
The function and evolution of C2H2 zinc finger proteins and transposons

by

Laura Francesca Campitelli

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of Molecular Genetics University of Toronto

© Copyright by Laura Francesca Campitelli 2020

The function and evolution of C2H2 zinc finger proteins and transposons

Laura Francesca Campitelli

Doctor of Philosophy

Department of Molecular Genetics University of Toronto

2020

Abstract

Transcription factors (TFs) confer specificity to transcriptional regulation by binding specific DNA sequences and ultimately affecting the ability of RNA polymerase to transcribe a gene.

The C2H2 zinc finger proteins (C2H2 ZFPs) are a TF class with the unique ability to diversify their DNA-binding specificities in a short evolutionary time. C2H2 ZFPs comprise the largest class of TFs in Mammalian genomes, including nearly half of all Human TFs (747/1,639).

Positive selection on the DNA-binding specificities of C2H2 ZFPs is explained by an evolutionary arms race with endogenous retroelements (EREs; copy-and-paste transposable elements), where the C2H2 ZFPs containing a KRAB repressor domain (KZFPs; 344/747 Human C2H2 ZFPs) are thought to diversify to bind new EREs and repress deleterious transposition events. However, evidence of the gain and loss of KZFP binding sites on the ERE sequence is sparse due to poor resolution of ERE sequence evolution, despite the recent publication of binding preferences for 242/344 Human KZFPs. The goal of my doctoral work has been to characterize the Human C2H2 ZFPs, with specific interest in their evolutionary history, functional diversity, and coevolution with LINE EREs. I contributed to the expert curation of the full complement of 1,639 Human TFs and used the results to quantitatively compare the evolutionary history and tissue specificities of C2H2 ZFPs to those of all other Human TFs. I analyzed protein-protein interaction (PPI) data for 118 DNA-binding C2H2 ZFPs and found extremely diverse interactions with nuclear factors despite paradoxically few dedicated PPI domains, revealing a new and unexplained dimension of functional diversity in addition to DNA-binding specificity diversity. Finally, I pioneered a computational technique to reconstruct extinct LINE L1 sequences and showed that they can be used to anchor the integration of KZFP genomic binding data and binding specificities for a complete picture of dynamic KZFP-ERE sequence specificity relationships. Together, my results paint a detailed picture of the diverse functionality and rapid evolution of Human C2H2 ZFPs, contribute to ongoing theorization on KZFP-ERE coevolution, and provide parallel datasets to power future investigations.


Acknowledgements

I am thankful to have had the guidance and mentorship of my excellent supervisor, Dr. Tim Hughes. He has taught me to follow the data, work systematically, explain clearly, and question everything. The entirety of this work results from opportunities enabled and supported by Tim. I hope to carry throughout my life the example he has set as a leader and scientific thinker.

I’d like to thank my thesis committee members, Drs. Jack Greenblatt and Michael Wilson. Their informed perspectives have challenged and shaped this work and motivated my growth as a scientist. Several additional professors have contributed directly or indirectly to my doctoral studies, including Drs. Mathieu Blanchette, Anne-Claude Gingras, Mikko Taipale, and Quaid Morris.

I am grateful to the C2H2 ZFP experts who gave me the special opportunity to learn through collaboration with knowledgeable and kind mentors early in my graduate studies: Drs. Frank Schmitges, Ernest Radovani, Hamed Najafabadi, and Marjan Barazandeh.

There are many senior researchers whose patience, kindness, and exceeding scientific talent played critical roles, including Drs. Edyta Marcon, Mandy Lam, and Ally Yang. Jeff Liu and Dr. Mihai Albu deserve special thanks for joining me on winding journeys through poorly documented bioinformatics software. I’m especially appreciative of Drs. Rozita Razavi and Debashish Ray for many years of teaching and encouragement.

I’m grateful for all the students I’ve had the opportunity to work with over the years. Dr. Samuel Lambert, my only senior lab mate and first example of a great grad student. Kaitlin Laverty has taught me a lot about data science, is always willing to thought partner on scientific ideas, and has made an immeasurable impact through her friendship. My experience in graduate school was made complete by personal and professional connections with all of the students in the Hughes lab, as well as classmates including Samantha Ing Esteves, Lauren Tracey, and Nader Alerasool.

I am extremely thankful to the family and friends who have supported me unconditionally throughout my education and expanded my horizons in a stage when the world can sometimes feel very small – my parents Rita and Peter Campitelli, my grandmothers Franca Campitelli and Maria Celenza, my friend Marilena Danelon, and especially my fiancé Nicholas Fleming.

Finally, I’d like to dedicate this thesis to my friend Dr. Benjamin Grys, who has been there from beginning to end – from my first committee meeting practice talk, to beachside cocktails the world over, to my mid-pandemic online defence seminar. Ben has been my most patient mentor and my closest friend in this experience, and I couldn’t have done it without him.


Table of Contents

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

CHAPTER 1

INTRODUCTION

1.1 CHAPTER OUTLINE
1.2 TRANSCRIPTION FACTORS
1.2.1 DEFINING A TRANSCRIPTION FACTOR
1.2.2 HUMAN DNA-BINDING DOMAINS (DBDS)
1.2.3 IDENTIFYING TFS
1.2.4 TRANSCRIPTIONAL REGULATION BY TFS
1.2.5 SUMMARY
1.3 C2H2 ZINC FINGER PROTEINS
1.3.1 C2H2 ZFP DNA-BINDING
1.3.2 C2H2 ZFP PPIS
1.3.3 EVOLUTION OF C2H2 ZFPS
1.4 TES AND TRANSCRIPTIONAL REGULATION
1.4.1 ANNOTATION AND CLASSIFICATION OF ERES
1.4.2 RECONSTRUCTING ANCESTRAL TES
1.4.3 SUMMARY
1.5 CHAPTER SUMMARY AND THESIS RATIONALE

CHAPTER 2

THE FUNCTION AND EVOLUTION OF HUMAN TRANSCRIPTION FACTORS

2.1 INTRODUCTION
2.2 METHODS
2.2.1 HUMAN TF LIST CURATION
2.2.2 HUMAN TF PARALOG EVOLUTION
2.2.3 TF EXPRESSION PROFILES IN HUMAN TISSUES
2.2.4 PBM CONSTRUCT DESIGN FOR MONODACTYL ZFPS
2.2.5 PBM EXPERIMENTS
2.3 RESULTS AND DISCUSSION
2.3.1 THE HUMAN TFS
2.3.2 PARALOG EVOLUTION OF HUMAN TFS
2.3.3 EXPRESSION PROFILES OF THE HUMAN TFS
2.3.4 SEQUENCE-SPECIFIC DNA BINDING BY MONODACTYL ZFPS
2.4 SUMMARY

CHAPTER 3

PROTEIN-PROTEIN INTERACTIONS OF HUMAN C2H2 ZINC FINGER PROTEINS

3.1 INTRODUCTION
3.2 METHODS
3.2.1 MOLECULAR COMPARISON BETWEEN INVESTIGATED C2H2 ZFPS AND ALL HUMAN C2H2 ZFPS
3.2.2 AP-MS EXPERIMENTAL METHODS
3.2.3 STATISTICAL ANALYSIS OF AP-MS DATA
3.2.4 FUNCTIONAL ANNOTATION OF AP-MS PREYS
3.3 RESULTS
3.3.1 MOLECULAR COMPARISON BETWEEN INVESTIGATED C2H2 ZFPS AND ALL HUMAN C2H2 ZFPS
3.3.2 C2H2 ZFPS HAVE UNIQUE PPI PROFILES
3.3.3 EFFECTOR DOMAIN SUBCLASSES RECRUIT EXPECTED INTERACTION PARTNERS, AND ADDITIONAL AND ALTERNATIVE PPIS ARE PERVASIVE WITHIN EACH GROUP
3.3.4 C2H2 ZFPS INTERACT WITH TRANSCRIPTION-RELATED NUCLEAR FACTORS
3.4 DISCUSSION
3.5 SUMMARY

CHAPTER 4

RECONSTRUCTING THE EVOLUTIONARY HISTORY OF LINE L1S TO INTERPRET KZFP-ERE COEVOLUTION

4.1 INTRODUCTION
4.2 METHODS
4.2.1 ANCESTRAL RECONSTRUCTED GENOMES
4.2.2 TE ANNOTATION
4.2.3 FULL-LENGTH PROGENITOR SEQUENCE RECONSTRUCTION
4.2.4 ORF REFINEMENT
4.2.5 COMPOSITE PROGENITOR SEQUENCE RECONSTRUCTION
4.2.6 COMPOSITE RECONSTRUCTED PROGENITOR SEQUENCE VALIDATION
4.3 RESULTS AND DISCUSSION
4.3.1 REPEATMASKER ANNOTATIONS IN ANCESTRAL GENOMES CORRESPOND TO L1 SUBFAMILY RELATIVE AGES AND SPECIES DISTRIBUTIONS
4.3.2 REPEATMASKER HITS IN ANCESTRAL GENOMES ARE SIMILAR IN LENGTH TO THOSE IN HG38, AND LESS DIVERGENT FROM CONSENSUS MODELS
4.3.3 FULL-LENGTH ANCESTRAL SEQUENCE RECONSTRUCTION
4.3.4 TARGETED ORF RECONSTRUCTION
4.3.5 COMPOSITE RECONSTRUCTED PROGENITOR SEQUENCES CAPTURE EXPECTED PHYLOGENETIC RELATIONSHIPS AND SEQUENCE COMPONENTS
4.3.6 COMPOSITE RECONSTRUCTED PROGENITOR SEQUENCES ANCHOR INTEGRATION OF IN VIVO KZFP BINDING EVIDENCE AND IN SILICO-PREDICTED KZFP BINDING PREFERENCES
4.4 SUMMARY

CHAPTER 5

DISCUSSION

5.1 CHAPTER OUTLINE
5.2 PERSPECTIVES AND FUTURE DIRECTIONS
5.2.1 THE DNA-BINDING PREFERENCES OF EVERY HUMAN TF
5.2.2 FUNCTIONAL DIVERSIFICATION OF C2H2 ZFPS: DNA-BINDING, PPIS, AND EXPRESSION PATTERNS
5.2.3 IDENTIFYING FAST-EVOLVING PPI ELEMENTS OF THE C2H2 ZFPS
5.2.4 THE EVOLUTIONARY ARC OF THE KZFPS
5.2.5 ROLES OF C2H2 ZFPS OUTSIDE THE NUCLEUS
5.2.6 FUNCTIONAL AND EVOLUTIONARY ASSESSMENT OF RECONSTRUCTED L1 PROGENITOR SEQUENCES
5.2.7 INTERPRETING PREDICTED KZFP BINDING SITES ON RECONSTRUCTED L1 PROGENITOR SEQUENCES
5.2.8 THE FUTURE OF TE RECONSTRUCTION
5.3 CLOSING REMARKS

REFERENCES

COPYRIGHT ACKNOWLEDGEMENTS

List of Tables

Table 3.1 Molecular features of the 118 C2H2 ZFPs investigated compared to all Human C2H2 ZFPs

List of Figures

Figure 1.1 Schematic of a prototypical TF

Figure 1.2 Examples of DBD structures and counts of Human TFs by their DBD type

Figure 1.3 C2H2 ZFP binding mode and protein-protein interactions

Figure 1.4 The arms race and domestication models of KZFP-ERE coevolution

Figure 1.5 EREs in the

Figure 2.1 Overview of the strategy for identifying the Human TF repertoire

Figure 2.2 Basic residues of Drosophila GAGA and their neutral replacement residues

Figure 2.3 Number of TFs and motif status for each DBD family

Figure 2.4 Paralog divergence times of the Human TFs

Figure 2.5 Tissue expression profiles of the Human TFs

Figure 2.6 PBM results for single-ZFPs with and without ZF-adjacent basic residues perturbed

Figure 3.1 Significant differences in molecular features of the 118 C2H2 ZFPs investigated compared to all Human C2H2 ZFPs (related to Table 3.1)

Figure 3.2 Nuclear protein interactions with C2H2 ZFPs

Figure 3.3 Reproducibility between AP-MS replicates

Figure 3.4 Shared interactions between C2H2 ZFPs

Figure 3.5 Overview of expected interactions for C2H2 ZFPs with KRAB, SCAN and BTB domains

Figure 3.6 Functional overview of the C2H2 ZFP PPI partners

Figure 3.7 Functional overview of the C2H2 ZFP PPI partners for C2H2 ZFPs with no effector domain, or KRAB, SCAN or BTB domains (related to Figure 3.6)

Figure 3.8 Correlation between TRIM28 association and H3K9me3 signals

Figure 4.1 Overview of phylogenetic relationships between Human LINE L1 subfamilies and available consensus models and progenitor sequence reconstructions

Figure 4.2 Overview of reconstructed ancestral Human genomes

Figure 4.3 Overview of phylogenetic relationships between Human LINE L1 subfamilies and available consensus models and progenitor sequence reconstructions

Figure 4.4 Lengths and sequence identities to gold standards for RepeatMasker hits and full-length reconstructed sequences

Figure 4.5 Best full-length reconstructed sequences

Figure 4.6 ORF0 detected in full-length reconstructed progenitor sequences

Figure 4.7 Quality comparison for various reconstruction methods

Figure 4.8 Codon-aware alignment improves ORF reconstruction

Figure 4.9 Composite reconstructed progenitor sequences

Figure 4.10 Lengths of composite reconstruction components

Figure 4.11 Reconstructed progenitor sequence lengths for 67 LINE L1 subfamilies

Figure 4.12 Integration of reconstructed L1MC1 progenitor sequence with KZFP DNA-binding data from ChIP-seq/exo experiments

List of Abbreviations

AP-MS           Affinity purification coupled to mass spectrometry
AGR             Ancestral genome reconstruction
ASR             Ancestral sequence reconstruction
B1H             Bacterial one-hybrid
B1H-RC          Bacterial one-hybrid-based recognition code (predicts C2H2 ZFP DNA-binding preferences from sequence)
C2H2 ZFP        Cys2His2 zinc finger protein
ChIP            Chromatin immunoprecipitation
ChIP-exo        ChIP with exonuclease digestion and sequencing
ChIP-seq        ChIP with sequencing
ERE             Endogenous retroelement
ERV             Endogenous retrovirus
KRAB            Krüppel-associated box (protein-protein interaction domain associated with transcriptional repression)
KZFP            Cys2His2 zinc finger protein with KRAB domain
LINE            Long interspersed nuclear element
Monodactyl ZFP  C2H2 ZFP with a single ZF domain
ORF             Open reading frame
PBM             Protein-binding microarray
Pol II          RNA polymerase II
SINE            Short interspersed nuclear element
TE              Transposable element
TF              Transcription factor
ZF              Cys2His2 zinc finger domain

Chapter 1 Introduction

Part of the work described in this chapter is published in:

Lambert, S. A.*, Jolma, A.*, Campitelli, L. F.*, Das, P. K., Yin, Y., Albu, M., Chen, X., Taipale, J., Hughes, T. R., and Weirauch, M. T. (2018). The Human Transcription Factors. Cell. 172, 650-665. doi:10.1016/j.cell.2018.01.029 which is published under a Creative Commons Attribution License (CC BY 4.0).

*denotes equal contributions


Introduction

1.1 Chapter outline

Transcription factors (TFs) confer specificity to the process of transcriptional regulation. In the Human genome, half of all TFs are C2H2 zinc finger proteins (C2H2 ZFPs). These TFs have been underrepresented in functional investigations, and recent evidence suggests that they have played a critical role in the evolution of regulatory modules emerging from coevolution with transposable elements (TEs).

In Chapter 1.2 I define a TF and give an overview of the major classes of Human TFs. I further review experimental methods to determine the DNA-binding preferences and protein-protein interactions of TFs, and the mechanisms by which TFs use these functions to affect transcription.

In Chapter 1.3 I review the largest class of Human TFs, the C2H2 ZFPs. I specifically focus on their modular DNA-binding mechanism including the ability to rapidly evolve new DNA-binding preferences, their protein-protein interaction domains, and evidence for their coevolution with endogenous retroelements (EREs; copy-and-paste transposable elements).

In Chapter 1.4 I provide an overview of the Human transposable elements (TEs), emphasizing the annotation and classification of EREs and current hurdles in the interpretation of their evolutionary histories. I further describe how modern bioinformatic techniques can be exploited to overcome these hurdles for the benefit of interpreting the coevolution of C2H2 ZFPs and EREs.

The goal of my doctoral work has been to characterize the Human C2H2 ZFPs, with specific interest in their functional diversity and coevolution with EREs. I have approached this problem through the analysis of experimental datasets capturing the tissue expression profiles, protein-protein interactions, and DNA-binding evidence for the Human C2H2 ZFPs. Furthermore, I have reconstructed the sequence evolutionary history of Human LINE L1 ERE subfamilies and demonstrated the utility of the output dataset for the integration of in vivo and in silico DNA-binding preferences of C2H2 ZFPs to better understand their evolutionary interaction.


1.2 Transcription factors

1.2.1 Defining a transcription factor

Transcription factors (TFs) confer specificity to genome regulation by preferentially binding specific DNA sequences, and ultimately affecting the propensity for transcription by RNA Polymerase. The ability to bind to specific sequences is sufficient to indicate the ability to regulate transcription, because TFs can act simply by occluding the DNA-binding sites of other proteins. We may therefore define a TF simply as a protein that is able to bind a specific DNA sequence. The majority of identified TFs mediate this specific binding with conserved DNA-binding domains (DBDs). Many TFs affect transcription via effector domains that mediate protein-protein interactions (PPIs), perform enzymatic activities, and/or regulate TF activity via ligand binding sites (Figure 1.1). Interpretation of the function and evolution of TFs relies on comprehension of both the DNA sequences they bind and their effector functions.

1.2.2 Human DNA-binding domains (DBDs)

TFs are distinguished from other transcriptional regulatory proteins by their ability to bind specific DNA sequences. The ability to bind a specific DNA sequence is typically conferred to TFs by at least one conserved DNA-binding domain (DBD). There are ~100 known eukaryotic DBD types, which are cataloged in Pfam (Finn et al. 2016), SMART (Letunic et al. 2015) or Interpro (Finn et al. 2017). DBDs represent diverse solutions to the problem of DNA-contacting, typically involving association with the major groove via an α-helix, although minor groove and phosphate-sugar backbone interactions also occur. DBD structures in complex with DNA are currently available in the Protein Data Bank (PDB) (Berman et al., 2002) for most families of Human TFs.

DBDs are assumed to be derived from a small set of common ancestors representing the major DBD folds, giving rise to families of TFs with the same DBD by duplication. TFs are thus classified by the DBDs they contain. DBDs are not uniformly represented among the Human TFs; rather, about 80% of Human TFs contain one of only a few DBDs: Cys-2-His-2 zinc finger (C2H2 ZF), Homeodomain, basic Helix-Loop-Helix (bHLH), basic leucine zipper (bZIP), HMG/Sox, Forkhead, or Nuclear receptor (Figure 1.2 A). DBDs have historically been grouped into superfamilies, which are based on shared structural characteristics rather than evolutionary relationships: basic, helix-turn-helix, β-scaffold, and zinc-coordinating (Luscombe et al., 2000; Weirauch and Hughes, 2011).

[Figure 1.1 image. Labels: Effector domain(s): protein-protein interactions, ligand binding, enzymatic activities. DNA-binding domain(s): recognize specific DNA sequences and sites (TFBS).]

Figure 1.1. Schematic of a prototypical TF

TFs’ DNA-binding activities and effector functions (e.g. ligand binding, protein-protein interactions, and enzymatic activities) are often carried out by dedicated domains, resulting in functional modularity.

Adapted from Lambert, Jolma, Campitelli et al. (2018) © 2018 by Elsevier Inc.

[Figure 1.2 image. Panel A: DBD structures. Panel B: bar chart of TF count in the Human genome (y-axis) by DNA-binding domain (x-axis): C2H2 ZF, Homeodomain, bHLH, bZIP, Forkhead, HMG/Sox, Ets, T-box, AT hook, Myb/SANT, CCCH ZF, Unknown, All Others.]

Figure 1.2. Examples of DBD structures and counts of Human TFs by their DBD type

A. Examples of DBDs from each of the four major structural families: Basic, black; Helix-Turn-Helix, green; Zinc-Coordinating, orange; β-Scaffold, purple. DBD structures were obtained from Protein Data Bank (Berman et al. 2000) by Weirauch and Hughes, and labeled with the DBD name and PDB accession number. Structures are coloured by macromolecule and secondary structure: DNA, blue; α-helices, red; β-sheets, yellow; and loop regions, green. Adapted from Weirauch and Hughes (2011) © 2011 by Springer Nature.

B. Counts of the number of Human TFs containing each DBD. DBD annotations are referenced from Pfam (Finn et al. 2016). Adapted from Lambert, Jolma, Campitelli et al. (2018) © 2018 by Elsevier Inc.

1.2.2.1 Basic

The basic superfamily is largely comprised of DBDs that dimerize to form a scissor-like grip on the DNA, mainly represented by bHLH and bZIP domains (Vinson et al. 1989). These domains are primarily comprised of an α-helix that contacts the major groove of DNA, and a dimerization interface. bHLH proteins have been demonstrated to leverage diversification of both dimerization partners and DNA-binding preferences (as well as spatiotemporal expression) to achieve functional diversity (Grove et al., 2009). As a result of dimeric DNA binding, basic DBDs commonly recognize palindromic DNA sequences (Weirauch and Hughes, 2011).

1.2.2.2 Helix-turn-helix

Helix-turn-helix DBDs contact DNA via an open tri-helical bundle. The principal DNA contacts are typically attributed to the third (C-terminal) α-helix, thus called the “recognition helix.” Helix-turn-helix DBDs are prevalent throughout eukaryotes as well as archaea and bacteria, represented by numerous DBDs comprised of the core helix-turn-helix structure often adorned with secondary structural features that appear to promote a more closed configuration (Aravind et al., 2005; Weirauch and Hughes, 2011). Forkhead and homeodomain proteins are examples of helix-turn-helix DBDs. Homeodomain proteins are the largest class of eukaryotic helix-turn-helix TFs, including notable members such as the HOX body plan-determining gene cluster in Bilaterian animals, and POU proteins like the Yamanaka factor OCT4 (Phillips and Luisi, 2000).

1.2.2.3 β-scaffold

The core structure of β-scaffold DBDs is comprised of two anti-parallel β-sheets that are packed together to yield a “β-sandwich” arrangement, with DNA-contacting residues typically contained in additional variable structures (Berardi et al., 1999). The famous tumor suppressor p53 leverages β-scaffolds to contact DNA, as do other TFs with DBDs classed as “p53-like,” including T-box TFs, STAT, and the Rel Homology Region (RHR) family, including the immune-responsive NF-κB complex (Gilmore, 2006; Weirauch and Hughes, 2011).


1.2.2.4 Zinc-coordinating

Zinc-coordinating DBDs are characterized by the presence of a zinc atom (Zn2+) that stabilizes the DBD by forming noncovalent associations with cysteine and histidine residues. Coordination by the zinc atom tends to stabilize an α-helix such that it can make base-specific contacts with the major groove of DNA (Luisi, 1992).

The most notable zinc-coordinating DBD classes are the nuclear receptors and Cys-2-His-2 zinc finger proteins (C2H2 ZFPs). Nuclear receptors are a Metazoan-specific TF family that function as transcriptional switches by recognizing small molecules like fatty acids, vitamins and steroids via ligand-binding domains. The nuclear receptor DBD contains two C4 zinc finger motifs, where four cysteine residues in each are coordinated by the tetrahedral geometry of a single zinc atom (Pardee et al. 2011). The N-terminal zinc finger’s α-helix contacts the major groove of DNA, while the C-terminal zinc finger is typically involved in dimerization (Zilliacus et al., 1995). C2H2 ZFPs represent by far the largest class of TFs in eukaryotes and are uniquely able to rapidly diversify their DNA-binding preferences, owing to the structure and arrangement of their DBDs. The structure, function and evolution of the C2H2 ZF domain and the C2H2 ZFPs are described in Chapter 1.3.

1.2.3 Identifying TFs

If we define a TF as a protein that binds a specific DNA sequence, then we may identify putative TFs based on homology to known DBDs, but we can only confidently label a protein as a TF once it is experimentally determined to bind DNA in a sequence-specific manner.

1.2.3.1 Predicting TFs

Currently, most known and putative TFs have been identified by homology to a previously characterized DBD. The databases Pfam (Finn et al. 2016), SMART (Letunic et al. 2015), and Interpro (Finn et al. 2017) catalogue DBDs as profile hidden Markov models (HMMs). HMMs are simple and powerful statistical tools, and profile HMMs are frequently used in bioinformatics to summarize sequence variation in an alignment of ‘true positive’ sequences (e.g. bona fide members of a DBD family) (Eddy, 1998). A profile HMM represents the probability that each sequence position is occupied by a given residue, and the probabilities that that position will be followed by an insertion, deletion, or another residue. Profile HMMs can therefore be used to score query sequences for the likelihood of a homology match to the sequences in the HMM based on the cumulative score of all positions in the sequence aligned to the HMM. HMMs from the databases listed above representing each DBD family can thus be used to identify putative TFs based on homology to known DBDs. To date, all but a handful of well-characterized Mammalian TFs contain a known DBD (Fulton et al., 2009).

Identifying putative TFs by homology inevitably overlooks an unknown proportion of TFs. It is likely that additional DBDs remain to be discovered (especially in poorly-characterized clades), and some bona fide TFs do not contain any known DBDs. Experimental methods to broadly test proteins for DNA-binding activity therefore continue to be relevant for the discovery of putative TFs. Protein microarrays are one such method, where an array of query proteins are immobilized and tested for specific binding to selected DNA sequences in vitro (Hu et al., 2009). One-hybrid assays are based on the expression of a query TF modified to bind a known transcriptional in cells where the DNA sequence of interest is cloned upstream of a reporter (Reece-Hoyes and Walhout 2012). DNA purification can also be coupled to mass spectrometry to identify proteins bound to DNA in a given cell line (Tacheny et al., 2013). Together, homology-based and experimental predictive methods can produce long lists of putative TFs, which must then be accepted or rejected as TFs based on experimental evidence of sequence-specific binding.

1.2.3.2 Representing DNA-binding specificities with motifs

TF DNA-binding specificities are frequently summarized as motifs — models representing the set of related, short sequences preferred by a given TF — which can be used to scan longer sequences (e.g., promoters) to identify potential binding sites. Determining a DNA-binding motif is often the first step toward detailed examination of the function of a TF, because identification of potential binding sites provides a gateway to further analyses.

Motifs are typically displayed as a sequence logo (Schneider and Stephens, 1990), which in turn represents an underlying table or position weight matrix (PWM) of relative preference of the TF for each base in the binding site (Stormo and Zhao, 2010). At each base position, each of the four bases has a score, and multiplying these scores for each base of a sequence yields a predicted relative affinity of the TF to that sequence. Key biological distinctions between PWMs and profile HMMs are that (1) profile HMMs represent dependencies between the states of adjacent positions and (2) profile HMMs permit and account for indels. Beyond PWMs, complicated models have been developed to account for additional complexities, including cooperative interactions between TFs (Jolma et al., 2015) or DNA methylation (Yin et al., 2017), with improvement in accuracy depending on the TF and its family. In many cases, however, the improvement is minor or even undetectable, especially when comparing across different datasets (Weirauch et al., 2013), and the PWM remains the most commonly used model for analysis of TF binding. Hereafter, the term motif signifies PWM.
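As a minimal sketch of the multiplicative PWM scoring described above, the following Python example scores every window of a longer sequence against a small PWM. The matrix values, the 4-bp motif width, and the function names (`pwm_score`, `scan`) are invented for illustration and do not describe any real TF; only the forward strand is scanned.

```python
# Hypothetical 4-position PWM: one dict of relative base preferences per position.
# The numbers are illustrative only and do not describe a real TF.
PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative position
]


def pwm_score(site, pwm=PWM):
    """Multiply the per-position scores of each base in `site`."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal PWM length")
    score = 1.0
    for column, base in zip(pwm, site):
        score *= column[base]
    return score


def scan(sequence, pwm=PWM):
    """Score every window of `sequence` (forward strand only)."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        hits.append((start, site, pwm_score(site, pwm)))
    return hits


# Report the best-scoring window of a short toy sequence.
best_start, best_site, best_score = max(scan("CCAGTACC"), key=lambda hit: hit[2])
print(best_start, best_site, best_score)
```

In practice the scores are often log-transformed so that they can be summed rather than multiplied; the product form is used here only because it mirrors the description in the text.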

There is typically only a partial overlap between experimentally determined binding sites in the genome and sequences matching the motif; moreover, even experimentally determined binding sites are relatively poor predictors of the genes that the TFs actually regulate (Cusanovich et al., 2014). At the same time, motif matches are often among the most enriched sequences in a ChIP-seq (chromatin immunoprecipitation and sequencing; described below) dataset, indicating that intrinsic DNA-binding specificity is important for TF binding in vivo. In retrospect, this outcome should have been expected: most TF-binding sites are small (usually 6–12 bases) and flexible, so a typical Human gene (>20 kb) will contain multiple potential binding sites for most TFs (Wunderlich and Mirny, 2009).
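A back-of-the-envelope calculation illustrates why short sites match so often by chance. The sketch below assumes a uniformly random sequence (each base with probability 0.25), exact matches to a single k-mer, and scanning of both strands; these are simplifying assumptions, and the function name `expected_matches` is hypothetical.

```python
def expected_matches(site_length, region_length=20_000, strands=2):
    """Expected number of exact matches to one k-mer in a random region.

    Assumes independent, equiprobable bases, which is an idealization.
    """
    windows = region_length - site_length + 1
    return strands * windows / 4 ** site_length


# Compare short and long sites across a 20 kb region.
for k in (6, 8, 10, 12):
    print(k, round(expected_matches(k), 4))
```

Under these assumptions an exact 6-bp site is expected roughly ten times in a 20 kb region, whereas an exact 12-bp site is expected far less than once. Because real motifs tolerate mismatches at several positions, the number of potential sites for an actual TF is higher than the exact-match estimate.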

1.2.3.3 Determining TF motifs experimentally

Methods to discover DNA-binding preferences typically involve exposing putative TFs to a complex library of artificial sequences or genomic DNA. The terms in vitro and in vivo are somewhat subjective; in the field of motif discovery they generally distinguish between methods where a protein is purified and exposed to purified DNA (in vitro), or exposed to chromatin in the cell context (in vivo).

Strengths of in vitro methods include higher throughput and ability to determine TF sequence specificity independent from the influence of other proteins. Prevalent in vitro methods today include protein-binding microarray (PBM) (Bulyk et al. 1999; 2001), bacterial one-hybrid (B1H), systematic evolution of ligands by exponential enrichment (SELEX)-based methods, and mechanically induced trapping of molecular interactions (MITOMI).

Conversely, in vivo methods may yield motifs that represent cooperative or indirect DNA-contacting, but also return actual genomic localization of the putative TF. Genomic binding sites can be compared with other datasets (e.g. generated in parallel or those available from ENCODE (The ENCODE Project Consortium, 2012)) to substantiate hypotheses about the protein’s function, for example by the identification of promoters that may be regulated by the protein, binding loci correlated with the loci bound by other TFs, changes to bound loci in different environmental conditions or developmental stages, or chromatin marks that may be associated with the protein. Prevalent in vivo methods today include chromatin immunoprecipitation with sequencing (ChIP-seq), and DNA adenine methyltransferase identification with sequencing (DamID-seq). While TF binding loci in vivo are influenced by the expression of cofactors and accessibility of chromatin, motifs obtained from in vivo data are often consistent with those derived from in vitro approaches, indicating that a TF’s intrinsic sequence specificity is a major factor controlling in vivo binding (Weirauch et al., 2013). Below I describe the two methods most pertinent to interpreting the contents of this thesis.

1.2.3.3.1 Protein-binding microarray (PBM)

PBMs are microarrays of spots containing double-stranded DNA probes. The basic concept underlying a PBM experiment is to incubate a purified protein on the DNA probe array to test whether the protein preferentially binds probes with similar sequences (Bulyk et al. 1999; 2001). The array contains ~41,000 defined 35-mers. The probes are designed such that all possible unique 10-mer sequences occur once, and all non-palindromic 8-mers are present 32 times in different sequence contexts (palindromic 8-mers occur 16 times). The query protein can be either a full-length protein sequence, or simply the DBD(s) plus the 50 amino acids flanking the domain. The query construct also contains an epitope tag (e.g. a GST tag), such that a fluorescently labelled antibody can illuminate the DNA probe spots to which the protein has bound, and fluorescence intensity scales with the degree of binding preference. The protein’s DNA-binding specificity can then be inferred (Weirauch et al., 2013).
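The 32-versus-16 distinction is consistent with simple strand-symmetry counting, which the following Python sketch works through. It is an illustration rather than the array design procedure, and it rests on two assumptions not stated in the text: that "palindromic" means an 8-mer equal to its own reverse complement, and that covering every 10-mer once places each 8-mer on a given strand 16 times (the number of ways to extend an 8-mer by two bases). The helper names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_palindromic(k):
    """Count k-mers that equal their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


total_8mers = 4 ** 8                      # all possible 8-mers
palindromic_8mers = count_palindromic(8)  # self-reverse-complementary 8-mers
per_strand = 4 ** (10 - 8)                # 10-mers sharing a given 8-mer prefix

# Assumption: on double-stranded probes, an 8-mer and its reverse complement
# mark the same site, so a non-palindromic 8-mer is seen on both strands,
# while a palindromic 8-mer coincides with itself.
occurrences_nonpalindromic = 2 * per_strand
occurrences_palindromic = per_strand

print(total_8mers, palindromic_8mers)
print(occurrences_nonpalindromic, occurrences_palindromic)
```

Under these assumptions the counts come out to 32 and 16, matching the figures quoted above.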

The Hughes research group has tested nearly 7000 constructs in PBM assays since 2007, representing 121 DBD classes and 399 species. In these assays, each putative TF’s DBD construct is assayed on two microarrays with distinct solutions to the problem of achieving the sequence complexity described above (i.e. different probes in different spots), in order to minimize array-specific biases. These two arrays, “ME” and “HK”, are named for their designers, Michael Eisen and Hilal Kazan (Mintseris and Eisen, 2006; Badis et al., 2009). The sequence specificity of a putative TF for a given 8-mer can then be represented as an E-score and a Z-score. The E-score represents the relative rank of the microarray spot intensity, and can range from -0.5 to +0.5 (Berger et al., 2006). The Z-score represents a given 8-mer’s deviation from the mean intensity across all 8-mers. Z-scores approximately scale with binding affinity (Badis et al., 2009). In the Hughes research group, a putative TF is said to exhibit highly significant sequence-specific binding to an 8-mer if it achieves an E-score > 0.45 and Z-score > 6 in both arrays (Berger et al., 2008; Weirauch et al., 2014). 8-mer E- and Z-scores are both reproducible between arrays (only 5% of constructs tested in the Hughes research group since 2007 exhibit significant binding preference in only one of the two arrays), and comparable between studies (Weirauch et al., 2014).
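The two-array significance rule described above can be sketched as follows. This is a minimal illustration, not the group's pipeline: the Z-score is assumed here to be the deviation from the mean divided by the standard deviation across all 8-mers, E-scores are taken as given rather than computed, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(intensity_by_kmer: dict) -> dict:
    """Standardize 8-mer intensities (assumed definition of the Z-score)."""
    values = list(intensity_by_kmer.values())
    mu, sigma = mean(values), pstdev(values)  # sigma must be non-zero
    return {kmer: (v - mu) / sigma for kmer, v in intensity_by_kmer.items()}


def passes_both_arrays(scores_me, scores_hk, e_cut=0.45, z_cut=6.0) -> bool:
    """Each argument is an (E-score, Z-score) pair for one 8-mer on one array."""
    return all(e > e_cut and z > z_cut for e, z in (scores_me, scores_hk))


# An 8-mer that clears both thresholds on the ME and HK arrays.
print(passes_both_arrays((0.48, 7.2), (0.46, 6.5)))  # True
# The same 8-mer failing the E-score threshold on one array.
print(passes_both_arrays((0.48, 7.2), (0.40, 9.0)))  # False
```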

PBMs are relatively efficient and economical, and offer the advantages of in vitro methods described above. A limitation of PBMs is that the number of possible DNA sequences to include in the array increases exponentially with the length of the k-mer one wishes to query, and therefore PBMs are generally limited to the analysis of 14bp binding preferences (Berger et al., 2006; Mintseris and Eisen, 2006; Badis et al., 2009; Weirauch et al., 2013). This length is sufficient to query TFs from most DBD classes, with the notable exception of C2H2 ZFPs.

1.2.3.3.2 Chromatin immunoprecipitation (ChIP)-based methods

The basis of a ChIP experiment is the immunoprecipitation of a protein bound to chromatin, followed by the identification of the DNA sequences bound. In an early manifestation of this approach, ChIP-chip, bound sequences were identified by microarray hybridization; today they are typically identified by next generation sequencing (ChIP-seq and ChIP-exo) (Barski et al., 2007; Johnson et al., 2007; Park, 2009).

In a ChIP-seq experiment to identify the binding preferences of a putative TF, a query cell line must be selected to express the protein of interest. Typically, a construct is cloned to express the protein at a higher level, possibly also with a linked epitope tag. Cells are then treated with formaldehyde to cross-link proteins to their bound DNA sequence. This step converts hydrogen bonds between proteins and DNA to covalent bonds, preserving their association (Hoffman et al., 2015). Chromatin is sheared into small fragments (200-600bp) by sonication, and an antibody is used to purify the protein of interest along with its bound fragments. Crosslinks are then reversed, and the DNA fragments that were bound by the query protein are identified by high-throughput sequencing (Park, 2009). These sequences can then be (1) used to derive a motif for the protein based on identifying similarities between bound sequences, and (2) mapped back to a reference genome to identify bound loci and compare those to the loci of other genomic features. Visualizations quantifying the number of reads mapped to each position in the genome result in ‘peaks’ corresponding to the query protein’s binding sites.
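The way mapped fragments pile up into ‘peaks’ can be illustrated with a toy example. The sketch below is an assumption-laden simplification: fragments are half-open intervals on a single chromosome, a peak is any contiguous run of positions whose depth reaches a fixed threshold, and no control sample or background model is used, unlike real peak callers. All names are hypothetical.

```python
from collections import Counter


def coverage(fragments):
    """Per-position read depth from (start, end) half-open intervals."""
    depth = Counter()
    for start, end in fragments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def call_peaks(depth, min_depth):
    """Merge adjacent positions with depth >= min_depth into (start, end) runs."""
    positions = sorted(p for p, d in depth.items() if d >= min_depth)
    peaks = []
    for p in positions:
        if peaks and p == peaks[-1][1]:
            peaks[-1][1] = p + 1       # extend the current run
        else:
            peaks.append([p, p + 1])   # start a new run
    return [tuple(peak) for peak in peaks]


fragments = [(0, 10), (5, 15), (8, 12), (100, 110)]
print(call_peaks(coverage(fragments), min_depth=2))  # [(5, 12)]
```

Raising the threshold narrows the called region toward the position where the most fragments overlap, which is the intuition behind reading binding sites from peak summits.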

ChIP-seq fragments are 1-2 orders of magnitude longer than TF binding sites. ChIP-exo achieves higher resolution by modifying the ChIP-seq protocol to include a lambda exonuclease digestion step after sonication and immunoprecipitation (Mahony and Pugh, 2015). This step digests DNA fragments that are not cross-linked to a protein, such that the fragments eluted later are smaller – typically within 5bp of a motif instance (Rhee and Pugh, 2011). Another recent innovation to elute smaller fragments corresponding to protein-bound loci in situ is cleavage under targets and release using nuclease (CUT&RUN), where antibodies are used to recruit Micrococcal Nuclease directly to the query protein (Skene and Henikoff, 2017).

ChIP-based methods are powerful tools for studying transcriptional regulation. In addition to yielding TF binding sites and loci in vivo, they can also be used to infer the localization of basal transcriptional machinery and of histones with specific modifications that can indicate the state of chromatin (discussed further in the next section). Integrating these datasets with annotations of gene and transposable element (TE) loci, and across diverse chromatin proteins, cell lines, and expression profiles can reveal a complex and dynamic snapshot of transcriptional regulation. The ENCODE project has released several experimental datasets to this end (The ENCODE Project Consortium, 2012), and the UCSC Genome Browser (Lee et al., 2020) is a valuable database of both experimental and in silico annotations for a large and growing number of genomes.

1.2.4 Transcriptional regulation by TFs

Transcription is the process by which DNA is copied into RNA by RNA polymerase. Transcription of eukaryotic protein-coding genes is carried out by RNA polymerase II (Pol II), which transcribes DNA into mRNA in phases: initiation, the placement of Pol II at the gene promoter; elongation, the synthesis of mRNA beginning at the transcription start site (TSS); and finally cleavage and polyadenylation followed by termination, releasing the mRNA transcript. The rate at which this process is repeated cyclically, together with the half-life of the resulting transcript, determines the mRNA expression level of the gene (Venkatesh and Workman, 2015).
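The dependence of expression level on synthesis rate and transcript half-life can be made concrete with a simple kinetic sketch. It assumes, purely for illustration, a constant synthesis rate and first-order decay, in which case the steady-state level is the synthesis rate divided by the decay constant ln(2)/half-life. The function name and units are hypothetical.

```python
import math


def steady_state_mrna(synthesis_rate: float, half_life: float) -> float:
    """Steady-state transcript count under constant synthesis and first-order decay.

    synthesis_rate: transcripts produced per unit time
    half_life: transcript half-life in the same time unit
    """
    decay_constant = math.log(2) / half_life
    return synthesis_rate / decay_constant


# Doubling the half-life doubles the steady-state level at a fixed synthesis rate.
low = steady_state_mrna(synthesis_rate=10.0, half_life=2.0)
high = steady_state_mrna(synthesis_rate=10.0, half_life=4.0)
print(round(high / low, 6))  # 2.0
```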

Transcription initiation requires the assembly of the general transcription factors (GTFs) TFIIB, TFIID, TFIIE, TFIIF and TFIIH to form the pre-initiation complex. This complex opens promoter DNA, initiates RNA synthesis, and enables the escape of Pol II from the promoter (Sainsbury et al. 2015).

Pol II’s access to promoters is affected by the packaging of DNA in chromatin by histone octamers. Canonical histone octamers consist of two of each of the canonical histones: H3, H4, H2A and H2B. An additional H1 linker histone locks DNA around the histone core (Venkatesh and Workman, 2015). Each octamer is wrapped 1.65 times by DNA, and the DNA-histone complex comprises a nucleosome (Luger, 1997). Regions of open chromatin, corresponding to nucleosome depletion, can be detected using FAIRE-seq and DNase hypersensitivity assays and typically indicate accessible cis-regulatory elements like promoters and enhancers (Song et al., 2011). Open chromatin is accessible to the transcriptional machinery, and therefore the displacement of nucleosomes is associated with transcriptional activation (Henikoff, 2008; Boyle et al., 2011). Chromatin modifications, histone variants, and chromatin remodelers regulate nucleosomal dynamics to affect the propensity for transcription.

Chromatin modifications are post-translational modifications to histones maintained by proteins called writers, erasers, and readers, which add, remove, and interpret chromatin modifications, respectively. These modifications can be combined and culminate in a histone code based on the combination of histone subunit, residue, and modification types (Jenuwein and Allis, 2001). The terms heterochromatin and euchromatin describe transcriptionally repressed and active regions, respectively. These terms were first used by Emil Heitz in 1928, who distinguished the two based on staining experiments showing that heterochromatin remained stained throughout the cell cycle, while euchromatin staining was diminished after telophase (Allshire and Madhani, 2018). Heitz also recognized that some chromosomal regions appeared to be associated with heterochromatin in all cells, termed constitutive heterochromatin, while other heterochromatin regions varied between cell types, called facultative heterochromatin (Allshire and Madhani, 2018). Today, heterochromatin and euchromatin are typically defined based on histone modifications. The best-studied types of heterochromatin are distinguished by histone hypoacetylation and H3K9 (e.g. H3K9me1, H3K9me2, H3K9me3) or H3K27 methylation (Allshire and Madhani, 2018). For example, the silencing mark H3K9me3 is written by the methyltransferase SETDB1 and read by HP1s, which induce chromatin condensation (Karimi et al., 2011). Euchromatin is associated with histone H4 acetylation and methylation of histone H3 at lysine 4 (e.g. monomethylation – H3K4me) (Grewal and Jia, 2007). Histone acetyltransferases, like CBP/p300, function as transcriptional co-activators (Giordano and Avantaggiati, 1999). Canonical histone subunits can also be swapped by histone exchange for variant subunits (Venkatesh and Workman, 2015).

Chromatin remodelers hydrolyze ATP to slide or evict nucleosomes or mediate histone variant exchange. They can be grouped into four types: INO80, chromodomain, ISWI, and SWI/SNF (Tyagi et al., 2016). INO80 participates in the regulation of transcription, replication, and repair (reviewed by Conaway and Conaway, 2009). The related SWR chromatin remodeling complex mediates the exchange of histone subunit H2A for the variant H2A.Z, which impedes the spread of heterochromatin (Kobor et al., 2004). Chromodomain proteins recognize (read) specific histone marks – for example, the HP1s that recognize H3K9me3 to induce heterochromatinization. ISWI remodelers impede transcription by positioning nucleosomes and facilitating de novo nucleosome assembly (Tyagi et al., 2016). The SWI/SNF chromatin remodeling complexes alter histone-DNA interactions to enable the addition or removal of nucleosomes (Roberts and Orkin, 2004).

TFs add sequence specificity to transcriptional regulation – they interpret the genetic regulatory information of cis-regulatory elements (promoters and enhancers) and convert it to action (or inaction) by Pol II. TF binding alone is sufficient to affect transcription because it can block polymerase. A classic example is regulation of the bacteriophage λ repressor locus. During the lysogenic growth phase, λ Repressor blocks the expression of the λ Cro-encoding gene by occupying an operator site to block DNA-binding by RNA polymerase. Conversely during the lytic growth phase, λ Cro blocks λ Repressor expression by the same mechanism (Ptashne, 2011).

Beyond binding site occlusion, TFs often rely on effector domains to affect transcription (Figure 1.1) (Frietze and Farnham, 2011). Some TFs’ activities are influenced by ligand binding domains, as exemplified by nuclear receptors (Chapter 1.2.2.4). In other cases, TFs can exhibit enzymatic activities in addition to DNA binding – for example, a subset of Human C2H2 ZFPs contain SET methyltransferase domains. Many TFs appear to affect transcription via protein-protein interactions with the general transcriptional machinery, other TFs, or histone and chromatin modifying enzymes (Frietze and Farnham, 2011).

1.2.4.1 Protein-protein interactions (PPIs)

Protein-protein interactions (PPIs) are non-covalent associations between proteins, often facilitated by dedicated PPI elements. PPI elements can have intrinsic structure (PPI domains): for example, bZIP TFs dimerize using dedicated domains (Weirauch and Hughes, 2011). PPI domains are typically conserved and larger than 30 amino acids, and they typically arise by duplication, and therefore exhibit homology to one another and can be identified by the methods described in Chapter 1.2.3.1 used to identify DBDs by homology to a consensus HMM.

TF PPI elements can also occur in disordered regions (e.g. short linear motifs (SLiMs)), as is typical for transactivation domains (TADs). Intrinsically disordered regions are common in eukaryotic TFs – up to 94% of eukaryotic TFs have been predicted to contain some extended region of intrinsic disorder (Liu et al., 2006). SLiMs can be as short as 2-4 amino acids; therefore, they are likely to arise randomly by point mutations rather than duplication and cannot easily be detected by homology (Neduva and Russell, 2005). TADs are common among TFs; it is thought that their intrinsic disorder enables structural and evolutionary flexibility (Frietze and Farnham, 2011). A recent investigation into the recruitment of the Mediator complex by OCT4 and the in Human embryonic stem cells also suggests that these intrinsically disordered activation domains of TFs establish phase-separated droplets that permit the aggregation of transcription-related cofactors (Boija et al., 2018). This theory may explain the challenging repeated observation that the amino acid sequences of TADs show little correspondence to the cofactors they recruit (Hope and Struhl, 1986; Godowski, Picard and Yamamoto, 1988; Ransone et al., 1990; Jin et al., 2016).

TFs use PPIs to recruit transcription-related cofactors – histone modifiers, chromatin remodelers, general transcription factors, and/or other transcription factors. For example, the tumor suppressor p53 is a TF that recruits the histone acetyltransferases CBP/p300 via a TAD (Dyson and Wright, 2005). YY1 is a C2H2 ZFP that recruits the chromatin remodeling INO80 complex (Cai et al., 2007). The TF has been shown to interact with GTFs and TATA-binding protein (TBP; required for GTF recruitment) (Frietze and Farnham, 2011). TF-TF PPIs can enable long-range interaction between distant genomic loci through DNA looping (Barna et al., 2002; Weintraub et al., 2017). TF-TF interactions can also be facilitated by DNA, and can result in TFs binding sequences distinct from their canonical preferred monomeric binding sites (Jolma et al., 2015).

1.2.4.1.1 Identifying TF PPIs

Because proteins interact with other proteins with related functions, exploring the PPIs of an uncharacterized protein is a valuable first step to functional characterization. Current methods for interrogating PPIs vary between in vitro and in vivo analogously to DNA-binding assays. In vivo assays offer the benefit of detecting interactions that occur in cells, although the interactions may be indirect (mediated by another protein). The selection of a PPI assay is non-trivial; one must further consider the biophysical properties of the anticipated interactions, as well as the throughput of the method. For example, a method involving the activation of a nuclear reporter (e.g. yeast two hybrid (Y2H)) would be nonideal to query the PPIs of a transmembrane protein in its membrane context (Snider et al., 2015). Prominent databases for PPI experimental results include BioGrid (Oughtred et al., 2019) and STRING (Szklarczyk et al., 2019); both are explicit about the types of experiments that yielded evidence for associations between proteins, which is critical to biological interpretation.

There exist several PPI interrogation methods demonstrated or likely to be appropriate for TFs. Protein-protein microarrays (Yazaki et al., 2016) are based on the incubation of a purified query protein on an array of potential interactors. Proximity dependent biotin identification coupled to mass spectrometry (BioID-MS) is a relatively new technique, where query proteins (baits) are linked to a biotin ligase molecule and proteins that they are close to within the cell context (preys) are tagged with biotin, enabling the purification of preys with streptavidin and their identification by mass spectrometry (Roux et al., 2018). Affinity purification coupled to mass spectrometry (AP-MS) to query TF PPIs has been integral to the findings reported in this thesis.

1.2.4.1.2 AP-MS experiments

Affinity purification coupled to mass spectrometry (AP-MS) is a well-established tool for the discovery of PPI binding partners (preys) for a query protein (bait). The basic principle underlying an AP-MS experiment is the purification of the bait (e.g. by immunoprecipitation), followed by the identification of preys using protein sequencing by mass spectrometry (Dunham et al. 2012). Typically, the bait protein is expressed with an epitope tag in the query cell line.

Cells are then lysed, and bait proteins are immunoprecipitated using an antibody fixed to a solid support such as a magnetic bead. Captured preys are then eluted and digested with a protease (e.g. trypsin) to cleave proteins at positions predictable based on their amino acid sequences. The prey peptides are then identified by tandem mass spectrometry, and prey proteins are inferred based on the mapping of peptides to a database of full-length protein sequences (Snider et al., 2015).
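The predictability of protease cleavage, and the mapping of peptides back to full-length proteins, can be sketched in a few lines. The example assumes the commonly used simplified trypsin rule (cleave after lysine or arginine unless the next residue is proline) and exact string matching of peptides; real search engines score spectra rather than matching strings. The protein names and sequences are invented for illustration.

```python
def trypsin_digest(protein: str) -> list:
    """In silico digest: cut after K or R, except when followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def proteins_matching(peptide: str, database: dict) -> list:
    """Names of database proteins whose digest contains the peptide."""
    return sorted(name for name, seq in database.items()
                  if peptide in trypsin_digest(seq))


# Hypothetical two-protein database.
database = {"preyA": "MKWVTFRPSLKAA", "preyB": "GGKWVTFRASLK"}
print(trypsin_digest(database["preyA"]))        # ['MK', 'WVTFRPSLK', 'AA']
print(proteins_matching("WVTFRPSLK", database)) # ['preyA']
```

In this toy case the proline after the arginine in preyA protects that site, so the observed peptide distinguishes preyA from preyB even though the two sequences are similar.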

When using AP-MS to query the PPIs of TFs, an endonuclease digestion step should be incorporated after cell lysis and before bait purification in order to exclude DNA- or RNA-mediated indirect interactions (Marcon et al., 2014). Nuclear fractionation can also be employed to ensure that interactions discovered occur in the nucleus.

The general design of AP-MS experiments leads to benefits as well as potential limitations. The method is ‘unbiased’ in the sense that prey proteins need not be explicitly queried – however, AP-MS is limited to the discovery of preys that are sufficiently highly expressed at the protein level in the experimental cell line (Dunham et al. 2012). Depending on the experimental question, either standardized epitope tagging systems or antibodies against the native bait can be employed. With the use of standardized epitope tags and appropriate cell lines, the throughput of AP-MS is also amenable to the investigation of hundreds of baits in parallel for the production of interaction networks. Notably, lysis and affinity purification are less likely to permit detection of weak/transient interactions (Snider et al., 2015). AP-MS is also restricted from the detection of homodimerization interactions, because the preys are indistinguishable from baits when sequences are identical.

The interpretation of AP-MS results involves several data analysis steps. The first problem is the identification of preys based on digested prey peptides. This computationally expensive task involves the inference of the likeliest proteins (and their quantities) that would have resulted in the observed peptides after digestion by the known peptidase. Fortunately, software solutions exist, including X! Tandem (Craig and Beavis, 2003). This analysis produces semi-quantitative spectral counts, which represent the quantities of each of the mapped prey proteins in each bait’s purification.

Next, statistically significant bait-prey associations must be distinguished from background. Contaminant proteins (proteins with unusually high background quantities in bait purifications) must be identified so that their spectral counts in bait purifications can be penalized. Contaminants can include proteins that are highly expressed in the experimental cell line, and/or proteins that have affinity for the epitope tag employed. Negative control experimental data is critical to the identification and quantification of contaminant preys. The Contaminant Repository for Affinity Purification Mass Spectrometry Data (CRAPome) is a database that additionally offers negative control spectral count data that can be matched to a user’s AP-MS spectral count data based on cell line, epitope tag, mass spectrometer model, and other specific aspects of experimental design (Mellacheruvu et al., 2013). Finally, with the incorporation of negative control data, statistically significant bait-prey interactions can be identified based on the distribution of the bait’s spectral counts across all preys, and the distribution of the prey’s spectral counts across all baits including negative control experiments. This problem lends itself to Bayesian inference, where the prior probability of non-interaction between a given bait-prey pair can be modelled based on these parameters. The AP-MS analysis tools SAINT and SAINTexpress employ this method to assign confidence scores to each bait-prey interaction, where the confidence score (AvgP) is the posterior probability of a true interaction between the bait and prey (Choi et al., 2011; Teo et al., 2014).
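The Bayesian reasoning described above can be illustrated with a deliberately simplified model. The sketch below is not the SAINT or SAINTexpress algorithm: it assumes spectral counts follow one Poisson distribution for true interactions and another for background, with hypothetical fixed rates and prior, and it combines replicates by averaging their posteriors. In practice such parameters are estimated from the data, including negative controls.

```python
import math


def poisson_pmf(count: int, rate: float) -> float:
    """Probability of observing `count` events at the given Poisson rate."""
    return math.exp(-rate) * rate ** count / math.factorial(count)


def posterior_true(count, rate_true, rate_false, prior_true=0.1):
    """Posterior probability that a bait-prey pair is a true interaction."""
    true_term = prior_true * poisson_pmf(count, rate_true)
    false_term = (1.0 - prior_true) * poisson_pmf(count, rate_false)
    return true_term / (true_term + false_term)


def average_posterior(counts, rate_true, rate_false, prior_true=0.1):
    """Mean posterior across replicate purifications (a simple combination rule)."""
    posteriors = [posterior_true(c, rate_true, rate_false, prior_true)
                  for c in counts]
    return sum(posteriors) / len(posteriors)


# Hypothetical rates: ~15 spectra for true preys, ~2 for background.
print(average_posterior([18, 22], rate_true=15, rate_false=2) > 0.99)  # True
print(average_posterior([1, 2], rate_true=15, rate_false=2) < 0.01)    # True
```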

1.2.5 Summary

TFs are proteins that bind specific DNA sequences. This function is normally enabled by dedicated DNA-binding domains. The DNA-binding preferences of a TF can be summarized as a motif, and a myriad of assays to establish DNA-binding preferences exist, including in vitro methods like PBMs and in vivo methods like ChIP-seq. While in vitro methods offer confidence that the derived motif results from direct DNA-contacting, in vivo methods reveal the genomic loci that a TF binds, and these loci can be compared to other annotations to reveal a more integrated picture of the TF’s place in a dynamic regulatory landscape. TFs can influence transcription simply by occupying a binding site, but most eukaryotic TFs are thought to influence Pol II’s access to promoters, predominantly via PPIs to recruit chromatin proteins or GTFs. TF PPIs can be queried in vitro, for example by protein microarrays, or in vivo by methods like BioID-MS or AP-MS. A complete picture of the Human TFs must start with an agreed upon list of Human TFs and a consideration of their motifs, effector functions (including PPIs), and evolution.

1.3 C2H2 zinc finger proteins

C2H2 ZFPs are the largest class of Human TFs and are relatively poorly studied despite several unique and intriguing properties. Despite comprising about half of all Human TFs (747) and having hundreds of members in other Vertebrate genomes, they are often overlooked in functional analyses and are vastly undercharacterized to this day. A subset of C2H2 ZFPs have been well-studied, including YY1, the insulator protein CTCF, specificity proteins SP1-3, the KLFs, and early growth response proteins EGR1-4. Yet, C2H2 ZFPs have gradually gained interest over the last decade, as experimental evidence of a long-theorized evolutionary relationship with TEs has begun to accumulate and inspire further imagination about how this interaction influences the evolution of complex genome regulation. Because C2H2 ZFPs are TFs, we can consider the problem of their functional characterization in terms of their DNA-binding specificities and PPIs. Their unique DNA-contacting mechanism sets C2H2 ZFPs apart from all other TFs in their ability to rapidly evolve new DNA-binding preferences, and their PPI domains segregate with important functional and evolutionary distinctions within the C2H2 ZFP family.

1.3.1 C2H2 ZFP DNA-binding

The C2H2 ZFPs contact DNA using C2H2 ZF domains. These domains are distinct from other DBDs in that they can evolve diverse binding preferences and their structure permits tandem linking of several C2H2 ZF domains. This modularity permits rapid evolution of extremely diverse DNA-binding preferences and lends itself to the problem of in silico motif prediction based on C2H2 ZF domain amino acid sequences.

1.3.1.1 ZF domain structure and function

C2H2 ZF domains (hereafter ZFs) are zinc-coordinating DBDs composed of two short antiparallel β strands that form a β hairpin, followed by an α helix (Figure 1.2 A, Figure 1.3 A). A zinc ion contacts two cysteine residues in the β strands and two histidine residues in the α helix (hence Cys-2-His-2), folding the β strands away from the DNA-contacting face of the α helix. ZFs are 23-30aa in length and follow the pattern X2-Cys-X2,4-Cys-X12-His-X3,4,5-His (Pabo et al. 2001). The α helix associates with the major groove of the DNA helix, and the amino acids at α helix positions -1, 2, 3, and 6 are called the ‘specificity residues’ because they have the greatest effect on the DNA-binding preferences of the ZF. Each ZF domain confers binding specificity to 3-4 nucleotide residues (Figure 1.3 A) (Wolfe et al. 2000; Klug 2010; Najafabadi et al. 2015).
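The ZF pattern and the specificity residue positions lend themselves to a short pattern-matching sketch. Two assumptions are made here that are not stated in the text: "X2,4" is read as exactly two or four residues and "X3,4,5" as three to five, and helix positions are numbered so that the first zinc-coordinating histidine sits at position +7 (with no position 0). The example sequence and function names are illustrative only.

```python
import re

# X2-Cys-X2,4-Cys-X12-His-X3,4,5-His; group 1 captures the 12 residues
# that immediately precede the first histidine.
ZF_PATTERN = re.compile(r"..C(?:.{2}|.{4})C(.{12})H.{3,5}H")


def find_zfs(protein: str) -> list:
    """Return (start index, matched ZF sequence) for each pattern match."""
    return [(m.start(), m.group()) for m in ZF_PATTERN.finditer(protein)]


def specificity_residues(zf: str) -> str:
    """Residues at helix positions -1, 2, 3 and 6 of a single ZF."""
    match = ZF_PATTERN.fullmatch(zf)
    if match is None:
        raise ValueError("sequence does not fit the C2H2 pattern")
    pre_his = match.group(1)
    # With His at +7, the 12-mer indices 5, 7, 8, 11 map to -1, 2, 3, 6.
    return "".join(pre_his[i] for i in (5, 7, 8, 11))


zf = "YACPVESCDRRFSRSDELTRHIRIH"         # illustrative 25-residue finger
print(find_zfs("MM" + zf + "TGQKP")[0][0])  # 2
print(specificity_residues(zf))            # RDER
```

A pattern this permissive will also match sequences that are not functional ZFs, which is one reason profile-based methods such as HMMs are preferred for domain annotation.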

While the ZF-DNA interaction is well understood, some evidence also suggests that ZFs can associate with RNA and that this association is functional. GTF3A has been shown to bind both DNA and ribosomal dsRNA in separate crystal structures, using the same ZFs to contact each nucleic acid (Brown, 2005). YY1 is another DNA-binding C2H2 ZFP that appears to function as an RNA-binding protein, including interactions with various mRNAs (Belak and Ovsenek, 2007; Belak, Ficzycz and Ovsenek, 2008). In Mouse embryonic fibroblasts, YY1’s interaction with Xist, the long non-coding RNA encoded by and localized to the female X chromosome for inactivation, supports a model where YY1 tethers Xist to the X chromosome by interacting with both RNA and DNA simultaneously (Jeon and Lee, 2011). YY1 has been reported to bind ssRNA with low specificity, and unlike GTF3A, YY1 appears to utilize distinct ZFs for each of its DNA-binding and RNA-binding functions (Wai et al. 2016). CTCF (Kung et al., 2015) and ZNF74 (Grondin, Bazinet, and Aubry 1996) have also been shown to bind RNA. Further evidence is required to establish whether RNA-contacting by ZFs is sequence-specific, and to demonstrate the extent to which this function pervades the DBD class. Likewise, ZF domains have been demonstrated to facilitate PPIs, and this phenomenon remains to be characterized at a global scale (discussed further in Chapter 1.3.2.4).

1.3.1.2 Arrangement of ZF domains

Vertebrate C2H2 ZFPs are sometimes called tandem ZFPs or polydactyl ZFPs because they are almost always comprised of arrays of several ZF domains linked head-to-tail toward the C-terminus of the protein. These proteins typically contain at least three ZF domains and an average of 8.5, with the longest C2H2 ZFPs containing more than 30 ZF domains (e.g. ZNF91) (Weirauch et al. 2014; Letunic et al. 2015; Stubbs et al. 2011). The ZF domains are connected in tandem by a 7aa flexible linker, resulting in approximately regular spacing of the DNA-contacting residues of each ZF domain. This establishes a standardized modular template where the binding specificity of the array can be modified by the addition, removal, or replacement of ZFs, or the replacement of specificity residues (Schmidt and Durrett 2004; Emerson and Thomas 2009) (Figure 1.3 A). While the members of most TF DBD classes bind similar sequences to other members of the same DBD class, the C2H2 ZFPs are characterized by the rapid evolution of extremely diverse binding preferences enabled by this modular architecture (discussed in Chapter 1.3.3).

Figure 1.3 C2H2 ZFP binding mode and protein-protein interactions
A. Binding interactions between the ZFs of C2H2 ZFPs and DNA. Adapted from Stubbs, Sun, and Caetano-Anolles (2011) © 2011 by Springer Nature.
B. Schematic representing the expected modular functions of C2H2 ZFPs. Zinc finger arrays contact DNA at a specific binding site, and most C2H2 ZFPs contain one of three N-terminal effector domains (KRAB, SCAN, or BTB). The counts and percentages of C2H2 ZFPs containing each effector domain and the canonical function of the effector domain are shown. Domain annotations of Human proteins are sourced from SMART (Letunic et al. 2015).

1.3.1.2.1 Why the long array?

Given that each ZF confers specificity to about 3bp, and each Human ZFP contains an average of 8.5 ZFs, it would appear that the average C2H2 ZFP can preferentially bind sequences about 26nt in length, with the longest C2H2 ZFPs (~34 ZFs) potentially recognizing sequences over 100nt. These are far longer than the average TF’s binding preference, and far longer than the theoretical maximum motif length required to confer such a high degree of specificity that it only recognizes a single locus in the Human genome – even ignoring sequence degeneracy, a random 16nt sequence is expected to occur only about once in a random 3Gb genome. It is hard to imagine an evolutionary scenario that would select for the conservation of such long ZF arrays for DNA sequence recognition (Emerson and Thomas 2009).
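The arithmetic behind this argument can be reproduced directly. The sketch assumes, for illustration only, a uniformly random genome of 3 Gb scanned on one strand and exactly 3 nt of specificity per ZF; the function names are hypothetical.

```python
def footprint_nt(n_zfs: float, nt_per_zf: int = 3) -> float:
    """Length of sequence an array could specify at nt_per_zf per finger."""
    return n_zfs * nt_per_zf


def expected_occurrences(k: int, genome_size: int = 3_000_000_000) -> float:
    """Expected matches of one fixed k-mer in a uniformly random genome."""
    return genome_size / 4 ** k


def min_unique_length(genome_size: int = 3_000_000_000) -> int:
    """Shortest k for which a fixed k-mer is expected at most once."""
    k = 1
    while 4 ** k < genome_size:
        k += 1
    return k


print(footprint_nt(8.5))                  # 25.5
print(footprint_nt(34))                   # 102
print(min_unique_length())                # 16
print(round(expected_occurrences(16), 2)) # 0.7
```

Under these assumptions a 16nt site is already expected less than once per genome, so the roughly 26nt footprint of an average array exceeds that threshold by about 10nt.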

Instead, there exist several alternative explanations for these long ZF arrays, which are not mutually exclusive: (1) a single ZFP may use different subsets of its ZFs to bind different DNA sequences, thereby binding multiple target sites; (2) some of the ZFs may function in protein binding or RNA binding; (3) subsets of the long arrays are not under selection, but rather evolutionary relics – leftover by-products of selection to rapidly diversify binding specificity (Iuchi, 2001).

1.3.1.2.2 Monodactyl ZFPs

13 Human C2H2 ZFPs contain only one ZF and no other DBDs (Letunic et al. 2015) – these can be called ‘Monodactyl ZFPs.’ To my knowledge, ZNF750 is the only known Human Monodactyl ZFP with a DNA-binding motif (Sen et al. 2012; Boxer et al. 2014). However, ZNF750 has been identified in ChIP-seq experiments to interact with the bona fide TF , which also binds a similar GC-rich sequence (Sen et al. 2012; Boxer et al. 2014). Given the paucity of evidence that Monodactyl ZFPs can function as TFs, DNA-binding experiments in the absence of other TFs (in vitro) are required to establish whether ZNF750 and other Human Monodactyl ZFPs tend to bind DNA specifically, and whether a single ZF domain alone is sufficient to confer this binding, or other sequence features are required.

The best-studied DNA-binding Monodactyl ZFP is the Drosophila melanogaster TF GAGA, named for its “GAGAGAG” binding preference. Gel-shift experiments (Pedone et al., 1996) and a crystal structure (Omichinski et al., 1997) have suggested that Drosophila GAGA’s ability to bind DNA specifically is dependent on basic amino acids in a disordered region upstream of the ZF domain, and therefore the ZF domain alone is insufficient to confer DNA binding specificity. It remains an open question whether a single ZF is sufficient to confer specific DNA-binding, and therefore in vitro experimental evidence is required to determine whether Monodactyl ZFPs should be labelled as TFs.

1.3.1.3 DNA-binding specificities of C2H2 ZFPs

Prior to work described in this thesis (Chapters 2 and 3) and other work published contemporaneously (Imbeault, Helleboid and Trono, 2017; Barazandeh et al., 2018), binding specificities were only known for about 150/747 Human C2H2 ZFPs. Nonetheless, it was already clear that (1) C2H2 ZFPs are distinct from other TFs because their modular DNA-recognition mechanism enables them to rapidly evolve new, diverse binding specificities, and (2) C2H2 ZFPs tend to bind TEs (discussed further in Chapter 1.3.3).

A global picture of C2H2 ZFP DNA binding specificities was lacking for multiple reasons. First, the determination of motifs for C2H2 ZFPs is challenging for both in vitro and in vivo methods. For example, typical PBMs are suited to discover motifs up to 14bp, and the number of probes required for a PBM increases exponentially with the maximum length of the motif one can discover. Meanwhile, the median C2H2 ZFP contains 9 ZFs, so an appropriate PBM to query motifs for even the half of C2H2 ZFPs with the shortest arrays would require a probe library large enough to discover 36bp motifs. Furthermore, the in vitro expression of C2H2 ZFPs is hampered by their low solubility due to high cysteine content (Stubbs, Sun and Caetano-Anolles, 2011). At the same time, in vivo motif discovery using ChIP-seq has been limited by a lack of available antibodies against C2H2 ZFPs, by challenges in mapping sequencing reads to TEs (which can exceed 10kb in length), and because sequence similarity in ChIP-seq peaks may reflect common ancestry rather than binding preference (Najafabadi, Albu, and Hughes 2015). Motif discovery for C2H2 ZFPs can be aided by recognition codes – methods to predict the binding preference of a C2H2 ZFP based on its amino acid sequence.
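The exponential scaling of probe-library size with motif length noted above can be made concrete with a small calculation. The sketch below is an illustration only: it assumes that a library must represent every possible DNA sequence of length k at least once (4^k sequences), which is a simplification and not a description of any specific PBM design. The function name `num_kmers` is hypothetical.

```python
def num_kmers(k: int) -> int:
    """Number of distinct DNA sequences of length k (alphabet A, C, G, T)."""
    return 4 ** k


if __name__ == "__main__":
    # Motif lengths mentioned in the text: 14bp and 36bp.
    for k in (14, 36):
        print(f"k = {k:2d}: {num_kmers(k):,} possible sequences")

    # Growth factor when moving from 14bp to 36bp motifs.
    print(f"fold increase: {num_kmers(36) // num_kmers(14):,}")
```

Under this simplifying assumption, each additional base pair of motif length multiplies the sequence space by four, so extending coverage from 14bp to 36bp enlarges it by a factor of 4^22.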


1.3.1.3.1 Recognition codes

The endeavour to predict DNA binding specificities of TFs from amino acid sequence alone has been of long-standing interest (Seeman, Rosenberg and Rich, 1976; Mandel-Gutfreund and Margalit, 1998; Benos, Lapedes and Stormo, 2002). Early frameworks for decoding the relationship between C2H2 ZFP amino acid residues and the nucleotides they preferentially bind were based on careful analysis of individual C2H2 ZFPs with relatively short ZF arrays – for example, one approach leveraged a crystal structure for the three-fingered Murine EGR1 protein (Pavletich and Pabo 1991), and another compared mutation analyses between the three-fingered proteins EGR2 and SP1 (Nardelli et al., 1991). Today, an understanding of the general rules for the contacts between nucleotide sequences and the residues of an individual ZF (discussed in Chapter 1.3.1.1) motivates the development of recognition codes: statistical methods to predict in silico the binding preference (a motif) for a C2H2 ZFP based on the ZF array’s amino acid sequence (Wolfe, Nekludova and Pabo, 2000; Klug, 2010). The problem is conceptually simple – each ZF’s four specificity residues should predict the 3-4nt that each ZF will recognize, and a tandem array of ZFs should recognize a sequence represented by the concatenation of its constituent ZFs’ preferences. In practice, the problem is complicated by several factors: (1) it is unlikely that all ZF domains participate in DNA contacting; (2) ZF domains impose a variable degree of influence on the binding of adjacent domains (Pavletich and Pabo, 1991; Kaplan, Friedman and Margalit, 2005); and (3) binding is not always limited to the ZF’s four canonical specificity residues (Wolfe et al. 2000).
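The conceptually simple version of the problem described above (one preferred triplet per ZF, concatenated across the array) can be sketched as a toy lookup. This is a minimal illustration and not any published recognition code: the table `TOY_CODE`, the function `predict_site`, and the two residue-to-triplet entries (loosely modelled on the three-fingered EGR1 protein) are assumptions made for the example. The sketch also assumes that the four specificity residues are taken from helix positions -1, 2, 3 and 6, and that the N-terminal ZF contacts the 3ʹ end of the site. It deliberately ignores the complicating factors listed above, and real recognition codes output probabilistic motifs rather than a single sequence.

```python
# Toy lookup: four specificity residues (assumed helix positions -1, 2, 3, 6)
# mapped to one preferred DNA triplet. Entries are illustrative only.
TOY_CODE = {
    "RDER": "GCG",
    "RDHT": "TGG",
}


def predict_site(fingers):
    """Concatenate per-ZF triplets into a predicted binding site (5' to 3').

    `fingers` lists the specificity residues of each ZF from the N- to the
    C-terminus. Assumption: the N-terminal ZF binds the 3' end of the site,
    so triplets are joined in reverse ZF order.
    """
    triplets = [TOY_CODE[f] for f in fingers]
    return "".join(reversed(triplets))


if __name__ == "__main__":
    egr1_like_array = ["RDER", "RDHT", "RDER"]  # hypothetical three-ZF array
    print(predict_site(egr1_like_array))
```

A lookup of this kind fails whenever a ZF does not contact DNA, when neighbouring ZFs alter one another's preferences, or when residues outside the four canonical positions contribute to binding, which is why statistical models trained on large datasets are used in practice.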

A historical focus on the EGR family as a model for C2H2 ZFP DNA-binding specificity (Nardelli et al., 1991; Pavletich and Pabo, 1991; Benos, Lapedes and Stormo, 2002) has permitted the use of EGR proteins in in vitro assays where the individual EGR ZFs are replaced with query ZFs and the ability of the modified EGR protein to bind a library of promoters can be tested using bacterial one-hybrid (B1H) (Christensen et al., 2011; Gupta et al., 2014; Najafabadi et al., 2015). One such recent study developed a random forest model to predict the preferences of two-finger ZF models based on a dataset of 95 B1H motifs called ZFModels (Gupta et al., 2014). Shortly thereafter, another B1H-based recognition code (B1H-RC) was developed based on the binding preferences for more than 8,000 naturally-occurring individual ZFs in parallel in an EGR1 context, again using a random forest-based recognition code for individual ZFs (Najafabadi et al., 2015). The B1H-RC slightly outperforms ZFModels on the same gold standard dataset, but also fails on many of the same cases, apparently because some ZFs’ natural binding preferences are more strongly influenced by neighbouring ZFs in the ZF arrays in which they naturally occur (Gupta et al., 2014; Najafabadi et al., 2015). While inaccuracies indicate a need for further investigation to produce more nuanced predictive models for the binding specificities of natural C2H2 ZFPs, these recognition codes are useful for the interpretation of ChIP-seq data – they can be used to identify motifs that reflect direct binding and the residues of the C2H2 ZFP that bind DNA (Najafabadi, Albu and Hughes, 2015; Najafabadi et al., 2015). For example, the method RCADE uses the B1H-RC to optimize motif discovery and identify the ZFs involved in DNA contacting (Najafabadi, Albu, and Hughes 2015).

Since the release of RCADE, it has been used to establish virtually all new C2H2 ZFP ChIP-seq motifs – an increase from ~150 to 480 motifs for the 747 members of this DBD class (work described in Chapter 3; Najafabadi et al. 2015; Barazandeh et al. 2018).

1.3.2 C2H2 ZFP PPIs

Based on small-scale investigations, some expectations about functions of the most common PPI domains of the C2H2 ZFPs have been generally established. Human C2H2 ZFPs typically contain one of three PPI domains toward the N-terminus: KRAB, SCAN, or BTB (Figure 1.3 B). The extent to which ZF domains and unstructured regions of C2H2 ZFPs contribute to PPIs remains unknown.

1.3.2.1 KRAB domain

The Krüppel-associated box (KRAB) domain occurs in 47% of Human C2H2 ZFPs (348/747) (Letunic, Doerks and Bork, 2015). KRAB-containing C2H2 ZFPs (hereafter KZFPs) were first discovered in 1991 (Bellefroid et al., 1991). The KRAB domain is about 75 amino acids in length and comprises a KRAB-A box and a KRAB-B box, where the KRAB-B box is thought to support the KRAB-A box’s PPIs (Bellefroid et al., 1991; Mannini et al., 2006). The KRAB domain is largely unstructured and is well characterized as a recruiter of the transcriptional co-repressor TRIM28 (also called KAP1) (Stoll et al., 2019).

KZFPs originated in the last common ancestor of Coelacanth, Lungfishes, and Tetrapods ~413M years ago, and Tetrapod genomes typically contain at least 200 KZFPs; notable and unexplained exceptions are Avian genomes, which generally encode fewer than 10 (Imbeault, Helleboid and Trono, 2017). The KRAB domain of modern KZFPs is thought to have originated from PRDM9 (Birtle and Ponting, 2006). Many KZFPs appear to be lineage-specific (i.e. not shared between closely related species), suggesting that they arise rapidly and are often not conserved over long evolutionary time (Nowick et al., 2011). Segmental duplications have been an important source of KZFP duplication events, leading to the occurrence of clusters of paralogous KZFPs, predominantly on (Hamilton et al., 2006; Huntley et al., 2006; Nowick et al., 2010, 2011). At least 136/348 Human KZFPs are Primate specific, 70 of which arose in the Hominoid lineage (Nowick et al., 2010). The observation that paralogous loci tend to diverge in their ZF arrays led to the hypothesis that DNA-binding specificity is an important contributor to KZFP functional diversification (Hamilton et al., 2006; Nowick et al., 2010). One comparison of the tissue expression profiles of KZFP paralogs within the same cluster also found that paralog pairs tend to exhibit divergent spatial expression (Nowick et al., 2010). In general, KZFPs have been characterized to function primarily in the germline and early development, and more recent evidence suggests they can also function as transcriptional repressors in adult tissues (Wolf and Goff, 2009; Quenneville et al., 2012; Corsinotti et al., 2013; Turelli et al., 2014; Wolf et al., 2015).

The discovery that TRIM28 was essential for early embryonic repression of transposable elements in both Mouse and Human led to a shift in focus to the roles of KZFPs as repressors of deleterious TE insertions during developmental reprogramming (discussed further in Chapter 1.3.3.2) (Wolf and Goff, 2009; Quenneville et al., 2012; Corsinotti et al., 2013; Turelli et al., 2014). TRIM28 acts as a scaffold for a number of chromatin proteins that contribute to transcriptional repression: the histone methyltransferase SETDB1 (Schultz et al., 2002), the nucleosome remodelling and deacetylation complex NuRD (Schultz, Friedman and Rauscher, 2001), HP1 (Nielsen et al., 1999; Sripathy, Stevens and Schultz, 2006) and, in embryonic stem cells, DNA methyltransferases (Quenneville et al., 2012) (Figure 1.4). TRIM28 recruitment by KZFPs has been shown to induce heterochromatinization by H3K9me3, and this effect is capable of spreading to silence promoters tens of kilobases from the KZFP’s DNA binding site (Groner et al., 2010). ZFP809 knockout results in reactivation of the transposable elements it binds, associated primarily with a decrease in TRIM28 and SETDB1 localization, decrease in heterochromatin-associated H3K9me3 marks, and a modest reduction in DNA methylation in Mouse embryonic stem cells (Wolf et al., 2015). In Mouse embryonic fibroblasts (which are more differentiated than Mouse embryonic stem cells), H3K9me3 appears to preserve TE silencing in the absence of ZFP809 (Wolf et al., 2015); however, other studies have suggested that KZFPs and TRIM28 play a role in TE silencing in adult cells like neural progenitors (Fasching et al., 2015) or Murine hematopoietic stem cells (Haas et al., 2003). At present, the paucity of KZFP knockout models limits a broader physiological understanding of this gene family (Yang, Wang and Macfarlan, 2017).

The function of TRIM28 as a transcriptional repressor corresponds well with the interpretation that KZFPs recruit TRIM28 to silence TEs (Rowe et al., 2010). Interestingly, TRIM28 has also been documented to carry out roles beyond transcriptional repression, including the regulation of elongation (Bunch et al., 2014) and DNA damage repair (Ziv et al., 2006; Goodarzi, Jeggo and Lobrich, 2010); overall, it appears that potential impacts of TRIM28 localization are multifold (Iyengar and Farnham, 2011) and at least partially dependent on contexts such as position relative to the transcription start site (Iyengar et al., 2011) and post-translational modifications (Ziv et al., 2006; Goodarzi, Jeggo and Lobrich, 2010).

The KRAB domain is essential to the theory that C2H2 ZFPs are driven to diversify by the diversification of transposable elements. Further description of, and evidence for, this theory are provided in Chapter 1.3.3.2.

1.3.2.2 SCAN domain

The SCAN domain is present in about 7% (52/747) of Human C2H2 ZFPs and functions as a selective dimerization domain (Williams, Blacklow and Collins, 1999; Sander et al., 2000; Stone et al., 2002; Letunic, Doerks and Bork, 2015). The SCAN domain was first identified by its sequence identity in the N-termini of SCAN-containing C2H2 ZFPs (Williams et al., 1995). SCAN-containing C2H2 ZFPs typically occur in the genome in clusters of two to seven, and those in the same cluster tend to have higher sequence identity to one another, suggesting that SCAN C2H2 ZFPs may arise by segmental duplications like KZFPs (Sander et al., 2000).


Figure 1.4 The arms race and domestication models of KZFP-ERE coevolution

A. The arms race model posits that KZFPs are driven to expand and diversify by the diversification and expansion of EREs, and EREs are driven to diversify to escape transcriptional silencing by KZFPs. B. The domestication model proposes that a subset of the observed KZFP-ERE binding interactions are conserved beyond the replicative lifespan of the ERE (i.e. after it has accumulated so many mutations that it is no longer capable of transposition) because the ERE has been “domesticated” by the host genome as a cis-regulatory element. Examples of three distinct hypothetical cellular conditions are provided, where the KZFP’s expression or post-translational modification could be used to further modulate the regulatory effects of the ERE.

Adapted from Ecco et al. (2017) © 2017 by The Company of Biologists Ltd.


Unlike the KRAB domain, the SCAN domain is highly structured. The minimal functional unit for dimerization is 58 amino acids in length and consists of a bundle of five α helices, where the N-terminal helix is thought to form the binding interface between SCAN domains and to have the greatest effect on dimerization preference (Nam, Honer and Schumacher, 2004). It appears that all SCAN domains can homodimerize, while heterodimerization is a restricted feature (Liang, Choo, et al. 2012; Liang, Huimei Hong, et al. 2012; Porsch-Özcürümez et al. 2001). However, the dimerization specificity of a given SCAN domain cannot be predicted from its amino acid sequence based on presently available data.

The SCAN domain is thought to descend from the capsid protein of Gmr1-like Gypsy/Ty3-like retroelements, which itself is an oligomerization domain (Emerson and Thomas 2011). Gmr1-like elements originated before the common ancestor of Deuterostomes (Goodwin and Poulter, 2002), and the SCAN domain is found throughout Vertebrates. While SCAN-containing C2H2 ZFPs are far less numerous than KZFPs and not typically considered to be key players in the coevolution between C2H2 ZFPs and TEs, it has been theorized that the SCAN domain’s original function was to target host C2H2 ZFPs to retroelements (Emerson and Thomas 2011). About half of the Human SCAN-containing C2H2 ZFPs also contain a KRAB domain, and the means by which these domains may function together is unknown (Letunic, Doerks and Bork, 2015). Although the SCAN domain is found throughout Vertebrates, and KZFPs originated in Sarcopterygii (Imbeault, Helleboid and Trono, 2017), the co-occurrence of KRAB and SCAN domains on the same C2H2 ZFP is observed in Mammals and Lizards, but not Chicken or Frog (Fedotova et al., 2017).

1.3.2.3 BTB domain

The BTB domain (sometimes called the POZ domain) occurs at a similar frequency to SCAN domains in Human C2H2 ZFPs (7%; 50/747) (Letunic, Doerks and Bork, 2015). Unlike the KRAB and SCAN domains, BTB domains are not confined to C2H2 ZFPs, and they have been identified in diverse functional roles beyond transcriptional regulation, including cytoskeletal regulation (Ziegelbauer et al., 2001) and ion channel gating (Minor et al. 2000). BTB domains are broadly distributed, with representatives in Arabidopsis thaliana and Caenorhabditis elegans (Stogios et al., 2005).


In C2H2 ZFPs, BTB domains are generally regarded as selective dimerization domains that additionally bind non-BTB proteins with transcription-related functions. For example, the BTB domain of BCL6 is a homodimerization domain that also directly binds SMRT and N-CoR transcriptional co-repressors involved in histone deacetylation (Ahmad et al. 2003). Like SCAN domains, BTB domains are highly structured. The 132 amino acid homodimerizing BTB domain of PLZF consists of a cluster of six α helices flanked by five β strands (three of which form an antiparallel sheet). About a quarter of the monomer surface is involved in dimerization via a hydrophobic interface, and the residues that determine binding partner specificity are unknown (Ahmad, Engel, and Privé 1998).

1.3.2.4 The ZF as a PPI domain

C2H2 ZF domains are primarily considered DBDs, although as delineated in Chapter 1.3.1.2.1, it is unlikely that all ZFs in any given C2H2 ZFP actually contribute to DNA-binding specificity. The simple structure of the ZF domain also lends itself to PPIs. Some ZFs appear to function as PPI domains, and this function appears to be independent of whether the same ZF also functions in DNA-binding. A number of examples of ZF domains mediating PPIs by their α helices or β strands are reviewed by Brayer and Segal (2008).

In some cases, PPIs are mediated by non-canonical ZF domains which are likely defunct as DBDs. For example, Drosophila melanogaster FOG1 interacts with GATA1 using four ZFs, which contain modified α helices that may preclude them from DNA-binding (Liew et al., 2005). In other cases, DNA-binding ZFs function in PPIs that have been demonstrated to affect transcriptional regulation. For example, SP1 contains only three ZFs, but they have been implicated in PPIs with the chromatin-remodelling proteins p300, SWI/SNF, and TAF1. The interaction between TAF1 and DNA-binding ZFs was shown to inhibit DNA binding, preventing SP1-mediated transcriptional activation, and thus revealing that PPIs with DNA-binding ZFs can affect transcriptional regulation by blocking the C2H2 ZFP from promoter localization (Suzuki et al., 2003).

A case study based on the 30 ZFs of the Human C2H2 ZFP OAZ found that ZFs involved in DNA-binding could also mediate PPIs, but ZFs that mediated PPIs did not always contribute to DNA-binding. The authors concluded that DNA-binding may be a restricted function for ZFs, while their potential for PPI mediation is actually less confined and therefore possibly more extensive (Brayer, Kulshreshtha and Segal, 2008). Ultimately, a global perspective on the extent to which ZFs function as DNA-binding domains vs. PPI domains is needed.

1.3.2.5 Disordered regions

Nearly one third of Human C2H2 ZFPs contain no identifiable effector domain (30%, 224/747), and instead typically contain long unstructured N-termini (Letunic, Doerks and Bork, 2015). This is not necessarily surprising – the vast majority of eukaryotic TFs contain disordered regions (Liu et al., 2006; Vuzman and Levy, 2012). As described in Chapter 1.2.4.1, disordered regions of TFs often contribute to PPIs as TADs, SLiMs, or scaffolds (Staby et al., 2017). Disordered regions can also support DNA-binding by tuning DNA-binding specificity or affinity, or by supporting the mechanism by which DNA-binding proteins scan DNA to dock their preferred sequences (Vuzman and Levy, 2012). In the case of Drosophila melanogaster GAGA, a Monodactyl ZFP described in Chapter 1.3.1.2.2, basic residues in a disordered region near the ZF have been reported to be required for sequence-specific DNA-binding. It seems likely that the extensive and prevalent disordered regions of Human C2H2 ZFPs contribute to the functions of these proteins – possibly as a dock for PPIs and/or by tuning DNA-binding affinity or specificity.

1.3.3 Evolution of C2H2 ZFPs

C2H2 ZFPs are found across eukaryotes. Only 1-2 copies are typically observed in plants and fungi, but dozens are present in basal Metazoans and hundreds in Tetrapods (Emerson and Thomas 2009). Gene family expansions rely on duplication followed by functional divergence. Functional divergence for TF families can affect DNA-binding preference, effector functions (for C2H2 ZFPs, mainly PPIs), or expression conditions (Figure 1.5). The diversification of Metazoan C2H2 ZFPs has mainly been explained by positive selection on DNA-binding specificities, especially for those with KRAB domains to acquire binding specificities for new and changing TEs. With respect to expression profile diversification, one previous investigation addressing the expression of 19 KZFPs across 13 Human tissues concluded that tissue expression profile appears to be an important contributor to functional diversification between KZFPs (Nowick et al., 2010).


Figure 1.5 EREs in the Human genome

A. Hierarchical classification of EREs found in the Human genome, based on the scheme presented by Dfam (Hubley et al. 2016). Superfamilies containing active transposons in the modern Human genome are identified by Mills et al. (2007). Percentages of the Human genome comprised of EREs, ERVs, LINEs, and SINEs are reported by Mandal and Kazazian (2008). B. Structural representation of the canonical full-length forms and forms most commonly observed in the Human genome for the three superfamilies of EREs (LTRs and ERVs: LTR, gag, pol, env, LTR; LINEs: 5ʹ UTR, ORF1, ORF2, 3ʹ UTR, commonly 5ʹ truncated; SINEs), with coding and regulatory regions distinguished. Pol II, RNA polymerase II; Pol III, RNA polymerase III; PAS, polyadenylation signals. Adapted from Chuong, Elde, and Feschotte (2017) © Macmillan Publishers Ltd, part of Springer Nature.


1.3.3.1 Mechanisms of DNA sequence specificity diversification

The expansion of C2H2 ZFPs in Metazoans is characterized by lineage-specific expansions – relatively few C2H2 ZFP orthologs are shared between relatively closely related species, implying that C2H2 ZFPs arise rapidly and are not typically conserved over long evolutionary time (Nowick et al., 2011). In the Primate lineage, recent (35-40MYA) segmental duplications have been an important mechanism for the expansion of KZFPs (discussed in Chapter 1.3.2.1), and the resultant paralogs have divergent ZF domain sequences (suggesting divergent binding preferences) and expression profiles (Nowick et al., 2010). Many Metazoan C2H2 ZFP paralogs are under positive selection to acquire new binding specificities via nucleotide substitutions at their specificity residues (Emerson and Thomas 2009), as well as the gain and loss of ZF domains (Nowick et al., 2011). Furthermore, the non-base-contacting residues of ZFs in Metazoan C2H2 ZFPs are uniquely able to contribute to DNA binding, such that the specificity residues are freer to diversify without losing their ability to bind DNA altogether (Najafabadi et al., 2017). In sum, Metazoan C2H2 ZFPs are under diversifying selection to acquire new binding specificities, demonstrated by amino acid sequence analysis and the few DNA-binding motifs that were available at the time. Rapid duplication is facilitated by segmental duplications, and rapid diversification of binding specificities involves the substitution of ZF specificity residues, as well as the gain, loss, and/or exchange of ZFs.

1.3.3.2 Evidence and theories for coevolution with TEs

The overwhelming evidence of diversifying selection on Mammalian C2H2 ZFP binding specificities, especially in the case of KZFPs, supports the theory that C2H2 ZFPs coevolve with endogenous retroelements (EREs). Human TEs are described in the next section – for this Chapter it is sufficient to understand that EREs are diversifying copy-and-paste TEs that duplicate via transcription, and that their duplication can be deleterious. TE activation can be deleterious not only because their insertions can disrupt functional genomic elements (Bourque et al., 2018), but also because (1) de-repression may interfere with the transcription of other loci or the processing of other mRNA (Feschotte, 2008; Daniel, Behm and Öhman, 2015; Elbarbary, Lucas and Maquat, 2016) – for example, TRIM28 represses TE-derived cis-regulatory elements that can cause the aberrant activation of nearby genes in the absence of TRIM28 (Rowe et al., 2013); (2) TE-encoded proteins (like the endonuclease encoded by LINE L1s) can induce genomic instability (Hedges and Deininger, 2007); and (3) the accumulation of TE RNA transcripts can trigger innate immune responses (Kassiotis and Stoye, 2016).

Coevolution between C2H2 ZFPs and TEs was first suggested by a correlation between the number of endogenous retroviruses (ERVs; a type of ERE) and the number of individual ZFs in Vertebrate genomes (Thomas and Schneider 2011). At the same time, the KZFP ZFP809 was observed to silence retroviruses in embryonic stem cells (Wolf and Goff 2009). Shortly thereafter, it was found that deletion of TRIM28 (the canonical cofactor of KZFPs) or its effector SETDB1 results in massive derepression of ERVs in Mouse embryonic stem cells (Matsui et al., 2010; Rowe et al., 2010), and that TRIM28 is also an important regulator of another type of ERE, LINE L1s, which comprise a much larger proportion of the Human genome than ERVs (Mandal and Kazazian, 2008; Castro-Diaz et al., 2014). These findings led to the interpretation that KZFP-mediated TRIM28-dependent transcriptional repression via SETDB1 recruitment is an important mechanism for ERE restriction. The recruitment of TRIM28 by ERE-binding KZFPs is one of several ERE restriction mechanisms observed in developmental reprogramming – others include DNA methylation by DNMTs (Fadloun et al., 2013; Molaro et al., 2014; Turelli et al., 2014), cytidine deamination by APOBECs (Cullen, 2006; Richardson et al., 2014), and RNA interference (Juliano, Wang and Lin, 2011; Ku and Lin, 2014).

The observation that KZFPs appear to diversify their binding specificities (detailed in the previous section) and coevolve with EREs led to the arms race theory: KZFPs specifically bind and transcriptionally silence EREs via TRIM28 recruitment, blocking ERE transposition and imposing diversifying selection on EREs to acquire new sequences that are not recognized by any active KZFPs, and in turn ERE diversification imposes diversifying selection on KZFPs to acquire DNA-binding specificity for new ERE sequences (Figure 1.4 A) (Ecco, Imbeault, and Trono 2017; Emerson and Thomas 2009; Thomas and Schneider 2011; G. Wolf, Greenberg, and Macfarlan 2015). Some of the best direct evidence for this arms race is based on an interaction between young groups of LINEs and ZNF93, where the older groups of LINEs were found to contain a ZNF93 binding site, but all younger LINEs had lost the binding site by deletion, and ZNF93 binding was linked to transposition repression by transcriptional silencing (Jacobs et al., 2014).


During the course of the work described in this thesis, a complementary theory matured: the domestication model. This model arose in response to ChIP-exo results for 222 of the 348 Human KZFPs, which revealed that most of the TEs bound by KZFPs have accumulated enough mutations that they are no longer able to transpose (Ecco, Imbeault and Trono, 2017; Imbeault, Helleboid and Trono, 2017). The domestication model proposes that once an ERE is no longer active, the KZFP and its binding specificity corresponding to the ERE may be conserved if the KZFP has acquired some new function beyond transcriptional silencing of the ERE (Figure 1.4 B).

Together, the arms race and domestication models suggest a self-perpetuating mechanism for complex genomes to rapidly acquire new TF-binding site pairs to act as blank slates for the evolution of novel lineage-specific regulatory modules (Ecco, Imbeault and Trono, 2017; Imbeault, Helleboid and Trono, 2017).

1.4 TEs and transcriptional regulation

TEs are genetic elements that are able to move and replicate within the genome. TEs are considered a type of repetitive DNA, a category that also includes simple tandem, centromeric, and telomeric repeats; these sequences are grouped together because their repetition in the genome complicates assembly and investigation (Biscotti, Olmo and Heslop-Harrison, 2015). Identifiable TEs comprise ~45% of the current Human genome sequence, but this figure is likely an underestimate because very old insertions are more difficult to identify – up to two thirds of the genome has been estimated to be TE-derived (de Koning et al., 2011). Human TEs can be divided into endogenous retroelements (EREs) and DNA transposons. EREs mobilize via a copy-and-paste mechanism, while DNA transposons operate via cut-and-paste. EREs therefore comprise a far larger proportion of the Human genome than DNA transposons – ~43% vs. ~3%, respectively (Mandal and Kazazian, 2008). Both these quantities are striking, given that less than 2% of the Human genome encodes Human proteins (The ENCODE Project Consortium, 2012).

TEs exhibit “selfish” behaviour in that they are selected to efficiently replicate themselves, and their replicative activity can impose deleterious effects on the genomes in which they reside (Chuong, Elde and Feschotte, 2017) (described in Chapter 1.3.3.2). As a result, TEs are often regarded as genomic parasites, and genomes have evolved regulatory mechanisms to deactivate them, including transcriptional repression by KZFPs. Despite these deleterious effects, TEs were first regarded by Barbara McClintock as “normal components of the chromosome responsible for controlling, differentially, the time and type of activity of individual genes” (McClintock, 1956; Chuong, Elde and Feschotte, 2017).

TEs are now recognized as an important source of cis-regulatory elements (Chuong, Elde and Feschotte, 2017; Sundaram and Wysocka, 2020). TEs are predisposed for co-option as cis-regulatory elements because they rely on transcription in order to transpose, and therefore are under selection to acquire strong promoters that are compatible with the eukaryotic transcriptional machinery (Chuong, Elde and Feschotte, 2017). Among EREs, ERVs contain Pol II promoters in each of their two LTRs, and LINEs contain an internal Pol II promoter in their 5ʹ UTRs (Figure 1.5 B). TE-derived cis-regulatory elements can drive lineage-specific innovation – in the Human genome, Primate-specific cis-regulatory elements are enriched for TE-derived sequences compared to cis-regulatory elements of all ages (Jacques, Jeyakani and Bourque, 2013; Trizzino et al., 2017). In Rodents, a comparative analysis between Mouse and Rat trophoblasts found that species-specific enhancers are enriched for ERV-derived sequences, implicating TEs as contributors to the rapid placental divergence observed between many Mammalian species (Chuong et al., 2013). Addressing a longer Mammalian timescale, ChIP-seq binding profiles show that the key embryonic stem cell regulator TFs, OCT4 and NANOG, share only about 5% of their binding sites between Human and Mouse embryonic stem cells (Kunarso et al., 2010). Major drivers in this evolutionary divergence appear to have been differential ERV-derived OCT4 and NANOG binding sites, which contributed up to one quarter of the bound loci in both Human and Mouse. Importantly, these divergent TF occupancy profiles were linked to divergent expression profiles of orthologous genes near binding sites, demonstrating that the repeat-derived binding sites contributed to lineage-specific divergence in the regulation of Human and Mouse embryonic stem cells (Kunarso et al., 2010).

As detailed in Chapter 1.3.3.2, coevolution between C2H2 ZFPs and EREs is also thought to enable the co-option of EREs as cis-regulatory elements (Ecco, Imbeault and Trono, 2017; Imbeault, Helleboid and Trono, 2017).

Proteins encoded by TEs also contain sequence-specific DBDs to recognize their own genomic loci after translation. These DBDs have been co-opted by eukaryotic genomes to give rise to such TF DBD classes as PAX (e.g. PAX6 developmental regulator), CENP-B (centromere function), and BED-ZF (e.g. DREF, differentiation and proliferation regulator) (reviewed in Feschotte, 2008).


1.4.1 Annotation and classification of EREs

EREs are sometimes called RNA transposons because they are unified by transcription-based copy-and-paste transposition mechanisms where an mRNA intermediate is reverse transcribed and inserted at a new locus. In the Human genome, the three major classes of EREs are endogenous retroviruses (ERVs), short interspersed nuclear elements (SINEs), and long interspersed nuclear elements (LINEs) (Figure 1.5). Within each of these classes exist a number of subfamilies. A subfamily of EREs is a group of clonal descendants of a parent progenitor sequence.

The most prevalent annotation tool for TEs is RepeatMasker (Tempel, 2012; Smit, Hubley and Green, 2013), which underlies the UCSC Genome Browser’s repeat annotations and is the predominant repository-based annotation tool for eukaryotic genomes (Goerner-Potvin and Bourque, 2018). RepeatMasker annotates EREs based on either a library of consensus sequences (Repbase, which is proprietary; Jurka 2000), or consensus HMMs (Dfam; Hubley et al. 2016). RepeatMasker annotates LINEs and ERVs based on matches to sub-sequence models contained in Repbase and Dfam, rather than full-length models, which are appropriate because LINEs and ERVs typically occur in the genome as defunct sub-sequences rather than full-length functional copies of the active progenitor from which they descended (described further below).

Other annotation tools identify repetitive elements de novo with partial or complete independence from consensus model repositories based on similarity between observed sequences (Goerner-Potvin and Bourque, 2018). One of the most widely used de novo TE annotation tools is RepeatModeler (Flynn et al., 2020), which integrates outputs from the TE discovery algorithms RepeatScout (Price, Jones and Pevzner, 2005) and RECON (Bao and Eddy, 2002), and has recently been expanded to incorporate further specialized tools for ERV discovery (Flynn et al., 2020). Methods independent from a standardized repository are beneficial for the annotation of genomes with no prior annotation, the discovery of new TE families, and the identification of very old/diverged TE insertions that bear little sequence similarity to their consensus models (Goerner-Potvin and Bourque, 2018). However, the substantial benefit in referencing previously published classification schema and annotations (where available) is that the outputs of new investigations will be directly compatible with prior work and between closely related genomes, and therefore more broadly interpretable. A current problem across TE investigations is incompatibility between numerous classification schema which are not universally consistent or interpretable to non-experts (Wicker et al., 2007; Seberg and Petersen, 2009; Vargiu et al., 2016).

1.4.1.1 ERVs

ERVs comprise about 8% of the Human genome (Figure 1.5 A; Mandal and Kazazian 2008). There are no active (replicating) copies of ERVs in the modern Human genome. ERVs are the vestiges of exogenous RNA viruses that have reverse-transcribed and integrated their genomes into a host’s germline cells and thus achieved vertical transmission (Gifford et al., 2018). ERVs are typically 6-10kb in length and, like other viral genomes, typically consist of an internal region that encodes viral proteins gag, pol and env, plus 0.5-1kb long terminal repeats (LTRs) at the 3ʹ and 5ʹ ends (Figure 1.5 B). LTRs of the same ERV locus very often recombine with one another, excising the internal region (Sverdlov, 1998) – as a result, 85% of ERV-descended loci exist solely as left-behind “solo” LTRs (Belshaw et al., 2007). While they have lost their protein-coding regions, solo LTRs retain cis-regulatory functionality and are thought to be an important source of TE-derived regulatory elements (Thompson, Macfarlan and Lorincz, 2016). Because many ERV loci exist as solo LTRs, RepeatMasker uses separate consensus models for the internal sequences and LTRs.

Recombination between ERV internal protein-coding regions is also common – more than 60% of ERV loci in the Human genome are thought to represent recombinant forms between multiple subfamilies (Vargiu et al., 2016). Furthermore, several classification and annotation schemes beyond RepeatMasker exist for ERVs, and a lack of convergence on an agreed-upon scheme has been called “bewildering” even by ERV experts (Vargiu et al., 2016). An annotation and classification scheme for ERVs that addresses recombination called RetroTector (Vargiu et al., 2016) has been proposed, but its classification scheme is not transparently mappable to the more common RepeatMasker scheme, and its associated annotation software is not amenable to widespread adoption. Before ERV progenitor sequences can be reconstructed systematically across all subfamilies, a universally agreed-upon classification scheme with an associated reproducible annotation method that addresses recombinant forms of ERVs is needed (Gifford et al., 2018). Nonetheless, in the case of a single subfamily (HERV-K/HML-2), a provirus capable of reverse transcription has already been reconstructed, demonstrating that functional progenitor sequence reconstructions are possible for ERVs (Lee and Bieniasz 2007).

1.4.1.2 LINEs

LINEs comprise about 20% of the Human genome (Figure 1.5 A; Mandal and Kazazian 2008). LINEs can be further subdivided into three groups, L1s, L2s, and L3s, where L1s are the youngest and best-documented and represent the largest proportion of the genome (about 17%). Unlike ERVs, a transparent peer-reviewed universal classification scheme for L1s exists (Smit et al. 1995), and is already linked to the most widely used annotation tools (RepeatMasker, Repbase and Dfam). An additional aid in the reconstruction of LINE L1s is that the youngest L1 subfamily, L1HS, is still active in the modern Human genome, providing an example of a functional L1. Furthermore, progenitor sequences have been reconstructed for Primate-distributed (youngest and least degenerate) L1s (Khan, Smit and Boissinot, 2005), providing a clear template for the problem of reconstructing older and more degenerate L1s. A functional LINE L1 is 6-8kb in length and includes two open reading frames encoding proteins called ORF1 protein (ORF1p; 338aa) and ORF2 protein (ORF2p; 1275aa), as well as a 5ʹ UTR (Figure 1.5 B).

LINE L1 subfamilies are continuously related to one another – the progenitor of each new subfamily is a derived member of the previous subfamily (Smit et al. 1995; Khan, Smit, and Boissinot 2005). In recent evolutionary history (~40M years, Simiiformes), only one L1 subfamily has existed at a time in a continuous lineage of subfamilies, suggesting competition between subfamilies, possibly to do with a limiting host resource (Khan, Smit and Boissinot, 2005), and intriguingly, possibly related to coevolution with C2H2 ZFPs. This approximately linear evolution has lent itself to straightforward nomenclature (Smit et al. 1995) – subfamily names typically start with “L1P” or “L1M” indicating approximately Primate or Mammalian distribution, followed by a letter (L1PA-B, L1MA-E) indicating relative age, where “A” is youngest, followed by a number (L1PA1-L1PB4, L1MA1-L1ME5) where “1” is youngest, and sometimes followed by another letter indicating further subdivisions (e.g. L1ME3A-C). The youngest family, L1PA1, is often called L1HS, because it is only found in the Human genome and is still active.


Similar to ERVs, RepeatMasker’s method for annotating L1s is based on a combination of sub-sequence models tailored to the biology of L1s. The 3ʹ-to-5ʹ reverse transcription and integration of an L1 transcript into a new genomic locus typically fails before completion, which is (in addition to purifying selection) another reason that the overwhelming majority of L1 loci in the Human genome are 5ʹ truncated (Cost et al., 2002). The 67 Human L1 subfamilies are defined by ~1kb 3ʹ end consensus HMMs. Because L1 5ʹ ends are far rarer in the genome, they are more difficult to reconstruct for each of the 67 subfamilies and instead are represented by 10 generalized consensus models that map one-to-many to the 67 subfamilies (Smit et al. 1995).

1.4.2 Reconstructing ancestral TEs

The overwhelming majority of TE loci in the Human genome are not capable of transposition, with the exception of some LINE L1, Alu, and SVA subfamilies representing less than 0.05% of the total Human genome sequence (Mills et al., 2007). Functional L1 loci are removed by purifying selection (Boissinot, Entezam and Furano, 2001), and defunct insertions accumulate mutations at the neutral rate. Computational estimation of the functional progenitor can be used to investigate how the sequence functioned, whether any host TFs may have been able to bind it, and ultimately what competitive advantage(s) it may have had over other ERE subfamilies that were active around the same time (Khan, Smit and Boissinot, 2005; Yang et al., 2014).

As described above, the popular TE annotation tool RepeatMasker identifies members of a given subfamily in a genome based on homology to a library of sub-sequence consensus models corresponding (often one-to-many) to TE subfamilies. These can be thought of as distinct from reconstructions of the progenitor sequence, i.e. the replicative parent to a subfamily of EREs. Consensus sequences are sometimes sufficient to approximate progenitor sequences – for example, quantifying the sequence divergence of observed ERE subfamily insertions to their respective consensus models can give a relative estimation of subfamily age, where insertions of older subfamilies should exhibit higher sequence divergence from the corresponding subfamily consensus model assuming equal substitution rates (Blanchette et al., 2004). Khan, Smit, and Boissinot (2005) demonstrated that frequency-based consensus reconstruction methods can be sufficient to approximate progenitor sequences for the youngest TE subfamilies, which have large numbers of detectable insertions and have accumulated relatively few mutations, as exemplified in the case of 22 LINE L1 subfamilies that are Primate-distributed, and reported that the method was limited in its ability to reconstruct older Mammalian-distributed L1s because of the accumulation of mutations in older subfamilies. Similarly, a frequency-based consensus was employed to reconstruct a transposition-competent megabat LINE progenitor sequence, but was only sufficient to reconstruct functional sub-sequences (Yang et al., 2014). In both studies, the reconstructed progenitor L1 sequences were useful for the identification of functional sequence features; however, consensus-based methods were limited in their ability to reconstruct very old LINE L1 subfamilies.
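The two ideas in the preceding paragraph, a frequency-based consensus and divergence from that consensus as a relative age proxy, can be illustrated with a minimal sketch. It assumes the insertions of one subfamily are already aligned to equal length with "-" for gaps; the function names and the toy alignment are hypothetical and are not taken from the cited studies or from RepeatMasker.

```python
from collections import Counter


def majority_consensus(aligned_seqs):
    """Frequency-based consensus: most common non-gap base in each column."""
    consensus = []
    for column in zip(*aligned_seqs):
        counts = Counter(base for base in column if base != "-")
        # A column made only of gaps stays a gap in the consensus.
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


def divergence(seq, consensus):
    """Fraction of mismatches over positions where neither sequence has a gap."""
    pairs = [(s, c) for s, c in zip(seq, consensus) if s != "-" and c != "-"]
    if not pairs:
        return 0.0
    return sum(s != c for s, c in pairs) / len(pairs)


# Toy alignment of four insertions (hypothetical data).
insertions = ["ACGT", "ACGA", "AC-T", "TCGT"]
consensus = majority_consensus(insertions)
mean_divergence = sum(divergence(s, consensus) for s in insertions) / len(insertions)
```

Under the equal-substitution-rate assumption stated above, a subfamily with a higher mean divergence would be interpreted as older. The sketch also shows why the approach degrades for old subfamilies: as mutations accumulate, the majority base in a column is less likely to be the progenitor base.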

Ancestral sequence reconstruction (ASR) is the extrapolation back in time to the common ancestor of a set of related sequences (Joy et al., 2016). Ancestral reconstruction is based on a phylogeny of the related sequences, which is essentially a hypothesis about the relationships between sequences in a multiple sequence alignment (MSA). Methods typically involve estimation of every internal node in the phylogeny, beginning at the terminal nodes (extant sequences) and working back in time to the root of the tree, which represents the common ancestor to all the sequences (i.e. terminal and internal nodes) represented in the phylogeny. ASR methods are based on statistical approaches applied to phylogeny construction, including parsimony, maximum likelihood (ML), and Bayesian inference (Joy et al., 2016). Software tools available for ASR include FastML (Pupko et al., 2000), PAML (Yang, 2007), MEGA (Kumar et al., 2008), and MrBayes (Ronquist et al., 2012).

The principles of ASR can be extended to whole genomes as ancestral genome reconstruction (AGR). Here, whole genomes are aligned, and the genomes of internal nodes are reconstructed. Ancestors 1.1 is a whole genome reconstruction tool that relies on ML-based inference of ancestral sequence states to reconstruct ancestral species’ genomes based on whole genome alignments between the reference genomes of extant Vertebrates (Diallo, Makarenkov and Blanchette, 2010). Reconstructed ancestral genomes can then be annotated by the same methods as newly assembled genomes, including homology methods to annotate TEs. AGR may be especially useful for interpreting the evolutionary history of TEs, because the tendency to accumulate mutations is thought to preclude many loci corresponding to the oldest ERE subfamilies from being detected by homology to consensus models. A pilot study using a 1.1Mb genomic segment of the Boreoeutherian ancestor based on 20 extant Mammalian genomes reconstructed with Ancestors suggests that the method is highly accurate, even reporting 96% accuracy (based on simulations) for the challenging case of reconstruction in repetitive regions (Blanchette et al., 2004). The authors also ran RepeatMasker on the reconstructed ancestral genome and reported improved detection of very old TE families, presumably because their insertions more closely resembled the functional progenitors, which are approximated by the consensus models used by RepeatMasker (Blanchette et al., 2004).

1.4.3 Summary

TE evolution is poorly resolved, primarily hallmarked by a paucity of full-length progenitor sequences to represent the clonal parent of TE subfamilies. Efforts to reconstruct progenitor sequences have been limited by a lack of unified and broadly agreed-upon annotation and classification schema, as well as the accumulation of mutations at old ERE loci that impede the inference of functional progenitor sequences as straightforward frequency-based consensus models. However, classification and annotation of LINEs is sufficiently unified, and ancestral sequence reconstruction (ASR) and ancestral genome reconstruction (AGR) could support the reconstruction of older LINE subfamily progenitor sequences. LINEs are a worthwhile candidate lineage for improved reconstruction because they are known to participate in an evolutionary arms race with Mammalian C2H2 ZFPs, and may have been co-opted as cis-regulatory elements.

1.5 Chapter summary and thesis rationale

TFs are proteins that bind specific DNA sequences and affect transcription, often through PPIs with transcription-related chromatin proteins. C2H2 ZFPs are unique from other Human TF families because their DNA-binding mechanism permits the rapid diversification of binding specificities, and this unique ability is thought to lend C2H2 ZFPs to the task of rapidly coevolving with EREs in an arms race. The C2H2 ZFPs have been largely uncharacterized at the global level, but ChIP-seq with affinity-tagged C2H2 ZFP constructs can be used to uncover the genomic binding profiles of C2H2 ZFPs in parallel. Additionally, with the support of RCADE, DNA-binding motifs can be established from C2H2 ZFP ChIP-seq datasets.

Given the ChIP-seq DNA-binding profiles of C2H2 ZFPs, I have sought to address further aspects of their functional characterization in this thesis. If most C2H2 ZFPs contact DNA with an array of ZFs, what is the capacity of those with only a single ZF to bind specific DNA sequences? If TF paralogs arise via duplication events followed by functional divergence in terms of expression, PPIs, or DNA-binding preference (Grove et al., 2009), and the DNA-binding preferences of the C2H2 ZFPs are diverse, should we expect C2H2 ZFPs to have low diversity in their tissue expression profiles and PPIs? Is the presence of a KRAB domain sufficient to predict that a C2H2 ZFP will recruit its silencing co-factor, TRIM28? Can reconstruction of ERE subfamily progenitor sequences improve resolution of the sequence evolution, and support the interpretation of ChIP-seq/exo data to investigate the evolutionary interaction between KZFPs and EREs?

I can identify several methods to improve the characterization of C2H2 ZFPs and their coevolution with EREs. First, an updated catalogue of Human TFs is needed. This new list of Human TFs can be used to conduct meta-analyses characterizing the C2H2 ZFPs alongside all Human TFs to illustrate their evolutionary histories and investigate the extent to which their expression profiles have diversified beyond embryonic tissues. Furthermore, in vitro binding experiments can be used to identify which Monodactyl ZFPs, if any, are capable of DNA binding and whether that binding is dependent on basic flanking residues (as observed in the Monodactyl ZFP TF Drosophila GAGA). Second, the PPIs of a representative complement of Human DNA-binding C2H2 ZFPs should be queried in parallel in order to establish general functional trends among the C2H2 ZFPs’ cofactors, and to establish the extent to which PPI domain content is predictive of PPI binding partners. Finally, I think that ASR and AGR can be used to improve ERE progenitor sequence reconstruction, and the resultant progenitor sequences can be used to identify KZFP binding sites on ancestral EREs.

In Chapter 2, I describe the curation of the 1,639 Human TFs and downstream analyses to investigate the evolutionary history and adult tissue expression profiles of all TFs. I further explore the DNA-binding capacity of the Human Monodactyl ZFPs, and query whether basic residues flanking the ZF are required for specific DNA-binding in PBM experiments.

In Chapter 3, I analyze AP-MS experimental data for 118 DNA-binding Human C2H2 ZFPs. I specifically investigate the extent to which KRAB-, SCAN- and BTB-containing C2H2 ZFPs participate in the PPIs predicted based on their PPI domains alone, the diversity of PPIs, and the functions of the nuclear proteins they bind.

In Chapter 4, I combine ASR and AGR to reconstruct putative progenitor sequences for all 67 Human LINE L1 subfamilies. I demonstrate that these sequences can be used to support the integration of KZFP ChIP-seq/exo bound genomic loci and motifs, as well as B1H-RC predicted motifs, and the ages of L1 subfamilies and KZFPs to interpret KZFP-ERE coevolution.

Together, my results paint a dynamic picture of C2H2 ZFP function and evolution. Among Human TFs, they are virtually the only DBD class to have expanded within the last 100M years, and they are markedly depleted for tissue-specific expression. C2H2 ZFPs’ functional diversity extends beyond DNA-binding preferences to their PPIs with myriad nuclear factors, establishing their functions as transcriptional regulators and raising new questions as to the functional elements mediating such diverse PPIs with paradoxically few PPI domains. Finally, improved resolution of the Human LINE L1 evolutionary lineage empowers sequence-level interpretation of KZFP-ERE coevolution.


Chapter 2 The function and evolution of Human transcription factors

Part of the work described in this chapter is published in:

Lambert, S. A.*, Jolma, A.*, Campitelli, L. F.*, Das, P. K., Yin, Y., Albu, M., Chen, X., Taipale, J., Hughes, T. R., and Weirauch, M. T. (2018). The Human Transcription Factors. Cell. 172, 650-665. doi:10.1016/j.cell.2018.01.029 which is published under a Creative Commons Attribution License (CC BY 4.0).

*denotes equal contributions

Author Contributions

SAL, AJ, LFC, PKD, YY, XC, JT, TRH, and MTW assessed Human proteins to establish a list of known and likely Human TFs. SAL conducted motif similarity comparisons and ortholog comparisons across the Human TFs. AJ conducted overview analyses of the Human TFs. LFC compared paralogy relationships between Human TFs and analyzed their tissue-specific expression profiles. MTW analyzed relationships between Human TFs and clinically relevant gene variants. MA produced and maintains the online database. SAL, AJ, and LFC wrote the manuscript with significant contribution from TRH and MTW.

______

A manuscript describing part of the work in this chapter is in preparation for publication:

Campitelli, L. F., Pour, S., Yang, A. W. H., Hughes, T. R. H. Monodactyl C2H2 zinc finger proteins. (In prep.)

Author Contributions

LFC and TRH conceived the experimental plan, designed PBM constructs, and interpreted PBM results. SP conducted supporting data analyses integrating ChIP-seq data and PBM data. AWHY ran PBM experiments.


2.1 Introduction

TFs are proteins capable of independently binding a specific DNA sequence. In order to globally analyze the Human TFs, a list of high-confidence Human TFs is needed, along with a catalogue of functional data that has already been established.

The Human TFs have been catalogued previously, but an expert-curated update is warranted. The task of establishing a list of Human TFs is not suited to automation because domain structures do not perfectly predict TFs, literature evidence is variable, and electronic annotations are non-uniform. Prior to this work, the latest Human TF catalogues were published in 2009 (Fulton et al., 2009; Vaquerizas et al., 2009). Fulton et al. identified putative Human and Mouse TFs based on evidence of DNA activity, which included both DNA-binding activity and the regulation of transcription – 535 Human TFs were identified. Vaquerizas et al. used automated annotation to identify proteins that contain putative DBDs and appended it with Gene Ontology (Ashburner et al., 2000) and TRANSFAC (Matys et al., 2006) to produce a final list of 1,391 Human TFs. Recent substantial advancements in data collection, including hundreds of motifs generated in vitro (Badis et al., 2008; Wei et al., 2010; Jolma et al., 2013, 2015; Weirauch et al., 2013, 2014; Yin et al., 2017), as well as updates to gene annotations, render previous Human TF lists incomplete and create a need for a new Human TF database and meta-analysis.

A new list of known and likely Human TFs can be compiled based on the most recent (1) experimental data and (2) DBD annotations. However, 13 Human C2H2 ZFPs contain only a single ZF domain and no other DBDs (Monodactyl ZFPs), in contrast with virtually all characterized C2H2 ZFPs, which leverage tandem arrays of ZF domains for DNA binding. This raises a question as to whether Monodactyl ZFPs without DNA-binding evidence should be considered likely TFs. The Drosophila Monodactyl ZFP GAGA is well-known to bind DNA specifically, and this ability is attributed to basic residues in an unstructured region near its ZF that have been observed to contact DNA in a crystal structure (Pedone et al., 1996; Omichinski et al., 1997). To my knowledge, the only Human Monodactyl ZFP TF is ZNF750, although its only DNA-binding motif comes from ChIP-seq, and it is known to interact with the bona fide TF KLF4 (Sen et al. 2012; Boxer et al. 2014), so in vitro DNA-binding evidence is needed to confirm that ZNF750 can bind DNA independently.

In this Chapter, I aimed to characterize the evolution and function of Human TFs. First, with co-authors, I defined a TF as a protein that binds a specific DNA sequence and systematically assessed nearly 3,000 proteins to produce a list of 1,639 known and likely TFs. Given this list, I explored the Human TFs’ evolutionary history of duplication events and expression profiles across Human tissues. I demonstrated that the KZFPs have a distinct evolutionary history from all other TFs, as they are virtually the only Human TFs to have arisen by duplication since the origin of amniotes. Furthermore, C2H2 ZFPs, especially KZFPs, are also distinct from all other Human TFs in that they are less likely to display tissue-specific expression in adult tissues. To address the question of whether Monodactyl ZFPs function as TFs, I designed PBM experiments to test the ZF domains of Monodactyl ZFPs for DNA-binding specificity with and without mutation of basic residues flanking the ZF domain. I found that (1) 8/11 Human Monodactyl ZFPs tested show no DNA-binding specificity in PBM experiments, (2) Drosophila GAGA may not be dependent on basic flanks to bind its preferred sequence, (3) ZNF750’s in vitro-derived motif is distinct from its ChIP-seq motif, suggesting that its in vivo binding preferences may not be primarily determined by its DNA-binding specificity, and (4) the paralogs ZNF608 and ZNF609 appear to be Human Monodactyl ZFP TFs that are not dependent on basic flanks for DNA-binding.


2.2 Methods

2.2.1 Human TF list curation

My co-authors and I examined 2,765 proteins compiled by combining putative TF lists from several sources: the most recent previous publications attempting to compile such a list (Fulton et al., 2009; Vaquerizas et al., 2009), domain searches (using HMMs and parameters from CisBP (Weirauch et al., 2014) and Interpro, as well as the TRANSFAC-related database TFClass (Wingender et al., 2015)), Gene Ontology, and crystal and NMR structures of proteins in complex with DNA taken from the PDB (Berman et al., 2002). Where available, existing motifs were matched to the putative TFs from motif collections including TRANSFAC (Matys et al., 2006), JASPAR (Mathelier et al., 2016), HT-SELEX (Jolma et al. 2013; Jolma et al. 2015; Yin et al. 2017), UniPROBE (Hume et al., 2015), and CisBP (Weirauch et al., 2014). A web page for each protein containing all relevant information and links to external databases was created by M. Albu. The approach used to establish the curated list of known and likely Human TFs is depicted in Figure 2.1.

Each of the 2,765 putative TFs was then randomly assigned to two of the curators (myself, S. A. Lambert, A. Jolma, P. K. Das, Y. Yin, X. Chen, J. Taipale, T. R. Hughes, and M. T. Weirauch) to classify the protein’s status as a TF (“TF with a known motif,” “TF with a motif inferred from a close homolog,” “likely TF” [due to presence of a DBD or literature information], “ssDNA/RNA binding protein,” or “unlikely TF”), and its DNA binding mode (binds as a monomer or homomultimer, binds as an obligate heteromer, binds with low specificity, or does not bind DNA). Curators could also submit notes and citations supporting their assessments. Using data from CisBP and other sources, we recorded whether motifs are known for each TF (or a close homolog) along with the availability of a protein/DNA structure. Global sequence alignments and known DNA-binding residues were also considered to make decisions for poorly characterized proteins within families where only a subset binds DNA (e.g., ARID, HMG, and Myb/SANT). To make the task feasible, we did not explore or record complexities such as protein modifications or binding partners. I conducted such assessments for 669/2,765 putative TFs.

Three senior co-authors (T. R. Hughes, M. T. Weirauch, and J. Taipale) resolved cases of disagreement between reviewers and manually reviewed all cases where both curators agreed that a protein without a canonical DBD is a likely TF. The “HumanTFs” website (http://Humantfs.ccbr.utoronto.ca/) displays the results, with a separate page for each TF, along with all known motifs and information and sequence alignments for each DBD type. The site also has an option for users to submit additional information.

Figure 2.1 Overview of the strategy for identifying the Human TF repertoire

Potential TF lists were identified using TFCat, TF Census, TFClass, CisBP, Gene Ontology and Protein Data Bank (see figure for citations). A total of 2,765 potential TFs were assessed for likelihood of being a TF and final annotations as ‘TF,’ ‘likely TF,’ or ‘unlikely TF’ were assigned by expert judges. 1,639 known and likely Human TFs were identified.

From Lambert, Jolma, Campitelli et al. (2018) © 2018 by Elsevier Inc.

2.2.2 Human TF paralog evolution

I extracted a list of all paralog relationships between Human genes from EnsemblCompara (Herrero et al., 2016). Using the list of Human TFs determined in 2.2.1, I labelled TFs in the paralogs list, along with their DBDs. TF-TF paralog pairs typically had the same DBD class annotation in our dataset. The only exceptions were paralog pairs between KRAB and non-KRAB C2H2 ZFPs, which were rare and were randomly assigned to one label or the other. The Ensembl data also reported the oldest clade in which both paralogs occur, which I converted to a time in millions of years ago (MYA) and interpreted as a time since the divergence of any given pair of paralogs (analogous to time since duplication event). I quantified the number of TF-TF paralog pairs that diverged in each of Ensembl’s clades, and reported both the number of paralog pairs for each DBD class that diverged about the common ancestor of each clade, and the fraction of Human paralog pairs that were TF-TF paralog pairs that diverged about the common ancestor of each clade.

2.2.3 TF expression profiles in Human tissues

I queried the list of Human TFs against the Human Tissue Atlas database (Uhlen et al., 2015). The database reported expression profiles for 37 Human tissues as RNA-seq transcripts per million (TPM). Following Uhlen et al.’s definition that a protein is ‘expressed’ in a given tissue if it achieves a TPM value of at least 1, I removed all TFs that did not achieve a TPM of at least 1 in at least 1 of the 37 available tissues. This resulted in a table of 37 tissues by 1,554 Human TFs. I then normalized the TPM data by row sum and column sum. Rows (tissues) were reordered based on hierarchical agglomerative clustering by Pearson Correlation on the normalized RNA-seq data. Columns (TFs) were clustered similarly, and clusters were then reordered manually for ease of interpretation. Some TFs had generally higher expression levels than others in the raw TPM data, which is not observable after normalization. For each TF, I therefore also reported the mean raw TPM quantity across the 37 tissues.
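The filtering and normalization steps can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the order of normalization (rows first, then columns) is an assumption, since the text states only that the data were normalized by row sum and column sum. The clustering step is not shown.

```python
def filter_expressed(tpm_by_tf, threshold=1.0):
    """Keep TFs that reach the TPM threshold in at least one tissue."""
    return {tf: values for tf, values in tpm_by_tf.items()
            if max(values) >= threshold}


def normalize_rows_then_columns(matrix):
    """Divide each row by its sum, then each column by its sum.

    Rows are tissues and columns are TFs, as in the table described above.
    The row-then-column order is an assumption made for illustration.
    """
    row_normalized = []
    for row in matrix:
        total = sum(row)
        row_normalized.append([x / total if total else 0.0 for x in row])
    column_sums = [sum(column) for column in zip(*row_normalized)]
    return [[x / s if s else 0.0 for x, s in zip(row, column_sums)]
            for row in row_normalized]


# Toy data: two TFs measured in two tissues (hypothetical values).
expressed = filter_expressed({"TF1": [0.2, 0.9], "TF2": [0.0, 1.0]})
normalized = normalize_rows_then_columns([[1.0, 3.0], [2.0, 2.0]])
```

After this kind of normalization, absolute expression levels are no longer visible, which is why the mean raw TPM per TF is reported separately.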


I investigated whether certain DBD classes tend to be more broadly expressed than other TFs (depleted for tissue-specific expression) or restricted to a narrower complement of tissues (enriched for tissue-specific expression). Uhlen et al. provide a framework for labelling a degree of tissue specificity for a given protein based on the distribution of its TPM quantities across the 37 tissues:

- tissue enriched: the protein has at least fivefold higher expression (TPM) in one tissue than all other tissues

- tissue enhanced: the protein has fivefold higher average TPM in one or more tissues compared to the mean TPM of all other tissues

- group enriched: the protein has a fivefold higher average TPM in a group of two to seven tissues compared to the mean TPM of all other tissues

To enable straightforward comparison across all 1,554 TFs and their DBD classes, I consolidated the above definitions and defined any TF fitting any of the above descriptions as tissue specific. I then created a 2-by-12 contingency table for the counts of TFs falling into each of the 24 possible categories: tissue specific (True or False), and DBD class (defined by the 11 largest DBD classes, plus all others in a single additional category). I used Fisher’s Exact Test to test each of the DBD classes for significant enrichment or depletion in the tissue specific category and corrected all p-values for multiple testing using Bonferroni correction.
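The consolidated tissue-specificity call and the per-class test can be sketched as below. Several choices here are assumptions made for illustration rather than details from the analysis: the three categories are collapsed into one rule (the mean TPM of the top k tissues, with k from 1 to 7, is at least fivefold the mean of the remaining tissues), each DBD class is compared against all other TFs pooled in a 2-by-2 table, and the two-sided Fisher's Exact Test is written out with the standard library so that the example is self-contained. All function names are hypothetical.

```python
from math import comb


def is_tissue_specific(tpm, fold=5.0, max_group=7):
    """Assumed consolidated rule: top-k mean >= fold x mean of the rest, k = 1..max_group."""
    values = sorted(tpm, reverse=True)
    n = len(values)
    for k in range(1, min(max_group, n - 1) + 1):
        top_mean = sum(values[:k]) / k
        rest_mean = sum(values[k:]) / (n - k)
        if top_mean > 0 and top_mean >= fold * rest_mean:
            return True
    return False


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, total)


def bonferroni_fisher_by_class(counts):
    """counts maps DBD class -> (n tissue specific, n not tissue specific)."""
    total_specific = sum(s for s, _ in counts.values())
    total_other = sum(o for _, o in counts.values())
    n_tests = len(counts)
    adjusted = {}
    for dbd_class, (s, o) in counts.items():
        p = fisher_exact_two_sided(s, o, total_specific - s, total_other - o)
        adjusted[dbd_class] = min(1.0, p * n_tests)  # Bonferroni correction
    return adjusted


# Toy counts for two hypothetical classes.
adjusted_p = bonferroni_fisher_by_class({"classA": (1, 9), "classB": (11, 3)})
```

The test itself does not indicate direction; whether a class is enriched or depleted for tissue-specific TFs would be read from the direction of the difference in proportions.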

2.2.4 PBM construct design for Monodactyl ZFPs

Monodactyl ZFPs were identified using the SMART database (Letunic et al. 2015). SMART contained 742 Human C2H2 ZFPs, and 51/742 contained only 1 ZF domain. 40/51 proteins contained other DBDs that would have typically rendered them “likely TFs” by our criteria (described in 2.2.1), resulting in 11/51 Monodactyl ZFPs that contained no other DBDs. These remaining 11 proteins became candidates to explore the DNA-binding potential of Monodactyl ZFPs, and their DBDs with flanking regions were tested in PBM experiments. For 8/11 of the Monodactyl ZFPs tested, I designed constructs in which the basic residues flanking the ZF domain were perturbed. Rather than deleting these sequences, I replaced each basic amino acid in the regions flanking the DBD in the PBM constructs with an amino acid of similar size but with an uncharged side chain: Arginine with Glutamine, Lysine with Asparagine, and Histidine with Phenylalanine (Figure 2.2 B). I included Drosophila GAGA in this experiment, and replaced its BR1 and BR2 basic residues by the same method (Figure 2.2 A). Constructs encoding the DBDs and flanking residues of the Monodactyl ZFPs were created using gene synthesis and inserted into a modified T7-driven expression vector (pTH6838) that expresses N-terminal GST fusion proteins (BioBasic).
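The substitution scheme can be sketched as a simple sequence transformation. The function name, the coordinate convention, and the toy sequence below are hypothetical and for illustration only; the real constructs were designed per protein from the putative basic flanks shown in Figure 2.2.

```python
# Basic residue -> neutral replacement, as described in the text.
NEUTRAL_SUBSTITUTION = {"R": "Q", "K": "N", "H": "F"}


def neutralize_flanks(sequence, domain_start, domain_end):
    """Replace R/K/H outside the half-open interval [domain_start, domain_end).

    Residues inside the ZF domain are left untouched, so the zinc-coordinating
    histidines of the C2H2 fold are preserved.
    """
    return "".join(
        aa if domain_start <= i < domain_end else NEUTRAL_SUBSTITUTION.get(aa, aa)
        for i, aa in enumerate(sequence)
    )


# Toy construct: flank "RKAH", domain "CPYCH", flank "KRA" (not a real protein).
wild_type = "RKAHCPYCHKRA"
mutant = neutralize_flanks(wild_type, domain_start=4, domain_end=9)
```

Substituting rather than deleting keeps the construct length and the spacing around the ZF domain unchanged, which is the stated motivation for this design.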

2.2.5 PBM experiments

PBM laboratory methods were performed by A. Yang as described previously (Lam et al., 2011; Weirauch et al., 2013). Each DBD-encoding plasmid was analyzed in duplicate on two different arrays with differing probe sequences. 8-mer Z- and E-scores were calculated as previously described (Berger et al., 2006). Experiments were deemed successful if at least one 8-mer had an E-score > 0.45 on both arrays and the complementary arrays produced highly correlated E- and Z-scores.

2.3 Results and Discussion

2.3.1 The Human TFs

My collaborators and I identified 1,639 Human TFs in total. Roughly three-quarters (1,211) of the Human TFs currently have a binding motif (1,107 “known,” i.e., measured experimentally, and a further 104 inferred from a closely-related homolog) (Weirauch et al., 2014). 913 of the known motifs were obtained from high-throughput in vitro assays such as HT-SELEX or PBM and hence provide a profile of their intrinsic relative preferences to many DNA sequences. Figure 2.3 illustrates that most DBD classes of TFs have high or complete motif coverage, while a handful have major gaps. Almost all Homeodomains (188/196), for example, have a known or inferred motif, likely due to their relative ease of study in vitro and their deep conservation, enabling inference by homology. The C2H2 ZFP class, in contrast, currently lacks hundreds of motifs (267/747) (Figure 2.3, inset), possibly because they are difficult to study in vitro (many are large proteins) and relatively few are well conserved (Stubbs et al. 2011). By proportion, the AT-hook proteins, THAP finger, BED-ZF, and those with no known DBD are also poorly characterized.

Of the 1,639 identified TFs, 69 had no canonical DBD. 19/69 had an experimentally determined motif. 6/19 of those motifs were determined by an in vitro method, while the remaining 13 were determined by a method such as ChIP-seq (and therefore should be interpreted with caution as they may represent indirect DNA binding). The remaining 50/69 proteins lacking both a canonical DBD and a motif were labelled as TFs due to some evidence for direct sequence-specific DNA binding, warranting further investigation. An example of a protein lacking both a canonical DBD and a motif that I identified is PREB, which was found to bind specific promoter sequences in gel-shift experiments (Fliss et al. 1999) and induced expression in a reporter assay where the reporter locus was regulated by the promoter that PREB controls (Murao et al., 2009). More thorough investigation of PREB and other TF “edge cases” will improve our definition of the minimally sufficient structural features required to preferentially bind specific DNA sequences, and may point to other proteins in Humans and other species that could function as TFs without canonical DBDs.

2.3.2 Paralog evolution of Human TFs

The divergence times between Human TF-TF paralogs display a bimodal distribution: a first wave of duplications across diverse TF families occurred at the base of Bilateria, and a second wave of duplications, dominated by KRAB C2H2-ZFs, began in Amniota (Figure 2.4, left). The earlier wave, with duplications across diverse TF families, is consistent with the postulation that two rounds of whole-genome duplication occurred at or near the base of Vertebrates (Dehal and Boore, 2005). This event is roughly coincident with the expansion of cell-type diversity, possibly facilitated by duplicated TFs available to regulate novel cell types (Nitta et al., 2015; Arendt et al., 2016). The expansive KRAB radiation may be partly explained by the increased opportunity for retroviral transmission facilitated by the placenta (Hayward et al. 2015). Remarkably, TF-TF duplications during the KRAB radiation era dominate the distribution of all Human paralog pairs arising over the last 300 million years (Figure 2.4, right).

2.3.3 Expression profiles of the Human TFs

I examined expression patterns for 1,554 TFs detected in 37 adult tissues using RNA-seq data from the Human Tissue Atlas (Figure 2.5A) (Uhlen et al., 2015). This global view of patterns captures known roles for many well-characterized TFs. For example, , OLIG1, and POU3F2 (OCT7) are expressed almost exclusively in the cerebral cortex, and GATA4 and TBX20 are highly expressed only in cardiac muscle. Roughly one-third (543) of the Human TFs in this dataset displayed tissue-specific expression, including many with poorly characterized physiological roles.

Figure 2.2 Basic residues of Drosophila GAGA and their neutral replacement residues

A. Left: The crystal structure for Drosophila GAGA’s single ZF binding DNA, including its basic flanks labelled in blue and yellow (Omichinski et al. 1997). Right: cartoon showing GAGA’s sequence and the residues of the basic flanks. Blue and yellow indicate basic flanks and correspond to the figure at left. Purple text shows the residues replacing the basic residues in the GAGA-MUT construct.

B. Each of the basic amino acids and its non-basic replacement residue in putative basic flanks identified in each of the MUT constructs: Arginine (R) -> Glutamine (Q); Lysine (K) -> Asparagine (N); Histidine (H) -> Phenylalanine (F).

Figure 2.3 Number of TFs and motif status for each DBD family

Inset displays the distribution of the number of C2H2-ZF domains for classes of effector domains (KRAB, SCAN, or BTB domains); “Classic” indicates the related and highly conserved SP, KLF, EGR, GLI, GLIS, ZIC, and WT proteins. From Lambert, Jolma, Campitelli et al. (2018) © 2018 by Elsevier Inc.

Figure 2.4 Paralog divergence times of the Human TFs

Left: Number of TF-TF paralog pairs that diverged in each Human ancestral clade based on EnsemblCompara data (Herrero et al. 2016). Right: the proportion of all Human paralogs that diverged in each clade that are TF-TF paralogs. From Lambert, Jolma, Campitelli et al. (2018) © 2018 by Elsevier Inc.

Comparing between DBD classes, a striking trend emerges, mimicking the evolutionary analysis above. C2H2 ZFPs are markedly depleted for tissue specificity: only 19% are tissue specific, versus 49% for other types of TFs (p < 10^-13, Bonferroni-corrected Fisher’s Exact Test) (Figure 2.5A right, Figure 2.5B). Only 12% (41/339) of KRAB-containing C2H2 ZFPs are tissue specific, possibly due to their role in the repression of transposable elements, which may be beneficial broadly across cell types. The majority of these tissue-specific KRAB C2H2 ZFPs are testes-specific (26/41), consistent with a role for KRAB C2H2 ZFPs in ERE silencing during gametogenesis (Ecco et al. 2017) (Figure 2.5B). Homeodomain TFs, in contrast, are highly enriched for tissue-specific expression (133/162, 82%, p < 10^-13) and are also the only group overrepresented in the list of TFs that are not detected in the Human Tissue Atlas dataset (34/84; p < 10^-7), presumably reflecting well-established roles in early embryonic cell fate specification and/or roles in the maintenance and differentiation of specialized cell types (Bürglin, 2011; Dunwell and Holland, 2016). Excluding C2H2 ZFPs, half of the remaining TFs (49%) are tissue specific, providing a clue as to their specific physiological functions. Higher-resolution data (e.g., from single-cell RNA-seq, which can resolve the different cell types that comprise tissues) will almost certainly lead to a more refined view of the associations between TFs, cell identity, and the genes regulated by the TFs.

2.3.4 Sequence-specific DNA binding by Monodactyl ZFPs

C2H2 ZFPs with only a single ZF domain presented another important edge case in the prediction of likely TFs. Monodactyl ZFPs were classed as ‘likely TFs’ in the publication described above due to the presence of the ZF DBD. However, none of the Human Monodactyl ZFPs have a motif, with the exception of ZNF750, which has a ChIP-seq motif. I designed PBM experiments to test whether the ZF domains of 11/13 Monodactyl ZFPs were capable of sequence-specific DNA binding, and to explore whether basic amino acid residues near the ZF domains were required for this DNA binding. Drosophila GAGA was also included in the experiment because it is a well-known Monodactyl ZFP TF, and has been reported to rely on basic residues near the ZF domain to bind DNA.

Interestingly, perturbation of Drosophila GAGA’s BR1 and BR2 regions by replacement with neutral amino acids was insufficient to eliminate binding to its canonical preferred sequence, GAGAGA (Pedone et al., 1996; Nitta et al., 2015), although it did yield lower PBM scores (Figure 2.6 A). The studies that originally asserted that the basic residues in the BR1 and BR2 regions were required for DNA binding were based on (1) gel-shift assays of the GAGA ZF domain and flanking regions with deletion of the BR1 and BR2 regions (Pedone et al., 1996), and (2) crystal structures indicating contacts between BR1 and BR2 and the minor and major grooves, respectively (Omichinski et al., 1997). One interpretation is that, although basic residues in the BR1 and BR2 regions have been observed to contact DNA, and flanking sequences may support DNA binding in gel-shift assays, basic amino acids in these positions are not required for recognition of GAGA’s preferred sequence in vitro.

Figure 2.5 Tissue expression profiles of the Human TFs

Expression profiles for all Human TFs detected in Human Tissue Atlas’s RNA-seq data for 37 adult Human tissues (Uhlen et al. 2015). Panel B is labelled by DBD class and tissue specificity.

From Lambert, Jolma, Campitelli et al. (2018) © 2018 by Elsevier Inc.

Of the Monodactyl ZFPs’ ZF domains tested in PBM experiments, 3/11 displayed significant sequence-specific DNA binding: ZNF608, ZNF609, and ZNF750. The paralogs ZNF608 and ZNF609 (Herrero et al., 2016) displayed similar results to GAGA: PBM scores fell when basic residues flanking the ZF domain were perturbed, but significant PBM results did not depend on the presence of these residues (Figure 2.6 A). Similar A-rich motifs were discovered for both ZNF608 and ZNF609, with and without basic residues perturbed (Figure 2.6 B).

The only ZF domain that lost DNA-binding specificity upon basic flank perturbation originated from ZNF750. This result should be interpreted with caution because the absolute E- and Z-scores in the PBM experiments with and without basic flank perturbation fell only marginally below and above the significance thresholds, respectively (Figure 2.6 A). ZNF750 bound an A-rich motif, similar to those of ZNF608 and ZNF609 and in stark contrast with its previously published GC-rich ChIP-seq motif (Boxer et al., 2014) (Figure 2.6 B). A previous study also found that ZNF750 binding sites are highly correlated with those of KLF4, which also binds a GC-rich motif (Sen et al., 2012). Taken together, these results suggest that the Monodactyl ZFP ZNF750 has some DNA-binding capacity, but that its recruitment to the loci reported by Boxer et al. may result from interaction with KLF4.

Of the 11 Human Monodactyl ZFPs tested, only three appeared to bind specific DNA sequences. In all three cases, perturbation of basic residues near the ZF domain lowered PBM scores, but only resulted in the loss of significant binding in the case of ZNF750, which achieved marginal PBM scores. Intriguingly, the ZF domains of Monodactyl ZFP paralogs ZNF608 and ZNF609 appear sufficient to confer sequence-specific binding, and this function is not dependent on basic residues flanking the ZF domain. The sequence features underlying the ability of the lone ZF domains of ZNF608 and ZNF609 to confer sequence-specific binding remain to be uncovered.

Figure 2.6 PBM results for single-ZFPs with and without ZF-adjacent basic residues perturbed

A. PBM results for each of the WT and MUT constructs. Asterisks indicate constructs that showed significant DNA-binding activity (Z-score > 6 and E-Score > 0.45 on both ME and HK arrays). B. Motifs obtained from PBMs for each construct with significant DNA-binding activity.

2.4 Summary

Determination of a TF’s binding preferences is the first step in understanding its function. Toward the goal of identifying and establishing a motif for every Human TF, I contributed to the manual curation of a list of 1,639 known and likely Human TFs. I then explored the paralog relationships between Human TFs and tissue expression profiles to compare their evolution and functions. By investigating the time since divergence of paralog pairs, I captured a striking distinction between KZFPs and all other TFs: KZFPs (and some non-KRAB C2H2 ZFPs) are virtually the only Human TFs to have arisen by duplication since the time of the presumed whole genome duplications in the Bilaterian ancestor.

Furthermore, I found that C2H2 ZFPs are distinct from all other TFs in that they are depleted for tissue-specific expression in adult tissues, and this observation is especially marked for KZFPs. I designed PBM experiments to test the Human Monodactyl ZFPs and Drosophila GAGA for sequence-specific DNA binding, and to test whether basic residues near the ZF domains are required for this function. 8/11 of the Human Monodactyl ZFPs displayed no sequence-specific binding. Surprisingly, Drosophila GAGA may not depend on basic residues flanking the ZF domain for sequence-specific binding, as previously thought. Similarly, the Monodactyl ZF paralogs ZNF608 and ZNF609 demonstrated sequence-specific binding that was not dependent on basic residues flanking the ZF domain. ZNF750 demonstrated marginal sequence-specific DNA binding, and the resultant motif called into question whether its prior ChIP-seq motif might be the result of indirect DNA binding.


Chapter 3 Protein-protein interactions of Human C2H2 zinc finger proteins

The work described in this chapter is published in:

Schmitges, F. W.*, Radovani, E.*, Najafabadi, H. S.*, Barazandeh, M.*, Campitelli, L. F.*, Yin, Y., Jolma, A., Zhong, G., Guo, H., Kangalingam, T., Dai, W. F., Taipale, J., Emili, A., Greenblatt, J. F., and Hughes, T.R. (2016). Multiparameter functional diversity of Human C2H2 zinc finger proteins. Genome Research. Dec; 26(12):1742-1752. doi: 10.1101/gr.209643.116 which is published under a Creative Commons Attribution License (CC BY 4.0).

*denotes equal contributions

Author Contributions

FWS, ER, HSN, MB, and LFC wrote the manuscript, with significant contribution from TRH. FWS conducted ChIP-seq experiments. ER conducted AP-MS experiments and a literature review to functionally annotate AP-MS preys. JFG and AE supervised the AP-MS experiments. HSN analyzed ChIP-seq data and compared measures of diversity between ChIP-seq and AP-MS data. MB investigated functional divergence between paralogs. LFC conducted statistical analyses of AP-MS data and downstream functional characterization of C2H2 ZFPs based on these data. YY and AJ provided data for external validation. GZ, HG, TK, and WFD supported experiments, including experimental method optimization and cloning C2H2 ZFP constructs for ChIP-seq and AP-MS experiments.


3.1 Introduction

Diversification of TF functions can be driven by alteration of DNA sequence specificity, PPIs, and the expression pattern of the TF-encoding gene. All three parameters contribute to divergence within the Caenorhabditis elegans bHLH TF family (Grove et al., 2009), but it is largely unknown whether this is the case in other TF families, and to what extent. With ~750 members, the C2H2 ZFPs are the largest class of Human TFs, yet these proteins have been overlooked in functional investigations (Letunic et al. 2015; Lambert et al. 2018). The extremely diverse DNA binding preferences of C2H2 ZFPs, explained by diversifying selection on the specificity residues and domain content of their ZF DBD arrays, set them apart from virtually all other TF classes.

Far less is known about the PPIs of the C2H2 ZFPs. Some well-studied C2H2 ZFPs have been found to interact with transcription-related complexes. For example, YY1 interacts with the INO80 chromatin remodeling complex (Cai et al., 2007), FOG1 interacts with GATA1 and TACC3 (Fox et al., 1999), and SP1 interacts with the histone chaperone TAF-I (Suzuki et al., 2003). However, these findings come from low throughput investigations, limiting our ability to make direct comparisons between the PPI profiles of C2H2 ZFPs to establish global proteomic trends.

About two thirds of C2H2 ZFPs additionally contain one of three PPI domains at their N-termini: 52 (7%) contain a SCAN domain, 50 (7%) contain a BTB domain, and 348 (48%) contain a KRAB domain (Letunic et al. 2015). The SCAN (Nam, Honer and Schumacher, 2004) and BTB (Ahmad, Engel and Privé, 1998; Chauhan et al., 2013) domains are rigidly structured dimerization domains. Both SCAN and BTB domains are known to be selective in their dimerization preferences, but the residues predicting dimerization specificity at the interaction interface are unknown (Liang, Huimei Hong, et al., 2012; Piepoli et al., 2020).

The KRAB domain is largely unstructured and associated with transcriptional silencing by virtue of its recruitment of TRIM28/KAP1, which assembles a repressor complex leading to the formation of heterochromatin via H3K9me3 (Ecco et al. 2017). The rapid lineage-specific diversification of C2H2 ZFPs in Metazoans is linked to the silencing activity of the KRAB domain: KZFPs are thought to coevolve with TEs in an arms race, imposing diversifying selection on their DNA-binding preferences to be able to recognize new TE sequences and prevent their replication via transcriptional silencing (Ecco et al. 2017). Several KZFPs have been recognized to both bind TE loci and function as transcriptional silencers, for example ZFP809 (Wolf and Goff, 2009), ZNF91 and ZNF93 (Jacobs et al., 2014).

Prior to the work in this chapter, no large-scale investigations had attempted to characterize the PPIs of C2H2 ZFPs. Here, I aimed to characterize the PPIs of a representative subset of 118 Human DNA-binding C2H2 ZFPs, with specific interest in the global trends that define this TF class. The AP-MS experiments in this chapter were conducted by expression of each bait C2H2 ZFP fused to an affinity tag, immunoprecipitation of the tagged bait, and identification of the bound prey proteins by tandem mass spectrometry. I acquired the outputs of this method – raw spectral counts – and scored them using the statistical analysis method SAINTexpress (Teo et al., 2014). I further quantified trends in the function of C2H2 ZFP binding partners, with special consideration for the PPIs expected based on effector domains.

In this Chapter I demonstrate that (1) C2H2 ZFPs interact with extremely diverse complements of binding partners, (2) C2H2 ZFPs tend to interact with expected binding partners based on their effector domains, but additional diverse interactions are at least as prevalent, and (3) C2H2 ZFPs tend to interact with transcription-related nuclear factors. These findings bolster the claim that most C2H2 ZFPs function as transcriptional regulatory proteins, and raise questions as to the mechanisms that enable such diverse PPIs with paradoxically few effector domains.


3.2 Methods

3.2.1 Molecular comparison between investigated C2H2 ZFPs and all Human C2H2 ZFPs

The C2H2 ZFPs analyzed in this Chapter were all confirmed TFs based on the results of ChIP-seq experiments conducted by F. W. Schmitges, supported by T. Kangalingam, and analyzed by H. S. Najafabadi. I define them as TFs because their ChIP-seq results demonstrate that they preferentially bind specific DNA sequences, i.e. a DNA-binding motif can be derived for each of them, satisfying the definition of a TF established in Chapter 2 (Lambert et al., 2018).

I conducted further analyses comparing the molecular features of the 118 C2H2 ZFPs to the full complement of 755 Human C2H2 ZFPs from SMART in order to identify biases in my dataset that could impede generalization of my findings to all Human C2H2 ZFPs. For each metric listed in Table 3.1, I used a Student’s T-test to test for significant differences between the 118 C2H2 ZFPs and all 755 Human C2H2 ZFPs.

3.2.2 AP-MS experimental methods

Cloning of C2H2 ZFPs was performed by G. Zhong, P. Young, W. F. Dai and E. Radovani. HEK293 cells expressing GFP tagged C2H2-ZF proteins were generated as previously described (Najafabadi et al., 2015). Sequence-verified clones from the ORFeome, Harvard Plasmid collection and synthesized constructs were used to make the expression vectors.

AP-MS experiments were optimized and executed by E. Radovani, G. Zhong and H. Tang. ~20 million cells were grown in two batches representing two biological replicates and harvested 24 hours following induction of protein expression with doxycycline. WCE (Whole Cell Extract) was prepared as previously described (Marcon et al., 2014). GFP-tagged C2H2-ZF proteins were immunoprecipitated with anti-GFP antibody (G10362, Life Technologies) overnight followed by a 2 hour incubation with Protein G Dynabeads (Invitrogen). Following three washes with buffer (10 mM TRIS-HCl, pH 7.9, 420 mM NaCl, 0.1% NP-40) and two washes with no-detergent buffer (10 mM TRIS-HCl, pH 7.9, 420 mM NaCl), immunoprecipitated proteins were eluted with ammonium hydroxide and then lyophilized. Proteins were prepared for MS by in-solution trypsin digestion. Briefly, the protein pellet was resuspended in 44 µl of 50 mM ammonium bicarbonate, reduced with 100 mM TCEP-HCl, alkylated with 500 mM iodoacetamide, and then digested with 1 µg of trypsin overnight at 37°C. Samples were desalted using ZipTip Pipette tips (EMD Millipore) through standard procedures, then analyzed with an LTQ-Orbitrap Velos mass spectrometer (ThermoFisher Scientific). Raw MS data was submitted to X! Tandem by H. Guo as previously described (Marcon et al., 2014), obtaining spectral counts for individual Human proteins for each experiment.

Table 3.1 Molecular features of the 118 C2H2 ZFPs investigated compared to all Human C2H2 ZFPs

Comparison of molecular features of the 118 C2H2 ZFPs investigated in this Chapter to the full complement of 755 Human C2H2 ZFPs listed in SMART (Letunic et al. 2015).

3.2.3 Statistical analysis of AP-MS data

I obtained confidence scores for each putative PPI using SAINTexpress (Teo et al., 2014) with our two biological replicates. As negative samples for SAINT analysis, I included both internal controls (GFP-only) and equivalent CRAPome (version 1.1) negative controls (Mellacheruvu et al., 2013). I chose a SAINT confidence score (AvgP) cutoff of 1 as our operating threshold for inclusion in data analyses, for the following reasons: (1) this cutoff resulted in a false positive rate of 0 (assessed using 33 CRAPome cytosolic promiscuous proteins) and a true positive rate of 0.6 (assessed using 18 literature-curated positive controls); (2) lower thresholds introduced false positives; (3) sparse data (such as AP-MS) is sensitive to false positives, such that a modest false-positive rate can result in a very high false discovery rate. Figures also display PPIs with SAINT AvgP 0.9-1.0, to illustrate that our overall conclusions are not greatly impacted by thresholding effects.

I rearranged the SAINTexpress output table into a matrix of sum spectral counts (baits x preys). I retained sum spectral counts across two replicates in the matrix for all interactions with a prey that had at least one interaction with a confidence score equal to 1. I removed promiscuous preys that had no interactions with a z-score higher than 2.2 in the log distribution of their sum spectral counts across all baits; the value 2.2 was chosen to allow retention of TRIM28, a protein which is expected to be present in many samples. This resulted in a set of 344 prey proteins interacting with the 118 C2H2 ZFP baits.
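The promiscuous-prey filter can be sketched as below. This is an illustration under stated assumptions rather than the analysis code: the function names are hypothetical, a pseudo-count of 1 is added before taking logarithms so that zero counts are defined, and the population standard deviation is used for the z-score; the text specifies only the log distribution and the cutoff of 2.2.

```python
from math import log
from statistics import mean, pstdev


def max_log_zscore(counts, pseudo=1.0):
    """Largest z-score of log(count + pseudo) for one prey across all baits."""
    logs = [log(c + pseudo) for c in counts]
    mu, sd = mean(logs), pstdev(logs)
    if sd == 0:
        return 0.0  # identical counts everywhere: no bait stands out
    return max((x - mu) / sd for x in logs)


def drop_promiscuous(prey_counts, z_cutoff=2.2):
    """Keep preys with at least one bait whose z-score exceeds z_cutoff."""
    return {prey: counts for prey, counts in prey_counts.items()
            if max_log_zscore(counts) > z_cutoff}


# Toy sum spectral counts across 20 baits; prey names and values are made up.
toy_counts = {
    "specific_prey": [0] * 19 + [50],       # one strong interaction
    "uniform_prey": [10] * 20,              # same count with every bait
    "noisy_prey": [8, 9, 10, 11, 12] * 4,   # present everywhere at similar levels
}
retained = drop_promiscuous(toy_counts)
```

A prey that is abundant with every bait has no count far above its own mean, so it is removed, whereas a prey with one or a few strong interactions is retained.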

The SAINTexpress confidence scores that I calculated were the only metric used to define interaction significance. However, a normalized version of the raw spectral counts is also presented in figures in the interests of data transparency. To normalize them, the raw spectral counts were converted to an odds ratio for each bait-prey interaction by H. S. Najafabadi and me. For each bait-prey interaction (i,j), the expected number of peptide counts under the null assumption that the bait and prey would not interact with each other was calculated. This was done by estimating the background probability of observing a peptide from each prey j in the AP-MS profile of non-interacting baits (which I defined as baits with SAINT score < 0.5). Then, for each bait i, the expected number of peptides from prey j would be ne(i,j) = N(i) * p(j), where ne(i,j) is the expected number of peptides from prey j in the AP-MS profile of bait i, N(i) is the total number of peptides in the AP-MS profile of bait i, and p(j) is the background probability of observing a peptide from prey j. The odds ratio is then defined as OR(i,j) = [ no(i,j) + 1 ] / [ ne(i,j) + 1 ], where no(i,j) is the observed peptide count for prey j in the AP-MS profile of bait i (+1 is added as a pseudo-count).
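The odds ratio calculation can be sketched as follows. The data layout and function name are hypothetical, and p(j) is estimated here as the pooled fraction of all peptides from non-interacting baits that belong to prey j, which is one plausible reading of the description and is an assumption.

```python
def odds_ratio(counts, saint, bait, prey, max_saint=0.5):
    """OR(i,j) = (no(i,j) + 1) / (ne(i,j) + 1) with ne(i,j) = N(i) * p(j).

    counts[b][q] is the peptide count of prey q with bait b;
    saint[b][q] is the SAINT score of that pair (missing pairs count as 0).
    """
    # Baits treated as non-interacting with this prey (SAINT score < 0.5).
    background = [b for b in counts if saint.get(b, {}).get(prey, 0.0) < max_saint]
    background_total = sum(sum(counts[b].values()) for b in background)
    background_prey = sum(counts[b].get(prey, 0) for b in background)
    p_j = background_prey / background_total if background_total else 0.0

    n_i = sum(counts[bait].values())       # N(i): all peptides for bait i
    expected = n_i * p_j                   # ne(i,j)
    observed = counts[bait].get(prey, 0)   # no(i,j)
    return (observed + 1) / (expected + 1)


# Toy data: bait and prey names and all values are made up.
toy_counts = {
    "B1": {"P1": 20, "P2": 80},
    "B2": {"P1": 1, "P2": 99},
    "B3": {"P1": 0, "P2": 100},
}
toy_saint = {"B1": {"P1": 1.0}, "B2": {"P1": 0.0}, "B3": {"P1": 0.1}}
or_b1_p1 = odds_ratio(toy_counts, toy_saint, "B1", "P1")
```

In the toy data, p(P1) = 1/200 from baits B2 and B3, so the expected count for B1 is 100 * 0.005 = 0.5 and the odds ratio is 21 / 1.5 = 14. The pseudo-count keeps the ratio finite when the expected count is zero.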

3.2.4 Functional annotation of AP-MS preys

I used PANTHER to conduct GO enrichment analysis (Thomas et al. 2003) on the 344 statistically significant preys determined above, with Bonferroni correction for multiple testing. PANTHER findings are discussed further in the Results section, but one important finding was that the AP-MS preys were significantly enriched for proteins that localize to the nucleus.

To focus on the role of C2H2 ZFPs in the nucleus, I removed all preys from the data set that did not localize to the nucleus based on GeneCards nuclear localization score (defined as having a GeneCards nuclear localization score of 3 or greater). If such an annotation was not available for any of the 344 preys, I applied manual assignment as ‘nuclear’ or ‘non-nuclear’ based on literature review. I then clustered the matrix by seriation and rearranged clusters manually to form the readable diagonal clustering. This resulted in a set of 227 nuclear preys interacting with the 118 C2H2 ZFPs.

The 227 nuclear preys were manually functionally categorized by E. Radovani using a combination of primary literature, Uniprot and GeneCards: Adaptor/Scaffold, Chromatin Remodeler, General Transcription Factor (GTF), RNA Related, Signaling, Transcription Factor, DNA Replication/Repair, Protein Modifier, Helicase, Histone, DNA Methylation, Other/Unknown. More than one tag was assigned to proteins that fulfilled more than one definition. The definitions of the tags are as follows: Adaptor/Scaffold - proteins that mediate protein interactions between two other proteins; Chromatin Remodeler - proteins that either chaperone histones or remodel nucleosomes; GTFs - proteins that associate with RNA Polymerases; RNA Related - proteins that contain an RNA binding domain or proteins that are parts of complexes associated with RNA modification; Signaling - proteins that are parts of signaling pathways; Transcription Factor - proteins that contain a DNA binding domain; DNA Replication/Repair - proteins that are involved in DNA replication and/or DNA repair; Protein Modifier - proteins that contain a domain with catalytic activity towards other proteins; Helicase - proteins that can remodel nucleic acids or nucleic acid-protein complexes; Histone - core histones H3, H4, H2A, H2B, linker histones and histone variants; DNA Methylation - proteins that are involved in reading or regulating DNA methylation; Other/Unknown - proteins that are uncharacterized or did not fulfill the definition of any other category.

3.3 Results

3.3.1 Molecular comparison between investigated C2H2 ZFPs and all Human C2H2 ZFPs

The C2H2 ZFPs included in this functional study are generally representative of the full complement of Human C2H2 ZFPs in terms of representation of the most common PPI domains, amino acid content and base content (Figure 3.1). The only significant bias is that the most extreme outliers in terms of length (number of amino acid residues) and molecular weight (kDa) have been excluded from this investigation (Table 3.1, Figure 3.1), likely as a result of impediments to the molecular cloning of proteins exceeding 1000 amino acids. However, there is no significant bias in the number of ZF domains per protein.

3.3.2 C2H2 ZFPs have unique PPI profiles

Figure 3.2 gives an overview of the nuclear PPIs of the 118 C2H2 ZFPs. The first and most striking feature is the sparseness of the figure – C2H2 ZFPs interact with nuclear proteins that are not shared with many other C2H2 ZFPs in my dataset. Many previously established interactions are captured and follow this trend. For example, the well-studied interaction of YY1 with the INO80 complex (Cai et al., 2007) is exclusive among the 118 proteins examined, while components of the repressive DIF-1 complex (Yeung et al., 2011) interact specifically with KLF10, and Groucho-related proteins TLE1, TLE3, and AES all interact only with OSR2. One of a few notable exceptions to the trend that few preys are shared across C2H2 ZFPs is the E3 ubiquitin ligase HUWE1, which interacts with C2H2 ZFPs from different subclasses, including KRAB, SCAN, BTB, and C2H2 ZFPs lacking any effector domain.

Figure 3.1 Significant differences in molecular features of the 118 C2H2 ZFPs investigated compared to all Human C2H2 ZFPs (related to Table 3.1)

The only significant difference detected between the subset of 118 C2H2 proteins and the total population of Human C2H2 proteins is the underrepresentation of very large outlier proteins. Despite this difference, the number of C2H2 domains per protein in the subset of 118 is not different from that of all C2H2 proteins (t-test, p = 0.74).

Figure 3.2 Nuclear protein interactions with C2H2 ZFPs

Interactions between 118 C2H2 ZFPs and 227 nuclear preys detected using AP-MS. (Heat map axes: 118 C2H2 baits and 227 nuclear preys; legend: SAINT score, fold change, and effector domain class of each bait: KRAB, SCAN, BTB.)

© Schmitges, Radovani, Najafabadi, Barazandeh, Campitelli et al. (2016)

To elucidate the overall diversity in C2H2 ZFP PPIs, I quantified the overlap in PPI profiles between distinct C2H2 ZFPs in the dataset. I leveraged the Pearson correlation between biological replicates of C2H2 ZFPs (Figure 3.3). I reasoned that if a C2H2 ZFP has an interaction profile distinct from those of the other C2H2 ZFPs in the dataset, then the Pearson correlation between the two biological replicates for that C2H2 ZFP should be high, and their correlations to the replicates of any other C2H2 ZFP should be low. Indeed, AP-MS spectral count profiles from biological replicates for the same C2H2 ZFP were typically more similar to each other than to those of any of the other 117 proteins (100/118 cases) (Figure 3.3A), and the correlation between replicates exceeds 0.9 for 93/118 C2H2 ZFPs (Figure 3.3B).
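The replicate comparison described above can be illustrated with a minimal sketch. It assumes spectral count profiles are stored as equal-length lists keyed by (bait, replicate); the function names and the toy values are hypothetical and are not taken from the dataset or from the original analysis code.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def replicate_rank(profiles, bait):
    """Rank of a bait's replicate 2 among all other experiments,
    ordered by correlation with that bait's replicate 1 (rank 1 = most similar)."""
    query = profiles[(bait, 1)]
    scores = {key: pearson(query, vec)
              for key, vec in profiles.items() if key != (bait, 1)}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index((bait, 2)) + 1, scores[(bait, 2)]

# Toy spectral counts for four hypothetical preys (illustrative values only).
toy_profiles = {
    ("BAIT_A", 1): [30, 0, 2, 0],
    ("BAIT_A", 2): [28, 1, 3, 0],
    ("BAIT_B", 1): [0, 25, 0, 4],
    ("BAIT_B", 2): [1, 22, 0, 5],
}
rank, r = replicate_rank(toy_profiles, "BAIT_A")
print(rank, round(r, 3))
```

A rank of 1 for a bait corresponds to the situation described for EGR2 in Figure 3.3A, where one replicate is more similar to its partner replicate than to any other experiment.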

To confirm that the observed uniqueness of C2H2 ZFP interaction profiles was not a result of thresholding effects, I queried the number of C2H2 ZFP purifications that contained each prey. I repeated this analysis for SAINT AvgP confidence score thresholds of 0.8, 0.9, and 1, and with and without applying filtering for prey proteins that localize to the nucleus (Figure 3.4). Regardless of the combination of thresholds applied, the majority of AP-MS preys interact with only 1-2 of the 118 C2H2 ZFPs tested. Taken together, these findings capture striking and unexpected diversity in the interaction profiles of C2H2 ZFPs.
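The threshold check can be sketched as follows, assuming each interaction is recorded as a (bait, prey, AvgP) tuple and that nuclear localization is available as a set of prey names. All names and toy values here are hypothetical; this is an illustration of the counting logic, not the original analysis code.

```python
from collections import defaultdict

def baits_per_prey(interactions, min_avgp, nuclear_preys=None):
    """Count how many distinct baits purified each prey at a given AvgP cutoff.

    interactions: iterable of (bait, prey, avgp) tuples.
    nuclear_preys: optional set of prey names; if given, other preys are dropped.
    """
    partners = defaultdict(set)
    for bait, prey, avgp in interactions:
        if avgp < min_avgp:
            continue  # interaction does not pass the confidence cutoff
        if nuclear_preys is not None and prey not in nuclear_preys:
            continue  # optional nuclear localization filter
        partners[prey].add(bait)
    return {prey: len(baits) for prey, baits in partners.items()}

def fraction_specific(counts, max_baits=2):
    """Fraction of preys that interact with at most max_baits baits."""
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n <= max_baits) / len(counts)

# Toy interaction list (illustrative values only).
toy_interactions = [
    ("b1", "p1", 1.0), ("b2", "p1", 0.95), ("b3", "p1", 0.9),
    ("b1", "p2", 1.0), ("b2", "p3", 0.85), ("b3", "p4", 0.92),
]
toy_nuclear = {"p1", "p2", "p4"}

for cutoff in (0.8, 0.9, 1.0):
    for nuclear in (None, toy_nuclear):
        counts = baits_per_prey(toy_interactions, cutoff, nuclear)
        print(cutoff, nuclear is not None, round(fraction_specific(counts), 2))
```

Repeating the count over every combination of cutoff and localization filter mirrors the grid of histograms in Figure 3.4.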

3.3.3 Effector domain subclasses recruit expected interaction partners, and additional and alternative PPIs are pervasive within each group

I next investigated whether C2H2 ZFPs with KRAB, SCAN, and BTB auxiliary domains tended to interact with their expected binding partners (Figure 3.5). Surprisingly, only 38/55 KRAB-containing C2H2 ZFPs (KZFPs) recruited the expected cofactor, TRIM28. This result was robustly supported by the raw spectral count data and high degree of reproducibility between replicates (Figure 3.5 A). KZFPs that bound TRIM28 also interacted with other nuclear factors not shared with other KZFPs (Figure 3.2).

As expected, 9 of the 11 SCAN-containing C2H2-ZF proteins we examined interact with other SCAN-containing proteins and 3 of the 9 BTB-containing C2H2-ZF proteins interact with other BTB-containing proteins (Figure 3.5 B, top boxes). It is possible that the lack of heterotypic interactions for some SCAN and BTB C2H2 ZFPs is explained by the absence of their binding partners in HEK293 cells. In addition, in both cases, many additional interactions specific to one or a few baits are observed (Figure 3.5 B, top boxes). Some interacting proteins are common to multiple C2H2-ZF proteins. For example, SCAND1, ZKSCAN1, and ZSCAN18 interact specifically with most of the SCAN-domain containing C2H2-ZF proteins. Interestingly, both SCAN and BTB-C2H2s also recruited a diverse range of other nuclear factors, including some other C2H2 ZFPs (Figure 3.5 B, bottom boxes).

Figure 3.3 Reproducibility between AP-MS replicates

A Distribution of rank orders of AP-MS replicate Pearson correlation compared to all other replicates based on the raw spectral counts from AP-MS results, using a matrix of 236 baits (each of the 118 in duplicate) x 344 prey proteins (the same set described in the text for PANTHER analysis). For example, a rank order of 1 for EGR2 indicates that EGR2 replicate 1 was more similar to EGR2 replicate 2 than all other 235 individual AP-MS experiments. (x-axis: rank order of replicate Pearson correlation; y-axis: frequency.)

B Distribution of Pearson correlations for each pair of AP-MS replicates for the same C2H2-ZFP, as in A. The correlation between replicates is above 90% for 93/118 C2H2-ZFs. (x-axis: Pearson correlation between replicates; y-axis: frequency.)

© Schmitges, Radovani, Najafabadi, Barazandeh, Campitelli et al. (2016)

Figure 3.4 Shared interactions between C2H2 ZFPs

The histograms display the distribution of the number of C2H2 ZFPs that purified with each prey at a given confidence score cutoff used to define the interaction (including/excluding cytoplasmic preys). The results presented in Figure 5 correspond to the plot with nuclear preys and AvgP ≥ 0.9 (middle right). Blue columns indicate the total number of preys that interact with only one or two of the 118 C2H2 ZFPs examined by AP-MS. (x-axis: number of interacting C2H2-ZFPs per prey; y-axis: frequency.)

Panel annotations (all preys, n = 344; nuclear preys, n = 227):
AvgP = 1: 264 (77%) of all preys; 172 (76%) of nuclear preys
AvgP ≥ 0.9: 213 (62%) of all preys; 140 (62%) of nuclear preys
AvgP ≥ 0.8: 202 (59%) of all preys; 133 (59%) of nuclear preys

© Schmitges, Radovani, Najafabadi, Barazandeh, Campitelli et al. (2016)

Figure 3.5 Overview of expected interactions for C2H2 ZFPs with KRAB, SCAN and BTB domains

A Spectral counts and SAINT scores for TRIM28 from AP-MS experiments for KRAB and non-KRAB-C2H2s for two biological replicates. B SCAN-SCAN interactions are common and BTB-BTB interactions are also observed, but each of these PPI domain subclasses also interacts with diverse sets of other nuclear proteins that are not shared with other C2H2-ZFP baits with the same effector domain.

© Schmitges, Radovani, Najafabadi, Barazandeh, Campitelli et al. (2016)

3.3.4 C2H2 ZFPs interact with transcription-related nuclear factors

GO overrepresentation analysis with PANTHER on the 344 significant binding partners of the C2H2 ZFPs (without filtering for preys that localize to the nucleus) revealed twofold enrichment of the GO-Slim terms “nucleus” (P < 1.69 × 10⁻⁵), “RNA metabolic process” (P < 3.41 × 10⁻⁶), “DNA binding” (P < 0.0287), and “transcription, DNA-dependent” (P < 0.000883), and greater than fivefold enrichment of “helicase activity” (P < 0.0069). For the 227 nuclear preys, more than half (124) are associated with “regulation of gene expression,” 40 are associated with “chromosome organization,” and 20 with “histone modification.” Following the result that all of the 118 C2H2 ZFP baits bound specific DNA sequences in prior ChIP-seq experiments, these findings provide a second dimension of support that Human C2H2 ZFPs function as transcriptional regulators.
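To make the notion of fold enrichment concrete, the sketch below computes it alongside a one-sided hypergeometric P value for a single term. This is an illustrative stand-in only: the test PANTHER applied in this analysis is not specified here and may differ, and the counts in the example are hypothetical.

```python
from math import comb

def fold_enrichment(k, n, K, N):
    """Observed fraction of annotated preys divided by the background fraction.

    k: preys in the test set carrying the term
    n: size of the test set
    K: background proteins carrying the term
    N: size of the background
    """
    return (k / n) / (K / N)

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n), i.e. the probability of
    drawing at least k annotated proteins in n draws without replacement."""
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

# Hypothetical counts: 3 of 4 test proteins carry a term held by 5 of 20 background proteins.
print(fold_enrichment(3, 4, 5, 20))
print(hypergeom_upper_tail(3, 4, 5, 20))
```

In a real analysis the P values would also need correction for the number of terms tested, which this sketch omits.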

In order to capture two relevant dimensions of the PPI network, for each of the functional categories described in 3.2.4, I quantified both the number of the 227 nuclear preys and the number of bait-prey interactions (i.e. network edges) that were represented by each category. Figure 3.6 provides a summary of this analysis and specific examples of C2H2 ZFP interactions with preys in each of the functional categories, and Figure 3.7 demonstrates that the molecular functions of preys are approximately consistent across all categories of effector domain.
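The two quantities can be sketched as follows, assuming the network is held as a list of (bait, prey) edges and each prey has a single assigned category. The prey names and category assignments in the example are hypothetical placeholders.

```python
from collections import Counter

def summarize_by_category(edges, prey_category):
    """Return (preys per category, bait-prey edges per category).

    edges: iterable of (bait, prey) pairs; duplicates are counted once.
    prey_category: dict mapping each prey to its functional category.
    """
    unique_edges = set(edges)
    preys = {prey for _, prey in unique_edges}
    prey_counts = Counter(prey_category[p] for p in preys)
    edge_counts = Counter(prey_category[p] for _, p in unique_edges)
    return prey_counts, edge_counts

# Toy network: one scaffold-like prey shared by three baits, two bait-specific TFs.
toy_edges = [("b1", "p1"), ("b2", "p1"), ("b3", "p1"), ("b1", "p2"), ("b2", "p3")]
toy_categories = {
    "p1": "Adaptor/Scaffold",
    "p2": "Transcription Factor",
    "p3": "Transcription Factor",
}
prey_counts, edge_counts = summarize_by_category(toy_edges, toy_categories)
print(dict(prey_counts))
print(dict(edge_counts))
```

The toy network shows why both counts are informative: a category can contain few preys yet account for many interactions when one prey is shared across baits.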

A wide variety of intriguing molecular functions is represented among the interacting proteins. The largest functional categories are DNA-binding transcription factors (primarily other C2H2-ZF proteins), post-translational modifiers, and adaptor/scaffold proteins. Protein modifiers that interact with C2H2-ZF proteins also exhibit diverse activities such as histone acetylation (CREBBP/EP300) (Kalkhoven, 2004), methylation (CARM1) (Chen et al., 1999), and demethylation (KDM1A, LSD1, NO66) (Shi et al., 2004; Sinha et al., 2010). The most common scaffolding protein in the interaction network is TRIM28 (Fig. 5A), followed by TRIM27 (eight interactions), CTBP1 (four interactions), and CTBP2 (three interactions), all of which have been implicated in recruitment of histone modification complexes (Bloor et al., 2005; Stankiewicz et al., 2014). These findings support a widespread role of C2H2-ZF proteins in chromatin structure and organization.

Figure 3.6 Functional overview of the C2H2 ZFP PPI partners

All 227 nuclear prey proteins were assigned functional categories based on a literature search (see Methods). The bar graph shows the number of individual prey proteins in each category while the colour of the bars reflects the number of total interactions between bait proteins and prey proteins from each category. Example interactions for each category are shown as heat maps.

© Schmitges, Radovani, Najafabadi, Barazandeh, Campitelli et al. (2016)

Figure 3.7 Functional overview of the C2H2 ZFP PPI partners for C2H2 ZFPs with no effector domain, or KRAB, SCAN or BTB domains (related to Figure 3.6)

All 227 nuclear prey proteins were assigned functional categories based on a literature search (see Methods). The bar graphs show the number of individual prey proteins in each category while the colour of the bars reflects the number of total interactions between bait proteins and prey proteins from each category. (Panels: All C2H2s, baits = 118, preys = 227; No effector domain, baits = 44, preys = 134; KRAB, baits = 55, preys = 96; SCAN, baits = 7, preys = 43; BTB, baits = 9, preys = 36. x-axis: functional category; y-axis: total preys; bar colour: total interactions.)

3.4 Discussion

In Metazoans, C2H2 ZFPs are well known for their diverse DNA binding preferences and frequent association with a small number of effector domains (Stubbs, Sun and Caetano-Anolles, 2011). Surprisingly, these findings demonstrate that Human C2H2 ZFPs have PPI profiles that also exhibit extremely high diversity, which can vary dramatically even among proteins that share the same type of PPI domain. PPIs also strongly indicate that C2H2 ZFPs do function as transcriptional regulators – most interact with at least one other protein that has an established role in regulation of chromatin or gene expression. These findings lead to the conclusion that multi-parameter evolution, previously described for bHLH proteins in C. elegans (Grove et al., 2009), is widespread among the largest class of Human TFs.

The surprising finding that about one third of the tested KZFPs do not recruit TRIM28 (which is highly expressed in HEK293 cells) suggests that these proteins may have other roles in the nucleus, or at least that their silencing ability is not always conserved. Other investigations by my co-authors uncovered that 31 of the 59 KZFPs interacted with at least one transcriptional activator protein in this dataset, and that there is a quantitative relationship between TRIM28 spectral counts in KZFPs’ AP-MS purifications and genomic binding site overlap with H3K9me3 loci in ChIP-seq experiments (r = 0.4, P < 0.0016; Figure 3.8). Taken together, these findings suggest that a significant fraction of KZFPs do not function as transcriptional silencers by the canonical SETDB1 pathway, challenging prior generalizations about the function of this domain, and raising questions as to which sequence components of KRAB domains are essential for (or predictive of) TRIM28 recruitment. Furthermore, these findings fit a picture of KZFP evolution proposed by the arms race and domestication models (Ecco, Imbeault and Trono, 2017; Imbeault, Helleboid and Trono, 2017): ERE diversification imposes diversifying selection on silencing KZFPs to acquire binding specificity for new EREs, but after the bound EREs have decayed, the binding KZFP(s) are no longer constrained to retain silencing ability and can possibly acquire new functions. These ideas are explored further in Chapter 5.

Figure 3.8 Correlation between TRIM28 association and H3K9me3 signals

Scatterplot of TRIM28 signal vs. H3K9me3 signal for the 50 KRAB proteins that were studied by AP-MS and ChIP-seq. The Pearson correlation is calculated using the log-transformed TRIM28 signal. Figure created by H. S. Najafabadi, incorporating the AP-MS data described in this chapter and ChIP-seq data generated by F. Schmitges and analyzed by H. S. Najafabadi.

© Schmitges, Radovani, Najafabadi, Barazandeh, Campitelli et al. (2016)

The observation that virtually all C2H2 ZFPs with PPI domains have non-overlapping interactions beyond those associated with their PPI domains points to a paradox: C2H2 ZFPs are able to bind extremely diverse complements of nuclear proteins, despite containing very few known PPI domains (KRAB, SCAN, BTB). I propose three possible explanations, which are not mutually exclusive:

1. Some of the PPI domains may recruit unexpected binding partners. SCAN and BTB domains may mediate interactions beyond dimerization, or KRAB domains may recruit other binding partners than TRIM28.

2. ZFs may contribute extensively to PPI diversity. I suspect that a significant fraction of C2H2 ZFP PPIs are mediated by ZF domains. Several examples of C2H2 ZFPs using ZFs to mediate PPIs have already been described, including YY1, SP1, OAZ, and PLZF (reviewed by Brayer and Segal, 2008). Given that C2H2 ZFPs often contain more ZF domains than they appear to employ in DNA binding, and that the PPI functions of ZFs can occur regardless of whether the ZF is involved in DNA-binding (discussed in Chapter 1), one possible function of some of these apparently superfluous DBDs could be the mediation of PPIs. The key to the C2H2 ZFPs’ ability to rapidly evolve new DNA-binding specificities lies in the established capacity to swap and modify ZF domains. The extension of this modular evolutionary mechanism to the facilitation of PPIs would provide a fitting explanation for the striking PPI diversity observed here.

3. The C2H2 ZFPs may contain undiscovered PPI-mediating elements. The C2H2 ZFPs that lack auxiliary domains are predicted to be largely unstructured outside the ZFs (16% alpha helix and 6% beta sheet overall, using HHpred (Söding, Biegert and Lupas, 2005)). These regions primarily include long unstructured N-termini, where PPI domains normally occur when present, but can also include the linker regions between ZFs. The contribution of intrinsically disordered regions to PPIs is often overlooked (Oldfield and Dunker, 2014). It is conceivable that the apparent excess of unstructured and poorly conserved polypeptide sequence in these proteins may serve as a template for rapid evolution of short linear motifs to mediate PPIs.

Elucidation of the molecular elements that enable C2H2 ZFPs to mediate such diverse PPIs will uncover how C2H2 ZFPs are able to evolve distinct PPI profiles on a relatively short timescale. Approaching this problem will require new experimental techniques, which are discussed further in Chapter 5.


3.5 Summary

Despite being the largest class of Human TFs, the C2H2 ZFPs have been underrepresented in DNA-binding experiments, and their PPIs have never been investigated at scale. I characterized the PPI preferences of 118 DNA-binding C2H2 ZFPs using AP-MS data. The highly parallel nature of this investigation uncovered global trends in the PPI profiles of Human C2H2 ZFPs. I found that C2H2 ZFPs tend to interact with transcription-related nuclear factors, but that the specific complements of interaction partners tend to show little overlap between C2H2 ZFPs: 76% of interaction partners bound only 1-2 of the 118 ZFPs. Although they exhibit such diverse PPIs, C2H2 ZFPs tend to contain one of very few PPI domains: the KRAB silencing domain, and SCAN and BTB dimerization domains. C2H2 ZFPs with PPI domains tended to recruit their expected binding partners, but additional interactions were pervasive. Interestingly, one third of KZFPs failed to recruit their canonical silencing cofactor, TRIM28, consistent with recent speculation that conservation of silencing activity is predominantly restricted to those KZFPs that bind recently inserted TEs. In conjunction with the highly diverse DNA-binding preferences reported previously and found by co-authors in this study, it is clear that the evolutionary trajectory of the C2H2 ZFPs demands multiparameter diversification. While the mechanisms by which C2H2 ZFPs diversify their DNA binding preferences are well understood, further investigation is required to illuminate the structural features that enable the C2H2 ZFPs to bind such diverse sets of cofactors and to acquire new PPIs during evolutionary diversification.

Chapter 4 Reconstructing the evolutionary history of LINE L1s to interpret KZFP-ERE coevolution

A manuscript describing the work in this Chapter is in preparation for publication:

Campitelli, L.F., Albu, M., Blanchette, M., and Hughes, T.R. (2020). The evolution of Human LINE L1 TEs. (In prep.)

Author Contributions

LFC produced a method leveraging ancestral Mammalian genomes for TE reconstruction, reconstructed LINE L1 progenitor sequences, refined their protein-coding regions, and conducted downstream analyses to characterize their evolution. MB provided ancestral Mammalian genomes. MA provided supporting data curation. LFC wrote the manuscript with significant contribution from TRH.

4.1 Introduction

My goal in this Chapter was to reconstruct a full-length progenitor sequence for each of the 67 Human LINE L1 subfamilies. LINE L1s are a class of EREs, copy-and-paste TEs that replicate via transcription by the general transcriptional machinery. L1s have expanded to comprise more than 16% of the Human genome (Mandal and Kazazian, 2008). The observation that only one LINE L1 subfamily appears to be active at a time suggests that subfamilies compete with one another for limited transcriptional resources, although the specific underpinnings of this competition are unresolved (Khan, Smit and Boissinot, 2005).

A reconstructed L1 progenitor sequence should contain two protein-coding ORFs called ORF1 and ORF2. The ORF1 protein (ORF1p) contains a dsRBD and an ssRBD, and the ORF2 protein (ORF2p) contains endonuclease and reverse transcriptase domains (Burns and Boeke, 2012) (Figure 4.1 A). Full-length progenitor sequences have been reconstructed for the youngest L1 subfamilies, but older subfamilies are only represented by 3ʹ end consensus sequences (Figure 4.1 C). L1 subfamilies L1PA8 and younger also encode an antisense ORF called ORF0 in their 5ʹ UTRs, which improves transposition efficiency (Denli et al., 2015). The expected phylogenetic relationships among the 67 L1 subfamilies have been previously described and can be recapitulated using established consensus sequences corresponding to their 3ʹ ends (Smit et al., 1995) (Figure 4.1 B).

KZFPs appear to coevolve with EREs in an evolutionary arms race, where KZFPs are under positive selection to acquire new binding specificities for new ERE subfamilies, in turn imposing diversifying selection on EREs to escape KZFP recognition. I reasoned that this phenomenon is worth investigating not only because it appears to have driven the expansion of the largest and most unusual family of Human TFs, but also because its byproduct KZFP-ERE interactions can be co-opted for the establishment of new gene regulatory modules (Ecco, Imbeault and Trono, 2017; Imbeault, Helleboid and Trono, 2017).

The coevolution of KZFPs and EREs is broadly supported, but direct evidence connecting the sequences of potentially active EREs to the binding specificities of KZFPs is restricted to a handful of cases. These include four of the 222 KZFPs investigated by Imbeault et al. (2017), one of which was the previously described ZNF93 (Jacobs et al., 2014). All of those cases involved KZFP interactions with very young LINE L1 subfamilies (L1PA6 and younger, restricted to the last 30M years, i.e. Old World Monkeys). However, the recent increase in ChIP-seq/exo data available for Human KZFPs confirms that specific relationships between KZFPs and ERE subfamilies span much farther back in time; for example, associations with L1MEs and L2s (which are older than 100M years, i.e. Eutherians) are prevalent (Schmitges et al., 2016; Imbeault, Helleboid and Trono, 2017; Barazandeh et al., 2018).

Figure 4.1 Overview of phylogenetic relationships between Human LINE L1 subfamilies and available consensus models and progenitor sequence reconstructions

A Example of an annotated gold standard L1 sequence for the subfamily L1HS (the most recent subfamily, currently active in the modern Human genome), as published by Khan et al. (2005). Annotations of ORF start and end loci are determined by NCBI ORFfinder, and domain start and end sites are determined by NCBI Conserved Domain Search (CDS). Conversely, the older L1MC1 subfamily lacks a full-length reconstruction; instead L1MC1 is represented by a consensus model for its 3ʹ ~1kb. This is representative for virtually all subfamilies that do not have a reconstruction from Khan et al. (as demonstrated in C). B Phylogeny of L1 subfamilies based on Dfam 3ʹ end models. C Lengths of consensus models from Dfam and Repbase, and progenitor reconstructions from Khan et al. (2005).

The identification of such KZFP-ERE binding relationships has been limited by both the availability of sequence representations of active forms of EREs and the availability of DNA-binding specificities of KZFPs. The recent increase in ChIP-seq/exo data for 314/747 unique Human C2H2 ZFPs, including 242 KZFPs (Schmitges et al., 2016; Imbeault, Helleboid and Trono, 2017; Barazandeh et al., 2018), motivates the improved reconstruction of ERE subfamily active progenitor sequences to identify sequence-level relationships between KZFPs and EREs older than 30M years.

Human LINE L1 subfamilies present an ideal starting point to develop a robust ERE progenitor reconstruction approach because (1) Human L1 subfamilies have a universal classification and annotation scheme (Smit et al., 1995), (2) Khan et al. (2005) have already established a method for the reconstruction of L1 progenitor sequences that can be improved with ancestral sequence reconstruction (ASR) and ancestral genome reconstruction (AGR) techniques, (3) evidence from several studies has already identified distinct KZFP-L1 subfamily associations (Jacobs et al., 2014; Schmitges et al., 2016; Imbeault, Helleboid and Trono, 2017), and (4) KZFP binding sites at the level of the L1 progenitor sequence have already been established in the simplest cases (Jacobs et al., 2014; Imbeault, Helleboid and Trono, 2017).

To support the reconstruction of very old subfamily progenitor sequences, I leveraged AGR using 11 reconstructed ancestral Human genomes spanning 105M years from the base of Eutheria (i.e. Human-Armadillo common ancestor) to hg38 (Figure 4.2 A). I annotated all repeat elements in each genome using RepeatMasker (Figure 4.2 B) and used ASR to reconstruct a full-length progenitor sequence for each L1 subfamily. I further refined the protein-coding ORF reconstructions with codon-aware alignment. Altogether, I produced full-length or near full-length progenitor reconstructions for all 67 Human LINE L1 subfamilies, including 55 subfamily progenitors with refined ORFs that reflect established phylogenetic relationships. In a case study based on the L1MC1 subfamily, I demonstrate the utility of the reconstructed full-length progenitor sequences for integrating KZFPs’ DNA-binding motifs and binding sites from in vivo and in silico predicted binding preferences of KZFPs. These promising results suggest that ASR and AGR can be used to rebuild the sequence evolutionary history of any ERE subfamily for which consistent annotation and classification schema are available, and that these reconstructed progenitor sequences can be used to investigate the evolution of Human genome regulation.

Figure 4.2 Overview of reconstructed ancestral Human genomes

A Phylogeny of extant genomes used to produce the ancestral reconstructions. Coloured internal nodes represent the ancestral Human genomes in the dataset. Coloured text corresponding to node colours indicates the extant species with which the ancestral genome is a common ancestor to Human, and divergence time from Human in millions of years ago (MYA). The orange line indicates the Human lineage that is reconstructed. B Workflow used to produce ancestral genomes and to annotate L1 loci. Each internal node represents a whole genome that is reconstructed by Ancestors 1.1. The orange line represents the Human lineage. Coloured internal nodes represent the ancestral genomes on the Human lineage, which are the ancestral reconstructed Human genomes used in this investigation.

4.2 Methods

4.2.1 Ancestral reconstructed genomes

Reconstructed ancestral genomes were acquired from Mathieu Blanchette. They were produced using Ancestors 1.1 (Diallo, Makarenkov and Blanchette, 2010), using the Mammalian clade of UCSC’s 100-way whole genome alignment and the corresponding UCSC phylogeny. Ancestors 1.1 implements a set of maximum likelihood approaches to infer the most likely ancestral state for each ancestral node in the phylogeny at each position in the input multiple sequence alignment. The result is a whole reconstructed ancestral genome for every internal node in the input phylogeny. The most complete reconstructed ancestral genomes are those on the Human lineage, because the input alignment was referenced to hg38. Given my interest in Human genome evolution, I extracted the 11 ancestral genomes on the Human lineage, ranging from the Human-Armadillo common ancestor (105 MYA) to the Human-Chimpanzee common ancestor (7 MYA). I added hg38 to this dataset for a total of 12 genomes.

4.2.2 TE annotation

I annotated TEs and all other repetitive genomic elements in each of the 12 genomes using RepeatMasker (Tempel, 2012), using Dfam HMMs (Hubley et al., 2016) as the reference library for repeat element classification. RepeatMasker sometimes annotates L1 loci as being members of general 5ʹ end subfamilies, presumably because the 3ʹ end of the locus cannot be confidently mapped to a specific 3ʹ end subfamily. Such loci were excluded from analyses.

For each subfamily, in each genome, the average sequence divergence between the subfamily’s 3ʹ end HMM hit loci and the 3ʹ end HMM consensus model was calculated using RepeatMasker’s calcDivergenceFromAlign.pl. This method calculates the sequence divergence based on the Kimura 2 Parameter substitution model, and applies CpG correction to account for hypermutability of CpG sites (i.e. 5ʹ-CG-3ʹ sites, where C-G base pairs are frequently converted to A-T due to cytosine methylation and spontaneous deamination).
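As a rough illustration of the Kimura 2 Parameter distance that underlies this divergence estimate, the Python sketch below computes it for a single pair of aligned sequences. It is not RepeatMasker's calcDivergenceFromAlign.pl and it omits the CpG correction; the function name `k2p_distance` and the choice to skip gapped or ambiguous columns are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(aligned_ref, aligned_query):
    """Kimura 2 Parameter distance between two aligned sequences.

    Columns with a gap or a non-ACGT character are skipped (an assumption
    of this sketch). No CpG correction is applied.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for r, q in zip(aligned_ref.upper(), aligned_query.upper()):
        if r not in "ACGT" or q not in "ACGT":
            continue
        compared += 1
        if r == q:
            continue
        same_class = ({r, q} <= PURINES) or ({r, q} <= PYRIMIDINES)
        if same_class:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable columns")

    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P formula")
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

Averaging this value over all hit loci of a subfamily in one genome would give a per-subfamily divergence comparable in spirit, though not in exact value, to the CpG-corrected figure reported here.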

For each of the 67 biological subfamilies defined by Dfam 3ʹ end HMM models, the most probable linear sequences were also extracted from the HMMs using HMMER hmmemit -c (Finn, Clements and Eddy, 2011).

4.2.3 Full-length progenitor sequence reconstruction

For each of the 67 biological subfamilies defined by 3ʹ end HMMs, I considered the sequences of the 100 longest hits from each of the 12 genomes (hg38 and the 11 ancestors), excluding any hits less than 3kb in length. For each genome and subfamily combination, I produced a multiple sequence alignment of these ≤ 100 longest hits using Muscle (Edgar, 2004). Each alignment was then supplied to FastML (Ashkenazy et al., 2012) to reconstruct the sequence ancestral to all sequences in the alignment. FastML was run with each of its two indel reconstruction models, and thus returns two possible solutions for the ancestral sequence of each input multiple sequence alignment. The result of this process was a maximum of 24 candidate progenitor sequences (a maximum of two for each genome) for each of the 67 subfamilies. Because not every subfamily has high-quality hits in every genome, the total number of candidate progenitor sequences reconstructed was 1134.
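The hit selection step described above can be sketched in Python as follows. This is an illustration rather than the pipeline used here: the function name `select_longest_hits` and the representation of hits as (identifier, sequence) pairs are hypothetical, while the 3kb minimum and the cap of 100 hits come from the text.

```python
def select_longest_hits(hits, min_length=3000, max_hits=100):
    """Return up to max_hits of the longest hits that are >= min_length.

    hits: iterable of (hit_id, sequence) pairs for one subfamily in one genome.
    """
    # Exclude hits shorter than the minimum length (3kb in the text).
    long_enough = [(hit_id, seq) for hit_id, seq in hits if len(seq) >= min_length]
    # Longest first, then keep at most max_hits of them.
    long_enough.sort(key=lambda hit: len(hit[1]), reverse=True)
    return long_enough[:max_hits]
```

Each selected set would then be aligned and passed to ancestral sequence reconstruction. A genome and subfamily combination with no hits of at least 3kb yields an empty list and therefore no candidate, which is why fewer than 67 x 24 candidates are obtained overall.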

I also considered a simpler frequency-based consensus approach for progenitor reconstruction. In this method, I started with the same ≤ 100 longest hits aligned using Muscle (Edgar, 2004). I produced a consensus profile HMM of the multiple sequence alignment using HMMER (hmmbuild) (Finn, Clements and Eddy, 2011), and then extracted the most likely linear sequence from the profile HMM (hmmemit -c).

4.2.3.1 Automated selection of the best candidate progenitor sequence

I next systematically selected the best ancestral reconstructed sequence for each subfamily. It stands to reason that the best progenitor reconstruction for a given subfamily should come from the oldest ancestral genome in which that subfamily occurred. I chose to select the best progenitor sequence without consideration of the source genome, however, in order to test the validity of the approach. My selection criteria for the best progenitor sequence for each subfamily were based on (1) achieving a length of 6-8kb, and (2) having high percent sequence identity to gold standard sequences.

I curated a list of gold standard sequences using truncated L1 subfamily 3ʹ end models from the databases Dfam (Hubley et al., 2016) and Repbase (Jurka, 2000), and full-length reconstructions of progenitor sequences for the most recent LINE L1s from a previous study (Khan, Smit and Boissinot, 2005). Dfam HMMs were converted to their most likely linear sequences using HMMER (hmmemit -c) (Finn, Clements and Eddy, 2011). I aligned each candidate progenitor to each of these gold standards and assessed percent identity from the pairwise alignment using in-house Python scripts. Because my selection pipeline includes a separate step to assess the length of the reconstructed sequences, and because most gold standard sequences were also truncated, I did not penalize indels in my percent identity assessment. Instead, I calculated percent identity between the two sequences as (a – b) / a, where a is the length of the shorter sequence (either the candidate progenitor or the gold standard) and b is the total number of mismatches in the pairwise local alignment.
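The indel-tolerant percent identity defined above, (a – b) / a, can be sketched in Python as follows. This is a minimal illustration and not the in-house script: the function name `percent_identity`, the use of '-' as the gap character, and the assumption that the two sequences are supplied already aligned (equal-length strings) are all choices made for the example.

```python
def percent_identity(aligned_x, aligned_y):
    """Identity as (a - b) / a, ignoring indels.

    a: ungapped length of the shorter of the two sequences.
    b: number of mismatched columns (columns containing a gap are not counted).
    """
    if len(aligned_x) != len(aligned_y):
        raise ValueError("sequences must be aligned to the same length")

    # b: mismatches only; gapped columns are skipped so indels are not penalized.
    mismatches = sum(
        1
        for x, y in zip(aligned_x.upper(), aligned_y.upper())
        if x != "-" and y != "-" and x != y
    )
    # a: length of the shorter sequence once gap characters are removed.
    shorter = min(
        len(aligned_x.replace("-", "")),
        len(aligned_y.replace("-", "")),
    )
    if shorter == 0:
        raise ValueError("at least one sequence is empty")
    return (shorter - mismatches) / shorter
```

For example, `percent_identity("ACGT-ACGT", "ACGTTACCT")` compares an 8-base and a 9-base sequence with one mismatch and one gap, giving (8 - 1) / 8 = 0.875; the gap does not lower the score.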

Having percent identities for gold standards for each candidate progenitor sequence, I selected the ‘best’ full-length reconstruction. First, I considered only candidate progenitor sequences with a percent identity to a gold standard within 1.5% of the maximum percent identity achieved by any candidate progenitor in that subfamily. Next, I discarded reconstructions with lengths >8kb (which were rare) and overwrote any lengths between 6-8kb as 6kb in order to consider them equivalently preferable. Finally, for each subfamily I rank ordered the candidate progenitors in descending order of best percent identity to a gold standard, followed by normalized length, and then selected the top candidate reconstruction as the best full-length progenitor sequence.
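The selection steps above can be sketched in Python as follows. The six candidates are the illustrative values shown in the workflow panel of Figure 4.5 A; the function name `select_best_candidate`, the tuple layout, and the reading of "within 1.5%" as an absolute difference of 0.015 in fractional identity are assumptions. The final sort in this sketch follows the order drawn in Figure 4.5 A (normalized length first, then identity); swapping the two elements of the sort key gives the order as worded in the text.

```python
def select_best_candidate(candidates, identity_window=0.015,
                          full_min=6000, full_max=8000):
    """Pick one candidate from a list of (name, identity, length) tuples.

    identity is the best fractional identity to any gold standard.
    """
    best_identity = max(identity for _, identity, _ in candidates)

    kept = []
    for name, identity, length in candidates:
        # Step 1: keep only candidates close to the best identity.
        if identity < best_identity - identity_window:
            continue
        # Step 2: discard over-long reconstructions (>8kb).
        if length > full_max:
            continue
        # Step 3: treat every length in the 6-8kb range as 6kb.
        normalized = full_min if length >= full_min else length
        kept.append((name, identity, normalized))

    if not kept:
        return None
    # Step 4: rank by normalized length, then identity (order of Figure 4.5 A).
    kept.sort(key=lambda c: (c[2], c[1]), reverse=True)
    return kept[0][0]


example = [
    ("RS1", 0.988, 6242), ("RS2", 0.995, 6299), ("RS3", 0.962, 6725),
    ("RS4", 0.922, 6101), ("RS5", 0.991, 8225), ("RS6", 0.996, 5143),
]
best = select_best_candidate(example)  # "RS2" for these values
```

In this example RS3 and RS4 fail the identity window, RS5 is discarded for exceeding 8kb, and RS2 is preferred over RS6 because it reaches the normalized full length.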

4.2.4 ORF refinement

A functional progenitor sequence should have intact ORFs that encode the expected protein products. I noted that in many cases, especially for the more challenging reconstructions of older subfamilies’ progenitor sequences, small inaccuracies such as frame shift mutations, which also introduce premature stop codons, disrupted the ORF-homologous sequences of my best full-length progenitors. To improve the reconstruction of ORF-encoding sequences systematically, I introduced a dedicated ORF reconstruction step.

For each subfamily, I searched each of the ≤ 100 longest hits >3kb from each genome for their sub-sequences homologous to ORF1 and ORF2 using BLAST (Madden, 2013). ORF1 and ORF2-homologous loci were identified using blastx to search each of the query L1 nucleotide sequences against a custom database of L1HS’s ORF1 and ORF2 protein products: L1RE1 (UniProt: Q9UN81) and LORF2 (UniProt: O00370).

For each subfamily, for each genome, for each ORF, a new multiple sequence alignment was created using the codon-aware aligner MACSE (Ranwez et al., 2018). Briefly, MACSE assumes that all nucleotide sequences in an alignment originated from a common protein-coding ancestor, and extends the classical Needleman-Wunsch alignment generation and scoring method to penalize more severely the introduction of mismatches and indels that introduce amino acid changes or frame shifts. I then submitted each of these codon-aware ORF1 or ORF2 alignments to FastML for ancestral sequence reconstruction as described above, producing a total of 865 ORF1 reconstructions and 905 ORF2 reconstructions.

4.2.4.1 Automated selection of the best ORF1 and ORF2 sequences

For each subfamily, I selected the best ORF1 and ORF2 reconstruction based on (1) homology to the expected protein domains and (2) whether those domains occurred in the same reading frame. I assessed homology to protein domains by translating my reconstructed ORFs in all 3 reading frames and searching the amino acid sequences against NCBI’s Conserved Domain Database (Lu et al., 2020). Homology hits for domains that were truncated or split over multiple reading frames were considered absent. For ORF1 reconstructions, homology to the Transposase 22 trimerization, Transposase 22 dsRBD and Transposase 22 ssRBD domains was considered as selection criteria. For ORF2, homology to the Endonuclease, Reverse transcriptase, and DUF1725 domains was considered as selection criteria. For each subfamily and ORF combination, the best ORF reconstruction was the one that achieved the nearest length to the L1HS-encoded protein (338aa for ORF1 or 1275aa for ORF2), that encoded homology to the greatest number of the selection criteria domains listed above, and that encoded those domains in the fewest different reading frames (i.e. included the fewest observable frame shifts).
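Translating a reconstruction in all three forward reading frames, as done before the domain search, can be sketched in Python as follows. The function names are hypothetical and the standard genetic code is assumed; the search against the Conserved Domain Database is not reproduced here. Counting stop codons ('*') per frame gives a quick indication of how disrupted each frame is.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide sequence from its first base.

    Trailing bases that do not fill a codon are ignored; codons with
    characters outside ACGT are translated as 'X'.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def translate_three_frames(seq):
    """Return the translations for forward reading frames 0, 1 and 2."""
    return [translate(seq[offset:]) for offset in range(3)]


def stop_codons_per_frame(seq):
    """Count stop codons in each forward reading frame."""
    return [protein.count("*") for protein in translate_three_frames(seq)]
```

For the toy input `"ATGGCCTAA"`, the three frames translate to `"MA*"`, `"WP"` and `"GL"`.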


4.2.5 Composite progenitor sequence reconstruction

Given a single best reconstruction for each subfamily based on near full-length input sequences (described in 4.2.3) and a best reconstruction for each of the ORF1- and ORF2-coding sequences refined using the codon-aware aligner MACSE (described in 4.2.4), these sequences can now be combined to create a ‘composite’ progenitor sequence. For each subfamily, I used the full-length progenitor sequence as a starting template. For each of ORF1 and ORF2, if a successful refined reconstruction was available, I replaced the ORF1- and ORF2-like sub-sections of the full-length progenitor sequence with the ORF1 and ORF2 reconstructions, respectively. I identified the ORF1- and ORF2-like sub-sequences using blastx from BLAST (Madden, 2013) to search each best full-length progenitor nucleotide sequence against a custom database of L1HS’s ORF1 and ORF2 protein products: ORF1p (UniProt: Q9UN81) and ORF2p (UniProt: O00370).
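The replacement step can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `build_composite` is hypothetical, the ORF-like regions are given as 0-based half-open coordinates on the template (in practice these would be derived from the blastx hits), and the regions are assumed not to overlap.

```python
def build_composite(template, replacements):
    """Replace sub-sections of template with refined reconstructions.

    replacements: list of (start, end, new_seq) with 0-based, half-open
    coordinates on the original template. An empty list returns the
    template unchanged (no refined ORF available).
    """
    ordered = sorted(replacements, key=lambda r: r[0])
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("replacement intervals overlap")

    composite = template
    # Work from the 3' end backwards so that coordinates of the
    # not-yet-replaced upstream regions stay valid.
    for start, end, new_seq in reversed(ordered):
        composite = composite[:start] + new_seq + composite[end:]
    return composite
```

Applying the replacements from the 3ʹ end means a refined ORF2 of a different length than the region it replaces does not shift the coordinates of the ORF1-like region upstream.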

4.2.6 Composite reconstructed progenitor sequence validation

4.2.6.1 Phylogenetic and age relationships between composite sequences

The composite sequences were aligned using Muscle (Edgar, 2004) and a maximum likelihood phylogeny was constructed using FastTree (Price, Dehal and Arkin, 2009), assuming a general time reversible substitution model. The topology of the tree was compared to the age relationships interpreted from the distribution of RepeatMasker hits in ancestral reconstructed genomes, and to the relationships represented by the topology of the phylogeny I constructed by the same methods using the most likely linear sequences derived from Dfam 3ʹ end HMMs. Composite sequences that varied dramatically from the observed age ordering and phylogenetic relationships were removed, resulting in a total of 55 plausible composite sequences.

4.2.6.2 ORF0 homology

ORF0 homology was queried using tblastn from BLAST (Madden, 2013) to search the 5ʹ ends of the composite sequences against a custom database containing the ORF0 amino acid sequence reported by Denli et al. (2015).

4.2.6.3 Integration with KZFP binding data: L1MC1 case study

The composite reconstructed progenitor sequence for L1MC1 was compared with in vivo binding sites for KZFPs identified in ChIP-seq and ChIP-exo data. Next, the ages of the KZFPs that bind L1MC1 were compared to the observed age of L1MC1 based on RepeatMasker hits in ancestral reconstructed genomes. Finally, in vivo and in silico motifs for KZFPs that bound L1MC1 in vivo were identified on the reconstructed L1MC1 progenitor sequence.

4.2.6.3.1 ChIP-seq and ChIP-exo in vivo KZFP binding sites

Barazandeh et al. (2018) previously integrated data from 132 C2H2 ZFPs’ ChIP-seq experiments (including 60 KZFPs) from Schmitges et al. (2016) and 222 KZFPs’ ChIP-exo experiments from Imbeault et al. (2017). This analysis reported a motif for each of 242 unique Human KZFPs, as well as enrichment data for ChIP-seq/exo peaks overlapping TE loci. Here, Marjan Barazandeh defined C2H2 ZFP-L1 pairs as having significant overlap if they (1) achieved enrichment in genomic loci defined as P < 1e-10 (Fisher’s Exact Test, one-tailed), and (2) at least 10 of the top 500 ChIP-seq/exo peaks for the KZFP overlapped loci corresponding to the L1 based on the RepeatMasker annotation of hg38 available from UCSC Genome Browser (Lee et al., 2020). L1MC1 was selected as a case study example for interaction with KZFPs, because it was found to overlap significantly with a relatively large number of KZFPs (five), and its advanced age exploits the extended scope of analysis empowered by the novel reconstructed progenitor sequences described here.

4.2.6.3.2 Age relationships between L1MC1-binding KZFPs and L1MC1

Ages of KZFPs based on their consensus species distributions across a number of gene databases were sourced from Litman and Stein (2019). The authors numbered internal nodes on the tree of life and identified the most commonly reported monophyletic clade for Human protein-coding genes, reporting a ‘modal value’ representing the ancestral node in which each gene must have arisen. I created the following correspondence between their modal values and my ancestral genomes, which have ages sourced from TimeTree: <17, Armadillohg; 17, SN Molehg; 18, Pikahg; 19.1, S Monkeyhg; 19.2, Chimpanzeehg.

4.2.6.3.3 Identification of KZFP binding sites on the L1MC1 reconstructed progenitor sequence

Putative KZFP binding sites on the L1MC1 composite progenitor sequence were identified by Smith-Waterman alignment of each KZFP’s position frequency matrix (PFM) representing its binding preferences derived from ChIP-seq/exo experiments. PFMs were sourced from Barazandeh et al. (2018) and alignments were conducted by Abhimanyu Banerjee (Anshul Kundaje lab, Stanford University Departments of Genetics and Computer Science). Predicted binding preferences for each of the five KZFPs were generated based on amino acid sequences using B1H-RC (Najafabadi et al., 2015). Smith-Waterman alignments were then conducted between the experimental motifs and the B1H-RC predicted binding preferences for each KZFP.

4.3 Results and Discussion

4.3.1 RepeatMasker annotations in ancestral genomes correspond to L1 subfamily relative ages and species distributions

RepeatMasker hits in ancestral genomes and hg38 are presented in Figure 4.3. For each subfamily, the average sequence divergence between subfamily member loci in hg38 (at the 3ʹ end) and the Dfam 3ʹ end consensus model, which should be proportional to the age of the subfamily, is plotted in Figure 4.3 A. Figure 4.3 B shows the species distribution of each subfamily as reported by Dfam. Both lines of evidence show good correspondence between Dfam and my RepeatMasker results – for example, Primate-distributed subfamilies are exclusively found in genomes younger than Bushbabyhg (74 MYA), and these subfamilies typically show lower divergence from their 3ʹ end consensus model than Euarchontoglires- or Eutherian-distributed subfamilies (Figure 4.3 C). Overall, the distribution of RepeatMasker hits is consistent with prior knowledge.

4.3.2 RepeatMasker hits in ancestral genomes are similar in length to those in hg38, and less divergent from consensus models

RepeatMasker annotations in ancestral genomes exhibit similar maximum lengths to those annotated in hg38 (Figure 4.4 A), but a striking reduction in average divergence from the 3ʹ end consensus HMM (Figure 4.4 B). This convergence toward the 3ʹ end HMM consensus model suggests that the sequences in older genomes more closely resemble a common progenitor sequence, validating the utility of ancestral reconstructed genomes for progenitor sequence reconstruction. It is noteworthy that this reduction in sequence divergence is apparently small for the oldest L1 subfamilies (e.g. L1ME and older, right side of Figure 4.4 B). This may be because the oldest genome in my dataset (the Eutherian ancestor, Armadillohg – 105MYA), is much older than those subfamilies, or because the 3ʹ end HMM models for those subfamilies are a relatively poor approximation of the progenitor sequence.

[Figure 4.3 image not reproduced. Panels as labelled in the figure: Kimura divergence (Dfam); species distribution (Dfam: Eutheria, Euarchontoglires, Primates, Simiiformes or more recent); normalized RepeatMasker hit count for each of the 67 L1 subfamilies in each ancestral Human genome. Genome ages (MYA): hg38, 0; Chimpanzeehg, 6.65; Gorillahg, 9.06; Orangutanhg, 15.2; Gibbonhg, 20.19; G Monkeyhg, 29.44; S Monkeyhg, 43.2; Bushbabyhg, 74.0; CT Shrewhg, 82.0; Pikahg, 90.0; SN Molehg, 96.0; Armadillohg, 105.0.]

Figure 4.3. Overview of phylogenetic relationships between Human LINE L1 subfamilies and available consensus models and progenitor sequence reconstructions.

A. Phylogenetic distance from Dfam consensus model for subfamily hits in hg38, as reported by Dfam (Hubley et al. 2016).
B. Counts of L1 loci in each of the genomes, for each subfamily. Annotated using RepeatMasker and Dfam consensus models for each subfamily.
C. Species distribution of each subfamily, as reported by Dfam (Hubley et al. 2016).

[Figure 4.4 image not reproduced. Panel axes, each plotted by subfamily and coloured by ancestral Human genome: A, RepeatMasker hit length (kb), with 7kb marked; B, average divergence from Dfam model (Kimura 2 Parameter); C, reconstructed sequence length (kb), with 7kb marked; D, identity to gold standard sequence (%), with 95% marked.]

Figure 4.4. Lengths and sequence identities to gold standards for RepeatMasker hits and full-length reconstructed sequences.

A. Lengths of L1 hits for each subfamily in each ancestral genome (from Figure 4.3 B).
B. Average phylogenetic distance from consensus models for L1 hits for each subfamily in each ancestral genome (from Figure 4.3 B). Phylogenetic distance is as described in the caption for Figure 4.3 A.
C. Lengths of full-length ancestral sequence reconstructions for every subfamily, in every ancestral genome.
D. Percent identities to gold standard sequences for full-length ancestral sequence reconstructions for every subfamily, in every genome.

4.3.3 Full-length ancestral sequence reconstruction

I used FastML with each of two indel reconstruction models to conduct ancestral sequence reconstructions for each genome and subfamily combination, for a total of 1134 ‘full-length’ progenitor reconstructions – up to 24 for each subfamily (two per source genome). (These are called full-length because for each input sequence, the entire RepeatMasker hit was used, rather than reconstructing sub-sequences as described later). I scored the full-length reconstructions for sequence identity to gold standards: prior full-length consensus models for L1HS-L1MA4 from Khan et al. 2006, and consensus sequences from Dfam and Repbase (which are similar but not always identical). The lengths and identities to gold standards for all L1 reconstructions (Figure 4.4 C-D) show agreement across source genomes for younger subfamilies (e.g. L1MA10 and younger), but variability based on source genome for older subfamilies. In most of the older subfamilies, an improvement in at least one of length or percent identity to gold standard sequences is readily observable (e.g. L1MB4, L1MB5, L1ME2Z). Based on length and percent identity to gold standards, a single best full-length reconstruction was selected for each of the 67 L1 subfamilies (Figure 4.5 A). In 61/67 cases, the best full-length progenitor reconstruction came from an ancestral genome rather than hg38, further validating the utility of the ancestral genomes for reconstruction of ancient TEs. Additionally, the source ancestral genome approximately corresponds to the relative divergence (a proxy for age) (Figure 4.3 A) – older L1 subfamilies’ progenitor sequences were reconstructed based on subfamily homology hits from older genomes (Figure 4.5 B). Homology to the recently discovered antisense ORF at the 5ʹ UTR of L1PA8-L1HS, ORF0 (Denli et al., 2015), was also detectable in the 5ʹ ends of my reconstructions for L1PA8-L1HS, validating the reconstructions of the 5ʹ ends (Figure 4.6).

Frequency-based consensus reconstruction was also considered as an alternative to ancestral sequence reconstruction. This method generally produced more truncated reconstructions with poorer capture of homology to expected protein domains. For example, in the case of L1MC1 (Figure 4.7 A-B), the frequency-based consensus is severely truncated at the 5ʹ end, while the ancestral reconstructed sequence achieves normal length for an active L1. Additionally, the ancestral reconstructed sequence contained fewer frame shift mutations in ORF2 protein domains. However, even the ancestral reconstructed L1MC1 progenitor sequence lacked homology to ORF1 protein domains. This observation was common and motivated targeted reconstruction of the protein-coding regions, ORF1 and ORF2.

[Figure 4.5 image not reproduced. Panel A workflow steps: (1) create 1134 ancestral reconstructed sequences (RSs; 12 genomes x ≤ 67 subfamilies x 2 FastML methods); (2) get percent identity between each RS and subfamily gold standards (GSs), using MUSCLE to align RSs to GS consensus sequences from Dfam, Repbase, and Khan et al. 2006; (3) consider only RSs with high identity to GSs: for each subfamily, discard RS-GS pairs where identity is not within 1.5% of the highest RS-GS identity in the same subfamily; (4) normalize lengths: consider all lengths of 6-8kb equivalent and discard RSs with lengths >8kb; (5) select the longest sequence with highest identity to a GS: rank order RSs by normalized length, then by GS identity, and select the top sequence. Worked example in panel A (identity, length): RS1, 0.988, 6242; RS2, 0.995, 6299; RS3, 0.962, 6725; RS4, 0.922, 6101; RS5, 0.991, 8225; RS6, 0.996, 5143. Panel B: best RS length (kb) for each of the 67 subfamilies; fill, identity to gold standard (%); colour bar, source ancestral Human genome; marker, published full-length model available.]

Figure 4.5. Best full-length reconstructed sequences.

A. Full-length progenitor sequence reconstruction workflow.
B. Length, identity to gold standard, and source genome for the best reconstructed progenitor sequence from each L1 subfamily. Height of bar indicates length of reconstructed sequence, and fill intensity indicates the best percent identity that the reconstruction had to any of the gold standard sequences for that subfamily. Colour bar at bottom indicates the genome from which the best reconstructed sequence was sourced.

[Figure 4.6 image not reproduced. Axes: subfamily reconstructed sequence; ORF0 homology (BLAST -log(E-value)).]

Figure 4.6. ORF0 homology detected in full-length reconstructed progenitor sequences.

BLAST scores for the ORF0 amino acid sequence from Denli et al. (2015) searched against a database of the full-length progenitor nucleotide sequences. Denli et al. reported ORF0 detected in L1PA8-L1HS. Subfamily progenitor sequences with BLAST Bit Scores < 25 for ORF0 are excluded.

Figure 4.7 Quality comparison for various reconstruction methods

A. L1MC1 reconstruction based on Muscle alignment followed by frequency-based consensus construction using HMMER. The sequence is 5ʹ truncated, missing ORF1. Domain annotations are based on NCBI CDD, where each row of the image represents domains found in one of three reading frames (RF). Annotations of ORF1 homologous regions are approximate.
B. L1MC1 full-length ancestral sequence reconstruction based on Muscle alignment followed by FastML, as detailed in Figure 5 and represented in Figure 6.
C. L1MC1 composite ancestral sequence reconstruction based on (1) Muscle alignment followed by FastML, (2) ORF reconstruction with MACSE and FastML, and (3) replacing the ORFs in the full-length reconstruction with the best reconstructed ORFs.


4.3.4 Targeted ORF reconstruction

MACSE (codon-aware aligner) and Muscle (standard multiple sequence aligner) were compared for their efficacy in reconstructing ORF sub-sequences. For both ORF1 and ORF2, reconstructions downstream of MACSE alignments exhibited a marked reduction in stop codons per amino acid, without a reduction in overall length of the reconstructed sequence (Figure 4.8 A). Therefore, MACSE was used upstream of ancestral sequence reconstruction to reconstruct ORF1 and ORF2 for each genome and subfamily combination, and the success of each reconstruction was scored based on (1) the length of the reconstructed sequence, and (2) detectable homology to expected conserved domains (Figure 4.8 B).
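The stop-codon comparison described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a single forward reading frame, and it normalizes stop codons by the number of translated positions (stops included). The function names are illustrative and are not taken from the analysis code used in this thesis.

```python
# Illustrative sketch only; not the pipeline used for the reported results.
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq, frame=0):
    """Translate one forward frame; codons with ambiguous bases become 'X'."""
    seq = nt_seq.upper().replace("-", "")  # drop alignment gaps
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def stops_per_amino_acid(nt_seq, frame=0):
    """Return (stop count, translated length, stops per translated position)."""
    protein = translate(nt_seq, frame)
    stops = protein.count("*")
    length = len(protein)
    return stops, length, (stops / length if length else 0.0)
```

Applied to the ORF-homologous subsequence of a Muscle-based and a MACSE-based reconstruction of the same subfamily, the third value gives a length-normalized measure on which the two aligners could be compared.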

In ORF1 reconstructions (Figure 4.8 C, top), the best reconstructed sequence typically included both the Transposase 22 dsRBD and Transposase 22 domains, with the exception of some older subfamilies, L1MB8, L1ME3B, L1ME3D, and L1ME3F, each of which was missing homology to one domain. In most cases, both domains were in the same reading frame, with the exceptions of L1MB4, L1MC4A, and L1ME4B. 41/67 ORF1 reconstructions achieved at least 90% of the 338 amino acid expected length for the ORF1p protein. ORF1 truncations are readily explained by the fact that ORF1 occurs at the 5ʹ end of the L1 sequence, which is truncated in the majority of insertions.

In ORF2 reconstructions (Figure 4.8 C, bottom), the best reconstructed sequence typically included homology to the Reverse Transcriptase and Endonuclease domains, again excepting a few of the oldest subfamilies – L1MC4A, L1ME2Z, L1ME3, L1ME3B, L1MD3, and L1ME5. Like ORF1, frame shifts occurred between protein coding domains in a minority of cases – L1PB1, L1MA5, L1MB1, L1MA10, and L1ME2. 60/67 ORF2 reconstructions achieved at least 90% of the 1275 amino acid expected length for the ORF2p protein.
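The length criterion used to score the reconstructions amounts to a simple threshold, sketched here under stated assumptions. The expected lengths (338 and 1275 amino acids) are taken from the text; the subfamily names and lengths used with this sketch would be supplied by the reader, and the helper names are hypothetical.

```python
# Illustrative sketch; expected lengths are those stated in the text.
EXPECTED_LENGTH_AA = {"ORF1": 338, "ORF2": 1275}


def meets_length_threshold(orf, length_aa, min_fraction=0.90):
    """True if a reconstruction reaches min_fraction of the expected length."""
    return length_aa >= min_fraction * EXPECTED_LENGTH_AA[orf]


def summarize(orf, lengths_by_subfamily, min_fraction=0.90):
    """Return the passing subfamilies and a 'passing/total' summary string."""
    passing = sorted(
        name
        for name, length in lengths_by_subfamily.items()
        if meets_length_threshold(orf, length, min_fraction)
    )
    return passing, f"{len(passing)}/{len(lengths_by_subfamily)}"
```

With the real reconstruction lengths as input, the summary string corresponds to the fractions reported above for ORF1 and ORF2.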

Promisingly, some of the oldest subfamilies – L1MC5, L1MC5A, L1ME3CZ, L1ME3G, L1ME4A, and L1ME4C – achieved near full-length ORF1 and ORF2 reconstructions exhibiting homology to all conserved domains, despite truncated reconstructions with poor homology to gold standards in the prior full-length reconstructions (Figure 4.5 B).

To summarize, the codon-aware aligner MACSE was used to support ancestral sequence reconstruction of defunct ORF1 and ORF2 sequences for each L1 subfamily. MACSE

Figure 4.8 panel content (graphic not reproduced). A Stop codons (count) against length (amino acids) for ORF1 and ORF2, each with Muscle and MACSE alignments. B ORF reconstruction pipeline: (1) get the longest hits for each subfamily, in each genome (RepeatMasker); (2) excise ORF1-like and ORF2-like sequence regions from each RepeatMasker hit (BLAST to identify ORF-like regions); (3) codon-aware alignment of the ORF1-like and ORF2-like sequence regions (MACSE); (4) ancestral sequence reconstruction of ORF1 and ORF2 (FastML); (5) select the ‘best’ ORF1 and ORF2 reconstructions based on conserved domains and frame shifts (NCBI Conserved Domain Search Tool). C, D Length (aa) and conserved domains (Transposase 22 dsRBD and Transposase 22 for ORF1; Reverse Transcriptase and Endonuclease for ORF2) for each of the 67 subfamilies, with a colour bar for the 12 ancestral Human genomes (legend labels: hg38, Orangutan, S Monkey, Pika, Chimpanzee, Gibbon, Bushbaby, SN Mole, Gorilla, G Monkey, CT Shrew, Armadillo).

Figure 4.8 Codon-aware alignment improves ORF reconstruction

A Comparison of ORF quality in Muscle and MACSE full-length L1 subfamily reconstructions. Number of in-frame stop codons observed in ORF1 (left) and ORF2 (right) in full-length FastML reconstructions based on alignments conducted with either a standard aligner, Muscle, or the codon-aware aligner, MACSE. In Muscle figures, each point represents the ORF-homologous subsequence of the best full-length reconstruction for one subfamily (as plotted in Figure 4.5B). In MACSE figures, each point represents a MACSE alignment of the same input sequences as the corresponding subfamily’s Muscle reconstruction. ORF-homologous subsequences were identified using BLAST. B Workflow applied for targeted ORF reconstruction using MACSE. C Best ORF1 reconstruction for each subfamily. Bar plot represents the length of the reconstructed sequence, compared to the expected length (which is the length of ORF1p, UniProt Q9UN81). In the heat map, non-truncated homology to expected conserved domains is represented as red fill. Conserved domains were identified using NCBI’s Conserved Domain Search Tool. A colour change in the upper (more C-terminal) domain indicates that a frame shift occurs between the domains. Truncated or absent domains appear as white. The colour bar at bottom indicates the source genome for the sequences used to reconstruct the best ORF1 for each subfamily. D Best ORF2 reconstruction for each subfamily. Interpret as (C). (The reference ORF2p is UniProt O00370.)

reconstruction for ORF1 yielded 41/67 reconstructions within 90% of the expected length, as well as 60/67 ORF2 reconstructions within 90% of the expected length. ORF1 (1013nt) and ORF2 (3824nt) comprise the majority of the 6-8kb sequence of a full-length L1, suggesting that targeted sub-sequence reconstruction for defunct protein-coding genes represents a powerful approach for faithful reconstruction of large regions of full-length progenitors in the most challenging cases.

4.3.5 Composite reconstructed progenitor sequences capture expected phylogenetic relationships and sequence components

To finally produce full-length progenitor sequences, the previously described full-length reconstructions and targeted ORF reconstructions were combined. The composite reconstructed progenitor sequence was produced by identifying the ORF1 and ORF2 homologous regions in the best full-length sequence and replacing each with the best targeted ORF1 and ORF2 reconstructions (Figure 4.9 A). 55 of the 67 subfamilies’ composite reconstructions were considered plausible because their phylogenetic relationships to one another demonstrated acceptable agreement with the phylogenetic relationships of the 3ʹ end consensus sequences from Dfam (Figure 4.9 B), as well as with the relative ages of the subfamilies based on sequence divergence between RepeatMasker hits in hg38 and the Dfam consensus sequence (Figure 4.3 A). All of the 55 composite sequences contained targeted reconstructions for ORF1 and ORF2, except the two oldest: L1ME3E (no successful targeted ORF reconstructions) and L1ME3D (missing ORF2 targeted reconstruction) (Figure 4.10). 44/55 subfamilies’ composite progenitor reconstructions contained a 5ʹ UTR of at least 400nt, and 47/55 subfamilies’ progenitor reconstructions achieved a total length of at least 6kb. Altogether, the composite sequences generally capture the expected lengths of L1 progenitor sequences, and the majority contain 5ʹ ends and encode the expected protein products with homology to their substituent conserved domains (Figure 4.7 C).
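The splicing step that produces a composite sequence can be sketched as follows. This is a simplified illustration, not the code used here: it assumes that the ORF-homologous regions have already been located on the full-length reconstruction (for example by BLAST) and are supplied as 0-based, half-open coordinates, and the function and argument names are hypothetical.

```python
# Illustrative sketch; names and coordinate convention are assumptions.
def build_composite(full_length, orf_regions, targeted_orfs):
    """Replace ORF-homologous regions with targeted ORF reconstructions.

    full_length:   best full-length reconstructed sequence (string)
    orf_regions:   {"ORF1": (start, end), ...}, 0-based half-open coordinates
    targeted_orfs: {"ORF1": sequence, ...}; a region with no targeted
                   reconstruction keeps its full-length sequence
    """
    pieces = []
    cursor = 0
    ordered = sorted(orf_regions.items(), key=lambda item: item[1][0])
    for name, (start, end) in ordered:
        if start < cursor or end < start or end > len(full_length):
            raise ValueError(f"invalid or overlapping region for {name}")
        pieces.append(full_length[cursor:start])  # flanking sequence, unchanged
        pieces.append(targeted_orfs.get(name, full_length[start:end]))
        cursor = end
    pieces.append(full_length[cursor:])  # remaining 3' sequence
    return "".join(pieces)
```

Leaving a region untouched when no targeted reconstruction is available mirrors the cases noted above, in which a composite lacks a targeted ORF2 or both targeted ORFs.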

In sum, the results presented here represent full-length or near full-length reconstructions for all 67 L1 subfamilies’ progenitor sequences, where 55/67 are further reconstructed as composites of the full-length reconstructions and targeted ORF reconstructions with good phylogenetic agreement to prior knowledge, and 45/67 have not been previously described to my knowledge.

Figure 4.9 panel labels (graphic not reproduced). A Workflow. B Left: 55 composite reconstructions (full-length + ORF reconstructions). Right: 67 consensus models (Dfam; 3ʹ ends only, median length = 925bp).

Figure 4.9 Composite reconstructed progenitor sequences

A Workflow used to produce composite reconstructed progenitor sequences by combining the best full-length reconstructed sequences (illustrated in Figure 4.5) and the best ORF1 and ORF2 targeted reconstructions (illustrated in Figure 4.8). B Phylogenies of the L1 subfamilies created using composite sequences (which are generally 6-8kb in length; left) and linear consensus models for L1 3ʹ ends extracted from Dfam HMMs (right). 55 of the 67 L1 subfamilies produced successful composite sequences and are presented here. Phylogenies were built using FastTree based on Muscle alignments of input sequences.


Figure 4.10 Lengths of composite reconstruction components

The length of each composite reconstruction’s 5ʹ end, ORF1, inter-ORF distance, ORF2, and 3ʹ end. ORFs were reconstructed with FastML following MACSE codon-aware alignment, and all other components are sourced from the best reconstruction where Muscle was used to align full-length sequences. 55 of the 67 L1 subfamilies produced successful composite sequences and are presented here.


Both the composite reconstructed progenitor sequences and the best full-length reconstructed progenitor sequences (before targeted ORF reconstruction) represent a significant improvement compared to the lengths of pre-existing consensus sequences from Dfam and Repbase, and the lengths of composite L1PB-L1PA sequences show near perfect agreement to the 22 previously reconstructed by Khan et al. (2005) (Figure 4.11).

4.3.6 Composite reconstructed progenitor sequences anchor integration of in vivo KZFP binding evidence and in silico-predicted KZFP binding preferences

The full-length L1 progenitor sequences described here provide an opportunity to identify direct specificity relationships between the putative active parent sequence of a given subfamily and the KZFPs that are observed to bind that subfamily’s loci in ChIP-seq/exo experiments. These relationships have been previously identified for L1PA sequences (Jacobs et al., 2014; Imbeault, Helleboid and Trono, 2017), and are replicated in the results presented here, since the sequences match those previously described. Such trends have not been explored for older subfamilies, however. Here, I describe an investigation of the relationships between the L1MC1 composite reconstructed progenitor sequence and the KZFPs that preferentially bind it in vivo, as part of an ongoing collaboration described below. The L1MC1 subfamily progenitor sequence was selected because previous investigations leveraging full-length copies of L1 subfamilies have been limited to Primate-specific L1PA subfamilies (Khan, Smit and Boissinot, 2005; Jacobs et al., 2014; Imbeault, Helleboid and Trono, 2017), whereas the high-quality reconstruction of a full-length L1MC1 progenitor sequence provides an opportunity to investigate KZFP-ERE coevolution in a subfamily ~53M years older than previously addressed. Of specific interest is the relationship between L1MC1 and the KZFPs ZNF248 and ZNF382. Curiously, these KZFPs appear to bind many or all L1PA subfamilies (Imbeault, Helleboid and Trono, 2017) and arose about the same time as L1MC1 (Litman and Stein, 2019).

In collaboration with Marjan Barazandeh, 5/314 unique Human C2H2 ZFPs were identified with significantly enriched binding at L1MC1 loci in ChIP-seq/exo experiments: ZIK1, ZNF248, ZNF382, ZFP69, and ZNF454 (P < 1e-10, Fisher’s Exact Test, and > 10 of the top 500 ChIP peaks overlapping) (Schmitges et al., 2016; Imbeault, Helleboid and Trono, 2017; Barazandeh et al., 2018) (Figure 4.12 A). All of these contain KRAB domains (Lambert et al., 2018), and their ages closely correspond to the ~96M year age of L1MC1 according to species distribution of the


Figure 4.11 Reconstructed progenitor sequence lengths for 67 LINE L1 subfamilies

Lengths of reconstructed progenitors and composite reconstructed progenitors compared to the currently available consensus sequences from Dfam (Smit et al. 1995; Hubley et al. 2016) and Repbase (Jurka 2000), and previously described full-length reconstructions for the youngest 22 subfamilies from Khan et al. (2005).

Figure 4.12 Integration of reconstructed L1MC1 progenitor sequence with KZFP DNA-binding data from ChIP-seq/exo experiments

A Overlap between 314 unique C2H2 ZFPs’ binding sites in hg38 and L1MC1 loci based on ChIP-seq/exo experiments. ChIP-seq data is from Schmitges et al. (2016) and ChIP-exo data is from Imbeault et al. (2017), reanalyzed and sourced from Barazandeh et al. (2018). Dotted lines represent significance thresholds for binding site overlap with L1MC1 loci: p < 1e-10 enrichment based on Fisher’s Exact Test, and ≥10 of the top 500 ChIP peaks. The bottom left quadrant contains data for 130 C2H2 ZFPs from ChIP-seq experiments (Schmitges et al. 2016) and 217 KZFPs from ChIP-exo experiments (Imbeault et al. 2017). B Normalized count of L1MC1 hits detected in each ancestral genome using RepeatMasker, and ancestral genome of origin for KZFPs with significant binding site overlap with L1MC1 loci (from A). Ancestral genome of origin for KZFPs was adapted from Litman and Stein (2019). C Putative binding sites on the L1MC1 composite reconstructed progenitor sequence for the KZFPs associated with L1MC1 in ChIP-seq/exo experiments (from A). Correspondence is indicated between the ChIP-seq/exo in vivo motif (“ChIP”) from Barazandeh et al. (2018), the reconstructed L1MC1 sequence from this investigation (“L1MC1”), and the in silico predicted motif based on the B1H-RC from Najafabadi et al. (2015) (“B1H-RC”).

KZFP-encoding genes (Litman and Stein, 2019) and the oldest reconstructed ancestral Human genome in which L1MC1 loci were annotated by RepeatMasker (Figure 4.12 B). Finally, binding sites for all five ChIP-seq/exo enriched KZFPs were identified on the L1MC1 progenitor sequence by Abhimanyu Banerjee, and these L1MC1 loci and in vivo motifs also show good correspondence to in silico predicted motifs based on the B1H-RC (Najafabadi et al., 2015) (Figure 4.12 C).
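The enrichment criterion applied to each C2H2 ZFP can be sketched as below. The layout of the 2x2 contingency table (top peaks versus a background set of regions, each either overlapping or not overlapping L1MC1 loci) is an assumption made for illustration, since the exact construction is not restated in this section, and the overlap cut-off is treated here as "at least 10 peaks", which is also an assumption. The one-sided Fisher's Exact Test is computed as a hypergeometric tail using only the Python standard library.

```python
# Illustrative sketch; the contingency-table layout is an assumption.
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in [[a, b], [c, d]].

    a: top peaks overlapping the element;  b: top peaks not overlapping
    c: background regions overlapping;     d: background regions not overlapping
    Returns P(X >= a) under the hypergeometric null.
    """
    total = a + b + c + d
    overlapping = a + c  # all regions that overlap the element
    peaks = a + b        # size of the peak set being tested
    upper = min(overlapping, peaks)
    numerator = sum(
        comb(overlapping, k) * comb(total - overlapping, peaks - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, peaks)


def is_enriched(a, b, c, d, alpha=1e-10, min_overlap=10):
    """Apply both thresholds: overlap count and one-sided p-value."""
    return a >= min_overlap and fisher_exact_greater(a, b, c, d) < alpha
```

Both conditions must hold for a factor to be called enriched, so a factor with a small p-value but too few overlapping top peaks would still be excluded.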

Altogether, it appears that the expansion of the L1MC1 subfamily distributed binding sites about the genome for five KZFPs that arose about the same time and shortly after. These findings represent in vivo and in silico support for KZFP-ERE coevolution, where the L1MC1 expansion may have been sufficiently deleterious to impose diversifying selection on the Human genome to acquire new KZFPs that could bind and silence the L1MC1 progenitor and its functional replicates. Interestingly, ZNF248 and ZNF382 were also identified to bind the ORF2 in all of the L1 subfamilies for which the binding position on the L1 sequence was reported (L1PA16-L1HS) by Imbeault et al. (2017), indicating that not all binding sites can necessarily be interpreted as arms race evidence, and suggesting that extension of this approach across all reconstructed L1s will support the identification of clear arms race dynamics for a subset of KZFPs bound. Given that coevolution evidence at the level of the L1 sequence position was previously limited to L1PAs, which are restricted to the last ~43M years of Human evolution (Figure 4.3 C), and that L1MC1 appears to have arisen ~96MYA (Figure 4.12 B), these findings extend sequence-level evidence of KZFP-L1 coevolution an additional ~53M years back in time.

4.4 Summary

LINE L1 retroelements comprise ~17% of the Human genome. These copy-and-paste TEs can be split into 67 subfamilies, each presumably defined by a single clonal parent, or progenitor sequence. For the youngest 22 subfamilies (Primate distributed, ~43M years old), a full-length progenitor sequence has been previously reconstructed. However, reconstruction of older subfamilies’ progenitor sequences is more difficult because they have accumulated more mutations over longer evolutionary time. I combined ancestral genome reconstruction and ancestral sequence reconstruction to “turn back the clock” on sequence decay and reconstructed full-length or near full-length progenitor sequences for all 67 L1 subfamilies, where the oldest subfamilies exceed 105M years in age (older than Eutherians). 55/67 subfamily progenitors were further reconstructed as composites of the full-length reconstructions and targeted ORF reconstructions with good phylogenetic agreement to prior knowledge, and 45/67 had not been previously described. The utility of these progenitor reconstructions is demonstrated in the case of L1MC1, where in vivo binding enrichment for five KZFPs at L1MC1 loci can be explained by positions on the reconstructed L1MC1 progenitor sequence that match the KZFPs’ experimental and predicted binding preferences. In conjunction with the age correspondence between L1MC1 and the KZFPs that bind it, these findings represent new sequence-level support for KZFP-L1 coevolution at a timescale ~53M years older than previously described. Altogether, these findings validate the L1 progenitor sequences reconstructed, and demonstrate their utility in extending farther back in time our interpretation of their contribution to the evolution of transcriptional regulation in the Human genome.


Chapter 5 Discussion

5.1 Chapter outline

DNA-binding specificity, PPIs, and spatiotemporal expression patterns are thought to be the most important aspects of TF functional diversification (Grove et al., 2009). Prior to the work described in this thesis, the largest and most rapidly expanding family of Human TFs – the C2H2 ZFPs – lacked these functional characterizations at a global scale. Furthermore, elucidation of the coevolution dynamics of KZFPs and EREs was hindered by both a paucity of DNA-binding data for KZFPs and insufficient resolution of ERE sequence evolution. As a result of the work described in this thesis, as well as contemporary publications, sufficient evidence now exists to compare the roles of DNA-binding specificity, PPIs, and expression patterns in the diversification of the C2H2 ZFPs, in order to support further investigations into the molecular elements of C2H2 ZFPs related to these functions and to empower further functional prediction from sequence. Together, these insights enable interpretation of the contributions to Mammalian genome regulation of the most numerous, dynamic, and diverse TFs and the mobile elements they bind.

In Chapter 2, I contributed to the curation of a database for all Human TFs, motivated in part by the creation of a list of the remaining putative Human TFs lacking an experimentally determined DNA-binding motif. I further investigated the tissue expression profiles and paralog relationships of Human TFs. I found that C2H2 ZFPs are virtually the only Human TFs to have arisen by duplication since the time of the presumed whole genome duplications in the Bilaterian ancestor and are distinct from all other TFs in that they are depleted for tissue-specific expression in adult tissues. I further explored whether Human Monodactyl ZFPs can function as sequence-specific DNA-binding proteins, and found that the Human paralogs ZNF608 and ZNF609 can do so independently of flanking basic residues. The results from Chapter 2 provide a definitive list of 428 putative Human TFs requiring motifs, raise questions as to the sequence distinctions that enable some Monodactyl ZFPs to bind DNA specifically, and enable direct comparisons of the spatial expression diversification and expansion of C2H2 ZFPs with those of all other TFs.

In Chapter 3, I conducted the first global analysis of the PPIs of C2H2 ZFPs using AP-MS data. I uncovered striking diversity in nuclear PPI profiles for 118 DNA-binding C2H2 ZFPs, and proposed that these diverse PPIs are probably explained by some combination of undiscovered PPI-mediating functions of the known PPI domains, ZF domains, or N-terminal unstructured regions. The results from Chapter 3 suggest that C2H2 ZFPs carry out extremely diverse functions related to transcription and chromatin architecture. These functions are even represented by the preys of KZFPs, which had been previously thought to function primarily as transcriptional silencers but have more recently been proposed to functionally diversify in some cases once the EREs they bind lose the ability to transpose. In this Chapter, I contextualize these diverse PPIs in terms of the global characterization of the C2H2 ZFPs, including reconciliation with the findings of more recent evidence about the function and evolution of KZFP PPIs and DNA-binding activity. I further describe specific experimental approaches that can be used to uncover the PPI-mediating sequence features and the implications of those results with respect to the long-standing paradox that C2H2 ZFPs encode more ZFs than they appear to use for DNA-binding. Finally, I address neglected evidence for roles outside the nucleus for a minority of C2H2 ZFPs.

In Chapter 4, I combined ASR, AGR, and codon-aware alignment to reconstruct full-length progenitor sequences for 67 Human LINE L1 subfamilies and demonstrated the utility of these sequences for integrating KZFP DNA-binding evidence with the progenitors of the L1 subfamilies they appear to target. The results from Chapter 4 show that ASR and AGR can be used to estimate the active progenitor sequences for ERE families in cases where a linked annotation and classification scheme are present, and this method could be extended to other ERE subfamilies beyond L1s. The integration with KZFP ChIP-seq/exo data and reconstructed L1 progenitors I described can also be extended to search for evidence of arms race/domestication models, or possibly reveal new patterns of coevolution between L1s and KZFPs. In this Chapter, I recommend a transposition assay that can be used to test the functionality of these reconstructed progenitors, bioinformatic and experimental methods to test the functions of enigmatic functional sequence elements of the L1 progenitors, and analytical and experimental methods to extend and validate the integration of KZFP DNA-binding evidence with the reconstructed progenitor sequences.


5.2 Perspectives and Future Directions

5.2.1 The DNA-binding preferences of every Human TF

A major motivation for the work described in Chapter 2 was the production of a bona fide list of all known and putative Human TFs. That list now provides a definitive set of 428 putative Human TFs that have no known DNA-binding motif. Despite substantial recent contributions to motif characterization for Human C2H2 ZFPs (Schmitges et al., 2016; Imbeault, Helleboid and Trono, 2017; Barazandeh et al., 2018), more than 300 of the Human TFs lacking a motif are C2H2 ZFPs. Given that C2H2 ZFPs exhibit by far the most diverse DNA-binding preferences of all Human TFs (Najafabadi et al., 2015; Lambert et al., 2018), the underrepresentation of their motifs indicates that a disproportionately large fraction of the Human regulatory lexicon remains undiscovered. Ongoing and future investigations leveraging newer DNA-binding assays like SMiLE-seq, and developing new assays, including modified versions of SELEX, will reveal the DNA-binding preferences of the most experimentally challenging TFs.

5.2.2 Functional diversification of C2H2 ZFPs: DNA-binding, PPIs, and expression patterns

Grove et al. (2009) propose DNA-binding preferences, PPIs, and spatiotemporal expression patterns as the predominant parameters for functional diversification of TF paralogs. Given the rapid lineage-specific proliferation of the C2H2 ZFPs, they must rapidly acquire new functions to escape redundancy with their paralogs (Nowick et al., 2011; Stubbs, Sun and Caetano-Anolles, 2011).

C2H2 ZFPs have long been thought to bind highly diverse DNA sequences, and work by my co-authors related to Chapter 2 and Chapter 3 now permits all-by-all comparison of motif similarities to demonstrate unequivocally that C2H2 ZFPs exhibit the most diverse DNA-binding preferences of any Human TF (Schmitges et al., 2016; Barazandeh et al., 2018; Lambert et al., 2018).

In Chapter 3, I reported the surprising result that C2H2 ZFPs, including those with known PPI domains, exhibit unexpectedly high diversity in their PPIs as well. This finding established a second parameter of rapid C2H2 ZFP functional diversification, although the mechanistic underpinnings remain to be explored (described further below). It is noteworthy that many of these interactions were with other TFs, especially other C2H2 ZFPs, and that those interactions were not limited to the subset of C2H2 ZFP baits with the known oligomerization domains, SCAN and BTB. Since the publication of the work described in Chapter 3, another study has interrogated the PPIs of 101 Human KZFPs using AP-MS (Helleboid et al., 2019). Similar to our findings, the authors reported a range of spectral count yields for TRIM28 (suggesting variable propensity for binding), and many interactions with chromatin proteins beyond TRIM28 (discussed further in Chapter 5.2.4).

However, in Chapter 2, I found that, compared to other TF families, C2H2 ZFPs are depleted for tissue-specific expression in adult tissues. My quantitative definition of tissue specificity follows definitions established by Uhlen et al. (2015) and permits the interpretation that C2H2 ZFPs exhibit relatively low diversity in tissue expression profiles compared to other Human TFs. These findings should be interpreted cautiously given that a survey of 37 adult tissues is not exhaustive, that temporal fluctuations in expression were not considered, and that the nuclear abundance of these proteins could alternatively be modulated post-translationally and thus would not be detected in the RNA profiles described in Chapter 2. Furthermore, any picture of C2H2 ZFP expression requires consideration of their role in embryonic stem cells, where KZFPs have been demonstrated to recruit TRIM28/KAP1 to ERE loci for apparently permanent transcriptional silencing (Rowe et al., 2010; Quenneville et al., 2012; Castro-Diaz et al., 2014). However, findings from previous investigations into the expression of 19 KZFPs across 13 Human tissues (Nowick et al., 2010) and more recently 222 KZFPs across 1800 Human tissues, cell types, and cell states from FANTOM5 (Imbeault, Helleboid and Trono, 2017) suggest that KZFPs do exhibit expression profile diversity when compared directly to one another rather than to other TFs. It is possible that KZFPs exhibit subtler expression profile divergence than other TFs, but that this divergence still accounts for important functional diversification. Altogether, it is not an overstatement to conclude that widespread evidence of functional diversification to the degree observed in other TFs in terms of spatiotemporal expression pattern is currently lacking among the C2H2 ZFPs, and warrants further exploration.

In sum, the functional diversification of C2H2 ZFPs has resulted in higher diversity in DNA-binding preference than other Human TFs, lower diversity in spatial expression profiles in adult tissues than other Human TFs, and surprisingly high diversity in PPI binding partners. Direct comparison of PPI diversity between C2H2 ZFPs and other Human TFs remains to be explored but could be conducted using the Human TF list and DBD annotations described in Chapter 2 along with a highly parallel PPI network dataset, for example the BioPlex network (Huttlin et al., 2015; Luck et al., 2017).

The result that C2H2 ZFPs exhibit low diversification in spatial expression patterns relative to other TFs permits speculation that they may play more general roles in chromatin regulation than other TFs with more cell type specificity. This interpretation fits with the established functions of some well-studied C2H2 ZFPs like CTCF, and a main outcome of the diversification model is expected to be the co-option of EREs as novel cis-regulatory elements by the KZFPs that originally arose to silence them (Imbeault, Helleboid and Trono, 2017).

5.2.3 Identifying fast-evolving PPI elements of the C2H2 ZFPs

In Chapter 3, AP-MS experiments proved a powerful technique for the unbiased discovery of new prey proteins for C2H2 ZFPs, but the sequence element(s) mediating such diverse PPIs with so few dedicated PPI domains is not explained. I listed three possible explanations, which I suspect are all contributors: (1) the PPI domains mediate more interactions than previously expected, (2) the ZF domains significantly contribute to total PPIs, and (3) long N-terminal unstructured regions play a significant role in PPIs, e.g. by functioning as scaffolds and providing unconstrained sequence for the rapid gain/loss of SLiMs/TADs.

The relative contribution of each of these sequence components could be determined by the sub-cloning of sequence elements and testing for the loss of PPIs observed for the full-length construct. In the simplest conception of this experiment, the sub-sequence clones would separate C-terminal ZF arrays from the N-terminal effector domains and/or long unstructured regions. These results alone would permit the quantification of the number of interactions dependent on the N-terminal unstructured domain, ZF domains, or both. Further sub-cloning of each effector domain or individual ZF would enable the attribution of each PPI to a specific effector domain or ZF. These experiments should be approached with caution, however, especially because the KRAB domain and large regions of the C2H2 ZFP N-termini are largely unstructured, suggesting that they may acquire new conformations upon binding and could be dependent on one direct interaction to facilitate another (Dyson and Wright, 2005; Mannini et al., 2006), and they may form “fuzzy complexes,” where the interaction is supported by weak contacts between the disordered region of the C2H2 ZFP and multiple subunits rather than an obvious binary interaction with an individual subunit (Tompa and Fuxreiter, 2008; Tuttle et al., 2018). These results could then be integrated with annotations of the ZFs predicted to be involved in DNA-binding based on the B1H-RC and established motifs. These findings would provide a global and quantitative perspective on the extent to which ZFs not apparently involved in DNA-binding function as PPI domains, contributing to the long-standing question as to why most C2H2 ZFPs contain more ZFs than required to bind highly specific sites in the Human genome (described in Chapter 1.3.1.2.1).
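The bookkeeping for the simplest version of this sub-cloning experiment can be sketched with set operations. The construct names and prey lists are hypothetical placeholders (only TRIM28 is a prey named in this thesis), and the categories are one possible way to summarize the outcome, not an established scoring scheme.

```python
# Illustrative sketch; categories and argument names are assumptions.
def attribute_ppis(full_length_preys, n_terminal_preys, zf_array_preys):
    """Classify preys of the full-length bait by which fragment retains them.

    Preys seen only with a fragment (not with the full-length construct)
    are ignored, because the question is which full-length PPIs are lost.
    """
    full = set(full_length_preys)
    n_term = full & set(n_terminal_preys)
    zf = full & set(zf_array_preys)
    return {
        "n_terminal_only": n_term - zf,      # retained by the N-terminal fragment alone
        "zf_array_only": zf - n_term,        # retained by the ZF array alone
        "either_fragment": n_term & zf,      # retained by both fragments
        "lost_in_both": full - n_term - zf,  # may require both regions together
    }
```

Given the caveats above about unstructured regions and fuzzy complexes, preys in the last category would be candidates for interactions that depend on the intact protein rather than on a single separable element.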

5.2.4 The evolutionary arc of the KZFPs

The results described in this thesis can be reconciled with prior and contemporary studies to paint a picture of the evolutionary ‘life cycle’ of an average Human KZFP.

The expansion of KZFPs in Mammals is contemporaneous with an expansion of EREs. In Chapter 2, I illustrated the paralog relationships of Human KZFPs to show that they have predominantly expanded in the Human lineage since the base of Amniotes, further increasing in Mammalia. While KZFPs are also found in non-Mammalian Vertebrates, the number of both ZFs and ERVs encoded by Mammalian genomes is higher than in non-Mammalian Vertebrates (Thomas and Schneider, 2011). Mammalian genomes tolerate larger complements of EREs, and it is thought that this tolerance for ERE expansion permits the competitive dynamics observed between L1 subfamilies in Mammals and not Fish (Furano, Duvernell and Boissinot, 2004). At the same time, antiviral APOBEC3 family proteins, which mutate ERV loci and appear to diversify in correlation with the extent of ERV germline colonization (Ito, Gifford and Sato, 2020), represent another anti-ERE molecular mechanism that also arose at the base of Mammals. Interestingly, the co-option of TEs in turn played a major role in the evolution of pregnancy, both as cis-regulatory elements (Lynch et al., 2015) and as protein products (Cornelis et al., 2015). What biological changes explain the increased ERE load in Mammalian genomes and the sudden diversification of adaptive mechanisms like APOBEC3s and KZFPs to combat ERE expansion? A definitive answer remains elusive, although explanations may stem in part from lower effective population sizes in Mammals, which reduce the power of purifying selection to remove ERE insertions (Furano, Duvernell and Boissinot, 2004), or from a role of the placenta in boosting maternal-fetal retroviral transmission via the bloodstream, increasing the chances for retrovirus endogenization in the germline (Haig, 2012).

In Chapter 3, I reported the surprising finding that only 38/55 KZFPs tested in AP-MS recruited TRIM28, and that TRIM28 recruitment was correlated with H3K9me3 at bound loci in ChIP-seq experiments for the same KZFPs (Schmitges et al., 2016). These observations suggested that a considerable fraction of KZFPs may not function as transcriptional repressors of EREs. Since then, further investigation into the DNA-binding activity of 222 KZFPs led to the domestication model, a theory that the binding of KZFPs to very old (defunct) EREs may be conserved because the EREs have been co-opted as cis-regulatory elements (Ecco, Imbeault and Trono, 2017; Imbeault, Helleboid and Trono, 2017). In this model, KZFPs binding older EREs are more likely free to lose TRIM28 recruitment and acquire new PPIs. Further evidence in support of this theory was published last year, where an AP-MS investigation on 101 KZFPs showed that a subset of the most ancient KZFPs that were tested lacked the ability to recruit TRIM28 and exhibited KRAB domains with higher sequence variability (Helleboid et al., 2019). Interestingly, these old KZFPs instead interacted with proteins related to chromatin organization and RNA processing, consistent with the annotations ascribed to the preys of KZFPs in Chapter 3. Altogether, since the publication of the work described in Chapter 3, new evidence has suggested that the observation that many KZFPs do not bind TRIM28 may be explained by a normal part of the life cycle of KZFPs: KZFPs binding EREs initially arise during the KZFP-ERE arms race, then KZFP-ERE interactions can be conserved long after the ERE becomes defunct if the KZFP acquires new functions. The domestication model raises an obvious question – what features of the KRAB sequence are predictive of silencing activity? Ongoing and future endeavours involving high-throughput interrogation of KRAB domains with diverse sequences may reveal the sequence-to-function rules that can be used to predict which KZFPs function as repressors in modern genomes.

5.2.5 Roles of C2H2 ZFPs outside the nucleus

The AP-MS data analysis in Chapter 3 focussed on the roles of C2H2 ZFPs inside the nucleus, because the C2H2 ZFPs included in the experiments had been demonstrated to bind specific DNA sequences based on ChIP-seq data. This much-needed analysis yielded the result that C2H2 ZFPs are generally likely to function in transcriptional regulation based on observed interactions with diverse transcription-related cofactors. However, roles of C2H2 ZFPs outside the nucleus remain unexplored.

Three of the 118 C2H2 ZFP baits in Chapter 3 retrieved relatively high spectral counts for a number of mitochondrial ribosomal protein subunits (MRPSs; e.g. MRPS6): ZNF684 interacted with 18 MRPSs, ZNF16 with 8 MRPSs, and ZNF22 with 4 MRPSs. ZNF684 contains a KRAB domain, while ZNF16 and ZNF22 have no annotated effector domains. All of these C2H2 ZFPs also interacted with nuclear proteins. These nuclear and mitochondrial PPIs, as well as the aforementioned ChIP-seq results, suggest that ZNF684, ZNF16, and ZNF22 may function in both the nucleus and mitochondrion. Additionally, while interactions with mitochondrial proteins were restricted to these three C2H2 ZFPs, PPIs with non-nuclear preys were common: 117/344 of the AP-MS preys that achieved at least one SAINT AvgP score equal to 1 with any of the C2H2 ZFPs were non-nuclear proteins. A more recent investigation into the PPIs of 101 KZFPs reported consistent results: in immunofluorescence imaging, 7/101 KZFPs localized to both the nucleus and cytoplasm, and an additional 4/101 demonstrated predominantly non-nuclear localization (Helleboid et al., 2019). While these fractions may appear small, this under-representative count of at least 14 unique C2H2 ZFPs that appear to function outside the nucleus is not insignificant.

Altogether, it appears that in some cell types a subset of C2H2 ZFPs may have predominantly non-nuclear roles, and many C2H2 ZFPs may function both inside and outside the nucleus. These results warrant further exploration into the localization of endogenously expressed C2H2 ZFPs in cell lines beyond HEK293Ts, starting with immunofluorescence imaging. Plausible biological explanations for roles outside the nucleus include interactions with RNA or other proteins. Potentially disappointing explanations, like the erroneous localization of affinity-tagged constructs, should not be overlooked, although it is tempting to imagine that some C2H2 ZFPs could function as TFs regulating the mitochondrial genome.

5.2.6 Functional and evolutionary assessment of reconstructed L1 progenitor sequences

In Chapter 4, I described the reconstruction of LINE L1 subfamily progenitor sequences. These reconstructions also present new opportunities to investigate other functional and evolutionary aspects of L1 sequences. Ideally, the composite reconstructed L1 progenitor sequences should be tested in some version of a previously described transposition assay, where the L1 progenitor locus is tagged with an antisense reporter interrupted by an intron such that successfully integrated reverse transcribed transcripts from the progenitor locus express the intact reporter (Moran et al., 1996; Kopera et al., 2016). Quantification of reporter expression could yield relative transposition efficiency scores for each subfamily progenitor, potentially revealing the sequence changes contributing to the competition between L1 subfamilies that has shaped the lineage.
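One way such relative scores might be computed is sketched below, under the assumption that each progenitor construct yields replicate reporter measurements (e.g. counts of reporter-positive colonies) and that one construct is designated as the reference. The function name, construct labels, and values are hypothetical and serve only to illustrate the normalization.

```python
from statistics import mean


def relative_efficiency(reporter_counts, reference):
    """Scale each construct's mean reporter signal to the reference construct.

    reporter_counts maps a construct label to a list of replicate values.
    """
    reference_mean = mean(reporter_counts[reference])
    if reference_mean == 0:
        raise ValueError("reference construct has no reporter signal")
    return {
        label: mean(values) / reference_mean
        for label, values in reporter_counts.items()
    }


# Placeholder labels and replicate values, for illustration only.
scores = relative_efficiency(
    {"reference": [100, 120, 110], "progenitor_X": [55, 50, 60]},
    reference="reference",
)
print(scores)
```

A real analysis would additionally need to account for transfection efficiency and construct expression, which this sketch does not model.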

Functional elements of the L1 progenitor sequences can also be compared to assess their evolution and possibly clarify their roles in L1 transposition. For example, the N-terminal coiled-coil Transposase 22 trimerization domain of ORF1 has been of long-standing interest. The coiled-coil domain’s nucleotide sequence exhibits signatures of positive selection in L1PA8-3, but not in L1PA17-8 (Boissinot and Furano, 2001; Khan, Smit and Boissinot, 2005). Recently, the elucidation of a crystal structure for the coiled-coil domain has led to the suggestion that positive selection on the coiled-coil domain could simply have resulted from compensatory mutations following an initial destabilizing deleterious mutation (Khazina and Weichenrieder, 2018). I think it is also worth exploring whether the rapid diversification of the coiled-coil domain has been driven by selection for (or resulted in) increased self-specificity. This could be tested in an all-by-all PPI assay of ORF1ps from L1PA17-1, measuring whether the ORF1ps from L1PA8-3 exhibit higher oligomerization specificity for self than for non-self ORF1ps. Leveraging my reconstructed L1 progenitor sequences, it is additionally possible to extend this analysis farther back in time to the L1ME-A subfamilies, to test whether the coiled-coil domain or other domains of L1 ORF proteins have undergone phases of diversifying selection in the past. It is also worth noting that the diversification of the coiled-coil domain begins in L1PA8, suspiciously coincident with the origin of the 5ʹ antisense ORF0, which encodes a transcript of unknown function that has been demonstrated to improve transposition efficiency (Denli et al., 2015). To my knowledge, a relationship between these coincident evolutionary events has not been explored.
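The self-specificity comparison proposed above could be summarized as a simple ratio, sketched here under the assumption that the all-by-all assay produces one interaction signal per ordered pair of ORF1ps. The scoring scheme (self signal divided by mean non-self signal), the function name, and the example values are illustrative assumptions, not an established metric.

```python
from statistics import mean


def self_specificity(signal):
    """Return self / mean(non-self) interaction signal for each ORF1p.

    signal[a][b] is the measured interaction between ORF1p a and ORF1p b.
    A ratio above 1 indicates a preference for self-association.
    """
    ratios = {}
    for a, row in signal.items():
        non_self = [value for b, value in row.items() if b != a]
        non_self_mean = mean(non_self)
        if non_self_mean == 0:
            ratios[a] = float("inf")  # no detectable non-self interaction
        else:
            ratios[a] = row[a] / non_self_mean
    return ratios


# Placeholder subfamily labels and signals, for illustration only.
example = {
    "sub_old": {"sub_old": 2.0, "sub_young": 2.0},
    "sub_young": {"sub_old": 2.0, "sub_young": 8.0},
}
print(self_specificity(example))
```

Under the hypothesis above, ORF1ps from L1PA8-3 would be expected to show higher ratios than those from the older subfamilies.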

5.2.7 Interpreting predicted KZFP binding sites on reconstructed L1 progenitor sequences

In Chapter 4, I demonstrated the utility of reconstructed L1 subfamily progenitor sequences for integrating KZFP DNA-binding evidence to illustrate sequence-level relationships, using L1MC1 as an example. This analysis should be extended across all the L1 subfamily progenitor sequences to identify the specific sequence changes that led to the loss of KZFP binding sites, which are expected to be primarily subtler than the 129 bp deletion eliminating the ZNF93 binding site in L1PA3-L1HS (Jacobs et al., 2014; Imbeault, Helleboid and Trono, 2017).

Using the cell lines generated for the transposition assay described in Chapter 5.2.6, the expression of KZFPs expected to bind and silence the L1 progenitors based on existing ChIP-seq/exo data could be tested to assess the relative silencing effect of each KZFP on the transposition efficiency of the reconstructed L1 subfamily progenitor sequence. This could help elucidate the combinatorial effects of KZFP binding, which is of particular interest because it is now clear that many EREs appear to contain functional binding sites for multiple KZFPs, and not all KZFP binding sites are subjected to the same degree of selection (some KZFP binding sites persist across many subfamilies while others are lost rapidly) (Imbeault, Helleboid and Trono, 2017). The interpretation of these results could be further integrated with the results of the aforementioned silencing assay (collaboration with M. Taipale lab, Chapter 5.2.4) to confirm whether the silencing effect of each KZFP is proportional to its relative ability to suppress L1 transposition.
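As a rough illustration of how the two assays might be compared, the sketch below computes a Spearman rank correlation between per-KZFP silencing scores and per-KZFP suppression of transposition. Note that a rank correlation tests for a monotonic relationship, which is a weaker condition than strict proportionality; this choice, the function names, and the input values are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for four hypothetical KZFPs, for illustration only.
silencing = [0.9, 0.4, 0.7, 0.1]
suppression = [0.8, 0.5, 0.6, 0.2]
print(spearman(silencing, suppression))
```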

5.2.8 The future of TE reconstruction

In Chapter 4, I described the combination of several bioinformatic methods to reconstruct plausible representations of active progenitor sequences representing Human LINE L1 subfamilies. The endeavour was successful in that full-length and near full-length reconstructions were achieved for much older subfamilies than previously described, and I demonstrated that these reconstructions can be used to relate DNA-binding evidence of TFs to estimated active TE sequences. However, many of the subfamilies, especially those older than L1MA8, still suffered from truncations of the 5ʹ end including the 5ʹ region of ORF1, presumably resulting from a paucity of full-length insertions detectable in the Human genome or the reconstructed ancestral genomes.

The ancestral genomes employed in Chapter 4 are produced from an alignment “referenced” to hg38, meaning that large regions of extant Mammalian genomes that are not homologous to any region of hg38 are excluded from the multiple whole genome alignment, and thus also excluded from the ancestral genome reconstructions. Given the relatively high quality of the hg38 assembly compared to other Mammalian genomes, referenced alignments boost the credibility of the reconstructed genomes by excluding regions of non-Human genomes that do not align to the Human genome, which could represent assembly errors, but could also descend from regions that have been lost in the modern Human genome. Referenced alignments also save on computing resources by limiting the size of the alignment and reconstructed genomes. As computing resources continue to become more accessible and genome assembly methods improve, unreferenced whole genome alignments can be used to produce more complete ancestral reconstructed genomes, which will likely uncover more L1 insertions, including more full-length insertions. In the future, I hope that the progenitor reconstruction method described in Chapter 4 will be repeated using an unreferenced alignment. I predict that an improved starting dataset will yield more complete L1 reconstructions, and possibly enable the extension of my method to older LINE L2s and L3s.

Similarly, I propose that the method described in Chapter 4 can be adapted to any transposon group for which an annotation method linked to a classification scheme is available, or such a classification scheme is inferred after de novo TE annotation.

5.3 Closing Remarks

The defining initiative of functional genomics is to understand and predict how the sequences encoded by the genome can be related to observed phenotypes. Over the course of my graduate studies, I have had the opportunity to approach this problem in the context of TFs, the proteins they bind and the sequences they recognize. I anticipate that my contributions will support important landmarks for the field of transcriptional regulation, including elucidation of a motif for every Human TF, the attribution of diverse PPIs to specific sequence elements of C2H2 ZFPs, and improved resolution of the nuanced evolutionary relationship between KZFPs and TEs. Ultimately, these results will resolve how the largest class of TFs and the TEs that comprise about half of our genomes have evolved together and independently to establish the dynamic regulation of the Human genome.

References

Ahmad, K. F. et al. (2003) ‘Mechanism of SMRT corepressor recruitment by the BCL6 BTB domain’, Molecular Cell, 12(6), pp. 1551–1564. doi: 10.1016/s1097-2765(03)00454-4.

Ahmad, K. F., Engel, C. K. and Privé, G. G. (1998) ‘Crystal structure of the BTB domain from PLZF’, Proceedings of the National Academy of Sciences of the United States of America, 95(21), pp. 12123–12128. doi: 10.1073/pnas.95.21.12123.

Allshire, R. C. and Madhani, H. D. (2018) ‘Ten principles of heterochromatin formation and function’, Nature Reviews Molecular Cell Biology. Nature Publishing Group, 19(4), pp. 229–244. doi: 10.1038/nrm.2017.119.

Aravind, L. et al. (2005) ‘The many faces of the helix-turn-helix domain: Transcription regulation and beyond’, FEMS Microbiology Reviews, 29(2), pp. 231–262. doi: 10.1016/j.femsre.2004.12.008.

Arendt, D. et al. (2016) ‘The origin and evolution of cell types’, Nature Reviews Genetics, 17(12), pp. 744–757. doi: 10.1038/nrg.2016.127.

Ashburner, M. et al. (2000) ‘Gene Ontology: tool for the unification of biology’, Nature genetics, 25(1), pp. 25–29. doi: 10.1038/75556.

Ashkenazy, H. et al. (2012) ‘FastML: a web server for probabilistic reconstruction of ancestral sequences’, Nucleic Acids Research, 40(W1), pp. W580–W584. doi: 10.1093/nar/gks498.

Badis, G. et al. (2008) ‘A Library of Yeast Transcription Factor Motifs Reveals a Widespread Function for Rsc3 in Targeting Nucleosome Exclusion at Promoters’, Molecular Cell, 32(6), pp. 878–887. doi: 10.1016/j.molcel.2008.11.020.

Badis, G. et al. (2009) ‘Diversity and complexity in DNA recognition by transcription factors’, Science (New York, N.Y.), 324(5935), pp. 1720–1723. doi: 10.1126/science.1162327.

Bao, Z. and Eddy, S. R. (2002) ‘Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes’, Genome Research, 12(8), pp. 1269–1276. doi: 10.1101/gr.88502.

Barazandeh, M. et al. (2018) ‘Comparison of ChIP-Seq Data and a Reference Motif Set for Human KRAB C2H2 Zinc Finger Proteins’, G3: Genes|Genomes|Genetics, 8(1), pp. 219–229. doi: 10.1534/g3.117.300296.

Barna, M. et al. (2002) ‘Plzf mediates transcriptional repression of HoxD gene expression through chromatin remodeling’, Developmental Cell, 3(4), pp. 499–510. doi: 10.1016/s1534-5807(02)00289-7.

Barski, A. et al. (2007) ‘High-resolution profiling of histone methylations in the Human genome’, Cell, 129(4), pp. 823–837. doi: 10.1016/j.cell.2007.05.009.

Belak, Z. R., Ficzycz, A. and Ovsenek, N. (2008) ‘Biochemical characterization of Yin Yang 1-RNA complexes’, Biochemistry and Cell Biology = Biochimie Et Biologie Cellulaire, 86(1), pp. 31–36. doi: 10.1139/o07-155.

Belak, Z. R. and Ovsenek, N. (2007) ‘Assembly of the Yin Yang 1 transcription factor into messenger ribonucleoprotein particles requires direct RNA binding activity’, The Journal of Biological Chemistry, 282(52), pp. 37913–37920. doi: 10.1074/jbc.M708057200.

Bellefroid, E. J. et al. (1991) ‘The evolutionarily conserved Krüppel-associated box domain defines a subfamily of eukaryotic multifingered proteins’, Proceedings of the National Academy of Sciences of the United States of America, 88(9), pp. 3608–3612. doi: 10.1073/pnas.88.9.3608.

Belshaw, R. et al. (2007) ‘Rate of Recombinational Deletion among Human Endogenous Retroviruses’, Journal of Virology. American Society for Microbiology Journals, 81(17), pp. 9437–9442. doi: 10.1128/JVI.02216-06.

Benos, P. V., Lapedes, A. S. and Stormo, G. D. (2002) ‘Probabilistic Code for DNA Recognition by Proteins of the EGR Family’, Journal of Molecular Biology, 323(4), pp. 701–727. doi: 10.1016/S0022-2836(02)00917-8.

Berardi, M. J. et al. (1999) ‘The Ig fold of the alpha Runt domain is a member of a family of structurally and functionally related Ig-fold DNA-binding domains’, Structure (London, England: 1993), 7(10), pp. 1247–1256. doi: 10.1016/s0969-2126(00)80058-1.

Berger, M. F. et al. (2006) ‘Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities’, Nature Biotechnology, 24(11), pp. 1429–1435. doi: 10.1038/nbt1246.

Berger, M. F. et al. (2008) ‘Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences’, Cell, 133(7), pp. 1266–1276. doi: 10.1016/j.cell.2008.05.024.

Berman, H. M. et al. (2002) ‘The Protein Data Bank’, Acta Crystallographica Section D, 58(6–1), pp. 899–907. doi: 10.1107/S0907444902003451.

Birtle, Z. and Ponting, C. P. (2006) ‘Meisetz and the birth of the KRAB motif’, Bioinformatics. Oxford Academic, 22(23), pp. 2841–2845. doi: 10.1093/bioinformatics/btl498.

Biscotti, M. A., Olmo, E. and Heslop-Harrison, J. S. P. (2015) ‘Repetitive DNA in eukaryotic genomes’, Chromosome Research: An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology, 23(3), pp. 415–420. doi: 10.1007/s10577-015-9499-z.

Blanchette, M. et al. (2004) ‘Reconstructing large regions of an ancestral mammalian genome in silico’, Genome Research, 14(12), pp. 2412–2423. doi: 10.1101/gr.2800104.

Bloor, A. J. C. et al. (2005) ‘RFP represses transcriptional activation by bHLH transcription factors’, Oncogene, 24(45), pp. 6729–6736. doi: 10.1038/sj.onc.1208828.

Boija, A. et al. (2018) ‘Transcription Factors Activate Genes through the Phase-Separation Capacity of Their Activation Domains’, Cell, 175(7), pp. 1842-1855.e16. doi: 10.1016/j.cell.2018.10.042.

Boissinot, S., Entezam, A. and Furano, A. V. (2001) ‘Selection against deleterious LINE-1-containing loci in the Human lineage’, Molecular Biology and Evolution, 18(6), pp. 926–935. doi: 10.1093/oxfordjournals.molbev.a003893.

Boissinot, S. and Furano, A. V. (2001) ‘Adaptive Evolution in LINE-1 Retrotransposons’, Molecular Biology and Evolution, 18(12), pp. 2186–2194. doi: 10.1093/oxfordjournals.molbev.a003765.

Bourque, G. et al. (2018) ‘Ten things you should know about transposable elements’, Genome Biology, 19(1), p. 199. doi: 10.1186/s13059-018-1577-z.

Boxer, L. D. et al. (2014) ‘ZNF750 interacts with KLF4 and RCOR1, KDM1A, and CTBP1/2 chromatin regulators to repress epidermal progenitor genes and induce differentiation genes’, Genes & Development, 28(18), pp. 2013–2026. doi: 10.1101/gad.246579.114.

Boyle, A. P. et al. (2011) ‘High-resolution genome-wide in vivo footprinting of diverse transcription factors in Human cells’, Genome Research, 21(3), pp. 456–464. doi: 10.1101/gr.112656.110.

Brayer, K. J., Kulshreshtha, S. and Segal, D. J. (2008) ‘The protein-binding potential of C2H2 zinc finger domains’, Cell Biochemistry and Biophysics, 51(1), pp. 9–19. doi: 10.1007/s12013-008-9007-6.

Brayer, K. J. and Segal, D. J. (2008) ‘Keep Your Fingers Off My DNA: Protein–Protein Interactions Mediated by C2H2 Zinc Finger Domains’, Cell Biochemistry and Biophysics, 50(3), pp. 111–131. doi: 10.1007/s12013-008-9008-5.

Brown, R. S. (2005) ‘Zinc finger proteins: getting a grip on RNA’, Current Opinion in Structural Biology, 15(1), pp. 94–98. doi: 10.1016/j.sbi.2005.01.006.

Bulyk, M. L. et al. (1999) ‘Quantifying DNA–protein interactions by double-stranded DNA arrays’, Nature Biotechnology, 17(6), pp. 573–577. doi: 10.1038/9878.

Bulyk, M. L. et al. (2001) ‘Exploring the DNA-binding specificities of zinc fingers with DNA microarrays’, Proceedings of the National Academy of Sciences, 98(13), pp. 7158–7163. doi: 10.1073/pnas.111163698.

Bunch, H. et al. (2014) ‘TRIM28 regulates RNA polymerase II promoter-proximal pausing and pause release’, Nature Structural & Molecular Biology, 21(10), pp. 876–883. doi: 10.1038/nsmb.2878.

Bürglin, T. R. (2011) ‘Homeodomain Subtypes and Functional Diversity’, in Hughes, T. R. (ed.) A Handbook of Transcription Factors. Dordrecht: Springer Netherlands (Subcellular Biochemistry), pp. 95–122. doi: 10.1007/978-90-481-9069-0_5.

Burns, K. H. and Boeke, J. D. (2012) ‘Human Transposon Tectonics’, Cell, 149(4), pp. 740–752. doi: 10.1016/j.cell.2012.04.019.

Cai, Y. et al. (2007) ‘YY1 functions with INO80 to activate transcription’, Nature Structural & Molecular Biology, 14(9), pp. 872–874. doi: 10.1038/nsmb1276.

Castro-Diaz, N. et al. (2014) ‘Evolutionally dynamic L1 regulation in embryonic stem cells’, Genes & Development, 28(13), pp. 1397–1409. doi: 10.1101/gad.241661.114.

Chauhan, N. et al. (2013) ‘Analysis of dimerization of BTB-IVR domains of Keap1 and its interaction with Cul3, by molecular modeling’, Bioinformation, 9(9), pp. 450–455. doi: 10.6026/97320630009450.

Chen, D. et al. (1999) ‘Regulation of Transcription by a Protein Methyltransferase’, Science, 284(5423), pp. 2174–2177. doi: 10.1126/science.284.5423.2174.

Choi, H. et al. (2011) ‘SAINT: probabilistic scoring of affinity purification–mass spectrometry data’, Nature Methods, 8(1), pp. 70–73. doi: 10.1038/nmeth.1541.

Christensen, R. G. et al. (2011) ‘A modified bacterial one-hybrid system yields improved quantitative models of transcription factor specificity’, Nucleic Acids Research. Oxford Academic, 39(12), pp. e83–e83. doi: 10.1093/nar/gkr239.

Chuong, E. B. et al. (2013) ‘Endogenous retroviruses function as species-specific enhancer elements in the placenta’, Nature Genetics, 45(3), pp. 325–329. doi: 10.1038/ng.2553.

Chuong, E. B., Elde, N. C. and Feschotte, C. (2017) ‘Regulatory activities of transposable elements: from conflicts to benefits’, Nature Reviews Genetics, 18(2), pp. 71–86. doi: 10.1038/nrg.2016.139.

Conaway, R. C. and Conaway, J. W. (2009) ‘The INO80 chromatin remodeling complex in transcription, replication and repair’, Trends in Biochemical Sciences, 34(2), pp. 71–77. doi: 10.1016/j.tibs.2008.10.010.

Cornelis, G. et al. (2015) ‘Retroviral envelope gene captures and syncytin exaptation for placentation in marsupials’, Proceedings of the National Academy of Sciences, 112(5), pp. E487–E496. doi: 10.1073/pnas.1417000112.

Corsinotti, A. et al. (2013) ‘Global and stage specific patterns of Krüppel-associated-box zinc finger protein gene expression in Murine early embryonic cells’, PloS One, 8(2), p. e56721. doi: 10.1371/journal.pone.0056721.

Cost, G. J. et al. (2002) ‘Human L1 element target-primed reverse transcription in vitro’, The EMBO journal, 21(21), pp. 5899–5910. doi: 10.1093/emboj/cdf592.

Craig, R. and Beavis, R. C. (2003) ‘A method for reducing the time required to match protein sequences with tandem mass spectra’, Rapid communications in mass spectrometry: RCM, 17(20), pp. 2310–2316. doi: 10.1002/rcm.1198.

Cullen, B. R. (2006) ‘Role and Mechanism of Action of the APOBEC3 Family of Antiretroviral Resistance Factors’, Journal of Virology. American Society for Microbiology Journals, 80(3), pp. 1067–1076. doi: 10.1128/JVI.80.3.1067-1076.2006.

Cusanovich, D. A. et al. (2014) ‘The Functional Consequences of Variation in Transcription Factor Binding’, PLoS Genetics, 10(3). doi: 10.1371/journal.pgen.1004226.

Daniel, C., Behm, M. and Öhman, M. (2015) ‘The role of Alu elements in the cis-regulation of RNA processing’, Cellular and molecular life sciences: CMLS, 72(21), pp. 4063–4076. doi: 10.1007/s00018-015-1990-3.

Dehal, P. and Boore, J. L. (2005) ‘Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate’, PLoS Biology. Edited by P. Holland, 3(10), p. e314. doi: 10.1371/journal.pbio.0030314.

Denli, A. M. et al. (2015) ‘Primate-Specific ORF0 Contributes to Retrotransposon-Mediated Diversity’, Cell, 163(3), pp. 583–593. doi: 10.1016/j.cell.2015.09.025.

Diallo, A. B., Makarenkov, V. and Blanchette, M. (2010) ‘Ancestors 1.0: a web server for ancestral sequence reconstruction’, Bioinformatics, 26(1), pp. 130–131. doi: 10.1093/bioinformatics/btp600.

Dunham, W. H., Mullin, M. and Gingras, A.-C. (2012) ‘Affinity-purification coupled to mass spectrometry: Basic principles and strategies’, Proteomics, 12(10), pp. 1576–1590. doi: 10.1002/pmic.201100523.

Dunwell, T. L. and Holland, P. W. H. (2016) ‘Diversity of Human and Mouse gene expression in development and adult tissues’, BMC Developmental Biology, 16(1), p. 40. doi: 10.1186/s12861-016-0140-y.

Dyson, H. J. and Wright, P. E. (2005) ‘Intrinsically unstructured proteins and their functions’, Nature Reviews Molecular Cell Biology, 6(3), pp. 197–208. doi: 10.1038/nrm1589.

Ecco, G., Imbeault, M. and Trono, D. (2017) ‘KRAB zinc finger proteins’, Development, 144(15), pp. 2719–2729. doi: 10.1242/dev.132605.

Eddy, S. R. (1998) ‘Profile hidden Markov models’, Bioinformatics, 14(9), pp. 755–763. doi: 10.1093/bioinformatics/14.9.755.

Edgar, R. C. (2004) ‘MUSCLE: multiple sequence alignment with high accuracy and high throughput’, Nucleic Acids Research, 32(5), pp. 1792–1797. doi: 10.1093/nar/gkh340.

Elbarbary, R. A., Lucas, B. A. and Maquat, L. E. (2016) ‘Retrotransposons as regulators of gene expression’, Science. American Association for the Advancement of Science, 351(6274). doi: 10.1126/science.aac7247.

Emerson, R. O. and Thomas, J. H. (2009) ‘Adaptive evolution in zinc finger transcription factors’, PLoS genetics, 5(1), p. e1000325. doi: 10.1371/journal.pgen.1000325.

Emerson, R. O. and Thomas, J. H. (2011) ‘Gypsy and the Birth of the SCAN Domain’, Journal of Virology, 85(22), pp. 12043–12052. doi: 10.1128/JVI.00867-11.

Fadloun, A. et al. (2013) ‘Chromatin signatures and retrotransposon profiling in Mouse embryos reveal regulation of LINE-1 by RNA’, Nature Structural & Molecular Biology, 20(3), pp. 332–338. doi: 10.1038/nsmb.2495.

Fasching, L. et al. (2015) ‘TRIM28 Represses Transcription of Endogenous Retroviruses in Neural Progenitor Cells’, Cell Reports, 10(1), pp. 20–28. doi: 10.1016/j.celrep.2014.12.004.

Fedotova, A. A. et al. (2017) ‘C2H2 Zinc Finger Proteins: The Largest but Poorly Explored Family of Higher Eukaryotic Transcription Factors’, Acta Naturae, 9(2), pp. 47–58. doi: 10.32607/20758251-2017-9-2-47-58.

Feschotte, C. (2008) ‘Transposable elements and the evolution of regulatory networks’, Nature Reviews Genetics, 9(5), pp. 397–405. doi: 10.1038/nrg2337.

Finn, R. D. et al. (2016) ‘The Pfam protein families database: towards a more sustainable future’, Nucleic Acids Research, 44(D1), pp. D279–D285. doi: 10.1093/nar/gkv1344.

Finn, R. D. et al. (2017) ‘InterPro in 2017—beyond protein family and domain annotations’, Nucleic Acids Research, 45(D1), pp. D190–D199. doi: 10.1093/nar/gkw1107.

Finn, R. D., Clements, J. and Eddy, S. R. (2011) ‘HMMER web server: interactive sequence similarity searching’, Nucleic Acids Research, 39(suppl), pp. W29–W37. doi: 10.1093/nar/gkr367.

Fliss, M. S., Hinkle, P. M. and Bancroft, C. (1999) ‘Expression cloning and characterization of PREB (Prolactin Regulatory Element Binding), a novel WD motif DNA-binding protein with a capacity to regulate prolactin promoter activity’, 13(4), p. 14.

Flynn, J. M. et al. (2020) ‘RepeatModeler2 for automated genomic discovery of transposable element families’, Proceedings of the National Academy of Sciences. National Academy of Sciences, 117(17), pp. 9451–9457. doi: 10.1073/pnas.1921046117.

Fox, A. H. et al. (1999) ‘Transcriptional cofactors of the FOG family interact with GATA proteins by means of multiple zinc fingers’, The EMBO Journal, 18(10), pp. 2812–2822. doi: 10.1093/emboj/18.10.2812.

Frietze, S. and Farnham, P. J. (2011) ‘Transcription Factor Effector Domains’, in Hughes, T. R. (ed.) A Handbook of Transcription Factors. Dordrecht: Springer Netherlands (Subcellular Biochemistry), pp. 261–277. doi: 10.1007/978-90-481-9069-0_12.

Fulton, D. L. et al. (2009) ‘TFCat: the curated catalog of Mouse and Human transcription factors’, Genome Biology, 10(3), p. R29. doi: 10.1186/gb-2009-10-3-r29.

Furano, A. V., Duvernell, D. D. and Boissinot, S. (2004) ‘L1 (LINE-1) retrotransposon diversity differs dramatically between mammals and fish’, Trends in genetics: TIG, 20(1), pp. 9–14. doi: 10.1016/j.tig.2003.11.006.

Gifford, R. J. et al. (2018) ‘Nomenclature for endogenous retrovirus (ERV) loci’, Retrovirology, 15(1), p. 59. doi: 10.1186/s12977-018-0442-1.

Gilmore, T. D. (2006) ‘Introduction to NF-kappaB: players, pathways, perspectives’, Oncogene, 25(51), pp. 6680–6684. doi: 10.1038/sj.onc.1209954.

Giordano, A. and Avantaggiati, M. L. (1999) ‘p300 and CBP: Partners for life and death’, Journal of Cellular Physiology, 181(2), pp. 218–230. doi: 10.1002/(SICI)1097-4652(199911)181:2<218::AID-JCP4>3.0.CO;2-5.

Godowski, P. J., Picard, D. and Yamamoto, K. R. (1988) ‘Signal transduction and transcriptional regulation by -LexA fusion proteins’, Science, 241(4867), pp. 812–816. doi: 10.1126/science.3043662.

Goerner-Potvin, P. and Bourque, G. (2018) ‘Computational tools to unmask transposable elements’, Nature Reviews Genetics, 19(11), pp. 688–704. doi: 10.1038/s41576-018-0050-x.

Goodarzi, A. A., Jeggo, P. and Lobrich, M. (2010) ‘The influence of heterochromatin on DNA double strand break repair: Getting the strong, silent type to relax’, DNA Repair. (A break is not the End; insight into the damage response to DNA double strand breaks), 9(12), pp. 1273–1282. doi: 10.1016/j.dnarep.2010.09.013.

Goodwin, T. and Poulter, R. (2002) ‘A group of deuterostome Ty3/gypsy-like retrotransposons with Ty1/copia-like pol-domain orders’, Molecular Genetics and Genomics, 267(4), pp. 481–491. doi: 10.1007/s00438-002-0679-0.

Grewal, S. I. S. and Jia, S. (2007) ‘Heterochromatin revisited’, Nature Reviews Genetics, 8(1), pp. 35–46. doi: 10.1038/nrg2008.

Grondin, B., Bazinet, M. and Aubry, M. (1996) ‘The KRAB zinc finger gene ZNF74 encodes an RNA-binding protein tightly associated with the nuclear matrix’, The Journal of Biological Chemistry, 271(26), pp. 15458–15467. doi: 10.1074/jbc.271.26.15458.

Groner, A. C. et al. (2010) ‘KRAB–Zinc Finger Proteins and KAP1 Can Mediate Long-Range Transcriptional Repression through Heterochromatin Spreading’, PLoS Genetics. Edited by H. D. Madhani, 6(3), p. e1000869. doi: 10.1371/journal.pgen.1000869.

Grove, C. A. et al. (2009) ‘A Multiparameter Network Reveals Extensive Divergence between C. elegans bHLH Transcription Factors’, Cell, 138(2), pp. 314–327. doi: 10.1016/j.cell.2009.04.058.

Gupta, A. et al. (2014) ‘An improved predictive recognition model for Cys(2)-His(2) zinc finger proteins’, Nucleic Acids Research, 42(8), pp. 4800–4812. doi: 10.1093/nar/gku132.

Haas, D. L. et al. (2003) ‘The Moloney Murine Leukemia Virus Repressor Binding Site Represses Expression in Murine and Human Hematopoietic Stem Cells’, Journal of Virology, 77(17), pp. 9439–9450. doi: 10.1128/JVI.77.17.9439-9450.2003.

Haig, D. (2012) ‘Retroviruses and the Placenta’, Current Biology, 22(15), pp. R609–R613. doi: 10.1016/j.cub.2012.06.002.

Hamilton, A. T. et al. (2006) ‘Evolutionary expansion and divergence in the ZNF91 subfamily of Primate-specific zinc finger genes’, Genome Research, 16(5), pp. 584–594. doi: 10.1101/gr.4843906.

Hayward, A., Cornwallis, C. K. and Jern, P. (2015) ‘Pan-vertebrate comparative genomics unmasks retrovirus macroevolution’, Proceedings of the National Academy of Sciences, 112(2), pp. 464–469. doi: 10.1073/pnas.1414980112.

Hedges, D. J. and Deininger, P. L. (2007) ‘Inviting instability: Transposable elements, double-strand breaks, and the maintenance of genome integrity’, Mutation Research, 616(1–2), pp. 46–59. doi: 10.1016/j.mrfmmm.2006.11.021.

Helleboid, P. et al. (2019) ‘The interactome of KRAB zinc finger proteins reveals the evolutionary history of their functional diversification’, The EMBO Journal, 38(18). doi: 10.15252/embj.2018101220.

Henikoff, S. (2008) ‘Nucleosome destabilization in the epigenetic regulation of gene expression’, Nature Reviews Genetics, 9(1), pp. 15–26. doi: 10.1038/nrg2206.

Herrero, J. et al. (2016) ‘Ensembl comparative genomics resources’, Database, 2016, p. bav096. doi: 10.1093/database/bav096.

Hoffman, E. A. et al. (2015) ‘Formaldehyde Crosslinking: A Tool for the Study of Chromatin Complexes’, Journal of Biological Chemistry. American Society for Biochemistry and Molecular Biology, 290(44), pp. 26404–26411. doi: 10.1074/jbc.R115.651679.

Hope, I. A. and Struhl, K. (1986) ‘Functional dissection of a eukaryotic transcriptional activator protein, GCN4 of Yeast’, Cell, 46(6), pp. 885–894. doi: 10.1016/0092-8674(86)90070-X.

Hu, S. et al. (2009) ‘Profiling the Human protein-DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling’, Cell, 139(3), pp. 610–622. doi: 10.1016/j.cell.2009.08.037.

Hubley, R. et al. (2016) ‘The Dfam database of repetitive DNA families’, Nucleic Acids Research, 44(D1), pp. D81–D89. doi: 10.1093/nar/gkv1272.

Hume, M. A. et al. (2015) ‘UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein–DNA interactions’, Nucleic Acids Research, 43(D1), pp. D117–D122. doi: 10.1093/nar/gku1045.

Huntley, S. et al. (2006) ‘A comprehensive catalog of Human KRAB-associated zinc finger genes: Insights into the evolutionary history of a large family of transcriptional repressors’, Genome Research, 16(5), pp. 669–677. doi: 10.1101/gr.4842106.

Huttlin, E. L. et al. (2015) ‘The BioPlex Network: A Systematic Exploration of the Human Interactome’, Cell, 162(2), pp. 425–440. doi: 10.1016/j.cell.2015.06.043.

Imbeault, M., Helleboid, P.-Y. and Trono, D. (2017) ‘KRAB zinc-finger proteins contribute to the evolution of gene regulatory networks’, Nature, 543(7646), pp. 550–554. doi: 10.1038/nature21683.

Ito, J., Gifford, R. J. and Sato, K. (2020) ‘Retroviruses drive the rapid evolution of mammalian APOBEC3 genes’, Proceedings of the National Academy of Sciences. National Academy of Sciences, 117(1), pp. 610–618. doi: 10.1073/pnas.1914183116.

Iuchi, S. (2001) ‘Three classes of C2H2 zinc finger proteins’, Cellular and Molecular Life Sciences, 58(4), pp. 625–635. doi: 10.1007/PL00000885.

Iyengar, S. et al. (2011) ‘Functional Analysis of KAP1 Genomic Recruitment’, Molecular and Cellular Biology, 31(9), pp. 1833–1847. doi: 10.1128/MCB.01331-10.

Iyengar, S. and Farnham, P. J. (2011) ‘KAP1 Protein: An Enigmatic Master Regulator of the Genome’, Journal of Biological Chemistry, 286(30), pp. 26267–26276. doi: 10.1074/jbc.R111.252569.

Jacobs, F. M. J. et al. (2014) ‘An evolutionary arms race between KRAB zinc-finger genes ZNF91/93 and SVA/L1 retrotransposons’, Nature, 516(7530), pp. 242–245. doi: 10.1038/nature13760.

Jacques, P.-É., Jeyakani, J. and Bourque, G. (2013) ‘The majority of Primate-specific regulatory sequences are derived from transposable elements’, PLoS genetics, 9(5), p. e1003504. doi: 10.1371/journal.pgen.1003504.

Jenuwein, T. and Allis, C. D. (2001) ‘Translating the histone code’, Science (New York, N.Y.), 293(5532), pp. 1074–1080. doi: 10.1126/science.1063127.

Jeon, Y. and Lee, J. T. (2011) ‘YY1 tethers Xist RNA to the inactive X nucleation center’, Cell, 146(1), pp. 119–133. doi: 10.1016/j.cell.2011.06.026.

Jin, W. et al. (2016) ‘Critical POU domain residues confer Oct4 uniqueness in somatic cell reprogramming’, Scientific Reports. Nature Publishing Group, 6(1), p. 20818. doi: 10.1038/srep20818.

Johnson, D. S. et al. (2007) ‘Genome-wide mapping of in vivo protein-DNA interactions’, Science (New York, N.Y.), 316(5830), pp. 1497–1502. doi: 10.1126/science.1141319.

Jolma, A. et al. (2013) ‘DNA-Binding Specificities of Human Transcription Factors’, Cell, 152(1–2), pp. 327–339. doi: 10.1016/j.cell.2012.12.009.

Jolma, A. et al. (2015) ‘DNA-dependent formation of transcription factor pairs alters their binding specificity’, Nature, 527(7578), pp. 384–388. doi: 10.1038/nature15518.

Joy, J. B. et al. (2016) ‘Ancestral Reconstruction’, PLOS Computational Biology, 12(7), p. e1004763. doi: 10.1371/journal.pcbi.1004763.

Juliano, C., Wang, J. and Lin, H. (2011) ‘Uniting Germline and Stem Cells: the Function of Piwi Proteins and the piRNA Pathway in Diverse Organisms’, Annual review of genetics, 45. doi: 10.1146/annurev-genet-110410-132541.

Jurka, J. (2000) ‘Repbase Update: a database and an electronic journal of repetitive elements’, Trends in Genetics, 16(9), pp. 418–420. doi: 10.1016/S0168-9525(00)02093-X.

Kalkhoven, E. (2004) ‘CBP and p300: HATs for different occasions’, Biochemical Pharmacology, 68(6), pp. 1145–1155. doi: 10.1016/j.bcp.2004.03.045.

Kaplan, T., Friedman, N. and Margalit, H. (2005) ‘Ab initio prediction of transcription factor targets using structural knowledge’, PLoS computational biology, 1(1), p. e1. doi: 10.1371/journal.pcbi.0010001.

Karimi, M. M. et al. (2011) ‘DNA methylation and SETDB1/H3K9me3 regulate predominantly distinct sets of genes, retroelements, and chimeric transcripts in mESCs’, Cell Stem Cell, 8(6), pp. 676–687. doi: 10.1016/j.stem.2011.04.004.

Kassiotis, G. and Stoye, J. P. (2016) ‘Immune responses to endogenous retroelements: taking the bad with the good’, Nature Reviews Immunology. Nature Publishing Group, 16(4), pp. 207–219. doi: 10.1038/nri.2016.27.

Khan, H., Smit, A. F. A. and Boissinot, S. (2005) ‘Molecular evolution and tempo of amplification of Human LINE-1 retrotransposons since the origin of Primates’, Genome Research, 16(1), pp. 78–87. doi: 10.1101/gr.4001406.

Khazina, E. and Weichenrieder, O. (2018) ‘Human LINE-1 retrotransposition requires a metastable and a positively charged N-terminus in L1ORF1p’, eLife. Edited by S. P. Goff. eLife Sciences Publications, Ltd, 7, p. e34960. doi: 10.7554/eLife.34960.

Klug, A. (2010) ‘The Discovery of Zinc Fingers and Their Applications in Gene Regulation and Genome Manipulation’, Annual Review of Biochemistry, 79(1), pp. 213–231. doi: 10.1146/annurev-biochem-010909-095056.

Kobor, M. S. et al. (2004) ‘A protein complex containing the conserved Swi2/Snf2-related ATPase Swr1p deposits histone variant H2A.Z into euchromatin’, PLoS biology, 2(5), p. E131. doi: 10.1371/journal.pbio.0020131.

de Koning, A. P. J. et al. (2011) ‘Repetitive Elements May Comprise Over Two-Thirds of the Human Genome’, PLoS Genetics. Edited by G. P. Copenhaver, 7(12), p. e1002384. doi: 10.1371/journal.pgen.1002384.

Kopera, H. C. et al. (2016) ‘LINE-1 Cultured Cell Retrotransposition Assay’, Methods in molecular biology (Clifton, N.J.), 1400, pp. 139–156. doi: 10.1007/978-1-4939-3372-3_10.

Ku, H.-Y. and Lin, H. (2014) ‘PIWI proteins and their interactors in piRNA biogenesis, germline development and gene expression’, National Science Review, 1(2), pp. 205–218. doi: 10.1093/nsr/nwu014.

Kumar, S. et al. (2008) ‘MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences’, Briefings in bioinformatics, 9(4), pp. 299–306. doi: 10.1093/bib/bbn017.

Kunarso, G. et al. (2010) ‘Transposable elements have rewired the core regulatory network of Human embryonic stem cells’, Nature Genetics, 42(7), pp. 631–634. doi: 10.1038/ng.600.

Kung, J. T. et al. (2015) ‘Locus-specific targeting to the X chromosome revealed by the RNA interactome of CTCF’, Molecular Cell, 57(2), pp. 361–375. doi: 10.1016/j.molcel.2014.12.006.

Lam, K. N. et al. (2011) ‘Sequence specificity is obtained from the majority of modular C2H2 zinc-finger arrays’, Nucleic Acids Research, 39(11), pp. 4680–4690. doi: 10.1093/nar/gkq1303.

Lambert, S. A. et al. (2018) ‘The Human Transcription Factors’, Cell, 172(4), pp. 650–665. doi: 10.1016/j.cell.2018.01.029.

Lee, C. M. et al. (2020) ‘UCSC Genome Browser enters 20th year’, Nucleic Acids Research, 48(D1), pp. D756–D761. doi: 10.1093/nar/gkz1012.

Lee, Y. N. and Bieniasz, P. D. (2007) ‘Reconstitution of an Infectious Human Endogenous Retrovirus’, PLoS Pathogens, 3(1), p. e10. doi: 10.1371/journal.ppat.0030010.

Letunic, I., Doerks, T. and Bork, P. (2015) ‘SMART: recent updates, new developments and status in 2015’, Nucleic Acids Research, 43(D1), pp. D257–D260. doi: 10.1093/nar/gku949.

Liang, Y., Choo, S. H., et al. (2012) ‘Crystal optimization and preliminary diffraction data analysis of the SCAN domain of Zfp206’, Acta Crystallographica Section F Structural Biology and Crystallization Communications, 68(4), pp. 443–447. doi: 10.1107/S1744309112006070.

Liang, Y., Huimei Hong, F., et al. (2012) ‘Structural analysis and dimerization profile of the SCAN domain of the pluripotency factor Zfp206’, Nucleic Acids Research, 40(17), pp. 8721–8732. doi: 10.1093/nar/gks611.

Liew, C. K. et al. (2005) ‘Zinc fingers as protein recognition motifs: structural basis for the GATA-1/friend of GATA interaction’, Proceedings of the National Academy of Sciences of the United States of America, 102(3), pp. 583–588. doi: 10.1073/pnas.0407511102.

Litman, T. and Stein, W. D. (2019) ‘Obtaining estimates for the ages of all the protein-coding genes and most of the ontology-identified noncoding genes of the Human genome, assigned to 19 phylostrata’, Seminars in Oncology, 46(1), pp. 3–9. doi: 10.1053/j.seminoncol.2018.11.002.

Liu, J. et al. (2006) ‘Intrinsic disorder in transcription factors’, Biochemistry, 45(22), pp. 6873–6888. doi: 10.1021/bi0602718.

Lu, S. et al. (2020) ‘CDD/SPARCLE: the conserved domain database in 2020’, Nucleic Acids Research, 48(D1), pp. D265–D268. doi: 10.1093/nar/gkz991.

Luck, K. et al. (2017) ‘Proteome-scale Human interactomics’, Trends in biochemical sciences, 42(5), pp. 342–354. doi: 10.1016/j.tibs.2017.02.006.

Luger, K. (1997) ‘Crystal structure of the nucleosome core particle at 2.8 A resolution’, Nature, 389, pp. 251–260.

Luisi, B. (1992) ‘Zinc standard for economy’, Nature, 356(6368), pp. 379–380. doi: 10.1038/356379a0.

Luscombe, N. M. et al. (2000) ‘An overview of the structures of protein-DNA complexes’, Genome Biology, 1(reviews001.1), p. 37. doi: 10.1186/gb-2000-1-1-reviews001.

Lynch, V. J. et al. (2015) ‘Ancient Transposable Elements Transformed the Uterine Regulatory Landscape and Transcriptome during the Evolution of Mammalian Pregnancy’, Cell reports, 10(4), pp. 551–561. doi: 10.1016/j.celrep.2014.12.052.

Madden, T. (2013) The BLAST Sequence Analysis Tool, The NCBI Handbook [Internet]. 2nd edition. National Center for Biotechnology Information (US). Available at: https://www.ncbi.nlm.nih.gov/books/NBK153387/ (Accessed: 19 May 2020).

Mahony, S. and Pugh, B. F. (2015) ‘Protein–DNA binding in high-resolution’, Critical Reviews in Biochemistry and Molecular Biology. Taylor & Francis, 50(4), pp. 269–283. doi: 10.3109/10409238.2015.1051505.

Mandal, P. K. and Kazazian, H. H. (2008) ‘SnapShot: Vertebrate Transposons’, Cell, 135(1), pp. 192-192.e1. doi: 10.1016/j.cell.2008.09.028.

Mandel-Gutfreund, Y. and Margalit, H. (1998) ‘Quantitative parameters for amino acid-base interaction: Implications for prediction of protein-DNA binding sites’, Nucleic Acids Research. Oxford Academic, 26(10), pp. 2306–2312. doi: 10.1093/nar/26.10.2306.

Mannini, R. et al. (2006) ‘Structure/function of KRAB repression domains: Structural properties of KRAB modules inferred from hydrodynamic, circular dichroism, and FTIR spectroscopic analyses’, Proteins: Structure, Function, and Bioinformatics, 62(3), pp. 604–616. doi: 10.1002/prot.20792.

Marcon, E. et al. (2014) ‘Human-Chromatin-Related Protein Interactions Identify a Demethylase Complex Required for Chromosome Segregation’, Cell Reports, 8(1), pp. 297–310. doi: 10.1016/j.celrep.2014.05.050.

Mathelier, A. et al. (2016) ‘JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles’, Nucleic Acids Research, 44(D1), pp. D110–D115. doi: 10.1093/nar/gkv1176.

Matsui, T. et al. (2010) ‘Proviral silencing in embryonic stem cells requires the histone methyltransferase ESET’, Nature. Nature Publishing Group, 464(7290), pp. 927–931. doi: 10.1038/nature08858.

Matys, V. et al. (2006) ‘TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes’, Nucleic Acids Research, 34(90001), pp. D108–D110. doi: 10.1093/nar/gkj143.

McClintock, B. (1956) ‘Intranuclear systems controlling gene action and mutation’, Brookhaven Symposia in Biology, (8), pp. 58–74.

Mellacheruvu, D. et al. (2013) ‘The CRAPome: a contaminant repository for affinity purification–mass spectrometry data’, Nature Methods, 10(8), pp. 730–736. doi: 10.1038/nmeth.2557.

Mills, R. E. et al. (2007) ‘Which transposable elements are active in the Human genome?’, Trends in Genetics, 23(4), pp. 183–191. doi: 10.1016/j.tig.2007.02.006.

Minor, D. L. et al. (2000) ‘The polar T1 interface is linked to conformational changes that open the voltage-gated potassium channel’, Cell, 102(5), pp. 657–670. doi: 10.1016/s0092-8674(00)00088-x.

Mintseris, J. and Eisen, M. B. (2006) ‘Design of a combinatorial DNA microarray for protein-DNA interaction studies’, BMC bioinformatics, 7, p. 429. doi: 10.1186/1471-2105-7-429.

Molaro, A. et al. (2014) ‘Two waves of de novo methylation during Mouse germ cell development’, Genes & Development, 28(14), pp. 1544–1549. doi: 10.1101/gad.244350.114.

Moran, J. V. et al. (1996) ‘High frequency retrotransposition in cultured mammalian cells’, Cell, 87(5), pp. 917–927. doi: 10.1016/s0092-8674(00)81998-4.

Murao, K. et al. (2009) ‘The transcriptional factor PREB mediates MCP-1 transcription induced by cytokines in Human vascular endothelial cells’, Atherosclerosis, 207(1), pp. 45–50. doi: 10.1016/j.atherosclerosis.2009.03.051.

Najafabadi, H. S. et al. (2015) ‘C2H2 zinc finger proteins greatly expand the Human regulatory lexicon’, Nature Biotechnology, 33(5), pp. 555–562. doi: 10.1038/nbt.3128.

Najafabadi, H. S. et al. (2017) ‘Non-base-contacting residues enable kaleidoscopic evolution of metazoan C2H2 zinc finger DNA binding’, Genome Biology, 18(1), p. 167. doi: 10.1186/s13059-017-1287-y.

Najafabadi, H. S., Albu, M. and Hughes, T. R. (2015) ‘Identification of C2H2-ZF binding preferences from ChIP-seq data using RCADE’, Bioinformatics, 31(17), pp. 2879–2881. doi: 10.1093/bioinformatics/btv284.

Nam, K., Honer, C. and Schumacher, C. (2004) ‘Structural components of SCAN-domain dimerizations’, Proteins: Structure, Function, and Bioinformatics, 56(4), pp. 685–692. doi: 10.1002/prot.20170.

Nardelli, J. et al. (1991) ‘Base sequence discrimination by zinc-finger DNA-binding domains’, Nature. Nature Publishing Group, 349(6305), pp. 175–178. doi: 10.1038/349175a0.

Neduva, V. and Russell, R. B. (2005) ‘Linear motifs: Evolutionary interaction switches’, FEBS Letters, 579(15), pp. 3342–3345. doi: 10.1016/j.febslet.2005.04.005.

Nielsen, A. L. et al. (1999) ‘Interaction with members of the heterochromatin protein 1 (HP1) family and histone deacetylation are differentially involved in transcriptional silencing by members of the TIF1 family’, The EMBO Journal, 18(22), pp. 6385–6395. doi: 10.1093/emboj/18.22.6385.

Nitta, K. R. et al. (2015) ‘Conservation of transcription factor binding specificities across 600 million years of Bilateria evolution’, eLife, 4. doi: 10.7554/eLife.04837.

Nowick, K. et al. (2010) ‘Rapid sequence and expression divergence suggest selection for novel function in Primate-specific KRAB-ZNF genes’, Molecular Biology and Evolution, 27(11), pp. 2606–2617. doi: 10.1093/molbev/msq157.

Nowick, K. et al. (2011) ‘Gain, loss and divergence in Primate zinc-finger genes: a rich resource for evolution of gene regulatory differences between species’, PloS One, 6(6), p. e21553. doi: 10.1371/journal.pone.0021553.

Oldfield, C. J. and Dunker, A. K. (2014) ‘Intrinsically disordered proteins and intrinsically disordered protein regions’, Annual Review of Biochemistry, 83, pp. 553–584. doi: 10.1146/annurev-biochem-072711-164947.

Omichinski, J. G. et al. (1997) ‘The solution structure of a specific GAGA factor–DNA complex reveals a modular binding mode’, Nature Structural Biology, 4(2), pp. 122–132. doi: 10.1038/nsb0297-122.

Oughtred, R. et al. (2019) ‘The BioGRID interaction database: 2019 update’, Nucleic Acids Research, 47(D1), pp. D529–D541. doi: 10.1093/nar/gky1079.

Pabo, C. O., Peisach, E. and Grant, R. A. (2001) ‘Design and selection of novel Cys2His2 zinc finger proteins’, Annual Review of Biochemistry, 70, pp. 313–340. doi: 10.1146/annurev.biochem.70.1.313.

Pardee, K., Necakov, A. S. and Krause, H. (2011) ‘Nuclear Receptors: Small Molecule Sensors that Coordinate Growth, Metabolism and Reproduction’, in Hughes, T. R. (ed.) A Handbook of Transcription Factors. Dordrecht: Springer Netherlands (Subcellular Biochemistry), pp. 123–153. doi: 10.1007/978-90-481-9069-0_6.

Park, P. J. (2009) ‘ChIP-seq: advantages and challenges of a maturing technology’, Nature Reviews. Genetics, 10(10), pp. 669–680. doi: 10.1038/nrg2641.

Pavletich, N. P. and Pabo, C. O. (1991) ‘Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1 A’, Science (New York, N.Y.), 252(5007), pp. 809–817. doi: 10.1126/science.2028256.

Pedone, P. V. et al. (1996) ‘The single Cys2-His2 zinc finger domain of the GAGA protein flanked by basic residues is sufficient for high-affinity specific DNA binding.’, Proceedings of the National Academy of Sciences of the United States of America, 93(7), pp. 2822–2826.

Phillips, K. and Luisi, B. (2000) ‘The virtuoso of versatility: POU proteins that flex to fit’, Journal of Molecular Biology, 302(5), pp. 1023–1039. doi: 10.1006/jmbi.2000.4107.

Piepoli, S. et al. (2020) ‘Structural analysis of the PATZ1 BTB domain homodimer’, Acta Crystallographica Section D: Structural Biology. International Union of Crystallography, 76(6), pp. 581–593. doi: 10.1107/S2059798320005355.

Porsch-Özcürümez, M. et al. (2001) ‘The Zinc Finger Protein 202 (ZNF202) Is a Transcriptional Repressor of ATP Binding Cassette Transporter A1 (ABCA1) and ABCG1 Gene Expression and a Modulator of Cellular Lipid Efflux’, Journal of Biological Chemistry. American Society for Biochemistry and Molecular Biology, 276(15), pp. 12427–12433. doi: 10.1074/jbc.M100218200.

Price, A. L., Jones, N. C. and Pevzner, P. A. (2005) ‘De novo identification of repeat families in large genomes’, Bioinformatics. Oxford Academic, 21(suppl_1), pp. i351–i358. doi: 10.1093/bioinformatics/bti1018.

Price, M. N., Dehal, P. S. and Arkin, A. P. (2009) ‘FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix’, Molecular Biology and Evolution, 26(7), pp. 1641–1650. doi: 10.1093/molbev/msp077.

Ptashne, M. (2011) ‘Principles of a switch’, Nature Chemical Biology, 7(8), pp. 484–487. doi: 10.1038/nchembio.611.

Pupko, T. et al. (2000) ‘A fast algorithm for joint reconstruction of ancestral amino acid sequences’, Molecular Biology and Evolution, 17(6), pp. 890–896. doi: 10.1093/oxfordjournals.molbev.a026369.

Quenneville, S. et al. (2012) ‘The KRAB-ZFP/KAP1 system contributes to the early embryonic establishment of site-specific DNA methylation patterns maintained during development’, Cell Reports, 2(4), pp. 766–773. doi: 10.1016/j.celrep.2012.08.043.

Ransone, L. J. et al. (1990) ‘Domain swapping reveals the modular nature of Fos, Jun, and CREB proteins.’, Molecular and Cellular Biology. American Society for Microbiology Journals, 10(9), pp. 4565–4573. doi: 10.1128/MCB.10.9.4565.

Ranwez, V. et al. (2018) ‘MACSE v2: Toolkit for the Alignment of Coding Sequences Accounting for Frameshifts and Stop Codons’, Molecular Biology and Evolution. Edited by C. Wilke, 35(10), pp. 2582–2584. doi: 10.1093/molbev/msy159.

Reece-Hoyes, J. S. and Marian Walhout, A. J. (2012) ‘Yeast one-hybrid assays: A historical and technical perspective’, Methods, 57(4), pp. 441–447. doi: 10.1016/j.ymeth.2012.07.027.

Rhee, H. S. and Pugh, B. F. (2011) ‘Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution’, Cell, 147(6), pp. 1408–1419. doi: 10.1016/j.cell.2011.11.013.

Richardson, S. R. et al. (2014) ‘APOBEC3A deaminates transiently exposed single-strand DNA during LINE-1 retrotransposition’, eLife. Edited by A. Ferguson-Smith. eLife Sciences Publications, Ltd, 3, p. e02008. doi: 10.7554/eLife.02008.

Roberts, C. W. M. and Orkin, S. H. (2004) ‘The SWI/SNF complex — chromatin and cancer’, Nature Reviews Cancer. Nature Publishing Group, 4(2), pp. 133–142. doi: 10.1038/nrc1273.

Ronquist, F. et al. (2012) ‘MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space’, Systematic Biology, 61(3), pp. 539–542. doi: 10.1093/sysbio/sys029.

Roux, K. J. et al. (2018) ‘BioID: A Screen for Protein-Protein Interactions’, Current protocols in protein science, 91, p. 19.23.1-19.23.15. doi: 10.1002/cpps.51.

Rowe, H. M. et al. (2010) ‘KAP1 controls endogenous retroviruses in embryonic stem cells’, Nature, 463(7278), pp. 237–240. doi: 10.1038/nature08674.

Rowe, H. M. et al. (2013) ‘TRIM28 repression of retrotransposon-based enhancers is necessary to preserve transcriptional dynamics in embryonic stem cells’, Genome Research, 23(3), pp. 452–461. doi: 10.1101/gr.147678.112.

Sainsbury, S., Bernecky, C. and Cramer, P. (2015) ‘Structural basis of transcription initiation by RNA polymerase II’, Nature Reviews Molecular Cell Biology, 16(3), pp. 129–143. doi: 10.1038/nrm3952.

Sander, T. L. et al. (2000) ‘Identification of a novel SCAN box-related protein that interacts with MZF1B. The leucine-rich SCAN box mediates hetero- and homoprotein associations’, The Journal of Biological Chemistry, 275(17), pp. 12857–12867. doi: 10.1074/jbc.275.17.12857.

Schmidt, D. and Durrett, R. (2004) ‘Adaptive Evolution Drives the Diversification of Zinc-Finger Binding Domains’, Molecular Biology and Evolution, 21(12), pp. 2326–2339. doi: 10.1093/molbev/msh246.

Schmitges, F. W. et al. (2016) ‘Multiparameter functional diversity of Human C2H2 zinc finger proteins’, Genome Research, 26(12), pp. 1742–1752. doi: 10.1101/gr.209643.116.

Schneider, T. D. and Stephens, R. M. (1990) ‘Sequence logos: a new way to display consensus sequences’, Nucleic Acids Research, 18(20), pp. 6097–6100. doi: 10.1093/nar/18.20.6097.

Schultz, D. C. et al. (2002) ‘SETDB1: a novel KAP-1-associated histone H3, lysine 9-specific methyltransferase that contributes to HP1-mediated silencing of euchromatic genes by KRAB zinc-finger proteins’, Genes & Development, 16(8), pp. 919–932. doi: 10.1101/gad.973302.

Schultz, D. C., Friedman, J. R. and Rauscher, F. J. (2001) ‘Targeting histone deacetylase complexes via KRAB-zinc finger proteins: the PHD and bromodomains of KAP-1 form a cooperative unit that recruits a novel isoform of the Mi-2α subunit of NuRD’, Genes & Development, 15(4), pp. 428–443. doi: 10.1101/gad.869501.

Seberg, O. and Petersen, G. (2009) ‘A unified classification system for eukaryotic transposable elements should reflect their phylogeny’, Nature Reviews Genetics. Nature Publishing Group, 10(4), pp. 276–276. doi: 10.1038/nrg2165-c3.

Seeman, N. C., Rosenberg, J. M. and Rich, A. (1976) ‘Sequence-specific recognition of double helical nucleic acids by proteins’, Proceedings of the National Academy of Sciences. National Academy of Sciences, 73(3), pp. 804–808. doi: 10.1073/pnas.73.3.804.

Sen, G. L. et al. (2012) ‘ZNF750 Is a p63 Target Gene that Induces KLF4 to Drive Terminal Epidermal Differentiation’, Developmental Cell, 22(3), pp. 669–677. doi: 10.1016/j.devcel.2011.12.001.

Shi, Yujiang et al. (2004) ‘Histone demethylation mediated by the nuclear amine oxidase homolog LSD1’, Cell, 119(7), pp. 941–953. doi: 10.1016/j.cell.2004.12.012.

Sinha, K. M. et al. (2010) ‘Regulation of the osteoblast-specific transcription factor Osterix by NO66, a Jumonji family histone demethylase’, The EMBO journal, 29(1), pp. 68–79. doi: 10.1038/emboj.2009.332.

Skene, P. J. and Henikoff, S. (2017) ‘An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites’, eLife, 6. doi: 10.7554/eLife.21856.

Smit, A. F. A. et al. (1995) ‘Ancestral, Mammalian-wide Subfamilies of LINE-1 Repetitive Sequences’, Journal of Molecular Biology, 246(3), pp. 401–417. doi: 10.1006/jmbi.1994.0095.

Smit, A. F. A., Hubley, R. and Green, P. (2013) ‘RepeatMasker Open-4.0’. Available at: .

Snider, J. et al. (2015) ‘Fundamentals of protein interaction network mapping’, Molecular Systems Biology, 11(12), p. 848. doi: 10.15252/msb.20156351.

Söding, J., Biegert, A. and Lupas, A. N. (2005) ‘The HHpred interactive server for protein homology detection and structure prediction’, Nucleic Acids Research, 33(Web Server issue), pp. W244-248. doi: 10.1093/nar/gki408.

Song, L. et al. (2011) ‘Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity’, Genome Research, 21(10), pp. 1757–1767. doi: 10.1101/gr.121541.111.

Sripathy, S. P., Stevens, J. and Schultz, D. C. (2006) ‘The KAP1 Corepressor Functions To Coordinate the Assembly of De Novo HP1-Demarcated Microenvironments of Heterochromatin Required for KRAB Zinc Finger Protein-Mediated Transcriptional Repression’, Molecular and Cellular Biology. American Society for Microbiology Journals, 26(22), pp. 8623–8638. doi: 10.1128/MCB.00487-06.

Staby, L. et al. (2017) ‘Eukaryotic transcription factors: paradigms of protein intrinsic disorder’, The Biochemical Journal, 474(15), pp. 2509–2532. doi: 10.1042/BCJ20160631.

Stankiewicz, T. R. et al. (2014) ‘C-terminal binding proteins: central players in development and disease’, Biomolecular Concepts, 5(6), pp. 489–511. doi: 10.1515/bmc-2014-0027.

Stogios, P. J. et al. (2005) ‘Sequence and structural analysis of BTB domain proteins’, Genome Biology, 6(10), p. R82. doi: 10.1186/gb-2005-6-10-r82.

Stoll, G. A. et al. (2019) ‘Structure of KAP1 tripartite motif identifies molecular interfaces required for retroelement silencing’, Proceedings of the National Academy of Sciences. National Academy of Sciences, 116(30), pp. 15042–15051. doi: 10.1073/pnas.1901318116.

Stone, J. R. et al. (2002) ‘The SCAN domain of ZNF174 is a dimer’, The Journal of Biological Chemistry, 277(7), pp. 5448–5452. doi: 10.1074/jbc.M109815200.

Stormo, G. D. and Zhao, Y. (2010) ‘Determining the specificity of protein–DNA interactions’, Nature Reviews Genetics. Nature Publishing Group, 11(11), pp. 751–760. doi: 10.1038/nrg2845.

Stubbs, L., Sun, Y. and Caetano-Anolles, D. (2011) ‘Function and Evolution of C2H2 Zinc Finger Arrays’, in Hughes, T. R. (ed.) A Handbook of Transcription Factors. Dordrecht: Springer Netherlands (Subcellular Biochemistry), pp. 75–94. doi: 10.1007/978-90-481-9069-0_4.

Sundaram, V. and Wysocka, J. (2020) ‘Transposable elements as a potent source of diverse cis-regulatory sequences in mammalian genomes’, Philosophical Transactions of the Royal Society B: Biological Sciences. Royal Society, 375(1795), p. 20190347. doi: 10.1098/rstb.2019.0347.

Suzuki, T. et al. (2003) ‘Functional Interaction of the DNA-binding Transcription Factor Sp1 through Its DNA-binding Domain with the Histone Chaperone TAF-I’, Journal of Biological Chemistry, 278(31), pp. 28758–28764. doi: 10.1074/jbc.M302228200.

Sverdlov, E. D. (1998) ‘Perpetually mobile footprints of ancient infections in Human genome’, FEBS Letters, 428(1–2), pp. 1–6. doi: 10.1016/S0014-5793(98)00478-5.

Szklarczyk, D. et al. (2019) ‘STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets’, Nucleic Acids Research, 47(Database issue), pp. D607–D613. doi: 10.1093/nar/gky1131.

Tacheny, A. et al. (2013) ‘Mass spectrometry-based identification of proteins interacting with nucleic acids’, Journal of Proteomics, 94, pp. 89–109. doi: 10.1016/j.jprot.2013.09.011.

Tempel, S. (2012) ‘Using and Understanding RepeatMasker’, in Bigot, Y. (ed.) Mobile Genetic Elements. Totowa, NJ: Humana Press (Methods in Molecular Biology), pp. 29–51. doi: 10.1007/978-1-61779-603-6_2.

Teo, G. et al. (2014) ‘SAINTexpress: Improvements and additional features in Significance Analysis of INTeractome software’, Journal of Proteomics, 100, pp. 37–43. doi: 10.1016/j.jprot.2013.10.023.

The ENCODE Project Consortium (2012) ‘An integrated encyclopedia of DNA elements in the Human genome’, Nature, 489(7414), pp. 57–74. doi: 10.1038/nature11247.

Thomas, J. H. and Schneider, S. (2011) ‘Coevolution of retroelements and tandem zinc finger genes’, Genome Research, 21(11), pp. 1800–1812. doi: 10.1101/gr.121749.111.

Thomas, P. D. et al. (2003) ‘PANTHER: A Library of Protein Families and Subfamilies Indexed by Function’, Genome Research, 13(9), pp. 2129–2141. doi: 10.1101/gr.772403.

Thompson, P. J., Macfarlan, T. S. and Lorincz, M. C. (2016) ‘Long Terminal Repeats: From Parasitic Elements to Building Blocks of the Transcriptional Regulatory Repertoire’, Molecular Cell, 62(5), pp. 766–776. doi: 10.1016/j.molcel.2016.03.029.

Tompa, P. and Fuxreiter, M. (2008) ‘Fuzzy complexes: polymorphism and structural disorder in protein–protein interactions’, Trends in Biochemical Sciences, 33(1), pp. 2–8. doi: 10.1016/j.tibs.2007.10.003.

Trizzino, M. et al. (2017) ‘Transposable elements are the primary source of novelty in Primate gene regulation’, Genome Research, 27(10), pp. 1623–1633. doi: 10.1101/gr.218149.116.

Turelli, P. et al. (2014) ‘Interplay of TRIM28 and DNA methylation in controlling Human endogenous retroelements’, Genome Research, 24(8), pp. 1260–1270. doi: 10.1101/gr.172833.114.

Tuttle, L. M. et al. (2018) ‘Gcn4-Mediator Specificity Is Mediated by a Large and Dynamic Fuzzy Protein-Protein Complex’, Cell Reports, 22(12), pp. 3251–3264. doi: 10.1016/j.celrep.2018.02.097.

Tyagi, M. et al. (2016) ‘Chromatin remodelers: We are the drivers!!’, Nucleus, 7(4), pp. 388–404. doi: 10.1080/19491034.2016.1211217.

Uhlen, M. et al. (2015) ‘Tissue-based map of the Human proteome’, Science, 347(6220), pp. 1260419–1260419. doi: 10.1126/science.1260419.

Vaquerizas, J. M. et al. (2009) ‘A census of Human transcription factors: function, expression and evolution’, Nature Reviews Genetics, 10(4), pp. 252–263. doi: 10.1038/nrg2538.

Vargiu, L. et al. (2016) ‘Classification and characterization of Human endogenous retroviruses; mosaic forms are common’, Retrovirology, 13(1), p. 7. doi: 10.1186/s12977-015-0232-y.

Venkatesh, S. and Workman, J. L. (2015) ‘Histone exchange, chromatin structure and the regulation of transcription’, Nature Reviews Molecular Cell Biology, 16(3), pp. 178–189. doi: 10.1038/nrm3941.

Vinson, C. R., Sigler, P. B. and McKnight, S. L. (1989) ‘Scissors-Grip Model for DNA Recognition by a Family of Leucine Zipper Proteins’, Science. American Association for the Advancement of Science, 246(4932), pp. 911–916.

Vuzman, D. and Levy, Y. (2012) ‘Intrinsically disordered regions as affinity tuners in protein–DNA interactions’, Molecular BioSystems, 8(1), pp. 47–57. doi: 10.1039/C1MB05273J.

Wei, G.-H. et al. (2010) ‘Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo’, The EMBO Journal, 29(13), pp. 2147–2160. doi: 10.1038/emboj.2010.106.

Weintraub, A. S. et al. (2017) ‘YY1 Is a Structural Regulator of Enhancer-Promoter Loops’, Cell, 171(7), pp. 1573–1588.e28. doi: 10.1016/j.cell.2017.11.008.

Weirauch, M. T. et al. (2013) ‘Evaluation of methods for modeling transcription factor sequence specificity’, Nature Biotechnology, 31(2), pp. 126–134. doi: 10.1038/nbt.2486.

Weirauch, M. T. et al. (2014) ‘Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity’, Cell, 158(6), pp. 1431–1443. doi: 10.1016/j.cell.2014.08.009.

Weirauch, M. T. and Hughes, T. R. (2011) ‘A Catalogue of Eukaryotic Transcription Factor Types, Their Evolutionary Origin, and Species Distribution’, in Hughes, T. R. (ed.) A Handbook of Transcription Factors. Dordrecht: Springer Netherlands (Subcellular Biochemistry), pp. 25–73. doi: 10.1007/978-90-481-9069-0_3.

Wicker, T. et al. (2007) ‘A unified classification system for eukaryotic transposable elements’, Nature Reviews Genetics. Nature Publishing Group, 8(12), pp. 973–982. doi: 10.1038/nrg2165.

Williams, A. J. et al. (1995) ‘Isolation and Characterization of a Novel Zinc-finger Protein with Transcriptional Repressor Activity’, Journal of Biological Chemistry. American Society for Biochemistry and Molecular Biology, 270(38), pp. 22143–22152. doi: 10.1074/jbc.270.38.22143.

Williams, A. J., Blacklow, S. C. and Collins, T. (1999) ‘The Zinc Finger-Associated SCAN Box Is a Conserved Oligomerization Domain’, Molecular and Cellular Biology, 19(12), pp. 8526–8535. doi: 10.1128/MCB.19.12.8526.

Wingender, E. et al. (2015) ‘TFClass: a classification of Human transcription factors and their rodent orthologs’, Nucleic Acids Research, 43(D1), pp. D97–D102. doi: 10.1093/nar/gku1064.

Wolf, D. and Goff, S. P. (2009) ‘Embryonic stem cells use ZFP809 to silence retroviral DNAs’, Nature, 458(7242), pp. 1201–1204. doi: 10.1038/nature07844.

Wolf, G. et al. (2015) ‘The KRAB zinc finger protein ZFP809 is required to initiate epigenetic silencing of endogenous retroviruses’, Genes & Development, 29(5), pp. 538–554. doi: 10.1101/gad.252767.114.

Wolf, G., Greenberg, D. and Macfarlan, T. S. (2015) ‘Spotting the enemy within: Targeted silencing of foreign DNA in mammalian genomes by the Krüppel-associated box zinc finger protein family’, Mobile DNA, 6(1), p. 17. doi: 10.1186/s13100-015-0050-8.

Wolfe, S. A., Nekludova, L. and Pabo, C. O. (2000) ‘DNA Recognition by Cys2His2 Zinc Finger Proteins’, Annual Review of Biophysics and Biomolecular Structure, 29(1), pp. 183–212. doi: 10.1146/annurev.biophys.29.1.183.

Wunderlich, Z. and Mirny, L. A. (2009) ‘Different gene regulation strategies revealed by analysis of binding motifs’, Trends in Genetics, 25(10), pp. 434–440. doi: 10.1016/j.tig.2009.08.003.

Yang, L. et al. (2014) ‘Reviving the Dead: History and Reactivation of an Extinct L1’, PLoS Genetics. Edited by C. Feschotte, 10(6), p. e1004395. doi: 10.1371/journal.pgen.1004395.

Yang, P., Wang, Y. and Macfarlan, T. S. (2017) ‘The Role of KRAB-ZFPs in Transposable Element Repression and Mammalian Evolution’, Trends in Genetics, 33(11), pp. 871–881. doi: 10.1016/j.tig.2017.08.006.

Yang, Z. (2007) ‘PAML 4: phylogenetic analysis by maximum likelihood’, Molecular Biology and Evolution, 24(8), pp. 1586–1591. doi: 10.1093/molbev/msm088.

Yazaki, J. et al. (2016) ‘Mapping transcription factor interactome networks using HaloTag protein arrays’, Proceedings of the National Academy of Sciences, 113(29), pp. E4238–E4247. doi: 10.1073/pnas.1603229113.

Yeung, K. T. et al. (2011) ‘A novel transcription complex that selectively modulates apoptosis of breast cancer cells through regulation of FASTKD2’, Molecular and Cellular Biology, 31(11), pp. 2287–2298. doi: 10.1128/MCB.01381-10.

Yin, Y. et al. (2017) ‘Impact of cytosine methylation on DNA binding specificities of Human transcription factors’, Science, 356(6337), p. eaaj2239. doi: 10.1126/science.aaj2239.

Ziegelbauer, J. et al. (2001) ‘Transcription factor MIZ-1 is regulated via microtubule association’, Molecular Cell, 8(2), pp. 339–349. doi: 10.1016/s1097-2765(01)00313-6.

Zilliacus, J. et al. (1995) ‘Structural Determinants of DNA-binding Specificity by Steroid Receptors’.

Ziv, Y. et al. (2006) ‘Chromatin relaxation in response to DNA double-strand breaks is modulated by a novel ATM- and KAP-1 dependent pathway’, Nature Cell Biology. Nature Publishing Group, 8(8), pp. 870–876. doi: 10.1038/ncb1446.

Copyright Acknowledgements

Figure 1.2A. RightsLink license number: 4867820618985 © 2011 by Springer Nature

Figure 1.3A. RightsLink license number: 4867820455783 © 2011 by Springer Nature