Evolution, structure and assembly of basal Factor II D

Simona V. Antonova

The work described in this thesis was partially funded by the Bijvoet NWO Graduate Programme (grant 022.004.019).

Paranimfen Elisa Bloemberg Ragna Senf

Printed by Print: Ridderprint | www.ridderprint.nl

ISBN: 978-94-6416-229-5

Cover Adaptation of ―Transcription factors‖ by Kelvinsong, distributed under a CC-BY 3.0 license

Evolution, structure and assembly of basal Transcription Factor II D

Evolutie, structuur en assemblage van basale transcriptiefactor II D (met een samenvatting in het Nederlands)

Proefschrift

ter verkrijging van de graad van doctor aan de Universiteit Utrecht op gezag van de rector magnificus, prof.dr. H.R.B.M. Kummeling, ingevolge het besluit van het college voor promoties in het openbaar te verdedigen op

donderdag 22 oktober 2020 des ochtends te 11.00 uur

door

Simona Vladimirova Antonova geboren op 5 augustus 1989 te Sofia

Promotoren:

Prof. dr. H.T.M. Timmers

Prof. dr. A.J.R. Heck

Content

CHAPTER 1 11 General introduction 11 1.1. Chromatin 13 1.2. The code 15 1.3. Chromatin dynamics 16 1.4. architecture 20 1.5. Transcription initiation and transcription cycle 23 1.6. TFIID 30 1.7. Sharing with SAGA 45 1.8. Concluding remarks and outline 48

CHAPTER 2 49 Epigenetics and transcription regulation during eukaryotic diversification: the saga of TFIID 49

CHAPTER 3 87 Chaperonin CCT checkpoint function in basal transcription factor TFIID assembly 87

CHAPTER 4 121 Proteomic analysis of human TFIID in different cellular compartments 121

CHAPTER 5 153 TAF1L integrates into a holo-TFIID complex with canonical composition 153

CHAPTER 6 171 General discussion 171 6.1. Cellular distribution and assembly of TFIID 173 6.2. Sharing the TAFs 180 6.3. Final remarks and conclusions 181

APPENDICES 183 Bibliography 184 Summary 199 Nederlandse sammenvatting 200 Dankwoord 200 List of publications 204 Curriculum vitae 205

Voorzitter van de promotieplechtigheid / Chairperson of the defense ceremony: Prof. dr. H.M. Verkooijen

Promotiecommissie / Doctoral committee members: Prof. dr. P. J. Coffer (beoordelingscommissie / assesment committee member) Prof. dr. F.C.P. Holstege (beoordelingscommissie / assesment committee member) Prof. dr. W.L. de Laat (beoordelingscommissie / assesment committee member) Dr. S.M. Lemeer Prof. dr. T.K. Sixma Prof. dr. B. Snel (beoordelingscommissie / assesment committee member) Prof. dr. L. Tora Prof. dr. C.P. Verrijzer (beoordelingscommissie / assesment committee member)

List of abbreviations Ac acetylation Acetyl Co-A acetyl coenzyme-A AML acute myeloid leukemia ARP actin-related ATAC ada2a-containing B2M β2 microglobulin BDP1 B double prime 1 BP(S) (s) BrD bromodomain BRE B recognition element BRED B recognition element (downstream) BREU B recognition element (upstream) BRF1 B-related factor 1 BTAF1 B-TFIID TATA-box binding protein associated factor 1 BTF basal transcription factor CBP CREB-binding protein CCD conserved domain database CCT chaperonin containing TCP1 CDK cyclin-dependent kinase CHD chromodomain ChEC-Seq chromatin endogenous cleavage and sequencing CHX cycloheximide Co-IP co-immunoprecipitation COMPASS complex associated with Set1 Co-REST co- of RE1 silencing transcription factor Cryo-EM cryogenic electron microscopy CTD C-terminal domain CUT cryptic unstable transcripts DBD DNA-binding domain DCE downstream core element DNA deoxyribonucleic acid Dox doxycycline DPE downstream promoter element DSIF DRB sensitivity inducing factor DTT dithiothreitol DUB deubiquitinating (domain/protein) DUF3591 domain of unknown function 3591 EER4 enhanced ethylene response 4 ERAP-1 endoplasmic reticulum aminopeptidase 1 ESC embryonic stem cell FACT facilitates chromatin transcription FAD flavin dinucleotide

FDR false discovery rate FECA first eukaryotic common ancestor FRT Flp recombinase target (site) GFP green fluorescent protein GNAT GCN5-related N-acetyltransferases GSTF specific transcription factor GTF general transcription factor H1 histone H2A histone 2A H2B histone 2B H3 histone 3 H4 histone 4 HAT histone acetyl transferase HDAC HDMT histone demethyltransferase HEAT huntingtin, elongation factor 3, protein phosphatase 2A, TOR1 HF histone fold HFD histone fold domain HGT horizontal gene transfer HMT histone methyltransferase HRP horseradish peroxidase iBAQ intenstity-based absolute quantification Inr initiator IP immunoprecipitation iPSC induced pluripotent stem cells iTOL interactive tree of life kDA kilo Dalton KDM lysine demethylase KMT lysine methyltransferase KO knock-out LECA last eukaryotic common ancestor LFQ label-free quantification MDA mega Dalton MBD methyl-CpG-binding domain Me methylation MLL mixed lineage leukemia MOZ monocytic leukemia zinc-finger mRNA messenger RNA MBT malignant brain tumour (domain) MTE motif ten element MYST MOZ-Ybf2/Sas3-Sas2/Tip60 (family) NAD nicotinamide adenine dinucleotide NC2 negative cofactor 2

NELF negative elongation factor NLS nuclear localization signal NTD N-terminal domain NuRD remodeling and deacetylase PCAF p300/CREB-binding protein-associated factor PCR polymerase chain reaction PDB PHD plant homeodomain PIC pre-initiation complex Pol I RNA polymerase I Pol II RNA polymerase II Pol III RNA polymerase III P-TEFB positive transcription elongation factor b PTM post-translational modification qMS quantitative mass spectrometry RNA ribonucleic acid RNAPI RNA polymerase I RNAPII RNA polymerase II RNAPIII RNA polymerase III rRNA ribosomal RNA SAGA Spt-Ada-Gcn5 acetyltransferase SCP super core promoter SEC super elongation complex SET su(var)3-9, of zeste, trithorax SF splicing factor SL1 selective factor 1 SUMO small ubiquitin-like modifier SUT stable unannotated transcripts SWI/SNF SWItch/sucrose non-fermentable) TAF TBP-association factor TAND TAF1 N-terminal domain TBP TATA-binding protein TBPL TBP-like protein TF transcription factor TFB transcription factor B TFIIA transcription factor II A TFIIB transcription factor II B TFIID transcription factor II D TFIIE transcription factor II E TFIIF transcription factor II F TFIIH transcription factor II H TLF TBP-like factor TRF TBP-related factor

TRIC tailless complex polypeptide 1 ring complex tRNA transfer RNA TSS transcription start site UAS upstream activation sequence Ub ubiquitin WCE whole cell extract WGD whole genome duplication WH winged helix WT wild type ZnK zinc knuckle

General introduction

1

Chapter 1

General introduction

11 Chapter 1

1

Chromatin ~ Epigenetics ~ Transcription

“…the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.” (The Central Dogma, Symp. Soc. Exp. Biol. 1958, vol 12, pp 138-163)

Over half a century ago and shortly after the discovery of the deoxyribonucleic acid (DNA) double- helical structure, the process of transcribing DNA-encoded information to RNA messenger molecules and subsequently to a synthesized functional protein was described as the central dogma of molecular biology. This, together with the early 1960‘s genetic work of Jacob and Monod as well as the discovery of messenger RNA (mRNA) molecule and the subsequent RNA polymerase II purification were among the main fundamental research achievements of the last century, which marked the beginning of central fields of molecular biology, such as transcriptional regulation and epigenetics. Since then our understanding of the mechanisms and structures behind the immense complexity of has expanded exponentially, revealing a chaotic world of tightly regulated processes. This thesis aims to further elucidate our scientific perception of eukaryotic transcription on the evolutionary and structural example of a protein complex central to this process, transcription factor II D (TFIID). For this purpose, the general introduction of the manuscript will first illustrate the nuclear protein network tasked with the execution and maintenance of transcriptional activity in the human cell. Next, I will delve into the process of transcriptional initiation with emphasis on the structural and functional role of TFIID in it.

12 General introduction

1.1. Chromatin 1 The DNA is a -based biopolymer made of sugar-phosphate backbone and four nitrogen-containing nucleobases – (G), (C), thymidine (T) and adenine (A). Its well-characterized double helical shape is attained through complementary A-T and C-G base pairing of the two strands in regular right-handed twists (1-5). Such coiling causing slight distortion in the direct opposition of the two strands and, thus, forms unequally sized opposing distances between each ~10 bps DNA turn, termed minor and major grooves (Fig. 1A) (6). These structures provide the first layer of regulation in the decoding of the genetic information through differential protein-binding properties (7). In humans, the entirety of this information is contained within approximately 3.2x109 base pairs (bps) of DNA, which makes up to a meter in stretched length (8). As most cells in the human body are somatic and, therefore, diploid, up to 2 meters of genetic material has to be packed within the nuclear dimensions of about 0,5x10-9 m3 (9). An analogy can be made with fitting 200 km of thin thread into a basketball (10). In addition, besides being packed, the DNA must also be stored in an accessible manner for the cell both to execute housekeeping programs and to respond adequately to occurring environmental changes in order to maintain cellular homeostasis. To achieve this, the double helix is wrapped around a group of small nuclear proteins, , forming the basic unit of chromatin (Greek chroma, ―colour‖, due to its staining properties) – the nucleosome (11). Indeed, the first layer of chromatin organization consists of long nucleosomal arrays, which wrap around 80% of the total amount of DNA in structures closely resembling ―beads-on-string‖ (Fig. 1B) (12-14). The arrays are further packed into higher-order chromatin, which, impressively, is continuously re-organized and re-assembled during and within cell cycle stages (15). The structure of the canonical nucleosome includes two copies of four core histones: histone 2A (H2A), histone 2B (H2B), histone 3 (H3) and histone 4 (H4), assembled into an octamer and wrapped 1.7 times with 147 bps of DNA, akin to thread on spool (16, 17). The octamer formation is driven by the strong affinity of histone proteins for one another due to the presence of a histone fold domain (HFD) within their structure. The HFDs consist of a long central α-helix (α2) connected N- and C-terminally to two shorter α-helices (α1 and α3, respectively) via unstructured loops, L1 and L2 (Fig. 1C). The α2 helix is essential for the histone- histone interaction between non-identical HFDs (H2A with H2B and H3 with H4) in an antiparallel fashion (Fig. 1D). The interaction is known as a ―molecular handshake‖, where one of the HFDs dimerizes with its distinct HFD partner in a head-to-tail fashion, resembling a handshake. The tertiary structure of the core histones is relatively similar, consisting of a globular α- helical mid-region and unstructured N- and C-terminal ―tails‖ and, due to this structural overlap, the nucleosome assumes a pseudo-dyad symmetry (Fig. 1E). The oppositional arrangement of the histone pairs within the nucleosome also leads to opposing loops brought in proximity and the formation of a basic L1-L2 surface essential for the interaction with the negatively charged phosphodiester backbone of the DNA. The remaining two HFD helices participate in additional nucleosome stabilization.

13 Chapter 1

1

Figure 1. Chromatin and nucleosome assembly. A) DNA double helix, grooves, right-turn twist, base pairs; created using SMART servier medical art platform. B) Electron microscopy image of nucleosomal arrays (13). C) Schematic overview of histone domains. D) Core histones, HFD dimers, octamer assembly. E) Pseudo-dyad of assembled nucleosome. F) Linker H1 forming higher order structures. D-F, adapted from: (16, 18, 19).

Notably, histones contact the DNA primarily at its backbone and, therefore, can theoretically assemble in a sequence-indiscriminate manner. Nevertheless, specific nucleotide combinations affect the physical properties of the DNA (e.g. its bendability) by allowing more energetically favourable bend during the octamer wrapping. As such dinucleotides of AA, TA

14 General introduction or TT with 10 bps periodicity and/or similar repeats of GCs drive nucleosome-favourable rotational positioning of the DNA, while stretches of A or T, i.e. poly(dA:dT), disfavours DNA 1 bend and consequently the nucleosome formation (20, 21). The process of nucleosome formation consists of initial chaperone-assisted H3/H4 tetramer assembly and association with the DNA, followed by recruitment of two pairs of H2A/H2B heterodimers and completion of the double helix wrapping (Fig. 1D) (22, 23). Variants of the core histones, such as H2A.Z, H2A.B, H2A.X, macroH2A, H3.3 and cenH3, also support the structure of the nucleosomes (24), though variations in the stability are observed (e.g. H2A.Z, macroH2A). Such differences have functional output by either allowing for the establishment of a stably-packed, inaccessible to other proteins chromatin, or loosely-bound and open DNA regions (25). Finally, in addition to the core histones and their variants, histone 1 (H1) stabilizes the nucleosome-wound DNA in place by capping together linker regions flanking the nucleosomal 147 bps (Fig. 1F) (26, 27). It is also involved in higher-order chromatin packing by establishing links between the inter-nucleosomal DNA regions (28).

1.2. The histone code

For several decades after their discovery, histones were regarded primarily as structural and supportive units of the eukaryotic nuclear DNA. Work in the early 1960s, however, began to unveil their central role in transcriptional regulation by describing alleviation of transcriptional suppression from isolated chromatin upon removal of histones (29). Following was the discovery of post-translational modification of histones by methylation and acetylation (30, 31), and with the beginning of the new millennium histones were already widely accepted in the field as key epigenetic players in transcriptional regulation (32). Indeed, besides DNA interaction, the histone structure also includes disordered and highly conserved N- and C-terminal ―tails‖, protruding from the nucleosome, which can be extensively decorated with post-translational modifications (PTMs). Such modifications serve as binding platforms for a variety of proteins with PTM-recognition domains, including bromodomains (BrDs), chromodomains, Tudor domains, PWWP, MBT and plant homeodomain (PHD) fingers (33-37). The PTM-recruited proteins, collectively referred to as ―readers‖, exert a wide range of functions onto the chromatin, including essential processes such as DNA transcription, replication and repair. The most studied histone PTMs are acetylation (ac) and methylation (me) and they can be differentially added to histone lysine (K) or arginine (R) residues by specialized enzymes termed ―writers‖, such as histone acetyltransferases (HATs) and histone methyltransferases (HMTs or KMTs). Histone modifications can also be removed by dedicated enzymes, known as ―erasers‖, with examples including histone demethylases (HDMTs or KDMs) and histone deacetylases (HDACs). Collectively histone PTMs and their downstream effects on are referred to as the ―histone code‖ and, through providing the second level of genome decoding regulation, are central in the field of epigenetics (Fig. 2) (32, 38). In wider terms, the histone code allows for the demarcation of the chromatin into transcriptionally active or repressive regions. A classic example of a transcriptionally active histone mark is the trimethylation of H3 lysine 4 (H3K4me3), which is preferentially enriched at the beginning of transcribed (39-41). On the other hand, trimethylation at the lysine 27 residue of H3 (H3K27me3) marks the start of transcriptionally repressed genes (42), demonstrating the strict

15 Chapter 1

precision in residue-selectivity upon carrying out the functional output of histone modifications. 1 Such functional diversity of histone modifications is dependent both on the proteins recruited by a given mark and on the structural effects it has on the chromatin packing and, therefore, its dynamics.

Figure 2. Schematic overview of the commonly described histone post-translational modifications. Possible modification groups are depicted above targeted residues in blue for phosphorylation, purple for acetylation, red for methylation and light blue for ubiquitination. Sequences of globular domains are omitted and PTMs within those regions are not included (e.g. H3K56ac and H3K79me). Sequences correspond to the human canonical histones. Adapted from (43).

1.3. Chromatin dynamics

Chromatin undergoes dynamic structural changes in response to the protein-protein events occurring in its proximity. Consequently, the functional state of the chromatin also shifts, leading to its categorization into two transcriptionally-definitive states: euchromatin or heterochromatin, where the former represents a more relaxed, accessible and, therefore, mostly transcriptionally active form of the DNA and the latter refers to a tightly packed, inaccessible and, thus, mostly inactive one (debatable, see below) (44, 45). Of note, neither of the two chromatins is strictly repressive or activating, as demonstrated by examples of transcriptional events within defined heterochromatin regions as well as gene inactivation within open regions. Among the main protein classes that regulate chromatin dynamics are the histone modifying enzymes, which recruit chromatin-affecting molecular complexes, and the chromatin remodelers, which allow or restrict the access of the effector proteins to specific genomic regions.

1.3.1. Histone modifying enzymes Acetylation Histone acetylation occurs upon transfer of an acetyl moiety from the acetyl-coenzyme A (acetyl-CoA) to the ε-amino group of histone lysine residues by HAT enzymes. As a result, the acetylated lysine loses its positive charge, contributing to loosened bond with the negatively charged phosphate backbone of the DNA within the nucleosome (46-49). For this reason, histone acetylation is generally considered a histone mark for active transcription through aiding the exposure of the double helix for interaction with transcription-related players. In addition and maybe more important, acetylated histones also serve as a binding platform for proteins endowed with acetyl-recognition domains, such as BrDs (50, 51), which are often part of larger molecular complexes with chromatin-oriented functions. As such acetylation of histones not only increases

16 General introduction the accessibility of the local chromatin, but also actively participates in the accommodation of effector proteins. 1 Through acetyl deposition, HATs play a central role in the epigenetic regulation of transcription and this is evident by their numerous presence as modular components of large multimeric nuclear transcription complexes (52). The first such complex with identified HAT functionality was Spt-Ada-Gcn5-Acetyltransferase (SAGA), isolated and extensively characterized over the past two decades (53-56). Subsequently, the number of identified HATs expanded and they have since been classified into three main families: the Gcn5-like acetyltransferases (GNATs), the monocytic leukaemia zinc finger (MOZ)-Ybf2/Sas3-Sas2 and Tip60 family (MYST) and the metazoan-specific p300/CBP family (57). The functionality and specificity of the HATs depends also on their surrounding molecular context. As such yGcn5 has been shown to acetylate H3K14 alone on free histones, but in the context of the SAGA complex its functionality extends to H3K9, H3K18 and H3K23 as well (58). A related metazoan-specific Ada2a-containing complex (ATAC), on the other hand, utilized the yGcn5 homologues KAT2A (GCN5) and KAT2B (PCAF) in combination with an additional HAT module, ATAC2 (KAT14), to additionally modify H4K12 and H3K16 (59, 60). The molecular environment of the HATs also drives their DNA-specificity by providing interaction with proteins endowed with chromatin-binding domains. As such SAGA is recruited to sites with acetylated histone lysines through its Gcn5-contained BrD and to H3K4me2/3 regions through its Tudor domain-containing subunit ySgf29 (hCCDC101) (59, 61, 62).

Deacetylation The deposition of the acetyl group is a reversible process carried out by the HDAC enzymes. Based on phylogenetics and , HDACs are divided into four evolutionally conserved classes (I-IV), with class III, or the sirtuin family, being the only one utilizing nicotinamide adenine dinucleotide (NAD+) to catalyse the deacetylation reaction (57). The remaining three classes are zinc (Zn2+)-dependent and are referred to as the ―classical family‖ due to the chronological order of their discovery. Notably, a major difference between the two types of HDACs is that sirtuins are intrinsically active and able to deacetylate histone proteins in vitro (57), whereas class I, II and IV HDACs require additional proteins for their enzymatic activity. As such, classic HDACs are often part of larger molecular complexes, such as HDAC1 and HDAC2 in the Nucleosome Remodeling Deacetylase (NuRD) and Co-repressor of REST (Co-REST) complexes (63, 64). In addition, they do not exhibit high substrate specificity (65), in contrast to the sirtuin HDACs, which have been shown to deacetylate histones at selected positions, such as H4K16ac, H3K9ac and H3K18ac (66, 67). Methylation Another central PTM with a key role in epigenetic regulation and chromatin state is methylation, present across all core histones on evolutionarily conserved K or R residues (31, 68, 69). Unlike acetylation, the methyl moiety does not affect the positive charge of the histones and primarily exerts its effects through direct regulation of protein-protein interactions, instead of destabilization of the chromatin packing (70). By providing a binding platform for effector proteins endowed with methyl-recognition domains, such as PHD fingers or chromodomains, serves for the recruitment of a wide range of functionally diverse molecules, including

17 Chapter 1

transcriptional activators or , DNA repair or replication molecules and the 1 segregation machinery (33, 68, 71, 72). In addition, arginines and lysines can carry up to two or three methyl groups, respectively, broadening the information potential of the mark (31, 69, 73, 74). An example of functional diversity based on extent of methylation is the methylation of H3K4, where monomethylation occurs at distal transcription regulatory DNA elements, such as enhancers, while trimethylation on the same residue is enriched at the transcription start site of active genes (39, 41, 75). Similarly, mono- and di-methylation of H3K27 occurs at many chromatin loci, while H3K27me3 is selective for repressed genes (76). Histone methylation is deposited by HMT enzymes through the utilization of S-adenosyl- L-methionine (SAM) molecule as a methyl group donor, which is subsequently transferred to the nitrogen side chain of histone Ks or Rs, resulting in methylated histone residue and S-adenosyl-L- homocysteine (SAH) as by-product (68, 77, 78). In most of the cases this reaction is carried out by a conserved catalytic SET (Su(var)3-9, Enhancer of zeste, Trithorax) domain, which bridges together the targeted histone residue and the SAM group, mediating the deposition (79). An F/Y switch region within the SET regulates the extent of methylation as shown by mutation analyses, where substitution of the tyrosine (T) with phenylalanine (Y) led to switch of HMTs from mono- methyltransferases to di- and tri-methyltransferases (80). For a long time the only non-SET histone methyltransferases were considered to be DOT1L (yDot1), responsible for the methylation of H3K79 located within the globular domain of the H3 rather than the unstructured tail (81). DOT1L belongs to the 7-β-strand family of methyltransferases and until recently was considered the only member of this family known to methylate histones (68). Depending on its molecular context, it can establish all three methyl states, marking either sites of transcriptional activation or elongation (82). Recently, a novel KMT9 described to target H4K12 for monomethylation was also shown to belong to the 7-β-strand family of methyltransferases (83).

Demethylation Due to its slow turnover, methylation was initially considered a true mark of epigenetic inheritance, which is maintained throughout cell cycles and preserved across generations (68, 84). However, with the discovery of lysine-specific-demethylase-1A (LSD1), it became clear that the mark is as dynamic as acetylation and its removal is as actively regulated as its deposition (85). To this end, two families of KDMs have been identified – the amine oxidases LSD1/2 and the Jumonji C (JmjC) domain-containing hydroxylases. The LSD demethylases utilize flavin-adenine dinucleotide (FAD) as a cofactor for the demethylation reaction. Notably this requires the presence of protonated lysine ε-nitrogen atom as + + a substrate and for that reasons LSD HDMs can only demethylate mono- (NH2 ) or di- (NH ) methylated, but not tri- (N+) methylated lysines. Being historically the first discovered HMDs, the mono- and di-methylation specificity of the LSD family was suggestive of the likely existence of additional HMDs, capable of demethylating tri-methylated histones as well (85). Indeed, the JmjC family of KDMs was shortly after identified and shown to successfully remove histone trimethyl marks, confirming, therefore, the reversibility of all histone methylation states (86). The JmjC HDMs utilize Fe2+ and α-ketoglutarate (α-KG) as cofactors and, unlike the LSDs, rely on hydroxylation for the demethylation reaction, which allows for removal of the methyl group without the need for protonated substrate.

18 General introduction

Histone modifications crosstalk Besides acetylation and methylation, there is a number of additional histone PTMs that 1 are deposited or removed by dedicated enzymes and have functional impact on the chromatin state. These include ubiquitin, small ubiquitin-like modifier (SUMO), phosphate, serotoninyl, dopaminyl, crotonyl, ADP-ribose and citrulline modifications among others, which altogether decorate extensively the histone tails in a variety of combinations (87-89). The functional variability of these combinations is further complicated by the fact that the presence or absence of modifications on specific residues has a direct effect on the modification of different histone regions (90, 91). Such crosstalk between the PTMs adds further regulatory nuance to the histone code and diversifies its functional readout through various mechanisms. These include: 1) mutual exclusivity of distinct PTMs with antagonistic effects, such as the methylation and acetylation of H3K9 or H3K27 (with repressive and activating roles, respectively) (42, 92-94); and 2) inhibition or enhancement of reader-, writer- or eraser- proteins recruitment by neighbouring (cis-) or on different tails (trans-) modifications (Fig. 3). An example of a histone PTM cis-crosstalk has been observed in the case of H3K9 and H3K14 acetylations by Gnc5, where the presence of a phosphate group at the adjacent H3S10 can inhibit the acetylation of K9 while enhancing it at K14 (95-97). The most notorious and well-described case of a trans-regulation, on the other hand, is the dependence of H3K4me3 initial deposition by the ySet1 complex COMPASS (human mixed- lineage-leukaemia (MLL) complex MLL3/MLL4) on ubiquitination of H2BK123 (or H2BK120ub in human), establishing, therefore, an essential connection between histone ubiquitination and methylation in the light of transcriptional activation (98). Interestingly, in Drosophila SET16 (MLL3/4) also associates with the H3K27-specific demethylase UTX, which in turn results in removal of the repressive H3K27me3 at the sites of active transcription (99). Adding more to the complexity, the DNA itself can also carry functional modifications, i.e. repressive methyl groups on cytosine neighboured by guanine, which in turn can affect the transcriptional state of the chromatin as well (100). A classic example of DNA and histone PTM crosstalk is provided by the major chromatin remodelling complex NuRD, recruited through its methyl-CpG-binding domains (MBDs) at sites of methylated DNA and inducing histone remodelling through HDAC modules (101, 102).

Figure 3. Examples of cis- and trans-crosstalk between histone modifications. Cis-crosstalk is depicted in blue dotted lines, while trans-crosstalk is depicted in green. Inhibitory and stimulating effects are indicated. Preferential effect on only one modifying group is included when required above the connecting line. Adapted from (103).

19 Chapter 1

1.3.2. Chromatin remodelers 1 The second major class of proteins involved in chromatin dynamics are the ATP- dependent chromatin remodelers. These enzymes utilize their ATPase nature to mobilize the nucleosomes along the DNA through hydrolysis-induced break of DNA-histone octamer contact points (104, 105). This allows them to carry out one of three types of function: 1) chromatin assembly by ensuring location-specific deposition and controlled spacing of nucleosomes; 2) chromatin accessibility by nucleosome sliding and/or ejection in order to expose or obstruct DNA regions; 3) editing of existing nucleosomes by incorporation of histone variants with functional specificity. The structure of the chromatin remodelers consists of an ATPase catalytic domain with tandem RecA-like folds and varying flanking domains (106). Based on their catalytic domain the remodelers are classified into four families: switch/sucrose nonfermenting (SWI/SNF) involved in chromatin accessibility, imitation switch (ISWI) and chromo and DNA-binding domain (CHD) with roles in chromatin assembly, and inositol requiring 80 / SWI2/SNF2-Related 1 (INO80/SWR1) essential for nucleosomal editing (105). Most chromatin remodelers include a chromatin-binding domain within their structure, such as the BrDs in SWI/SNF family, chromodomains and DNA-binding domains (DBDs) in CHDs and HAND-SINT-SLIDE (HSS) domains in ISWI, which guide the enzymes to their target regions (105). Additional regulation of the remodelers‘ activity is provided through their typical cooperation with other chromatin-modifying complexes, such as SAGA-assisted recruitment of SWI/SNF at active genes (107, 108). Interestingly, the catalytic domains of SWI/SNF and INO80 also contain a helicase/SANT-associated (HSA) domain, through which nuclear actin monomers as well as actin-related proteins (Aprs) have been shown to associate with the chromatin remodelers and partake in the complex formation, stability and functionality (105). Altogether, both histone modifying enzymes and chromatin remodelers have an extensive repertoire of associated molecular modules with a variety of functional roles, allowing for an efficient chromatin modification upon recruitment to a target site. In the context of transcriptional regulation, such sites most often include upstream gene regions such as promoters and enhancers.

1.4. Promoter architecture

Promoters constitute cis-regulatory DNA elements essential for the transcription of downstream regions (109). In eukaryotes they are typically enriched with different binding motifs and, together with the histone code, serve as selective recruitment platforms for transcription- associated factors and the transcriptional machinery. Based on their relative distance from the transcription start site (TSS) there are two types of promoters – promoter proximal in the range of 40–250 bps upstream from TSS and core promoters, encompassing the TSS and flanking it with additional ~30 bps of up- and downstream DNA motifs (111-114). In addition to promoter regions, distal regulatory elements including enhancers, silencers and insulators are scattered in a distance of over 100 kbps upstream from the TSS. Communication between these regions through the coordinated action of transcription- assisting factors, such as GSTFs, chromatin remodelers (e.g. SWI/SNF) and histone modifiers (e.g. SAGA) provides crucial layers of transcriptional regulation and fine-tuning by the establishment of

20 General introduction transcriptionally accommodating or repressive environment (Fig. 4). In addition, multimeric protein complexes such as the Mediator coordinate such distal binding events with the core 1 promoter through the formation of contact points between both sides of the otherwise spatially separated genomic regions (115).

Figure 4. Schematic overview of promoter architecture and transcriptional initiation. Core promoter and enhancer regions are brought together and bound by the transcriptional machinery (PIC and Pol II) and distal regulatory complexes, such as SWI/SNF chromatin remodelers, HAT chromatin modifiers and GSTFs. Being the first PIC member to bind DNA, TFIID is recruited by promoter- specific activating histone PTMs and establishes crucial contact points with indicated core promoter motifs. TBP association with TATA leads to DNA distortion and formation of PIC-favourable promoter conformation. Bridging large molecular assemblies (e.g. Mediator) drive the proximity- formation between promoters and enhancers. Accurate nucleosome positioning flanks the core promoter at -1 and +1 sites and different combinations of histone PTMs demarcate different chromatin regions. BTAF1 and NC2 serve as regulators of TBP promoter-association. Adopted from (110). Created using SMART servier medical art platform.

1.4.1. Core promoter The core promoter is the key site for the assembly of the basal pol II transcriptional machinery and the actual initiation of transcription (116). Core promoters have been shown to be minimally essential for basal (as opposed to activated) transcription in vitro. Indeed RNA synthesis from core promoter templates required only the presence of the RNA polymerase enzyme together with six basal transcriptions factors (BTFs), needed for the recruitment and stabilization of the polymerase onto the DNA (see section 1.5.1. PIC assembly). Several sequence motifs are typically found within core promoters and their combinatorial presence (or absence) is defining for the nature of the transcriptional event (i.e. strength, speed,

21 Chapter 1

activated vs constitutive expression). Prominent examples include the TATA-box, 1 (Inr), TFIIB recognition element (BRE), downstream promoter element (DPE), motif ten element (MTE) and downstream core element (DCE) (111-114). Historically, the TATA-box is considered a defining feature of core promoters, leading to their most common classification into TATA- containing and TATA-less promoters, with the former including a consensus TATA sequence (TATAWAW, W = A or T) and the latter lacking one or bearing a degenerate TATA-like version primarily reported in yeast (117). The remaining core elements are less consistently present. Nevertheless, when present, the commonality in their pattern of distribution is indicative of dedicated functional roles in nucleation of the transcriptional machinery. As such on RNAPII- transcribed genes BRE is flanking the TATA-box with relatively conserved BRE upstream (BREu) and/or downstream (BREd) sequences rich in C and G nucleotides, Inr is partially overlapping with the TSS and contains an essential A marking the +1 site, and DCE, MTE and DPE are positioned progressively further downstream from the TSS (111-114).

1.4.2. TATA and TATA-less promoters The discovery of the TATA-box was the result of comparative analysis of sequences upstream to the TSS of H2A genes in several species (118, 119). This revealed a conserved enrichment of A-Ts approximately 25-30 bps 5‘ to the start site, which at the time was named after its discoverers – the ―Goldberg-Hogness‖ box (116, 118, 120). Subsequently more genes were shown to contain an upstream TATA-box, and its crucial presence for transcriptional initiation was highlighted through mutational analyses (121, 122). In parallel, with the discovery of the RNA polymerases I, II and III, an effort was made in reconstituting the transcriptional machinery in vitro using TATA-containing DNA templates and purified polymerases. Nevertheless, it was not until the addition of a whole cell or nuclear extracts that in vitro transcription was achieved, indicating that additional nuclear proteins are required for the transcriptional activity (123, 124). Thorough biochemical experiments have since revealed a number of different molecular structures, essential for the execution of gene expression by each of the three eukaryotic polymerases (125-127). A shared molecule between all three systems is the TATA-binding protein (TBP) (128), which binds TATA sequences with a high affinity via the minor groove, induces a DNA distortion at an ~ 90o angle and leads to subsequent molecular events necessary for the loading of the RNA polymerase onto the promoter region of genes (see sections 1.5.1. PIC assembly and 1.6. TFIID) (129-133). Notably, although initially assumed as universal, TATA-containing core promoters are in fact underrepresented in the eukaryotic genomes with ~ 20% of yeast genes dependent on TATA, and only ~ 5% in humans (113, 134-136). Transcription initiation from such promoters is often characterized as ―sharp‖ or ―focused‖ due to the precise location of the TSS, dependent on the upstream TATA-box location (111, 136). Interestingly, these promoters mostly belong to tissue- specific, inducible and/or highly regulated genes, indicative of a rapid ON-and-OFF transcription switch at these sites. The remaining majority of core promoters, on the other hand, lacks a precise consensus TATA and are associated with much less defined TSSs. Due to the unspecific nature of their TSS, there are often several possible transcription start sites within TATA-less promoters, leading to their annotation as ―broad‖ or ―dispersed‖ promoters. In metazoa they are mostly enriched with upstream C-G dinucleotide repeats stretching from 0.5 to 2 kbps and termed CpG islands (111, 136). In yeast TATA-like promoters are primarily associated with housekeeping genes and genes

22 General introduction with constitutive expression (137) (Fig. 7). Despite lacking a consensus TATA-box, these promoters are still bound by TBP in the event of transcriptional initiation, highlighting the central 1 role of this protein in transcription (138).

1.5. Transcription initiation and transcription cycle

Transcription of DNA into RNA is an essential biological mechanism, with roles in fundamental processes, such as differentiation and control of cell growth and homeostasis. Therefore, intricate regulatory mechanisms are needed for its correct execution in time and space. Transcriptional initiation is arguably the most central step in this process. Not only does it provide a binding and stabilizing platform for the RNA polymerase enzymes at the core promoters, but it also dictates the directionality of the process, promoter escape and elongation and is responsible for the global maintenance of steady-state mRNA levels (139-143). The importance of transcriptional initiation is further highlighted by the conceptual conservation of its mechanism throughout the tree of life. In simplified terms, such mechanism includes the utilization of a promoter-recognizing molecule, capable of associating with the DNA, recruiting the RNA polymerase and, alone or aided by additional proteins, stabilize the structure in transcription-favourable conformation. In prokaryotes this process includes a four-subunit RNA polymerase guided by an additional σ factor to transcription initiation DNA sequences, which bear similarities with the eukaryotic TATA-box (144). In this network has expanded to involve a TATA-box and the associated TBP together with transcription factor B (TFB) (distantly related to the bacterial σ factor (145)), which stabilizes the TBP-DNA complex and participates in the recruitment of the RNA polymerase (145-147). Finally, to meet the higher demand for transcriptional diversity of eukaryotes, their transcription initiation machineries have acquired an extensive and mostly conserved from yeast to human molecular repertoire (termed the pre- initiation complex (PIC)), which consists of over 50 polypeptides as opposed to the <10 polypeptides found in bacteria and archaea, and which coordinates the molecular events leading to the RNA synthesis with an impressive accuracy and efficacy.

1.5.1. RNAPII PIC assembly In animals RNAPII-dependent transcription is responsible for expression of protein- coding genes (among others) and, therefore, it is the most extensively studied of the nuclear eukaryotic RNA polymerases. Furthermore, likely due to the larger amount of target genes and requirements for differential expression, transcription by RNAPII is also more extensively regulated than the transcription by RNAPI or RNAPIII, which are instead involved in housekeeping expression of a limited set of genes, such as ribosomal RNA (rRNA) and transfer RNA (tRNA) as well as some small RNAs, respectively. As such, the overall composition of an RNAPII-bound PIC involves the cooperation of over fifty polypeptides – double to the amount of proteins involved in RNAPI- and RNAPIII-specific PIC assemblies (148-150). Human RNAPII PIC is composed of six basal (or general) transcription factors (BTFs or GTFs) – TFIIA, -B, -D, -E, -F and -H, and RNAPII, which are altogether stabilized at gene promoters and poised for transcriptional engagement (141, 151). With the exception of TFIIB, all these factors are composed of several subunits, which in the case of the largest one (TFIID) reach up to 14 subunits (152, 153). The total size of RNAPII PIC is about 2.7 megaDalton (MDa) and

23 Chapter 1

with the addition of other molecular complexes involved in the bridging with distal regulatory 1 chromatin regions and initiation of transcriptional elongation (e.g. Mediator) – it reaches the impressive structure of about 4.5 MDa, similar in dimensions to the human ribosome complex (154-156). How does such an complicated assembly coordinate its formation and functional output? And, considering the correlation between increased demand for transcriptional regulation and the expansion of the PIC in eukaryotes – how and which components of the PIC are structurally and functionally responsible for the extended regulatory diversification? To shed light on these questions, collective scientific efforts from the fields of transcriptional regulation, genetics, biophysics, biochemistry and structural biology have gathered a growing amount of data characterizing PIC function, composition and assembly mechanisms. A summary of over 40 years of research describes a stepwise process, where transcriptional initiation begins with promoter recognition by the TBP-bound TFIID complex, stabilized by TFIIA and followed first by TFIIB and next RNAPII/TFIIF recruitment and completed through the integration of TFIIE and TFIIH (Fig. 5). Numerous conformational rearrangements occur throughout the sequence of the assembly, allowing for the accommodation of the various structures and the correct placement RNAPII at the TSS (157-159). The following section summarizes key structural and functional data of the PIC assembly members and follows to highlight the diverse roles of the TFIID complex during transcriptional initiation.

Figure 5. Schematic overview of PIC assembly. TFIID is the first PIC member to bind core promoter motifs. TBP association with TATA sequence leads to DNA distortion and the interaction is stabilized by TFIIA and later TFIIB. RNAPII is subsequently recruited to engage the TSS through its active site and is stabilized on to the forming PIC together with TFIIF. In the final step of PIC formation the translocase TFIIH and its partner TFIIE are recruited to open the double DNA strands and establish the transcription bubble. Created using SMART servier medical art platform.

TFIID As the first member of PIC to bind the core promoter, TFIID is regarded as a central player in transcriptional initiation, with a crucial role in guiding the RNAPII towards the TSS, as well as nucleating the remaining PIC members needed for productive promoter engagement (129, 135, 141, 160, 161). Human TFIID is composed of TBP and 13 TBP-associated factors (TAFs),

24 General introduction associated in specific stoichiometry in order to construct the largest PIC polypeptide complex with a total size of ~1.3 MDa (162-164). The dimensions of the complex are notably conserved 1 throughout the various branches of the eukaryotic domain, indicating a selective pressure on maintaining the extensive compositional repertoire (135, 165). Why is this elaborate molecular architecture conserved and necessary for core promoter recognition? Early studies aiming at in vitro transcription reconstitution by using TATA-containing templates and purified TFIID and recombinantly expressed TBP provide the first indications that besides the TATA-recognition role of TBP, the remaining TAF subunits have a crucial role to play in the orchestration of activated transcription (166). Indeed, it was observed that while TBP alone is sufficient to support basal transcriptional activity, activated transcription required the presence of holo-TFIID, where the TAFs can mediate transcriptional communication between activators, such as SP1, and the (167). In addition to protein-interaction, several TAFs provide further layers of transcriptional regulation by directly associating with core promoter elements. Such are TAF1, shown to bind Inr, MTE, DPE and DCE regions, TAF2 similarly recognizing Inr and TAF4 and TAF12 shown to bind DNA non-selectively (168-173). Furthermore, human TAF1 and TAF3 both contain chromatin reader domains – double BrDs and PHD, respectively, with recognition specificity for the active promoter histone marks H3K9ac/H4K5/K12ac and H3K4me3 (174-177). To add to the complexity, several TAFs have functional paralogues and splice variants, which can substitute or combine with them in the structure of TFIID (see section 1.6. TFIID) and, due to their mostly tissue-specific expression – likely drive cell type-specific transcriptional programs (178). As such it becomes clear that in a cellular context, holo-TFIID constitutes a large protein machinery capable of differentially targeting gene promoters through varying combination of transcription factors (TFs) interaction, DNA motifs recognition and epigenetic-reader domains. The role of TBP as a part of such a complex has been extensively characterized, with initial views of TBP serving as the first anchor point of TFIID in promoter recognition, with the remaining TAFs stabilizing for this interaction (152, 179). However, more recent work has challenged this decades-old model by demonstrating that promoter-recognition by TAFs precedes TATA-recognition by TBP and as such TFIID structure acts as guide for the accurate deposition of TBP at fixed upstream position in relation to the downstream core promoter elements (162). This model can easier explain the recruitment of TBP at precise distances from the TSS even in promoters, which lack consensus TATA sequences but contain other promoter motifs and/or chromatin modifications (see section 1.6.3. TFIID stoichiometry and structure).

TFIIA The stabilization of the TBP-DNA contact is enabled by the sequentially recruited TFIIA dimer, composed of GTF2A1 and GTF2A2 in humans (180-182). TFIIA is shown to bind a protein-interaction surface on TBP via the β-barrel structure of GTF2A2 and shield it from promoter eviction mediated otherwise by the TBP-dedicated negative regulators BTAF1/Mot1p and NC2 (110, 183-190). The GTF2A2-TBP-DNA interaction is further stabilized by TFIIA-DNA contacts formed at its backbone upstream the TBP-bound TATA sequence (133, 181, 191). A four α-helical bundle of GTF2A1, on the other hand, seems to dock onto additional TFIID subunits, establishing, therefore, a bridge connecting the overall TATA-neighbouring TFIID structure with the DNA (158, 159, 162, 192). Interestingly, TFIIA was also considered a non-GTF, since its

25 Chapter 1

presence was not essential for in vitro reconstitution of basal transcription in the presence of TBP 1 and the rest of the PIC members. However, activated transcription both in vitro and in vivo required the presence of TFIIA and the remaining of the TFIID complex, suggesting crucial interplay between those factors in the cell (193). Indeed, besides TBP/DNA stabilization, a central role of TFIIA in the PIC assembly is its competitive (in relation to TFIID) affinity for TBP, which allows for the release of TBP from its TFIID-inhibited state and association with the promoter (see section 1.6. TFIID) (158, 159, 162).

TFIIB While TFIIA stabilizes TBP-DNA complex by bridging interactions with TFIID subunits and TATA-upstream DNA, the following PIC member to join, TFIIB, stabilizes this structure from the opposite site of TBP and provides a platform for the subsequently recruited RNAPII (132, 194, 195). TFIIB is a single-protein BTF with the ability to bind the TATA-flanking sequences BREu and BREd, which due to the bend induced by TBP-binding to TATA (195). This interaction is established via two asymmetrical α-helical cyclin folds within the C-terminal B-core domain of TFIIB, which contact both the DNA backbone and TBP itself (132). The N-terminal part of TFIIB, on the other hand, participates in transcription-related functions by extending to contact the subsequently bound RNAPII (196). This interaction includes docking of the polymerase to the PIC, guidance of TSS identification and blocking the RNA exit channel of RNAPII (173). Following the complete assembly of PIC and the trigger of transcriptional initiation – global rearrangements of PIC take place leading to the release of the RNAPII exit channel by TFIIB and allowing for newly synthesized RNA to emerge as the polymerase proceeds into transcription elongation step (173). Interestingly, next to TBP and the polymerase itself, TFIIB is the third most conserved PIC member, being homologous to archaeal TFB and, albeit distantly, related to the bacterial σ factor (145-147, 197).

RNAPII and TFIIF The RNAPII complex is the second largest member of the PIC, with total number of 12 subunits and a mass of ~550 kiloDalton (kDa) – a structural complexity that reflects the numerous functions it performs (152, 198, 199). These are the recognition of the template DNA strand and synthesis of a complementary RNA chain, its proof-reading, successful separation of the DNA- RNA hybrid and recruitment of RNA processing factors, such as 5‘ capping enzymes, elongation, splicing and termination factors (200, 201). An essential component of the polymerase for discriminating between its different functional states is provided by the extended C-terminal tail domain (CTD) of its biggest subunit, RBP1, which can be extensively modified. Indeed, the RBP1 CTD consists of a heptad stretch of YSPTSPS repeats (extending up to 52 times in humans), which are differentially phosphorylated throughout the different stages of the transcriptional cycle (i.e. initiation, elongation, termination). RNAPII can enter PIC only in an un-phosphorylated state, while to proceed to promoter escape and elongation – CTD phosphorylation at precise locations is needed (e.g. serine-5 phosphorylation, Ser5p, peaks at early transcription and is exchanged by Ser2p as the polymerase progresses into the elongation phase and towards termination) (202). In addition to the docking platform provided by TFIIB, the recruitment of the RNAPII into the pre-initiation complex is further aided by another PIC member, the dimer TFIIF (173, 203-208). Due to its close association with the polymerase and the presence of TFIIF-like proteins

26 General introduction as integral subunits of other RNA polymerases, this BTF was initially considered a part of the enzyme structure. Subsequently it was revealed that TFIIF is a separate member of PIC with 1 predominant function in securing correct TSS identification by the active site of the RNAPII (159, 205, 209). TFIIF was also shown to associate with the CTD-targeting phosphatase FCP1, suggesting a RNAPII-preparatory role prior assembly into PIC (210, 211). Finally, by contacting the upstream-located TFIIA and subsequently added TFIIE, TFIIF bridges over the cleft site of RNAPII and traps the DNA close to the active site of the polymerase, preparing the stage for the promoter melting and transcriptional bubble formation in the following step of PIC assembly (141, 159).

TFIIE and TFIIH The final members to assemble the pre-initiation complex are the dimer TFIIE and the decamer TFIIH. Notably, these two BTFs exert their effect within PIC cooperatively, where close interaction between the two drives a TFIIE-dependent activation of the translocase and kinase functions of TFIIH (141, 212, 213). Besides TFIIH activation, TFIIE also participates in the formation of the structural bridge over the RNAPII cleft, which retains the promoter DNA in fixed proximity with the polymerase (141). Promoter melting is mediated through the ATP- dependent translocase subunit of TFIIH, XPB, which travels 12-15 bps downstream of its initial binding site and, due to anchor points with PIC, induces DNA backtracking (157, 214). As the upstream promoter sequence is fixed at the TBP/TFIIA/TFIIB site, such backtracking leads to the folding of the TSS sequence towards the active site of the polymerase and torsional strain on the two strands drives the melting of the promoter (transcription bubble). The template strand is then engaged by the polymerase in a 5‘ to 3‘ direction and the first few RNA nucleotides are synthesized. Major structural rearrangements are required for the exposure and activation of XPB, including the repositioning of the RNAPII-specific kinase module of TFIIH, CDK7, in proximity with the CTD of RNAPII (141, 159). This allows coordination between the transcription bubble formation and the phosphorylation of the CTD at Ser5, driving the release of the polymerase from the promoter PIC structure (promoter clearance or escape) into the transcription elongation stage.

Promoter escape and pausing, elongation, termination The PIC assembly-driven initiation is the first step of the transcription cycle, followed by elongation during which the productive transcription takes place and termination, at which point RNAPII disengages from the DNA (215). Throughout the transcription cycle the CTD tail of RNAPII serves as a binding platform for numerous molecules and through its differentially modified residues it coordinates the recruitment of transcriptional players in a well-timed manner (216, 217). Indeed, the TFIIH-dependent CTD Ser5 phosphorylation is among the activators of the promoter escape, which mediates transition from initiation to elongation (218). Prior to engagement of productive transcription, the RNAPII is stalled about 20 nucleotides downstream the promoter region by the negative elongation factors NELF and DSIF – a phenomenon known as promoter-proximal pausing (219, 220). Capping enzymes are recruited by Ser5p to provide the cap structure typical of pol II transcription (221). The association of the positive transcription elongation factor (P-TEFb) is essential to drive the machinery into the next stage of productive transcription, elongation, through DSIF and NELF phosphorylation and alteration of their stalling effect (222). Notably, P-TEFb further phosphorylates CTD-Ser2, allowing for the

27 Chapter 1

recruitment of elongation factors, as well as chromatin remodelers, histone modifiers and 1 chaperones(215). As such as the polymerase progresses throughout the body of the coding region, CTD-Ser5 phosphorylation convers to Ser2p at a fixed distance from the promoter and plateaus until the end of the transcribed sequence (216, 217). This phosphorylation state is associated with dynamic reorganization of the chromatin, which includes the efficient disassembly of the nucleosomes ahead of RNAPII in order to allow the passage of the complex, and the swift reassembly behind the enzyme moves (215). The Facilitates Chromatin Transcription (FACT) complex is a histone chaperone complex, central to these dynamics (223). As the RNAPII reaches the gene end, the synthesized mRNA incorporates a terminal polyadenylation tail (polyA) sequence (5‘- AAUAAA -3‘), followed by a G/U-rich stretch (215). With this signal the RNAPII peters out, reducing its processivity, and reaches terminal pausing and termination of transcription (224). At this point termination factors are recruited and drive the cleavage of the synthesized RNA at the polyA tail region. The remaining downstream RNA is degraded and the polymerase disassociates from the template DNA, completing therefore the transcription cycle.

1.5.2. Additional roles of TFIID in the transcription cycle

Transcriptional re-initiation For the release of the polymerase upon transcriptional termination, restoration of its hypo-phosphorylated state is required. Dedicated phosphatases (e.g. FCP1 and SCP1) carry out this process during the termination pausing and the detached enzyme is ready for re-use in a new cycle of transcription, or re-initiation (215, 225). Interestingly, early in vitro studies have shown that some PIC members may remain associated at promoter regions even after promoter escape of the polymerase (225). Such structures have been proposed to serve as ―bookmarks‖ of recent transcription initiation events and to provide a scaffold for the efficient re-initiation of a subsequent round. TFIID complex is a central candidate of re-initiation scaffolds formation, as suggested by the in vitro observation of the complex association with promoter template even after transcriptional initiation (226). Stringent rounds of template-bound PICs washing revealed that several TFIID TAF subunits remaining stably associated with the template (227). Based on this, it was proposed that promoter-associated TFIID allows for efficient re-initiation events through circumvention of the activator dependence (225). Indeed, while nuclear extract of TAF mutants supported initial rounds of activated-transcription to a similar extend as wild type counterparts, the multi-round transcription potential was largely reduced (225, 226). Models of TFIID involvement in re-initiation propose a role for TAFs in preventing nucleosomal occlusion of the promoter by maintaining the distance of the +1 (first downstream of TSS) nucleosome and/or preservation of the TBP-bound DNA complex, which favours RNAPII recruitment. The exact mechanisms of transcriptional re-initiation as still debated and additional BTFs (e.g. TFIIH and TFIIB) have been proposed to play a role in events such as looping between promoter and termination region and reloading of the polymerase upon completion of the transcriptional cycle (215).

Cryptic transcription Correct positioning of the initiation event is essential for the synthesis of stable and functional mRNA. Nevertheless, transcription events have been shown to be largely pervasive,

28 General introduction sprouting bidirectionally from enhancer, termination or intergenic regions and driving the expression of non-coding RNAs, broadly classified into cryptic unstable transcripts (CUTs) and 1 stable uncharacterized transcripts (SUTs) (228). Chromatin accessibility is a major determinant for the occurrence of cryptic transcription events as demonstrated by mutations of the Set2 and Rpd3S pathway of histone re-deposition during the RNAPII elongation step (229). Indeed, defects in this process lead to an intragenic transcription initiation due to the more accessible state of the chromatin as a consequence of aberrant histone deposition (228). Notably, pervasive transcription is controlled at least in part through the cooperation of TBP and its negative regulators BTAF1 (yMot1p) and NC2 (see section 1.6. TFIID). As PIC assembly is required at cryptic promoters, TBP removal is, intuitively, a counteraction for pervasive transcription. Indeed, NC2 and Mot1p co-operate to suppress cryptic transcription from 3‘ or intragenic (in case of chromatin organization defects) regions of genes by displacement of pervasively associated TBP from those sites (187, 230). Considering the competitive binding nature between TFIID and B-TAF1/NC2 for TBP, the contribution of the rest of the TFIID subunits in this process remains to be characterized.

Promoter-proximal pausing In addition to re-initiation and cryptic expression regulation, few studies have also functionally linked TFIID to promoter pausing of the RNAPII (231-234). This association was proposed due to the spatial overlap of TFIID-bound core promoter elements such as MTE and DPE and the promoter-proximal region of the polymerase pre-elongation pausing (235, 236). As such the model describes that in the presence of strongly TFIID-bound downstream core promoter elements – the complex drives a more regulated transcription by enforcing a stalling of the polymerase after the initiation step (233). Indeed, TFIID has been shown to facilitate recruitment of elongation factors such as P-TEFb and members of the super elongation complex (SEC), as well as proposed to act as negative regulator of the TFIIH-CDK7 through its TAF7 subunit (234, 237). A recent study suggested that in vitro reconstituted PIC-only promoter templates require the presence of TFIID to induce RNAPII stalling without the pausing factors NELF and DSIF (231). Altogether, it becomes clear that numerous molecular mechanisms are coordinated simultaneously in order to execute transcriptional events with spatial and temporal precision. The genomic sequence provides an important basis for preferential chromatin formation and, therefore, dictates rules for region occlusion or accessibility. The subsequent layers of transcriptional regulation are dynamic, mostly reversible and essential for the establishment of the functional readout of the encoded information. The histone code lays out the ―grammatical rules‖ for decoding of the genetic material, while effector proteins with histone PTM reader domains execute this epigenetic code in order to provide or restrict the access of the RNA polymerase to the underlying genetic code. Furthermore, it is evident that formation of the pre-initiation complex is a necessary step for the loading, stabilization and priming for transcription of the RNAPII, highlighting its essential role in gene expression. As summarized above, the formation of the PIC involves numerous structural rearrangements and protein modifications, necessary for the correct execution of the downstream RNA synthesis. Such interlaced relationships between the PIC topology and the nature of its functional output is the reason structural characterization of transcription initiation laid the foundation of fields like transcription regulation and associated

29 Chapter 1

mechanisms of disease in cancer biology (among others). As the first member to associate with the 1 core promoter and to nucleate the remaining of the PIC and RNAPII, the TFIID complex is vital to this process. Its multimeric structure and flexible nature are indicative of a major role in the PIC conformational adaptations. The potential contribution of TFIID in processes outside of PIC nucleation additionally brings to focus the complexity behind its organization and functionality.

1.6. TFIID

The identification of basal transcription factors was initially conducted through the utilization of cation-exchange phosphocellulose chromatography, which allowed for the separation of proteins with DNA-binding properties and their step-wise elution with increasingly high salt concentrations, resulting in the attainment of four fractions, labelled A-D, with likely transcription- associated activities (238). Indeed, characterization of each of the fractions revealed that most of the PIC members were scattered between fractions A-C, while fraction D contained TFIID. The complex took name after its fraction (transcription factor of RNAPII, fraction D) and being among the last to elute was expected to possess strong DNA-binding properties. Footprinting experiments confirmed a role of TFIID in binding core promoters and highlighted its role as the pioneer PIC member needed for the nucleation of the rest of the complex and the RNA polymerase II (238). Interestingly, parallel experiments in yeast led to the purification of a much smaller TFIID molecule, with a size of 27 kDa as opposed to the observed human TFIID with size of 250-300 kDa, which had similar TATA-binding properties and ability to drive transcription in reconstituted systems (238-240). Based on this, the existence of a similar ~40 kDa TATA-binding protein (TBP) from fraction D was subsequently confirmed in humans as well (241), revealing that the previously observed higher molecular weight species was likely a complex consisting of TBP and additional TBP-associated factors (TAFs) (242, 243). Notably, TBP alone could support only basal, but not activated, transcription, suggesting that TAFs must participate in communication propagation of upstream gene activation signals. Subsequent extensive biochemical analyses have revealed the rest of the human TFIID complex to include 13 TAFs (TAF1-TAF13), with several paralogues, splice variants and shared subunits, impressively conserved impressively conserved from yeast to human. The following section summarizes the structural and functional information for each of the TFIID subunits gathered over the past 30 years.

1.6.1. TFIID composition TBP Function: promoter binding Partners: RNAPI, -II, -III PIC members and BTAF1, NC2 TBP is the first TFIID subunit to be identified and has been characterized extensively since (244-246). As described, it is central to transcriptional processes due to its ability to associate upstream to the TSS and bend the DNA in a PIC-favourable conformation (132, 133, 159). TBP consists of a vertebrate-specific flexible N-terminal domain and a conserved C-terminal core domain (247). While the N terminus is relatively undercharacterized, it has been proposed to modulate the DNA-binding properties of the protein and is suggested to have a role in prevention of placental/fetal rejection by regulation of the expression levels of the immunity-related molecule,

30 General introduction

β2m (248-250). Interestingly, the region contains a poly-glutamine (polyQ) stretch in human TBP with varying length of 25-42 repeats and mutations resulting in an increase of the repeats up to 55 1 have been associated with the neurological disorder SCA17, which includes ataxia, dystonia, parkinsonism and dementia (251-255). The core domain of TBP (Fig. 6), on the other hand, is highly conserved among eukaryotes and archaea, suggesting a central role in the main – transcription-nucleating – function of TBP (110, 256).

Figure 6. Schematic overview of DNA-bound core TBP. Structural characterization of the interaction reveals saddle-like core TBP, intercalating DNA through its concave surface. The convex surface and the two ―stirrup‖ regions are exposed for protein-protein interactions (Freeman et al., 2012, Biochemistry, e7).

Atomic analysis of TBP has revealed that core-TBP adopts a bipartite tertiary structure composed of α-helices and β-sheets arranged symmetrically into a shape mostly described as crescent or saddle (131-133). The concave (bottom) surface of the protein consists of ten mirroring antiparallel β-sheets, implicated in DNA binding through non-polar and hydrophobic interactions. Unlike most transcription factors, TBP associates with the minor groove of the DNA, which leads to partial unwinding of the double helix. Two pairs of highly-conserved phenylalanines (F) within this region of TBP are crucial to its DNA-bending ability by inserting into the first T-A of the TATA sequence and the last 2 bps, both of which are well-positioned within the minor groove of the DNA. Such intercalation causes sharp distortion at the borders of the TATA-box and bends the double helix towards the major groove with an approximate angle of 90o. It is not established whether TBP association at TATA-less promoters drives similar degree of distortion (257). Mutational analyses have shown that single nucleotide alterations within the canonical TATA sequence can severely decrease the TBP-DNA binding affinity, suggesting that degenerate TATA-like sequences and TATA-less promoters may exhibit lower TBP affinity and, perhaps, lower degree of TBP intercalation as well (257, 258). And while TAFs and TFIIA may compensate for this by stabilizing TBP onto the DNA, the possibility of differential DNA distortion based on the underlying TBP-bound sequence may be reflective of the dynamics of TBP turnover in post-initiation stage. Indeed, TATA-bound TBP has been shown to have higher rate of turnover as opposed to the non-TATA-associated TBP (257). A proposed model for these observations ascribes the likely differential bending of the DNA as a catalyser for differential eviction capacity of TBP by its negative regulators, NC2 and BTAF1. The convex (top) side of TBP is made of four amphipathic symmetrical α-helices, which participate in protein-protein interactions with factors, such as the RNAPII-related TAF1, TFIIA and TFIIB, the RNAPIII-related BRF1, the RNAPI-related SL1 and the TBP regulators BTAF1 and NC2 (188-190, 259-261). Interestingly, evolutionary analysis of TBP and its negative regulators

31 Chapter 1

has reported a preferential co-conservation dependent on the presence of the two pairs of Fs 1 within the TBP DNA-binding region (110). When these residues were poorly conserved – NC2 and BTAF1 were similarly absent, indicating a crucial role of these proteins in eviction of strongly- intercalated TBP. Such observation leaves open the possibility of preferential need of TBP negative regulation at sites with conserved TATA sequences, should those indeed exhibit higher TBP affinity and degree of intercalation as opposed to TATA-like and/or TATA-less sequences. Finally, three metazoan functional paralogues of TBP have been identified, TRF1, TBPL1 (TRF2) and TBPL2 (TRF3) (262, 263). Among these, TRF1 is only found in insects and is associated with RNAPIII transcription, while TBPL1 and TBPL2 are metazoan- and vertebrate- specific paralogues, respectively, with role in RNAPII transcription. The latter two contain a core structure similar to TBP, with TBPL2 sharing over 90% core similarity with TBP, and TBPL1, which lacks the ability of associating with TATA box – only ~ 40%. In addition the proteins differ by the presence and/or extent of their disordered N-terminal regions, with TBPL1 completely deprived of it. The functional roles of these paralogues are still investigated; however accumulating data is suggestive of mediation of differential transcriptional programs and central roles in early embryogenesis and gametogenesis (262, 264, 265). Indeed, TBPL1 deletion is associated with male sterility in mice and early embryonic development arrest in non-mammalian model organisms. Furthermore, TBPL1 was shown to interact with TFIIA and TFIIB in similar manner as TBP and to act as negative regulator of TBP-targeted gene expression through sequestering of these two PIC members (266, 267). TBPL2, on the other hand, is primarily oocyte-enriched where it seems to substitute the transcriptionally repressed TBP in the driving of oocyte-specific expression programs (268). Notably upon early embryogenesis a rapid switch from TBPL2- to TBP-dominated transcription is observed correlated with an increase of the TBP expression and decrease of TBPL2. In addition, TBPL2 deletion has also been associated with defects in folliculogenesis and female sterility in mice.

TAF1, TAF1L and TAF1 neuronal splice variants Function: promoter recognition Partners: TBP and TAF7 TAF1 (hTAFII250) is the largest subunit of TFIID with an approximate size of 250 kDa and DNA core promoter binding properties. It is one of the earliest identified TFIID members, likely due to its direct interaction with TBP, which was initially used as affinity bait during co- purification compositional characterization of the complex. Its initial discovery revealed TAF1 was identical with CCG1, a nuclear protein essential for the progression of the G1 phase in the cell cycle (269). Subsequent protein characterization in the context of TFIID indicated a central role for the complex integrity and functionality. Several domain regions within the protein indicate its multileveled role. Such are the TAF1 N-terminal domain (TAND), subdivided into two subdomains – TAND1 and TAND2, the highly disorganized domain of unknown function (DUF3591) and the tandem BrDs located C-terminally (175, 260). More recent characterizations of the protein have shed some light on the DUF3591 region and identified an additional C-terminal Zn2+ knuckle (ZnK) domain (270, 271). TAF1 has also been proposed to contain domains with HAT and kinase functions; however direct in vivo characterization and confirmation of these domains in missing (272, 273).

32 General introduction

The TAND domain is the contact point between TAF1 and TBP and plays a crucial role in locking the latter in an inhibited state while the complex is not promoter-bound (158, 159, 162, 1 260). The inhibition is achieved by: 1) the TAND1 region associating closely with the concave surface of TBP establishing hydrophobic bonds through a structural mimicry of TATA-box, and 2) TAND2, which contains several negatively charged residues that recognize the positive convex site of TBP and, thus, shield away the protein side responsible for TFIIA recognition and PIC assembly (141, 158, 159, 162, 173). Characterization of the DUF3591 has been challenging due to its highly-flexible and insoluble nature. A recent study, however, successfully obtained a co-crystalized structure of TAF1DUF3591 together with TAF7ΔCTD, which revealed winged helix (WH) region in the middle of DUF3591 with two TAF7-interaction domains flanking the WH, forming thereby a pyramid-like organization of the TAF1 upon association with TAF7 (Fig. 7)(260). The N-terminal TAF7- interaction site of TAF1 consists of a triple β-barrel, which heterodimerizes with a similar triple barrel structure within N-terminal TAF7. The C-terminal TAF7-interaction site of TAF1 is an extended α-helical organization closely associating with a coiled C-terminal region within TAF7, completing the triangular shape of the heterodimer. Notably, the WH domain located on top of the triangle was shown to exhibit DNA binding capacity with preferential association to Inr and downstream core promoter sequences, confirming therefore the previously reported function of TAF1 in anchoring TFIID to promoters through the recognition of these elements (173).

Figure 7. Schematic overview of TAF1 and TAF7 interaction. DUF3591 central region of TAF1 stably co-purifies in a complex with TAF7. Structural characterization of the interaction reveals complementary dimerization of triple barrel regions between the two proteins. TAF1 barrel is followed by a DNA-binding winged helix (WH), which forms the top of the triangular shape of DUF3591, and a α-helical domain, which further associated with coiled region within TAF7, downstream of its triple barrel (271).

33 Chapter 1

The remaining to TAF1 domains, the tandem BrDs and the ZnK are both located 1 downstream from the TAF7/DNA interaction region, but are similarly implemented in the TAF1 promoter-recognition function (175, 270). Indeed, the small ZnK region was shown to bind preferentially promoter sequences over random sequences and the tandem BrDs have long been described to recognize acetylated H4 enriched at active genes promoters. Due to its large size, the transcription of TAF1 involves the processing of a complex 38 exons system and additional 5 more downstream exons, all of which have been implemented in the generation of numerous TAF1 splice variations. Interestingly, many of these variants have been detected in brain tissues and a hereditary X-linked neurological disease with dystonia and Parkinsonism phenotypes have been associated with the misregulation of the TAF1 neuronal splicing (274-276). Finally, TAF1 is not expressed during male meiosis as the gene locus is within the X- chromosome, which is transcriptionally repressed at that stage of spermatogenesis. Interestingly, during primate evolution a retroposition event of processed TAF1 mRNA has led to the generation of its 94% identical autosomal homolog TAF1L, shown to be expressed specifically in testis tissue and to recruit TBP and function interchangeably with TAF1 in ex vivo settings (277).

TAF2 Function: promoter recognition, complex assembly Partners: TAF8 The second largest member of TFIID is TAF2 (hTAFII150, CIF150) with a size of ~ 150 kDa and proposed core promoter binding properties. TAF2 consists of an extensive N-terminal domain with high sequence homology to the M1 family of metallopeptidases and a short, positively charged, disordered C-terminal region (278, 279). Despite the M1-similarity, TAF2 does not possess enzymatic activity due to a missing exopeptidase motif required for the functionality (278). Notably, close structural similarities of TAF2 with the enzymatically active endoplasmic reticulum aminopeptidase-1 (ERAP-1) have enabled interchangeable use of the proteins in order to model the existing crystal structure of ERAP-1 in place of TAF2 during structural characterization of TFIID (173). It is unclear whether TAF2 can undergo similar conformational rearrangements as the substrate-bound ERAP-1; however, its DNA-recognition region reportedly coincides with the highly flexible domain III of ERAP-1 (173, 280). Altogether, the structural and/or functional significance of the aminopeptidase-like domain of TAF2 has not been clarified. Nevertheless, as M1 family members often utilize regions within this domain for protein-protein interaction, it may be indicative of role in the complex integrity maintenance (278). Indeed, recent breakthrough in cryo-electron microscopy (cryo-EM) characterization of the complex revealed a likely interface between TAF2 aminopeptidase domain and TAF8 C-terminal region (173). Interaction between TAF2 and TAF8 has previously been established through co- purification and mutational analysis (281). Specifically, presence of a trimeric TFIID module consisting of TAF2/TAF8/TAF10 was reported in cytoplasmic extracts. The only nuclear localization signal (NLS)-containing member, TAF8, was shown to be essential for the nuclear import of the trimer via cooperation with Importin-α (Fig. 8). Τhe C-terminus of TAF8 was further identified as necessary for TAF2 incorporation into the recombinantly assembled partial TFIID, consisting of TAF5/TAF6/TAF8/TAF9/TAF10/TAF12, indicative of an essential interface for TAF2 interaction within this TAF8 region (282).

34 General introduction

Figure 8. Nuclear import model of the cytoplasmic TAF2-TAF8- 1 TAF10 TFIID module. TAF2/TAF8/TAF10 associate in the cytoplasm and via the NLS within TAF8 are imported in the nucleus as a trimer. The import is mediated through importin-1α (IMP). TAF8/TAF10 interaction is formed by a dimerization of their HFDs and TAF2 binds the following C-terminal region of TAF8. Upon nuclear localization the trimer is expected to integrate into core TFIID and trigger structural rearrangements, which lead to the subsequent formation of nuclear holo-TFIID (281).

The incorporation of TAF2 within TFIID has been proposed as essential for the assembly of the complex (see section 1.6.2. TFIID assembly), although complexes lacking TAF2 have been purified (281). This could due to substiochiometric presence of the subunit within TFIID or due to redundancy within the complex post-assembly. The DNA-binding properties of TAF2 support function in promoter-recognition and, therefore, role as part of holo-TFIID post-assembly as well.

TAF3 Function: H3K4me3 recognition Partners: TAF10 (HFD) Human TAF3 (hTAFII140) was identified based on sequence similarity with the already characterized yTAF3, shown to dimerize specifically with TAF10. Since human TAF10 has already been identified at the time, its yeast interaction partner was expected to be conserved in humans as well (176). Notably, yTAF10 was shown to have two interaction partners – yTAF3 and yTAF8, both of which share conserved N-terminal H2A-like HFD, mediating dimerization with a complementary HFD present within yTAF10. Utilizing the conserved HFD sequence of their yeast counterparts, both hTAF3 and hTAF8 were subsequently identified in human and confirmed to associate specifically with hTAF10. Interestingly, while the N-terminus of hTAF3 contained the expected and conserved HFD, a large C-terminal portion missing in yeast was observed in humans, including at its most distal region a discrete chromatin reader PHD finger domain. The sequence preceding the PHD is highly charged, unstructured and of yet unidentified function. Subsequent characterization of the TAF3-PHD in a study probing its affinity to differentially modified histone tail peptides revealed a highly specific preference for H3K4me3 modification, revealing, therefore, the crucial role of metazoan TAF3 in targeting TFIID to gene epigenetically marked promoters (177). Structural interrogation of the interaction demonstrated a crucial role of aromatic residues within the PHD pocket in creating negatively charged surface that

35 Chapter 1

interacts through exceptionally high chromatin-binder affinity (0.16-0.31 μM) with the positively 1 charged H3K4me3 modification (283) (Fig. 9). Further investigation noted a synergistic binding between H3K4me3 and H3K14ac in the recruitment of TFIID, while neighbouring modifications at arginine 2 and tyrosine 3 of H3 (i.e. asymmetric H3R2me2a and H3T3-phosphorylation) led to inhibition of the TAF3 binding, suggestive of regulatory epigenetically-mediated switch of transcriptional initiation by targeting TFIID promoter-recognition mechanisms (283, 284). Interestingly, besides its role in TFIID promoter-targeting, TAF3 has also been implemented in developmental processes through its association with the TBP paralog TBPL2, outside of the TFIID complex (285-287). Indeed, several studies have investigated TAF3 role during myogenic process of myoblast to myotubule differentiation. A considerable decrease of TFIID members, including TBP, has been reported upon the differentiation-favouring transcriptional switch. Notable exception was TAF3, which was shown to associate with the upregulated TBPL2 and drive the expression of a myogenic transcriptional program. More specifically, TAF3 seemed to directly mediate the communication between the MyoD transcription factor and TRF3 at promoter regions of genes essential for myogenesis (e.g. Myogenin).

Figure 9. Interface of the TAF3- PHD and H3K4me3 interaction. Surface plot highlights the PHD finger region and residues forming the binding pocket are indicated. Blue, orange and light brown show the different sites of H3K4me3 insertion. H3K4me3 peptide is represented in sticks and intermolecular hydrogen bonds – in yellow dashed lines (283).

Notably, lack of structural characterization and atomic resolution of this interaction obscures clarity on how these molecules associate. However, domain exclusion analyses indicate an important role on TAF3-N for MyoD interaction and TAF3-C for the TBPL2 contact. While both regions include known domains, HFD and PHD, respectively, it remains to be elucidated how the HFD-lacking MyoD associates with the HFD-containing TAF3 region as well as whether TAF3-PHD still serves as anchoring point to H3K4me3 marked promoters and how the protein recognizes TBPL2, while it does not associate with its structurally similar paralog TBP. Interestingly, similar reduction of TFIID expression and coinciding upregulation of TBPL2 and cooperation with TAF3 have been reported during early embryonic development and haematopoiesis, indicative of possible general role of TBPL2/TAF3 in the driving of differentiation-stimulating transcriptional programs (288). Nonetheless, when challenged in an in vivo system, TBPL2 knock-out (KO) experiments failed to produce muscle defects in mice, but were rather associated with sterility (268).

36 General introduction

TAF4 and TAF4B Function: transcription factors interaction 1 Partners: TAF12 (HFD) TAF4 (hTAFII130/135) was first characterized as a TFIID subunit that heterodimerized with TAF12 specifically (289, 290). Similarly to TAF3/TAF10 and TAF8/TAF10, TAF4 and TAF12 associate via compatible HFDs, which resemble structurally the folds of H2A and H2B, respectively (290). Atomic resolution of this heterodimer reveals however distinguishing differences from the histone pair, such as the atypically disordered and elongated shape of TAF4 helix α3. Notably, TAF4 and TAF12 have been proposed as core members of TFIID, essential for the assembly and integrity of the complex (see section 1.6.2. TFIID assembly) (282). The composition of TAF4 includes highly disordered N-terminal and mid-regions, followed by the C-terminal HFD (278, 282, 290). Role of TAF4 N-terminal region in association and stabilization of the protein onto the DNA, have been suggested (291). The mid-TAF4 includes a domain termed TAFH, which consists of a helical hydrophobic groove and is shown to be involved in the binding of transcriptional activators such as SP1, NFAT, HP1 and NF-E2, and, thus, to mediate initiation events through guided-TFIID promoter association (292-297). Notably, due to numerous TF-binding sites within TAF4, this TFIID subunit has been implemented in wide range of transcription-related functions. Indeed, TAF4 isoforms with varying TAFH structures were shown to be essential for the establishment of differentiation programs for various cell lineages stemming from osteoblasts, neuron progenitors, haematopoietic cells, hepatocytes, cardiomyocytes and keratinocytes (298-302). Such function diversity on transcriptional regulation is suggestive of a central role for TAF4 as a mediator between GSTFs and transcriptional output. A functional homolog of TAF4 is the TAF4B (hTAFII105) protein, shown to assemble into TFIID similarly to its related counterpart (172). Notably, both proteins have been shown to simultaneously incorporate into TFIID, indicating that some TAFs may be present in double copies within the complex (see sections 1.6.2. TFIID assembly and 1.6.3. TFIID stoichiometry and structure) (303). Although TAF4B was initially identified in B cells, it was subsequently shown to be redundant for B cell development and functionality and mainly to accumulate in the cytoplasm (304, 305). The observation was further complemented by the discovery of a nuclear export signal within the protein sequence, which indicates the likely existence of an export pathway, modulating the transcription regulatory activity of the protein. Subsequent cell-type expression experiments revealed preferential accumulation of TAF4B in testis and ovarian tissues, indicating a likely role in those cell lineage determinations (306-308). Indeed, a number of studies characterized TAF4B in the context of gametogenesis (309, 310). Protein deletion in mouse models was reportedly associated with infertility in both males and females, but the mechanism behind this phenotype remains unclear. Several studies have proposed a TAF4B-dependent estrogen-responsive transcriptional upregulation, required for ovarian development and early structural studies suggest a differential TFIID conformation in TAF4B- bound state with a more open complex arrangement and increased accessibility to TF binding (303, 309, 311). Interestingly, TAF4 and TAF4B seem to have different roles in embryonic stem cells (ESCs) transcriptional regulation, with TAF4B being highly expressed at that stage and downregulated upon differentiation, at which point TAF4 peaks and seemingly takes control over the transcriptional programs (312). In mouse embryonic fibroblasts (MEFs) deletion of TAF4 was

37 Chapter 1

partially rescued by TAF4B, although morphological alterations as well as increased proliferation 1 were described (312). Intriguingly, while TAF4B expression seems to be mostly gamete- and stem cell-dedicated, overexpression of TAF4 has been reported as crucial for increased efficiency of induced pluripotent stem cells (iPSCs) reprogramming, highlighting the need for further elucidation of the TAF4 and TAF4B functional dichotomy within TFIID (313).

TAF5 Function: structural role Partners: TAF6 and TAF9 TAF5 (hTAFII100) is another member of the proposed core-TFIID structure, shown to be essential for the complex integrity (see section 1.6.2. TFIID assembly) (282). The C-terminal region of the protein consists of a WD40 β-barrel with a key role in TFIID integrity (314). TAF5 has been shown to strongly associate with TAF6 and TAF9, which like TAF4/TAF12, TAF3/TAF10 and TAF8/TAF10, are dedicated HF-interaction partners (315). Interestingly, TAF5 could bind the TAF6/TAF9 heterodimer, as well as each of the proteins separately, indicating the presence of contact surface for both of these TAFs (316). Furthermore, early co-IP work has indicated weaker association with TAF12 and TAF11 proteins, both of which are HFD-containing proteins. As WD40-containing proteins have been described to bind histones (e.g. WDR5), a scaffold-like role within TFIID was proposed for TAF5, which serves as a stabilizing platform for the various HFD TAF pairs (317). The N-terminus of TAF5 consists of two α-helical regions, termed NTD1 and NTD2 (or LisH), proposed to mediate homodimerization, but this has been debated (282, 318). Notably, structural analysis of in vitro reconstituted TFIID through EM and immunolabeling revealed presence of two TAF5 copies within the complex (314). In agreement, in vitro assembly of core- TFIID core also included a pair of TAF5, suggesting therefore a likely double presence of TAF5 within TFIID in vivo and leaving the possibility of the protein homodimerization open (282).

TAF6 Function: structural role Partners: TAF9 (HFD) and TAF5 The third core-TFIID member is yet another HF-containing protein, TAF6 (hTAFII70/80). Unlike the afore-described, this N-terminally located HFD is structurally related to the H4 fold and, accordingly, its H3-related HFD partner is located within TAF9 (278, 319). Interestingly, early studies have also proposed TAF12 as an additional TAF6 HFD partner due to the partial affinity between the two. As the three proteins contain HFs of H3, H4 and H2B – an octamer histone-like structure has been proposed to exist within the TFIID and provide diverse scaffold and interaction interface (320). The C-terminal region of TAF6 consists of several HEAT repeats important for interactions with TAF5 and TAF9 and Drosophila work has demonstrated TAF5-WD40-dependence of TAF6 interaction (316, 321). Both TAF9 and TAF6 have been suggested to exhibit DNA binding specificity towards the DPE core promoter element and the interaction is proposed to be mediated via histone-like DNA recognition through the HFD interface of the two proteins (168, 322). Notably, this is in contrast with a recent cryo-EM analysis of promoter-bound TFIID, which reports TAF1-DPE association rather than with TAF4 (173).

38 General introduction

Interestingly, several TAF6 isoforms have been described. Among is the TAF6δ variant, missing the HFD due to a frame shift caused by 10nt indel (323, 324). The expression of this 1 variant has been associated with pro-apoptotic upstream stimuli, mediated via activating cis-acting RNA elements and caspase-dependent mRNA cleavage of the isoform (Wilhelm et al. 2008). Due to crucial residues deletion within the HFD region, the interaction of this variant with TAF9 is disrupted. Notably, significantly reduced but detectable interaction with other TFIID subunits was observed, leading to the suggestion that TAF6δ is forming a TAF9-less TFIID complex. Interestingly, apoptotic phenotype was also observed in cells with overexpressed wild-type (wt) TAF6 protein, suggesting that the sequestering of TAF9 from the TFIID is the driver of the cell death transcriptional program activation.

TAF7 and TAF7L Function: transcription factors interaction Partners: TAF1 TAF7 (hTAFII55) is a member of TFIID, shown to associate with TAF1 via its N- terminal triple barrel and mid-coiled regions, as described above (see TAF1) (271). The role of this subunit is not fully established, however numerous TFs have been proposed to drive transcriptional activation through interaction with the central region of TAF7 (e.g. YY1, SP1, c-Jun, USF) (325, 326). TAF7 is also described to disassociate upon formation of the PIC and participate in regulation of promoter escape, pausing and elongation steps (237, 327), but the experimental evidence for this is not strong. Just as with other TAFs, a functional paralog of TAF7, TAF7L, has been identified with role in gametogenesis (328-330). Indeed, TAF7L expression is enriched during spermatogenesis. Interestingly, its localization is primarily cytoplasmic during early steps of the process. With progression of the differentiation, TAF7L is imported into the nucleus – an event which reportedly coincides with TAF7 downregulation and TBP overexpression. The role of TAF7L in spermatogenesis is further supported by association of gene polymorphisms with male infertility and a mouse work reporting similar infertility phenotype in TAF7L or TBPL1 depleted mice. In addition to germ cell specific role, TAF7L has also been implicated in adipogenesis (331). Whether TAF7L assembles a similar TFIID composition as its counterpart remains unaddressed. Notably, despite the lack of genetic conservation, the two proteins share significant amino acid similarity, indicating likelihood for substitution of TAF7 with TAF7L within TFIID structure.

TAF8 Function: structural and developmental roles Partners: TAF10 (HFD) As mentioned, TAF8 (hTAFII43) is a HF-interaction partner of TAF10. The N-terminus of the protein includes a HFD similar to the HFD of TAF3, while the C-terminus is structurally less defined with the only prominent feature being a proline-rich region (278). Notably, this region has been shown as essential for interaction with TAF2, allowing the formation of the TAF2/TAF8/TAF10 heterotrimer (281). The NLS of TAF8 is essential for the import of the other two TAFs, which lack such signal motifs. Interestingly, a recent study investigating possible co- translational mechanisms of TFIID assembly noted association of TAF8 mRNA with TAF10

39 Chapter 1

protein indicating early regulation of designated TFIID partners binding (332). TAF8 is an integral 1 component of TFIID and its heterotimeric assembly into the complex is proposed to drive essential structural rearrangements needed for the association of the remaining TAFs and TBP (282). Interestingly, while TAF8 depletion is lethal in murine embryonic stem cells, homozygous TAF8 mutations leading to reduced and partial TFIID formation were reported in human patients with intellectual disability phenotypes (333).

TAF9 and TAF9B Function: promoter recognition, structural and transcription regulatory role Partners: TAF6, TAF6L (in SAGA) HFD and TAF5, TAF5L (in SAGA) In humans, TAF9 (hTAFII31/32) is the first of three TAFs shared between TFIID and an additional complex, SAGA (see section 1.7. Sharing with SAGA) (59). Through an N- terminally located H3-like HFD it interacts with TAF6 H4-like HFD to form stable heterodimers (334). As described, TAF6/TAF9 also forms a heterotrimer with TAF5 and all three proteins have been shown to be integral for the TFIID (282). The C-terminus of TAF9 is relatively under- characterized. It has been shown as non-essential for the complex formation and TAF6 interaction. However, a high sequence conservation in this region is suggestive of an important role, likely, in transcription regulation. Indeed, C-terminal deletions of TAF9 have been reported to the alter promoter occupancy of TFIID (335). A functional paralogue of TAF9 is TAF9B, which like TAF9 was shown to assemble into holo-TFIID structures via its TAF6 HFD partner (336). While the two proteins perform partially redundant functions, TAF9B was shown as essential for growth and transcriptional regulation in HeLa cells (337). Furthermore, it is specifically upregulated alongside TAF7L and TAF4B around day 11 of embryonic development indicating a role in embryogenesis (306). Both TAFs are essential for cell viability and differentially expressed during apoptosis. Notably, such apoptotic- dependency is established through the functional regulation of p53 activity by TAF9 (336).

TAF10 Function: developmental transcription role Partners: TAF3 (HFD), TAF8 (HFD) and SUPT7L (HFD; in SAGA) Another SAGA-shared subunit of human TFIID is TAF10 (hTAFII30)(59). Following a disordered N-terminus is a C-terminal H4-like HFD, which associates with TAF3 or TAF8 in TFIID or with SUPT7L in SAGA (278, 338). TAF10 has also been shown to co-translate with TAF8 though association with their mRNAs (332). The interaction between TAF10 and TAF8 or SUPT7L is essential for the protein‘s nuclear import (281, 339). In the cytoplasm the protein is proposed to associate in a TAF2/TAF8/TAF10 heterotrimer. Functionally TAF10 is implicated in numerous transcriptional programs, such as embryogenesis, keratinogenesis, hepatogenesis and erythropoiesis (340-343). Notably, these studies suggest that depletion of TAF10 in adult, terminally differentiated cells was much less detrimental to the cell viability, indicating a role in developmental and differentiation transcriptional reprograming rather than TFIID integrity.

40 General introduction

TAF11 Function: structural role 1 Partners: TAF13 (HFD) TAF11 is among the smallest TFIID subunits and contains a single essential domain, a C- terminal HFD, which associates in specific pair with the HFD of TAF13 (344). TAF11 was initially proposed to interact with TBP, and later studies revealed a role for the TAF11/TAF13 heterodimer in providing crucial contact points for promoter-free TFIID-bound TBP (162, 345, 346). In such interaction, the heterodimer competes with TAF1-TAND and the TATA-box for TBP association via an interface provided by TAF13 (158, 162, 173). Interestingly, TAF11 has been implemented in bridging TFIID complex with promoter-bound TFIIA (347).

TAF12 Function: structural and transcription factor interaction role Partners: TAF4 (HFD), TAF4B (HFD) and TADA1 (HFD; in SAGA) The third TFIID subunit shared with SAGA complex is TAF12 (hTAFII20/15) (59). The protein contains a H2B-like HFD and is paired with the H2A-like HFD of TAF4 (289, 290). Recent work has proposed role for TAF12 in the progression of AML, based on the observation that TAF4/TAF12 heterodimer is targeted by the oncogenic transcription factor MYB (348, 349). Interestingly, expression of TAF4-HFD induced AML regression in mouse model through sequestering of TAF12 and suppressing MYB activity.

TAF13 Function: structural role Partners: TAF11 (HFD) TAF13 (hTAFII18) is the smallest TFIID subunit with a size of 18 kDa. It is an N- terminal HFD protein with TAF11 as interaction partner (344). TAF13 also binds TBP via its C- terminus and regulates its availability to promoter elements, likely in the context of TFIID positioning and PIC formation (158, 159, 162, 173, 345).

1.6.2. TFIID assembly The existence of multiple HFD-TAF heterodimers (TAF4/TAF4B-TAF12, TAF6- TAF9/9B, TAF8-TAF10, TAF10-TAF3 and TAF11-TAF13) as well as specific heterotrimeric partners (TAF5-TAF6-TAF9, TAF2-TAF8-TAF10 and TAF11-TAF13-TBP) led to the proposition of a modular mechanism of TFIID assembly (281, 282, 332). Such mechanism entails the pre-formation of separate TAF subcomplexes, likely in the cytoplasmic compartment upon protein biosynthesis, which sequentially associate to form the final holo-TFIID complex. Notably, such model is also suggestive of the existence of checkpoints in TFIID formation, as well as limiting TAFs crucial for the propagation of a complete complex assembly as opposed to partial. In the light of such a subcomplex organization, it would be interesting to correlate functional output with TFIID modules, such as the association of both TAF1L and TAF7L paralogues with gametogenesis, considering their probable (due to described TAF1/TAF7 interactions) mutual co- assembly into TFIID.

41 Chapter 1

While in vivo characterization of TFIID assembly has not been systematically performed 1 yet, the structural mechanism behind it was addressed through biochemical approaches in in vitro settings. Insect-based simultaneous expression of different TAF interaction partners has proven to be a successful method for acquiring soluble and stable TFIID modules (Fig. 10) (282).

Figure 10. Model of TFIID assembly. Core TFIID exists in the nucleus and consists of two copies of TAF4/TAF5/TAF6/TAF9/TAF12, assembled symmetrically. Nuclear import of the cytoplasmic TAF2/TAF8/TAF10 module leads to its incorporation into the core TFIID complex and distortion of the symmetry. Consequently, newly exposed surfaces become available for further integration of the remaining TFIID subunits. Adopted from (281, 282).

Structural cryo-EM analysis of such assemblies revealed at 11.6 Å resolution a possibility for core-TFIID formation through the symmetrical pairing of TAF4/TAF12 and TAF6/TAF9 HFD partners onto the scaffolding surface provided by two TAF5 homodimers. The structure suggested close contact between TAF5-WD40 β-barrel and TAF6-HEAT repeats at the upper part of the crystal, with TAF5-NTDs extending downwards and away from this surface. Of note, the proposed TAF5-NTD dimerization could not be observed in this crystal conformation. The N- terminal HFD of TAF6 is shown to pair with TAF9-HFD in the upper middle region of the core architecture, while the second HFD-pair TAF12/TAF4 is localized closer to TAF5 NTDs, creating sufficient distance from the two HFD pairs to exclude octamer-like formation. Finally, mirroring TAF5-NTDs in an antiparallel fashion, TAF4-NTDs extend from the bottom towards the upper side of the structure, protruding symmetrically as each side above the TAF5-WD40 domains. Interestingly, such symmetrical arrangement was only possible in the absence of additional TFIID members. Indeed, upon the incorporation of the TAF8/TAF10 HFD pair the symmetry was broken, exposing new interfaces, suggested to drive the subsequent recruitment of the remaining TAFs and TBP (282). The molecular mechanisms leading to the holo-TFIID formation

42 General introduction have not been elucidated yet and in vivo data supporting the symmetrical double TAF core model is essential for its validation. Furthermore, as TAF8/TAF10 was shown to form a cytoplasmic 1 TAF2/TAF8/TAF10 module in vivo, it remains to be seen how the core-TFIID is affected by the incorporation of this heterotrimer as opposed to the TAF8/TAF10 dimer (281, 282).

1.6.3. TFIID stoichiometry and structure Considering the multileveled transcriptional regulation provided by the assembly and structural rearrangements of the promoter-bound PIC, mutual efforts in fields of structural biology, biochemistry and molecular biology have been aimed at gaining better mechanistic understanding of this process. Notably, structural analysis of TFIID has proven exceptionally challenging due to its scarce availability upon purification as well as numerous unstable, disordered and/or flexible regions driving major complex rearrangements. Early structural descriptions of human TFIID employed negative-stain EM to report a horseshoe-like shape composed of three defined lobes (A-C) with a central cavity and large degree of plasticity (279, 350-352). In agreement, similar geometry was subsequently observed for yeast TFIID (279). Antibody staining of the obtained structure further indicated the likely existence of several TAFs in double copies, notably, localized within distinct lobes (314). Subsequent EM work has indicated possible structural rearrangements in the presence of bound activators as well as upon promoter association of the complex (352). Indeed, TFIID is classically divided into two main structural organizations: the canonical, DNA-free form and the rearranged promoter-bound TFIID. A major difference between the two states stems from the differential location of lobe A, shown to travel a distance of over 100 Å and to shift from one side of the lobes B and C to the other (279, 350-352). Further comparison of the two forms indicated association between lobes A and C in canonical TFIID, whereas DNA and TFIIA-bound TFIID contained lobe A re-localized in close proximity with lobe B. Notably TFIID structures lacking lobe A were not observed indicating that while highly flexible upon promoter binding, it remains stably tethered to the complex. Recent advances in TFIID structural characterization came from work employing mild- crosslinking for the stabilization of TFIID during the harsh sample preparation conditions of cryo- EM combined with technologically improved equipment (173). The study utilized purified TFIID and TFIIA, bound to a super core promoter (SCP) sequence, containing TATA box, Inr, MTE and DPE elements, which allowed for strong and motif-specific association of the proteins onto the DNA template. The work revealed similarly rearranged TFIID organization, but importantly allowed better resolution of the otherwise highly mobile lobe A, revealing two distinct and differently-dynamic substructures, termed lobes A1 and A2. More specifically, the smaller lobe, A1, was promoter-bound and stable, which allowed dissection of its composition and revealed TATA- bound and TFIIA-stabilized TBP. Lobe A2, on the other hand, was considerably larger and exhibited high degree of flexibility, preventing clear definition of its consistency. Notably, re- examination of TFIID in the absence of DNA indicated that this lobe likely contains TAF1/TAF7 module and a clustering of specific HFD TAF partners (TAF4/TAF12, TAF6/TAF9, TAF10/TAF3 and TAF11/TAF13) bound to a single TAF5-WD40 domain (158, 159, 162). The high resolution of A1 allowed the identification of the TATA box within the obtained SCP-bound TFIID structure and, therefore, the annotation of the downstream promoter elements (173). Based on their location, the available partial structures of TAF1/TAF7 and the

43 Chapter 1

homology-based aminopeptidase region of TAF2 were modelled into the cryo-EM image, revealing 1 major contribution of TAF1 WH domain in promoter recognition. TAF2 was assigned to the TSS-downstream localized lobe C. The lobe additionally included a homodimeric asymmetrical association of two TAF6-HEATs confirming the two copy presence of this TAF within holo-TFIID. Notably, a density connecting TAF6-HEAT with TAF2 was proposed to correspond to an extension of the lobe B-localized TAF8, in agreement with the suggested role of TAF8 in driving the incorporation of TAF2 into core-TFIID (281, 282). Finally, lobe B reportedly contained TAF5-WD40-bound HFD TAF pairs, namely TAF4/TAF12, TAF6/TAF9 and TAF8/TAF10, and with some exceptions resembled the TAF5- bound HFD pairs of lobe A (158, 159, 162). Such observation confirmed the previously proposed stoichiometry of TFIID, which includes double copies of its core subunits, TAF4, TAF5, TAF6, TAF9 and TAF12 as well as TAF10 (281, 282). Close examination of different TFIID cryo-EM structures led to a mechanistic model for delivery of TBP at an accurate distance upstream of the TSS (Fig. 11) (158, 159, 162). In this model promoter-associating TAFs serve as a molecular ruler for TBP deposition, a function with a likely relevance for TATA-less transcriptional initiation. In the model, TBP exists in an inhibited state within free-TFIID through its interactions with TAF1-TAND as well as TAF11/TAF13 heterodimer – all localized within the A lobe of the complex. Upon downstream promoter recognition mediated via the lobe A-specific TAF1/TAF7 and lobe C-specific TAF2, conformational changes within lobe A lead to shift in the position of TBP, which allows for the protein to scan the upstream DNA for a binding region. The scanning is aided through interactions with TFIIA, which is recruited through the upstream-located lobe B of TFIID, and results in out- competition of TBP from its TAF1-TAND-inhibited state and TATA-box engagement. The stabilization of TBP onto the DNA through the TFIIA interactions induces steric clashes with TAF11/TAF13-TBP association, which leads to the release of TBP from lobe A and opens a site for interaction with TFIIB, allowing, therefore, progression of the PIC assembly. Interestingly, parallel characterization of yeast TFIID, while partially in agreement, nevertheless proposes several major differences in the promoter-bound complex organization (291). More specifically, the study suggests a connected, ring-like structure of the three lobes, mediated through TAF6-HEATs (in accordance with the reported human TFIID architecture) and TAF5-NTD dimerization (unlike in human TFIID). Furthermore, while similarities between the HF-containing lobes A and B are striking, crucial discrepancy stems from the proposed TAF11/TAF13 localization with lobe B as opposed to TAF8/TAF10. Notably, the study could not model TAF8/TAF10 within their structure, as well as TBP, indicating that the observed complex may represent a conformational state of TFIID during one of its steps of DNA-binding and TBP deposition at TSS-upstream sequences. Nevertheless, due to several structural differences between human and yeast TFIID, as well as the differences in shared subunits between yeast TFIID and SAGA (TAF5/TAF5L and TAF6/TAF6L) the possibility of different complex organization and promoter-binding mechanism remains plausible (59, 291).

44 General introduction

1

Figure 11. Model of TFIID promoter-association. DNA non-bound TFIID (here Apo TFIID) exists in a canonical or extended form, where DNA-binding surface of TBP is covered by TAF1-TAND. In these forms of TFIID, TBP is still associated with TAF1-TAND and possesses the flexibility to relocate alongside lobe A, while maintaining contact with TAND. Upon promoter engagement of TFIID through TAF1 and TAF2 DNA-binding regions, this flexibility allows for TBP to scan for compatible DNA sequence (TATA box) and bind to it through the stabilizing assistance of TFIIA. The association of TFIIA displaces TBP from its inhibitory TAF1-TAND and TAF11/TAF13 interactions and allows for the intercalation of TBP into the TATA sequence. Such state results in the engaged conformation of TFIID, where lobe A co-localized with lobe B, instead of lobe C. The model proposes subsequent recycling of holo-TFIID by release from the promoter region (162).

1.7. Sharing with SAGA

As described above, human TFIID and SAGA share three subunits – TAF9, TAF10 and TAF12. Interestingly, all three proteins utilize their HFDs for assembling into both complexes, forming TFIID-specific pairs with TAF6, TAF3 or TAF8 and TAF4, respectively, and similar SAGA-specific pairs, namely TAF9/TAF6L, TAF10/SPT7L and TAF12/TADA1 (yADA1) interactions (59, 353-355). These SAGA-specific subunits also contain HFDs, which closely resembles their TFIID counterparts, suggesting likely evolutionary links between the two complexes (344, 356).

45 Chapter 1

Further structural and functional commonalities between TFIID and SAGA are in 1 agreement with possible common origins. Indeed, in yeast TAF5 and TAF6 also shared, indicating that the metazoan-specific TAF5L and TAF6L result in complexes divergence (59, 353, 355). Notably, SAGA is described as a modular assembly, which include four functionally diverse structures: a core essential for the integrity of the complex, a HAT and a DUB (de-ubiquitination) modules participating in chromatin remodelling and an activator binding module, mediating the communication between upstream activation signals and effector functions of the complex (59, 353, 355). Strikingly, the yeast core of SAGA is proposed to consist of TAF5-TAF6-TAF9-TAF12- ADA1, which closely resembles the in vitro reconstituted core-TFIID and suggests common assembly pathways between the two complexes. Molecular mechanisms behind TAF discrimination between the two complexes are still uncharacterized. Further commonalities between TFIID and SAGA are found in the SAGA component, SPT3, which is composed of two N- and C-terminally localized HFDs that are virtually identical to the HFs of TAF13 and TAF11, respectively. Finally, yeast TFIID and SAGA also share TBP, bound in SAGA by Spt3 and Spt8. Notably, metazoan SAGA loss of Spt8 is accompanied by loss of TBP subunit, suggesting a likely role of SAGA in yeast in transcriptional initiation. Indeed, early depletion analyses of the SAGA- specific Spt3 indicated that about 10% of yeast genes are dependent on SAGA, while TAF1 depletion affected over 90% of the transcriptional activity (137). Such observations led to the classification of yeast genome into TFIID- or SAGA-dominated, where TFIID-dominated genes are generally characterized as constitutively-expressed and housekeeping, lacking a consensus TATA sequence at their promoter, and SAGA-dominated genes are stress-inducible and highly regulated, with a well-defined TATA-box (143). Interestingly, while only a small subset of genes appeared to be SAGA-regulated, chromatin endogenous cleavage coupled with high-throughput sequencing (ChEC-seq) analysis has revealed a global genomic association of the complex at upstream activating sequences (UAS), which was correlated with its HAT and DUB histone modifying functions (140). The discrepancy between the global occupancy of SAGA and the small number of genes it was proposed to regulate indicated the likely existence of uncharacterized molecular mechanisms guiding transcription initiation events in yeast. Indeed, recent studies have revised the TFIID- and SAGA- gene classification in yeast by highlighting a dual and non-redundant dependency on both factors for transcription activation of all RNAPII-targeted genes. Such observation was made through the evaluation of only nascent RNA levels upon depletion of TFIID or SAGA subunits, an approach that allowed bypassing the inaccuracy of steady state mRNA measurement, which was shown to be affected by cellular buffering mechanisms (i.e. increase of RNA half-life) upon decrease of transcriptional activation (142). Notably, while reporting that all gene expression was dependent on TFIID and SAGA, the study also highlighted that likely existence of different rate-limiting steps, which could explain higher or lower occupancy of different promoter regions by either of the factors. Such rate-limiting steps could be provided by: 1) gene-specific features such as TATA vs TATA-less promoters, the number of TF binding sites; 2) chromatin architecture based on present histone PTMs and level of nucleosome occupancy; 3) negative transcription regulators targeting the two complexes, such as the TBP-specific NC2 and Mot1 (142, 143). As such, the previously proposed yeast housekeeping/constitutive dependence on TFIID could be explained by lower association of the SAGA complex and the inducible-transcription dependence reported for SAGA-

46 General introduction dominated gene could be due to higher occupancy of SAGA at the UAS of those genes (140, 142, 143, 357). 1 Importantly, in either of the cases both TFIID and SAGA seem to be required for transcriptional activity. This functional co-dependency is intuitive in metazoan transcription, with SAGA acting as a co-activator essential for chromatin remodelling and TFIID – central for nucleation of PIC and the marking of the TSS through TBP deposition. In yeast, however, the role of TBP within SAGA remains largely elusive and further structural and functional characterization of both TFIID and SAGA is needed to shed light on the molecular similarities and discrepancies between these two complexes.

47 Chapter 1

1.8. Concluding remarks and outline 1 Altogether, nearly 50 years of research have unveiled immensely complex and interconnected cellular mechanisms tightly regulating the accuracy, timing and potency of transcription by RNA polymerase II. TFIID is integral for both the initiation and modulation of this process and characterization of its assembly pathways as well as structural composition is key for the identification of checkpoints, components and interactors involved in transcription regulation. Not surprisingly, alterations of the TFIID subunits on transcriptional or structural level have been associated with a number of disorders (333, 358-363). As such a systemic experimental interrogation of each of the TAFs and TBP in various contexts will aid our understanding of their role in mechanisms of disease. The aims of my thesis are to: 1) investigate the evolutionary links between TFIID and SAGA in order to gain understanding in the mechanisms that drive their progressive sub- functionalization; 2) interrogate the core-TFIID structure and assembly on the example of the TAF5/TAF6/TAF6 heterotrimer; and 3) characterize promoter-bound and free TFIID as well as its nuclear vs cytoplasmic form in order to elucidate the molecular pathways involved in the complex assembly as well as shed light on its in vivo compositions. Chapter 1 discussed the nuclear transcription regulatory network with emphasis on TFIID. Specifically, important aspects of chromatin dynamics in the context of transcriptional activity were detailed and the process of transcriptional initiation was examined. The structural and functional role of TFIID was presented in depth and links and dichotomy with the SAGA complex were highlighted. Chapter 2 provides a comprehensive evolutionary analysis of each of the TAF subunits performed through computational biology methods. The conservation of structural elements as well as protein divergence is discussed in detail. Links and evolutionary history between TFIID and SAGA are presented in the context of a mutual ancestral origin and gradual sub- functionalization, accommodating the increasingly complex demand for transcriptional regulation among eukaryotes. Chapter 3 presents the atomic resolution of an in vivo detected cytoplasmic TFIID module, TAF5/6/9, supported through mutational validation analysis combined with proteomic methods. The chaperonin CCT was shown to be a central checkpoint-like player in the assembly of the TAF heterotrimer by ensuring correct folding of TAF5-WD40 and providing a platform for the TAF6/TAF9 association with the N-terminus of TAF5, leading to the release of the protein and the module formation. Chapter 4 details the investigation of TFIID composition in different cellular compartments attained through a systematic interrogation of all TAFs and TBP via proteomic methods. Cytoplasmic pre-assembly modules, nuclear composition and relative stoichiometry as well as chromatin-bound structures are discussed. A pulse-chase analysis of newly- synthesized TAFs highlights the timing and order of TFIID assembly as well as likely additional players involved in it. Chapter 5 investigates the TFIID-formation capacity of the functional TAF1L paralogue in comparison to TAF1 and describes the competitive nature of the the protein for complex assembly. Chapter 6 discusses the results presented in chapters 2-5 in the context of current literature, highlighting the answers and remaining questions and providing an outlook for further experimental investigations of the studied molecular mechanisms.

48 Evolution of TFIID

2

Chapter 2

Epigenetics and transcription regulation during eukaryotic diversification: the saga of TFIID

Simona V. Antonova, Jeffrey Boeren, H.T. Marc Timmers, Berend Snel.

Genes & Development, published online May 23, 2019

49 Chapter 2

Epigenetics and transcription regulation during eukaryotic diversification: the saga of TFIID Perspective Simona V. Antonova,1 Jeffrey Boeren,2 H.T. Marc Timmers,1,3,4,6 and Berend Snel5,6 1Molecular Cancer Research and Regenerative Medicine, University Medical Centre Utrecht, 3584 CT Utrecht, The 2 Netherlands; 2Department of Developmental Biology, Erasmus MC, 3015 CN Rotterdam, The Netherlands; 3Department of Urology, Medical Centre-University of Freiburg, 79106 Freiburg, Germany; 4Deutsches Konsortium für Translationale Krebsforschung (DKTK) Standort Freiburg, Deutsches Krebsforschungszentrum (DKFZ), 69120 Heidelberg, Germany; 5Theoretical Biology and Bioinformatics, Department of Biology, Utrecht University, 3584 CH Utrecht, The Netherlands; 6 Utrecht University, 3584 CH Utrecht, The Netherlands;

The basal transcription factor TFIID is central for RNA polymerase II-dependent transcription. Human TFIID is endowed with chromatin reader and DNA-binding do- mains and protein interaction surfaces. Fourteen TFIID TATA-binding protein (TBP)- associated factor (TAF) sub- units assemble into the holocomplex, which shares sub- units with the Spt–Ada–Gcn5–acetyltransferase (SAGA) . Here, we discuss the structural and function- al evolution of TFIID and its divergence from SAGA. Our orthologous tree and domain analyses reveal dynamic gains and losses of epigenetic readers, plant-specific functions of TAF1 and TAF4, the HEAT2-like repeat in TAF2, and, importantly, the pre-LECA origin of TFIID and SAGA. TFIID evolution exemplifies the dynamic plasticity in transcription complexes in the eukaryotic lineage.

The complexity of eukaryotic organisms requires tightly regulated and fine-tuned gene expression programs for the adaptation to intracellular and extracellular challenges (López-Maury et al. 2008; Rosanova et al. 2017). The basal transcription factor TFIID is critical for gene transcription by RNA polymerase II (Pol II), as it is the first protein complex to recognize core promoters and nucleate preinitiation complex assembly (Gupta et al. 2016). Comprised of TATA- binding protein (TBP) and 13–14 TBP- associated factors (TAFs), the TFIID complex includes a number of domains essential for its core promoter recognition function (Fig. 1; Chalkley and Verrijzer 1999; Vermeulen et al. 2007; Gupta et al. 2016). Several TFIID subunits are shared with the Spt–Ada–Gcn5–acetyltransferase (SAGA) coactivator complex (Fig. 1; Spedale et al. 2012). SAGA is a multimeric complex consisting of several functional modules carrying histone acetyltransferase (HAT) or deubiquitination (DUB) functions (Helmlinger and Tora 2017). The evolutionary link between SAGA and TFIID is evident by shared and paralogous subunits, which resulted from gene duplication and subfunctionalization events (Spedale et al. 2012). However, it is unclear when the ancestral subunits of TFIID and SAGA emerged and how they should be placed on the evolutionary tree of eukaryotes (Fig. 1). Insights into the timing of these duplications helps to understand the subfunctionalization and redundancy of TAFs and TFIID and might also provide a better understanding of the idiosyncrasies of transcription regulation across the whole domain of eukarya.

50 Evolution of TFIID

2

Figure 1. Structural variation between human (h) and yeast (y) TFIID and SAGA complexes. Shared TAFs between TFIID and SAGA may reflect a common ancestral origin for the two complexes (here ―ancestor?‖). Reduction of shared TAFs between TFIID and SAGA in human versus yeast Saccharomyces cerevisiae as well as loss of epigenetic domains in S. cerevisiae (e.g., TAF1 BrDs [bromodomains] and TAF3 PHD) indicate divergence in TFIID and SAGA adaptation to transcriptional requirements across different eukaryotic branches (Matangkasombut et al. 2000; Gangloff et al. 2001a; Spedale et al. 2012). Unique and shared subunits as well as epigenetic reader domains are color-coded as indicated.

Here, we determine the evolutionary history of all TFIID subunits by examining the occurrence and structure of their genes over a time span of almost 2 billion years. TFIID and SAGA subunits are placed in a functional context to understand their diversification. We address the following questions: What is the origin of TFIID? Are functional domains conserved throughout gene duplication events in TAFs? Which functional domains of TAFs are highly dynamic across eukaryotic evolution, and which ones are relatively stable? When did SAGA and TFIID duplicate and diverge? How did TFIID diversify in structure and function to meet the growing morphological complexity across evolving species? These questions are examined by phylogenetic comparisons and by profile searches across a set of well-annotated genomes representative of the eukaryotic kingdom (Supplemental Fig. S1). The results are organized per sets of functionally similar TAFs. First, we start by examining the

51 Chapter 2

three TAFs implicated in chromatin binding (TAF1, TAF2, and TAF3). Second, we determine the relationships between TAF8, TAF3, and the SAGA subunit SPT7. Third, we describe the evolutionary rather invariant sub- units TAF5, TAF6, TAF7, TAF9, and TAF10, including their paralogs. Fourth, the relationships of the TAF4 and TAF12 pair are analyzed with respect to the ADA1 sub- unit of SAGA. Finally, we propose models for the origin 2 of TAF11, TAF13, and their SAGA paralog, SPT3. Our combined results strongly support a pre- LECA (last eukaryotic common ancestor) origin for the complete TFIID complex, comprising the full ensemble of TAFs. Later lineage-specific duplications resulted in TAF1L, TAF3, TAF4B, TAF4x, TAF7L, TAF9B, and the TAF12 paralog, EER4, which allowed subfunctionalizations to support increasingly complex multicellularity. Highly sensitive pro- file searches in a representative set of eukaryotic proteomes indicate a dynamic distribution of the bromo- domain (BrD) and plant homeodomain (PHD) epigenetic domains within TFIID evolution. These dynamic do- mains are in sharp contrast to the invariable histone fold (HF), WD40, and HEAT domains, whose conservation reflects their central role in the complex integrity (Kolesnikova et al. 2018; Patel et al. 2018). Additionally, besides TAF paralogous subfunctionalizations, we characterize a stable ancestral repertoire of TFIID subunits combined with a persistent and invariable structure across the entire eukaryotic lineage.

Evolutionary dynamics of the basal transcription machinery TFIID uses TBP to recognize the TATA element of core Pol II promoters, and this has been well studied (Tora and Timmers 2010). Stable binding of TBP to TATA-boxes involves insertion of two highly conserved phenylalanine pairs of TBP into the DNA, which results in an ∼80° angle. In vitro binding of TATA by TBP displays a long half-life, which is countered by the NC2 and BTAF1/MOT1 regulators of TBP activity. These proteins are required for the dynamic behavior of TBP in vivo (Tora and Timmers 2010). Phylogenetic comparisons revealed that these two phenylalanine pairs in TBP coevolved with genes encoding NC2 and BTAF1/MOT1 (Koster et al. 2015). TBP variants binding less stably to TATA elements do not seem to require NC2 and BTAF1/MOT1, indicating that the Pol II basal transcription machinery can adapt to evolutionary pressures. All eukarya contain at least a single gene for TBP, but TBP homologs can also be found in certain archaeal lineages. However, the genes encoding NC2, BTAF1, or the TFIID TAFs are unique to eukarya (Koster et al. 2015; our unpublished results) and absent from currently avail- able archaeal genomes. It has been shown that several TAFs are duplicated in eukaryotic evolution, which suggests that functional and structural divergence correlates with increased transcriptional complexity. In metazoa, TBP and TAF paralogs are actively involved in promoting development and differentiation as well as maintaining cell and tissue identity (Frontini et al. 2005; D‘Alessio et al. 2009; Pijnappel et al. 2013; Zhou et al. 2013, 2014).

Domain analysis of TAF1 and TAF2 reveals BrD dynamics in fungi The two largest TFIID subunits are represented by the TAF1 and TAF2 proteins. TAF1 is characterized by a variety of domains, which together classify it as the most structurally diverse subunit of TFIID. This complexity is reflected by the large size of the protein, its neuronal-specific alternative splicing, and its functions in both chromatin binding and complex stabilization (Chalkley and Verrijzer 1999; Gupta et al. 2016). The tandem BrDs of metazoan TAF1 are central to TFIID function as a transcription regulatory complex. BrDs mediate binding to acetylated lysines on

52 Evolution of TFIID histones H3 and H4, which are hallmarks of active promoters and transcription (Jacobson et al. 2000). Interestingly, human TAF1 is localized on the X chromosome, and missense mutations in the BrD region have been identified in male patients with intellectual disability phenotypes (O‘Rawe et al. 2015). In the plant Arabidopsis thaliana (see the Glossary for species, clades, etc.), TAF1 only has a single BrD, and Saccharomyces cerevisiae TAF1 lacks both BrDs (Matangkasombut et al. 2000; Bertrand et al. 2005). These observations prompted us to investigate in more detail at 2 which point during evolution (see Supplemental Fig. S1 for an evolutionary time tree) the BrDs were lost or acquired. Analysis of TAF1 domain organization among orthologs revealed a secondary loss of the BrD in the ancestor of dikarya, a subkingdom of fungi that contains ascomycota and basidiomycota (Fig. 2A). Notably, ascomycota include model organisms such as S. cerevisiae, Kluyveromyces lactis, Neurospora crassa, and Schizosaccharomyces pombe. Furthermore, the BrD has an overall patchy (or irregular) occurrence in fungal species and points toward in- dependent loss in at least five lineages, including mucoromycota, chytridiomycota, and the early fungal relative fonticula (Supplemental Figs. S1, S2). All metazoans, on the other hand, possess a second BrD acquired in their common ancestor. The presence of a BrD is predicted to have direct implications for TAF1 binding to acetylated nucleosomes in the vicinity of transcription regulatory elements (Jacobson et al. 2000). The absence of BrDs in fungal lineages suggests that acetylated nucleosomes do not serve as anchoring points for fungal TFIID. In contrast to TAF1, TAF2 domain analysis revealed highly dynamic BrD distribution across fungi (Fig. 2B). While TAF2 from ascomycota does not possess any BrD (similarly to TAF1), differential loss and gain of a variable number of BrDs (ranging from one to six) was observed in TAF2 from members of the basidiomycota, mucoromycota, and blastocladiomycota. The dynamic nature of the TAF2 BrDs is emphasized by a differential occurrence in related fungi. For example, the two closely related species of Mucor circinelloides and Phycomyces blakesleeanus (members of mucoromycotina) (Supplemental Fig. S1) contains six BrDs or no BrD, respectively (Supplemental Fig. S3). Mortierellomycetes (closely related to mucoromycotina) contain four BrDs. As indicated, TAF2 from ascomycota consistently lacks a BrD, but TAF2 from their closest basidiomycota contains BrDs (Supplemental Figs. S1, S3). These BrDs were likely acquired in a fungal ancestor, since their emergence is only specific for the fungal branch of the eukaryotic tree. Altogether, the analysis of BrD occurrence across TAFs shows that this domain is present only in TAF1 and TAF2. Due to its direct involvement in epigenetic regulation via acetylated histone recognition, the overall BrD dynamics observed in TAF1 and TAF2 emphasize differential transcription regulation in diverse eukaryotic lineages. Notably, in fungal TFIID, BrDs were differentially acquired and have been independently lost multiple times during evolution. Several fungal species seem to compensate for the absence of a TAF1 BrD by the presence of multiple BrDs in TAF2. Transfer of the BrD from TAF1 to TAF2 could influence TFIID conformation on the core promoter but could also reflect different histone acetylation patterns between species. Beside BrDs, TAF1 structure is characterized by an N-terminal TBP-binding domain (TAND) (Burley and Roeder 1998), a central domain for dimerization with TAF7 (Bhattacharya et al. 2014), and a zinc finger region (zf-CCHC_6) or Zn knuckle only recently described as involved in core promoter DNA binding (Curran et al. 2018). Previous work indicated that TAF1 can bind to the INR element of core promoters (Chalkley and Verrijzer 1999), but this function has not been mapped to a TAF1 domain yet. Examination of domain dynamics of the TAND and Zn

53 Chapter 2

knuckle regions across species is limited by the low sequence conservation of these regions, resulting in their patchy (or irregular) distributions across the phylogenetic tree (Fig. 2A). However, the observation that both domains occur in numerous common ancestors indicates that they have been present in an ancient eukaryotic progenitor.

2

Figure 2. Inferred evolutionary history of TAF1 and TAF2. (A) TAF1 is duplicated in the Old World monkeys. BrD is gained in the ancestor of metazoa and lost in dikarya. Streptophyta acquired a ubiquitin-like domain. (B) TAF2 contains previously unrecognized HEAT2-like repeats. Various BrDs were acquired early in fungal evolution and subsequently lost late in fungi. Duplications are represented as red arrows; gradient domains are not predicted in all species of that respective (super)group.

54 Evolution of TFIID

In A. thaliana, a TAF1 ubiquitin-like module has been proposed (Bertrand et al. 2005). Our analysis indicates that this module was acquired as early as the ancestor of the streptophyta, which contains land plants and related eukaryotic algae (Fig. 2A). Klebsormidium flaccidum is the earliest branching species that contains a ubiquitin- like domain, which is retained up to A. thaliana (Supplemental Figs. S1, S2). The domain is inserted into the TAF7 interaction domain (Bertrand et al. 2005; Wang et al. 2014). The existence of a conserved ubiquitin-like domain within TAF1 in 2 streptophyta is suggestive of regulatory processes involving ubiquitin-binding modules, but the exact link with transcription remains to be discovered. Finally, our analysis did not reveal any HAT or kinase domain within TAF1, which has been suggested previously (Matangkasombut et al. 2000). Besides domain rearrangements, TAF1 evolution is further marked by a duplicative retrotransposition event in the hominoid lineage. This TAF1L gene was first identified in Old World monkeys (cercopithecidae) using intron-spanning primers, and protein expression is only observed in the testis (Wang and Page 2002). Adding Old and New World monkeys to the data set and analyzing TAF1 paralogs confirmed this duplication timing, as there is no paralogous TAF1 gene in Callithrix jacchus, a New World monkey, while the Old World monkey genomes of Hylobates leucogenys and Papio anubus contain TAF1L (Fig. 2A; Supplemental Figs. S1, S2). Furthermore, TAF1L of these species clusters with human TAF1L in the gene tree (Supplemental Fig. S2), which reflects the close relationship of Homo sapiens with Old World monkeys.

Domain interrogation highlights a HEAT2-like repeat region in TAF2 TAF2 is characterized by the N-terminal aminopeptidase M1-like domain, homologous to the catalytic domain of leukotriene A4 hydrolase (LTA4H), a member of the structurally conserved M1 aminopeptidase family of proteins (Papai et al. 2009; Drinkwater et al. 2017). Despite this homology, the signature exopeptidase motif of M1 amino- peptidase, GxMxN, is not present in any of the TAF2 orthologs (data not shown), indicating that the protein lacks peptidase activity. However, the zinc-binding motif of the catalytic site, HExxHx18E, is present in a number of TAF2 orthologs, including human (Hosein et al. 2010). Domain investigation of TAF2 confirmed the consistent presence of peptidase M1-like across all eukaryotes (Fig. 2B; Supplemental Fig. S3), indicating a pre-LECA origin. Interestingly, analysis using the Conserved Domain Database (CDD; NCBI) in the TAF2 orthologous group revealed a HEAT2-like repeat region (data not shown). The presence of a HEAT structure in TAF2 is consistent with recent cryo-EM results of human TFIID, which revealed a density in TAF2 with architecture resembling an armadillo fold (Louder et al. 2016). The study used hu- man endoplasmic reticulum aminopeptidase1 (ERAP1), also a member of the M1 aminopeptidase family, for homology-based modeling of TAF2 into the structure of TFIID. Notably, atomic structure analysis of ERAP1 revealed eight atypical HEAT repeats at its C terminus (Nguyen et al. 2011). To enhance the sensitivity for detection of HEATs in our TAF2 orthologous group, the identified HEAT region was included in the tree analysis, since the best predictors for a HEAT repeat identity are protein internal repeats (Yoshimura and Hirano 2016). Indeed, two pairs of HEAT2-like repeats were identified and are ubiquitously present in TAF2 orthologs across all eukaryotes (Fig. 2B). The first pair of HEAT repeat resides in all TAF2 orthologs, while the second repeat is present mainly in metazoa and has a patchy distribution across the rest of the eukaryotic tree. However, even with the enhanced detection sensitivity, the sequence divergence in HEAT repeat sequences was quite large, and, outside the regions of HEAT2

55 Chapter 2

homology, the exact architecture and number of the expected repeats could not be determined. Based on our analyses and the ERAP1 structure, we propose that human TAF2 contains a HEAT2-like region spanning amino acids 646–976 and likely consisting of eight repeats (Fig. 2B). In conclusion, our analyses showed that, unlike their epigenetic domains, the remaining structures of TAF1 and TAF2 are invariable across eukaryotes and most likely have a pre-LECA 2 origin. A notable exception is the ubiquitin-like domain insertion in TAF1, originating in the plant branches of eukaryotic evolution. TAF2 in LECA contained the N-terminal aminopeptidase M1- like domain followed by a region of HEAT2-like repeats.

The TAF3 PHD finger is dynamic in the eukaryotic lineage Generation of the orthologous tree for TAF3 was complicated by consistent cross- identification of TAF8 and the SPT7 (SUPT7L in metazoa) subunit of SAGA in the profile search. Notably, all three proteins are shown to form a HF pair with TAF10 (Gangloff et al. 2001b). Therefore, the proteins were combined in a single group. Investigation of their evolutionary relationship indicated that they share a highly conserved HF domain (HFD), which is the reason for their cross-identification. In addition to the HFD, TAF8 is characterized by a proline-rich region, which is in- variable across species (Fig. 3). This allowed separation of TAF3 and SPT7 to examine their domain evolution. In contrast to TAF8, both TAF3 and SPT7 have undergone substantial reorganization across species, which is dis- cussed separately. Given the highly similar domain architectures of the TAF3 N-terminal HFD and the C-terminal PHD finger within early branches of fungal mucoromycotina (e.g., in M. circinelloides) and metazoans, it seems that TAF3 emerged in the opisthokonta (animal and fungal lineages and their unicellular relatives but not plants) ancestor through a duplication of TAF8 followed by acquisition of a PHD finger (Fig. 3A; Supplemental Figs. S1, S4). Within opisthokonta, TAF3 is highly variable. As such, human TAF3 includes the C-terminally acquired PHD finger central to H3K4me3 recognition and TFIID association with active promoters (Vermeulen et al. 2007). S. cerevisiae TAF3 lacks this chromatin reader domain (Gangloff et al. 2001a), which is consistent with observations that H3K4me3 modifications are less important for gene transcription in this yeast (Howe et al. 2017). Interrogation of the timing of PHD loss indicates a secondary loss early in fungal evolution based on the presence of a C-terminal PHD finger in mucoromycotina (e.g., M. circinelloides) and metazoa (Fig. 3A; Supplemental Figs. S1, S4). In contrast, fungi branching after the split of mucoromycotina (mainly dikarya) lack a PHD finger and are characterized solely by the HF (Fig. 3A; Supplemental Figs. S1, S4). Interestingly, there is significant overlap between fungal members that lack both the TAF1 BrD and the TAF3 PHD finger, which strengthens the hypothesis of a smaller contribution of chromatin modifications to transcription regulation in these organisms. On the other hand, the gain of TAF2 BrDs in some species may reflect an alternative mechanism for chromatin binding by TFIID or an inter- mediate stage of adaptation. Outside of the opisthokonta, most proteins in the gene tree contain a canonical TAF8 or SPT7 but not TAF3, indicating that TAF3 is not present in nonopisthokonta eukaryotic branches such as plants. Earlier work using a yeast two-hybrid approach in A. thaliana could not identify TAF3, supporting its absence in archaeplastida (including also land plants) (Lawit et al. 2007). The absence of TAF3 in archaeplastida suggests that TAF8 may be present in two copies in the TFIID complex in this supergroup in order to match two proposed two-copy stoichiometry of TAF10 within the complex (Bieniossek et al. 2013).

56 Evolution of TFIID

2

Figure 3. Inferred evolutionary history of TAF3, TAF8, and SPT7. (A) TAF3 arises from a duplication of a shared ancestor of TAF8 in opisthokonta. TAF3 acquired a PHD, which is secondarily lost in late fungi. (B, C) SPT7 duplicated either in the ancestor of the amoebozoa (B) or pre-LECA (C), implying differential loss. Metazoan SPT7 lost its BrD. Duplications are represented as red arrows.

The hypothesis of TAF8 duplication at opisthokonta and the resultant TAF3 is evident only from the domain analysis and not from ortholog clustering (Fig. 3A; Supplemental Fig. S4). Indeed, several TAF3 proteins in ascomycota (fungi) partly cluster close to their TAF8 counterpart, which could be interpreted as independent duplication and convergent evolution of TAF3 in fungi. Furthermore, TAF3 paralogs in mucoromycotina (fungi) cluster together with TAF8 in the supergroup of SAR (stramenopiles, alveolates, and rhizaria), which suggests the presence of one

57 Chapter 2

common protein (TAF8) rather than a separate TAF3 protein (Supplemental Fig. S4). In addition, no holozoa (single-celled organisms closely resembling animals) TAF3s are present in the gene tree (Fig. 3A). A possible explanation for such clustering inconsistencies comes from the relatively short sequences of TAF3, TAF8, and SPT7 HFDs used as a baseline for our tree, which likely lacks sufficient evolutionary information. Since all three proteins interact differently with their 2 common interaction partner TAF10 (Gangloff et al. 2001b), significant sequence divergence is also likely to play a role in the observed clustering inconsistencies. Consequently, our TAF3 origin and evolution hypothesis is based mostly on domain organization analysis and not clustering in the gene tree. In conclusion, TAF8 is present invariantly across the entire eukaryotic lineage and has a pre-LECA origin. Subsequent duplication in opisthokonta most likely gave rise to TAF3. Subsequently, TAF3 acquired a PHD finger, which is retained in metazoa and early fungi (mucoromycotina) and is subsequently lost later in fungal evolution. Similar to TAF1 and TAF2, TAF3 PHD evolution demonstrates the dynamic nature of epigenetic readers within TFIID across the eukaryotic lineage.

Comparative evolutionary analysis of TAF8 and SPT7 reveals BrD gains in SAGA SPT7 is a SAGA-specific subunit that interacts via its HFD with TAF10, which is shared between SAGA and TFIID across species (Spedale et al. 2012). Domain characterization revealed that in amoebozoa and opisthokonta, ancestral SPT7 includes an N-terminal BrD followed by an HFD (Fig. 3B). Interestingly, in metazoa, SPT7 (hSUPT7L) lacks a BrD, which results from secondary loss in the animal ancestor, as is evidenced by the presence of a BrD in unicellular holozoan SPT7 (Fig. 3B; Supplemental Fig. S4). None of the early animals, such as Nematostella vectensis, retained this BrD, which suggests a functional reduction of metazoan SAGA in binding acetylated lysines (Fig. 3B; Supplemental Figs. S1, S4). This may be compensated for in metazoan SAGA through a BrD in the GCN5 subunit of the HAT module (Hassan et al. 2002). The GCN5 BrD is essential for SAGA chromatin recognition and transcriptional activation (Syntichaki et al. 2000). The SPT7 BrD has been shown to anchor SAGA to acetylated chromatin but was dispensable for the function of the complex in S. cerevisiae (Hassan et al. 2002). Structural interrogation of BrDs suggests conserved core residues involved in the recognition of acetylated lysines surrounded by a target-specific cavity, which differs between individual domains (Josling et al. 2012). As such, while dispensable for SAGA function, the BrD of fungal SPT7 seems to add versatility to the molecular mechanisms of chromatin recognition by SAGA, and this function has been lost in animals. This highlights the diverging functions of SAGA between fungi and metazoa and mimics in reverse TFIID evolution, in which animals, but not fungi, increase their epigenetic dependency through acquisition of relevant domains within the complex. Similarly to TAF3, timing the origin of SPT7 is challenging, but domain analysis indicates that SPT7 resulted from a duplication event of TAF8. This event occurred either in the ancestor of amoebozoa and opisthokonta (Fig. 3B) or pre-LECA (Fig. 3C). Orthologs of SPT7 are present across all amoebozoa and opisthokonta, which suggests an origin in their ancestor. Nevertheless, there are two proteins in the plant, including archaeplastida that cluster together with SPT7 – one from Cyanophora paradoxa (a glaucophyte), which only has the HFD, and another from K. flaccidum, which contains a BrD followed by an HFD (Supplemental Fig. S4). The presence of an SPT7-like sequence outside of amoebozoa and opisthokonta could be due to (1) SPT7 originating pre-LECA

58 Evolution of TFIID and differential loss in the respective supergroups (Fig. 3C); (2) horizontal gene transfer (HGT) to these species, which is a rare eukaryotic event (Leger et al. 2018); or (3) technical difficulties in domain sequence alignment and low conservation. Based on this, the precise timing and origins of SPT7 remains unclear. In conclusion, we propose that SPT7 originates from either amoebozoa or a pre-LECA TAF8-like ancestor. In the latter case, the SPT7/TAF8 ancestor was probably a sub- unit of both 2 SAGA and TFIID. A duplication event result- ed in ancestral TAF8 and SPT7 proteins, both of which subfunctionalized to TFIID and SAGA, respectively. After this duplication, SPT7 acquired a BrD in the amoebozoa–opisthokonta ancestor and some archaeplastida, which has been lost subsequently in animals. This dynamic gain (in fungi) and loss (in metazoa) of BrDs is reminiscent of their variable occurrence in TAF1 and TAF2 between these two kingdoms, which supports plasticity in binding acetylated lysines. This appears to be a com- mon theme in the dynamic evolution of TFIID and SAGA complexes.

The invariable ancestral repertoire of TFIID The dynamic domain variations in TAF1, TAF2, and TAF3 are contrasted by relatively stable domain organization of the other TAF subunits. These include the core TFIID subunits TAF5, TAF6, and TAF9 (Bieniossek et al. 2013) as well as TAF10 and TAF7. The TAF4 and TAF12 core subunits appear to have followed a distinct evolutionary pathway in plants and therefore are discussed separately. In addition, while TAF11 and TAF13 maintain their simple HFD-only structure, these proteins are discussed separately in light of their evolutionary link with the SAGA subunit SPT3 (hSUPT3H). The evolution of TAF5 and TAF6 shares a common theme of duplication into paralogs that subfunctionalized toward TFIID or SAGA. Previous studies speculated on animal-specific timing of duplication, as both paralogs are present in Drosophila melanogaster (Spedale et al. 2012). The earliest detection of TAF5 and TAF6 paralogs in our orthologous gene tree is in N. vectensis, which confirms duplication of TAF5 and TAF6 in an ancestor of metazoa (Fig. 4A, B; Supplemental Figs. S1, S5, S6). This analogous evolution fits well with the close interaction of the proteins in core TFIID and their central role in overall complex integrity (Bieniossek et al. 2013). In addition, a simultaneous occurrence of the SAGA-specific TAF5L and TAF6L paralogs is indicative of the combinatorial structural basis for discrimination between TFIID and SAGA in terms of architecture, assembly, and function. Domain analysis of TAF5 and TAF6 revealed an overall stable organization. TAF5 has been characterized by the presence of a Lis homology domain (LisH) followed by the N-terminal domain 2 (NTD2) and WD40 repeats (Bieniossek et al. 2013; Malkowska et al. 2013). Assessing the domain organization of TAF5 indicated seven WD40 repeats, which are widely spread across all eukaryotes (Fig. 4A; Supplemental Fig. S5). The first and last repeat diverged more compared with the other repeats, which complicated identification using the canonical WD40 model and required optimization of repeat detection. In SAR and excavata (supergroup containing unicellular organisms), the prediction accuracy was insufficient, resulting in a variable number of WD40 repeats ranging from one in Blastocystis hominis to seven in Aplanochytrium kerguelense (Supplemental Fig. S5). Moreover, the sequence length separating the repeats increases, indicating possible structural divergence, which likely contributed to the occasional patchiness of the repeats in our

59 Chapter 2

tree (Fig. 4A; Supplemental Fig. S5). The LisH domain also exhibited patchy distribution in the alignments due to low conservation of these sequences.

2

Figure 4. Evolutionary history of the relative invariable TFIID subunits. (A) TAF5 duplicated in the ancestor of animals and contains seven WD40 repeats. (B) TAF6 duplicated in the ancestor of animals. TAF5 and TAF6 paralogs subfunctionalized to either SAGA or TFIID. (C) TAF9 duplicated in placentalia but did not subfunctionalize to SAGA. Duplications are represented as red arrows; gradient domains are not predicted in all species of that respective (super)group.

60 Evolution of TFIID

TAF6 has been characterized previously by the presence of a N-terminal HFD-mediating TAF9 interaction, which is followed by a region of HEAT repeats (Bieniossek et al. 2013). Domain analysis revealed no innovations for TAF6 among eukaryotes (Fig. 4B; Supplemental Fig. S6). Notably, the publicly available TAF6_C_HEAT (Pfam: PF07571) model does not cover the entire HEAT repeat and starts only from helix 2 in repeat 3 to helix 1 in repeat 5 (Scheer et al. 2012). The entire HEAT region in human TAF6 spans from 218 to 477 (Scheer et al. 2012). The 2 individual HEAT repeats have diverged significantly in sequence, which prevents the determination of possible gains or losses in HEAT repeats. Similar to its TAF6 interaction partner, TAF9 is present across the entire eukaryotic lineage (Fig. 4C; Supplemental Fig. S7). As TAF9 has been duplicated into TAF9b in mammals (Frontini et al. 2005), additional mammals were included in the representative eukaryotic tree to determine the timing of this duplication event. This revealed that gene duplication occurred in the ancestor of placental mammals, as a single TAF9 protein was detected within Ornithorhynchus anatinus (platypuses). Meanwhile, two TAF9 proteins are present within Loxodonta africana (African elephants), each of which clusters with TAF9 and TAF9b of H. sapiens and Mus musculus, respectively (Fig. 4C; Supplemental Figs. S1, S7). Furthermore, Macropus eugenii (wallabies) or Monodelphis domestica (opossums) do not have a TAF9 duplication. These two organisms belong to the marsupialia and are the closest relatives of the placentalia (Deakin 2012). Hence, TAF9 was duplicated later than TAF5 and TAF6. The timing difference suggests distinct functional outcomes for the three duplication events. Indeed, neither TAF9 nor TAF9b sub- functionalized toward SAGA. The structural invariability and functional conservation of TAF9 possibly reflects on its role in complex integrity of both TFIID and SAGA, while the TAF9 interaction partner TAF6 and its associated partner, TAF5, provide context-dependent variability in animals. The pattern of invariability in one interaction partner while the other is continuously evolving is also observed for other TAFs. A highly conserved structural organization of TAF10 and TAF7 is observed across species, but their interaction partners (TAF3 and TAF8 or TAF1, respectively) are characterized by a dynamic domain organization. In short, TAF10 contains only an HFD and maintains this basic fold across the entire tree of eukaryotes (Supplemental Figs. S8A, S9). TAF7 is characterized by the presence of an NTD (Pfam; TAFII55_N), essential for the interaction with TAF1 (Bhattacharya et al. 2014). TAF7 structure is conserved across all eukaryotes (Supplemental Figs. S8A, S10). The TAF7 paralog TAF7L is essential for spermatogenesis in mice (Cheng et al. 2007), which is striking in light of testis-specific expression of the TAF1 paralog TAF1L (Wang and Page 2002). Timing the duplication for this paralog was challenging due to a low sequence conservation. In the vertebrate ohnolog database, TAF7L has an intermediate confidence for being duplicated in the vertebrate whole-genome duplication (WGD) event (Singh et al. 2015). Addition of vertebrate species to the database does not support this prediction and reveals a likely origin of TAF7L in mammals, as two copies of TAF7 were detected in O. anatinus (Supplemental Figs. S1, S8B, S10) – a part of the monotremes (egg-laying mammals) that branched early in mammalian evolution and has a striking combination of mammalian and reptilian features (Luo et al. 2011). In summary, TAF5 and TAF6 duplicated in the ancestor to metazoa and are otherwise present across the eukaryotic lineage as a single-copy gene, which stresses the pre- LECA origin of both TAFs. Metazoan paralogs subfunctionalized to localize to either TFIID or SAGA. No domain innovations have been found for either protein. Duplications of TAF7 and TAF9 were

61 Chapter 2

specific for later animal branching events (TAF7 in mammalia and TAF9 in placentailia), but none of them is linked to SAGA-specific subfunctionalization. Together with the Old World mon- key appearance of TAF1L, these duplications are indicative of animal-specific TFIID subfunctionalization events, which may be linked in part to mammalian reproduction. The presence of TAF5, TAF6, TAF7, TAF9, and TAF10 across the different eukaryotic supergroups 2 implies that these subunits have a pre-LECA origin, since no specific eukaryotic origin could be identified within the early branching events.

Duplications and plant-specific variations for TAF4, Ada1, and TAF12 interaction partners TAF4 is widespread across eukaryotes and mostly lacking in SAR (with the exception of Oxytricha trifallax) (Supplemental Fig. S11). Still, the TAF12 HFD partner of TAF4 is widespread in SAR (Supplemental Fig. S12), indicating that TAF4‘s absence is due to poor genome quality or gene prediction. TAF4 is characterized by the presence of a highly disordered N-terminal region, which is followed by an NHR1-binding (or TAFH-binding) motif in animals or an RST-binding motif in plants and an HFD (Gangloff et al. 2001b). The RST motif is found within RCD1, SRO, and TAF4 proteins and is a binding interface for multiple transcription factors (TFs) (Jaspers et al. 2010). RST is proposed to be a streptophytan invention and identified in TAF4 by sequence similarity (Jaspers et al. 2010). Indeed, the motif first appeared in the streptophyta and is not present in green algae or other archaeplastida (Fig. 5A; Supplemental Figs. S1, S11). In animals, NHR1 has a position similar to plant RST, but they do not share any homology (by HHSearch; data not shown). The NHR1 domain was acquired in the ancestor of animals, which is indicated by its presence in N. vectensis and absence in holozoa or amoebozoa (Fig. 5A; Supplemental Figs. S1, S11). Functional analysis of the NHR1 motif showed that it interacts with TFs and is associated with ETO (eight twenty-one) oligomerization (Wei et al. 2007). The functional similarity of RST and NHR1 motifs points toward convergent evolution of TF-binding interfaces within TAF4 and that this TFIID subunit acts as a target for tissue- and lineage-specific regulation of transcription. TAF12 is conserved across species and consists of a single HFD (Fig. 5B). It is present as a single copy across eukaryotes, except for a gene duplication event yielding EER4 in plants. TAF12 duplication is proposed to have occurred with angiosperm (flowering plant) WGD (Jiao et al. 2011). Streptophyta such as Physcomitrella patens, which branches earlier than angiosperm, do not contain EER4, which confirms the time of duplication (Fig. 5B; Supplemental Figs. S1, S12). This plant-specific innovation in TAF12 mirrors the RST motif variation observed in plant TAF4. Besides TAF4, TAF12 also interacts via its HFD with the SAGA subunit ADA1 (hTADA1 in humans) (Spedale et al. 2012), which suggests a possible evolutionary link between TAF4 and ADA1. The SAGA-Tad1 domain in ADA1 (Pfam: 12767), which includes the HFD, was used to generate a TAF4/ADA1 tree (Supplemental Fig. S13). This showed that ADA1 duplicated several times within streptophyta, which is consistent with previously recognized duplications in archaeplastida (Srivastava et al. 2015). This plant-specific event again points toward lineage-specific variations within TAF4/TAF12/ADA1 inter- actions and suggests that the proteins are intimately linked in structure and function. Analysis of the timing of TAF4/ADA1 subfunctionalization showed that both proteins form monophyletic groups in the gene tree, which points toward a duplication and complete subfunctionalization before the emergence of eukaryotes (Supplemental Fig. S13). It

62 Evolution of TFIID seems that TAF4 and ADA1 likely share a pre-LECA ancestor, which resided in both the TFIID and SAGA complexes (Fig. 5C). The duplication event freed this ancestor for specialization toward a single complex.

2

Figure 5. Inferred evolutionary history ofTAF4/Ada1 and the TAF12 HF partner. (A) TAF4 duplicated in the ancestor of vertebrates through a WGD. Afterward, an additional small-scale duplication took place, named TAF4x, which is lost in tetrapoda. The RST domain is acquired in the ancestor of streptophyta, while the NHR1 domain is acquired in animals specifically. (B) TAF12 duplicated in the angiosperm through a WGD. (C) TAF4 and Ada1 emerged through a pre-LECA duplication and subfunctionalized to either SAGA or TFIID. WGD events are represented as blue arrows.

63 Chapter 2

TAF4 underwent additional duplications as a TFIID subunit in vertebrates. The best known is TAF4B, which emerged after the vertebrate WGD based on the vertebrate ohnolog database (Singh et al. 2015) and additional vertebrate species in our eukaryotic tree (Fig. 5A; Supplemental Figs. S1, S11). Strikingly, we found an additional TAF4 duplication within Latimeria chalumnae (coelacanths), a sarcopterygii (lobe-finned fish) closely related to the tetrapoda (four- 2 limbed vertebrates) (Supplemental Fig. S1; Amemiya et al. 2013). The clustering with other vertebrates confirmed the existence of a paralog next to TAF4 and TAF4B, which we named TAF4x (Fig. 5A). It seems that TAF4x has been lost in tetrapoda. These results are in line with the 2R hypothesis of the vertebrate WGD, pointing toward two back-to-back WGDs followed by differential loss of this TAF4 paralog (Kasahara 2007). Notably, the model organism Danio rerio (a ray-finned fish that belongs to actinopterygii) also still contains TAF4x (Fig. 5A; Supplemental Figs. S1, S11). Altogether, our data indicated that an ancestral TAF4/ ADA1 protein existed pre-LECA, which had under- gone duplication and subfunctionalization, resulting in TFIID-specific TAF4 and SAGA-specific ADA1. TAF4 had undergone additional duplications after WGD events in vertebrates, leading to the TAF4B and the fish-specific TAF4x paralogs. TAF4, ADA1, and TAF12 have all undergone plant-specific innovations, indicating differential evolution of these interaction partners within specific eukaryotic plant branches.

Common evolution for TAF11, TAF13, and SPT3 proteins TAF11 and TAF13 are TFIID-specific HFD interaction partners with a simple organization of a single HFD (Gupta et al. 2017). Notably, the SPT3 subunit of SAGA contains two HFDs in tandem. The HFD of SPT3 at the N- terminal half resembles TAF13, while the one in the C- terminal half is homologous to TAF11 (Gangloff et al. 2001b). TBP has been shown to interact with both the TAF11/TAF13 dimer and SPT3 (Eisenmann et al. 1992; Mengus et al. 1995). This raises questions about the evolutionary relationship between the three proteins. To ex- amine this, the HFDs of SPT3 were separated in order to infer a phylogenetic tree of SPT3-N, SPT3-C, TAF11, and TAF13, which suggest a pre-LECA origin of these proteins (Supplemental Fig. S14). The tree showed two clear clusters – one containing mainly the TAF13 and SPT3-N HFDs, while the other contained the TAF11 and SPT3-C HFDs. Within these clusters, additional separation is also observed between the TAFs and SPT3 sequences, which stresses their subfunctionalization in TFIID and SAGA. TAF11 and TAF13 are widespread across the entire eukaryotic tree with a few exceptions in SAR species (namely, Albugo laibachii and Bigelowiella natans), which contain a single HFD cluster with the SPT3-N HFD (Supplemental Fig. S14). Due to the HFD sequence similarity, it remains possible that these are TAF13 proteins in reality. Essentially, all opisthokonta contain SPT3, with the most notable exception of Thecamonas trahens (part of apusozoa), which is an early branching sister group of amoebozoa (Supplemental Fig. S1; Paps et al. 2013). In other supergroups, SPT3 is seemingly lost, with the exception of Naegleria gruberi (excavates), Acanthamoeba castellanii (amoebozoa), Galdieria sulphuraria, and Cyanidioschyzon merolae (red algae) (Supplemental Fig. S14). The differential loss of SPT3 outside of opisthokonta suggests the existence of SAGA lacking SPT3 in those organisms or sharing TAF11 and TAF13 with TFIID. This could be re- solved by biochemical analysis of SAGA complexes from organisms lacking SPT3.

64 Evolution of TFIID

The TAF11/TAF13/SPT3 gene tree points toward two hypotheses for the origin of these proteins. (1) SPT3 is the ancestral protein (Fig. 6A). This pre-LECA ancestor (aSPT3) would have duplicated, and TAF11 and TAF13 then arose as the result of a gene split. (2) TAF11 and TAF13 were the ancestral proteins (Fig. 6B), both of which were duplicated before fusing into SPT3 in a pre-LECA genome. Irrespective of the exact scenario, the duplication allowed subfunctionalization toward either SAGA (SPT3) or TFIID (TAF11 and TAF13), while the ancestor was likely 2 functional in both complexes. The aSPT3 hypothesis describes the more evolutionarily simple process, which requires only two events (duplication followed by fission). In contrast, the TAF11/TAF13 hypothesis requires two independent duplications followed by a specific fusion between SPT3-N and SPT3-C. In summary, the analysis of the TAF11, TAF13, and SPT3 orthologous groups revealed their common ancestry and pre-LECA roots. Our results reveal that duplication and subfunctionalization differentiated the proteins in TFIID- and SAGA-specific subunits.

Figure 6. Inferred evolutionary history of TAF11/TAF13/SPT3. (A) SPT3 is the ancestral protein that gave rise to TAF11 and TAF13 through a duplication followed by a gene fission. (B) TAF11 and TAF13 are ancestral and gave rise to SPT3 through independent duplications followed by gene fusion.

Discussion This work reconstructs the evolutionary history of TAF subunits forming the basal transcription complex TFIID, which is central to all Pol II transcription. A common theme emerging is a pre-LECA origin for all TFIID subunits, with the later duplications resulting in TAF3, TAF4B, TAF4x, TAF7L, TAF9B, and TAF1L. Most likely, an almost complete – as compared with human TFIID— complex existed in pre-LECA ancestors (Fig. 7). Our analysis of the eukaryotic lineage revealed that most of the TAF duplication events occurred predominantly in opisthokonta branches. Large expansions of TF and cofactor families in metazoan evolution have been suggested to support increased morphological and genome complexity (Cheatle Jarvela and

65 Chapter 2

Hinman 2015). The observations with TFIID match well with a versatile transcriptional regulation in opisthokonta. The only other clade in which TFIID is duplicating, albeit it to a lesser extent, is plants. Other examples of evolutionary expansions in major cellular complexes are observed in ribosomes, spliceosomes, and proteasomes (Vosseberg and Snel 2017). A salient feature in TFIID evolution is the extensive dynamics of chromatin reader and 2 TF-binding domains between the TAFs in opisthokonta and streptophyta. Notably, highly dynamic chromatin reader domains occur only in the TAF1, TAF2, and TAF3 subunits (Fig. 7). TAF3 was duplicated from TAF8 early in opisthokonta evolution and acquired a PHD finger, which was lost subsequently in later branching fungi (such as dikarya). In striking similarity, metazoan TAF1 acquired a second BrD, while TAF1 in dikarya (branching in fungi) has lost its BrD. In early fungi, highly dynamic BrDs are present in TAF2, which could compensate for the loss in TAF1 BrDs in some fungal species. Ascomycota (part of dikarya) subsequently lost BrDs from TAF2. Interestingly, all common yeast models are included in ascomycota, which suggests that research in S. cerevisiae and S. pombe focuses on an intriguing exception of the TFIID complex. In these model systems, TFIID is entirely deprived of chromatin reader domains as compared with TFIID complexes across the rest of the eukaryotic lineage. This is consistent with previous work in S. cerevisiae, which shows a reduced association of TAFs with chromatin regulators (Huisinga and Pugh 2004). Notably, S. cerevisiae has been characterized by loss of components of other cellular machineries, including the spliceosome and RNA-modifying and protein-folding complexes (Aravind et al. 2000; Vosseberg and Snel 2017). Complexity reduction in evolutionary terms often indicates alternative (and beneficial) functional adaptations of the living organism. Such benefits are exemplified by the lack of an RNAi pathway in S. cerevisiae, which allows for its symbiotic coexistence with the dsRNA killer virus, which is highly toxic for other fungal species (Drinnenberg et al. 2011). With respect to transcription, the loss of epigenetic domains indicates that TFIID becomes less dependent on chromatin marks for targeting to promoter regions during the course of fungal evolution. How this correlates with SAGA fungal evolution, where SPT7 has gained a BrD, remains to be tested. During plant evolution, TAF1 acquired a ubiquitin-like domain in streptophyta, and TAF4 has gained nonhomologous TF-binding interface RST (as opposed to the metazoan NHR1 domain). This indicates that TFIID is a direct TF target in archaeplastida. Furthermore, TAF4 and TAF12 duplications in the plant kingdom indicate possible roles in driving specific lineage programs. The domain analysis of TAF2 revealed the presence of a highly conserved HEAT2-like repeat region. HEAT repeats are commonly present in a wide range of eukaryotic proteins. TAF6 also has a HEAT repeat region, which has been pro- posed as highly flexible (Yoshimura and Hirano 2016). In TAF1, we also confirmed the presence of a Zn knuckle structure (Curran et al. 2018), which represents a highly conserved Zn finger involved in directing TFIID promoter binding (Curran et al. 2018). The phylogenetic analysis of TAFs stresses the evolutionary linkage of TFIID with SAGA (Fig. 7). We propose that at least eight invariable subunits (ancestral TAF4, TAF5, TAF6, TAF8, TAF9, TAF10, TAF11/13, and TAF12) were shared between the two complexes and that their divergence already started at a pre-LECA stage (Fig. 7). Probably TAF4/ADA1 and TAF11/TAF13/SPT3 (and possibly TAF8/SPT7) were the first shared members to duplicate and subfunctionalize toward each of the complexes, indicating their core role in TFIID and SAGA structural discrimination. This facilitated functional separation of the TFIID and SAGA complexes.

66 Evolution of TFIID

2

Figure 7. Model of TFIID and SAGA evolutionary divergence from pre-LECA until fungal and metazoan ancestors. In a pre- LECA, the ancestral repertoire (green) of TFIID and SAGA was completely shared. Through duplication and subfunctionalization of the resulting paralogs, the complexes diverged to share fewer subunits throughout eukaryotic evolution (pink and grey). Metazoan TFIID acquired several lineage-specific paralogs (e.g., TAF1L, TAF4B, TAF4x, TAF7L, and TAF9B). Epigenetic domains are differentially gained and lost in metazoan and fungal TFIID and SAGA: Metazoan TFIID acquired epigenetic domains (double BrDs in TAF1 and a PHD in TAF3), while metazoan SAGA lost BrD in SUPT7L (retained in fungal SAGA); in contrast, fungal TFIID gradually lost the TAF3 PHD and carries only one BrD in TAF1 (in some late fungi, the BrDs are completely lost). Additionally, fungal TAF2 displays dynamic gains and losses of numerous BrDs, in contrast to metazoan TAF2. Unique and shared subunits as well as dynamics in epigenetic reader domains are color- coded as indicated.

67 Chapter 2

In contrast, TAF5L and TAF6L are more recent SAGA-specific subfunctionalizations. In animals, TFIID shares only three subunits (TAF9, TAF10, and TAF12) with SAGA (Fig. 7). Interestingly, TFIID-specific subfunctionalizations are also evident among metazoa, including TAF4B in vertebrates and TAF4x in fish, mammalian TAF7L, placental- specific TAF9B, and the Old World monkey-specific TAF1L (Fig. 7). The high rate of TAF subfunctionalization coinciding 2 with increased morphological complexity implies a selection for functional divergence of TFIID and SAGA, which started in the pre-LECA era. Our orthologous trees provide a framework for evolutionary reconstruction of the structural changes underlying TAF subfunctionalization through paleostructural biology. From a broader perspective, it is clear that the analysis of TFIID evolution exemplifies how phylogenetic protein interrogation aids in uncovering existing structures, drawing parallels between related complexes, and challenges offered by genome expansions can be countered by exploiting chromatin modifications.

Materials and methods Phylogenetic analysis of the TFIID complex members Species and genome selection. To reconstruct the evolution of the TFIID subunits across the eukaryotic tree of life, a selected reference set of species was chosen such that it was large enough to reliably reconstruct TFIID subunit dynamics across the eukaryotic tree of life but small enough for manual curation and inspection of protein phylogenies (Supplemental Table S1). Predicted proteomes for these species were downloaded from diverse sources (Supplemental Table S1), and protein identifiers were changed to allow manual annotation of duplications and losses in the protein trees. For a subset of TFIID subunits, the addition of specific proteins from phylogenetically informative species was essential to accurately time the duplications and losses. These protein-specific additions included primates and placental mammals for TAF1, nontetrapod vertebrates for TAF4, streptophytes for TAF12, and early branching mammals for TAF7 as well as TAF9.

Sequence analysis and alignment. Protein domains were identified using Pfam version 29.0 (Finn et al. 2016) or CDD (Marchler-Bauer et al. 2015) or were based on literature-proposed domains (Supplemental Fig. S15). Orthologous groups for each TAF were acquired using Pfam‘s gathering cutoffs or manual curation when new HMMER models were made. Sequences were aligned using MAFFT version 7.294 einsi or linsi based on the domain organization of the proteins (Katoh and Standley 2013). linsi was used mostly for orthologous groups where a single domain or excised domains were aligned, while einsi was used for groups with complex domain organizations. Alignments were visualized using Jalview (Waterhouse et al. 2009). After manual inspection, alignments were curated with the trimal option automated if the alignment contained few gaps or gappyout if the alignment was patchy (Capella-Gutierrez et al. 2009). Curated alignments of selected species were visualized using ESPript 3.0, and con- served residues at >70% threshold were marked.

Phylogenetic reconstruction and annotation. Phylogenetic trees were reconstructed with default Phyml version 3.0 settings (LG model of evolution) (Lefort et al. 2017) using the curated alignments (Supplemental Fig. S15). Visualization was done in interactive Tree Of Life (iTOL) (Letunic and Bork 2007). A custom Python script was developed to provide a file for iTOL to color the

68 Evolution of TFIID sequences according to which eukaryotic supergroup the species belong and where the proteins came from (Burki 2014). A second custom python script was developed to provide a file for iTOL to delineate and color domain organization of each protein, as inferred from Pfam searches as described above. The resulting phylogenetic trees were reconciled with the species tree using phylogenetic as well as domain considerations to infer timing of gene duplications and losses. The results of these reconciliations are shown in Figures 2–5 and Supplemental Figures S2–S7 and 2 S9–S14.

Data availability The results from all intermediate steps as well as all final trees are available at https://bioinformatics.bio.uu.nl/snel/TFIID These results include custom HMMER models to search for domains, FASTA files of orthologs, selected protein domain alignments (both the FASTA files and the imagery representation), and annotated protein trees. Graphical representations of the domain and protein alignments for selected species are in Supplemental Figures S16–S32.

Acknowledgments We thank Tanja Bhuiyan, Laszlo Tora, and Imre Berger for discussions and critical reading of the manuscript. This research was financially supported by the SFB850 and SFB992 networks of the Deutsche Forschungsgemeinschaft (to H.T.M.T.) and Nether- lands Organization for Scientific Research (NWO) grants 022.004.019 (to S.V.A.) and 016.160.638 Vici (to B.S.) as well as ALW820.02.013 (to H.T.M.T.). We apologize to our colleagues whose primary findings could not be cited due to space constraints.

Author contributions H.T.M.T. and B.S. conceived the study with input from S.V.A. and J.B. J.B. carried out the bioinformatic analysis, assisted by B.S. Data interpretation and presentation were carried out by S.V.A., J.B., H.T.M.T., and B.S. S.V.A. and H.T.M.T. wrote the manuscript together with input from J.B. and B.S.

69 Chapter 2

Glossary

Note: With recent advances in phylogenetics, the classical taxonomy of the eukaryotic tree of life has undergone extensive revisions. As a result, there is a current lack of uniform taxonomic nomenclature for eukaryotes. This glossary aims to familiarize the readers in general terms with the species and names used throughout the study. For 2 further reading on the different classifications, we suggest several reviews (Burki 2014; Brown et al. 2018).

Acanthamoeba castellanii: genus in amoebozoa. Actinopterygii: ray-finned fish, in which skin webs of the fins are connected by bony spines; kingdom of metazoa. Albugo laibachii: species belonging to the supergroup of SAR (stramenopiles, alveolates, and rhizaria); pathogens of A. thaliana. Alveolates: a taxonomic group of primarily single-celled eukaryotes, characterized by the presence of sacs underneath their cell membranes; forms the “A” in the eukaryotic supergroup SAR. Amoebozoa: a taxonomic group of primarily single-celled eukaryotes, characterized by the presence of pseudopodia and movement through internal cytoplasmic flow. Angiosperm: a large group in the kingdom of plantae, which includes flowering land plants. Aplanochytrium kerguelense: a genus included in the eukaryotic supergroup of SAR; a common marine microorganism. Apusozoa: or obazoa, is an early branching group in eukarya, which includes opisthokonta (also known as fungi and animals but not plants) but excludes amoebozoa. Arabidopsis thaliana: flowering plant (plantae kingdom); a model organism commonly used in laboratory settings. Archaeplastida: a taxonomic classification that includes viridiplantae (e.g., land plants and green algae) as well as rhodophytae (e.g., red algae). Ascomycota: phylum in the fungal subkingdom of dikarya, which includes the commonly used yeast model organisms (e.g., S. cerevisiae, K. lactis, N. crassa, and S. pombe). Basidiomycota: phylum in the fungal subkingdom of dikarya, which includes mushrooms. Bigelowiella natans: flagellated species in SAR with a marine lifestyle; model organism in laboratory settings. Blastocystis hominis: a genus belonging to the eukaryotic supergroup of SAR; contains unicellular parasites capable of infecting humans. Blastocladiomycota: phylum in the kingdom of fungi; parasitic lifestyle; includes model organisms Allomyces macrogynus and Blastocladiella emersonii. Callithrix jacchus: common marmoset, a New World monkey; a model organism used in laboratory settings. Chytridiomycota: division in the kingdom of fungi, characterized by the unique (for fungi) ability to lead a motile lifestyle due to presence of posterior flagellum; a parasite among plants and amphibians. Cyanidioschyzon merolae: unicellular extremophile adapted to sulphur-rich hot spring environments; red algae; a model organism with minimalist cell structure, used for studying organelle and cellular organization. Danio rerio: or zebrafish, is a ray-finned fish (skin webs of the fins are connected by bony spines) in the kingdom of metazoa; commonly used model organism in research and popular in aquarium trade. Dikarya: subkingdom of fungi, also known as “higher fungi.” Excavata: eukaryotic supergroup, including flagellated unicellular organisms. Fonticula: a genus with lifestyle similar to clime mold; includes unicellular organisms capable of assembling into multicellular structures; relative of fungi.

70 Evolution of TFIID

Galdieria sulphuraria: species of red algae; a thermoacidophile, suggested to have acquired its extremophilic adaptations through rare horizontal gene transfer events from archaea and bacteria. Hylobates leucogenys: or Nomascus leucogenys, white- cheeked gibbon; species of Old World monkey. Holozoa: taxonomic group within opisthokonta that includes animals and closely related unicellular organisms but excludes fungal branches. Klebsormidium flaccidum: a species of fresh-water filamentous green algae; kingdom of plantae. 2 Kluyveromyces lactis: a species of Saccharomycetes class (ascomycota division); part of fungi kingdom; commonly used model organism in yeast studies. Loxodonta africana: or African savanna elephant; mammal; kingdom of metazoa. Latimeriachalumnae: species of coelacanth (living fossil), lobe- finned fish; fins are supported on a fleshy lobe- like structure connected to the body in a way similar to tetrapod limbs; more closely related to tetrapods than to ray- finned fish, kingdom of metazoa. LECA: last eukaryotic common ancestor; proposed and reconstructed unicellular organism with nucleus. Mucor circinelloides: species of mucormycota division; fungi kingdom; frequently infecting farm animals. Monodelphis domestica: (laboratory) opossum, mammal in the marsupial cohort; metazoa kingdom; model organism. Macropus eugenii: wallaby, mammal in the marsupial cohort; metazoa kingdom; model organism. Mammalia: all animals nursing their young with milk; metazoa kingdom. Marsupialia: cohort of mammals, carrying their young in pouch; metazoa kingdom. Metazoa: kingdom of animals. Mortierellomycetes: fungal order, belongs to mucoromycota phylum; fungi kingdom. Mucoromycota: a lineage in the fungal kingdom, separate from dikarya; includes common bread mold. Mus musculus: house mouse, mammal in the order rodentia; metazoa kingdom; commonly used model organism. Naegleria gruberi: species belonging to excavata, capable of changing from amoeba to flagellated unicellular organism with cytoskeletal structure. Nematostella vectensis: or starlet sea anemone, a species of sea anemone; metazoa kingdom; model organism, holding position at the base of the animal tree; predatory lifestyle. Neurospora crassa: species of ascomycota (dikarya lineage); fungal kingdom; model organism. New World monkeys: includes families of primates, distinguished from Old World monkeys and apes in the nasal structure, among others; metazoa kingdom. Ornithorhynchus anatinus: or platypus, is an egg-laying mammal; metazoa kingdom. Oxytricha trifallax: species in SAR; ciliated model organism. Old World monkey: family of primates, more closely related to hominoid lineages than New World monkeys; metazoa kingdom. Opisthokonta: group of eukarya, which includes animal, fungal lineages, and their unicellular relatives but not plants. Papio anubis: or olive baboon, member of Old World Monkeys; metazoa kingdom. Phycomyces blakesleeanus: filamentous fungal species, belongs to mucoromycota phylum; fungi kingdom. Physcomitrella patens: earth moss, species in the kingdom of plantae; model organism. Placentalia: cohort of mammals, carrying their young in womb; metazoa kingdom. Protozoa: unicellular heterotrophic eukaryotes. Rhizaria: taxonomic group of mostly unicellular organisms, which forms the “R” in the eukaryotic supergroup of SAR.

71 Chapter 2

Saccharomyces cerevisiae: species of ascomycota (dikarya lineage); fungal kingdom; common model organism. SAR: taxonomic supergroup of primarily single-celled eukaryotes (includes stramenopiles, alveolates, and rhizaria groups). Sarcopterygii: a class of lobe-finned fish, including coelacanths and closely related to tetrapoda; kingdom of metazoa. 2 Schizosaccharomyces pombe: species of ascomycota (dikarya lineage); fungal kingdom; common model organism. Stramenopiles: diverse group of eukaryotes, including plant pathogenic oomycetes, photosynthetic diatoms, and brown algae such as kelp; forms the S in eukaryotic supergroup SAR. Streptophyta: a branching in the kingdom of plantae that includes land plants and green algae and excludes red algae. Thecamonas trahens: genus of apusozoa. Tetrapoda: includes four-limbed vertebrates; kingdom of metazoa.

72 Evolution of TFIID

References  Amemiya CT, Alföldi J, Lee AP, Fan S, Philippe H, Maccallum I, Braasch I, Manousaki T, Schneider I, Rohner N, et al. 2013. The African coelacanth genome provides insights into tetrapod evolution. Nature 496: 311–316. doi:10.1038/nature12027  Aravind L, Watanabe H, Lipman DJ, Koonin EV. 2000. Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci 97: 11319–11324. 2 doi:10.1073/pnas.200346997  Bertrand C, Benhamed M, Li YF, Ayadi M, Lemonnier G, Renou JP, Delarue M, Zhou DX. 2005. Arabidopsis HAF2 gene encoding TATA-binding protein (TBP)-associated factor TAF1, is required to integrate light signals to regulate gene expression and growth. J Biol Chem 280: 1465– 1473. doi:10.1074/ jbc.M409000200  Bhattacharya S, Lou X, Hwang P, Rajashankar KR, Wang X, Gustafsson JA, Fletterick RJ, Jacobson RH, Webb P. 2014. Structural and functional insight into TAF1–TAF7, a subcomplex of transcription factor II D. Proc Natl Acad Sci 111: 9103– 9108. doi:10.1073/pnas.1408293111  Bieniossek C, Papai G, Schaffitzel C, Garzoni F, Chaillet M, Scheer E, Papadopoulos P, Tora L, Schultz P, Berger I. 2013. The architecture of human general transcription factor TFIID core complex. Nature 493: 699–702. doi:10.1038/nature11791  Brown MW, Heiss AA, Kamikawa R, Inagaki Y, Yabuki A, Tice AK, Shiratori T, Ishida KI, Hashimoto T, Simpson AGB, et al. 2018. Phylogenomics places orphan protistan lineages in a novel eukaryotic super-group. Genome Biol Evol 10: 427–433. doi:10.1093/gbe/evy014  Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol 6: a016147. doi:10.1101/cshperspect.a016147  Burley SK, Roeder RG. 1998. TATA box mimicry by TFIID: auto-inhibition of pol II transcription. Cell 94: 551–553. doi:10.1016/S0092-8674(00)81596-2  Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25: 1972–1973. doi:10.1093/bioinformatics/btp348  Chalkley GE, Verrijzer CP. 1999. DNA binding site selection by RNA polymerase II TAFs: a TAF(II)250-TAF(II)150 complex recognizes the initiator. EMBO J 18: 4835–4845. doi:10.1093/emboj/18.17.4835  Cheatle Jarvela AM, Hinman VF. 2015. Evolution of transcription factor function as a mechanism for changing metazoan developmental gene regulatory networks. Evodevo 6: 3. doi:10.1186/2041- 9139-6-3  Cheng Y, Buffone MG, Kouadio M, Goodheart M, Page DC, Ger- ton GL, Davidson I, Wang PJ. 2007. Abnormal sperm in mice lacking the Taf7l gene. Mol Cell Biol 27: 2582–2589. doi:10.1128/MCB.01722-06  Curran EC, Wang H, Hinds TR, Zheng N, Wang EH. 2018. Zinc knuckle of TAF1 is a DNA binding module critical for TFIID promoter occupancy. Sci Rep 8: 4630. doi:10.1038/s41598- 018- 22879-5  D‘Alessio JA, Wright KJ, Tjian R. 2009. Shifting players and paradigms in cell-specific transcription. Mol Cell 36: 924–931. doi:10.1016/j.molcel.2009.12.011  Deakin JE. 2012. Marsupial genome sequences: providing insight into evolution and disease. Scientifica 2012: 543176. doi:10.6064/2012/543176

73 Chapter 2

 Drinkwater N, Lee J, Yang W, Malcolm TR, McGowan S. 2017. M1 aminopeptidases as drug targets: broad applications or therapeutic niche? FEBS J 284: 1473–1488. doi:10.1111/febs.14009  Drinnenberg IA, Fink GR, Bartel DP. 2011. Compatibility with killer explains the rise of RNAi- deficient fungi. Science 333: 1592. doi:10.1126/science.1209575  Eisenmann DM, Arndt KM, Ricupero SL, Rooney JW, Winston F. 1992. SPT3 interacts with 2 TFIID to allow normal transcription in Saccharomyces cerevisiae. Genes Dev 6: 1319–1331. doi:10.1101/gad.6.7.1319  Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, et al. 2016. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44: D279–D285. doi:10.1093/nar/gkv1344  Frontini M, Soutoglou E, Argentini M, Bole-Feysot C, Jost B, Scheer E, Tora L. 2005. TAF9b (formerly TAF9L) is a bona fide TAF that has unique and overlapping roles with TAF9. Mol Cell Biol 25: 4638–4649. doi:10.1128/MCB.25.11.4638-4649.2005.  Gangloff YG, Pointud JC, Thuault S, Carre L, Romier C, Murato- glu S, Brand M, Tora L, Couderc JL, Davidson I. 2001a. The TFIID components human TAFII140 and Drosophila BIP2 (TAFII155) are novel metazoan homologues of yeast TAFII47 containing a histone fold and a PHD finger. Mol Cell Biol 21: 5109–5121. doi:10.1128/MCB.21.15.5109-5121.2001  Gangloff YG, Romier C, Thuault S, Werten S, Davidson I. 2001b.The histone fold is a key structural motif of transcription fac- tor TFIID. Trends Biochem Sci 26: 250–257. doi:10.1016/ S0968-0004(00)01741-2  Gupta K, Sari-Ak D, Haffke M, Trowitzsch S, Berger I. 2016. Zooming in on transcription preinitiation. J Mol Biol 428: 2581–2591. doi:10.1016/j.jmb.2016.04.003  Gupta K, Watson AA, Baptista T, Scheer E, Chambers AL, Koeh- ler C, Zou J, Obong-Ebong I, Kandiah E, Temblador A, et al. 2017. Architecture of TAF11/TAF13/TBP complex suggests novel regulation properties of general transcription factor TFIID. Elife 6: e30395. doi:10.7554/eLife.30395  Hassan AH, Prochasson P, Neely KE, Galasinski SC, Chandy M, Carrozza MJ, Workman JL. 2002. Function and selectivity of bromodomains in anchoring chromatin-modifying complexes to promoter nucleosomes. Cell 111: 369–379. doi:10.1016/ S0092-8674(02)01005-X  Helmlinger D, Tora L. 2017. Sharing the SAGA. Trends Biochem Sci 42: 850–861. doi:10.1016/j.tibs.2017.09.001  Hosein FN, Bandyopadhyay A, Peer WA, Murphy AS. 2010. The catalytic and protein–protein interaction domains are re- quired for APM1 function. Plant Physiol 152: 2158–2172. doi:10.1104/pp.109.148742  Howe FS, Fischl H, Murray SC, Mellor J. 2017. Is H3K4me3 instructive for transcription activation? Bioessays 39: 1–12. doi:10.1002/bies.201670013  Huisinga KL, Pugh BF. 2004. A genome-wide housekeeping role for TFIID and a highly regulated stress-related role for SAGA in Saccharomyces cerevisiae. Mol Cell 13: 573–585. doi:10.1016/S1097-2765(04)00087-5  Jacobson RH, Ladurner AG, King DS, Tjian R. 2000. Structure and function of a human TAFII250 double bromodomain module. Science 288: 1422–1425. doi:10.1126/science.288.5470.1422  Jaspers P, Overmyer K, Wrzaczek M, Vainonen JP, Blomster T, Salojärvi J, Reddy RA, Kangasjärvi J. 2010. The RST and PARP-like domain containing SRO protein family: analysis of protein

74 Evolution of TFIID

structure, function and conservation in land plants. BMC Genomics 11: 170. doi:10.1186/1471- 2164-11-170  Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97–100. doi:10.1038/nature09916  Josling GA, Selvarajah SA, Petter M, Duffy MF. 2012. The role of bromodomain proteins in 2 regulating gene expression. Genes 3: 320–343. doi:10.3390/genes3020320  Kasahara M. 2007. The 2R hypothesis: an update. Curr Opin Immunol 19: 547–552. doi:10.1016/j.coi.2007.07.009  Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30: 772–780. doi:10.1093/molbev/ mst010  Kolesnikova O, Ben-Shem A, Luo J, Ranish J, Schultz P, Papai G. 2018. Molecular structure of promoter-bound yeast TFIID. Nat Commun 9: 4666. doi:10.1038/s41467-018-07096-y  Koster MJ, Snel B, Timmers HT. 2015. Genesis of chromatin and transcription dynamics in the origin of species. Cell 161: 724– 736. doi:10.1016/j.cell.2015.04.033  Lawit SJ, O‘Grady K, Gurley WB, Czarnecka-Verner E. 2007. Yeast two-hybrid map of Arabidopsis TFIID. Plant Mol Biol 64: 73–87. doi:10.1007/s11103-007-9135-1  Lefort V, Longueville JE, Gascuel O. 2017. SMS: smart model se- lection in PhyML. Mol Biol Evol 34: 2422–2424. doi:10.1093/ molbev/msx149  Leger MM, Eme L, Stairs CW, Roger AJ. 2018. Demystifying eukaryote lateral gene transfer. Bioessays 40: e1700242. doi:10  .1002/bies.201700242  Letunic I, Bork P. 2007. Interactive tree of life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23: 127–128. doi:10.1093/bioinformatics/btl529  López-Maury L, Marguerat S, Bähler J. 2008. Tuning gene expression to changing environments: from rapid responses to evolutionary adaptation. Nat Rev Genet 9: 583–593. doi:10.1038/ nrg2398  Louder RK, He Y, López-Blanco JR, Fang J, Chacón P, Nogales E. 2016. Structure of promoter- bound TFIID and model of human pre-initiation complex assembly. Nature 531: 604–609. doi:10.1038/nature17394  Luo ZX, Yuan CX, Meng QJ, Ji Q. 2011. A Jurassic eutherian mammal and divergence of marsupials and placentals. Nature 476: 442–445. doi:10.1038/nature10291  Malkowska M, Kokoszynska K, Rychlewski L, Wyrwicz L. 2013. Structural bioinformatics of the general transcription factor TFIID. Biochimie 95: 680–691. doi:10.1016/j.biochi.2012.10.024  Marchler-Bauer A, Derbyshire MK, Gonzales NR, Lu S, Chitsaz F, Geer LY, Geer RC, He J, Gwadz M, Hurwitz DI, et al. 2015. CDD: NCBI‘s conserved domain database. Nucleic Acids Res 43: D222–D226. doi:10.1093/nar/gku1221  Matangkasombut O, Buratowski RM, Swilling NW, Buratowski S. 2000. Bromodomain factor 1 corresponds to a missing piece of yeast TFIID. Genes Dev 14: 951–962.  Mengus G, May M, Jacq X, Staub A, Tora L, Chambon P, David- son I. 1995. Cloning and characterization of hTAFII18, hTAFII20 and hTAFII28: three subunits of the human transcription factor TFIID. EMBO J 14: 1520–1531. doi:10.1002/j.1460-2075.1995.tb07138.x

75 Chapter 2

 Nguyen TT, Chang SC, Evnouchidou I, York IA, Zikos C, Rock KL, Goldberg AL, Stratikos E, Stern LJ. 2011. Structural basis for antigenic peptide precursor processing by the endoplasmic reticulum aminopeptidase ERAP1. Nat Struct Mol Biol 18: 604–613. doi:10.1038/nsmb.2021  O‘Rawe JA, Wu Y, Dörfel MJ, Rope AF, Au PY, Parboosingh JS, Moon S, Kousi M, Kosma K, Smith CS, et al. 2015. TAF1 variants are associated with dysmorphic features, intellectual disability, 2 and neurological manifestations. Am J Hum Genet 97: 922–932. doi:10.1016/j.ajhg.2015.11.005  Papai G, Tripathi MK, Ruhlmann C, Werten S, Crucifix C, Weil PA, Schultz P. 2009. Mapping the initiator binding Taf2 sub- unit in the structure of hydrated yeast TFIID. Structure 17: 363–373. doi:10.1016/j.str.2009.01.006  Paps J, Medina-Chacón LA, Marshall W, Suga H, Ruiz-Trillo I. 2013. Molecular phylogeny of unikonts: new insights into the position of apusomonads and ancyromonads and the internal relationships of opisthokonts. Protist 164: 2–12. doi:10.1016/j.protis.2012.09.002  Patel AB, Louder RK, Greber BJ, Grünberg S, Luo J, Fang J, Liu Y, Ranish J, Hahn S, Nogales E. 2018. Structure of human TFIID and mechanism of TBP loading onto promoter DNA. Science 362: eaau8872. doi:10.1126/science.aau8872  Pijnappel WW, Esch D, Baltissen MP, Wu G, Mischerikow N, Bergsma AJ, van der Wal E, Han DW, Bruch H, Moritz S, et al. 2013. A central role for TFIID in the pluripotent transcription circuitry. Nature 495: 516–519. doi:10.1038/nature11970  Rosanova A, Colliva A, Osella M, Caselle M. 2017. Modelling the evolution of transcription factor binding preferences in com- plex eukaryotes. Sci Rep 7: 7596. doi:10.1038/s41598-017-07761-0  Scheer E, Delbac F, Tora L, Moras D, Romier C. 2012. TFIID TAF6–TAF9 complex formation involves the HEAT repeat- containing C-terminal domain of TAF6 and is modulated by TAF5 protein. J Biol Chem 287: 27580–27592. doi:10.1074/ jbc.M112.379206  Singh PP, Arora J, Isambert H. 2015. Identification of ohnolog genes originating from whole genome duplication in early vertebrates, based on synteny comparison across multiple genomes. PLoS Comput Biol 11: e1004394. doi:10.1371/ journal.pcbi.1004394  Spedale G, Timmers HT, Pijnappel WW. 2012. ATAC-king the complexity of SAGA during evolution. Genes Dev 26: 527– 541. doi:10.1101/gad.184705.111  Srivastava R, Rai KM, Pandey B, Singh SP, Sawant SV. 2015. Spt–Ada–Gcn5–acetyltransferase (SAGA) complex in plants: genome wide identification, evolutionary conservation and functional determination. PLoS One 10: e0134709. doi:10.1371/journal.pone.0134709  Syntichaki P, Topalidou I, Thireos G. 2000. The Gcn5 bromodomain co-ordinates nucleosome remodelling. Nature 404: 414– 417. doi:10.1038/35006136  Tora L, Timmers HT. 2010. The TATA box regulates TATA-binding protein (TBP) dynamics in vivo. Trends Biochem Sci 35: 309–314. doi:10.1016/j.tibs.2010.01.007  Vermeulen M, Mulder KW, Denissov S, Pijnappel WW, van Schaik FM, Varier RA, Baltissen MP, Stunnenberg HG, Mann M, Timmers HT. 2007. Selective anchoring of TFIID to nucleosomes by trimethylation of histone H3 lysine 4. Cell 131: 58–69. doi:10.1016/j.cell.2007.08.016  Vosseberg J, Snel B. 2017. Domestication of self-splicing introns during eukaryogenesis: the rise of the complex spliceosomal machinery. Biol Direct 12: 30. doi:10.1186/s13062-017-0201-6 Wang PJ, Page DC. 2002. Functional substitution for TAF(II)250 by a retroposed homolog that is expressed in human spermatogenesis. Hum Mol Genet 11: 2341–2346. doi:10.1093/hmg/11.19.2341

76 Evolution of TFIID

 Wang H, Curran EC, Hinds TR, Wang EH, Zheng N. 2014. Crystal structure of a TAF1-TAF7 complex in human transcription factor IID reveals a promoter binding module. Cell Res 24: 1433– 1444. doi:10.1038/cr.2014.148  Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. 2009. Jalview version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189–1191. doi:10.1093/bioinformatics/btp033 2  Wei Y, Liu S, Lausen J, Woodrell C, Cho S, Biris N, Kobayashi N, Wei Y, Yokoyama S, Werner MH. 2007. A TAF4-homology domain from the ETO is a docking platform for positive and negative regulators of transcription. Nat Struct Mol Biol 14: 653–661. doi:10.1038/nsmb1258  Yoshimura SH, Hirano T. 2016. HEAT repeats – versatile arrays of amphiphilic helices working in crowded environments? J Cell Sci 129: 3963–3970. doi:10.1242/jcs.185710  Zhou H, Grubisic I, Zheng K, He Y, Wang PJ, Kaplan T, Tjian R. 2013. Taf7l cooperates with Trf2 to regulate spermiogenesis. Proc Natl Acad Sci 110: 16886–16891. doi:10.1073/pnas.1317034110  Zhou H, Wan B, Grubisic I, Kaplan T, Tjian R. 2014. TAF7L modulates brown adipose tissue formation. Elife 3: e02811. doi:10.7554/eLife.02

77 Chapter 2

Supplemental material

Supplemental Table S1. The selected reference set of species used in this study to reconstruct eukaryotic tree of life in relation to TFIID evolution. The sources of the species‘ predicted proteomes are indicated. The additional species section was added to more accurately determine 2 duplication timing of specific TAFs. Grey colour indicates selected species for protein and domain alignment visualization (due to large size Fig. S1-S7; S9-S14 and S16-S32 are available only in the online version).

Species Generic Genome Species name Published Group code name version Acastellanii.str Acanthamoebe NEFF v1 PMID: Amoeboz ACAS Amoeba castellanii str. Neff (January 9, 23375108 oa 2013) Entamoeba histolytica PMID: Amoeboz EHIS Amoeba 11/27/2009 HM-1:IMSS 20559563 oa Dictyostelium PMID: Amoeboz DDIS Slime mold Unknown discoideum AX4 15875012 oa PolPal_Dec200 Polysphondylium Cellular slime PMID: Amoeboz PPAL 9, (January 29, pallidum PN500 mold 21757610 oa 2010) Capsaspora owczarzaki Amoeboid PMID: COWC V2 Holozoa ATCC 30864 symbiont 23942320 Monosiga brevicollis V1.0, Choanoflagell PMID: MBRE MX1 / ATCC (December 20, Holozoa ate 18273011 50154 2007) Choanoflagell PMID: SROS Salpingoeca rosetta V1 Holozoa ate 23419129 Amphimedon V1.0, (May 28, PMID: AQUE Sponge Metazoa queenslandica 2010) 20686567 V1.0, (June 17, PMID: TADH Trichoplax adhaerens Placozoan Metazoa 2008) 18719581 Starlet sea PMID: NVEC Nematostella vectensis 1 Metazoa anemone 17615350 Sea PMID: MLEI Mnemiopsis leidyi walnut/Warty Initial release Metazoa 24337300 comb Jelly PMID: SMAN Schistosoma mansoni Flatworm 5 Metazoa 19606141 CELE Caenorhabditis elegans Roundworm WBcel235 PMID: 9851916. Metazoa Filarial PMID: BMAL Brugia malayi nematode WS238 Metazoa 17885136 worm Drosophila 5.51, May 7, PMID: DMEL Fruitfly Metazoa melanogaster 2013 10731132

78 Evolution of TFIID

Anopheles gambiae str. PMID: AGAM Mosquito AgamP3.7 Metazoa PEST 12364791 Saccoglossus SKOW Acorn worm Unknown Metazoa kowalevskii Spur_3.1 (RefSeq Strongylocentrotus Purple sea PMID: SPUR Assembly ID: Metazoa 2 purpuratus urchin 17095691 GCF_0000022 35.3 (latest)) Transparent KH (April 29, PMID: CINT Ciona intestinalis Metazoa sea squirt 2011) 12481130 Florida PMID: BFLO Branchiostoma floridae Unknown Metazoa lancelet 18563158 PMID: DRER Danio rerio Zebrafish Zv9 Metazoa 23594743 FUGU5 Japanese PMID: TRUB Takifugu rubripes (October 13, Metazoa pufferfish 12142439 2011) Xtropicalis_v7 Western PMID: XTRO Xenopus tropicalis (December 19, Metazoa clawed frog 20431018 2012) PMID: MMUS Mus musculus Mouse GRCm38.p1 Metazoa 12466850 PMID: 11237011, HSAP Homo sapiens Human GRCh37.p10 Metazoa PMID:11181995 (?) FALB Fonticula alba V2 Fonticula

GCA_0004420 PMID: RALL Rozella allomycis Fungi 15.1 23932404 Vavraia culicis Microsporidia VCUL 4/30/2013 Fungi floridensis n parasite Encephalitozoon Microsporidia PMID: EINT intestinalis ATCC 10/18/2010 Fungi n parasite 20865802 50506 v1 (March PSPE Piromyces sp. Fungi 2011) Batrachochytrium Chytrid BDEN V1 Fungi dendrobatidis JAM81 fungus Spizellomyces punctatus Chytrid SPUN 3/1/2012 Fungi DAOM BR117 fungus Allomyces macrogynus Chytrid V1, March 1, AMAC Fungi ATCC 38327 fungus 2012 Catenaria anguillulae Chytrid CANG V1 Fungi PL171 fungus Mucor circinelloides Zygomycete MCIR V1 Fungi 1006PhL fungus

79 Chapter 2

Phycomyces Zygomycete PBLA blakesleeanus V2 Fungi fungus NRRL1555 Mortierella verticillata Zygomycete MVER 2/17/2011 Fungi NRRL 6337 fungus Zygomycete MELO Mortierella elongata V1 Fungi 2 fungus Arbuscular Rhizophagus irregularis v1 (March PMID: RIRR mycorrhizal Fungi DAOM 197198 2012) 24277808 fungus Conidiobolus coronatus Zygomycete CCOR V1 Fungi NRRL28638 fungus Coemansia reversa Zygomycete CREV V1 Fungi NRRL 1564 fungus ASM32847v1 PMID: UMAY Ustilago maydis Fungus Fungi (March 5, 2007) 17080091 CNEO Cryptococcus neoformans Fungus V4 PMID:15653466 Fungi ASM294v1 Schizosaccharomyces PMID: SPOM Fission yeast (2012-03- Fungi pombe (strain 972) 11859360 PomBase) Neurospora crassa Filamentous PMID: NCRA 3/11/2013 Fungi OR74A fungus 12712197 Kluyveromyces lactis ASM251v1, PMID: KLAC Yeast Fungi NRRL Y-1140 (July 2, 2004) 15229592 Release R64-1- 1 (see Saccharomyces cerevisiae http://www.ye PMID: 8849441 SCER Baker's yeast Fungi S288C astgenome.org/ (?) about/how-to- cite-sgd) Bacteriovorou Thecamonas trahens TTRA s flagellated 10/28/2011 ATCC 50062 protist Single-celled PMID: BHOM Blastocystis hominis protozoan V1 Sar 21439036 parasite Aplanochytrium AKER Marine protist V1 Sar kerguelense PBS07 Aurantiochytrium ALIM limacinum ATCC Marine protist V1 Sar

MYA-1381 Phytophthora infestans Potato late ASM14294v1, PMID: PINF Sar T30-4 blight fungus (June 17, 2009) 19741609 Arabidopsis ENA 1 (2011- PMID: ALAI Albugo laibachii Nc14 Sar pathogen 08-ENA) 21750662 Hyaloperonospora Hyaloperonos PMID: HPAR Unknown Sar parasitica pora 21148394

80 Evolution of TFIID

arabidopsidis Nannochloropsis Stramenopile PMID: NGAD 1.1 Sar gaditana CCMP526 alga 22353717 Aureococcus v.1.0 Harmful PMID: AANO anophagefferens (September 27, Sar bloom alga 21368207 CCMP1984 2007) PMID: 2 ESIL Ectocarpus siliculosus Brown alga Unknown Sar 20520714 Phaeodactylum ASM15095v2 PMID: PTRI tricornutum Diatome (December 12, Sar 18923393 CCAP1055/1 2008) ASM14940v1 Thalassiosira PMID: TPSE Marine diatom (January 16, Sar pseudonana 15459382 2009) No. JCV centre Perkinsus marinus is likely to not PMAR Unknown Sar ATCC 50983 finish this genome PMID: SMIN Symbiodinium minutum v1.0 Sar 23850284 Plasmodium falciparum PMID: PFAL Protozoan 3 Sar 3D7 12368864 Cryptosporidium PMID: CPAI Protozoan 2/23/2007 Sar parvum Iowa II 15044751 Toxoplasma gondii TGON Protozoan 7/23/2012 No Sar ME49 PMID: PTET Paramecium tetraurelia Paramecium v1.78 Sar 17086204 Tetrahymena PMID: TTHE Tetrahymena oct2008 Sar thermophila 16933976 PMID: OTRI Oxytricha trifallax 2/12/2012 Sar 23382650 Bigelowiella natans Chlorarachnio PMID: BNAT V1 Sar CCMP2755 phyte alga 23201678 V1, November PMID: Archaeplas CPAR Cyanophora paradoxa Glaucophyte 2010 22344442 tida ASM34128v1 PMID: Archaeplas GSUL Galdieria sulphuraria Red alga (February 25, 23471408 tida 2013) Cyanidioschyzon PMID: Archaeplas CMER Red alga V1 merolae 10D 17623057 tida Porphyridium ? (updated PMID: Archaeplas PPUR Red alga purpureum 02/2012) 23770768 tida Chlorella variabilis PMID: Archaeplas CVAR Green alga V1 NC64A 20852019 tida Coccomyxa V2.0 (April 13, PMID: Archaeplas CSUB Green alga subellipsoidea C-169 2012) 22630137 tida

81 Chapter 2

Chlamydomonas PMID: Archaeplas CREI Green alga Unknown reinhardtii 17932292 tida PMID: Archaeplas VCAR Volvox carteri Green alga Unknown 20616280 tida Micromonas species Picoplanktoni ASM9098v1 PMID: Archaeplas MSPE RCC299 c green alga (April 10, 2009) 19359590 tida 2 Ostreococcus PMID: Archaeplas OLUC Green alga Unknown lucimarinus CCE9901 17460045 tida Bathycoccus prasinos PMID: Archaeplas BPRA Green alga V1 RCC1005 22925495 tida Klebsormidium Filamentous PMID: Archaeplas KFLA V1.0 flaccidum green alga 24865297 tida Physcomitrella patens PMID: Archaeplas PPAT Moss V1.6 subsp. patens 18079367 tida Selaginella v1.0 (2011-05- PMID: Archaeplas SMOE Spikemoss moellendorffii ENA) 21551031 tida Archaeplas ATRI Amborella trichopoda Unknown tida MSU6 (2009- PMID: Archaeplas OSAT Oryza sativa japonica Rice 01-MSU6) 16100779 (?) tida TAIR10, PMID: Archaeplas ATHA Arabidopsis thaliana Thale cress November 17, 11130711 (?) tida 2010 This is a release of the initial 8X Rocky unmapped Aquilegia coerulea Archaeplas ACOE mountain Aquilegia Goldsmith tida columbine coerulea Goldsmith genome Trichomonas vaginalis PMID: TVAG Protozoan 1/11/2007 Excavata G3 17218520 Giardia intestinalis PMID: GINT Protozoan 2/8/2013 Excavata assemblage A 17901334 Leishmania major PMID: LMAJ Protozoan 10/20/2010 Excavata strain Friedlin 16020728 Trypanosoma brucei PMID: TBRU Protozoan 2013-01-16 Excavata TREU 927 16020726 Naegleria gruberi strain PMID: NGRU Protozoan 1 Excavata NEG-M 20211133 Guith1 (2012- Guillardia theta Cryptomonad PMID: Incerta GTHE 12- CCMP2712 alga 23201678 sedis EnsemblProtist Emiliania huxleyi PMID: EHUX Haptophyte V1 CCMP1516 17931970 Additional species GGAL Gallus gallus red junglefowl GCA_000002 PMID:15592404 Metazoa

82 Evolution of TFIID

315.3 Chinese GCA_000230 PSIN Pelodiscus sinensis PMID:17381049 Metazoa softshell turtle 535.1 African Bush GCA_000001 LAFR Loxodonta africana Metazoa Elephant 905.1 West Indian LCHA Latimeria chalumnae Ocean LoxAfr 3.0 Metazoa 2

coelacanth The olive GCA_000264 PANU Papio Anubus Metazoa baboon 685.1 Northern GCA_000146 HLEU Hylobates leucogenys white-cheeked PMID:20138221 Metazoa 795.3 gibbon Common GCA_000004 CJAC Callithrix jacchus Metazoa marmoset 665.1 Ornithorhynchus GCA_000002 OANA Platypus PMID:18464734 Metazoa anatinus 275.2 GCA_000485 PFOR Poecilia formosa Amazon molly Metazoa 575.1 GCA_000242 LOCU Lepisosteus oculatus Spotted gar Metazoa 695.1

83 Chapter 2

2

Supplemental Figure S8. Inferred evolutionary history of TAF10 and TAF7. A) TAF10 remains invariable across the eukaryotic tree. B) TAF7 duplicated in Mammalia. Duplications are represented as red arrows.

84 Evolution of TFIID

2

Supplemental Figure S15. Schematic representation of the manual and computational steps taken during the phylogenetic analysis. Importantly, species are added and domains or homology alignment refined based on the outcome. E.g. to improve timing of duplication of TAF4 homologs in multiple Tetrapoda such as the living fossil Coelacanth were added from public genome repositories. Different steps in the workflow are coloured as denoted in the legend. Public resources are evolutionary bioinformatics databases or tools available through the world wide web. Genome repositories are further detailed in Table S1. Existing tools are command line tools for evolutionary sequence analysis

85 Chapter 2

.

2

86 CCT as checkpoint for TFIID assembly

3

Chapter 3

Chaperonin CCT checkpoint function in basal transcription factor TFIID assembly

Simona V. Antonova, Matthias Haffke, Eleonora Corradini, Mykolas Mikuciunas, Teck Y. Low, Luca Signor, Robert M. van Es, Kapil Gupta, Elisabeth Scheer, Harmjan R. Vos, László Tora, Albert J. R. Heck, Imre Berger and H. T. Marc Timmers

NATURE STRUCTURAL & MOLECULAR BIOLOGY | VOL 25 | DECEMBER 2018 | 1119–1127

87 Chapter 3

Chaperonin CCT checkpoint function in basal transcription factor TFIID assembly

Simona V. Antonova1,15, Matthias Haffke2,13,15, Eleonora Corradini3, Mykolas Mikuciunas1, Teck Y. Low3,14, Luca Signor4, Robert M. van Es5, Kapil Gupta6, Elisabeth Scheer7,8,9,10, Harmjan R. Vos5, László Tora7,8,9,10, Albert J. R. Heck3, Imre Berger6,16 and H. T. Marc Timmers1,11,12,16 1Molecular Cancer Research and Regenerative Medicine, University Medical Centre Utrecht, Utrecht, The Netherlands. 2European Molecular Biology Laboratory (EMBL), Grenoble, France. 3Biomolecular Mass Spectrometry and Proteomics, Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences and Netherlands Proteomics 3 Centre, Utrecht University, Utrecht, The Netherlands. 4Université Grenoble Alpes, CEA, CNRS, IBS (Institut de Biologie Structurale), Grenoble, France. 5Molecular Cancer Research, Center for Molecular Medicine, University Medical Centre Utrecht, Utrecht, The Netherlands. 6Bristol Synthetic Biology Centre BrisSynBio, Biomedical Sciences, School of Biochemistry, University of Bristol, Bristol, UK. 7Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France. 8Centre National de la Recherche Scientifique, UMR 7104, Illkirch, France. 9Institut National de la Santé et de la Recherche Médicale, U964, Illkirch, France. 10Université de Strasbourg, Illkirch, France. 11Department of Urology, Medical Center–University of Freiburg, Freiburg, Germany. 12Deutsches Konsortium für Translationale Krebsforschung (DKTK), Standort Freiburg, Deutsches Krebsforschungszentrum (DKFZ), Heidelberg, Germany. 13Present address: Chemical Biology and Therapeutics, Novartis Institutes for BioMedical Research, Basel, Switzerland. 14Present address: UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia. 15These authors contributed equally: Simona V. Antonova, Matthias Haffke. 16 These authors share last authorship. *e-mail: [email protected]; [email protected]

TFIID is a cornerstone of eukaryotic gene regulation. Distinct TFIID complexes with unique subunit compositions exist and several TFIID subunits are shared with other complexes, thereby conveying precise cellular control of subunit allocation and functional assembly of this essential transcription factor. However, the molecular mechanisms that underlie the regulation of TFIID remain poorly understood. Here we use quantitative proteomics to examine TFIID submodules and assembly mechanisms in human cells. Structural and mutational analysis of the cytoplasmic TAF5–TAF6–TAF9 submodule identified novel interactions that are crucial for TFIID integrity and for allocation of TAF9 to TFIID or the Spt-Ada-Gcn5 acetyltransferase (SAGA) co-activator complex. We discover a key checkpoint function for the chaperonin CCT, which specifically associates with nascent TAF5 for subsequent handover to TAF6–TAF9 and ultimate holo-TFIID formation. Our findings illustrate at the molecular level how multisubunit complexes are generated within the cell via mechanisms that involve checkpoint decisions facilitated by a chaperone.

Eukaryotic gene transcription by RNA polymerase II (pol II) is controlled by a plethora of proteins that are preassembled into complexes, which include basal transcription factors1,2. TFIID is the first transcription factor to bind the core promoter, nucleating the pre-initiation complex. In humans, TFIID comprises the TATA- binding protein (TBP) and 13 TBP-associated factors (TAFs). Cryoelectron microscopy (cryo-EM) analyses have provided important insights into TFIID architecture and promoter binding3,4. Distinct TFIID complexes have been identified, which contain paralogues of TAFs and TBP, that have key roles in cellular differentiation1,2,5. The human TFIID subunits TAF9, TAF10 and TAF12 are also present in SAGA6, a co-activator complex that, similar to TFIID, is globally required for efficient pol II transcription7,8. In yeast,

88 CCT as checkpoint for TFIID assembly

Taf5 and Taf6 are also present in the SAGA complex, whereas the human SAGA complex contains the paralogues TAF5L and TAF6L9. Sharing of subunits is not unique to TFIID and SAGA complexes but occurs frequently in complexes that control transcription and chromatin conformation1,2,6–11. Faithful subunit allocation and regulated complex assembly therefore must be essential for proper cell development and function; however, the underlying molecular mechanisms and factors that are involved remain unknown. Chaperones are known to facilitate the folding of key proteins in essential processes, such as kinase-mediated cellular signalling, proteasome assembly or nucleosome turnover11–15. The CCT complex has been shown to mediate ATP-dependent folding of many substrates, including actin, tubulin and the recently described CRLCSA DNA- repair factor16–18. CCT adopts a barrel-like shape, with two hetero-octameric rings stacked on top 3 of each other creating two folding chambers18. Here we use targeted quantitative mass spectrometry (qMS)19,20 to systematically analyze TAF interactions in the nucleus and cytoplasm of human cells in order to investigate the molecular mechanisms underlying TFIID formation. We identified a cytoplasmic TFIID submodule that comprises TAF5, TAF6 and TAF9. The crystal structure of this complex revealed complex TAF– TAF interactions, providing a basis for mutational studies. Notably, our analyses reveal a key checkpoint function for CCT, providing unprecedented insights into the early steps of holo-TFIID assembly, which is regulated by this chaperonin.

Results TFIID and SAGA submodules in the cytoplasm The presence of stable, partial TFIID assemblies in cells suggests that the holo-complex occurs is formed by discrete, functional submodules 3,21–23. TAF5 is the presumed central scaffold within TFIID and has been shown to interact with TAF6 and TAF93,24. TAF6 and TAF9 form a heterodimer that is stabilized by the pairing of their histone-fold domains (HFDs)25. We created stable doxycycline (Dox)-inducible HeLa cell lines26 that express TAF5, TAF6 or TAF9 proteins that contained an N-terminal green-fluorescent protein (GFP) tag. As expected, all GFP-TAF proteins predominantly localized to the nucleus (Fig. 1a). GFP-TAF5 expression levels were close to endogenous levels, whereas GFP-TAF6 expression was increased. Anti-GFP immunoblotting and quantitative real time PCR (qPCR) showed that GFP-TAF9 levels were between GFP-TAF5 and GFP-TAF6 expression levels (Supplementary Fig. 1). We examined integration of GFP–TAF proteins into TFIID by GFP co-immunoprecipitation (co-IP) from nuclear extracts of Dox-treated cells followed by intensity-based absolute quantification (iBAQ) qMS 27. All GFP–TAF proteins were enriched in the complete holo-TFIID complex (Fig. 1b, c). TAF9 is shared between TFIID and SAGA6, and all SAGA subunits were similarly enriched in nuclear GFP–TAF9 co- IPs. We determined the relative abundance of TFIID subunits and found similar stoichiometries in all experiments. Moreover, comparable levels of SAGA and TFIID were present in the GFP–TAF9 co- IPs (Fig. 1c). Our results demonstrate that GFP-tagged TAF5 and TAF6 are efficiently integrated into TFIID, whereas GFP–TAF9 is found in both TFIID and SAGA complexes. Next, we analyzed GFP–TAF proteins that were purified from the corresponding cytoplasmic extracts (Fig. 1d, e). We found that TAF6 and TAF9 co-purified efficiently with GFP–TAF5, revealing the presence of a stable heterotrimeric complex in the cytoplasm. We also found significant levels of TAF4, TAF8, TAF10 and TAF12, similar to stable TFIID subassemblies in recombinant reconstitution experiments3.

89 Chapter 3

3

Figure 1. TFIID and SAGA submodules in the cytoplasm. a, Cellular localization of GFP-TAF5, GFP-TAF9 and GFP-TAF6 by confocal microscopy; scale bar = 10 μm. b, GFP-TAF5 and GFP-TAF6

90 CCT as checkpoint for TFIID assembly enrich all TFIID subunits in co-IPs from nuclear extracts; GFP-TAF9 enriches TFIID and SAGA subunits. Subunits unique to TFIID are colored in green, subunits unique to SAGA in purple, shared subunits in blue. Dashed red lines denote threshold between background and significant enrichment (two-tailed t-test; FDR = 1%; S0 =1). Each data point is plotted as average of technical triplicates. c, Relative abundance of TFIID and SAGA subunits from nuclear extracts are shown, normalized to TAF4/4B. Each bar represents an average of technical triplicates. Error bars = s.d. of mean. d, GFP- TAF5 and GFP-TAF6 enrich a subset of TFIID subunits in co-IPs from the cytoplasm; GFP-TAF9 enriches both TFIID and SAGA subunits. Each data point is plotted as average of technical triplicates. e, Relative abundance of TFIID and SAGA subunits in the cytoplasmic co-IPs. Each bar represents an average of technical triplicates. Error bars = s.d. of mean. Source data for c and e are available online. 3

Cytoplasmic GFP–TAF6 predominantly co-purified TAF9. Cytoplasmic GFP–TAF9 co- IPs showed enrichment of TAF6 as well as the SAGA subunits TAF5L, TAF6L, TADA1, SUPT3H and SUPT7L. TAF10 and TAF12, like TAF9, are shared between TFIID and SAGA, and were also present. In SAGA, TAF10 and TAF12 are known to form specific heterodimers with SUPT7L and TADA1, respectively. Therefore, the SAGA module that we observed may represent a putative core-SAGA complex that is similar to the core-TFIID complex that was originally identified in Drosophila melanogaster 3,21. Taken together, our data provide evidence for the presence of discrete TFIID and SAGA submodules in the cytoplasm, which likely represent building blocks of the respective holo-complexes.

Crystal structure of TAF5–TAF6–TAF9 Our qMS experiments revealed TAF5–TAF6–TAF9 as a prevalent TFIID submodule in the cytoplasm. Limited proteolysis of full-length TAF5–TAF6– TAF9 identified a sample that was suitable for crystallization (Supplementary Figs. 2, 3 and Supplementary Table 1). Crystals comprising a TAF5 construct, which spanned the N-terminal domain (NTD) and WD40-repeat domain (TAF5(194-800)), bound to an extended TAF6-TAF9 heterodimer (TAF6(1-92); TAF9(1- 120)) diffracted to high resolution (Supplementary Fig. 4). Phases were obtained with tantalum bromide clusters and the structure was refined to 2.5 Å (Table 1), resulting in excellent electron densities (Supplementary Fig. 4e-h). The final model includes residues 207-800 of TAF5 (residues 194-206, 381-416 and 748–753 were not modelled owing to poor definition), residues 6- 92 of TAF6 and residues 5-120 of TAF9. Our human TAF9 construct comprises an extended C-terminal region, which is completely resolved in the electron density (Fig. 2c). This TAF9 C-terminal domain adopts a distinct conformation that is crucial for TAF5–TAF6–TAF9 complex integrity, wrapping around the TAF5 WD40-repeat domain like a clamp with three major anchor points (Fig. 2d). The TAF9 C-terminal loop region attaches tightly to the surface of the TAF5 WD40 domain, involving residues L104, I107 and L115 (Fig. 2e). The second anchor is formed by the TAF9 αC helix, which packs laterally against the TAF5 WD40-repeat domain, engaging in multiple hydrogen bonds that are mediated by R99, N100 and T102 (Fig. 2f).

91 Chapter 3

Moreover, a triple Table 1 | Data collection, phasing and refinement statistics a a Ta6Br proline turn formed by TAF9 Native 12 residues P86–P88 interacts with Data collection both the TAF5 NTD and Space group I23 I23 WD40-repeat domain (Fig. 2g). Cell dimensions This remark- able multitude of a, b, c (Å) 339.05, 339.05, 338.16, 338.16, 338.16 protein–protein interactions 339.05 α, β, γ (°) 90, 90, 90 90, 90, 90 combine to a total buried surface

3 area of 2,713 Å2 in the TAF5– Peak Wavelength 1.0044 1.25439 TAF6–TAF9 complex. b Resolution (Å) 79.92–2.50 48.81–3.80 The crystal structure b b (2.60–2.50) (3.90–3.80) displays a compact, triangular Rmerge 0.1358 (3.048) 0.173 (1.17) shape (Fig. 2a), with the TAF6– I/σ(I) 8.67 (0.64) 9.48 (2.08) TAF9 HFD heterodimer CC1/2 0.995 (0.188) 0.996 (0.676) sandwiched between the NTD Completeness (%) 99.03 (91.70) 100.00 (100.00) and WD40-repeat domain of Redundancy 4.69 (4.15) 5.78 (5.70) TAF5, giving rise to an Refinement intertwined architecture that has Resolution (Å) 79.92–2.50 TAF5 as the central scaffold. Number of reflections 219,090 The TAF5 NTD was crystallized Rwork/Rfree 19.73/22.00 28 previously in isolation . Number of atoms Substantial differences are Protein 24,768 apparent in the TAF5–TAF6– Ion (Cl−) 7 TAF9 complex. Notably, the N- Water 391 terminal α-helix 7 of the NTD is B factors rearranged owing to intimate Protein 88.6 interactions with the TAF5 Ligand/ion 79.6 WD40-repeat domain Water 68.8 (Supplementary Fig. 5). The Root-mean-square deviations WD40 repeat adopts a seven- Bond lengths (Å) 0.004 bladed β-propeller with Bond angles (°) 1.06

pronounced α-helical insertions aA single crystal was used to collect the native (PDB 6F3T) and the Ta6Br12 dataset. in blades 1 and 7. TAF5 engages bValues in parentheses are for the highest-resolution shell. extensively with the central α- helix of the TAF6 HFD through its NTD and the bottom cavity of the WD40 domain (Fig. 2b). The HFDs of TAF6 and TAF9 adopt a structure similar to the D. melanogaster TAF6–TAF9 HFD pair29.

92 CCT as checkpoint for TFIID assembly

3

Figure 2. Crystal structure of TAF5-TAF6-TAF9 complex. a, The TAF5-TAF6-TAF9 complex is shown in a cartoon representation in two views. The TAF6-TAF9 HFD heterodimer is intimately wedged in between the TAF5 NTD and WD40 repeat domain. TAF5 NTD and WD40 repeat domain are colored in cyan and blue, respectively. TAF6 is colored in red, TAF9 is colored in orange. Disordered loops are drawn as grey dashed lines. b, The TAF5 WD40 repeat domain is viewed from its bottom cavity, adopting a 7-bladed β-propeller. c, Human TAF6-TAF9 superimposed on the TAF6-

93 Chapter 3

TAF9 HFD dimer from D. melanogaster (PDBID 1TAF 29). D. melanogaster proteins are colored in grey, human TAF in red and human TAF9 in orange. Note the extended C-terminal domain in human TAF9. d, Overview of the structure with the three major interactions anchoring TAF9 to TAF5.e-g, Zoom-in showing TAF9 C-terminal loop (e), hydrogen-bond network at C-terminus of helix αC (f), and triple proline-turn (g). h, Mutations to probe TAF5-TAF9 interfaces. Colors as in (d); green indicates mutated region in TAF5 NTD targeting TAF5-TAF6-TAF9 interface.

TAF5–TAF9 interactions dictate TFIID assembly Our crystal structure revealed unexpected, complex interactions with TAF9, which we 3 next analyzed in cells by qMS, using a series of mutants (Fig. 2h). Four TAF9 mutants (TAF9m1– m4) were designed with the aim to disrupt the TAF5–TAF9 interface, while maintaining correct folding of the proteins. TAF9m1 targeted the hydrophobic side chains of the α-helix that are inserted into the hydrophobic pockets on the TAF5 WD40 surface as well as prominent salt bridges in between the TAFs. In TAF9m2, we abolished the network of hydrogen bonds between TAF5 and TAF9. TAF9m3 was designed to destabilize the triple proline turn in the C-terminal extension and abolish the TAF5S303–TAF9R89 interaction. Finally, TAF9m4 combines all of these mutations. The GFP–TAF9 mutants were expressed in HeLa cells and co-IPs of nuclear extracts were subjected to qMS (Fig. 3a and Supplementary Fig. 6a–d). TAF9m1 did not incorporate into TFIID, although it retained a low, albeit significant, interaction with TAF6. Whereas the incorporation was strongly reduced compared to wild- type TAF9, TAF9m1 incorporation into a (partial) SAGA complex was indicated by the significantly enriched SAGA subunits TADA3, TADA2B, SUPT7L, SUPT20H, KAT2A and TRRAP (Supplementary Fig. 6a). TAF9m2 disrupted TFIID formation to a similar extent as TAF9m1, while maintaining pairing with TAF6. This TAF9 mutant had no noticeable effect on SAGA. Similarly, TAF9m3 compromised TFIID assembly. Notably, TAF9m3 efficiently enriched most subunits of SAGA, with the exception of the subunits forming the deubiquitination module9, suggesting a role for the region that is mutated in TAF9m3 in mediating deubiquitination integration. Finally, TAF9m4 targeting all TAF9–TAF5 interfaces abolished both TFIID and SAGA formation completely and this mutant associated only with TAF6. In human SAGA, TAF9 pairs with the TAF6 paralogue TAF6L. Of note, TAF6L was not observed in TAF9m1 (and TAF9m4) co-IPs. Our results demonstrate that mutation of TAF9 residues critical for the TAF5–TAF9 interaction completely disrupt TFIID formation. By contrast, the TAF9 mutations affect SAGA formation to a much lower extent, indicating that the conformation adopted by TAF9 and its interactions may be substantially different within TFIID compared to the SAGA complex. We introduced mutations in TAF5 reciprocal to those in the TAF9 mutants for validation (Figs. 2h, 3b–e and Supplementary Figs. 6e–g, 7). First, we included mutations in TAF5, targeting each of the three distinct interface regions individually. Notably, co-IPs and qMS of GFP-tagged versions showed that the three individual TAF5 mutants did not completely disrupt TFIID but that the relative abundance of TFIID subunits was reduced compared to wild-type TFIID, indicating that the formation of TFIID was less efficient (Supplementary Fig. 6f). This is not because of reduced protein levels, because all TAF5 mutants were expressed at levels higher than those of wild-type TAF5 (Supplementary Fig. 6g). We next prepared TAF5 mutants, each combining two sets of mutations. The combinations of TAF5m1 and TAFm2 (P428G, P430G, Q431A, Y557A, Y598A, Y617A and R621V) and TAF5m2 and

94 CCT as checkpoint for TFIID assembly

TAF5m3 (S303A, Y586A, H605A, R607A, andR621V) completely prevented TFIID formation (Fig. 3c–e).

3

Figure 3. TAF5-TAF9 interfaces probed by mutagenesis and quantitative proteomics. a, Stoichiometry plots of TFIID and SAGA subunits from GFP-TAF9 co-IPs are shown, normalized to bait protein. Each bar represents an average of technical triplicates. Error bars= s.d. of mean. b, Confocal fluorescence microscopy reveals nuclear localization of all GFP- TAF5 proteins studies except for GFP-TAF5m1+2 which is also present in the cytoplasm; scale bar = 10 μm. c-e, Nuclear Co-IPs of GFP-TAF5, GFP-TAF5m1+2 and GFP- TAF5m2+3 evidence compromised TFIID assembly by the mutants. Each data point in volcano plots is plotted as average of technical triplicates. Dashed red lines denote threshold between background and significant enrichment (two-tailed t-test; FDR = 1%; S0 =1). Each bar in stoichiometry plot represents an average of technical triplicates. Error bars = s.d. of mean. n = 2 independent experimental replicates for the representative GFP-TAF5m1+2 sample. Source data for a and e are available online.

95 Chapter 3

Our mutational analyses validated the atomic interactions that were observed in our crystal structure, confirming their central importance for TFIID assembly and integrity in vivo. Moreover, our results with mutants of TAF9 suggest that there are distinct interactions of this shared subunit with TFIID and SAGA, implicating TAF9 structural dynamics in the formation pathways of the holo- complexes.

Chaperonin CCT interacts with the WD40 domain of TAF5 Cells that expressed GFP–TAF5m1 and GFP–TAF5m2 exhibited a marked GFP signal in the cytoplasm, in contrast to the predominantly nuclear localization of fluorescence of the other 3 mutant and wild- type GFP–TAF5 cell lines (Fig. 3b). Notably, TAF5m1 and TAF5m2 co-IPs revealed eight highly enriched proteins in close to equimolar amounts, which we identified as the subunits of the CCT complex (Fig. 4a,b). CCT assists in the folding of about 10% of the proteome, including many WD40-repeat proteins16,17,30. In addition, strong enrichment of the HSPA8 protein was observed (Fig. 4a,b), suggesting that HSPA8 is involved in TAF5 delivery to the CCT chaperonin31. We reasoned that CCT may facilitate folding of the TAF5 C-terminal WD40-repeat domain to allow handover to the TAF6– TAF9 heterodimer, as an essential step in holo-TFIID formation. Our qMS experiments show that the TAF5m1 and TAF5m2 double mutant is deficient in accreting TFIID subunits including TAF6– TAF9 (Fig. 4a–c). We hypothesized that the TAF5m1 and TAF5m2 protein may be retained in the CCT folding chamber, possibly owing to impaired interactions with TAF6– TAF9. We investigated whether our observation only related to this particular mutant. Careful inspection confirmed the presence of CCT, albeit in lower amounts, in all GFP–TAF5 mutant co- IPs (Supplementary Fig. 8a,b). Notably, CCT was similarly detected in co-IPs of wild-type GFP– TAF5 from the cytoplasm, but it was absent in the nuclear extract, suggesting that CCT interactions occurred transiently with newly synthesized TAF5 (Supplementary Fig. 8b). Similar co-IPs using GFP-tagged TBP, TAF1, TAF6, TAF7 and TAF9 proteins did not contain any of the CCT subunits (data not shown). We conclude that the CCT interaction is highly specific for the TAF5 subunit of TFIID. We used transiently transfected HEK293T cells to substantiate our findings. Overexpression of GFP–TAF5, the GFP–TAF5m1 and GFP–TAF5m2 double mutant or SAGA- specific GFP–TAF5L strongly enriched CCT in co-IPs, unlike GFP-tagged TAF6 and TAF9 or WDR5, a transcriptional regulator with a comparable WD40-repeat domain (Fig. 4d), confirming that TAF5 and its close paralogue TAF5L are bona fide CCT substrates. Note that WDR5 can be produced in Escherichia coli, indicating that WDR5 folding does not depend on a eukaryotic chaperone32. Transient overexpression of TAF5 fragments confirmed that the WD40 domain is mediating the CCT interaction and that the NTD did not interact with CCT subunits (Fig. 4e). Recombinant reconstitution also confirmed that the TAF5–TAF6–TAF9 complex is dependent on the TAF5 WD40 domain (Fig. 4f).

96 CCT as checkpoint for TFIID assembly

3

Figure 4. Chaperonin CCT engages TAF5. a-c GFP-TAF5m1+2 enriches all subunits of CCT in virtually stoichiometric ratios. Chaperonin subunits are colored in magenta, TFIID subunits in green, HSPA8 in purple. Relative stoichiometry is normalized to bait. Each data point in volcano plots is plotted as average of technical triplicates. Dashed red lines denote threshold between background and significant enrichment (two-tailed t-test; FDR = 1%; S0 =1). Each bar in stoichiometry plot represents an average of technical triplicates. Error bars = s.d. of mean. n = 2 independent experimental replicates for the representative GFP-TAF5m1+2 nuclear sample. d, GFP-TAF interactions with CCT in whole cell extract from transient transfection experiments by immunoblot. Input represents 1% of the protein sample used in the co-IPs. Hallmark CCT subunits TCP1 and CCT2 were probed. Technical replicates n = 3. e, Transient transfection experiments with whole cell extract GFP-fusions of WT or m1+2 (m) TAF52-456 (ΔWD40), TAF5345-800 (ΔNTD) and TAF5456-800 (WD40) probed by co-IPs and immunoblot. Technical replicates n = 2. f, Talon pull-down of co-expressed TAF6, TAF9 and truncated TAF5 analyzed by SDS-PAGE (left) and immunoblot (right). WCE stands for whole cell extract, SUP for cleared lysate, FT for flowthrough, E for eluted fraction His for oligohistidine tag. Source data for c are available online.

CCT hands over nascent TAF5 to TAF6–TAF9 for holo-TFIID assembly Formation of the CCT–TAF5 complex would presumably occur following translation in the cytoplasm. We tested this in pulse–chase experiments in our Dox-inducible cell lines (Fig. 5a–c and Supplementary Fig. 9a,b). Before adding Dox, cycloheximide (CHX) was applied to block translation, resulting in mRNA accumulation. In the chase phase, CHX blockade and Dox treatment were removed and nascent GFP–TAF5 was sampled over time, revealing progressive CCT binding to newly synthesized GFP–TAF5 in co-IP experiments (Fig. 5a). Experiments using the GFP–TAF5m1 and GFP–TAF5m2 double mutant yielded comparable results; GFP– TAF7 was included as a negative control (Supplementary Fig. 9b).

97 Chapter 3

3

Figure 5. CCT is required for holo-TFIID assembly. a, CCT interaction with newly-synthesized GFP-TAF5 was probed by pulse-chase experiments. Input represents 5% of the protein sample used in co-IPs. b, c, Pulse-chase experiment sampled at different time-points. CCT binding to TAF5 peaks at 75 min after inducing GFP-TAF5 expression. Relative abundance was normalized to bait. Baseline was taken at t=0. CCT subunits are colored in magenta and TFIID subunits in green. Each data point in volcano plots is plotted as average of technical triplicates. Dashed red lines denote threshold between background and significant enrichment (two-tailed t-test; FDR = 1%; S0=1). Each bar in stoichiometry plot represents an average of technical triplicates. Error bars = s.d. of mean. d, Transient co-transfection experiments with GFP-TAF5, GFP-TAF6 and GFP-TAF9 probed by co-IPs and immunoblot. TAF6- TAF9 is required for releasing TAF5 from CCT. Technical replicates n = 2. e, f, qMS analysis of GFP- TAF5m4 co-IPs from cytoplasmic fraction. Stoichiometry plot is normalized to bait. Each data point in volcano plots is plotted as average of technical triplicates. Each bar in stoichiometry plot represents an average of technical triplicates. Error bars = s.d. of mean. Source data for b and f are available online.

98 CCT as checkpoint for TFIID assembly

In reverse pulse–chase experiments, prolonged Dox induction enabled GFP–TAF5 protein accumulation, followed by CHX addition to inhibit new protein synthesis. We observed time-dependent disassociation of GFP–TAF5 from CCT. By contrast, TAF5m1and TAFm2 remained bound to the chaperonin. Mass spectrometry analysis of GFP–TAF5-induced cells showed that newly synthesized TAF5 enriched all CCT subunits (Fig. 5b), with CCT binding peaking at 75 min before declining, corroborating the transient nature of the interaction. Notably, co-IPs at later time points progressively enriched TAFs and TBP, whereas CCT disappeared into back- ground, consistent with TAF5 release from CCT and incorporation into the holo-TFIID complex (Fig. 5c). Our model indicates that the handover of TAF5 from CCT to TAF6–TAF9 would be 3 controlled by the levels of the TAF6–TAF9 heterodimer and be dependent on the initial encounter between TAF6–TAF9 and the NTD of TAF5. Indeed, transient co-expression experiments of GFP-tagged TAF5, TAF6 and TAF9 in various combinations underscored that TAF6–TAF9 is required for TAF5 release from CCT (Fig. 5d). We designed a TAF5 mutant that had mutations in the NTD (TAF5m4) that were predicted to disrupt the interaction with TAF6–TAF9 (Fig. 2h and Supplementary Fig. 7). Subsequent co-IP of this mutant followed by qMS showed that TAF5m4 did not form TFIID submodules and that it was retained in the CCT complex (Fig. 5e,f). These data indicate that the CCT complex acts as an essential checkpoint in TFIID assembly. Next, we investigated the effect of short interfering RNA (siRNA)-mediated knockdown of TCP1, which affects CCT function16. TCP1 knockdown in GFP–TAF5-induced cells markedly reduced co-purification of TAF6 as well as TBP, indicating that the interaction with CCT is a prerequisite to prime TAF5 for TAF6–TAF9 binding and incorporation into the TFIID complex (Fig. 6a). Consistent with this, TCP1 knockdown affected cellular localization and a considerable proportion of GFP–TAF5 was retained in the cytoplasm (Fig. 6b).

Discussion Taken together, we provide unique insights at the molecular level into early events in the regulated assembly of the basal transcription factor TFIID. Our results provide compelling evidence that the chaperonin CCT associates with newly synthesized TAF5 in the cytoplasm of cells, acting as an essential checkpoint for the formation of functional TFIID holo-complexes (Fig. 6c and Supplementary Video 1, online version). Our data demonstrate that the interaction with CCT is a prerequisite for the formation of a TFIID submodule that consists of TAF5, TAF6 and TAF9. Presumably, CCT facilitates the folding of the WD40-repeat domain and stabilizes TAF5 in a conformation that is compatible with the formation of the multitude of interactions between the TAF6–TAF9 and the TAF5 NTD that were observed in the crystal structure. Our results show that TAF6–TAF9 binding is required to release TAF5 from CCT and indicate that CCT rebinding may be prevented by docking of α-helix 4 of TAF9 on the lateral surface of the TAF5 WD40 domain (Fig. 2a and Supplementary Video 1). This handover is critical for the formation of the holo-TFIID complex that contains a full complement of TAFs and TBP. Additionally, we identify TAF5–TAF6–TAF9 as a discrete cytoplasmic TFIID subcomplex that is poised to engage other TAFs including the TAF4–TAF12 heterodimer, thus substantiating the concept of holo-TFIID assembly from preformed submodules and pinpointing a chaperonin as a key factor in this process.

99 Chapter 3

3

Figure 6. CCT chaperonin-assisted early events in TFIID assembly. a, siRNA-mediated knockdown of CCT subunit TCP1 impedes GFP-TAF5 integration into TFIID. Control cells were treated with a non-targeting siRNA (siNT). Input represent 2% of the protein sample used in each co- IP. Technical replicates n = 2. b, TCP1 knockdown affects nuclear localization of GFP-TAF5; scale bar = 10 μm. Technical replicates n = 2. c, Newly- synthesized TAF5 is captured by the chaperonin (left). The TAF5 N-terminal domain folds autonomously, Folding of the WD40 repeat domain depends on CCT. Release of readily folded TAF5 from the chaperonin requires binding of TAF6-TAF9, resulting in a discrete TAF5-TAF6-TAF9 complex in the cytoplasm. Binding of further TAFs and TBP completes functional holo- TFIID assembly.

Our finding that not only TAF5, but also its paralogue TAF5L interact with CCT, suggests that the handover mechanism described here is not confined to TFIID and similar mechanisms may also govern early stages of SAGA assembly. We anticipate that our observations will have general implications for the molecular mechanisms that regulate the assembly of the many multiprotein machines that are at work in the cells.

100 CCT as checkpoint for TFIID assembly

Methods TAF5–TAF6–TAF9 complex production. The TAF5–TAF6–TAF9 full-length complex was cloned, expressed and purified using a polyprotein strategy as described previously3. For all other TAF5–TAF6–TAF9 complexes, coding sequences of individual subunits and truncation mutants were cloned into acceptor (pFL) and donor (pIDC and pIDK) vectors of the MultiBac system by sequence- and ligation-independent cloning and fused by in vitro Cre–loxP recombination to yield a single plasmid with multiple expression cassettes33. Constructs used are listed in Supplementary Table 1. A decahistidine tag followed by a tobacco etch virus (TEV) NIa protease cleavage site was placed at the N terminus of the TAF5 subunit. Recombinant baculovirus was produced as previously described34 and used to infect SF21 insect cells (Invitrogen) at a cell density of 1.0 × 106 3 per ml in SF4 medium. Cells were collected 72–96 h after proliferation arrest by centrifugation at 4,000g for 15 min. Cell pellets were resuspended in lysis buffer (25 mM Tris-HCl pH 7.5, 300 mM NaCl, 5 mM imidazole, 1 mM β-mercaptoethanol and 0.1 % NP-40), incubated on ice for 30 min and the lysate was cleared by centrifugation at 25,000 r.p.m. in a JA-25.50 rotor (Beckman) for 90 min. The supernatant was mixed with Talon resin (Clontech), equilibrated in Talon A buffer (25 mM Tris-HCl pH 7.5, 300 mM NaCl and 5 mM imidazole) and incubated for 1 h at 4 °C. The resin was washed with 15 column volumes of Talon A buffer, 10 column volumes of Talon HS buffer (25 mM Tris-HCl pH 7.5, 1 M NaCl and 7.5 mM imidazole), and 15 column volumes of Talon A buffer before eluting the bound protein complex with 5 column volumes of Talon B buffer (25 mM Tris- HCl pH 7.5, 200 mM NaCl and 300 mM imidazole). Then, 6×His-TEV protease (prepared in-house) was added at a 1:100 w/w ratio and the protein was dialyzed against a 100-fold excess of dialysis buffer (25 mM Tris-HCl pH 7.5, 100 mM NaCl and 5 mM β-mercaptoethanol). The dialyzed protein was passed over a Ni-NTA resin (QIAGEN) equilibrated in dialysis buffer and the flow-through was collected. The flow-through was then passed through a MonoQ 5/50GL column (GE Healthcare). The protein complex was found in the flow-through and separated on a Superdex S200 16/60 column (GE Healthcare) equilibrated in SEC buffer (10 mM HEPES-NaOH pH 7.5, 150 mM NaCl and 1 mM dithiothreitol (DTT)). Peak fractions were pooled and concentrated to 10 mg ml−1. Small aliquots were frozen in liquid nitrogen and stored at −80 °C until further use. In TAF5ΔWD40– TAF6–TAF9 pull-down experiments with full-length TAF6 and TAF9 co-expressed with a TAF5 truncation (TAF5(1–343)) mutant that lacked the WD40 domain (Fig. 4e), samples were separated by SDS–PAGE, blotted and probed using specific antibodies (L.T. laboratory), according to standard immunoblot procedures.

Crystallization, data collection and structure determination. The TAF5(194–800)–TAF6(1– 92RLRRRAH)–TAF9(1–120) complex was crystallized in 0.1 M Tris-HCl pH 6.8, 0.8 M sodium citrate and 0–0.15 M NaCl using the hanging drop vapor diffusion technique by mixing 2 µl of protein solution at 5.4–7.5 mg ml−1 with 1 µl of reservoir solution and equilibrated against 500 µl of reservoir solution at 4 °C. Optionally, crystallization drops were seeded with 0.5 µl of a 1:10,000 to 1:1,000,000-diluted micro-seed stock solution, which was prepared by crushing a single crystal in 10 µl reservoir solution. Crystals of pyramidal shape first appeared after four days and reached their maximal size of 300 × 300 × 300 µm after ten days. For data collection at cryogenic temperatures, crystals were transferred sequentially into 2-µl drops containing 0.1 M Tris-HCl pH 6.8, 0.15 M NaCl and increasing concentrations of sodium citrate from 1.0 M to 1.6 M before flash- freezing in liquid nitrogen. For experimental phasing, crystals of

101 Chapter 3

the complex were transferred to a 2-µl drop containing a stabilizing solution of 0.1 M Tris-HCl pH

6.8, 1.0 M sodium citrate and 0.15 M NaCl. Solid Ta6Br12 clusters were directly added to the crystallization drop containing the pre-equilibrated crystals. After 16 h, the Ta6Br12-soaked crystals were sequentially transferred into 2-µl drops containing 0.1 M Tris-HCl pH 6.8 with 0.15 M NaCl and increasing concentrations of sodium citrate from 1.2 M to 1.6 M before flash-freezing in liquid nitrogen. Native datasets were collected at beamline ID14-4 (ESRF) at 100 K (1.0044 Å

wavelength). Datasets of Ta6Br12-soaked TAF5–TAF6–TAF9-complex crystals were collected at beamline ProximaI (SOLEIL) following an optimized inverse- beam single-wavelength anomalous

diffraction (SAD) data-collection strategy at the peak wavelength for Ta (1.25439 Å). The Ta6Br12- 3 soaked crystals were non- isomorphous to native TAF5–TAF6–TAF9-complex crystals. All diffraction data were processed with the XDS software package35,36. The structure of the

TAF5–TAF6–TAF9 complex was solved using the Ta6Br12 SAD dataset. Three Ta6Br12 cluster sites were identified by HySS as implemented in PHENIX software suite37. The heavy-atom substructure was further refined and phases were calculated with PHASER38. The initial electron density map improved substantially by density modification in PARROT39, allowing placement of four truncated molecules of the human TAF5-NTD domain (Protein Data Bank (PDB) ID 2NXP) and four molecules of the D. melanogaster TAF6–TAF9 HFD pair (PDB ID 1TAF) manually into the electron density map. Subsequently, the experimental electron density map was further improved by non-crystallographic symmetry averaging in RESOLVE40,41. The improved map allowed placement of four molecules of a seven-bladed WD40-repeat model obtained from the PHYRE server42 (based on the TAF5 protein sequence) by molecular replacement in MOLREP43. This initial model was used to phase the native dataset by molecular replacement in MOLREP. After automated model building in ARP/wARP44 and BUCCANEER39, the model was manually adjusted using repetitive rounds of refinement in PHENIX and model building in COOT. Translation/Libration/ Screw refinement was used in the final rounds of refinement with 61 individual Translation/Libration/Screw groups as determined by PHENIX.

Pull-down assay of TAF5(NTD)–TAF6–TAF9. Cell pellets were resuspended in buffer (25 mM Tris-HCl pH 8.0, 150 mM NaCl, 5 mM imidazole and EDTA-free complete protease inhibitor (Roche)), lysed by two cycles of freeze–thawing in liquid nitrogen and the lysate was cleared by centrifugation at 19,000g in a Thermo Scientific Fiberlite F21-8 x 50y fixed-angle rotor for 90 min. The supernatant was mixed with Talon resin (Clontech), equilibrated in buffer A (25 mM Tris-HCl pH 8.0, 150 mM NaCl and 5 mM imidazole) and incubated for 1 h at 4°C. The resin was washed with 15 column volumes of buffer A, followed by 10 column volumes of buffer HS (25 mM Tris- HCl pH 8.0, 1,000 mM NaCl and 5 mM imidazole) and 15 column volumes of buffer A before eluting the bound protein complex with 5 column volumes of buffer B (25 mM Tris-HCl pH 8.0, 150 mM NaCl and 200 mM imidazole). Different samples were analyzed by SDS–PAGE analysis and western blot analysis using an HRP-conjugated anti-His antibody (Sigma-Aldrich) for TAF5(NTD) and an anti-TAF6 antibody (primary, from L.T.) followed by HRP- conjugated anti- mouse antibody (secondary, Sigma-Aldrich) for TAF6.

GFP–TAF cell line generation. cDNA of human proteins was amplified by PCR with gene- specific primers fused to attB recombination sequences for GATEWAY cloning. The amplified sequence was recombined into the pDON201 donor vector according to the manufacturer‘s

102 CCT as checkpoint for TFIID assembly protocol (ThermoFisher). The cloned sequence was verified with Sanger sequencing and was transferred to the pcDNA5-FRT-TO-N-GFP26 Gateway destination vector by LR recombination according to the manufacturer‘s protocol (ThermoFisher). TAF9- and TAF5- mutant cDNAs were purchased from GenScript. Both TAF5m1 and TAF5m2 and TAF5m2 and TAF5m3 were obtained in a single round of mutagenesis PCR. All destination vectors were co-transfected with a pOG44 plasmid that encodes the Flp recombinase into HeLa Flp-In/T-REx cells26 using polyethyleneimine transfection to generate stable Dox-inducible expression cell lines.

Cell culture methods. HeLa Flp-In/T-REx cells, which contained the Flp recombination target site and expressed the Tet repressor, were grown in Dulbecco‘s modified Eagl e‘s medium 3 (DMEM), 4.5 g l−1 glucose, supplemented with 10% v/v fetal bovine serum, 10 mM l-glutamine and 100 U ml−1 penicillin–streptomycin (all purchased from Lonza), together with 5 μg ml−1 blasticidin S (InvivoGen) and 200 μg ml−1 zeocin (Invitrogen), which were used to select for the FRT and Tet repressor, respectively. All cell lines were mycoplasma negative. Recombined cells were selected by replacing zeocin with 250 μg ml−1 hygromycin B (Roche Diagnostics) 48 h after polyethyleneimine transfection. Expression of GFP-tagged protein was induced by addition of 1 μg ml−1 Dox, for 16–18 h. HEK293T cells used for transient-expression experiments were grown in DMEM, 4.5 g l−1 glucose, supplemented with 10% v/v fetal bovine serum, 10 mM l-glutamine and 100 U ml−1 penicillin–streptomycin (all purchased from Lonza).

Immunoblotting procedures. Cell were seeded in six-well dishes at 30,000 cells per well and induced with Dox for 16 h to 18 h before collection. Cell lysates were prepared in 1× sample buffer (160 mM Tris-HCl pH 6.8, 4% SDS, 20% glycerol and 0.05% bromophenol blue). Equal amounts of cell lysates were separated by a 10% SDS–PAGE and transferred onto a PVDF membrane. GFP purification samples for immunoblotting were collected by elution with sample buffer. The membrane was developed with the appropriate antibodies and ECL. Immunoblots were analyzed using a ChemiDoc imaging system (BioRad). The images were subjected to linear contrast/brightness enhancement in Photoshop (CS6, 13.0.6 ×64, extended) when needed for data presentation purposes. Antibodies used in the study include: GFP (JL-8, Clontech), TCP1 (91A, ThermoFisher), CCT2 (kind gift from W. Vonk/J. Frydman), GAPDH (clone 6C5, mAb374, Millipore), Penta His (QIAGEN), TAF5 (in-house, 25TA 2G7) and TAF6 (in-house, 1TA-1C2).

Extract preparation and GFP co-IP. Cells were seeded in 15-cm dishes (Greiner Cellstar, Sigma- Aldrich) and grown to 70–80% confluence prior to Dox induction. GFP–protein expression was verified using EVOS fluorescence microscopy (Thermo Fischer). Next, induced cells were collected and nuclear and cytoplasmic extracts were obtained using a modified version of the Dignam and Roeder procedure45. Protein concentrations were determined by Bradford assay (BioRad). Then, 1 mg of nuclear or 3 mg of cytoplasmic extract was used for GFP-affinity purification as previously described46. In brief, protein lysates were incubated in binding buffer (20 mM HEPES-KOH pH 7.9, 300 mM NaCl, 20% glycerol, 2 mM MgCl2, 0.2 mM EDTA, 0.1% NP-40, 0.5 mM DTT and 1× Roche protease inhibitor cocktail) on a rotating wheel for 1 h at 4 °C in triplicates with GBP- coated agarose beads (Chromotek) or control agarose beads (Chromotek). The beads were washed twice with binding buffer containing 0.5% NP-40, twice with PBS containing 0.5% NP-40 and twice with PBS. On-bead digestion of bound proteins was performed overnight in elution buffer

103 Chapter 3

(100 mM Tris-HCl pH 7.5, 2 M urea and 10 mM DTT) with 0.1 µg ml−1 of trypsin at room temperature and eluted tryptic peptides were bound to C18 stagetips prior to mass spectrometry analysis. Whole-cell extracts were collected in lysis buffer (50 mM Tris-HCl pH 7.9, 10% glycerol, 100 mM NaCl, 10 mM MgCl2, 0.5 mM DTT, 0.1% NP-40 and 1× complete protease inhibitor cocktail (Roche)), by incubating the cells with the lysis buffer on ice for 10 min. Next, the protein samples were collected by manual harvesting and centrifuged at 14,000 r.p.m. for 20 min at 4 °C. Supernatants were snap-frozen and stored at −80 °C.

3 Mass spectrometry and data analysis. Tryptic peptides were eluted from the C18 stagetips in H2O:acetonitril (50:50) with 0.1% formic acid and dried prior to resuspension in 10% formic acid. A third of this elution was injected into a Q-Exactive (Thermo Fischer) in tandem mass spectrometry mode with 90 min total analysis time. Blank samples consisting of 10% formic acid were run for 45 min between GFP and non-GFP samples, to avoid carry-over between runs. The raw data files were analyzed with MaxQuant software (version 1.5.3.30) using the Uniprot human FASTA database46,47. Label-free quantification values and match between run options were selected. The iBAQ algorithm was also activated for subsequent relative protein abundance estimations27. The obtained protein files were analyzed by Perseus software (MQ package, version 1.5.4.0), in which contaminants and reverse hits were filtered out. Protein identification based on non-unique peptides as well as proteins identified by only one peptide in the different triplicates were excluded to increase protein prediction accuracy. For identification of the bait interactors, label-free quantification intensity-based values were transformed on the logarithmic scale (log2) to generate Gaussian distribution of the data. This enables the imputation of missing values based on the normal distribution of the overall data (in Perseus, width = 0.3; shift = 1.8). The normalized label-free quantification intensities were compared between grouped GFP triplicates and non-GFP triplicates, using a 1% permutation- based false discovery rate (FDR) in a two-tailed Student‘s t-test. The threshold for significance (S0), based on the FDR and the ratio between GFP and non-GFP samples, was kept at the constant value of 1 for comparison purposes. Relative abundance plots were obtained by comparison of the iBAQ values of GFP interactors. The values of the non-GFP iBAQ values were subtracted from the corresponding proteins in the GFP pull-down and were next normalized to a chosen co- purifying protein for scaling and data-presentation purposes27.

siRNA-mediated knockdown experiments. Cells were transfected using HiPerfect and a reverse-transfection protocol that was based on the manufacturer‘s instructions (QIAGEN). After 72 h, cells were collected for immunoblotting, mRNA analysis or immunofluorescence confocal microscopy. Protein lysates were collected in sample buffer. RNA was extracted using the RNeasy kit following the manufacturer‘s protocol including a DNase treatment step (QIAGEN). For immunofluorescence analysis, cells grown on coverslips were fixed in 4% paraformaldehyde in PBS for 20 min at room temperature and stored in 0.4% paraformaldehyde in PBS at 4 °C before analysis.

RT–qPCR analysis. cDNA for RT–qPCR analysis was synthesized from 500 ng of the total RNA using Superscript II (Invitrogen) with the supplied random primers mix. RT–qPCR was performed

104 CCT as checkpoint for TFIID assembly in a 25 µl final volume using iQ SYBR Green Supermix (Bio- Rad). qPCR analysis was performed in CFX Connect Real-Time PCR Detection System (Bio-Rad) and extracted data were analyzed in Excel following the ΔΔCq method.

Confocal microscopy. Paraformaldehyde-fixed cells were subjected to a short cell membrane permeabilization step with 0.5% Triton X-100 in PBS for 5 min. After quenching the cross-linking with 50 mM glycine, the samples were blocked for 30 min at room temperature with 5% natural goat serum and subsequently incubated with primary antibodies for 2 h at room temperature. After three sequential washing steps with PBS, secondary antibodies conjugated to a fluorophore were added, together phalloidin (Abcam) for staining of actin for 45 min at room temperature. For 3 nuclear staining, the samples were incubated with DAPI (2 mg l−1, Sigma-Aldrich) for 5 min before mounting the coverslips on microscopy slides (ThermoFischer Scientific) with Immu-Mount (Invitrogen). Images were obtained using a SP8-X confocal microscope (Leica Microsystems) using a HC PL APO 63×/1.40 NA oil CS2 objective. Gain and offset settings were adjusted according to the fluorescence signal, but they were kept constant in comparative experimental designs such as Dox- induction tests. Exported files were next subjected to linear contrast and brightness processing in Photoshop (CS6, 13.0.6 ×64, extended) for image presentation purposes.

Pulse–chase experiments. Forward setup (tracing newly synthesized proteins). Cells were seeded in 10 cm culture dishes and grown to approximately 90% confluence. Next, protein translation was blocked by addition of 50 µg ml−1 CHX (Sigma-Aldrich). Then, 30 min after CHX block, cells were induced with 1 µg ml−1 Dox for 6 h to accumulate GFP–protein mRNA. Prior to sample collection, cells were washed with PBS twice for complete removal of CHX and medium without Dox was refreshed for the duration of the chosen time points. Whole-cell extracts were collected for each of the time points and used for GFP affinity purification. Enriched proteins were separated by SDS–PAGE and detected by immunoblotting.

Reverse setup (blocking new protein synthesis). Cells were grown in 10 cm culture dishes to 70% confluence before Dox induction. Then, 50 µg ml−1 of CHX was added after 16 h of induction to block the synthesis of new proteins and to allow gradual disassociation of the newly synthesized proteins from their respective folding partners. Cells were collected at different time points after translational block induction and used for GFP affinity purification in order to analyze associated proteins. Samples were analyzed by SDS–PAGE and immunoblotting.

Data availability Atomic coordinates and crystallographic structure factors have been deposited in the PDB under accession code 6F3T. Proteomics data have been deposited in the PRIDE database under accession code PXD011293. Source data for Figs. 1, 3–5 and Supplementary Figs. 6, 8 and 9 are available in the online version of the paper. All other data supporting findings in this study are available from the corresponding authors upon reasonable request.

105 Chapter 3

Acknowledgements We thank S. Trowitzsch (Goethe University Frankfurt), P. Legrand and A. Thompson (Synchrotron SOLEIL), W. Vonk (Princess Maxima Centre Utrecht), R. Baas (Netherlands Cancer Institute Amsterdam) and M. Vermeulen (Radboud University Nijmegen) for assistance and reagents. We greatly appreciate discussion with R. Sawakar (MPI for Immunobiology and Epigenetics). This research was supported by the Netherlands Organization for Scientific Research (NWO) grants 022.004.019 (S.V.A.), ALW820.02.013 (H.T.M.T.) and 184.032.201 Proteins@Work (E.C., T.Y.L., H.R.V. and A.J.R.H.), a Kékulé fellowship from the Fonds der Chemischen Industrie (M.H.), a European Research Council Advanced grant ERC-2013-340551 (L.T.), Agence Nationale 3 de Recherche research grants ANR-10-IDEX-0002-02 and ANR-10-LABX- 0030-INRT (L.T.) and a Wellcome Trust Senior Investigator Award 106115/Z/14/Z (I.B.). This work used the platforms of the Grenoble Instruct-ERIC Center (ISBG UMS 3518 CNRS-CEA-UGA-EMBL) with support from FRISBI (ANR-10-INBS-05-02) and GRAL (ANR-10-LABX-49-01) within the Grenoble Partnership for Structural Biology (PSB). This research received support from BrisSynBio, a BBSRC/EPSRC Research Centre for synthetic biology at the University of Bristol (BB/L01386X/1).

Author contributions H.T.M.T. and I.B. conceived the study with input from S.V.A., M.H., L.T. and A.J.R.H. S.V.A. carried out all cell biology experiments, assisted by M.M. and E.S. The majority of the quantitative proteomics analyses were carried out by E.C. and T.Y.L., assisted by S.V.A. R.M.v.E. and H.R.V. carried out proteomics analyses of the pulse–chase experiment. M.H. carried out all recombinant protein work, crystallization and structural analysis, assisted by L.S., K.G. and with input from I.B. S.V.A., M.H., L.T., A.J.R.H., H.T.M.T. and I.B. designed experiments, interpreted data and wrote the manuscript together with input from all authors.

References 1. Roeder, R. G. Transcriptional regulation and the role of diverse coactivators in animal cells. FEBS Lett. 579, 909–915 (2005). 2. Levine, M., Cattoglio, C. & Tjian, R. Looping back to leap forward: transcription enters a new era. Cell 157, 13–25 (2014). 3. Bieniossek, C. et al. The architecture of human general transcription factor TFIID core complex. Nature 493, 699–702 (2013). 4. Louder, R. K. et al. Structure of promoter-bound TFIID and model of human pre-initiation complex assembly. Nature 531, 604–609 (2016). 5. Muller, F., Zaucker, A. & Tora, L. Developmental regulation of transcription initiation: more than just changing the actors. Curr. Opin. Genet. Dev. 20, 533–540 (2010). 6. Helmlinger, D. & Tora, L. Sharing the SAGA. Trends Biochem. Sci. 42, 850–861 (2017). 7. Baptista, T. et al. SAGA is a general cofactor for RNA polymerase II transcription. Mol. Cell 68, 130– 143 (2017). 8. Warfield, L. et al. Transcription of nearly all yeast RNA polymerase II-transcribed genes is dependent on transcription factor TFIID. Mol. Cell 68, 118–129 (2017).

106 CCT as checkpoint for TFIID assembly

9. Spedale, G., Timmers, H. T. & Pijnappel, W. W. ATAC-king the complexity of SAGA during evolution. Genes Dev. 26, 527–541 (2012). 10. Hernandez, N. TBP, a universal eukaryotic transcription factor? Genes Dev. 7, 1291–1308 (1993). 11. Clapier, C. R. & Cairns, B. R. The biology of complexes. Annu. Rev. Biochem. 78, 273–304 (2009). 12. Ellis, R. J. Assembly chaperones: a perspective. Phil. Trans. R. Soc. B 368, 20110398 (2013). 13. Schopf, F. H., Biebl, M. M. & Buchner, J. The HSP90 chaperone machinery. Nat. Rev. Mol. Cell Biol. 18, 345–360 (2017). 14. Ramos, P. C. & Dohmen, R. J. PACemakers of proteasome core particle assembly. Structure 16, 1296–1304 (2008). 3 15. Venkatesh, S. & Workman, J. L. Histone exchange, chromatin structure and the regulation of transcription. Nat. Rev. Mol. Cell Biol. 16, 178–189 (2015). 16. Lopez, T., Dalton, K. & Frydman, J. The mechanism and function of group II chaperonins. J. Mol. Biol. 427, 2919–2930 (2015). 17. Pines, A. et al. TRiC controls transcription resumption after UV damage by regulating Cockayne syndrome protein A. Nat. Commun. 9, 1040 (2018). 18. Muñoz, I. G. et al. Crystal structure of the open conformation of the mammalian chaperonin CCT in complex with tubulin. Nat. Struct. Mol. Biol. 18, 14–19 (2011). 19. Altelaar, A. F., Munoz, J. & Heck, A. J. Next-generation proteomics: towards an integrative view of proteome dynamics. Nat. Rev. Genet. 14, 35–48 (2013). 20. Ahrens, C. H., Brunner, E., Qeli, E., Basler, K. & Aebersold, R. Generating and navigating proteome maps using mass spectrometry. Nat. Rev. Mol. Cell Biol. 11, 789–801 (2010). 21. Wright, K. J., Marr, M. T. II & Tjian, R. TAF4 nucleates a core subcomplex of TFIID and mediates activated transcription from a TATA-less promoter. Proc. Natl Acad. Sci. USA 103, 12347–12352 (2006). 22. Trowitzsch, S. et al. Cytoplasmic TAF2-TAF8-TAF10 complex provides evidence for nuclear holo- TFIID assembly from preformed submodules. Nat. Commun. 6, 6011 (2015). 23. Gupta, K. et al. Architecture of TAF11/TAF13/TBP complex suggests novel regulation properties of general transcription factor TFIID. eLife 6, e30395 (2017). 24. Scheer, E., Delbac, F., Tora, L., Moras, D. & Romier, C. TFIID TAF6–TAF9 complex formation involves the HEAT repeat-containing C-terminal domain of TAF6 and is modulated by TAF5 protein. J. Biol. Chem. 287, 27580–27592 (2012). 25. Gangloff, Y. G., Romier, C., Thuault, S., Werten, S. & Davidson, I. The histone fold is a key structural motif of transcription factor TFIID. Trends Biochem. Sci. 26, 250–257 (2001). 26. van Nuland, R. et al. Quantitative dissection and stoichiometry determination of the human SET1/MLL histone methyltransferase complexes. Mol. Cell. Biol. 33, 2067–2077 (2013). 27. Schwanhausser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011). 28. Bhattacharya, S., Takada, S. & Jacobson, R. H. Structural analysis and dimerization potential of the human TAF5 subunit of TFIID. Proc. Natl Acad. Sci. USA 104, 1189–1194 (2007). 29. Xie, X. et al. Structural similarity between TAFs and the heterotetrameric core of the histone octamer. Nature 380, 316–322 (1996). 30. Miyata, Y., Shibata, T., Aoshima, M., Tsubata, T. & Nishida, E. The molecular chaperone TRiC/CCT binds to the Trp-Asp 40 (WD40) repeat protein WDR68 and promotes its folding, protein kinase DYRK1A binding, and nuclear accumulation. J. Biol. Chem. 289, 33320–33332 (2014).

107 Chapter 3

31. Cuéllar, J. et al. The structure of CCT–Hsc70NBD suggests a mechanism for Hsp70 delivery of substrates to the chaperonin. Nat. Struct. Mol. Biol. 15, 858–864 (2008). 32. Han, Z. et al. Structural basis for the specific recognition of methylated histone H3 lysine 4 by the WD-40 protein WDR5. Mol. Cell 22, 137–144 (2006). 33. Haffke, M., Viola, C., Nie, Y. & Berger, I. Tandem recombineering by SLIC cloning and Cre–LoxP fusion to generate multigene expression constructs for protein complex research. Methods Mol. Biol. 1073, 131–140 (2013). 34. Fitzgerald, D. J. et al. Protein complex expression by using multigene baculoviral vectors. Nat. Methods 3, 1021–1032 (2006). 3 35. Kabsch, W. XDS. Acta Crystallogr. D 66, 125–132 (2010). 36. Kabsch, W. Integration, scaling, space-group assignment and post-refinement. Acta Crystallogr. D 66, 133–144 (2010). 37. Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr. D 66, 213–221 (2010). 38. McCoy, A. J. et al. Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007). 39. Cowtan, K. The Buccaneer software for automated model building. 1. Tracing protein chains. Acta Crystallogr. D 62, 1002–1011 (2006). 40. Terwilliger, T. SOLVE and RESOLVE: automated structure solution, density modification and model building. J. Synchrotron Radiat. 11, 49–52 (2004). 41. Terwilliger, T. C. Automated structure solution, density modification and model building. Acta Crystallogr. D 58, 1937–1940 (2002). 42. Kelley, L. A. & Sternberg, M. J. Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc. 4, 363–371 (2009). 43. Vagin, A. & Teplyakov, A. Molecular replacement with MOLREP. Acta Crystallogr. D 66, 22–25 (2010). 44. Langer, G., Cohen, S. X., Lamzin, V. S. & Perrakis, A. Automated macromolecular model building for X-ray crystallography using ARP/wARP version 7. Nat Protoc. 3, 1171–1179 (2008). 45. Carey, M. F., Peterson, C. L. & Smale, S. T. Dignam and Roeder nuclear extract preparation. Cold Spring Harbor Protoc. 2009, pdb.prot5330 (2009). 46. Spruijt, C. G., Baymaz, H. I. & Vermeulen, M. Identifying specific protein– DNA interactions using SILAC-based quantitative proteomics. Methods Mol. Biol. 977, 137–157 (2013). 47. Low, T. Y. et al. Quantitative and qualitative proteome characteristics extracted from in-depth integrated genomics and proteomics analysis. Cell Rep. 5, 1469–1478 (2013).

108 CCT as checkpoint for TFIID assembly

Supplemental material

3

Supplementary Figure 1. Expression analysis of WT GFP-TAF5, GFP-TAF6 and GFP-TAF9 inducible cell lines. a, Expression of GFP-TAF5 and GFP-TAF6 proteins relative to their endogenous counterparts. Technical replicates n = 2. b, Relative expression levels of GFP-TAF5, GFP-TAF6 and GFP-TAF9 by immunoblotting. Technical replicates n = 2. c, GFP-mRNA analysis of GFP-TAF5, GFP-TAF6 and GFP-TAF9. Technical replicates n = 3.

109 Chapter 3

3

Supplementary Figure 2. Limited proteolysis of the full-length TAF5/TAF6/TAF9 complex. a, Proteolysis time course of full-length TAF5/TAF6/TAF9 at a 1:250 [w/w] ratio with α-chymotrypsin on ice. b, Size exclusion chromatography profile on a Superdex S200 10/300GL column of preparative proteolysis with α-chymotrypsin at a 1:200 [w/w] ratio on ice for 120 min and SDS-PAGE analysis of peak fractions. Fragments highlighted in SDS-PAGE were analyzed by N-terminal protein sequencing with Edman degradation and LC-ESI/MS. c, The identified N- and C-terminal domain boundaries obtained by limited proteolysis.

110 CCT as checkpoint for TFIID assembly

3

Supplementary Figure 3. Purified TAF5/TAF6/TAF9 complexes for crystallization trials. a, Constructs in which TAF5 is lacking the N-terminal LisH domain. b, Constructs in which TAF5 retains the N-terminal LisH domain. SDS-PAGE analysis of purified complexes is shown on the left. TAF5-LisH corresponds to TAF589-800, TAF5-NTD corresponds to TAF5194-800, TAF6-HEAT to TAF61-480, TAF6-HFDL to TAF61-92RLRRRAH, TAF6-HFDS to TAF61-92, TAF9-HFDL to TAF91- 139, TAF9-HFDS to TAF91-120. Size exclusion chromatography profiles of a Superdex S200 PC3.2/300 chromatography are shown on the right. Elution volumes of molecular weight standards are indicated by triangles.

111 Chapter 3

3

Supplementary Figure 4. Crystal structure determination and electron density map quality of the TAF5/TAF6/TAF9 complex. a, Optimized native crystals of TAF5/TAF6/TAF9 complex grew to

300 μm in size. b, Crystals soaked with Ta6Br12 cluster for SAD phasing experiments. Note the green

color due to incorporation of the Ta6Br12 cluster. c, Diffraction pattern of native TAF5-TAF6-TAF9

complex crystals extending beyond 2.7 Å resolution. d, Ta6Br12 soaked crystals diffracted beyond 3.8 Å resolution. e-g, Quality of the initial experimental electron density map (e) improved significantly by density modification without (f) and including NCS averaging (g). The top panel shows the central α- helix of the TAF6 HFD, the bottom panel shows a β-sheet of the TAF5 WD40 repeat. h, i, Quality of the final 2Fo-Fc electron density map, contoured at 1.5σ. A region of the TAF6-TAF9 HFD heterodimer (h) and a β-strand of the TAF5-WD40 repeat are shown (i).

112 CCT as checkpoint for TFIID assembly

3

Supplementary Figure 5. TAF5 NTD structures. a, Superposition of the TAF5-NTD in the TAF5/TAF6/TAF9 crystal structure and of TAF5-NTD crystallized in isolation (PDB-ID 2NXP, coloured in grey) b, Detailed view of the side chain orientations of L298, R301 and K304. Helix α1, which is present in the isolated TAF5-NTD structure but could not be traced in the TAF5/TAF6/TAF9 complex structure is shown as a ribbon (coloured in red). Note the clash between the reoriented helix α7 (residues L298 to R301) and the location of helix α1 in the isolated TAF5-NTD structure. Helices α3 and α4 are not shown for clarity.

113 Chapter 3

3

Supplementary Figure 6. Probing the TAF5/TAF9 interface. a, Enrichment represented in volcano plots reveal loss of TFIID and SAGA subunits with both GFP-TAF9m1 (left) and GFP-TAF9m3 (right). GFP-TAF9m2 (middle) is defective in TFIID formation only. Subunits are colored as in Fig. 1f. Each data point is plotted as average of technical triplicates; n = 2 independent experimental replicates for the representative GFP-TAF9m1 sample. Dashed red lines denote threshold between background and

114 CCT as checkpoint for TFIID assembly significant enrichment (two-tailed t-test; FDR = 1%; S0 =1). b, Mutation of all three regions involved in forming the TAF5 interface disrupts all interactions of GFP-TAF9 except with TAF6. c, Expression of GFP-TAF9 mutant proteins compared to GFP-TAF9 WT by immunoblot. d, Confocal fluorescence microscopy shows nuclear localization of all GFP-TAF9 proteins; scale bar = 10 μm. e, Volcano plot analysis indicates that GFP-TAF5m1 (left), GFP-TAF5m2 (middle) and GFP-TAF5m3 (right) co-purify all TFIID subunits. Each data point is plotted as average of technical triplicates. f, Relative enrichment of the co-purified TFIID subunits compared to GFP-TAF5 wild-type indicating reduced assembly of the mutants into TFIID. Each bar represents an average of technical triplicates. Error bars = s.d. of mean. g, Expression of GFP-TAF5 mutant proteins compared to GFP-TAF5 WT by immunoblot. Source data for f is available online. 3

115 Chapter 3

3

Supplementary Figure 7. Localization and local residue environment of TAF5 mutations. a, Overview of TA5 in cartoon representation with the WD40 repeat domain colored in blue and the NTD in cyan. Mutated residues in TAF5m1-m4 are shown as sticks in the same color code as in Figure 2. b-f, Local residue environment around the mutated residues. Neighboring residues in the WD40 repeat

116 CCT as checkpoint for TFIID assembly domain and the NTD of TAF5 are shown within a 4 Å radius of the mutations in TAF5m1 (b), TAF5m2 (c), TAF5m3 (d) and TAF5m4 (e,f). Polar contacts are shown as black dashed lines and only residues involved in polar contacts with the mutated residues are labeled with residue numbers for clarity. The majority of residues mutated in TAF5 are located in loop regions and have polar contacts via or to backbone atoms. Side chain mutations in TAF5m1-4, thus, most likely do not affect protein folding of TAF5.

3

Supplementary Figure 8. CCT subunits are co-enriched in GFP-TAF5 purifications. a, GFP- TAF5 WT and mutants co-enrich CCT subunits to different extent. Volcano plot analysis indicates that TAF5 mutants display increased enrichment of all CCT subunits compared to WT TAF5, supporting the transient association of the chaperonin with the WT TAF5. TFIID Subunits are colored in green, CCT subunits – in magenta. Each data point is plotted as average of technical triplicates. Dashed red lines denote threshold between background and significant enrichment (two-tailed t-test; FDR = 1%; S0 =1).

117 Chapter 3

n = 2 independent experimental replicates for the representative GFP-TAF5m1+2 sample. b, Relative protein abundance plot of CCT subunits indicates highest enrichment in mutants 1 and 2 of GFP-TAF5 as well as substantial enrichment in purifications on the cytoplasmic GFP-TAF5 wild-type. GFP- TAF5m1+2 has been excluded from the graph for representation purpose, as CCT enrichment is in 1:2 ratio with the bait. Each bar represents an average of technical triplicates. Error bars = s.d. of mean. Source data for b is available online.

3

Supplementary Figure 9. Pulse-chase experiments reveal transient CCT/TAF5 association. a, Schematic representation of the forward (left panel) and reverse (right panel) pulse-chase setups. Forward: cells are treated with CHX for 30 min prior Dox-induction. Upon mRNA accumulation, the translation block is released and the newly-synthesized GFP-tagged protein is followed in time. Expectations include GFP-TAF5 association with CCT upon protein synthesis stimulation. Reverse: GFP-tagged

118 CCT as checkpoint for TFIID assembly protein expression is induced for 24 h with Dox, allowing accumulation of GFP-TAF5 in two separate complexes – transiently with CCT for the newly-synthesized GFP-TAF5 and stably with TFIID complex. CHX addition to the induced cells blocks synthesis of GFP-TAF5 and, therefore, its transient association with CCT. b, Analysis by immunoblot of CCT interaction with newly-synthesized GFP- TAF5m1+2 and GFP-TAF7 followed in pulse-chase forward and reverse setups. Input represents 5% of the protein sample used in each IP. c, qMS analysis of pulse-chase forward setup reveals gradual enrichment of TFIID subunits during GFP-TAF5 synthesis. Data is normalized to bait. Each bar represents an average of technical triplicates. Error bars = s.d. of mean. d, siRNA-mediated TCP1 knockdown affects cellular localization of TAF5m1+2, which shifts to cytoplasmic. GFP-WDR5 localization is unaffected. GADPH siRNA (top) and non-targeting siRNAs (middle) were applied as negative controls. Scale bar = 10 μm. Technical replicates n = 2. Source data for c is available online. 3

119 Chapter 3

3

120 Proteomic analysis of TFIID

4

Chapter 4

Proteomic analysis of human TFIID in different cellular compartments

(Manuscript in preparation)

121 Chapter 4

Proteomic analysis of human TFIID in different cellular compartments Manuscript in preparation

S.V. Antonova1, M. Biniossek2, S. Capponi1,3, T. Bhuiyan3, L. van Wijk1,4, S. G. Alcantara1,5, J.J.J. Boeren1,6, H. T. M. Timmers1,3

1 Molecular Cancer Research and Regenerative Medicine, University Medical Centre Utrecht, 3584 CT Utrecht, The Netherlands. 2 Institute of Molecular Medicine and Cell Research, Faculty of Medicine, Albert Ludwigs University Freiburg, Stefan Meier Strasse 17, 79104, Freiburg, Germany. 3 German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120, Heidelberg, Germany; German Cancer Consortium (DKTK) partner site Freiburg, 79106, Freiburg, Germany; Department of Urology, Medical Center-University of Freiburg, Medical Faculty, Breisacher Straße 66, 79106, Freiburg, Germany. 4 4 Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands. 5 Stowers Institute for Medical Research, Kansas City, United States. 6 Department of Developmental Biology, Erasmus MC, 3015 CN Rotterdam, The Netherlands.

Abstract

Recent structural work has shed light on the mechanistic aspects the human pre-initiation complex formation. The accurate delivery of TATA-binding protein (TBP) at a fixed distance from the transcription start sites (TSSs) was shown to be dependent on the basal transcription factor IID (TFIID). Major structural rearrangements are required for the release of TBP from TFIID, suggesting likely compositional alterations in chromatin-bound TFIID. Interrogation of promoter- bound TFIID in vivo is needed to validate these in vitro observations. Several reports of partial cytoplasmic TFIID structures have suggested a modular TFIID assembly. Previously, we have characterized the cytoplasmic assembly of the TFIID module TAF5/TAF6/TAF9 and highlighted the crucial role of accurate spatio-temporal presence of the interaction partners in this process. Here, we expand our cell-based system to all TFIID subunits and interrogate their molecular surroundings in three cellular fractions through a quantitative proteomics approach. We show the existence of a number of discrete cytoplasmic TFIID modules, present also in the nucleus, indicating their involvement in nuclear holo-TFIID assembly. Notably, we observe that chromatin- bound TFIID composition is significantly destabilized upon release of TBP, with primarily core TFIID subunits as well as assembly partners preserved, suggesting an on-chromatin disassembly process mimicking the assembly in reverse.

Key words: TFIID, TBP, modules, quantitative proteomics, iBAQ

122 Proteomic analysis of TFIID

Introduction Transcriptional regulation is essential nuclear process required for accurate gene expression in spatio-temporal manner throughout eukaryotic life. Several chromatin-associated events are coordinated to provide a multi-layered and fine-tuned control of transcriptional programs 1. Among these events, transcription initiation is a central process ensuring timely recruitment and stabilization of RNA polymerase II (pol II) at core promoters of genes 2. Six basal (or general) transcription factors (BTFs) carry out this step through the assembly of the pre- initiation complex (PIC), which acts as an anchor and guide for the correct deposition of the pol II enzyme at the transcription start site (TSS) 3,4. TFIID is the first member of the PIC to locate and associate with activated promoters – an event that nucleates in a sequential manner the remaining PIC members and the pol II 5. Recent advances in the structural characterization of TFIID report major conformational rearrangements within its chromatin-bound form, driving the release of the TATA-binding protein 4 (TBP) from the complex and its measured deposition at an accurate distance from the TSS 2-4. TBP is a saddle-shaped protein with a hydrophobic concave surface endowed with DNA-binding properties, and a convex surface involved in protein-protein interactions 6-8. Upon association with the minor groove, TBP intercalates into the double helix through two pairs of highly conserved phenylalanines and enforces the formation of an approximately 90o angle in the DNA helix. Such distortion is favourable for the propagation of PIC assembly and is sequentially stabilized through the following association of TFIIA and TFIIB with the TBP:DNA complex 9,10. This process is regulated by the TBP-binding negative cofactor 2 (NC2) complex and the SWI-SNF-related ATPase BTAF1, which are involved respectively in the competitive shielding of TBP from BTFs and displacement of TBP in the context of the PIC 11-15. While structural characterizations of TBP interaction with its numerous partners have shed light on the mechanistic aspects of the binding discrimination (see Table 1) 9,10,12,14,16-20, understanding of TBP in vivo distribution between these complexes is still lacking. In addition to TBP, human TFIID is composed of 13 TBP- associated factors (TAFs) and four interchangeable TAF homologues. Several TAFs are reported to incorporate as double copies (TAF4, TAF5, TAF6, TAF9, TAF10 and TAF12) and, apart from TAF10, to form a core TFIID structure, integral for the complex assembly 21. Cryogenic electron microscopy (cryo-EM) analysis of recombinantly reconstituted core TFIID revealed a two-fold symmetrical organization of a scaffolding TAF5 and the histone fold (HF)-partners TAF6/TAF9 and TAF4/TAF12 22. Notably, addition of TAF8/TAF10 HF-pair led to a unilateral incorporation of the heterodimer into the core, resulting in symmetry breakage at the site of insertion and exposure of novel interaction surfaces. The step was proposed as central for the progression of holo-TFIID assembly. Subsequent work reported the in vivo existence of a cytoplasmic

123 Chapter 4

TAF2/TAF8/TAF10 module, dependent on TAF8 for nuclear import, leading to the proposition of a nuclear holo-TFIID formation upon incorporation of pre-assembled cytoplasmic modules 23. Insights into the sequence of these events, as well as in vivo interrogation for other cytoplasmic TFIID modules could greatly enhance the mechanistic understanding of this process and potentially reveal key spatial and/or temporal components of TFIID assembly. Furthermore, atomic resolution of separate TFIID subunits has revealed a number of domain-specific binding partners (see Table 2) 17,18,23-29, leaving open the possibility of translation of such interactions into modular cytoplasmic or nuclear pre-assemblies. Human TFIID shares three TAFs (TAF9, TAF10 and TAF12) with the multi-protein transcription coactivator, the 4 Spt-Ada-Gcn5 acetyltransferase (SAGA) chromatin modifying complex. The structure of SAGA is organized in a functionally and structurally modular manner, with its deubiquitinase (DUB), acetyltrasnferase (HAT) and transcription activators interaction (TRRAP) modules individually attached to a central (core) SAGA module 30. Akin to core TFIID, the shared TAFs dimerizes with HF-specific partners (TAF6L, SUPT7L and TADA1, respectively) onto a scaffolding interface provided by TAF5L. Additional elements, like the presence of the SUPT3H and SUPT20H subunits, as well as the single copies of all core SAGA members, highlight the structural and assembly deviation between the two complexes 30,31. Recent cryo-EM work has provided an insight into the architecture of yeast and human TFIID with the impressive resolutions of <5 Å and <10 Å, respectively 3,15,32. The two structures are in general agreement with each other and with previous studies of the complexes at lower resolutions 33, enlisting two lobes containing a single copy of the core TAFs, surrounded by different subunit repertoire and separated by a third lobe containing exclusively non-HF TAFs. Dynamic rearrangements of the complex are proposed to take place upon promoter recognition, leading to the release of TBP from its in-complex DNA-binding inhibited state 15. To date, this process has only been characterized through biochemical in vitro approaches, which limit the observed chromatin-bound TFIID structures to several stably captured states, and therefore, urge for an in vivo TFIID compositional interrogation. Here, we set out to analyse the proteomic surrounding of each TFIID subunit in three functionally relevant cellular compartments – the cytoplasm, and the soluble as well as the chromatin-bound nuclear fractions. Through a comprehensive quantitative mass spectrometry approach, we have defined cytoplasmic TFIID modules, nuclear holo-TFIID composition and relative stoichiometry, as well as chromatin-bound complex structure. Our data further indicate dynamic transition of TBP between its different interaction partners in a compartment-specific manner and preferential localization of shared TAFs to core SAGA rather than core TFIID in the

124 Proteomic analysis of TFIID cytoplasm. Together, this work illustrates the structural aspects of TFIID assembly and disassembly in the cell, sheds light onto the differential pathways of TFIID and SAGA formation and highlights the dynamics of TBP redistribution.

Results

Differential compartmentalization of TFIID-, SAGA- and TBP-specific proteins

To assess the molecular compositions of cellular TFIID, doxycycline (dox)-inducible HeLa-derived cells lines were generated to express TFIID subunits with an N-terminally fused green fluorescent protein (GFP) 24,34. Immunofluorescence analysis of the cells revealed differential 4 TAF and TBP distribution across the cytoplasmic and nuclear cellular compartments upon dox induction of the expression (Fig. 1A, S1A). To investigate the complex formation potential of each subunit in a compartment-dependent manner, cytoplasmic, nuclear and chromatin-bound cellular fractions from induced cells were obtained and analysed through GFP co-immunoprecipitation (co-IP) and quantitative mass spectrometry (qMS) 24,34,35 (Fig. 1B). Comparative analysis of identified GFP-baits within each extract indicated differential accumulation within the cytoplasm and the chromatin, and more uniform presence in the nucleus (Fig. S1B). (GO)- term enrichment analysis of the proteins identified across different fractions confirmed compartment-specific representation of protein molecular functions (Fig. 1C). Selective examination of the compartment-specific protein composition revealed differential distribution of several TFIID-, SAGA- and TBP-associated proteins (Fig. 1D). As such, TFIID is most consistently found throughout all fractions, with only the TAF7 homologue, TAF7L, absent from the cytoplasmic fraction. In contract, SAGA is more differentially distributed, with: 1) its spliceosome-related partners, SF3B3 and SF3B5 36,37, found in the cytoplasm and the soluble nuclear fraction; 2) most of the DUB module (except for ENY2) exclusively found in the soluble nuclear fraction; 3) and the enzymatic HAT subunits KAT2A (GCN5) and KAT2B (PCAF) specifically gained in the nucleus. Interestingly, upon data filtration and statistical intensity based absolute quantification (iBAQ) analysis, an exclusive cytoplasmic enrichment for SF3B3 and SF3B5 was observed among all SAGA-shared GFP-TAF baits (Fig. S1C). TBP-specific interactors (outside of TFIID) are similarly compartmentalized, with the SL1 complex members TAF1A and TAF1D and the pol II PIC member TFIIA (both GTF2A1 and GTF2A2 subunits) gained in the nucleus, and the TFIIIB subunit BDP1 as well as the NC2 regulator complex (DR1/DRAP1) exclusively present at the chromatin (Fig. 1D). Notably, the peptide coverage achieved with our qMS analysis was sufficient for confident protein-of-interest identification and was further evident of the fraction-dependent manner of complexes distribution (Fig. S2). Together, the data suggested dynamic gain of TFIID-, SAGA- and TBP-specific interaction partners in different cellular compartments.

125 Chapter 4

4

Figure 1. Intracellular distribution of exogenously expressed TFIID subunits. A. Immuno- fluorescent analysis of dox-induced cell lines expressing GFP-fused TAF(1-13) or TBP proteins. Scale bars, 5 μm. B. Schematic overview of the quantitative proteomics approach used to analyse the affinity-

126 Proteomic analysis of TFIID purified samples from each cell fraction. C. GO enrichment analysis of the fraction-specific proteins. Data are analysed through the online Panther GO overrepresentation test tool (test type FISHER; FDR- corrected P-values) and presented as log of fold enrichment. D. Venn diagram representation of overlapping and fraction-specific proteins identified through quantitative proteomics analysis in each cellular compartment. Unique localization of TAF- or TBP-specific complex members is indicated.

Cytoplasmic TFIID pre-assembly modules The qMS experiments revealed presence of TFIID in all three cellular compartments. To investigate the possible existence of pre-assembly modules, we systematically probed the cytoplasmic fraction for TFIID components co-purified with different TFIID subunits. Partial TFIID enrichments were observed for the majority of the baits, with core members primarily residing within core TFIID, SAGA-shared TAFs divided between core TFIID and core SAGA and non-core members found alone or in specific hetero-dimers or -trimers (Fig. 2). As such GFP- 4 TAF1 primarily enriched for TAF7 and TBP, GFP-TAF2 for TAF8, GFP-TAF4 for TAF12, GFP- TA5 for TAF6 and TAF9, GFP-TAF8 for TAF10, and GFP-TAF11 and GFP-TAF13 for each other. In addition GFP-TAF3 and GFP-TAF7 were found alone and GFP-TAF4B, -TAF6, - TAF9(9B), -TAF10 and -TAF12 co-enriched for core TFIID (and SAGA in the case of shared TAFs) members. Notably, among all expressed TFIID subunits, GFP-TBP exclusively co-enriched for the BRF1 protein, which is part of the Pol III TFIIIB PIC, while endogenous TBP was found in the GFP-TAF1 co-IP. To assess the relative abundance of the co-purified partners in relation to their baits, we next performed iBAQ-based analysis, allowing for correction of the registered peptide intensity over the size of the identified protein and the amount of its theoretical tryptic peptides. This revealed with confidence the formation of discrete bait-specific cytoplasmic TFIID modules (Fig. 3), including GFP-TAF1/TAF7/TBP, GFP-TAF11/TAF13, GFP-TAF5/TAF6/TAF9, GFP- TAF2/TAF8 and GFP-TAF4/TAF12 (Fig. 3, left). Interestingly, the co-purification of these modules was not consistently observed when probing with different members of the assemblies (Fig. 3, right), suggesting the possible existence of rate-limiting factors regulating modules formations.

TAF1/TAF7/TBP The observed module was only present in GFP-TAF1 co-IPs with an approximate ratio of 1:1 for TAF1:TAF7 and 1:2 for TAF1:TBP (Fig. 3A, left panel). Notably, neither GFP-TBP nor GFP-TAF7 enriched for the trimer (Fig. 3A, right and S3A), which may indicate that endogenous amounts of TAF1 are limiting for TAF1/TAF7/TBP formation.

TAF11/TAF13 This module consisted of two HF interaction partners and has been described in vitro previously 25. The relative abundance of GFP-TAF11 to TAF13 is approximately 2:1, indicating that over half of the bait resides alone, while the other half is bound by its HF partner TAF13 (Fig. 3B, left). Similarly to TAF1/TAF7/TBP, interrogation of this module through TAF13 did not enrich for the dimer. Indeed GFP-TAF13 was found to reside in the cytoplasm primarily alone, with less than 10% if it bound by endogenous TAF11 (Fig. 3B, right), which may indicate that endogenous TAF11 is limiting.

127 Chapter 4

4

Figure 2. Partial TFIID and SAGA assemblies in the cytoplasm. Cytoplasmic co-IPs indicate partial enrichment of TFIID-, SAGA or TBP-specific subunits compared to GFP-only control. Each data point in the volcano plots is plotted as the mean of technical triplicates. Dashed red lines denote the threshold between background and significant enrichment (two-tailed Student‘s t-test; FDR ≤1%; S0= 1). TFIID-unique subunits are coloured in green, SAGA-unique in purple, shared TFIID/SAGA subunits are shown in blue and TBP-specific interactors in red. Baits are highlighted in bold.

TAF5/TAF6/TAF9 We have previously characterized the in vivo existence of this module in the context of cytoplasmic GFP-TAF5 co-IPs (Fig. 3C, left) 24. Similar to our prior observations, GFP-TAF6 was inefficient in forming the module and instead associated only with its HF interaction partner, TAF9(B) (Fig. 3C, right). TAF9 and TAF9B are shared between TFIID and SAGA and are, therefore, separately discussed in the following section.

128 Proteomic analysis of TFIID

4

Figure 3. TAFs in cytoplasmic TFIID modules. (A-E). Relative abundance of identified TFIID subunits, normalized to bait. Modules are highlighted in dark grey (bait) and grey (identified partners). Data are derived from iBAQ values mean ± s.d. of technical triplicates.

129 Chapter 4

Notably, however, neither of the two enriched specifically for the TAF5/TAF6/TAF9 trimer (Fig. 4A, S3C), in agreement with the previously reported observation of rate-limiting amounts of endogenous TAF5 for formation of this trimeric complex 24.

TAF2/TAF8/TAF10 (TAF3) The interaction between TAF2 and TAF8 has been characterized previously in the context of a TAF2/TAF8/TAF10 cytoplasmic module, with TAF8 bridging the interactions 23. In our hands, cytoplasmic GFP-TAF2 enriched primarily for TAF8 (Fig. 3D, left), while GFP-TAF8 enriched exclusively for its HF partner TAF10 in an approximate ratio of 1:1 (Fig. 3D, right). Interestingly, GFP-TAF10, which due to its nature as a shared subunit with SAGA will be discussed in greater detail in the following section, also partially enriched for TAF8 (Fig. 4B). Notably, the other HF partner of TAF10, TAF3, resided alone in the cytoplasm, suggesting that 4 these proteins may be interacting in different cellular compartments (Fig. S3B).

TAF4(B)/TAF12 TAF4(B) and TAF12 are well-defined HF partners 38. In accordance, we observed preferential cytoplasmic enrichment for TAF12 in both GFP-TAF4 and GFP-TAF4B co-IPs (Fig. 3E). Notably, unique peptides of each homologue were detected with both baits (Fig. S4), indicating a heterodimeric co-existence of the two and suggesting a possible hetero-tetramer containing two copies of TAF4(B)/TAF12 dimer, which would be reminiscent of the double copy core TFIID structure described by Bieniossek et al. 22. In further agreement with this structure, GFP-TAF4B also co-enriched for the rest of core TFIID members, namely TAF5/TAF6/TAF9, displaying surprisingly a higher efficiency of core TFIID assembly as compared to its homolog – and canonical TFIID subunit – TAF4 (Fig. 3E, right). Curiously, TAF8 presence in the absence of its HF partner TAF10 was also evident and partially in agreement with the reported TAF8/TAF10 symmetry-breaking incorporation into core TFIID 22.

Shared TAFs preferentially assemble cytoplasmic core SAGA Metazoan TFIID shares several TAFs with SAGA complex. Interestingly, examining of cytoplasmic co-enrichments from these TAFs did not reveal enrichment of the above-described assemblies, but rather a preferential incorporation into partial SAGA formations (Fig. 4). Indeed, while GFP-TAF9 co-purified to a significant extent with its TFIID HF partner TAF6, the presence of TAF6L (similarly a HF partner), TAF5L, TADA1 and SUPT7L, alongside the other shared TAFs, TAF10 and TAF12, was indicative of assembly of the central module of SAGA (Fig. 4A). Notably, albeit relatively low, the existence of a core TFIID, incorporating TAF8/TAF10, was hinted by low amounts of TAF5/TAF6/TAF9 and TAF8. Interestingly, similarly to TAF4 and TAF4B, the homologue of TAF9, GFP-TAF9B, was more efficient in forming TFIID assemblies, as evident from its enrichment of members of both the core of TFIID and central module of SAGA, as well as TAF8/TAF10 and SUPT7L/TAF10, respectively (Fig. S3C).

130 Proteomic analysis of TFIID

Notably, MS detection of unique tryptic TAF9 and TAF9B peptides in GFP-TAF9B cytoplasmic co- IPs further indicated that the two proteins co-exist within the detected core TFIID, and based on Bieniossek et al. 2013 likely represent an overall two-copy stoichiometry for all its subunits (Fig. S4). Similarly to TAF9, 4 GFP-TAF10 enriched the central module of SAGA and only its TFIID HF partner TAF8 (Fig. 4B). Notably, the second TFIID HF partner of TAF10, TAF3, was not identified, confirming the earlier GFP-TAF3 observation that these two proteins are likely to interact in other cellular compartments (Fig. S3B). Finally, similar to Figure 4. Shared TAFs preferentially assemble SAGA core in TAF9B, GFP-TAF12 the cytoplasm. (A-C). Relative abundance of identified TFIID and enriched to an extent for both SAGA subunits, normalized to bait. Data are derived from iBAQ core TFIID and core SAGA, values mean ± s.d. of technical triplicates. with SAGA relatively more abundant than TFIID (Fig. 4C). TAF8/TAF10 and SUPT7L/TAF10 modules were similarly present, and minor levels of SUPT3H indicated an efficient formation of the central module of SAGA.

Nuclear holo-TFIID formation Probing TFIID complex through its different subunits from cytoplasmic extracts of our cell lines revealed the presence of discrete submodules. To investigate whether the GFP-tagged baits can assemble a holo-TFIID, we next interrogated the nuclear compartment of the cells 24. Indeed, in contrast to the cytoplasmic compartment, we found the entire TFIID complex enriched in the nucleus, confirming that our GFP-fused baits are capable of forming holo-TFIID in the appropriate cellular environment (Fig. 5). In addition to TFIID, the complete SAGA complex as well as non-TFIID TBP interactors were similarly found, indicating that upon nuclear import the shared TFIID subunits continue to follow discriminatory pathways for differential complexes assemblies. While shared TAFs become complex-dedicated already in the cytoplasm, TBP re-

131 Chapter 4

distribution is likely mechanistically linked to its negative regulator BTAF1, which was identified alongside the rest of the nuclear interaction partners of TBP.

4

Figure 5. Nuclear TFIID and SAGA complexes. Nuclear co-IPs enrich for complete TFIID and SAGA, as well as TBP-specific interactors compared to GFP-only control. Each data point in the volcano plots is plotted as the mean of technical triplicates. Dashed red lines denote the threshold between background and significant enrichment (two-tailed Student‘s t-test; FDR ≤1%; S0= 1). TFIID- unique subunits are coloured in green, SAGA-unique in purple, shared TFIID/SAGA subunits are shown in blue and TBP-specific interactors in red. Baits are highlighted in bold.

Relative TFIID stoichiometry Next we compared the relative stoichiometries of the co-purified nuclear TFIID complexes using the iBAQ values obtained in our qMS analysis and normalizing all data to a selected subunit (TAF1), known the be present in a single copy within the complex. Remarkably,

132 Proteomic analysis of TFIID despite observing a number of bait-specific variations, a common relative (to TAF1) abundance pattern of the different TFIID subunits was detected irrespectively to whether the bait was a core, non-core or shared subunit (Fig. 6). As such the single copy subunits TAF2, TAF3, TAF7, TAF11, TAF13 and TBP were generally present in levels similar or below the single protein reference point set by TAF1, confirming their singular stoichiometry. Notably, among these TAF2, TAF11 and TAF13 had significantly lower presence – an observation that could potentially be attributed either to dynamic/unstable association with the rest of the complex or inefficient peptide detection. While TAF11 and TAF13 are among the smallest TFIID subunits and, therefore, provide only a limited number of tryptic peptides, TAF2 is approximately 150 kDa in size and the extensive amount of detected for this protein nuclear peptides is suggestive of a substoichiometric presence rather than underdetection (Fig. S2).

TFIID subunits with higher relative abundance than TAF1 included all previously 4 described double copy TAFs (TAF4, TAF5, TAF6, TAF9, TAF10 and TAF12) as well as, the otherwise expected in a singular presence, TAF8. We attribute this discrepancy to the significantly elevated relative levels of TAF10, observed consistently throughout all nuclear co-IPs and also seen previously in similar experimental setups 24, which likely drive the formation of TAF10/TAF8 modules formed separately of TFIID. In addition to TAF10, an increased abundance of TAF4 was similarly detected across our nuclear data as well as in our previous work 24. Curiously, unlike with TAF10/TAF8, the increase of TAF4 was not matched by its HF cytoplasmic partner TAF12.

Presence of assembly modules Similar module-specific co-enrichment was observed for a number of baits. Indeed a number of partnered TAFs displayed a bait-specific increase in relative abundance. As such, GFP- TAF1 co-enriched for TAF7 and TBP to a higher extent than the average enrichment observed among all nuclear co-IPs (Fig. 6, pink bars), suggesting partial residence of the bait within a nuclear TAF1/TAF7/TBP module. GFP-TAF2 co-enriched for TAF8 and TAF10, indicating a nuclear presence of the TAF2/TAF8/TAF10 module described by Trowitzsch et al. 23. In agreement with this observation, GFP-TAF8 co-IPs also displayed elevated relative abundance of these proteins, while GFP-TAF10 contrasted by enriching only for TAF8. Interestingly, TAF3 levels were not elevated in GFP-TAF10 co-IPs, suggesting a potential binding preference of TAF10 for TAF8 over TAF3. Notably, however, while unclear whether TAF10 is co-enriched by GFP-TAF3 nuclear baits, a striking and unique for GFP-TAF3 co-IPs decrease of TAF8 was observed, indicating competitive relation between these proteins for their mutual partner TAF10. In addition to TAF1/TAF7/TBP and TAF2/TAF8/TAF10 modules, preferential increase of HF interaction partners was detected in GFP-TAF4, GFP-TAF6 and GFP-TAF9 co- IPs. In contrast, GFP-TAF12, GFP-TAF4B and GFP-TAF9B did not enrich for their corresponding partner, once again highlighting a higher efficiency of these proteins in TFIID incorporation. An exception was observed for GFP-TAF9B, which displayed elevated relative levels of TAF12 – an observation present also in the cytoplasmic extract of the cells (Fig. S3C).

133 Chapter 4

4

Figure 6. Consistency in the relative abundance of nuclear DNA-free TFIID subunits. Relative abundance of identified TFIID subunits, normalized to TAF1. Modules are highlighted in dark grey (bait) and grey (partners); pink box-frames indicate average enrichment of each subunit. Data are derived from iBAQ values mean ± s.d. of technical triplicates.

134 Proteomic analysis of TFIID

Besides TAF4B, TAF9B and TAF12, similar stoichiometric compliance (disregarding the modular assemblies) was present among the GFP-TAF1, -TAF4, -TAF5, -TAF7, -TAF8, -TAF10 and -TBP co-IPs. In contrast, comparative deviations in the relative abundances of TFIID subunits were seen in the GFP-TAF2, -TAF3, -TAF6, -TAF9, -TAF11 and -TAF13 co-IPs, where a consistent loss of TAF2, TAF7 (as opposed to TAF7L) and/or TAF13 was observed.

Non-TFIID complexes Examining the relative abundance of additional proteins co-enriched by our shared baits revealed separate residence of GFP-TAF9(B), -TAF10 and -TAF12 within a now fully assembled nuclear SAGA complex and the enrichment of RNAPI, RNAPII and RNAPIII PIC members, as well as BTAF1 for GFP-TBP (Fig. S5). Notably, while SAGA was enriched to a similar extent as TFIID (Fig. S5A-D), TBP-specific partners were present is substoichiometric levels suggesting either dynamic disengagement or preferential association of nuclear DNA-free GFP-TBP with 4 TFIID complex (Fig. S5E). Similarly, while a full SAGA complex was identified, the core and HAT modules were significantly elevated as compared to DUB and TRRAP, suggesting either dynamic or unstable association of the latter two with the core (Fig. 5A-D).

Loss of TBP destabilizes chromatin-bound TFIID Our cytoplasmic and nuclear extracts revealed the presence of discrete pre-assembly modules as well as stable holo-TFIID formation, respectively. To investigate the complex composition upon chromatin association, chromatin extracts obtained from the insoluble nuclear fraction were generated and analysed. Surprisingly, an overall loss of holo-TFIID formation was evident by the reduced identification of significantly enriched TFIID subunits (Fig. 7). Similarly, only partial SAGA assemblies were found with some, but not all, shared TAFs baits. Interestingly, while TAFs lose their interaction partners, GFP-TBP co-IPs revealed a gain in interactions, such as the NC2 complex (DR1/DRAP1) and the second TFIIIB PIC member, BDP1. Assessment of the relative abundance of the co-enriched (per bait) proteins revealed primarily core TAFs and TAF8/TAF10 preservation in chromatin-bound TFIID, indicating a gradual disassembly of the complex once associated with the DNA (Fig. 8, pink boxes). The presence of unique TAF4/TAF4B peptides as well as TAF9/TAF9B unique peptides in GFP- TAF4B and GFP-TAF9B co-IPs, respectively, was indicative that at least in some cases core TFIID maintains its double-copy composition (Fig. S6). In addition, co-enrichment of modules, such as GFP-TAF1/TBP, GFP-TAF2/TAF8/TAF10, GFP-TAF8/TAF10 and, curiously, GFP- TAF3/TAF10 was detected. Several of the baits co-enriched to an extent for holo-TFIID (e.g. GFP-TAF4B, -TAF5, - TAF9B, -TAF10 and -TAF12), however, core TFIID still remained most abundantly present (Fig. 8). Other baits like GFP-TAF11 and GFP-TBP also displayed enrichment for additional to the core structure TFIID subunits, albeit to a lower extent. Notably, the high relative abundance of TAF4 and TAF10, observed in the DNA-free nuclear fraction was absent in the chromatin-bound TFIID. Interestingly, while GFP-TBP co-enriches partly for holo-TFIID, it primarily resides within other interactors – namely the Pol II PIC member TFIIA (GTF2A1/2), the NC2 complex (DR1/DRAP1) and to an extent BTAF1, indicating that TBP leaves TFIID once chromatin-bound and is dynamically engaged by other functional complexes. Indeed, most of the chromatin GFP-

135 Chapter 4

TAF co-IPs confirmed loss of TBP, while the exceptions consistently included TAF1 presence as well. Notably, loss of Pol I and Pol III PIC members as interactors of GFP-TBP may be indicative of a transient residence of the protein with these complexes once chromatin-bound.

4

Figure 7. Partial chromatin-bound TFIID-, SAGA- and TBP-specific assemblies. Chromatin co- IPs reveal partial presence of TFIID-, SAGA or TBP-specific subunits compared to GFP-only control. Each data point in the volcano plots is plotted as the mean of technical triplicates. Dashed red lines denote the threshold between background and significant enrichment (two-tailed Student‘s t-test; FDR ≤1%; S0= 1). TFIID-unique subunits are coloured in green, SAGA-unique in purple, shared TFIID/SAGA subunits are shown in blue and TBP-specific interactors in red. Baits are highlighted in bold.

136 Proteomic analysis of TFIID

Finally, the SAGA complex is primarily lost from the chromatin-bound shared GFP- TAFs. An exception was observed with GFP-TAF9B, which enriched for partial SAGA assemblies, including low levels of core SAGA and several members of the HAT and DUB modules, as well as relatively high levels of CCDC101 (also SGF29). Interestingly, minor presence of CCDC101/SGF29 was also detected in GFP-TAF10 and GFP-TAF12 co-IPs.

4

Figure 8. Disassociation of TBP from chromatin-bound TFIID and complex disassembly. Relative abundance of identified TFIID-, SAGA- and TBP-specific subunits, normalized to bait. Modules are highlighted in dark grey (bait) and grey (partners); pink box-frames indicate core TFIID subunits. Data are derived from iBAQ values mean ± s.d. of technical triplicates.

137 Chapter 4

Discussion Recent breakthroughs in the structural biology of TFIID have revealed major complex rearrangements upon association with DNA 15. Additional work has provided insight into the likely holo-TFIID formation from cytoplasmic pre-assemblies 18,23,24. This, together with the presence of a number of shared subunits within TFIID, is suggestive of a dynamic assembly process and compartmental organization of the complex in cells. We have previously utilized an inducible cellular system for the qMS interrogation of TFIID subunits to reveal an intertwined relation between TAF5 biosynthesis and its cytoplasmic interactions with the TAF6/TAF9 HF pair 24. Here we extended this approach to all TFIID subunits to investigate their direct molecular interactions in three functionally- and structurally-relevant for the complex cellular compartments – the cytoplasm, the nucleus and on chromatin. Our work confirmed the existence of cytoplasmic pre- assembly modules and their subsequent translocation to the nucleus, whereupon a holo-TFIID 4 formation takes place. Notably, we observed that, once chromatin-bound, holo-TFIID loses TBP and is destabilized back into modular assemblies, which is consistent with proposed structural rearrangements of TFIID upon core promoter binding

TAF1/TAF7/TBP This module was stably enriched by cytoplasmic GFP-TAF1 and its cellular existence is in line with the available atomic structures of purified TAF1:TBP and TAF1:TAF7 subcomplexes 17,28. Notably we find the heterotrimer also in the nuclear fraction, indicating ability of nuclear translocation and providing mechanistic insights into the delivery of non-core TFIID structures during holo-TFIID formation. Interestingly, neither GFP-TAF7 nor GFP-TBP co-enriched for the module or for TAF1 alone, suggesting that the available TAF1 levels are rate-limiting for the assembly of this structure. Notably, all three proteins were shown to form a stable holo-TFIID in the nucleus, suggesting possibility for nuclear import and integration into this subcomplex. It remains unclear whether GFP-TBP and GFP-TAF7 can be recruited into forming TFIID as single molecules or whether they only transiently reside in TAF1/TAF7/TBP modules prior rapid nuclear import and incorporation into holo-TFIID. Notably, previous work has shown that depletion of TAF1 led to TBP-less nuclear TFIID assemblies 39, indicating TAF1-dependency of TBP for integration into the complex. Interestingly, at chromatin GFP-TAF1 does not co-enrich for the TAF1/TAF7/TBP module but rather for TAF1/TBP, indicating that the trimer is likely an assembly step rather than a functional unit. Notably, interrogation of chromatin-bound GFP-TBP revealed preferential existence outside of the TFIID complex, which is in line with the proposed role of the complex in accurate deposition and release of TBP at promoter sites 15. The abundant presence of NC2 and TFIIA in chromatin GFP-TBP co-IPs illustrates a dynamic competition between these complexes for TBP interaction, allowing therefore for a stochastic and concentration-dependent regulation of transcription initiation by active NC2-driven removal of the protein upon lack of molecular shielding provided otherwise by TFIIA.

TAF2/TAF8/TAF10 (TAF3) The existence of cytoplasmic TAF2/TAF8/TAF10 module has been described previously 23. Interestingly, we detected this module in the nucleus, while cytoplasmic GFP-TAF2 and GFP- TAF8, enriched for TAF2/TAFA8 and TAF8/TAF10, respectively. The lack of these heterodimers in the nucleus indicates that only the trimeric form is imported, which is in agreement

138 Proteomic analysis of TFIID with the reported TAF8-dependency of TAF10 and TAF2 nuclear import 23. Similarly to the TAF1/TAF7/TBP module, the inability to detect of a cytoplasmic TAF2/TAF8/TAF10 assembly could be related to the low (endogenous) levels of the module formation and its dynamic transportation to the nucleus, which are masked by the elevated expression of the baits and the partial module assemblies trapped in the cytoplasm until completed trimeric formation. Cytoplasmic GFP-TAF10 also co-enriched for its SAGA HF partner, SUPT7L, and core SAGA, indicating a preferential SAGA incorporation of the TAF10 protein in the cytoplasm. In contrast, nuclear GFP-TAF10 formed holo-TFIID and SAGA in comparable relative levels, suggesting that the cytoplasmic discrimination may be a regulatory step dictating the difference in complex-commitment of the protein. Such eventuality would suggest a competitive mechanism between TAF10 nuclear import (and TFIID-commitment) and TAF10 cytoplasmic retention in core SAGA structures (and SAGA-commitment). In line with this, co-expression of TAF8 and TAF3 was reported to drive nuclear import of exogenous TAF10, while overexpression of the 4 SAGA-specific SUPT7L maintained its cytoplasmic retention 40. Interestingly, TAF3 was not found in cytoplasmic GFP-TAF10 co-IPs and vice versa. Indeed, the exogenous protein resided either alone in the cytoplasm and the nucleus, or incorporated into nuclear holo-TFIID. The lack of TAF3/TAF10 dimers in our system may be suggestive of either singular incorporation into TFIID or a transient existence, which is swiftly shuttled into the nucleus and incorporated into the complex. Curiously, TAF3/TAF10 module was present in our chromatin extract, suggesting that it may be involved in complex disassembly rather than assembly. Existence of TAF2/TAF8/TAF10 and TAF8/TAF10 chromatin modules further indicates a modular TFIID disassembly.

TAF4/TAF12 (core-TFIID) The atomic structure of TAF4/TAF12 HF partners has been described before 38. Here, we observed enrichment of TAF12 by GFP-TAF4 in the cytoplasm and nucleus, indicative of a cytoplasmic association and transportation into the nucleus. Interestingly, cytoplasmic GFP- TAF4B enriched additionally for the complete core TFIID, described previously as nuclear 22,39. Unlike TAF4, TAF4B does not include an NLS, leaving the possibility open for inefficient nuclear import of the exogenously expressed GFP-TAF4B due to a rate-limiting endogenous step, allowing therefore the cytoplasmic progression to core TFIID assembly. Strikingly, cytoplasmic GFP- TAF12 also co-enriched for core TFIID as well as TAF8/TAF10, placing the proposed symmetry breakage of core TFIID in the cytoplasm rather than the nucleus 22, and suggesting a possible rate- limiting for TAF12 in core TFIID formation. Indeed accumulation of cytoplasmic ―assymetrical‖ core TFIID was only observed with GFP-TAF12 and, curiously, with GFP-TAF9B, in which significantly elevated TAF12 levels were detected. Besides core TFIID, cytoplasmic GFP-TAF12 also assembled core SAGA with a slight preference over TFIID, as seen with GFP-TAF10, again indicating a possible role of compartmentalization in the discriminatory assembly pathways of the two complexes.

TAF5/TAF6/TAF9 The presence of TAF5/TAF6/TAF9 module in cytoplasmic GFP-TAF5 co-IPs, but not GFP-TAF6 and GFP-TAF9, is in line with our previous work, describing a rate-limiting role of TAF5 for the module formation 24. Notably, nuclear GFP-TAF5 is fully incorporated in holo-

139 Chapter 4

TFIID, as evident by the lack of free bait TAF5 in our analysis, indicating that the nuclear import of TAF5 is likely regulated by a rate-limiting mechanism. At chromatin, the protein enriched primarily for core TFIID as well as to an extent for holo-TFIID, indicating that its presence is associated with complex stability. Nuclear GFP-TAF6 and GFP-TAF9, on the other hand, assembled less stable holo-TFIID and at chromatin resided exclusively in core TFIID structures. Notably, cytoplasmic core TFIID was not enriched by any of these proteins indicating that TAF4/TAF12 heterodimers are rate-limiting for its formation.

TAF11/TAF13 Cytoplasmic TAF11/TAF13 module has been previously described 18. Here, we show preferential enrichment of cytoplasmic TAF13 by GFP-TAF11, while GFP-TAF13 remained primarily alone, suggesting a rate-limiting capacity of TAF11 for the module formation. Notably, 4 nuclear interrogation of GFP-TAF13 revealed presence of TAF11/TAF13 modules in addition to holo-TFIID, indicating that nuclear import of the protein occurs in the context of a dimer. At chromatin, the proteins are found primarily in holo-TFIID with dynamically changing presence of TBP (missing in GFP-TAF13, but not in GFP-TAF11) and TAF13 (missing in GFP-TAF11). As we observed that TBP is only transiently present in chromatin-bound TFIID, such dynamics could be reflective of structural reorganizations between these proteins. Indeed, direct interaction between TAF11/TAF13/TBP has been characterized 18 and recent structural work described co- localization of all three proteins in the dynamic lobe A of chromatin-bound TFIID 15. Conclusions Altogether, our data suggests modular TFIID assembly from pre-existing cytoplasm sub- complexes. We propose cytoplasmic TAF4(B)/TAF12 dimerization as the first step of TFIID assembly, followed by incorporation of a second TAF4(B)/TAF12 copy and subsequent recruitment of a pre-formed TAF5/TAF6/TAF9 module (Fig. 9A). In the likely following step, TAF8/TAF10 dimer is incorporated to form ―asymmetrical‖ core TFIID 22. Whether this is a cytoplasmic or nuclear event, or if it involved TAF2/TAF8/TAF10 trimer remains to be elucidated. The preferential incorporation of shared TAFs into cytoplasmic core SAGA is indicative of a likely compartment-specific discrimination between TFIID and SAGA assemblies. As such it is plausible that TFIID modules are actively re-directed into the nucleus for holo-TFIID formation, while core SAGA is held back for further assembly progression (and structural diversion) prior nuclear import. Indeed, we detected a number of cytoplasmic TFIID modules present in the nucleus, likely representing nuclear pre-assembly steps prior holo-TFIID formation. While their accumulation could be a result of our expression system, we expect at least a transient presence of those assemblies in the cytoplasm prior to nuclear import. Characterization of holo- TFIID revealed stable and strikingly consistent in its relative abundance composition. Once chromatin-bound, however, the complex loses its TBP subunit to additional interactors (i.e. NC2, BTAF1 and TFIIA) and begins to disintegrate, likely due to structural rearrangements involved in the TBP release 15 (Fig. 9B). The dynamic loss of single-copy subunits as well as the general preservation of core TFIID likely indicates a disassembly pathway following the assembly steps in reverse. Together, our work has shed needed light on the cellular dynamics of TFIID subunits and their cell compartment-specific complex-discriminatory pathways, and lays out the groundwork for future research aiming to expand the molecular mechanisms behind these events.

140 Proteomic analysis of TFIID

4

Figure 9. Model of cellular TFIID dynamics. A. Different cytoplasmic modules assemble prior nuclear holo-TFIID formation. Color-coded boxes indicate different building blocks of TFIID –

141 Chapter 4

TAF2/TAF8/TAF10 in yellow, TAF3 in blue, TAF11/TAF13 in green, TAF1/TAF7/TBP in grey and core TFIID in pink. Preferential integration of SAGA-shared subunits (TAF9(B), TAF10, TAF12) into core SAGA is indicated by bold arrows. Less-efficient core TFIID formation is illustrated by thin arrows. Curved red arrows indicate limiting TAFs for the formation of the modules (arrow base points the limiting TAF and arrow head illustrates resulting module). Dashed arrows indicate the speculated steps of assembly for the identified modules and question mark indicates potential stage of a double- copy core TFIID formation. B. Stably formed holo-TFIID is detected in the DNA-free nuclear fraction. Upon chromatin association, aided by chromatin modifications and promoter sequence (based on literature), TFIID loses TBP to the transcription-promoting TFIIA and to its negative regulators BTAF1 and NC2 (DR1/DRAP1) and is destabilized to primarily core TFIID composition. Assembly modules are present on both DNA-free and DNA-bound fractions, suggesting that prior complex formation and upon its disassociation TAFs remain preferentially paired. 4 Methods GFP–TAF cell line generation. cDNA of human proteins was amplified by PCR with gene-specific primers fused to attB recombination sequences for GATEWAY cloning. The amplified sequence was recombined into the pDON201 donor vector according to the manufacturer‘s protocol (ThermoFisher). The cloned sequence was verified with Sanger sequencing and was transferred to the pcDNA5-FRT-TO-N-GFP Gateway destination vector by LR recombination according to the manufacturer‘s protocol (ThermoFisher). All destination vectors were co-transfected with a pOG44 plasmid that encodes the Flp recombinase into HeLa Flp-In/T-REx cells using polyethyleneimine transfection to generate stable Dox-inducible expression cell lines.

Cell culture methods. HeLa Flp-In/T-REx cells, which contained the Flp recombination target site and expressed the Tet repressor, were grown in Dulbecco‘s modified Eagl e‘s medium (DMEM), 4.5 g l−1 glucose, supplemented with 10% v/v fetal bovine serum, 10 mM l-glutamine and 100 U ml−1 penicillin–streptomycin (all purchased from Lonza), together with 5 μg ml−1 blasticidin S (InvivoGen) and 200 μg ml−1 zeocin (Invitrogen), which were used to select for the FRT and Tet repressor, respectively. All cell lines were mycoplasma negative. Recombined cells were selected by replacing zeocin with 250 μg ml−1 hygromycin B (Roche Diagnostics) 48 h after polyethyleneimine transfection. Expression of GFP-tagged protein was induced by addition of 1 μg ml−1 Dox, for 16– 18 h.

Confocal microscopy. Paraformaldehyde-fixed cells were subjected to short cell membrane permeabilization with 0.5% Triton X-100 in PBS for 5 min. After quenching the cross-linking with 50 mM glycine, the samples were blocked for 30 min at RT with 5% natural goat serum (NGS) and subsequently incubated with primary antibody (GFP, JL-8, Clontech), for 2 hours at RT. After three sequential washing steps with PBS, secondary antibodies conjugated to a fluorophore were added for 45 min at RT. For nuclear staining, the samples were incubated with DAPI dye (2 mg/L, Sigma-Aldrich, St. Louis, MO) for 5 min prior to mounting the coverslips on microscopy slides (ThermoFischer Scientific, Waltham, MA) with Immu-Mount (Invitrogen, Carlsbad, CA). Images were obtained using SP8-X confocal microscope (Leica microsystems, Germany) using a HC PL APO 63x/1.40 Oil CS2 objective. Gain and offset settings were adjusted according to the fluorescence signal, but they were kept constant in comparative experimental designs such as

142 Proteomic analysis of TFIID doxycycline-induction tests. Exported files were next subjected to linear contrast and brightness processing in Photoshop (CS6, 13.0.6 x64, extended) for image representation purposes.

Extract preparation and GFP co-IP. Cells were seeded in 15-cm dishes (Greiner Cellstar, Sigma- Aldrich) and grown to 70–80% confluence prior to Dox induction. GFP–protein expression was verified using EVOS fluorescence microscopy (Thermo Fischer). Next, induced cells were collected and nuclear and cytoplasmic extracts were obtained as described before 24. Chromatin extracts were obtained by resuspension through an IKA Ultra-Turrax Homogenizer of the insoluble nuclear pellets in a NIB buffer (250 mM Sucrose, 10 mM MgCl2, 20 mM Hepes-KOH pH 7.8, 0.1% Triton X-100, 5 mM βMe) in 1:10 weight-to-volume ratio, in the presence of 3 mM CaCl2. The pellets were further digested at RT for 10 min with 4 units of MNase (MO247S, NEB) per 100 mg of pellet. The reaction was terminated with 10 mM EGTA and 150 mM NaCl. Protein concentrations were determined by Bradford assay (BioRad). Then, 1 mg of nuclear, 3 mg of cytoplasmic or 0.5 4 mg of chromatin extract was used for GFP-affinity purification as previously described 24. In brief, protein lysates were incubated in binding buffer (20 mM HEPES-KOH pH 7.9, 300 mM NaCl, 20% glycerol, 2 mM MgCl2, 0.2 mM EDTA, 0.1% NP-40, 0.5 mM DTT and 1× Roche protease inhibitor cocktail) on a rotating wheel for 1 h at 4 °C in triplicates with GBP-coated agarose beads (Chromotek). The beads were washed twice with binding buffer containing 0.5% NP-40, twice with PBS containing 0.5% NP-40 and twice with PBS. On-bead digestion of bound proteins was performed overnight in elution buffer (100 mM Tris-HCl pH 7.5, 2 M urea and 10 mM DTT) with 0.1 µg ml−1 of trypsin at room temperature and eluted tryptic peptides were bound to C18 stagetips prior to mass spectrometry analysis.

Mass spectrometry and data analysis. Tryptic peptides were eluted from the C18 stagetips in H2O:acetonitril (50:50) with 0.1% formic acid and dried prior to resuspension in 10% formic acid. A third of this elution was injected into a Q-Exactive (Thermo Fischer) in tandem mass spectrometry mode with 90 min total analysis time. Blank samples consisting of 10% formic acid were run for 45 min between all different triplicates samples, to avoid carry-over between runs. The raw data files were analyzed with MaxQuant software (version 1.5.3.30) using the Uniprot human FASTA database 24. Label-free quantification values and match between run options were selected. The iBAQ algorithm was also activated for subsequent relative protein abundance estimations. The obtained protein files were analyzed by Perseus software (MQ package, version 1.5.4.0), in which contaminants and reverse hits were filtered out. Protein identification based on non-unique peptides as well as proteins identified by only one peptide in the different triplicates were excluded to increase protein prediction accuracy. For identification of the bait interactors, label-free quantification intensity-based values were transformed on the logarithmic scale (log2) to generate Gaussian distribution of the data. This enables the imputation of missing values based on the normal distribution of the overall data (in Perseus, width = 0.3; shift = 1.8). The normalized label-free quantification intensities were compared between grouped GFP-bait triplicates and GFP-only triplicates, using a 1% permutation- based false discovery rate (FDR) in a two-tailed Student‘s t-test. The threshold for significance (S0), based on the FDR and the ratio between GFP-bait and GFP-only samples, was kept at the constant value of 1 for comparison purposes. Relative abundance plots were obtained by comparison of the iBAQ values of GFP-bait interactors. The values of the GFP-only iBAQ values

143 Chapter 4

were subtracted from the corresponding proteins in the GFP pull-down and were next normalized to a chosen co-purifying protein for scaling and data-presentation purposes 24.

Literature 1 Shandilya, J. & Roberts, S. G. The transcription cycle in eukaryotes: from productive initiation to RNA polymerase II recycling. Biochim Biophys Acta 1819, 391-400, doi:10.1016/j.bbagrm.2012.01.010 (2012). 2 Schier, A. C. & Taatjes, D. J. Structure and mechanism of the RNA polymerase II transcription machinery. Genes Dev 34, 465-488, doi:10.1101/gad.335679.119 (2020). 3 Louder, R. K. et al. Structure of promoter-bound TFIID and model of human pre-initiation complex assembly. Nature 531, 604-609, doi:10.1038/nature17394 (2016). 4 Nogales, E., Louder, R. K. & He, Y. Structural Insights into the Eukaryotic Transcription 4 Initiation Machinery. Annu Rev Biophys 46, 59-83, doi:10.1146/annurev-biophys-070816- 033751 (2017). 5 Bhuiyan, T. & Timmers, H. T. M. Promoter Recognition: Putting TFIID on the Spot. Trends Cell Biol 29, 752-763, doi:10.1016/j.tcb.2019.06.004 (2019). 6 Kim, J. L. & Burley, S. K. 1.9 A resolution refined structure of TBP recognizing the minor groove of TATAAAAG. Nat Struct Biol 1, 638-653, doi:10.1038/nsb0994-638 (1994). 7 Kim, J. L., Nikolov, D. B. & Burley, S. K. Co-crystal structure of TBP recognizing the minor groove of a TATA element. Nature 365, 520-527, doi:10.1038/365520a0 (1993). 8 Tora, L. & Timmers, H. T. The TATA box regulates TATA-binding protein (TBP) dynamics in vivo. Trends Biochem Sci 35, 309-314, doi:10.1016/j.tibs.2010.01.007 (2010). 9 Tan, S., Hunziker, Y., Sargent, D. F. & Richmond, T. J. Crystal structure of a yeast TFIIA/TBP/DNA complex. Nature 381, 127-151, doi:10.1038/381127a0 (1996). 10 Nikolov, D. B. et al. Crystal structure of a TFIIB-TBP-TATA-element ternary complex. Nature 377, 119-128, doi:10.1038/377119a0 (1995). 11 Auble, D. T. et al. Mot1, a global repressor of RNA polymerase II transcription, inhibits TBP binding to DNA by an ATP-dependent mechanism. Genes Dev 8, 1920-1934, doi:10.1101/gad.8.16.1920 (1994). 12 Butryn, A. et al. Structural basis for recognition and remodeling of the TBP:DNA:NC2 complex by Mot1. Elife 4, doi:10.7554/eLife.07432 (2015). 13 Chitikila, C., Huisinga, K. L., Irvin, J. D., Basehoar, A. D. & Pugh, B. F. Interplay of TBP inhibitors in global transcriptional control. Mol Cell 10, 871-882, doi:10.1016/s1097- 2765(02)00683-4 (2002). 14 Wollmann, P. et al. Structure and mechanism of the Swi2/Snf2 remodeller Mot1 in complex with its substrate TBP. Nature 475, 403-407, doi:10.1038/nature10215 (2011). 15 Patel, A. B. et al. Structure of human TFIID and mechanism of TBP loading onto promoter DNA. Science 362, doi:10.1126/science.aau8872 (2018). 16 Hernandez, N. TBP, a universal eukaryotic transcription factor? Genes Dev 7, 1291-1308, doi:10.1101/gad.7.7b.1291 (1993). 17 Anandapadamanaban, M. et al. High-resolution structure of TBP with TAF1 reveals anchoring patterns in transcriptional regulation. Nat Struct Mol Biol 20, 1008-1014, doi:10.1038/nsmb.2611 (2013). 18 Gupta, K. et al. Architecture of TAF11/TAF13/TBP complex suggests novel regulation properties of general transcription factor TFIID. Elife 6, doi:10.7554/eLife.30395 (2017).

144 Proteomic analysis of TFIID

19 Juo, Z. S., Kassavetis, G. A., Wang, J., Geiduschek, E. P. & Sigler, P. B. Crystal structure of a transcription factor IIIB core interface ternary complex. Nature 422, 534-539, doi:10.1038/nature01534 (2003). 20 Knutson, B. A., Luo, J., Ranish, J. & Hahn, S. Architecture of the Saccharomyces cerevisiae RNA polymerase I Core Factor complex. Nat Struct Mol Biol 21, 810-816, doi:10.1038/nsmb.2873 (2014). 21 Sanders, S. L., Garbett, K. A. & Weil, P. A. Molecular characterization of Saccharomyces cerevisiae TFIID. Mol Cell Biol 22, 6000-6013, doi:10.1128/mcb.22.16.6000-6013.2002 (2002). 22 Bieniossek, C. et al. The architecture of human general transcription factor TFIID core complex. Nature 493, 699-702, doi:10.1038/nature11791 (2013). 23 Trowitzsch, S. et al. Cytoplasmic TAF2-TAF8-TAF10 complex provides evidence for nuclear holo-TFIID assembly from preformed submodules. Nat Commun 6, 6011, doi:10.1038/ncomms7011 (2015). 4 24 Antonova, S. V. et al. Chaperonin CCT checkpoint function in basal transcription factor TFIID assembly. Nat Struct Mol Biol 25, 1119-1127, doi:10.1038/s41594-018-0156-z (2018). 25 Birck, C. et al. Human TAF(II)28 and TAF(II)18 interact through a histone fold encoded by atypical evolutionary conserved motifs also found in the SPT3 family. Cell 94, 239-249, doi:10.1016/s0092-8674(00)81423-3 (1998). 26 Gangloff, Y. G. et al. Histone folds mediate selective heterodimerization of yeast TAF(II)25 with TFIID components yTAF(II)47 and yTAF(II)65 and with SAGA component ySPT7. Mol Cell Biol 21, 1841-1853, doi:10.1128/MCB.21.5.1841-1853.2001 (2001). 27 Gangloff, Y. G. et al. The human TFIID components TAF(II)135 and TAF(II)20 and the yeast SAGA components ADA1 and TAF(II)68 heterodimerize to form histone-like pairs. Mol Cell Biol 20, 340-351, doi:10.1128/mcb.20.1.340-351.2000 (2000). 28 Wang, H., Curran, E. C., Hinds, T. R., Wang, E. H. & Zheng, N. Crystal structure of a TAF1- TAF7 complex in human transcription factor IID reveals a promoter binding module. Cell Res 24, 1433-1444, doi:10.1038/cr.2014.148 (2014). 29 Xie, X. et al. Structural similarity between TAFs and the heterotetrameric core of the histone octamer. Nature 380, 316-322, doi:10.1038/380316a0 (1996). 30 Wang, H. et al. Structure of the transcription coactivator SAGA. Nature 577, 717-720, doi:10.1038/s41586-020-1933-5 (2020). 31 Antonova, S. V., Boeren, J., Timmers, H. T. M. & Snel, B. Epigenetics and transcription regulation during eukaryotic diversification: the saga of TFIID. Genes Dev 33, 888-902, doi:10.1101/gad.300475.117 (2019). 32 Kolesnikova, O. et al. Molecular structure of promoter-bound yeast TFIID. Nat Commun 9, 4666, doi:10.1038/s41467-018-07096-y (2018). 33 Matangkasombut, O., Auty, R. & Buratowski, S. Structure and function of the TFIID complex. Adv Protein Chem 67, 67-92, doi:10.1016/S0065-3233(04)67003-3 (2004). 34 van Nuland, R. et al. Quantitative dissection and stoichiometry determination of the human SET1/MLL histone methyltransferase complexes. Mol Cell Biol 33, 2067-2077, doi:10.1128/MCB.01742-12 (2013). 35 Schwanhausser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337-342, doi:10.1038/nature10098 (2011). 36 Brand, M. et al. UV-damaged DNA-binding protein in the TFTC complex links DNA damage recognition to nucleosome acetylation. EMBO J 20, 3187-3196, doi:10.1093/emboj/20.12.3187 (2001).

145 Chapter 4

37 Martinez, E. et al. Human STAGA complex is a chromatin-acetylating transcription coactivator that interacts with pre-mRNA splicing and DNA damage-binding factors in vivo. Mol Cell Biol 21, 6782-6795, doi:10.1128/MCB.21.20.6782-6795.2001 (2001). 38 Werten, S. et al. Crystal structure of a subcomplex of human transcription factor TFIID formed by TATA binding protein-associated factors hTAF4 (hTAF(II)135) and hTAF12 (hTAF(II)20). J Biol Chem 277, 45502-45509, doi:10.1074/jbc.M206587200 (2002). 39 Wright, K. J., Marr, M. T., 2nd & Tjian, R. TAF4 nucleates a core subcomplex of TFIID and mediates activated transcription from a TATA-less promoter. Proc Natl Acad Sci U S A 103, 12347-12352, doi:10.1073/pnas.0605499103 (2006). 40 Soutoglou, E. et al. The nuclear import of TAF10 is regulated by one of its three histone fold domain-containing interaction partners. Mol Cell Biol 25, 4092-4104, doi:10.1128/MCB.25.10.4092-4104.2005 (2005). 4

Acknowledgements We thank Roy Baas for bioinformatics assistance, as well as Laszlo Tora (Strasbourg, France) and Imre Berger (Bristol, the UK) for provided assistance and reagents. This research was supported by Netherlands Organization for Scientific Research (NWO) grants 022.004.019 (S.V.A) and ALW820.02.013 (H.T.M.T).

Author contribiutions H.T.M.T. conceived the study with input from S.V.A. S.V.A. carried out all cell biology experiments, assisted by L.vW., S.G.A., J.J.J.B., S.C. and T.B. The quantitative proteomics analyses were carried out by M.B., assisted by S.V.A. S.V.A. interpreted the data and wrote the manuscript together with input from H.T.M.T.

146 Proteomic analysis of TFIID

Supplemental Figures

4

Supplemental Figure 1. A. Immunofluorescence analysis of dox-untreated cell lines integrated with exogenous inducible GFP-fused TFIID subunits. Scale bars, 5 μm. B. Black-to-yellow colour scale represents low-to-high enrichment of identified bait proteins per fraction. Data are derived from iBAQ values mean ± s.d. of technical triplicates. C. Relative abundance of identified cytoplasmic and nuclear SAGA-specific SF3B3/SF3B5 subunits, normalized to bait. Data are derived from iBAQ values mean ± s.d. of technical triplicates.

147 Chapter 4

4

Supplemental Figure 2. Complex-specific overview of the proteins peptide coverage identified during the proteomic analysis. Theoretical tryptic peptides were obtained through the online

148 Proteomic analysis of TFIID

PeptideMass ExPASy tool; mass limit is set from 0.5 kDa to 5 kDa to include all identified peptides. In- scale plotting of theoretical and identified peptides was performed with the online Proteator tool.

4

Supplemental Figure 3. (A-C). Relative abundance of identified cytoplasm TFIID and SAGA (C) subunits, normalized to bait. Modules are highlighted in dark grey (bait) and grey (identified partners). Data are derived from iBAQ values mean ± s.d. of technical triplicates.

149 Chapter 4

4

Supplemental Figure 4. Unique TAF4/TAF4B/TAF9/TAF9B peptides are identified across cytoplasmic GFP co-IPs. Baits are depicted at the top and include the technical triplicates for each co- IP. Black-to-yellow colour scale represents lack-to-abundance of identified unique peptides. Green dotted frames highlight peptide data for GFP-TAF4/GFP-TAF4B and GFP-TAF9/GFP-TAF9B homologues.

150 Proteomic analysis of TFIID

4

Supplemental Figure 5. (A-E). Relative abundance of identified nuclear SAGA and TBP-specific interactors, normalized as indicated. SAGA modules are color coded as follows: core SAGA in red, DUB in blue, HAT in green and TF-associating module in black. Data are derived from iBAQ values mean ± s.d. of technical triplicates.

151 Chapter 4

4

Supplemental Figure 6. Unique TAF4/TAF4B/TAF9/TAF9B peptides are identified across chromatin GFP co-IPs. Baits are depicted at the top and include the technical triplicates for each co-IP. Black-to-yellow colour scale represents lack-to-abundance of identified unique peptides. Green dotted frames highlight peptide data for GFP-TAF4/GFP-TAF4B and GFP-TAF9/GFP-TAF9B homologues.

152 TAF1L forms canonical TFIID

Chapter 5 5

TAF1L integrates into a holo-TFIID complex with canonical composition

(Manuscript in preparation)

153 Chapter 5

TAF1L integrates into a holo-TFIID complex with canonical composition Manuscript in preparation

Antonova S.V.1, Boeren J.J.J.2, Vos H.R.3, Capponi S.4, Timmers H.T.M.1,4

1 Molecular Cancer Research and Regenerative Medicine, University Medical Centre Utrecht, 3584 CT Utrecht, The Netherlands. 2 Department of Developmental Biology, Erasmus MC, 3015 CN Rotterdam, The Netherlands. 3 Molecular Cancer Research, Center for Molecular Medicine, University Medical Centre Utrecht, Universiteitsweg 100, 3584 CG Utrecht, The Netherlands. 4 German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120, Heidelberg, Germany; German Cancer Consortium (DKTK) partner site Freiburg, 79106, Freiburg, Germany; Department of Urology, Medical Center-University of Freiburg, Breisacher Straße 66, 79106, Freiburg, Germany.

Abstract

5 TFIID is a basal transcription complex required for the transcriptional initiation by RNA polymerase II (pol II) of all eukaryotic protein-coding genes. It is the first pre-initiation complex member to bind the core promoter of genes and to recruit the pol II transcriptional machinery. The complex consists of the TATA-binding protein (TBP) and 13-14 TBP-associated factors (TAFs). Together the TAFs provide a multi-layered control of transcription through numerous chromatin-binding sites and transcription factor-associating regions. Further context-dependent complexity to TFIID regulation is provided by several subunit paralogues, which have mostly been implicated in embryogenesis, gametogenesis and stem cell maintenance. TAF1L is a TAF1-specific paralogue, which has risen from a retrotransposition event in the split of the Old World Monkeys evolutionary branch and the two proteins share high sequence similarity. Expression analyses have placed TAF1L in the testis tissue with proposed role in spermatogenesis. Reported in vitro interaction between TAF1L and TBP suggests it may function in the context of the TFIID complex, but data supporting this are missing. Here, we implement a quantitative mass spectrometry approach to identify the cellular interactors of exogenously expressed TAF1L in the cytoplasmic and nuclear extracts of cells. We observe similar TFIID assembly steps between TAF1 and TAF1L, which include the cytoplasmic formation of TAF1(1L)/TAF7(7L)/TBP trimer and the nuclear incorporation of this trimer into holo-TFIID. Notably, in the presence of endogenous TAF1, GFP-TAF1L integrates less efficiently into TFIID, indicating possible competition between the two proteins for TFIID assembly. CRISPR/Cas9-mediated inactivation of endogenous TAF1 alleviated the phenotype and allowed for complete integration of GFP-TAF1L into a nuclear holo- TFIID, confirming the competition model between the two proteins for incorporation into the full TFIID complex.

Key words: TAF1L, TFIID, quantitative proteomics, CRISPR/Cas9

154 TAF1L forms canonical TFIID

Introduction The eukaryotic basal transcription factor TFIID is central to the expression of protein- coding genes, as it is the first complex to recognize core promoter regions and to, thus, allow subsequent assembly of the pre-initiation complex (PIC) and stabilization of the RNA polymerase II (pol II) at the transcription start site (TSS) 1. Metazoan TFIID consists of TATA-Binding Protein (TBP) in complex with 13-14 TBP-Associated Factors (TAFs) – abundantly present and mostly conserved across the entire eukaryotic tree-of-life, indicating strong correlation between complex structure and function among all eukaryotes 2. Historically, as a protein with specificity for canonical TATA box sequences, TBP was seen as the central promoter-anchoring subunit of TFIID 3. Nevertheless, subsequent work has highlighted the essential role of TAFs in promoter selectivity and TFIID recruitment during activated transcription and at TATA-less promoters 1,4,5. Additionally, recent breakthrough in structural characterization of TFIID and PIC assembly has revealed a general TAF-dependence in the initial promoter-binding step, followed by TBP-DNA association at a measured distance from the TSS, driven by conformational rearrangements within TFIID 1,4. A central member of the TFIID complex is TAF1, as it contributes to promoter recognition through numerous domains. 5 Being the largest TFIID subunit, human TFIID consists of an N-terminal TAND domain involved in inhibitory TBP binding to prevent premature loading of TBP onto the DNA; a central DUF3591 region, which includes TAF7-interaction site as well as binding regions for promoter motifs, such as Inr, DPE and MTE; and a C-terminal Zinc2+ knuckle (ZnK) and tandem bromodomains (BrDs) involved in DNA- and acetylated histone tails-binding, respectively 2,6-9. Besides TAF1, additional promoter recognition within TFIID structure is provided by the H3K4me3-specific TAF3-PHD domain and the DNA-binding specificity of TAF2 and TAF4 1,4,10. Finally, differential promoter targeting has been ascribed to a direct interaction of several TAFs with various transcription factors (TFs) 11. Several TAF paralogues have emerged at different points of eukaryotic evolution, indicating lineage-specific diversification of the complex 2. Considering the functional role of TFIID, this repertoire expansion is most likely reflective of the differential need for fine-tuning and re-targeting of transcription initiation events among different eukaryotic branches. In line with such lineage-specific role, all TFIID paralogues have been associated with developmental and differentiation transcription programs. As such, the metazoan branch has obtained three additional TBP-related factors (TRFs), which include the ubiquitously present among all animals TRF2 (or TBPL1) and the additionally diversified TRF1 (or TLF) and TRF3 (or TBPL2), specific for insects and vertebrates, respectively 12. Notably, while TBP has been associated with somatic transcription, TRFs mainly act during gametogenesis and embryogenesis. In mammals, TRF2 and TRF3 have undergone further sub-functionalization and have each been implemented in spermatogenesis and oogenesis, respectively 12. The TAF4 subunit of TFIID has also duplicated to give rise to the vertebrate-specific TAF4B, as well as fish-specific TAF4x 2. While the former has been ascribed a role in proliferation and maintenance of germ and embryonic stem cells, the latter is yet to be characterized in a functional context 13. TAF9B is placental animal invention, with suggested role in neurogenesis, and the mammalian TAF7L has been implemented in spermatogenesis as well as adipogenesis 14-16. The most recent diversification of TFIID is the Old World Monkeys-specific TAF1L, which arose from a retrotransposition event and shares 94% DNA and 84% amino acid

155 Chapter 5

sequence similarity with TAF1 2. Similarly to other TFIID paralogues, the protein has been suggested to play a role in spermatogenesis, however functional data is still missing 17. We have previously described a cytoplasmic presence of a TAF1/TAF7/TBP heterotrimer as a likely assembly step during TFIID formation (Chapter 4). As the paralogues of each of these proteins have been associated with spermatogenesis, it raises the possibility of additional functionality behind this trimeric structure. Indeed, TAF7L has been shown to associate not only with TBP, but also with TRF2 with a significantly higher affinity 18. This interaction is unlikely to be direct, suggesting that an additional protein serves as mediator. Consistently, TBP, TAF7 and TAF7L have all been described to directly bind TAF1 6,9,15, and TAF1L has been shown to bind TBP 17. This suggests that the interaction between these proteins is likely conserved among their paralogues as well. Furthermore, a well-described temperature sensitive (ts) point mutation within TAF1 associated with cell cycle arrest and apoptosis evokes similar phenotype in a corresponding TAF1L ts mutant 17. Notably, the mutation is located at the border of TAF7- interaction region and a DNA-binding domain suggesting conservation of these interactions within TAF1L as well 9. Despite the indicative data, no molecular characterization of TAF1L cellular 5 interactions, specifically in the context of TFIID assembly has been described yet. Here, we have implemented a cell-based system for expression and analysis of TAF1L in the context of its molecular interactions. A quantitative mass spectrometry (qMS) approach was utilized for the characterization of TAF1L potential for TFIID assembly in comparison to TAF1, revealing a competitive relation between the two for complex integration. Inactivation of endogenous TAF1 eliminated this competition and allowed for efficient TAF1L-mediated TFIID assembly.

Results TAF1L integrates into TFIID with lower efficiently than TAF1 To assess the primary structure similarity between TAF1 and TAF1L, a Clustal Omega (CLUSTAL-O)-based sequence alignment was performed between the two proteins 19. A significant conservation among all functional domains was observed, which indicates preservation of interaction partners as well as protein conformation (Fig. 1S). Indeed, in accordance with reported TAF1L-TBP interaction 17, TAF1-TAND is largely conserved in TAF1L. The presence of several substitutions within this region may be indicative of a variable affinity for TBP between the two proteins. The DUF3591 domain is closely preserved between TAF1L and TAF1, indicating that the former is likely to bind TAF7 (and possibly TAF7L). Indeed, the glycine (Gly) rich motif within the TAF1 triple barrel described as essential for TAF7 recognition, as well as the basic lysine (Lys) and arginine (Arg) residues within its DNA-binding winged helix (WH) region 9 are both fully conserved in TAF1L (Fig. S1). In addition to the WH, the other promoter recognition sites of TAF1 (e.g. targeting DPE, MTE, Inr 20), the ZnK 7 and the tandem BrDs 21 are also mostly conserved. Similar to the TAND, a few residue substitutions are present at the Inr and DPE binding sites and the BrDs (albeit outside the acetyl-lysine binding pockets) of TAF1L may indicate possible variations in promoter recruitment between the two proteins. Altogether, despite the local variations, the overall sequence similarity between interaction domains of TAF1 and TAF1L, as well as the reported in vitro association of TAF1L with TBP 17 are both suggestive of a TFIID assembly potential for TAF1L.

156 TAF1L forms canonical TFIID

To address this possibility in a cellular context, we set out to determine whether exogenously expressed TAF1L could integrate into a holo-TFIID. For that purpose, we utilized a previously described HeLa_FRT cell system 22 for the generation of a Doxycycline (dox)-inducible cell line expressing TAF1L, which is N-terminally fused to a green fluorescent protein (GFP) (Fig. 1A, B). In order to interrogate the molecular interactions of GFP-TAF1L in the context of TFIID formation, cytoplasmic and nuclear extracts of the induced cells were prepared alongside extracts from previously generated cells expressing GFP-TAF1, GFP-TAF7 or GFP-TBP (Chapter 4). qMS analysis of affinity-based enrichment revealed a cytoplasmic heterotrimeric interaction between GFP-TAF1L, TAF7 (or TAF7L) and TBP, which is identical observations with the GFP- TAF1 (Fig. 1C, chapter 4). In addition, GFP-TAF1 also co-enriched for TAF4, TAF5 and TAF6, suggesting a a higher affinity of TAF1 for TFIID subunits than TAF1L (Fig. 1C). To compare the amount of TFIID complex members co-enriched by each bait, relative abundance was calculated through the intensity-based absolute quantification (iBAQ) values for each identified TFIID subunit (Fig. 1D). As seen before, such analysis revealed specific co-enrichment in approximately 1:1:1 ratio of TAF7 and TBP by GFP-TAF1, representing a discrete TFIID assembly module formed in the cytoplasm (Fig. 1D, left, chapter 4). Interestingly, while GFP-TAF1L similarly co- 5 enriched for TAF7 and TBP, it also largely remained unbound, suggesting that it may be inefficiently integrating into the heterotrimer (Fig. 1D, right, 1E). Neither GFP-TAF7 nor GFP- TBP co-precipitated with the heterotrimer, which is in agreement with our previous observations that TAF1 (and, therefore, possibly TAF1L) is limiting for the formation of this cytoplasmic TFIID assembly module (Fig. S2A,B, chapter 4). To investigate the possibility of an overall reduction in TFIID formation, correlated with the cytoplasmic pre-assembly stagnation of TAF1L, we applied qMS analysis to the nuclear co- enrichments of the baits. Bioinformatics analysis revealed overall significant co-enrichment of the complete TFIID complex with each bait (Fig. 1F, S2C). Notably, only partial enrichment for TAF10 was seen in GFP-TAF1L, indicating that the protein may be comparatively less efficient in forming a stable holo-TFIID (Fig. 1F, right). In contrast, GFP-TAF1, -TAF7 and -TBP successfully enriched for the entire complex (Fig. 1F, S2C). The exception was TAF7L, missing from the GFP-TAF7 co-precipitates due to their mutually exclusive presence within TFIID (Fig. S2C). While TAF7 and TAF7L do not share any tryptic peptides, providing us with confidence in their identification, only shared peptides were identified for TAF1L in our entire proteomic analysis (except for the GFP-TAF1L enrichments), limiting the confidence of its identification. As such co- enriched endogenous TAF1L likely represents a miss-assigned splice version of TAF1, which contains shared tryptic peptides with TAF1L. Finally, besides TFIID, GFP-TBP also co-enriched for PIC members of the pol I and pol III transcription machineries, and the pol II regulators BTAF1 and TFIIA, which is in line with our previous observations of the nuclear distribution of TBP between different transcription-regulating complexes (Fig. 2C, chapter 4). To assess the ability of the nuclear baits to enrich holo-TFIID we utilized iBAQ values for the estimation of the relative abundance of each of the identified TFIID subunits. These relative stoichiometries revealed that most of GFP-TAF1 was integrated into holo-TFIID, which expectedly comprised of elevated levels of the two-copy core TFIID members (TAF4/TAF5/TAF6/TAF9/TAF10/TAF12) and lower levels of the single-copy TAF2, TAF3, TAF11 and TAF13 (Fig. 1G, left) 4,20,23.

157 Chapter 5

5

Figure 1. Inefficient TFIID assembly by TAF1L compared to its canonical paralogue TAF1. A: Immunofluorescence analysis indicates primarily nuclear localization of GFP-TAF1L upon induction. B: Protein analysis of GFP-TAF1L HeLa whole cell lysate confirms GFP-TAF1L expression upon induction. C: Enrichment analysis of bait and co-precipitates from the cytoplasmic and (F) nuclear extracts of induced cell lines compared to a control. Each data point in the volcano plots is plotted as the mean of technical triplicates. Black line denotes the threshold between background and significant enrichment (two-tailed Student‘s t-test; FDR ≤1%; S0= 1). TFIID subunits are coloured in green and

158 TAF1L forms canonical TFIID baits in magenta. D: Relative abundance calculation of co-enriched cytoplasmic and (G) nuclear interactors normalized to selected single copy TFIID subunit. The cytoplasmic heterotrimer is highlighted in dark grey (bait) and grey (interaction partners). Data are derived from iBAQ values mean ± s.d. of technical triplicates. E: Estimation of heterotrimeric and (H) total TFIID co-enrichment from the cytoplasmic and nuclear extracts of induced GFP-TAF1 and GFP-TAF1L HeLa cells. Data are derived from iBAQ values mean ± s.d. of technical triplicates. Baits as well as the heterotrimer members (in the nuclear samples) are excluded from the calculation.

Notably, increased levels of the otherwise single-copy subunits TAF7 and TBP is in agreement with our previous observations of a nuclear presence of the TAF1/TAF7/TBP assembly module (Chapter 4). Such distribution of TAF1 among its heterotrimeric and holo-TFIID state may be representative of the assembly process of TFIID, where TAF1/TAF7/TBP is imported from cytoplasm to the nucleus and integrated into the complex. Interestingly, the relative abundance of the GFP-TAF1L-enriched TFIID subunits revealed excess of TAF1L/TAF7/TBP trimer over holo-TFIID, supporting our initial observation of an inefficient (compared to TAF1) TAF1L 5 incorporation into the complex (Fig. 1G, right). Indeed, analysis of the total amount of co- enriched TFIID, confirmed a significantly lower presence of the complex in GFP-TAF1L precipitates as compared to the one in GFP-TAF1 enrichments (Fig. 1H). In contrast, the relative abundance of the co-enriched TFIID subunits by nuclear GFP-TAF7 and GFP-TBP was similar to the one observed in GFP-TAF1 indicating that the low efficiency of complex integration was specific for GFP-TAF1L (Fig. S2D). Altogether, our data showed that, like TAF1, cytoplasmic TAF1L assembles into a TAF1L/TAF7(or TAF7L)/TBP heterotrimeric subcomplex, albeit to a lower extent. Accordingly, this TAF1L-containing trimer is imported into the nucleus, where it proceeds to assemble into a holo-TFIID. Akin to the cytoplasmic TAF1L, nuclear integration of the protein into holo-TFIID exhibited lower efficiency as compared to TAF1 protein.

Deletion of TAF1 allows complete incorporation of TAF1L into holo-TFIID Despite the lower efficiency TAF1L can form a complete TFIID, which indicates that the protein likely functions as a part of the TFIID complex. We asked whether the presence of endogenous TAF1 in the induced GFP-TAF1L cells would hinder the integration of TAF1L in TFIID due to their mutually exclusive nature in the context of the complex. Such competition model would indicate the possibility of regulated TAF1L functionality in the presence of TAF1 and its expansion in the absence of TAF1. To investigate the hypothesized model, we utilized the clustered regularly interspaced short palindromic repeats (CRISPR) / CRISPR-associated protein 9 (Cas9) gene editing technology to target the endogenous TAF1 gene for deletion. Cas9 enzyme can be targeted to desired genomic regions through single-guide RNA (sgRNA) sequences followed by a protospacer adjacent motif (PAM), where it introduces a single stranded DNA cut 4 base pairs (bps) upstream of the PAM sequence 24. As TAF1 is a large, multi-exon protein with a number of splice variants occurring in the C-terminal portion of the protein 25,26, we chose to target the splice donor (SD) site immediately after the end of exon 1, such that the Cas9-induced cut occurs at the consensus SD motif ―gt‖. Induction of DNA damage at this site is expected to disrupt the protein synthesis by affecting mRNA splicing and premature translation termination through a stop codon downstream of the exon 1 (Fig. 2A).

159 Chapter 5

5

Figure 2. Deletion of endogenous TAF1 enhances the incorporation of exogenous TAF1L into TFIID. A: Schematic overview of CRISPR/Cas9 gene editing strategy for targeted deletion of TAF1 and aligned sequence modifications of the two selected clones. Pink box depicts exon 1, red letters indicate sgRNA target sequence, PAM is indicated and highlighted in blue. B,C: Enrichment analysis of bait and co-precipitates from the nuclear extract of induced cell lines compared to a control. Each data point in the volcano plots is plotted as the mean of technical triplicates. Black line denotes the threshold between background and significant enrichment (two-tailed Student‘s t-test; FDR ≤1%; S0= 1). TFIID subunits are coloured in green and baits in magenta. D,E: Relative abundance calculation of co-enriched nuclear interactors normalized to TAF7. The cytoplasmic heterotrimer is highlighted in dark grey (bait) and grey (interaction partners). Data are derived from iBAQ values mean ± s.d. of technical triplicates.

160 TAF1L forms canonical TFIID

As TAF1 is an essential gene 27, it was targeted for deletion in the background of an induced GFP-TAF1L expression, such that it would rescue the lethal phenotype only in the case of partial and sufficient functional overlap between the two proteins. Through genomic characterization of the resulting cells, two clones were identified to carry TAF1 SD deletions at the first exon/intron boundary (Fig 2A). One of the clones (TAF1L clone #1, TAF1L-1) contained an heterozygous 20-nucleotide deletion encompassing (and disrupting) the entire SD site and introducing a frame-shift leading to a premature stop-codon. The other clone (TAF1L clone#2, TAF1L-2) had a homozygous 5-nucleotide deletion with the same effect (Fig. 2A). To investigate whether the deletion of endogenous TAF1 affected the integration of GFP-TAF1L into TFIID, we subjected the nuclear extracts of the two CRISPR clones to affinity- purification and qMS analysis. Significant enrichment of all TFIID subunits was consistently observed in both cell lines, in line with previously observed holo-TFIID co-precipitation by GFP- TAF1L (Fig. 1F, 2B,C). Assessment of the relative abundance of the complex subunits revealed remarkable similarity to the complex enrichment by GFP-TAF1 (Fig. 1G, 2D,E). Interestingly, the heterozygous clone showed a clear abundance of the TAF1L/TAF7/TBP subcomplex, albeit to lower extent than the one found in the nuclear fraction of GFP-TAF1L (TAF1+/+) cells, and much 5 more reminiscent of the GFP-TAF1 subcomplex formation (Fig. 1G, 2D). Strikingly, the subcomplex was primarily absent in the homozygous clone, suggesting that in the case of complete absence of endogenous TAF1, TAF1L efficiently integrates into the cellular pool of TFIID (Fig. 2E). Together, these data indicate that TAF1L functionally overlaps with TAF1 to an extent sufficient for the maintenance of cellular integrity and that endogenous TAF1 outcompetes exogenously-expressed TAF1L for TFIID formation, forcing the latter to remain in the assembly module of TAF1L/TAF7/TBP even upon nuclear import.

Discussion The ability of TBP and TAF paralogues to assemble TFIID with canonical composition is indicative of a central role of the complex structure for its functionality (Chapter 4). Here we report the ability of exogenously-expressed GFP-TAF1L to engage the assembly pathway of TFIID complex by forming a cytoplasmic TAF1L/TAF7(7L)/TBP module, which upon nuclear import integrates into a holo-TFIID. This is in accordance with our previous observations on the TFIID incorporation mechanism of its canonical subunit, TAF1 (Chapter 4). Notably, despite the high primary sequence similarity and the conservation of crucial interaction sites, as well as the parallels in the TFIID formation steps, GFP-TAF1L integration into the complex was inefficient as compared to GFP-TAF1. Indeed, cytoplasmic GFP-TAF1L enriched to a lower extent the TAF1/TAF7/TBP hetero-trimer and nuclear import of this trimer was accompanied with accumulation outside of TFIID due to inefficient integration of the module into the complex. Intriguingly, deletion of endogenous TAF1 alleviated the complex formation stagnation, suggesting that in the presence of both proteins, TFIID preferentially incorporates TAF1 over its paralogue (Fig. 3). Such preference is likely due to minor but crucial topological differences between the two proteins and probing the structures through a domain-swapping approach could aid in identifying sites within TAF1 or TAF1L involved in the enhanced or reduced complex formation, respectively.

161 Chapter 5

Figure 3. Competition between TAF1 and TAF1L for the assembly of TFIID. Left panel, complete GFP-TAF1L integration into the TFIID complex is hindered by mutually exclusive integration of the endogenously expressed TAF1, which exhibits higher affinity for the complex. Right panel, upon deletion of endogenous TAF1, GFP-TAF1 is fully incorporated in TFIID and does not remain in the pre-assembly TAF1L/TAF7(L)/TBP trimer once in the nucleus.

TAF1L has been implemented in spermatogenesis due to its testis-specific tissue expression 17. Indeed, testis biopsies from Sertoli cell-only syndrome patients, which lack male 5 germ cells and are, therefore sterile, did not reveal TAF1L expression, in line with its proposed role in spermatogenesis 17. Interestingly, TAF1 expression was detected in both in healthy and germ- defective testis tissues, which supports its somatic function 17. As such TAF1L would be transcriptional driver of spermatogenesis and, therefore, according to our data, it would be expressed in the absence of TAF1 (which otherwise would outcompete the transcriptional function). Such mechanism could include the temporal upregulation of TAF1L, which would co- occur with a corresponding downregulation of TAF1 expression. Indeed, similar mechanism has been reported for TBP and TRF2 as well as TAF7 and TAF7L during pachytene stage of spermatogenic meiosis 15,18,28,29. TAF7L exhibits stronger binding preference for TRF2 than TBP in testis extract, implying a functional link between the two during spermatogenesis. The role TAF7L for spermatogenesis has been characterized with a cytoplasmic localization prior to pachytene stage, and rapid nuclear import at its onset 15. The reported reduction in TAF7 and TBP levels during that stage, therefore, seems to be compensated by the functional increase of their paralogues. Importantly, the interaction between TAF7L and TRF2 (and TBP, albeit to a lower extent) is not directly mediated 15,30, which indicates that an additional protein like TAF1L would participate in the formation of this functionally relevant module. Based on our previous work (Chapter 4) along with the present study and the reported TAF1L tissue specificity 17, we speculate that TAF1L is the likely mediator of this interaction. Unlike TAF7, TAF7L does not include a nuclear localization signal, indicating that potential interaction with the NLS-containing TAF1L during spermatogenesis is of a crucial role for the functionality of these paralogues. Further research in this direction would benefit from compositional characterization of TAF1L-formed TFIID in the cellular context of spermatogenesis. Altogether, our data shed light on the molecular interactions of TAF1L in vivo and demonstrates the ability of this paralogue to assemble holo-TFIID. Additionally, we report competitive integration of TAF1, but not TAF1L, in TFIID, which may be indicative of a mechanistic interplay between the two and underscores possible functional implementations during differentiation processes such as spermatogenesis.

162 TAF1L forms canonical TFIID

Methods GFP-TAF cell line generation Human cDNA was amplified by PCR with gene-specific primers fused to attB recombination sequences for GATEWAY cloning. The amplified sequence was recombined into the pDONR201 donor vector according to manufacturer‘s protocol (ThermoFisher). The cloned sequence was verified with Sanger sequencing and was transferred to the pcDNA5_FRT_TO_N-GFP 22 Gateway destination vector by LR recombination according to manufacturer‘s protocol (ThermoFisher). All destination vectors were co-transfected with pOG44 plasmid encoding for the Flp recombinase into HeLa Flp-In/T-REx cells 22 using polyethyleneimine (PEI) transfection to generate stable Dox-inducible expression cell lines.

Cell culture methods HeLa Flp-In/T-REx cells, containing the Flp Recombination Target site and expressing the Tet Repressor, were grown in Dulbecco's Modified Eagle's Medium (DMEM), 4.5 g/L glucose, supplemented with 10% v/v fetal bovine serum, 10 mM L-glutamine and 100 U/mL penicillin/streptomycin (all purchased from Lonza), together with 5 μg/ml blasticidin S 5 (InvivoGen, San Diego, CA) and 200 μg/ml zeocin (Invitrogen, Carlsbad, CA) as selection drugs for FRT and Tet repressor, respectively. All cell lines used were mycoplasma-negative. Recombined cells were selected by replacing zeocin by 250 μg/mL hygromycin B (Roche Diagnostics, Mannheim, Germany) 48 h after PEI transfection. Expression of GFP-tagged protein was induced by addition of 1 μg/mL doxycycline, for 16 to 18 h.

Immunoblotting procedures Cell were seeded in 6-well dishes at 30,000 cell/well and induced with doxycycline for to 18 h prior to harvesting. Cell lysates were prepared in 1x sample buffer (160 mM Tris-HCl pH 6.8, 4% SDS, 20% glycerol, 0.05% bromophenol blue). Equal amounts of cell lysates were separated by a 10% SDS-PAGE and transferred onto PVDF membrane. GFP purification samples for immunoblotting were collected by elution with sample buffer. The membrane was developed with the appropriate antibodies and ECL. Immunoblots were analyzed using ChemiDoc imaging system (BioRad). The images were subjected to linear contrast/brightness enhancement in Photoshop (CS6, 13.0.6 x64, extended) when needed for data representation purposes. Antibodies used in the study include: GFP (JL-8, Clontech) and tubulin (clone 6C5, mAb374, Millipore).

Confocal microscopy Paraformaldehyde-fixed cells were subjected to short cell membrane permeabilization with 0.5% Triton X-100 in PBS for 5 min. After quenching the cross-linking with 50 mM glycine, the samples were blocked for 30 min at RT with 5% natural goat serum (NGS) and subsequently incubated with primary antibodies for 2 hours at RT. After three sequential washing steps with PBS, secondary antibodies conjugated to a fluorophore were added, together Phalloidin (Abcam) for staining of actin for 45 min at RT. For nuclear staining, the samples were incubated with DAPI dye (2 mg/L, Sigma-Aldrich, St. Louis, MO) for 5 min prior to mounting the coverslips on microscopy slides (ThermoFischer Scientific, Waltham, MA) with Immu-Mount (Invitrogen, Carlsbad, CA). Images were obtained using SP8-X confocal microscope (Leica microsystems, Germany) using a HC PL APO 63x/1.40 Oil CS2 objective. Gain and offset settings were adjusted according to the

163 Chapter 5

fluorescence signal, but they were kept constant in comparative experimental designs such as doxycycline-induction tests. Exported files were next subjected to linear contrast and brightness processing in Photoshop (CS6, 13.0.6 x64, extended) for image representation purposes.

Cell extracts and GFP co-immunoprecipitation (co-IP) Cells were seeded in 15-cm dishes (Greiner Cellstar, Sigma-Aldrich) and grown to 70-80% confluence prior to doxycycline induction. GFP-protein expression was verified using EVOS fluorescence microscopy (Thermo Fischer). Next, induced cells were harvested and nuclear and cytoplasmic extracts were obtained using a modified version of the Dignam and Roeder procedure 31. Protein concentrations were determined by Bradford assay (BioRad). 1 mg of nuclear or 3 mg of cytoplasmic extract was used for GFP-affinity purification as described46. In short, protein lysates were incubated in binding buffer (20 mM Hepes-KOH pH 7.9, 300 mM NaCl, 20% glycerol, 2 mM MgCl2, 0.2 mM EDTA, 0.1% NP-40, 0.5 mM DTT and 1x Roche protease inhibitor cocktail) on a rotating wheel for 1 h at 4° C in triplicates with GBP-coated agarose beads (Chromotek) or control agarose beads (Chromotek). The beads were washed two times with binding buffer containing 5 0.5% NP-40, two times with PBS containing 0.5% NP-40, and two times with PBS. On-bead digestion of bound proteins was performed overnight in elution buffer (100 mM Tris-HCl pH 7.5, 2 M urea, 10 mM DTT) with 0.1 μg/ml of trypsin at RT and eluted tryptic peptides were bound to C18 stagetips prior to mass spectrometry analysis.

Mass spectrometry and data analysis Tryptic peptides were eluted from the C18 stage-tips in H2O:acetonitril (50:50) with 0.1% formic acid and dried prior to resuspension in 10% formic acid. A third of this elution was injected into Q-Exactive (Thermo Fischer) in the MS/MS mode with 90 min total analysis time. Blank samples consisting of 10% formic acid were run for 45 min between GFP and non-GFP samples, to avoid carry-over between runs. The raw data files were analyzed with MaxQuant software (version 1.5.3.30) using Uniprot human FASTA database 22,32. Label-free quantification values (LFQ) and match between run options were selected. Intensity based absolute quantification (iBAQ) algorithm was also activated for subsequent relative protein abundance estimation. The obtained protein files were analyzed by Perseus software (MQ package, version 1.5.4.0), in which contaminants and reverse hits were filtered out. Protein identification based on non-unique peptides as well as proteins identified by only one peptide in the different triplicates were excluded to increase protein prediction accuracy. For identification of the bait interactors LFQ intensity-based values were transformed on the logarithmic scale (log2) to generate Gaussian distribution of the data. This allows for imputation of missing values based on the normal distribution of the overall data (in Perseus, width = 0.3; shift = 1.8). The normalized LFQ intensities were compared between grouped GFP triplicates and non-GFP triplicates, using 1% permutation-based false discovery rate (FDR) in a two-tailed t-test. The threshold for significance (S0), based on the FDR and the ratio between GFP and non-GFP samples, was kept at the constant value of 1 for comparison purposes. Relative abundance plots were obtained by comparison of the iBAQ values of GFP interactors. The values of the non-GFP iBAQ values were subtracted from the corresponding proteins in the GFP pull-down and were next normalized on a chosen co-purifying protein for scaling and data representation purposes 22.

164 TAF1L forms canonical TFIID

CRISPR/Cas9 TAF1 knockout GFP-TAF1L-HeLa_FRT cells were supplemented with 1 μg/ml doxycycline to maintain expression of the GFP fusion protein under the targeted deletion of endogenous TAF1. Cells were transfected with CRISPR/Cas9 and sgRNA-TAF1 DNA, targeting the first exon/intron boundary of TAF1 using Fugene (Promega) according to the manufacturer's instructions. 24 h post transfection the medium was refreshed and replaced by puromycin-supplemented (1 µg/ml) growth medium that contained antibiotic selection (Blasticidin S, hygromycin B). 48 h post transfection and 24 post puromycin selection, cells were trypsinized and 3-fold logarithmically diluted to acquire individual colonies. Colonies were picked by hand and grown in a 12-well dish and maintained. To confirm knock-outs, genomic DNA (gDNA) was isolated using the DNeasy blood and tissue kit (QIAgen). The targeted areas were amplified by PCR, cloned and sent for sequencing. Out of the >30 CRISPR clones, 2 were found positive for heterozygous and homozygous knock-out of TAF1.

References 1 Patel, A. B., Greber, B. J. & Nogales, E. Recent insights into the structure of TFIID, its 5 assembly, and its binding to core promoter. Curr Opin Struct Biol 61, 17-24, doi:10.1016/j.sbi.2019.10.001 (2020). 2 Antonova, S. V., Boeren, J., Timmers, H. T. M. & Snel, B. Epigenetics and transcription regulation during eukaryotic diversification: the saga of TFIID. Genes Dev 33, 888-902, doi:10.1101/gad.300475.117 (2019). 3 Thomas, M. C. & Chiang, C. M. The general transcription machinery and general cofactors. Crit Rev Biochem Mol Biol 41, 105-178, doi:10.1080/10409230600648736 (2006). 4 Patel, A. B. et al. Structure of human TFIID and mechanism of TBP loading onto promoter DNA. Science 362, doi:10.1126/science.aau8872 (2018). 5 Zhou, Q., Boyer, T. G. & Berk, A. J. Factors (TAFs) required for activated transcription interact with TATA box-binding protein conserved core domain. Genes Dev 7, 180-187, doi:10.1101/gad.7.2.180 (1993). 6 Anandapadamanaban, M. et al. High-resolution structure of TBP with TAF1 reveals anchoring patterns in transcriptional regulation. Nat Struct Mol Biol 20, 1008-1014, doi:10.1038/nsmb.2611 (2013). 7 Curran, E. C., Wang, H., Hinds, T. R., Zheng, N. & Wang, E. H. Zinc knuckle of TAF1 is a DNA binding module critical for TFIID promoter occupancy. Sci Rep 8, 4630, doi:10.1038/s41598-018-22879-5 (2018). 8 Matangkasombut, O., Buratowski, R. M., Swilling, N. W. & Buratowski, S. Bromodomain factor 1 corresponds to a missing piece of yeast TFIID. Genes Dev 14, 951-962 (2000). 9 Wang, H., Curran, E. C., Hinds, T. R., Wang, E. H. & Zheng, N. Crystal structure of a TAF1- TAF7 complex in human transcription factor IID reveals a promoter binding module. Cell Res 24, 1433-1444, doi:10.1038/cr.2014.148 (2014). 10 Vermeulen, M. et al. Selective anchoring of TFIID to nucleosomes by trimethylation of histone H3 lysine 4. Cell 131, 58-69, doi:10.1016/j.cell.2007.08.016 (2007). 11 Liu, W. L. et al. Structures of three distinct activator-TFIID complexes. Genes Dev 23, 1510- 1521, doi:10.1101/gad.1790709 (2009). 12 Akhtar, W. & Veenstra, G. J. TBP-related factors: a paradigm of diversity in transcription initiation. Cell Biosci 1, 23, doi:10.1186/2045-3701-1-23 (2011).

165 Chapter 5

13 Langer, D. et al. Essential role of the TFIID subunit TAF4 in murine embryogenesis and embryonic stem cell differentiation. Nat Commun 7, 11063, doi:10.1038/ncomms11063 (2016). 14 Herrera, F. J., Yamaguchi, T., Roelink, H. & Tjian, R. Core promoter factor TAF9B regulates neuronal gene expression. Elife 3, e02559, doi:10.7554/eLife.02559 (2014). 15 Pointud, J. C. et al. The intracellular localisation of TAF7L, a paralogue of transcription factor TFIID subunit TAF7, is developmentally regulated during male germ-cell differentiation. J Cell Sci 116, 1847-1858, doi:10.1242/jcs.00391 (2003). 16 Zhou, H. et al. Dual functions of TAF7L in adipocyte differentiation. Elife 2, e00170, doi:10.7554/eLife.00170 (2013). 17 Wang, P. J. & Page, D. C. Functional substitution for TAF(II)250 by a retroposed homolog that is expressed in human spermatogenesis. Hum Mol Genet 11, 2341-2346, doi:10.1093/hmg/11.19.2341 (2002). 18 Zhou, H. et al. Taf7l cooperates with Trf2 to regulate spermiogenesis. Proc Natl Acad Sci U S A 110, 16886-16891, doi:10.1073/pnas.1317034110 (2013). 19 Madeira, F. et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res 47, W636-W641, doi:10.1093/nar/gkz268 (2019). 5 20 Louder, R. K. et al. Structure of promoter-bound TFIID and model of human pre-initiation complex assembly. Nature 531, 604-609, doi:10.1038/nature17394 (2016). 21 Jacobson, R. H., Ladurner, A. G., King, D. S. & Tjian, R. Structure and function of a human TAFII250 double bromodomain module. Science 288, 1422-1425, doi:10.1126/science.288.5470.1422 (2000). 22 Antonova, S. V. et al. Chaperonin CCT checkpoint function in basal transcription factor TFIID assembly. Nat Struct Mol Biol 25, 1119-1127, doi:10.1038/s41594-018-0156-z (2018). 23 Bieniossek, C. et al. The architecture of human general transcription factor TFIID core complex. Nature 493, 699-702, doi:10.1038/nature11791 (2013). 24 Ran, F. A. et al. Genome engineering using the CRISPR-Cas9 system. Nat Protoc 8, 2281-2308, doi:10.1038/nprot.2013.143 (2013). 25 Capponi, S. et al. Neuronal-specific microexon splicing of TAF1 mRNA is directly regulated by SRRM4/nSR100. RNA Biol 17, 62-74, doi:10.1080/15476286.2019.1667214 (2020). 26 Herzfeld, T., Nolte, D. & Muller, U. Structural and functional analysis of the human TAF1/DYT3 multiple transcript system. Mamm Genome 18, 787-795, doi:10.1007/s00335-007- 9063-z (2007). 27 Bertomeu, T. et al. A High-Resolution Genome-Wide CRISPR/Cas9 Viability Screen Reveals Structural Features and Contextual Diversity of the Human Cell-Essential Proteome. Mol Cell Biol 38, doi:10.1128/MCB.00302-17 (2018). 28 Martianov, I. et al. Distinct functions of TBP and TLF/TRF2 during spermatogenesis: requirement of TLF for heterochromatic chromocenter formation in haploid round spermatids. Development 129, 945-955 (2002). 29 Zhang, D., Penttila, T. L., Morris, P. L. & Roeder, R. G. Cell- and stage-specific high-level expression of TBP-related factor 2 (TRF2) during mouse spermatogenesis. Mech Dev 106, 203- 205, doi:10.1016/s0925-4773(01)00439-7 (2001). 30 Lavigne, A. C. et al. Multiple interactions between hTAFII55 and other TFIID subunits. Requirements for the formation of stable ternary complexes between hTAFII55 and the TATA-binding protein. J Biol Chem 271, 19774-19780, doi:10.1074/jbc.271.33.19774 (1996). 31 Carey, M. F., Peterson, C. L. & Smale, S. T. Dignam and Roeder nuclear extract preparation. Cold Spring Harb Protoc 2009, pdb prot5330, doi:10.1101/pdb.prot5330 (2009).

166 TAF1L forms canonical TFIID

32 Spruijt, C. G., Baymaz, H. I. & Vermeulen, M. Identifying specific protein-DNA interactions using SILAC-based quantitative proteomics. Methods Mol Biol 977, 137-157, doi:10.1007/978-1- 62703-284-1_11 (2013).

Acknowledgements We thank Roy Baas for assistance with bioinformatics and Imre Berger (Bristol, the UK) for the cDNAs of TAF1L and TAF7, as well as M.M. for critical reading of the manuscript. This research was supported by Netherlands Organization for Scientific Research (NWO) grants 022.004.019 (S.V.A), ALW820.02.013 (H.T.M.T), and 184.032.201 Proteins@Work (H.R.V.).

Author contributions H.T.M.T. conceived the study with input from S.V.A. and S.C. S.V.A. and J.J.J.B. carried out all cell biology experiments, assisted by S.C. The quantitative proteomics analyses were carried out by H.R.V., assisted by S.V.A. H.T.M.T., S.V.A. and J.J.J.B interpreted the data and S.V.A. wrote the manuscript together with input from J.J.J.B. and H.T.M.T. 5

167 Chapter 5

Supplemental Figures

5

(continues on next page)

168 TAF1L forms canonical TFIID

5

Supplemental Figure 1. Primary sequence alignment of human TAF1 and TAF1L using the CLUSTAL-O platform. Minor variations are observed in the TAF1L regions corresponding to the TAND TAF1 domain (blue), the TAF7 and DNA interaction region DUF3591 (green) and the ZnK (purple) and tandem BrD sites (yellow). Important residues for chromatin recognition are highlighted with a box frame and for TAF7 interaction with red letters.

169 Chapter 5

5

Supplemental Figure 2. Nuclear TFIID enrichment by GFP-TAF7 and GFP-TBP. A: Enrichment analysis of bait and co-precipitates from cytoplasmic and (C) nuclear and extract of induced cell lines compared to control. Each data point in the volcano plots is plotted as the mean of technical triplicates. Black line denotes the threshold between background and significant enrichment (two-tailed Student‘s t-test; FDR ≤1%; S0= 1). TFIID subunits are coloured in green, TBP-specific interactors in purple and baits in magenta. B: Relative abundance calculation of co-enriched cytoplasmic and (D) nuclear interactors normalized to selected single copy TFIID subunit. Data are derived from iBAQ values mean ± s.d. of technical triplicates.

170 General discussion

Chapter 6

General discussion 6

171 Chapter 6

The tree of life has classically been characterized through three main domains of life – Bacteria, Archaea and Eukarya (364, 365). While timing of the origin and relation between these domains remains disputed, advances in the fields of phylogenetics and paleogeology collectively reveal a complex process behind eukaryogenesis, stemming from an ancestral branching of Archaea and involving numerous horizontal gene transfer events from Bacteria, possibly dating prior the emblematic mitochondrial endosymbiosis between the two (366-368). Concomitant emergence of a nuclear envelope, whole genome duplication events and viral and/or transposon integrations have gradually increased the chromatin complexity during the first-to-last eukaryotic common ancestor (FECA-to-LECA) transition (368). By the time of LECA a burst of innovation in the transcription- associated proteins is described leading to expansion, diversification and subfunctionalization of the overall molecular repertoire of eukaryotes (369). As such, shaped under the selective pressure of survival, sophisticated eukaryotic intracellular networks have emerged to coordinate the cytoplasmic and nuclear events leading to transcriptional output. In a combinatorial fashion, numerous layers of gene regulation are provided by the nuclear genomic segregation, chromatin organization, epigenetic modifications and RNA processing. As at the centre of this vast protein network lays the ability for interaction, structural characterization is essential for disentangling the molecular rules and mechanisms of eukaryotic transcription. Among the different layers of eukaryotic gene regulation, transcription initiation 6 encompasses a number of upstream and downstream events to coordinate activator signalling pathways together with promoter/enhancer communication and precise positioning and assembly of the transcriptional machinery (141, 152, 158). As discussed over the course of this scientific work, TFIID is a central player in this process, with an extensive and highly conserved compositional organization throughout the eukaryotic branches (153) (see chapter 2). The diversity of its functional implementations is highlighted by an ubiquitous involvement in pathologies, such as tumorigenesis, neurodegeneration and infertility (360, 362, 370-377). Indeed, the lineage- and tissue-specific expression and functions of TFIID subunits indicate a role in differential transcription programs and developmental processes (274, 328, 378, 379). The functionality of TFIID has mostly been based on phenotypic observations (with a number of prominent exceptions, nonetheless (285, 331, 352)), while structural characterization has primarily been performed in in vitro setups (173, 282), leaving room for functional and context-dependent interrogation of the available structures. The diversity of TFIID composition is amongst the prominent contributors to the differential functional outputs of the complex. Indeed, at different branching points of eukaryotic evolution several functional paralogues and a number of splice variants have emerged (153, 264, 274). In addition, intricate evolutionary links between TFIID and a second transcription-related complex, SAGA, indicate a progressively untwining structural and, therefore functional, diversification (59, 153). As such, the structural and biochemical interrogation of TFIID sheds light onto the mechanistic aspects of its functionality by revealing important interactions (e.g. assembly checkpoints, network components, target sites) and highlighting functional domains. Accordingly, the collective research presented in this thesis aimed at investigating the molecular interactions of the TFIID subunits in a cellular context and characterizing their conservation across eukaryotic evolution. Through an extensive combination of quantitative proteomics, mutational analyses, biochemistry and structural and cellular biology, I identified: 1) cytoplasmic pre-assembly modules of TFIID, 2) sequential pattern of the complex formation and

172 General discussion associated rate-limiting subunits and modules, 3) crucial for the complex integrity interfaces, 4) involvement of additional external complexes in this process (see chapters 3 and 4). Furthermore, based on the reported tissue expression patterns and the observed competitive nature of TAF1- and TAF1L- TFIID formation, my work also suggested additional functionality of the pre- assembly modules beyond the complex formation (see chapter 5). In addition, the compositional organization of chromatin-bound and free TFIID was also addressed, revealing dynamic alterations between the two states of the complex (see chapter 4). Finally, through a comprehensive computational biology approach I characterized the eukaryotic evolution of TFIID, the timing and origin of its many subunits and its gradual separation from the SAGA complex (see chapter 2). Altogether, this work highlights the crucial roles of different TFIID structural elements and indicates their likely implementations in functional outcomes.

6.1. Cellular distribution and assembly of TFIID

6.1.1. Cytoplasmic TFIID Based on the cytoplasmic accumulation of TAF2/TAF8/TAF10 and the successful reconstitution of a symmetrical core TFIID structure, the modular assembly of the complex has been proposed before (281, 282). To address the possibility of discrete module formations between 6 each of the TFIID subunits, in chapter 4 I have investigated their cytoplasmic molecular interactions. As a result, the novel TAF1/TAF7/TBP and TAF5/TAF6/TAF9 pre-assembly modules of TFIID, as well as the co-localization of the TAF4/TAF12 and TAF11/TAF13 HFD partners were identified. The following sections examine each of these modules in the light of current scientific literature and detail possible structural (and functional) implementations of my observations.

TAF1(1L)/TAF7(7L)/TBP Previously described structures of TAF1/TAF7 and TAF1/TBP were indicative of a possible TAF1-mediated heterotrimerization between these proteins in cells (260, 271). The specificity of these interactions is further suggested by its preservation among the functional paralogues of the proteins (277, 330). Here, we reported the existence of a discrete cytoplasmic TAF1/TAF7/TBP heterotrimer, which can also be formed by TAF1L and is limited by the levels of available TAF1(1L). Interestingly, among the members of this module only TAF1 and TAF1L contain a strong NLS, indicating that the protein is not only limiting for the formation of the trimer but also for its subsequent nuclear import. A weaker NLS region is also present within the TAF7 sequence, but it is unlikely to contribute to the nuclear import of the module due to overlap with TAF7/TAF1 interface residues (271). The precise mechanisms of nuclear import of the heterotrimer can be tested by mutation of the TAF1(1L) NLS, which is expected to affect the cytoplasmic-to-nuclear distribution of the three proteins. An NLS-deletion in TAF1(1L) is expected to increase the cytoplasmic accumulation of TAF7 and TBP. Notably, overexpression- based cellular systems are beneficial due to the offered possibility of sequestering interaction partners from the endogenous TAF1 copy and the possibility to express mutant proteins. In chapter 4 we observed that cytoplasmic GFP-TAF1 can exist also outside of its heterotrimer, but in the nucleus the protein was exclusively distributed between TAF1/TAF7/TBP

173 Chapter 6

and holo-TFIID, indicating that, TAF1 may require the heterotrimer formation for nuclear import and for protein stability, which could relate to a crucial stabilizing role of TAF1 module partners for its conformational integrity. In line with this, the DUF3591 region of TAF1 was co-purified and successfully crystalized only in the presence of its binding partner TAF7, suggesting a possible structural support role for TAF7 in TAF1 biosynthesis (271). In light of this, it is striking that our cytoplasmic co-precipitation experiments with GFP-TAF7 did not reveal the TAF1/TAF7/TBP heterotrimer (chapter 4). We speculate that due to a rate-limiting role of TAF1, the heterotrimer is formed by a small fraction of GFP-TAF7, which is then swiftly shuttled into the nucleus. Co- overexpression of GFP-TAF7 with TAF1 (with and without NLS) and TBP could provide further insight into the preferential nuclear import pathways of these proteins. Due to the lack of a NLS in TBP along with the observation that nuclear TBP exists only in complex-bound form (chapter 4), we conclude that, similarly to TAF1, this protein depends on cytoplasmic partners for nuclear import. Akin to GFP-TAF7, GFP-TBP co-precipitated primarily alone from cytoplasmic extracts, likely due to the limiting role of TAF1 for the heterotrimer formation and its efficient shuttling to the nucleus upon assembly. Accordingly, affinity co-purification of endogenous TBP is expected to reveal the presence of a cytoplasmic TAF1/TAF7/TBP. Among the other two TBP-containing nuclear complexes, only the SL1 includes NLS- containing subunits (i.e. TAF1A, TAF1C and TAF1D), indicating that partial nuclear import of 6 TBP is provided by this complex. As such, the TFIIIB complex is the sole TBP partner, which is not expected to associate with the endogenous protein prior nuclear context (in contrast to the exogenously expressed GFP-TBP, which due to its partial cytoplasmic retention also co-enriched for BRF1, chapter 4), posing the question of how nuclear TBP is redistributed among different RNA polymerase machineries. The presence of BTAF1 in our nuclear GFP-TBP co-enrichments indicates active interplay between these proteins outside of the chromatin context, suggesting possible role of BTAF1 in redistribution of TBP. Recent work has reported the in vitro formation of a TBP/TAF11/TAF13 sub-complex (345). We did not observe exclusive co-enrichment between these proteins in neither of our cellular extracts, suggesting it may exists in the context of a holo-TFIID. Indeed, structural work has revealed key TBP:TAF11/TAF13 interactions during deposition of TBP onto promoter regions by chromatin-bound TFIID (162). Finally, we are the first to characterize the functional paralogue of TAF1, TAF1L, in the context of a cytoplasmic TAF1L/TAF7/TBP assembly module and a nuclear holo-TFIID formation (see chapter 5). Comparative relative abundance analysis of co-purified interactors from GFP-TAF1L and GFP-TAF1 cellular extracts revealed less efficient integration of the paralogue into the complex and suggested possible functional implementations of this competition. We speculate that TAF1 expression levels will be reduced during spermatocyte development, which is characterized by upregulation of TAF1L. During the transient phase of TAF1/TAF1L co- expression, TAF1L incorporation into TFIID will be hindered by the presence of TAF1. Notably, during spermatogenesis the X chromosome (and therefore the X-linked TAF1) is temporarily inactivated, which potentially substantiates our hypothesis of mutually exclusive expression patterns between the two proteins (277). Interestingly, the two module partners of TAF1L also exist as the spermatogenesis-associated paralogues TAF7L and TRF2, with an established functional (but indirect) link (330) and a potential for TAF1 (TAF1L) interaction. Further work in a context-

174 General discussion specific cellular system is required to elucidate the affinity of these proteins for formation of the heterotrimer and of holo-TFIID and the subsequent functional implementations.

TAF2/TAF8/TAF10 Cytoplasmic existence of the TAF2/TAF8/TAF10 complex has been described (281). While our work revealed separate cytoplasmic dimers of TAF2/TAF8 and of TAF8/TAF10 (see chapter 4), a nuclear enrichment of TAF2/TAF8/TAF10 indicated existence of this heterotrimer in our system as well. We speculate that its lack in the cytoplasmic extract of the cells is due to an efficient nuclear import upon assembly, which leads to the accumulation and preferential co- enrichment by our bait proteins of incomplete module species. As such, deletion of TAF8-NLS, shown as central for the trimer nuclear import (281), is expected to retain the TAF2/TAF8/TAF10 trimer in the cytoplasm. Notably, nuclear GFP-TAF2 co-precipitated only as a part of TAF2/TAF8/TAF10 trimer and holo-TFIID, indicating that it is only imported into the nucleus in the context of the trimer. In contrast, nuclear GFP-TAF8 co-enriched for the trimer and TFIID, as well as for TAF8/TAF10, while GFP-TAF10 – only for TAF10/TAF8 and TFIID (see chapter 4). This indicates that while the trimer is prerequisite for the nuclear import of TAF2, the import of TAF8 and TAF10 does not require TAF2 presence. Considering that TAF8 is the only NLS-containing member of the trimer, it remains to be seen how TAF10 (as a part of the trimer) contributes to the import of TAF2. 6 Interestingly, addition of excess TAF2 to a recombinant 7TAF complex (asymmetrical core TFIID, including TAF8/TAF10) led to 8TAF formation, indicating that TAF2 is likely capable of integrating within TFIID on its own (282). As such the strict trimer-dependency for its nuclear import may serve a regulatory, rate-limiting role for holo-TFIID assembly.

TAF3/TAF10 Computational evolutionary analysis of TAF3 presence across the eukaryotic tree places its origins as a functional duplication from TAF8 during the Opisthokonta branching, which has acquired a PHD domain with epigenetic properties (lost in late Fungi) (see chapter 2). It shares HFD with TAF8, which provides specificity for TAF10-HFD (176, 339). Interestingly, interrogation of cytoplasmic extracts for molecular interactors of GFP-TAF3 did not reveal TAF3/TAF10 dimers (and neither did cytoplasmic GFP-TAF10 co-IP), indicating that: 1) either the two proteins interact elsewhere, and/or that 2) once bound they are rapidly imported into the nucleus (see chapter 4). To exclude one of the possibilities, deletion of TAF3-NLS could potentially allow for cytoplasmic dimer accumulation. Notably, nuclear co-IP of GFP-TAF3 revealed relative instability in the formed holo-TFIID and a surprising reduction of TAF8 levels, as compared to its abundance in the rest of our nuclear co-IPs (see chapter 4). Such competitive integration of the two proteins is not in line with their proposed co-existence within holo-TFIID, as well as with the central role of TAF3 in TFIID for promoter recognition, and can likely be attributed to sequestering effects of GFP-TAF3 overexpression. Endogenous affinity co- precipitation or decreased expression levels of exogenous GFP-TAF3 are expected to improve the observed integration of the protein within holo-TFIID. Interestingly, we observed TAF3/TAF10 dimers in the chromatin extracts of our cells, indicating that the interaction is enriched outside the assembly context of TFIID (see chromatin-bound TFIID discussion).

175 Chapter 6

TAF4(4B)/TAF12 The in vitro formation of this heterodimer was described before (290). Here we demonstrated its preferential cytoplasmic formation and potential to further grow into a core TFIID (see chapter 4). As cytoplasmic core TFIID was only enriched by GFP-TAF4B and GFP- TAF12, but not GFP-TAF4, we speculate that TAF4-formed modules are rapidly imported into the nucleus upon formation. Indeed, among the three proteins TAF4 contains the strongest NLS and TAF4B lacks a NLS. We attribute the ability of GFP-TAF4B to assemble cytoplasmic core TFIID to its inefficient (as compared to TAF4) nuclear import. Addition of an NLS to GFP- TAF4B is expected to reduce the time of its cytoplasmic residence and the observed cytoplasmic core TFIID. Interestingly, both GFP-TAF4 and GFP-TAF4B co-enriched for each other alongside TAF12, indicating that upon TAF12 binding, the heterodimers further associate. As such mechanism would provide the basis for the two-copy core TFIID formation, further research is required to elucidate the interface for the dimers‘ interaction and the subsequent recruitment mechanism of TAF5/TAF6/TAF9 trimer. Among all TFIID subunits, GFP-TAF12 was the only one to form the 7TAF asymmetric core complex in the cytoplasm (see chapter 4). As TAF12 contains an NLS and is a direct (and early) partner of TAF4, similarly endowed with an NLS sequence, we could not attribute this formation to a cytoplasmically-retained GFP-TAF12. Therefore, the 7TAF presence in cytoplasmic 6 GFP-TAF12 co-IPs may be indicative of a central, rate-limiting role of TAF12 in core TFIID assembly. Indeed, as we speculate that the complex formation stems from the two-copy TAF4/TAF12 structures, it is conceivable that the available cellular TAF12 levels act as rate- limiting for the subsequent TFIID assembly. As such depletion of TAF12 as well as prevention of (TAF4/TAF12)x2 dimerization are both expected to halt TFIID assembly. Interestingly, we did not observe 8TAF formation (which includes TAF2 in addition to the asymmetrical core TFIID), in line with our speculation of a rapid nuclear TAF2 import.

TAF5/TAF6/TAF9(B) We were the first to identify the cytoplasmic assembly of a TAF5/TAF6/TAF9 TFIID module and to extensively characterize the interactions between these proteins (see chapter 3). Our structural work revealed a triangular organization, with TAF5-NTD and -WD40 domains sandwiching the TAF6/TAF9 HFD dimer and providing scaffolding interface. Three important contacts points between TAF9 C-terminal region and TAF5 WD40 were identified, allowing TAF9 to partially wrap around the surface of the barrel-like WD40 domain. This was in contrast with previously reported interactions between TAF5, TAF6 and TAF9 in the context of a two-copy core TFIID (282), indicating possible internal rearrangements of the TFIID modules upon incorporation into larger structures. Importantly, we validated the TAF5/TAF9 interface seen in our structure in a cellular context through mutational analyses and unequivocally demonstrated a crucial role of this region for TFIID integrity. Notably, as TAF9 is shared between TFIID and SAGA, our mutational analysis also allowed a comparative investigation of TAF9 integration into SAGA complex. Curiously and in contrast to TFIID, only one out of our three TAF9 mutants prevented SAGA formation, indicating significant protein plasticity in accommodating the two complexes and highlighting differential interfaces between TAF5-TAF6-TAF9 and TAF5L- TAF6L-TAF9 structures. Interestingly, a recent structural characterization of yeast SAGA complex revealed that the relative orientations of TAF5-NTD and -WD40 differ from the one of TAF5 in

176 General discussion

TFIID, highlighting a differential role for TAF5 interface in discrimination between the two complexes (380, 381). Further probing of the region may reveal crucial discriminatory points for preferential TFIID or SAGA formation. Additionally, cytoplasmic assembly characterization of the SAGA complex could further indicate the molecular mechanisms dictating the complex-dedication of shared TAFs. Our TAF5 mutants, targeting the TAF5/TAF9 interface and mirroring the TAF9 mutants, assisted in the identification of an otherwise transient interaction between newly- synthesized TAF5 and the folding chaperonin complex CCT, which is known to engage a number of seven-blade WD40-containing cargoes (382). Indeed, we confirmed that the TAF5/CCT interaction is mediated via TAF5-WD40 and demonstrated that the mutant accumulation within CCT is linked to the aberrant TAF5/TAF9 interaction rather than misfolding of the WD40 domain (circular dichroism spectroscopy, data not shown). We also identified a crucial role for TAF5-NTD in the release of nascent TAF5 from CCT and speculate that the TAF5-NTD serves as a docking platform for initial TAF6/TAF9 association. Upon TAF5 release from the CCT folding chamber a subsequent stabilization of the interaction through the TAF9 C-terminal interface, which wraps the folded TAF5 barrel, prevents its re-insertion into the CCT chamber. Interrogation of cytoplasmic TAF5/TAF6/TAF9 formation indicated, that TAF5 is rate- limiting for the assembly of this module, since neither GFP-TAF6 nor GFP-TAF9 could co-enrich for this heterotrimer (see chapters 3 and 4). In addition, lack of the trimer in these cytoplasmic 6 co-IPs further suggests a rapid nuclear import upon formation, which is in line with the NLS sequence within TAF5. Notably, TAF5L does not include an NLS. In contrast, TAF6L but not TAF6 contains a predicted NLS. Such opposite NLS correlation between these partners and the dependence of biosynthesis TAF5 on TAF6/TAF9 interaction indicates mechanistic differences between protein folding, cytoplasmic assembly, composition of sub-modules and nuclear import, which likely participates in the discriminatory pathways of TFIID and SAGA assemblies. Indeed, yTAF5 which is also shared between the two complexes lacks an NLS, indicating that the metazoan TAF5 may have evolved to guide swift nuclear import upon synthesis. The dependence on TAF6/TAF9 may therefore represent a mechanism to ensure TAF5/TAF6/TAF9 association prior to TFIID formation. TAF5-NLS deletion experiments could address the significance of this import in the context of TFIID and SAGA formation.

TAF11/TAF13 These two HFD partners of TFIID share evolutionary origin with the SAGA SUPT3H subunit (see chapter 2). Indeed, SUPT3H contains the HFDs of both TAF11 and TAF13, indicating a likely gene duplication and subsequent split, which eventually subfunctionalized towards TFIID. All three proteins are implemented in TBP interactions, suggesting a conserved functionality (345, 383). We did not observe the TAF11/TAF13/TBP trimer in vivo and based on recent structural characterization of promoter-bound TFIID, I suggest that the reported interactions occur in the context of holo-TFIID and are essential for DNA-bound TFIID rearrangement events during TBP release from complex (162). GFP-TAF11 and GFP-TAF13 co- enrichments from cytoplasmic extracts revealed isolated dimerization between the proteins and indicated simultaneous integration of this module into TFIID at later (nuclear) stages of the complex assembly (see chapter 4).

177 Chapter 6

Altogether, molecular interactions of TFIID subunits in the cytoplasmic compartment revealed the existence of discrete pre-assembly modules and participation of at least two major cytoplasmic networks (protein folding and nuclear import) that serve as regulatory checkpoints in the complex formation (chapters 3 and 4). Furthermore, our work exemplifies the benefits of mutational and overexpression molecular approaches together with quantitative proteomics, which allow for the identification of otherwise transient and/or intermediate complex-formation interactions and stages. These observations, together with the reported importin-α1 pathway of TAF2/TAF8/TAF10 nuclear import (281), collectively draw a picture of several regulatory steps involved in TFIID complex formation. Most likely, this exemplifies the mechanistic aspects of large nuclear protein assemblies in general.

6.1.2. Nuclear TFIID

Compositional characterization of DNA-free holo-TFIID complex The investigation of TFIID from the nuclear compartment revealed an overall canonical composition of the complex with an elevated presence of the two-copy core members TAF4, TAF5, TAF6, TAF9 and TAF12 and lower levels for the single-copy TAF1, TAF2, TAF3, TAF7, TAF11 and TAF13 subunits (see chapter 4). Furthermore, most of the baits exhibited elevated co- 6 enrichment of cytoplasmic module partners indicating their nuclear presence as a likely step in the holo-TFIID assembly process. Indeed, TAF1/TAF7/TBP, TAF2/TAF8/TAF10, TAF4/TAF12 and TAF11/TAF13 were all detected in the nucleus. Additionally, several bait proteins resided exclusively in module and holo-TFIID structure (TAF1, TAF2, TAF5, TAF12 and TBP), suggesting that each of these proteins requires module partners in order to enter the nucleus. Such observations are in line with the biology of these proteins, as neither TAF2 nor TBP include an NLS sequence, TAF5 biosynthesis depends on TAF6/TAF9 association (see chapter 3) and TAF1 may require TAF7 for conformational stabilization (271). An elevated co-enrichment of TAF4, TAF8 and TAF10 subunits consistently reappeared among all our nuclear co-IPs (see chapter 4). We attribute the observed TAF8 increase as a complementary to the increase of its HFD partner TAF10. The mechanistic background of TAF4 and TAF10 elevation, however, remains largely elusive. A technical explanation could include a bias stemming from the overexpression nature of our system, or overrepresentation of protein identification due to preferential peptide detection during our qMS analysis. An alternative explanation could involve certain functionality behind increases in specific subunits, in which TAF4 and/or TAF10 participate in the response-regulation of global TFIID levels. As such, the sensing of TFIID subunits overexpression could be triggering a pathway for a corresponding boost or reduction in endogenous TAFs/TBP production in order to match the elevated levels of non- assembled subunits. Such mechanism would indicate the existence of a sense-responding regulation of TFIID formation either on transcriptional or translational level (or both). Previous work has indicated that TAF4 overexpression alone results in the enhancement of induced pluripotent stem cells (iPSCs) generation (313). The study also characterized an overall elevation of TFIID subunit expression in mouse embryonic stem cells and highlighted the specific targeting of the promoter- upstream region of TAF4 by the Nanog and Oct4 reprogramming factors reported elsewhere (384). Further work is required to address the possible existence of a gene expression regulatory loop between the different TAFs. An alternative mechanism behind the observed accumulation of

178 General discussion

TAF10 and TAF4 could stem from the formation of ―dead-end‖ complexes, which fail to proceed to full TFIID assembly. Finally, not all nuclear baits co-enriched for a stable holo-TFIID, indicated by a consistent loss of TAF2 and/or TAF13 in several samples (i.e. GFP-TAF2, GFP-TAF3, GFP-TAF6, GFP- TAF9, GFP-TAF11 and GFP-TAF13). We attribute these variations to technical aspects, since our previous work did not reveal similar subunits loss in nuclear co-IPs from GFP-TAF6 and GFP- TAF9 cell lines (see chapter 3). However, it is noteworthy that the loss of TAF2 and TAF13 consistently coincided with a preferential enrichment of TAF7L over TAF7 in the identified holo- TFIID complexes (see chapter 4), which leaves room for a possible functionality behind our observations.

Compositional characterization of DNA-bound holo-TFIID complex Analysis of TFIID subunits from the chromatin indicated destabilization of the TFIID complex, when it is chromatin-bound. Indeed, preferential co-enrichment of core-TFIID members as opposed to holo-TFIID indicated that the complex composition becomes rearranged upon chromatin association (see chapter 4). Interestingly, as core TFIID includes a number of HFD- containing proteins, its stable preservation onto the DNA is partially reminiscent of a histone octamer-like structure. In contrast, a recent structural characterization of reconstituted promoter- bound TFIID does not reveal HFD-mediated DNA contacts (162), indicating that further work is 6 required to elucidate the mechanisms behind the reported here stable core TFIID association with chromatin. The most striking and consistent chromatin subunit loss was for TBP, which was only co- enriched by GFP-TAF1 bait or in the presence of co-enriched endogenous TAF1 representing a minor fraction of preserved chromatin-bound holo-TFIID (see chapter 4). Such TAF1- dependency is indicative of a central stabilizing role of TAF1, but not TAF11/TAF13, for TBP in the context of TFIID. Additionally, chromatin GFP-TBP co-IPs revealed preferential localization of the protein within non-TFIID complexes, such as the regulatory NC2 and TFIIA, further highlighting the low stability of TBP when TFIID is chromatin-bound. This was in contrast with nuclear (chromatin-free) GFP-TBP, which primarily resided within TFIID and was to a lower extent divided between the TFIIIB and SL1 complexes, as well as in complex with BTAF1. The lack of TFIIIB and SL1 complex members among the co-enriched interactors of chromatin-bound GFP-TBP was not expected. We speculate TBP may only be transiently associating with RNAPI/III genes, which could explain the need for NC2/BTAF1-regulated TBP recycling from RNAPII, but not RNAPI/III, promoters. Finally, TAF10-containing assembly modules were detected in our chromatin extract, suggesting a possible modular TFIID disassembly. Surprisingly, beside TAF2/TAF8/TAF10 we also observed co-enrichment between TAF3 and TAF10, indicating that the interaction between these two HFD partners may include chromatin-related functionality. Akin to the nuclear extract analysis, only GFP-TAF3, but not GFP-TAF10, co-enriched for this interaction and, together with the associated loss of TAF8 in GFP-TAF3 co-IPs, these data indicate competition between TAF3 and TAF8 for TAF10. Altogether, we show that holo-TFIID forms in the nucleus from cytoplasmically preassembled and imported modules and that the composition of the complex is relatively preserved among the different subunits (see chapters 4 and 5). Notably, chromatin-bound TFIID

179 Chapter 6

is largely destabilized, resulting primarily in preservation of the core of TFIID. This may be the result of conformational rearrangements required for TBP release (135, 158) and the seeming accompanying loss of TAF1, TAF2, TAF3, TAF7, TAF11 and TAF13.

6.2. Sharing the TAFs

Four TAFs are shared between TFIID and SAGA in humans – TAF9 (9B), 10 and 12 (see chapter 2). Curiously, cytoplasmic assessment of co-enriched TFIID and SAGA by the shared TAFs as bait proteins revealed a preference for SAGA subunits over TFIID (see chapter 4). This pattern was lower in GFP-TAF12 co-precipitates, which was in line with the enhanced capacity of the protein to form cytoplasmic 7TAF complexes. All cytoplasmic co-enrichments of SAGA included the central SAGA subunits TAF5L, TAF6L, TADA1, SUPT7L, TAF9(B), TAF10 and TAF12, which together correspond to core TFIID and indicate similarities in the assembly pathways of the two complex. The observed specificity of CCT for TAF5L demonstrated further parallels (see chapter 3). However, the differential effect of our TAF9 mutants on the complex integrity of SAGA and TFIID highlighted the existence of crucial complex-assembly interface differences within the shared TAF for 6 interaction with TAF5/TAF6 or TAF5L/TAF6L. Based on the observed preferential enrichment of SAGA subunits from cytoplasm extracts, we speculate that cytoplasmic vs nuclear localization may contribute to the segregation of the two complex assemblies. In line with this, crucial NLS differences are observed in TAF4 and TADA1 – with the former including the strongest core TFIID NLS and the latter fully lacking one, and in TAF5/TAF5L and TAF6/TAF6L, both of which contain NLS in an opposing manner. TAF5 includes NLS, which likely actively participates in TAF5/TAF6/TAF9 nuclear import (see chapter 4), but TAF5L lacks an NLS. In a similar fashion, TAF6 is deprived of a NLS, while TAF6L contains one. Considering the reported here cytoplasmic SAGA enrichment, it is conceivable that the differential NLS distribution between TAF5/TAF6 and TAF5L/TAF6L represent a mechanistic step for discrimination between the fate of TAF5 and TAF5L upon release from the CCT complex. Nascent TAF5 would be retained by the CCT complex until it associates with the TAF6/TAF9 pair. This heterotrimer would be at imported rapidly into the nucleus to find the TAF4/TAF12 dimer. In contrast, due to lack of an NLS, cytoplasmic TAF5L requires the assembly of other central subunits for efficient import into the nucleus. NLS-swap experiments will test the differential distribution and assembly rates between the TFIID and SAGA. Additionally, yTAF5 is shared between both TFIID and SAGA and does not include an NLS, underscoring a likely evolutionary pressure on the metazoan TAF5 acquisition and further highlighting the fascinating structural and functional links between the two complex. To gain further insight into the differences and similarities between TFIID and SAGA, we investigated the evolutionary history of TFIID and revealed that it is intimately intertwined with the history of SAGA (see chapter 2). Indeed, both complexes take roots in a common ancestral origin in pre-LECA dating back ~2 billion years. It is likely that at the time of eukaryogenesis a rudimentary pre-initiation complex consisted primarily of an archaeal-like TBP and TFIIB, which gradually expanded their molecular repertoire in the time between FECA and LECA to match the increasingly higher demand for transcriptional complexity. We suggest that in LECA TBP was integrated into an ancestral TFIID/SAGA complex with a minimal composition of core

180 General discussion

TAF4/TADA1, TAF5, TAF6, TAF9 and TAF12, surrounded by TAF8/SUPT7L, TAF10, SUPT3H (TAF11/TAF13), TAF1, TAF2 and TAF7. Upon the expansion of the eukaryotic tree, the complex gradually began to separate into two structurally and functionally different complexes – TFIID and SAGA, which provided further possibilities for transcriptional regulation and diversification. A hallmark of eukaryotic transcription is the epigenetic layer of control provided by histone tail modifications (32, 385). Intriguingly, we observed significant dynamics in epigenetic domains of TFIID and SAGA, suggesting continuous adaptation of the complexes to the transcriptional landscape of different eukaryotic branches. As such TAF1, TAF2 and SUPT7L gained and lost bromodomains with particularly prominent and contrasting differences between the metazoan and fungal kingdoms (see chapter 2). TAF3 resulted from a TAF8 duplication in opisthokonta (aka Metazoa and Fungi). After this TAF3 gained a PHD finger domain, which was subsequently lost in late fungi. Together with the lineage-specific changes within the TF binding region of TAF4, this collectively point towards a selective and dynamic modification within the functional domains of TFIID. The function of TFIID has been re-adjusting and expanding throughout evolution, but the structure and composition of TFIID is remarkably conserved. Indeed, key domains such as HFs, WD40, HEATs and TAND, involved in TAF/TBP interactions, remain a structural constant. Such conservation is indicative of a crucial role of the complex structure for function and this fits 6 well with the proposed model of TFIID as a molecular ruler for TBP accurate delivery at a measured distance from TSS (162). The key role of TFIID geometry is further highlighted by the fact that its metazoan diversification includes functional paralogues, of which most have been shown to assemble TFIID with canonical composition (see chapters 4 and 5). We observe that these paralogues are a result of subunit duplications at different branching points of metazoan evolution and contribute either to a functional diversification of TFIID or further separation from SAGA (i.e. TAF5 and TAF6). Altogether, our evolutionary analysis reveals extensive diversification over time between TFIID and SAGA, which revolved around the gradual separation of their core composition. This diversification is supplemented with dynamic functional adaptations of the complexes with direct implementations in epigenetic regulation of transcription. This work exemplifies how computational and evolutionary biology methods can be utilized to highlight crucial structural and functional elements within large molecular assemblies and provide directions for further research.

6.3. Final remarks and conclusions

Together, the scientific work discussed in this thesis illustrates how the structure of TFIID revolves and evolves around its functionality. Diversifications in the chromatin-engaging and TF-interacting subunits as well as in their direct TFIID interaction partners were discussed thoroughly in the light of evolution and in the context of complex assembly. The differential pathways of preferential TFIID vs SAGA assemblies have also been addressed and questions for further research were highlighted. The role of protein complexes (e.g. CCT) in TFIID assembly may be a major contributor to this. Additional mechanisms such as nuclear import and co-translational assembly point further towards a complex network of regulation involved in the function of TFIID (281, 332). Interestingly, a number of TFIID subunits possess intrinsically disordered regions

181 Chapter 6

(IDRs), which in the recent years have drawn much attention due to their role in the spatial organization and co-occurrence of large molecular architectures with co-dependent functionality (386, 387). IDRs have been proposed to contribute to the formation of transcriptionally active hubs through the process of phase separation and the presence of TFIID within such aggregates is expected (388). Strikingly, in this work none of the previously described TFIID interactions (e.g. with Mediator and additional PIC members) was detected, which stresses their transient nature. In addition, significant destabilization of chromatin-bound TFIID was observed. Recent reports suggest a role for TFIID in promoter pausing through TAF1/TAF2-mediated RNAPII stalling (389). In our work, we observed a small portion of chromatin-bound holo-TFIID, leaving possibility for a holo-TFIID role at promoter regions outside of TBP delivery. Interestingly, Shao et al. reports a role of TATA box in decreasing of Pol II promoter pausing, indicating an a possible interplay between TBP association with specific DNA sequences and the downstream preservation of TAF1/TAF2 promoter-binding or even holo-TFIID stability.

* * *

6 Altogether, the work of my thesis exemplifies how integration of structural, proteomic, cellular and computational biology methods can be utilized to comprehensively investigate the molecular rules and mechanisms driving large protein assemblies and their functionalities. The work on the TFIID complex actively contributes to the on-going research in the field and helps to rationally place the complex functionality in the light of its surround molecular environment and the transcriptional network of eukaryotes.

182

Appendices

A

183

Bibliography 1. R. E. Franklin, R. G. Gosling, Molecular configuration in sodium thymonucleate. Nature 171, 740- 741 (1953). 2. R. E. Franklin, R. G. Gosling, Evidence for 2-chain helix in crystalline structure of sodium deoxyribonucleate. Nature 172, 156-157 (1953). 3. B. Maddox, The double helix and the 'wronged heroine'. Nature 421, 407-408 (2003). 4. J. D. Watson, F. H. Crick, The structure of DNA. Cold Spring Harb Symp Quant Biol 18, 123-131 (1953). 5. M. H. Wilkins, A. R. Stokes, H. R. Wilson, Molecular structure of deoxypentose nucleic acids. Nature 171, 738-740 (1953). 6. J. C. Wang, Helical repeat of DNA in solution. Proc Natl Acad Sci U S A 76, 200-203 (1979). 7. P. L. Privalov et al., What drives proteins into the major or minor grooves of DNA? J Mol Biol 365, 1-9 (2007). 8. J. C. Venter et al., The sequence of the . Science 291, 1304-1351 (2001). 9. R. Phillips, R. Milo, A feeling for the numbers in biology. Proc Natl Acad Sci U S A 106, 21465- 21471 (2009). 10. D. W. Sumners, DNA, Knots and Tangles. Banagl M., Vogel D. (eds) The Mathematics of Knots. Contributions in Mathematical and Computational Sciences 1, (2011). 11. P. Oudet, M. Gross-Bellard, P. Chambon, Electron microscopic and biochemical evidence that chromatin structure is a repeating unit. Cell 4, 281-300 (1975). 12. R. D. Kornberg, Chromatin structure: a repeating unit of histones and DNA. Science 184, 868-871 (1974). 13. A. L. Olins, D. E. Olins, Spheroid chromatin units (v bodies). Science 183, 330-332 (1974). 14. E. Segal, J. Widom, What controls nucleosome positions? Trends Genet 25, 335-343 (2009). 15. O. Deniz, O. Flores, M. Aldea, M. Soler-Lopez, M. Orozco, Nucleosome architecture throughout the cell cycle. Sci Rep 6, 19729 (2016). 16. K. Luger, A. W. Mader, R. K. Richmond, D. F. Sargent, T. J. Richmond, Crystal structure of the A nucleosome core particle at 2.8 A resolution. Nature 389, 251-260 (1997). 17. R. K. McGinty, S. Tan, Nucleosome structure and function. Chem Rev 115, 2255-2273 (2015). 18. A. K. Shaytan, D. Landsman, A. R. Panchenko, Nucleosome adaptability conferred by sequence and structural variations in histone H2A-H2B dimers. Curr Opin Struct Biol 32, 48-57 (2015). 19. A. K. Shaytan et al., Coupling between Histone Conformations and DNA Geometry in Nucleosomes on a Microsecond Timescale: Atomistic Insights into Nucleosome Functions. J Mol Biol 428, 221-237 (2016). 20. C. Anselmi, G. Bocchinfuso, P. De Santis, M. Savino, A. Scipioni, Dual role of DNA intrinsic curvature and flexibility in determining nucleosome stability. J Mol Biol 286, 1293-1301 (1999). 21. A. Thastrom et al., Sequence motifs and free energies of selected natural and non-natural nucleosome positioning DNA sequences. J Mol Biol 288, 213-229 (1999). 22. F. X. Wilhelm, M. L. Wilhelm, M. Erard, M. P. Duane, Reconstitution of chromatin: assembly of the nucleosome. Nucleic Acids Res 5, 505-521 (1978). 23. R. J. Burgess, Z. Zhang, Histone chaperones in nucleosome assembly and human disease. Nat Struct Mol Biol 20, 14-22 (2013). 24. S. Henikoff, M. M. Smith, Histone variants and epigenetics. Cold Spring Harb Perspect Biol 7, a019364 (2015). 25. P. B. Talbert, S. Henikoff, Histone variants on the move: substrates for chromatin dynamics. Nat Rev Mol Cell Biol 18, 115-126 (2017). 26. S. P. Hergeth, R. Schneider, The H1 linker histones: multifunctional proteins beyond the nucleosomal core particle. EMBO Rep 16, 1439-1453 (2015). 27. F. Thoma, T. Koller, A. Klug, Involvement of histone H1 in the organization of the nucleosome and of the salt-dependent superstructures of chromatin. J Cell Biol 83, 403-427 (1979). 28. P. J. Robinson, D. Rhodes, Structure of the '30 nm' chromatin fibre: a key role for the linker histone. Curr Opin Struct Biol 16, 336-343 (2006). 29. R. C. Huang, J. Bonner, Histone, a suppressor of chromosomal RNA synthesis. Proc Natl Acad Sci U S A 48, 1216-1222 (1962). 30. V. G. Allfrey, R. Faulkner, A. E. Mirsky, Acetylation and Methylation of Histones and Their Possible Role in the Regulation of Rna Synthesis. Proc Natl Acad Sci U S A 51, 786-794 (1964).

184

31. K. Murray, The Occurrence of Epsilon-N-Methyl Lysine in Histones. Biochemistry 3, 10-15 (1964). 32. T. Jenuwein, C. D. Allis, Translating the histone code. Science 293, 1074-1080 (2001). 33. M. Yun, J. Wu, J. L. Workman, B. Li, Readers of histone modifications. Cell Res 21, 564-578 (2011). 34. K. S. Champagne, T. G. Kutateladze, Structural insight into histone recognition by the ING PHD fingers. Curr Drug Targets 10, 432-441 (2009). 35. J. Kim et al., Tudor, MBT and chromo domains gauge the degree of lysine methylation. EMBO Rep 7, 397-403 (2006). 36. R. Marmorstein, S. L. Berger, Structure and function of bromodomains in chromatin-regulating complexes. Gene 272, 1-9 (2001). 37. S. Maurer-Stroh et al., The Tudor domain 'Royal Family': Tudor, plant Agenet, Chromo, PWWP and MBT domains. Trends Biochem Sci 28, 69-74 (2003). 38. B. D. Strahl, C. D. Allis, The language of covalent histone modifications. Nature 403, 41-45 (2000). 39. B. E. Bernstein et al., Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 120, 169-181 (2005). 40. B. D. Strahl, R. Ohba, R. G. Cook, C. D. Allis, Methylation of histone H3 at lysine 4 is highly conserved and correlates with transcriptionally active nuclei in Tetrahymena. Proc Natl Acad Sci U S A 96, 14967-14972 (1999). 41. R. Schneider et al., Histone H3 lysine 4 methylation patterns in higher eukaryotic genes. Nat Cell Biol 6, 73-77 (2004). 42. B. Schuettengruber, D. Chourrout, M. Vervoort, B. Leblanc, G. Cavalli, Genome regulation by polycomb and trithorax proteins. Cell 128, 735-745 (2007). 43. M. Rodriguez-Paredes, M. Esteller, Cancer epigenetics reaches mainstream oncology. Nat Med 17, 330-339 (2011). 44. A. Babu, R. S. Verma, Chromosome structure: euchromatin and heterochromatin. Int Rev Cytol 108, 1-60 (1987). 45. I. Solovei, K. Thanisch, Y. Feodorova, How to rule the nucleus: divide et impera. Curr Opin Cell Biol 40, 47-59 (2016). 46. W. R. Bauer, J. J. Hayes, J. H. White, A. P. Wolffe, Nucleosome structural changes due to A acetylation. J Mol Biol 236, 685-690 (1994). 47. M. Garcia-Ramirez, C. Rocchini, J. Ausio, Modulation of chromatin folding by histone acetylation. J Biol Chem 270, 17923-17928 (1995). 48. L. Hong, G. P. Schroth, H. R. Matthews, P. Yau, E. M. Bradbury, Studies of the DNA binding properties of histone H4 amino terminus. Thermal denaturation studies reveal that acetylation markedly reduces the binding constant of the H4 "tail" to DNA. J Biol Chem 268, 305-314 (1993). 49. V. Mutskov et al., Persistent interactions of core histone tails with nucleosomal DNA following acetylation and transcription factor binding. Mol Cell Biol 18, 6293-6304 (1998). 50. P. Filippakopoulos, S. Knapp, Targeting bromodomains: epigenetic readers of lysine acetylation. Nat Rev Drug Discov 13, 337-356 (2014). 51. J. W. Tamkun et al., brahma: a regulator of Drosophila homeotic genes structurally related to the yeast transcriptional activator SNF2/SWI2. Cell 68, 561-572 (1992). 52. K. K. Lee, J. L. Workman, Histone acetyltransferase complexes: one size doesn't fit all. Nat Rev Mol Cell Biol 8, 284-295 (2007). 53. J. E. Brownell, C. D. Allis, An activity gel assay detects a single, catalytically active histone acetyltransferase subunit in Tetrahymena macronuclei. Proc Natl Acad Sci U S A 92, 6364-6368 (1995). 54. J. E. Brownell et al., Tetrahymena histone acetyltransferase A: a homolog to yeast Gcn5p linking histone acetylation to gene activation. Cell 84, 843-851 (1996). 55. S. Y. Roth, J. M. Denu, C. D. Allis, Histone acetyltransferases. Annu Rev Biochem 70, 81-120 (2001). 56. P. Y. Wu, C. Ruhlmann, F. Winston, P. Schultz, Molecular architecture of the S. cerevisiae SAGA complex. Mol Cell 15, 199-208 (2004). 57. R. D. Steunou AL., Côté J. , Regulating Chromatin by Histone Acetylation. In: Workman J., Abmayr S. (eds) Fundamentals of Chromatin. Springer, New York, NY, 147-212 (2014). 58. P. A. Grant et al., Expanded lysine acetylation specificity of Gcn5 in native complexes. J Biol Chem 274, 5895-5900 (1999). 59. G. Spedale, H. T. Timmers, W. W. Pijnappel, ATAC-king the complexity of SAGA during evolution. Genes Dev 26, 527-541 (2012).

185

60. T. Suganuma et al., ATAC is a double histone acetyltransferase complex that stimulates nucleosome sliding. Nat Struct Mol Biol 15, 364-372 (2008). 61. C. Bian et al., Sgf29 binds histone H3K4me2/3 and is required for SAGA complex recruitment and histone H3 acetylation. EMBO J 30, 2829-2842 (2011). 62. D. E. Sterner et al., Functional organization of the yeast SAGA complex: distinct components involved in structural integrity, nucleosome acetylation, and TATA-binding protein interaction. Mol Cell Biol 19, 86-98 (1999). 63. J. K. Tong, C. A. Hassig, G. R. Schnitzler, R. E. Kingston, S. L. Schreiber, Chromatin deacetylation by an ATP-dependent nucleosome remodelling complex. Nature 395, 917-921 (1998). 64. A. You, J. K. Tong, C. M. Grozinger, S. L. Schreiber, CoREST is an integral component of the CoREST- human histone deacetylase complex. Proc Natl Acad Sci U S A 98, 1454-1458 (2001). 65. E. Seto, M. Yoshida, Erasers of histone acetylation: the histone deacetylase enzymes. Cold Spring Harb Perspect Biol 6, a018713 (2014). 66. H. A. Eskandarian et al., A role for SIRT2-dependent histone H3K18 deacetylation in bacterial infection. Science 341, 1238858 (2013). 67. A. Vaquero et al., Human SirT1 interacts with histone H1 and promotes formation of facultative heterochromatin. Mol Cell 16, 93-105 (2004). 68. S. Y. Gozani O., Histone Methylation in Chromatin Signaling. In: Workman J., Abmayr S. (eds) Fundamentals of Chromatin. Springer, New York, NY, 213-256 (2014). 69. P. Byvoet, G. R. Shepherd, J. M. Hardin, B. J. Noland, The distribution and turnover of labeled methyl groups in histone fractions of cultured mammalian cells. Arch Biochem Biophys 148, 558-567 (1972). 70. S. D. Taverna, H. Li, A. J. Ruthenburg, C. D. Allis, D. J. Patel, How chromatin-binding modules interpret histone modifications: lessons from professional pocket pickers. Nat Struct Mol Biol 14, 1025-1040 (2007). 71. D. B. Beck, H. Oda, S. S. Shen, D. Reinberg, PR-Set7 and H4K20me1: at the crossroads of genome integrity, cell cycle, chromosome condensation, and transcription. Genes Dev 26, 325-337 (2012). A 72. Y. Huyen et al., Methylated lysine 79 of histone H3 targets 53BP1 to DNA double-strand breaks. Nature 432, 406-411 (2004). 73. K. Hempel, H. W. Lange, L. Birkofer, [Epsilon-N-trimethyllysine, a new amino acid in histones]. Naturwissenschaften 55, 37 (1968). 74. W. K. Paik, D. C. Paik, S. Kim, Historical review: the field of protein methylation. Trends Biochem Sci 32, 146-152 (2007). 75. N. D. Heintzman et al., Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 39, 311-318 (2007). 76. E. Lavarone, C. M. Barbieri, D. Pasini, Dissecting the role of H3K27 acetylation and methylation in PRC2 mediated control of cellular identity. Nat Commun 10, 1679 (2019). 77. S. Kim, W. K. Paik, Studies on the origin of epsilon-N-methyl-L-lysine in protein. J Biol Chem 240, 4629-4634 (1965). 78. W. K. Paik, S. Kim, Protein methylation. Science 174, 114-119 (1971). 79. R. C. Trievel, B. M. Beach, L. M. Dirk, R. L. Houtz, J. H. Hurley, Structure and catalytic mechanism of a SET domain protein methyltransferase. Cell 111, 91-103 (2002). 80. R. E. Collins et al., In vitro and in vivo analyses of a Phe/Tyr switch controlling product specificity of histone lysine methyltransferases. J Biol Chem 280, 5563-5570 (2005). 81. Q. Feng et al., Methylation of H3-lysine 79 is mediated by a new family of HMTases without a SET domain. Curr Biol 12, 1052-1058 (2002). 82. M. Wong, P. Polly, T. Liu, The histone methyltransferase DOT1L: regulatory functions and a cancer therapy target. Am J Cancer Res 5, 2823-2837 (2015). 83. E. Metzger et al., KMT9 monomethylates histone H4 lysine 12 and controls proliferation of prostate cancer cells. Nat Struct Mol Biol 26, 361-371 (2019). 84. T. Jenuwein, Re-SET-ting heterochromatin by histone methyltransferases. Trends Cell Biol 11, 266- 273 (2001). 85. Y. Shi et al., Histone demethylation mediated by the nuclear amine oxidase homolog LSD1. Cell 119, 941-953 (2004). 86. Y. Tsukada et al., Histone demethylation by a family of JmjC domain-containing proteins. Nature 439, 811-816 (2006).

186

87. A. M. Arnaudo, B. A. Garcia, Proteomic characterization of novel histone post-translational modifications. Epigenetics Chromatin 6, 24 (2013). 88. A. J. Bannister, T. Kouzarides, Regulation of chromatin by histone modifications. Cell Res 21, 381- 395 (2011). 89. Y. Zhao, B. A. Garcia, Comprehensive Catalog of Currently Documented Histone Modifications. Cold Spring Harb Perspect Biol 7, a025064 (2015). 90. F. W. Schmitges et al., Histone methylation by PRC2 is inhibited by active chromatin marks. Mol Cell 42, 330-341 (2011). 91. T. Zhang, S. Cooper, N. Brockdorff, The interplay of histone modifications - writers that read. EMBO Rep 16, 1467-1481 (2015). 92. M. Lachner, D. O'Carroll, S. Rea, K. Mechtler, T. Jenuwein, Methylation of histone H3 lysine 9 creates a binding site for HP1 proteins. Nature 410, 116-120 (2001). 93. W. Zhang, N. Garcia, Y. Feng, H. Zhao, J. Messing, Genome-wide histone acetylation correlates with active transcription in maize. Genomics 106, 214-220 (2015). 94. J. Zhou et al., Genome-wide profiling of histone H3 lysine 9 acetylation and dimethylation in Arabidopsis reveals correlation between multiple histone marks and gene expression. Plant Mol Biol 72, 585-595 (2010). 95. D. G. Edmondson et al., Site-specific loss of acetylation upon phosphorylation of histone H3. J Biol Chem 277, 29496-29502 (2002). 96. W. S. Lo et al., Phosphorylation of serine 10 in histone H3 is functionally linked in vitro and in vivo to Gcn5-mediated acetylation at lysine 14. Mol Cell 5, 917-926 (2000). 97. P. Cheung et al., Synergistic coupling of histone H3 phosphorylation and acetylation in response to epidermal growth factor stimulation. Mol Cell 5, 905-915 (2000). 98. J. Dover et al., Methylation of histone H3 by COMPASS requires ubiquitination of histone H2B by Rad6. J Biol Chem 277, 28368-28371 (2002). 99. J. Vandamme et al., The C. elegans H3K27 demethylase UTX-1 is essential for normal development, independent of its enzymatic activity. PLoS Genet 8, e1002647 (2012). 100. M. M. Suzuki, A. Bird, DNA methylation landscapes: provocative insights from epigenomics. Nat A Rev Genet 9, 465-476 (2008). 101. P. A. Wade et al., Mi-2 complex couples DNA methylation to chromatin remodelling and histone deacetylation. Nat Genet 23, 62-66 (1999). 102. Y. Zhang et al., Analysis of the NuRD subunits reveals a histone deacetylase core complex and a connection with DNA methylation. Genes Dev 13, 1924-1935 (1999). 103. J. A. Latham, S. Y. Dent, Cross-regulation of histone modifications. Nat Struct Mol Biol 14, 1017- 1024 (2007). 104. G. Langst, L. Manelyte, Chromatin Remodelers: From Function to Dysfunction. Genes (Basel) 6, 299-324 (2015). 105. C. R. Clapier, J. Iwasa, B. R. Cairns, C. L. Peterson, Mechanisms of action and regulation of ATP- dependent chromatin-remodelling complexes. Nat Rev Mol Cell Biol 18, 407-422 (2017). 106. A. Flaus, D. M. Martin, G. J. Barton, T. Owen-Hughes, Identification of multiple distinct Snf2 subfamilies with conserved structural motifs. Nucleic Acids Res 34, 2887-2905 (2006). 107. D. Mitra, E. J. Parnell, J. W. Landon, Y. Yu, D. J. Stillman, SWI/SNF binding to the HO promoter requires histone acetylation and stimulates TATA-binding protein recruitment. Mol Cell Biol 26, 4095-4110 (2006). 108. K. J. Pollard, C. L. Peterson, Chromatin remodeling: a marriage between two families? Bioessays 20, 771-780 (1998). 109. J. Scaife, J. R. Beckwith, Mutational alteration of the maximal level of Lac expression. Cold Spring Harb Symp Quant Biol 31, 403-408 (1966). 110. M. J. Koster, B. Snel, H. T. Timmers, Genesis of chromatin and transcription dynamics in the origin of species. Cell 161, 724-736 (2015). 111. V. Haberle, A. Stark, Eukaryotic core promoters and the functional basis of transcription initiation. Nat Rev Mol Cell Biol 19, 621-637 (2018). 112. J. T. Kadonaga, Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol 1, 40-51 (2012). 113. B. Lenhard, A. Sandelin, P. Carninci, Metazoan promoters: emerging characteristics and insights into transcriptional regulation. Nat Rev Genet 13, 233-245 (2012).

187

114. A. L. Roy, D. S. Singer, Core promoters in transcription: old problem, new insights. Trends Biochem Sci 40, 165-171 (2015). 115. E. V. Putlyaev, A. N. Ibragimov, L. A. Lebedeva, P. G. Georgiev, Y. V. Shidlovskii, Structure and Functions of the Mediator Complex. Biochemistry (Mosc) 83, 423-436 (2018). 116. D. J. Mathis, P. Chambon, The SV40 early region TATA box is required for accurate in vitro initiation of transcription. Nature 290, 310-315 (1981). 117. H. S. Rhee, B. F. Pugh, Genome-wide structure and organization of eukaryotic pre-initiation complexes. Nature 483, 295-301 (2012). 118. R. P. Lifton, M. L. Goldberg, R. W. Karp, D. S. Hogness, The organization of the histone genes in Drosophila melanogaster: functional and evolutionary implications. Cold Spring Harb Symp Quant Biol 42 Pt 2, 1047-1051 (1978). 119. R. Breathnach, P. Chambon, Organization and expression of eucaryotic split genes coding for proteins. Annu Rev Biochem 50, 349-383 (1981). 120. J. Ponjavic et al., Transcriptional and structural impact of TATA-initiation site spacing in mammalian core promoters. Genome Biol 7, R78 (2006). 121. A. Giangrande, C. Mettling, M. Martin, C. Ruiz, G. Richards, Drosophila Sgs3 TATA: effects of point mutations on expression in vivo and protein binding in vitro with staged nuclear extracts. EMBO J 8, 3459-3466 (1989). 122. R. Grosschedl, B. Wasylyk, P. Chambon, M. L. Birnstiel, Point mutation in the TATA box curtails expression of sea urchin H2A histone gene in vivo. Nature 294, 178-180 (1981). 123. J. L. Manley, A. Fire, A. Cano, P. A. Sharp, M. L. Gefter, DNA-dependent transcription of adenovirus genes in a soluble whole-cell extract. Proc Natl Acad Sci U S A 77, 3855-3859 (1980). 124. P. A. Weil, D. S. Luse, J. Segall, R. G. Roeder, Selective and accurate initiation of transcription at the Ad2 major late promotor in a soluble system dependent on purified RNA polymerase II and DNA. Cell 18, 469-484 (1979). 125. L. Comai, N. Tanese, R. Tjian, The TATA-binding protein and associated factors are integral components of the RNA polymerase I transcription factor, SL1. Cell 68, 965-976 (1992). A 126. D. Eberhard, L. Tora, J. M. Egly, I. Grummt, A TBP-containing multiprotein complex (TIF-IB) mediates transcription specificity of murine RNA polymerase I. Nucleic Acids Res 21, 4180-4186 (1993). 127. B. F. Pugh, R. Tjian, Diverse transcriptional functions of the multisubunit eukaryotic TFIID complex. J Biol Chem 267, 679-682 (1992). 128. B. P. Cormack, K. Struhl, The TATA-binding protein is required for transcription by all three nuclear RNA polymerases in yeast cells. Cell 69, 685-696 (1992). 129. K. Gupta, D. Sari-Ak, M. Haffke, S. Trowitzsch, I. Berger, Zooming in on Transcription Preinitiation. J Mol Biol 428, 2581-2591 (2016). 130. D. I. Chasman, K. M. Flaherty, P. A. Sharp, R. D. Kornberg, Crystal structure of yeast TATA- binding protein and model for interaction with DNA. Proc Natl Acad Sci U S A 90, 8174-8178 (1993). 131. Y. Kim, J. H. Geiger, S. Hahn, P. B. Sigler, Crystal structure of a yeast TBP/TATA-box complex. Nature 365, 512-520 (1993). 132. D. B. Nikolov et al., Crystal structure of a TFIIB-TBP-TATA-element ternary complex. Nature 377, 119-128 (1995). 133. S. Tan, Y. Hunziker, D. F. Sargent, T. J. Richmond, Crystal structure of a yeast TFIIA/TBP/DNA complex. Nature 381, 127-151 (1996). 134. A. D. Basehoar, S. J. Zanton, B. F. Pugh, Identification and distinct regulation of yeast TATA box- containing genes. Cell 116, 699-709 (2004). 135. T. Bhuiyan, H. T. M. Timmers, Promoter Recognition: Putting TFIID on the Spot. Trends Cell Biol 29, 752-763 (2019). 136. P. Carninci et al., Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 38, 626-635 (2006). 137. K. L. Huisinga, B. F. Pugh, A genome-wide housekeeping role for TFIID and a highly regulated stress-related role for SAGA in Saccharomyces cerevisiae. Mol Cell 13, 573-585 (2004). 138. R. Donczew, S. Hahn, Mechanistic Differences in Transcription Initiation at TATA-Less and TATA-Containing Promoters. Mol Cell Biol 38, (2018).

188

139. S. Sainsbury, C. Bernecky, P. Cramer, Structural basis of transcription initiation by RNA polymerase II. Nat Rev Mol Cell Biol 16, 129-143 (2015). 140. T. Baptista et al., SAGA Is a General Cofactor for RNA Polymerase II Transcription. Mol Cell 68, 130-143 e135 (2017). 141. A. C. Schier, D. J. Taatjes, Structure and mechanism of the RNA polymerase II transcription machinery. Genes Dev 34, 465-488 (2020). 142. L. Warfield et al., Transcription of Nearly All Yeast RNA Polymerase II-Transcribed Genes Is Dependent on Transcription Factor TFIID. Mol Cell 68, 118-129 e115 (2017). 143. V. Fischer, K. Schumacher, L. Tora, D. Devys, Global role for coactivator complexes in RNA polymerase II transcription. Transcription 10, 29-36 (2019). 144. D. F. Browning, S. J. Busby, The regulation of initiation. Nat Rev Microbiol 2, 57-65 (2004). 145. S. P. Burton, Z. F. Burton, The sigma enigma: bacterial sigma factors, archaeal TFB and eukaryotic TFIIB are homologs. Transcription 5, e967599 (2014). 146. K. B. Decker, D. M. Hinton, Transcription regulation at the core: similarities among bacterial, archaeal, and eukaryotic RNA polymerases. Annu Rev Microbiol 67, 113-139 (2013). 147. S. D. Bell, S. P. Jackson, Transcription in Archaea. Cold Spring Harb Symp Quant Biol 63, 41-51 (1998). 148. I. Grummt, Life on a planet of its own: regulation of RNA polymerase I transcription in the nucleolus. Genes Dev 17, 1691-1702 (2003). 149. T. I. Lee, R. A. Young, Transcription of eukaryotic protein-coding genes. Annu Rev Genet 34, 77-137 (2000). 150. L. Schramm, N. Hernandez, Recruitment of RNA polymerase III to its target promoters. Genes Dev 16, 2593-2620 (2002). 151. D. S. Luse, The RNA polymerase II preinitiation complex. Through what pathway is the complex assembled? Transcription 5, e27050 (2014). 152. S. Hahn, Structure and mechanism of the RNA polymerase II transcription machinery. Nat Struct Mol Biol 11, 394-403 (2004). A 153. S. V. Antonova, J. Boeren, H. T. M. Timmers, B. Snel, Epigenetics and transcription regulation during eukaryotic diversification: the saga of TFIID. Genes Dev 33, 888-902 (2019). 154. T. M. Harper, D. J. Taatjes, The complex structure and function of Mediator. J Biol Chem 293, 13778-13785 (2018). 155. K. Murakami et al., Structure of an RNA polymerase II preinitiation complex. Proc Natl Acad Sci U S A 112, 13543-13548 (2015). 156. G. Yusupova, M. Yusupov, Crystal structure of eukaryotic ribosome and its complexes with inhibitors. Philos Trans R Soc Lond B Biol Sci 372, (2017). 157. Y. He, J. Fang, D. J. Taatjes, E. Nogales, Structural visualization of key steps in human transcription initiation. Nature 495, 481-486 (2013). 158. A. B. Patel, B. J. Greber, E. Nogales, Recent insights into the structure of TFIID, its assembly, and its binding to core promoter. Curr Opin Struct Biol 61, 17-24 (2020). 159. B. J. Greber, E. Nogales, The Structures of Eukaryotic Transcription Pre-initiation Complexes and Their Functional Implications. Subcell Biochem 93, 143-192 (2019). 160. S. Buratowski, S. Hahn, L. Guarente, P. A. Sharp, Five intermediate complexes in transcription initiation by RNA polymerase II. Cell 56, 549-561 (1989). 161. B. L. Davison, J. M. Egly, E. R. Mulvihill, P. Chambon, Formation of stable preinitiation complexes between eukaryotic class B transcription factors and promoter sequences. Nature 301, 680-686 (1983). 162. A. B. Patel et al., Structure of human TFIID and mechanism of TBP loading onto promoter DNA. Science 362, (2018). 163. B. D. Dynlacht, T. Hoey, R. Tjian, Isolation of coactivators associated with the TATA-binding protein that mediate transcriptional activation. Cell 66, 563-576 (1991). 164. L. Tora, A unified nomenclature for TATA box binding protein (TBP)-associated factors (TAFs) involved in RNA polymerase II transcription. Genes Dev 16, 673-675 (2002). 165. M. Hantsche, P. Cramer, Conserved RNA polymerase II initiation complex structure. Curr Opin Struct Biol 47, 17-22 (2017).

189

166. S. K. Burley, R. G. Roeder, Biochemistry and structural biology of transcription factor IID (TFIID). Annu Rev Biochem 65, 769-799 (1996). 167. J. L. Chen, L. D. Attardi, C. P. Verrijzer, K. Yokomori, R. Tjian, Assembly of recombinant TFIID reveals differential coactivator requirements for distinct transcriptional activators. Cell 79, 93-105 (1994). 168. T. W. Burke, J. T. Kadonaga, The downstream core promoter element, DPE, is conserved from Drosophila to humans and is recognized by TAFII60 of Drosophila. Genes Dev 11, 3020-3031 (1997). 169. G. E. Chalkley, C. P. Verrijzer, DNA binding site selection by RNA polymerase II TAFs: a TAF(II)250-TAF(II)150 complex recognizes the initiator. EMBO J 18, 4835-4845 (1999). 170. Z. Chen, J. L. Manley, Core promoter elements and TAFs contribute to the diversity of transcriptional activation in vertebrates. Mol Cell Biol 23, 7350-7362 (2003). 171. D. H. Lee et al., Functional characterization of core promoter elements: the downstream core element is recognized by TAF1. Mol Cell Biol 25, 9674-9686 (2005). 172. K. Gazit et al., TAF4/4b x TAF12 displays a unique mode of DNA binding and is required for core promoter function of a subset of genes. J Biol Chem 284, 26286-26296 (2009). 173. R. K. Louder et al., Structure of promoter-bound TFIID and model of human pre-initiation complex assembly. Nature 531, 604-609 (2016). 174. J. V. Geisberg, J. L. Chen, R. P. Ricciardi, Subregions of the adenovirus E1A transactivation domain target multiple components of the TFIID complex. Mol Cell Biol 15, 6283-6290 (1995). 175. R. H. Jacobson, A. G. Ladurner, D. S. King, R. Tjian, Structure and function of a human TAFII250 double bromodomain module. Science 288, 1422-1425 (2000). 176. Y. G. Gangloff et al., The TFIID components human TAF(II)140 and Drosophila BIP2 (TAF(II)155) are novel metazoan homologues of yeast TAF(II)47 containing a histone fold and a PHD finger. Mol Cell Biol 21, 5109-5121 (2001). 177. M. Vermeulen et al., Selective anchoring of TFIID to nucleosomes by trimethylation of histone H3 lysine 4. Cell 131, 58-69 (2007). A 178. J. A. D'Alessio, K. J. Wright, R. Tjian, Shifting players and paradigms in cell-specific transcription. Mol Cell 36, 924-931 (2009). 179. R. G. Roeder, The role of general initiation factors in transcription by RNA polymerase II. Trends Biochem Sci 21, 327-335 (1996). 180. A. N. Imbalzano, K. S. Zaret, R. E. Kingston, Transcription factor (TF) IIB and TFIIA can independently increase the affinity of the TATA-binding protein for DNA. J Biol Chem 269, 8280- 8286 (1994). 181. T. Lagrange et al., High-resolution mapping of nucleoprotein complexes by site-specific protein- DNA photocrosslinking: organization of the human TBP-TFIIA-TFIIB-DNA quaternary complex. Proc Natl Acad Sci U S A 93, 10620-10625 (1996). 182. J. Ozer, K. Mitsouras, D. Zerby, M. Carey, P. M. Lieberman, Transcription factor IIA derepresses TATA-binding protein (TBP)-associated factor inhibition of TBP-DNA binding. J Biol Chem 273, 14293-14300 (1998). 183. M. P. Klejman et al., NC2alpha interacts with BTAF1 and stimulates its ATP-dependent association with TATA-binding protein. Mol Cell Biol 24, 10072-10082 (2004). 184. F. J. van Werven et al., Cooperative action of NC2 and Mot1p to regulate TATA-binding protein function across the genome. Genes Dev 22, 2359-2369 (2008). 185. P. de Graaf et al., Chromatin interaction of TATA-binding protein is dynamically regulated in human cells. J Cell Sci 123, 2663-2671 (2010). 186. G. Spedale et al., Tight cooperation between Mot1p and NC2beta in regulating genome-wide transcription, repression of transcription following heat shock induction and genetic interaction with SAGA. Nucleic Acids Res 40, 996-1008 (2012). 187. M. J. Koster, H. T. Timmers, Regulation of anti-sense transcription by Mot1p and NC2 via removal of TATA-binding protein (TBP) from the 3'-end of genes. Nucleic Acids Res 43, 143-152 (2015). 188. M. P. Klejman, X. Zhao, F. M. van Schaik, W. Herr, H. T. Timmers, Mutational analysis of BTAF1- TBP interaction: BTAF1 can rescue DNA-binding defective TBP mutants. Nucleic Acids Res 33, 5426-5436 (2005). 189. K. Kamada et al., Crystal structure of negative cofactor 2 recognizing the TBP-DNA transcription complex. Cell 106, 71-81 (2001).

190

190. P. Wollmann et al., Structure and mechanism of the Swi2/Snf2 remodeller Mot1 in complex with its substrate TBP. Nature 475, 403-407 (2011). 191. M. Bleichenbacher, S. Tan, T. J. Richmond, Novel interactions between the components of human and yeast TFIIA/TBP/DNA complexes. J Mol Biol 332, 783-793 (2003). 192. J. H. Layer, P. A. Weil, Direct TFIIA-TFIID protein contacts drive budding yeast ribosomal protein gene transcription. J Biol Chem 288, 23273-23294 (2013). 193. X. Sun, D. Ma, M. Sheldon, K. Yeung, D. Reinberg, Reconstitution of human TFIIA activity from recombinant polypeptides: a role in TFIID-mediated transcription. Genes Dev 8, 2336-2348 (1994). 194. O. Flores et al., The small subunit of transcription factor IIF recruits RNA polymerase II into the preinitiation complex. Proc Natl Acad Sci U S A 88, 9999-10003 (1991). 195. S. Lee, S. Hahn, Model for binding of transcription factor TFIIB to the TBP-DNA complex. Nature 376, 609-612 (1995). 196. D. Y. Zhang, D. J. Carson, J. Ma, The role of TFIIB-RNA polymerase II interaction in start site selection in yeast cells. Nucleic Acids Res 30, 3078-3085 (2002). 197. J. Soppa, Transcription initiation in Archaea: facts, factors and future aspects. Mol Microbiol 31, 1295- 1305 (1999). 198. P. Cramer, D. A. Bushnell, R. D. Kornberg, Structural basis of transcription: RNA polymerase II at 2.8 angstrom resolution. Science 292, 1863-1876 (2001). 199. A. Klug, Structural biology. A marvellous machine for making messages. Science 292, 1844-1846 (2001). 200. F. X. Chen, E. R. Smith, A. Shilatifard, Born to run: control of transcription elongation by RNA polymerase II. Nat Rev Mol Cell Biol 19, 464-478 (2018). 201. M. J. Thomas, A. A. Platas, D. K. Hawley, Transcriptional fidelity and proofreading by RNA polymerase II. Cell 93, 627-637 (1998). 202. J. P. Hsin, J. L. Manley, The RNA polymerase II CTD coordinates transcription and RNA processing. Genes Dev 26, 2119-2137 (2012). 203. C. Chang, C. F. Kostrub, Z. F. Burton, RAP30/74 (transcription factor IIF) is required for promoter escape by RNA polymerase II. J Biol Chem 268, 20482-20489 (1993). A 204. D. A. Khaperskyy, M. L. Ammerman, R. C. Majovski, A. S. Ponticelli, Functions of Saccharomyces cerevisiae TFIIF during transcription start site utilization. Mol Cell Biol 28, 3757-3766 (2008). 205. Q. Yan, R. J. Moreland, J. W. Conaway, R. C. Conaway, Dual roles for transcription factor IIF in promoter escape by RNA polymerase II. J Biol Chem 274, 35668-35675 (1999). 206. W. H. Chung et al., RNA polymerase II/TFIIF structure and conserved organization of the initiation complex. Mol Cell 12, 1003-1013 (2003). 207. Z. A. Chen et al., Architecture of the RNA polymerase II-TFIIF complex revealed by cross-linking and mass spectrometry. EMBO J 29, 717-726 (2010). 208. M. Sopta, R. W. Carthew, J. Greenblatt, Isolation of three proteins that bind to mammalian RNA polymerase II. J Biol Chem 260, 10353-10360 (1985). 209. M. A. Ghazy, S. A. Brodie, M. L. Ammerman, L. M. Ziegler, A. S. Ponticelli, Amino acid substitutions in yeast TFIIF confer upstream shifts in transcription initiation and altered interaction with RNA polymerase II. Mol Cell Biol 24, 10975-10985 (2004). 210. R. S. Chambers, B. Q. Wang, Z. F. Burton, M. E. Dahmus, The activity of COOH-terminal domain phosphatase is regulated by a docking site on RNA polymerase II and by the general transcription factors IIF and IIB. J Biol Chem 270, 14962-14969 (1995). 211. T. Kamenski, S. Heilmeier, A. Meinhart, P. Cramer, Structure and mechanism of RNA polymerase II CTD phosphatases. Mol Cell 15, 399-407 (2004). 212. J. A. Goodrich, R. Tjian, Transcription factors IIE and IIH and ATP hydrolysis direct promoter clearance by RNA polymerase II. Cell 77, 145-156 (1994). 213. F. C. Holstege, P. C. van der Vliet, H. T. Timmers, Opening of an RNA polymerase II promoter occurs in two distinct steps and requires the basal transcription factors IIE and IIH. EMBO J 15, 1666-1677 (1996). 214. Y. He et al., Near-atomic resolution visualization of human transcription promoter opening. Nature 533, 359-365 (2016). 215. J. Shandilya, S. G. Roberts, The transcription cycle in eukaryotes: from productive initiation to RNA polymerase II recycling. Biochim Biophys Acta 1819, 391-400 (2012). 216. S. Egloff, S. Murphy, Cracking the RNA polymerase II CTD code. Trends Genet 24, 280-288 (2008).

191

217. B. Schwer, A. M. Sanchez, S. Shuman, Punctuation and syntax of the RNA polymerase II CTD code in fission yeast. Proc Natl Acad Sci U S A 109, 18024-18029 (2012). 218. Y. Ohkuma, R. G. Roeder, Regulation of TFIIH ATPase and kinase activities by TFIIE during active initiation complex formation. Nature 368, 160-163 (1994). 219. J. A. Coppola, A. S. Field, D. S. Luse, Promoter-proximal pausing by RNA polymerase II in vitro: transcripts shorter than 20 nucleotides are not capped. Proc Natl Acad Sci U S A 80, 1251-1255 (1983). 220. Y. Yamaguchi et al., NELF, a multisubunit complex containing RD, cooperates with DSIF to repress RNA polymerase II elongation. Cell 97, 41-51 (1999). 221. C. K. Ho, S. Shuman, Distinct roles for CTD Ser-2 and Ser-5 phosphorylation in the recruitment and allosteric activation of mammalian mRNA capping enzyme. Mol Cell 3, 405-411 (1999). 222. J. Peng, M. Liu, J. Marion, Y. Zhu, D. H. Price, RNA polymerase II elongation control. Cold Spring Harb Symp Quant Biol 63, 365-370 (1998). 223. F. K. Hsieh et al., Histone chaperone FACT action during transcription through chromatin by RNA polymerase II. Proc Natl Acad Sci U S A 110, 7654-7659 (2013). 224. N. J. Proudfoot, Transcriptional termination in mammals: Stopping the RNA polymerase II juggernaut. Science 352, aad9926 (2016). 225. Y. J. Joo et al., Downstream promoter interactions of TFIID TAFs facilitate transcription reinitiation. Genes Dev 31, 2162-2174 (2017). 226. T. Oelgeschlager, Y. Tao, Y. K. Kang, R. G. Roeder, Transcription activation via enhanced preinitiation complex assembly in a human cell-free system lacking TAFIIs. Mol Cell 1, 925-931 (1998). 227. N. Yudkovsky, J. A. Ranish, S. Hahn, A transcription reinitiation intermediate that is stabilized by activator. Nature 408, 225-229 (2000). 228. J. Berretta, A. Morillon, Pervasive transcription constitutes a new level of eukaryotic genome regulation. EMBO Rep 10, 973-982 (2009). 229. M. J. Carrozza et al., Histone H3 methylation by Set2 directs deacetylation of coding regions by A Rpd3S to suppress spurious intragenic transcription. Cell 123, 581-592 (2005). 230. M. J. Koster, A. D. Yildirim, P. A. Weil, F. C. Holstege, H. T. Timmers, Suppression of intragenic transcription requires the MOT1 and NC2 regulators of TATA-binding protein. Nucleic Acids Res 42, 4220-4229 (2014). 231. C. B. Fant et al., TFIID Enables RNA Polymerase II Promoter-Proximal Pausing. Mol Cell, (2020). 232. B. Li, J. A. Weber, Y. Chen, A. L. Greenleaf, D. S. Gilmour, Analyses of promoter-proximal pausing by RNA polymerase II on the hsp70 heat shock gene promoter in a Drosophila nuclear extract. Mol Cell Biol 16, 5433-5443 (1996). 233. W. Shao, S. G. Alcantara, J. Zeitlinger, Reporter-ChIP-nexus reveals strong contribution of the Drosophila initiator sequence to RNA polymerase pausing. Elife 8, (2019). 234. D. Yadav, K. Ghosh, S. Basu, R. G. Roeder, D. Biswas, Multivalent Role of Human TFIID in Recruiting Elongation Components at the Promoter-Proximal Region for Transcriptional Control. Cell Rep 26, 1303-1317 e1307 (2019). 235. H. Skolnik-David, Y. Aloni, Pausing of RNA polymerase molecules during in vivo transcription of the SV40 leader region. EMBO J 2, 179-184 (1983). 236. B. A. Purnell, P. A. Emanuel, D. S. Gilmour, TFIID sequence recognition of the initiator and sequences farther downstream in Drosophila class II genes. Genes Dev 8, 830-842 (1994). 237. A. Gegonne et al., TFIID component TAF7 functionally interacts with both TFIIH and P-TEFb. Proc Natl Acad Sci U S A 105, 5367-5372 (2008). 238. R. Weinmann, The basic RNA polymerase II transcriptional machinery. Gene Expr 2, 81-91 (1992). 239. M. Horikoshi et al., Purification of a yeast TATA box-binding protein that exhibits human transcription factor IID activity. Proc Natl Acad Sci U S A 86, 4843-4847 (1989). 240. M. W. Van Dyke, M. Sawadogo, DNA-binding and transcriptional properties of human transcription factor TFIID after mild proteolysis. Mol Cell Biol 10, 3415-3420 (1990). 241. H. T. Timmers, P. A. Sharp, The mammalian TFIID protein is present in two functionally distinct complexes. Genes Dev 5, 1946-1956 (1991). 242. M. Killeen, B. Coulombe, J. Greenblatt, Recombinant TBP, transcription factor IIB, and RAP30 are sufficient for promoter recognition by mammalian RNA polymerase II. J Biol Chem 267, 9463-9466 (1992).

192

243. Q. Zhou, P. M. Lieberman, T. G. Boyer, A. J. Berk, Holo-TFIID supports transcriptional stimulation by diverse activators and from a TATA-less promoter. Genes Dev 6, 1964-1974 (1992). 244. S. Hahn, S. Buratowski, P. A. Sharp, L. Guarente, Isolation of the gene encoding the yeast TATA binding protein TFIID: a gene identical to the SPT15 suppressor of Ty element insertions. Cell 58, 1173-1181 (1989). 245. M. Horikoshi et al., Cloning and structure of a yeast gene encoding a general transcription initiation factor TFIID that binds to the TATA box. Nature 341, 299-303 (1989). 246. C. C. Kao et al., Cloning of a transcriptionally active human TATA binding factor. Science 248, 1646- 1650 (1990). 247. A. A. Bondareva, E. E. Schmidt, Early vertebrate evolution of the TATA-binding protein, TBP. Mol Biol Evol 20, 1932-1939 (2003). 248. R. Kuddus, M. C. Schmidt, Effect of the non-conserved N-terminus on the DNA binding activity of the yeast TATA binding protein. Nucleic Acids Res 21, 1789-1796 (1993). 249. S. Gupta et al., DNA and protein footprinting analysis of the modulation of DNA binding by the N- terminal domain of the Saccharomyces cerevisiae TATA binding protein. Biochemistry 46, 9886-9898 (2007). 250. N. K. Hobbs, A. A. Bondareva, S. Barnett, M. R. Capecchi, E. E. Schmidt, Removing the vertebrate-specific TBP N terminus disrupts placental beta2m-dependent interactions with the maternal immune system. Cell 110, 43-54 (2002). 251. S. J. Reid et al., TBP, a polyglutamine tract containing protein, accumulates in Alzheimer's disease. Brain Res Mol Brain Res 125, 120-128 (2004). 252. G. Schaffar et al., Cellular toxicity of polyglutamine expansion proteins: mechanism of transcription factor deactivation. Mol Cell 15, 95-105 (2004). 253. M. J. Friedman et al., Polyglutamine domain modulates the TBP-TFIIB interaction: implications for its normal function and neurodegeneration. Nat Neurosci 10, 1519-1528 (2007). 254. S. Huang et al., Large Polyglutamine Repeats Cause Muscle Degeneration in SCA17 Mice. Cell Rep 13, 196-208 (2015). 255. S. Yang, X. J. Li, S. Li, Molecular mechanisms underlying Spinocerebellar Ataxia 17 (SCA17) A pathogenesis. Rare Dis 4, e1223580 (2016). 256. F. Blombach, D. Grohmann, Same same but different: The evolution of TBP in archaea and their eukaryotic offspring. Transcription 8, 162-168 (2017). 257. L. Tora, H. T. Timmers, The TATA box regulates TATA-binding protein (TBP) dynamics in vivo. Trends Biochem Sci 35, 309-314 (2010). 258. C. Millan-Pacheco, V. M. Capistran, N. Pastor, On the consequences of placing amino groups at the TBP-DNA interface. Does TATA really matter? J Mol Recognit 22, 453-464 (2009). 259. Z. S. Juo, G. A. Kassavetis, J. Wang, E. P. Geiduschek, P. B. Sigler, Crystal structure of a transcription factor IIIB core interface ternary complex. Nature 422, 534-539 (2003). 260. M. Anandapadamanaban et al., High-resolution structure of TBP with TAF1 reveals anchoring patterns in transcriptional regulation. Nat Struct Mol Biol 20, 1008-1014 (2013). 261. B. A. Knutson, J. Luo, J. Ranish, S. Hahn, Architecture of the Saccharomyces cerevisiae RNA polymerase I Core Factor complex. Nat Struct Mol Biol 21, 810-816 (2014). 262. U. G. Jacobi et al., TBP paralogs accommodate metazoan- and vertebrate-specific developmental gene regulation. EMBO J 26, 3900-3909 (2007). 263. J. H. Reina, N. Hernandez, On a roll for new TRF targets. Genes Dev 21, 2855-2860 (2007). 264. W. Akhtar, G. J. Veenstra, TBP-related factors: a paradigm of diversity in transcription initiation. Cell Biosci 1, 23 (2011). 265. M. E. Torres-Padilla, L. Tora, TBP homologues in embryo transcription: who does what? EMBO Rep 8, 1016-1018 (2007). 266. J. A. Chong et al., TATA-binding protein (TBP)-like factor (TLF) is a functional regulator of transcription: reciprocal regulation of the neurofibromatosis type 1 and c-fos genes by TLF/TRF2 and TBP. Mol Cell Biol 25, 2632-2643 (2005). 267. I. Martianov et al., Late arrest of spermiogenesis and germ cell apoptosis in mice lacking the TBP- like TLF/TRF2 gene. Mol Cell 7, 509-515 (2001). 268. E. Gazdag et al., TBP2 is essential for germ cell development by regulating transcription and chromatin condensation in the oocyte. Genes Dev 23, 2210-2223 (2009).

193

269. S. Ruppert, E. H. Wang, R. Tjian, Cloning and expression of human TAFII250: a TBP-associated factor implicated in cell-cycle regulation. Nature 362, 175-179 (1993). 270. E. C. Curran, H. Wang, T. R. Hinds, N. Zheng, E. H. Wang, Zinc knuckle of TAF1 is a DNA binding module critical for TFIID promoter occupancy. Sci Rep 8, 4630 (2018). 271. H. Wang, E. C. Curran, T. R. Hinds, E. H. Wang, N. Zheng, Crystal structure of a TAF1-TAF7 complex in human transcription factor IID reveals a promoter binding module. Cell Res 24, 1433- 1444 (2014). 272. R. Dikstein, S. Ruppert, R. Tjian, TAFII250 is a bipartite protein kinase that phosphorylates the base transcription factor RAP74. Cell 84, 781-790 (1996). 273. C. A. Mizzen et al., The TAF(II)250 subunit of TFIID has histone acetyltransferase activity. Cell 87, 1261-1270 (1996). 274. S. Capponi et al., Neuronal-specific microexon splicing of TAF1 mRNA is directly regulated by SRRM4/nSR100. RNA Biol 17, 62-74 (2020). 275. T. Herzfeld, D. Nolte, U. Muller, Structural and functional analysis of the human TAF1/DYT3 multiple transcript system. Mamm Genome 18, 787-795 (2007). 276. D. Nolte, S. Niemann, U. Muller, Specific sequence changes in multiple transcript system DYT3 are associated with X-linked dystonia parkinsonism. Proc Natl Acad Sci U S A 100, 10347-10352 (2003). 277. P. J. Wang, D. C. Page, Functional substitution for TAF(II)250 by a retroposed homolog that is expressed in human spermatogenesis. Hum Mol Genet 11, 2341-2346 (2002). 278. M. Malkowska, K. Kokoszynska, L. Rychlewski, L. Wyrwicz, Structural bioinformatics of the general transcription factor TFIID. Biochimie 95, 680-691 (2013). 279. G. Papai et al., Mapping the initiator binding Taf2 subunit in the structure of hydrated yeast TFIID. Structure 17, 363-373 (2009). 280. G. Kochan et al., Crystal structures of the endoplasmic reticulum aminopeptidase-1 (ERAP1) reveal the molecular basis for N-terminal peptide trimming. Proc Natl Acad Sci U S A 108, 7745-7750 (2011). 281. S. Trowitzsch et al., Cytoplasmic TAF2-TAF8-TAF10 complex provides evidence for nuclear holo- A TFIID assembly from preformed submodules. Nat Commun 6, 6011 (2015). 282. C. Bieniossek et al., The architecture of human general transcription factor TFIID core complex. Nature 493, 699-702 (2013). 283. H. van Ingen et al., Structural insight into the recognition of the H3K4me3 mark by the TFIID subunit TAF3. Structure 16, 1245-1256 (2008). 284. R. van Nuland et al., Multivalent engagement of TFIID to nucleosomes. PLoS One 8, e73495 (2013). 285. M. D. Deato et al., MyoD targets TAF3/TRF3 to activate myogenin transcription. Mol Cell 32, 96- 105 (2008). 286. M. D. Deato, R. Tjian, Switching of the core transcription machinery during myogenesis. Genes Dev 21, 2137-2149 (2007). 287. Y. Stijf-Bultsma et al., The basal transcription complex component TAF3 transduces changes in nuclear phosphoinositides into transcriptional output. Mol Cell 58, 453-467 (2015). 288. D. O. Hart, M. K. Santra, T. Raha, M. R. Green, Selective interaction between Trf3 and Taf3 required for early development and hematopoiesis. Dev Dyn 238, 2540-2549 (2009). 289. Z. Moqtaderi, J. D. Yale, K. Struhl, S. Buratowski, Yeast homologues of higher eukaryotic TFIID subunits. Proc Natl Acad Sci U S A 93, 14654-14658 (1996). 290. S. Werten et al., Crystal structure of a subcomplex of human transcription factor TFIID formed by TATA binding protein-associated factors hTAF4 (hTAF(II)135) and hTAF12 (hTAF(II)20). J Biol Chem 277, 45502-45509 (2002). 291. O. Kolesnikova et al., Molecular structure of promoter-bound yeast TFIID. Nat Commun 9, 4666 (2018). 292. P. J. Amrolia et al., The activation domain of the enhancer binding protein p45NF-E2 interacts with TAFII130 and mediates long-range activation of the alpha- and beta-globin gene loci in an erythroid cell line. Proc Natl Acad Sci U S A 94, 10051-10056 (1997). 293. E. Hibino et al., Interaction between intrinsically disordered regions in transcription factors Sp1 and TAF4. Protein Sci 25, 2006-2017 (2016). 294. E. Hibino et al., Identification of heteromolecular binding sites in transcription factors Sp1 and TAF4 using high-resolution nuclear magnetic resonance spectroscopy. Protein Sci 26, 2280-2290 (2017).

194

295. L. J. Kim, A. G. Seto, T. N. Nguyen, J. A. Goodrich, Human Taf(II)130 is a coactivator for NFATp. Mol Cell Biol 21, 3503-3513 (2001). 296. M. F. Vassallo, N. Tanese, Isoform-specific interaction of HP1 with human TAFII130. Proc Natl Acad Sci U S A 99, 5919-5924 (2002). 297. W. Y. Chen et al., A TAF4 coactivator function for E proteins that involves enhanced TFIID binding. Genes Dev 27, 1596-1609 (2013). 298. J. Kazantseva et al., Alternative splicing targeting the hTAF4-TAFH domain of TAF4 represses proliferation and accelerates chondrogenic differentiation of human mesenchymal stem cells. PLoS One 8, e74799 (2013). 299. J. Kazantseva, H. Sadam, T. Neuman, K. Palm, Targeted alternative splicing of TAF4: a new strategy for cell reprogramming. Sci Rep 6, 30852 (2016). 300. J. Kazantseva, K. Tints, T. Neuman, K. Palm, TAF4 controls differentiation of human neural progenitor cells through hTAF4-TAFH activity. J Mol Neurosci 55, 160-166 (2015). 301. D. Alpern et al., TAF4, a subunit of transcription factor II D, directs promoter occupancy of nuclear receptor HNF4A during post-natal hepatocyte differentiation. Elife 3, e03613 (2014). 302. D. Langer et al., Essential role of the TFIID subunit TAF4 in murine embryogenesis and embryonic stem cell differentiation. Nat Commun 7, 11063 (2016). 303. W. L. Liu et al., Structural changes in TAF4b-TFIID correlate with promoter selectivity. Mol Cell 29, 81-91 (2008). 304. R. Dikstein, S. Zhou, R. Tjian, Human TAFII 105 is a cell type-specific TFIID subunit related to hTAFII130. Cell 87, 137-146 (1996). 305. A. Rashevsky-Finkel, A. Silkov, R. Dikstein, A composite nuclear export signal in the TBP- associated factor TAFII105. J Biol Chem 276, 44963-44969 (2001). 306. M. A. Gura et al., Dynamic and regulated TAF gene expression during mouse embryonic germ cell development. PLoS Genet 16, e1008515 (2020). 307. R. N. Freiman et al., Requirement of tissue-selective TBP-associated factor TAFII105 in ovarian development. Science 293, 2084-2087 (2001). 308. K. J. Grive, K. A. Seymour, R. Mehta, R. N. Freiman, TAF4b promotes mouse primordial follicle A assembly and oocyte survival. Dev Biol 392, 42-51 (2014). 309. A. E. Falender, M. Shimada, Y. K. Lo, J. S. Richards, TAF4b, a TBP associated factor, is required for oocyte development and function. Dev Biol 288, 405-419 (2005). 310. K. G. Geles et al., Cell-type-selective induction of c-jun by TAF4b directs ovarian-specific transcription networks. Proc Natl Acad Sci U S A 103, 2594-2599 (2006). 311. J. R. Wardell et al., Estrogen responsiveness of the TFIID subunit TAF4B in the normal mouse ovary and in ovarian tumors. Biol Reprod 89, 116 (2013). 312. A. Bahat et al., TAF4b and TAF4 differentially regulate mouse embryonic stem cells maintenance and proliferation. Genes Cells 18, 225-237 (2013). 313. W. W. Pijnappel et al., A central role for TFIID in the pluripotent transcription circuitry. Nature 495, 516-519 (2013). 314. C. Leurent et al., Mapping key functional sites within yeast TFIID. EMBO J 23, 719-727 (2004). 315. W. Selleck et al., A histone fold TAF octamer within the yeast TFIID transcriptional coactivator. Nat Struct Biol 8, 695-700 (2001). 316. E. Scheer, F. Delbac, L. Tora, D. Moras, C. Romier, TFIID TAF6-TAF9 complex formation involves the HEAT repeat-containing C-terminal domain of TAF6 and is modulated by TAF5 protein. J Biol Chem 287, 27580-27592 (2012). 317. Y. Tao et al., Specific interactions and potential functions of human TAFII100. J Biol Chem 272, 6714-6721 (1997). 318. S. Bhattacharya, S. Takada, R. H. Jacobson, Structural analysis and dimerization potential of the human TAF5 subunit of TFIID. Proc Natl Acad Sci U S A 104, 1189-1194 (2007). 319. K. Hisatake et al., Evolutionary conservation of human TATA-binding-polypeptide-associated factors TAFII31 and TAFII80 and interactions of TAFII80 with other TAFs and with general transcription factors. Proc Natl Acad Sci U S A 92, 8195-8199 (1995). 320. M. Bando, S. Ijuin, S. Hasegawa, M. Horikoshi, The involvement of the histone fold motifs in the mutual interaction between human TAF(II)80 and TAF(II)22. J Biochem 121, 591-597 (1997). 321. H. Mitsuzawa, A. Ishihama, Identification of histone H4-like TAF in Schizosaccharomyces pombe as a protein that interacts with WD repeat-containing TAF. Nucleic Acids Res 30, 1952-1958 (2002).

195

322. H. Shao et al., Core promoter binding by histone-like TAF complexes. Mol Cell Biol 25, 206-219 (2005). 323. C. Kamtchueng et al., Alternative splicing of TAF6: downstream transcriptome impacts and upstream RNA splice control elements. PLoS One 9, e102399 (2014). 324. E. Wilhelm, F. X. Pellay, A. Benecke, B. Bell, TAF6delta controls apoptosis and gene expression in the absence of p53. PLoS One 3, e2721 (2008). 325. C. M. Chiang, R. G. Roeder, Cloning of an intrinsic human TFIID subunit that interacts with multiple transcriptional activators. Science 267, 531-536 (1995). 326. C. Munz et al., TAF7 (TAFII55) plays a role in the transcription activation by c-Jun. J Biol Chem 278, 21510-21516 (2003). 327. A. Gegonne, B. N. Devaiah, D. S. Singer, TAF7: traffic controller in transcription initiation. Transcription 4, 29-33 (2013). 328. Y. Cheng et al., Abnormal sperm in mice lacking the Taf7l gene. Mol Cell Biol 27, 2582-2589 (2007). 329. J. C. Pointud et al., The intracellular localisation of TAF7L, a paralogue of transcription factor TFIID subunit TAF7, is developmentally regulated during male germ-cell differentiation. J Cell Sci 116, 1847-1858 (2003). 330. H. Zhou et al., Taf7l cooperates with Trf2 to regulate spermiogenesis. Proc Natl Acad Sci U S A 110, 16886-16891 (2013). 331. H. Zhou et al., Dual functions of TAF7L in adipocyte differentiation. Elife 2, e00170 (2013). 332. I. Kamenova et al., Co-translational assembly of mammalian nuclear multisubunit complexes. Nat Commun 10, 1740 (2019). 333. F. El-Saafin et al., Homozygous TAF8 mutation in a patient with intellectual disability results in undetectable TAF8 protein, but preserved RNA polymerase II transcription. Hum Mol Genet 27, 2171-2186 (2018). 334. X. Xie et al., Structural similarity between TAFs and the heterotetrameric core of the histone octamer. Nature 380, 316-322 (1996). 335. M. Saint et al., The TAF9 C-terminal conserved region domain is required for SAGA and TFIID A promoter occupancy to promote transcriptional activation. Mol Cell Biol 34, 1547-1563 (2014). 336. M. Frontini et al., TAF9b (formerly TAF9L) is a bona fide TAF that has unique and overlapping roles with TAF9. Mol Cell Biol 25, 4638-4649 (2005). 337. Z. Chen, J. L. Manley, In vivo functional analysis of the histone 3-like TAF9 and a TAF9-related factor, TAF9L. J Biol Chem 278, 35172-35183 (2003). 338. Y. G. Gangloff et al., Histone folds mediate selective heterodimerization of yeast TAF(II)25 with TFIID components yTAF(II)47 and yTAF(II)65 and with SAGA component ySPT7. Mol Cell Biol 21, 1841-1853 (2001). 339. E. Soutoglou et al., The nuclear import of TAF10 is regulated by one of its three histone fold domain-containing interaction partners. Mol Cell Biol 25, 4092-4104 (2005). 340. A. K. Indra et al., TAF10 is required for the establishment of skin barrier function in foetal, but not in adult mouse epidermis. Dev Biol 285, 28-37 (2005). 341. W. S. Mohan, Jr., E. Scheer, O. Wendling, D. Metzger, L. Tora, TAF10 (TAF(II)30) is necessary for TFIID stability and early embryogenesis in mice. Mol Cell Biol 23, 4307-4318 (2003). 342. P. Papadopoulos et al., TAF10 Interacts with the GATA1 Transcription Factor and Controls Mouse Erythropoiesis. Mol Cell Biol 35, 2103-2118 (2015). 343. A. Tatarakis et al., Dominant and redundant functions of TFIID involved in the regulation of hepatic genes. Mol Cell 31, 531-543 (2008). 344. C. Birck et al., Human TAF(II)28 and TAF(II)18 interact through a histone fold encoded by atypical evolutionary conserved motifs also found in the SPT3 family. Cell 94, 239-249 (1998). 345. K. Gupta et al., Architecture of TAF11/TAF13/TBP complex suggests novel regulation properties of general transcription factor TFIID. Elife 6, (2017). 346. A. C. Lavigne et al., Synergistic transcriptional activation by TATA-binding protein and hTAFII28 requires specific amino acids of the hTAFII28 histone fold. Mol Cell Biol 19, 5050-5060 (1999). 347. M. M. Robinson et al., Mapping and functional characterization of the TAF11 interaction with TFIIA. Mol Cell Biol 25, 945-957 (2005). 348. C. C. Bell, M. A. Dawson, TFIID and MYB Share a Therapeutic Handshake in AML. Cancer Cell 33, 1-3 (2018).

196

349. Y. Xu et al., A TFIID-SAGA Perturbation that Targets MYB and Suppresses Acute Myeloid Leukemia. Cancer Cell 33, 13-28 e18 (2018). 350. F. Andel, 3rd, A. G. Ladurner, C. Inouye, R. Tjian, E. Nogales, Three-dimensional structure of the human TFIID-IIA-IIB complex. Science 286, 2153-2156 (1999). 351. M. A. Cianfrocco et al., Human TFIID binds to core promoter DNA in a reorganized structural state. Cell 152, 120-131 (2013). 352. W. L. Liu et al., Structures of three distinct activator-TFIID complexes. Genes Dev 23, 1510-1521 (2009). 353. D. Helmlinger, L. Tora, Sharing the SAGA. Trends Biochem Sci 42, 850-861 (2017). 354. E. Koutelou, C. L. Hirsch, S. Y. Dent, Multiple faces of the SAGA complex. Curr Opin Cell Biol 22, 374-382 (2010). 355. G. Liu et al., Architecture of Saccharomyces cerevisiae SAGA complex. Cell Discov 5, 25 (2019). 356. Y. Han, J. Luo, J. Ranish, S. Hahn, Architecture of the Saccharomyces cerevisiae SAGA transcription coactivator complex. EMBO J 33, 2534-2546 (2014). 357. W. J. de Jonge et al., Molecular mechanisms that distinguish TFIID housekeeping from regulatable SAGA promoters. EMBO J 36, 274-290 (2017). 358. N. Fieremans et al., Identification of Intellectual Disability Genes in Female Patients with a Skewed X-Inactivation Pattern. Hum Mutat 37, 804-811 (2016). 359. S. Gudmundsson et al., TAF1, associated with intellectual disability in humans, is essential for embryogenesis and regulates neurodevelopmental processes in zebrafish. Sci Rep 9, 10730 (2019). 360. S. Hellman-Aharony et al., Microcephaly thin corpus callosum intellectual disability syndrome caused by mutated TAF2. Pediatr Neurol 49, 411-416 e411 (2013). 361. H. R. Oh, C. H. An, N. J. Yoo, S. H. Lee, Frameshift mutations of TAF7L gene, a core component for transcription by RNA polymerase II, in colorectal cancers. Pathol Oncol Res 21, 849-850 (2015). 362. J. R. Ribeiro, L. A. Lovasco, B. C. Vanderhyden, R. N. Freiman, Targeting TBP-Associated Factors in Ovarian Cancer. Front Oncol 4, 45 (2014). 363. G. Stevanin, A. Brice, Spinocerebellar ataxia 17 (SCA17) and Huntington's disease-like 4 (HDL4). Cerebellum 7, 170-178 (2008). A 364. C. R. Woese, O. Kandler, M. L. Wheelis, Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87, 4576-4579 (1990). 365. C. R. Woese, G. E. Fox, Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 74, 5088-5090 (1977). 366. A. Spang et al., Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173-179 (2015). 367. T. M. Embley, W. Martin, Eukaryotic evolution, changes and challenges. Nature 440, 623-630 (2006). 368. L. Eme, A. Spang, J. Lombard, C. W. Stairs, T. J. G. Ettema, Archaea and the origin of eukaryotes. Nat Rev Microbiol 15, 711-723 (2017). 369. A. de Mendoza et al., Transcription factor evolution in eukaryotes and the assembly of the regulatory toolkit in multicellular lineages. Proc Natl Acad Sci U S A 110, E4858-4866 (2013). 370. H. Cheng et al., Missense variants in TAF1 and developmental phenotypes: challenges of determining pathogenicity. Hum Mutat, (2019). 371. A. Domingo et al., Evidence of TAF1 dysfunction in peripheral models of X-linked dystonia- parkinsonism. Cell Mol Life Sci 73, 3205-3215 (2016). 372. M. B. Mobasheri, R. Shirkoohi, M. H. Modarressi, Cancer/Testis OIP5 and TAF7L Genes are Up- Regulated in Breast Cancer. Asian Pac J Cancer Prev 16, 4623-4628 (2015). 373. J. A. O'Rawe et al., TAF1 Variants Are Associated with Dysmorphic Features, Intellectual Disability, and Neurological Manifestations. Am J Hum Genet 97, 922-932 (2015). 374. H. R. Oh, C. H. An, N. J. Yoo, S. H. Lee, Frameshift Mutations in the Mononucleotide Repeats of TAF1 and TAF1L Genes in Gastric and Colorectal Cancers with Regional Heterogeneity. Pathol Oncol Res 23, 125-130 (2017). 375. K. Stouffs et al., The role of the testis-specific gene hTAF7L in the aetiology of male infertility. Mol Hum Reprod 12, 263-267 (2006). 376. H. Tawamie et al., Hypomorphic Pathogenic Variants in TAF13 Are Associated with Autosomal- Recessive Intellectual Disability and Microcephaly. Am J Hum Genet 100, 555-561 (2017).

197

377. Y. Xu et al., TAF1 plays a critical role in AML1-ETO driven leukemogenesis. Nat Commun 10, 4925 (2019). 378. K. A. Jones, Changing the core of transcription. Elife 3, e03575 (2014). 379. L. A. Lovasco et al., Accelerated ovarian aging in the absence of the transcription regulator TAF4B in mice. Biol Reprod 82, 23-34 (2010). 380. G. Papai et al., Structure of SAGA and mechanism of TBP deposition on gene promoters. Nature 577, 711-716 (2020). 381. H. Wang et al., Structure of the transcription coactivator SAGA. Nature 577, 717-720 (2020). 382. Y. Miyata, T. Shibata, M. Aoshima, T. Tsubata, E. Nishida, The molecular chaperone TRiC/CCT binds to the Trp-Asp 40 (WD40) repeat protein WDR68 and promotes its folding, protein kinase DYRK1A binding, and nuclear accumulation. J Biol Chem 289, 33320-33332 (2014). 383. L. Laprade, D. Rose, F. Winston, Characterization of new Spt3 and TATA-binding protein mutants of Saccharomyces cerevisiae: Spt3 TBP allele-specific interactions and bypass of Spt8. Genetics 177, 2007-2017 (2007). 384. X. Chen et al., Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106-1117 (2008). 385. A. Willbanks et al., The Evolution of Epigenetics: From Prokaryotes to Humans and Its Biological Consequences. Genet Epigenet 8, 25-36 (2016). 386. D. Hnisz, K. Shrinivas, R. A. Young, A. K. Chakraborty, P. A. Sharp, A Phase Separation Model for Transcriptional Control. Cell 169, 13-23 (2017). 387. P. E. Wright, H. J. Dyson, Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16, 18-29 (2015). 388. A. Boija et al., Transcription Factors Activate Genes through the Phase-Separation Capacity of Their Activation Domains. Cell 175, 1842-1855 e1816 (2018). 389. C. B. Fant et al., TFIID Enables RNA Polymerase II Promoter-Proximal Pausing. Mol Cell 78, 785- 793 e788 (2020). A

198

Summary Almost every cell of the human body carries the same genetic information, encoded in a unique for each human DNA sequence. Yet, our bodies contain numerous different cells, which form a number of different organs with specialized functions. How do we develop from a single cell to multicellular organisms when all of our cells contain exactly the same genetic sequence? A main contributor to the cellular diversity found in each multicellular organism is the mechanism by which the information in the DNA is decoded and translated into function. Such mechanism includes an extensive binding of the genetic information by dedicated molecules, which hide away all the regions that should not be accessible for reading and translation in a given cell. These molecules can wind and condense the DNA around them, add temporary chemical modifications to it or even shield it away like an – all with the goal of establishing a second layer of information, which is provided not by the DNA itself, but by the molecules attached to it. Such above-the-genetic layer is called epi-genetic (επι – ―above‖, Greek) and it has unique patterns in different cells of the same organism, ensuring differential read of the genetic code and rise of cellular diversity. In order to find, read and decode the genetic information hidden within the epigenetic code, our cells have evolved a sophisticated network of molecules, which coordinate their functions. DNA is read (or transcribed) by a specialized molecular transcriptional machinery into an intermediate information-carrying molecule (mRNA), which becomes exported from the nucleus to the cytoplasm of the cell. There, mRNA is further decoded (or translated), which entails the synthesis of a protein based on the sequence provided by the mRNA molecule. Each protein synthesised from a unique mRNA sequence (and therefore unique DNA sequence) A performs a unique function within the cell. For instance, some proteins oversee cellular survival, while others ensure correct cell division. As such, it becomes clear that while the genetic code of our cells is identical, the epigenetic control over it drives the differential production of proteins and allows for the cellular diversity needed to maintain the various functions of our bodies. Importantly, in the centre of this process is the ability to read the genetic and epigenetic codes – a process overseen by the transcriptional machinery. In order to decode the DNA to mRNA, this machinery must first identify the ―beginning‖ of the information it aims to read with an utmost precision. Indeed, small differences in the information start site can result in a false read of the genetic code and failed RNA synthesis. A central player in the identification of such start sites is the protein complex investigated in this thesis – TFIID. TFIID possesses the ability to both scan through and bind to the DNA sequence, as well as to read the epigenetic layer above it, which together allow it to accurately find the correct place for transcription initiation within the DNA. In the work presented here, I have investigated what the mechanisms behind TFIID formation are and whether they are present among all eukaryotes. I reported the existence of an extensive molecular interplay during the formation of TFIID and observed impressive conservation of this interplay throughout all eukaryotes over ~2 billion years of evolution. A notable exception is the ability of TFIID to read the epigenetic code, which seems to have expanded mostly in animals (as compared to plants, fungi and unicellular eukaryotes). Together, the results from my doctoral research aid in the better understanding of the transcriptional initiation mechanisms and highlight the importance of detailed analyses when investigating the molecular rules of transcriptional events.

199

Nederlandse sammenvatting Bijna elke cel in het menselijk lichaam bevat dezelfde genetische informatie, gecodeerd in een voor elk persoon unieke DNA-sequentie. Onze lichamen bevatten echter verschillende typen cellen, die gezamenlijk verschillende organen vormen met gespecialiseerde functies. Hoe ontwikkelen we ons als mens van een enkele cel naar een multicellulair organisme als al onze cellen precies dezelfde genetische sequentie bevatten? Een belangrijke oorzaak van de cellulaire diversiteit in elk multicellulair organisme is het mechanisme waarmee de informatie in het DNA wordt gedecodeerd en vertaald naar een functie. Dit mechanisme omvat de binding van de genetische informatie door speciale moleculen die in een bepaalde cel alle regio‘s verbergen die niet toegankelijk moeten zijn om te worden afgelezen en vertaald. Deze moleculen kunnen het DNA om zichzelf heen winden en condenseren, tijdelijke chemische veranderingen toevoegen aan het DNA of het zelfs beschermen als een soort insulator – dit alles heeft als doel om een tweede laag aan informatie tot stand te brengen die niet door het DNA zelf wordt aangeleverd, maar door de moleculen die daaraan hechten. Zo‘n laag boven-de-genetica wordt epi-genetisch (επι – ―boven‖, Grieks) genoemd en bestaat uit unieke patronen die anders zijn in de verschillende cellen van hetzelfde organisme, waardoor verschillende leeswijzen van de genetische code ontstaan die cellulaire diversiteit bewerkstelligen. Om de genetische code, die verstopt zit in de epigenetische code, te vinden, af te lezen en te decoderen, hebben onze cellen een geraffineerd netwerk van moleculen ontwikkeld dat hun functies coördineert. DNA wordt afgelezen (oftewel getranscribeerd) door een gespecialiseerde moleculaire transcriptiemachine, die het omzet in een informatie-bevattend tussenproduct (mRNA) dat getransporteerd wordt van de celkern naar het cytoplasma van de cel. Aldaar wordt het mRNA verder gedecodeerd (oftewel vertaald) waardoor een eiwit wordt gesynthetiseerd dat gebaseerd is op de sequentie aangeleverd door het mRNA molecuul. Elk eiwit dat gesynthetiseerd A wordt via een unieke mRNA-sequentie (en dus door een unieke DNA-sequentie) voert een unieke functie in de cel uit. Sommige eiwitten houden bijvoorbeeld toezicht op het overleven van de cel, terwijl andere eiwitten ervoor zorgen dat de celdeling correct verloopt. Zodoende wordt duidelijk dat ondanks dat de genetische code in onze cellen identiek is, de epigenetische controle ervoor zorgt dat gedifferentieerde productie van eiwitten mogelijk is waardoor de cellulaire diversiteit ontstaat die nodig is om de verschillende functies van onze lichamen in stand te houden. Centraal in dit proces is het vermogen om de genetische en epigenetische code te lezen – een proces dat wordt overzien door de transcriptiemachine. Om het DNA naar mRNA te decoderen, moet deze machine eerst het ‗begin‘ van alle informatie die het zo precies mogelijk wil aflezen identificeren. Kleine verschillen in de informatie startplaats kunnen ervoor zorgen dat de genetische code verkeerd wordt afgelezen waardoor de RNA-synthese mislukt. Een belangrijke speler in de identificatie van zulke startplaatsen is het eiwitcomplex dat in het kader van deze thesis werd onderzocht – TFIID. TFIID bezit niet alleen het vermogen om de DNA-sequentie te scannen en eraan te binden, maar kan daarnaast ook de epigenetische laag erboven aflezen, wat er gezamenlijk voor zorgt dat het precies de juiste plek kan vinden voor de start van de transcriptie van het DNA. In het werk dat ik in dit proefschrift presenteer, heb ik onderzocht wat de mechanismes achter de vorming van TFIID zijn en of deze aanwezig zijn bij alle eukaryoten. Ik rapporteerde de aanwezigheid van een uitgebreid moleculair samenspel tijdens de vorming van TFIID en observeerde indrukwekkende conservatie van dit samenspel bij alle eukaryoten gedurende ~2 miljard jaar evolutie. Een opvallende uitzondering is het vermogen van TFIID om de epigenetische code te lezen, welke vooral is doorontwikkeld in dieren (in vergelijking met planten, schimmels en eencellige eukaryoten). Gezamenlijk helpen de resultaten van mijn promotieonderzoek bij het ontwikkelen van een beter begrip van de transcriptie startmechanismes en benadrukken ze het belang van een gedetailleerde analyse tijdens het onderzoeken van de moleculaire regels van transcriptie.

200

Dankwoord Throughout the years I have received professional and personal support, input and understanding from many people, for which I will always be grateful and hopeful to return the favour one day. Above all I would first like to thank my supervisor, Marc Timmers, who has been an example to me since the beginning. Marc, through your love for science, work ethics and optimism, you have been a great mentor to me. The example you set, together with the space you gave me to develop my own style, have always been acknowledged and deeply appreciated from my side. In moments of mistakes, doubt or low spirits, you have always been a source of support and encouragement and for that I wholeheartedly thank you. You have told me on numerous occasions that the PhD path should be a growing curve (and kindly reassured me each time that I am going upwards), and that has stuck with me throughout the years, giving me the means to battle away insecurities and find motivation in seeking improvement. I will try to apply the ―growing curve‖ logic at any step in my life and hope that one day I will be able to infuse reassurance and confidence to younger generations the same way you did with me. My second supervisor, Albert Heck, has given me the opportunity to work and learn from one of the best mass spectrometry labs in the Netherlands. Albert, I will always be grateful to you for providing me with such an excellent support from specialists such as Fan Liu and dedicated scientists like Eleonora Corradini and Low Teck. I have received plenty of enthusiasm, assistance and productive collaboration through you, which has ultimately been invaluable for my progress as a scientist and for the obtaining of my doctoral degree. In addition to my supervisors, I would also like to thank Paul Coffer for the kind acceptance into his lab of an ―orphaned‖ PhD student, with headachingly large amount of A structural and biochemical line of work. I truly appreciate the sincere welcome you and your whole lab has showed me and have grown attached to you all as much as I was to my now Germany- departed lab-mates. It was amazing opportunity to be able to gain such a thorough social and scientific experience from two labs in one PhD, and I am so glad that this was possible within your lab. For all the structural TFIID details I threw at everyone‘s politely smiling faces – I have actually grown very fond of molecular immunology and for that I can only thank you! To my PhD progress committee members, Madelon Maurice and Frank Holstege, a wholehearted ―thank you‖ for the support you have shown me. My PhD had a number of divergent points, where it was completely unclear to me whether things are still within the frames of ―doable‖. The yearly evaluations are naturally the culmination points of such insecurities and I couldn‘t have asked for more supportive and reassuring committee than you. And while our interactions were sporadic – they meant to me a lot more than it probably seemed. I would also like to express my appreciation to all my PhD examination committee members, for the time they have dedicated in evaluating this thesis and their enthusiasm and interest. It is an honour for me to be able to share and discuss my work with such an audience. Finally, I would like to thank the groups of Imre Berger and Laszlo Tora for the support, open scientific communication and thorough assistance they have so enthusiastically provided me with. The brainstorming sessions, experience and opinion exchange and the collaborations between us were not only fruitful but also a blast. It is always heart-warming to be among fellow TFIID enthusiasts! In the following section I would like to thank the various people who have helped me out, showed me support and befriended me throughout the years, even without needing to :-). It is only

201

right to start with three people: Nelleke, Maria and Roy, who were the most amazingly welcoming senior-―ish‖ PhD students I could ask for. You have all gone through your own PhD challenges (long ago now), but still at the time readily offered me friendship, advices and help anytime I needed it. I will always cherish that. Nelleke, you were the best thing that could have happened for my productivity and organizational skills. The working example you set was actually inspirational and it really stuck with me for a while, even after you went to Nijmegen. I cannot thank you enough for that! In a dream-lab I will most definitely want to be next to you. Maria, I still laugh thinking of my first days in the lab and your ―welcome‖! It‘s refreshing to remember how sometimes few sentences and jokes down the road can turn a ―no‖ into ―yes‖, and eventually me into being your paranimph! We had quite some times having fun and I‘m glad we stay in touch! Roy, you must know how helpful you were all these years! Always friendly and always ready to give me the details! I still feel bad for telling you off for switching the heater off… and I hope it‘s only me who remembers that! Wish all of you guys the best and hope to keep crossing paths with you! Hetty and Richard, during our time together you were the housekeeping genes of the lab! So I want to express my gratitude not only for being so kind and assisting (and funny and friendly!), but also for helping us all not to crumble into a mess of confusion and disorder! Students: Franka, Lise, Sergio, Jeffrey and Mykolas. You were all my babies that taught me all kinds of unexpected lessons! I hope I didn‘t damage you in any way :-))! Timmers lab: Simona and Laura, the Utrecht survivors, you get to live and tell the tale! You were great colleagues and are sweet friends and I‘m always looking forward to seeing you or hearing from you! Tanja, Steffi, Omid, Timothy, Saiful, Nizam – it was great visiting you in A Freiburg and even more fun skiing (or attempting to) with you! Koen and Elfi – you were the always friendly and humorous doctors; I wish you a lot of success! Betty, Mirjam and Bärbel, without your help we would all be lost the in world outside of academia… thank you so much for the kindness, readily given help throughout the years and for generally putting up with academics! Coffer lab: I don‘t even know what to say guys! It‘s been great many jokes per hour with you all! Enric, I still have an unfinished list of things to watch thanks to you and a suppressed desire to try out the autophagy diet. Guy, you‘ve been an exemplary-awesome frat-boy and as a fellow shorty – it‘s up to you now to sneak around and ruin everyone‘s experiments! Ed, Josse, Anna – so much fun to find similar ―blood-types‖ in the lab and have fun outside of it ;-)! Cindy, thank you for the physical inspiration! I wish we can be doing that in the future too! Magdalena, Cornelieke, Koen, like true Coffer group members – it‘s always easy and uplifting to talk to you! Suzi, Nader, Alessandro and Sonia – have fun! To all of you and previous (associated) members (Jorg, Lucas, Steffan, Janneke, Luca, Ana, Gianmarco) – a big thank you for being so welcoming! Harmjan, Holger, Thanasis, thank you all for the readily assistance, advice and friendly ear! I‘m always happy to chat and/or work with you! Irem, Vittoria, José, Tao, Silke, Bas, Wim, Corina, Samir, Kim, Ajit, Susan, Cristina, Rik and Lisa – ever the drinking, pub-quizzing, chatting and random conferences/meetings buddies of the past and present! Looking forward to more opportunities like those in the future! Last but not least, I would like to thank Petra, who was in a way my introduction point to everything and everyone mentioned above. I wish you all the best, Petra, and thank you for being the perfect combination of supervisor and a friendly ear at the times needed!

202

Finally, I would like to dedicate a few words to my dearest friends Ragna, Elisa, Anna and Jorge. We‘ve met at random points and are all doing our things, but whenever there is a moment I want to share – I‘m always looking forward to seeing you there! Annushka, Jorge, my Belgium buddies, there is no reason for you not to become my Dutch buddies! Ragna, woman, what a great ―LSD‖ buddy you are – looking forward to being old grannies with you! Elisa, puella, you‘ve been the best of UCU! We‘ve been doing some parallel growing up together and I‘m looking forward to adding you to the grandma parties with Ragna!

The four people who have always been (and wanted to be) there for me at any point, showing understanding, strength, humour and love, are of course my mom and dad, Filip and Mike. Mom and dad, you deserve all the credits for whatever I do (this is very true). You‘ve given me pretty much everything I could ask for and the more I grow up the more clearly I see how much that was and is! Thank you for that, forever and beyond. Fif, … винаги си бил най- прекрасният брат – от както бяхме малки и едва се карахме, и когато се изправяше срещу всеки, който ми се намръщи, докато се справяше с трудностите в училище (през които се надявам да съм ти помогнала поне малко повече отколкото ми се струва), и когато ...порасна! Беше и продължаваш да си най-готиното ХУ на мам & дад! … Finally, my Michali, you know the best what it meant to be next me, while I‘m obtaining a doctoral degree. Thank you for being there, even before I realize I need it! I love you all so much and this scientific TFIID fruit is dedicated to you.

A

203

List of publications

Antonova S.V., Haffke M., Corradini E., Mikuciunas M., Low T.Y., Signor L., van Es R.M., Gupta K., Scheer E., Vos H.R., Tora L., Heck A.J.R., Timmers H.T.M. and Berger I. ―Chaperonin CCT Checkpoint Function in Basal Transcription Factor TFIID Assembly.‖ Nat Struct Mol Biol. (2018), 25(12): 1119–1127.

Antonova S.V., Boeren J.J.J., Timmers H.T.M. and Snel B. ―Epigenetics and transcription regulation during eukaryotic diversification: the saga of TFIID.‖ Genes & Dev. (2019), 33: 888- 902.

A

204

Curriculum vitae

Simona V. Antonova was born on August 5th 1989 in Sofia, Bulgaria. After completing her secondary education with a silver award at the Russian Embassy School in Tripoli, Libya, she moved to the Netherlands in 2008 and followed a science major at the Liberal Arts and Sciences University College of Utrecht, where she obtained an honorous bachelor‘s degree in 2010. In 2013 she obtained a master‘s degree in Molecular and Cellular Life Sciences at the University of Utrecht, during which she explored the potential of BET inhibitors as epi-drugs. In the same year, after a successful application, she was granted the Netherlands Organisation for Scientific Research (NWO) grant for funding of her Ph.D work at the group of H.T.M. Timmers in the department of Molecular Cancer Research at the University Medical Center of Utrecht. The results of her doctoral research are presented in this thesis.

A

205

206