Biochemical studies of human TBP-associated factor TAF2 and a TAF2 and a TAF2-8-10 subcomplex of human general factor TFIID Cristina Viola

To cite this version:

Cristina Viola. Biochemical studies of human TBP-associated factor TAF2 and a TAF2 and a TAF2- 8-10 subcomplex of human general TFIID. Biochemistry, Molecular Biology. Uni- versité de Grenoble, 2014. English. ￿NNT : 2014GRENV072￿. ￿tel-01561609￿

HAL Id: tel-01561609 https://tel.archives-ouvertes.fr/tel-01561609 Submitted on 13 Jul 2017

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés.

THÈSE To obtain the title of / Pour obtenir le grade de DOCTEUR DE L’UNIVERSITÉ DE GRENOBLE Discipline / Spécialité : Doctorat CSV/Chimie Physique Moléculaire et Structurale Arrêté ministériel :

Presented by / Présentée par : Cristina VIOLA

Thesis supervisor / Thèse dirigée par : Prof. Imre BERGER

Thesis prepared at / Thèse préparée au sein du European Molecular BIology Laboratory (EMBL) Grenoble Outstation In / dans : l'École Doctorale Chimie et Sciences du Vivantes

Biochemical studies of human TBP-associated factor TAF2 and a TAF2-8-10 subcomplex of human general transcription factor TFIID.

Thèse soutenue publiquement le 24 Octobre 2014 devant le jury composé de : Dr. Carlo PETOSA President/ Président Chef d'équipe, IBS Grenoble, France Prof. Andreas LADURNER Reviewer/ Rapporteur Professor, LMU Munchen, Allemagne Dr. Christophe ROMIER Reviewer/ Rapporteur Chef d'équipe, IGBMC Strasbourg, France Prof. Imre BERGER Directeur de Thèse Chef d'équipe, EMBL Grenoble, France

Université Joseph Fourier / Université Pierre Mendès France / 1 Université Stendhal / Université de Savoie / Grenoble INP Table of Contents

1. Introduction ...... 6

1.1. Eukaryotic Transcription, TFIID and the pre-initiation complex ...... 10

1.2. Basal transcription, the core promoter and TFIID ...... 11

1.2.1. Initiator element ...... 12

1.3. TSM-1 becomes TAF2(TFII150) a bona fide TAF conserved from Yeast to Humans .. 14

1.3.1. TAF2 core promoter INR recognition ...... 16

1.3.2. TAF2 in TFIID ...... 16

1.3.3. Domain Structure of TAF2 ...... 17

1.4. Structural studies of TFIID ...... 18

1.4.1. Structural analysis of TAFs ...... 18

1.5. Production of TAF samples for functional and structural studies ...... 18

1.5.1. Purification of native TAFs ...... 18

1.5.2. Production of recombinant TAFs for structural and functional studies ...... 19

1.6. Publications ...... 21

1.6.1. Publication 1 ...... 21

1.6.2. Publication 2 ...... 37

1.6.3. Publication 3 ...... 49

2. Materials and methods ...... 65

2.1. Materials ...... 65

2.1.1. Chemicals ...... 65

2.1.2. Enzymes and Antibodies ...... 66

2.1.3. Plasmids ...... 66

2.1.4. Oligonucleotides ...... 67

2.1.5. Cell Lines ...... 67

2.1.6. Bacterial strains ...... 68

2 2.1.7. Crystallization Screens ...... 68

2.1.8. Software and programs ...... 68

2.2. Methods ...... 69

2.3. Nucleic acid biochemistry ...... 69

2.3.1. Concentration determination of nucleic acids ...... 69

2.3.2. Preparation of plasmid DNA ...... 69

2.3.3. Agarose gel electrophoresis ...... 69

2.3.4. Restriction digestion of DNA ...... 70

2.3.5. Polymerase chain reaction (PCR) ...... 70

2.3.6. DNA ligation ...... 70

2.3.7. DNA extraction from agarose gels ...... 71

2.3.8. DNA Sequencing ...... 71

2.4. General cloning strategies ...... 71

2.4.1. Cloning TAF2 constructs ...... 71

2.4.2. Cloning TAF8 constructs ...... 74

2.4.3. Cloning TAF2-8-10 ...... 74

2.4.4. Overlap extension PCR ...... 76

2.4.5. Sequence and ligation independent cloning ...... 77

2.5. production techniques ...... 78

2.5.1. Baculovirus production and protein expression in insect cells ...... 78

2.5.2. Protein expression in E. coli ...... 80

2.5.3. Purification of His-tagged protein from E. coli cytosolic fraction ...... 80

2.5.4. Purification of TAF2 constructs and TAF2-8-10 complex from Sf21 lysate ...... 81

2.5.5. Expression of [ 13 C, 15 N]-labeled MBP-TAF8 C-terminal constructs ...... 82

2.6. Protein Biochemistry and biochemical methods ...... 82

2.6.1. Denaturing polyacrylamide gel electrophoresis (SDS-PAGE) ...... 82

2.6.2. Limited proteolysis ...... 83

2.6.3. Electrospray ionization mass spectrometry ...... 83

3 2.6.4. Edman sequencing ...... 84

2.6.5. Binding experiments using a peptide array ...... 84

2.7. Cells and cell culture ...... 85

2.7.1. Cultivation of E. coli cells ...... 85

2.7.2. Transformation of chemically competent bacteria ...... 85

2.7.3. Transformation of electrocompetent E. coli ...... 86

2.7.4. Cultivation of Sf21 cells ...... 86

2.7.5. Transfection of Sf21 cells ...... 86

2.7.6. Cultivation of HeLa cells ...... 86

2.7.7. Transfection of HeLa cells ...... 87

2.8. Bioinformatic methods ...... 88

2.8.1. Multiple sequence alignments ...... 88

2.8.2. Secondary structure prediction ...... 88

2.8.3. Structural modeling ...... 88

2.9. Biophysical methods ...... 88

2.9.1. Surface plasmon resonance ...... 88

2.9.2. NMR ...... 89

2.9.3. Native Mass Spectrometry ...... 89

2.10. Protein crystallization ...... 89

3. Results ...... 90

3.1. Domain architecture of TAF2 ...... 90

3.1.1. Introduction ...... 90

3.1.2. Expression and Purification of Full Length Human, Drosophila and Yeast TAF2 ...... 90

3.1.3. Limited proteolysis ...... 97

3.1.4. Human TAF2 ‘core’ and ‘Two-Domain’ constructs ...... 102

3.1.5. Crystallization trials ...... 105

3.1.6. Resume ...... 106

4 3.2. Functional annotation of TAF2 ...... 107

3.2.1. Introduction ...... 107

3.2.2. TAF2 secondary structure and modeling ...... 108

3.2.3. TAF2 Histone binding ...... 113

3.2.4. Resume ...... 115

3.3. Biochemical and biophysical characterization of TAF2-8-10 ...... 117

3.3.1. Introduction ...... 117

3.3.2. Sf21 co-expression and purification of TAF2-8-10 complex ...... 117

3.3.3. TAF2-8-10 AUC and Native Mass Spectrometry ...... 120

3.3.4. TAF2-8-10 negative stain EM ...... 125

3.3.5. TAF2 interaction with TAF8 C-term region: SEC analysis ...... 126

3.3.6. TAF2 interaction with TAF8 C-terminal region: A thermodynamic analysis ...... 130

3.3.7. NMR analysis of TAF8 interaction region ...... 133

3.3.8. Identification of the TAF8 C-terminal interaction region by a peptide array assay ..... 135

3.3.9. Resume ...... 141

3.4. Functional characterization of TAF2-8-10 ...... 142

3.4.1. Introduction ...... 142

3.4.2. TAF8/10 lacking putative TAF2 interaction region can be mixed with 5TAF to form

7TAF-∆TAF8[C-term] ...... 143

3.4.3. TAF2 incorporation into 7TAF-∆TAF8[C-term] ...... 144

3.4.4. Resume ...... 146

3.5. Is TAF2 co-imported into the nucleus with TAF8/10? ...... 148

3.5.1. Introduction ...... 148

3.5.2. TAF2-mCherry expression in HeLa ...... 148

3.5.3. TAF2-8-10 Importin α complex ...... 149

3.5.4. Resume ...... 151

4. Discussion ...... 153

5 4.1. Publication 4 (under review at Nature Communications; NCOMMS-14-12636) ...... 160

5. Appendix ...... 219

5.1. Tables ...... 219

6. References ...... 224

List of figures

Figure 1-1: Human TFIID subunits ...... 11

Figure 1-2: TAF2 Conserved Domains and propensity to fold ...... 17

Figure 2-1: Human TAF2 N-terminus ...... 72

Figure 2-2: MultiBac vectors for TAF2-8-10 complex expression in insect cells ...... 75

Figure 3-1: YFP monitor of hsTAF2FL expression in Sf21 cells...... 92

Figure 3-2: Human TAF2 purification ...... 94

Figure 3-3: S200 SEC profiles of three TAF2 homologs ...... 95

Figure 3-4: Full-length TAF2 limited proteolysis time course ...... 98

Figure 3-5: Summary of mass spectrometry and Edman sequencing analysis of fragments

derived from limited proteolysis of three TAF2 homologs ...... 100

Figure 3-6:. Anti-His 5 antibody stained Western blot of Sf21 extract enriched with His 10 -

tagged full-length TAF2 and His 10 -tagged TAF2 ‘core’ ...... 103

Figure 3-7: Size exclusion chromatography of the Human TAF2 'core' domain ...... 104

Figure 3-8: TAF2 Proteolysis and the ‘Two-Domain’ Model ...... 105

Figure 3-9: PRALINE alignment of TAF2 and aminopeptidase N ...... 110

Figure 3-10: TAF2 model domain alignment with ERAP1 crystal structure and PSIPred

protein sequence analysis ...... 112

6 Figure 3-11: SEC analysis of mixtures of TAF2 and HFD-containing domains in TFIID ... 114

Figure 3-12: YFP and CFP monitor of TAF2-8-10 expresiion in Sf21 ...... 118

Figure 3-13: Purification of TAF2-8-10 complex ...... 119

Figure 3-14: Sedimentation velocity analytical ultracentrifugation ...... 121

Figure 3-15: Native MS/MS dissociation of purified hs TAF2-8-10 FL complex ...... 123

Figure 3-16: Collision induced dissociation experiments of hsTAF2-8-10 FL ...... 124

Figure 3-17: Glycerol gradient and negative stain EM ...... 126

Figure 3-18: TAF2 and TAF8 construct mixing experiments ...... 128

Figure 3-19: Predicted disorder based on the Human TAF8 protein sequence ...... 130

Figure 3-20: Surface plasmon resonance experiments using immobilized TAF2 ligand...... 132

Figure 3-21. 2D ( 1H, 15 N) TROSY NMR analysis of TAF8...... 134

Figure 3-22: Peptide scan analysis of the putative TAF2-TAF8 C-terminal interaction region

...... 137

Figure 3-23: Peptide scan analysis of the TAF2 interaction regions III and IV from TAF8 . 139

Figure 3-24: Region IV alanine substitution peptide scan ...... 140

Figure 3-25: TAF8/10 lacking TAF2 interaction region incorporation into 5TAF ...... 144

Figure 3-26: TAF2 incorporation into 7TAF ∆TAF8[C-term] complex ...... 146

Figure 3-27: Exogenously expressed mCherry-hs TAF2 FL in HEK293 cells ...... 149

Figure 3-28: TAF2 and Importin α do not interact ...... 150

Figure 3-29: SEC analysis of TAF2/8/10 and Importin α mixtures in high and low salt

concentrations ...... 151

Figure 5-1. ESPRIPT representation of a CLUSTAL alignment of Human TAF8 homologs

...... 222

7

Abstract

Initiation of transcription in eukaryotes requires the co-ordinated activity of myriad different and protein complexes. Integral to this process are the general transcription factors

(GTFs), of which TFIID is a principle player, interacting with both the promoter region of the target and RNA polymerase II. That TFIID is critical in transcription initiation has long been appreciated, however the precise molecular details of its function are still being determined. A large step forward in our understanding will come with high-resolution knowledge of the structure-function relationship of TFIID and, in particular, how the different TBP-associated factors that comprise TFIID assemble to form a functional complex.

This thesis focuses on TAF2, the second largest TFIID subunit. The data presented here address structural, biophysical and biochemical properties of TAF2 and how interactions with other TAFs may influence its assembly into TFIID.

Full-length Human, Drosophila and Saccharomyces TAF2 homologs were expressed in insect cells. The full-length proteins were purified to homogeneity yet size exclusion chromatography revealed that the three TAF2 homologs all eluted at larger than expected volumes, indicating the presence of flexible regions or a non-globular shape. Extensive crystallization trials were performed using the full-length protein without success. Proteolytic analysis revealed a ‘TAF2-core’ (residues 1-1027), which lacked a C-terminal putatively unstructured region. This construct was also expressed in insect cells yet could not be crystallised. Additional limited proteolysis analysis suggested the presence of two ‘domains’ of TAF2, a feature conserved across the three homologues tested. One of these ‘domains’ had to the Aminopeptidase N proteins (M1 family of metalloproteins).

Defunct aminopeptidase domains have previously been shown to interact with histone-fold

8 domains (HFDs), for example in the protein ‘Facilitates chromatin transcription’ or FACT.

Two independent histone binding assays were performed but not interaction was detected.

HFDs are also found in certain TAFs. Of the 5 pairs of HFDs found in TFIID found, SEC co- migration assays revealed that TAF2 interacts specifically with a HFD-mediated TAF8-10 dimer to form a 1:1:1 complex. Further biochemical and biophysical analyses mapped the

TAF2 interaction region to the unstructured C-terminal tail of TAF8, rather than to the HFDs present in TAF8 or TAF10. SEC, AUC, NMR and peptide array experiments showed distinct binding regions in the C-terminus of TAF8. Our collaborators (The group of Laszlo Tora,

IGBMC, Strasbourg) have identified a heterotrimeric TFIID subcomplex consisting of TAF2,

TAF8 and TAF10 in the cytoplasm of human cells. Binding of TAF2 to TAF8/10 promoted the incorporation of TAF2 into a nuclear core-TFIID complex. These results provide evidence for a step-wise assembly pathway of TFIID in the nucleus, regulated by nuclear import of preformed cytoplasmic submodules.

9 1. Introduction

1.1. Eukaryotic Transcription, TFIID and the pre-initiation complex

In order for genetic information encoded in the DNA of living organisms to be transformed into functional molecules, must be transcribed into RNA by the host’s transcription machinery. Regulation of the transcription of genes ensures that functional molecules are present at the right time and place allowing different cells to perform their unique functions.

Gene transcription in Humans is governed by three different RNA polymerases (RNApolI, II and III); RNApol I transcribes mainly ribosomal RNA, RNApol III small nuclear, transfer and ribosomal RNA, while RNApol II governs the transcription of messenger RNA(Thomas and Chiang, 2006). Class II transcription concerns all protein coding genes transcribed by

RNApol II, which is recruited to the start site of genes by the General Transcription Factors

(GTFs), TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH (Thomas and Chiang, 2006). DNA elements found at the transcription start sites of genes interact with the GTFs to form what is known as the Pre-Initiation Complex (PIC). The PIC is responsible for recruiting RNA pol II resulting ultimately in gene transcription (Kornberg, 2007). At any given time during the cell cycle, a highly complex regulatory system is in place to coordinate the recruitment of RNA polymerase allowing the correct set of genes to become transcriptionally active (Carey and

Smale, 2001). TFIID, the largest of the GTFs, is a mega-Dalton, multi-protein complex composed of the TATA-Box Binding Protein (TBP) and 13 TBP-associated factors (TAFs)

(Figure 1-1). TFIID is the first of the GTFs to contact the transcription start site nucleating the PIC (Tanese et al., 1991). The complexity of the role of the 13 TAFs in initiating transcription remains largely unknown, in part due to the difficulty in producing TAFs in the quantity and quality necessary for functional and structural studies. TFIID has proven

10 particularly challenging to produce recombinantly using traditional hosts for recombinant protein production, such as E. coli (Berger et al., 2011).

Figure 1-1: Human TFIID subunits. Subunits of the Human TFIID complex. TFIID is composed of TBP and 13 TBP-associated factors (TAFs).

1.2. Basal transcription, the core promoter and TFIID

Basal transcription is the low level of transcription of class II genes by RNA pol II observed in vitro in the presence of the six highly purified GTFs(Roeder, 1991; Sawadogo and

Sentenac, 1990). This low level transcription by RNApol II in vitro without the stimulating or repressive effects of activators, chromatin states and/or co-factors resulted in the dissection of the minimum promoter elements required for transcription, called the ‘core promoter’(Butler and Kadonaga, 2002). The core promoter generally encompasses the transcription start site

(TSS) of genes, usually within ~40 nt of the TSS and comprises several sequence motifs, called core promoter elements, that have specific functions in relation to transcription. In

Eukaryotes, seven core promoter elements have been identified, namely the TATA-box, the

11 Motif-ten element (MTE), the Downstream promoter element (DPE), the Initiator (INR), the

Downstream core element (DCE) and the upstream and downstream TFIIB-recognition elements (BRE u and BRE d) (Tjian and Maniatis, 1994). While the TATA-box element and its role in the recruitment of TFIID to the TSS of Class II genes is the most widely studied core promoter element (CPE), no single CPE is primarily required for transcription. Furthermore, much less well-characterized promoter elements appear to influence the recruitment of TFIID and ultimately the regulation of gene transcription (Juven-Gershon et al., 2008). The well- characterized TBP interaction with the TATA-box core promoter element provides much insight into the mechanism that drives the recognition of Class II genes by TFIID, but only partially explains why TFIID should occupy all Class II promoters (Smale, 1997). Genome scale computational analysis of the indicate that the canonical TATA-box is found at 8.4% of class II promoters while promoters containing the INR only or uncharacterized core promoter elements make up 78.4% of class II promoters(Yang et al.,

2007). Therefore, investigation into the role of gene-selective transcription by TFIID via less well-characterized core promoter elements, such as the INR, is required.

1.2.1. Initiator element

The activity of purified TBP is limited to the recognition of promoters containing a TATA-

Box element and not of other regulatory sequences that seem to be able to direct transcription in the presence of TFIID partially purified from Human cells (Smale et al., 1990). The

Initiator (INR), a core promoter element, was first identified as a sequence that can determine the transcription start site from a promoter lacking a TATA-box element (Smale and

Baltimore, 1989). It was subsequently shown that an INR element located within 25 bp of a

TATA-box element could enhance the strength of the promoter indicating that additional elements on the core promoter could be interacting with the basal transcription machinery

12 (Purnell and Gilmour, 1993). Endogenously purified TFIID complex from HeLa cells was shown to be sufficient to support transcription from a promoter lacking a TATA-Box element

(Zhou et al., 1992) and furthermore TBP was not required for directing basal transcription from a TATA-less promoter (Martinez et al., 1994). Random mutations in the INR element revealed a loose but consistent consensus sequence 5’-pY-pY-A-N-(T/A)-pY-pY-3’

(pY=Pyrimidine A=Adenine N=any nucleotide) suggesting INR elements are recognized by a universal protein (Javahery et al., 1994; Kaufmann et al., 1996). Until recently, TFIID interacting with core promoter DNA at transcription start sites was considered to be principally the role of TBP whose interaction with the TATA-box element would set off the chain of events leading towards PIC assembly and finally recruitment of

RNApolII(Sawadogo and Roeder, 1985). However, there is a significant body of evidence to suggest that distinct TFIID complexes that lack TBP are in fact more often found at the start site of transcription than those containing TBP (Oelgeschläger et al., 1998). Genome wide chromatin immunoprecipitation analyses of human promoters occupied by TFIID suggests that roughly 90% of transcription start sites analyzed are TATA-independent and that INR and DPE core promoter elements are instead enriched. (Carninci et al., 2006; Kim et al.,

2005).

13 1.3. TSM-1 becomes TAF2(TFII150) a bona fide TAF conserved from

Yeast to Humans

The temperature sensitive Yeast mutant TSM-1 was targeted in a genetic screen designed to isolate mating-type specific gene sequences flanking the MAT Locus on

Chromosome III of Saccharomyces cerevisiae . The lethal phenotype of the temperature sensitive mutant could be rescued by complementation with a vector encoding the wild-type region flanking the MAT locus between MATalpha and

THR4. This region was found to contain an open reading frame encoding a 1470 aa protein whose deletion of the last 919 bp, recapitulated the lethal phenotype(Ray et al., 1991). Cloning of a 150 kDa Drosophila protein, which was originally identified by immunoblotting a fraction of immunopurified TFIID, revealed that the Drosophila

TBP-associated-factor (previously named dTF II 150) was in fact a homolog of the essential Yeast protein Temperature Sensitive Mutant-1 (TSM-1)(Verrijzer et al.,

1994). Immunoprecipitation assays using recombinantly produced Yeast and

Drosophila HA-tagged TBP and Human HA-tagged TAF1 with either radiolabelled

Drosophila TAF2 or the COOH-terminal 75% of Yeast TAF2 revealed that both

TAF2 constructs co-immunoprecipitated with TAF1 and TBP, regardless of which homolog was used(Verrijzer et al., 1994). These experiments suggested that an interaction between TAF2, TAF1 and TBP is conserved from Yeast to Humans and provided evidence for a potential role for TAF2 in the recognition of promoter elements in DNA. The Human homolog of Drosophila TAF2 was first identified in a complex deemed to be separate from TFIID called Co-activator of Initiator Function

(CIF). The authors of the study concluded that this Human homolog of Drosophila

TAF2 was not stably associated with TFIID and named it CIF150 instead(Kaufmann

14 et al., 1998). Supporting the claim that CIF150 was not associated with TFIID were gel filtration studies with HeLa nuclear extracts and subsequent immunoblot assays, which showed that a polyclonal antibody raised against recombinantly purified

CIF150 reacted with a species that did not co-migrate with TFIID(Martin et al.,

1999). A conflicting study, showed that a Human homolog of Drosophila TAF2 was stably associated with highly pure immunopurified TFIID and finally resulted in the

Human homolog being recognized as a bona fide TAF(Martinez et al., 1998). Papai and colleagues (Papai et al., 2009) were able to partially deplete TAF2 from their preparation of TFIID containing extract purified from S. cerevisiae by Mono-S FPLC chromatography using a 1 M NaCl gradient. The majority of TAF2 eluted below 0.25

M NaCl, which is consistent with the method that produced the INR activity depleted

(IA- ) fraction of TFIID used in the CIF150 complementation assay by Kaufmann et al(Kaufmann et al., 1998). The conflicting results may be due to a combination of the use of different preparations of HeLa extracts and/or different antibodies, however, regardless of whether TAF2 is stably associated with TFIID, these studies point to an intriguing yet elusive role of TAF2 in core promoter recognition and a potential role for TAF2 separate from TFIID. The second largest TBP associated factor TAF2

(previously called TF II 150) has been shown to interact directly with the Initiator sequence in the core promoter, suggesting an alternative method by which TFIID communicates with gene transcription start sites other than TBP (Kaufmann et al.,

1996; Verrijzer et al., 1995).

15 1.3.1. TAF2 core promoter INR recognition

Before TAF2 was accepted as a bona fide TAF, a partially purified 150 kDa TBP- associated-factor, was first identified in UV cross-linking studies of fractionated nuclear extracts from Drosophila embryos containing TFIID and Heat shock protein

70 (Hsp70) promoter DNA(Gilmour et al., 1990). These studies were the first to implicate TAF2 in the role of core promoter binding. Subsequently, Drosophila TAF2

(dm TAF2) was found to possess the ability to discriminate between proximal and distal INR elements of a promoter upstream of the gene encoding Alcohol dehydrogenase (Hansen and Tjian, 1995). In a separate study, complementation assays showed that Ni 2+ -purified Human homolog of Drosophila TFII150, then named

Co-factor for TFIID-dependent initiator function (CIF150), could restore transcriptional activity from HeLa cell extracts depleted in initiator activity

(Kaufmann et al., 1998). Further evidence that TAF2 is potentially involved in promoter DNA recognition is that purified TAF2 (then called CIF-150) preferentially bound DNA fragments containing the consensus sequence 5'-Py-X-GAG-A/C-A/Py-

3', based on the promoter sequence of Cyclin B1 over DNA fragments with a randomized sequence(Martin et al., 1999). While this consensus sequence is not found in the INR specifically, the authors postulate that TAF2 might stabilize TFIID on TATA-less promoters for the eventual recognition of the INR by a different TAF.

1.3.2. TAF2 in TFIID

In the context of TFIID, TAF2 is thought to play a major role in transcription. It has been implicated in the recognition of DNA elements exclusive of the TATA-Box found in the ‘Core Promoter’ of gene transcription start sites (See section 1.3.1 and references therein). Immunopurified Drosophila TFIID incubated with Drosophila

16 Hsp70 promoter DNA substituted with Bromodeoxyuridine followed by crosslinking with UV light and separation by SDS-PAGE chromatography revealed two radiolabelled species with approximate molecular weights matching both TAF1 (250 kDa) and TAF2 (150 kDa)(Verrijzer et al., 1995). Replacing immunopurified TFIID by a complex made of recombinant TAF1-TAF2-TBP resulted in the same outcome

(Verrijzer et al., 1995). Given that TAF2 is sometimes considered not to be a stable member of TFIID (Kaufmann et al., 1998; Papai et al., 2009) and that the Drosophila homolog has been produced recombinantly using a baculovirus system(Martin et al.,

1999), TAF2 could be considered a suitable target for structural studies.

1.3.3. Domain Structure of TAF2

Secondary structure predictions of TAF2 reveal a conserved unfolded region in the C- terminal half of the protein (Figure 1-2). A sequence similarity search of Human

TAF2 against a non-redundant database of all proteins using the BLAST search algorithm reveals a similarity to the Aminopeptidase N of the M1 family of metalloproteases localized to the N-terminal region of TAF2.

(http://blast.ncbi.nlm.nih.gov/Blast.cgi).

Figure 1-2: TAF2 Conserved Domains and propensity to fold. The Domain topology of three homologs of TAF2 showing N-terminal conserved

17 aminopeptidase homology in green. The numbers delineate the boundaries of the conserved aminopeptidase domain. Conserved domains were found using the web tool SMART (a Simple Modular Architecture Research Tool) (http://smart.embl-heidelberg.de/). The areas in red delineate the amino acids that are predicted to be unfolded using the web tool FoldIndex© (http://bip.weizmann.ac.il/fldbin/findex).

1.4. Structural studies of TFIID

1.4.1. Structural analysis of TAFs

The first subunit of TFIID to be structurally characterized was one of the two isoforms of TBP from Arabidopsis thaliana (at). The sequence encoding at TBP-2 was overexpressed in E. coli and tested for activity in a reconstituted Human in vitro transcription system (Gasch et al., 1990). Active protein was isolated and crystals diffracting to 2.6 Å were obtained (Nikolov et al., 1992). Higher resolution crystal and NMR structures of TBP from other organisms including Homo sapiens have since been described. The first histone fold containing pairs of subunits from TFIID were cloned from Saccharomyces cerevisiae.

1.5. Production of TAF samples for functional and structural studies

1.5.1. Purification of native TAFs

Historically, TAFs and TFIID have been purified form native sources using immunoaffinity purification techniques. Protein purified this way may have the advantage of possessing ‘native’ post-translational modifications and co-purifying native co-factors often necessary to ensure correct function with, however, unfeasibly large volumes of cultures and large amounts of antibodies are required. Due to

18 restricted flexibility when purifying from native sources, it becomes difficult to introduce mutations, make alternative constructs, truncations, etc that are often necessary to probe the function of a protein. Recombinant production of protein can be used to overcome low yields and flexibility of construct design, however, traditional methods using E. coli for recombinant protein production have proved not to be suitable for many of the proteins found in TFIID.

1.5.2. Production of recombinant TAFs for structural and functional studies

Historically, full-length Human TAF subunits have proven difficult to produce individually using traditional recombinant protein production methods that employ E. coli as an expression host. Since many of the TAF subunits have rather been expressed in eukaryotic hosts, a baculovirus expression system is a feasible option for screening TAF construct solubility in order to produce samples in the quality and quantity necessary for structural studies. In this thesis, the MultiBac system, was used to produce single proteins as well as a protein complex. The MultiBac system contains a suite of vectors that allow the heterologous expression of multiple proteins simultaneously from a single baculovirus encoded on separate protein expression cassettes or as polycistrons (Berger et al., 2004; Bieniossek and Berger, 2008;

Bieniossek et al., 2008; Fitzgerald et al., 2006; 2007; Nie et al., 2011; Trowitzsch et al., 2010; Vijayachandran et al., 2011). Multiple MultiBac 'Donor' vectors with varying antibiotic resistance markers encoding the conditional origin of replication,

R6K gamma, can be fused to an 'Acceptor' vector encoding a standard origin of replication via Cre-lox recombination. For replication of a 'Donor' vector encoding the

R6K gamma origin of replication, it is essential that the vector is propagated in a strain of E. coli expressing the pir-encoded pi protein. Once vectors are fused by Cre-

19 lox recombination, they can be separated into individual 'Acceptor' and 'Donor' vectors as a result of the reversibility of the reaction catalyzed by Cre-recombinase enzyme. Sequence and ligation independent cloning allows flexibility in inserting or deleting of any DNA sequence necessary to improve the solubility and/or yield of a resulting protein of interest, regardless of the availability of 'convenient' restriction sites present, traditionally used in cloning procedures. The SLIC method utilizes the complementarity of short regions of DNA, typically 12-18 bp long and the E. coli replication machinery to repair nicked DNA to create the desired product (Li and

Elledge, 2007). One of the major challenges in structural biology remains the production of soluble eukaryotic protein complexes for structural biology. A combination of the SLIC method to produce a variety of gene constructs to screen for solubility with the potential of the MultiBac system to produce eukaryotic expression targets in a combinatorial fashion the previously, the challenges preventing the study of difficult eukaryotic protein complex targets becomes attainable. A more comprehensive description of the methodology of the MultiBac protein expression system, the techniques used to employ it and how it can be applied to the production of multi-subunit protein complexes is provided in three articles that I co-authored.

These articles are presented in section 1.6.

20 1.6. Publications

In this section, three publications are presented which describe the technologies and procedures that were used to obtain the results described in this thesis, including technologies that were developed or optimized during the experimental part of this thesis.

1.6.1. Publication 1

Title: Tandem recombineering by SLIC cloning and Cre-LoxP fusion to generate multigene expression constructs for protein complex research.

Authors: Haffke, M., Viola, C., Nie, Y. & Berger, I.

Journal: Methods in Molecular Biology 2013;1073:131-40.

In this manuscript, the tandem recombineering (TR) method developed in the Berger laboratory is described in detail. This method combines sequence and ligation independent cloning (SLIC) with DNA fusion by the Cre recombinase enzyme via

LoxP sequences present on the DNAs. TR can be used to rapidly generate expression constructs, gene truncations and mutations, optionally in high-throughput.

The manuscript is published in Methods in Molecular Biology.

21 Running Head:

Tandem recombineering by SLIC cloning and Cre-LoxP fusion to generate multigene expression constructs for protein complex research

Author affiliations:

Matthias Haffke 1, Cristina Viola 1, Yan Nie 1 and Imre Berger 1,2

1 European Molecular Biology Laboratory (EMBL), BP 181, and Unit of Virus Host Cell Interactions

(UVHCI), Polygone Scientifique, 6 Rue Jules Horowitz, 38042 Grenoble Cedex 9, France

2Corresponding author: [email protected]

Keywords

Sequence and ligation independent cloning, Cre recombinase, Cre-LoxP fusion, multigene delivery, multiprotein complexes, MultiBac, ACEMBL automation, robotics

22 Summary

A robust protocol to generate recombinant DNA containing multigene expression cassettes by using sequence and ligation independent cloning (SLIC) followed by multiplasmid Cre-LoxP recombination in tandem for multiprotein complex research is described. The protocol includes polymerase chain reaction (PCR) amplification of the desired genes, seamless insertion into the target vector via SLIC and Cre-LoxP recombination of specific donor and acceptor plasmid molecules, optionally in a robotic setup. This procedure, called tandem recombineering, has been implemented for multiprotein expression in E.coli and mammalian cells, and also for insect cells using a recombinant baculovirus.

1. Introduction

High flexibility and diversity in cloning techniques are essential aspects for the creation of multigene constructs and multiprotein assemblies in synthetic biology (1). Common techniques used to insert

PCR products into vectors for are restriction enzyme dependent cloning (2), blunt end cloning (3) and Gateway cloning (4). However, such cloning techniques possess limitations due to the requirements for specific DNA sequences and/or restriction enzyme sites and are therefore not feasible for high-throughput applications or automation. Since SLIC removes the requirement for specific

DNA sequences and furthermore does not require restriction enzyme sites, it is more suitable for integration in an automated setup (5, 6).

In a typical SLIC reaction, the gene of interest (GOI) is amplified using primers which provide a homology sequence to the vector on their 5’ sites, followed by a GOI specific sequence (Fig. 1).

Primers for the creation of multigene constructs are designed in a similar way, providing a complementary sequence to the 5’ adjacent GOI or to the homology sequence of the vector (Fig. 2).

Primers for linearization of the vector are complementary to the homology sequences chosen in the

GOI primers. The PCR products and the linearized vector are treated with T4 DNA polymerase, which exhibits 3’ exonuclease activity in the absence of dNTPs to generate 5’ overhangs. In vitro recombination is achieved by annealing of the T4 DNA polymerase treated fragments and transformation of competent E.coli cells with the reaction mix.

23 The combination of SLIC with Cre-LoxP recombination of specific acceptor and donor plasmids in vitro , called tandem recombineering, further increases versatility and flexibility of the generation of multigene constructs for multiprotein expression. The ACEMBL technology is available for E.coli

(MultiColi ) (6, 8), mammalian cells (MultiMam ) (9) and insect cells via a recombinant baculovirus

(MultiBac ) (7, 10) (Tab. 1). Both acceptor and donor plasmids contain LoxP sites for recombination via Cre recombinase in vitro . Acceptor plasmids can be recombined with multiple donors to create fused plasmids for multiprotein expression ( Fig. 3). Since donor plasmids carry a conditional origin of replication (R6K γ), they are only propagated in pir positive E.coli strains or after fusion with one/multiple acceptor plasmids in conventional cloning ( pir negative) strains (6, 8). This, in combination with different antibiotic resistances, (Tab. 1) allows for specific selection of the desired

Cre-LoxP recombined multiplasmid constructs. The methods described here were optimized to be integrated in an automated robotic setup with a liquid handling system (6).

2. Materials

All solutions should be prepared using ultrapure water (Millipore Milli-Q system or equivalent; conductivity of 18.2 MΩ·cm at 25°C) and analytical grade reagents. Store all buffers, antibiotics and enzymes at -20°C.

2.1 Preparation of vector

1. LB Broth (Miller, cat. no. 0103)

2. Purified Agar Agar (Euromedex, ref. 1329-D)

3. Sterile polystyrene Falcon tube (15 ml)

4. QIAprep Spin Miniprep Kit (Qiagen, cat. no. 27104)

5. Antibiotics: Ampicillin, Chloramphenicol, Spectinomycin, Tetracycline, Gentamycin,

Kanamicin ( see Note 1)

2.2 PCR and linearization of vector

1. Phusion High-Fidelity DNA Polymerase (Thermo Scientific, kit cat. no. F-530S)

2. 5xPhusion HF Buffer (included in kit)

3. 10 mM dNTP mix (New England Biolabs Inc., cat. no. N0447S)

24 4. Thermocycler (e.g. Biometra GmbH, Thermocycler T3000)

2.3 Dpn1 digest

1. Dpn1 (New England Biolabs Inc., cat. no. R0176S)

2. 10x NEBuffer 4 (included in kit)

3. 37°C water bath

4. Qiagen spin column (QIAquick Gel Extraction Kit, cat. no. 28704)

5. Qiagen buffers (included in kit)

6. Agarose gel electrophoresis system (e.g. BioRad, Mini-Sub Cell GT System)

7. 5x TBE Buffer: 0.89 M Tris base, 0.89 M boric acid, 20 mM EDTA (pH 8.0) (see Note 2)

8. Agarose Type D-5 DNA-grade (Euromedex, ref. D5-D)

9. 6x DNA Loading Dye: 30% (v/v) glycerol, 0.125% (w/v) bromophenol blue, 0.125% (w/v)

Xylene cyanol FF (see Note 3)

10. 1 kb DNA Ladder (New England Biolabs Inc., cat. no. N3232S) (see Note 4)

2.4 T4 DNA Polymerase treatment

1. T4 DNA Polymerase (New England Biolabs Inc., cat. no. M0203S)

2. NEBuffer 2 (included in kit)

3. 2M Urea

4. 500 mM EDTA (see Note 5)

5. 75°C Heat Block

2.5 SLIC annealing

1. 65°C heat block

2.6 Transformation of chemical competent cells

1. BW23474 chemical competent cells or equivalent

2. 42°C waterbath

3. LB Broth (Miller, cat. no. 0103)

4. 37°C shaking incubator

2.7 Cre-LoxP recombination

1. Cre Recombinase (New England Biolabs Inc., cat. no. M0298S)

25 2. 10x Cre Recombinase Reaction Buffer (New England Biolabs Inc., cat. no. B0298S)

3. 37°C water bath

3. Methods

3.1 Preparation of vector

1. Inoculate 5 ml of LB broth containing appropriate antibiotics in a 15 ml Falcon tube from a

glycerol stock of E.coli cells containing the desired vector. Incubate at 37°C, agitating at

150 rpm for 12 h. Concentrations for antibiotics: Ampicillin 50 µg/ml, Chloramphenicol

34 µg/ml, Spectinomycin 100 µg/ml Tetracycline 12.5 µg/ml, Gentamycin 10 µg/ml,

Kanamicin 30 µg/ml.

2. Centrifuge the Falcon tubes for 10 min at 5,000 x g at 4°C. Take off the supernatant and invert

the Falcon tubes to drain.

3. Perform a plasmid prep using the QIAprep Spin Miniprep Kit and follow the instructions in

the product’s manual.

4. Determine the concentration of the extracted DNA spectrophotometrically (e.g. Thermo

Scientific NanoDrop 2000).

3.2 PCR and linearization of vector

1. Identical PCR reactions are set up for amplification of the desired insert and linearization of

the vector.

2. Set up a 100 µl PCR reaction in a 0.5 ml PCR tube: Mix 1 µl template DNA (approximately

10 ng) with 20 µl 5x Phusion HF Buffer ( see Note 6), 2 µl 10 mM dNTP mix, 1 µl of forward

primer (concentration 100 µM), 1 µl of reverse primer (concentration 100 µM) and 74.5 µl

water.

3. Add 0.5 µl Phusion High-Fidelity DNA Polymerase and mix (see Note 7).

4. Choose appropriate annealing temperatures for the specific primers chosen to perform the

PCR (see Note 8). Typically, templates are initially denatured at 98°C for 60 s; followed by 30

26 cycles at 98°C for 20 s, the specific annealing temperature for 30 s, 72°C for 30 s (for 1 kb

product size); and a single final step at 72°C for 10 min.

3.3 Dpn1 digest and purification of PCR product and linearized vector

1. Add 20 U of Dpn1 directly to the 100 µl PCR product and incubate at 37°C for 2 h ( see

Note 9). This step is not required for insert PCR reactions if the resistance marker of the

template plasmid differs from the destination vector.

2. Mix with 20 µl 6x DNA loading dye, load on 1% TBE agarose gel and run the gel at 100 V

(see Note 10 ) for around 1.5 h until the 1 kb DNA ladder is well resolved.

3. Excise the band corresponding to the PCR product using a UV light box and transfer to a 2 ml

sterile Eppendorf tube (see Note 11).

4. Extract the DNA from the gel slices using the QIAquick Gel Extraction Kit following the

instructions in the product’s manual.

5. Determine the concentration of the extracted DNA spectrophotometrically (e.g. Thermo

Scientific NanoDrop 2000).

3.4 T4 DNA Polymerase treatment of PCR product and linearized vector

1. Set up the reaction in a 0.5 ml PCR tube: Mix 2 µl 10x NEBuffer 2, 1 µl 100mM DTT, 2 µL

2M Urea, 0.5 U T4 DNA Polymerase and 1 µg of the purified PCR product (see Note 12 ) in a

total volume of 20 µl. For a 20 bp overhang between PCR product and vector, incubate for

30 min at room temperature (see Note 13 ).

2. Stop the reaction by adding 1 µL of 500 mM EDTA.

3. Inactivate T4 DNA Polymerase by heating to 75°C for 20 min.

3.5 SLIC annealing

1. Set up the reaction in a 0.5 ml PCR tube: Mix 10 µL of the T4 DNA polymerase treated vector

with 10 µL of T4 DNA polymerase treated insert.

2. Incubate at 65°C for 10 min and let cool down slowly in the heat block at RT.

3.6 Transformation of chemical competent cells

27 1. Mix 5 µl of the annealing reaction with 50 µl of BW23474 chemical competent cells on ice

and incubate for 30 min, heat shock at 42°C for 60 s, incubate on ice for 2 min, add 400 µl of

LB Broth and incubate in a 37°C shaker for 1 h.

2. Plate 100 µl of the cells on a selective LB agar plate with appropriate antibiotic(s), pellet the

remaining cells by centrifugation at 4,000 x g for 1 min, take off 250 µl supernatant and

resuspend the pellet in the remaining 100 µl. Plate this concentrated cell suspension cells on a

second LB agar plate.

3.7 Cre-LoxP recombination of Acceptor and Donor vectors

1. Set up a 10 µl Cre-LoxP recombination reaction in a 0.5 ml PCR tube: Mix 1 µg of Donor

vector with a 1:1 molar ratio of Acceptor, 1 µl 10x Cre Recombinase Reaction Buffer and

0.5 µl Cre Recombinase in a 10 µl total reaction volume.

2. Incubate the reaction at 37°C for 1 h (see Note 14).

3.8 Transformation of chemical competent cells

1. Mix 5 µl of the Cre-LoxP recombination reaction with 50 µl of BW23474 chemical competent

cells on ice and incubate for 30 min, heat shock at 42°C for 60 s, incubate on ice for 2 min,

add 400 µl of LB Broth and incubate at 37°C for overnight (see Note 15).

2. Plate 100 µl of the cells on a selective LB agar plate with appropriate antibiotic(s), pellet the

remaining cells by centrifugation at 4,000 x g for 1 min, take off 250 µl supernatant and

resuspend the pellet in the remaining 100 µl. Plate the remaining cells on a second LB agar

plate.

4. Notes

1. Carbenicillin can be used as a substitute for Ampicillin (at the same concentration) to reduce

presence of satellite colonies. Concentration of stock solutions (1000x): Ampicillin 50 mg/ml

in water, Carbenicillin 50 mg/ml in 50% ethanol, Chloramphenicol 34 mg/ml in ethanol,

Spectinomycin 10 mg/ml in water, Tetracycline 12.5 mg/ml in 70% ethanol, Gentamycin

10 mg/ml in water, Kanamicin 30 mg/ml in water.

28 2. Weigh 108 g Tris base (MW: 121.10 g/mol) and 55 g boric acid (MW: 61.83 g/mol) and add

40 ml of 0.5 M EDTA, pH 8.0 in a 2-L graduated cylinder. Having water on the bottom of the

cylinder (~400 ml) and stirring while adding Tris base and boric acid helps to dissolve these

components. Fill up to a total volume of 2 L with water. Filter through 0.22 µm filter and

autoclave to prevent precipitation during long-term storage. Store at room temperature.

3. Add 0.125% Orange G to the 6x DNA Loading Dye if working with small PCR products.

Orange G migrates at about 50 bp in 1% TBE agarose gels and helps to determine the time

needed for electrophoresis.

4. For smaller PCR products use a 100 bp DNA ladder (New England Biolabs Inc., cat. no.

N3231S) to identify fragments in the range of 100 bp to 1 kbp more easily.

5. Weigh 73.06 g EDTA (MW: 292.24 g/mol), add 400 mL of water and adjust pH to 8.0. EDTA

will not dissolve until the pH is adjusted to 8.0. Top up to a total volume of 500 mL. Filter

through a 0.22 µm filter and store at room temperature.

6. When using the Phusion High-Fidelity DNA Polymerase kit, the 5x GC buffer can help to

increase the performance of Phusion High-Fidelity DNA Polymerase on long or GC rich

templates. When working with GC rich templates, add 3%DMSO as a PCR additive to aid

denaturing of templates with high GC content. It is practical to run two PCR reactions with

HF and GC buffer in parallel and compare yield and PCR product specificity for both

reactions.

7. Mix by pipetting or flipping the tube. Centrifuge for 10 s at 4,000 x g to collect the mix on the

bottom of the PCR tube. No bubbles should remain in the tube.

8. When using the Phusion High-Fidelity DNA Polymerase kit, calculate the annealing

temperature with the manufacturer’s Tm calculator tool on the website:

http://www.finnzymes.fi/tm_determination.html

9. This is a critical step to reduce background colonies after transformation. The Dpn1 digest can

be incubated longer than 2 h (e.g. overnight) to reduce background colonies.

10. Depending on the gel system used the voltage might be increased up to 120 V to reduce

separation time. Increasing the voltage can result in heating up and melting the agarose gel.

29 11. 1.5 ml Eppendorf tubes can be used in this step as well, depending on size of the gel slice.

When excising the desired band from the agarose gel, use longer wavelength (e.g. 365 nm or

equivalent) and reduced intensity on the UV lightbox to avoid any modifications to your PCR

product.

12. It is important to purify the desired PCR products as described before the T4 DNA

Polymerase treatment as residual dNTPs from the PCR reaction inhibit the 3’ exonuclease

activity of the T4 DNA Polymerase.

13. The incubation time is a critical step for T4 DNA Polymerase treatment. A too short

incubation time will result in non-overlapping overhangs between PCR product and vector and

impede correct annealing.

14. Longer incubation times will likely lead to undesired higher molecular weight recombination

products.

15. Long recovering times are essential to obtain positive transformants, especially when creating

multiple acceptor-donor fusions due to the high selective pressure from the combination of

antibiotics used.

References

1. Ellis, T., Adie, T., and Baldwin, G.S. (2011) DNA assembly for synthetic biology: from parts

to pathways and beyond. Integr. Biol. 3, 109-118.

2. Scharf, S. J., Horn, G. T., and Erlich, H. A. (1986) Direct cloning and sequencing analysis of

enzymatically amplified genomic sequences. Science 233 , 1076–1078.

3. Costa, G. L., Grafsky, A., and Weiner, M. P. (1994) Cloning and analysis of PCR-generated

DNA fragments . PCR Meth. Appl. 3, 338–345.

4. Esposito, D., Garvey, L.A., and Chakiath C.S. (2009) Gateway cloning for protein expression.

Methods Mol. Biol. 498 , 31-54.

5. Li, M.Z., and Elledge, S.J. (2007) Harnessing homologous recombination in vitro to generate

recombinant DNA via SLIC. Nat. Meth. 4, 251-256.

30 6. Bieniossek, C., Nie, Y., Frey, D., Olieric, N., Schaffitzel, C., Collinson, I., Romier, C., Berger,

P., Richmond, T.J., Steinmetz, M.O., and Berger, I. (2009) Automated unrestricted multigene

recombineering for multiprotein complex production. Nat. Meth. 6, 447-450.

7. Fitzgerald, D.J., Berger, P., Schaffitzel, C., Yamada, K., Richmond, T.J., and Berger, I. (2006)

Protein complex expression by using multigene baculoviral vectors. Nat. Meth. 3, 1021-1032.

8. Nie, Y., Bieniossek C., Frey, D., Olieric, N., Schaffitzel, C., Steinmetz, M.O., and Berger, I.

(2009) ACEMBLing multigene expression vectors by recombineering. Nat. Protoc. 4,

doi:10.1038/nprot.2009.104.

9. Kriz, A., Schmid, K., Baumgartner, N., Ziegler, U., Berger, I., Ballmer-Hofer, K., and Berger,

P. (2010) A plasmid-based multigene expression system for mammalian cells. Nat. Commun.

1, doi:10.1038/ncomms1120.

10. Berger, I., Fitzgerald, D.J., and Richmond, T.J. (2004) Baculovirus expression system for

heterologous multiprotein complexes. Nat. Biotechnol. 22 , 1583-1587.

11. Vijayachandran, L.S., Viola, C., Garzoni, F., Trowitzsch, S., Bieniossek, C., Chaillet, M.,

Schaffitzel, C., Busso, D., Romier, C., Poterszman, A., Richmond, T.J. and Berger, I. (2011)

Robots, pipelines, polyproteins: enabling multiprotein expression in prokaryotic and

eukaryotic cells. J. Struct. Biol . 175 , 198-208.

Acknowledgements

We thank all members of the Berger laboratory for helpful discussions. MH is recipient of a Kekulé fellowship of the Fonds der Chemischen Industrie (FCI, Germany). YN is a fellow of the Marie-Curie training network Chromatin Plasticity and the Boehringer Ingelheim Foundation (BIF, Germany). IB acknowledges support from the Swiss National Science Foundation (SNSF), the Agence Nationale de la Recherche (ANR), the Centre National de la Recherche Scientifique (CNRS), the EMBL and the

European Commission (EC) through the joint EIPOD program, and the European Commission (EC) projects SPINE2-Complexes and 3D-Repertoire (Framework Program 6 (FP6)), as well as

INSTRUCT, PCUBE, BioSTRUCT-X and ComplexINC (EC FP7).

31 Competing financial interest statement

The authors declare competing financial interests. IB is author on patents and patent applications related to the methods here described.

32 Figures:

Fig. 1 . Schematic overview of single- and multigene SLIC reactions. Genes of interest (A, B, C) are shown as colored boxes. 5’ sites in primers and T4 DNA polymerase treated PCR products are indicated. Regions of homology are indicated by different grayscales. Inset: schematic representation of the primer design for SLIC reactions. The homology sequence should be 20 bp long, a similar length should be chosen for the GOI specific sequence.

33

Fig. 2. Examples for primer design for single- and multigene SLIC reactions. Complementary sequences to GOIs and vectors are indicated by lines, as well as 5’ and 3’ sites. Homology regions for multi SLIC reactions are shown in different grayscales. The sequences shown do not refer to a specific vectors or GOIs and need to be changed accordingly.

34

Fig. 3. Schematic representation of the Cre-LoxP recombination process. The Cre recombination process is an equilibrium reaction and gives rise to all combinations of the acceptor (A) and donor (D) fusions. One acceptor can be fused with multiple donors. Desired acceptor-donor fusions (A-D1 / A-

D2 / A-D1-D2) are selected via specific antibiotics (colored boxes). The process of Cre-LoxP recombination is reversible (De-Cre reaction). LoxP sites are shown as red balls. Adapted from (11).

35

Tab 1. Overview of available ACEMBL systems showing all acceptor and donor plasmids for prokaryotic ( MultiColi ), mammalian ( MultiMam ) and baculovirus expression ( MultiBac ). 1 For reagents contact: [email protected]

1 Reprinted from: Robots, pipelines, polyproteins: enabling multiprotein expression in prokaryotic and eukaryotic cells, 175(2), Vijayachandran, L.S., Viola, C., Garzoni, F., Trowitzsch, S., Bieniossek, C., Chaillet, M., Schaffitzel, C., Busso, D., Romier, C., Poterszman, A., Richmond, T.J. and Berger, I., pages 198-208, Copyright 2011, with permission from Elsevier.

36 1.6.2. Publication 2

Title: Robots. pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukaryotic cells

Authors: Vijayachandran, L.S., Viola, C., Garzoni, F., Trowitzsch, S., Bienossek, C.,

Chaillet, M., Schaffitzel, C., Busso, D., Romier, C., Poterszman, A., Richmond, T. J.,

Berger, I.

Journal: J Struct Biol. 2011 Aug;175(2):198-208

The following manuscript outlines the goals of the SPINE2 project and summarizes the results from this structural proteomics initiative while addressing the challenges of the overproduction of recombinant eukaryotic protein complexes.

The manuscript is published in the Journal of Structural Biology .

37 Journal of Structural Biology xxx (2011) xxx–xxx

Contents lists available at ScienceDirect

Journal of Structural Biology

journal homepage: www.elsevier.com/locate/yjsbi

Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukaryotic cells

Lakshmi Sumitra Vijayachandran a, Cristina Viola a, Frederic Garzoni a, Simon Trowitzsch a, Christoph Bieniossek a, Maxime Chaillet a, Christiane Schaffitzel a, Didier Busso b,c,d,e, ⇑ Christophe Romier b,c,d,e, Arnaud Poterszman b,c,d,e, Timothy J. Richmond f, Imre Berger a, a European Molecular Biology Laboratory (EMBL), Grenoble Outstation, and Unit of Virus Host Cell Interactions UVHCI, UJF-CNRS-EMBL Unite Mixte International UMI 3265, rue Jules Horowitz, 38042 Grenoble Cedex 9, France b IGBMC (Institut de Génétique et de Biologie Moléculaire et Cellulaire), 1 rue Laurent Fries, BP10142, Illkirch F-67400, France c Inserm, U964, Illkirch F-67400, France d CNRS, UMR7104, Illkirch 67400, France e Université de Strasbourg, Strasbourg F-67000, France f Institute for Molecular Biology and Biophysics IMB, Swiss Federal Institute of Technology ETH Hönggerberg, Schafmattstrasse 20, CH 8093 Zürich, Switzerland article info abstract

Article history: Multiprotein complexes catalyze vital biological functions in the cell. A paramount objective of the Received 15 December 2010 SPINE2 project was to address the structural molecular biology of these multiprotein complexes, by Received in revised form 11 March 2011 enlisting and developing enabling technologies for their study. An emerging key prerequisite for studying Accepted 11 March 2011 complex biological specimens is their recombinant overproduction. Novel reagents and streamlined pro- Available online xxxx tocols for rapidly assembling co-expression constructs for this purpose have been designed and validated. The high-throughput pipeline implemented at IGBMC Strasbourg and the ACEMBL platform at the EMBL Keywords: Grenoble utilize recombinant overexpression systems for heterologous expression of proteins and their ACEMBL complexes. Extension of the ACEMBL platform technology to include eukaryotic hosts such as insect Automation Baculovirus and mammalian cells has been achieved. Efficient production of large multicomponent protein com- BEVS plexes for structural studies using the baculovirus/insect cell system can be hampered by a stoichiome- High-throughput tric imbalance of the subunits produced. A polyprotein strategy has been developed to overcome this Insect cells bottleneck and has been successfully implemented in our MultiBac baculovirus expression system for Mammalian host producing multiprotein complexes. MultiBac Ó 2011 Elsevier Inc. All rights reserved. pET-MCN Polyprotein Recombineering Robotics Structural biology

Abbreviations: BEVS, baculovirus expression vector system; CFP, cyan florescent 1. Introduction protein; CMV, cytomegalovirus; dpa, day of proliferation arrest; dsRed, red fluorescent protein; E. coli , Escherichia coli; EGFP, enhanced green fluorescent protein; EM, electron microscopy; FRET, fluorescence resonance energy transfer; The structural proteomics initiative SPINE (Structural Proteo- HT, high throughput; kb, kilo base; kDa, kilo dalton; MOI, multiplicity of infection; mics IN Europe) was a highly successful European project aimed NMR, nuclear magnetic resonance (spectroscopy); ORF, open reading frame; Ori, at production and structural characterization of mostly single pro- origin of replication; p10, p10 baculoviral late promoter; polh, polyhedrin bacul- teins with important roles in human health. By building on these oviral very late promoter; R6Kc, bacteriophage R6K c; SDS–PAGE, sodium dodecyl- sulfate–polyacryamide gel electrophoresis; Sf9, Sf21, Spodoptera frugiperda cell lines successes, SPINE2 (full name SPINE2-COMPLEXES) addressed more 9 or 21, respectively; SLIC, sequence and ligation independent cloning; tcs, TEV challenging biological systems which consists of many interlocking protease cleavage site; TEV, tobacco etch virus, resp. a protease (N1A) from this protein subunits assembled in complexes to exert their biological virus; YFP, yellow fluorescent protein. function. Multiprotein complexes are emerging as cornerstones ⇑ Corresponding author. Address: European Molecular Biology Laboratory (EMBL Grenoble), BP 181, 6 rue Jules Horowitz, 38042 Grenoble Cedex 9, France. Fax: +33 of biological activity in the cell, and deciphering their structure (0)476207199. and function is imperative for advancing research in health and E-mail address: [email protected] (I. Berger). disease (e.g. Alberts, 1998; Nie et al., 2009a ). Many essential

1047-8477/$ - see front matter Ó 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jsb.2011.03.007

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

38 2 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx protein complexes, particularly in human cells are comparatively Owing to a large part to structural genomics efforts such as the hard to come by, thus complicating their study. Low endogenous SPINE project, affordable methods and equipment have been levels and sample heterogeneity often impede purification from developed to automate molecular cloning of expression constructs, native sources in the quality and quantity required for structural recombinant expression screening and purification. There are obvi- studies, especially by X-ray diffraction methods requiring highly ous advantages of automation in the molecular biology laboratory. purified material that can crystallize. Recombinant techniques Procedures can be carried out in parallel and scaled-up accordingly can overcome several of the bottlenecks encountered, and can fur- to process large amounts of samples at reasonable cost. Automa- ther provide the means to modify complexes at the gene level tion puts constraints on the robustness of protocols to be imple- resulting in alteration of the protein complex subunits to meet mented which are certainly more stringent than regular manual the high quality requirements for structural analysis. Conse- laboratory intervention requires. In our experience, a good protocol quently, SPINE2-COMPLEXES placed considerable emphasis on yielding reproducible results, be it for cloning, transformation or developing and implementing technologies, protocols and reagents expression screening, by no means guarantees that it can be to facilitate protein complex production for structural studies, in scripted without further ado into a robotics routine. Rather, seem- prokaryotic and eukaryotic host organisms. In this contribution, ingly robust protocols for manual experimentation often have to be we will review the high throughput platforms at Strasbourg and optimized further to work robustly in a parallelized manner on a Grenoble as well as the practical considerations concerning setting robot. This is beneficial to the laboratory implementing automa- up and running a eukaryotic expression facility. New approaches tion in several ways. Not all experiments will be carried out on a for eukaryotic protein production, including the benefits of poly- robot even in the most well-equipped laboratory, and the general protein design, will also be discussed. success rate of manual experimentation, at least in our laborato-

Fig.1. SPINE2: automation for HT protein complex production. (A) The high throughput pipeline for screening co-expression constructs at IGBMC Strasbourg is shown in a schematic representation. Relying on semi-automated procedures, co-expression constructs are prepared in a parallel fashion by the cloning method of choice. This can be carried out in small volumes for example on a 96 well micro-titer plate. Constructs are transformed and plated on agar provided on 24 well tissue culture (TC) plates (top). Parallel expression of single colonies is performed in 2–4 ml miniculture volumes in deep well TC plates screening a variety of parameters (media, temperature, induction, etc.). Cells are pelleted and lysed by a multitip sonicator. Overproduced soluble protein complexes are purified from the cleared lysate by metal affinity purification on deep well plates using automated procedures (middle). The complexes retained on the affinity resin are analyzed by SDS–PAGE (below). Promising complex candidates are prioritized for scale-up, biochemical and biophysical characterization and structure determination (bottom). (B) The multigene cloning and expression pipeline ACEMBL at the EMBL Grenoble relies on fully automated protocols scripted into robotic routines that are implemented on a liquid handling workstation (a Tecan Evo II LHW is shown). One or several genes are inserted in specifically designed synthetic plasmids called Donors and Acceptors ( Table 1) by sequence and ligation independent cloning (SLIC). These plasmids are next concatamerized by Cre–LoxP fusion into multigene constructs for protein complex production. The combination of SLIC and Cre–LoxP reactions for multigene vector construction is termed tandem recombineering (TR). The Cre-reaction is an equilibrium reaction and can also be exploited for disassembling constructs. (C) The Cre–LoxP fusion reaction is detailed in a schematic representation (left). In a single Eppendorf tube, Donors and one Acceptor ( Table 1), each containing a LoxP sequence, are combined with Cre recombinase. Cre–LoxP fusion generates in this tube all possible combinations in an equilibrium reaction (arrows marked by Cre and De-Cre). After transformation and plating, desired Donor–Acceptor fusions are selected by the specific antibiotic combination conferred by the resistance markers on the plasmids. Identification of the desired recombinants can be carried out on a 96 well microtiter plate (right). D is Donor, A is Acceptor. A–D and A–D–D denote Donor–Acceptor fusions. LoxP sites are shown as colored balls. Ap, Cm, Kn, Sp stand for antibiotics ampicillin, chloramphenicol, kanamycin and spectinomycin. Antibiotics combinations are boxed. Red dye is added in wells on the microtiter plate for orientation. ACEMBL system Donors and Acceptors are listed in Table 1. C is adapted from Bieniossek et al. (2009) with permission of the publisher.

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

39 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx 3 ries, definitely increased when protocols working also robustly on method are described in detail in an accompanying paper in this robots become available to the researchers. In turn, automated SPINE2 special issue by Diebold et al. processes on a robot can run in parallel for large numbers of reac- tions with unsurpassed precision, ensuring error-free operations 2.2. The Grenoble experience once robust routines are implemented. Automation in molecular biology laboratories can come in High-resolution structure elucidation of single proteins often many forms ranging from simple modules for individual tasks that requires mutations, truncations or the elimination of low complex- are integrated into a pipeline requiring frequent manual handling, ity regions that introduce heterogeneity in the sample and may to large systems which perform many tasks on an integrated ro- preclude crystallization. For single proteins or small binary or tri- botic platform, with a minimum of manual intervention by the re- meric complexes, this can be achieved still in a reasonable time- searcher. The high-throughput (HT) expression screening platform frame, for example by applying parallelized cloning, expression for co-expression in Escherichia coli at IGBMC Strasbourg and the and purification approaches. As the size and number of protein ACEMBL platform at EMBL Grenoble represent two complementary subunits in a complex of interest increases, the work-load aug- solutions to some of the challenges of automated protein complex ments excessively and can become inhibitory, in particular if the production (Fig. 1). recombinant multigene expression vectors required to produce this complex have to be reconstructed every time from scratch. This challenge was addressed at EMBL Grenoble by creating the ACEMBL platform for multigene expression vector construction 2. The SPINE2 high throughput platforms for expression of and complex production. protein complexes in E. coli

2.1. The Strasbourg experience 3. The ACEMBL system facilitates high-throughput recombineering for protein complex production The platform in Strasbourg is composed of individual modules, each for a specialized function in the recombinant protein expres- ACEMBL is the result of deconstructing multigene assembly into sion process that integrates into the pipeline shown (Fig. 1A). two simple steps of gene integration and plasmid concatameriza- Co-expression vectors are put together in a parallel fashion, for tion, which can both be automated and performed by a liquid han- example in 48 or 96 well microtiter or PCR plates. Procedures dling workstation (Fig. 1B). Sequence and ligation independent involving restriction enzymes and ligation, or ligation independent cloning (SLIC) was chosen for gene insertion (Li and Elledge, cloning methods such as LIC (Novagen, Merck Biosciences), Gate- 2007). SLIC relies on the exonuclease activity of T4 DNA polymer- way (Invitrogen) (Walhout et al., 2000 ) or In-Fusion (Clontech ase to produce sticky ends on DNA fragments that are annealed Takahara) (Berrow et al., 2009 ) can be applied, depending on the and joined together covalently upon transformation by the DNA re- project and according to the preference of the user. Transformation pair machinery of the E. coli cell used. Gene insertion occurs in is followed by plating on agar supplemented with antibiotic(s), small (2 kb) synthetic circular DNA modules called Donor and optionally on six well or 24 well tissue culture (TC) plates at appro- Acceptor plasmids (Table 1). Donors contain a conditional origin priate dilutions. Single colonies are used to inoculate liquid broth of replication which makes their propagation dependent on (LB) or terrific broth (TB) minicultures (2–4 ml volume) in 24 deep expressing the pir gene in the host used for cloning. Acceptors con- well TC plates. Autoinduction or addition of IPTG for induction can tain a ColE1 derived, regular origin of replication. Donors are sui- be chosen if expression plasmids relying on a lac operator are used cide vectors in all common cloning strains that are pir negative, (Studier, 2005). Lysis is carried out in the deep well plates by a and are only propagated in those strains if fused with an Acceptor multitip sonicator. The lysates are cleared by plate centrifugation providing a regular replicon ( Bieniossek et al., 2009; Nie et al., and applied to affinity resin presented in a 96 deep well plate. At 2009b). least one of the components of the complex studied is designed Fusions of Donors and an Acceptor, each containing one or sev- to incorporate a purification tag. SDS–PAGE analysis of the resin- eral recombinant genes of interest, are catalyzed by Cre recombi- bound material then reveals the outcome of the expression exper- nase, which fuses DNA molecules containing a LoxP sequence. In iment. Promising complexes are selected and production is scaled an Eppendorf tube containing several Donors and one Acceptor, up off-line for characterization by biochemical and biophysical each with a different resistance marker, all possible multigene fu- methods and functional assays, followed by structure determina- sions are made if Cre enzyme is added, in an equilibrium reaction tion involving NMR, EM and X-ray crystallography. favoring disassembly (Fig. 1C) providing a convenient combinato- This co-expression pipeline is at the core of the Strasbourg sys- rial option. Fusions containing the desired gene combinations are tem and performs successfully with a number of co-expression identified by their unique resistance marker combination via anti- systems developed by the Strasbourg laboratory during the SPINE biotic challenge on a microtiter plate ( Fig. 1C). and SPINE2 projects, as well as with co-expression systems from SLIC relies on one enzyme (T4 polymerase) and one protocol other laboratories (Busso et al., this issue ). Specifically, this pipe- (annealing). The Cre–LoxP fusion relies also on one enzyme (Cre line fully supports the pET-MCN/pET-MCP multi-expression sys- recombinase) and one protocol (incubation). This simplicity con- tem developed at IGBMC for protein complex expression in fers utter robustness and therefore lends itself to automation E. coli . This system relies on a cyclic multi-expression strategy (Nie et al., 2009b ). A welcome added value is the very low invest- based on several vectors that can be conjoined by restriction/liga- ment necessary in terms of material when using these enzymes, tion to yield multigene expression constructs ( Romier et al., 2006; which can either be purchased at low cost (T4 DNA polymerase) Perrakis and Romier, 2008; Diebold et al., this issue ). This co- or produced and purified efficiently in large amounts on site (Cre expression system has been instrumental for producing numerous recombinase). Numerous multiprotein complexes were produced complexes for high-resolution structure determination. Applica- in E. coli using the ACEMBL system from plasmids constructed by tion of the pET-MCN/pET-MCP system has further revealed the par- SLIC/Cre–LoxP tandem recombineering (TR), notably including, amount importance of tag placement on complex production among many others, important protein–RNA complexes and also efficiency, and the requirement to address this issue in a combina- large transmembrane assemblies such as the bacterial holotranslo- torial manner. These results and the use of the pET-MCN/pET-MCP con complex (Fig. 2). These examples show that the ACEMBL

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

40 4 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx

Table 1 ACEMBL systems for multiprotein complex expression.

system greatly facilitates cloning and expression of protein com- requiring specialist knowledge and highly trained personnel. Exist- plexes in E. coli . ing approaches were hampered by many impediments, rendering it virtually impossible to generate mammalian cell populations in 3.1. Extending ACEMBL to eukaryotic expression systems which every cell simultaneously expresses all desired genes, for labeling specific cell compartments, or for assembling into multi- E. coli has been and remains the major host for heterologous protein complexes. production of proteins in many laboratories, notably when large To tackle this bottleneck, the ACEMBL technology concept was quantities are required for biochemical and biophysical studies extended to introducing multiple foreign genes simultaneously and structure analysis. Eukaryotic proteins and protein complexes, into mammalian cells by using a single multigene plasmid which however, may impose requirements on the expression host which is rapidly built from Donors and an Acceptor by tandem recombi- E. coli cannot support properly. These can include a specialized neering as outlined above ( Kriz et al., 2010 ). This work was carried eukaryotic folding machinery, or the necessity for authentic pro- out in a collaboration between EMBL Grenoble and the Paul Scher- cessing and post-translational targeting and modifications to gain rer Institute in Villigen, Switzerland. Highly efficient co-expression activity. Eukaryotic expression systems therefore have attracted of currently five genes in a cell transfected with a Donor–Acceptor considerable interest since some time as an alternative to expres- multigene construction was demonstrated ( Fig. 2C). The genera- sion in E. coli . Currently, two eukaryotic expression systems are tion of stable multigene expresser cell lines by using specific inte- increasingly being used for eukaryotic protein expression: the bac- gration elements that can be readily introduced by the TR method ulovirus expression vector system (BEVS) for protein production in was also achieved. The viability of the transfected cells was not ad- insect cells, and mammalian expression systems for protein pro- versely affected, as they show the expected behavior for example duction in mammalian cells ( Nettleship et al., 2010 ). when stimulated with growth factors (ibid). Provision of mamma- Introducing foreign genes into mammalian cells remains a chal- lian cell compatible promoter and terminator DNA sequences on lenge to date, especially if many genes should be introduced simul- the ACEMBL plasmid modules originally conceived for E. coli taneously. Genetic engineering of mammalian cells with expression, lead to the MultiMam system ( Table 1). MultiMam transgenes, however, is essential in contemporary biology. Not consists of small synthetic Donors and Acceptors that are fully only structural biology applications benefit from introducing for- compatible with the robotic routines of the ACEMBL platform for eign genes efficiently into mammalian cells. Reprogramming of so- multigene construct assembly by TR, thus setting the stage for effi- matic cells into stem cells by co-expressing an array of specific cient multigene expression, optionally transient or stable, in mam- transcription factors is a prominent example. Efficient simulta- malian cells for structural molecular biology and numerous other neous monitoring of many parameters in living cells with fluores- applications. cent-protein sensors is an essential prerequisite for cell biology Finally, the ACEMBL approach was implemented for the baculo- experiments, for example if multi-component pathways are to be virus/insect cell system by adapting our time-tested MultiBac tech- followed precisely. Achieving robust results in all these experi- nology (Berger et al., 2004; Fitzgerald et al., 2006; Bieniossek et al., ments using mammalian cells has been a formidable task, 2008) for the requirements of the automated TR method ( Table 1).

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

41 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx 5

Fig.2. Donor–Acceptor recombineering exploits. (A) Expression of the signal receptor particle SRP in E. coli was achieved by ACEMBL. A dual expression plasmid encoding for the protein subunit Ffh and the RNA subunit Ffs was created by fusing Donor and Acceptor plasmids containing the encoding genes. Co-expression resulted in SRP which eluted in a symmetric peak from a size-exclusion column. SDS–PAGE (Ffh) and agarose gel (Ffs) analysis of SEC fractions revealed the complex. SRP is depicted as a model (based on PDB entry 2IY3) in the inset. (B) E. coli was used as expression host to produce the SecYEG-DF-YidC holotranslocon. A schematic drawing of this transmembrane multiprotein complex bound to the ribosome is shown (top). A multigene ACEMBL construct was used to express and purify the holotranslocon (below). A gel section from SDS–PAGE of the complex purified from detergent solubilized membrane fractions shows all expected components (right). (C) Efficient multigene expression in mammalian cells. Five fluorescent marker proteins were inserted into Donors and an Acceptor containing mammalian cell active promoter elements. All cells produced all proteins at identical levels in all cells of the mammalian culture. EBFP2-Nuc stains the nuclei, mTFP-FYVE binds a phosphoinositol-phosphate (PI-3-P), tubulin is labeled with EYFP, Mito- dsRed localizes to mitochondria, and Plum-PLC d-PH binds a phospho-inositol-biphosphate (PI-4,5-P2) The porcine aortic endothelial cells analyzed are shown in light field (bottom right). A and B are adapted from Bieniossek et al. (2009) , C is adapted from a time-lapse movie in Kriz et al., 2010 (image credit to P. Berger, PSI Villigen, Switzerland).

The MultiBac methodology, the protocols and reagents, and the at and common table-top shakers in a room constantly kept at times spectacular exploits, have been reviewed in some detail re- 27 °C, instead of the significantly more expensive equipment com- cently (Trowitzsch et al., 2010 ). We strove to maximally streamline monly encountered in baculovirus expression facilities (spinner procedures for reliable and robust use of the baculovirus expres- flasks, magnetic stirrers, incubators, fermentors, wave bioreactors, sion vector system (BEVS) as a standard technique for protein etc.). The last major cost factor when operating the facility is the expression in laboratories, not significantly more complicated than media used for cell culture. We undertook to test a number of com- recombinant expression in E. coli , which has become routine prac- mercially supplied media (liquid and dry powder), as well as own tice virtually everywhere. Following our strategy of deconstructing formulations based on literature from the early days of baculovirus multigene vector generation, we reengineered and streamlined our expression and insect cell culture. We systematically compared MultiBac plasmids by synthesizing small Donors and Acceptors cell viability and growth curves, variation from batch-to-batch, containing baculovirus specific promoters and terminators. In and protein production efficacy of the media tested, in relation to addition to other characteristic features of the ACEMBL system the cost and effort incurred in their purchase and/or preparation. such as homing endonuclease sites for inserting multiple expres- To our surprise, when informed about our ongoing study, commer- sion cassettes, the new plasmids also contain DNA sequences re- cial suppliers commenced offering significant discount rates on quired for integrating the expression cassettes of interest into the their products, effectively disqualifying our own media formula- genome of our engineered MultiBac baculovirus by Tn7 transposi- tions, which tended to suffer from variations between batches pre- tion (Table 1). The classical head-to-head dual-expression-cassette pared. The outcome of our study is compiled in Fig. 3. Formerly, we design of the original MultiBac plasmids, or comparable plasmids used SF900 II SFM serum free media (Gibco Life Technologies, such as pFastBacDual (Bac-to-Bac system, Invitrogen) can be easily Invitrogen) with excellent results for all operations (transfection, recapitulated from the new Donors and Acceptors if desired, by virus preparation, amplification and storage, protein expression) iteratively inserting a second (or, by the same token, third, fourth, in the EEF. Among those received, the most competitive offer etc.) expression cassette utilizing homing endonuclease sites pro- was for Hyclone (Thermo Fisher). The Gibco media proved to be vided on all ACEMBL plasmids ( Bieniossek et al., 2009; Nie et al., clearly superior to Hyclone in supporting cell growth to higher 2009b). Thus, a theoretically unlimited number of foreign genes densities, however, in the region relevant to our protocols (0.5– can be incorporated into the recombinant baculovirus for heterol- 2.5 million cells per ml), both media performed equally well ogous multiprotein expression in infected insect cell cultures. (Fig. 3A). We assayed protein expression using a MultiBac virus The ACEMBL systems shown in Table 1 are available from ATG- expressing YFP and a test protein to infect Sf21 cells adapted to biosynthetics ( www.atg-biosynthetics.com), and are called the media analyzed. Both YFP expression levels (Fig. 3B) and MultiColi (E. coli system), MultiMam (mammalian system) and expression and solubility of the test protein ( Fig. 3C) were equiva- MultiBacTurbo (BEVS). lent. We conclude that Hyclone media supports baculovirus expression in the EEF to our full satisfaction, at a fraction of the costs hitherto incurred. 4. Tips and tricks for eukaryotic protein production by BEVS A recurring subject of discussion when considering co-expres- sion of proteins by BEVS is the utility of the co-infection approach, When setting up the eukaryotic expression facility (EEF) at where recombinant baculoviruses, each containing one foreign EMBL Grenoble for baculovirus expression using our MultiBac gene of interest, are used to co-infect an insect cell culture for pro- BEVS, we strove to control the expenditures of equipment and con- ducing the ensemble of proteins encoded by the baculoviruses. As sumables while maintaining the superior performance of the sys- widely documented, co-expression of subunits of a multiprotein tem (Bieniossek et al., 2008; Trowitzsch et al., 2010 ). We use assembly is often mandatory for production of high-quality com- regular screw-capped Erlenmeyer flasks (250 ml to 2.5 L volume) plexes. Proper folding and/or solubility of the subunits may

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

42 6 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx

Fig.3. Sf21 growth and protein production: controlling facility costs. Media performance of several suppliers was assessed in SPINE2. (A) Sf21 growth curves for SF900 II SFM from Gibco (Life Technologies, Invitrogen) and Hyclone (Thermo Fischer Inc.) are shown. Both are serum free media. The Gibco media supports growth to high cell densities. Both media perform equally in the region (shaded in gray) relevant for protein complex expression in the laboratory (0.5–2.5 Mio cells/ml). Hyclone per liter cost is at a third of the Gibco list price and 40% below the best offer to the SPINE2 consortium and the eukaryotic expression facility (EEF) at EMBL Grenoble. (B) A recombinant baculovirus (EMBacY-p75) simultaneously expressing YFP and a test protein (p75) was used to infect Sf21 cells adapted to the Gibco and Hyclone media, respectively. YFP expression per 1 Mio cells in both media was identical. St stands for a defined fluorescent protein standard (used to calibrate for 100,000 arbitratry units), dpa stands for day of proliferation arrest in the infected culture. (C) Coomassie blue stained SDS–PAGE of the test expression shows comparable levels of recombinant protein for both media. CC stands for uninfected cell control, WCE for whole cell extract, SN for cleared lysate after sonication and centrifugation. Bands corresponding to YFP and p75 are indicated.

Fig.4. Limits of multigene expression by baculovirus co-infection. Infection of Sf9 cells with one or two baculoviruses (expressing dsRed protein or EGFP, respectively) 48 h post-infection are shown. (A) Micrographs of the same field of Sf9 cells showing specific fluorescence signals for dsRed (left), and for EGFP (middle) and an overlay of the two (right). Multiplicity of infection (MOI) was one for both viruses. Cells that express dsRed are colored red, cells expressing EGFP are colored green. Cells that effectively co- express both dsRed and EGFP appear yellow in the overlay. (B) The influence of virus titers on co-expression efficacy was analyzed, and the results are shown in a graphical representation. The green, red and yellow bars represent the proportion of cells that expressed dsRed, EGFP or both dsRed and EGFP, respectively. At least 1500 cells in each condition were analyzed. Saturating co-infection with both viruses was not achieved in this experiment.

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

43 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx 7 critically depend on each other, thus ruling out reconstitution of (although it appeared to affect mostly proteins larger than individually purified components to achieve the complex. The real- 100 kDa). Secondly, we observed that we could express very large ization of this necessity was originally the impetus for creating single proteins (400–500 kDa and larger) efficiently by using MultiBac, where all DNAs encoding a protein complex are assem- MultiBac. bled into and expressed from a single composite multigene baculo- We then considered a viral strategy. For example, SARS coro- virus (Berger et al., 2004 ). navirus, which is the virus that causes severe acute respiratory Nonetheless, co-expression of two or more proteins in the bac- syndrome (SARS), is a positive and single stranded RNA virus ulovirus system can also be achieved by co-infection of the insect belonging to a family of enveloped coronaviruses ( Peiris et al., cells with several baculoviral stocks encoding the individual sub- 2003). Genome expression is realized by translation of two large units. This approach can offer advantages, in particular for explor- open reading frames (ORFs), which are two polyproteins. The atory screening of putative interaction partners prior to large scale polyproteins are then processed into 16 smaller subunits expression. Co-infection also offers a possibility to adjust individ- by proteases contained within the polyproteins. One polyprotein ual virus titers by simply using more or less volume of a particular of SARS coronavirus is 700 kDa in size ( Gorbalenya et al., virus for infection, thus choosing the ratio between viruses and 2006). thereby possibly also the amounts of proteins produced in the cul- We asked whether a similar approach could be exploited to pro- ture. Simply speaking, if single-gene baculoviruses are at hand, duce protein complexes from large polyproteins by the baculovirus why not use them for testing in a straight-forward experiment system, thus potentially overcoming the limitations imposed by whether or not the encoded proteins interact, at least analytically, stoichiometric imbalance of the subunits produced ( Fig. 5). In a prior to up-scaling? Undoubtedly, even such seemingly straight- first approach, we expressed by MultiBac a fusion protein forward experiments need to be carried out with professional care (CFPtcsYFP) of cyan fluorescent protein (CFP) and yellow fluores- to avoid false interpretations for example due to a virus titer ‘‘going cent protein (YFP), connected by a short linker incorporating the off’’, i.e. absence of protein caused by a weak virus must not be amino acids that are recognized and cleaved by protease N1A from confused with absence of interaction or incorporation into a tobacco etch virus (TEV). CFP and YFP represent a Förster pair complex. which can be used for fluorescent resonance energy transfer (FRET) In a pilot experiment, the Strasbourg group addressed the measurements. If CFP and YFP are closely spaced as in our fusion requirements for co-infection experiments using baculoviruses. protein, excitation of CFP at a wavelength of 400 nm will result Results of this experiment are shown in Fig. 4, illustrating the lim- in excitation of YFP by Förster transfer, leading to an emission its and the difficulty to optimize co-infection experiments for spectrum characteristic of YFP. Cleared lysates of insect cells in- achieving tangible results. Insect cells were co-infected with two fected by the virus encoding for CFPtcsYFP were analyzed by viruses expressing fluorescent proteins dsRed and EGFP, respec- SDS–PAGE showing an over-expressed band at 50 kDa. Co-expres- tively. Cells infected either with dsRed or EGFP or both were sion of TEV protease resulted in complete cleavage of the fusion counted using fluorescence microscopy (Fig. 4A). These experi- protein as evidenced by SDS–PAGE. The expression of TEV protease ments already demonstrate that, in order to maximize co-expres- interfered neither with cell proliferation nor with the baculoviral sion, optimization is absolutely required. The best ratio between life cycle upon infection. In a separate experiment, the fusion pro- viruses needs to be determined experimentally, by varying the tein could also be quantitatively cleaved by adding purified TEV multiplicity of infection (MOI) i.e. the number of infectious virus protease to the lysate and incubation overnight, resulting in particles of each kind counted against the insect cells present in appearance of the same bands as in the TEV/CFPtcsYFP co-expres- the infected culture. The percentage of cells that are productively sion experiment ( Fig. 5A). Analysis of the lysates revealed efficient co-infected varies significantly with the amount and ratio of indi- FRET for the fusion protein. Interestingly, exclusively the spectrum vidual viruses used for co-infection. Importantly, in these experi- of CFP was observed in the TEV/CFPtcsYFP co-expression experi- ments, even after optimization, only a fraction (50%) of all insect ment upon excitation at 400 nm. FRET is an extremely sensitive cells effectively expressed both proteins simultaneously ( Fig. 4B). tool for measuring close proximity of suitable fluorophores. The Taken together, these results certainly suggest that co-infection absence of any residual YFP signal in the lysate of cells co-express- experiments, seemingly easy, may need significant investment of ing TEV and fusion protein confirm complete cleavage of the fusion time and effort and should be approached with caution to arrive at the TEV protease site. We can confidently rule out the possibility at interpretable and reliable outcomes. of cleavage occurring during or immediately after cell lysis, as addition of TEV protease to a lysate containing fusion protein only gradually reverted the YFP emission spectrum to the CFP spectrum 5. Polyproteins: a new concept for producing protein complexes overnight (Fig. 5B). Excitation at 488 nm revealed the presence of comparable amounts of YFP protein in all samples analyzed. Since its introduction in 2004, the MultiBac system has been We have reported a 400 kDa transcription factor subcomplex of prolifically used by us and many other laboratories, including users human general transcription factor TFIID composed of two copies and visitors in the EEF at EMBL Grenoble to generate what in the each of the TBP associated factors (TAFs) 5, 6 and 9 ( Fitzgerald past would have been considered difficult targets. Many proteins et al., 2007 ). This complex was expressed by MultiBac from a single and their complexes were produced successfully, often for the first virus containing three separate expression cassettes (Fig. 5C). The time (Trowitzsch et al., 2010 ). MultiBac relies on infecting insect smallest subunit, TAF9, contained a 6 histidine tag. Purification of cells with a single virus containing all foreign genes of choice. the complex by IMAC revealed substoichiometric production of When logging and interpreting the outcomes of many complex the largest subunit, TAF5 (100 kDa). The eluate contained an excess expressions, we arrived at two important conclusions. Firstly, even of TAF6–TAF9 dimer. Purification of the intact complex was none- though all genes were present on one MultiBac baculovirus used theless possible by a further affinity step taking advantage of a cal- for complex expression, it still occurred occasionally that one or modulin binding peptide present on TAF5 ( Fitzgerald et al., 2007 ). the other subunit was expressed at much lower levels than the However, a large portion of the protein produced was lost due to remaining subunits, thus reducing the yield of intact complex puri- the observed stoichiometric imbalance. We now created a single fied. This feature of inhibitory imbalance in individual expression ORF encoding for TEV protease, TAF6, TAF5 and TAF9. Again, levels appeared to be exacerbated in the case of large complexes TAF9 contained a 6 histidine tag in this polyprotein. Surprisingly with many subunits, and proved to be difficult to rationalize and to our delight, expression of the polyprotein by MultiBac not

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

44 8 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx

Fig.5. Protein complex expression from a polyprotein. (A) A fusion protein of CFP and YFP, linked by amino acid residues corresponding to a cleavage site for TEV protease, was constructed and expressed using the MultiBac system, alternatively with and without TEV protease co-expressed. A gel section from SDS–PAGE is shown, with the band corresponding to the fusion protein indicated. Co-expression of TEV protease results in loss of this band and the appearance of bands at half of its molecular weight (left). Expression of the baculoviral protein p10 is shown as an internal control (below). Presence of TEV protease in the co-expression was confirmed by Western blot using antibody specific for the 6 histidine N-terminal tag of the protease, revealing a doublet (middle). Addition of TEV protease to the lysate of the fusion protein resulted likewise in cleavage at the specific proteolytic site (right). FL denotes uncleaved, full-length fusion protein, Cld denotes protein cleaved at the TEV protease site, tcs stands for TEV protease cleavage site. (B) Excitation of CFP in the CFPtcsYFP pair results in FRET and an emission maximum characteristic of YFP (curve marked CY FL ). Addition of TEV protease to the lysate containing the fusion gradually converts the curves (red, purple, blue) to a spectrum resembling CFP emission by eliminating FRET. An emission spectrum of CFP expressed alone is inset for comparison. The spectrum of lysate from the co-expression of TEV and the fusion (TCY) is identical to CFP only (left). Spectra of all lysates excited at the YFP excitation wavelength (488 nm) confirm that all samples tested contain comparable amounts of flourescent protein (line color as before). For comparison, the spectrum of YFP expressed alone is inserted (right). (C) A 400 kDa human transcription factor complex composed of proteins TAF5, TAF6 and TAF9 (with 6 histidine tag) was expressed by using the MultiBac system from single expression cassettes (SEc) or, alternatively, from a polyprotein (PP) joining TEV protease and the three subunits by TEV protease cleavage sites followed by purification by IMAC. Sections from SDS–PAGE are shown on the right. Bands corresponding to TAFs are indicated. A band corresponding to a degradation product of TAF6 is marked with an asterisk. Expression from single cassettes results in complex significantly depleted of TAF5. Polyprotein expression, in contrast, results in stoichiometrically balanced sample and reduction of degradation. Protein was purified from equivalent amounts of cells infected by SEc or PP expressing virus, respectively. only revealed stoichiometrically balanced sample production, but 5.1. The pPBac vector for polyprotein expression also indicated that TAF5 production from the polyprotein had in fact ‘‘caught up’’ to the levels previously obtained for TAF6 and Encouraged by our success with the TFIID subcomplex, we went TAF9 (Fig. 4C). on to generate a plasmid, pPBac, specifically designed for

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

45 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx 9

Fig.6. BEVS pPBac vector for polyprotein expression of complexes. A new plasmid, pPBac, for expressing polyproteins with the baculovirus system was constructed. The plasmid map is shown (A). pPBac contains a gene for TEV protease linked to a CFP gene, spaced apart by a BstEII recognition site, followed by a RsrII site and a TEV protease cleavage site. pPBac contains all elements required for Tn7 transposition into a recipient baculovirus. (B) Polyprotein constructs for producing TAF8 and TAF10 (1), TAF10 and SPT7L (2) , or TAF8, SPT7L and two copies of TAF10 (3) were generated as shown. (C) Constructs were inserted via Tn7 transposition into EMBacY baculovirus expressing YFP from the backbone (left). Tracking YFP and CFP expression simultaneously evidenced production of YFP and of the polyprotein in the expressions (right). St stands for a standard sample (100,000 counts) used to calibrate the spectrophotometer. (D) SDS–PAGE sections of IMAC batch purifications are shown. All polyprotein specimens are processed completely resulting in stoichiometrically balanced product. Bands corresponding to SPT7L, TAF8, TAF10 and TEV protease are observed. Samples are numbered 1– 3 as in B and C.

producing polyproteins with the Multibac system ( Fig. 6). In addi- Recently, a small TAF containing complex (SMAT) was discov- tion to elements required for Tn7 transposition into the MultiBac ered in nuclear extracts from HeLa cells, containing TAF8, TAF10 baculoviral genome, pPBac contains an expression cassette flanked and SPT7L, a subunit specific for the SAGA complex by a polyhedrin promoter and a terminator derived from the SV40 (Demény et al., 2007 ). We used pPBac to produce the subunits polyA signal sequence. The cassette contains an ORF encoding for encoding for SMAT from polyproteins. Two copies of TAF10 were TEV protease, followed by a short linker DNA sequence with a proposed to exist in SMAT. We therefore created a corresponding BstEII restriction site and a RsrII restriction site. The ORF is com- polyprotein that contained two copies of the TAF10 gene in the pleted by the gene encoding for CFP, cloned in frame with TEV, ORF (Fig. 6B). The polyproteins were inserted into EMBacY, a bac- and containing a TEV protease cleavage site at its N-terminus. ORFs ulovirus where we had inserted YFP as a single expression cassette encoding for polyproteins are inserted into pPBac by using the into the backbone (Bieniossek et al., 2008; Trowitzsch et al., 2010 ). BstEII and RsrII cleavage sites. Both enzymes are asymmetric cut- YFP fluorescence can be used to monitor virus performance in ters with recognition sequences encompassing seven nucleotides infection experiments using EMBacY. We monitored in parallel each. LIC or SLIC procedures can likewise be used to insert ORFs the emission of YFP (from the virus backbone) at 488 nm excita- into pPBac linearized at the BstEII and/or RsrII sites, maintaining tion, and of CFP (from the polyprotein) at 400 nm excitation the reading frame. Polyproteins produced from pPBac will start (Fig. 6C). YFP and CFP intensities were similar for the three poly- with TEV protease and end with CFP. Monitoring CFP emission proteins produced, indicating comparable virus performance and by using a fluorescence spectrophotometer will report in ‘‘real- polyprotein expression levels. CFP intensities were consistently time’’ on the expression level of the polyprotein in probes taken lower as compared to the YFP levels owing to the lower (10%) from the expressing cell culture. TEV protease expressed from quantum yield of the former fluorophore. The polyproteins were pPBac contains a 6 histidine tag to allow for detection of TEV with again completely processed, resulting in stoichiometrically virtu- an antibody specific for the polyhistidine tag. By analyzing the po- ally balanced expression of the subunits ( Fig. 6D), thus compel- sition of TEV in a Western blot, expression of the polyprotein as lingly validating our approach and the utility of the pPBac vector well as (in)complete cleavage can be monitored. Completely pro- for polyprotein expression by BEVS. cessed polyprotein will only produce a TEV specific signal in the blot at the molecular weight corresponding to the protease. Incom- plete processing would result in a staining pattern resembling a 6. Conclusions ladder. For more than 40 expressions from pPBac, we thus con- firmed by Western blotting quantitative cleavage of all polypro- Technology drives discovery. In this contribution, we have de- teins produced to date (IB, unpublished). scribed the complementary HT pipelines for protein complex

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

46 10 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx production implemented during SPINE2-COMPLEXES at IGBMC in et al., 2010 ). It will be interesting to see whether or not E. coli Strasbourg and at the EMBL in Grenoble. The SPINE2C consortium and mammalian overproduction systems can likewise benefit from provided a rich environment for developing expression technolo- the polyprotein approach for structural biology applications. We gies, and several co-expression systems have been designed, vali- anticipate strategies relying on tandem recombineering, with Do- dated and successfully implemented during the SPINE2 project, nor and Acceptor plasmids each modified similar to pPBac, for pro- each with its own merit (Busso et al., this issue; Diebold et al., this ducing recombinant protein complexes with a very large number issue). The ACEMBL system at EMBL Grenoble was originally de- of subunits efficiently by co-expressing polyproteins. Each of the signed for rapidly generating multigene vectors for protein com- polyproteins can contain a different fluorescent marker at the C- plex expression in E. coli . The ACEMBL tandem recombineering terminal end, which would allow for directly following expression procedures, scripted in robotics routines and implemented on a ro- levels during complex production by multichannel fluorescence bot, are equally applicable to generating multigene expression con- emission measurements. structs for producing heterologous protein complexes in During the SPINE2 project, we have also optimized the perfor- mammalian cells, and the original suite of prokaryotic Donor and mance of the eukaryotic expression facility (EEF) which uses our Acceptor plasmids has now been complemented by equivalents Multibac system for protein complex production. The controlled for mammalian expression. A broad range of applications will ben- reduction of expenditures by purchasing appropriate equipment efit from this methodology besides structural biology. For instance, and, importantly, identifying media that performs well but is avail- fluorescent markers can be used to genetically tag specific proteins able at a fraction of the previous costs, has resulted in a prototype to study cellular processes and entire pathways, with interesting facility which can be duplicated in many laboratories in Europe, possibilities to enhance screening in mammalian cells for pharma- including labs with comparatively small operating budgets. The ro- cological applications. Further, we have redesigned and stream- bust protocols and procedures we developed for baculovirus lined our MultiBac system for baculovirus/insect cell expression expression have, at least in our view, reduced the previously per- of complexes such that it can be accommodated in the ACEMBL ceived ‘‘complexity’’ of this eukaryotic expression technology to pipeline. In summary, we now have ACEMBL compatible multi- the level of E. coli expression. These developments have set the expression systems for protein production in E. coli , mammalian stage to welcome many scientists, from academia and industry, and baculovirus/insect cells, which represent the three major host for intensive training visits at the EEF in Grenoble, for example systems used in contemporary structural molecular biology for re- through the EC funded MultiBac trans-national access program in combinant protein production. the FP7 I3 project P-CUBE ( www.p-cube.eu). Notably, we have strikingly extended the BEVS complex expres- sion tool-box by implementing a novel logic for multisubunit com- plex expression in insect cells by using polyproteins ( Fig. 7). The Conflict of interest potential of this technology, also for commercial applications has been recognized (Berger and Richmond, 2006). More recently, a The authors declare competing financial interest. I.B. and T.J.R. comparable polyprotein approach was also implemented for deliv- are authors on patents and patent applications describing parts ering foreign genes at defined ratios into mammalian cells ( Chen of the technologies discussed in this contribution.

Fig.7. Multiprotein complex expression strategies by baculovirus. (A) Co-infection of baculoviruses each encoding for a different protein subunit of a complex results in heterogenous infection patterns in the ensemble of cells in the expression culture, ranging from cells infected with only one virus (left) to cells infected with all viruses (right). Balanced expression is difficult to achieve. (B) Expressing all subunits from a single baculovirus (for example by the MultiBac system) containing separate expression cassettes reliably achieves expression of all components in each individual cell in the culture. Inherent differences in individual protein expression levels can impede complex purification and drastically reduce yields if one subunit is in severely short supply. (C) Expression of all components of a multiprotein complex from a single polyprotein. Transcription of a long ORF containing all genes results in a single polypeptide encoding for all subunits and TEV protease. A fluorescent protein at the C-terminal end of the polyprotein allows real-time evaluation of the expression level. TEV protease releases itself from the polyprotein and proceeds to tailor the remaining cleavage sites in the polyprotein. This can lead to stoichiometrically balanced production of the complex components and superior complex assembly. Letters A–D denote protein subunits encoded by genes labeled a to d. TEV protease, CFP and YFP are indicated. Sites of cleavage activity by TEV protease are marked (scissors).

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

47 L.S. Vijayachandran et al. / Journal of Structural Biology xxx (2011) xxx–xxx 11

Acknowledgments Demény, M.A., Soutoglou, E., Nagy, Z., Scheer, E., Jànoshàzi, A., Richardot, M., Argentini, M., Kessler, P., Tora, L., 2007. Identification of a small TAF complex and its role in the assembly of TAF-containing complexes. PLoS One 21, e316. We thank Yuichiro Takagi, Wassim Abdul Rahman, Philipp Diebold, M.L., Fribourg, S., Koch, M. Metzger, T., Romier, C. Deciphering correct Berger, Daniel Fitzgerald, Michel O. Steinmetz, Daniel Frey, Darren strategies for multiprotein complex assembly by co-expression: application to Hart, Yvonne Hunziker and all members of the Berger and complexes as large as the histone octamer. J. Struct. Biol., this issue (SPINE2 Special). Schaffitzel laboratories for helpful discussions, and Hubert Berna- Fitzgerald, D.J., Berger, P., Schaffitzel, C., Yamada, K., Richmond, T.J., Berger, I., 2006. uer and Alexander Craig (ATG-biosynthetics), Marcel Boeglin Protein complex expression by using multigene baculoviral vectors. Nat. (IGBMC Imaging Center) and Isabelle Kolb-Cheynel (IGBMC Baculo- Methods 3, 1021–1032. Fitzgerald, D.J., Schaffitzel, C., Berger, P., Wellinger, R., Bieniossek, C., Richmond, T.J., virus Expression Facility) for advice and expert support. S.T. is a Berger, I., 2007. Multiprotein expression strategy for structural biology of European Commission (EC) Marie Curie post-doctoral fellow. C.B. eukaryotic complexes. Structure 15, 275–279. is supported by a Swiss National Science Foundation Advanced Re- Gorbalenya, A.E., Enjuanes, L., Ziebuhr, J., Snijder, E.J., 2006. Nidovirales: evolving the largest RNA virus genome. Virus Res. 117, 17–37. searcher fellowship (SNSF, Switzerland). I.B. acknowledges support Kriz, A., Schmid, K., Baumgartner, N., Ziegler, U., Berger, I., Ballmer-Hofer, K., Berger, from the Agence Nationale de la Recherche (ANR, France), the Cen- P., 2010. A plasmid-based multigene expression system for mammalian cells. tre National de la Recherche Scientifique (CNRS), the SNSF, the Nat. Commun. 1. doi: 10.1038/ncomms1120. Li, M.Z., Elledge, S.J., 2007. Harnessing homologous recombination in vitro to EMBL through the EIPOD program, and the EC through Framework generate recombinant DNA via SLIC. Nat. Methods 4, 251–256. Program 7 (FP7) projects INSTRUCT and P-CUBE. The present work Nettleship, J.E., Assenberg, R., Diprose, J.M., Rahman-Huq, N., Owens, R.J., 2010. was supported by EC projects SPINE2-COMPLEXES and 3D Reper- Recent advances in the production of proteins in insect and mammalian cells for structural biology. J. Struct. Biol. 172, 55–65. toire (FP6). Nie, Y., Viola, C., Bieniossek, C., Trowitzsch, S., Vijayachandran, L.S., Chaillet, M., Garzoni, F., Berger, I., 2009a. Getting a grip on complexes. Curr. Genomics 10, References 558–572. Nie, Y., Bieniossek, C., Frey, D., Olieric, N., Schaffitzel, C., Steinmetz, M.O., Berger, I., 2009b. ACEMBLing multigene expression vectors by recombineering. Nat. Alberts, B., 1998. The cell as a collection of protein machines: preparing the next Protoc. 4. doi: 10.1038/nprot.2009.104. generation of molecular biologists. Cell 92, 291–294. Peiris, J.S. et al., 2003. Coronavirus as a possible cause of severe acute respiratory Berger, I., Fitzgerald, D.J., Richmond, T.J., 2004. Baculovirus expression system for syndrome. Lancet 361, 1319–1325. heterologous multiprotein complexes. Nat. Biotechnol. 22, 1583–1587. Perrakis, A., Romier, C., 2008. Assembly of protein complexes by co-expression in Berger, I., Richmond, T.J., 2006. New logic for recombinant expression of prokaryotic and eukaryotic hosts: an overview. Methods Mol. Biol. 426, 247– multiprotein complexes using polygenes. PCT/EP2006/010608, WO2007/ 256. 054250 (granted 2009). Romier, C., Ben Jelloul, M., Albeck, S., Buchwald, G., Busso, D., Celie, P.H., Berrow, N.S., Alderton, D., Owens, R.J., 2009. The precise engineering of expression Christodoulou, E., De Marco, V., van Gerwen, S., Knipscheer, P., Lebbink, J.H., vectors using high-throughput In-Fusion PCR cloning. Methods Mol. Biol. 498, Notenboom, V., Poterszman, A., Rochel, N., Cohen, S.X., Unger, T., Sussman, J.L., 75–90. Moras, D., Sixma, T.K., Perrakis, A., 2006. Co-expression of protein complexes in Bieniossek, C., Richmond, T.J., Berger, I., 2008. MultiBac: multigene baculovirus- prokaryotic and eukaryotic hosts: experimental procedures, database tracking based eukaryotic protein complex production. Curr. Protoc. Protein Sci. Unit 5, and case studies. Acta Crystallogr. D Biol. Crystallogr. 62, 1232–1242. 20 (Chapter 5). Studier, F.W., 2005. Protein production by auto-induction in high density shaking Bieniossek, C., Nie, Y., Frey, D., Olieric, N., Schaffitzel, C., Collinson, I., Romier, C., cultures. Protein Expr. Purif. 41, 207–234. Berger, P., Richmond, T.J., Steinmetz, M.O., Berger, I., 2009. Automated Trowitzsch, S., Bieniossek, C., Nie, Y., Garzoni, F., Berger, I., 2010. New baculovirus unrestricted multigene recombineering for multiprotein complex production. expression tools for recombinant protein complex production. J. Struct. Biol. Nat. Methods 6, 447–450. 172, 45–54. Busso, D., Peleg, Y., Romier, C., Salim, L., Troesch, E., Perrakis, A., Celie, P.H.N. Walhout, A.J., Temple, G.F., Brasch, M.A., Hartley, J.L., Lorson, M.A., van den Heuvel, Expression of protein complexes using multiple E. coli protein co-expression S., Vidal, M., 2000. GATEWAY recombinational cloning: application to the Systems: A benchmarking study. J. Struct. Biol., this issue (SPINE2 Special). cloning of large numbers of open reading frames or ORFeomes. Methods Chen, X., Pham, E., Truong, K., 2010. TEV protease-facilitated stoichiometric delivery Enzymol. 328, 575–592. of multiple genes using a single expression vector. Protein Sci. 19, 2379–2388.

Please cite this article in press as: Vijayachandran, L.S., et al. Robots, pipelines, polyproteins: Enabling multiprotein expression in prokaryotic and eukary- otic cells. J. Struct. Biol. (2011), doi: 10.1016/j.jsb.2011.03.007

48 1.6.3. Publication 3

Title: Getting a grip on complexes

Authors: Nie, Y.*, Viola, C.*, Bienossek, C., Trowitzsch, S., Vijayachandran L. S.,

Chaillet, M., Garzoni, F., Berger, I.

Journal: Curr. Genomics. 2009 Dec;10(8):558-72.

The following manuscript reviews the effects of recent structural genomics efforts necessitating advancements in technology in the field of structural biology such as the

MultiBac and ACEMBL systems developed by the Berger Group.

The manuscript is published in Current Genomics.

Asterisk denotes that both authors contributed equally to this manuscript.

49 558 Current Genomics, 2009 , 10, 558-572 Getting a Grip on Complexes

Yan Nie #, Cristina Viola #, Christoph Bieniossek, Simon Trowitzsch, Lakshmi Sumitra Vijay- achandran, Maxime Chaillet, Frederic Garzoni and Imre Berger*

European Molecular Biology Laboratory (EMBL), Grenoble Outstation and Unit of Virus Host-Cell Interactions (UVHCI), UJF-EMBL-CNRS, UMR 5233, 6 rue Jules Horowitz, 38042 Grenoble CEDEX 9, France

Abstract: We are witnessing tremendous advances in our understanding of the organization of life. Complete genomes are being deciphered with ever increasing speed and accuracy, thereby setting the stage for addressing the entire gene product repertoire of cells, towards understanding whole biological systems. Advances in bioinformatics and mass spec- trometric techniques have revealed the multitude of interactions present in the proteome. Multiprotein complexes are emerging as a paramount cornerstone of biological activity, as many proteins appear to participate, stably or transiently, in large multisubunit assemblies. Analysis of the architecture of these assemblies and their manifold interactions is impera- tive for understanding their function at the molecular level. Structural genomics efforts have fostered the development of many technologies towards achieving the throughput required for studying system-wide single proteins and small interac- tion motifs at high resolution. The present shift in focus towards large multiprotein complexes, in particular in eukaryotes, now calls for a likewise concerted effort to develop and provide new technologies that are urgently required to produce in quality and quantity the plethora of multiprotein assemblies that form the complexome, and to routinely study their struc- ture and function at the molecular level. Current efforts towards this objective are summarized and reviewed in this con- tribution.

Received on: June 21, 2009 - Revised on: July 15, 2009 - Accepted on: July 24, 2009 Key Words: Proteome, interactome, multiprotein assemblies, structural genomics, robotics, multigene expression, multi- Bac, BEVS, ACEMBL, complexomics.

INTRODUCTION tural biology is to understand the mechanisms governing biological interactions in living systems for designing strate- Protein-protein interactions (PPIs) are intrinsic to virtu- gies to modulate, and interfere with these interactions. How- ally every essential process in the cell. Deciphering PPIs is ever, the large and increasing body of data describing PPIs imperative for understanding the underlying biological on a genome-wide scale, and the pace at which it is amassed, mechanisms of living systems. Cellular activities that govern is currently at a pronounced disparity with the rate at which health and disease, such as DNA replication, transcription, the structure and function of representative protein com- splicing, translation, secretion, cell cycle control, signal plexes that comprise the identified interactions, are described transduction and intermediary metabolism are controlled by at the molecular level. Despite considerable advances in con- PPIs [1-5]. New developments in sequencing technology in temporary structure determination techniques and significant combination with advances in affinity purification tech- efforts by structural genomics consortia to streamline the niques and automation are presenting researchers with the process leading to high-resolution structures, many bottle- opportunity to study the proteome of various organisms at an necks in the structure determination pipeline remain. ever increasing pace. Genome-wide protein-protein interac- tion studies involving affinity chromatography and mass Protein complexes are often found in scarce amounts in spectrometry (MS) analyses of systematically tagged open their endogenous host and remain difficult to isolate in the reading frames (ORFs) have been developed and imple- quantity and quality required for detailed functional and mented, aided by powerful bioinformatics approaches, to structural analysis. This is often the case already for electron address the entirety of PPIs in cells. microscopy experiments, although the requirements of this technique in terms of sample quantity are typically less im- To date, many thousands of PPIs are known, however, posing as compared to studies for example by X-ray crystal- the precise molecular details are available for only a small lography or by nuclear magnetic resonance (NMR) spectros- fraction of these interactions. Structure elucidation can ulti- copy. The latter two are the currently most powerful and mately turn abstract system representations into models that widely used techniques for providing high-resolution struc- more accurately reflect biological reality. The utility of struc- tural information. Multiplexed overexpression experiments

by using advanced recombinant production technologies *Address correspondence to this author at the European Molecular Biology could be instrumental not only for overcoming the sample Laboratory (EMBL), Grenoble Outstation and Unit of Virus Host-Cell In- production bottleneck, but also for compellingly validating teractions (UVHCI), UJF-EMBL-CNRS, UMR 5233, 6 rue Jules Horowitz, proposed interactions in a heterologous setup. Streamlined 38042 Grenoble CEDEX 9, France; E-mail: [email protected] high-throughput technologies for production of multisubunit #These authors contributed equally.

1389-2029/09 $55.00+.00 ©2009 Bentham Science Publishers Ltd.

50 Getting a Grip on Complexes Current Genomics, 2009 , Vol. 10, No. 8 559 protein complexes, however, have been utterly lacking to DECIPHERING THE INTERACTOME date. New developments are required to rapidly and repro- In recent years, new and powerful methods have been ducibly construct large protein complexes and variations developed which allow complex cellular protein-protein in- thereof at the rate that they are conceptualized from genome- teraction networks to be mapped (Fig. ( 1)). Such techniques wide studies. have produced a wealth of data and have given rise to a new

Fig. (1). Interactomics. Recent technological advances in genome-wide methods enable researchers to address protein-protein interactions present in the proteome of organisms in a comprehensive fashion, thus giving rise to the interactome. Native purification of proteins present in organelles and entire cells by using tandem affinity purification (TAP) methods, Strep-protein interaction experiment (SPINE) and trans- genomics involving bacterial artificial for generating stable mammalian cell lines, as well as protein-protein screens by yeast two-hybrid (Y2H) methods are supported by bioinformatics analyses, and together provide a (growing) picture of the interactome as a com- plex mixture of multiprotein assemblies. Mass spectrometry (MS) based proteomic methods including matrix-assisted laser desorption ioni- zation (MALDI) and electro-spray ionization (ESI) techniques coupled to liquid chromatography (LC-MS) and tandem MS-MS measure- ments add to the catalogue of tools employed to tackle the complexome. The link between ineractome research and structural biology is made by native mass spectrometry. Native MS can provide vital information about the structure, topology and architecture of protein complexes preserved in the gaseous phase. Ion mobility separation coupled to mass spectrometry (IM-MS) and collision induced dissociation (CID) are new approaches holding particular promise for characterizing the properties and composition of even very large protein complexes. Recombinant overproduction, functional characterization and eventually 3-D structure determination can help to validate the vast amounts of interactome data from recent systems biology efforts. Multiplexed and quantitative MS methods in conjunction with limited proteolysis may become critically important to elucidate variants of recombinantly overproduced multiprotein complexes amenable to high-resolution struc- tural and functional analysis. Combinatorial multigene generation, parallel small-scale expression and biochemical and biophysical analysis of multiprotein complexes derived from interactome data constitute likely modules of a conceptual “complexomics“ pipeline in analogy to current structural genomics approaches, leading to routine and rapid elucidation of the molecular architecture of many complexes and their subunit components by X-ray diffraction analysis, electron microscopy and NMR spectroscopy.

51 560 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al. sphere of research designated “interactomics”. The term “in- modifying the TAP tag to include green fluorescent protein teractome” is used to describe all known interactions present (GFP) followed by the Z-domain of protein G instead of pro- in the cellular gene product repertoire [6]. tein A, and by replacing the CBP-tag with streptavidin pep- tide, this study revealed intact complexes involved in C. ele- Purification from Native Source gans kinetochore formation. A celebrated development in high-throughput identifica- Furthermore, Burckstummer et al . overcame the problem tion of protein complexes is the tandem affinity purification of low protein yields in TAP tagging experiments in mam- (TAP) method [7]. In this approach endogenously tagged malian systems by likewise altering the composition of the proteins of interest are produced which are used as bait to TAP tag [12]. They also replaced the IgG peptides of Protein fish out interacting partners. The original TAP tag comprises A with those of Protein G and the CBP peptide with strep- two affinity tags: the Z-domain of protein A, which binds to tavidin peptide. Using IKK ! with this modified TAP tag as immunoglobulin G (IgG), and calmodulin-binding peptide bait resulted in a ten-fold increase not only in the amount of (CBP), which binds to calmodulin. These two tags are sepa- bait but also of its interacting partner, IKK ". These ad- rated by the highly specific tobacco etch virus (TEV) prote- vancements in affinity purification techniques promise to ase site. TAP tagging involves a relatively mild extraction allow future interactome maps of cultivated human cell lines procedure in which protein complexes are purified via a two- to be determined, as well as maps of other cell types that are step process that yields intact protein complexes composed inherently more difficult to cultivate in large quantities, such of the tagged bait and any associated partners. This method as neuronal cells and immune cells. By tweaking certain as- is particularly useful for detecting stable complexes; more pects of existing purification strategies, such as modifying transient complexes are not observed, as they tend to disso- the original TAP tag itself, high-throughput interactome ciate during purification. Two major proteome-wide studies maps are moving into the realm of mammalian systems. in S. cerevisiae using the TAP method have revealed many An interesting approach called BAC TransgeneOmics previously unknown protein interactions and pathway asso- was recently described as a tool for studying protein-protein ciations [8, 9]. In one study, Gavin et al . TAP-tagged 6406 and protein-DNA interactions in addition to protein localiza- ORFs from the S. cerevisiae genome which enabled the puri- tion [13]. BAC TansgeneOmics describes a method by fication of 1993 tagged proteins and the identification of 491 which all known proteins within a proteome of a given orga- protein complexes [8]. In an independent study, Krogan et nism are tagged on a genome-wide scale. Using this recom- al . TAP-tagged 4562 ORFs from the yeast proteome. 2357 binantly tagged genome to create a bacterial artificial chro- of these TAP-tagged proteins were purified revealing 547 mosome (BAC) library ensures the presence of native regu- complexes as well as 429 interactions between complexes latory regions around the target gene. BACs containing the [9]. In both of these extensive studies affinity tags were in- recombinantly tagged genes of interest are then sequentially troduced into the 3’ ends of target ORFs in the yeast chro- transfected and expressed in mammalian cells. The tags con- mosome by homologous recombination. Data generated sist of a combination of fluorescent proteins and peptides for from these surveys correlated well with known protein com- affinity purification and reporting on factors such as in vivo plexes formerly discovered and studied by conventional protein localization and endogenous protein interactions. means. More notably, new interaction partners of well- known complexes were identified, as well as entirely novel Interaction Analysis by Yeast Two-Hybrid Screens complexes and associations. Another powerful method for generating interactome Methods to optimize the TAP tagging strategy are under maps in a high-throughput manner is the yeast two-hybrid way in an effort to obtain larger quantities of tagged protein (Y2H) approach [14]. Interactome-wide binary interaction assemblies. One of the challenges of the TAP method is to maps resulting from Y2H screens are generally regarded as gain insight into the more fleeting interactions present in a low-coverage studies, noisy and containing a high likelihood protein complex. Herzberg et al. have developed a Strep- of false positives. In an attempt to systematically map inte- protein interaction experiment (SPINE) that deals with the ractome networks from Y2H screens, Venkatesan et al. esti- inherent false positives otherwise found in TAP tagging ex- mate that only 8% of the full human interactome has been periments [10]. By replacing the TAP tag with a strongly covered by Y2H screens [15]. However, these surveys conti- interacting variant of Streptavidin called Strep-tactin and nue to provide a useful concomitant view of the whole inte- employing a reversible cross-linking reagent, Herzberg et al . ractome when considered alongside other affinity purificati- were able to get an in vivo snap-shot of bait interactors in B. on/MS-based techniques [5]. Y2H screens report on whether subtilis in a single affinity purification step . or not two proteins interact by fusing to a target protein the In the years since the pioneering initial glimpses into the DNA binding domain (DBD) of a transcription factor while yeast interactome, subsequent affinity purification studies potential binding partners are fused to an activation domain. have sought to shed light on the interactomes of multicellular Any interaction between the two target proteins leads to the organism. Multicellular organisms are generally less amena- expression of a reporter gene [16]. There are three common- ble to TAP-tagging approaches due to the challenge of using ly used high-throughput Y2H screening approaches: (1) the homologous recombination to insert affinity tags and the yeast mating approach in which haploid DBD strains and difficulties in retrieving sufficient quantities of purified ma- strains with the activation domains undergo mating and se- terial. Nevertheless, Cheeseman et al . described a procedure lection for reporter expression; (2) the matrix approach, whe- using the TAP tagging principle to purify protein complexes re DBD strains can be mated with an array of strains contai- from C. elegans strains and cultivated HeLa cells [11]. By ning activation domains; and (3) the library approach, which

52 Getting a Grip on Complexes Current Genomics, 2009 , Vol. 10, No. 8 561 involves the mating of individual DBD strains with a library ineractions and will likely continue to make major contribu- of activation domain strains that represents a cDNA library tions to the field of interactomics. of a given target organism [5]. The latter method is the most efficient for high-throughput studies, however, the sampling Computational Approaches efficiency of individual DBD strains with entire cDNA libra- Results from high-throughput interactome studies are ries is greatly reduced. being tabulated with increasing clarity. These efforts are While the Y2H strategy has the capacity to meet the de- resulting in unprecedented amounts of potentially useful data mands of high-throughput interactome mapping, this appro- for molecular and structural biologists. On the bioinformatics ach cannot currently compete with affinity based methods in side, the major hurdles in analyzing high-throughput interac- terms of genome coverage. Nonetheless, Y2H surveys have tome data sets include managing databases, creating useful realized a rich source of high-quality binary interaction maps clustering algorithms to glean valuable information about from a wide range of organisms, including viruses, bacteria protein interactions, and using the resulting clustering to [17], S. cerevisiae [14, 18, 19], D. melanogaster [2], C. ele- make predictions about biological systems. Results from gans [20-22] and humans [4, 23, 24]. It is also important to combined genome-wide interaction studies may contain only note that while Y2H screens are critisized for inherent prob- partially overlapping datasets, false positives (interactions lems concerning the overexpression of homologous genes, that should not normally occur in a cell) and false negatives the post-translational modification machinery and a bias to- (limited or biased coverage that excludes a true interaction). wards interactions that occur in the nucleus, this approach Such issues hamper a comprehensive portrayal of protein can examine a different subspace of the protein interaction networking [26]. Today’s bioinformatician faces many chal- world to that sampled by affinity/MS methods. Together, lenges in the emerging field of interactomics. What follows both sources of interactome mapping provide a more is an overview of what challenges are being faced currently comprehensive outlook of the whole interactome. and those that are on the horizon that will undoubtedly con- tinue to be a boon for structural biologists in search of com- Two valuable high-throughput Y2H human PPI maps plex three dimensional (3-D) structures. were generated by Stelzl et al . [24] and Rual et al . [4]. These independent studies both utlized the matrix approach to a- Considering that each genome-wide interactome study chieve greatest possible coverage of the human genome and generates characteristic data and that each existing repository between them identified approximately 6000 binary protein uses characteristic file formats for storing data, the challenge interactions. In the Stelzl study, where 4456 baits and 5632 of creating a consolidated resource for a transparent flow of preys were screened, 195 disease related genes were found to data between datasets is startling. The Molecular Interactions interact with previously unidentified partners. Furthermore, (MI) group of the Proteomics Standards Initiative (PSI) has 342 uncharacterized proteins were assigned new putative created an international standard for representing protein roles after being found to interact with a protein of known interaction data by consolidating existing interactome data function. In total, new functions were assigned to hundreds sets from individually curated databases to create the Inter- of different proteins. In a comparable effort, Rual and col- national Molecular Interaction Exchange consortium (IMEx) leagues looked for binary interactions between approximate- [27]. The consortium, to date, includes the following data- ly 8100 ORF’s and detected approximately 2800 protein bases: DIP (http://dip.doe-mbi.ucla.edu), IntAct (http://www. associations. These interactions were then correlated with ebi.ac.uk/intact), MINT (http://mint.bio.uniroma2.it/mint), independant co-affinity purifications which revealed an over- MPact (http://mips.gsf.de/genre/proj/mpact), MatrixDB lap of approximately 78%. Despite the impact these Y2H (http://www.matrixdb.ibcp.fr), BioGRID (http://www.the- screens have made in the field of interactomics, further deve- biogrid.org), MPIDB (http://www.jcvi.org/mpidb) and BIND lopments are still required before they reach the coverage (http://www.blueprint.org). Alongside IMEx is MIMIx, the achieved by affinity methods. The impact of these studies minimum information required for reporting a molecular will surely propel the current technology in Y2H to new interaction experiment. MIMIx tackles the lack of commu- heights. nity consensus on what information is required to report mo- In a recent high-quality yeast binary protein interaction lecular interaction by setting up an international standard to facilitate the extraction of useable data from PPI experiments study, Yu et al. have attempted to deal with a long standing by users. Currently, data is exchanged in XML format. criticism that Y2H screens are biased towards interactions that occur within the nucleus [25]. To counter this concern, A major challenge concerning interactome datasets is Yu et al. performed a Y2H screen in parallel with a yellow how to cluster the resulting interactions to accurately report flourescent protein complementation assay (PCA) in which on real protein complexes rather than spurious, or false posi- the traditional bait and prey peptides are replaced with non- tive interactions while including more transient members of flourescing halves of yellow fluorescent protein (YFP). Once protein complexes rather than only architectural ones. Based the interacting partners are in close proximity, the fluores- on the Gavin, Krogan and Ho studies, Hart et al . used an cent properties of YFP are reconstituted and thereby create a unsupervised probabilistic scoring scheme and assigned con- useful marker that is not limited to reporting on interactions fidence scores to each interaction. This approach generated a that occur within the nucleus. Using their dual method, Yu matrix-model interpretation of the yeast interactome datasets et al. were able to validate their own results, which showed a [28]. Unsatisfied with the existing spoke model as a way of greater degree of correlation than that shown between the representing interactome data which only considers bait and Gavin and Krogan TAP studies. Y2H screens are certainly prey interactions, Hart and colleagues devised a scoring becoming a valuable tool for studying genome-wide protein method to hone the matrix model which additionally also

53 562 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al. takes prey/prey interactions into account, thereby including MS experiments can likewise be used to obtain invento- the elusive transient members of complexes without decreas- ries of biochemically isolated organelles allowing for the ing the overall accuracy of reported complexes. In doing so, characterization of sub-interactomes contained within sub- it was shown that the degree of overlap between the reported cellular compartments. High-resolution methods were ap- datasets was considerably higher than previously thought, plied for accurate protein identification and novel algorithms and that one of the major problems in previous comparisons were developed to assign genuine components from co- was the inclusion of ribosomal protein interactions. Based on purifying proteins in these experiments [38]. This holds par- assessments of similarity between the above mentioned ticular promise for accessing the protein repertoire and datasets and with a third yeast interactome dataset [9], Hart complexome of such cellular subcompartments by high- et al. suggested that these studies are approaching saturation resolution structural and functional studies. of what can be known about the subset of the complexome MS based interactome wide studies are often met with of yeast grown in rich media. Recently, Krogan indicated skepticism concerning the reproducibility of results [39]. The that a rough calculation based on the overlap of the two stud- Test Sample working group of the Human Proteome Organi- ies suggests that approximately 80% of the interactions ca- zation (HUPO), who have an interest in establishing interna- pable of detection in yeast by the TAP method have been tional standards for proteomics studies, attempted to address detected [29]. the question of irreproducibility in MS experiments. The Another consequence of the upsurge in PPI maps and working group provided a defined test sample containing an genome-wide sequencing efforts is the new wealth of data equimolar mixture of highly purified recombinant proteins to that can be used by the community of scientists who model 27 different laboratories using high-throughput MS methods protein interactions and predict protein function from the to test their ability to correctly identify the mixture [40]. The gene sequence. With the ever increasing amounts of data results were that, initially, only a quarter of the laboratories about PPIs, it is possible to identify recurring ‘domain signa- correctly identified the protein mixture. However, upon clo- tures’ and to correlate frequent interactions between them, ser inspection of each laboratory‘s raw data, it became appa- the idea being that the interaction may be mediated by the rent that the peptides had in fact been identified in every case signature sequence [30]. Knowledge about where an interac- and that the problem arose in environmental contamination tion might occur can also narrow down which portions of a of the sample, incorrect database matching and poor curation protein sequence should be included in designing protein of proteins identified. In summary, this study exemplified complex constructs [31]. that reproducibility in MS experiments can be achieved by carrying out the MS experiments with care and by upgrading Mass Spectrometry existing databases for their curation [39, 40]. Mass spectrometry (MS) has emerged as an indispen- The link between interactome research and structural sable tool for studying the interactome [32, 33]. MS is now biology is made by native mass sepctrometry of large protein firmly established as one of the main driving forces of prote- assemblies, an emerging, very promising technology. Native ome studies, and is increasingly the method of choice for mass spectrometry techniques allow sensitive analyses of analyzing complex protein mixtures derived from entire endogenously expressed protein complexes with high speed cells. Besides protein identification, quantification and profi- and selectivity [41, 42]. Importantly, native MS can provide ling, MS has had a significant impact on the analysis of pro- vital information about the structure, topology and architec- tein interactions and protein complexes [32]. Combining ture of protein complexes. Protein complexes in native MS affinity purification with MS allowed a de novo characteriza- experiments are prevented from disassociating in the gaseous tion of the composition and organization of the cellular ma- phase during electro-spray ionization (ESI). Additionally, chinery. Data derived from these methods indicated that nanoflow ES (nano-ES) is employed for improved resolution complexes can combine transiently and differentially in a of the sample being studied thereby improving the sensitivity modular fashion thus enabling a diversification of the poten- of native MS [40]. High-perfomance mass analyzers, such as tial function of individual protein complexes [8]. MS-based orthogonal ESI-time of flight (TOF) instruments, can be u- interactome analysis approaches, using a variety of techni- sed to accurately identify ions with a high mass-to-charge ques including matrix-assisted laser desoprtion/ionization ratio, a prerequisite for analyzing large protein complexes (MALDI) and liquid-chromatography coupled electro-spray with many subunits by native MS [42]. Tandem MS-MS ionization (LC-MS), offer several important advantages for methods, usually used in proteomics experiments to deduce studying protein complexes as compared to other techniques. the amino acid sequences of small peptides, can be applied A protein complex can be isolated directly from its cellular to native MS to gather information about the subunits pre- environment, fully processed with its full complement of sent in a protein complex [40]. Apparently, peripheral subu- modifications and directly studied by MS without the need nits are preferentially eliminated in this setup, thus allowing for further manipulations [34]. MS based methods can readi- interpretation of the topology of the complexes analyzed. ly detect stable interactions which constitute core architectu- A recent technological advance is ion mobility seperation res of protein complexes. Implementation of chemical cross- coupled to mass spectrometry (IM-MS), which has been par- linking strategies in MS experiments further offers possibili- ticulary useful to establish mass spectrometry as a powerful ties to detect and analyze important transient interactions tool for structural biology applications [41, 43]. In IM-MS, [35]. A key issue is the analysis of the vast amount of data ions are separated on the basis of their mass-to-charge ratio gathered in MS-based proteome and interactome analysis. and as well on their drift time in a gas-filled ion mobility Progress is being made in developing tools for analyzing chamber. The drift time depends on the cross-section of the MS-data based on statistical principles [36, 37].

54 Getting a Grip on Complexes Current Genomics, 2009 , Vol. 10, No. 8 563 molecule, with larger molecules exhibiting longer drift- typically requires iterative optimization of protein encoding times, thus allowing determination of the average projection sequences for expression and purification of the specimens. area of a specimen studied. It is conceivable that this techni- Several to many expression vectors, host organisms and host que will mature into a tool that will be routinely used to strains need to be integrated into the experimental work- measure the cross-section of large protein complexes, which flow, in addition to covering a large space of conditions suit- could be rather useful for providing volume constraints that able for crystallization. All steps involved require consider- can be utilized in molecular modelling of these assemblies able investment in labor and materials and a very significant [43]. through-put of experiments. Entire proteomes are addressed most often at the single protein or protein domain level. Requiring relatively small amounts of protein sample Consequently, structural genomics intensively stimulated compared to other MS techniques, nanoelectro-spray ioniza- tion can achieve the maintenance of a solution structure in and fostered the implementation of automation and high- throughput approaches, which now result also in consider- the gas phase. Using collision-induced dissociation (CID), able benefit for classical, hypothesis driven structural mo- even very large protein complexes can be selectively disso- lecular biology. Many laboratories are now in the process of ciated by collision with neutral gas atoms. Each collision integrating high-throughput approaches at varying levels in event results in the accumulation of internal energy by the their research [49]. ion in question. Upon accumulation of sufficient internal energy, this ion may undergo dissociation. This approach can Structural genomics projects generally start from target be used to dissociate protein complexes into subcomplexes selection, which is based on evaluation of a large amount of and subunits which are then analyzed with TOF instruments. candidate genes via bioinformatics methods. This is followed CID has been used to analyze virus capsids and entire ribo- by cloning, insertion in one or several expression vectors, somes with a molecular mass of 2.5 MDa [44]. The complete expression and purification, and finally structure determina- subunit architecture of the yeast exosome, the protein ma- tion. Researchers at centers engaged in structural genomics chine which degrades RNA in yeast, could be correctly as- integrated automated cloning strategies based on restric- signed using CID [45]. Furthermore, subcomplexes and pe- tion/ligation [50, 51], ligation-independent cloning [52, 53], ripheral subunits of human elongation factor elF3 could be or recombination [54, 55]. Among them, recombination identified by using this method [46, 47]. based cloning systems are most widely utilized in high- throughput experiments. Although the systems used cur- IMPACT OF STRUCTURAL GENOMICS rently are robust and can be automated, they are often not The description of the 3-D structure of biological mac- sufficiently flexible when variations of expression elements romolecules, at near-atomic resolution, is imperative for un- such as purification tags, promoter/terminator combinations, derstanding their function at the molecular level. The eluci- protease cleavage sites and others need to be introduced or dation of the DNA sequence of the entire genome of many modified [49]. organisms, including humans, revealed the gene repertoire Autoinduction procedures were found to be particularly present in cells. This set the stage to address the proteome, useful for automated high-throughput approaches for expres- which is the comprehensive assemblage of all known gene sion of the target specimens in E. coli as expression host. products in an organism. The elucidation of the 3-D structure Autoinduction is based on a defined medium containing of all encoded proteins, at high resolution, is the goal of glycerol, glucose and lactose as inducer, which makes use of structural genomics efforts. Structural genomics aims at promoters containing lac operators. Glucose prevents induc- building up a high-resolution library dedicated to catalogu- tion by lactose until it is consumed. Upon glucose depletion ing the protein complement of different organisms via high- in the culture, lactose is metabolized and heterologous induc- throughput and automated approaches starting from molecu- tion occurs by means of the lac operator. Autoinduction thus lar cloning of the genes to structure elucidation of the en- simplifies the expression procedure: it alleviates the re- coded proteins. Based on structures deposited in the Protein quirement for monitoring the density of cell cultures, as glu- Data Bank (PDB), structure determination by single crystal cose depletion auto-regulates the time of induction. Further, X-ray diffraction analysis is currently the predominantly auto-induction does not require the addition of inducer used technique, in addition to structure determination in so- chemicals facilitating means for automation [56]. lution by NMR. By means of comparison with structures of well-characterized proteins and domains, the biological func- Increasingly, cell-free (CF) protein synthesis methods tion of uncharacterized proteins can often be discovered or emerge as a viable alternative to in vivo expression in struc- proposed. Until the beginning of 2008, the combined effort tural genomics pipelines due to several advantages [57]. Pro- from structural genomics consortia worldwide contributed teins that are toxic to host cells can be expressed by CF ex- about 50% of the newly-deposited structures in the PDB. pression, and CF expression, in principle, can be better con- One of the largest structural genomic projects is the Project trolled by using highly purified components [58]. CF expres- Structure Initiative (PSI) in the United States, which is spon- sion is especially useful for structure determination by NMR sored by the National Institute of Health (NIH). Several spectroscopy, since it is performed in small volumes and other large consortia exist in Japan, Canada, and Europe therefore requires less isotope label than cellular protein la- [48]. beling [48, 57]. CF methods may be particularly useful for efficient screening of detergents required for successful pro- In addition to the very large number of structures to be duction for membrane proteins [59], and may also allow elucidated for describing a proteome, structural genomics rapid, small volume parallel screening of many variants of a approaches were confronted with a multitude of challenges. target protein [60]. Successful structural determination by X-ray crystallography

55 564 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al.

Many particularly exciting targets in the proteome will coli still remains to date the expression system of choice in require expression in eukaryotic systems. Baculovirus ex- most laboratories. Nonetheless, eukaryotic expression is also pression vector systems (BEVS) increasingly become the being implemented for production of samples that can not be method of choice for many of these targets. While consider- produced in E. coli . In particular the baculovirus/insect cell able effort is being invested into automation and high- system has been streamlined significantly, and detailed pro- throughput protein expression by using BEVS [61-63], con- tocols have become available that considerably simplify trolled virus generation in sufficient quantity and quality handling, thus alleviating some of the uncertainties regarding remains a challenge with currently available BEVS tech- this system that impeded its routine application by non- nologies [61]. Transient transfection of plasmid DNA into specialist users [70, 73, 74]. the nucleus of insect cells was suggested as a possible, eco- Our laboratory has contributed to some of these devel- nomic alternative for analytical screening prior to larger opments, with particular focus on the production of mul- scale virus generation [61]. ticomponent protein complexes for structural biology appli- Hierarchal multiplex expression and purification strate- cations. We are interested in the structural molecular biology gies utilized by the core Protein Production Platform of the of eukaryotic complexes. For recombinant overproduction of Northeast Structural Genomics Consortium (NESG), foster these complexes, a system for multiprotein expression in an increase in the production of protein samples and also the insect cells, called MultiBac, was introduced [70, 73] (Fig. solution of many 3-D protein structures [55]. Initiatives are (2)). MultiBac uses an engineered deletion baculovirus with ongoing to set up productive modules for target sampling, improved protein production properties including reduced cloning, sample characterization and crystallization, arranged proteolysis and a delayed onset of cell fragmentation in the into fully integrated pipelines [64]. Since compact globular late phase of viral infection [73]. This MultiBac baculovirus domains defined by limited proteolysis are good candidates is accessed by two plasmids called transfer vectors at two for production of diffraction quality crystals, high- recombination sites present on the virus: a LoxP imperfect throughput limited proteolysis/mass spectrometry ap- inverted repeat for site-specific recombination, and a Tn7 proaches for protein domain elucidation are being included attachment site. The Tn7 attachment site is embedded in a into such pipelines, providing precise definition of domain LacZ ! gene for blue-white selection of recombinant bacu- boundaries, with significant impact for success prospects loviruses. These transfer vectors harbour the heterologous [65]. genes of interest. The MultiBac baculovirus exists as a BAC E. coli Structural genomics has decisively accelerated automa- in cells containing also a small plasmid with four genes encoding for the Tn7 transposon, similar to the widely tion and the development of robust high-throughput meth- utilized Bac-to-Bac system from Invitrogen, and essentially ods. Nonetheless, critics claim that structural genomics con- all other baculovirus systems that rely on Tn7 transposition sortia have gone after the “low-hanging fruit”, such as solu- of a transfer vector in vivo in an E. coli host strain. ble single proteins of prokaryotic origin which are compara- tively easy to express and purify [66]. Actually, structural The transfer vectors that we developed for MultiBac con- genomics efforts now are gradually moving to address more tained elements that made it particularly straight forward to challenging target proteins of eukaryotic origin. The objec- arrange into multigene expression cassettes several to many tive is to facilitate the structural determination of human expression units containing ORFs encoding for example for proteins, integral membrane proteins, and eventually multi- members of a protein complex of choice. One transfer vector protein complexes [48]. However, the currently implemented was designed to provide these multigene cassettes between approaches for automation and high-throughput methods Tn7L and Tn7R DNA sequences for integration into the Tn7 cannot easily accommodate the upgrade required to address, site of the MultiBac baculovirus. A second transfer vector in particular, large and complex multicomponent systems. contained a LoxP sequence thus enabling integration of The automation currently implemented in cloning routines multigene cassettes into the LoxP site of the MultiBac virus and expression systems are mainly designed for addressing in the presence of Cre recombinase, the enzyme responsible single ORFs or small, mostly binary systems [67]. for fusing DNA pieces that contain the imperfect inverted repeat. Integration into the LoxP and Tn7 site could be car- EUKARYOTIC MULTIPROTEIN EXPRESSION: ried out simultaneously by co-transfecting the two transfer MULTIBAC vectors into E. coli cells harboring the MultiBac virus, and expressing Tn7 transposon and Cre recombinase, respec- The interactome can not be rationalized on the basis of tively, from helper plasmids [73]. Selection for recombinant elucidating single protein structures. It is now increasingly MultiBac viruses harboring the multigene cargo occurred via clear that the proteins in the cell function as interlocking machines containing ten or more interaction partners, that blue/white selection and antibiotic challenge for the resis- tance marker contained in the plasmid incorporated into the associate stably or transiently to realize cellular activities [1]. virus by Cre -LoxP fusion (Fig. ( 2)). Structural genomics efforts have provided a wealth of detail on the level of individual proteins and domains. To address The MultiBac system as conceived in 2004 was surpris- the more complex challenge of multicomponent assemblies, ingly well received in the community, probably indicating a number of expression systems have been introduced, that the present and growing interest in researching eukaryotic are suitable for simultaneous expression of several genes in interactomes and multiprotein complexes. Many laboratories prokaryotic and eukaryotic hosts [68-72]. In spite of consid- requested the MultiBac reagents, many proteins were ex- erable improvements of eukaryotic expression systems, E. pressed, and X-ray crystal structures based on specimens

56 Getting a Grip on Complexes Current Genomics, 2009 , Vol. 10, No. 8 565

Fig. (2). MultiBac BEVS: Eukaryotic multiprotein expression. ORFs (a-e) encoding for subunits of a protein complex and auxiliary pro- tein such as modifiers or chaperones, are inserted into a plasmid containing the sequences required for Tn7 transposition (Tn7L, Tn7R), or a plasmid containing a LoxP imperfect inverted repeat, respectively. Gene insertion occurs via a multiplication module (small rectangles) de- signed for facilitating multigene cassette generation. A baculovirus genome containing the Tn7 attachment site (attn7) and a LoxP sequence, in addition to deletions beneficial for protein production, is present in bacterial cells in form of a bacterial artificial (BAC). Inte- gration of multigene expression cassettes is mediated by the Tn7 transposon and Cre recombinase, respectively, which are expressed from helper vectors in the bacteria [73]. Transfection of insect cells with the resulting composite baculovirus results in high-level expression of the proteins in cultured insect cells. Adapted from [95]. produced by MultiBac are now being reported [75, 76]. In- to establish a robust protocol for virus generation, amplifica- terestingly, our baculovirus expression technologies were not tion and protein production which then could be applied rou- only used successfully for protein complex production for tinely and successfully in our laboratory and many others structural biology as they were designed for, but also for including non-specialist users [74]. We feel that BEVS ex- rather diverse other applications ranging from production of pression, by using these protocols, can now be performed possible vaccine candidates based on papilloma virus like with almost the same ease and effort, as heterologous ex- particles [77] to preparing recombinant adenoviruses for pression is commonly carried out in E. coli. gene therapy treatment of obesity in laboratory rodents [78]. ACEMBLING MULTIPROTEIN COMPLEXES In our view, the genuinely useful contribution in conjunc- tion with MultiBac, was not only the creation of yet another The combination of many genes encoding for subunits of baculovirus and a few transfer vectors. We had realized in a protein complex into vectors used for expression will re- the process of our experimental work that the parameters of main a rather laborious task, in particular if it relies on re- virus generation are not really compatible with routine appli- striction digestion and pasting together of DNA fragments by cation of an expression method in laboratories focusing on ligase in a serial, one-gene-at-a-time mode. This approach is structural analysis. Baculovirus expression is constrained by essentially refractory to automation. Structural genomics certain requirements that need to be met to assure that the consortia have strived to address the problem by implement- recombinant DNA cargo is properly maintained in the bacu- ing recombination methods for gene insertion. These meth- loviral genome during virus amplification and eventually ods have the advantage that they always use the very same protein production [79-81]. We found that introducing a reagents and reaction conditions, and therefore can be fluorescent marker gene into the virus backbone, and pre- scripted into a robotics routine. The emphasis of most sys- cisely monitoring fluorescence intensity as well as the cell tems currently was mainly placed on offering a multitude of growth development in a culture, provided a very useful and expression options for the one ORF of choice. For instance, simple regimen to largely alleviate the detrimental loss of the Gateway system from Invitrogen, defines an Entry vector titer or loss of protein production which are the major im- for the gene of interest, which is inserted by any suitable pediments encountered when using BEVS. This allowed us means. This Entry vector is then used to introduce this gene

57 566 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al. into a wide range of Destination vectors providing a large available. Nicks are closed and gaps are filled by the E. coli assortment of purification or solubility tags for expression in machinery upon transformation with the annealed DNAs. a variety of hosts. The situation presents itself in reverse for We found that efficient procedures could be established for multiprotein complex expression: here, the challenge is to integrating single genes or polycistrons into the ACEMBL introduce an assortment of genes into probably one expres- Donors and Acceptors by SLIC, and scripted into robust rou- sion system of choice to start with. This needs to be achieved tines, which could be readily carried out by a robot [82]. in a way that ideally, the genes encoding for the multiprotein Gene integration into the ACEMBL vectors occurs at inte- complex to be studied can not only be assembled fairly eas- gration sites that make up a so-called multiple integration ily, but also options need to be provided to modify the indi- element (MIE), which contains also restriction sites for con- vidual subunit components rapidly and in a flexible way by ventional gene integration as well as homing endonuclease mutation, truncation or replacing of affinity tags. Already for sites for facile gene multiplication into multi-expression cas- single proteins, altering the wild-type sequence for example settes [82]. by removing low complexity regions is often a prerequisite Donors thus charged with recombinant DNA cargo, each for successful high-resolution structural analysis, and intro- containing single genes, polycistrons or multiple expression ducing mutations is commonplace for elucidating the func- cassettes, are then fused with one Acceptor by using Cre tion and activity. This is equally valid for multiprotein com- recombinase and the LoxP site present on each vector. Ac- plexes, however, the tasks at hand are considerably more ceptors like Donors can contain one or several genes, poly- complicated to achieve as the number of interacting subunits cistrons or a combination thereof. Several Donors can be increases. fused with each Acceptor. Selection for multiple resistance, These deliberations and underlying experimental necessi- each of these characteristic for one Donor or one Acceptor, ties prompted us recently to introduce ACEMBL, an auto- then identifies the Donor-Acceptor fusions in a combinato- matable system for multiprotein expression making use of rial fashion. By using this approach, we could easily gener- multigene recombineering by using a robot [82, 83] (Fig. ate in a single reaction a series of multigene expression vec- (3)). For matters of simplicity, we first created ACEMBL in tors expressing protein complexes as well as all possible a version suitable for multiprotein complex production in E. combinations of genes contained on the individual vectors, coli as an expression host, although, the same robotic scripts revealing subcomplexes [82]. Interestingly, our experiments can likewise be applied for generating multigene constructs showed that multigene expression vectors could not only be for protein complex expression in eukaryotic hosts. We de- assembled in this way, but likewise also selectively decon- cided to consequently adapt recombination methods at every structed by using the reverse approach. This is achieved by step of the process of gene insertion and gene combination applying Cre recombinase to previously generated Donor- into multigene expression cassettes, and to implement al- Acceptor fusions. This is possible due to the equilibrium ready existing, robust robotics protocols for small scale ex- reached between the fusion and excision activities of the Cre pression and protein extraction by using affinity purification enzyme. Thus, defined parts of a multigene construct, encod- [82]. ing for subunits of a protein complex, can be excised by our procedure, altered for example by truncation, mutation, or Building on our positive experiences using Cre -LoxP replacement of the encoding genes, and then reintegrated fusion in MultiBac, we synthesized two families of small into the multigene expression construct of choice by apply- plasmids with the minimum DNA sequences required. These ing Cre fusion. This provides useful combinatorial options, plasmids are called Acceptors and Donors. They are small also for robotics applications [82]. By employing the (2-2.5 kb) and each plasmid contains the LoxP inverted im- perfect repeat. Donors contain a conditional origin of replica- ACEMBL method, we were able to express and purify all members of the holotranslocon from E. coli , a large prokary- tion which makes their existence and propagation in regular otic translocation complex consisting of six transmembrane cloning and expression strains dependent on Cre -LoxP me- proteins, from a 16 kb multigene plasmid [82]. diated fusion with Acceptors, which in turn have a regular origin of replication derived from the classical ColE1 origin. STRUCTURAL COMPLEXOMICS? We settled on sequence and ligation independent cloning Genome and proteome-wide studies have clearly re- (SLIC) as the method of choice for inserting genes into Do- vealed the key role of macromolecular complexes in most, if nors and Acceptors, as detailed protocols for this methods not all vital cellular processes. Protein complexes display became available recently [84]. Nonetheless, we needed to activities that are entirely different from the activities of each modify and improve these protocols to achieve robust inte- subunit studied independently, as interaction partners often gration, in particular when the process was carried out on in a robotic setup using a liquid handling workstation [82, 83]. dramatically influence recognition propensities and likewise biological activities. In addition, protein complex composi- This SLIC method, and likewise the BD-InFusion (Clontech tion in particular in higher eukaryotes can depend on tissue Takara) or standardized ligation independent cloning (LIC) type and cell state. Importantly, covalent posttranslational methods (Novagen), are commonly referred to as recombina- modifications such as phosphorylation, acetylation, methyla- tion methods, although this denotion is slightly misleading tion and many others can have a critical impact on the forma- for these approaches. Rather, these methods have in common that they make use of the 3’ exonuclease activity of DNA tion of protein complexes and their activity. Due to all of the variables that need to be controlled when attempting to as- polymerases in the absence of nucleotide triphosphates. semble protein complexes recombinantly, it is important to Thus, long single stranded overhangs are created which can have a robust system that allows rapid testing of many dif- serve as sticky ends if complementary single strands become ferent constructs.

58 Getting a Grip on Complexes Current Genomics, 2009 , Vol. 10, No. 8 567

Fig. (3). ACEMBL System. ACEMBL consists of newly designed, small vectors (A) and automated procedures and routines relying on re- combineering for gene insertion and vector fusion (B). Multigene expression constructs are generated by insertion of genes into multiple inte- gration elements (MIE) by recombination, followed by Cre -LoxP fusion of Donors with an Acceptor. Incubation of educt constructs (here pDK, pDS, pACE) containing genes of interest (white arrows) results in all possible combinations in a single reaction including Acceptor- Donor (AD) and Acceptor-Donor-Donor (ADD) fusions as shown here schematically. Creation of even four-plasmid ADDD constructs has also been completed successfully in our laboratory [82]. All co-existing constructs have characteristic antibiotic marker combinations and resistance levels (right). Donor vectors contain a conditional origin of replication derived from R6K !, and thus act as suicide vectors in clon- ing strains devoid of the pir gene unless fused to an Acceptor with a regular replicon. A second Acceptor, pACE2, is identical to pACE ex- cept for the encoded marker which confers resistance to tetracycline rather than ampicillin (not shown). Plasmid pACE2 can be used in con- junction with pACE derivatives for example to co-express auxiliary proteins such as chaperones or modifiers [82]. ( C) Recombineering workflow by using the ACEMBL system is shown. Genes are integrated in Donors or Acceptors by ligation independent methods such as SLIC followed by combinatorial multigene vector generation using Cre -LoxP fusion. Expression and purification provide protein complex for analysis. Multigene vectors are deconstructed by using Cre excision activity (De- Cre ). Encoded genes are modified by PCR and reintegrated into the workflow by recombination in an iterative cycle. The entire process is compatible with automation, and was successfully scripted into a robotic routine. Adapted in part from [82, 83].

In the current environment, in which valuable informa- technological state-of-the-art is simply incompatible with the tion about interactomes, complexomes and other genome- speed at which new data is accumulated through high- wide studies is pouring in at an ever increasing pace, struc- throughput research addressing the interactome, and a major tural biology as it is performed to date simply cannot keep effort towards the development of new technologies is ur- up with the increasing demand for the validation that only 3- gently required to close this gap. D structures can provide. Protein structures can offer insights 3-D structural information can be gained from purified into the details of a protein interaction at the molecular or material extracted in small amounts from native source by near-atomic level, and it is imperative for structural biolo- electron-microscopic techniques which have significantly gists to move into the arena of protein complex interactions. matured in recent years [85-87]. In particular, cryo-electron Despite recent colossal efforts in obtaining 3-D structures at microscopy in conjunction with single-particle analysis can near atomic resolution by X-ray crystallography, greatly fos- be used to gain information about the quaternary architecture tered by structural genomics consortia, obtaining diffraction of multiprotein assemblies. Although 3-D protein structures quality crystals of protein complexes remains a significant obtained from cryo-electron microscopy are reaching higher challenge and often takes on the order of years achieve. This

59 568 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al. resolutions than ever before, 3-D structures obtained by this proper complex assembly and folding. Protein-RNA com- method provide still limited information when compared to plexes such as telomerase, snRNPs or RNAi containing com- the atomic details obtained by X-ray crystallography or plexes are a focus of contemporary research efforts aimed at NMR spectroscopy. elucidating mechanisms of health and disease. The recent 3- Undoubtedly, great benefit could be derived from the D structure of a human spliceosomal U1 snRNP compellingly demonstrates the power of recombinant recon- development of advanced techniques and reproducible pro- stitution of such a complex for structure elucidation [92]. tocols for micropurification of endogenous complexes. Puri- Technologies allowing routine multigene expression in pro- fication of protein from biological material present in limited karyotic and eukaryotic hosts will certainly need to incorpo- amounts will certainly be necessary in particular for the iden- rate the means for producing heterologous complexes con- tification of complexes, or variations of complexes, that are present in specialized cells or specific tissues, and for a thor- taining non-protein components such as RNA and other bio- molecules. ough validation of interactome data. This requires highly efficient methods to recover the quantities of protein re- Automation is essential for accelerating contemporary quired for biophysical methods. Due to the considerable in- protein science. Automation depends on standardization and crease in sensitivity of mass spectrometers achieved in recent simplification of protocols that are robust and reproducible. years, it is now possible to routinely identify subunits of pro- These requirements must be addressed by the development tein complexes from pico- to femto-mole quantities of mate- of easy-to-use, affordable reagents that are ideally compati- rial. It is critically important now to develop new strategies ble with robotic procedures. Automation has already had a for the micropurification of protein complexes that will al- considerable impact on cloning, DNA preparation, protein low the simultaneous processing of several samples from purification by affinity tags and assaying protein activities. limited amounts of source material. Such micropurification Protocols optimized for automation have at times superseded techniques, in conjunction with process automation for en- earlier, more laborious procedures even in laboratories not dogenous sample preparation will decisively improve current applying robots routinely, as manual procedures generally research approaches both in terms of throughput and also also benefit considerably from the standardization and ro- quality of analysis. Size-exclusion chromatography (SEC) is bustness inherently required for methods that can be used by often a rate limiting step in the preparation of protein com- robots. Automation will be particularly important for recon- plexes. New purification strategies involving native gels, stitution of macromolecular complexes by heterologous capillary electrophoresis or absorption onto membranes multigene expression as probably a large number of con- could possibly mature into genuine alternatives to SEC, thus structs will need to be tested for many cases until a satisfac- allowing parallel processing of many samples and increasing tory reconstitution is achieved, yielding specimens suitable sample homogeneity. for detailed studies. The number of possible combinations increases dramatically with the number of subunits. This is Recombinant expression most certainly had a decisive particularly true if the pipeline is geared towards X-ray crys- impact on life science research, and is to date the major technique for successful production of well-defined macro- tallography. molecular specimens in the quality and quantity required for In single crystal structure determination by X-ray diffrac- many applications. Apart from notable examples such as tion, a vital prerequisite is the ability of a specimen to ar- ribosomes or RNA polymerase [88-91], near-atomic struc- range into a highly ordered crystal lattice that diffracts the ture determination of complex multicomponent systems will incident X-ray radiation to near-atomic resolution. Often, in all likelihood in most cases depend on recombinant over- this challenge can only be met by introducing variation into production. More recently, several multi-expression systems the wild-type sequence until a crystallizable specimen is ob- have been introduced for expression of protein complexes in tained. Limited proteolysis, in conjunction with mass spec- a variety of different expression hosts, two of these were trometry, has been particularly useful for defining regions of described in some detail in this contribution. However, most low-complexity that can often interfere with crystallization. systems currently available still require dedicated expertise Such regions are then typically removed by introducing trun- and considerable technical specialisation of the user, which cations or deletions in encoding DNA sequences, and re- is refractory to routine research, in particular for high- combinant overexpression of the resulting variant can then throughput applications. Biological and also pharmaceutical result in sample more amenable to crystallization. Corre- research often depend on introducing variations (mutation, sponding procedures are now being introduced in more truncations, fusions with markers, etc) into the specimen elaborate structural genomics pipelines. Nonetheless, it is studied. Multi-expression systems therefore must provide the clear that implementing such limited proteolysis procedures, flexibility required for rapid revision of experiments, where often already laborious for single proteins, will be vastly such alterations can be introduced with ease. The ACEMBL more complicated when several to many ORFs need to be system we developed could represent a first step in this di- diversified concomitantly in a multiprotein complex. Recent rection. Nonetheless, production of many vital protein com- advances in mass spectrometry, including quantitative, mul- plexes, especially those requiring a eukaryotic host machin- tiplexed techniques [93, 94] may prove to be invaluable for ery for sample production, remains a challenge and a major designing tools to analyze limited proteolysis experiments of bottleneck in the pipeline to high-resolution 3-D structures. complex multiprotein assemblies in high-throughput for structure elucidation. A further consideration in protein complex biology are those complexes that contain protein subunits as well as High-resolution structure determination, in particular by RNA components which may need to be co-expressed for X-ray crystallography, has developed into an indispensable

60 568 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al. resolutions than ever before, 3-D structures obtained by this proper complex assembly and folding. Protein-RNA com- method provide still limited information when compared to plexes such as telomerase, snRNPs or RNAi containing com- the atomic details obtained by X-ray crystallography or plexes are a focus of contemporary research efforts aimed at NMR spectroscopy. elucidating mechanisms of health and disease. The recent 3- Undoubtedly, great benefit could be derived from the D structure of a human spliceosomal U1 snRNP compellingly demonstrates the power of recombinant recon- development of advanced techniques and reproducible pro- stitution of such a complex for structure elucidation [92]. tocols for micropurification of endogenous complexes. Puri- Technologies allowing routine multigene expression in pro- fication of protein from biological material present in limited karyotic and eukaryotic hosts will certainly need to incorpo- amounts will certainly be necessary in particular for the iden- rate the means for producing heterologous complexes con- tification of complexes, or variations of complexes, that are present in specialized cells or specific tissues, and for a thor- taining non-protein components such as RNA and other bio- molecules. ough validation of interactome data. This requires highly efficient methods to recover the quantities of protein re- Automation is essential for accelerating contemporary quired for biophysical methods. Due to the considerable in- protein science. Automation depends on standardization and crease in sensitivity of mass spectrometers achieved in recent simplification of protocols that are robust and reproducible. years, it is now possible to routinely identify subunits of pro- These requirements must be addressed by the development tein complexes from pico- to femto-mole quantities of mate- of easy-to-use, affordable reagents that are ideally compati- rial. It is critically important now to develop new strategies ble with robotic procedures. Automation has already had a for the micropurification of protein complexes that will al- considerable impact on cloning, DNA preparation, protein low the simultaneous processing of several samples from purification by affinity tags and assaying protein activities. limited amounts of source material. Such micropurification Protocols optimized for automation have at times superseded techniques, in conjunction with process automation for en- earlier, more laborious procedures even in laboratories not dogenous sample preparation will decisively improve current applying robots routinely, as manual procedures generally research approaches both in terms of throughput and also also benefit considerably from the standardization and ro- quality of analysis. Size-exclusion chromatography (SEC) is bustness inherently required for methods that can be used by often a rate limiting step in the preparation of protein com- robots. Automation will be particularly important for recon- plexes. New purification strategies involving native gels, stitution of macromolecular complexes by heterologous capillary electrophoresis or absorption onto membranes multigene expression as probably a large number of con- could possibly mature into genuine alternatives to SEC, thus structs will need to be tested for many cases until a satisfac- allowing parallel processing of many samples and increasing tory reconstitution is achieved, yielding specimens suitable sample homogeneity. for detailed studies. The number of possible combinations increases dramatically with the number of subunits. This is Recombinant expression most certainly had a decisive particularly true if the pipeline is geared towards X-ray crys- impact on life science research, and is to date the major technique for successful production of well-defined macro- tallography. molecular specimens in the quality and quantity required for In single crystal structure determination by X-ray diffrac- many applications. Apart from notable examples such as tion, a vital prerequisite is the ability of a specimen to ar- ribosomes or RNA polymerase [88-91], near-atomic struc- range into a highly ordered crystal lattice that diffracts the ture determination of complex multicomponent systems will incident X-ray radiation to near-atomic resolution. Often, in all likelihood in most cases depend on recombinant over- this challenge can only be met by introducing variation into production. More recently, several multi-expression systems the wild-type sequence until a crystallizable specimen is ob- have been introduced for expression of protein complexes in tained. Limited proteolysis, in conjunction with mass spec- a variety of different expression hosts, two of these were trometry, has been particularly useful for defining regions of described in some detail in this contribution. However, most low-complexity that can often interfere with crystallization. systems currently available still require dedicated expertise Such regions are then typically removed by introducing trun- and considerable technical specialisation of the user, which cations or deletions in encoding DNA sequences, and re- is refractory to routine research, in particular for high- combinant overexpression of the resulting variant can then throughput applications. Biological and also pharmaceutical result in sample more amenable to crystallization. Corre- research often depend on introducing variations (mutation, sponding procedures are now being introduced in more truncations, fusions with markers, etc) into the specimen elaborate structural genomics pipelines. Nonetheless, it is studied. Multi-expression systems therefore must provide the clear that implementing such limited proteolysis procedures, flexibility required for rapid revision of experiments, where often already laborious for single proteins, will be vastly such alterations can be introduced with ease. The ACEMBL more complicated when several to many ORFs need to be system we developed could represent a first step in this di- diversified concomitantly in a multiprotein complex. Recent rection. Nonetheless, production of many vital protein com- advances in mass spectrometry, including quantitative, mul- plexes, especially those requiring a eukaryotic host machin- tiplexed techniques [93, 94] may prove to be invaluable for ery for sample production, remains a challenge and a major designing tools to analyze limited proteolysis experiments of bottleneck in the pipeline to high-resolution 3-D structures. complex multiprotein assemblies in high-throughput for structure elucidation. A further consideration in protein complex biology are those complexes that contain protein subunits as well as High-resolution structure determination, in particular by RNA components which may need to be co-expressed for X-ray crystallography, has developed into an indispensable

61 570 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al.

[9] Krogan, N.J.; Cagney, G.; Yu, H.; Zhong, G.; Guo, X.; Ig- Lee, H.; Doucette-Stamm, L.; Hill, D.E.; Vidal, M. C. elegans OR- natchenko, A.; Li, J.; Pu, S.; Datta, N.; Tikuisis, A.P.; Punna, T.; Feome version 1.1: experimental verification of the genome anno- Peregrín-Alvarez, J.M.; Shales, M.; Zhang, X.; Davey, M.; Robin- tation and resource for proteome-scale protein expression. Nat. son, M.D.; Paccanaro, A.; Bray, J.E.; Sheung, A.; Beattie, B.; Genet. , 2003 , 34 , 35-41. Richards, D.P.; Canadien, V.; Lalev, A.; Mena, F.; Wong, P.; Sta- [22] Walhout, A.J.; Boulton, S.J.; Vidal, M. Yeast two-hybrid systems rostine, A.; Canete, M.M.; Vlasblom, J.; Wu, S.; Orsi, C.; Collins, and protein interaction mapping projects for yeast and worm. Yeast S.R.; Chandran, S.; Haw, R.; Rilstone, J.J.; Gandi, K.; Thompson, 2000 , 17 , 88-94. N.J.; Musso, G.; St Onge, P.; Ghanny, S.; Lam, M.H.; Butland, G.; [23] Colland, F.; Jacq, X.; Trouplin, V.; Mougin, C.; Groizeleau, C.; Altaf-Ul, A.M.; Kanaya, S.; Shilatifard, A.; O'Shea, E.; Weissman, Hamburger, A.; Meil, A.; Wojcik, J.; Legrain, P.; Gauthier, J.M. J.S.; Ingles, C.J.; Hughes, T.R.; Parkinson, J.; Gerstein, M.; Wo- Functional proteomics mapping of a human signaling pathway. dak, S.J.; Emili, A.; Greenblatt, J.F. Global landscape of protein Genome Res. , 2004 , 14 , 1324-1332. complexes in the yeast Saccharomyces cerevisiae. Nature , 2006 , [24] Stelzl, U.; Worm, U.; Lalowski, M.; Haenig, C.; Brembeck, F.H.; 440 , 637-643. Goehler, H.; Stroedicke, M.; Zenkner, M.; Schoenherr, A.; Koep- [10] Herzberg, C.; Weidinger, L.A.; Dörrbecker, B.; Hübner, S.; Stülke, pen, S.; Timm, J.; Mintzlaff, S.; Abraham, C.; Bock, N.; Kietz- J.; Commichau, F.M. SPINE: a method for the rapid detection and mann, S.; Goedde, A.; Toksöz, E.; Droege, A.; Krobitsch, S.; Korn, analysis of protein-protein interactions in vivo . Proteomics , 2007 , B.; Birchmeier, W.; Lehrach, H.; Wanker, E.E. A human protein- 7, 4032-4035. protein interaction network: a resource for annotating the proteome. [11] Cheeseman, I.M.; Desai, A. A combined approach for the localiza- Cell , 2005 , 122 , 957-968. tion and tandem affinity purification of protein complexes from [25] Yu, H.; Braun, P.; Yildirim, M.A.; Lemmens, I.; Venkatesan, K.; metazoans. Sci. STKE. , 2005 , 266 , pl1. Sahalie, J.; Hirozane-Kishikawa, T.; Gebreab, F.; Li, N.; Simonis, [12] Bürckstümmer, T.; Bennett, K.L.; Preradovic, A.; Schütze, G.; N.; Hao, T.; Rual, J.F.; Dricot, A.; Vazquez, A.; Murray, R.R.; Hantschel, O.; Superti-Furga, G.; Bauch, A. An efficient tandem Simon, C.; Tardivo, L.; Tam, S.; Svrzikapa, N.; Fan, C.; de Smet, affinity purification procedure for interaction proteomics in mam- A.S.; Motyl, A.; Hudson, M.E.; Park, J.; Xin, X.; Cusick, M.E.; malian cells. Nat. Methods , 2006 , 12 , 1013-1019. Moore, T.; Boone, C.; Snyder, M.; Roth, F.P.; Barabási, A.L.; Tav- [13] Poser, I.; Sarov, M.; Hutchins, J.R.; Hériché, J.K.; Toyoda, Y.; ernier, J.; Hill, D.E.; Vidal, M. High-quality binary protein interac- Pozniakovsky, A.; Weigl, D.; Nitzsche, A.; Hegemann, B.; Bird, tion map of the yeast interactome network. Science 2008 , 322 , 104- A.W.; Pelletier, L.; Kittler, R.; Hua, S.; Naumann, R.; Augsburg, 110. M.; Sykora, M.M.; Hofemeister, H.; Zhang, Y.; Nasmyth, K.; [26] Devos, D.; Russel, R.B. A more complete, complexed and struc- White, K.P.; Dietzel, S.; Mechtler, K.; Durbin, R.; Stewart, A.F.; tured interactome. Curr. Opin. Struct. Biol ., 2007 , 17 , 370-377. Peters, J.M.; Buchholz, F.; Hyman, A.A. BAC TransgeneOmics: a [27] Orchard, S.; Salwinski, L.; Kerrien, S.; Montecchi-Palazzi, L.; high-throughput method for exploration of protein function in Oesterheld, M.; Stümpflen, V.; Ceol, A.; Chatr-aryamontri, A.; mammals. Nat. Methods , 2008 , 5, 409-415. Armstrong, J.; Woollard, P.; Salama, J.J.; Moore, S.; Wojcik, J.; [14] Fromont-Racine, M.; Rain, J.C.; Legrain, P. Towards a functional Bader, G.D.; Vidal, M.; Cusick, M.E.; Gerstein, M.; Gavin, A.C.; analysis of the yeast genome through exhaustive two-hybrid Superti-Furga, G.; Greenblatt, J.; Bader, J.; Uetz, P.; Tyers, M.; screens. Nat. Genet. , 1997 , 16 , 277-282. Legrain, P.; Fields, S.; Mulder, N.; Gilson, M.; Niepmann, M.; [15] Venkatesan, K.; Rual, J.F.; Vazquez, A.; Stelzl, U.; Lemmens, I.; Burgoon, L.; De Las Rivas, J.; Prieto, C.; Perreau, V.M.; Hogue, Hirozane-Kishikawa, T.; Hao, T.; Zenkner, M.; Xin, X.; Goh, K.I.; C.; Mewes, H.W.; Apweiler, R.; Xenarios, I.; Eisenberg, D.; Ce- Yildirim, M.A.; Simonis, N.; Heinzmann, K.; Gebreab, F.; Sahalie, sareni, G.; Hermjakob, H. The minimum information required for J.M.; Cevik, S.; Simon, C.; de Smet, A.S.; Dann, E.; Smolyar, A.; reporting a molecular interaction experiment (MIMIx). Nat. Bio- Vinayagam, A.; Yu, H.; Szeto, D.; Borick, H.; Dricot, A.; Klitgord, technol. , 2007 , 25 , 894-898. N.; Murray, R.R.; Lin, C.; Lalowski, M.; Timm, J.; Rau, K.; Boone, [28] Hart, G.T.; Lee, I.; Marcotte, E.R. A high-accuracy consensus map C.; Braun, P.; Cusick, M.E.; Roth, F.P.; Hill, D.E.; Tavernier, J.; of yeast protein complexes reveals modular nature of gene essen- Wanker, E.E.; Barabási, A.L.; Vidal, M. An empirical framework tiality. BMC Bioinformatics , 2007 , 8, 236. for binary interactome mapping. Nat. Methods , 2009 , 6, 83-90. [29] Collins, S.R.; Miller, K.M.; Maas, N.L.; Roguev, A.; Fillingham, [16] Fields, S.; Song, O. A novel system to detect protein-protein inter- J.; Chu, C.S.; Schuldiner, M.; Gebbia, M.; Recht, J.; Shales, M.; actions. Nature , 1989 , 340 , 245-246. Ding, H.; Xu, H.; Han, J.; Ingvarsdottir, K.; Cheng, B.; Andrews, [17] Rain, J.C.; Selig, L.; De Reuse, H.; Battaglia, V.; Reverdy, C.; B.; Boone, C.; Berger, S.L.; Hieter, P.; Zhang, Z.; Brown, G.W.; Simon, S.; Lenzen, G.; Petel, F.; Wojcik, J.; Schächter, V.; Che- Ingles, C.J.; Emili, A.; Allis, C.D.; Toczyski, D.P.; Weissman, J.S.; mama, Y.; Labigne, A.; Legrain, P. The protein-protein interaction Greenblatt, J.F.; Krogan, N.J. Functional dissection of protein map of Helicobacter pylori. Nature , 2001 , 409 , 211-215. complexes involved in yeast chromosome biology using a genetic [18] Ito, T.; Chiba, T.; Ozawa, R.; Yoshida, M.; Hattori, M.; Sakaki, Y. interaction map . Nature , 2007 , 446 , 806-810. A comprehensive two-hybrid analysis to explore the yeast protein [30] Aloy, P.; Russell, R.B. Structural systems biology: modelling pro- interactome. Proc. Natl. Acad. Sci. U.S.A. , 2001 , 98 , 4569-4574. tein interactions. Nat. Rev. Mol. Cell Biol. , 2006 , 7, 188-197. [19] Uetz, P.; Giot, L.; Cagney, G.; Mansfield, T.A.; Judson, R.S.; [31] Sprinzak, E.; Altuvia, Y.; Margalit, H. Characterization and predic- Knight, J.R.; Lockshon, D.; Narayan, V.; Srinivasan, M.; Pochart, tion of protein-protein interactions within and between complexes . P.; Qureshi-Emili, A.; Li, Y.; Godwin, B.; Conover, D.; Proc. Natl. Acad. Sci. U. S. A. , 2006 , 103 , 14718-14723. Kalbfleisch, T.; Vijayadamodar, G.; Yang, M.; Johnston, M.; [32] Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Fields, S.; Rothberg, J.M. A comprehensive analysis of protein- Nature , 2003 , 422 , 198-207. protein interactions in Saccharomyces cerevisiae. Nature , 2000 , [33] Han, X.; Aslanian, A.; Yates, J.R. III. Mass spectrometry for pro- 403 , 623-627. teomics. Curr. Opinion Chem. Biol. , 2008 , 12 , 483-490. [20] Li, S.; Armstrong, C.M.; Bertin, N.; Ge, H.; Milstein, S.; Boxem, [34] Domon, B. Aebersold, R. Mass Spectrometry and Protein Analysis. M.; Vidalain, P.O.; Han, J.D.; Chesneau, A.; Hao, T.; Goldberg, Science , 2006 , 312 , 212-217. D.S.; Li, N.; Martinez, M.; Rual, J.F.; Lamesch, P.; Xu, L.; Tewari, [35] Ashman, K.; Moran, M.F.; Sicheri, F.; Pawson, T.; Tyers, M. Cell M.; Wong, S.L.; Zhang, L.V.; Berriz, G.F.; Jacotot, L.; Vaglio, P.; signaling - the proteomics of it all. Sci. STKE , 2001 , 103 , pe33. Reboul, J.; Hirozane-Kishikawa, T.; Li, Q.; Gabel, H.W.; Elewa, [36] Rappsilber, J.; Siniossoglou, S.; Hurt, E.C.; Mann, M. A generic A.; Baumgartner, B.; Rose, D.J.; Yu, H.; Bosak, S.; Sequerra, R.; strategy to analyze the spatial organization of multi-protein com- Fraser, A.; Mango, S.E.; Saxton, W.M.; Strome, S.; Van Den Heu- plexes by cross-linking and mass-spectrometry. Anal. Chem. , 2000 , vel, S.; Piano, F.; Vandenhaute, J.; Sardet, C.; Gerstein, M.; 72 , 267-275. Doucette-Stamm, L.; Gunsalus, K.C.; Harper, J.W.; Cusick, M.E.; [37] Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical Roth, F.P.; Hill, D.E.; Vidal, M. A map of the interactome network statistical model to estimate the accuracy of peptide identifications of the metazoan C. elegans. Science , 2004 , 303 , 540-543. made by MS/MS and database search. Anal. Chem. , 2000 , 74 , [21] Reboul, J.; Vaglio, P.; Rual, J.F.; Lamesch, P.; Martinez, M.; Arm- 5383-5392. strong, C.M.; Li, S.; Jacotot, L.; Bertin, N.; Janky, R.; Moore, T.; [38] Andersen, J.S.; Mann, M. Organellar proteomics: turning invento- Hudson, J.R. Jr.; Hartley, J.L.; Brasch, M.A.; Vandenhaute, J.; ries into insights. EMBO Rep. , 2006 , 7, 874-879. Boulton, S.; Endress, G.A.; Jenna, S.; Chevet, E.; Papasotiropoulos, [39] Aebersold, R. A stress test for mass spectrometry-based pro- V.; Tolias, P.P.; Ptacek, J.; Snyder, M.; Huang, R.; Chance, M.R.; teomics. Nat. Methods , 2009 , 6, 411-412.

62 Getting a Grip on Complexes Current Genomics, 2009 , Vol. 10, No. 8 571

[40] Bell, A.W.; Deutsch, E.W.; Au, C.E.; Kearney, R.E.; Beavis, R.; A.; Hayashizaki, Y.; Kigawa, T.; Shirouzu, M.; Yokoyama, S. Sechi, S.; Nilsson, T.; Bergeron, J.J. HUPO Test Sample Working Crystal structure of the RUN domain of the RAP2-interacting pro- Group. Nat. Methods , 2009 , 6, 423-430. tein x. J. Biol. Chem. , 2006 , 281 , 31843-31853. [41] Benesch, J.L.; Robinson, C.V. Mass spectrometry of macromolecu- [61] Buchs, M.; Kim, E.; Pouliquen, Y.; Sachs, M.; Geisse, S.; Mahnke, lar assemblies: preservation and dissociation. Curr. Opin. Struct. M.; Hunt, I. High-throughput insect cell protein expression applica- Biol. , 2006 , 16 , 245-251. tions. Methods Mol. Biol. , 2009 , 498 , 199-227. [42] Heck, A.J. Native mass spectrometry: a bridge between interactom- [62] Schlaeppi, J.M.; Henke, M.; Mahnke, M.; Hartmann, S.; Schmitz, ics and structural biology. Nat. Methods , 2008 , 5, 927-933. R.; Pouliquen, Y.; Kerins, B.; Weber, E.; Kolbinger, F.; Kocher, [43] Ruotolo, B.T.; Giles, K.; Campuzano, I.; Sandercock, A.M.; Bate- H.P. A semi-automated large-scale process for the production of man, R.H.; Robinson, C.V. Evidence for macromolecular protein recombinant tagged proteins in the Baculovirus expression system. rings in the absence of bulk water. Science , 2005 , 310 , 1658-1661. Protein Expr. Purif. , 2006 , 50 , 185-195. [44] Benesch, J.L.; Ruotolo, B.T.; Simmons, D.A.; Robinson, C.V. [63] Kärkkäinen, H.R.; Lesch, H.P.; Määttä, A.I.; Toivanen, P.I.; Protein complexes in the gas phase: technology for structural ge- Mähönen, A.J.; Roschier, M.M.; Airenne, K.J.; Laitinen, O.H.; nomics and proteomics. Chem. Rev. , 2007 , 107 , 3544-3567. Ylä-Herttuala, S. A 96-well format for a high-throughput baculovi- [45] Hernandez, H.; Dziembowski, A.; Traverner, T.; Seraphin, B.; rus generation, fast titering and recombinant protein production in Robinson, C.V. Subunit architecture of multimeric complexes iso- insect and mammalian cells. BMC Res. Notes , 2009 , 2, 63. lated directly from cells. EMBO Rep. , 2006 , 7, 605-610. [64] Bonanno, J.B.; Almo, S.C.; Bresnick, A.; Chance, M.R.; Fiser, A.; [46] Damoc, E.; Fraser, C.S.; Zhou, M.; Videler, H.; Mayeur, G.L.; Swaminathan, S.; Jiang, J.; Studier, F.W.; Shapiro, L.; Lima, C.D.; Hershey, J.W.; Doudna, J.A.; Robinson, C.V.; Leary, J.A. Struc- Gaasterland, T.M.; Sali, A.; Bain, K.; Feil, I.; Gao, X.; Lorimer, D.; tural characterization of the human eukaryotic initiation factor 3 Ramos, A.; Sauder, J.M.; Wasserman, S.R.; Emtage, S.; D'Amico, protein complex by mass spectrometry. Mol. Cell. Proteomics , K.L.; Burley, S.K. New York-Structural GenomiX Research Con- 2007 , 6, 1135-1146. sortium (NYSGXRC): a large scale center for the protein structure [47] Zhou, M.; Sandercock, A.M.; Fraser, C.S.; Ridlova, G.; Stephens, initiative. J. Struct. Funct. Genomics , 2005 , 6, 225-232. E.; Schenauer, M.R.; Yokoi-Fong, T.; Barsky, D.; Leary, J.A.; Her- [65] Gao, X.; Bain, K.; Bonanno, J.B.; Buchanan, M.; Henderson, D.; shey, J.W.; Doudna, J.A.; Robinson, C.V. Mass spectrometry re- Lorimer, D.; Marsh, C.; Reynes, J.A.; Sauder, J.M.; Schwinn, K.; veals modularity and a complete subunit interaction map of the eu- Thai, C.; Burley, S.K. High-throughput limited proteolysis/mass karyotic translation factor eIF3. Proc. Natl. Acad. Sci. U.S.A. , spectrometry for protein domain elucidation. J. Struct. Funct. 2008 , 105 , 18139-18144. Genomics , 2005 , 6, 129-134. [48] Fox, B.G.; Goulding, C.; Malkowski, M.G.; Stewart, L.; Deacon, [66] Editorial. Structural genomics in the spotlight. Nat. Methods , 2008 , A. Structural genomics: from genes to structures with valuable ma- 5, 115. terials and many questions in between. Nat. Methods , 2008 , 5, 129- [67] Romier, C.; Ben Jelloul, M.; Albeck, S.; Buchwald, G.; Busso, D.; 132. Celie, P.H.; Christodoulou, E.; De Marco, V.; van Gerwen, S.; [49] Kambach, C. Pipelines, robots, crystals and biology: what use high Knipscheer, P.; Lebbink, J.H.; Notenboom, V.; Poterszman, A.; throughput solving structures of challenging targets? Curr. Protein Rochel, N.; Cohen, S.X.; Unger, T.; Sussman, J.L.; Moras, D.; Pept. Sci. , 2007 , 8, 205-217. Sixma, T.K.; Perrakis, A. Co-expression of protein complexes in [50] Klock, H.E.; White, A.; Koesema, E.; Lesley, S.A. Methods and prokaryotic and eukaryotic hosts: experimental procedures, data- results for semi-automated cloning using integrated robotics. J. base tracking and case studies. Acta Crystallogr. D Biol. Crystal- Struct. Funct. Genomics , 2005 , 6, 89-94. logr. , 2006 , 62 , 1232-1242. [51] Blommel, P.G.; Martin, P.A.; Wrobel, R.L.; Steffen, E.; Fox, B.G. [68] Tan, S.; Kern, R.C.; Selleck, W. The pST44 polycistronic expres- High efficiency single step production of expression plasmids from sion system for producing protein complexes in Escherichia coli. cDNA clones using the Flexi Vector cloning system. Protein Expr. Protein Expr. Purif. , 2005 , 40 , 385-395. Purif. , 2006 , 47 , 562-570. [69] Tolia, N.H.; Joshua-Tor, L. Strategies for protein coexpression in [52] Stols, L.; Gu, M.; Dieckman, L.; Raffen, R.; Collart, F.R.; Don- Escherichia coli. Nat. Methods, 2006 , 3 , 55-64. nelly, M.I. A new vector for high-throughput, ligation-independent [70] Fitzgerald, D.J.; Berger, P.; Schaffitzel, C.; Yamada, K.; Rich- cloning encoding a tobacco etch virus protease cleavage site. Pro- mond, T.J.; Berger, I. Protein complex expression by using tein Expr. Purif. 2002 , 25 , 8-15. multigene baculoviral vectors. Nat. Methods , 2006 , 3 , 1021-1032. [53] Klock, H.E.; Koesema, E.J.; Knuth, M.W.; Lesley, S.A. Combining [71] Chanda, P.K.; Edris, W.A.; Kennedy, J.D. A set of ligation- the polymerase incomplete primer extension method for cloning independent expression vectors for co-expression of proteins in Es- and mutagenesis with microscreening to accelerate structural ge- cherichia coli. Protein Expr. Purif. , 2006 , 47 , 217-224. nomics efforts. Proteins , 2008 , 71 , 982-994. [72] Scheich, C.; Kümmel, D.; Soumailakakis, D.; Heinemann, U.; [54] Thao, S.; Zhao, Q.; Kimball, T.; Steffen, E.; Blommel, P.G.; Riters, Büssow, K. Vectors for co-expression of an unrestricted number of M.; Newman, C.S.; Fox, B.G.; Wrobel, R.L. Results from high- proteins. Nucleic Acids Res. , 2007 , 35 , e43. throughput DNA cloning of Arabidopsis thaliana target genes using [73] Berger, I.; Fitzgerald, D.J.; Richmond, T.J. Baculovirus expression site-specific recombination. J. Struct. Funct. Genomics , 2004 , 5, system for heterologous multiprotein complexes. Nat. Biotechnol ., 267-276. 2004 , 22 , 1583-1587. [55] Acton, T.B.; Gunsalus, K.C.; Xiao, R.; Ma, L.C.; Aramini, J.; Ba- [74] Bieniossek, C.; Richmond, T.J.; Berger, I. MultiBac: multigene ran, M.C.; Chiang, Y.W.; Climent, T.; Cooper, B.; Denissova, baculovirus-based eukaryotic protein complex production. Curr. N.G.; Douglas, S.M.; Everett, J.K.; Ho, C.K.; Macapagal, D.; Ra- Prot. Protein Sci. , 2008 , ch. 5, Unit 5.20. pp. 2001-2025, Wiley, jan, P.K.; Shastry, R.; Shih, L.Y.; Swapna, G.V.; Wilson, M.; Wu, New York. M.; Gerstein, M.; Inouye, M.; Hunt, J.F.; Montelione, G.T. Robotic [75] Cui, S.; Eisenächer, K.; Kirchhofer, A.; Brzózka, K.; Lammens, A.; cloning and Protein Production Platform of the Northeast Structural Lammens, K.; Fujita, T.; Conzelmann, K.K.; Krug, A.; Hopfner, Genomics Consortium. Methods Enzymol. , 2005 , 394 , 210-243. K.P. The C-terminal regulatory domain is the RNA 5'-triphosphate [56] Studier, F.W. Protein production by auto-induction in high density sensor of RIG-I. Mol. Cell , 2008 , 29 , 169-179. shaking cultures. Protein Expr. Purif ., 2005 , 41 , 207-234. [76] Murzina, N.V.; Pei, X.Y.; Zhang, W.; Sparkes, M.; Vicente-Garcia, [57] Manjasetty, B.A.; Turnbull, A.P.; Panjikar, S.; Büssow, K.; J.; Pratap, J.V.; McLaughlin, S.H.; Ben-Shahar, T.R.; Verreault, A.; Chance, M.R. Automated technologies and novel techniques to ac- Luisi, B.F. and Laue, E.D. Structural basis for the recognition of celerate protein crystallography for structural genomics. Pro- histone H4 by the histone-chaperone RbAp46. Structure , 2008 , 16 , teomics , 2008 , 8, 612-625. 1077-1085. [58] Shimizu, Y.; Inoue, A.; Tomari, Y.; Suzuki, T.; Yokogawa, T.; [77] Senger, T.; Schädlich, L.; Gissmann, L.; Müller, M. Enhanced Nishikawa, K.; Ueda, T. Cell-free translation reconstituted with pu- papillomavirus-like particle production in insect cells. Virology , rified components. Nat. Biotechnol. , 2001 , 19 , 751-755. 2009 , 388 , 344-353. [59] Liguori, L.; Marques, B.; Villegas-Méndez, A.; Rothe, R.; Lenor- [78] Shapiro, A.; Matheny, M.; Zhang, Y.; Tümer, N.; Cheng, K.Y.; mand, J.L. Production of membrane proteins using cell-free expres- Rogrigues, E.; Zolotukhin, S.; Scarpace, P.J. Synergy between sion systems. Expert Rev. Proteomics , 2007 , 4, 79-90. leptin therapy and a seemingly negligible amount of voluntary [60] Kukimoto-Niino, M.; Takagi, T.; Akasaka, R.; Murayama, K.; wheel running prevents progression of dietary obesity in leptin- Uchikubo-Kamo, T.; Terada, T.; Inoue, M.; Watanabe, S.; Tanaka, resistant rats. Diabetes , 2008 , 57 , 614-622.

63 572 Current Genomics, 2009 , Vol. 10, No. 8 Nie et al.

[79] Kool, M.; Voncken, J.W.; van Lier, F.L.; Tramper, J.; Vlak, J.M. [87] Cheng, Y.; Walz, T. The Advent of Near-Atomic Resolution in Detection and analysis of Autographa californica nuclear polyhe- Single-Particle Electron Microscopy. Annu. Rev. Biochem. , 2009 , drosis virus mutants with defective interfering properties. Virology , 78 , 723-742. 1991 , 183 , 739-746. [88] Korostelev, A.; Noller, H.F. The ribosome in focus: new structures [80] De Gooijer, C.D.; Koken, R.H.; Van Lier, F.L.; Kool, M.; Vlak, bring new insights. Trends Biochem. Sci. , 2007 , 32 , 434-441. J.M.; Tramper, J. A structured dynamic model for the baculovirus [89] Kornberg, R.D. The molecular basis of eukaryotic transcription. infection process in insect-cell reactor configurations. Biotechnol. Proc. Natl. Acad. Sci. U.S.A. , 2007 , 104 , 12955-12961. Bioeng. , 1992 , 40 , 537-548. [90] Ramakrishnan, V. What we have learned from ribosome structures. [81] Simón, O.; Williams, T.; Caballero, P.; López-Ferber, M. Dynam- Biochem. Soc. Trans. , 2008 , 36 , 567-574. ics of deletion genotypes in an experimental insect virus popula- [91] Cramer, P.; Armache, K.J.; Baumli, S.; Benkert, S.; Brueckner, F.; tion. Proc. Biol. Sci. , 2006 , 273 , 783-790. Buchen, C.; Damsma, G.E.; Dengl, S.; Geiger, S.R.; Jasiak, A.J.; [82] Bieniossek, C.; Nie, Y.; Frey, D.; Olieric, N.; Schaffitzel, C.; Jawhari, A.; Jennebach, S.; Kamenski, T.; Kettenberger, H.; Kuhn, Collinson, I.; Romier, C.; Berger, P.; Richmond, T.J.; Steinmetz, C.D.; Lehmann, E.; Leike, K.; Sydow, J.F.; Vannini, A. Structure M.O.; Berger, I. Automated unrestricted multigene recombineering of eukaryotic RNA polymerases. Annu. Rev. Biophys. , 2008 , 37 , for multiprotein complex production. Nat. Methods , 2009 , 6, 447- 337-352. 450. [92] Pomeranz Krummel, D.A.; Oubridge, C.; Leung, A.K.; Li, J.; Na- [83] Nie, Y.; Bieniossek, C.; Frey, D.; Olieric, N.; Schaffitzel, C.; Ste- gai, K. Crystal structure of human spliceosomal U1 snRNP at 5.5 A inmetz, M.O.; Berger, I. ACEMBLing multigene expression vec- resolution. Nature , 2009 , 458 , 475-480. tors by recombineering. Nat. Protocols , 2009 , DOI: 10.1038/ [93] Yan, W.; Hwang, D.; Aebersold, R. Quantitative proteomic analy- nprot.2009.104. sis to profile dynamic changes in the spatial distribution of cellular [84] Li, M.Z.; Elledge, S.J. Harnessing homologous recombination in proteins. Methods Mol. Biol. , 2008 , 432 , 389-401. vitro to generate recombinant DNA via SLIC. Nat. Methods , 2007 , [94] Pflieger, D.; Jünger, M.A.; Müller, M.; Rinner, O.; Lee, H.; Gehrig, 4, 251-256. P.M.; Gstaiger, M.; Aebersold, R. Quantitative proteomic analysis [85] Chiu, W.; Baker, M.L.; Almo, S.C. Structural biology of cellular of protein complexes: concurrent identification of interactors and machines. Trends Cell Biol. , 2006 , 16 , 144-150. their state of phosphorylation. Mol. Cell. Proteomics , 2008 , 7, 326- [86] Zhou, Z.H. Towards atomic resolution structural determination by 346. single-particle cryo-electron microscopy. Curr. Opin. Struct. Biol. [95] Roy, P. Baculovirus solves a complex problem. Nat. Biotechnol. , 2008 , 18 , 218-228. 2004 , 22 , 1527-1528.

64 2. Materials and methods

2.1. Materials

2.1.1. Chemicals

Agarose (electrophoresis grade) Sigma-Aldrich Ammonium peroxodisulfate (APS) Sigma-Aldrich Ampicillin, Sodium salt Sigma-Aldrich Bromophenol Blue, sodium salt Sigma-Aldrich Chloramphenicol Sigma-Aldrich Complete EDTA-free Protease inhibitor Roche Coomassie Brilliant Blue R250 Sigma-Aldrich Deoxynucleoside-5’-triphosphates (dNTPs) New England Biolabs Dimethylsulfoxide (DMSO) Sigma-Aldrich Dithiothreitol (DTT) Sigma-Aldrich DNA molecular weight marker New England Biolabs Ethanol VWR International Ethylenediaminetetracetate, disodium salt Sigma-Aldrich (EDTA) Glycerol VWR International 2-[4-(2-Hydroxyethyl)-1-piperazino]- VWR International ethansulfonic acid (HEPES) Isopropyl-β-D-thiogalactopyranoside Sigma-Aldrich (IPTG) Kanamycin sulfate Sigma-Aldrich 2-Propanol VWR International Precision Plus Prestained protein molecular Biorad weight marker Sodium chloride Sigma-Aldrich Sodium dodecyl sulfate (SDS) Sigma-Aldrich Sodium hydroxide Sigma-Aldrich Tris-(hydroxymethyl)aminomethane (Tris) Sigma-Aldrich Triton-X-100 VWR International Xylene cyanolFF VWR International Phenylmethylsulfonyl fluoride (PMSF) VWR International NP-40 (IGEPAL® CA-630) Sigma-Aldrich Methanol VWR International 2-Mercaptoethanol VWR International LB medium powder Sigma-Aldrich Imidazole VWR International N.N.N.’,N’-Tetreamethylethylenediamide Sigma-Aldrich (TEMED)

65 Ethidium bromide solution (10mg/mL) Sigma-Aldrich Bluogal™ Invitrogen Gentamicin Sigma Aldrich Tetracycline Sigma Aldrich

2.1.2. Enzymes and Antibodies

Phusion Hot Start II High Fidelity DNA Thermo Scientific Polymerase T4 DNA Ligase (400 U/µL) New England Biolabs Monoclonal Anti-polyHistidine Sigma-Aldrich Peroxidase Conjugate Restriction Enzymes New England Biolabs Chymotrypsin Sigma Aldrich Trypsin Sigma Aldrich Endoproteinase Glu-C Sigma Aldrich Subtilisin Sigma Aldrich TEV Protease (1 mg/mL) EMBL Core Facility TAF2 Polyclonal Antibody Laszlo Tora (IGBMC) Cre-recombinase EMBL Core Facility T4 DNA polymerase New England Biolabs

2.1.3. Plasmids

Name Description/Resistance Origin

Insect cell expression vectors: pDIFB_150x Polh/p10 dual promoter baculotransfer Martina vector encoding Human FLAG-tagged TAF2 16-1201 for Tn7 insertion into AcMNPV derived bacmid, Gent R, Amp R pFL Polh/p10 dual promoter baculotransfer MultiBac(Berger et ‘acceptor’ vector with LoxP site and al., 2004) Tn7L and R attachment sites for bacmid insertion, Gent R,Amp R pFBDM Polh/p10 dual promoter baculotransfer MultiBac(Berger et ‘acceptor’ vector with Tn7L and R al., 2004) attachment sites for bacmid insertion, Gent R,Amp R pIDC_TAF8_TAF10 Polh/p10 dual promoter baculotransfer Simon Trowiztsch ‘donor’ vector with LoxP and R6K γ origin encoding Human full-length TAF8 and TAF10 as polyprotein 1_1201 pFL_HT_TAF2 His 10, TEV cleavage site, OE PCR This study repaired TAF2 N-terminus. Inserted via SLIC pFL_HT_scTAF2 FL Inserted via RsrII/HindIII This study pFBDM_HT_dmTAF2 FL This study 66 pFL_HT_T2 FL _pIDC_TAF8_TAF10 This study

E. coli expression vectors: pMAL-C Ptac promoter for overexpression of NEB chiermic genes coding for an N- terminal Maltose binding protein upstream of protein of interest, Amp R 105_310 pMAL_TAF8 Inserted via SLIC, C-term His 6 This study 141_310 pMAL_TAF8 Inserted via SLIC, C-term His 6 This study 141_199 pMAL_TAF8 Inserted via SLIC, C-term His 6 This study 105_199 pMAL_TAF8 Inserted via SLIC, C-term His 6 This study 200_310 pMAL_TAF8 Inserted via SLIC, C-term His 6 This study 164_252 pMAL_TAF8 Inserted via SLIC, C-term His 6 This study 164_310 pMAL_TAF2 Inserted via SLIC, C-term His 6 This study

Mammalian expression vectors pMDS CMV promoter based mammalian MultiMam (Kriz et transient expression vector, Spec R al., 2010) pMDS_TAF2 FL _mCherry Inserted via SLIC, N-terminal His 10 tag This study and TEV cleavage site

2.1.4. Oligonucleotides

Name Sequence Description TAF2_OE1 5'GTGTGTTCGGTCCGATGCATCACCATCACCATCACCATCACCATCACG TAF2 repair TAF2_OE2 5'CCCTGAAAATACAGGTTTTCGGTCGTTGGGATATCGTAATCGTGATGGTGATGGTGATGG TAF2 repair TAF2_OE3 5'ACCGAAAACCTGTATTTTCAGGGCGTCGACCCGCTGACTGGTGTAGAGCCCGCCAGAA TAF2 repair TAF2_OE4 5'TTCAAAGCCCTTGTCTCCTTTCCTGTTCATTCTGGCGGGCTCTACAC TAF2 repair TAF2_OE5 5'GTGTGTTCGGTCCGATGC TAF2 repair TAF2_OE6 5'TTCAAAGCCCTTGTCTCCTTTC TAF2 repair MBP-T8 1 5’CATCACCATCACCATCACCATTGAGTACCCGGCCGGGGATCC pMAL Forward MBP-T8 2 5’GAAAATACAGGTTTTCGGTCGTGATATCCGAGCTCGAATTAGTCTGCGCGTC pMAL RevCom MBP-T8 3 5’CACGACCGAAAACCTGTATTTTCAGGGCACTCAGAGTCTTTTCAAAGATGACGTCAGCAC TAF8 200 For MBP-T8 4 5’CTCAATGGTGATGGTGATGGTGATGCTCGCCTGTCTTGGCCATGAAAC TAF8 199 rc MBP-T8 5 5’CACGACCGAAAACCTGTATTTTCAGGGCGACACTCTCCCTGCTTATGCAAAACGG TAF8 105 For MBP-T8 6 5’CACGACCGAAAACCTGTATTTTCAGGGCCCCCACCCGCCGCACATC TAF8 141 For MBP-T8 7 5’CTCAATGGTGATGGTGATGGTGATGGGAGAGGGACTTCTTCCTGCGG TAF8 310 rc

Table 1: Table of primers used in this study

2.1.5. Cell Lines

Sf 21 Invitrogen

(Spodoptera frugiperda cells)

HeLa Kindly donated by the Schaffitzel group(EMBL)

(Human cervical cancer cells)

HEK-293 Kindly donated by the Schaffitzel group

(Human embryonic kidney cells)

67 2.1.6. Bacterial strains

The E. coli strains DH5 α, DH10EMBacY and PirHC were used for propagation of plasmid DNA and BL21(DE3)-RIL strains were used for the heterologous production of labeled protein for NMR studies.

DH10EMBacY Geneva Biotech (Berger et al., 2004)

PirHC Geneva Biotech (Berger et al., 2004)

BL21-CodonPlus®(DE3)-RIL (Stratagene)

DH5 α [F- endA1 hsdR17 (rk- mk-) supE44 thi-1 recA1 gyrA (NalR) (Hanahan, 1983) relA (lacZYA-argF)U169 lac

2.1.7. Crystallization Screens

Crystallization kits:

The Classics I and II Qiagen Wizard I and II Rigaku Crystal Screen Lite Hampton Research PEG/Ion Hampton Research The PEGs I Qiagen Index Screen Hampton Research

2.1.8. Software and programs

ApE Plasmid editing software http://biologylabs.utah.edu/jorgensen/wayned/ape/

Pymol 3D visualization software http://pymol.sourceforge.net/

Adobe Illustrator Graphics software http://www.adobe.com/products/illustrator.html

CCPNMR NMR data analysis (Vranken et al., 2005)

XmGrace Graphing tool http://plasma-gate.weizmann.ac.il/Grace/

68 2.2. Methods

2.3. Nucleic acid biochemistry

2.3.1. Concentration determination of nucleic acids

The concentration of nucleic acids was determined by measuring the extinction in aqueous solution at a wavelength of 260 nm in comparison to a reference. The following equations were used to determine DNA concentrations (Sambrook and

Green, 2001):

1 OD260 = 50 µg/ml double stranded DNA (in nucleotides)

2.3.2. Preparation of plasmid DNA

Plasmid DNA was purified from bacterial cells using the QIAprep spin miniprep kit

(Qiagen), according to the maufacturers’ instructions. Plasmid DNA was typically eluted with 50 µL ddH2O and stored at -20 ˚C.

2.3.3. Agarose gel electrophoresis

Analysis and purification of DNA molecules after restriction digestions, PCR reactions and DNA ligation reactions was done using agarose gel electrophoresis.

Gels were prepared containing 0.8% - 2% agarose (depending of the size of the product) 1 x TBE buffer and 0.4 µg/mL ethidium bromide, then solidified for visualization of bands under UV illumination. Samples were mixed with DNA loading dye (6x DNA loading dye: 30 % [v/v] glycerol, 0.125% [w/v] bromopheol blue, 0.125% [w/v] xylene cyanol FF or 30 % [v/v] glycerol and 2 mg/mL Orange G) and separated using horizontal electrophoresis.

69 2.3.4. Restriction digestion of DNA

Restriction digestion was used to cleave DNA in order to produce the desired ends in

PCR products and vectors for use in downstream ligation reactions or analytical cleavages of plasmids. Restriction digestion reactions were performed according to the manufacturer’s directions. In general, 1 U of restriction enzyme was used to cleave 1 µg of DNA in 1 h at the recommended temperature.

2.3.5. Polymerase chain reaction (PCR)

PCR was used for gene and gene fragment amplification. Phusion Hot-Start High-

Fidelity DNA Polymerase (Thermo Scientific) was used according to the manufacturers instruction. Briefly, a standard PCR reaction contained 5 µL 10 x HF

Buffer, 2 µL dNTP mix (10 mM of each nucleotide-5’-triphosphate), 5-50 ng template DNA, 100 pmol sense and antisense primer and 1-2 µL Phusion Hot-Start

High-Fidelity DNA Polymerase. The temperature protocol was as follows: an initial denaturation step (Step1: 98 °C, 2 min), a second denaturation step (Step 2: 98 °C, 30 sec), an annealing step (Step 3: 2-5°C below the melting temperature of the complementary primer sequences, 30 sec), an elongation step (Step 4: 72 °C, 1-2 min per kilobase product) and a final elongation step (Step 5: 72 °C, 5 min). Cycling between steps 2-4 were repeated 25 times.

2.3.6. DNA ligation

DNA fragments with 5’-phosphates and 3’-hydroxyl groups were ligated using T4

DNA ligase. Cloning procedures consisted of cleaving a vector and PCR product

(insert) with the same pair of enzymes. 100 ng of cleaved vector was mixed with a 3-

70 fold molar excess of insert in a 20 µL reaction volume containing 2 µL of 10x T4

DNA ligase buffer and 1 µL T4 DNA ligase. The reaction was incubated for 4 hours at 16 °C before transformation of chemically competent cells.

2.3.7. DNA extraction from agarose gels

DNA bands were excised from ethidium bromide stained gels illuminated with 365 nm UV light and purified using the Qiagen Gel Extraction kit according to manufacturer’s instructions.

2.3.8. DNA Sequencing

DNA sequences of plasmids generated for this study were analyzed for correct sequences by DNA sequencing. For a sequencing reaction, 2 µg of DNA, purified as described in Section 2.3.8 Preparation of Plasmid DNA, was sent to Macrogen,

Europe (Amsterdam, Netherlands) with the appropriate sequencing primer.

Sequencing reaction data were analyzed using the plasmid editing software Ape

(http://biologylabs.utah.edu/jorgensen/wayned/ape/).

2.4. General cloning strategies

2.4.1. Cloning TAF2 constructs

For the cloning of plasmids expressing full-length protein and truncated constructs used in this work, the full-length Human TAF2 gene sequence was obtained from Dr.

I. Berger in the baculovirus expression vector pDIFB150x, described in the thesis of

Martina Mijuskovic (Mijuskovic, 2007). Considering the Uniprot full-length TAF2 sequence Q6P1X5 as the reference for residue numbering, the construction of the

TAF2 gene cloned into pDIFB150x starts at residue Asp12 directly downstream of an

71 N-terminal FLAG tag. Using the ‘overlap extension PCR’ method (see section 2.4.4) wild type residues upstream of Asp12 derived from the human TAF2 sequence

Q6P1X5 were introduced to the truncated TAF2 construct and the FLAG tag swapped for a (His 10 ) tag followed by a TEV cleavage site. Finally, the (His 10 -TEV cleavage site-TAF2[2-1199]) construct was cloned into the pFL baculovirus expression vector for downstream bacmid preparation.

A ______FLAG______pDIFB TAF2 [12-1199]___ M D Y K D D D D K N R K K G D K G F E S P R

...... 20 . . . .

B ______TAF2 Uniprot QXP1Z5 [1-1199]______M P L T G V E P A R M N R K K G D K G F E S P R

...... 10 ...... 20 . . . .

C M-10xHis-TEV ______pFL_HT_TAF2 [2-1199]______P L T G V E P A R M N R K K G D K G F E S P R

...... 10 ...... 20 . . . .

Figure 2-1: Human TAF2 N-terminus. The residues belonging to TAF2 are colored green. Residues belonging to tags and/or cleavage sites are colored blue and the Methionine at the start position in black. TAF2 sequence is shown only up to residue Arg24. (A) Original TAF2 construct from pDIFB vector with N-terminal FLAG tag (B) Uniprot Full Length TAF2 sequence (C) Full length TAF2 sequence with N-terminal TEV cleavable His-tag used in this study.

72 The gene encoding S. cerevisiae TAF2 full-length ( sc TAF2FL) ( Saccharomyces

Genome Database name YCR042C, Uniprot P23255) was purchased from Open

Biosystems and cloned into the pFBDM baculovirus expression vector using the sequence and ligation independent cloning (SLIC) method(Li and Elledge, 2007). The pFBDM vector used in this case was modified using the same OE PCR method as for the pFL vector in order to include an N-terminal TEV cleavable His 10 tag as for the hs TAF2FL construct. D. melanogaster TAF2 ( dm TAF2FL) was kindly donated by

Dr. Christophe Romier and cloned into the pFBDM vector in the same manner as sc TAF2FL (Uniprot Q24325). For the production of the TAF2 core, and the TAF2

‘Two-Domain’ pseudo-complex, the TAF2 core gene fragment, TAF2[2-1027], and the gene fragment encoding the Aminopeptidase Containing Domain (ACD),

TAF2[2-600], were cloned into the pFL-TAF2 vector via RsrII/HindIII restriction sites for the production of a TEV-cleavable, N-terminal His 10 tag. The 20kD domain,

TAF2[820-984] was cloned into the pUCDM vector using the SLIC method.

For the localization of TAF2 expression in HeLa cells by fluorescent microscopy, the

TAF2 gene was synthesized by GenScript USA Inc. and the mCherry gene containing an N-terminal TEV protease cleavage site was amplified from pIDS-

T[1113310]mCherry donated by Yan Nie (Berger Group member). The TAF2 gene was fused to the mCherry gene via an RsrII restriction site on the 3’ end of TAF2 and on the 5’ end of the TEV protease cleavage site-mCherry gene. The entire TAF2- mCherry gene fusion product was cloned into pMDS via the restriction sites

PmeI/XhoI allowing the production of a TAF2-mCherry fusion protein.

73 2.4.2. Cloning TAF8 constructs

For binding experiments using TAF2 mixed with truncation versions of TAF8, the following maltose binding protein (MBP)-TAF8 fusions were cloned in the pMAL vector using the SLIC method described by Li et al (Li and Elledge, 2007) and using the primers: MBP-TAF8 105-310 , MBP-TAF8 141-310 , MBP-TAF8 105-199 , MBP-TAF8 141-

199 , MBP-TAF8 200-310 . All MBP-TAF8 constructs included a TEV cleavage site in between MBP and TAF8 and a C-terminal His 6x -tag for production of TEV cleavable

MBP fusion proteins. The gene fragment TAF8 164-310 was ordered from GenScript

USA Inc. and cloned into the pMAL-TAF8 105-310 vector via SalI/HindIII restriction sites.

2.4.3. Cloning TAF2-8-10

For the purification of the TAF2/8/10 complex, the pFL-hs TAF2 vector was fused by

Cre-lox recombination to a pIDC vector encoding the polyprotein [TEV protease-

TAF8-TAF10-CFP] kindly donated by Dr. Simon Trowitzsch (Figure 2-2).

74

Figure 2-2: MultiBac vectors for TAF2-8-10 complex expression in insect cells. Acceptor and donor vectors are depicted encoding TAF2 and a polyprotein made up of TEV protease, TAF8, TAF10 and CFP, respectively. LoxP sites used to fuse Acceptor and Donor vectors are indicated by red circles. Blue boxes indicate promoter (M) and terminator sequences. Genes mediating resistances against Ampicillin (Amp R) Kanamycin (Kan R) and Chloramphenicol (Chl R) as well as the origins of replication (ori ColE1 and ori R6Kgamma ) are shown. TEV cleavage sites (tcs) allow cleavage by the TEV protease. Tn7L and Tn7R sites are depicted on the Acceptor vector and are used to integrate expression cassettes into the MultiBac baculoviral genome via a mini Tn7 attachment site embedded in the lacZ gene.

To produce the TAF2-8-10 complex, an Acceptor vector encoding TAF2 and a Donor vector encoding the TAF8-10 polyprotein were fused by Cre-Lox recombination.

(Fitzgerald et al., 2006). (1 µg) Acceptor plasmid (1 µg) DNA and Donor plasmid

(1.2 µg) DNA was mixed with 10 U of Cre recombinase enzyme and incubated for 30

75 minutes at 37 °C. The reaction was stopped by heat denaturation of Cre recombinase by incubating at 65 °C for 10 minutes.

2.4.4. Overlap extension PCR

In order to repair the 5’ region of the Human TAF2 gene and create a TEV protease cleavable N-terminal His 10 -tag, the Overlap-Extension PCR (OE-PCR) technique was used. A first round of PCR was performed to assemble overlapping oligos (2.1.4:

Oligonucleotides TAF2_OE 1-4) creating a 152 bp DNA fragment containing a 5’

RsrII restriction site, the newly created sequence, and 20 bp at the 3’ end complementary to the 5’ end of the N-terminally truncated TAF2 construct. 4 µL of each oligo (2.1.4: Oligonucleotides TAF2_OE 1-4) stock concentration of 0.125

µg/ µL), 2 µL of 10 mM dNTPs, 20 µL of 5x HF Buffer, 3000U Pfu DNA polymerase and 60.5 µL ddH 2O were mixed to produce a final reaction volume of 100 µL. The temperature cycling was performed as follows: An initial denaturation step (step 1:

94 °C, 7 min), second denaturation step (step 2: 94 °C, 1.5 min), an annealing step

(Step 3: 54 °C, 2 min) and an elongation step (step 4: 72 °C, 3 min). steps 2-4 were cycled 8 times. The assembled DNA fragment was amplified using sense and antisense primers (2.1.4: TAF2_OE 5-6) flanking 5’ and 3’ ends. 4µL of each sense and antisense primer (2.1.4: TAF2_OE 5-6) stock concentration of 0.25 µg/ µL), 2 µL of 10 µM dNTPs, 20 µL of 5 x HF Buffer, 3000U of recombinantly prepared Pfu

DNA polymerase and 60. µL of ddH2O were mixed to a final reaction volume of 100

µL. The temperature cycling was performed as follows: An initial denaturation step

(step 1: 98 °C, 5 min), a second denaturation step (step 2: 94 °C, 30 sec), an annealing step (step 3: 54 °C, 2 min) and an elongation step (step 4: 72 °C, 1.5 min). Steps 2-4 were cycled 25 times (Protocol adapted from:

76 http://www.yorku.ca/pjohnson/AssemblyPCRProtocol.html). The amplified 152 bp fragment was gel purified using the Qiagen gel purification kit. The TAF2 gene lacking the wild-type N-terminus was amplified using the same protocol described in section 2.3.10. Melting temperature of the overlapping region shared between the assembled DNA molecule and the amplified non wild-type TAF2 PCR product was calculated using the following PROMEGA webtool

(https://www.promega.com/techserv/tools/biomath/calc11.htm). Another PCR reaction was set up using 1 µL of the crude mixture from the first PCR reaction mixed with 4 µL of each primer (2.1.4: TAF2_OE 5-6), 4 µL of 5 mM dNTPs, 10 µL of 10x

HF buffer (ThermoScientific) 1.5 µL of Phusion Hot Start II High Fidelity DNA

Polymerase, and 75.5 µL of double distilled water. This mixture was then subjected to

25 cycles of amplification. Each cycle consisted of a 30 second 98 °C step, a 2 min 54

°C step, and a 1.5 min 72 °C step. Prior to the first cycle, a 5 min 98 °C step was used. A 5 min 72 °C elongation step was included following the final cycle. The final reamplified product was cloned into the MultiBac pFL plasmid, and verified by DNA sequencing.

2.4.5. Sequence and ligation independent cloning

For the insertion of TAF8 fragments into the pMAL vector and production of TEV cleavable MBP-TAF8 fusion proteins, the Sequence and Ligation Independent

Cloning (SLIC) method was used (Li and Elledge, 2007). 2 µg of vector was digested with SacI/EcoRI . The digested vector was gel purified and isolated with Qiagen’s Gel

Extraction Kit according to the manufacturer’s instruction. The inserts were amplified with Phusion Hot-Start High-Fidelity DNA Polymerase using the full-length human

TAF8 gene as a template. A 100 µL PCR reaction was set up with 250 µM of each

77 dNTP, 0.5 µM of each primer and 2.5 units of Phusion Hot-Start High-Fidelity DNA

Polymerase. Temperature cycling was performed as follows: 98 °C for 45 seconds: 30 cycles of 98 °C for 45 seconds, 54 °C for 45 seconds and 72 °C for 1 minute; and a final step of 72 °C 45 seconds. 1 µg of each digested vector and insert were treated separately with 0.5 units of T4 DNA polymerase in NEB buffer 2 in a 20 µL reaction for 30 minutes. The reaction was stopped by adding 1/10 th of the volume of 10 mM dCTP and incubated on ice for 10 minutes. A 10 µL annealing reaction in a 1:1 insert to vector ratio, using 0.074 pmoles of treated vector DNA, 1 x NEB T4 ligation buffer, appropriate amount of insert and water was incubated at 37 °C for 30 minutes.

The annealing reaction was stopped by incubation on ice. 5 µL of the annealing reaction mixture was used to transform 150 µL of chemically competent OmniMax cells by incubation on ice for 10 minutes, heat shock incubation at 42 °C for 45 seconds, further incubation on ice for 2 minutes, addition of 500 µL of Luria Broth

(LB) and recovery at 37 °C shaking at 180 rpm on a orbital shaking platform for 1 hour. Cultures were spun down and pellets resuspended in 100 µL of LB and the entire mixture was plated onto LB-Agar plates supplemented with appropriate antibiotics.

2.5. Protein production techniques

2.5.1. Baculovirus production and protein expression in insect cells

For the production of recombinant proteins in insect cells used in this study, bacmid

DNA molecules encoding genes for study were produced using the MultiBac system as described previously (Alexander and Berger, 2011; Bieniossek et al., 2008;

Fitzgerald et al., 2006; Trowitzsch et al., 2010). Briefly, the E. coli strain EMBacY

78 contains modified Autographa californica multinucleopolyhedrosis virus (AcMNPV) as a bacterial artificial chromosome, so-called bacmid DNA. Modified bacmid DNA encodes a mini-F replicon and a Kanamycin resistance marker for propagation and selection. The EMBacY bacmid also contains a mini Tn7 attachment site, which is used to integrate transfer vectors via Tn7 transposition. Since the mini Tn7 attachment site on the bacmid is embedded in the LacZ gene, successful integration of DNA fragments can be monitored by blue white screening on LB agar plates containing kanamycin, gentamycin, tetracyclin, IPTG and BluoGal™. Bacmid DNA is isolated from white colonies using the Qiagen Miniprep ™ kit. However, after the first centrifugation step, the supernatant is transferred to a new tube and the centrifuged again. 0.7 volumes [v/v] of isopropanol are added to the cleared cell lysate and the bacmid DNA is pelleted for 10 min in table top centrifuge at full speed. The DNA pellet is washed twice with 70 % [v/v] ethanol and bacmid DNA is finally resuspended in 20 µL sterilized ddH 20. Typically, the purified bacmid was used to transfect 1 x 10 6 adherent Sf21 cells in a 6-well plate with transfection reagent

(according to the manufacturer's instruction) and the first generation of virus (V0) harvested after incubation at 28 °C for 60 hours. V0 virus was typically amplified in a

250 mL capped flask by adding 3 mL of budded virus harvested from the 6-well plate used to carry out the transfection procedure to 25 mL of Sf21 culture at a density of around 1 x 10 6 cells/mL. 48 hours post cellular proliferation arrest, amplified second generation virus (V1) supernatant was harvested by centrifugation (10 minutes, 1000 rpm, Eppendorf 58 R centrifuge, 4 °C) after incubation at 28 °C, shaking on an orbital shaker at 80 rpm. V1 virus was tested for optimal expression of the reporter YFP expression cassette under control of the polyhedrin promoter by infection of Sf21 cultures with a three different concentration of the V1 virus. V1 virus tests were

79 performed using 50 mL of Sf21 culture infected at a culture density of 50 mL at 1 x

10 6 cells/mL and V1 virus concentrations of 0.1-2% (v/v). YFP fluorescence

(Emission wavelength=530 nm) was measured every twenty four hours post infection and compared with a standard of known fluorescence (YFP Standard) as well as an uninfected negative control (uninfected Sf21). When maximum YFP fluorescence was reached, indicated by a plateau in fluorescence counts for 72 hours, cells would be harvested from the infected cultures by centrifugation (20 minutes, 1000 rpm,

Beckman rotor JLA 8.1 at 4 °C) and the pelleted cell mass stored at -20 °C.

2.5.2. Protein expression in E. coli

E. coli BL21(DE3)RIL cells were transformed with plasmid DNA carrying expression cassettes of interest and positive clones were selected on LB-agar plated supplemented with the appropriate antibiotics. The same antibiotics were added in all subsequent cultivation steps. As precultures, 50 mL of LB medium were inoculated with a single colony from a plated transformation and incubated overnight at 37 °C, shaking at 180 rpm. 1L of LB-medium supplemented with the appropriate antibiotics was inoculated with 10-20 mL of preculture. Gene expression was induced by the addition of 1 mM isopropyl beta-D-1-thiogalactopyranosid (IPTG) when cultures reached an optical density (OD 600 )of 0.6-1.2. Protein production was carried out for 4-

24 hours at 30 °C, 25 °C or 18 °C. Cells were harvested by centrifugation in a JLA

8.1 rotor for 30 minutes at 4000 rpm.

2.5.3. Purification of His-tagged protein from E. coli cytosolic fraction

All TAF8 constructs in this study were extracted from E. coli lysate using immobilized metal affinity chromatography (IMAC) via a C-terminal His 6 tag. The

80 target protein was captured from soluble extract using Ni 2+ -NTA resin and eluted by the addition of imidazole. IMAC eluted TAF8 protein identified by polyacrylamide gel electrophoresis was buffer exchanged into 50 mM Tris pH 7.5, 150 mM NaCl using a 10/300 S200 Superdex column (GE Healthcare). Pooled fractions were concentrated using Amicon Ultra-15 concentrators (Millipore) to 10 mg/mL and the protein was judged >95% pure by SDS-PAGE analysis.

2.5.4. Purification of TAF2 constructs and TAF2-8-10 complex from Sf21

lysate

TAF2 constructs and TAF2-8-10 complex encoded on a single baculotransfer vector as described in section 2.4.3 were Co 2+ purified via an N-terminal TEV cleavable

His 10 tag on TAF2 as well as a C-terminal His 6 tag on TAF8 using IMAC. Sf 21 lysates were prepared by diluting the pellet volume 10-fold with buffer composed of

500 mM NaCl, 50 mM HEPES pH 7.5 and 10 mM MgCl 2 to which 5 mM imidazole was added. Typically the resin was washed with 20 column volumes of buffer supplemented with 10 mM imidazole and finally eluted with buffer containing 250 mM imidazole. Eluted TAF2 constructs and the TAF2-8-10 complex were dialyzed against buffer containing 150 mM NaCl instead of 500 mM NaCl and further purified using an ion exchanger and a gradient of 100 mM to 1 M NaCl. HiTrap SP HP (GE

Healthcare) was used in the case of dm TAF2 and HiTrap Q HP (GE Healthcare) for the rest. The proteins and protein complexes were further purified by gel filtration using a Superdex 200 10/300 column. Pooled fractions were concentrated using

Amicon Ultra-15 concentrators (Millipore) to 1-5 mg/mL and the protein was judged

>95% pure by SDS-PAGE analysis.

81 2.5.5. Expression of [ 13 C, 15 N]-labeled MBP-TAF8 C-terminal constructs pMAL-TAF8-164-310 and pMAL-TAF8-200-310 encoding a TEV-cleavable N- terminally MBP-tagged TAF8 construct with a Gly-Gly-Gly-Ser linker fused to the terminal Gly of the TEV cleavage site and a C-terminal His 6 tag were transformed into BL21-CodonPlus®(DE3)-RIL E. coli cells (Stratagene). Transformed E. coli cells were plated onto LB-agar supplemented with ampicillin (100 µg/mL). Colonies were used to inoculate 50 mL of LB media and pre-cultures were incubated at 37ºC shaking at 180 rpm for 6 hours. Pre-cultures were used to inoculate 200 mL of [ 13 C,

15 15 N]- or [ N]-labeled M9 minimal media to an optical density at 600 nm (OD 600 ) of

0.1. This second preculture was incubated overnight at 37ºC shaking at 180 rpm.

13 15 15 C/ N-M9 media was prepared with 1 g/L of NH 4Cl and 3 g/L of 13C-labelled glucose per liter. 2 L of labeled M9 media was inoculated with sufficient pre-culture to give a starting OD 600 = 0.1. The cultures were incubated at 37ºC shaking at 180 rpm until the OD 600 reached 0.7 and then stored at 25ºC for 45 minutes. Protein expression was induced by adding IPTG to a final concentration of 1 mM. The induced cultures were incubated at 15 ºC for 18 hours. Finally, the cultures were harvested by centrifugation and the cell pellet stored at -80ºC until required.

2.6. Protein Biochemistry and biochemical methods

2.6.1. Denaturing polyacrylamide gel electrophoresis (SDS-PAGE)

SDS-polyacrylamide gels were used to analyze protein samples. Depending on the size of the protein being analyzed, 6-18 % of polyacrylamide was used. SDS-PAGE was performed as described elsewhere (Laemmli, 1970; Weber and Osborn, 1969).

Gels were poured and run vertically in SDS-PAGE running buffer (25 mM Tris-HCl

82 (pH 8.1-8.3), 192 mM glycine, 0.1% [w/v] SDS. Gels were poured with a stacking layer (125 mM Tris HCl (pH 6.8), 0.1% [w/v] SDS, 6% acrylamide, and polymerized with 0.3% [v/v] APS and 0.03% [v/v] TEMED) and a separating layer (375 mM Tris-

HCl (pH 8.8), 0.1% SDS, 6-12% acrylamide polymerized with 0.3% [v/v] APS and

0.03% [v/v] TEMED). Protein samples were denatured in protein gel loading buffer

(60 mM Tris-HCl (pH 6.8), 50 mM DTT, 1 mM EDTA, 15% glycerol, 2% [w/v] SDS and 0.1% [w/v] bromophenol blue). Protein samples were heated to 95 ºC for 5 minutes except in the case of TAF2 containing samples where samples were not heated at all. Gels were stained with Coomassie brilliant blue stain (Thermo

Scientific) and subsequently destained with 10 % acetic acid and 40 % ethanol (v/v) to visualize proteins.

2.6.2. Limited proteolysis

Limited proteolysis experiments were typically performed in 20 µL using 5-10 µg of target protein. Typically, protease concentrations were first screened is a series of 10- fold dilutions at a fixed time period of 45-60 minutes. Once the proper protease concentration was determined, time course experiments were conducted. Reactions were stopped by the addition of Complete Protease Inhibitor Cocktail Tablets (Roche) and protein gel loading buffer. Samples were heated to 95 ºC for 3 minutes prior to analysis via SDS-PAGE.

2.6.3. Electrospray ionization mass spectrometry

Electrospray ionization mass spectrometry of proteolyzed mixtures of TAF2 were performed by Luca Signor (PSB, Grenoble) using an LC ESI-TOF MS 6210 Analyzer

(Agilent Technologies)

83 2.6.4. Edman sequencing

Edman sequencing was performed by Jean-Pierre Andrieu (PSB, Grenoble) using

Procise® Sequencing System Model 492 (Applied Biosystems).

2.6.5. Binding experiments using a peptide array

Peptide array analysis was used to identify the peptide corresponding to the region of

TAF8 that contributes to the binding of TAF2. Overlapping 25-mers covering the C- terminal region of TAF8 were immobilized on cellulose membranes via double β- alanine anchors and assembles using the SPOT technology (AG Molekulare

Bibliotheken, Institut für Medizinische Immunology, Charité - Universitätsmedizin

Berlin, Germany). The library of 20-mer peptides spanning the C-terminal TAF8 residues [105-310] were synthesized by Fmoc (9-fluorenylmethoxycarbonyl) chemistry with an offset of 3 amino acids between spots. Low density hexa-Histidine peptides were used as controls at the beginning and the end of each array. Pepscan membranes were blocked in blocking buffer (50 mM Tris-HCl (pH 7.6), 500 mM

NaCl, 20 % [w/v] sucrose, 3 % [w/v] bovine serum albumin) for 1h at 4 ºC, washed with TBS (50 mM Tris-HCl (pH 7.6), 500 mM NaCl) and incubated for 1.5 h with

His-tagged TAF2 at a concentration of 10 µg/ml in blocking buffer or with blocking buffer alone. Membranes were subsequently incubated with mouse anti-His monoclonal primary antibody (SIGMA H1029, dilution 1:3000) and peroxidase- conjugated anti-mouse secondary antibody (SIGMA A5906, dilution 1:10000) in blocking buffer. Membranes were washed 3 times with TBS between each incubation step. After addition of luminol solution (Pierce) luminescence was detected on a

KODAK 4000MM photoimager. Images were analyzed using the Dot Blot Analyzer tool in ImageJ.

84 2.7. Cells and cell culture

2.7.1. Cultivation of E. coli cells

E. coli was cultivated in LB medium, M9 minimal medium or on LB agar plates.

Adequate aeration was maintained by growing cells in baffled Erlenmeyer flasks, shaking on a rotating platform at 180 rpm at 37 °C unless otherwise stated. Optical densities were measured in plastic cuvettes with a 1 cm path-length in a spectral photometer at a wavelength measuring 600 nm and using LB media alone as a reference.

2.7.2. Transformation of chemically competent bacteria

Chemically competent E. coli were prepared using the CaCl 2 method (Chung and

Miller, 1993). Competent cells were stored at -80°C in 50 µL aliquots. A typical transformation reaction was prepared by mixing 5 µL of a ligation reaction with 50

µL of competent E. coli in a 1.5 mL Eppendorf tube and incubated on ice for 20 minutes. The mixture would be heat-shocked for 45 seconds at 42°C and then cooled on ice for 2 minutes. 400 µL of LB medium were added to the mixture and incubated at 37°C and shaking for 30 minutes. Transformed cells were then selected on LB-agar plates supplemented with appropriate antibiotics. Antibiotic concentrations were 50

µg/mL (Streptomycin), 34 µg/mL (Chloramphenicol), 30 µg/mL (Kanamycin), 100

µg/mL (Ampicillin). For transformation of E. coli strains with MultiBac baculotransfer vectors; antibiotics, Bluo-Gal and IPTG concentrations were prepare as described in the MultiBac method (Fitzgerald et al., 2006).

85 2.7.3. Transformation of electrocompetent E. coli

Electroporation of electrocompetent DH10EMBacY was essentially done as described elsewhere (Alexander and Berger, 2011) . Briefly, 5 µL of a mixture containing DNA for transformation was added to 100 µL of electrocompetent cells and added to an electrocuvette. A Biorad electroporator set to 1.8 kV was used to pulse the E. coli mixture. Electroporated cells were recovered in 400 µL of autoclaved LB, shaking

180 rpm at 37ºC for 1 hour and plated onto LB-agar supplemented with appropriate antibiotics.

2.7.4. Cultivation of Sf21 cells

Sf 21 cells (Gibco) were cultivated in serum free media Sf-900™II (Gibco®) medium.

Cells were grown in suspension with cell densities between 0.5 and 2 x 106 cells/mL in glass capped Erlenmeyer flasks. Harvesting of Sf 21 from cultures was performed as described elsewhere (Fitzgerald et al., 2006).

2.7.5. Transfection of Sf21 cells

Adherent Sf 21 cells were transfected in a 6-well plates using the transfection reagent

FuGene® (Promega) according to methods described in the MultiBac Manual

(Alexander and Berger, 2011).

2.7.6. Cultivation of HeLa cells

For the cultivation of HeLa cells (kindly donated by the Schaffitzel group, EMBL

Grenoble), Dulbecco's Modified Eagle Medium (DMEM) supplemented with 10% fetal calf serum and 2 mM L-glutamine was used. Adherent cultures were grown at 37

86 ºC and cultivated and harvested according to methods described elsewhere. (Freshney,

2006; SCHERER and HOOGASIAN, 1954)

2.7.7. Transfection of HeLa cells

HeLa cells were transiently transfected with Lipofectamine® according to the manufacturers instructions. (Hawley-Nelson and Ciccarone, 2003)

87 2.8. Bioinformatic methods

2.8.1. Multiple sequence alignments

Multiple sequence alignments were performed using the webtool ClustalW

(http://www.ebi.ac.uk/Tools/msa/clustalw2/ ) or the webtool PRALINE

(http://www.ibi.vu.nl/programs/pralinewww/).

2.8.2. Secondary structure prediction

Secondary structure prediction was performed using the webtool PSI-PRED

(http://bioinf.cs.ucl.ac.uk/psipred/) .

2.8.3. Structural modeling

Structural modeling of TAF2 was performed using the webtool MODELLER

(http://salilab.org/modeller/)

2.9. Biophysical methods

2.9.1. Surface plasmon resonance

Surface plasmon resonance experiments were done using a BIACORE 3000 (Biacore

AB, Uppsala). TAF2 ligand was immobilized onto CM5 sensor chips (GE Healthcare) to a level of 2000 response units (RU) using amine-coupling chemistry. A control cell was prepared without ligand. Truncation mutants of TAF8 fused to N-terminal maltose binding protein (MBP) were serially diluted into running buffer (25 mM

HEPES-NaOH (pH 7.0), 300 mM NaCl, 0.01 % [v/v] NP-40). For association phase,

150 µl of analyte were injected over both flow cells at a flow rate of 25 µl/min and dissociation phases were monitored for 200 seconds by injecting running buffer only.

88 Binding responses were recorded and responses from referencing sensorgrams were subtracted using BIAevaluation software (GE Healthcare). Data were globally analyzed with the analysis software and fitted to a simple 1:1 binding model.

Triplicates of nine different concentrations of analyte were tested for three different

TAF8 constructs and steady-state kinetics were evaluated using the software

SigmaPlot.

2.9.2. NMR

All NMR spectra were recorded on Varian DirectDrive 600 MHz spectrometer equipped with a cryogenic triple-resonance probe. 2D ( 1H, 15N) HSQC or TROSY

HSQC spectra were recorded of different [U-15 N]-labeled constructs of TAF8. Spectra

1 were recorded with 80 (t 1) and 2560 (t 2) complex points, spectral widths of 16.7 ( H)

15 and 30 ( N) ppm and 8 scans per t 1 increment. Sample concentrations varied between

100 and 250 µM.

2.9.3. Native Mass Spectrometry

Native mass spectrometry experiments were carried out in the lab of Carol V.

Robinson (Oxford University).

2.10. Protein crystallization

All crystallization screens were carried out using the HTX robotic platform at the

EMBL in Grenoble (https://embl.fr/htxlab/).

89 3. Results

3.1. Domain architecture of TAF2

3.1.1. Introduction

In order to probe the domain architecture of TAF2, homologs from Human, Yeast and

Drosophila were cloned and expressed using a baculovirus expression system, and soluble protein subsequently purified from insect cells. A baculovirus expression system was used since full-length Human TAF2 (hsTAF2 FL ) is known to be insoluble when produced using an E. coli expression system. In this section, I will describe the cloning, expression and purification of three TAF2 homologs and C-terminal constructs of Human TAF2. In this section I will also describe the production and analysis of stable fragments of three TAF2 homologs derived from limited proteolysis that led to the cloning of a ‘Two-Domain’ construct. The 'Two-Domain' construct was engineered with a view to engineering a globular version of TAF2 lacking protease sensitive regions notorious for preventing the crystallization of macromolecules.

3.1.2. Expression and Purification of Full Length Human, Drosophila and

Yeast TAF2

Full-length TAF2 was cloned using the MultiBac system, expressed in Sf 21 cells and purified as described in sections 2.4, 2.5.1, 2.5.3 and 2.5.4. Briefly, the MultiBac baculotransfer vectors pFL or pFBDM encoding full-length TAF2 homologs from H. sapiens (hs), S. cerevisiae (sc) or D. melanogaster (dm) were incorporated into a bacmid housed in the MultiBac E. coli strain DH10EMBacY via Tn7 transposition elements and a transposase encoded on a helper plasmid (Berger et al., 2004).

90 Baculovirus was produced from transfected Sf 21 cells and amplified to infect large- scale Sf 21 cultures. Since the MultiBac system includes a YFP reporter gene encoded in the bacmid backbone, TAF2 expression could be monitored by measuring YFP fluorescence on a spectrophotometer in comparison with a standard fluorescence marker (Emission wavelength=530 nm). For example, measuring YFP fluorescence from a baculovirus infected culture expressing hs TAF2 FL every 24 hours, indicated the optimal time for harvesting cells for maximum protein expression (Figure 3-1).

Cells were typically harvested 72 hours after cell proliferation arrest by which time

YFP fluorescence reached a maximum. A similar measurement was carried out for all proteins produced in insect cells for this thesis.

91

Figure 3-1: YFP monitor of hsTAF2FL expression in Sf21 cells . YFP fluorescence of the soluble fraction of Sf21 cells infected with recombinant baculovirus expressing a YFP marker and full-length His 10 -tagged Human TAF2 was monitored every 24 hours post infection (YFP Excitation wavelength=524 nm, Emission wavelength=544 nm) The fluorescence of infected cells were compared with a standard of known fluorescence (YFP Standard) along side an uninfected negative control (Sf21 Control). Time intervals are labeled on the x-axis as days post infection (DPI) and days post cell proliferation arrest (DPA). Maximum YFP fluorescence was typically reached by three days post cellular proliferation arrest (DPA+72).

Immobilized metal affinity chromatography (IMAC) with Co 2+ as the immobilized metal was used to purify all TAF2 constructs via an N-terminal His 10 tag. Impurities from Co 2+ affinity purification were for the most part removed by washing the metal affinity resin with 10 mM imidazole, however, a small amount of large molecular weight contaminant around 250 kDa remained after elution as determined by analysis

92 of affinity purified material by SDS-PAGE (Figure 3-2). Migration of affinity purified hs TAF2 FL by SDS-PAGE showed that the protein migrated to a position consistent with its calculated molecular weight of 137 kDa. Metal affinity chromatography steps were carried out with buffer conditions of 25 mM HEPES pH

7.5, 500 mM NaCl, 10 mM MgCl2 and 5 mM Beta-mercaptoethanol. A high concentration of NaCl was used since purified hs TAF2 FL precipitated at [NaCl] below

200 mM (data not shown). hs TAF2 FL (theoretical pI=8.45) could be further purified using a cationic exchanger, taking care not to dialyze the protein into a buffer containing less than 200 mM NaCl. The absorption profile of hs TAF2 FL passed over a

5 mL SP HP column (GE Healthcare) using a NaCl gradient from 200 mM to 1 M shows hs TAF2 FL eluted as a single peak at approximately 350 mM NaCl. Size exclusion chromatography (SEC) using a Superdex 200 10/300 column (GE

Healthcare) of affinity and cation exchange purified hs TAF2 FL using buffer composed of 25 mM HEPES pH 7.5, 500 mM NaCl and 2 mM beta-mercaptoethanol. SEC as a final purification step of hs TAF2 FL resulted in protein that was at least 90-95% pure according to SDS-PAGE ( Figure 3-2). An essentially similar purification approach was used for all TAF2 homolog C-terminal constructs.

93

Figure 3-2: Human TAF2 purification . On the left hand side are absorption profiles (280 nm) in black, conductivity profiles in red and [Imidazole] profiles in green. On the right hand are corresponding SDS-PAGE gels. The numbers marking the wells of the SDS-PAGE gel correspond to the fractions that can be found along the x-axis of the corresponding absorption profiles. (A) IMAC purification (B) Ion exchange purification (C) SEC profile.

The retention volume of hs TAF2 FL measured using an S200 10/300 GL column (GE

Healthcare) was larger than expected for a globular monomer with the calculated mass of hs TAF2 FL . The larger than expected retention volume is consistent with a

94 globular particle with an apparent molecular mass of 327 kDa, as determined by column calibration with known protein standards (not shown). Considering the calculated molecular weight of hs TAF2 FL is 137 kDa, the retention volume observed by gel filtration (Blue peak in Figure 3-3) is consistent with either a dimeric stoichiometry or an elongated configuration. A comparison of hs TAF2 FL , dm TAF2 FL and sc TAF2 FL SEC profiles showed larger than expected retention volumes for all three homologs ( Figure 3-3).

Figure 3-3: S200 SEC profiles of three TAF2 homologs. The left panel shows an overlay of three chromatograms (absorption at 280 nm) resulting from the application of affinity and ion exchange purified TAF2 from three different homolgs on a Superdex 200 10/300 column (GE Healthcare). Human

95 (hsTAF2 FL ) in blue, S. cerevisiae TAF2 (scTAF2 FL ) in green and D. melanogaster (dmTAF2 FL ) in red. The right panel shows SEC fractions separated by SDS-PAGE.

96

3.1.3. Limited proteolysis

Larger than expected retention volumes of purified TAF2 homologs, as determined by

SEC studies in 3.1.2, are consistent with flexible regions notorious for the potential disruption of crystal contacts and may explain the difficulty of past efforts in crystallizing TAF2. Limited biochemical and structural information and lack of clear homology to TAF2 protein in the BLAST database repository leaves little in the way of clues regarding how best to engineer protein constructs that are amenable to crystallization. Analysis of stable protein fragments resulting from treatment with low concentration of proteases can reveal a protein’s 'structural' boundaries that may lead to a more 'crystallizable' construct. Purified sc TAF2 FL and dm TAF2 FL homologs were subjected to limited proteolysis using trypsin, chymotrypsin and/or endogenous proteolysis, described in section 2.6.2. hs TAF2 FL proteolysis studies were performed previously under the supervision of Dr. Imre Berger and Dr. Timothy Richmond and the results can be found in the Ph.D. thesis of Dr. Martina Mijuskovic (Mijuskovic,

2007). SDS-PAGE separation of proteolyzed mixtures showed protease resistant fragments of sc TAF2 FL and dm TAF2 FL appearing after treatment with chymotrypsin and trypsin (Figure 3-4).

97

Figure 3-4: Full-length TAF2 limited proteolysis time course. SDS-PAGE analysis of peptide fragments of trypsin and chymotrypisin-treated sc TAF2 FL (Left panel) or dm TAF2 FL (Right panel) appearing over time. Labeled fragments were analyzed by mass spectrometry and N-terminal sequencing (TR# for Trypsin derived fragments and CH# for Chymotrypsin derived fragments)

Stable peptide fragments that resulted from limited proteolysis of all three insect cell purified TAF2 homologs were N-terminally sequenced by Edman degradation

(performed by Dr. Jean Pierre Andrieu, IBS Grenoble). Each fragment mixture resulting from exposure to the respective protease for at least 40 minutes was separated by SEC and peak fractions were analyzed by electrospray ionization mass

(ESI) spectrometry (ESI performed by Luca Signor, IBS Grenoble). All N-terminally sequenced peptide fragments were manually matched with corresponding ESI masses detected from the proteolyzed mixtures derived are listed in two tables found in the

Appendix of this thesis ( Table 4 and Table 5). The fragments outlined in these two tables are represented in Figure 3-5 as black, dark grey and light grey bars depending on the proteolytic degradation process used. The bars representing each proteolytic fragment analyzed were drawn to scale and positioned based on N-terminal

98 sequencing results under a domain topology figure representing each TAF2 homolog.

Stable fragments summarized in the panel labeled ‘Human’ of Figure 3-5 were analyzed as described in the Ph.D. thesis of Dr. Martina Musjkovic by: a) trypsinization of hs TAF2 FL ; b) trypsinization of a construct of Human TAF2 ‘core’ residues 1-1000; and c) a natural degradation process of purified full-length TAF2.

The stable fragments resulting from limited proteolysis of dm TAF2 and sc TAF2 were derived from the mixture analyzed by SDS-PAGE in (Figure 3-4). Analysis of dm TAF2 FL treated with chymotrypsin and sc TAF2 FL treated with trypsin and chymotrypsin revealed a putative conserved pattern of stable fragments that map onto the same two distinct regions of each TAF2 homolog. A summary of ESI and Edman sequencing analysis of stable fragments derived from limited proteolysis of the three

TAF2 homologs are depicted in Figure 3-5. In each panel, stable fragments analyzed were mapped onto a domain topology diagram of TAF2 showing AminopeptidaseN homology as detected by a BLAST search of the TAF2 sequence as well as regions predicted to be unstructured as determined using the web service FoldIndex©

(http://bip.weizmann.ac.il/fldbin/findex). One group of fragments maps onto a large region in the N-terminal half of each TAF2 homolog, loosely corresponding to the region containing sequence homology to the AminopeptidaseN family of metallopeptidases, while the other group maps onto a smaller region spanning approximately 20 kDa in the C-terminal region. This conserved pattern of stable regions suggests from the three TAF2 homologs is consistent with a conserved ‘Two-

Domain’ architecture of TAF2. In the case of Human TAF2, the trypsin derived stable fragment TR4 can be mapped to a region spanning the first 600 residues of hs TAF2, forming the N-terminal domain of the ‘Two-Domain’ architecture and TR6 forming the C-terminal domain, which maps onto the residues 820-984.

99

Figure 3-5: Summary of mass spectrometry and Edman sequencing analysis of fragments derived from limited proteolysis of three TAF2 homologs . In each panel, a domain topology diagram representing TAF2 is

100 drawn to scale with Aminopeptidase homology region coloured in green, unfolded regions in red (predicted using the online tool FoldIndex©: http://bip.weizmann.ac.il/fldbin/findex ) and six consecutive Histidines coloured in blue. Fragments resistant to proteolysis are shown in black, grey and light grey bars and marked CH for Chymotrypsin derived fragments or TR for Tryspin derived fragments. N-termini of fragments were determined by Edman sequencing (Jean Pierre Andrieu, IBS). Fragment sizes were determined by matching protein mass derived from ESI analysis of proteolyzed mixtures (Luca Signor, IBS). A table summarizing masses detected by ESI can be found in the Appendix of this thesis.

101 3.1.4. Human TAF2 ‘core’ and ‘Two-Domain’ constructs

Human TAF2 constructs were designed based on peptide sequences of stable fragments analyzed from limited proteolysis experiments (N.B. Stable fragments of

Human TAF2 analyzed in this study were derived from limited proteolysis experiments that are described in the Ph.D. thesis of Dr. Martina Mijuskovic). The constructs listed in Table 2 were cloned and expressed as described in section 2.4 and the soluble ‘core’ 1-1027aa purified as described in section 2.5.4.

Human TAF2 Construct Residues Fragment

‘Core’ 1-1027aa Full-Length TR2

‘Two-Domain’ N-terminal domain 1-600aa Full-Length TR4

‘Two-Domain’ C-terminal domain [1-600aa] X [820-984aa] Full-Length TRa6

Table 2: Human TAF2 constructs designed based on limited proteolysis experiments.

The TAF2 ‘core’ construct is based on the stable fragment labeled TR2 derived from the Trypsin limited proteolysis fragment shown in Figure 3-5. A monoclonal anti-

His 5x, HRP-conjugated antibody probe was used against blotted SDS-PAGE separated full-length and TAF2 ‘core’ expressed in Sf21. Antibody staining of insoluble and soluble fractions of both hsTAF2 FL and TAF2 ‘core’ showed that TAF2 ‘core’ has a higher ratio of soluble protein when compared to its full-length counterpart. (Figure

3-6).

102

Figure 3-6:. Anti-His 5 antibody stained Western blot of Sf21 extract enriched with His 10 -tagged full-length TAF2 and His 10 -tagged TAF2 ‘core’. Enriched Sf21 whole cell extract was first separated by centrifugation into an insoluble fraction (see lanes labeled 'P') and a soluble fraction (see lanes labeled 'S') and then separated by SDS-PAGE. HRP-conjugated anti-

His 5 antibody staining of blotted protein by conversion of DAB substrate to a chromogenic product showed that the ratio of soluble His 10 -tagged TAF2

‘core’ is greater than that of His 10 -tagged Full-length TAF2. Molecular weight standards are in the lane labeled 'mw'.

2+ SDS-PAGE of peak fractions from SEC of Co affinity purified His 10 -tagged Human

TAF2 ‘core’ showed a single band migrating to its expected molecular weight of 118 kDa. The observed retention volume of the TAF2 ‘core’ domain according to SEC was 12.91 mL (Figure 3-7). The observed retention volume of TAF2 ‘core’ corresponds to a globular protein that is approximately 165 kDa, when compared to the migration of molecular weight standards. Since the expected retention volume of a globular 118 kDa protein analyzed by SEC using S200 resin is 13.13 mL, it appears that the ‘core’ is more globular in comparison its full-length protein counterpart, which eluted as a globular particle with a molecular mass greater than 300 kDa.

103

Figure 3-7: Size exclusion chromatography of the Human TAF2 'core' domain. Affinity and ion exchange purified Human TAF2 'core' was applied to a Superdex 200 10/300 column (GE Healthcare). A chromatogram showing the absorption profile of soluble TAF2 ‘core’ domain indicated the retention volume was as expected for a globular protein with the same molecular mass. Coomassie stained SDS-PAGE of SEC peak fractions are inset.

The TAF2 ‘core’ fragment derived from limited proteolysis contains the conserved N- and C- terminal domains that make up the proposed ‘two-domain’ construct. The two domains in each case are composed of a larger molecular weight fragment at the N- terminal part of the protein which includes the predicted aminopeptidase domain

(denoted Aminopeptidase Containing Domain, ACD) and a smaller fragment, which is roughly 20-30 kD in size, located in the C-terminal half of the protein (denoted

20kD) (Figure 3-8). It is possible that TAF2 is organized into two stable, tightly associated domains that are resistant to proteolysis and separated by protease sensitive linker regions. To examine this hypothesis, individual domains of the TAF2 ‘Two-

104 Domain’ construct were cloned into two separate Multi-Bac vectors. The N-terminal

‘ACD’ domain was expressed either alone or together with the C-terminal ‘20 kD’ domain by performing a Cre-lox fusion of the individual vectors.

Figure 3-8: TAF2 Proteolysis and the ‘Two-Domain’ Model. TOP: A schematic of the ‘Two-Domain’ model based on limited proteolysis experiments performed using Human, Drosophila and Yeast TAF2. Bottom: A schematic representation of the MultiBac vectors designed to express two domains of TAF2 from a single baculovirus.

Both the ACD and ‘Two-Domain’ constructs were expressed in insect cells as described in section 2.5.1, however, these constructs proved to be insoluble when expressed in insect cells (Data not shown).

3.1.5. Crystallization trials

Over 600 different crystallization conditions were tested for each purified Human,

Drosophila and Yeast full-length TAF2 as well as Human TAF2 ‘core’ proteins.

Purified protein concentrations ranged 1-10 mg/mL. Crystallization conditions were also tested in the presence of trace amounts of Chymotrypsin. No crystals were

105 obtained under any of the conditions tested. All crystallization conditions were tested using the HTX robotic platform, Grenoble.

3.1.6. Resume

Three different homologs of TAF2 produced recombinantly in insect cells can be purified to near homogeneity as was expected given the evidence that native Yeast and Human TAF2 may dissociate from TFIID under certain conditions (Kaufmann et al., 1998; Papai et al., 2009). SEC analysis of metal affinity and ion exchange purified

TAF2 of all three homologs is consistent with a conserved non-globular architecture.

The elution profiles of insect cell expressed, purified hs TAF2 FL , dm TAF2 FL and sc TAF2 FL homologs are consistent large unfolded regions, or a possible extended conformation, which may explain the difficulty in obtaining a crystal of this protein.

In contrast, a core fragment of Human TAF2, hsTAF2 1-1027 , that was found to be resistant to proteolysis in a previous study (Mijuskovic, 2007), migrated in SEC experiments nearly as expected for a globular monomer. Fragments of hs TAF2 FL , dm TAF2 FL and sc TAF2 FL resistant to proteolysis could be mapped to two distinct regions on the protein sequence of each respective TAF2 sequence consistent with an architecture composed of two globular domains separated by a flexible linker. Co- expression of these two putatively conserved domains of TAF2 in insect cells did not result yield any soluble material. Potential experiments to test the ‘Two-Domain’ theory could include expression of the two domains on a single polypeptide chain joined by a protease cleavage site or expression of constructs that include extended domain boundaries on each of the respective domains.

106 3.2. Functional annotation of TAF2

3.2.1. Introduction

Other than being associated with TFIID as a bona fide TBP associated factor and a putative co-activator of initiator function, there is little information regarding the role of TAF2 within the TFIID complex. Often, conserved domain architecture, predicted on the level of amino acid sequence, can give insight into the function of a protein whose role is otherwise unknown (Chothia et al., 2003). TAF2 sequence homology could potentially shed light on its role in TFIID. A search against the non-redundant

BLASTp database using the online search tool PSI-BLAST

(http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGR

AM=blastp&RUN_PSIBLAST=on) showed that residues in the N-terminal half of

TAF2 possess homology to Aminopeptidase N with low sequence identity.

Aminopeptidase N is a member of zinc-containing secreted enzymes that play a role in the digestion of peptides generated by the hydrolysis of proteins, typically found in the gut. Although homology to the Aminopeptidase N in the N-terminal region of

TAF2 is conserved in many different TAF2 homologs, TAF2 has not been shown to possess any peptidase activity. The M1 family metalloproteases, of which

Aminopeptidase N is a member, contain the signature exopeptidase motif (GXMXN)

(not present in Human TAF2) as well as the Zinc-binding motif (HEXXH[18]E)

(present in the Human TAF2). Sequence alignment of TAF2 homologs and

Aminopeptidase N proteins whose structures are known show that conserved residues are not shared with those that confer catalytic activity to the aminopeptidases.

Catalytically inactive peptidases have been shown to participate in protein-protein interactions as in the case of the chromatin remodeler complex FACT (FAcilitates

107 Chromatin Transcription) (Stuwe et al., 2008). The SpT16 subunit of FACT possesses in its N-terminal region homology to Aminopeptidase P (from the M24 family of metalloproteases) but with the catalytically active site residues mutated in SpT16. The

SpT16 peptidase-like domain, expressed on its own, was shown to bind both the globular domains and tails of H3-H4 histones. Furthermore, a crystal structure of the peptidase-like domain of SpT16 revealed a shared ‘pita-bread’-fold with known

Aminopeptidase P 3-D structures (Stuwe et al., 2008). In light of the SpT16 interaction with H3-H4 histones, it seems reasonable to ask whether the defunct aminopeptidase TAF2 may also mediate protein-protein interactions with histones or histone fold domains within TFIID via such a defunct aminopeptidase domain.

3.2.2. TAF2 secondary structure and modeling

It is well documented that protein structure is more conserved than protein sequence

(Chothia et al., 2003). Sequence identity between TAF2 and Aminopeptidase N members falls below the 30% boundary set to allow the determination of evolutionary relatedness between proteins in a sequence alignment (Todd et al., 2001). An alignment between the predicted secondary structure of TAF2 homologs and the known secondary structures of Aminopeptidase N homologs (calculated from crystal structures) was performed using the web-based tool PRALINE

(http://www.ibi.vu.nl/programs/pralinewww/ ) (Simossis and Heringa, 2005). Figure

3-9 shows the result of the PRALINE sequence analysis; an alignment of secondary structure elements predicted for two TAF2 homologs with those from three known

Aminopeptidase N proteins whose crystal structures have been solved. The sequences aligned are hs TAF2, dm TAF2, Human aminopeptidase N ( hs APN), Endoplasmic reticulum aminopeptidase1 (ERAP1), Colwellia psychrerythraea cold-aminopeptidase

108 N ( cp APN) and Thermoplasma F3 factor ( ta PIF3). The sequences highlighted in blue in Figure 3-9 are predicted to be beta-sheets and those in red are predicted to be alpha helical. A green circle marks residues that confer activity to ERAP1, which are mutated in TAF2. The residues known to confer activity to Aminopeptidase N enzymes are not conserved TAF2. The conserved residues that are mutated in TAF2 are noted with a green circle in Figure 3-9. For example, the residues found in the

Zinc-coordinating HEXXHX 18 E motif of ERAP1, H353, H357 and E376 and those found to coordinate the amino group of the peptide mimic, bestatin, E183 and E320, are not conserved in TAF2.

109

Figure 3-9: PRALINE alignment of TAF2 and aminopeptidase N. Sequences highlighted in blue are predicted to be beta-sheets and those in red are predicted to be alpha helical. A green circle marks residues that confer activity to ERAP1.

110 In order to further explore the possibility that TAF2 is a structural homolog of an

Aminopeptidase N, the Human TAF2 sequence was modeled using an aminopeptidase as a structural template. Two aminopeptidase template candidates for a structural alignment of Human TAF2 were detected using the web based modeling software ‘Modeller’ ( http://salilab.org/modeller/ ) see Section 5.1 Appendix Table 6 for a summary of ‘Modeller’ hits. The TAF2 model calculated with the highest confidence was that using ERAP1 as a template, scoring an E-value of 0. The second highest E-value of 4 x 10 -6 calculated for a structural alignment resulted when cp APN, another member of the Aminopeptidase N sub- family of M1 metallopeptidases, was used as a template for modelling. The sequence identity across the residues 93-912 of

Human TAF2, used for modelling Human TAF2 residues with ERAP1, is only 14%, which is lower than the accepted 20% cut-off for protein sequence homology modeling. However, considering a PSI-BLAST search of the amino acid sequence of

Human TAF2 revealed a 24% sequence identity with ERAP1 in the N-terminal residues 187-377 and up to 30% sequence identity with putative Aminopeptidase N homologs, it seems reasonable to assume these calculated models reflect a distant homology between TAF2 and the Aminopeptidase N subfamily. Secondary structure prediction of both Human TAF2 and ERAP1 using the online software PSIPred

(http://bioinf.cs.ucl.ac.uk/psipred/ ) are presented next to the TAF2 model predicted by

Modeller and the crystal structure of ERAP1in Figure 3-10 . Colouring of the TAF2 model based on the four domains described for the ERAP1 crystal structure (Kochan et al., 2011) show where predicted structural domains would reside in TAF2.

111

Figure 3-10: TAF2 model domain alignment with ERAP1 crystal structure and PSIPred protein sequence analysis. Structural domains as described for ERAP1 were numbered I-IV and coloured consecutively yellow, red, blue and green.

112 3.2.3. TAF2 Histone binding

The residues that would normally confer Aminopeptidase P activity were shown to be mutated in the homologous region found in SpT16. These mutated residues were instead shown by X-ray crystallography to be involved in binding of the enzymatically defunct SpT16 to the histone pair H3-H4 (Stuwe et al., 2008). In order to explore the possibility that the putative Aminopeptidase N homology in TAF2 mediates an interaction with histones in a similar fashion to SpT16 two experiments were conducted. For the first experiment conducted in the laboratory of Prof. Andreas

Ladurner (EMBL, Heidelberg), purified hs TAF2 FL samples used as part of this thesis were mixed with purified H3-H4 and H2A-H2B dimers and the mixture separated by

SEC. For the second experiment purified hs TAF2 FL was shipped to the lab of Dr. Or

Gozani (Stanford University, California) and assayed for interaction using a peptide array of modified histone tails (Bua et al., 2009). No interaction between TAF2 and the histone pairs or the modified histone tails was observed (data not shown).

Considering TFIID contains five known histone fold domain (HFD) pairs of TAFs

(TAF 8-10, TAF 3-10, TAF 4-12, TAF 6-9 and TAF 11-13), purified hs TAF2 FL was mixed with each purified HFD containing TAF pair and the mixtures analyzed by

SEC (Figure 3-11). A chromatogram of hs TAF2 FL alone superimposed with chromatograms of mixtures of hs TAF2 FL and purified HFD-containing TAF pairs showed shift in the elution volume of TAF2 could be detected only when hs TAF2 FL was mixed with full-length TAF8-10 (Figure 3-11).

113 Figure 3-11: SEC analysis of mixtures of TAF2 and HFD-containing domains in TFIID . hs TAF2 FL was mixed with each HFD-containing TAF pair, TAF8-10, TAF3-10, TAF11-13, TAF6-9, TAF4-12 and analyzed by SEC. The corresponding SDS-PAGE gel is shown on the right with peaks indicated by their elution volumes shown in the chromatogram overlay on the left. Grey curves represent the chromatogram of hs TAF2 FL and black curves the chromatogram of the mixture. A shift in the elution volume of hs TAF2 FL indicates an interaction with the HFD-containing TAF pair. A shift in the elution volume of hs TAF2 FL was only detected when mixed with TAF8-10 (top-panel).

114 Mixing each full-length HFD containing TAF pair respectively with TAF2 and analysis by size exclusion chromatography revealed that only after mixing with

TAF8-10 was there a shift in the elution volume of TAF2. Analysis of the peak fractions revealed the presence of TAF2, TAF8 and TAF10. Of all the known HFD containing TAF pairs in TFIID, only TAF8-10 interacted with TAF2 under the conditions tested. The HFD containing TAF pair TAF3-10 did not result in a shift the elution volume of hs TAF2 FL to that consistent with a larger molecular weight species, indicting that the TAF2/TAF8-10 interaction is unlikely to be mediated by TAF10.

3.2.4. Resume

In light of the shared structural “pita-bread” fold between SpT16 and the

Aminopeptidase P sub-family (despite having only ~20% sequence identity), it is possible that TAF2 and the Aminopeptidase N sub-family may also share a distant homology and possibly a structural fold. Using the BLAST homology search algorithm, it was shown that the N-terminal region of TAF2 shares up to 30% sequence identity with members of the Aminopeptidase N subfamily of M1

Metalloproteases. The homology modeling software ‘Modeller’ detected two template candidates for modeling TAF2, ERAP1 (PDB: 2YD0) residues 94-935 and Colwellia psychrerythraea cold-aminopeptidase N (PBD: 3C1A) residues 51-365. Residues that confer aminopeptidase activity to ERAP1 are not conserved in TAF2. Since the

SpT16 subunit of FACT interacts with the Histone pairs H2A-H2B and H3-H4 via its defunct aminopeptidase fold, we asked whether TAF2 might also interact with histones via a defunct aminopeptidase fold. Data collected in collaboration with the

Prof. Andreas Ladurner and Dr. Or Gozani showed no evidence that hs TAF2 interacted with recombinant H3-H4 or H2A-H2B histones pairs nor with modified

115 histone tails. The ability for TAF2 to interact with the five purified HFD-containing

TAF pairs that are found within TFIID itself was tested. SEC analysis of mixtures containing hs TAF2 FL and each of the HFD-containing pairs of TAFs in TFIID revealed that an interaction could only be detected with purified full-length TAF8-10.

Since an interaction was not detected between TAF2 and TAF3-10, it is unlikely that the interaction between TAF8-10 is mediated by TAF10. Despite aminopeptidase homology and a mutated active site, TAF2 does not appear to mediate an interaction with purified histone pairs as in the case of SpT16. Neither does TAF2 mediate an interaction with modified histone tails. Instead, an observed shift in elution volume when mixed with TAF8-10, indicated a more specific interaction with an HFD- containing subunit within TFIID itself. It remains unknown whether this interaction resides in the aminopeptidase domain of TAF2 as in the case of SpT16. The following experiments described in this thesis concentrate on isolating the region of interaction on TAF8-10.

116 3.3. Biochemical and biophysical characterization of TAF2-8-10

3.3.1. Introduction

In Section 3.2, the question was posed whether TAF2 may be interacting with a HFD- containing protein via a defunct aminopeptidase region, as in the case of the chromatin remodeler FACT (Stuwe et al., 2008). According to analytical SEC of mixtures of HFD containing TAFs found in TFIID and TAF2, an interaction was detected between TAF2 and TAF8-10 and unlikely to be mediated by TAF10. Next, the questions were asked whether the TAF8-10 histone fold pair could be interacting with TAF2 in the same manner as FACT, via HFDs present in TAF8-10 and what could be the nature of such an interaction. Co-expression of the trimeric TAF2-8-10 in insect cells yielded a purifiable complex that co-migrates in SEC. The complex could be analyzed by native mass spectrometry and analytical ultracentrifugation. The interaction region could be narrowed down to the C-terminal region of TAF8 and steady state thermodynamic analyses using surface plasmon resonance SPR could be performed to measure affinities. Results from this section point to an interaction with

TAF2 involving a region in the C-terminal half of TAF8 that secondary structure prediction software predicted to be mainly unfolded. A BLAST homology search showed a conserved TAF8 C-terminal domain that is rich is proline in the region.

3.3.2. Sf21 co-expression and purification of TAF2-8-10 complex

In order to simplify the production of theTAF2-8-10 complex, a baculovirus transfer vector was prepared encoding Human full-length TEV cleavable His 10 -tagged TAF2 and a polyprotein cassette encoding full-length Human TAF8-10 polyprotein fused to

TEV protease with TEV cleavage sites separating all three encoded proteins as

117 described in section 2.4.3. Using the MultiBac technology, individual MultiBac baculotransfer vectors encoding TAF2 and the TAF8-10 cassette could be fused together via Cre-lox recombination to allow all the subunits of the trimeric complex to be expressed from a single baculovirus (Bieniossek et al., 2008; 2009).

TAF2-8-10 was expressed in insect cells in large scale as described in section 2.5.4. A gene encoding CFP was included in the polyprotein expression cassette for the purpose of monitoring fluorescence during the course of infection. The fluorescence of both YFP (encoded in the bacmid backbone) and CFP (encoded in the TAF8-10-

TEVprotease-CFP polyprotein cassette) reached a maximum by 72 hours after cell proliferation arrest, by which time the cells were harvested (Figure 3-12).

Figure 3-12: YFP and CFP monitor of TAF2-8-10 expresiion in Sf21. Since YFP and CFP were co-expressed with the TAF2-8-10 complex, monitoring fluorescent counts emitted by YFP (in yellow) and CFP (in blue) was used to estimate complex expression levels. Fluorescence reached a maximum by 3 days post cell-proliferation arrest (DPA). DPI stands for days post infection.

118 Briefly, the TAF2-8-10 complex could be purified by IMAC via an N-terminal His 10-

FL Tag located on hs TAF2 and a C-terminal His 6-tag on TAF8. The contaminating

TEV protease could be removed by ion exchange chromatography and the complex further purified to homogeneity by gel filtration, as analyzed by SDS-PAGE (Figure

3-13). The complex eluted as a single peak with an elution volume of 10.7 mL indicating a molecular mass of a globular protein that is larger than expected for this complex in a 1:1:1 stoichiometric ratio (Figure 3-13).

Figure 3-13: Purification of TAF2-8-10 complex. Chromatograms which monitor absorbance at 280 nm are in black, conductivity profiles in red and Imidazole concentration in green. The top panel shows an IMAC profile on the left and the corresponding SDS-PAGE gel on the right. An imidazole gradient

119 of 10-400 mM yields a crude purification of TAF2-8-10. A double band corresponding to TEV protease is indicated on the SDS-PAGE gel. In the middle panel, Ion exchange chromatography(IEX) was used to separate TAF8-10 and other contaminants from TAF2-8-10 as shown by the SDS- PAGE gel on the right. Contaminating TEV protease and excess TAF8-10 are found in the flowthrough fraction of the SP chromatography. A size exclusion profile is shown in the bottom panel. The complex eluted in a single peak, which eluted at 10.7 mL.

The TAF2-8-10 complex could be purified to homogeneity (>90% as determined by

SDS-PAGE analysis), by a three-step purification (IMAC, IEX and SEC). The complex eluted from an S200 10/300 column as a single peak consistent with a globular a species with a larger molecular mass than expected for a 1:1:1 TAF2-8-10 complex .

3.3.3. TAF2-8-10 AUC and Native Mass Spectrometry

The sedimentation coefficient of purified hsTAF2-8-10 complex was measured and compared to that of hs TAF2 FL alone. Sedimentation velocity AUC of hs TAF2 FL resulted in two main peaks: one centered at a sedimentation coefficient of ~4.3 S, accounting for ~60% of the total protein and a second peak centered at ~6.2 S accounting for ~30% of the total protein. A sedimentation coefficient of ~4.3 S is too low for a globular monomer with the same molecular mass as hs TAF2 FL ; although by applying a frictional constant, the sedimentation coefficient observed is consistent with an extended conformation of a species with the same molecular mass as hs TAF2 FL . TAF2-8-10 complex analyzed by sedimentation velocity AUC resulted in a main peak accounting for ~60% of the total protein was shifted to a sedimentation coefficient of ~4.9 S when compared to the peak corresponding to TAF2 alone. A

120 sedimentation coefficient of ~4.9 S is consistent with a 200 kDa species made up of a

1:1:1 stoichiometric ratio of each TAF2, TAF8 and TAF10 when applying a frictional coefficient consistent with an extended conformation of the complex.

Figure 3-14: Sedimentation velocity analytical ultracentrifugation. In this figure the full distribution of sedimentation coefficients (Svedbergs) S is shown on the x-axis. The distribution of sedimentation coefficients of TAF2 is represented by a dashed line and that of the TAF2-8-10 complex as a solid black line.

Sedimentation velocity analytical ultracentrifugation experiments confirm the presence of a TAF2-8-10 complex, and give results that are consistent with a species containing a 1:1:1 stoichiometric ratio when applying a frictional coefficient consistent with an extended conformation..

The purified TAF2-8-10 complex was sent to the laboratory of Carol Robinson

(Oxford University, UK) for analysis by native mass spectrometry. MS/MS spectra

(Figure 3-15) and Collision Induced Dissociation experiments (Figure 3-16) resulted in a series of peaks consistent with predicted masses corresponding to the combination of subunits found in Table 3-3. Every combination of subunits, except

121 TAF2-10, could be assigned to a mass. The main peak corresponded to a mass consistent with the presence of the TAF2-8-10 complex in a 1:1:1 stoichiometric ratio.

122

Figure 3-15: Native MS/MS dissociation of purified hs TAF2-8-10 FL complex. Figure kindly provided by the lab of Carol Robinson (Oxford University).

123

Figure 3-16: Collision induced dissociation experiments of hsTAF2-8- 10 FL

124 Name Expected mass Calculated mass Mass Difference (Daltons) (Daltons) (Daltons) TAF2-8-10 195712.3 196372.2 +659.9 1:1:1 195222.5 -489.8 TAF2 137115.0 137526.7 +411.7 136364.8 -750.2 TAF2-8 172099.2 171548.1 +461.1 1:1 172753.8 +744.6 TAF8-10 58597.3 58975.8 +378.5 1:1 TAF8 34984.2 35028.2 +44.0

TAF10 23613.1 23751.6 +138.5 23484.3

Table 3: Masses calculated from peaks derived from Native Mass spectrometry . Data communicated the lab of Carol Robinson (Oxford University). Data from native MS/MS and collision induced dissociation experiments resulted in peaks from which masses were calculated. In this table the calculated masses are assigned to expected masses of TAF2-8-10 and resulting subcomplexes.

Data from both TAF2-8-10 native MS/MS experiments and HFD-containing-TAF- pair mixing experiments are consistent with TAF2 interacting with TAF10. Results from sedimentation velocity AUC experiments and native mass spectrometry experiments described in this section are consistent with a 1:1:1 stoichiometry of the

TAF2-8-10 complex.

3.3.4. TAF2-8-10 negative stain EM

The purified TAF2-8-10 complex was negative stained with uranyl acetate and visualized by EM as a qualitative measure. TAF2-8-10 was passed over a continuous

10-30% glycerol gradient and then fractionated. Carbon grids of negative stained

TAF2-8-10 were prepared of the purest fraction, as determined by SDS-PAGE. The micrograph in the right panel of Figure 3-17 revealed a relatively homogenous

125 distribution of elongated particles with approximate diameters ranging from 15-25 nm.

Figure 3-17: Glycerol gradient and negative stain EM. The left panel shows an SDS-PAGE gel of the TAF2-8-10 glycerol gradient fractions from 10-30% glycerol. The right panel shows a micrograph of a uranyl acetate negative stained carbon grid of the fraction loaded in the sixth lane of the SDS-PAGE gel magnified 50,000X.

3.3.5. TAF2 interaction with TAF8 C-term region: SEC analysis

The TAF8-10 heterodimer is made of TAF10, which contains a HFD and TAF8, which also contains a HFD as well as a proline rich domain and a Nuclear

Localization Sequence (NLS) at its C-terminus (Figure 3-18). In order to further explore the possibility that TAF2-8-10 is not mediated by TAF10 (considering SEC analysis of mixtures of hs TAF2 and HFD-containing TAF mixtures that suggested hs TAF2 is interacting with hs TAF8 and not hs TAF10 (Figure 3-11)), different constructs of TAF8-10 and of Maltose Binding Protein (MBP)-hs TAF8 fusion proteins were mixed with hs TAF2 FL and analyzed by SEC to identify critical

126 interaction motifs. MBP-hs TAF8 fusion proteins were cloned as described in 2.4.2.

Briefly, MBP-TAF8 fusion constructs were cloned into the pMAL vector (NEB) and contain a TEV cleavage site separating MBP and TAF8. All constructs included a C- terminal His 6x -tag for purification from E. coli lysate as described in 2.5.4. Mixing the histone fold of TAF8-10 (TAF8 1-134 TAF10 98-218 ) (Purified protein kindly donated by

Dr. Simon Trowitszch) did not cause a shift in the elution volume of TAF2, but a

TAF8 fusion protein of MBP fused to the C-terminal residues of hs TAF8 105-310 did cause a shift in the elution volume of TAF2. This result is consistent with an interaction between TAF2 and TAF8 that is mediated by resides in the C-terminal two-thirds of the TAF8. A TAF8 construct encoding the histone folds of TAF8 and

TAF10 fused directly to the native TAF8 NLS (TAF8 [1-134][297-310] TAF10 98-218 ) was mixed with TAF2. This mixture did not result in a shift in the elution volume of

TAF2, consistent with the interaction between TAF2 and TAF8-10 that is not mediated by the NLS of TAF8. Constructs in Figure 3-18, panels A-E, when mixed with hs TAF2 FL did not cause a change in the elution volume of TAF2. These data suggest that TAF2 interacting region of TAF8 resides in the residues 164-296.

Interestingly, MBP-TAF8 200-310 did not cause a shift in the elution volume of TAF2, however, SDS-PAGE analysis of the TAF2 peak fractions contained a band corresponding to the size of MBP-TAF8 200-310 (Figure 3-18, Panel D). A possible explanation for the presence of residual MBP-TAF8 200-310 in the peak corresponding to the elution volume of TAF2 could be a change in either the affinity (ie. a weaker interaction between the two proteins) or a change in the kinetics of complex formation or dissociation, which would result in a visualization of the complex at an intermediate elution volume. SEC analysis and SDS-PAGE results of all TAF8 fragment mixing experiments with TAF2 are summarized in Figure 3-18.

127 Figure 3-18: TAF2 and TAF8 construct mixing experiments. Top right: Domain topology of the hs TAF8-10 FL heterocomplex with TAF10 in green, TAF8 in blue, HFD domains in purple and TAF8 C-term domain called the proline rich domain (PRD) for the purpose of this thesis in grey (Human TAF8 Sequence I.D.: NP_612639.2, TAF8 C-term domain, NCBI conserved domain database I.D.CDD:176263). Panels (A-H): SEC and SDS-PAGE analysis of eight different MBP-TAF8 fusion proteins mixed with hs TAF2 FL .

128 Chormatograms on the left of each panel are coloured as follows: Grey chromatograms represent mixtures, dotted line chromatograms represent MBP-TAF8 fusions and black chromatograms represent hs TAF2 FL alone. Panels F, G and H show that a shift in the elution volume of TAF2 was observed when mixed with the MBP-TAF8 fusions indicated.

Three MBP-TAF8 C-term fusion proteins resulted in a shift of the elution volume of hs TAF2 FL in SEC mixing experiments described in Figure 3-18, Panel F, G and H: 1)

MBP-TAF8 105-310 2) MBP-TAF8 141-310 and 3) MBP-TAF8 164-310 (Figure 3-18).

Results obtained from SEC analysis of TAF8-TAF2 mixtures are consistent with complex formation being mediated principally by TAF8 residues 164-297. A

CLUSTAL alignment of homologous TAF8 amino acid sequences revealed that residues 164-297 contain a part of the conserved proline rich domain (See appendix:

Figure 5-1). The region of TAF8 containing residues 164-297 was predicted to be largely unfolded by the online software FoldIndex© except for the stretch of residues

188-219 (Figure 3-19) (Prilusky et al., 2005).

129 Figure 3-19: Predicted disorder based on the Human TAF8 protein sequence (UNIPROT: Q7Z7C8). Regions predicted to be disordered are coloured red and regions predicted to be ordered in green. The disorder plot was generated using the webservice FoldIndex© (Prilusky et al., 2005).

3.3.6. TAF2 interaction with TAF8 C-terminal region: A thermodynamic

analysis

Two of the MBP-TAF8 fusion constructs that were found to interact with hs TAF2 FL as well as the hs TAF8-10 FL heterodimer were further characterized by surface plasmon resonance experiments. The kinetics of the TAF2-TAF8 interaction meant that it was only possible to perform an equilibrium analysis using this technique. hs TAF2 FL ‘ligand’ was immobilized to a standard CM5 sensor chip (GE Healthcare) and the two MBP-TAF8 fusion and hs TAF8-10 FL heterodimer ‘analytes’ were flowed over the immobilized ligand until equilibrium was reached. The SPR signal

130 ‘Response Units’ were measured at equilibrium for nine different concentrations of each MBP-TAF8 fusion construct ranging 0-800 µM and binding constants were calculated using the Langmuir binding model assuming a 1:1 interaction. The experimental details can be found in the section 2.9.1. To investigate the contribution of two different C-terminal regions of TAF8 to binding hs TAF2 FL , equilibrium constants of two different MBP-TAF8 fusion constructs were compared to that of hs TAF2 FL binding the hs TAF8-10 FL heterodimer. The equilibrium-binding constant of hs TAF2 FL and hs TAF8-10 FL heterodimer was calculated to be 43 +/- 8 nM. When the

C-terminal regions of TAF8 105-310 and then TAF8 141-310 (both of which were fused to an N-terminal MBP tag) were flowed over hs TAF2 FL the equilibrium binding constants were measured to be 126 +/- 9 nM and 177 +/- 36 nM, respectively (Figure

FL FL 3-20). The calculated K D the hs TAF8-10 heterodimer: hs TAF2 interaction is about 3 times lower than the TAF8 truncation constructs as measured by SPR.

Possible explanations for this result could be that the presence of MBP disrupts the interaction or that the presence of TAF10 is contributing to a conformation favorable for binding to TAF2.

131

Figure 3-20: Surface plasmon resonance experiments using immobilized TAF2 ligand . Sensorgrams on the left obtained by applying analytes that were applied at nine different concentrations ranging 0-800 μM. Binding curves on the right were obtained by non-linear curve fitting of the Langmuir binding equation (assuming a 1:1 interaction) using the graphing software Grace (http://plasma-gate.weizmann.ac.il/Grace/). Standard deviation was calculated between three replicates. Panel A, hs TAF8-10 FL heterodimer analyte. Panel B, MBP-TAF8 105-310 analyte and Panel C, MBP-TAF8 105-310 analyte.

132 3.3.7. NMR analysis of TAF8 interaction region

In order to further probe the interaction region between hs TAF2 FL and the C-terminal region of TAF8, isotopically labeled samples of MBP-TAF8 164-310 were produced for

NMR interaction studies. 15 N labeled MBP-TAF8 164-310 purified from E. coli containing the proposed interaction region, deduced from SEC binding experiments

(Figure 3-18), was mixed with unlabeled hs TAF2 FL purified from insect cells where hs TAF2 FL was added in a 10-fold molar excess. An obvious feature of the 2D ( 1H,

15 N) TROSY spectra of 15 N labeled MBP-TAF8 164-310 alone (Figure 3-21) is that the dispersion of the crosspeaks is consistent with the protein being unfolded. The

TROSY spectrum of the mixture showed considerable broadening of crosspeaks compared to the spectra of MBP-TAF8 164-310 alone, which is consistent with the formation of a higher molecular weight complex tumbling more slowly in solution.

The overlaid spectra from the two TROSY experiments showed peaks corresponding to TAF8 disappeared upon the addition of TAF2. The disappearance of cross peaks is consistent with the hypothesis that a few residues in a predicted unfolded stretch of

TAF8 residues are involved in the interaction with TAF2. Observable crosspeaks with unperturbed chemical shifts following the formation of a complex in excess of 150 kDa strongly suggests that a portion of TAF8 does not interact with TAF2 and that these residues remain unfolded and flexible.

133 110

120 N (ppm) 15

130 9.0 8.0 7.0 8.0 7.0 8.0 7.0 1H (ppm)

Figure 3-21. 2D ( 1H, 15 N) TROSY NMR analysis of TAF8. Black spectrum (left-hand panel), 15 N- labeled TAF8 164-310 , red spectrum (middle panel) 15 N- labeled TAF8 164-310 mixed with unlabeled. The two spectra are superposed on the right-hand panel.

Only 115 peaks were visible in a TROSY spectrum of 15 N-labelled TAF8 164-310 , which accounts for 77% of all-non proline residues. The abundance of unfolded regions in the C-terminal region of TAF8 lead to a significant overlap of 1H -15 N, 13C -alpha and

13C-beta correlations in TROSY spectra impeding unambiguous assignment of connectivity between spin-systems. Assignment of TAF8 164-310 was attempted using a combination of HADAMAC, TROSY, HNCA, and HNCACB cross-correlation experiments as well as comparison with a TROSY spectrum derived from a shorter

TAF8 construct containing residues 200-310. However, poor quality data due to low signal-to-noise and a high proportion of missing peaks in 3D data sets prevented

134 assignment of TAF8 164-310 . Despite this, the NMR spectra presented here show strong evidence that a substantial portion of TAF8 164-310 is unfolded. The observation that only selected crosspeaks disappear upon the addition of TAF2 supports the hypothesis that TAF2 binds to discrete regions of TAF8, which are likely composed on unfolded interaction motifs. These data, whilst not yielding complete assignments, support the application of this technique in future high resolution mapping interaction sites in the unfolded C-terminal tail of TAF8. Further studies would require additional optimization of the sample conditions and significant improvements of the signal-to- noise ratio of the data collected.

3.3.8. Identification of the TAF8 C-terminal interaction region by a peptide

array assay

Peptide array experiments were employed to further map the region of TAF8 involved in binding TAF2. Peptide array assays were performed as described in 2.6.5. Briefly, a peptide array made of 20-mer peptides spanning residues 105-310 of hs TAF8 was chemically synthesized with an offset of three residues and immobilized onto cellulose membranes. hs TAF2 FL was incubated with identical arrays at two different

NaCl concentrations, 150 mM and 500 mM NaCl. Control peptide scans were also prepared in parallel, which were not incubated with TAF2. Peptides scans incubated at both concentrations of NaCl revealed a pattern of spots that corresponded to five different interactions regions on the C-terminal third of TAF8 Figure 3-22(Panel A).

The five regions detected map to the same area on TAF8, regardless of NaCl concentration, however, the scans incubated in low [NaCl] showed a different pattern of spot intensities than when incubated at a high [NaCl]. Spot intensities corresponding to region III approximately doubled in the presence of low salt while

135 spot intensities corresponding to regions II and IV remained the same regardless of salt concentration. Each 20-mer peptide identified in Panel A in Figure 3-22 was assigned a number from 1-4 depending on spot intensity. Then every residue in the peptide was assigned a number according to its position in the hs TAF8 sequence, such that numbers from overlapping residues in spot forming peptides could be summed. The resulting sum of intensities assigned to each position of the TAF8 C- terminal residues 105-310 region represented in overlapping peptides could be plotted

(Panel B Figure 3-22). Plotting summed spot intensities this way for all residues of the C-terminal region of TAF8 revealed five interaction regions (I-V) that could be roughly assigned to the residues outlined in Figure 3-22 (Panel C).

136 Figure 3-22: Peptide scan analysis of the putative TAF2-TAF8 C-terminal interaction region. Panel A) Peptide arrays composed of 20-mer epitopes derived from residues 105-310 of TAF8 incubated with and without His 10x - hs TAF2 FL in two different [NaCl] (500 mM NaCl on the left and 150 mM NaCl on the right). Spots were visualized by incubating arrays with monoclonal mouse anti-His 5x antibodies and horseradish peroxidase-coupled anti-mouse IgG antibodies and chemiluminescent substrate Luminol. The chemilumiescent signal was visualized using a Kodak Imager. Positions a1, g5 and g6 contain positive control peptides made of 6x Histidines. B) Spot

137 intensities plotted as a function of residue number from two different scans, one at 500 mM NaCl (yellow) and another at 150 mM NaCl (purple). C) Top, TAF8 domain topology figure depicting residues 105-310 with the PRD coloured grey and the NLS coloured black. Green arrows represent helical regions predicted by PSI-PRED. Bottom, regions I to V delineated from the five peaks seen in panel B shown using the same colour scheme as in B.

A finer mapping of the TAF2 interaction regions III and IV on TAF8 was performed with two peptide arrays composed of 20-mer peptides spanning the TAF8 residues

170-205 and 202-240 with an offset of one residue. Peptide arrays were visualized as described for peptide arrays shown in Figure 3-22. An alignment of the 20-mer peptides corresponding to spots with intensities above a certain threshold revealed that the minimum sequence for binding hs TAF2 FL in region III is TAF8 181 -

SQRRDVERALTR-193 and in region IV is TAF8 219 -FTIPYLTAL-227 (Figure 3-23).

138

Figure 3-23: Peptide scan analysis of the TAF2 interaction regions III and IV from TAF8. Panel A Top: A cellulose bound peptide array of 20-mer epitopes from the TAF2 interaction region, TAF8 region III (TAF8 residues

170-205) offset by a single amino acid incubated with and without His 10X - hs TAF2 FL . Control peptides made of six Histidines are located positions a1 and b9. Panel A Bottom: An alignment of the peptides sequences that yielded FL a chemiluminescent signal after incubation with His 10X -hs TAF2 . A consensus binding sequence is outlined in red. Panel B Top: A peptide array composed of the TAF2 interaction region IV (TAF8 residues 202-240), prepared in the same manner as the peptide array shown in the top of panel A. Control peptides made of 6x Histidines are located positions a1, c2 and c3. Panel B Bottom: An alignment of the peptide sequences that yielded a chemiluminescent signal after incubation with TAF2.

139 To obtain higher resolution information regarding which residues from regions III and

IV on TAF8 participate in the interaction with TAF2, an alanine substitution scan was performed using a peptide containing TAF8 Region IV, residues 213-232. Each residue in the peptide was substituted for an alanine except in the case where an alanine was already present, in which case, it was substituted for a serine. The residues from Region IV, that were shown to abolish binding of the TAF8 213-232 are shown in red in Figure 3-24.

Figure 3-24: Region IV alanine substitution peptide scan. An alanine substitution scan of the peptide TAF8 213 -LIAARPFTIPYLTALLPSEL-232 derived from the sequence found in Region IV of TAF8. Each residue in the peptide was substituted for an alanine except in the case where an alanine was already present, in which case it was substituted for a serine. The top panel shows an alanine substitution scan incubated with TAF2 and the bottom panel shows control blot incubated without TAF2.

The residues in TAF8 Region IV that abolished binding when mutated includes the sequence LXXLL, a motif known to be involved in protein-protein interactions.

(Heery et al., 1997; Plevin et al., 2005).

140 3.3.9. Resume

Results from peptide array experiments probing the TAF2-TAF8 interaction (as seen in Section 3.3.7) showed that the minimum interaction region on TAF8 does not contain a discrete peptide sequence as initially proposed but instead includes the C- terminal regions III, IV and V spanning the residues 165-310.

141 3.4. Functional characterization of TAF2-8-10

3.4.1. Introduction

The discovery of a stable recombinant human TAF subcomplex composed of TAF2-

8-10 raises the question of whether such a subcomplex is relevant to the assembly and activity of TFIID. in vitro TFIID complex assembly experiments have shown that particular subcomplexes can support activated transcription in the presence of certain activators. Immunoaffinity purified Drosophila TAF1, TAF2 and TBP when activated by the transcription factor NTF-1 (Neurogenic element-binding transcription factor 1) has been shown to undergo activated transcription in vitro (Chen et al., 2014). Recent results communicated form the lab of Dr. Laszlo Tora showed that a complex containing TAF2-8-10 (as identified by mass spectrometry using the multidimensional protein identification technology (MudPIT)) could be co- immunoprecipitated using a specific TAF2 antibody from HeLa cytoplasmic extracts

(Data in Publication 4, Figure 1 (d) in section 4 of this thesis). Experiments in the

Berger group have shown that a stable homogeneous subcomplex of TFIID can be isolated that is composed of five recombinant human TAFs, and so called

5TAF(Bieniossek et al., 2013). 5TAF was shown to be made up of two copies of each

TAF5, TAF6, TAF9, TAF4 and TAF12. Purified recombinant TAF8-10 could be combined with 5TAF to subsequently form 7TAF. TAF2 could also be incorporated into 7TAF to form the so-called 8TAF complex (Data in Publication 4,

Supplementary methods, Figure 6 in section 4 of this thesis). The quantification of a stable interaction between TAF2 and TAF8/10, likely to reside in the C-terminal region of TAF8 (described in this thesis), together with the 7TAF complex provided, an assay for studying the incorporation of TAF2 into a mutant 7TAF lacking the

142 putative TAF2 interaction domain. This section will describe the incorporation of a construct of TAF8-10 lacking the putative TAF2 interaction domain found in the C- terminal third of TAF8 into the 5TAF complex forming a 7TAF complex.

3.4.2. TAF8/10 lacking putative TAF2 interaction region can be mixed with

5TAF to form 7TAF-∆TAF8[C-term]

The purified TAF8 25-120 /MBP-TAF10 98-218 heterocomplex (Purified protein kindly donated by Dr. Simon Trowistzch) lacking the putative TAF2 interaction domain

(proposed in this study reside in the TAF8 residues numbered 164-297) was mixed with purified 5TAF complex composed of (TAF5, TAF6, TAF9, TAF4 and TAF12)

(Components of the purified 5TAF complex kindly donated Dr. Christoph

Bieniossek). The mixture was analyzed by size exclusion chromatography using S6 resin. The SEC profile showed a peak at 9.9 mL corresponding to a complex with a molecular weight between 1.3 and 2.0 MDa when compared to known molecular weight standards. SDS-PAGE analysis of the peak fractions showed the expected bands corresponding to each protein expected in the mixture. Excess TAF8 25-

120 /MBP-TAF10 98-218 heterocomplex could be found in a second peak, which eluted from the S6 column at 16.3 mL.

143

Figure 3-25: TAF8/10 lacking TAF2 interaction region could be incorporation into 5TAF. Left panel SDS-PAGE analysis of peak fraction from S6 chromatogram found in the right panel. All five subunits of 5TAF (TAF5, 6, 9, 4 and 12) as well as TAF8/10 lacking the putative TAF2 interaction region are identified in the 9.9 mL peak. The band corresponding to TAF12 is very faint but was observed when loading more protein on the gel (as seen in Figure 3-26). The peak in the chromatogram labeled 16.3 mL contains the excess TAF8/10 construct lacking the putative TAF2 interaction region.

3.4.3. TAF2 incorporation into 7TAF-∆TAF8[C-term]

The Berger group has shown that TAF2 can be mixed with 7TAF to form 8TAF (Data in Supplementary methods, Figure 6 in section 4 of this thesis). In order to test whether mixing TAF2 with 7TAF-∆TAF8[C-term] could result in the incorporation of TAF2, the individual components were and the mixed and analyzed by SEC. Two peaks were eluted from an S6 column and analyzed by SDS-PAGE Figure 3-26. SDS-

PAGE analysis of peak fractions eluted at 10.0 mL revealed bands corresponding to

144 5TAF and the TAF8 25-120 /MBP-TAF10 98-218 heterocomplex (Figure 3-26). SDS-

PAGE analysis of the peak fractions eluted at 14.7 mL revealed a band corresponding to the expected molecular weight of TAF2 as well as a band migrating as expected for a 50 kDa protein, which corresponds to the molecular weight of similar in size to

MBP-TAF10 98-218 (Figure 3-26, Panel A). Since both TAF4 and TAF2 migrate to very similar positions on a 6-12% Acrylamide SDS-PAGE gel, anti-TAF2 antibody was incubated with 5TAF, the S6 separated 7TAF ∆TAF8[C-term]/TAF2 mixture and

TAF2 alone to determine if TAF2 was incorporated into 7TAF ∆TAF8[C-term]. Anti-

TAF2 antibody staining of 5TAF resulted in the unexpected staining of a species that corresponds to TAF4, however, close inspection of the migration positions of each species revealed that TAF2 migrates as a slightly smaller species than TAF4 (Figure

3-26, Panel D). Furthermore, there was no shift detected in the elution volume of the peak corresponding to 7TAF ∆TAF8[C-term] when TAF2 was added to the mixture, as would be expected from a resulting interaction. These results are consistent with

TAF2 not being capable of binding the 7TAF ∆TAF8[C-term] complex.

145

Figure 3-26: TAF2 incorporation into 7TAF ∆TAF8[C-term] complex. Panel A: 15 % Acrylamide SDS-PAGE and coomassie staining of the fractions eluted from an S6 10/300 column (GE Healthcare) in Panel B. Panel B: Chromatogram of mixture eluted from an S6 10/300 column. Panel C: Anti- TAF2 antibody staining of an identical SDS-PAGE as in Panel A blotted onto PVDF membrane. Panel D: Anti-TAF2 antibody staining of the 5TAF complex next to insect cell expressed and affinity purified hs TAF2 FL produced for this thesis showing cross reactivity of the enzyme with TAF4.

3.4.4. Resume

The recombinant human TFIID subcomplexes 3, 5, 7 and 8TAF have been reported by the Berger group by Bieniossek et al as well as in Publication 4 in section 4 in this thesis (Bieniossek et al., 2013). The recombinant 5TAF complex made up TAFs 5, 6,

146 9, 4 and 12 have been shown to incorporate TAF8/10 to produce the 7TAF and subsequently 8TAF when mixed with TAF2. This 7TAF complex provided an assay to test whether TAF2 could be recruited to such a complex lacking the TAF2 interaction region on the C-terminal region of TAF8. A TAF8/10 construct lacking the proposed TAF2 interaction region could form 7TAF ∆TAF8[C-term] when mixed with 5TAF complex. Mixing TAF2 with the modified 7TAF complex did not result in the 8TAF. The similar migration behavior of TAF2 and TAF4 prevented unambiguous identification of the absence of TAF2 and anti-TAF2 antibody showed cross reactivity with a species that corresponds to TAF2. A better separation between

TAF2 and TAF2 was subsequently achieved and can be found in the supplementary methods of Publication 4 of this thesis, Figure 6.

147 3.5. Is TAF2 co-imported into the nucleus with TAF8/10?

3.5.1. Introduction

TAF10 does not possess an NLS and is entirely dependent on an interaction with

TAF8 (or TAF3, or SPT7L) for entry into the nucleus (Soutoglou et al., 2005). TAF8 possesses a C-terminal nuclear localization sequence that when removed and expressed simultaneously with TAF10 in HeLa cells, caused a block in the nuclear import of both TAF8 and TAF10 (Soutoglou et al., 2005). An NLS could not be detected in the sequence of TAF2 using the signal sequence prediction software

Signal P 4.1 (http://www.cbs.dtu.dk/services/SignalP/). The Importin α/β heterodimer is known to target hundreds of proteins to the Nuclear Pore Complex (NPC) for translocation across the nuclear envelope. Importin α (IMPα) is known to bind the classical nuclear localization signal. In this section, I report an interaction between the

TAF2-8-10 complex and Importin α, and furthermore, that TAF2 does not interact with Importin α unless TAF8/10 is also present.

3.5.2. TAF2-mCherry expression in HeLa

To determine the localization of recombinant TAF2 in HeLa cells, an expression construct encoding full-length TAF2 was fused N-terminally to the fluorescent protein mCherry and subsequently cloned into in the mammalian expression vector pMam, as described in 2.4.1. Signal corresponding to the emission wavelength of mCherry protein was observed during fluorescent imaging of transfected HeLa cells.

148

Figure 3-27: Exogenously expressed TAF2 is localized to the cytoplasm and the nucleus. Human TAF2-mCherry exogenously expressed in HeLa cells and visualized directly by fluorescence microscopy. mCherry fluorescence is shown in red and DAPI staining of DNA is shown in blue.

3.5.3. TAF2-8-10 Importin α complex

Importin α is the subunit of the NPC complex responsible for interacting with the nuclear localization sequence (NLS) of proteins that are destined for translocation across the nuclear membrane. It is known that TAF8 possesses a C-terminal NLS and can dimerize with Importin α. It has been shown that TAF10 does not itself interact with Importin α but rather requires the dimerization with either TAF3 or TAF8 to bind Importin α. Soutoglou et al, have shown that in order for exogenously expressed

TAF10 to be translocated across the nuclear membrane TAF8 or TAF3 must be exogenously expressed at the same time. Since TAF2 does not itself possess a classical NLS, translocation across the nuclear membrane may be effected by its interaction with TAF8 as in the case for TAF10. In order to test whether TAF2 can interact with Importin α, recombinantly purified Human TAF2 was mixed with

149 Importin α and analyzed by SEC. SEC analysis of the TAF2 Importin α mixture showed that the elution volume of TAF2 did not shift in the presence of Importin α indicating that TAF2 does not interact with Importin α. SDS-PAGE analysis of the peak fractions confirms that Importin α does not form complex with TAF2 under the conditions tested (Figure 3-28).

Figure 3-28: TAF2 and Importin α do not interact. SEC analysis of a 1:1.1 mixture of TAF2 and Importin α is shown on the left. The black curve represents a chromatogram at 280 nm. Two species eluted at 11.5 mL and 14.6 mL respectively. SDS-PAGE analysis of the peaks, shown on the right, shows independent migration of both TAF2 and Importin α.

To further explore the similarity between TAF2 and TAF10 regarding the required interaction with TAF8 for interaction with Importin α and translocation across the nuclear membrane, the purified Human TAF2/8/10 complex was mixed with Importin

α and analyzed by SEC. SEC analysis of the mixture using S200 resin resulted in a shift in the elution volume of TAF2/8/10, consistent with the conclusion that an interaction was detected between the TAF2/8/10 complex and Importin α. The same mixing experiment was carried out in 500 mM NaCl, however, a shift in the elution volume of TAF2/8/10 was not detected. This is consistent with the conclusion that

Importin α does not interact with TAF2/8/10 in high salt conditions.

150

Figure 3-29: SEC analysis of TAF2/8/10 and Importin α mixtures in high and low salt concentrations. Top horizontal panel (150 mM NaCl conditions): S200 chromatogram of TAF2/8/10/IMP α mixture overlaid with the chromatogram, dashed, of the TAF2/8/10 complex for reference. SDS-PAGE analysis of the fractions from the single peak is shown on the right. Bottom horizontal panel (500 mM NaCl): S200 chromatogram of TAF2/8/10/IMP α mixture under high salt conditions resulted in the separate elution of TAF2/8/10 and IMP α. The elution profile of TAF2/8/10/IMP α is overlaid for reference (dashed).

TAF2, TAF2/8/10 and Importin α mixing experiments revealed that TAF2 also requires the presence of TAF8 to bind Importin α, as has been previously shown for

TAF10.

3.5.4. Resume

In a previous study, TAF10 fused to the fluorescent protein CFP, showed that concomitantly expressed TAF8 was required for both TAF8 and TAF10 to be

151 imported into the nucleus of HeLa cells (Soutoglou et al., 2005). Results from this study showed that exogenously expressed TAF2 tagged with mCherry at its N- terminus could not be localized to any compartment of HeLa cells. The fluorescence pattern of the TAF2-mCherry fusion expressed in HeLa cells was determined to be an artifact of over-produced protein as seen previously in such systems (personal communication by Laszlo Tora). Subsequently, mass spectrometry analysis of co- precipitated proteins resulting from an immunoprecipitation, using Anti-TAF2 from

HeLa cell cytoplasmic fractions, identified only TAF2 and TAFs 8 and 10 only while none of the other TAFs could be detected.

Additionally, results in this study showed that purified TAF2-8-10 complex can form an intact complex when mixed with Importin α. The Importin α/TAF2-8-10 complex could be dissociated in high salt conditions(500 mM NaCl) which consistent with an interaction that is mediated by ionic bonds.

152 4. Discussion

Results from this thesis show that soluble protein could be obtained from three different homologs of TAF2, Human, D. melanogaster and S. cerevisiae. TAF2 could be purified recombinantly from insect cells independently from any other subunits of the TFIID complex, corroborating a previous hypothesis that TAF2 is the most dissociable subunit of TFIID (Martin et al., 1999; Papai et al., 2009) The three homologues of TAF2 tested here could all be purified to near homogeneity from insect cells, as determined by SDS-PAGE and SEC. Despite considerable effort, attempts to crystallize any of the three homologues was unsuccessful. Considering

SEC of all three homologs of TAF2 resulted in the elution of species that were larger than expected for globular molecular weights, it is possible that crystallization efforts were impeded by flexible regions. Flexible unstructured regions of proteins can cause an increase in the hydrodynamic radius compared to that expected for a globular species with the same molecular mass. Attempts to obtain a fragment of TAF2 more amenable to crystallization using limited proteolysis and recloning of a two domain construct best representing the stable fragments, did not yield soluble protein.

However, hs TAF2 1-1027 lacking the C-terminal region, which is predicted to be unstructured by the software PSIPRED (Buchan et al., 2003), did produce soluble material that behaved as a more ‘globular’ species by SEC. However this construct did not yield crystals either. It is noteworthy that the stable fragments of TAF2 from all three homologs tested in proteolysis assays consistently mapped to two regions on the amino acid sequence of TAF2. These data are consistent with the conservation of two folded regions separated by a proteolytically sensitive linker region. TAF2 and peptides derived from TAF2 originating from various Human cell lines have been

153 shown to be phosphorylated in positions found across the length of the protein

(Danielsen et al., 2011; Olsen et al., 2010; Shiromizu et al., 2013) Considering that the efforts of crystallizing TAF2 did not yield any crystals, dephosphorylation of the purified material may be considered as an attempt to obtain a more homogeneous sample for future crystallization trials. It might also be worth considering a different host organism for expression of TAF2 with a potentially different set of post- translational modifications.

Using purified hs TAF2 FL produced in this study, results from experiments including

SEC co-migration assays, sedimentation velocity AUC, co-expression in insect cells,

SPR, NMR and peptide array analysis are all consistent with a specific interaction between TAF2 and the HFD-containing pair hs TAF8-10. These studies further showed that this interaction is localized to the C-terminal region of TAF8 164-297. The group of Laszlo Tora (IGBMC, Strasbourg) detected a TAF2-8-10 complex in the cytoplasmic fraction of HeLa cells. The identity of the three components was confirmed by immunopurification using anti-hs TAF2 antibody and a mass spectrometric analysis of the immunopurified sample using MudPIT

(Multidimensional Protein Identification Technology). Importantly, other subunits of

TFIID could not be detected in the immunopurified sample. This observation is consistent with the presence of a TFIID subcomplex comprised of the subunits of

TAF-2-8-10 that can transiently reside in the cytoplasm (see Trowitszch et al ., publication in review, section 4.1 supplementary Figure 6). Other data from the Tora group have shown that fluorescently-tagged TAF10 ectopically-expressed in HeLa cells is cytoplasmic and could not enter the nucleus unless co-expressed with TAF8.

Furthermore, co-expression with a construct of TAF8 lacking its C-terminal NLS

154 prevented TAF10 entry into the nucleus, which is consistent with nuclear import of

TAF10 being dependent on the C-terminal NLS of TAF8. Since a nuclear localization sequence has not been identified in TAF2, this raises the question of whether TAF2 also utilizes the NLS of TAF8 for import into the nucleus in the same manner as

TAF10. TFIID assembly on the core promoter remains an unknown aspect of eukaryotic transcription regulation and if it is indeed the case that TAF2 can only enter the nucleus via the NLS of TAF8 as is the case for TAF10, the discovery of a

TAF2-8-10 subcomplex in the cytoplasm, combined with an absence of an obvious

TAF2 NLS points to the possibility that TAF2 uses TAF8 via a TAF2-8-10 complex to enter the nucleus. If TAF2-8-10 is proven to be a physiologically relevant module involved in TFIID and PIC assembly at the TSS of class II genes, its entry into the nucleus may describe a mechanism by which gene transcription is regulated by the cell. Results from this thesis show that overexpression of mCherry-tagged TAF2 resulted in the detection of the characteristic signal emitted by mCherry protein in

HEK293 cells, consistent with ectopic expression of TAF2 in the cytoplasm. In order to validate these findings, it would be necessary to determine if mCherry-hs TAF2 FL is functional and able to interact with TAF8-10 in a similar manner to the wildtype protein. The fact that some of the same subunits present in TFIID are also present in at least two other transcription regulation complexes, STAGA (Spt3-TAF9-Gcn5- acetyltransferase), and TFTC (TBP-free TAF containing complex), is consistent with

TFIID regulating transcription in a modular fashion. Further evidence pointing to a modular view of TFIID include the isolation of a TFIID ‘core’ composed of two copies of five subunits of TFIID (TAF5, 6, 9, 4, 12). This complex was initially observed by RNA knockdown studies in Drosophila S2 cells (Wright et al., 2006). A later cryo-EM study using a recombinant TFIID ‘core’, produced using the MultiBac

155 system, revealed a symmetrical dimer (Bieniossek et al., 2013; Wright et al., 2006).

When purified recombinant 5TAF dimer was mixed with the HFD-containing heterodimer TAF8-10, TAF8-10 was shown to be incorporated in a single copy, breaking the original symmetry of the 5TAF dimer to form a 7TAF complex

(Bieniossek et al., 2009). TAF2 could be incorporated into 7TAF to form 8TAF (see

Trowitszch et al ., publication in review, section 4, supplementary Figure 6). The

TAF2-8-10 complex provides an intriguing glimpse into a potential modularized mechanism for TFIID assembly.

Biochemical evidence of TAF2-8-10 in this thesis combined with in vivo detection of the complex in the cytoplasm of HeLa cells provides an intriguing insight into the possible role of this subcomplex and the assembly of TFIID. Although TAF2 is a bona fide member of the TFIID complex, there are many instances in the literature that are consistent with TAF2 existing outside of TFIID. An EM study of purified yeast TFIID by Papai et al showed that endogenously immunopurified TFIID complex could be separated into two architecturally distinct forms depending on whether or not TAF2 was present in the complex. In this case, TAF2 could be depleted form the complex using TAP-tagged TAF1 purified TFIID from S. cerevisiae and applying an NaCl gradient to the ion exchange column-applied sample and monitoring the disappearance of a band by SDS-PAGE with the same molecular weight as sc TAF2 (Papai et al., 2009) . Closer inspection of the SDS-PAGE-separated, ion exchanged fraction, that showed TAF2 being depleted from TFIID, reveals also the presence of TAF8 and TAF10, as well as all the subunits that make up the 8TAF complex described. The data presented in this thesis suggest that it may be possible to deplete TAF2 using peptides derived from the C-terminal region of TAF8. If possible,

156 this approach could be used to determine if the interaction with TAF8 is the sole anchor point for TAF2 in TFIID, or whether the association with TFIID is dependent on other interactions.

A three-dimensional model of Human TAF2 (residues 93-984) could be obtained using Human Endoplasmic Reticulum Aminopeptidase, ERAP1, (PDB: 2YD0) as a template. TAF2 has known sequence homology with the Aminopeptidase N family of enzymes, specifically in the N-terminal region of TAF2, which as yet has no known peptidase activity. Analysis of the homologous active site region in ERAP1 shows that none of the residues that confer activity to ERAP1 are conserved in TAF2. The chromatin specific-transcription elongation factor, ‘Facilitates Chromatin

Transcription’ (FACT), has been shown to interact with histones via a defunct

Aminopeptidase P domain (Stuwe et al., 2008). The obvious question was asked whether the putatively defunct aminopeptidase domain of TAF2 could be analogous to that found in FACT? That is could TAF2 interact with one of the many histone pairs found in TFIID? The results from this thesis are consistent with TAF2 interacting with the histone fold containing hetermodimer TAF8-10. However, rather than interacting with the histone fold elements of TAF8-10, results in this thesis are consistent with TAF2 instead interacting with a largely unstructured region of the C- terminal tail of TAF8. Evidence of an interaction between highly purified hs TAF2 FL and hs TAF8-10 FL heterodimer could be confirmed by native mass spectrometry,

NMR, sedimentation velocity AUC, and SPR which yielded results that are consistent with the TAF2-8-10 complex being heterotrimeric with a 1:1:1 stoichiometric ratio with a K D measuring approximately 50 nM. SPR experiments showed that the C- terminal residues 141-310 of TAF8 could mostly recapitulate the K D (within 100 nM)

157 of the full-length heterodimeric complex. The difference in K D could be attributed to the either the presence of MBP moiety fused to N-terminus of the TAF8 truncation constructs, or the absence of an indirect effect of TAF10. Peptide array experiments in this thesis are consistent with five distinct regions (regions I-V) located in the TAF8 residues 164-297 being involved in an interaction with TAF2. NMR experiments using labeled MBP-tagged TA8 are consistent with a C-terminal region of TAF8 that is unfolded, supporting the theory that the TAF8 interaction with TAF2 is mediated by an unfolded stretch of residues. A recent high resolution NMR structure showed that yTBP interacts with an unfolded negatively charged region of TAF1 (TAND1).

This interaction is mediated by a highly positively charged cleft on yTAF1 and anchored by a phenylalanine. The authors suggest that the yeast TAF1-TBP complex

NMR structure suggests that yTAF1 may obscure yTBP from interacting with the

TATA-box element in the core promoter of transcription start sites

(Anandapadamanaban et al., 2013). An intriguing hypothesis is that TAF2 interacts with an unfolded region of TAF8 in the context of TFIID, potentially obscuring binding to the core promoter via a similar mode of binding as that found in yeast

TAF1 and TBP.

By devising a robust protocol that allows the production of large quantities of the subunits of TFIID using the MultiBac baculovirus expression system, it has been possible to perform experiments that show that TAF2 interacts specifically with the histone fold containing TFIID subunit, TAF8. TAF8 has been implicated in the regulation of cell differentiation. Specifically, TAF8 expression has been shown to be significantly increased during the differentiation of mouse 3T3-L1 preadipocytes to adipocytes. Interestingly, TAF depleted 3T3 nuclear extract using anti-TAF5 and anti-

158 TBP could not be depleted of TAF2. The authors attribute an inability to deplete

TAF2 from 3T3-L1 nuclear extract as reflecting the lability of TAF2 within TFIID. It is intriguing to speculate that TAF2 incorporation into TFIID is regulated by the presence of TAF8.

159 4.1. Publication 4 (under review at Nature Communications;

NCOMMS-14-12636)

Title: The histone fold-containing TAF8-10 complex directs incorporation of TAF2 into human TFIID

Authors: Simon Trowitzsch 1,2, Cristina Viola 1,2,8 , Gabor Papai 3, Marjorie Fournier 4,

Elisabeth Scheer 4, Ima-Obong Ebong 5,7 , Juan Zou 6, Juri Rappsilber 6, Carol

Robinson 5, Patrick Schulz 3, Laszlo Tora 4 & Imre Berger 1,2

Summary: This publication describes the major findings of this thesis presenting the dissection of the TAF2 interaction with TAF8-TAF10 in the context of TFIID. The publication also presents the discovery of the TAF2-8-10 complex as a functional

TFIID submodule in the cytosol of Human cells. The following manuscript in preparation and describes a biochemical and biophysical characterization of the

TFIID subcomplex TAF2-8-10. Immunopurification studies revealed a stable subcomplex present in the cytosol of HeLa cells that can be produced recombinantly in Insect cells. Biophysical methods were used to detect a binding interface between

TAF8 and TAF2. Nuclear import of the complex is discussed and a crystal structure of the Histone-fold domain of TAF8-10 as well as the nuclear import protein Importin alpha in complex with the NLS of TAF8 in presented.

The manuscript has been submitted for publication.

160 Cytoplasmic complex comprising TAF2, TAF8 and TAF10 provides evidence for nuclear holo- TFIID assembly from preformed submodules.

Simon Trowitzsch 1,2, Cristina Viola 1,2, Elisabeth Scheer 3, Virginie Chavant 4, Marjorie Fournier 3, Gabor Papai 5, Ima-Obong Ebong 6, Christiane Schaffitzel 1,2 , Juan Zou 7, Juri Rappsilber 7,8 , Carol Robinson 6, Patrick Schultz 5, Laszlo Tora 3 & Imre Berger 1,2

1 European Molecular Biology Laboratory, Grenoble Outstation, 6 rue Jules Horowitz, 38042 Grenoble, France 2 Unit for Virus Host-Cell Interactions, Univ. Grenoble Alpes-EMBL-CNRS, 6 rue Jules Horowitz, 38042 Grenoble, France. 3 Cellular Signaling and Nuclear Dynamics Program, Institut de Génétique et de Biologie Moléculaire et Cellulaire, UMR 7104, INSERM U964, 1 rue Laurent Fries, 67404 Illkirch, France. 4 Proteomics Platform, Institut de Génétique et de Biologie Moléculaire et Cellulaire, UMR 7104, INSERM U964, 1 rue Laurent Fries, 67404 Illkirch, France. 5 Integrated Structural Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire, UMR 7104, INSERM U964, 1 rue Laurent Fries, 67404 Illkirch, France. 6 Chemistry Research Laboratory, University of Oxford, South Parks Road, Oxford, Oxfordshire, OX1 3TA, United Kingdom. 7 Wellcome Trust Centre for Cell Biology, University of Edinburgh, Mayfield Road, Edinburgh EH9 3JR, United Kingdom. 8 Institute of Bioanalytics, Department of Biotechnology, Technische Unversität Berlin, 13353 Berlin, Germany

Corresponding authors: Laszlo Tora ([email protected]) & Imre Berger ([email protected])

161 ABSTRACT General transcription factor TFIID is a cornerstone of RNA polymerase II transcription initiation in eukaryotic nuclei. Human TFIID is composed of the TATA- binding protein (TBP) and 13 TBP-associated factors (TAFs). The cellular mechanism of TFIID assembly is poorly understood. In the cytoplasm of human cells we discovered a heterotrimeric TFIID subcomplex consisting of TAF2, TAF8 and TAF10. X-ray crystallography revealed a non-canonical arrangement of the TAF8- TAF10 histone fold domains (HFDs). TAF2 binds to multiple motifs within the TAF8 C-terminal region, and this interaction promotes TAF2 incorporation into a nuclear core-TFIID complex. Our results provide evidence for a step-wise assembly pathway of TFIID in the nucleus, regulated by nuclear import of preformed cytoplasmic submodules. .

162 INTRODUCTION

Eukaryotic class II gene transcription is controlled by a plethora of proteins, which are preassembled in large multiprotein complexes, including RNA polymerase II, Mediator and the general transcription factors (GTFs) 1. The sequential nucleation of GTFs and Mediator on core promoter DNA initiates regulated class II gene transcription 2. The GTF TFIID plays a central role in this process by linking cellular signaling events with regulatory DNA elements and the transcription machinery 3. Although a basal transcription system supporting initiation of transcription from TATA-box-containing promoters can be reconstituted with TBP, TFIIA, TFIIB, TFIIE, TFIIF and TFIIH in vitro , TFIID is additionally required to respond to activators and for efficient transcription from TATA-less promoters 4,5 . In mammalian cells, about 90% of protein-coding gene promoters are occupied by TFIID and loss of TFIID components leads to embryonic lethality 6-9. TFIID subunits are implicated in crosstalk with epigenetic modifications on nucleosomes and regulatory DNA elements in promoter regions 10,11 . Structural analysis of TFIID by cryo-electron microscopy (cryo-EM) revealed the overall architecture of TFIID and provided important insights into subunit assembly and promoter recognition at low to medium resolution 12-16 . Canonical human TFIID consists of TATA-binding protein (TBP) and 13 TBP- associated factors (TAFs) 17 . Other non-canonical TFIID and TAF-containing complexes have been identified recently with key roles during spermatogenesis and stem cell development 18-20 . A central scaffold of canonical TFIID comprises two copies each of TAF4, 5, 6, 9, and 12, which were shown to form a symmetric core 12,21 . This core-TFIID complex was first identified in Drosophila melanogaster nuclei 23 . TAF3, 4, 6, 8, 9, 10, 11 and 12 contain histone fold domains (HFDs), which stabilize discrete heterodimers (TAF3-10, TAF4-12, TAF6-9, TAF8-10 and TAF11- 13) 22-25 . Among these HFD pairs, the TAF8-10 heterodimer plays a key role in the TFIID assembly pathway, is critical for the integrity of holo-TFIID and also fulfills essential functions in early embryonic development 6,8,26,27 . Binding of TAF8-10 to core-TFIID triggers a transition from symmetry to asymmetry, which was proposed to prime the recruitment of TAF1, 2, 3, 7, 11, 13 and TBP to complete holo-TFIID 12 . Evidence from genetic and biochemical studies showed that knockout of the TAF10 gene leads to impairment of mature TFIID assembly in F9 EC cells and to

163 dissociation of TFIID in hepatocytes 6,26,27 . Biochemical data suggest that TAF8 and TAF10 interact strongly and specifically with each other via their HFDs 28 . Identification of human TAF8 uncovered high sequence similarities with the Drosophila protein PRODOS and the mouse TBN protein 8,28,29 . Mouse e mbryos carrying a mutation in TBN develop normally to the blastocyst stage but fail to develop further due to the lack of inner cell mass cells 8. Interestingly, the same phenotype was also found in TAF10 knockout mice strongly suggesting that TAF8 and TAF10 are both involved in controlling embryonic development at similar stages 6. The importance of this cooperative activity of TAF8 and TAF10 is supported by nuclear import assays, which showed that the transport of TAF10 from the cytoplasm to the nucleus depends on the nuclear localization signal found at the C-terminal end of TAF8 30 . Photocrosslinking experiments, DNA binding-site selections and reconstituted transcription assays identified TAF2 as mediator of Initiator (Inr) function in the context of TFIID 11,31-33 . Recently, mutations in the TAF2-coding gene were shown to be associated with various neurological disorders 34,35 . Human TAF2 is predicted to adopt an aminopeptidase-like fold with an additional C-terminal unstructured region. Localization studies using immunopurified TFIID showed that TAF2 is an integral part of the central lobe of the holo-complex 13 . While general functions of individual TFIID subunits and the holo-complex are increasingly better understood, very little is known to date about how the cell assembles this essential multiprotein complex. The existence of physiological core- TFIID in the nucleus, containing a subset of TAFs, provides evidence that the holo- complex may be assembled in a regulated manner from stable, preformed partial TFIID subassemblies. The dependence of some of the TAFs on each other for nuclear import and the critical role of the TAF8-10 pair in functional remodeling of core- TFIID imply that discrete submodules preassemble also in the cytoplasm of cells. However, direct evidence for the presence of subassemblies in the cytoplasm is lacking to date. By immunoprecipitating TAF-containing complexes from different cellular compartments, we identified a novel endogenous TFIID subcomplex formed by TAF2, 8 and 10 in the cytoplasm of human cells. We dissected cytoplasmic TAF2-8- 10 biochemically and structurally. We elucidated the interactions that stabilize the

164 complex and revealed a central role of TAF8 in its nucleation. By X-ray crystallography, we demonstrate a non-canonical histone-fold domain pair arrangement between TAF8 and TAF10. We report a novel interaction between TAF8 and TAF2, mediated by multiple peptide motifs in the TAF8 C-terminal region. Moreover, we describe the formation of a putative nuclear import particle comprising the TAF2-8-10 complex and Importin α1. Further, we demonstrate that the TAF8- TAF2 interaction is not only crucial for formation of the cytoplasmic TAF2-8-10 complex, but also dictates incorporation of TAF2 into a physiological core-TFIID complex that exists in the nucleus.

165 RESULTS Identification of an endogenous cytoplasmic TAF2-8-10 complex With the objective to better understand TFIID assembly and in particular the incorporation of TAFs into holo-TFIID, we carried out immunoprecipitations from HeLa cell cytoplasmic and nuclear extracts. To test the role of TAF2 in the assembly process, we raised a polyclonal antibody using highly purified recombinant human TAF2 protein for the immunization procedure. We ascertained specificity of the purified antibody against recombinant TAF2 and endogenous TFIID by Western blotting (Fig. 1a and Supplementary Fig. 1). Using this antibody we carried out co- immunoprecipitation experiments of endogenous TAF2 from the cytoplasm, where TAF2 is synthesized de novo , and from nuclear extracts, where TAF2 likely functions in the context of TFIID. To identify proteins that co-precipitated with TAF2 we subjected the immunoprecipitated proteins to proteomics analysis by using the multidimensional protein identification technology (MudPIT). MudPIT analysis of proteins co-precipitated with TAF2 from the nuclear fraction revealed the full set of TFIID components (Fig. 1b and Supplementary Table I). We observed differences in abundance of the individual TAFs, which may argue for the presence of distinct TAF2-associated TAF or TFIID-like complexes in the nucleus. Strikingly, MudPIT analysis of TAF2-associated proteins from the cytoplasmic fractions identified only TAF8 and TAF10, whereas none of the other TAFs could be detected (Fig. 1b, c).

TAF8 nucleates the TAF2-8-10 complex We next used highly purified recombinant human TAF2, 8 and 10 to reconstitute the cytoplasmic TAF2-8-10 complex in vitro . We produced recombinant TAF2 and the TAF8-10 pair separately in insect cells and tested complex formation by size exclusion chromatography (SEC) experiments. SEC of a stoichiometric mixture of TAF2 and TAF8-10 showed a clear peak shift in retention volume towards earlier fractions as compared to the individual components (Fig. 2a). Analysis of the chromatographic fractions by SDS-PAGE shows that all three polypeptides co-elute in the same fractions (Fig. 2a). We observed unusually high molecular weight estimates for the components TAF2 and TAF8-10, and also for the complete TAF2-8- 10 complex, which exceed the calculated molecular weights of the proteins. These high estimates can be due to either oligomerization or elongated shapes of the specimens analyzed. We therefore determined the oligomeric states of purified TAF2,

166 TAF8-10 and the TAF2-8-10 complex by analytical ultracentrifugation sedimentation velocity and native mass spectrometry (MS) experiments. Sedimentation coefficients of 4.3 S, 2.3 S and 4.9 S were obtained for TAF2, TAF8-10 and TAF2-8-10, respectively (Fig. 2b). Continuous size-distribution analyses returned best-fit molecular weights of 140 kDa, 52 kDa and 200 kDa. These values are in good agreement with monomeric TAF2, heterodimeric TAF8-10 and heterotrimeric TAF2- 8-10 complexes, with subunit stoichiometries of 1:1 and 1:1:1 in case of the complexes. Analysis of TAF2-8-10 by native mass spectrometry revealed a predominant complex with an average molecular mass of 195797 Da corresponding to a TAF2-8- 10 heterotrimer containing one copy of each protein (Supplementary Fig. 2 and Supplementary Table 2). We subjected the TAF2-8-10 complex to collision-induced dissociation (CID) experiments in the mass spectrometer to probe for subunit interactions 36 . The resulting spectra reveals dissociation of the trimeric complex into TAF2-8 and TAF10 submodules (Fig. 2c and Supplementary Table 2). Notably, TAF2 and TAF10 do not interact under the conditions studied, since we did not observe a TAF2-10 species (Fig. 2c). We conclude from these data that TAF2, 8 and 10 assemble as a heterotrimeric complex with a 1:1:1 stoichiometry and that the complex is nucleated by TAF8 and stabilized by distinct TAF2-8 and TAF8-10 interactions.

TAF8 and TAF10 adopt a non-canonical histone fold dimer We next dissected the interactions identified by CID. First, we determined the X-ray crystal structure of the TAF8-10 complex. Previous GST pull-down experiments suggested that the interaction between TAF8 and TAF10 is mediated by their histone fold domains (HFDs), which are present in the N-terminal half of TAF8 and the C- terminal half of TAF10 28,30 . We co-expressed and purified full-length TAF8-10 complex in insect cells from a polyprotein construct 37 , subjected the complex to limited proteolysis and defined the core complex to TAF8 residues 1-134 and TAF10 residues 98-218 (hereafter referred to as TAF8 ∆C and TAF10 ∆N, respectively) (Supplementary Fig. 3a). We prepared this TAF8 ∆C-TAF10 ∆N core complex, but only obtained crystals diffracting X-rays to 5-6 Å resolution. We therefore tested various N- and C-terminal deletion constructs of the two proteins in crystallization experiments. A complex of

167 TAF8-10 comprising TAF8 residues 25-120 and TAF10 residues 112-212 yielded crystals which diffracted incident X-rays to 1.9 Å resolution (Supplementary Fig. 3b). We determined the structure of this complex by the Sulfur-SAD method and refined the model to a crystallographic R value of 20.5% and a free R factor of 23.7% with excellent stereochemistry (Table I). The final model includes TAF8 residues 28-120 and TAF10 residues 113-212 with the exception of a flexible loop in TAF10 comprising residues 178-191. The crystal structure of the TAF8-10 complex reveals that the two proteins adopt atypical HFDs with 3 central α helices flanked by additional N- and C-terminal α helices (Fig. 3a). In our structure, TAF8 wraps entirely around the α2 helix of TAF10 markedly enveloping its interaction partner (Fig. 3a). Complex formation buries 2212.3 Å 2 with predominantly hydrophobic intermolecular contacts. As observed in other HFD interactions, the two opposing aromatic residues Y68 of TAF8 and F162 of TAF10 at the crossover of the α2 helices contact each other via hydrophobic stacking interactions and categorize the complex to the H3/H4 family of HFD-containing proteins (Fig. 3b and Supplementary Fig. 3c) 38 . The additional N- terminal α-helix of TAF10, αN, and the C-terminal α-helix of TAF8, αC, contact each other on one side of the HFD in a head to tail fashion and significantly stabilize the complex by hydrophobic interactions centered on F119 of TAF10 (Fig. 3c). Interestingly, the proteins TAF8 and TAF10 have similar L1 loop geometries, which are not found in other structures of related HFD-containing TAFs (Fig. 3d,e) 23,25 . In both proteins a phenylalanine of loop L1 (F50 in TAF8 and F144 in TAF10) is embedded in a composite, hydrophobic cavity mainly formed by residues from helices α1/ α2 of one protomer and helices α2/ α3 of the other (Fig. 3d,e). The amino acids forming this hydrophobic cavity are remarkably similar in TAF8 and TAF10, suggesting an evolutionary interrelation between the two proteins (Supplementary Fig. 3d). Primary sequence comparison with two other TAF10-interacting proteins, TAF3 and human SPT7L, shows that similar residues can be also found in their HFDs, arguing for a conserved binding mode of these proteins known to interact with TAF10 (Supplementary Fig. 3d). Our structure underscores that HFDs in TAFs can adopt a variety of conformations, which may differ significantly from the canonical histone pairs found in the nucleosome.

HFDs of TAF8 and TAF10 are dispensable for TAF2 binding

168 We next analyzed the physical interactions between TAF2 and the TAF8-10 heterodimer. We first tested the effects of deleting the intrinsically unstructured regions of TAF8 and TAF10 on TAF2 binding in pull-down assays with purified proteins. As a control, full-length TAF8-10 was co-precipitated with TAF2 tagged with maltose-binding protein (MBP) (Fig. 3f). Truncation of the N-terminal region of TAF10 (TAF8-TAF10 ∆N, TAF10 residues 98-218) did not change the binding properties and still co-precipitated with MBP-TAF2. In contrast, a truncated complex of TAF8-10, in which the flexible C-terminal region of TAF8 was deleted (TAF8 ∆C- TAF10, TAF8 residues 1-134), did not co-precipitate with MBP-TAF2 suggesting that the region that mediates binding to TAF2 resides in the C-terminal, low- complexity tail of TAF8 (Fig. 3f). We confirmed the interaction between TAF2 and the C-terminal tail of TAF8 by SEC. We utilized full-length TAF2 and a fusion protein of MBP with residues 105-310 of TAF8 and evidenced complex formation (Supplementary Fig. 4a,b). These results are consistent with the CID data in native MS, which showed that only TAF8, and not TAF10, is directly interacting with TAF2.

TAF2 recognizes multiple short motives in the C-terminal region of TAF8 We characterized the TAF2-TAF8 interaction further by means of a peptide array. We monitored binding of His-tagged TAF2 to peptide arrays covering residues 105-310 of TAF8 (Fig. 4a). Densitometric analysis of the arrays indicated that TAF2 binding clusters around four distinct regions; a short N-terminal region I covering TAF8 residues 105-125, a less well-defined region II including residues 147-202 and regions III and IV spanning residues 207-238 and 282-310, respectively (Fig. 4a). We next analyzed the individual contributions of these four TAF8 regions to TAF2 binding by surface plasmon resonance (SPR) experiments. We generated N- and C-terminal deletion constructs of TAF8 and fused them to MBP (Fig. 4b). We monitored association and dissociation phases of the MBP-TAF8 truncations on TAF2-charged sensor chips and compared binding kinetics at identical analyte concentrations. An MBP-TAF8 fusion construct spanning the entire C-terminal region (TAF8 residues 105-310) showed a maximal association level of about 85 response units (RU) with fast on and off rates (Fig. 4b). A shorter MBP-fusion protein lacking region I (TAF8 residues 141-310) showed similar kinetics but a reduced maximal association level of approximately 40 RU (Fig. 4b). MBP-fusion constructs with

169 deleted regions II or IV (TAF8 residues 200-310 or 105-260, respectively) hardly interacted with immobilized TAF2 showing maximal association levels of less than 10 RUs (Fig. 4b). These data indicate that all four TAF2-interacting regions of TAF8 contribute cooperatively to the binding to TAF2. Based on our peptide array and SPR results, we introduced TAF8 point mutants into the TAF8-10 polyprotein expression construct by substituting three triple amino acid clusters spanning residues 185-187 (DVE), 222-224 (PYL) and 293-295 (PYL) with alanines. We produced and purified wild type TAF8-10 and the mutated TAF8- 10 complex and analyzed TAF2 binding via SEC (Fig. 4c). In contrast to wild-type TAF8-10, formation of a trimeric TAF2-8-10 complex was not observed with the three triple amino acid cluster TAF8-10 mutant, corroborating the results that we obtained with our peptide array and SPR experiments (Fig. 4c).

The C-terminal region of TAF8 promotes TAF2 incorporation in TFIID We showed recently that a TFIID subcomplex comprising TAF4, 5, 6, 8, 9, 10 and 12 (hereafter referred as 7TAF) can be formed in vitro by binding TAF8-10 to a physiological nuclear core-TFIID complex, which constitutes an important intermediate in holo-TFIID assembly 12,21 . We next asked whether the association of TAF2 to this 7TAF complex depends on the C-terminal region of TAF8, which we identified as responsible for TAF2 binding in the TAF2-8-10 complex. We produced and purified recombinant 7TAF and a 7TAF ∆ complex, in which TAF8 is substituted by TAF8 ∆C (Fig. 5a). We monitored binding of a mCherry-TAF2 fusion protein to 7TAF and 7TAF ∆ complexes using SEC. We introduced the mCherry tag on TAF2 in order to unambiguously separate the protein from TAF4 in SDS-PAGE. In all SEC experiments, we used stoichiometric amounts of TAF2 in relation to TAF8-10 or the truncated TAF8 ∆C-TAF10 complex. Interestingly, TAF2 could be fully incorporated into the 7TAF complex, whereas TAF2 did not interact noticeably with the 7TAF ∆ complex, in which the C-terminal TAF2-interaction region of TAF8 had been deleted (Fig. 5b). Next, we mapped the position of TAF2 on 7TAF. To this end, we determined a 3D model of negatively stained 7TAF complexes bound to TAF2 (hereafter referred as 8TAF complex) by single particle electron microscopy and compared the resulting structure to the reconstruction of the 7TAF complex we had determined previously 12 .

170 We observed major density differences clearly positioned on only one side of the particle, indicating TAF2 location (Fig. 5c and Supplementary Fig. 5). Interestingly, our 8TAF complex reconstruction resembles a precursor to the characteristic clamp shape of holo-TFIID, in contrast to the less elongated shape of 7TAF and core-TFIID 10 . Next we sought to characterize possible alterations in the protein-protein interaction networks along the assembly pathway to holo-TFIID. In particular we looked at the transition from 7TAF to 8TAF complexes upon TAF2 binding by cross- linking and mass spectrometry (CLMS) experiments. We cross-linked 7TAF and 8TAF complexes with the bifunctional reagent bis(sulfosuccinimidyl)suberate, BS 3 that targets mostly lysines 39 (Supplementary Fig. 6). Cross-linked complexes were separated from non-cross-linked species by SDS-PAGE, in-gel digested and cross- linked peptides were assigned to ion masses observed by mass spectrometry. We identified 43 protein-protein cross-links for the 7TAF complex and 35 protein-protein cross-links for the 8TAF complex with an overlap of 23 cross-linked peptides between the two complexes (Fig. 5d and Supplementary Table 3). Our data suggest that TAF9 plays a central role in 7TAF complex architecture by interconnecting TAF4, 5, 6, 8 and 12 (Fig. 5d). In our CLMS data, prominent cross-links between TAF8 and TAF10 were not present, consistent with the paucity of lysines within cross-linking distance, and the partly buried location of the TAF8-10 dimer within the 7TAF complex 12 . In the 8TAF complex, we observed cross-links of the C-terminal region of TAF8 with residues on TAF2, which are predicted to map to the surface (Fig. 5d). In addition to its proximity to TAF8, TAF2 is also positioned closely to TAF5, 6, and 9 and promotes cross-linking between TAF9 and 10, and TAF4 and 5 (Fig. 5d). Our data indicate that TAF2 is indeed anchored to the 7TAF complex via the TAF2-interacting region on TAF8 and that binding of TAF2 induces significant conformational changes that result in novel TAF-TAF interactions not present in the 7TAF complex.

A putative nuclear import particle comprising TAF2-8-10 and Importin α1. Biochemical and cell biology experiments demonstrated that the C-terminal nuclear localization signal (NLS) within TAF8 is necessary for shuttling TAF8 and TAF10 from the cytoplasm to the nucleus in an Importin α/β dependent fashion 30 . We asked whether the identified endogenous TAF2-8-10 complex would be capable of

171 recruiting Importin α1 in vitro to form a nuclear import complex. To this end we mixed highly purified TAF2-8-10 with a two-fold molar excess of an Importin α1 IBB variant lacking the Importin β-binding domain (Importin α1∆ ). We observed IBB efficient complex formation in SEC indicating that Importin α1∆ was stoichiometrically incorporated into the TAF2-8-10 complex (Fig. 6a and Supplementary Fig. 7a). We also observed by SEC that TAF2 alone is not bound by Importin α1 (data not shown). To better understand the function of Importin α1 in the TAF2-8-10 complex, we determined the X-ray crystal structure of the C-terminal NLS of TAF8 in complex IBB with Importin α1∆ at 1.75 Å resolution. The refined model has a crystallographic R value of 15.3% and a free R factor of 18.0% with good stereochemistry (Table I). Importin α1 residues 72-497 and residues 297-305 of the TAF8 peptide could be unambiguously traced in the electron density map. The TAF8 peptide binds as a monopartite NLS via residues 297-302 (Fig. 7b). In previous Importin/NLS structures, asparagines N146, N188 and N235 of Importin α1 hydrogen bond to NLS main chain amide and carbonyl groups at positions P1, P3 and P5 40 (Fig. 6b and Supplementary Fig. 6b). Importin α1 tryptophanes W142, W184, and W231 form apolar pockets, which accommodate the aliphatic moieties of lysine residues K300 and K302, and position the TAF8 NLS backbone via residues P297, L300 and L302 (Fig. 6b). The side chain of L299 is coordinated by the main chain carbonyl group of G150, the hydroxyl group of T155 and the carboxylate of D192, whereas side chains of L300 and L302 are contacted by side chain carbonyl groups of N228 and Q181, respectively. We could also model a less well-defined short amino acid segment at the minor binding site of Importin α1 (Supplementary Fig. 7c). In order to assess, whether the minor binding site of Importin α1 plays a role in binding the NLS of TAF8, we determined the kinetic parameters for the Importin α1/TAF8-NLS complex IBB formation by isothermal titration calorimetry. Using Importin α1∆ as analyte and an NLS peptide comprising TAF8 residues 288 to 310 as titrant, we obtained a 1:1 binding stoichiometry with a dissociation constant in the low micromolar range ( Kd = 10.4 +/- 0.8 µM; Supplementary Fig. 7d). In accordance with our crystal structure, the binding of the NLS of TAF8 to Importin α1 is driven by enthalpy involving mainly

172 hydrogen bonds and van der Waals interactions (enthalpy change of ∆H = -18.5 +/- 1.3 kcal mol -1 and entropy change of -T∆S = 11.5 kcal mol -1). Taken together, our data suggest the presence of a nuclear import particle in which the TAF2-8-10 complex is bound by the major binding site of Importin α1 via the NLS of TAF8 (Fig. 6).

173 DISCUSSION The structure and function of multiprotein complexes in gene regulation, including TFIID, are an intense focus of current research efforts 41,42 ..The dynamic assembly of gene regulatory complexes in the cell and the functional implications, however, remain poorly understood. Evidence for a modular concept of TFIID assembly derives from earlier studies. In vitro experiments suggested that the Initiator-binding activity of TAF2 markedly increases when TAF1 and TBP were added 11,33. A stable TFIID core complex comprising two copies each of TAF4, 5, 6, 9, 4, 12 was identified in the Drosophila nuclei 21 . We show here that endogenous TAF2 associates with TAF8-10 readily in the cytoplasm to form a stable complex, which is competent to interact with components of the import machinery to shuttle into the nucleus. Importantly, we did not detect a stable endogenous interaction between TAF2 and TAF1, or TAF2 and TBP in our co-IP coupled MS analyses. Association of TAF2, 8 and 10 with the nuclear core-TFIID complex involves an intricate network of interactions between the TAF8 C-terminal tail and TAF2, and the globular HFD of TAF8-10 and core-TFIID. Core-TFIID, thought to represent a central scaffold of canonical TFIID, comprises two copies each of TAF4, 5, 6, 9, and 12, in a symmetric structure 12,21 . Binding of a single copy of TAF8-10 to core-TFIID was shown to cause profound rearrangements within the particle and a transition from symmetry to asymmetry, which may direct the recruitment of the remaining TAFs and TBP in single copy 12 . As if to illustrate this at the atomic level, the crystal structure of the TAF8-10 complex combines symmetric and asymmetric elements. Both TAFs share characteristic folds of their L1 loops, which give rise to pseudo-symmetric structures at the extremities of their HFDs. These pseudo-symmetric L1 loops are characteristic for the TAF8-10 complex since similar arrangements cannot be found in the crystal structures of Drosophila TAF6-9 and human TAF4-12 complexes 23,25 . On the other hand, the presence of the additional helices, αN and αC, render the TAF8-10 complex asymmetric in our crystal structure. Human TAF2 has a high sequence similarity to the protein ERAP-1, which belongs to the family of aminopeptidases 43 . Although catalytically inactive, TAF2 was suggested to share a similar fold 44 . Interestingly, other proteins involved in transcription regulation and chromatin functions, like the FACT subunit Spt16 and the

174 protein Ebp1, also comprise inactive aminopeptidase-like folds 45,46 . The aminopeptidase-like domain of Spt16 was shown to specifically interact with the globular domains and the unstructured tails of histone H3 and H4 46 . We could not detect a complex formation between TAF2 and the histone fold domains of TAF8-10 in analogy to Spt16 (Supplementary Fig. 4). However, we identify here a stable complex between TAF2 and a non-canonical histone fold pair TAF8-TAF10, only involving multiple short domains within the low-complexity TAF8 C-terminal tail. It is interesting to speculate that TAF2 may have evolved its aminopeptidase fold as a specific protein-protein interaction domain for specifically binding peptide segments from its binding partner, TAF8. Cytoplasmic TAF2 may also use the NLS of TAF8 to be indirectly imported into the nucleus via the Importin α/β pathway, similar to TAF10 30 . The peptide arrays revealed a strong TAF2-binding region in the vicinity of the NLS of TAF8 (TAF2-binding region IV). Phosphorylation of residues of the NLS or neighboring residues may have a strong effect on protein import 47 . Phosphorylation of serines close to the NLS of TAF8 may therefore influence the binding kinetics to TAF2 and/or Importin α1 and could provide an additional level of regulation of TFIID assembly in vivo . We identified a novel TFIID building block comprising TAF2-8-10 in the cytoplasm of cells by immunoprecipitation of endogenous TAF2 with a specific antibody followed by proteomics analysis. We characterized the interactions stabilizing this complex in an integrated approach combining native mass- spectroscopy, X-ray crystallography, peptide arrays and biochemical and biophysical means. We demonstrate that Importin α1 can stably associate with a NLS present in this complex, thus giving rise to a putative nuclear import particle. Using highly purified sample, we show that the TAF2-TAF8 interaction we found stabilizes the cytoplasmic complex, and also dictates the incorporation of TAF2 when TFIID is assembled in the nucleus. Our results compellingly support the view that stable partial TFIID complexes, potentially with important functions on their own, exist in the cell. These complexes may represent functional cytoplasmic or nuclear modules (Fig. 6c), which assemble into holo-TFIID in a modular and step-wise fashion. Our results also point to the role of processes such as cytoplasmic-nuclear transport in the regulation of holo-TFIID

175 assembly. It will be exciting to uncover in molecular detail the intricate mechanisms at work in the cell to build the complicated multiprotein machines at the core of vital biological functions.

176 METHODS Methods and any associated references are available in the online version of the paper.

Accession codes. Atomic coordinates and structure factors will be deposited in the .

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGEMENTS We thank the ESRF-EMBL Joint Structural Biology Group for access to ESRF beamlines and the EMBL-ESRF-ILL-INS Partnership for Structural Biology for access to structural biology instrumentation and protein analysis and characterization facilities. We are grateful to Andrew Thompson (SOLEIL) for support during data collection and structure solving. We also thank Aurelian Deniaud (EMBL) and Andrés Palencia (EMBL) for technical help with AUC, SPR and ITC experiments. S.T. was an European Commisssion (EC) Framework Programme (FP) 7 Marie Curie post-doctoral fellow. IB is supported by the EC FP7 ComplexINC project (contract nr. 279039). CS is supported by a European Research Council (ERC) Starting Grant and LT by an ERC Advanced grant. IB, LT and PS acknowledge support from the Agence National de la Recherche ANR Projet Blanc DiscoverIID and of the French Infrastructure for Integrated Structural Biology (FRISBI; ANR-10-INSB-05-01) and INSTRUCT as part of the European Strategy Forum on Research Infrastructures (ESFRI). JR is a Wellcome Trust Senior Research Fellow (grant nr. 084229). The Wellcome Trust Centre for Cell Biology is supported by core grant numbers 077707 and 092076, and the work was also supported by Wellcome Trust instrument grant nr. 091020.

AUTHOR CONTRIBUTIONS ST, CV, LT, IB designed and interpreted experiments. ST and CV produced and purified all proteins and performed characterizations. ST determined crystal structures. ES prepared antibodies. VS, MF performed MudPIT experiments and MF, LT analyzed them. IE and CR performed and analyzed native mass spectrometry experiments. GP, CS, PS prepared EM grids, collected and analyzed data and

177 calculated reconstructions. JZ and JR performed and analyzed CLMS experiments. ST, LT, IB wrote the manuscript with input from all authors.

COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests. MAIN TEXT RERFERENCES 1. Thomas, M.C. & Chiang, C.-M. The general transcription machinery and general cofactors. Crit. Rev. Biochem. Mol. Biol. 41, 105-78 (2006). 2. Rhee, H.S. & Pugh, B.F. Genome-wide structure and organization of eukaryotic pre-initiation complexes. Nature 483, 295-301 (2012). 3. Albright, S.R. & Tjian, R. TAFs revisited: more data reveal new twists and confirm old ideas. Gene 242, 1-13 (2000). 4. Hampsey, M. & Reinberg, D. RNA polymerase II as a control panel for multiple coactivator complexes. Curr. Opin. Genet. Dev. 9, 132-9 (1999). 5. Juven-Gershon, T., Hsu, J.Y., Theisen, J.W. & Kadonaga, J.T. The RNA polymerase II core promoter - the gateway to transcription. Curr. Opin. Cell. Biol. 20, 253-9 (2008). 6. Mohan, W.S., Scheer, E., Wendling, O., Metzger, D. & Tora, L. TAF10 (TAF(II)30) is necessary for TFIID stability and early embryogenesis in mice. Mol. Cell. Biol. 23 4307-18 (2003). 7. Kim, T.H. et al. A high-resolution map of active promoters in the human genome. Nature 436, 876-80 (2005). 8. Voss, A.K. et al. Taube nuss is a novel gene essential for the survival of pluripotent cells of early mouse embryos. Development 127, 5449-61 (2000). 9. Gegonne, A. et al. The general transcription factor TAF7 is essential for embryonic development but not essential for the survival or differentiation of mature T cells. Mol. Cell. Biol. 32, 1984-97 (2012). 10. Vermeulen, M. et al. Selective anchoring of TFIID to nucleosomes by trimethylation of histone H3 lysine 4. Cell 131, 58-69 (2007). 11. Verrijzer, C.P., Chen, J.L., Yokomori, K. & Tjian, R. Binding of TAFs to core elements directs promoter selectivity by RNA polymerase II. Cell 81, 1115-25 (1995). 12. Bieniossek, C. et al. The architecture of human general transcription factor TFIID core complex. Nature 493, 699-702 (2013). 13. Papai, G. et al. Mapping the initiator binding Taf2 subunit in the structure of hydrated yeast TFIID. Structure 17, 363-73 (2009). 14. Grob, P. et al. Cryo-electron microscopy studies of human TFIID: conformational breathing in the integration of gene regulatory cues. Structure 14, 511-20 (2006). 15. Cianfrocco, M.A. et al. Human TFIID binds to core promoter DNA in a reorganized structural state. Cell 152, 120-31 (2013). 16. Brand, M., Leurent, C., Mallouh, V., Tora, L. & Schultz, P. Three- dimensional structures of the TAFII-containing complexes TFIID and TFTC. Science 286, 2151-3 (1999). 17. Matangkasombut, O., Auty, R. & Buratowski, S. Structure and function of the TFIID complex. Adv. Protein Chem. 67, 67-92 (2004). 18. Maston, G.A. et al. Non-canonical TAF complexes regulate active promoters in human embryonic stem cells. Elife 1, e00068 (2012).

178 19. Goodrich, J.A. & Tjian, R. Unexpected roles for core promoter recognition factors in cell-type-specific transcription and gene regulation. Nat. Rev. Genet. 11, 549-58 (2010). 20. Müller, F., Zaucker, A. & Tora, L. Developmental regulation of transcription initiation: more than just changing the actors. Curr. Opin. Genet. Dev. 20, 533-40 (2010). 21. Wright, K.J., Marr, M.T. & Tjian, R. TAF4 nucleates a core subcomplex of TFIID and mediates activated transcription from a TATA-less promoter. Proc. Natl. Acad. Sci. USA 103, 12347-52 (2006). 22. Xie, X. et al. Structural similarity between TAFs and the heterotetrameric core of the histone octamer. Nature 380, 316-22 (1996). 23. Werten, S. et al. Crystal structure of a subcomplex of human transcription factor TFIID formed by TATA binding protein-associated factors hTAF4 (hTAF(II)135) and hTAF12 (hTAF(II)20). J. Biol. Chem. 277, 45502-9 (2002). 24. Gangloff, Y.G. et al. Histone folds mediate selective heterodimerization of yeast TAF(II)25 with TFIID components yTAF(II)47 and yTAF(II)65 and with SAGA component ySPT7. Mol. Cell. Biol. 21, 1841-53 (2001). 25. Birck, C. et al. Human TAF(II)28 and TAF(II)18 interact through a histone fold encoded by atypical evolutionary conserved motifs also found in the SPT3 family. Cell 94, 239-49 (1998). 26. Metzger, D., Scheer, E., Soldatov, A. & Tora, L. Mammalian TAF(II)30 is required for cell cycle progression and specific cellular differentiation programmes. EMBO J. 18, 4823-34 (1999). 27. Tatarakis, A. et al. Dominant and redundant functions of TFIID involved in the regulation of hepatic genes. Mol. Cell 31, 531-43 (2008). 28. Guermah, M., Ge, K., Chiang, C.M. & Roeder, R.G. The TBN protein, which is essential for early embryonic mouse development, is an inducible TAFII implicated in adipogenesis. Mol. Cell 12, 991-1001 (2003). 29. Hernández-Hernández, A. & Ferrús, A. Prodos is a conserved transcriptional regulator that interacts with dTAF(II)16 in Drosophila melanogaster. Mol. Cell. Biol. 21, 614-23 (2001). 30. Soutoglou, E. et al. The nuclear import of TAF10 is regulated by one of its three histone fold domain-containing interaction partners. Mol. Cell. Biol. 25, 4092-104 (2005). 31. Chalkley, G.E. & Verrijzer, C.P. DNA binding site selection by RNA polymerase II TAFs: a TAF(II)250-TAF(II)150 complex recognizes the initiator. EMBO J. 18, 4835-45 (1999). 32. Sypes, M.A. & Gilmour, D.S. Protein/DNA crosslinking of a TFIID complex reveals novel interactions downstream of the transcription start. Nucleic Acids Res. 22, 807-14 (1994). 33. Verrijzer, C.P., Yokomori, K., Chen, J.L. & Tjian, R. Drosophila TAFII150: similarity to yeast gene TSM-1 and specific binding to core promoter DNA. Science 264, 933-41 (1994). 34. Najmabadi, H. et al. Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature 478, 57-63 (2011). 35. Halevy, A. et al. Microcephaly-thin corpus callosum syndrome maps to 8q23.2-q24.12. Pediatr. Neurol. 46, 363-8 (2012).

179 36. Benesch, J.L. & Robinson, C.V. Mass spectrometry of macromolecular assemblies: preservation and dissociation. Curr. Opin. Struct. Biol. 16, 245-51 (2006). 37. Nie, Y., Bellon-Echeverria, I., Trowitzsch, S., Bieniossek, C. & Berger, I. Multiprotein complex production in insect cells by using polyproteins. Methods Mol. Biol. 1091, 131-41 (2014). 38. Davey, C.A., Sargent, D.F., Luger, K., Maeder, A.W. & Richmond, T.J. Solvent mediated interactions in the structure of the nucleosome core particle at 1.9 Å resolution. J. Mol. Biol. 319, 1097-113 (2002). 39. Chen, Z.A. et al. Architecture of the RNA polymerase II-TFIIF complex revealed by cross-linking and mass spectrometry. EMBO J. 29, 717-26 (2010). 40. Conti, E., Uy, M., Leighton, L., Blobel, G. & Kuriyan, J. Crystallographic analysis of the recognition of a nuclear localization signal by the nuclear import factor karyopherin alpha. Cell 94, 193-204 (1998). 41. Hahn, S. Structure and mechanism of the RNA polymerase II transcription machinery. Nat. Struct. Mol. Biol. 11, 394-403 (2004). 42. Kandiah, E., Trowitzsch, S., Gupta, K., Haffke, M. & Berger, I. More pieces to the puzzle: recent structural insights into class II transcription initiation. Curr. Opin. Struct. Biol. 24, 91-7 (2014). 43. Kochan, G. et al. Crystal structures of the endoplasmic reticulum aminopeptidase-1 (ERAP1) reveal the molecular basis for N-terminal peptide trimming. Proc. Natl. Acad. Sci. USA 108, 7745-50 (2011). 44. Aravind, L. & Koonin, E.V. Eukaryotic transcription regulators derive from ancient enzymatic domains. Curr. Biol. 8, R111-3 (1998). 45. Kowalinski, E. et al. The crystal structure of Ebp1 reveals a methionine aminopeptidase fold as binding platform for multiple interactions. FEBS Lett. 581, 4450-4 (2007). 46. Stuwe, T. et al. The FACT Spt16 "peptidase" domain is a histone H3-H4 binding module. in Proc. Natl. Acad. Sci. USA Vol. 105 8884-9 (2008). 47. Nardozzi, J.D., Lott, K. & Cingolani, G. Phosphorylation meets nuclear import: a review. Cell Commun. Signal 8, 32 (2010).

180 FIGURES

Figure 1

Figure 2

181 Figure 3

182 Figure 4

183 Figure 5

184 Figure 6

185 Table 1 Data collection and refinement statistics

TAF8-10 TAF8-10 Importin α1 / TAF8- Native S-SAD NLS Native Data collection Space group P3 121 P3 121 P2 12121 Cell dimensions a, b, c (Å) 51.32, 51.32, 144.40 51.30, 51.30, 144.70 54.27, 77.72, 128.57 α, β, γ (°) 90, 90, 120 90, 90, 120 90, 90, 90 Wavelength 0.98011 1.90745 0.93340 Resolution (Å) 44.44-1.91 (1.98-1.91)* 48.23-2.61 49.54-1.75 (1.81-1.75) Rmerge 2.92 (87.98) 2.00 5.00 (77.45) I / σI 23.4 (1.68) 48.43 19.18 (2.03) Completeness (%) 93.91 (58.53) 100.00 99.60 (99.14) Redundancy 4.80 (4.30) 20.91 4.60 (4.60)

Refinement Resolution (Å) 44.44-1.91 49.54-1.75 No. reflections 16793 (1630) 12943 55423 (5436) Rwork / Rfree 20.5 (33.1) / 23.7 (35.5) 15.3 (23.9) / 18.0 (27.2) No. atoms 1474 3877 Protein 1404 3366 Ligand/ion 7 48 Water 63 463 B-factors Protein 58.2 30.9 Ligand/ion 59.9 59.6 Water 51.9 44.8 R.m.s deviations Bond lengths (Å) 0.003 0.008 Bond angles ( °) 0.681 1.158 *Values in parentheses are for highest-resolution shell.

186 FIGURE LEGENDS

Figure 1 A TAF2-8-10 complex in the cytoplasm. (a) Purified, polyclonal anti-TAF2 antibodies specifically recognize recombinant and endogenous TAF2. Recombinant (rec.) purified TAF2 (10 ng, lane 1) and immunopurified TFIID (300 and 150 ng; lanes 2, 3) were loaded on an 8% SDS-PAGE, blotted and analyzed by western blot assay. Protein size markers are indicated. ( b) Abundances of individual proteins co- immunoprecipitated from nuclear or cytoplasmic HeLa cell extracts (grey or black bars, respectively) using purified polyclonal anti-TAF2 antibodies were compared in units of normalized spectral abundance factors (NSAFs). Each column is the average of two independent experiments and error bars represent standard deviations. (c) Domain organization of TAF2, TAF8 and TAF10 in a schematic view. Grey rectangles indicate predicted, unstructured regions. The nuclear localization sequence (NLS) of TAF8 is shown as a black bar. Numbers indicate first and last amino acids in each protein. HFD, histone fold domain.

Figure 2. Recombinant TAF2-8-10 complex. (a) TAF2, the TAF8-10 pair and a mixture of TAF2-8-10 were analyzed by SEC. Elution profiles of TAF2 (green), TAF8-10 (blue) and TAF2-8-10 (purple) are plotted in relative absorption units at 280 nm versus elution volume (top). Fractions are numbered (top of graph). SDS-PAGE analyses of the eluted samples are shown (below). Molecular masses of protein standards are indicated on the left of gel sections. Protein denominations are shown on the right. First lane shows the SEC input (IN). ( b) Absorbance c(s) profiles from sedimentation velocity analytical ultracentrifugation experiments are plotted for TAF2 (green), TAF8-10 (blue) and TAF2-8-10 (purple). ( c) Mass spectrum of TAF2- 8-10 complex electrosprayed from an aqueous ammonium acetate solution under high collision energy for subunit dissociation. The MS spectrum reveals peaks with corresponding masses for a TAF8-10 dimer (blue dots), TAF2 subunit (green dots) and a predominant TAF2-8-10 complex (purple dots) centered at 4000 m/z, 6000 m/z and 7500 m/z respectively. At 12000 m/z is a TAF2-8 dimer (yellow squares) resulting from the dissociation of TAF10 subunit (light blue dots) from the intact TAF2-8-10 complex. Proteins and protein complexes are shown schematically as circles.

187 Figure 3. Analysis of TAF8-TAF10 interactions. ( a) Crystal structure of human TAF8-10 complex is shown in a cartoon representation. Two orientations related by a vertical rotation of 90° are shown. TAF8 is colored in blue and TAF10 in green. The disordered L2 loop of TAF10 is represented by a dotted line. Secondary structure elements and loops are labeled. The TAF8-TAF10 complex adopts a non-canonical HFD pair. (b-e) Close-up views of the interactions between TAF8 and TAF10. Key interacting residues are highlighted. All structure drawings were generated with PyMOL (http://www.pymol.org/). ( f) Pull-down experiments of TAF2 fused to maltose-binding protein (MBP) analyzing interactions with TAF8-10, TAF8- TAF10 ∆N and TAF8 ∆C-TAF10 HFD pairs (see main text for details). Unfused MBP is included as a control. Input samples (top) and samples precipitated on amylose resin (bottom) were resolved on 4-12% gradient gels. Protein identities are shown on the right.

Figure 4. Analysis of TAF8-TAF2 interactions. ( a) Binding of His-tagged TAF2 is monitored to overlapping peptides of the TAF8 C-terminal region (residues 105-310) spotted onto nitrocellulose membranes (spots A2-G4, left) utilizing a peptide array. Bound TAF2 was visualized by luminol reaction and signal intensities were plotted for each spot after background subtraction (right). Spots A1, G5 and G6 served as positive controls. TAF2 protein was omitted for the control membrane. The four major binding regions are indicated above the histogram. ( b) Surface plasmon resonance (SPR) experiments with immobilized full-length TAF2 as ligand and MBP- fusions of TAF8 fragments 105-310, 141-310, 200-310 and 105-260 as analytes. TAF8 deletion constructs are schematically shown as bar diagrams (left). TAF2- interacting regions on TAF8 as identified in panel a are highlighted. SPR sensorgrams at identical analyte concentrations of 500 nM are plotted as response units (RU) versus time (right). ( c) SEC analyses assessing the influence of TAF8 point mutations on TAF2 binding. Elution profiles for the indicated proteins and protein complexes are plotted on the left and SDS-PAGE analyses of each run are shown on the right. Molecular masses of protein standards are denoted on the left of the gels and protein names on the right. MBP, maltose-binding protein; HFD, histone fold domain.

188 Figure 5. TAF8 promotes TAF2 incorporation in TFIID assembly. (a) TFIID components studied are shown as bar diagrams. Predicted low-complexity regions of the proteins are colored in grey. A black bar denotes the TAF8 NLS. C-terminally truncated TAF8 (TAF8 ∆C), which was used to reconstitute the 7TAF ∆ complex, is depicted on the right. Numbers denote first and last amino acids for each protein. HFD, histone fold domain; NTD, N-terminal domain. (b) Impact of the TAF8 truncation on TAF2 binding to 7TAF complexes. SEC elution profiles for indicated proteins and protein complexes are shown (top). Corresponding SDS-PAGE gel sections of peak fractions of each run are shown (bottom). Protein size markers are shown on the left; protein identities on the right. ( c). 3D single particle EM reconstruction of negatively-stained 8TAF complex (grey mesh) overlayed on 7TAF complex (yellow, from Ref. 12) is shown in two views related by a 90° rotation as indicated (arrow). Difference density attributed to bound TAF2 is highlighted in blue. (e) Protein-protein cross-link maps for the 7TAF complex (left) and the 8TAF complex (right) are shown. Circle sizes represent relative molecular weights of each protein. Black lines connect cross-linked proteins. Grey bars superimposed on black lines indicate cross-link frequencies (www.crosslinkviewer.org).

Figure 6. Putative nuclear import particle comprising TAF2-8-10 complex and IBB Importin α1. ( a) A complex consisting of TAF2, 8, 10 and Importin α1∆ was IBB formed from highly purified components. Importin α1∆ was mixed in a 2-fold molar excess with purified TAF2-8-10 and the mixture purified by SEC. SDS-PAGE analysis of the peak fraction is shown. ( b) Importin α1-TAF8 complex crystal structure. Magnified view of interacting residues of the major binding site of Importin α1 (grey) with residues of the NLS of TAF8 (blue). Importin α1 is shown in ribbon representation and the TAF8-NLS as a stick model. TAF8 residues R303 and R304 are not involved in contacting Importin α1 but stabilized by crystal contacts, and are omitted for clarity. ( c) Cartoon model of cytoplasmic TAF2-8-10 complex and nuclear holo-TFIID assembly. The NLS of TAF8 is filled in black. The TAF2- interaction domain within TAF8 is highlighted by shading. TAF2, blue; TAF8, green; TAF10, orange. HFD, histone fold domain. The TAF2-8-10 complex resides in the cytoplasm, whereas the physiological symmetric core-TFIID complex is found in the nucleus 10,23 . The cryo-EM density envelope of core-TFIID complex is shown

189 (adapted from Ref. 12). Upon binding of Importin α1 (grey) to the TAF8 NLS, TAF2-8-10 translocates into the nucleus through a nuclear pore (arrow). In the nucleus, Importin α1 is released and TAF2, 8 and 10 associate with core-TFIID, to form intermediates including the pronouncedly asymmetric 8TAF complex along the holo-TFIID assembly pathway.

ONLINE METHODS DNA constructs. Coding sequences of full-length TAF8 (Uniprot accession number Q7Z7C8) and TAF10 (Uniprot accession code Q12962) were synthesized at GenScript (New Jersey, USA) as a polyprotein construct and cloned into pPBac vector 37 via restriction sites Bst EII and Rsr II. The triple alanine mutant of the TAF8- 10 polyprotein construct was generated by substituting the Bst EII-Apa I fragment with a synthetic DNA fragment (GenScript) carrying mutated codons for TAF8 residues 185-187 (DVE to AAA), 222-224 (PYL to AAA) and 293-295 (PYL to AAA). ORFs coding for deletion constructs of TAF10 (with engineered N-terminal, Tobacco Etch virus (TEV)-cleavable deca-His-tag) and TAF8 were subcloned into transfer vectors pFL and pIDC via sequence and ligation independent cloning (SLIC), respectively 48,49 . The TAF2 coding sequence (UniProt accession number Q6P1X5 VAR_027855) was cloned into a modified pFL vector coding for an engineered N-terminal TEV- cleavable deca-His tag via restriction sites SalI and Hind III. The mCherry-TAF2 construct was cloned by inserting the mCherry-coding sequence via the SalI cleavage site into the pFL-HisTEVTAF2 vector. The MBP-TAF2 construct was generated in analogy to the mCherry-TAF2 construct. Transfer vectors were either first fused in vitro by Cre-LoxP recombination or directly integrated into the EmBacY baculovirus genome by in vivo transposition 50,51 . Transfer vectors coding for subcomplexes TAF5-6-9 (pPBac-3TAF) and TAF4-12 (pDiFB-412) are described elsewhere 12 .

190 For MBP-fusion constructs, coding sequences of truncation versions of the TAF8 protein were amplified via PCR from the synthetic polyprotein construct and cloned into pMAL-c vector (Novagen) with engineered C-terminal hexa-His tags via IBB SLIC. The vector coding for human Importin α1∆ (residues 60-529, UniProt accession number P52292) was described elsewhere 52 . A shortened version of Importin α1 (residues 71-497) was generated by amplifying the coding region of Importin α1 by PCR and insertion of the fragment into the same vector backbone via NdeI and SacI cleavage sites. All constructs were verified by DNA sequencing.

Sequence alignments. Primary protein sequences were aligned using the ClustalW2 server 53 and alignments were plotted using ESPript (http://espript.ibcp.fr) 54 . Protein sequences were retrieved from the UniProt server (www..org) for human TAF8 (UniProt accession number Q7Z7C8), human TAF10 (UniProt accession number Q12962), human TAF3 (UniProt accession number Q5VWG9) and human SPT7L (UniProt accession number O94864). Protein production and purification. Maltose-binding protein (MBP)-TAF8 fusion proteins were produced in E. coli Rosetta TM (DE3) cells (Novagen) transformed with plasmids pMAL-c_TAF8_105-310, pMAL-c_TAF8_105-260, pMAL-c_TAF8_141- 310, and pMAL-c_TAF8_200-310. Cells were grown in LB broth (Miller’s) medium supplemented with 34 µg/ml Chloramphenicol and 100 µg/ml Ampicillin at 37 °C.

Temperature was decreased to 20 °C at an optical density (OD 600 ) of 0.4. Protein production was induced at an OD 600 of 0.8 by addition of isopropyl β-D-1- thiogalactopyranoside (IPTG) to a final concentration of 1 µM. Cells were harvested 18 h post induction by centrifugation. Cell pellets were flash frozen in liquid nitrogen and resuspended in lysis buffer (50 mM HEPES-NaOH, pH 7.6, 300 mM NaCl, 5 mM Imidazole, 1 µM Leupepsin, 1 µM Pepstatin, 50 µg/ml lysozyme) supplemented with 1 tablet of cOmplete EDTA-free protease inhibitor cocktail (Roche). Crude extracts were prepared by sonication and cleared by centrifugation (Beckman JA-20 rotor, 45 min, 20000 rpm, 4 °C). All purification steps were performed at 4 °C on ÄKTA prime and purifier systems (GE Healthcare). Soluble extracts were passed over a 5 ml column of TALON ® metal affinity resin (Clontech) equilibrated in lysis buffer. The resin was washed with 10 cv of washing buffer (50 mM HEPES-NaOH, pH 7.6, 300 mM NaCl, 5 mM Imidazole, 0.01% [v/v] NP-40, 1 µM Leupepsin, 1 µM

191 Pepstatin) and 10 cv of washing buffer without NP-40. MBP-fusions were eluted by a linear imidazole gradient of 16 cv into elution buffer (50 mM HEPES-NaOH, pH7.6, 300 mM NaCl, 300 mM Imidazole, 1 µM Leupepsin, 1 µM Pepstatin). Peak fractions were pooled and concentrated to 1 ml via ultrafiltration in 15 ml, 10 MWCO spin concentrators (Millipore). Fusion proteins were further purified to homogeneity via size exclusion chromatography on a Superdex 200 16/60 column (GE Healthcare) in a buffer comprising 25 mM HEPES-NaOH, pH 7.6, 300 mM NaCl and supplemented with cOmplete EDTA-free protease inhibitor cocktail (Roche). Proteins were concentrated to ~ 10 mg/ml, flash-frozen in liquid nitrogen and stored at – 80 °C in aliquots. Importin α1 constructs (residues 60-529 or residues 71-497) were produced and purified essentially as described elsewhere 52 . However, E. coli Rosetta TM (DE3) cells (Novagen) were used and protein production was induced by the addition of 1 mM IPTG. Production and purification of core-TFIID and 7TAF complex was performed as described 12 . Proteins TAF2, MBP-TAF2, mCherry-TAF2 and TAF8-10 constructs were produced using the MultiBac system 50,51 . Pellets of baculovirus-infected Sf21 cells expressing TAF2 constructs were resuspended in lysis buffer (50 mM HEPES-NaOH, pH 7.5, 500 mM NaCl, 10 mM Imidazole, 1 µM Leupepsin, 1 µM Pepstatin, supplemented with cOmplete EDTA-free protease inhibitor cocktail) and crude extracts were prepared by sonication and subsequently cleared by centrifugation (Beckman JA-25.50 rotor, 1 h, 25000 rpm, 4 °C). TAF2 constructs were captured from soluble extract via TALON ® metal affinity resin (Clontech) in batch. The resin was extensively washed with lysis buffer and TAF2 constructs were eluted in lysis buffer supplemented with 250 mM Imidazole. Proteins TAF2 and mCherry-TAF2 were furthermore purified by ion exchange chromatography using a 5 ml SP- Sepharose HiTrap column (GE Healthcare) and eluted from the column by a linear salt gradient in a buffer comprising 25 mM HEPES-NaOH, pH 7.5, 150 to 1000 mM NaCl, 1 mM β-mercaptoethanol, 1 µM Leupepsin and 1 µM Pepstatin. Prior to ion exchange chromatography the His-tag was removed by incubating the proteins with Tobacco Etch Virus (TEV) protease (produced in house). TAF2 constructs were finally polished by gel filtration using Superdex200 10/300 or Superose6 10/300 columns (GE Healthcare) in a buffer comprising 25 mM HEPES-NaOH, pH 7.5, 500

192 mM NaCl, 1 mM β-mercaptoethanol. Proteins were concentrated to ~ 5-20 mg/ml, flash-frozen in liquid nitrogen and stored at – 80 °C in aliquots. TAF8-10 constructs were purified essentially as described for the TAF2 constructs. However, the lysis and gel filtration buffer contained 150 mM NaCl instead of 500 mM and protein complexes were gel filtrated right after elution from the TALON resin. For crystallization purposes, the complexes were gel filtrated in a buffer comprising 10 mM Tris-HCl, pH 7.5, 150 mM NaCl, 1 mM dithiothreitol (DTT).

Binding experiments. Size exclusion chromatography (SEC) experiments were done on AKTA purifier or AKTA Micro systems (GE Healthcare) using Superdex200 10/300, Superdex200 PC3.2 or Superose6 PC3.2 columns. Binding experiments shown in Fig. 1a were performed in a running buffer comprising 25 mM HEPES- NaOH, pH 7.5, 150 mM NaCl, 1 mM DTT, whereas SEC runs shown in Fig. 4 and Fig. 5 were performed in a running buffer comprising 25 mM HEPES-NaOH, pH 7.5, 500 mM NaCl, 1 mM DTT.

Crystallization and structure determination. Screening for crystallization conditions were performed at the High Throughput Crystallization Laboratory (EMBL, Grenoble, France). Despite intensive screening, no crystals were obtained from full-length TAF8-10 complex. A truncated complex of TAF8-10 was obtained from limited proteolysis using chymotrypsin. Stable fragments of the proteins TAF8 and TAF10 were identified by N-terminal sequencing in conjunction with mass spectrometry at facilities of the Partnership for Structural Biology (PSB, Grenoble, France). Crystals of a truncated TAF8-10 complex (TAF8 residues 25-120 and TAF10 residues 112-212) were refined manually by sitting-drop vapor diffusion by mixing equal volumes of protein solution containing 15-25 mg/ml TAF8-10 in 25 mM Tris-HCl, 150 mM NaCl at pH 7.5 and crystallization solution containing 1.4 M

Na/K PO 4 at pH 7.6. Crystals grew in space group P3 121 with cell dimensions of a=b=51.3 Å and c=144.8 Å within 1 - 2 weeks. Crystals were cryoprotected by addition of 20% [v/v] Glycerol and flash frozen in liquid nitrogen. Diffraction data were collected at 100 K on beamline PROXIMA 1 using a Pilatus 6M detector (SOLEIL synchrotron, Gyf-sur-Yvette, France) and were integrated and scaled using

193 XDS 55,56 . The structure of TAF8-10 was solved by the Sulfur-SAD method. A partial model could be built into the experimental electron density map by iterative rounds of density modification and automated structure building using the programs Pirate and Buccaneer from the CCP4i suite 57 . The model was used to phase a high-resolution dataset, which was collected on beamline PROXIMA 1 (SOLEIL, Paris, France) by molecular replacement using Phaser 57,58 . Diffraction data from the high-resolution dataset were analyzed and corrected for anisotropy using the Diffraction Anisotropy Server (services.mbi.ucla.edu/anisoscale) and an isotropic B of -11.99 A2 was applied to the data set 59 . The TAF8-10 structure was iteratively built and refined using the programs Coot and Phenix, respectively 60,61 . Final refinements included TLS parameters and individual B-factor refinement. Crystals of Importin α1 (residues 60-529) with a synthetic TAF8 NLS peptide (amino acids 297-PVKKPKIRRKKSLS-310 purchased from Peptide Specialty Laboratory GmbH, Germany) were grown by mixing 2 µl of protein solution containing 8 mg/ml Importin α1/TAF8-NLS in 10 mM Tris-HCl, pH 8.0, 150 mM NaCl and 1 mM DTT with 1 µl reservoir solution containing 100 mM HEPES-NaOH, pH 7.1, 12% [w/v] polyethylene glycole 3350 and 200 mM L-proline in sitting drop vapor diffusion plates. Crystals grew in space group P2 12121 with cell dimensions of a=54.3 Å, b=77.7 Å and c=128.6 Å within ~ 2 months. Crystals were cryoprotected by addition of 30% [v/v] ethylene glycole and flash frozen in liquid nitrogen. Datasets were collected at 100 K on beamline ID14-1 using an ADSC Quantum Q210 detector (European Synchrotron Radiation Facility, Grenoble, France). Diffraction data were integrated and scaled using XDS 55,56 . The structure of Importin α1/TAF8-NLS was solved by molecular replacement using the atomic coordinates of Importin α1 (PDB ID 3RZ9) 62 as a search model. The Importin α1/TAF8-NLS structure was refined in Phenix 60 using TLS parameters as well as occupancy and individual B-factor refinements and manually built using Coot 61 .

Surface plasmon resonance. Biosensor experiments were performed at 25 °C on a BIACORE 3000 (Biacore AB, Uppsala). TAF2 ligand was immobilized onto CM5 sensor chips (GE Healthcare) to a level of 2500 response units (RU) using amine- coupling chemistry. A second flow cell was treated equally but without the ligand and was used as control. Truncation mutants of TAF8 fused C-terminally to maltose

194 binding protein (MBP) were serially diluted into running buffer (25 mM HEPES- NaOH, pH 7.0, 300 mM NaCl, 0.01% [v/v] NP-40). For association phase, 150 µl of analyte at a concentration of 500 nM were injected over both flow cells at a flow rate of 25 µl/min and dissociation phases were monitored for 200 seconds by injecting running buffer only. Binding responses were recorded and responses from referencing sensorgrams were subtracted using BIAevaluation software (GE Healthcare). Data were globally analyzed with the analysis software.

Isothermal titration calorimetry. Calorimetric experiments were conducted in duplicates with a MicroCal iTC 200 instrument (GE Healthcare) at 25 °C. Importin α1 (residues 71-479) and the TAF8 NLS peptide (residues 288- NPYLRPVKKPKIRRKKSLS-310 purchased from Peptide Specialty Laboratories GmbH, Germany) were extensively dialyzed against ITC buffer (25 mM Tris-HCl (pH 8.0), 150 mM NaCl, 1 mM b-mercaptoethanol) and used at concentrations of 39 µM and 1.5 mM, respectively. Protein concentrations were determined by absorbance spectroscopy at 280 nm with calculated extinction coefficients of 48930 M-1 cm -1 for Importin α1 and 1490 M-1 cm -1 for the TAF8 peptide. 1.5 µl TAF8 peptide were injected for 3 sec with a spacing of 180 sec between injections into 200 µl of Importin α1. Heat changes were recorded over a total of 26 injections. Acquired calorimetric titration data were integrated, corrected for heat of dilution of the TAF8 peptide alone and analyzed using Origin software version 7.0 according to a one-site binding model.

Binding stoichiometry (n), association constant ( Ka), binding enthalpy ( ∆H) and entropy change ( ∆S) were deduced from fitted isotherms by nonlinear regression. Gibbs free energy difference was calculated using the equation ∆G = ∆H - T∆S.

Pull-down assays. MBP pull-down assays were performed by mixing 10 µg bait (MBP or MBP-TAF2) with 10 µg prey (TAF8-10, TAF8 ∆C-TAF10, TAF8- TAF10 ∆N) for 1 h at 4 °C in binding buffer (25 mM HEPES-NaOH, pH 7.5, 500 mM NaCl, 5% [v/v] glycerol, 2 mM β-mercaptoethanol). Protein mixtures were incubated with 20 µl Amylose resin (New England Biolabs) for 1.5 h on a head-over-tail rotator at 4 °C. The resin was washed three times with binding buffer, once with washing buffer (25 mM HEPES-NaOH, pH 7.5, 500 mM NaCl, 5% [v/v] glycerol, 2 mM β- mercaptoethanol, 0.05% NP-40) and again three times with binding buffer. Proteins

195 were eluted in 15 µl binding buffer containing 30 mM D-maltose. Input samples and precipitates were mixed with SDS loading dye and resolved on 4-12% Bis-Tris NuPAGE gels (Invitrogen) with a MES running buffer. Proteins were visualized by Coomassie-blue staining.

Limited proteolysis experiments. 1 mg/ml of the TAF8-10 complex was treated with Chymotrypsin at an enzyme to protein ratio of 1:10 [w/w] and samples were taken after 2, 5, 10, 20, 40 and 60 min and analyzed by SDS-PAGE. In order to identify the TAF8-10 core complex, the proteolyzed sample was loaded on a Superdex75 10/300 size exclusion chromatography column (GE Healthcare) prior to N-terminal sequencing and mass spectrometry analysis of the comigrating polypeptides.

Peptide arrays. Pepscan libraries of the C-terminal region of TAF8 (residues 105- 310) were immobilized on cellulose membranes via double β-alanine anchors and assembled using the SPOT technology (AG Molekulare Bibliotheken, Institut für Medizinische Immunology, Charité - Universitätsmedizin Berlin, Germany). Overlapping 20-mer peptides of TAF8 were synthesized by Fmoc (9- fluorenylmethoxycarbonyl) chemistry with an offset of 3 amino acids between neighboring spots. Low density hexa-Histidine peptides were used as controls at the beginning and the end of each array. Pepscan membranes were blocked in blocking buffer (50 mM Tris-HCl, pH 7.6, 500 mM NaCl, 20% [w/v] sucrose, 3% [w/v] bovine serum albumin) for 1h at 4 ˚C, washed with TBS (50 mM Tris-HCl, pH 7.6, 500 mM NaCl) and incubated for 1.5 h with His-tagged TAF2 at a concentration of 10 µg/ml in blocking buffer or with blocking buffer alone. Membranes were subsequently incubated with mouse anti-His monoclonal primary antibody (SIGMA H1029, dilution 1:3000) and peroxidase-conjugated anti-mouse secondary antibody (SIGMA A5906, dilution 1:10000) in blocking buffer. Membranes were washed 3 times with TBS between each incubation step. After addition of luminol solution (Pierce) luminescence was detected on a KODAK 4000MM photoimager. Images were analyzed using the Dot Blot Analyzer tool in ImageJ.

196 Analytical Ultracentrifugation. Analytical ultracentrifugation sedimentation velocity experiments were performed in a Beckman XL-I analytical ultracentrifuge (Beckman Coulter). The purified proteins and protein complexes TAF2, TAF8-10 and TAF2-8-10 were loaded into sapphire-windowed cells with 12 mm optical path length and spun in an An-60Ti rotor (Beckman Coulter). Absorbance at 280 nm was measured for 16 hours at 42,000 rpm and 10 °C. The data were analyzed in terms of continuous size-distribution (c(s)) with the Sedfit program 63 , considering 200 particles with sedimentation coefficients, s, between 0.1 and 20 S. TAF2 and the TAF2-8-10 complex are rather compact based on SEC, while TAF8-10 is elongated. Therefore, we used a frictional ratio of 1.4 for TAF2 and TAF2-8-10 and of 1.6 for TAF8-10. A partial specific volume of 0.73 was used for all samples. A regularization procedure was also applied with a confidence level of 0.68. Sample densities and viscosities were 1.023 g/mL and 1.40 mPa.s respectively for TAF2 and TAF2-8-10 and were 1.021 g/mL and 1.31 mPa.s for TAF8-10, as determined with Sednterp 64 .

Native Mass Spectrometry. Prior to native mass spectrometric analyses, 30 µl of purified TAF2-8-10 complex were buffer exchanged by ultrafiltration using Amicon spin concentrators (Millipore, 10 kDa MWCO) into 500 mM ammonium acetate buffer (pH 7.5). All mass spectrometry experiments were performed on a Quadrupole Time-of-flight (Q-ToF) II mass spectrometer (Waters, Manchester, UK) in the positive ion mode 65,66 . For data acquisition, 2 µl of the sample were injected into the mass spectrometer with gold-coated capillary needles made in-house using a needle puller (Harvard Apparatus, Holliston, MA, USA). MS spectra were acquired using a capillary voltage of 1.7 kV and cone and collision voltages of 100 V. Time-of-flight and analyzer pressures were at 5.6 x 10 -6 and 4.2 x 10 -4 mbar, respectively. Datasets were acquired and processed with MassLynx V4.1 software (Waters, UK) with minimal smoothing and no background subtraction. The recorded mass spectra were calibrated externally using 100 mg/ml caesium iodide in water.

Cross-linking and mass spectrometry (CLMS) analyses. 7TAF complexes were reconstituted by mixing the individual modules TAF5-6-9, TAF4-12, and TAF8-10 in a 1:2:2 molar ratio and purified using SEC in a buffer comprising 25 mM HEPES- NaOH, pH 7.6, 150 mM NaCl and 5 mM β-mercaptoethanol. 8TAF complexes were

197 reconstituted from 7TAF complexes by adding 2-fold molar excess of TAF2 and removal of unbound TAF2 from 8TAF complexes by SEC in the same buffer. 7TAF (200 µg) and 8TAF complexes (200 µg) were cross-linked by BS 3 (Bis- sulfosuccinimidyl suberate, Thermo Scientific) at complex/BS3 ratio of 1:5 (w/w) in cross-linking buffer (25 mM HEPES-NaOH, pH 7.6, 150 mM NaCl, 1 mM DTT) for 2 h on ice. The reaction was quenched by addition of saturated ammonium bicarbonate solution and further incubation on ice for 45 min. The cross-linked samples were concentrated by ultrafiltration using spin concentrators (Millipore) and separated by SDS-PAGE on NuPAGE 3-8% Tris-Acetate gels run in Tris-Acetate SDS running buffer (Invitrogen). Gels were stained with Coomassie brilliant blue, destained and bands corresponding to the cross-linked complexes were excised from the gels. Cross-linked complex proteins were reduced, alkylated and trypsin digested following standard procedures 67 . Cross-linked peptides were fractionated using SCX- StageTips following the published protocol for linear peptides and desalted using C18 StageTips 68-70 . Mass spectrometry – Peptides were analyzed on an LTQ Orbitrap Velos mass spectrometer coupled with an UltiMate 3000 Rapid Seperation LC system (Thermo Fisher Scientific). The column was packed into a spray emitter (75-µm inner diameter, 8-µm opening, 250-mm length; New Objectives) with C18 material (ReproSil-Pur C18-AQ 3 µm; Dr Maisch GmbH, Ammerbuch-Entringen, Germany) using an air pressure pump (Proxeon Biosystems) 71 . Mobile phase A consisted of water and 0.1% formic acid. Mobile phase B consisted of 80% acetonitrile and 0.1% formic acid. Peptides were loaded onto the column with 2% B at 500 nl/min flow rate and eluted at 300 nl/min flow rate in two steps: linear increase from 2% B to 40% B in 139 minutes; then increase from 40% to 95% B in 11 minutes. The eluted peptides were directly sprayed into the mass spectrometer. Peptides were analyzed using a high/high strategy: both MS spectra and MS2 spectra were acquired in the Orbitrap. MS spectra were recorded at 100,000 resolution. The eight highest intensity peaks with a charge state of three or higher were selected in each cycle for iontrap fragmentation. The fragments were produced using CID with 35% normalized collision energy and detected by the Orbitrap at 7500 resolution. Dynamic exclusion was set to 90s and repeat count was 1. Data processing – The mass spectrometric raw files were processed into peak lists using MaxQuant (version 1.3.0.5) 72 at default parameters except for “top MS/MS 198 peaks per 100 Da” being set to 100. Search was conducted against TAF complex sequences using Xi software (version 1.3.355). Search parameters were MS accuracy, 6ppm; MS/MS accuracy, 20ppm; enzyme, trypsin; cross-linker, BS3 (including BS3 modification); . missed cleavages, 4; fixed modification, carbamidomethylation on cysteine; variable modifications, oxidation on methionine; cross-linkable amino acids, N-terminus, lysine, serine, tyrosine and threonine; fragments, b and y ions with loss of H 2O, NH 3 and CH 3SOH.

Antibody production and purification. TAF2 antibodies were generated by immunizing rabbits with purified baculovirus overexpressed full-length human TAF2. Immunization and antibody purification was done as described 73 with the following modifications: 2 mg of recombinant full-length human TAF2 were fixed on 400 µl Affi-Gel 10/15 beads (Bio-Rad) for 2 hours at 4°C with gentle agitation in phosphate- buffered saline (PBS). Free active esters were blocked with 1 M ethanolamine HCl (pH 8) solution for 1 hour at 4 °C under gentle agitation. The TAF2-bound gel was transferred to a column and washed 4 times with 10 volumes of PBS. 10 ml of rabbit polyclonal antibody sera raised against human TAF2 were applied twice and the column was washed with 10 ml of PBS prior to elution. Bound antibodies were eluted with 0.1 M glycine (pH 2.5) buffer. 500 µl fractions of purified antibody were collected and quickly neutralized by adding 50 µl of 2 M Tris-HCl (pH 8.8) buffer.

Protein extract preparations, immunoprecipitation and MudPIT analyses. Hela cell nuclear extract (NE) preparations and immunoprecipitations were essentially done as described elsewhere 74 . HeLa cell cytoplasmic extract was prepared following the NE preparation protocol until opening the cells by the first dounce step. The resulting supernatant containing the cytoplasm was precipitated by adding stepwise 0.3 g/ml of ammonium sulfate under agitation at 4 °C for 30 minutes. Precipitated proteins were collected by centrifugation with 30 000 g at 4°C for 20 minutes. The protein pellet was resuspended in 1/5 volume of the starting volume and dialyzed over night as described in the NE preparation protocol. Immunoprecipitated proteins were eluted from the protein G columns with a 0.1 M glycine (pH 2.5) solution and quickly neutralized with a 2 M Tris-HCl (pH 8.8) buffer. MudPIT analyses of immunoprecipitated proteins were performed as previously described 75,76 . In brief, protein mixtures were TCA precipitated, urea- 199 denaturated, reduced, alkylated and digested with endoproteinase Lys-C followed by modified trypsin digestion. Peptide mixtures were loaded onto a triphasic 100 mm diameter fused silica microcapillary column as described 77 . Loaded microcapillary columns were placed in-line with a Quaternary Dionex Ultimate 3000 HPLC pump and a LTQ Velos linear ion trap mass spectrometer equipped with a nano-LC electrospray ionization source (Thermo Fischer Scientific). A fully automated 12- steps MudPIT run was performed as previously described during which each full MS scan (from 300 to 1700 m/z range) was followed by 20 MS/MS events using data- dependent acquisition 75 . Proteins were identified by database searching using SEQUEST with ThermoProteome Discoverer 1.4 (Thermo Fischer Scientific) 78 . Tandem mass spectra were searched against a human protein sequence database (from the Homo sapiens 2013-04-03 Swissprot release). In all searches, cysteine residues were considered to be fully carboxyamidomethylated (+57 Da statically added) and methionine considered to be oxidized (+16 Da dynamically added). Relative protein abundance for each protein in a given sample was estimated by Normalized Spectral Abundance Factor (NSAF) 79 . The NSAF values were calculated from the spectral counts of each identified protein. To account for the fact that larger proteins tend to contribute more peptide/spectra, spectral counts were divided by protein length to provide a spectral abundance factor (SAF). SAF values were then normalized against the sum of all SAF values in the corresponding run allowing the comparison of protein levels across different runs.

Electron microscopy. Specimen preparation. The 8TAF sample was stabilized by mild glutaraldehyde cross-linking by using the GraFix method 80 as described previously 12 . Briefly, 200 µl of purified 8TAF complexes were loaded on a 4 mL centrifugation tube containing a 10 to 30% glycerol and a 0 to 0.15% glutaraldehyde gradient. The gradient was centrifuged for 18 h at 34,000 rpm and 4°C (SW60 rotor, Beckman Coulter) and the fractions containing the stabilized sample were deposited onto a buffer exchange column (Zeba spin desalting columns, Pierce) to remove the excess of glycerol. Specimen was adsorbed onto a thin layer of carbon deposited on an electron microscopy grid and negatively stained for 45 s with a drop of 2% of uranyl acetate. The particles were imaged using a transmission electron microscope (Tecnai F20 G2, FEI) equipped with a field emission gun operating at 200 kV. Images were recorded under low-dose condition (total dose of 40–50 e/Å 2) on a 2048 x 2048 200 CCD camera (Ultrascan 1000, Gatan Inc., Pleasanton) at a magnification of 50,000 resulting in a pixel spacing on the specimen of 0.21 nm. Random Conical Tilt Reconstructions. The initial reference volumes were obtained by random conical tilt using XMIPP 81 and IMAGIC 82 software packages. Two consecutive images of the same area were taken at 45° and 0° tilt angles under low- dose conditions. A total of 1546 tilt pairs were selected manually using XMIPP. Untilted images were aligned using iteratively refined 2D class averages as references and multivariance statistical analysis (MSA) and Hierarchical Ascendant Classification (HAC) for clustering into 50 class averages with IMAGIC. The 50 volumes calculated from the 2D classes were aligned, clustered and averaged using XMIPP MLtomo to compensate for the missing cone, resulting in 5 RCT reconstructions. Structure refinement. The best volume was used as reference for refinement cycles using a dataset of 35.145 untilted molecular images windowed with the Boxer application of the EMAN2 software package 83 and coarsened by two resulting in a pixel spacing of 4.2 Å. Image sorting was found necessary to select the most homogeneous particles since part of the structure was flexible and prevented convergence. Sorting was performed by using first XMIPP then subsequently the RELION software package 84 . The final 3-D reconstruction was carried out in RELION with 2361 sorted particles and resulted in a structure with 37 Å resolution as estimated by the 0.5 Fourier Shell Correlation criteria. Images were prepared using Chimera visualization software 85 .

201 ONLINE METHODS REFERENCES

48. Li, M.Z. & Elledge, S.J. Harnessing homologous recombination in vitro to generate recombinant DNA via SLIC. Nat. Methods 4, 251-6 (2007). 49. Fitzgerald, D.J. et al. Protein complex expression by using multigene baculoviral vectors. Nat. Methods 3, 1021-32 (2006). 50. Trowitzsch, S., Bieniossek, C., Nie, Y., Garzoni, F. & Berger, I. New baculovirus expression tools for recombinant protein complex production. J. Struct. Biol. 172, 45-54 (2010). 51. Bieniossek, C. et al. Automated unrestricted multigene recombineering for multiprotein complex production. Nat. Methods 6, 447-50 (2009). 52. Boivin, S. & Hart, D.J. Interaction of the influenza A virus polymerase PB2 C-terminal region with importin alpha isoforms provides insights into host adaptation and polymerase assembly. J. Biol. Chem. 286, 10439-48 (2011). 53. Larkin, M.A. et al. Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947-8 (2007). 54. Gouet, P., Robert, X. & Courcelle, E. ESPript/ENDscript: Extracting and rendering sequence and 3D information from atomic structures of proteins. Nucleic Acids Res. 31, 3320-3 (2003). 55. Kabsch, W. Integration, scaling, space-group assignment and post-refinement. Acta Crystallogr. D Biol. Crystallogr. 66, 133-44 (2010). 56. Kabsch, W. Xds. Acta Crystallogr. D Biol. Crystallogr. 66, 125-32 (2010). 57. Winn, M.D. An overview of the CCP4 project in protein crystallography: an example of a collaborative project. J. Synchrotron Radiat. 10, 23-5 (2003). 58. McCoy, A.J. Solving structures of protein complexes by molecular replacement with Phaser. Acta Crystallogr. D Biol. Crystallogr. 63, 32-41 (2007). 59. Strong, M. et al. Toward the structural genomics of complexes: crystal structure of a PE/PPE protein complex from Mycobacterium tuberculosis . Proc. Natl. Acad. Sci. USA 103, 8060-5 (2006). 60. Adams, P.D. et al. PHENIX: building new software for automated crystallographic structure determination. Acta Crystallogr. D Biol. Crystallogr. 58, 1948-54 (2002). 61. Emsley, P. & Cowtan, K. Coot: model-building tools for molecular graphics. Acta Crystallogr. D Biol. Crystallogr. 60, 2126-32 (2004). 62. Takeda, A.A., de Barros, A.C., Chang, C.W., Kobe, B. & Fontes, M.R. Structural basis of importin-alpha-mediated nuclear transport for Ku70 and Ku80. J. Mol. Biol. 412, 226-34 (2011). 63. Schuck, P. Size-distribution analysis of macromolecules by sedimentation velocity ultracentrifugation and lamm equation modeling. Biophys. J. 78, 1606-19 (2000). 64. Hayes, D., Laue, T.M. & Philo, J. Program Sednterp: Sedimentation Interpretation Program, Alliance Protein Laboratories, Thousand Oaks, CA. (1995). 65. Hernandez, H. & Robinson, C.V. Determining the stoichiometry and interactions of macromolecular assemblies from mass spectrometry. Nat. Protoc. 2, 715-26 (2007).

202 66. Sobott, F., Hernandez, H., McCammon, M.G., Tito, M.A. & Robinson, C.V. A tandem mass spectrometer for improved transmission and analysis of large macromolecular assemblies. Anal. Chem. 74, 1402-7 (2002). 67. Maiolica, A. et al. Structural analysis of multiprotein complexes by cross- linking, mass spectrometry, and database searching. Mol. Cell. Proteomics 6, 2200-11 (2007). 68. Ishihama, Y., Rappsilber, J. & Mann, M. Modular stop and go extraction tips with stacked disks for parallel and multidimensional Peptide fractionation in proteomics. J. Proteome Res. 5, 988-94 (2006). 69. Rappsilber, J., Ishihama, Y. & Mann, M. Stop and go extraction tips for matrix-assisted laser desorption/ionization, nanoelectrospray, and LC/MS sample pretreatment in proteomics. Anal. Chem 75, 663-70 (2003). 70. Rappsilber, J., Mann, M. & Ishihama, Y. Protocol for micro-purification, enrichment, pre-fractionation and storage of peptides for proteomics using StageTips. Nat. Protoc. 2, 1896-906 (2007). 71. Ishihama, Y., Rappsilber, J., Andersen, J.S. & Mann, M. Microcolumns with self-assembled particle frits for proteomics. J. Chromatogr. A 979, 233-9 (2002). 72. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367-72 (2008). 73. Muratoglu, S. et al. Two different Drosophila ADA2 homologues are present in distinct GCN5 histone acetyltransferase-containing complexes. Mol. Cell. Biol. 23, 306-21 (2003). 74. Demény, M.A. et al. Identification of a small TAF complex and its role in the assembly of TAF-containing complexes. PLoS ONE 2, e316 (2007). 75. Florens, L. & Washburn, M.P. Proteomic analysis by multidimensional protein identification technology. Methods Mol. Biol. 328, 159-75 (2006). 76. Washburn, M.P., Wolters, D. & Yates, J.R., 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 19, 242-7 (2001). 77. McDonald, W.H., Ohi, R., Miyamoto, D.T., Mitchison, T.J. & Yates, J.R. Comparison of three directly coupled HPLC MS/MS strategies for identification of proteins from complex mixtures: single-dimension LC- MS/MS, 2-phase MudPIT, and 3-phase MudPIT. Intern. J. Mass Spectrom. 219, 245-251 (2002). 78. Eng, J.K., McCormack, A.L. & Yates, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976-89 (1994). 79. Zybailov, B. et al. Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae . J. Proteome Res. 5, 2339-47 (2006). 80. Stark, H. GraFix: stabilization of fragile macromolecular complexes for single particle cryo-EM. Methods Enzymol. 481, 109-26 (2010). 81. Sorzano, C.O. et al. XMIPP: a new generation of an open-source image processing package for electron microscopy. J. Struct. Biol. 148, 194-204 (2004). 82. van Heel, M., Harauz, G., Orlova, E.V., Schmidt, R. & Schatz, M. A new generation of the IMAGIC image processing system. J. Struct. Biol. 116, 17- 24 (1996).

203 83. Tang, G. et al. EMAN2: an extensible image processing suite for electron microscopy. J. Struct. Biol. 157, 38-46 (2007) 84. Scheres, S.H. RELION: implementation of a Bayesian approach to cryo-EM structure determination. J. Struct. Biol. 180, 519-30 (2012). 85. Pettersen, E.F. et al. UCSF Chimera--a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605-12 (2004).

204 SUPPLEMENTARY INFORMATION

Cytoplasmic complex comprising TAF2, TAF8 and TAF10 provides evidence for nuclear holo-TFIID assembly from preformed submodules.

Simon Trowitzsch 1,2 , Cristina Viola 1,2 , Elisabeth Scheer 3, Virginie Chavant 4, Marjorie Fournier 3, Gabor Papai 5, Ima-Obong Ebong 6, Christiane Schaffitzel 1,2 , Juan Zou 7, Juri Rappsilber 7,8 , Carol Robinson 6, Patrick Schultz 5, Laszlo Tora 3 & Imre Berger 1,2

1 European Molecular Biology Laboratory, Grenoble Outstation, 6 rue Jules Horowitz, 38042 Grenoble, France 2 Unit for Virus Host-Cell Interactions, Univ. Grenoble Alpes-EMBL-CNRS, 6 rue Jules Horowitz, 38042 Grenoble, France. 3 Cellular signaling and nuclear dynamics program, Institut de Génétique et de Biologie Moléculaire et Cellulaire, UMR 7104, INSERM U964, 1 rue Laurent Fries, 67404 Illkirch, France. 4 Proteomics platform, Institut de Génétique et de Biologie Moléculaire et Cellulaire, UMR 7104, INSERM U964, 1 rue Laurent Fries, 67404 Illkirch, France. 5 Integrated Structural Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire, UMR 7104, INSERM U964, 1 rue Laurent Fries, 67404 Illkirch, France. 6 Chemistry Research Laboratory, University of Oxford, South Parks Road, Oxford, Oxfordshire, OX1 3TA, United Kingdom. 7 Wellcome Trust Centre for Cell Biology, University of Edinburgh, Mayfield Road, Edinburgh EH9 3JR, United Kingdom. 8 Institute of Bioanalytics, Department of Biotechnology, Technische Unversität Berlin, 13353 Berlin, Germany

205

Corresponding authors: Laszlo Tora ([email protected]) & Imre Berger ([email protected])

Supplementary Figure 1

Analysis of polyclonal anti-TAF2 antibodies.

Crude extracts of E. coli (left) or baculovirus-infected insect cells (right) expressing 6His-tagged human TAF2 were resolved on SDS-PAGE. Proteins

206 were blotted onto nitrocellulose membranes and incubated with pre-immun serum or with the anti-TAF2 serum 3038 taken from rabbits, which were immunized with recombinant human TAF2 protein. Protein size markers are indicated on the left of each blot. The polyclonal antibody recognizes recombinant TAF2 from E. coli and baculovirus-infected insect cells.

207 Supplementary Figure 2

Native mass spectrometry of recombinant TAF2-8-10 module.

Recombinant TAF2-8-10 complexes were electrosprayed from an aqueous ammonium acetate solution. The TAF2-8-10 module (purple dots) centers on a charge state at 7000 m/z. Charge states at around 2000 m/z (light blue dots) and 12000 m/z (yellow dots) correspond to minor amounts of TAF10 and a TAF2-8 complex, respectively. Proteins and protein complexes are schematically shown as circles.

208 Supplementary Figure 3

e

Structural analysis of TAF8-10 complex.

(a) Time course of a limited proteolysis experiment with TAF8-10 using

Chymotrypsin (left). Time points, protein size markers and protein identities are indicated. ***, TAF8 fragment spanning residues 1-159; **, TAF8 fragment spanning residues 1-134; *, TAF10 fragment spanning residues 98-218; IN,

Input sample. HFD, histone fold domain. Bar diagrams of the proteins TAF8

209 and TAF10 are indicated as shown in Fig. 1c. Domain boundaries of the core

TAF8-10 complex (TAF8 ∆C and TAF10 ∆N) are highlighted. ( b) Image of crystals grown from a refined TAF8-10 construct (TAF8 residues 25-125 and

TAF10 residues 112-212) with bar diagrams of the protein constructs. ( c)

Comparison of the central α helices of other histone fold-containing structures

(PDB IDs 1KX5, 1BH8, 1TAF) showing an array of residues at the crossing of the helices. ( d) Sequence alignment of the L1 loop regions of TAF8 and

TAF10 (top). Putative L1 regions of TAF3 and human SPT7L are aligned to

TAF8 (bottom). Start and end residues of the aligned sequences are indicated. Residues highlighted in Fig. 3d,e are marked by asterisks.

Secondary structure elements are shown for TAF10 at the top of the alignment. Note that the L2 loop of TAF10 was removed for clarity ( L2 arrow).

(e) Representative section of the 2F o-Fc electron density map (mesh) of the

TAF8-10 crystal structure is shown in a stereo view, contoured at 1.5 σ around the central helices of TAF8 (in blue) and TAF10 (in green).

210 Supplementary Figure 4

TAF2 interacts with the C-terminal region of TAF8 but not with the core complex of TAF8-10.

(a) Binding analysis of TAF2 with the core construct TAF8 ∆C-TAF10 ∆N using gel filtration. The elution profile monitored at an absorption wavelength of 280 nm versus elution volume is shown on the left and the SDS-PAGE analysis of peak fractions is shown on the right. ( b) Similar binding experiment as in ( a) but with an MBP-fusion construct of the unstructured C-terminal region of

TAF8 (TAF8 residues 105-310). IN, input sample. Protein size markers and protein identities are indicated.

211 Supplementary Figure 5

Electron micrographs and 2D class averages of 7TAF and 8TAF complexes

A section of electron micrographs from 7TAF complex consisting of TAF4, 5,

6, 8, 9, 10 and 12 is shown on the left, with representative 2D class averages are shown below. A similar section from 8TAF complex comprising TAF2, 4,

5, 6, 8, 9, 10 and 12 is shown on the right, with representative 2D class averages below. Scale bars are indicated. 8TAF complex has an elongated shape as compared to more compact 7TAF complex. Additional density corresponding to TAF2 is located at one side of the 8TAF complex, adopting flexible conformations.

212 Supplementary Figure 6

Cross-linking of 7TAF and 8TAF complexes using bifunctional crosslinker BS 3.

Cross-linking efficiency of 7TAF and 8TAF complexes was assessed on

NuPAGE Novex 3-8 % Tris-Acetate gels (Invitrogen). Identical amounts of

7TAF and 8TAF samples before cross-linking (-) and after cross-linking (+) were loaded on each lane. Protein size markers are shown on the left and protein identities on the right. Cross-linked complexes are indicated.

213 Supplementary Figure 7

Structural and biochemical characterisation of the putative nuclear import particle comprising TAF2-8-10-Importin α1.

(a) Binding experiment as in Supplementary Fig. 4a, but with the TAF2-8-10

IBB complex mixed with a two-fold molar excess of Importin a1∆ . Elution profile of the mixture is shown as a black line. The dotted line shows the elution profile of the rechromatographed material pooled from the first peak (at around 10 ml). SDS-PAGE analysis of peak fractions is shown on the right.

(b) Schematic representation of the interactions between Importin α1 and the

NLS of TAF8. Residues engaged in salt bridges, van der Waals contacts or hydrogen bondings are indicated by dashed lines. Backbone amino and carbonyl groups of the NLS peptide are schematically drawn. Residue

214 positions are indicated. (c) Structure of Importin α1 with an NLS peptide of

TAF8. Importin α1 molecule (grey) is shown in cartoon representation and the

TAF8 peptide as sticks in blue. The 2F o-Fc density map contoured at 1 σ around the NLS peptide fragments is shown as a grey mesh. TAF8 residues

R303 and R304, which are stabilized by crystal contacts but are not engaged in Importin α1 binding, are indicated. Major and minor NLS-binding sites on the Importin α1 molecule are denoted. ( d) TAF8-NLS peptide binding to

Importin α1 assayed by isothermal titration calorimetry. The upper panel shows the added heat to the cell over time with successive additions. The excess heat added per addition was integrated from the upper panel and plotted in the lower panel as a function of the ratio of the concentration of the

NLS and Importin α1 in the cell. The right panel shows a control run without

Importin α1 in the cell to assess heat of dilution of the peptide.

215 Supplementary Table 1 MutPIT analysis of TAF2 co-immunoprecipitated proteins from nuclear and cytoplasmic HeLa cell extracts. TFIID subunits specifically enriched in TAF2 immunoprecipitations (IPs) as compared to control IP samples are indicated by protein sequence coverage (%), unique peptides and spectral counts.

TAF2 IP (nuclear) TAF2 IP (cytoplasmic)

TFIID Sequence Unique Spectral Sequence Unique Spectral

subunit coverage (%) peptides counts coverage (%) peptides counts

TAF1 13.19 14 56

TAF2 33.19 32 573 30.94 25 424

TAF3 4.09 3 6

TAF4 19.63 16 169

TAF4B 9.98 6 19

TAF5 38.63 21 131

TAF6 36.78 20 207

TAF7 26.65 7 44

TAF8 48.71 10 100 36.77 6 55

TAF9 40.53 8 45

TAF9B 23.90 5 15

TAF10 27.06 3 34 20.64 2 15

TAF11 30.33 5 21

TAF12 11.08 2 2

TAF13 16.94 2 11

TBP 7.37 2 12

216 Supplementary Table 2 Native Mass Spectrometry Data of TAF2-8-10 complexes.

Protein / protein complex Measured mass [Da] Calculated mass [Da]

TAF8 35028 34984

TAF10 23484 / 23751 23613

(23618 ± 134)

TAF8-10 57975 / 58976 58579

(58475 ± 501)

TAF2 136364 / 137526 137030

(136945 ± 581)

TAF2-8 171548 / 172754 172014

(172151 ± 603)

TAF2-8-10 195222 / 196372 195609

(195797 ± 575)

Two series of peaks are observed in the spectra for TAF10 and

TAF2 (and therefore also complexes containing these TAFs),

likely due to post-translational modification. Mass averages are

provided in brackets.

217 Supplementary Table 3

BS 3 protein-protein cross-links of 7TAF and 8TAF complexes identified by mass spectrometry.

Cross-linked proteins 7TAF 8TAF Common to 7TAF and 8TAF

(prot1-prot2) unique (matches) unique (matches) [residues prot1-prot2]

TAF2-TAF5 - 1 (2) -

TAF2-TAF6 - 1 (2) -

TAF2-TAF8 - 4 (4) -

TAF2-TAF9 - 1 (1) -

TAF4-TAF5 4 (6) 5 (5) 3 [974-564/984-564/987-486]

TAF4-TAF9 3 (3) 2 (2) 2 [974-108/984-108]

TAF4-TAF12 1 (2) 1 (2) 1 [897-62]

TAF5-TAF6 5 (10) 3 (3) 2 [364-1/364-10]

TAF5-TAF8 3 (5) - (-) -

TAF5-TAF9 2 (12) 2 (12) 2 [577-134/577-135]

TAF6-TAF8 15 (19) 6 (9) 5 [1-178/67-1/67-20/67-78/214-178]

TAF6-TAF9 4 (7) 2 (2) 2 [67-10/67-135]

TAF8-TAF9 3 (3) 3 (6) 3 [20-1/20-10/135-178]

TAF9-TAF10 - (-) 1 (1) -

TAF9-TAF12 2 (5) 2 (3) 2 [62-107/62-114]

TAF10-TAF12 1 (1) 1 (1) -

Unique cross-links 43 35 23

218 5. Appendix

5.1. Tables

Human

Full-length Trypsin Limited Proteolysis

Fragment Obs.Mass kDa N-terminal Obs Mass Exp.Mass Sequence (SDSPAGE) Seq. Result (MALDI) TR1 120 Start - 121584.9 1-1060 TR2 110 (R)KKGD - 113707.5 12-1000 TR3 90 Start - 93739.0 1-816 TR4 60 (R)KKGD - 67001.0 12-597 TR5 40 (R)NKKKK - 46724.4 597-1000 TR5a 35 (R)TLDNL - 33098.8 816-1108 TR5b 30 (R)TLDNL - 29275.5 816-1073 TR5c 25 (K)TNNFM - 30658.5 736-1000 TR6 20 (R)TLDNL - 21382.9 816-1000 ‘Core’ Trypsin Limited Proteolysis

Fragment Obs.Mass kDa N-terminal Obs. Mass Exp. Mass Sequence (SDSPAGE) Seq. Result (MALDI) TRa1 90 (R)KKGD 92200 92342.5 12-815 TRa1a 80 (R)KKGD 83000 83066.9 12-735 TRa2 66 (R)KKGD 66900 66844.9 12-595 TRa3 55 (R)NLNYF 56700 56624.0 100-595 TRa3a 50 (R)KKGD 46600 46825.9 12-420 TRa4 25 (R)NKKKK 25300 25359.5 597-815 TRa5 22 (R)TLDNL 22000 21382.9 816-999 TRa6 21 (R)TLDNL 20900 19309.3 816-981 TRa7 16 (R)NKKKK 16000 16083.8 597-735 TRa8 15 (R)NKKKK 14100 14124.4 597-718 TRa9 7 (K)TNNFM 6100 6065.1 736-786 Full-length Endogenous Limited Proteolysis

Fragment Obs.Mass kDa N-terminal Obs.Mass Exp.Mass Sequence (SDSPAGE) Seq. Result (MALDI) FL 116 DDD 113400 113499.9 4-991 DEG1 80 DDD 84100 84140.9 4-737 DEG2 30 NNFMS 29300 29535.1 737-990 DEG3 20 DNLNP 20200 20102.3 818-991

Table 4: Human TAF2 Limited proteolysis . Data found in this table was obtained from the thesis of Dr. Martina Mijuskovic (Mijuskovic, 2007)

219

Drosophila

Full-length Chymotrypsin Limited Proteolysis

Fragment Obs.Mass kDa N-terminal Obs Mass Exp.Mass Sequence (SDSPAGE) Seq. Result (MALDI) CH1a 80 Start 83843.8 83843.1 1-733 CH1b 84702.5 84703.2 1-741 CH1c 84236.7 84235.6 1-737 CH2a 30 (L)NNFSN 29777.4 29777.2 742-1000 CH2b (F)SNFQL 29401.8 29401.9 745-1000 CH3a 20 (A)IHGTQ 21262.1 21262.5 817-1000 CH3b (H)GTQIT 21011.7 21012.2 819-1000

Yeast

Full-length Trypsin Limited Proteolysis

Fragment Obs.Mass kDa N-terminal Obs Mass Exp.Mass Sequence (SDSPAGE) Seq. Result (MALDI) TR1 50 Start - - TR2 37 (R)ENVVD 34205.0 34204.2 412-710 TR3 25 (R)GGGAN 26976.5 26973.4 782-1014 TR4a 10 (R)NIEES 9222.9 9222.4 330-411 TR4b (R)AIVSE 12278.8 12279.0 11-117 TR4c (K)TTPGF 13259.7 13259.7 165-280 Full-length Chymotrypsin Limited Proteolysis

Fragment Obs.Mass kDa N-terminal Obs. Mass Exp. Mass Sequence (SDSPAGE) Seq. Result (MALDI) CH1a 75 (L)SNYTP 64131.5 64121.0 154-716 CH1b (Y)TPITP 64573.5 64565.5 157-723 CH2a 25 (Y)RRLRK 25821.1 25822.3 774-996 CH2b 27535.5 27535.0 774-1011 CH2c (L)RKXGG 27109.6 27109.5 777-1011 CH2d 27611.5 27615.1 777-1015 CH3a 15 (F)KSALD 16925.8 16926.7 718-861 CH3b (F)ASLNE 16923.9 122-271

Table 5: Stable peptide fragments derived from proteolysed mixtures of Drosophila and Yeast TAF2. A summary of the analysis of stable fragments obtained from limited proteolysis time course experiments shown in Figure 3-4. Proteolyzed mixtures were subjected to ESI to determine peptide masses present in each mixture. Peptide masses detected by ESI were matched to fragments based on their expected molecular weight and N-terminal sequence 220 (determined by Edman sequencing). Labeling of fragments in this table corresponds with the labels assigned to fragments detected by SDS-PAGE in Figure 3-4.

TAF2 Template Template e-value Sequence Protein residues Modeled Identity % modeled 93-912 2yd0A 94-935 0 14% Human Endoplasmic Reticulum Aminopeptidase-1 27-389 3ciaA 51-365 4e-06 22% Colwellia psychrerythraea cold- aminopeptidase

Table 6: TAF2 homology modeling hits calculated using the online modeling software Modeller. The top two hits for a template to model TAF2 detected by Modeller.

221

Figure 5-1. ESPRIPT representation of a CLUSTAL alignment of Human TAF8 homologs. Homologs are labeled as follows: Sacccaromyces cerevisiae (sc), Homo sapiens (hs), Arabidopsis thaliana (at), Drosophila melanogaster (dm). ESPRIPT software location: http://espript.ibcp.fr , (Robert and Gouet, 2014). CLUSTAL software location: http://www.ebi.ac.uk/Tools/msa/clustalo/ (McWilliam et al., 2013)

222 Region TAF8 Residues 500 mM NaCl 150 mM NaCl I 108-127 105-130 II 147-175 147-175 III 165-202 165-208 IV 207-238 192-241 V 273-310 261-310

Table 7: TAF8 C-terminal-TAF2 peptide scan interaction regions .

223 6. References

Alexander, C., and Berger, I. (2011). ACEMBL Expression System Series. (www.embl.fr/multibac/multiexpression_technologies )

Anandapadamanaban, M., Andresen, C., Helander, S., Ohyama, Y., Siponen, M.I., Lundström, P., Kokubo, T., Ikura, M., Moche, M., and Sunnerhagen, M. (2013). High-resolution structure of TBP with TAF1 reveals anchoring patterns in transcriptional regulation. Nature Publishing Group 20 , 1008–1014.

Berger, I., Blanco, A.G., Boelens, R., Cavarelli, J., Coll, M., Folkers, G.E., Nie, Y., Pogenberg, V., Schultz, P., Wilmanns, M., et al. (2011). Structural insights into transcription complexes. J. Struct. Biol. 175, 135–146.

Berger, I., Fitzgerald, D.J., and Richmond, T.J. (2004). Baculovirus expression system for heterologous multiprotein complexes. Nat. Biotechnol. 22 , 1583–1587.

Bieniossek, C., and Berger, I. (2008). Towards eukaryotic structural complexomics. J Struct Funct Genomics 10 , 37–46.

Bieniossek, C., Nie, Y., Frey, D., Olieric, N., Schaffitzel, C., Collinson, I., Romier, C., Berger, P., Richmond, T.J., Steinmetz, M.O., et al. (2009). Automated unrestricted multigene recombineering for multiprotein complex production. Nat. Methods 6, 447–450.

Bieniossek, C., Papai, G., Schaffitzel, C., Garzoni, F., Chaillet, M., Scheer, E., Papadopoulos, P., Tora, L., Schultz, P., and Berger, I. (2013). The architecture of human general transcription factor TFIID core complex. Nature 493, 699–702.

Bieniossek, C., Richmond, T.J., and Berger, I. (2008). MultiBac: multigene baculovirus-based eukaryotic protein complex production. Curr Protoc Protein Sci Chapter 5 , Unit5.20.

Bua, D.J., Kuo, A.J., Cheung, P., Liu, C.L., Migliori, V., Espejo, A., Casadio, F., Bassi, C., Amati, B., Bedford, M.T., et al. (2009). Epigenome microarray platform for proteome-wide dissection of chromatin-signaling networks. PLoS ONE 4, e6789.

Buchan, D.W.A., Rison, S.C.G., Bray, J.E., Lee, D., Pearl, F., Thornton, J.M., and Orengo, C.A. (2003). Gene3D: structural assignments for the biologist and bioinformaticist alike. Nucleic Acids Research 31 , 469–473.

Butler, J.E.F., and Kadonaga, J.T. (2002). The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes & Development 16 , 2583–2592.

Carey, M., and Smale, S.T. (2001). Transcriptional Regulation in Eukaryotes (CSHL Press).

Carninci, P., Sandelin, A., Lenhard, B., Katayama, S., Shimokawa, K., Ponjavic, J.,

224 Semple, C.A.M., Taylor, M.S., Engström, P.G., Frith, M.C., et al. (2006). Genome- wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 38 , 626–635.

Chen, J.L., Attardi, L.D., Verrijzer, C.P., Yokomori, K., and Tjian, R. (2014). Assembly of recombinant TFIID reveals differential coactivator requirements for distinct transcriptional activators. Cell 79 , 93–105.

Chothia, C., Gough, J., Vogel, C., and Teichmann, S.A. (2003). Evolution of the protein repertoire. Science 300, 1701–1703.

Chung, C.T., and Miller, R.H. (1993). Preparation and storage of competent Escherichia coli cells. Meth. Enzymol. 218, 621–627.

Danielsen, J.M.R., Sylvestersen, K.B., Bekker-Jensen, S., Szklarczyk, D., Poulsen, J.W., Horn, H., Jensen, L.J., Mailand, N., and Nielsen, M.L. (2011). Mass spectrometric analysis of lysine ubiquitylation reveals promiscuity at site level. Mol. Cell Proteomics 10 , M110.003590.

Fitzgerald, D.J., Berger, P., Schaffitzel, C., Yamada, K., Richmond, T.J., and Berger, I. (2006). Protein complex expression by using multigene baculoviral vectors. Nat. Methods 3, 1021–1032.

Fitzgerald, D.J., Schaffitzel, C., Berger, P., Wellinger, R., Bieniossek, C., Richmond, T.J., and Berger, I. (2007). Multiprotein Expression Strategy for Structural Biology of Eukaryotic Complexes. Structure 15 , 275–279.

Freshney, R.I. (2006). Basic Principles of Cell Culture. In Culture of Cell for Tissue Engineering, V.-N. Gordana, and R.I. Freshney, eds. (Basic Principles of Cell Culture), pp. 3–22.

Gasch, A., Hoffmann, A., Horikoshi, M., Roeder, R.G., and Chua, N.H. (1990). Arabidopsis thaliana contains two genes for TFIID. Nature 346, 390–394.

Gilmour, D.S., Dietz, T.J., and Elgin, S.C. (1990). UV cross-linking identifies four polypeptides that require the TATA box to bind to the Drosophila hsp70 promoter. Molecular and Cellular Biology 10 , 4233–4238.

Hanahan, D. (1983). Studies on transformation of Escherichia coli with plasmids. Journal of Molecular Biology 166, 557–580.

Hansen, S.K., and Tjian, R. (1995). TAFs and TFIIA mediate differential utilization of the tandem Adh promoters. Cell 82 , 565–575.

Hawley-Nelson, P., and Ciccarone, V. (2003). Transfection of cultured eukaryotic cells using cationic lipid reagents. Curr Protoc Cell Biol Chapter 20 , Unit20.6.

Heery, D.M., Kalkhoven, E., Hoare, S., and Parker, M.G. (1997). A signature motif in transcriptional co-activators mediates binding to nuclear receptors. Nature 387, 733– 736.

Javahery, R., Khachi, A., Lo, K., Zenzie-Gregory, B., and Smale, S.T. (1994). DNA

225 sequence requirements for transcriptional initiator activity in mammalian cells. Molecular and Cellular Biology 14 , 116–127.

Juven-Gershon, T., Hsu, J.-Y., Theisen, J.W., and Kadonaga, J.T. (2008). The RNA polymerase II core promoter — the gateway to transcription. Current Opinion in Cell Biology 20 , 253–259.

Kaufmann, J., Ahrens, K., Koop, R., Smale, S.T., and Müller, R. (1998). CIF150, a human cofactor for transcription factor IID-dependent initiator function. Molecular and Cellular Biology 18 , 233–239.

Kaufmann, J., Verrijzer, C.P., Shao, J., and Smale, S.T. (1996). CIF, an essential cofactor for TFIID-dependent initiator function. Genes & Development 10 , 873–886.

Kim, T.H., Barrera, L.O., Zheng, M., Qu, C., Singer, M.A., Richmond, T.A., Wu, Y., Green, R.D., and Ren, B. (2005). A high-resolution map of active promoters in the human genome. Nature 436, 876–880.

Kochan, G., Krojer, T., Harvey, D., Fischer, R., Chen, L., Vollmar, M., Delft, von, F., Kavanagh, K.L., Brown, M.A., Bowness, P., et al. (2011). Crystal structures of the endoplasmic reticulum aminopeptidase-1 (ERAP1) reveal the molecular basis for N- terminal peptide trimming. Proceedings of the National Academy of Sciences 108, 7745–7750.

Kornberg, R.D. (2007). The molecular basis of eucaryotic transcription. Cell Death Differ. 14 , 1989–1997.

Kriz, A., Schmid, K., Baumgartner, N., Ziegler, U., Berger, I., Ballmer-Hofer, K., and Berger, P. (2010). A plasmid-based multigene expression system for mammalian cells. Nat Commun 1, 120.

Laemmli, U.K. (1970). Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature 227, 680–685.

Li, M.Z., and Elledge, S.J. (2007). Harnessing homologous recombination in vitro to generate recombinant DNA via SLIC. Nat. Methods 4, 251–256.

Martin, J., Halenbeck, R., and Kaufmann, J. (1999). Human transcription factor hTAF(II)150 (CIF150) is involved in transcriptional regulation of cell cycle progression. Molecular and Cellular Biology 19 , 5548–5556.

Martinez, E., Chiang, C.M., Ge, H., and Roeder, R.G. (1994). TATA-binding protein- associated factor(s) in TFIID function through the initiator to direct basal transcription from a TATA-less class II promoter. Embo J. 13 , 3115–3126.

Martinez, E., Ge, H., Tao, Y., Yuan, C.X., Palhan, V., and Roeder, R.G. (1998). Novel cofactors and TFIIA mediate functional core promoter selectivity by the human TAFII150-containing TFIID complex. Molecular and Cellular Biology 18 , 6571– 6583.

McWilliam, H., Li, W., Uludag, M., Squizzato, S., Park, Y.M., Buso, N., Cowley, A.P., and Lopez, R. (2013). Analysis Tool Web Services from the EMBL-EBI.

226 Nucleic Acids Research 41 , W597–W600.

Mijuskovic, M. (2007). STRUCTURAL AND BIOCHEMICAL STUDIES ON HUMAN TAF2. ETH Zurich.

Nie, Y., Viola, C.M., Bieniossek, C., Trowitzsch, S., Vijayachandran, L.S., Chaillet, M., Garzoni, F., and Berger, I. (2011). Getting a grip on complexes. Curr. Genomics 10 , 558–572.

Nikolov, D.B., Hu, S.H., Lin, J., Gasch, A., Hoffmann, A., Horikoshi, M., Chua, N.H., Roeder, R.G., and Burley, S.K. (1992). Crystal structure of TFIID TATA-box binding protein. Nature 360, 40–46.

Oelgeschläger, T., Tao, Y., Kang, Y.K., and Roeder, R.G. (1998). Transcription activation via enhanced preinitiation complex assembly in a human cell-free system lacking TAFIIs. Molecular Cell 1, 925–931.

Olsen, J.V., Vermeulen, M., Santamaria, A., Kumar, C., Miller, M.L., Jensen, L.J., Gnad, F., Cox, J., Jensen, T.S., Nigg, E.A., et al. (2010). Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci Signal 3, ra3.

Papai, G., Tripathi, M.K., Ruhlmann, C., Werten, S., Crucifix, C., Weil, P.A., and Schultz, P. (2009). Mapping the Initiator Binding Taf2 Subunit in the Structure of Hydrated Yeast TFIID. Structure/Folding and Design 17 , 363–373.

Plevin, M.J., Mills, M.M., and Ikura, M. (2005). The LxxLL motif: a multifunctional binding sequence in transcriptional regulation. Trends Biochem Sci 30 , 66–69.

Prilusky, J., Felder, C.E., Zeev-Ben-Mordehai, T., Rydberg, E.H., Man, O., Beckmann, J.S., Silman, I., and Sussman, J.L. (2005). FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21 , 3435–3438.

Purnell, B.A., and Gilmour, D.S. (1993). Contribution of sequences downstream of the TATA element to a protein-DNA complex containing the TATA-binding protein. Molecular and Cellular Biology 13 , 2593–2603.

Ray, B.L., White, C.I., and Haber, J.E. (1991). The TSM1 gene of Saccharomyces cerevisiae overlaps the MAT locus. Curr. Genet. 20 , 25–31.

Robert, X., and Gouet, P. (2014). Deciphering key features in protein structures with the new ENDscript server. Nucleic Acids Research 42 , W320–W324.

Roeder, R.G.R. (1991). The complexities of eukaryotic transcription initiation: regulation of preinitiation complex assembly. Trends Biochem Sci 16 , 402–408.

Sambrook, J., and Green, M.R. (2001). Molecular cloning (New York: Cold Spring Harbour Lboratory Press).

Sawadogo, M., and Roeder, R.G. (1985). Interaction of a gene-specific transcription factor with the adenovirus major late promoter upstream of the TATA box region.

227 Cell 43 , 165–175.

Sawadogo, M., and Sentenac, A. (1990). RNA polymerase B (II) and general transcription factors. Annu. Rev. Biochem. 59 , 711–754.

SCHERER, W.F., and HOOGASIAN, A.F. (1954). Preservation at subzero temperatures of mouse fibroblasts (strain L) and human epithelial cells (strain HeLa). Proc Soc Exp Biol Med 87 , 480–487.

Shiromizu, T., Adachi, J., Watanabe, S., Murakami, T., Kuga, T., Muraoka, S., and Tomonaga, T. (2013). Identification of missing proteins in the neXtProt database and unregistered phosphopeptides in the PhosphoSitePlus database as part of the Chromosome-centric Human Proteome Project. J. Proteome Res. 12 , 2414–2421.

Simossis, V.A., and Heringa, J. (2005). PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Research 33 , W289–W294.

Smale, S.T. (1997). Transcription initiation from TATA-less promoters within eukaryotic protein-coding genes. Biochim. Biophys. Acta 1351, 73–88.

Smale, S.T., and Baltimore, D. (1989). The “initiator” as a transcription control element. Cell 57 , 103–113.

Smale, S.T., Schmidt, M.C., Berk, A.J., and Baltimore, D. (1990). Transcriptional activation by Sp1 as directed through TATA or initiator: specific requirement for mammalian transcription factor IID. Proc. Natl. Acad. Sci. U.S.a. 87 , 4509–4513.

Soutoglou, E., Demeny, M.A., Scheer, E., Fienga, G., Sassone-Corsi, P., and Tora, L. (2005). The Nuclear Import of TAF10 Is Regulated by One of Its Three Histone Fold Domain-Containing Interaction Partners. Molecular and Cellular Biology 25 , 4092– 4104.

Stuwe, T., Hothorn, M., Lejeune, E., Rybin, V., Bortfeld, M., Scheffzek, K., and Ladurner, A.G. (2008). The FACT Spt16 “peptidase” domain is a histone H3-H4 binding module. Proceedings of the National Academy of Sciences 105, 8884–8889.

Tanese, N., Pugh, B.F., and Tjian, R. (1991). Coactivators for a proline-rich activator purified from the multisubunit human TFIID complex. Genes & Development 5, 2212–2224.

Thomas, M.C., and Chiang, C.-M. (2006). The General Transcription Machinery and General Cofactors. Critical Reviews in Biochemistry and Molecular Biology 41 , 105– 178.

Tjian, R., and Maniatis, T. (1994). Transcriptional activation: a complex puzzle with few easy pieces. Cell 77 , 5–8.

Todd, A.E., Orengo, C.A., and Thornton, J.M. (2001). Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology 307, 1113–1143.

228 Trowitzsch, S., Bieniossek, C., Nie, Y., Garzoni, F., and Berger, I. (2010). New baculovirus expression tools for recombinant protein complex production. J. Struct. Biol. 172, 45–54.

Verrijzer, C.P., Chen, J.L., Yokomori, K., and Tjian, R. (1995). Binding of TAFs to core elements directs promoter selectivity by RNA polymerase II. Cell 81 , 1115– 1125.

Verrijzer, C.P., Yokomori, K., Chen, J.L., and Tjian, R. (1994). Drosophila TAFII150: similarity to yeast gene TSM-1 and specific binding to core promoter DNA. Science 264, 933–941.

Vijayachandran, L.S., Viola, C., Garzoni, F., Trowitzsch, S., Bieniossek, C., Chaillet, M., Schaffitzel, C., Busso, D., Romier, C., Poterszman, A., et al. (2011). Robots, pipelines, polyproteins: enabling multiprotein expression in prokaryotic and eukaryotic cells. J. Struct. Biol. 175, 198–208.

Vranken, W.F., Boucher, W., Stevens, T.J., Fogh, R.H., Pajon, A., Llinas, M., Ulrich, E.L., Markley, J.L., Ionides, J., and Laue, E.D. (2005). The CCPN data model for NMR spectroscopy: development of a software pipeline. Proteins 59 , 687–696.

Weber, K., and Osborn, M. (1969). The reliability of molecular weight determinations by dodecyl sulfate-polyacrylamide gel electrophoresis. J. Biol. Chem. 244, 4406– 4412.

Wright, K.J., Marr, M.T., and Tjian, R. (2006). TAF4 nucleates a core subcomplex of TFIID and mediates activated transcription from a TATA-less promoter. Proc. Natl. Acad. Sci. U.S.a. 103, 12347–12352.

Yang, C., Bolotin, E., Jiang, T., Sladek, F.M., and Martinez, E. (2007). Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene 389, 52–65.

Zhou, Q., Lieberman, P.M., Boyer, T.G., and Berk, A.J. (1992). Holo-TFIID supports transcriptional stimulation by diverse activators and from a TATA-less promoter. Genes & Development 6, 1964–1974.

229