Investigating the intrinsically disordered regions of USF1 and other bHLHZ transcription factors.

by Serban C. Popa

A thesis submitted in conformity with the requirements for the degree of Master of Science

Department of Cell and Systems Biology
University of Toronto

© Copyright by Serban C. Popa (2020)

Abstract

Part I of this thesis describes structural studies of the disordered loop of the basic helix-loop-helix/zipper (bHLHZ) transcription factor USF1. The objective was to understand how USF1 recognizes a single nucleotide polymorphism in a 4G tract found in the regulatory region of the plasminogen activator inhibitor 1 (PAI-1) gene, by mutating selected residues in its disordered loop. T234 is the key residue implicated in differentiating the PAI-1 polymorphism, and the USF1 bHLHZ preferentially binds the 5G variant of the polymorphism and not the 4G variant.

Part II explores the design of minimalist transcription factors where domains from various transcription factors with desired properties were combined to generate novel .

By using crystal structures of the parent proteins, we appended domains of interest onto the MEF bHLHZ to create MEF variants with altered DNA binding properties. Additionally, phage-assisted continuous evolution (PACE) is being utilized to further improve these MEF variants.


Acknowledgements

To start off, I would like to thank my supervisor, Prof. Jumi Shin, for giving me this research opportunity and for her guidance throughout my studies. I especially want to thank two former Shin lab research associates, Dr. Inder Sheoran and Dr. Ichiro Inamoto, for their support, guidance and mentorship throughout my degree. Without their invaluable aid and support, completing this thesis would not have been possible. I am also very grateful to Prof. Ho-Sung Rhee and Prof. Voula Kanelis for agreeing to be on my committee, for offering guidance when needed, for providing valuable feedback that opened new avenues to explore at each of our committee meetings and for their feedback on this thesis. I would also like to thank Prof. Tim Westwood for agreeing to be my external examiner and for his critical reading of this thesis.

It has been a pleasure working with an amazing group of scientists in the Shin lab: Montdher Hussain, Duan Fan Tang, Kevin Do and all former graduate and undergraduate students. I thank my coworkers for their enthusiasm, willingness to lend a hand and bright outlook, which made the days in the lab much more bearable than they would have otherwise been.

I am extremely thankful to my friends and family who supported me throughout my studies and helped keep my spirits high even when things were at their worst. Without them, I would not be where I am today.


Table of contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
List of Appendices
1. Part I: Investigating how the intrinsically disordered loop of USF1 alters DNA binding specificity in hereditary asthma
1.1 Preface to Part I
1.2 Introduction
1.2.1 The bHLHZ family of transcription factors and their disordered domains
1.2.2 Intrinsically disordered regions and their use in vivo
1.2.3 USF1 and hereditary asthma
1.2.4 The unique loop of USF1
1.2.5 Characterization of bHLHZ proteins
1.2.6 Research objectives
1.3 Materials and Methods
1.3.1 Handling of E. coli strains
1.3.2 Manipulation of DNA
1.3.3 Bacterial one-hybrid experiments
1.3.4 expression
1.3.5 Electrophoretic mobility shift assay (EMSA)
1.3.6 Circular dichroism
1.4 Results and Discussion
1.4.1 Assessing secondary structure of USF1 mutants for proper folding
1.4.2 Determining DNA binding affinities via EMSA
1.4.3 Examining bHLHZ activity in vivo via the B1H assay
1.4.4 Discrepancies from previously published material
1.4.5 Relevance
1.4.6 Future directions
1.5 References
1.6 Appendix A
1.6.1 Appendix A1 – Composition of media and buffers
1.6.2 Appendix A2 – DNA sequences used and representative figures
2. Part II: The rational design of novel bHLHZ transcription factors using a mixture of rational design and non-rational, directed evolution systems
2.1 Preface to Part II
2.2 Introduction
2.2.1 Strategies for protein engineering: rational and non-rational
2.2.2 Rational design of proteins
2.2.3 Non-rational design of proteins
2.2.4 PACE
2.2.5 The importance of DNA topology on protein-DNA interactions
2.3 Materials and Methods
2.3.1 Amplification and purification of phage particles
2.3.2 Construction of vectors
2.3.3 Plaque assay
2.3.4 PANCE setup
2.3.5 Miscellaneous
2.4 Results and Discussion
2.4.1 Rational redesign of USF1 to create UFW
2.4.2 Assessing properties of UFW
2.4.3 Non-rationally modifying USF1 and UFW in PACE
2.4.4 MEFU and MEFH
2.4.5 Relevance
2.4.6 Future directions
2.5 References
2.6 Appendix B - DNA sequences used and additional figures


List of Tables

Part I

Table 1. Summary of CD data for the various USF1 and Max constructs
Table 2. Summary of Kd values for the various USF1 and Max constructs
Appendix Table 1. Sequences of DNA oligos used for SDM/cloning/B1H
Appendix Table 2. Sequences of DNA used for EMSA/CD/B1H reporters

Part II

Table 1. Comparison of UFW in vivo and in vitro properties relative to USF1
Table 2. Comparison of in vitro properties for MEFU and related proteins
Appendix Table 1. DNA sequences used for cloning and B1H, EMSA experiments
Appendix Table 2. Sequences of reporters used for PANCE


List of Figures

Part I

Fig. 1. Representative bHLHZ heterodimer
Fig. 2. Overview of PAI-1 expression and subsequent effects
Fig. 3. Comparison of the USF1 loop to a prototypical bHLHZ loop
Fig. 4. Three residues of the USF1 loop make contacts with nucleotides flanking the E-box
Fig. 5. Schematic representation of the bacterial one-hybrid assay
Fig. 6. Representative CD spectra for USF1 T234A
Fig. 7. In vitro characterization of USF1 T234A
Fig. 8. Optimization of B1H assay for use with USF1
Fig. 9. Identifying optimal spacer length for 4G/5G constructs for B1H with USF1
Fig. 10. B1H results for USF1/Max derived TFs
Fig. 11. B1H assay to assess the impact of K235 acetylation on USF1
Appendix Fig. 1. Representative CD experiments for wildtype USF1
Appendix Fig. 2. Protein sequences of studied transcription factors
Appendix Fig. 3. Representative EMSA for USF1 T234A binding to nonspecific DNA
Appendix Fig. 4. Non-specific activity for -11 4G and 5G E-box reporters

Part II

Fig. 1. Alignment of the USF1 and Max bHLHZs for the MaxULoop loop swap
Fig. 2. A representation of the MEF structure
Fig. 3. Simplified schematic of the M13 phage replication cycle
Fig. 4. Schematic diagram illustrating the PACE-B1H system
Fig. 5. Alignment of FosW and USF1 LZs for designing UFW
Fig. 6. Comparison of USF1 activity to UFW activity in the B1H assay
Fig. 7. In vitro characterization of UFW binding affinity
Fig. 8. Overview of the B1H-PACE selection circuit
Fig. 9. B1H plate of MEF, MEFU, MaxULoop binding to the 4G or 5G E-box
Fig. 10. Comparison of the E47 HLH and Max HLH
Fig. 11. MEFH interacting with B1H reporter system
Fig. 12. Differing DNA binding properties of MEFH and MEFGH
Appendix Fig. 1. Representative UFW EMSAs for binding to nonspecific DNA and 5G E-box
Appendix Fig. 2. B1H testing how a truncated LZ impacts UFW activity
Appendix Fig. 3. Attempts to find a suitable spacer length to evolve USF and UFW in PACE
Appendix Fig. 4. Modifications made to original MEF protein sequence to generate MEFH, MEFGH and MEFU
Appendix Fig. 5. B1H for MEFU nonspecific activity
Appendix Fig. 6. B1H testing TF binding to symmetric G5G5 E-boxes
Appendix Fig. 7. Initial B1H for MEFH
Appendix Fig. 8. Addition of the Hin arm alters TF binding preferences


List of Abbreviations

3-AT: 3-Amino-1,2,4-triazole
4G: 4G E-box variant of PAI-1
5G: 5G E-box variant of PAI-1
AD: Activator domain
Amp: Ampicillin
B1H assay: Bacterial one-hybrid assay
bHLHZ: basic helix-loop-helix/zipper
bHLH: basic helix-loop-helix
bZIP: basic-leucine zipper
CD: Circular dichroism
DBD: DNA-binding domain
DNA: Deoxyribonucleic acid
E. coli: Escherichia coli
E-box: Enhancer box
EMSA: Electrophoretic mobility shift assay
HLH: helix-loop-helix
IDP: Intrinsically disordered protein
IDR: Intrinsically disordered region
IPTG: Isopropyl beta-D-1-thiogalactopyranoside
LB: Lysogeny broth
Kd: Dissociation constant
Mad: Max dimerization protein 1
Max: Myc associated factor X
MaxULoop: Max bHLHZ with USF1 loop
ME47: Max basic region-E47 HLH or Max-E47
MEF: Max-E47-FosW LZ
MEFH: MEF with Hin arm
MEFU: MEF with USF1 loop
Myc: Myelocytomatosis
Kan: Kanamycin
LZ: Leucine zipper
NMR: Nuclear magnetic resonance
NS: Nonspecific
OD: Optical density
PAI-1: Plasminogen activator inhibitor 1
PACE: Phage-assisted continuous evolution
PANCE: Phage-assisted non-continuous evolution
PCR: Polymerase chain reaction
PEG 8000: Polyethylene glycol 8000
PDB: Protein Data Bank
RNA: Ribonucleic acid
RPM: Revolutions per minute
SDM: Site directed mutagenesis
SOB: Super optimal broth
SOC: Super optimal broth with catabolite repression
TAE: Tris-Acetate-EDTA
TF: Transcription factor
UFW: USF1-FosW
USF1: Upstream Stimulatory Factor 1
Zif268: Zinc finger-containing transcription factor 268


List of Appendices

Appendix A1 – Composition of media and buffers
Appendix A2 – DNA sequences used and representative figures
Appendix B – DNA sequences used and additional figures


Part I: Investigating how the disordered loop of USF1 alters DNA binding specificity in hereditary asthma.

1.1 Preface to Part I

Part I of this thesis consists primarily of a manuscript that has been published:

“Popa, S. C., & Shin, J. A. (2019). The Intrinsically Disordered Loop in the USF1 bHLHZ Domain Modulates Its DNA-Binding Sequence Specificity in Hereditary Asthma. The Journal of Physical Chemistry B, 123(46), 9862-9871.”

The format of the published manuscript has been heavily modified throughout to rearrange the order in which certain topics are discussed and to more fully discuss the presented material.

Author list:

Serban C. Popa, Jumi A. Shin

Author contributions:

S.C.P. designed and performed experiments, analyzed data, and wrote the paper. J.A.S. conceived the project idea, designed experiments and wrote the paper.


1.2 Introduction

1.2.1 The bHLHZ family of transcription factors and their disordered domains

The basic/helix-loop-helix/zipper (bHLHZ) group of transcription factors (TFs) belongs to the larger basic/helix-loop-helix (bHLH) superfamily and is involved in the transcriptional regulation of a whole host of human genes.1 Eukaryotic TFs contain more disordered regions than the average eukaryotic protein; these regions contribute to flexibility, conformational adaptability and the ability to modulate on/off rates of DNA binding necessary for fine-tuning gene regulation, and this is also true of bHLHZ TFs.1-3 The bHLHZ TFs function as dimeric species, forming either homo- or heterodimers in order to bind their cognate DNA targets. Of particular interest is the DNA target bound by the Myc:Max heterodimer, known as the Enhancer box (E-box, Fig. 1).1,4

Fig. 1. Representative bHLHZ heterodimer. The bHLHZ heterodimer (PDB: 1NKP 5) formed by c-Myc (blue) and Max (red) monomers is shown as a ribbon representation bound within the major groove of the E-box element (consensus sequence: 5’-CACGTG-3’). Structures visualized with Chimera v.1.13.1.56


The E-box consensus sequence is defined as 5’-CAN•NTG-3’, where the core nucleotides in the middle of the motif (N) determine the binding preference of a given bHLHZ to a given E-box motif.4,6 Protein contacts with the underlined nucleotide positions determine whether the bHLHZ TF interacts with the canonical Class A E-box (CACGTG) or non-canonical Class B E-box (CAGCTG).1 The basic regions of bHLHZ proteins are intrinsically disordered, and scan along DNA in the nucleus until they come across their cognate E-box, which causes the basic regions to undergo a disorder-to-order transition, in which the basic regions adopt a helical structure that makes contacts with nucleotides in the E-box motif.7,8 Additionally, the disordered loops of bHLHZ proteins are known to make phosphodiester contacts with nucleotides flanking the E-box that modulate the binding preferences of the TF, illustrating how vital these intrinsically disordered regions (IDRs) are for target recognition.8
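As an illustration of the CANNTG consensus described above, the following minimal Python sketch scans a DNA sequence for E-box hexamers and reports the two core nucleotides of each site. The function names and the test sequence are hypothetical and were chosen only for this example; they are not taken from the constructs used in this thesis.

```python
import re

# CANNTG written as a regular expression. The lookahead lets overlapping
# sites be reported instead of being skipped.
EBOX_PATTERN = re.compile(r"(?=(CA[ACGT]{2}TG))")


def find_eboxes(sequence):
    """Return (0-based start, hexamer) for every CANNTG site in the sequence."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in EBOX_PATTERN.finditer(seq)]


def core_nucleotides(hexamer):
    """Return the two central nucleotides (NN) of a CANNTG hexamer."""
    return hexamer[2:4]


# Hypothetical test sequence (not a sequence from this thesis).
example = "ttgaCACGTGaccCAGCTGtt"
for start, site in find_eboxes(example):
    print(start, site, core_nucleotides(site))
```

Because the reverse complement of any CANNTG site is itself a CANNTG site, scanning a single strand is sufficient to locate every E-box in a duplex.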

Dimerization of two bHLHZ monomers occurs using the helix-loop-helix domain as the primary dimerization interface, and is aided by additional dimerization motifs like leucine zippers (LZ) or Per-Arnt-Sim (PAS) domains.9,10 It has been suggested that bHLHZ dimers may preform in the cytosol after translation and, after transport to the nucleus, search for cognate E-boxes using their disordered basic regions; however, this may not be thermodynamically favorable.10 A more likely model is that a bHLHZ monomer identifies a target E-box, followed by rapid binding of a second monomer in a process resembling a two-state system.11 LZs are coiled-coil structures that consist of a heptad repeat of leucines, (abcdefg)n, in which hydrophobic residues at the first and fourth positions (a/d) of the heptad play an important role in dimerization specificity.1,10 Positions a/d lie at the dimerization interface and form “a hydrophobic stripe which associates with respective partners on the other helix”, such that altering these residues alters dimerization preferences.10 These secondary dimerization motifs can play a significant role in determining partner preference, which significantly impacts how the TFs will behave.12,13 For example, in the case of the Myc/Max/Mxd1 regulatory network, the Myc LZ possesses residues at positions a/d that preclude homodimerization but facilitate the formation of heterodimers with the Max LZ.12
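The heptad register described above can be sketched in a few lines of Python: each residue is assigned a position a to g, and the residues at a and d, which line the dimerization interface, are listed. The sequence below is a made-up example rather than a real zipper, and the register is assumed to start at position a; for a real LZ the starting register would have to be taken from a structure or an alignment.

```python
HEPTAD = "abcdefg"


def heptad_register(sequence, first_position="a"):
    """Pair each residue with its heptad position, starting at first_position."""
    offset = HEPTAD.index(first_position)
    return [(res, HEPTAD[(i + offset) % 7]) for i, res in enumerate(sequence)]


def interface_residues(sequence, first_position="a"):
    """Return (index, residue, position) for residues at heptad positions a and d."""
    register = heptad_register(sequence, first_position)
    return [(i, res, pos) for i, (res, pos) in enumerate(register) if pos in "ad"]


# Hypothetical 14-residue (two-heptad) sequence, for illustration only.
example_zipper = "LKALEDKVEELLSK"
for index, residue, position in interface_residues(example_zipper):
    print(index, residue, position)
```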

1.2.2 Intrinsically disordered regions and their use in vivo

Intrinsically disordered proteins (IDPs) and proteins containing IDRs are ubiquitously found in the eukaryotic proteome, and have greatly changed our understanding of how proteins interact with their intended targets.14-16 IDRs are characterized by low sequence complexity, low proportions of bulky, hydrophobic amino acids and high proportions of charged, hydrophilic amino acids that allow IDRs to sample multiple conformations.16 These IDRs are involved in diverse processes within the cell, functioning as hubs in molecular interaction networks, promoting phase separation within the cytoplasm and promoting the formation of multi-molecular complexes.14 All of this has spurred significant interest in studying and understanding how the structure of IDRs permits them to carry out these various functions.

TFs also benefit from using IDRs to augment their DNA-binding function.16-18 IDRs are particularly important for bHLHZ TFs as they regulate how these TFs identify and bind their cognate DNA. These bHLHZ proteins also contain significant disorder at their N and C termini, which are thought to be crucial to TF activity by “contributing to the ability of TFs to 1) recognize target sequences in the DNA appropriately, 2) bind to a wider diversity of DNA target sequences, 3) be anchored with higher affinity to the DNA after recognizing target sequences, 4) bind to other factors and complexes positioned on the DNA or involved in transcriptional regulation, or 5) present activation domains to downstream transcriptional regulatory machinery”.18 By better understanding how IDRs like the disordered basic region and disordered loops are used by bHLHZ proteins to recognize and interact with their cognate DNA, we can gain insights into TF function and possibly extend that knowledge to other types of IDPs as well.

1.2.3 USF1 and hereditary asthma

The most prominent of the bHLHZ TFs are the Myc family of bHLHZ proteins, in particular c-Myc, and one of its binding partners, Max. This heterodimer plays a central role in the cell cycle, by upregulating expression of critical positive cell cycle regulators like Cdks, cyclins and E2F transcription factors, as well as in cell differentiation and other important cellular processes; however, when aberrantly expressed, the Myc:Max heterodimer can lead to the development of various cancers, and as such has been studied extensively.12,13 With the focus given to the c-Myc:Max heterodimer, less attention has been paid to other members of the bHLHZ family, despite their roles in regulating other important processes within the cell. One such example is upstream stimulatory factor 1 (USF1), a bHLHZ TF that has been implicated in diseases such as hereditary asthma and atherosclerosis, although the mechanism(s) through which USF1 contributes to these diseases is not fully understood.19,20

Genetic studies of individuals afflicted with hereditary asthma have shown that these individuals have elevated levels of plasminogen activator inhibitor-1 (PAI-1), with the highest expression levels of PAI-1 occurring in the lungs from stimulated human mast cells.21 Transcription of PAI-1 is elevated in the airways of those who suffer from heritable asthma, and it has been suggested that PAI-1 induces allergic inflammation and tissue remodeling in the airways (Fig. 2A).22,23 Interestingly, these individuals have a polymorphism in the regulatory region of the PAI-1 gene that occurs in the context of the E-box through which USF1 regulates expression of PAI-1.23,24 The PAI-1 gene and the associated 4G/5G polymorphism play a role in hereditary asthma, as the 4G allele is more frequently observed in asthmatic children, and those with the 4G/4G genotype have the highest plasma levels of PAI-1, followed by 4G/5G then 5G/5G.23 Given that the 4G/5G polymorphism occurs in nucleotides flanking the E-box, and that it is the loop of bHLHZ TFs that contacts flanking nucleotides, it is possible that the intrinsically disordered loop of USF1 plays a role in the preferential binding of the 4G variant of PAI-1 and may ultimately be responsible for this type of hereditary asthma (Fig. 2B).

Fig. 2. Overview of PAI-1 expression and subsequent effects. (A) Schematic representation of downstream PAI-1 targets. PAI-1 acts as an inhibitor of tissue-type and urokinase plasminogen activators (tPA/uPA), which prevents the formation of plasmin, the protease that activates matrix metalloproteinases.24,25 This prevents the degradation of extracellular matrix (ECM), which then accumulates in the airways of asthmatics, leading to the asthmatic phenotype.24 (B) USF1 interactions with the 4G and 5G variants of PAI-1. USF1 is known to mediate expression of PAI-1 by interacting with an E-box motif in the regulatory region of the gene. Individuals with a 4G polymorphism in the nucleotides flanking that E-box have elevated levels of PAI-1 expression and present an asthmatic phenotype. Figure adapted from ref. 20.

1.2.4 The unique loop of USF1 and relevance to asthma

The loop of USF1 is unique when compared to the loop of a prototypical bHLHZ protein like Max (Fig. 3). The increased length of the USF1 loop confers greater flexibility that allows the loop to make contacts with nucleotides further away from the E-box motif relative to a typical bHLHZ protein like Max.6,26


Fig. 3. Comparison of the USF1 loop to a prototypical bHLHZ loop. USF1 (pink, PDB: 1AN4 26) has a loop that is significantly longer (12 residues) than that of Max (blue, PDB: 1HLO 6), whose loop is only 8 residues. The USF1 loop can make three contacts with nucleotides flanking the E-box, whereas the Max loop makes only one such contact.6,26 Structures visualized with Chimera v.1.13.1.56

The three residues in the USF1 loop that are responsible for making contacts with nucleotides flanking the E-box are S233, T234 and Q238 (Fig. 4A). Q238 lies in the middle of the loop and contacts the C nucleotide of the G/C immediately flanking the 3'-end of the E-box; it was not of interest, as it cannot contact nucleotides at the site of the PAI-1 polymorphism.26 Q238 may, however, be important in orienting the loop properly such that S233 and/or T234 can contact the phosphodiester backbone. S233 and T234 traverse the minor groove to make phosphodiester contacts and are at the farthest point in the loop, which is interesting because T234 contacts the thymine highlighted in Fig. 4B. This was of interest, as that thymine is found at the same position where the 4G/5G polymorphism of PAI-1 is found, and its positioning could allow T234 to distinguish between the 4G/5G variants of PAI-1. However, since USF1 was crystallized in complex with the adenovirus major late promoter, it is possible that the PAI-1 sequence has a different DNA topology, and that S233 is responsible for distinguishing between the two E-boxes.


Fig. 4. Three residues of the USF1 loop make contacts with nucleotides flanking the E-box. (A) USF1 interactions with the adenovirus major late promoter. USF1 binds to the E-box using its basic region, while S233 and T234 in the loop make contacts with flanking nucleotides where the 4G/5G polymorphism is found (PDB: 1AN4 26). The USF1 monomers are light and dark blue; the E-box is highlighted green. The following contacts are shown: S233 (black) and T234 (red) are at the far end of the USF1 loop. Structures visualized with Chimera v.1.13.1.56 (B) DNA sequences. Burley crystallized USF1 with the adenovirus major late promoter; the thymine contacted by T234 is bolded.26 The same position is bolded and highlighted yellow in all DNA sequences and corresponds to the 4G/5G polymorphism, illustrating the sequence diversity flanking the E-box bound by USF1. Sequences for 4G, 5G and nonspecific (NS) DNA were used for in vitro and in vivo studies of the various USF1 and Max constructs. The NS DNA sequence has been used previously and shown to be valid in studying bHLHZ binding, whereas the 4G and 5G E-boxes are flanked by the same sequences that are found in the regulatory region of PAI-1.24 (C) Loop sequences. Sequences of the USF1 bHLHZ and Max bHLHZ loops. S233 and T234 were mutated to alanine to remove their ability to contact the phosphodiester backbone of nucleotides flanking the E-box. USF1 AA is the double mutant with both S233 and T234 mutated to Ala. Positions corresponding to S233 and T234 are underlined; residues mutated to Ala are bolded. Adapted from ref. 27.

Another unusual feature of the USF1 loop is that the loops in the dimer make asymmetric contacts, where one loop interacts with the DNA while the other loop does not. This is highly uncharacteristic for bHLHZ TFs and is even more interesting since the 4G/5G polymorphism is found on only one side of the E-box motif.1,26 Taken together, these unique and unusual properties of the USF1 loop led to our initial hypothesis that it is the USF1 loop that allows for the preferential recognition of the 4G variant of the PAI-1 gene, and that either S233 or T234 mediates this interaction. To test the hypothesis, USF1 loop mutants were created, specifically S233A, T234A and the double mutant in which both residues were mutated to alanine.27 Mutating these residues to alanine removes the ability of their side chains to make phosphodiester contacts with the nucleotides flanking the E-box, so one would expect to see a change in the way the mutants interact with the E-box constructs. The double mutant, USF1 AA, was also made to investigate the effect of two Ala residues in the USF1 loop and was found to possess significantly altered properties relative to the wildtype USF1. The USF1 loop was also appended onto the Max protein scaffold in lieu of its native loop to elucidate the function of the USF1 loop in a non-native context.27 The resulting protein, MaxULoop, was found to differentiate the 4G/5G polymorphism of PAI-1, unlike the parent Max bHLHZ from which it was derived.

1.2.5 Characterization of bHLHZ proteins

The Shin group has developed a two-pronged strategy to study bHLHZ proteins and changes made to them, wherein their properties are assessed both in vivo and in vitro.

Circular dichroism (CD) has been used to give insights into the secondary structure of bHLHZ proteins, which is key to their proper function.28 Far-UV CD is predicated upon the differential absorption of left and right circularly polarized light by optically active biological molecules, including amino acids.29 When the amide chromophores of the polypeptide backbone are aligned in arrays, their optical transitions are shifted or split into multiple transitions.29 As a result, different structural elements have characteristic CD spectra, which can be used to determine the proportions of secondary structure elements present in a given protein (i.e., how much of the protein is α-helical, random coil, etc.); however, CD does not provide high-resolution information about protein structure. Given that bHLHZ proteins are primarily composed of disordered regions and α-helices, CD experiments can provide significant insights into their structural properties. The helicity of bHLHZ and related proteins has been shown to be linked to their DNA binding properties, which makes assessing the α-helical character of these proteins important.29,30
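To make the link between a CD signal and α-helical content concrete, the sketch below converts an observed ellipticity into mean residue ellipticity and then estimates fractional helicity from the signal at 222 nm. It is only a minimal illustration: the reference value for a fully helical peptide (about −39,500 deg cm² dmol⁻¹ for an infinite helix, with a 2.57/n length correction) is a commonly used empirical estimate that is assumed here and is not necessarily the treatment applied to the data in this thesis. The input numbers are hypothetical.

```python
def mean_residue_ellipticity(theta_mdeg, conc_molar, path_cm, n_residues):
    """Convert observed ellipticity (mdeg) to mean residue ellipticity
    in deg cm^2 dmol^-1: theta / (10 * c * l * n)."""
    return theta_mdeg / (10.0 * conc_molar * path_cm * n_residues)


def fully_helical_reference(n_residues, infinite_helix=-39500.0, k=2.57):
    """Assumed empirical [theta]222 for a fully helical chain of n residues."""
    return infinite_helix * (1.0 - k / n_residues)


def fractional_helicity(mre_222, n_residues):
    """Estimate the helical fraction from the mean residue ellipticity at 222 nm."""
    return mre_222 / fully_helical_reference(n_residues)


# Hypothetical measurement: -20 mdeg at 222 nm, 10 uM protein,
# 0.1 cm path length, 100-residue construct.
mre = mean_residue_ellipticity(-20.0, 10e-6, 0.1, 100)
print(round(mre), round(fractional_helicity(mre, 100), 2))
```

Some treatments divide by the number of peptide bonds (n − 1) rather than the number of residues; for constructs of this size the difference is small.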

We use electrophoretic mobility shift assays (EMSAs) to detect and study protein:DNA complexes. A titration of increasing protein concentrations is incubated with labelled DNA oligonucleotides and subjected to electrophoresis, where the free DNA oligonucleotides migrate farther down the gel than the protein:DNA complex.31 By examining the intensities of the shifted bands across the titration range (which reflect the protein’s affinity for the DNA), we can determine the protein’s dissociation constant for the target DNA (Kd), or in other words, the binding affinity of the protein for the DNA duplex.31,32 Once the protein:DNA complex is loaded onto a gel for electrophoresis, the system is no longer in true equilibrium (in contrast, in situ fluorescence anisotropy maintains true equilibrium throughout the measurement); nevertheless, EMSAs have been used successfully to measure Kd values for various bHLHZ proteins.32
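The way a Kd can be extracted from an EMSA titration is sketched below in Python. The example assumes the simplest possible model, a single-site isotherm in which the fraction of DNA bound equals [P]/(Kd + [P]) with protein in large excess over DNA; the equation actually used for dimeric bHLHZ proteins may differ, so the model, the function names and the synthetic data are all illustrative assumptions. The fit is a golden-section search over log10(Kd) and uses only the standard library.

```python
import math


def fraction_bound(bound_intensity, free_intensity):
    """Fraction of DNA shifted in one lane, from the two band intensities."""
    total = bound_intensity + free_intensity
    return bound_intensity / total if total > 0 else 0.0


def isotherm(protein_conc, kd):
    """Assumed single-site binding model: theta = [P] / (Kd + [P])."""
    return protein_conc / (kd + protein_conc)


def sum_squared_error(kd, concs, fractions):
    return sum((isotherm(p, kd) - f) ** 2 for p, f in zip(concs, fractions))


def fit_kd(concs, fractions, kd_min=1e-12, kd_max=1e-3, tol=1e-6):
    """Least-squares Kd estimate by golden-section search on log10(Kd)."""
    lo, hi = math.log10(kd_min), math.log10(kd_max)
    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if sum_squared_error(10 ** left, concs, fractions) < sum_squared_error(10 ** right, concs, fractions):
            hi = right
        else:
            lo = left
    return 10 ** ((lo + hi) / 2)


# Synthetic, noise-free titration generated with a hypothetical Kd of 50 nM.
concs = [1e-9, 5e-9, 1e-8, 5e-8, 1e-7, 5e-7, 1e-6]
fractions = [isotherm(p, 50e-9) for p in concs]
fitted_kd = fit_kd(concs, fractions)
print(f"fitted Kd = {fitted_kd * 1e9:.1f} nM")
```

With real gels the fractions would come from `fraction_bound` applied to quantified band intensities in each lane, and replicate titrations would be needed to estimate the uncertainty in Kd.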

The bacterial one-hybrid (B1H) assay designed by Wolfe et al. was adapted to study how bHLHZ TFs interact with the E-box motif in vivo (Fig. 5). The TFs used consist of just the bHLHZ domain and lack the ability to mediate expression of downstream genes by virtue of lacking their native activator domain (AD). The TF DNA binding domains were instead fused to the omega subunit of RNA polymerase. This fusion can then recruit the various components of the transcription complex needed for expression of the reporter gene downstream of the E-box.33 By altering key residues within the bHLHZ scaffold, we can change how the protein interacts with the E-box, resulting in changes to His3 reporter expression.


Fig. 5. Schematic representation of the bacterial one-hybrid assay. The transcription factor of interest (in this instance, USF1) is fused to the omega subunit of RNA polymerase (which acts as an AD) via a 21-residue linker and is cloned into the pB1H2w2 vector (refer to Methods for construct details). This fusion protein binds the E-box motif that is present on the pH3U3 vector to direct the omega subunit to the weak lac promoter. The omega subunit allows for the assembly of the RNA polymerase complex that initiates the transcription of the His3 reporter gene that allows the US0 cells (ΔhisBΔpyrF) to grow in His-deficient media.33 We can supplement the media with 3-Amino-1,2,4-triazole (3-AT), an inhibitor of His synthesis, which then requires increased transcription factor binding in order to allow the US0 cells to survive.33 Adapted from ref. 27.

1.2.6 Research objectives

Given what is known about the 4G/5G polymorphism in the regulatory region of the PAI-1 gene and how unique the USF1 bHLHZ is, we investigated the following questions: 1) Is the USF1 loop responsible for differentiating between the two variants of the PAI-1 polymorphism and 2) if the USF1 loop mediates this phenomenon, which residue(s) is/are responsible for this?

Studying how USF1 interacts with the regulatory region of PAI-1 with its unique loop might give insight into how intrinsically disordered loops affect protein-DNA interactions.

Additionally, studying the USF1 loop might give insights into the development of potential therapeutics for those that have the hereditary asthma associated with the PAI-1 gene.

1.3 Materials and Methods

All reagents were purchased from BioShop, except where noted. The compositions of media and buffers are given in Appendix A1.


1.3.1 Handling of E. coli strains

1.3.1.1 Culturing and storage of E. coli strains

E. coli strains DH5α (Invitrogen) and USO (Addgene bacterial strain # 18049) were grown in lysogeny broth (LB) or on LB agar plates, except where noted. Overnight cultures were started from a single colony or by inoculating from a glycerol stock into 5 mL of LB and grown at 37 ºC with shaking at 200 rpm in a MaxQ 400 Shaker (Thermo Scientific) for approximately 16 hours.

Antibiotics were added to the growth medium when required for plasmid selection. The final antibiotic concentrations were as follows: 50 mg/L ampicillin (Amp), 30 mg/L kanamycin (Kan).

Glycerol stocks for long-term storage were prepared by combining equal volumes of 50% v/v glycerol and cell culture and storing at -80 ºC.
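The working concentrations above follow from simple dilution arithmetic, sketched here in Python. The antibiotic stock concentration (50 mg/mL) is a hypothetical value chosen for the example and is not stated in this thesis; the function names are likewise illustrative.

```python
def stock_volume_ul(final_mg_per_l, culture_ml, stock_mg_per_ml):
    """Volume of antibiotic stock (uL) needed to reach the final concentration."""
    mass_mg = final_mg_per_l * culture_ml / 1000.0
    return mass_mg / stock_mg_per_ml * 1000.0


def mixed_concentration(conc_a, vol_a, conc_b, vol_b):
    """Concentration after combining two solutions (same units for both)."""
    return (conc_a * vol_a + conc_b * vol_b) / (vol_a + vol_b)


# 50 mg/L ampicillin in a 5 mL overnight culture, assuming a 50 mg/mL stock.
amp_ul = stock_volume_ul(50.0, 5.0, 50.0)

# Equal volumes of 50% v/v glycerol and culture (0% glycerol).
final_glycerol = mixed_concentration(50.0, 1.0, 0.0, 1.0)

print(f"{amp_ul:.1f} uL stock; {final_glycerol:.0f}% v/v glycerol")
```

Under these assumptions, 5 µL of stock is needed per 5 mL culture, and the glycerol stocks contain 25% v/v glycerol.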

1.3.1.2 Preparation and transformation of chemically competent cells

Chemically competent E. coli strains were prepared by the transformation and storage solution method as previously described.34

1.3.1.3 Transformation of competent E. coli

1–3 µL of plasmid DNA (intact plasmid or ligation product) was added to 100 µL of competent cells and incubated on ice for 30 minutes. Cells were heat shocked at 42 ºC for 45 seconds and 900 µL of Super Optimal broth with Catabolite repression (SOC; Super Optimal broth, SOB, with glucose to a final concentration of 20 mM) was added to each cell suspension.

Cells were recovered for 1–3 hours at 37 ºC with shaking. Longer recoveries were required after transformation with some ligation products. Cells were plated on LB agar containing the appropriate antibiotic(s) to select for the plasmid(s).

1.3.2 Manipulation of DNA

1.3.2.1 Preparation of plasmid DNA

Plasmids were isolated from cells and purified using a QIAprep Spin Miniprep Kit (QIAGEN). Plasmids were extracted from 5 mL of overnight culture following the manufacturer’s instructions with the following modifications: (1) cleared cell lysate and water were allowed to incubate in the column for 5 minutes before spinning to improve efficiency of DNA binding and elution, respectively; (2) an additional two-minute spin was performed after removal of wash buffer to ensure removal of residual ethanol; (3) DNA was eluted in nuclease-free water.

1.3.2.2 Agarose gel electrophoresis

DNA was separated on a 1–2% agarose (Invitrogen) gel in 1x TAE with SYBR® Safe DNA Gel Stain (Invitrogen). DNA samples (7–60 µL of DNA with 1x DNA sample buffer) were loaded into the wells of an agarose gel with a DNA ladder (1 kb Plus DNA Ladder, Invitrogen) and electrophoresed at 100 V for 1 hour using an EC 105 Electrophoresis Power Supply (E-C Apparatus Corporation). Gels were visualized on a VWR LM-20E transilluminator (VWR Scientific) at 302 nm.

1.3.2.3 Site directed mutagenesis

PCR primers were designed manually, and oligonucleotides were synthesized and lyophilized by Eurofins MWG Operon (Appendix A2 – Table 1). Primers were resuspended and diluted to a final concentration of 10 µM. All reactions had a total volume of 50 µL and were performed with either Taq DNA polymerase or Phusion® High-Fidelity DNA polymerase (New England Biolabs) according to the manufacturer’s instructions. Reactions were performed in an Applied Biosystems Veriti® 96-Well Thermal Cycler (Life Technologies).

1.3.2.4 Restriction digestion of DNA

Restriction endonucleases and buffers were purchased from New England Biolabs. Volumes of 10–40 µL of DNA per reaction were digested following the manufacturer’s instructions. When possible, restriction endonucleases were inactivated by incubating at 65 ºC for 20 minutes.

1.3.2.5 Alkaline phosphatase treatment

Digested plasmid DNA intended as a cloning vector was treated with alkaline phosphatase prior to purification and ligation to prevent vector re-circularization. Alkaline phosphatase and buffer were purchased from New England Biolabs. Enzyme and buffer were added directly to digestion reaction mixtures and the plasmids were treated following the manufacturer’s instructions.

1.3.2.6 Ligation

T4 DNA ligase and buffer were purchased from New England Biolabs. Quantities of 20–50 ng of vector DNA and a 3-fold or 10-fold molar excess of insert DNA were used in each ligation reaction. Ligation reactions were performed following the manufacturer’s instructions with the following exception: reactions were incubated at 14 ºC overnight, then 16 ºC for 2 hours.
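The insert quantity implied by a 3-fold or 10-fold molar excess can be estimated from the fragment lengths. The following is a minimal sketch only; the function name and the example vector and insert lengths are hypothetical and are not taken from the constructs described here.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, molar_ratio):
    """Mass of insert (ng) that gives the requested insert:vector molar ratio.

    For double-stranded DNA, moles are proportional to mass / length,
    so the average mass per base pair cancels out of the ratio.
    """
    if vector_bp <= 0 or insert_bp <= 0:
        raise ValueError("fragment lengths must be positive")
    return vector_ng * (insert_bp / vector_bp) * molar_ratio


if __name__ == "__main__":
    # Hypothetical example: 50 ng of a 5000 bp vector and a 250 bp insert.
    for ratio in (3, 10):
        mass = insert_mass_ng(vector_ng=50, vector_bp=5000, insert_bp=250,
                              molar_ratio=ratio)
        print(f"{ratio}:1 insert:vector -> {mass:.1f} ng insert")
```

Because short inserts weigh far less than the vector per mole, even a 10-fold molar excess typically corresponds to a smaller mass of insert than of vector.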

1.3.2.7 Gel extraction of DNA fragments

Following electrophoresis, bands corresponding to restriction endonuclease digested DNA or PCR products were excised from agarose gels with a scalpel. DNA was extracted and purified from gel slices using a QIAquick Gel Extraction Kit (QIAGEN) following the manufacturer’s instructions with the following modifications: (1) the dissolved gel slice/DNA and water were allowed to incubate in the column for 5–10 minutes before spinning to improve efficiency of DNA binding and elution, respectively; (2) an additional two-minute spin was performed after removal of wash buffer to ensure removal of residual ethanol; (3) DNA was eluted in 20 µL of nuclease-free water.

1.3.2.8 Quantification of DNA

DNA concentrations were measured using a NanoDrop 2000 spectrophotometer (Thermo Scientific) following the manufacturer’s instructions.

1.3.2.9 DNA sequencing

All Sanger sequencing of plasmid DNA and PCR products was performed by The Centre for Applied Genomics (TCAG) at The Hospital for Sick Children.

1.3.3 Bacterial one-hybrid experiments

1.3.3.1 Construction of DNA-binder and reporter vectors

DNA coding for the USF1 bHLHZ flanked by restriction sites compatible with the multiple cloning site of pB1H2w2 was ordered from IDT. pB1H2w2 (Addgene plasmid # 18038) was a gift from Scot Wolfe.33 A KpnI site was present at the 5’ end of USF1, while XbaI, a stop codon, and XhoI were present in that order at the 3’ end of USF1; this was done to facilitate subcloning of our TFs from pB1H2w2 to pET28a. Along with pB1H2w2, the USF1 oligo was double digested with KpnI and XbaI. Linearized pB1H2w2 was treated with alkaline phosphatase. Both DNA fragments were purified by gel extraction, quantified, and ligated. Successfully constructed DNA-binder vectors were confirmed by restriction analysis and Sanger sequencing.

E-box and NS DNA fragments containing the various spacers and flankers were designed manually and ordered from Eurofins MWG Operon. The forward oligonucleotide was designed to contain the desired binding sequence and incorporate the last six nucleotides of the NotI restriction site at the 5’ end and the first nucleotide of the EcoRI restriction site at the 3’ end, while the reverse oligonucleotide was complementary to the forward oligonucleotide and incorporated the last two nucleotides of the NotI site at the 3’ end and the first five nucleotides of the EcoRI site at the 5’ end. These partial restriction sites were designed to flank the E-box or NS DNA and spacer such that annealing the two oligonucleotides would produce a fragment with the desired binding sequence flanked by 5’ overhangs identical to those that would be created by a double digestion with NotI and EcoRI. Oligonucleotides were combined in equimolar amounts, treated with T4 polynucleotide kinase (New England Biolabs) to provide 5’ phosphate groups for downstream ligation, heated at 80 ºC for 20 minutes, and allowed to cool slowly to room temperature to anneal the DNA. pH3U3 (Addgene plasmid # 12609) was a gift from Scot Wolfe. pH3U3 was double digested with NotI and EcoRI.33 The resulting linearized plasmid was treated with alkaline phosphatase, purified by gel extraction, quantified, and ligated with the binding sequence fragments. Successfully constructed reporter vectors were confirmed by restriction analysis and/or Sanger sequencing.
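The oligonucleotide design described above can be sketched in a few lines of code. This is an illustrative sketch, not the procedure actually used: the function names are hypothetical, and the example binding sequence is a bare CACGTG E-box without the spacers and flankers used in this work. It relies only on the known recognition sites of NotI (GC^GGCCGC) and EcoRI (G^AATTC), both of which leave 4-nt 5’ overhangs.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_reporter_oligos(binding_sequence):
    """Build forward and reverse oligos (both written 5'->3') that anneal
    into a duplex with NotI- and EcoRI-compatible 5' overhangs."""
    site = binding_sequence.upper()
    # Forward: last six nt of the NotI site, binding sequence, first nt of EcoRI.
    forward = "GGCCGC" + site + "G"
    # Reverse: complementary strand, covering the first five nt of the EcoRI
    # site at its 5' end and the last two nt of the NotI site at its 3' end.
    reverse = "AATTC" + reverse_complement(site) + "GC"
    return forward, reverse


def five_prime_overhangs(forward, reverse):
    """Check that the strands pair outside the first 4 nt of each and
    return the two single-stranded 5' overhangs."""
    if reverse[4:] != reverse_complement(forward[4:]):
        raise ValueError("oligos are not complementary in the duplex region")
    return forward[:4], reverse[:4]


if __name__ == "__main__":
    fwd, rev = design_reporter_oligos("CACGTG")  # hypothetical minimal E-box
    print("forward 5'-" + fwd + "-3'")
    print("reverse 5'-" + rev + "-3'")
    print("overhangs:", five_prime_overhangs(fwd, rev))
```

The overhang check makes explicit why the annealed fragment can be ligated directly into NotI/EcoRI-digested pH3U3 without itself being digested.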

1.3.3.4 Bacterial one-hybrid assays

Selective NM –His plates and NM +/-His liquid minimal growth medium were prepared in advance as outlined previously.33 E. coli US0 cells (ΔrpoZ, ΔhisB, ΔpyrF) were transformed either 1) doubly, with pB1H2w2 containing a DNA-binding protein and pH3U3 containing one of the various E-box or NS binding sequences, or 2) singly, with one of the pH3U3/binding sequence constructs alone. Overnight cultures were started from well-isolated, single colonies of transformed US0 cells. 100 µL aliquots of overnight culture were used to inoculate 4 mL of NM +His and cultures were grown at 37 ºC with shaking for 1.5–3 hours. Cultures were then centrifuged at 1380g for 5 minutes in a Centrific Model 228 centrifuge (Fisher Scientific) to pellet the cells, the supernatants were decanted, and the pellets were resuspended in 5 mL of NM –His. This wash, which removes histidine, was performed a total of four times per culture. The washed cell pellets were resuspended in 1 mL of NM –His and diluted ten-fold into NM –His. Cells were then diluted in NM –His to produce 1 mL samples with an OD600 of 0.1, from which ten-fold serial dilutions were made from 10^-1 to 10^-6. 5 µL of each dilution was plated on NM –His agar with 0–30 mM 3-AT as well as LB agar. Positive and negative controls were performed with every B1H assay. The positive control consisted of US0 harbouring pB1H2w2 containing Zif268, a zinc finger transcription factor, and pH3U3 containing the binding sequence of Zif268 (GCGTGGGGCG),35 while the negative control consisted of US0 harbouring pB1H2w2 containing the DNA-binding protein under investigation and an empty pH3U3 vector.
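The normalization to an OD600 of 0.1 and the subsequent ten-fold dilution series follow from C1V1 = C2V2. The sketch below illustrates that arithmetic only; the function names, the example OD600 reading, and the 100 µL transfer volume are assumptions for illustration and are not specified in the protocol above.

```python
def culture_volume_for_target_od(measured_od, target_od=0.1, final_volume_ml=1.0):
    """Return (mL of culture, mL of NM -His) needed to reach target_od,
    using C1 * V1 = C2 * V2."""
    if measured_od < target_od:
        raise ValueError("culture is already below the target OD600")
    culture_ml = final_volume_ml * target_od / measured_od
    return culture_ml, final_volume_ml - culture_ml


def serial_dilution_plan(steps=6, transfer_ul=100.0, fold=10):
    """Return (label, transfer volume in uL, diluent volume in uL) for each
    step of a serial dilution; the transfer volume is an assumed value."""
    diluent_ul = transfer_ul * (fold - 1)
    return [(f"10^-{i}", transfer_ul, diluent_ul) for i in range(1, steps + 1)]


if __name__ == "__main__":
    # Hypothetical reading: washed, resuspended cells measured at OD600 = 0.8.
    culture_ml, medium_ml = culture_volume_for_target_od(0.8)
    print(f"{culture_ml * 1000:.0f} uL cells + {medium_ml * 1000:.0f} uL NM -His")
    for label, transfer, diluent in serial_dilution_plan():
        print(f"{label}: {transfer:.0f} uL into {diluent:.0f} uL NM -His")
```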

1.3.4 Protein expression

The USF1 bHLHZ, Max bHLHZ, and mutant proteins were bacterially expressed and purified following previously described methods with the modifications described below.36 DNA fragments encoding the proteins with E. coli-optimized codons were assembled and cloned into the pET28A(+) expression vector (Novagen; details of gene construction are given in the Supporting Information).

These vectors were transformed into E. coli BL21(DE3) pLysS for protein expression. Typically, cells were grown in a 1 L culture in which protein production was induced by adding 1 mM IPTG during the mid-log phase of growth (OD600 ~0.6). After induction, the cells were harvested and sonicated, and the proteins were purified using Co2+ metal affinity chromatography (TALON, Clontech) following the manufacturer’s protocol. The recommended buffers for TALON typically contain up to 5 mM β-mercaptoethanol (BME). However, 20 µL of 1 M DTT was used in lieu of BME because BME had previously been found to be covalently linked to cysteine residues in the TFs.32 After TALON, proteins in the elution fraction were reduced by exposure to 10 mM DTT for 1 h at 37 °C and further purified by reversed-phase HPLC (Beckman System Gold) on a semi-preparative reversed-phase C18 column (Vydac) with a gradient of acetonitrile and water plus 0.06% trifluoroacetic acid (v/v); the flow rate was 3 mL/min, and the gradient ran from 0 to 20% acetonitrile over 20 min, followed by 20 to 60% acetonitrile over 45 min. Protein identities were confirmed by ESI-MS (Waters Micromass ZQ, Model MM1), and their concentrations were measured by UV/Vis spectrometry (NanoDrop 2000 spectrophotometer, Thermo Scientific). Finally, the proteins were lyophilized in aliquots and stored at -80 °C. Immediately before EMSA, proteins were reconstituted to the desired concentration (typically 20 µM) in buffer and incubated for 1 h at 37 °C to maximize solubility.
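For reference, the two-segment HPLC gradient described above can be written as a piecewise function of run time. This is a sketch under the assumption that each segment is linear, which the text does not state explicitly; the names used are illustrative.

```python
# (start % acetonitrile, end % acetonitrile, duration in min), as given in the text.
GRADIENT_SEGMENTS = [(0.0, 20.0, 20.0), (20.0, 60.0, 45.0)]
FLOW_RATE_ML_PER_MIN = 3.0


def percent_acetonitrile(t_min, segments=GRADIENT_SEGMENTS):
    """Percent acetonitrile at run time t_min, assuming linear segments."""
    if t_min < 0:
        raise ValueError("time must be non-negative")
    elapsed = 0.0
    for start, end, duration in segments:
        if t_min <= elapsed + duration:
            return start + (end - start) * (t_min - elapsed) / duration
        elapsed += duration
    raise ValueError("time is beyond the end of the gradient")


def total_run_volume_ml(segments=GRADIENT_SEGMENTS, flow=FLOW_RATE_ML_PER_MIN):
    """Total solvent volume delivered over the whole gradient."""
    return flow * sum(duration for _, _, duration in segments)


if __name__ == "__main__":
    for t in (0, 10, 20, 42.5, 65):
        print(f"t = {t:>5} min: {percent_acetonitrile(t):.1f}% acetonitrile")
    print(f"total volume: {total_run_volume_ml():.0f} mL")
```

Such a function makes it straightforward to convert an observed retention time into the approximate acetonitrile percentage at which a protein elutes.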

1.3.5 Electrophoretic mobility shift assay (EMSA)

Single-stranded oligonucleotides containing the desired protein-recognition site (NS DNA, 4G, or 5G) were synthesized with 6-carboxyfluorescein (6-FAM) incorporated at their 5’ ends (Eurofins Genomics, Fig. 1B). The 6-FAM-labeled oligonucleotides were annealed to their corresponding unlabeled complementary oligonucleotides by mixing the two in 10 mM Tris–HCl, pH 7, with the unlabeled oligonucleotide in 1.5x molar excess, heating at 95 °C for 10 min, and slow-cooling to room temperature. These annealed DNA targets were used in the following binding reactions, with appropriate amounts of protein solution added to each reaction to cover the full titration range of the binding reaction. Protein-DNA binding reactions were performed in EMSA buffer (20 mM Tris–HCl, pH 8, 0.15 mM EDTA, 0.5 M NaCl, 2 mM DTT, 100 µg/mL BSA, 2 µg/mL poly dI–dC) with 2 nM 6-FAM-labeled duplex DNA in 30 µL total volume, and 4 µL of 30% ficoll was added to the samples just before loading onto the gel. Prior to electrophoresis, samples were treated with the temperature-leap tactic (T-leap) to minimize protein misfolding and aggregation: protein solutions were incubated at 4 °C overnight, followed by 30 min at 37 °C and 1 h at room temperature.37 The samples were loaded onto a pre-equilibrated native PAGE gel (10% polyacrylamide, 0.5 % TBE) and run at 200 V for 5 min followed by 100 V for 25 min. The gels were visualized using the BioRad ChemiDoc MP Imaging System and Imagelab software, and the resulting Kd values were obtained as described previously.38
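To illustrate how a Kd can be extracted from EMSA titration data, the sketch below fits a simple one-site binding isotherm (fraction bound = [P] / (Kd + [P])) by least squares. This model is an assumption chosen for illustration; the fitting equation actually used follows ref. 38 and is not reproduced here. The data points are synthetic, and the function names are hypothetical.

```python
import math


def fraction_bound(protein_nm, kd_nm):
    """One-site isotherm; assumes free protein ~ total protein."""
    return protein_nm / (kd_nm + protein_nm)


def sum_of_squares(kd_nm, concentrations, observed):
    """Sum of squared residuals between the isotherm and the observations."""
    return sum((fraction_bound(p, kd_nm) - y) ** 2
               for p, y in zip(concentrations, observed))


def fit_kd(concentrations, observed, kd_min=1e-2, kd_max=1e4, tol=1e-6):
    """Least-squares Kd (nM) found by golden-section search over log10(Kd)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = math.log10(kd_min), math.log10(kd_max)
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if (sum_of_squares(10 ** c, concentrations, observed)
                < sum_of_squares(10 ** d, concentrations, observed)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return 10 ** ((a + b) / 2)


if __name__ == "__main__":
    # Synthetic titration (nM monomer) generated from a known Kd of 5 nM.
    conc = [0.5, 1, 2, 5, 10, 20, 50, 100]
    obs = [fraction_bound(p, 5.0) for p in conc]
    print(f"fitted Kd = {fit_kd(conc, obs):.2f} nM")
```

Searching over log10(Kd) rather than Kd keeps the search well behaved across the several orders of magnitude that separate specific and non-specific binding.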

1.3.6 Circular dichroism (CD)

The protein was resuspended in CD buffer (15 mM Na2HPO4, 5 mM KH2PO4, 50 mM NaCl, pH 7.4) to a final concentration of 20 µM and incubated for 1 h at 37 °C. The protein was then diluted to 2 µM final concentration in 2 mL final volume with CD buffer (and DNA duplex if applicable, which was diluted to 2 µM final concentration). The protein-DNA mixture was treated with the T-leap, as described above.37 CD was performed on an Aviv 215 spectrometer with a suprasil, 10 mm path-length cell (Hellma) at 22 °C. Spectra were acquired between 180 and 300 nm at 0.2 nm increments and a sampling time of 0.2 s. Each spectrum was the average of two scans with the buffer control spectrum subtracted. Data obtained were not smoothed. Protein α-helix content was calculated by the method of Chau and coworkers.39 Briefly, percent α-helix content was determined by assuming only α-helical content (i.e., assuming no β-sheet structure) and using the equation:

$$H = \frac{\Theta_{222}}{\Theta^{\infty}_{H222}\,\left(1 - k_{222}/n\right)}$$

where $H$ is the percent helicity, $\Theta_{222}$ is the mean residue ellipticity at 222 nm, $\Theta^{\infty}_{H222}$ is the reference value for a helix of infinite length, $k_{222}$ is a wavelength-dependent constant, and $n$ is the number of amino acids in the protein.
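The helicity calculation can be expressed as a short function. This is a sketch only: the default reference ellipticity (-39,500 deg cm^2 dmol^-1) and constant (2.57) are values commonly quoted for this method at 222 nm and are assumed here, since the values used in this work are not stated above; the example ellipticity and chain length are hypothetical. The result is multiplied by 100 to express it as a percentage.

```python
def percent_helicity(mre_222, n_residues, theta_inf_222=-39500.0, k_222=2.57):
    """Percent alpha-helicity from the mean residue ellipticity at 222 nm.

    mre_222       -- mean residue ellipticity at 222 nm (deg cm^2 dmol^-1)
    n_residues    -- number of amino acids in the protein
    theta_inf_222 -- assumed reference value for an infinitely long helix
    k_222         -- assumed wavelength-dependent constant
    """
    if n_residues <= k_222:
        raise ValueError("chain is too short for the length correction")
    length_correction = 1.0 - k_222 / n_residues
    return 100.0 * mre_222 / (theta_inf_222 * length_correction)


if __name__ == "__main__":
    # Hypothetical input: MRE of -18,000 for a 100-residue protein.
    print(f"{percent_helicity(-18000.0, 100):.1f}% helical")
```

The length correction matters most for short constructs: for a fixed ellipticity, a shorter chain yields a higher calculated helicity.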

1.4 Results and Discussion

1.4.1 Assessing secondary structure of USF1 mutants for proper folding

After purifying the various proteins using metal-ion affinity chromatography and then reversed-phase HPLC (see Methods), we assessed the secondary structure of the proteins. A generalization for bHLHZ TFs is that the more α-helical the TF is, the better it is at dimerizing and binding DNA, so it was important to confirm that our TFs assume the proper secondary structure.27 The lyophilized proteins were resuspended in phosphate buffer and treated using the temperature-leap tactic (where the protein is allowed to fold slowly and properly at 4 °C overnight and is then warmed to 37 °C for 1 h before measurement), as this was found to improve the α-helicities of bZIP and other similar TFs, and the α-helicities of the various constructs were determined (Fig. 6, Table 1).39,40

Fig. 6. Representative CD spectra for USF1 T234A. The above spectra were generated for USF1 T234A in the absence of DNA (black), and with the addition of NS DNA (blue), 4G E-box (green) and 5G E-box (red). There are two local minima at 222 nm and 208 nm that are characteristic of α-helices indicating that the protein has properly folded.39 Upon introduction of DNA, a characteristic disorder-to-order transition is observed where the disordered basic region becomes α-helical in order to interact with the DNA. CD spectra for the wildtype USF1 can be found in Appendix A2 (Appendix Fig. 1). Adapted from ref. 27.

As observed in Fig. 6 and Table 1, the various constructs behave as expected of a typical bHLHZ TF: the α-helicity of these proteins increases as NS DNA is added and increases further in the presence of the cognate E-box. For these constructs, the α-helicities measured in the presence of the 4G and 5G E-boxes are very similar, suggesting that differences in α-helicity do not drive differentiation of the PAI-1 polymorphism. The α-helicity of MaxULoop is significantly lower than that of Max bHLHZ, which can be attributed to the domain swap that it underwent. The USF1 loop is slightly longer than that of Max (12 residues vs 8 residues), but that alone does not explain the significant decrease in α-helicity that is observed. This suggests that the loop swap had a destabilizing impact on the MaxULoop structure (discussed below), which is observed in part as the notable decrease in the α-helical character of the protein.

Table 1. Summary of CD data for the various USF1 and Max constructs.

Protein       α-helicity,   α-helicity,   α-helicity,   α-helicity,
              no DNA        NS DNA        4G            5G
USF1 bHLHZ    50%           62%           67%           63%
USF1 S233A    47%           52%           58%           55%
USF1 T234A    47%           53%           56%           56%
USF1 AA       58%           40%           40%           38%
Max bHLHZ     50%           55%           55%           55%
MaxULoop      16%           21%           20%           19%

The above summarizes the α-helicity values obtained for USF1 and Max constructs from the CD experiments. Each value is the average of two CD scans of the same sample. The helicity of each transcription factor was determined as described previously.28,39

MaxULoop was made by aligning the protein sequences of the Max bHLHZ and USF1 bHLHZ to determine regions where the sequences differed significantly (corresponding to the loops of the two proteins). The 8 aa loop of Max was swapped for the 12 aa USF1 loop, keeping the rest of the Max protein scaffold, in order to investigate the effects of the USF1 loop in a non-native context. USF1 AA was made to investigate the effect that the double mutant (S233A and T234A) would have on the USF1 loop’s ability to differentiate between the 4G and 5G E-boxes. We hypothesized that the protein would have the same affinity to the E-box as the wildtype protein, but lack the ability to differentiate between the 4G and 5G E-boxes (refer to Appendix A2 Fig. 1 for the protein sequences of all constructs that were investigated).

The most interesting construct is USF1 AA, whose behaviour deviates from what is expected of a typical bHLHZ protein. USF1 AA has a notably higher α-helical character than the other USF1 derivatives in the absence of DNA; we believe that this is due to having two consecutive alanine residues in the USF1 loop. Alanine is known to promote the formation of α-helices in proteins, and having two alanines in a disordered region of the protein may have imposed additional structure onto the protein.41,42 This altered secondary structure may have weakened the ability of USF1 AA to interact with DNA, which may contribute to the unusual α-helicity recorded in the presence of DNA. Inspection of USF1 AA using X-ray crystallography or NMR could confirm whether any changes were made to the secondary structure of the protein that would explain its unexpected behavior.

1.4.2 Determining DNA binding affinities via EMSA

Having established that the USF1 and Max derived constructs were properly folded and able to bind DNA as previously described in the literature, the DNA binding affinities of the various constructs were determined via EMSA for binding to NS DNA and 4G and 5G E-box containing DNA (Fig. 7, Table 2).

Fig 7. In vitro characterization of USF1 T234A. (A) Representative EMSA for USF1 T234A binding to the 4G E-box. USF1 T234A shows a high affinity for the 4G E-box (Kd = 4.7 nM), like the wildtype protein. The additional bands that appear at the higher protein concentrations correspond to the formation of tetramers, as was previously reported.26 Numbers above each lane correspond to amounts of USF1 T234A monomer added to target DNA. (B) Representative curve fit for USF1 T234A binding to the 4G E-box. The curve fit was done in Kaleidagraph v. 3.6.2 43 (R2 = 0.992) for the EMSA result shown in Fig. 7A. From the curve fit, it was possible to determine the Kd values reported in Table 2. Representative EMSA data for USF1 T234A binding to the other DNA targets can be found in Appendix A2 (Appendix Fig. 3).

Table 2. Summary of Kd values for the various USF1 and Max constructs.

Protein       4G Kd (nM)       5G Kd (nM)       NS DNA Kd (nM)
USF1 bHLHZ    7.0 ± 0.4        4.1 ± 0.3        102 ± 3.1
USF1 S233A    5.7 ± 0.7        4.4 ± 0.3        153 ± 17.0
USF1 T234A    4.9 ± 0.3        3.9 ± 0.1        120 ± 9.8
USF1 AA       23.8 ± 2.1       24.4 ± 3.1       611 ± 94
Max bHLHZ     5.1 ± 0.1 (a)    5.0 ± 0.1 (a)    306.3 ± 28.3
MaxULoop      490.6 ± 24.6     426.3 ± 29.6     > 2000 (b)

Each value is the average of two independent EMSA experiments. Numbers represent the total monomeric protein concentration of each sample.
(a) Previous measurements of the Max bHLHZ binding to an E-box containing oligonucleotide (AT E-box) gave Kd values of 14 ± 8 nM via fluorescence anisotropy with slightly altered buffer conditions.38
(b) The observed Kd value for MaxULoop was greater than 2000 nM; at this concentration range there were issues with protein aggregation such that obtaining a good curve fit was not feasible.

The various USF1 constructs had low nanomolar affinities for the 4G and 5G E-boxes, which complicated our initial analysis of the data, as the Kd values were very similar (and highly reproducible) for the various loop mutants. Additionally, the USF1 constructs all bind the E-box with high specificity, as found when comparing binding to NS DNA vs 4G or 5G E-boxes. The most surprising EMSA result is that the USF1 bHLHZ appears to preferentially bind the 5G E-box, not the 4G E-box as was previously reported in the literature,23,24 a finding which was contrary to our initial hypothesis (discussed below).
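The comparisons made above can be put on a common scale by expressing the Table 2 values as ratios. The sketch below uses the mean Kd values from Table 2 directly; the two ratio definitions and the function names are illustrative choices rather than metrics defined in this work, and the uncertainties in Table 2 are not propagated.

```python
# Mean Kd values (nM) from Table 2; None marks a value reported only as > 2000 nM.
KD_NM = {
    "USF1 bHLHZ": {"4G": 7.0, "5G": 4.1, "NS": 102.0},
    "USF1 S233A": {"4G": 5.7, "5G": 4.4, "NS": 153.0},
    "USF1 T234A": {"4G": 4.9, "5G": 3.9, "NS": 120.0},
    "USF1 AA": {"4G": 23.8, "5G": 24.4, "NS": 611.0},
    "Max bHLHZ": {"4G": 5.1, "5G": 5.0, "NS": 306.3},
    "MaxULoop": {"4G": 490.6, "5G": 426.3, "NS": None},
}


def preference_for_5g(protein):
    """Kd(4G) / Kd(5G); values above 1 indicate tighter binding to 5G."""
    kd = KD_NM[protein]
    return kd["4G"] / kd["5G"]


def specificity(protein):
    """Kd(NS) / weakest E-box Kd, or None if the NS Kd was not determined."""
    kd = KD_NM[protein]
    if kd["NS"] is None:
        return None
    return kd["NS"] / max(kd["4G"], kd["5G"])


if __name__ == "__main__":
    for name in KD_NM:
        spec = specificity(name)
        spec_text = "n/d" if spec is None else f"{spec:.0f}-fold"
        print(f"{name:<11} 4G/5G = {preference_for_5g(name):.2f}, "
              f"NS/E-box = {spec_text}")
```

Ratios close to 1 for Kd(4G)/Kd(5G) indicate little discrimination between the two E-boxes in vitro, whereas a large NS/E-box ratio reflects specific binding.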

Studying the various USF1 loop mutants did not clarify what the key residue was in determining DNA binding preference. USF1 S233A behaved like the wildtype USF1, suggesting that S233 is not the residue that drives differentiation of the PAI-1 polymorphism. On the other hand, USF1 T234A binds the 4G E-box with almost the same affinity as it does the 5G E-box, suggesting that T234 might be a residue of interest for DNA recognition; however, it was difficult to draw any definitive conclusions. The impact of changing the binding affinity by a few nM in an in vitro assay may not be significant, but in vivo, it is possible that a slight change in binding affinity could have significant implications for the expression of genes that are regulated by USF1.

USF1 AA behaved in a similar fashion to USF1 T234A in that it seemingly does not differentiate between the 4G and 5G E-boxes; however, the binding affinities are weaker than for the other constructs. This agrees with the CD data for USF1 AA in that it has a lower α-helicity in the presence of DNA, which correlates to a weaker DNA binding affinity as determined by EMSA (58% α-helicity in the absence of DNA vs 40% α-helicity in the presence of cognate DNA). This further reinforced the suspicion that having two consecutive alanines in the loop of USF1 altered the secondary structure of the loop and other regions of the protein. This could remove the ability of the loop to differentiate between the two E-boxes and could also weaken the binding affinity of the TF, by possibly altering the dimerization interface of the protein. If this is the case, then that might suggest that the USF1 loop not only contributes to DNA binding specificity but might also contribute to binding affinity as well.

EMSA studies with the Max constructs also provided significant information on the nature of the USF1 loop. Max-derived proteins can be difficult to work with, as has been noted in the literature, and this may contribute to discrepancies observed between our results and previously published results.38,40 The Max HLH has positively charged residues throughout helix 1 and helix 2 that hinder Max dimerization. The full-length Max protein has a disordered, negatively charged domain preceding the basic region that folds back to mask these charges, but the constructs used here lack this domain.40 By systematically altering various components of the buffer, we were able to identify a buffer that improved the stability of the Max constructs to enable assessment of their DNA-binding function in EMSAs.

The Max bHLHZ is incapable of differentiating between 4G and 5G E-boxes, which agreed with our hypothesis that the Max loop is too short to allow it to do so. Interestingly, MaxULoop was able to still bind the E-box, despite the disruptive loop domain swap (as evidenced by the CD data and loss of DNA binding affinity), but it also gained the ability to differentiate between the 4G and 5G E-boxes, in agreement with the hypothesis that it is the USF1 loop that mediates this process. Because the domain swap greatly impacted the secondary structure of MaxULoop, its binding affinity is much weaker than what was observed for the other bHLHZ proteins, and it is easier to see that the USF1 loop drives the differentiation of the E-boxes, as the difference in Kd values for the 4G and 5G E-boxes is significantly larger than for the USF1 constructs.

The fact that MaxULoop was still able to interact with DNA despite how disruptive the domain swap appeared to be was surprising, as domain swap experiments often abrogate the original function of the protein. More time could have been spent to optimize the length and location of the loop swap for MaxULoop to generate a better DNA binder; however, the initial construct was sufficient to validate our hypothesis. By creating MaxULoop, it was possible to gain more insights into the USF1 loop’s contributions to binding and specificity in contrast to just investigating the USF1 loop mutants, as it was possible to see what the USF1 loop does in a non-native context.

From the EMSA studies, we concluded that the USF1 bHLHZ appears to preferentially bind the 5G E-box and not the 4G E-box, that T234 might be responsible for differentiating between the two E-box variants, and that S233 contributes little if anything to E-box differentiation. Another key finding was that it is possible to alter the DNA-binding preferences of Max by grafting the USF1 loop onto the Max protein scaffold. To further clarify the EMSA results, in vivo assays were carried out to gain insight as to how the constructs behave in a biological setting.

1.4.3 Examining bHLHZ activity in vivo via the B1H assay

Before being able to test the USF1 loop mutants in the B1H assay, we had to optimize the reporter system. By altering the number of nucleotides in the flanking sequence between the E-box motif and the weak lac promoter, we can alter how the TF-omega subunit fusion interacts with the reporter system (Fig. 5). The goal was to identify a spacer length that would allow for specific activity (i.e. the TF-omega subunit fusion interacts solely with the E-box motif and not NS DNA present at the same position) and that has a high signal-to-noise ratio (in other words, the reporter system is not auto-activating). A spacer library that had been previously generated by Dr. Ichiro Inamoto and Dr. Sarmitha Sathiamoorthy for studying their own bHLHZ TFs was screened to find a suitable spacer. It was found that the -9 and -11 AT spacers provided a strong B1H signal even at high 3-AT concentrations (Fig. 8B) and that these spacers were not auto-activating (Fig. 8A), which made these two spacers promising candidates for use with USF1.

[Fig. 8 image: B1H spot plates, panels A and B. Panel A rows: -9 AT E-box; -11 AT E-box; -13 AT E-box; -11 GC E-box; positive control; negative control. Panel B rows: positive control; negative control; -9 AT E-box + USF1; -11 AT E-box + USF1; -13 AT E-box + USF1; -11 GC E-box + USF1.]

Fig 8. Optimization of B1H assay for use with USF1. (A) Auto-activating B1H to test signal-to-noise ratios. Shown is the 2.5 mM 3-AT auto-activating plate. US0 cells were transformed solely with the reporter construct in the pH3U3 vector for this experiment. Reporters with a spacer length greater than 11 bp were found to be auto-activating and were not pursued. The -9 and -11 AT spacers showed only minimal auto-activating behavior and were deemed fit for use. (B) True B1H to test signal from different spacers. Shown is the 20 mM 3-AT plate. -9/-11 refers to the number of nucleotides from the beginning of the weak lac promoter to the designated nucleotides that flank the E-box (AT or GC). The positive control (+) comprises the transcription factor Zif268 with its cognate DNA in lieu of the E-box, and the negative control (-) comprises Zif268 with an unchanged pH3U3 vector that lacks the Zif268 cognate DNA site upstream of the His3 reporter.33,35 5 µL of serially diluted cells (10^-1 to 10^-6) were spotted from left to right.

Once these promising spacers were found, the goal was to clone the two variants of PAI-1 into the pH3U3 reporter system (keeping the optimal spacer lengths) to assess how USF1 interacts with the 4G and 5G E-boxes in a context that is as similar to the PAI-1 gene as possible. To do so, -9 and -11 bp versions of the 4G and 5G E-boxes were made, cloned into pH3U3 and used to repeat the B1H experiments done previously. The -9 4G and 5G reporters were found to be auto-activating (Fig. 9A) and as such were not suitable to study the USF1 loop mutants. The -11 4G and 5G reporters were not auto-activating, so these were the constructs that were used for all subsequent B1H experiments (Fig. 9B). Lastly, the -11 4G and 5G reporter constructs caused minimal non-specific activity when the E-box was replaced with NS DNA, showing that these constructs are indeed valid for use in characterizing the USF1 loop mutants in vivo (Appendix A2 Fig. 4).

We found that USF1 bHLHZ produces a notably stronger signal from the -11 5G E-box relative to the -11 4G E-box, which agrees with the earlier EMSA studies, but the magnitude of the difference is more notable in the B1H assay than in the EMSA. However, these reporter assays rely on indirect means for detection of protein:DNA interactions, and their output is not necessarily linear, meaning that it is difficult to say how meaningful these differences in reporter expression are.44

[Fig. 9 image: B1H spot plates, panels A and B. Panel A rows: -9 4G E-box; -9 5G E-box; -11 4G E-box; -11 5G E-box; positive control; negative control. Panel B rows: -9 4G E-box + USF1; -9 5G E-box + USF1; -11 4G E-box + USF1; -11 5G E-box + USF1; positive control; negative control.]

Fig. 9. Identifying optimal spacer length for 4G/5G constructs for B1H with USF1. (A) Auto-activating plates for -9/-11 4G and 5G reporter constructs. Shown is the 5 mM 3-AT plate. US0 cells were transformed solely with the reporter construct in the pH3U3 vector for this experiment. The -9 4G and 5G E-box reporters are auto-activating in that they produce the His3 protein even in the absence of the USF1 fusion protein, whereas the -11 reporters are not. Controls are as previously described. (B) Plates testing signal of -9/-11 4G and 5G E-box reporter constructs. Shown is the 10 mM 3-AT plate. Surprisingly, the signal produced from the -11 5G E-box is significantly stronger than the signal from the -11 4G E-box. The wildtype USF1 was used for these experiments. Controls are as previously described.

Using the now validated B1H assay, we tested the various USF1 mutants to determine whether the loop mutations had any impact on USF1 behavior in vivo. The B1H results for the USF1 loop mutants agree with what we observed in vitro, with the results being easier to interpret than the results for the EMSAs. USF1 bHLHZ and USF1 S233A both preferentially bind the 5G E-box over the 4G E-box as evidenced by the higher cell densities when plated on His deficient media (Fig. 10A); however, the difference is less than an order of magnitude. Using a different reporter gene like LuxAB could make detecting and quantifying the differences in reporter expression easier, but the B1H assay is still useful in providing insight as to how the USF1 loop mutations alter . USF1 T234A, on the other hand, does not differentiate between the 4G and 5G E-boxes, which agrees with the EMSA data and reinforces the idea that T234 is an important residue when it comes to differentiating the PAI-1 polymorphism. USF1 was shown to have some nonspecific activity with the -11 4G NS construct (Appendix Fig. 1), which makes the fact that USF1 gives a better signal for the 5G E-box, and not 4G E-box, even more striking.

Surprisingly, USF1 AA seemingly differentiates between the two E-boxes in vivo (Fig. 10C) despite having very similar Kd values in vitro for the two E-boxes. Additionally, the amount of growth in the B1H with USF1 AA is less relative to the other USF1 constructs, which agrees with previous results. From previous data, we had hypothesized that USF1 AA has a different secondary structure relative to the other USF1 variants, which may explain how it can still differentiate between the E-boxes despite having both S233 and T234 mutated to alanine. Our working theory is that if the loop structure is significantly different for USF1 AA, it may be possible for Q238 to make the contacts needed to differentiate between the 4G and 5G E-boxes; however, it is difficult to synthesize all of the experimental data into one cohesive explanation as to how the double loop mutant changes the DNA binding preferences of USF1. Structural studies via NMR might provide insight into how the AA loop mutant interacts differently with the 4G and 5G E-boxes relative to the wildtype protein or even the single mutants.

As in the EMSA studies, interesting results were found for the Max constructs (Fig. 10B). Max bHLHZ and USF1 T234A behave in a similar manner and do not differentiate between the E-boxes; however, MaxULoop does distinguish between the two, producing a signal that is roughly an order of magnitude stronger for the 5G E-box than for the 4G E-box.

Fig 10. B1H results for USF1/Max derived TFs. (A) Effects of S233A and T234A mutations. E-box responsive transcriptional activation of USF1 loop mutants plated as a 10-fold serial dilution (10^-1 to 10^-6 from left to right for all plates) on a 5 mM 3-AT plate in the B1H assay. USF1 bHLHZ and USF1 S233A preferentially bind the 5G E-box, resulting in improved growth relative to the 4G E-box. USF1 T234A shows no ability to differentiate between the 4G/5G sequences, resulting in identical growth for the 4G and 5G. Red boxes highlight the 10^-4 dilutions on the plate. All reporter constructs had the -11 spacer. Controls are as previously described. (B) USF1 loop swap alters binding specificity of Max. The Max bHLHZ produces a weaker signal than USF1 bHLHZ in the B1H assay (1 mM 3-AT plate shown); the Max bHLHZ is unable to distinguish between the 4G/5G polymorphism. MaxULoop, on the other hand, behaves like USF1 bHLHZ, in that it preferentially binds the 5G sequence, producing a signal that is ~10 times stronger than that produced with the 4G. Controls are as previously described. (C) USF1 AA differentiates between 4G and 5G E-boxes. Surprisingly, the double mutant can differentiate between the 4G E-box (top lane for all proteins) and the 5G E-box (bottom lane for all proteins), whereas USF1 T234A does not. Shown here is the 5 mM 3-AT plate. Controls are as previously described.

The Max and MaxULoop results from the EMSA experiments match what is observed in vivo, with the in vivo results being more straightforward to interpret. These results reinforce the idea that it is indeed the USF1 loop that allows for differentiating between the 4G and 5G E-box, as the wildtype Max bHLHZ is unable to differentiate between the two E-boxes, while MaxULoop does differentiate between the two. The ~10-fold difference in the MaxULoop signal for the 5G E-box vs. the 4G E-box provides further evidence that the USF1 loop contributes to both DNA-binding preference and affinity; this difference is difficult to discern in the USF1 B1H assays given how strongly all of the constructs interact with the E-box.

1.4.4 Discrepancies from previously published material

While the results suggest that the USF1 bHLHZ does indeed differentiate between the two variants of the PAI-1 gene, we found that the USF1 bHLHZ preferentially binds the 5G E-box, and not the 4G E-box as was previously reported, which was perplexing to us.24 Previous EMSA characterization of USF1 used the full-length USF1 (comprising a bHLHZ DNA-binding domain and an activator domain) produced in stimulated human mast cells, whereas our EMSA experiments used only the USF1 bHLHZ produced in E. coli, which differs significantly from the literature conditions.24 These differences led to several theories that might explain the discrepancy: 1) in mammalian cells, a post-translational modification occurs in the USF1 bHLHZ (most likely in the loop) that shifts the DNA-binding preference of USF1 from the 5G E-box towards the 4G E-box; 2) post-translational modifications in IDRs preceding or following the USF1 bHLHZ impact DNA-binding preferences; or 3) the full-length USF1, together with the transcriptional machinery needed for expression of PAI-1, uses regions outside of the DBD to preferentially bind the 4G E-box.

Post-translational modifications are a common way of fine-tuning the behavior of TFs once they are expressed. For example, phosphorylation of Max modulates the structure/disorder balance of the basic region, greatly impacting how it interacts with the E-box, so it was thought that something similar might happen with USF1.45 UniProt and other protein databases were searched for sites where USF1 undergoes post-translational modifications. We found that four modifications occur in the USF1 bHLHZ (modifications in other parts of the full-length USF1 protein were not explored, as our constructs lack these regions). Of particular interest to us was the acetylation of K235, as it lies beside T234, which mediates 4G and 5G differentiation. K235 of USF1 is known to be acetylated in the context of fat metabolism and insulin production, but may be modified in other contexts as well.46,47 Acetylating this residue changes the chemical environment of the loop, so it is possible that the DNA-binding preferences of USF1 could be altered. Acetylation of lysine is a common post-translational modification that contributes to a TF’s ability to regulate nuclear functions.48

To test this hypothesis, site-directed mutagenesis was carried out to create USF1 K235M (Fig. 11), as Met somewhat resembles an acetylated Lys residue. USF1 K235M had a slightly weaker signal in the B1H assay relative to the wildtype, but still maintained the same DNA-binding preference as the wildtype protein. This suggests that the chemical environment of the loop impacts binding affinity but has little to no impact on specificity, which was an interesting result but did not answer the starting question. The other post-translational modifications that occur in the USF1 bHLHZ all involve acetylation of lysine residues in other regions of the USF1 bHLHZ; these residues were considered unlikely to impact DNA-binding preferences, so they were not explored. In light of the B1H data, we did not express USF1 K235M for any in vitro experiments. Acetylating the K235 site specifically in vitro and doing EMSAs with the acetylated USF1 might have provided more insight into how modifications of this residue affect DNA binding. It is possible that modifications in IDRs not part of the USF1 bHLHZ alter its DNA-binding properties, as this is observed with other eukaryotic transcriptional machinery, which could explain why our results differ from the literature.49

Fig. 11. B1H assay to assess the impact of K235 acetylation on USF1. Relative to USF1 bHLHZ (first two lanes), USF1 K235M is a slightly weaker binder of the E-box; however, it still preferentially interacts with the 5G E-box rather than with the 4G E-box. Samples plated on a 2.5 mM 3-AT plate. Controls as described before.

As such, our current working hypotheses to explain the discrepancies between these results and previously published results are either that the full-length USF1 (AD + bHLHZ), which recruits the eukaryotic transcriptional machinery, may have altered DNA-binding preferences, or that modifications of other IDRs in the full-length USF1 may impact DNA-binding preferences.

1.4.5 Relevance

USF1's use of its unusually long 12-residue loop to recognize and distinguish closely related DNA sequences could be an excellent tactic that has fine-tuned its utility during evolution. Loop structures are useful in molecular recognition: catalytic antibodies use protein loops, and aptamers use oligonucleotide loops to effect function.50-52 The humoral immune response uses antibody hypervariable loops to recognize foreign antigens.51 Surprisingly, ~40% of human proteins are believed to be significantly disordered yet may adopt (partly) folded structures to achieve function.14 By expanding upon the work shown here, it may be possible to create designer peptides that bind desired DNA targets by swapping loops and other IDRs with known properties onto our protein scaffolds. The results for the USF1 loop suggest that not only can this disordered loop region drive DNA-binding specificity, but it can also behave as an independent unit that can be pasted into another protein framework. Gaining insights into how IDRs are used by proteins for DNA recognition has tremendous potential applications. By understanding how these IDRs function, it could be possible to design small proteins that target specific DNA motifs associated with diseases as a precursor to personalized medicine, or to develop new proteins useful in synthetic biology toward the design and control of biological systems.

1.4.6 Future directions

Currently, the only structural information that exists for USF1 is Burley’s crystal structure of the truncated USF1 bHLH in complex with the adenovirus major late promoter.26 It was never explicitly stated why the full-length USF1 bHLHZ was not used; it could be that the full-length protein was not amenable to crystallization.26 It would be interesting to obtain structures of the USF1 loop mutants (via NMR, possibly in collaboration with Prof. Kanelis) in complex with the 4G and 5G E-boxes associated with PAI-1 to better understand how changing these residues alters DNA binding. The structural information gained from these experiments could potentially inform the rational design of IDPs, which is an important research focus for the Shin group, as well as guide future work on the USF1 bHLHZ and other IDPs.


As was previously discussed, discrepancies between the results generated here and previously published work could stem in part from the significantly different systems that were used. Introducing these loops into a more relevant setting, such as asthma-relevant tissue cells (HBEC tissue culture)53 or mouse asthma models (chronic allergen exposure models),54 might answer some unresolved questions, namely, why the USF1 bHLHZ behaves so differently from the full-length protein.

For example, we could repeat our loop mutations with the full-length USF1 protein (both the AD and DBD) in these model systems, which would also possess the relevant post-translational modifications; this would test these modifications as a potential explanation for any deviations. Additionally, if we could generate a mammalian cell line that has the USF1 gene deleted, we could observe the phenotype of these cells and then possibly rescue the wildtype phenotype by transfecting the mutant cell line with a construct containing just the USF1 bHLHZ. From here, we could investigate how the USF1 bHLHZ interacts with the PAI-1 promoter in a more relevant system. Lastly, it would be interesting to work with a murine/tissue culture model for this hereditary asthma to see if providing the USF1 bHLHZ protein used in these studies would reverse the asthmatic phenotype that is observed. USF1 bHLHZ without the AD would act as a competitive inhibitor that would compete with the native USF1 for binding to the problematic E-box, and ideally, we would see a decrease in the asthmatic phenotype. Given the Shin group’s lack of experience with these systems, collaborators such as Prof. Edmond Young, with his airway-on-a-chip technology,55 will be needed to help design and carry out these experiments.


1.5 References

1. Jones, S. (2004). An overview of the basic helix-loop-helix proteins. Genome biology, 5(6), 226.
2. Liu, J., Perumal, N. B., Oldfield, C. J., Su, E. W., Uversky, V. N., & Dunker, A. K. (2006). Intrinsic disorder in transcription factors. Biochemistry, 45(22), 6873-6888.
3. Tompa, P., & Fuxreiter, M. (2008). Fuzzy complexes: polymorphism and structural disorder in protein–protein interactions. Trends in biochemical sciences, 33(1), 2-8.
4. Nair, S. K., & Burley, S. K. (2006). Structural aspects of interactions within the Myc/Max/Mad network. In The Myc/Max/Mad Transcription Factor Network (pp. 123-143). Springer, Berlin, Heidelberg.
5. Nair, S. K., & Burley, S. K. (2003). X-ray structures of Myc-Max and Mad-Max recognizing DNA: molecular bases of regulation by proto-oncogenic transcription factors. Cell, 112(2), 193-205.
6. Ferré-D'Amaré, A. R., Prendergast, G. C., Ziff, E. B., & Burley, S. K. (1993). Recognition by Max of its cognate DNA through a dimeric b/HLH/Z domain. Nature, 363(6424), 38.
7. Brownlie, P., Ceska, T. A., Lamers, M., Romier, C., Stier, G., Teo, H., & Suck, D. (1997). The crystal structure of an intact human Max–DNA complex: new insights into mechanisms of transcriptional control. Structure, 5(4), 509-520.
8. Gordân, R., Shen, N., Dror, I., Zhou, T., Horton, J., Rohs, R., & Bulyk, M. L. (2013). Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell reports, 3(4), 1093-1104.
9. Amoutzias, G. D., Robertson, D. L., Van de Peer, Y., & Oliver, S. G. (2008). Choose your partners: dimerization in eukaryotic transcription factors. Trends in biochemical sciences, 33(5), 220-229.
10. Mason, J. M., & Arndt, K. M. (2004). Coiled coil domains: stability, specificity, and biological implications. Chembiochem, 5(2), 170-176.
11. Vancraenenbroeck, R., & Hofmann, H. (2018). Occupancies in the DNA-Binding Pathways of Intrinsically Disordered Helix-Loop-Helix Leucine-Zipper Proteins. The Journal of Physical Chemistry B, 122(49), 11460-11467.
12. Nair, S. K., & Burley, S. K. (2006). Structural aspects of interactions within the Myc/Max/Mad network. In The Myc/Max/Mad Transcription Factor Network (pp. 123-143). Springer, Berlin, Heidelberg.
13. Fernandez, P. C., Frank, S. R., Wang, L., Schroeder, M., Liu, S., Greene, J., ... & Amati, B. (2003). Genomic targets of the human c-Myc protein. Genes & development, 17(9), 1115-1129.
14. Fuxreiter, M., Simon, I., & Bondos, S. (2011). Dynamic protein–DNA recognition: beyond what can be seen. Trends in biochemical sciences, 36(8), 415-423.
15. Wright, P. E., & Dyson, H. J. (2015). Intrinsically disordered proteins in cellular signalling and regulation. Nature reviews Molecular cell biology, 16(1), 18-29.
16. Minezaki, Y., Homma, K., Kinjo, A. R., & Nishikawa, K. (2006). Human transcription factors contain a high fraction of intrinsically disordered regions essential for transcriptional regulation. Journal of molecular biology, 359(4), 1137-1149.
17. Tsafou, K., Tiwari, P. B., Forman-Kay, J. D., Metallo, S. J., & Toretsky, J. A. (2018). Targeting intrinsically disordered transcription factors: changing the paradigm. Journal of molecular biology, 430(16), 2321-2341.
18. Guo, X., Bulyk, M. L., & Hartemink, A. J. (2012). Intrinsic disorder within and flanking the DNA-binding domains of human transcription factors. In Biocomputing 2012 (pp. 104-115).
19. Putt, W., Palmen, J., Nicaud, V., Tregouet, D. A., Tahri-Daizadeh, N., Flavell, D. M., ... & Talmud, P. J. (2004). Variation in USF1 shows haplotype effects, gene: gene and gene: environment associations with glucose and lipid parameters in the European Atherosclerosis Research Study II. Human molecular genetics, 13(15), 1587-1597.
20. Ma, Z., Paek, D., & Oh, C. K. (2009). Plasminogen activator inhibitor‐1 and asthma: role in the pathogenesis and molecular regulation. Clinical & Experimental Allergy, 39(8), 1136-1144.
21. Cho, S. H., Tam, S. W., Demissie-Sanders, S., Filler, S. A., & Oh, C. K. (2000). Production of plasminogen activator inhibitor-1 by human mast cells and its possible role in asthma. The Journal of Immunology, 165(6), 3154-3161.
22. Sherenian, M. G., Cho, S. H., Levin, A. M., Min, J. Y., Sen, S., Oh, S., ... & Rodriguez-Santana, J. R. (2017). PAI-1 Gain of Function Genotype and Airway Obstruction in Asthma. Journal of Allergy and Clinical Immunology, 139(2), AB171.
23. Buč, D., Izakovičová Hollá, L., & Vacha, J. (2002). Polymorphism 4G/5G in the plasminogen activator inhibitor‐1 (PAI‐1) gene is associated with IgE‐mediated allergic diseases and asthma in the Czech population. Allergy, 57(5), 446-448.
24. Ma, Z., Jhun, B., Jung, S. Y., & Oh, C. K. (2008). Binding of upstream stimulatory factor 1 to the E-Box regulates the 4G/5G polymorphism–dependent plasminogen activator inhibitor 1 expression in mast cells. Journal of allergy and clinical immunology, 121(4), 1006-1012.
25. Lijnen, H. R. (2002). Matrix metalloproteinases and cellular fibrinolytic activity. Biochemistry (Moscow), 67(1), 92-98.
26. Ferre‐D'Amare, A. R., Pognonec, P., Roeder, R. G., & Burley, S. K. (1994). Structure and function of the b/HLH/Z domain of USF. The EMBO journal, 13(1), 180-189.
27. Popa, S. C., & Shin, J. A. (2019). The intrinsically disordered loop in the USF1 bHLHZ domain modulates its DNA-binding sequence specificity in hereditary asthma. The Journal of Physical Chemistry B.
28. Johnson, W. C. (1999). Analyzing protein circular dichroism spectra for accurate secondary structures. Proteins: Structure, Function, and Bioinformatics, 35(3), 307-312.
29. Lajmi, A. R., Lovrencic, M. E., Wallace, T. R., Thomlinson, R. R., & Shin, J. A. (2000). Minimalist, alanine-based, helical protein dimers bind to specific DNA sites. Journal of the American Chemical Society, 122(23), 5638-5639.
30. Bird, G. H., Lajmi, A. R., & Shin, J. A. (2002). Sequence‐specific recognition of DNA by hydrophobic, alanine‐rich mutants of the basic region/leucine zipper motif investigated by fluorescence anisotropy. Biopolymers: Original Research on Biomolecules, 65(1), 10-20.
31. Hellman, L. M., & Fried, M. G. (2007). Electrophoretic mobility shift assay (EMSA) for detecting protein–nucleic acid interactions. Nature protocols, 2(8), 1849.
32. Inamoto, I., Chen, G., & Shin, J. A. (2017). The DNA target determines the dimerization partner selected by bHLH/Z-like hybrid proteins AhRJun and ArntFos. Mol. BioSyst., 13, 476-488.
33. Meng, X., Brodsky, M. H., & Wolfe, S. A. (2005). A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nature biotechnology, 23(8), 988.
34. Chung, C. T., Niemela, S. L., & Miller, R. H. (1989). One-step preparation of competent Escherichia coli: transformation and storage of bacterial cells in the same solution. Proc. Natl. Acad. Sci. U. S. A., 86, 2172-2175.
35. Christy, B., & Nathans, D. (1989). DNA binding site of the growth factor-inducible protein Zif268. Proceedings of the National Academy of Sciences, 86(22), 8737-8741.
36. Xu, J., Chen, G., De Jong, A. T., Shahravan, S. H., & Shin, J. A. (2009). Max-E47, a designed minimalist protein that targets the E-box DNA site in vivo and in vitro. J. Am. Chem. Soc., 131, 7839-7848.
37. Xie, Y., & Wetlaufer, D. B. (1996). Control of aggregation in protein refolding: the temperature-leap tactic. Protein Sci., 5, 517-523.
38. Chen, G., De Jong, A. T., & Shin, J. A. (2012). Forced homodimerization of the c-Fos leucine zipper in designed bHLH/Z-like hybrid proteins MaxbHLH-Fos and ArntbHLH-Fos. Molecular BioSystems, 8(4), 1286-1296.
39. Chen, Y.-H., Yang, J. T., & Chau, K. H. (1974). Determination of the helix and β form of proteins in aqueous solution by circular dichroism. Biochemistry, 13, 3350-3359.
40. Naud, J. F., McDuff, F. O., Sauvé, S., Montagne, M., Webb, B. A., Smith, S. P., ... & Lavigne, P. (2005). Structural and thermodynamical characterization of the complete p21 gene product of Max. Biochemistry, 44(38), 12746-12758.
41. O'Neil, K. T., & DeGrado, W. F. (1990). A thermodynamic scale for the helix-forming tendencies of the commonly occurring amino acids. Science, 250, 646-651.
42. Luque, I., Mayorga, O. L., & Freire, E. (1996). Structure-based thermodynamic scale of α-helix propensities in amino acids. Biochemistry, 35, 13681-13688.
43. Tellinghuisen, J. (2000). Nonlinear least-squares using microcomputer data analysis programs: KaleidaGraph™ in the physical chemistry teaching laboratory. Journal of Chemical Education, 77(9), 1233.
44. Estojak, J., Brent, R., & Golemis, E. A. (1995). Correlation of two-hybrid affinity data with in vitro measurements. Molecular and cellular biology, 15(10), 5820-5829.
45. Pursglove, S. E., Fladvad, M., Bellanda, M., Moshref, A., Henriksson, M., Carey, J., & Sunnerhagen, M. (2004). Biophysical properties of regions flanking the bHLH-Zip motif in the p22 Max protein. Biochemical and biophysical research communications, 323(3), 750-759.
46. Wang, Y., Wong, R. H., Tang, T., Hudak, C. S., Yang, D., Duncan, R. E., & Sul, H. S. (2013). Phosphorylation and recruitment of BAF60c in chromatin remodeling for lipogenesis in response to insulin. Mol. Cell, 49, 283-297.
47. Wong, R. H., Chang, I., Hudak, C. S., Hyun, S., Kwan, H. Y., & Sul, H. S. (2009). A role of DNA-PK for the metabolic gene regulation in response to insulin. Cell, 136, 1056-1072.
48. Xing, S., & Poirier, Y. (2012). The protein acetylome and the regulation of metabolism. Trends Plant Sci., 17, 423-430.
49. Gibbs, E. B., Lu, F., Portz, B., Fisher, M. J., Medellin, B. P., Laremore, T. N., ... & Showalter, S. A. (2017). Phosphorylation induces sequence-specific conformational switches in the RNA polymerase II C-terminal domain. Nature communications, 8(1), 1-11.
50. Wentworth, P., & Janda, K. D. (2001). Catalytic antibodies: structure and function. Cell Biochem. Biophys., 35, 63-87.
51. Dunn, M. R., Jimenez, R. M., & Chaput, J. C. (2017). Analysis of aptamer discovery and technology. Nature Rev. Chem., 1, 1-16.
52. Sela-Culang, I., Kunik, V., & Ofran, Y. (2013). The structural basis of antibody-antigen recognition. Front. Immunol., 4, 1-13.
53. Sachs, L. A., Finkbeiner, W. E., & Widdicombe, J. H. (2003). Effects of media on differentiation of cultured human tracheal epithelium. In Vitro Cellular & Developmental Biology-Animal, 39(1-2), 56-62.
54. Nials, A. T., & Uddin, S. (2008). Mouse models of allergic asthma: acute and chronic allergen challenge. Disease models & mechanisms, 1(4-5), 213-220.
55. Humayun, M., Chow, C. W., & Young, E. W. K. (2018). Microfluidic lung airway-on-a-chip with arrayable suspended gels for studying epithelial and smooth muscle cell interactions. Lab Chip, 18, 1298-1309.
56. Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., & Ferrin, T. E. (2004). UCSF Chimera—a visualization system for exploratory research and analysis. Journal of computational chemistry, 25(13), 1605-1612.


1.6 Appendix A

1.6.1 Appendix A1 - Compositions of media and buffers

LB / LB agar
10 g tryptone
5 g yeast extract
10 g NaCl
add ddH2O to 1 L
add 15 g agar for plates

SOB media
20 g tryptone
5 g yeast extract
0.5 g NaCl
5 g MgSO4
add ddH2O to 1 L

2xYT broth / 2YT top agar
16 g tryptone
10 g yeast extract
5 g NaCl
add ddH2O to 1 L
adjust pH to 7.0
add 7.5 g agar for top agar

50x TAE
2 M Tris
1 M acetic acid
50 mM EDTA

10x TBE
890 mM Tris
890 mM boric acid
20 mM EDTA

10x DNA Sample buffer
50% v/v glycerol
0.25% w/v bromophenol blue
0.25% w/v xylene cyanol

10x M9 salts
67.8 g Na2HPO4
30 g KH2PO4
5 g NaCl
10 g NH4Cl
add ddH2O to 1 L

33.3x solution
Prepare the following six solutions:

Solution I
0.99 g Phe
1.1 g Lys
2.5 g Arg
add ddH2O to 100 mL

Solution II
0.2 g Gly
0.7 g Val
0.84 g Ala
0.41 g Trp

Solution III
0.71 g Thr
8.4 g Ser
4.6 g Pro
0.96 g Asn
add ddH2O to 100 mL

Solution IV
1.04 g Asp
18.7 g Glu
add ddH2O to 100 mL

Solution V
14.6 g Gln
0.36 g Tyr
add ddH2O to ~90 mL

Solution VI
0.79 g Ile
0.77 g Leu
add ddH2O to 100 mL

Add Solution V to Solution IV and add NaOH pellets until all of the amino acids dissolve. Combine Solutions I – IV, filter sterilize, and store at 4 ºC.
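As a worked example of using the concentrated stocks listed in this appendix, the volume of stock needed for a working solution follows from C1V1 = C2V2. The sketch below is illustrative only: the function name, the 1x target and the 1 L final volume are assumptions, not part of the original protocols.

```python
def stock_volume_ml(stock_fold: float, final_volume_ml: float, final_fold: float = 1.0) -> float:
    """Return the volume (mL) of an N-fold stock needed for the final volume.

    Uses C1 * V1 = C2 * V2 with concentrations expressed as fold (e.g. 50 for 50x).
    """
    if stock_fold <= 0 or final_fold <= 0:
        raise ValueError("fold concentrations must be positive")
    if final_fold > stock_fold:
        raise ValueError("cannot dilute a stock to a higher concentration")
    return final_fold * final_volume_ml / stock_fold


# Stocks named in Appendix A1; the 1000 mL final volume is a hypothetical choice.
STOCKS = {"50x TAE": 50.0, "10x TBE": 10.0, "10x M9 salts": 10.0, "33.3x solution": 33.3}

for name, fold in STOCKS.items():
    volume = stock_volume_ml(fold, final_volume_ml=1000.0)
    print(f"{name}: {volume:.1f} mL stock + {1000.0 - volume:.1f} mL ddH2O")
```

For instance, 20 mL of 50x TAE made up to 1 L gives 1x TAE, i.e. 40 mM Tris, 20 mM acetic acid and 1 mM EDTA based on the stock composition above.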


1.6.2 Appendix A2 – DNA sequences used and representative figures

Appendix Table 1. Sequences of DNA oligonucleotides used for SDM/cloning.

Name of oligo                                        Sequence of oligo
USF1 T234A fwd primer                                5’ CAGTATGGAATCAGCCAAAAGTGGACAGAGCAAA
USF1 T234A rev primer                                5’ TTTGCTCTGTCCACTTTTGGCTGATTCCATACTG
USF1 S233A fwd primer                                5’ CTGCAGTATGGAATCAGCCAAAAGTGGACAGAGC
USF1 S233A rev primer                                5’ GCTTTTGCTCTGTCCACTTTTGGCTGATTCCATA
USF1 AA fwd primer                                   5’ GCAGTATGGAAGCTGCCAAAAGTGGACA
USF1 AA rev primer                                   5’ TGTCCACTTTTGGCAGCTTCCATACTGC
USF1 K235M fwd primer                                5’ TATGGAATCAACGATGAGTGGACAGAGCAA
USF1 K235M rev primer                                5’ TTGCTCTGTCCACTCATCGTTGATTCCATA
pET28a KpnI site fwd primer                          5’ TTAAGAAGGAGAGGTACCATGGGCGCGGATA
pET28a KpnI site rev primer                          5’ TATCCGCGCCCATGGTATATCTCCTTCTTAA
-9 spacer with 4G E-box for cloning into pH3U3 fwd   5’ [Phos] GGCCGCCTCAGGGGCACAGAGAGAGTCTGGACACGTGGGGAGG
-9 spacer with 4G E-box for cloning into pH3U3 rev   5’ AATTCCCTCCCCACGTGTCCAGACTCTCTCTGTGCCCCTGAGGC
-9 spacer with 5G E-box for cloning into pH3U3 fwd   5’ [Phos] GGCCGCCTCAGGGGGACAGAGAGAGTCTGGACACGTGGGGGAG
-9 spacer with 5G E-box for cloning into pH3U3 rev   5’ AATTCCCTCCCCCACGTGTCCAGACTCTCTCTGTGCCCCTGAGGC
-11 spacer with 4G E-box for cloning into pH3U3 fwd  5’ [Phos] GGCCGCCTCAGGGGCACAGAGAGAGTCTGGACACGTGGGGAGTCG
-11 spacer with 4G E-box for cloning into pH3U3 rev  5’ [Phos] AATTCGACTCCCCACGTGTCCAGACTCTCTCTGTGCCCCTGAGGC
-11 spacer with 5G E-box for cloning into pH3U3 fwd  5’ [Phos] GGCCGCCTCAGGGGCACAGAGAGAGTCTGGACACGTGGGGGATCG
-11 spacer with 5G E-box for cloning into pH3U3 rev  5’ [Phos] AATTCGATCCCCCACGTGTCCAGACTCTCTCTGTGCCCCTGAGGC
-11 spacer with 4G NS for cloning into pH3U3 fwd     5’ [Phos] GGCCGCCTCAGGGGCACAGAGAGAGTCTGGTTCCAAGGGGAGTCG
-11 spacer with 4G NS for cloning into pH3U3 rev     5’ [Phos] AATTCGACTCCCCTTGGAACCAGACTCTCTCTGTGCCCCTGAGGC
-11 spacer with 5G NS for cloning into pH3U3 fwd     5’ [Phos] GGCCGCCTCAGGGGCACAGAGAGAGTCTGGTTCCAAGGGGGATCG
-11 spacer with 5G NS for cloning into pH3U3 rev     5’ [Phos] AATTCGATCCCCCTTGGAACCAGACTCTCTCTGTGCCCCTGAGGC

Sequences of DNA oligos ordered from Eurofins Genomics to create the USF1 bHLHZ mutants and to clone the cognate DNA into the pH3U3 reporter vector for the B1H assays. [Phos] refers to oligos that were modified to have a 5’ phosphate group to allow for ligation into the pH3U3 vector. In bold are shown the sequences corresponding to the digested EcoRI/NotI sites, such that when the oligos are annealed, they can be directly ligated into pH3U3 that was digested at the same sites.
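Several of the SDM primer pairs in Appendix Table 1 are exact reverse complements of one another. A minimal sketch for checking this is given below; it covers only the T234A, AA and K235M pairs, and the helper names are illustrative assumptions rather than tools used in this work.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_complementary_pair(fwd: str, rev: str) -> bool:
    """True if rev is the exact reverse complement of fwd."""
    return reverse_complement(fwd) == rev.upper()


# Primer sequences copied from Appendix Table 1 (5' to 3').
PRIMER_PAIRS = {
    "USF1 T234A": ("CAGTATGGAATCAGCCAAAAGTGGACAGAGCAAA",
                   "TTTGCTCTGTCCACTTTTGGCTGATTCCATACTG"),
    "USF1 AA": ("GCAGTATGGAAGCTGCCAAAAGTGGACA",
                "TGTCCACTTTTGGCAGCTTCCATACTGC"),
    "USF1 K235M": ("TATGGAATCAACGATGAGTGGACAGAGCAA",
                   "TTGCTCTGTCCACTCATCGTTGATTCCATA"),
}

for name, (fwd, rev) in PRIMER_PAIRS.items():
    print(name, "fully complementary:", is_complementary_pair(fwd, rev))
```

The same check can be applied to any annealed oligo pair, keeping in mind that pairs designed with offsets or sticky-end overhangs are not expected to be exact reverse complements.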


Appendix Table 2. Sequences of DNA used for CD/EMSA/B1H reporters

Name of construct               DNA sequence
-9 AT E-box, B1H reporter       ACCACGTGGTatcgaattcTTTACA
-11 AT E-box, B1H reporter      ACCACGTGGTccatcgaattcTTTACA
-13 AT E-box, B1H reporter      ACCACGTGGTggccatcgaattcTTTACA
-13 GC E-box, B1H reporter      GCCACGTGCGggccatcgaattcTTTACA
-9 4G E-box, B1H reporter       GACACGTGGGgaggaattcTTTACA
-9 5G E-box, B1H reporter       GACACGTGGGggagaattcTTTACA
-11 4G E-box, B1H reporter      GACACGTGGGgagtcgaattcTTTACA
-11 5G E-box, B1H reporter      GACACGTGGGggatcgaattcTTTACA
-11 4G NS, B1H reporter         GATTCCAAGGGGatcgaattcTTTACA
-11 5G NS, B1H reporter         GATTCCAAGGGGgatgaattcTTTACA
Positive control, B1H reporter  GCGTGGGGCGatcgaattcTTTACA
4G E-box fwd, CD                AGTCTGGACACGTGGGGAGTCAGC
4G E-box rev, CD and EMSA       GCTGACTCCCCACGTGTCCAGACT
5G E-box fwd, CD                AGTCTGGACACGTGGGGGAGTCAGC
5G E-box rev, CD and EMSA       GCTGACTCCCCCACGTGTCCAGACT
4G E-box fwd, EMSA              [Fluor] AGTCTGGACACGTGGGGAGTCAGC
5G E-box fwd, EMSA              [Fluor] AGTCTGGACACGTGGGGGAGTCAGC

Sequences of DNA used as targets for bHLHZ TFs in CD/EMSA and B1H experiments. Cognate DNA shown in bold. For B1H reporters: underlined sequences denote nucleotides used in naming reporter, sequence in red is part of spacer, sequence in blue is the EcoRI site that is also part of spacer, sequence in yellow corresponds to -35 of RNA polymerase binding site. [Fluor] refers to 6-Carboxyfluorescein that was covalently added to the 5’ end of EMSA oligos.

[Plot: mean residue ellipticity (mdeg * cm2 * dmol-1; y-axis, 0 to -30000) versus wavelength (nm; x-axis, 200 to 300).]

Appendix Fig. 1. Representative CD experiments for wildtype USF1. 2 µM USF1 monomer (thus, 1 µM USF1 homodimer) plus no DNA, blue line; 2 µM 4G duplex, orange line; 2 µM 5G duplex, silver line; 2 µM NS DNA, yellow line. The helicities of the transcription factors were determined as previously described.2 Protein was resuspended in CD buffer to 20 µM final concentration, incubated for 1 hr at 37 °C, then diluted to 2 µM final concentration in 2 mL final volume with CD buffer and oligonucleotide duplex if applicable. Each scan was carried out twice at 22 °C, scanning from 200 to 300 nm; scans were averaged and not subjected to smoothing. The buffer control was subtracted from each protein spectrum. Mean residue ellipticities are presented on the y-axis.
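For readers unfamiliar with the quantity on the y-axis, the sketch below shows the standard conversion from an observed CD signal to mean residue ellipticity, [θ] = θ_obs / (10 · c · l · n), which yields deg cm2 dmol-1 when θ_obs is in mdeg, c in mol/L and l in cm. This is a generic illustration, not necessarily the exact procedure used here; the example signal, path length and residue count are hypothetical placeholders, and only the 2 µM concentration comes from the caption.

```python
def mean_residue_ellipticity(theta_mdeg: float, conc_molar: float,
                             path_cm: float, n_residues: int) -> float:
    """Convert an observed ellipticity (mdeg) to mean residue ellipticity.

    Result is in deg cm^2 dmol^-1, using [theta] = theta_obs / (10 * c * l * n).
    """
    if conc_molar <= 0 or path_cm <= 0 or n_residues <= 0:
        raise ValueError("concentration, path length and residue count must be positive")
    return theta_mdeg / (10.0 * conc_molar * path_cm * n_residues)


# Hypothetical example: -20 mdeg signal, 2 uM protein (from the caption),
# 1 cm cuvette (assumed) and a 100-residue construct (assumed).
example = mean_residue_ellipticity(theta_mdeg=-20.0, conc_molar=2e-6,
                                   path_cm=1.0, n_residues=100)
print(f"[theta] = {example:.0f} deg cm^2 dmol^-1")
```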


Appendix Fig. 2. Protein sequences of studied transcription factors. Residues in bold, red font correspond to residues that were mutated relative to the wildtype USF1 or Max proteins (i.e. USF1 S233A, USF1 T234A, USF1 AA and MaxULoop). Residues in bold, blue font correspond to residues found in the wildtype protein.

Appendix Fig. 3. Left. Representative EMSA for USF1 T234A binding to nonspecific DNA. USF1 T234A behaves like the wildtype when presented with non-cognate DNA (Kd = 95 nM). Protein concentrations are as follows from left to right: 0, 25, 50, 100, 150, 200, 350, 500 nM. Right. Representative EMSA for USF1 T234A binding to 5G E-box. USF1 T234A shows a high affinity for the 5G E-box (Kd = 3.8 nM) like the wildtype protein; however, unlike the wildtype protein, it seemingly does not differentiate between the two E-boxes. Protein concentrations are as follows from left to right: 0, 2, 4, 8, 10, 15, 20, 40 nM.
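To give a feel for what these Kd values imply across the titrations, the sketch below evaluates a simple single-site binding isotherm, fraction bound = [P] / (Kd + [P]). This assumes protein in excess over labeled DNA and treats the binding species as a single unit; it is an illustrative model and not necessarily the fitting equation used to obtain the reported Kd values.

```python
def fraction_bound(protein_nm: float, kd_nm: float) -> float:
    """Fraction of DNA bound for a simple one-site isotherm (protein in excess)."""
    if protein_nm < 0 or kd_nm <= 0:
        raise ValueError("protein concentration must be >= 0 and Kd must be > 0")
    return protein_nm / (kd_nm + protein_nm)


# Kd values and protein concentrations (nM) taken from the Appendix Fig. 3 caption.
TITRATIONS = {
    "nonspecific DNA (Kd = 95 nM)": (95.0, [0, 25, 50, 100, 150, 200, 350, 500]),
    "5G E-box (Kd = 3.8 nM)": (3.8, [0, 2, 4, 8, 10, 15, 20, 40]),
}

for label, (kd, concentrations) in TITRATIONS.items():
    fractions = [round(fraction_bound(c, kd), 2) for c in concentrations]
    print(label, fractions)
```

Under this model, half of the DNA is bound when the protein concentration equals Kd, which is why the two titrations span such different concentration ranges.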

[Plate lanes: -11 4G NS; -11 5G NS; -11 AT, no USF1; Positive control; Negative control]

Appendix Fig. 4. Non-specific activity for -11 4G and 5G E-box reporters. Shown is the 2.5 mM 3-AT plate. The E-box motif in the reporter construct was replaced with NS DNA to assess whether the USF1 fusion protein interacts non-specifically with the reporter constructs. There is some non-specific activity for the -11 4G NS construct; however, it was low enough to still use the -11 4G reporter for subsequent analyses. At the next highest concentration of 3-AT (5 mM), all NS activity disappears. Despite there being some non-specific activity towards the -11 4G construct, USF1 still gives a better signal when binding to the 5G E-box, which is very surprising.


Part II: The rational design of novel bHLHZ transcription factors using a mixture of rational design and non-rational, directed evolution systems

2.1 Preface to Part II

Part II of this thesis contains parts of two manuscripts that are in review/in preparation for publication. The format of the submitted manuscripts has been modified in that only sections relevant to this thesis were included.

Manuscript 1: Combining rational design and continuous evolution on minimalist proteins that target DNA

Author list: Ichiro Inamoto, Inder Sheoran, Serban C. Popa, Montdher Hussain, and Jumi A. Shin

Author contributions: I.I. and I.S. designed and performed experiments and analyzed data. S.C.P. and M.H. analyzed data and wrote the paper. J.A.S. conceived the project idea, designed experiments and wrote the paper.

Manuscript 2: Phage Assisted Continuous Evolution (PACE): a How-to Guide for Directed Evolution

Author list: Serban C. Popa, Ichiro Inamoto, Benjamin W. Thuronyi, and Jumi A. Shin

Author contributions: I.I. designed and performed experiments and analyzed data. S.C.P. and B.W.T. analyzed data and wrote the paper. J.A.S. conceived the project idea, designed experiments and wrote the paper.


2.2 Introduction

2.2.1 Strategies for protein engineering: rational and non-rational

Ever since it has been understood how the “central dogma” of biology links the genetic information stored in DNA to a resulting protein, attempts at engineering proteins to alter or enhance their function have been ongoing.1,2 Historically, protein engineering has been guided by the structure-function paradigm, where in order for a protein to have function, it must adopt a certain structure, and that by altering the protein sequence, we can alter how the protein behaves.3 This method of approaching protein engineering has had tremendous success in the past, as demonstrated by modifying the original green fluorescent protein to generate variants that fluoresce at different wavelengths.4 However, as more insight was gained on proteins, we understood that the functional domains of proteins are not entirely comprised of rigid structures, but in fact can have extremely dynamic domains, which pose significant challenges to protein engineering.5

Methods for protein engineering can be classified into two broad categories: rational design and non-rational design.6,7 Rational design refers to the use of literature, modeling, and knowledge of the protein scaffold to generate novel proteins with desired traits, whereas non-rational design refers to subjecting proteins to multiple, alternating rounds of mutagenesis and selection with the end goal of improving protein function by acquiring some beneficial mutation.7
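The alternating rounds of mutagenesis and selection that define non-rational design can be sketched as a toy simulation. Everything below is hypothetical and for illustration only: the sequences are arbitrary strings rather than real proteins, and "fitness" is simply similarity to an arbitrary target, standing in for whatever selective assay or screen is actually used.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Randomly substitute each position with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq)


def fitness(seq: str, target: str) -> int:
    """Toy selection score: number of positions matching the target."""
    return sum(a == b for a, b in zip(seq, target))


def evolve(parent: str, target: str, rounds: int = 20,
           library_size: int = 200, rate: float = 0.05, seed: int = 0) -> str:
    """Alternate mutagenesis (library generation) and selection (keep the best)."""
    rng = random.Random(seed)
    best = parent
    for _ in range(rounds):
        library = [mutate(best, rate, rng) for _ in range(library_size)]
        library.append(best)  # retain the parent so the score never decreases
        best = max(library, key=lambda s: fitness(s, target))
    return best


# Hypothetical starting sequence and target; neither is a real protein.
PARENT = "A" * 10
TARGET = "MKTAYIAKQR"
evolved = evolve(PARENT, TARGET)
print(PARENT, fitness(PARENT, TARGET), "->", evolved, fitness(evolved, TARGET))
```

Note that the loop never uses structural information: improvement depends only on having a score to select on, which mirrors the requirement for an appropriate assay or screen.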

It is possible to alter the specificity of an enzyme/protein via rational design by mutating amino acids at key positions throughout the protein. Although the rational approach is powerful and rapid, a limitation is the availability of useful information: e.g., we must know a protein’s structure and function in order to mutate the right amino acids to obtain the desired changes. An example of rational design was provided in the previous chapter in the form of the MaxULoop construct. By comparing the protein sequences of the Max and USF1 bHLHZs (Fig. 1), we were able to rationally design MaxULoop to alter how the Max protein scaffold interacts with different E-boxes.

Fig. 1. Alignment of the USF1 and Max bHLHZs for the MaxULoop loop swap. Aligning the two protein sequences identifies H1 and H2 (regions with higher homology), whereas the loops have little to no homology. Using this information, we could determine which residues to remove from the Max bHLHZ in order to introduce the USF1 loop.8-10 * denotes identical residues, : denotes very similar residues, . denotes somewhat similar residues, and - denotes gaps added to improve the alignment.

On the other hand, non-rational design has the advantage that it can be performed with less knowledge of the protein. A solid understanding of the protein's activity and mode of action is still required, as these are vital for building an appropriate selective assay/screen to identify improved variants of the protein.11 However, structural information is not strictly necessary, as non-rational design allows us to work with large libraries that can be mutated randomly and can provide unbiased coverage of all protein variants. This means that a non-rational approach might identify beneficial mutations in unexpected regions of the protein that would have been impossible to foresee via high-resolution structural information.11

One of the research goals of the Shin group is to design novel TFs using a mixture of rational and non-rational design. These novel TFs have many potential applications, such as uses in synthetic biology and as potential therapeutics for diseases. Historically, targeting bHLHZ proteins using small molecule drugs has proven difficult, given the large amount of disorder and lack of obvious drug binding pockets, but certain groups have had some success with it as of late.12,13 That being said, more progress has been made pursuing peptide therapeutics like Mad, Omomyc and ME47, that either sequester binding partners of c-Myc to avoid the formation of key regulatory heterodimers or actively compete for binding to the E-box of interest.14-16

2.2.2 Rational design of proteins

The increase in the number of structures available for a wide range of proteins, generated via X-ray crystallography, NMR, and cryo-EM, has significantly improved our ability to engineer proteins via rational design. X-ray crystallography and cryo-EM are particularly useful in providing snapshots of structured proteins that supply the structural information needed for rational design, for example, the structure of active sites within an enzyme.17,18 NMR, however, is better suited for providing information on proteins that contain significant disorder, as these disordered regions can sample multiple conformations, such that a single snapshot would not adequately describe these proteins.19,20 Additionally, computational techniques have increased in predictive power when it comes to protein design, allowing for the creation of enzymes that catalyze reactions not observed in nature, but these enzymes still fall short of natural enzymes when it comes to efficiency.21,22

Examples of rationally designed proteins made by the Shin group include the bHLH protein ME47 and bHLHZ protein MEF.15,22 ME47 was rationally designed to be a hybrid protein consisting of the Max basic region and the E47 HLH domain, with the goal of having this protein act as a competitive inhibitor of the c-Myc:Max heterodimer.15 ME47 was meant to compete with the c-Myc:Max heterodimer for binding to the same cognate E-boxes, but lacks an AD so that expression of downstream genes does not occur. The E47 TF binds the noncanonical, asymmetric E-box motif (5’-CACCTG) and homo-/heterodimerizes with different TFs than Max does.23

We showed that ME47 does not dimerize with c-Myc and outcompetes the Max homodimer and Myc:Max heterodimer for binding to the E-box in yeast two-hybrid assays.15 ME47 was also effective in decreasing tumor growth in a mouse model, making it a promising candidate to be used as a peptide therapeutic.24 While ME47 is a strong binder of the E-box, it also had some undesirable features, such as problems with folding and stability that made reproducing results in vitro problematic, suggesting that further refinements of the protein scaffold were needed to optimize the protein for use as a potential therapeutic.22 Other examples of similar, small peptide therapeutics include Omomyc, a variant of c-Myc whose leucine zipper was modified to allow for homodimerization with c-Myc (c-Myc does not homodimerize) and Mad, a variant of the Mxd1 protein that acts as a c-Myc antagonist (by competing with c-Myc for dimerization with Max).14,16 Omomyc has had promising results in treating Myc-dependent tumors and is currently in preparation for Phase I clinical trials, while Mad has recently been shown to be potentially more potent than Omomyc at impeding c-Myc function,14,16 which would suggest that peptide-based therapeutics are a valid strategy for targeting E-box mediated diseases moving forward.14,24

The lack of ME47 tractability prompted the creation of our next generation of E-box binding proteins known as MEF. MEF is a “franken-protein” that consists of the ME47 bHLH that has had a LZ appended to its C-terminus, making it into a bHLHZ protein.22 We hypothesized that by providing a secondary dimerization motif, the MEF homodimer would become more stable and would therefore be more tractable both in vitro and in vivo. LZs are characterized by a heptad repeat (abcdefg)n where positions a/d are occupied by hydrophobic residues that engage in hydrophobic interactions at the dimer interface, and positions e/g are occupied by charged residues that engage in electrostatic interactions to stabilize the dimer.25,26 LZs are secondary dimerization motifs in bHLHZ proteins that provide partner specificity, through their larger dimerization interface, and additional protein stability.25
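As an illustration of the heptad register described above, the following minimal Python sketch assigns positions a-g along a sequence and extracts the residues that fall on the a/d and e/g positions. The sequence TOY_ZIPPER and the function names are hypothetical and chosen only for illustration; they are not the FosW or MEF sequences.

```python
HEPTAD = "abcdefg"


def heptad_register(sequence: str, first_position: str = "a") -> list[tuple[str, str]]:
    """Pair each residue with its heptad position, starting from first_position."""
    start = HEPTAD.index(first_position)
    return [(residue, HEPTAD[(start + i) % 7]) for i, residue in enumerate(sequence)]


def residues_at(sequence: str, positions: str, first_position: str = "a") -> str:
    """Return the residues that fall on the requested heptad positions."""
    return "".join(
        residue
        for residue, position in heptad_register(sequence, first_position)
        if position in positions
    )


# Hypothetical two-heptad sequence used only to exercise the functions
# (assumption: NOT a real leucine zipper sequence).
TOY_ZIPPER = "LKELEDKLKELEDK"

core = residues_at(TOY_ZIPPER, "ad")   # hydrophobic interface positions
flank = residues_at(TOY_ZIPPER, "eg")  # charged positions flanking the interface
print("a/d residues:", core)
print("e/g residues:", flank)
```

For a real zipper, the register of the first residue would have to be taken from a structure or alignment and passed as first_position.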

Fig. 2. A representation of the MEF structure. The ME47 (PDB: 3U5V 29) and FosW (PDB: 5VF8 28) crystal structures were taken and merged to illustrate what MEF would look like. Modelling was done using the I-TASSER software.27 MEF is a rationally designed fusion peptide containing the Max basic region, E47 HLH and FosW LZ.28 Adapted from ref. 22. Structures visualized with Chimera v.1.13.1.75

The foreign LZ that was added to ME47 is FosW, a rationally designed LZ engineered by Mason and coworkers based on the bZIP protein c-Fos. c-Fos heterodimerizes with c-Jun to form AP-1, which, like the c-Myc:Max heterodimer, is involved in key cellular processes such as cellular differentiation and apoptosis.28 The native c-Fos LZ is unable to homodimerize, but the FosW zipper was modified such that it can homodimerize, by changing two Lys residues in the native c-Fos LZ coiled coil that prevent homodimerization to Asn and Ile.28 MEF was shown to behave significantly better than ME47, both in vivo and in vitro, validating this design choice.22 Additionally, by combining the domains of three different classes of TFs (bZIP, bHLH, bHLHZ) to create MEF, we would suspect that MEF should not interact with proteins natively present in the cell, making it more appealing for use as a potential therapeutic or synthetic biology tool.28,30

Current work on MEF is attempting to alter MEF’s DNA binding preferences by appending domains from helix-turn-helix (HTH) TFs as well as domains from other bHLHZ TFs.

Additionally, there are ongoing attempts to rationally modify the USF1 protein scaffold to create a protein that only homodimerizes in vivo. While rational design can be very powerful, eventually, all “obvious” modifications that can be made will be exhausted, at which point taking advantage of non-rational design can provide further breakthroughs in protein design.

2.2.3 Non-rational design of proteins

Historically, non-rational protein design strategies have entailed performing random mutagenesis on a protein of interest via chemicals/error-prone PCR/recombination, and then screening/selecting for the desired change, depending on the activity of the protein.31,32 Screens and selections exist on a continuum, where the trait that is meant to be evolved determines the appropriate strategy to be adopted. Proteins with beneficial changes in activity can then be inserted into the next cycle of mutagenesis and selection, and this process can in theory be repeated indefinitely. This approach has yielded significant changes in protein function, such as developing P450 to hydroxylate propane, evolving new variants of Bt toxins to target resistant insects, and evolving GPCR proteins that respond to novel stimuli.32-34 A popular example of non-rational design is phage display, where the protein of interest is fused to a phage coat protein (typically pIII of M13 bacteriophage).35 Libraries of these fusions can then be generated with variations in the protein of interest, and screening of the libraries would in theory identify a variant with the desired trait; however, this workflow is quite laborious.36

Improvements to non-rational strategies have come in the form of automating the maintenance of the mutagenesis and selection cycles, by carrying them out in vivo and using continuous evolution to guide changes that arise in the protein of interest.

Benefits of these evolution-based approaches to protein engineering include increasing the portion of the evolutionary landscape that can be sampled, and the ability to parallelize conditions to have multiple evolutionary trajectories to increase the number of potential mutants.37,38 Additionally, one can obtain epistatic effects: in a study on cephalosporin antibiotic resistance mutations in β-lactamase, several mutations were tolerated only in the presence of a pre-existing stabilizing mutation.39 Conversely, challenges arising from this strategy include designing a selective assay that appropriately guides the protein's evolution in the desired direction, generating large unbiased libraries, avoiding false positive and false negative signals, and manually performing multiple generations of mutagenesis and selection.40,41

Systems like phage-assisted continuous evolution (PACE, described below) and viral evolution of genetically actuating sequences (VEGAS) have found ways to tie the rapid phage reproductive cycle to the function of a protein of interest in bacterial and mammalian systems respectively, creating a selection circuit where improvements in the protein of interest are translated to a selective advantage for the phage.35,43 PACE was adopted by the Shin group to non-rationally modify the protein sequences of bHLHZ TFs to improve their binding to the E-box.

2.2.4 PACE

PACE is a true evolution system, in which evolving genes are subjected to continuous cycles of mutagenesis and selection.42 Liu and coworkers developed PACE to provide a system similar to natural evolution, where random mutations in the phage genome are produced at a rate that is much higher than what occurs naturally, and proteins of interest are selected for their fitness in situ. To accomplish directed evolution without constant human intervention, PACE utilizes the continuous infection of E. coli host cells by a modified version of M13 bacteriophage (Fig. 3).

Work headed by Dr. Ichiro Inamoto in collaboration with former graduate and undergraduate students in the Shin group has adapted PACE for use with our bHLHZ transcription factors. The B1H assay that had been used to study the behavior of bHLHZ transcription factors in vivo was adapted for the purposes of PACE and formed the basis of the PACE selection circuit (Fig. 4).

Fig. 3. Simplified schematic of the M13 phage replication cycle. (A) M13 bacteriophage, and (B) selection phage (SP). For both diagrams, gIII and its product pIII are shown as brown triangles; other phage proteins are represented in green and blue. The protein of interest, expressed from the SP that mediates expression of gIII, is represented as orange circles. M13 infection of host E. coli is mediated by interaction of the pIII coat protein with the TolQRA complex (yellow).43,44 Upon entry into the host, the single-stranded viral genome is converted to a double-stranded genome by host DNA polymerases and expression of the M13 genome can begin by utilizing host machinery.45 Adapted from ref. 71.

PACE utilizes a mutant M13 bacteriophage whose gIII gene is replaced by that for the protein of interest (the mutant phage is called Selection Phage, SP).42 Thus, the SP expresses the protein of interest instead of pIII in host E. coli; the SP cannot produce mature phage particles by itself. To complement the SP, the gIII gene is supplied on a separate plasmid in the host E. coli (Accessory Plasmid, AP) as part of a selection system that activates pIII production in response to the activity of the protein of interest. SP can only propagate by expressing the protein of interest from the phage genome, followed by expression of gIII that is mediated by the protein’s activity (Fig. 4). Thus, successful SP propagation is linked to the activity of the protein of interest. SP carrying a mutant protein with enhanced activity will have a fitness advantage over other SP particles, because the enhanced protein activity allows for increased pIII production, thereby increasing phage production. Phage-assisted non-continuous evolution (PANCE) is a simpler, smaller-scale version of PACE that is less stringent, which should in theory allow for the accumulation of a wider range of mutations that might not arise in a PACE experiment; this is why our TFs were initially evolved using a PANCE evolution scheme.47
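The link between protein activity and phage propagation can be illustrated with a deliberately simple toy model, sketched below in Python. It assumes that an improved SP variant produces a fixed multiple of the progeny of the parental SP in every passage; the starting fraction, the relative fitness and the number of passages are hypothetical values chosen for illustration, and the model ignores mutation, host cell limits and phage washout.

```python
def enrich(initial_fraction: float, relative_fitness: float, passages: int) -> list[float]:
    """Track the fraction of an improved SP variant over successive passages.

    relative_fitness is the progeny yield of the improved variant relative
    to the parental SP (parental yield is defined as 1.0).
    """
    fraction = initial_fraction
    history = [fraction]
    for _ in range(passages):
        improved = fraction * relative_fitness
        parental = (1.0 - fraction) * 1.0
        fraction = improved / (improved + parental)
        history.append(fraction)
    return history


# Hypothetical scenario: a variant starting at 0.1% of the population
# with twice the progeny yield of the parental SP.
history = enrich(initial_fraction=0.001, relative_fitness=2.0, passages=10)
for passage, fraction in enumerate(history):
    print(f"passage {passage:2d}: improved variant = {fraction:.3%}")
```

Under these assumptions the odds of the improved variant double with every passage, so a rare variant overtakes the parental SP within about ten passages.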

The rationale for using PACE and PANCE to evolve USF1 was that, given the dynamic nature of IDPs and the difficulty of rationally designing changes in these regions, PACE and PANCE could be used to evolve these disordered regions for us. We also hoped to use PANCE to evolve our ME derived TFs to improve their stability, as well as to alter their DNA-binding preferences by providing the appropriate selection circuits encoded in the AP construct.

Fig. 4. Schematic diagram illustrating the PACE-B1H system. The omega subunit of RNAP fused to ME47 (green and orange, respectively) acts as an activator domain (AD). ME47 binds to the E-box (blue) present on the pH3U3/AP vector and positions the omega subunit of RNAP so that it can recognize the -10/-35 elements of the weak lac promoter (purple). This leads to expression of the downstream gIII gene (brown) that is needed for phage propagation.48,49 Adapted from ref. 71.

2.2.5 The importance of DNA sequence and topology on protein-DNA interactions

We can expect to find the E-box motif fairly frequently in a given genome: once in every 4^8 bp (~65K bp) if we include flanking nucleotides (i.e. NCACGTGN), meaning that in a human genome of ~3 billion bp, one could expect ~46,000 E-boxes to be present (this assumes no sequence bias and that all DNA would be available for protein interaction, which is not necessarily the case).50 DNA in eukaryotes is compacted into dense nucleosomes by interacting with histone octamers in order to fit within the nucleus, and modifications to these histone proteins can determine whether the genes encoded in the DNA are expressed (this DNA is “euchromatin”) or whether the genes are repressed (this DNA is “heterochromatin”).51 A fraction of all possibly occurring E-boxes in a genome would be found as heterochromatin and these would not be of concern to us; of greater import are E-boxes that are available to interact with regulatory machinery. This is not to mention the existence of other similar sequences like CAGGTG, etc., with which any E-box binding protein could interact nonspecifically, which is highly undesirable.
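The estimate above can be reproduced with a short calculation, sketched here in Python. It follows the text in counting eight recognized positions and in assuming equal base frequencies with no sequence bias; the constant and function names are illustrative only.

```python
# Back-of-the-envelope estimate of chance E-box occurrences.
# Assumes equal base frequencies and independent positions, the same
# simplification stated in the text.

RECOGNIZED_POSITIONS = 8          # eight positions, as counted in the text
GENOME_SIZE_BP = 3_000_000_000    # approximate human genome size used in the text


def expected_spacing(positions: int) -> int:
    """Average number of bp between chance occurrences of one fixed sequence."""
    return 4 ** positions


def expected_sites(genome_size_bp: int, positions: int) -> float:
    """Expected number of chance occurrences in a genome of the given size."""
    return genome_size_bp / expected_spacing(positions)


spacing = expected_spacing(RECOGNIZED_POSITIONS)
sites = expected_sites(GENOME_SIZE_BP, RECOGNIZED_POSITIONS)
print(f"one site per {spacing:,} bp")
print(f"about {sites:,.0f} expected sites")
```

The same functions show how quickly the expected number of chance sites falls as more positions are recognized, since each additional position divides the count by four.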

Additionally, there are other bHLH/bHLHZ TFs that interact with the E-box like MyoD (involved in muscle tissue genesis), CLOCK (regulates processes tied to circadian rhythm) and E47 (involved in tissue differentiation). All are members of the bHLH superfamily, and are expected to have similar properties and mechanisms for targeting E-boxes as the engineered TFs described thus far.52-54 Ideally, our engineered bHLHZ TFs would interact solely with targeted E-boxes, while not interacting with E-boxes regulated by these other TFs or other E-box like sequences. The question then becomes how to alter and refine the specificity of our engineered bHLHZ TFs, so that they interact with desired E-boxes and not off-target sequences that would hamper their use as therapeutics and their applications in synthetic biology. It is known that bHLHZ proteins target E-box sequences in major grooves using their disordered basic regions, which leaves adjacent nucleotides in the neighboring minor grooves as a means to refine specificity.

TFs that use DNA minor grooves to refine specificity are already well documented. A classic example of this is the Hox protein family, which is part of the larger homeodomain family of proteins.54 Homeodomain TFs were initially discovered in Drosophila melanogaster and found to regulate numerous aspects of development and cellular differentiation. These TFs comprise a HTH motif that inserts into the major grooves of their cognate DNA and provides specificity to the protein:DNA interaction.54 Homeodomain TFs also use intrinsically disordered N-terminal arms to insert into AT-rich minor grooves that flank the major grooves of their cognate sites, and use the RNR motif (N=Pro, Gly, Asn etc.) in their arms to recognize these minor grooves.54,55 AT-rich sequences are known to form very narrow minor grooves that induce curvature of DNA and focus the electrostatic potential in the minor groove; the N-terminal arms recognize the narrow groove’s shape and electrostatic potential.56 Studies of how Hox proteins recognize these minor grooves have shown that the N-terminal arms do not make any specific contacts with nucleotides in the minor groove, but rather they recognize the electrostatic potential based on the topology of the groove.56

Mutating the RNR motif in the N-terminal arms of Hox proteins to RNA had a notably detrimental impact on DNA binding, demonstrating the importance of these minor groove contacts in TF activity.55 These N-terminal arms containing the RNR motif are often short (<10 residues) and computational studies have shown that the unique shape and electrostatic potential of the AT-rich minor groove creates an anionic environment ideal for localizing Arg's diffuse cationic guanidinium group, which is easier to desolvate than Lys, allowing these arms to contribute to DNA-binding specificity.55-57 Additionally, these TFs all bind very similar sequences, and it has recently been demonstrated that DNA shape, which is impacted by the DNA sequence in these minor grooves, is essential for how these TFs recognize their specific cognate sites.58

The DNA major groove dimensions remain fairly constant whether free or bound by a protein, whereas the minor groove undergoes significant distortions and makes more protein-DNA contacts with these proteins than one would expect, which speaks to the importance of flanking DNA in protein:DNA interactions.58 Attempts to study how the sequence of DNA impacts the dimensions/flexibility of DNA grooves have been driven by 31P NMR based techniques, which have found that AT-rich sequences are quite rigid whereas GC-rich sequences are very flexible.59,60 This also impacts the ratio of BI-BII DNA (refers to phosphate backbone orientations of B-DNA when crystallized; BII DNA positions the phosphate group towards the minor groove, while BI DNA positions it symmetrically between the major and minor grooves).60 BII DNA is preferentially bound by proteins and corresponds to deep, narrow minor grooves, while BI corresponds to shallow, wide major grooves.60,61 Physically, DNA tracts with higher flexibility (typically GC-rich) explore a larger conformational space than those with lower flexibility (typically AT-rich), which can have a significant impact on protein:DNA interactions.56

Given the diversity of DNA-binding proteins that exist, one can envisage appending different IDRs onto the bHLHZ scaffold to allow it to recognize different E-boxes based on the DNA topology and sequence of nucleotides flanking that E-box. This would expand the 4^8 sequence combinations involved in E-box recognition to 4^n (where n denotes the number of nucleotides in the sequence flanking the E-box that the appended IDR recognizes), which would allow for greater target specificity and open the door for using these modified bHLHZ TFs for more practical applications.

2.3 Methods and materials

2.3.1 Amplification and purification of phage particles

A 200 µL aliquot of overnight, infected E. coli host culture started from a single plaque was used to inoculate 20 mL of LB and grown at 37 °C with shaking for 4-5 hours. Cells were then collected by centrifugation at 4500 x g for 10 minutes (Eppendorf Centrifuge 5804 R). The supernatant was removed and the pellet was centrifuged a second time. The top 16 mL of supernatant was retained, combined with 4 mL NaCl/PEG 8000, and incubated at 4 °C overnight. Precipitated phage particles were collected by centrifugation at 12000 x g for 15 minutes (Eppendorf Centrifuge 5804 R). Supernatant was discarded and the pellet was resuspended in 1 mL of TBS, combined with 200 µL of NaCl/PEG 8000, incubated on ice for 1-2 hours, and centrifuged at 16100 x g for 10 minutes. Supernatant was discarded and the pellet containing phage particles was resuspended in 200 µL TBS and stored at -20 °C.

2.3.2 Construction of vectors

2.3.2.1 Construction of SP vector

The selection phage (SP) was a gift from David R. Liu.43 Subcloning of the TF of interest was carried out by digesting the SP and the pB1H2w2 containing the TF of interest with SacII (NEB R0157S) and XbaI (NEB R0145S) in a sequential digest. The digested SP was then treated with Antarctic Phosphatase (NEB M0289L) while the digested pB1H2w2 was run on a 1% agarose gel to gel extract the band corresponding to the TF of interest using a kit (Qiagen ID: 28704). The TF insert and the digested SP backbone were then ligated together using T4 ligase (NEB M0202S) and the ligation product was transformed into chemically competent DH5α cells. The resulting transformants were plasmid prepped and sent for sequencing (TCAG-Sick Kids) to confirm successful cloning.

2.3.2.2 Construction of AP vector

The accessory plasmid was a gift from David R. Liu.43 Cloning of various reporter constructs into the AP was done by digesting the AP with NotI (NEB R3189S) and EcoRI (NEB R0101S). The reporter constructs were ordered as ssDNA oligos (Eurofins Genomics) and were designed to have NotI/EcoRI overhangs when annealed to allow for cloning into the AP in the proper orientation. After annealing, the oligos were treated with T4 polynucleotide kinase (NEB M0201S) then ligated into the AP using T4 ligase (NEB M0202S) and the ligation product was transformed into chemically competent DH5α cells. The resulting transformants were plasmid prepped and sent for sequencing (TCAG-Sick Kids) to confirm successful cloning.

2.3.3 Plaque assay

The desired host cells (permissive 1059 cells and 1030/2060 cells with various AP constructs) were used to start overnight cultures the day before the assay in 5 mL of LB supplemented with Amp. The following day, the overnight cultures were subcultured 1:50 in 5 mL of fresh LB + Amp media and grown to an OD600 of 0.4-0.5 as measured by a Beckman DU640 spectrophotometer. During the outgrowth step, phage particles containing the desired SP were serially diluted from 10^-1 to 10^-6 and tubes containing 4 mL molten 2x YT top agar were prepared and kept in a water bath at 50 °C. Once the desired OD600 was reached, 270 µL of the desired host cells were added to 30 µL of the diluted phage particles and were incubated for 5 minutes at room temperature. The now infected cell culture was added to a tube of molten agar, mixed quickly, poured onto an LB plate and evenly spread on the plate before the agar could solidify. These plates were then incubated overnight at 37 °C. If 2060 host cells were used, then 80 µL of 20 mg/mL X-gal was added directly to the molten agar to improve the visualization of plaques, as this host strain contains a β-gal gene under the control of a phage shock promoter, producing blue plaques that are easier to visualize.
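Plaque counts from such an assay are usually converted to a titer by dividing by the dilution and by the volume of diluted phage that was plated. The Python sketch below shows this standard calculation for the 30 µL of diluted phage used here; the plaque count and the chosen dilution are hypothetical values for illustration, not results from this work.

```python
def titer_pfu_per_ml(plaque_count: int, dilution: float, volume_plated_ul: float) -> float:
    """Titer in PFU/mL = plaques / (dilution x volume of diluted phage plated in mL)."""
    if dilution <= 0 or volume_plated_ul <= 0:
        raise ValueError("dilution and plated volume must be positive")
    volume_plated_ml = volume_plated_ul / 1000.0
    return plaque_count / (dilution * volume_plated_ml)


# Hypothetical example: 45 plaques counted on the 10^-6 dilution plate,
# with 30 uL of diluted phage mixed with the host cells.
example_titer = titer_pfu_per_ml(plaque_count=45, dilution=1e-6, volume_plated_ul=30)
print(f"{example_titer:.2e} PFU/mL")
```

In practice one would count the plate whose dilution gives well-separated plaques and average over replicate plates where available.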

2.3.4 PANCE setup

2 mL of the desired host cells (1030 cells transformed with the AP reporter of interest) were added to 2 mL of fresh 2xYT media + Amp/Chl as well as 500 µL 1 M arabinose. To this resulting culture, 500 µL of the most recent phage stock were added, allowed to infect the host for several minutes and outgrown for 8-12 h at 37 °C at 200 rpm. After 12 h, the culture was spun down at 3000 rpm for 5 minutes to pellet the host cells, and the supernatant containing phage particles was filter sterilized and used to set up the next passage of the PANCE experiment as outlined above. After 2 passages, a recovery step was done using the permissive 1059 host in order to replenish the phage titer that drops after successive PANCE subculturing. The phage stock to be recovered was added to 100 µL 1059 cells in 25 mL of 2xYT media supplemented with Amp/Kan and incubated overnight at 37 °C at 200 rpm. The resulting culture was spun down the following day at 6000 rpm for 30 minutes, and the resulting supernatant was filter sterilized and gave phage stocks with titers up to 10^12 plaque forming units.
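The volumes given for a PANCE passage imply a final arabinose concentration and a dilution of the input phage that can be worked out directly. The Python sketch below does this arithmetic under the assumption that the four listed components make up the entire culture volume; the variable names are illustrative.

```python
# Volumes (mL) of the components combined for one PANCE passage, as listed above.
components_ml = {
    "host culture": 2.0,
    "fresh 2xYT": 2.0,
    "1 M arabinose": 0.5,
    "phage stock": 0.5,
}

total_ml = sum(components_ml.values())

# Final arabinose concentration after dilution of the 1 M (1000 mM) stock.
arabinose_final_mM = 1000.0 * components_ml["1 M arabinose"] / total_ml

# Fold dilution of the input phage stock on entering the passage.
phage_dilution_factor = total_ml / components_ml["phage stock"]

print(f"total volume: {total_ml} mL")
print(f"final arabinose: {arabinose_final_mM} mM")
print(f"input phage diluted 1:{phage_dilution_factor:g}")
```

Because each passage dilutes the input phage 10-fold under this assumption, phage that replicate poorly on the selective host lose titer from passage to passage, which is consistent with the need for the recovery step on the permissive host.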

2.3.5 Miscellaneous

Cloning of the various constructs introduced in Chapter 2 into pB1H2w2 and pET28a was carried out exactly as outlined in Chapter 1. Manipulations of DNA/bacterial cultures were performed exactly as outlined in Chapter 1. The composition of any media used is the same as outlined in Appendix A1 of Chapter 1 unless otherwise stated. The same in vitro/in vivo assays described in Chapter 1 were used for characterizing the proteins described in Chapter 2, with the sole exception that EMSAs carried out using MEF and MEF derived proteins used a phosphate buffer in the reaction mixture (4.3 mM Na2HPO4, 1.4 mM KH2PO4, 150 mM NaCl, 2.7 mM KCl, 1.5 mM EDTA, pH 7.4) in lieu of the Tris buffer used for USF1 and USF1 derived proteins.

2.4 Results and discussion

2.4.1 Rational re-design of USF1 to create UFW

USF1 is known to interact with many proteins in the cell including PDX-1 (regulates insulin production in pancreas), hTERT, TGFβ2 and IGF2R (cell cycle regulating proteins and growth suppressor genes), in part through its disordered regions and in part through its leucine zipper.62-67 These protein-protein interactions are regulated by post-translational modifications such as phosphorylation of T100 by CK2.63 We wanted to replace the native LZ of USF1 with the engineered FosW LZ to reduce the likelihood of unwanted protein-protein interactions with the aforementioned proteins, which is very desirable in an engineered TF. The LZs of FosW and USF1 were aligned (Fig. 5) in order to determine where the optimal site for the domain swap would be. The alignment showed that the heptad register aligned from L7 of FosW and L271 of the USF1 protein (referring to the positions of the residues on the corresponding proteins); this was then used as the basis for the domain swap to create UFW.

Fig. 5. Alignment of FosW and USF1 LZs for designing UFW. The FosW LZ was directly swapped for the USF1 LZ once aligned in order to generate UFW. Leucines that constitute the beginning of a heptad repeat in the LZ are highlighted in yellow. In bold are the beginning/ends of the LZ domains for each protein. The resulting UFW construct has a LZ that is longer by one turn relative to that of USF1.

If the UFW construct works as envisaged and retains the same DNA-binding properties of the USF1 bHLHZ while eliminating unwanted protein-protein interactions, then it might be possible to use UFW as a peptide therapeutic for individuals that are afflicted with the hereditary asthma associated with overexpression of USF1, where UFW would act as a competitive inhibitor and compete with USF1 for binding to the 4G E-box of PAI-1 in a similar manner as ME47 does with the c-Myc:Max heterodimer.

2.4.2 Assessing properties of UFW

After UFW was designed, the protein was cloned into the pB1H2 and pET28a vectors, and characterized in vivo and in vitro as described in Ch. 1. From the B1H assay (Fig. 6), we found that UFW behaves like USF1, in that it preferentially binds the 5G E-box, and not the 4G E-box. Given that there is some slight nonspecific binding of the 4G reporter and none for the 5G reporter, the fact that both USF1 and UFW have a higher affinity for the 5G reporter is interesting.

Fig. 6. Comparison of USF1 activity to UFW activity in the B1H assay. Left. Samples were spotted as a serial dilution from 10^-1 to 10^-5 from left to right on a 5 mM 3-AT plate. The spacer length for the 4G and 5G E-boxes was based on the 11 AT E-box as that spacer was previously found to give good activity and was not auto-activating for USF1. The signal from UFW is slightly weaker than that of USF1 for both the 4G and 5G E-box (roughly a 10-fold decrease in activity) and UFW preferentially binds the 5G E-box just like USF1 does. Reporter sequences used: 4G E-box: GGCACGTGGGGAGTCG, 5G E-box: GGCACGTGGGGGAGTCG. Right. NS assay for USF1 and UFW. Cells were plated on a 1 mM 3-AT plate. UFW shows some nonspecific activity for the -11 4G NS like the wildtype USF1; however, at 5 mM 3-AT this signal disappears (data not shown), suggesting that the nonspecific activity should not hamper interpretation of the B1H results. Reporter sequences used: 4G NS: GGTTCCAAGGGGAGTCG, 5G NS: GGTTCCAAGGGGGAGTCG.
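The four reporter sequences listed in the caption can be checked programmatically for the presence of the E-box and for the length of the G tract. The Python sketch below is one way to do so; the sequences are copied from the caption, while the dictionary keys and function names are illustrative.

```python
import re

EBOX = "CACGTG"

# Reporter sequences as listed in the Fig. 6 caption.
REPORTERS = {
    "4G E-box": "GGCACGTGGGGAGTCG",
    "5G E-box": "GGCACGTGGGGGAGTCG",
    "4G NS": "GGTTCCAAGGGGAGTCG",
    "5G NS": "GGTTCCAAGGGGGAGTCG",
}


def has_ebox(sequence: str) -> bool:
    """True if the canonical E-box hexamer occurs in the sequence."""
    return EBOX in sequence


def longest_g_run(sequence: str) -> int:
    """Length of the longest uninterrupted run of G in the sequence."""
    return max((len(run) for run in re.findall(r"G+", sequence)), default=0)


for name, sequence in REPORTERS.items():
    print(f"{name:9s} E-box: {has_ebox(sequence)!s:5s} longest G tract: {longest_g_run(sequence)}")
```

Note that in the E-box reporters the G tract begins with the final G of CACGTG, so the 4G and 5G tracts overlap the E-box by one nucleotide.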

The signal produced in the B1H assay with UFW is slightly weaker than what is observed with USF1, which is surprising since UFW has an extra repeat of the LZ heptad, and we expected that UFW would form a more stable dimer. Differences in the expression levels of USF1 and UFW might explain this phenomenon and could be investigated via RT-PCR to assess the levels of mRNA transcript for the respective proteins.68 Additionally, it is possible that the bHLH-LZ interface is not optimized despite having done a sequence alignment, so that the resulting UFW dimer is less stable than the USF1 dimer, resulting in UFW being targeted for degradation in the cell. This could be investigated via Western blots with antibodies targeting the USF1 bHLH, as it would be a shared domain for both proteins. Despite the slight decrease in binding affinity observed, UFW still produced a good signal in the B1H assay and seemed to be a promising candidate for use as a peptide therapeutic.

In vitro, we found that the UFW construct is slightly less helical than the wildtype USF1 bHLHZ (Table 1), which is surprising given that UFW has a longer LZ than USF1.

Table 1. Comparison of UFW in vivo and in vitro properties relative to USF1

Protein   4G E-box    5G E-box    NS DNA       Helicity,  Helicity,  Helicity,  Helicity,
          Kd (nM)     Kd (nM)     Kd (nM)      no DNA     NS DNA     4G E-box   5G E-box
USF1      7.0 ± 0.4   4.1 ± 0.3   102 ± 3.1    50         62         67         63
UFW       9.1 ± 0.5   7.1 ± 0.3   29.1 ± 3.0   47         49         57         55

Kd: each value is the average of two independent EMSA experiments. Numbers represent the total monomeric protein concentration of each sample. Helicity: each value is the average of two CD scans of the same sample that were then averaged. The helicity of each transcription factor was determined as described previously. Level of His3 activity leading to growth was reported relative to the growth observed from US0 cells transformed with the USF1 bHLHZ and 4G E-box reporter system.

To further investigate what was observed with CD, we did a series of EMSAs using the same fluoresceinated oligos that were used in Ch. 1 to compare UFW to USF1 (Fig. 7). Like USF1, UFW has a low nanomolar affinity for E-box containing DNA and maintains a slight preference for the 5G E-box over the 4G E-box. There are, however, two significant differences between UFW and USF1 that might preclude its use in practical applications: UFW interacts strongly with nonspecific DNA, and it is prone to tetramerizing at low protein concentrations (tetramerization starts around 100 nM of UFW against NS DNA). Comparing the ratio of the Kd values of 5G:NS (Table 1) for both USF1 and UFW, we see that for USF1 the ratio is 1:25, whereas for UFW the ratio is 1:4, a 6-fold increase in promiscuity for the UFW construct. The increase in promiscuity runs counter to our design principles and represents a significant step backwards in our protein design.
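The ratios quoted above can be checked against the Kd values in Table 1 with a few lines of Python, sketched below. The quoted 1:25 and 1:4 ratios, and the resulting 6-fold difference, are reproduced when the NS DNA Kd is divided by the 5G E-box Kd; the corresponding values relative to the 4G E-box are computed alongside for comparison. The dictionary layout and function name are illustrative.

```python
# Mean Kd values (nM) taken from Table 1.
KD_NM = {
    "USF1": {"4G": 7.0, "5G": 4.1, "NS": 102.0},
    "UFW": {"4G": 9.1, "5G": 7.1, "NS": 29.1},
}


def selectivity(protein: str, site: str) -> float:
    """Fold preference for an E-box over nonspecific DNA (NS Kd / E-box Kd)."""
    return KD_NM[protein]["NS"] / KD_NM[protein][site]


for site in ("4G", "5G"):
    usf1 = selectivity("USF1", site)
    ufw = selectivity("UFW", site)
    print(
        f"{site}:NS  USF1 1:{usf1:.1f}  UFW 1:{ufw:.1f}  "
        f"loss of selectivity: {usf1 / ufw:.1f}-fold"
    )
```

Relative to the 4G E-box the same calculation gives roughly 1:15 for USF1 and 1:3 for UFW, so the qualitative conclusion that UFW is markedly more promiscuous holds for either reference site.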

Additionally, we observed that UFW is prone to tetramerizing at lower protein concentrations relative to USF1 (100 nM vs. 250 nM, respectively for NS DNA), which has significant implications for drug delivery if the protein were to be used as a peptide therapeutic.

If the UFW protein needs to be supplied at high concentrations in order to alter transcription in human cells, it is possible that it would begin tetramerizing before it even reaches the nucleus, which could impede its mechanism of action. USF1 and other bHLHZ proteins are known to form bivalent homotetramers in vitro and can still interact with E-box containing DNA when in this state, but it is difficult to assess whether these tetramers form in vivo and whether they would be a problem moving forward.67 It is interesting to note, however, that both MEF and MFW (Max bHLH and FosW LZ) have been engineered to contain the same FosW LZ but do not have the same propensity for tetramerization that UFW does (data not shown).

Fig. 7. In vitro characterization of UFW binding affinity. Representative EMSA for UFW binding to the 4G E-box with representative curve fit (R² of 0.991; fit performed in KaleidaGraph v. 3.6.2). From left to right, samples were loaded as follows: 0, 2, 4, 8, 10, 15, 20, 25, and 40 nM; Kd = 8.5 nM. UFW has a similar affinity for E-box containing DNA as USF1 does but is more prone to tetramerizing, as evidenced by the additional bands that appear in the gel. USF1 is known to form bivalent homotetramers, but these were only observed at higher protein concentrations of USF1 (>250 nM).67
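As a rough illustration of what a Kd of 8.5 nM implies for the lanes listed in the caption, the sketch below evaluates a simple hyperbolic binding isotherm at each concentration. The single-site model, the assumption that free protein approximates total protein, the reading of the lane values as protein concentrations, and the function name are all assumptions made for illustration; the fitting equation actually used for the curve fit is not given in this section.

```python
def fraction_bound(protein_nm: float, kd_nm: float) -> float:
    """Simple hyperbolic isotherm: theta = [P] / (Kd + [P]).

    Assumes free protein is approximately equal to total protein.
    """
    if protein_nm < 0 or kd_nm <= 0:
        raise ValueError("protein_nm must be >= 0 and kd_nm must be > 0")
    return protein_nm / (kd_nm + protein_nm)


# Lane concentrations (nM) and Kd (nM) as listed in the Fig. 7 caption.
LANES_NM = [0, 2, 4, 8, 10, 15, 20, 25, 40]
KD_4G_NM = 8.5

if __name__ == "__main__":
    for conc in LANES_NM:
        theta = fraction_bound(conc, KD_4G_NM)
        print(f"{conc:>3} nM protein -> fraction of DNA bound {theta:.2f}")
```

Under this model, half of the DNA is bound when the protein concentration equals the Kd, so the titration range in Fig. 7 brackets the midpoint of the binding curve.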

To further assess the contribution of the FosW LZ to UFW, a variant of UFW was designed with a premature stop codon situated after the first heptad of the LZ to generate a truncated version of UFW. This construct was tested in the B1H assay, and no signal was obtained for the truncated UFW (Appendix Fig. 2, rows 5 and 7). Previous in vitro studies with USF1 showed that removing the LZ of the protein, which is essentially what was done to generate the truncated UFW, resulted in 1000-fold weaker binding to the E-box as determined by EMSA.67 That being said, the complete lack of a signal was somewhat surprising given that the experiment was done in vivo and not in vitro. A next step could be to mutate the Leu residues in the FosW LZ to Ala, so that dimerization is inhibited but the LZ is otherwise intact, and see how that affects UFW behavior, although we anticipate results similar to those seen with the truncated UFW.

Taken together, these unwanted properties of UFW have led us to reconsider what UFW can and should be used for. However, given that UFW has these problematic features, UFW is a prime candidate to be further developed using non-rational approaches to redesign the protein scaffold to remove or alter these unwanted features.

2.4.3 Non-rationally modifying USF1 and UFW in PACE

PACE has previously been shown to modify proteins to have altered behavior or even completely new functions.69,70 In light of how powerful this technique can be, our initial goal was to adapt PACE to modify the IDRs found in bHLHZ proteins, with a particular interest in the peculiar loop of USF1 and UFW. As was noted in the previous chapter, the loop of the USF1 bHLHZ preferentially recognizes the 5G E-box of PAI-1 over the 4G E-box associated with asthma. If PACE could be used to evolve the disordered USF1 loop to alter its DNA binding preferences away from the 5G E-box towards the 4G E-box, the new loop could be used towards generating a peptide therapeutic for hereditary asthma. We could also imagine using PACE to evolve the USF1 loop to differentiate other flanking nucleotides, by creating different variants of the selection circuit and evolving USF1 and UFW on the new selection circuit to generate new loop variants.

Both the USF1 and UFW bHLHZ constructs were subcloned from the pB1H2w2 vectors into the SP vector as described in the methods. The SP vectors containing the proteins of interest, which can be manipulated like a plasmid, were then transformed into the permissive 1059 host cells in order to generate mature phage particles, as the 1059 cells constitutively express the gIII protein, which is necessary for phage assembly and has been removed from the SP vector. The now infected cells were outgrown and spun down to separate the host cells from the phage particles in the supernatant, and the phage supernatant was then filter-sterilized to generate a stock of mature phage particles.

Using what was thought to be mature phage containing the proteins of interest, plaque assays were done using 1030 cells containing AP-RBSA -7 GC E-box (the lowest stringency selection circuit available for use, refer to Appendix B for sequences) and AP-RBSA -11 4G E-box to assess if these phage were viable for use in PACE, as well as 1059 cells, which served as the positive control. No plaques were produced from any of the host strains, suggesting that there was a problem with the phage particles. After some troubleshooting, we determined that the phage titer was too low after the initial transformation, and a modified recovery using PEG 8000 was employed to increase the phage titer and allow for successful plaque assays. The plaque assay was repeated with the improved phage stock and plaques were produced from 1059 cells (the phage titers were in the 10^9 range, indicative of a successful plaque assay), but not from either of the 1030 host cell strains, where TF activity is needed for phage propagation.
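For readers unfamiliar with how a titer is obtained from a plaque assay, the sketch below shows the standard back-calculation from a plaque count. The plaque count, dilution factor, and plated volume are hypothetical numbers chosen only to land in the 10^9 range mentioned above, and the pfu/mL unit is an assumption; none of these values are taken from the experiments described here.

```python
def phage_titer_pfu_per_ml(plaques: int, dilution_factor: float, volume_plated_ml: float) -> float:
    """Back-calculate the titer of the undiluted stock from one countable plate.

    dilution_factor is the total fold-dilution of the plated sample (e.g. 1e6
    for a 10^-6 dilution); volume_plated_ml is the volume of diluted phage plated.
    """
    if plaques < 0 or dilution_factor <= 0 or volume_plated_ml <= 0:
        raise ValueError("inputs must be positive (plaques may be zero)")
    return plaques * dilution_factor / volume_plated_ml


if __name__ == "__main__":
    # Hypothetical example: 50 plaques from 10 uL of a 10^-6 dilution.
    titer = phage_titer_pfu_per_ml(plaques=50, dilution_factor=1e6, volume_plated_ml=0.01)
    print(f"Estimated titer: {titer:.1e} pfu/mL")
```

A plate with no plaques, as seen here for the 1030 host strains, only places an upper bound on the titer of phage able to propagate on that strain at the dilution tested.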

The fact that AP-RBSA -7 GC E-box did not give plaques was not surprising as it was previously optimized for use in PACE with ME47 and MEF. AP-RBSA -7 GC E-box was shown to be a valid selection circuit partner for MEF, which is a smaller protein than USF1 (127 residues vs. 151 residues), so this difference in protein size may explain why USF1 did not produce plaques from it. By virtue of USF1 being so different from MEF, it is possible that the USF1-omega subunit fusion does not align properly with the promoter region upstream of the gIII gene (Fig. 8), resulting in no expression of gIII. The more surprising result was that AP-RBSA -11 4G E-box did not produce plaques, as this was the same sequence that gave a strong signal in the B1H assay with USF1, and the B1H assay forms the basis of our PACE selection circuits. This is a recurring problem when adapting an external reporter circuit for use in PACE or PANCE, and it can only be solved by trial and error.71 Given the additional layers of complexity that are introduced by the phage-host interaction, as well as all of the different parameters that can be altered when setting up a selection circuit (Fig. 8), establishing a successful PACE or PANCE setup can take up to a year.71

Fig. 8. Overview of the B1H-PACE selection circuit. Schematic depiction of how the accessory plasmid (AP) is organized, and how the AP and protein of interest-omega subunit fusion from the SP interact with one another to affect gIII expression in the B1H-PACE with ME47. The numbers correspond to the list outlining parameters that can be altered to fine-tune gIII expression. In brief, the E-box motif (yellow) is flanked by sequences (1, green trapezoid) that can be modified to alter the distance between cognate DNA and promoter. The length of the linker (2, blue rectangle) between ME47 and the omega subunit of RNA polymerase can be altered to optimally position the omega subunit on the promoter. The promoter sequence can also be altered to give differing amounts of transcription from the RNA polymerase (4).48,49 The RBS sequence (5, red semi-circle) can be altered to change how easily the ribosome binds to it to affect translation.72 Altering the first codon of gIII (6, gray arrow) can also decrease the amount of translation that occurs.72 A LuxAB CDS (purple arrow) can be placed downstream of the gIII CDS to allow for an indirect means to measure gIII expression. Adapted from ref 71.

Given the large number of components that can be optimized to obtain a functional selection circuit in the context of PACE, Dr. Inamoto was consulted on where to begin troubleshooting. As per his advice, efforts were made to optimize the length of the spacer to accommodate the larger USF1-omega subunit fusion. To this end, reporter constructs with spacers of varying lengths on the pH3U3 vector were tested in the B1H assay to identify spacers that gave strong signals and were not auto-activating. Eventually, we found that the -23 AT spacer was a suitable candidate (Appendix Fig. 3). AP-RBSA -23 4G E-box was then designed and tested in the plaque assay for its viability in PACE, but no plaques were produced.

The difficulty in transferring successful B1H results over to PACE was also observed when developing a PACE selection circuit for ME47 and MEF, and this is the point where this project remains.

Currently, evolving the USF1 loop in the context of the MEF protein is being explored, as MEF has previously been shown to be amenable to use in PACE with the AP-RBSA -7 GC E-box. The goal is to evolve the USF1 loop in the context of the MEF protein scaffold and then introduce any loop mutations that may arise into the USF1 or UFW scaffolds to examine changes in the resulting proteins.

2.4.4 MEFU and MEFH

Given the success with the domain swaps to create the MaxULoop construct in Ch. 1, our next step was to explore how far these domain swaps could be taken. To this end, two new protein designs were envisaged: 1) MEFU, which is a further refinement of the initial MaxULoop construct, and 2) MEFH, a variant of MEF that has an appended N-terminal arm from a HTH transcription factor (Appendix Fig. 4). MEFU is based on MaxULoop, with the goal of making the new protein differentiate between the 4G and 5G E-boxes without sacrificing the low nM DNA binding affinity that is characteristic of bHLHZ TFs. MEFU was also designed with the goal of evolving the USF1 loop in PACE, due to the issues discussed in section 2.4.3. With MEFH, the goal was to append to the N-terminus of MEF the N-terminal arm of a HTH TF that recognizes AT-rich minor grooves (which typically contain 3-5 A/T nucleotides) based on the electrostatic potential of the minor groove, so that the resulting protein could preferentially recognize E-boxes flanked by these AT-rich sequences.56 Sequences of the AT-rich reporters are given in Appendix Table 1. After this round of rational design on the MEF protein scaffold, the MEF-derived proteins can then be subjected to PACE to further refine our design, as the wildtype MEF has been shown to be viable in PACE (data not shown), which will hopefully broaden the potential applications of these proteins.

The rationale for creating these constructs is simple: by appending these various domains onto MEF, it should be possible to alter the DNA-binding preferences of the protein so that it becomes possible to target specific E-box and flanking sequences found throughout the human genome. A generic bHLHZ protein recognizes the core E-box motif (6 nucleotides) and may make contacts with nucleotides that directly flank the E-box (1-2 nucleotides away on either side). This hypothetical E-box has up to 8 nucleotides that will be recognized by a generic bHLHZ, meaning that a given site should be found by chance approximately once every 65 kbp (4^8 = 65,536), assuming random sequence with equal base frequencies. By adding these additional domains onto these proteins, the goal was to increase the length of the targeted E-box and flankers in order to improve the specificity of the protein:DNA interaction. Additionally, if the MEFH project works as intended, this would demonstrate that DNA binding domains of other TFs can be used to alter bHLHZ binding properties, which would then greatly increase the number of bHLHZ variants that can be generated and increase the diversity of DNA sequences that can be targeted using this strategy.
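The site-frequency argument above can be made explicit with a short calculation. The sketch below assumes random DNA with equal, independent base frequencies and counts a single strand only; the function name is introduced here for illustration.

```python
def expected_spacing_bp(site_length: int, alphabet_size: int = 4) -> int:
    """Expected number of base pairs between chance occurrences of one fixed site.

    Assumes each position is independent and each base is equally likely.
    """
    if site_length < 1:
        raise ValueError("site_length must be at least 1")
    return alphabet_size ** site_length


if __name__ == "__main__":
    for n in (6, 8, 10, 12):
        print(f"{n}-nucleotide site: about once every {expected_spacing_bp(n):,} bp")
```

Under these assumptions, a 6-nucleotide core E-box is expected once every 4,096 bp and an 8-nucleotide site once every 65,536 bp; every additional recognized nucleotide makes a chance match four times rarer, which is the motivation for extending the recognized flanking sequence.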

2.4.4.1 MEFU

As previously discussed, MEF represents a significant improvement in our protein design strategy, and we thought that swapping the native MEF loop for the USF1 loop should result in a protein (MEFU) that can differentiate between 4G and 5G tracts flanking the E-box and is more stable. A second objective for this protein would be to subject it to PACE in order to evolve the disordered loop of USF1 in a protein scaffold that works in PACE. This project was a collaboration between 3rd year ROP student Kevin Do and myself. The design and cloning of MEFU into vectors were done by myself; protein expression/purification, B1H, EMSA and CD experiments were a joint effort.

This new protein, dubbed MEFU, was tested in the B1H assay and, as expected, was found to differentiate between the 4G and 5G E-boxes like MaxULoop does; however, the preferential binding of the 5G E-box over the 4G E-box was less evident than it was for MaxULoop. A reason for this could be that MEFU generated a signal comparable to that of MEF, while MaxULoop produced a signal that was much weaker (by several orders of magnitude), which makes the differentiation between the 5G and 4G E-boxes much more apparent for MaxULoop (Fig. 9). This would seemingly satisfy the first objective for making MEFU.

Fig. 9. B1H plate of MEF, MEFU, and MaxULoop binding to the 4G or 5G E-box. B1H comparing MEF to bHLHZ proteins with the foreign USF1 loop appended to them. The addition of the USF1 loop allows for novel DNA recognition at the cost of DNA binding affinity. MEFU appears to strike a balance between MEF and MaxULoop, where some DNA binding affinity is sacrificed to gain the ability to differentiate the 4G/5G polymorphism. Controls are as previously described. Shown is a 2.5 mM 3-AT plate; the assay was done by Kevin Do.

Comparing the E47 protein scaffold to that of Max (Fig. 10) shows that the E47 HLH has a longer H1, a shorter loop, and an H2 helix oriented at a different angle relative to the Max HLH, which could explain why MEFU behaves differently in the B1H assay relative to MaxULoop.73,74 MEFU produces a similar signal to MEF in the B1H while having the desired differentiation of the 4G/5G polymorphism, validating the rationale for making the protein in the first place. Additionally, MEFU behaves like MEF in that it has little to no nonspecific activity in the context of the B1H (Appendix Fig. 5).

Fig. 10. Comparison of the E47 HLH and Max HLH. H1 of E47 (PDB: 2QL2,73 blue) is longer than that of Max (PDB: 1HLO,74 beige) by one turn, and the loop of E47 is shorter by one residue relative to that of Max. The protein scaffold of E47 is notably different from that of Max, which might explain why MEFU does not behave the same way that MaxULoop does. Structures visualized with Chimera v.1.13.1.75

MEFU was also used to investigate the effect of having the E-box motif flanked by symmetric G5G5 tracts (i.e. G5-E-box-G5). As described in Ch. 1, USF1 makes asymmetric contacts using its loops with nucleotides flanking the E-box, and the polymorphism is located downstream of the E-box regulating PAI-1 expression.67 It was found that E-boxes flanked by symmetric G5G5 tracts produced a noticeably weaker signal for both MEF and MEFU relative to the asymmetric E-boxes (Appendix Fig. 6). This could be due to the inherent flexibility associated with G5G5 tracts,61 which might allow the major groove in which the E-box is found to adopt conformations that are not conducive to bHLHZ binding, again highlighting how important these flanking nucleotides are in regulating protein-DNA interactions.

From CD, we see that MEFU has a similar helicity to the parent MEF, suggesting that the domain swap was not as disruptive to MEFU as it was for MaxULoop, in agreement with our observations in vivo (Table 2). Issues with the CD instrument prevented us from characterizing how MEFU behaves in the presence of 4G and 5G E-box containing DNA. However, in terms of in vitro binding of DNA, MEFU behaves similarly to MaxULoop, with Kd values in the hundreds of nanomolar, which was surprising since MEFU had a B1H signal that was significantly stronger than that observed for MaxULoop (Fig. 9). This would suggest that differences between MEFU and MaxULoop in vivo might stem from improvements in MEFU stability or an increase in protein expression levels but not from significant differences in DNA binding affinity between the two proteins. While MEFU presents a small improvement in the rational design strategy first employed in making MaxULoop, it requires further refining as the goal is to recapitulate the DNA-binding affinity of the original MEF protein. Altering the length of the appended USF1 loop or the loop interface with H1 or H2 might stabilize the dimerization interface of MEFU and recover some of the DNA-binding affinity but may require a brute force approach using rational design in order to do so.

Table 2. Comparison of in vitro properties for MEFU and related proteins

             CD, 222 nm (c)                 Kd (nM) (a)
Protein      no DNA    E-box    NS DNA      4G E-box        5G E-box        NS DNA
MEF          17        25       27          -               -               96 ± 22
MaxULoop     16        21       20          490.6 ± 24.6    426.3 ± 29.6    > 2000 (b)
MEFU         16        -        -           426.3 ± 44.7    379 ± 26        > 2000 (b)

(a) Each value is the average of two independent EMSA experiments. Numbers represent the total monomeric protein concentration of each sample.
(b) Protein binding of NS DNA was below 50% at 2000 nM; the titration range was not expanded due to issues with protein aggregation.
(c) Each α-helicity value is the average of two CD scans of the same sample. The helicity of each transcription factor was determined as described previously.
EMSAs for MEF binding to the 4G E-box and 5G E-box were not fully carried out; however, the Kd for the GC E-box (GCCACGTGCG) was 7.8 ± 0.6 nM and preliminary EMSAs with the 4G E-box and 5G E-box gave similar results.

Given that MEFU seems to be well behaved in vivo and in vitro, it was cloned into the SP for use in PACE using the same strategy employed for USF1. Initial plaque assays using the permissive 1059 host strain showed that SP-MEFU was cloned properly and capable of producing mature phage particles. Additional plaque assays using various AP constructs (AP-RBSA -7 GC, AP-RBSA -11 4G, AP-RBSA -11 5G, AP-RBSA -13 4G, AP-RBSA -23 4G) with SP-MEFU produced plaques for all of those constructs (data not shown), which was surprising given the challenges of using SP-USF1 or SP-UFW to produce plaques when selective pressure was present. PANCE with MEFU is currently ongoing in the absence of selective pressure to allow for genetic drift (the goal is to build a library of MEFU mutants), before introducing selective pressure to select for MEFU variants that have a higher DNA binding affinity/increased stability.

2.4.4.2 MEFH

In order to create MEFH, the N-terminal arm of Hin recombinase was appended to the N-terminus of MEF (sequence of Hin arm: RPRAITKHKL). Hin recombinase, a protein of 198 residues, is a Ser recombinase identified in Salmonella that cleaves DNA and alters its orientation.76 The DBD of Hin recombinase comprises a HTH motif that inserts into the major groove of its cognate DNA (HixC consensus half site is 5’ TGTTTTTGATAAGA), and it possesses an N-terminal arm that inserts into the neighbouring AT-rich minor groove.76 Interestingly, it has been found that the N-terminal arm of Hin recombinase does not recognize specific nucleotides within the minor groove, but rather it recognizes the electrostatic potential within the minor groove to help direct the rest of the protein towards its cognate DNA.56 The N-terminal arm of Hin possesses an RPR motif that is highly conserved in this class of proteins and that, when mutated, abolishes sequence-specific binding of the TF.77 Other HTH proteins, like the Engrailed homeodomain transcription factor, use similar RNR motifs for contacting the minor groove, which provides some variety in terms of what N-terminal arms could have been used. We used the N-terminal arm of Hin as other ongoing projects in the group are also investigating this domain.

The goal was to append the N-terminal arm of Hin recombinase onto the MEF protein scaffold to alter its E-box flanking sequence preferences in a way similar to what the USF1 loop did for MaxULoop. This project was conceived as a proof of concept that domains of other TFs and DNA-binding proteins can be appended to the bHLHZ protein scaffold and that we are not merely limited to domain swaps between closely related TFs. If this project works the way it is intended, then it becomes feasible to consider attaching DNA-recognizing domains from other protein families to the bHLHZ scaffold in order to begin targeting specific E-boxes, which would open the door to creating custom TFs that can be used for a whole host of applications. This project was a collaboration between 4th year thesis student Montdher Hussain and myself. The design and cloning of MEFH into vectors were done by myself, and B1H assays were a joint effort, while the design and creation of the various reporter constructs were done by Montdher.

As with MEFU, MEFH was initially assessed using the B1H assay. The initial A4T4 E-box reporter consists of the core E-box motif flanked on either side by a 4-nucleotide polyA/polyT flanker, designed to mimic the AT-rich flanking region that the Hin arm natively inserts into (Appendix Table 1).76 The initial design of the A4T4 E-box was suboptimal, as MEFH gave approximately the same signal with the A4T4 E-box as MEF did (Appendix Fig. 7), suggesting that the N-terminal arm is not altering DNA binding preferences as was intended. However, the addition of the N-terminal arm does not seem to have had a negative impact on DNA binding affinity, unlike what was seen when the USF1 loop was appended onto MEF.

There are two potential explanations for why MEFH did not differentiate the initial AT-rich spacer flanking the E-box in our reporter system, both of which might have to do with the nature of the protein construct used in the B1H (Fig. 11). The omega subunit of the RNAP is fused to the N-terminus of the bHLHZ protein via a 21-residue linker, which coincidentally is where the N-terminal Hin arm is also found. If the two proteins (MEFH and RNAP) act as a sort of “anchor” and impede the Hin arm from inserting properly into the minor groove, then one would not expect to see any differentiation of DNA on this basis. If the linker between the bHLHZ and omega subunit were lengthened to provide additional flexibility/slack, it may be possible to recover the ability of the Hin arm to recognize AT-rich sequences.

Fig. 11. MEFH interacting with B1H reporter system. A schematic representation of how our TF fusions might interact with the reporter system in the B1H assay. The N-terminal Hin arm could be anchored by the bHLHZ and RNAP subunit, which may prevent it from inserting properly into the AT rich minor groove, thus preventing it from recognizing the electrostatic potential motif in the minor groove. Figure created by Montdher Hussain. Structures visualized with Chimera v.1.13.1.75

The second theory is that the AT-rich tract is too close to the core E-box, so that the Hin arm cannot contort itself to insert into the groove. A potential solution to this problem would be to either lengthen the AT-rich tract by adding additional A/T nucleotides where appropriate or to move the AT-rich tract away from the E-box, so that it is not directly abutting the E-box motif.

To this end, MEFGH, a variant of MEFH with a longer linker between the Max basic region and Hin arm, was made (Appendix Fig. 4), with others being designed as well, and additional B1H reporters were designed to test how MEFH or MEFGH interacts with other similar AT-rich E-boxes (Fig. 12, Appendix Fig. 7).

Fig. 12. Differing DNA binding properties of MEFH and MEFGH. Lengthening of the AT tracts significantly impacted DNA binding by MEFH but had no impact on MEFGH. Both sets of samples were plated on 20 mM 3-AT; controls are as described before. B1H assays were performed by Montdher Hussain.

MEFH and MEFGH behave significantly differently from one another when tested in vivo with the same reporter constructs despite differing by a single Gly residue, although admittedly, the work shown here is still preliminary. Lengthening the AT-rich tracts significantly impacted MEFH binding but seems to have had little to no impact on MEFGH, suggesting that we have developed two MEF variants that can preferentially bind slightly different AT tracts (A4T4 E-box for MEFH and A5T5 E-box for MEFGH). We can imagine further lengthening the linker to see how drastically we can change the DNA binding properties of MEFH/MEFGH. Additionally, we found that the addition of the elongated Hin arm to MEFGH shifted the DNA binding preferences away from the A5T5 E-box preferred by the original MEF construct towards the A5A5 E-box (Appendix Fig. 8). Further characterization of these MEFH-derived proteins needs to be carried out in vivo and in vitro, but currently, it seems that by altering the length of the linker between the Hin arm and Max basic region, we can generate proteins that recognize slightly different E-boxes. This might allow for the design of tunable TFs, where we can append different IDRs to MEF and other related TFs and target very specific E-boxes.

2.4.5 Relevance

Ever since Jacob and Monod elucidated the workings of the lac operon, attempts have been ongoing to take advantage of circuits that regulate cellular behavior.78 Since then, with breakthroughs made in genomic research and the development of high throughput biology, synthetic biology has become feasible, where we can construct gene circuits within model organisms that are orthogonal to the native organism’s machinery. The goal of creating these orthogonal circuits would be to create a desirable product, whether it be an industrially relevant product or a drug. Examples of this include the anti-malarial artemisinin, whose precursor is being made at industrial levels in yeast, and the production of biofuels like isobutanol in E. coli.79,80 We can imagine placing the expression of genes needed for these products under the control of a specific E-box that a designed bHLHZ TF would recognize in a specific manner. Additionally, if we expand our repertoire of DNA-recognizing IDRs that can be appended onto engineered bHLHZ TFs, then that may open the door for using these TFs as peptide therapeutics for diseases like asthma or cancer. Other groups have shown that small peptides can be successfully used to inhibit c-Myc dependent tumors. The hope is that by refining the tunability of the TFs, we can limit the off-target effects that these small peptides might have.

2.4.6 Future directions

Finishing the PANCE experiment on MEFU to see if it can evolve to preferentially recognize the 4G E-box and not the 5G would be of significant value, as any mutations that arise would be expected to be in the loop or the H1-loop or H2-loop interfaces, and these mutations could be reintroduced into USF1 or UFW to attempt to reengineer their binding preferences. We can extend this to have MEFU evolve to recognize flanking sequences other than an asymmetrical polyG tract, which would be an excellent proof of concept to validate using PANCE for these applications, as well as a way to build a library of loop variants that recognize diverse sequences. If the PANCE experiment with MEFU yields interesting loop mutants, then we can imagine using PANCE to evolve other IDRs towards a certain DNA target, potentially even evolving IDRs that are not necessarily involved in protein:DNA interactions to make novel DNA contacts.

MEFH and MEFGH behave significantly differently from one another despite differing by just a Gly residue. Exploring longer linkers and fully characterizing them will hopefully provide a “repertoire” of disordered N-terminal arms that can be used to design TFs that target specific AT-rich sequences. Testing different AT-rich flanking sequences with MEFH or MEFGH, like ATATA or AATAA, may lead to interesting results, as the width of the minor groove changes depending on the sequence, which will alter how these arms interact with the groove. Given that the addition of the N-terminal Hin arm altered the DNA binding properties of MEF, exploring IDRs from other DNA binding proteins like histones and other classes of TFs could be of significant interest and might be feasible using PACE to optimize the fusions. Upon the full characterization of several disordered DNA binding domains, we could then consider appending two or more such domains to a bHLHZ to explore how far this concept can be taken.

Ideally, we would test the designed bHLHZ in the model systems described in Ch. 1 to assess their use as potential therapeutics, which would be of significant interest and could inform future modifications that need to be made to these protein scaffolds.

2.5 References

1. Ulmer, K. M. (1983). Protein engineering. Science, 219(4585), 666-671.

2. Carter, P. J. (2011). Introduction to current and future protein therapeutics: a protein engineering perspective. Experimental cell research, 317(9), 1261-1269.

3. Wright, P. E., & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. Journal of molecular biology, 293(2), 321-331.

4. Remington, S. J. Green fluorescent protein: A perspective. Prot. Sci. 20, 1509-1519 (2011).

5. Marshall, S. A., Lazar, G. A., Chirino, A. J. & Desjarlais, J. R. Rational design and engineering of therapeutic proteins. Drug Discov. Today 8, 212-221 (2003).

6. Packer, M. S. & Liu, D. R. Methods for the directed evolution of proteins. Nat. Rev. Genetics 16, 379-394 (2015).

7. Gregor, P. D., Sawadogo, M., & Roeder, R. G. (1990). The adenovirus major late transcription factor USF is a member of the helix-loop-helix group of regulatory proteins and binds to DNA as a dimer. Genes & Development, 4(10), 1730-1740.

8. Blackwood, E. M., & Eisenman, R. N. (1991). Max: a helix-loop-helix zipper protein that forms a sequence-specific DNA-binding complex with Myc. Science, 251(4998), 1211-1217.

9. Popa, S. C., & Shin, J. A. (2019). The intrinsically disordered loop in the USF1 bHLHZ domain modulates its DNA-binding sequence specificity in hereditary asthma. The Journal of Physical Chemistry B.

10. Popa, S. C., Inamoto, I., Thuronyi, B. W., & Shin, J. A. (2019). Phage Assisted Continuous Evolution (PACE): A How-to Guide for Directed Evolution. In preparation.

11. Bayliss, R., Burgess, S. G., Leen, E., & Richards, M. W. (2017). A moving target: structure and disorder in pursuit of Myc inhibitors. Biochemical Society Transactions, 45(3), 709-717.

12. Struntz, N. B., Chen, A., Deutzmann, A., Wilson, R. M., Stefan, E., Evans, H. L., ... & Neel, D. V. (2019). Stabilization of the Max homodimer with a small molecule attenuates Myc-driven transcription. Cell chemical biology, 26(5), 711-723.

13. Demma, M. J., Mapelli, C., Sun, A., Bodea, S., Ruprecht, B., Javaid, S., ... & Orvieto, F. (2019). Omomyc reveals new mechanisms to inhibit the MYC oncogene. Molecular and Cellular Biology, 39(22), e00248-19.

14. Xu, J., Chen, G., De Jong, A. T., Shahravan, S. H., & Shin, J. A. (2009). Max-E47, a designed minimalist protein that targets the E-box DNA site in vivo and in vitro. Journal of the American Chemical Society, 131(22), 7839-7848.

15. Fernandez-Leiro, R., & Scheres, S. H. (2016). Unravelling biological macromolecules with cryo-electron microscopy. Nature, 537(7620), 339-346.

16. Demma, M. J., Hohn, M. J., Sun, A., Mapelli, C., Hall, B., Walji, A., & O’Neil, J. (2020). Inhibition of Myc Transcriptional Activity by a Mini Protein Based Upon Mxd1. FEBS letters.

17. Josephson, K., Jones, B. C., Walter, L. J., DiGiacomo, R., Indelicato, S. R., & Walter, M. R. (2002). Noncompetitive antibody neutralization of IL-10 revealed by protein engineering and x-ray crystallography. Structure, 10(7), 981-987.

18. Kosol, S., Contreras-Martos, S., Cedeño, C., & Tompa, P. (2013). Structural characterization of intrinsically disordered proteins by NMR spectroscopy. Molecules, 18(9), 10802-10828.

19. Jensen, M. R., Ruigrok, R. W., & Blackledge, M. (2013). Describing intrinsically disordered proteins at atomic resolution by NMR. Current opinion in structural biology, 23(3), 426-435.

20. Hellinga, H.W. and Richards, F.M. (1994) Optimal sequence selection in proteins of known structure by simulated evolution. Proc. Natl. Acad. Sci. U. S. A. 91, 5803–5807.

21. Chevalier, A.; Silva, D.-A.; Rocklin, G. J.; et al., Massively parallel de novo protein design for targeted therapeutics. Nature 2017, 550, 74-82.

22. Inamoto, I., Sheoran, I., Popa, S. C., Hussain, M., & Shin, J. A. (2020). Combining rational design and continuous evolution on minimalist proteins that target DNA. In review.

23. Ellenberger, T., Fass, D., Arnaud, M., & Harrison, S. C. (1994). Crystal structure of transcription factor E47: E-box recognition by a basic region helix-loop-helix dimer. Genes & development, 8(8), 970-980.

24. Lustig, L. C., Dingar, D., Tu, W. B., Lourenco, C., Kalkat, M., Inamoto, I., ... & Penn, L. Z. (2017). Inhibiting MYC binding to the E-box DNA motif by ME47 decreases tumour xenograft growth. Oncogene, 36(49), 6830.

25. Landschulz, W. H., Johnson, P. F., & McKnight, S. L. (1988). The leucine zipper: a hypothetical structure common to a new class of DNA binding proteins. Science, 240(4860), 1759-1764.

26. Mason, J. M., & Arndt, K. M. (2004). Coiled coil domains: stability, specificity, and biological implications. Chembiochem, 5(2), 170-176.


27. Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., & Zhang, Y. (2015). The I-TASSER Suite: protein structure and function prediction. Nature methods, 12(1), 7.

28. Worrall, J. A., & Mason, J. M. (2011). Thermodynamic analysis of Jun–Fos coiled coil peptide antagonists. The FEBS journal, 278(4), 663-672.

29. Ahmadpour, F., Ghirlando, R., De Jong, A. T., Gloyd, M., Shin, J. A., & Guarné, A. (2012). Crystal structure of the minimalist Max-E47 protein chimera. PLoS One, 7(2).

30. Tam, J. P., Yu, Q., & Miao, Z. (1999). Orthogonal ligation strategies for peptide and protein. Peptide Science, 51(5), 311-332.

31. Arnold, F. H., & Georgiou, G. Directed Enzyme Evolution. Vol. 230 (Humana Press, 2003).

32. Fasan, R., Chen, M. M., Crook, N. C. & Arnold, F. H. Engineered alkane-hydroxylating cytochrome P450(BM3) exhibiting native-like catalytic properties. Angew. Chem. Int. Ed. Engl. 46, 8414–8418 (2007).

33. Badran, A. H., Guzov, V. M., Huai, Q., Kemp, M. M., Vishwanath, P., Kain, W., ... & Wang, P. (2016). Continuous evolution of Bacillus thuringiensis toxins overcomes insect resistance. Nature, 533(7601), 58.

34. English, J. G., Olsen, R. H., Lansu, K., Patel, M., White, K., Cockrell, A. S., ... & Roth, B. L. (2019). VEGAS as a Platform for Facile Directed Evolution in Mammalian Cells. Cell, 178(3), 748-761.

35. Sidhu, S. S. (2001). Engineering M13 for phage display. Biomolecular engineering, 18(2), 57-63.

36. Wittrup, K. D. (2001). Protein engineering by cell-surface display. Current opinion in biotechnology, 12(4), 395-399.

37. Leconte, A. M. et al. A Population-Based Experimental Model for Protein Evolution: Effects of Mutation Rate and Selection Stringency on Evolutionary Outcomes. Biochemistry 52, 1490-1499 (2013).

38. Dickinson, B. C., Leconte, A. M., Allen, B., Esvelt, K. M. & Liu, D. R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. Proc. Natl. Acad. Sci. USA 110, 9007-9012 (2013).

39. Wang, X., Minasov, G. & Shoichet, B. K. Evolution of an antibiotic resistance enzyme constrained by stability and activity trade-offs. J. Mol. Biol. 320, 85-95 (2002).


40. Serebriiskii, I. G. & Golemis, E. A. Two-hybrid system and false positives. Approaches to detection and elimination. Meth. Mol. Biol. 177, 123-134 (2001).

41. Vidalain, P.-O., Boxem, M., Ge, H., Li, M. & Vidal, M. Increasing specificity in high-throughput yeast two-hybrid experiments. Methods 32, 363-370 (2004).

42. Esvelt, K. M., Carlson, J. C., & Liu, D. R. (2011). A system for the continuous directed evolution of biomolecules. Nature, 472(7344), 499.

43. Rakonjac, J., Bennett, N. J., Spagnuolo, J., Gagic, D. & Russel, M. Filamentous bacteriophage: biology, phage display and nanotechnology applications. Curr. Iss. Mol. Biol. 13, 51-76 (2011).

44. Bennett, N. J. & Rakonjac, J. Unlocking of the filamentous bacteriophage virion during infection is mediated by the C domain of pIII. J. Mol. Biol. 356, 266-273 (2006).

45. Smeal, S. W., Schmitt, M. A., Rodrigues Pereira, R., Prasa, A. & Fisk, J. D. Simulation of the M13 life cycle I: Assembly of a genetically-structured deterministic chemical kinetic simulation. Virology 500, 259-274 (2017).

46. Dickinson, B. C., Leconte, A. M., Allen, B., Esvelt, K. M. & Liu, D. R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. Proc. Natl. Acad. Sci. USA 110, 9007-9012 (2013).

47. Roth, T. B., Woolston, B. M., Stephanopoulos, G., & Liu, D. R. (2019). Phage-assisted evolution of Bacillus methanolicus methanol dehydrogenase 2. ACS synthetic biology, 8(4), 796-806.

48. Meng, X., Brodsky, M. H. & Wolfe, S. A. A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nat. Biotech. 23, 988-994 (2005).

49. Meng, X. & Wolfe, S. A. Identifying DNA sequences recognized by a transcription factor using a bacterial one-hybrid system. Nat. Protocol. 1, 30-45 (2006).

50. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., ... & Gocayne, J. D. (2001). The sequence of the human genome. Science, 291(5507), 1304-1351.

51. Mariño-Ramírez, L., Kann, M. G., Shoemaker, B. A., & Landsman, D. (2005). Histone structure and nucleosome stability. Expert review of proteomics, 2(5), 719-729.

52. Nakahata, Y., Sahar, S., Astarita, G., Kaluzova, M., & Sassone-Corsi, P. (2009). Circadian control of the NAD+ salvage pathway by CLOCK-SIRT1. Science, 324(5927), 654-657.

53. Megeney, L. A., Kablar, B., Garrett, K., Anderson, J. E., & Rudnicki, M. A. (1996). MyoD is required for myogenic stem cell function in adult skeletal muscle. Genes & development, 10(10), 1173-1183.


54. Pérez-Moreno, M. A., Locascio, A., Rodrigo, I., Dhondt, G., Portillo, F., Nieto, M. A., & Cano, A. (2001). A new role for E12/E47 in the repression of E-cadherin expression and epithelial-mesenchymal transitions. Journal of Biological Chemistry, 276(29), 27424-27431.

55. Bürglin, T. R., & Affolter, M. (2016). Homeodomain proteins: an update. Chromosoma, 125(3), 497-521.

56. Joshi, R., Passner, J. M., Rohs, R., Jain, R., Sosinsky, A., Crickmore, M. A., ... & Mann, R. S. (2007). Functional specificity of a Hox protein mediated by the recognition of minor groove structure. Cell, 131(3), 530-543.

57. Bondos, S. E.; Swint-Kruse, L.; Matthews, K. S., Flexibility and Disorder in Gene Regulation: LacI/GalR and Hox Proteins. J. Biol. Chem. 2015, 290, 24669-77.

58. Rohs, R., West, S. M., Sosinsky, A., Liu, P., Mann, R. S., & Honig, B. (2009). The role of DNA shape in protein–DNA recognition. Nature, 461(7268), 1248-1253.

59. Zeiske, T., Baburajendran, N., Kaczynska, A., Brasch, J., Palmer III, A. G., Shapiro, L., ... & Mann, R. S. (2018). Intrinsic DNA shape accounts for affinity differences between Hox-cofactor binding sites. Cell reports, 24(9), 2221-2230.

60. Oguey, C., Foloppe, N., & Hartmann, B. (2010). Understanding the sequence-dependence of DNA groove dimensions: implications for DNA interactions. PloS one, 5(12).

61. Heddi, B., Oguey, C., Lavelle, C., Foloppe, N., & Hartmann, B. (2010). Intrinsic flexibility of B-DNA: the experimental TRX scale. Nucleic acids research, 38(3), 1034-1047.

62. Stefan, M. I., & Le Novère, N. (2013). Cooperative binding. PLoS computational biology, 9(6).

63. Lupp, S., Götz, C., Khadouma, S., Horbach, T., Dimova, E. Y., Bohrer, A. M., ... & Montenarh, M. (2014). The upstream stimulatory factor USF1 is regulated by protein kinase CK2 phosphorylation. Cellular signalling, 26(12), 2809-2817.

64. Qian, J., Kaytor, E. N., Towle, H. C., & Olson, L. K. (1999). Upstream stimulatory factor regulates Pdx-1 gene expression in differentiated pancreatic β-cells. Biochemical Journal, 341(2), 315-322.

65. Chang, J. T. C., Yang, H. T., Wang, T. C. V., & Cheng, A. J. (2005). Upstream stimulatory factor (USF) as a transcriptional suppressor of human telomerase reverse transcriptase (hTERT) in oral cancer cells. Molecular Carcinogenesis: Published in cooperation with the University of Texas MD Anderson Cancer Center, 44(3), 183-192.


66. Hering, S., Isken, F., Knabbe, C., Janott, J., Jost, C., Pommer, A., ... & Pfeiffer, A. F. H. (2001). TGFβ1 and TGFβ2 mRNA and protein expression in human bone samples. Experimental and clinical endocrinology & diabetes, 109(04), 217-226.

67. Ferre‐D'Amare, A. R., Pognonec, P., Roeder, R. G., & Burley, S. K. (1994). Structure and function of the b/HLH/Z domain of USF. The EMBO journal, 13(1), 180-189.

68. Erickson, H. S., Albert, P. S., Gillespie, J. W., Rodriguez-Canales, J., Linehan, W. M., Pinto, P. A., ... & Emmert-Buck, M. R. (2009). Quantitative RT-PCR gene expression analysis of laser microdissected tissue samples. Nature protocols, 4(6), 902.

69. Badran, A. H., Guzov, V. M., Huai, Q., Kemp, M. M., Vishwanath, P., Kain, W., ... & Wang, P. (2016). Continuous evolution of Bacillus thuringiensis toxins overcomes insect resistance. Nature, 533(7601), 58-63.

70. Pu, J., Disare, M., & Dickinson, B. C. (2019). Evolution of C‐terminal modification tolerance in full‐length and split T7 RNA Polymerase biosensors. ChemBioChem, 20(12), 1547-1553.

71. Popa, S. C., Inamoto, I., Thuronyi, B. W., & Shin, J. A. (2019). Phage Assisted Continuous Evolution (PACE): A How-to Guide for Directed Evolution. In preparation.

72. Chen, H., Bjerknes, M., Kumar, R. & Jay, E. Determination of the optimal aligned spacing between the Shine–Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucl. Acid. Res. 22, 4953-4957 (1994).

73. Longo, A., Guanga, G. P., & Rose, R. B. (2008). Crystal Structure of E47−NeuroD1/Beta2 bHLH Domain−DNA Complex: Heterodimer Selectivity and DNA Recognition. Biochemistry, 47(1), 218-229.

74. Brownlie, P., Ceska, T. A., Lamers, M., Romier, C., Stier, G., Teo, H., & Suck, D. (1997). The crystal structure of an intact human Max–DNA complex: new insights into mechanisms of transcriptional control. Structure, 5(4), 509-520.

75. Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., & Ferrin, T. E. (2004). UCSF Chimera—a visualization system for exploratory research and analysis. Journal of computational chemistry, 25(13), 1605-1612.

76. Feng, J. A., Johnson, R. C., & Dickerson, R. E. (1994). Hin recombinase bound to DNA: the origin of specificity in major and minor groove interactions. Science, 263(5145), 348-355.

77. DiCara, D., Rapisarda, C., Sutcliffe, J. L., Violette, S. M., Weinreb, P. H., Hart, I. R., ... & Marshall, J. F. (2007). Structure-function analysis of Arg-Gly-Asp helix motifs in αvβ6 integrin ligands. Journal of Biological Chemistry, 282(13), 9657-9665.


78. Monod, J. & Jacob, F. Teleonomic mechanisms in cellular metabolism, growth, and differentiation. Cold Spring Harb. Symp. Quant. Biol. 26, 389–401 (1961).

79. Westfall, P. J.; Pitera, D. J.; Lenihan, J. R.; Eng, D.; Woolard, F. X.; Regentin, R.; Horning, T.; Tsuruta, H.; Melis, D. J.; Owens, A.; Fickes, S.; Diola, D.; Benjamin, K. R.; Keasling, J. D.; Leavell, M. D.; McPhee, D. J.; Renninger, N. S.; Newman, J. D.; Paddon, C. J., Production of amorphadiene in yeast, and its conversion to dihydroartemisinic acid, precursor to the antimalarial agent artemisinin. Proc. Natl. Acad. Sci. USA 2012, 109, E111-18.

80. Huo, Y. X. et al. Conversion of proteins into biofuels by engineering nitrogen flux. Nature Biotech. 29, 346–351 (2011).


2.6 Appendix B – DNA sequences used and additional figures

Appendix Fig. 1. Representative UFW EMSAs for binding to nonspecific (NS) DNA and the 5G E-box. Left: NS DNA; protein concentrations from left to right: 0, 25, 50, 100, 125, 175, 250, 500 nM; Kd: 32.9 nM. Right: 5G E-box; concentrations from left to right: 0, 2, 4, 8, 10, 15, 20, 40 nM; Kd: 6.8 nM. UFW binds NS DNA with a higher affinity than USF1 does, which is a problematic feature. Additionally, UFW is prone to tetramerizing at low concentrations in the presence of E-box-containing DNA, which may pose challenges to its further development as a therapeutic.
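As a rough illustration of what the two Kd values imply, the sketch below evaluates a simple single-site binding isotherm, fraction bound = [P] / (Kd + [P]), over the concentration series listed in the caption. This is an assumed model, not necessarily the fitting procedure used for these EMSAs: it takes all concentrations and Kd values to be in nM, treats protein as being in large excess over labelled DNA, and ignores the tetramerization noted above. The function name `fraction_bound` is hypothetical.

```python
def fraction_bound(protein_nM: float, kd_nM: float) -> float:
    """Fraction of DNA bound under an assumed 1:1 binding isotherm."""
    if protein_nM < 0 or kd_nM <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return protein_nM / (kd_nM + protein_nM)


# Concentration series and Kd values as listed in the caption (assumed nM).
titrations = {
    "NS DNA": (32.9, [0, 25, 50, 100, 125, 175, 250, 500]),
    "5G E-box": (6.8, [0, 2, 4, 8, 10, 15, 20, 40]),
}

for label, (kd, series) in titrations.items():
    # One predicted fraction-bound value per lane, left to right.
    predicted = [round(fraction_bound(c, kd), 2) for c in series]
    print(label, predicted)
```

Under this model the half-saturation point falls at a protein concentration equal to Kd, which gives a quick visual check against the lane where roughly half of the probe is shifted.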

Appendix Fig. 2. B1H testing how a truncated LZ impacts UFW activity. A stop codon was introduced into the UFW ORF such that the LZ was truncated after the first heptad, and this construct was used in the B1H assay (lanes labelled stop). On the left is a 0 mM 3-AT plate and on the right is a 1 mM 3-AT plate. Upon addition of low amounts of 3-AT, the truncated UFW gave no signal, suggesting that truncating the LZ of UFW had a significantly detrimental effect on its DNA binding ability. Controls as described before. Truncated UFW sequence, where * denotes the stop codon: GTDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDCSMESTKSGQSKGGILSKACDYIQELRQSNHRLSEELQGLDQL*AEIEQLEERNYALRKEIEDLQKQLEKLGAPLE
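To make the effect of the stop codon explicit, the following minimal sketch splits the annotated sequence from the caption at the `*` marker into the portion that would still be translated and the portion removed by the truncation. The helper name `split_at_stop` is hypothetical; only the sequence itself comes from the caption.

```python
# Truncated UFW sequence from the caption; '*' marks the introduced stop codon.
UFW_STOP = (
    "GTDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDCSMESTKSGQSKGGILSKACDYIQELRQSNH"
    "RLSEELQGLDQL*AEIEQLEERNYALRKEIEDLQKQLEKLGAPLE"
)


def split_at_stop(seq: str, stop: str = "*") -> tuple[str, str]:
    """Return (translated part, part lost after the first stop marker)."""
    expressed, _, removed = seq.partition(stop)
    return expressed, removed


expressed, removed = split_at_stop(UFW_STOP)
print("translated:", expressed)
print("removed:   ", removed)
```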


Appendix Fig. 3. Attempts to find a suitable spacer length to evolve USF1 and UFW in PACE. The entirety of a previously generated spacer library (spacers ranging from -11 to -25) was investigated in the B1H to find a suitable spacer length to evolve USF1 in PACE. Eventually, the -23 GC E-box was used, as it gave a strong signal and minimal NS activity relative to the other constructs, but ultimately this spacer was also not suited for use in PACE. Samples were plated on a 5 mM 3-AT plate; other plates are not shown. Controls as described before.

Appendix Fig. 4. Amino acid sequences of MEF-derived proteins with different IDRs attached. Shown are the protein sequences of MEFH, MEFGH and MEFU. Residues in red are domains derived from different TFs (Hin recombinase and USF1, respectively), while dashes were inserted in order to maintain the alignment of the protein sequences.

Appendix Fig. 5. B1H for MEFU nonspecific activity. MEFU behaves like MEF in that it does not interact with nonspecific DNA, which suggests that any signal produced in B1Hs containing E-boxes is a result of specific TF-E-box interactions. Samples were plated on a 2.5 mM 3-AT plate. Controls as described before.


Appendix Fig. 6. B1H testing TF binding to symmetric G5G5 E-boxes. Having the E-box flanked by symmetric G5G5 tracts significantly decreased the amount of binding for both MEF and MEFU relative to DNA binding with E-boxes flanked by asymmetric G tracts. Plated on 20 mM 3-AT. The negative control at 10⁻⁵ had positive control plated on it due to a mix-up while plating. Controls as described before; assay performed by Kevin Do.

Appendix Fig. 7. Initial B1H for MEFH. Initial B1H testing MEFH binding to E-boxes flanked by an A4T4 tract. MEF and MEFH produced roughly the same signal, suggesting that the Hin arm is not recognizing the AT-rich flanking sites. Samples plated on a 20 mM 3-AT plate; controls as described before. B1H experiment performed by Montdher Hussain.


Appendix Fig. 8. Addition of the Hin arm alters TF binding preferences. The signal seen from A5A5 and G5G5 E-boxes was the same for both MEF and MEFGH; however, addition of the Hin arm seems to have altered binding preferences away from A5T5 towards A5A5.


Appendix Table 1. DNA sequences used for cloning and B1H, EMSA experiments.

| Name of construct | DNA sequence |
| --- | --- |
| SP-UFW frame shift fwd | TGAAGATCTGCAGAAACAGCTGGAAAAACTGGGCGCGCCG |
| SP-UFW frame shift rev | CGGCGCGCCCAGTTTTTCCAGCTGTTTCTGCAGATCTTCA |
| -23 GG 4G E-box fwd, B1H, PANCE | [Phos] GGCCGCAGAAAGTCTGGACACGTGGGGAGTCAGCCGTGTATCAG |
| -23 GG 4G E-box rev, B1H, PANCE | [Phos] AATTCTGATACACGGCTGACTCCCCACGTGTCCAGACTTTCTGC |
| A4T4 E-box fwd, B1H | GCGGCCGCTGCAGAAAAACACGTGTTTTTGAATTC |
| A4T4 E-box rev, B1H | GAATTCAAAAACACGTGTTTTTCTGCAGCGGCCGC |
| G5G5 E-box fwd, B1H | GGCCGCCTCGGGGGCACGTGGGGGGAGG |
| G5G5 E-box rev, B1H | AATTCCTCCCCCCACGTGCCCCCGAGGC |
| A5T5 E-box fwd, B1H | GGCCGCCTCAAAAACACGTGTTTTTG |
| A5T5 E-box rev, B1H | GGCCGCCTCAAAAACACGTGAAAAAG |
| A5A5 E-box fwd, B1H | GGCCGCCTCAAAAACACGTGAAAAAG |
| A5A5 E-box rev, B1H | AATTCCTTTTTCACGTGTTTTTGAGG |
| -11 spacer with 4G NS for cloning into pH3U3 fwd | 5’ [Phos] GGCCGCCTCAGGGGCACAGAGAGAGTCTGGTTCCAAGGGGAGTCG |
| -11 spacer with 4G NS for cloning into pH3U3 rev | 5’ [Phos] AATTCGACTCCCCTTGGAACCAGACTCTCTCTGTGCCCCTGAGGC |
| -11 spacer with 5G NS for cloning into pH3U3 fwd | 5’ [Phos] GGCCGCCTCAGGGGCACAGAGAGAGTCTGGTTCCAAGGGGGATCG |
| -11 spacer with 5G NS for cloning into pH3U3 rev | 5’ [Phos] AATTCGATCCCCCTTGGAACCAGACTCTCTCTGTGCCCCTGAGGC |
| UFW stop codon fwd, SDM | GGGTTGGATCAGCTGTAGGCGGAAATTGAAC |
| UFW stop codon rev, SDM | GTTCAATTTCCGCCTACAGCTGATCCAACCC |
| MEFGH fwd, SDM | CATTACTAAGCATGGCAAGCTTATGGGCGCG |
| MEFGH rev, SDM | CGCGCCCATAAGCTTGCCATGCTTAGTAATG |

Sequences of DNA oligos ordered from Eurofins Genomics to create the various reporters used in Ch. 2 and to carry out the various SDM reactions to alter the TFs. [Phos] refers to oligos that were modified to have a 5’ phosphate group to allow for ligation into the pH3U3 vector. B1H reporters ordered without the 5’ phosphate were phosphorylated in house to save on costs, but the cloning strategy remained the same as outlined in Ch. 1.
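Paired oligos such as those in Appendix Table 1 can be sanity-checked for complementarity before ordering. The sketch below is one way to do this and is not part of the original workflow. SDM primer pairs are compared over their full length, while reporter inserts are compared after trimming the 5’ overhang on each strand, assumed here to be the 4-nt overhangs (GGCC and AATT) left by NotI and EcoRI. The helper names `reverse_complement` and `anneals` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def anneals(fwd: str, rev: str, fwd_overhang: int = 0, rev_overhang: int = 0) -> bool:
    """True if the two oligos are fully complementary once the given number
    of 5' overhang nucleotides is trimmed from each strand."""
    core_fwd = fwd.upper()[fwd_overhang:]
    core_rev = rev.upper()[rev_overhang:]
    return core_fwd == reverse_complement(core_rev)


# SDM primer pair from the table: blunt, fully overlapping.
stop_fwd = "GGGTTGGATCAGCTGTAGGCGGAAATTGAAC"
stop_rev = "GTTCAATTTCCGCCTACAGCTGATCCAACCC"
print("UFW stop codon pair:", anneals(stop_fwd, stop_rev))

# Reporter insert from the table: 4-nt sticky ends on both strands (assumed).
g5g5_fwd = "GGCCGCCTCGGGGGCACGTGGGGGGAGG"
g5g5_rev = "AATTCCTCCCCCCACGTGCCCCCGAGGC"
print("G5G5 E-box pair:", anneals(g5g5_fwd, g5g5_rev, fwd_overhang=4, rev_overhang=4))
```

Running the same check over every pair in a table is a cheap way to catch copy-and-paste slips in oligo order sheets.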


Appendix Table 2. Reporter sequences used for PANCE.

| Name of construct | DNA sequence |
| --- | --- |
| AP RBS A -11 5G E-box | ccgcctcaggggcacagagagagtctggacacgtgggggatcgaattcTTTACA |
| AP RBS A -11 4G E-box | ccgcctcaggggcacagagagagtctggacacgtggggaatcgaattcTTTACA |
| AP RBS A -23 4G E-box | GCCGCAGAAAGTCTGGACACGTGGGGAGTCAGCCGTGTATCAGaattcTTTACA |
| AP RBS A -7 GC E-box | GCCACGTGCGggccatcgaattcTTTACA |
| AP RBS A -13 4G E-box | GCCGCAGAAAGTCTGGACACGTGGGGAGTCAGGaattcTTTACA |

Sequence in yellow corresponds to the -35 region that the omega subunit of RNA polymerase interacts with to mediate transcription of the downstream gIII gene. Sequence in green is the EcoRI restriction site. Sequence in red corresponds to the nucleotides 2 bp from the core E-box motif that are used to name the reporter. Sequence in bold is the core E-box motif. The same cloning strategies used for cloning into the pH3U3 reporter were used for cloning into the AP reporter (i.e. digesting with EcoRI and NotI). RBS A refers to the fact that the Shine-Dalgarno sequence on the resulting mRNA transcript is the wildtype sequence; variants of this sequence have been made to decrease the affinity of the ribosome for this sequence and so increase the stringency of our selection circuit (corresponds to parameter 5 of Fig. 8).
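Because the reporter names encode where the E-box sits relative to the promoter, it can help to locate the annotated elements programmatically. The sketch below finds the core E-box (CACGTG) and the -35 hexamer (TTTACA) in a reporter sequence and counts the nucleotides between them. This is an illustrative helper only: the name `locate_elements` is hypothetical, and the counted gap is not assumed to equal the spacer number in the construct name, since the numbering convention is defined elsewhere in the thesis.

```python
def locate_elements(seq: str, ebox: str = "CACGTG", minus35: str = "TTTACA"):
    """Return (E-box start, -35 start, nucleotides between the two elements).

    Positions are 0-based. The first E-box and the last -35 match are used.
    """
    s = seq.upper()
    ebox_start = s.find(ebox)
    m35_start = s.rfind(minus35)
    if ebox_start < 0 or m35_start < 0:
        raise ValueError("sequence lacks the E-box or the -35 hexamer")
    gap = m35_start - (ebox_start + len(ebox))
    return ebox_start, m35_start, gap


# Reporter sequence copied from the table (AP RBS A -11 5G E-box).
reporter = "ccgcctcaggggcacagagagagtctggacacgtgggggatcgaattcTTTACA"
ebox_start, m35_start, gap = locate_elements(reporter)
print(f"E-box at {ebox_start}, -35 at {m35_start}, {gap} nt in between")
```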
