Directing an artificial zinc finger protein to new targets by fusion to a non-DNA-binding domain

Wooi Fang (Catheryn) Lim

A thesis in fulfilment of the requirements for the degree of

Doctor of Philosophy

School of Biotechnology and Biomolecular Sciences

Faculty of Science

March 2016

THESIS/ DISSERTATION SHEET

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

WOOI FANG LIM Signed ……………………………………………......

31-03-2016 Date ……………………………………………......

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Signed ……………………………………………......

Date ……………………………………………......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………......

Date ……………………………………………......

Table of Contents

ORIGINALITY STATEMENT
Table of Contents
Acknowledgement
Publications arising from this thesis
Abstract
Abbreviations
Chapter 1 General Introduction
  1.1 Transcriptional regulation
    1.1.1 DNA binding proteins
    1.1.2 Modularity of sequence-specific DNA binding proteins
    1.1.3 C2H2 zinc finger containing DNA binding proteins
    1.1.4 Target recognition by sequence-specific transcription factors
    1.1.5 KLF family of zinc finger transcription factors
  1.2 Artificial Zinc Finger DNA binding proteins
    1.2.1 Application of artificial zinc finger DNA binding proteins
    1.2.2 Construction of artificial zinc finger DNA binding proteins
    1.2.3 Artificial zinc finger proteins targeting VEGF-A promoter
    1.2.4 Specificity of artificial zinc finger DNA binding proteins
  1.3 Aims
Chapter 2 Materials and Methods
  2.1 Materials
    2.1.1 Reagents and kits
    2.1.2 Cell lines
    2.1.3 Oligonucleotides
    2.1.4 Vectors
  2.2 Laboratory methods
    2.2.1 General methods
    2.2.2 Cell culture
    2.2.3 Generation of retroviral and expression vectors
    2.2.4 Transient transfection for protein production
    2.2.5 Retroviral transduction to generate stable cell lines
    2.2.6 Nuclear extracts
    2.2.7 SDS-PAGE and Western blot
    2.2.8 Electrophoretic mobility shift assay (EMSA)
    2.2.9 RNA extraction and cDNA synthesis
    2.2.10 Real time PCR
    2.2.11 Chromatin immunoprecipitation (ChIP)
    2.2.12 DNA library preparation and next generation sequencing
  2.3 Bioinformatics methods
    2.3.1 Quality trimming
    2.3.2 Alignment
    2.3.3 Peak calling and IDR analysis
    2.3.4 Quantification of ChIP tags
    2.3.5 Differential binding analysis
    2.3.6 Genomic annotation and visualization
    2.3.7 De novo motif analysis
    2.3.8 Motif scanning
    2.3.9 ENCODE data set and data accession
    2.3.10 Common promoter binding events
    2.3.11 Statistical test
Chapter 3 in vivo DNA binding specificity of a three zinc finger artificial DNA binding protein
  3.1 Introduction
  3.2 Experimental design and construct validation
  3.3 ChIP-Seq and Bioinformatics workflow
  3.4 AZF genomic occupancy
    3.4.1 AZF binds to a large number of sites within the genome
    3.4.2 AZF occupancy is enriched at DNase hypersensitive sites
    3.4.3 AZF binds predominantly to sites containing the target sequence
  3.5 Discussion
Chapter 4 Regions outside of DNA binding domain of the zinc finger KLF3 are involved in in vivo DNA binding specificity
  4.1 Introduction
  4.2 Experimental design and construct validation
  4.3 ChIP-Seq and bioinformatics workflow
  4.4 KLF3FD-AZF shows increased DNA occupancy across the genome
  4.5 Both AZF and KLF3FD-AZF bind the VEGF-A recognition site in vivo
  4.6 Differential binding analysis
    4.6.1 KLF3FD-AZF peaks are abundant in promoter regions
    4.6.2 Peaks preferentially bound by KLF3FD-AZF contain an imperfect AZF DNA binding site
    4.6.3 KLF3 FD, in the absence of a DBD, is capable of chromatin binding in vivo
    4.6.4 KLF3FD-AZF generates peaks at known KLF3 target sites
  4.7 Discussion
Chapter 5 General Discussion and Conclusions
  5.1 General Discussion and Conclusions
  5.2 Future directions
References
Appendix

Acknowledgement

I would like to thank everyone below for their unconditional support throughout the three and a half years of my Ph.D. candidature. Without them, I wouldn’t have come this far and achieved this much.

All past and current members of Crossley Lab and Dawes Lab, especially my supervisors, Prof. Merlin Crossley, Dr. Kate Quinlan, Dr. Richard Pearson, Emeritus Prof. Ian Dawes and Dr. Joyce Chiu.

Ph.D. review committee: Prof. Andrew Brown and Prof. Marc Wilkins.

Friends that are always there for me - Michelle, Samantha, Clair, ShinDee, Juri, Juli, Jiawei, Dennis, Wen Jun, Xiu Yi, Ivy …

Students that I have mentored during my Ph.D. candidature - Thanks for being so considerate and tolerant of my busy schedule.

My precious family members - Mom and dad, my brothers Chin Onn and Chin Way, aunts and uncles and grandparents. I am very sorry for my absence from many of our family events.

I am very fortunate to have all of you with me throughout this very difficult yet exciting journey. I truly appreciate everything you all have done for me to make me a better scientist and a better person.

Catheryn Lim

Publications arising from this thesis

Journal articles

Burdach, J., Funnell, A.P., Mak, K.S., Artuz, C.M., Wienert, B., Lim, W.F., Tan, L.Y., Pearson, R.C. and Crossley, M. (2014) Regions outside the DNA-binding domain are critical for proper in vivo specificity of an archetypal zinc finger transcription factor. Nucleic Acids Res, 42, 276-289.

Lim, W.F., Burdach, J., Funnell, A.P., Pearson, R.C., Quinlan, K.G. and Crossley, M. (2016) Directing an artificial zinc finger protein to new targets by fusion to a non-DNA-binding domain. Nucleic Acids Res, 44, 3118-3130. (This is the original place of publication for a majority of the data presented in this thesis.)

Selected conference abstracts

Selected oral presentations:

Lim, W. F., Burdach, J., Funnell, A. P., Pearson, R. C., Quinlan, K. & Crossley, M. (2015). How do transcription factors find their target genes? Lorne Genome Conference. Australia.

Selected poster presentations:

Lim, W. F., Funnell, A. P., Pearson, R. C., & Crossley, M. (2013). Engineering next generation artificial transcription factors for improved chromatin access. Combio Perth. Australia.

Lim, W. F., Burdach, J., Funnell, A. P., Pearson, R. C., Quinlan, K. & Crossley, M. (2014). Regions outside of the DNA binding domain are important for in vivo DNA binding specificity of a zinc finger transcription factor. Combio Canberra. Australia.

Lim, W. F., Burdach, J., Funnell, A. P., Pearson, R. C., Quinlan, K. & Crossley, M. (2015). How do transcription factors find their target genes in vivo? EMBO conference nuclear structure and dynamics, L’Isle-sur-la-Sorgue, France.

Abstract

Transcription factors are often regarded as having two separable components: a DNA binding domain (DBD) and a functional domain (FD), with the DBD thought to determine target recognition. While this holds true for DNA binding in vitro, it appears that in vivo FDs can also influence genomic targeting. In the current study, we fused the FD from the well-characterised transcription factor Krüppel-like Factor 3 (KLF3) to an artificial zinc finger (AZF) protein originally designed to target the Vascular Endothelial Growth Factor-A (VEGF-A) gene promoter. ChIP-Seq (chromatin immunoprecipitation followed by high-throughput DNA sequencing) was then performed to identify DNA binding sites across the genome for the AZF alone, previously reported to bind robustly in an in vitro setting, and for the fusion protein KLF3FD-AZF. As predicted, AZF binds to the VEGF-A promoter in vivo, but we also found AZF binding to approximately 25,000 other sites, a large number of which contained the expected AZF recognition sequence, GCTGGGGGC. We then compared genome-wide occupancy of the KLF3FD-AZF fusion to that observed with AZF. Interestingly, addition of the KLF3 FD re-distributes the fusion protein to new sites, with DNA occupancy detected at around 50,000 sites. A portion of these sites correspond to known KLF3 endogenous targets, whilst others contained sequences similar but not identical to the expected AZF recognition sequence. These results show that FDs can influence, and may be useful in directing, artificial zinc finger DNA binding proteins to specific targets and provide insights into how natural transcription factors operate.

Reviewer links to deposited data (not yet public)

ChIP-Seq data, NCBI GEO# GSE69739
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=cfmhiceclfwddip&acc=GSE69739

Abbreviations

18S rRNA      18S ribosomal RNA
ADR1          Alcohol dehydrogenase II synthesis regulator
ARID3B        AT Rich Interactive Domain 3B
AZF           Artificial zinc finger protein used in the current thesis
Bp            Base pair
C2H2          Cysteine-2/Histidine-2
CAP           Catabolite Activator Protein
cDNA          Complementary DNA
CDS           Coding DNA sequence
CENTRIMO      Central motif enrichment analysis
ChIP          Chromatin immunoprecipitation
ChIP-PCR      Chromatin immunoprecipitation followed by polymerase chain reaction
ChIP-Seq      Chromatin immunoprecipitation followed by high throughput sequencing
COS           African green monkey kidney fibroblast-like cell line
CPP           Cell-penetrating peptide
CRISPR/Cas9   Clustered regularly interspaced short palindromic repeats / CRISPR associated protein 9
CRISPR/dCas9  Clustered regularly interspaced short palindromic repeats / dead CRISPR associated protein 9 (catalytically inactive)
CtBP          C-terminal binding protein
DBD           DNA binding domain
DMEM          Dulbecco’s modified Eagle medium
DNA           Deoxyribonucleic acid
DNase-seq     DNase I hypersensitive site sequencing
DREME         Discriminative Regular Expression Motif Elicitation
ECL           Electrochemiluminescence
EDTA          Ethylenediaminetetraacetic acid
EF-1α         Elongation factor 1 alpha
EMSA          Electrophoretic mobility shift assay
ENCODE        Encyclopedia of DNA elements
ETS           E-twenty six
FBS           Fetal bovine serum
FC            Fold change
FD            Functional domain
FDR           False-discovery rate
FIMO          Find Individual Motif Occurrences
FOG-1         Friend of GATA-1
GAL4          Galactose-responsive transcription factor
GATA-1        GATA binding factor 1
GC-rich       Guanine-cytosine rich
HCOP          HGNC Comparison of Orthology Predictions
HEK293        Human Embryonic Kidney 293 cells
HepG2         Liver hepatocellular cells
HIV-1         Human immunodeficiency virus 1
HOMER         Hypergeometric Optimization of Motif EnRichment
HOMER-IDR     Combined Hypergeometric Optimization of Motif EnRichment and Irreproducible Discovery Rate
HRP           Horseradish peroxidase
HS            Hypersensitive
IDR           Irreproducible Discovery Rate
IgG           Immunoglobulin type G
IGV           Integrative Genomics Viewer
IN            Input
IP            Immunoprecipitant
IRES          Internal ribosomal entry site
KAP1          KRAB-associated protein-1
KLF           Krüppel-like factor
KLF3 FD       Functional domain of Krüppel-like factor 3 (aligned to amino acids 1-262 of the full length protein)
KLF3          Krüppel-like factor 3
KLF3FD-AZF    Fusion protein of functional domain of Krüppel-like factor 3 and an artificial zinc finger DNA binding domain targeting VEGF-A promoter
KRAB          Krüppel-associated box
MEFs          Murine embryonic fibroblast cells
MEME          Multiple EM for motif elicitation
MSCV          Murine stem cell virus
NFY           Nuclear transcription factor Y
NGS           Next generation sequencing
NLS           Nuclear localisation signal
NRF1          Nuclear respiratory factor-1
PAM           Protospacer adjacent motif
PAX5          Paired Box 5 protein
PBS           Phosphate buffered saline
PCA           Principal component analysis
PCR           Polymerase chain reaction
Phoenix A     Phoenix amphotropic packaging cells
PSG           Penicillin Streptomycin Glutamine
PWM           Position weight matrix
RNA           Ribonucleic acid
RNAP II       RNA-polymerase II
RSAT          Regulatory Sequence Analysis Tools
RT            Room temperature
RT-PCR        Real time polymerase chain reaction
RUNX1         Runt-related transcription factor 1
SDS           Sodium dodecyl sulfate
SDS-PAGE      Sodium dodecyl sulfate polyacrylamide gel electrophoresis
SETDB1        SET Domain, Bifurcated 1
SH3GL1        SH3-domain GRB2-like 1
Sin3A         SIN3 Homolog A
SOX           Sry-related HMG box
SP            Specificity Protein
SRF           Serum response factor
SV40          Simian virus 40
SV5           Simian virus 5
TAL1          T-cell acute lymphocytic leukemia protein 1
TALE          Transcription activator-like effector
TAT           Peptide (GRKKRRQRRRPQ) derived from the transactivator of transcription (TAT) of human immunodeficiency virus
TBST          Tris-Buffered Saline, tween-20
TF            Transcription factor
TGEKP         Thr-Gly-Glu-Lys-Pro
TSS           Transcription start site
TTS           Transcription termination site
UTR           Untranslated region
VEGF-A        Vascular endothelial growth factor A
VP16          Virus encoded protein 16
VP64          Tetrameric VP16
WT            Wild-type
ZF            Zinc finger
ZIF268        Zinc finger 268

Chapter 1 General Introduction

1.1 Transcriptional regulation

The key concepts of transcriptional control were first established in bacterial systems half a century ago by Jacob and Monod (1). Since then, extensive studies have been carried out to understand transcription regulatory systems in eukaryotes. These systems are more complex and involve diverse arrays of proteins such as general transcription factors, sequence-specific DNA binding transcription factors, co-factors, and chromatin remodelling proteins (2-7). An effective transcriptional regulatory system is crucial to ensure proper and precise control of gene expression in different cell states or cellular systems. Various biological processes, including cell cycle progression, cellular differentiation and development, and maintenance of intracellular metabolic and physiological balance, involve transcriptional regulation. Aberrant regulation has been associated with numerous diseases and syndromes including cancer, autoimmunity, neurological disorders, diabetes, cardiovascular disease, and obesity, which can be caused by mutations in regulatory DNA sequences and in the transcription factors, cofactors, chromatin regulators, and noncoding RNAs that interact with these regions (8). In addition, a third of human developmental disorders have been attributed to dysfunctional transcription factors (9).

Essentially, transcriptional regulation involves sequence-specific DNA binding proteins, or transcription factors, recognising and interacting with promoter or enhancer elements. Typically, multiple transcription factors cooperatively bind the regulatory elements of a gene, leading to DNA looping between the enhancers and the core promoter, recruitment of co-factors and general transcription factors, and ultimately a change in the expression of that gene. Transcription factors also recruit an array of chromatin-modifying enzymes that acetylate, methylate or ubiquitinylate nucleosomes, as well as remodelling complexes that alter nucleosome architecture (8,10,11).

By extracting and integrating information from the InterPro database, the International Protein Index (IPI) database, the Ensembl Genome Browser database and functional studies, Vaquerizas and colleagues identified 1,391 genomic loci that encode sequence-specific transcription factors, approximately 6% of the total number of protein-coding genes in the human genome. Further classification of these proteins based on the structure of their DNA binding domains (DBDs) revealed three types of transcription factors that dominate the human repertoire, together accounting for over 80% of it. These are the C2H2 zinc finger, homeodomain and helix-loop-helix transcription factors, shown in Figure 1.1 (9). In the next section, we will review DNA binding proteins, focussing on the sequence-specific DNA binding proteins, which largely comprise transcription factors.

Figure 1.1 Estimates of the abundance of transcription factors in humans based on the structure of their DBDs. Transcription factors were classified into families according to their DBD composition. Taken from Vaquerizas et al. 2009 (9).

1.1.1 DNA binding proteins

DNA binding proteins contain DBDs that enable interaction with DNA with either general or sequence-specific affinity. DNA binding proteins play crucial roles in all aspects of genetic control within an organism, including participating in the regulation of transcription, replication, packaging, rearrangement and repair. Among the DNA binding proteins first identified were the bacterial regulatory proteins, e.g. the lac repressor, lambda repressor and CAP (12). Since then, many more have been identified, and it has been reported that 2-3% of prokaryotic and 6-7% of eukaryotic coding genes encode DNA binding proteins (13).

There are DNA binding proteins that recognise specific DNA sequences, such as transcription factors, nucleases and restriction enzymes. There are also DNA binding proteins that interact with DNA with minimal sequence specificity, including histone octamers and non-histone chromosomal proteins that organise DNA into higher order structures, and enzymes such as polymerases and DNase I. In addition, there are proteins that bind single stranded DNA, termed single stranded DNA binding proteins (SSBs), which are vital in genome maintenance during biological processes such as DNA replication, recombination and repair, where double stranded DNA is transiently unwound (14). Next, we will review sequence-specific DNA binding proteins that are important in transcriptional regulation and discuss how these proteins recognise and interact with DNA and thus participate in gene regulation.

1.1.2 Modularity of sequence-specific DNA binding proteins

Advances in techniques such as X-ray crystallography and in silico approaches for structural prediction over the past decades have led to the elucidation of various protein and protein-nucleic acid structures. These findings have provided valuable insights into the stereochemical principles of protein-DNA interactions, including how particular base sequences are recognised by DNA binding proteins. Sequence-specific DNA binding proteins, or transcription factors, have been regarded as having two separable domains: a DNA binding domain (DBD) that directs the transcription factor to its target gene in a sequence-specific manner, and a functional domain (FD) that influences gene expression via recruitment of other accessory factors and RNA polymerase. Under this model, the DNA binding domain is the sole determinant of target specificity. The modular nature of transcription factors was first described by Brent and Ptashne in 1985 via domain swapping experiments in which the DNA binding domain of the Escherichia coli repressor protein LexA was fused to the activation domain of the Saccharomyces cerevisiae transcriptional activator Gal4, resulting in transcriptional activation dependent on the presence of a LexA binding site near the transcription start site (15). This was later confirmed in other domain switching experiments, and the flexibility in the positioning of these domains within the fusion proteins suggested that the domains are indeed functionally independent and separable (16). The modularity and independence of these domains is exemplified by the development of the yeast two-hybrid system (17), a technique that has been widely used to study protein-protein interactions. More recently, this information has been used to generate sequence-specific nucleases and transcription factors by fusing a functional domain to a designer DNA binding domain (18,19). In the next section, we will review one of the most abundant types of DNA binding proteins, the C2H2 type zinc finger DNA binding proteins, and will discuss how these proteins interact with DNA.

1.1.3 C2H2 zinc finger containing DNA binding proteins

Luscombe et al. classified DNA binding proteins into eight distinct groups based on the structure of their DBDs, notably the helix-turn-helix, zinc-coordinating and zipper-type groups (13). Many of these proteins function as dimers and they mostly bind in the DNA major groove. Of the eight groups, the zinc-coordinating proteins and, more specifically, one of their subgroups, the C2H2 type zinc finger containing DNA binding proteins, are among the most abundant. C2H2 zinc finger DNA binding proteins make up 3% of human coding genes and are relatively well studied. C2H2 zinc finger proteins were the founding members of the zinc-coordinating family of DNA binding proteins (20,21). They contain one or more C2H2 zinc finger motifs in their DBD, a DNA binding motif that folds in the presence of a zinc ion to form a compact ββα domain. Each finger is a self-contained domain stabilised by a zinc ion tetrahedrally coordinated by two cysteines on the beta sheet and two histidines on the recognition alpha helix, and by a hydrophobic core consisting of several aromatic amino acids (Figure 1.2A). There are approximately 30 amino acids in each finger and the fingers are linked in tandem to recognise nucleic acid sequences of different lengths, with each finger typically recognising 3-4 bases (22-24).

The C2H2 type zinc finger motif was first discovered in the Xenopus laevis protein TFIIIA and has since been found in many transcription regulatory proteins and in other DNA binding proteins (25). These zinc finger proteins are widespread in nature, occur in many different types of organisms, and contain from one to more than 30 fingers. There are triple-C2H2 (containing three C2H2 zinc fingers), multiple-adjacent-C2H2 and separated-paired-C2H2 zinc finger proteins. In contrast to triple-C2H2 proteins, the latter two types may not utilise all of their fingers for interaction with DNA (26). Table 1.1 shows the abundance of C2H2 zinc finger proteins in eight different organisms; these proteins constitute 0.9-3.0% of the genes across the different genomes. They have been implicated in myriad cellular processes including replication and repair, transcription and translation, metabolism and signalling, and cell proliferation and apoptosis (27). In addition to interacting with DNA, zinc finger proteins can also act as protein recognition or RNA-binding modules, or are capable of dual DNA and RNA binding (28-32).

The precise mode of interaction between zinc fingers and DNA was uncovered when Pavletich and Pabo solved the crystal structure of the Zif268-DNA complex (Figure 1.2B) (33). This revealed that the primary contacts are made by the alpha helix, which binds DNA in the major groove through specific one-to-one interactions between the amino acids at helical positions -1, 3 and 6 and three successive DNA bases on one strand of the DNA. Later, a secondary interaction between the amino acid at position 2 and a DNA base on the opposite strand was reported by another group (34). Other residues, including those responsible for phosphate backbone contacts, the TGEKP linkers and other amino acid residues not directly involved in DNA recognition, are important for the precise docking arrangement and for stabilising the zinc finger-DNA interaction (23,35,36), but are not thought to make base-specific contacts.

C2H2 proteins have been shown to bind their DNA targets with high affinity and specificity. Zinc finger proteins such as Zif268 and SP1 bind their preferred sequences with dissociation constants between 10⁻⁸ M and 10⁻¹¹ M (depending on buffer conditions and assay methods) (36-39). These proteins also show good specificity for their target sites, as demonstrated by competition with non-specific DNA sequences and by transient transfection assays. Compilation of data from structural studies, mutagenesis, statistical analysis of sequences and design studies over the past decades has led to a general recognition code, a chart showing one-to-one interactions between specific positions on the helix and specific base pairs in the finger recognition site (33,34,40-42). This recognition code has facilitated the design of artificial zinc finger proteins with tailor-made DNA binding specificities (see Section 1.2).
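The one-finger-per-triplet logic that underlies the recognition code can be sketched in a few lines of Python. This is only an illustration, not a design tool: the function name `split_into_triplets` is hypothetical, and the sketch assumes the antiparallel arrangement seen in the Zif268-DNA structure, in which the N-terminal finger (finger 1) contacts the 3'-most triplet of the recognised strand. The example site is the AZF recognition sequence quoted in the Abstract.

```python
def split_into_triplets(site: str) -> list[tuple[int, str]]:
    """Assign each 3-bp subsite of a target site (written 5'->3') to a finger.

    Assumption (not taken from the thesis): fingers bind antiparallel to the
    recognised strand, so finger 1 (N-terminal) contacts the 3'-most triplet.
    """
    site = site.upper()
    if len(site) % 3 != 0 or set(site) - set("ACGT"):
        raise ValueError("site must contain only A/C/G/T and have a length divisible by 3")
    triplets = [site[i:i + 3] for i in range(0, len(site), 3)]
    n_fingers = len(triplets)
    # The 5'-most triplet is assigned to the last finger, the 3'-most to finger 1.
    return [(n_fingers - i, triplet) for i, triplet in enumerate(triplets)]


if __name__ == "__main__":
    for finger, triplet in split_into_triplets("GCTGGGGGC"):
        print(f"finger {finger}: 5'-{triplet}-3'")
```

A real design would additionally choose, for each triplet, helix residues at positions -1, 3 and 6 from a recognition-code chart; that lookup is deliberately left out here because it depends on the specific chart used.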

Figure 1.2 C2H2 zinc finger motif. (A) Structure of a zinc finger from a two-dimensional NMR study of a two-finger peptide in solution, taken from Klug 2010 (18). The ribbon represents the carbon-nitrogen backbone of the amino acid chain, showing an anti-parallel β sheet containing two cysteines and an α helix containing two histidines, which together coordinate a zinc ion. Also shown are amino acids that form the hydrophobic core. Together, these hold and stabilise the finger-like structure. (B) Zif268-DNA complex. A three zinc finger regulatory protein, Zif268, is shown binding to the major groove of DNA. On the right are the one-to-one interactions between recognition amino acids and the individual DNA bases. Amino acid numbering is relative to their position in the α helix. Adapted and modified from Rhodes & Klug 1993 (21) and Wolfe et al. 1999 (23).

Table 1.1 Number of C2H2 genes in the genomes of various organisms (adapted from Klug 2010 (18)).

Organism                  Total number of genes   C2H2 genes
Human                     23,299                  709 (3.0%)
Mouse                     24,948                  573 (2.3%)
Rat                       21,276                  466 (2.2%)
Zebrafish                 20,062                  344 (1.7%)
Drosophila                13,525                  298 (2.2%)
Anopheles                 14,653                  296 (2.0%)
Caenorhabditis elegans    19,564                  173 (0.88%)
Caenorhabditis briggsae   11,884                  115 (0.9%)

1.1.4 Target recognition by sequence-specific transcription factors

It is widely accepted that the target specificity of sequence-specific transcription factors lies within the DBD, with specific amino acids in the DBD making direct interactions with bases in the target DNA sequences. Nevertheless, the fact that the two domains can work autonomously does not imply that in vivo the domains of most or all transcription factors have independent and distinct functions. There is increasing evidence that natural transcription factors localise to their target genes via the combined functions of both their DBDs and FDs (43-46). GATA1, for example, has been shown to bind different sets of target genes depending on the presence of its co-factor, FOG1, which was previously thought merely to be involved in the recruitment of co-activators or co-repressors (44,45). Furthermore, there is also evidence that the DBDs of some transcription factors are dispensable for binding to a subset of their target genes (47-49). One of these studies, involving the SP family of zinc finger transcription factors, proposed a mechanism involving interaction between regions outside of the DBD and a DNA-binding co-factor, NF-Y, to explain recruitment of a DBD-deficient SP2 mutant to chromatin (48). Given the size of the human genome (3 billion base pairs) and the fact that most transcription factors recognise relatively short consensus sequences that are present many times across the genome (50,51), it is not surprising that factors other than the DNA sequence preference specified by the DBD may be involved in the target specificity of transcription factors.
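The scale of this problem can be illustrated with a back-of-the-envelope calculation. The sketch below is not an analysis from this thesis: it assumes a genome of 3 billion base pairs in which the four bases are equally frequent and independent, which real genomes are not, and the function name `expected_occurrences` is hypothetical.

```python
def expected_occurrences(motif_length: int,
                         genome_size: float = 3e9,
                         both_strands: bool = True) -> float:
    """Expected number of exact matches to one fixed motif in a random genome.

    Assumptions: each base is A, C, G or T with probability 0.25, positions
    are independent, and the motif is not its own reverse complement (so the
    two strands contribute separately).
    """
    probability_per_position = 0.25 ** motif_length
    strands = 2 if both_strands else 1
    return genome_size * strands * probability_per_position


if __name__ == "__main__":
    for length in (4, 6, 9, 18):
        print(f"{length:>2} bp motif: ~{expected_occurrences(length):,.0f} expected sites")
```

Under these simplifying assumptions a nine base pair site, the length recognised by three zinc fingers, is expected by chance alone roughly 23,000 times, whereas an 18 base pair site is expected less than once. The estimate ignores base composition, repeats and chromatin accessibility, so it indicates only an order of magnitude.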

Interestingly, there are families of transcription factors, such as the ETS, SOX and SP/KLF families, whose members have DBDs with very similar biochemical properties. Despite their highly homologous DNA binding domains, each member performs different functions in vivo (51-55). A study conducted by Wei and colleagues assessed the DNA binding specificities of the human and mouse ETS families of transcription factors, consisting of 27 and 26 members, respectively (56). ETS factors share a highly conserved winged helix-turn-helix DBD and recognise a consensus DNA sequence, 5’-GGA(A/T)-3’, yet are known to have diverse functions and activities in physiology and oncogenesis. In this study, the DNA sequence preferences of these factors were analysed in vitro via high-throughput microwell-based transcription factor DNA binding specificity assays and protein-binding microarrays, and in vivo via ChIP-Seq (chromatin immunoprecipitation followed by high throughput sequencing). The authors revealed that while all the family members bind similar DNA sequences, attributable to their near-identical DBDs, some differences were observed between the DNA motifs bound by the different factors, both in vitro and in vivo, resulting in four distinct classes of ETS factors. The main differences between the DNA recognition motifs within the family are at the core +4 position and the 5’ flanking base pairs, resulting from amino acid divergence at specific DNA-interacting residues within the DBDs of the proteins. Interestingly, when the in vitro and in vivo binding motifs were compared, the in vivo motifs were in general more degenerate, and the authors speculated that this was partly due to the high number of GGAA repeats at ETS binding sites in the human genome. It is also possible that the complex chromosomal landscape existing only in the in vivo setting, or the presence of other DNA binding proteins or chromatin modifying enzymes in vivo, may have contributed to the variation. The in vivo studies also found that while there are overlaps in the genomic occupancy of different ETS factors, even between the different classes, most of the binding sites were specific for a given factor, underscoring the differential specificity of the ETS family members. It thus seems that DNA sequence preference alone is insufficient to explain the site specificity of the ETS proteins, and that other factors, such as co-operative binding of ETS factors with different transcription factors via other, more divergent, non-DNA binding portions of the proteins, may play a role. This is plausible because ETS factors are known to have a diverse range of binding partners; for example, ELK1 and ELK4 bind DNA in cooperation with SRF (57,58) and ETS1 binds composite sites with PAX5 and RUNX1 (59-61).
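To make concrete how little information a four base pair core such as 5’-GGA(A/T)-3’ carries, the following minimal Python sketch scans both strands of a sequence for the core. It is an illustration only and not the motif-scanning method used in this thesis; the function names and the example sequence are hypothetical.

```python
import re

ETS_CORE = re.compile(r"GGA[AT]")          # 5'-GGA(A/T)-3' consensus core
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_core_sites(seq: str) -> list[tuple[int, str, str]]:
    """Return (start, strand, match) for every core match on either strand.

    'start' is the 0-based position of the leftmost base of the site in
    plus-strand coordinates, regardless of which strand carries the match.
    """
    seq = seq.upper()
    hits = []
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for match in ETS_CORE.finditer(text):
            start = match.start()
            if strand == "-":
                # Convert a minus-strand offset back to plus-strand coordinates.
                start = len(seq) - match.end()
            hits.append((start, strand, match.group(0)))
    return sorted(hits)


if __name__ == "__main__":
    example = "ACCGGAAGTTTCCT"  # hypothetical sequence, for illustration only
    for start, strand, site in find_core_sites(example):
        print(start, strand, site)
```

Even this 14 base pair example contains one match on each strand, which is consistent with the argument above that sequence preference for a short core cannot by itself account for factor-specific occupancy.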

This is further illustrated by the SOX proteins, a family of sequence-specific transcription factors with characteristic DNA binding high-mobility-group (HMG) domains that target a common consensus motif in DNA. Based on evidence from studies on the individual proteins, it has been proposed that target specificity among the different SOX proteins could be further achieved through differential affinity or preference for particular flanking and/or central heptameric sequences, homo- or heterodimerisation among SOX proteins (62-64), post-translational modifications of SOX factors (65), or interaction with other cofactors or transcription factors to bind composite elements (66-68).

An interesting study on the SP family of zinc finger transcription factors demonstrated differential in vivo binding site selection among three members of this family. While SP1 and SP3 occupy essentially the same promoters containing a GC-rich consensus, SP2 primarily localises to CCAAT motifs via interaction with a binding partner, the CCAAT box-binding transcription factor NF-Y, in a manner independent of its zinc finger DBD (48).

Taken together, while it is certain that DBDs are crucial for DNA sequence recognition by sequence-specific transcription factors, as demonstrated by mutagenesis studies (69,70), the short degenerate motifs recognised by most DBDs may alone be insufficient to specify in vivo target recognition, and recent evidence implies that additional parts of the protein may play a role. In the case of families of transcription factors sharing a consensus binding motif, one plausible explanation for divergent target gene regulation is binding to composite elements on the DNA via interaction with different protein partners, possibly through regions outside of the DBDs. In the next section, we will discuss one of the well-studied families of zinc finger transcription factors, the Krüppel-like factor (KLF) family.

1.1.5 KLF family of zinc finger transcription factors

Transcription factors are classified into super-families based on their DNA binding domains (DBDs) and sequence homology. Members within a super-family exhibit high sequence homology, especially at the DBD, and often recognise or target similar DNA sequences. One well-studied family of zinc finger transcription factors is the Krüppel-like factor (KLF) family (71,72). The founding member, KLF1, was cloned in the early 1990s (73), and since then a total of 17 members have been discovered, which according to current nomenclature are referred to as KLF1-KLF17 (52). The key feature of this family is the presence of three highly conserved classical C2H2 zinc fingers, with each zinc finger recognising three base pairs of DNA, so that the domain interacts with nine base pairs in total. These fingers are located at the carboxyl terminus of the proteins (Figure 1.3A). Sequence alignment of the zinc finger domains shows high sequence homology among the family members, with close to 100% identity at the critical residues that make direct contact with DNA (Figure 1.3B).

KLF family members recognise a consensus GC-rich or CACCC box in DNA (74).

Figure 1.3 KLF family of transcription factors. (A) Schematic representation of KLF proteins showing the conserved C-terminal DBD of three C2H2 zinc fingers and the variable N-terminal functional domain (taken from Pearson et al. 2008 (72)). (B) Amino acid sequence alignment of the carboxyl-terminal zinc finger domains of the 17 members of the KLF family. The consensus zinc fingers (ZF1, ZF2 and ZF3) are indicated below the sequences and the amino acid residues that make specific contacts with DNA are highlighted in brown. The percentage similarity between the zinc finger domains of KLF1 and those of the other KLFs is shown on the right.

Recently, the advent of next generation sequencing technology has allowed genome-wide analysis of the genomic occupancy of transcription factors via ChIP-Seq experiments. Comparison of individual ChIP-Seq datasets published for members of the KLF family, KLF1 (46), KLF3 (75), KLF4 (76), KLF10 and KLF14 (77), confirmed in vivo binding of these family members to highly similar DNA consensus motifs (Figure 1.4). These structural and sequence similarities inevitably create instances of overlap in their transcriptional targets; for example, in embryonic stem cells, KLF2, KLF4 and KLF5 bind and activate Esrrb, Fbxo15, Nanog and Tcl1 (78).

Figure 1.4 In vivo DNA binding motifs of KLFs from published ChIP-Seq experiments. Members of the KLF zinc finger family (KLF1, KLF3, KLF4, KLF10 and KLF14) bind to highly similar GC-rich consensus sequences in vivo.

It is, however, difficult to envisage that all 17 KLF proteins compete with each other for the same binding sites within the genome. In addition, different KLFs have been shown to regulate different biological processes, ranging from proliferation and cell growth to differentiation, development, survival and responses to external stress (71,79-87). The short and degenerate motifs recognised by the DBDs of the KLFs are found many times throughout the genome and thus are insufficient to explain differences in the target genes regulated by these family members or to discriminate between different KLF proteins. While some proteins show cell-type-specific expression, for example, KLF1 expression is predominantly restricted to cells of the erythroid lineage, the majority of these proteins are widely expressed (72,83). Thus, cell-type-specific expression of particular KLFs cannot account for the diverse biological processes in which individual members of this family participate. It is therefore of interest to understand how these proteins with closely related DBDs regulate the expression of specific subsets of genes in vivo.

While all KLF proteins share highly similar zinc finger DBDs, there is little homology among the family members in regions outside of the DBD. These regions, referred to as the functional domains (FDs), have been shown to serve as protein-interaction surfaces that allow recruitment of co-factors, such as co-activators and co-repressors, to exert their regulatory effects. Phylogenetic analysis and functional characterisation of the KLFs (Figure 1.5) revealed three distinct groups within the family: Group 1 consists of transcriptional repressors that interact with the carboxyl-terminal binding protein (CtBP), Group 2 members act predominantly as transcriptional activators, and Group 3 members act as transcriptional repressors via interaction with a common transcriptional co-repressor, Sin3A (83).

Figure 1.5 Phylogenetic tree and protein structure of human Krüppel-like factors (KLFs). The 17 members of KLF are divided into distinct groups based on multiple sequence alignment and phylogenetic analysis (left) and according to common structural and functional domains (right). This figure is adapted and modified from McConnell and Yang 2010 (83).

Our recent study compared the genome-wide occupancy of one member of the KLF family, Krüppel-like factor 3 (KLF3) (88), with that of a mutant variant lacking the entire amino-terminal functional domain. It identified a role for the N-terminal domain in in vivo binding site selection, as the deletion mutant was unable to localise to a large proportion of the binding sites. Comparison of these genomic DNA binding profiles with that of another mutant, carrying a point mutation in the contact motif for a known KLF3 protein partner, CtBP, showed an intermediate pattern: the total number of bound sites was lower than for wild-type KLF3 but greater than for the mutant lacking the entire functional domain (75). This suggests that regions outside of the zinc finger DBD are important for in vivo DNA targeting and that contact with CtBP may partly explain the loss of binding in the absence of the N-terminal domain.

These observations motivated the current project, which further explores the role of the functional domain of a zinc finger transcription factor in in vivo target specificity using a gain-of-function approach (refer to section 1.3 Aims).

1.2 Artificial Zinc Finger DNA binding proteins

The increasing understanding of gene regulation and of the role of sequence-specific transcription factors as its primary regulators, coupled with knowledge of precisely how these DNA binding proteins recognise and bind their target sequences, has allowed the development of genetic engineering tools to artificially manipulate gene expression. One promising tool involves the construction of synthetic zinc finger proteins that are capable of targeting any DNA sequence to specifically regulate, knock out or replace any gene, with the ultimate aim of correcting genetic diseases resulting from aberrant gene expression. These artificial zinc finger proteins typically consist of a classical C2H2 zinc finger domain as the DNA-binding moiety, an effector domain such as an activation or repressor domain, a nuclear localisation signal and an epitope tag to monitor expression of the protein; in some cases, a cell-penetrating peptide or protein transduction domain is also conjugated to the zinc finger protein to facilitate delivery into cells. The simple one-to-one mode of recognition between individual amino acids in the C2H2 zinc finger DBD and individual DNA bases provides an ideal scaffold for designing proteins with novel sequence specificities (Figure 1.6). In addition, their modular nature allows these zinc fingers to be used as building blocks for the de novo design of zinc finger proteins with greater numbers of tandem fingers, which therefore recognise extended DNA sequences. The ability of these zinc fingers to function as monomers provides an additional advantage of C2H2-based artificial DNA binding proteins over other types of DBDs, which require palindromic targets that are not frequently found in the genome. This allows these zinc finger proteins to recognise essentially any given DNA sequence (18,89).

Figure 1.6 Artificial zinc finger protein. Schematic diagram of DNA recognition by a three-zinc finger protein showing specific DNA-amino acid contacts. Also shown are the cross-strand interactions emanating from position 2 on the recognition helix of the zinc fingers (taken from Klug 2010 (18)).

1.2.1 Application of artificial zinc finger DNA binding proteins

The first study of gene regulation by a modified transcription factor was performed in 1992 by Young’s group, who showed that mutations in the zinc finger DBD of the yeast transcription factor ADR1 alter the DNA binding specificity of the protein, leading to activation of a reporter gene containing the target DNA sequence in yeast (90). The early iterations of artificial zinc finger proteins lacked an effector domain and were primarily used to influence gene expression by physically blocking the movement of RNA polymerase (91) or by competing with natural transcription factors for their binding sites (92-94). Subsequently, these minimal designer zinc fingers have been fused to a range of different effector domains, allowing the proteins to carry out diverse roles including gene activation or repression, epigenetic modification and genome editing. Artificial zinc finger proteins have been used for a myriad of applications, predominantly for their biotechnological and therapeutic potential (reviewed in (89,95,96)). The robustness of this artificial DNA-binding platform has been validated in vitro in reporter systems and in vivo in various model organisms such as Arabidopsis thaliana (97), Caenorhabditis elegans (98) and mouse (99).

One example demonstrating the therapeutic potential of artificial zinc finger technology is the re-activation of the γ-globin gene as a strategy to treat sickle cell disease and β-thalassemia. Barbas’s group designed zinc finger fusion proteins targeting a region of the γ-globin promoter proximal to the binding sites of natural transcription factors. Fusion of these zinc finger proteins to a potent activation domain, VP64, successfully up-regulated γ-globin expression in human cell lines (100) and activated the silenced γ-globin gene in primary human hematopoietic stem cells (101) and in vivo in a transgenic mouse model (102). This presents an attractive and promising therapeutic approach in which targeted gene activation in a patient’s own cells could compensate for the genetic disorder.

Recently, transcription activator-like effectors (TALEs) and the clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9 system have emerged as alternatives to the zinc finger-based DNA-binding platform (Figure 1.7). Numerous studies have reported the effectiveness of these systems in genome editing and gene regulation applications (19,103-105). However, these technologies are still in their infancy and much work remains to be done to understand the robustness and specificity of these systems.

Figure 1.7 Schematic representations of three artificial DNA-binding platforms. Construction and application of artificial C2H2 zinc fingers are discussed in detail in the main text. Transcription activator-like effectors (TALEs) are natural bacterial proteins from the plant pathogen Xanthomonas sp. that are involved in activation of genes for bacterial colonisation. The central region of the protein contains tandem repeats of 34 amino acid modules that can be engineered to recognise any DNA sequence. DNA specificity is conferred by highly variable amino acids at positions 12 and 13, termed the repeat variable diresidues (RVDs), within each module. Each module or monomer targets one nucleotide, and thus the linear sequence of monomers in a TALE specifies the target DNA sequence in the 5’ to 3’ orientation. The CRISPR/Cas9 system, on the other hand, was developed from our understanding of the microbial adaptive immune system CRISPR (clustered regularly interspaced short palindromic repeats). For the purpose of gene regulation, two point mutations are introduced into the Cas9 nuclease to render the protein catalytically inactive, creating the dCas9 protein. dCas9 is directed to a target site adjacent to a protospacer adjacent motif (PAM), 5’-NGG-3’, by a guide RNA complementary to the target DNA sequence.
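As a rough illustration of the PAM requirement described in the caption, the following Python sketch lists candidate dCas9 target sites on the forward strand of a sequence. It assumes a 20 nt protospacer lying immediately 5’ of an NGG PAM; the 20 nt length is a common convention rather than something stated above, and the function name and parameters are hypothetical.

```python
def find_pam_sites(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for each forward-strand site
    whose protospacer is immediately followed by an NGG PAM.

    protospacer_len=20 is an assumed, conventional guide length.
    """
    seq = seq.upper()
    sites = []
    # Stop early enough that a full 3-nt PAM fits after the protospacer.
    for i in range(len(seq) - protospacer_len - 2):
        pam = seq[i + protospacer_len : i + protospacer_len + 3]
        if pam[1:] == "GG":  # N can be any base; only the GG is checked
            sites.append((i, seq[i : i + protospacer_len], pam))
    return sites
```

Only the forward strand is scanned here; a fuller search would also scan the reverse complement, since a guide RNA can be designed against either strand.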

1.2.2 Construction of artificial zinc finger DNA binding proteins

To date, a few groups have developed different methods to design and construct synthetic zinc finger proteins. One simple and widely used method is modular assembly of the zinc finger domains (89). In this method, the first step is to determine the DNA sequences in the promoter regions of the endogenous genes to be targeted for modulation. This usually involves assessment of the chromatin structure of the promoter regions and identification of DNase I hypersensitive sites. This is important because, as demonstrated by a study performed by Wolffe’s group, not all artificial zinc finger proteins are capable of regulating endogenous gene expression despite showing superb performance in activating transient reporter genes (106,107). The next step involves determining the amino acid sequences making up the zinc finger domains. This can be done either by selecting finger domains from available building modules (108-112), or by modifying the specific DNA-contacting amino acids of the zinc finger (positions -1, 2, 3 and 6 of the alpha-helical region) in the mouse Zif268 or human SP1 zinc finger proteins, based on recognition code tables generated from several phage display experiments (42,113) (Figure 1.8). DNA fragments encoding the selected zinc finger domains are then assembled by PCR, with the canonical linker TGEKP included to link the finger domains. Other accessory domains, such as effector domains, nuclear localisation signals and epitope tags, are then fused to the zinc finger DNA binding domains.
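The triplet-by-triplet logic of modular assembly can be sketched in a few lines of Python. The example splits a target site into 3 bp subsites and pairs each with a finger, assuming the Zif268-like orientation in which finger 1 contacts the 3’-most triplet. The lookup of actual recognition helices is deliberately omitted, as these would come from published module sets or recognition code tables; the function names and placeholder finger labels are hypothetical.

```python
LINKER = "TGEKP"  # canonical inter-finger linker named in the text


def assign_triplets_to_fingers(target):
    """Split a target site (written 5'->3') into triplets keyed by finger number.

    Assumes a Zif268-like arrangement: finger 1 reads the 3'-most triplet,
    so the 5'-most triplet is read by the last finger.
    """
    target = target.upper()
    if len(target) % 3 != 0:
        raise ValueError("target length must be a multiple of 3")
    triplets = [target[i:i + 3] for i in range(0, len(target), 3)]
    n = len(triplets)
    return {n - idx: triplet for idx, triplet in enumerate(triplets)}


def join_fingers(finger_peptides):
    """Concatenate finger peptide sequences (N- to C-terminal) with the linker."""
    return LINKER.join(finger_peptides)


# Example with a 9 bp site; "F1".."F3" stand in for real finger sequences.
print(assign_triplets_to_fingers("GCTGGGGGC"))
print(join_fingers(["F1", "F2", "F3"]))
```

For a six-finger protein the same function would return six triplets from an 18 bp site, which is why the modular approach scales to extended targets.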

Several types of effector domains have been fused to designer zinc fingers to carry out various molecular roles and enable diverse applications, including nuclease (114), recombinase (115), activation or repression (91,99,102), DNA methyltransferase (116-118) and histone methyltransferase (119) domains (Figure 1.9). Finally, the resulting genes encoding the synthetic zinc finger proteins are cloned into mammalian expression plasmids to evaluate the efficiency of these designer proteins.

Figure 1.8 DNA recognition of artificial zinc finger protein. Zinc finger-DNA recognition table elucidated by Choo and Klug 1994 showing amino acid-nucleotide base contacts frequently observed in interactions of selected zinc finger with DNA. Amino acids and their positions in the α-helix are entered in the matrix relating each base to each position of a DNA triplet recognised by one zinc finger. Amino acids from position +2 can enhance or modulate specificity of amino acids at position -1 and are listed as pairs (42).

Figure 1.9 Application of artificial DNA binding proteins. Artificial zinc finger DBD can be fused to different effector domains including nuclease, recombinase, activator VP64, methyltransferase, repressor KRAB and histone methyltransferase to carry out a wide range of functions (taken from Gersbach et al. 2014 (112)).

1.2.3 Artificial zinc finger proteins targeting VEGF-A promoter

Among the first artificial zinc finger proteins constructed to modulate endogenous gene expression were those targeting the VEGF-A promoter. VEGF-A is a key regulator of physiological angiogenesis, important both during embryogenesis and later in development for wound healing. VEGF-A dysregulation has thus been implicated in a number of medical conditions, including tumour growth, diabetic retinopathy, and ischemic heart and limb diseases (120). Many studies have suggested that expression of appropriate relative levels of the major splice variants produced by VEGF-A is important for proper functioning of the gene (121,122), which may explain the unsuccessful attempts to promote neovasculature by delivery of a single spliced isoform to the target cells or tissues (123,124). VEGF-A thus presents an attractive therapeutic target for both pro- and anti-angiogenic gene therapies using artificial zinc finger proteins. In 2001, Liu and colleagues designed synthetic zinc fingers targeting DNA sequences contained within the DNase I-hypersensitive regions of the VEGF-A promoter. When fused to an activation domain, VP16 or p65, these zinc fingers were able to up-regulate VEGF-A transcript and protein expression in human embryonic kidney (HEK293) cells to a level exceeding induction by hypoxic stress, a condition known to induce VEGF-A expression (107). Later, in 2002, Rebar and colleagues reported the first in vivo application of these zinc finger proteins in a whole-organism model. Intramuscular injection of adenovirus encoding the zinc fingers resulted in induction of VEGF-A expression in the quadriceps muscle of mice, and visible neovasculature was observed in the mouse ear upon subcutaneous injection of adenovirus encoding the fingers. In addition, they showed zinc finger protein-mediated acceleration of wound healing in mice. Notably, the neovasculature induced by these synthetic zinc finger proteins was not hyperpermeable, unlike that produced by expression of a single VEGF-A spliced isoform, VEGF-A164 (99). Snowden and colleagues, on the other hand, demonstrated repression of VEGF-A in a highly tumourigenic glioblastoma cell line, U87MG, which decreased VEGF-A expression to a level comparable to that in a non-angiogenic cancer line (125). Similar work was done by Kang and colleagues, who generated a replication-incompetent adenovirus expressing VEGF-A promoter-targeted synthetic zinc fingers fused to the KOX1 repressor domain and assessed its anti-tumour efficacy. Treatment of a human glioma xenograft mouse model with this protein not only resulted in inhibition of VEGF-A mediated angiogenesis and reduction in tumour growth, but also conferred increased survival in the treated mice compared to the controls (126). There are also other in vivo studies in which VEGF-A expression was regulated in model organisms, including a rat model of diabetic neuropathy (127,128) and mouse (129) and rabbit hindlimb ischemic models (130). One group has also created artificial zinc finger activators and repressors fused to cell-penetrating peptides (CPPs): HIV-1 TAT, PTD4 or a 9-mer of arginine. They reported that these proteins, when added exogenously to the medium, entered the nucleus in HEK293 cells and modulated endogenous VEGF-A expression (131).

This study serves as a proof of concept for the use of zinc finger proteins as novel protein drugs potentially applicable to various therapeutic targets. Taken together, this evidence shows that VEGF-A targeting artificial zinc finger proteins are useful tools that can be used to improve our understanding of gene regulation and biological processes. In addition, they are also potent regulators of gene expression with therapeutic promise in the treatment of disease.

1.2.4 Specificity of artificial zinc finger DNA binding proteins

The ultimate utility and applicability of this technology relies on the capability to generate artificial transcription factors (ATFs) that are highly specific, as off-target DNA binding may result in aberrant transcriptional regulation leading to unwanted secondary effects. Tan et al. 2003 and Zhang et al. 2012 each reported genome-wide single-gene specificity of six-finger artificial transcription factors, using microarray analysis to study gene expression changes (132,133). Small-scale targeted DNA-binding assessments by chromatin immunoprecipitation combined with quantitative real-time PCR suggested limited off-target DNA binding (94,119,134). Recently, however, genome-wide mapping of DNA binding by chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-Seq) revealed substantial off-target localisation of artificial zinc finger proteins (135). ChIP-Seq is a powerful technology for assessing the genomic occupancy of a DNA binding protein on a genome-wide scale (Figure 1.10). In this study, the authors assessed the in vivo DNA binding specificity of six-finger DNA binding proteins targeting two different 18 nt regions of the human promoter, with or without a repressor domain, the super KRAB domain (SKD). Each of these 18 nt sequences was expected to occur only once in the whole genome, and thus binding to a single site was expected.

Surprisingly, they identified three to five thousand genomic regions bound by the two effector-free zinc finger proteins, and addition of an effector domain increased the number of binding sites approximately five-fold, to about 20,000 sites. They proposed two possible explanations for the observed promiscuous binding by these proteins: 1) the SKD effector domain, derived from the KOX1 zinc finger protein, may interact with other DNA binding proteins and thus recruit the SKD-containing artificial zinc finger proteins to additional sites in the genome, and 2) instead of using all six zinc fingers for DNA binding, these artificial proteins may use only a subset of fingers to contact the genomic DNA. The second explanation was supported by de novo motif analysis, which revealed binding of these proteins to shorter DNA consensus sequences consisting of partial sequences from the 18 nt sites the proteins were originally designed to bind. They further showed that while thousands of promoters were bound by these proteins, only 10% of the bound promoters showed changes in gene expression.
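The contrast between a full 18 nt site and the shorter partial motifs can be made concrete with a back-of-envelope calculation. The Python sketch below estimates how often a sequence of a given length is expected to occur by chance, assuming a haploid human genome of roughly 3.1 × 10^9 bp with equal, independent base frequencies and counting both strands. These are simplifying assumptions introduced here for illustration and are not taken from the study.

```python
GENOME_SIZE_BP = 3.1e9  # assumed approximate haploid human genome size


def expected_occurrences(motif_len, genome_size=GENOME_SIZE_BP, both_strands=True):
    """Expected number of exact matches to one specific sequence of
    motif_len nucleotides in a random genome with uniform base usage."""
    per_strand = genome_size / 4 ** motif_len
    return 2 * per_strand if both_strands else per_strand


for n in (9, 12, 18):
    print(f"{n} nt: {expected_occurrences(n):.3g} expected chance matches")
```

Under these assumptions a specific 18 nt sequence is expected by chance well under once per genome, whereas a specific 9 nt sequence is expected more than twenty thousand times, which is consistent with widespread binding when only a subset of fingers contacts DNA.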

Several strategies have been proposed to improve the specificity of artificial zinc finger DNA binding proteins, including the use of a split system or dimerisation domains that bring together two separate sets of zinc finger proteins (114,117). However, it remains to be definitively determined how specific such systems are. Research is still required to further characterise and improve the in vivo specificity of artificial zinc finger DNA binding proteins.

Figure 1.10 Outline of ChIP-Seq procedure. The first step of ChIP involves crosslinking of protein-DNA interactions using crosslinking reagents such as formaldehyde. After crosslinking, the tissues are homogenised and the cells are lysed with standard lysis buffers. The chromatin is then sheared by sonication. An aliquot of the sheared DNA is kept as the input control and the remainder is incubated with magnetic beads coupled to an antibody specific for the target protein. This is followed by washes, elution and reversal of crosslinking at 65°C. The proteins and RNA in the samples are enzymatically digested and the immunoprecipitated DNA is purified by phenol-chloroform extraction and ethanol precipitation. The final step prior to high-throughput DNA sequencing is DNA library preparation, which can be done with the commercial Illumina Genomic Sample Preparation Kit. This step involves end-repair of the DNA, ligation of sequencing adapters to the DNA fragments and amplification of the adapter-modified DNA by polymerase chain reaction (PCR). The amplified library is purified on an agarose gel to select DNA fragments of a specific size range, and the library is run on a Bioanalyzer for quality control and to estimate the concentration of the DNA samples. The sample is then ready for sequencing on a sequencer. Figure is taken from Schmidt et al. 2009 (136).

1.3 Aims

The recent advent of next generation sequencing technology has allowed the study of transcription factor occupancy on a genome-wide scale. Despite convincing evidence confirming the therapeutic potential of the artificial zinc finger DNA-binding platform, the in vivo target specificity of these proteins remains poorly understood. The first aim of this study was to assess the in vivo genome-wide DNA binding specificity of one potentially therapeutically relevant artificial zinc finger protein, the VEGF-A promoter-targeting three-zinc finger protein.

Secondly, we aimed to answer an important question regarding the in vivo target specificity of natural transcription factors: how do proteins with closely related DBDs regulate different sets of target genes and thus carry out unique functions in vivo? We hypothesised that, in addition to DBDs, regions outside of these domains also play a role in in vivo DNA binding. Previously, using a loss-of-function approach, we showed that, at least in the case of Krüppel-like factor 3 (KLF3), an archetypal zinc finger transcription factor from the KLF family, the DBD was not the sole determinant of in vivo DNA binding specificity. Deletion of the entire functional domain of KLF3 reduced DNA occupancy across the genome (75). Thus, in the current study, we extended this investigation by performing gain-of-function experiments in which the KLF3 functional domain was fused onto an artificial zinc finger DBD targeting the VEGF-A promoter.

Chapter 2 Materials and Methods

2.1 Materials

2.1.1 Reagents and kits

Table 2.1 shows a list of reagents and kits used, with details of the suppliers or manufacturers, categorized by the experiment types. Common laboratory chemicals and reagents that can be easily obtained from numerous suppliers in any region are excluded from the list.

Table 2.1: List of reagents and kits

Reagents/kits | Manufacturers

Mammalian cell culture
Dulbecco’s modified Eagle’s medium, DMEM (Cat. No.: 11995-073) | Thermo Fisher Scientific
Fetal bovine serum, FBS (Cat. No.: 16000044) | Thermo Fisher Scientific
Penicillin-Streptomycin-Glutamine (100x), PSG (Cat. No.: 10378016) | Thermo Fisher Scientific
Puromycin dihydrochloride (Cat. No.: P8833-100MG) | Sigma-Aldrich

DNA cloning and plasmid preparation
Q5 High-Fidelity DNA polymerase (Cat. No.: M0491S) | New England Biolabs
Restriction enzymes and reaction buffers | New England Biolabs
T4 DNA ligase (Cat. No.: M0202S) | New England Biolabs
Alpha-select silver competent cells (Cat. No.: BIO-85026) | Bioline
Wizard SV gel and PCR clean-up system (Cat. No.: A9282) | Promega
Purelink HiPure plasmid filter maxiprep kit (Cat. No.: K210017) | Thermo Fisher Scientific

Transient transfection and retroviral transduction
FuGENE6 transfection reagent (Cat. No.: E2692) | Promega
Polybrene (Cat. No.: 107689) | Sigma-Aldrich

Electrophoretic mobility shift assay (EMSA)
Adenosine 5’-[γ-32P] triphosphate, 32P (Cat. No.: BLU502A250UC) | Perkin Elmer
T4 polynucleotide kinase and reaction buffer, T4 PNK (Cat. No.: M0201S) | New England Biolabs
Quick spin columns for radiolabeled DNA purification (Cat. No.: 11273949001) | Sigma-Aldrich
V5 mouse monoclonal antibody (Cat. No.: R960CUS) | Thermo Fisher Scientific

Protein gel electrophoresis and Western blot
NuPAGE Novex 10% Bis-Tris protein gels, 1.0 mm, 10-well (Cat. No.: NP0301BOX) | Thermo Fisher Scientific
NuPAGE MOPS SDS running buffer (20X) (Cat. No.: NP0001) | Thermo Fisher Scientific
Amersham ECL full-range rainbow molecular weight markers (Cat. No.: RPN800E) | GE Healthcare Life Sciences
BioTrace NT Nitrocellulose Transfer Membrane (Cat. No.: 66485) | PALL Corporation
V5 mouse monoclonal antibody (Cat. No.: R960CUS) | Thermo Fisher Scientific
KLF3 goat polyclonal antibody (Cat. No.: PA5-18030) | Thermo Fisher Scientific
β-actin mouse monoclonal antibody (Cat. No.: A1978-200UL) | Sigma-Aldrich
Amersham ECL Mouse IgG-HRP conjugated antibodies (Cat. No.: GEHENA931-1ML) | GE Healthcare Life Sciences
Goat IgG-HRP conjugated antibodies (Cat. No.: sc-2020) | Santa Cruz Biotech
Immobilon Western chemiluminescent HRP substrate (Cat. No.: WBKLS0500) | Merck Millipore

RNA extraction and Real Time PCR
Tri-reagent (Cat. No.: T9424-200ml) | Sigma-Aldrich
DNA-free DNA removal kit (Cat. No.: AM1906) | Thermo Fisher Scientific
RNeasy Mini kit (Cat. No.: 74106) | Qiagen
SuperScript VILO cDNA synthesis kit (Cat. No.: 11754-250) | Thermo Fisher Scientific
Power SYBR Green PCR master mix (Cat. No.: 4368702) | Thermo Fisher Scientific

Chromatin immunoprecipitation (ChIP)
Dynabeads Protein G for immunoprecipitation (Cat. No.: 100-04D) | Thermo Fisher Scientific
cOmplete EDTA free protease inhibitor cocktail tablets (Cat. No.: 11836170001) | Roche
Proteinase K (Cat. No.: P8107S) | New England Biolabs
RNase A (Cat. No.: 19101) | Qiagen
GlycoBlue coprecipitant (Cat. No.: AM9515) | Thermo Fisher Scientific
MinElute PCR purification kit (Cat. No.: 28004) | Qiagen
V5 mouse monoclonal antibody (Cat. No.: R960CUS) | Thermo Fisher Scientific

DNA library preparation for Next Generation Sequencing
TruSeq ChIP library prep kit (Cat. No.: IP-202-1012) | Illumina

2.1.2 Cell lines

• Human embryonic kidney cells (HEK293): gift from Dr. Richard Pearson

• Phoenix Ampho cells (retrovirus producer cell line, also known as Phoenix A): gift from Dr. Laura Norton

2.1.3 Oligonucleotides

All oligonucleotides were synthesised by Sigma-Aldrich, Australia. A list of oligonucleotides used in this study is available in the Appendix as Supplementary Table 2.1.

2.1.4 Vectors

Table 2.2: List of vectors used in this thesis.

Vector | Source | Description

pMT3.KLF3 | Gift from Crossley Lab | Mammalian expression vector with mouse full length KLF3 coding sequence insert.

pMA-RQ:NLS:AZFVEGFA:V5 | GeneArt product from Life Technologies | pMA-RQ vector with NLS:AZFVEGFA:V5 insert.

pEF.IRES.puro | Gift from Crisbel Artuz | Mammalian expression vector. Transcript expression is driven by the human elongation factor-1 α (EF-1 α) promoter. Contains a puromycin resistance gene as selection marker.

pMSCV.puro | Clontech Laboratories, CA, USA | MSCV (Murine Stem Cell Virus) retroviral vector for stable introduction of a gene of interest into the genome of mammalian cell lines via a retroviral expression system involving the use of a packaging cell line. Contains a puromycin resistance gene as selection marker.

2.2 Laboratory methods

2.2.1 General methods

Standard molecular biology techniques were carried out as described in Molecular Cloning, A Laboratory Manual Book 1-3 by Sambrook et al. 1989 (137).

2.2.2 Cell culture

HEK293 and Phoenix A cells were maintained at 37°C with 5% CO2 and were cultured in DMEM supplemented with 10% FBS and 1% PSG. Where appropriate, HEK293 cells were maintained under selection in 2.5 µg/mL puromycin.

2.2.3 Generation of retroviral and expression vectors

The AZF used in this study was designed and functionally validated in HEK293 cells by Liu et al. 2001 (107), where it was referred to as construct VZ+42/+530. This AZF was designed to target two GCTGGGGGC sites within the DNase I hypersensitive regions of the human VEGF-A locus, located 42 and 530 bases, respectively, downstream of the human VEGF-A transcriptional start site (+1 TSS). KLF3FD-AZF was made by fusing the KLF3 FD (amino acids 1-262) to the N-terminus of the AZF. The third construct, termed KLF3 FD, lacked a DNA binding domain and consisted of KLF3 FD amino acids 1-262 alone. All the constructs contained a nuclear localisation signal from the SV40 large T antigen, Pro-Lys-Lys-Lys-Arg-Lys-Val, N-terminal to the AZF or C-terminal to the KLF3 FD, and a C-terminal glycine-serine linker followed by a V5 tag for immunoprecipitation with an anti-V5 antibody. DNA sequences encoding AZF, KLF3FD-AZF or KLF3 FD were cloned into a mammalian expression vector with an EF1-α promoter (pEF.IRES.puro) for transient studies and into the retroviral expression system pMSCVpuro (Clontech Laboratories, Mountain View, CA, USA) for stable expression. DNA cloning and subcloning were performed as described in (137). Table 2.3 shows the list of vectors generated in this study.

Table 2.3: List of vectors generated in this study.

Vectors generated in this study | Description

pEF.IRES.puro:NLS:AZFVEGFA:V5 | Mammalian expression vector used for transient expression of AZF protein in HEK293 cells for validation experiments.

pEF.IRES.puro:KLF3FD:NLS:AZFVEGFA:V5 | Mammalian expression vector used for transient expression of KLF3FD.AZF protein in HEK293 cells for validation experiments.

pEF.IRES.puro:KLF3FD:NLS:V5 | Mammalian expression vector used for transient expression of KLF3FD protein in HEK293 cells for validation experiments.

pMSCV.puro:NLS:AZFVEGFA:V5 | Retroviral vector used to introduce and facilitate stable integration of AZF coding sequence to the genome to generate stable HEK293 cell lines expressing the AZF protein.

pMSCV.puro:KLF3FD:NLS:AZFVEGFA:V5 | Retroviral vector used to introduce and facilitate stable integration of KLF3FD-AZF coding sequence to the genome to generate stable HEK293 cell lines expressing the KLF3FD.AZF protein.

pMSCV.puro:KLF3FD:NLS:V5 | Retroviral vector used to introduce and facilitate stable integration of KLF3FD coding sequence to the genome to generate stable HEK293 cell lines expressing the KLF3FD protein.

2.2.4 Transient transfection for protein production

Lipid-based transfection using FuGENE 6 transfection reagent was performed on HEK293 cells according to the manufacturer’s instructions to overexpress the AZF, KLF3FD-AZF and KLF3FD proteins.

2.2.5 Retroviral transduction to generate stable cell lines

DNA sequences encoding AZF, KLF3FD-AZF or KLF3 FD were cloned into the retroviral expression system pMSCV-puro. The packaging cell line Phoenix Ampho (Phoenix A) was transfected with 5 µg of retroviral vector using FuGENE 6 transfection reagent, according to the manufacturer’s instructions. At 48 hours post-transfection, virus-containing supernatants were used to infect the target HEK293 cells. Puromycin selection (2.5 µg/mL) was initiated 48 hours post-transduction and was maintained for at least two weeks to ensure stable expression of the transgene. Single stable clones expressing each transgene were isolated, and transcript and protein expression levels were assessed via quantitative real-time PCR (RT-PCR) and Western blot using an anti-V5 antibody, respectively, as described in the following sections.

2.2.6 Nuclear extracts

Nuclear extracts were prepared as described in Andrew et al. 1991 (138).

2.2.7 SDS-PAGE and Western blot

Protein concentration was determined by UV absorbance at 280 nm using a Nanodrop (Thermo Fisher Scientific, MA, USA) so that equal loading could be achieved in each lane. Nuclear extracts were run at 200 V for 45 minutes on NuPAGE Novex 10% Bis-Tris gels (Life Technologies, CA, USA) using X-Cell modules (Life Technologies, CA, USA) as per the manufacturer’s instructions. Proteins were transferred to a nitrocellulose membrane using the X-Cell blot module (Life Technologies, CA, USA). Membranes were then blocked in TBST (50 mM Tris pH 7.4, 150 mM NaCl, 0.05% Tween-20) with 4% (w/v) skim milk powder for 30 minutes. Membranes were then probed using the antibodies and conditions described in Table 2.4.


Table 2.4: Primary and secondary antibodies used in this thesis.

Primary antibody/ condition | Secondary antibody/ condition

V5 mouse monoclonal antibody (Cat. No.: R960CUS)/ 1:15000 diluted in TBST (Tris-buffered saline with Tween 20), incubated for 1 hour at room temperature. | Amersham ECL Mouse IgG-HRP conjugated antibodies (Cat. No.: GEHENA931-1ML)/ 1:20000 dilution in TBST, incubated for 1 hour at room temperature.

β-actin mouse monoclonal antibody (Cat. No.: A1978-200UL)/ 1:30000 diluted in TBST, incubated for 1 hour at room temperature. | Amersham ECL Mouse IgG-HRP conjugated antibodies (Cat. No.: GEHENA931-1ML)/ 1:20000 dilution in TBST, incubated for 1 hour at room temperature.

KLF3 goat polyclonal antibody (Cat. No.: PA5-18030)/ 1:250 diluted in TBST, incubated for 1 hour at room temperature. | Goat IgG-HRP conjugated antibodies (Cat. No.: sc-2020)/ 1:60000 dilution in TBST, incubated for 1 hour at room temperature.

HRP-labelled antibodies were detected using Immobilon Western chemiluminescent HRP substrate (Merck Millipore) according to the manufacturer’s instructions, and chemiluminescent bands were imaged using an ImageQuant LAS 500 (GE Healthcare, UK). Rainbow molecular weight markers (GE Healthcare Life Sciences) were included for size estimation.

2.2.8 Electrophoretic mobility shift assay (EMSA)

The in vitro DNA binding properties of the proteins were assessed via Electrophoretic Mobility Shift Assay (EMSA) using 32P radiolabelled probes, as previously described (139). Oligonucleotides used for EMSA are available in the Appendix as Supplementary Table 2.1.

2.2.9 RNA extraction and cDNA synthesis

Total RNA was extracted with Tri-reagent according to the manufacturer’s protocol, followed by purification with an RNeasy mini kit and a DNA-free DNA removal kit to ensure purity of the RNA samples. RNA concentration was determined by UV absorbance at 260 nm using a Nanodrop (Thermo Fisher Scientific, MA, USA) and 1 µg of RNA was used as a template for cDNA synthesis using the SuperScript VILO cDNA synthesis kit as per the manufacturer’s instructions.

2.2.10 Real time PCR

Reactions (SYBR Green PCR master mix, 0.5 µg each of forward and reverse primers targeting the locus of interest, and template cDNA or ChIP DNA) were set up in triplicate and run on the Applied Biosystems 7500 Real-Time PCR system. For transcript level analysis, the results were normalized against the 18S rRNA levels of the respective samples; for ChIP experiments, normalization was done against the respective input samples containing total sonicated DNA. Oligonucleotides used for RT PCR are available in the Appendix as Supplementary Table 2.1.
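The two normalisations described above can be illustrated with a short sketch. The section does not state the exact formulae used, so the comparative Ct form shown here (2^-ΔCt against 18S rRNA, and percent of input for ChIP) is an assumption, as are the function names, the assumed 100% amplification efficiency and the example Ct values.

```python
import math


def relative_expression(ct_target, ct_reference):
    """Transcript level relative to a reference gene (e.g. 18S rRNA), as 2^-dCt.

    Assumes roughly 100% amplification efficiency (one doubling per cycle).
    """
    return 2.0 ** -(ct_target - ct_reference)


def percent_input(ct_ip, ct_input, input_fraction):
    """ChIP signal expressed as a percentage of the input chromatin.

    input_fraction is the fraction of chromatin kept as the input sample
    (hypothetical example: 0.01 for a 1% input).
    """
    # Adjust the input Ct to the value expected for 100% of the chromatin.
    ct_input_adjusted = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (ct_input_adjusted - ct_ip)


# Hypothetical Ct values, each taken as the mean of triplicate wells.
print(relative_expression(ct_target=24.0, ct_reference=10.0))
print(percent_input(ct_ip=27.0, ct_input=25.0, input_fraction=0.01))
```

Both functions work on mean Ct values, so triplicate wells would be averaged before they are called.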

2.2.11 Chromatin immunoprecipitation (ChIP)

ChIP was conducted in six stable HEK293 clones, two each expressing equivalent levels of AZF, KLF3FD-AZF or KLF3 FD, representing two biological replicates. Approximately 7 x 10^7 cells were used for each ChIP and the experiments were conducted as described (136) using 14 µg of V5 mouse monoclonal antibody. DNA samples obtained were used for RT PCR and high throughput DNA sequencing. Oligonucleotides used for RT PCR are available in the Appendix as Supplementary Table 2.1.

2.2.12 DNA library preparation and next generation sequencing

Libraries were prepared using the TruSeq ChIP Sample Prep Kit (Illumina, San Diego, CA) according to the manufacturer’s instructions. The libraries were multiplexed into 2 lanes using sample specific adapters such that there were 4 samples per lane. 50 bp single reads or 100 bp paired end reads were sequenced on the HiSeq 2500 or HiSeq 2000 (Illumina, San Diego, CA). For KLF3 FD samples, 75 bp single reads were sequenced on the NextSeq 500. Library preparation and sequencing were performed by the Ramaciotti Centre for Genomics, University of New South Wales, NSW, Australia.

2.3 Bioinformatics methods

2.3.1 Quality trimming

Quality control was performed using FastQC v0.10.1 available from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Reads were quality filtered and trimmed, and adapter sequences were removed, using Trimmomatic v0.3.2 (140).

2.3.2 Alignment

Reads were aligned to the hg19/GRCh37 Homo sapiens genome using Bowtie v2.2.1 (141) set to --very-sensitive. Resulting alignments were sorted and indexed using Samtools v0.1.18 (142).

2.3.3 Peak calling and IDR analysis

Pseudoreplicates were created using homer-idr v0.1 (available from https://zenodo.org/record/11619#) for individual and combined IP samples. Peaks were called using HOMER v4.7.2 (143) with permissive settings (-P .1 -LP .1 -poisson .1) on individual replicates, combined replicates, individual pseudoreplicates and combined pseudoreplicates against the combined input control. Peak lists were then supplied to homer-idr to determine the IDR statistic for each peak, generating a final peak list satisfying the thresholds set by homer-idr. Peaks were merged using mergePeaks with the switch -d, meaning that peaks had to literally overlap in genomic space to be considered overlapping. Two Venn diagrams were generated for this study, summarizing 1) the total number of AZF only peaks, KLF3FD-AZF only peaks and peaks common to AZF and KLF3FD-AZF, and 2) the total number of KLF3FD-AZF only peaks, KLF3 FD only peaks and peaks common to KLF3FD-AZF and KLF3 FD.
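The literal-overlap criterion and the three Venn categories can be sketched as follows. This is an illustration only and does not reproduce the internal logic of HOMER's mergePeaks; the `Peak` type, the function names, the half-open coordinate convention and the example coordinates are all assumptions.

```python
from collections import namedtuple

# Hypothetical minimal peak record: chromosome plus half-open [start, end) interval.
Peak = namedtuple("Peak", ["chrom", "start", "end"])


def overlaps(a, b):
    """True if two peaks on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def venn_counts(peaks_a, peaks_b):
    """Return (A only, common, B only) counts for two peak lists.

    'common' is counted from the A side. A quadratic scan is used for
    clarity; it would be slow for tens of thousands of peaks.
    """
    a_common = sum(1 for a in peaks_a if any(overlaps(a, b) for b in peaks_b))
    b_common = sum(1 for b in peaks_b if any(overlaps(a, b) for a in peaks_a))
    return len(peaks_a) - a_common, a_common, len(peaks_b) - b_common


# Hypothetical peaks for illustration only.
azf_peaks = [Peak("chr1", 100, 314), Peak("chr2", 500, 714)]
fusion_peaks = [Peak("chr1", 300, 514), Peak("chr3", 10, 224)]
print(venn_counts(azf_peaks, fusion_peaks))
```

Peaks that merely touch at a boundary are not counted as overlapping under the half-open convention assumed here.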

2.3.4 Quantification of ChIP tags

HOMER was used to quantify ChIP tag density at peak locations across the genome. Unless otherwise noted, tags were counted within 162 bp (for KLF3 FD) and 214 bp (for AZF and KLF3FD-AZF) around the peak centre (as peak widths could vary across the different samples). All tag counts were normalized to 100 M reads, and were thus expressed as reads/100 M reads to allow comparison across samples.
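As a minimal sketch of the arithmetic described above, the following scales a raw tag count to reads per 100 M reads and builds a fixed-width window around a peak centre. The function names, the raw count and the total tag number are hypothetical; HOMER performs this quantification internally and may differ in detail.

```python
def normalise_tags(tag_count, total_tags, scale=100_000_000):
    """Scale a raw tag count to tags per `scale` total tags (default 100 M)."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return tag_count * scale / total_tags


def window_around_centre(centre, width):
    """Return (start, end) of a window of `width` bp centred on `centre`."""
    half = width // 2
    return centre - half, centre + half


# Hypothetical example: 25 raw tags in a sample containing 11,203,387 tags.
print(round(normalise_tags(25, 11_203_387), 1))

# A 214 bp window, as used for AZF and KLF3FD-AZF, around an example centre.
print(window_around_centre(36_438_439, 214))
```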

2.3.5 Differential binding analysis

Peaks were called on each replicate against its corresponding input control using HOMER v4.7.2 (143) with the default settings except for (-style factor -F 10). Peaks from each replicate along with alignments were piped to DiffBind v1.10.2 (144) and the contrast was set based on the factor that was immunoprecipitated (AZF/KLF3FD-AZF). DiffBind was used to calculate differential binding statistics for the KLF3FD-AZF and AZF groups using edgeR-based analysis. Principal component analysis (PCA) and MA plots were also generated using DiffBind on the factor contrast. Peaks were considered differentially bound if they showed FDR < 0.1, P < 0.05 and a log2 fold-change of either > 2 or < -2 between the groups.
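The three thresholds can be expressed as a single filter. The sketch below only illustrates the stated cut-offs; it is not DiffBind code, and the record layout, field names and example values are hypothetical.

```python
def is_differentially_bound(fdr, p_value, log2_fold_change,
                            max_fdr=0.1, max_p=0.05, min_abs_log2fc=2.0):
    """Apply the cut-offs FDR < 0.1, P < 0.05 and |log2 fold-change| > 2."""
    return (fdr < max_fdr
            and p_value < max_p
            and abs(log2_fold_change) > min_abs_log2fc)


# Hypothetical per-peak statistics, for illustration only.
peaks = [
    {"id": "peak_1", "fdr": 0.01, "p": 0.001, "log2fc": 3.2},
    {"id": "peak_2", "fdr": 0.20, "p": 0.030, "log2fc": -4.0},  # fails FDR
    {"id": "peak_3", "fdr": 0.05, "p": 0.010, "log2fc": -2.5},
    {"id": "peak_4", "fdr": 0.05, "p": 0.010, "log2fc": 1.5},   # fails fold-change
]

selected = [p["id"] for p in peaks
            if is_differentially_bound(p["fdr"], p["p"], p["log2fc"])]
print(selected)
```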


2.3.6 Genomic annotation and visualization

Peak lists from IDR and differential binding analysis were annotated using annotatePeaks.pl using the HOMER annotation set for hg19/GRCh37. HOMER was used to create bedgraph files using the makeUCSCfile program. These were viewed using Integrative Genomics Viewer IGV v2.2 (145).

2.3.7 De novo motif analysis

De novo motif discovery was performed on the top 600 peaks, ranked by normalized tag counts, from the AZF total peaks, KLF3FD-AZF total peaks and AZF and KLF3FD-AZF common peaks obtained from the HOMER-IDR analysis, and on the top 600 peaks, ranked by log2 fold-change contrasting AZF/KLF3FD-AZF normalized tag counts, from the KLF3FD-AZF differentially bound peaks obtained from the differential binding analysis. Sequence databases in FASTA format consisting of the 100 bp surrounding peak centres were created using the open source, web-based platform Galaxy (available from https://usegalaxy.org/) and piped to MEME-ChIP (146) using the default settings to identify the most significantly bound motif from each sample group.
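The peak selection step can be sketched as below. The windows in this study were produced in Galaxy, so this is only an illustration of the ranking and windowing logic; the function name and tuple layout are hypothetical, and reading "100 bp surrounding peak centres" as a 100 bp window in total (50 bp each side) is an assumption.

```python
def top_peak_windows(peaks, n=600, width=100):
    """Rank peaks by score and return fixed-width windows around their centres.

    peaks: iterable of (chrom, centre, score) tuples (hypothetical layout).
    Returns a list of (chrom, start, end) for the n highest-scoring peaks,
    with start clipped at 0.
    """
    ranked = sorted(peaks, key=lambda peak: peak[2], reverse=True)[:n]
    half = width // 2
    return [(chrom, max(0, centre - half), centre + half)
            for chrom, centre, _score in ranked]


# Hypothetical peaks: (chromosome, peak centre, normalised tag count).
example = [("chr1", 1000, 5.0), ("chr2", 2000, 9.0), ("chr3", 30, 7.0)]
print(top_peak_windows(example, n=2, width=100))
```

The resulting intervals would then be used to fetch sequences from the reference genome before motif discovery.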

For motif analysis using the complete ChIP-Seq datasets, alternative web-based motif finding tools, RSAT peak motifs (available from http://floresta.eead.csic.es/rsat/peak-motifs_form.cgi) (147) and DREME (available from http://meme-suite.org/tools/dreme) (148), were used.


2.3.8 Motif scanning

FIMO, a motif scanning tool available from http://meme-suite.org/tools/fimo (149), was used to identify all GCTGGGGGC sites in the human genome version hg19 with default settings.
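For intuition, a site search of this kind can be approximated by exact string matching on both strands. This is a deliberately simplified stand-in: FIMO scores sequences against a motif model and reports statistical significance, which the sketch below does not attempt. The function names and the test sequences are hypothetical.

```python
TARGET = "GCTGGGGGC"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(sequence, motif=TARGET):
    """Return (0-based position, strand) for every exact match of `motif`.

    A '-' hit means the forward strand contains the reverse complement of
    the motif at that position. A palindromic motif is reported once, as '+'.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    rc_motif = reverse_complement(motif)
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc_motif:
            hits.append((i, "-"))
    return hits


print(find_sites("AAGCTGGGGGCTT"))
print(find_sites("ttGCCCCCAGCaa"))
```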

2.3.9 ENCODE data set and data accession

The HEK293T DNase-Seq data set (GEO Accession #GSM1008573), generated by the Crawford laboratory at Duke University (150,151), was downloaded from the ENCODE Consortium. The raw sequencing reads from this data set were processed using the ChIP-Seq pipeline described above to make bedgraph files for visualization in IGV and to quantify sequencing tags at genomic locations of interest.

2.3.10 Common promoter binding events

A list was generated consisting of all promoter peaks (2,025 peaks) bound by KLF3FD-AZF but not by the variant lacking the KLF3 functional domain (from the differential binding analysis). This list (referred to as the KLF3FD-AZF HEK293 list) was compared to a list consisting of all the promoter peaks (4,212 peaks) bound by unmodified KLF3 protein but not by the variant lacking the KLF3 functional domain, referred to as the KLF3 MEFs list. These two lists represent promoters that are differentially bound by KLF3FD-AZF and KLF3, respectively, and that require the presence of the KLF3 functional domain for DNA occupancy. The latter was obtained from published ChIP-Seq experiments performed in mouse embryonic fibroblasts (MEFs) (75). Promoter binding was defined as DNA occupancy observed between -1000 bp and +100 bp relative to the +1 TSS of a coding gene. The comparison was made based on common downstream gene names. To address the issue of interspecies difference, mouse gene names on the KLF3 MEFs list were converted to the corresponding human orthologs using HCOP: Orthology Prediction Search (available from http://www.genenames.org/cgi-bin/hcop). The KLF3FD-AZF HEK293 list was refined to contain only the human/mouse orthologous genes. These two lists (containing 3,525 and 1,720 genes, respectively) were overlapped based on common downstream gene names to generate a list of common promoters bound by KLF3FD-AZF and KLF3. The expected number of common promoters found by chance was calculated from the total number of promoter peaks from each dataset used for comparison, expressed as percentages of the total number of protein coding genes (152) (as an estimate of the total number of promoters). The product of these percentages gives an estimate of the expected common promoter binding that would occur by chance. A similar analysis was performed to investigate common promoter binding events between KLF3 (75) and KLF3 FD from the current study.
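The chance expectation described above amounts to multiplying the two list fractions and scaling back to a gene count. In the sketch below the list sizes are the gene counts quoted in the text, while the total number of protein coding genes is a hypothetical placeholder, because the value taken from (152) is not given in this section; the function name is also an assumption.

```python
def expected_common_by_chance(n_list_a, n_list_b, n_total_genes):
    """Expected number of shared promoters if the two lists were independent.

    Each list is expressed as a fraction of all protein coding genes; the
    product of the two fractions is the expected shared fraction, which is
    then converted back to a number of genes.
    """
    if n_total_genes <= 0:
        raise ValueError("n_total_genes must be positive")
    fraction_a = n_list_a / n_total_genes
    fraction_b = n_list_b / n_total_genes
    return fraction_a * fraction_b * n_total_genes


# 3,525 and 1,720 are the list sizes quoted in the text.
# 20,000 is a hypothetical placeholder for the number of protein coding genes.
print(expected_common_by_chance(3525, 1720, 20_000))
```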

2.3.11 Statistical test

Chi-squared (Χ2) test was performed to determine whether the difference between the observed and the expected number of common promoter binding was significant. A chi-squared value was computed yielding a p-value using GraphPad software available from http://graphpad.com/quickcalcs/.
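A comparison of observed and expected counts of this kind can be sketched as a chi-squared test with one degree of freedom. The two-category layout (common versus not common) is an assumption about how the comparison was set up, and the function name and example numbers are hypothetical; the study itself used the GraphPad online calculator.

```python
import math


def chi_squared_observed_vs_expected(observed_common, expected_common, n_total):
    """Chi-squared test on two categories: common and not common.

    Returns (chi-squared statistic, p-value) with one degree of freedom.
    For one degree of freedom the upper-tail probability of the chi-squared
    distribution equals erfc(sqrt(x / 2)).
    """
    observed = [observed_common, n_total - observed_common]
    expected = [expected_common, n_total - expected_common]
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for illustration only.
chi2, p_value = chi_squared_observed_vs_expected(60, 50.0, 100)
print(chi2, p_value)
```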


Chapter 3 in vivo DNA binding specificity of a three zinc finger artificial DNA binding protein

3.1 Introduction

Decades of selection and design studies to understand modes of DNA recognition by zinc fingers have led to simple yet comprehensive rules to facilitate the design of an artificial zinc finger protein (AZF) that can bind virtually any DNA sequence (101,108,111). This designer zinc finger DNA binding platform has since been used to generate transcription factors to regulate endogenous gene expression for biotechnology and therapeutic applications (18,89,96,153).

One of the first synthetic zinc finger proteins developed was designed to target a model gene, VEGF-A (107). The VEGF-A gene encodes an angiogenic and neuroprotective factor that may be effective in the treatment of heart ischemia, diabetic neuropathy and cancer (120,154). Given that three alternatively spliced isoforms of VEGF-A are required for its maximal biological activity (155), VEGF-A targeting AZFs that modulate the expression of all three isoforms of VEGF-A represent a promising therapeutic strategy. Several different VEGF-A-targeting AZFs have now been reported and shown to bind DNA robustly in vitro, and when fused to an activation or a repressor domain, to be effective in modulating expression from the VEGF-A promoter both in vitro and in reporter systems (107). These proteins have also been used to modulate angiogenesis in vivo in normal and diseased mouse models and to induce neuroprotection in a diabetic mouse (99,127,128).


Perhaps surprisingly, to our knowledge, no in vivo DNA binding specificity studies have been reported with these VEGF-A targeting artificial zinc finger proteins. Therefore, the first aim of the current study was to assess the in vivo genome-wide DNA binding specificity of one potentially therapeutically relevant AZF using chromatin immunoprecipitation followed by high throughput DNA sequencing, an established approach to determine where a protein binds in the genome. This AZF, depicted in Figure 3.1A (referred to as VZ+42/+530 in Liu et al. (107)), consists of three classical C2H2 zinc fingers and was originally designed to target two GCTGGGGGC sites within DNase I hypersensitive regions in the human VEGF-A locus. One of these sites lies within the VEGF-A promoter, 42 bases downstream of the transcription start site (Figure 3.1B).

This AZF, when fused to a transcription activation domain, was reported to be effective in activating transcription of both the endogenous human and mouse VEGF-A gene (the 9 bp target sequence is conserved in human and mouse) and a transiently transfected native reporter construct containing the VEGF-A promoter (99,107). When tested in a whole-organism model, expression of this AZF induced angiogenesis in mice (99).

In the current study, we investigated the genome-wide occupancy of this AZF stably expressed in human embryonic kidney cells (HEK293) and identified the consensus motif from the sequences bound by this protein using several motif discovery algorithms. We also identified all the GCTGGGGGC sites across the human genome, representing all the candidate AZF binding sites, and combined these with DNase I hypersensitivity information for these sites and with the actual AZF bound sites to uncover the relationship between AZF DNA occupancy and chromatin accessibility.

Figure 3.1 Artificial zinc finger protein (AZF) design. (A) AZF consists of three classical C2H2 zinc fingers that recognise a GC-rich 9 nt DNA sequence. Three specific amino acid-DNA base interactions from each finger are indicated by arrows, with finger 3 (F3) making the most 5’ contact with the DNA binding site. (B) Human VEGF-A locus showing the location of the 9 nt target sequence GCTGGGGGC 42 bases downstream of the +1 TSS that the AZF was designed to recognise and bind. (C) Schematic representation of the AZF protein with an N-terminal nuclear localisation signal (NLS) and C-terminal V5 epitope tag. (D) Amino acid sequence of the AZF protein expressed from the construct used in this study. The NLS and V5 tag are highlighted in blue and green, respectively. Amino acids at position -1 to +6 relative to the recognition helix for each finger are highlighted in red while amino acids that make specific contact with DNA bases are further underlined in red. Sequences with black underline are the GS linker.

3.2 Experimental design and construct validation

In order to investigate genome-wide occupancy of the AZF via a chromatin immunoprecipitation (ChIP) approach, the V5 epitope tag, consisting of 14 amino acids derived from the P and V proteins of the paramyxovirus simian virus 5 (SV5) (156), was fused to the C-terminus of the AZF via a glycine-serine linker (with 3 repeats of a GS pair of amino acids) (Figure 3.1C and D). Important considerations for an efficient and reliable genomic ChIP experiment include the availability of a high quality antibody that is specific to the target protein and has minimal cross-reactivity with other cellular targets, and the accessibility of epitopes upon chromatin crosslinking. Thus, in the current study, we chose an epitope tag approach to ensure efficient and reliable pull down of our target protein, the AZF, which was designed based on a natural transcription factor, ZIF268, and differs from it by only a few amino acids. Kolodziej reported that V5-ChIP is at least equivalent, if not better, in terms of performance and efficiency compared to the highly specific and efficient streptavidin ChIP and is less affected by formaldehyde crosslinking (157).

In addition to the C-terminal V5 tag, a Nuclear Localisation Signal (NLS) derived from SV40 large T antigen (158) was linked to the N-terminus of the AZF via the same glycine-serine linker to facilitate transport of the protein to the nucleus (Figure 3.1C and D).

Human embryonic kidney cell lines (HEK293) stably expressing this AZF protein were generated using the MSCV retroviral transduction system. The HEK293 cell line was used for this work because the original study on this AZF was performed in HEK293 cells, where the robust performance of this protein in terms of in vitro DNA binding and functionality was first reported (107). Two independent clonal cell lines stably expressing the AZF were selected for further analysis.


Expression and DNA binding of the AZF proteins expressed in these two clonal lines were characterised using a series of molecular biology techniques. A Western blot, using an antibody to the V5 epitope tag, was performed on nuclear proteins extracted from the two AZF stable HEK293 cell lines and showed robust and comparable AZF protein expression in the two lines (Figure 3.2A). A β-actin immunoblot was included as a loading control. To assess in vitro DNA binding of the AZF protein, we carried out Electrophoretic Mobility Shift Assays (EMSAs) using a previously validated target site containing the canonical motif GCTGGGGGC (107). EMSA is an affinity electrophoresis technique based on the differential migration of free (unbound) probes, protein-bound probes and antibody-protein-probe complexes on a non-denaturing polyacrylamide gel: smaller species, such as the free linear DNA probes, migrate faster, whereas larger complexes migrate more slowly. Probes, 22 bp DNA fragments containing the GCTGGGGGC target sequence, were radiolabelled with 32P and incubated under suitable binding conditions with total nuclear proteins extracted from the HEK293 cell lines stably expressing the AZF proteins, followed by detection via autoradiography. As shown in Figure 3.2B, strong shifted bands were observed in lanes 3 and 5, indicating protein-probe complexes. Migration of the complex was further retarded in lanes 4 and 6 (blue asterisks) upon addition of anti-V5 antibody, specific to the V5 tagged AZF proteins. These supershift experiments confirmed the identity of the retarded species, and thus substantiated in vitro DNA binding of the AZF proteins stably expressed in the two HEK293 clonal lines.


Figure 3.2 Validation of AZF protein expression and DNA binding. (A) Western blot with anti-V5 antibody shows comparable expression of AZF proteins in the two HEK293 clones stably expressing the AZF protein. Expression of β-actin protein was included as a loading control. (B) In vitro DNA binding study using electrophoretic mobility shift assay (EMSA) shows robust and equivalent AZF protein binding to 32P radiolabelled probes containing the GCTGGGGGC 9 nt target sequence for the two stable clonal lines (lanes 3 and 5). Blue asterisks indicate the supershift of the protein-DNA probe complex upon addition of anti-V5 antibody, confirming the identity of the V5-tagged AZF proteins (lanes 4 and 6). EV denotes the empty vector control, which was included as a negative control.

3.3 ChIP-Seq and Bioinformatics workflow

We performed ChIP-Seq using the anti-V5 antibody on the two AZF expressing HEK293 lines as biological replicates to interrogate in vivo DNA binding of these proteins. Approximately 1 x 10^8 cells were used per replicate. Briefly, DNA and the associated proteins in the cells were cross-linked with a reversible cross-linking agent, formaldehyde, followed by a sonication step to shear the DNA into fragments of 100-300 bp. Cross-linked protein-DNA complexes were selectively immunoprecipitated with an anti-V5 antibody specific to the AZF proteins, and DNA fragments associated with AZF proteins were purified and subjected to library preparation, which involved linking adapters to the DNA fragments to allow multiplexing, followed by several rounds of PCR amplification. In addition to DNA fragments obtained from immunoprecipitated samples (IP samples), total DNA fragments from the cells (INPUT samples) for each replicate were also included as a control for downstream normalisation purposes. The ends of the DNA fragments were sequenced using the Illumina next-generation sequencing (NGS) platform.

A concise workflow for the analysis of our ChIP-Seq data is presented in Figure 3.3. Raw sequences obtained from the NGS platform for each sample were filtered by applying size and quality cutoffs, and low-quality ends of the reads were trimmed off using a freely available tool called Trimmomatic. The remaining reads were then mapped to human genome reference version hg19 using an efficient short-read aligner, Bowtie 2. Table 3.1 shows the total number of reads obtained for each sample and the percentage of total reads (98.7 to 99.3%) successfully mapped to hg19. The high percentage of uniquely mapped reads (64-72%), as expected for a human genome library, indicated overall good immunoprecipitation and DNA library preparation and robust sequencing performance (159).

The next step, which is the pivotal computational analysis for a ChIP-Seq experiment, is to find the genomic regions bound by the AZF protein. A peak caller, HOMER, was used to identify regions with a significant number of mapped reads (peaks). Mapped reads were linearly normalised according to sequencing depth: the reads were multiplied by a scale factor to obtain the same total number of reads (expressed as reads/100 million reads) across all the different samples (INPUT and IP samples). Mapped reads from INPUT samples were then used as controls for further normalisation and noise removal. To ensure reproducibility of the results, we performed irreproducible discovery rate (IDR) analysis using the HOMER-IDR package across the two sets of peaks identified for the biological replicates. This analysis produced a list of significant and reproducible ChIP-Seq peaks, or genomic regions occupied by the AZF protein, that passed a reproducibility threshold based on the consistency of identified peaks between the replicates.

A series of downstream analyses were then performed using available bioinformatics tools to answer several biological queries. These included peak annotation to associate the ChIP-Seq peaks with functionally relevant genomic features, including genomic localisation (promoter, transcription start sites, intergenic regions) and information on the nearest coding genes; motif analysis using two different motif discovery algorithms, MEME-ChIP and RSAT: Peak Motifs, to identify centrally located motifs bound by the AZF protein; and graphical ChIP-Seq track visualisation on the Integrative Genomics Viewer (IGV).


Figure 3.3 Workflow for the computational analysis of AZF V5 ChIP-Seq. Software tools used for each step are included in blue writing.

Table 3.1 Read mapping summary.

Sample | Reads | Mapped to hg19 | % mapped | % uniquely mapped

AZF Input 1 | 14,770,820 | 14,602,331 | 98.9 | 63.9

AZF Input 2 | 15,225,256 | 15,034,280 | 98.7 | 63.9

AZF IP 1 | 11,304,986 | 11,203,387 | 99.1 | 72.0

AZF IP 2 | 6,737,341 | 6,688,293 | 99.3 | 70.0
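The "% mapped" column follows directly from the two read counts in each row, as the short sketch below shows. The read counts are those in Table 3.1; the dictionary and function names are hypothetical.

```python
# (total reads, reads mapped to hg19) for each sample, taken from Table 3.1.
READ_COUNTS = {
    "AZF Input 1": (14_770_820, 14_602_331),
    "AZF Input 2": (15_225_256, 15_034_280),
    "AZF IP 1": (11_304_986, 11_203_387),
    "AZF IP 2": (6_737_341, 6_688_293),
}


def percent_mapped(total_reads, mapped_reads):
    """Percentage of sequenced reads that were mapped to the reference."""
    return 100.0 * mapped_reads / total_reads


for sample, (total, mapped) in READ_COUNTS.items():
    print(f"{sample}: {percent_mapped(total, mapped):.1f}% mapped")
```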


3.4 AZF genomic occupancy

The ChIP-Seq analysis returned a list of peaks that were bound by AZF and upon inputting the sorted binary tiled data (.tdf) file to IGV, we were able to visualise the data where the AZF occupied regions were displayed as a peak or histogram, with the height larger than the background or negative regions.

3.4.1 AZF binds to a large number of sites within the genome

Global analysis of the ChIP-Seq dataset following the bioinformatics workflow discussed in section 3.3 identified 25,322 peaks that were significant and consistent across the two independent biological replicate sets of AZF samples and that passed the reproducibility threshold (irreproducible discovery rate, IDR ≤ 0.05) (Figure 3.4A). Annotated tables containing AZF ChIP-Seq peaks can be found in Table 3.2 (showing the top 20 peaks) and Supplementary Table 3.1 (total peaks). Interrogation of the AZF bound regions revealed AZF binding at the expected and previously validated AZF binding site in the human VEGF-A locus (Figure 3.4B). Histograms or peaks in blue indicate where the AZF protein binds. The 9 nt target sequence GCTGGGGGC for the AZF was found at the centre of the peak in the VEGF-A promoter. The VEGF-A promoter peak was among the top 3% (peak rank 727 out of the total 25,322 peaks) of peaks bound by the AZF as ranked by normalised tag count (Figure 3.4C). Enrichment profiles of AZF at two selected loci, representing the top AZF bound sites, are illustrated in Figure 3.4D. This indicates that the AZF protein, originally designed to target the VEGF-A promoter by binding to a specific 9 nt DNA sequence, also occupies a large number of other sites across the human genome in addition to the predicted target site.


Figure 3.4 HOMER-IDR peak calling and reproducibility analysis. (A) Upon implementation of a reproducibility threshold of IDR ≤ 0.05, 25,322 peaks were identified as significant and reproducible AZF bound regions in the two biological replicates. (B) Presence of a blue histogram on the ChIP-Seq track at the +42 region of the VEGF-A locus confirms AZF occupancy at the predicted target site in vivo. Partial sequence around the peak summit is shown with a black line underlining the 9 nt target sequence GCTGGGGGC. No or negligible binding by AZF was seen in the input control sample AZF (ctrl). (C) Graphical representation of all AZF ChIP-Seq peaks ranked in descending order by normalised tag counts. Peak height/ normalised tag counts is shown on the Y-axis and peak rank on the X-axis. The approximate position of the peak at the target site in the VEGF-A promoter is indicated by a black arrow. (D) ChIP-Seq tracks showing the top two regions bound by AZF, ranked by normalised tag counts.


Table 3.2 Top 20 annotated peaks occupied by AZF ranked by normalised tag counts.

Chr | Start | End | Normalised Tag Count | Annotation | Gene Name | Gene Description

chr19 | 36438332 | 36438546 | 243.5 | Intergenic | LRFN3 | leucine rich repeat and fibronectin type III domain containing 3

chr6 | 170209114 | 170209328 | 242.8 | Intergenic | LINC00242 | long intergenic non-protein coding RNA 242

chr15 | 66648997 | 66649211 | 240 | promoter-TSS | TIPIN | TIMELESS interacting protein

chr14 | 64854890 | 64855104 | 231.1 | 5' UTR | MTHFD1 | methylenetetrahydrofolate dehydrogenase (NADP+ dependent) 1

chr17 | 38268028 | 38268242 | 227.7 | Intergenic | MSL1 | male-specific lethal 1 homolog (Drosophila)

chr5 | 133706960 | 133707174 | 225.6 | promoter-TSS | UBE2B | ubiquitin-conjugating enzyme E2B

chr5 | 179238273 | 179238487 | 225.6 | intron | SQSTM1 | sequestosome 1

chr14 | 76127115 | 76127329 | 222.8 | promoter-TSS | C14orf1 | chromosome 14 open reading frame 1

chr20 | 3691562 | 3691776 | 220.1 | Intergenic | SIGLEC1 | sialic acid binding Ig-like lectin 1, sialoadhesin

chr1 | 225965869 | 225966083 | 218.7 | intron | SRP9 | signal recognition particle 9kDa

chr1 | 36839508 | 36839722 | 216 | intron | STK40 | serine/threonine kinase 40

chr19 | 41869657 | 41869871 | 213.9 | promoter-TSS | TMEM91 | transmembrane protein 91

chr19 | 39390547 | 39390761 | 213.9 | promoter-TSS | NFKBIB | nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, beta

chr1 | 228297288 | 228297502 | 211.2 | promoter-TSS | MRPL55 | mitochondrial ribosomal protein L55

chr19 | 38146609 | 38146823 | 211.2 | promoter-TSS | ZFP30 | ZFP30 zinc finger protein

chr10 | 73610634 | 73610848 | 203.6 | intron | PSAP | prosaposin

chr1 | 41846762 | 41846976 | 202.9 | intron | FOXO6 | forkhead box O6

chr19 | 42783947 | 42784161 | 202.9 | Intergenic | CIC | capicua transcriptional repressor

chr14 | 105293945 | 105294159 | 200.8 | Intergenic | LINC00638 | long intergenic non-protein coding RNA 638

chr6 | 149067875 | 149068089 | 200.8 | promoter-TSS | UST | uronyl-2-sulfotransferase

3.4.2 AZF occupancy is enriched at DNase hypersensitive sites

The AZF protein was designed to bind the 9 nt sequence GCTGGGGGC. We have confirmed that it binds this sequence in vitro by EMSA and that, in vivo, it binds the VEGF-A promoter containing this sequence. One possible explanation for the widespread genomic occupancy by AZF is that many more of these relatively short GCTGGGGGC sites are available for AZF binding in the human genome. This would not be surprising: the human genome contains 3 billion base pairs, and statistically a given 9 nt sequence is predicted to occur approximately 23,000 times across the human genome, assuming equal distribution of the four DNA bases. Thus, the first step was to identify all GCTGGGGGC sites in the human genome. Using FIMO, a motif scanning tool, we found 37,000 GCTGGGGGC sites across the whole human genome.
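The statistical estimate above can be reproduced with a short calculation. The following Python sketch is illustrative only: it assumes uniform, independent base frequencies, and it assumes that the figure of approximately 23,000 counts occurrences on both DNA strands (the per-strand expectation is roughly half of that). The exact-match scanner is a simplified stand-in for a motif scanning tool and is not how FIMO works; the function names and the toy sequence are hypothetical.

```python
GENOME_SIZE = 3_000_000_000  # approximate human genome length quoted in the text
TARGET = "GCTGGGGGC"         # 9 nt AZF target sequence


def expected_sites(genome_size: int, k: int, both_strands: bool = True) -> float:
    """Expected occurrences of one specific k-mer, assuming equal base frequencies."""
    per_strand = genome_size / 4 ** k
    return 2 * per_strand if both_strands else per_strand


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_exact_sites(sequence: str, motif: str) -> int:
    """Count exact matches to the motif on either strand of the sequence."""
    sequence = sequence.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(sequence[i:i + k] in targets for i in range(len(sequence) - k + 1))


# Expected number of sites in a 3 Gb genome (both strands): about 22,888
print(round(expected_sites(GENOME_SIZE, len(TARGET))))

# Toy sequence (hypothetical) with one forward-strand and one reverse-strand site
toy = "AAGCTGGGGGCTTTTGCCCCCAGCAA"
print(count_exact_sites(toy, TARGET))
```

Under these assumptions the expected count is close to the 23,000 quoted above, so the 37,000 sites found by FIMO are somewhat more than the uniform-base model predicts.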

We then overlapped these GCTGGGGGC sites with the AZF bound sites to determine whether AZF binds to them. Surprisingly, only 18% (6510 out of 37,000 sites) of the total GCTGGGGGC sites were bound by the AZF protein (Figure 3.5A).

A large fraction of the human genome is rendered inaccessible for protein binding as a result of the complex chromatin organisation and nucleosome packing that are essential for the regulation of biological processes. In fact, a collaborative study by two ENCODE production centres (University of Washington and Duke University) revealed that, overall, 95% of all ChIP-Seq peaks from all ENCODE studied transcription factors fall within accessible chromatin (150,151). Thus, we next sought to investigate whether the under-representation of AZF occupancy at perfect GCTGGGGGC sites across the human genome is linked to the chromatin accessibility of these regions. A DNase I dataset on HEK293T cells produced by the Crawford lab at Duke University (150,151), GEO accession #GSM1008573, was analysed using our in-house ChIP-Seq pipeline as described in the methods section. By incorporating genomic DNase I sensitivity information on HEK293T into our analysis, we found that 11% of the GCTGGGGGC sites lay within open chromatin, identified as DNase I hypersensitive sites (DHS), and of these, 63% were bound by AZF. Surprisingly, of the 89% of GCTGGGGGC sites that fell within the non-DHS category, regions that are considered inaccessible to protein, 14% were bound by the AZF protein (Figure 3.5B). This indicates that AZF is capable of binding to both nucleosome-free and nucleosome-containing regions, but with a higher preference for open chromatin regions.
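As a consistency check on the percentages reported above, the DHS and non-DHS fractions can be combined into an overall bound fraction. The sketch below is illustrative arithmetic on the rounded values quoted in the text; the variable names are hypothetical and the results inherit the rounding of the inputs.

```python
# Rounded fractions quoted in the text (Figure 3.5B)
frac_dhs = 0.11          # GCTGGGGGC sites in open chromatin (DHS)
bound_in_dhs = 0.63      # fraction of DHS sites bound by AZF
frac_non_dhs = 0.89      # GCTGGGGGC sites outside DHS
bound_in_non_dhs = 0.14  # fraction of non-DHS sites bound by AZF

# Overall fraction of GCTGGGGGC sites bound, as a weighted average of the two classes
overall_bound = frac_dhs * bound_in_dhs + frac_non_dhs * bound_in_non_dhs

# Relative preference for open over closed chromatin
preference = bound_in_dhs / bound_in_non_dhs

print(f"overall bound fraction: {overall_bound:.1%}")
print(f"preference for DHS sites: {preference:.1f}-fold")
```

The weighted average comes to about 19%, which agrees with the 18% (6510 out of 37,000) reported in Figure 3.5A given the rounding of the inputs, and a site in open chromatin is about 4.5 times more likely to be bound than a site in closed chromatin.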

Figure 3.5 AZF preferentially binds perfect recognition sites in open chromatin regions. (A) Co-occurrence of the 37,000 total GCTGGGGGC sites across the human genome and the 25,322 total AZF ChIP-Seq peaks was analysed using HOMER; approximately a fifth of the GCTGGGGGC-containing regions were bound by AZF. (B) Total GCTGGGGGC sites in the human genome were divided into two categories based on their DNase I sensitivity (DHS: open chromatin; non-DHS: closed chromatin). AZF occupancy at these two classes of genomic regions is presented in the pie charts (left and right).

3.4.3 AZF binds predominantly to sites containing the target sequence

Despite there being 37,000 perfect GCTGGGGGC sites in the genome, only a small fraction (6510 out of the total 37,000) of these sites were bound by AZF, which, as discussed above, could be partly explained by low chromatin accessibility across the human genome. To further investigate the widespread binding of AZF, we interrogated the AZF ChIP-Seq peaks to identify a centrally located consensus motif bound by AZF in vivo. A range of de novo motif discovery algorithms is currently available. However, different algorithms have complementary strengths and weaknesses. While a few studies have assessed and compared the performance of a wide range of available motif finding tools, it has proved difficult to identify one best algorithm that is adequate for a reliable analysis (160,161). Thus, we resolved to perform the motif analysis using three alternative algorithms, MEME, DREME and Peak Motifs, to ensure the reliability and robustness of the outcomes. While these three tools have been widely used for motif analysis and are supported by user-friendly web interfaces, MEME, an expectation-maximisation (EM) based algorithm (162), has a much higher time cost during motif discovery than the word-based motif discovery program DREME (148) and the combined word- and pattern-based motif finding tool Peak Motifs (163). Thus, in the current study, motif analysis with MEME was restricted to the top 600 peaks ranked by normalised tag counts, while all 25,322 peaks were inputted to DREME and Peak Motifs. From the three motif analyses, we identified a common and significant consensus motif that is present in a large fraction of the ChIP-Seq peak sequences: 95% and 88% in the MEME and DREME analyses, respectively. A central enrichment analysis, CENTRIMO, was further carried out to confirm the central localisation of the motif in the peak sequences (Figure 3.6). Interestingly, the top motif conformed to the 9 nt GCTGGGGGC target sequence bound by the AZF in vitro.

A closer examination of the motif obtained from MEME revealed varied levels of tolerance to mismatches across the 9 nt consensus sequence. The percentage frequency of occurrence of each of the four DNA bases at positions 1 to 9, with G as nucleotide no. 1 and C as nucleotide no. 9 of the GCTGGGGGC sequence, is presented in a table under the position weight matrix (PWM) of the top consensus motif (Figure 3.6B). For example, at position 4, a G is present in all 571 sequences containing this consensus motif, giving 100% occurrence; it therefore has the maximum height and an information content of 2 bits. This suggests that a G is essentially required at position 4. On the other hand, any of the four possible nucleotides could be present at position 9, although with differing relative frequencies. This represents an increase in uncertainty, which lowers the information content at this position and suggests that the identity of the base at this position is not an important determinant of binding specificity. While most of the bound sequences show high conservation of the 9 nt target sequence the AZF was designed to recognise, the nucleotide identities at positions 3 and 9 of the GCTGGGGGC target sequence, the T and C, respectively, seem to be more tolerant of mismatches. Surprisingly, we also observed AZF binding to DNA in the absence of the consensus motif in 5% of the top 600 AZF peaks investigated. Two examples are shown in Figure 3.7, with DNA sequences around the peak summit displayed in white boxes, indicating the absence of a recognisable GCTGGGGGC-related motif identified from the motif analysis. The tolerance to mismatches in the binding consensus and the instances of peaks with no related motif are both likely to contribute to the widespread binding exhibited by the AZF protein.
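The relationship between base frequencies and information content described above can be illustrated with a short calculation. This is a minimal sketch that uses the standard definition for DNA (2 bits minus the Shannon entropy of the position) and ignores the small-sample correction that some logo tools apply. The invariant position uses the 571 sequences mentioned in the text; the uniform position is a hypothetical extreme and does not represent the measured frequencies at position 9.

```python
from math import log2


def information_content(counts: dict) -> float:
    """Information content (bits) of one motif position: 2 minus the Shannon entropy."""
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)
    return 2.0 - entropy


# Position 4: G in all 571 motif-containing sequences (from the text)
invariant = {"A": 0, "C": 0, "G": 571, "T": 0}

# Hypothetical position where all four bases are equally frequent
uniform = {"A": 25, "C": 25, "G": 25, "T": 25}

print(information_content(invariant))  # maximum: 2 bits
print(information_content(uniform))    # minimum: 0 bits
```

Positions that tolerate mismatches, such as positions 3 and 9, fall between these two extremes.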

Figure 3.6 Motif analysis of AZF bound regions using MEME, DREME and Peak Motifs. (A) Three de novo motif discovery tools were used for motif analysis; the top 600 AZF ChIP-Seq peaks were inputted to MEME, while all 25,322 peaks were used for DREME and Peak Motifs analysis. All three motif discovery algorithms returned a common consensus motif that conforms to the predicted 9 nt DNA sequence the AZF was designed to recognise and bind. CENTRIMO was included to determine the localisation of the motifs in the peaks. (B) Position weight matrix (PWM) of the consensus motif identified by MEME. The height of each letter represents the information content (in bits) at that position, which is related to the degree of certainty of the particular nucleotide at that position, and the table below shows the relative frequency (as a percentage) of each nucleotide (A, C, G, T) observed at a given position.

Figure 3.7 AZF binds in the absence of the AZF 9 nt target sequence GCTGGGGGC. Two examples of ChIP-Seq tracks illustrating AZF binding to regions lacking the AZF target sequence. Partial sequence around the peak summit is shown in white boxes. Blue histogram on AZF track indicates AZF occupancy and no or negligible AZF binding is observed for the input control sample, shown on the AZF (ctrl) track.

3.5 Discussion

Artificial zinc finger based DNA binding platforms are currently being developed for therapeutic purposes. Since the development of this technology, the field has moved from the early use of three zinc finger proteins based on the natural zinc finger protein ZIF268 (107,125) to six zinc finger proteins (134,164), which are expected to be more specific in targeting sites within the whole human genome. One recent study by Grimmer and colleagues assessed the genome-wide DNA binding specificity of two six zinc finger proteins targeting two different 18 nt sequences in the human SOX2 promoter. They revealed that, while statistically one would expect an 18 nt sequence to appear once or at most a few times in the whole human genome, these six finger proteins unexpectedly occupy thousands of sites across the genome (135). One explanation for this is that, instead of requiring a perfect 18 nucleotide site, these DNA binding proteins can use subsets of their available zinc finger domains for genomic interactions. In short, they possibly have more available sites than the corresponding three zinc finger proteins targeting similar sites. This finding may encourage a return to the use of three zinc finger designer proteins. Interestingly, while a large number of three zinc finger artificial DNA binding proteins have been designed and used for various biotechnology and therapeutic applications (99,107,125), to date little is known about the in vivo DNA binding of these proteins, and a comprehensive genomic occupancy study is lacking.

In the current study we have carried out conventional ChIP-Seq experiments to identify genomic sites bound by a three zinc finger artificial DNA binding protein targeting the VEGF-A gene promoter. This was one of the first synthetic DNA binding proteins generated (107). This artificial zinc finger protein (AZF) is a ‘first generation’ synthetic DNA binding protein that contains three classical C2H2 zinc fingers designed to recognise the 9 nt sequence GCTGGGGGC, derived from a DNase hypersensitive site in the human VEGF-A promoter. It was previously reported to bind robustly in vitro and to be functionally effective in activating endogenous VEGF-A expression when fused to an activation domain (107).

In this study, we found that this AZF did bind the VEGF-A locus as expected. In addition, we also observed binding at around 25,000 additional genomic locations. Interestingly, in our efforts to understand the widespread binding of AZF, we found that, of the 37,000 GCTGGGGGC sites identified in the 3 billion base pair human genome, AZF binds to only 18%, or 6510, of these sites. We partly attribute this to the complex chromatin organisation, which may restrict AZF accessibility to a substantial portion of these target sites, a finding not uncommon for natural transcription factors or DNA binding proteins, although here the AZF also binds, to a lesser extent, to nucleosome-occupied regions. While most natural transcription factors, such as c-Jun, GATA1 and NRF1, are known to bind mostly within accessible chromatin, there are cases where transcription factors also bind compacted chromatin, including the KRAB-associated factors KAP1 and SETDB1 (151).

De novo motif analysis revealed AZF binding to related motifs, where 7 or 8 of the 9 nts were conserved, as well as to some sites where no clear motif could be identified. The degeneracies of the AZF DNA binding preference observed in this study mean that the three zinc finger protein effectively recognises a 7 or 8 nt sequence, rather than the extended 9 nt sequence. This may partly explain the off-target binding observed for this AZF. In the current study, we also found AZF binding to sites lacking a clear motif or anything recognisably like the 9 nt target sequence. It is not clear whether these additional sites represent cases where the AZF is binding to highly divergent motifs or is localising to particular genes via protein-protein interactions, or whether these non-canonical sites are the consequence of secondary long or short range interactions, possibly resulting from enhancer looping to the promoter, and thus represent sites that were not directly bound by the AZF. In the case of natural transcription factors, binding to non-canonical sites is often explained by the possibility that the protein is docking to a biologically relevant partner protein and thereby localising to its target gene indirectly via protein-protein interactions (45,47,48). However, in the case of AZF, a minimal zinc finger protein, it is uncertain whether AZF has many high affinity protein partners.

Overall we favour the hypothesis that in vivo synthetic AZF proteins could be more promiscuous in their binding than has previously been suspected. These results are not dissimilar to those obtained with natural transcription factors. ChIP-Seq experiments on natural transcription factors revealed binding to multiple genes, including genes that do not contain recognisable consensus binding sites, as well as genes that are not functionally regulated by the transcription factor (75,165).

These findings also advance our understanding of engineered zinc finger proteins. In contrast to the six zinc finger proteins that could use subsets of three to four fingers for target recognition leading to promiscuous binding across the genome (135), the three zinc finger protein used in the current study binds to the designated 9 nt sequence despite a small degree of tolerance to mismatches. Although the two studies were based on zinc finger proteins targeting two different sites and are thus not directly comparable, these findings do imply that adding more zinc fingers may not necessarily improve DNA binding specificity of the artificial zinc finger proteins, as one would have expected.

In addition, developments in this area have also led to the use of multiple artificial DNA binding proteins to either work synergistically (166,167), or, in the case of genomic nucleases and DNA methyltransferases, dimerise to a heterologous partner so that full binding is specified by more ZFs (114,117). However, it remains to be definitively determined how specific such systems are. In the next chapter, we will discuss the effect of fusing a non-DNA binding domain to this AZF on in vivo target specificity.

Chapter 4 Regions outside of DNA binding domain of the zinc finger transcription factor KLF3 are involved in in vivo DNA binding specificity

4.1 Introduction

Transcription factors are sequence-specific DNA binding proteins that play an important role in the regulation of gene expression. They are typically thought of as being composed of independent and separable DNA binding domains (DBDs) and functional domains (FDs). This idea has led to the development of important methodologies such as the yeast two-hybrid system (17) and has also facilitated the generation of sequence-specific artificial DNA binding proteins, such as zinc finger based and Transcription Activator Like Effector (TALE) based proteins, which have been widely used in genome editing for various therapeutic and biotechnology applications (168).

Nonetheless, there has recently been increasing evidence that natural transcription factors localise to their many target genes via the combined functions of both their DBDs and FDs, leading to the idea that the DBD alone may not be sufficient to explain the in vivo target specificity observed for many transcription factors (45,48,169).

While many studies have sought to understand how transcription factors recognise their target genes in vivo, one perplexing observation is that most transcription factors are part of large families sharing highly similar DBDs, and yet each of these proteins carries out unique functions in vivo (52,55,66). There has thus been an on-going quest to understand how these transcription factors achieve target specificity in vivo. The answer is likely to lie within regions outside of the DBD. To assess this, we investigated one of the well-studied families of zinc finger transcription factors, the Krüppel-like factor (KLF) family, consisting of 17 members carrying highly conserved DBDs but variable FDs. It is known that these 17 KLF family members regulate different subsets of genes and diverse biological processes (83,85). We have previously examined an archetypal zinc finger transcription factor, Krüppel-like factor 3 (KLF3), a member of the SP/KLF family of transcription factors (75). KLF3 has a typical KLF C-terminal zinc finger DBD that recognises CACCC boxes and GC-rich sequences in DNA, and an N-terminal FD known to recruit co-repressors, such as C-terminal Binding Protein (CtBP), to silence gene expression (88,139,170). Using a loss-of-function approach, we recently showed that, unexpectedly, the DBD was not the sole determinant of DNA-binding specificity (75). We found that deletion of the entire FD of KLF3 reduced DNA occupancy across the genome, as assessed by ChIP-Seq. This result highlighted the importance of the FD for proper in vivo DNA-binding specificity.

In the current study, we have extended this investigation by performing gain-of-function experiments. We fused the KLF3 FD onto the unrelated but well characterised artificial zinc finger (AZF) protein originally designed to target a model target gene, VEGF-A (107). A genome-wide DNA occupancy study on the minimal AZF protein (Chapter 3) revealed widespread binding of the protein, which could be attributable to the presence of a large number of target sequences across the genome and to degeneracies in the AZF DNA binding preference. Here we went on to test whether the binding pattern of the AZF protein was affected by the addition of a heterologous FD, the KLF3 FD. We performed differential binding analysis by comparing the sites bound by the AZF alone with the set of peaks generated by the KLF3FD-AZF fusion protein to identify a list of peaks significantly and differentially bound only in the presence of the KLF3 FD. We also carried out ChIP-Seq experiments on a protein lacking a DBD (KLF3 FD only) to test whether the FD alone is sufficient to confer in vivo DNA binding.

4.2 Experimental design and construct validation

Having previously established a system to study genome-wide DNA occupancy of the AZF protein (Chapter 3), we set out to expand this system to investigate the role of a non-DNA binding domain in in vivo target recognition by fusing the FD from an archetypal zinc finger transcription factor, KLF3, to the minimal AZF protein and comparing the genomic DNA binding profiles of these two proteins. This fusion protein (designated KLF3FD-AZF) contained the same three zinc finger DBD targeting the 9 nt sequence GCTGGGGGC, nuclear localisation signal and C-terminal V5 epitope tag as the minimal AZF protein, to enable consistent immunoprecipitation and comparison between the two proteins. In addition, it contained the N-terminal region of the KLF3 protein (amino acids 1 to 262 of full length KLF3), which possesses potent repressor activity via interaction with the transcriptional corepressor CtBP (170). Figure 4.1 shows schematic representations of the two constructs, with or without the KLF3 FD, and the amino acid sequence of the new fusion protein generated.

Figure 4.1 KLF3FD-AZF fusion protein. (A) Schematic representation of AZF (blue) and KLF3FD-AZF (red), which has an extra KLF3 functional domain (KLF3 FD) fused to the N-terminus of the artificial zinc fingers. Also shown is the 9 nt DNA sequence the shared zinc finger DBD was designed to target. (B) Amino acid sequence of the KLF3FD-AZF protein. KLF3 amino acids 1 to 262 (KLF3 FD) are highlighted in grey. The NLS and V5 tag are shown in blue and green font, respectively. Amino acids at positions -1 to +6 relative to the recognition helix of each finger are shown in red, while amino acids that make specific contact with DNA bases are further underlined in red. Sequences with a black underline are the GS linker.

As in Chapter 3, we generated human embryonic kidney cell lines (HEK293) stably expressing this KLF3FD-AZF protein using the MSCV retroviral transduction system. Two independent clonal cell lines stably expressing the KLF3FD-AZF were selected for comparison to the AZF expressing HEK293 lines described in Chapter 3. Western blot (Figure 4.2A) and quantitative real time PCR (Figure 4.2B) experiments showed robust and comparable protein and mRNA expression in all selected stable lines. mRNA expression levels were normalised against the levels of 18S rRNA of the respective samples. Two clones designated AZF 1 and 2 expressed AZF, while clones labelled KLF3FD-AZF 1 and 2 expressed KLF3FD-AZF. While there was some variation in mRNA levels, the protein levels were comparable in Western blotting using an antibody to the shared V5 epitope tag. A β-actin immunoblot was included as a loading control for the protein expression analysis.

We then investigated in vitro binding with Electrophoretic Mobility Shift Assays (EMSAs) using the previously validated target site containing the canonical motif GCTGGGGGC. We observed strong and comparable binding of both AZF and KLF3FD-AZF in all four lines (Figure 4.2C). Supershift experiments with an anti-V5 antibody confirmed the identity of the retarded species (marked with asterisks). These validation experiments suggested that these cell lines were appropriately expressing the proteins of interest at equivalent levels, which was crucial to ensure comparability of the genomic DNA binding profiles of the two proteins. Thus, any differences observed in genome-wide DNA binding should not be due to differential expression levels of the respective proteins.

Figure 4.2 Validation of AZF and KLF3FD-AZF protein and transcript expression and in vitro DNA binding. Western blot (A) and quantitative real time PCR (B) showing equivalent protein and mRNA transcript expression, respectively, for the four selected HEK293 clones stably expressing AZF or KLF3FD-AZF. For the Western blot, β-actin was included as the loading control; for real time PCR, transcript expression was normalised to the 18S rRNA level and is shown relative to the expression for AZF clone 1 (first bar), which was set to an arbitrary value of 1. (C) Electrophoretic mobility shift assay showing equivalent in vitro AZF and KLF3FD-AZF binding to a 32P radiolabelled EMSA probe containing the 9 nt target sequence GCTGGGGGC. Asterisks indicate the supershift of the protein-DNA probe complex by an anti-V5 antibody, confirming the identity of the V5-tagged AZF proteins. EV denotes empty vector and was included as a negative control.

4.3 ChIP-Seq and bioinformatics workflow

ChIP-Seq was performed on the two KLF3FD-AZF expressing HEK293 lines as biological replicates using approximately 1 x 10^8 cells each. These ChIP-Seq experiments were performed in parallel with those on the AZF samples described in Chapter 3. Across the input and IP libraries from the four cell lines (Table 4.1), a total of more than 150M raw sequencing reads were obtained and subsequently analysed following the same bioinformatics pipeline discussed previously (Chapter 3.3). Briefly, raw sequences were quality filtered with Trimmomatic, followed by mapping to the human reference genome version hg19 using Bowtie 2. Table 4.1 shows the total numbers of raw reads, mapped reads and percentages of mapped reads per sample. The high percentages of uniquely mapped and total mapped reads (in the range of 61-77% and 98-99%, respectively) are indications of robust and reliable ChIP-Seq experiments. The subsequent steps involved peak calling using HOMER and irreproducible discovery rate (IDR) analysis using the HOMER-IDR package to identify consensus peaks across the biological replicates (Figure 4.3). Annotated tables of KLF3FD-AZF ChIP-Seq peaks, the regions bound by KLF3FD-AZF, can be found in Table 4.2 (top 20 peaks) and Supplementary Table 4.1 (all peaks).

To identify regions that are differentially bound by AZF and KLF3FD-AZF, we performed quantitative differential binding analysis using an R Bioconductor package, DiffBind (Figure 4.3). AZF and KLF3FD-AZF peaks obtained from the ChIP-Seq peak caller HOMER were inputted into the DiffBind package. The first step of DiffBind involved reading and merging the overlapping peaks between the two datasets to give a single set of unique genomic intervals covering all the supplied peaks. This was followed by counting reads at each genomic interval for the two samples, AZF and KLF3FD-AZF, leading to the generation of a binding affinity matrix containing a normalised read count for each sample at every potential binding site. Based on the p-value and false discovery rate (FDR) assigned by edgeR and the normalised read count contrasts (KLF3FD-AZF/AZF or AZF/KLF3FD-AZF), peaks that passed the desired cut-offs were identified as significantly differentially bound. The results are presented as an MA plot showing an overview of the analysis outcome, a principal component analysis (PCA) plot that reports how the different samples, including the replicates, cluster based on the differentially bound sites, boxplots that show the distribution of the reads comparing the two samples AZF and KLF3FD-AZF, and lists of differentially bound regions with genomic coordinates (Chapter 4.6). The differentially bound regions were further annotated with information including genomic localisation and nearest gene names, and were also subjected to a comprehensive motif analysis to address the aims of the current study.
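The first two ideas in this workflow, merging overlapping peaks into one non-redundant interval set and expressing a contrast as a log fold change, can be sketched in a few lines of Python. This is a conceptual illustration only and is not DiffBind's or edgeR's implementation: the function names are hypothetical, the read counts are invented for the example, and no normalisation or statistical testing is performed. The peak coordinates are taken from Tables 3.2 and 4.2.

```python
from math import log2


def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) intervals into a non-redundant set."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def ma_values(count_a, count_b, pseudocount=1.0):
    """Return (A, M): mean log2 signal and log2 fold change of sample b over sample a."""
    log_a = log2(count_a + pseudocount)
    log_b = log2(count_b + pseudocount)
    return (log_a + log_b) / 2, log_b - log_a


azf_peaks = [("chr15", 66648997, 66649211), ("chr5", 133706960, 133707174)]
fusion_peaks = [("chr15", 66648988, 66649184), ("chr5", 133706986, 133707182),
                ("chr1", 85666683, 85666879)]

# Step 1: a single set of unique intervals covering all supplied peaks
consensus = merge_peaks(azf_peaks + fusion_peaks)
for interval in consensus:
    print(interval)

# Step 2 (hypothetical read counts): coordinates for one point on an MA plot
a_value, m_value = ma_values(count_a=31, count_b=127)
print(a_value, m_value)
```

In this toy example the five input peaks collapse to three consensus intervals, and a site with four-fold more signal in the second sample has M = 2.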

Table 4.1 Read mapping summary.

Sample                Reads         Mapped to hg19  % mapped  % uniquely mapped
AZF Input 1           14,770,820    14,602,331      98.9      63.9
AZF Input 2           15,225,256    15,034,280      98.7      63.9
KLF3FD-AZF Input 1    10,269,497    10,163,649      99.0      61.1
KLF3FD-AZF Input 2    11,546,992    11,431,010      99.0      61.1
AZF IP 1              11,304,986    11,203,387      99.1      72
AZF IP 2              6,737,341     6,688,293       99.3      70
KLF3FD-AZF IP 1       40,810,770    40,015,258      98.1      75.8
KLF3FD-AZF IP 2       43,254,066    42,505,354      98.3      76.4
Total                 153,919,728   151,643,562
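The mapping percentages and totals in Table 4.1 follow directly from the read counts. The short Python sketch below recomputes them from the values in the table; the variable names are illustrative, and the uniquely mapped percentages are not recomputed because the underlying counts are not listed.

```python
# (sample, raw reads, reads mapped to hg19) as listed in Table 4.1
table_4_1 = [
    ("AZF Input 1", 14_770_820, 14_602_331),
    ("AZF Input 2", 15_225_256, 15_034_280),
    ("KLF3FD-AZF Input 1", 10_269_497, 10_163_649),
    ("KLF3FD-AZF Input 2", 11_546_992, 11_431_010),
    ("AZF IP 1", 11_304_986, 11_203_387),
    ("AZF IP 2", 6_737_341, 6_688_293),
    ("KLF3FD-AZF IP 1", 40_810_770, 40_015_258),
    ("KLF3FD-AZF IP 2", 43_254_066, 42_505_354),
]

# Percentage of reads mapped for each sample
percent_mapped = {s: round(100 * mapped / reads, 1) for s, reads, mapped in table_4_1}
for sample, pct in percent_mapped.items():
    print(f"{sample:<20}{pct:5.1f}% mapped")

# Totals across all eight libraries
total_reads = sum(reads for _, reads, _ in table_4_1)
total_mapped = sum(mapped for _, _, mapped in table_4_1)
print(f"Total reads:  {total_reads:,}")
print(f"Total mapped: {total_mapped:,}")
```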

Figure 4.3 Workflow for the computational analysis of the V5 ChIP-Seq data. Software tools used for each step are included in blue writing.

4.4 KLF3FD-AZF shows increased DNA occupancy across the genome

To investigate whether the fusion of an additional KLF3 FD from a natural zinc finger transcription factor to an artificial zinc finger protein has an effect on in vivo DNA binding specificity, we compared the genomic DNA binding profile of KLF3FD-AZF to that of the AZF. HOMER-IDR analysis (with the reproducibility threshold of IDR ≤ 0.05) on ChIP-Seq data obtained from KLF3FD-AZF samples revealed 48,003 peaks across the human genome, consistent between the two biological replicates (Figure 4.4A). Table 4.2 shows the top 20 peaks bound by KLF3FD-AZF, ranked by normalised tag count or peak height. A complete list of KLF3FD-AZF ChIP-Seq peaks is available as Supplementary Table 4.1. Compared with the 25,322 peaks observed with AZF, KLF3FD-AZF thus generates approximately twice as many peaks. We further examined and compared the regions where peaks were observed with AZF and/or KLF3FD-AZF by directly overlapping the two sets of peaks based on peak position to get an overview of differential binding. As expected, with both AZF and KLF3FD-AZF sharing the same zinc finger DBD, most of the peaks (more than 80%) observed with AZF were also observed with KLF3FD-AZF. Strikingly, however, KLF3FD-AZF also generates 27,610 additional peaks (Figure 4.4B). ChIP-Seq tracks in Figure 4.4C illustrate two examples of regions commonly bound by AZF and KLF3FD-AZF. Blue and red histograms on the tracks indicate AZF and KLF3FD-AZF binding, respectively, at the pictured genomic loci. ChIP-Seq tracks for the input samples were included as negative controls, showing no enrichment at the two loci inspected.
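The peak counts quoted above are mutually consistent, as the following Python sketch shows. It is an illustration under a simplifying assumption: each shared region is counted as one AZF peak and one KLF3FD-AZF peak, which need not hold exactly when one peak overlaps several peaks in the other set. The `overlaps` helper is a hypothetical stand-in for position-based overlapping, demonstrated on the TIPIN promoter peaks listed in Tables 3.2 and 4.2.

```python
azf_total = 25_322     # AZF peaks (Chapter 3)
fusion_total = 48_003  # KLF3FD-AZF peaks
fusion_only = 27_610   # additional peaks seen only with KLF3FD-AZF

# Peaks shared by both proteins, assuming one-to-one correspondence
shared = fusion_total - fusion_only
shared_pct_of_azf = round(100 * shared / azf_total, 1)
fold_more_peaks = round(fusion_total / azf_total, 1)

print(f"shared peaks: {shared:,} ({shared_pct_of_azf}% of AZF peaks)")
print(f"KLF3FD-AZF / AZF peak ratio: {fold_more_peaks}")


def overlaps(peak_a, peak_b):
    """True if two (chrom, start, end) peaks share at least one base."""
    return peak_a[0] == peak_b[0] and peak_a[1] <= peak_b[2] and peak_b[1] <= peak_a[2]


# TIPIN promoter peak in the AZF and KLF3FD-AZF top 20 lists
azf_tipin = ("chr15", 66648997, 66649211)
fusion_tipin = ("chr15", 66648988, 66649184)
print(overlaps(azf_tipin, fusion_tipin))
```

Under this assumption about 20,393 regions are shared, which corresponds to 80.5% of the AZF peaks and matches the "more than 80%" stated in the text.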

Figure 4.4 HOMER-IDR peak calling and reproducibility analysis. (A) Upon implementation of the reproducibility threshold of IDR ≤ 0.05, 48,003 peaks were identified as significant and reproducible KLF3FD-AZF bound regions between the two biological replicates. (B) A proportional Venn diagram showing the regions bound by AZF (blue) and/or KLF3FD-AZF (red). (C) ChIP-Seq tracks showing the top two regions bound by KLF3FD-AZF, ranked by normalised tag counts.

Table 4.2 KLF3FD-AZF binding sites across human genome. Annotated peak list showing top 20 regions bound by KLF3FD-AZF ranked by normalised tag counts using HOMER-IDR for peak calling.

Chr    Start      End        Normalised Tag Count  Annotation    Gene Name   Gene Description
chr1   85666683   85666879   153.7                 promoter-TSS  SYDE2       synapse defective 1, Rho GTPase, homolog 2 (C. elegans)
chr15  66648988   66649184   147.5                 promoter-TSS  TIPIN       TIMELESS interacting protein
chr1   877227     877423     145.9                 intron        SAMD11      sterile alpha motif domain containing 11
chr15  78423803   78423999   144.7                 promoter-TSS  CIB2        calcium and integrin binding family member 2
chr14  54863375   54863571   141.4                 promoter-TSS  CDKN3       cyclin-dependent kinase inhibitor 3
chr6   170209179  170209375  140.6                 Intergenic    LINC00242   long intergenic non-protein coding RNA 242
chr19  38146627   38146823   138.9                 promoter-TSS  ZFP30       ZFP30 zinc finger protein
chr1   224033487  224033683  138.5                 promoter-TSS  TP53BP2     tumor protein p53 binding protein 2
chr17  5185298    5185494    137.5                 promoter-TSS  RABEP1      rabaptin, RAB GTPase binding effector protein 1
chr19  39390578   39390774   137.5                 promoter-TSS  NFKBIB      nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, beta
chr7   156931243  156931439  135.9                 promoter-TSS  UBE3C       ubiquitin protein ligase E3C
chr17  44270746   44270942   135.3                 promoter-TSS  KANSL1-AS1  KANSL1 antisense RNA 1
chr1   244998511  244998707  134.4                 promoter-TSS  COX20       COX20 cytochrome C oxidase assembly factor
chr1   153756049  153756245  133.6                 Intergenic    SLC27A3     solute carrier family 27 (fatty acid transporter), member 3
chr19  39903120   39903316   132                   promoter-TSS  PLEKHG2     pleckstrin homology domain containing, family G (with RhoGef domain) member 2
chr5   179238272  179238468  131                   intron        SQSTM1      sequestosome 1
chr5   133706986  133707182  131                   promoter-TSS  UBE2B       ubiquitin-conjugating enzyme E2B
chr6   31588129   31588325   131                   promoter-TSS  PRRC2A      proline-rich coiled-coil 2A
chr20  306392     306588     130.8                 promoter-TSS  SOX12       SRY (sex determining region Y)-box 12
chr10  104403081  104403277  129.1                 Intergenic    TRIM8       tripartite motif containing 8

4.5 Both AZF and KLF3FD-AZF bind the VEGF-A recognition site in vivo

We next interrogated the ChIP-Seq data to assess binding at the VEGF-A locus. Satisfyingly, peaks were observed at the expected and previously validated artificial zinc finger binding site in the human VEGF-A locus in both the AZF and KLF3FD-AZF expressing cell lines (Figure 4.5A). The partial DNA sequence around the AZF and KLF3FD-AZF peak summit is displayed in a white box and the 9 nt AZF target sequence is underlined, confirming AZF and KLF3FD-AZF binding to the predicted target site in vivo. Sorting the peaks by peak height revealed that the peaks at the VEGF-A promoter (indicated with arrows in Figure 4.5B) were among the highest-ranked peaks, ranking 727 out of the 25,322 AZF peaks and 299 out of the 48,003 KLF3FD-AZF peaks. This result was also confirmed by ChIP-PCR analysis. ChIP assays using an anti-V5 antibody specific to the AZF and KLF3FD-AZF proteins were performed on each of the four stable clonal cell lines, and the recovered DNA was amplified by quantitative real-time PCR using primers specific for the human VEGF-A promoter region near the GCTGGGGGC target sequence. A region 3.5 kb upstream of the VEGF-A transcriptional start site, which lacks the target sequence, served as a negative control (primer sequences are available in Supplementary Table 2.1). We observed strong enrichment at the VEGF-A promoter, at least 7-fold greater than at the negative locus (Figure 4.5C), indicating AZF and KLF3FD-AZF occupancy at the predicted target site. Enrichment of the AZF and KLF3FD-AZF proteins at the genomic loci is presented as the percentage of DNA fragments recovered from the immunoprecipitated samples relative to total DNA fragments from the cells (% input).
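The % input metric can be illustrated with a minimal sketch. It assumes the commonly used Ct-based calculation, a 1% input aliquot and roughly 100% primer efficiency; none of these details are stated in the text, and the function names and numeric values below are hypothetical.

```python
import math


def percent_input(ct_ip: float, ct_input: float, input_fraction: float = 0.01) -> float:
    """Return ChIP enrichment as a percentage of input chromatin.

    input_fraction is the share of chromatin set aside as input
    (1% is an assumption, not a value taken from the thesis).
    """
    # Adjust the input Ct to the value expected for 100% of the chromatin.
    ct_input_adjusted = ct_input - math.log2(1.0 / input_fraction)
    # One Ct unit is treated as a two-fold difference (assumes ~100% efficiency).
    return 100.0 * 2.0 ** (ct_input_adjusted - ct_ip)


def fold_over_negative(target_pct: float, negative_pct: float) -> float:
    """Ratio of % input at the target locus to % input at the negative locus."""
    return target_pct / negative_pct


# Hypothetical Ct values, for illustration only.
target = percent_input(ct_ip=27.0, ct_input=24.0)
negative = percent_input(ct_ip=30.0, ct_input=24.0)
print(target, negative, fold_over_negative(target, negative))
```

With these made-up Ct values the target locus is enriched 8-fold over the negative locus, which is the kind of comparison summarised in Figure 4.5C.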


Figure 4.5 Validation of AZF and KLF3FD-AZF binding to the predicted target site GCTGGGGGC in vivo via ChIP-Seq and ChIP-PCR assays. ChIP-Seq (A) and ChIP-real time PCR (C) confirm that AZF and KLF3FD-AZF bind to the predicted target site in the VEGF-A promoter in vivo. ChIP (chromatin immunoprecipitation) was performed on four HEK293 clones, two stably expressing AZF and two stably expressing KLF3FD-AZF, using an anti-V5 antibody against the V5-tagged AZF proteins. Purified DNA was subjected to (A) high-throughput next generation DNA sequencing, where the partial sequence around the peak summit is shown with a red line underlining the 9 nt target sequence, and no or negligible binding was detected for the input control samples around the same region; and (C) real-time PCR using primers designed to target the human VEGF-A promoter near the target sequence and a negative locus 3.5 kb upstream of the VEGF-A +1 TSS. As expected, higher enrichment of AZF or KLF3FD-AZF was observed at the VEGF-A promoter than at the negative locus. Enrichment is presented as a percentage of the respective input sample. (B) Graphical representation of all AZF and KLF3FD-AZF ChIP-Seq peaks ranked in descending order by normalised tag counts. Peak height (normalised tag count) is shown on the Y-axis and peak rank on the X-axis. The approximate positions of the peaks at the target site in the VEGF-A promoter are indicated by black arrows.


4.6 Differential binding analysis

We have shown that adding the KLF3 FD to the AZF increased the total genomic DNA occupancy of the zinc finger protein. Next, we sought to identify regions that were differentially bound by KLF3FD-AZF compared to the AZF protein lacking the KLF3 FD. Differential binding analysis was performed using the DiffBind package, which defined peaks differing by more than two-fold in normalised read counts between the KLF3FD-AZF and AZF samples as differentially bound, a quantitative approach based on evidence of binding affinity. The MA plot in Figure 4.6A shows the log2 fold difference in normalised read counts between the AZF and KLF3FD-AZF samples (AZF/KLF3FD-AZF) for each binding site identified in the AZF and KLF3FD-AZF ChIP-Seq data, plotted against the average log2 normalised read count at that site. Differentially bound sites were those that showed at least a two-fold difference in normalised read counts and were statistically significant, with FDR < 0.1 and p-value < 0.05. This analysis produced two peak lists of 4,620 and 4,357 peaks that were preferentially or uniquely bound by AZF (log2 FC > 2) and KLF3FD-AZF (log2 FC < -2), respectively. The top 20 KLF3FD-AZF differentially bound peaks are shown in Table 4.3, and complete annotated peak lists for AZF and KLF3FD-AZF differentially bound peaks can be found in Supplementary Table 4.2 and Supplementary Table 4.3, respectively. Principal component analysis (PCA) (Figure 4.6B) using the differentially bound sites demonstrated that the KLF3FD-AZF samples behaved differently from the AZF samples at those binding sites. Replicates of each sample clustered together, indicating small variance between replicates. To visualise how reads at differentially bound sites were distributed between the AZF and KLF3FD-AZF samples, boxplots were generated for the AZF and KLF3FD-AZF differentially bound sites (Figure 4.6C and D). KLF3FD-AZF showed increased binding affinity at the KLF3FD-AZF differentially bound regions and, similarly, AZF showed increased binding affinity at the AZF differentially bound sites. ChIP-Seq traces in Figures 4.6E and F show examples of KLF3FD-AZF and AZF unique and preferentially bound peaks, respectively. These differentially bound peaks were further studied to elucidate the role of the KLF3 FD in in vivo DNA binding.


Figure 4.6 Differential binding analysis. (A) MA plot showing the binding affinity of the AZF and KLF3FD-AZF samples at each binding site identified. Differential binding analysis with DiffBind identified 4,357 and 4,610 sites that are differentially bound by KLF3FD-AZF and AZF, respectively. (B) PCA using the differentially bound sites shows that the KLF3FD-AZF and AZF samples exhibited distinct binding patterns at these differentially bound regions. (C+D) Boxplots showing log2 normalised read counts at KLF3FD-AZF differentially bound regions (C) and AZF differentially bound regions (D) in the AZF and KLF3FD-AZF samples. (E+F) An illustrative range of examples showing regions that are differentially bound by KLF3FD-AZF (E) or AZF (F). Binding profiles of the respective input samples are included as controls and shown in the third and fourth tracks.


Table 4.3 Top 20 KLF3FD-AZF differentially bound regions determined using a differential binding analysis tool, DiffBind and ranked by log2 fold-change contrasting AZF/KLF3FD-AZF normalised tag counts.

| Chr | Start | End | Annotation | Gene Name | log2 mean | log2 AZF normalised read counts | log2 KLF3FD-AZF normalised read counts | log2 fold change |
|---|---|---|---|---|---|---|---|---|
| chr16 | 30406601 | 30406809 | promoter-TSS | ZNF48 | 6.55 | 1.61 | 7.53 | -5.92 |
| chr10 | 104180825 | 104181027 | exon | PSD | 6.48 | 1.61 | 7.46 | -5.85 |
| chr5 | 178157642 | 178157844 | promoter-TSS | ZNF354A | 6.34 | 1.61 | 7.32 | -5.71 |
| chr12 | 6961498 | 6961736 | promoter-TSS | USP5 | 6.33 | 1.61 | 7.30 | -5.70 |
| chr1 | 875693 | 875895 | intron | SAMD11 | 8.19 | 3.51 | 9.16 | -5.65 |
| chr19 | 13162473 | 13162682 | intron | NFIX | 6.24 | 1.61 | 7.21 | -5.61 |
| chr1 | 12538323 | 12538515 | intron | SNORA59A | 6.24 | 1.61 | 7.21 | -5.60 |
| chr1 | 205744479 | 205744681 | promoter-TSS | RAB29 | 6.20 | 1.61 | 7.17 | -5.56 |
| chr16 | 30538111 | 30538336 | promoter-TSS | ZNF768 | 6.19 | 1.61 | 7.16 | -5.55 |
| chr7 | 99679433 | 99679625 | promoter-TSS | ZNF3 | 6.19 | 1.61 | 7.15 | -5.55 |
| chr12 | 117175923 | 117176115 | promoter-TSS | RNFT2 | 6.15 | 1.61 | 7.12 | -5.51 |
| chr4 | 140005444 | 140005646 | promoter-TSS | ELF2 | 6.11 | 1.61 | 7.08 | -5.47 |
| chr2 | 230786750 | 230787039 | promoter-TSS | TRIP12 | 6.10 | 1.61 | 7.07 | -5.46 |
| chr1 | 150551701 | 150551908 | exon | MCL1 | 7.09 | 2.63 | 8.06 | -5.43 |
| chr5 | 112539079 | 112539281 | intron | MCC | 6.04 | 1.61 | 7.01 | -5.40 |
| chr5 | 79783734 | 79783926 | promoter-TSS | FAM151B | 5.97 | 1.61 | 6.94 | -5.33 |
| chr7 | 148936604 | 148936796 | promoter-TSS | ZNF212 | 5.97 | 1.61 | 6.93 | -5.32 |
| chr7 | 150945709 | 150945911 | promoter-TSS | SMARCD3 | 5.96 | 1.61 | 6.92 | -5.32 |
| chr19 | 30302331 | 30302533 | promoter-TSS | CCNE1 | 5.95 | 1.61 | 6.91 | -5.31 |
| chr8 | 141474447 | 141474649 | Intergenic | TRAPPC9 | 6.97 | 2.63 | 7.93 | -5.30 |
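The columns of Table 4.3 appear to be related by simple arithmetic: the log2 fold change is the difference of the two log2 normalised read counts, and the log2 mean is the log2 of the average of the two counts on the linear scale. The sketch below checks this reading against the ZNF48 row; the function name is hypothetical and the relationship is inferred from the tabulated values rather than taken from the DiffBind documentation.

```python
import math
from typing import Tuple


def ma_values(log2_azf: float, log2_fusion: float) -> Tuple[float, float]:
    """Return (log2 mean, log2 fold change) from two log2 normalised read counts."""
    # Fold change is AZF over KLF3FD-AZF, so it is a difference on the log2 scale.
    log2_fc = log2_azf - log2_fusion
    # The mean is taken on the linear scale and then log2-transformed.
    mean_count = (2.0 ** log2_azf + 2.0 ** log2_fusion) / 2.0
    return math.log2(mean_count), log2_fc


# ZNF48 row of Table 4.3: log2 AZF = 1.61, log2 KLF3FD-AZF = 7.53.
log2_mean, log2_fc = ma_values(1.61, 7.53)
print(round(log2_mean, 2), round(log2_fc, 2))
```

For the ZNF48 row this reproduces the tabulated log2 mean of 6.55 and log2 fold change of -5.92; a negative fold change marks a site preferentially bound by KLF3FD-AZF.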

4.6.1 KLF3FD-AZF peaks are abundant in promoter regions

Previously, genome-wide DNA occupancy studies on unmodified full-length KLF3 protein revealed that 30% of the KLF3-bound regions lay in promoter regions (Figure 4.7A) (75). As promoters constitute less than one per cent of the total genome (an estimate based on 27,000 total coding genes and an average promoter size of 1100 bp according to RefSeq), this result represents a strong enrichment of KLF3 peaks in promoters, which is consistent with its role as a transcriptional regulator. Interestingly, a deletion mutant lacking the KLF3 FD was not similarly enriched at promoters but instead generated peaks more broadly across the genome, with only 1% of its peaks in promoters (Figure 4.7B). Thus, this loss-of-function experiment suggested that the KLF3 FD may play a role in localising KLF3 to target gene promoters.
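The promoter fraction estimate can be reproduced with a short calculation. The gene count and promoter size come from the text; the genome size of about 3.1 billion bp is an assumption introduced here for illustration.

```python
def promoter_fraction(n_genes: int, promoter_bp: int, genome_bp: float) -> float:
    """Fraction of the genome covered by promoters, ignoring any overlaps."""
    return n_genes * promoter_bp / genome_bp


GENOME_BP = 3.1e9  # assumed approximate size of the human genome

fraction = promoter_fraction(27_000, 1_100, GENOME_BP)
# Enrichment of KLF3 peaks in promoters relative to a uniform distribution.
enrichment = 0.30 / fraction
print(f"{100 * fraction:.2f}% of genome, {enrichment:.0f}-fold enrichment")
```

Under this assumption promoters cover just under 1% of the genome, so 30% of peaks in promoters corresponds to roughly a 30-fold enrichment over a uniform distribution of peaks.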

We next investigated whether the KLF3 FD exhibits similar behaviour in the current gain-of-function experiments. To achieve this, we compared the peaks generated by AZF to those generated by KLF3FD-AZF and determined which were in promoters, introns, intergenic and intragenic regions. The peaks were analysed based on RefSeq annotations and were categorised according to their genomic localisation across the Homo sapiens (hg19/GRCh37) genome into the four main categories. In this analysis, promoters were defined as the regions -1 to +0.1 kb from the RefSeq transcription start site (TSS), and a fifth category, ‘other’, was used to encompass peaks that fell within coding exons, 5’UTR and 3’UTR exons, or close to transcriptional termination sites (-100 bp to +1 kb).
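The promoter definition above can be expressed as a small strand-aware test. This is a minimal sketch of the rule only, not the annotation tool used in this work; the function name and coordinates are hypothetical.

```python
def is_promoter(peak_pos: int, tss: int, strand: str,
                upstream: int = 1000, downstream: int = 100) -> bool:
    """Return True if a peak position lies within -1 kb to +0.1 kb of the TSS."""
    # Signed offset from the TSS, positive in the direction of transcription.
    offset = peak_pos - tss if strand == "+" else tss - peak_pos
    return -upstream <= offset <= downstream


# Hypothetical coordinates: a TSS at position 10,000.
print(is_promoter(9_500, 10_000, "+"))   # 500 bp upstream on the plus strand
print(is_promoter(10_500, 10_000, "+"))  # 500 bp downstream on the plus strand
print(is_promoter(10_500, 10_000, "-"))  # 500 bp upstream on the minus strand
```

The strand matters because "upstream" on the minus strand corresponds to higher genomic coordinates, so the same peak position can be a promoter peak for a minus-strand gene but not for a plus-strand gene.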

We found that 18% and 20% of the AZF and KLF3FD-AZF peaks, respectively, lie within promoters (Figure 4.7C and D). This is not surprising given the large number of sites commonly bound by AZF and KLF3FD-AZF. Intriguingly, we observed that 46% of the peaks differentially bound by KLF3FD-AZF resided in promoter regions (Figure 4.7E). These are the peaks whose height is at least two-fold greater in KLF3FD-AZF than in AZF. This observation suggests that the addition of the KLF3 FD does indeed help to target the fusion protein to promoter regions that are not significantly occupied by AZF alone.


Figure 4.7 Genomic localisation of regions bound by AZF or KLF3FD-AZF showing enrichment of KLF3FD-AZF differential binding at promoter regions. Genomic localisation of (A) total regions bound by unmodified full-length KLF3 protein and (B) total regions bound by the FD-deficient KLF3 DNA binding domain only (DBD) protein. (A) and (B) were analysed from peaks obtained from our previously published ChIP-Seq experiments (75). (C) Total regions bound by AZF, (D) total regions bound by KLF3FD-AZF and (E) regions differentially bound by KLF3FD-AZF as identified by DiffBind. Promoters are defined as the region -1000 bp to +100 bp around the +1 TSS of RefSeq genes. Peaks that fell into CDS exons, non-coding, 5’ and 3’ UTR exons and transcription termination sites are all labelled as ‘other’. The percentage of binding peaks lying in each genomic DNA class is given and the total number of peaks sampled is shown in parentheses.


4.6.2 Peaks preferentially bound by KLF3FD-AZF contain an imperfect AZF DNA binding site

To further characterise the new binding sites acquired in the presence of the KLF3 FD, we generated and compared consensus motifs from peaks bound by AZF and/or KLF3FD-AZF using the MEME, DREME and Peak Motifs algorithms (Figure 4.8). The combined use of three different motif discovery algorithms, as mentioned in Chapter 3.4.3, was to ensure the reliability and robustness of the results, since different algorithms have their own strengths and weaknesses. As expected, a large fraction of KLF3FD-AZF peaks and of AZF and KLF3FD-AZF shared peaks contained the 9 nt AZF target sequence, GCTGGGGGC (Figure 4.8 first and second column). These likely represent regions that were bound utilising the shared three zinc finger DBD. Interestingly, regions differentially bound by AZF also contained the same 9 nt sequence, albeit in a smaller fraction of the peaks (Figure 4.8 third column). In contrast, examination of the peaks generated either uniquely or preferentially by KLF3FD-AZF revealed a related motif, consisting of a string of G and C bases (Figure 4.8 fourth column). A closer examination of the motif obtained from MEME revealed its resemblance to an imperfect version of the AZF target sequence that is more forgiving of mismatches at nucleotide positions 3, 5, 8 and 9 (Figure 4.9A). The percentage frequency of occurrence of each of the four DNA bases at positions 1 to 9, with G as nucleotide no. 1 and C as nucleotide no. 9 according to the GCTGGGGGC target sequence, is presented in a table under the position weight matrix (PWM) of the consensus motif. Also shown for comparison are the frequencies for the top motif obtained from MEME for KLF3FD-AZF total peaks (Figure 4.9B) and AZF and KLF3FD-AZF shared peaks (Figure 4.9C). As shown in Figure 4.9A, a G base has become more prominent than the original T base, particularly at position 3. Additionally, tolerance of a C base at position 5 and of an A or C base at position 8, at frequencies of 8 to 28%, is acquired. The G base at position 8 occurs at a frequency of 52% compared to 99% in peaks shared by both AZF and KLF3FD-AZF.
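The per-position percentages in Figure 4.9 are the kind of table obtained by counting bases across aligned motif occurrences. The sketch below shows only that tabulation step; it does not reproduce the MEME motif discovery itself, and the example sites are hypothetical rather than taken from the ChIP-Seq data.

```python
from collections import Counter
from typing import Dict, List


def position_frequencies(sites: List[str]) -> List[Dict[str, float]]:
    """Return, for each position, the percentage of sites carrying A, C, G and T."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    table = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        table.append({base: 100.0 * counts.get(base, 0) / len(sites) for base in "ACGT"})
    return table


# Hypothetical aligned 9 nt sites, for illustration only.
example_sites = ["GCTGGGGGC", "GCGGGGGGC", "GCGGCGGAC", "GCTGGGGCC"]
freqs = position_frequencies(example_sites)
for position, row in enumerate(freqs, start=1):
    print(position, row)
```

Positions are reported 1-based to match the numbering of the GCTGGGGGC target sequence used in the text.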

We further evaluated this in an in vitro setting using naked DNA. We performed EMSA experiments to assess AZF and KLF3FD-AZF binding affinity to probes containing a G to C point mutation at nucleotide position 5 (labelled as the 5G>C probe). As a positive control we used a previously validated probe containing a perfect 9 nt GCTGGGGGC AZF target sequence, referred to as the ‘wild type’ AZF probe. As expected, the G to C point mutation compromised AZF binding. It similarly reduced the binding of KLF3FD-AZF in vitro (Figure 4.10A). In vivo, KLF3FD-AZF but not AZF occupied an imperfect AZF target site (Figure 4.10B). The partial sequence around the peak summit is shown in a white box and the imperfect DNA sequence bound by KLF3FD-AZF is underlined.

This result highlights the differences between the in vitro and in vivo DNA-binding behaviours. While in vitro both AZF and KLF3FD-AZF show weak binding to the 5G>C sequence, in vivo KLF3FD-AZF, but not AZF, generates a peak at many such sequences. This suggests that in the in vivo setting, where the complexity of the chromatin comes into play, the additional KLF3 FD may be facilitating binding to more degenerate sites than the AZF alone can bind.
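To make the notion of a degenerate site concrete, the sketch below scans a sequence for 9 nt windows that match GCTGGGGGC except at the positions reported as tolerant (3, 5, 8 and 9). It is an illustrative simplification under stated assumptions: it treats every substitution at a tolerant position as acceptable, scans the forward strand only, and is not the procedure used to identify sites in this work.

```python
from typing import Iterable, List, Tuple

TARGET = "GCTGGGGGC"


def find_sites(seq: str, target: str = TARGET,
               tolerant_positions: Iterable[int] = (3, 5, 8, 9)) -> List[Tuple[int, str, int]]:
    """Return (start, site, mismatch count) for windows whose mismatches
    fall only at the tolerated 1-based positions."""
    tolerant = set(tolerant_positions)
    hits = []
    for start in range(len(seq) - len(target) + 1):
        window = seq[start:start + len(target)]
        # 1-based positions at which the window differs from the target.
        mismatched = [i + 1 for i, (a, b) in enumerate(zip(window, target)) if a != b]
        if all(position in tolerant for position in mismatched):
            hits.append((start, window, len(mismatched)))
    return hits


print(find_sites("AAGCTGGGGGCAA"))  # perfect site embedded in flanking sequence
print(find_sites("GCTGCGGGC"))      # 5G>C variant of the target sequence
print(find_sites("GCTCGGGGC"))      # mismatch at position 4, not tolerated
```

The 5G>C sequence used in the EMSA probe is accepted with one mismatch, whereas a substitution at a non-tolerant position such as position 4 is rejected.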


Figure 4.8 De novo motif analysis using MEME, DREME and RSAT. Top consensus motifs from regions bound by (A) KLF3FD-AZF and (B) both AZF and KLF3FD-AZF, and from regions differentially bound by (C) AZF and (D) KLF3FD-AZF. The DREME and RSAT peak-motifs tools were used to perform motif finding analysis on full-size ChIP-Seq peaks from each dataset, while MEME was used to analyse the top 600 ChIP-Seq peaks from each dataset.


Figure 4.9 Frequencies of nucleotides in percentage at each position for top motifs obtained from MEME. Top consensus motifs obtained from MEME for (A) regions differentially bound by KLF3FD-AZF, (B) total regions bound by KLF3FD-AZF and (C) total regions equally bound by AZF and KLF3FD-AZF. The table below each consensus motif contains the relative frequency, in percentage, of each A, C, G and T nucleotide observed at a given position.

Figure 4.10 AZF and KLF3FD-AZF display distinct binding patterns in vivo but not in vitro. (A) EMSA experiments showing AZF (left) and KLF3FD-AZF (right) binding to the AZF target sequence, labelled as ‘wild type AZF’ (lane 3). Upon addition of anti-V5 antibody, a supershift was observed, identifying the V5-tagged AZF proteins (marked with an asterisk). A radiolabelled probe containing a mutation at position 5 from a G base to a C base (labelled as ‘5G>C’) was included to investigate AZF and KLF3FD-AZF binding to a degenerate AZF target sequence that was bound in vivo by KLF3FD-AZF (lane 5, and lane 6 with addition of anti-V5 antibody). EV (empty vector) serves as a negative control. (B) A peak image from the ChIP-Seq experiments shows differential binding by AZF and KLF3FD-AZF in vivo at a region containing the corresponding degenerate AZF target site (black underline).


4.6.3 KLF3 FD, in the absence of a DBD, is capable of chromatin binding in vivo

We next assessed whether the KLF3 FD alone, in the absence of a DBD, is recruited to genomic sites. To achieve this, we performed ChIP-Seq on two HEK293 lines stably expressing a KLF3 FD-only protein tagged with a V5 epitope for immunoprecipitation, to investigate the genome-wide DNA occupancy of this protein (Figure 4.11A). Following the same bioinformatics pipeline described in Chapter 3.3, we identified 1439 sites bound by KLF3 FD across the genome that were consistent across the two biological replicates (Supplementary Table 4.4). In general, we observed lower normalised tag counts for these binding sites than for those bound by AZF or KLF3FD-AZF, an indication of weaker binding by the DBD-deficient protein. This is not surprising given the previous evidence implying that the KLF3 FD facilitates recruitment of a protein to DNA, while the DBD plays the primary role in DNA binding and recognition.

When we compared the sites bound by KLF3 FD to those occupied by KLF3FD-AZF, we found 264 sites commonly bound by both KLF3 FD and KLF3FD-AZF. This represents 23% of the genomic sites occupied by KLF3 FD (Figure 4.11B and Supplementary Table 4.5). The correlation of normalised tag counts at individual KLF3 FD and KLF3FD-AZF commonly bound peaks demonstrates different chromatin binding efficiencies between the two proteins, with the DBD-deficient protein, KLF3 FD, binding DNA much more weakly (Figure 4.11C). Representative peak images of KLF3 FD and KLF3FD-AZF binding to one of the common target sites, ARID3A, are shown in Figure 4.11D, and binding was validated with an independent ChIP-PCR experiment (Figure 4.11E).


This evidence suggests that the KLF3 FD, despite lacking a DBD, is still capable of in vivo chromatin binding, albeit with lower efficiency. We next used EMSA to investigate whether this protein binds to naked DNA in vitro. We assessed KLF3 FD binding to three DNA probes containing the AZF target site, a natural KLF3 target site, or a 50 nt DNA sequence derived from an ARID3B intron region. The latter has been shown to be bound by KLF3 FD in vivo. As shown in Figure 4.12, there was no detectable KLF3 FD binding to any of these naked DNA probes. KLF3FD-AZF was included as a positive control and, as expected, bound to the AZF target site but to neither the KLF3 target nor the ARID3B probe. Consistent with our results for AZF and KLF3FD-AZF chromatin binding, this indicates that the FD plays a role in DNA binding in vivo but not in vitro, and that this could be attributable to chromatin organisation in vivo.


Figure 4.11 ChIP-Seq and ChIP-PCR demonstrate KLF3 FD in vivo chromatin binding. (A) Schematic representation of the KLF3 FD protein with a V5-epitope tag at the C-terminus for immunoprecipitation. (B) A proportional Venn diagram showing 264 genomic regions bound by both KLF3FD-AZF and KLF3 FD. (C) The chromatin binding efficiency of KLF3 FD is lower than that of KLF3FD-AZF, as illustrated by the normalised tag counts of the two proteins at the commonly bound regions. A representative ChIP-Seq track (D) and ChIP-PCR (E) showing KLF3 FD genomic occupancy. ChIP-Seq tracks for KLF3FD-AZF and KLF3 FD are on scales of [0-500] and [0-100], respectively.


Figure 4.12 KLF3 FD, in the absence of a DBD, is incapable of in vitro DNA binding. EMSA was performed to assess the DNA binding activity of KLF3 FD to the AZF target sequence (wild type AZF probe), the full-length KLF3 protein target sequence (KLF3 CACCC box probe) and a 50 nt DNA sequence derived from an ARID3B intron region, shown to be bound by KLF3 FD in the KLF3 FD ChIP-Seq experiment. While KLF3FD-AZF binds to the AZF target sequence (lane 3) as expected, with lane 4 showing the supershift of the protein-DNA complex by an anti-V5 antibody, there is no detectable KLF3 FD binding to any of the probes (lanes 5, 6, 11, 12, 17 and 18).

4.6.4 KLF3FD-AZF generates peaks at known KLF3 target sites

To further analyse the nature of the peaks that were dependent on the KLF3 FD, we compared the genomic sites bound by KLF3FD-AZF to those bound by full-length unmodified KLF3 protein, which contains the same FD but in its native configuration upstream of the KLF3 zinc finger DBD rather than the AZF domain. KLF3 is known to bind CACCC boxes and GC-rich elements in the regulatory regions of its target genes via its C2H2 zinc finger DBD (139). We examined our previously published KLF3 ChIP-Seq dataset, which was produced by experiments in immortalised murine embryonic fibroblasts (MEFs) (75). As shown in Figure 4.13, comparing peaks bound by the full-length KLF3 protein with those bound by the KLF3 FD deletion mutant (DBD) in the loss-of-function experiments demonstrated decreased genomic occupancy in the absence of the KLF3 FD. We refined our search to focus on peaks that were generated by full-length KLF3 but not by the KLF3 deletion mutant, the DBD-only protein missing the FD. This generates a list representing regions that require the presence of the KLF3 FD for DNA binding, as determined by loss-of-function experiments.

Figure 4.13 A proportional Venn diagram showing regions bound by unmodified full length KLF3 protein and FD-deficient KLF3 DNA binding domain only protein (DBD), analysed from the previous loss-of-function experiments (75).

We then compared this dataset to the list of peaks differentially bound by KLF3FD-AZF but not by AZF, obtained from the DiffBind analysis. This creates a list of KLF3 FD-dependent peaks generated in the gain-of-function experiments. The ChIP-Seq experiments used for comparison were performed in two different cell lines from two different species and thus cannot be directly overlaid. We therefore compared only the promoter occupancies (by overlapping downstream gene names from the two lists), which are more highly conserved between mouse and human, rather than focussing on less well conserved non-coding regions in the genome. The issue of interspecies gene nomenclature differences was addressed by converting all the downstream gene names to the corresponding human orthologs using HCOP: Orthology Prediction Search. Two refined lists of KLF3 FD-dependent promoter occupancies, containing 3,525 and 1,720 promoters, respectively, were created from the two ChIP-Seq datasets (Figure 4.14A). Common promoter binding events were calculated based on the occurrence of common downstream gene names in the two lists. As shown in Figure 4.14A, we would expect to find approximately 0.8%, or approximately 223, common promoter peaks between the two datasets by chance, taking into consideration the number of promoter peaks found in each dataset as a percentage of the total number of protein-coding genes (as an estimate of the total number of promoters) (152). Interestingly, we identified at least 578 peaks that were generated by KLF3FD-AZF and not by AZF alone and were similarly bound by full-length KLF3 but not by its counterpart lacking the KLF3 FD, with a p-value of <0.0001 from a Chi-squared test assessing the significance of the difference between the observed and expected numbers of common promoter binding events. This represents more than twice the expected number of common promoter binding events based on the probabilistic calculation. ChIP-Seq tracks in Figure 4.14B illustrate an example of KLF3 FD-dependent binding near the SH3GL1 promoter, an endogenous KLF3 target, where we observed binding only by the KLF3 FD-containing constructs, full-length KLF3 and the KLF3FD-AZF fusion protein, but not by the KLF3 FD-deficient constructs, DBD and AZF. Lists of the KLF3 FD-dependent binding sites are included as Table 4.4 (top 100) and Supplementary Table 4.6 (complete list). This result is surprising because the AZF is hypothesised to bind GCTGGGGGC sequences in the genome and thus is not expected to be enriched at this proportion of KLF3 target genes.
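The expected number of shared promoters can be approximated by treating the two promoter lists as independent draws from the same set of genes. The sketch below assumes a total of 27,000 protein-coding genes (the figure used for the promoter estimate in Section 4.6.1; the exact total taken from reference 152 is not stated), and it uses a simple one-degree-of-freedom goodness-of-fit statistic as one plausible form of the Chi-squared test, which may differ from the test actually performed.

```python
import math
from typing import Tuple


def expected_overlap(n_a: int, n_b: int, n_total: int) -> float:
    """Expected number of shared genes if the two lists were independent."""
    return n_a * n_b / n_total


def chi2_one_df(observed: float, expected: float, n_total: int) -> Tuple[float, float]:
    """Goodness-of-fit statistic over two categories (shared, not shared) and its p-value."""
    stat = ((observed - expected) ** 2 / expected
            + ((n_total - observed) - (n_total - expected)) ** 2 / (n_total - expected))
    # Upper tail of the chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


N_GENES = 27_000  # assumed total number of protein-coding genes

expected = expected_overlap(3525, 1720, N_GENES)
stat, p_value = chi2_one_df(578, expected, N_GENES)
print(f"expected {expected:.0f}, observed/expected {578 / expected:.2f}, p = {p_value:.3g}")
```

Under this assumption the expected overlap is about 225 promoters, close to the reported 223, and the observed 578 is more than twice that value with a p-value far below 0.0001.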

Using a similar analysis procedure, we compared the KLF3 FD-dependent promoter peaks from the previous KLF3 loss-of-function experiment (75) to the KLF3 FD promoter peaks obtained in the current study, consisting of 107 promoter regions bound by the DBD-deficient KLF3 FD protein. We found 45 common promoter binding events overlapping the two lists based on common gene names. This represents 42% of the total promoter peaks bound by the KLF3 FD protein lacking a DBD (Figure 4.14C and Supplementary Table 4.7). Despite the inevitable caveats associated with this comparative analysis method as a result of human and mouse genome differences, these observations remain striking.


Figure 4.14 Common promoter binding analysis. (A) 3525 KLF3 FD-dependent promoter binding events were obtained from our published KLF3 ChIP-Seq study (75) and were overlapped with 1720 promoter regions differentially bound by KLF3FD-AZF. 578 common promoter occupancies were observed, compared with the expected count of 223. (B) Example showing KLF3FD-AZF binding to an endogenous KLF3 target site that the KLF3 DBD-only protein and AZF are unable to bind. Unmodified KLF3 protein shows stronger enrichment at a region in the Sh3gl1 promoter than its counterpart lacking the KLF3 FD (DBD). While AZF shows weak or negligible enrichment at the same region, KLF3FD-AZF, which contains the KLF3 FD, shows stronger enrichment. (C) 3525 KLF3 FD-dependent promoter binding events were obtained from a published KLF3 ChIP-Seq study and were overlapped with 107 promoter regions bound by the DBD-deficient KLF3 FD. 45 common promoter occupancies were observed.

One possible explanation is that the KLF3FD-AZF fusion protein may be dimerising with endogenous KLF3 in HEK293 cells, and is thus directed to KLF3’s normal target genes. To explore this possibility we tested whether endogenous KLF3 protein was expressed in HEK293 cells. No signal corresponding to full-length KLF3 protein could be detected by Western blot (Figure 4.15). COS (Lane 2) and HepG2 (Lane 4) protein extracts were included as negative and positive controls, respectively. KLF3 protein was undetectable in the former, and a distinct band corresponding to the expected size of KLF3 protein was seen in the latter. We therefore conclude that the KLF3 FD is not likely to be dimerising with endogenous KLF3 to facilitate genomic localisation but, more interestingly, may be contacting another molecule, perhaps another transcription factor or a histone mark that identifies KLF3 target promoters.

Taken together, our results suggest that the KLF3 FD is both required to localise KLF3 to certain target genes in loss-of-function experiments, and is also capable of redirecting an artificial AZF to known target genes in gain-of-function experiments.


Table 4.4 Common promoter binding analysis. 3525 and 1720 KLF3 FD dependent promoter binding events (differentially bound by KLF3FD-AZF, and not by AZF) obtained from Burdach et al. 2013 (75) and current study, respectively, were overlapped based on the common downstream gene names. There were 578 common promoters in both datasets, two times more than the expected common promoters by chance.

Common promoter binding Downstream gene name (human) Showing 100 out of the total 578

ZNF768 SLBP GAS1 DNAJC17 YWHAZ ZNF3 PGAP1 MEF2D DDX55 MSH6 RNFT2 YAP1 CDKN2AIP ATR LAMP1 ZNF212 OSBPL2 FAM8A1 ZNF451 SLC22A5 SMARCD3 DPYSL2 NACC1 SAP130 GMFB RBPJ TBC1D4 ATP6AP1 TWSG1 SERBP1 DNLZ OGDH NDUFAF4 SURF6 SPAG9 SLC25A3 WIZ TSPAN14 KCTD5 NFYB DGCR8 PSMB6 TNFRSF10B FAM72B FLYWCH1 TRPM7 SH3BGRL2 CDV3 CYHR1 RANBP2 TAF5L SLC45A1 DIDO1 STX16 HMGB2 MDM2 NFE2L1 PDP2 RAPGEF6 TMEM194A AMN1 GFER MTF2 ASF1A WNT9A PAPD7 ATL2 SLC35F5 ZDHHC2 CTNNB1 TPD52 MCM6 HMG20B BMF KPNB1 POMP FNIP2 PIP5K1C PPP2CA ALG14 C3orf58 RFX3 LCLAT1 PFKP POLR1D LENG8 TPBG PRICKLE1 FOSL2 ZBTB11 ABHD14B TMED2 POGZ AMMECR1L USP25 RNF139 SLC3A2 PACSIN2 RNF34 PLEKHF1


Figure 4.15 Undetectable protein expression of endogenous KLF3 in HEK293 cells. Western blot was performed using an anti-KLF3 antibody. Lane 1 contains nuclear extract from COS cells transiently transfected with pMT3-KLF3 expression plasmid. Lane 2 is the negative control containing nuclear extract from a KLF3-null cell line, untransfected COS cells. Lane 3 and lane 4 were loaded with nuclear extract from HEK293 and HepG2 cells, respectively. HepG2 was included as a positive control.

4.7 Discussion

In Chapter 3, we reported promiscuous binding across the human genome of a three zinc finger artificial DNA binding protein (AZF) designed to target a 9 nt DNA sequence in the VEGF-A promoter (107). In this chapter, we compared the genomic occupancy of this AZF protein to that of a fusion protein containing the N-terminal FD from the well-studied transcription factor KLF3 fused to the AZF protein. KLF3 is a member of the KLF family of transcription factors, which consists of 17 members in total including KLF3. Despite sharing highly conserved DBDs, different KLF proteins regulate different sets of target genes (71,72,85). We previously observed in a loss-of-function study that the KLF3 FD may have a role in in vivo target specificity (75). The current study thus aimed to further investigate the role of the KLF3 FD in in vivo target recognition using a gain-of-function approach via fusion to an unrelated zinc finger DBD, and also to assess the effect of adding an FD, one that is implicated in in vivo target recognition, on the target specificity of the AZF. KLF3 itself has 3 classical C2H2 zinc fingers, so by fusing the FD upstream of the 3 zinc finger artificial DBD, we are essentially producing a protein that is related to KLF3, but has different zinc fingers.

From our ChIP-Seq experiments, we detected around 50,000 peaks from the KLF3FD-AZF sample, that is, about twice as many peaks as observed with the AZF alone. In many cases the KLF3FD-AZF generated peaks reside at the same location as those of the AZF alone. We examined the elements where we observed strong peaks with the KLF3FD-AZF fusion protein and weaker or no peaks with the AZF alone, using a cut-off where the KLF3FD-AZF peaks were at least twice as high. These peaks had a number of interesting features. Firstly, they were enriched in promoters (46% were in promoter regions). Secondly, they contained a centrally enriched consensus sequence that resembled but was not identical to the 9 nt recognition element of the AZF. Thirdly, approximately 34% (578 out of 1720) of the promoter peaks mapped to known KLF3-bound gene promoters previously shown to depend upon the presence of the KLF3 FD in loss-of-function experiments. That is, both loss-of-function and gain-of-function experiments implicated the KLF3 FD in targeting its cargo to a shared set of gene promoters.

These results provide evidence that the KLF3 FD, the N-terminal, non-DNA-binding domain that is thought to function primarily in turning genes off by recruiting the co-repressor CtBP, has a second function in facilitating localisation of the protein to its target genes. This is not unprecedented and aligns with recent work on other SP/KLF family transcription factors, which assessed and compared the genomic occupancy of three SP proteins (48). While SP1 and SP3 were found to function as conventional DNA-binding proteins, with the 3 zinc finger domains being necessary and sufficient to localise the protein to CACCC and GC-rich elements in the genome, the zinc finger DBD of SP2 was entirely dispensable for such activity and the N-terminal FD of SP2 was necessary and sufficient for promoter targeting.

These findings may help explain why different family members, for example, different members of the KLF family, bind and regulate different sets of target genes in vivo, despite having near identical DBDs (72,74,83). One might expect all KLFs to bind to all CACCC or GC-containing regulatory elements as specified by the DBD, but it now seems more likely that gene targeting is orchestrated by the combined action of the shared zinc finger DBD and the divergent N-terminal FDs. Thus the variable FDs between the KLF members may explain the different sets of genes regulated by different KLF proteins. This is also supported by our results on the genome localisation of the KLF3 FD-only protein, which is capable of in vivo chromatin binding in the absence of a DBD. Its reduced chromatin binding affinity compared to KLF3FD-AZF is likely to reflect a role in target recognition that is secondary to that of a DBD, in this case the artificial zinc finger domain.

Several possibilities could be envisaged to explain how the FDs influence gene targeting. They could bind directly to DNA or be recruited to DNA indirectly via a gene-localised non-coding RNA, another DNA-bound transcription factor or cofactor, or a specific histone or transcription factor mark. We are currently investigating these possibilities. To date we have found no good evidence that the FD of KLF3 binds to DNA or RNA, nor does it contain any known histone or histone mark binding domain (such as a bromodomain or chromodomain). KLF3 is known to bind the co-repressor CtBP (170). CtBP can dimerise (171,172) and is resident at many different promoters (173), so recruitment via CtBP is an attractive possibility. Previous experiments using a KLF3 mutant that is unable to bind CtBP showed that CtBP may be important for proper KLF3 localisation to promoters (75). However, this mutant showed largely similar genomic binding profiles to those of the wild-type KLF3 protein at the common KLF3 FD-dependent promoter binding sites (the 578 KLF3 FD-dependent promoter binding sites obtained by overlapping the two related KLF3 and KLF3FD-AZF ChIP-Seq data sets). Thus it seems unlikely that recruitment by CtBP completely accounts for the ability of the KLF3 FD to participate in genome targeting in vivo. In the case of SP2, a protein from the SP family that is closely related to the KLF family, the FD was found to localise to a duplicated CCAAT element that is known to be bound by NF-Y. Nevertheless, no direct binding to NF-Y was detected (48), so in both cases the precise mechanism behind the gene targeting remains unknown.
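The overlap of two ChIP-Seq peak sets referred to above can be illustrated with a minimal sketch. This is not the analysis pipeline used in this thesis; the function name, the half-open coordinate convention and the toy peak coordinates are all assumptions made purely for illustration.

```python
from collections import defaultdict


def overlapping_peaks(peaks_a, peaks_b):
    """Return the peaks in peaks_a that overlap at least one peak in peaks_b.

    Each peak is a (chrom, start, end) tuple with half-open coordinates
    [start, end). This convention is an assumption of the sketch.
    """
    # Group the second peak set by chromosome so that only peaks on the
    # same chromosome are compared.
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    shared = []
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < b_end and b_start < end for b_start, b_end in by_chrom[chrom]):
            shared.append((chrom, start, end))
    return shared


# Toy coordinates, invented for illustration only (not data from this study).
klf3_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
klf3fd_azf_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]

print(overlapping_peaks(klf3_peaks, klf3fd_azf_peaks))
```

In this toy example only the first chr1 peak is shared between the two sets. Peaks that merely touch at a boundary are not counted as overlapping under the half-open convention assumed here.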

A model explaining our hypothesis is illustrated in Figure 4.16. The AZF binds to genes containing its canonical binding site or a near fit to this site. It is unable to bind to sites that diverge too far from the ideal consensus element. When the KLF3 FD is fused to the AZF, the fusion protein can also recognise optimal binding sites. In addition, some property of the KLF3 FD allows localisation to weak, divergent consensus elements. This property may be related to interactions between a genomic element and the KLF3 FD, as illustrated.


Figure 4.16 Models illustrating possible mechanisms of AZF and KLF3FD-AZF binding to the 9 nt target sequence, and of KLF3FD-AZF but not AZF binding to a more degenerate sequence, in vivo. This mechanism involves the DBD of a transcription factor, in this case the zinc fingers of KLF3 or the artificial zinc fingers of KLF3FD-AZF, which establishes a protein-DNA interaction, and the FD of the transcription factor, in this case the KLF3 FD, which interacts with a component, possibly via protein-protein interaction, to facilitate and specify transcription factor binding to DNA. (A) Both AZF and KLF3FD-AZF bind to the 9 nt target sequence GCTGGGGGC in vivo, relying mainly on zinc finger-DNA interactions. This represents regions that are bound equivalently by both AZF and KLF3FD-AZF, as depicted by the ChIP-Seq track. (B) KLF3FD-AZF but not AZF binds to a degenerate site via a weak zinc finger-DNA interaction, and the binding is facilitated further by a specific interaction between the KLF3 FD and a component shown in grey, which could be another protein, a histone or histone mark, or RNA.
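The distinction drawn in the model between a canonical site and a degenerate site can be sketched as a simple sequence scan. The sketch below covers only the sequence component of the model, not the FD-mediated interaction. The mismatch threshold, the function names and the example sequence are hypothetical, and counting mismatches is a deliberate simplification of how binding preferences are normally represented.

```python
TARGET = "GCTGGGGGC"  # 9 nt AZF target sequence given in Figure 4.16

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(sequence, target=TARGET, max_mismatches=0):
    """Return (start, strand, mismatches) for each window close to the target.

    max_mismatches=0 finds only canonical sites; larger values (an
    illustrative assumption, not a measured threshold) also admit
    degenerate sites.
    """
    sequence = sequence.upper()
    k = len(target)
    motifs = (("+", target), ("-", reverse_complement(target)))
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        for strand, motif in motifs:
            mismatches = count_mismatches(window, motif)
            if mismatches <= max_mismatches:
                hits.append((i, strand, mismatches))
    return hits


# Invented example: one canonical site and one site with a single mismatch.
example = "AAAGCTGGGGGCAAAGCTGAGGGCAAA"
print(find_sites(example))                    # canonical site only
print(find_sites(example, max_mismatches=1))  # canonical and degenerate sites
```

In terms of the model, the strict scan stands in for sites that the AZF alone might occupy, while the relaxed scan stands in for the wider set of weak sites at which an additional FD-mediated interaction is proposed to stabilise binding.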

Given the size of the human genome (3 billion bp), a motif would need to be greater than 16 bp in length to be unique if a random nucleotide distribution is assumed. Despite this, most eukaryotic TF motifs are rather short and only some positions carry strong sequence preference (77). The zinc finger transcription factors of the KLF family, for instance, recognise a 9 bp sequence with only 4 of these positions being restricted to a single, specific nucleotide (46,75,76). Thus, it is possible that many natural transcription factors have evolved to use both protein-DNA contacts made through their DBDs and protein-protein interactions via their FDs in order to localise specifically to only the repertoire of their target genes. Recognising this possibility, and analysing more FDs for DNA-binding activity or specificity functions, may reveal insights that will be useful in designing AZF-FD fusion proteins with increased or even more restricted AZF specificity in the future.
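The arithmetic behind the motif length argument above can be checked with a short calculation. This is a rough sketch that assumes a uniform, independent nucleotide distribution. The function names are hypothetical, and treating the KLF motif as having exactly four fully specified positions, with all other positions unconstrained, is a simplification.

```python
def min_unique_motif_length(genome_size_bp, both_strands=False):
    """Smallest k for which 4**k exceeds the number of possible start positions."""
    positions = genome_size_bp * (2 if both_strands else 1)
    k = 1
    while 4 ** k <= positions:
        k += 1
    return k


def expected_random_matches(genome_size_bp, fixed_positions):
    """Expected number of chance matches on one strand to a motif with the
    given number of fully specified positions."""
    return genome_size_bp / 4 ** fixed_positions


GENOME_BP = 3_000_000_000  # approximate human genome size used in the text

print(min_unique_motif_length(GENOME_BP))                     # 16
print(min_unique_motif_length(GENOME_BP, both_strands=True))  # 17
print(expected_random_matches(GENOME_BP, 9))  # fully specified 9 bp site
print(expected_random_matches(GENOME_BP, 4))  # only 4 positions constrained
```

Under these assumptions the threshold falls at about 16 to 17 bp, depending on whether both strands are counted, which is in line with the figure quoted above. A motif with only four constrained positions would be expected to match by chance at millions of positions, which illustrates why sequence recognition by the DBD alone is unlikely to account for the restricted set of sites bound in vivo.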


Chapter 5 General Discussion and Conclusions

5.1 General Discussion and Conclusions

In the current study, we assessed the in vivo specificity of a three-zinc-finger artificial DNA-binding protein and investigated the effect of adding a non-DNA-binding domain on the target specificity of this artificial DNA-binding domain (DBD). Genome-wide specificity was assessed by protein binding studies using ChIP-Seq, chromatin immunoprecipitation followed by high-throughput DNA sequencing. We reported widespread binding of this three-zinc-finger DNA-binding protein, which may be partly explained by the degeneracies of the binding motif. As the non-DNA-binding domain, we chose the functional domain (FD) of the KLF3 protein, which was previously reported to be implicated in in vivo target specificity (75). Upon fusion to this domain, the artificial zinc finger protein was directed to new sites, a large proportion of which correspond to endogenous target sites of the KLF3 protein. In addition, ChIP-Seq experiments with the KLF3 FD alone showed that this truncated protein, which lacks a DBD, has in vivo DNA-binding capability. We thus hypothesised that the artificial zinc finger protein is recruited to these new sites via an additional interaction between the KLF3 FD and a component that remains to be elucidated, which could be a DNA- or histone-binding protein partner of KLF3 or an RNA molecule. This gain-of-function study therefore confirmed the finding of our published loss-of-function experiments (75) that the KLF3 FD is involved in in vivo DNA-binding specificity.

Since KLF3 is a member of the KLF family of zinc finger transcription factors, this hypothesis may also apply to other members of the family, such that the different FDs of the KLFs may interact with different protein partners, resulting in recruitment of these proteins to different sets of target genes. This may provide an explanation for how specificity is achieved among members of this family of proteins despite the DBDs showing highly similar DNA-binding specificity in vitro.

Since the elucidation of the structure of DNA by Watson and Crick more than 50 years ago, research has focused on understanding how proteins interact with DNA. In the past, target recognition by DNA-binding proteins, especially transcription factors, was thought to rely solely on the DBDs. However, with the advent of next-generation technologies that allow studies such as genome-wide mapping of transcription factor occupancy, there is increasing evidence that factors other than the DBD are required to specify transcription factor target recognition in vivo (45,47,48).

In the current study, we reported that adding a non-DNA-binding domain to a zinc finger protein directs the protein to other genomic sites that are not solely specified by the DBD, and that a secondary interaction, direct or indirect, between the non-DNA-binding domain and those genomic sites may facilitate binding at those regions. This finding thus adds to our understanding of the in vivo target specificity of zinc finger transcription factors, and may apply to other closely related families of transcription factors such as the SP family (52).

The current study on the target specificity of a three-zinc-finger artificial DNA-binding protein also improved our understanding of the in vivo DNA-binding specificity of this artificial zinc finger protein. This novel and powerful artificial DNA-binding technology has been proposed for use in therapeutic settings due to the versatility and effectiveness of the system in genome editing and gene regulation (18,89,96,153). However, prior to this study, the in vivo DNA-binding specificities of these proteins were still poorly understood. While researchers have generated six-zinc-finger artificial DNA-binding proteins hoping to achieve single-gene specificity suitable for therapeutic applications (134,164,167), a recent study by Grimmer and colleagues reported promiscuous genome-wide binding of two six-zinc-finger proteins as a consequence of the use of subsets of the six zinc fingers (135). In fact, Najafabadi et al. pointed out in their study that, on average, only 45% of the C2H2 zinc finger domains in each natural zinc finger protein assessed are utilised for direct DNA binding in vivo (77). Thus, adding more zinc fingers may not be sufficient to improve the specificity of artificial zinc finger proteins to a therapeutically relevant degree. If these proteins are not carefully designed, off-target effects arising from non-specific targeting may lead to reduced efficiency of on-target gene regulation or, worse, an adverse event in a therapeutic setting. The design of artificial zinc finger proteins has been heavily reliant on recognition codes based on mutation of specificity residues within well-defined structural templates. Studies have shown that these recognition codes may be incomplete because they do not fully account for factors such as DNA-protein side-chain side-chain interactions and the three-dimensional structures of the interacting proteins and DNA (77,174). These factors may contribute to the degeneracies observed in the DNA-binding motif of the artificial zinc finger protein studied in the current work. Recently, a group attempted to derive a new version of the recognition codes by combining data from a modified bacterial one-hybrid system with protein-binding microarray and chromatin immunoprecipitation analyses. They showed, however, that the contributions of non-specificity residues to DNA binding are complex and difficult to model (77). In the current work, we demonstrated that a non-DNA-binding domain can be used to recruit the artificial zinc finger protein to new sets of genes. Thus, further understanding how non-DNA-binding domains are involved in in vivo target specificity, and incorporating this information into the design of artificial DNA-binding proteins, may help to create novel artificial DNA-binding proteins with improved specificity in vivo via fusion to FDs from natural transcription factors.

Another interesting observation from the current study was that, in general, at regions commonly bound by AZF and KLF3FD-AZF, we observed greater peak heights in the KLF3FD-AZF sample than in the AZF sample, which may imply an additional role of the KLF3 FD in enhancing the DNA-binding affinity of a protein. Interestingly, a similar phenomenon was observed in our previous loss-of-function study, where the full-length KLF3 sample showed greater peak heights than the DBD-alone sample (75). However, it remains to be determined whether this observation is the direct result of improved binding affinity conferred by the KLF3 FD. Our current study thus showed that the FD may be important for both DNA-binding specificity and DNA-binding efficiency, and that fusing one such domain to an artificial DNA-binding protein may improve not only the specificity but also the efficiency of these proteins in regulating endogenous target gene expression. This is also relevant for applications involving another artificial DNA-binding technology, the transcription activator-like effector (TALE) (19). While a few studies reported impressive in vivo target specificity of these proteins (175,176), this system has mainly been used for genome editing. The effectiveness of gene regulation by TALEs has been inconsistent (176-178). In one study, the synergistic activity of a number of TALE proteins, each fused to a strong activation domain, was required to sufficiently activate an endogenous target gene (176). With a protein such as a TALE, which is more than three times larger than a typical zinc finger protein (112), this is therapeutically impractical. By understanding how natural transcription factors access the large genome and specifically and efficiently regulate the expression of their target genes, we may be able to engineer next-generation artificial DNA-binding proteins that are both more specific and more efficient.

5.2 Future directions

There has been exciting progress in understanding the specificity of both natural and artificial zinc finger DNA-binding proteins. However, significant challenges remain, and more work is needed to fully understand natural zinc finger proteins and so enable the development of artificial zinc finger proteins that are versatile and reliable for widespread application in biochemical research and gene therapy. One possible extension of the current study is to elucidate the molecular mechanisms by which non-DNA-binding domains interact with DNA and, in the case of indirect interactions, to identify the genomic elements that allow the interactions to occur. To date, there is no evidence of direct interaction between regions outside of a DBD and DNA. We thus favour the hypothesis that non-DNA-binding domains, such as the KLF3 FD in the current study, are recruited to DNA via interaction with another genomic element, which could be a DNA-binding, histone-binding or even non-DNA-binding protein, or an RNA. This is not unprecedented. GATA1, for example, is known to be recruited to different sets of target genes, and thus to carry out different biological functions, depending on the availability of FOG1, a non-DNA-binding protein interacting with GATA1 via its non-DNA-binding N-terminal zinc fingers (44,45). A mechanism was proposed in which FOG1 acts to stabilise GATA1 binding at a fraction of its target sites via interaction with other chromatin factors such as ETS family proteins (45).


Therefore, the first future studies will involve identification of KLF3 protein partners. One well-studied protein partner of KLF3 is CtBP (170), a non-DNA-binding co-repressor that is known to interact with a large complex consisting of co-repressors and histone-modifying enzymes (172). Previously, we assessed the genomic occupancy of a KLF3 mutant that is unable to interact with CtBP (75). While that study revealed that CtBP may be important for KLF3 localisation to promoters, this does not explain the increased genomic occupancy of KLF3FD-AZF relative to AZF observed in the current study. A handful of other KLF3 protein partners were recently identified in a high-throughput yeast two-hybrid study (179). Much work is still required to further characterise the interactions between KLF3 and these proteins. Assessing the genomic occupancy of KLF3 mutants unable to interact with these partner proteins, and comparing it to that of wild-type KLF3, may reveal the role of these partners in KLF3 genomic occupancy.

While there is currently no evidence that KLF3 possesses RNA-binding activity, the genome encodes large numbers of different types of RNAs and there has recently been an explosion of evidence on their roles in gene regulation (180). It is thus possible that KLF3 is recruited to the FD-specific targets by an RNA molecule. One of the best examples of RNA-guided recruitment of a protein to target DNA is the clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9 system, which acts via formation of a complementary RNA-DNA duplex (104). Experiments such as RNA immunoprecipitation (RIP) (181,182) or crosslinking immunoprecipitation (CLIP) (183) followed by high-throughput DNA sequencing will be carried out to identify RNA-KLF3 or RNA-KLF3 FD interactions.


In addition, the current study will be expanded to include the whole KLF family, consisting of 17 members, in domain-swapping experiments followed by genome-wide DNA-binding studies, to further interrogate the important biological question of how transcription factors from a family of proteins sharing near-identical DBDs achieve specificity.

Lastly, it would also be of interest to study the role of non-DNA-binding domains in in vivo target specificity in other artificial DNA-binding systems, including TALE- and CRISPR/dCas9-based DNA-targeting strategies. Previously, it was reported that addition of a KRAB domain (a repressor domain from the KOX1 protein) to artificial zinc fingers increases genome-wide occupancy five-fold compared to the counterpart lacking a KRAB domain (135). More recently, a similar study showed that fusion of the same effector domain to the dCas9 protein does not increase recruitment to DNA (184). It is thus of interest to investigate whether the KLF3 FD from the current study, or FDs from other natural transcription factors, are able to improve the target specificity of these proteins.


References

1. Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol, 3, 318-356.

2. Bannister, A.J. and Kouzarides, T. (2011) Regulation of chromatin by histone modifications. Cell Res, 21, 381-395.

3. Conaway, R.C. and Conaway, J.W. (2011) Function and regulation of the Mediator complex. Curr Opin Genet Dev, 21, 225-230.

4. Fuda, N.J., Ardehali, M.B. and Lis, J.T. (2009) Defining mechanisms that regulate RNA polymerase II transcription in vivo. Nature, 461, 186-192.

5. Ho, L. and Crabtree, G.R. (2010) Chromatin remodelling during development. Nature, 463, 474-484.

6. Roeder, R.G. (2005) Transcriptional regulation and the role of diverse coactivators in animal cells. FEBS Lett, 579, 909-915.

7. Spitz, F. and Furlong, E.E. (2012) Transcription factors: from enhancer binding to developmental control. Nat Rev Genet, 13, 613-626.

8. Lee, T.I. and Young, R.A. (2013) Transcriptional regulation and its misregulation in disease. Cell, 152, 1237-1251.

9. Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A. and Luscombe, N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat Rev Genet, 10, 252-263.

10. Mitchell, P.J. and Tjian, R. (1989) Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science, 245, 371-378.

11. Levine, M. and Tjian, R. (2003) Transcription regulation and animal diversity. Nature, 424, 147-151.

12. Takeda, Y., Ohlendorf, D.H., Anderson, W.F. and Matthews, B.W. (1983) DNA-binding proteins. Science, 221, 1020-1026.

13. Luscombe, N.M., Austin, S.E., Berman, H.M. and Thornton, J.M. (2000) An overview of the structures of protein-DNA complexes. Genome Biol, 1, REVIEWS001.

14. Ashton, N.W., Bolderson, E., Cubeddu, L., O'Byrne, K.J. and Richard, D.J. (2013) Human single-stranded DNA binding proteins are essential for maintaining genomic stability. BMC Mol Biol, 14, 9.

15. Brent, R. and Ptashne, M. (1985) A eukaryotic transcriptional activator bearing the DNA specificity of a prokaryotic repressor. Cell, 43, 729-736.

16. Frankel, A.D. and Kim, P.S. (1991) Modular structure of transcription factors: implications for gene regulation. Cell, 65, 717-719.

17. Fields, S. and Song, O. (1989) A novel genetic system to detect protein-protein interactions. Nature, 340, 245-246.

18. Klug, A. (2010) The discovery of zinc fingers and their applications in gene regulation and genome manipulation. Annu Rev Biochem, 79, 213-231.

19. Bogdanove, A.J. and Voytas, D.F. (2011) TAL effectors: customizable proteins for DNA targeting. Science, 333, 1843-1846.

20. Klug, A. and Rhodes, D. (1987) Zinc fingers: a novel protein fold for nucleic acid recognition. Cold Spring Harb Symp Quant Biol, 52, 473-482.

21. Rhodes, D. and Klug, A. (1993) Zinc fingers. Sci Am, 268, 56-59, 62-55.

22. Klug, A. and Schwabe, J.W. (1995) Protein motifs 5. Zinc fingers. FASEB J, 9, 597-604.

23. Wolfe, S.A., Nekludova, L. and Pabo, C.O. (2000) DNA recognition by Cys2His2 zinc finger proteins. Annu Rev Biophys Biomol Struct, 29, 183-212.

24. Laity, J.H., Lee, B.M. and Wright, P.E. (2001) Zinc finger proteins: new insights into structural and functional diversity. Curr Opin Struct Biol, 11, 39-46.

25. Brown, D.D. (1984) The role of stable complexes that repress and activate eucaryotic genes. Cell, 37, 359-365.

26. Iuchi, S. (2001) Three classes of C2H2 zinc finger proteins. Cell Mol Life Sci, 58, 625-635.

27. Krishna, S.S., Majumdar, I. and Grishin, N.V. (2003) Structural classification of zinc fingers: survey and summary. Nucleic Acids Res, 31, 532-550.

28. Hudson, W.H. and Ortlund, E.A. (2014) The structure, function and evolution of proteins that bind DNA and RNA. Nat Rev Mol Cell Biol, 15, 749-760.

29. Font, J. and Mackay, J.P. (2010) Beyond DNA: zinc finger domains as RNA-binding modules. Methods Mol Biol, 649, 479-491.

30. Brown, R.S. (2005) Zinc finger proteins: getting a grip on RNA. Curr Opin Struct Biol, 15, 94-98.

31. Lu, D., Searles, M.A. and Klug, A. (2003) Crystal structure of a zinc-finger-RNA complex reveals two modes of molecular recognition. Nature, 426, 96-100.

32. Gamsjaeger, R., Liew, C.K., Loughlin, F.E., Crossley, M. and Mackay, J.P. (2007) Sticky fingers: zinc-fingers as protein-recognition motifs. Trends Biochem Sci, 32, 63-70.

33. Pavletich, N.P. and Pabo, C.O. (1991) Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1 A. Science, 252, 809-817.

34. Fairall, L., Schwabe, J.W., Chapman, L., Finch, J.T. and Rhodes, D. (1993) The crystal structure of a two zinc-finger peptide reveals an extension to the rules for zinc-finger/DNA recognition. Nature, 366, 483-487.

35. Berg, J.M. (1990) Zinc finger domains: hypotheses and current knowledge. Annu Rev Biophys Biophys Chem, 19, 405-421.

36. Shi, Y. and Berg, J.M. (1995) A direct comparison of the properties of natural and designed zinc-finger proteins. Chem Biol, 2, 83-89.

37. Rebar, E.J. and Pabo, C.O. (1994) Zinc finger phage: affinity selection of fingers with new DNA-binding specificities. Science, 263, 671-673.

38. Greisman, H.A. and Pabo, C.O. (1997) A general strategy for selecting high-affinity zinc finger proteins for diverse DNA target sites. Science, 275, 657-661.

39. Elrod-Erickson, M. and Pabo, C.O. (1999) Binding studies with mutants of Zif268. Contribution of individual side chains to binding affinity and specificity in the Zif268 zinc finger-DNA complex. J Biol Chem, 274, 19281-19285.

40. Berg, J.M. (1988) Proposed structure for the zinc-binding domains from transcription factor IIIA and related proteins. Proc Natl Acad Sci U S A, 85, 99-102.

41. Choo, Y. and Klug, A. (1994) Toward a code for the interactions of zinc fingers with DNA: selection of randomized fingers displayed on phage. Proc Natl Acad Sci U S A, 91, 11163-11167.

42. Choo, Y. and Klug, A. (1994) Selection of DNA binding sites for zinc fingers using rationally randomized DNA reveals coded interactions. Proc Natl Acad Sci U S A, 91, 11168-11172.

43. Merika, M. and Orkin, S.H. (1995) Functional synergy and physical interactions of the erythroid transcription factor GATA-1 with the Kruppel family proteins Sp1 and EKLF. Mol Cell Biol, 15, 2437-2447.

44. Tsang, A.P., Visvader, J.E., Turner, C.A., Fujiwara, Y., Yu, C., Weiss, M.J., Crossley, M. and Orkin, S.H. (1997) FOG, a multitype zinc finger protein, acts as a cofactor for transcription factor GATA-1 in erythroid and megakaryocytic differentiation. Cell, 90, 109-119.

45. Chlon, T.M., Dore, L.C. and Crispino, J.D. (2012) Cofactor-mediated restriction of GATA-1 chromatin occupancy coordinates lineage-specific gene expression. Mol Cell, 47, 608-621.

46. Tallack, M.R., Whitington, T., Yuen, W.S., Wainwright, E.N., Keys, J.R., Gardiner, B.B., Nourbakhsh, E., Cloonan, N., Grimmond, S.M., Bailey, T.L. et al. (2010) A global role for KLF1 in erythropoiesis revealed by ChIP-seq in primary erythroid cells. Genome Res, 20, 1052-1063.

47. Kassouf, M.T., Hughes, J.R., Taylor, S., McGowan, S.J., Soneji, S., Green, A.L., Vyas, P. and Porcher, C. (2010) Genome-wide identification of TAL1's functional targets: insights into its mechanisms of action in primary erythroid cells. Genome Res, 20, 1064-1083.

48. Volkel, S., Stielow, B., Finkernagel, F., Stiewe, T., Nist, A. and Suske, G. (2015) Zinc finger independent genome-wide binding of Sp2 potentiates recruitment of histone-fold protein Nf-y distinguishing it from Sp1 and Sp3. PLoS Genet, 11, e1005102.

49. Kassouf, M.T., Chagraoui, H., Vyas, P. and Porcher, C. (2008) Differential use of SCL/TAL-1 DNA-binding domain in developmental hematopoiesis. Blood, 112, 1056-1067.

50. Jolma, A., Yan, J., Whitington, T., Toivonen, J., Nitta, K.R., Rastas, P., Morgunova, E., Enge, M., Taipale, M., Wei, G. et al. (2013) DNA-binding specificities of human transcription factors. Cell, 152, 327-339.

51. Badis, G., Berger, M.F., Philippakis, A.A., Talukder, S., Gehrke, A.R., Jaeger, S.A., Chan, E.T., Metzler, G., Vedenko, A., Chen, X. et al. (2009) Diversity and complexity in DNA recognition by transcription factors. Science, 324, 1720-1723.

52. Suske, G., Bruford, E. and Philipsen, S. (2005) Mammalian SP/KLF transcription factors: bring in the family. Genomics, 85, 551-556.

53. Sarkar, A. and Hochedlinger, K. (2013) The sox family of transcription factors: versatile regulators of stem and progenitor cell fate. Cell Stem Cell, 12, 15-30.

54. Kamachi, Y. and Kondoh, H. (2013) Sox proteins: regulators of cell fate specification and differentiation. Development, 140, 4129-4144.

55. Sharrocks, A.D. (2001) The ETS-domain transcription factor family. Nat Rev Mol Cell Biol, 2, 827-837.

56. Wei, G.H., Badis, G., Berger, M.F., Kivioja, T., Palin, K., Enge, M., Bonke, M., Jolma, A., Varjosalo, M., Gehrke, A.R. et al. (2010) Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo. EMBO J, 29, 2147-2160.

57. Dalton, S. and Treisman, R. (1992) Characterization of SAP-1, a protein recruited by serum response factor to the c-fos serum response element. Cell, 68, 597-612.

58. Boros, J., Donaldson, I.J., O'Donnell, A., Odrowaz, Z.A., Zeef, L., Lupien, M., Meyer, C.A., Liu, X.S., Brown, M. and Sharrocks, A.D. (2009) Elucidation of the ELK1 target gene network reveals a role in the coordinate regulation of core components of the gene regulation machinery. Genome Res, 19, 1963-1973.

59. Garvie, C.W., Hagman, J. and Wolberger, C. (2001) Structural studies of Ets-1/Pax5 complex formation on DNA. Mol Cell, 8, 1267-1276.

60. Fitzsimmons, D., Lutz, R., Wheat, W., Chamberlin, H.M. and Hagman, J. (2001) Highly conserved amino acids in Pax and Ets proteins are required for DNA binding and ternary complex assembly. Nucleic Acids Res, 29, 4154-4165.

61. Hollenhorst, P.C., Shah, A.A., Hopkins, C. and Graves, B.J. (2007) Genome-wide analyses reveal properties of redundant and specific promoter occupancy within the ETS gene family. Genes Dev, 21, 1882-1894.

62. Han, Y. and Lefebvre, V. (2008) L-Sox5 and Sox6 drive expression of the aggrecan gene in cartilage by securing binding of Sox9 to a far-upstream enhancer. Mol Cell Biol, 28, 4999-5013.

63. Peirano, R.I. and Wegner, M. (2000) The glial transcription factor Sox10 binds to DNA both as monomer and dimer with different functional consequences. Nucleic Acids Res, 28, 3047-3055.

64. Bridgewater, L.C., Walker, M.D., Miller, G.C., Ellison, T.A., Holsinger, L.D., Potter, J.L., Jackson, T.L., Chen, R.K., Winkel, V.L., Zhang, Z. et al. (2003) Adjacent DNA sequences modulate Sox9 transcriptional activation at paired Sox sites in three chondrocyte-specific enhancer elements. Nucleic Acids Res, 31, 1541-1553.

65. Tsuruzoe, S., Ishihara, K., Uchimura, Y., Watanabe, S., Sekita, Y., Aoto, T., Saitoh, H., Yuasa, Y., Niwa, H., Kawasuji, M. et al. (2006) Inhibition of DNA binding of Sox2 by the SUMO conjugation. Biochem Biophys Res Commun, 351, 920-926.

66. Wegner, M. (2010) All purpose Sox: The many roles of Sox proteins in gene expression. Int J Biochem Cell Biol, 42, 381-390.

67. Kamachi, Y., Uchikawa, M., Tanouchi, A., Sekido, R. and Kondoh, H. (2001) Pax6 and SOX2 form a co-DNA-binding partner complex that regulates initiation of lens development. Genes Dev, 15, 1272-1286.

68. Yuan, H., Corbi, N., Basilico, C. and Dailey, L. (1995) Developmental-specific activity of the FGF-4 enhancer requires the synergistic action of Sox2 and Oct-3. Genes Dev, 9, 2635-2645.

69. Zang, W.Q., Veldhoen, N. and Romaniuk, P.J. (1995) Effects of zinc finger mutations on the nucleic acid binding activities of Xenopus transcription factor IIIA. Biochemistry, 34, 15545-15552.

70. Hoffman, R.C., Horvath, S.J. and Klevit, R.E. (1993) Structures of DNA-binding mutant zinc finger domains: implications for DNA binding. Protein Sci, 2, 951-965.

71. Turner, J. and Crossley, M. (1999) Mammalian Kruppel-like transcription factors: more than just a pretty finger. Trends Biochem Sci, 24, 236-240.

72. Pearson, R., Fleetwood, J., Eaton, S., Crossley, M. and Bao, S. (2008) Kruppel-like transcription factors: a functional family. Int J Biochem Cell Biol, 40, 1996-2001.

73. Miller, I.J. and Bieker, J.J. (1993) A novel, erythroid cell-specific murine transcription factor that binds to the CACCC element and is related to the Kruppel family of nuclear proteins. Mol Cell Biol, 13, 2776-2786.

74. Kaczynski, J., Cook, T. and Urrutia, R. (2003) Sp1- and Kruppel-like transcription factors. Genome Biol, 4, 206.

75. Burdach, J., Funnell, A.P., Mak, K.S., Artuz, C.M., Wienert, B., Lim, W.F., Tan, L.Y., Pearson, R.C. and Crossley, M. (2014) Regions outside the DNA-binding domain are critical for proper in vivo specificity of an archetypal zinc finger transcription factor. Nucleic Acids Res, 42, 276-289.

76. Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V.B., Wong, E., Orlov, Y.L., Zhang, W., Jiang, J. et al. (2008) Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133, 1106-1117.

77. Najafabadi, H.S., Mnaimneh, S., Schmitges, F.W., Garton, M., Lam, K.N., Yang, A., Albu, M., Weirauch, M.T., Radovani, E., Kim, P.M. et al. (2015) C2H2 zinc finger proteins greatly expand the human regulatory lexicon. Nat Biotechnol, 33, 555-562.

78. Jiang, J., Chan, Y.S., Loh, Y.H., Cai, J., Tong, G.Q., Lim, C.A., Robson, P., Zhong, S. and Ng, H.H. (2008) A core Klf circuitry regulates self-renewal of embryonic stem cells. Nat Cell Biol, 10, 353-360.

79. Tetreault, M.P., Yang, Y. and Katz, J.P. (2013) Kruppel-like factors in cancer. Nat Rev Cancer, 13, 701-713.

80. Suzuki, T., Aizawa, K., Matsumura, T. and Nagai, R. (2005) Vascular implications of the Kruppel-like family of transcription factors. Arterioscler Thromb Vasc Biol, 25, 1135-1141.

81. Wu, Z. and Wang, S. (2013) Role of kruppel-like transcription factors in adipogenesis. Dev Biol, 373, 235-243.


82. Moore, D.L., Apara, A. and Goldberg, J.L. (2011) Kruppel-like transcription factors in the nervous system: novel players in neurite outgrowth and axon regeneration. Mol Cell Neurosci, 47, 233-243.

83. McConnell, B.B. and Yang, V.W. (2010) Mammalian Kruppel-like factors in health and diseases. Physiol Rev, 90, 1337-1381.

84. Ghaleb, A.M., Nandan, M.O., Chanchevalap, S., Dalton, W.B., Hisamuddin, I.M. and Yang, V.W. (2005) Kruppel-like factors 4 and 5: the yin and yang regulators of cellular proliferation. Cell Res, 15, 92-96.

85. Dang, D.T., Pevsner, J. and Yang, V.W. (2000) The biology of the mammalian Kruppel-like family of transcription factors. Int J Biochem Cell Biol, 32, 1103-1121.

86. Bieker, J.J. (2001) Kruppel-like factors: three fingers in many pies. J Biol Chem, 276, 34355-34358.

87. Atkins, G.B. and Jain, M.K. (2007) Role of Kruppel-like transcription factors in endothelial biology. Circ Res, 100, 1686-1695.

88. Pearson, R.C., Funnell, A.P. and Crossley, M. (2011) The mammalian zinc finger transcription factor Kruppel-like factor 3 (KLF3/BKLF). IUBMB Life, 63, 86-93.

89. Sera, T. (2009) Zinc-finger-based artificial transcription factors and their applications. Adv Drug Deliv Rev, 61, 513-526.

90. Thukral, S.K., Morrison, M.L. and Young, E.T. (1992) Mutations in the zinc fingers of ADR1 that change the specificity of DNA binding and transactivation. Mol Cell Biol, 12, 2784-2792.

91. Choo, Y., Sanchez-Garcia, I. and Klug, A. (1994) In vivo repression by a site- specific DNA-binding protein designed against an oncogenic sequence. Nature, 372, 642-645.

92. Papworth, M., Moore, M., Isalan, M., Minczuk, M., Choo, Y. and Klug, A. (2003) Inhibition of herpes simplex virus 1 gene expression by designer zinc-finger transcription factors. Proc Natl Acad Sci U S A, 100, 1621-1626.

93. Bartsevich, V.V. and Juliano, R.L. (2000) Regulation of the MDR1 gene by transcriptional repressors selected using peptide combinatorial libraries. Mol Pharmacol, 58, 1-10.

94. Barrow, J.J., Masannat, J. and Bungert, J. (2012) Neutralizing the function of a beta-globin-associated cis-regulatory DNA element using an artificial zinc finger DNA-binding domain. Proc Natl Acad Sci U S A, 109, 17948-17953.

95. Liu, W., Yuan, J.S. and Stewart, C.N., Jr. (2013) Advanced genetic tools for plant biotechnology. Nat Rev Genet.

96. Jamieson, A.C., Miller, J.C. and Pabo, C.O. (2003) Drug discovery with engineered zinc-finger proteins. Nat Rev Drug Discov, 2, 361-368.

97. Holmes-Davis, R., Li, G., Jamieson, A.C., Rebar, E.J., Liu, Q., Kong, Y., Case, C.C. and Gregory, P.D. (2005) Gene regulation in planta by plant-derived engineered zinc finger protein transcription factors. Plant Mol Biol, 57, 411-423.

98. Morton, J., Davis, M.W., Jorgensen, E.M. and Carroll, D. (2006) Induction and repair of zinc-finger nuclease-targeted double-strand breaks in Caenorhabditis elegans somatic cells. Proc Natl Acad Sci U S A, 103, 16370-16375.

99. Rebar, E.J., Huang, Y., Hickey, R., Nath, A.K., Meoli, D., Nath, S., Chen, B., Xu, L., Liang, Y., Jamieson, A.C. et al. (2002) Induction of angiogenesis in a mouse model using engineered transcription factors. Nat Med, 8, 1427-1432.

100. Wilber, A., Tschulena, U., Hargrove, P.W., Kim, Y.S., Persons, D.A., Barbas, C.F., 3rd and Nienhuis, A.W. (2010) A zinc-finger transcriptional activator designed to interact with the gamma-globin gene promoters enhances fetal hemoglobin production in primary human adult erythroblasts. Blood, 115, 3033-3041.

101. Graslund, T., Li, X., Magnenat, L., Popkov, M. and Barbas, C.F., 3rd. (2005) Exploring strategies for the design of artificial transcription factors: targeting sites proximal to known regulatory regions for the induction of gamma-globin expression and the treatment of sickle cell disease. J Biol Chem, 280, 3707-3714.

102. Costa, F.C., Fedosyuk, H., Neades, R., de Los Rios, J.B., Barbas, C.F., 3rd and Peterson, K.R. (2012) Induction of Fetal Hemoglobin In Vivo Mediated by a Synthetic gamma-Globin Zinc Finger Activator. Anemia, 2012, 507894.

103. Sanjana, N.E., Cong, L., Zhou, Y., Cunniff, M.M., Feng, G. and Zhang, F. (2012) A transcription activator-like effector toolbox for genome engineering. Nat Protoc, 7, 171-192.

104. Sander, J.D. and Joung, J.K. (2014) CRISPR-Cas systems for editing, regulating and targeting genomes. Nat Biotechnol, 32, 347-355.

105. Hsu, P.D., Lander, E.S. and Zhang, F. (2014) Development and applications of CRISPR-Cas9 for genome engineering. Cell, 157, 1262-1278.

106. Zhang, L., Spratt, S.K., Liu, Q., Johnstone, B., Qi, H., Raschke, E.E., Jamieson, A.C., Rebar, E.J., Wolffe, A.P. and Case, C.C. (2000) Synthetic zinc finger transcription factor action at an endogenous chromosomal site. Activation of the human erythropoietin gene. J Biol Chem, 275, 33850-33860.

107. Liu, P.Q., Rebar, E.J., Zhang, L., Liu, Q., Jamieson, A.C., Liang, Y., Qi, H., Li, P.X., Chen, B., Mendel, M.C. et al. (2001) Regulation of an endogenous locus using a panel of designed zinc finger proteins targeted to accessible chromatin regions. Activation of vascular endothelial growth factor A. J Biol Chem, 276, 11323-11334.

108. Segal, D.J., Dreier, B., Beerli, R.R. and Barbas, C.F., 3rd. (1999) Toward controlling gene expression at will: selection and design of zinc finger domains recognizing each of the 5'-GNN-3' DNA target sequences. Proc Natl Acad Sci U S A, 96, 2758-2763.

109. Dreier, B., Fuller, R.P., Segal, D.J., Lund, C.V., Blancafort, P., Huber, A., Koksch, B. and Barbas, C.F., 3rd. (2005) Development of zinc finger domains for recognition of the 5'-CNN-3' family DNA sequences and their use in the construction of artificial transcription factors. J Biol Chem, 280, 35588-35597.

110. Dreier, B., Beerli, R.R., Segal, D.J., Flippin, J.D. and Barbas, C.F., 3rd. (2001) Development of zinc finger domains for recognition of the 5'-ANN-3' family of DNA sequences and their use in the construction of artificial transcription factors. J Biol Chem, 276, 29466-29478.

111. Bae, K.H., Kwon, Y.D., Shin, H.C., Hwang, M.S., Ryu, E.H., Park, K.S., Yang, H.Y., Lee, D.K., Lee, Y., Park, J. et al. (2003) Human zinc fingers as building blocks in the construction of artificial transcription factors. Nat Biotechnol, 21, 275-280.

112. Gersbach, C.A., Gaj, T. and Barbas, C.F., 3rd. (2014) Synthetic zinc finger proteins: the advent of targeted gene regulation and genome modification technologies. Acc Chem Res, 47, 2309-2318.

113. Wolfe, S.A., Greisman, H.A., Ramm, E.I. and Pabo, C.O. (1999) Analysis of zinc fingers optimized via phage display: evaluating the utility of a recognition code. J Mol Biol, 285, 1917-1934.

114. Urnov, F.D., Miller, J.C., Lee, Y.L., Beausejour, C.M., Rock, J.M., Augustus, S., Jamieson, A.C., Porteus, M.H., Gregory, P.D. and Holmes, M.C. (2005) Highly efficient endogenous human gene correction using designed zinc-finger nucleases. Nature, 435, 646-651.

115. Gaj, T., Mercer, A.C., Sirk, S.J., Smith, H.L. and Barbas, C.F., 3rd. (2013) A comprehensive approach to zinc-finger recombinase customization enables genomic targeting in human cells. Nucleic Acids Res, 41, 3937-3946.

116. Meister, G.E., Chandrasegaran, S. and Ostermeier, M. (2010) Heterodimeric DNA methyltransferases as a platform for creating designer zinc finger methyltransferases for targeted DNA methylation in cells. Nucleic Acids Res, 38, 1749-1759.

117. Nomura, W. and Barbas, C.F., 3rd. (2007) In vivo site-specific DNA methylation with a designed sequence-enabled DNA methylase. J Am Chem Soc, 129, 8676-8677.

118. Chaikind, B., Kilambi, K.P., Gray, J.J. and Ostermeier, M. (2012) Targeted DNA methylation using an artificially bisected M.HhaI fused to zinc fingers. PLoS One, 7, e44852.

119. Snowden, A.W., Gregory, P.D., Case, C.C. and Pabo, C.O. (2002) Gene-specific targeting of H3K9 methylation is sufficient for initiating repression in vivo. Curr Biol, 12, 2159-2166.

120. Nagy, J.A., Dvorak, A.M. and Dvorak, H.F. (2007) VEGF-A and the induction of pathological angiogenesis. Annu Rev Pathol, 2, 251-275.

121. Carmeliet, P., Ng, Y.S., Nuyens, D., Theilmeier, G., Brusselmans, K., Cornelissen, I., Ehler, E., Kakkar, V.V., Stalmans, I., Mattot, V. et al. (1999) Impaired myocardial angiogenesis and ischemic cardiomyopathy in mice lacking the vascular endothelial growth factor isoforms VEGF164 and VEGF188. Nat Med, 5, 495-502.

122. Grunstein, J., Masbad, J.J., Hickey, R., Giordano, F. and Johnson, R.S. (2000) Isoforms of vascular endothelial growth factor act in a coordinate fashion to recruit and expand tumor vasculature. Mol Cell Biol, 20, 7282-7291.

123. Baumgartner, I., Rauh, G., Pieczek, A., Wuensch, D., Magner, M., Kearney, M., Schainfeld, R. and Isner, J.M. (2000) Lower-extremity edema associated with gene transfer of naked DNA encoding vascular endothelial growth factor. Ann Intern Med, 132, 880-884.

124. Magovern, C.J., Mack, C.A., Zhang, J., Rosengart, T.K., Isom, O.W. and Crystal, R.G. (1997) Regional angiogenesis induced in nonischemic tissue by an adenoviral vector expressing vascular endothelial growth factor. Hum Gene Ther, 8, 215-227.

125. Snowden, A.W., Zhang, L., Urnov, F., Dent, C., Jouvenot, Y., Zhong, X., Rebar, E.J., Jamieson, A.C., Zhang, H.S., Tan, S. et al. (2003) Repression of vascular endothelial growth factor A in glioblastoma cells using engineered zinc finger transcription factors. Cancer Res, 63, 8968-8976.

126. Kang, Y.A., Shin, H.C., Yoo, J.Y., Kim, J.H., Kim, J.S. and Yun, C.O. (2008) Novel cancer antiangiotherapy using the VEGF promoter-targeted artificial zinc-finger protein and oncolytic adenovirus. Mol Ther, 16, 1033-1040.

127. Pawson, E.J., Duran-Jimenez, B., Surosky, R., Brooke, H.E., Spratt, S.K., Tomlinson, D.R. and Gardiner, N.J. (2010) Engineered zinc finger protein-mediated VEGF-a activation restores deficient VEGF-a in sensory neurons in experimental diabetes. Diabetes, 59, 509-518.

128. Price, S.A., Dent, C., Duran-Jimenez, B., Liang, Y., Zhang, L., Rebar, E.J., Case, C.C., Gregory, P.D., Martin, T.J., Spratt, S.K. et al. (2006) Gene transfer of an engineered transcription factor promoting expression of VEGF-A protects against experimental diabetic neuropathy. Diabetes, 55, 1847-1854.

129. Yu, J., Lei, L., Liang, Y., Hinh, L., Hickey, R.P., Huang, Y., Liu, D., Yeh, J.L., Rebar, E., Case, C. et al. (2006) An engineered VEGF-activating zinc finger protein transcription factor improves blood flow and limb salvage in advanced-age mice. FASEB J, 20, 479-481.

130. Dai, Q., Huang, J., Klitzman, B., Dong, C., Goldschmidt-Clermont, P.J., March, K.L., Rokovich, J., Johnstone, B., Rebar, E.J., Spratt, S.K. et al. (2004) Engineered zinc finger-activating vascular endothelial growth factor transcription factor plasmid DNA induces therapeutic angiogenesis in rabbits with hindlimb ischemia. Circulation, 110, 2467-2475.

131. Tachikawa, K., Schroder, O., Frey, G., Briggs, S.P. and Sera, T. (2004) Regulation of the endogenous VEGF-A gene by exogenous designed regulatory proteins. Proc Natl Acad Sci U S A, 101, 15225-15230.

132. Zhang, H.S., Liu, D., Huang, Y., Schmidt, S., Hickey, R., Guschin, D., Su, H., Jovin, I.S., Kunis, M., Hinkley, S. et al. (2012) A designed zinc-finger transcriptional repressor of phospholamban improves function of the failing heart. Mol Ther, 20, 1508-1515.

133. Tan, S., Guschin, D., Davalos, A., Lee, Y.L., Snowden, A.W., Jouvenot, Y., Zhang, H.S., Howes, K., McNamara, A.R., Lai, A. et al. (2003) Zinc-finger protein-targeted gene regulation: genomewide single-gene specificity. Proc Natl Acad Sci U S A, 100, 11997-12002.

134. Onori, A., Pisani, C., Strimpakos, G., Monaco, L., Mattei, E., Passananti, C. and Corbi, N. (2013) UtroUp is a novel six zinc finger artificial transcription factor that recognises 18 base pairs of the utrophin promoter and efficiently drives utrophin upregulation. BMC Mol Biol, 14, 3.

135. Grimmer, M.R., Stolzenburg, S., Ford, E., Lister, R., Blancafort, P. and Farnham, P.J. (2014) Analysis of an artificial zinc finger epigenetic modulator: widespread binding but limited regulation. Nucleic Acids Res, 42, 10856-10868.

136. Schmidt, D., Wilson, M.D., Spyrou, C., Brown, G.D., Hadfield, J. and Odom, D.T. (2009) ChIP-seq: using high-throughput sequencing to discover protein-DNA interactions. Methods, 48, 240-248.

137. Sambrook, J., Fritsch, E.F. and Maniatis, T. (1989) Molecular cloning: A laboratory manual. Cold Spring Harbor Laboratory Press, United States of America.

138. Andrews, N.C. and Faller, D.V. (1991) A rapid micropreparation technique for extraction of DNA-binding proteins from limiting numbers of mammalian cells. Nucleic Acids Res, 19, 2499.

139. Crossley, M., Whitelaw, E., Perkins, A., Williams, G., Fujiwara, Y. and Orkin, S.H. (1996) Isolation and characterization of the cDNA encoding BKLF/TEF-2, a major CACCC-box-binding protein in erythroid cells and selected other cells. Mol Cell Biol, 16, 1695-1705.

140. Bolger, A.M., Lohse, M. and Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114-2120.

141. Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods, 9, 357-359.

142. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079.

143. Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y.C., Laslo, P., Cheng, J.X., Murre, C., Singh, H. and Glass, C.K. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell, 38, 576-589.

144. Ross-Innes, C.S., Stark, R., Teschendorff, A.E., Holmes, K.A., Ali, H.R., Dunning, M.J., Brown, G.D., Gojis, O., Ellis, I.O., Green, A.R. et al. (2012) Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature, 481, 389-393.

145. Robinson, J.T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G. and Mesirov, J.P. (2011) Integrative genomics viewer. Nat Biotechnol, 29, 24-26.

146. Machanick, P. and Bailey, T.L. (2011) MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics, 27, 1696-1697.

147. Thomas-Chollier, M., Darbo, E., Herrmann, C., Defrance, M., Thieffry, D. and van Helden, J. (2012) A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc, 7, 1551-1568.

148. Bailey, T.L. (2011) DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics, 27, 1653-1659.

149. Grant, C.E., Bailey, T.L. and Noble, W.S. (2011) FIMO: scanning for occurrences of a given motif. Bioinformatics, 27, 1017-1018.

150. ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57-74.

151. Thurman, R.E., Rynes, E., Humbert, R., Vierstra, J., Maurano, M.T., Haugen, E., Sheffield, N.C., Stergachis, A.B., Wang, H., Vernot, B. et al. (2012) The accessible chromatin landscape of the human genome. Nature, 489, 75-82.

152. Xuan, Z., Zhao, F., Wang, J., Chen, G. and Zhang, M.Q. (2005) Genome-wide promoter extraction and analysis in human, mouse, and rat. Genome Biol, 6, R72.

153. Papworth, M., Kolasinska, P. and Minczuk, M. (2006) Designer zinc-finger proteins and their applications. Gene, 366, 27-38.

154. Benjamin, L.E. (2001) Glucose, VEGF-A, and diabetic complications. Am J Pathol, 158, 1181-1184.

155. Whitlock, P.R., Hackett, N.R., Leopold, P.L., Rosengart, T.K. and Crystal, R.G. (2004) Adenovirus-mediated transfer of a minigene expressing multiple isoforms of VEGF is more effective at inducing angiogenesis than comparable vectors expressing individual VEGF cDNAs. Mol Ther, 9, 67-75.

156. Dunn, C., O'Dowd, A. and Randall, R.E. (1999) Fine mapping of the binding sites of monoclonal antibodies raised against the Pk tag. J Immunol Methods, 224, 141-150.

157. Kolodziej, K.E., Pourfarzad, F., de Boer, E., Krpic, S., Grosveld, F. and Strouboulis, J. (2009) Optimal use of tandem biotin and V5 tags in ChIP assays. BMC Mol Biol, 10, 6.

158. Kalderon, D., Roberts, B.L., Richardson, W.D. and Smith, A.E. (1984) A short amino acid sequence able to specify nuclear location. Cell, 39, 499-509.

159. Bailey, T., Krajewski, P., Ladunga, I., Lefebvre, C., Li, Q., Liu, T., Madrigal, P., Taslim, C. and Zhang, J. (2013) Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol, 9, e1003326.

160. Tran, N.T. and Huang, C.H. (2014) A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data. Biol Direct, 9, 4.

161. Das, M.K. and Dai, H.K. (2007) A survey of DNA motif finding algorithms. BMC Bioinformatics, 8 Suppl 7, S21.

162. Bailey, T.L., Williams, N., Misleh, C. and Li, W.W. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res, 34, W369-373.

163. Thomas-Chollier, M., Herrmann, C., Defrance, M., Sand, O., Thieffry, D. and van Helden, J. (2012) RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res, 40, e31.

164. Stolzenburg, S., Rots, M.G., Beltran, A.S., Rivenbark, A.G., Yuan, X., Qian, H., Strahl, B.D. and Blancafort, P. (2012) Targeted silencing of the oncogenic transcription factor SOX2 in breast cancer. Nucleic Acids Res, 40, 6725-6740.

165. Yu, M., Riva, L., Xie, H., Schindler, Y., Moran, T.B., Cheng, Y., Yu, D., Hardison, R., Weiss, M.J., Orkin, S.H. et al. (2009) Insights into GATA-1-mediated gene activation versus repression via genome-wide chromatin occupancy analysis. Mol Cell, 36, 682-695.

166. Perez-Pinera, P., Ousterout, D.G., Brunger, J.M., Farin, A.M., Glass, K.A., Guilak, F., Crawford, G.E., Hartemink, A.J. and Gersbach, C.A. (2013) Synergistic and tunable human gene activation by combinations of synthetic transcription factors. Nat Methods, 10, 239-242.

167. Garriga-Canut, M., Agustin-Pavon, C., Herrmann, F., Sanchez, A., Dierssen, M., Fillat, C. and Isalan, M. (2012) Synthetic zinc finger repressors reduce mutant huntingtin expression in the brain of R6/2 mice. Proc Natl Acad Sci U S A, 109, E3136-3145.

168. Gaj, T., Gersbach, C.A. and Barbas, C.F., 3rd. (2013) ZFN, TALEN, and CRISPR/Cas-based methods for genome engineering. Trends Biotechnol, 31, 397-405.

169. Wu, W., Morrissey, C.S., Keller, C.A., Mishra, T., Pimkin, M., Blobel, G.A., Weiss, M.J. and Hardison, R.C. (2014) Dynamic shifts in occupancy by TAL1 are guided by GATA factors and drive large-scale reprogramming of gene expression during hematopoiesis. Genome Res, 24, 1945-1962.

170. Turner, J. and Crossley, M. (1998) Cloning and characterization of mCtBP2, a co-repressor that associates with basic Kruppel-like factor and other mammalian transcriptional regulators. EMBO J, 17, 5129-5140.

171. Kumar, V., Carlson, J.E., Ohgi, K.A., Edwards, T.A., Rose, D.W., Escalante, C.R., Rosenfeld, M.G. and Aggarwal, A.K. (2002) Transcription corepressor CtBP is an NAD(+)-regulated dehydrogenase. Mol Cell, 10, 857-869.

172. Shi, Y., Sawada, J., Sui, G., Affar el, B., Whetstine, J.R., Lan, F., Ogawa, H., Luke, M.P., Nakatani, Y. and Shi, Y. (2003) Coordinated histone modifications mediated by a CtBP co-repressor complex. Nature, 422, 735-738.

173. Chinnadurai, G. (2002) CtBP, an unconventional transcriptional corepressor in development and oncogenesis. Mol Cell, 9, 213-224.

174. Rohs, R., Jin, X., West, S.M., Joshi, R., Honig, B. and Mann, R.S. (2010) Origins of specificity in protein-DNA recognition. Annu Rev Biochem, 79, 233-269.

175. Mendenhall, E.M., Williamson, K.E., Reyon, D., Zou, J.Y., Ram, O., Joung, J.K. and Bernstein, B.E. (2013) Locus-specific editing of histone modifications at endogenous enhancers. Nat Biotechnol, 31, 1133-1136.

176. Polstein, L.R., Perez-Pinera, P., Kocak, D.D., Vockley, C.M., Bledsoe, P., Song, L., Safi, A., Crawford, G.E., Reddy, T.E. and Gersbach, C.A. (2015) Genome-wide specificity of DNA binding, gene regulation, and chromatin remodeling by TALE- and CRISPR/Cas9-based transcriptional activators. Genome Res, 25, 1158-1169.

177. Gao, X., Tsang, J.C., Gaba, F., Wu, D., Lu, L. and Liu, P. (2014) Comparison of TALE designer transcription factors and the CRISPR/dCas9 in regulation of gene expression by targeting enhancers. Nucleic Acids Res, 42, e155.

178. Maeder, M.L., Linder, S.J., Reyon, D., Angstman, J.F., Fu, Y., Sander, J.D. and Joung, J.K. (2013) Robust, synergistic regulation of human gene expression using TALE activators. Nat Methods, 10, 243-245.

179. Rolland, T., Tasan, M., Charloteaux, B., Pevzner, S.J., Zhong, Q., Sahni, N., Yi, S., Lemmens, I., Fontanillo, C., Mosca, R. et al. (2014) A proteome-scale map of the human interactome network. Cell, 159, 1212-1226.

180. Morris, K.V. and Mattick, J.S. (2014) The rise of regulatory RNA. Nat Rev Genet, 15, 423-437.

181. Gilbert, C. and Svejstrup, J.Q. (2006) RNA immunoprecipitation for determining RNA-protein associations in vivo. Curr Protoc Mol Biol, Chapter 27, Unit 27 24.

182. Zhao, J., Ohsumi, T.K., Kung, J.T., Ogawa, Y., Grau, D.J., Sarma, K., Song, J.J., Kingston, R.E., Borowsky, M. and Lee, J.T. (2010) Genome-wide identification of polycomb-associated RNAs by RIP-seq. Mol Cell, 40, 939-953.

183. Licatalosi, D.D., Mele, A., Fak, J.J., Ule, J., Kayikci, M., Chi, S.W., Clark, T.A., Schweitzer, A.C., Blume, J.E., Wang, X. et al. (2008) HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature, 456, 464-469.

184. O'Geen, H., Henry, I.M., Bhakta, M.S., Meckler, J.F. and Segal, D.J. (2015) A genome-wide analysis of Cas9 binding specificity using ChIP-seq and targeted sequence capture. Nucleic Acids Res, 43, 3389-3404.

185. Lim, W.F., Burdach, J., Funnell, A.P., Pearson, R.C., Quinlan, K.G. and Crossley, M. (2015) Directing an artificial zinc finger protein to new targets by fusion to a non-DNA-binding domain. Nucleic Acids Res.

Appendix

Supplementary Tables

Supplementary Tables 3.1, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6 and 4.7 are available as Excel spreadsheets on the attached CD.

Supplementary Table 2.1: Oligonucleotide sequences for (A) cloning the AZF, KLF3FD-AZF and KLF3 FD coding sequences into the pEF.IRES.puro and pMSCV.puro vectors, (B) semi-quantitative real-time PCR to assess mRNA transcript levels, (C) EMSA probes, and (D) ChIP-PCR assays. All sequences are written 5' to 3'.

A. DNA cloning

To generate pEF.IRES.puro:KLF3FD:NLS:AZFVEGFA:V5

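Primer ID | Name | Sequence (5' to 3') | Purpose
A4197 | AZF_F_XbaI | ATTATCTAGATGCGGGCCCGAAGAAAAAGCG | To amplify NLS:AZFVEGFA:V5 from pMA-RQ:NLS:AZFVEGFA:V5
A4166 | AZF_R_NotI | GCGGCCGCTTAGGTGGAATC | To amplify NLS:AZFVEGFA:V5 from pMA-RQ:NLS:AZFVEGFA:V5
A4163 | KLF3FD_F_XhoI | ATTACTCGAGAACATGCTCATGTTTGATCCAGT | To amplify KLF3FD aa 1-262 from pMT3.KLF3
A4164 | KLF3FD_R_XbaI | ATTATCTAGACCGTCGCATCTGTGTATCCTGC | To amplify KLF3FD aa 1-262 from pMT3.KLF3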

To generate pMSCV.puro:NLS:AZFVEGFA:V5

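Primer ID | Name | Sequence (5' to 3') | Purpose
A4197 | AZF_F_XbaI | ATTATCTAGATGCGGGCCCGAAGAAAAAGCG | To amplify NLS:AZFVEGFA:V5 from pEF.IRES.puro:NLS:AZFVEGFA:V5
A4201 | AZF_R_EcoRI | TAATGAATTCTTAGGTGGAATCCAGGCCCAGTAA | To amplify NLS:AZFVEGFA:V5 from pEF.IRES.puro:NLS:AZFVEGFA:V5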

To generate pMSCV.puro:KLF3FD:NLS:V5

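Primer ID | Name | Sequence (5' to 3') | Purpose
A5790 | KLF3FD_F_XhoI_2 | ATTACTCGAGACCATGCTCATGTTTGATCCAGT | To amplify KLF3FD:NLS:V5 from pTALE_TF_v2:KLF3FD:NLS:V5
A5466 | KLF3FD_R_EcorI_2 | ATTAGAATTCTTAGGTGGAATCCAGGCCCA | To amplify KLF3FD:NLS:V5 from pTALE_TF_v2:KLF3FD:NLS:V5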

Sequencing primers

Primer ID | Name | Sequence (5' to 3') | Purpose
A2160 | pEF_IRES_F | CTCTTAAGGCTAGAGTACTTAATACG | Sequencing primer for pEF.IRES vectors
A2161 | pEF_IRES_R | CCAAGCGGCTTCGGCCAGTAACGTTA | Sequencing primer for pEF.IRES vectors
A3727 | pMSCV_F | CCCTTGAACCTCCTCGTTCGACC | Sequencing primer for pMSCV vectors
A3728 | pMSCV_F | GAGACGTGCTACTTCCATTTGTC | Sequencing primer for pMSCV vectors
A4387 | KLF3FD_internal_F | ATGTATACCAGCCACCTGCAGCA | Internal forward sequencing primer annealed to KLF3FD

B. Semi-quantitative real-time PCR

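Primer ID | Name | Sequence (5' to 3') | Used in
A4503 | AZF_mRNA_F | GCTTCTCCGACAGGTCTCAC | Figure 4.2B, transcript levels of AZF and KLF3FD-AZF
A4505 | AZF_mRNA_R | CTGGGCAAACTTTCTTCCAC | Figure 4.2B, transcript levels of AZF and KLF3FD-AZF
A1560 | 18s_mRNA_F | CACGGCCGGTACAGTGAAAC | Figure 4.2B, for normalization
A1561 | 18s_mRNA_R | AGAGGAGCGAGCGACCAA | Figure 4.2B, for normalization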

C. EMSA

Primer ID | Name | Sequence (5' to 3') | Used in
A4157 | Wild type AZF_F | CGTGGCGCTGGGGGCTAGCACC | Figure 3.2B; Figure 4.2C; Figure 4.10 (lane 1 to 4); Figure 4.12 (lane 1 to 6)
A4158 | Wild type AZF_R | GGTGCTAGCCCCCAGCGCCACG | Figure 3.2B; Figure 4.2C; Figure 4.10 (lane 1 to 4); Figure 4.12 (lane 1 to 6)
A6317 | AZF_5G>C_F | CGTGGCGCTGCGGGCTAGCACC | Figure 4.10 (lane 5 and 6)
A6318 | AZF_5G>C_R | GGTGCTAGCCCGCAGCGCCACG | Figure 4.10 (lane 5 and 6)
A6615 | KLF3 caccc box_F | TAGAGCCACACCCTGGTAAG | Figure 4.12 (lane 7 to 12)
A6616 | KLF3 caccc box_R | CTTACCAGGGTGTGGCTCTA | Figure 4.12 (lane 7 to 12)
A6721 | ARID3B_F | TGGCAGAGTTCCATACCGGCCGCTGAGTCAGCTCATAGCAGGATCCTGCC | Figure 4.12 (lane 13 to 18)
A6722 | ARID3B_R | GGCAGGATCCTGCTATGAGCTGACTCAGCGGCCGGTATGGAACTCTGCCA | Figure 4.12 (lane 13 to 18)
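The forward and reverse EMSA oligonucleotides in part C are annealed into double-stranded probes, so each reverse strand should be the exact reverse complement of its forward strand. The following is a minimal Python sketch of that check, not part of the original methods. It assumes full complementarity with no overhangs (not stated in the table), and the sequences are joined from the wrapped table cells. The names `reverse_complement`, `check_pairs` and `EMSA_PROBES` are illustrative.

```python
# Watson-Crick complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Forward/reverse probe pairs as listed in Supplementary Table 2.1C.
# Assumption: wrapped table cells were joined without gaps.
EMSA_PROBES = {
    "Wild type AZF": ("CGTGGCGCTGGGGGCTAGCACC", "GGTGCTAGCCCCCAGCGCCACG"),
    "AZF_5G>C": ("CGTGGCGCTGCGGGCTAGCACC", "GGTGCTAGCCCGCAGCGCCACG"),
    "KLF3 caccc box": ("TAGAGCCACACCCTGGTAAG", "CTTACCAGGGTGTGGCTCTA"),
    "ARID3B": (
        "TGGCAGAGTTCCATACCGGCCGCTGAGTCAGCTCATAGCAGGATCCTGCC",
        "GGCAGGATCCTGCTATGAGCTGACTCAGCGGCCGGTATGGAACTCTGCCA",
    ),
}


def check_pairs(probes: dict) -> dict:
    """Map each probe name to True if the reverse oligo matches the
    reverse complement of the forward oligo."""
    return {
        name: reverse_complement(fwd) == rev
        for name, (fwd, rev) in probes.items()
    }


if __name__ == "__main__":
    for name, ok in check_pairs(EMSA_PROBES).items():
        print(f"{name}: {'complementary' if ok else 'MISMATCH'}")
```

A mismatch reported by this check would indicate a transcription error in the table, not necessarily a problem with the oligonucleotides themselves.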

D. ChIP-PCR

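Primer ID | Name | Sequence (5' to 3') | Used in
A4453 | hVegfA_-30bp_F | TTTTTAAAAGTCGGCTGGTAGC | Figure 3.2C and 4.2B, AZF predicted target site on VEGF-A promoter
A4454 | hVegfA_-30bp_R | CTGACCGGTCCACCTAACC | Figure 3.2C and 4.2B, AZF predicted target site on VEGF-A promoter
A4449 | hVegfA_-3.5kb_F | GGTTTGTATCCTGCCCTTCC | Figure 3.2C and 4.2B, negative control; Figure 4.11E, negative control
A4450 | hVegfA_-3.5kb_R | ACTGGGTCTTGCTGTTTTCC | Figure 3.2C and 4.2B, negative control; Figure 4.11E, negative control
A6680 | hARID3B_intron_F | CCCACTGATTGCTTTGGTTT | Figure 4.11E, ARID3B
A6681 | hARID3B_intron_F | CGGTATGGAACTCTGCCAAC | Figure 4.11E, ARID3B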

Table 3.1: AZF binding sites across the human genome. ChIP-Seq peak locations and normalized tag counts for regions bound by AZF (25322 sites), with peaks called using homer-IDR.

Table 4.1: KLF3FD-AZF binding sites across the human genome. ChIP-Seq peak locations and normalized tag counts for regions bound by KLF3FD-AZF (48003 sites), with peaks called using homer-IDR.

Table 4.2: Regions differentially bound by AZF (4620 sites) determined using a differential binding analysis tool, DiffBind.

Table 4.3: Regions differentially bound by KLF3FD-AZF (4537 sites) determined using a differential binding analysis tool, DiffBind.

Table 4.4: KLF3 FD binding sites across the human genome. ChIP-Seq peak locations and normalized tag counts for regions bound by KLF3 FD (1439 sites), with peaks called using homer-IDR.

Table 4.5: Regions bound by both KLF3FD-AZF and KLF3 FD, which lacks a DNA-binding domain (264 sites).

Table 4.6: Common promoter binding analysis. 3525 and 1720 KLF3 FD-dependent promoter binding events (differentially bound by KLF3FD-AZF but not by AZF), obtained from Burdach et al. 2013 (75) and the current study (185), respectively, were overlapped by downstream gene name. The two datasets shared 578 promoters, twice the number expected by chance.
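The overlap described for Table 4.6 compares two gene-name lists and sets the shared count against a chance expectation. How that expectation was calculated is not stated here. The Python sketch below uses one common approach, the mean of a hypergeometric draw, n_a × n_b / N, where N is the number of genes in the background. The function names and the background size of 20000 genes are hypothetical placeholders, not values from this study.

```python
def common_genes(genes_a, genes_b):
    """Return the gene names present in both lists (duplicates ignored)."""
    return set(genes_a) & set(genes_b)


def expected_overlap(n_a: int, n_b: int, background_size: int) -> float:
    """Expected overlap of two random gene sets of sizes n_a and n_b drawn
    from the same background (mean of the hypergeometric distribution)."""
    return n_a * n_b / background_size


def fold_enrichment(observed: int, n_a: int, n_b: int,
                    background_size: int) -> float:
    """Observed overlap divided by the overlap expected by chance."""
    return observed / expected_overlap(n_a, n_b, background_size)


if __name__ == "__main__":
    # Toy gene lists, for illustration only.
    study_1 = ["VEGFA", "ARID3B", "KLF3"]
    study_2 = ["ARID3B", "KLF3", "GATA1"]
    print(sorted(common_genes(study_1, study_2)))

    # Counts from Table 4.6; the background size is a HYPOTHETICAL value.
    background = 20000
    print(expected_overlap(3525, 1720, background))
    print(fold_enrichment(578, 3525, 1720, background))
```

The fold enrichment depends directly on the assumed background size, so the figure is meaningful only with the gene universe used in the original analysis.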

Table 4.7: Common promoter binding analysis. 3525 and 107 promoter binding events for KLF3 FD (lacking a DNA-binding domain), obtained from Burdach et al. 2013 (75) and the current study (185), respectively, were overlapped by downstream gene name. The two datasets shared 45 promoters.
