CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

Bioinformatic Comparison of the EVI2A Promoter and Coding Regions

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Biology

By Max Weinstein

May 2020

The thesis of Max Weinstein is approved:

Professor Rheem D. Medh Date

Professor Virginia Oberholzer Vandergon Date

Professor Cindy Malone, Chair Date

California State University Northridge


Table of Contents

Signature Page
List of Figures
Abstract
Introduction
    Ecotropic Viral Integration Site 2A, a Gene within a Gene
Materials and Methods
    PCR and Cloning of Recombinant Plasmid
    Transformation and Cell Culture
    Generation of Deletion Constructs
    Transfection and Luciferase Assay
    Identification of Transcription Factor Binding Sites
    Determination of Region for Analysis
    Multiple Sequence Alignment (MSA)
    Model Testing
    Tree Construction
    Promoter and CDS Conserved Motif Search
Results and Discussion
    Choice of Species for Analysis
    Mapping of Potential Transcription Factor Binding Sites
    Confirmation of Plasmid Generation through Gel Electrophoresis
    Analysis of Deletion Constructs by Transient Transfection
    EVI2A Coding DNA Sequence Phylogenetics
    EVI2A Promoter Phylogenetics
    EVI2A Conserved Leucine Zipper
    EVI2A Conserved Casein kinase II phosphorylation site
    EVI2A Conserved Sox-5 Binding Site
    EVI2A Conserved HLF Binding Site
    EVI2A Conserved cREL Binding Site
    EVI2A Conserved CREB Binding Site
Summary
Appendix: Figures
Literature Cited


List of Figures

Figure 1. EVI2A is nested within the gene NF1
Figure 2. Putative Transcription Start Site
Figure 3. Gel Electrophoresis of pGC Blue cloned, Restriction Digested Plasmid
Figure 4. Gel Electrophoresis of pGL3 Basic cloned, Restriction Digested Plasmids
Figure 5. Non-Significant differences in activity between EVI2A promoter deletion constructs
Figure 6. Phylogenetic Tree of EVI2A CDS Generated Through Bayesian Inference
Figure 7. Phylogenetic Tree of EVI2A CDS Generated Through Maximum Likelihood
Figure 8. Phylogenetic Tree of EVI2A Putative Promoter Generated Through Bayesian Inference
Figure 9. Phylogenetic Tree of EVI2A Putative Promoter Generated Through Maximum Likelihood
Figure 10. MSA of the amino acids in the leucine zipper motif of the EVI2A
Figure 11. MSA of the amino acids of the casein kinase II phosphorylation site
Figure 12. MSA of nucleotides composing the Sox-5 Transcription Factor Binding Site
Figure 13. MSA of nucleotides composing the HLF Transcription Factor Binding Site
Figure 14. MSA of nucleotides composing the cREL Transcription Factor Binding Site
Figure 15. MSA of nucleotides composing the CREB Transcription Factor Binding Site


Abstract

Bioinformatic Comparison of the EVI2A Promoter and Coding Regions

By Max Weinstein Master of Science in Biology

Evolution is driven by natural selection operating on the variation present in a given species. However, it is unclear whether the different parts of a gene (coding and non-coding) are under the same types of selection. The similarity of selection between the coding and non-coding regions of a gene can be assessed using bioinformatics, by analyzing the differences in conserved sites between the coding and non-coding (promoter) regions of the gene. Ecotropic Viral Integration Site 2A (EVI2A) was analyzed at both its putative promoter region and its coding DNA sequence to determine its similarity of selection. After identifying the putative promoter region of the EVI2A gene through tracking of conserved regions, the genetic sequences for the EVI2A putative promoter and coding sequences were compared across several different vertebrate species to generate genetic phylogenies, which were then compared against each other and against known evolutionary phylogenetic trees. Conserved sequences were identified in both the putative promoter regions and the coding sequences. The motifs conserved in the coding sequence were much more stringently conserved than the conserved sequences in the putative promoter region, suggesting that the two regions of the EVI2A gene are under different types of selective pressure.


Introduction

For transcription of a gene to begin, RNA polymerase must combine with auxiliary transcription factors to form a transcription preinitiation complex, which binds to the gene locus upstream of the transcription start site, in an approximately 1000 base pair long area called the promoter region (Vo et al. 2017). This process of RNA polymerase binding is tightly regulated to control when a given gene is expressed in a cell. A failure of this regulation can have far-ranging consequences for the human body. Diseases that can result from genetic mis-regulation include autoimmune inflammation, neurodegenerative disorders, and cancer (Lee and Young, 2013).

RNA polymerase II binds to the DNA strand upstream of the DNA sequence to be transcribed. The transcription start site marks the location where RNA polymerase II begins transcribing the DNA blueprint into mRNA. This is separate from the site where translation of mRNA into protein begins, which always corresponds to a methionine. In order to bind properly, the polymerase requires a number of other proteins to regulate its binding. These proteins, known as transcription factors, must also bind to the DNA upstream of the transcription start site, in an area known as the promoter region.

Within the promoter region is the core promoter, an area approximately 40 base pairs long that is the minimum continuous stretch of DNA necessary for the proper binding of RNA polymerase II to initiate transcription (Butler et al. 2002). In addition to the promoter itself, assorted regulatory sequences and transcription factors contribute to controlling when a gene is transcribed and its protein produced. “Enhancers” are sequences of DNA that recruit proteins known as “activators” to the promoter region. These activators, in turn, act to recruit RNA polymerase to the promoter (Maston et al. 2006). Conversely, “silencer” sequences of DNA recruit “repressor” transcription factors, which act to block transcription of the gene. Factors that include environmental stimuli, response to infection, and intercellular signaling act to control when these transcription factors bind to their transcription factor binding sites (TFBS).

The core promoter structure varies between different genes, and it is this variation that helps to allow genes to be differentially regulated. Though core promoters as a whole differ between genes, they are usually made up of a number of similar sequence motifs. It is the combination of these motifs that differentiates core promoters from each other. In general, there are two distinct types of core promoters: focused and dispersed. Focused promoters have a single, strong transcription start site, or a group of tightly clustered start sites, to which transcription factors bind in order to guide in RNA polymerase II (Juven-Gershon et al. 2008). In these focused promoters, the single binding site is usually a TATA Box. Despite being the oldest known form of core promoter, focused promoters are in the minority in vertebrates. Only about 35 percent of vertebrate promoters are focused promoters (Dikstein 2011). The other 65 percent of promoters are dispersed promoters. Dispersed promoters, unlike focused promoters, make use of a number of weaker transcription start sites spread out over an area of 50-100 nucleotides (Juven-Gershon et al. 2008). The binding sites for dispersed promoters are often (approximately 50% of the time) associated with CpG islands, areas of high G/C concentration (Juven-Gershon et al. 2008, Mahpour et al. 2018).
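As a rough illustration of what "high G/C concentration" means in practice, the following Python sketch computes the G/C fraction of a sequence and its observed-to-expected CpG ratio. The function names are illustrative, and the observed/expected formula is a commonly used CpG island measure assumed here; it is not taken from the sources cited above, and no threshold for calling an island is implied.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Assumed formula for illustration; returns 0.0 when C or G is absent.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)
```

A sequence such as "CGCGCGCG" scores high on both measures, while an A/T-only sequence scores zero on both.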

The most ancestral motif found in focused core promoters, but not dispersed promoters, is the “TATA Box”. The TATA Box is identified by the sequence “TATAWAAR”, with R being the downstream element of the gene (Juven-Gershon et al. 2008). It is also possible for up to two mismatches to be present in the TATA box while still maintaining its function (Dikstein 2011). The TATA box acts as a binding site for Transcription Factor IID (TFIID), a functional unit made up of a number of proteins. The protein “TATA Binding Protein” (TBP) binds to the TATA box, and the TBP-Associated Factors bind onto TBP, building TFIID. TFIID, once assembled, aids in the binding of RNA polymerase II to the DNA at the proper location to begin transcription. Another protein, known as TRF3 or TBP2, functions as an analog for TBP. TRF3, like TBP, is able to bind to the TATA box. TRF3 also interacts with some of the same transcription factors as TBP, such as TFIIA and TFIIB, and is capable of performing basal transcription, making it a replacement for TBP (Goodrich and Tjian 2010). The TATA box is often accompanied by a TFIIB Recognition Element, or “BRE”. The BRE can be located either upstream or downstream of the TATA box (Juven-Gershon et al. 2008).
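The mismatch tolerance described above can be sketched as a simple sequence scan. The Python example below is a minimal illustration and not a method used in this study. It assumes the standard IUPAC reading of the consensus letters (W = A or T, R = A or G), and the function name and default parameters are hypothetical.

```python
# IUPAC nucleotide codes needed for the consensus sequences discussed here.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "R": "AG", "Y": "CT", "S": "CG", "N": "ACGT",
}


def find_motif(seq, consensus="TATAWAAR", max_mismatches=2):
    """Return (position, matched window, mismatch count) for every window
    of seq that matches the consensus with at most max_mismatches."""
    seq = seq.upper()
    width = len(consensus)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        mismatches = sum(
            base not in IUPAC[symbol]
            for base, symbol in zip(window, consensus)
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits
```

Calling `find_motif(sequence, max_mismatches=0)` restricts the scan to exact consensus matches; the default of two mismatches mirrors the tolerance reported by Dikstein (2011).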

Another motif seen in focused promoters is the initiator. The initiator motif spans the transcription start site with a consensus sequence of YYANWYY, though the sequence has such a high level of variability that it has been suggested that the consensus site should only be YR, with R corresponding to the transcription start site and replacing “ANWYY” (Dikstein 2011). By itself, the initiator can be weakly bound by RNA polymerase II. A much stronger bond is formed when RNA polymerase II forms a complex with TFIIB (binding to BRE), TFIID (binding to the TATA box), and TFIIF.

A third motif important for basal transcription activity seen in focused promoters is the Downstream Promoter Element (DPE). It is a highly conserved motif across animal species (Juven-Gershon et al. 2008). As the name suggests, this element is found downstream of the transcription start site, approximately +33 bases from the +1 site of transcriptional initiation. The DPE acts as a binding site for Transcription Factor IID.

A motif unique to dispersed core promoters is the “CGCG Element”, which has the consensus sequence TCTCGCGAGA (Mahpour et al. 2018). The CGCG element is capable of allowing for transcriptional activity independent of other transcription factors (Mahpour et al. 2018). The CGCG element is a bi-directional promoter; it is capable of acting as a promoter for sequences both downstream (toward the telomere) and upstream (toward the centromere). The CGCG element is highly conserved, as the palindromic nature of the motif is important to its ability to act bidirectionally as a promoter element; swapping the first and last three nucleotides of the sequence resulted in a loss of function for the CGCG element. Unsurprisingly, CGCG elements are most often found within the same CpG islands associated with dispersed core promoters.
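The palindromic nature of the CGCG element refers to the motif reading the same on both DNA strands. A minimal Python sketch of that check follows; the helper names are illustrative only.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_rc_palindrome(seq):
    """True when seq is identical to its own reverse complement."""
    return seq.upper() == reverse_complement(seq)
```

Applied to the consensus TCTCGCGAGA, `is_rc_palindrome` returns True, since the reverse complement of the motif is the motif itself.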

Similar to focused core promoters, dispersed promoters also have a version of the initiator element. Its consensus is much more strongly conserved, giving it the name “Strict Initiator” (sINR) (Dikstein 2011), with a consensus sequence of GSCGCCATYTTG (Yarden et al. 2009). In addition to its own strict sequence, the sINR is flanked by other strongly conserved sequences (Dikstein 2011). Strict Initiators have been shown to be enriched within TATA-less promoters. Interestingly, a sINR can take the place of an initiator (INR), but the inverse results in a non-functioning promoter, implying a strong significance to the sequence variations between the two types of initiator.

Other types of TFBSs include proximal promoter elements (PPE) and distal regulatory elements. PPEs are located directly upstream of the core promoter and serve to increase RNA polymerase binding affinity (Maston et al. 2006). These proximal promoters are normally (~60%) located near a CpG island. Distal regulatory elements are located over 1000 bp upstream of the core promoter or downstream of the transcription termination site and, in addition to enhancers and silencers, include sequences known as “insulators”. Insulators are not TFBSs like proximal promoters and silencers are; transcription factors do not bind to insulators. Instead, they act to protect the gene they are associated with from being affected by the transcriptional activity of neighboring genes. They do so by blocking enhancer-promoter communication and preventing the spread of repressive chromatin.

Downstream of the promoter region lies the gene itself. The DNA that makes up the gene can be described in several ways: exon and intron, coding and non-coding. It is the coding DNA sequence (CDS) that is ultimately read during the process of translation from mRNA to protein. This translation generates the primary structure of the protein, the linear sequence of amino acids. This primary structure then begins to fold, sometimes with the help of other “chaperone” proteins, into the secondary structures (alpha helices and beta sheets), which in turn continue to fold into the tertiary structure of the protein (Whitford 2005). The folding of the protein into its tertiary structure is initially driven by nonspecific hydrophobic interactions, but the final conformation needs to be “locked” into place through specific chemical interactions, which include salt bridges, disulfide bonds, and hydrogen bonds.

Determining the tertiary structure of a protein is usually done through direct analysis of a sample of the protein. Methods include X-ray crystallography (Larsen et al. 1994), nuclear magnetic resonance spectroscopy (Poulsen 1994), and cryogenic electron microscopy (Hoffman et al. 2020). Inferring the tertiary structure of an uncharacterized protein from its primary structure is a difficult task. Without knowledge of which environmental factors, including chaperone proteins, interact with the folding protein, and the order in which they interact, any inference of the tertiary structure from the primary structure is a “best guess”. Instead, it is more convenient to compare the sequence (DNA or amino acid residue) of the unknown protein to that of a known protein through multiple sequence alignments (Subramaniam 1994). In this way it is possible to recognize certain common patterns, or “motifs”, in the primary amino acid sequence. These motifs are indicative of certain structural patterns present in the tertiary structure, known as “supersecondary structures” (Efimov 1994). One example of such a motif is the leucine zipper, a motif in which a leucine residue occurs at every seventh position in the sequence of an alpha helix (Whitford 2005). This placement causes the leucine residues to sit atop one another once the primary structure curls into a helix. A leucine zipper is a point of interface for the protein, allowing it to bind either to DNA or to another protein’s leucine zipper in order to form a complex of multiple proteins, or quaternary structure. Other common motifs include the helix-loop-helix and the zinc finger. A helix-loop-helix motif is often found in transcription factors. It is made of two sequential alpha helices connected by a short loop of amino acids. In transcription factors, these two helices contain basic amino acid residues to facilitate binding to DNA (Zipursky et al. 2003). Zinc fingers are similar in function to helix-loop-helix structures; the term “zinc finger” describes a number of similarly constructed motifs consisting of two beta sheets and one alpha helix folded over the sheets. The structure is stabilized through the bonding of the secondary structure components to a zinc ion. As with the helix-loop-helix supersecondary structure, zinc fingers bind to nucleic acids at the basic amino acid residues present in the alpha helix (Latchman 2004). Finally, while not a structural motif, the hydrophobicity of the protein can be used to help determine its function. While hydrophobic regions help force the protein to fold into its tertiary structure, an exposed hydrophobic region still present in the tertiary structure indicates that the region will sit inside a plasma membrane; the protein is either embedded in or crosses a plasma membrane within the cell (Buchberg et al. 1990).
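The leucine zipper pattern described above (a leucine at every seventh position) lends itself to a simple search of the primary sequence. The Python sketch below is illustrative only; the function name is hypothetical, and the requirement of four leucines in register is an assumption chosen for the example rather than a criterion taken from the cited sources.

```python
import re


def leucine_heptads(protein, min_leucines=4):
    """Find runs with a leucine at every seventh position.

    Returns (start index, matched segment) pairs. A lookahead is used so
    that overlapping candidates are all reported.
    """
    pattern = re.compile(r"(?=(L(?:.{6}L){%d}))" % (min_leucines - 1))
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein.upper())]
```

A hit only indicates that the spacing is compatible with a zipper; whether the segment actually forms an alpha helix cannot be judged from this pattern alone.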

Transcription factor binding sites and protein structural motifs are both necessary to the proper expression and function of a gene. As such, they are under selective evolutionary pressure. For the structural motifs, any deviation from the existing amino acid sequence threatens to destabilize the final protein, altering protein efficiency, usually to the detriment of the cell expressing the mutant protein. These motifs are therefore under selective pressure not to change, and protein motifs should be expected to be conserved across analogous genes between species. Compared to the CDS of a gene, the promoter region is much more flexible in terms of acceptance of genetic variation (Arnosti and Kulkarni, 2005). Distal regulatory elements can typically be found up to 1000 base pairs upstream of the transcription start site, and the exact position of the regulatory elements is not set in stone. The regulatory elements themselves also have flexibility in their sequence, with the initiator promoter element being an extreme example of sequence flexibility. Yet there is still a level of selective pressure on these promoter elements, because transcription factors do require a certain amount of consistency in the sequences they bind to (Dikstein 2011, Mahpour et al. 2018).

Ecotropic Viral Integration Site 2A, a Gene within a Gene

EVI2A is a gene found in jawed vertebrates with many interesting characteristics. Among those is the fact that it is expressed at higher levels in Small Lymphocytic Lymphoma (a lymphoma that affects B-cells) than in Mantle Cell Lymphoma when compared using suppression subtractive hybridization (Henson et al. 2011). The EVI2A gene itself is also of interest for many reasons. To begin with, the entire EVI2A locus is a “nested” gene; it is located within an intron of another gene, NF1, on chromosome 17 (Fig. 1). The direction of transcription for EVI2A is also opposite that of NF1, running from telomere to centromere (Largaespada et al. 1995). Therefore, transcription factors for EVI2A should not affect genomic regulation of the surrounding NF1 gene that EVI2A is nested within. This makes it typical in terms of nested gene traits (Kumar 2009).

While the exact function of EVI2A has not been characterized, it is unlikely to be directly connected to the function of NF1, as nested genes rarely share similarities in terms of function or expression patterns. Nested genes are usually the result of either gene duplication or retrotransposition (Kumar 2009). In the case of EVI2A, the former is more likely, as the gene still contains intronic regions that would not be present if the insertion were due to retrotransposition of a processed mRNA. Interestingly, EVI2A shares no similarity with the neighboring gene EVI2B; both genes appear to be the results of separate gene duplication events. This, along with the presence of a third gene in the region, OMG, suggests that the NF1 EVI2 locus has always had a high affinity for integration of new DNA, and that this affinity is not a function of the genes present in the locus.

As a nested gene, there is the question of what sort of selective pressure the promoter region of EVI2A is subject to. It is possible that EVI2A is under normal selective pressure, in which the sequence that has optimal binding affinity with the proper transcription factors is selected for. This kind of selection would occur independent of species; a given binding sequence is ideal across species spanning multiple classes. A second option also exists: that EVI2A transcription factor binding sites are conserved as the result of a selective sweep. The surrounding NF1 gene is under selective pressure to remain the same, as mutant genes have a high propensity to lead to the development of neurofibromatosis. With EVI2A nested entirely within NF1, it is possible that its sequence, including the promoter region, is “protected” from mutations, regardless of the effect any mutation may have on the viability of the promoter region.

EVI2A is expressed primarily in the nervous system (predominantly in the spinal cord) and in lymphocytes (GeneCards, GTEx). The exact purpose of the EVI2A protein is currently unknown, but the predicted structure of the protein indicates the presence of multiple leucine zippers, a structural trait that can facilitate binding to both other proteins and DNA (Buchberg et al. 1990). In EVI2A, these zippers are found in a transmembrane domain of the protein (as identified by graphing hydrophobicity of the protein primary structure), and the positioning of two zippers in the alpha-helix suggests that EVI2A interacts with at least two other proteins in order to form a functional unit.

Finally, the EVI2 locus (which contains EVI2A, EVI2B, and a third gene, OMG) is a common integration site for various proviruses (Buchberg et al. 1990). Integration of proviruses at EVI2A can disrupt the function of NF1, which can lead to the disease Neurofibromatosis Type 1, which is characterized by the growth of nerve sheath tumors, or neurofibromas (Largaespada et al. 1995). These neurofibromas can develop into malignant peripheral nerve sheath tumors (MPNSTs). In these MPNSTs, EVI2A is shown to be significantly upregulated (Pasmant et al. 2011). In addition, when a virus inserts proviral DNA into EVI2A, it tends to insert into the promoter region of the gene. This alters the expression of EVI2A, but not its coding region (Buchberg et al. 1990). The result is a case of neurofibromatosis caused by truncated NF1 proteins, with a correlative increase in EVI2A expression (Largaespada et al. 1995).

In this study, putative promoter regions and coding DNA sequences of fifty-five species found in the National Center for Biotechnology Information database were aligned for basic phylogenetic analyses. For the coding DNA sequences, the DNA sequence was converted to an amino acid sequence before alignment. The two groups of sequences were used to generate a pair of multiple sequence alignments which, in turn, were subjected to statistical analyses including Bayesian inference and phylogenetic tree construction. The genetic phylogenies developed were compared against known species phylogenies in order to look for evidence that gene evolution, in both the promoter region and the CDS, does not necessarily follow overall species evolution. At the same time, the promoter and CDS MSAs were examined for conserved transcription factor binding sites and structural motifs, respectively. These analyses suggest that conservation of transcription factor binding sites is less stringent than conservation of structural motifs. Taken together, the MSAs and phylogenies generated through this study suggest that different parts of a gene are under different levels of negative selective pressure.
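The conversion of a coding DNA sequence to an amino acid sequence mentioned above can be sketched in a few lines of Python. This is a minimal illustration assuming the standard genetic code; it does not reproduce the software used in this study, and the function and parameter names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, stop_at_stop=True):
    """Translate a CDS in frame 0; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads the codons ATG, GCC, and TAA and stops at the stop codon, giving the two-residue peptide "MA".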


Materials and Methods

PCR and Cloning of Recombinant Plasmid

Primers were designed using the Primer3Plus software to amplify a 1577 base pair (bp) region of DNA that contains the transcription start site of EVI2A as well as an approximately 1500 bp region upstream of it. Primers were purchased through IDT DNA. The primers were then phosphorylated using the T4 Polynucleotide Kinase (PNK) and corresponding buffer from the Lucigen™ pGC Blue Cloning Kit, following the protocol described in the kit.

The phosphorylated primers were then used to perform a polymerase chain reaction (PCR), which amplifies the region of interest from a genomic DNA template. The PCR reaction was performed as per the protocol provided by Promega™ for use with their 2x GoTaq Master Mix, at a final volume of 50 µL. The PCR thermocycling program was designed as a touchdown PCR program to maximize the amplification across a wide range of primer melting temperatures.


Cycles    Temperature    Duration
1         95°C           5 min
2         95°C           30 sec
          60°C           40 sec
          72°C           2 min
2         95°C           30 sec
          57°C           40 sec
          72°C           2 min
2         95°C           30 sec
          54°C           40 sec
          72°C           2 min
2         95°C           30 sec
          52°C           40 sec
          72°C           2 min
27        95°C           30 sec
          50°C           40 sec
          72°C           2 min
1         4°C            Hold

Table 1. Touchdown PCR Cycling Parameters
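The touchdown program in Table 1 can be written out as a simple data structure, which makes the total cycle count and minimum run time easy to check. The Python sketch below is illustrative only; the names are hypothetical, and the time estimate ignores the final hold and any ramping between temperatures.

```python
# (number of cycles, annealing temperature in degrees C), from Table 1.
TOUCHDOWN_STAGES = [(2, 60.0), (2, 57.0), (2, 54.0), (2, 52.0), (27, 50.0)]


def build_program(stages, initial=(95.0, 300), denature=(95.0, 30),
                  anneal_seconds=40, extend=(72.0, 120)):
    """Expand the stages into a flat list of (temperature, seconds) steps."""
    steps = [initial]
    for cycles, anneal_temp in stages:
        for _ in range(cycles):
            steps += [denature, (anneal_temp, anneal_seconds), extend]
    return steps


program = build_program(TOUCHDOWN_STAGES)
total_cycles = sum(cycles for cycles, _ in TOUCHDOWN_STAGES)
total_seconds = sum(seconds for _, seconds in program)
```

With the values in Table 1, this gives 35 amplification cycles and 6950 seconds (just under two hours) of programmed step time.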

The PCR product was purified via gel electrophoresis. Gel electrophoresis was performed on a 1% weight-by-volume agarose gel at 70 volts for 85 minutes. Also included in the gel was either GeneRuler 1kb or GeneRuler 1kb+ ladder (Thermo Fisher™). The bands of amplified DNA were viewed by staining the gel with GelRed dye (Biotium™) and imaging the gel under long-wave UV light. The amplicon band was cut out and purified using the GeneJET Gel Extraction Kit (ThermoFisher™), following the protocol provided with the kit. The concentration of the amplicon product was measured at this point via NanoDrop spectrophotometer (ThermoFisher™). The amplicon was cloned into the pGC Blue cloning vector (Lucigen™) using the protocol provided with the pGC Blue Cloning and Amplification Kit (Lucigen™).


Transformation and Cell Culture

Plasmid was transformed into GC5 chemically competent E. coli cells as per protocol (Sigma-Aldrich™). Plasmids without an inserted amplicon contained the gene for a functional α-fragment of β-galactosidase, which combined with the Ω-fragment in the GC5 cell to produce a functioning protein that would break down X-gal into a pair of subunits, one of which is blue in color. Plasmids containing the amplicon would have it inserted in the multiple cloning site (MCS), which sits in the middle of the α-fragment, thereby disrupting the α-fragment and preventing a functional α-fragment from being produced. These GC5 cells were unable to break down X-gal and so appeared clear.

Five colonies that showed successful transformation (clear colonies) were removed from the plate and suspended in 4 mL of a mixture of Lysogeny broth (LB) and kanamycin. These samples were incubated on a shaker for 16 hours at 37°C. These outgrowths were used to obtain purified plasmid DNA via a boil preparation.

To confirm the presence and orientation of the promoter region insert within the pGC Blue plasmid, a sample of the plasmid was digested using restriction enzymes, following the protocol provided by ThermoFisher™ for use with their Fast Digest buffer and enzymes. The restriction enzyme EcoRI (restriction site 5’-G^AATTC-3’) was used to confirm the presence of the insert, while NcoI (5’-C^CATGG-3’) was used to determine the orientation of the insert. Following digestion, the reaction was run on a gel, stained with ethidium bromide (EtBr), and imaged under short-wave UV light to view the bands.
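The fragment sizes expected from such a diagnostic digest can be predicted in silico before running the gel. The following Python sketch is a simplified illustration using the two recognition sites given above; the function names are hypothetical, and for brevity it does not detect a site that spans the point where a circular sequence is joined.

```python
# Recognition site and cut offset within the site (G^AATTC, C^CATGG).
ENZYMES = {"EcoRI": ("GAATTC", 1), "NcoI": ("CCATGG", 1)}


def cut_positions(seq, enzyme):
    """Return the top-strand cut positions of the enzyme in seq."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, enzyme, circular=True):
    """Return fragment lengths, largest first, after a complete digest."""
    cuts = cut_positions(seq, enzyme)
    n = len(seq)
    if not cuts:
        return [n]  # uncut molecule
    if circular:
        # Each fragment runs from one cut to the next, wrapping around.
        pairs = zip(cuts, cuts[1:] + cuts[:1])
        return sorted(((b - a) % n or n for a, b in pairs), reverse=True)
    edges = [0] + cuts + [n]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)
```

Comparing the predicted lengths for the two possible insert orientations against the observed bands is the in silico counterpart of the NcoI orientation check.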

Once the presence and orientation of the putative promoter amplicon had been confirmed in the pGC Blue plasmid, the amplicon was excised from pGC Blue and ligated into the pGL3 Basic plasmid. The pGL3 Basic plasmid contains an ampicillin resistance gene for screening, as well as a firefly luciferase gene lacking a promoter. The putative promoter is inserted upstream of the luc gene. To do this, the pGC Blue plasmid containing the amplicon was digested with BcuI (restriction site 5’-A^CTAGT-3’) and XhoI (5’-C^TCGAG-3’) restriction enzymes (Promega™), and the “empty” pGL3 Basic plasmid was digested with NheI (5’-G^CTAGC-3’) and XhoI in a buffer solution provided by Promega. The putative promoter region and pGL3 plasmid were ligated together using T4 DNA ligase, and the plasmid was inserted into GC5 chemically competent cells according to protocol (Millipore-Sigma™) for a second round of selection and outgrowth. GC5 cells were grown overnight on an LB plate containing ampicillin to allow for screening based on the resistance gene included in pGL3. Colonies that successfully grew were grown overnight and analyzed via boil preparation for the presence of the plasmid and insert. From the aliquot of cells not analyzed by boil preparation, the plasmid DNA was purified via the GeneJET Plasmid Miniprep Kit (ThermoFisher™) following the included protocol.

The purified plasmid was prepared for analysis by adding an aliquot of plasmid to one of two primers designed to amplify the putative promoter insert region of the plasmid from either the 5’ end (RVp3) or the 3’ end (GLp2). The product was sent to Laragen Sequencing and Genotyping for Sanger sequencing. Upon receiving the sequence of the putative promoter region, it was compared against the sequence provided by NCBI to identify any known single nucleotide polymorphisms (SNPs), or mutations accrued due to exposure to short-wave UV radiation during imaging. Two previously uncharacterized SNPs were found and noted, but neither altered the sequence of any of the potential TFBSs identified.


Generation of Deletion Constructs

To create the deletion constructs, primers were designed for a process termed “amplification with exclusion (AWE)”. Each primer pair consisted of approximately 30 base pair primers with GC clamps (areas of 5 nucleotides with at least 2 G/C nucleotides) on both the 5’ and 3’ end. The primers were designed so as to bind flanking an approximately 200 bp fragment of the putative promoter region in the plasmid, but with the 3’ ends of the primers facing away from the fragment. In this way, everything on the circular plasmid except the 200 bp fragment would be amplified by PCR. Each primer set used the same 5’ primer, so that each consecutive deletion construct would remove a larger segment of the putative promoter region. A total of six primer sets were designed, and amplified via PCR in a touchdown procedure to maximize the product produced.

Cycles    Temperature    Duration
1         95°C           5 min
2         95°C           30 sec
          57.5°C         40 sec
          72°C           6 min
2         95°C           30 sec
          54.5°C         40 sec
          72°C           6 min
2         95°C           30 sec
          51.5°C         40 sec
          72°C           6 min
2         95°C           30 sec
          48.5°C         40 sec
          72°C           6 min
27        95°C           30 sec
          47.5°C         40 sec
          72°C           6 min
1         72°C           5 min
          4°C            Hold

Table 2. Amplification With Exclusion PCR Parameters
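The GC clamp criterion given for the AWE primers (at least 2 G/C nucleotides within the 5 nucleotides at each end) can be checked programmatically when screening candidate primers. The Python sketch below is a minimal illustration; the function names are hypothetical and the example sequences used with it are invented, not the primers from this study.

```python
def has_gc_clamp(window, min_gc=2):
    """True when the window contains at least min_gc G or C bases."""
    return sum(base in "GC" for base in window.upper()) >= min_gc


def primer_has_clamps(primer, clamp_len=5, min_gc=2):
    """Check both the 5' and 3' ends of a primer for a GC clamp."""
    five_prime = primer[:clamp_len]
    three_prime = primer[-clamp_len:]
    return has_gc_clamp(five_prime, min_gc) and has_gc_clamp(three_prime, min_gc)
```

A primer passes only when both ends satisfy the criterion, matching the requirement that clamps be present on both the 5’ and 3’ end.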


The amplified AWE product was transformed into GC5 chemically competent cells. The GC5 cells perform the ligation reaction necessary to re-circularize the linear AWE product. GC5 cells were again grown on LB-AMP plates to screen for plasmid insertion, with the successfully grown colonies grown in LB-AMP broth. Cells were lysed via boil preparation, and the resulting purified plasmid was digested with the appropriate restriction enzyme and analyzed via gel electrophoresis to test that the AWE process successfully excised the 200 bp fragment. The digested plasmid that was not run on a gel was ligated back into a circular plasmid for transfection.

Transfection and Luciferase Assay

Following confirmation of a successful amplification of the deletion construct, the plasmid was transfected into HEK293 cells. HEK293 cells were grown on a 12-well plate according to protocol (Qiagen™). At this point, the HEK cells contained the pGL3 plasmid containing the putative promoter region, as well as the pRL SV40 normalization construct, which contains the gene for Renilla luciferase. This construct acted as a positive control. HEK cells were imaged via fluorescence microscopy to determine the effect of excising successive 200 bp areas of the promoter. A baseline fluorescence level was determined by transfecting HEK cells with the full promoter, and each deletion construct’s fluorescence was measured in relation to the baseline as “Relative Fluorescence Units” (RFU). Should the RFU value of a given deletion construct decrease after a segment was excised, that was taken as evidence that an activating TFBS was present in that sequence. Conversely, should the RFU value increase following the removal of a section, that was taken as evidence that the segment contained a repressing TFBS.
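The comparison of each deletion construct against the full-promoter baseline can be expressed as a simple ratio. The Python sketch below assumes that each reporter signal is first divided by the signal from the Renilla normalization construct in the same well, which is a common convention and an assumption here rather than a calculation stated in this section. The function name and the numbers used with it are hypothetical.

```python
def relative_activity(reporter, renilla, baseline_reporter, baseline_renilla):
    """Normalized reporter signal of a construct relative to the baseline.

    Values below 1.0 indicate lower activity than the full promoter;
    values above 1.0 indicate higher activity.
    """
    if renilla <= 0 or baseline_renilla <= 0 or baseline_reporter <= 0:
        raise ValueError("signals must be positive")
    return (reporter / renilla) / (baseline_reporter / baseline_renilla)
```

For instance, a construct reading 500 units against a baseline of 1000 units, with equal normalization signals of 100 in both wells, would have a relative activity of 0.5.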


Identification of Transcription Factor Binding Sites

Potential transcription factor binding sites (TFBSs) for the EVI2A promoter were identified using three programs: AliBaba2.1, Match, and Consite. Potential TFBSs were defined as any binding site on which at least two of the programs agreed. In total, 19 potential TFBSs were identified in this manner. Additionally, a phylogenetic tree was developed showing the homologous location of the EVI2A gene in species closely related to humans. Species were chosen to encompass a wide range of vertebrate classes in order to track potential points of change between classes, and sequences were obtained from the National Center for Biotechnology Information database.
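The two-program agreement rule for TFBSs can be sketched as follows. This is a simplified illustration that matches predictions by factor name only; a fuller comparison would also require the predicted positions to overlap. The function name and the example predictions (including "FactorX") are hypothetical and are not the actual output of AliBaba2.1, Match, or Consite.

```python
def consensus_factors(predictions, min_programs=2):
    """Return factors predicted by at least `min_programs` programs.

    predictions: dict mapping program name -> set of predicted factor names.
    The result maps each retained factor to the sorted list of programs
    that predicted it.
    """
    support = {}
    for program, factors in predictions.items():
        for factor in factors:
            support.setdefault(factor, []).append(program)
    return {
        factor: sorted(programs)
        for factor, programs in support.items()
        if len(programs) >= min_programs
    }


# Hypothetical predictions, for illustration only.
predictions = {
    "AliBaba2.1": {"Sox-5", "HLF", "FactorX"},
    "Match": {"HLF", "cREL"},
    "Consite": {"Sox-5", "HLF", "cREL"},
}
agreed = consensus_factors(predictions)
for factor in sorted(agreed):
    print(factor, agreed[factor])
```

Keeping the list of supporting programs alongside each factor mirrors the color-coding by program pair used in Fig. 2.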

Determination of Region for Analysis

An initial multiple sequence alignment was run using a smaller sample of mammals (primates, rodents, and artiodactyls) in order to determine where conserved sequences were clustered. The area that contained transcription factor binding sites was chosen to be further examined across a wider variety of animal species to determine if the conservation of TFBSs still held.

Multiple Sequence Alignment (MSA)

Separate MSAs were performed for the putative promoter region and the CDS of EVI2A. The putative promoter region was identified by the initial MSA to be a region approximately six hundred bases upstream of the assumed transcription start site. That sequence was used in conjunction with the Basic Local Alignment Search Tool (BLAST) to obtain orthologous promoter regions of EVI2A in other species. A total of fifty-five different species were obtained, in classes Mammalia, Aves, Reptilia, Actinopterygii, and Sarcopterygii. For each species, the CDS was also obtained for MSA. All sequences (promoter and CDS) were obtained through the National Center for Biotechnology Information (NCBI) gene database. Sequences were aligned using MEGA version 7.0.26 software. The promoter region sequences were aligned using ClustalW parameters, while the CDS was aligned using MUSCLE (Codon) parameters.

Model Testing

To determine which evolutionary models fit each alignment, model testing was performed using the model test function provided in MEGA7. Model testing for generation of a Maximum Likelihood (ML) tree included the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to assess a substitution model’s goodness of fit to the dataset. For both the promoter and CDS alignments, both tests returned a Hasegawa-Kishino-Yano model (HKY) as the best supported base evolution model. Testing of the CDS alignment also recommended inclusion of a gamma distribution parameter (+G), as well as invariable sites (+I). Meanwhile, testing of the promoter sequence alignment did not recommend inclusion of gamma distribution or invariable sites.
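For reference, the standard textbook definitions of the two criteria can be sketched as below, where lower scores indicate a better trade-off between fit and parameter count. This is not MEGA7's implementation, and the exact variant the software reports is not stated here; the model names, log-likelihoods, and parameter counts in the example are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(candidates, n_sites, criterion="BIC"):
    """Pick the lowest-scoring model.

    candidates: dict mapping model name -> (log_likelihood, n_params).
    """
    def score(item):
        log_likelihood, n_params = item[1]
        if criterion == "AIC":
            return aic(log_likelihood, n_params)
        return bic(log_likelihood, n_params, n_sites)

    return min(candidates.items(), key=score)[0]


# Hypothetical fits, for illustration only.
candidates = {"model_A": (-1000.0, 5), "model_B": (-990.0, 9)}
print(best_model(candidates, n_sites=1000, criterion="AIC"))
print(best_model(candidates, n_sites=1000, criterion="BIC"))
```

Because BIC penalizes each extra parameter by ln(n) rather than 2, it tends to favor simpler models on long alignments, which is why agreement between the two criteria is reassuring.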

Tree Construction

Maximum likelihood along with parametric bootstrapping (1,000 replicates) was performed based on the appropriate model for both promoter and CDS alignments (HKY and HKY+G+I, respectively). In addition to ML, neighbor joining (NJ) and maximum parsimony (MP) were performed using the MEGA 7 software to further evaluate some of the statistically supported groupings of ML. Bayesian inference was also performed, using the Geneious Prime 2020.1.1 software platform (https://www.geneious.com/) with the MrBayes 2.2.4 plugin component. A total of 1.1 million generations were performed, generating a tree every 200 generations. The first 100,000 generations were discarded as burn-in.
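The number of trees retained after burn-in follows directly from the settings above. The arithmetic can be sketched as below; the function name is hypothetical, and the count may differ by one depending on whether the software also records the starting tree at generation 0.

```python
def retained_samples(total_generations, sample_every, burn_in_generations):
    """Return (trees sampled, trees discarded as burn-in, trees retained)."""
    sampled = total_generations // sample_every
    discarded = burn_in_generations // sample_every
    return sampled, discarded, sampled - discarded


sampled, discarded, retained = retained_samples(1_100_000, 200, 100_000)
print(sampled, discarded, retained)  # 5500 500 5000
```

Under these settings roughly 5,000 sampled trees contribute to the posterior probabilities reported on the Bayesian trees.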

Promoter and CDS Conserved Motif Search

The aligned EVI2A putative promoter sequences from the fifty-five species were analyzed by visual inspection to identify conserved sequences that matched the putative TFBSs identified using Match, Consite, and AliBaba. Conservation in this case was not restricted by alignment; only the presence or absence of the TFBS in the putative promoter region was considered. Species were grouped by shared TFBSs. The putative promoter region was also analyzed using the “CpG Island Finder” software (dbcat.cgm.ntu.edu.tw) in order to check for the presence of CpG islands. Similarly, the CDSs of fifty-four species were analyzed to identify conserved protein motifs. Protein motif databases InterPro and MyHits were used to identify putative structural motifs in the translated CDS of the human EVI2A gene. Visual inspection was carried out in order to determine which identified motifs were conserved between human and non-human CDSs. Unlike with the promoter region, location of the motif was a factor; if the motif did not align with the human EVI2A motif, it was not considered to be conserved between the two sequences.
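The presence-or-absence criterion used for promoter TFBSs was applied by visual inspection, but it can be illustrated programmatically. The sketch below is an assumption-laden stand-in for that inspection: it expands an IUPAC consensus (here CCACAMY, the Sox-5 consensus reported in this study) into a regular expression, strips alignment gaps so position in the alignment does not matter, and searches the forward strand only. The function names and toy sequences are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_positions(consensus, sequence):
    """0-based start positions of all (possibly overlapping) matches.

    Alignment gaps ("-") are removed first, so positions refer to the
    ungapped sequence. Only the forward strand is searched.
    """
    body = "".join(IUPAC[base] for base in consensus.upper())
    pattern = re.compile("(?=" + body + ")")
    ungapped = sequence.upper().replace("-", "")
    return [match.start() for match in pattern.finditer(ungapped)]


def has_motif(consensus, sequence):
    """Presence/absence of the motif, regardless of its position."""
    return bool(motif_positions(consensus, sequence))


# Toy sequences, for illustration only (not sequences from this study).
toy = {
    "species_1": "TTGCCACAATGG",
    "species_2": "TTG--CCACACCAA",
    "species_3": "TTGCCACAGTGG",
}
for name, seq in toy.items():
    print(name, has_motif("CCACAMY", seq))
```

A fuller search would also scan the reverse complement, since a binding site can lie on either strand.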

Results and Discussion

Choice of Species for Analysis

The National Center for Biotechnology Information database entries on EVI2A indicated that the gene was only present in the phylum Chordata. It is not present in the chordates’ closest living relatives, the echinoderms. This suggests that the insertion of the EVI2 region into the NF1 intronic region occurred sometime after the divergence of chordates from echinoderms. Upon further examination, it was also shown that, while all chordates have an EVI2 region, not all EVI2 regions are the same. Coelacanths and cartilaginous fishes (such as sharks) have orthologs of EVI2A and OMG in their EVI2 region, and lack EVI2B. Bony fishes, meanwhile, have EVI2B orthologs but lack EVI2A and OMG orthologs. All other jawed vertebrates have orthologs of all three genes located in their EVI2 region. This suggests an interesting evolutionary chain, whereby the ancestral EVI2 region contained all three genes and was introduced into the genome after the development of jaws on the evolutionary timeline. Cartilaginous fish experienced a loss of function in EVI2B after the development of jaws, but before the development of bony skeletons. From there, bony fish experienced a loss of EVI2A and OMG after the split from lunged animals, while lobe-finned fish (such as the coelacanth) experienced a loss of the EVI2B ortholog after the development of lungs but before the development of limbs in tetrapods. Tetrapods, meanwhile, retain orthologs of all three ancestral genes. For this reason, bony fish were excluded from analysis in this study.

Mapping of Potential Transcription Factor Binding Sites

A 1577 base pair-long region, spanning 1,449 bases upstream and 128 bases downstream of the believed EVI2A transcription start site (Fig. 2), was chosen to be the initial area for examination. Since promoter regions typically extend to only approximately 1000 bases upstream of the transcription start site, a larger area was chosen to ensure that no transcriptional elements were overlooked. Potential binding sites that were identified by at least two of the programs were highlighted and color-coded according to which programs identified them (Fig. 2). Transcription factor binding sites identified by all three programs were bracketed in green. Sites identified by AliBaba and Consite were bracketed in blue. Sites identified by Consite and Match 2.0 were bracketed in yellow. Sites identified by AliBaba and Match 2.0 were bracketed in purple. The sequence was also used to map out where deletion constructs would be generated through amplification with exclusion (Fig. 2). Constructs were generated that covered the areas defined by -1248/+128, -1048/+128, -848/+128, -648/+128, -448/+128, and -58/+128 base pairs. Construct locations were chosen to remove approximately 200 bases of putative promoter at a time, with care not to interrupt any putative TFBSs. Results from CpG Island Finder indicated that no CpG islands were present in the putative promoter region of the human EVI2A gene.
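The construct boundaries listed above can be turned into lengths and per-step deletions with simple arithmetic. The sketch below is illustrative only; it assumes the conventional numbering in which +1 is the transcription start site and there is no position 0, and it takes the full-length boundary (-1449/+128) from the Figure 2 and Figure 5 captions. The variable and function names are hypothetical.

```python
# Upstream boundaries (bases upstream of +1) of the full-length and
# deletion constructs; every construct ends at +128.
CONSTRUCT_STARTS = [1449, 1248, 1048, 848, 648, 448, 58]
DOWNSTREAM = 128


def construct_length(upstream, downstream=DOWNSTREAM):
    """Length in bp of a -upstream/+downstream construct (no position 0)."""
    return upstream + downstream


lengths = [construct_length(start) for start in CONSTRUCT_STARTS]
removed = [a - b for a, b in zip(CONSTRUCT_STARTS, CONSTRUCT_STARTS[1:])]

print(lengths)  # [1577, 1376, 1176, 976, 776, 576, 186]
print(removed)  # [201, 200, 200, 200, 200, 390]
```

The full-length value of 1577 bp agrees with the size of the region chosen for examination, and the list of removed bases makes the size of each deletion step explicit.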

The initial MSA of mammals identified a region approximately 800 base pairs long upstream of the assumed transcription start site. This region contained the transcription factor binding sites identified as above that were also conserved across multiple species. While the sequence upstream of this region did contain TFBSs, none of these were conserved across more than four different species. For this reason, the region on the genome from (-1577) to (-775) bp was not examined across the fifty-five species in the complete MSAs.

Confirmation of Plasmid Generation through Gel Electrophoresis

After the putative EVI2A promoter region was amplified and ligated into a plasmid (either pGC Blue or pGL3 Basic), a sample of the plasmid was digested with restriction enzymes in order to confirm that the amplicon had been successfully ligated into the plasmid. When the pGC Blue plasmid construct was digested with EcoRI and NcoI restriction enzymes, it generated bands of sizes indicating that the putative promoter region had been successfully ligated into the plasmid in all tested samples (Fig. 3). Similarly, after the putative promoter region had been excised from the pGC Blue plasmid and ligated into the pGL3 Basic plasmid, the construct was digested with EcoRI or NcoI and run on a gel again (Fig. 4). Of the six samples taken for testing, only one sample showed bands of the appropriate size to denote successful ligation of amplicon into plasmid and transformation of plasmid into GC5 E. coli cells.

Analysis of Deletion Constructs by Transient Transfection

Deletion constructs generated through amplification with exclusion were transfected into human embryonic kidney 293T cells via transient transfection. The transfected cells were harvested for luciferase 24 hours post transfection. Levels of luciferase were assayed using a MonoLight 3010 luminometer system (BD Biosystems™). Luciferase activity was normalized to the pRL SV40 promoter and enhancer vector, then divided by the results of a control plasmid containing the luc gene with no promoter (Fig. 5). Statistical significance was determined by Student’s two-sided t-test. The activity of the EVI2A promoter deletion constructs did not show statistically significant variation compared to either each other or to the controls. For this reason, the results were not used to determine the putative promoter region of EVI2A. Instead, a bioinformatics approach was used, looking for transcription factor binding sites conserved across different animal species.
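The two-step normalization described above (normalization to the co-transfected pRL SV40 signal, then division by the promoterless control) can be sketched as follows. This is a minimal illustration under the assumption that normalization is done per well as a simple ratio; the function names and all readings are hypothetical and are not data from this study.

```python
from statistics import mean


def normalized_activity(firefly, renilla):
    """Per-well firefly signal divided by the co-transfected Renilla signal."""
    return [f / r for f, r in zip(firefly, renilla)]


def fold_over_promoterless(construct_wells, empty_wells):
    """Mean normalized activity of a construct relative to the promoterless control."""
    return mean(construct_wells) / mean(empty_wells)


# Hypothetical luminometer readings, for illustration only.
construct = normalized_activity([400.0, 500.0, 300.0], [100.0, 100.0, 100.0])
empty = normalized_activity([100.0, 150.0, 50.0], [100.0, 100.0, 100.0])

print(fold_over_promoterless(construct, empty))  # 4.0
```

A fold value near 1 would mean the construct is no more active than the promoterless vector; a significance test on the replicate wells would then decide whether any departure from 1 is meaningful.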

EVI2A Coding DNA Sequence Phylogenetics

Phylogeny was estimated with both Bayesian inference (Fig. 6) and Maximum Likelihood (Fig. 7) in order to compare results. The topologies of the ML and Bayesian trees are identical. Both trees match the expected topology given the variety of different classes and species involved, with one significant divergence. Homo sapiens is clustered with its closest relatives, the great apes, which, in turn, bifurcate most recently from monkeys. The primates, in turn, split from other placental mammal groups, then from marsupials and monotremes, and finally from a second group consisting of reptiles and birds. Artiodactyls are located as expected, with cetaceans placed adjacent to the terrestrial artiodactyls, which split from the carnivora. The disparity with current knowledge exists in the relationship between the primate, rodent, artiodactyl, and carnivora groups. According to current understanding of the tree of life, rodents and primates share a more recent last common ancestor than artiodactyls/carnivora and primates do. Primates and rodents should be grouped into the Euarchontoglires clade, while artiodactyls and carnivora are grouped into the Laurasiatheria clade. High support values on both the Bayesian and ML trees (posterior probabilities and bootstrap values, respectively) support the validity of the deviation from expected results. This disparity between established species phylogenies and the phylogenies generated around the EVI2A gene seems to support the possibility that a mutation in the rodent EVI2A CDS occurred after primates and rodents diverged from a last common ancestor, while primates retained the ancestral genotype shared by artiodactyls and carnivora.

EVI2A Promoter Phylogenetics

As with the coding DNA sequence, the putative promoter region was analyzed through both Bayesian inference and Maximum Likelihood, with the two trees sharing identical topologies (Fig. 8 and 9). Unlike the CDS tree, however, the tree deviates from currently established species phylogenetic trees in multiple locations. While great apes (including humans) are still properly grouped together, a polyphyletic group made up of marsupials, monotremes, and reptiles bifurcates off from the primates in a manner inconsistent with known evolutionary biology. Following this, a second misplaced bifurcation would suggest that artiodactyls and carnivora share a more recent common ancestor with most primates than those primates do with macaques and olive baboons. Rodents are no longer clustered together, and are instead spread across multiple branches. One example is that the naked mole rat and domestic goat clustered as most closely related, with high support from both Bayesian inference and maximum likelihood (posterior probability of 0.999 and bootstrap value of 92, respectively). The only group relatively unaffected is aves. Despite the outwardly messy-appearing deviations in the phylogeny, support values remain high at most bifurcations, even taking into account that ML bootstrap values are more conservative than the posterior probabilities Bayesian inference generates (Yang and Rannala, 1997). It is not surprising to see these kinds of deviations from the accepted norm when looking at the promoter regions of the EVI2A genotypes. As discussed previously, the promoter region of a gene is not as stringent in the placement of transcription factor binding sites as the CDS is in the placement of protein structural motifs. Instead, it is more valuable to look for conserved binding sites. If the site exists within the approximately eight hundred base pair region that makes up the putative promoter, then it should perform equally well as a part of the promoter, regardless of precise positioning. Furthermore, MSAs and phylogenies are complicated by the presence of “structural” DNA, a type of non-coding DNA. These nucleotides are neither coding nor binding sites. Instead, they act as spacers: placeholders between the biochemically active parts of the gene. Once transcription factors bind to their binding sites, the structural DNA allows the chromosome to contort in a way that brings the transcription factors into proper position relative to each other and the transcription start site. In this role, the placeholder function is conserved regardless of what nucleotide is present. This makes it equivalent to a four-fold degenerate site in coding DNA terms.

EVI2A Conserved Leucine Zipper

The leucine zipper is an amino acid sequence composed of leucine or isoleucine residues placed every seven residues in the motif. In an alpha-helix conformation, the helix completes approximately two turns every seven residues. This places the leucine residues adjacent to each other on the same face of the helix. These hydrophobic amino acids will interface with other leucine zippers, usually found on other monomers, with the result being a polymeric functional unit. In the case of EVI2A, two leucine zippers exist in the same alpha helix: one composed of three amino acid residues, and one composed of four residues (Fig. 10). This suggests that EVI2A interacts with at least two other peptides in order to form a functional polymer. As the zippers are necessary for forming a complete functional unit, it is unsurprising to see that they are conserved across all species examined in this study. Of the 54 species analyzed, 51 shared amino acid residue sequences located within amino acid positions 226-252, with a conserved sequence of (IIIAVLFLICTFLFLSTVVLANKVSSL) and a leucine or isoleucine placed every seven residues, starting at position 226 for the first leucine zipper and position 231 for the second zipper. In the case of the remaining three species, one or more leucine/isoleucine residues were substituted with either a valine or methionine. Both valine and methionine are hydrophobic residues like leucine/isoleucine, with Sneath’s indices of 9 and 20, respectively, when compared to leucine, and 7 and 22, respectively, when compared with isoleucine. This high similarity between the residues means that the substitutions should not interfere with the action of the zipper.
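The heptad spacing described above can be checked directly against the conserved sequence. The sketch below is an illustration rather than the method used in this study (motifs were identified with InterPro and MyHits and compared by visual inspection); the function name is hypothetical. It walks every seventh residue from a starting position for as long as the residue is leucine or isoleucine.

```python
# Conserved EVI2A segment reported for positions 226-252.
SEGMENT = "IIIAVLFLICTFLFLSTVVLANKVSSL"
SEGMENT_START = 226


def zipper_run(segment, segment_start, first_position,
               allowed=frozenset("LI"), step=7):
    """Collect (position, residue) pairs every `step` residues.

    Starts at `first_position` (protein numbering, assumed to be at or
    after `segment_start`) and stops at the first residue outside
    `allowed` or at the end of the segment.
    """
    run = []
    position = first_position
    while position - segment_start < len(segment):
        residue = segment[position - segment_start]
        if residue not in allowed:
            break
        run.append((position, residue))
        position += step
    return run


print(zipper_run(SEGMENT, SEGMENT_START, 226))
# [(226, 'I'), (233, 'L'), (240, 'L')]
print(zipper_run(SEGMENT, SEGMENT_START, 231))
# [(231, 'L'), (238, 'L'), (245, 'L'), (252, 'L')]
```

The two runs reproduce the three-residue and four-residue zippers marked in Fig. 10. To accommodate the valine and methionine substitutions seen in three species, the allowed set could be widened to include V and M.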

EVI2A Conserved Casein kinase II phosphorylation site

Phosphorylation of a protein alters the conformation and, by extension, the activity of a protein. It can often control whether a protein is in an active or inactive conformation, allowing for post-translational control over the protein’s activity. This is likely the function casein kinase II has on EVI2A when it phosphorylates at the site described by the sequence (SNGD). Like the leucine zipper, this phosphorylation site is conserved across 51 of the 54 amino acid sequences analyzed in this paper. In the remaining three sequences, a single point mutation has resulted in a change in the amino acid encoded by the codon. In the American crow and bearded dragon, a serine has been replaced with a threonine (Sneath’s index of 12). In coelacanths, a glycine has been replaced with an alanine (or, given the coelacanth’s status as the oldest living lineage of the Sarcopterygii clade, it is equally likely the SNAD sequence is the original sequence and the SNGD sequence seen elsewhere is a mutation). The Sneath’s index for a G to A replacement is 9. In the case of both substitutions, the low substitution index values indicate a high similarity between the two residues. This in turn would suggest that these substitutions do not interfere with the ability of casein kinase II to bind to and phosphorylate the site.

EVI2A Conserved Sox-5 Binding Site

The SRY-related HMG-box 5 protein (“Sox-5”) is a transcription factor involved most commonly in embryonic development, particularly the determination of cell fate (GeneCards). SOX genes are known to be conserved across eukaryotic species, but that is not strictly consistent with what was found by looking at the MSA of the EVI2A putative promoter regions (Fig. 12). A conserved sequence of (CCACAMY) is found represented across all classes of mammals (primates, rodents, artiodactyls, and carnivora), aves, and reptiles. However, the sequence was not identified in the putative promoter region of marsupials, cartilaginous fishes, or coelacanths. In addition, the location of the binding site showed variability when compared between the different species. For example, in the sequences of most species, the Sox-5 site was found to be located approximately 400 base pairs upstream of the presumed transcription start site, but in the helmeted guineafowl, the sequence is a full 840 base pairs upstream. As stated previously, the exact location of a TFBS is not as critical to its proper function as location is for protein motifs, and these placements may serve equally well for the function of allowing Sox-5 to bind upstream of EVI2A and regulate expression.

EVI2A Conserved HLF Binding Site

Hepatic leukemia factor (HLF) is a protein that forms dimers with other proline and acidic-rich proteins to act as a transcriptional activator (GeneCards). Of the potential TFBSs examined here, its binding site is the most conserved in mammals, appearing in fully half of the species examined (Fig. 13). In addition, the position of the HLF site in the putative promoter is much more consistent than was the position of the Sox-5 site. The HLF site is consistently found 400-600 bases upstream of the transcription start site. This is a likely spot for a proximal promoter element. The conservation and consistent placement of the HLF binding site, combined with its known function as a transcriptional activator, make HLF a strong candidate for further examination as a potential transcription factor that activates transcription of EVI2A.

EVI2A Conserved cREL Binding Site

The protein cREL, coded for by the proto-oncogene REL, belongs to the Rel Homology Domain family (GeneCards). It is involved in the survival and proliferation of B-cells, but mutation of the gene may lead to overexpression of cREL, leading to cancers such as Hodgkin’s lymphoma. The cREL binding site is conserved across all mammal classes, being fairly ubiquitous (Fig. 14). One point of note is the evidence of an insertion event in the genome of flying lemurs which is not found in Coquerel’s lemurs. This would suggest a relatively recent insertion event that, given the size of the insertion and its position in the middle of the TFBS, likely resulted in a loss of function for the binding site. The existence of flying lemurs, then, would suggest that the cREL binding site is not absolutely necessary for the proper transcription of EVI2A.

EVI2A Conserved CREB Binding Site

The cAMP-Responsive Element Binding (CREB) protein is a transcription factor that acts upon genes involved in neural functions (GeneCards). Like the EVI2A protein, CREB contains a leucine zipper, which interacts with the zipper on another CREB monomer to form a homodimer as its functional unit. The CREB binding site is restricted to only primates and, furthermore, differs in humans compared to other primates (Fig. 15). In humans, there exists a single adenine insertion into the middle of the binding site. Unlike the insertion in the flying lemur cREL TFBS, this does not appear to have resulted in a loss of function, further showcasing the plasticity of transcription factor binding sites.

Summary

This study looked to determine whether there were differing selective pressures on the coding DNA sequence and putative promoter region of the Ecotropic Viral Integration Site 2A gene. Isolating the putative promoter region was done both bioinformatically and experimentally through integration of the promoter region into plasmids for dual luciferase assays. Once the putative promoter region in humans was confirmed through assays, it was used as the basis for determining the putative promoter regions of other species, chosen to represent a wide variety of animal classes. Multiple sequence alignments of both the promoter and CDS were generated and used to generate phylogenetic trees. The MSAs were also used to determine conserved sequences in the promoter and CDS that could represent transcription factor binding sites and functional motifs, respectively. The differences between the generated phylogenetic trees suggest that the promoter region and CDS are under differing selective pressures, and these differences lead to differences in which parts are conserved. Additionally, a potential point of divergence between primates and rodents was identified. Two highly conserved motifs were revealed in the amino acid sequence of EVI2A. The fact that the leucine zipper and casein kinase II phosphorylation site were conserved across all examined species is strong evidence of the central role these motifs play in ensuring proper structure and function of the protein, even if that function is currently unknown. Conversely, no potential TFBSs showed that same level of conservation, even considering the increased flexibility of TFBSs in both placement and sequence. While some TFBSs, such as Sox-5, could be found spanning multiple classes of animals, the most highly conserved binding sites, such as HLF, were restricted only to mammals. Further laboratory analysis would narrow the search range for potential TFBSs as well as identify a set transcription start site, allowing for more focused analysis of conserved regions in the promoter.

Appendix: Figures

Figure 1. EVI2A is nested within the gene NF1. Along with EVI2B and OMG, EVI2A is located on and runs antiparallel to NF1. The three genes sit within an intronic region of the NF1 gene. Large arrows denote exonic regions of the genes and direction of transcription.

Figure 2. An area composed of 1,449 bases upstream and 128 bases downstream of the assumed transcription start site (labeled with a green arrow and “+1”) was analyzed using the software AliBaba, Consite, and Match 2.0. The potential transcription factor binding sites that were identified by more than one program were highlighted by color. TFBSs identified by all three programs were bracketed in green. Sites identified by AliBaba and Consite were bracketed in blue. Sites identified by Consite and Match 2.0 were bracketed in yellow. Sites identified by AliBaba and Match 2.0 were bracketed in purple. The cutoff sites for the generated deletion constructs are marked by a black line.

Figure 3. Gel Electrophoresis of pGC Blue cloned, Restriction Digested Plasmid. Colonies of GC5 E. coli cells were grown in medium, lysed, and digested with either EcoRI (lanes 2-7), NcoI (lanes 9-13), or no restriction enzyme (lane 14). When digested with EcoRI, all samples generated bands at 2233 bp and 1595 bp. Similarly, samples digested with NcoI generated bands 2207 and 1340 bp in length. Results confirmed that all bacterial colonies sampled contained the pGC Blue plasmid with the EVI2A putative promoter region inserted in the forward orientation.

Figure 4. Gel Electrophoresis of pGL3 Basic cloned, Restriction Digested Plasmids. Colonies of GC5 E. coli cells transformed with pGL3 Basic plasmid containing the EVI2A putative promoter region were grown in medium, lysed, and digested with either EcoRI (lanes 2-4, 6-8), NcoI (lanes 9-11, 13-15), or no restriction enzyme (lane 1). Six different colonies were analyzed (colony 1: lanes 2, 9; colony 2: lanes 3, 10; colony 3: lanes 4, 11; colony 4: lanes 6, 13; colony 5: lanes 7, 14; colony 6: lanes 8, 15). When digested with EcoRI, all samples generated bands at 5057 bp and 1595 bp. When digested with NcoI, only colony sample 5 generated the expected bands of 6010 and 452 bp in length. Sample 5 plasmid was used to generate further deletion constructs and knockouts.

Figure 5. Non-Significant differences in activity between EVI2A promoter deletion constructs. Transient transfections of the full EVI2A promoter region (−1449/+128), the sequentially deleted constructs (−1248/+128, −1048/+128, −848/+128, −648/+128, −448/+128, −58/+128), and the pGL3 Basic promoterless control (“empty”) were performed in the HEK293T cell line. The cutoff sites of the EVI2A promoter sequence for each construct are shown in Fig. 2. The activity of each construct is expressed as the fold activation over the promoterless pGL3 Basic vector. Luciferase activity is normalized to the pRL SV40 promoter and enhancer vector. The activity of the constructs was not significantly higher than that of pGL3 Basic by Student’s two-sided t-test (p ≤ 0.05, n = 5).

Figure 6. Phylogenetic Tree of EVI2A CDS Generated Through Bayesian Inference. Fifty-four species were analyzed by Bayesian inference and a phylogenetic tree was produced showing posterior probabilities between 0 and 1, with 1 equaling 100% agreement. The analysis was performed using Geneious Prime version 2020.0.5 and the MrBayes version 2.2.4 plugin. A total of 1.1 million generations were performed, generating a tree every 200 generations. The first 100,000 generations were discarded as burn-in.

Figure 7. Phylogenetic Tree of EVI2A CDS Generated Through Maximum Likelihood. The evolutionary history was inferred by using the Maximum Likelihood method based on the Hasegawa-Kishino-Yano model [Hasegawa et al. 1985]. The tree with the highest log likelihood (-13428.22) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying the Maximum Parsimony method. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 1.6207)). The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 14.37% sites). The analysis involved 54 nucleotide sequences. There were a total of 1176 positions in the final dataset. Evolutionary analyses were conducted in MEGA7 [Kumar et al. 2016].

Figure 8. Phylogenetic Tree of EVI2A Putative Promoter Generated Through Bayesian Inference. Fifty-five species were analyzed by Bayesian inference and a phylogenetic tree was produced showing posterior probabilities between 0 and 1, with 1 equaling 100% agreement. The analysis was performed using Geneious Prime version 2020.0.5 and the MrBayes version 2.2.4 plugin. A total of 1.1 million generations were performed, generating a tree every 200 generations. The first 100,000 generations were discarded as burn-in.

Figure 9. Phylogenetic Tree of EVI2A Putative Promoter Generated Through Maximum Likelihood. The evolutionary history was inferred by using the Maximum Likelihood method based on the Hasegawa-Kishino-Yano model [Hasegawa et al. 1985]. The tree with the highest log likelihood (-40677.85) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. The analysis involved 55 nucleotide sequences. There were a total of 2588 positions in the final dataset. Evolutionary analyses were conducted in MEGA7 [Kumar et al. 2016]

Figure 10. MSA of the amino acids in the leucine zipper motifs of the EVI2A protein. Two leucine zippers are present in all species examined. The leucine zippers span residues 226 through 252 and are bracketed by the yellow box. Leucine, isoleucine, or compatible hydrophobic residues are noted by colored arrows. Leucine zipper one is denoted by orange arrows at positions 226, 233, and 240. Leucine zipper two is denoted by blue arrows at positions 231, 238, 245, and 252. EVI2A peptide sequences were obtained through the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) and aligned with MEGA version 7.0.26.

Figure 11. MSA of the amino acids of the casein kinase II phosphorylation site. The casein kinase II phosphorylation site is present in all species examined. The casein kinase II phosphorylation site spans residues 282 through 285 and is bracketed by the yellow box. The asparagine residue at position 283 and the aspartic acid residue at position 285 are conserved across all species. EVI2A peptide sequences were obtained through the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) and aligned with MEGA version 7.0.26.

Figure 12. MSA of nucleotides composing the Sox-5 Transcription Factor Binding Site. A conserved sequence of (CCACAMY) is found across all classes of mammals (primates, rodents, artiodactyls, and carnivora), aves, and reptiles. The Sox-5 transcription factor binding site spans nucleotides from 388 to 381 base pairs upstream of the presumed transcription start site in humans and is bracketed by the yellow box. EVI2A nucleic acid sequences were obtained through the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) and aligned with MEGA version 7.0.26.

Figure 13. MSA of nucleotides composing the HLF Transcription Factor Binding Site. A conserved sequence of (AGTYYYRCAMY) is found represented across all classes of mammals (primates, rodents, artiodactyls, and carnivora). The HLF transcription factor binding site spans nucleotides 392 through 381 bases upstream of the presumed transcription start site in humans and is bracketed by the yellow box. EVI2A nucleic acid sequences were obtained through the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) and aligned with MEGA version 7.0.26.

Figure 14. MSA of nucleotides composing the cREL Transcription Factor Binding Site. A conserved sequence of (RRARMMCCCT) is found represented across all classes of mammals (primates, rodents, artiodactyls, and carnivora). The cREL transcription factor binding site spans nucleotides 197 through 187 bases upstream of the presumed transcription start site in humans and is bracketed by the yellow box. The cREL binding site in flying lemurs has been subject to an insertion mutation event in the middle of the TFBS, likely abrogating the effectiveness of the binding site. EVI2A nucleic acid sequences were obtained through the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) and aligned with MEGA version 7.0.26.

Figure 15. MSA of nucleotides composing the CREB Transcription Factor Binding Site. The sequence (TGCGTCAACCCT) is found only in humans as a result of an insertion event. The CREB transcription factor binding site spans nucleotides 630 through 619 upstream of the presumed transcription start site in humans and is bracketed by the yellow box. EVI2A nucleic acid sequences were obtained through the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) and aligned with MEGA version 7.0.26.

Literature Cited

Arnosti DN, Kulkarni MM. 2005. Transcriptional Enhancers: Intelligent Enhanceosomes or Flexible Billboards? J. Cell. Biochem. 94:890–898. doi:10.1002/jcb.20352.

Buchberg AM, Bedigian HG, Jenkins NA, Copeland NG. 1990. Evi-2, a Common Integration Site Involved in Murine Myeloid Leukemogenesis. Mol. Cell. Biol. 10:4658–4666.

Butler JEF, Kadonaga JT. 2002. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 16:2583–2592. doi:10.1101/gad.1026202.

CREB Gene - GeneCards. https://www.genecards.org/cgi-bin/carddisp.pl?gene=CREB1&keywords=CREB

Dikstein R. 2011. The unexpected traits associated with core promoter elements. Transcription. 2(5):201–206. doi:10.4161/trns.2.5.17271.

Efimov AV. 1994. Super-secondary Structures in Proteins. In: Bohr H, Brunak S, editors. Protein Structure by Distance Analysis. IOS Press. p. 187–200.

EVI2A Gene - GeneCards. https://www.genecards.org/cgi-bin/carddisp.pl?gene=EVI2A&keywords=evi2a

EVI2A Gene Expression - GTEx. https://gtexportal.org/home/gene/EVI2A

Goodrich JA, Tjian R. 2010. Unexpected Roles for Core Promoter Recognition Factors in Cell-Type-Specific Transcription and Gene Regulation. Nat. Rev. Genet. 11:549–558. doi:10.1038/nrg2847.

Hasegawa M, Kishino H, Yano T. 1985. Dating the human-ape split by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.

Henson SE, Morford T, Stein MP, Wall R, Malone CS. 2011. Candidate genes contributing to the aggressive phenotype of mantle cell lymphoma. Acta Histochem. 113:729–742. doi:10.1016/j.acthis.2010.11.001.

HLF Gene - GeneCards. https://www.genecards.org/cgi-bin/carddisp.pl?gene=HLF&keywords=HLF

Hoffman DP, Shtengel G, Xu CS, Campbell KR, Freeman M, Wang L, Milkie DE, Pasolli HA, Iyer N, Bogovic JA, et al. 2020. Correlative three-dimensional super-resolution and block-face electron microscopy of whole vitreously frozen cells. Science. 367:eaaz5357. doi:10.1126/science.aaz5357.

Juven-Gershon T, Hsu J-Y, Theisen JWM, Kadonaga JT. 2008. The RNA Polymerase II Core Promoter - the Gateway to Transcription. Curr. Opin. Cell Biol. 20:253–259. doi:10.1016/j.ceb.2008.03.003.

Kumar S, Stecher G, Tamura K. 2016. MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Mol. Biol. Evol. 33:1870–1874.

Largaespada DA, Shaughnessy JD, Jenkins NA, Copeland NG. 1995. Retroviral Integration at the Evi-2 Locus in BXH-2 Myeloid Leukemia Cell Lines Disrupts Nf1 Expression without Changes in Steady-State Ras-GTP Levels. J. Virol. 69:5095–5102.

Larsen S, Kadziola A, Petersen JFW. 1994. Structure from X-Ray Crystallography illustrated by Proteins with prosthetic Groups. In: Bohr H, Brunak S, editors. Protein Structure by Distance Analysis. IOS Press. p. 15–23.

Latchman DS. 2004. Eukaryotic Transcription Factors. Fourth. Elsevier Academic Press.

Lee TI, Young RA. 2013. Transcriptional Regulation and its Misregulation in Disease. Cell. 152:1237–1251.

Leenen FAD, Vernocchi S, Hunewald OE, Schmitz S, Molitor M, Muller CP, Turner JD. 2016. Where does transcription start? 5′-RACE adapted to next-generation sequencing. Nucleic Acids Res. 44:2628–2645. doi:10.1093/nar/gkv1328.

Mahpour A, Scruggs BS, Smiraglia D, Ouchi T, Gelman IH. 2018. A methyl-sensitive element induces bidirectional transcription in TATA-less CpG island-associated promoters. PLoS One. 13:1–25. doi:10.1371/journal.pone.0205608.

Maston GA, Evans SK, Green MR. 2006. Transcriptional Regulatory Elements in the Human Genome. Annu. Rev. Genomics Hum. Genet. 7:29–59. doi:10.1146/annurev.genom.7.080505.115623.

Pasmant E, Masliah-Planchon J. 2011. Identification of Genes Potentially Involved in the Increased Risk of Malignancy in NF1-Microdeleted Patients. Mol. Med. 17:1. doi:10.2119/molmed.2010.00079.

Poulsen FM. 1994. Function and Three-Dimensional Structure of Proteins using Nuclear Magnetic Resonance Spectroscopy. In: Bohr H, Brunak S, editors. Protein Structure by Distance Analysis. IOS Press. p. 24–35.

REL Gene - GeneCards. https://www.genecards.org/cgi-bin/carddisp.pl?gene=REL&keywords=cREL

SOX-5 Gene - GeneCards. https://www.genecards.org/cgi-bin/carddisp.pl?gene=SOX5&keywords=Sox-5

Subramaniam S. 1994. Protein Structure Prediction - Past and Present. In: Bohr H, Brunak S, editors. Protein Structure by Distance Analysis. IOS Press. p. 3–14.

Vo L, Wang Y, Kassavetis GA, Kadonaga JT. 2017. The punctilious RNA polymerase II core promoter. Genes Dev. 31:1289–1301. doi:10.1101/gad.303149.117.

Whitford D. 2005. Proteins Structure and Function. John Wiley & Sons, Ltd.

Yarden G, Elfakess R, Gazit K, Dikstein R. 2009. Characterization of sINR, a strict version of the initiator core promoter element. Nucleic Acids Res. 37:4234–4246. doi:10.1093/nar/gkp315.

Zipursky L, Berk A, Krieger M, Darnell JE, Lodish HF, Kaiser C, Scott MP, Matsudaira PT. 2003. McGill Lodish 5E Package - Molecular Cell Biology & McGill Activation Code. San Francisco: W.H. Freeman.
