Identifying Mouse Genes Putatively Transcriptionally Regulated by the Glucocorticoid Receptor

By Zuojian Tang

School of Computer Science, McGill University, Montreal, January 2005

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Science

© Zuojian Tang 2005

ISBN: 0-494-12552-7

Abstract

The glucocorticoid receptor (GR) is one of many steroid hormone receptors. It controls broad physiological networks, confers pathological effects in a range of disease states, and offers an excellent target for therapeutic intervention. Therefore, it is necessary to better understand the mechanisms of GR regulation. In particular, we are interested in better understanding the protein-nucleotide interactions (transcription factors interacting with transcription factor binding sites). Upon glucocorticoid hormone binding, the GR forms a protein-nucleotide interaction with a specific transcription factor binding site known as a glucocorticoid response element (GRE). This research has employed three different but complementary bioinformatics approaches to identify Mouse genes putatively transcriptionally regulated by GR. Firstly, we focus on the problem of searching for putative GREs in the complete Mouse genome using a position weight matrix. This produced a large number of putative GREs, most of which are likely false positive predictions. Secondly, two different strategies are used to improve the accuracy of our framework: combinatorial analysis of multiple TFs/modules of TFBSs and phylogenetic footprinting (PF). The number of putative GREs can be reduced by 97.9% using the module of TFBSs analysis, 97.7% using the PF analysis, and 99.9% using both module and PF analyses. In each step, a statistical test has been used to measure the significance of the results.

Résumé

The glucocorticoid receptor (GR) belongs to the large family of steroid receptors. It is involved in controlling the expression of a good number of genes that form a broad and extensive physiological regulatory network. It plays an important role in several pathologies and therefore offers an excellent therapeutic target. It is therefore essential to better understand the mechanisms of action of GR. We were particularly interested in gene regulation by GR through protein-DNA interaction, in which GR acts as a transcription factor that specifically binds a DNA binding site. Indeed, following the binding of its glucocorticoid ligand, GR binds as a homodimer to a DNA binding site that is specific to it: the glucocorticoid response element (GRE). In the present research, we employed three different, complementary bioinformatics approaches to identify genes whose transcription is potentially directly regulated by GR in the mouse. First, we addressed the problem of searching for potential GREs across the complete mouse genome using a position weight matrix. This method yielded a very large number of potential GREs, most of which are false predictions. Second, two different strategies were employed to increase the precision of our prediction tool. On the one hand, we used combinatorial analysis of transcription regulation modules made up of at least one GRE and DNA binding sites for other transcription factors. On the other hand, we carried out phylogenetic footprinting (PF) analysis of the potential GREs. The number of potential GREs can be reduced by 97.9% using the module analysis and by 97.7% using the phylogenetic footprinting analysis. Combining the two strategies reduced this number by 99.9%. At each step, a statistical test was used to assess the significance of the results.

Acknowledgements

First of all, I would like to express my deep gratitude to my supervisor, Professor Michael Hallett. His profound insights have enlightened and guided me immensely.

I would like to thank Dr. Sebastien Provencher for his advice and guidance throughout my research work, especially on aspects of biological knowledge.

I would like to thank Mr. Alexandre Marcil for his cooperation and for the information and experimental data he provided for this research.

I owe special thanks to Mr. François Pepin for helping me with the BIAS system.

Finally, I deeply appreciate my husband for his support, understanding, and patience.

Table of contents:

1. Introduction
2. Biological Background
2.1. Promoter
2.1.1. The Basic Structure of a Promoter and The Initiation of Transcription
2.1.2. Modules
2.2. Transcriptional Regulation by Glucocorticoid Receptor (GR)
3. Background for Computer Science and Bioinformatics Concepts
3.1. BIAS: Bioinformatics Integrated Application Software
3.1.1. Modules in BIAS
3.1.2. Internal Data Sources - Object-Relational Model
3.1.3. External Data Sources
3.1.3.1. Java API of Ensembl (Ensj)
3.1.3.2. Java API of BIAS
3.2. Background of Bioinformatics Approaches
3.2.1. Module for TFBS
3.2.1.1. A Markov chain (MC) [118]
3.2.1.2. Position Weight Matrix (PWM)
3.2.1.3. Regions in Mouse Genome
3.2.1.4. Search for putative TFBSs using PWM
3.2.1.5. Statistical Significant Test of putative TFBSs
3.2.1.6. TRANSFAC
3.2.2. Background for the Genome Wide Search of TFBSs
3.2.3. Modules
3.2.4. Phylogenetic Footprinting (PF)
4. Methods And Implementation
4.1. External Data Sources Preparation in BIAS
4.2. Genome Wide Search For Transcription Factor Binding Site
4.3. Module
4.3.1. Algorithm
4.3.2. Implementation
4.3.2.1. Statistical Significance Test using MBMC
4.4. Phylogenetic Footprinting
4.4.1. Algorithm
4.4.1.1. Orthologous Genes Selection
4.4.1.2. Alignment of Promoter Sequences of Orthologous Genes
4.4.1.2.1. Masking Sequences
4.4.1.2.2. Principle of AVID Alignment
4.4.2. Implementation
5. Results and Discussion
5.1. Genome Wide Search for TFBSs
5.2. Modules
5.3. Phylogenetic Footprinting
5.3.1. Parameters Prediction Using Supervised Data Mining Techniques
5.3.1.1. Parsing AVID Alignments
5.3.1.2. Parameter Prediction of AVID Alignment
5.3.1.2.1. Effect of Different Length of Promoter Sequences
5.3.1.2.2. Effect of Maximum Number of Gaps
5.3.1.2.3. Effect of RepeatMasker
5.3.2. One Case Study For Gene Expression Data
5.3.2.1. Experimental Background For Gene Expression Data
5.3.2.2. Results and Discussion
5.3.3. Study For Whole Mouse Genome
6. Conclusion and Recommendations
6.1. Conclusion
6.2. Recommendations
References
Appendix A: The BIAS Database
Appendix B

List of tables:

Table 3-1: List of resources for obtaining and analyzing genomic sequences [84]
Table 5-1: Number of putative TFBSs and p-value of frequency of TFBSs for four transcription factors (GR, NF-1, C/EBPdelta, C/EBPbeta) in either whole Mouse genome or different regions
Table 5-2: Percentage of nucleotides A, C, G, and T in consensus string of four TFs
Table 5-3: Statistical Results for pGREs and binding sites of NF-1 in different regions of Mouse genome
Table 5-4: Statistical Results for binding sites of C/EBPbeta and C/EBPdelta in different regions of Mouse genome
Table 5-5: Statistic test for combinatorial probability distribution in MBMC
Table 5-6: Results for all co-occurrence genes
Table 5-7: Information of 6 conserved GREs
Table 5-8: Part of AVID alignment result
Table 5-9: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available
Table 5-10: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available
Table 5-11: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available
Table 5-12: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available
Table 5-13: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available
Table 5-14: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available
Table 5-15: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available
Table 5-16: Number of genes in each group from gene expression data and the number of genes found by both module and PF analyses
Table 5-17: Final results for gene expression data
Table 5-18: Interesting points for five good candidate genes
Table 5-19: Phylogenetic footprinting results for Mouse 2 as compared with Human orthologous genes

List of Figures:

Figure 1-1: The major actions of glucocorticoids in mammals [3].
Figure 2-1: Schematic structure of promoter [12].
Figure 2-2: Schematic structure of promoter modules [12].
Figure 2-3: The Type-1 mechanism of action of the glucocorticoid receptor [3].
Figure 2-4: The Type-2 mechanism of action of the glucocorticoid receptor [3].
Figure 3-1: Basic overview of BIAS [16]
Figure 3-2: An example on how to derive a count matrix from a set of known TFBSs.
Figure 3-3: A comparison of the number of hits between Mouse chromosome 2 and different background sequences. Window size changes from 2 up to whole chromosome long. Overlap is 0.
Figure 3-4: Number of hits comparison among Mouse chromosome 2 and different random sequences. Window size is 200, 500, 1000, and whole chromosome long with different overlapping 0, 10, and 100.
Figure 3-5: Evolutionary relationship between metazoans [86].
Figure 4-1: Overview of module "datasources/ensembl"
Figure 4-2: Overview of sub-module "gre/TFBS"
Figure 4-3: Overview of sub-module "gre/TFBS" for co-occurrence
Figure 4-4: Overview of sub-module "gre/secondOrderMarkovChain"
Figure 4-5: Global view of AVID alignment.
Figure 4-6: Selecting anchors from the set of matches. Every maximal match is shown in blue. A set of good anchors is shown in red [100].
Figure 4-7: Overview of sub-module "gre/footPrinting"
Figure 5-1: Frequency distribution of pGREs in different regions of Mouse genome and frequency distribution of pGREs in random sequence
Figure 5-2: Frequency distribution of putative TFBSs of NF-1 in different region of Mouse genome and frequency distribution of putative TFBSs of NF-1 in random sequence
Figure 5-3: Frequency distribution of putative TFBSs of C/EBPbeta in different region of Mouse genome and frequency distribution of putative TFBSs of C/EBPbeta in random sequence
Figure 5-4: Frequency distribution of putative TFBSs of C/EBPdelta in different region of Mouse genome and frequency distribution of putative TFBSs of C/EBPdelta in random sequence
Figure 5-5: Frequency distribution of nucleotides in each chromosome and each region of the Mouse genome.
Figure 5-6: Frequency distribution of pGREs in promoter (left), gene (middle), and junk (right) regions of the Mouse genome
Figure 5-7: Frequency distribution of putative TFBSs for NF-1 in promoter (left), gene (middle), and junk (right) regions of Mouse genome
Figure 5-8: Frequency distribution of putative TFBSs for C/EBPbeta in promoter (left), gene (middle), and junk (right) regions of Mouse genome
Figure 5-9: Frequency distribution of putative TFBSs for C/EBPdelta in promoter (left), gene (middle), and junk (right) regions of Mouse genome
Figure 5-10: Mouse and wide hits rate distribution at different threshold for total number of putative binding sites of four different transcription factors (GR, NF-1, C/EBPdelta, C/EBPbeta)
Figure 5-11: Order distribution of module for four different TFs (GR, NF-1, C/EBPdelta, C/EBPbeta). Different colour stands for different in Mouse genome
Figure 5-12: Counts distribution of combinatorial probability of TFBSs for four different TFs (GR, NF-1, C/EBPdelta, C/EBPbeta) of MBMC.
Figure 5-13: Frequency distribution of combinatorial probability of TFBSs for four different TFs in promoter region of Mouse genome (GR, NF-1, C/EBPdelta, C/EBPbeta)
Figure 5-14: Percentage of qualified genes
Figure 5-15: Average number of mismatches in conserved alignments
Figure 5-16: Average distance to TSS of gene
Figure 5-17: Average score of pGREs

List of abbreviations:

activating transcription factor (ATF)
Bioinformatics Integrated Application Software (BIAS)
CCAAT/enhancer binding protein (C/EBPbeta and C/EBPdelta)
Genetics Computing Group (GCG)
glucocorticoid receptor (GR)
glucocorticoid response element (GRE)
Glucocorticoids (GCs)
initiator region (INR)
Markov chain (MC)
Mouse Background MC (MBMC)
nuclear factor 1 (NF-1)
object-relational (OR)
Phylogenetic Footprinting (PF)
position frequency matrix (PFM)
position weight matrix (PWM)
putative GREs (pGREs)
transcription factor binding site (TFBS)
transcription factors (TFs)
transcription start site (TSS)
transcriptional accessory factors (TAFs)

1. Introduction

Glucocorticoids (GCs) are steroid hormones produced in the adrenal cortex. They play an important role in a variety of organ systems during development and in many physiological and pathological processes [1]. A very important function of GCs is the regulation of carbohydrate and lipid metabolism. Clinically, GCs are also important anti-inflammatory and immunosuppressive agents. They are used in the treatment of a wide variety of diseases, including allergic and autoimmune diseases such as asthma and rheumatoid arthritis [2 (p.197-p.210)]. Figure 1-1 shows the major glucocorticoid actions in mammals. Since GCs are essential for the survival of vertebrates, it is necessary to understand the molecular mechanism of their actions [3]. GCs function by regulating the expression of target genes through an intracellular receptor, the glucocorticoid receptor (GR).

[Figure 1-1 labels: Lipid Metabolism and Protein Metabolism; Carbohydrate Metabolism; Electrolyte Homeostasis; General Adaptation; Immunosuppression; Anti-inflammatory Action; Oxidative Metabolism; Reproduction and Growth; Apoptosis; Anti-tumor Promoting Activity]

Figure 1-1: The major actions of glucocorticoids in mammals [3].

GR is one of many steroid hormone receptors. It controls broad physiological gene networks, confers pathological effects in a range of disease states, and offers an excellent target for therapeutic intervention [4]. GR belongs to the nuclear receptor family of ligand-dependent transcription factors (TFs). Upon GC-hormone binding, the GR forms a protein-nucleotide interaction with a specific transcription factor binding site (TFBS) known as a glucocorticoid response element (GRE). The GRE is a short, near-palindromic DNA sequence located in the promoter region of GR-controlled genes. It is hypothesized that the binding of GR to a GRE alone is insufficient to promote transcription of target genes. Rather, it is believed that GR "operates" in collaboration with a set of additional proteins and TFs. These accessories aid GR regulation either by binding GR via protein-protein interactions or by binding TFBSs in genomic DNA in close proximity to the GRE. The set of accessory proteins and TFs is largely unknown.

This research employs several bioinformatics approaches in order to better understand the mechanisms of GR regulation. In particular, we are interested in better understanding the protein-nucleotide interactions (TFs interacting with TFBSs). The broad goal of this study is to develop a software platform to predict GREs with high accuracy. This will allow us to identify GR-regulated genes throughout the whole Mouse genome. It is believed that GR regulates no more than approximately 100 of the approximately 25,000 genes in Mouse. Identifying the putative GR-regulated Mouse genes could help: (i) to improve the design of so-called ChIP-chip experiments for GR in Mouse, (ii) to better make sense of microarray expression studies, (iii) to identify targets for intervention (towards drug design for the various GR-related diseases), (iv) to improve our understanding of protein-nucleotide interactions in general, and (v) to improve our understanding of other steroid-related transcription factors, including the progesterone receptor, the estrogen receptor, and the androgen receptor.

Most TFBSs are highly degenerate sequence motifs ranging from 5 to 20 bp in length. By degenerate we mean that a TFBS for a specific TF allows some sequence variation. We can think of these as stochastic variants of some "consensus" binding site pattern. The identification of true binding sites for most transcription factors is very difficult since they are very short (5-20 bp) relative to the size of the Mouse genome (2.6 Gb).

Our framework for identifying GREs proceeds as follows. Firstly, we focus on the problem of searching for GREs in the complete Mouse genome using a position weight matrix (PWM), a common approach in the literature for modeling TFBSs. This search produces a large number of candidate GREs (1.8M). Clearly, such an approach is extremely non-specific (it produces a prohibitively large number of false positive predictions). However, by using a "liberal" parameter for the PWM, we hope that this initial method is sensitive (there are not too many false negatives). Further analysis is then required to reduce the number of false positives. Two different strategies acknowledged in the literature for reducing the rate of false positives [13, 66, 85] are used to improve the accuracy of our framework: combinatorial analysis of multiple TFs/modules and phylogenetic footprinting. In eukaryotes, especially in higher-order organisms, mechanisms for controlling gene expression are highly complex. For example, transcriptional activation of a gene is in many cases determined by the combinatorial absence/presence of multiple, interacting TFs [5-6]. In other words, a specific set of TFs operates together to promote the transcription of a gene. Each TF has its own TFBS, and the binding sites of such a set are usually relatively close together along the genomic DNA. GR is known to "collaborate" with transcription factor accessories such as NF-1, Oct-1, HNF-1, HNF-3, AP-1, C/EBPβ, and C/EBPδ [7]. Therefore, requiring that a putative GRE co-occur with binding sites of these cooperating TFs may allow us to further filter false positive predictions obtained via the PWM.
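The PWM search described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the aligned sites, the uniform background, the pseudocount, and the two thresholds are all hypothetical values chosen only to show how a "liberal" threshold admits more candidate sites than a strict one.

```python
import math

# Hypothetical aligned binding sites (toy data, not taken from TRANSFAC or this thesis).
SITES = ["GGTACA", "GGTACT", "GGAACA", "AGTACA"]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform background
PSEUDOCOUNT = 0.5  # assumed smoothing value


def build_pwm(sites):
    """Return one {base: log-odds score} dict per column of the aligned sites."""
    pwm = []
    for column in zip(*sites):
        total = len(column) + 4 * PSEUDOCOUNT
        pwm.append({
            base: math.log2(((column.count(base) + PSEUDOCOUNT) / total) / BACKGROUND[base])
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Yield (position, score) for every window scoring at least threshold."""
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other non-ACGT symbols
        score = sum(col[base] for col, base in zip(pwm, window))
        if score >= threshold:
            yield start, score


pwm = build_pwm(SITES)
max_score = sum(max(col.values()) for col in pwm)
toy_sequence = "TTGGTACATTGGAACTCC"
hits_liberal = list(scan(toy_sequence, pwm, 0.6 * max_score))   # sensitive, less specific
hits_strict = list(scan(toy_sequence, pwm, 0.95 * max_score))   # specific, less sensitive
```

On this toy sequence the strict threshold keeps only the exact match, while the liberal threshold also admits a degenerate variant, which is the sensitivity versus specificity trade-off discussed above.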

All eukaryotes are believed to share a highly conserved mechanism of transcriptional regulation [8-9]. The underlying assumption is that functional sequences, including both coding regions and regulatory modules, are more conserved evolutionarily than non-functional regions throughout the whole genome. Comparative genomic analyses have been used extensively to identify functional regions, provided that a significant amount of genomic sequence is available [10, 70-74, 79-83, 97]. It appears that phylogenetic footprinting is an effective method for the identification of TFBSs [10, 85, 97].
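The idea behind phylogenetic footprinting can be sketched as a conservation check on a pairwise alignment. The sketch below is illustrative only: the function name `conserved`, the gap handling, and the `max_mismatches` default are assumptions, and the alignment rows are toy strings rather than real Mouse and Human promoter alignments.

```python
def conserved(mouse_aln, human_aln, site_start, site_len, max_mismatches=1):
    """Check whether a Mouse site is conserved in a pairwise alignment.

    mouse_aln, human_aln: gapped alignment rows of equal length ('-' marks a gap).
    site_start: 0-based start of the site in the ungapped Mouse sequence.
    """
    if len(mouse_aln) != len(human_aln):
        raise ValueError("alignment rows must have equal length")
    mouse_pos = 0    # index into the ungapped Mouse sequence
    mismatches = 0
    covered = 0      # number of site bases seen so far
    for m, h in zip(mouse_aln, human_aln):
        if m == "-":
            # Insertion in Human: penalise it only when it falls inside the site.
            if 0 < covered < site_len:
                mismatches += 1
            continue
        if site_start <= mouse_pos < site_start + site_len:
            covered += 1
            if m.upper() != h.upper():  # substitution, or a gap in Human
                mismatches += 1
        mouse_pos += 1
    return covered == site_len and mismatches <= max_mismatches


# Toy alignment: the Mouse site GGTACA (ungapped positions 2-7) has one substitution in Human.
mouse_row = "AC-GGTACATT"
human_row = "ACTGGTACGTT"
site_is_conserved = conserved(mouse_row, human_row, site_start=2, site_len=6)
```

A putative site that fails such a check in the orthologous promoter would be treated as a likely false positive; how many mismatches to tolerate is a parameter that has to be chosen empirically.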

The structure of this thesis is as follows. Chapter 2 introduces the necessary biological background for this study. This includes the basic structure of promoters, modules, transcription initiation, and transcriptional regulation by GR. Chapter 3 focuses on the necessary computer science background and bioinformatics concepts. In particular, we describe the architecture of Bioinformatics Integrated Application Software (BIAS) [16], the development framework used to build our software. Chapter 4 describes all of the bioinformatics approaches used and how they are implemented in BIAS. Chapter 5 provides the results of our experiments and a discussion of their significance. Chapter 6 lists a number of conclusions, recommendations and open problems for further research.

2. Biological Background

2.1. Promoter

2.1.1. The Basic Structure of a Promoter and The Initiation of Transcription

The promoter is an integral part of a gene. There exist a few different definitions of a promoter, motivated by different research purposes. Fickett and Wasserman [10] explain that a promoter is a region of DNA surrounding the transcription start site (TSS) (represented in Figure 2-1 within the INR region) that is able to direct transcription from the TSS. The TSS is the base pair where transcription is initiated. By convention, the TSS in the DNA sequence of a transcription unit is usually numbered +1 [126 p.346]. Werner [12] describes a promoter as a region that is necessary to achieve transcriptional initiation. It marks the beginning of the first exon of a gene. The function of a promoter is to mediate and control initiation of transcription of the part of a gene that is located immediately downstream of the promoter in the 3' direction. By 3' direction, we mean the 3' end orientation of a nucleic acid chain formed through nucleotide polymerization. A nucleic acid chain has an end-to-end chemical orientation: the 5' end has a free phosphate group on the 5' carbon of its terminal sugar; the 3' end has a free hydroxyl group on the 3' carbon of its terminal sugar. Synthesis of a nucleic acid chain always proceeds from 5' to 3'. Therefore, the sequence of a nucleic acid chain is written from 5' to 3' [126 p.100-103]. In many cases, understanding the structure of the promoter is the first logical step towards understanding the function of the associated protein.

In general, there are three different promoter regions: the core promoter, the proximal promoter and the distal promoter, depicted in Figure 2-1. The core promoter consists of the binding sites of the complex composed of RNA polymerase II and six general transcription factors, designated TFIIA, TFIIB, TFIID, TFIIE, TFIIF and TFIIH, where "TF" stands for "transcription factor" and "II" for RNA polymerase II [127]. TFIIA can stabilize TFIID and promoter binding. TFIIB recruits RNA polymerase II and TFIIF. TFIID includes transcriptional accessory factors (TAFs) and binds to the TATA box via TBP. TFIIE recruits TFIIH. TFIIF is mainly composed of RAP74 and is in charge of conformational changes in the RNA polymerase II promoter complex [12]. TFIIH can unwind the DNA duplex. The core promoter is composed of several regions with different functions: an initiator region (INR), one or more binding sites for general transcription factors, which are mostly located on the 5' side of the TSS, and a TATA box. TATA boxes are the best-known binding sites of general transcription factors and are recognized by TATA box binding proteins. The core promoter is also the most important factor in determining the precise TSS and is the minimum region that is capable of basal transcription.

The complex containing RNA polymerase II and the general transcription factors is referred to as the transcription initiation complex. Transcription can begin when the transcription initiation complex is recruited to the promoter [11].

There is no clear-cut distinction from where the core promoter ends to where the proximal promoter begins. The region next to the 5' end of the core promoter up to 200-300 bp upstream of the TSS is called the proximal promoter. It contains several TFBSs and is responsible for the modulation of transcription. A set of TFs can function cooperatively to control the rate of transcription.
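The coordinate conventions above (TSS numbered +1, upstream positions negative, proximal promoter reaching roughly 200-300 bp upstream) can be made concrete with a short sketch. The function names are hypothetical, and the 40 bp core boundary is an assumed placeholder, since the text notes that there is no clear-cut boundary between the core and proximal promoters.

```python
def tss_relative(genomic_pos, tss, strand):
    """Map a 1-based genomic coordinate to a TSS-relative position.

    The TSS itself is +1 and the base immediately upstream is -1 (there is no position 0).
    strand is '+' or '-'; on the '-' strand, upstream means larger genomic coordinates.
    """
    offset = genomic_pos - tss if strand == "+" else tss - genomic_pos
    return offset + 1 if offset >= 0 else offset


def promoter_region(rel_pos, core_upstream=40, proximal_upstream=300):
    """Classify a TSS-relative position; both boundaries are assumed cut-offs."""
    if rel_pos > 0:
        return "transcribed"
    if rel_pos >= -core_upstream:
        return "core"
    if rel_pos >= -proximal_upstream:
        return "proximal"
    return "distal"


# A site 200 bp upstream of a '+' strand TSS at genomic position 1000.
region = promoter_region(tss_relative(800, 1000, "+"))
```

Expressing every putative binding site in TSS-relative coordinates makes sites on both strands directly comparable, which is convenient when measuring their distance to the TSS.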

So-called distal promoters are located possibly thousands of base pairs upstream of proximal promoters. This type of region varies much more than the other two promoter regions with respect to composition and length. It also contains TFBSs for specific TFs that are known to activate or repress transcriptional activity.

The set of TFs (and the interactions amongst the members of this set) that bind to the genomic DNA sequence upstream of the promoter can activate or repress the RNA polymerase II machinery. In some cases, the set of TFs is responsible for recruiting this machinery to the core promoter [12] and to the transcription start site located within it, where mRNA synthesis can then be initiated. Figure 2-2 depicts the components of transcription regulation.

Figure 2-1: Schematic structure of promoter [12].

2.1.2. Modules

It is common to find a set of TFBSs where the binding sites are located in close proximity to one another. In many cases, the set of TFBSs together performs a specific regulatory function. That is, the combinatorial presence or absence of the corresponding TFs is capable of regulating the downstream gene. We call such a set of TFBSs a module. The module is defined by a specific set of TFBSs located in close proximity to one another; furthermore, this set of TFBSs is typically flanked by long stretches of nucleotides that do not contain TFBSs. Modules control the expression of the gene. Such a module is depicted in Figure 2-2.

[Figure 2-2 labels: Distal modules; Proximal modules; TF binding sites; TATA box; INR; Distal promoter; Proximal promoter (200-300 bp); Core promoter]

Figure 2-2: Schematic structure of promoter modules [12].

Werner [122] describes modules as the next level of functional organization after individual TFBSs. A module is defined there as two or more individual TFBSs that act in a coordinated way (either synergistically or antagonistically), with the contributing elements arranged within a defined distance and sequential order. Kel et al. [50] and GuhaThakurta et al. [49] describe how, in synergistic binding, simultaneous interactions of two factors with closely situated binding sites can result in a non-additively high level of transcriptional activation. A non-additively high level of activation means that the activation provided by cooperating TFs is much higher than the sum of the levels of activation provided by either TFBS alone. In antagonistic binding, two factors interfere with each other, so that competition for overlapping sites leads to mutually exclusive binding of the TFs.

Wagner [46] describes three types of cooperative action: (i) homotypic, which involves interactions among multiple bound factors of the same kind; (ii) heterotypic, which involves interactions among TFs of different kinds; and (iii) a mixture of homotypic and heterotypic. Any of them can form a tightly linked cluster of TFBSs, that is, a module.

Typically the TFs form a three-dimensional complex that interacts with the nucleotides in a geometric manner. The shape and size of the complex imply that binding sites must be in specific positions relative to one another. Therefore, within a module, the order, the distance, and the strand orientation of TFBSs all play an important role in the functionality of the module. A promoter can contain one or more modules. We revisit the discussion of modules in Section 3.2.3.
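As a rough illustration of the module concept, the sketch below groups a putative GR binding site with binding sites of other TFs that lie within a fixed distance. The `Hit` record, the function `find_modules`, and the 200 bp distance are hypothetical choices; the sketch covers only the heterotypic case and ignores order and strand constraints, which a real module model would also need to consider.

```python
from collections import namedtuple

# Hypothetical record for one putative binding site.
Hit = namedtuple("Hit", "tf start strand")


def find_modules(hits, anchor_tf="GR", max_distance=200):
    """Pair each anchor hit with the partner TF hits within max_distance bp.

    Returns a list of (anchor, partners) tuples; anchors without partners are dropped.
    """
    modules = []
    for anchor in (h for h in hits if h.tf == anchor_tf):
        partners = sorted(
            (h for h in hits
             if h.tf != anchor_tf and abs(h.start - anchor.start) <= max_distance),
            key=lambda h: h.start,
        )
        if partners:
            modules.append((anchor, partners))
    return modules


toy_hits = [
    Hit("GR", 1000, "+"),
    Hit("NF-1", 1080, "-"),        # close enough to the first GR hit
    Hit("C/EBPbeta", 5000, "+"),   # isolated, joins no module
    Hit("GR", 9000, "+"),          # no partner nearby, so it is filtered out
]
modules = find_modules(toy_hits)
```

Here the second GR hit is discarded because no cooperating TF site lies nearby, which is the kind of filtering that co-occurrence analysis is meant to provide.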

2.2. Transcriptional Regulation by Glucocorticoid Receptor (GR)

GR is one of the best-characterized steroid receptors and it is essential for the regulation of multiple physiological processes [2 (p.279-p.295), 7]. When the ligand GCs are absent from the cell, GR is localized to the cytoplasm by association with a multiprotein complex consisting of heat-shock proteins (hsp90), and immunophilins such as the FK506-binding family.

GCs bind GR when present in the cell. This releases GR from the multiprotein complex and allows the translocation of GR from the cytoplasm to the nucleus of the cell. In the nucleus, the GC-bound GR can either activate or repress the transcription of target genes by interacting with short GREs. This short GRE sequence is composed of two defined stretches of nucleotides separated by undefined nucleotides, creating two half-sites [3]. Usually, GREs are not perfectly palindromic and some degeneracy in the half-sites can be tolerated [7]. The consensus GRE is GGTACAnnnTGTTCT, where n can be any nucleotide [124]. GR can modulate gene transcription via two types of mechanism, termed Type-1 and Type-2.
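The consensus GGTACAnnnTGTTCT and the tolerated degeneracy can be illustrated with a simple mismatch-counting search over both strands. This is only a sketch: the function names and the `max_mismatches` default are assumptions, and a consensus search of this kind is a cruder model than a position weight matrix.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
GRE_CONSENSUS = "GGTACAnnnTGTTCT"  # 'n' matches any nucleotide


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper case)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def mismatches(window, consensus=GRE_CONSENSUS):
    """Count positions where window differs from the defined consensus bases."""
    return sum(1 for base, ref in zip(window.upper(), consensus)
               if ref != "n" and base != ref)


def find_gres(sequence, max_mismatches=2):
    """Return (start, strand, mismatch_count) for every near-consensus window."""
    width = len(GRE_CONSENSUS)
    sequence = sequence.upper()
    found = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, candidate in (("+", window), ("-", reverse_complement(window))):
            count = mismatches(candidate)
            if count <= max_mismatches:
                found.append((start, strand, count))
    return found


# Toy sequence containing one exact consensus match starting at index 2.
exact_sites = find_gres("AAGGTACACTGTGTTCTAA", max_mismatches=0)
```

Raising `max_mismatches` models the degeneracy in the half-sites, at the cost of admitting more spurious matches.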

The Type-1 mechanism functions as follows: the GC-bound GR undergoes homo-dimerization or hetero-dimerization in the nucleus. A dimer is a protein complex made up of two subunits. In a homo-dimer the two subunits are identical, and in a hetero-dimer they differ (though they are often still very similar in structure). Homo-dimerization produces a homo-dimer, which binds to a GRE and activates transcription of the gene downstream (i.e., in the 3' direction) of the GRE. Hetero-dimerization produces a hetero-dimer, and the binding of a hetero-dimer to a GRE leads to suppression of transcription. Occasionally, GC-bound GR can also bind to a GRE without undergoing dimerization. When this kind of binding occurs, the GRE is of a special form (not a near-palindromic sequence) and we refer to it as a negative GRE (nGRE). nGREs can be bound by either GR monomers or a combination of monomers and dimers. Such binding can also inhibit the transcription of a gene (transrepression). Figure 2-3 depicts the Type-1 mechanism.

The Type-2 mechanism functions as follows: the hormone-bound GR binds to other TFs belonging to the AP-1 family (e.g. c-Jun, Jun-B, Jun-D, c-Fos, Fos-B, Fos-B2) or to activating transcription factor (ATF) family proteins (e.g. Fra-1, Fra-2). The protein-protein interactions between GR and these proteins result in the inhibition of the transcription of target genes regulated by these additional TFs [3]. Figure 2-4 depicts the Type-2 mechanism.

Figure 2-3: The Type-1 mechanism of action of the glucocorticoid receptor [3].

Figure 2-4: The Type-2 mechanism of action of the glucocorticoid receptor [3].

3. Background for Computer Science and Bioinformatics Concepts

This section describes the necessary computer science background. In particular, we describe the architecture of BIAS [16], the development framework used to build our software. Furthermore, we introduce some bioinformatics concepts used in this research.

The framework for identifying genes putatively regulated by GR has been implemented in BIAS. BIAS was created in the Hallett lab in order to carry out integrative bioinformatics research. We have contributed to the development of BIAS in three ways:

(i) The construction of a so-called module entitled gre. This module contains all of the routines/code for performing the experiments described in this thesis. Furthermore, the tools will be made available to the community to be used for GR and other transcriptional studies. (ii) The development of the underlying relational database schema for representing biological objects. In particular, we have extended the schema to allow the representation of GREs and binding sites of GR's cooperative TFs, orthologous gene pairs between Mouse and Human, and conserved information between Mouse and Human in promoters. (iii) The development of the so-called external data source capabilities of BIAS. The main focus here is on the importation of data from Ensembl.

3.1. BIAS: Bioinformatics Integrated Application Software

Our software is developed in BIAS (Bioinformatics Integrated Application Software) [16]. BIAS is a development platform especially tailored to bioinformatics research and software development. The mandate of BIAS is to provide infrastructure to integrate many different bioinformatics objects together into one easy to use, easy to expand, and computationally powerful system. The goals of BIAS are: (i) Formalize the representation of biological data to allow for ease of application communication, data exchange, and comparison through standards and languages. (ii) Create an object-relational database capable of storing, accessing, and guaranteeing the consistency of data in the relational world as well as the object world. (iii) Develop a repository of computer science and bioinformatics related algorithms. (iv) Be a framework that is flexible enough to allow for third-party software to be integrated within it whenever the appropriate machinery such as Software Development Kits (SDKs) or Application Programmer Interfaces (APIs) are made available from the distributor. (v) Be a system that integrates the above into one easy to use framework.

BIAS is used by all members of the Hallett lab. BIAS currently contains infrastructure for (i) Gene expression studies (cDNA, Agilent, and Affymetrix) compliant with the MIAME standard, (ii) Handling complete genomes and annotations, (iii) Transcription factor binding site analyses, (iv) Genetic network inference, and (v) Protein-protein interaction experiments. The structure of BIAS is depicted in Figure 3-1.

BIAS has had a significant impact on both reducing the development time of new applications and improving existing applications in our lab. It is an Open Source project and is freely available from the website http://www.mcb.mcgill.ca/~bias/.

Figure 3-1: Basic overview of BIAS [16]

3.1.1 Modules in BIAS

A module is an application written by a developer in BIAS. It can make use of the BIAS internal and external libraries, data source access routines, and data model. Essentially, modules are the basic means by which developers add bioinformatics functionality to the framework. Modules can be used as a stand-alone package where the end-user of the system (i.e., the biologist) need not necessarily know that the application is written in BIAS. When an end-user uses BIAS, they execute the client process on their machine of choice. The client is responsible for all communication to and from the server. The server component acts as a middleware layer between the modules/clients and the data sources/libraries components of the system and receives requests from clients. It is primarily responsible for accessing the BIAS libraries and data resources of a module.

We have implemented our framework in BIAS within a module called gre. It can be obtained at http://www.mcb.mcgill.ca/~bias/.

3.1.2. Internal Data Sources - Object-Relational Model

BIAS offers a pragmatic object-relational (OR) model that maps objects in the object-oriented world to relations in the database and vice versa. The notion of an OR model is essentially an amalgamation of relational and object-oriented models [17-19] that tries to provide the advantages of both systems. In particular, it allows programmers to work with a modern object-oriented programming environment while preserving the rich toolkit of reliable functionality offered for persistent storage by relational database management systems. When a programmer defines a new class in Java, the class is reflected as a table in the database. When an instance of the class is created, the instance is reflected as a tuple (row) in the corresponding table. The system guarantees that modifications to classes or instances of classes are automatically reflected in the database (and vice versa). The strengths of our system are its ease of use in comparison with other OR systems [17-18], and the ease with which many important bioinformatics objects can be integrated into the model.
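The class-to-table and instance-to-row correspondence described above can be illustrated with a small sketch. It is written in Python with an in-memory SQLite database purely for illustration; BIAS itself is Java-based, and the class BindingSite, its fields, and the helper functions below are hypothetical and do not reproduce the BIAS data model or its automatic synchronization.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple

# Mapping from Python field types to SQL column types (illustrative subset).
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}

@dataclass
class BindingSite:  # hypothetical class, not part of BIAS
    chromosome: str
    start_pos: int
    sequence: str

def create_table(conn, cls):
    """Reflect a class as a table: one column per field."""
    columns = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    conn.execute(f"CREATE TABLE IF NOT EXISTS {cls.__name__} ({columns})")

def save(conn, obj):
    """Reflect an instance as a tuple (row) in the table of its class."""
    marks = ", ".join("?" for _ in fields(obj))
    conn.execute(f"INSERT INTO {type(obj).__name__} VALUES ({marks})", astuple(obj))

def load_all(conn, cls):
    """Map every row of the table back to an object of the class."""
    return [cls(*row) for row in conn.execute(f"SELECT * FROM {cls.__name__}")]

conn = sqlite3.connect(":memory:")
create_table(conn, BindingSite)
save(conn, BindingSite("2", 1200, "GGTACATTTTGTTCT"))
print(load_all(conn, BindingSite))
```

In this sketch the mapping is triggered by explicit calls; an OR layer of the kind described in the text would perform the equivalent steps transparently.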

3.1.3. External Data Sources

Any bioinformatics software must consider questions regarding the importation and exportation of data from foreign sources. Of course, there are a multitude of open problems and thorny issues involved in discussions of distribution and inter-operability of data. Our main concern within this research is to build a middleware layer between BIAS and external data sources, such as the Ensembl database.

3.1.3.1 Java API of Ensembl (Ensj)

The mandate of Ensembl is to develop a software system to produce and maintain automatic annotations of eukaryotic genomes [20]. Ensembl provides a Java interface, called Ensj, to make it easy for Java programmers to access the Ensembl databases. Ensembl has many different data types, such as genes, transcripts, and translations. Each data type has a corresponding Java interface, and each data type has its own adaptor that can be used to retrieve data from the database and to write to the database. Adaptors are grouped together into a driver. The driver represents a single logical data source such as the 'current' database on ensembldb.ensembl.org. Ensj allows the programmer to manipulate Java objects and does not require the programmer to know about the particular Ensembl schema. This is simpler than accessing the database directly via JDBC and allows database changes to be hidden from the application programmer.

I have developed a module called ensembl in BIAS to import both Mouse and Human genome information through Ensj.

3.1.3.2 Java API of BIAS

This subsection describes our pragmatic approach for the importation of data from Ensembl. Ideally BIAS requires a method for (i) Accessing this information, (ii) Determining what information has changed or been added to Ensembl, (iii) Updating any local information dependent on changes made to Ensembl.

With respect to (i), we have constructed an interface to connect to Ensembl directly through a Java API (via Ensj) that in turn calls the Ensembl MySQL server. With respect to (ii) and (iii), in the future BIAS will contain a module that acts as a middleware layer between BIAS and Ensembl. Essentially, whenever BIAS determines that a new version of Ensembl is available, it will (semi-)automatically remove all of the necessary objects (tables in the database) and download the new version of Ensembl. Via a set of Java scripts, all annotations that can be created automatically are re-run with the new data and stored with the BIAS system.

Although we have not yet developed the functionality with respect to (ii) and (iii) in BIAS, we have carefully annotated our TFBSs with additional information that allows us to uniquely identify each TFBS in all future versions of the genome with high probability. The additional information includes the flanking 10 bp around the site and the distance and label of the nearest gene. If the assembly changes, BIAS will be capable of automatically searching for the occurrence of each TFBS in the new data set and thus will be capable of transferring any annotations to the new version of the database.
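One way the flanking-sequence annotation could be used to relocate a site after an assembly change is sketched below. This is only an assumption about how such a search might look, not the BIAS implementation (which, as stated above, does not yet exist): the function name is hypothetical, only exact matches of flank + site + flank are considered, and the nearest-gene label and distance are not used.

```python
def relocate_site(new_sequence, left_flank, site, right_flank):
    """Return the 0-based start positions of `site` in `new_sequence`
    at which both flanking sequences also match exactly."""
    probe = (left_flank + site + right_flank).upper()
    seq = new_sequence.upper()
    starts = []
    pos = seq.find(probe)
    while pos != -1:
        starts.append(pos + len(left_flank))  # start of the site itself
        pos = seq.find(probe, pos + 1)        # keep looking for further occurrences
    return starts

# Hypothetical example: the site has shifted by two bases in the "new assembly".
left, site, right = "AAAAACCCCC", "GGTACATTTTGTTCT", "GGGGGTTTTT"
new_assembly = "NN" + left + site + right + "AC"
print(relocate_site(new_assembly, left, site, right))
```

If the returned list contains exactly one position, the annotation can be transferred unambiguously; zero or several positions would need the additional gene information to resolve.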

3.2. Background of Bioinformatics Approaches

3.2.1. Module for TFBS

The module for TFBS mentioned in Section 2.1.2 is different from the module for BIAS mentioned in Section 3.1.1. In this section, we will discuss the module for TFBS. The identification of TFBSs and the corresponding TFs is an important problem for several reasons: (i) These protein-nucleotide interactions are a fundamental mechanism for controlling gene expression in all forms of life. (ii) It allows us to identify genes regulated by a specific TF. This may reveal unappreciated and unexpected functions of these genes. (iii) It allows us to better understand the function of the TF itself. (iv) It is an important initial step in determining the DNA signals that regulate transcription of the genome. (v) It allows us to understand how the genome encodes the information that specifies when and where a gene will be expressed.

In this research, a module called gre in BIAS is developed to find putative TFBSs and their target genes transcriptionally regulated by the GR through three different bioinformatics approaches: (i) Searching for putative GREs (pGREs) in the complete Mouse genome using a PWM, (ii) Combinatorial analysis of multiple TFs/modules, and (iii) Phylogenetic footprinting.

3.2.1.1. A Markov chain (MC) [118]

Markov processes can be in either discrete or continuous time, and in either discrete or continuous space. However, in this study where we consider DNA sequences, we only deal with discrete-time finite MCs. A finite MC has some finite discrete set S of possible states, labeled {S1, S2, S3, ..., Sn}. At each of the unit time points t = 1, 2, 3, ... an MC process occupies one of these states. In each time step t to t+1, the process either stays in the same state or moves to some other state in S. Further, it does this in a probabilistic, or stochastic, way rather than in a deterministic way. Suppose the MC is in state Si at time t. Pij is used to denote the probability that at time t+1 it is in state Sj. So Pij is called the transition probability from Si to Sj. It is convenient to group the transition probabilities into the so-called transition probability matrix of the MC. Any row in the matrix corresponds to the state from which the transition is made, and any column in the matrix corresponds to the state to which the transition is made.

The MC model has the following two distinguishing Markov characteristics: (i) The memoryless property. If at some time t the process is in state Si, the probability that one time unit later it is in state Sj depends only on Si, and not on the past history of the states it was in before time t. (ii) The time homogeneity property. Given that at time t the process is in state Si, the probability that one time unit later it is in state Sj is independent of t. The MC described above is the first order MC, which means it possesses the first characteristic above.

More general Markov processes relax one or both properties and allow for so-called ith order Markov processes, for i >= 0. For example, a model where each position of a nucleotide sequence evolves independently of all other positions is called a 0th order MC. In general, a model where each state depends on at most K >= 1 other states is called a Kth order Markov model. In this study, we have modeled the PWM using a 0th order MC. Furthermore, we have used a simple 2nd order MC as a background model of the Mouse genome. In this MC, each position depends on exactly the two previous positions. We refer to this as the Mouse Background MC (MBMC).

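In DNA sequence analysis, the transition probability matrix of a first order MC is normally presented in the form of a 4 x 4 matrix:

$$P = \begin{pmatrix} P_{AA} & P_{AC} & P_{AG} & P_{AT} \\ P_{CA} & P_{CC} & P_{CG} & P_{CT} \\ P_{GA} & P_{GC} & P_{GG} & P_{GT} \\ P_{TA} & P_{TC} & P_{TG} & P_{TT} \end{pmatrix}$$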
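The 4 x 4 matrix above can be estimated from a sequence by counting adjacent nucleotide pairs and normalizing each row. The following is a minimal sketch under stated assumptions: the function name and the toy sequence are hypothetical, pairs containing symbols other than A, C, G, T are ignored, and a row with no observations is set to the uniform distribution.

```python
from collections import Counter

ALPHABET = "ACGT"

def transition_matrix(sequence):
    """Estimate P[x][y] = P(next nucleotide is y | current nucleotide is x)."""
    seq = sequence.upper()
    pair_counts = Counter(zip(seq, seq[1:]))  # counts of adjacent pairs (x, y)
    matrix = {}
    for x in ALPHABET:
        row_total = sum(pair_counts[(x, y)] for y in ALPHABET)
        # Assumption: if x is never followed by a nucleotide, use a uniform row.
        matrix[x] = {
            y: (pair_counts[(x, y)] / row_total if row_total else 0.25)
            for y in ALPHABET
        }
    return matrix

P = transition_matrix("ACGTACGTTTGACA")  # hypothetical toy sequence
for x in ALPHABET:
    print(x, [round(P[x][y], 2) for y in ALPHABET])
```

Each row of the result sums to one, in agreement with the row/column convention described in the text (rows are the state from which the transition is made).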

3.2.1.2. Position Weight Matrix (PWM)

A count matrix is commonly used to reflect characteristics at each position of the binding site of a TF. It is represented by a 4 x n matrix. Each row of the matrix corresponds to one nucleotide (A, C, G, T). Each column of the matrix corresponds to one position of a TFBS. Given an aligned set of known TFBSs, it assigns a count to each possible nucleotide at each position by taking the number of times that nucleotide is observed at that position in the alignment. Figure 3-2 gives an example of how to derive a count matrix from a set of known TFBSs.

Known TFBSs (aligned):
AAGCATT
CAGCAAT
AAGCATC
AAGTATT

Posi.  1  2  3  4  5  6  7
A      3  4  0  0  4  1  0
C      1  0  0  3  0  0  1
G      0  0  4  0  0  0  0
T      0  0  0  1  0  3  3

Figure 3-2: An example of how to derive a count matrix from a set of known TFBSs.

A position frequency matrix (PFM), such as the matrices of TFBSs from TRANSFAC, is another form of a count matrix. It records the frequency of every nucleotide at each position given a set of known TFBSs. Normally this PFM is normalized to form a probability matrix, meaning that each column adds up to a total of one. We call this transformed matrix a PWM.

For a PWM matrix P, let P(i, j) be the probability that nucleotide i occurs at position j. Then P(i, j) can be estimated from either the count matrix or the PFM using the following equation:

$$P(i,j) = \frac{n(i,j)}{\sum_{b \in \{A,C,G,T\}} n(b,j)} = \frac{f(i,j)}{\sum_{b \in \{A,C,G,T\}} f(b,j)} \qquad (3\text{-}2)$$

where n(i, j) is the number of times nucleotide i occurs at position j, and f(i, j) is the frequency of nucleotide i occurring at position j.
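As a worked example of the normalization above, the sketch below builds the count matrix for the four aligned example sites and divides each column by its total. The site strings are written as they are read from Figure 3-2, and the function names are hypothetical.

```python
ALPHABET = "ACGT"

# The four aligned example sites, as read from Figure 3-2.
SITES = ["AAGCATT", "CAGCAAT", "AAGCATC", "AAGTATT"]

def count_matrix(sites):
    """counts[b][j] = number of sites with nucleotide b at position j (0-based)."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return {b: [sum(s[j] == b for s in sites) for j in range(length)] for b in ALPHABET}

def to_pwm(counts):
    """Normalize each column of the count matrix so that it sums to one."""
    length = len(counts["A"])
    totals = [sum(counts[b][j] for b in ALPHABET) for j in range(length)]
    return {b: [counts[b][j] / totals[j] for j in range(length)] for b in ALPHABET}

counts = count_matrix(SITES)
pwm = to_pwm(counts)
for b in ALPHABET:
    print(b, counts[b], [round(p, 2) for p in pwm[b]])
```

For instance, A occurs in 3 of the 4 sites at the first position, so the corresponding PWM entry is 3/4 = 0.75.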

A PWM is a 0th order Markov process since it models each position as evolving independently of all other positions. A PWM reflects the nucleotide preferences for the individual positions of the aligned known TFBSs and is ideally derived from a set of functionally characterized binding sites for the given TF [10, 30, 31, 40, 122].

It is common to use a PWM for a specific TF to locate putative TFBSs in the complete genome of an organism. This proceeds by "sliding" the PWM along the genomic DNA sequence and scoring each "window" of length n according to the probabilities in the PWM.

Counts in the count matrix can also be converted into odds ratios of observed to expected probabilities [33] using the following equation:

$$\frac{P(i,j)}{P(i)} = \frac{n(i,j)/N_j}{P(i)} \qquad (3\text{-}2)$$

where P(i, j) is the estimate of the probability from the PWM, N_j is the total count at position j, and P(i) is the probability of nucleotide i in a random sequence. This latter quantity can be estimated simply by examining the entire Mouse genome. Sometimes, as is the case with our MBMC, it is important to use background models that are not 0th order MCs. For the MBMC we use a 2nd order MC and we need to calculate P(i | i', i''), where i' is the nucleotide at the previous position and i'' is the nucleotide at the 2nd previous position. Here we are assuming that the MC is homogeneous: the probability of nucleotide i at position j given nucleotides i', i'' at positions j-1, j-2 respectively is the same as the probability of nucleotide i at position j' given nucleotides i', i'' at positions j'-1, j'-2 respectively.

It is common to use the so-called log-odds ratio when performing calculations in practice. For example, we would take the logarithm of Equation 3-2. Adding log-odds values is equivalent to multiplying the corresponding probability ratios. The odds ratio can maximize selectivity for observed nucleotides [33]. However, log-odds approaches can have serious drawbacks that must be handled carefully. When an entry in the count matrix is zero, the corresponding probability in the PWM will be zero. In this case, the logarithm of zero is negative infinity. In this study, we did not use the log-odds method. Instead we use PWMs in genome-wide searches and compare our results against randomly generated sequences from an MBMC.
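The drawback mentioned above can be made concrete with a small sketch. The counts, the uniform background of 0.25, and the function names are hypothetical choices for illustration; as stated in the text, the log-odds method was not used in this study.

```python
import math

def log_odds(count, column_total, background):
    """Natural-log odds of the observed PWM probability against a background probability."""
    p = count / column_total
    if p == 0.0:
        return float("-inf")  # log(0): the problematic case described in the text
    return math.log(p / background)

def window_log_odds(window_counts, column_total, background=0.25):
    """Sum of per-position log-odds, i.e. the log of the product of the odds ratios."""
    return sum(log_odds(c, column_total, background) for c in window_counts)

# Counts of the observed nucleotide at each position of two hypothetical windows,
# looked up in a count matrix built from 4 sites.
good_window = [3, 4, 4, 3]
bad_window = [3, 4, 0, 3]  # one nucleotide was never seen at its position

print(window_log_odds(good_window, 4))
print(window_log_odds(bad_window, 4))
```

A single zero count drives the whole window score to negative infinity, regardless of how well the remaining positions match.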

3.2.1.3. Regions in Mouse Genome

Three different regions are defined in the whole Mouse genome: (i) Gene region: the whole gene area including introns and exons, (ii) Promoter region: 1000 bp in the 5' direction (upstream) starting from the TSS. However, we extend this to 15000 bp upstream and 1000 bp downstream in the phylogenetic footprinting analysis, and (iii) Junk region: all other regions excluding the gene and promoter regions. This region can also be called the "intergenic region".
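The promoter definition above depends on the strand of the gene, since "upstream" means the 5' direction relative to the gene. The sketch below shows one way the interval could be computed. The function name is hypothetical, and two conventions are assumptions rather than statements from the text: coordinates are 1-based and inclusive, and the TSS position itself is included in the interval.

```python
def promoter_interval(tss, strand, upstream=1000, downstream=0):
    """Return (start, end), inclusive, of the promoter in chromosome coordinates.

    strand is +1 or -1. Upstream is toward smaller coordinates on the + strand
    and toward larger coordinates on the - strand.
    """
    if strand == 1:
        start, end = tss - upstream, tss + downstream
    elif strand == -1:
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be +1 or -1")
    return max(1, start), end  # do not run off the start of the chromosome

print(promoter_interval(5000, 1))                  # default promoter, + strand
print(promoter_interval(5000, -1))                 # default promoter, - strand
print(promoter_interval(20000, 1, 15000, 1000))    # extended region for footprinting
```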

3.2.1.4. Search for putative TFBSs using PWM

In this research, we used a method based on [31] to search for binding sites of a single known TF. The original PFM is extracted from TRANSFAC [45]. We begin by transforming it into a PWM using the method described above. Using this PWM, we calculate the matrix similarity as follows:

$$\mathrm{mat\_sim} = \frac{\sum_{j=1}^{n} C_i(j) \times P(i,j)}{\sum_{j=1}^{n} C_i(j) \times \mathrm{max\_P}(j)} \qquad (3\text{-}1)$$

where n is the length of the matrix, P(i, j) is the estimate of the probability for base i at position j, and max_P(j) is the maximum value of P(b, j) at position j with b ∈ {A, C, G, T}. Ci(j) is the value at position j of the vector Ci, which is an array of values termed the consensus index vector. Equation (3-2) gives the calculation for each position j of the Ci vector.

$$C_i(j) = \frac{100}{\ln 4} \times \left[ \sum_{b \in \{A,C,G,T\}} P(b,j) \times \ln P(b,j) + \ln 4 \right] \qquad (3\text{-}2)$$

The Ci vector represents the conservation of the individual nucleotide positions in the matrix. The maximum of 100 is reached by a position when total conservation of one nucleotide is observed. The minimum value of 0 only occurs at a position with an equal distribution of all four nucleotides [109].

The matrix similarity reaches 1 only if the candidate sequence corresponds to the most conserved nucleotide at each position of the matrix. Multiplying each score by the Ci(j) emphasizes the fact that mismatches at less conserved positions are more easily tolerated than mismatches at highly conserved positions [31].

After we calculate the score for the particular sequence, we compare it against a threshold. If it is higher than the threshold, it is qualified as a putative TFBS. Otherwise this sequence is ignored.
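The scoring and thresholding steps of this section can be sketched as follows. This is a minimal illustration of the matrix similarity and consensus index formulas given above, not the code of the gre module: the function names and the 3-position toy PWM are hypothetical, only the forward strand is scanned, windows containing symbols other than A, C, G, T are skipped, and 0 x ln 0 is taken to be 0.

```python
import math

ALPHABET = "ACGT"

def ci_vector(pwm):
    """Ci(j) = (100 / ln 4) * (sum_b P(b,j) * ln P(b,j) + ln 4)."""
    ci = []
    for j in range(len(pwm["A"])):
        term = sum(pwm[b][j] * math.log(pwm[b][j]) for b in ALPHABET if pwm[b][j] > 0)
        ci.append(100.0 / math.log(4) * (term + math.log(4)))
    return ci

def mat_sim(window, pwm, ci):
    """Matrix similarity of one window: Ci-weighted score over Ci-weighted maximum."""
    numerator = sum(ci[j] * pwm[window[j]][j] for j in range(len(ci)))
    denominator = sum(ci[j] * max(pwm[b][j] for b in ALPHABET) for j in range(len(ci)))
    return numerator / denominator

def scan(sequence, pwm, threshold):
    """Slide the PWM along the sequence; keep windows scoring above the threshold."""
    n = len(pwm["A"])
    ci = ci_vector(pwm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        if any(ch not in ALPHABET for ch in window):
            continue
        score = mat_sim(window, pwm, ci)
        if score > threshold:
            hits.append((start, window, round(score, 3)))
    return hits

# Hypothetical toy PWM: position 1 fully conserved, position 2 uninformative,
# position 3 split evenly between G and T.
TOY_PWM = {
    "A": [1.0, 0.25, 0.0],
    "C": [0.0, 0.25, 0.0],
    "G": [0.0, 0.25, 0.5],
    "T": [0.0, 0.25, 0.5],
}
print(ci_vector(TOY_PWM))
print(scan("TTACGAAT", TOY_PWM, 0.9))
```

In the toy example the uninformative middle position receives a consensus index of about 0, so mismatches there do not lower the score, which is the weighting behaviour described in the text.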

3.2.1.5. Statistical Significance Test of putative TFBSs

When a genome-wide search for TFBSs is performed for a given TF, we require a method for assessing the statistical significance of the number of putative TFBSs found. In other words, we need to estimate a p-value expressing our belief that the number of TFBSs is different from what we would expect from randomly generated sequences. To do this, we modeled Mouse sequences using different kinds of random sequences [129]. We found that the 2nd order MC (MBMC) modeled the Mouse genome well (depicted in Figure 3-3 and Figure 3-4). Figure 3-3 compares the number of pGREs found using the algorithm from Section 3.2.1.4 in Mouse chromosome 2 and in three different kinds of background sequences, with different window sizes and no overlap between windows. The three different background sequences are:

(i) 0th order MC (described in Section 3.2.1.1) (ii) 2nd order MC, also called MBMC (described in Section 3.2.1.1) (iii) Sequence obtained by shuffling the nucleotides in the chromosome within a certain window. By window, we mean a stretch of sequence of a certain length. By shuffling, we mean that the window is permuted randomly. For example, if the window contains "AATTCCGG", a random permutation of these nucleotides, such as "ATCGATCG", is one valid result. Different window sizes and different overlaps between two windows are used. By overlap, we mean that a certain number of base pairs is shared between two consecutive windows.
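The window-based shuffling of background (iii) can be sketched as below for the case of non-overlapping windows. The function name and the fixed random seed are hypothetical choices; the handling of overlapping windows is not specified in enough detail in the text to be reproduced here, so it is left out.

```python
import random

def shuffle_in_windows(sequence, window, seed=0):
    """Randomly permute the nucleotides inside consecutive, non-overlapping windows."""
    if window < 1:
        raise ValueError("window must be positive")
    rng = random.Random(seed)  # seeded so that the example is reproducible
    out = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        rng.shuffle(chunk)
        out.append("".join(chunk))
    return "".join(out)

original = "AATTCCGGACGTACGTAC"
shuffled = shuffle_in_windows(original, 8, seed=1)
print(original)
print(shuffled)
```

Each window of the shuffled sequence keeps the nucleotide composition of the corresponding window in the original, so small windows preserve local composition while destroying the order of the nucleotides.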

Figure 3-4 shows the results with the different overlapping for each fixed window size.

Based on Figure 3-3, it appears that a window size of 2 and an overlap of 0 better approximate the true number of GRE hits in Chromosome 2. Based on Figure 3-4, it appears that when the overlap is increased, the number of hits in the random sequence better approximates the true number of GRE hits in Chromosome 2. However, the MBMC best matches the true number of hits.

[Plot: number of hits (y-axis) against threshold (x-axis, 0.81 to 0.98) for Mouse chromosome 2, the MBMC, and shuffled sequences based on Mouse chromosome 2.]

Figure 3-3: A comparison of the number of hits between Mouse chromosome 2 and different background sequences. The window size ranges from 2 up to the length of the whole chromosome. The overlap is 0.

[Plot: number of hits (y-axis) against threshold (x-axis, 0.81 to 0.98) for Mouse chromosome 2, the 2nd order Markov chain, and shuffled sequences based on Mouse chromosome 2.]

Figure 3-4: A comparison of the number of hits between Mouse chromosome 2 and different random sequences. The window size is 200, 500, 1000, or the length of the whole chromosome, with overlaps of 0, 10, and 100.

A p-value is also called an achieved significance level. It is the probability of observing a test statistic as extreme as or more extreme than the observed value, assuming that the null hypothesis is true. In other words, the p-value gauges the strength of evidence against the null hypothesis on a numerical scale. A small p-value indicates strong support for the alternative hypothesis, that is, rejection of the null hypothesis. Let H0 be the null hypothesis. In our case, the null hypothesis states that there is no difference in the number of pGREs found between the Mouse genome and, for example, the MBMC (mentioned in Section 3.2.1.1). Let H1 be the alternative hypothesis that the number of pGREs found in the Mouse genome is different from that in the MBMC. The p-value is compared to a selected significance level α to determine whether the result is statistically significant. If the p-value is smaller than α for a one-sided test or smaller than α/2 for a two-sided test, then H0 will be rejected and H1 will be accepted.

The following algorithm is used for generating p-values: (i) Derive the frequency and combinatorial probability (see Section 4.3.1) distributions from the MBMC. (ii) Decide whether both distributions are normally distributed. (iii) Formulate H0 and H1. (iv) Set a threshold α (such as 0.05) to determine the rejection region. (v) Calculate the p-value for the frequency and combinatorial probability derived from the Mouse genome. (vi) Draw the conclusion: state whether or not H0 is rejected.

When searching for TFBSs for a specific TF in Section 4.2, we test the significance of the frequency of TFBSs in the different regions of the Mouse genome. A right-sided test is used to calculate the p-value. In the module study of a set of TFBSs in Section 4.3, we test the significance of the combinatorial probability of the position order for the module in the promoter region of the Mouse genome. A two-sided test is used to calculate the p-value.
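The test procedure above can be sketched as follows, assuming that the null distribution obtained from the MBMC sequences is approximately normal (step (ii)). The function names and the sample values are hypothetical. For the two-sided case the sketch follows the convention stated in the text: the tail probability is compared with α/2.

```python
import math
from statistics import mean, stdev

def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def reject_h0(observed, null_samples, alpha=0.05, two_sided=False):
    """Return (tail probability, True if H0 is rejected) under a normal approximation."""
    mu, sigma = mean(null_samples), stdev(null_samples)
    if sigma == 0:
        raise ValueError("null samples have zero variance")
    z = (observed - mu) / sigma
    if two_sided:
        tail = normal_sf(abs(z))      # probability in the tail on the observed side
        return tail, tail < alpha / 2
    p = normal_sf(z)                  # right-sided test
    return p, p < alpha

# Hypothetical numbers of hits found in ten MBMC-generated sequences.
null_hits = [98, 101, 100, 102, 99, 100, 97, 103, 100, 100]
print(reject_h0(110, null_hits))                   # far above the null mean
print(reject_h0(101, null_hits))                   # within the null spread
print(reject_h0(90, null_hits, two_sided=True))    # far below, two-sided
```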

We estimated the parameters for the MBMC. The transition probability matrices are calculated based on the different regions of the Mouse genome. The following algorithm describes how an MBMC is created: (i) Retrieve the different regions from the Mouse genome. (ii) Concatenate all retrieved sequences into one sequence. (iii) Create the MBMC transition probability matrices based on the concatenated sequence. (iv) Generate an MBMC sequence multiple times (such as 1000) using the transition probability matrices. (v) Search for binding sites using a PWM for a specific TF in each newly created sequence.
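Steps (iii) and (iv) of the algorithm above can be sketched as follows. This is an illustration under stated assumptions, not the gre module code: the function names are hypothetical, the first two nucleotides of a generated sequence are taken from a randomly chosen observed context, and a context that was never observed falls back to a uniform distribution.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def fit_second_order(sequence):
    """Estimate P(i | i'', i') by counting, for every observed pair of preceding nucleotides."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    seq = sequence.upper()
    for k in range(2, len(seq)):
        context, nxt = seq[k - 2:k], seq[k]
        if all(ch in ALPHABET for ch in context + nxt):
            counts[context][nxt] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = [row[b] / total for b in ALPHABET]
    return model

def generate(model, length, seed=0):
    """Generate a random sequence of the given length from the fitted 2nd order MC."""
    rng = random.Random(seed)
    out = list(rng.choice(sorted(model)))  # assumption: start from an observed context
    while len(out) < length:
        weights = model.get("".join(out[-2:]))
        if weights is None:                # unseen context: uniform fallback
            weights = [0.25] * 4
        out.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(out[:length])

model = fit_second_order("ACGTACGTACGT")   # hypothetical concatenated sequence
print(model)
print(generate(model, 20))
```

Repeating the call to generate with different seeds (for example 1000 times) and scanning each result with the PWM would yield the null distribution of hit counts used in the significance test.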

Skewness and Kurtosis

Skewness and kurtosis are two parameters used to characterize the similarity of each data set to the normal distribution. The 2*ses and 2*sek parameters are the thresholds for skewness and kurtosis. The following gives a more detailed explanation of these two concepts.

Skewness is a measure of symmetry or lack of symmetry. A data set is symmetric if it looks the same to the left and right of the center point. A distribution is more symmetric if the value of skewness tends to 0. Distributions with positive skewness have long tails to the right, and distributions with negative skewness have long tails to the left. Normal distributions produce a skewness of approximately zero. Values of two standard errors of skewness (ses) or more (regardless of sign) are probably skewed to a significant degree. The ses can be estimated roughly using the following formula [120]: $\sqrt{6/N}$, where N is the number of observations.

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. The distribution is closer to being normal if the value of kurtosis tends to 0. Values of two standard errors of kurtosis (sek) or more (regardless of sign) probably differ from mesokurtic to a significant degree. A distribution with the same kurtosis as the normal distribution is called mesokurtic.

The sek can be estimated roughly using the following formula [120]: $\sqrt{24/N}$, where N is the number of observations.
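The normality check described in this subsection can be sketched as follows. Two points are assumptions rather than statements recoverable from the text: the standard errors are taken to be the commonly used rough estimates sqrt(6/N) and sqrt(24/N), and skewness and kurtosis are computed with simple moment estimators, with kurtosis reported as excess kurtosis so that a normal distribution gives approximately 0. The function names and data are hypothetical.

```python
import math

def skewness_and_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((x - mu) ** 3 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def looks_normal(values):
    """True if |skewness| < 2*ses and |kurtosis| < 2*sek (rough standard errors)."""
    n = len(values)
    skew, kurt = skewness_and_kurtosis(values)
    ses = math.sqrt(6.0 / n)    # assumed rough estimate of the standard error of skewness
    sek = math.sqrt(24.0 / n)   # assumed rough estimate of the standard error of kurtosis
    return abs(skew) < 2 * ses and abs(kurt) < 2 * sek

symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1] * 50 + [100]
print(skewness_and_kurtosis(symmetric), looks_normal(symmetric))
print(looks_normal(long_right_tail))
```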

3.2.1.6. TRANSFAC

TRANSFAC [36-43] is a database of eukaryotic TFBSs and transcription factors. The database was founded in 1988 [44] and covers a range of diverse organisms including Yeast, Human, and Mouse. The TRANSFAC database is available from [45]. TFs, TFBSs, and TFBS sequence profiles (consensus/IUPAC string or count matrix/PFM) are generated from the published literature. If we consider the PWM described in Section 3.2.1.2, the nucleotide in {A, C, G, T} with the maximum P(i, j) at position j is the consensus symbol at this position. However, if the probabilities at one position are almost the same, say approximately 0.25, then the symbol N is used. IUPAC is the abbreviation of International Union of Pure and Applied Chemistry [27]. An IUPAC string is a type of consensus string that uses so-called ambiguous symbols (e.g., B = C, G or T; R = A or G) to describe the variability of nucleotide usage.

Release 6.0 of TRANSFAC is made available as ASCII flat files. It has six tables [45]: SITE, GENE, FACTOR, CELL, CLASS, and MATRIX. SITE gives information on (regulatory) TFBSs within eukaryotic genes. GENE gives a short explanation of the gene where a site (or group of sites) is found. FACTOR describes the proteins binding to these sites. CELL gives a brief overview of the cellular source of proteins that have been shown to interact with the sites. CLASS briefly explains some of the main features of the DNA-binding domains of 44 transcription factor classes. MATRIX gives the nucleotide frequencies observed in aligned binding sites of the corresponding TF. An additional column depicts the IUPAC string derived from the matrix according to the following rules: a single nucleotide is shown if its frequency is greater than 50% and at least twice as high as that of the second most frequent nucleotide. A double-degenerate code indicates that the corresponding two nucleotides occur in more than 75% of the underlying sequences but each of them is present in less than 50%. Usage of triple-degenerate codes is restricted to those positions where one of the nucleotides did not show up at all in the sequences and none of the afore-mentioned rules applies. All other frequency distributions are represented by the letter N [110].
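The rules above translate directly into a small decision procedure for one matrix column. The sketch below is a literal reading of those rules and not TRANSFAC code: the function name is hypothetical, the rules are applied in the order given, and cases that the text leaves open (such as ties between the second and third most frequent nucleotides) are resolved arbitrarily.

```python
# Standard IUPAC two- and three-nucleotide ambiguity codes.
TWO = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
       frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M"}
THREE = {frozenset("CGT"): "B", frozenset("AGT"): "D",
         frozenset("ACT"): "H", frozenset("ACG"): "V"}

def iupac_symbol(counts):
    """Return the IUPAC symbol for one column, given counts for A, C, G and T."""
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("empty column")
    ranked = sorted(((counts[b] / total, b) for b in "ACGT"), reverse=True)
    (f1, b1), (f2, b2) = ranked[0], ranked[1]
    # Rule 1: one nucleotide above 50% and at least twice the second most frequent.
    if f1 > 0.5 and f1 >= 2 * f2:
        return b1
    # Rule 2: two nucleotides together above 75%, each below 50%.
    if f1 + f2 > 0.75 and f1 < 0.5 and f2 < 0.5:
        return TWO[frozenset(b1 + b2)]
    # Rule 3: exactly one nucleotide never observed and no earlier rule applies.
    present = [b for b in "ACGT" if counts[b] > 0]
    if len(present) == 3:
        return THREE[frozenset(present)]
    # Rule 4: everything else.
    return "N"

print(iupac_symbol({"A": 3, "C": 1, "G": 0, "T": 0}))
print(iupac_symbol({"A": 4, "C": 4, "G": 1, "T": 1}))
print(iupac_symbol({"A": 3, "C": 3, "G": 3, "T": 0}))
print(iupac_symbol({"A": 1, "C": 1, "G": 1, "T": 1}))
```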

Currently, the MATRIX table contains 336 nucleotide frequency matrices of aligned binding sites. Appendix B gives the PWMs, derived from the PFMs of TRANSFAC, of TFBSs for GR, NF-1, C/EBPbeta, and C/EBPdelta.

3.2.2. Background for the Genome Wide Search of TFBSs

The alignment of molecular sequences is a common approach used in many of the techniques for locating TFBSs. Traditionally, sequence alignment research has focused on tools (e.g. FASTP, FASTA, LFASTA [21-24], GappedBLAST, PSI-BLAST [25]) for aligning coding regions of genes, whether at the DNA or amino acid level. A coding region of a gene is normally longer than a TFBS and, generally speaking, the degree of similarity is higher. However, for regulatory regions containing TFBSs, these standard tools are inappropriate.

There are three main reasons why traditional sequence alignment tools are not appropriate for the alignment of nucleotide sequences from regulatory regions. Firstly, the scoring matrices used in traditional BLAST are appropriate only for coding regions. This is because they are in fact derived from coding regions. However, there is no codon structure in a regulatory region. Secondly, the statistics used to measure the significance of an alignment are not sufficient. For example, the expect value (E-value) is a parameter used in BLAST alignments for measuring the significance of a hit. The E-value is defined as the number of hits one can "expect" to see just by chance when searching a database of a particular size. The closer it is to 0, the more "significant" the match is. However, the E-value does not estimate the expected number of hits very well for short sequences [125]. Since TFBSs are only 6-20 bp in length, the E-value is not the correct choice for measuring the significance of a hit. A third reason is that full dynamic programming solutions are computationally infeasible for the sequence lengths witnessed in Eukaryotic organisms. It is known, for example, that some regulatory regions containing TFBSs for Mouse genes are 25 Kb in length.

There has been much recent research on the identification of TFBSs in promoters. One common approach is to form a consensus string or IUPAC string for the TFBS. Some software has been developed for performing TFBS searches based on a consensus string or IUPAC string, including FindPatterns in the Genetics Computing Group (GCG) package [26, 27] and SIGNAL SCAN [28-29]. These consensus string or IUPAC string based search programs are very fast and widely used. However, the definition of the IUPAC code is arbitrary to some extent. An IUPAC string is generated to represent the nucleotide composition at each position, based on the alignment of sets of known binding sites of a TF. It does not reflect the quantitative characteristics of the TFBS, which potentially removes much information.
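To make the IUPAC-string approach concrete, the following is a minimal sketch of such a search, not a reimplementation of FindPatterns or SIGNAL SCAN. The function names and the example pattern are hypothetical; only the IUPAC code table itself is standard.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC string into an equivalent regular expression."""
    return "".join(IUPAC[symbol] for symbol in pattern.upper())


def find_iupac_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead consumes no characters, so overlapping sites are all reported.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [match.start() for match in regex.finditer(sequence.upper())]


# Illustrative pattern and sequence only; they are not taken from TRANSFAC.
hits = find_iupac_sites("TGTYCT", "AATGTTCTGGTGTCCTA")
print(hits)
```

Every site either matches or does not, which illustrates the limitation noted above: the search returns no score that distinguishes a strong site from a weak one.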

A number of software packages now exist for identifying putative TFBSs using PWMs. The list includes MATRIX SEARCH (SIGNAL SCAN version 4.0) [30], MatInspector [31], FastM [32], TransFind [34], and TESS [35]. One of the benefits a PWM-based search has over a consensus or IUPAC-based search is that it provides a quantitative evaluation of each potential binding site. Of course, the accuracy of PWM-based methods is limited by the quality of the PWM itself. If the PWM is not created from a sufficiently large library of known TFBSs, it may not perform well. While a PWM-based search is considered to be more sensitive, it produces a large number of false positive TFBS predictions. Nevertheless, PWM-based searches are a common starting point in many analyses. Combinatorial analysis of multiple TFs (modules) and phylogenetic footprinting are both effective and efficient methods for reducing this large number of false positives.

3.2.3. Modules

Section 2.1.2 provided a biological definition of a module. This section describes our approach to reducing the number of false positive TFBS predictions obtained via PWM searches using these modules.

Methods for detecting modules fall into two classes: supervised and unsupervised methods. A supervised machine learning approach uses a priori knowledge of the structure of a particular module. This a priori knowledge is summarized as a model. Different approaches use different models. For example, MSCAN [59] is an algorithm that identifies DNA segments that contain clusters of putative TF binding sites. It takes a set of one or more TF binding site models and outputs a set of putative modules. The model is then aligned against the entire genome in a manner similar to the alignment of a PWM-based method.

Typically, an unsupervised machine learning approach is used if there is no a priori information regarding the structure of a module. It focuses on the identification of significance of a predicted module. The statistical significance of a putative module is a function of the statistical significance of each individual predicted TFBS.

If the promoter of a gene contains this module, that gene is the best candidate gene. Modules can therefore be used to detect the best candidate genes effectively [13, 57]. Wasserman et al. [56] have reported that, compared with the rate of predictions of individual TFBSs, the focus on modules eliminated ~99% of false TFBS predictions while retaining 60% of functional regions. Furthermore, if the module analysis is combined with PF, this can eliminate 99.9% of predictions while retaining ~50% of known modules [13].

Several studies [5, 47-55] have identified cooperative binding of multiple transcription factors to DNA in different genomes and the formation of protein-DNA complexes. Several good software packages [32, 49, 58-65] have been developed for identifying modules. Several databases [36-43, 50, 67] have been constructed for module studies. See [10, 13, 66] for good reviews of this topic.

3.2.4. Phylogenetic Footprinting (PF)

Recently, comparative genomics approaches have been developed to reduce the large number of false positive binding site predictions. PF was originally used by Tagle et al. [67]. Their analysis focused on non-coding regions of primate epsilon and gamma globin genes. They demonstrated that TFBSs with important functions are evolutionarily conserved. Even when such TFBSs are quite short, they are distinguishable from the more rapidly evolving non-coding DNA in which they are embedded, provided the number of aligned homologous sequences represents enough evolutionary time for the accumulation of mutations at the less constrained (presumably selectively neutral) base positions. They define PF as the phylogenetic comparisons that reveal conserved TFBSs in the non-coding regions of homologous genes [67]. More recently, PF has also been defined as a method for the discovery of TFBSs in a set of orthologous regulatory regions from multiple species by identifying the best-conserved binding sites in those orthologous regions [69].

Frazer et al. [84] note that genes derived from a common ancestral gene are homologs, and the level of similarity in their sequences often reflects the time since they diverged. Homologous genes can be generated by speciation, which produces pairs of orthologs. By orthologs, we mean genes in different species that are derived from the same gene in the last common ancestral species. Therefore, orthologous genes usually have similar functions. Homologous genes can also result from the duplication of a chromosomal segment, which produces paralogs. Two genes are paralogs if they diverged from a gene duplication event. In many cases, paralogs do not have the same function.

Some useful PF software has been developed and is available on the web [69, 73, 75-78]. Wasserman et al. [68] found that 98% (74/75) of experimentally defined sequence-specific binding sites of skeletal-muscle-specific transcription factors are confined to the 19% of Human sequences that are most conserved in the orthologous rodent sequences. It is believed that most Human regulatory regions can probably be discovered by comparisons between Human and Rodent sequences. Comparative studies have also demonstrated that conserved 5' flanking sequences of genes often constitute TFBSs that are involved in the regulation of gene expression [87]. PF can identify TFBSs specific even to a single gene, as long as they are sufficiently conserved across species, instead of requiring a set of reliably co-regulated genes in a single-genome search. There are many good reviews of PF, including [10, 13, 84-86].

Genome projects are rapidly producing sequences from a wide variety of metazoan organisms, such as Caenorhabditis elegans [88], Drosophila melanogaster [89], Homo sapiens [90-91], Fugu rubripes [92], Anopheles gambiae [93], Mus musculus [94], Caenorhabditis briggsae, Rattus norvegicus, and Ciona intestinalis [95]. Figure 3-5 shows the evolutionary relationships between metazoans. These projects provide a significant amount of the data necessary for PF. The comparative genomic approach considerably increases the amount of information that can be extracted from these genomic sequences.


Figure 3-5: Evolutionary relationship between metazoans [86]. Evolutionary distances are in million years.

In comparison to the length of promoter regions, TFBSs are quite short (normally 5 to 20 bp long). Because the noise of the diverged nonfunctional background can overwhelm the short conserved signal, short TFBSs may not be detectable by regular alignment algorithms. Therefore, PF requires specialized alignment tools. Gumucio et al. and Duret et al. [96-97] have reported that there are two variants of the PF method. The first one, called "differential phylogenetic footprinting", relies on a search for sequence differences and is typically used to identify regulatory elements responsible for the establishment of novel expression patterns in specific lineages. The analysis is carried out by aligning orthologous sequences and searching for sequence differences. The second variant, called "motif-based phylogenetic footprinting", has been developed to detect conserved binding sites that show sequence variation. Rather than focusing solely on primary sequence conservation, this method searches for putative TFBSs that occur at orthologous positions, allowing the detection of functionally conserved binding sites despite sequence differences. COGs/KOGs [104], HOPS [105], and HomoloGene [106] are bioinformatics resources that catalogue orthologous genes between species. HOVERGEN [107] is also a large database of homologous sets of vertebrate genes.

There are two broadly used classes of algorithms for PF alignment of non-coding genome sequences: local alignment and global alignment. Commonly used programs based on these alignments are BLASTZ [98] for local alignment, and LAGAN [99] and AVID [100, 108] for global alignment. BLASTZ uses an empirically determined scoring matrix for matches and mismatches plus an affine gap penalty. The principal difference between the AVID and LAGAN algorithms is that AVID looks for exact matches to nucleate the alignment, whereas LAGAN uses short inexact words for this purpose. Therefore, LAGAN is more suitable for the alignment of distant species [78]. Two different web servers, PipMaker (http://bio.cse.psu.edu) [101-102] and VISTA (http://www-gsd.lbl.gov/vista) [75, 103], have been developed for producing and visualizing alignments of genomic DNA. PipMaker's underlying alignment software is BLASTZ, and VISTA currently uses the AVID global alignment program and the LAGAN algorithm. Table 3-1 gives some resources for obtaining and analyzing genome sequences. In this PF study, we used the AVID alignment algorithm. A detailed description of this algorithm is given in Section 4.4.1.2.

Databases of Genomic Sequences:
  NCBI                    http://www.ncbi.nlm.nih.gov/
  TIGR                    http://www.tigr.org/
  Sanger                  http://www.sanger.ac.uk/
  Ensembl                 http://www.ensembl.org/
  TAIR                    http://www.arabidopsis.org/home.html
  SGD                     http://genome-www.stanford.edu/Saccharomyces/
  MGD                     http://www.informatics.jax.org/
  Human Genome Browser    http://www.genome.ucsc.edu/
  NISC                    http://www.nisc.nih.gov/
  Rat Genome Database     http://www.rgd.mcw.edu/
  FlyBase                 http://flybase.bio.indiana.edu/
  Wormbase                http://brie2.cshl.org:8081/
  ExoFish                 http://www.genoscope.cns.fr/externe/tetraodon/

Gene Annotation/Prediction Programs:
  GENSCAN                 http://genes.mit.edu/GENSCAN.html
  GenomeScan              http://genes.mit.edu/genomescan/
  Sim4                    http://pbil.univ-lyon1.fr/sim4.html
  EST Genome              http://www.sanger.ac.uk/Software/Alfresco/download.shtml
  FGENESH                 http://genomic.sanger.ac.uk/gf.html
  GrailEXP                http://compbio.ornl.gov/grailexp/
  TwinScan                http://genes.cs.wustl.edu/query.html
  Genie                   http://www.fruitfly.org/seq_tools/genie.html
  SGP                     http://kiwi.ice.mpg.de/sgp-1/
  SLAM                    http://baboon.math.ber1

Table 3-1: List of resources for obtaining and analyzing genomic sequences [84]

4. Methods And Implementation

4.1. External Data Sources Preparation in BIAS

In BIAS, a Java API is developed to import data from the Ensembl database through Ensj. The module is called "ensembl" in BIAS and is located in the following path:

~/ca/mcgill/mcb/bias/datasources/ensembl

Figure 4-1 shows a detailed overview of this module.

The function of this module is to enable BIAS to connect to the Ensembl database. This middleware can download eukaryotic genomic information from Ensembl. Three classes, AddChromosomeFromENSJ, AddGenesFromENSJ, and GeneHomologyFromEnsembl, constitute the middleware between the BIAS database and the Ensembl database. BIAS can fetch eukaryotic genome sequence, gene, and orthologous information from Ensembl through this layer. In this research, all Human and Mouse genomic information, such as genome sequence, genes, and all orthologous gene pairs between the two species, has been stored in the BIAS database.

Class AddChromosomeFromENSJ enables species genome information to be retrieved from Ensembl and saves this information automatically in BIAS. It creates a connection from the Ensembl server to the BIAS server, given host name, user name, password, and database name parameters, so that BIAS can download chromosomal information directly. Due to the limitations of our database system (MySQL), the genome of a species is divided into smaller fragments. Each fragment is typically 10^7 bp in length. These divisions also make retrieval easier. Two tables in the BIAS database exist to accomplish this: CHROMOSOME and CHROMOSOME_FRAGMENT. Appendix A.2 shows detailed information regarding these tables.
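The fragmenting scheme can be illustrated with a short sketch. This is not the BIAS Java code: the function names and the in-memory dictionary standing in for table CHROMOSOME_FRAGMENT are assumptions made for illustration, and the toy fragment length replaces the real one.

```python
def split_into_fragments(sequence, fragment_length):
    """Yield (fragment_index, start_position, fragment_sequence) tuples.

    start_position is the 1-based chromosome coordinate of the fragment's first base.
    """
    for index, offset in enumerate(range(0, len(sequence), fragment_length)):
        yield index, offset + 1, sequence[offset:offset + fragment_length]


def fetch_region(fragments, fragment_length, start, end):
    """Rebuild chromosome positions start..end (1-based, inclusive).

    `fragments` maps fragment_index -> fragment_sequence, standing in for the
    rows that would be read from the database.
    """
    first = (start - 1) // fragment_length
    last = (end - 1) // fragment_length
    joined = "".join(fragments[i] for i in range(first, last + 1))
    offset = (start - 1) - first * fragment_length
    return joined[offset:offset + (end - start + 1)]


chromosome = "ACGTACGTAC"
stored = {index: seq for index, _, seq in split_into_fragments(chromosome, 4)}
region = fetch_region(stored, 4, 3, 9)
```

Only the fragments that overlap the requested region are read, which is one way the division can make retrieval easier.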

Class AddGenesFromENSJ fetches the gene information for a specific species, including all available pseudo-genes, from Ensembl. This information is saved automatically within the BIAS database. It includes the accession number of a gene, its display name, its start/end position, and its orientation. Three tables are used to store this information: GENE, GENE_ANNOTATION, and LOCUS. Table GENE stores basic information for each gene. Table GENE_ANNOTATION records the functions of each gene (if available) along with the Ensembl accession number and name. The sequence for each gene is stored in table LOCUS. Appendix A.2 gives detailed information regarding these tables.

Class GeneHomologyFromEnsembl fetches orthologous gene pairs between Mouse and Human. This information is used in the PF analysis of Section 4.4. The information is saved in table GENE_HOMOLOGY. Detailed information regarding this table is given in Appendix A.2.

Our approach to finding GRE TFBSs follows three steps. Firstly, we perform a genome-wide search using the PWM. In this step, a PWM is retrieved from table ZERO_PSSM in the BIAS database. Detailed information regarding this table is also shown in Appendix A.2. Secondly, we try to identify putative modules. This requires that we consider TFBSs from multiple TFs simultaneously. In particular, we search for regions within the Mouse genome that contain an unusually high number of TFBSs. Thirdly, we apply phylogenetic footprinting using orthology between Human and Mouse.

[Figure 4-1 diagram: the Ensembl database is accessed through ENSJ by the BIAS datasources module "ensembl" (classes AddChromosomeFromENSJ, AddGeneFromENSJ, and GeneHomologyFromEnsembl), which writes to tables CHROMOSOME, CHROMOSOME_FRAGMENT, GENE, GENE_ANNOTATION, LOCUS, and HOMOLOGY_GENE in the BIAS database.]

Figure 4-1: Overview of module "datasources/ensembl"

4.2. Genome Wide Search For Transcription Factor Binding Site

A detailed description of how to find genome-wide TFBSs using the PWM of a known TF is given in this section. The sub-module of module gre in BIAS is located in the following path:

~/ca/mcgill/mcb/bias/modules/gre/TFBS

Figure 4-2 gives an overview of this sub-module. This sub-module is suitable for any kind of binding site search based on the PWM of a specific TF, given either a species' whole genome sequence or a specific sequence.

The complete sub-module is explained as follows:

There are two major classes: TranscriptionalFactorBindingSitesLocating and TranscriptionalFactorBindingSitesReturnType.

The function of class TranscriptionalFactorBindingSitesLocating is to search for all putative TFBSs on both orientations of an input sequence (either part of a genome or the whole genome) using a given PWM. If the input is the whole genome sequence of a specific species from the BIAS database, then the user needs to supply only the species' unique identifier and the chromosome's unique identifier from the BIAS database. Otherwise, the whole sequence must be input into the program. A threshold is also needed to serve as a cutoff point. The criteria are described in Section 3.2.1.4. This class can search for TFBSs of a single TF using one PWM. It can also accept multiple PWMs of multiple TFs and search for their TFBSs simultaneously.

Class TranscriptionalFactorBindingSitesReturnType is instantiated as an object to store the information corresponding to each qualified putative TFBS found by class TranscriptionalFactorBindingSitesLocating. The calculated information includes chromosome unique identifier, position, similarity score, TFBS sequence, and PWM identifier.
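The two classes above can be pictured with the following minimal sketch of a PWM scan on both orientations. It is an illustration under stated assumptions, not the BIAS implementation: the record fields mirror the list above, but the names `Hit`, `scan`, and `similarity`, the toy matrix, and the rescaled-score cutoff are hypothetical, and the actual scoring criteria are those of Section 3.2.1.4.

```python
from collections import namedtuple

# One record per qualified putative TFBS (a stand-in for the return-type class).
Hit = namedtuple("Hit", "chromosome_id position strand score site pwm_id")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def similarity(pwm, site):
    """Score of `site`, rescaled to [0, 1] between the worst and best possible site."""
    raw = sum(column[base] for column, base in zip(pwm, site))
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return (raw - lowest) / (highest - lowest)


def scan(sequence, pwm, threshold, chromosome_id=None, pwm_id=None):
    """Return all hits on both strands whose similarity is at least `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) - set("ACGT"):
            continue  # skip windows containing N or other ambiguity codes
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = similarity(pwm, site)
            if score >= threshold:
                # Positions are reported 1-based on the forward strand.
                hits.append(Hit(chromosome_id, start + 1, strand, score, site, pwm_id))
    return hits


# Toy 3-column matrix favouring the site "ACG" (illustrative values only).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
hits = scan("TTACGTTCGTAA", toy_pwm, threshold=0.9)
```

Searching with several PWMs at once amounts to calling `scan` once per matrix and tagging each hit with its `pwm_id`.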

Class GRE_ReturnType acts as an object to keep the information corresponding to table GRE_HIT. Similarly, class GREAccessory_ReturnType acts as an object to keep the information corresponding to tables GRE_ACCESSORY_NF_1, GRE_ACCESSORY_C_EBPbeta, or GRE_ACCESSORY_C_EBPdelta. These two classes are instantiated as objects that pack the information together before it is saved into tables in the BIAS database.

Class LocatingPgres is used to launch the search for pGREs in either the Mouse or the Human genome and to save the results into table GRE_HIT. It can also search for the nearest gene corresponding to each pGRE, where the nearest gene means the closest gene downstream of the pGRE. Three other classes, LocatingPnf1, LocatingPcebpbeta, and LocatingPcebpdelta, have a similar function to class LocatingPgres, except that they work for GR's cooperative TFs NF-1, C/EBPbeta, and C/EBPdelta. We used separate classes for the four TFs because each is considered one application of the sub-module.
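The nearest-gene lookup can be sketched as a binary search over gene start positions. This is a simplified illustration, not the LocatingPgres code: it assumes forward-strand coordinates, so that "downstream" means a larger coordinate, and the function name and the (start, gene_id) layout are hypothetical.

```python
import bisect


def nearest_downstream_gene(position, genes):
    """Return the (start, gene_id) pair of the closest gene starting at or after
    `position`, or None if no such gene exists.

    `genes` must be sorted by start position.
    """
    starts = [start for start, _ in genes]
    index = bisect.bisect_left(starts, position)
    return genes[index] if index < len(genes) else None


genes = [(100, "geneA"), (500, "geneB"), (900, "geneC")]
nearest = nearest_downstream_gene(450, genes)
```

Handling genes on the negative strand would require a mirrored search over gene end positions, which is omitted here.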

[Figure 4-2 diagram: tables CHROMOSOME, CHROMOSOME_FRAGMENT, PSSM, GENE, and LOCUS in the BIAS database feed the BIAS module gre: TFBS (classes TranscriptionFactorBindingSitesLocating, LocatingPgres, LocatingPnf1, LocatingPcebpbeta, LocatingPcebpdelta, TranscriptionFactorBindingSitesReturnType, GRE_ReturnType, and GREAccessory_ReturnType), which writes to tables GRE_HIT, GRE_ACCESSORY_NF_1, GRE_ACCESSORY_C_EBPbeta, and GRE_ACCESSORY_C_EBPdelta in the BIAS database.]

Figure 4-2: Overview of sub-module "gre/TFBS"

4.3. Module

4.3.1. Algorithm

The module analysis strategy may be particularly well suited for GRE discovery since many cooperative TFs (see Section 2.1) of GR are known. In TRANSCompel [111], the cooperative TFs of GR are annotated as NF1, Oct-1, Ets-1, HNF-1, HNF-3, AP-1, and C/EBPβ. Therefore, looking for the co-occurrence of both GREs and TFBSs of its cooperative TFs might improve the identification of functional sites [7]. This will help in understanding the interactions between these factors and, hopefully, the range of this effect. A better understanding of the interplay between GR and the other transcription factors could help to predict regions inside which GR effects can occur.

Three cooperative TFs (NF-1, C/EBPbeta, and C/EBPdelta), which were suggested by collaborating biologists, are used for the module analysis in this research. The co-occurrence of all TFBSs in the same promoter region (definition given in Section 3.2.1.3) is studied. We call this a GRE module. The window size is 201 bp, with 100 bp flanking each side of the first position of the 5' end of the GRE. The search allows flexible spacing between TFBSs, which means overlapping is permitted, but does not allow multiple binding sites for the same TF to contribute to the predictions. The position order of binding sites is determined using the start position of the 5' end, without a so-called strand orientation limit: the order of co-occurrence is determined only by the start position of each binding site, regardless of whether it is on the positive or the negative strand. The combinatorial probability is calculated using the multinomial distribution:

$$P_Y(y) = \frac{n!}{\prod_{i} y_i!} \prod_{i} p_i^{y_i} \qquad (4\text{-}1)$$

where $n$ is the number of independent trials, $p_i$ is the probability of outcome $i$ on any trial, $i = 1, 2, \ldots, k$, where $k$ is the number of possible outcomes on each trial, and $y_i$ is the number of times that outcome $i$ occurs in the $n$ trials. In this study, $n$ is the total number of modules found and $k$ is the total number of possible position orders for four TFs, which is $4! = 24$.
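As a small numerical illustration of the multinomial probability in equation 4-1, the sketch below evaluates it directly. The function name and the toy counts are hypothetical; the thesis's own computation is done in class CalculateCombinatorialProbability, which is not reproduced here.

```python
from math import factorial, prod


def multinomial_probability(counts, probabilities):
    """Probability of observing counts y_1..y_k given outcome probabilities p_1..p_k."""
    n = sum(counts)
    coefficient = factorial(n)
    for y in counts:
        coefficient //= factorial(y)  # exact: every partial quotient is an integer
    return coefficient * prod(p ** y for p, y in zip(probabilities, counts))


# Toy case with k = 3 outcomes and n = 3 trials: 3!/(2! 1! 0!) * 0.5^2 * 0.25 = 0.1875
example = multinomial_probability([2, 1, 0], [0.5, 0.25, 0.25])
```

For the module counts of a whole genome, n is large and the direct product underflows; in that case the same quantity would normally be evaluated in log space (for example with `math.lgamma`).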

The significance of this value is tested using MBMC as a background. Section 3.2.1.5 gives the detailed information of how to estimate parameters for MBMC and how to do the statistical test for module analysis.

4.3.2. Implementation

This section describes the implementation details of the toolkit in BIAS related to searching for modules. The sub-module of module gre in BIAS is located in the following path:

~/ca/mcgill/mcb/bias/modules/gre/TFBS

Figure 4-3 gives an overview of this code.

Three classes, LocatingPnf1, LocatingPcebpdelta, and LocatingPcebpbeta, are each responsible for calculating the co-occurrence between pGREs and the putative TFBSs of one of GR's cooperative TFs. The programs retrieve putative TFBSs separately from tables GRE_HIT, GRE_ACCESSORY_NF_1, GRE_ACCESSORY_C_EBPbeta, and GRE_ACCESSORY_C_EBPdelta. The method then uses a window of size 201 to find the qualified results. A result qualifies if, within the 201 bp window based on the first position of the 5' end of the GRE, there exists at least one TFBS of NF-1, C/EBPbeta, or C/EBPdelta. Only the putative TFBS nearest to the selected pGRE is counted in this calculation. Results are saved into table GRE_HIT in the BIAS database. These results are then used in the module analysis. To retrieve all putative modules for each chromosome, we can use the following query:

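    select * from GRE_HIT
    where COOCCUR_DISTANCE_TO_NF1 is not null        -- a co-occurring NF-1 site exists
      and COOCCUR_DISTANCE_TO_CEBPDELTA is not null  -- a co-occurring C/EBPdelta site exists
      and COOCCUR_DISTANCE_TO_CEBPBETA is not null   -- a co-occurring C/EBPbeta site exists
      and CHROMOSOME_ID = user_selected_id           -- chromosome identifier chosen by the user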

This will retrieve a list of putative modules from table GRE_HIT, including the pGRE unique identification number, the chromosome unique identification number, the corresponding gene identification number, and the co-occurring TFBS information.

Appendix A.2 gives detailed information on each column of table GRE_HIT.
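The window test described in this section can be sketched as follows. This is an illustrative stand-in for the Locating classes, not their actual code: the function names are hypothetical, site positions are taken to be the start of each site's 5' end, and a missing distance plays the role of a null COOCCUR_DISTANCE column.

```python
import bisect


def nearest_within_window(gre_start, tfbs_starts, flank=100):
    """Signed distance from the pGRE to the nearest TFBS within +/- `flank` bp.

    `tfbs_starts` must be sorted. Returns None when no site falls inside the
    201 bp window (flank = 100 on each side of the pGRE start).
    """
    index = bisect.bisect_left(tfbs_starts, gre_start)
    # Only the neighbours just below and at/above the pGRE can be the nearest site.
    candidates = tfbs_starts[max(index - 1, 0):index + 1]
    best = min(candidates, key=lambda start: abs(start - gre_start), default=None)
    if best is None or abs(best - gre_start) > flank:
        return None
    return best - gre_start


def is_putative_module(gre_start, accessory_sites, flank=100):
    """True when every accessory TF has at least one site inside the window."""
    return all(
        nearest_within_window(gre_start, starts, flank) is not None
        for starts in accessory_sites.values()
    )


sites = {"NF-1": [960], "C/EBPbeta": [1100], "C/EBPdelta": [1000]}
module_found = is_putative_module(1000, sites)
```

A pGRE for which all three distances are present corresponds to a row returned by the query above.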

4.3.2.1. Statistical Significance Test using MBMC

Section 3.2.1.5 has described the steps for generating an MBMC. The sub-module of module gre in BIAS is located in the following path:

~/ca/mcgill/mcb/bias/modules/gre/secondOrderMarkovChain

Figure 4-4 gives an overview of this sub-module.

Class ConcatenateRegionSequence begins by retrieving all genes from the BIAS database. It then fetches the promoter region of each gene (described in Section 3.2.1.3) and concatenates all promoter sequences into one sequence. Class CalculateDistributionBasedOnBiojava estimates the transition probabilities of the MBMC using Biojava [112]. Class CreateSecondOrderMarkovChain generates an MBMC sequence, given a sequence length and a transition probability matrix, using Biojava [112]. Class LocateTranscriptionFactorBindingSitesOnMarkovChain creates an object of class TranscriptionalFactorBindingSitesLocating to locate putative TFBSs for all TFs. In order to test the statistical significance of the position order of modules in the promoter regions of the Mouse genome, class OrderOfCoocurrence is applied to find the position order of all putative modules, in the direction from the smaller to the larger start position of each binding site within the module.
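The estimation and generation steps can be illustrated without Biojava. The sketch below is a plain-Python stand-in that assumes, from the sub-module name, a second-order chain (the next base depends on the previous two); the function names and the uniform fallback for unseen contexts are assumptions, not details from the thesis.

```python
import random
from collections import Counter, defaultdict


def estimate_transitions(sequence, order=2):
    """Estimate P(next base | previous `order` bases) from observed counts."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - order):
        context = sequence[i:i + order]
        next_base = sequence[i + order]
        if set(context + next_base) <= set("ACGT"):  # ignore masked or ambiguous bases
            counts[context][next_base] += 1
    return {
        context: {base: n / sum(counter.values()) for base, n in counter.items()}
        for context, counter in counts.items()
    }


def generate(transitions, length, seed_context, rng=None):
    """Generate a background sequence of `length` bases starting from `seed_context`."""
    rng = rng or random.Random()
    order = len(seed_context)
    bases_out = list(seed_context)
    while len(bases_out) < length:
        distribution = transitions.get("".join(bases_out[-order:]))
        if distribution is None:
            # Context never observed in the training sequence: fall back to uniform.
            bases_out.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*distribution.items())
        bases_out.append(rng.choices(bases, weights=weights)[0])
    return "".join(bases_out)


transitions = estimate_transitions("ACGACGACGT")
background = generate(transitions, 50, "AC", random.Random(0))
```

The generated background can then be scanned with the same PWMs as the real promoters, so that the module counts observed in the genome are compared against counts expected by chance.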

Since there are four different TFs (GR, NF-1, C/EBPbeta, and C/EBPdelta), there are 4! = 24 possible position orders of modules. We use the number "0" to represent the GRE, "1" to represent the binding site of NF-1, "2" to represent the binding site of C/EBPdelta, and "3" to represent the binding site of C/EBPbeta. For example, the sequence "2013" induces the following ordering (from 5' to 3'):

5' --- C/EBPdelta --- GRE --- NF-1 --- C/EBPbeta --- 3'

[Diagram: the four binding sites lie in this order along the sequence; the regions are labelled Junk, Promoter, and Gene.]

After we calculate all the position orders of putative modules in the promoter regions of the Mouse genome, class CalculateCombinatorialProbability is used to compute the combinatorial probability of the position orders of modules in the Mouse genome using equation 4-1.
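The encoding of position orders can be sketched in a few lines. The digit assignment follows the text above; the function names and the dictionary layout of a module are hypothetical, and sites with identical start positions are simply left in input order, which is an assumption.

```python
from collections import Counter

# Digit codes as defined in the text: 0 = GRE, 1 = NF-1, 2 = C/EBPdelta, 3 = C/EBPbeta.
TF_CODE = {"GRE": "0", "NF-1": "1", "C/EBPdelta": "2", "C/EBPbeta": "3"}


def position_order(start_positions):
    """Encode a module as a digit string, ordered by increasing 5' start position."""
    ranked = sorted(start_positions, key=lambda tf: start_positions[tf])
    return "".join(TF_CODE[tf] for tf in ranked)


def order_counts(modules):
    """Count how often each of the 24 possible position orders occurs."""
    return Counter(position_order(module) for module in modules)


module = {"GRE": 150, "NF-1": 180, "C/EBPdelta": 120, "C/EBPbeta": 230}
print(position_order(module))  # prints 2013
```

The counts returned by `order_counts` are the y_i values that enter equation 4-1.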

[Figure 4-3 diagram: tables GRE_HIT, GRE_ACCESSORY_NF_1, GRE_ACCESSORY_C_EBPbeta, and GRE_ACCESSORY_C_EBPdelta in the BIAS database feed the BIAS module gre: TFBS (classes LocatingPnf1, LocatingPcebpbeta, and LocatingPcebpdelta), which writes its results to table GRE_HIT in the BIAS database.]

Figure 4-3: Overview of sub-module "gre/TFBS" for co-occurrence

[Figure 4-4 diagram: tables CHROMOSOME, CHROMOSOME_FRAGMENT, GENE, GRE_HIT, GRE_ACCESSORY_NF_1, GRE_ACCESSORY_C_EBPbeta, and GRE_ACCESSORY_C_EBPdelta in the BIAS database feed the BIAS module gre: secondOrderMarkovChain (classes ConcatenateRegionSequence, CalculateDistributionBasedOnBiojava, CreateSecondOrderMarkovChain, LocateTranscriptionFactorBindingSitesOnMarkovChain, OrderOfCoocurrence, and CalculateCombinatorialProbability), which writes to tables GRE_MARKOVCHAIN and GRE_MARKOVCHAIN_TFBS in the BIAS database.]

Figure 4-4: Overview of sub-module "gre/secondOrderMarkovChain"

4.4. Phylogenetic Footprinting

4.4.1. Algorithm

PF proceeds via four steps:
(i) Select orthologous gene pairs between Mouse and Human and retrieve their corresponding promoter regions (described in Section 3.2.1.3).
(ii) Align the promoter sequences.
(iii) Identify conserved segments of the alignment. A conserved alignment is a qualified aligned sequence, derived from the AVID alignment under certain parameters (described in Section 5.3.1), that contains at least one pGRE.
(iv) Search for TFBSs only in conserved alignments and identify the module.

4.4.1.1. Orthologous Genes Selection

The choice of which organisms to use in orthologous analysis and the choice of a correct evolutionary distance for performing this analysis are hard problems without clear solutions. This is partially due to the duplication and/or deletion of genes during evolution. The selection of orthologous genes is very important for the success of PF methods. The alignment of promoter sequences becomes progressively more difficult as the evolutionary distance between the sequences increases.

Cliften et al. [113] and Lenhard et al. [76] have found that if the species are too closely related, for instance Human and chimpanzee, the sequence alignment is obvious but unable to yield any meaningful results, since the functional elements are not sufficiently better conserved than the surrounding nonfunctional sequence. Conversely, if the species are too distantly related, such as Human and fish, it is difficult or impossible to find an accurate alignment because they will have diverged too much to preserve any significant similarity. The comparison of DNA sequences between pairs of species that diverged ~40-80 million years ago from a common ancestor, such as Humans and Mice, reveals conservation in both coding sequences and a significant number of noncoding sequences [84]. Wasserman et al. [13] reported that the combination of PF and PWM searches applied to orthologous Human and Mouse gene sequences reduces the rate of false predictions by an order of magnitude with only a modest reduction in sensitivity. In this analysis, we compare our Mouse sequence against Human. We use all orthologous gene pairs between Mouse and Human from table GENE_HOMOLOGY of the BIAS database (described in Section 4.1). Appendix A.2 describes table GENE_HOMOLOGY.

4.4.1.2. Alignment of Promoter Sequences of Orthologous Genes

We use AVID to align orthologous gene pairs. AVID is a very fast pairwise global alignment program capable of aligning megabases of sequence in a reasonable amount of time. It can also provide details about ordered and oriented contigs, and accurate placement in the finished sequence. It is fully integrated with repeat masking (see Section 4.4.1.2.1). AVID is sensitive in finding orthologous regions. It addresses existing shortcomings of global alignment programs, but is also specific and avoids the false-positive problem of local alignment programs. The AVID alignment system takes advantage of the fact that in related genomic sequences, such as the Human and the Mouse genomes, there are parts that are largely conserved between the two species. These conserved parts align very well. However, they are separated by longer sequences that are not well conserved and consequently much harder to align. Instead of attempting to do a global alignment of the two sequences directly, these systems try to find the "islands" that align very well using a local alignment routine. This routine runs in time linear in the length of the sequence. The set of "islands" is then "chained" together to form an overall global alignment, as shown in Figure 4-5.

Figure 4-5: Global view of AVID alignment.

4.4.1.2.1. Masking Sequences

In most mammalian genomes, a large fraction of the genome consists of sequences that are repeated a large number of times. These are referred to as repetitive sequence elements. Lander et al. [90] have defined several classes of repetitive elements at the nucleotide level. If these repetitive elements are not "masked out" (the nucleotides of the sequences are replaced by the symbol N) in a DNA sequence prior to a comparative analysis, they will generate a very large number of alignments that do not reflect biologically significant similarities [84].

Before we attempt an AVID alignment between orthologous sequences, the RepeatMasker [115-116] program is used to mask the repetitive sequence elements.
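What "masking out" does to a sequence can be shown in a few lines. The sketch below does not call or imitate RepeatMasker: it assumes the repeat intervals have already been found by some other tool, and the function name and interval format (1-based, inclusive) are hypothetical.

```python
def mask_repeats(sequence, repeat_intervals):
    """Replace every base inside a repeat interval with the symbol N.

    Intervals are (start, end) pairs, 1-based and inclusive; parts of an
    interval that fall outside the sequence are ignored.
    """
    bases = list(sequence)
    for start, end in repeat_intervals:
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


masked = mask_repeats("ACGTACGTAC", [(3, 5), (9, 12)])
print(masked)  # prints ACNNNCGTNN
```

Because masking preserves sequence length, coordinates of TFBSs found later still refer to positions in the original sequence.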

RepeatMasker is the most commonly used program for screening DNA sequences for interspersed repeats and low complexity DNA sequences. The software generates a detailed annotation of all repeats present in the input sequences as well as a "masked out" version of the input sequence. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by P. Green [115-116] or, in conjunction with the script MaskerAid [117], by WU-Blast, developed by W. Gish. MaskerAid can increase the speed of RepeatMasker about 30-fold while maintaining sensitivity.

4.4.1.2.2. Principle of AVID Alignment

This explanation follows [100, 108].

The AVID program takes two genomic sequences as input. The output is a global alignment with additional information (e.g., an overall score). The AVID method has four components:

(i) Repeat Masking (Optional)
The input sequences can be processed with the RepeatMasker program [115]. When AVID performs a masked alignment, both the masked and unmasked sequences are used in the alignment process.

(ii) Finding Matches Using Suffix Trees
Matches are divided into two groups: those overlapping repeats, called repeat matches, and those not overlapping repeats, called clean matches. Here the term "match" refers to a maximal exact match that is not necessarily unique. It differs from the "alignment" mentioned before: an alignment need not be an exact match and allows mismatches as long as the score is higher than the selected threshold. Selected matches (the red matches in Figure 4-6) will be part of the final global alignment. Suffix trees are used to find all maximal repeated substrings of a single string. A maximal match between two sequences is a pair of matching subsequences (one from each sequence) whose flanking bases are mismatches. AVID first concatenates the two sequences and places the character N between them. A maximal repeat in this string that crosses the boundary between the two sequences represents a maximal match between the two sequences.

(iii) Anchor Selection

This step is a recursive process of anchoring and aligning the sequences. An anchor set is a collection of non-overlapping, non-crossing matches (the red matches in Figure 4-6).

Figure 4-6: Selecting anchors from the set of matches. Every maximal match is shown in blue. A set of good anchors is shown in red [100].

• Sorting of the matches
Throughout this step, the match set can be reduced by eliminating "noisy" matches from those being considered for anchors; for example, matches that are less than half the length of the longest match are removed from initial consideration. The matches are also ordered with clean matches appearing first (sorted by length), followed by repeat matches. Repeat matches are not considered for anchoring until there are no more clean matches.

• Selection of the anchors
A variant of the Smith-Waterman algorithm is used to select anchors. The gap score used is zero, and the mismatch score is negative infinity. The score assigned to a match is based on its length and the alignment score of the regions flanking the match (10 bp on each side). Anchors are also required to be non-overlapping. There is no guarantee that the matches in the anchor set produced by this procedure are biologically significant. For regions that are too long to align by the Needleman-Wunsch algorithm, there is no choice but to use the anchors. Otherwise, the anchors should only be used if we are confident that they are correct. When AVID aligns regions short enough to perform an optimal alignment, it uses anchors only if the total length of the anchor set is >50% of the length of the sequence. Otherwise the regions are aligned using the Needleman-Wunsch algorithm with standard parameters.

(iv) Recursion
The next step builds the final global alignment recursively. If n anchors have been set by the previous step, there will be n + 1 regions between these anchors that remain to be aligned. By filtering the current list of matches, we produce n + 1 lists of matches corresponding to the n + 1 regions. The ith list is the list of all maximal matches between the ith inter-anchor pair. This is done by checking, for each match, whether that match (or any sufficiently long part of that match) lies entirely between two sets of anchors. Once the maximal matches have been obtained, the smaller inter-anchor regions are realigned using the anchor selection step described earlier. The recursion terminates under the following two conditions:
• There are no remaining bases to be aligned.
• There are no significant matches in the remaining sequences. If the sequences are short (<= 4 kb each), they are aligned using the Needleman-Wunsch algorithm. For long sequences, we conclude that the lack of anchors indicates no significant alignment between them, and because a Needleman-Wunsch type alignment would be meaningless, we return a trivial alignment, in which both sequences are completely gapped.

AVID works very well in practice since it fills in the Needleman-Wunsch matrix to find the alignments in the area between the local alignments.
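The anchor selection step (iii) can be sketched as a chaining problem: choose a subset of matches that neither overlap nor cross and that maximizes a score. The sketch below is a simplified stand-in that scores a match by its length only; AVID's actual scoring also uses the flanking regions, which is not modelled here. The function name and the tuple layout `(start_a, start_b, length)` are assumptions.

```python
def select_anchors(matches):
    """Pick non-overlapping, non-crossing matches with maximal total length.

    Each match is (start_a, start_b, length). Simple O(n^2) dynamic
    programme over matches sorted by their start in the first sequence.
    """
    ms = sorted(matches)
    n = len(ms)
    if n == 0:
        return []
    best = [m[2] for m in ms]  # best chain score ending at match i
    prev = [-1] * n            # predecessor of match i in that chain
    for i in range(n):
        ai, bi, li = ms[i]
        for j in range(i):
            aj, bj, lj = ms[j]
            # j may precede i only if it ends before i starts in BOTH sequences.
            if aj + lj <= ai and bj + lj <= bi and best[j] + li > best[i]:
                best[i] = best[j] + li
                prev[i] = j
    # Trace back from the best-scoring chain end.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ms[i])
        i = prev[i]
    return chain[::-1]


if __name__ == "__main__":
    demo = [(0, 0, 5), (3, 20, 10), (6, 6, 4), (12, 12, 6), (30, 2, 8)]
    print(select_anchors(demo))
```

In the demo, the match `(30, 2, 8)` is rejected because it would cross the other anchors, and `(3, 20, 10)` is rejected because it overlaps them.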

4.4.2. Implementation

In BIAS, a sub-module called "footprinting" of the module gre implements the approach of this section. It is located at the following path: -/ca/mcgill/mcb/bias/modules/gre/footPrinting

Figure 4-7 provides an overview of this sub-module.

The following steps describe the implementation of PF:
• Fetch orthologous gene pairs between Mouse and Human from the table GENE_HOMOLOGY in the BIAS database.
• Retrieve the orthologous sequences, including both upstream and downstream regions relative to each gene's TSS, given the length required in each direction.
• Apply the RepeatMasker program to mask different repeats by choosing different options (see Section 5.3.1.2.3), and then merge all results into one repeat-masked sequence.
• Run the AVID alignment program to align each orthologous sequence pair.
• Parse the alignment results using two parameters: the minimum length of each alignment and the maximum number of gaps allowed for each alignment in the conserved sequence.
• Check whether any pGREs lie within the conserved alignments by retrieving pGREs from the table GRE_HIT in the BIAS database.
• Update the co-occurrence information for each conserved pGRE pair between Mouse and Human.

Class AvidAlignmentApplication drives this process. Its main function is to obtain all orthologous genes from the table GENE_HOMOLOGY, given either a chromosome's unique identification number or a gene's unique identification number. It then fetches the orthologous sequences and runs the AVID alignment after the sequences have been masked by RepeatMasker. The qualified results are saved into the table HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT in the BIAS database. Appendix A.2 gives detailed information regarding this table.

Class AvidAlignmentColumnParser is instantiated by class AvidAlignmentApplication. Its function is to parse each alignment file into five different vectors, one for each column of the original alignment output: the index in the first sequence, the corresponding nucleotide of the first sequence ("1" represents insertion/gap), the alignment result ("-" represents match), the corresponding nucleotide of the second sequence, and the index in the second sequence.

Class AvidAlignmentParser is also instantiated by class AvidAlignmentApplication. It parses an AVID alignment file and extracts the conserved alignments, given two parameters: the minimum length required and the maximum number of gaps allowed (see Section 5.3.1). Information about the conserved alignments is kept as an object of class AvidAlignmentParserReturnType.

Class AvidAlignmentAndGreMap checks whether any pGREs lie within each conserved alignment, based on the start position of each pGRE. If both sequences in a conserved alignment contain one or more pGREs, the result qualifies and is saved into the table GRE_AVIDALIGNMENT_MAP. Appendix A.2 has detailed information about this table.
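The mapping check performed by AvidAlignmentAndGreMap can be sketched as an interval test on pGRE start positions. This is a minimal illustration, not the BIAS implementation: the function name, the dictionary layout of a conserved block, and the use of inclusive coordinates are all assumptions.

```python
def gre_hits_in_block(block, mouse_gre_starts, human_gre_starts):
    """Return (qualified, mouse_hits, human_hits) for one conserved block.

    `block` holds inclusive (start, end) coordinates per species
    (hypothetical layout). The block qualifies only when both species
    have at least one pGRE start position inside it.
    """
    m_lo, m_hi = block["mouse"]
    h_lo, h_hi = block["human"]
    mouse_hits = [p for p in mouse_gre_starts if m_lo <= p <= m_hi]
    human_hits = [p for p in human_gre_starts if h_lo <= p <= h_hi]
    qualified = bool(mouse_hits) and bool(human_hits)
    return qualified, mouse_hits, human_hits


if __name__ == "__main__":
    block = {"mouse": (100, 113), "human": (11, 26)}
    print(gre_hits_in_block(block, [95, 105], [20]))
```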

Class CooccurrenceUpdate updates the co-occurrence information between GREs and the binding sites of the three other TFs: NF-1, C/EBPdelta, and C/EBPbeta. The algorithm is similar to the one used in the module analysis. However, in this section we do not limit the window size to 200 bp with 100 bp flanking on each side. Instead, starting from the start position of each pGRE inside a conserved alignment, we search for the nearest binding site of each cooperative TF within 15000 bp upstream and 1000 bp downstream of the orthologous genes.
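A nearest-site search of this kind can be sketched as follows. The sketch assumes a plus-strand gene, so that "upstream" means smaller coordinates than the TSS; the function name, argument names, and this coordinate convention are assumptions and do not come from the BIAS code.

```python
from bisect import bisect_left


def nearest_site(gre_start, site_starts, tss, upstream=15000, downstream=1000):
    """Return the site start closest to `gre_start`, or None.

    Only sites inside [tss - upstream, tss + downstream] are considered
    (plus-strand convention, assumed for illustration).
    """
    lo, hi = tss - upstream, tss + downstream
    candidates = sorted(p for p in site_starts if lo <= p <= hi)
    if not candidates:
        return None
    # The nearest site is one of the two neighbours of the insertion point.
    i = bisect_left(candidates, gre_start)
    neighbours = candidates[max(i - 1, 0):i + 1]
    return min(neighbours, key=lambda p: abs(p - gre_start))


if __name__ == "__main__":
    sites = [1000, 6000, 19000, 20500, 25000]
    print(nearest_site(18000, sites, tss=20000))
```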

[Figure 4-7 diagram omitted. Input tables from the BIAS database: CHROMOSOME, CHROMOSOME_FRAGMENT, GENE_HOMOLOGY, GENE, GRE_HIT, GRE_ACCESSORY_NF_1, GRE_ACCESSORY_C_EBPbeta, GRE_ACCESSORY_C_EBPdelta. Classes of BIAS module gre: footPrinting: AvidAlignmentColumnParser, AvidAlignmentParser, AvidAlignmentApplication, AvidAlignmentParserReturnType, AvidAlignmentInformationRetrieve, AvidAlignmentAndGreMap, CooccurrenceUpdate. Output tables in the BIAS database: HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT, GRE_AVIDALIGNMENT_MAP.]

Figure 4-7: Overview of sub-module "gre/footPrinting"

5. Results and Discussion

5.1. Genome Wide Search for TFBSs

Four different TFs were used in a genome-wide search for TFBSs in the Mouse genome: glucocorticoid receptor (GR), nuclear factor 1 (NF-1), and two CCAAT/enhancer binding proteins (C/EBPbeta and C/EBPdelta). The algorithm is described in Section 3.2.1.

Table 5-1 shows the total number of putative TFBSs for each TF in the complete Mouse genome and in each of the three regions (gene, promoter, junk, as defined in Section 3.2.1), together with the p-value of the frequency of the TFBSs in the three regions. The calculation of the p-values is described at the end of this section. The number of pGREs found in the complete Mouse genome is 1830228 when a threshold of 0.81 is used. This is the suggested threshold for the GRE PWM from [128]. Of this total, 15575 pGREs are found in promoter regions, 502465 pGREs are located within gene regions, and the remaining 1312188 are located in junk regions. The threshold for TFBSs of NF-1, C/EBPdelta, and C/EBPbeta is 0.90. This value is chosen conservatively, which means that relatively few hits are found with such a strict threshold.

Quantity | GR (Threshold: 0.81) | NF-1 (Threshold: 0.9) | C/EBPdelta (Threshold: 0.9) | C/EBPbeta (Threshold: 0.9)
Number of putative TFBSs in whole Mouse genome | 1830228 | 5618177 | 3829152 | 2544364
Number of putative TFBSs in promoter region | 15575 | [illegible] | [illegible] | [illegible]
p-value of frequency of TFBSs in promoter region | 0.0004 | 0.492 | 0.0351 | 1
Number of putative TFBSs in gene region | 502465 | [illegible] | [illegible] | [illegible]
p-value of frequency of TFBSs in gene region | 0.0008 | 0.9875 | 0.3121 | 1
Number of putative TFBSs in junk region | 1312188 | [illegible] | [illegible] | [illegible]
p-value of frequency of TFBSs in junk region | 0 | 1 | 0 | 1

Table 5-1: Number of putative TFBSs and p-value of frequency of TFBSs for four transcription factors (GR, NF-1, C/EBPdelta, C/EBPbeta) in either whole Mouse genome or different regions

Figure 5-1 to Figure 5-4 depict the distribution of frequencies of pGREs, putative TFBSs of NF-1, putative TFBSs of C/EBPbeta, and putative TFBSs of C/EBPdelta in each of the Mouse chromosomes for the three different regions. Figure 5-1 to Figure 5-4 also show the frequencies of putative TFBSs in random sequence as a comparison. The random sequences are generated from a 0th-order Markov chain, and the value shown is the average over 100 sequences of 1 Mb each.
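A 0th-order Markov chain draws every base independently from fixed nucleotide frequencies. The sketch below shows one way to generate such a background sequence; the function name and the example frequencies are assumptions for illustration and are not the values estimated in this work.

```python
import random


def random_sequence(length, freqs, seed=None):
    """Generate a 0th-order Markov chain sequence.

    Each base is drawn independently with the probabilities in `freqs`
    (a dict mapping base to probability).
    """
    rng = random.Random(seed)
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


if __name__ == "__main__":
    # Hypothetical background frequencies, for illustration only.
    background = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}
    print(random_sequence(60, background, seed=1))
```

Averaging the hit frequency over many such sequences gives the random-sequence baseline plotted in the figures.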

[Plot omitted. Title: Frequency Distribution of pGREs in Different Regions of Mouse Genome (Threshold: 0.81). Series: pGREs in promoter region, gene region, junk region, and random sequence. X-axis: Chromosome Name.]

Figure 5-1: Frequency distribution of pGREs in different regions of Mouse genome and frequency distribution of pGREs in random sequence

[Plot omitted. Title: Frequency Distribution of Putative Binding Sites of NF-1 in Different Regions of Mouse Genome (Threshold: 0.90). Series: putative binding sites of NF-1 in promoter region, gene region, junk region, and random sequence. X-axis: Chromosome Name.]

Figure 5-2: Frequency distribution of putative TFBSs of NF-1 in different region of Mouse genome and frequency distribution of putative TFBSs of NF-1 in random sequence

[Plot omitted. Title: Frequency Distribution of Putative Binding Sites of C/EBPbeta in Different Regions of Mouse Genome (Threshold: 0.90). Series: putative binding sites of C/EBPbeta in promoter region, gene region, junk region, and random sequence. X-axis: Chromosome Name.]

Figure 5-3: Frequency distribution of putative TFBSs of C/EBPbeta in different region of Mouse genome and frequency distribution of putative TFBSs of C/EBPbeta in random sequence

[Plot omitted. Title: Frequency Distribution of Putative Binding Sites of C/EBPdelta in Different Regions of Mouse Genome (Threshold: 0.90). Series: putative binding sites of C/EBPdelta in promoter region, gene region, junk region, and random sequence. X-axis: Chromosome Name.]

Figure 5-4: Frequency distribution of putative TFBSs of C/EBPdelta in different region of Mouse genome and frequency distribution of putative TFBSs of C/EBPdelta in random sequence

We need to explain the difference in the observed number of pGREs between the different regions. First, we explore the GC bias in each region. Figure 5-5 shows the frequency distribution of nucleotides in the different regions of the non-masked Mouse genome. We read the gene region in the direction of transcription and read the promoter region in the same direction as the gene. The junk region is read on both strands. Nucleotide T has the highest frequency in the gene region. Nucleotides A and T in the junk region have the second highest frequency. Nucleotide A in the gene region has the third highest frequency. In the promoter region, the frequency of nucleotides A and T is higher than the frequency of nucleotides G and C. The frequency of nucleotides G and C in the junk region is the lowest.

[Plot omitted. Title: Frequency Distribution of A, C, G, T Content in Different Regions of Mouse Genome. Series: G, C, A, and T in the promoter, gene, and junk regions. X-axis: Chromosome Name (chr1 to chr19, chrX).]

Figure 5-5: Frequency distribution of nucleotides in each chromosome and each region of the Mouse genome.

Secondly, we need to consider the nucleotide frequency within the consensus string according to the PWM of each TF. Table 5-2 gives the percentage of nucleotides for each consensus string of the four TFs.

Nucleotide | gre | nf1 | cebpbeta | cebpdelta
A | 0.214285714 | 0 | 0.222222222 | 0.25
T | 0.357142857 | 0.285714286 | 0.333333333 | 0.333333333
G | 0.285714286 | 0.285714286 | 0.111111111 | 0.166666667
C | 0.142857143 | 0.428571429 | 0.333333333 | 0.25

Table 5-2: Percentage of nucleotides A, C, G, and T in consensus string of four TFs

Figure 5-1 indicates that the frequency of pGREs is highest in gene regions, followed by junk regions and then promoter regions. This can be explained in an ad hoc manner using Figure 5-5 and Table 5-2. The GRE consensus string GGTACAANNTGTTCTG contains a large number of T nucleotides. From Figure 5-5, nucleotide T has its highest frequency in the gene region. A similar analysis applies to the putative binding sites of the other three TFs.

From Figure 5-2, we see that the promoter region has the highest frequency of NF-1 TFBSs and the junk region the lowest. Table 5-2 shows that the consensus sequence for NF-1, NNTTGGCNNNNNNCCNNN, has a much higher number of G and C nucleotides than A and T nucleotides. Figure 5-5 shows that the Mouse genome also has a higher frequency of nucleotides G and C in the promoter region than in the gene and junk regions.

The frequencies of C/EBPbeta TFBSs and C/EBPdelta TFBSs display the same trend: the junk region has the highest frequency of both and the promoter region the lowest. Table 5-2 indicates that both the C/EBPbeta consensus sequence NNNTTGCNCAACTN and the C/EBPdelta consensus sequence AATTGCGTCACT have a higher number of nucleotides A and T than nucleotides G and C. However, compared with GRE in Table 5-2, the proportion of nucleotide T is not high enough to outweigh the influence of the proportion of nucleotide A. Therefore, C/EBPbeta and C/EBPdelta TFBSs are more frequent in the junk region than in the gene region, whereas pGREs are more frequent in the gene region than in the junk region.
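The fractions in Table 5-2 are consistent with counting each nucleotide over the non-N positions of the consensus strings quoted above. The sketch below reproduces that arithmetic; the function name and the dictionary keys are illustrative, and the assumption that N positions are excluded is inferred from the numbers rather than stated in the text.

```python
from collections import Counter

# Consensus strings as quoted in the text.
CONSENSUS = {
    "gre": "GGTACAANNTGTTCTG",
    "nf1": "NNTTGGCNNNNNNCCNNN",
    "cebpbeta": "NNNTTGCNCAACTN",
    "cebpdelta": "AATTGCGTCACT",
}


def base_fractions(consensus):
    """Fraction of A, T, G, C among the non-N positions of a consensus."""
    counts = Counter(b for b in consensus.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ATGC"}


if __name__ == "__main__":
    for name, cons in CONSENSUS.items():
        print(name, base_fractions(cons))
```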

The statistical significance of the number of putative TFBSs for each TF is addressed next. We estimate a p-value for each region of the Mouse genome. Tables 5-3 and 5-4 show these results. In both tables, the first row gives the frequency of TFBSs in the different regions of the Mouse genome. We need to calculate the p-value for each of these regions. A MBMC (see Section 3.2.1.1) is used here as a background to calculate the p-value. We estimated the parameters of a single MBMC (depicted in Section 3.2.1.5) and then used this MBMC to generate 1000 sequences of 1 Mb each for each region of each TF. After this, we searched each sequence for putative TFBSs. The searching algorithm is the same as the algorithm used for searching TFBSs in the Mouse genome. Tables 5-3 and 5-4 also give the mean and the standard deviation for each group of MBMC sequences. Here "group" means the 1000 MBMC sequences for one region of one TF. Therefore, in total, we have 12 different groups for 3 different regions and 4 different TFs.

Statistic | pGREs: Promoter | pGREs: Gene | pGREs: Junk | NF-1: Promoter | NF-1: Gene | NF-1: Junk
Frequency of TFBSs in Different Regions of Mouse Genome | 0.000624299 | 0.000733739 | 0.00070264 | 0.002736291 | 0.002481415 | 0.002061907
Mean | 0.000566432 | 0.000650833 | 0.000517302 | 0.002735612 | 0.002596752 | 0.002502082
Standard Deviation | 1.73444E- [remainder of table not recoverable]

Table 5-3: Statistical Results for pGREs and binding sites of NF-1 in different regions of Mouse genome

[Table 5-4 body not recoverable; surviving fragment: 0.3121, 0]

Table 5-4: Statistical Results for binding sites of C/EBPbeta and C/EBPdelta in different regions of Mouse genome

In Tables 5-3 and 5-4, the four rows labeled Skewness, Kurtosis, 2*Ses, and 2*Sek are used to compare the frequency distributions with the normal distribution (see Section 3.2.1.5).

Figures 5-6 to 5-9 depict the results as histograms. We split the range of the data (here, the frequency of TFBSs in each region for each TF) into equal-sized bins. Then, for each bin, the number of points from the data set that fall into that bin is counted. The counts, or frequencies, are plotted on the vertical axis and the bins on the horizontal axis. The red vertical line in each histogram marks the point at which the p-value is calculated for the Mouse genome. The amplitude of the line does not indicate the number of sequences found. A right-sided test is used to obtain the p-value (depicted in Section 3.2.1.5).

[Histograms omitted. Three panels, each titled: Frequency Distribution of TFBSs for 2nd Order Markov Chains (Total Repeats: 1000).]

Figure 5-6: Frequency distribution of pGREs in promoter(left), gene(middle), and junk(right) regions of the Mouse genome

[Histograms omitted. Three panels, each titled: Frequency Distribution of TFBSs for 2nd Order Markov Chains (Total Repeats: 1000).]

Figure 5-7: Frequency distribution of putative TFBSs for NF-1 in promoter(left), gene(middle), and junk(right) regions of Mouse genome

[Histograms omitted. Three panels, each titled: Frequency Distribution of TFBSs for 2nd Order Markov Chains (Total Repeats: 1000).]

Figure 5-8: Frequency distribution of putative TFBSs for C/EBPbeta in promoter(left), gene(middle), and junk(right) regions of Mouse genome

[Histograms omitted. Three panels, each titled: Frequency Distribution of TFBSs for 2nd Order Markov Chains (Total Repeats: 1000).]

Figure 5-9: Frequency distribution of putative TFBSs for C/EBPdelta in promoter(left), gene(middle), and junk(right) regions of Mouse genome

From Tables 5-3 and 5-4, we found that the Skewness value in each column is always smaller than the corresponding 2*Ses value in the same column. Also, the Kurtosis value in each column is always smaller than the corresponding 2*Sek value in the same column. This means that all the frequency distributions are close to normal distributions. Therefore, a normal distribution is used to derive the p-value for each region of each TF. The p-value results are also included in Tables 5-3 and 5-4. Based on Section 3.2.1.5, the p-value results show that the frequencies of pGREs in all regions of the Mouse genome are significant. This implies that we reject H0 and accept H1 for α = 0.05. Furthermore, the frequency of putative TFBSs for C/EBPdelta in the junk region of the Mouse genome is also significant for α = 0.05.
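Under the normal approximation, a right-sided p-value is the upper-tail probability of the observed frequency given the mean and standard deviation of the background group. The sketch below shows that calculation with the standard library; the function name is illustrative, and the numbers in the usage lines are hypothetical rather than taken from Tables 5-3 and 5-4.

```python
import math


def right_sided_p_value(observed, mean, sd):
    """Upper-tail probability P(X >= observed) for X ~ Normal(mean, sd)."""
    z = (observed - mean) / sd
    # The survival function of the standard normal, written via erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    p = right_sided_p_value(observed=0.00062, mean=0.00057, sd=0.00002)
    print(p, "significant at alpha = 0.05" if p < 0.05 else "not significant")
```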

The relationship between Tables 5-3 and 5-4 and Figures 5-1 to 5-4 can also be observed. For example, in Figure 5-1, the frequency of pGREs in random sequence is much lower than that in the three different regions of the Mouse chromosomes. In fact, all frequencies of pGREs in the three different regions of the Mouse chromosomes are significant compared to the random sequence (Table 5-3).

We also performed a genome-wide search for these TFBSs in the Human genome. The results of this section are used in the PF studies of Section 4.4. Figure 5-10 depicts the frequency distribution of TFBSs with different thresholds in both the Mouse and Human genomes. It shows that the frequency of TFBSs decreases as the threshold increases for all TFs in both the Mouse and Human genomes. However, for the frequencies of TFBSs for NF-1 and C/EBPbeta, there is a roughly stable area for thresholds between 0.92 and 0.94 in both the Mouse and Human genomes. This trend has been reconfirmed with both MBMC sequences and random sequences using multiple simulations. This behavior may be related to the structure of the PWMs for these two TFs.

[Plot omitted. Title: Putative Transcription Factor Binding Site Hits Rate On Mouse and Human Genome. Series: pGREs, NF-1, C/EBPbeta, and C/EBPdelta, each for Mouse and Human. X-axis: Threshold.]

Figure 5-10: Mouse and Human genome wide hits rate distribution at different threshold for total number of putative binding sites of four different transcription factors (GR, NF-1, C/EBPdelta, C/EBPbeta)

Figure 5-10 also shows that the frequencies of putative TFBSs for all three of GR's cooperative TFs (NF-1, C/EBPbeta, and C/EBPdelta) are always higher in the Human genome than in the Mouse genome. However, pGREs are more frequent in Mouse than in Human. Neither the Mouse nor the Human genome is masked. Also, the frequency of TFBSs for the cooperative TFs is on average higher than the frequency of TFBSs for GR. One reason is that GR has a longer binding site than two of the other factors, C/EBPdelta and C/EBPbeta. Although the GRE is shorter than the NF-1 binding site, the GRE has a more specific DNA binding site profile. This can be seen from the consensus index vector (Appendix B). As described in Section 4.2.1, a value of 100 is reached by a position with total conservation of one nucleotide, whereas the value of 0 only occurs at a position with equal distribution of all four nucleotides and gaps. The GRE has the highest average consensus index of the four TFs. The analysis should be repeated later on the masked Mouse and Human genomes.

5.2. Modules

From the previous section, a total of 15575 pGREs are found in the promoter region (1000 bp upstream of the TSS) of the Mouse genome among 11151 distinct genes. In this section we search for the co-occurrence of pGREs and putative binding sites for NF-1, C/EBPbeta, and C/EBPdelta. If we remove from consideration all pGREs that are not part of a module, the number drops sharply from 15575 to 327. This means that only 327 pGREs are part of a module within a 201 bp window based on the first position at the 5' end of each pGRE. The 327 pGREs are located directly upstream of 306 distinct genes in the Mouse genome.
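The module filter can be sketched as a window test around each pGRE. The sketch assumes that a module requires at least one putative site of each of the three cooperative TFs within 100 bp on either side of the pGRE start (a 201 bp window); that reading, the function name, and the data layout are assumptions, since the module definition itself is given in Section 4.3.

```python
def find_modules(gre_starts, accessory_sites, flank=100):
    """Return (gre_start, {tf: nearest_site}) for every pGRE in a module.

    `accessory_sites` maps a TF name to a list of site start positions.
    A pGRE is kept only if every TF has a site within `flank` bp of it.
    """
    modules = []
    for g in gre_starts:
        hits = {}
        for tf, starts in accessory_sites.items():
            in_window = [p for p in starts if abs(p - g) <= flank]
            if not in_window:
                break  # this TF is missing, so no module at this pGRE
            hits[tf] = min(in_window, key=lambda p: abs(p - g))
        else:
            # The loop finished without a break: all TFs were found.
            modules.append((g, hits))
    return modules


if __name__ == "__main__":
    sites = {"NF-1": [450, 1990], "C/EBPbeta": [560], "C/EBPdelta": [420, 3000]}
    print(find_modules([500, 2000], sites))
```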

Figure 5-11 shows the position orders of all 327 putative modules. Different colors stand for different Mouse chromosomes, as indicated in the graph. As described in Section 4.3.2.1, there are 24 different possible types of position order. Figure 5-11 gives each position order of the module and the count for each type. It also shows the combinatorial probability of the position orders calculated using equation 4-1 in Section 4.3.1.
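The count of 24 follows from the 4! = 24 ways of ordering four binding sites along the sequence. The sketch below enumerates these orders and derives the order string of one module from its site coordinates. The digit labels "0" to "3" follow the style of the labels in Figure 5-11, but which digit denotes which TF is not stated here, so the labels in the example are hypothetical.

```python
from itertools import permutations


def position_order(positions):
    """Return the site labels sorted by coordinate, joined into one string.

    `positions` maps a one-character label to a start coordinate.
    """
    return "".join(sorted(positions, key=positions.get))


# All possible orders of four distinct sites: 4! = 24.
ALL_ORDERS = ["".join(p) for p in permutations("0123")]

if __name__ == "__main__":
    # Hypothetical labels and coordinates, for illustration only.
    print(len(ALL_ORDERS))
    print(position_order({"0": 500, "1": 450, "2": 560, "3": 420}))
```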

[Bar chart omitted. X-axis: type of order (the 24 position orders, e.g. 0123, 3201). The counts per order are not recoverable from the source.]

Figure 5-11: Order distribution of module for four different TFs (GR, NF-1, C/EBPdelta, C/EBPbeta). Different colour stands for different chromosomes in Mouse genome

The combinatorial probability of all putative modules for TFBSs of GR, NF-1, C/EBPbeta, and C/EBPdelta in the promoter region of the Mouse genome is calculated using equation 4-1 in Section 4.3.1. The value of the combinatorial probability is 8.20E-49. A statistical test was performed to check the significance of this value (depicted in Section 3.2.1.5). Its p-value is calculated using the MBMC as a background. Figure 5-12 shows a histogram of the combinatorial probabilities of 1000 MBMC sequences. We split the range of the combinatorial probability into equal-sized bins. Then, for each bin, we count the number of sequences whose combinatorial probability falls into that bin. The counts are plotted on the vertical axis and the bins on the horizontal axis. As in the previous section, the red vertical line marks the point at which the p-value is calculated for the combinatorial probability of the promoter region in the Mouse genome.

[Histogram omitted. Title: Combinatorial Probability Distribution of Module Analysis for pGREs with 2nd Order Markov Chains (Total Repeats: 1000). X-axis: Combinatorial Probability Period.]

Figure 5-12: Counts distribution of combinatorial probability of TFBSs for four different TFs (GR, NF-1, C/EBPdelta, C/EBPbeta) of MBMC.

Since the combinatorial probability is very small, it is more convenient to use its logarithm to calculate the p-value. On the vertical axis, we use frequency instead of counts. Figure 5-13 shows a transformed histogram of this combinatorial probability distribution. The graph also shows the normal distribution curve with the same mean and standard deviation.

[Histogram omitted. Title: Combinatorial Probability Distribution of Module Analysis for pGREs with 2nd Order Markov Chains (Total Repeats: 1000). X-axis: Combinatorial Probability Period (Logarithm Transformed).]

Figure 5-13: Frequency distribution of combinatorial probability of TFBSs for four different TFs in promoter region of Mouse genome (GR, NF-1, C/EBPdelta, C/EBPbeta)

Table 5-5 depicts the results of several statistical tests for this distribution. We find that the absolute value of the Kurtosis is smaller than 2 standard errors of Kurtosis; this implies that the distribution is similar to the normal distribution in this respect. However, the absolute value of the Skewness, |-0.172096574|, is slightly higher than 2 standard errors of Skewness, which is 0.154919334. This means that the distribution is skewed to a significant degree, with a tail to the left. Although it is slightly skewed to the left, we can still consider it similar to the normal distribution since the difference is marginal. Therefore, the p-value for the combinatorial probability in the promoter region of the Mouse genome can be calculated based on the normal distribution using a two-sided test (depicted in Section 3.2.1.5). The value is 0.017. With 0.025 as the cut-off, this value is significant using the MBMC. It means that this module is unlikely to occur by chance alone.

Statistic Definition | Value
Mean | -198.11
SD | 18.23
Skewness | -0.172096574
Kurtosis | -0.186655208
2*SES | 0.154919334
2*SEK | 0.309838668
p-value | 0.017

Table 5-5: Statistic test for combinatorial probability distribution in MBMC
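The 2*SES and 2*SEK values in Table 5-5 are reproduced by the common approximations SES = sqrt(6/n) and SEK = sqrt(24/n) with n = 1000 sequences. The sketch below applies that rule of thumb; the function name is illustrative, and the use of these approximations is inferred from the numbers rather than quoted from Section 3.2.1.5.

```python
import math


def normality_check(skewness, kurtosis, n):
    """Compare skewness and kurtosis against twice their standard errors.

    Uses the approximations SES = sqrt(6/n) and SEK = sqrt(24/n).
    """
    two_ses = 2 * math.sqrt(6 / n)
    two_sek = 2 * math.sqrt(24 / n)
    return {
        "2*SES": two_ses,
        "2*SEK": two_sek,
        "skew_ok": abs(skewness) < two_ses,
        "kurt_ok": abs(kurtosis) < two_sek,
    }


if __name__ == "__main__":
    # Skewness and kurtosis from Table 5-5, with n = 1000 MBMC sequences.
    print(normality_check(-0.172096574, -0.186655208, 1000))
```

With the Table 5-5 values, the kurtosis passes the check while the skewness narrowly fails it, which matches the discussion above.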

We decided to use a Gene Ontology (GO) [131] search with GOstat [119] to investigate the genes that are downstream of the 327 modules. The GO project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner [131].

The GOstat software can find statistically overrepresented biological processes within a group of genes. The Mouse GO gene-association database is selected [119]. Table 5-6 shows the results with their p-values.

GO | Genes | Pvalue | GO as name
GO:0042391 | KCNK2; PRES | 0.000295 | regulation of membrane potential
GO:0006334 | D4ERTD89E; HIST1H1B; DXBWG1396E | 0.00261 | nucleosome assembly
GO:0005615 | SYT13; TEX261; CRP; LCN2; LYNX1; WNT1; CLGN; SEMA3A; SARS2; CHRNE; GPR63; CCL12; PRLPC2; CLDN16; CLDN4; D10UCLA2; TGFBR3; CEACAM12; BC026374; LEP; MGI:102514; ORM2; BC018222; SEMA4D; NKG7; DEFCR-RS1; DDX28 | 0.00358 | extracellular space
GO:0016564 | CLP1; BCLAF1; HES7 | 0.00591 | transcriptional repressor activity
GO:0042640 | DSG4 | 0.00707 | anagen
GO:0019627 | SEMA4D | 0.00707 | urea metabolism
GO:0030322 | KCNK2 | 0.00707 | stabilization of membrane potential
GO:0004310 | FDFT1 | 0.00707 | farnesyl-diphosphate farnesyltransferase activity
GO:0016455 | 3110004H13RIK | 0.00707 | RNA polymerase II transcription mediator activity
GO:0008119 | TPMT | 0.00707 | thiopurine S-methyltransferase activity
GO:0008051 | FDFT1 | 0.00707 | farnesyl-diphosphate farnesyl transferase complex
GO:0006163 | ATP6V1F; AMPD2; ATP5G2 | 0.00839 | purine nucleotide metabolism
GO:0006333 | D4ERTD89E; HIST1H1B; DXBWG1396E | 0.0092 | chromatin assembly or disassembly
GO:0046933 | ATP6V1F; ATP5G2 | 0.00954 | hydrogen-transporting ATP synthase activity, rotational mechanism
GO:0046961 | ATP6V1F; ATP5G2 | 0.00954 | hydrogen-transporting ATPase activity, rotational mechanism
GO:0006953 | CRP; ORM2 | 0.0124 | acute-phase response
GO:0015985 | ATP6V1F; ATP5G2 | 0.0124 | energy coupled proton transport, down electrochemical gradient
GO:0015986 | ATP6V1F; ATP5G2 | 0.0124 | ATP synthesis coupled proton transport
GO:0000726 | G22P1 | 0.0141 | non-recombinational repair
GO:0030327 | ZMPSTE24 | 0.0141 | prenylated protein catabolism
GO:0004311 | FDFT1 | 0.0141 | farnesyltranstransferase activity
GO:0006303 | G22P1 | 0.0141 | double-strand break repair via nonhomologous end-joining
GO:0004828 | SARS2 | 0.0141 | serine-tRNA ligase activity
GO:0006434 | SARS2 | 0.0141 | seryl-tRNA aminoacylation
GO:0004756 | SEPHS1 | 0.0141 | selenide, water dikinase activity
GO:0016781 | SEPHS1 | 0.0141 | phosphotransferase activity, paired acceptors
GO:0016469 | ATP6V1F; ATP5G2 | 0.0155 | proton-transporting two-sector ATPase complex
GO:0008080 | GCN5L2; NAT5 | 0.0155 | N-acetyltransferase activity
GO:0006754 | ATP6V1F; ATP5G2 | 0.0166 | ATP biosynthesis
GO:0006753 | ATP6V1F; ATP5G2 | 0.0166 | nucleoside phosphate metabolism

Table 5-6: Gene ontology results for all co-occurrence genes
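Overrepresentation of a GO term in a gene group is commonly assessed with a hypergeometric (one-sided Fisher) tail probability. The sketch below shows that calculation as a general illustration of how p-values of this kind can arise; it does not claim to reproduce GOstat's exact procedure or any multiple-testing correction, and the function and parameter names are assumptions.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Hypergeometric upper tail P(X >= k).

    N: genes in the background, K: background genes with the GO term,
    n: genes in the selected group, k: selected genes with the GO term.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(enrichment_p_value(k=3, n=306, K=40, N=20000))
```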

5.3. Phylogenetic Footprinting

5.3.1. Parameters Prediction Using Supervised Data Mining Techniques

Using six known conserved GREs between Mouse and Human as positive controls, we investigate optimal settings for several parameters related to the PF study including the length of upstream and downstream required, maximum number of gaps allowed within each conserved alignment, minimum length required for each conserved alignment. These six conserved GREs belong to four known genes, see Table 5-7. For gene FKBP51, there are three conserved GREs. Two of these three are downstream of the gene and far from the gene start position. They are in fact located within the first intron (the start of this gene is badly mapped: exon 1 is missing). The one upstream GRE should be the real promoter if exon 1 is valid.

gene name | MT2 | ctgf | CYB561 | FKBP51 | FKBP51 | FKBP51
chromosome in mouse | 8 | 10 | 11 | 17 | 17 | 17
strand for gene | + | + | - | - | - | -
GRE sequence | aggac:agcctgtcct | agaac:attga.taac | agucc:ttttgttcc | agaactgactgttcc | agaacagcttgttct | ggtatauctgtgct
GRE score | 0.913383645 | 0.433318478 | 0.74351128 | 0.7331936 | 0.909893622 | 0.928179907
strand for GRE | + | + | - | - | - | -
Upstream or downstream for GRE | upstream | upstream | upstream | upstream | downstream | downstream
GRE distance to gene | -1244 | -1837 | 748 | 2507 | -43810 | -43669

Table 5-7: Information of 6 conserved GREs

5.3.1.1. Parsing AVID Alignments

In this section, an AVID alignment is parsed based on two parameters: the maximum number of gaps allowed within a conserved alignment and the minimum length of each alignment. The parsing criterion is very important for any further analysis. Here, we give some examples to explain how we parse the AVID alignment results. Table 5-8 shows part of an AVID alignment result. There are five columns. The first column is the index in the Mouse sequence. The second column shows the aligned Mouse sequence. Character "1" in the column means there is an insertion at this position. The third column shows the mapping information between the Mouse sequence and the Human sequence. The fourth column gives the aligned Human sequence. The fifth column is the index in the Human sequence.

Mouse index  Mouse  Match  Human  Human index
90           a
91           g
92           t
93           c
94           t
95           c
96           c
97           a
98           a
99           a
100          c      -      c      11
101          a             t      12
102          g      -      g      13
103          t      -      t      14
104          a             c      15
105          a      -      a      16
                           t      17
                           g      18
106          a      -      a      19
107          t      -      t      20
108          g      -      g      21
109          c             a      22
110          t      -      t      23
111          g      -      g      24
112          t      -      t      25
113          t      -      t      26
114          c             1
115          a             1
116          t             1
117          a             1
118          g             1
119          g             1
120          c             1

Note:
First column: the index of the Mouse sequence
Second column: the corresponding Mouse sequence base pair
Third column: match if there is "-" or mismatch if nothing
Fourth column: the corresponding Human sequence base pair
Fifth column: the index of the Human sequence
"1" means gap or insertion in that sequence.

Table 5-8: Part of AVID alignment result

If different parameters are used, different results will be parsed. For example, if we set the maximum number of gaps allowed within an alignment to zero and the minimum length of an alignment to five, we obtain the following two parsed alignments:

100          c      -      c      11
101          a             t      12
102          g      -      g      13
103          t      -      t      14
104          a             c      15
105          a      -      a      16

and

106          a      -      a      19
107          t      -      t      20
108          g      -      g      21
109          c             a      22
110          t      -      t      23
111          g      -      g      24
112          t      -      t      25
113          t      -      t      26

However, if we set the maximum number of gaps allowed within an alignment to two and the minimum length of an alignment to five, we obtain a single parsed alignment:

100          c      -      c      11
101          a             t      12
102          g      -      g      13
103          t      -      t      14
104          a             c      15
105          a      -      a      16
             1             t      17
             1             g      18
106          a      -      a      19
107          t      -      t      20
108          g      -      g      21
109          c             a      22
110          t      -      t      23
111          g      -      g      24
112          t      -      t      25
113          t      -      t      26

If we set the maximum number of gaps allowed within an alignment to zero and the minimum length of an alignment to seven, only the second alignment is long enough to be kept:

106          a      -      a      19
107          t      -      t      20
108          g      -      g      21
109          c             a      22
110          t      -      t      23
111          g      -      g      24
112          t      -      t      25
113          t      -      t      26
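The three parsing examples above can be reproduced with a short sketch. It rests on several assumptions that are not stated in the text: the AVID output has already been converted into two equal-length strings with "-" as the gap symbol (a hypothetical input format, not AVID's native one), the gap limit applies to the total number of gap columns inside one parsed alignment, and the minimum length counts alignment columns. The names `parse_conserved_blocks` and `count_mismatches` are illustrative only.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap symbol in the converted alignment strings


def parse_conserved_blocks(mouse: str, human: str,
                           max_gaps: int, min_length: int) -> List[Tuple[str, str]]:
    """Split a pairwise alignment into blocks holding at most max_gaps gap columns."""
    if len(mouse) != len(human):
        raise ValueError("aligned sequences must have equal length")
    blocks: List[Tuple[str, str]] = []
    start = None   # first column of the block being built
    end = 0        # one past the last non-gap column of that block
    gaps = 0       # gap columns seen since the block started

    def close_block() -> None:
        nonlocal start, gaps
        # Slicing up to `end` drops any trailing gap columns.
        if start is not None and end - start >= min_length:
            blocks.append((mouse[start:end], human[start:end]))
        start, gaps = None, 0

    for i, (m, h) in enumerate(zip(mouse, human)):
        if m == GAP or h == GAP:
            if start is None:
                continue          # a block never starts on a gap column
            gaps += 1
            if gaps > max_gaps:
                close_block()
        else:
            if start is None:
                start = i
            end = i + 1
    close_block()
    return blocks


def count_mismatches(block: Tuple[str, str]) -> int:
    """Count aligned columns whose bases differ; gap columns are ignored."""
    m_seq, h_seq = block
    return sum(1 for m, h in zip(m_seq, h_seq) if GAP not in (m, h) and m != h)


# Mouse positions 90-120 and the aligned Human bases from Table 5-8.
MOUSE = "agtctccaaa" + "cagtaa" + "--" + "atgctgtt" + "cataggc"
HUMAN = "----------" + "ctgtca" + "tg" + "atgatgtt" + "-------"

if __name__ == "__main__":
    for max_gaps, min_length in [(0, 5), (2, 5), (0, 7)]:
        print(max_gaps, min_length, parse_conserved_blocks(MOUSE, HUMAN, max_gaps, min_length))
```

Under these assumptions the three parameter settings yield two blocks, one block, and one block respectively, matching the examples. A different reading of the gap limit (for instance, per run of consecutive gaps) would need a different stopping rule.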

5.3.1.2. Parameter Prediction of AVID Alignment

We return to the four known GR-regulated genes mentioned above. Taking the promoter sequences of each pair of orthologous genes, we use AVID to confirm whether the conserved GREs in the Mouse genome can be found with the proper parameters. We use the term promoter sequences to distinguish them from the promoter region (Section 3.2.1.3). By the proper parameters, we mean the proper promoter sequences (that is, the proper upstream and downstream regions of the gene relative to the TSS), the proper maximum number of gaps allowed within each alignment, and the proper minimum length of alignment. Using the information on the six known GREs, we can obtain these parameters for the PF analysis of the whole Mouse genome.

This process proceeds as follows:
• Fetch promoter sequences for each orthologous gene pair
• Do phylogenetic footprinting on each orthologous gene pair
• Get the conserved alignments after fixing two key parameters: the maximum number of gaps allowed within each alignment and the minimum length of alignment
• Check whether the known conserved GREs are included in the alignment results
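Before any alignment is run, a known GRE can only be recovered if it lies inside the fetched promoter sequence. The sketch below illustrates that first check. It assumes the sign convention that Table 5-7 appears to follow (the distance is a signed genomic offset, so a negative value is upstream for a gene on the "+" strand and downstream for a gene on the "-" strand) and takes the strand entries of the four FKBP51 and Cyb561 sites to be "-". The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class KnownSite:
    gene: str
    strand: str   # strand of the gene, "+" or "-"
    offset: int   # signed genomic offset of the GRE from the gene start


def relative_position(site: KnownSite) -> Tuple[str, int]:
    """Return ("upstream" or "downstream", distance) in the gene's own orientation."""
    # Assumed convention: flip the sign for genes on the "-" strand.
    oriented = site.offset if site.strand == "+" else -site.offset
    side = "upstream" if oriented < 0 else "downstream"
    return side, abs(oriented)


def in_promoter_window(site: KnownSite, upstream_bp: int, downstream_bp: int) -> bool:
    """True if the site falls inside the window fetched around the gene start."""
    side, distance = relative_position(site)
    limit = upstream_bp if side == "upstream" else downstream_bp
    return distance <= limit


# The six conserved GREs; offsets as read from the extracted tables.
KNOWN_SITES: List[KnownSite] = [
    KnownSite("MT2", "+", -1244),
    KnownSite("ctgf", "+", -1837),
    KnownSite("Cyb561", "-", 748),
    KnownSite("FKBP51", "-", 2507),
    KnownSite("FKBP51", "-", -43610),
    KnownSite("FKBP51", "-", -43669),
]

if __name__ == "__main__":
    for up, down in [(50_000, 1_000), (50_000, 50_000)]:
        covered = [in_promoter_window(s, up, down) for s in KNOWN_SITES]
        print(up, down, covered)
```

A site that passes this check may still be missed by the alignment itself, so the check is necessary but not sufficient for the last step of the list above.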

5.3.1.2.1. Effect of Different Length of Promoter Sequences

Note that for gene MT2 in Mouse, there are 7 different homologous genes in Human. Therefore, we find several alignments for this gene. However, only one of the alignments matches very well (i.e. it shows sufficient conservation). The rest of them have many mismatches. This can be verified by examining Tables 5-9 to 5-15.

Four case studies have been performed. Table 5-9 shows the results using 50k bp upstream and 1k bp downstream. We can see that four of the conserved GREs are found across all genes. All the alignments have a low number of mismatches. However, two of the conserved GREs for gene FKBP51 are not found, since our choice of promoter sequence excluded these sites. Table 5-9 also shows the results for different numbers of gaps allowed within an alignment. This parameter does not affect the results for the known GREs. Although we find more alignments when we allow a high number of gaps within an alignment, they are not good alignments since each of them has many mismatches.

AVID alignment, 50k bp upstream and 1k bp downstream:

Gene name                                   MT2               ctgf      Cyb561    FKBP51     FKBP51         FKBP51
Chromosome in mouse                         8                 10        11        17         17             17
Upstream or downstream for GRE              upstream          upstream  upstream  upstream   downstream     downstream
GRE distance to gene                        -1244             -1837     748       2507       -43610         -43669
Number of mismatches with 0 gaps allowed    0                 2         2         2          not available  not available
Number of mismatches with 2 gaps allowed    0                 2         2         2          not available  not available
Number of mismatches with 4 gaps allowed    0, 7              2         2         2          not available  not available
Number of mismatches with 6 gaps allowed    0, 7, 8           2         2         2          not available  not available
Number of mismatches with 8 gaps allowed    0, 7, 8           2         2         2          not available  not available
Number of mismatches with 10 gaps allowed   0, 7, 8           2         2         2          not available  not available
Number of mismatches with 20 gaps allowed   0, 7, 8           2         2         2          not available  not available

Table 5-9: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available

Table 5-10 shows the results using 10k bp upstream and 1k bp downstream. We can see the conserved GRE for gene FKBP51 is not found. This means that we should examine longer upstream regions.

AVID alignment, 10k bp upstream and 1k bp downstream:

Gene name                                   MT2               ctgf      Cyb561    FKBP51     FKBP51         FKBP51
Chromosome in mouse                         8                 10        11        17         17             17
Upstream or downstream for GRE              upstream          upstream  upstream  upstream   downstream     downstream
GRE distance to gene                        -1244             -1837     748       2507       -43610         -43669
Number of mismatches with 0 gaps allowed    0                 2         2         not found  not available  not available
Number of mismatches with 4 gaps allowed    0, 5, 8, 9        2         2         not found  not available  not available
Number of mismatches with 8 gaps allowed    0, 5, 8, 10       2         2         not found  not available  not available

Table 5-10: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available

Table 5-11 shows the results using 50k bp upstream and 50k bp downstream. We can see that one of the conserved GREs downstream of gene FKBP51 is found if we extend the downstream region to 50k bp. The other conserved downstream GRE is also found if the maximum number of gaps allowed within an alignment is set to a high value. However, the conserved GRE upstream of gene FKBP51 is missed. As we mentioned before, this conserved GRE is in the real promoter sequence and we do not want to lose this information (see the explanation in Section 5.3.1). Also, the alignment of the conserved GRE of gene MT2 has more mismatches than in Table 5-9 and Table 5-10. This may be caused by the longer downstream sequences. Therefore, we should not select long sequences for both the upstream and downstream regions, since they interfere with each other. In this case, we may want to keep long upstream sequences rather than long downstream sequences, since most TFBSs are located upstream of the gene. However, we then risk losing some information downstream.

AVID alignment, 50k bp upstream and 50k bp downstream:

Gene name                                   MT2               ctgf      Cyb561    FKBP51     FKBP51         FKBP51
Chromosome in mouse                         8                 10        11        17         17             17
Upstream or downstream for GRE              upstream          upstream  upstream  upstream   downstream     downstream
GRE distance to gene                        -1244             -1837     748       2507       -43610         -43669
Number of mismatches with 0 gaps allowed    5                 2         2         not found  not found      3
Number of mismatches with 4 gaps allowed    5                 2         2         not found  3              3
Number of mismatches with 8 gaps allowed    5                 2         2         not found  3              3
Number of mismatches with 12 gaps allowed   5                 2         2         not found  3              3
Number of mismatches with 20 gaps allowed   5                 2         2         not found  3              3

Table 5-11: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available

Table 5-12 shows the results using 3k bp upstream and 50k bp downstream. We can see that the conserved GRE upstream of gene FKBP51 is not found. The results are similar to those of Table 5-11.

AVID alignment, 3k bp upstream and 50k bp downstream:

Gene name                                   MT2               ctgf      Cyb561    FKBP51     FKBP51         FKBP51
Chromosome in mouse                         8                 10        11        17         17             17
Upstream or downstream for GRE              upstream          upstream  upstream  upstream   downstream     downstream
GRE distance to gene                        -1244             -1837     748       2507       -43610         -43669
Number of mismatches with 0 gaps allowed    0, 8, 8           2         2         not found  2              not found
Number of mismatches with 4 gaps allowed    0, 8, 8, 8, 8     2         2         not found  2              not found
Number of mismatches with 8 gaps allowed    0, 7, 8, 8        2         2         not found  2              not found
Number of mismatches with 12 gaps allowed   0, 7, 8, 8, 8, 8  2         2         not found  2              not found
Number of mismatches with 20 gaps allowed   0, 7, 8, 8, 8     2         2         not found  2              not found

Table 5-12: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available

5.3.1.2.2. Effect of Maximum Number of Gaps

From these case studies, we find that increasing the maximum number of gaps allowed within an alignment yields more parsed alignments for gene MT2, since this gene has multiple homologous genes in the Human genome. However, most of them are not reliable due to the large number of mismatches in the alignments. One exception is gene FKBP51 in Table 5-11. Here one more conserved GRE in the downstream region is found when the value of this parameter is increased.

5.3.1.2.3. Effect of RepeatMasker

Section 4.4.1.2.1 has described why a RepeatMasker is needed to mask the sequence. Here we only explain how to choose the parameters for RepeatMasker.

There are several options available in RepeatMasker to control the kind of repeats that are masked [115-116]. For example, with the "noint" option, RepeatMasker masks only low complexity/simple repeats and does not mask interspersed repeats. There are also parameters that trade running speed against sensitivity: the "slow" setting masks more repetitive DNA sequences but takes longer, whereas the "quick" setting masks fewer repetitive DNA sequences but takes less time. RepeatMasker also allows you to choose whether Wu-blast is used (see Section 4.4.1.2.1). The default setting of the web server is:
• Default speed, and
• Mask interspersed and simple repeats, and
• Without Wu-blast
Default speed means a speed between the slow and quick settings.

After trying different combinations of options, we chose to run RepeatMasker on the Mouse genome sequence with the following options:
• Select slow setting (-s)
• Select noint option (-noint)
• Select Wu-blast option (-w)

Moreover, the Dust [121] program is used to mask simple repeats in nucleotide sequences. Dust is a program that is used in different BLAST alignments to filter low-complexity and short-periodicity internal repeats. We merge all results from both RepeatMasker and Dust into combined masked sequences for both the Mouse and the Human member of each orthologous pair. These are used as input to AVID.

Table 5-13 shows the results using 10k bp upstream and 1k bp downstream with masked sequences. The RepeatMasker setting is the default setting from the web server. All of the conserved GREs that lie within the selected promoter sequences are found. Compared with Table 5-10, one more conserved GRE of gene FKBP51 is found due to the masking. However, this conserved GRE of gene FKBP51 has many mismatches within the alignment.

AVID alignment, 10k bp upstream and 1k bp downstream, with default masking:

Gene name                                   MT2               ctgf      Cyb561    FKBP51     FKBP51         FKBP51
Chromosome in mouse                         8                 10        11        17         17             17
Upstream or downstream for GRE              upstream          upstream  upstream  upstream   downstream     downstream
GRE distance to gene                        -1244             -1837     748       2507       -43610         -43669
Number of mismatches with 0 gaps allowed    0                 2         2         6          not available  not available

Table 5-13: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available

Table 5-14 shows the results using 50k bp upstream and 1k bp downstream with masked sequences. The RepeatMasker setting is the same as mentioned above. All of the conserved GREs that lie within the selected promoter sequences are found. Compared with Table 5-9, one more conserved GRE for gene MT2 is found, with 4 mismatches in the alignment. This means that repeat masking does improve the results.

AVID alignment, 50k bp upstream and 1k bp downstream, with combined masking:

Gene name                                   MT2               ctgf      Cyb561    FKBP51     FKBP51         FKBP51
Chromosome in mouse                         8                 10        11        17         17             17
Upstream or downstream for GRE              upstream          upstream  upstream  upstream   downstream     downstream
GRE distance to gene                        -1244             -1837     748       2507       -43610         -43669
Number of mismatches with 0 gaps allowed    0, 4              2         2         2          not available  not available

Table 5-14: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available

Table 5-15 shows the results using 50k bp upstream and 50k bp downstream with masked sequences. The RepeatMasker setting is the default setting from the web server. Compared with Table 5-11, we cannot tell which one is obviously better. Without RepeatMasker, we find one more conserved GRE of gene FKBP51. However, the conserved GRE of gene MT2 then has many mismatches within the alignment. With RepeatMasker, we find fewer conserved GREs, but all of them are good results.

AVID alignment, 50k bp upstream and 50k bp downstream, with masking:

Gene name                                   MT2               ctgf      Cyb561    FKBP51     FKBP51         FKBP51
Chromosome in mouse                         8                 10        11        17         17             17
Upstream or downstream for GRE              upstream          upstream  upstream  upstream   downstream     downstream
GRE distance to gene                        -1244             -1837     748       2507       -43610         -43669
Number of mismatches with 0 gaps allowed    0                 2         2         not found  not found      not found

Table 5-15: AVID alignment results. This table shows the number of mismatches in each parsed alignment if it is available

Considering all of the above case studies, we conclude that the first sensitive factor is the length of the downstream region chosen. The longer the downstream sequence is, the worse the alignment results are. The second factor is the length of the upstream region. The longer the upstream sequence is, the better the alignment results are. However, if the total length of both sequences is very long, then AVID does not perform well. There should be a balance point, although we still need to explore further why this happens. Sequence repeat masking improves the results only slightly. The number of gaps allowed does not affect the results much. Although a higher value may find more conserved GREs for one gene in some cases, the additional alignments have many mismatches, which means they are not reliable results.

Finally, we summarize the conditions for phylogenetic footprinting:
• Use the RepeatMasker program with the selected options (mentioned above)
• Combine the Dust mask filtering program with RepeatMasker
• Use 15k bp upstream and 1k bp downstream of the TSS of each gene as the promoter sequence
• Use zero as the maximum number of gaps allowed within each alignment
• Use 16 as the minimum length of alignment required
• Use the AVID program to align the orthologous sequence pairs between Mouse and Human, with the Mouse sequence as the query sequence and the Human sequence as the hit sequence
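For bookkeeping, the conditions listed above can be collected in one immutable settings object. This is only a sketch: the class name, field names, and the helper `window_length` are hypothetical, and the values are copied from the list rather than taken from any actual configuration file of the pipeline.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FootprintingSettings:
    """Phylogenetic footprinting conditions as summarized in the text."""
    upstream_bp: int = 15_000            # promoter sequence upstream of the TSS
    downstream_bp: int = 1_000           # promoter sequence downstream of the TSS
    max_gaps: int = 0                    # gaps allowed within each parsed alignment
    min_alignment_length: int = 16       # minimum length of a parsed alignment
    repeatmasker_options: Tuple[str, ...] = ("-s", "-noint", "-w")  # as listed in the text
    combine_with_dust: bool = True       # merge Dust masking with RepeatMasker masking
    aligner: str = "AVID"
    query_species: str = "Mouse"
    hit_species: str = "Human"


def window_length(settings: FootprintingSettings) -> int:
    """Total promoter sequence length fetched per gene, in bp."""
    return settings.upstream_bp + settings.downstream_bp


SETTINGS = FootprintingSettings()

if __name__ == "__main__":
    print(SETTINGS)
    print(window_length(SETTINGS))
```

Freezing the dataclass is a design choice that keeps one parameter set fixed across all chromosomes, so that results from different runs remain comparable.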

The conditions summarized here are based only on the six conserved GREs belonging to four known genes. They are therefore sufficient for this research at this moment, but may not be suitable for other studies. Generally speaking, six is a small number for machine learning techniques to be safely applied.

5.3.2. One Case Study For Gene Expression Data

5.3.2.1 Experimental Background For Gene Expression Data

The experiments used Affymetrix GeneChips to study expression profiles in 3T3-L1 cells following treatment with DEX. Expression profiles of DEX-treated 3T3-L1 cells in the presence or absence of cycloheximide were obtained. Cycloheximide is a drug that blocks protein synthesis, thus allowing us in principle to distinguish the direct from the indirect GR targets. Cells pretreated (30 min) with or without cycloheximide (CHX) were stimulated with DEX or EtOH (as control) for 1h and 3h. The mRNA from these cells was harvested with Trizol, labeled and hybridized to GeneChips. This experiment was replicated 3 times.

Gene expression was analyzed with dChip. After normalization, gene expression was modeled using the PM-only model. Comparisons between DEX-treated and control samples (in the presence and absence of CHX) enabled us to find the significantly differentially expressed genes. Genes induced by DEX regardless of whether CHX was present were classified as direct targets, whereas genes induced by DEX alone, but not in the presence of CHX, were classified as indirect targets. The same logic was applied to downregulated genes.
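The classification rule can be written down as a small decision function. The sketch below is one reading of the rule, not the dChip procedure itself: it assumes that each gene has already been called "up", "down" or "none" for DEX versus control, once without CHX and once with CHX, and it treats a gene as a direct target only when the same change is seen in both conditions. The function name and the "unclassified" label are illustrative.

```python
VALID_CALLS = {"up", "down", "none"}


def classify_target(change_dex: str, change_dex_chx: str) -> str:
    """Assign a gene to direct-up/down, indirect-up/down, or unclassified.

    change_dex:     call for DEX vs control in the absence of CHX
    change_dex_chx: call for DEX vs control in the presence of CHX
    """
    if change_dex not in VALID_CALLS or change_dex_chx not in VALID_CALLS:
        raise ValueError("calls must be 'up', 'down' or 'none'")
    if change_dex == "none":
        return "unclassified"            # no response to DEX alone
    if change_dex_chx == change_dex:
        return "direct-" + change_dex    # response survives the block of protein synthesis
    if change_dex_chx == "none":
        return "indirect-" + change_dex  # response needs newly synthesized protein
    return "unclassified"                # opposite calls: left out (assumption)


if __name__ == "__main__":
    for calls in [("up", "up"), ("up", "none"), ("down", "down"), ("down", "none")]:
        print(calls, classify_target(*calls))
```

How genes with conflicting calls, or with a response only in the presence of CHX, were handled is not stated in the text, so the sketch simply leaves them unclassified.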

The experiments were performed by Marcil [7]. We are responsible for the analysis.

5.3.2.2. Results and Discussion

Using the original gene expression data, we define five different groups of genes: direct-up, direct-down, indirect-up, indirect-down, and control (ctrl). Both module analysis with GR's three cooperative TFs (NF-1, C/EBPbeta, and C/EBPdelta) and PF analysis have been applied to the data. Consider a gene in one of the five groups. We say that the gene is qualified if it satisfies the following conditions:
• There exists at least one conserved alignment (see Section 4.4.1) in the promoter sequence (see Section 5.3.1.2) of the gene.
• At least one pGRE in the conserved alignment belongs to a module.

Table 5-16 shows the number of genes in each group from original gene expression data and the number of genes found by both module and PF analyses.

Group          Total number of genes  Number of genes found by both module and PF analyses
Direct-up      45                     12
Direct-down    12                     1
Indirect-up    25                     6
Indirect-down  53                     9
Control        79                     15

Table 5-16: Number of genes in each group from gene expression data and the number of genes found by both module and PF analyses
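The fraction of qualified genes in each group follows directly from the counts in Table 5-16. The short sketch below recomputes it; the dictionary layout and function name are illustrative, and the rounding to three decimals is only for display.

```python
from typing import Dict, Tuple

# group -> (total number of genes, genes found by both module and PF analyses)
TABLE_5_16: Dict[str, Tuple[int, int]] = {
    "Direct-up": (45, 12),
    "Direct-down": (12, 1),
    "Indirect-up": (25, 6),
    "Indirect-down": (53, 9),
    "Control": (79, 15),
}


def qualified_fractions(table: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Fraction of qualified genes per group (qualified / total)."""
    return {group: found / total for group, (total, found) in table.items()}


FRACTIONS = qualified_fractions(TABLE_5_16)

if __name__ == "__main__":
    for group, fraction in FRACTIONS.items():
        print(f"{group:14s} {fraction:.3f}")
```

With these counts the direct-up group has the highest fraction (12 of 45, about 0.27) and the direct-down group the lowest (1 of 12, about 0.08), with the control group in between (15 of 79, about 0.19).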

Figure 5-14 shows the percentage of qualified genes in each group. Figure 5-15 gives the average number of mismatches in the conserved alignments for each group. Figure 5-16 demonstrates the average distance to gene based on TSS for qualified pGREs in each group. Figure 5-17 reports the average score of qualified pGREs in each group.

[Figure: bar chart titled "Percentage of qualified genes in each group"]

Figure 5-14: Percentage of qualified genes

[Figure: bar chart titled "Average number of mismatches in conserved alignment pairs in each group"; x-axis: type of gene]

Figure 5-15: Average number of mismatches in conserved alignments

[Figure: bar chart titled "Average distance to gene corresponding to TSS in each group"; legend: mouse, human]

Figure 5-16: Average distance to TSS of gene

[Figure: bar chart titled "Average score of putative GREs in each group"]

Figure 5-17: Average score of pGREs

From the above results, we find that genes in the direct-up group have a higher hit percentage, a lower average number of mismatches, a shorter average distance to the TSS of the gene for conserved pGREs, and a higher average score for conserved pGREs compared with control genes. However, genes in the direct-down group have a lower hit percentage and a higher average number of mismatches compared with control genes. This is reasonable because the binding of the homo-dimer GR to a GRE activates the transcription of a gene (described in Section 2.2). The pGREs found in this research likely activate the transcription of a gene, since the GRE consensus used is almost palindromic. Figures 5-14 to 5-17 also show that genes in the indirect-up group have a lower hit percentage, a higher average number of mismatches, a greater average distance to the gene for conserved pGREs, and a lower average score for conserved pGREs compared with genes in the direct-up group. In the direct upregulated gene group, we found 12 genes among 45 with both module and PF analyses, given the parameters described in Section 5.3.1.2. Among these 12 genes, one gene, Sgk, is Serum- and Glucocorticoid-regulated Kinase [130]; two genes, Mt2 and Ctgf, carry GREs that are among the 6 well-known conserved GREs described in Section 5.3.1; and another 4 genes, Nfkbia, Mt1, Rgs2, and Cdkn1a, may be good candidates for further wet-lab experiments. In the indirect downregulated gene group, we found 9 genes among 53 with both module and PF analyses. Only one gene, Thbs1, may be interesting. All the results are shown in Table 5-17. Table 5-18 shows the interesting points for the five good candidate genes mentioned above.

Group          Total number  Number of genes  Interesting  Number of GRE-
               of genes      found by PF      genes        regulated genes
Direct-up      45            12               4            3
Direct-down    12            1                0            0
Indirect-up    25            6                0            0
Indirect-down  53            9                1            0

Table 5-17: Final results for gene expression data

Gene category, gene name, and interesting points from http://www.ncbi.nlm.nih.gov/Entrez/index.html:

Direct-up, Mt1: Reduces inflammatory response, oxidative stress and apoptosis and increases angiogenesis following focal brain cryolesion.

Direct-up, Nfkbia: 1. Data suggest that RelA is liberated during LPS-induced pulmonary inflammation by the regulated degradation of both IkappaB-alpha and IkappaB-beta. 2. IkappaBalpha and IkappaBbeta play unique injury context-specific roles in activating NF-kappaB-mediated proinflammatory responses.

Direct-up, Rgs2: Parathyroid hormone induces RGS-2 expression by a cyclic adenosine 3',5'-monophosphate-mediated pathway in primary neonatal murine osteoblasts.

Direct-up, Cdkn1a: 1. Ablation of p21 in lupus-prone mice allows these cells to reenter the cell cycle and undergo apoptosis, leading to autoimmune disease reduction. 2. Cdkn1a is induced in the Mouse central nervous system as part of the acute response to inflammation.

Indirect-down, Thbs1: Thrombospondin-1 may have a proinflammatory role in anti-glomerular basement membrane glomerulonephritis.

Table 5-18: Interesting points for five good candidate genes

5.3.3. Study For Whole Mouse Genome

This section presents the comparative analysis for all Mouse genes. We select Human as the orthologous species for Mouse, and we carry out the analysis chromosome by chromosome. Table 5-19 shows the results for Mouse chromosome 2. A total of 425 orthologous gene pairs are found, and they all contain a conserved alignment in the regulatory region between Mouse and Human. Each conserved alignment includes one or more pGREs. Among the 425 gene pairs, 337 Mouse genes contain pGREs upstream and 88 Mouse genes contain pGREs downstream. For the 337 Mouse genes, 326 orthologous Human genes also contain pGREs upstream and 11 orthologous Human genes contain pGREs downstream. For the 88 Mouse genes, 86 orthologous Human genes also contain pGREs downstream and two orthologous Human genes contain pGREs upstream.

                                                 upstream       downstream
Number of genes in mouse                         337            88
Number of genes in human                         328            97
Number of mismatches in the alignment            30.52225519    110.2613636
Average distance to mouse genes                  7055.338279    415.4772727
Average score of putative GREs in mouse genome   0.846150686    0.844944686
Average distance to human genes                  7136.740854    453.7628866
Average score of putative GREs in human genome   0.848495891    0.847566551

Table 5-19: Phylogenetic footprinting results for Mouse chromosome 2 as compared with Human orthologous genes
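The gene counts in Table 5-19 can be cross-checked against the breakdown given in the text by tabulating, for each orthologous pair, where the pGREs lie in Mouse and in Human. The sketch below does this; the dictionary layout and the helper `side_totals` are illustrative and not part of the original analysis.

```python
from typing import Dict, Tuple

# (side of pGREs in Mouse, side of pGREs in Human) -> number of orthologous gene pairs
PAIR_COUNTS: Dict[Tuple[str, str], int] = {
    ("upstream", "upstream"): 326,
    ("upstream", "downstream"): 11,
    ("downstream", "downstream"): 86,
    ("downstream", "upstream"): 2,
}


def side_totals(counts: Dict[Tuple[str, str], int], species_index: int) -> Dict[str, int]:
    """Sum pair counts by side for one species (0 = Mouse, 1 = Human)."""
    totals = {"upstream": 0, "downstream": 0}
    for sides, n in counts.items():
        totals[sides[species_index]] += n
    return totals


MOUSE_TOTALS = side_totals(PAIR_COUNTS, 0)
HUMAN_TOTALS = side_totals(PAIR_COUNTS, 1)

if __name__ == "__main__":
    print("Mouse:", MOUSE_TOTALS)
    print("Human:", HUMAN_TOTALS)
    print("Pairs:", sum(PAIR_COUNTS.values()))
```

The Mouse totals (337 upstream, 88 downstream) and the Human totals (328 upstream, 97 downstream) agree with the first two rows of the table, and the four cells sum to the 425 gene pairs.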

If we combine this result with the module analysis results shown in Section 5.2, only one gene, Fsip2, is found to contain a module within conserved orthologous sequences on Mouse chromosome 2. This gene is fibrous sheath-interacting protein 2 and has the following gene ontology terms:
• GO:0005515: protein binding
• GO:0009434: flagellum (sensu Eukarya)

Considering all of the above results, we derive the following conclusions:
(i) The module analysis reduced the number of putative GREs by 97.9%.
(ii) The PF analysis reduced the number of putative GREs by 97.7%.
(iii) The module and PF analyses together reduced the number of putative GREs by 99.9%.

6. Conclusion and Recommendations

6.1. Conclusion

Three different but complementary bioinformatics approaches have been applied to identify Mouse genes putatively transcriptionally regulated by GR. Firstly, we searched for putative TFBSs using PWMs for four TFs (GR, NF-1, C/EBPbeta, and C/EBPdelta) in the complete Mouse genome. This produced a large number of pGREs and putative TFBSs for the other three TFs. Most of these are likely false positive predictions. Secondly, both the analysis of modules of TFBSs and the PF analysis were used to reduce this large number of false positives. In each step, a statistical test was used to measure the significance of the results. Finally, we reach the following conclusions:
(i) The module analysis reduced the number of putative GREs by 97.9%.
(ii) The PF analysis reduced the number of putative GREs by 97.7%.
(iii) The module and PF analyses together reduced the number of putative GREs by 99.9%.
Also, in the case study for gene expression data, we found four genes, Nfkbia, Mt1, Rgs2, and Cdkn1a, that may be good candidates for further wet-lab experiments.

6.2. Recommendations

Although it has been demonstrated that our module can find a set of GR-regulated genes as a good guideline for biologists, it would be better if we could increase our confidence in these predictions. There remain a number of additional properties of TFBSs that we could consider. Firstly, within a module, the order, orientation, and distance of DNA-binding sites are all important factors for function. In this study, we considered only the distance among TFBSs, allowed a flexible order, and did not place any restrictions upon their orientation. Secondly, there are many co-operative TFs for GR. In our module, we examined only three TFs and did not compare the effects of different TFs. This means we should use different combinations of GR's co-operative TFs in the TFBS search and see which set of TFs gives good results. At present, we still do not know which of GR's co-operative TFs is most important to GR. Thirdly, in the PF analysis, we predict the AVID alignment parameters based on 6 well-known conserved GREs. However, this is a small number for machine learning techniques to be safely applied.

References:

1. W. L. Miller, and J. Blake Tyrrel, The adrenal cortex. In: Felig P, Baxter JD, Frohman LA (eds) Endocrinology and metabolism. McGraw-Hill, New York, (1995) p.555-711
2. A. C. B. Cato, H. Schacke, and K. Asadullah, Recent advances in glucocorticoid receptor action, Springer, (2002)
3. B. B. P. Gupta, and K. Lalchhandama, Molecular mechanisms of glucocorticoid action, Current Science, 83(9) (2002) 1103-1111
4. I. Rogatsky, J. Wang, M. K. Derynck, D. F. Nonaka, D. B. Khodabakhsh, C. M. Haqq, B. D. Darimont, M. J. Garabedian, and K. R. Yamamoto, Target-specific utilization of transcriptional regulatory surfaces by the glucocorticoid receptor, Proceedings of the National Academy of Sciences of the United States of America (PNAS), 100(24) (2003) 13845-13850
5. Y. Pilpel, P. Sudarsanam, and G. M. Church, Identifying regulatory networks by combinatorial analysis of promoter elements, Nature Genetics, 29 (2001) 153-159
6. P. Sudarsanam, Y. Pilpel, and G. M. Church, Genome-wide co-occurrence of promoter elements reveals a cis-regulatory cassette of rRNA transcription motifs in Saccharomyces cerevisiae, Genome Research, 12 (2002) 1723-1731
7. A. Marcil, Project Proposal for Ph.D., Elucidation of the glucocorticoid receptor transcriptional regulatory network, (2003)
8. M. Ptashne, How eukaryotic transcription factors work, Nature, 335 (1988) 683-689
9. M. Ptashne, and A. Gann, Transcriptional activation by recruitment, Nature, 386 (1997) 569-577
10. J. W. Fickett, and W. W. Wasserman, Discovery and modeling of transcriptional regulatory regions, Current Opinion in Biotechnology, 11 (2000) 19-24
11. F. Sauer, and R. Tjian, Mechanisms of transcriptional activation: differences and similarities between yeast, Drosophila, and man, Current Opinion in Genetics and Development, 7(2) (1997) 176-181
12. T. Werner, Models for prediction and recognition of eukaryotic promoters, Mammalian Genome, 10 (1999) 168-175
13. W. W. Wasserman, and A. Sandelin, Applied bioinformatics for the identification of regulatory elements, Nature Reviews Genetics, 5(4) (2004) 276-287
14. L. Zawel, and D. Reinberg, Common Themes in Assembly and Function of Eukaryotic Transcription Complexes, Annual Reviews Biochemistry, 64 (1995) 533-561
15. K. Rhee, and E. A. Thompson, Glucocorticoid regulation of a transcription factor that binds an initiator-like element in the murine thymidine kinase (Tk-1) promoter, Molecular Endocrinology, 10(12) (1996) 1536-1548
16. G. Finak, N. Godin, M. Hallett, F. Pepin, Z. Rajabi, V. Srivastava, and Z. Tang, BIAS: bioinformatics integrated application software, McGill Center for Bioinformatics, (2004) to be submitted
17. The apache DB project, http://db.apache.org, (2004)
18. Java 2 platform, enterprise edition, http://java.sun.com/j2ee, (2003)
19. M. Stonebraker, D. Moore, and P. Brown, Object-Relational DBMSs: tracking the next great wave, Morgan Kaufmann, second edition (1999)
20. T. Hubbard, et al., The Ensembl genome database project, Nucleic Acids Research, 30(1) (2002) 38-41
21. D. J. Lipman, and W. R. Pearson, Rapid and Sensitive Protein Similarity Searches, Science, 227(4693) (1985) 1435-1441
22. W. R. Pearson, and D. J. Lipman, Improved tools for biological sequence comparison, Proceedings of the National Academy of Sciences of the United States of America, 85 (1988) 2444-2448
23. W. R. Pearson, Rapid and sensitive sequence comparison with FASTP and FASTA, Methods Enzymol, 183 (1990) 63-98
24. W. R. Pearson, Comparison of methods for searching protein sequence databases, Protein Science, 4(6) (1995) 1145-1160
25. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 25(17) (1997) 3389-3402
26. http://www.accelrys.com/products/gcg_wisconsin_package/index.html
27. A. Cornish-Bowden, Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984, Nucleic Acids Research, 13 (1985) 3021-3030
28. D. S. Prestridge, SIGNAL SCAN: a computer program that scans DNA sequences for eukaryotic transcriptional elements, Comput. Appl. Biosci., 7(2) (1991) 203-206
29. D. S. Prestridge, G. Stormo, SIGNAL SCAN 3.0: new database and program features, Computer Applications in the Biosciences, 9(1) (1993) 113-115
30. Q. K. Chen, G. Z. Hertz, and G. D. Stormo, MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices, Computer Applications in the Biosciences, 11(5) (1995) 563-566
31. K. Quandt, K. Frech, H. Karas, E. Wingender, and T. Werner, MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data, Nucleic Acids Research, 23(23) (1995) 4878-4884
32. A. Klingenhoff, K. Frech, K. Quandt, and T. Werner, Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity, Bioinformatics, 15(3) (1999) 180-186
33. J. G. Henikoff, and S. Henikoff, Using substitution probabilities to improve position-specific scoring matrices, Computer Applications in the Biosciences, 12(2) (1996) 135-143
34. http://www.genomicdiscoverytools.com/
35. J. Schug, and G. C. Overton, TESS: Transcription Element Search Software on the WWW, Technical Report CBIL-TR-1997-1001-v0.0, Computational Biology and Informatics Laboratory, School of Medicine, University of Pennsylvania, http://www.cbil.upenn.edu/tess (1997)
36. E. Wingender, R. Knuppel, P. Dietze, H. Karas, K. Frech, K. Quandt, and T. Werner, The TRANSFAC data base and ConsInd program as tools for describing and understanding regulatory functions of the genome, SAMS, 18-19 (1995) 823-826
37. E. Wingender, P. Dietze, H. Karas, and R. Knuppel, TRANSFAC: a database on transcription factors and their DNA binding sites, Nucleic Acids Research, 24(1) (1996) 238-241
38. E. Wingender, A. E. Kel, O. V. Kel, H. Karas, T. Heinemeyer, P. Dietze, R. Knuppel, A. G. Romaschenko, and N. A. Kolchanov, TRANSFAC, TRRD and COMPEL: towards a federated data base system on transcriptional regulation, Nucleic Acids Research, 25(1) (1997) 265-268
39. T. Heinemeyer, E. Wingender, I. Reuter, H. Hermjakob, A. E. Kel, O. V. Kel, E. V. Ignatieva, E. A. Ananko, O. A. Podkolodnaya, F. A. Kolpakov, N. L. Podkolodny, and N. A. Kolchanov, Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL, Nucleic Acids Research, 26(1) (1998) 362-367
40. T. Heinemeyer, X. Chen, H. Karas, A. E. Kel, O. V. Kel, I. Liebich, T. Meinhardt, I. Reuter, F. Schacherer, and E. Wingender, Expanding the TRANSFAC data base towards an expert system of regulatory molecular mechanisms, Nucleic Acids Research, 27(1) (1999) 318-322
41. E. Wingender, X. Chen, R. Hehl, H. Karas, I. Liebich, V. Matys, T. Meinhardt, M. Pruß, I. Reuter, and F. Schacherer, TRANSFAC: an integrated system for gene expression regulation, Nucleic Acids Research, 28(1) (2000) 316-319
42. E. Wingender, X. Chen, E. Fricke, R. Geffers, R. Hehl, I. Liebich, M. Krull, V. Matys, H. Michael, R. Ohnhauser, M. Pruß, F. Schacherer, S. Thiele, and S. Urbach, The TRANSFAC system on gene expression regulation, Nucleic Acids Research, 29(1) (2001) 281-283
43. V. Matys, E. Fricke, R. Geffers, E. Goßling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A. E. Kel, O. V. Kel-Margoulis, D. U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele, and E. Wingender, TRANSFAC: transcriptional regulation, from patterns to profiles, Nucleic Acids Research, 31(1) (2003) 374-378
44. E. Wingender, Compilation of transcription regulating proteins, Nucleic Acids Research, 16(5) (1988) 1879-1902
45. http://www.gene-regulation.com/
46. A. Wagner, Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes, Bioinformatics, 15(10) (1999) 776-784
47. M. I. Arnone, and E. H. Davidson, The hardwiring of development: organization and function of genomic regulatory systems, Development, 124 (1997) 1851-1864
48. C. H. Yuh, H. Bolouri, and E. H. Davidson, Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene, Science, 279 (1998) 1896-1902
49. D. GuhaThakurta, and G. D. Stormo, Identifying target sites for cooperatively binding factors, Bioinformatics, 17(7) (2001) 608-621
50. O. V. Kel, A. G. Romaschenko, A. E. Kel, E. Wingender, and N. A. Kolchanov, A compilation of composite regulatory elements affecting gene transcription in vertebrates, Nucleic Acids Research, 23 (1995) 4097-4103
51. M. S. Halfon, A. Carmena, S. Gisselbrecht, C. M. Sackerson, F. Jimenez, M. K. Baylies, and A. M. Michelson, Ras pathway specificity is determined by the integration of multiple signal-activated and tissue-restricted transcription factors, Cell, 103(1) (2000) 63-74
52. H. Weintraub, R. Davis, D. Lockshon, and A. Lassar, MyoD binds cooperatively to two sites in a target enhancer sequence: Occupancy of two sites is required for activation, Proceedings of the National Academy of Sciences of the United States of America (PNAS), 87 (1990) 5623-5627
53. C. S. Moreno, P. Emery, J. E. West, B. Durand, W. Reith, B. Mach, and J. M. Boss, Purified X2 binding protein (X2BP) cooperatively binds the class II MHC X box region in the presence of purified RFX, the X box factor deficient in the bare lymphocyte syndrome, Journal of Immunology, 155(9) (1995) 4313-4321
54. J. W. Fickett, Coordinate positioning of MEF2 and myogenin binding sites, Gene, 172(1) (1996) GC19-GC32
55. A. Muhlethaler-Mottet, W. Di Berardino, L. A. Otten, and B. Mach, Activation of the MHC class II transactivator CIITA by interferon-gamma requires cooperative interaction between Stat1 and USF-1, Immunity, 8(2) (1998) 157-166
56. W. W. Wasserman, and J. W. Fickett, Identification of regulatory regions which confer muscle-specific gene expression, Journal of Molecular Biology, 278 (1998) 167-181
57. E. H. Davidson, Genomic Regulatory Systems: Development and Evolution, Academic Press, San Diego, (2001)
58. K. Quandt, K. Grote, and T. Werner, GenomeInspector: basic software tools for analysis of spatial correlations between genomic structures within megabase sequences, Genomics, 33 (1996) 301-304
59. W. B. L. Alkema, O. Johansson, J. Lagergren, and W. W. Wasserman, MSCAN: identification of functional clusters of transcription factor binding sites, Nucleic Acids Research, 32 (2004) W195-W198
60. M. Rebeiz, N. L. Reeves, and J. W. Posakony, SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data-Site clustering over random expectation, Proceedings of the National Academy of Sciences of the United States of America (PNAS), 99 (2002) 9888-9893
61. T. L. Bailey, and W. S. Noble, Searching for statistically significant regulatory modules, Bioinformatics, 19 (2003) ii16-ii25
62. M. C. Frith, M. C. Li, and Z. Weng, Cluster-Buster: finding dense clusters of motifs in DNA sequences, Nucleic Acids Research, 31(13) (2003) 3666-3668
63. R. Sharan, I. Ovcharenko, A. Ben-Hur, and R. M. Karp, CREME: a framework for identifying cis-regulatory modules in Human-Mouse conserved segments, Bioinformatics, 19 (2003) i283-i291

95 64. S. Aerts, P. Van Loo, G. Thijs, Y. Moreau, and B. De Moor, Computational detection of cis-regulatory modules, Bioinformatics, 19(2) (2003) ii5-ii14 65. A. Sosinsky, C. P. Bonin, R. S. Mann, and B. Honig, Target Explorer: an automated tool for the identification of new target genes for a specified set of transcription factors, Nucleic Acids Research, 31 (2003) 3589-3592 66. T. Werner, S. Fessele, H. Maier, and P. J. Nelson, Computer modeling of promoter organization as a tool to study transcriptional coregulation, The Federation of American Societies for Experimental Biology (FASEB) Journal, 17(10) (2003) 1228-1237 67. D. A. Tagle, B. F. Koop, M. Goodman, J. L. Slightom, D. L. Hess, and R. T. Jones, Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints, Journal of Molecular Biology, 203(2) (1988) 439-455 68. W. W. Wasserrnan, M. Palumbo, W. Thompson, J. W. Fickett, and C. E. Lawrence, Human-Mouse genome comparisons to locate regulatory sites, Nature Genetics, 26 (2000) 225-228 69. M. Blanchette, and M. Tompa, Discovery of regulatory elements by a computational method for phylogenetic footprinting, Genome Research, 12 (2002) 739-748 70. R. C. Hardison, J. Oeltjen, and W. Miller, Long Human-Mouse sequence alignments reveal novel regulatory elements: A reason to sequence the Mouse genome, Genome Research, 8 (1997) 959-966 71. M. A. Ansari-Lari, J. C. Oeltjen, S. Schwartz, Z. Zhang, D. M. Muzny, J. Lu, J. H. Gorrell, A. C. Chinault, J. W. Belmont, W. Miller, and R. A. Gibbs, Comparative sequence analysis of a gene-rich cluster at Human chromosome 12p13 and its syntenic region in Mouse chromosome 6, Genome Research, 8 (1998) 29-40 72. N. Jareborg, E. Birney, and R. Durbin, Comparative analysis of noncoding regions of 77 orthologous Mouse and Human gene pairs, Genome Research, 9 (1999) 815-824

96 73. 1. Dubchak, M. Brudno, G. G. Loots, L. Pachter, C. Mayor, E. M. Rubin, and K. A. Frazer, Active conservation of noncoding sequences revealed by three-way species comparisons, Genome Research, 10 (2000) 1304-1306 74. Q. Wu, T. Zhang, J. Cheng, Y. Kim, J. Grimwood, J. Schmutz, M. Dickson, J. P. Noonan, M. Q. Zhang, R. M. Myers, and T. Maniatis, Comparative DNA sequence analysis of Mouse and Human protocadherin gene clusters, Genome Research, 11 (2001) 389-404 75. G. G. Loots, 1. Ovcharenko, L. Pachter, 1. Dubchak, and E. M. Rubin, rVista for comparative sequence-based discovery of functional transcription factor binding sites, Genome Research, 12 (2002) 832-839 76. B. Lenhard, A. Sandelin, L. Mendoza, P. Engstrom, N. Jareborg, and W. W. Wasserman, Identification of conserved regulatory elements by comparative genome analysis, Journal of Biology, 2(2) (2003) article 13 77. A. Sandelin, W. W. Wasserman, and B. Lenhard, ConSite: web-based prediction of regulatory elements using cross-species comparison, Nucleic Acids Research, 32 (2004) W249-W252 78. E. Berezikov, V. Guryev, R. H.A. Plasterk, and E. Cuppen, CONREAL: Conserved Regulatory Elements Anchored Alignment Aigorithm for Identification of Transcription Factor Binding Sites by Phylogenetic Footprinting, Genome Research, 14 (2004) 170-178 79. W. Krivan, and W. W. Wasserman, A predictive model for regulatory sequences directing liver-specific transcription, Genome Research, 11 (2001) 1559-1566 80. J. C. Oeltjen, T. M. Malley, D. M. Muzny, W. Miller, R. A. Gibbs, and J. W. Belmont, Large-scale comparative sequence analysis of the Human and murine bruton's tyrosine kinase loci reveals conserved regulatory domains, Genome Research, 7 (1997) 315-329 81. R. C. Hardison, Conserved noncoding sequences are reliable guides to regulatory elements, Trends in Genetics, 16(9) (2000) 369-372

97 82. G. G. Loots, R. M. Locksley, C. M. Blankespoor, Z. E. Wang, W. Miller, E. M. Rubin, and K. A. Frazer, Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons, Science, 288 (2000) 136-140 83. P. Qiu, L. Qin, R. P. Sorrentino, J. R. Greene, and L. Wang, Comparative promoter analysis and its application in analysis of PTH-regulated gene expression, Journal of Molecular Biology, 326 (2003) 1327-1336 84. K. A. Frazer, L. Elnitski, D. M. Church, 1. Dubchak, and R. C. Hardison, Cross-Species Sequence Comparisons: A Review of Methods and Available Resources, Genome Research, 13 (2003) 1-12 85. P. Qiu, Recent advances in computational promoter analysis in understanding the transcriptional regulatory network, Biochemical and Biophysical Research Communications, 309 (2003) 495-501 86. A. Ureta-Vidal, L. Ettwiller, and E. Birney, Comparative Genomics: Genome­ wide analysis in metazoan eukaryotes, Genetics, 4 (2003) 251-262 87. C. Thacker, M. A. Marra, A. Jones, and D. L. Baillie, Functional Genomics in Caenorhabditis elegans: An approach involving comparisons of sequences from related nematodes, Genome Research, 9 (1999) 348-359 88. The C. elegans Sequencing Consortium, Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology, Science, 282(5396) (1998) 2012-2018 89. M. D. Adams et aL, The Genome Sequence of Drosophila melanogaster, Science, 287(5461) (2000) 2185-2195 90. International Human Genome Sequencing Consortium, Initial sequencing and analysis of the Human genome, Nature, 409 (2001) 860-921 91. J. C. Venter et aL, The sequence of the Human genome, Science, 291 (2001) 1304-1351 92. S. Aparicio et aL, Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes, Science, 297(5585) (2002) 1301-1310 93. R. A. Holt et aL, The genome sequence of the malaria mosquito Anopheles gambiae, Science, 298(5591) (2002) 129-149

98 94. Waterston et aL, and Mouse Genome Sequencing Consortium, Initial sequencing and comparative analysis of the Mouse genome, Nature, 420 (2002) 520-562 95. P. Oehal et aL, The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins, Science, 298(5601) 2002 2157-2167 96. O. L. Gumucio, O. A. Shelton, W. Zhu, O. Millinoff, T. Gray, J. H. Bock, J. L. Slightom, and M. Goodman, Evolutionary strategies for the elucidation of cis and trans factors that regulate the developmental switching programs of the beta-like globin genes, Molecular Phylogenetics and Evolution, 5(1) (1996) 18-32 97. L. Ouret, and P. Bucher, Searching for regulatory elements in Human noncoding sequences, Current Opinion in Structural Biology, 7 (1997) 399-406 98. S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. hardison, O. Haussier, and W. Miller, Human-Mouse alignments with BLASTZ, Genome Research, 13 (2003) 103-107 99. M. Brudno, O. B. Chuong, G. M. Cooper, M. F. Kim, E. Oavydov, NISC Comparative Sequencing Program, E. O. Green, A. Sidow, and S. Batzoglou, LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic ONA, Genome Research. 13(4) (2003) 721-731 100. N. Bray, 1. Oubchak, and L. Pachter, AVIO: A global alignment program, Genome Research, 13(1) (2003) 97-102 101. S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller, PipMaker-A web server for aligning two genomic ONA sequences, Genome Research, 10 (2000) 577-586 102. L. Elnitski, C. Riemer, H. Petrykowska, L. Florea, S. Schwartz, W. Miller, and R. Hardison, PipTools: a computational toolkit to annotate and analyze pairwise comparisons of genomic sequences, Genomics, 80(6) (2002) 681-690 103. C. Mayor, M. Brudno, J. R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. S. Pachter, and 1. Oubchak, VISTA: Visualizing global ONA sequence alignments of arbitrary length, Bioinformatics, 16 (2000) 1046-1047 104. R. L. 
Tatusov, N. D. Fedorova1, J. O. Jackson1, A. R. Jacobs1, B. Kiryutin, E. V. Koonin, O. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B.

99 S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. 1. Wolf, J. J. Yin, and D. A. Natale, The COG database: an updated version includes eukaryotes, BMC Bioinformatics, 4(1) (2003) 41 105. C. E.V. Storm, and E. L.L. Sonnhammer, Comprehensive analysis of orthologous protein domains using the HOPS data base, Genome Research, 13 (2003) 2353-2362 106. D. L. Wheeler, C. Chappey, A. E. Lash, D. D. Leipe, T. L. Madden, G. D. Schuler, T. A. Tatusova, and B. A. Rapp, Database resources of the national center for biotechnology information, Nucleic Acids Research, 28(1) (2000) 10-14 107. L. Duret, D. Mouchiroud, and M. Gouy, HOVERGEN: a data base of homologous vertebrate genes. Nucleic Acids Research, 22(12) (1994) 2360-2365 108. N. Bray and L. Pachter. MAVID multiple alignment server, Nucleic Acids Research, 31 (13) (2003) 3525-3526 109. K. Frech, G. Hermann, and T. Werner, Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids, Nucleic Acids Research, 21(7) (1993) 1655-1664 110. D. R. Cavener, Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates, Nucleic Acids Research, 15(4) (1987) 1353-136 111. O. V. Kel-Margoulis, A. E. Kel, 1. Reuter, 1. V. Deineko, and E. Wingender, TRANSCompel: a data base on composite regulatory elements in eukaryotic genes, Nucleic Acids Research, 30(1) (2002) 332-334 112. BioJava, http://www.biojava.org/ 113. P. F. Cliften, L. W. Hillier, L. Fulton, T. Graves, T. Miner, W. R. Gish, R. H. Waterston, M. Johnston, Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis, Genome Research, 11(7) (2001) 1175-1186 114. W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussier, The Human genome browser at UCSC, Genome Research, 12(6) (2002) 996-1006 115. A. F. A. Smit, and P. Green, RepeatMasker at http://repeatmasker.org, 2004

100 116. A. F. A. Smit, and P. Green, P RepeatMasker, http://ftp.genome.washington.edu/RM/RepeatMasker.html, 2003 117. J. A. Bedell, L. Korf, and W. Gish, MaskerAid: a performance enhancement to RepeatMasker, Bioinformatics, 16(11) (2000) 1040-1041 118. W. J. Ewens and G. R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, (2001) 119. T. Beissbarth, and T. P. Speed, GOstat: Find statistically overrepresented Gene Ontologies within a group of genes, Bioinformatics, 20(9) (2004) 1464-1465 120. B. G. Tabachnick, and L. S. Fidell' Using multivariate statistics (3rd ed), New York, (1996) 121. R. L. Tatusov, and D. J. Lipman, DUST, http://www.ncbi.nlm.nih.gov/ search for DUST 122. T. Werner, S. Fessele, H. Maier, and P. J. Nelson, Computer modeling of promoter organization as a tool to study transcriptional coregulation, The Federation of American Societies for Experimental Biology (FASEB), 17 (2003) 1228-1237 123. S. Khorasanizadeh, and F. Rastinejad, Nuclear-receptor interactions on DNA-response elements, Trends in Biochemical Sciences, 26(6) (2001) 384-390 124. M. Beato, Gene regulation by steroid hormones, Cell' 56 (1989) 335-344 125. http://www.ncbi.nlm.nih .gov/BLASTIproducttable.shtml#degenerate 126. H. Lodish, A. Berk, S. L. Zipursky, P. Matsudaira, D. Baltimore, and J. Darnell, Molecular Cell Biology, 4th edition, W. H. Freeman and Company, (2001) 127. Gene transcription, http://www.web-books.com/MoBio/Free/Contents.htm 128. http://www.genomatix.de/products/Matinspector/optthresh_example.html 129. M. Noellenburg, Locating Glucocorticoid Receptor Elements in the Mouse Genome, Course Project Report, McGili Center for Bioinformatics, McGiII University, (2003) 130. H. Sakoda, Y. Gotoh, H. Katagiri, M. Kurokawa, H. Ono, Y. Onishi, M. Anai, T. Ogihara, M. Fujishiro, Y. Fukushima, M. Abe, N. Shojima, M. Kikuchi, Y. Oka, H. Hirai, and T. Asano, Differing Roles of Akt and Serum- and Glucocorticoid-

101 regulated Kinase in Glucose Metabolism, DNA Synthesis, and Oncogenic Activity, Journal of Biological Chemistry, 278(28) (2003) 25802-25807 131. http://www.geneontology.org/

Appendix A: The BIAS Database

A.1: ER Diagram of GRE Module

[Figure: ER diagram of the GRE module. Entities legible in the diagram include GENE_ANNOTATION, CHROMOSOME, CHROMOSOME_FRAGMENT, GENE_HOMOLOGY, GRE_ACCESSORY_NF_1, GRE_GENE, ZERO_PSSM, HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT, and GRE_AVIDALIGNMENT_MAP. Their attributes are tabulated in Section A.2.]

A.2: Related Tables in BIAS Database

Table: CHROMOSOME

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| CHROMOSOME_ID | Integer | Yes | No | / | Unique identifier |
| SPECIES_ID | Integer | No | Yes | SPECIES | Primary key of table SPECIES |
| NAME | String | No | No | / | Chromosome name |
| SOURCE | String | No | No | / | |
| SEQUENCE | Long string | No | No | / | |

Table: CHROMOSOME_FRAGMENT

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| CHROMOSOME_FRAG_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| STARTING_POSITION | Integer | No | No | / | Start position of the 5' end of the fragment on the full chromosome |
| ENDING_POSITION | Integer | No | No | / | End position of the 3' end of the fragment on the full chromosome |
| SEQUENCE | Long string | No | No | / | Nucleotides of this part |

Table: GENE

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GENE_ID | Integer | Yes | No | / | Unique identifier |
| NAME | String | No | No | / | Gene's systematic name |
| SPECIES_ID | Integer | No | Yes | SPECIES | Primary key of table SPECIES |

Table: LOCUS

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| LOCUS_ID | Integer | Yes | No | / | Unique identifier |
| GENE_ID | Integer | No | Yes | GENE | Primary key of table GENE |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| SEQUENCE_START | Integer | No | No | / | Start position of the 5' end of the gene |
| SEQUENCE_STOP | Integer | No | No | / | Position of the 3' end of the gene |
| DNA_SEQUENCE | Long string | No | No | / | Complete DNA sequence |

Table: GENE_ANNOTATION

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GENE_ANNOTATION_ID | Integer | Yes | No | / | Unique identifier |
| GENE_ID | Integer | No | Yes | GENE | Primary key of table GENE |
| TYPE | String | No | No | / | Reference and/or additional name |
| SOURCE | String | No | No | / | Source of annotation |
| DESCRIPTION | String | No | No | / | Description of annotation |

Table: ZERO_PSSM

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| MATRIX_REF_ID | Integer | Yes | Yes | MATRIX_INFO | Primary key of table MATRIX_INFO |
| TECHNIQUE_ID | Integer | Yes | Yes | TECHNIQUE | Primary key of table TECHNIQUE |
| POSITION | Integer | Yes | No | / | Position index in the motif |
| A | Double | No | No | / | Weight value for nucleotide A |
| C | Double | No | No | / | Weight value for nucleotide C |
| G | Double | No | No | / | Weight value for nucleotide G |
| T | Double | No | No | / | Weight value for nucleotide T |

Table: GENE_HOMOLOGY

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GENE_HOMOLOGY_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | Yes | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| GENE_ID | Integer | Yes | Yes | GENE | Primary key of table GENE |
| HIT_CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| HIT_GENE_ID | Integer | No | Yes | GENE | Primary key of table GENE |
| DESCRIPTION | String | No | No | / | |

Table: GRE_HIT

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GRE_HIT_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| MATRIX_REF_ID | Integer | No | Yes | MATRIX_INFO | Primary key of table MATRIX_INFO (205 for GRE) |
| START_POS | Integer | No | No | / | Start position of the 5' end of the putative binding site (included in sequence on W strand and not included on C strand) |
| END_POS | Integer | No | No | / | End position of the 3' end of the putative binding site (included in sequence on C strand and not included on W strand) |
| TECHNIQUE_ID | Integer | No | No | / | Primary key of table TECHNIQUE (4 for PSSM) |
| SCORE | Double | No | No | / | Similarity score based on PSSM |
| SEQUENCE | String | No | No | / | Putative DNA binding elements |
| DESCRIPTION | String | No | No | / | |
| NEAREST_DISTANCE_TO_GENE | Integer | No | No | / | Distance to the nearest gene, whether upstream or downstream and on either the W or C strand |
| GENE_ID | Integer | No | Yes | GENE | Identifier of the gene corresponding to the nearest distance |
| REGION | Integer | No | No | / | Region that the putative TFBS belongs to |
| COOCCUR_DISTANCE_TO_NF1 | Integer | No | No | / | Distance to the nearest NF-1 binding site within 100 bp flanking on each side, on either the W or C strand |
| COOCCUR_DISTANCE_TO_C/EBPDELTA | Integer | No | No | / | Distance to the nearest C/EBPdelta binding site within 100 bp flanking on each side, on either the W or C strand |
| COOCCUR_DISTANCE_TO_CEBPBETA117 | Integer | No | No | / | Distance to the nearest C/EBPbeta binding site within 100 bp flanking on each side, on either the W or C strand |
| NEAREST_DISTANCE_TO_GENE_IN_UPSTREAM | Integer | No | No | / | Nearest distance to a gene, upstream only, on either the W or C strand |
| GENE_ID_IN_UPSTREAM | | | | GENE | Identifier of the gene corresponding to the nearest upstream distance |
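As an illustration of how the foreign key from GRE_HIT to GENE can be used, the following is a minimal sketch built on Python's standard sqlite3 module. The choice of SQLite, the reduced column set, the function name `genes_with_nearby_hits`, the score scale, and all inserted rows are assumptions made for this example; the appendix does not state which database engine BIAS runs on, and the rows are invented rather than taken from the thesis.

```python
import sqlite3

# Hypothetical subset of the BIAS schema; SQLite types stand in for the
# "Integer", "Double" and "String" attribute types listed in the tables.
SCHEMA = """
CREATE TABLE GENE (
    GENE_ID INTEGER PRIMARY KEY,
    NAME    TEXT
);
CREATE TABLE GRE_HIT (
    GRE_HIT_ID               INTEGER PRIMARY KEY,
    MATRIX_REF_ID            INTEGER,
    START_POS                INTEGER,
    END_POS                  INTEGER,
    SCORE                    REAL,
    NEAREST_DISTANCE_TO_GENE INTEGER,
    GENE_ID                  INTEGER REFERENCES GENE (GENE_ID)
);
"""


def genes_with_nearby_hits(conn, max_distance, min_score):
    """Return (gene name, hit count) for GRE hits passing both cutoffs."""
    rows = conn.execute(
        """
        SELECT g.NAME, COUNT(*)
        FROM GRE_HIT h JOIN GENE g ON g.GENE_ID = h.GENE_ID
        WHERE h.MATRIX_REF_ID = 205          -- 205 is the GRE matrix
          AND h.NEAREST_DISTANCE_TO_GENE <= ?
          AND h.SCORE >= ?
        GROUP BY g.NAME
        ORDER BY g.NAME
        """,
        (max_distance, min_score),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented example rows, not data from the thesis.
conn.executemany("INSERT INTO GENE VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
conn.executemany(
    "INSERT INTO GRE_HIT VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 205, 1000, 1016, 0.92, 350, 1),
        (2, 205, 5000, 5016, 0.81, 12000, 1),
        (3, 205, 9000, 9016, 0.88, 900, 2),
    ],
)
result = genes_with_nearby_hits(conn, max_distance=1000, min_score=0.85)
print(result)
```

Storing the nearest gene and its distance directly on each hit, as the table does, keeps this kind of query to a single join.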

Table: GRE_ACCESSORY_NF_1

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GRE_ACCESSORY_NF_1_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| MATRIX_REF_ID | Integer | No | Yes | MATRIX_INFO | Primary key of table MATRIX_INFO (193 for NF-1) |
| START_POS | Integer | No | No | / | Start position of the 5' end of the putative binding site (included in sequence on W strand and not included on C strand) |
| END_POS | Integer | No | No | / | End position of the 3' end of the putative binding site (included in sequence on C strand and not included on W strand) |
| TECHNIQUE_ID | Integer | No | No | / | Primary key of table TECHNIQUE (4 for PSSM) |
| SCORE | Double | No | No | / | Similarity score based on PSSM |
| SEQUENCE | String | No | No | / | Putative DNA binding elements |
| DESCRIPTION | String | No | No | / | |
| NEAREST_DISTANCE_TO_GENE | Integer | No | No | / | Distance to the nearest gene, whether upstream or downstream and on either the W or C strand |
| GENE_ID | Integer | No | Yes | GENE | Identifier of the gene corresponding to the nearest distance |
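The GRE_HIT table describes COOCCUR_DISTANCE_TO_NF1 as the nearest distance to an NF-1 binding site within 100 bp of flanking sequence on each side. A minimal sketch of that lookup is given below. The function name `nearest_cooccurrence`, the definition of distance as the gap between the two site intervals (0 when they overlap), and the use of `None` when no site falls inside the flank are all assumptions for illustration; the appendix does not specify them.

```python
def nearest_cooccurrence(gre, accessory_sites, flank=100):
    """Gap in bp from one pGRE to the closest accessory site within `flank`.

    `gre` and every accessory site are (start, end) pairs with start <= end,
    on the same chromosome and in the same coordinate system. Strand is
    ignored, matching the "on either the W or C strand" wording.
    """
    best = None
    for start, end in accessory_sites:
        if end < gre[0]:        # accessory site lies entirely before the pGRE
            gap = gre[0] - end
        elif start > gre[1]:    # accessory site lies entirely after the pGRE
            gap = start - gre[1]
        else:                   # intervals overlap
            gap = 0
        if gap <= flank and (best is None or gap < best):
            best = gap
    return best


# Invented coordinates for illustration only.
gre_site = (1000, 1016)
nf1_sites = [(900, 918), (1050, 1068), (2000, 2018)]
print(nearest_cooccurrence(gre_site, nf1_sites))
print(nearest_cooccurrence(gre_site, [(2000, 2018)]))
```

A linear scan is enough for a sketch; with chromosome-scale site lists, sorting the accessory sites and using binary search would avoid comparing every pair.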

Table: GRE_ACCESSORY_C_EBPbeta

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GRE_ACCESSORY_C_EBPbeta_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| MATRIX_REF_ID | Integer | No | Yes | MATRIX_INFO | Primary key of table MATRIX_INFO (117 for C/EBPbeta) |
| START_POS | Integer | No | No | / | Start position of the 5' end of the putative binding site (included in sequence on W strand and not included on C strand) |
| END_POS | Integer | No | No | / | End position of the 3' end of the putative binding site (included in sequence on C strand and not included on W strand) |
| TECHNIQUE_ID | Integer | No | No | / | Primary key of table TECHNIQUE (4 for PSSM) |
| SCORE | Double | No | No | / | Similarity score based on PSSM |
| SEQUENCE | String | No | No | / | Putative DNA binding elements |
| DESCRIPTION | String | No | No | / | |
| NEAREST_DISTANCE_TO_GENE | Integer | No | No | / | Distance to the nearest gene, whether upstream or downstream and on either the W or C strand |
| GENE_ID | Integer | No | Yes | GENE | Identifier of the gene corresponding to the nearest distance |

Table: GRE_ACCESSORY_C_EBPdelta

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GRE_ACCESSORY_C_EBPdelta_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| MATRIX_REF_ID | Integer | No | Yes | MATRIX_INFO | Primary key of table MATRIX_INFO (621 for C/EBPdelta) |
| START_POS | Integer | No | No | / | Start position of the 5' end of the putative binding site (included in sequence on W strand and not included on C strand) |
| END_POS | Integer | No | No | / | End position of the 3' end of the putative binding site (included in sequence on C strand and not included on W strand) |
| TECHNIQUE_ID | Integer | No | No | / | Primary key of table TECHNIQUE (4 for PSSM) |
| SCORE | Double | No | No | / | Similarity score based on PSSM |
| SEQUENCE | String | No | No | / | Putative DNA binding elements |
| DESCRIPTION | String | No | No | / | |
| NEAREST_DISTANCE_TO_GENE | Integer | No | No | / | Distance to the nearest gene, whether upstream or downstream and on either the W or C strand |
| GENE_ID | Integer | No | Yes | GENE | Identifier of the gene corresponding to the nearest distance |

Table: HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| GENE_ID | Integer | No | Yes | GENE | Primary key of table GENE |
| ALIGNMENT_START | Integer | No | No | / | Start position of 5' end of qualified aligned sequence |
| ALIGNMENT_END | Integer | No | No | / | Position of 3' end of qualified aligned sequence |
| SEQUENCE | String | No | No | / | Qualified aligned sequence |
| HIT_CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| HIT_GENE_ID | Integer | No | Yes | GENE | Primary key of table GENE |
| HIT_ALIGNMENT_START | Integer | No | No | / | Start position of 5' end of qualified hit aligned sequence |
| HIT_ALIGNMENT_END | Integer | No | No | / | End position of 3' end of qualified hit aligned sequence |
| HIT_SEQUENCE | String | No | No | / | Hit qualified aligned sequence |
| NUMBER_OF_GAPS | Integer | No | No | / | Total number of gaps on both aligned sequences |
| NUMBER_OF_MISMATCH | Integer | No | No | / | Total number of mismatches on qualified alignment |
| NUMBER_OF_MAXIMUM_GAPS_ALLOWED | Integer | No | No | / | One of the parameters for parsing AVID alignment |
| NUMBER_OF_MINIMUM_ALIGNMENT_LENGTH_ALLOWED | Integer | No | No | / | One of the parameters for parsing AVID alignment |
| LENGTH_OF_UPSTREAM_REQUIRED | Integer | No | No | / | One of the parameters for AVID alignment |
| LENGTH_OF_DOWNSTREAM_REQUIRED | Integer | No | No | / | One of the parameters for AVID alignment |
| DESCRIPTION | String | No | No | / | |
| TYPE | String | No | No | / | Additional information |
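The NUMBER_OF_GAPS and NUMBER_OF_MISMATCH attributes summarise a pairwise aligned segment, and the two "ALLOWED" attributes record the cutoffs used when parsing the AVID output. The sketch below shows one plausible way to derive such values from a pair of aligned strings. It assumes gaps are written as `-`, that gaps are counted per gap character across both sequences, and that a segment qualifies when it has at most the allowed number of gaps and at least the minimum length; the function names and these rules are assumptions, not the procedure documented in this appendix.

```python
def alignment_stats(seq_a, seq_b, gap="-"):
    """Return (gap characters in both sequences, mismatched columns)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    gaps = seq_a.count(gap) + seq_b.count(gap)
    mismatches = sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != gap and b != gap and a != b
    )
    return gaps, mismatches


def is_qualified(seq_a, seq_b, max_gaps, min_length):
    """Apply the two assumed parsing cutoffs to one aligned segment."""
    gaps, _ = alignment_stats(seq_a, seq_b)
    return gaps <= max_gaps and len(seq_a) >= min_length


# Invented aligned segment for illustration only.
query = "ACGT-ACGTA"
hit = "ACGTTAC-CA"
print(alignment_stats(query, hit))
print(is_qualified(query, hit, max_gaps=2, min_length=10))
```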

Table: GRE_GENE

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GRE_GENE_ID | Integer | Yes | No | / | Unique identifier |
| CHROMOSOME_ID | Integer | No | Yes | CHROMOSOME | Primary key of table CHROMOSOME |
| GENE_ID | Integer | No | Yes | GENE | Primary key of table GENE |
| GENE_START | Integer | No | No | / | Start position of 5' end of gene |
| GENE_END | Integer | No | No | / | Position of 3' end of gene |
| GRE_HIT_ID | Integer | No | Yes | GRE_HIT | The nearest GRE either upstream or downstream of the gene |
| GRE_START | Integer | No | No | / | Start position of 5' end of pGRE |
| GRE_END | Integer | No | No | / | Position of 3' end of pGRE |
| DISTANCE_TO_GENE_UPSTREAM | Integer | No | No | / | Distance to its gene if it is upstream |
| DISTANCE_TO_GENE_DOWNSTREAM | Integer | No | No | / | Distance to its gene if it is downstream |

Table: GRE_AVIDALIGNMENT_MAP

| Attributes | Attribute type | Primary key | Foreign reference | Foreign table | Description |
|---|---|---|---|---|---|
| GRE_AVIDALIGNMENT_MAP_ID | Integer | Yes | No | / | Unique identifier |
| HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT_ID | Integer | No | Yes | HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT | Primary key of table HOMOLOGY_GENE_UPSTREAM_DOWNSTREAM_ALIGNMENT |
| INCLUDED_PUTATIVE_GRE_ID_SET | String | No | No | / | Set of pGREs included in conserved alignment |
| INCLUDED_HIT_PUTATIVE_GRE_ID_SET | String | No | No | / | Set of pGREs included in conserved hit alignment |
| DESCRIPTION | String | No | No | / | |
| COOCCUR_TO_NF1_SET | String | No | No | / | Set of the nearest NF-1 binding sites for each pGRE within the conserved alignment, either upstream or downstream |
| COOCCUR_TO_C/EBPdelta_SET | String | No | No | / | Set of the nearest C/EBPdelta binding sites for each pGRE within the conserved alignment, either upstream or downstream |
| COOCCUR_TO_C/EBPbeta_SET | String | No | No | / | Set of the nearest C/EBPbeta binding sites for each pGRE within the conserved alignment, either upstream or downstream |

Appendix B: PWM Information in BIAS Database

TFBS of GR (GRE)

| Position | A | C | G | T | Ci |
|---|---|---|---|---|---|
| 1 | 0 | 0.10299401 | 0.61077843 | 0.28622756 | 35.5618642 |
| 2 | 0 | 0 | 0.71223021 | 0.28776979 | 56.7088557 |
| 3 | 0 | 0.22134387 | 0.09881423 | 0.6798419 | 40.4996922 |
| 4 | 0.8868421 | 0 | 0 | 0.1131579 | 74.5315867 |
| 5 | 0 | 0.79281437 | 0 | 0.20718563 | 63.1965686 |
| 6 | 0.70863308 | 0.29136692 | 0 | 0 | 56.4759778 |
| 7 | 0.58443114 | 0.11616767 | 0.19640718 | 0.10299401 | 19.3703967 |
| 8 | 0.30179641 | 0.09820359 | 0.2011976 | 0.39880239 | 7.76246014 |
| 9 | 0.31257484 | 0.37005987 | 0.31736528 | 0 | 20.9684049 |
| 10 | 0.19304556 | 0 | 0 | 0.80695444 | 64.6100377 |
| 11 | 0 | 0 | 1 | 0 | 100 |
| 12 | 0 | 0 | 0 | 1 | 100 |
| 13 | 0 | 0.32215569 | 0.09580838 | 0.58203593 | 34.7445185 |
| 14 | 0 | 1 | 0 | 0 | 100 |
| 15 | 0 | 0 | 0 | 1 | 100 |
| 16 | 0 | 0.16581197 | 0.44273504 | 0.39145299 | 26.0026715 |
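The appendix lists a Ci value for every matrix position but does not define it. The listed numbers appear consistent with a conservation index of the form Ci = 100 · (1 + Σ_b p_b · log4 p_b), where p_b is the frequency of base b at that position and 0 · log 0 is taken as 0. This reading is inferred from the values themselves and is not stated in the text. A minimal Python sketch that recomputes a few GRE entries under that assumption (the function name `conservation_index` is hypothetical):

```python
import math


def conservation_index(column):
    """Ci = 100 * (1 + sum_b p_b * log4(p_b)), skipping zero frequencies.

    `column` holds the (A, C, G, T) frequencies of one motif position.
    The formula is an inferred reading of the listed Ci values.
    """
    entropy_term = sum(p * math.log(p, 4) for p in column if p > 0)
    return 100.0 * (1.0 + entropy_term)


# (A, C, G, T) frequencies copied from the GRE matrix above.
gre_position_1 = (0.0, 0.10299401, 0.61077843, 0.28622756)
gre_position_4 = (0.8868421, 0.0, 0.0, 0.1131579)
gre_position_11 = (0.0, 0.0, 1.0, 0.0)

for label, column in [
    ("position 1", gre_position_1),
    ("position 4", gre_position_4),
    ("position 11", gre_position_11),
]:
    print(label, round(conservation_index(column), 4))
```

Under this reading, a fully conserved position scores 100 and a position with four equally frequent bases scores 0.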

TFBS of NF-1

| Position | A | C | G | T | Ci |
|---|---|---|---|---|---|
| 1 | 0.226667 | 0.24 | 0.186667 | 0.346667 | 1.932345 |
| 2 | 0.226667 | 0.32 | 0.186667 | 0.266667 | 1.404065 |
| 3 | 0.066667 | 0.146667 | 0.013333 | 0.773333 | 48.17659 |
| 4 | 0 | 0 | 0.013333 | 0.986667 | 94.8921 |
| 5 | 0 | 0 | 1 | 0 | 100 |
| 6 | 0.013333 | 0 | 0.986667 | 0 | 94.8921 |
| 7 | 0.066667 | 0.893333 | 0.026667 | 0.013333 | 68.58414 |
| 8 | 0.428571 | 0.185714 | 0.114286 | 0.271429 | 7.838033 |
| 9 | 0.306667 | 0.266667 | 0.266667 | 0.16 | 1.851465 |
| 10 | 0.213333 | 0.213333 | 0.426667 | 0.146667 | 5.928231 |
| 11 | 0.266667 | 0.253333 | 0.24 | 0.24 | 0.070055 |
| 12 | 0.453333 | 0.093333 | 0.253333 | 0.2 | 9.851867 |
| 13 | 0.133333 | 0.226667 | 0.36 | 0.28 | 4.110251 |
| 14 | 0.133333 | 0.546667 | 0.12 | 0.2 | 15.23346 |
| 15 | 0.146667 | 0.533333 | 0.16 | 0.16 | 13.20571 |
| 16 | 0.4 | 0.24 | 0.12 | 0.24 | 5.794627 |
| 17 | 0.36 | 0.16 | 0.28 | 0.2 | 3.38809 |
| 18 | 0.293333 | 0.186667 | 0.266667 | 0.253333 | 0.932165 |

TFBS of C/EBPbeta

| Position | A | C | G | T | Ci |
|---|---|---|---|---|---|
| 1 | 0.352941176 | 0.117647059 | 0.235294118 | 0.294117647 | 4.801609346 |
| 2 | 0.117647059 | 0.117647059 | 0.352941176 | 0.411764706 | 10.80703282 |
| 3 | 0.352941176 | 0.235294118 | 0.294117647 | 0.117647059 | 4.801609346 |
| 4 | 0 | 0 | 0.058823529 | 0.941176471 | 83.86215206 |
| 5 | 0 | 0 | 0 | 1 | 100 |
| 6 | 0.176470588 | 0 | 0.764705882 | 0.058823529 | 51.09922217 |
| 7 | 0.058823529 | 0.823529412 | 0.117647059 | 0 | 58.28264885 |
| 8 | 0.352941176 | 0.117647059 | 0.352941176 | 0.176470588 | 6.728144734 |
| 9 | 0.235294118 | 0.470588235 | 0 | 0.294117647 | 23.89050639 |
| 10 | 0.705882353 | 0.235294118 | 0.058823529 | 0 | 45.68435796 |
| 11 | 0.882352941 | 0.058823529 | 0 | 0.058823529 | 67.98967833 |
| 12 | 0.117647059 | 0.411764706 | 0.117647059 | 0.352941176 | 10.80703282 |
| 13 | 0.176470588 | 0.235294118 | 0.117647059 | 0.470588235 | 9.611821179 |
| 14 | 0.235294118 | 0.294117647 | 0.294117647 | 0.176470588 | 1.433235735 |

TFBS of C/EBPdelta

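| Position | A | C | G | T | Ci |
|---|---|---|---|---|---|
| 1 | 0.4375 | 0.375 | 0.1875 | 0 | 24.73795925 |
| 2 | 0.875 | 0 | 0 | 0.125 | 72.82177784 |
| 3 | 0 | 0 | 0 | 1 | 100 |
| 4 | 0 | 0 | 0.0625 | 0.9375 | 83.13549667 |
| 5 | 0.1875 | 0 | 0.5 | 0.3125 | 26.13914993 |
| 6 | 0 | 1 | 0 | 0 | 100 |
| 7 | 0.25 | 0.125 | 0.4375 | 0.1875 | 7.519912364 |
| 8 | 0.0625 | 0.25 | 0.125 | 0.5625 | 20.40414067 |
| 9 | 0.375 | 0.5 | 0 | 0.125 | 29.71804689 |
| 10 | 0.75 | 0.1875 | 0 | 0.0625 | 49.29511722 |
| 11 | 0 | 0.5 | 0.0625 | 0.4375 | 36.41088892 |
| 12 | 0.125 | 0.375 | 0 | 0.5 | 29.71804689 |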
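The matrices above carry both base frequencies and a Ci weight per position, which is the information needed for a Ci-weighted similarity score in the style of MatInspector's matrix similarity: the Ci-weighted sum of the frequencies of the observed bases, divided by the same sum for the best-scoring base at every position. Whether the SCORE columns of the BIAS tables were computed exactly this way is an assumption; the sketch below only illustrates the general scheme, using the C/EBPdelta values, and the names `matrix_similarity` and `consensus` are hypothetical.

```python
# Frequencies and Ci values copied from the C/EBPdelta matrix above
# (one list entry per motif position, 1 to 12).
C_EBP_DELTA = {
    "A": [0.4375, 0.875, 0, 0, 0.1875, 0, 0.25, 0.0625, 0.375, 0.75, 0, 0.125],
    "C": [0.375, 0, 0, 0, 0, 1, 0.125, 0.25, 0.5, 0.1875, 0.5, 0.375],
    "G": [0.1875, 0, 0, 0.0625, 0.5, 0, 0.4375, 0.125, 0, 0, 0.0625, 0],
    "T": [0, 0.125, 1, 0.9375, 0.3125, 0, 0.1875, 0.5625, 0.125, 0.0625, 0.4375, 0.5],
}
C_EBP_DELTA_CI = [
    24.73795925, 72.82177784, 100, 83.13549667, 26.13914993, 100,
    7.519912364, 20.40414067, 29.71804689, 49.29511722, 36.41088892,
    29.71804689,
]


def matrix_similarity(site, matrix, ci):
    """Ci-weighted score of `site`, scaled so the best possible site gives 1."""
    site = site.upper()
    if len(site) != len(ci):
        raise ValueError("site length must equal the motif length")
    observed = sum(ci[j] * matrix[base][j] for j, base in enumerate(site))
    best = sum(
        ci[j] * max(row[j] for row in matrix.values()) for j in range(len(ci))
    )
    return observed / best


def consensus(matrix):
    """Most frequent base at each position (first base wins a tie)."""
    length = len(next(iter(matrix.values())))
    return "".join(
        max(matrix, key=lambda base: matrix[base][j]) for j in range(length)
    )


best_site = consensus(C_EBP_DELTA)
print(best_site, matrix_similarity(best_site, C_EBP_DELTA, C_EBP_DELTA_CI))
print(matrix_similarity("GGGGGGGGGGGG", C_EBP_DELTA, C_EBP_DELTA_CI))
```

Weighting by Ci means that a mismatch at a highly conserved position, such as positions 3 and 6 here, lowers the score far more than one at a variable position.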
