
Developing the Cis-Regulatory Association Model (CRAM) to Identify Combinations of Transcription Factors in ChIP-Seq Data

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Brian Kennedy, B.S.

Department of Computer Science and Engineering

The Ohio State University

2010

Thesis Committee:

Victor X. Jin, Advisor

Raghu Machiraju

Copyright by

Brian Kennedy

2010

Abstract

There are approximately 2,600 human transcription factors (TFs) which may cis-regulate the expression of proximal genes. These TFs may further interact with one another and exhibit different behavior in combination than individually, forming cis-regulatory modules (CRMs). Even simple 2 and 3 TF combinations could form over 2.9 billion different cis-regulatory modules. Testing the functionality of these modules experimentally will be a massive undertaking. CRAM, the Cis-Regulatory Association Modeler, predicts functional regulatory modules in silico using TFs found in sequences searched for TF motifs defined by Position Weight Matrices. This technique targets ChIP-seq data and finds CRMs which are over-represented in the target sequences compared to a random background, or another contrasting sample of sequences, by using contrast frequent item-set mining in the experimental ChIP-seq peaks and the control sample. The error with which these CRMs may be separated from the random background by a variety of features (degree of motif matching, relative position, and genetic conservation) is used to determine which CRMs are truly specific to the experimental ChIP-seq sample. Feed-forward neural networks are used to learn the function which specifies the classifiability of each CRM and calculate the error with which they are compared. Several other programs use a comparable approach; however, the application of neural networks specifically and contrast item-set mining is novel.

Dedicated to my mother, father, and brother,

for all of their love and support.

Acknowledgments

I have many people to thank for my making it this far: my advisor, Dr. Victor Jin, for everything he's done; Dr. Raghu Machiraju, for his counsel and support; all of my lab mates, for their knowledge, assistance, and encouragement; and the incredible Biomedical Informatics Department staff for everything they do.

Vita

2003 Memphis Central High School

2008 B.S. Computer Science, University of Memphis

2009 Transferred from M.S. , University of Memphis

2009-Present M.S. Computer Science & Engineering, The Ohio State University

Publications

Kennedy BA, Gao W, Huang TH, Jin VX (2009) HRTBLDb: an informative data resource for hormone receptors target binding loci. Nucleic Acids Res. 38:D676-681

Bapat SA, Jin V, Berry N, Balch C, Sharma N, Kurrey N, Zhang S, Fang F, Lan X, Li M, Kennedy BA, Bigsby RM, Huang TH, Nephew KP (2010) Multivalent epigenetic marks confer microenvironment-responsive epigenetic plasticity to ovarian cancer cells. Epigenetics 5(8):716-729

Fields of Study

Major: Computer Science & Engineering

Machine applied in Bioinformatics

Table of Contents

Abstract
Acknowledgments
Vita
Table of Contents
List of Illustrations
List of Tables
Lists of Symbols & Abbreviations

Chapter 1: Introduction
1.1 Biological Background of the Question
1.2 Research Question & Goals
1.3 Prior Art in CRM Prediction
1.4 Overview of the Solution
1.5 Overview of the Thesis

Chapter 2: Algorithms and Design
2.1 Input Data
2.2 Transcription Factor Representation
2.3 Transcription Factor Search
2.4 Genetic Conservation Measurement
2.5 Transcription Factor Cartization
2.6 Contrast Frequent Item-set Mining
2.7 Artificial Neural Networks
2.8 Output Data

Chapter 3: Comparing Transcription Factors in K562 cells
3.1 Data Source
3.2 Data summary
3.3 Biological background
3.4 Method
3.5 Analysis
3.6 Discussion

Chapter 4: Contrasting TGF-β Treatment in A2780
4.1 Data source
4.2 Data summary
4.3 Biological background
4.4 Method
4.5 Analysis
4.6 Discussion

Chapter 5: Conclusions
5.1 Address of the Research Question
5.2 Discussion of CRAM
5.3 Future Work

Bibliography

Appendix
A. FASTA format
B. BED format
E. Output format
F. JSON format
G. Financial support
H. License Citations

List of Illustrations

Illustration 1: A TATA-binding protein in blue ligating to a DNA strand in red.

Illustration 2: Cis-regulatory modules consist of several proximal transcription factors ligating near a gene and acting in concert to regulate the gene's expression.

Illustration 3: A broad overview of the CRAM work-flow, highlighting the most significant inputs, transformations, and outputs at each step from start to end.

Illustration 4: The ChIP-seq protocol in its essential steps (graphic used with permission, see Appendix section H).

Illustration 5: A graphical representation of the motif for the protein Ar.

Illustration 6: Position Weight Matrix Score S(p,n) is the score at position p for nucleotide n in a PWM where PFM(p) is the row of nucleotides at position p in a PFM with size number of rows.

Illustration 7: A) A position frequency matrix for the TF Ar, Androgen Receptor. B) The position weight matrix for the position frequency matrix in (A).

Illustration 8: Precomputed alignments are used to quickly estimate the conservation of a given sequence in multiple target species. These steps follow from a conceptual visualization of the source information in (1) to a representation of the precomputed alignment of such in (2) to the formula for computing the conservation metric for a given query sequence in (3).

Illustration 9: Neighboring TFs inside the acceptance window and outside the rejection window are formed into one 'cart' per TF. In practice the rejection window is set to 0 nt from the edges of the current motif and the acceptance window varies based on the user's desire for tightly or loosely clustered CRMs.

Illustration 10: An example of cart expansion. A) A list of TFs predicted by PWM search in a target sequence. B) What the set of TFs in this cart would look like; note that the information on the duplicate GATA2 and USF1 will be lost. C) For any subset of this cart which contained only unique TFs, members of such sets must share an edge in the graph in (C). D) A list of the carts needed to represent the distinct combinations of separate motifs which are of the same TF type.

Illustration 11: Three frequent and one infrequent one-TF sets followed by all possible two-TF and three-TF sets. The property of downward closure illustrated above shows the manner in which Apriori improves frequent set mining. In the one-TF sets, GATA2, USF1, and NFATC1 are frequent and GABP does not have the minimum support, thus the two and three sets with GABP need never be calculated.

Illustration 12: A feed-forward, 3 layer perceptron network. This neural network's purpose is to classify an exemplar of two inputs into binary categories.

Illustration 13: Each TF comes from a separate ChIP-seq experiment, so some variance is expected. Peak calculation for each sample was conducted using identical parameters, and the larger differences are due to naturally differing preponderance.

Illustration 14: A) The distribution over all the peaks, for each TF, of that peak's midpoint relative to the nearest gene. B) The same distribution with each region, 5' Distal to 3' Distal, normalized by its size, showing density in peaks/1 knt rather than absolute quantity.

Illustration 15: Overlaps between the 23 TFs measured in the K562 cell line. Each column contains the overlaps for that TF with the other TFs in the rows, meaning that the peaks being counted are from the source TF's sample. The data in orange along the diagonal are not overlaps, but the number of peaks from that TF's data set, provided for reference. The values highlighted in blue are the maximal absolute overlap for a given column in the table. The columns following MAX are continued in Illustration 15.

Illustration 16: A) The distribution over all SMAD4 peaks of that peak's midpoint relative to the nearest gene. B) The same distribution with each region, 5' Distal to 3' Distal, normalized by its size, showing density in peaks/1 knt rather than absolute quantity.

List of Tables

Table 1: The number of peaks calculated

Table 2: The top 5 2-TF CRMs from each sample, identified in the "Source" column which contains the source TF in the CRM. The other TF in the CRM is listed in Co-factor. The number of Experimental and Control sample instances of each CRM are listed in the Expr and Ctrl CRMs columns respectively. Highlighted in blue are modules which are also a top ranked overlap between two TFs.

Table 3: TGF-b Treated contrasted with TGF-b Untreated. Above are the predicted modules with NRMSE < 0.1.

Table 4: TGF-b Treated contrasted with a random background sample.

Table 5: TGF-b Untreated contrasted with a random background sample.

Lists of Symbols & Abbreviations

CRAM – Cis-Regulatory Association Model, the name of the project.

ANN – Artificial Neural Network

ChIP – Chromatin Immunoprecipitation

ChIP-Seq – ChIP-Sequence

CRM – Cis-Regulatory Module

DBD – DNA Binding Domain

DNA – Deoxyribonucleic Acid

knt – kilo-nucleotides, 1,000 nucleotides of DNA or RNA

NRMSE – Normalized Root Mean Squared Error

PFM – Position Frequency Matrix

PSSM – Position Specific Scoring Matrix

PWM – Position Weight Matrix

RNA – Ribonucleic Acid

TF – Transcription Factor

Chapter 1: Introduction

Presented here is an investigation into the use of frequent item-set mining and neural network classification for the prediction of functional cis-regulatory elements in ChIP-sequence data. Section 1.1 of this chapter contains summary background information on the biological nature of this question. Section 1.2 itemizes the central question and goals of this research. Section 1.3 summarizes the prior art in this research question. Section 1.4 presents a summary overview of the solution to the research question, and section 1.5 is an organizational overview of the thesis itself.

1.1 Biological Background of the Question

Transcription Factors are a family of proteins with a DNA-binding domain, a structure with an affinity for ligating to a specific nucleotide sequence on a strand of DNA, as in Illustration 1, and which may regulate genes proximal to their ligation site [1] [2]. TFs are often present in the cellular cytoplasm and nucleus along with many other significant regulatory proteins; however, it is the DBD which distinguishes a TF from other families of regulators. The regulatory effect of a TF will consist of either enhancing or repressing the proximal gene's transcription into RNA through interactions with the RNA polymerase. As such, TFs are considered cis-acting proteins, as opposed to trans-acting proteins which would indirectly regulate the expression of a gene, and this regulatory action may be caused by one TF or a group of several TFs near the same site acting in concert.

Illustration 1: A TATA-binding protein in blue ligating to a DNA strand in red.

TFs are an essential component of gene regulation, being found in all organisms. Due to the importance of their function, many TFs are evolutionarily conserved between species. It has been shown that the total number of transcription factors and the average density of transcription factors near a gene increase with the size of a genome [3]. There are approximately 2,600 human transcription factors which may cis-regulate the expression of proximal genes [4]. The human genome is relatively large at 32,000 transcripts compared to other species, such as E. coli, S. cerevisiae, and A. thaliana with 4,300, 6,400 and 27,800 transcripts respectively, and H. sapiens has been observed to have a relatively large number of transcription factors per transcript [4]. It can therefore be expected that the difficulty of mapping the relationships between these transcription factors and the transcripts which they regulate will be above rather than below the average for other species.

A cluster of TFs, ligated within a relatively short region, which have a different regulatory effect on a gene when combined than when separate is called a Cis-Regulatory Module. As seen in Illustration 2, a CRM is a specific, repeatedly occurring set of TFs clustered within a few hundred nucleotides of one another [5]. CRMs may provide a stronger regulatory effect than any of the constituent TFs individually. The constituent TFs of a CRM must be present within the CRM at the same time in order to function, and when this occurs, they generally perform one of three functions: enhancing the expression of the proximal gene, repressing the expression of the proximal gene, or acting as a separator between other regulatory processes without having a direct regulatory effect on the gene.

Illustration 2: Cis-regulatory modules consist of several proximal transcription factors ligating near a gene and acting in concert to regulate the gene's expression.

Presently, CRMs which regulate the expression of a gene directly are of the most interest; however, all are significant, and since they share the same basic structure, all of them are amenable to analyses which depend only on such structure. Chromatin structure is also an important factor in CRM function; however, it remains a challenge due to the need for targeted biological experimentation to measure the factors of chromatin structure, which also vary over time. Even counting only simple, unordered two and three-set combinations of these transcription factors results in over 2.9 billion putative CRMs. Biological experimentation to validate the functionality of each one at the sites at which it occurs would be a massive undertaking; therefore, there have been several attempts to computationally predict CRMs using in silico analysis.
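The figure of over 2.9 billion putative CRMs can be checked with a short calculation. The sketch below assumes exactly 2,600 TFs (the approximate count quoted above) and counts unordered sets of two and three distinct TFs; the variable names are illustrative only.

```python
from math import comb

# Approximate number of human TFs quoted in the text (an assumption of
# this sketch; the true count is only known approximately).
N_TFS = 2600

pairs = comb(N_TFS, 2)    # unordered two-TF sets
triples = comb(N_TFS, 3)  # unordered three-TF sets
total = pairs + triples

print(f"2-TF sets: {pairs:,}")    # 3,378,700
print(f"3-TF sets: {triples:,}")  # 2,925,954,200
print(f"total:     {total:,}")    # 2,929,332,900
```

Nearly all of the total comes from the three-TF sets, so allowing larger modules would grow the search space faster still.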

Cis-regulatory modules are a significant regulatory mechanism in the human transcriptome [6]. The presence of a functional CRM, rather than the individual motifs that comprise it, may have a different effect on the target genes of the CRM than its constituent TFs do [5]. For this reason, identifying possible CRMs along the genome is highly important. Putative CRMs can be tested in biological experiments to determine their function; however, there are far too many possible CRMs to make the experimental validation of all of them a feasible near-term task. In parallel with this biological effort, many have attempted to predict CRMs algorithmically on a genome-wide scale.

1.2 Research Question & Goals

Which transcription factors commonly interact to form cis-regulatory modules?

The goals of this thesis are to present the following contributions to the study of this question:

1. A method of applying frequent item-set mining in ChIP-sequence data to identify common putative CRMs.

2. A method of applying neural network classification of transcription factor motif characteristics and genetic conservation to distinguish CRMs in the area of ChIP-seq observed protein binding from CRMs which may occur elsewhere.

3. The material contribution of a software implementation of these techniques with which to test their efficacy and provide such analysis on novel data sets.

4. An investigation into the validity of these techniques by applying them to ChIP-sequence data in two distinct types of analysis: interactions between 23 transcription factors in K562 cells and differences between SMAD4 binding in TGF-beta treated and untreated A2780 cells.

1.3 Prior Art in CRM Prediction

There are two significant classes of algorithm to predict CRMs. The first relies on the assumption of prevalence of a functional CRM and uses sequence data to identify significantly prevalent clusters of TFs. For example, Stubb uses hidden Markov models and expectation maximization to identify statistically significant regulatory modules [7]. Stubb requires that TF associations be known beforehand, and it uses this prior knowledge along with genetic conservation information to build an HMM whose parameters are set by EM to generate probable CRMs. ChIPModules finds TFs using PWMs in target sequences and uses classification trees to build putative CRMs [8]. The second class of algorithms relies on the assumption of expression regulation by a functional CRM and an assumption that co-expressed genes imply a common co-regulator. This second class uses expression data to infer the existence of CRMs and identifies TF clusters that co-expressed genes have in common, as in the ModuleSearcher algorithm [9]. ModuleMaster also uses the assumption of co-expression and a database of known TFs; however, its core algorithm is presently unpublished [10]. CRÉME takes a given list of genes to investigate and calculates the probability that 2-TF clusters occur significantly more frequently in the regions a few kilo-nucleotides (knt) upstream of the transcription start sites of the given actually or potentially co-expressed genes [11].

1.4 Overview of the Solution

CRAM predicts functional regulatory modules in silico using TFs found in ChIP-seq data. It compares two sets of data and provides an analysis of the CRMs which are differentiable between the two. The experimental set will be a ChIP-seq experiment, and the control set will either be another ChIP-seq experiment or a representative randomly distributed background sample. These sequences are searched with Position Weight Matrices to find putative TF ligation sites. The sequences are scored for their evolutionary conservation between the source species and 7 other species. The motifs in each input sequence are grouped into “carts” which are amenable to frequent item-set mining. This is done by considering, for each TF, which of its neighbors could possibly interact with it at a CRM, and if they are not already in a cart together, they are added to one. CRAM then finds CRMs which are over-represented in the experimental carts by using contrast frequent item-set mining in the experimental carts and the control carts. The error with which these CRMs may be separated between the experimental set and the control set by classification based on their PWM scores and sequence conservation values is used to determine which CRMs are truly specific to the experimental ChIP-seq sample. Feed-forward neural networks are used to learn the function which specifies the differentiability of each CRM and calculate the error with which they are classified into the experimental and control sets. Several other programs use a comparable approach; however, the application of neural networks specifically and contrast item-set mining is novel.
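The idea of finding TF sets that are over-represented in the experimental carts relative to the control carts can be illustrated with a toy sketch. It is not CRAM's implementation: it counts support by brute force rather than with Apriori pruning, and the cart contents, the `min_support` and `min_ratio` thresholds, and the function names are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def support(carts, size):
    """Fraction of carts that contain each TF combination of the given size."""
    counts = Counter()
    for cart in carts:
        for combo in combinations(sorted(cart), size):
            counts[combo] += 1
    return {combo: n / len(carts) for combo, n in counts.items()}


def contrast_sets(experimental, control, size=2, min_support=0.5, min_ratio=2.0):
    """TF sets frequent in the experimental carts and rarer in the control carts.

    min_support and min_ratio are hypothetical thresholds for this sketch.
    """
    exp = support(experimental, size)
    ctl = support(control, size)
    found = []
    for combo, exp_support in exp.items():
        ctl_support = ctl.get(combo, 0.0)
        over_represented = ctl_support == 0.0 or exp_support / ctl_support >= min_ratio
        if exp_support >= min_support and over_represented:
            found.append((combo, exp_support, ctl_support))
    return sorted(found)


# Toy carts (sets of TF names); the contents are invented for illustration.
experimental = [
    {"GATA2", "USF1", "NFATC1"},
    {"GATA2", "USF1"},
    {"GATA2", "USF1", "GABP"},
    {"NFATC1", "USF1"},
]
control = [
    {"GATA2", "NFATC1"},
    {"USF1", "GABP"},
    {"GATA2", "USF1"},
    {"NFATC1"},
]

result = contrast_sets(experimental, control)
for combo, exp_support, ctl_support in result:
    print(combo, exp_support, ctl_support)
```

With these toy carts, the pair GATA2 and USF1 has support 0.75 in the experimental carts against 0.25 in the control carts, and the pair NFATC1 and USF1 has support 0.5 against 0.0, so both are reported.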

Taken as a whole, each of the methods in Prior Art has one or more significant limitations, and while CRAM does as well, they are not the same limitations. For example, CRAM does not depend on the assumption of co-expression of genes near the same CRMs, nor on the CRMs having a particular distribution around genes, nor on prior knowledge of TF association. While each of these things has its merits for CRM detection, they are limitations on the applicability of the techniques. CRAM uses unbiased biological data on TF binding activity to direct the search for CRMs which occur frequently and with differentiable PWM scores and genetic conservation when compared to a contrasting sample, which can either be a representatively distributed random background sample or another ChIP-seq experiment.

1.5 Overview of the Thesis

Chapter 2, Algorithms and Design, walks through the details of each basic component of CRAM in the sequence in which they are needed in the CRAM algorithm, with one section devoted to each. It gives a brief literature background for each methodology utilized by CRAM, the rationale for utilizing that methodology, and a description of its properties and implementation.

Chapter 3, Comparing Transcription Factors in K562 cells, is a comparison of the observed biological interactions, from overlapping TF ligation sites detected in independent ChIP-seq experiments, and the predicted interactions with each TF from that TF's ChIP-seq data contrasted with a representatively distributed random background sample. This will validate the utility of CRAM's core algorithms, frequent item-set mining and neural network classification, in identifying biologically real interactions rather than spurious clustering.

Chapter 4, Contrasting TGF-β Treatment in A2780, is an observational comparison of the results of contrasting dependent ChIP-seq samples of the same TF, SMAD4, taken from the same cell culture before and after treatment with an upstream inducer of SMAD4 ligation, TGF-β. This will elucidate the limits of this approach in classifying similar samples with potential differences.

Chapter 5, Conclusions, starts with a summary of the CRAM algorithm and its utility, and then discusses the results from Chapters 3 & 4 and their implications. A statement of the future work that can be done with this project is followed by the final concluding remarks.

Chapter 2: Algorithms and Design

This chapter will follow the basic components outlined in Illustration 3 in order, giving summary background, properties, and implementation details of each. The software is implemented entirely in Python with the exception of the optional Biopython module, an external dependency which enables the reading of a diversity of input file formats.

[Illustration 3 work-flow: Input sequences (positive and negative sets) → scan for motifs in the input, optionally scoring their conservation → motif positions and scores → identify frequent CRMs, with the Apriori algorithm, that are over-represented in the experimental set compared to the control set → over-represented CRMs → rank the CRMs by their error in cross validation in supervised neural network classification → ranked CRMs]

Illustration 3: A broad overview of the CRAM work-flow, highlighting the most significant inputs, transformations, and outputs at each step from start to end.

2.1 Input Data

Background

TF ligation is a highly important regulatory mechanism of genetic expression. Understanding the distribution of binding for a given protein is of critical importance in the inference of more complex regulatory networks and in explaining and understanding the regulatory mechanisms behind the differences in patients with differing clinical prognoses. Many different technologies have been applied to this problem of precisely locating TF binding sites; however, recently, the use of high-throughput short read sequencing has allowed highly precise location of the source of a given sample of DNA. This, combined with a technique for collecting fragments of DNA which have a target protein ligated to them, is an extremely adaptable technology for understanding the mechanisms behind critical cellular regulatory processes. The means for collecting such targeted DNA is a process called Chromatin Immunoprecipitation assay, ChIP [12]. Antibodies specifically designed to attach to a target protein are applied to fragments of DNA, and these fragments with attached antibodies are precipitated from the remainder of the solution of DNA fragments, producing a highly specific concentration of DNA fragments with protein ligation. Combined with high-throughput short read sequencing, the technology ChIP-sequencing is born [13], seen in Illustration 4.

Illustration 4: The ChIP-seq protocol in its essential steps (graphic used with permission, see Appendix section H).

These short reads, once sequenced, can be used to reconstruct the original protein ligated sequence fragments with careful statistical analysis. The short reads from the sequencing experiment will tend to accumulate in greater density from regions of the genome that contained a ligated fragment due to the ChIP step filtering out most of the un-ligated DNA fragments. Algorithms which can compute such sites do so by looking for areas of relatively high read density compared to the remainder of the genome, or 'peaks' in the read density. The recovered sequences from this process are often referred to as ChIP-seq peaks, and “peak” and the noun “sequence” can often be used interchangeably in this context.

Rationale

ChIP-seq data provides an unbiased biological context for the input data to the algorithm, as an alternative to applying assumptions about the behavior of TFs to a biased selection of sequences from the genome or attempting to generalize the behavior of TFs, which may not behave uniformly, across the whole genome. There is much stronger reason to believe that the biological activity of a given putative TF binding site or CRM of TFs is consistent within the proximity of observed binding of another protein than at a site selected by arbitrary criteria. CRAM requires both the genomic coordinates and nucleotide sequence of each input ChIP-seq sequence.

Implementation

The program takes 5 essential sets of input data: a library of position frequency matrices (discussed further in section 2.2), the experimental set's peak file and sequence file, and the control set's peak file and sequence file. The peak files are in BED format, as described in Appendix section B, and the sequence files for those peaks are in FASTA format, as described in Appendix section A. The coordinate files are used for the precomputed alignment conservation scoring, and the sequence files are searched for TF binding motifs, as described in sections 2.4 and 2.3 respectively.

2.2 Transcription Factor Representation

Background

The DBD of a TF ligates to one or more specific DNA sequences, but because the experiments which determine such DBDs are imperfect and leave some natural uncertainty, the binding site of a TF is modeled with a tolerance for some degree of error [14]. This is most commonly done with a position specific scoring matrix. A PSSM assigns a probability or 'score' to each nucleotide at each position of a finite length matrix representing the binding sites of a given ligand. There are two main types of PSSM in use today [15].

The first is the PFM, or Position Frequency Matrix. A PFM is a direct representation of a set of sequences which are being taken as an aggregate representation of the binding activity of a given ligand. The value of each nucleotide at a given position in a PFM is the number of occurrences of that nucleotide at that position in the set of sequences to which the DBD may ligate (i.e. the sum over all nucleotides at each position will equal the total number of DBD target sequences used to compute the PFM). This represents the likelihood of a nucleotide at a given position as a position specific linear combination independent of the other positions.

Illustration 5: A graphical representation of the motif for the protein Ar.
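The construction of a PFM from a set of aligned binding sites can be sketched in a few lines. The binding sites below are invented, and the row-per-position layout with columns ordered A, C, G, T is an assumption chosen to match the layout of Illustration 7.

```python
NUCLEOTIDES = "ACGT"  # column order assumed throughout this sketch


def build_pfm(sites):
    """Count nucleotide occurrences per position across equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    pfm = [[0, 0, 0, 0] for _ in range(length)]
    for site in sites:
        for position, nucleotide in enumerate(site.upper()):
            pfm[position][NUCLEOTIDES.index(nucleotide)] += 1
    return pfm


# Hypothetical aligned binding sites, for illustration only.
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
pfm = build_pfm(sites)

for position, row in enumerate(pfm, start=1):
    print(position, row)
# 1 [3, 0, 0, 1]
# 2 [0, 4, 0, 0]
# 3 [0, 0, 3, 1]
# 4 [1, 0, 0, 3]
```

Every row sums to the number of input sites (4 here), which is the property noted in the parenthetical above.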

The more common representation is a Position Weight Matrix, for which some kernel has been applied to transform the PFM into a different scoring space. This is most commonly done to accommodate the assumption that the positions in the input sequence are not uniformly significant; such non-uniformity is represented in the graphical logo for the Ar TF in Illustration 5.

Rationale

The PWM is the most commonly used means of representing the binding preferences of a protein DBD. Additionally, there are large sets of PFMs which represent known ligation sites for a large number of TFs. PFMs are easily converted into PWMs, and this allows us to access a large body of predictive information about the binding of TFs.

Implementation

The JASPAR [16] and TRANSFAC [17] databases were collected and the mammalian TF PFMs therein were selected for use in CRAM. The PFMs are stored in a format described in Appendix section C. These PFMs are read by CRAM and PWMs are calculated from them using the formula in Illustration 6. For an example of a PFM and its corresponding PWM, see Illustration 7.

\[
s_{p,n} = \frac{\mathrm{PFM}(p,n) - \operatorname{minimum}(\mathrm{PFM}(p))}{\mathit{max} - \mathit{min}}, \qquad
\mathit{min} = \sum_{p=1}^{\mathit{size}} \operatorname{minimum}(\mathrm{PFM}(p)), \qquad
\mathit{max} = \sum_{p=1}^{\mathit{size}} \operatorname{maximum}(\mathrm{PFM}(p))
\]

Illustration 6: Position Weight Matrix Score S(p,n) is the score at position p for nucleotide n in a PWM where PFM(p) is the row of nucleotides at position p in a PFM with size number of rows.
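The conversion in Illustration 6 can be sketched directly in Python. The PFM below is the Androgen Receptor matrix from Illustration 7 (A); the function name `pfm_to_pwm` and the list-of-rows layout are assumptions of this sketch rather than CRAM's actual interface.

```python
# Androgen Receptor PFM from Illustration 7 (A): one row per position,
# columns ordered A, C, G, T.
AR_PFM = [
    [9, 7, 2, 6], [9, 2, 3, 10], [11, 3, 3, 7], [16, 1, 7, 0],
    [0, 0, 22, 2], [12, 6, 1, 5], [21, 2, 0, 1], [0, 24, 0, 0],
    [15, 0, 8, 1], [4, 9, 2, 9], [5, 11, 2, 6], [6, 9, 9, 0],
    [3, 5, 1, 15], [0, 0, 24, 0], [4, 0, 0, 20], [11, 5, 1, 7],
    [1, 22, 1, 0], [3, 16, 1, 4], [6, 7, 5, 6], [6, 5, 9, 4],
    [10, 11, 0, 3], [5, 11, 6, 2],
]


def pfm_to_pwm(pfm):
    """Apply s(p, n) = (PFM(p, n) - minimum(PFM(p))) / (max - min)."""
    total_min = sum(min(row) for row in pfm)  # 'min' in Illustration 6
    total_max = sum(max(row) for row in pfm)  # 'max' in Illustration 6
    span = total_max - total_min
    if span == 0:
        raise ValueError("PFM carries no positional preference")
    return [[(count - min(row)) / span for count in row] for row in pfm]


pwm = pfm_to_pwm(AR_PFM)
print([round(score, 3) for score in pwm[0]])  # [0.024, 0.017, 0.0, 0.014]
print(round(sum(max(row) for row in pwm), 6))  # 1.0
```

The first printed row agrees with position 1 of the PWM in Illustration 7 (B). Because the scores are divided by max − min, the best possible match sums to 1 and the worst to 0, which makes a score threshold comparable across motifs of different lengths.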

2.3 Transcription Factor Search

Background

This is a straightforward application of PWMs and can easily be done on a large scale to develop a body of putative TF ligation sites to a specified degree of mismatch from the PWM [15]. A library of PFMs is converted into PWMs which are tested against every possible ligation site in the given sequences. An example PFM and its PWM are shown in Illustration 7. Since PFMs can be assembled from even a small number of example ligation sites for a protein, there are a great many such PFMs with which to find putative TFs [16], [17].

Androgen Receptor

        A – PFM                   B – PWM
Pos.    A    C    G    T          A      C      G      T
1       9    7    2    6      0.024  0.017  0.000  0.014
2       9    2    3   10      0.024  0.000  0.003  0.028
3      11    3    3    7      0.028  0.000  0.000  0.014
4      16    1    7    0      0.055  0.003  0.024  0.000
5       0    0   22    2      0.000  0.000  0.076  0.007
6      12    6    1    5      0.038  0.017  0.000  0.014
7      21    2    0    1      0.073  0.007  0.000  0.003
8       0   24    0    0      0.000  0.083  0.000  0.000
9      15    0    8    1      0.052  0.000  0.028  0.003
10      4    9    2    9      0.007  0.024  0.000  0.024
11      5   11    2    6      0.010  0.031  0.000  0.014
12      6    9    9    0      0.021  0.031  0.031  0.000
13      3    5    1   15      0.007  0.014  0.000  0.048
14      0    0   24    0      0.000  0.000  0.083  0.000
15      4    0    0   20      0.014  0.000  0.000  0.069
16     11    5    1    7      0.035  0.014  0.000  0.021
17      1   22    1    0      0.003  0.076  0.003  0.000
18      3   16    1    4      0.007  0.052  0.000  0.010
19      6    7    5    6      0.003  0.007  0.000  0.003
20      6    5    9    4      0.007  0.003  0.017  0.000
21     10   11    0    3      0.035  0.038  0.000  0.010
22      5   11    6    2      0.010  0.031  0.014  0.000

Illustration 7: A) A position frequency matrix for the TF Ar, Androgen Receptor. B) The position weight matrix for the position frequency matrix in (A).

Rationale

There are databases of manually curated, experimentally verified motif binding sites; however, these are sparse compared to the likely number of actual binding sites or even to the number of observed ligation events in a typical ChIP-seq experiment; therefore, they are inappropriate for large scale data mining in arbitrary ChIP-seq data sets. ChIP-seq data itself provides large scale empirical evidence of TF binding which could be used for this purpose; however, the precision of ChIP-seq measurements is not specific enough to identify the exact coordinates of the TF which is binding somewhere inside a ChIP-seq peak. Experimentally verified TF binding sites can, however, be used to construct PFMs which describe the observed binding preferences of that TF, and those binding preferences can be used to identify all identical or similar potential binding sites across the genome. This provides less accurate data than verified TF binding, but it is possible to acquire it on a genome-wide scale.

Implementation

Two scores are used to judge the fit of a PWM to a nucleotide sequence. The total score is defined simply as

$$\text{total score} = \sum_{p=1}^{\mathit{length}} s_{p,N_p}$$

where length is the length of the PFM, $s_{p,n}$ is as defined in Illustration 6, and $N_p$ is the nucleotide at position p in the sequence being scored. The core score is calculated using an identical formula; however, it is effectively computed from a second PWM generated for the contiguous 5-position segment of the PFM which has the highest total frequency of the maximally frequent nucleotide at each of the 5 positions. The core score is calculated because the fit of a PWM, or lack thereof, is measured more accurately when the most significant 5-nucleotide stretch, typically near the center, is weighted more heavily: a mismatch in this segment of a binding motif may have a disproportionate effect on TF ligation [8].
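The two scores can be illustrated with a minimal sketch. It assumes each matrix is stored as one row of four values per position in A, C, G, T order; the function names and this storage layout are illustrative assumptions, not CRAM's actual code. The total score sums the weight of the observed nucleotide at each position, and the core window is the 5-position stretch whose most frequent nucleotides have the highest combined count in the PFM.

```python
NUCLEOTIDES = "ACGT"  # assumed column order of each matrix row


def total_score(pwm, site):
    """Sum the weight s[p][N_p] of the observed nucleotide at each position p."""
    return sum(pwm[p][NUCLEOTIDES.index(n)] for p, n in enumerate(site))


def core_start(pfm, width=5):
    """Return the 0-based start of the contiguous window whose per-position
    maximum counts have the largest sum (ties go to the leftmost window)."""
    column_max = [max(row) for row in pfm]
    window_sums = [sum(column_max[i:i + width])
                   for i in range(len(pfm) - width + 1)]
    return max(range(len(window_sums)), key=window_sums.__getitem__)


# First 10 positions of the Androgen Receptor PFM from Illustration 7.
AR_PFM = [
    [9, 7, 2, 6], [9, 2, 3, 10], [11, 3, 3, 7], [16, 1, 7, 0], [0, 0, 22, 2],
    [12, 6, 1, 5], [21, 2, 0, 1], [0, 24, 0, 0], [15, 0, 8, 1], [4, 9, 2, 9],
]
```

For these ten positions `core_start(AR_PFM)` returns 3, i.e. positions 4 through 8 in the 1-based numbering of Illustration 7. The core score would then be `total_score` applied to a PWM built from only those five rows.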

Each PWM is applied to each possible position in each input sequence. For a sequence of n nucleotides and a motif of m nucleotides, n-m+1 PWM calculations will be made. A PWM match with a total score and core score above the corresponding user-provided thresholds will be recorded as an instance of the TF which the PWM represents, and those two scores will be part of the neural network input for CRMs containing that motif. The time complexity of motif scanning a single sequence would be m*(n-m+1), which is undesirable; however, the distribution of nucleotide binding preferences is far from uniform, and that irregularity can be used to shorten the number of calculations it takes to reject a given site as a potential ligation site for a PWM. In calculating the PWM score, the positions with the highest single nucleotide scores are tested first, and the loss from each non-maximal match is accumulated. If this loss exceeds 1-t, where t is the total score threshold, then the site can be rejected immediately and the next site checked.
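The early-rejection scan described above can be sketched as follows. This is an illustration under stated assumptions rather than CRAM's implementation: the PWM is a list of rows in A, C, G, T order, the score is normalized by the best achievable total so that a perfect match scores 1, sequences contain only A, C, G and T, and only the total score threshold is applied (the core score check is omitted). The name `scan_sequence` is hypothetical.

```python
NUCLEOTIDES = "ACGT"  # assumed column order of each PWM row


def scan_sequence(sequence, pwm, threshold):
    """Return (start, normalized_score) for every site scoring >= threshold."""
    m = len(pwm)
    best = [max(row) for row in pwm]          # best weight at each position
    max_total = sum(best)                     # score of a perfect match
    allowed_loss = (1.0 - threshold) * max_total
    # Test the most informative positions first so mismatches are found early.
    order = sorted(range(m), key=lambda p: best[p], reverse=True)
    hits = []
    for start in range(len(sequence) - m + 1):
        loss = 0.0
        for p in order:
            weight = pwm[p][NUCLEOTIDES.index(sequence[start + p])]
            loss += best[p] - weight
            if loss > allowed_loss:
                break                         # reject this site immediately
        else:
            hits.append((start, (max_total - loss) / max_total))
    return hits
```

The worst case is still m*(n-m+1) weight lookups, but a mismatch at a strongly preferred position usually ends the inner loop after one or two lookups.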

2.4 Genetic Conservation Measurement

Background

From morphological, geographical and paleontological observations, Darwin first inferred the interconnectedness of life and the idea of common descent. From this sprang the Theory of Evolution [18]. Much of the work on this theory in the last century has been in the fields of ecology and taxonomy, but with the advent of modern genetics, the Theory of Evolution can be used predictively to infer that organisms posited to be related by other sciences should have similar genetic makeups and, likewise, that organisms with similar genes should be related [19].

Rationale

This predictive capacity can be applied at the individual gene and protein level to create an evolutionary analogy between things long studied, such as morphology, and these newer discoveries. Much as morphologically conserved features are conserved because of their utility, so too are genetic features conserved because of their functionality [20]; therefore, we can use genetic conservation as a discriminant of genetic feature functionality.

Implementation

I use pairwise alignment between the source genome and 7 other genomes to compute a vector of 7 conservation scores. The species being aligned are Human Homo sapiens (hg19), Chimpanzee Pan troglodytes (panTro2), Mouse Mus musculus (mm9), Brown rat Rattus norvegicus (rn4), Dog Canis lupus familiaris (canFam2), Chicken Gallus gallus (galGal3), Zebrafish Danio rerio (danRer6), and last but not least, Fugu Takifugu rubripes (fr2). Precomputed alignments between the 8 genomes are measured against each query sequence for which we want to measure the conservation, and the proportion of aligning nucleotides within the precomputed alignment to the total gapped alignment is the conservation score, as seen in Illustration 8. Conservation, c, for a given query sequence, q, is given by the following score, where i is an aligning species, and $G_{i,x}$ is a boolean state, aligned or non-aligned, for the gapped alignment to species i at nucleotide x.

$$\mathit{score}_{i,q} = \frac{\sum_{x=q_{start}}^{q_{end}} \begin{cases} 1 & G_{i,x} \text{ is aligned} \\ 0 & G_{i,x} \text{ is non-aligned} \end{cases}}{q_{end} - q_{start}}$$

This score is calculated for each sequence in the experimental and control sets for each other species. This produces a vector of 7 conservation scores for a given sequence which are used as part of the input to the neural network classification step for each CRM in that sequence.
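The conservation score can be sketched as below. This is a minimal illustration, assuming the aligned portions of one species' precomputed alignment are already available as non-overlapping half-open intervals (start, end) in query-genome coordinates, and treating the query itself as the half-open interval from q_start to q_end so that the denominator matches the number of positions summed. The function names and the `alignments_by_species` mapping are hypothetical.

```python
def conservation_score(aligned_blocks, q_start, q_end):
    """Fraction of query positions covered by aligned blocks for one species."""
    aligned = 0
    for block_start, block_end in aligned_blocks:
        overlap = min(block_end, q_end) - max(block_start, q_start)
        if overlap > 0:
            aligned += overlap
    return aligned / (q_end - q_start)


def conservation_vector(alignments_by_species, q_start, q_end):
    """One score per target species, in the order the species are supplied."""
    return [conservation_score(blocks, q_start, q_end)
            for blocks in alignments_by_species.values()]
```

With the 7 target species supplied in a fixed order, `conservation_vector` yields the 7-element vector that is passed on to the neural network step.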

The precomputed alignments used by CRAM by default are pairwise alignments between human and the other 7 target species or mouse and the other 7 target species. These are Chain format files which contain several blocks which map areas of relatively high alignment from the query genome to the target genome by storing segments of aligning sequence and gaps for misaligning sequence in either or both genomes (see Appendix section D for the Chain file format).

Illustration 8: Precomputed alignments are used to quickly estimate the conservation of a given sequence in multiple target species. These steps follow from a conceptual visualization of the source information in (1) to a representation of the precomputed alignment of such in (2) to the formula for computing the conservation metric for a given query sequence in (3).

2.5 Transcription Factor Cartization

Background

While TFs occur all across the genome and have no apparent grouping, each chromosome is a two dimensional surface, and on this surface the TFs have an almost Euclidean spatial relationship. One of the two dimensions is the DNA strand, and in the sense that this is a dimension at all, it has only a binary value: a TF or other genetic feature is either on the positive strand or the negative strand of a piece of DNA. The other is the discrete length of nucleotides which make up the chromosome itself. TFs can interact with one another freely between strands, as the strands are so close together compared to the size of the ligating proteins that the separation is immaterial. Ligating TFs will wrap around both strands and may freely interact in the three dimensions outside the one dimensional world of a nucleotide sequence. For this reason, strand is not considered as a factor in the formation of CRMs, and thus a CRM is, functionally, a group of TFs defined by a one dimensional coordinate system.

Within the biological constraints of CRM function, there are very real upper and lower bounds on the positions of TFs which may compose a CRM. The constituent TFs of a CRM may not be more than a few hundred nucleotides from one another, and the TF proteins themselves may not occupy the same physical space. The TFs can be put into logical groups set by the biological realities of these limitations.

Rationale

Bart Goethals is a pioneer in the field of applying association rule and item-set mining techniques to the simplification of complex multi-dimensional data sets [21]. Those same techniques can be applied to the one dimensional structure of TF ligation in CRMs. The quintessence of his technique is to render continuous data into a discrete set of transactions or “carts” which are amenable to analysis by item-set mining techniques. In the case of CRMs, nature has provided such a rendering for us, encoded in the biological constraints on CRM function.

Each TF may be conceived as being a member of one or more sets of TFs, any subset of which is a putative CRM. The set of all such sets contains all possible CRMs. By applying frequent item-set mining to this set of sets, the biologically possible CRMs with the highest frequency may be determined. This of course assumes a uniform probability that a given set of possible TF ligation sites is a CRM, which is not the case. There are many potential ligation sites on the chromosome too far from transcribed elements for them to have any effect on such transcription by traditional means; therefore, we can observe the probability of possible TF ligation sites forming a CRM to be non-uniform in the population of all such sites, but there is not a great deal of additional knowledge which can be brought to bear on the problem, and the true distribution is wholly unknown. For this reason, additional analysis of putative CRMs using known biological indicators is required.

Implementation

The motifs in each sequence are converted into one or more 'carts', which are transactions amenable to the frequent item-set mining that follows. The heuristic for constructing carts from a sequence of motifs, Illustration 9, is simple. For each sequence, starting with the TF, current-TF, which has the end-point nearest the start of the sequence, all TFs which start at least r nucleotides ahead of the current-TF and which end no more than a nucleotides away from the end of the current-TF are added to a cart with current-TF. Then the TF next nearest the start is selected and this process repeats. In practice r, the rejection window, is set to 0 nt, and a, the acceptance window, is set to around 300 nt.

Illustration 9: Neighboring TFs inside the acceptance window and outside the rejection window are formed into one 'cart' per TF. In practice the rejection window is set to 0 nt from the edges of the current motif and the acceptance window varies based on the user's desire for tightly or loosely clustered CRMs.
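The cart-building heuristic can be sketched as below. This is an illustration under one reading of the description: both windows are measured from the end of the current motif, so a neighbor joins the cart when it starts at least r nt past that end and finishes no more than a nt past it. The `Motif` tuple and `build_carts` name are hypothetical and not taken from CRAM.

```python
from typing import NamedTuple


class Motif(NamedTuple):
    tf: str      # name of the transcription factor
    start: int   # first nucleotide of the predicted site
    end: int     # position just past the last nucleotide of the site


def build_carts(motifs, r=0, a=300):
    """Build one cart per motif, visiting motifs in order of their end-point."""
    ordered = sorted(motifs, key=lambda m: m.end)
    carts = []
    for current in ordered:
        cart = [current]
        for other in ordered:
            if other is current:
                continue
            outside_rejection = other.start >= current.end + r
            inside_acceptance = other.end <= current.end + a
            if outside_rejection and inside_acceptance:
                cart.append(other)
        carts.append(cart)
    return carts
```

The nested loop is quadratic in the number of motifs per sequence; because the motifs are sorted, a production version could stop scanning once a neighbor ends beyond the acceptance window.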

After these carts are constructed there is a further step in cart generation, motif expansion. The 'expansion parameter' is the number of times different instances of the same motif in the same cart may replace the current cart with the Cartesian product of those instances of that motif and the carts resulting from recursive expansion of the rest of the cart; i.e., if expansion is set to 1, the result is no different from the basal case in which all TFs are added to a single set which defines what TFs are members of the cart. Illustration 10 shows an example of a cart which is fully expanded by a motif expansion of 2. This is done to address the concern that a single cart may encompass multiple CRMs comprising different physical instances of the same TFs.

Illustration 10: An example of cart expansion. A) A list of TFs predicted by PWM search in a target sequence. B) What the set of TFs in this cart would look like; note that the information on the duplicate GATA2 and USF1 will be lost. C) For any subset of this cart which contains only unique TFs, members of such sets must share an edge in the graph in (C). D) A list of the carts needed to represent the distinct combinations of separate motifs which are of the same TF type.
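One possible reading of motif expansion can be sketched with a Cartesian product. The sketch assumes a cart is a list of (tf, start, end) tuples, keeps at most `expansion` instances of each TF in the order they appear, and emits one expanded cart per combination that uses exactly one instance of each TF. The name `expand_cart` and this interpretation of the parameter are assumptions for illustration; the thesis text does not fix these details.

```python
from itertools import product


def expand_cart(cart, expansion=2):
    """Split a cart with repeated TFs into carts holding one instance per TF."""
    instances_by_tf = {}
    for motif in cart:
        instances_by_tf.setdefault(motif[0], []).append(motif)
    # Keep at most `expansion` physical instances of each TF.
    choices = [instances[:expansion] for instances in instances_by_tf.values()]
    return [list(combination) for combination in product(*choices)]
```

With expansion set to 1, each TF contributes a single instance and the result is one cart, matching the basal case. With two instances each of two TFs and an expansion of 2, four carts are produced.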

2.6 Contrast Frequent Item-set Mining

Background

In 1994, Rakesh Agrawal and Ramakrishnan Srikant proposed a fundamentally new algorithm for computing relationships between items for sale in a store from a large quantity of sales transactions [22]. In this problem, there are a large number of different items for sale and there is a large database of sales transactions. Each transaction is a set of items which were purchased together in a single sale. Their goal is to predict associations between products in the form of rules, such as 'If Bread and Milk are purchased, then Eggs are likely to be purchased as well.' This is accomplished by counting which subsets of items in the sales transactions occur with some minimum level of frequency. Once those frequent sets of items are identified, they are further processed to deduce the associations. The first step is accomplished with the Apriori algorithm, which identifies frequent item sets 3 to 10 times faster than the algorithms that precede it, AIS and SETM.

Rationale

Downward closure, which is seen in Illustration 11, asserts that no larger TF set with minimum support can contain a subset that does not have minimum support. This allows us to assert a minimum number of exemplars needed for effective neural network classification and simultaneously reduce the computational complexity of finding differential CRMs. There are faster algorithms which follow Apriori, such as Eclat [23] and FP-growth [24]; however, Apriori is elegant, simple to implement, easily verifiable, and ultimately produces the same solution as any other frequent item-set mining algorithm with the same measure of interestingness or frequency.

Illustration 11: Three frequent and one infrequent one-TF sets followed by all possible two-TF and three-TF sets. The property of downward closure illustrated above shows the manner in which Apriori improves frequent set mining. In the one-TF sets, GATA2, USF1, and NFATC1 are frequent and GABP does not have the minimum support, thus the two and three sets with GABP need never be calculated.

Several other means of evaluating TF clustering in ChIP-seq data have been applied, but frequent item-set mining using background confidence as a measure of interest has not previously been applied to ChIP-seq data. The analogy between the biological constraints of CRM function and the nature of finite discrete transactions used in traditional frequent item-set mining is compelling. With the addition of background confidence as a measure of interestingness, the biologically grounded, mostly unbiased application of contrasting frequent item-set mining becomes a clear candidate for the measurement of TF clustering in putative CRMs.

Implementation

The canonical Apriori algorithm is implemented using two measures of interestingness. Additionally, each iteration of the Apriori algorithm is run on the experimental set and then the control set in order to determine the levels of support in both sets for item-sets of a given size. Minimum absolute support, i.e. the minimum number of instances of a CRM (10 by default), and minimum background confidence are used to measure whether or not a CRM is 'frequent'. Background confidence is defined as the ratio of support in the experimental set to support in the control set, where support here is the probability of a CRM in its data set. By keeping the minimum absolute support low, the background confidence measure serves as a biologically significant, unbiased method of selecting the parameters for item-set frequency determination. The default confidence ratio is 1.0, which can be interpreted as equal frequency, i.e. equal support, in the experimental and control sets; however, in a typical case this would be adjusted by the experimenter to the level of difference that they are interested in studying. The neural network classification step follows calculation of the frequent CRMs and uses separability between the experimental and control set to determine whether there are meaningful differences, provided the confidence ratio is close to 1.
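The contrast mining step can be sketched as a level-wise Apriori pass over both cart collections. The sketch makes several assumptions that the text does not settle: carts are reduced to sets of TF names, candidate generation is pruned on experimental support only (background confidence is a ratio and is not guaranteed to be downward closed), a set absent from the control carts is given infinite confidence, and item-sets are limited to `max_size` TFs. The function names are hypothetical.

```python
from itertools import combinations
from math import inf


def count_support(carts, candidates):
    """Absolute support: the number of carts containing each candidate set."""
    return {c: sum(1 for cart in carts if c <= cart) for c in candidates}


def contrast_apriori(exp_carts, ctl_carts, min_support=10,
                     min_confidence=1.0, max_size=3):
    """Map each reported TF set to (exp_count, ctl_count, confidence)."""
    exp_carts = [frozenset(c) for c in exp_carts]
    ctl_carts = [frozenset(c) for c in ctl_carts]
    candidates = list({frozenset([tf]) for cart in exp_carts for tf in cart})
    reported = {}
    for size in range(1, max_size + 1):
        if not candidates:
            break
        exp_counts = count_support(exp_carts, candidates)
        ctl_counts = count_support(ctl_carts, candidates)
        supported = {c for c in candidates if exp_counts[c] >= min_support}
        for c in supported:
            exp_freq = exp_counts[c] / len(exp_carts)
            ctl_freq = ctl_counts[c] / len(ctl_carts) if ctl_carts else 0.0
            confidence = exp_freq / ctl_freq if ctl_freq else inf
            if confidence >= min_confidence:
                reported[c] = (exp_counts[c], ctl_counts[c], confidence)
        # Join step, then prune by downward closure on support.
        joined = {a | b for a in supported for b in supported
                  if len(a | b) == size + 1}
        candidates = [c for c in joined
                      if all(frozenset(s) in supported
                             for s in combinations(c, size))]
    return reported
```

Pruning on support alone keeps the search complete for the support criterion; the confidence filter is then applied only when deciding which sets to report.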

2.7 Artificial Neural Networks

Background

Artificial neural networks are a connectionist model of computation which abstractly models the communication patterns of biological neurons. This abstraction comprises a group of connected artificial neurons which emit values to others in the group to which they are connected and aggregate the emitted values of those connected to them, forming a weighted directed graph of such connections [25]. Neural networks can be used to solve a variety of complex data modeling and predictive challenges. They do this by processing sample data and converging on a state which provides a desired set of output values when given the corresponding input data, or approximates the appropriate output values to an acceptable degree. This is done by providing an example input, computing the output of the network, and modifying the network based on its output to reduce future error, iteratively until some halting condition is reached, such as acceptably low error in the output responses [26].

The state of a network defines a function mapping a set of input values to a member of a set of output classifications. The network taken as a whole, in the context of a function f : X → Y, defines a class of functions which approximate f for members of X given some known mappings f(x) = y as training data. Single layer neural networks, which have a number of input nodes connected only to a set of output nodes, are computationally equivalent to a linear classifier, in that they compute some function of a linear combination of their input values. Multiple layer neural networks, which have one or more layers of hidden nodes, i.e. nodes which are between the input and output nodes, are more computationally powerful and can compute non-linear functions on input data.

Rationale

As seen in Section 1.3 of Chapter 1, a diversity of machine learning and classification methods have been implemented for discriminating between similar and divergent properties of a CRM. Here we are attempting to validate or invalidate the utility of neural networks and develop a platform within which future classifier experimentation may be performed.

For both of the previously mentioned models of neural network, it is assumed that the graph of connections is directed and acyclic. These are known as feed-forward networks because of the acyclic forward motion of information through the network. If the graph of connections has one or more cycles, the network is referred to as a recurrent neural network [25]. Recurrent neural networks are proven to be equivalent to a Universal Turing Machine; however, there is no prescriptive means of utilizing such a network [27]. For other architectures of recurrent neural networks, there may be advantages in computational power to be gained over feed-forward networks; however, the Cybenko Theorem shows that a multi-layer feed-forward network with a single output is a universal function approximator [28]. The advantages in simplicity, speed, and predictability of feed-forward networks, combined with their ability to approximate any function, are in the opinion of this author persuasive enough to justify their use to the exclusion of recurrent models.

Implementation

As shown in Illustration 12, feed-forward neural networks with a single hidden layer of 3 neurons and a single-neuron output layer were used for classification. The activation function used was hyperbolic tangent, producing output on the range (-1, 1). Backpropagation is used to propagate error through the network for training, and hyperbolic tangent has a convenient derivative for backpropagation: writing f(x) = tanh(x) for a neuron's output, the derivative is $f'(x) = 1 - f(x)^2$. Backpropagation is implemented using the canonical algorithm, as described in Russell, 2002. For classification, output < 0 was considered -1 and output >= 0 was considered 1, corresponding to the control classification and the experimental classification respectively, for the calculation of specificity and sensitivity.

Illustration 12: A feed-forward, 3 layer perceptron network. This neural network's purpose is to classify an exemplar of two inputs into binary categories.
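The network described above can be sketched in plain Python as follows. This is a minimal illustration, not CRAM's implementation: the class name `TinyNet`, the bias handling (one extra weight per neuron fed a constant 1), the weight initialization range, the learning rate, and the squared-error objective are all assumptions chosen for clarity.

```python
import math
import random


class TinyNet:
    """Feed-forward network: inputs -> 3 tanh hidden neurons -> 1 tanh output."""

    def __init__(self, n_inputs, n_hidden=3, seed=0):
        rng = random.Random(seed)
        # Each neuron has one weight per input plus a trailing bias weight.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Return (hidden activations, output) for one input vector."""
        xb = list(x) + [1.0]
        hidden = [math.tanh(sum(w * v for w, v in zip(ws, xb)))
                  for ws in self.w_hidden]
        out = math.tanh(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, target, lr=0.1):
        """One backpropagation update; returns the squared error before it."""
        hidden, out = self.forward(x)
        xb = list(x) + [1.0]
        # tanh derivative expressed through the neuron's output: 1 - y^2.
        delta_out = (target - out) * (1.0 - out * out)
        delta_hidden = [delta_out * self.w_out[j] * (1.0 - h * h)
                        for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] += lr * delta_out * v
        for j, d in enumerate(delta_hidden):
            for k, v in enumerate(xb):
                self.w_hidden[j][k] += lr * d * v
        return (target - out) ** 2


def classify(output):
    """Outputs below 0 map to control (-1); outputs at or above 0 to experimental (1)."""
    return 1 if output >= 0 else -1
```

Training repeatedly calls `train_step` over the exemplars, with target 1 for experimental and -1 for control instances, until a halting condition such as a fixed number of epochs is met.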

If the confidence ratio in the contrasting frequent item-set mining is large, then the neural networks will tend toward classifying all CRMs as experimental, given that this will produce low error even in the absence of strong differences between the inputs. This may seem problematic, but in the case that there is a significantly greater frequency in the experimental set than the control set, this is the expected behavior and will yield CRMs representing a significant difference between the experimental and control sets by virtue of their relative frequencies, if not the discriminatory power of their conservation and PWM scores.

The final output score for a CRM is the arithmetic mean of the NRMSE values from 5 cross validations. Each cross validation trains on 80% of the data, sampled randomly in proportion from the experimental and control sets, i.e. a random 80% of each set, while the remaining data is used for testing.
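The cross-validation score can be sketched as below. The sketch assumes NRMSE means the root mean squared error divided by the range of the targets (2.0, since targets are -1 and 1), which is one common normalization and may differ from the one CRAM uses. The `fit` argument is a hypothetical callable that takes training pairs (features, target) and returns a prediction function; any classifier can be plugged in.

```python
import math
import random


def nrmse(predictions, targets, value_range=2.0):
    """Root mean squared error normalized by the assumed target range."""
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    return math.sqrt(mse) / value_range


def stratified_split(experimental, control, train_fraction, rng):
    """Randomly split each set separately, then label and combine the parts."""
    train, test = [], []
    for rows, label in ((experimental, 1.0), (control, -1.0)):
        rows = list(rows)
        rng.shuffle(rows)
        cut = int(round(train_fraction * len(rows)))
        train += [(x, label) for x in rows[:cut]]
        test += [(x, label) for x in rows[cut:]]
    return train, test


def cross_validated_score(experimental, control, fit, runs=5,
                          train_fraction=0.8, seed=0):
    """Mean NRMSE on the held-out data over several random stratified splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = stratified_split(experimental, control,
                                       train_fraction, rng)
        predict = fit(train)
        predictions = [predict(x) for x, _ in test]
        scores.append(nrmse(predictions, [t for _, t in test]))
    return sum(scores) / len(scores)
```

Splitting each set separately keeps the experimental-to-control proportion the same in the training and test portions of every run.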

2.8 Output Data

Each CRM is output as a set of TFs along with its mean NRMSE from the cross validation, and this is used to rank the CRMs on an ordinal scale of measure, i.e. there is no meaning to addition or ratios of the scores, but they provide a total preorder on the CRMs. Each CRM is also output with the mean sensitivity and specificity of its classification, along with the number of instances of that CRM in the experimental set and the number of instances of that CRM in the control set.
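The output measures can be sketched as follows, using the classification rule from Section 2.7 (output >= 0 is experimental, otherwise control). Reporting 0.0 when a class has no test instances is a convention chosen here for illustration, and the record layout with an "nrmse" key is hypothetical.

```python
def sensitivity_specificity(outputs, labels):
    """Sensitivity and specificity with experimental (1) as the positive class."""
    tp = fn = tn = fp = 0
    for output, label in zip(outputs, labels):
        predicted = 1 if output >= 0 else -1
        if label == 1:
            if predicted == 1:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == -1:
                tn += 1
            else:
                fp += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def rank_crms(results):
    """Order CRM records from lowest to highest mean NRMSE."""
    return sorted(results, key=lambda record: record["nrmse"])
```

Because the ranking uses only the ordering of NRMSE values, ties are possible and are left in their input order by the stable sort, which is consistent with a preorder rather than a strict ranking.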

Chapter 3: Comparing Transcription Factors in K562 cells

This chapter analyzes a collection of ChIP-seq data from the K562 cell line for 23 different transcription factors. This large sampling of different factors helps us test the hypothesis that the CRM differences identified by my algorithm can reflect actual biological relationships between the given TFs in CRMs. This is accomplished by observationally comparing the overlap of biologically observed TF binding in the ChIP-seq data to the predicted CRMs within each of the datasets.

3.1 Data Source

The K562 cell line is the first culture of immortalized human myelogenous leukemia cells. They are erythroleukemia type cells from a 53 year old female with chronic myelogenous leukemia in crisis. Data from this cell line for 23 ChIP-seq experiments with different TFs, performed by the ENCODE consortium [29], are compared to provide in-vivo validation of the CRMs predicted by CRAM. The TFs in Table 2 have PFMs published and in the TF library curated for this experiment.

3.2 Data summary

These data sets were all run using the unmodified ChIP-seq peaks found in the source sequence tags by BELT [30]. Every TF's sequence tags were processed using the same BELT parameters, a bin size of 50 nt and an output threshold of 0.9995, resulting in the numbers of peaks seen in Table 1 and charted in Illustration 13.

The experimental data for each TF sample were the top 2,000 sequences from that TF's ChIP-seq sample which were within 100 knt of a RefSeq annotated gene, ranked by highest significance as measured by BELT's p-value for peak significance. The control data for each set was a set of 2,000 randomly placed peaks designed to have the same distribution as the experimental data. This was done by generating 2,000 random peaks which were within 100 knt of a RefSeq gene and had an identical distribution of peak lengths as the experimental set, i.e. for every peak in the experimental set, there was one peak in the control set with the same length. This ensures that the control set is representative of the experimental set's overall distribution while being randomly positioned. Illustration 14 shows the distributions of the TFs across the genome relative to their nearest, i.e. most likely target, gene.

Transcription Factor  Peak Quantity
Pu1                   24774
Srf                   22572
Usf1                  21714
Egr1                  20130
NFATC1                18310
Tr4                   17538
Nrsf                  16679
Cjun                  16095
Gabp                  15536
Ctcf                  15258
Cfos                  15113
Max                   13939
Nfyb                  13118
Gata2                 13019
Cmyc                  12020
Yy1                   11801
Nfya                  11185
E2f6                  10323
Atf3                  10194
Nfe2                  9687
E2f4                  9494
Znf263                9044
Gata1                 8313

Table 1: The number of peaks calculated by BELT for each ChIP-seq sample.
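The construction of a length-matched control set can be sketched as below. This is an illustration only: it assumes peaks and genes are (chromosome, start, end) tuples, picks a gene uniformly at random for each experimental peak, and places a peak of the same length anywhere inside that gene's span extended by 100 knt on each side. Chromosome end coordinates are not checked, peaks are assumed to be much shorter than the window, and the name `make_control_peaks` is hypothetical; the thesis does not specify how the random placement was actually drawn.

```python
import random


def make_control_peaks(experimental_peaks, genes, max_distance=100_000, seed=0):
    """Return one random control peak per experimental peak, matching its length."""
    rng = random.Random(seed)
    control = []
    for _chrom, start, end in experimental_peaks:
        length = end - start
        gene_chrom, gene_start, gene_end = rng.choice(genes)
        lowest_start = max(0, gene_start - max_distance)
        highest_start = gene_end + max_distance - length
        new_start = rng.randint(lowest_start, highest_start)
        control.append((gene_chrom, new_start, new_start + length))
    return control
```

Because each control peak copies the length of exactly one experimental peak, the two sets have identical length distributions by construction.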

[Chart: "K562 Sample Sizes". Bar chart with one bar per transcription factor (vertical axis, "Transcription Factors") against "Quantity of Peaks" (horizontal axis, 0 to 30,000).]

Illustration 13: Each TF comes from a separate ChIP-seq experiment, so some variance is expected. Peak calculation for each sample was conducted using identical parameters, and the larger differences are due to naturally differing preponderance.

Illustration 14: A) The distribution over all the peaks, for each TF, of that peak's midpoint relative to the nearest gene. B) The same distribution with each region, 5' Distal to 3' Distal normalized by its size, showing density in peaks/1 knt rather than absolute quantity.

3.3 Biological background

Each of these TFs is known to bind frequently in the promoter region of a gene, which is the area approximately 2,000 nucleotides upstream of the transcription start site. Given that they are often found in the same area, one would expect that there may be potential interactions and that there would certainly be a high degree of co-occurrence.

3.4 Method

CRAM Parameters:

• Cart size 300 nt

• PWM total score threshold 0.90

• PWM core score threshold 0.95

• Minimum support 10

• Background confidence 1.0

• Motif expansion 2

The TF library used comprised one PFM for each of the 23 TFs sampled in K562, taken from the JASPAR and TRANSFAC databases.

3.5 Analysis

The overlap between each pair of transcription factors is the count of the number of peaks in one set which have a peak within 500 nt in the overlapping data set. This number, 500 nt, is typically used to allow for the natural error in ChIP-seq peak finding. Given that the overall density of peaks within a single data set is sparse, the probability of peaks overlapping by chance is expected to be relatively low. Thus when two peaks overlap between two different ChIP-seq data sets, we infer proximal binding between the two TFs targeted by the two sequencing experiments. Illustration 15 shows the number of overlapping peaks between every pair of data sets from the K562 cell line. To validate the predictions of CRAM we will show that there is meaningful correspondence between the biologically observed co-occurrence of these TFs and their predicted CRMs.
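The overlap count can be sketched as below. The sketch assumes each peak is reduced to a (chromosome, position) pair, for example its midpoint, and that "within 500 nt" is measured between those positions; both are assumptions for illustration, as is the function name.

```python
from bisect import bisect_left


def count_overlaps(source_peaks, other_peaks, window=500):
    """Count source peaks that have at least one other peak within `window` nt."""
    positions_by_chrom = {}
    for chrom, position in other_peaks:
        positions_by_chrom.setdefault(chrom, []).append(position)
    for positions in positions_by_chrom.values():
        positions.sort()
    count = 0
    for chrom, position in source_peaks:
        positions = positions_by_chrom.get(chrom, [])
        # First other peak at or beyond the left edge of the window.
        i = bisect_left(positions, position - window)
        if i < len(positions) and positions[i] <= position + window:
            count += 1
    return count
```

The count is taken from the point of view of the source sample, so swapping the two arguments can give a different number. This is why the matrix in Illustration 15 is not symmetric.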

          ATF3   CFOS   CJUN   CMYC   CTCF   E2F4   E2F6   EGR1   GABP  GATA1  GATA2    MAX
ATF3    10,194  1,198    850  1,328    341  1,272  1,126  1,576  1,312    247    279  1,531
CFOS     1,206 15,113  7,752  3,603    980  3,122  2,748  3,992  3,901  1,263  2,030  3,050
CJUN       828  7,771 16,095  3,851  1,740  2,416  2,097  2,903  2,792  1,643  2,737  2,396
CMYC     1,120  3,125  3,466 12,020  2,051  4,961  5,520  6,117  5,703  1,616  2,212  5,907
CTCF       340    982  1,788  2,246 15,258  1,331  1,221  1,971  1,557    332    578  1,428
E2F4     1,086  2,609  2,066  5,132  1,189  9,494  5,631  7,406  6,033    561    700  3,588
E2F6       904  2,260  1,784  5,506  1,043  5,443 10,323  7,328  5,694    593    798  4,922
EGR1     1,138  2,820  2,283  5,275  1,582  6,061  6,242 20,130  9,490    691    804  4,001
GABP     1,044  3,018  2,408  5,502  1,313  5,545  5,405 10,802 15,536  1,172  1,436  3,708
GATA1      254  1,295  1,671  1,739    336    634    670    871  1,300  8,313  5,205  1,240
GATA2      265  1,991  2,697  2,272    557    743    847    942  1,534  5,021 13,019  1,584
MAX      1,442  2,875  2,312  6,413  1,367  3,918  5,439  5,162  4,218  1,169  1,588 13,939
NFATC1     479  1,534  1,909  1,699  1,777  1,473  1,295  2,045  1,931    373    605  1,186
NFE2       780  2,226  2,407  1,475    391    898  1,037  1,236  1,049    593  1,034  1,584
NFYA       797  5,627  1,199  2,324    427  2,661  2,313  3,330  3,265    568    762  2,175
NFYB       891  5,687  1,180  2,486    501  3,004  2,659  3,941  3,668    482    688  2,433
NRSF       735  1,919  1,730  3,799  1,348  3,859  3,855  8,789  6,714    662    829  2,679
PU1        725  2,774  2,823  3,525  1,519  2,777  2,740  5,336  5,865  1,366  2,269  2,416
SRF      1,162  2,551  2,294  4,891  2,226  5,240  5,343 12,288  8,906    742    853  3,794
TR4        261    260    270    459    240    380    321    587    608    214    210    441
USF1     1,746  3,823  2,608  5,476  1,484  4,022  5,198  8,074  6,214    908  1,304  6,543
YY1        733  1,655  1,229  3,258    713  3,297  2,699  4,468  3,757    458    604  2,244
ZNF263     270    563    628    938    631  1,004  1,059  1,679  1,242    292    373    652

Illustration 15: Overlaps between the 23 TFs measured in the K562 cell line. Each column contains the overlaps for that TF with the other TFs in the rows, meaning that the peaks being counted are from the source TF's sample. The data in orange along the diagonal are not overlaps, but the number of peaks from that TF's data set, provided for reference. The values highlighted in blue are the maximal absolute overlap for a given column in the table. The columns following MAX are continued below.

Illustration 15: Continued

        NFATC1   NFE2   NFYA   NFYB   NRSF    PU1    SRF    TR4   USF1    YY1 ZNF263
ATF3       500    755    824    986    946    738  1,643    258  1,915    784    297
CFOS     1,616  2,164  6,163  6,545  2,434  2,905  3,521    261  4,143  1,783    650
CJUN     2,007  2,339  1,246  1,267  2,115  2,912  2,932    265  2,747  1,308    695
CMYC     1,522  1,302  1,999  2,218  4,161  3,261  5,620    369  5,194  2,882    900
CTCF     1,844    391    440    517  1,609  1,548  2,774    239  1,544    759    698
E2F4     1,307    746  2,311  2,711  4,324  2,538  6,291    301  3,878  3,051    950
E2F6     1,138    851  1,972  2,358  4,157  2,503  6,182    263  4,825  2,389    974
EGR1     1,708    888  2,398  2,931  7,695  4,174 12,197    417  6,215  3,414  1,465
GABP     1,714    835  2,610  3,049  6,781  5,129 10,070    470  5,414  3,211  1,149
GATA1      406    576    582    519    806  1,457    929    216    924    484    323
GATA2      633    974    765    708    969  2,309    987    206  1,291    591    401
MAX      1,172  1,481  2,124  2,478  3,226  2,412  4,734    389  6,727  2,177    719
NFATC1  18,310    487    710    777  1,298  1,602  1,942    319  1,425    888    421
NFE2       545  9,687    796    862    895  1,209  1,189    146  1,807    649    287
NFYA       719    743 11,185  9,225  1,887  1,793  2,829    231  3,321  1,546    411
NFYB       764    782  8,986 13,118  2,248  1,922  3,369    259  4,096  1,751    504
NRSF     1,159    682  1,504  1,840 16,679  3,242  8,822    334  4,314  2,266    978
PU1      1,654  1,130  1,762  1,943  3,773 24,774  5,048    415  3,675  1,841  1,016
SRF      1,638    852  2,109  2,585  7,700  4,004 22,572    546  5,852  2,950  1,319
TR4        370    144    243    280    425    434    709 17,538    391    454    211
USF1     1,429  1,598  3,219  4,117  5,084  3,633  7,498    353 21,714  2,477  1,035
YY1        907    596  1,493  1,738  2,797  1,809  3,823    425  2,551 11,801    663
ZNF263     398    248    373    459  1,092    936  1,466    187    955    617  9,044

To put this data in context, we examine the top scoring 2-TF CRMs from each TF sample which contain the sample TF itself and one other. These CRMs, highlighted in Table 2, have meaningful correspondence with the overlaps, thus showing biologically the validity of this technique as a means to predict CRMs which have experimentally determined cause to exist.

Source  NRMSE  Sensitivity  Specificity  Expr CRMs  Ctrl CRMs  Co-factor
CJUN    .0069  1.0000       .0000        406        0          ATF3
CJUN    .0094  1.0000       .0000        42         0          GABP
CJUN    .0097  1.0000       .0000        44         0          NFYB
CJUN    .0104  1.0000       .0000        34         0          TR4
CJUN    .2847  .9154        .0000        596        56         GATA2
CMYC    .0068  1.0000       .0000        414        0          USF1
CMYC    .0078  1.0000       .0000        414        0          MAX
CMYC    .0090  1.0000       .0000        28         0          ZNF263
CMYC    .0092  1.0000       .0000        24         0          E2F6
CMYC    .0096  1.0000       .0000        32         0          EGR1
CTCF    .0079  1.0000       .0000        163        0          USF1
CTCF    .0089  1.0000       .0000        63         0          YY1
CTCF    .0093  1.0000       .0000        106        0          CFOS
E2F4    .0061  1.0000       .0000        561        0          USF1
E2F4    .0079  1.0000       .0000        215        0          NFYB
E2F4    .0084  1.0000       .0000        175        0          NFYA
E2F4    .0089  1.0000       .0000        94         0          MAX
E2F4    .0092  1.0000       .0000        78         0          CTCF
EGR1    .0088  1.0000       .0000        76         0          ZNF263
EGR1    .0093  1.0000       .0000        58         0          MAX
EGR1    .0095  1.0000       .0000        37         0          E2F4
EGR1    .0100  1.0000       .0000        12         0          CJUN
EGR1    .1355  .9814        .0000        793        16         USF1
GABP    .0028  1.0000       .0000        2648       0          E2F6
GABP    .0042  1.0000       .0000        1188       0          NFYB
GABP    .0043  1.0000       .0000        1040       0          NFYA
GABP    .0046  1.0000       .0000        1016       0          ZNF263
GABP    .0050  1.0000       .0000        674        0          CJUN
GATA1   .3776  .7111        .0222        44         8          NFE2
GATA1   .4338  .7931        .0000        118        30         ZNF263
GATA1   .5178  .6500        .0375        130        32         MAX
GATA1   .5667  .5308        .0673        4192       2127       USF1
GATA1   .5703  .5579        .0605        1628       706        YY1
MAX     .0070  1.0000       .0000        274        0          EGR1
MAX     .0074  1.0000       .0000        246        0          ZNF263
MAX     .0087  1.0000       .0000        188        0          NFYB
MAX     .0089  1.0000       .0000        84         0          NFE2
MAX     .0090  1.0000       .0000        128        0          GABP
NFATC1  .0059  1.0000       .0000        53         1          NFYB
NFATC1  .0070  1.0000       .0000        49         1          NFYA
NFATC1  .0078  1.0000       .0000        320        0          CMYC
NFATC1  .0093  1.0000       .0000        112        0          ATF3
NFATC1  .0189  1.0000       .0000        128        4          NFE2
NFE2    .0090  1.0000       .0000        36         0          MAX
NFE2    .0096  1.0000       .0000        14         0          NFYA
NFE2    .0097  1.0000       .0000        14         0          NFYB
NFE2    .0098  1.0000       .0000        14         0          GABP
NFE2    .0099  1.0000       .0000        20         0          CJUN
NFYA    .0074  1.0000       .0000        312        0          GABP
NFYA    .0089  1.0000       .0000        112        0          E2F4
NFYA    .0093  1.0000       .0000        56         0          CMYC
NFYA    .0098  1.0000       .0000        16         0          SRF
NFYA    .0100  1.0000       .0000        32         0          PU1
NFYB    .0067  1.0000       .0000        462        0          GABP
NFYB    .0075  1.0000       .0000        312        0          MAX
NFYB    .0084  1.0000       .0000        140        0          CJUN
NFYB    .0086  1.0000       .0000        140        0          E2F4
NFYB    .0096  1.0000       .0000        40         0          PU1
NRSF    .0023  1.0000       .0000        2948       0          USF1
NRSF    .0028  1.0000       .0000        2568       0          CFOS
NRSF    .0040  1.0000       .0000        1182       0          YY1
NRSF    .0075  1.0000       .0000        208        0          ZNF263
NRSF    .0083  1.0000       .0000        192        0          CJUN
PU1     .0089  1.0000       .0000        68         0          E2F6
PU1     .0091  1.0000       .0000        26         0          GABP
PU1     .0100  1.0000       .0000        12         0          MAX
PU1     .0102  1.0000       .0000        38         0          ZNF263
PU1     .0103  1.0000       .0000        20         0          TR4
SRF     .0331  1.0000       .0000        23         1          USF1
TR4     .0095  1.0000       .0000        24         0          YY1
TR4     .0096  1.0000       .0000        15         0          CFOS
TR4     .2677  .8000        .0000        32         4          USF1
TR4     .3954  .6857        .0286        34         6          GATA2
USF1    .0028  1.0000       .0000        298        2          EGR1
USF1    .0056  1.0000       .0000        626        0          CMYC
USF1    .0088  1.0000       .0000        84         0          ATF3
USF1    .0098  1.0000       .0000        18         0          E2F4
USF1    .0514  1.0000       .0000        81         2          CTCF
YY1     .0043  1.0000       .0000        248        1          EGR1
YY1     .0064  1.0000       .0000        448        0          ATF3
YY1     .0090  1.0000       .0000        84         0          NFE2
YY1     .0096  1.0000       .0000        76         0          E2F4
YY1     .0101  1.0000       .0000        32         0          PU1
ZNF263  .0042  1.0000       .0000        1174       0          MAX
ZNF263  .0049  1.0000       .0000        1001       0          E2F6
ZNF263  .0057  1.0000       .0000        628        0          NFYB
ZNF263  .0058  1.0000       .0000        596        0          NFYA
ZNF263  .0081  1.0000       .0000        318        0          GABP

Table 2: The top 5 2-TF CRMs from each sample, identified in the "Source" column, which contains the source TF in the CRM. The other TF in the CRM is listed in Co-factor. The number of Experimental and Control sample instances of each CRM are listed in the Expr CRMs and Ctrl CRMs columns respectively. Highlighted in blue are modules which are also a top ranked overlap between two TFs.

3.6 Discussion

For each of the CRMs YY1+EGR1, USF1+AFT3, PU1+GABP, NFE2+CJUN, EGR1+E2F4, EGR1+ZNF263, and CMYC+MAX, the 2-TF set is the most frequent overlap from one TF to the other, and each was predicted among the top 5 modules containing the TF from whose sample the module was predicted. The YY1+EGR1 interaction has been previously observed [31], and USF1+AFT3 interactions have been shown in [32]. The PU1+GABP interaction has been observed as a functioning regulatory module of CD18, for which the PU1+GABP CRM can begin transcription in cells that are normally non-permissive for CD18 [33]. NFE2+CJUN is also a documented TF interaction [34]. An EGR1+E2F4 interaction has also been detected [35]. Further, the CMYC+MAX interaction is very well documented [36].

This biologically validates CRAM's capability of producing some biologically significant, genuine cis-regulatory modules.

The converse of this success would be CRAM's failure to assign low scores to poorly differential CRMs; however, this data shows clearly that this is not the case. Out of all 88 top-scoring 2-TF CRMs which contain their experimental sample's TF, only 13 occur at all in the control sample, despite the two samples being the same overall size and being selected under the same constraint of lying within 100 knt of a gene. Although these highest-scoring CRMs score highly because there are no control set TFs to distinguish them from, this does not invalidate the utility of the results. They are significant because they are found only in the experimental set to such a large degree, and over half of them number over 100 instances in the experimental set. Two noteworthy examples are the USF1+NRSF CRM and the E2F6+GABP CRM, which each have over 2,000 instances in the experimental set yet 0 in the control set. This is because they are clustered densely in the experimental data and are not similarly distributed throughout the genome, and CRAM is able to extract this fact from the sequence data. The minimum support parameter could be adjusted higher if an experimenter were not interested in putative CRMs with only moderate frequency. These results clearly indicate the ability of CRAM to extract useful information from ChIP-seq data.
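The counting behind the Expr CRMs and Ctrl CRMs columns, and the effect of the minimum support parameter, can be illustrated with a small sketch. This is not CRAM's implementation: it assumes each cart is simply a set of TF names, and the function names `count_pairs` and `contrast_pairs` and the toy carts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(carts):
    """Count, for every unordered TF pair, the number of carts containing both."""
    counts = Counter()
    for cart in carts:
        for pair in combinations(sorted(set(cart)), 2):
            counts[pair] += 1
    return counts


def contrast_pairs(expr_carts, ctrl_carts, min_support):
    """Return (pair, expr_count, ctrl_count) for every pair that meets
    min_support in the experimental carts, most frequent first."""
    expr = count_pairs(expr_carts)
    ctrl = count_pairs(ctrl_carts)
    rows = [(pair, n, ctrl.get(pair, 0))
            for pair, n in expr.items() if n >= min_support]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Toy data (hypothetical): three experimental carts and two control carts.
expr_carts = [{"USF1", "NRSF"}, {"USF1", "NRSF", "YY1"}, {"YY1", "EGR1"}]
ctrl_carts = [{"YY1", "EGR1"}, {"USF1"}]
result = contrast_pairs(expr_carts, ctrl_carts, min_support=2)
```

Raising `min_support` removes the pairs seen only once in the experimental carts, which mirrors how an experimenter could discard putative CRMs of only moderate frequency.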

Chapter 4: Contrasting TGF-β Treatment in A2780

These are a pair of ChIP-seq data sets from the A2780 cell line for the same TF, SMAD4. One sample is treated with TGF-β, an upstream inducer of SMAD4, and the other is not treated. These comparative samples of a single TF help us validate the hypothesis that the contrast item-set mining used by my algorithm, in conjunction with the classification metrics, will highlight the differences between dependent samples which are expected to deviate in some fashion.

4.1 Data source

A2780 cells were divided into two samples: those treated with TGF-β for 3 hours and an untreated control sample. TGF-β is a known inducer of the SMAD4 transcription factor. TGF-β and SMAD4 bind in the cytoplasm and transduce into the cell nucleus, where SMAD4 may ligate DNA. A2780 is a human ovarian epithelial carcinoma cell line.

4.2 Data summary

These data sets were all run using the ChIP-seq peaks found in the source sequence tags by BELT [30]. They were drawn from the data set being used in an ongoing analysis project in order to bring some direct biological meaning to the results. This data has been notoriously noisy; consequently, a conservative sample was taken from it: the top 500 sequences in each of the treated and untreated data sets, ranked by peak significance score, which were within 100 knt of a RefSeq annotated gene.

There are three different comparisons being made on this data. The control data for each set was a set of 500 randomly placed peaks designed to have the same distribution as the experimental data. This was done by generating 500 random peaks which were within 100 knt of a RefSeq gene and had an identical distribution of peak lengths as the experimental set, i.e. for every peak in the experimental set, there was one peak in the control set with the same length.
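The construction of a length-matched random control can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: genes are given as (chrom, start, end) tuples, "within 100 knt" is taken to mean that the peak midpoint falls within 100,000 nt of a gene, genes are sampled uniformly, and chromosome ends are not checked. The function name `random_control_peaks` is hypothetical.

```python
import random


def random_control_peaks(expr_peaks, genes, window=100_000, seed=0):
    """For each experimental peak, emit one random peak of identical length
    whose midpoint lies within `window` nt of a randomly chosen gene."""
    rng = random.Random(seed)
    control = []
    for _chrom, start, end in expr_peaks:
        length = end - start
        g_chrom, g_start, g_end = rng.choice(genes)
        # Draw the midpoint from the gene body extended by the window.
        mid = rng.randint(max(0, g_start - window), g_end + window)
        c_start = max(0, mid - length // 2)
        control.append((g_chrom, c_start, c_start + length))
    return control


# Hypothetical inputs: three experimental peaks and one annotated gene.
expr_peaks = [("chr1", 100, 400), ("chr1", 1000, 1050), ("chr2", 5000, 5600)]
genes = [("chr1", 200_000, 250_000)]
control_peaks = random_control_peaks(expr_peaks, genes)
```

Because one control peak is emitted per experimental peak with the same length, the two sets have identical peak-length distributions by construction.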

The distribution of SMAD4 binding, as seen in Illustration 16, is wide and not concentrated near the transcription start site as it was for the TFs studied in the K562 cell line. Although a large quantity of the peaks are intragenic, gene length varies greatly, and the average length of a gene with an intragenic peak in these data is long enough that the density of intragenic peaks is actually quite low.

Illustration 16: A) The distribution over all SMAD4 peaks of each peak's midpoint relative to the nearest gene. B) The same distribution with each region, 5' Distal to 3' Distal, normalized by its size, showing density in peaks/1 knt rather than absolute quantity.

4.3 Biological background

SMAD4 is the only mammalian co-SMAD: SMADs 1, 2, 3, 5, and 8 transduce TGF-β from the cell cytoplasm into the cell nucleus, where they bind with SMAD4, which in turn ligates to DNA to regulate genes. Involved in cell signaling pathways, SMAD4 binds to a diversity of genomic locations and is implicated in a variety of pathologies.

The TGF-β signaling pathway is responsible for repressing ovarian epithelial cell growth following rupture during menstruation. It is dysregulated in many cancers [37]. As this dysregulation of the key nuclear transducer of TGF-β removes a tightly regulated growth inhibition signal, it can lead to over-proliferation of the epithelial cells. Additionally, because the SMAD transcription factors are known to interact with a wide variety of cofactors, they are able to influence many cellular processes.

A2780 cells are not an aggressive cancer line, and they are still sensitive to a key chemotherapeutic drug, cisplatin (cis-diamminedichloroplatinum(II)). The A2780 cells have only an intermediate level of TGF-β dysregulation: they are still able to induce SMAD4 expression and transduce existing SMAD4 from the cytoplasm to the nucleus following TGF-β stimulation.

4.4 Method

CRAM Parameters:

• Cart size 300 nt

• PWM total score threshold 1.0

• PWM core score threshold 1.0

• Minimum support 10

• Background confidence 1.0

• Motif expansion 1

These data sets were all run using modified ChIP-seq peaks. Because of the large average size of ChIP-seq peaks in these data sets (~600 nucleotides), peaks larger than 300 nucleotides were redefined to span 150 nucleotides upstream and 150 nucleotides downstream of the weighted peak center calculated by BELT.
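The peak redefinition described above amounts to a simple rule, sketched here. The function name `trim_peak` and its argument layout are hypothetical, and the weighted peak center is assumed to be supplied by the peak finder rather than computed here.

```python
def trim_peak(start, end, center, max_len=300, flank=150):
    """Return (start, end) unchanged for peaks of at most max_len nt;
    otherwise return a window of `flank` nt on each side of `center`."""
    if end - start <= max_len:
        return start, end
    return max(0, center - flank), center + flank


# A 600 nt peak with its weighted center at 1250 becomes a 300 nt window.
trimmed = trim_peak(1000, 1600, 1250)
# A 200 nt peak is left as it is.
untouched = trim_peak(1000, 1200, 1100)
```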

In contrast with the K562 experiment, in which the goal was to find relationships among the set of TFs for which we had biological data, the goal of this analysis is merely to identify whether or not there are significant differential CRMs between the ligation sites of SMAD4 before TGF-β treatment and after treatment with TGF-β for 3 hours; therefore, the TF library for this data set is the combined mammalian TF library that was curated from the JASPAR and TRANSFAC databases for this purpose. Whereas the library in the previous analysis contained 23 TF PFMs, the full combined TF library contains 1,061 different PFMs.
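To give a sense of how much the search space grows with the larger library, the number of candidate modules can be counted directly. This is a back-of-the-envelope sketch that assumes only 2-TF and 3-TF modules are considered and that a module is an unordered combination of distinct library entries; `module_count` is a hypothetical helper.

```python
from math import comb


def module_count(n_tfs, sizes=(2, 3)):
    """Number of unordered TF combinations of the given sizes."""
    return sum(comb(n_tfs, k) for k in sizes)


k562_modules = module_count(23)      # 253 pairs + 1,771 triples
full_modules = module_count(1061)    # 562,330 pairs + 198,502,490 triples
```

Under these assumptions the 23-PFM library admits 2,024 candidate modules, while the 1,061-PFM library admits 199,064,820, roughly a 100,000-fold increase.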

For reference, both the treated sample and untreated sample were independently compared with a random background sample representative of their distribution.

4.5 Analysis

NRMSE   Sens.   Spec.   Expr.  Ctrl.  CRM Motifs
0.0125  1.0000  0.0000  12     3      AREB6_4(T) ARID3A(J)
0.0148  1.0000  0.0000  12     3      AREB6_4(T) ARID3A(J) Prrx2(J)
0.0178  1.0000  0.0000  11     2      HMG(T) NFAT2(T) Prrx2(J)
0.0189  1.0000  0.0000  11     2      HMG(T) NFAT2(T)
0.0283  1.0000  0.0000  12     4      AREB6_4(T) GATA2(J)
0.0428  1.0000  0.0000  11     3      Arnt::Ahr(J) DBP(T) NKX31(J)
0.0435  1.0000  0.0000  13     3      Arnt::Ahr(J) DBP(T)
0.0527  1.0000  0.0000  12     4      NFE2L1::MafG(J) RORBETA(T) ZNF354C(J)
0.0626  1.0000  0.0000  13     3      Arnt::Ahr(J) DBP(T) GATA2(J)
0.0747  1.0000  0.0000  12     3      CBF1(T) OG2_1(T) SOX10(J)
0.0778  1.0000  0.0000  12     4      FOXO3(J) NFATC2(J) NFE2L1::MafG(J)
0.0882  1.0000  0.0000  13     3      AP1(J) ARID3A(J) ZNF354C(J)

Table 3: TGF-β treated contrasted with TGF-β untreated. Above are the predicted modules with NRMSE < 0.1.

Table 3 shows all of the predicted modules from the contrast between the treated and untreated samples with an NRMSE < 0.10. Originally, these data sets were run with the same PWM thresholds as the K562 experiment; however, this produced a startling number of redundant motifs, so the stringency of the PWM threshold requirement was increased. This did not negatively impact the usability of the data set, as the TGF-β treated set contained 25,151 motifs and the untreated set contained 25,264 motifs under these parameters.

The levels of CRM support are lower in this set than in the previous one, and lower by a greater magnitude than the difference in input sample size between this and the K562 data set; therefore, comparisons between each of these samples and a random sample were performed as well. While the contrast between treated and untreated has only 10 CRMs with NRMSE < 0.10, the treated vs. random contrast has 76 and the control vs. random contrast has 38. The CRMs from these comparisons have similarly low levels of support; therefore, we can support the conclusion that there are a smaller number of clear-cut differential CRMs between these two samples, rather than concluding that they do not have enough data to be differentiated. The treated and control vs. random background CRMs can be seen in Table 4 and Table 5, respectively.

NRMSE   Sens.   Spec.   Expr.  Ctrl.  CRM Motifs
0.0021  1.0000  0.0000  11     3      NFATC2(J) Nobox(J) USF1(J)
0.0032  1.0000  0.0000  13     1      DBP(T) ETS1(J) GATA3(J)
0.0035  1.0000  0.0000  13     4      ETS1(J) NFATC2(J) Nobox(J)
0.0042  1.0000  0.0000  11     1      NFAT1_3(T) NFAT2(T) ZNF354C(J)
0.0043  1.0000  0.0000  12     3      AREB6_4(T) GATA2(J)
0.0053  1.0000  0.0000  12     1      CRX(T) GATA3(J) MZF1_14(J)
0.0058  1.0000  0.0000  12     2      ARID3A(J) NFAT2(T) ZNF354C(J)
0.0062  1.0000  0.0000  14     2      NFAT2(T) Prrx2(J) ZNF354C(J)
0.0075  1.0000  0.0000  11     1      IRF1_2(T) LEF1_1(T) NFATC2(J)
0.0079  1.0000  0.0000  12     1      MZF1_14(J) NFAT2(T) USF1(J)
0.0084  1.0000  0.0000  11     1      Arnt(J) CREB_5(T) NKX31(J)
0.0097  1.0000  0.0000  12     0      ETS2(T) GATA3(J) USF1(J)
0.0100  1.0000  0.0000  12     1      ELF1(T) NFE2L1::MafG(J) NURR1(T)
0.0102  1.0000  0.0000  11     0      ELF1(T) SPI1(J) ZNF354C(J)
0.0124  1.0000  0.0000  14     4      ELF1(T) NFAT1_3(T) ZNF354C(J)

Table 4: TGF-β treated contrasted with a random background sample.

NRMSE   Sens.   Spec.   Expr.  Ctrl.  CRM Motifs
0.0036  1.0000  0.0000  12     1      CRX(T) HNF4_5(T) Nr2e3(J)
0.0051  1.0000  0.0000  11     3      GATA2(J) HMG(T) HNF4_4(T)
0.0057  1.0000  0.0000  13     1      CRX(T) Nr2e3(J) YY1(J)
0.0097  1.0000  0.0000  16     0      ETS2(T) GATA3(J) USF1(J)
0.0103  1.0000  0.0000  11     0      Arnt(J) CRX(T) HNF4_5(T)
0.0107  1.0000  0.0000  14     4      ETS1(J) HNF4_4(T) ZNF354C(J)
0.0160  1.0000  0.0000  12     2      ARID3A(J) CRX(T) NFAT2(T)
0.0167  1.0000  0.0000  11     3      CRX(T) MAFB(T) NFATC2(J)
0.0293  1.0000  0.0000  16     4      ELF1(T) Pdx1(J) ZNF354C(J)
0.0324  1.0000  0.0000  12     2      CRX(T) ETS2(T) GATA3(J)
0.0327  1.0000  0.0000  13     3      ETS1(J) Pdx1(J) REX1_2(T)
0.0393  1.0000  0.0000  11     2      GATA2(J) MZF1_14(J) NFAT2(T)
0.0396  1.0000  0.0000  11     3      ETS2(T) NFATC2(J) TEF1(T)
0.0418  1.0000  0.0000  11     3      ARID3A(J) ELF1(T) GATA(T)
0.0423  1.0000  0.0000  11     3      ETS1(J) HOXA7_1(T) USF1(J)

Table 5: TGF-β untreated contrasted with a random background sample.

4.6 Discussion

Usage of the full TF library was expected to increase the number of frequent CRMs and perhaps their average frequency. Because of the increase in possible combinations, it was thought more likely that some large differences would arise by chance; however, the neural network classification step did not differentiate the two populations to an acceptable error for such CRMs. For example, Arnt::Ahr(J) + GATA2(J) + NFE2L1::MafG(J) has the highest background confidence of all of the results with over 100 instances in the input data, at 104 treated instances and 83 untreated instances, yet its NRMSE is 0.5196, with a sensitivity of 0.22 and a specificity of 0.267. With the PWM thresholds set to 1.0, the only scores which could vary between samples were the conservation scores. A classification error about the same as random guessing therefore indicates that there were no meaningful differences in conservation for these CRMs despite the high background confidence. It is the purpose of the neural network classification to rank poorly those CRMs which have no defining characteristics to separate them in the experimental set from the control set. Without such differences or a larger background confidence, this and similar CRMs are effectively the same between the two samples.
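The link between an NRMSE near 0.5 and random guessing can be made concrete with a small sketch. The exact definitions CRAM uses for NRMSE, sensitivity, and specificity are not restated in this chapter, so the code below assumes the common textbook forms: RMSE divided by the range of the target labels, and confusion-matrix rates at a decision threshold of 0.5. Both function names are hypothetical, and the values produced need not match the conventions behind the table columns.

```python
import math


def nrmse(targets, outputs):
    """Root-mean-square error divided by the range of the targets."""
    mse = sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)
    span = max(targets) - min(targets)
    return math.sqrt(mse) / span


def sensitivity_specificity(targets, outputs, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(targets, outputs))
    tp = sum(1 for t, o in pairs if t == 1 and o >= threshold)
    fn = sum(1 for t, o in pairs if t == 1 and o < threshold)
    tn = sum(1 for t, o in pairs if t == 0 and o < threshold)
    fp = sum(1 for t, o in pairs if t == 0 and o >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Labels: 1 = experimental instance, 0 = control instance.
targets = [1, 1, 0, 0]
uninformative = [0.5, 0.5, 0.5, 0.5]   # a network that learned nothing
informative = [0.9, 0.8, 0.1, 0.2]     # a network that separates the sets

chance_error = nrmse(targets, uninformative)
good_error = nrmse(targets, informative)
```

Under these assumed definitions, a classifier that outputs 0.5 for every instance has an NRMSE of exactly 0.5, which is the sense in which an error of about 0.52 is no better than guessing.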

The low level of differentiation seen here stands in contrast to the higher level of CRM differentiation seen in the K562 samples, and there may be several reasons for this. The first is that the samples are significantly wider, with an average size of 250 to 300 nt, compared to the K562 peaks, which were 50 to 300 nt in length and much smaller on average. This decreases the specificity of the ChIP-seq binding and raises the likelihood of TFs binding in the region which are not closely related to the ligating activity being targeted by the ChIP-seq experiment.

Chapter 5: Conclusions

5.1 Address of the Research Question

From the ChIP-seq samples involved in these CRMs, CRAM was able to predict the following documented TF interactions as among the most significant for the samples in which they were predicted:

CMYC+MAX [36]

CMYC+EGR1 [38]

PU1+GABP [33]

NFE2+CJUN [34]

EGR1+E2F4 [35]

YY1+EGR1 [31]

The utility of these predictions, along with the fact that in the K562 cell results there were very strong differences between the CRMs in the experimental samples and the control samples, answers the research question in the affirmative: contrasting frequent item-set mining and neural networks as a classifier have utility in predicting CRMs in ChIP-seq data.

CRAM also indicated that there are only subtle differences between the TGF-β treated and untreated samples. This can be said to support the null hypothesis that there are no major differences between the treated and untreated SMAD4 ChIP-seq samples which can be well discerned by either CRM frequency or conservation measurements. Although there were very frequent CRMs with sufficient background confidence, they were not differentiable between the treated and control sets by the neural networks on the basis of their conservation data, nor was the frequency difference large enough for the classification to produce a low error despite their similarity in genetic conservation.

5.2 Discussion of CRAM

CRAM is a useful tool in the analysis of potential CRMs specific to a ChIP-seq data set and separate from the genome-wide distribution of such CRMs. In predicting functional regulatory modules using TFs scanned from ChIP-seq data with a library of PWMs, CRAM is not that different from the plurality of CRM prediction algorithms. It compares two sets of sequence data and provides an analysis of the CRMs which are differentiable between the two; however, CRAM differentiates itself by being independent of some of the major assumptions behind the existing algorithms, which depend on the assumption of co-expression of genes near the same CRMs, on the CRMs having a particular distribution around genes, or on prior knowledge of TF association. While each of these assumptions may contribute something to CRM detection, they also take something away: they are limitations on the applicability of the techniques. CRAM uses unbiased biological data on TF binding activity to direct the search for CRMs which occur frequently and with differentiable PWM scores and genetic conservation when compared to a control sample. In this way CRAM of course introduces its own assumptions, which are that one's control sample represents a meaningful difference from one's experimental set in terms of CRM frequency, PWM scores, or genetic conservation; however, these assumptions are not as thoroughly tested as those dominant in the field, and thus are fertile ground for experimentation and for adapting existing techniques to the norms of the increasingly common field of ChIP-sequencing.

The K562 results biologically validate CRAM's capability of producing some biologically significant, genuine cis-regulatory modules. The TGF-β results suggest that it would be useful to ensure a larger number of smaller sequences for ideal predictive ability of CRMs which are likely to be related to the experimental TF.

The performance of Python is less than desirable on large data sets; however, the use of the presently experimental PyPy just-in-time compiling Python interpreter speeds up CRAM's execution by over an order of magnitude in many cases, and several of the TFs in the K562 data set completed their analysis in under 5 minutes. Nevertheless, the motif expansion algorithm produces too many carts for efficient neural network classification of the resulting frequent item sets when there are a large number of dense repeating motifs. In cases in which there are a great many motifs, raising the motif score thresholds and reducing the motif expansion parameter will provide a useful data set while coping with the preponderance of transcription factors.

5.3 Future Work

CRAM was developed for two primary purposes. The first, which is complete, was to apply the techniques of contrast frequent item-set mining to 'cartized' TF data from ChIP-seq experiments and of neural network classification based on genetic conservation and PWM binding specificity. The second is that CRAM was made to be fairly modular and to serve as a test bed for other techniques which follow the same basic formula: generate classification metrics, use one or more interestingness measures to find frequent sets of CRMs, and use a classifier to rank the CRMs by their differentiability back into the data sets from which they came, based on the previously generated classification metrics.

While the primary goals of CRAM are to leverage current understanding of the biological constraints of CRM function and to investigate the usage of contrast frequent item-set mining and neural networks in an unbiased manner using ChIP-seq data, the body of programs which accomplish the same goal should in many cases produce similar results. The differences between these programs should appear only for CRMs which are less certain to be authentic. It would be useful to produce a qualitative comparison between this and other CRM prediction models in order to guide future algorithm development.

CRAM currently requires both sequence input and sequence coordinate input; however, the input is taken from ChIP-seq peak finding results, which by their nature do not necessarily include the nucleotide sequence of their output. This requires a data bank of the source genome, for every genome to be used, in order to look up the nucleotide sequences for the input data. Additionally, the conservation data requires pre-computed pairwise alignment data between every source species and each of the remaining target species. The sum effect is that a large amount of data is required to run the program, and CRAM uses it only briefly or as a means to another end. Given the large number of secondary data files it needs, CRAM is a natural fit for an on-line processing model. It would be very helpful to develop a web-based front end to CRAM which would facilitate its use over the Internet on a properly configured machine with the relevant data.

Bibliography

[1] D. S. Latchman, “Transcription factors: an overview,” Int. J. Biochem. Cell Biol., vol. 29, no. 12, pp. 1305–12, 1997.

[2] M. Karin, “Too many transcription factors: positive and negative interactions,” New Biol., vol. 2, no. 2, pp. 126–31, 1990.

[3] E. van Nimwegen, “Scaling laws in the functional content of genomes,” Trends Genet., vol. 19, no. 9, pp. 479–84, 2003.

[4] M. M. Babu, N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann, “Structure and evolution of transcriptional regulatory networks,” Curr. Opin. Struct. Biol., vol. 14, no. 3, pp. 283–91, 2004.

[5] E. Davidson, The Regulatory Genome: Gene Regulatory Networks in Development and Evolution. Elsevier, 2006.

[6] S. Ben-Tabou de-Leon and E. H. Davidson, “Gene regulation: gene control network in development,” Annu Rev Biophys Biomol Struct, vol. 36, p. 191, 2007.

[7] S. Sinha, E. van Nimwegen, and E. D. Siggia, “A probabilistic method to detect regulatory modules,” Bioinformatics (Oxford, England), vol. 19, pp. i292-301, 2003.

[8] V. X. Jin, H. O'Geen, S. Iyengar, R. Green, and P. J. Farnham, “Identification of an OCT4 and SRY regulatory module using integrated computational and experimental genomics approaches,” Genome Research, vol. 17, no. 6, pp. 807-817, Jun. 2007.

[9] S. Aerts, P. Van Loo, G. Thijs, Y. Moreau, and B. De Moor, “Computational detection of cis-regulatory modules,” Bioinformatics (Oxford, England), vol. 19, pp. ii5-14, Oct. 2003.

[10] C. Wrzodek et al., “ModuleMaster: A new tool to decipher transcriptional regulatory networks,” Biosystems, vol. 99, no. 1, pp. 79–81, Oct. 2009.

[11] R. Sharan, A. Ben-Hur, G. G. Loots, and I. Ovcharenko, “CRÈME: Cis regulatory module explorer for the human genome,” Nucleic Acids Res., vol. 32, pp. W253–W256, 2004.

[12] G. Robertson et al., “Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing,” Nat Meth, vol. 4, no. 8, pp. 651-657, 2007.

[13] P. Collas, “The Current State of Chromatin Immunoprecipitation,” Molecular Biotechnology, Jan. 2010.

[14] H. A. Levine and M. Nilsen-Hamilton, “A mathematical analysis of SELEX,” Computational Biology and Chemistry, vol. 31, no. 1, pp. 11–35, 2007.

[15] G. D. Stormo, “DNA binding sites: representation and discovery,” Bioinformatics, vol. 16, no. 1, pp. 16–23, 2000.

[16] E. Portales-Casamar et al., “JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles,” Nucleic Acids Research, vol. 38, pp. D105-110, Jan. 2010.

[17] V. Matys et al., “TRANSFAC®: transcriptional regulation, from patterns to profiles,” Nucleic Acids Research, vol. 31, no. 1, pp. 374-378, Jan. 2003.

[18] C. Darwin, “On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life,” New York: D. Appleton, 1859.

[19] R. K. Wayne and P. A. Morin, “Conservation genetics in the new molecular age,” Frontiers in Ecology and the Environment, vol. 2, no. 2, pp. 89-97, 2004.

[20] R. S. Singh and C. B. Krimbas, Evolutionary genetics: from molecules to morphology. Cambridge University Press, 2000.

[21] T. Calders and B. Goethals, “Mining All Non-derivable Frequent Itemsets,” in Principles of Data Mining and Knowledge Discovery, vol. 2431, T. Elomaa, H. Mannila, and H. Toivonen, Eds. Springer Berlin / Heidelberg, 2002, pp. 1-42.

[22] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” in Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD '93, pp. 207-216, 1993.

[23] M. Zaki, “Scalable algorithms for association mining,” Knowledge and Data Engineering, IEEE Transactions on, vol. 12, no. 3, pp. 372-390, 2000.

[24] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach,” Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.

[25] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Prentice Hall, 2002.

[26] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neural Networks, vol. 1, pp. 593-605, 1989.

[27] H. T. Siegelmann and E. D. Sontag, “Turing computability with neural nets,” Applied Mathematics Letters, vol. 4, no. 6, pp. 77-80, 1991.

[28] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems, vol. 2, no. 4, pp. 303-314, 1989.

[29] E. Birney et al., “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project,” Nature, vol. 447, no. 7146, pp. 799-816, Jun. 2007.

[30] S. Frietze, X. Lan, V. X. Jin, and P. J. Farnham, “Genomic Targets of the KRAB and SCAN Domain-containing Protein 263,” Journal of Biological Chemistry, vol. 285, no. 2, pp. 1393-1403, Jan. 2010.

[31] T. López-Rovira, E. Chalaux, J. Massagué, J. L. Rosa, and F. Ventura, “Direct Binding of Smad1 and Smad4 to Two Distinct Motifs Mediates Bone Morphogenetic Protein-specific Transcriptional Activation of Id1 Gene,” Journal of Biological Chemistry, vol. 277, no. 5, pp. 3176-3185, Feb. 2002.

[32] L. Pan et al., “Critical Roles of a Cyclic AMP Responsive Element and an E-box in Regulation of Mouse Renin,” Journal of Biological Chemistry, vol. 276, no. 49, pp. 45530-45538, Dec. 2001.

[33] A. G. Rosmarin, D. G. Caprio, D. G. Kirsch, H. Handa, and C. P. Simkevich, “GABP and PU.1 Compete for Binding, yet Cooperate to Increase CD18 (Leukocyte Integrin) Transcription,” Journal of Biological Chemistry, vol. 270, no. 40, pp. 23627-23633, Oct. 1995.

[34] C. Francastel, Y. Augery-Bourget, M. Prenant, M. Walters, D. I. Martin, and J. Robert-Lézénès, “c-Jun inhibits NF-E2 transcriptional activity in association with p18/ in Friend erythroleukemia cells,” Oncogene, vol. 14, no. 7, pp. 873-877, Feb. 1997.

[35] Z. Chen et al., “Egr-1 Regulates Autophagy in Cigarette Smoke-Induced Chronic Obstructive Pulmonary Disease,” PLoS ONE, vol. 3, no. 10, p. e3316, Oct. 2008.

[36] B. Amati, S. Dalton, M. W. Brooks, T. D. Littlewood, G. I. Evan, and H. Land, “Transcriptional activation by the human c-Myc oncoprotein in yeast requires interaction with Max,” Nature, vol. 359, no. 6394, pp. 423-426, Oct. 1992.

[37] E. Kalo, Y. Buganim, K. E. Shapira, et al., “Mutant p53 attenuates the SMAD-dependent transforming growth factor beta1 (TGF-beta1) signaling pathway by repressing the expression of TGF-beta receptor type II,” Mol. Cell. Biol., vol. 27, no. 23, pp. 8228–42, 2007.

[38] M. Shafarenko, D. A. Liebermann, and B. Hoffman, “Egr-1 abrogates the block imparted by c-Myc on terminal M1 myeloid differentiation,” Blood, vol. 106, no. 3, pp. 871-878, Aug. 2005.

Appendix

A. FASTA format

Each record is a header line which begins with the “>” character followed by the sequence's ID. This must correspond to the ID in the Name/ID column of the corresponding BED file for this sequence. The header is followed by a mandatory new line and then by any number of whitespace-insensitive nucleotide letters. Admissible letters are [AGTCagtcNn].

>Seq_216

ACTGATCTGATATCGATCTATCG

ACTTGTgcacgactagcNNNNGATAG
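A reader for this format can be sketched in a few lines. This is an illustration of the rules stated above, not CRAM's own reader; the function name `parse_fasta` is hypothetical, and rejecting letters outside [AGTCagtcNn] with an exception is an assumption about how invalid input should be handled.

```python
import re

# Letters admitted by the format description above.
VALID = re.compile(r"^[AGTCagtcNn]*$")


def parse_fasta(text):
    """Return {sequence_id: sequence}; whitespace inside sequences is ignored."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:].strip()
            records[seq_id] = ""
        else:
            if seq_id is None:
                raise ValueError("sequence data before the first header line")
            chunk = "".join(line.split())
            if not VALID.match(chunk):
                raise ValueError(f"inadmissible letter in {seq_id}")
            records[seq_id] += chunk
    return records


example = ">Seq_216\nACTGATCTGATATCGATCTATCG\n\nACTTGTgcacgactagcNNNNGATAG\n"
sequences = parse_fasta(example)
```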

B. BED format

BED format is a text file of tab-separated columns:

chrom chromStart chromEnd name score strand

The six required BED fields are:

1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

4. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.

5. score - This field is not directly used by CRAM.

6. strand - Defines the strand - either '+' or '-'.

The BED specification contains further columns after 6, but they will be ignored by CRAM.

Example:

Here's an example of an annotation track that uses a working BED definition:

chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
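The six-column layout can be read with a short helper such as the one below. It is a sketch of the rules listed above, not CRAM's parser: `BedRecord` and `parse_bed_line` are hypothetical names, the score is kept as an uninterpreted string because CRAM does not use it, and any columns after the sixth are dropped.

```python
from collections import namedtuple

BedRecord = namedtuple("BedRecord", "chrom start end name score strand")


def parse_bed_line(line):
    """Parse one tab-separated BED line, ignoring columns after the sixth."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated columns")
    chrom, start, end, name, score, strand = fields[:6]
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Coordinates are zero-based and half-open: end is not part of the feature.
    return BedRecord(chrom, int(start), int(end), name, score, strand)


record = parse_bed_line("chr22\t1000\t5000\tcloneA\t960\t+")
feature_length = record.end - record.start
```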

C. Chain alignment format

The chain format describes a pairwise alignment that allows gaps in both sequences simultaneously. Each set of chain alignments starts with a header line, contains one or more alignment data lines, and terminates with a blank line. The format is deliberately quite dense.

Example:

chain 4900 chrY 58368225 + 25985403 25985638 chr5 151006098 - 43257292 43257528 1
9 1 0
10 0 5
61 4 0
16 0 4
42 3 0
16 0 8
14 1 0
3 7 0
48

chain 4900 chrY 58368225 + 25985406 25985566 chr5 151006098 - 43549808 43549970 2
16 0 2
60 4 0
10 0 4
70

Header Lines

chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id

The initial header line starts with the keyword chain, followed by 12 required attribute values, and ending with a blank line. The attributes include:

• score -- chain score

• tName -- chromosome (reference sequence)

• tSize -- chromosome size (reference sequence)

• tStrand -- strand (reference sequence)

• tStart -- alignment start position (reference sequence)

• tEnd -- alignment end position (reference sequence)

• qName -- chromosome (query sequence)

• qSize -- chromosome size (query sequence)

• qStrand -- strand (query sequence)

• qStart -- alignment start position (query sequence)

• qEnd -- alignment end position (query sequence)

• id -- chain ID

The alignment start and end positions are represented as zero-based half-open intervals. For example, the first 100 bases of a sequence would be represented with start position = 0 and end position = 100, and the next 100 bases would be represented as start position = 100 and end position = 200. When the strand value is "-", position coordinates are listed in terms of the reverse-complemented sequence.

Alignment Data Lines

Alignment data lines contain three required attribute values:

size dt dq

• size -- the size of the ungapped alignment

• dt -- the difference between the end of this block and the beginning of the next block (reference sequence)

• dq -- the difference between the end of this block and the beginning of the next block (query sequence)

NOTE: The last line of the alignment section contains only one number: the ungapped alignment size of the last block.
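The header and data lines fit together as follows: summing size + dt over all blocks gives the reference span tEnd - tStart, and summing size + dq gives the query span qEnd - qStart. The sketch below parses chain text and checks this. It is an illustration only; `parse_chains` and `aligned_span` are hypothetical names, the score is assumed to be an integer as in the examples above, and data lines that appear before any header are silently ignored.

```python
HEADER_KEYS = ("score tName tSize tStrand tStart tEnd "
               "qName qSize qStrand qStart qEnd id").split()
INT_KEYS = ("score", "tSize", "tStart", "tEnd", "qSize", "qStart", "qEnd")


def parse_chains(text):
    """Parse chain-format text into a list of dicts with header fields and blocks."""
    chains = []
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            current = None  # a blank line terminates the current chain
            continue
        if tokens[0] == "chain":
            current = dict(zip(HEADER_KEYS, tokens[1:13]))
            for key in INT_KEYS:
                current[key] = int(current[key])
            current["blocks"] = []
            chains.append(current)
        elif current is not None:
            size = int(tokens[0])
            # The last data line carries only the block size.
            dt, dq = (int(tokens[1]), int(tokens[2])) if len(tokens) == 3 else (0, 0)
            current["blocks"].append((size, dt, dq))
    return chains


def aligned_span(chain):
    """Return (reference span, query span) implied by the alignment blocks."""
    t_span = sum(size + dt for size, dt, _ in chain["blocks"])
    q_span = sum(size + dq for size, _, dq in chain["blocks"])
    return t_span, q_span


example = (
    "chain 4900 chrY 58368225 + 25985406 25985566 "
    "chr5 151006098 - 43549808 43549970 2\n"
    "16 0 2\n60 4 0\n10 0 4\n70\n\n"
)
chain = parse_chains(example)[0]
spans = aligned_span(chain)
```

For the second example chain in this appendix, the blocks imply a reference span of 160 and a query span of 162, which agree with the header coordinates.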

E. Output format

The output of the program is placed in a file called results.json in the directory specified by the -s/--save parameter. It is in the JSON format and takes the following shape:

[ record1, record2, … ]

where a record is [scores, crm],

where scores is [NRMSE, Sensitivity, Specificity, Number of Experimental instances, Number of Control instances],

and where crm is [motif_name, motif_name, ...]
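Output in this shape can be loaded and ranked with Python's standard json module. The sketch below is a reading aid, not part of CRAM: `load_results` and the dictionary keys are hypothetical, and the sample text reuses two rows of Table 3 purely as example values. In practice the text would be read from results.json in the directory given by the -s/--save parameter.

```python
import json


def load_results(json_text):
    """Turn [[scores, crm], ...] into a list of dicts sorted by ascending NRMSE."""
    records = []
    for scores, crm in json.loads(json_text):
        nrmse, sensitivity, specificity, n_expr, n_ctrl = scores
        records.append({
            "nrmse": nrmse,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "expr": n_expr,
            "ctrl": n_ctrl,
            "motifs": crm,
        })
    return sorted(records, key=lambda record: record["nrmse"])


sample = """[
  [[0.0148, 1.0, 0.0, 12, 3], ["AREB6_4(T)", "ARID3A(J)", "Prrx2(J)"]],
  [[0.0125, 1.0, 0.0, 12, 3], ["AREB6_4(T)", "ARID3A(J)"]]
]"""
ranked = load_results(sample)
best = ranked[0]
```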

F. JSON format

This is JavaScript Object Notation. It is a simple data representation format defining strings, numbers, lists, and maps, like so: {"key": ["value", 1, 2, 3]}

For details please see:

Crockford, D. (2006). JSON: The Fat-Free Alternative to XML. Presented at XML 2006, Boston. Retrieved from http://www.json.org/fatfree.html

G. Financial support

My Research Assistantship under which this thesis research was conducted was funded by an NIH/NCI grant, U54CA113001, PI: Dr. Tim H.-M. Huang, Co-I: Dr Victor X Jin.

H. License Citations

Chris Taplin created the ChIP-seq work flow illustration and released it for reuse under a Creative Commons 3.0 share alike license http://creativecommons.org/licenses/by-sa/3.0/

I did not alter, transform, or build upon Chris Taplin's work, and therefore, no liability or obligations beyond the inclusion of this attribution and mention of the license are required of me, any other person, The Ohio State University, any department thereof, or this innocent document itself.

BioPython may in the future be distributed in part or in whole with CRAM which entails no liability or further obligations beyond propagation of this license.

Biopython License Agreement

Permission to use, copy, modify, and distribute this software and its documentation with or without modifications and for any purpose and without fee is hereby granted, provided that any copyright notices appear in all copies and that both those copyright notices and this permission notice appear in supporting documentation, and that the names of the contributors or copyright holders not be used in advertising or publicity pertaining to distribution of the software without specific prior permission. The contributors and copyright holders of this software disclaim all warranties with regard to this software, including all implied warranties of merchantability and fitness, in no event shall the contributors or copyright holders be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.
