A HMM Approach to Identifying Distinct DNA Methylation Patterns

for Subtypes of Breast Cancers

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science

in the Graduate School of The Ohio State University

By

Maoxiong Xu, B.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2011

Thesis Committee:

Victor X. Jin, Advisor

Raghu Machiraju

Copyright by

Maoxiong Xu

2011

Abstract

The United States has the highest annual incidence rates of breast cancer in the world: 128.6 per 100,000 among whites and 112.6 per 100,000 among African Americans [1,2]. It is the second-most common cancer (after skin cancer) and the second-most common cause of cancer death (after lung cancer) [1]. Recent studies have demonstrated that hyper-methylation of CpG islands may be implicated in tumorigenesis, acting as a mechanism to inactivate the expression of a diverse array of genes (Baylin et al., 2001).

Genes reported to be regulated by CpG hyper-methylation include tumor suppressor genes, cell cycle related genes, DNA mismatch repair genes, hormone receptors, and tissue or cell adhesion molecules (Yan et al., 2001). Breast cancer cells may or may not express three important receptors: estrogen receptor (ER), progesterone receptor (PR), and HER2. We therefore take ER, PR, and HER2 status into account when handling the data. In this thesis, we first train a Hidden Markov Model (HMM) on methylation data from both breast cancer cells and other cancer cells. We then perform hierarchical clustering on the gene expression data for the breast cancer cells and, based on the clustering results, obtain the methylation distribution in each cluster. Finally, we correlate the HMM training results with the methylation distributions to assign biological meanings to the states of the trained HMM.


Dedicated to my father, mother, and wife,

for all of their love and support.


Acknowledgments

I have many people to thank for my making it this far: my advisor, Dr. Victor Jin, for everything he's done; Dr. Raghu Machiraju, for his help and support; all of my lab mates, for their knowledge, assistance, and encouragement; and the incredible

Biomedical Informatics Department staff for everything they do.


Vita

2005: Mudu Central High School

2009: B.S. Computer Science, Southeast University

2009 to present: M.S. Computer Science & Engineering, The Ohio State University

Sep. 2010 to present: Graduate Teaching Associate, Department of Bioinformatics, The Ohio State University

Publications

Cao AR, Rabinovich R, Xu M, Xu X, Jin VX, Farnham PJ: Genome-wide analysis of transcription factor E2F1 mutant proteins reveals that N- and C-terminal protein interaction domains do not participate in targeting E2F1 to the human genome. J Biol Chem. 2011 Apr 8; 286(14):11985-96. Epub 2011 Feb 10.

Fields of Study

Major: Computer Science & Engineering

Machine Learning applied in Bioinformatics


Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
  1.1 Methylation
    1.1.1 What Is Methylation?
    1.1.2 DNA Methylation
    1.1.3 DNA Methylation Mechanism
    1.1.4 DNA Methylation in Cancer
  1.2 Gene Expression
    1.2.1 Gene Expression Measurement
    1.2.2 mRNA Quantification
    1.2.3 Regulation of Gene Expression
  1.3 Hidden Markov Model
    1.3.1 Introduction to Hidden Markov Model
    1.3.2 Hidden Markov Model
    1.3.3 Model Architecture
    1.3.4 HMM Training and Decoding
    1.3.5 HMMs in Computational Biology
    1.3.6 Application of HMMs to Specific Problems
Chapter 2: Methods and Algorithms
  2.1 The Probabilistic Model
  2.2 Baum-Welch Algorithm
  2.3 Work Flow
Chapter 3: Data Process
  3.1 Data Sets
  3.2 MBD-seq Protocol
  3.3 Data Preprocess
  3.4 Input for HMM
  3.5 Methylation Distribution Overview
  3.6 Gene Expression Data
Chapter 4: Results and Discussion
  4.1 Results from HMM
  4.2 Biology Meanings
    4.2.1 Gene Expression Results for 33 Breast Cancer Cell Lines
    4.2.2 Results Based on Different Clusters
    4.2.3 States Meanings and Group Patterns
Chapter 5: Data Visualization
Chapter 6: Conclusions and Suggestions for Further Work
  6.1 Conclusion
  6.2 Future Work
References
Appendix: Formats
  A. BAM format
  B. SAM format
  C. Export format
  D. BED format
  E. Fastq format
  F. Bowtie output format


List of Tables

Table 3.1 Data summary for 36 cell lines
Table 3.2 12 Groups for 36 cell lines
Table 4.1 BIC results for HMM results
Table 4.2 Transition Matrix
Table 4.3 Emission probabilities for each mark in each state
Table 4.4 Ordered emission probabilities for each mark in each state: marks
Table 4.5 Ordered emission probabilities for each mark in each state: probabilities
Table 4.6 Filtered ordered emission probabilities for each mark in each state: marks
Table 4.7 Number of genes in each cluster
Table 4.8 First 3 marks for each state
Table 4.9 States and interval correlation results
Table 4.10 States meanings
Table 4.11 Patterns for subtypes of Breast cancers


List of Figures

Fig 1.1 Methylation
Fig 1.2 DNA methylation
Fig 1.3 DNA methylation mechanism
Fig 1.4 DNA methylation in cancer
Fig 1.5 Gene Expression
Figure 1.6: A simple HMM λ = (A, B, π), where N = 3, M = 3, a12, a23, a32 are non-zero, b1(a) = b2(t) = b3(g) = 1 and π = (1, 0, 0)
Fig 2.1 A broad overview of the HMM work-flow, highlighting the most significant inputs, transformations, and outputs at each step from start to end
Fig 3.1 Bar figure for 36 cell lines
Fig 3.2 Methylation distribution for 33 breast cancer cell lines
Fig 4.1 Heatmap for transition matrix
Fig 4.2 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering
Fig 4.3 Grouped 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering
Fig 4.4 Methylation distribution based on cluster 1 genes
Fig 4.5 Methylation distribution based on cluster 2 genes
Fig 4.6 Methylation distribution based on cluster 3 genes
Fig 4.7 Methylation distribution based on cluster 4 genes
Fig 4.8 Methylation distribution based on cluster 5 genes
Fig 4.9 Methylation distribution based on cluster 6 genes
Fig 4.10 Methylation distribution based on cluster 7 genes
Fig 4.11 Methylation distribution based on cluster 8 genes
Fig 4.12 Methylation distribution based on cluster 9 genes
Fig 5.1 Database Web Tool


Chapter 1: Introduction

1.1 Methylation

1.1.1 What Is Methylation?

In the view of chemical sciences, methylation means the addition of a methyl group to a substrate or the substitution of an atom or group by a methyl group.

Methylation is a form of alkylation with, to be specific, a methyl group, rather than a larger carbon chain, replacing a hydrogen atom.

In the view of biological systems, methylation is catalyzed by enzymes; such methylation can be involved in modification of heavy metals, regulation of gene expression, regulation of protein function, and RNA metabolism. Methylation of heavy metals can also occur outside of biological systems. Chemical methylation of tissue samples is also one method for reducing certain histological staining artifacts.

Fig 1.1 Methylation

The term methylation in organic chemistry refers to the alkylation process used to describe the delivery of a CH3 group [3]. This is commonly performed using electrophilic methyl sources - iodomethane, dimethyl sulfate, dimethyl carbonate, or less commonly the more powerful (and more dangerous) methylating reagents methyl triflate or methyl fluorosulfonate (magic methyl), which all react via SN2 nucleophilic substitution.

For example a carboxylate may be methylated on oxygen to give a methyl ester, an alkoxide salt RO− may be likewise methylated to give an ether, ROCH3, or a ketone enolate may be methylated on carbon to produce a new ketone.

1.1.2 DNA Methylation

After every cycle of DNA replication, several modifications occur in the DNA. DNA methylation is one such post-synthesis modification. It is an epigenetic modification involved in both normal developmental processes and disease states through the modulation of gene expression and the maintenance of genomic organization [4]. DNA methylation has been shown to be involved in a number of biological processes, such as regulation of imprinted genes, X chromosome inactivation, and tumor suppressor gene silencing in cancerous cells. It also acts as a protection mechanism adopted by pathogen DNA (mainly bacterial) against the endonuclease activity that destroys any foreign DNA [5, 6].

Fig 1.2 DNA methylation

DNA cytosine methylation is the covalent addition of a methyl group to the 5 position of cytosine. In humans, DNA methylation occurs predominantly in a CpG dinucleotide context and is catalyzed by DNA methyltransferases [7, 8, 9]. Dense clusters of CpG dinucleotides, termed CpG islands, are present in roughly 40% of gene promoters, and methylation of these regions is associated with transcriptional silencing [10, 11]. DNA methylation is essential for normal developmental processes, such as imprinting [12] and X chromosome inactivation [13]. Dysregulation of DNA methylation occurs in disease states such as cancer, where promoter CpG island hyper-methylation leads to inactivation of tumor suppressor genes [14, 15]. Thus, many tumor suppressors classically identified through mutation analyses, such as APC [16, 17], BRCA1 [18, 19], and CDKN2A [20, 21], have also been found to be transcriptionally silenced by promoter hyper-methylation.

1.1.3 DNA Methylation Mechanism

In DNA, methylation usually occurs in CpG islands, CG-rich regions upstream of the promoter region. The letter “p” here signifies that the C and G are connected by a phosphodiester bond. In humans, DNA methylation is carried out by a group of enzymes called DNA methyltransferases. These enzymes not only determine the DNA methylation patterns during early development, but are also responsible for copying these patterns to the strands generated by DNA replication [6].

DNA methylation involves the addition of a methyl group to the 5 position of the cytosine pyrimidine ring or the number 6 nitrogen of the adenine purine ring (cytosine and adenine are two of the four bases of DNA). This modification can be inherited through cell division. DNA methylation is typically removed during zygote formation and re-established through successive cell divisions during development, although the latest research shows that hydroxylation of methyl groups occurs in the zygote rather than their complete removal [22]. DNA methylation is a crucial part of normal organismal development and cellular differentiation in higher organisms. DNA methylation stably alters the gene expression pattern in cells such that cells can "remember where they have been" or decrease gene expression; for example, cells programmed to be pancreatic islets during embryonic development remain pancreatic islets throughout the life of the organism without continuing signals telling them that they need to remain islets. In addition, DNA methylation suppresses the expression of viral genes and other deleterious elements that have been incorporated into the genome of the host over time. DNA methylation also forms the basis of chromatin structure, which enables cells to form the myriad characteristics necessary for multicellular life from a single immutable sequence of DNA.

Fig 1.3 DNA methylation mechanism

DNA methylation also plays a crucial role in the development of nearly all types of cancer [23].

1.1.4 DNA Methylation in Cancer

DNA methylation is an important regulator of gene transcription. A large body of evidence has demonstrated that aberrant DNA methylation is associated with unscheduled gene silencing, and that genes with high levels of 5-methylcytosine in their promoter region are transcriptionally silent. DNA methylation is essential during embryonic development, and in somatic cells, patterns of DNA methylation are generally transmitted to daughter cells with high fidelity. Aberrant DNA methylation patterns have been associated with a large number of human malignancies and are found in two distinct forms: hyper-methylation and hypo-methylation compared to normal tissue.

Hyper-methylation is one of the major epigenetic modifications that repress transcription via the promoter regions of tumor suppressor genes. Hyper-methylation typically occurs at CpG islands in the promoter region and is associated with gene inactivation. Global hypo-methylation has also been implicated in the development and progression of cancer through different mechanisms [24].

Fig 1.4 DNA methylation in cancer

1.2 Gene Expression

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as rRNA genes or tRNA genes, the product is a functional RNA. The process of gene expression is used by all known life - eukaryotes (including multicellular organisms), prokaryotes (bacteria and archaea) and viruses - to generate the macromolecular machinery for life. Several steps in the gene expression process may be modulated, including transcription, RNA splicing, translation, and post-translational modification of a protein. Gene regulation gives the cell control over structure and function, and is the basis for cellular differentiation, morphogenesis and the versatility and adaptability of any organism. Gene regulation may also serve as a substrate for evolutionary change, since control of the timing, location, and amount of gene expression can have a profound effect on the functions (actions) of the gene in a cell or in a multicellular organism.

Fig 1.5 Gene Expression

In genetics, gene expression is the most fundamental level at which genotype gives rise to the phenotype. The genetic code is "interpreted" by gene expression, and the properties of the expression products give rise to the organism's phenotype.

1.2.1 Gene Expression Measurement

Measuring gene expression is an important part of many life sciences - the ability to quantify the level at which a particular gene is expressed within a cell, tissue or organism can give a huge amount of information. For example, measuring gene expression can:

• Identify viral infection of a cell (viral protein expression)

• Determine an individual's susceptibility to cancer (oncogene expression)

• Find if a bacterium is resistant to penicillin (beta-lactamase expression)

Similarly, analysis of the location of protein expression is a powerful tool, and this can be done on an organism or cellular scale. Investigation of localization is particularly important for the study of development in multicellular organisms and as an indicator of protein function in single cells. Ideally, measurement of expression is done by detecting the final gene product (for many genes this is the protein); however, it is often easier to detect one of the precursors, typically mRNA, and infer the gene expression level.

1.2.2 mRNA Quantification

Levels of mRNA can be quantitatively measured by Northern blotting, which gives size and sequence information about the mRNA molecules. A sample of RNA is separated on an agarose gel and hybridized to a radio-labeled RNA probe that is complementary to the target sequence. The radio-labeled RNA is then detected by an autoradiograph. The main problems with Northern blotting stem from the use of radioactive reagents (which make the procedure time consuming and potentially dangerous) and lower quality quantification than more modern methods (because quantification is done by measuring band strength in an image of a gel). Northern blotting is, however, still widely used, as the additional mRNA size information allows the discrimination of alternatively spliced transcripts.

A more modern low-throughput approach for measuring mRNA abundance is reverse transcription quantitative polymerase chain reaction (RT-PCR followed by qPCR). RT-PCR first generates a DNA template from the mRNA by reverse transcription. The DNA template is then used for qPCR, where the fluorescence of a probe changes as the DNA amplification process progresses. With a carefully constructed standard curve, qPCR can produce an absolute measurement such as the number of copies of mRNA, typically in units of copies per nanolitre of homogenized tissue or copies per cell. qPCR is very sensitive (detection of a single mRNA molecule is possible), but can be expensive due to the fluorescent probes required.

Northern blots and RT-qPCR are good for detecting whether a single gene is being expressed, but they quickly become impractical if many genes within the sample are being studied. Using DNA microarrays, transcript levels for many genes at once (expression profiling) can be measured. Recent advances in microarray technology allow for the quantification, on a single array, of transcript levels for every known gene in several organisms' genomes, including humans.

Alternatively, "tag based" technologies like serial analysis of gene expression (SAGE), which can provide a relative measure of the cellular concentration of different messenger RNAs, can be used. The great advantage of tag-based methods is the "open architecture", allowing for the exact measurement of any transcript, with a known or unknown sequence.

1.2.3 Regulation of Gene Expression

Regulation of gene expression refers to the control of the amount and timing of appearance of the functional product of a gene. Control of expression is vital to allow a cell to produce the gene products it needs when it needs them; in turn this gives cells the flexibility to adapt to a variable environment, external signals, damage to the cell, etc. Some simple examples of where gene expression is important are:

• Control of insulin expression so it gives a signal for blood glucose regulation

• X chromosome inactivation in female mammals to prevent an "overdose" of the genes it contains

• Cycling expression levels control progression through the eukaryotic cell cycle

More generally, gene regulation gives the cell control over all structure and function, and is the basis for cellular differentiation, morphogenesis and the versatility and adaptability of any organism.

Any step of gene expression may be modulated, from the DNA-RNA transcription step to post-translational modification of a protein. The stability of the final gene product, whether it is RNA or protein, also contributes to the expression level of the gene - an unstable product results in a low expression level. In general, gene expression is regulated through changes in the number and type of interactions between molecules that collectively influence transcription of DNA and translation of RNA.

DNA methylation is a widespread mechanism for epigenetic influence on gene expression; it is seen in bacteria and eukaryotes and has roles in heritable transcription silencing and transcription regulation. In eukaryotes, the structure of chromatin, controlled by the histone code, regulates access to DNA, with significant impacts on the expression of genes in euchromatin and heterochromatin areas.

1.3 Hidden Markov Model (HMM)

1.3.1 Introduction to HMM

A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties of observed real-world data. A good HMM accurately models the real-world source of the observed data and has the ability to simulate the source. Machine learning techniques based on HMMs have been successfully applied to problems including speech recognition, optical character recognition, and, as we will examine, problems in computational biology.

Methylation finding or prediction has become one of the foremost computational biology problems for two reasons. First, completely sequenced genomes have become readily available. Second, and most important, there is a need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of hyper-methylation on the genome is a very important step towards building such a body of knowledge. This thesis will introduce several different statistical and algorithmic methods for hyper-methylation finding, with a focus on the statistical model-based approach using HMMs.

1.3.2 Hidden Markov Models

A basic Markov model of a process is a model where each state corresponds to an observable event and the state transition probabilities depend only on the current and predecessor state. This model is extended to a Hidden Markov model for application to more complex processes, including speech recognition and computational gene finding.

A generalized Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output symbols, a set of state transition probabilities and a set of emission probabilities. The emission probabilities specify the distribution of output symbols that may be emitted from each state. Therefore in a hidden model, there are two stochastic processes; the process of moving between states and the process of emitting an output sequence. The sequence of state transitions is a hidden process and is observed through the sequence of emitted symbols.

Let us formalize the definition of an HMM in the following way, taken from an HMM tutorial by Lawrence Rabiner [25]. An HMM is defined by the following elements:

1. Set S of N states, S = {S1, S2, …, SN}.

2. Set O of M observation symbols, the output alphabet, O = {o1, o2, …, oM}.

3. Set A of state transition probabilities, A = {aij}, where aij is the probability of moving from state i to state j:

$$a_{ij} = P(Q_{t+1} = S_j \mid Q_t = S_i), \quad 1 \le i, j \le N \tag{E1.1}$$

4. Set B of observation symbol probabilities, B = {bj(k)}, where bj(k) is the probability of emitting symbol k at state j:

$$b_j(k) = P(O_t = o_k \mid Q_t = S_j), \quad 1 \le j \le N, \; 1 \le k \le M \tag{E1.2}$$

5. Set π, the initial state distribution, π = {πi}, where πi is the probability that the start state is state i:

$$\pi_i = P(Q_1 = S_i), \quad 1 \le i \le N \tag{E1.3}$$

Here Qt denotes the state at time t and Ot the symbol observed at time t.

Given the definitions above, the notation of a model is λ = (A, B, π).

Figure 1.6: A simple HMM λ = (A, B, π), where N = 3, M = 3, a12, a23, a32 are non-zero, b1(a) = b2(t) = b3(g) = 1 and π = (1, 0, 0). Note that states can be 'null' states that do not emit any symbol.
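As an illustration only, the model of Figure 1.6 can be written down and sampled from in a few lines of Python. This is a minimal sketch, not code from this thesis: it assumes that a12, a23 and a32 are the only non-zero transitions (so each must equal 1 for the rows of A to sum to 1), and the names `states`, `symbols` and `generate` are hypothetical.

```python
import random

# Hypothetical encoding of the model in Figure 1.6; states S1..S3 map to indices 0..2.
states = ["S1", "S2", "S3"]
symbols = ["a", "t", "g"]

# Assumption: a12, a23 and a32 are the only non-zero transitions, so each equals 1.
A = [
    [0.0, 1.0, 0.0],  # S1 -> S2
    [0.0, 0.0, 1.0],  # S2 -> S3
    [0.0, 1.0, 0.0],  # S3 -> S2
]
# b1(a) = b2(t) = b3(g) = 1: every state emits exactly one symbol.
B = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# pi = (1, 0, 0): the chain always starts in S1.
pi = [1.0, 0.0, 0.0]


def generate(A, B, pi, length, rng=random):
    """Sample a (state path, emitted string) pair of the given length from (A, B, pi)."""
    path, emitted = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        path.append(state)
        # Emit a symbol according to the emission row of the current state.
        emitted.append(symbols[rng.choices(range(len(symbols)), weights=B[state])[0]])
        # Move to the next state according to the transition row.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return path, "".join(emitted)


path, sequence = generate(A, B, pi, 5)
print(path, sequence)
```

Because every non-zero probability in this particular model is 1, the sampled path is the same on every run; with less extreme probabilities the same function would produce different hidden paths and emitted strings.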

1.3.3 Model Architecture

The set of states S, the output symbol alphabet O and the connections between the states constitute the architecture of a model. The architecture of an HMM is problem dependent. The model is constructed to correspond to the properties and constraints of the observed sequences and of the process itself. HMM architecture can also be learned from the data, but in most computational biology problems, it is advantageous to use known constraints that characterize the processes.

1.3.4 HMM Training and Decoding

Once the architecture of an HMM has been decided, the HMM must be trained to closely fit the process it models. Training involves adjusting the transition and output probabilities until the model sufficiently fits the process. These adjustments are performed using standard machine learning techniques to optimize P(O|λ), the probability of the observed sequence O = O1O2…OT given the model λ, over a set of training sequences (here T is the length of the observation sequence, i.e. the number of 1000bp intervals). The most common and straightforward algorithm for HMM training is expectation maximization (EM), which adapts the transition and output parameters by continually re-estimating these parameters until P(O|λ) has been locally maximized.

HMM decoding involves the prediction of hidden states given an observed sequence. The problem is to discover the best sequence of states Q = Q1Q2…QT visited that accounts for an emitted sequence O = O1O2…OT and a model λ. There may be several different ways to define a best sequence of states. A common decoding algorithm is the Viterbi algorithm, which uses a dynamic programming approach to find the most likely sequence of states Q given an observed sequence O and model λ.
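The dynamic programming idea behind Viterbi decoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the two-state model ("low" and "high" methylation), its probabilities, and the binary observations (mark absent = 0, present = 1 per interval) are all hypothetical, and log-probabilities are used to avoid numerical underflow on long sequences.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log-probability, most likely state path) for the observed sequence."""

    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    # delta[s]: best log-probability of any state path ending in s at the current step.
    delta = {s: log(start_p[s]) + log(emit_p[s].get(obs[0], 0.0)) for s in states}
    backpointers = []

    for symbol in obs[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Best predecessor state for s at this step.
            prev = max(states, key=lambda r: delta[r] + log(trans_p[r][s]))
            new_delta[s] = (delta[prev] + log(trans_p[prev][s])
                            + log(emit_p[s].get(symbol, 0.0)))
            pointers[s] = prev
        backpointers.append(pointers)
        delta = new_delta

    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[last], path


# Hypothetical toy model: all numbers below are illustrative assumptions.
states = ("low", "high")
start_p = {"low": 0.5, "high": 0.5}
trans_p = {"low": {"low": 0.9, "high": 0.1}, "high": {"low": 0.1, "high": 0.9}}
emit_p = {"low": {0: 0.9, 1: 0.1}, "high": {0: 0.2, 1: 0.8}}
obs = [0, 0, 1, 1, 1, 0, 0]

best_log_p, best_path = viterbi(obs, states, start_p, trans_p, emit_p)
print(best_path)  # ['low', 'low', 'high', 'high', 'high', 'low', 'low']
```

In this toy example the decoded path switches to the "high" state for the run of present marks and back again afterwards, because the high self-transition probabilities make brief state changes costly.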

1.3.5 HMMs in Computational Biology

The field of computational biology involves the application of computer science theories and approaches to biological and medical problems. Computational biology is motivated by newly available and abundant raw molecular datasets gathered from a variety of organisms. Though the availability of this data marks a new era in biological research, it alone does not provide any biologically significant knowledge. The goal of computational biology is then to elucidate additional information regarding protein coding, protein function and many other cellular mechanisms from the raw datasets. This new information is required for drug design, medical diagnosis, medical treatment and countless fields of research.

The majority of raw molecular data used in computational biology corresponds to sequences of nucleotides corresponding to the primary structure of DNA and RNA or sequences of amino acids corresponding to the primary structure of proteins. Therefore the problem of inferring knowledge from this data belongs to the broader class of sequence analysis problems.

Two of the most studied sequence analysis problems are speech recognition and language processing. Biological sequences have the same left-to-right linear aspect as sequences of sounds corresponding to speech and sequences of words representing language. Consequently, the major computational biology sequence analysis problems can be mapped to linguistic problems. A common linguistic metaphor in computational biology is that of protein family classification as speech recognition. The metaphor suggests interpreting different proteins belonging to the same family as different vocalizations of the same word. Another metaphor is gene finding in DNA sequences as the parsing of language into words and semantically meaningful sentences. It follows that biological sequences can be treated linguistically with the same techniques used for speech recognition and language processing.

Since the theory of HMMs was formalized in the late 1960s, several scientists have applied the theory to speech recognition and language processing. Just as HMMs were first introduced as mathematical models of language, HMMs can be used as mathematical models of molecular processes and biological sequences. In addition, HMMs have been applied to linguistics because they are suited for problems where the exact theory may be unknown but where there exist large amounts of data and knowledge derived from observation. As this is also the situation in biology, HMM-based approaches have been successfully applied to problems in computational biology. The main benefit is that an HMM provides a good method for learning the theory from the data and can provide a structured model of sequence data and molecular processes.

1.3.6 Application of HMMs to Specific Problems

It is clear that an HMM-based approach is a logical idea for tackling problems in computational biology. Much work has been performed and applications have been built using such an approach.

Baldi and Brunak [26] define three main groups of problems in computational biology for which HMMs have proven especially useful. First, HMMs can be used for multiple alignments of DNA sequences, which is a difficult task to perform using a naive dynamic programming approach. Second, the structure of trained HMMs can uncover patterns in biological data. Such patterns have been used to discover periodicities within specific regions of the data and to help predict regions in sequences prone to forming specific structures. Third is the large set of classification problems. HMM-based approaches have been applied to structure prediction, the problem of classifying each nucleotide according to the structure to which it belongs. HMMs have also been used in protein profiling to discriminate between different protein families and predict a new protein's family or subfamily. HMM-based approaches have also been successful when applied to the problem of methylation function finding, that is, predicting the function of methylation according to the type of role it performs. The remainder of this thesis is concerned with this last problem of computational methylation function prediction.


Chapter 2: Methods and Algorithms

2.1 The Probabilistic Model

The probabilistic model is based on a multivariate instance of a Hidden Markov Model (HMM). The model assumes a fixed number of hidden states N. In each hidden state, the emission distribution, that is, the probability distribution over each combination of marks, is modeled with a product of independent Bernoulli random variables. Formally, for each of the N states and M = 4096 (i.e. $2^{12}$) input marks, there is an emission parameter $b_{k,m}$ denoting the probability in state k (k = 1,…,N) that input mark m (m = 1,…,M) has a present call. Let $c \in C$ denote a chromosome, where C is the set of all chromosomes. Let t = 1,…,T index the 1000bp intervals on the genome, where T is the number of intervals into which the genome was divided. Let $O_{c,t,m}$ be assigned '1' if mark m is detected in the t-th 1000bp interval on chromosome c, and '0' otherwise. Let $a_{ij}$ denote the probability of transitioning from state i to state j, where i = 1,…,N and j = 1,…,N.

We also have parameters $\pi_i$ (i = 1,…,N), which denote the probability that the state of the first interval on the chromosome is i. Let $Q_c \in \mathcal{Q}_c$ be an unobserved state sequence through chromosome c, where $\mathcal{Q}_c$ is the set of all possible state sequences. Let $Q_{c,t}$ denote the unobserved state on chromosome c at location t for state sequence $Q_c$. The full likelihood of all of the observed data D for the parameters a, b, and π can then be expressed as:

$$P(D \mid a, b, \pi) = \prod_{c \in C} \sum_{Q_c \in \mathcal{Q}_c} \pi_{Q_{c,1}} \prod_{t=2}^{T} a_{Q_{c,t-1},\,Q_{c,t}} \prod_{t=1}^{T} \prod_{m=1}^{M} b_{Q_{c,t},m}^{\,O_{c,t,m}} \left(1 - b_{Q_{c,t},m}\right)^{1 - O_{c,t,m}} \tag{E2.1}$$
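To make the emission model concrete, the probability of one interval's marks in a given state is a product of independent Bernoulli terms, one per mark. The following is a minimal sketch under stated assumptions: the function name `emission_log_prob`, the three-mark state, and its parameter values are hypothetical and chosen only for illustration; they are not the trained parameters reported in this thesis.

```python
import math


def emission_log_prob(b_k, v):
    """Log-probability of a binary mark vector v in a state with Bernoulli parameters b_k.

    b_k[m] is the probability that mark m has a present call in this state;
    v[m] is 1 if mark m was called present in the interval, else 0.
    """
    total = 0.0
    for p, present in zip(b_k, v):
        # Each mark contributes p if present and (1 - p) if absent.
        q = p if present else 1.0 - p
        if q == 0.0:
            return float("-inf")
        total += math.log(q)
    return total


# Hypothetical state with three marks (illustrative values only).
b_k = [0.9, 0.2, 0.5]
v = [1, 0, 1]
log_p = emission_log_prob(b_k, v)
print(math.exp(log_p))  # 0.9 * (1 - 0.2) * 0.5
```

Working in log space turns the product over marks into a sum, which stays numerically stable when the number of marks or intervals is large.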

2.2 Baum-Welch Algorithm

In electrical engineering, computer science, statistical computing and bioinformatics, the Baum–Welch algorithm is used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm and is named for Leonard E. Baum and Lloyd R. Welch.

A Hidden Markov Model is a probabilistic model of the joint probability of a collection of random variables $\{O_1, \dots, O_T, Q_1, \dots, Q_T\}$. The $O_t$ variables are discrete observations and the $Q_t$ variables are “hidden” and discrete. Under an HMM, two conditional independence assumptions are made about these random variables that make the associated algorithms tractable. These independence assumptions are:

1. The $t$-th hidden variable, given the $(t-1)$-st hidden variable, is independent of previous variables, or:

$$P(Q_t \mid Q_{t-1}, O_{t-1}, \dots, Q_1, O_1) = P(Q_t \mid Q_{t-1}) \qquad \text{(E2.2)}$$

2. The $t$-th observation depends only on the $t$-th state:

$$P(O_t \mid Q_T, O_T, \dots, Q_1, O_1) = P(O_t \mid Q_t) \qquad \text{(E2.3)}$$

In the following, we present the EM algorithm for finding the maximum-likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. This algorithm is also known as the Baum-Welch algorithm.

$Q_t$ is a discrete random variable with $N$ possible values $\{S_1, \dots, S_N\}$. We further assume that the underlying “hidden” Markov chain defined by $P(Q_t \mid Q_{t-1})$ is time-homogeneous (i.e., independent of the time $t$). Therefore, we can represent $P(Q_t \mid Q_{t-1})$ as a time-independent stochastic transition matrix

$$A = \{a_{ij}\}, \qquad a_{ij} = P(Q_t = S_j \mid Q_{t-1} = S_i) \qquad \text{(E2.4)}$$

The special case of time $t = 1$ is described by the initial state distribution

$$\pi_i = P(Q_1 = S_i) \qquad \text{(E2.5)}$$

We say that we are in state $j$ at time $t$ if $Q_t = S_j$. A particular sequence of states is described by $Q = (Q_1, \dots, Q_T)$, where $Q_t \in \{S_1, \dots, S_N\}$ is the state at time $t$.

The observation is one of $M$ possible observation symbols, $O_t \in \{o_1, \dots, o_M\}$. The probability of a particular observation symbol $o_k$ at a particular time $t$ for state $j$ is described by:

$$b_j(o_k) = P(O_t = o_k \mid Q_t = S_j) \qquad \text{(E2.6)}$$

so $B = \{b_j(o_k)\}$ is an $N$ by $M$ matrix.

A particular observation sequence $O$ is described as $O = (O_1 = o_1, \dots, O_T = o_T)$.

Therefore, we can describe an HMM by $\lambda = (A, B, \pi)$. Given an observation sequence $O$, the Baum-Welch algorithm finds:

$$\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda) \qquad \text{(E2.7)}$$

that is, the HMM $\lambda$ that maximizes the probability of the observation sequence $O$.

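Initialization: set $\lambda = (A, B, \pi)$ with random initial conditions. The algorithm updates the parameters of $\lambda$ iteratively until convergence, following the procedure below.

The forward procedure:

We define:

$$\alpha_i(t) = P(O_1 = o_1, \dots, O_t = o_t, Q_t = S_i \mid \lambda) \qquad \text{(E2.8)}$$

which is the probability of seeing the partial sequence $o_1, \dots, o_t$ and ending up in state $S_i$ at time $t$.

We can efficiently calculate $\alpha_i(t)$ recursively as:

1. $\alpha_i(1) = \pi_i \, b_i(o_1)$ (E2.9)

2. $\alpha_j(t+1) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_i(t) \, a_{ij}$ (E2.10)

The backward procedure:

We define $\beta_i(t) = P(O_{t+1} = o_{t+1}, \dots, O_T = o_T \mid Q_t = S_i, \lambda)$. This is the probability of the ending partial sequence $o_{t+1}, \dots, o_T$ given that we started at state $S_i$ at time $t$. We can efficiently calculate $\beta_i(t)$ as:

1. $\beta_i(T) = 1$ (E2.11)

2. $\beta_i(t) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_j(t+1)$ (E2.12)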

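Using $\alpha$ and $\beta$, we can calculate the following variables:

$$\gamma_i(t) = P(Q_t = S_i \mid O, \lambda) = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\,\beta_j(t)} \qquad \text{(E2.13)}$$

$$\xi_{ij}(t) = P(Q_t = S_i, Q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\sum_{k=1}^{N} \sum_{l=1}^{N} \alpha_k(t)\, a_{kl}\, b_l(o_{t+1})\, \beta_l(t+1)} \qquad \text{(E2.14)}$$

Having $\gamma$ and $\xi$, we can define the update rules as follows:

$$\pi'_i = \gamma_i(1) \qquad \text{(E2.15)}$$

$$a'_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \qquad \text{(E2.16)}$$

$$b'_i(k) = \frac{\sum_{t=1}^{T} \mathbb{1}[O_t = o_k]\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)} \qquad \text{(E2.17)}$$

(note that the summation in the numerator of $b'_i(k)$ is only over the time points whose observed symbol equals $o_k$).

Using the updated values of $A$, $B$ and $\pi$, a new iteration is performed, and this is repeated until convergence.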
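The forward, backward and update steps of this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the program used in this thesis: it handles a single observation sequence (the thesis trains over all chromosomes), it rescales the forward and backward variables at every time step to avoid numerical underflow (a standard device that is not described above), and the function names `forward_backward` and `baum_welch_step` are hypothetical.

```python
import math

def forward_backward(obs, A, B, pi):
    """Scaled forward (E2.9, E2.10) and backward (E2.11, E2.12) passes.

    obs is a list of symbol indices 0..M-1; A is N x N, B is N x M, pi has length N.
    Returns scaled alpha, scaled beta, the scaling factors and log P(O | lambda).
    """
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]   # last row stays 1 (E2.11)
    scale = [0.0] * T
    for t in range(T):
        for j in range(n):
            if t == 0:
                alpha[t][j] = pi[j] * B[j][obs[0]]
            else:
                alpha[t][j] = B[j][obs[t]] * sum(
                    alpha[t - 1][i] * A[i][j] for i in range(n))
        scale[t] = sum(alpha[t])           # assumption: rescale to avoid underflow
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            ) / scale[t + 1]
    log_likelihood = sum(math.log(c) for c in scale)
    return alpha, beta, scale, log_likelihood

def baum_welch_step(obs, A, B, pi):
    """One re-estimation step (E2.13 to E2.17) for a single sequence.

    Returns the updated (A, B, pi) and the log-likelihood of the input parameters.
    """
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta, scale, log_likelihood = forward_backward(obs, A, B, pi)
    # With this scaling, gamma_i(t) is simply alpha * beta.
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    new_pi = list(gamma[0])
    new_A = [[0.0] * n for _ in range(n)]
    new_B = [[0.0] * m for _ in range(n)]
    for i in range(n):
        visits = sum(gamma[t][i] for t in range(T - 1))
        for j in range(n):
            xi_sum = sum(
                alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
                for t in range(T - 1))
            new_A[i][j] = xi_sum / visits
        total = sum(gamma[t][i] for t in range(T))
        for t in range(T):
            new_B[i][obs[t]] += gamma[t][i] / total   # only times with O_t = o_k count
    return new_A, new_B, new_pi, log_likelihood
```

Calling `baum_welch_step` repeatedly, feeding each output back in as the next input, gives the iterative procedure; because it is an EM algorithm, the returned log-likelihood should not decrease from one iteration to the next.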

2.3 Work Flow

Fig 2.1 A broad overview of the HMM work flow, highlighting the most significant inputs, transformations, and outputs at each step from start to end.

Detailed description of the work flow:

Data Sources and Data Processing (discussed in the next chapter)

HMM training: For the HMM training, we first randomly initialized the parameters: the initial state distribution (π), the initial transition matrix (A), and the initial emission matrix (B).

We then trained the model with different numbers of states, from 11 to 26, using the Baum-Welch algorithm. During the training, we also applied some control to make the training process more controllable. For example, in each iteration, we calculate the values of A and B with

E2.17 and

E2.18

In this way, we smooth the transition and emission matrices. In our program, based on the number of states and the number of possible observations in each state, we set the smoothing constant α to 0.000001.

For the termination conditions, we have two thresholds.

The first is delta, the difference between the previous LLR (Log Likelihood Ratio) and the current LLR. In every iteration, we calculate the LLR for the current iteration and compare it with the previous LLR to obtain delta. If delta is less than or equal to a threshold (here 0.001), we consider the training sufficient and stop.

The other is the number of iterations. We count the iterations throughout the training process; if the count is less than the threshold (here 300), we continue training; otherwise, we abort.

At the end of every iteration, we check both thresholds; if either the delta threshold or the iteration threshold is reached, we stop the training process and output the training results.
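The two stopping rules can be illustrated with a small driver loop. This is a minimal sketch, not the thesis program: the function name `train_until_converged` is hypothetical, `step` stands for any function that performs one re-estimation and returns the updated parameters together with the log-likelihood score, delta is assumed to be the absolute difference between consecutive scores, and `max_iter` is assumed to be at least 1.

```python
def train_until_converged(step, params, tol=0.001, max_iter=300):
    """Iterate `step` until the score changes by at most `tol` or `max_iter` is reached.

    step(params) must return (updated_params, score).
    Returns the final parameters, the last score and the number of iterations run.
    """
    prev_score = None
    for n_iter in range(1, max_iter + 1):
        params, score = step(params)
        # Threshold 1: delta between the previous and the current score.
        if prev_score is not None and abs(score - prev_score) <= tol:
            break
        prev_score = score
    # Threshold 2 is enforced by the loop bound: at most max_iter iterations.
    return params, score, n_iter
```

With the defaults, the loop matches the values quoted above (a delta threshold of 0.001 and at most 300 iterations).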

We trained our HMM with different numbers of states, from 11 to 26, and then calculated the Bayesian information criterion (BIC) value for each model. The BIC is an asymptotic result derived under the assumption that the data distribution is in the exponential family.

The formula for the BIC is:

$$-2 \ln p(x \mid k) \approx \mathrm{BIC} = -2 \ln L + k \ln(n) \qquad \text{(E2.19)}$$

where $x$ is the observed data; $n$ is the number of data points in $x$, that is, the number of observations or, equivalently, the sample size; $k$ is the number of free parameters to be estimated (if the estimated model is a linear regression, $k$ is the number of regressors, including the intercept); $p(x \mid k)$ is the probability of the observed data given the number of parameters or, equivalently, the likelihood of the parameters given the dataset; and $L$ is the maximized value of the likelihood function for the estimated model.
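As a small worked example, the BIC can be computed directly from the log-likelihood, $k$ and $n$. The sketch below is illustrative only. The parameter-counting rule in `free_parameters` is an assumption that is not stated in the text: it treats the emission distribution of each state as a categorical distribution over the 4096 observation symbols, and it happens to reproduce the $k$ column of Table 4.1. Both function names are hypothetical.

```python
import math

def free_parameters(n_states, n_symbols):
    """Assumed count of free parameters of a discrete HMM.

    Each probability row sums to 1, so one entry per row is not free.
    """
    emissions = n_states * (n_symbols - 1)
    transitions = n_states * (n_states - 1)
    initial = n_states - 1
    return emissions + transitions + initial

def bic(log_likelihood, k, n):
    """BIC = -2 ln L + k ln(n); lower values indicate a preferred model."""
    return -2.0 * log_likelihood + k * math.log(n)
```

For the 11-state model in Table 4.1, `free_parameters(11, 4096)` gives 45,165, and `bic(-4257409, 45165, 3070531)` agrees with the tabulated BIC value to within rounding.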


Chapter 3: Data Process

3.1 Data Sets:

We used 36 cell lines as our training datasets: 33 breast cancer cell lines, plus the H1, HCT116 and CD4+ T cell lines.

a) 33 Breast Cancer Cell Lines:

Breast cell lines were procured through the Integrative Cancer Biology Program (ICBP) of the National Cancer Institute (Neve et al., 2006) [27].

b) H1 cell line:

The H1 MBD-seq data used in this thesis is from R. Alan Harris et al. (Nature Biotechnology, 2010) [28].

c) HCT116:

The HCT116 MBD-seq data used in this thesis is from David Serre et al. (Nucleic Acids Research, 2010) [4].

d) CD4+ T cell:

The CD4+ T cell MBD-seq data used in this thesis is from Jung K Choi et al. (Genome Biology, 2009) [29].

3.2 MBD-seq Protocol

Genomic DNA was isolated with the QIAamp DNA Mini Kit (Qiagen) following the manufacturer's protocol. Genomic DNA of breast cell lines was procured through the Integrative Cancer Biology Program (ICBP) of the National Cancer Institute.

MBDCap-seq, mapping and normalization. Methylated DNA was eluted with the MethylMiner Methylated DNA Enrichment Kit (Invitrogen) according to the manufacturer's instructions. Briefly, one microgram of genomic DNA was sonicated and captured by MBD proteins. The methylated DNA was eluted in 1 M salt buffer. DNA in each eluted fraction was precipitated with glycogen, sodium acetate and ethanol, and was resuspended in TE buffer. Eluted DNA was used to generate libraries for sequencing following the standard protocols from Illumina. MBDCap-seq libraries were sequenced using the Illumina Genome Analyzer II as per the manufacturer's instructions. Image analysis and base calling were performed with the standard Illumina pipeline. Sequencing reads were mapped with the ELAND algorithm. Unique reads of up to 36 base pairs were mapped to the human reference genome (hg18), with up to two mismatches. Reads in satellite regions were excluded due to the large number of amplifications.

3.3 Data Preprocess

The original DNA methylation data for the 33 breast cancer cell lines are in the export format, which is described in Appendix C; the H1 data are in BAM format, which is described in Appendix A; and the HCT116 and CD4+ T cell data are in FASTQ format, which is described in Appendix E. For the export files (33 breast cancer cell lines), we first divide the reads in the files into three groups:

Group 1: the uniquely matched reads (only one sequence on the genome matches the read)

Group 2: the multiple-matched or non-matched reads (multiple sequences, or no sequence, on the genome match the read)

Group 3: QC (Quality Control) reads (the read itself cannot meet some quality requirements)

Then, for the reads in group 2, we use a tool (Lonut) to find the most probable matches, and add them to the reads from group 1 to obtain the reads that we work with (Total Reads After Processing).

For the BAM file (H1 cell line), we first use the popular tool SAMtools to convert it from BAM format to SAM format, since a BAM file is a binary file. Because there are no QC reads in the SAM file, we divide the reads into only two groups, group 1 and group 2, defined in the same way as for the export files.

For the FASTQ files (HCT116 and CD4+ T cell), we first use the software Bowtie to map the sequences onto the genome, and then follow the same steps as for the 33 breast cancer cell lines.

The outputs of all the processes above are in BED format (Appendix D).

Table 3.1 Data summary for 36 cell lines

Cell line ID | Raw Reads | Unique Matched Reads | Not Unique Matched Reads | Total Reads After Process
BrCa-02(AU565) | 38,389,113 | 21,757,417 | 11,268,129 | 33,025,546
BrCa-03(BT549) | 38,607,423 | 24,343,702 | 9,151,298 | 33,495,000
BrCa-06(HCC1569) | 33,243,637 | 17,790,745 | 11,032,912 | 28,823,657
BrCa-07(HCC1937) | 32,664,695 | 17,761,936 | 10,746,815 | 28,508,751
BrCa-08(HCC2185) | 40,922,132 | 22,424,765 | 11,505,148 | 33,929,913
BrCa-09(HCC70) | 42,112,586 | 24,613,958 | 11,832,051 | 36,446,009
BrCa-10(LY2) | 38,858,773 | 23,020,926 | 11,294,571 | 34,315,497
BrCa-11(MCF-7) | 43,128,546 | 24,876,183 | 12,608,935 | 37,485,118
BrCa-12(MDAMB-231) | 36,495,183 | 22,767,185 | 8,963,014 | 31,730,199
BrCa-14(MDAMB-468D) | 46,932,495 | 25,467,786 | 14,656,101 | 40,123,887
BrCa-15(SUM149PT) | 36,129,334 | 21,592,142 | 10,491,546 | 32,083,688
BrCa-16(SUM225CWN) | 27,600,744 | 17,390,015 | 7,502,375 | 24,892,390
BrCa-20(BT20) | 38,329,851 | 21,775,872 | 11,679,619 | 33,455,491
BrCa-25(HCC1954) | 38,223,154 | 21,961,680 | 10,936,018 | 32,897,698
BrCa-28(MCF10A) | 47,587,907 | 27,946,727 | 12,391,220 | 40,337,947
BrCa-32(SKBR3) | 41,094,509 | 24,365,279 | 10,698,249 | 35,063,528
BrCa-33(SUM159PT) | 43,752,391 | 25,433,158 | 11,726,940 | 37,160,098
BrCa-38(BT474) | 46,247,881 | 27,613,327 | 11,417,755 | 39,031,082
BrCa-40(HCC1143) | 34,178,168 | 20,315,316 | 9,549,005 | 29,864,321
BrCa-41(HCC1428) | 33,877,849 | 21,196,882 | 8,922,003 | 30,118,885
BrCa-43(HCC202) | 34,308,392 | 20,183,614 | 9,788,109 | 29,971,723
BrCa-44(HCC3153) | 30,572,196 | 16,910,888 | 9,429,028 | 26,339,916
BrCa-49(MDAMB436) | 37,982,516 | 22,680,715 | 10,033,312 | 32,714,027
BrCa-51(SUM185PE) | 28,071,302 | 16,021,760 | 7,372,317 | 23,394,077
BrCa-55(600MPE) | 29,989,492 | 15,772,575 | 8,997,509 | 24,770,084
BrCa-59(HCC1500) | 33,512,194 | 21,689,325 | 7,297,771 | 28,987,096
BrCa-63(HS578T) | 29,774,913 | 19,954,550 | 6,690,775 | 26,645,325
BrCa-64(MCF12A) | 39,816,671 | 19,667,828 | 14,407,417 | 34,075,245
BrCa-65(MDAMB175VII) | 36,527,805 | 19,117,034 | 11,991,384 | 31,108,418
BrCa-67(MDAMB453) | 34,634,234 | 18,920,039 | 11,290,635 | 30,210,674
BrCa-68(SUM1315MO2) | 34,112,917 | 20,891,597 | 9,163,182 | 30,054,779
BrCa-70(SUM52PE) | 32,326,562 | 19,320,660 | 9,731,807 | 29,052,467
T47D | 43,275,135 | 26,119,748 | 10,784,904 | 36,904,652
HCT116 | 19,041,613 | 4,906,885 | 3,615,483 | 8,522,368
Tcell | 71,725,143 | 21,645,848 | 18,120,978 | 39,766,826
H1 | 59,618,003 | 30,139,685 | 174,794 | 30,314,479

Fig 3.1 Bar figure for 36 cell lines

3.4 Input for HMM

First, we divide the whole genome into 1000-base-pair non-overlapping intervals, within which we independently make a call as to whether each of the 36 marks is detected as present, based on the count of tags mapping to the interval. Each tag is uniquely assigned to one interval based on the location of the 5' end of the tag after applying a shift of 500 bases in the 5' to 3' direction of the tag (mid-point). The threshold, $t$, for each mark is based on the total number of mapped reads for the mark, and is set to the smallest integer $t$ such that $P(X > t) < 10^{-4}$, where $X$ is a random variable with a Poisson distribution whose mean parameter is set to the empirical mean of the number of tags per interval. In each cell line, if a mark is detected in an interval, we assign '1' to this interval; otherwise, we assign '0' to it.
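The thresholding rule can be illustrated with a short Python sketch that finds the smallest integer $t$ with $P(X > t) < 10^{-4}$ by accumulating the Poisson probability mass. It is only a sketch under assumptions: the function names `poisson_threshold` and `call_intervals` are hypothetical, and the text does not say whether an interval is called present when its count is strictly above $t$, so the strict comparison used in `call_intervals` is an assumption.

```python
import math

def poisson_threshold(mean, p_cutoff=1e-4):
    """Smallest integer t such that P(X > t) < p_cutoff for X ~ Poisson(mean)."""
    pmf = math.exp(-mean)      # P(X = 0)
    cdf = pmf
    t = 0
    while 1.0 - cdf >= p_cutoff:
        t += 1
        pmf *= mean / t        # P(X = t) from P(X = t - 1)
        cdf += pmf
    return t

def call_intervals(counts, p_cutoff=1e-4):
    """Binarize per-interval tag counts for one mark.

    The Poisson mean is the empirical mean number of tags per interval.
    Assumption: an interval is called present (1) when its count exceeds t.
    """
    mean = sum(counts) / len(counts)
    t = poisson_threshold(mean, p_cutoff)
    return [1 if c > t else 0 for c in counts]
```

Because the threshold depends on the empirical mean, cell lines with deeper sequencing automatically receive a higher threshold.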

We also group the 36 cell lines into 12 groups based on several factors (e.g., gene cluster, ER+/-, PR+/- and HER2 expression) as follows:

Table 3.2 12 groups for 36 cell lines

Cell line | Gene Cluster | ER | PR | HER2 | Code (1 for +, 2 for -) | Group
BT474 | Lu | + | + | 11.9994 | 111 | 1
MCF10A | BaB | - | - | 6.837 | 222 | 2
MCF12A | BaB | - | - | 7.226 | 222 | 2
HCC1428 | Lu | + | + | 7.6065 | 112 | 3
MCF7 | Lu | + | + | 8.4522 | 112 | 3
T47D | Lu | + | + | 8.2666 | 112 | 3
600MPE | Lu | + | - | 9.2756 | 122 | 4
LY2 | Lu | + | - | 6.9903 | 122 | 4
MDAMB175 | Lu | + | - | 9.2384 | 122 | 4
SUM52PE | Lu | + | - | 7.6287 | 122 | 4
AU565 | Lu | - | - | 12.1189 | 221 | 5
HCC202 | Lu | - | - | 12.1056 | 221 | 5
SKBR3 | Lu | - | - | 11.5751 | 221 | 5
HCC1569 | BaA | - | - | 11.7554 | 221 | 6
HCC1954 | BaA | - | - | 11.5082 | 221 | 6
SUM225CWN | BaA | - | - | 12.9908 | 221 | 6
HCC2185 | Lu | - | - | 9.3429 | 222 | 7
MDAMB453 | Lu | - | - | 10.172 | 222 | 7
SUM185PE | Lu | - | - | 8.3417 | 222 | 7
BT20 | BaA | - | - | 7.7677 | 222 | 8
HCC1143 | BaA | - | - | 8.7032 | 222 | 8
HCC1937 | BaA | - | - | 7.7687 | 222 | 8
HCC3153 | BaA | - | - | 8.9164 | 222 | 8
HCC70 | BaA | - | - | 7.9334 | 222 | 8
MDAMB468 | BaA | - | - | 7.0596 | 222 | 8
BT549 | BaB | - | - | 7.0328 | 222 | 9
HCC1500 | BaB | - | - | 7.2479 | 222 | 9
HS578T | BaB | - | - | 6.6301 | 222 | 9
MDAMB231 | BaB | - | - | 6.5633 | 222 | 9
MDAMB436 | BaB | - | - | 6.9034 | 222 | 9
SUM1315 | BaB | - | - | 6.9018 | 222 | 9
SUM149PT | BaB | - | - | 6.5676 | 222 | 9
SUM159PT | BaB | - | - | 7.3181 | 222 | 9
H1 | | | | | | 10
HCT116 | | | | | | 11
Tcell | | | | | | 12

Then we need to decide whether an interval is hyper-methylated in each group.

To do so, we count the number of marks in each interval for every group. If at least one mark is present in an interval for a group, we say that this group is hyper-methylated in this interval. Again, for each group, if an interval is hyper-methylated, we assign '1' to this interval; otherwise, '0'.

For the input of the HMM, we combine the 12 groups as follows.

We denote the observation value for interval $t$ as $v_t$ ($t = 1, \dots, T$), and calculate the value of $v_t$ in the following steps:

1. Initialize $v_t$: set $v_t = 0$.

2. Put the 12 groups in increasing order, from group 1 to group 12.

3. Going through the 12 groups in this order, for each group, if the corresponding interval is 1, then $v_t = 2 v_t + 1$; otherwise, $v_t = 2 v_t$.

4. After going through the 12 groups, add 1 to $v_t$, that is, $v_t = v_t + 1$, so that $v_t$ takes values from 1 to 4096.

Based on this encoding, we can decode the value of an interval. We simply write $v_t - 1$ as a binary number of the form

$$v_t - 1 = \sum_{i=1}^{12} a_i \, 2^{12-i} \qquad \text{(E3.1)}$$

where $a_i \in \{0, 1\}$ and $i \in \{1, \dots, 12\}$. If $a_i = 1$, we know that group $i$ is 1 (hyper-methylated) in this interval.

Thus, for $v_t = 1$, we know that none of the 12 groups is 1, since $1 - 1 = 0$.

For $v_t = 169$, we know that this interval is a combination of 3 groups (9, 7, 5) with value 1 (hyper-methylated), since

$$169 - 1 = 128 + 32 + 8 = 2^{12-5} + 2^{12-7} + 2^{12-9}$$
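The encoding steps and the decoding rule can be written as two small Python functions. This is an illustrative sketch; the function names `encode` and `decode` are hypothetical and not taken from the thesis program.

```python
def encode(bits):
    """Combine per-group calls into one observation value.

    bits holds the 0/1 calls for group 1, group 2, ..., in that order.
    For 12 groups the result lies in 1..4096.
    """
    v = 0
    for b in bits:
        v = 2 * v + (1 if b else 0)   # step 3: shift left, append this group's call
    return v + 1                      # step 4: shift the range from 0..4095 to 1..4096

def decode(v, n_groups=12):
    """Return the 1-based numbers of the groups whose call is 1 in value v."""
    x = v - 1
    return [i for i in range(1, n_groups + 1) if (x >> (n_groups - i)) & 1]
```

For the worked example above, `decode(169)` returns groups 5, 7 and 9, and `decode(1)` returns an empty list.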

3.5 Methylation Distribution Overview

We also need an overview of the DNA methylation distribution, so we modified a web tool developed by Brian to visualize the data.

To find the meaning of the states (the training result), we also have to relate our result to the source data. We first remap our intervals to the genome (hg18) to obtain their positions on the genome. We then correlate these remapped intervals with genes (hg18). In this way, we can divide the intervals into different regions based on their distances to the gene 5' end. Since it is not necessary to include the gene desert regions, we also filter out the regions that are farther than 100k from the gene body. In addition, since different genes have different gene body lengths, we choose a fixed distance (2kbp) from both the 5' and 3' ends within the gene body, and all the intervals have a length of 2k base pairs.

[Plot: one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 2.5]

Fig 3.2 Methylation distribution for 33 breast cancer cell lines

3.6 Gene Expression Data

Increasing evidence is revealing a role of methylation in the interaction of environmental factors with genetic expression. Differences in maternal care during the first 6 days of life in the rat induce differential methylation patterns in some promoter regions, thus influencing gene expression [30].

We also correlated the results with the gene expression data (Richard M. Neve et al., 2006) [27]. The gene expression data files are in *.cel format, so we used an R package to transform them into readable files. We then combined them into a single file and performed one-way hierarchical clustering (clustering the genes) for the 33 breast cancer cell lines. To make the meanings of the states easier to find, we arranged the cell lines in the order of the subgroups. In this way, we can see directly which cluster of genes is correlated with a particular subgroup.


Chapter 4: Results and Discussion

4.1 Results from HMM

We trained our HMM with different numbers of states, from 11 to 26, and obtained the BIC values as follows:

Table 4.1 BIC results for HMM results

# states | L(log ratio) | k | n | BIC
11 | -4,257,409 | 45,165 | 3,070,531 | 9,189,464
12 | -4,256,362 | 49,283 | 3,070,531 | 9,248,882
13 | -4,254,518 | 53,403 | 3,070,531 | 9,306,736
14 | -4,265,865 | 57,525 | 3,070,531 | 9,391,002
15 | -4,212,430 | 61,649 | 3,070,531 | 9,345,733
16 | -4,195,762 | 65,775 | 3,070,531 | 9,374,029
17 | -4,223,164 | 69,903 | 3,070,531 | 9,490,494
18 | -4,247,955 | 74,033 | 3,070,531 | 9,601,768
19 | -4,162,556 | 78,165 | 3,070,531 | 9,492,691
20 | -4,188,262 | 82,299 | 3,070,531 | 9,605,854
21 | -4,189,774 | 86,435 | 3,070,531 | 9,670,659
22 | -4,158,646 | 90,573 | 3,070,531 | 9,670,214
23 | -4,161,580 | 94,713 | 3,070,531 | 9,737,922
24 | -4,167,498 | 98,855 | 3,070,531 | 9,811,629
25 | -4,151,067 | 102,999 | 3,070,531 | 9,840,667
26 | -4,145,348 | 107,145 | 3,070,531 | 9,891,160

Based on the BIC, we chose the model with 11 states, which has the lowest BIC value.

The training results for the model with 11 states (transition and emission matrices) are given below.

The transition matrix is 11x11, since there are 11 states in total, and it holds the probabilities of moving from each state to every possible state.

The result is as follows:

Table 4.2 Transition Matrix

Here the rows are the states in which a transition starts and the columns are the states it moves to; each cell in the main table is the probability of a transition from the from-state (the row number) to the to-state (the column number).

The corresponding heatmap:

Fig 4.1 Heatmap for transition matrix

From the transition matrix, we can see that most of the states have very high probabilities of transitioning to themselves, except states 1, 4 and 10. From a biological point of view, this is reasonable: for methylation intervals across the whole genome, if the current region is methylated (or not methylated), the next interval is very likely to be methylated (or not methylated) as well. Some of the states are more likely to transition to other states; possibly they mostly occur in intervals whose next interval differs from them (from a methylated interval to a non-methylated interval, or from a non-methylated interval to a methylated interval). States 1, 4 and 10 individually do not have very high probabilities of transitioning to themselves, but when they are treated as a group, the group still has a very high probability of transitioning to itself. So these states could perhaps be treated as a group in future analysis.

The emission matrix is an 11 x 4096 matrix, which is too large to present in full here. We also do not need the detailed observation probabilities of every combination of marks; what we need is the probability that each group occurs in each state, so we add up the probabilities of all observations that contain a particular group.

The results are as follows:

Table 4.3 Emission probabilities for each mark in each state

Here the rows are the states and the columns are the marks; each cell in the main table is the probability that a mark (the column number) is present in the corresponding state (the row number).
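The summation that turns one row of the 11 x 4096 emission matrix into per-group probabilities can be sketched as follows. This is an illustrative sketch, not the thesis program: the function name `group_marginals` is hypothetical, and it assumes that observation values follow the encoding of Section 3.4, with group 1 as the most significant bit and the row indexed by $v_t - 1$.

```python
def group_marginals(emission_row, n_groups=12):
    """Probability that each group is present in one state.

    emission_row[v - 1] is P(observation v | state) for v = 1..2**n_groups.
    The marginal of group i is the sum over all observations whose bit i is 1.
    """
    marginals = [0.0] * n_groups
    for idx, p in enumerate(emission_row):        # idx equals v - 1
        for i in range(1, n_groups + 1):
            if (idx >> (n_groups - i)) & 1:       # group i is set in this observation
                marginals[i - 1] += p
    return marginals
```

Applying this function to each of the 11 rows would give one table row per state, with one column per group.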

To give a clearer view of the emission probabilities, we ordered the probabilities in each state; the following tables give the marks in decreasing order and the corresponding probabilities:

Table 4.4 Ordered emission probabilities for each mark in each state-mark

Table 4.5 Ordered emission probabilities for each mark in each state- probabilities

Furthermore, among all the intervals on the whole genome, there are many non-methylated intervals, which may even make up a large part of the genome. We do not want to consider these, since we want to focus on the methylated intervals. So we apply a threshold of 0.08 to the observation probability of each state, which gives 8 states above the threshold and 3 states below it.

Table 4.6 Filtered ordered emission probabilities for each mark in each state- marks

From the table above, we can see that the 3 states under the threshold are exactly the ones that do not have very high probabilities of transitioning to themselves. These 3 states are also quite similar to each other: they are all dominated by 3 marks (7, 9, 10), followed by the other marks in very similar orders. In addition, some states are dominated by non-DNA-methylation-related cell lines (e.g., state 7, which is dominated by groups 10 and 12, the H1 stem cell and CD4+ T cell lines).

4.2 Biology Meanings

After obtaining the states and their features, we assign biological meanings to them by correlating them with other data.

We correlate the results above with the gene expression data for the 33 breast cancer cell lines, since increasing evidence is revealing a role of methylation in the interaction of environmental factors with genetic expression.

We performed one-way hierarchical clustering on the 12113 genes obtained for the 33 breast cancer cell lines (Richard M. Neve et al., 2006) [27], as described in the data process chapter.

4.2.1 Gene Expression results for 33 breast cancer cell lines

Fig 4.2 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering

We further applied a threshold to the clustering results and obtained the following results:

Fig 4.3 Grouped 33 Breast Cancer Cell Gene Expression One-Way Hierarchy Clustering

4.2.2 Results Based on Different Clusters

From the results above, we can see that the genes were divided into 9 clusters, which are annotated in 9 different colors. From top to bottom, we denote them cluster 1 to cluster 9. The figure above shows some very interesting features, but we discuss them together with the methylation distribution in these clusters. Based on the clustering results, we obtain the genes in each cluster.

Table 4.7 Number of genes in each cluster

clusters | # genes
1 | 1,362
2 | 860
3 | 1,098
4 | 1,274
5 | 800
6 | 4,010
7 | 1,211
8 | 866
9 | 632

Then, for each group, we related the methylated intervals to the genome based on their distances to the nearest gene in each cluster, as follows.

We first remap each methylated interval onto the genome as the distance from its midpoint to the 5'TSS or 3'TSS of its nearest gene. We do this for all the intervals and call this process find-region. Then we count the number of reads located in each 2kb interval from -100kb to 4kb relative to the 5'TSS and from -4kb to 100kb relative to the 3'TSS. After that, we plot the distribution of methylated intervals for each group in each cluster and look for the relationship between gene expression and methylation.

[Plot: panel title cluster1; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 2.5]

Fig 4.4 Methylation distribution based on cluster 1 genes

Here we can see that group 3 and group 6 are highly methylated in the gene body region. From the clustering results, group 3 and group 6 also have high gene expression, which makes sense since methylation in the gene body can up-regulate gene expression. Group 8 and group 9 have low methylation in the 5'TSS promoter region (5TSS-2), which also makes sense since methylation in the promoter region can repress gene expression.

[Plot: panel title cluster2; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3]

Fig 4.5 Methylation distribution based on cluster 2 genes

Here group 1 is highly methylated in the gene body region (5TSS+2) and has high gene expression.

[Plot: panel title cluster3; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3]

Fig 4.6 Methylation distribution based on cluster 3 genes

Here group 2 is highly methylated in the gene body region (3TSS+1) and has high gene expression. Group 8 has low methylation in the 5TSS promoter region (5TSS-4) and has high gene expression.

[Plot: panel title cluster4; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3]

Fig 4.7 Methylation distribution based on cluster 4 genes

Here group 2 is highly methylated in the 5TSS promoter region (5TSS-1) and has low gene expression. Group 7 has low methylation in the 5TSS promoter region (5TSS-2) and has high gene expression.

[Plot: panel title cluster5; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3]

Fig 4.8 Methylation distribution based on cluster 5 genes

Here group 4 is highly methylated in the gene body region (5TSS+1) and has high gene expression. Group 8 has low methylation in the 5TSS promoter region (5TSS-4) and has high gene expression.

[Plot: panel title cluster6; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3]

Fig 4.9 Methylation distribution based on cluster 6 genes

Here group 2 is highly methylated in the 5TSS promoter region (5TSS-1) and has low gene expression.

[Plot: panel title cluster7; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3]

Fig 4.10 Methylation distribution based on cluster 7 genes

Here group 3 and group 4 are highly methylated in the gene body region (5TSS+1 and 5TSS+2) and have high gene expression.

[Plot: panel title cluster8; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3]

Fig 4.11 Methylation distribution based on cluster 8 genes

Here group 5 is highly methylated in the 5TSS promoter region (5TSS-2) and has low gene expression.

[Plot: panel title cluster9; one curve per group (1 to 9); x-axis: 2 kb intervals from 5TSS-50 to 3TSS+47; y-axis: 0 to 3.5]

Fig 4.12 Methylation distribution based on cluster 9 genes

Here group 2 and group 6 are highly methylated in the gene body region (3TSS-1 and 5TSS+1) and have high gene expression.

4.2.3 States Meanings and Group Patterns

From the figures above, we can see that in each cluster the distributions of methylated intervals differ considerably between groups, but it is still not easy to find the biological meanings of the states. We therefore ordered all the groups in each 2kbp interval in decreasing order, as we did for the emission matrix.

Then we correlate the emission matrix with the ordered groups in each cluster. We assume that a state is coherent with an interval if the first 3 groups are the same.

Table 4.8 First 3 marks for each state

For example, state 3 has the ordered features (5, 8, 6, 3, 9, 4…) once groups 10, 11 and 12 are excluded; an interval is then said to be coherent with state 3 if its ordered features start with 5, 8, 6, followed by the other groups.
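The coherence test can be sketched in Python as below. This is an illustration only: the function names `ranked_groups` and `is_coherent` are hypothetical, and the sketch assumes that "the first 3 groups are the same" means the same three groups in the same order, which is how the matching rows of Table 4.9 read.

```python
def ranked_groups(values, exclude=()):
    """Order group numbers by decreasing value, skipping excluded groups.

    values maps a group number to its score (an emission probability for a
    state, or a methylation level for a 2 kb interval).
    """
    groups = [g for g in values if g not in exclude]
    return sorted(groups, key=lambda g: values[g], reverse=True)

def is_coherent(state_order, interval_order, top=3):
    """Assumed rule: the first `top` groups must match in the same order."""
    return list(state_order[:top]) == list(interval_order[:top])
```

For instance, the leading features of state 3 (5, 8, 6, 3, 9, 4) are coherent with an interval whose ordered features are 5 8 6 4 1 9 2 3 7, because both orderings start with 5, 8, 6.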

Based on the assumption above, we correlate our states with the clustering results and we get the results as follows:

Table 4.9 States and interval correlation results

clusters | states | intervals | features
cluster1 | State_1 | 5TSS-36 | 9 7 8 4 1 6 2 5 3
cluster2 | State_3 | 5TSS-26 | 5 8 6 4 1 9 2 3 7
cluster2 | State_3 | 3TSS+22 | 5 8 6 7 9 1 4 2 3
cluster2 | State_5 | 5TSS-12 | 8 9 5 4 2 7 3 1 6
cluster3 | State_2 | 3TSS+26 | 4 1 9 6 2 3 8 7 5
cluster3 | State_11 | 3TSS+28 | 6 8 9 4 7 5 2 3 1
cluster3 | State_11 | 3TSS+30 | 6 8 9 3 4 5 7 1 2
cluster4 | State_1 | 5TSS-23 | 9 7 8 5 2 4 1 6 3
cluster4 | State_1 | 3TSS+3 | 9 7 8 6 2 5 1 4 3
cluster5 | State_1 | 5TSS-6 | 9 7 8 6 3 4 1 2 5
cluster5 | State_1 | 3TSS+17 | 9 7 8 5 3 1 4 2 6
cluster5 | State_2 | 3TSS+25 | 4 1 9 5 2 3 8 7 6
cluster5 | State_5 | 3TSS+46 | 8 9 5 4 7 6 3 2 1
cluster5 | State_8 | 5TSS-28 | 9 8 7 2 5 6 1 3 4
cluster6 | State_6 | 5TSS-34 | 8 7 6 4 9 5 3 1 2
cluster8 | State_11 | 3TSS+48 | 6 8 9 7 5 2 4 1 3
cluster9 | State_5 | 3TSS+21 | 8 9 5 3 4 6 1 7 2
cluster9 | State_8 | 5TSS-25 | 9 8 7 6 4 5 1 3 2
cluster9 | State_8 | 3TSS+20 | 9 8 7 4 6 3 1 5 2
cluster9 | State_9 | 3TSS+25 | 9 8 2 7 3 6 1 4 5

In cluster 1, we get:

State_1 5TSS-36 9 7 8 4 1 6 2 5 3

which means that the 36th 2kbp interval upstream of the 5TSS is coherent with state 1.

From the results above, we can see that states 4, 7 and 10 do not appear. This is reasonable, since states 4 and 10 are among the 3 states that we filtered out and are dominated by group 10, which is not a breast cancer cell line, so they can all be classified together. State 7 is also mostly dominated by non-methylation related groups (group 10 and group 12). State 1 appears in clusters 1, 4 and 5; moreover, in clusters 1 and 4, state 1 is the only state. For the other states, we can go on to look for deeper biological meanings.

Table 4.10 States meanings

states     meanings
1          Regions that are unlikely to be methylated; if methylated, it is with high probability in breast cancer cell lines
2          Regions that are methylated in breast cancer cell lines, at far 3' distal regions (50k downstream of 3'core with high gene expression and 52k downstream of 3'core with low gene expression)
3          Regions that are methylated in breast cancer cell lines, at far distal (52k upstream of 5'TSS and 44k downstream of 3'core) regions with high gene expression
4 and 10   Regions that are unlikely to be methylated; if methylated, it is with high probability in non-breast cancer cell lines
5          Regions that are methylated in breast cancer cell lines, at near 5' distal and far 3' distal (24k upstream of 5'TSS and 92k downstream of 3'core) regions with high gene expression
6          Regions that are methylated in breast cancer cell lines, at far 5' distal (68k upstream of 5'TSS) regions with low gene expression
7          Regions that are mainly methylated in non-breast cancer cell lines (H1 stem cell and CD4+ T cell)
8          Regions that are methylated in breast cancer cell lines, at far 5' distal and near 3' distal (56k and 50k upstream of 5'TSS and 40k downstream of 3'core) regions with high gene expression
9          Regions that are methylated in breast cancer cell lines, at far 3' distal (50k downstream of 3'core) regions with low gene expression
11         Regions that are methylated in breast cancer cell lines, at far 3' distal (56k, 60k and 96k downstream of 3'TSS) regions with high gene expression

Near distal regions are the regions 10k-40k upstream/downstream of the 5'TSS/3'TSS.

Far distal regions are the regions 41k-100k upstream/downstream of the 5'TSS/3'TSS.

Proximal regions are the regions 4k-10k upstream/downstream of the 5'TSS/3'TSS.
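These three definitions amount to a simple lookup on the distance from the 5'TSS or 3'TSS. The sketch below is illustrative only. The function name `classify_region` is hypothetical, and two boundary choices are assumptions not stated in the text: a distance of exactly 10k is treated as proximal, and anything above 40k up to 100k is treated as far distal.

```python
def classify_region(distance_kb):
    """Return the region class for a distance (in kb) from the 5'TSS or 3'TSS."""
    if 4 <= distance_kb <= 10:
        return "proximal"
    if 10 < distance_kb <= 40:      # assumption: exactly 10k counts as proximal
        return "near distal"
    if 40 < distance_kb <= 100:     # assumption: covers the stated 41k-100k range
        return "far distal"
    return "unclassified"           # closer than 4k or farther than 100k


for d in (6, 24, 40, 72):
    print(d, classify_region(d))
```

With 2 kbp intervals the distances are always even numbers of kb, so the choice made between 40k and 41k does not affect any interval in Table 4.9.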

Then, based on the results above and correlating them with Table 4.6 and Figure 4.2, we can work out the DNA methylation patterns for the subtypes of breast cancer as follows:

Table 4.11 Patterns for subtypes of Breast cancers

subtypes   patterns
Group 1    Low methylation at far 3' distal (56k and 92k downstream of 3'TSS) regions with high gene expression
Group 2    Low methylation at far 3' distal (40k, 42k, 60k and 96k downstream of 3'TSS) regions with low gene expression, and at far 5' distal (68k upstream of 5'TSS) regions with high or low gene expression
Group 3    Low methylation at far 5' distal (46k and 72k upstream of 5'TSS), far 3' distal (44k and 96k downstream of 3'TSS) and 3' proximal (6k downstream of 3'TSS) regions with high gene expression
Group 4    High methylation at far 3' distal (50k and 52k downstream of 3'TSS) regions with low gene expression
Group 5    High methylation at far distal (52k upstream of 5'TSS and 44k downstream of 3'core) regions with high gene expression
Group 6    High methylation at far 3' distal (56k, 60k and 96k downstream of 3'TSS) regions with high gene expression
Group 7    Low methylation at far 5' distal (52k upstream of 5'TSS) regions with high gene expression
Group 8    High methylation at far 3' distal (42k and 96k downstream of 3'TSS) and near 5' distal (24k upstream of 5'TSS) regions with high gene expression, but at far 5' distal (68k upstream of 5'TSS) regions with low gene expression
Group 9    High methylation at 5' distal (46k, 50k, 56k and 72k upstream of 5'TSS), 3' distal (34k, 40k and 50k downstream of 3'TSS) and 3' proximal (6k downstream of 3'TSS) regions with high gene expression


Chapter 5: Data Visualization

We also modified an existing database web tool to visualize the methylation data of our 36 cell lines and give an intuitive view of the data. In the web tool, we grouped the 36 cell lines into 9 groups, which makes it convenient to compare the data across groups. The details are as follows.

Step 1: For each cell line, we divide the genome into 100 bp intervals and count the number of methylation reads that fall into each interval.
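The counting in this step can be sketched as follows. This is a minimal illustration, not the thesis program: it assumes that each read is assigned to a single interval by its 0-based start coordinate (the text does not say how reads that cross an interval boundary are handled), and the name `count_reads_per_bin` is hypothetical.

```python
from collections import Counter

BIN_SIZE = 100  # interval width in bp, from Step 1


def count_reads_per_bin(read_starts, bin_size=BIN_SIZE):
    """Count reads per fixed-width interval, keyed by interval index."""
    # Assumption: a read belongs to the interval containing its start position.
    return Counter(start // bin_size for start in read_starts)


counts = count_reads_per_bin([5, 99, 100, 250, 299])
print(sorted(counts.items()))  # [(0, 2), (1, 1), (2, 2)]
```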

Step 2: We normalize each of the 36 cell lines to the same level, namely 10,000,000 methylation reads per cell line, since according to the statistical summary most of the cell lines have roughly this number of methylation reads.

Thus, for an original interval value $D_{ij}$, which is the number of methylation reads in the $j$th interval of the $i$th cell line, we can calculate the normalized value, written here as $N_{ij}$:

$$N_{ij} = \frac{D_{ij}}{S_i} \times 10{,}000{,}000 \qquad \text{(E5.1)}$$

where $S_i$ is the total number of methylation reads in the $i$th cell line.
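As a sketch of this normalization, assuming it rescales every interval count of a cell line by 10,000,000 divided by that cell line's total read count, the step could be written as below. The function name `normalize_counts` is hypothetical.

```python
TARGET_READS = 10_000_000  # common read depth chosen in Step 2


def normalize_counts(counts, target=TARGET_READS):
    """Scale one cell line's interval counts so that they sum to `target` reads."""
    total = sum(counts)  # total number of methylation reads in this cell line
    if total == 0:
        raise ValueError("cell line has no methylation reads")
    return [d * target / total for d in counts]


# A cell line with only 8 reads in 2 intervals, scaled to a depth of 16 reads.
print(normalize_counts([2, 6], target=16))  # [4.0, 12.0]
```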

Step 3: To express the methylation level of an interval, we use shades of red from light to dark to represent methylation levels from low to high. Our level boundaries are as follows:

E5.2

We therefore use 7 shades of red of differing brightness to represent the different levels of methylation.
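The mapping from a normalized value to one of 7 shades can be sketched with a sorted list of boundaries. The boundary values below are purely hypothetical placeholders, since the actual boundaries of E5.2 are not reproduced here. The names `methylation_level` and `red_shade` and the particular RGB ramp are likewise assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical boundaries (normalized reads per interval). They are NOT the
# values used in the thesis; 6 boundaries separate 7 levels.
EXAMPLE_BOUNDARIES = [1, 2, 4, 8, 16, 32]


def methylation_level(value, boundaries=EXAMPLE_BOUNDARIES):
    """Return a level index from 0 (lowest) to len(boundaries) (highest)."""
    return bisect_right(boundaries, value)


def red_shade(level, n_levels=7):
    """Map a level to a hex colour, from light red (low) to dark red (high)."""
    fraction = level / (n_levels - 1)
    green_blue = round(220 * (1 - fraction))  # fade green and blue toward 0
    red = round(255 - 116 * fraction)         # 255 for light red, 139 for dark red
    return f"#{red:02x}{green_blue:02x}{green_blue:02x}"


print(methylation_level(5), red_shade(methylation_level(5)))
```

Keeping the boundaries in a parameter means the real values can be substituted without changing the lookup.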

Methylation data is also stored as a start coordinate followed by up to 4 consecutive methylation levels for fixed-step 100 nt spans. This reduces the number of records in the database and improves performance: although methylation data is globally dense, it consists of many disjoint segments with non-zero methylation values, each usually only a few hundred consecutive nucleotides long.
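One way to read this storage scheme is sketched below: a dense list of per-span levels is packed into records of a start coordinate plus at most 4 consecutive non-zero levels, and zero-valued spans are not stored at all. The record layout, the rule that a record ends at the first zero-valued span, and the name `pack_records` are assumptions made for illustration, not the actual database schema.

```python
SPAN = 100        # nt per methylation level, from the text
MAX_LEVELS = 4    # maximum number of levels per record, from the text


def pack_records(levels, span=SPAN, max_levels=MAX_LEVELS):
    """Turn a dense per-span list into (start_coordinate, [levels]) records."""
    records = []
    i = 0
    while i < len(levels):
        if levels[i] == 0:          # unmethylated spans are simply skipped
            i += 1
            continue
        chunk = []
        while i < len(levels) and levels[i] != 0 and len(chunk) < max_levels:
            chunk.append(levels[i])
            i += 1
        records.append(((i - len(chunk)) * span, chunk))
    return records


print(pack_records([0, 0, 3, 5, 2, 1, 4, 0, 7]))
# [(200, [3, 5, 2, 1]), (600, [4]), (800, [7])]
```

Here nine spans collapse into three records, which is the kind of saving the text attributes to short, disjoint methylated segments.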

Besides the methylation data, we also associate genome regions with genes, so that we can see the correlation between genes and methylated regions. For example, in Fig 5.1, in region chr1:27060485-27070485 there is one gene, SFN (NM_006142); within this gene, the cell lines in Group2 and Group4 are not methylated, but some cell lines in Group6 and Group7 are hyper-methylated.


Fig 5.1 Database Web Tool

Corresponding Link: http://motif.bmi.ohio-state.edu/hmm


Chapter 6: Conclusions and Suggestions for Future Work

6.1 Conclusions

Many researchers study breast cancer cell lines and methylation data, and some are trying to solve biological problems using Hidden Markov Models (HMMs). However, few have applied an HMM to DNA methylation data from breast cancer cell lines. Moreover, those who use HMMs for biological problems usually train with only 2 states whose meanings are already known, which makes the training less informative. In this thesis, we used many more states for the training, and the meanings of the states were not known beforehand; after training, we worked out their meanings by correlating the states with other biological data, which is novel. For the program itself, we modified the standard version of the HMM so that it works better for our data. The time and space complexities of the standard version are O(#iterations * n^2 T) and O(n^2 T), where n is the number of states and T is the length of the input sequence. Ours are O(#iterations * (m+n)nT) and O(nT), where m is the number of possible observations. We made this modification because our server has a 30 GB memory limit, and running the standard HMM on the whole genome would need much more memory than that.
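One common way to reach O(nT) memory in Baum-Welch training is to avoid storing the pairwise state posteriors for every position and to add them to an n x n accumulator as the backward pass runs. The sketch below illustrates that idea only. It is not the program written for this thesis, it makes no claim about the O((m+n)nT) running time quoted above, and the function name `expected_counts` is hypothetical.

```python
def expected_counts(obs, pi, A, B):
    """One Baum-Welch E-step that keeps only O(nT) values in memory.

    obs: observation indices (length T); pi: initial probabilities (length n);
    A: n x n transition matrix; B: n x m emission matrix.
    Returns expected transition counts (n x n) and emission counts (n x m).
    """
    n, m, T = len(pi), len(B[0]), len(obs)

    # Scaled forward pass: alpha[t][i] = P(state_t = i | obs[0..t]).
    alpha, scale, prev = [], [], None
    for t, o in enumerate(obs):
        if t == 0:
            row = [pi[i] * B[i][o] for i in range(n)]
        else:
            row = [sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                   for j in range(n)]
        c = sum(row)
        prev = [x / c for x in row]
        alpha.append(prev)
        scale.append(c)

    trans = [[0.0] * n for _ in range(n)]
    emit = [[0.0] * m for _ in range(n)]

    # Backward pass: only the current beta row is kept, and the pairwise
    # posteriors are summed into `trans` instead of being stored for every t.
    beta = [1.0] * n
    for t in range(T - 1, -1, -1):
        for i in range(n):
            emit[i][obs[t]] += alpha[t][i] * beta[i]  # posterior of state i at t
        if t == 0:
            break
        o = obs[t]
        for i in range(n):
            for j in range(n):
                trans[i][j] += (alpha[t - 1][i] * A[i][j] * B[j][o]
                                * beta[j] / scale[t])
        beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(n)) / scale[t]
                for i in range(n)]
    return trans, emit
```

Dividing each row of `trans` and `emit` by its sum then gives the re-estimated transition and emission probabilities for the next iteration.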

Besides the HMM part, we also used programs published or developed by our lab to process and analyze the data, which greatly helped us to find the biological meanings. For example, in correlating the DNA methylation data with the gene expression data, we found relationships between them that are consistent with some published results.

6.2 Future Work

The HMM program we designed is not parallelized, so training on the whole genome takes quite a long time. We could parallelize the program using OpenMP, MPI or other methods to make it more efficient. Also, our prediction of biological meanings is restricted to the dataset we have; if we further correlated our results with other data, we could quite possibly predict more and deeper biological meanings. Finally, at present we can only check our results against some published papers, and not very systematically, so where possible it would be better to carry out biological experiments to further verify the predictions we made.


References:

[1] American Cancer Society (September 13, 2007). "What Are the Key Statistics for Breast Cancer?". Archived from the original on January 5, 2008. http://web.archive.org/web/20080105001124/http://www.cancer.org/docroot/CRI/content/CRI_2_4_1X_What_are_the_key_statistics_for_breast_cancer_5.asp. Retrieved 2008-02-03

[2] "Browse the SEER Cancer Statistics Review 1975–2006". http://seer.cancer.gov/csr/1975_2006/browse_csr.php?section=4&page=sect_04_table.07.html.

[3] March, Jerry; Smith, Michael W. (2001). March's advanced organic chemistry: reactions, mechanisms, and structure. New York: Wiley. ISBN 0-471-58589-0

[4] Serre D, Lee BH, Ting AH. MBD-isolated Genome Sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Res. 2010; 38:391–9

[5] Campbell, M.K., (1995) Biochemistry. Saunders College: Philadelphia, pgs. 615-16, 181

[6] Maclean, N., S.P. Gregory, and R.A. Flavell (1993) Eukaryotic Genes. Butterworth and Co., London, pgs. 53-67

[7] Yen RW, Vertino PM, Nelkin BD, Yu JJ, el-Deiry W, Cumaraswamy A, Lennon GG, Trask BJ, Celano P, Baylin SB. Isolation and characterization of the cDNA encoding human DNA methyltransferase. Nucleic Acids Res. 1992; 20:2287–2291.

[8] Gruenbaum Y, Cedar H, Razin A. Substrate and sequence specificity of a eukaryotic DNA methylase. Nature. 1982; 295:620–622.

[9] Bird A. Perceptions of epigenetics. Nature. 2007; 447:396–398.

[10] Holliday R, Pugh JE. DNA modification mechanisms and gene activity during development. Science. 1975; 187:226–232.

[11] Cedar H, Stein R, Gruenbaum Y, Naveh-Many T, Sciaky-Gallili N, Razin A. Effect of DNA methylation on gene expression. Cold Spring Harb. Symp. Quant. Biol. 1983; 47 (Pt 2):605–609.

[12] Reik W, Collick A, Norris ML, Barton SC, Surani MA. Genomic imprinting determines methylation of parental alleles in transgenic mice. Nature. 1987; 328:248–251.

[13] Riggs AD. X inactivation, differentiation, and DNA methylation. Cytogenet. Cell Genet. 1975; 14:9–25.

[14] Feinberg AP, Vogelstein B. Alterations in DNA methylation in human colon neoplasia. Semin. Surg. Oncol. 1987; 3:149–151.

[15] Spruck CH, III, Rideout WM, III, Jones PA. DNA methylation and cancer. [Review]. EXS. 1993; 64:487–509.

[16] Virmani AK, Rathi A, Sathyanarayana UG, Padar A, Huang CX, Cunnigham HT, Farinas AJ, Milchgrub S, Euhus DM, Gilcrease M, et al. Aberrant methylation of the adenomatous polyposis coli (APC) gene promoter 1A in breast and lung carcinomas. Clin. Cancer Res. 2001; 7:1998–2004.

[17] Tsuchiya T, Tamura G, Sato K, Endoh Y, Sakata K, Jin Z, Motoyama T, Usuba O, Kimura W, Nishizuka S, et al. Distinct methylation patterns of two APC gene promoters in normal and cancerous gastric epithelia. Oncogene. 2000; 19:3642–3646.

[18] Ibanez de Caceres I, Battagli C, Esteller M, Herman JG, Dulaimi E, Edelson MI, Bergman C, Ehya H, Eisenberg BL, Cairns P. Tumor cell-specific BRCA1 and RASSF1A hypermethylation in serum, plasma, and peritoneal fluid from ovarian cancer patients. Cancer Res. 2004; 64:6476–6481.

[19] Rice JC, Massey-Brown KS, Futscher BW. Aberrant methylation of the BRCA1 CpG island promoter is associated with decreased BRCA1 mRNA in sporadic breast cancer cells. Oncogene. 1998; 17:1807–1812.

[20] Bian YS, Osterheld MC, Fontolliet C, Bosman FT, Benhattar J. p16 inactivation by methylation of the CDKN2A promoter occurs early during neoplastic progression in Barrett's esophagus. Gastroenterology. 2002; 122:1113–1121.

[21] Holst CR, Nuovo GJ, Esteller M, Chew K, Baylin SB, Herman JG, Tlsty TD. Methylation of p16(INK4a) promoters occurs in vivo in histologically normal human mammary epithelia. Cancer Res. 2003; 63:1596–1601.

[22] Iqbal, K.; Jin, S.-G.; Pfeifer, G. P.; Szabo, P. E. (2011). "Reprogramming of the paternal genome upon fertilization involves genome-wide oxidation of 5-methylcytosine". Proceedings of the National Academy of Sciences 108 (9): 3642–3647. doi:10.1073/pnas.1014033108. PMC 3048122. PMID 21321204. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=3048122

[23] Jaenisch, R.; Bird, A. (2003). "Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals". Nature genetics 33 Suppl: 245–254. doi:10.1038/ng1089. PMID 12610534

[24] Craig, JM; Wong, NC (editor) (2011). Epigenetics: A Reference Manual. Caister Academic Press. ISBN 978-1-904455-88-2

[25] Rabiner, Lawrence R. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pp. 257-286.

[26] Baldi, P. & Brunak S. "Bioinformatics - The Machine Learning Approach". Massachusetts Institute of Technology, 1998.

[27] Richard M. Neve et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 10, 515-527, December 2006

[28] R Alan Harris et al. "Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications". Nature Biotechnology, 2010

[29] Jung K Choi, Jae-Bum Bae, Jaemyun Lyu, Tae-Yoon Kim and Young-Joon Kim. Nucleosome deposition and DNA methylation at coding region boundaries. Genome Biology, 2009, 10:R89

[30] Weaver IC (2007). "Epigenetic programming by maternal behavior and pharmacological intervention. Nature versus nurture: let's call the whole thing off". Epigenetics 2 (1): 22–8. doi:10.4161/epi.2.1.3881. PMID 17965624. http://www.landesbioscience.com/journals/epi/abstract.php?id=3881.

[31] Cock et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 2009


Appendix: Formats

A. BAM format

BAM format is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and indexable representation of nucleotide sequence alignments. Many next-generation sequencing and analysis tools work with SAM/BAM.

For custom track display, the main advantage of indexed BAM over PSL and other human-readable alignment formats is that only the portions of the files needed to display a particular region are transferred to UCSC. This makes it possible to display alignments from files that are so large that the connection to UCSC would time out when attempting to upload the whole file to UCSC. Both the BAM file and its associated index file remain on your web-accessible server (http or ftp), not on the UCSC server. UCSC temporarily caches the accessed portions of the files to speed up interactive display.

B. SAM (Sequence Alignment/Map) format

SAM format is a generic format for storing large nucleotide sequence alignments.

SAM aims to be a format that:

• Is flexible enough to store all the alignment information generated by various alignment programs;

• Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;

• Is compact in file size;

• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;

• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

C. EXPORT format

HWUSI-EAS68R 0012 3 1 1173 16855 0 1 AGATCGAGCTGGAGAAATTCCATGAATATACCACAC cddddcLcc^dd\d^a`ccYcca^aM_]_b`b\TYc chr10.fa 12854193 R 36 118 Y

HWUSI-EAS68R 0012 3 1 1174 2493 0 1 AAGACGGGAAAGGACTCACTCAAAGTCACACAGCTG cTccc_M_^_L_UUL[MM]XGXZFZXSQV\aaYYaV chr3.fa 195586021 F 17C18 97 N

HWUSI-EAS68R 0012 3 1 1174 7057 0 1 CAACTTGGAGAATCACATTTGAAGTGCAAAGAACAC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB NM N

D. BED format

BED format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

The first three required BED fields are:

1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

The 9 additional optional BED fields are:

4. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.

5. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). This table shows the Genome Browser's translation of BED score values into shades of gray, from lightest to darkest:

score in range   ≤ 166   167-277   278-388   389-499   500-611   612-722   723-833   834-944   ≥ 945

6. strand - Defines the strand - either '+' or '-'.

7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).

8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.

10. blockCount - The number of blocks (exons) in the BED line.

11. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

12. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

Example

Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
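To make the field order and the coordinate convention concrete, a BED data line can be parsed along the following lines. This is a minimal sketch, assuming whitespace-separated fields with no spaces inside the comma-separated block lists. The function name `parse_bed_line` and the extra `length` key are hypothetical conveniences, not part of the BED format.

```python
BED_FIELDS = ["chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb", "blockCount",
              "blockSizes", "blockStarts"]
INT_FIELDS = ("chromStart", "chromEnd", "score", "thickStart", "thickEnd",
              "blockCount")


def parse_bed_line(line):
    """Parse one whitespace-separated BED data line into a dict."""
    fields = line.split()
    if not 3 <= len(fields) <= 12:
        raise ValueError("a BED line has between 3 and 12 fields")
    record = dict(zip(BED_FIELDS, fields))
    for key in INT_FIELDS:
        if key in record:
            record[key] = int(record[key])
    for key in ("blockSizes", "blockStarts"):
        if key in record:
            # A trailing comma, as in "567,488,", yields an empty item to drop.
            record[key] = [int(x) for x in record[key].split(",") if x]
    # chromStart is 0-based and chromEnd is excluded, so the feature length
    # is a plain difference.
    record["length"] = record["chromEnd"] - record["chromStart"]
    return record


rec = parse_bed_line("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
print(rec["name"], rec["length"], rec["blockSizes"])  # cloneA 4000 [567, 488]
```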

E. Fastq format

Fastq format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are encoded with a single ASCII character for brevity. It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has recently become the de facto standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer [31].

Example

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
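A four-line record like the one in this example can be decoded as sketched below. The offset of 33 is an assumption: it is the Sanger convention, while older Solexa/Illumina variants use a different offset, so the value should be checked against the instrument that produced the file. The function name `parse_fastq_record` is hypothetical.

```python
def parse_fastq_record(lines, offset=33):
    """Parse a four-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, plus, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Each quality character encodes one score as an ASCII code plus an offset.
    scores = [ord(ch) - offset for ch in quality]
    return header[1:].split()[0], sequence, scores


record = [
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "I" * 30 + "9IG9IC",
]
read_id, seq, scores = parse_fastq_record(record)
print(read_id, len(seq), scores[0], scores[30])  # SRR001666.1 36 40 24
```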

F. Bowtie output format (generated by the bowtie software from the fastq input files)

19 + chr9 8003 AGGCTATATGCGCGGCCAGCAGACCTGCAGGGCCCGCTCGTCCAGGGGGCGGTGCTTGCTCTGGATCGTGTGCGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 1 67:C>G,72:G>C,73:C>G

28 + chr19 28101 AAATAAATAAATAAAAACAACTTGTCCAAGGTCAGACAGGCCGCCTCTTAGTAAGCACACCTATCCTCTATAGTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 1 41:A>C,60:A>C,72:T>G

28 + chr1 25355 AAATAAATAAATAAAAACAACTTGTCCAAGGTCAGACAGGCCGCCTCTTAGTAAGCACACCTATCCTCTATAGTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 1 41:A>C,60:A>C,72:T>G

35 + chr1 41809 CAAATACGGTGACTGTTTCTTACGTGGACGACGTTGTGTTGAACATGGGTGAGTAAGACTGAAGCAGCCGTAATT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
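The records above can be read field by field as sketched here. The sketch assumes Bowtie's default tab-separated output, in which the columns are the read name, strand, reference name, 0-based offset, read sequence, qualities, a count of other alignments, and an optional comma-separated list of mismatches written as offset:reference-base>read-base. The function name `parse_bowtie_line` and the dictionary keys are hypothetical.

```python
def parse_bowtie_line(line):
    """Parse one tab-separated line of Bowtie's default output into a dict."""
    name, strand, ref, offset, seq, quals, others, *rest = (
        line.rstrip("\n").split("\t"))
    mismatches = []
    if rest and rest[0]:
        for item in rest[0].split(","):
            pos, change = item.split(":")
            ref_base, read_base = change.split(">")
            mismatches.append((int(pos), ref_base, read_base))
    return {
        "read": name,
        "strand": strand,
        "reference": ref,
        "offset": int(offset),          # 0-based position of the leftmost base
        "sequence": seq,
        "qualities": quals,
        "other_alignments": int(others),
        "mismatches": mismatches,       # (offset in read, reference base, read base)
    }


hit = parse_bowtie_line("19\t+\tchr9\t8003\tACGT\tIIII\t1\t2:C>G,3:G>C")
print(hit["reference"], hit["offset"], hit["mismatches"])
# chr9 8003 [(2, 'C', 'G'), (3, 'G', 'C')]
```

The short sequence in the usage line is a shortened stand-in for the 75-base reads shown above.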
