Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1167

Rule-based Models of Transcriptional Regulation and Complex Diseases

Applications and Development

SUSANNE BORNELÖV

ACTA UNIVERSITATIS UPSALIENSIS
ISSN 1651-6214
ISBN 978-91-554-9005-8
urn:nbn:se:uu:diva-230159
Uppsala 2014

Dissertation presented at Uppsala University to be publicly examined in BMC C8:301, Husargatan 3, Uppsala, Friday, 3 October 2014 at 13:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Prof. Joanna Polanska (Silesian University of Technology).

Abstract

Bornelöv, S. 2014. Rule-based Models of Transcriptional Regulation and Complex Diseases. Applications and Development. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1167. 69 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9005-8.

As we gain an increased understanding of genetic disorders and gene regulation, more focus has turned towards complex interactions. Combinations of genes, or of genetic and environmental factors, have been suggested to explain the missing heritability behind complex diseases. Furthermore, gene activation and splicing seem to be governed by a complex machinery of histone modification (HM), transcription factor (TF), and DNA sequence signals. This thesis aimed to apply and develop multivariate machine learning methods for use on such biological problems. Monte Carlo feature selection was combined with rule-based classification to identify interactions between HMs and to study the interplay of factors with importance for asthma and allergy. Firstly, publicly available ChIP-seq data (Paper I) for 38 HMs was studied. We trained a classifier for predicting exon inclusion levels based on the HM signals. We identified HMs important for splicing and illustrated that splicing could be predicted from the HM patterns. Next, we applied a similar methodology to data from two large birth cohorts describing asthma and allergy in children (Paper II). We identified genetic and environmental factors with importance for allergic diseases, which confirmed earlier results, and found candidate gene-gene and gene-environment interactions. In order to interpret and present the classifiers we developed Ciruvis, a web-based tool for network visualization of classification rules (Paper III). We applied Ciruvis to classifiers trained on both simulated and real data and compared our tool to another methodology for interaction detection using classification. Finally, we continued the earlier study by analyzing HM and TF signals in genes with or without evidence of bidirectional transcription (Paper IV). We identified several HMs and TFs with different signals between unidirectional and bidirectional genes. Among these, the TF CTCF was shown to have a well-positioned peak 60-80 bp upstream of the transcription start site in unidirectional genes.

Keywords: Histone modification, Transcription factor, Transcriptional regulation, Next-generation sequencing, Feature selection, Machine learning, Rule-based classification, Asthma, Allergy

Susanne Bornelöv, Department of Cell and Molecular Biology, Computational and Systems Biology, Box 596, Uppsala University, SE-75124 Uppsala, Sweden.

© Susanne Bornelöv 2014

ISSN 1651-6214
ISBN 978-91-554-9005-8
urn:nbn:se:uu:diva-230159 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-230159)

List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I   Enroth, S.†, Bornelöv, S.†, Wadelius, C. and Komorowski, J. (2012) Combinations of histone modifications mark exon inclusion levels. PLOS ONE, 7(1):e29911

II  Bornelöv, S.†, Sääf, A.†, Melén, E., Bergström, A., Moghadam, B.T., Pulkkinen, V., Acevedo, N., Pietras, C.O., Ege, M., Braun-Fahrländer, C., Riedler, J., Doekes, G., Kabesch, M., van Hage, M., Kere, J., Scheynius, A., Söderhäll, C., Pershagen, G. and Komorowski, J. (2013) Rule-Based Models of the Interplay between Genetic and Environmental Factors in Childhood Allergy. PLOS ONE, 8(11):e80080

III Bornelöv, S., Marillet, S. and Komorowski, J. (2014) Ciruvis: a web-based tool for rule networks and interaction detection using rule-based classifiers. BMC Bioinformatics, 15:139

IV  Bornelöv, S., Komorowski, J. and Wadelius, C. Different distribution of histone modifications in genes with unidirectional and bidirectional transcription and a role of CTCF and cohesin in directing transcription. Manuscript

Reprints were made with permission from the respective publishers.

†These authors contributed equally.

List of additional publications

1. Bornelöv, S., Enroth, S. and Komorowski, J. (2012) Visualization of Rules in Rule-Based Classifiers. Intelligent Decision Technologies, 329-338

2. Bysani, M., Wallerman, O., Bornelöv, S., Zatloukal, K., Komorowski, J. and Wadelius, C. (2013) ChIP-seq in steatohepatitis and normal liver tissue identifies candidate disease mechanisms related to progression to cancer. BMC Med Genomics, 6(50)

Contents

Introduction
1 Biological background
1.1 Genome organization
1.1.1 DNA
1.1.2 The human genome
1.1.3 Chromatin
1.1.4 Protein-coding genes
1.2 Transcriptional regulation
1.2.1 Transcription factors
1.2.2 Histone modifications
1.2.3 Nucleosome positioning
1.2.4 Alternative splicing
1.2.5 Divergent transcription
1.3 High-throughput technologies
1.3.1 Next-generation sequencing
1.3.2 RNA-seq
1.3.3 ChIP-seq
1.3.4 DNA microarrays
1.3.5 Cap analysis of gene expression
1.4 Databases and international projects
1.4.1 Ensembl
1.4.2 ENCODE
1.4.3 FANTOM
2 Computational background
2.1 Feature selection
2.1.1 Monte Carlo feature selection
2.2 Machine learning
2.2.1 Clustering
2.2.2 Principal component analysis
2.2.3 Classification
2.3 Model validation and performance
2.3.1 Training and test sets
2.3.2 Cross validation
2.3.3 Classifier accuracy
2.3.4 Confusion matrix
2.3.5 Receiver-operator characteristic curve
2.4 Statistics
2.4.1 Rule significance
2.4.2 Undersampling
2.4.3 Multiple testing problem
2.4.4 Bootstrapping
3 Aims
4 Methods and results
4.1 Paper I
4.1.1 Background
4.1.2 Data
4.1.3 Methods
4.1.4 Results
4.2 Paper II
4.2.1 Background
4.2.2 Data
4.2.3 Method
4.2.4 Results
4.3 Paper III
4.3.1 Background
4.3.2 Data
4.3.3 Method
4.3.4 Results
4.4 Paper IV
4.4.1 Background
4.4.2 Data
4.4.3 Method
4.4.4 Results
5 Discussion and conclusions
6 Sammanfattning på svenska
7 Acknowledgements
8 References

Abbreviations

ac          acetylation
AUC         area under the ROC curve
bp          base pair
CAGE        cap analysis of gene expression
cDNA        complementary DNA
ChIP        chromatin immunoprecipitation
DNA         deoxyribonucleic acid
DNase I     deoxyribonuclease I
FN          false negative
FP          false positive
FPR         false positive rate
GWAS        genome-wide association study
H           histone
HM          histone modification
K           lysine
kb          1000 bp
MCFS        Monte Carlo feature selection
me1/2/3     mono-/di-/trimethylation
mRNA        messenger RNA
PCA         principal component analysis
RI          relative importance
RNA         ribonucleic acid
RNA Pol II  RNA polymerase II
ROC         receiver-operator characteristic
rRNA        ribosomal RNA
seq         sequencing
SNP         single nucleotide polymorphism
TF          transcription factor
TN          true negative
TP          true positive
TPR         true positive rate
tRNA        transfer RNA
TSS         transcription start site

Introduction

Molecular biology has undergone several revolutions in recent years. The sequencing of the human genome was the largest project in the history of biological sciences. When the human genome was published in 2001 [1, 2] and 2004 [3], we realized that there were only approximately 20,000-25,000 genes [3], a number significantly lower than many earlier estimates. Only a small fraction of the genome coded for proteins, and much of it consisted of repetitive regions. These discoveries paved the way for a new view of the genome. We now know that much more than the protein-coding genes determines the behavior of a cell or the development of an organism. Furthermore, genes are expressed differently in different cell types. Such differences are mainly regulated by proteins binding to the deoxyribonucleic acid (DNA), and there are numerous different mechanisms and interactions determining this regulation. The introduction of new technologies such as next-generation sequencing allowed genomic studies at a genome-wide level [4-6]. New collaborative projects such as ENCODE and FANTOM [7, 8] have emerged relying on these technologies. All this genomic data could only be analyzed with assistance from computer science and mathematics, thus transforming molecular biology into an interdisciplinary field. It is in this intersection of biology, computer science, and mathematics that the new and important discipline of bioinformatics appeared. This thesis covers four bioinformatics studies aimed at the application and development of multivariate machine learning methods to solve complex biological problems. Chapter 1 gives an introduction to the molecular biology and the experimental data underlying the thesis. Chapter 2 presents the computational and statistical methods that were employed. In Chapter 3 the specific aims of the four studies are presented, followed by a brief summary of each paper in Chapter 4. Finally, the contributions are put into perspective and discussed in Chapter 5.

1 Biological background

1.1 Genome organization

This section provides a basic overview and description of the genomic features that are needed to understand the results presented in this thesis.

1.1.1 DNA

The DNA is the fundamental unit of inheritance and encodes the information needed for the development of a new organism. The DNA molecule forms a double-helix structure with two strands composed of polynucleotide chains. The nucleotides consist of a nucleotide base, either adenine (A), cytosine (C), guanine (G), or thymine (T), a deoxyribose sugar, and a phosphate group. Covalent bonds between the sugar of one nucleotide and the phosphate group of another nucleotide form the DNA backbone (Figure 1).

Figure 1. Chemical structure of the DNA. Figure created by Madeleine Price Ball (Wikimedia Commons).

The two ends of a strand are referred to as the 5' and the 3' ends, with a terminal phosphate or hydroxyl group, respectively. The two strands are bound to each other by hydrogen bonds between the bases A and T, or C and G. It is the sequence of the nucleotide bases that encodes the genetic information. Since each base can only be paired with exactly one other base, the strands are complementary to each other and contain the same genetic information.

1.1.2 The human genome

The human genome consists of about 3 billion base pairs (bp) divided into 22 chromosomes called autosomes plus the sex chromosomes X and Y. Each chromosome is present in two homologous copies in the cell nucleus. Only about 1.5% of the human genome encodes proteins and this part is called the coding DNA. The remaining part is called the non-coding DNA and may have regulatory functions or be traces of retroviruses, transposon activity, genome rearrangements, or gene duplications. The non-coding DNA also includes introns and genes that encode non-coding ribonucleic acids (RNA) such as ribosomal RNAs (rRNA) and transfer RNAs (tRNA). Recently a large number of long non-coding RNAs have been detected [9-11], defined as polyadenylated RNAs longer than 200 nucleotides. Such RNAs may be regulators of transcription [12] and also show signs of being regulated themselves [11]. It was long believed that the non-coding DNA was vastly filled with "junk" DNA [13] without any known function. The function of the non-coding DNA is still to a high degree unclear, but the term "junk" has become obsolete. The ENCODE project has estimated the usage of the human genome [7]. For instance, different RNAs map to 62% of the genome and histone modifications (HM) are enriched at 56% of the genome in at least one cell type. Note that these numbers are expected to rise when more cell types are studied. In total, at least 80% of the human genome is either transcribed at some point or contains a regulatory signal in at least one cell type, e.g. protein binding or a specific chromatin signature [7].

1.1.3 Chromatin

If unwrapped and lined up, the 3 billion bp in one human cell would measure about 2 meters. Clearly, the DNA molecules could not have become this large unless the DNA could be efficiently packed. The DNA in the cell nucleus is packed into a structure called chromatin, and once in this structure it measures roughly 10 µm in diameter. The chromatin structure is necessary for the condensation during mitosis, prevents DNA damage, and allows for gene regulation by influencing the accessibility of the DNA.

The key parts of the chromatin are the nucleosomes. The nucleosomes consist of approximately 147 bp of DNA wrapped 1.7 times around a complex of eight histone proteins; two copies of each of the histones H2A, H2B, H3, and H4 [14]. The nucleosomes are connected by short stretches of linker DNA bound by the H1 or H5 histones [15]. The compaction of DNA into nucleosomes, the "beads on a string" structure, is called euchromatin. Multiple such nucleosomes are wrapped by the addition of histone H1 into a 30 nm fibre, which is the common compact form of DNA called heterochromatin. Active genes are associated with the euchromatin structure whereas inactive genes are associated with the heterochromatin structure. During mitosis and meiosis the heterochromatin is packed by additional scaffold proteins into an even higher level of compaction (Figure 2).

Figure 2. The major chromatin structures. Figure adapted from Richard Wheeler (Wikimedia Commons).

1.1.4 Protein-coding genes

There are about 20,000 protein-coding genes in the human genome [16]. Protein-coding genes in eukaryotes are organized into coding and non-coding regions called exons and introns. The median human gene, based on the Ensembl database [16], is 24,000 bp (24 kb) long and has 19 exons, each of length 137 bp. During the transcription of the DNA into the messenger RNA (mRNA), the intronic regions are removed by the spliceosome, a large complex that recognizes specific start and end sequences at the intron boundaries [17]. In this process certain exons may also be removed, and the exons of a single gene may therefore be assembled into different variants, called transcripts. The median gene has four such transcripts. The mRNAs are transported out of the nucleus, then bound by the ribosomes, and subsequently the bases are read in triplets to produce an amino acid chain (a polypeptide). This phase is called the translation. Finally, the polypeptide is folded into a protein. Thus, a single gene may give rise to several different proteins of similar type. The whole process of DNA being transcribed into mRNA, which is translated into a protein, is part of the central dogma of molecular biology [18], which describes the common flow of genetic information.

There are several genomic regulatory elements associated with the expression of protein-coding genes. The promoter is a region of approximately 100-2,500 bp located upstream of the transcription start site (TSS). To initiate gene transcription the RNA polymerase II (RNA Pol II) has to bind to the promoter region. In eukaryotes, there are several different events that need to occur to facilitate and enhance this crucial initiation.

There are also enhancers and silencers: genomic regions cooperating with or binding to promoters to regulate gene transcription. They operate in a similar way. Transcription factors bind to the enhancers and silencers to activate or repress the target promoter, respectively. There are hundreds of thousands of enhancers in the human genome and they are more difficult to identify than promoters, since they may be located 1 Mb or more from the gene and do not necessarily regulate the closest gene [19, 20]. Enhancers are typically 50-1,500 bp long and are normally identified by chromatin features or elements binding to them, rather than by sequence. Insulators are gene-distant genomic elements that may block the interaction between enhancers and promoters when placed between them [21]. Insulators may thus prevent enhancers from affecting multiple genes, which allows for different regulatory patterns of neighboring genes [21].

1.2 Transcriptional regulation

Since all cells have the same DNA, there have to be mechanisms that define which genes are active in each cell type. This regulation may be achieved by transcription factors, which are differentially expressed across cell types, or by differences in HM marks, and is discussed in this section.

1.2.1 Transcription factors

Transcription factors (TF) are the largest protein family and are defined as proteins that bind to DNA in order to regulate the transcription of genes [22]. TFs may act as both activators and repressors of gene expression. TFs may work alone, but more commonly they form large complexes with other TFs. Approximately 2,600 proteins may function as TFs in human [23], but about 1,000 of them are yet to be verified. In a recent study 1,762 TFs identified from the literature were studied; a median of 430 TFs were actively transcribed in each cell type, with 24 of them being highly active [8].

TFs function by binding to the regulatory regions of a gene. The binding may help or block the recruitment of RNA polymerase, recruit other coactivator or corepressor proteins, or influence the acetylation or deacetylation of histone proteins, thus making the DNA more or less accessible to RNA polymerase. Most TFs have specific DNA-binding motifs and such motifs are catalogued e.g. in the JASPAR database [24]. Recently, motifs were identified for 830 TFs in vitro using the SELEX technology [25].

Some TFs are general and necessary for transcription to occur. TFs in this group are not cell specific, but form part of the so-called RNA preinitiation complex and bind to the promoter of a gene. Other TFs bind to the enhancer regions of the genes and participate in the more specific gene regulation that allows different genes to be expressed in different cell types. Genes coding for TFs may be regulated by TFs in the same way as any other gene, which enables a complex network of regulatory pathways.

The importance of TFs has furthermore been illustrated through cellular reprogramming. Generally, only a small number of TFs are needed to reprogram a cell from one cell type to another [26]. The first reprogramming of a cell into a pluripotent stem cell was performed with only the four factors Oct3/4, Sox2, Klf4, and c-Myc in mouse fibroblasts [27]. Later, the same four factors were used to reprogram human fibroblasts into pluripotent stem cells [28], whereas another research group simultaneously published similar results using the OCT4, SOX2, NANOG, and LIN28 factors [29].

1.2.2 Histone modifications

The core histones H2A, H2B, H3, and H4, like other proteins, commonly undergo post-translational modifications. Over 100 different histone modifications (HM) have been reported, including acetylation (ac), methylation (me), and ubiquitination of lysine (K), methylation of arginine (R), and phosphorylation of serine (S) and threonine (T) [30]. Although mass spectrometry studies have identified modifications on the lateral surface of the histone globular domain [31], modifications on the N-terminal or C-terminal tail domains are the most well-studied. An overview of such modifications is provided in Figure 3. The study of histone tail modifications has been greatly facilitated by the introduction of the chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq) technology, which allows a snapshot of the current HM state in a biological sample.

Figure 3. Overview of modifications occurring on the N- and C-terminal tails of the histones [32]. Residue number and modification type are shown in the figure. Reprinted with permission from Nature Publishing Group.

Acetylations are mainly found on histones in active chromatin. Acetylation of the histones alters their charge and thus weakens their association with DNA, making the DNA more accessible to transcription. Methylations, on the other hand, do not alter the charge but are dependent on other proteins recognizing the modifications. Methylations may facilitate both activation and repression of transcription, and methylated lysines may be either mono-, di-, or tri-methylated (me1/2/3). Examples of histone methylations are the tri-methylation of lysine 27 on histone H3 (H3K27me3), associated with transcriptional silencing [33], and H3K4me3, associated with active promoters [34]. Most histone methyltransferases are rather specific and modify only one or a few positions [35].

The HMs are not independent of each other. Several of them are correlated, some may not co-exist, and there may be certain hierarchies among them [36]. An overview of such effects with experimental evidence is discussed in [35] and shown in Figure 4. It has been suggested that a histone code exists [37], which would be a cellular machinery allowing specific combinations of HMs to be "written" onto the nucleosomes in order to regulate the genes. Indeed, there are enzymes able to both add and remove histone acetylations (histone acetyltransferases and histone deacetylases, respectively). However, only a handful of proteins able to "read" HMs have been found and they are mainly related to active transcription. Alternative models, e.g. with positive feedback loops [38, 39], have been suggested in which the HMs may work as a transcriptional memory, but are not the main regulators themselves. Furthermore, since the histone-modifying enzymes lack sequence specificity, there must be other cellular mechanisms to determine the gene locations.

Figure 4. Histone modification cross-talk. Positive effects are shown as arrowheads and negative effects as flat heads [35]. Reprinted with permission from Nature Publishing Group.

1.2.3 Nucleosome positioning

The nucleosome positioning adds another layer of transcriptional regulation. Nucleosomes have been shown to occupy specific locations relative to other functional elements. The nucleosome occupancy governs e.g. the TF accessibility of the DNA, and nucleosome-free regions are enriched for regulatory elements [40]. Nucleosome positioning may be studied by digesting chromatin using micrococcal nuclease, leaving the DNA wrapped around the mononucleosomes intact for sequencing or microarray profiling. Nucleosomes do not have specific DNA motifs, but there are several associations between sequence and nucleosome positioning [41], related to how easily different DNA sequences are bent around the histone octamers.

1.2.4 Alternative splicing

A single gene may give rise to several different proteins by a process called alternative splicing. This process includes either the inclusion or exclusion of particular exons in the processed mRNA. Nearly all genes (95%) in the human genome with multiple exons may undergo alternative splicing [42]. An overview of different alternative splicing types is given in Figure 5. The most common form of alternative splicing is exon skipping (cassette exon) [43]. In this alternative splicing type a specific exon may be spliced out from the transcript while its neighbors are always retained. Alternative splicing has a key role in gene regulation and allows many protein isoforms to be produced from a single gene. Therefore, the number of proteins produced from the human genome by far exceeds the number of genes.

Figure 5. Schematic view of different alternative splicing events. Figure adapted from Agathman (Wikimedia Commons).

The traditional view on splicing was that it occurred post-transcriptionally. In that model, the DNA was first transcribed into a complete pre-mRNA, and the splicing events, such as the removal of introns and the alternative inclusion or exclusion of exons, took place only after transcription was completed, before the mRNA was translated into a protein. More recent studies have suggested that splicing may happen both during transcription and after it [44-46], and genome-wide studies have shown that most splicing occurs co-transcriptionally [47]. Furthermore, HMs have been suggested to play a role in the regulation of alternative splicing [48, 49].

1.2.5 Divergent transcription

Most promoters initiate transcription in both directions from the TSS [50-52]. However, the RNA polymerase normally does not elongate efficiently in the upstream antisense direction and the RNAs produced are usually short and quickly degraded [53, 54]. It has been suggested that transcription is initiated from the nucleosome-depleted region upstream of the TSS in both directions in almost equal proportions, but that checkpoints during elongation terminate the antisense transcription, so that most of the observed transcripts are from the sense direction [54]. Divergent transcription has been speculated to have a function since it is common for most genes [39], but no clear function has been demonstrated yet.

Moreover, about 10% of all protein-coding genes have a bidirectional orientation, with two genes on opposite strands separated by less than 1,000 bp and with a common promoter region in between [55, 56]. Such gene pairs do not usually participate in a common pathway, but show similarities in their expression profiles [55]. This type of gene orientation may have been evolutionarily advantageous since it may have allowed more gene translocations [53].

Several of the HM and TF signals that are used to identify gene regulatory elements may be present upstream of the transcribed genes due to divergent transcription or bidirectionally oriented genes. With few exceptions [51], the influence of antisense transcription has not been taken into account when studying these regulatory marks in the past.

1.3 High-throughput technologies

In this section some of the technologies used to produce data analyzed in this thesis are presented.

1.3.1 Next-generation sequencing

DNA sequencing technologies have improved rapidly over the last years. Current methods include so-called next-generation sequencing, which allows for parallel sequencing of millions of DNA fragments. There are several sequencing methods. Two of them that are commonly used in high-throughput applications are Illumina and SOLiD sequencing. Both produce rather short reads, but have very high throughput. After sequencing, the short reads are aligned to the reference genome and the position of their alignment is assumed to represent their genomic origin. Depending on the read lengths, about 80-90% of the genome is mappable [57] using either of these two methods. The remaining part of the genome contains highly repetitive regions that require other sequencing methods, producing longer reads, to generate mappable sequences.

1.3.2 RNA-seq

RNA sequencing (RNA-seq) is a method for genome-wide profiling of the mRNA transcriptome. It uses next-generation sequencing in order to sequence a snapshot of the mRNA in a biological sample. RNA-seq may be used to profile e.g. gene expression and co-expression, gene splicing, and single nucleotide polymorphisms (SNP) in coding regions, and it has largely replaced expression microarrays in transcriptome profiling.

There are several steps, both mandatory and optional, in performing RNA-seq, as described by Wang et al. [5]. First, either the total RNA may be used, or a specific fraction may be selected. For instance, mRNAs may be enriched by targeting the polyadenylated tail of the RNA. Next, the RNA is reverse-transcribed into complementary DNA (cDNA). The RNA may be fragmented prior to the reverse transcription to avoid biases related to e.g. the formation of secondary RNA structures [58]. However, since the 3' and 5' ends of the RNA are less efficiently converted into DNA, the fragmentation may alternatively be performed on the cDNA, depending on the aim of the study. The resulting cDNA fragments may be amplified using PCR. Finally, the cDNA fragments are sequenced and aligned to the reference genome or reference transcripts, or assembled de novo.

1.3.3 ChIP-seq

ChIP-seq is the most common method to analyze DNA-protein interactions. It is mainly applied to produce genome-wide snapshots of TFs or chromatin features [4]. Since TFs interact with DNA to control gene expression, their genome-wide distribution is highly relevant in the study of certain diseases.

ChIP is performed in several steps. First, the DNA-binding proteins are crosslinked to the DNA by the addition of formaldehyde. Next, the chromatin is fragmented, either by sonication or by the addition of selected nucleases. DNA fragments of about 150-300 bp are selected, and the size-selected fragments are immunoprecipitated with an antibody. The antibody is chosen to bind to the protein or domain of interest and may be targeted e.g. against a specific TF, a protein domain, or a histone with a particular post-translational modification. The DNA-protein complexes that are bound by the antibody are then selected, the complexes are washed to remove unspecific binding, and the DNA is finally purified. The DNA segments are then usually amplified by PCR and sequenced using next-generation sequencing. The genomic position to which the DNA fragments map will correspond to the position of the protein bound by the antibody.

Before next-generation sequencing became the method of choice, the ChIP fragments were hybridized to microarrays (ChIP-chip) instead. This also gave a similar genome-wide view but was limited to the probed DNA regions and prone to more biases [4].

1.3.4 DNA microarrays

A microarray contains thousands to millions of small DNA or RNA sequences (probes) attached to a solid surface. Each probe matches a specific target sequence and the probe-target binding is quantified using e.g. fluorescent dyes that have been attached to the complementary DNA or RNA. Microarrays may be used to determine the expression of multiple genes or exons, or to determine the SNPs in a biological sample. A common application has been to identify the differences in gene expression between a patient and a control sample. The resolution of the microarray is determined by the number of probes, and only known genes and transcripts may be studied. Commonly 500,000 to 2,000,000 SNPs are profiled at once, which covers most of the genetically interesting parts of the genome. However, this number should be compared to the total of 38 million SNPs identified by the 1000 Genomes Project [59]. Thus, microarrays are limited by the quality of the genome assembly and the SNP databases and may not be used for de novo identification of SNPs or transcripts. As with any other technique, there are technical limitations related e.g. to low-abundance transcripts, which may not be reliably assessed, and to differences in probe-target binding properties between different probes.

1.3.5 Cap analysis of gene expression

Cap analysis of gene expression (CAGE) [60] is a next-generation sequencing technique able to identify the 5' ends of mRNAs. Firstly, the mRNA is selected based on the presence of a cap structure at the 5' end. Then the mRNA is reverse-transcribed into DNA, amplified, and the first 20-27 nucleotides from the 5' end are sequenced. The sequence reads (called CAGE tags) are mapped to the reference genome and correspond to mRNA originating from active TSSs and promoters in the studied sample. The number of reads mapped to a particular TSS may be used to quantify its activity. Furthermore, several advances have been made to the original CAGE protocol and it may now be performed using single-molecule sequencing without the use of PCR amplification [61].

1.4 Databases and international projects

This section presents some of the data sources that have been used throughout the thesis.

1.4.1 Ensembl

Ensembl [16] is a joint project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. The goal of the project is "to automatically annotate the genome, integrate this annotation with other available biological data and make all this publicly available via the web", as stated by the project website [62]. The project now includes several species in addition to the human genome. In this thesis we have used the Ensembl database to determine the genomic coordinates of human genes and transcripts (H. sapiens 54_36p, hg18).

1.4.2 ENCODE

The Encyclopedia of DNA Elements (ENCODE) [7] is a research project aimed at finding all functional elements in the human genome for a number of different cell types. The data is produced in several labs following standardized protocols and is made publicly available. A pilot phase of the project was released in 2007 and focused on the analysis of 1% of the genome [63, 64]. In 2012 a series of papers were published describing the analysis of the complete genome [7]. ENCODE continues to be an important resource of data on transcriptional regulation and much of the next-generation sequencing data used in this thesis is from the ENCODE project.

1.4.3 FANTOM

The Functional Annotation of the Mammalian Genome (FANTOM) is a collaborative project launched in 2000 aimed at identifying all functional elements in the mammalian genome. The results have been released in several phases, including the latest, FANTOM5, which was published in May 2014. In these publications the CAGE technology was used to identify a large set of promoters across many cell types and to construct a human gene expression atlas [8]. Additionally, an atlas of active enhancers across the human cell types was constructed [65].

2 Computational background

2.1 Feature selection

The task of a feature selection method is to select a subset of the features available in the data. Feature selection is often applied when dealing with many features measured for a small number of objects. In contrast to other techniques, feature selection does not change or transform the original features. Since thousands of genes or even millions of SNPs are commonly analyzed today, performing a feature selection step has become an important bioinformatics task [66].

Feature selection methods are motivated both by limitations of the computational resources and by the assumption that the data often contains redundant or non-informative features. Computational time may be saved by removing redundant features, and the model quality may be improved if features without any predictive value are removed prior to the modeling. Another benefit is that the interpretation of the model is simplified when it is based on fewer features.

Different approaches for feature selection are available, from univariate methods that evaluate each feature independently (e.g. by a t-test) to multivariate methods that include feature dependencies to various degrees [66]. Feature selection may be applied either once, as a filter prior to the model construction, or wrapped around the model construction. In the latter case a new model may be constructed for each selection of features, and thus the set of features may be evaluated in the context of a specific classifier. Each approach has different advantages and disadvantages (see overview in [66]). For instance, filtering methods are faster than wrapper methods, but cannot incorporate the interaction with the classifier algorithms. Similarly, univariate methods are faster than multivariate methods, but ignore dependencies between the features.

2.1.1 Monte Carlo feature selection

Monte Carlo feature selection (MCFS) is a method aimed at identifying features with importance for classification [67]. The algorithm is implemented in the publicly available dmLab software [68]. MCFS is a multivariate method based on the Monte Carlo technique, a group of algorithms based on random sampling of the data. Typically, a large number of random samples are constructed and the property of interest is estimated from these. Monte Carlo simulations are generally employed when the exact solution cannot be computed or when there is a lot of uncertainty in the input data. Since the biological data used in classification is usually quite noisy, Monte Carlo-based methods may be well suited for this type of application.

A schematic view of the MCFS procedure is shown in Figure 6. From the original set of d features, s subsets, each containing m features, are sampled at random. Each subset is divided into a training and a test set t times, creating t × s pairs of training and test sets. For each of them, a classifier is trained on the training set and evaluated on the test set. Typically, the classification algorithm used in the evaluation step of MCFS has been a decision tree classifier.

Figure 6. Overview of the MCFS procedure [67]. Reprinted with permission from Oxford University Press.

Every iteration of the MCFS procedure ends with the construction and evaluation of a decision tree. Based on this tree, a relative importance (RI) score is computed for all of the participating features. Finally, this score is summarized per feature over all the iterations following

$$\mathrm{RI}_{g_k} = \sum_{\tau=1}^{s \cdot t} \big(\mathrm{wAcc}\big)^{u} \sum_{n_{g_k}(\tau)} \mathrm{IG}\big(n_{g_k}(\tau)\big)\left(\frac{\text{no. in } n_{g_k}(\tau)}{\text{no. in } \tau}\right)^{v}$$

The s and t parameters have been described above and their product is the total number of trees. For each such tree (τ) the weighted accuracy (wAcc) is computed according to the following equation

$$\mathrm{wAcc} = \frac{1}{c}\sum_{i=1}^{c}\frac{n_{ii}}{n_{i1}+n_{i2}+\cdots+n_{ic}}$$

where c is the number of decision classes and n_ij is the number of objects from class i that were classified as class j; wAcc is thus a mean accuracy with all the decision classes weighted equally. The wAcc of a tree is multiplied by the sum, over all nodes n within the tree τ that contain the attribute g_k, of the information gain (IG) of each node times the ratio of the number of objects in (no. in) the node to the number of objects in the whole tree τ. The parameters u and v are two optional weighting parameters, which were set to either u = v = 1, or u = 0 and v = 1, in the experiments performed throughout this thesis.
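To make the procedure concrete, the following is a minimal sketch, in Python, of how the RI score defined above could be accumulated over the t × s trees. It is not the dmLab implementation; it assumes u = v = 1, uses scikit-learn decision trees with the entropy criterion (so that the impurity decrease of a split equals its information gain), and all data, names, and parameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def weighted_accuracy(y_true, y_pred, classes):
    """wAcc: the mean of the per-class accuracies."""
    return np.mean([np.mean(y_pred[y_true == c] == c)
                    for c in classes if np.any(y_true == c)])

def accumulate_ri(ri, tree, feature_idx, wacc, u=1.0, v=1.0):
    """Add one tree's contribution to the relative importance scores."""
    t = tree.tree_
    n_root = t.n_node_samples[0]
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                       # leaf node: no split, no IG
            continue
        n = t.n_node_samples[node]
        # information gain of the split (valid with criterion="entropy")
        ig = t.impurity[node] - (t.n_node_samples[left] * t.impurity[left]
                                 + t.n_node_samples[right] * t.impurity[right]) / n
        g = feature_idx[t.feature[node]]     # map back to the original feature
        ri[g] += wacc ** u * ig * (n / n_root) ** v

# toy data: 200 objects, d = 100 features, two informative features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

d, s, m, t_reps = X.shape[1], 50, 10, 5      # s subsets of m features, t splits each
ri = np.zeros(d)
for _ in range(s):
    subset = rng.choice(d, size=m, replace=False)
    for _ in range(t_reps):
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, subset], y, test_size=0.33)
        clf = DecisionTreeClassifier(criterion="entropy").fit(X_tr, y_tr)
        wacc = weighted_accuracy(y_te, clf.predict(X_te), np.unique(y))
        accumulate_ri(ri, clf, subset, wacc)

print("top-ranked features:", np.argsort(ri)[::-1][:5])
```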

2.1.1.1 Determining which features are significant

While performing MCFS, a total RI score is reported for each feature. To determine which features significantly contributed to the classification, we applied a permutation test as implemented in dmLab. The permutation test is performed by randomly reassigning the class labels of the original data N times. Thus, in each iteration the original distribution of labels is preserved. For each such relabeled dataset, the whole MCFS procedure is repeated, from the construction of s subsets to the computation of the RI for each feature. The RIs are stored, producing a total of N permutation-test RIs for each feature in addition to the RI based on the original data.

In a typical permutation test, a p-value may be estimated for each attribute score according to the proportion of times the RI of the relabeled data is higher than or equal to the RI of the original data. In order to be certain of the validity of the results, many permutations (e.g., >10,000) have to be performed. This is highly impractical in the MCFS setting since the algorithm is very computer intensive. Under the assumption that the RI scores are normally distributed, the mean μ_gk and standard deviation σ_gk of the permutation-test RI scores may instead be estimated for each attribute. The estimates are calculated according to the following formulas

$$\mu_{g_k} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{RI}_{g_k}^{(i)}$$

and

$$\sigma_{g_k} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathrm{RI}_{g_k}^{(i)}-\mu_{g_k}\right)^{2}}$$

where N is the number of permutations and RI_gk^(i) is the RI of feature g_k in the i-th permutation. Next, the probability that the RI of the original data came from a normal distribution with these parameters may be computed using the estimates. This has been shown to work well in practice using 100 permutations. However, the p-value will be more accurate the more permutations that are performed.

Since a high number of attributes are tested for significance simultaneously, Bonferroni correction has been applied to the p-values calculated using this strategy. If the number of significant features is too high to include all features in the model, another option is to include only the top-ranked features instead. This strategy was applied by Enroth et al. [69].
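A minimal sketch of this normal-approximation permutation test is given below, assuming the permutation RIs have already been computed; the data is simulated and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def permutation_p_values(ri_observed, ri_permuted):
    """ri_observed: (d,) array; ri_permuted: (N, d) array of permutation RIs."""
    mu = ri_permuted.mean(axis=0)
    sigma = ri_permuted.std(axis=0, ddof=1)          # 1/(N-1) estimator
    # probability of an RI at least this large under the normal null model
    p = norm.sf(ri_observed, loc=mu, scale=sigma)
    d = len(ri_observed)
    return np.minimum(p * d, 1.0)                    # Bonferroni correction

# toy example: 100 features, N = 100 permutations, feature 0 clearly important
rng = np.random.default_rng(1)
ri_perm = rng.normal(1.0, 0.1, size=(100, 100))
ri_obs = rng.normal(1.0, 0.1, size=100)
ri_obs[0] = 2.0
p_adj = permutation_p_values(ri_obs, ri_perm)
print("significant features:", np.where(p_adj < 0.05)[0])
```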

2.2 Machine learning

In machine learning the aim is to automatically organize the data and learn from it. Typically a set of training examples is given, and the task is to find patterns underlying these observations. Machine learning methods may be divided into supervised learning, in which the training examples are labeled with a known class or outcome, and unsupervised learning, which deals with unlabeled examples [70].

For instance, in a setup where the outcome is whether a child has asthma or not, an unsupervised method may seek to identify groups of children with similar genotypes and environmental exposures, whereas a supervised method may seek to predict if a child will develop asthma or not based on its genotype and environmental exposures. The following section provides an overview of some of the most common machine learning techniques, as well as a more in-depth description of the rule-based method employed in this thesis.

2.2.1 Clustering

Clustering is a type of unsupervised learning with the aim of automatically discovering subgroups or clusters in the data, where the objects within a cluster are more similar to each other than to the objects in any other cluster. Clustering is used when no class information is given, or when it is desirable to study the subgroup structure of the data. There are several different algorithms for clustering and the preferred algorithm depends on how the data is distributed.

2.2.1.1 K-means clustering

In k-means clustering, a set of k initial objects is selected as cluster centroids at random. Next, each object is assigned to the closest centroid, so that k clusters are formed. The centroid of each such cluster is then recalculated and these assignment-recalculation steps are repeated until convergence is reached. This algorithm is affected by the choice of starting centroids and the results may therefore differ from run to run.
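As an illustration, the k-means procedure described above can be sketched in a few lines of Python; the toy data and implementation details (e.g. the convergence check) are illustrative and not taken from the thesis.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k objects at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: each object goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its cluster
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:             # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # convergence reached
            break
        centroids = new_centroids
    return labels, centroids

# toy data: two well-separated groups of points
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```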

2.2.1.2 Hierarchical clustering

Hierarchical clustering is performed either "bottom-up" (agglomerative) or "top-down" (divisive). In the agglomerative setting each object starts as one cluster, and in each iteration the two most similar clusters are merged into a new cluster until the desired number of clusters has been reached. In the divisive setting, all objects start in the same cluster. In each step one cluster is split into two, until the desired number of clusters has been reached. The optimal split or merge is commonly determined by a greedy algorithm.
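For comparison, a minimal sketch of agglomerative clustering using SciPy is shown below; the choice of average linkage and the toy data are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])

# each object starts as its own cluster; the two most similar clusters are
# merged repeatedly, producing a full merge tree (dendrogram) encoded in Z
Z = linkage(X, method="average")

# cut the tree so that the desired number of clusters (here 2) remains
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```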

2.2.2 Principal component analysis

Principal component analysis (PCA) is an unsupervised method that is used for dimension reduction. It was first introduced in 1901 by Karl Pearson [71]. Given a dataset with possibly correlated features describing a set of objects, PCA may be used to replace the original features by a new set of linearly uncorrelated features called principal components. These principal components are normally ordered by the amount of the total variance in the data that is explained by each of them. Thus, using the first two principal components, high-dimensional data may be visualized in a simple two-dimensional plot (Figure 7).

Figure 7. Example of PCA usage, adapted in part from Bornelöv et al. [72]. Here the expression of 4,026 genes was used to describe patients with three different lymphoma diagnoses (encoded as C, D, and F). The PCA was used to reduce (A) all genes, or (B) a selection of 30 genes, into two dimensions. The PCA made it possible to visually inspect the distribution of patients in the gene expression space before and after the selection of genes predictive for the diagnosis.

26 The PCA loadings are defined as the contribution of each original feature to each principal component, and are usually represented as vectors. Features with high loadings in the first principal components may be interpreted to have high importance for the variation of the data. PCA may thus be used to identify important features.
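A minimal sketch of how PCA can be used for dimension reduction and for inspecting loadings, here with scikit-learn and simulated data (the gene expression data from the figure is not reproduced):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                   # 100 objects, 50 features
X[:, 0] = X[:, 1] + 0.1 * rng.normal(size=100)   # make two features correlated

pca = PCA(n_components=2)
scores = pca.fit_transform(X)                    # object coordinates in the new space

# proportion of the total variance explained by each principal component
print(pca.explained_variance_ratio_)

# loadings: contribution of each original feature to each principal component
loadings = pca.components_                       # shape (2, 50)
print(np.argsort(np.abs(loadings[0]))[::-1][:5]) # top features on PC1
```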

2.2.3 Classification

Classification is a type of supervised learning. Each object is described by a feature vector which contains the value of each feature for that object. The aim of the classification is, based on a set of such objects with known class labels, to build a model able to predict the decision class for previously unseen objects, given that they are described by the same type of feature vectors.

There are many classification methods available to date and they differ in their assumptions, limitations, and representation of the data. Generally, simpler methods are easier to interpret, whereas more advanced methods are more accurate in their predictions [70]. The preferred classification method may therefore differ depending on the purpose of the study. In this section we will give an in-depth description of the rule-based classification method used in this thesis as well as a short description of logistic regression and decision trees.

2.2.3.1 Logistic regression

Logistic regression is similar to linear regression, but instead of predicting a continuous response variable, a binary variable (represented as 0 or 1) is predicted. The method is commonly used in biology, and is one of the simplest classification methods. In order to ensure that the predicted value falls into the [0,1] interval, the traditional regression equation (where Y is the response, βi are the coefficients, and Xi are the explanatory variables)

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$

is replaced by the logistic function

$$P(Y=1) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n}}$$

This function is guaranteed to fall between 0 and 1 and the resulting value is interpreted as the probability of the 1 outcome [70].
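A small worked sketch of the logistic function above, with made-up coefficients, illustrates how the predicted probability is computed and turned into a class prediction:

```python
import numpy as np

def logistic_probability(x, beta0, beta):
    """P(Y=1) = exp(beta0 + beta.x) / (1 + exp(beta0 + beta.x))."""
    z = beta0 + np.dot(beta, x)
    return np.exp(z) / (1.0 + np.exp(z))

beta0, beta = -1.0, np.array([2.0, -0.5])    # hypothetical fitted coefficients
x = np.array([1.2, 0.3])                     # explanatory variables for one object
p = logistic_probability(x, beta0, beta)
print(p, "-> predicted class:", int(p > 0.5))
```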

2.2.3.2 Decision trees

Decision trees are conceptually simple but nevertheless able to capture non-linear relations. They are represented as directed, acyclic, and connected graphs with a number of nodes connected by edges. The uppermost node is called the root node. The bottommost nodes are called leaf nodes, thus producing an analogy with an upside-down tree. Every path in the tree starts at the root, goes via a number of internal nodes, and ends up in a leaf.

Figure 8 shows a decision tree trained on the HM data from Enroth et al. [69] using the J48 algorithm in WEKA [73]. The classification task was to predict whether a cassette exon would be included in the transcript or spliced out based on the HMs present preceding, on, and succeeding the exon. In order to get a small tree suitable for this example, the WEKA parameters were set to require at least 3,000 training set objects in each leaf node, which produced this minimal tree with only the root, two internal nodes, and four leaf nodes.

It is very easy to interpret and use a decision tree. Consider again the classifier in Figure 8 and assume that we have one exon which does not have the H3K9me2 modification on the exon (follow the edge going down-right from the root), but that has the H4K91ac modification present succeeding the exon (follow the edge going down-left from the H4K91ac.succ node). We now end up in the "Spliced out" leaf node, which corresponds to a prediction that the unknown exon is spliced out.

Figure 8. Decision tree for prediction of cassette exon inclusion based on histone modification data. The accuracies of the prediction in each leaf node based on the training data are shown under the nodes.

2.2.3.3 Rule-based classification using rough sets

The classification performed in this thesis is based on the ROSETTA framework for data mining and knowledge discovery [74]. ROSETTA uses rough set theory [75, 76], introduced in 1982 by Zdzislaw Pawlak, to construct rule-based classifiers. This section provides a detailed description of rough set theory and the use of rule-based classifiers.

2.2.3.3.1 Rough set theory

Rough set theory is a mathematical framework for analyzing tabular data. It was specifically built to deal with imprecise data by constructing set approximations. The first step in dealing with rough sets involves arranging the data with the objects as rows and the attributes as columns; together they form an information system. The objects are grouped into equivalence classes, where each equivalence class is defined as all objects that are indiscernible (identical) with respect to the set of all attributes. The decision attribute is a unique attribute, usually positioned as the last data column, containing the decision class of each object. An information system with a decision attribute is called a decision system. For instance, in Enroth et al. [69], we had a decision system in which we treated the exons as objects, different HMs as attributes, and the exon inclusion level, with the two possible values "included" or "spliced out", as the decision attribute.

If the set of all objects is crisp, then each equivalence class will contain objects with only one decision class. If there is an equivalence class with objects of different decision classes, then the data set is defined to be rough. In other words, in a rough set there are at least two objects that are indiscernible with respect to the attribute values, but have different decision values. Most biological datasets are rough, thus preventing an errorless classification.

A reduct [77] is defined as a minimal set of attributes needed to discern all objects of different decision classes equally well as by using all attributes. An extension of the reduct notion is the approximate reduct, which is a set of attributes that can discern a sufficiently large fraction of all objects, as determined by a threshold parameter. In order to compute the reducts of a decision system, a discernibility function is first created. The discernibility function is a Boolean function that will return either true or false. Each input variable is Boolean and its value represents the inclusion or exclusion of an attribute. The discernibility function is simplified using Boolean algebra and each minimal solution to it corresponds to one reduct (see description of the process in [78]).

2.2.3.3.2 An example of reduct computation

In this section we will show step by step how to compute the reducts for a decision system. The decision system used (Table 1) is based on data from Enroth et al. [69] using the five highest ranked attributes and a random selection of 7 objects (slightly modified). The classification task is to predict the exon inclusion level ("Included" or "Spliced out") based on HMs residing on the preceding intron, on the exon, or on the succeeding intron. Since only a small fraction of the data is used, we do not expect the results to reflect the published results.

Table 1. A decision system with 7 objects, 5 attributes and a decision attribute.

ID   H2BK5me1    H2BK5me1     H3K9me2   H4K91ac     H4K91ac      Inclusion
     Preceding   Succeeding             Preceding   Succeeding   level (IL)
A    Yes         Yes          No        No          No           Included
B    Yes         Yes          No        No          No           Spliced out
C    No          No           No        No          Yes          Included
D    No          Yes          No        No          Yes          Included
E    No          No           No        Yes         No           Spliced out
F    Yes         Yes          No        Yes         Yes          Spliced out
G    No          No           No        Yes         No           Spliced out

We start by identifying some of the parameters. A decision system is defined to consist of the set of all objects, U, the set of all attributes, A, and the decision attribute, d. In the current example (Table 1) we may identify U = {A, B, C, D, E, F, G}, A = {H2BK5me1.prec, H2BK5me1.succ, H3K9me2, H4K91ac.prec, H4K91ac.succ}, and d = Inclusion level. The next step is to compute the equivalence classes. The equivalence class of an object x∈U is denoted [x] and is defined as [x] = {x'∈U | ∀a∈A a(x) = a(x')}. This definition identifies groups of objects that share the same attribute values. Here, the following equivalence classes were identified:

[A] = {A, B}
[B] = {A, B}
[C] = {C}
[D] = {D}
[E] = {E, G}
[F] = {F}
[G] = {E, G}

For each equivalence class we may compute its generalized decision, defined as ∂([x]) = {i | ∃x'∈[x] d(x') = i}, or in other words the set of all decision values for the objects in the given equivalence class. Applying this definition, the generalized decisions were identified as:

∂([A]) = ∂([B]) = {"Included", "Spliced out"}
∂([C]) = {"Included"}
∂([D]) = {"Included"}
∂([E]) = ∂([G]) = {"Spliced out"}
∂([F]) = {"Spliced out"}

We can now construct the discernibility matrix, which is an n × n matrix, where n is the number of equivalence classes. Each cell c_ij in the matrix is defined as c_ij = {a∈A | a(x_i) ≠ a(x_j)}. In other words, each cell is the set of all attributes that may be used to discern between two equivalence classes.

Table 2. Discernibility matrix, listed cell by cell for the lower triangle (the diagonal cells are ∅).

c([C],[A]) = {H2BK5me1.prec, H2BK5me1.succ, H4K91ac.succ}
c([D],[A]) = {H2BK5me1.prec, H4K91ac.succ}
c([D],[C]) = {H2BK5me1.succ}
c([E],[A]) = {H2BK5me1.prec, H2BK5me1.succ, H4K91ac.prec}
c([E],[C]) = {H4K91ac.prec, H4K91ac.succ}
c([E],[D]) = {H2BK5me1.succ, H4K91ac.prec, H4K91ac.succ}
c([F],[A]) = {H4K91ac.prec, H4K91ac.succ}
c([F],[C]) = {H2BK5me1.prec, H2BK5me1.succ, H4K91ac.prec}
c([F],[D]) = {H2BK5me1.prec, H4K91ac.prec}
c([F],[E]) = {H2BK5me1.prec, H2BK5me1.succ, H4K91ac.succ}

The discernibility matrix (Table 2) describes how each object may be discerned from all other objects by any of the attributes within a cell. It may be noted that in order to classify objects, it is only necessary to discern between objects from different decision classes, and therefore the decision-relative discernibility matrix is usually computed instead. The decision-relative discernibility matrix (Table 3) is defined in the same way as the full discernibility matrix except that now c_ij = ∅ if ∂([x_i]) = ∂([x_j]).

Table 3. Decision-relative discernibility matrix. It is identical to Table 2 except that the cells comparing equivalence classes with the same generalized decision, c([D],[C]) and c([F],[E]), are now ∅.

c([C],[A]) = {H2BK5me1.prec, H2BK5me1.succ, H4K91ac.succ}
c([D],[A]) = {H2BK5me1.prec, H4K91ac.succ}
c([E],[A]) = {H2BK5me1.prec, H2BK5me1.succ, H4K91ac.prec}
c([E],[C]) = {H4K91ac.prec, H4K91ac.succ}
c([E],[D]) = {H2BK5me1.succ, H4K91ac.prec, H4K91ac.succ}
c([F],[A]) = {H4K91ac.prec, H4K91ac.succ}
c([F],[C]) = {H2BK5me1.prec, H2BK5me1.succ, H4K91ac.prec}
c([F],[D]) = {H2BK5me1.prec, H4K91ac.prec}

Now the discernibility function, f(A), can be defined. It is a Boolean function that describes which attributes a∈A are needed to discern between equivalence classes with different decisions [78]. It is defined as

31 ∗ ∗ ∗ ( ) =|1≤≤≤, ≠  where A* is a set of Boolean variables corresponding to the variables in A * * and cij = {a | a∈cij}. Using this formula, the discernibility function is f(A) = (H2BK5me1.prec* ˅ H2BK5me1.succ* ˅ H4K91ac.succ*) ˄ (H2BK5me1.prec* ˅ H4K91ac.succ*) ˄ (H2BK5me1.prec* ˅ H2BK5me1.succ* ˅ H4K91ac.prec*) ˄ (H4K91ac.prec* ˅ H4K91ac.succ*) ˄ (H4K91ac.prec* ˅ H4K91ac.succ*) ˄ (H2BK5me1.prec* ˅ H2BK5me1.succ* ˅ H4K91ac.prec*) ˄ (H2BK5me1.succ* ˅ H4K91ac.prec* ˅ H4K91ac.succ*) ˄ (H2BK5me1.prec* ˅ H4K91ac.prec*)

By applying the Boolean absorption law, A ˄ (A ˅ B) = A, this function may be simplified to

f(A) = (H2BK5me1.prec* ˅ H4K91ac.succ*) ˄
       (H4K91ac.prec* ˅ H4K91ac.succ*) ˄
       (H2BK5me1.prec* ˅ H4K91ac.prec*)

Transforming the expression from conjunctive normal form to disjunctive normal form results in

f(A) = (H2BK5me1.prec* ˄ H4K91ac.prec*) ˅
       (H2BK5me1.prec* ˄ H4K91ac.succ*) ˅
       (H4K91ac.prec* ˄ H4K91ac.succ*)

corresponding to the three reducts {H2BK5me1.prec, H4K91ac.prec}, {H2BK5me1.prec, H4K91ac.succ}, and {H4K91ac.prec, H4K91ac.succ}.
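The reduct computation walked through above can also be reproduced programmatically. The following is a minimal brute-force sketch (not the ROSETTA algorithm) that builds the decision-relative discernibility cells from Table 1 and searches for minimal attribute sets that intersect every cell:

```python
from itertools import combinations

ATTRS = ["H2BK5me1.prec", "H2BK5me1.succ", "H3K9me2",
         "H4K91ac.prec", "H4K91ac.succ"]
# Table 1: attribute values followed by the decision (inclusion level)
DATA = {
    "A": (("Yes", "Yes", "No", "No", "No"), "Included"),
    "B": (("Yes", "Yes", "No", "No", "No"), "Spliced out"),
    "C": (("No", "No", "No", "No", "Yes"), "Included"),
    "D": (("No", "Yes", "No", "No", "Yes"), "Included"),
    "E": (("No", "No", "No", "Yes", "No"), "Spliced out"),
    "F": (("Yes", "Yes", "No", "Yes", "Yes"), "Spliced out"),
    "G": (("No", "No", "No", "Yes", "No"), "Spliced out"),
}

# group objects into equivalence classes and record their generalized decisions
classes = {}
for obj, (values, decision) in DATA.items():
    classes.setdefault(values, set()).add(decision)

# decision-relative discernibility cells: one per pair of equivalence classes
# with different generalized decisions
cells = []
for (v1, d1), (v2, d2) in combinations(classes.items(), 2):
    if d1 != d2:
        cells.append({a for a, x, y in zip(ATTRS, v1, v2) if x != y})

# a reduct is a minimal attribute set intersecting every non-empty cell
reducts = []
for size in range(1, len(ATTRS) + 1):
    for subset in combinations(ATTRS, size):
        if all(set(subset) & cell for cell in cells):
            if not any(set(r) <= set(subset) for r in reducts):
                reducts.append(subset)

print(reducts)   # expected: the three reducts derived in the text
```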

2.2.3.3.3 Rule-based classification in ROSETTA

The ROSETTA software incorporates reduct calculation algorithms similar to the one described in the previous two sections. However, since simplifying the discernibility function is an NP-hard problem [77], approximation algorithms are used. Furthermore, computing approximate reducts instead of exact reducts will often reduce overfitting and simplify the problem.

When the reducts have been computed, they are translated into IF-THEN rules by reading the values of the reduct attributes in the original dataset. Consider for instance the dataset presented in Table 4. Again, the classification task is to predict if the exon is included in the transcript or spliced out based on which HMs are significantly enriched in the region.

Table 4. A decision system with four exons.

ID   H3K4me1   H3K9me2   H3K9me3   ...         Inclusion level (IL)
A    No        Yes       Yes       ...   Yes   Included
B    No        No        No        ...   Yes   Included
C    Yes       No        No        ...   No    Spliced out
D    Yes       Yes       Yes       ...   Yes   Included

Assume now that the discernibility function has been constructed, that a set of approximate reducts has been computed and, furthermore, that one of the reducts was the set {H3K9me2, H3K9me3}. To construct the classification rules from this reduct we consider each unique combination of attribute values in the training examples (Table 4) and read the decision values of these. The corresponding rules are:

IF H3K9me2=Yes AND H3K9me3=Yes THEN IL=”Included”
IF H3K9me2=No AND H3K9me3=No THEN IL=”Included” OR IL=”Spliced out”

The first rule is based on exons A and D and shows that exons are predicted to be included when both the H3K9me2 and the H3K9me3 modifications are present. The second rule, which is based on exons B and C, has two different decision classes. This shows that the decision system is rough and that the absence of the two modifications is not a predictor of the exon inclusion level. In general, the syntax of a classification rule is:

IF Condition1=Value1 AND Condition2=Value2 AND … AND ConditionN=ValueN THEN Decision=Prediction
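As an illustration of how such rules are read from a reduct, here is a minimal sketch (assuming a simple dictionary representation of Table 4, not ROSETTA's internal format) that groups the training objects by their values over the reduct {H3K9me2, H3K9me3} and prints one rule per value combination.

```python
from collections import defaultdict

# The decision system from Table 4, restricted to the attributes needed
# to build rules from the reduct {H3K9me2, H3K9me3}.
exons = [
    {"id": "A", "H3K9me2": "Yes", "H3K9me3": "Yes", "IL": "Included"},
    {"id": "B", "H3K9me2": "No",  "H3K9me3": "No",  "IL": "Included"},
    {"id": "C", "H3K9me2": "No",  "H3K9me3": "No",  "IL": "Spliced out"},
    {"id": "D", "H3K9me2": "Yes", "H3K9me3": "Yes", "IL": "Included"},
]
reduct = ["H3K9me2", "H3K9me3"]

# Group the training objects by their value combination over the reduct
# and read off the observed decisions for each combination.
outcomes = defaultdict(list)
for exon in exons:
    key = tuple((a, exon[a]) for a in reduct)
    outcomes[key].append(exon["IL"])

for conditions, decisions in outcomes.items():
    if_part = " AND ".join(f"{a}={v}" for a, v in conditions)
    then_part = " OR ".join(f'IL="{d}"' for d in sorted(set(decisions)))
    print(f"IF {if_part} THEN {then_part}")
# IF H3K9me2=Yes AND H3K9me3=Yes THEN IL="Included"
# IF H3K9me2=No AND H3K9me3=No THEN IL="Included" OR IL="Spliced out"
```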

Rules may have one or multiple conditions in the IF-part of the rule, but only one decision attribute in the THEN-part. Several properties can be defined for the rules in order to measure their quality. The support is defined as the number of objects in the training set that match both the IF-part and the THEN-part of the rule. The rule accuracy is defined as the number of objects that match both the IF-part and the THEN-part of the rule, divided by the number of objects that match the IF-part. Thus, support is a measure of the generality of a rule, whereas accuracy measures its specificity. Both support and accuracy are reported by ROSETTA. The rules may be visualized [72, 79] to ease the interpretation in order to gain biological knowledge. This methodology has been applied, for instance, to understand how HMs are involved in alternative splicing [69] and to study risk and protective factors for asthma and allergy in children [80]. In order to interpret the rules it is important to know how informative they are. Aside from computing their support and accuracy as described above, the rule

quality may be assessed by constructing a classifier from the rules and measuring the classifier performance. The set of all rules constructed from all reducts constitutes a rule-based classifier and may be used to predict the outcome of previously unseen objects. The classification of a new object is done in several steps. Firstly, the object is compared to each rule to determine which rules will fire. A rule is said to fire if each condition in the IF-part of the rule matches the attribute values of the object. Next, a voting scheme is created from the firing rules. In its simplest form the number of votes per rule equals the rule support and the votes are cast on the decision class that is predicted by the rule. Rules with multiple decision values split their votes according to the decision class distribution in the training data. Thus, general rules (with high support) give more votes, but if a rule is not specific (i.e. does not have high accuracy), its votes are split over multiple decision classes according to the observed uncertainty in the training data. If no weighting is applied, the decision class with the highest total number of votes is reported as the classifier prediction.
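A minimal sketch of this voting scheme, assuming a simple rule representation with hypothetical field names; each firing rule casts support-weighted votes, split according to its decision class distribution.

```python
# Minimal voting sketch. Each rule carries its IF-part, its support and
# the distribution of decision classes among the training objects that
# match the IF-part (so a rule with two decisions splits its votes).
rules = [
    {"conditions": {"H3K9me2": "Yes", "H3K9me3": "Yes"},
     "support": 2, "class_dist": {"Included": 1.0}},
    {"conditions": {"H3K9me2": "No", "H3K9me3": "No"},
     "support": 2, "class_dist": {"Included": 0.5, "Spliced out": 0.5}},
]

def classify(obj, rules):
    votes = {}
    for rule in rules:
        fires = all(obj.get(a) == v for a, v in rule["conditions"].items())
        if not fires:
            continue
        for decision, fraction in rule["class_dist"].items():
            votes[decision] = votes.get(decision, 0.0) + rule["support"] * fraction
    # Report the decision class with the highest total number of votes.
    return max(votes, key=votes.get) if votes else None

new_exon = {"H3K9me2": "Yes", "H3K9me3": "Yes"}
print(classify(new_exon, rules))  # -> "Included"
```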

2.3 Model validation and performance
2.3.1 Training and test sets
When a classifier is trained on a particular dataset, the classifier performance on that data is usually higher than on independent data. Furthermore, tuning the classifier to optimize the performance for the training set will usually overfit the model on the training data, making it perform worse on independent data. In order to avoid the problem of overfitting and get reliable estimates of the classifier performance, the data is commonly divided into a training set and a test set. A simple scheme for this division is to randomly sample 80% of the objects into the training set and use the remaining 20% as the test set. Both training and test sets should follow a value distribution similar to the original data. When the model is generated, it is trained using the training set. The model is applied to the test set after the training step, and the performance on the test set is used as an estimate of the performance on previously unseen, independent, data.

2.3.2 Cross validation
Cross validation is another technique used to estimate how a classifier will perform on previously unseen data. Alternatively, cross validation may be used in the model training phase in order to avoid overfitting. In k-fold cross validation (k is usually set to 10) the original set of objects is divided into k equally sized disjoint subsets. One of the subsets is selected

to be used as the test set and the remaining k−1 subsets are used as the training set. A classifier is trained on the training set and evaluated on the test set. This sequence of selection-training-evaluation is repeated k times so that each subset is used as the test set exactly once. The average performance on the test sets is used as an estimate of the performance on unseen data. Several other versions of cross validation are available. Leave-one-out cross validation is a variant of k-fold cross validation with k set to the total number of objects. Thus, each object is excluded from the training set and used as the test set exactly once. Alternatively, a second cross validation may be performed inside the first one [81]. In 5x2-fold cross validation the data is randomly divided into two subsets. The first subset is used as a training set and the second as a test set. This is then reversed, using the second as the training set and the first as the test set. This process is repeated five times.
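A minimal sketch of k-fold cross validation; `train` and `evaluate` are placeholders for any classifier training and scoring functions, not part of ROSETTA.

```python
import random

def cross_validate(objects, k, train, evaluate, seed=0):
    """Estimate performance with k-fold cross validation.

    `train` builds a classifier from a list of objects and `evaluate`
    returns its accuracy on a held-out list; both are placeholders here.
    """
    shuffled = objects[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint subsets

    accuracies = []
    for i in range(k):
        test_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        classifier = train(training_set)
        accuracies.append(evaluate(classifier, test_set))
    # The average test-set performance estimates performance on unseen data.
    return sum(accuracies) / k
```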

2.3.3 Classifier accuracy
Classifier accuracy is a measure of how likely the classifier is to return a correct answer on previously unseen data. It is estimated using either an external test set or cross validation. A high accuracy is preferred, and a 100% accuracy signifies that no misclassifications occurred. However, the number of decision classes and their relative distribution has to be considered when interpreting the classifier accuracy. For instance, for a classification problem with objects equally distributed over two decision classes, random guessing would be expected to yield a classifier with 50% accuracy; as a result, an observed accuracy below or close to 50% would indicate poor performance.

2.3.4 Confusion matrix
The classifier performance on a test set may also be summarized in a confusion matrix. A confusion matrix shows the number of correctly and incorrectly classified objects (Table 5) from the positive and negative class, respectively (terminology explained in Table 6). Each row in the matrix shows the actual number of objects, here 46+4=50 objects from class 1 (Table 5). The number of correct predictions is shown in the diagonal, which is marked in gray. The classifier accuracy is shown in the lower right corner and is defined as the number of objects falling on the diagonal divided by the total number of objects, here (46+48)/(46+4+2+48)=0.94.

Table 5. A confusion matrix.

                        Predicted
                        1           0
Actual      1           46 (TP)     4 (FN)      92%
            0           2 (FP)      48 (TN)     96%
                        96%         92%         94%

Table 6. Description of classification terminology.

Term                    Description
True positive (TP)      Positive correctly classified as positive
False negative (FN)     Positive incorrectly classified as negative
False positive (FP)     Negative incorrectly classified as positive
True negative (TN)      Negative correctly classified as negative
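The quantities shown in Table 5 can be recomputed directly from the TP/FN/FP/TN counts, for example:

```python
# The counts from Table 5.
TP, FN, FP, TN = 46, 4, 2, 48

sensitivity = TP / (TP + FN)                 # 0.92, class 1 row
specificity = TN / (TN + FP)                 # 0.96, class 0 row
ppv = TP / (TP + FP)                         # ~0.96, class 1 column
npv = TN / (TN + FN)                         # ~0.92, class 0 column
accuracy = (TP + TN) / (TP + FN + FP + TN)   # 0.94, lower right corner

print(f"{sensitivity:.2f} {specificity:.2f} {ppv:.2f} {npv:.2f} {accuracy:.2f}")
```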

2.3.5 Receiver operating characteristic curve
A classifier typically provides a score related to how likely an object is to belong to a certain class. In order for the classifier to decide which class should be reported for a certain object, the classifier compares the score (certainty) of each class to a pre-defined threshold. For a problem with two decision classes, the threshold is typically set so that the class with the highest score is reported. However, in many applications different misclassifications may be associated with different costs. For instance, it may be undesirable to classify a possibly sick patient as healthy during a disease screening. Therefore classifiers need to adopt a threshold suitable for the purpose they are designed for. A receiver operating characteristic (ROC) curve illustrates the performance of a classifier as this threshold is varied and may therefore be used to evaluate classifiers without limiting the evaluation to a fixed threshold (Figure 9). The y-axis of the ROC curve shows the true positive rate (TPR) and the x-axis the false positive rate (FPR).

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

Generally, there is a trade-off between having a high TPR and a low FPR, and this trade-off is illustrated by the ROC curve. The dashed diagonal line from the lower left to the upper right corner in Figure 9 corresponds to the expected performance using a random guessing strategy. Points above this line correspond to classifiers performing better than random guessing, and points below to classifiers performing worse. The area under the ROC curve (AUC) is used as an alternative measure of classifier performance and should be significantly higher than 0.5, which corresponds to assigning class

labels by random guessing. Reporting the AUC may be preferred over the classifier accuracy when the latter is difficult to interpret, such as when the class distribution is skewed.

Figure 9. A hypothetical ROC curve is shown as the solid line. The points A and B represent two classifiers built using different thresholds. The dashed line and the point C show the theoretical performance using a random guessing strategy. The point D would correspond to a perfect classifier.

2.4 Statistics
This section covers statistical concepts used in the thesis.

2.4.1 Rule significance
The significance of a classification rule may be computed using the hypergeometric distribution, following Hvidsten et al. [82]:

$p(x; N, n, k) = \sum_{i=x}^{\min(n,k)} \frac{\binom{k}{i}\binom{N-k}{n-i}}{\binom{N}{n}}$

Here N is the total number of objects in the dataset, n is the number of objects that match the IF-part of the rule, k is the total number of objects with the decision class predicted by the rule, and x is the number of objects that

both match the IF-part of the rule and are of the predicted decision class. Typically, we have kept rules with p < 0.05. Bonferroni correction (see section 2.4.3) has been applied with the number of tests set to the number of rules.
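Assuming the definitions of N, n, k, and x above, the rule p-value can be computed with SciPy's hypergeometric distribution; the counts in the example call are hypothetical.

```python
from scipy.stats import hypergeom

def rule_p_value(N, n, k, x):
    """P(X >= x) where X is hypergeometric: N objects in total,
    k of the predicted decision class, and n drawn by the IF-part."""
    return hypergeom.sf(x - 1, N, k, n)

def bonferroni(p, n_tests):
    """Bonferroni correction with the number of tests set to the number of rules."""
    return min(1.0, p * n_tests)

# Hypothetical rule: 100 objects, 40 of the predicted class, and an IF-part
# matching 20 objects of which 15 are of the predicted class.
p = rule_p_value(N=100, n=20, k=40, x=15)
print(p, bonferroni(p, n_tests=165))
```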

2.4.2 Undersampling
Several classification techniques are sensitive to a skewed distribution of the decision classes. For instance, rule-based classifiers are more likely to predict the class that was most frequent in the training set. Undersampling is a strategy used to handle unbalanced data. The aim is to select a balanced subset of the original data and train the classifier on this set instead. The following undersampling strategy has previously been applied in the context of rule-based classifiers [80]:

Let n_i be the number of objects from decision class i. Let n_min be min(n_1, n_2, ..., n_k), where k is the number of decision classes. Each object in the original data then has probability n_min/n_i of being included in the subset, where i is the decision class of the object.

Since new biases may be introduced if only a small subset of the original data is used for classifier training, the undersampling may be repeated to construct several balanced subsets. A classifier is then trained on each subset and eventually, all these classifiers are combined.
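A minimal sketch of the undersampling strategy described above, where an object from class i is kept with probability n_min/n_i; repeating over several seeds gives the balanced subsets.

```python
import random
from collections import Counter

def undersample(objects, decision_of, seed=None):
    """Return a roughly balanced subset: an object from class i is kept
    with probability n_min / n_i, following the strategy above."""
    rng = random.Random(seed)
    counts = Counter(decision_of(obj) for obj in objects)
    n_min = min(counts.values())
    return [obj for obj in objects
            if rng.random() < n_min / counts[decision_of(obj)]]

# Repeat to build several balanced subsets and train one classifier on each,
# e.g. (with a hypothetical dataset `data` labelled by "IL"):
# subsets = [undersample(data, lambda o: o["IL"], seed=s) for s in range(10)]
```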

2.4.3 Multiple testing problem
When many hypotheses are tested simultaneously, the problem of multiple testing arises: the risk of incorrectly rejecting the null hypothesis increases. For instance, if every protein-coding gene in the human genome is tested for a significant effect on a disease trait (~20,000 tests) and the null hypothesis (no effect) is rejected if p<0.05, then 20,000·0.05=1,000 tests would be expected to reject the null hypothesis even if there were no real effects. If there are genes with a real effect among the ones tested, they may be hidden among all the FPs. If the implications of multiple testing are ignored, the conclusions of a study may be greatly flawed. There are several approaches to correct for multiple testing. The Bonferroni correction is the simplest and also the most stringent method, in which the p-values are multiplied by the number of tests. In the previous example the rejection criterion is replaced by p·20,000<0.05 or, equivalently, p<2.5·10⁻⁶. The number of expected FPs after correction would be 20,000·2.5·10⁻⁶ = 0.05 tests. The reduction of FPs has a cost, however, which is an increase in the number of FNs (in this case true hits with a p-value between 2.5·10⁻⁶ and 0.05). There are several other methods to correct

for multiple testing, which differ in how they balance FPs against FNs.

2.4.4 Bootstrapping
Bootstrapping was first introduced in 1979 by Bradley Efron [83] and is a method commonly used in statistics to estimate e.g. the variance or a confidence interval of an estimator. The technique uses an approximating distribution to measure a property of interest. If the original samples may be considered to be drawn from an independent and identically distributed population, then the original data may be resampled (random sampling with replacement) to form T bootstrap samples of the same size as the original sample. The property of interest may then be measured on these bootstrap samples. Bootstrapping is a powerful and simple technique for properties that cannot be computed directly.
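A minimal percentile-bootstrap sketch in NumPy, here estimating a 95% confidence interval of the mean; the sample values are arbitrary.

```python
import numpy as np

def bootstrap_ci(sample, statistic=np.mean, T=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    # Resample with replacement T times and evaluate the statistic each time.
    stats = np.array([statistic(rng.choice(sample, size=sample.size, replace=True))
                      for _ in range(T)])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

print(bootstrap_ci([2.1, 1.9, 2.4, 2.2, 2.0, 2.6, 1.8, 2.3]))
```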

3 Aims

The aim of this thesis was to apply and develop multivariate machine learning methods on complex biological problems. Specifically, we were interested in how different properties are combined with each other and how they interact. The following specific aims were set:

i. To investigate the combinatorial role of HMs.
ii. To identify novel gene-gene and gene-environment interactions with respect to asthma and allergy.
iii. To develop a tool for model visualization and to use it for interaction detection.
iv. To analyze the regulation of transcription with respect to HMs and TFs.

Each of these aims corresponds to one of the papers included in this thesis.

4 Methods and results

This chapter provides an overview of the four papers presented in this thesis. In Paper I and Paper II a pipeline combining feature selection and classification was applied to different biological questions with the specific aim of finding combinations and interactions. Paper III introduced a web-based tool for visualization of rule-based classifiers, and Paper IV continued the study of epigenetics initiated in Paper I. The overview presented here covers a short background specific to each paper, as well as the methods used and the main results. The implications of the results are discussed in the next chapter.

4.1 Paper I
Combinations of histone modifications mark exon inclusion levels

4.1.1 Background
The role of HMs in gene regulation has been extensively studied [30, 33, 84, 85], and machine learning techniques and statistics have been applied to predict gene expression based on HMs [7, 86-89]. However, the relation between HMs and alternative splicing has been studied less. Such connections have been illustrated [49]: in one report Luco et al. showed that HMs affected splice site selection in a set of genes [48], and in another study H3K36me3 was shown to be depleted in spliced out cassette exons [30]. In this paper we provided the first classification of exon inclusion levels based on HMs [69]. Since then, other studies have confirmed that splicing may be predicted from HMs [90, 91].

4.1.2 Data
Publicly available ChIP-seq data for 38 HMs (21 methylations and 17 acetylations) [33, 85] in human CD4+ T cells were downloaded together with exon microarray data [92] for the same cell type. This cell type was chosen since it had the highest number of HMs characterized at that time. The Ensembl [16] database (H. sapiens, 54_36p) was used to identify internal exons that were included in some annotated transcripts, but excluded in others.

Exons shorter than 50 bp were omitted from the study, since they are too short to contain a nucleosome. Moreover, an exon was omitted if any of the flanking intronic regions was shorter than 360 bp, to ensure that the studied intronic regions were sufficiently far away from other exons not to be influenced by them. For each such exon we quantified the inclusion level as a ratio between the expression of this particular exon and other exons annotated to the same gene. In total, 11,587 exons were assigned to the ‘spliced out’ class and 12,692 were assigned to the ‘included’ class. The HM signals preceding, on, or succeeding the exons were discretized into a binary (present/absent) representation (Figure 10). Using this representation, we constructed a decision system with 24,279 exons as the objects, 114 attributes describing the HM status of the exon and its flanking introns (38 HMs × 3 positions), and with the exon inclusion level as the decision class.

Figure 10. Schematic view of HM signal preceding, on, and succeeding an exon. The signal for each HM was discretized into present or absent in each region using a Poisson model. Reprinted from Enroth et al. [69].
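A hedged sketch of present/absent calling with a Poisson background model, in the spirit of Figure 10; the background rate and significance cutoff below are illustrative assumptions, not the parameters used in Paper I.

```python
from scipy.stats import poisson

def is_present(read_count, region_length, background_rate, alpha=0.05):
    """Call a HM signal 'present' in a region if the observed read count
    is unlikely under a Poisson background model.

    `background_rate` is the assumed expected number of reads per bp under
    the background; the cutoff `alpha` is an illustrative choice.
    """
    expected = background_rate * region_length
    p = poisson.sf(read_count - 1, expected)  # P(X >= read_count)
    return p < alpha

print(is_present(read_count=40, region_length=200, background_rate=0.05))  # True
```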

4.1.3 Methods
Firstly, the MCFS [67] algorithm was applied to rank the HMs by their classification ability. Using the built-in randomization test in dmLab [68], 94 out of the 114 attributes were found significant (p<0.05). In order to avoid overfitting of the model and to reduce the computational cost, we selected and used the 20 highest ranked attributes. Next we used ROSETTA [74] to construct a rule-based classifier based on these attributes. However, to avoid any biases related to having different numbers of training examples for each decision class, we first randomly subsampled the larger ‘included’ class

down to the size of the ‘spliced out’ class. This subsampling strategy was repeated ten times to define ten different balanced datasets. A classifier was trained on each dataset and ten-fold cross validation was used to estimate the classification accuracy of each classifier. To further avoid overfitting, we filtered the rules based on rule support and accuracy to drop any low-quality rules.

4.1.4 Results
Notably, the features that were highest ranked by MCFS were not the HMs on the exons, but those residing on the flanking introns. Among the 20 attributes used in the classification model, only two were centered on the exon. Several HMs previously related to exon expression were among the highest scoring attributes, confirming that important HMs had been selected. The final rule-based model consisted of 165 rules. Together the support sets of these rules covered 27% of the exons included in the study, and their classification accuracy, as estimated by the cross validation, was 72%. This is considerably higher than what would be expected by chance, and our results therefore suggested that HMs and combinations thereof are good predictors of certain alternative splicing events. Furthermore, we showed that HMs located in the flanking introns were more informative with regard to alternative splicing than those located over the exons.

4.2 Paper II
Rule-Based Models of the Interplay between Genetic and Environmental Factors in Childhood Allergy

4.2.1 Background
The development of allergic diseases depends on both genetic and environmental factors, and much effort has been put into identifying such factors and mapping them to allergic phenotypes [93]. Genome-wide association studies (GWAS) have been used to identify genetic risk variants [94], but the SNPs identified so far only explain a fraction of the disease risk. This so-called missing heritability [95] may be due to rare variants that are not detected in GWAS, interaction effects between different SNPs that are not accounted for in the present models [96], or incorrect estimates of the heritability; resolving it is currently of major interest. Two large projects that have collected data on asthma and allergy are the European PARSIFAL [97] cross-sectional study and the Swedish BAMSE [98, 99] birth cohort, each covering hundreds of genetic and environmental factors for thousands of children. In this paper we aimed to use machine

learning methods on data from these projects in order to study the interplay of genetic and environmental factors and to identify interactions. The methodology applied here was similar to that of Paper I. Machine learning methods have previously been used successfully to model asthma and allergy [100-102], although with other specific aims.

4.2.2 Data
All data originated from the PARSIFAL and BAMSE studies and were organized into decision systems in order to apply our methods. Separate decision systems were constructed for 11 different decision attributes (asthma, current asthma, allergic asthma, non-allergic asthma, wheeze, eczema, allergic eczema, non-allergic eczema, rhinoconjunctivitis, and atopic sensitization >0.35 kU/L or >3.5 kU/L, respectively). Each decision class was represented as a binary (yes/no) variable. The set of attributes included SNPs associated with 29 genes which had been sequenced previously, and environmental exposures obtained from previously performed questionnaires. The questionnaire data was already discrete, but the SNPs were transformed into a pseudo-binary representation [103] counting the number of copies of the major allele (0/1/2).

4.2.3 Method
Having access to two independent datasets, a classifier could be trained on one dataset and evaluated on the other. To use this strategy, we considered 110 SNPs that were genotyped in both datasets and then used MCFS to select significant features among them for each dataset separately. Then a rule-based model was constructed for each dataset. The rule-based classifiers were evaluated using 10-fold cross validation and, additionally, the most significant rules were applied to the other dataset (Figure 11A). In order to study the environmental factors that were not present in both datasets, we also trained a classifier based only on the PARSIFAL data, which was the most comprehensive dataset (Figure 11B). A partial validation could be performed on some of these rules, if they contained only features that were measured in BAMSE as well.


Figure 11. Methodology illustrating (A) the training of classifiers describing the gene-gene interplay and the cross-material validation applied to these rules, and (B) the classifier with both genetic and environmental features based solely on the PARSIFAL data. Reprinted from Bornelöv et al. [80].

The data was highly skewed towards children who were unaffected by the allergy phenotypes. Therefore, we randomly undersampled the larger set to select a group of unaffected children as large as the affected group. The undersampling was repeated 10 or 100 times and a rule-based classifier was constructed for each such balanced subset. Finally, the rules from all 10 or 100 classifiers were merged into one set of rules. A p-value (hypergeometric distribution) was computed for each rule, and for the most significant rules we computed their odds ratios.

The rules in each model were visualized as circular rule networks using a methodology we developed previously [79] (later implemented in the Ciruvis tool [72], described in Paper III).

4.2.4 Results
The quality of the classification was quite low, in particular for the gene-gene models, illustrating that it may be difficult to build highly predictive models solely from SNP data. This difficulty may be due to small effects of the genetic variants or poorly defined phenotypes [100, 104]. However, several rules in the classifiers were highly significant, and predictive combinations of attributes could be identified among these and from the rule networks. A majority of these rules were validated using the independent dataset, illustrating that information may be extracted also from a low-quality classifier. For instance, during the validation phase for the gene-gene models, 79% of the significant rules had an effect in the same direction in both datasets. A combination of genetic variants for the RORA (rs17270362=Ag; upper/lower case = major/minor allele) and ORMDL3 (rs2305480=GG) genes increased the risk of current asthma in both study populations, with an odds ratio of 2.1 (confidence interval 1.2-3.6) for BAMSE and 3.2 (2.0-5.0) for PARSIFAL. Another combination of ORMDL3 (rs7216389=tt), RORA (rs17270362=Ag), and COL29A1 (rs11917356=AA) variants was associated with increased risk of wheeze, with an odds ratio of 2.8 (1.7-4.4) in BAMSE and 1.8 (1.0-3.1) in PARSIFAL. Interactions between these genes had not been reported previously, but we speculated that there may be a functional cross-talk between ORMDL3 and RORA, which both are suggested to participate in cellular stress response and lipid metabolism [105, 106].
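For reference, an odds ratio and its 95% confidence interval can be computed from a 2×2 table with Woolf's logit method; the counts below are hypothetical and not the PARSIFAL or BAMSE numbers.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI (Woolf's method) for a 2x2 table:
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, (lower, upper)

# Hypothetical counts for children matching a rule vs. not matching it.
print(odds_ratio_ci(a=30, b=70, c=50, d=350))  # OR = 3.0, CI roughly (1.8, 5.1)
```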

4.3 Paper III
Ciruvis: a web-based tool for rule networks and interaction detection using rule-based classifiers

4.3.1 Background
Machine learning methods have become an integral part of bioinformatics (see e.g. [107]). Since many biological systems may rely on complex interactions, the need to develop machine learning methods to apply to such data has been stressed [108]. Indeed, this paper was motivated by the previous work in Paper I and Paper II, in which rule-based classifiers were trained to predict alternative splicing patterns of exons or allergy phenotypes in children. In addition, rule-based classifiers have previously been applied to a

wide range of biological applications [109-112] and will likely continue to be used as the focus shifts towards combinations and interactions. In order to efficiently visualize and interpret the classifiers, we developed an algorithm [79] which translates a set of rules into a table of connection scores. This table may then be visualized as a rule network using the Circos software for visualization of data in a circular layout [113]. In this paper we presented Ciruvis, a web-based tool able to produce publication-quality networks from a set of classification rules. Furthermore, using several different datasets we extensively explored how the rule networks could be used to identify interactions.

4.3.2 Data
The study of the Ciruvis tool was divided into three parts. In the first part we applied Ciruvis to simulated data which was defined to contain five pairs of attributes, each pair with an increasing level of interaction effect, and five attributes with different levels of correlation to the decision attribute. The aim was to have a controlled set of experiments in which we could determine how Ciruvis could be used to detect known interactions. The second part used the California Housing data [114], obtained from Torgo's Regression DataSets [115]. This dataset described census data from California collected in 1990. The decision attribute was the median house value of a block group, and there were 8 attributes describing, among others, the age of the houses, the number of rooms, the median income, and the geographic location. Since this data had been carefully analyzed before with respect to interactions [116], we could use these results to compare Ciruvis to another method. In the last part of our study we performed feature selection and classification on two well-known biological datasets describing different types of leukemia [117] and lymphoma [118]. In these two setups the objects were the patients, the attributes were the expression levels of thousands of genes measured by microarrays, and the decision attribute was the disease phenotype. Since no interactions had previously been detected in these datasets, we could present novel predictions of gene-gene interactions.

4.3.3 Method
The aim of the study was to test how the Ciruvis tool could be used to visualize co-occurring features in rule-based classifiers and to detect interactions. Consequently, the first step of each part of the study was to train a rule-based classifier. ROSETTA [74] was used for this purpose. The exact procedure varied depending on the size and the type of the data. Generally, we performed a 10-fold cross validation to assess the quality of the classifier. The two biological datasets describing leukemia and lymphoma had too many

features to apply ROSETTA directly, and feature selection was therefore performed using MCFS [67] to select the 30 most significant features. Any duplicate rules were removed and then the Ciruvis methodology was applied to the rules. A rule condition was defined as an attribute-value pair, e.g. “MIF=high”, and for each combination of two rule conditions a connection score was computed according to the formula

$\mathrm{connection}(x, y) = \sum_{r \in R(x,y)} \mathrm{support}(r) \cdot \mathrm{accuracy}(r)$

where R(x,y) is defined as the set of all rules that contain both condition x and condition y. The pairwise scores were then translated into a circular network in which the rule conditions were placed on the circle and each pair of conditions with a non-zero score was connected inside the circle. The width and color of the connecting ribbon reflected the size of the score. In the first part of the study we compared the relative score of the known interactions to the score for connected non-interacting features. In the second part we compared the strongest connections in the rule networks with the previously identified interactions, and computed the interaction effect manually for the highest scored connections assuming independent and multiplicative effects. In the third part of the study we compared our classification performance to what was previously accomplished using the same data, and verified whether the top connections in the rule network pointed to interactions.
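A minimal sketch of the connection score defined above, summing support·accuracy over all rules that contain both conditions; the example rules and their support and accuracy values are illustrative, not those from Paper III.

```python
from collections import defaultdict
from itertools import combinations

# Each rule: a set of conditions (attribute=value strings) plus its
# support and accuracy, e.g. as reported by a rule-based classifier.
rules = [
    {"conditions": {"MIF=low", "GPX1=high"}, "support": 11, "accuracy": 0.73},
    {"conditions": {"MIF=low", "SP100=low"}, "support": 8, "accuracy": 0.60},
    {"conditions": {"GPX1=high", "SP100=low", "MIF=low"}, "support": 5, "accuracy": 0.80},
]

scores = defaultdict(float)
for rule in rules:
    # Every pair of conditions co-occurring in a rule contributes
    # support * accuracy to that pair's connection score.
    for x, y in combinations(sorted(rule["conditions"]), 2):
        scores[(x, y)] += rule["support"] * rule["accuracy"]

for (x, y), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{x} -- {y}: {score:.2f}")
```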

4.3.4 Results
This paper demonstrated the use of a fast and simple methodology to visualize rule-based classifiers and provided a web-based implementation of the tool (Figure 12). Applying Ciruvis to rules trained from simulated data with different trade-offs between interacting attributes and attributes correlated to the decision, we illustrated how these properties affected the visualization of the classifier. When the correlation between an attribute and the decision was larger than the association between the interacting attributes and the decision, the interactions did not show up clearly in the rule networks. However, when this was not the case, the connections between the attribute pairs with the highest interaction were indeed the highest scored connections in the network. When identification of interactions is the main purpose of the visualization, we suggested and illustrated a strategy to first identify attributes strongly correlated to the decision and remove these prior to the construction of the classifier.


Figure 12. Screenshot of the Ciruvis website. This figure illustrates the co-occurrences of gene expression conditions predictive of acute lymphoblastic leukemia.

In comparison to another method for interaction detection, using the same data, we identified several significant interactions among the rule conditions of the highest scoring connections in the rule network. Additionally, we found a third interaction, between a high median age of the houses and a high number of rooms, which was not previously reported by Sorokina et al. [116]. Our classifier performed very well on the two biological datasets, in agreement with what has been reported previously (e.g. [67, 117]). Since the classification problem was rather easy, the rule networks were based on a large set of rules of mostly equal quality. In order to identify interaction effects, we compared the observed accuracy of a rule to the expected accuracy of the combination of conditions assuming independent multiplicative effects. For instance, the highest scored connection for the lymphoma data and the chronic lymphocytic leukemia (CLL) phenotype was the rule

IF MIF=low AND GPX1=high THEN CLL

This rule identified CLL patients with 73% accuracy, as compared to the expectation (51%) obtained by combining the individual effects of the rule conditions (MIF=low and GPX1=high, separately). This interaction and others were suggested by our methodology, although further studies are necessary to validate the findings.

4.4 Paper IV
Different distribution of histone modifications in genes with unidirectional and bidirectional transcription and a role of CTCF and cohesin in directing transcription

4.4.1 Background
Several HMs are mainly present at promoters and are associated with the promoter activity [33, 63]. An assumption in Paper I was that HMs are involved in the regulation of gene splicing, consistent with the histone code hypothesis [37]. In particular, we expected that combinations of HMs would be able to explain the regulation of alternative splicing. However, such mechanisms have not been reliably discovered [119] and there are now several indications that TFs binding to DNA are the main regulators and the main cause of HMs. For instance, regions with inter-individual differences in HMs have been enriched in SNPs disrupting TF motifs [120] and changes in TF motifs are associated with HM differences [121], suggesting that the sequence participates in determining the HM state. Since the histone acetyltransferases and histone methyltransferases are not sequence specific, there must be other mechanisms connecting them to the sequence. However, it may be argued that if HMs at promoters were a consequence of transcription, then they would not be present where there is no transcription. Most HMs have a signal peak both upstream and downstream of the TSS. We hypothesized that this pattern is caused by transcription in both directions from the TSS, which is common for most genes [50, 51]. Antisense RNA produced by this phenomenon is usually short and quickly degraded [53], but the occurrence of antisense transcription could explain the observed upstream HM peaks. In this paper we analyzed HM and TF signals around gene promoters, taking bidirectional transcription into account.

4.4.2 Data
In order to divide the human protein-coding genes into bidirectional or unidirectional, both CAGE data and gene annotations from Ensembl [16] were used. CAGE data for six cell types was downloaded from the ENCODE repository [7] at UCSC. The data was obtained both in the form of aligned reads and clusters (NCBI36, hg18). Cell lines included in the study were GM12878, H1hESC, HepG2, HUVEC, K562, and NHEK. The CAGE RNA had been extracted using several different isolation conditions and we selected three of them for our study (non-polyadenylated RNA from the cytosol, non-polyadenylated RNA from whole cells, and total RNA from the nucleolus). Since there were millions of clusters in each dataset, we selected the

29,857 top-scoring clusters, equal to the number of promoters identified in FANTOM 4 [122]. ChIP-seq data produced by the Broad, HudsonAlpha, UW, and Yale labs for HMs, TFs, and RNA Pol II, as well as RNA-seq produced by the Caltech lab, were downloaded from the ENCODE repository [7] at UCSC. The data was processed using the SICTIN tool [123] with ChIP-seq reads extended to 147 bp, corresponding to the length of a nucleosome.

4.4.3 Method
Since we studied several cell lines and CAGE RNA isolation conditions in parallel, the steps described here were performed for each dataset. Firstly, we determined which of the 19,950 protein-coding genes annotated in Ensembl [16] were actively transcribed in the current cell line, by requiring a CAGE-tag cluster on the sense strand within 10 bp of the TSS for the active genes. Genes that did not fulfil this criterion were considered inactive and were excluded for that cell line. The active genes were then divided into bidirectional or unidirectional genes. Two different approaches were used for this, and both of them searched for evidence of bidirectional transcription within a window of 1 kb on the opposite strand of the TSS. If no evidence of bidirectional transcription was found within this region, a gene was considered unidirectional. The first approach used the Ensembl database, looking for another gene annotated antisense to the first one. The second approach used the CAGE data to look for an antisense CAGE cluster. The overlap between the two methods was high and only the genes that were annotated equally by both approaches were considered for the further analysis. The next step was to compare HM and TF signals between these two groups of genes. The results were shown as footprints illustrating the average signal per group in a 2 kb region, with a 95% confidence interval of the average computed by bootstrapping. Moreover, in two other analyses we either additionally divided the genes into expression bins, to verify that the observed differences were not caused by differences in gene expression, or alternatively divided the genes into bins according to their amount of antisense transcription instead of using the bidirectional and unidirectional groups.
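A simplified sketch of the CAGE-based directionality call described above, assuming plain tuples for TSSs and CAGE clusters and a symmetric 1 kb window around the TSS (the exact window placement in Paper IV may differ).

```python
def classify_directionality(tss_list, cage_clusters, window=1000):
    """Label each gene as 'bidirectional' or 'unidirectional' depending on
    whether an antisense CAGE cluster lies within `window` bp of its TSS.

    tss_list: tuples (gene_id, chrom, tss_position, strand)
    cage_clusters: tuples (chrom, position, strand)
    """
    labels = {}
    for gene_id, chrom, tss, strand in tss_list:
        antisense = "-" if strand == "+" else "+"
        evidence = any(
            c_chrom == chrom and c_strand == antisense and abs(c_pos - tss) <= window
            for c_chrom, c_pos, c_strand in cage_clusters
        )
        labels[gene_id] = "bidirectional" if evidence else "unidirectional"
    return labels

print(classify_directionality(
    [("GENE1", "chr1", 10000, "+")],
    [("chr1", 9500, "-"), ("chr1", 10020, "+")],
))  # -> {'GENE1': 'bidirectional'}
```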

4.4.4 Results
Between 2,839 and 6,041 active genes were identified per cell line, and a high overlap was observed between the cell lines. The active genes were annotated using the two different approaches, showing on average 86.5% agreement. We identified significant differences in the distribution of several HMs and TFs between the bidirectional and unidirectional genes (Figure 13). For instance, the promoter marks H3K4me3 and H3K9ac each had a peak of approximately equal height both upstream and downstream of the TSS in the bidirectional group, whereas the unidirectional genes had a significantly smaller peak upstream of the TSS. This observation held independently of whether we used the first annotation approach, the second, or the intersection of them. Furthermore, the TFs TAF1 and NELFe, which both are tightly associated with the RNA Pol II preinitiation complex, had a significantly higher upstream signal in the bidirectional group.

Figure 13. Average ChIP-seq signal for five HMs and four TFs in bidirectional (red) and unidirectional (blue) genes. The confidence interval is shown as a shaded area.

Interestingly, we demonstrated that the ChIP-seq CTCF peak commonly observed just upstream of the TSS was specific for the unidirectional genes. This observation was also supported by an enrichment of the CTCF motif in this region, seen only in the unidirectional group. CTCF often acts together with cohesin [124, 125]. Since RAD21 is part of the cohesin complex [126], it may be used as a marker for cohesin. In the K562 cell line, where we had ChIP-seq data for RAD21, we observed a pattern similar to that of CTCF, but with the peak summit slightly shifted upstream, as expected from previous reports of RAD21 being positioned towards the 5’ end of the CTCF motif [127].

5 Discussion and conclusions

During the last few years there has been a rapid development of experimental techniques able to generate massive amounts of high-quality biological data. Instead of mapping thousands of genes by microarrays, next-generation sequencing approaches such as ChIP-seq now allow us to map a HM or a TF across the entire genome in one single experiment. International collaborative efforts such as the ENCODE [7], Ensembl [16], and FANTOM [8] projects have stated ambitious goals to produce large amounts of sequencing and annotation data to be made publicly available to the scientific community. An increasing number of papers do not rely on their own experimental data, but instead solely on the analysis of previously published data. Moreover, the development of experimental methods for data generation is not slowing down, and bioinformatics methods to analyze such data will likely be of high importance also in the future. One approach used to analyze this multitude of data is to apply machine learning methods. Machine learning may either be applied to test a specific hypothesis, or to do a hypothesis-free exploration of the data. In both Paper I and Paper II we performed classification in an exploratory fashion, hoping to identify new mechanisms for gene regulation and for the development of asthma and allergy in children. We applied a pipeline combining MCFS with rule-based classification using ROSETTA. We specifically aimed to extend the repertoire of machine learning methods used to discover complex patterns of associations in the data. Rule-based classifiers are promising in this context because they are simple to interpret without being limited to linear relations. With the increased availability of data describing chromatin features such as HMs, there have been numerous hypotheses on their function. Features as diverse as gene expression [7, 86-89], cancer recurrence [128], enhancers [129] and RNA Pol II states [130] have been modeled using HMs. In Paper I we showed that alternative splicing events for cassette exons could be predicted from HMs residing on the preceding intron, on the exon, or on the succeeding intron. Although single HMs had been associated with alternative splicing before [30, 48], this was the first paper to show the relationship between combinations of HMs and alternative splicing. Notably, HMs residing on the flanking introns were the most predictive features of our models, suggesting that signals outside of the exons may also be of high importance. In Paper IV we aimed at obtaining a deeper understanding of the biological processes behind gene transcription. One of the most difficult tasks when working

with data analysis is to distinguish the cause from the consequence of an observed phenomenon. In Paper IV we applied a novel approach, incorporating the direction of transcription into the analysis of HM and TF signals. This work led to new hypotheses on the role of CTCF and cohesin as potentially involved in determining the direction of transcriptional initiation. Furthermore, it confirmed the relation between the presence of HMs and active transcription. In Paper II we applied the same methodology combining MCFS and rule-based classification to classify allergy and asthma phenotypes observed in children, based on genetic and environmental factors. Since most SNPs governing complex traits have small effects [95], we fine-tuned the methods to work on data of low predictability. We trained a model on one dataset and validated it on another, independent dataset in order to select rules with consistently high quality. With the increased popularity of GWASs, machine learning techniques may take a key role in the mapping of SNPs to complex human diseases. The application and development of methods to be applied on such data is therefore of high priority. In this paper we confirmed many previously identified SNPs and exposures of importance for allergy and asthma, and were able to suggest specific gene-gene and gene-environment interactions. If further studies can validate these interactions, this may lead to an increased understanding of the allergic diseases. A main focus of both Paper I and Paper II was the interpretation and visualization of classifiers. The classification performance itself may be of high importance for certain applications, for instance in order to prove that there is an association between a set of attributes and a decision, but in bioinformatics it is commonly more important to understand the process behind the observed associations. We have developed an algorithm [79], which we implemented in the Ciruvis tool [72], for visualization of rule-based classifiers. In Paper III we described this tool and illustrated how it may be used to detect interactions in the data. Taken together, in these four papers we have made contributions to how machine learning methods such as MCFS and rule-based classification may be applied to biological data. We have developed new methodologies aimed at solving current limitations in the interpretability of classification models, and have relied heavily on publicly available genomics data in order to make novel discoveries. The rate at which new genomics data is being produced is very encouraging, and there are likely many questions regarding genome organization and function whose future solution will involve similar bioinformatics methods.

6 Sammanfattning på svenska

Utvecklingen av nya effektivare metoder för sekvensering har gjort det möj- ligt att studera biologiska signaler över hela genomet i ett enda experiment. På detta sätt har man kunnat kartlägga bindningen av olika transkriptionsfak- torer och förekomsten av olika histonmodifikationer. Transkriptionsfaktorer är proteiner som binder till DNA för att antingen aktivera eller tysta gener. Histonmodifikationer påverkar hur åtkomligt DNA:t är i cellen så att gener som ligger i mer åtkomliga delar av DNA har större chans att vara aktiva. Genom att bättre förstå mekanismerna för hur dessa signaler är kopplade till geners aktivitet kan vi öka vår förståelse för hur gener regleras. Internationella projekt (t.ex. ENCODE och FANTOM) har genererat enorma sådana datamängder som är fritt tillgängliga att analysera. Dessutom har utvecklingen av genetiska associationsstudier över hela genomet lett till upptäckt av flera nya genetiska varianter kopplade till en ökad risk eller ett skydd för en viss sjukdom. Trots dessa nya genetiska riskvarianter har man inte lyckats förklara hela den ärftliga faktorn bakom komplexa sjukdomar såsom astma och allergi och en av anledningarna tros vara att man inte har tagit hänsyn till hur riskvarianterna interagerar. En utmaning för bioinforma- tiken är därför att hitta metoder som kan identifiera dessa interaktioner. I denna avhandling har vi använt och utvecklat metoder för multivariat dataanalys i syfte att lösa flera olika biologiska problem relaterade till dessa typer av data. Den första delen av arbetet syftade till att förutsäga ifall kasettexoner var inkluderade eller ej i ett transkript, baserat på förekomsten av histonmodifieringar vid föregående intron, över exonen, samt vid efterföl- jande intron. Vi använde en metodik som kombinerar Monte Carlo-baserat val av attribut (MCFS, ”Monte Carlo feature selection”) med regelbaserad klassifikation. För att träna regelmodellerna användes programmet RO- SETTA. Med hjälp av MCFS valdes de histonmodifikationer ut som var bäst på att förutsäga ifall exonen var inkluderad eller ej. Utifrån dessa utvalda modifikationer skapade vi en regelbaserad klassifikator. Med denna klassifi- kator kunde vi i 72% av fallen korrekt förutsäga exonens status. Även om vissa histonmodifikationer tidigare hade kopplats till alternativ splitsning av gener, så bekräftade vår modell detta samband och utgjorde dessutom den första tillämpningen av klassifikatorer på detta problem. I nästa delstudie så tillämpade vi en liknande metodik med MCFS och regler för att prediktera olika allergirelaterade fenotyper hos barn utifrån genetiska varianter och miljöfaktorer. Fokus lades på att undersöka hur gener

55 samt gener och miljö interagerar. Vi hade tillgång till två stora datamaterial härrörande från den europeiska PARSIFAL-studien samt den svenska stu- dien BAMSE. Varje studie inkluderade tusentals barn för vilka man hade mätt ett hundratal genetiska varianter samt miljöfaktorer. Eftersom vi hade tillgång till två barngrupper kunde vi skapa en modell utifrån ett grupp och utvärdera den med hjälp av den andra. Genom att jämföra resultaten mellan barngrupperna hittades flera potentiella interaktioner, t.ex. mellan generna RORA, ORMDL3 och COL29A1. I den tredje delstudien utvecklade vi en metod för visualisering och tolk- ning av regelbaserade klassifikatorer. Vi hade tidigare konstruerat en algo- ritm för att visualisera regelbaserade klassifikatorer som regelnätverk och i denna studie så utvecklade vi ett webb-baserat verktyg som implementerade denna algoritm. En tidigare version av detta verktyg hade använts i såväl den första som den andra delstudien. Här använde vi dessutom verktyget på ett flertal olika dataset för att undersöka hur väl det fungerade. Vi simulerade data med kända interaktioner mellan attributen och tränade regler för dessa. Reglerna visualiserades och vi kunde bekräfta att de interagerande attributen kunde identifieras med hjälp av regelnätverken. Vi använde även riktiga data för att bekräfta att vår algoritm kunde upptäcka samma interaktioner som tidigare rapporterats av andra. Vidare tittade vi på biologiska data där vi identifierade potentiella geninteraktioner med betydelse för leukemi och lymfom. Den sista delstudien återkopplade till den första. Återigen studerade vi histonmodifikationer och försökte förstå mekanismer bakom genaktivering. Många histonmodifikationer återfinns främst på histoner nära geners tran- skriptionsstartpunkt (TSS) och är associerade till genens aktivitetsnivå. Van- ligen observeras en topp i förekomsten strax innan TSS:en samt en efter. Förekomst av histonmodifikationer såsom H3K4me3, H3K9ac och H3K27ac har tidigare kopplats till aktiva gener. I denna studie ville vi undersöka ifall denna topp innan TSS:en kunde förklaras av att transkriptionen också skedde i motsatt riktning från TSS:en. Detta undersöktes med hjälp av fritt tillgäng- liga data där vi jämförde förekomsten av histonmodifikationer hos gener med transkription endast i ena riktningen (unidirektionella gener) respektive i båda riktningarna (bidirektionella gener). Motsvarande jämförelse gjordes också med avseende på transkriptionsfaktorer. Vi observerade signifikant färre histonmodifieringar innan TSS:en hos de unidirektionella generna jäm- fört med de bidirektionella. Dessutom upptäckte vi att transkriptionsfaktorn CTCF har en topp precis innan TSS:en endast hos unidirektionella gener. Utifrån detta har vi föreslagit att CTCF har en hittills okänd roll i regleringen av geners transkriptionsriktning. Sammanfattningsvis så har vi utvecklat nya metoder för regelbaserad klassifikation. Vi har tillämpat dessa metoder på olika molekylärbiologiska problem. Genom dessa studier har vi bidragit till att öka kunskapen om al- lergirelaterade sjukdomar. Med hjälp av genomövergripande data har vi

dessutom bekräftat tidigare observationer och gjort nya upptäckter med potentiell betydelse för vår förståelse av genaktivering.

7 Acknowledgements

There are many people who have influenced my work or otherwise been important during my PhD.

Many thanks are due to

My main supervisor Prof. Jan Komorowski, who has always been very encouraging. I am glad that you believed in me and let me join your group. It has been a pleasure to work with ROSETTA and I am grateful that I could combine it with work on chromatin and genomics.

Prof. Claes Wadelius, my co-supervisor, who always shared his ideas with us, taught us about chromatin, and explained the experimental procedures behind the data.

Marcin Kierczak who helped me to get started with MCFS and ROSETTA. Stefan Enroth for first introducing me to chromatin during my master’s project and for our joint work on the first paper. Behrooz Torabi for keeping me company during lunchtime at the former LCB. Marcin Kruczyk who joined the group at almost the same time as me. It was very valuable to have another person in the same situation to share the confusion!

The current group of Jan Komorowski with Nicholas Baltzer, Michal Dabrowski, Klev Diamanti, Zeeshan Khaliq, Behrooz Torabi, and Husen M. Umer for all lunchtime discussions as well as scientific support and cooperation. Extra thanks to Klev, Michal, and Nicholas for proof-reading of this thesis!

Husen, Zeeshan och Behrooz, lycka till med svenskan(/zeeshanskan)! :-)

Simon Marillet, my co-author on the Ciruvis project.

Annika Sääf with whom I worked on the asthma and allergy project and who together with Cilla Söderhäll and Göran Pershagen taught me a lot about writing papers and how science is communicated in medicine.

I would also like to thank all the former and present members of Claes Wadelius’ group. In particular, I enjoyed working with Madhu Bysani and Ola Wallerman on the steatohepatitis project, but I also learnt a lot during the Tuesday meetings with Madhu Bysani, Marco Cavalli, Helena Nord, and Gang Pan.

There are also many friends who have contributed to life aside from work. I have had a lot of fun together with “team Uppsala” both in Sweden and during our travels abroad. Thank you Nike, Pelle, and Scale!

I also want to acknowledge Hanna Schierbeck, who shares my interest in cavies, as well as Robin Brostedt and Göran Söderberg for nice dinners and gaming, respectively.

My mother Solveig for always supporting me and to Fredrik for support and sharing everything with me.

Susanne Bornelöv Uppsala, August 2014

8 References

1. Lander ES, Consortium IHGS, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, et al: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921. 2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al: The sequence of the human genome. Science 2001, 291:1304-+. 3. Collins FS, Lander ES, Rogers J, Waterston RH, Conso IHGS: Finishing the euchromatic sequence of the human genome. Nature 2004, 431:931-945. 4. Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 2009, 10:669-680. 5. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 2009, 10:57-63. 6. Metzker ML: Applications of Next-Generation Sequencing Sequencing Technologies - the Next Generation. Nature Reviews Genetics 2010, 11:31-46. 7. Encode Project Consortium, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M: An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489:57-74. 8. Fantom Consortium: A promoter-level mammalian expression atlas. Nature 2014, 507:462-470. 9. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL, et al: RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 2007, 316:1484-1488. 10. Hangauer MJ, Vaughn IW, McManus MT: Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs. PLoS Genet 2013, 9. 11. Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, et al: Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 2009, 458:223-227. 12. Goodrich JA, Kugel JF: Non-coding-RNA regulators of RNA polymerase II transcription. Nat Rev Mol Cell Bio 2006, 7:612- 616. 13. Ohno S: So Much Junk DNA in Our Genome. Brookhaven Symp Biol 1972:366-&.

60 14. Luger K, Mader AW, Richmond RK, Sargent DF, Richmond TJ: Crystal structure of the nucleosome core particle at 2.8 angstrom resolution. Nature 1997, 389:251-260. 15. Robinson PJJ, Fairall L, Huynh VAT, Rhodes D: EM measurements define the dimensions of the "30-nm" chromatin fiber: Evidence for a compact, interdigitated structure. Proc Natl Acad Sci U S A 2006, 103:6506-6511. 16. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, et al: Ensembl 2009. Nucleic Acids Res 2009, 37:D690-697. 17. Pomeranz Krummel DA, Oubridge C, Leung AK, Li J, Nagai K: Crystal structure of human spliceosomal U1 snRNP at 5.5 A resolution. Nature 2009, 458:475-480. 18. Crick F: Central dogma of molecular biology. Nature 1970, 227:561-563. 19. Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G: Enhancers: five essential questions. Nat Rev Genet 2013, 14:288- 295. 20. Krivega I, Dean A: and promoter interactions-long distance calls. Curr Opin Genet Dev 2012, 22:79-85. 21. Burgess-Beusse B, Farrell C, Gaszner M, Litt M, Mutskov V, Recillas-Targa F, Simpson M, West A, Felsenfeld G: The insulation of genes from external enhancers and silencing chromatin. Proc Natl Acad Sci U S A 2002, 99 Suppl 4:16433- 16437. 22. Latchman DS: Transcription factors: An overview. Int J Biochem Cell Biol 1997, 29:1305-1312. 23. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA: Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 2004, 14:283-291. 24. Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen CY, Chou A, Ienasescu H, et al: JASPAR 2014: an extensively expanded and updated open- access database of transcription factor binding profiles. Nucleic Acids Res 2014, 42:D142-147. 25. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei GH, et al: DNA-Binding Specificities of Human Transcription Factors. Cell 2013, 152:327-339. 26. Lee TI, Young RA: Transcriptional Regulation and Its Misregulation in Disease. Cell 2013, 152:1237-1251. 27. Takahashi K, Yamanaka S: Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 2006, 126:663-676. 28. Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K, Yamanaka S: Induction of pluripotent stem cells from adult human fibroblasts by defined factors. Cell 2007, 131:861-872.

29. Yu JY, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S, Nie J, Jonsdottir GA, Ruotti V, Stewart R, et al: Induced pluripotent stem cell lines derived from human somatic cells. Science 2007, 318:1917-1920.
30. Hon G, Wang W, Ren B: Discovery and annotation of functional chromatin signatures in the human genome. PLoS Comput Biol 2009, 5:e1000566.
31. Cosgrove MS: Histone proteomics and the epigenetic regulation of nucleosome mobility. Expert Rev Proteomics 2007, 4:465-478.
32. Felsenfeld G, Groudine M: Controlling the double helix. Nature 2003, 421:448-453.
33. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129:823-837.
34. Santos-Rosa H, Schneider R, Bannister AJ, Sherriff J, Bernstein BE, Emre NCT, Schreiber SL, Mellor J, Kouzarides T: Active genes are tri-methylated at K4 of histone H3. Nature 2002, 419:407-411.
35. Bannister AJ, Kouzarides T: Regulation of chromatin by histone modifications. Cell Res 2011, 21:381-395.
36. Schwammle V, Aspalter CM, Sidoli S, Jensen ON: Large-scale analysis of co-existing post-translational modifications on histone tails reveals global fine-structure of crosstalk. Mol Cell Proteomics 2014.
37. Jenuwein T, Allis CD: Translating the histone code. Science 2001, 293:1074-1080.
38. Vermeulen M, Mulder KW, Denissov S, Pijnappel WW, van Schaik FM, Varier RA, Baltissen MP, Stunnenberg HG, Mann M, Timmers HT: Selective anchoring of TFIID to nucleosomes by trimethylation of histone H3 lysine 4. Cell 2007, 131:58-69.
39. Seila AC, Core LJ, Lis JT, Sharp PA: Divergent transcription: a new feature of active promoters. Cell Cycle 2009, 8:2557-2564.
40. MacAlpine DM, Almouzni G: Chromatin and DNA replication. Cold Spring Harb Perspect Biol 2013, 5:a010207.
41. Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J: A genomic code for nucleosome positioning. Nature 2006, 442:772-778.
42. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 2008, 40:1413-1415.
43. Sammeth M, Foissac S, Guigo R: A general definition and nomenclature for alternative splicing events. PLoS Comput Biol 2008, 4:e1000147.
44. Kornblihtt AR, De la Mata M, Fededa JP, Munoz MJ, Nogues G: Multiple links between transcription and splicing. RNA 2004, 10:1489-1498.
45. Moore MJ, Proudfoot NJ: Pre-mRNA Processing Reaches Back to Transcription and Ahead to Translation. Cell 2009, 136:688-700.

46. Kornblihtt AR, Schor IE, Allo M, Dujardin G, Petrillo E, Munoz MJ: Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat Rev Mol Cell Bio 2013, 14:153-165.
47. Tilgner H, Knowles DG, Johnson R, Davis CA, Chakrabortty S, Djebali S, Curado J, Snyder M, Gingeras TR, Guigo R: Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res 2012, 22:1616-1625.
48. Luco RF, Pan Q, Tominaga K, Blencowe BJ, Pereira-Smith OM, Misteli T: Regulation of alternative splicing by histone modifications. Science 2010, 327:996-1000.
49. Allo M, Schor IE, Munoz MJ, de la Mata M, Agirre E, Valcarcel J, Eyras E, Kornblihtt AR: Chromatin and alternative splicing. Cold Spring Harb Symp Quant Biol 2010, 75:103-111.
50. Seila AC, Calabrese JM, Levine SS, Yeo GW, Rahl PB, Flynn RA, Young RA, Sharp PA: Divergent transcription from active promoters. Science 2008, 322:1849-1851.
51. Core LJ, Waterfall JJ, Lis JT: Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 2008, 322:1845-1848.
52. Preker P, Nielsen J, Kammler S, Lykke-Andersen S, Christensen MS, Mapendano CK, Schierup MH, Jensen TH: RNA exosome depletion reveals transcription upstream of active human promoters. Science 2008, 322:1851-1854.
53. Wu X, Sharp PA: Divergent transcription: a driving force for new gene origination? Cell 2013, 155:990-996.
54. Wei W, Pelechano V, Jarvelin AI, Steinmetz LM: Functional consequences of bidirectional promoters. Trends Genet 2011, 27:267-276.
55. Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP, Myers RM: An abundance of bidirectional promoters in the human genome. Genome Res 2004, 14:62-66.
56. Adachi N, Lieber MR: Bidirectional gene organization: a common architectural feature of the human genome. Cell 2002, 109:807-809.
57. Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 2009, 27:66-75.
58. Mortazavi A, Williams BA, Mccue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5:621-628.
59. Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, Flicek P, Gabriel SB, et al: An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491:56-65.

60. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, et al: Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 2003, 100:15776-15781.
61. Kawaji H, Lizio M, Itoh M, Kanamori-Katayama M, Kaiho A, Nishiyori-Sueki H, Shin JW, Kojima-Ishiyama M, Kawano M, Murata M, et al: Comparison of CAGE and RNA-seq transcriptome profiling using clonally amplified and single-molecule next-generation sequencing. Genome Res 2014, 24:708-717.
62. About the Ensembl Project [http://www.ensembl.org/info/about/index.html]
63. Koch CM, Andrews RM, Flicek P, Dillon SC, Karaoz U, Clelland GK, Wilcox S, Beare DM, Fowler JC, Couttet P, et al: The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res 2007, 17:691-707.
64. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng ZP, Snyder M, Dermitzakis ET, Stamatoyannopoulos JA, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447:799-816.
65. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, et al: An atlas of active enhancers across human cell types and tissues. Nature 2014, 507:455-461.
66. Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23:2507-2517.
67. Draminski M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J: Monte Carlo feature selection for supervised classification. Bioinformatics 2008, 24:110-117.
68. The MCFS-ID package [http://www.ipipan.eu/staff/m.draminski/files/dmLab185.zip]
69. Enroth S, Bornelov S, Wadelius C, Komorowski J: Combinations of Histone Modifications Mark Exon Inclusion Levels. PLoS ONE 2012, 7.
70. James G, Witten D, Hastie T, Tibshirani R: An Introduction to Statistical Learning with Applications in R. Springer; 2013.
71. Pearson K: On lines and planes of closest fit to systems of points in space. Philos Mag 1901, 2:559-572.
72. Bornelov S, Marillet S, Komorowski J: Ciruvis: a web-based tool for rule networks and interaction detection using rule-based classifiers. BMC Bioinformatics 2014, 15:139.
73. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA Data Mining Software: An Update. SIGKDD Explorations 2009, 11.

74. Komorowski J, Øhrn A, Skowron A: The ROSETTA Rough Set Software System. In Handbook of Data Mining and Knowledge Discovery. Edited by Klösgen W, Żytkow J. New York: Oxford University Press; 2002.
75. Pawlak Z: Rough sets. International Journal of Computer and Information Sciences 1982, 11:341-356.
76. Pawlak Z: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers; 1992.
77. Skowron A, Rauszer C: The Discernibility Matrices and Functions in Information Systems. In Intelligent Decision Support. Volume 11: Theory and Decision Library. Edited by Słowiński R. Springer Netherlands; 1992: 331-362.
78. Komorowski J, Pawlak Z, Polkowski L, Skowron A: Rough Sets: A Tutorial. In Rough Fuzzy Hybridization – A New Trend in Decision Making. Edited by Pal SK, Skowron A. Springer; 1998: 3-98.
79. Bornelöv S, Enroth S, Komorowski J: Visualization of Rules in Rule-Based Classifiers. In Intelligent Decision Technologies. Volume 15: Smart Innovation, Systems and Technologies. Edited by Watada J, Watanabe T, Phillips-Wren G, Howlett RJ, Jain LC. Springer Berlin Heidelberg; 2012: 329-338.
80. Bornelov S, Saaf A, Melen E, Bergstrom A, Moghadam BT, Pulkkinen V, Acevedo N, Pietras CO, Ege M, Braun-Fahrlander C, et al: Rule-Based Models of the Interplay between Genetic and Environmental Factors in Childhood Allergy. PLoS ONE 2013, 8.
81. Freyhult E, Prusis P, Lapinsh M, Wikberg JE, Moulton V, Gustafsson MG: Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling. BMC Bioinformatics 2005, 6:50.
82. Hvidsten TR, Wilczynski B, Kryshtafovych A, Tiuryn J, Komorowski J, Fidelis K: Discovering regulatory binding-site modules using rule-based learning. Genome Res 2005, 15:856-866.
83. Efron B: Bootstrap Methods: Another Look at the Jackknife (1977 Rietz Lecture). Ann Stat 1979, 7:1-26.
84. Andersson R, Enroth S, Rada-Iglesias A, Wadelius C, Komorowski J: Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome Res 2009, 19:1732-1741.
85. Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, Zhao K: Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet 2008, 40:897-903.
86. Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M: Histone modification levels are predictive for gene expression. Proc Natl Acad Sci U S A 2010, 107:2926-2931.
87. Cheng C, Yan KK, Yip KY, Rozowsky J, Alexander R, Shou C, Gerstein M: A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets. Genome Biol 2011, 12.
88. Yu H, Zhu SS, Zhou B, Xue HL, Han JDJ: Inferring causal relationships among different histone modifications and gene expression. Genome Res 2008, 18:1314-1324.
89. Xu X, Hoang S, Mayo M, Bekiranov S: Application of machine learning methods to histone methylation ChIP-Seq data reveals H4R3me2 globally represses gene expression. BMC Bioinformatics 2010, 11:1-20.
90. Shindo Y, Nozaki T, Saito R, Tomita M: Computational analysis of associations between alternative splicing and histone modifications. FEBS Lett 2013, 587:516-521.
91. Zhu SJ, Wang GH, Liu B, Wang YD: Modeling Exon Expression Using Histone Modifications. PLoS ONE 2013, 8.
92. Oberdoerffer S, Moita LF, Neems D, Freitas RP, Hacohen N, Rao A: Regulation of CD45 alternative splicing by heterogeneous ribonucleoprotein, hnRNPLL. Science 2008, 321:686-691.
93. Ober C, Yao TC: The genetics of asthma and allergic disease: a 21st century perspective. Immunol Rev 2011, 242:10-30.
94. Moffatt MF, Kabesch M, Liang LM, Dixon AL, Strachan D, Heath S, Depner M, von Berg A, Bufe A, Rietschel E, et al: Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 2007, 448:470-473.
95. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature 2009, 461:747-753.
96. Ober C, Vercelli D: Gene-environment interactions in human disease: nuisance or opportunity? Trends Genet 2011, 27:107-115.
97. Alfven T, Braun-Fahrlander C, Brunekreef B, von Mutius E, Riedler J, Scheynius A, van Hage M, Wickman M, Benz MR, Budde J, et al: Allergic diseases and atopic sensitization in children related to farming and anthroposophic lifestyle - the PARSIFAL study. Allergy 2006, 61:414-421.
98. Kull I, Melen E, Alm J, Hallberg J, Svartengren M, van Hage M, Pershagen G, Wickman M, Bergstrom A: Breast-feeding in relation to asthma, lung function, and sensitization in young schoolchildren. J Allergy Clin Immunol 2010, 125:1013-1019.
99. Wickman M, Kull I, Pershagen G, Nordvall SL: The BAMSE project: presentation of a prospective longitudinal birth cohort study. Pediatr Allergy Immunol 2002, 13 Suppl 15:11-13.
100. Simpson A, Tan VY, Winn J, Svensen M, Bishop CM, Heckerman DE, Buchan I, Custovic A: Beyond atopy: multiple patterns of sensitization in relation to asthma in a birth cohort study. Am J Respir Crit Care Med 2010, 181:1200-1206.

101. Soeria-Atmadja D, Onell A, Kober A, Matsson P, Gustafsson MG, Hammerling U: Multivariate statistical analysis of large-scale IgE antibody measurements reveals allergen extract relationships in sensitized individuals. J Allergy Clin Immunol 2007, 120:1433-1440.
102. Xu M, Tantisira KG, Wu A, Litonjua AA, Chu JH, Himes BE, Damask A, Weiss ST: Genome Wide Association Study to predict severe asthma exacerbations in children using random forests classifiers. BMC Medical Genetics 2011, 12:90.
103. Kooperberg C, LeBlanc M, Dai JY, Rajapakse I: Structures and Assumptions: Strategies to Harness Gene x Gene and Gene x Environment Interactions in GWAS. Statistical Science 2009, 24:472-488.
104. Lotvall J, Akdis CA, Bacharier LB, Bjermer L, Casale TB, Custovic A, Lemanske RF Jr, Wardlaw AJ, Wenzel SE, Greenberger PA: Asthma endotypes: a new approach to classification of disease entities within the asthma syndrome. J Allergy Clin Immunol 2011, 127:355-360.
105. Breslow DK, Collins SR, Bodenmiller B, Aebersold R, Simons K, Shevchenko A, Ejsing CS, Weissman JS: Orm family proteins mediate sphingolipid homeostasis. Nature 2010, 463:1048-1053.
106. Jetten AM: Retinoid-related orphan receptors (RORs): critical roles in development, immunity, circadian rhythm, and cellular metabolism. Nuclear Receptor Signaling 2009, 7:e003.
107. Touw WG, Bayjanov JR, Overmars L, Backus L, Boekhorst J, Wels M, van Hijum SAFT: Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle? Brief Bioinform 2013, 14:315-326.
108. Moore JH, Asselbergs FW, Williams SM: Bioinformatics challenges for genome-wide association studies. Bioinformatics 2010, 26:445-455.
109. Laegreid A, Hvidsten TR, Midelfart H, Komorowski J, Sandvik AK: Predicting gene ontology biological process from temporal gene expression patterns. Genome Res 2003, 13:965-979.
110. Calvo-Dmgz D, Galvez JF, Glez-Pena D, Fdez-Riverola F: Biological Knowledge Integration in DNA Microarray Gene Expression Classification Based on Rough Set Theory. Adv Intel Soft Compu 2012, 154:53-61.
111. Kontijevskis A, Wikberg JE, Komorowski J: Computational proteomics analysis of HIV-1 protease interactome. Proteins 2007, 68:305-312.
112. Kruczyk M, Zetterberg H, Hansson O, Rolstad S, Minthon L, Wallin A, Blennow K, Komorowski J, Andersson MG: Monte Carlo feature selection and rule-based models to predict Alzheimer's disease in mild cognitive impairment. J Neural Transm 2012, 119:821-831.

113. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an information aesthetic for comparative genomics. Genome Res 2009, 19:1639-1645.
114. Kelley Pace R, Barry R: Sparse spatial autoregressions. Statistics & Probability Letters 1997, 33:291-297.
115. Regression DataSets [http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html]
116. Sorokina D, Caruana R, Riedewald M, Fink D: Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th International Conference on Machine Learning (ICML). ACM; 2008: 1000-1007.
117. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.
118. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403:503-511.
119. Henikoff S, Shilatifard A: Histone modification: cause or cog? Trends Genet 2011, 27:389-396.
120. Kasowski M, Kyriazopoulou-Panagiotopoulou S, Grubert F, Zaugg JB, Kundaje A, Liu Y, Boyle AP, Zhang QC, Zakharia F, Spacek DV, et al: Extensive variation in chromatin states across humans. Science 2013, 342:750-752.
121. McVicker G, van de Geijn B, Degner JF, Cain CE, Banovich NE, Raj A, Lewellen N, Myrthil M, Gilad Y, Pritchard JK: Identification of genetic variants that affect histone modifications in human cells. Science 2013, 342:747-749.
122. FANTOM Consortium, Suzuki H, Forrest AR, van Nimwegen E, Daub CO, Balwierz PJ, Irvine KM, Lassmann T, Ravasi T, Hasegawa Y, et al: The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat Genet 2009, 41:553-562.
123. Enroth S, Andersson R, Wadelius C, Komorowski J: SICTIN: Rapid footprinting of massively parallel sequencing data. BioData Min 2010, 3:4.
124. Wendt KS, Yoshida K, Itoh T, Bando M, Koch B, Schirghuber E, Tsutsumi S, Nagae G, Ishihara K, Mishiro T, et al: Cohesin mediates transcriptional insulation by CCCTC-binding factor. Nature 2008, 451:796-801.
125. Ong CT, Corces VG: CTCF: an architectural protein bridging genome topology and function. Nat Rev Genet 2014, 15:234-246.
126. Yan J, Enge M, Whitington T, Dave K, Liu J, Sur I, Schmierer B, Jolma A, Kivioja T, Taipale M, Taipale J: Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell 2013, 154:801-813.
127. Nitzsche A, Paszkowski-Rogacz M, Matarese F, Janssen-Megens EM, Hubner NC, Schulz H, de Vries I, Ding L, Huebner N, Mann M, et al: RAD21 cooperates with pluripotency transcription factors in the maintenance of embryonic stem cell identity. PLoS ONE 2011, 6:e19470.
128. Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M, Kurdistani SK: Global histone modification patterns predict risk of prostate cancer recurrence. Nature 2005, 435:1262-1266.
129. Rajagopal N, Xie W, Li Y, Wagner U, Wang W, Stamatoyannopoulos J, Ernst J, Kellis M, Ren B: RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State. PLoS Comput Biol 2013, 9.
130. Chen Y, Jorgensen M, Kolde R, Zhao X, Parker B, Valen E, Wen J, Sandelin A: Prediction of RNA Polymerase II recruitment, elongation and stalling from histone modification data. BMC Genomics 2011, 12:544.

Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1167
Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2014
Distribution: publications.uu.se
urn:nbn:se:uu:diva-230159