A Methodology for Motif Discovery Employing Iterated Cluster Re-Assignment

May 24, 2006 19:13 WSPC/Trim Size: 11in x 8.5in for Proceedings motif3 2571 A METHODOLOGY FOR MOTIF DISCOVERY EMPLOYING ITERATED CLUSTER RE-ASSIGNMENT Osman Abul¤ y and Finn Drabl¿s Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway ¤Email: [email protected] ¯[email protected] Geir Kjetil Sandve Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway Email: [email protected] Motif discovery is a crucial part of regulatory network identi¯cation, and therefore widely studied in the literature. Motif discovery programs search for statistically signi¯cant, well-conserved and over-represented patterns in given promoter sequences. When gene expression data is available, there are mainly three paradigms for motif discovery; cluster-¯rst, regression, and joint probabilistic. The success of motif discovery depends highly on the homogeneity of input sequences, regardless of paradigm employed. In this work, we propose a methodology for getting homogeneous subsets from input sequences for increased motif discovery performance. It is a unification of cluster-first and regression paradigms based on iterative cluster re-assignment. The experimental results show the e®ectiveness of the methodology. 1. INTRODUCTION from each other. ChIP-chip experiments can analyze the genome-wide binding of a speci¯c TF. For in- Transcription Factors (TF) are proteins that selec- stance, Lee et al. 19 have conducted experiments for tively bind to short pieces (5-25nt long) of DNA, so over 100 TFs for experimental identi¯cation of reg- called Transcription Factor Binding Sites (TFBS). ulatory networks of Saccharomyces cerevisiae. Un- Although TFs bind in a selective way they allow fortunately their resolution (¼ 1K-nt) is not enough some degeneracy in their binding sites, forming Tran- to exactly identify binding locations. Other prob- scription Factor Binding Motifs (TFBMs) or just lems include condition speci¯c binding, measurement motifs. This property creates the TFBS represen- noise, and di±culty in ¯nding an optimal consen- tation problem, i.e. the choice of language in which sus motif. TFBMs are functional elements of genes motifs are expressed. The most common represen- and preserved throughout the evolution. This prop- tations are motif consensus over IUPAC codes, mis- erty, together with available completed genetic maps match strings and position speci¯c weight matrices of many species, has made possible computational (PSWMs), as well as their variants and specializa- identi¯cation based solely on sequence data. That is, tions. since these regions have accumulated very few muta- Finding TFBMs is an important step in elucida- tions compared to non-functional parts, it is possible tion of genetic regulatory networksa. There are ba- to ¯nd them computationally by just exploiting the sically two methods for ¯nding TFBMs, experimen- statistical over-representation. Computational ap- tal and computational, although they usually bene¯t ¤Corresponding author. yThis work was carried out during the tenure of an ERCIM fellowship. aRegulatory network identi¯cation methods are also studied without explicitly focusing on use and discovery of TFBMs33; 27; 36; 28; 24, in this paper we do not cover these approaches. May 24, 2006 19:13 WSPC/Trim Size: 11in x 8.5in for Proceedings motif3 2258 proaches built around this fact include MEME 2; 1, An alternative to the cluster-¯rst approach is to BioProspector 20, AlignACE 12, Consensus 10, and start from a large set of putative motifs and ¯lter MDScan 21, among many others. them by regressing on expression data. The idea TFs bind to respective TFBSs in promoter re- behind this approach is to remove non-relevant mo- gions of their target genes. Each gene can have a tifs and thereby reduce the number of false positives. number of TFBSs for several di®erent TFs in its pro- Examples of this approach include Reduce 7, Motif moter sequence. In Eukaryotes, TFBSs are organized Regressor 8; 21, a boosting approach also employing in modules; sets of TFBSs for a number of TFs. Each ChIP-chip data (Hong et al. 15) and a logic regres- TF can function as inducer or repressor and this pro- sion approach by Kele»s et al. 17. cess is combinatorial, i.e. depends on the qualita- Although a number of algorithms and programs tively and quantitatively binding of other TFs. This have been developed for motif discovery, little has combinatorial behavior can cause non-additive ex- been done on designing a methodology for optimal pression behavior for their common targets. In gen- usage. In particular, little attention is paid to the eral, intra-module couplings are much stronger than s e le c tio n o f ho mo g e ne o us s ubs e ts fr o m he te r o g e ne o us inter-module couplings. Expression behavior also de- gene sets of interest. In practice, what an exper- pends on the genome-wide global conditions. imenter does is 1) cluster the gene sets of interest To understand the governing rules for gene ex- (using a clustering program like k-means, hierarchi- pression, we need to know 1) all TFs, 2) abundance cal clustering, Self-organizing maps, etc), then 2) in- and activity of them under varying conditions, 3) put them to one or a few motif ¯nding programs, and their binding sites, and 4) their combinatorial joint ¯nally 3) decide on the true motifs among all the can- regulation of target expression 35; 9. From this, it didates, either by further analysis (like regression) or is clear that to induce regulatory networks com- manually. Though clustering before motif discovery putationally we need both sequence and functional improves homogeneity compared to random subsets, data. Typically, the sequence data employed is the it might fail in ¯nding true clusters. Motivated by inter-genic promoter regions upstream of transcrip- this, we here study the generation of homogeneous tion start sites while the functional data is obtained clusters using both sequence and expression data, from microarray experiments under various condi- and we address the issue of methodology for motif tions. Other useful sources of data for motif (and discovery. module) discovery include ChIP-chip experiments We de¯ne an iterative procedure (a methodol- (e.g. 3), TFBM databases (e.g. 26), and phyloge- ogy) for the motif discovery process. Briefly, we start netic relations (e.g. 14). with an initial clustering of gene sets from gene ex- The success of motif discovery programs depends pression data and ¯nd motifs in these clusters. We on the quality of input data. That is, they typically then (optionally) re¯ne these motifs by ¯ltering out give high false-positives/negatives if input genes are irrelevant ones. In this step, simple ¯ltering or ¯lter- heterogenous with respect to regulation. To make ing employing regression analysis is applied. After the input genes homogeneous, genes are clustered be- that, we screen all the genes by motif pro¯les of each fore they are presented to motif discovery programs; cluster and re¯ne clusters by re-assignment based on hence this is called the cluster-¯rst approach. This screening score. Following this, we restart motif dis- is because gene expression depends on combinatorial covery on the new gene clusters and iterate this pro- binding of TFs on TFBMs. The co-expressed genes cedure until convergence. Finally, we output the set are assumed to be co-regulated, therefore genes are of motifs found in the last iteration. clustered based on their expression pro¯le similarity over a course of microarray experiments. Each clus- 2. POWERING MOTIF DISCOVERY ter (in which sequences are highly probable to con- USING GENE EXPRESSION DATA tain homogeneous TFBMs) is given as input to motif ¯nding programs (MEME, BioProspector, MDScan The three main paradigms for incorporating etc.). gene expression data into motif discovery are cluster- May 24, 2006 19:13 WSPC/Trim Size: 11in x 8.5in for Proceedings motif3 2593 ¯rst, regression and joint probabilistic. proach of Kele»s et al. 17 uses two-step logistic re- Brazma et al. 6 presented one of the earliest gression on a single gene expression experiment. In methods within the cluster-¯rst paradigm. They the ¯rst step, the set of all over-represented oligos look for over-represented oligos with limited degen- (allowing limited degeneracy) in the input sequences eracy, both genome-wide and for clusters generated are identi¯ed as candidate motifs. In the second step, from gene expression clustering based on the time se- for each sequence a binary score vector (serving as a ries data. The approach taken by Beer et al. 5 also covariate vector) is constructed in which each entry use a cluster-¯rst approach. The genes are clustered corresponds to existence of a motif type (or a logical using expression data with k-means clustering and function of a subset of all motif types, a so called AlignACE 12 is used for motif discovery. A very sim- logic tree) and this vector is regressed on expression ilar approach using a custom clustering algorithm is data. The Rim-Finder system of Zilberstein et al. presented in 23. 38 is another method using the regression approach. A variant of the cluster-¯rst approach is TFCC Identi¯cation of synergistic e®ects of pairs of motifs (Transcription Factor Centric Clustering) of Zhu et using co-expression has also been studied 26. al. 37. The idea is to ¯nd a set of genes showing sim- Methods for binary regression (classi¯cation) ilar expression pro¯les to the expression pro¯le of a have also been developed. A large-margin classi¯- particular TF over a set of expression experiments, cation approach, called Medusa, using boosting to- and then look for motifs in that cluster using Alig- gether with alternating decision trees is given in 22.

A Methodology for Motif Discovery Employing Iterated Cluster Re-Assignment

UNIVERSITY of CALIFORNIA RIVERSIDE Unsupervised And

University of California Santa Cruz Sample

Introduction

UCLA UCLA Electronic Theses and Dissertations

Director's Update

Top 100 AI Leaders in Drug Discovery and Advanced Healthcare Introduction

Probabilistic Models for Species Tree Inference and Orthology Analysis

David A. Knowles

Learning from Graph Neighborhoods Using Lstms

Michael Kearns

Population-Based Meta-Heuristic for Active Modules Identification Leandro Corrêa, Denis Pallez, Laurent Tichit, Olivier Soriani, Claude Pasquier

BE562 Computational Biology: Genomes, Networks, Evolution Fall 2014 Course Information