A Methodology for Motif Discovery Employing Iterated Cluster Re-Assignment
Total Page:16
File Type:pdf, Size:1020Kb
May 24, 2006 19:13 WSPC/Trim Size: 11in x 8.5in for Proceedings motif3 2571 A METHODOLOGY FOR MOTIF DISCOVERY EMPLOYING ITERATED CLUSTER RE-ASSIGNMENT Osman Abul¤ y and Finn Drabl¿s Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway ¤Email: [email protected] ¯[email protected] Geir Kjetil Sandve Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway Email: [email protected] Motif discovery is a crucial part of regulatory network identi¯cation, and therefore widely studied in the literature. Motif discovery programs search for statistically signi¯cant, well-conserved and over-represented patterns in given promoter sequences. When gene expression data is available, there are mainly three paradigms for motif discovery; cluster-¯rst, regression, and joint probabilistic. The success of motif discovery depends highly on the homogeneity of input sequences, regardless of paradigm employed. In this work, we propose a methodology for getting homoge- neous subsets from input sequences for increased motif discovery performance. It is a unification of cluster-first and regression paradigms based on iterative cluster re-assignment. The experimental results show the e®ectiveness of the methodology. 1. INTRODUCTION from each other. ChIP-chip experiments can analyze the genome-wide binding of a speci¯c TF. For in- Transcription Factors (TF) are proteins that selec- stance, Lee et al. 19 have conducted experiments for tively bind to short pieces (5-25nt long) of DNA, so over 100 TFs for experimental identi¯cation of reg- called Transcription Factor Binding Sites (TFBS). ulatory networks of Saccharomyces cerevisiae. Un- Although TFs bind in a selective way they allow fortunately their resolution (¼ 1K-nt) is not enough some degeneracy in their binding sites, forming Tran- to exactly identify binding locations. Other prob- scription Factor Binding Motifs (TFBMs) or just lems include condition speci¯c binding, measurement motifs. This property creates the TFBS represen- noise, and di±culty in ¯nding an optimal consen- tation problem, i.e. the choice of language in which sus motif. TFBMs are functional elements of genes motifs are expressed. The most common represen- and preserved throughout the evolution. This prop- tations are motif consensus over IUPAC codes, mis- erty, together with available completed genetic maps match strings and position speci¯c weight matrices of many species, has made possible computational (PSWMs), as well as their variants and specializa- identi¯cation based solely on sequence data. That is, tions. since these regions have accumulated very few muta- Finding TFBMs is an important step in elucida- tions compared to non-functional parts, it is possible tion of genetic regulatory networksa. There are ba- to ¯nd them computationally by just exploiting the sically two methods for ¯nding TFBMs, experimen- statistical over-representation. Computational ap- tal and computational, although they usually bene¯t ¤Corresponding author. yThis work was carried out during the tenure of an ERCIM fellowship. aRegulatory network identi¯cation methods are also studied without explicitly focusing on use and discovery of TFBMs33; 27; 36; 28; 24, in this paper we do not cover these approaches. May 24, 2006 19:13 WSPC/Trim Size: 11in x 8.5in for Proceedings motif3 2258 proaches built around this fact include MEME 2; 1, An alternative to the cluster-¯rst approach is to BioProspector 20, AlignACE 12, Consensus 10, and start from a large set of putative motifs and ¯lter MDScan 21, among many others. them by regressing on expression data. The idea TFs bind to respective TFBSs in promoter re- behind this approach is to remove non-relevant mo- gions of their target genes. Each gene can have a tifs and thereby reduce the number of false positives. number of TFBSs for several di®erent TFs in its pro- Examples of this approach include Reduce 7, Motif moter sequence. In Eukaryotes, TFBSs are organized Regressor 8; 21, a boosting approach also employing in modules; sets of TFBSs for a number of TFs. Each ChIP-chip data (Hong et al. 15) and a logic regres- TF can function as inducer or repressor and this pro- sion approach by Kele»s et al. 17. cess is combinatorial, i.e. depends on the qualita- Although a number of algorithms and programs tively and quantitatively binding of other TFs. This have been developed for motif discovery, little has combinatorial behavior can cause non-additive ex- been done on designing a methodology for optimal pression behavior for their common targets. In gen- usage. In particular, little attention is paid to the eral, intra-module couplings are much stronger than s e le c tio n o f ho mo g e ne o us s ubs e ts fr o m he te r o g e ne o us inter-module couplings. Expression behavior also de- gene sets of interest. In practice, what an exper- pends on the genome-wide global conditions. imenter does is 1) cluster the gene sets of interest To understand the governing rules for gene ex- (using a clustering program like k-means, hierarchi- pression, we need to know 1) all TFs, 2) abundance cal clustering, Self-organizing maps, etc), then 2) in- and activity of them under varying conditions, 3) put them to one or a few motif ¯nding programs, and their binding sites, and 4) their combinatorial joint ¯nally 3) decide on the true motifs among all the can- regulation of target expression 35; 9. From this, it didates, either by further analysis (like regression) or is clear that to induce regulatory networks com- manually. Though clustering before motif discovery putationally we need both sequence and functional improves homogeneity compared to random subsets, data. Typically, the sequence data employed is the it might fail in ¯nding true clusters. Motivated by inter-genic promoter regions upstream of transcrip- this, we here study the generation of homogeneous tion start sites while the functional data is obtained clusters using both sequence and expression data, from microarray experiments under various condi- and we address the issue of methodology for motif tions. Other useful sources of data for motif (and discovery. module) discovery include ChIP-chip experiments We de¯ne an iterative procedure (a methodol- (e.g. 3), TFBM databases (e.g. 26), and phyloge- ogy) for the motif discovery process. Briefly, we start netic relations (e.g. 14). with an initial clustering of gene sets from gene ex- The success of motif discovery programs depends pression data and ¯nd motifs in these clusters. We on the quality of input data. That is, they typically then (optionally) re¯ne these motifs by ¯ltering out give high false-positives/negatives if input genes are irrelevant ones. In this step, simple ¯ltering or ¯lter- heterogenous with respect to regulation. To make ing employing regression analysis is applied. After the input genes homogeneous, genes are clustered be- that, we screen all the genes by motif pro¯les of each fore they are presented to motif discovery programs; cluster and re¯ne clusters by re-assignment based on hence this is called the cluster-¯rst approach. This screening score. Following this, we restart motif dis- is because gene expression depends on combinatorial covery on the new gene clusters and iterate this pro- binding of TFs on TFBMs. The co-expressed genes cedure until convergence. Finally, we output the set are assumed to be co-regulated, therefore genes are of motifs found in the last iteration. clustered based on their expression pro¯le similarity over a course of microarray experiments. Each clus- 2. POWERING MOTIF DISCOVERY ter (in which sequences are highly probable to con- USING GENE EXPRESSION DATA tain homogeneous TFBMs) is given as input to motif ¯nding programs (MEME, BioProspector, MDScan The three main paradigms for incorporating etc.). gene expression data into motif discovery are cluster- May 24, 2006 19:13 WSPC/Trim Size: 11in x 8.5in for Proceedings motif3 2593 ¯rst, regression and joint probabilistic. proach of Kele»s et al. 17 uses two-step logistic re- Brazma et al. 6 presented one of the earliest gression on a single gene expression experiment. In methods within the cluster-¯rst paradigm. They the ¯rst step, the set of all over-represented oligos look for over-represented oligos with limited degen- (allowing limited degeneracy) in the input sequences eracy, both genome-wide and for clusters generated are identi¯ed as candidate motifs. In the second step, from gene expression clustering based on the time se- for each sequence a binary score vector (serving as a ries data. The approach taken by Beer et al. 5 also covariate vector) is constructed in which each entry use a cluster-¯rst approach. The genes are clustered corresponds to existence of a motif type (or a logical using expression data with k-means clustering and function of a subset of all motif types, a so called AlignACE 12 is used for motif discovery. A very sim- logic tree) and this vector is regressed on expression ilar approach using a custom clustering algorithm is data. The Rim-Finder system of Zilberstein et al. presented in 23. 38 is another method using the regression approach. A variant of the cluster-¯rst approach is TFCC Identi¯cation of synergistic e®ects of pairs of motifs (Transcription Factor Centric Clustering) of Zhu et using co-expression has also been studied 26. al. 37. The idea is to ¯nd a set of genes showing sim- Methods for binary regression (classi¯cation) ilar expression pro¯les to the expression pro¯le of a have also been developed. A large-margin classi¯- particular TF over a set of expression experiments, cation approach, called Medusa, using boosting to- and then look for motifs in that cluster using Alig- gether with alternating decision trees is given in 22.