Efficient Biclustering Algorithms for Time Series Gene Expression Data Analysis

Sara C. Madeira¹,²,³ and Arlindo L. Oliveira¹,²

¹ Instituto Superior Técnico, Technical University of Lisbon, Portugal
² Knowledge Discovery and Bioinformatics (KDBIO) Group, INESC-ID, Portugal
³ University of Beira Interior, Portugal

Abstract. We present a summary of a PhD thesis proposing efficient biclustering algorithms for time series gene expression data analysis. The algorithms are able to discover important aspects of gene regulation, such as anticorrelation and time-lagged relationships, and are complemented by a scoring method based on statistical significance and similarity measures. The ability of the proposed algorithms to efficiently identify sets of genes with statistically significant and biologically meaningful expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific transcriptional regulatory mechanisms.

Key words: Biclustering, gene expression time series, temporal expression patterns, anticorrelated time-lagged patterns, regulatory modules.

1 Context and Motivation

Biclustering has been shown to be remarkably effective in the discovery of local expression patterns, which are potentially useful for identifying regulatory mechanisms [1, 2]. However, most biclustering formulations are NP-hard, so heuristic approaches are often used, and these do not guarantee optimal solutions [1]. In time series expression data analysis, the biclustering problem can be restricted to the identification of biclusters with contiguous columns, leading to a tractable problem and enabling the use of efficient exhaustive search algorithms [3].
In this context, our motivation to analyze gene expression time series through the identification of biclusters with contiguous columns is based on the following key points:

- Being able to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses of many interacting components, should provide the basis for understanding evolving but complex biological processes, such as disease progression and drug responses [4].
- Some authors have already pointed out the importance of biclusters with contiguous columns in the identification of regulatory processes [3]. The biological support for this reasoning is the key observation that biological processes last for a contiguous period of time, leading to increased or decreased activity of sets of genes whose coherent expression patterns can be identified as biclusters with contiguous columns. Other authors have recently emphasized that biclustering holds tremendous promise in the analysis of expression time series as more systemic perturbations become available [4].
- Many algorithms have been proposed to address the general problem of biclustering and analyze expression data. In fact, these approaches have been extensively used to analyze expression time series [1] without focusing on biclusters with contiguous columns, and thus possibly producing suboptimal results. Despite the known importance of discovering local temporal expression patterns, few recent biclustering algorithms address the specific case of expression time series and look for biclusters with contiguous columns [3]. In this context, there is a need for biclustering algorithms for expression time series that are efficient from both a computational and a biological point of view.

Biclustering algorithms specifically developed for time series gene expression data analysis should take into account the noisy nature of these data by enabling the discovery of biclusters with approximate expression patterns¹, and should allow the discovery of important aspects of gene regulation such as anticorrelation and time-lagged relationships [3]. Furthermore, since applying biclustering to real data can produce hundreds or even thousands of biclusters, an objective evaluation of the quality of the biclusters discovered is crucial.

2 Contributions

A Survey on Biclustering Algorithms: We presented a systematic classification of the biclustering algorithms proposed for biological data analysis using four dimensions of analysis: bicluster type, bicluster structure, type of algorithm, and application domain [1]. We also studied in detail the state-of-the-art biclustering algorithms specifically proposed for gene expression time series [3].

CCC-Biclustering: We proposed CCC-Biclustering [2], a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters (CCC-Biclusters) with perfect expression patterns in time linear in the size of the time series expression matrix. We consider that a bicluster has perfect expression patterns if all the genes in the bicluster have the same expression pattern in the contiguous time points defining the bicluster. A bicluster is maximal if it cannot be extended with genes with the same expression pattern and its expression pattern cannot be extended with contiguous time points without losing genes. The linear time complexity of CCC-Biclustering is achieved by manipulating a discretized version of the original time series expression matrix using efficient string processing techniques based on suffix trees.
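
The two maximality conditions above can be made concrete with a brute-force enumeration. The following is a minimal sketch only, not the linear-time suffix-tree algorithm: it tries every contiguous column window, groups genes by their pattern in that window, and keeps a group when neither neighbouring column can be added without losing a gene. The function name `maximal_ccc_biclusters`, the `min_genes` parameter, and the toy matrix (the symbols D, N, U as they appear to be recoverable from Fig. 1) are assumptions made for illustration.

```python
from collections import defaultdict


def maximal_ccc_biclusters(matrix, min_genes=2):
    """Enumerate maximal CCC-Biclusters by brute force (illustrative only).

    matrix maps a gene name to a string of discretized symbols, one per
    time point. Returns tuples (genes, first_col, last_col, pattern) with
    0-based, inclusive column indices.
    """
    if not matrix:
        return []
    genes = sorted(matrix)
    n_cols = len(matrix[genes[0]])
    found = []
    for first in range(n_cols):
        for last in range(first, n_cols):
            # Row-maximality: group ALL genes sharing a pattern in this window.
            groups = defaultdict(list)
            for g in genes:
                groups[matrix[g][first:last + 1]].append(g)
            for pattern, members in groups.items():
                if len(members) < min_genes:
                    continue
                # Column-maximality: if every member agrees on a neighbouring
                # column, the window can be extended without losing genes.
                if first > 0 and len({matrix[g][first - 1] for g in members}) == 1:
                    continue
                if last < n_cols - 1 and len({matrix[g][last + 1] for g in members}) == 1:
                    continue
                found.append((tuple(members), first, last, pattern))
    return found


# Assumed toy matrix (rows G1..G4, columns C1..C5), read off Fig. 1.
matrix = {"G1": "NUDUN", "G2": "DUDUD", "G3": "NNNUN", "G4": "UUDUU"}
for members, first, last, pattern in maximal_ccc_biclusters(matrix):
    print(members, f"C{first + 1}-C{last + 1}", pattern)
```

On this toy matrix the sketch reports four biclusters with at least two genes, matching B1 to B4 in Fig. 1. Its running time grows polynomially with the number of time points, which is the cost the suffix-tree formulation avoids.
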
The key idea behind CCC-Biclustering is the relationship between maximal CCC-Biclusters and nodes in a generalized suffix tree built for the set of strings representing the expression pattern of each gene in the matrix, obtained after appending the column number to each symbol in the discretized matrix (Fig. 1). We proved this relationship in the following theorem: every maximal CCC-Bicluster with at least two rows corresponds to a maxNode in the generalized suffix tree, and each maxNode defines a maximal CCC-Bicluster with at least two rows. An internal node v is called a maxNode iff: (a) v does not have incoming suffix links; or (b) v has incoming suffix links only from nodes u_i such that, for every node u_i, the number of leaves in the subtree rooted at u_i is smaller than the number of leaves in the subtree rooted at v.

[Figure 1: discretized expression matrix for genes G1 to G4 over time points C1 to C5, after appending the column number to each symbol (G1 = N1 U2 D3 U4 N5; G2 = D1 U2 D3 U4 D5; G3 = N1 N2 N3 U4 N5; G4 = U1 U2 D3 U4 U5), and the corresponding generalized suffix tree, with the maximal CCC-Biclusters B1 = ({G1,G3}, {C1}), B2 = ({G1,G2,G4}, {C2,C3,C4}), B3 = ({G1,G2,G3,G4}, {C4}), and B4 = ({G1,G3}, {C4,C5}).]

Fig. 1. Maximal CCC-Biclusters and generalized suffix trees.

¹ Throughout the text we use the terms perfect and approximate expression pattern in the sense of exact and approximate string matching using the Hamming distance.

e-CCC-Biclustering: CCC-Biclustering identifies perfect expression patterns and thus cannot deal with measurement errors, inherent to microarray experiments, or with discretization errors, potentially introduced by a poor choice of discretization thresholds or number of symbols.
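
To illustrate how such errors affect pattern matching, the minimal sketch below counts mismatches (Hamming distance, the sense of "approximate" used in the footnote) between each gene's symbols in a contiguous window and a candidate pattern profile, and keeps the genes that stay within a chosen error budget. It illustrates only the matching criterion, not the e-CCC-Biclustering algorithm itself. The names `hamming` and `genes_matching`, the `max_errors` parameter, and the toy matrix are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def genes_matching(matrix, profile, first_col, max_errors):
    """Genes whose symbols in the contiguous window starting at first_col
    (0-based) differ from the pattern profile in at most max_errors positions."""
    last = first_col + len(profile)
    return [gene for gene, row in matrix.items()
            if hamming(row[first_col:last], profile) <= max_errors]


# Hypothetical discretized matrix: D = down, N = no change, U = up.
matrix = {"G1": "NUDUN", "G2": "DUDUD", "G3": "NNNUN", "G4": "UUDUU"}

# Profile "UDUN" over columns C2..C5 with 0, 1 and 2 errors allowed per gene.
for max_errors in (0, 1, 2):
    print(max_errors, genes_matching(matrix, "UDUN", first_col=1, max_errors=max_errors))
```

With no errors allowed, only G1 matches the profile over all four time points; allowing one error per gene also admits G2 and G4, which differ from the profile only in the last column.
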
In this context, we proposed e-CCC-Biclustering [5], a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series expression matrix. We consider that a CCC-Bicluster has approximate patterns if a given number of errors e is allowed, per gene, relative to a pattern profile identifying the expression pattern in the bicluster. The polynomial time complexity of e-CCC-Biclustering is achieved by using a discretized matrix and exploiting the relation between the problem of finding maximal e-CCC-Biclusters and the Common Motifs Problem. The algorithm has three main steps: 1) identify all right-maximal e-CCC-Biclusters, using an adaptation of SPELLER [6]; 2) discard non-left-maximal e-CCC-Biclusters using a trie; and 3) discard repetitions using a hash table.

Extended CCC-Biclustering and e-CCC-Biclustering: Given the importance of discovering more general expression patterns for the study of gene regulation using time series expression data, we proposed extensions to both CCC-Biclustering and e-CCC-Biclustering that are able to discover biclusters with anticorrelated (opposite patterns), scaled (patterns with different expression rates), and time-lagged (time-shifted patterns) expression patterns. These algorithms are also able to handle missing values and to identify complex expression patterns, such as the combination of anticorrelated and time-lagged patterns.

Scoring CCC-Biclusters and e-CCC-Biclusters: The inspection of biclustering results can be prohibitive without an efficient scoring approach that enables both ranking and filtering of results according to quality criteria. We proposed a scoring method for CCC-Biclusters and e-CCC-Biclusters that combines the statistical significance of their expression pattern (p-value) with a similarity measure between overlapping biclusters (Jaccard index).
Using this approach, the p-value of each bicluster is computed, and those not passing a Bonferroni-corrected statistical significance test at a predefined level, usually 1%, are discarded. The remaining biclusters are then sorted in increasing order of p-value and, when several of them overlap by more than a predefined threshold, only the most significant are kept [2].

3 Results Summary

We present a summary of the results obtained when applying CCC-Biclustering and e-CCC-Biclustering to the identification of transcriptional regulatory modules using a dataset concerning the Saccharomyces cerevisiae response to heat shock [2, 3]. The results were post-processed using the proposed scoring method. Table 1 shows a summary of the top 5 CCC-Biclusters analyzed using the Gene Ontology (GO) annotations. To perform the analysis for functional enrichment, we used the (Bonferroni-corrected) p-values obtained using the hypergeometric distribution to assess the over-representation of a specific GO term in the "Biological Process" ontology above level 2. The CCC-Biclusters describing transcriptional up/down-regulation patterns were then analyzed in more detail using GO annotations together with information on transcriptional regulations.
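
The scoring and filtering procedure summarized in Section 2 (a Bonferroni-corrected significance test, sorting by p-value, then removal of highly overlapping biclusters) could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: p-values are taken as given inputs, the number of tests in the Bonferroni correction is assumed to be the number of biclusters, the Jaccard index is assumed to be computed over (gene, column) cells, the overlap filter is assumed to be greedy, and the threshold `max_overlap=0.25` and all example values are hypothetical.

```python
def jaccard(b1, b2):
    """Jaccard index of two biclusters viewed as sets of (gene, column) cells."""
    cells1 = {(g, c) for g in b1["genes"] for c in b1["cols"]}
    cells2 = {(g, c) for g in b2["genes"] for c in b2["cols"]}
    union = cells1 | cells2
    return len(cells1 & cells2) / len(union) if union else 0.0


def score_and_filter(biclusters, alpha=0.01, max_overlap=0.25):
    """Keep significant biclusters, most significant first, dropping any that
    overlaps an already kept bicluster by more than max_overlap."""
    n_tests = len(biclusters)  # assumption: one test per bicluster
    # Bonferroni correction: compare p * n_tests against the chosen level.
    significant = [b for b in biclusters if b["p_value"] * n_tests <= alpha]
    significant.sort(key=lambda b: b["p_value"])
    kept = []
    for b in significant:
        if all(jaccard(b, k) <= max_overlap for k in kept):
            kept.append(b)
    return kept


# Hypothetical biclusters with made-up p-values, for illustration only.
biclusters = [
    {"name": "A", "genes": {"G1", "G2", "G4"}, "cols": {2, 3, 4}, "p_value": 1e-6},
    {"name": "B", "genes": {"G1", "G2"}, "cols": {2, 3, 4}, "p_value": 1e-4},
    {"name": "C", "genes": {"G1", "G3"}, "cols": {4, 5}, "p_value": 1e-3},
    {"name": "D", "genes": {"G3"}, "cols": {1}, "p_value": 0.2},
]
print([b["name"] for b in score_and_filter(biclusters)])
```

In this toy input, D fails the corrected significance test, and B is dropped because it lies entirely inside the more significant A, while C shares only one cell with A and is kept.
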
