Efficient Biclustering Algorithms for Time Series Gene Expression Data Analysis

Sara C. Madeira1,2,3 and Arlindo L. Oliveira1,2

1 Instituto Superior Técnico, Technical University of Lisbon, Portugal
2 Knowledge Discovery and Bioinformatics (KDBIO) Group, INESC-ID, Portugal
3 University of Beira Interior, Portugal
[email protected], [email protected]

Abstract. We present a summary of a PhD thesis proposing efficient biclustering algorithms for time series gene expression data analysis, able to discover important aspects of gene regulation such as anticorrelation and time-lagged relationships, and a scoring method based on statistical significance and similarity measures. The ability of the proposed algorithms to efficiently identify sets of genes with statistically significant and biologically meaningful expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific transcriptional regulatory mechanisms.

Key words: Biclustering, gene expression time series, temporal expression patterns, anticorrelated time-lagged patterns, regulatory modules.

1 Context and Motivation

Biclustering was shown to be remarkably effective in the discovery of local expression patterns, potentially useful to identify regulatory mechanisms [1, 2]. However, most biclustering formulations are NP-hard, so heuristic approaches that do not guarantee optimal solutions are often used [1]. In time series expression data analysis, the biclustering problem can be restricted to the identification of biclusters with contiguous columns, leading to a tractable problem and enabling the use of efficient exhaustive search algorithms [3]. In this context, our motivation to analyze gene expression time series through the identification of biclusters with contiguous columns is based on the following key points:

– Being able to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses of many interacting components, should provide the basis for understanding evolving but complex biological processes, such as disease progression and drug responses [4].

– Some authors have already pointed out the importance of biclusters with contiguous columns in the identification of regulatory processes [3]. The biological support for this reasoning is the key observation that biological processes last for a contiguous period of time, leading to increased/decreased activity of sets of genes whose coherent expression patterns can be identified as biclusters with contiguous columns. Other authors have recently emphasized that biclustering holds tremendous promise in the analysis of expression time series as more systemic perturbations are becoming available [4].

– Many algorithms have been proposed to address the general problem of biclustering and analyze expression data. In fact, these approaches have been extensively used to analyze expression time series [1], without focusing on biclusters with contiguous columns and thus possibly producing suboptimal results. Despite the known importance of discovering local temporal expression patterns, few recent biclustering algorithms address the specific case of expression time series and look for biclusters with contiguous columns [3]. In this context, there is a need for efficient biclustering algorithms to analyze expression time series, from both a computational and a biological point of view.

The biclustering algorithms specifically developed for time series gene expression data analysis should be able to take into account the noisy nature of these data, by enabling the discovery of biclusters with approximate expression patterns¹, and allow the discovery of important aspects of gene regulation such as anticorrelation and time-lagged relationships [3]. Furthermore, since applying biclustering to real data can produce hundreds or even thousands of biclusters, an objective evaluation of the quality of the biclusters discovered is crucial.

2 Contributions

A Survey on Biclustering Algorithms: We presented a systematic classification of the biclustering algorithms proposed for biological data analysis using four dimensions of analysis: bicluster type, bicluster structure, type of , and application domain [1]. We studied in detail the state-of-the-art biclustering algorithms specifically proposed for gene expression time series [3].

CCC-Biclustering: We proposed CCC-Biclustering [2], a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters (CCC-Biclusters) with perfect expression patterns in time linear in the size of the time series expression matrix. We consider that a bicluster has perfect expression patterns if all the genes in the bicluster have the same expression pattern in the contiguous time points defining the bicluster. A bicluster is maximal if it cannot be extended with genes with the same expression pattern and its expression pattern cannot be extended with contiguous time points without losing genes. The linear time complexity of CCC-Biclustering is achieved by manipulating a discretized version of the original time series expression matrix using efficient string processing techniques based on suffix trees. The key idea behind CCC-Biclustering is the relationship between maximal CCC-Biclusters and nodes in a generalized suffix tree built for the set of strings representing the expression pattern of each gene in the matrix, obtained after appending the column number to each symbol in the discretized matrix (Fig. 1). We proved this relationship in the following theorem: every maximal CCC-Bicluster with at least two rows corresponds to a maxNode in the generalized suffix tree, and each maxNode defines a maximal CCC-Bicluster with at least two rows. An internal node v is called a maxNode iff: (a) v does not have incoming suffix links; or (b) v has incoming suffix links only from nodes u_i such that, for every node u_i, the number of leaves in the subtree rooted at u_i is smaller than the number of leaves in the subtree rooted at v.

¹ Throughout the text we use the terms perfect and approximate expression pattern in the sense of exact and approximate string matching using the Hamming distance.

[Figure: discretized expression matrix (each symbol carries its column number), its generalized suffix tree, and the maximal CCC-Biclusters B1 to B4 marked on the tree.]

      C1   C2   C3   C4   C5
G1    N1   U2   D3   U4   N5
G2    D1   U2   D3   U4   D5
G3    N1   N2   N3   U4   N5
G4    U1   U2   D3   U4   U5

B1 = ({G1,G3}, {C1})
B2 = ({G1,G2,G4}, {C2,C3,C4})
B3 = ({G1,G2,G3,G4}, {C4})
B4 = ({G1,G3}, {C4,C5})

Fig. 1. Maximal CCC-Biclusters and generalized suffix trees.
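The definition of a maximal CCC-Bicluster can be checked on the small matrix of Fig. 1 with a brute-force sketch. This is not the linear-time suffix tree algorithm of [2]: it simply enumerates every contiguous column range, groups genes by pattern, and keeps the groups that cannot be extended to the left or to the right without losing genes. The function name `maximal_ccc_biclusters` and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

# Discretized example matrix of Fig. 1 (rows: genes G1..G4, columns: C1..C5).
MATRIX = {
    "G1": "NUDUN",
    "G2": "DUDUD",
    "G3": "NNNUN",
    "G4": "UUDUU",
}


def maximal_ccc_biclusters(matrix, min_rows=2):
    """Brute-force enumeration of maximal CCC-Biclusters with perfect patterns.

    Returns a set of (frozenset_of_genes, first_col, last_col), 1-based columns.
    Illustrative only: quadratic in the number of columns, unlike the
    suffix-tree-based algorithm described in the text.
    """
    n_cols = len(next(iter(matrix.values())))
    found = set()
    for first in range(n_cols):
        for last in range(first, n_cols):
            # Group genes by their pattern in columns first..last; each group
            # already contains every gene with that pattern (row-maximal).
            groups = defaultdict(set)
            for gene, row in matrix.items():
                groups[row[first:last + 1]].add(gene)
            for genes in groups.values():
                if len(genes) < min_rows:
                    continue
                # A column extension keeps all genes iff they all share the
                # same symbol in the neighbouring column.
                extend_left = (first > 0 and
                               len({matrix[g][first - 1] for g in genes}) == 1)
                extend_right = (last < n_cols - 1 and
                                len({matrix[g][last + 1] for g in genes}) == 1)
                if not extend_left and not extend_right:
                    found.add((frozenset(genes), first + 1, last + 1))
    return found


for genes, first, last in sorted(maximal_ccc_biclusters(MATRIX),
                                 key=lambda b: (b[1], b[2])):
    print(sorted(genes), f"C{first}-C{last}")
```

On this matrix the sketch returns exactly the four biclusters B1 to B4 listed in Fig. 1.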

e-CCC-Biclustering: CCC-Biclustering identifies perfect expression patterns and thus cannot deal with measurement errors, inherent to microarray experiments, and discretization errors, potentially introduced by a poor choice of discretization thresholds or number of symbols. In this context, we proposed e-CCC-Biclustering [5], a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series expression matrix. We consider that a CCC-Bicluster has approximate patterns if a given number of errors e is allowed, per gene, relative to a pattern profile identifying the expression pattern in the bicluster. The polynomial time complexity of e-CCC-Biclustering is achieved using a discretized matrix and exploring the relation between the problem of finding maximal e-CCC-Biclusters and the Common Motifs Problem. The algorithm has three main steps: 1) identify all right-maximal e-CCC-Biclusters, using an adaptation of SPELLER [6]; 2) discard non-left-maximal e-CCC-Biclusters using a trie; and 3) discard repetitions using a hash table.
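The membership criterion behind an e-CCC-Bicluster (at most e errors per gene, measured with the Hamming distance against a pattern profile) can be illustrated as follows. This sketch covers only that criterion; it does not implement the SPELLER-based search, the trie, or the hash table steps, and it does not test maximality. The helper names `hamming` and `genes_matching` are hypothetical, and the matrix is the discretized example of Fig. 1.

```python
# Discretized example matrix of Fig. 1 (rows: genes, columns: C1..C5).
MATRIX = {
    "G1": "NUDUN",
    "G2": "DUDUD",
    "G3": "NNNUN",
    "G4": "UUDUU",
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))


def genes_matching(matrix, profile, first, e):
    """Genes whose pattern, starting at 1-based column `first`, is within
    Hamming distance e of the pattern profile (hypothetical helper)."""
    start = first - 1
    end = start + len(profile)
    return {
        gene
        for gene, row in matrix.items()
        if end <= len(row) and hamming(row[start:end], profile) <= e
    }


# Perfect patterns (e = 0) versus one allowed error per gene (e = 1).
print(sorted(genes_matching(MATRIX, "UN", first=4, e=0)))
print(sorted(genes_matching(MATRIX, "UN", first=4, e=1)))
```

With the profile UN on columns C4-C5, allowing e = 0 errors selects G1 and G3 (bicluster B4 in Fig. 1), while e = 1 also admits G2 and G4, whose patterns UD and UU each differ from the profile in one position.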

Extended CCC-Biclustering and e-CCC-Biclustering: Given the importance of discovering more general expression patterns to the study of gene regulation using time series expression data, we proposed extensions to both CCC-Biclustering and e-CCC-Biclustering able to discover biclusters with anticorrelated (opposite patterns), scaled (patterns with different expression rates), and time-lagged (time-shifted patterns) expression patterns. These algorithms are also able to handle missing values, and identify complex expression patterns, such as the combination of anticorrelated and time-lagged patterns.
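A toy illustration of what anticorrelated and time-lagged occurrences of a pattern look like on discretized data is sketched below. It assumes, as an interpretation not spelled out in this summary, that U, D and N stand for up, down and no change, and that the opposite of a pattern is obtained by swapping U and D. The matrix, the function names, and the `max_lag` parameter are hypothetical; scaled patterns and missing values are not covered, and this is a direct scan, not the extended algorithms themselves.

```python
# Assumption: opposite (anticorrelated) patterns swap U and D and keep N.
SWAP_UP_DOWN = str.maketrans("UD", "DU")


def anticorrelated(pattern):
    """Opposite pattern under the U <-> D assumption stated above."""
    return pattern.translate(SWAP_UP_DOWN)


def lagged_matches(matrix, pattern, first, max_lag):
    """For each gene, list (lag, sign) pairs: the gene shows the pattern
    (sign +1) or its opposite (sign -1) starting `lag` columns after the
    1-based column `first`."""
    targets = [(pattern, +1)]
    opposite = anticorrelated(pattern)
    if opposite != pattern:  # avoid reporting an all-N pattern twice
        targets.append((opposite, -1))
    hits = {}
    for gene, row in matrix.items():
        for lag in range(max_lag + 1):
            start = first - 1 + lag
            window = row[start:start + len(pattern)]
            for target, sign in targets:
                if window == target:
                    hits.setdefault(gene, []).append((lag, sign))
    return hits


# Hypothetical toy matrix, not taken from the paper.
TOY = {"A": "UDNN", "B": "NUDN", "C": "DUNN", "D": "NNNN"}
print(lagged_matches(TOY, "UD", first=1, max_lag=1))
```

Here gene A shows the pattern UD at the reference column, gene B shows it one time point later (time-lagged), gene C shows the opposite pattern DU (anticorrelated), and gene D does not match.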

Scoring CCC-Biclusters and e-CCC-Biclusters: The inspection of biclustering results can be prohibitive without an efficient scoring approach enabling both ranking and filtering of results according to quality criteria. We proposed a scoring method for CCC-Biclusters and e-CCC-Biclusters combining the statistical significance of their expression pattern (p-value) and the similarity measure between overlapping biclusters (Jaccard Index). Using this approach, the p-value of each bicluster is computed and those not passing a Bonferroni-corrected statistical significance test at a predefined level, usually 1%, are discarded. Biclusters are then sorted by increasing order of their p-value and, when several of them overlap more than a predefined threshold, only the most significant are kept [2].
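The filtering and ranking steps described above can be sketched as follows, taking the p-values as given. Several details are assumptions made for illustration only: the Bonferroni correction multiplies each p-value by the number of biclusters tested, the Jaccard Index is computed over matrix cells (gene, time point), overlapping biclusters are resolved greedily in order of significance, and the 25% overlap threshold is an arbitrary default. The function names and the dictionary keys are hypothetical.

```python
def cells(bicluster):
    """Set of (gene, column) matrix cells covered by a bicluster."""
    return {
        (gene, col)
        for gene in bicluster["genes"]
        for col in range(bicluster["first"], bicluster["last"] + 1)
    }


def jaccard(a, b):
    """Jaccard Index of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_and_filter(biclusters, alpha=0.01, max_overlap=0.25):
    """Keep significant biclusters, most significant first, dropping any
    bicluster that overlaps an already kept one by more than max_overlap."""
    n_tests = len(biclusters)
    # Bonferroni correction (assumed form): p * number_of_tests < alpha.
    significant = [b for b in biclusters if b["p_value"] * n_tests < alpha]
    significant.sort(key=lambda b: b["p_value"])
    kept = []
    for candidate in significant:
        candidate_cells = cells(candidate)
        if all(jaccard(candidate_cells, cells(k)) <= max_overlap for k in kept):
            kept.append(candidate)
    return kept


# Hypothetical biclusters with made-up p-values.
EXAMPLE = [
    {"id": "A", "genes": {"g1", "g2", "g3"}, "first": 1, "last": 3, "p_value": 1e-10},
    {"id": "B", "genes": {"g1", "g2", "g3", "g4"}, "first": 1, "last": 3, "p_value": 1e-6},
    {"id": "C", "genes": {"g5", "g6"}, "first": 2, "last": 4, "p_value": 1e-4},
    {"id": "D", "genes": {"g7", "g8"}, "first": 1, "last": 2, "p_value": 5e-3},
]
print([b["id"] for b in score_and_filter(EXAMPLE)])
```

In this made-up example, D fails the corrected significance test (0.005 × 4 = 0.02), and B is dropped because it shares 9 of 12 cells with the more significant A (Jaccard Index 0.75), leaving A and C.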

3 Results Summary

We present a summary of the results obtained when applying CCC-Biclustering and e-CCC-Biclustering to the identification of transcriptional regulatory modules using a dataset concerning the Saccharomyces cerevisiae response to heat shock [2, 3]. The results were post-processed using the proposed scoring method. Table 1 shows a summary of the top 5 CCC-Biclusters analyzed using the Gene Ontology (GO) annotations. To perform the analysis for functional enrichment we used the (Bonferroni-corrected) p-values obtained using the hypergeometric distribution to assess the over-representation of a specific GO term in the “Biological Process” ontology above level 2. The CCC-Biclusters describing transcriptional up/down-regulation patterns were then analyzed in more detail using GO annotations together with information on transcriptional regulations.
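The hypergeometric over-representation test mentioned above can be written down directly. The sketch below computes the upper-tail probability of observing at least a given number of annotated genes in a bicluster, followed by a Bonferroni correction by the number of GO terms tested. The parameter names and the numbers in the usage line are hypothetical and are not taken from the yeast dataset.

```python
from math import comb


def hypergeom_tail(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: genes in the whole matrix
    annotated:  genes in the matrix annotated with the GO term
    sample:     genes in the bicluster
    hits:       genes in the bicluster annotated with the GO term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 10 genes, 4 annotated, bicluster of 3 genes, 2 hits.
p = hypergeom_tail(population=10, annotated=4, sample=3, hits=2)
print(p, bonferroni(p, n_tests=5))
```

For the small example, the tail probability is (C(4,2)·C(6,1) + C(4,3)·C(6,0)) / C(10,3) = 40/120 = 1/3, which the Bonferroni step with 5 tests caps at 1.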

Table 1. Summary of top 5 CCC-Biclusters after applying the scoring method.

ID    Sorting     Pattern   #Time Points   #Genes   #(corrected) GO    #(corrected) GO
      p-value               (first-last)            p-values < 0.01    p-values ∈ [0.01, 0.05[
124   2.56E-84    DNU       4 (2-5)        904      40                 8
14    1.64E-58    UND       4 (2-5)        1091     62                 12
27    3.69E-44    UUND      5 (1-5)        290      7                  6
39    8.65E-42    UNND      5 (1-5)        258      0                  0
151   3.99E-31    DNNU      5 (1-5)        232      12                 2

Table 2 shows the results for the top 2 CCC-Biclusters. The analysis of CCC-Bicluster 124 (delayed down-regulation pattern) shows that it comprises a number of genes involved in RNA and protein synthesis (“RNA processing” or “ribosome biogenesis and assembly” appear among its most significant GO terms). Indeed, the inhibition of ribosome biosynthesis and the repression of rRNA synthesis, associated with the general stress response program, are also features of the heat shock response. In agreement with this observation, the transcription factors (TFs) Sfp1p and Rap1p, associated with ribosome biogenesis and rRNA synthesis, appear as the main regulators of this bicluster, regulating 29.6% and 18.7% of the genes. A similar analysis of CCC-Bicluster 14 (middle up-regulation pattern) reveals the occurrence of highly significant terms, including “carbohydrate metabolic process” or “energy derivation by oxidation of organic compounds”, related to energy generation, and “response to stimulus” or “response to stress”, related to the cellular response to heat shock. These terms are consistent with the induction of protein folding chaperones aiming at protecting against, and recovering from, protein unfolding with associated energetic expenses. The transcriptional induction of genes involved in alternative carbon source metabolism and respiration, in the presence of glucose, is considered a consequence of a sudden decrease in cellular ATP concentration, caused by ATP-consuming stress defense mechanisms. As expected, the heat-shock factor Hsf1p comes out as one of the major regulators of this bicluster, regulating 17.1% of the genes. Moreover, in agreement with previous knowledge, Msn2p and Msn4p, regulators of the general stress response in yeast, appear as major contributors to the heat-induced transcriptional activation, regulating, respectively, 16.7% and 14.5% of the genes in the bicluster. A TF also presumably implicated in the regulation of the genes in this bicluster is Rpn4p, regulating 15.8% of the genes. This TF stimulates the expression of the proteasome genes, involved in the degradation of denatured or unnecessary proteins in stressed yeast cells [2].

Table 2. Relevant GO terms and transcriptional regulations of top 2 CCC-Biclusters.

ID    TFs     %      Relevant GO Terms Enriched                             p-value (level)
124   Sfp1p   29.6   ribonucleoprotein complex biogenesis and assembly      5.62E-81 (4)
      Rap1p   18.7   ribosome biogenesis and assembly                       1.76E-76 (5)
      Rpn4p   16.9   organelle biogenesis and assembly                      1.76E-76 (4)
      Arr1p   14.5   RNA processing                                         3.75E-37 (6)
      Fhl1p   11.6   rRNA metabolic process                                 9.12E-36 (6)
                     RNA metabolic process                                  1.02E-33 (5)
                     rRNA processing                                        1.10E-33 (6,7)
14    Sok2p   19.5   carbohydrate metabolic process                         2.77E-23 (4)
      Hsf1p   17.1   generation of precursor metabolites and energy         4.15E-20 (3)
      Msn2p   16.7   cellular carbohydrate metabolic process                5.08E-19 (5)
      Rpn4p   15.8   response to stress                                     1.53E-18 (3)
      Msn4p   14.5   response to stimulus                                   6.09E-17 (4)
                     energy derivation by oxidation of organic compounds    7.57E-16 (4)

We then assessed the impact of discovering 1-CCC-Biclusters on the biological results. The improvement was two-fold: (1) GO annotations (better p-values and a higher number of GO terms enriched); (2) transcriptional regulations (a higher number of genes regulated by relevant TFs relative to the corresponding CCC-Biclusters). Table 3 shows a summary of the top 5 1-CCC-Biclusters.

Table 3. Summary of top 5 1-CCC-Biclusters after applying the scoring method.

ID    Sorting     Pattern   #Time Points   #Genes   #(corrected) GO    #(corrected) GO
      p-value               (first-last)            p-values < 0.01    p-values ∈ [0.01, 0.05[
10    0.00E-00    DDNU      5 (1-5)        1079     58                 16
27    0.00E-00    DNUU      5 (1-5)        597      22                 13
79    0.00E-00    NNND      5 (1-5)        849      40                 16
132   0.00E-00    UNDD      5 (1-5)        539      10                 7
145   2.81E-41    UUDD      5 (1-5)        511      8                  5

Note that both 1-CCC-Biclusters 79 and 132 (functionally enriched) are obtained by extending CCC-Bicluster 39 (not functionally enriched) with genes with approximate patterns. Moreover, a detailed analysis of 1-CCC-Bicluster 10 (in Table 3) reveals that it corresponds to CCC-Bicluster 124 (in Table 1) extended with genes with approximate patterns and a contiguous time point on the left. The recovered genes are relevant: the functional enrichment results improved and the number of genes regulated by the relevant TFs in Table 2 is higher. In CCC-Bicluster 124, Sfp1p, Rap1p, Rpn4p, Arr1p and Fhl1p regulate, respectively, 268, 169, 153, 131, and 105 of the 904 genes. In 1-CCC-Bicluster 10, these key TFs regulate, respectively, 288, 175, 166, 138, and 108 of the 1079 genes.

4 Conclusions

We proposed efficient biclustering algorithms to analyze expression time series, a scoring method to post-process the results, and algorithmic extensions to deal with missing values and discover scaled, anticorrelated and time-lagged patterns. The results obtained using the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress showed not only the ability of CCC-Biclustering to extract relevant information compatible with documented biological knowledge, but also the utility of using this algorithm in the study of other environmental stresses and of regulatory modules in general [2]. Furthermore, these results demonstrated not only that e-CCC-Biclustering is able to recover genes with approximate patterns, which are potentially lost when only perfect patterns are considered, but also that the recovered genes are, in fact, biologically relevant to the problem under study.

Acknowledgments. We thank Miguel C. Teixeira and Isabel Sá-Correia for the invaluable help in the biological analysis of CCC-Biclustering results [2]. This work was partially supported by projects ARN, PTDC/EIA/67722/2006, and Dyablo, PTDC/EIA/71587/2006, funded by FCT, Fundação para a Ciência e Tecnologia.

References

1. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1(1) (January–March 2004) 24–45
2. Madeira, S.C., Teixeira, M.C., Sá-Correia, I., Oliveira, A.L.: Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 21 Mar. 2008. IEEE Society Digital Library. IEEE Computer Society, http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.34
3. Madeira, S.C.: Efficient Biclustering Algorithms for Time Series Gene Expression Data Analysis. PhD thesis, Instituto Superior Técnico, Technical University of Lisbon (2008)
4. Androulakis, I.P., Yang, E., Almon, R.R.: Analysis of time-series gene expression data: Methods, challenges and opportunities. Annual Review of Biomedical Engineering 9 (2007) 205–228
5. Madeira, S.C., Oliveira, A.L.: An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data. In: Proc. of the 5th Asia Pacific Bioinformatics Conference, Imperial College Press (2007) 67–80
6. Sagot, M.F.: Spelling approximate repeated or common motifs using a suffix tree. In: Proc. of Latin’98, Springer Verlag, LNCS 1380 (1998) 111–127