Defining Diversity, Specialization, and Gene Specificity in Transcriptomes Through Information Theory
Octavio Martínez*† and M. Humberto Reyes-Valdés‡

*Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Cinvestav, Campus Guanajuato, Apartado Postal 629, C.P. 36500 Irapuato, Guanajuato, Mexico; and ‡Department of Plant Breeding, Universidad Autónoma Agraria Antonio Narro, Buenavista, C.P. 25315 Saltillo, Coahuila, Mexico

Communicated by Luis Herrera-Estrella, Center for Research and Advanced Studies, Guanajuato, Mexico, April 10, 2008 (received for review March 10, 2008)

PNAS | July 15, 2008 | vol. 105 | no. 28 | 9709–9714 | www.pnas.org/cgi/doi/10.1073/pnas.0803479105

Author contributions: O.M. and M.H.R.-V. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
†To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0803479105/DCSupplemental.
© 2008 by The National Academy of Sciences of the USA

The transcriptome is a set of genes transcribed in a given tissue under specific conditions and can be characterized by a list of genes with their corresponding frequencies of transcription. Transcriptome changes can be measured by counting gene tags from mRNA libraries or by measuring light signals in DNA microarrays. In any case, it is difficult to completely comprehend the global changes that occur in the transcriptome, given that thousands of gene expression measurements are involved. We propose an approach to define and estimate the diversity and specialization of transcriptomes and gene specificity. We define transcriptome diversity as the Shannon entropy of its frequency distribution. Gene specificity is defined as the mutual information between the tissues and the corresponding transcript, allowing detection of either housekeeping or highly specific genes and clarifying the meaning of these concepts in the literature. Tissue specialization is measured by average gene specificity. We introduce the formulae using a simple example and show their application in two datasets of gene expression in human tissues. Visualization of the positions of transcriptomes in a system of diversity and specialization coordinates makes it possible to understand at a glance their interrelations, summarizing in a powerful way which transcriptomes are richer in diversity of expressed genes, or which are relatively more specialized. The framework presented clarifies the relations among transcriptomes, allowing a better understanding of their changes through the development of the organism or in response to environmental stimuli.

biological complexity | gene expression | microarrays | serial analysis of gene expression (SAGE) | Shannon entropy

The transcriptome is highly dynamic; the relative transcription frequencies of the genes change in response to environmental and internal stimuli, redirecting the functional and structural landscape of living organisms. Currently, we can measure transcriptome changes by counting gene tags with technologies such as serial analysis of gene expression (SAGE) (1), massively parallel signature sequencing (MPSS) (2), pyrosequencing of cDNA libraries obtained from mRNA (3), or alternatively by measuring light signals in DNA microarrays (4). In any case, it is difficult to completely understand the global changes that occur in the transcriptome, given that thousands of gene frequency measurements are involved. Here, we present a set of indexes that allow the calculation of transcriptome diversity and context specialization and the degree of gene specificity. These indexes are based on the adaptation of Shannon's information theory (5) to the transcriptome framework. Our approach is exemplified by the analysis of a dataset of the transcriptome of 32 human tissues, from which >32 million gene tags were obtained (6) and a comparable dataset for the expression of human genes in 36 human tissues using the Affymetrix GeneChip for the human genome (7). The main conclusion of our study is that this conceptualization allows elucidation of aspects of the transcriptome previously uncharacterized due to the quantity and complexity of the data.

Information theory was pioneered by Claude E. Shannon in a seminal paper in 1948 (5), and it has been generalized and applied to many scientific fields (8). In particular, it has been repeatedly applied to genetics in distinct contexts (9–12). Our approach consists of considering as symbols, in the sense of information theory, the distinct transcripts found in a tissue and counting their abundance to calculate information parameters.

Results and Discussion

Theoretical Framework. Consider the division of an organism into tissues; the transcriptome of each tissue can then be simply described as the set of relative frequencies, $p_{ij}$, for the $i$th gene ($i = 1, 2, \ldots, g$) in the $j$th tissue ($j = 1, 2, \ldots, t$). Then the diversity of the transcriptome of each tissue can be quantified by an adaptation of Shannon's entropy formula,

$$H_j = -\sum_{i=1}^{g} p_{ij} \log_2(p_{ij}). \tag{1}$$

$H_j$ will vary from zero, when only one gene is transcribed, up to $\log_2(g)$, when all $g$ genes are transcribed at the same frequency, $1/g$. We then consider the average frequency of the $i$th gene among tissues, say,

$$p_i = \frac{1}{t} \sum_{j=1}^{t} p_{ij}, \tag{2}$$

and define gene specificity as the information that its expression provides about the identity of the source tissue as

$$S_i = \frac{1}{t} \sum_{j=1}^{t} \frac{p_{ij}}{p_i} \log_2\left(\frac{p_{ij}}{p_i}\right). \tag{3}$$

$S_i$ will give a value of zero if the gene is transcribed at the same frequency in all tissues and a maximum value of $\log_2(t)$ if the gene is exclusively expressed in a single tissue. To quantify tissue specialization, we can obtain, for each $j$th tissue, the average of the gene specificities, say,

$$\delta_j = \sum_{i=1}^{g} p_{ij} S_i. \tag{4}$$

$\delta_j$ varies from zero, if all genes expressed in the tissue are completely unspecific ($S_i = 0$ for all $i$), up to a maximum of $\log_2(t)$, when all genes expressed in the tissue are not expressed anywhere else. If we substitute the values of $p_{ij}$ by $p_i$ in Eq. 1 and ignore the subindex $j$, we obtain a measure, say $H$, of the diversity of the whole system.

To define a measure of divergence with respect to the whole average transcriptome, let us define the average $\log_2$ of the global transcript frequencies in a given tissue, say

$$H_{Rj} = -\sum_{i=1}^{g} p_{ij} \log_2(p_i). \tag{5}$$

$H_{Rj}$ will be equal to or larger than the corresponding $H_j$, reaching equality if and only if $p_i = p_{ij}$ for all values of $i$. Now we can define the Kullback–Leibler divergence of the tissue $j$ as

$$D_j = H_{Rj} - H_j. \tag{6}$$

$D_j$ measures how much a given tissue $j$ departs from the corresponding transcriptome distribution of the whole system. Notice that $H_j$, the measure of diversity, depends only on the relative transcription frequencies of the tissue $j$; thus, it is independent of the context. However, the measures of tissue specialization and divergence, $\delta_j$ and $D_j$, respectively, depend not only on these frequencies but also on those of the remaining tissues; thus, these parameters are sensitive to the context where they are measured [see supporting information (SI) Text].

[Figure 1. Panel A: bar plot; x axis: tissue (a, b, c, d); y axis: relative frequency of transcription ($p_{ij}$); legend: Gene 1, S = 2.0000; Gene 2, S = 0.7341; Gene 3, S = 0.4151; Gene 4, S = 1.0501. Panel B: scatter plot; x axis: $H_j$ (diversity); y axis: $\delta_j$ (specialization); points a, b, c, d and a diamond labeled H / mean $\delta$.]

Fig. 1. Example of values of information parameters as functions of the relative frequencies of transcription of genes in a system with four tissues and four genes. (A) Bar plot of the relative frequencies of transcription of each gene in the four tissues. Values of gene specificities, $S_i$, are presented. (B) Scatter plot of $H_j$ (diversity) vs. $\delta_j$ (specialization, given by the average of the gene specificities) for each tissue. The value of $H$ in the whole system and the mean of $\delta_j$ is plotted as a diamond.

[...] an approximate intermediate diversity, whereas tissue b, transcribing three genes, one with relatively high specificity ($S_4 = 1.05$), is the second most diverse and specialized. The diversity of the whole system, with $H = 1.9965$, almost reaches the maximum diversity for a system with four genes, $\log_2(4) = 2$, and the mean specialization of the tissues, mean($\delta_j$) $= 1.0424$, is almost in the center of the range of possible values, 0 to $\log_2(4) = 2$ (diamond in Fig. 1B). In this example, the properties of the transcriptome can be easily understood by inspection of the transcription frequencies of the four genes, but in any real case, thousands of genes are involved, and the appreciation of the transcriptome properties becomes impossible without the tools described here.

Analysis of Human Data: The Tissue Perspective. To exemplify our approach with real cases, we analyzed two comparable datasets. The first consists of >31 million MPSS tags for 22,935 genes measured in 32 human tissues (6), and the second is a microarray expression profiling of 36 human tissues (7). These two datasets share 28 human tissues and thus present the possibility of comparing the results of our approach with two highly dissimilar methodologies.

Fig. 2 shows a scatter plot of the values of diversity, $H_j$, vs. the values of specialization given by the average gene specificity of the tissues, $\delta_j$. From the results of the MPSS dataset (Fig.
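As a minimal sketch, and not the authors' implementation, the information parameters of Eqs. 1–6 could be computed as follows. The function names and the 4 × 4 frequency matrix are hypothetical and chosen only for illustration; the matrix is not the data behind Fig. 1. Rows are assumed to be genes and columns tissues, each column is assumed to sum to 1, and terms with a zero frequency are taken to contribute zero, the usual convention for entropy.

```python
import math


def column(p, j):
    """Relative frequencies p_ij of all genes in tissue j."""
    return [row[j] for row in p]


def diversity(p, j):
    """H_j (Eq. 1): Shannon entropy of tissue j; zero frequencies contribute 0."""
    return -sum(x * math.log2(x) for x in column(p, j) if x > 0)


def mean_frequencies(p):
    """p_i (Eq. 2): average frequency of each gene over the t tissues."""
    t = len(p[0])
    return [sum(row) / t for row in p]


def specificity(p):
    """S_i (Eq. 3) for every gene; a gene that is never expressed gets 0."""
    t = len(p[0])
    return [
        sum((x / p_i) * math.log2(x / p_i) for x in row if x > 0) / t
        for row, p_i in zip(p, mean_frequencies(p))
    ]


def specialization(p, j):
    """delta_j (Eq. 4): average gene specificity weighted by p_ij."""
    return sum(x * s for x, s in zip(column(p, j), specificity(p)))


def divergence(p, j):
    """D_j = H_Rj - H_j (Eqs. 5 and 6)."""
    h_r = -sum(
        x * math.log2(p_i)
        for x, p_i in zip(column(p, j), mean_frequencies(p))
        if x > 0
    )
    return h_r - diversity(p, j)


# Hypothetical frequencies (assumption): rows = genes 1-4, columns = tissues a-d.
P = [
    [1.0, 0.00, 0.0, 0.00],  # gene 1 is expressed only in tissue a
    [0.0, 0.50, 0.5, 0.25],
    [0.0, 0.25, 0.5, 0.75],
    [0.0, 0.25, 0.0, 0.00],  # gene 4 is expressed only in tissue b
]

if __name__ == "__main__":
    for i, s in enumerate(specificity(P), start=1):
        print(f"gene {i}: S = {s:.4f}")
    for j, name in enumerate("abcd"):
        print(
            f"tissue {name}: H = {diversity(P, j):.4f}, "
            f"delta = {specialization(P, j):.4f}, D = {divergence(P, j):.4f}"
        )
```

In this toy matrix, genes 1 and 4 are each confined to a single tissue, so their specificity reaches the maximum $\log_2(4) = 2$, whereas tissue a, which transcribes a single gene, has zero diversity.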