Gene Networks
Hans J. Bohnert
Shisong Ma (present address: Yale University)
Introduction
High-throughput transcript profiling platforms, microarrays in a variety of forms,
make use of the increasing amount of genomic DNA and EST sequences and
generate a flood of data. Tools are urgently needed that allow for analyses that
convert these data into models for the structures of any underlying pathways and
causal networks (Brazhnik et al., 2002). From an organism’s response to various
developmental or externally manipulated conditions at the transcript level it
should be possible to infer functional relations between genes based on
co-expression patterns, essentially by assuming “guilt by association”. However, the
use of these data in interpreting how genes respond to external or internal, i.e.,
hormonal or biochemical, stimuli has been fraught with uncertainties. Statistical
models abound, yet there is no consensus about how biological context and
significance are to be merged in these models. Common are approaches that
generate “relevance networks” by determining Pearson correlation coefficients
(Markowetz and Spang, 2007), which indeed associate many genes with similar
expression patterns. However, the method is encumbered by several problems:
it cannot separate indirect from direct interactions and tends to recover
far too many interactions, which then frustrate interpretation (Schäfer and
Strimmer, 2005a). Also, results are dominated by a few heavily connected
pathways, for example genes associated with functions of the ribosome, while
the effects of environmental factors are less faithfully represented.
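The indirect-interaction problem can be illustrated with a minimal Python sketch. It assumes a simulated three-gene chain in which geneA drives geneB and geneB drives geneC; the gene names, the noise level and the 0.8 cutoff are illustrative choices, not values taken from the cited studies.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def relevance_network(profiles, threshold):
    """Keep every gene pair whose absolute correlation reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Simulated chain (an assumption of this sketch): geneA -> geneB -> geneC.
# There is no direct geneA-geneC dependence in the simulation.
rng = random.Random(0)
n_arrays = 200
gene_a = [rng.gauss(0, 1) for _ in range(n_arrays)]
gene_b = [a + 0.3 * rng.gauss(0, 1) for a in gene_a]
gene_c = [b + 0.3 * rng.gauss(0, 1) for b in gene_b]
profiles = {"geneA": gene_a, "geneB": gene_b, "geneC": gene_c}

edges = relevance_network(profiles, threshold=0.8)
for pair, r in sorted(edges.items()):
    print(pair, round(r, 2))
```

In this toy setting the geneA-geneC pair passes the cutoff although the two genes are linked only through geneB, which is the kind of indirect edge that inflates relevance networks.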
We have chosen the graphical Gaussian model, GGM, as a method for representing
correlation. GGM is based on partial correlation (Schäfer and Strimmer, 2005a,
b; Opgen-Rhein and Strimmer, 2007), placing the relationship between gene
pairs in the context of the entire transcriptome, while the Pearson correlation
approach considers each pair of genes separately. The partial correlation
approach has superior ability in separating direct from indirect interactions.
However, GGM is impeded by what may be called the problem of “large p, small
n”. This describes the fact that the number of microarray slides (n) in a dataset is
invariably much smaller than the number of genes (p) in a genome. For the
model plant Arabidopsis thaliana, whose genome is well known, some 3,000
microarray experiments with the Affymetrix ATH1 platform, on which >22,000
gene probes are printed, represent ~150 experimental conditions. A “shrinkage”
method (Schäfer and Strimmer, 2005a, b) makes it possible to infer partial
correlations among some 2,000 genes from this dataset. We then proposed and
implemented an iterative sampling routine, coupled with the shrinkage approach,
thus expanding the partial correlation analysis to establish network coverage of
the whole genome (Ma et al., 2007).
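The combination of shrinkage and iterative sampling can be sketched in Python under several stated assumptions. The shrinkage intensity `lam` is a fixed, hypothetical parameter here, whereas Schäfer and Strimmer estimate it from the data. The rule for combining iterations (keep, for each gene pair, the partial correlation of smallest absolute value) is likewise an assumption of this sketch, since the text above does not spell it out. The four simulated genes and all function names are illustrative only.

```python
import math
import random
from itertools import combinations


def correlation_matrix(rows):
    """Pearson correlation matrix for a list of equally long profiles."""
    n = len(rows[0])
    scaled = []
    for row in rows:
        mean = sum(row) / n
        dev = [v - mean for v in row]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, v)) for v in scaled] for u in scaled]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    size = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def partial_correlations(profiles, lam=0.05):
    """Partial correlations from a correlation matrix shrunk toward identity."""
    genes = sorted(profiles)
    corr = correlation_matrix([profiles[g] for g in genes])
    size = len(genes)
    shrunk = [[(1 - lam) * corr[i][j] + (lam if i == j else 0.0)
               for j in range(size)] for i in range(size)]
    omega = invert(shrunk)  # estimated precision (inverse correlation) matrix
    return {(genes[i], genes[j]):
            -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
            for i, j in combinations(range(size), 2)}


def iterative_pcor(profiles, subset_size, iterations, lam=0.05, seed=0):
    """Repeat the estimate on random gene subsets; keep the weakest value per pair."""
    rng = random.Random(seed)
    genes = sorted(profiles)
    best = {}
    for _ in range(iterations):
        subset = rng.sample(genes, subset_size)
        pcor = partial_correlations({g: profiles[g] for g in subset}, lam)
        for pair, value in pcor.items():
            if pair not in best or abs(value) < abs(best[pair]):
                best[pair] = value
    return best


# Simulated data: chain geneA -> geneB -> geneC plus an unrelated geneD.
rng = random.Random(1)
n_arrays = 1000
gene_a = [rng.gauss(0, 1) for _ in range(n_arrays)]
gene_b = [a + rng.gauss(0, 1) for a in gene_a]
gene_c = [b + rng.gauss(0, 1) for b in gene_b]
gene_d = [rng.gauss(0, 1) for _ in range(n_arrays)]
profiles = {"geneA": gene_a, "geneB": gene_b, "geneC": gene_c, "geneD": gene_d}

network = iterative_pcor(profiles, subset_size=3, iterations=50)
for pair, value in sorted(network.items()):
    print(pair, round(value, 2))
```

A subset that happens to omit geneB makes the geneA-geneC pair look directly connected; keeping the weakest value seen across iterations is one way to let a later subset that includes geneB correct this, which is why the sketch combines iterations in that manner.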
Subsequent analyses revealed that GGM recovers interactions between seemingly
unrelated biological pathways (developmental and biochemical) that cover
diverse aspects, which provided particular advantages for the analysis of
responses to external factors. Depending on the stringency of the settings,
which may be adjusted based on biological knowledge, it is possible to recover
genome-wide networks of moderate sizes that cover many pathways, and which
can be controlled by the degree to which they recover known gene
interactions. Although this represents a heuristic, experimental approach, many
interactions with biological significance have been recovered (Ma et al., 2007; Ma
and Bohnert, 2008; Li et al., 2008). Further meaningful enhancement of the
GGM can be obtained using clustering methods (Gasch and Eisen, 2002; Ma et
al., 2006; Ma and Bohnert, 2007).
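How the stringency setting trades network size against recovery of known interactions can be shown with a small Python sketch. All partial correlation values, gene labels and the set of "known" pairs below are made up for illustration, and the function names are hypothetical.

```python
def edges_at(pcor, cutoff):
    """Gene pairs whose absolute partial correlation reaches the cutoff."""
    return {pair for pair, value in pcor.items() if abs(value) >= cutoff}


def recall(edges, known):
    """Fraction of known interactions that appear among the network edges."""
    return len(edges & known) / len(known)


# Made-up partial correlations and known interactions (illustration only).
pcor = {
    ("g1", "g2"): 0.12,
    ("g1", "g3"): 0.03,
    ("g2", "g3"): 0.08,
    ("g3", "g4"): 0.06,
    ("g1", "g4"): 0.01,
}
known = {("g1", "g2"), ("g3", "g4")}

results = []
for cutoff in (0.10, 0.05, 0.02):
    edges = edges_at(pcor, cutoff)
    results.append((cutoff, len(edges), recall(edges, known)))

for cutoff, size, rec in results:
    print(f"cutoff={cutoff:.2f} edges={size} recall={rec:.2f}")
```

Lowering the cutoff enlarges the network; once the known pairs are all recovered, further relaxation only adds edges of uncertain value, which is one way to read the adjustment of stringency by biological knowledge.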
It seems that modified GGM approaches can provide novel understanding of
transcript profiles: the process can identify genes in biochemical pathways,
help explain hormonal, biotic and abiotic stress-relevant responses, and
place genes into developmental pathways. The chosen model is particularly
suitable for placing various isoforms of genes in families into different contexts,
and for assigning genes of unknown function to networks that can then be
experimentally interrogated. GGM would profit from increased experimentation,
especially if the number of experimental conditions and the number of
time-course experiments were increased, and if more wild type to (knockout) mutant
comparisons were included. The number of stringently controlled and
annotated experiments is not yet sufficient to provide a sufficiently complex and
nearly scale-free model of the Arabidopsis transcriptome. Also, we are still far
from being able to integrate into an improved GGM, and to see in context, data
established by different microarray platforms, or to profit from diverse datasets such as
those provided by protein interaction or metabolite dynamics studies, via, e.g.,
Bayesian network approaches (Lee et al., 2004b); robust models for
reconciling data from different microarray platforms are still to be
developed.
Another plant database exists that organizes a network based on the Pearson
correlation coefficient (Obayashi et al., 2007). For non-plant organisms, a
co-expression network based on first-order partial correlation exists for yeast
(Magwene and Kim, 2004). Other networks, for yeast, C. elegans, or human
tissues, typically combine co-expression data with other types of data, such as
protein-protein interactions, ChIP experimental data or across-species gene
conservation studies (Ramani et al., 2008; Lee et al., 2008; Lee et al., 2004a;
Lee et al., 2007). Owing to the fact that more protein interaction studies have
been conducted for other important models, human cell cultures or yeast in
particular, less emphasis has been placed in these models on gene networks,
while integration of different datasets has been emphasized. A similar trajectory
can be expected for plant datasets, which in the future should become incorporated into the
TAIR database, which strives to incorporate all information on Arabidopsis genetics,
biochemistry and genomics (Rhee et al., 2003).
References
Brazhnik, P., de la Fuente, A., Mendes, P. (2002) Gene networks: how to put
the function in genomics. Trends Biotechnol. 20: 467–472.
Gasch, A.P., Eisen, M.B. (2002) Exploring the conditional coregulation of yeast
gene expression through fuzzy k-means clustering. Genome Biol. 3: R0059.
Lee, I., Date, S.V., Adai, A.T., Marcotte, E.M. (2004a) A probabilistic functional
network of yeast genes. Science 306: 1555–1558.
Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J., Pavlidis, P. (2004b) Coexpression
analysis of human genes across many microarray data sets. Genome Res. 14:
1085–1094.
Lee, I., Li, Z., Marcotte, E.M. (2007) An Improved, Bias-Reduced Probabilistic
Functional Gene Network of Baker's Yeast, Saccharomyces cerevisiae. PLoS
ONE 2: e988.
Lee, I., Lehner, B., Crombie, C., Wong, W., Fraser, A.G., Marcotte, E.M. (2008) A
single gene network accurately predicts phenotypic effects of gene perturbation
in Caenorhabditis elegans. Nat. Genet. 40: 181–188.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K.,
Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J.,
Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne,
J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K., Young, R.A. (2002)
Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:
799–804.
Li, P., Ma, S., Bohnert, H.J. (2008) Coexpression characteristics of
trehalose-6-phosphate phosphatase subfamily genes reveal different functions in a network
context. Physiol. Plant. in press.
Ma, S., Gong, Q., Bohnert, H.J. (2006) Dissecting Salt Stress Pathways. J. Exp.
Bot. 57: 1097–1107.
Ma, S., Bohnert, H.J. (2007) Integration of Arabidopsis thaliana stress-related
transcript profiles, promoter structures, and cell-specific expression. Genome
Biol. 8: R49.
Ma, S., Gong, Q., Bohnert, H.J. (2007) An Arabidopsis Gene Network based on the
Graphical Gaussian Model. Genome Res. 17: 1614–1625.
Ma, S., Bohnert, H.J. (2008) Genomics Data, Integration, Networks and Systems.
Molec. BioSystems, epub: January 9, 2008.
Magwene, P.M., Kim, J. (2004) Estimating genomic coexpression networks using
first-order conditional independence. Genome Biol. 5: R100.
Markowetz, F., Spang, R. (2007) Inferring cellular networks – a review. BMC
Bioinformatics 8 Suppl 6: S5.
Obayashi, T., Kinoshita, K., Nakai, K., Shibaoka, M., Hayashi, S., Saeki, M.,
Shibata, D., Saito, K., Ohta, H. (2007) ATTED-II: a database of co-expressed
genes and cis elements for identifying co-regulated gene groups in Arabidopsis.
Nucleic Acids Res. 35: D863–D869.
Opgen-Rhein, R., Strimmer, K. (2007) From correlation to causation networks: a
simple approximate learning algorithm and its application to high-dimensional
plant gene expression data. BMC Syst. Biol. 1: 37.
Ramani, A.K., Li, Z., Hart, G.T., Carlson, M.W., Boutz, D.R., Marcotte, E.M.
(2008) A map of human protein interactions derived from co-expression of
human mRNAs and their orthologs. Mol. Syst. Biol. 4: 180.
Rhee, S.Y., Beavis, W., Berardini, T.Z., Chen, G., Dixon, D., Doyle, A.,
Garcia-Hernandez, M., Huala, E., Lander, G., Montoya, M. et al. (2003) The
Arabidopsis Information Resource (TAIR): a model organism database providing
a centralized, curated gateway to Arabidopsis biology, research materials and
community. Nucleic Acids Res. 31: 224–228.
Schäfer, J., Strimmer, K. (2005a) An empirical Bayes approach to inferring
large-scale gene association networks. Bioinformatics 21: 754–764.
Schäfer, J., Strimmer, K. (2005b) A shrinkage approach to large-scale covariance
matrix estimation and implications for functional genomics. Stat. Appl. Genet.
Mol. Biol. 4: Article 32.
Selected Links
TAIR: http://www.arabidopsis.org
Genevestigator: http://www.genevestigator.ethz.ch
Bohnert Lab: http://www.life.uiuc.edu/bohnert
ATTED-II: http://www.atted.bio.titech.ac.jp/
GGM (Strimmer lab): http://strimmerlab.org/notes/ggm.html
NASC: http://arabidopsis.info
Society Computational Biology: http://www.iscb.org
Society Molecular Biology: http://www.iscb.org/ismb2008/index.php
Keywords
Gene network, microarray, transcript profiling, gene expression, biochemical pathways,
signaling cascades, environmental control, graphical Gaussian model, scale-free
networks, Pearson correlation