A Visual Data Mining Tool That Facilitates Reconstruction of Transcription Regulatory Networks

A Visual Data Mining Tool that Facilitates Reconstruction of Transcription Regulatory Networks Daniel C. Jupiter, Vincent VanBuren* Department of Systems Biology and Translational Medicine, Texas A&M Health Science Center College of Medicine, Temple, Texas, United States of America Abstract Background: Although the use of microarray technology has seen exponential growth, analysis of microarray data remains a challenge to many investigators. One difficulty lies in the interpretation of a list of differentially expressed genes, or in how to plan new experiments given that knowledge. Clustering methods can be used to identify groups of genes with similar expression patterns, and genes with unknown function can be provisionally annotated based on the concept of ‘‘guilt by association’’, where function is tentatively inferred from the known functions of genes with similar expression patterns. These methods frequently suffer from two limitations: (1) visualization usually only gives access to group membership, rather than specific information about nearest neighbors, and (2) the resolution or quality of the relationships are not easily inferred. Methodology/Principal Findings: We have addressed these issues by improving the precision of similarity detection over that of a single experiment and by creating a tool to visualize tractable association networks: we (1) performed meta- analysis computation of correlation coefficients for all gene pairs in a heterogeneous data set collected from 2,145 publicly available micorarray samples in mouse, (2) filtered the resulting distribution of over 130 million correlation coefficients to build new, more tractable distributions from the strongest correlations, and (3) designed and implemented a new Web based tool (StarNet, http://vanburenlab.medicine.tamhsc.edu/starnet.html) for visualization of sub-networks of the correlation coefficients built according to user specified parameters. Conclusions/Significance: Correlations were calculated across a heterogeneous collection of publicly available microarray data. Users can access this analysis using a new freely available Web-based application for visualizing tractable correlation networks that are flexibly specified by the user. This new resource enables rapid hypothesis development for transcription regulatory relationships. Citation: Jupiter DC, VanBuren V (2008) A Visual Data Mining Tool that Facilitates Reconstruction of Transcription Regulatory Networks. PLoS ONE 3(3): e1717. doi:10.1371/journal.pone.0001717 Editor: Raya Khanin, University of Glasgow, United Kingdom Received November 7, 2007; Accepted February 5, 2008; Published March 5, 2008 Copyright: ß 2008 Jupiter, VanBuren. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This work was funded by a Scientist Development Grant from the American Heart Association (Grant ID: 0630263N), and by start-up funds from the Department of Systems Biology and Translational Medicine and the Dean of the College of Medicine, Texas A & M Health Science Center. The project fundershad no role in the design and conduct of the study, no role in the collection, analysis, and interpretation of the data, and no role in the preparation, review, or approval of the manuscript. Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected] Introduction Dynamic Bayesian networks offer a viable approach for the discovery of gene regulatory network topology [7–12]. However, Several approaches to microarray data analysis make use of these methods are often computationally intensive, heuristic, and clustering techniques [1–4] to suggest functional roles for previously limited to the study of small networks usually derived from time uncharacterized genes. Clustering approaches, however, normally series data. Our approach to addressing these issues focuses on result in a graphical display of groupings that typically lack specific visualizing association networks local to a given gene of interest. information about the correlation of expression patterns between Using the Affymetrix GeneChip Mouse Genome 430 2.0 Array two selected genes. Thus while group membership can be tentatively platform, we (1) selected samples from a wide variety of tissues established, the topology of the group, or the interactions between its and experimental conditions to build a table of correlation members are not necessarily well elucidated. coefficients from all pair-wise comparisons of genes represented Synthesis and visualization of publicly available data remains a on the array, (2) selected a subset of those samples in order to challenge for biologists. Available microarray data is thus typically examine the differences in network topology which arise in a not exploited beyond the scope of the original experiment. smaller set of related regulatory states in cardiac tissues and early Visualization platforms such as Cytoscape [5] or BioTapestry [6] developmental states, relative to the average regulatory state have provided versatile solutions for viewing large networks, represented by the full cohort of arrays, (3) built a Web based including association and interaction networks, but such platforms application for user specified network construction and viewing, expect a network provided by the user, and do not learn or and (4) provide assessment of the resultant networks by drawing reconstruct the networks in and of themselves. networks of known interactions involving the list of genes in the PLoS ONE | www.plosone.org 1 March 2008 | Volume 3 | Issue 3 | e1717 Visual Data Mining correlation network, and by determining Gene Ontology (GO) which we have dubbed the ‘‘full cohort’’, cover a wide range of [13] annotation terms that are enriched in the correlation network tissues and experimental conditions. Of these hybridizations, 239 as compared with the entire array platform. All data used in our were from experiments related to cardiac development, cardiac analyses were retrieved from the Gene Expression Omnibus [14]. tissues in adult mice, or early development (the ‘‘cardiac cohort’’). A Fig. 1 shows an overview of the project. complete list of the experimental datasets used is available at http:// We present a user-directed approach to network elucidation, www.vanburenlab.tamhsc.edu/starnet_doc.html. and provide an intuitive Web-based interface (StarNet, http:// Features on the array were mapped to Entrez Gene [16] IDs vanburenlab.medicine.tamhsc.edu/starnet.html) for visual explo- using Version 9 of the mapping provided by Dai and colleagues ration of correlation networks radiating from a selected gene. In [17]. Their mapping yields 16,297 genes on the array. The arrays short, there are two main parts to the work described here: (1) within the full and cardiac cohorts were normalized separately, construction of a database by combining annotations and known using the justRMALite [18] package within the BioConductor interactions from Entrez Gene with our meta-analysis computa- [19] suite of tools. This procedure performs quantile normaliza- tion of correlation coefficients and data partitioning, and (2) tion, positive match only adjustment, and Tukey median polish. development of a Web-based front end (StarNet) that interrogates Pearson correlation coefficients were calculated for all pairwise the database, constructs networks for visualization, and performs comparisons of genes on the array using Octave. This yielded some analyses on those networks to provide a quick assessment of 132,787,956 coefficients for each cohort. their utility. StarNet results may suggest putative interactions, Several subsets of the collection of correlation coefficients were either in biochemical pathways or transcriptional regulatory built. First, we selected the 20,000 (20K) largest positive networks, thus providing new hypotheses for additional experi- correlation coefficients. This procedure was repeated for 40,000 ments. The results provided by StarNet may also be viewed as the (40K) and 100,000 (100K) coefficients. The 20K, 40K, and 100K first step in a data analysis pipeline, where the putative networks sub-distributions were also formed for the largest negative produced by StarNet, for example, may be studied further using coefficients. We further considered the union of positive and the tools of Bayesian network analysis. negative ‘‘extreme tails’’, for each of the three sizes. This procedure was executed for both the full and cardiac cohorts, Methods yielding a total of 18 different sub-distributions. To guarantee that each gene on the array is represented in our Data Preparation distributions, a ‘‘genecentric’’ distribution was built. The ten We selected 2,145 sample hybridizations performed on the largest positive correlations to each gene were selected, with the Affymetrix GeneChip Mouse Genome 430 2.0 Array which are proviso that the p-value of the correlation was less than .05. This available from the Gene Expression Omnibus (GEO) [14,15] for was repeated for negative correlations, and again the union of which raw data was available from GEO. Data from these samples, positive and negative correlations was considered. This procedure was carried out for both full and cardiac cohorts, thus obtaining an additional six distributions. We built two further classes of ‘‘specialty’’ distributions, each a variant

A Visual Data Mining Tool That Facilitates Reconstruction of Transcription Regulatory Networks

Program Nr: 1 from the 2004 ASHG Annual Meeting Mutations in A

Critical Congenital Heart Disease

CAGE: Combinatorial Analysis of Gene-Cluster Evolution

Personalized Genetic Diagnosis of Congenital Heart Defects in Newborns

HHS Public Access Author Manuscript

B Cells in Rheumatoid Arthritis (RA) B Cells in Autoimmune Arthritis Model

Genome-Wide Gene Expression Profiling of Randall's Plaques In

CFC1 Is a Cancer Stemness-Regulating Factor in Neuroblastoma

Report CFC1 Mutations in Patients with Transposition of the Great

Rare Novel Variants in the ZIC3 Gene Cause X-Linked Heterotaxy

The Evolution and Population Diversity of Human-Specific Segmental Duplications

Generation and Annotation of the DNA Sequences of Human Chromosomes 2And4