A Novel Tool for Detection of Tissue and Cell Specific Marker Genes From
Total Page:16
File Type:pdf, Size:1020Kb
El Amrani et al. BMC Genomics (2015) 16:645 DOI 10.1186/s12864-015-1785-9 METHODOLOGY ARTICLE Open Access MGFM: a novel tool for detection of tissue and cell specific marker genes from microarray gene expression data Khadija El Amrani1, Harald Stachelscheid1,2, Fritz Lekschas1, Andreas Kurtz1,3* and Miguel A. Andrade-Navarro4,5 Abstract Background: Identification of marker genes associated with a specific tissue/cell type is a fundamental challenge in genetic and cell research. Marker genes are of great importance for determining cell identity, and for understanding tissue specific gene function and the molecular mechanisms underlying complex diseases. Results: We have developed a new bioinformatics tool called MGFM (Marker Gene Finder in Microarray data) to predict marker genes from microarray gene expression data. Marker genes are identified through the grouping of samples of the same type with similar marker gene expression levels. We verified our approach using two microarray data sets from the NCBI’s Gene Expression Omnibus public repository encompassing samples for similar sets of five human tissues (brain, heart, kidney, liver, and lung). Comparison with another tool for tissue-specific gene identification and validation with literature-derived established tissue markers established functionality, accuracy and simplicity of our tool. Furthermore, top ranked marker genes were experimentally validated by reverse transcriptase-polymerase chain reaction (RT-PCR). The sets of predicted marker genes associated with the five selected tissues comprised well-known genes of particular importance in these tissues. The tool is freely available from the Bioconductor web site, and it is also provided as an online application integrated into the CellFinder platform (http://cellfinder.org/analysis/marker). Conclusions: MGFM is a useful tool to predict tissue/cell type marker genes using microarray gene expression data. The implementation of the tool as an R-package as well as an application within CellFinder facilitates its use. Keywords: Microarrays, Marker genes, Samples Background genes are used to determine the tissue identity and to Large amounts of microarray experimental data are avail- characterize cells grown in vitro. able in public repositories. Although a variety of pro- Since disease-associated genes are more likely to show grams have been developed to make use of these data, tissue specific expression [2], marker genes of healthy the number of tools that identify marker genes is lim- tissues could also be used to understand the molecular ited. Genes may be split into two categories based on the mechanisms underlying complex diseases. number of tissues in which they are expressed [1]. Genes Microarrays allow the parallel analysis of the expression that are expressed in many tissues are often designated as of several thousands of genes in hundreds of tissues/cell housekeeping while those that are expressed in few tis- types, and have been extensively used by the scientific sues are termed tissue-specific or marker genes. Marker community. Consequently, a large amount of microarray expression data has accumulated in public repositories. *Correspondence: [email protected] The Gene Expression Omnibus (GEO) [3], contains cur- 1Charité - Universitätsmedizin Berlin, Berlin Brandenburg Center for rently expression data on 1,328,979 samples across 3848 Regenerative Therapies (BCRT), 13353 Berlin, Germany 3Seoul National University, College of Veterinary Medicine and Research datasets, and ArrayExpress [4] contains 1,649,790 assays Institute for Veterinary Science, 151-742 Seoul, Republic of Korea across 55,656 experiments. The aim of the current study Full list of author information is available at the end of the article was to develop a tool to detect marker genes associated © 2016 El Amrani et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. El Amrani et al. BMC Genomics (2015) 16:645 Page 2 of 12 with small sets of normal tissue samples obtained from distinct sample groups. For each probe set, the expression microarray experiments. Here we introduce a new com- levels of the two sample groups are summarized as the putational tool to predict marker genes from microarray mean of expression values. The marker genes can then be gene expression data. The tool is available as a stand- ranked according to a score ranging from 0 to 1, which alone version (R-package called MGFM)inBioconductor is the ratio of the second and first value in the vector of [5] and it is also integrated into the CellFinder platform mean expression values of a probe set. Values near 0 would (http://cellfinder.org/analysis/marker) to be used as an indicate high specificity and large values closer to 1 would online tool. CellFinder [6] is a comprehensive one-stop indicate low specificity. resource for diverse data characterizing mammalian cells We applied MGFM to two microarray data sets from in different tissues and in different development stages. It GEO. The first data set (#1) consists of triplicate sam- is built from carefully selected data sets stemming from ples from five human tissues (heart atrium, kidney cortex, other curated databases and the biomedical literature. liver, lung, and midbrain). The microarray data set is pub- We verified MGFM using two microarray data sets from licly available from GEO with the series number GSE3526 the GEO public repository each encompassing samples [7]. The second data set (#2) is derived from five human for similar sets of five human tissues (brain, heart, kidney, tissues (brain, heart, kidney, liver, and lung) from two liver, and lung). The accuracy of MGFM was verified with GEO Series GSE1133 [8] and GSE2361 [9] (see Table 7). known literature-curated marker genes. Using one of the Moreover, we compared the results with another tool data sets MGFM identified 72 % of the known marker genes. for tissue-specific gene identification [10] and validated Moreover, top ranked marker genes were further validated the identified markers using literature-curated markers by RT-PCR. (Additional file 1) and experimentally by RT-PCR. Results Marker selection Marker genes are identified when sample grouping of For data set #1, 12482 probe sets out of 54675 (com- the same type exist with similar expression of the prising about 23 % of all probe sets on the microarray) marker gene (see Fig. 1 for an illustrative example and were identified as potential markers associated with the Methods for details). After sorting the expression values five analyzed tissues. In data set #2 we identified 3836 of probe sets in decreasing order, a probe set is considered probe sets from 22283 as potential markers, or approxi- a potential marker of a sample type if the highest expres- mately 17 % of the probes on the microarray. In order to sion values represent all replicates of this sample type. refine the number of predicted markers, they were ranked We consider the position in the sorted expression vector according to their score (see Methods for details, Identifi- that segregates different sample types a cut-point. Each cation of marker genes). To investigate how the number of cut-point segregates elements of sample types into two selected markers changes depending on the score, we used Sort of expression values in decreasing order Identification of cut-points and calculation of mean expression values Mean value = 10.66 12 11.25 12 11.25 10.57 10.57 10.17 10.17 10 10 Mean value = 6.95 8.02 8.02 8 8 Mean value = 6.36 7.21 6.98 6.96 6.95 7.21 6.98 6.96 6.95 6.77 6.59 6.54 6.53 6.77 6.59 6.54 6.53 6.46 6.38 6.23 6.46 6.38 6.23 6 6 1 t 1 2 2 Expression value Expression value 4 4 tni nio o p p- - t t uc u 2 2 c 0 0 liver liver liver liver liver liver lung lung lung lung lung lung brain brain brain brain brain brain heart heart heart heart heart heart kidney kidney kidney kidney kidney kidney Samples Samples Possible marker of liver, score = 6.95/10.66 = 0.65 Fig. 1 An example showing how marker genes are identified by our method. The expression values correspond to the probe set "202357_s_at", which represents the gene CFB (complement factor B) El Amrani et al. BMC Genomics (2015) 16:645 Page 3 of 12 different cutoffs (Fig. 2). The number of potential markers respectively. The grey curves show the precision/recall for selected increases with less specific (higher) cutoffs. random selection. As illustrated, MGFM performed bet- ter than random selection in both data sets. Using lower Performance analysis score cutoffs results in higher precision and lower recall, To evaluate the precision of the developed tool, we whereas higher score cutoffs results in lower precision searched the literature to collect genes used extensively and higher recall. Tables 1 and 2 show the percentage as markers for cell types within a tissue. A total of 142 of probes on the microarray predicted as marker probe literature-derived genes were found for the five human tis- sets and the percentage of correctly identified marker sues (brain, heart, kidney, liver, and lung) and will here genes using different score cutoffs for data sets #1 and #2, be called real markers (Additional file 1). In addition to respectively (see Methods for details on how probe sets these markers, the cytochrome P450 genes (51 genes) were mapped to genes).