BMC BioMed Central

Research article Open Access GIFtS: annotation landscape analysis with GeneCards Arye Harel*1, Aron Inger, Gil Stelzer1, Liora Strichman-Almashanu1, Irina Dalah1, Marilyn Safran1,2 and Doron Lancet1

Address: 1Department of Molecular , Weizmann Institute of Science Rehovot 76100, Israel and 2Department of Biological Services (Bioinformatics Unit), Weizmann Institute of Science, Rehovot 76100, Israel Email: Arye Harel* - [email protected]; Aron Inger - [email protected]; Gil Stelzer - [email protected]; Liora Strichman-Almashanu - [email protected]; Irina Dalah - [email protected]; Marilyn Safran - [email protected]; Doron Lancet - [email protected] * Corresponding author

Published: 23 October 2009 Received: 22 February 2009 Accepted: 23 October 2009 BMC Bioinformatics 2009, 10:348 doi:10.1186/1471-2105-10-348 This article is available from: http://www.biomedcentral.com/1471-2105/10/348 © 2009 Harel et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: annotation is a pivotal component in computational , encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including (GO), pathways, interactions, , publications and many more. Results: We present the GeneCards Inferred Functionality Score (GIFtS) which allows a quantitative assessment of a gene's annotation status, by exploiting the unique wealth and diversity of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates browsing the by searching for the annotation level of a specified gene, retrieving a list of within a specified range of GIFtS value, obtaining random genes with a specific GIFtS value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories. The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into two main groups: the high-GIFtS peak consists almost entirely of -coding genes; the low- GIFtS peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also provide measures which enable the evaluation of the that serve as GeneCards sources. An inverse correlation is found (for GIFtS>25) between the number of genes annotated by each source, and the average GIFtS value of genes associated with that source. Three typical source prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising mainly highly annotated genes, and sources comprising mainly poorly annotated genes. The degree of accumulated knowledge for a given gene measured by GIFtS was correlated (for GIFtS>30) with the number of publications for a gene, and with the seniority of this entry in the HGNC . Conclusion: GIFtS can be a valuable tool for computational procedures which analyze lists of set of genes resulting from wet-lab or computational research. GIFtS may also assist the scientific community with identification of groups of uncharacterized genes for diverse applications, such as delineation of novel functions and charting unexplored areas of the human genome.

Page 1 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

Background In the quest for revealing the function of DNA sequences, 100000 scientists have used a variety of approaches, from molecu- lar techniques targeting specific genes, to systematic anal- yses of thousands of functional units encompassed by the transcriptome, proteome, and metabolome. This hetero- 10000 geneous mass of knowledge is time-dependent, with new information constantly arising from a variety of sources. Thus, a quantitative tool for assessing annotation depth is important for directing ongoing research and for analyz- 1000 ing the emerging results. Efforts in this field have included the Genome Annotation Scores (GAS) algorithm [1], which demonstrates a quantitative methodology of Source size Source size (number of genes) assigning annotations scores at the whole genome level, 100 the GO Annotation Quality (GAQ) score, which gives a quantitative measure of GO annotations [2], and the Gene Characterization Index (GCI), which scores the extent to which a gene's functionality is described, based 10 largely on the quantification of human perception, and 0204060 applied only to protein-encoding genes [3]. We now Source introduce the GeneCards Inferred Functionality Scores (GIFtS) tool [4], which utilizes the wealth of gene annota- SourceFigure size1 tion within GeneCards [5] to quantify the degree of func- Source size. The number of human gene entries in each tional knowledge about >50,000 GeneCards entries. one of the sources contributing to the GIFtS score. Sources GeneCards is a comprehensive gene-centric compendium are shown by their rank according to their size (see addi- of annotative information about human genes, automati- tional file 1: Table S1). A source's size is defined to be the cally mined from nearly 70 data sources [6-12]. Thus, number of GeneCards entries containing annotations from GIFtS can provide quantitative annotation estimates for a this source. very large number of genes, and at a significant depth, made possible by the exploitation of dozens of annota- tion resources. To study the effect of overlap between sources, we gener- ated a pairwise overlap matrix for all 68 data sources (see Results additional file 2, 3, 4: Tables S2, S3 and Fig. S1-A). For this GIFtS definition and applications we utilized a parameter (see Methods) which assesses the We devised the GeneCards Inferred Functionality Score overlap by gene set sharing. The rational is that if, for (GIFtS) which allows a quantitative assessment of a gene's example, all information about a set of genes, such as annotation status, with potential relevance to the degree , is imported from one source to another, it of relevant functional knowledge. A GIFtS value for a gene will result in an overlap in gene sets. Based on the overlap is defined as the number of GeneCards sources, out of a matrix, we proceeded to select the 20 source pairs with the total of 68 (see additional file 1: Table S1), that include highest overlap (≥0.8), and eliminated (typically) one information about this gene (see Methods). Data sources source of each pair. Fig. 1S-B (see additional file 3: Fig.S1) have heterogeneous sizes, as estimated by the number of shows that the exclusion of overlapping sources has only human gene entries for which the source contains infor- a minor effect on the overall shape of the distribution, mation (Fig. 1), having an average of 11,404 ± 10,970 hence, by inference, on the relative positions of the major- entries per source. One of GeneCards' main aims is to ity of genes on the GIFtS scale. This may suggest that hav- incorporate overlapping sources, and perform integration ing an overlap among sources does not impose an adverse of data for different annotation fields. Considerable atten- effect on proposed functionality scale. While such a find- tion is also directed to conflicts among sources, one clear ing could be interpreted as suggesting over-representation example being the GeneLoc [8] member of the GeneCards of information items, we believe that source redundancy suite, which handles conflicts in genomic coordinates also has its advantages, as pointed out below. from Ensembl [13] and NCBI [14]. The overlap problem is particularly applicable for data extracted from genome- We also asked whether overlap might exert an advanta- wide sources such as Entrez Gene, Ensembl, GO [15], Uni- geous effect. For this, we comparatively analysed the GCI Prot [16] and InterPro [17] which are all closely linked scale with 6 data sources, i.e. having a lower (but not neg- and share some of the information presented, which may ligible) degree of source overlap, and GIFtS, which has 68 introduce a degree of redundancy. sources and a considerably higher extent of source over-

Page 2 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

lap. Focusing on the 489 genes with highest functionality sequence similarity), but also many genes that are not eas- scores in GCI, all having the same score of 10.0, we note ily defined by their symbols, such as non protein-coding that the same genes span a rather broad range of GIFtS val- genes. ues, from 51 to 84 (see additional file 5: Fig. S2-A). Thus, for the most well-annotated genes, the GIFtS scale dis- We asked what might distinguish genes with extremely plays a fine resolution not seen in GCI. We suspect that high GIFtS score (>75, right tail of the high-GIFtS peak) as this may actually arise from the diversity of overlapping compared to those at the top of the same peak (genes with sources in GIFtS. That this is the case is tentatively sup- GIFtS value of 51); we note that some sources are particu- ported by the fact that when eliminating the 21 smallest- larly enriched (>X8) in the former. Such sources include, size sources (below source size of 2200, see characteriza- for example, databases on Genetic Association (GAD), tion of data sources by GIFtS, and additional file 1: Table Cytogenetics in Oncology and Haematology (ATLAS), S1) the spanned score range for the top 500 GeneCards Human Gene (HGMD), drug response varia- genes diminishes appreciably, from 68-84 (span of 16 tions (PHARMGKB), news about biomedical research units) to 92-100 (span of 8 units) (see Methods for scale (DOCTOR'S GUIDE) genetic testing of inherited disor- shift details). When performing a similar analysis for ders (GENETESTS) and Human Genome Epidemiology 1961 low-annotation genes, all having a GCI score of (HUGE NAVIGATOR), all indicative of advanced stages in exactly 2.2, a wide range of GIFtS scores was seen again, functional gene annotation. between 2 and 34 (see additional file 5: Fig. S2-B), sug- gesting that the enhanced resolution afforded by the We subsequently clustered the GeneCards entries accord- larger number of sources is not unique to top-scoring ing to the source combinations that provide their annota- genes. tion (Fig. 2B). As an example, under the assumption of 8 clusters, the most distinct was cluster 6 (T-test for differ- GIFtS is fully implemented in GeneCards. Every Gene- ence of means, p < 0.001), with contribution from the Cards gene entry is marked with its GIFtS value. Also pro- sources AlmaKnowledge Server (AKS, [18]), Expoldb [19] vided on the GeneCards homepage is a capacity for and GeneWiki [20]. Since these sources integrate data obtaining a random gene with a specified GIFtS score which is usually available for highly annotated genes (e.g. within a given range, constituting a useful tool for brows- AKS uses text mining to find genes' interactions with ing the human gene annotation landscape. Linked from chemical compounds and their relationships to ), the GeneCards homepage is the expanded GIFtS tool, it is conceivable that they appear in this unique cluster of which affords additional functionalities such as: a) retriev- high GIFtS. We see that different parts of the GIFtS distri- ing a list of genes within a specified range of GIFtS value, bution are dominated by different clustered patterns of thereby obtaining random genes with more detailed GIFtS annotation sources. Genes with high GIFtS tend to consist specifications; b) obtaining a list of sources that did not of data from the same source combination, as shown by comprise data for a selected gene, thereby facilitating fur- the high degree of similarity among clusters of the high ther characterization for genes of interest, and potentially GIFtS peak (Fig. 2B, inset). useful for understanding poorly researched genes; c) obtaining the GIFtS value of a selected gene, which results Characterization of data sources by GIFtS in an annotation indicator graphically superimposed on GIFtS is also useful for characterizing and comparing dif- the overall distribution graph; and d) displaying GIFtS ferent database sources. Plotting source size against GIFtS distributions and statistics. values, an inverse correlation was found for GIFtS>25 (Fig. 3). While the top of the graph (labeled G) comprises GIFtS distributions general sources pertinent to a large number of different To explore the annotation landscape of the human genes (e.g. NCBI Entrez Gene [14], Ensembl [13]) the genome, we analyzed the distribution of GIFtS for all lower end of this curve includes sources that specialize in GeneCards entries (Fig. 2). The bimodal shape of the a small number of highly annotated genes (euGenes [21], GIFtS distribution suggests a division of the human gene HGMD [22]). A separate cluster (SP) arises from another repertoire into two main groups, based on the degree of class of specialized sources (e.g. miRBase [23], IMGT [24]) annotation. Dissecting the curve according to GeneCards whose genes tend to be poorly studied. The standard devi- gene categories (Fig. 2A), we find that the high-GIFtS peak ation of GIFtS values tends to be higher for the larger, consists almost entirely of protein-coding genes, while the high-GIFtS sources, reflecting annotation heterogeneity low-GIFtS peak consists of genes from all of the categories (represented by colors in Fig. 3). (see examples in Table 1). It should be pointed out that low GIFtS genes include not only those originally identi- To achieve a higher resolution of sources analysis, we cal- fied as predicted genes, e.g. CiiORFjjj (where ii is a chro- culated GIFtS distributions for all sources. Three typical mosome number) and FAMxxx genes (family with prototypical distributions are shown in Fig. 4. The first

Page 3 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

Table 1: GIFtS values for selected genes.

Gene_Symbol Category Gene_Description HGNC Approval GIFtS

1 TP53 Protein-coding Tumor protein Yes 84

2 SOD1 Protein-coding Superoxide dismutase 1, soluble Yes 82

3 NFKB1 Protein-coding Nuclear factor of kappa polypeptide gene enhancer in B-cells 1 Yes 78

4 RAC2 Protein-coding Ras-related C3 botulinum toxin substrate 2 Yes 76 (rho family, small GTP binding protein Rac2)

5 IL2 Protein-coding Yes 73

6 BRCA1 Protein-coding Breast 1, early onset Yes 72

7 CD4 Protein-coding CD4 molecule Yes 72

8 DLEU1 RNA-gene Deleted in lymphocytic leukemia 1 (non-protein coding) Yes 48

9 HMGB1L10 high-mobility group box 1-like 10 Yes 31

10 GBAP Pseudogene Glucosidase, beta; acid, pseudogene Yes 28

11 ABCC6P1 Pseudogene ATP-binding cassette, sub-family C, member 6 pseudogene 1 Yes 9

12 MIRN1224 Uncategorized MicroRNA 1224 Yes 7

13 SNORA11D* RNA gene Small nucleolar RNA, H/ACA box 11D Yes 6

14 C14orf7* Protein-coding 14 open reading frame 7 Yes 4

15 AASTH1* Genetic Allergic/atopic asthma related QTL 1 No 1

16 C6orf179* Uncategorized Chromosome 6 open reading frame 179 Yes 1

* Unknown genomic location. Well studied genes are prominent at the top of the table. Lines 8-10 depict genes with unexpectedly high GIFtS values for their category.

(illustrated by Ensembl) is genome-wide sources whose seen, different datasets show somewhat different distribu- GIFtS distribution conforms with the one seen when all tions, suggesting better or worse typical prior knowledge sources are combined together for all GeneCards entries. about the constituent genes. In each of the sets, well stud- The second relates to sources, characterizing already ied genes in the high GIFtS region may suggest a higher known genes, whose genes are mainly found in the high probability for being associated with the relevant biologi- GIFtS peak, illustrated by KEGG [25]. The third includes cal change, while genes in the low GIFtS tail may necessi- sources, characterizing new and unusual genes, whose tate further earmarked functional studies. In parallel, we genes are mainly found in the low GIFtS peak, as exempli- have performed a similar analysis for three sets of genes fied by miRBase. derived from GeneCards searches with keywords with dif- ferent expected level of annotation, and showed that GIFtS facilitates the analysis of experimental genes sets indeed the GIFtS distribution can capture such annotation Inspecting the level of annotation for a list of genes gener- differences that result from different depths of experimen- ated by experiments can supply additional insight about tal evidence (see additional files 6, 8: Fig. S3-B, Table S5). the data, and may facilitate research by suggesting targets for further study. As an example, we have generated the The power of GIFtS in aiding experimental research so as GIFtS distribution for genes whose expression was signifi- to achieve deeper understanding of a gene sets is further cantly altered vs controls in 3 different published experi- highlighted by a source-depletion analysis. For this we uti- ments (see additional files 6, 7: Fig. S3-A and Table S4). As lized, as an example, 8 sets of genes expressed in exactly

Page 4 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

A 5 25 12 100 90 G G – General 10 4 20 80 SP- Specialized

RNA gene -4 poorly annotated 8 Uncategorized 70 Pseudogene 60 3 15 6 Protein-coding 50 Genetic locus 40 4 2 30 Percent of genes Cumulative 10 Cumulative percent Cumulative

2 20 Standard Deviation 10

Genes per source X10 Genes per 1 0 0 5 B 12 SP Gene’s Clusters: 1 2 3 4 5 6 7 8 0 10 0 20 40 60 80 1 88 1 0.9 77 0.9 GIFtS (average) 8 0.8 66 0.8 0.70.7 55 0.60.6 44 6 0.50.5 CharacterizationFigure 3 of GeneCards sources by GIFtS statistics 33 0.40.4 22 Characterization of GeneCards sources by GIFtS sta- 0.30.3 1 4 1 0.2 1 2 3 4 5 6 7 8

Percent of genes 0.2 tistics. Characterization of GeneCards sources by GIFtS sta- 1462 3 5 7 8 Scalar Vector multiplication tistics. Each point represents all genes within a given 2 annotation source (circles below dashed line were excluded 0 for the production of GIFtS reduced by 21 lowest ranked 1 7 13 19 25 30 36 42 48 54 59 65 71 77 84 sources, see additional file 1: Table S1). Generating the GIFtS GIFtS distribution for each source achieved higher resolution (e.g. Fig. 4). Distribution(GIFtS)Figure 2 of GeneCards Inferred Functionality Scores Distribution of GeneCards Inferred Functionality Scores (GIFtS). (A) Dissection of GIFtS by gene categories. (B) Clustering analysis of sources of the GIFtS tool (with highest significance for cluster 6, T-test for difference of means, p < 0.05). Inner frame presents correlation between 1000 ENSEMBL clusters (highest correlation -1) calculated by scalar-multipli- 5000 900 cation of unit vectors of cluster centers. KEGG 800 4000 MIRBASE 700 two normal human tissues (see Methods) derived from 600 microarray experiments previously performed in our lab- 3000 500 oratory [12]. We asked what might be recommended for further scrutiny based on GIFtS-related evaluation of the 400 2000 information available about each of these gene sets. 300 Within the GIFtS database, each gene in every one of the Gene count (MIRBASE) eight sets is characterized by a different pattern of sources 1000 200 Gene count (KEGG, ENSEMBL) from which information is available. We compute a statis- 100 tic stemming from the set-average of such annotation val- ues, and identify the sources with the least information for 0 0 1 9 33 49 57 65 73 81 41 25 the given gene set (Fig. 5). Such paucity of representation 17 of a given source in a gene set (see additional file 9: Table GIFtS S6) can point to new directions within the researcher's set- specific studies. For example, under-representation of the GIFtSFigure fingerprint 4 source ASD [26] in the genes expressed in pancreas and GIFtS fingerprint. Distribution of GIFtS values for 3 selected sources produced three different archetypes repre- muscle indicates lack of sufficient information about senting sources encompassing: well studied genes (KEGG), all alternative splicing; under-representation of KEGG [25] in genes (Ensembl), genes which are poorly studied (miRBase). the genes expressed in the brain and muscle, similarly

Page 5 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

indicates a need to seek more information on pathways which show a declining trend to a low of GIFtS of ~15 by related data; under-representation of SwissProt [27] and 2007 (T-test for difference of means, p < 0.001). InterPro [17] in the genes expressed in the brain and thy- mus, indicates lack of sufficient information in the pro- GIFtS subsectioning and weighting tein level, including protein domains. In its simplest embodiment, used throughout this paper, the GIFtS tool regards each source as a unitary entity, irre- Another feature that assists in gauging biological results as spective of content. However, we have also installed contained within GeneCards has been introduced in the improvements within the GIFtS web tool [4], that include new GeneCards version 3 beta [28]. The user can select experimenting with increased resolution by using sub-sec- gene category or type (such as protein-coding genes or tioning of data sources and adjusting scores based on the RNA genes) as well as GIFtS when browsing the genome presence or absence of detailed annotations within a by looking for a random gene or performing an advanced source. In addition we have introduced weights related to search. the quantitative aspects of annotations items, enabling better evaluation of the data relevant to annotation levels. Time-Evolution of human genomic knowledge Sub-sectioning was performed on data from a pivotal bio- We suggest that GIFtS scores reflect the degree of known informatics source for data, SwissProt [27], functionality for a gene, i.e. its annotation depth. One whereby a user may request scores based on the presence plausible parallel measure is the number of publications or absence of detailed annotations within this source (see for a given gene, as found in its GeneCards entry with inte- Methods). Weights were introduced for publications gration of gene-publication links from seven different (based on article counts per gene) and for orthologs sources [14,16,18,29-31]. Indeed, there is ample correla- (based on the number of species for which orthologs are tion (for GIFtS>30) between GIFtS and publications shown). count, which are not part of the standard GIFtS score (Fig. 6), but the lack of perfect correlation, comprising a small With the latter modification, it is possible to perform group of 60 genes (of which only one is protein coding) explorations in the domain of low-GIFtS genes as follows: with GIFtS<13 and publications count>= 10, suggests that the user can request a comprehensive list of genes with GIFtS provides additional information. Another GIFtS GIFtS scores in a pre-selected range, activating the correlation is with HGNC gene symbol approval date orthologs and publications options. The output then indi- [30], with higher GIFtS genes tending to be older (Fig. 7). cates genes for which there are known orthologs and pub- Genes with symbols approved before 1999 show a plateau lished articles, so as to facilitate the initiation of study of GIFtS value of nearly 50, conforming to the approximate a subset of the low GIFtS gene list. median of genes in the higher peak of the distribution of GIFtS for all GeneCards entries. This is statistically signif- We note that annotation about low-GIFtS genes is often icantly different relative to genes approved after 1999, limited to high-throughput generic genome publications, a situation which may reduce the usefulness of the publi- cation-adjusted scores. We have therefore screened our id database for such generic publications by looking for arti-

1 -1.020420783 - 0.8546822851.0402137860.71675569 -0.661023876 -0.998307491 -0.539884064 - 0.3235943221.150742897 -0.545135397 -0.690623103 -0.353553391 1.217638368 1.0184474581.210152750.810632725 1.184789393 0.621649609 1.101104638 -0.04867922 0.06285299 -0.59966402 0.027509769 0.830000469 0.377811281 0.3144830390.829415531 -1.059465963 0.092552826 0.551058704 0.079728797 0.438939218-0.285633565 -1.101756035 1.184789393 1.184789393 1.181626423 - 0.3535533911.027359546 -0.320191938 #DIV/0!0.9361427 1.200062791 -1.636470214 -0.353553391 1.267438548 -0.984840454 -0.216592331 #DIV /0! 1.2650239070.559446001 1.481246237 -0.558205173 1.273645767 0.987245992 #DIV/0! - 0.409542416#DIV/0! 0.91399562 -1.54413822 1.219149782 0.81398571-1.0526764070.746175183 -0.769442133 #DIV/0!#DIV/0! 0.81398571-0.353553391

2 -0.043198176 - 0.048326139-0.8310930870.616909126 0.441726855 -0.639790106 -0.539884064 0.783945768-1.261084815-0.208757276 1.257898291 -0.353553391 -0.892554588 - 0.2240159580.268157517 0.723668295 -0.767063335 -1.194907332 0.007456691 1.3907549320.038759344-0.533435955 0.311610661 -0.199037633 -1.242944361 0.655528253-0.1956276631.221010273 1.052774928 0.451072795 -0.772593214 - 0.618922663-0.9672209640.5717194 -0.767063335 -0.767063335 -1.716948891 -0.3535533911.2277993170.62894845 #DIV/0!-0.091260493 - 1.03025454 - 0.275696942-0.353553391 -0.344100963 1.381720526 -0.802591407 #DIV /0! -0.344092250.132733234 -0.033411569 0.964994368 -0.468907535 -1.034699822 #DIV/0! - 0.001649044#DIV/0! -0.378906684 -0.028176381 -1.463137149 0.7235428540.818471847 -1.312003328 1.004946462 #DIV/0!#DIV/0! 0.723542854-0.353553391 cles correlated with many genes. We found that only 25

3 0.319811708 0.824305854-0.8310930870.017829744 -0.527999749 -0.173513222 -0.539884064 0.1540925341.2026308240.10226571 -1.457008125 -0.353553391 0.178225756 1.611575048 -0.844768805-0.779574 0.341943896 -0.599314892 -0.529425029 0.100315963-0.787757477 0.860464809 -0.958173606 -1.477211276 -0.658888273 - 0.733575139-1.476931656 -0.204423355 0.135650766 -2.312174148 0.39958474 - 0.2148781940.1402986020.43553235 0.341943896 0.341943896 0.23501962 - 0.3535533911.2277993170.342068406 #DIV/0!-0.424918764 - 1.03025454 0.356122475-0.353553391-1.27890858 0.816384489 -0.326888134 #DIV /0! - 1.280827726-0.402132539 -0.389801641 -0.134980051 0.098552186 -0.382725458 #DIV/0! - 0.001649044#DIV/0! 0.857228404 0.141123761 0.350239931 - 0.775224486-0.955635474 -0.135901322 1.004946462 #DIV/0!#DIV/0! - 0.775224486-0.353553391

4 -0.075143046 - 0.366450207-1.072552039 -2.136003275 0.478356687 -0.894122952 -0.539884064 - 1.049755389-0.202186752-0.832493591 -1.046265826 -0.353553391 1.28892867 - 0.090684797-0.844768805-0.69260957 -1.691236028 -1.413291226 - 1.26515183 -1.13129229 -1.956823094 -0.923248126 -1.367652356 -1.401387416 -1.461965394 - 1.364508785-1.403714285 -1.059465963 -0.615977307 0.323818001 -0.385003072 -1.860441119 0.643454958 -1.010404228 -1.691236028 -1.691236028 -1.266494619 - 0.353553391-0.433229 -1.380206242 #DIV/0! -1.792587323 -1.324250915 - 0.045782401-0.353553391 -1.433754014 -0.263587246 -0.248586533 #DIV /0! - 1.430245655-0.372636265-0.287975906 -0.435346618 -1.292477395 -1.274032184 #DIV/0! - 0.676598479#DIV/0! 0.250166748 - 1.20166822 - 1.04127512 - 0.6977020380.081956045 -1.551326411 -0.626091672 #DIV/0!#DIV/0! - 0.697702038-0.353553391 publications are linked to >1000 genes and that 51 publi- 5 -1.132227827 -1.731732668 0.6780253590.217522871 -0.462451629 -0.328768614 -0.539884064 - 1.136432439-0.961928496 0.829112906 0.857174096 -0.353553391 0.250941865 -1.692619479 -0.844768805 0.326116613 0.926021039 1.167609345 -0.350464456 - 0.974623535-0.0607578910.324417513 -0.303274367 0.830000469 1.093279988 - 1.3199105650.8294155310.460972399 -2.128715 0.9055542 -1.624915226 0.189166274-0.967220964 -0.153490651 0.926021039 0.926021039 0.887851898 - 0.353553391 -0.22221751 0.150649312 #DIV/0!0.206058759 0.764137131 - 1.117825784-0.3535533910.65379183 -1.031179474 -1.310288882 #DIV /0! 0.655858505-1.149371486 0.895748261 -0.257427708 1.042644465 1.597956157 #DIV/0! 0.822694903#DIV/0! -1.367698904 0.198202094 0.954698958 0.323010203-1.434619568 1.368415198 -1.557869669 #DIV/0!#DIV/0! 0.323010203-0.353553391 cations are linked to >500 genes. Genes whose only pub- 6 1.509032087 1.4693907711.221308 0.959240202 1.324698595 1.365762276 1.667635229 1.0459030760.449294996 1.774014261 0.932309883 2.474873734 -0.580303063 - 0.624663024 1.583877082 1.046679036 0.852087223 0.313926848 1.101104638 - 0.6619254890.6683367961.416069386 1.319968756 0.830000469 0.523825303 0.7906346260.829415531 -1.059465963 0.852800487 0.323818001 0.854909082 1.658418885 -0.9672209640.373883891 0.852087223 0.852087223 0.287246202 2.474873734 -1.066263468 1.542290981 #DIV/0! 1.60676279 0.632345652 1.1786462 - 0.353553391 1.015097842 1.052986071 0.444340535 #DIV /0! 1.017909641.334214805 0.730281442 1.965120529 0.972339721 0.64062671 #DIV/0! 2.022129608 #DIV/0! -1.367698904 0.03277167 0.872845131 1.0465530560.9416391850.39060946 0.306112964 #DIV/0!#DIV/0! 1.046553056-0.353553391 7 1.267993524 0.203522082-0.943199029 -0.695359998 1.066361885 1.670655266 1.571669154 1.4099466880.64787842 0.305106788 -0.159663546 -0.353553391 0.151135441 - 0.1602915060.316887871 -1.835570653 -0.212559719 -0.063281697 -1.165729289 1.530797615 1.1250685240.846685815 1.436009965 -0.242365553 0.275601466 1.235305117 -0.24138852 1.320003755 0.446817893 -0.330635222 1.637615389 0.0495872760.57342557 1.801325931 -0.212559719 -0.212559719 0.35252943 - 0.353553391-0.9284082660.416647273 #DIV/0!-0.206884646 -0.046887353 1.151911952-0.353553391 0.303955851 0.122934693 0.456127873 #DIV0! / 0.3053010561.226061799 -1.624438677 -1.235365368 -0.564321116 -0.465253858 #DIV/0! - 0.693356336#DIV/0! 1.211733874 1.537124357 -0.229033303 -1.834697951 0.99513611 -0.026496484 1.116299052 #DIV/0!#DIV/0! -1.834697951 -0.353553391 lications belong to such a category, and that are also 8 -0.825847485 0.5039725910.7383900970.30310564 -1.659668769 -0.001915157 -0.539884064 - 0.884105915-1.025347073 -1.424113402 0.30617833 -0.353553391 -1.61401245 0.162252258-0.844768805 0.400657554 -0.633982467 1.167609345 1.101104638 - 0.2053479750.910320808 -1.391289422 -0.465998822 0.830000469 1.093279988 0.4220434530.829415531 0.380834817 0.164095406 0.08748767 -0.189326495 0.358130324 1.830117325 -0.916810658 -0.633982467 -0.633982467 0.039169937 - 0.353553391-0.832839936-1.380206242 #DIV/0!-0.233313024 0.835101773 0.389094715 2.474873734 -0.183520514 -1.094418606 2.004478879 #DIV0! / - 0.188927477 -1.32831555 -0.771648147 -0.30878998 -1.061476093 -0.069117536 #DIV/0! - 1.062029191#DIV/0! -0.118820155 0.864760938 -0.663488229 0.4005326510.605728262 0.520527705 -0.478901466 #DIV/0!#DIV/0! 0.400532651 2.474873734 Sources represented in a large fraction (say 50% or higher) of all such publications, contain GIFtS scores of ~1-2, highlight- SourcesFigure 5enrichment analysis ing them as extremely low GIFtS genes. For genes in the Sources enrichment analysis. The pattern of each of the higher GIFtS realm, the publication adjustment is almost sources enrichment for each of eight gene sets expressed negligible. exactly in two tissues (see additional file 10: Table S7) was calculated based on the significance (p < 0.07) of each of the source enrichments in each set. Blue and red squares indi- Discussion cate over or under representation of source in set, respec- GIFtS provides a quantitative tool for assessing annota- tively (see additional file 9: Table S6). Ids indicate co- tion depth of every gene in the human genome based on expression in: 1. Brain and Muscle; 2. Brain and Pancreases; 3. GeneCards' rich compendium of information sources. Brain and Prostate gland; 4. Brain and Thymus; 5. Lung and Three previously published relevant efforts focus only on Kidney; 6. Lung and Spleen; 7. Muscle and Bone-Marrow; 8. protein coding genes; two of them are mainly aimed at Pancreases and Muscle. gene sets and utilize more limited information sources. The first, the Genome Annotation Scores (GAS) algorithm

Page 6 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

and annotation depth. In recent years, several studies have highlighted the important role of non protein-coding 4 10 genes (such as RNA genes) in cellular function [32-34]. Therefore, it is important to develop bioinformatics tools such as GIFtS which are applicable to all types of genes. 3 10 However, in a world of fleeting gene annotations, it appears that GCI and GIFtS are complementary to each other: while GCI contains about 33,000 entries, all 2 10 defined as protein coding [3], GIFtS (via GeneCards) addresses only ~22,000 genes defined to be protein cod- ing (based on , refseq, and/or ensembl evidence), 1 along with another ~10,000 non-protein-coding genes 10 (categorized as , RNA genes, or disorder loci),

Publications count and another ~10,000 which are still uncategorized).

0 Future scrutiny could help resolve this dichotomy. 10 The usability of GIFtS benefits in significant ways from its being embedded within GeneCards. As of the introduc- -1 10 tion of GeneCards version 3 beta in May 2009, it allows 0 20 40 60 80 100 advanced searches that combine gene category and data GIFtS score section (e.g. expression, pathways, disorders) with GIFtS range. Furthermore, GIFtS enjoys the capacities of the CorrelationFigure 6 of publication count and GIFtS Correlation of publication count and GIFtS. Correla- GeneCards batch query engine, GeneALaCart, whereby tion between GIFtS scores and the count of publications to the user may receive tables with numerous genes, along each GeneCards gene. with their GIFtS scores and a variety of annotation items, including IDs. Such queries can be keyed by HGNC official gene symbols or by several alias types, [1], is broader in scope, since it addresses all species including Entrez Gene IDs (used in GCI queries), as well present in SwissProt [27]. Using 5 major information as Ensembl and SwissProt IDs. areas, literature, curation, sequence, structure and experi- ment, it provides only an average annotation score for all The GIFtS scoring concepts demonstrate a generic system protein-coding genes in a given species. The second tool, for evaluating genomic annotation that could be applied the GO Annotation Quality (GAQ) score [2] is a quantita- to a variety of species, providing that their annotation data tive annotation measure based on the number, detail and is derived from multiple sources. This becomes more evidence type of GO annotations available for a gene. important for the many species with only minimal anno- GAQ uses only one information source, and in contrast to tation, and in the context of the current development of GIFtS, focuses on assessing the annotation for an entire set high throughput, extremely rapid DNA sequencing. of gene entries. The third tool, the Gene Characterization Index (GCI) [3] assesses the characterization status of Currently GeneCards focuses only on human genes, but it individual human genes using annotations from six infor- is rich in annotation data derived from other species., In mation sources, combined with the count of articles per addition to an integrated ortholog section, it has signifi- each gene based on data from SwissProt [27] and NCBI cant data from certain mammals, especially mouse, where Entrez Gene [14]. It is based on an original concept, we include in the function section mouse phenotypes whereby the perception of human researchers about the from Mouse Genome Informatics (MGI), based on the annotation status of a subset of genes is used as training strong functional overlap between the species [35], as well data for developing a procedure to assign scores to all as mouse-related reagents (such as antibodies in the pro- human genes. GCI allows one to identify sets of low-scor- teins section). This should serve as infrastructure for the ing genes suited for experimental investigation as drug tar- ambitious undertaking of a bona-fide extension of Gene- gets. GIFtS, on the other hand, offers a non-biased Cards to other species, allowing full-fledged application approach which does not involve human perception, of GIFtS to such . This will allow, among others, scores all human genes (whether protein coding or not), comparisons of the landscape of annotations of the and is based on a much larger set of data sources. human genome, having a Genome Annotation Score (GAS) of 13081 [1], to those of genomes from other spe- One of the obvious advantages of the GIFtS scoring cies, such as Mus musculus (GAS = 5644), as well as less method is its applicability to all genes, irrespective to type studied species such as Bos Taurus with GAS<4000.

Page 7 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

ing trends in human genomic knowledge, performing

80 3500 database maintenance and refinement, and improving computational procedures for analyzing sets of genes

70 3000 resulting from wet-lab or computational research. In addi- tion, GIFtS may also assist the scientific community with 60 identification of groups of uncharacterized genes and 2500 fields of knowledge which are less studied for diverse 50 applications, such as delineation of novel functions and 2000 charting unexplored areas of the human genome. 40 1500

30 Gene Count Methods

GIFtS (average) The GIFtS scores presented here are based on the data in 1000 20 GeneCards version 2.39, comprising 52,524 gene entries, encompassing 22,690 protein coding genes, 7,776 RNA 10 500 genes, and 9,300 pseudogenes. GeneCards has recently been updated, but we expect only minor changes in anal- 0 0 yses described in this manuscript. 2001 2005 2003 2007 1989 1991 1993 1999 1995 1997 HGNC approval date The 68 sources used for the generation of GIFtS are shown in Table 1S (see additional file 1). The information about Time-dependenceFigure 7 of annotation levels the presence or absence of gene annotation in Ns = 68 Time-dependence of annotation levels. Correlation sources was extracted from the GeneCards text files by between Human Gene Nomenclature Committee (HGNC) approval dates [39] and GIFtS average (bars represent stand- GeneQArds, the GeneCards in-house quality assurance ard deviation). tool (available upon request) which is based on a collec- tion of programs used for data mining and statistics.

We define Sij as a binary Ns-dimensional matrix, whose j- Two projects currently running in our laboratory provide th columns depicts the absence or presence of informa- clear examples for the utility of GIFtS. One is GeneDecks tion in data source j. The GIFtS scalar value G for a gene i, [36,37], a relatively new member of the GeneCards suite. i = One of its tools, Set Distiller, aims to detect common j 68 is defined as GWSN= (/)∑ . Where W is a source attributes for a set of genes. The use of GIFtS enables the ijijs j j=1 profiling of gene sets, and helps choose the proper GIFtS- matched control sets for validation. The second project weight. For the standard GIFtS scale, all weights are set to scrutinizes protein interaction networks in the realm of 1; for the experimental web tool, users can change the synthetic lethality [38]. The GIFtS score of each gene was weights for a subset of the annotations. GIFtS vectors, cal- used to measure the confidence in the known protein culated for each of the GeneCards genes, are stored in a interactions for that gene, inferring an enhanced accuracy MySQL 5.0 database (Sun Microsystems, Santa Clara, CA) of known protein interactions for genes with a higher for further analyses and dissections (e.g. extraction of the GIFtS score. This allows one to generate a weighted pro- data for the GIFtS distribution of each gene category). It tein interaction network where high-confidence interac- should be noted that the maximal score depends on the tions receive high weights. maximal fraction of the total number of sources available Further, as exemplified here, we believe that understand- for the highest scoring genes. When analysing source ing the GIFtS profiles for lists of genes that result from elimination for understanding the effect of overlap, we transcription profiling experiments may help experimen- note that after eliminating 21 sources the maximal score tal biologists choose or eliminate gene candidates for rises from 84 to 100. future research. Finally, dissections of GIFtS data could be valuable for database construction, in decisions related to Access to modified GIFtS scores weighted by protein sub- the incorporation of new gene category (e.g. RNA genes) sections, ortholog counts and publication counts is avail- or in the selection of additional or most relevant data able in the GIFtS home page [4]. In order to enrich GIFtS sources. with respect to protein data, we selected the pivotal bioin- formatics source for such data, namely SwissProt [27], Conclusion and dissected it into 6 sub sources: protein subunit, sub GIFtS yields facile tools for navigating the human gene cellular location, post-translational modification, func- annotation landscape, which could be useful in describ- tion, catalytic activity, and other. Each of these subfields

Page 8 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

received a binary score as described above, thereby data from each one of the GeneCards sources, SS (source increasing the GIFtS vector size by 5. To weight proteins size); the number of genes entries in GeneCards, TG (total effectively in the new vectors, the sum of the binary data genes); for each set, a vector consisting of the number of was still divided by the original number of sources (with genes in each of the GeneCards sources was calculated, SwissProt treated as 1 source for this denominator, in SSSS (set specific source size); the number of genes in the spite of its sub sources contributions to the numerator). set, GCS (genes count in set). These parameters enabled To enrich GIFtS by orthologs or publications data, we calculating a vector of "source enrichment " for each of the define a new differential score for each of those compo- eight gene sets. The "source enrichment" was calculated nents, which is then added to the default GIFtS to generate for each of the GeneCards sources based on the following an adjusted score. Specifically, the orthologs and publica- fraction: (SSSS/SS)X(TG/GCS). The significant of each of tions differential scores for each gene are calculated as the "source enrichments" was evaluated by Z value, based round (logxsum(i)), where x equals 3 for orthologs and 5 on the distance from the average value divided by the for publications, and sum(i) is the count of relevant standard deviation (average and standard deviation were orthologs or publications. Genes with no orthologs or calculated for all eight sets). publications receive a differential score of zero for the - evant component(s); differential scores rounded down to Sources enrichment analysis was also used to compare 0 (for low counts) are set to 1 to distinguish them from a two gene sets (with GIFtS values of 51 or greater than 74) state of true 0 publication or orthologs. It should be noted and is available upon request. that in the adjusted scores, the increment due to the addi- tion of one or more differential scores may be rather Abbreviations small, as the intention is to only weigh the relevant counts GIFtS: GeneCards Inferred Functionality Score; SNP: Sin- in, and not to make them dominate the final outcome. gle Nucleotide Polymorphism; GAQ: GO Annotation Quality; GAS: Genome Annotation Scores; GCI: Gene Clustering of genes according to their GIFtS vector values Characterization Index. was performed by the k-means algorithm in MATLAB 7.5 (R2008a). The statistical significance of each cluster was Authors' contributions calculated by using the T-test function in Microsoft Excel AH developed the GIFtS scores and the GIFtS web-based (Microsoft, Bellevue, WA). The HGNC approval date for tool, performed most of the analyses, and wrote the bulk gene entries was downloaded from the HGNC custom of the paper. AI participated in: cluster analysis, dissection download page [39]. The significance of the difference of of genes by statistical parameters, analyzing overlap average GIFtS between early and late approved genes was between sources, and evaluation of the results. GS devel- calculated by using the T-test function in Excel. oped GeneDecks' Set Distiller, and participated in the analysis of wet-lab transcriptional profiling experiments To study the degree of overlap between a pair of sources, of expression levels from different human tissues. LSA S1 and S2, we computed the fraction of shared genes as developed GeneQArds, the GeneCards in-house Quality |Gs1∩Gs2|/|Gs1∪Gs2|, where Gi is the set of genes appearing Assurance tool used in this work. ID pivotal member in in source i. the development of the GeneCards V3 database, partici- pated in the analysis of GCI scores and the comparison to The five Microarray-based gene lists originated from the GIFtS data. MS is the head of the GeneCards development ArrayExpress database http://www.ebi.ac.uk/microarray- team, contributed to the weighted scores, the evaluation as/ae and appearing in: [40,41]. of the results, and to the writing of the paper. DL is the principal investigator, helped initiate the GIFtS idea, and The three keywords-based gene lists were retrieved by contributed to the evaluation of the results and to the searching for the terms enzyme*, develop*, in the sum- writing of the paper. All authors read and approved the maries section and for cancer* in the disorders section final manuscript. using the advanced search of GeneCards version 3 beta [28]. Additional material

Sources enrichment analysis was studied on sets of genes derived from whole genome expression studies in 12 Additional file 1 Table S1- The size (number of GeneCards entries) of GeneCards sources human tissues, previously performed in our lab [12]. [42] which were used for generating the GIFtS scores. Eight sets of genes expressed exactly in two tissues, (based Click here for file on their binary expression pattern [12]) were selected for [http://www.biomedcentral.com/content/supplementary/1471- further research. The following parameters were extracted 2105-10-348-S1.DOC] using MySQL and PHP scripts: the number of genes with

Page 9 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

Tsippi Iny Stein (who incorporated GIFtS into GeneALaCart), Justin Alex- Additional file 2 ander (currently working on the assimilation of GIFtS in GeneCards Ver- Table S2 - Pairwise analysis for the degree of overlap between all GIFtS sion 3), Emily Brewster (Quality Assurance), and the reviewers. Funding sources. was provided by Xennex Inc., Boston MA; the Weizmann Institute's Crown Click here for file Human Genome Center, Phyllis and Joseph Gurwin Fund for Scientific [http://www.biomedcentral.com/content/supplementary/1471- Advancement; and the EU Specific Targeted Research Project consortium 2105-10-348-S2.XLS] "Regulatory Control Networks Synthetic Lethality" (SYNLET).

Additional file 3 References Fig. S1 - Effect of reducing overlapping source on distribution of GIFtS 1. European Molecular Laboratory (EMBL)- European Click here for file Bioinformatics Institute (EBI) - Genome Annotation Scores [http://www.biomedcentral.com/content/supplementary/1471- (GAS) [http://www.ebi.ac.uk/integr8/HelpAc tion.do?action=searchById&refId=60] 2105-10-348-S3.PPT] 2. Buza TJ, McCarthy FM, Wang N, Bridges SM, Burgess SC: Gene Ontology annotation quality analysis in model eukaryotes. Additional file 4 Nucleic Acids Res 2008, 36(2):e12. Table S3 - GIFtS sources ids. 3. Kemmer D, Podowski RM, Yusuf D, Brumm J, Cheung W, Wahlest- edt C, Lenhard B, Wasserman WW: Gene characterization Click here for file index: assessing the depth of gene annotation. PLoS ONE 2008, [http://www.biomedcentral.com/content/supplementary/1471- 3(1):ce1440. 2105-10-348-S4.XLS] 4. GeneCards- GIFtS tool [http://www.genecards.org/GIFtS.shtml] 5. GeneCards [http://www.genecards.org] 6. Chalifa-Caspi V, Shmueli O, Benjamin-Rodrig H, Rosen N, Shmoish M, Additional file 5 Yanai I, Ophir , Kats P, Safran M, Lancet D: GeneAnnot: interfac- Fig. S2 - Distribution of GIFtS for genes with specific GCI scores [43] ing GeneCards with high-throughput gene expression com- Click here for file pendia. Brief Bioinform 2003, 4(4):349-360. [http://www.biomedcentral.com/content/supplementary/1471- 7. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a 2105-10-348-S5.PPT] novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998, 14(8):656-664. Additional file 6 8. Rosen N, Chalifa-Caspi V, Shmueli O, Adato A, Lapidot M, Stamp- Fig. S3 - Distribution of GIFtS for gene sets nitzky J, Safran M, Lancet D: GeneLoc: exon-based integration of Click here for file human genome maps. Bioinformatics 2003, 19(Suppl 1):i222-224. 9. Safran M, Chalifa-Caspi V, Shmueli O, Olender T, Lapidot M, Rosen [http://www.biomedcentral.com/content/supplementary/1471- N, Shmoish M, Peter Y, Glusman G, Feldmesser E, et al.: Human 2105-10-348-S6.PPT] Gene-Centric Databases at the Weizmann Institute of Sci- ence: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Additional file 7 Res 2003, 31(1):142-146. 10. Safran M, Solomon I, Shmueli O, Lapidot M, Shen-Orr S, Adato A, Table S4- Lists of genes derived from microarray experiments. Ben-Dor U, Esterman N, Rosen N, Peter I, et al.: GeneCards 2002: Click here for file towards a complete, object-oriented, human gene compen- [http://www.biomedcentral.com/content/supplementary/1471- dium. Bioinformatics 2002, 18(11):1542-1543. 2105-10-348-S7.XLS] 11. Shklar M, Strichman-Almashanu L, Shmueli O, Shmoish M, Safran M, Lancet D: GeneTide--Terra Incognita Discovery Endeavor: a new transcriptome focused member of the GeneCards/ Additional file 8 GeneNote suite of databases. Nucleic Acids Res 2005:D556-561. Table S5- Lists of genes derived from keywords search in GeneCards. 12. Shmueli O, Horn-Saban S, Chalifa-Caspi V, Shmoish M, Ophir R, Ben- Click here for file jamin-Rodrig H, Safran M, Domany E, Lancet D: GeneNote: whole [http://www.biomedcentral.com/content/supplementary/1471- genome expression profiles in normal human tissues. C R Biol 2003, 326(10-11):1067-1072. 2105-10-348-S8.XLS] 13. Ensembl [http://www.ensembl.org/index.html] 14. Entrez gene [http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene] Additional file 9 15. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Table S6 - Under and over significantly-represented GIFtS sources in each Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Con- of the eight gene sets (see Methods) sortium. Nat Genet 2000, 25(1):25-29. Click here for file 16. Universal Protein Resource (UniProt) [http://www.uni [http://www.biomedcentral.com/content/supplementary/1471- prot.org/] 2105-10-348-S9.XLS] 17. InterPro [http://www.ebi.ac.uk/interpro/] 18. AlmaKnowledge Server2 (AKS2) [http://www.bioalma.com/ aks2/index.php] Additional file 10 19. EXPOLDB: a database of expression variation in blood leu- Table S7 - List of genes expressed in exactly two tissues. kocytes in monozygotic twins and unrelated individuals Click here for file [http://expoldb.igib.res.in/] [http://www.biomedcentral.com/content/supplementary/1471- 20. Gene Wiki [http://en.wikipedia.org/wiki/Gene_Wiki] 2105-10-348-S10.XLS] 21. euGenes- Genomic Information for Eukaryotic Organisms [http://iubio.bio.indiana.edu:8089/] 22. Human gene mutation database (HGMD) [ http:// www.hgmd.cf.ac.uk/ac/index.php] 23. miRBase Sequence Database [http://microrna.sanger.ac.uk/ sequences/index.shtml] Acknowledgements 24. The international ImMunoGeneTics information system We thank Prof Yoram Groner whose inquiry seeded the idea of GIFtS, (IMGT) [http://imgt.cines.fr/] Naomi Rosen (who incorporated GIFtS into the display of each GeneCard),

Page 10 of 11 (page number not for citation purposes) BMC Bioinformatics 2009, 10:348 http://www.biomedcentral.com/1471-2105/10/348

25. Kyoto Encyclopedia of Genes and Genomes (KEGG) [http:// www.genome.ad.jp/kegg/] 26. Alternative Splicing Database Project [http://www.ebi.ac.uk/ asd/] 27. Swiss-Prot Protein knowledgebase (Swiss-Prot) [http:// www.expasy.org/sprot/] 28. GeneCards version 3 Beta [http://www.genecards.org/v3/] 29. Genetic Association Database (GAD) [http://geneticassocia tiondb.nih.gov/] 30. Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA, Lush MJ: The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res 2006:D319-321. 31. The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) [http://www.pharmgkb.org/] 32. Birney Eea: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799-816. 33. Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M: What is a gene, post-ENCODE? History and updated definition. Genome Res 2007, 17(6):669-681. 34. Washietl S, Hofacker IL, Lukasser M, Huttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding in the human genome. Nat Biotechnol 2005, 23(11):1383-1390. 35. Turgeon B, Meloche S: Interpreting neonatal lethal phenotypes in mouse mutants: insights into gene function and human diseases. Physiol Rev 2009, 89(1):1-26. 36. GeneDecks [http://www.genecards.org/v3/index.php?path=/Gene Decks] 37. Stelzer G, Inger A, Olender T, Iny-Stein T, Dalah I, Harel A, Safran M, Lancet D: GeneDecks: paralog hunting and gene-set distilla- tion with GeneCards annotation. OMICS 2009, 13(6):. 38. Platzer A, Perco P, Lukas A, Mayer B: Characterization of pro- tein-interaction networks in tumors. BMC Bioinformatics 2007, 8:224. 39. HGNC- download [http://www.genenames.org/cgi-bin/ hgnc_downloads.cgi] 40. Greenawalt DM, Duong C, Smyth GK, Ciavarella ML, Thompson NJ, Tiang T, Murray WK, Thomas RJ, Phillips WA: Gene expression profiling of esophageal cancer: comparative analysis of Bar- rett's esophagus, adenocarcinoma, and squamous carci- noma. Int J Cancer 2007, 120(9):1914-1921. 41. Saetre P, Emilsson L, Axelsson E, Kreuger J, Lindholm E, Jazin E: Inflammation-related genes up-regulated in schizophrenia brains. BMC Psychiatry 2007, 7(1):46. 42. GeneCards- sources [http://www.genecards.org/sources.shtml] 43. Gene Characterization Index Project [http://www.cisreg.ca/ gci/]

Publish with BioMed Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright

Submit your manuscript here: BioMedcentral http://www.biomedcentral.com/info/publishing_adv.asp

Page 11 of 11 (page number not for citation purposes)