A Computational Approach to Candidate Gene Prioritization for X
Total Page:16
File Type:pdf, Size:1020Kb
Lombard et al. Biology Direct 2011, 6:30 http://www.biology-direct.com/content/6/1/30 RESEARCH Open Access A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis Zané Lombard1*†, Chungoo Park2,3†, Kateryna D Makova2 and Michèle Ramsay1 Abstract Background: Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches - the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, in accompaniment with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results: The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification was achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA is integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly-ranked XLMR methods. Conclusions: The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlight an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR. Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi). * Correspondence: [email protected] † Contributed equally 1Division of Human Genetics, School of Pathology, Faculty of Health Sciences, National Health Laboratory Service & University of the Witwatersrand, Johannesburg, 2000, South Africa Full list of author information is available at the end of the article © 2011 Lombard et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Lombard et al. Biology Direct 2011, 6:30 Page 2 of 17 http://www.biology-direct.com/content/6/1/30 Background In an attempt to increase the probability of correctly The identification of genes and genetic variants that identifying disease related genes by producing a short result in disease, or contribute to disease susceptibility, list of highly probable candidate genes, a combined is a critical objective in medical research. Such findings approach could reduce uncertainty. This could be done have contributed to improvements in diagnosis, prog- by combining multiple independent lines of evidence, nosis and therapy [1]. The typical approach taken for each by itself lacking sufficient power. In this paper, a disease gene discovery for monogenic traits involves the previously described method of candidate gene selection identification of affected families, genotyping and linkage based primarily on gene annotation [25] is complemen- analysis. Subsequent fine mapping of the identified ted by the evaluation of pertinent DNA sequence motifs linked region is performed to focus the candidate region or signatures to refine the gene selection approach [24]. and reduce the number of putative candidate genes, and The intercept between these approaches is evaluated, mutation detection is done to uncover the genetic cause and the usefulness of combining the two is discussed. of the disorder [2,3]. This approach has been success- We have applied this approach to X-linked mental retar- fully used to pinpoint the genetic contributors for over a dation (XLMR), a group of related but heterogeneous thousand diseases, including Huntington disease [4], disorders for which some of the underlying genetics is Duchenne muscular dystrophy [5] and cystic fibrosis [6]. known (reviewed in [26]). With the successful identification of disease genes for many single-gene disorders, the focus has shifted to dis- Results eases with a complex, multifactorial etiology [7,8]. The Annotation-based gene prioritization using a binary candidate gene approach has often been used in the filtering method search for complex disease genes, but with the advent of The complete set of X chromosome genes (a total of massively parallel sequencing and genotyping, genome- 814 X-linked genes; Ensembl v49) was subjected to can- wide approaches (such as genome-wide association stu- didate gene selection for XLMR using a previously dies [GWAS]) are starting to take precedence [9]. How- described method [25]. Gene annotation terms found to ever, these approaches can result in large sets of be relevant to XLMR were identified through literature potentially implicated genes, with the challenge then and data-mining (a total of 40 terms; summarized in being to identify the actual genes involved with disease Table 1). Each term was used as a selection criterion to pathogenesis - a potentially laborious and costly populate a gene list (containing all genes within the exercise. Ensembl database annotated with that term). Candidate Recently, many computational candidate gene selec- genes were then prioritized using a binary evaluation tion and prioritization methods have been developed, in grid that assessed the term gene lists against the X- part to provide new avenues to pinpoint disease genes linked gene list, and genes were scored (see Methods for or to prioritize genes from a large list of candidates details) accordingly, and the X-linked gene list was sub- [10-21]. Most often, one of two approaches is taken to sequently prioritized for putative involvement in XLMR. identify and prioritize putative disease genes - either X-linked genes with the most matches to the annota- investigating similarities (such as sequence resemblance) tion-derived gene lists (27/40 being the most matches to known disease genes or evaluating functional annota- for any gene) were ranked as strong candidate XLMR tion of genes. The first is based on the premise that dif- genes. Figure 1 shows the relative enrichment for known ferences exist between disease genes, and other human XLMR genes within the prioritized list, both cumula- genes, including differences in gene sequence and struc- tively and as percentage enrichment. Table S1 (Addi- ture, which can then be utilized to single out candidate tional File 1) shows the prioritization of X-linked genes disease genes [13,14,22,23]. This information can also be as XLMR candidates using this approach, serving as a used for other objectives such as the discovery of test for the sensitivity of this approach. The genes sequences important in X-chromosome inactivation, and ranked highly by this method (i.e. more terms matched, the subsequent prediction of expression status of indivi- therefore stronger support of being a likely candidate) dual genes [24]. The second is an annotation-based have a higher enrichment for known XLMR genes than approach that centers on the hypothesis that similar dis- for non-XLMR genes. Conversely, the enrichment eases may be influenced by genes with comparable fea- becomes moderated as one moves lower down the tures (such as function). This method relies heavily on prioritized list. the terms used to describe genes and their related pro- By considering the higher ranking genes on the priori- ducts, and standardized ontologies utilized across data- tizedlist(signifiedbyanarbitrary cut-off of 10/40 bases (such as Gene Ontology (GO) and eVOC) are matches),itisobservedthat66%(54/82)ofknown imperative to the efficiency of this approach. XLMR genes are present. This signifies a two-fold Lombard et al. Biology Direct 2011, 6:30 Page 3 of 17 http://www.biology-direct.com/content/6/1/30 Table 1 Annotation terms identified to be pertinent to XLMR using a literature- and data-mining approach ANNOTATION TERM CATEGORIES1 Anatomical site Biological Process Phenotype Animal model homology Developmental Development Seizures Phenotype Liver Transcription Epilepsy Behaviour/Neurological Central nervous system Metabolism Acidosis Nervous system related Respiratory Phosphorylation Microcephaly Embryogenesis Cerebellum Brain development Tremor Kidney Timing Hippocampus Pre-Embryonic Spinal cord Embryonic Cerebral cortex Fetal Testis Brain stem Anatomy Peripheral nerve 2TS8-9 Ectoderm Cerebrum TS10-13 Neural Ectoderm Substantia nigra TS14-26 CNS Cardiovascular Adrenal gland Thyroid Ovary Amygdala Musculoskeletal Ganglion Hypothalamus 1These terms were used to extract gene lists that were compared to all X chromosome genes in a