Chapter 15: Disease Gene Prioritization
Total Page:16
File Type:pdf, Size:1020Kb
Education Chapter 15: Disease Gene Prioritization Yana Bromberg* Department of Biochemistry and Microbiology, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, New Jersey, United States of America Abstract: Disease-causing aberra- chromosome 4 called huntigtin (HTT) 2. Background tions in the normal function of a [4]. Huntington’s became the first genetic The Merriam-Webster dictionary de- gene define that gene as a disease disease mapped using polymorphism fines the word ‘‘disease’’ as a ‘‘a condition gene. Proving a causal link be- information (G8 DNA probe/genetic tween a gene and a disease exper- marker), closely followed by the same of the living animal or plant body or of one imentally is expensive and time- year discovery of phenylketonuria associ- of its parts that impairs normal functioning consuming. Comprehensive priori- ation with polymorphisms in a hepatic and is typically manifested by distinguishing tization of candidate genes prior to enzyme phenylalanine hydroxylase [5]. signs and symptoms.’’ Thus, disease is experimental testing drastically re- These advances provided a route for defined with respect to normal function of said duces the associated costs. Com- predicting the likelihood of disease devel- body or body part. Note, that this definition putational gene prioritization is opment and even stirred some worries also describes the malfunction of individual based on various pieces of correl- regarding the possibility of the rise of cells or cell groups. In fact, many diseases ative evidence that associate each ‘‘medical eugenics’’ [6]. Interestingly, it can and should be defined on a cellular gene with the given disease and took another ten years for HTT’s se- level. Understanding a disease, and poten- suggest possible causal links. A fair quence to be identified and for the tially finding curative or preventive mea- amount of this evidence comes precise nature of the Huntigton’s-associ- sures, requires answering three questions: from high-throughput experimen- ated mutation to be determined [7]. (1) What is the affected function? (2) What tation. Thus, well-developed meth- The recent explosion in high-through- functional activity levels are considered ods are necessary to reliably deal put experimental techniques has contrib- normal given the environmental contexts? with the quantity of information at (3) What is the direction and amount of hand. Existing gene prioritization uted significantly to the identification of change in this activity necessary to cause techniques already significantly im- disease-associated genes and mutations. prove the outcomes of targeted For instance, the latest release of SwissVar the observed phenotype? experimental studies. Faster and [8], a variation centered view of the Swiss- Contrary to the view that historically more reliable techniques that ac- Prot database of genes and proteins [9,10], prevailed in classical genetics it is rarely count for novel data types are reports nearly 20 thousand mutations in the case that one gene is responsible for necessary for the development of 35 hundred genes associated with over one function. Rather, an assembly of genes new diagnostics, treatments, and three thousand broad disease classes. constitutes a functional module or a cure for many diseases. Unfortunately, the improved efficiency in molecular pathway. By definition, a mo- production of association data (e.g. ge- lecular pathway leads to some specific end nome-wide association studies, GWAS) point in cellular functionality via a series of has not been matched by its similarly interactions between molecules in the cell. This article is part of the ‘‘Transla- improving accuracy. Thus, the sheer Alterations in any of the normally occur- tional Bioinformatics’’ collection for quantity of existing but yet unvalidated ring processes, molecular interactions, and PLOS Computational Biology. data resulted in information overflow. pathways lead to disease. For example, While association and linkage studies folate metabolism is an important molec- provide a lot of information, incorporation ular pathway, the disruptions in which 1. Introduction of other sources of evidence is necessary to have been associated with many disorders narrow down the candidate search space. including colorectal cancer [12] and In 1904 Dr. James Herrick reported [1] Computational methods - gene prioritiza- coronary heart disease [13]. Because this the findings of ‘‘peculiar elongated and tion techniques, are therefore necessary to pathway involves 19 proteins interacting sickle shaped’’ red blood cells discovered effectively translate the experimental data via numerous cycles and feedback loops by Dr. Ernest Irons in a hospital patient into legible disease-gene associations [11]. [14], it is not surprising that there are a afflicted with shortness of breath, heart palpitations, and various other aches and Citation: Bromberg Y (2013) Chapter 15: Disease Gene Prioritization. PLoS Comput Biol 9(4): e1002902. pains. This was the first documented case doi:10.1371/journal.pcbi.1002902 of sickle cell disease in the United States. Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Forty years later, in 1949, sickle cell Baltimore County, United States of America anemia became the first disease to be Published April 25, 2013 characterized on a molecular level [2,3]. Copyright: ß 2013 Yana Bromberg. This is an open-access article distributed under the terms of the Creative Thus, implicitly, the first disease-associat- Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, ed gene, coding for beta-globin chain of provided the original author and source are credited. hemoglobin A, was discovered. Funding: YB is funded by Rutgers, New Brunswick start-up funding, Gordon and Betty Moore Foundation It took another thirty years before in grant, and USDA-NIFA and NJAES grants Project No. 10150-0228906. The funders had no role in the 1983 a study of the DNA of families preparation of the manuscript. afflicted with Huntington’s disease has Competing Interests: The author has declared that no competing interests exist. revealed its association with a gene on * E-mail: [email protected] PLOS Computational Biology | www.ploscompbiol.org 1 April 2013 | Volume 9 | Issue 4 | e1002902 What to Learn in This Chapter change, binding site obstruction, loss of ligand affinity, etc.), (3) introduction of new pathway members (e.g. activation N Identification of specific disease genes is complicated by gene pleiotropy, polygenic nature of many diseases, varied influence of environmental factors, of previously silent genes), and (4) envi- and overlying genome variation. ronmental disruptions (e.g. increased tem- peratures due to inflammation or de- N Gene prioritization is the process of assigning likelihood of gene involvement in creased ligand concentrations due to generating a disease phenotype. This approach narrows down, and arranges in malnutrition). While all members of the the order of likelihood in disease involvement, the set of genes to be tested experimentally. affected pathways can be construed as disease genes, the identification of a subset N The gene ‘‘priority’’ in disease is assigned by considering a set of relevant of the true causative culprits is difficult. features such as gene expression and function, pathway involvement, and Obscuring such identification are individ- mutation effects. ual genome variation (i.e. the reference N In general, disease genes tend to 1) interact with other disease genes, 2) harbor definition of ‘‘normal’’ is person-specific), functionally deleterious mutations, 3) code for proteins localizing to the multigenic nature and complex pheno- affected biological compartment (pathway, cellular space, or tissue), 4) have types of most diseases, varied influence of distinct sequence properties such as longer length and a higher number of environmental factors, as well as experi- exons, 5) have more orthologues and fewer paralogues. mental data heterogeneity and constraints. N Data sources (directly experimental, extracted from knowledge-bases, or text- Disease genes are most often identified mining based) and mathematical/computational models used for gene using: (1) genome wide association or prioritization vary widely. linkage analysis studies, (2) similarity or linkage to and co-regulation/co-expres- sion/co-localization with known disease genes, and (3) participation in known number of different ways in which it can be 3. Interpreting What We Know disease-associated pathways or compart- broken. The changes in concentrations ments. In bioinformatics, these are repre- Identifying the genetic underpinnings of and/or activity levels of any of the pathway sented by multiple sources of evidence, the observed disease is a major challenge members directly affect the pathway end- both direct, i.e. evidence coming from own products (e.g. pyrimidine and/or methylat- in human genetics. Since disease results experimental work and from literature, ed DNA). The specifics of a given change from the alteration of normal function, and indirect, i.e. ‘‘guilt-by-association’’ define the severity and the type of the identifying disease genes requires defining data. The latter means that genes that resulting disease; see Box 1 for discussion molecular pathways whose disrupted func- are in any way related to already estab- on disease types. Moreover, since the view tionality