University of Massachusetts Medical School eScholarship@UMMS

Open Access Articles Open Access Publications by UMMS Authors

2009-10-02

Genetic region characterization ( RECQuest) - software to assist in identification and selection of candidate from genomic regions

Rajani S. Sadasivam University of Massachusetts Medical School

Et al.

Let us know how access to this document benefits ou.y Follow this and additional works at: https://escholarship.umassmed.edu/oapubs

Part of the Bioinformatics Commons, Genetics and Genomics Commons, and the Medicine and Health Sciences Commons

Repository Citation Sadasivam RS, Sundar G, Vaughan LK, Tanik MM, Arnett DK. (2009). Genetic region characterization (Gene RECQuest) - software to assist in identification and selection of candidate genes from genomic regions. Open Access Articles. https://doi.org/10.1186/1756-0500-2-201. Retrieved from https://escholarship.umassmed.edu/oapubs/2100

This material is brought to you by eScholarship@UMMS. It has been accepted for inclusion in Open Access Articles by an authorized administrator of eScholarship@UMMS. For more information, please contact [email protected]. BMC Research Notes BioMed Central

Short Report Open Access Genetic region characterization (Gene RECQuest) - software to assist in identification and selection of candidate genes from genomic regions Rajani S Sadasivam*†1, Gayathri Sundar†2, Laura K Vaughan†3, Murat M Tanik†2 and Donna K Arnett†4

Address: 1Division of Health Informatics and Implementation Science, Quantitative Health Sciences, University of Massachusetts Medical School, USA, 2Department of Electrical and Computer Engineering, School of Engineering, University of Alabama at Birmingham, Birmingham, Alabama, USA, 3Section on Statistical Genetics, Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, Alabama, USA and 4Department of Epidemiology, School of Public Health, University of Alabama at Birmingham, Birmingham, Alabama, USA Email: Rajani S Sadasivam* - [email protected]; Gayathri Sundar - [email protected]; Laura K Vaughan - [email protected]; Murat M Tanik - [email protected]; Donna K Arnett - [email protected] * Corresponding author †Equal contributors

Published: 30 September 2009 Received: 13 July 2009 Accepted: 30 September 2009 BMC Research Notes 2009, 2:201 doi:10.1186/1756-0500-2-201 This article is available from: http://www.biomedcentral.com/1756-0500/2/201 © 2009 Sadasivam et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: The availability of research platforms like the web tools of the National Center for Biotechnology Information (NCBI) has transformed the time-consuming task of identifying candidate genes from genetic studies to an interactive process where data from a variety of sources are obtained to select likely genes for follow-up. This process presents its own set of challenges, as the genetic researcher has to interact with several tools in a time-intensive, manual, and cumbersome manner. We developed a method and implemented an effective software system to address these challenges by multidisciplinary efforts of professional software developers with domain experts. The method presented in this paper, Gene RECQuest, simplifies the interaction with existing research platforms through the use of advanced integration technologies. Findings: Gene RECQuest is a web-based application that assists in the identification of candidate genes from linkage and association studies using information from Online Mendelian Inheritance in Man (OMIM) and PubMed. To illustrate the utility of Gene RECQuest we used it to identify genes physically located within a linkage region as potential candidate genes for a quantitative trait locus (QTL) for very low density lipoprotein (VLDL) response on 18. Conclusion: Gene RECQuest provides a tool which enables researchers to easily identify and organize literature supporting their own expertise and make informed decisions. It is important to note that Gene RECQuest is a data acquisition and organization software, and not a data analysis method.

Page 1 of 8 (page number not for citation purposes) BMC Research Notes 2009, 2:201 http://www.biomedcentral.com/1756-0500/2/201

Background and be 10-30 Mb in length [2]. In order to identify the Complex genetic diseases are due to common variants act- causal variant, these regions must then be subjected to ing alone or in combination with other genes or the envi- fine-mapping, where the area under each region is satu- ronment to cause disease. For the past several years, rated with additional molecular markers, and these mark- genetic linkage has been a mainstay for the analysis of ers are then tested for association with the trait in these complex diseases. Although there has been some question. This technique can reduce the area under each discussion about the future of genetic linkage studies, region and result in dozens of putative candidate genes for there is little debate that genetic linkage studies have had each region. These reduced regions are then subjected to tremendous success in identifying regions of the genome sequence analysis to further refine the search area. Numer- that contribute to a wide variety of complex phenotypes ous sequence variants, both in coding and non-coding [1]. However, identifying the gene (or genes) underlying a regions, can exist within each region. Because a complex linkage peak or in a region of association which drive the trait region can result from several variants within the association remains a significant challenge. same region, each variant must be tested independently for functionality, as well as combinations of all the vari- Linkage studies typically identify regions of association ants. With a linkage study of a complex trait, multiple that cover 10-30 cM, which can contain up to 300 genes regions are expected to be identified, resulting in several

WorkflowFigure 1 of manual linkage studies research activities Workflow of manual linkage studies research activities.

Page 2 of 8 (page number not for citation purposes) BMC Research Notes 2009, 2:201 http://www.biomedcentral.com/1756-0500/2/201

hundred, if not thousands, of putative candidate loci, and Viewer tool of NCBI, selecting the organism, chromo- necessitating fine-mapping and subsequent sequencing some, maps and other options, and specifying the region for numerous genomic regions, which could be costly and of chromosome to download a data file with a list of genes time prohibitive [3-6]. on sequence. For each gene, genetic researchers will then search and retrieve information such as individual A commonly used method in selecting genes from a link- PubMed citations and OMIM summaries, which are age study for follow-up is to identify the genes that lie searched manually for the key words of interest. This proc- within the region of interest and select genes from this list ess is hampered by the large number of genes in a linkage for further analysis based on some criteria such as previ- region, and the difficulty involved in identifying and cat- ously demonstrated association with the phenotype. Typ- aloguing each of those genes. ically, biological investigators will undertake a manual process such as that depicted in Figure 1, in which In an effort to make this gene selection process more effi- researchers browse through multiple pages in the Map cient, streamlined and organized, we have developed a

FigureGene RECQuest 2 Map Viewer Interface Gene RECQuest Map Viewer Interface.

Page 3 of 8 (page number not for citation purposes) BMC Research Notes 2009, 2:201 http://www.biomedcentral.com/1756-0500/2/201

software system called GENEtic REgion Characterization more information than the researcher needs for his or her Quest (Gene RECQuest). For genes located within a purpose. To reduce the number of steps, we have designed region of interest, Gene RECQuest automatically retrieves a single interface in which the user will be able to set the and stores the PubMed and OMIM citations using XML parameters for displaying the maps in a single step. The and Web Services, and allows for a key phrase-based Gene RECQuest system (Figure 2) navigates the user to the search using database technologies. To demonstrate the appropriate Map Viewer from which the user can down- utility, we applied our method to a recently identified load the chromosomal data, which will be uploaded to linkage peak for VLDL response on . the Gene RECQuest MySql database for further analysis. Unfortunately, as the output format of Map Viewer is Implementation inconsistent, we were not yet able to integrate the Map Our overall goal was to design and develop a web-based Viewer download to the file upload system of GeneREC- tool to integrate and customize the existing tools (NCBI Quest. To overcome this inconsistency, we introduced a web tools) to simplify and streamline genetic linkage step where the user has to format the Map Viewer output studies. We approached the problem with a user-centered before uploading to the GeneRECQuest system. design [7] and modified service-oriented architecture approach that is described [8-11]. Full instructions for use Integration with NCBI Web Services are available on the software's website and key imple- NCBI Web Services provide a Web Service Description mented features are as follows: Language (WSDL)-based Application Programmable Interface (API) to a collection of E-utilities. The Integration with Map Viewer Gene RECQuest NCBI tools-as-services wrapper uses the Integration with the Map Viewer tool in Gene RECQuest NCBI WSDL-API to access and retrieve the necessary infor- allows for the reduction of the number of steps required mation (Gene, OMIM, and PubMed information) from to interact with Map Viewer. Map Viewer is a genome the NCBI database. Due to the amount of data and restric- analysis tool provided by the NCBI. Map Viewer can be tions set by NCBI, the initial search must be conducted used to view chromosome maps of many organisms, during "off" hours, thus registration with email address is including humans, and also to identify and localize genes used to notify a user when the search is complete. Integra- and other biological features [12]. Although NCBI allows tion with NCBI (WSDL)-based API in Gene RECQuest the user to view the maps and download the sequence allows integrating and automating the access of data from data within a few clicks (without actually typing informa- the NBCI databases. For this purpose, we have developed tion), the user has to go through several pages that have a Windows service as part of Gene RECQuest that automat-

FigureConcept 3 map of Gene RECQuest MySql Database Model for organizing and searching NCBI data Concept map of Gene RECQuest MySql Database Model for organizing and searching NCBI data.

Page 4 of 8 (page number not for citation purposes) BMC Research Notes 2009, 2:201 http://www.biomedcentral.com/1756-0500/2/201

ically updates the necessary NCBI data to the MySql data- Key word or key phrases based search interface base whenever new genes are uploaded to the system. One of the key features of this software is the ability to eas- Updating the data to the MySql databases allows us to ily and quickly search for key words contained in the liter- implement the user-friendly search application using the ature associated with genes in the chromosomal region of most recent data available. interest. This feature can be found in the "Search Genes" tab on the Gene RECQuest website. The user simply selects User-friendly search through PubMed and OMIM articles the region of interest from their list and uses the search The implementation of the user-friendly search applica- form to find genes of interest. We leverage the Boolean tion has two parts: full text search feature of MySql database to implement the Gene RECQuest full text search on PubMed and OMIM Setting up a database to collect and organize data articles for each gene in a region. Instructions are provided The open-source MySql database is used to collect and to help the researcher construct the search query. Briefly, organize the information accessed from the NCBI web the researcher can either search using key words or phrases services. For this purpose, we designed a relational data of interest, but must specify variants of interest (e.g. model organizing the NCBI information and imple- searching for singular and plural variations). The search mented the model in MySql. Figure 3 shows a concept results first provide a listing of genes that contained both map depicting the relationship where the four main enti- OMIM annotations and PubMed articles matching the ties (Gene, Chromosome region, OMIM, and PubMed) search phrases in the same format as the "Genes in have been highlighted. Regions" list.

FigureGene RECQuest 4 User-Friendly Search Interface Gene RECQuest User-Friendly Search Interface.

Page 5 of 8 (page number not for citation purposes) BMC Research Notes 2009, 2:201 http://www.biomedcentral.com/1756-0500/2/201

The researcher can easily navigate the search results by clicking on each gene to see the OMIM and PubMed arti- cles that contain the search phrases. The PubMed articles information includes the PubMedID, title, authors, date of publication, journal of publication, and the authors' primary affiliation. The researcher can see the abstract by clicking on each article. A link to the PubMed web page for each article is also provided. The OMIM information includes the OMIMID (MIM number) and the OMIM name. Researchers can also see the OMIM summary by clicking on each OMIM and a link to the OMIM page is also provided. Figure 4 shows an example of the search results web page.

Application to a research question We chose to use one of our own research questions to illustrate the utility of Gene RECQuest. As a part of the Genetics of Lipid reducing drugs and Diet Network (GOLDN) project, a QTL analysis was conducted for VLDL response to a high fat meal (see [13] for full details). The study was conducted in 1254 individuals. Briefly, the high fat challenge meal was determined by body surface area, and contained 700 kilocalories per m2 of body sur- face area. The meal composition was 83% of calories from fat, 14% from carbohydrates, and 3% from . VLDL was measured twice (the day before and the day of the meal), and at 3.5 and 6 hours after the meal using proton nuclear magnetic resonance (NMR) spectroscopy. The VLDL response was calculated using a mixed model where ChromosomemilkFigure shake 5 (GOLDN 18 LOD study) plot for VLDL response to high fat a growth curve was created across the four measurement Chromosome 18 LOD plot for VLDL response to points. Variance components linkage analysis was imple- high fat milk shake (GOLDN study). Although this QTL mented using the program SOLAR (Sequential Oligogenic peak would not be considered significant (LOD >3), it is sug- gestive. The region spanning from 60-q20 cM was included in Linkage Analysis Routines) for the VLDL response trait. this analysis. The most promising region identified in this study was located on chromosome 18, with a maximum LOD score of 2.4 (Figure 5). For illustrative purposes, we chose to list of genes was generated (Table 1). Search terms include the entire region under the QTL peak, from 60 to included lipid(s), lipoprotein(s), triglyceride(s), blood 120 cM. This large, 60 cM region contains over 230 genes pressure and postural hypertension. Several of the genes (480 total genes on chromosome 18), many of which may (indicated by bold in Table 1) were identified through be likely candidates. The task of manually identifying each multiple key word searches. Of particular interest are the gene, researching associated publications, and tracking genes that were contained in the annotations for 4 out of known associations with human diseases through OMIM the 5 key words, lipase (LIPG), melanocortin 4 receptor is monumental. Of the 230+ annotations in the region, (MC4R) and SMAD family member 2 (SMAD2), all of there are136 genes, 70 hypothetical, 16 open reading which have strong support for possible involvement in frames (ORFs), 10 pseudo genes, 3 micro RNAs and 2 lipid metabolism. anticodons. The list of genes retained by Gene RECQuest in the "Genes by Region" tab contained 89 genes and 1 Conclusion micro RNA, each with a link on the left to "select" the It is important to note that Gene RECQuest is a data acqui- related PubMed and OMIM citations, and a link on the sition and organization software, and not a data analysis right that opens the Entrez Gene annotation in a new win- method. There is a multitude of data analysis methods dow. designed to aid in the selection of candidate genes from lists generated by linkage or association studies [14-17]. Search for potential candidate genes Many of these methods rely on some sort of annotation Key word searches for terms associated with the pheno- analysis, either comparison to a list of known genes (e.g. type of interest were quickly and easily conducted, and a ENDEAVOUR) [18] or over-representation enrichment

Page 6 of 8 (page number not for citation purposes) BMC Research Notes 2009, 2:201 http://www.biomedcentral.com/1756-0500/2/201

Table 1: Gene RECQuest results for key word search.

Chromosome 18 region 60-120 cM total 237 genes in region & 480 on chromosome 18 Lipoprotein(s) Triglyceride(s) Lipid(s) "Blood Pressure" Hypertension

DCC LIPG ACAA2 SLC14A2 LIPG LIPG LMAN1 ATP8B1 NEDD4L LMAN1 MC4R MC4R BCL2 MC4R MC4R SERPINB2 CD226 SERPINB7 MEX3C SERPINB5 CYB5A MRPS5P4 SERPINB3 DCC NEDD4L SMAD2 ELP2 SERPINB7 SMAD4 FVT1 SMAD2 KRT8P5 SOCS6 LIPG MALT1 MC4R MBP MRPS5P4 NEDD4L PIGN PIK3C3 POL1 SERPINB2 SLC39A6 SMAD2 SMAD4 STARD6 SYT4 VPS4B

Bold indicates genes that are present for more than 4 of the terms.

(e.g. GSEA) [19], to identify genes of interest. Often meth- Abbreviations ods that utilize the same input data and annotation API: Application Programmable Interface; GOLDN: sources produce very different output. As a result, it has Genetics of Lipid lowering drugs and Diet Network; NCBI: been suggested that the best approach is to use multiple National Center for Biotechnology Information; NMR: prioritization tools [20]. Although these methods are Nuclear magnetic resonance; OMIM: Online Mendelian potentially very powerful and have a place in one's tool- Inheritance in Man; QTL: Quantitative trait locus; SOLAR: box, investigators often approach them with hesitation Sequential Oligogenic Linkage Analysis Routines; VLDL: and want to read relevant literature to aid in their deci- Very low density lipoprotein cholesterol; WSDL: Web sions. Gene RECQuest provides a tool which enables Service Description Language; XML: Extensible Markup researchers to identify and organize literature supporting language. their own expertise easily, and make informed decisions. Competing interests Availability and requirements The authors declare that they have no competing interests. • Project name: Gene RECQuest Authors' contributions • Project home page: http://www.ncbi RS and LKV prepared the manuscript. RS designed and search.cme.uab.edu/Default.aspx developed the Gene RECQuest system. GS prepared the throw-away prototypes and assisted in the task analysis of • Anonymous accounts (no e-mail address for registration the genetic linkage studies. MMT oversaw the software is needed): http://www.ncbisearch.cme.uab.edu/Anony engineering of the Gene RECQuest system. LKV and DA mousLogin.aspx provided domain expertise for the development of the system. DA served as the project lead. All authors read and • Operating systems: any OS (that has an internet browser approved the manuscript. application) Acknowledgements • Programming language: .Net, ASP.Net, C#, MySQL This work has been supported by the following NIH grants:

Page 7 of 8 (page number not for citation purposes) BMC Research Notes 2009, 2:201 http://www.biomedcentral.com/1756-0500/2/201

LKV- K01DK080188

DA - 1R01 HL091357-01

References 1. Clerget-Darpoux F, Elston RC: Are linkage analysis and the col- lection of family data dead? Prospects for family studies in the age of genome-wide association. Hum Hered 2007, 64(2):91-96. 2. Perez-Iratxeta C, Bork P, Andrade MA: Association of genes to genetically inherited diseases using data mining. Nature Genet- ics 2002, 31(3):316-319. 3. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al.: Evolving gene/transcript defi- nitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33(20):175. 4. Farrall M, Morris AP: Gearing up for genome-wide gene-associ- ation studies. Hum Mol Genet 2005, 14(Spec No. 2):R157-162. 5. Evans DM, Cardon LR: Genome-wide association: a promising start to a long race. Trends Genet 2006, 22(7):350-354. 6. DiPetrillo K, Wang X, Stylianou IM, Paigen B: Bioinformatics tool- box for narrowing rodent quantitative trait loci. Trends Genet 2005, 21(12):683-692. 7. Vredenburg K, Isensee S, Righi C: User-centered design: an inte- grated approach. Upper Saddle River, NJ: Prentice Hall PTR; 2002. 8. Sadasivam RS: An Architecture Framework for Process-Per- sonalized Composite Services: Service-oriented Architec- ture, Web Services, Business-Process Engineering, and Human Interaction Management. Saarbrücken, Germany: VDM Verlag; 2008. 9. Sadasivam RS, Tanik MM: Composite process-personalization with service-oriented architecture. In Handbook of Research In Mobile Business: Technical, Methodological And Social Perspectives Issue 2 Edited by: Unhelkar B. IGI Global; 2008:470. 10. Sadasivam RS, Sundar G, Tanik MM, Jololian L, Tanju MN: A process personalization model for enabling biological research. In Int Design and Process Technology: June 3-8 2007 Antalya, Turkey: Society for Design and Process Science; 2007:168-174. 11. Sadasivam RS, Tanik MM, Jololian L: WO/2007/008687: Drag and drop communication of data via a computer network. World Intellectual Property Organization. vol. WO/2007/008687 2007. 12. Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Mad- den TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, et al.: Data- base resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004:35-40. 13. Corella D, Arnett DK, Tsai MY, Kabagambe EK, Peacock JM, Hixson JE, Straka RJ, Province M, Lai CQ, Parnell LD, et al.: The -256T>C polymorphism in the apolipoprotein A-II gene promoter is associated with body mass index and food intake in the genetics of lipid lowering drugs and diet network study. Clin Chem 2007, 53(6):1144-1152. 14. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations and open problems. Bioinfor- matics 2005, 21(18):3587-3595. 15. Tiffin N, Adie E, Turner F, Brunner H, van Driel MA, et al.: Compu- tational disease gene identification: A concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nuc Acids Res 2006, 34(10):3067-3081. 16. Nam D, Kim S-Y: Gene-set approach for expression pattern analysis. Briefings in Bioinformatics 2008, 9(3):189-197. Publish with BioMed Central and every 17. Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrich- scientist can read your work free of charge ment tools: paths toward the comprehensive functional analysis of large gene lists. Nuc Acids Res 2009, 37(1):1-13. "BioMed Central will be the most significant development for 18. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, disseminating the results of biomedical research in our lifetime." Tranchevent L-C, De Moor B, Marynen P, Hassan B, Carmeliet P, Sir Paul Nurse, Cancer Research UK Moreau Y: Gene prioritization through genomic data fusion. Nature Biotechnology 2006, 24(5):537-544. Your research papers will be: 19. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression available free of charge to the entire biomedical community with self-organizing maps: methods and application to peer reviewed and published immediately upon acceptance hematopoietric differentiation. PNAS 1999, 96:2907-2912. 20. Thornblad TA, Elliott KS, Jowett J, Visscher PM: Prioritization of cited in PubMed and archived on PubMed Central positional candidate genes using multiple web-based soft- yours — you keep the copyright ware tools. Twin Research and Human Genetics 2007, 10(6):861-870. Submit your manuscript here: BioMedcentral http://www.biomedcentral.com/info/publishing_adv.asp

Page 8 of 8 (page number not for citation purposes)