Biomedical Informatics

BIOMEDICAL INFORMATICS Abstract GENE LIST AUTOMATICALLY DERIVED FOR YOU (GLAD4U): DERIVING AND PRIORITIZING GENE LISTS FROM PUBMED LITERATURE JEROME JOURQUIN Thesis under the direction of Professor Bing Zhang Answering questions such as ―Which genes are related to breast cancer?‖ usually requires retrieving relevant publications through the PubMed search engine, reading these publications, and manually creating gene lists. This process is both time-consuming and prone to errors. We report GLAD4U (Gene List Automatically Derived For You), a novel, free web-based gene retrieval and prioritization tool. The quality of gene lists created by GLAD4U for three Gene Ontology terms and three disease terms was assessed using ―gold standard‖ lists curated in public databases. We also compared the performance of GLAD4U with that of another gene prioritization software, EBIMed. GLAD4U has a high overall recall. Although precision is generally low, its prioritization methods successfully rank truly relevant genes at the top of generated lists to facilitate efficient browsing. GLAD4U is simple to use, and its interface can be found at: http://bioinfo.vanderbilt.edu/glad4u. Approved ___________________________________________ Date _____________ GENE LIST AUTOMATICALLY DERIVED FOR YOU (GLAD4U): DERIVING AND PRIORITIZING GENE LISTS FROM PUBMED LITERATURE By Jérôme Jourquin Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Biomedical Informatics May, 2010 Nashville, Tennessee Approved: Professor Bing Zhang Professor Hua Xu Professor Daniel R. Masys ACKNOWLEDGEMENTS I would like to express profound gratitude to my advisor, Dr. Bing Zhang, for his invaluable support, supervision and suggestions throughout this research work. I would like to deeply thank my committee, Drs. Hua Xu and Daniel R. Masys, for their guidance and support of my master's research. I would like to thank Dr. Vito Quaranta for his constant support. Dr. Lourdes Estrada was instrumental in the success of my part-time study plan, and this work greatly benefited from the assistance of Ms. Toni Shepard, Brandy Weidow, Jillanne Shell and Jing Hao. Overall, I am also deeply in debt with the entire Quaranta laboratory for their understanding of my morphing into a graduate student. Thank you Walter and Shawn for your patience in answering all the math questions I threw your way. Thank you Manisha, Jill and Brandy for listening and enduring my ups and downs for the last two and a half years. To the members of the Zhang laboratory—Jing Li, Naresh Prodduturi, Mingguang Shi and Dexter Duncan—thank you so much for your help and feedbacks on my master‘s research. I would also like to thank the Department of Biomedical Informatics. I appreciate the support of the students, faculty, and staff. In particular, I would like to thank Dr. Cynthia Gadd who welcomed me into the program and kept up with my progress, and Ms. Rischelle Jenkins, Claudia McCarn and Nelda Fowles for their assistance throughout this project. Similarly, I appreciate all the help and training that I received from Drs. ii David Tabb, Dan Liebler and Robbert Slebos as well as Scott Sobecki, Joseph Burden and William Schultz. Finally, I am grateful to my family and friends for all their encouragements and support. You have all helped me succeed. My parents have been faithful supporters and expect to be partly paid back with the Commencement celebrations. I am most grateful to Steve for being my number one fan. His strong support, sound advices and continuous encouragement were critical for my research and the realization of this manuscript. Thank you to all my friends, especially to Bart and Sophie, who took my defense as an excuse for re-visiting the South. I have a special ‗thank you‘ addressed to the Matrisian laboratory for ―adopting‖ me; your support has meant a lot! Thank you to Mr. Steven Goodman and Ms. Brandy Weidow for proofreading the manuscript. This work was supported by the National Institutes of Health (NIH)/ National Institute of General Medical Sciences (NIGMS) through grant R01GM088822, the NIH/National Cancer Institute (NCI) through grant U54CA113007, and the NIH/ National Institute of Mental Health (NIMH) through grant P50MH078028. iii TABLE OF CONTENTS Page ACKNOWLEDGEMENTS ................................................................................................ ii LIST OF TABLES ............................................................................................................. vi LIST OF FIGURES .......................................................................................................... vii LIST OF ABBREVIATIONS .......................................................................................... viii Chapter I. INTRODUCTION ..................................................................................................... 1 II. BACKGROUND ....................................................................................................... 4 Literature review .................................................................................................. 4 Gene prioritization using co-occurrences ...................................................... 4 Terminologies/ontologies .............................................................................. 7 Integration ...................................................................................................... 9 Miscellaneous .............................................................................................. 10 Preliminary results: Targeted users interviews .................................................. 11 III. MATERIALS AND METHODS ............................................................................. 14 Gold standard generation ................................................................................... 14 Gene Ontology terms ................................................................................... 14 Disease terms ............................................................................................... 15 Extracting EBIMed-prioritized gene lists .......................................................... 16 Publication and gene retrieval ............................................................................ 17 Gene prioritization ............................................................................................. 19 GLAD4U Counts ......................................................................................... 19 GLAD4U Hypergeometric........................................................................... 19 Performance evaluation ..................................................................................... 20 GLAD4U web user-interface implementation ................................................... 20 IV. RESULTS ................................................................................................................ 21 Definition and evaluation of GLAD4U publication and gene retrieval algorithm ............................................................................................................ 21 Rationale ...................................................................................................... 21 Results and Discussion ................................................................................ 21 iv Limitations ................................................................................................... 23 Increase of GLAD4U performance by the implementation and evaluation of gene prioritization algorithms ............................................................................ 23 Rationale ...................................................................................................... 23 Results and Discussion ................................................................................ 23 Limitations ................................................................................................... 30 Implementation of GLAD4U methods as a web user-interface ....................... 31 Rationale ...................................................................................................... 31 Results and Discussion ................................................................................ 32 Limitations ................................................................................................... 35 V. DISCUSSION .......................................................................................................... 37 VI. FUTURE WORK ..................................................................................................... 41 GLAD4U ―Similarity‖ ....................................................................................... 41 GLAD4U web implementations ........................................................................ 41 VII. CONCLUSIONS...................................................................................................... 43 Appendix A. APPENDIX 1: SURVEY TEMPLATE ADAPTED FROM THE FREE-STORY INTERVIEWS .................................................................................................................. 44 B. APPENDIX 2: GOLD STANDARDS .................................................................... 47 C. APPENDIX 3: EBIMED-EXTRACTED PRIORITIZED GENE LISTS ............... 58 REFERENCES ................................................................................................................

Biomedical Informatics

Analyses of Allele-Specific Gene Expression in Highly Divergent

Polyclonal Antibody to APC11 / ANAPC11 - Serum

Genetic Mutation Analysis of High and Low Igy Chickens by Capture Sequencing

Loss-Of-Function Mutation in the Dioxygenase-Encoding FTO Gene

Pharmacogenomic Characterization in Bipolar Spectrum Disorders

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Download Download

Deubiquitinase UCHL1 Maintains Protein Homeostasis Through PSMA7-APEH- Proteasome Axis in High-Grade Serous Ovarian Carcinoma

Genomic Amplification of Chromosome 20Q13.33 Is the Early Biomarker For

WO 2019/079361 Al 25 April 2019 (25.04.2019) W 1P O PCT

Datasheet Blank Template

The Mammalian Adult Neurogenesis Gene Ontology (MANGO) Provides a Structural Framework for Published Information on Genes Regulating Adult Hippocampal Neurogenesis