Biomedical Informatics
Total Page:16
File Type:pdf, Size:1020Kb
BIOMEDICAL INFORMATICS Abstract GENE LIST AUTOMATICALLY DERIVED FOR YOU (GLAD4U): DERIVING AND PRIORITIZING GENE LISTS FROM PUBMED LITERATURE JEROME JOURQUIN Thesis under the direction of Professor Bing Zhang Answering questions such as ―Which genes are related to breast cancer?‖ usually requires retrieving relevant publications through the PubMed search engine, reading these publications, and manually creating gene lists. This process is both time-consuming and prone to errors. We report GLAD4U (Gene List Automatically Derived For You), a novel, free web-based gene retrieval and prioritization tool. The quality of gene lists created by GLAD4U for three Gene Ontology terms and three disease terms was assessed using ―gold standard‖ lists curated in public databases. We also compared the performance of GLAD4U with that of another gene prioritization software, EBIMed. GLAD4U has a high overall recall. Although precision is generally low, its prioritization methods successfully rank truly relevant genes at the top of generated lists to facilitate efficient browsing. GLAD4U is simple to use, and its interface can be found at: http://bioinfo.vanderbilt.edu/glad4u. Approved ___________________________________________ Date _____________ GENE LIST AUTOMATICALLY DERIVED FOR YOU (GLAD4U): DERIVING AND PRIORITIZING GENE LISTS FROM PUBMED LITERATURE By Jérôme Jourquin Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Biomedical Informatics May, 2010 Nashville, Tennessee Approved: Professor Bing Zhang Professor Hua Xu Professor Daniel R. Masys ACKNOWLEDGEMENTS I would like to express profound gratitude to my advisor, Dr. Bing Zhang, for his invaluable support, supervision and suggestions throughout this research work. I would like to deeply thank my committee, Drs. Hua Xu and Daniel R. Masys, for their guidance and support of my master's research. I would like to thank Dr. Vito Quaranta for his constant support. Dr. Lourdes Estrada was instrumental in the success of my part-time study plan, and this work greatly benefited from the assistance of Ms. Toni Shepard, Brandy Weidow, Jillanne Shell and Jing Hao. Overall, I am also deeply in debt with the entire Quaranta laboratory for their understanding of my morphing into a graduate student. Thank you Walter and Shawn for your patience in answering all the math questions I threw your way. Thank you Manisha, Jill and Brandy for listening and enduring my ups and downs for the last two and a half years. To the members of the Zhang laboratory—Jing Li, Naresh Prodduturi, Mingguang Shi and Dexter Duncan—thank you so much for your help and feedbacks on my master‘s research. I would also like to thank the Department of Biomedical Informatics. I appreciate the support of the students, faculty, and staff. In particular, I would like to thank Dr. Cynthia Gadd who welcomed me into the program and kept up with my progress, and Ms. Rischelle Jenkins, Claudia McCarn and Nelda Fowles for their assistance throughout this project. Similarly, I appreciate all the help and training that I received from Drs. ii David Tabb, Dan Liebler and Robbert Slebos as well as Scott Sobecki, Joseph Burden and William Schultz. Finally, I am grateful to my family and friends for all their encouragements and support. You have all helped me succeed. My parents have been faithful supporters and expect to be partly paid back with the Commencement celebrations. I am most grateful to Steve for being my number one fan. His strong support, sound advices and continuous encouragement were critical for my research and the realization of this manuscript. Thank you to all my friends, especially to Bart and Sophie, who took my defense as an excuse for re-visiting the South. I have a special ‗thank you‘ addressed to the Matrisian laboratory for ―adopting‖ me; your support has meant a lot! Thank you to Mr. Steven Goodman and Ms. Brandy Weidow for proofreading the manuscript. This work was supported by the National Institutes of Health (NIH)/ National Institute of General Medical Sciences (NIGMS) through grant R01GM088822, the NIH/National Cancer Institute (NCI) through grant U54CA113007, and the NIH/ National Institute of Mental Health (NIMH) through grant P50MH078028. iii TABLE OF CONTENTS Page ACKNOWLEDGEMENTS ................................................................................................ ii LIST OF TABLES ............................................................................................................. vi LIST OF FIGURES .......................................................................................................... vii LIST OF ABBREVIATIONS .......................................................................................... viii Chapter I. INTRODUCTION ..................................................................................................... 1 II. BACKGROUND ....................................................................................................... 4 Literature review .................................................................................................. 4 Gene prioritization using co-occurrences ...................................................... 4 Terminologies/ontologies .............................................................................. 7 Integration ...................................................................................................... 9 Miscellaneous .............................................................................................. 10 Preliminary results: Targeted users interviews .................................................. 11 III. MATERIALS AND METHODS ............................................................................. 14 Gold standard generation ................................................................................... 14 Gene Ontology terms ................................................................................... 14 Disease terms ............................................................................................... 15 Extracting EBIMed-prioritized gene lists .......................................................... 16 Publication and gene retrieval ............................................................................ 17 Gene prioritization ............................................................................................. 19 GLAD4U Counts ......................................................................................... 19 GLAD4U Hypergeometric........................................................................... 19 Performance evaluation ..................................................................................... 20 GLAD4U web user-interface implementation ................................................... 20 IV. RESULTS ................................................................................................................ 21 Definition and evaluation of GLAD4U publication and gene retrieval algorithm ............................................................................................................ 21 Rationale ...................................................................................................... 21 Results and Discussion ................................................................................ 21 iv Limitations ................................................................................................... 23 Increase of GLAD4U performance by the implementation and evaluation of gene prioritization algorithms ............................................................................ 23 Rationale ...................................................................................................... 23 Results and Discussion ................................................................................ 23 Limitations ................................................................................................... 30 Implementation of GLAD4U methods as a web user-interface ....................... 31 Rationale ...................................................................................................... 31 Results and Discussion ................................................................................ 32 Limitations ................................................................................................... 35 V. DISCUSSION .......................................................................................................... 37 VI. FUTURE WORK ..................................................................................................... 41 GLAD4U ―Similarity‖ ....................................................................................... 41 GLAD4U web implementations ........................................................................ 41 VII. CONCLUSIONS...................................................................................................... 43 Appendix A. APPENDIX 1: SURVEY TEMPLATE ADAPTED FROM THE FREE-STORY INTERVIEWS .................................................................................................................. 44 B. APPENDIX 2: GOLD STANDARDS .................................................................... 47 C. APPENDIX 3: EBIMED-EXTRACTED PRIORITIZED GENE LISTS ............... 58 REFERENCES ................................................................................................................