Identifying High Quality Medline Articles and Web Sites Using
Total Page:16
File Type:pdf, Size:1020Kb
IDENTIFYING HIGH QUALITY MEDLINE ARTICLES AND WEB SITES USING MACHINE LEARNING By YINDALON APHINYANAPHONGS Dissertation Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in Biomedical Informatics December 2007 Nashville, Tennessee Approved: Professor Constantin Aliferis Professor Ioannis Tsamardinos Professor Douglas Hardin Professor Steven Brown Professor Dan Masys Copyright © 2007 by Yin Aphinyanaphongs All Rights Reserved ii DEDICATION My parents. My family. iii ACKNOWLEDGEMENTS You could say that I had the best twenties a person could probably have. I got to wake up every morning excited and motivated to discover new knowledge and solve real problems, and at the same time, develop as a person and a scientist. I’m ready for the world and ready to be an adult. The tops of my list to thank are my parents and family. Without the consistent pressure of my mom asking how the dissertation was going, when I was going to be done, and what I was going to do next, I’m not quite sure I’d be writing this today. My parents are the source of who I am, and I thank them for making all the sacrifices so I could pursue my dream. Next is my advisor Constantin Aliferis. I could not ask for a better mentor and friend, and a clear sign of my advocacy for Dr. Aliferis as an advisor is that I would have him all over again, and I would not hesitate to recommend him to any student. The measure of any mentor is does the mentor prepare you to make contributions to the field and does the mentor teach you how to be a good scientist. My answer is an unequivocal yes to both questions. I feel prepared to make contributions on my own to the field and I have learned how to do “good” science. I’d like to thank all my committee members. Dr. Tsamardinos, Dr. Brown, Dr. Hardin, and Dr. Masys for their feedback and involvement. I’d also like to thank the MSTP department, the NLM training fellowship, and the Department of Biomedical Informatics for supporting me. iv There are a few other people I have to thank. My little brother Joe for being a hard ass and keeping me focused. My buddy Michael and JP for giving me refuge for a couple of days in the final stretch of dissertation writing so I could finish. Sutin for discussing software engineering with me and taking me away from the world of medicine every so often, and keeping me inspired. Mimmie, Micaela, Todd, and Jeff for listening to me rant and rave during my toughest emotional times so I could get it out of my system for the day to focus on research.Geoff for being supportive and a great friend, and keeping me in the loop on the outside world. Finally, all my peeps that left me to pursue their dreams. Jeff, Troy, Darcie, Allison, and Mary Hunt. If it wasn’t for their continued friendship, I would not have been able to make it to where I am today. v TABLE OF CONTENTS DEDICATION............................................................................................................................................. iii ACKNOWLEDGEMENTS.........................................................................................................................iv LIST OF TABLES .......................................................................................................................................ix LIST OF FIGURES .....................................................................................................................................xi I. INTRODUCTION ................................................................................................................................1 OVERVIEW AND DISSERTATION STRUCTURE ...............................................................................................4 Models and Evaluation of Retrieval Performance..................................................................................5 Evaluation of Generalization ..................................................................................................................6 EBMSearch: Proof of Concept Search Engine .......................................................................................7 Extensions to the World Wide Web .......................................................................................................7 Conventions............................................................................................................................................8 SUMMARY ...................................................................................................................................................8 II. BACKGROUND/ PRIOR WORK..................................................................................................9 TEXT CATEGORIZATION MODELS FOR HIGH QUALITY RETRIEVAL IN INTERNAL MEDICINE .......................9 Abstract ..................................................................................................................................................9 Introduction ..........................................................................................................................................11 Background...........................................................................................................................................12 Methods................................................................................................................................................15 Results ..................................................................................................................................................27 Discussion.............................................................................................................................................35 Summary ..............................................................................................................................................42 References ............................................................................................................................................43 LEARNING BOOLEAN QUERIES FOR ARTICLE QUALITY FILTERING............................................................46 Abstract ................................................................................................................................................46 Introduction ..........................................................................................................................................46 Methods................................................................................................................................................49 Results ..................................................................................................................................................55 Discussion.............................................................................................................................................59 Conclusions ..........................................................................................................................................62 References ............................................................................................................................................63 III. MODELS AND EVALUATION OF RETRIEVAL PERFORMANCE....................................65 A COMPARISON OF CITATION METRICS TO MACHINE LEARNING FILTERS FOR THE IDENTIFICATION OF HIGH QUALITY MEDLINE DOCUMENTS...................................................................................................65 vi Abstract ................................................................................................................................................65 Introduction & Background..................................................................................................................67 Hypothesis & Experiments...................................................................................................................69 Methods................................................................................................................................................70 Results ..................................................................................................................................................81 Study Limitations .................................................................................................................................90 Discussion and Conclusions .................................................................................................................95 References ............................................................................................................................................98 A COMPARISON OF WEB HYPERLINKS TO MACHINE LEARNING FILTERS FOR THE IDENTIFICATION OF HIGH QUALITY MEDLINE DOCUMENTS..........................................................................................................105 Abstract ..............................................................................................................................................105 Introduction ........................................................................................................................................106 Background.........................................................................................................................................108 Methods..............................................................................................................................................110