Text Mining Biomedical Literature for Genomic Knowledge Discovery
A Thesis
Presented to
The Academic Faculty
by
Ying Liu

In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy in Computer Science

Georgia Institute of Technology
August 2005

Approved by:

Shamkant B. Navathe, Chairman, College of Computing, Georgia Institute of Technology
Ray Dingledine, Department of Pharmacology, Emory University
Brian J. Ciliax, Department of Neurology, Emory University
Edward Omiecinski, College of Computing, Georgia Institute of Technology
Venu Dasigi, Department of Computer Science, Southern Polytechnic and State University
Ashwin Ram, College of Computing, Georgia Institute of Technology

Date Approved: June 28, 2005

This dissertation is dedicated to my parents.

ACKNOWLEDGEMENTS

This thesis would not have been possible without the help and support of many people. First, I would like to thank my advisor, Prof. Shamkant B. Navathe, for his excellent supervision, his knowledge, his belief and interest in the work, and his encouragement and motivation throughout. I would also like to thank Dr. Brian J. Ciliax for being a good friend, a mentor, and someone to follow. I am very grateful to Dr. Ray Dingledine for his hours of patient and detailed proofreading and for many conversations. My thanks also go to everyone who has provided support or advice in one way or another, including Dr. Venu Dasigi, Dr. Ashwin Ram, and Dr. Edward Omiecinski. Researchers at the Centers for Disease Control and Prevention, including Dr. Muin Korey, Dr. Marta Gwinn, and Mr. Bruce Lin, provided the training and testing sets used in Chapter 6. I also want to thank the other graduate students in the lab, including Nalini Polavarapu, Saurav Sahay, Abhishek Dabral, and Ramprasad Ramnarayanan, for their support. Finally, I would like to thank my family for their love and support.
TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures
Summary

Chapter 1 Motivation and Background
  1.1 Motivation and contribution of this thesis
  1.2 An overview of functional genomics
  1.3 Introduction to microarrays
  1.4 Two types of microarrays
    1.4.1 Spotted microarray
    1.4.2 Synthesised high density oligonucleotide microarrays
  1.5 Roadmap of chapters in the thesis
  1.6 Summary

Chapter 2 Related Previous Work
  2.1 Microarray data clustering analysis approaches
    2.1.1 K-means
    2.1.2 Hierarchical clustering
    2.1.3 Other clustering algorithms
  2.2 Limitation of microarray data analysis approaches
  2.3 Text mining biomedical literature
    2.3.1 Text mining biomedical literature for discovering gene-to-gene relationships
    2.3.2 Classification of biomedical literature
    2.3.3 Support vector machine
  2.4 Mining text for association rules
    2.4.1 How association rules work
    2.4.2 Mining text for association
  2.5 Summary

Chapter 3 Issues for Analysis of Biomedical Text
  3.1 Creating a relational database of medline abstracts
  3.2 Keyword extraction from biomedical literature
    3.2.1 Background sets
    3.2.2 Test sets
    3.2.3 Stemming
    3.2.4 Stop-word lists
  3.3 Keyword assessment
    3.3.1 Z-score method
    3.3.2 TFIDF method
    3.3.3 Normalized z-score method
  3.4 Precision-recall and error-minimization
  3.5 Keyword lists
  3.6 Extraction of new information
  3.7 Optimization of the keyword selection algorithm
  3.8 Comparison of TFIDF and normalized z-score method for keyword extraction
  3.9 Future experiments
  3.10 Summary

Chapter 4 Clustering Genes Based on Keyword Feature Vectors
  4.1 Keyword extraction from biomedical literature
    4.1.1 Test sets of genes
    4.1.2 Keyword assessment
    4.1.3 Keyword selection for gene clustering
  4.2 List of keywords and keyword X gene matrix generation
  4.3 BEA-PARTITION: a new algorithm with application to gene clustering
    4.3.1 Constructing the symmetric gene X gene matrix
    4.3.2 Sorting the matrix
    4.3.3 Partitioning the sorted matrix
  4.4 Other clustering algorithm
    4.4.1 Determination of number of clusters
    4.4.2 Determination of z-score threshold
  4.5 Evaluation the clustering results
    4.5.1 Purity
    4.5.2 Entropy
    4.5.3 Mutual information
  4.6 A comparative study of BEA-PARTITION with other clustering algorithms: using 26-gene set
  4.7 A comparative study of BEA-PARTITION with other clustering algorithms: using 44-gene set
  4.8 Top-scoring keywords shared among members of a gene cluster
  4.9 Discussion of clustering algorithm comparison
    4.9.1 BEA-PARTITION vs. k-means
    4.9.2 BEA-PARTITION vs. hierarchical algorithm
    4.9.3 K-means vs. SOM
    4.9.4 Computing time and complexity
    4.9.5