Content Based Search in Gene Expression Databases and a Meta-Analysis of Host Responses to Infection
A Thesis
Submitted to the Faculty of Drexel University
by Francis X. Bell
in partial fulfillment of the requirements for the degree of Doctor of Philosophy
November 2015

© Copyright 2015 Francis X. Bell. All Rights Reserved.

Acknowledgments

I would like to acknowledge and thank my advisor, Dr. Ahmet Sacan. Without his advice, support, and patience I would not have been able to accomplish all that I have. I would also like to thank my committee members and the Biomed Faculty that have guided me. I would like to give special thanks to the members of the bioinformatics lab, in particular the members of the Sacan lab: Rehman Qureshi, Daisy Heng Yang, April Chunyu Zhao, and Yiqian Zhou. Thank you for creating a pleasant and friendly environment in the lab.

I give the members of my family my sincerest gratitude for all that they have done for me. I cannot begin to repay my parents for their sacrifices. I am eternally grateful for everything they have done. The support of my sisters and their encouragement gave me the strength to persevere to the end.

Table of Contents

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1. A BRIEF INTRODUCTION TO GENE EXPRESSION
  1.1 Central Dogma of Molecular Biology
    1.1.1 Basic Transfers
    1.1.2 Uncommon Transfers
  1.2 Gene Expression
    1.2.1 Estimating Gene Expression
    1.2.2 DNA Microarrays
    1.2.3 Microarray Analysis Methods
  1.3 Gene Expression Databases
    1.3.1 Small or Specialty Databases
    1.3.2 Gene Expression Omnibus: the Large Database
    1.3.3 ArrayExpress: the European Counterpart
  1.4 Database Management Systems

2. BINARY REPRESENTATIONS OF GENE EXPRESSION STUDIES ENABLE EFFICIENT SEARCHES BY CONTENT
  2.1 Background
    2.1.1 Previous Attempts to Establish Content-based Searches
    2.1.2 Inspiration from Chemoinformatics
    2.1.3 Distance Measures
  2.2 Methods
    2.2.1 Dataset Acquisition
    2.2.2 Expression Profile Creation
    2.2.3 Binary Vector Creation
    2.2.4 Cross Validation Studies
  2.3 Results
    2.3.1 Single Platform Validation Study
    2.3.2 Multiple Platform Validation Study
  2.4 Discussion
  2.5 Conclusion

3. IMPLEMENTATION OF A DATABASE OF BINARY REPRESENTATIONS OF GENE EXPRESSION STUDIES
  3.1 Background and Methods
  3.2 Results
  3.3 Discussion
  3.4 Conclusion
  3.5 Future Work

4. ENRICHMENT OF GENE EXPRESSION DATA USING BINARY VECTOR REPRESENTATIONS
  4.1 Gene Set Enrichment
    4.1.1 KEGG Pathways
    4.1.2 Gene Ontology
  4.2 Determining Significance
    4.2.1 DAVID
    4.2.2 GSEA
  4.3 Motivation and Reasoning
  4.4 Methods
    4.4.1 Distance Measures
    4.4.2 Database Construction
    4.4.3 Exponential Transforms
  4.5 Results
  4.6 Discussion
  4.7 Conclusion and Future Work

5. META-ANALYSIS OF GENE EXPRESSION DURING HOST RESPONSES TO INFECTIONS
  5.1 Background
  5.2 Methods
    5.2.1 Data Acquisition
    5.2.2 Binary Search to Distinguish Taxonomical Groups
    5.2.3 Selection of a Meta-analysis Method
    5.2.4 Statistical Approach
  5.3 Results
  5.4 Discussion
  5.5 Conclusion

6. META-ANALYSIS OF MIRNA EXPRESSION DURING HOST RESPONSES TO INFECTIONS
  6.1 Background
  6.2 Methods
  6.3 Results
  6.4 Discussion
  6.5 Conclusion

7. CONCLUSION

REFERENCES

APPENDIX A: Data from the Single Platform Study of Binary Representations
APPENDIX B: Data from the Multiple Platform Study of Binary Representations
APPENDIX C: Data from the Initial Search of Binary Representation Database
APPENDIX D: Predicted Enrichments Using Binary Distances
APPENDIX E: Data Used in the Meta-analysis of Gene Expression Studies
APPENDIX F: Enriched Data from the Meta-analysis of Gene Expression
APPENDIX G: Data from the Meta-analysis of microRNA Expression Studies
APPENDIX H: Enriched Data from the Meta-analysis of miRNA Expression

VITA

List of Tables

2.1 Definitions of Binary Distance Measures. A and B are both binary vectors of length n. A+ and A− represent the number of positive and negative bits in A, respectively. A±B± denotes the number of bits in the intersection of A and B. A± ∨ B± denotes the number of bits in the union of A and B. To avoid division by zero, vectors of all positive or all negative bits are removed.

2.2 Definitions of Numerical Distance Measures. A and B are vectors of fold changes of length n. A_i represents the i-th term of the vector. m_A is the mean of vector A.

2.3 Results of Single Platform Validation Study. True positive rates at a false positive rate of 0 (TPR|FPR=0) and average search time for distance measures (d) are provided.

2.4 Results of Multiple Platform Validation Study. True positive rates at a false positive rate of 0 (TPR|FPR=0) and average search time for distance measures (d) are provided. The first search of a distance measure required the most time. In cases where the first search created deviations greater than the average search time, the first search was omitted from calculations.

3.1 Results of Initial