Detecting Phenotype-Specific Interactions Between Biological Processes from Microarray Data and Annotations
Total Page:16
File Type:pdf, Size:1020Kb
Wayne State University DigitalCommons@WayneState Wayne State University Dissertations 1-1-2010 Detecting Phenotype-Specific nI teractions Between Biological Processes From Microarray Data And Annotations Nadeem Ahmed Ansari Wayne State University Follow this and additional works at: http://digitalcommons.wayne.edu/oa_dissertations Recommended Citation Ansari, Nadeem Ahmed, "Detecting Phenotype-Specific nI teractions Between Biological Processes From Microarray Data And Annotations" (2010). Wayne State University Dissertations. Paper 76. This Open Access Dissertation is brought to you for free and open access by DigitalCommons@WayneState. It has been accepted for inclusion in Wayne State University Dissertations by an authorized administrator of DigitalCommons@WayneState. DETECTING PHENOTYPE-SPECIFIC INTERACTIONS BETWEEN BIOLOGICAL PROCESSES FROM MICROARRAY DATA AND ANNOTATIONS by NADEEM A. ANSARI DISSERTATION Submitted to the Graduate School of Wayne State University, Detroit, Michigan in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY 2010 MAJOR: COMPUTER SCIENCE Approved By: Advisor Date DEDICATION To my family, for all their loving care since my birth and unfaltering support throughout my academic studies. ii ACKNOWLEDGEMENTS I am very grateful to my teacher, advisor, and mentor, Dr. Sorin Dr˘aghici, who kindly and wisely guided me in the new field of bioinformatics, with all his patience and support. During my doctoral studies, whenever I thought I was making jumps and leaps in my research, he challenged me with critical inquiries, and encouraged me during my slow progress whenever I felt low. I am also very grateful to Dr. Carl D. Freeman for his advise and encouragement, especially during my most intense last months of study. I would like to express my sincere thanks to Dr. Daniel Grosu, Dr. Chandan Reddy, and Dr. Hasan Jamil for their advice throughout my Ph.D. candidacy years. On a personal level, I am grateful to Riyue Bao for helping me understand and in- terpret the results of our work in the context of biological phenomena; to Dr. Purvesh Khatri and Sivakumar Sellamuthu for helping me understand the OntoTools software suite and its back-end database; to the members of Intelligent Systems and Bioin- formatics Laboratory (Dr. Bogdan Done, Calin Voichita, Michele Donato, and Dr. Adi Tarca) for their everyday help and constructive criticism; and finally to Monika Witoslawski for her constant encouragement. iii TABLE OF CONTENTS Dedication....................................... ii Acknowledgements .................................. iii ListofTables..................................... vi ListofFigures .................................... vi Chapter1:Introduction . .. .. .. 1 1.1 Dissertationoutline. .. .. 3 Chapter2: Biologicalfoundations . ..... 5 2.1 DNA,RNA,andproteins.......................... 6 2.2 Geneexpression............................... 9 2.3 Gene expression measurement technologies . ..... 12 2.4 GeneOntology ............................... 18 2.5 Phenotypeandgenotype . .. .. 20 2.6 Identifying the biological phenomena - current approaches ....... 21 Chapter3: Mathematicalfoundations . .... 24 3.1 Vectorspacemodel ............................. 24 3.2 Correlation and it’s geometric interpretation . ........ 33 Chapter 4: Identifying interactions between biological processes . .. 36 4.1 Datarepresentation............................. 37 4.2 Approximating matrix in lower dimensional subspace . ....... 40 4.3 Association between biological processes . ...... 42 4.4 Identifying significant changes . ... 43 Chapter 5: Improvements and enhancements . ..... 48 5.1 Binaryscheme(1-1) ............................ 48 5.2 Weighting by gene expression levels (1-e) . ..... 49 5.3 Information retrieval binary scheme (IR1-1) . ...... 49 5.4 Information retrieval scheme weighted by gene expression levels (IR1-e) 51 iv Chapter 6: Implementation details . .... 52 6.1 Data collection and gene-function matrix formation . ........ 52 6.2 Modelimplementation . .. .. 54 6.3 Improvementsintegration . 57 6.4 Hardwareandsoftwareissues . 57 Chapter7:Literaturereview . .. 59 7.1 Relatedwork ................................ 59 7.2 Discussionofrelatedwork . 66 Chapter 8: Results and discussion . .... 68 8.1 Resultsevaluation.............................. 68 8.2 Discussion and results for data in breast cancer . ...... 70 8.3 Discussion and results for data in lung cancer . ..... 78 Chapter 9: Conclusions and future directions . ....... 88 9.1 Limitations ................................. 89 9.2 Futuredirections .............................. 89 Appendix A: Results for breast cancer data . ..... 91 Appendix B: Results for lung cancer data . 102 References.......................................114 Abstract........................................134 AutobiographicalStatement . 136 v LIST OF TABLES Table. 2.1 Over-representation of the functional categories ........... 22 Table. 2.2 Correct representation of functional categories............ 23 Table. 3.1 Document collection - an example . ..... 27 Table. 3.2 Keywords used in the document collection - an example ...... 27 Table. 7.1 A 2 x 2 contingency table - an example . .... 61 Table. 8.1 Results statistics for breast cancer data set . ........... 69 Table. 8.2 Results statistics for lung cancer data set . .......... 69 Table. 8.3 Selected pairs of biological processes - breast cancer dataset . 72 Table. 8.4 Selected pairs of biological processes - lung cancerdataset . 80 Table. A.1 Breast cancer results for scheme 1-1 . ....... 93 Table. A.2 Breast cancer results for scheme 1-e . ....... 95 Table. A.3 Breast cancer results for scheme IR 1-1 . ....... 97 Table. A.4 Breast cancer results for scheme IR 1-e . ....... 99 Table. B.1 Lung cancer results for scheme 1-1 . .104 Table. B.2 Lung cancer results for scheme 1-e . .106 Table. B.3 Lung cancer results for scheme IR 1-1 . .109 Table. B.4 Lung cancer results for scheme IR 1-e . ......111 vi LIST OF FIGURES Figure. 2.1 DNA - double helical structure . ..... 7 Figure.2.2 DNA-linearstructure . .. 8 Figure.2.3 RNA-linearstructure . .. 8 Figure.2.4 Protein-linearstructure . ..... 8 Figure.2.5 RNAtranscription. .. 9 Figure. 2.6 RNA transcription - base pairing . ...... 9 Figure. 2.7 RNA splicing - pre-mRNA to mRNA . 10 Figure. 2.8 RNA splicing - linear representation . ........ 10 Figure.2.9 Proteintranslation . ... 11 Figure. 2.10 Gene expression - DNA to RNA to proteins . ...... 11 Figure.2.11 Northernblotting . ... 13 Figure.2.12 Polymerasechainreaction . ..... 15 Figure. 2.13 Serial analysis of gene expression . ......... 17 Figure. 2.14 Microarray - DNA hybridization . ...... 18 Figure. 2.15 Gene Ontology graph representation . ........ 19 Figure. 3.1 Singular value decomposition . ...... 31 Figure. 3.2 Matrix approximation in a lower dimensional subspace. 32 Figure.3.3 Correlationcoefficients . .... 34 Figure. 6.1 OntoInteractions gene-function processing interface. 55 Figure. 6.2 OntoInteractions and OntoJoctave interface packages ....... 56 Figure. 8.1 Association between GO:0006508 and GO:0043065 ......... 72 Figure. 8.2 Association between GO:0006270 and GO:0006350,GO:0006355 . 74 vii Figure. 8.3 Association between GO:0006355 and GO:0006281 ......... 75 Figure. 8.4 Association between GO:0016192 and GO:0006366 ......... 76 Figure. 8.5 Association between GO:0006270 and GO:0048015 ......... 77 Figure. 8.6 Association between GO:0006916 and GO:0006954 ......... 79 Figure. 8.7 Association between GO:0044419 and GO:0043066 ......... 82 Figure. 8.8 Association between GO:0007223 and GO:0030154 ......... 83 Figure. 8.9 Association between GO:0007596 and GO:0045786 ......... 84 Figure. 8.10 Association between GO:0045786 and GO:0007155......... 86 viii 1 Chapter 1: Introduction Cells are the fundamental units of life that contain all the working machinery nec- essary for their functioning, survival, and death. This working machinery involves proteins, in general, and RNAs, in some cases. The process through which life gener- ates this working machinery is known as gene expression. Typically, in a eukaryotic cell nucleus, parts of DNA that constitute an expressing gene are transcribed into a (messenger) RNA. The messenger RNA then gets translated into a protein, which in turn usually goes through post-translation modification before becoming a functional part of the cell. These proteins participate in virtually all functions within the cells. In non-protein coding RNA genes, transcribed genes encode for functional RNAs such as ribosomal RNA and transfer RNA, or long non-coding RNAs. These RNAs participate in a protein-assembly process, chemical (e.g. catalytic) activities, and chromosomal inactivation [Lee, 2009] as well as genes regulation [Mercer et al., 2008]. Various genes are expressed or silenced in different cells and tissues, or under different stimuli, internal or external to the organism. A change in the expression patterns of particular genes may give rise to a phenotype [Alberts et al., 2003], cause a disease [Dermitzakis, 2008], or even decide about the fate of the cell [Brand and Perrimon, 1993]. Such changes may be caused by mutagens, hormones, or the ad- dition or removal of epigenetic tags such as ethyl or methyl groups. Understanding the structure and attribute of genes and the proteins they encode holds the promise of enabling us to understand disease states in a fundamental way that was not avail- able before the rapid sequencing of genes and genomes that is now available. High throughput technologies