Analysis of Gene Expression Data for Gene Ontology
ANALYSIS OF GENE EXPRESSION DATA FOR GENE ONTOLOGY BASED PROTEIN FUNCTION PREDICTION

A Thesis Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree Master of Science

Robert Daniel Macholan
May 2011

Thesis Approved:
Advisor: Dr. Zhong-Hui Duan
Committee Member: Dr. Chien-Chung Chan
Committee Member: Dr. Yingcai Xiao

Accepted:
Department Chair: Dr. Chien-Chung Chan
Dean of the College: Dr. Chand K. Midha
Dean of the Graduate School: Dr. George R. Newkome

ABSTRACT

A tremendous increase in genomic data has encouraged biologists to turn to bioinformatics for help in interpreting and processing it. One of the present challenges to understanding this data more completely is the development of a reliable method for accurately predicting the function of a protein from its genomic information. This study focuses on developing an effective algorithm for protein function prediction. The algorithm predicts a protein's function from proteins that have similar expression patterns. The similarity of expression data is determined using a novel measure, the slope matrix, which introduces a normalized method for comparing expression levels throughout a proteome. The algorithm is tested using real microarray gene expression data, and protein functions are characterized using Gene Ontology annotations. The results of the case study indicate that the protein function prediction algorithm developed is comparable to prediction algorithms that are based on the annotations of homologous proteins.
ACKNOWLEDGEMENTS

I could not have finished this project without the continued support of the many people who helped me throughout the entire process. First, I would like to thank my advisor, Dr. Duan. The success of this undertaking would not have been possible without your guidance, patience, and encouragement. Thank you to Dr. Chan and Dr. Xiao for reviewing my work and serving on my committee. Your contributions of time and expertise are sincerely appreciated. Finally, thanks to my family and friends for helping me through all of the challenges and periods of adversity along the way.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. BACKGROUND OF THE STUDY
    1.1 Modern Genetics
    1.2 Structure of DNA
    1.3 Genes
    1.4 Protein Synthesis
    1.5 Gene Ontology
    1.6 Gene Expression
II. PURPOSE OF THE STUDY
III. METHODOLOGY
    3.1 Source Files
    3.2 Data Pre-Processing
    3.3 Algorithm
IV. ANALYSIS OF RESULTS
V. CONCLUSIONS
REFERENCES
APPENDIX. EXPRESSION DATASET AFTER PRE-PROCESSING

LIST OF TABLES

Table
3.1 Cho – mitotic cell cycle.txt
3.2 ‘Cho – mitotic cell cycle’ After Pre-Processing
3.3 Expression Levels for Four Genes
3.4 Slope Matrix
3.5 Calculating Dissimilarity Scores
3.6 Closest Matching Genes
3.7 Closest Matching Genes Implementing 5 Nearest Neighbors
4.1 Prediction Accuracy – Level 1 GO Terms using 1 Nearest Neighbor
4.2 Prediction Accuracy – Level 1 GO Terms using 3 Nearest Neighbors
4.3 Prediction Accuracy – Level 1 GO Terms using 5 Nearest Neighbors
4.4 Prediction Accuracy – Level 2 GO Terms using 1 Nearest Neighbor
4.5 Prediction Accuracy – Level 2 GO Terms using 5 Nearest Neighbors

LIST OF FIGURES

Figure
1.1 The Structure of DNA
1.2 Exons and Introns of a Gene
1.3 Exons being Copied into mRNA
1.4 Crick’s Central Dogma of Molecular Biology
1.5 Subset of the Molecular Function Ontology
1.6 Gene Expression Profiling using a DNA Microarray
1.7 Heat Map Generated from DNA Microarray Data
3.1 Entry in the Yeast Proteome
3.2 Entry of a GO Term in ‘gene_ontology_ext.obo’
3.3 Algorithm for Analysis of Expression Data
3.4 Graph of Expression Levels for Four Genes
4.1 Method for Calculating Results / Prediction Success
4.2 Prediction Accuracy vs. Nearest Neighbors

CHAPTER I
BACKGROUND OF THE STUDY

1.1 Modern Genetics

The experiments of Gregor Mendel marked the beginning of the modern science of genetics. In the 1860s, Mendel’s experiments with garden peas (Pisum sativum) constituted some of the first rigorously scientific and statistical approaches to understanding heredity and inheritance. Mendel chose pea plants for his study because of their variety of observable traits, their ability to self-fertilize, and the fertility of the offspring of self-fertilized plants. The seven traits that he tested were seed shape, seed color, flower color, pod shape, pod color, flower position, and stem height. As the basic units of inheritance, genes are responsible for each of these traits.

Consider one of Mendel’s experiments, in which he crossed true-breeding tall plants with true-breeding short plants. The results were clear: the entire first generation of offspring expressed the tall characteristic. Taking it a step further, he crossed the first-generation plants with one another and discovered that about three quarters of the second generation exhibited the tall trait whereas the remainder exhibited the short trait, a ratio of 3:1. Because the first-generation plants were all tall, tall became known as the dominant allele, or trait. The short trait, on the other hand, was found to be the recessive allele.
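The 3:1 ratio described above can be reproduced by enumerating the allele combinations of each cross. The following is a minimal sketch in Python and is not part of the original study. The allele labels `T` (tall) and `t` (short) and the function names `cross` and `phenotype` are illustrative assumptions, and the sketch assumes that each parent carries two alleles of the gene and passes one of them to each offspring with equal probability.

```python
from collections import Counter
from itertools import product


def cross(parent_a: str, parent_b: str) -> Counter:
    """Count the equally likely offspring genotypes of a one-gene cross."""
    # Each parent contributes one of its two alleles. Sorting the pair makes
    # 'tT' and 'Tt' the same genotype (uppercase sorts before lowercase).
    return Counter("".join(sorted(a + b)) for a, b in product(parent_a, parent_b))


def phenotype(genotype: str) -> str:
    """Assumption: 'T' (tall) is dominant over 't' (short)."""
    return "tall" if "T" in genotype else "short"


# First generation: true-breeding tall crossed with true-breeding short.
f1 = cross("TT", "tt")

# Second generation: two first-generation plants crossed with each other.
f2 = cross("Tt", "Tt")

# Collapse second-generation genotypes into observable phenotypes.
f2_phenotypes = Counter()
for genotype, count in f2.items():
    f2_phenotypes[phenotype(genotype)] += count

print(f1)             # every first-generation plant carries one allele of each kind
print(f2)             # genotype counts in the second generation
print(f2_phenotypes)  # tall versus short counts in the second generation
```

Enumerating the four equally likely allele pairings is the same bookkeeping as a Punnett square; the counts are expected proportions, not observed data.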
Furthermore, Mendel came to understand the difference between an organism’s genotype, or combination of alleles, and its phenotype, or physical expression of those alleles. The first-generation pea plants had inherited a dominant (tall) allele and a recessive (short) allele from their parents; however, their phenotype expressed only the dominant trait. Following the cross of the first generation, the second-generation plants possessed a variety of genotypes. One quarter of the plants had two tall alleles, one half had one tall and one short allele, and one quarter had two short alleles. Only the genotype with two recessive alleles produced the short phenotype. In the end, Mendel made the first bold strides toward understanding the basic unit of heredity, the gene [3].

1.2 Structure of DNA

Another significant milestone in genetics came when James Watson and Francis Crick discovered the structure of DNA, or deoxyribonucleic acid. By the early twentieth century, scientists were searching for the source and structure of hereditary material. A 1952 experiment with bacteriophages by Alfred Hershey and Martha Chase concluded that DNA, rather than protein, was the physical carrier of heredity. Years of research had revealed much of the composition of DNA, but Watson and Crick were the ones who ultimately constructed an accurate model of its structure.

DNA is composed of two complementary strands of genetic material. The bases of each strand pair with one another like the rungs of a ladder. The purines, adenine and guanine, always pair with their respective pyrimidines, thymine and cytosine. The nucleotide bases are held together by hydrogen bonds. The strands of the