Region Based Gene Expression Via Reanalysis of Publicly Available Microarray Data Sets
Total Page:16
File Type:pdf, Size:1020Kb
University of Louisville ThinkIR: The University of Louisville's Institutional Repository Electronic Theses and Dissertations 5-2018 Region based gene expression via reanalysis of publicly available microarray data sets. Ernur Saka University of Louisville Follow this and additional works at: https://ir.library.louisville.edu/etd Part of the Bioinformatics Commons, Computational Biology Commons, and the Other Computer Sciences Commons Recommended Citation Saka, Ernur, "Region based gene expression via reanalysis of publicly available microarray data sets." (2018). Electronic Theses and Dissertations. Paper 2902. https://doi.org/10.18297/etd/2902 This Doctoral Dissertation is brought to you for free and open access by ThinkIR: The University of Louisville's Institutional Repository. It has been accepted for inclusion in Electronic Theses and Dissertations by an authorized administrator of ThinkIR: The University of Louisville's Institutional Repository. This title appears here courtesy of the author, who has retained all other copyrights. For more information, please contact [email protected]. REGION BASED GENE EXPRESSION VIA REANALYSIS OF PUBLICLY AVAILABLE MICROARRAY DATA SETS By Ernur Saka B.S. (CEng), University of Dokuz Eylul, Turkey, 2008 M.S., University of Louisville, USA, 2011 A Dissertation Submitted To the J. B. Speed School of Engineering in Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science and Engineering Department of Computer Engineering and Computer Science University of Louisville Louisville, Kentucky May 2018 Copyright 2018 by Ernur Saka All rights reserved REGION BASED GENE EXPRESSION VIA REANALYSIS OF PUBLICLY AVAILABLE MICROARRAY DATA SETS By Ernur Saka B.S. (CEng), University of Dokuz Eylul, Turkey, 2008 M.S., University of Louisville, USA, 2011 A Dissertation Approved On April 20, 2018 by the following Committee __________________________________ Dissertation Director Dr. Eric C. Rouchka __________________________________ Dr. Olfa Nasraoui __________________________________ Dr. Dar-Jen Chang ___________________________________ Dr. Juw Won Park ___________________________________ Dr. Jeffrey C. Petruska ii DEDICATION Dedicated to my parents Nuran Saka and Erol Saka and to my sister Esin Saka iii ACKNOWLEDGEMENTS First, I would like to thank my advisor Dr. Rouchka for his guidance, patience and support throughout my PhD. Without the guidance and the support that he has provided, this experience would not have been the same. I would also like to thank the rest of my dissertation committee members, Dr. Chang, Dr. Park, and Dr. Petruska for giving me the opportunity to learn from them. In particular, I would like to thank Dr. Nasraoui for her constant support and encouragement. I appreciate all my professors who have helped me progress. I also would like to thank Dr. Elmaghraby, Chairman of the Department of Computer Engineering and Computer Science for his support. My experience at the bioinformatics lab would not have been the same without our discussions, collaboration and seminars. Thanks to my former and current lab members Abdallah Eteleeb, Dazhuo Li, Fahim Mohammad, Mohammed Sayed, Mohamed Chaabane, Aanchal Malhotra, and Aryan Neupane. I would love to thank Dr. Harrison for his contributions to biological analysis and interpretation of our research outcomes. I also would like to thank Dr. Chariker for her helpful discussions and pleasant conversations we had. Through the years, several friends have provided support. I would like to thank James Walter Moore and Caroline Fortin for always being on my side, their encouragements and listening ears. iv Most of all, I am so grateful to my family, my mother Nuran, my father Erol, my sister Esin and my brother in love Levent for supporting me in every possible way unconditionally. v ABSTRACT REGION BASED GENE EXPRESSION VIA REANALYSIS OF PUBLICLY AVAILABLE MICROARRAY DATA SETS Ernur Saka April 20, 2018 A DNA microarray is a high-throughput technology used to identify relative gene expression. One of the most widely used platforms is the Affymetrix® GeneChip® technology which detects gene expression levels based on probe sets composed of a set of twenty-five nucleotide probes designed to hybridize with specific gene targets. Given a particular Affymetrix® GeneChip® platform, the design of the probes is fixed. However, the method of analysis is dynamic in nature due to the ability to annotate and group probes into uniquely defined groupings. This is particularly important since publicly available repositories of microarray datasets, such as ArrayExpress and NCBI’s Gene Expression Omnibus (GEO) have made millions of samples readily available to be reanalyzed computationally without the need for new biological experiments. One way in which the analysis can dynamically change is by correcting the mapping between probe sets and targets by creating custom Chip Description Files (CDFs) to arrange which probes belong to which probe set based on the latest genomic information or specific annotations of interest. vi Since default probe sets in Affymetrix® GeneChip® platforms are specific for a gene, transcript or exon, the analyses are then limited to profile differential expression at the gene, transcript or individual exon level. However, it has been revealed that untranslated regions (UTRs) of mRNA have important impacts on the regulation of proteins. We therefore developed a new probe mapping protocol that addresses three issues of Affymetrix® GeneChip® data analyses: removing nonspecific probes, updating probe target mapping based on the latest genome information and grouping the probes into region (UTR, individual exon), gene and transcript level targets of interest to support a better understanding of the effect of UTRs and individual exons on gene expression levels. Furthermore, we developed an R package, affyCustomCdf, for users to dynamically create custom CDFs. The affyCustomCdf tool takes annotations in a General/Gene Transfer Format File (GTF), aligns probes to gene annotations via Nested Containment List (NCList) indexing and generates a custom Chip Description File (CDF) to regroup probes into probe sets based on a region (UTR and individual exon), transcript or gene level. Our results indicate that removing probes that no longer align to the genome without mismatches or align to multiple locations can help to reduce false-positive differential expression, as can removal of probes in regions overlapping multiple genes. Moreover, our method based on regions can detect changes that would have been missed by analysis based on gene and transcript. It also allows for a better understanding of 3’ UTR dynamics through the reanalysis of publicly available data. vii TABLE OF CONTENTS DEDICATION iii ACKNOWLEDGEMENTS iv ABSTRACT vi LIST OF TABLES xiii LIST OF FIGURES xv 1 INTRODUCTION 1 1.1 Motivation ..................................................................................................................2 1.2 Organization of the Document ...................................................................................8 2 OVERVIEW OF MOLECULAR BIOLOGY 10 2.1 DNA .........................................................................................................................11 2.2 RNA .........................................................................................................................15 2.3 mRNA Structure ......................................................................................................19 2.3.1 Untranslated Regions (UTR) .......................................................................19 2.3.2 Poly(A) Tail .................................................................................................20 2.4 Protein ......................................................................................................................20 2.5 Chromosome ...........................................................................................................21 2.6 Gene ........................................................................................................................22 2.7 Gene Regions ...........................................................................................................22 2.3.1 Intron .............................................................................................................22 viii 2.3.2 Exon ..............................................................................................................23 2.3.3 Open Reading Frame (ORF) .........................................................................23 2.3.4 Coding sequence (CDS) ................................................................................23 2.8 Gene Expression .......................................................................................................24 2.9 Central Dogma of Molecular Biology ......................................................................25 2.9.1 General Transfers ..........................................................................................26 2.9.2 Special Transfers ...........................................................................................29 2.10 Genetic Code ..........................................................................................................30 2.11 Alternative Splicing ................................................................................................31 2.12 Transcript ..............................................................................................................33