Biological Computation: the Development of a Genomic Analysis
BIOLOGICAL COMPUTATION: THE DEVELOPMENT OF A GENOMIC ANALYSIS PIPELINE TO IDENTIFY CELLULAR GENES MODULATED BY THE TRANSCRIPTION / SPLICING FACTOR SRSF1

by Evan Clark

A Thesis Submitted to the Faculty of The College of Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Science

Florida Atlantic University
Boca Raton, FL
May 2017

Copyright 2017 by Evan Clark

INTRODUCTION
BACKGROUND ON RNA-SEQ
THE SRSF1 PROTEIN
INSTALLATION & DEPLOYMENT OF COMPUTATIONAL CLUSTER RESOURCES FOR RNA-SEQ DATA
DEVELOPMENT OF ANALYSIS PIPELINE
RNA-SEQ ANALYSIS OF HEK293 CELLS TRANSFECTED WITH AN SRSF1 AND AN RRM12 OVER EXPRESSION VECTOR
    Preprocessing of sequencing data
    Alignment of Sequences to Reference Genome
    Differential Gene Expression Analysis
    Histone genes are regulated by SRSF1 expression
    Key cellular pathways are regulated by SRSF1
APPENDIX A1
BIBLIOGRAPHY

ABSTRACT

Author: Evan Clark
Title: Biological Computation: the development of a genomic analysis pipeline to identify cellular genes modulated by the transcription / splicing factor SRSF1
Institution: Florida Atlantic University
Thesis Advisor: Dr. Waseem Asghar
Degree: Master of Science
Year: 2017

SRSF1 is a widely expressed mammalian protein with multiple functions in the regulation of gene expression through processes including transcription, mRNA splicing, and translation. Although much is known of SRSF1's role in alternative splicing of specific genes, little is known about its functions as a transcription factor and its global effect on cellular gene expression. We utilized an RNA sequencing (RNA-Seq) approach to determine the impact of SRSF1 on cellular gene expression and analyzed both the short-term (12 hours) and long-term (48 hours) effects of SRSF1 expression in a human cell line. Furthermore, we analyzed the effect of expressing a naturally occurring deletion mutant of SRSF1 (RRM12) and compared it to that of the full-length protein. Our analysis reveals that shortly after SRSF1 is over-expressed, the transcription of several histone coding genes is down-regulated, allowing for a more relaxed chromatin state and efficient transcription by RNA Polymerase II. This effect is reversed at 48 hours. At the same time, key genes for the immune pathways are activated, most notably Tumor Necrosis Factor-Alpha (TNF-α), suggesting a role for SRSF1 in T cell functions.

INTRODUCTION

Beginning in the early 2000s, novel methods for sequencing DNA came into high demand. Several technologies attempted to revolutionize sequencing through methods that would later be described as next generation. The major technique developed during this time was sequencing via synthesis.
This method works by hybridizing single-stranded DNA (ssDNA) fragments to a well and progressively adding nucleotides at each base until a match occurs, at which point laser-based excitation induces a signal from a visual reporter. The development of synthesis sequencing technology has led to several new experimental techniques that allow for the quantification of DNA and RNA. One of these techniques is RNA sequencing (RNA-Seq). This sequencing method utilizes RNA transcripts obtained from samples, which are converted to cDNA libraries and then sequenced. The sequencing results contain millions of sequenced transcripts, also known as sequence reads. Each read contains the identified nucleotide sequence and a corresponding base quality score assigned by the sequencing device. RNA-Seq has begun to replace techniques such as microarrays for identifying changes in gene expression. Compared to microarrays, RNA-Seq provides several advantages: i) genome-wide analysis of transcript abundance that is not limited by probe quantity, ii) identification of novel transcript sequences, and iii) quantification of alternative splicing events.

BACKGROUND ON RNA-SEQ

RNA is first extracted from cells using either organic phenol extraction or solid-phase extraction (utilized in this project). Solid-phase extraction differs from organic extraction in that the RNA molecules bind directly to silica fibers within a membrane and can be eluted after all other contaminants are washed away. To perform RNA-Seq, the sample must be treated with DNase and cleared of the ribosomal RNA that could interfere with the sequencing process. The extracted mRNA is then reverse-transcribed into cDNA, which will be amplified and used in sequencing. Before the cDNA can be sequenced, libraries are generated; these are composed of amplified cDNA sequences of a specific length with proprietary sequences inserted within the cDNA.
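To make the notion of a sequence read and its base quality scores concrete, the following is a minimal sketch, not the parsing code used in this project. It assumes reads are stored as four-line FASTQ records with Phred+33 quality encoding, and the names `Read`, `parse_fastq_record`, and `error_probability` are hypothetical, as is the example record.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq_record(lines: List[str], offset: int = 33) -> Read:
    """Parse one four-line FASTQ record (assumed Phred+33 encoding)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Each quality character encodes a Phred score as ASCII code minus offset.
    return Read(header[1:], sequence, [ord(c) - offset for c in quality])


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to a base-call error probability, 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Made-up record for illustration only; not data from this project.
record = ["@read_1", "GATTACA", "+", "IIII5I#"]
read = parse_fastq_record(record)
print(read.qualities)  # [40, 40, 40, 40, 20, 40, 2]
print(error_probability(read.qualities[4]))
```

A Phred score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000, which is why low-scoring bases are commonly trimmed or filtered during preprocessing.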
Library generation includes two major processes: adapter insertion and sequence amplification. Adapters are unique sequences of DNA used to tether each DNA fragment to the sequencing lane. The usage of adapter sequences may differ between sequencing platforms; here we focus on the Illumina sequencing platforms. Additionally, barcode sequences are inserted as part of the adapter sequence so that the sample origin of each DNA fragment can be easily determined. Fragments from several experiments can be included in a single sequencing run; this is known as multiplexing. The adapters are then used to generate clusters of similar fragments within the sequencing lanes through a process known as bridge amplification. After clusters are generated, the fragments are ready to be sequenced.

Figure 1. Flowchart for the sample preparation protocol utilized in RNA-Seq.

Figure 2. Sequencing Pipeline. This figure describes the normal approach to sequencing on the Illumina sequencing platform. Specifically, on an Illumina NextSeq 500, cDNA is fragmented into equal-length sequence pairs containing proprietary adapters that attach the read to the sequencing bed. The NextSeq 500 can record up to 800 million paired-end reads per run spread across 8 lanes. Reads are amplified using bridge amplification, which generates reverse reads. The reads are again amplified to form clusters, and bases are called by progressively adding nucleotides to their corresponding base pairs. The reads emit a color when excited by a laser at 500 nm, and this color is read by a detector that reports a nucleotide. Images obtained from https://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

The sequencing process might utilize either single-end or paired-end reads. Single-end reads are read from a single direction only during sequencing, whereas paired-end reads are read from both directions.
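As a toy illustration of what reading a fragment "from both directions" means, the sketch below derives the two mates of a paired-end read from a single fragment. It is a simplified model under stated assumptions: read 1 is taken from the start of the fragment, read 2 from the start of the opposite strand, and adapters, sequencing errors, and library strandedness are ignored. The function names and the example fragment are hypothetical.

```python
from typing import Tuple

# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> Tuple[str, str]:
    """Simulate the two mates obtained from one fragment."""
    read1 = fragment[:read_length]                      # forward direction
    read2 = reverse_complement(fragment)[:read_length]  # opposite direction
    return read1, read2


# Made-up 16-base fragment for illustration only.
fragment = "ATGGCCTTAGCATCGA"
r1, r2 = paired_end_reads(fragment, 5)
print(r1)  # ATGGC
print(r2)  # TCGAT
```

A single-end run would yield only `r1`. With both mates, an aligner knows that the two reads originate from opposite ends of the same fragment, which constrains where and in which orientation they can be placed on the reference.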
Using paired-end sequencing improves overall sequencing quality, allows for the identification of sequence rearrangements, allows for better alignment of reads to a reference genome, and allows for the identification of novel isoforms by decreasing the number of sequence fragment gaps that occur during alignment.

On the Illumina platform, sequencing is performed through a method known as sequencing via synthesis. During sequencing, special nucleotides containing a unique fluorescent reporter molecule are hybridized to their matching bases on each fragment. During this process, a laser at a wavelength of 488 nm excites the fluorescent tags and a detector reads the color response. These responses are then recorded as a base call for each fragment. This information is then written to a multiplexed base call (BCL) file containing all the reads from each sequencing run. This file is then demultiplexed using the barcode sequences inserted earlier to produce FASTQ files for each sample. If a sample is split across multiple sequencing lanes, a FASTQ file is generated for each lane as well.

THE SRSF1 PROTEIN

The serine/arginine-rich splicing factor 1 (SRSF1) is a widely expressed mammalian RNA-binding protein that is a member of the serine/arginine-rich protein family. SRSF1 contains three major domains within its structure (Fig. 3): RNA recognition motif 1 (RRM1) (16-91 aa), RNA recognition motif 2 (RRM2) (121-195 aa), and the serine/arginine-rich (SR) domain (198-247 aa). RRM1 and RRM2 have a significant role in RNA binding, while the SR domain is heavily phosphorylated and is thought to mediate protein/protein interactions. SRSF1 has multiple functions in regulating gene expression through biological processes including transcription, mRNA splicing, and translation. Specifically, its primary function is to serve as a master regulator for alternative