BIOLOGICAL COMPUTATION: THE DEVELOPMENT OF A GENOMIC ANALYSIS
PIPELINE TO IDENTIFY CELLULAR GENES MODULATED BY THE TRANSCRIPTION /
SPLICING FACTOR SRSF1
by
Evan Clark
A Thesis Submitted to the Faculty of
The College of Engineering and Computer Science
In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Florida Atlantic University
Boca Raton, FL
May 2017
Copyright 2017 by Evan Clark
INTRODUCTION
BACKGROUND ON RNA-SEQ
THE SRSF1 PROTEIN
INSTALLATION & DEPLOYMENT OF COMPUTATIONAL CLUSTER RESOURCES FOR RNA-SEQ DATA
DEVELOPMENT OF ANALYSIS PIPELINE
RNA-SEQ ANALYSIS OF HEK293 CELLS TRANSFECTED WITH AN SRSF1 AND AN RRM12 OVER EXPRESSION VECTOR
Preprocessing of sequencing data
Alignment of Sequences to Reference Genome
Differential Gene Expression Analysis
Histone genes are regulated by SRSF1 expression
Key cellular pathways are regulated by SRSF1
APPENDIX A1
BIBLIOGRAPHY
ABSTRACT
Author: Evan Clark
Title: Biological Computation: The Development of a Genomic Analysis Pipeline to Identify Cellular Genes Modulated by the Transcription / Splicing Factor SRSF1
Institution: Florida Atlantic University
Thesis Advisor: Dr. Waseem Asghar
Degree: Master of Science
Year: 2017
SRSF1 is a widely expressed mammalian protein with multiple functions in the regulation of gene expression through processes including transcription, mRNA splicing, and translation. Although much is known about the role of SRSF1 in the alternative splicing of specific genes, little is known about its functions as a transcription factor and its global effect on cellular gene expression. We utilized an RNA sequencing (RNA-Seq) approach to determine the impact of SRSF1 on cellular gene expression and analyzed both the short-term (12 hours) and long-term (48 hours) effects of SRSF1 expression in a human cell line. Furthermore, we analyzed the effect of expressing a naturally occurring deletion mutant of SRSF1 (RRM12) and compared it to that of the full-length protein. Our analysis reveals that shortly after SRSF1 is over-expressed, the transcription of several histone-coding genes is down-regulated, allowing for a more relaxed chromatin state and efficient transcription by RNA Polymerase II. This effect is reversed at 48 hours. At the same time, key genes of the immune pathways are activated, most notably Tumor Necrosis Factor-Alpha (TNF-α), suggesting a role for SRSF1 in T cell functions.
INTRODUCTION
Beginning in the early 2000s, novel methods for sequencing DNA came into high demand. Several technologies attempted to revolutionize sequencing through methods that would later be described as next generation. The major technique developed during this time was sequencing by synthesis. This method works by taking single-stranded DNA (ssDNA) fragments, hybridizing them within a sequencing well, and progressively adding labeled nucleotides. When a nucleotide matching the next base of the fragment is incorporated, its visual reporter is excited by a laser and the emitted signal is recorded.
The development of synthesis sequencing technology has led to several new experimental techniques that allow for the quantification of DNA and RNA.
One of these techniques is RNA sequencing (RNA-Seq). This sequencing method utilizes RNA transcripts obtained from samples that are converted to cDNA libraries and then sequenced. The sequencing results contain millions of sequenced transcripts known also as sequence reads. Each read contains the identified nucleotide sequence and a corresponding base quality score assigned by the sequencing device.
RNA-Seq has begun to replace techniques such as microarrays for identifying changes in gene expression. Compared to microarrays, RNA-Seq provides several advantages: i) genome-wide analysis of transcript abundance that is not limited by probe quantity, ii) identification of novel transcript sequences, and iii) quantification of alternative splicing events.
BACKGROUND ON RNA-SEQ
RNA is first extracted from cells using either an organic phenol extraction or a solid-phase extraction (utilized in this project). Solid-phase extraction differs from organic extraction in that the RNA molecules bind directly to silica fibers within a membrane and can be eluted after all other contaminants are washed away. In order to perform RNA-Seq, the sample must be treated with DNase and cleared of the ribosomal RNA that could interfere with the sequencing process. The extracted mRNA is then reverse-transcribed into cDNA, which is amplified and used in sequencing.
In order for the cDNA to be sequenced, libraries composed of amplified cDNA sequences of a specific length, with proprietary sequences inserted within the cDNA, are generated. Library generation includes two major processes: adapter insertion and sequence amplification. Adapters are unique sequences of DNA used to tether each DNA fragment to the sequencing lane. The usage of adapter sequences may differ depending on the sequencing platform; here we focus on the Illumina sequencing platforms. Additionally, barcode sequences are inserted as part of the adapter sequence so that the sample of origin of each DNA fragment can be easily determined. Fragments from several experiments can be included in a single sequencing run; this is known as multiplexing. The adapters are then used to generate clusters of similar fragments within the sequencing lanes through a process known as bridge amplification. After clusters are generated, the fragments are ready to be sequenced.
Figure 1. Flowchart for the sample preparation protocol utilized in RNA-Seq.
Figure 2. Sequencing pipeline. The standard approach to sequencing on the Illumina platform is shown. Specifically, on an Illumina NextSeq 500, cDNA is fragmented into equal-length sequence pairs containing proprietary adapters that attach the read to the sequencing bed. The NextSeq 500 can record up to 800 million paired-end reads per run spread across 8 lanes. Reads are amplified using bridge amplification, which generates reverse reads. The reads are again amplified to form clusters, and bases are called by progressively adding nucleotides to their corresponding base pairs. The reads emit a color when excited by a laser at 500nm, which is then read by a detector that reports a nucleotide. Images obtained from https://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
The sequencing process may utilize either single-end or paired-end reads. Single-end reads are read from a single direction during sequencing, whereas paired-end reads are read from both directions. Paired-end sequencing improves overall sequencing quality, allows for the identification of sequence rearrangements, allows for better alignment of reads to a reference genome, and allows for the identification of novel isoforms by decreasing the number of sequence fragment gaps that occur during alignment. On the Illumina platform, sequencing is performed through a method known as sequencing by synthesis. During sequencing, special nucleotides, each carrying a unique fluorescent reporter molecule, are hybridized to their matching bases on each fragment. A laser at a wavelength of 488nm excites the fluorescent tags and a detector reads the color response. These responses are then recorded as a base call for each fragment. This information is written to a multiplexed base call (BCL) file containing all the reads from the sequencing run. This file is then demultiplexed using the barcode sequences inserted earlier to produce FASTQ files for each sample. If a sample is split across multiple sequencing lanes, a FASTQ file is generated for each lane as well.
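To make the structure of a FASTQ file concrete, the following is a minimal Python sketch of a reader for the four-line record layout (header, sequence, separator, quality string). It is an illustration only and not part of the pipeline described here; the names `FastqRecord` and `read_fastq`, and the two example reads, are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header line without the leading "@"
    sequence: str   # called bases
    quality: str    # one quality symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four-line FASTQ entry in the handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n##II\n")
records = list(read_fastq(example))
```

Each record keeps the base calls and their quality symbols side by side, which is the pairing that the downstream quality-control steps rely on.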
THE SRSF1 PROTEIN
The serine/arginine rich splicing factor 1 (SRSF1) is a widely expressed mammalian RNA binding protein that is a member of the serine-arginine rich protein family. SRSF1 contains three major domains (Fig. 3): RNA recognition motif 1 (RRM1) (16-91 aa), RNA recognition motif 2 (RRM2) (121-195 aa), and the serine-arginine rich (SR) domain (198-247 aa). RRM1 and RRM2 have a significant role in RNA binding, while the SR domain is heavily phosphorylated and is thought to mediate protein/protein interactions. SRSF1 has multiple functions in regulating gene expression through biological processes including transcription, mRNA splicing, and translation. Specifically, its primary function is to serve as a master regulator of alternative splicing through its ability to activate splice events. Alternative splicing is a biological process that allows a single gene to code for multiple proteins by joining alternative coding regions (exons) in the messenger RNA. During this process, exons may be skipped or included in the final sequence of the mature mRNA. Furthermore, the laboratory of Dr. Caputi and others have shown that SRSF1 has the potential to function as a transcription factor (1, 2).
Figure 3. SRSF1 protein structure. The SRSF1 domains (RRM1, RRM2, RS) are indicated.
Dysregulation of SRSF1 has been shown to lead to several negative physiological effects. Overexpression of SRSF1 has been observed to cause the transformation of normal endothelial cells into malignant tumor cells in breast cancer (3), and SRSF1 has also been shown to be up-regulated in acute lymphoblastic leukemia (4). In addition, SRSF1 has been associated with diseases such as spinocerebellar ataxia and has been shown to be a key modulator of genes within the Human Immunodeficiency Virus (HIV) genome through transcription and splicing regulation (5, 6).
Given the multiple functions of SRSF1 in gene expression, we set out to characterize the role in gene expression of SRSF1 and of a deletion mutant (RRM12), which corresponds to naturally occurring isoforms of SRSF1 lacking the SR domain, utilizing an RNA-Seq approach. Our ultimate goal was to characterize the genes and the biological pathways that SRSF1 modulates through its activity as both a transcription factor and a splicing factor.
Figure 4. A) Gene transcripts generated by alternative splicing of the SRSF1 messengers. Three of the alternatively spliced messengers code for protein isoforms. ASF-1 (248 aa) is the canonical isoform, expressed with high abundance in most tissues. B) Schematic representation of the SRSF1 gene structure with the coding exons and their location relative to the protein structure.
INSTALLATION & DEPLOYMENT OF COMPUTATIONAL CLUSTER RESOURCES FOR RNA-SEQ DATA
When developing the distributed computing grid for the processing of high-throughput sequencing data, it was necessary to have a customizable GUI for straightforward analysis of data and the creation of reusable analysis pipelines. Galaxy (7), a collaborative platform aimed at simplifying bioinformatics, was chosen as the host for our analysis software packages because of its extensive library and management capabilities for command line applications. The library, known as the Galaxy toolshed (8), provides access to the installers and dependencies for commonly used bioinformatics tools, including RNA reference aligners and transcript quantification applications. Galaxy was installed on a dedicated server within the FAU High Performance Computing (HPC) cluster and cloned to the secure cluster for use with sensitive information.
The FAU HPC cluster was developed to allow students and faculty to run complex high IOP (Input/Output Processor) compute jobs on shared hardware purchased through academic research grants. The cluster is composed of nearly 100 nodes, each containing 128 GB of memory, dual Intel Xeon processors, and additional Xeon Phi co-processors, connected through high-capacity 40 Gb/s InfiniBand FDR (fourteen data rate) networking. The nodes currently run Scientific Linux 6, a derivative of Red Hat Enterprise Linux 6, but are in the process of being upgraded to Enterprise Linux 7, the latest version of the Enterprise Linux software line.
Once installed, the config file was modified for running galaxy in a production environment. [See appendix A1 for modified config file] Administrative access was granted to specified users and galaxy was connected to an external structured query language (SQL) database. PostgreSQL was enabled and initialized. A galaxy user was created within the SQL server and all privileges were granted to a newly generated galaxy database.
sudo su - postgres
psql
CREATE DATABASE galaxy_production;
CREATE USER galaxy WITH PASSWORD '******';
GRANT ALL PRIVILEGES ON DATABASE galaxy_production TO galaxy;
This database was then connected to Galaxy through the SQLAlchemy Python library by adding the database connection settings to the Galaxy config file. The database was then migrated and updated to link to all existing datasets managed by Galaxy, and Galaxy was started:
cd ~/galaxy/
sh run.sh
Once installation was completed, the Galaxy toolshed was mounted by uncommenting the following lines in the Galaxy config file.
# Directory where Tool Data Table related files will be placed
# when installed from a ToolShed. Defaults to tool_data_path.
#shed_tool_data_path = tool-data
With the toolshed added, tools could then be installed into Galaxy.
DEVELOPMENT OF ANALYSIS PIPELINE
The computing cluster was designed to maximize efficiency when performing RNA-Seq analysis. Our experimental methodology focuses on computing differential expression at the gene level and quantifying alternative splicing events at the exon level. Both analysis workflows require the generation of a transcript assembly by aligning the raw read data to a reference genome through a process known as mapping. Mapping is completed using a software algorithm that aligns transcripts to their complementary DNA gene sequences and assesses the read quality.
The FASTQC (9) Java application was added to Galaxy in order to assess raw read quality from sequencing data. FASTQ files are sent through FASTQC to generate average quality bar plots for the reads. This program provides a visual and numerical representation of the quality of the clusters generated and of read accuracy. FASTQC also provides contamination information for each sample by indicating over-represented sequences, such as those contained in the adapters used to bind DNA to the sequencing wells. Finally, FASTQC provides a measure of the overall sequence run quality, which is reported on a scale of 0-35, with good quality being above the value 30. FASTQC determines read quality by reading the quality symbols coupled with each sequence fragment. Depending on the character value, higher quality is assigned to nucleotides whose color was unambiguous during sequencing, and lower quality is assigned to nucleotides whose color was not easily identifiable by the sequencer. Each quality symbol encodes a phred score that indicates the probability that the base call is incorrect. Figure 5 shows the quality per tile of reads within the sequencing well (blue indicates high quality, red indicates poor quality) and the average quality of reads at each nucleotide position in the read.
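The relationship between a phred score Q and the probability P that a base call is wrong is P = 10^(-Q/10), so a score of 30 corresponds to one expected error in 1,000 calls. The short Python sketch below illustrates this conversion and the decoding of quality symbols. The ASCII offset of 33 is an assumption (it is the common Sanger / Illumina 1.8+ encoding) and the function names are hypothetical.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert FASTQ quality symbols to phred scores.

    The offset of 33 is assumed here; older encodings used a different offset.
    """
    return [ord(symbol) - offset for symbol in quality]


def phred_to_error_probability(score: float) -> float:
    """Probability that a base call with the given phred score is incorrect."""
    return 10 ** (-score / 10)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average phred score across one read."""
    scores = decode_quality(quality, offset)
    return sum(scores) / len(scores)


# "I" encodes 40, "5" encodes 20 and "#" encodes 2 under the assumed offset.
scores = decode_quality("I5#")
```

Under this relationship, the threshold of 28 used later in the pipeline corresponds to an error probability of roughly 0.0016 per base.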
Figure 5. Example read quality output files from FASTQC. These graphs represent quality across sequencing wells and average quality across read positions in each sequencing fragment. Each flow cell contains millions of wells, each containing nucleotide clusters of the same type. Average quality is reported across each well, with blue signifying high quality and red indicating poor quality. Read quality is assessed by phred score on a scale from 0 to 36, with 28 to 36 being ideal read quality.
If the quality metrics indicate that the reads fall below a threshold average quality, read trimming needs to be performed. Read trimming works by identifying sections of sequence pairs whose combined quality is below a certain value (generally 28-30) and removing them from the unaligned sequences. This operation is known as sliding frame trimming; however, depending on the type of sequencing performed, different trim operations can be used.
In order to trim reads accurately, the Trimmomatic package was installed in Galaxy. This package provides multiple options for trimming, including read adapter trimming, foreign genomic DNA sequence removal, and sliding frame quality trimming. In our data-processing pipeline we utilize the sliding frame trim operation to cut bases from reads that are not at or above a specified read quality level.
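The idea behind sliding frame trimming can be sketched in a few lines of Python. This is a simplified illustration of the general technique and not Trimmomatic's implementation: a window of fixed width moves from the 5' end of the read, and the read is cut at the first window whose mean quality falls below the threshold. The window width of 4 and the function name are assumptions made for the example.

```python
from typing import List, Tuple


def sliding_window_trim(
    sequence: str,
    qualities: List[int],
    window: int = 4,                 # assumed window width
    min_mean_quality: float = 28.0,  # threshold discussed in the text
) -> Tuple[str, List[int]]:
    """Cut a read at the first window whose mean phred score is too low.

    Reads shorter than the window are returned unchanged.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    for start in range(0, max(len(qualities) - window + 1, 0)):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            # Keep only the bases that precede the failing window.
            return sequence[:start], qualities[:start]
    return sequence, qualities


# Made-up read whose quality decays toward the 3' end.
trimmed_seq, trimmed_qual = sliding_window_trim(
    "ACGTACGTAC", [38, 38, 37, 36, 35, 34, 20, 12, 10, 5]
)
```

In this example the window starting at the fifth base has a mean score of 25.25, so only the first four bases are kept.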
Once trimming is completed, the resulting paired reads can be mapped to a reference genome using a splice-aware gapped read mapper. RNA-Seq produces reads from cDNA generated from mRNA, which does not contain introns. When performing an alignment, a reference genome composed of DNA sequences is used to identify the location of the fragments on the genome. RNA fragments, however, can span multiple exons and require gaps to be inserted in order for alignment to be performed correctly. Splice-aware gapped read mappers resolve this issue by identifying the parent DNA sequence of the fragments and splitting them at the location of the introns, while still preserving splice events.
Using the repository provided by Galaxy, we installed the Hierarchical Indexing for Spliced Alignment of Transcripts 2 (HISAT2) tool to perform the genome-guided assembly on our transcripts. HISAT2 is a splice-aware gapped read mapper and is the successor to the common TopHat2 aligner. HISAT2 provides two key functionalities required when performing differential expression analysis and alternative splicing analysis: it produces alignments tailored specifically for the Cufflinks analysis package, which is used for differential expression (using the --dta-cufflinks option), and it generates alignment files compatible with MISO. Furthermore, HISAT2 offers a balance between low memory footprint and speed; running multi-threaded, it can complete alignments 15 times faster than TopHat2. HISAT2 provides a sequence alignment map (SAM) with the location of each transcript within its corresponding gene sequence in the reference genome (we utilized the GRCh38 assembly of the human genome as the reference genome). The splice-aware functionality allows the application to determine which exon each portion of a sequenced fragment belongs to and to map each portion to the corresponding gene, which is required for the downstream quantification of splice variants at the exon level. The SAM file is then sorted and converted to a binary alignment map (BAM) through the samtools sort application, and an accompanying index file is generated using the samtools index application (10). The final BAM files are then run through the genome coverage bedgraph (11) tool to generate a track map of the read counts and alignments to exons.
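In a SAM record, a spliced alignment is encoded in the CIGAR string, where the N operation marks a stretch of reference sequence (an intron) that the read skips. The following Python sketch shows how the exon-level blocks covered by one read could be recovered from the alignment start position and the CIGAR string. It is an illustration of the file format rather than of HISAT2 itself, and the function name and example coordinates are hypothetical.

```python
import re
from typing import List, Tuple

# One CIGAR element is a length followed by a single operation letter.
CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive reference intervals covered by a read.

    M, =, X and D advance along the reference within a block; N (skipped
    region, typically an intron) closes the current block and opens a new one.
    I, S, H and P do not consume reference positions.
    """
    blocks: List[Tuple[int, int]] = []
    block_start = pos
    ref_cursor = pos
    for length_text, operation in CIGAR_PATTERN.findall(cigar):
        length = int(length_text)
        if operation in "M=XD":
            ref_cursor += length
        elif operation == "N":
            if ref_cursor > block_start:
                blocks.append((block_start, ref_cursor - 1))
            ref_cursor += length
            block_start = ref_cursor
    if ref_cursor > block_start:
        blocks.append((block_start, ref_cursor - 1))
    return blocks


# A read starting at position 1000 that spans a 2000-base intron.
blocks = aligned_blocks(1000, "50M2000N30M")
```

For the example read, the two returned blocks are the portions that fall in the upstream and downstream exons, which is the information exon-level splicing analysis depends on.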
The BAM file can then be utilized to quantify gene and transcript expression levels. For this purpose we utilized the Cufflinks toolset. The Cufflinks package (12) was downloaded and compiled from the Galaxy toolshed repository. This package installed the latest versions of cufflinks, cuffdiff, and cuffquant. Cuffquant is utilized to quantify transcript abundance levels across whole genome assemblies generated by cufflinks. The quantification file is then fed into cuffdiff, which computes the transcript- and gene-level differential expression between multiple samples. Cuffdiff performs pair-wise comparisons between datasets and generates a TSV file containing log2 fold changes for all genes, as well as q-values if replicates are present. In order to group the aligned sequences that belong to each gene, a gene transfer format (GTF) file is utilized, as it provides genome coordinates for each gene in addition to the gene symbol.
The database produced by cuffdiff can then be plotted using the CummeRbund R package, which produces visualizations of differential expression data including heatmaps, boxplots, dendrograms, and volcano plots. The differential expression pipeline is outlined in Figure 6.
[Flowchart, shown for two conditions: File inputs (FASTQ); Concatenate tail-to-head; forward reads (R1) and reverse reads (R2); FASTQC; "Is average read quality above 28?" (yes: HISAT2; no: FASTQ Quality Trimmer, then HISAT2); Cuffquant (with input GFF file); Cuffdiff; differential gene expression list.]
Figure 6. The pipeline used in generating differential gene expression information. This pipeline includes merging of read pairs into two final forward and reverse files, followed by QC metrics using FASTQC to ensure that read quality is high and to determine whether trimming is necessary. If average read quality is below the threshold of 28, reads are trimmed for quality using Trimmomatic. The reads are then aligned to a reference genome, in this case GRCh38, using the HISAT2 alignment tool. The final BAM files are sent into cuffdiff with a reference GFF file that contains gene locations for the specific reference used. BAM files are pairwise quantified using cuffquant through cuffdiff, and gene expression lists are output.
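For readers who want to reproduce the steps in Figure 6 outside of Galaxy, the sketch below assembles roughly equivalent command lines in Python without running them. It is an approximation under stated assumptions: the thesis ran these tools through Galaxy wrappers, so the exact options used are not known, and all file names, the thread count, and the function names are hypothetical. Only the --dta-cufflinks option is taken from the text.

```python
from typing import Dict, List, Sequence


def build_sample_commands(
    sample: str, index_prefix: str, annotation: str, threads: int = 8
) -> Dict[str, List[str]]:
    """Build per-sample commands for QC, alignment, sorting and quantification."""
    r1 = f"{sample}_R1.fastq"   # hypothetical merged forward reads
    r2 = f"{sample}_R2.fastq"   # hypothetical merged reverse reads
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return {
        "fastqc": ["fastqc", r1, r2],
        "align": ["hisat2", "-p", str(threads), "--dta-cufflinks",
                  "-x", index_prefix, "-1", r1, "-2", r2, "-S", sam],
        "sort": ["samtools", "sort", "-o", bam, sam],
        "index": ["samtools", "index", bam],
        "quantify": ["cuffquant", "-p", str(threads),
                     "-o", f"{sample}_cuffquant", annotation, bam],
    }


def build_cuffdiff_command(
    annotation: str,
    labels: Sequence[str],
    abundance_files: Sequence[str],
    output_dir: str = "cuffdiff_out",
) -> List[str]:
    """Build one cuffdiff call; each entry of abundance_files is one condition
    (replicates of a condition would be joined with commas)."""
    return ["cuffdiff", "-o", output_dir, "-L", ",".join(labels),
            annotation, *abundance_files]


commands = build_sample_commands("SRSF1_12h", "grch38_index", "genes.gtf")
cuffdiff_command = build_cuffdiff_command(
    "genes.gtf",
    ["SP6", "SRSF1"],
    ["SP6_12h_cuffquant/abundances.cxb", "SRSF1_12h_cuffquant/abundances.cxb"],
)
```

Each command list could be passed to `subprocess.run` once the corresponding tool is installed; the optional trimming step is left out because it only applies when the quality check fails.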
Alternative splicing is a biological process that occurs during transcription and alters the expression of genes through the alternative inclusion and exclusion of exonic sequences, allowing a single gene to code for multiple proteins. There are five primary types of alternative splicing events: skipped exons, where one or more exons are not included in the final messenger RNA (mRNA); alternative 5' and alternative 3' splice sites, where competing 5' donor and 3' acceptor sites dictate the portion of an exon that is included in or excluded from the mRNA; retained introns, where a sequence may either be retained in the mRNA as coding sequence or removed as an intron; and mutually exclusive exons, where the inclusion of one exon in the final messenger always excludes another exon.
In parallel to the data processing performed by the Cufflinks package, we also utilize the Bayesian probability-based splicing analysis software MISO (13). MISO, the tool we utilized in our pipeline to quantify alternative splicing events, was developed by the Burge lab at MIT and is widely utilized for the quantification of known and novel splicing events. MISO utilizes an exon-centric analysis, which focuses on the level of individual splice events, and differs from isoform-centric analysis in that it does not quantify expression levels of whole gene isoforms.
More specifically, MISO uses known splice junctions, categorized by event type (skipped exon (SE), retained intron (RI), alternative 3' splice site (A3SS), alternative 5' splice site (A5SS), and mutually exclusive exons (MXE)), generated from all assemblies of the human genome. The output produced by MISO contains chromosome locations without gene identifications. In order to better quantify overall changes in alternative splicing at the gene level, the MISO output is converted to a browser extensible data (BED) file, which reports each splice event as a position on a genome browser track. This file is then annotated using the Homer package from the Salk Institute (14), which inserts a gene name and symbol for each splice event. The original Bayes file produced by MISO, which contains expression quantifications in the form of percent spliced in (PSI) values (13) that indicate the fraction of mRNA that represents exon inclusion, and the output file from Homer are then merged to produce a tab separated value (TSV) file containing gene identifiers for each splice event.
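To illustrate what a PSI value expresses, the following Python sketch computes a naive count-based estimate from the number of reads supporting exon inclusion and exon exclusion. This is deliberately not MISO's method, which infers PSI with a Bayesian model; the sketch only shows the quantity being estimated. The function names and read counts are hypothetical.

```python
def naive_psi(inclusion_reads: int, exclusion_reads: int) -> float:
    """Fraction of informative reads that support inclusion of the exon.

    A simple ratio used for illustration only; MISO estimates PSI with a
    Bayesian model rather than this direct count ratio.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no informative reads for this splice event")
    return inclusion_reads / total


def delta_psi(control: float, treated: float) -> float:
    """Change in exon inclusion between two conditions."""
    return treated - control


# Made-up counts for one skipped-exon event in two conditions.
psi_control = naive_psi(inclusion_reads=30, exclusion_reads=10)
psi_treated = naive_psi(inclusion_reads=10, exclusion_reads=30)
change = delta_psi(psi_control, psi_treated)
```

A PSI near 1 means the exon is included in almost all transcripts, a PSI near 0 means it is almost always skipped, and the difference between conditions indicates the direction of the splicing change.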
RNA-SEQ ANALYSIS OF HEK293 CELLS TRANSFECTED WITH AN
SRSF1 AND AN RRM12 OVER EXPRESSION VECTOR.
Sequencing of Samples and Initial Output
To study the role of SRSF1 in global gene expression, HEK293 cells were transfected with an SRSF1 over-expression vector (pSRSF1), a truncated SRSF1 variant (pRRM12), and a control expression vector (pCMV-SP6). HEK293 cells were used because they are easily cultured and transfected, while the functions of SRSF1 do not change in different cell types. Total RNA was then extracted at 12 and 48 hours following transfection, since at 12 hours the genes directly regulated by SRSF1 will show a change in expression and processing, while at 48 hours both the directly regulated genes and their secondary targets will show changes in expression levels and RNA processing.
The total RNA was extracted utilizing the EZ Tissue/Cell Total RNA Mini Prep Kit from EZBioResearch. It was then quantified and analyzed for RNA integrity utilizing an Agilent 2100 Bioanalyzer. The total RNA preparation was then sent to the Scripps Florida genomics facility for library preparation and sequencing. The samples were treated with DNase to remove any possible genomic DNA contamination. Total RNA quality was reassessed utilizing an Agilent 2100 Bioanalyzer and the samples were quantified using a Qubit 2 fluorometer. RNA quality was measured using the RNA integrity number (RIN) (scale 1 to 10). All samples showed low degradation and a high RIN (> 8).
Figure 7. Quality of the Total mRNA utilized in the RNA-Seq assay.
The DNase-treated samples were then depleted of ribosomal RNA using the probes contained in the Illumina TruSeq Total RNA-seq kit, which provides rRNA probes for human, mouse, and rat rRNA. The sample quality was then re-assessed on the Bioanalyzer to check that the 28S and 18S rRNA peaks were minimized.
The cDNA from each transfection assay was then sequenced with an Illumina HiSeq 2500 sequencer. The demultiplexed data obtained consisted of 40 million paired-end reads [each around 80 nucleotides long] per sample. Furthermore, an additional set of samples from a second transfection assay was sequenced with an Illumina NextSeq 500 sequencer. The demultiplexed data received was composed of 2 technical replicates, each consisting of 25 million paired-end reads.
Preprocessing of sequencing data
The FASTQ files containing the read data for each sequencing run (SRSF1, SP6, and RRM12 samples) were run through the FASTQC program to determine read quality, phred score, read distribution, and read length. The reads were then merged using the concatenate tail-to-head tool to generate two FASTQ files, one containing forward reads and one containing reverse reads. The phred score of the merged reads was evaluated and was high enough (>28) that read trimming was not needed. This process was repeated for each set of replicates.
Alignment of Sequences to Reference Genome
The merged reads were then aligned to the Feb. 2012 Ensembl build of the human reference genome (GRCh38) to generate binary alignment map (BAM) files. The HISAT2 tool was utilized to align the paired reads to the reference genome.
Differential Gene Expression Analysis
The BAM files obtained from HISAT2 were then used to compute differential gene expression with the cuffdiff package. Each control and its corresponding replicate files were added to cuffdiff and used to develop a global model for all samples, which generates fold change values across samples. The controls and samples were then compared to generate a TSV file containing gene-level changes in expression and P-values for SP6 vs. SRSF1 and SP6 vs. RRM12 at 12 and 48 hours. Cuffdiff then outputs a list of genes that were differentially expressed between the two samples in each comparison. The gene lists obtained from cuffdiff were then subjected to value cutoffs. At 12 hours, the cuffdiff output was limited to genes with a log2 fold change greater than 1.5 or less than -1.5 and a P-value of less than 0.05; at 48 hours, it was limited to genes with a log2 fold change greater than 2 or less than -2 and a P-value of less than 0.05.
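The cutoffs described above can be expressed as a small filter over the cuffdiff gene-level table. The Python sketch below is an illustration under assumptions: the column names `gene`, `log2(fold_change)`, and `p_value` follow the cuffdiff output layout as we understand it, the function name is hypothetical, and the four example genes and their values are made up rather than taken from the results.

```python
import csv
import io
from typing import Dict, Iterable, List


def filter_genes(
    rows: Iterable[Dict[str, str]],
    min_abs_log2_fc: float,
    max_p_value: float = 0.05,
) -> List[str]:
    """Return genes whose log2 fold change and P-value pass the cutoffs."""
    selected = []
    for row in rows:
        log2_fc = float(row["log2(fold_change)"])
        p_value = float(row["p_value"])
        if abs(log2_fc) > min_abs_log2_fc and p_value < max_p_value:
            selected.append(row["gene"])
    return selected


# Made-up rows standing in for a cuffdiff gene-level table.
example_table = (
    "gene\tlog2(fold_change)\tp_value\n"
    "GENE_A\t-1.9\t0.01\n"
    "GENE_B\t0.4\t0.001\n"
    "GENE_C\t2.3\t0.20\n"
    "GENE_D\t2.6\t0.03\n"
)
rows = list(csv.DictReader(io.StringIO(example_table), delimiter="\t"))

hits_12h = filter_genes(rows, min_abs_log2_fc=1.5)  # 12-hour cutoff
hits_48h = filter_genes(rows, min_abs_log2_fc=2.0)  # stricter 48-hour cutoff
```

Running the same table through both cutoffs shows how the stricter 48-hour threshold removes genes with moderate fold changes that would pass at 12 hours.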
These gene lists were then run through the g:Profiler ID converter (15) in order to obtain a description for each gene. Using these descriptions, genes were separated into protein coding genes, RNA coding genes, pseudogenes, and unknown genes. Any gene for which g:Profiler did not return a description was then passed through the GeneCards (16) database to identify a function for the gene.
Using the output files obtained from cuffdiff, the following comparisons were made in Excel: the genes for SRSF1 were compared between 12 and 48 hours, and the genes for RRM12 were compared between 12 and 48 hours. The genes at 12 hours for SRSF1 and RRM12 were compared, and the genes at 48 hours for SRSF1 and RRM12 were also compared.
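The list comparisons described above amount to set operations on gene identifiers. The comparisons in this work were made in Excel; the Python sketch below is only an equivalent illustration, with a hypothetical function name and placeholder gene names.

```python
from typing import Dict, Iterable, List, Union


def overlap_summary(
    genes_a: Iterable[str], genes_b: Iterable[str]
) -> Dict[str, Union[List[str], float]]:
    """Summarize which genes are shared between two lists and which are unique."""
    set_a, set_b = set(genes_a), set(genes_b)
    shared = set_a & set_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        # Fraction of list A that also appears in list B.
        "fraction_of_a_shared": len(shared) / len(set_a) if set_a else 0.0,
    }


# Placeholder gene names, not results from this study.
summary = overlap_summary(["G1", "G2", "G3", "G4"], ["G3", "G4", "G5"])
```

The `fraction_of_a_shared` value is the kind of percentage reported when stating how many genes regulated in one condition are also regulated in another.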
A majority of the genes modulated by SRSF1 at 12 and 48 hours coded for functional protein products; of those, at 12 hours nearly 50% were up-regulated and 50% down-regulated. At 48 hours, however, nearly 90% of the protein coding genes were up-regulated by SRSF1.
Similar to the trend shown in the SRSF1 data, a majority of the genes modulated by RRM12 at 12 and 48 hours coded for functional protein products; of those, at 12 hours nearly 50% were up-regulated and 50% down-regulated. At 48 hours, however, nearly 90% of the protein coding genes were up-regulated by RRM12.
Only 50% of the protein coding genes regulated by SRSF1 at 12 hours were found to also be regulated by RRM12, suggesting that the deletion mutant shares similar but not identical functions in gene expression. Furthermore, the RRM12 mutant appeared to regulate a larger number of genes than SRSF1 (243 vs. 139), suggesting that the naturally occurring SRSF1 isoforms that contain RRM12 but lack the RS domain are able to modulate the expression of a set of genes differently than the canonical SRSF1 sequence. Only 35% of the genes regulated by SRSF1 at 12 hours were also found to be regulated at 48 hours. This may be due to multiple reasons, such as the increased stringency of our analysis cutoffs at 48 hours and feedback mechanisms that keep the relative abundance of some genes within a given range.
Figure 8. Genes significantly modulated by SRSF1 at 12 and 48 hours using a log2 fold change cutoff of above 1.5 and below -1.5 at 12 hours, a log2 fold change of above 2 and below -2 at 48 hours, and a p-value cutoff of 0.05 at both time points.
Figure 9. Genes significantly modulated by RRM12 at 12 and 48 hours using a log2 fold change cutoff of above 1.5 and below -1.5 at 12 hours, a log2 fold change of above 2 and below -2 at 48 hours, and a p-value cutoff of 0.05 at both time points.
Figure 10. Venn diagrams of genes found to be in common between SRSF1 and RRM12 differential expression testing at 12 and 48 hours, using a P-value cutoff of 0.05, log2 cutoffs of 1.5 and -1.5 at 12 hours, and log2 cutoffs of 2 and -2 at 48 hours.
Figure 11. Biological pathways significantly regulated by SRSF1 expression after 12 hours. GO Biological Processes (BP), GO Molecular Functions (MF), KEGG biological pathways (Keg).
Figure 12. Key pathways and genes significantly regulated by SRSF1 expression.
Log2 fold change by gene: PPP1R1B 1.60197; HIST3H2A -1.63071; HIST2H3D -1.60964; HIST2H2AC -1.65792; HIST2H3C -1.9085; HIST1H4k -1.91883; HIST2H2AA3 -1.69801; HIST2H3A -6.01476; HIST2H4B -8.66543; HIST1H4K -1.91883.
Genes by pathway:
Alcoholism: all ten genes listed above.
Chromatin silencing; Nucleosome; Systemic lupus erythematosus: HIST3H2A, HIST2H3D, HIST2H2AC, HIST2H3C, HIST1H4k, HIST2H2AA3, HIST2H3A, HIST2H4B, HIST1H4K.
Epigenetic regulation of gene expression; DNA methylation; HDACs deacetylate histones; Senescence-Associated Secretory Phenotype (SASP); Amyloid fiber formation; RNA Polymerase I Transcription; RMTs methylate histone arginines; HATs acetylate histones; Transcriptional regulation by small RNAs: HIST2H3D, HIST2H2AC, HIST2H3C, HIST1H4k, HIST2H2AA3, HIST2H3A, HIST2H4B, HIST1H4K.
Figure 13. Biological pathways significantly regulated by SRSF1 expression after 48 hours. GO biological processes (BP), GO molecular functions (MF), KEGG biological pathways (Keg).
Figure 14. Key pathways and genes significantly regulated by SRSF1 expression after 48 hours.
Figure 15. Biological pathways significantly regulated by RRM12 expression after 12 hours. GO biological processes (BP), GO molecular functions (MF), KEGG biological pathways (Keg).
Figure 16. Key pathways and genes significantly regulated by RRM12 expression after 12 hours.
Log2 fold change by gene: DMBX1 -1.54961; EHMT1 -1.97439; FZD9 -1.9905; H2AFX -1.87188; HES4 -2.24981; HIST1H2AB -1.69631; HIST1H2AH -1.65961; HIST1H2AI -1.65994; HIST1H2AL -2.26428; HIST1H2AM -2.03729; HIST1H3A -1.5631; HIST1H3H -1.80822; HIST1H4A -1.63374; HIST1H4K -2.03028; HIST2H2AA3 -2.22284; HIST2H2AA4 -1.65154; HIST2H2AB -1.7199; HIST2H2AC -2.59459; HIST2H3C -3.59225; HIST2H3D -3.03392; HIST3H2A -2.66858; HIST3H2BB -1.62825; JUND -1.52474; KCNH2 -1.64705; NPAS3 2.07788; RENBP -1.90329; SLC4A11 -1.69598; SLC6A4 2.97066; SOX18 -1.94017; SPPL2B -1.69262; ZFP36 -1.69776.
Genes by pathway:
Chromatin silencing: all 31 genes listed above.
Nucleosome; Systemic lupus erythematosus: H2AFX, HIST1H2AB, HIST1H2AH, HIST1H2AI, HIST1H2AL, HIST1H2AM, HIST1H3A, HIST1H3H, HIST1H4A, HIST1H4K, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H2AC, HIST2H3C, HIST2H3D, HIST3H2A, HIST3H2BB.
Protein heterodimerization activity: HIST1H2AH, HIST1H2AI, HIST1H2AL, HIST1H2AM, HIST1H3H, HIST1H4A, HIST1H4K, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H3D, HIST3H2A, HIST3H2BB, SOX18.
Alcoholism: HIST1H2AH, HIST1H2AI, HIST1H2AL, HIST1H2AM, HIST1H3H, HIST1H4A, HIST2H2AA4, HIST2H2AB, HIST2H3D, HIST3H2A, HIST3H2BB.
Epigenetic regulation of gene expression: H2AFX, HIST1H2AB, HIST1H3A, HIST1H3H, HIST1H4A, HIST1H4K, HIST2H2AA3, HIST2H2AA4, HIST2H2AC, HIST2H3C, HIST2H3D, HIST3H2BB.
DNA methylation: HIST1H2AI, HIST1H3A, HIST1H3H, HIST1H4A, HIST1H4K, HIST2H2AA3, HIST2H2AA4, HIST2H2AC, HIST2H3C, HIST2H3D, HIST3H2BB.
HDACs deacetylate histones; Chromatin organization: HIST1H2AB, HIST1H2AH, HIST1H2AI, HIST1H2AL, HIST1H2AM, HIST1H3A, HIST1H3H, HIST1H4A, HIST1H4K, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H2AC, HIST2H3C, HIST2H3D, HIST3H2A, HIST3H2BB.
HATs acetylate histones: HIST1H2AH, HIST1H2AI, HIST1H2AL, HIST1H2AM, HIST1H3H, HIST1H4A, HIST1H4K, HIST2H2AA4, HIST2H2AB, HIST3H2A, HIST3H2BB.
Amyloid fiber formation; Activation of anterior HOX genes in hindbrain dev.: HIST1H2AB, HIST1H3A, HIST1H3H, HIST1H4A, HIST1H4K, HIST2H2AA3, HIST2H2AA4, HIST2H2AC, HIST2H3C, HIST2H3D, HIST3H2BB.
RNA Polymerase I Transcription; Activated PKN1 stimulates transcription of AR; Transcriptional regulation by small RNAs: HIST1H3A, HIST1H4A, HIST1H4K, HIST2H2AA3, HIST2H2AA4, HIST2H2AC, HIST2H3C, HIST2H3D, HIST3H2BB.
Figure 17. Biological pathways significantly regulated by RRM12 expression after 48 hours. GO biological processes (BP), GO molecular functions (MF), KEGG biological pathways (Keg).
Figure 18. Key pathways and genes significantly regulated by RRM12 expression after 48 hours.
Histone genes are regulated by SRSF1 expression
Five classes of histone proteins are found in the chromatin of eukaryotic cells. Two molecules of each of the four core histones, H2A, H2B, H3, and H4, make up the nucleosome, while histone H1 is bound to the linker DNA between nucleosomes and is also required to organize the higher-order structure of chromatin. The expression of most histone genes depends on the cell cycle and cell replication; these genes are named replication-dependent histone genes, and their mRNAs are not polyadenylated. There are also some histone genes that are not cell-cycle regulated and that encode polyadenylated mRNAs. The five classes of replication-dependent histones are encoded by a multi-gene family organized in 3 gene clusters (HIST1, HIST2, HIST3) and a fourth location coding only for a copy of H4. HIST1 contains 51 histone genes, HIST2 6 genes, and HIST3 only 3 genes. H4 is encoded by 14 genes that code for the same protein, while there are 3 variants of histone H3 encoded by 12 genes. Histones H2A and H2B consist of at least 10 variants (17).
Because histone genes were the most significantly represented among the genes regulated at 12 hours after SRSF1 and RRM12 expression, we further analyzed the expression of all histone genes at both 12 and 48 hours following SRSF1 and RRM12 expression. For this secondary analysis we did not filter by P-value or by log2 fold change.
Overall, we observed a marked decrease in the transcripts of the replication-dependent histone genes at 12 hours following expression of either SRSF1 or RRM12, while at 48 hours the histone transcripts showed an overall increase. At the same time, the replication-independent histone genes did not appear to decrease significantly at 12 hours but did increase slightly at 48 hours. Histone H2A appeared to be the most down-regulated at 12 hours.
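The comparison by histone type can be sketched by grouping unfiltered log2 fold changes according to the gene symbol. The Python example below is a minimal illustration and not the procedure used for Figures 19 and 20. It assumes that cluster-style symbols (for example HIST1H2AB) identify the cluster and histone type, and that symbols not following this pattern (such as H2AFX or JUND) are simply skipped. The function names are hypothetical; the values are the log2 fold changes listed for RRM12 at 12 hours in Figure 16.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Cluster-style symbols: HIST<cluster>H<type>..., e.g. HIST1H2AB or HIST2H3C.
HISTONE_SYMBOL = re.compile(r"^HIST(?P<cluster>[1-4])H(?P<type>2A|2B|1|3|4)")


def parse_histone(symbol: str) -> Optional[Tuple[str, str]]:
    """Return (cluster, histone type) for a cluster-style symbol, else None."""
    match = HISTONE_SYMBOL.match(symbol.upper())
    if match is None:
        return None
    return f"HIST{match.group('cluster')}", f"H{match.group('type')}"


def mean_log2fc_by_type(log2fc: Dict[str, float]) -> Dict[str, float]:
    """Average log2 fold change per histone type, with no significance filter."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for symbol, value in log2fc.items():
        parsed = parse_histone(symbol)
        if parsed is None:
            continue  # not a cluster-style histone symbol
        groups[parsed[1]].append(value)
    return {histone_type: mean(values) for histone_type, values in groups.items()}


# Subset of the RRM12 12 hour values listed in Figure 16.
rrm12_12h = {
    "HIST1H2AB": -1.69631,
    "HIST1H2AH": -1.65961,
    "HIST2H2AC": -2.59459,
    "HIST3H2BB": -1.62825,
    "HIST1H3A": -1.5631,
    "HIST1H4A": -1.63374,
    "H2AFX": -1.87188,  # variant histone, not matched by the pattern
    "JUND": -1.52474,   # not a histone gene
}

by_type = mean_log2fc_by_type(rrm12_12h)
for histone_type in sorted(by_type):
    print(histone_type, round(by_type[histone_type], 5))
```

Grouping on the cluster instead of the type only requires using the first element returned by parse_histone as the key.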
Figure 19. Global effect on histone gene expression following SRSF1 and RRM12 delivery to the cells. Replication-dependent histone genes are shown in the left panel. Replication-independent genes are shown in the right panel. The heatmap shows the variation in histone gene expression relative to the control. Histone clusters are also indicated. No P-value filter was applied in this analysis. The log2 of the variation in histone gene expression is color-coded (green = downregulated, red = upregulated).
Figure 20. Variation in histone gene expression classified by histone type.
Key cellular pathways are regulated by SRSF1
Pathway analysis of the genes regulated by SRSF1 48 hours after its overexpression reveals that multiple cellular pathways are specifically regulated.
Among the genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression there are key components of the TNF and NF-Kappa B signaling pathways (Fig. 21, 22). Several cytokines are also regulated, affecting immunological processes and the functions of cells of the immune system. SRSF1 also appears to play a role in the immune disease rheumatoid arthritis and in amphetamine addiction (Fig. 23, 24).
STRING node analysis, carried out to characterize the interactions between the proteins regulated by SRSF1 expression, revealed that these proteins have significantly more interactions among themselves than would be expected for a random set of proteins of similar size (the enrichment P-value for this set of genes is 2.22e-15) (Fig. 25). From this analysis it appears that the cytokine TNF and the transcription factors EGR1 and FOS are the primary nodes for this gene set. Given the key role played by TNF in the immune response, in cytokine regulatory pathways, and in the majority of the biological pathways found to be statistically relevant in the analysis of SRSF1 overexpression at 48 hours, we validated TNF expression at 12 and 48 hours following SRSF1 and RRM12 expression using a quantitative PCR (qPCR) approach (Fig. 27) (experiment carried out by Sean Paz in the laboratory of Dr. Caputi).
Figure 21. KEGG pathway for TNF signaling. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.
Figure 22. KEGG pathway for NF-Kappa B signaling. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.
Figure 23. KEGG pathway for Rheumatoid Arthritis. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.
Figure 24. KEGG pathway for Amphetamine Addiction. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.
Figure 25. STRING node analysis of the genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression. Interactions among proteins (known or predicted) are indicated by strings connecting the proteins. The enrichment P-value for this set of genes is 2.22e-15. These proteins have significantly more interactions among themselves than would be expected for a random set of proteins of similar size.
Figure 26. Relative expression, in all samples analyzed, of the main gene nodes for the dataset obtained by selecting the genes significantly regulated at 48 hours by SRSF1 overexpression (log2 fold change above 2 or below -2, P < 0.05).
Figure 27. Validation by qPCR of TNF, F2, FES, AOC2, CXCL1, CCL20 expression at 12 and 48 hours following SRSF1 overexpression (Experiment carried out by Sean Paz).
DISCUSSION & FUTURE STUDIES
One of the most intriguing discoveries from our analysis is the decrease in histone gene transcripts following SRSF1 expression. A decrease in histone expression is usually the cause of an increase in overall gene transcription. This correlates well with previous observations from the laboratory of Dr. Caputi, which has shown that SRSF1 is highly up-regulated following CD4+ T cell activation. We have now observed that an increase in SRSF1 expression causes an activation, at 48 hours, of several pathways related to immunity and to T cell function and activation.
Taken together, these results indicate a key role for SRSF1 in T cell activation and functions.
In addition to the differential gene expression analysis, we intend to perform exon-centric splicing analysis. Since SRSF1 regulates genes through modulation of alternative splicing events, it is important to understand which events are prevalent with SRSF1 and RRM12 over-expression. To perform this analysis we will use the MISO program provided by the Burge lab. The output of MISO provides a separate file for each exon event type. These files contain PSI (percent spliced in) values that can be used to identify splice events suppressed or activated by SRSF1.
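One way PSI values could be used to flag candidate events is to compare the PSI of each event between the control and the over-expression sample. The Python sketch below illustrates that idea under stated assumptions and does not read MISO output files: the event names and PSI values are placeholders, and the 0.2 threshold on the PSI difference is an assumed value that is not taken from this work.

```python
from typing import Dict, Tuple

# Per-event PSI pair: event name -> (control PSI, PSI after over-expression).
PsiTable = Dict[str, Tuple[float, float]]

MIN_DELTA_PSI = 0.2  # assumed threshold for calling a change


def classify_events(psi: PsiTable, min_delta: float = MIN_DELTA_PSI) -> Dict[str, str]:
    """Label each event by the direction of its PSI change."""
    calls: Dict[str, str] = {}
    for event, (control, treated) in psi.items():
        for value in (control, treated):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"PSI for {event} must lie between 0 and 1")
        delta = treated - control
        if delta >= min_delta:
            calls[event] = "increased inclusion"
        elif delta <= -min_delta:
            calls[event] = "decreased inclusion"
        else:
            calls[event] = "unchanged"
    return calls


# Placeholder events and values, not results from this study.
example_psi: PsiTable = {
    "event_1": (0.30, 0.75),
    "event_2": (0.80, 0.35),
    "event_3": (0.50, 0.55),
}

calls = classify_events(example_psi)
for event, label in calls.items():
    print(event, label)
```

Whether an increase in inclusion counts as an event activated or suppressed by SRSF1 depends on the event type, so the labels here describe only the direction of the PSI change.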
APPENDIX A1
#
# Galaxy is configured by default to be usable in a single-user development
# environment. To tune the application for a multi-user production
# environment, see the documentation at:
#
# http://usegalaxy.org/production
#

# Throughout this sample configuration file, except where stated otherwise,
# uncommented values override the default if left unset, whereas commented
# values are set to the default value. Relative paths are relative to the root
# Galaxy directory.
#
# Examples of many of these options are explained in more detail in the wiki:
#
# https://wiki.galaxyproject.org/Admin/Config
#
# Config hackers are encouraged to check there before asking for help.

# ---- HTTP Server ------

# Configuration of the internal HTTP server.

[server:main]

# The internal HTTP server to use. Currently only Paste is provided. This
# option is required.
use = egg:Paste#http

# The port on which to listen.
port = 8088

# The address on which to listen. By default, only listen to localhost (Galaxy
# will not be accessible over the network). Use '0.0.0.0' to listen on all
# available network interfaces.
host = 0.0.0.0

# Use a threadpool for the web server instead of creating a thread for each
# request.
use_threadpool = True

# Number of threads in the web server thread pool.
#threadpool_workers = 10

# Set the number of seconds a thread can work before you should kill it
# (assuming it will never finish) to 3 hours. Default is 600 (10 minutes).
threadpool_kill_thread_limit = 10800

# ---- Filters ------

# Filters sit between Galaxy and the HTTP server.

# These filters are disabled by default. They can be enabled with
# 'filter-with' in the [app:main] section below.

# Define the gzip filter.
[filter:gzip]
use = egg:Paste#gzip

# Define the proxy-prefix filter.
[filter:proxy-prefix]
use = egg:PasteDeploy#prefix
prefix = /koko

# ---- Galaxy ------

# Configuration of the Galaxy application.

[app:main]

# -- Application and filtering

# The factory for the WSGI application. This should not be changed.
paste.app_factory = galaxy.web.buildapp:app_factory

# If not running behind a proxy server, you may want to enable gzip compression
# to decrease the size of data transferred over the network. If using a proxy
# server, please enable gzip compression there instead.
#filter-with = gzip

# If running behind a proxy server and Galaxy is served from a subdirectory,
# enable the proxy-prefix filter and set the prefix in the
# [filter:proxy-prefix] section above.
filter-with = proxy-prefix

# If proxy-prefix is enabled and you're running more than one Galaxy instance
# behind one hostname, you will want to set this to the same path as the prefix
# in the filter above. This value becomes the "path" attribute set in the
# cookie so the cookies from each instance will not clobber each other.
cookie_path = /koko

# -- Database

# By default, Galaxy uses a SQLite database at 'database/universe.sqlite'. You
# may use a SQLAlchemy connection string to specify an external database
# instead. This string takes many options which are explained in detail in the
# config file documentation.
# This is an example
#database_connection = sqlite:///./database/universe.sqlite?isolation_level=IMMEDIATE


# This is the connection to the local mysql
database_connection = postgres://galaxy:********@127.0.0.1/galaxy_production


# This doesnt work

# If the server logs errors about not having enough database pool connections,
# you will want to increase these values, or consider running more Galaxy
# processes.
#database_engine_option_pool_size = 5
#database_engine_option_max_overflow = 10

# If using MySQL and the server logs the error "MySQL server has gone away",
# you will want to set this to some positive value (7200 should work).
#database_engine_option_pool_recycle = -1

# If large database query results are causing memory or response time issues in
# the Galaxy process, leave the result on the server instead. This option is
# only available for PostgreSQL and is highly recommended.
#database_engine_option_server_side_cursors = False

# Log all database transactions, can be useful for debugging and performance
# profiling. Logging is done via Python's 'logging' module under the qualname
# 'galaxy.model.orm.logging_connection_proxy'
#database_query_profiling_proxy = False

# By default, Galaxy will use the same database to track user data and
# tool shed install data. There are many situations in which it is
# valuable to separate these - for instance bootstrapping fresh Galaxy
# instances with pretested installs. The following option can be used to
# separate the tool shed install database (all other options listed above
# but prefixed with install_ are also available).
#install_database_connection = sqlite:///./database/universe.sqlite?isolation_level=IMMEDIATE

# -- Files and directories

# Dataset files are stored in this directory.
#file_path = database/files

# Temporary files are stored in this directory.
#new_file_path = database/tmp

# Tool config files, defines what tools are available in Galaxy.
# Tools can be locally developed or installed from Galaxy tool sheds.
# (config/tool_conf.xml.sample will be used if left unset and
# config/tool_conf.xml does not exist).
#tool_config_file = config/tool_conf.xml,config/shed_tool_conf.xml

# Enable / disable checking if any tools defined in the above non-shed
# tool_config_files (i.e., tool_conf.xml) have been migrated from the Galaxy
# code distribution to the Tool Shed. This setting should generally be set to
# False only for development Galaxy environments that are often rebuilt from
# scratch where migrated tools do not need to be available in the Galaxy tool
# panel. If the following setting remains commented, the default setting will
# be True.
#check_migrate_tools = True

# Tool config maintained by tool migration scripts. If you use the migration
# scripts to install tools that have been migrated to the tool shed upon a new
# release, they will be added to this tool config file.
#migrated_tools_config = config/migrated_tools_conf.xml

# File that contains the XML section and tool tags from all tool panel config
# files integrated into a single file that defines the tool panel layout. This
# file can be changed by the Galaxy administrator to alter the layout of the
# tool panel. If not present, Galaxy will create it.
#integrated_tool_panel_config = integrated_tool_panel.xml

# Default path to the directory containing the tools defined in tool_conf.xml.
# Other tool config files must include the tool_path as an attribute in the
# [...]

# The dependency resolvers config file specifies an ordering and options for how
# Galaxy resolves tool dependencies (requirement tags in Tool XML). The default
# ordering is to the use the Tool Shed for tools installed that way, use local
# Galaxy packages, and then use Conda if available.
# See https://github.com/galaxyproject/galaxy/blob/dev/doc/source/admin/dependency_resolvers.rst
# for more information on these options.
#dependency_resolvers_config_file = config/dependency_resolvers_conf.xml

# The following Conda dependency resolution options will change the defaults for
# all Conda resolvers, but multiple resolvers can be configured independently
# in dependency_resolvers_config_file and these options overridden.
# Location on the filesystem where Conda packages are installed

# conda_prefix is the location on the filesystem where Conda packages and environments are installed
# IMPORTANT: Due to a current limitation in conda, the total length of the
# conda_prefix and the job_working_directory path should be less than 50 characters!
#conda_prefix =
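The settings changed from their defaults in the configuration above can be checked programmatically. The Python sketch below is a hedged illustration that uses the standard library configparser module on an excerpt copied from the listing. The helper names are hypothetical, and the check that the cookie path matches the proxy prefix follows the recommendation in the configuration comments rather than any Galaxy API.

```python
import configparser

# Excerpt of the uncommented settings from the configuration listed above.
CONFIG_EXCERPT = """
[server:main]
use = egg:Paste#http
port = 8088
host = 0.0.0.0
use_threadpool = True
threadpool_kill_thread_limit = 10800

[filter:proxy-prefix]
use = egg:PasteDeploy#prefix
prefix = /koko

[app:main]
filter-with = proxy-prefix
cookie_path = /koko
"""


def load_settings(text: str) -> configparser.ConfigParser:
    """Parse INI text; interpolation is disabled so '%' is read literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def proxy_paths_match(parser: configparser.ConfigParser) -> bool:
    """True when the cookie path equals the proxy prefix, as the comments advise."""
    prefix = parser.get("filter:proxy-prefix", "prefix")
    cookie_path = parser.get("app:main", "cookie_path")
    return prefix == cookie_path


settings = load_settings(CONFIG_EXCERPT)
port = settings.getint("server:main", "port")
kill_limit_hours = settings.getint("server:main", "threadpool_kill_thread_limit") / 3600
print(port, settings.get("server:main", "host"), kill_limit_hours, proxy_paths_match(settings))
```

The thread kill limit of 10800 seconds corresponds to the 3 hours mentioned in the configuration comments, which allows long-running requests to complete before a worker thread is terminated.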
BIBLIOGRAPHY
1. Paz S, Krainer AR, Caputi M. 2014. HIV-1 transcription is regulated by
splicing factor SRSF1. Nucleic Acids Res 42:13812-13823.
2. Ji X, Zhou Y, Pandit S, Huang J, Li H, Lin CY, Xiao R, Burge CB, Fu XD. 2013.
SR proteins collaborate with 7SK and promoter-associated nascent RNA to
release paused polymerase. Cell 153:855-868.
3. Anczukow O, Rosenberg AZ, Akerman M, Das S, Zhan L, Karni R,
Muthuswamy SK, Krainer AR. 2012. The splicing factor SRSF1 regulates
apoptosis and proliferation to promote mammary epithelial cell
transformation. Nat Struct Mol Biol 19:220-228.
4. Zou L, Zhang H, Du C, Liu X, Zhu S, Zhang W, Li Z, Gao C, Zhao X, Mei M,
Bao S, Zheng H. 2012. Correlation of SRSF1 and PRMT1 expression with
clinical status of pediatric acute lymphoblastic leukemia. J Hematol Oncol
5:42.
5. Jablonski JA, Caputi M. 2009. Role of cellular RNA processing factors in
human immunodeficiency virus type 1 mRNA metabolism, replication, and
infectivity. J Virol 83:981-992.
6. Paz S, Lu ML, Takata H, Trautmann L, Caputi M. 2015. SRSF1 RNA
Recognition Motifs Are Strong Inhibitors of HIV-1 Replication. J Virol
89:6275-6286.
7. Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Čech M,
Chilton J, Clements D, Coraor N, Eberhard C, Grüning B, Guerler A,
Hillman-Jackson J, Von Kuster G, Rasche E, Soranzo N, Turaga N, Taylor
J, Nekrutenko A, Goecks J. 2016. The Galaxy platform for accessible,
reproducible and collaborative biomedical analyses: 2016 update. Nucleic
Acids Research doi:10.1093/nar/gkw343.
8. Blankenberg D, Von Kuster G, Bouvier E, Baker D, Afgan E, Stoler N,
Taylor J, Nekrutenko A. 2014. Dissemination of scientific software with
Galaxy ToolShed. Genome Biology 15:1-3.
9. Andrews S. 2010. FastQC: a quality control tool for high throughput
sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
Accessed
10. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G,
Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The
Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078-
2079.
11. Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for
comparing genomic features. Bioinformatics 26:841-842.
12. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ,
Salzberg SL, Wold BJ, Pachter L. 2010. Transcript assembly and
quantification by RNA-Seq reveals unannotated transcripts and isoform
switching during cell differentiation. Nat Biotechnol 28:511-515.
13. Katz Y, Wang ET, Airoldi EM, Burge CB. 2010. Analysis and design of RNA
sequencing experiments for identifying isoform regulation. Nat Methods
7:1009-1015.
14. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre
C, Singh H, Glass CK. 2010. Simple combinations of lineage-determining
transcription factors prime cis-regulatory elements required for macrophage
and B cell identities. Mol Cell 38:576-589.
15. Reimand J, Kull M, Peterson H, Hansen J, Vilo J. 2007. g:Profiler--a web-
based toolset for functional profiling of gene lists from large-scale
experiments. Nucleic Acids Res 35:W193-200.
16. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. 1998. GeneCards: a novel
functional genomics compendium with automated data mining and query
reformulation support. Bioinformatics 14:656-664.
17. Marzluff WF, Gongidi P, Woods KR, Jin J, Maltais LJ. 2002. The human and
mouse replication-dependent histone genes. Genomics 80:487-498.