BIOLOGICAL COMPUTATION: THE DEVELOPMENT OF A GENOMIC ANALYSIS

PIPELINE TO IDENTIFY CELLULAR GENES MODULATED BY THE

TRANSCRIPTION / SPLICING FACTOR SRSF1

by

Evan Clark

A Thesis Submitted to the Faculty of

The College of Engineering and Computer Science

In Partial Fulfillment of the Requirements for the Degree of

Master of Science

Florida Atlantic University

Boca Raton, FL

May 2017

Copyright 2017 by Evan Clark


BIOLOGICAL COMPUTATION: THE DEVELOPMENT OF A GENOMIC ANALYSIS

PIPELINE TO IDENTIFY CELLULAR GENES MODULATED BY THE

TRANSCRIPTION / SPLICING FACTOR SRSF1

INTRODUCTION

BACKGROUND ON RNA-SEQ

THE SRSF1 PROTEIN

INSTALLATION & DEPLOYMENT OF COMPUTATIONAL CLUSTER RESOURCES FOR RNA-SEQ DATA

DEVELOPMENT OF ANALYSIS PIPELINE

RNA-SEQ ANALYSIS OF HEK293 CELLS TRANSFECTED WITH AN SRSF1 AND AN RRM12 OVER EXPRESSION VECTOR

Preprocessing of sequencing data

Alignment of Sequences to Reference Genome

Differential Gene Expression Analysis

Histone genes are regulated by SRSF1 expression

Key cellular pathways are regulated by SRSF1

APPENDIX A1

BIBLIOGRAPHY

ABSTRACT

Author: Evan Clark

Title: Biological Computation: The Development of a Genomic Analysis Pipeline to Identify Cellular Genes Modulated by the Transcription / Splicing Factor SRSF1

Institution: Florida Atlantic University

Thesis Advisor: Dr. Waseem Asghar

Degree: Master of Science

Year: 2017

SRSF1 is a widely expressed mammalian protein with multiple functions in the regulation of gene expression through processes including transcription, mRNA splicing, and translation. Although much is known about the role of SRSF1 in the alternative splicing of specific genes, little is known about its function as a transcription factor and its global effect on cellular gene expression. We utilized an RNA sequencing (RNA-Seq) approach to determine the impact of SRSF1 on cellular gene expression and analyzed both the short-term (12 hours) and long-term (48 hours) effects of SRSF1 expression in a human cell line. Furthermore, we analyzed the effect of expressing a naturally occurring deletion mutant of SRSF1 (RRM12) and compared it to that of the full-length protein. Our analysis reveals that shortly after SRSF1 is over-expressed, the transcription of several histone coding genes is down-regulated, allowing for a more relaxed chromatin state and efficient transcription by RNA Polymerase II. This effect is reversed at 48 hours. At the same time, key genes of the immune pathways are activated, most notably Tumor Necrosis Factor-Alpha (TNF-α), suggesting a role for SRSF1 in T cell functions.


INTRODUCTION

Beginning in the early 2000s, novel methods for sequencing DNA came into high demand. Several technologies attempted to revolutionize sequencing through methods that would later be described as next generation. The major technique developed during this time was sequencing by synthesis. This method works by immobilizing single-stranded DNA (ssDNA) fragments on a solid surface and progressively adding labeled nucleotides; when a nucleotide complementary to the template base is incorporated, its fluorescent reporter is excited by a laser and the emitted signal identifies the base.

The development of sequencing-by-synthesis technology has led to several new experimental techniques that allow for the quantification of DNA and RNA. One of these techniques is RNA sequencing (RNA-Seq). In this method, RNA transcripts obtained from samples are converted to cDNA libraries and then sequenced. The sequencing results contain millions of sequenced fragments, also known as sequence reads. Each read contains the identified nucleotide sequence and a corresponding base quality score assigned by the sequencing device.
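To make the structure of a sequence read concrete, the following minimal Python sketch parses one read record and decodes its per-base quality scores. It assumes the common four-line FASTQ layout with Phred+33 quality encoding; the record shown is invented for illustration, and the function name `parse_fastq` is hypothetical.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) for each four-line FASTQ record.

    Assumes Phred+33 encoding: quality = ord(character) - 33.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality_line = handle.readline().rstrip("\n")
        qualities = [ord(ch) - 33 for ch in quality_line]
        yield header[1:], sequence, qualities

# Invented example record (not taken from the thesis data).
example = io.StringIO("@read_1\nGATTACA\n+\nIIIIII#\n")
records = list(parse_fastq(example))
```

Each record pairs the called bases with one quality value per base, which is the information used later for quality control and trimming.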

RNA-Seq has begun to replace techniques such as microarrays for identifying changes in gene expression. Compared to microarrays, RNA-Seq provides several advantages: i) genome-wide analysis of transcript abundance that is not limited by probe quantity, ii) identification of novel transcript sequences, and iii) quantification of alternative splicing events.

BACKGROUND ON RNA-SEQ

RNA is first extracted from cells using either an organic phenol extraction or solid-phase extraction (utilized in this project). Solid-phase extraction differs from organic extraction in that the RNA molecules bind directly to silica fibers within a membrane and can be eluted after all other contaminants are washed away. In order to perform RNA-Seq, the sample must be treated with DNase and cleared of the ribosomal RNA that could interfere with the sequencing process. The extracted mRNA is then reverse-transcribed into cDNA, which is amplified and used in sequencing.

In order for the cDNA to be sequenced, libraries composed of amplified cDNA sequences of a specific length, with proprietary sequences added to the cDNA, are generated. Library generation includes two major processes: adapter insertion and sequence amplification. Adapters are unique sequences of DNA used to tether each DNA fragment to the sequencing lane. The usage of adapter sequences may differ depending on the sequencing platform; here we focus on the Illumina sequencing platforms. Additionally, barcode sequences are inserted as part of the adapter sequence to easily determine the sample origin of each DNA fragment. Fragments from several experiments can be included in a single sequencing run; this is known as multiplexing. The adapters are then used to generate clusters of identical fragments within the sequencing lanes through a process known as bridge amplification. After clusters are generated, the fragments are ready to be sequenced.

Figure 1. Flowchart for the sample preparation protocol utilized in RNA-Seq.

Figure 2. Sequencing pipeline. The typical approach to sequencing on the Illumina platform is shown. Specifically, on an Illumina NextSeq 500, cDNA is fragmented into equal-length sequences carrying proprietary adapters that attach each fragment to the flow cell. The NextSeq 500 can record up to 800 million paired-end reads per run spread across 8 lanes. Fragments are amplified using bridge amplification, which generates reverse strands. The fragments are amplified again to form clusters, and bases are called by progressively adding nucleotides complementary to each template base. The incorporated nucleotides emit a color when excited by a laser at 500 nm, which is read by a detector that reports a nucleotide. Images obtained from https://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

The sequencing process may utilize either single-end or paired-end reads. Single-end reads are read from only one end of each fragment, whereas paired-end reads are read from both ends. Paired-end sequencing improves overall sequencing quality, allows for the identification of sequence rearrangements, allows for better alignment of reads to a reference genome, and aids the identification of novel isoforms by decreasing the number of gaps between sequence fragments that occur during alignment.

On the Illumina platform, sequencing is performed through a method known as sequencing by synthesis. During sequencing, special nucleotides, each carrying a unique fluorescent reporter molecule, are hybridized to their matching bases on each fragment. A laser at a wavelength of 488 nm excites the fluorescent tags and a detector reads the color response. These responses are then recorded as a base call for each fragment. This information is written to a multiplexed base call (BCL) file containing all the reads from each sequencing run. This file is then demultiplexed using the barcode sequences inserted earlier to produce FASTQ files for each sample. If a sample is split across multiple sequencing lanes, a FASTQ file is generated for each lane as well.
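The demultiplexing step can be illustrated with a short Python sketch that assigns reads to samples by their barcode. This is only a simplified illustration that assumes exact barcode matches; the barcodes, sample names, and the function `demultiplex` are hypothetical and do not reproduce the vendor software used by the sequencing facility.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Group (barcode, sequence) pairs by sample using exact barcode matches.

    Reads whose barcode is not in the lookup table go to 'undetermined'.
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)

# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "SRSF1_12h", "TGCA": "SP6_12h"}
reads = [
    ("ACGT", "GGGAAA"),
    ("TGCA", "CCCTTT"),
    ("ACGT", "TTTGGG"),
    ("NNNN", "AAAAAA"),
]
result = demultiplex(reads, barcodes)
```

In practice each per-sample group would be written to its own FASTQ file, one per lane when a sample is split across lanes.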

THE SRSF1 PROTEIN

The serine/arginine-rich splicing factor 1 (SRSF1) is a widely expressed mammalian RNA-binding protein that is a member of the serine/arginine-rich protein family. SRSF1 contains three major domains (Fig. 3): RNA recognition motif 1 (RRM1) (16-91 aa), RNA recognition motif 2 (RRM2) (121-195 aa), and the serine/arginine-rich (SR) domain (198-247 aa). RRM1 and RRM2 have a significant role in RNA binding, while the SR domain, which is heavily phosphorylated, is thought to mediate protein-protein interactions. SRSF1 has multiple functions in regulating gene expression through biological processes including transcription, mRNA splicing, and translation. Specifically, its primary function is to serve as a master regulator of alternative splicing through its ability to activate splice events. Alternative splicing is a biological process that allows a single gene to code for multiple proteins by joining alternative coding regions (exons) in the messenger RNA. During this process, exons may be skipped or included in the final sequence of the mature mRNA. Furthermore, the laboratory of Dr. Caputi and others have shown that SRSF1 has the potential to function as a transcription factor (1, 2).

Figure 3. SRSF1 protein structure. The SRSF1 domains (RRM1, RRM2, RS) are indicated.

Dysregulation of SRSF1 has been shown to lead to several negative physiological effects. Overexpression of SRSF1 has been observed to cause the transformation of normal endothelial cells into malignant tumor cells in breast cancer (3), and SRSF1 has also been shown to be up-regulated in acute lymphoblastic leukemia (4). In addition, SRSF1 has been associated with diseases such as spinocerebellar ataxia and has been shown to be a key modulator of genes within the Human Immunodeficiency Virus (HIV) genome through transcription and splicing regulation (5, 6).

Given the multiple functions of SRSF1 in gene expression, we set out to characterize the role in gene expression of SRSF1 and of a deletion mutant (RRM12), which corresponds to a naturally occurring isoform of SRSF1 lacking the SR domain, utilizing an RNA-Seq approach. Our ultimate goal was to characterize the genes and the biological pathways that SRSF1 modulates through its activity as both a transcription factor and a splicing factor.


Figure 4. A) Gene transcripts generated by alternative splicing of the SRSF1 messengers. Three of the alternatively spliced messengers code for protein isoforms. ASF-1 (248 aa) is the canonical isoform, expressed with high abundance in most tissues. B) Schematic representation of the SRSF1 gene structure with the coding exons and their location relative to the protein structure.

INSTALLATION & DEPLOYMENT OF COMPUTATIONAL CLUSTER

RESOURCES FOR RNA-SEQ DATA

When developing the distributed computing grid for the processing of high-throughput sequencing data, it was necessary to have a customizable GUI for straightforward analysis of data and the creation of reusable analysis pipelines. Galaxy (7), a collaborative platform aimed at simplifying bioinformatics, was chosen as the host for our analysis software packages because of its extensive library of command line applications and its capabilities for managing them. The library, known as the Galaxy Tool Shed (8), provides access to the installers and dependencies for commonly used bioinformatics tools, including RNA reference aligners and transcript quantification applications. Galaxy was installed on a dedicated server within the FAU High Performance Computing (HPC) cluster and cloned to the secure cluster for use with sensitive information.

The FAU HPC cluster was developed to allow students and faculty to run complex, high-IOPS (input/output operations per second) compute jobs on shared hardware purchased through academic research grants. The cluster is composed of nearly 100 nodes, each containing 128 GB of memory, dual Intel Xeon processors, and additional Xeon Phi co-processors, connected through high-capacity 40 Gb/s InfiniBand fourteen data rate (FDR) networking. The nodes currently run Scientific Linux 6, a derivative of Red Hat Enterprise Linux 6, but are in the process of being upgraded to Enterprise Linux 7, the latest version of the Enterprise Linux software line.

Once Galaxy was installed, its config file was modified for running in a production environment (see Appendix A1 for the modified config file). Administrative access was granted to specified users and Galaxy was connected to an external structured query language (SQL) database. PostgreSQL was enabled and initialized. A galaxy user was created within the SQL server and was granted all privileges on a newly created Galaxy database.

    sudo su - postgres
    psql
    CREATE DATABASE galaxy_production;
    CREATE USER galaxy WITH PASSWORD '******';
    GRANT ALL PRIVILEGES ON DATABASE galaxy_production TO galaxy;

This database was then connected to Galaxy through the SQLAlchemy Python library by adding the database connection settings to the Galaxy config file. The database was then migrated and updated to link to all existing datasets managed by Galaxy.

    cd ~/galaxy/
    sh run.sh
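As an illustration of the database connection described in this section, the Python sketch below assembles a SQLAlchemy-style PostgreSQL URL for the galaxy user and the galaxy_production database created with the SQL commands earlier. The password, host, and port are placeholders, the helper `postgres_url` is hypothetical, and the `database_connection` key is an assumption about the Galaxy configuration rather than a copy of the config file in Appendix A1.

```python
from urllib.parse import quote

def postgres_url(user, password, host, port, database):
    """Build a SQLAlchemy-style PostgreSQL URL, percent-encoding the password."""
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{database}"

# 'galaxy' and 'galaxy_production' come from the SQL commands in this section;
# the password, host, and port below are placeholders (assumptions).
url = postgres_url("galaxy", "p@ss/word", "localhost", 5432, "galaxy_production")
config_line = f"database_connection = {url}"
```

Percent-encoding the password keeps characters such as "@" or "/" from being misread as URL separators.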

Once installation was complete, the Galaxy Tool Shed was enabled by uncommenting the following lines in the Galaxy config file.

    # Directory where Tool Data Table related files will be placed
    # when installed from a ToolShed. Defaults to tool_data_path.
    #shed_tool_data_path = tool-data


With the Tool Shed enabled, tools could then be installed in Galaxy.

DEVELOPMENT OF ANALYSIS PIPELINE

The computing cluster was designed to maximize efficiency when performing RNA-Seq analysis. Our experimental methodology focuses on computing differential expression at the gene level and quantifying alternative splicing events at the exon level. Both analysis workflows require the generation of a transcript assembly by aligning the raw read data to a reference genome through a process known as mapping. Mapping is completed using a software algorithm that aligns transcripts to their complementary DNA gene sequences and assesses the read quality.

The FASTQC (9) Java application was added to Galaxy in order to assess raw read quality from sequencing data. FASTQ files are sent through FASTQC to generate average quality plots for each read position. This program provides a visual and numerical representation of the quality of the clusters generated and of read accuracy. FASTQC also provides contamination information for each sample by indicating over-represented sequences, such as those contained in the adapters used to bind DNA to the sequencing wells. Finally, FASTQC provides a measure of the overall sequence run quality, which is reported on a scale of 0-35, with good quality being above the value 30. FASTQC determines read quality from the quality symbols coupled with each sequence fragment. Depending on the character value, higher quality is assigned to nucleotides whose color was unambiguous during sequencing, and lower quality is assigned to nucleotides whose color was not easily identifiable by the sequencer. Each quality value corresponds to a Phred score that indicates the probability that the base call is incorrect. Figure 5 shows the quality per tile of reads within the sequencing well (blue indicates high quality, red indicates poor quality) and the average quality of reads at each nucleotide position in the read.
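The relationship between a Phred score Q and the probability P of an incorrect base call is P = 10^(-Q/10). The short Python sketch below applies this standard definition in both directions; the function names are hypothetical and the code is not part of FASTQC.

```python
import math

def phred_to_error_probability(q):
    """Probability that a base call is wrong, given Phred quality Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Inverse relation: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

p20 = phred_to_error_probability(20)  # roughly 1 error in 100 base calls
p30 = phred_to_error_probability(30)  # roughly 1 error in 1000 base calls
```

Under this definition, a score of 30 corresponds to a base-call accuracy of about 99.9%.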

Figure 5. Example read quality output files from FASTQC. These graphs represent quality across sequencing wells and average quality across read positions in each sequencing fragment. Each flow cell contains millions of wells, each containing nucleotide clusters of the same type. Average quality is reported across each well, with blue signifying high quality and red indicating poor quality. Read quality is assessed by Phred score on a scale from 0 to 36, with 28 to 36 being ideal read quality.

If the quality metrics indicate that the reads are below an average threshold quality, read trimming needs to be performed. Read trimming works by identifying sections of reads whose combined quality is below a certain value (generally 28-30) and removing them from the unaligned sequences. This operation is known as sliding frame trimming; however, depending on the type of sequencing performed, different trim operations can be used.

In order to trim reads accurately, the Trimmomatic package was installed in Galaxy. This package provides multiple options for trimming, including read adapter trimming, foreign genomic DNA sequence removal, and sliding frame quality trimming. In our data-processing pipeline we utilize the sliding frame trim operation to cut bases from reads that are not at or above a specified read quality level.
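The idea behind sliding frame trimming can be sketched in a few lines of Python. The sketch scans a read from its 5' end and cuts it at the first window whose mean quality falls below a threshold. It is a simplified illustration, not Trimmomatic's actual implementation; the window size of 4 is an assumption, the threshold of 28 follows the value used in this pipeline, and the example read is invented.

```python
def sliding_window_trim(sequence, qualities, window=4, min_mean_quality=28):
    """Cut the read at the first window whose mean quality is below the threshold.

    Scans from the 5' end; everything from the start of the failing window
    onward is discarded. A simplified sketch, not Trimmomatic's exact code.
    """
    n = len(sequence)
    if n < window:
        return sequence, qualities  # too short to evaluate; left unchanged
    for start in range(0, n - window + 1):
        window_quals = qualities[start:start + window]
        if sum(window_quals) / window < min_mean_quality:
            return sequence[:start], qualities[:start]
    return sequence, qualities

# Invented read whose quality drops toward the 3' end.
seq = "ACGTACGTAC"
quals = [38, 37, 36, 36, 35, 34, 30, 20, 12, 8]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
```

Averaging over a window rather than testing single bases keeps one isolated low-quality base from truncating an otherwise good read.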

Once trimming is completed, the resulting paired reads can be mapped to a reference genome using a splice-aware gapped read mapper. RNA-Seq produces reads from cDNA generated from mRNA, which does not contain any introns. When performing an alignment, a reference genome composed of DNA sequences is used to identify the location of the fragments on the genome. RNA fragments, however, can span multiple exons and require gaps to be inserted in order for the alignment to be performed correctly. Splice-aware gapped read mappers resolve this issue by identifying the parent DNA sequence of the fragments and splitting them at the location of the introns, while still preserving splice events.

Using the repository provided by Galaxy, we installed the Hierarchical Indexing for Spliced Alignment of Transcripts 2 (HISAT2) tool to perform the genome-guided assembly on our transcripts. HISAT2 is a splice-aware gapped read mapper and is the successor to the common TopHat2 aligner. HISAT2 provides two key functionalities required when performing differential expression analysis and alternative splicing analysis: it produces alignments tailored to the Cufflinks analysis package, which is used for differential expression (using the --dta-cufflinks option), and it generates alignment files compatible with MISO. Furthermore, HISAT2 offers a balance between low memory footprint and speed; running multi-threaded, it can complete alignments 15 times faster than TopHat2. HISAT2 produces a sequence alignment map (SAM) file with the location of each transcript within its corresponding gene sequence in the reference genome (we utilized the GRCh38 assembly of the human genome as the reference genome). The splice-aware functionality allows the application to determine which exon each portion of a sequenced fragment belongs to and to map the portions to the corresponding genes, which is required for the downstream quantification of splice variants at the exon level. The SAM file is then sorted and converted to a binary alignment map (BAM) file through the samtools sort application, and an accompanying index file is generated using the samtools index application (10). The final BAM files are then run through the genome coverage bedgraph (11) tool to generate a track map of the read counts and alignments to exons.
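In the SAM format, a spliced alignment is encoded in the CIGAR string, where the N operation marks a stretch of reference sequence skipped by the read (an intron in RNA-Seq data). The following Python sketch, with a hypothetical function name and an invented alignment, shows how the exon blocks covered by a read could be recovered from its start position and CIGAR string.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_blocks(start, cigar):
    """Return reference intervals (1-based, inclusive) covered by a read.

    'N' operations (skipped reference, i.e. introns in RNA-Seq) split the
    alignment into separate exon blocks.
    """
    blocks = []
    pos = start
    block_start = None
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":  # consumes reference inside a block
            if block_start is None:
                block_start = pos
            pos += length
        elif op == "N":  # intron: close the current block and skip ahead
            if block_start is not None:
                blocks.append((block_start, pos - 1))
                block_start = None
            pos += length
        # I, S, H and P do not consume the reference
    if block_start is not None:
        blocks.append((block_start, pos - 1))
    return blocks

# Invented read: 50 aligned bases, a 2000-base intron, then 30 aligned bases.
blocks = aligned_blocks(1000, "50M2000N30M")
```

Each returned interval corresponds to the part of the read that lies within one exon, which is the information exon-level splicing quantification relies on.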

The BAM file can then be utilized to quantify gene and transcript expression levels. For this purpose we utilized the Cufflinks toolset. The Cufflinks package (12) was downloaded and compiled from the Galaxy Tool Shed repository. This package installed the latest versions of Cufflinks, Cuffdiff, and Cuffquant. Cuffquant is utilized to quantify transcript abundance levels across whole-genome assemblies generated by Cufflinks. The quantification file is then fed into Cuffdiff, which computes the transcript- and gene-level differential expression between multiple samples. Cuffdiff performs pair-wise comparisons between datasets and generates a TSV file containing log2 fold changes for all genes, and q-values if replicates are present. In order to group the aligned sequences that belong to each gene, a gene transfer format (GTF) file is utilized, as it provides genome coordinates for each gene in addition to the gene symbol.
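The log2 fold change reported for each gene can be illustrated with a minimal Python sketch. The expression values are invented, and the pseudocount used to avoid division by zero is an assumption of this sketch; Cuffdiff's own handling of zero expression and its statistical testing are not reproduced here.

```python
import math

def log2_fold_change(control_fpkm, treated_fpkm, pseudocount=1.0):
    """log2(treated / control), with a pseudocount to avoid division by zero.

    The pseudocount is an assumption of this sketch, not Cuffdiff's behavior.
    """
    return math.log2((treated_fpkm + pseudocount) / (control_fpkm + pseudocount))

# Invented expression values for illustration.
up = log2_fold_change(control_fpkm=9.0, treated_fpkm=39.0)    # 40 / 10, a 4-fold increase
down = log2_fold_change(control_fpkm=39.0, treated_fpkm=9.0)  # 10 / 40, a 4-fold decrease
```

On the log2 scale, increases and decreases of the same magnitude are symmetric around zero, which is why cutoffs are stated as a positive and a negative value.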

The database produced by Cuffdiff can then be plotted using the CummeRbund R package, which produces visualizations for differential expression data including heatmaps, boxplots, dendrograms, and volcano plots. The differential expression pipeline is outlined below in Figure 6.

[Flowchart: for each condition, the FASTQ input files are concatenated tail-to-head into forward (R1) and reverse (R2) read files and checked with FASTQC. If the average read quality is above 28, the reads go directly to HISAT2; otherwise they pass through the FASTQ Quality Trimmer before HISAT2. The alignments and an input GFF file are passed to Cuffquant and then Cuffdiff, which outputs the differential gene expression list.]

Figure 6. The pipeline used in generating differential gene expression information. The pipeline merges the read files into one final forward file and one final reverse file. QC metrics from FASTQC are used to confirm that read quality is high and to determine whether trimming is necessary. If average read quality is below the threshold of 28, reads are trimmed for quality using Trimmomatic. The reads are then aligned to a reference genome, in this case GRCh38, using the HISAT2 alignment tool. The final BAM files are sent into Cuffdiff with a reference GFF file that contains gene locations for the specific reference used. BAM files are quantified pairwise using Cuffquant through Cuffdiff, and gene expression lists are output.


Alternative splicing is a biological process that occurs during the processing of the primary RNA transcript and alters the expression of genes through the alternative inclusion and exclusion of exonic sequences, leading to a single gene coding for multiple proteins. There are five primary types of alternative splicing events: skipped exons, where one or more exons are not included in the final messenger RNA (mRNA); alternative 5’ and alternative 3’ splice sites, where competing 5’ donor and 3’ acceptor sites dictate the portion of an exon that is either included in or excluded from the mRNA; retained introns, where a sequence may be retained in the mRNA or removed as an intron; and mutually exclusive exons, where the inclusion of one exon in the final messenger always excludes another exon.

In parallel to the data processing performed by the Cufflinks package, we also utilize MISO (13), a splicing analysis software package based on Bayesian probability. MISO, the tool we use in our pipeline to quantify alternative splicing events, was developed by the Burge lab at MIT and is widely used for the quantification of known and novel splicing events. MISO utilizes an exon-centric analysis, which focuses on the level of individual splice events, and differs from isoform-centric analysis in that it does not quantify expression levels of whole gene isoforms.

More specifically, MISO uses known splice junctions categorized by event type (skipped exon (SE), retained intron (RI), alternative 3’ splice site (A3SS), alternative 5’ splice site (A5SS), and mutually exclusive exons (MXE)) generated from all assemblies of the human genome. The output produced by MISO contains locations without gene identifications. In order to better quantify overall changes in alternative splicing at the gene level, the MISO output is converted to a browser extensible data (BED) file, which reports each splice event as a position on a genome browser track. This file is then annotated using the Homer package from the Salk Institute (14), which inserts a gene name and symbol for each splice event. The original Bayes file produced by MISO contains expression quantifications in the form of percent spliced in (PSI) values (13), which indicate the fraction of mRNA that represents exon inclusion. This file and the output file from Homer are then merged to produce a tab-separated value (TSV) file containing gene identifiers for each splice event.
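The merge step can be sketched in Python as a join on a shared event identifier. The PSI values here come from a naive read-count ratio used only to produce example numbers; MISO itself estimates PSI by Bayesian inference, which this sketch does not attempt to reproduce. The event identifiers, gene names, and function names are hypothetical.

```python
import csv
import io

def naive_psi(inclusion_reads, exclusion_reads):
    """Read-count estimate of percent spliced in: inclusion / (inclusion + exclusion)."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads support either isoform")
    return inclusion_reads / total

def merge_by_event(psi_by_event, gene_by_event):
    """Join PSI values with gene symbols on a shared event identifier, as TSV text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["event", "gene", "psi"])
    for event, psi in sorted(psi_by_event.items()):
        writer.writerow([event, gene_by_event.get(event, "NA"), f"{psi:.2f}"])
    return out.getvalue()

# Hypothetical events and read counts, for illustration only.
psi = {"chr1:100-200": naive_psi(30, 10), "chr2:500-650": naive_psi(5, 15)}
genes = {"chr1:100-200": "GENE_A"}
tsv = merge_by_event(psi, genes)
```

Events without an annotation are kept and marked "NA" so that no splice event is silently dropped during the merge.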


RNA-SEQ ANALYSIS OF HEK293 CELLS TRANSFECTED WITH AN

SRSF1 AND AN RRM12 OVER EXPRESSION VECTOR.

Sequencing of Samples and Initial Output

To study the role of SRSF1 in global gene expression, HEK293 cells were transfected with an SRSF1 over-expression vector (pSRSF1), a truncated SRSF1 variant (pRRM12), and a control expression vector (pCMV-SP6). HEK293 cells were used because they are easily cultured and transfected, and the functions of SRSF1 are not expected to change between cell types. Total RNA was then extracted at 12 and 48 hours following transfection: at 12 hours, the genes directly regulated by SRSF1 will show a change in expression and processing, while at 48 hours both the directly regulated genes and their secondary targets will show changes in expression levels and RNA processing.

The total RNA was extracted utilizing the EZ Tissue/Cell Total RNA Mini Prep Kit from EZBioResearch. It was then quantified and analyzed for RNA integrity utilizing an Agilent 2100 Bioanalyzer. The total RNA preparation was then sent to the Scripps Florida genomics facility for library preparation and sequencing. The samples were treated with DNase to remove any possible genomic DNA contamination. Total RNA quality was reassessed utilizing an Agilent 2100 Bioanalyzer and the samples were quantified using a Qubit 2 fluorometer. RNA quality was measured using the RNA integrity number (RIN) (scale 1 to 10). All samples showed low degradation and a high RIN (> 8).

Figure 7. Quality of the Total mRNA utilized in the RNA-Seq assay.

The DNase-treated samples were then depleted of ribosomal RNA using the probes contained in the Illumina TruSeq Total RNA-seq kit, which provides probes for human, mouse, and rat rRNA. The sample quality was then re-assessed on the Bioanalyzer to check that the 28S and 18S rRNA peaks were minimized.

The cDNA from each transfection assay was then sequenced with an Illumina HiSeq 2500 sequencer. The demultiplexed data obtained consisted of 40 million paired-end reads (each around 80 nucleotides long) per sample.

Furthermore, an additional set of samples from a second transfection assay was sequenced with an Illumina NextSeq 500 sequencer. The demultiplexed data received was composed of 2 technical replicates, each consisting of 25 million paired-end reads.

Preprocessing of sequencing data

The FASTQ files containing the read data for each sequencing run (SRSF1, SP6, and RRM12 samples) were run through the FASTQC program to determine read quality, Phred score, read distribution, and read length. The reads were then merged using the concatenate tail-to-head tool to generate two FASTQ files, one containing forward reads and one containing reverse reads. The Phred score of the merged reads was evaluated and was high enough (above 28 on the 0 to 36 scale reported by FASTQC, where 28 to 36 is considered ideal) that read trimming was not needed. This process was repeated for each set of replicates.

Alignment of Sequences to Reference Genome

The merged reads were then aligned to the Feb. 2012 Ensembl build of the human reference genome (GRCh38) to generate binary alignment map (BAM) files. The HISAT2 tool was utilized to align the paired reads to the reference genome.

Differential Gene Expression Analysis

The BAM files obtained from HISAT2 were then used in computing differential gene expression through the Cuffdiff package. Each control and its corresponding replicate files were added to Cuffdiff and used to develop a global model for all samples, which generates fold change values across samples. The controls and samples were then compared to generate a TSV file containing gene-level changes in expression and P-values for SP6 vs. SRSF1 and SP6 vs. RRM12 at 12 and 48 hours. Cuffdiff then outputs a list of genes that were differentially expressed between the two samples in each comparison. The gene lists obtained from Cuffdiff were then subjected to value cutoffs. The Cuffdiff output was limited to genes with a log2 fold change of greater than 1.5 or less than -1.5 and a P-value of less than 0.05 at 12 hours, and a log2 fold change of greater than 2 or less than -2 and a P-value of less than 0.05 at 48 hours.
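The cutoffs described above can be expressed as a small filter. The Python sketch below assumes, as in the figure captions, that the fold change cutoffs apply to log2 fold change values; the gene names, values, and the function `passes_cutoffs` are hypothetical.

```python
def passes_cutoffs(log2_fc, p_value, hours):
    """Apply the time-point-specific fold change cutoff and the P-value cutoff."""
    fc_cutoff = {12: 1.5, 48: 2.0}[hours]
    return abs(log2_fc) > fc_cutoff and p_value < 0.05

# Hypothetical rows: (gene, log2 fold change, P-value).
rows = [
    ("GENE_A", 1.7, 0.01),
    ("GENE_B", -2.4, 0.03),
    ("GENE_C", 3.0, 0.20),
    ("GENE_D", 0.9, 0.001),
]
kept_12h = [gene for gene, fc, p in rows if passes_cutoffs(fc, p, 12)]
kept_48h = [gene for gene, fc, p in rows if passes_cutoffs(fc, p, 48)]
```

Because the 48-hour cutoff is stricter, a gene such as the hypothetical GENE_A can pass at 12 hours and fail at 48 hours without any change in its expression.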

These gene lists were then run through the g:Profiler ID converter (15) in order to obtain a description for each gene. Using these descriptions, genes were separated into protein coding genes, RNA coding genes, pseudogenes, and unknown genes. Any gene for which g:Profiler did not return a description was then searched in a second database (16) to identify a function for the gene.

In Excel, the Cuffdiff output files were used to compare the genes modulated by SRSF1 at 12 and 48 hours, and the genes modulated by RRM12 at 12 and 48 hours. The genes modulated by SRSF1 and RRM12 were also compared with each other at 12 hours and at 48 hours.
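The same comparisons can be expressed as set operations. The Python sketch below is an illustrative alternative to the spreadsheet workflow, not the method used in this work; the gene lists and function names are hypothetical.

```python
def compare_gene_lists(list_a, list_b):
    """Return genes shared by both lists and genes unique to each, as sorted lists."""
    a, b = set(list_a), set(list_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }

def percent_shared(list_a, list_b):
    """Percentage of genes in list_a that also appear in list_b."""
    a = set(list_a)
    return 100.0 * len(a & set(list_b)) / len(a) if a else 0.0

# Hypothetical gene lists, for illustration only.
srsf1_12h = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
rrm12_12h = ["GENE_B", "GENE_D", "GENE_E"]
overlap = compare_gene_lists(srsf1_12h, rrm12_12h)
shared_pct = percent_shared(srsf1_12h, rrm12_12h)
```

The three groups correspond to the regions of a two-set Venn diagram, and the percentage is computed relative to the first list.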

A majority of the genes modulated by SRSF1 at 12 and 48 hours coded for functional protein products; of those, at 12 hours nearly 50% were up-regulated and 50% down-regulated. At 48 hours, however, nearly 90% of the protein coding genes were up-regulated by SRSF1.

Similar to the trend shown in the SRSF1 data, a majority of the genes modulated by RRM12 at 12 and 48 hours coded for functional protein products; of those, at 12 hours nearly 50% were up-regulated and 50% down-regulated. At 48 hours, however, nearly 90% of the protein coding genes were up-regulated by RRM12.

Only 50% of the protein coding genes regulated by SRSF1 at 12 hours were found to also be regulated by RRM12, suggesting that the deletion mutant shares similar but not identical functions in gene expression. Furthermore, the RRM12 mutant appeared to regulate a larger number of genes than SRSF1 (243 vs. 139), suggesting that the naturally occurring SRSF1 isoforms that contain RRM12 but lack the RS domain are able to modulate the expression of a set of genes differently than the canonical SRSF1 sequence. Only 35% of the genes regulated by SRSF1 at 12 hours were also found to be regulated at 48 hours. This can be due to multiple reasons, such as the increased stringency of our analysis cutoffs at 48 hours and feedback mechanisms that keep the relative abundance of some genes within a given range.

Figure 8. Genes significantly modulated by SRSF1 at 12 and 48 hours using a log2 fold change cutoff of above 1.5 and below -1.5 at 12 hours, a log2 fold change of above 2 and below -2 at 48 hours, and a p-value cutoff of 0.05 at both time points.


Figure 9. Genes significantly modulated by RRM12 at 12 and 48 hours using a log2 fold change cutoff of above 1.5 and below -1.5 at 12 hours, a log2 fold change of above 2 and below -2 at 48 hours, and a p-value cutoff of 0.05 at both time points.


Figure 10. Venn diagrams of genes found to be in common between SRSF1 and RRM12 differential expression testing at 12 and 48 hours, using a P-value cutoff of 0.05, a log2 fold change cutoff of above 1.5 and below -1.5 at 12 hours, and a log2 fold change cutoff of above 2 and below -2 at 48 hours.


Figure 11. Biological pathways significantly regulated by SRSF1 expression after 12 hours. GO Biological Processes (BP), GO Molecular Functions (MF), KEGG biological pathways (Keg).

Figure 12. Key pathways and genes significantly regulated by SRSF1 expression.

| Pathway | PPP1R1B | HIST3H2A | HIST2H3D | HIST2H2AC | HIST2H3C | HIST1H4k | HIST2H2AA3 | HIST2H3A | HIST2H4B | HIST1H4K |
| Chromatin silencing | | -1.63071 | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| | | -1.63071 | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| Systemic lupus erythematosus | | -1.63071 | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| Alcoholism | 1.60197 | -1.63071 | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| Epigenetic regulation of gene expression | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| DNA methylation | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| HDACs deacetylate | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| Senescence-Associated Secretory Phenotype (SASP) | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| Amyloid fiber formation | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| RNA Polymerase I Transcription | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| RMTs methylate histone arginines | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| HATs acetylate histones | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |
| Transcriptional regulation by small RNAs | | | -1.60964 | -1.65792 | -1.9085 | -1.91883 | -1.69801 | -6.01476 | -8.66543 | -1.91883 |


Figure 13. Biological pathways significantly regulated by SRSF1 expression after 48 hours. GO Biological Processes (BP), GO Molecular Functions (MF), KEGG biological pathways (Keg).

Figure 14. Key pathways and genes significantly regulated by SRSF1 expression after 48 hours.


Figure 15. Biological pathways significantly regulated by RRM12 expression after 12 hours. GO Biological Processes (BP), GO Molecular Functions (MF), KEGG biological pathways (Keg).

Genes (in column order): DMBX1 EHMT1 FZD9 H2AFX HES4 HIST1H2AB HIST1H2AH HIST1H2AI HIST1H2AL HIST1H2AM HIST1H3A HIST1H3H HIST1H4A HIST1H4K HIST2H2AA3 HIST2H2AA4 HIST2H2AB HIST2H2AC HIST2H3C HIST2H3D HIST3H2A HIST3H2BB JUND KCNH2 NPAS3 RENBP SLC4A11 SLC6A4 SOX18 SPPL2B ZFP36

Chromatin silencing: -1.54961 -1.97439 -1.9905 -1.87188 -2.24981 -1.69631 -1.65961 -1.65994 -2.26428 -2.03729 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -1.7199 -2.59459 -3.59225 -3.03392 -2.66858 -1.62825 -1.52474 -1.64705 2.07788 -1.90329 -1.69598 2.97066 -1.94017 -1.69262 -1.69776
Nucleosome: -1.87188 -1.69631 -1.65961 -1.65994 -2.26428 -2.03729 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -1.7199 -2.59459 -3.59225 -3.03392 -2.66858 -1.62825
Protein heterodimerization activity: -1.65961 -1.65994 -2.26428 -2.03729 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -1.7199 -3.03392 -2.66858 -1.62825 -1.94017
Systemic lupus erythematosus: -1.87188 -1.69631 -1.65961 -1.65994 -2.26428 -2.03729 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -1.7199 -2.59459 -3.59225 -3.03392 -2.66858 -1.62825
Alcoholism: -1.65961 -1.65994 -2.26428 -2.03729 -1.80822 -1.63374 -1.65154 -1.7199 -3.03392 -2.66858 -1.62825
Epigenetic regulation of gene expression: -1.87188 -1.69631 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -2.59459 -3.59225 -3.03392 -1.62825
DNA methylation: -1.65994 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -2.59459 -3.59225 -3.03392 -1.62825
HDACs deacetylate histones: -1.69631 -1.65961 -1.65994 -2.26428 -2.03729 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -1.7199 -2.59459 -3.59225 -3.03392 -2.66858 -1.62825
HATs acetylate histones: -1.65961 -1.65994 -2.26428 -2.03729 -1.80822 -1.63374 -2.03028 -1.65154 -1.7199 -2.66858 -1.62825
Amyloid fiber formation: -1.69631 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -2.59459 -3.59225 -3.03392 -1.62825
RNA Polymerase I Transcription: -1.5631 -1.63374 -2.03028 -2.22284 -1.65154 -2.59459 -3.59225 -3.03392 -1.62825
Chromatin organization: -1.69631 -1.65961 -1.65994 -2.26428 -2.03729 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -1.7199 -2.59459 -3.59225 -3.03392 -2.66858 -1.62825
Activated PKN1 stimulates transcription of AR: -1.5631 -1.63374 -2.03028 -2.22284 -1.65154 -2.59459 -3.59225 -3.03392 -1.62825
Activation of anterior HOX genes in hindbrain dev.: -1.69631 -1.5631 -1.80822 -1.63374 -2.03028 -2.22284 -1.65154 -2.59459 -3.59225 -3.03392 -1.62825
Transcriptional regulation by small RNAs: -1.5631 -1.63374 -2.03028 -2.22284 -1.65154 -2.59459 -3.59225 -3.03392 -1.62825

Figure 16. Key pathways and genes significantly regulated by RRM12 expression after 12 hours.


Figure 17. Biological pathways significantly regulated by RRM12 expression after 48 hours. GO Biological Processes (BP), GO Molecular Functions (MF), KEGG biological pathways (Keg).

Figure 18. Key pathways and genes significantly regulated by RRM12 expression after 48 hours.

Histone genes are regulated by SRSF1 expression

Five classes of histone proteins are found in the chromatin of eukaryotic cells. Two molecules of each of the four core histones, H2A, H2B, H3, and H4, make up the nucleosome, while H1 is bound to the linker DNA between nucleosomes and is also required to organize the higher-order structure of chromatin. The expression of most histone genes depends on the cell cycle and cell replication; these genes are named replication-dependent histone genes, and their mRNAs are not polyadenylated. There are also some histone genes that are not cell-cycle regulated and that encode polyadenylated mRNAs. The five classes of replication-dependent histones are encoded by a multi-gene family organized in 3 gene clusters (HIST1, HIST2, HIST3) and a fourth location coding only for a copy of H4. HIST1 contains 51 histone genes, HIST2 6 genes, and HIST3 only 3 genes. H4 is encoded by 14 genes that code for the same protein, while there are 3 variants of [...] encoded by 12 genes. [...] and H2B consist of at least 10 variants (17).

Because histone genes were the most significantly represented among the genes regulated at 12 hours after SRSF1 and RRM12 expression, we further analyzed the expression of all histone genes at both 12 and 48 hours following SRSF1 and RRM12 expression. For this secondary analysis we did not filter by P-value or by log2 fold change.

Overall, we observed a marked decrease in replication-dependent histone gene transcripts at 12 hours following expression of either SRSF1 or RRM12, while at 48 hours histone transcripts showed an overall increase. At the same time, the replication-independent histone genes did not appear to decrease significantly at 12 hours but did increase slightly at 48 hours. Histone H2A appeared to be the most strongly down-regulated at 12 hours.
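Grouping histone genes by cluster and by histone type, as done for this secondary analysis, can be sketched from the gene symbols alone. The example below assumes the HIST<cluster><type> symbol convention seen in the figures (for example HIST1H2AB); the function names and the input values are hypothetical and are not the values measured in this study. Symbols that do not follow the pattern are skipped.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Optional, Tuple

# Matches symbols such as HIST1H2AB: cluster digit, then histone type.
HISTONE_SYMBOL = re.compile(r"^HIST(\d)(H1|H2A|H2B|H3|H4)")


def classify(symbol: str) -> Optional[Tuple[str, str]]:
    """Return (cluster, histone type) for a cluster-style symbol, else None."""
    match = HISTONE_SYMBOL.match(symbol.upper())
    if match is None:
        return None
    return "HIST" + match.group(1), match.group(2)


def mean_change_by_type(log2_changes: Dict[str, float]) -> Dict[str, float]:
    """Average the log2 fold change of all matching genes per histone type."""
    by_type = defaultdict(list)
    for symbol, value in log2_changes.items():
        classified = classify(symbol)
        if classified is not None:
            by_type[classified[1]].append(value)
    return {histone_type: mean(values) for histone_type, values in by_type.items()}


if __name__ == "__main__":
    # Illustrative numbers only.
    example = {"HIST1H4A": -2.0, "HIST1H4K": -1.0, "HIST2H3C": -3.0, "JUND": -1.5}
    print(classify("HIST1H2AB"))
    print(mean_change_by_type(example))
```

No P-value or fold change filter is applied here, matching the unfiltered nature of the secondary analysis.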


Figure 19. Global effect on histone gene expression following SRSF1 and RRM12 delivery to the cells. Replication-dependent histone genes are shown in the left panel. Replication-independent genes are shown in the right panel. The heatmap shows the variation in histone gene expression relative to the control. Histone clusters are also indicated. No P-value cutoff was applied in this analysis. The log2 of the variation in histone gene expression is color-coded (green = downregulated, red = upregulated).


Figure 20. Variation in histone gene expression classified by histone type.

Key cellular pathways are regulated by SRSF1

Pathway analysis of the genes regulated by SRSF1 48 hours after its overexpression reveals that multiple cellular pathways are specifically regulated.

Among the genes significantly regulated at 48 hours following SRSF1 overexpression (log2 fold change above 2 or below -2, P < 0.05) are key components of the TNF and NF-kappa B signaling pathways (Fig. 21, 22). Several cytokines are also regulated, affecting immunological processes and the functions of cells of the immune system. SRSF1 also appears to play a role in the immune disease rheumatoid arthritis and in amphetamine addiction (Fig. 23, 24).

STRING node analysis, carried out to characterize the interactions between the proteins regulated by SRSF1 expression, revealed that these proteins have significantly more interactions among themselves than would be expected for a random set of proteins of similar size (the enrichment P-value for this set of genes is 2.22e-15) (Fig. 25). From this analysis it appears that the cytokine TNF and the transcription factors EGR1 and FOS are the primary nodes for this gene set. Given the key role played by TNF in the immune response, in cytokine regulatory pathways, and in the majority of the biological pathways found to be statistically relevant in the analysis of SRSF1 overexpression at 48 hours, we validated TNF expression at 12 and 48 hours following SRSF1 and RRM12 expression using a quantitative PCR (qPCR) approach (Fig. 27) (experiment carried out by Sean Paz in the laboratory of Dr. Caputi).


Figure 21. KEGG pathway for TNF signaling. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.


Figure 22. KEGG pathway for NF-kappa B signaling. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.


Figure 23. KEGG pathway for rheumatoid arthritis. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.


Figure 24. KEGG pathway for amphetamine addiction. The genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression are indicated in red.


Figure 25. STRING node analysis of the genes significantly regulated at 48 hours (log2 fold change above 2 or below -2, P < 0.05) following SRSF1 overexpression. Interactions among proteins (known or predicted) are indicated by strings connecting the proteins. The enrichment P-value for this set of genes is 2.22e-15. These proteins have significantly more interactions among themselves than would be expected for a random set of proteins of similar size.


Figure 26. Relative expression, in all samples analyzed, of the main gene nodes in the dataset of genes significantly regulated at 48 hours by SRSF1 overexpression (log2 fold change above 2 or below -2, P < 0.05).

Figure 27. Validation by qPCR of TNF, F2, FES, AOC2, CXCL1, CCL20 expression at 12 and 48 hours following SRSF1 overexpression (Experiment carried out by Sean Paz).


DISCUSSION & FUTURE STUDIES

One of the most intriguing findings of our analysis is the decrease in histone gene transcripts following SRSF1 expression. A decrease in histone expression usually leads to an increase in overall gene transcription, which correlates well with previous observations from the laboratory of Dr. Caputi showing that SRSF1 is highly up-regulated following CD4+ T cell activation. We have now observed that, at 48 hours, an increase in SRSF1 expression activates several pathways related to immunity and to T cell function and activation.

Taken together, these results indicate a key role for SRSF1 in T cell activation and functions.

In addition to the differential gene expression analysis, we intend to perform an exon-centric splicing analysis. Since SRSF1 regulates genes through modulation of alternative splicing events, it is important to understand which events are prevalent with SRSF1 and RRM12 over-expression. To perform this analysis we will use the MISO program provided by the Burge lab. MISO outputs a separate file for each exon event type. These files contain PSI (percent spliced in) values that can be used to identify splice events suppressed or activated by SRSF1.
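One way PSI values could be used to flag such events is to compare the PSI of each event between control and SRSF1-expressing samples. The sketch below is only an illustration of that idea and does not call MISO itself: it assumes the PSI values have already been read into dictionaries keyed by event name, and the function name, the 0.2 threshold, and the reading of "activated" as increased inclusion are all assumptions.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, float]  # (event name, change in PSI)


def classify_splice_events(
    psi_control: Dict[str, float],
    psi_treated: Dict[str, float],
    min_delta: float = 0.2,  # hypothetical threshold on the PSI difference
) -> Tuple[List[Event], List[Event]]:
    """Split events into (activated, suppressed) by their change in PSI.

    Only events present in both samples are compared.
    """
    activated, suppressed = [], []
    for event in sorted(psi_control.keys() & psi_treated.keys()):
        delta = psi_treated[event] - psi_control[event]
        if delta >= min_delta:
            activated.append((event, delta))
        elif delta <= -min_delta:
            suppressed.append((event, delta))
    return activated, suppressed


if __name__ == "__main__":
    # Illustrative PSI values only.
    control = {"exon_1": 0.25, "exon_2": 0.5, "exon_3": 0.75}
    treated = {"exon_1": 0.75, "exon_2": 0.5, "exon_3": 0.25, "exon_4": 0.9}
    print(classify_splice_events(control, treated))
```

The same comparison would be repeated per event type, since each event type is reported in its own file.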


APPENDIX A1

#
# Galaxy is configured by default to be usable in a single-user development
# environment. To tune the application for a multi-user production
# environment, see the documentation at:
#
# http://usegalaxy.org/production
#

# Throughout this sample configuration file, except where stated otherwise,
# uncommented values override the default if left unset, whereas commented
# values are set to the default value. Relative paths are relative to the root
# Galaxy directory.
#
# Examples of many of these options are explained in more detail in the wiki:
#
# https://wiki.galaxyproject.org/Admin/Config
#
# Config hackers are encouraged to check there before asking for help.

# ---- HTTP Server ------

# Configuration of the internal HTTP server.

[server:main]

# The internal HTTP server to use. Currently only Paste is provided. This
# option is required.
use = egg:Paste#http

# The port on which to listen.
port = 8088

# The address on which to listen. By default, only listen to localhost (Galaxy
# will not be accessible over the network). Use '0.0.0.0' to listen on all
# available network interfaces.
host = 0.0.0.0

# Use a threadpool for the web server instead of creating a thread for each
# request.
use_threadpool = True

# Number of threads in the web server thread pool.
#threadpool_workers = 10

# Set the number of seconds a thread can work before you should kill it
# (assuming it will never finish) to 3 hours. Default is 600 (10 minutes).
threadpool_kill_thread_limit = 10800

# ---- Filters ------

# Filters sit between Galaxy and the HTTP server.

# These filters are disabled by default. They can be enabled with
# 'filter-with' in the [app:main] section below.

# Define the gzip filter.
[filter:gzip]
use = egg:Paste#gzip

# Define the proxy-prefix filter.
[filter:proxy-prefix]
use = egg:PasteDeploy#prefix
prefix = /koko

# ---- Galaxy ------

# Configuration of the Galaxy application.

[app:main]

# -- Application and filtering

# The factory for the WSGI application. This should not be changed.
paste.app_factory = galaxy.web.buildapp:app_factory

# If not running behind a proxy server, you may want to enable gzip compression
# to decrease the size of data transferred over the network. If using a proxy
# server, please enable gzip compression there instead.
#filter-with = gzip

# If running behind a proxy server and Galaxy is served from a subdirectory,
# enable the proxy-prefix filter and set the prefix in the
# [filter:proxy-prefix] section above.
filter-with = proxy-prefix

# If proxy-prefix is enabled and you're running more than one Galaxy instance
# behind one hostname, you will want to set this to the same path as the prefix
# in the filter above. This value becomes the "path" attribute set in the
# cookie so the cookies from each instance will not clobber each other.
cookie_path = /koko

# -- Database

# By default, Galaxy uses a SQLite database at 'database/universe.sqlite'. You
# may use a SQLAlchemy connection string to specify an external database
# instead. This string takes many options which are explained in detail in the
# config file documentation.
# This is an example
#database_connection = sqlite:///./database/universe.sqlite?isolation_level=IMMEDIATE


# This is the connection to the local mysql
database_connection = postgres://galaxy:********@127.0.0.1/galaxy_production


# This doesnt work

# If the server logs errors about not having enough database pool connections,
# you will want to increase these values, or consider running more Galaxy
# processes.
#database_engine_option_pool_size = 5
#database_engine_option_max_overflow = 10

# If using MySQL and the server logs the error "MySQL server has gone away",
# you will want to set this to some positive value (7200 should work).
#database_engine_option_pool_recycle = -1

# If large database query results are causing memory or response time issues in
# the Galaxy process, leave the result on the server instead. This option is
# only available for PostgreSQL and is highly recommended.
#database_engine_option_server_side_cursors = False

# Log all database transactions, can be useful for debugging and performance
# profiling. Logging is done via Python's 'logging' module under the qualname
# 'galaxy.model.orm.logging_connection_proxy'
#database_query_profiling_proxy = False

# By default, Galaxy will use the same database to track user data and
# tool shed install data. There are many situations in which it is
# valuable to separate these - for instance bootstrapping fresh Galaxy
# instances with pretested installs. The following option can be used to
# separate the tool shed install database (all other options listed above
# but prefixed with install_ are also available).
#install_database_connection = sqlite:///./database/universe.sqlite?isolation_level=IMMEDIATE

# -- Files and directories

# Dataset files are stored in this directory.
#file_path = database/files

# Temporary files are stored in this directory.
#new_file_path = database/tmp

# Tool config files, defines what tools are available in Galaxy.
# Tools can be locally developed or installed from Galaxy tool sheds.
# (config/tool_conf.xml.sample will be used if left unset and
# config/tool_conf.xml does not exist).
#tool_config_file = config/tool_conf.xml,config/shed_tool_conf.xml

# Enable / disable checking if any tools defined in the above non-shed
# tool_config_files (i.e., tool_conf.xml) have been migrated from the Galaxy
# code distribution to the Tool Shed. This setting should generally be set to
# False only for development Galaxy environments that are often rebuilt from
# scratch where migrated tools do not need to be available in the Galaxy tool
# panel. If the following setting remains commented, the default setting will
# be True.
#check_migrate_tools = True

# Tool config maintained by tool migration scripts. If you use the migration
# scripts to install tools that have been migrated to the tool shed upon a new
# release, they will be added to this tool config file.
#migrated_tools_config = config/migrated_tools_conf.xml

# File that contains the XML section and tool tags from all tool panel config
# files integrated into a single file that defines the tool panel layout. This
# file can be changed by the Galaxy administrator to alter the layout of the
# tool panel. If not present, Galaxy will create it.
#integrated_tool_panel_config = integrated_tool_panel.xml

# Default path to the directory containing the tools defined in tool_conf.xml.
# Other tool config files must include the tool_path as an attribute in the
# tag.
#tool_path = tools

# -- Tool dependencies

# Path to the directory in which tool dependencies are placed. This is used by
# the Tool Shed to install dependencies and can also be used by administrators
# to manually install or link to dependencies. For details, see:
# https://wiki.galaxyproject.org/Admin/Config/ToolDependencies
# Set to the string none to explicitly disable tool dependency handling.
# If this option is set to none or an invalid path, installing tools with dependencies
# from the Tool Shed will fail.
tool_dependency_dir = /home/eclark28/galaxy_production/galaxy/tool_dependency/tool_dependency

# The dependency resolvers config file specifies an ordering and options for how
# Galaxy resolves tool dependencies (requirement tags in Tool XML). The default
# ordering is to the use the Tool Shed for tools installed that way, use local
# Galaxy packages, and then use Conda if available.
# See https://github.com/galaxyproject/galaxy/blob/dev/doc/source/admin/dependency_resolvers.rst
# for more information on these options.
#dependency_resolvers_config_file = config/dependency_resolvers_conf.xml

# The following Conda dependency resolution options will change the defaults for
# all Conda resolvers, but multiple resolvers can be configured independently
# in dependency_resolvers_config_file and these options overridden.
# Location on the filesystem where Conda packages are installed

# conda_prefix is the location on the filesystem where Conda packages and environments are installed
# IMPORTANT: Due to a current limitation in conda, the total length of the
# conda_prefix and the job_working_directory path should be less than 50 characters!
#conda_prefix = /_conda
# Override the Conda executable to use, it will default to the one on the
# PATH (if available) and then to /bin/conda
#conda_exec =
# Pass debug flag to conda commands.
#conda_debug = False
# conda channels to enable by default (http://conda.pydata.org/docs/custom-channels.html)
#conda_ensure_channels = conda-forge,r,bioconda,iuc
# Set to True to instruct Galaxy to look for and install missing tool
# dependencies before each job runs.
#conda_auto_install = False
# Set to True to perform additional checking of installed Conda environment
#conda_verbose_install_check=False
# Set to True to instruct Galaxy to install Conda from the web automatically
# if it cannot find a local copy and conda_exec is not configured.
#conda_auto_init = False

# File containing the Galaxy Tool Sheds that should be made available to
# install from in the admin interface (.sample used if default does not exist).
#tool_sheds_config_file = config/tool_sheds_conf.xml

# Set to True to enable monitoring of tools and tool directories
# listed in any tool config file specified in tool_config_file option.
# If changes are found, tools are automatically reloaded. Watchdog (
# https://pypi.python.org/pypi/watchdog ) must be installed and
# available to Galaxy to use this option. Other options include 'auto'
# which will attempt to watch tools if the watchdog library is available
# but won't fail to load Galaxy if it is not and 'polling' which will use
# a less efficient monitoring scheme that may work in wider range of scenarios
# than the watchdog default.
#watch_tools = False

# Enable automatic polling of relative tool sheds to see if any updates
# are available for installed repositories. Ideally only one Galaxy
# server process should be able to check for repository updates. The
# setting for hours_between_check should be an integer between 1 and 24.
#enable_tool_shed_check = False
#hours_between_check = 12

# Enable use of an in-memory registry with bi-directional relationships between
# repositories (i.e., in addition to lists of dependencies for a repository,
# keep an in-memory registry of dependent items for each repository.
#manage_dependency_relationships = False

# XML config file that contains data table entries for the
# ToolDataTableManager. This file is manually # maintained by the Galaxy
# administrator (.sample used if default does not exist).
#tool_data_table_config_path = config/tool_data_table_conf.xml

# XML config file that contains additional data table entries for the
# ToolDataTableManager. This file is automatically generated based on the
# current installed tool shed repositories that contain valid
# tool_data_table_conf.xml.sample files. At the time of installation, these
# entries are automatically added to the following file, which is parsed and
# applied to the ToolDataTableManager at server start up.
#shed_tool_data_table_config = config/shed_tool_data_table_conf.xml

# Directory where data used by tools is located, see the samples in that
# directory and the wiki for help:
# https://wiki.galaxyproject.org/Admin/DataIntegration
#tool_data_path = tool-data

# Directory where Tool Data Table related files will be placed
# when installed from a ToolShed. Defaults to tool_data_path.
#shed_tool_data_path = tool-data

# File containing old-style genome builds
#builds_file_path = tool-data/shared/ucsc/builds.txt

# Directory where chrom len files are kept, currently mainly used by trackster
#len_file_path = tool-data/shared/ucsc/chrom

# Datatypes config file(s), defines what data (file) types are available in
# Galaxy (.sample is used if default does not exist). If a datatype appears in
# multiple files, the last definition is used (though the first sniffer is used
# so limit sniffer definitions to one file).
#datatypes_config_file = config/datatypes_conf.xml

# Disable the 'Auto-detect' option for file uploads
#datatypes_disable_auto = False

# Visualizations config directory: where to look for individual visualization
# plugins. The path is relative to the Galaxy root dir. To use an absolute
# path begin the path with '/'. This is a comma separated list.
# Defaults to "config/plugins/visualizations".
#visualization_plugins_directory = config/plugins/visualizations

# Interactive environment plugins root directory: where to look for interactive
# environment plugins. By default none will be loaded. Set to
# config/plugins/interactive_environments to load Galaxy's stock plugins
# (currently just IPython). These will require Docker to be configured and
# have security considerations, so proceed with caution. The path is relative to the
# Galaxy root dir. To use an absolute path begin the path with '/'. This is a comma
# separated list.
interactive_environment_plugins_directory = config/plugins/interactive_environments/

# Interactive tour directory: where to store interactive tour definition files.
# Galaxy ships with several basic interface tours enabled, though a different
# directory with custom tours can be specified here. The path is relative to the
# Galaxy root dir. To use an absolute path begin the path with '/'. This is a comma
# separated list.
tour_config_dir = config/plugins/tours

# Each job is given a unique empty directory as its current working directory.
# This option defines in what parent directory those directories will be
# created.
#job_working_directory = database/job_working_directory

# If using a cluster, Galaxy will write job scripts and stdout/stderr to this
# directory.
#cluster_files_directory = database/pbs

# Mako templates are compiled as needed and cached for reuse, this directory is
# used for the cache
#template_cache_path = database/compiled_templates

# Set to false to disable various checks Galaxy will do to ensure it
# can run job scripts before attempting to execute or submit them.
#check_job_script_integrity = True
# Number of checks to execute if check_job_script_integrity is enabled.
#check_job_script_integrity_count = 35
# Time to sleep between checks if check_job_script_integrity is enabled (in seconds).
#check_job_script_integrity_sleep = .25


# Set the default shell used by non-containerized jobs Galaxy-wide. This
# defaults to bash for all jobs and can be overidden at the destination
# level for heterogenous clusters. conda job resolution requires bash or zsh
# so if this is switched to /bin/sh for instance - conda resolution
# should be disabled. Containerized jobs always use /bin/sh - so more maximum
# portability tool authors should assume generated commands run in sh.
#default_job_shell = /bin/bash

# Citation related caching. Tool citations information maybe fetched from
# external sources such as http://dx.doi.org/ by Galaxy - the following
# parameters can be used to control the caching used to store this information.
#citation_cache_type = file
#citation_cache_data_dir = database/citations/data
#citation_cache_lock_dir = database/citations/lock

# External service types config file, defining what types of external_services
# configurations are available in Galaxy (.sample is used if default does not
# exist).
#external_service_type_config_file = config/external_service_types_conf.xml

# Path to the directory containing the external_service_types defined in the
# config.
#external_service_type_path = external_service_types

# Tools with a number of outputs not known until runtime can write these
# outputs to a directory for collection by Galaxy when the job is done.
# Previously, this directory was new_file_path, but using one global directory
# can cause performance problems, so using job_working_directory ('.' or cwd
# when a job is run) is encouraged. By default, both are checked to avoid
# breaking existing tools.
#collect_outputs_from = new_file_path,job_working_directory

# -- Data Storage (Object Store)
#
# Configuration file for the object store
# If this is set and exists, it overrides any other objectstore settings.
363. # object_store_config_file = config/object_store_conf.xml 364. 365. 366. # -- Mail and notification 367. 368. # Galaxy sends mail for various things: subscribing users to the mailing list 369. # if they request it, password resets, notifications from the Galaxy Sample 370. # Tracking system, reporting dataset errors, and sending activation emails. 371. # To do this, it needs to send mail through an SMTP server, which you may 372. # define here (host:port). 373. # Galaxy will automatically try STARTTLS but will continue upon failure. 374. #smtp_server = None 375. smtp_server = localhost 376. 377. # If your SMTP server requires a username and password, you can provide them 378. # here (password in cleartext here, but if your server supports STARTTLS it 379. # will be sent over the network encrypted). 380. #smtp_username = None 381. #smtp_password = None 382. 383. # If your SMTP server requires SSL from the beginning of the connection 384. # smtp_ssl = False 385. 386. # On the user registration form, users may choose to join a mailing list. This 387. # is the address used to subscribe to the list. Uncomment and leave empty if you 388. # want to remove this option from the user registration form. 389. #mailing_join_addr = [email protected] 390. 391. # Datasets in an error state include a link to report the error. Those reports 392. # will be sent to this address. Error reports are disabled if no address is 393. # set. Also this email is shown as a contact to user in case of Galaxy 394. # misconfiguration and other events user may encounter. 395. #error_email_to = None 396. 397. # Email address to use in the 'From' field when sending emails for 398. # account activations, workflow step notifications and password resets. 399. # We recommend using string in the following format: 400. # Galaxy Project 401. # If not configured, '' will be used. 402. #email_from = None 403. email_from = [email protected] 404. 405. # URL of the support resource for the galaxy instance. 
Used in activation 406. # emails. 407. #instance_resource_url = https://wiki.galaxyproject.org/ 408. instance_resource_url = https://koko-galaxy.fau.edu/ 409. 410. # E-mail domains blacklist is used for filtering out users that are using 411. # disposable email address during the registration. If their address domain 412. # matches any domain in the blacklist, they are refused the registration. 50 413. #blacklist_file = config/disposable_email_blacklist.conf 414. 415. # Registration warning message is used to discourage people from registering 416. # multiple accounts. Applies mostly for the main Galaxy instance. 417. # If no message specified the warning box will not be shown. 418. #registration_warning_message = Please register only one account - we provide this service free of charge and have limited computational resources. Multi- accounts are tracked and will be subjected to account termination and data deletion. 419. 420. 421. # -- Account activation 422. 423. # User account activation feature global flag. If set to "False", the rest of 424. # the Account activation configuration is ignored and user activation is 425. # disabled (i.e. accounts are active since registration). 426. # The activation is also not working in case the SMTP server is not defined. 427. #user_activation_on = False 428. 429. # Activation grace period (in hours). Activation is not forced (login is not 430. # disabled) until grace period has passed. Users under grace period can't run 431. # jobs. Enter 0 to disable grace period. 432. # Users with OpenID logins have grace period forever. 433. #activation_grace_period = 3 434. 435. # Shown in warning box to users that were not activated yet. 436. # In use only if activation_grace_period is set. 437. #inactivity_box_content = Your account has not been activated yet. Feel free to browse around and see what's available, but you won't be able to upload data or run jobs until you have verified your email ad dress. 438. 439. 
# Password expiration period (in days). Users are required to change their
# password every x days. Users will be redirected to the change password
# screen when they log in after their password expires. Enter 0 to disable
# password expiration.
#password_expiration_period = 0

# Galaxy Session Timeout
# This provides a timeout (in minutes) after which a user will have to log back in.
# A duration of 0 disables this feature.
#session_duration = 0


# -- Analytics

# You can enter tracking code here to track visitor's behavior
# through your Google Analytics account. Example: UA-XXXXXXXX-Y
#ga_code = None

# -- Display sites

# Galaxy can display data at various external browsers. These options specify
# which browsers should be available. URLs and builds available at these
# browsers are defined in the specified files.

# If use_remote_user = True, display application servers will be denied access
# to Galaxy and so displaying datasets in these sites will fail.
# display_servers contains a list of hostnames which should be allowed to
# bypass security to display datasets. Please be aware that there are security
# implications if this is allowed. More details (including required changes to
# the proxy server config) are available in the Apache proxy documentation on
# the wiki.
#
# The list of servers in this sample config are for the UCSC Main, Test and
# Archaea browsers, but the default if left commented is to not allow any
# display sites to bypass security (you must uncomment the line below to allow
# them).
#display_servers = hgw1.cse.ucsc.edu,hgw2.cse.ucsc.edu,hgw3.cse.ucsc.edu,hgw4.cse.ucsc.edu,hgw5.cse.ucsc.edu,hgw6.cse.ucsc.edu,hgw7.cse.ucsc.edu,hgw8.cse.ucsc.edu,lowepub.cse.ucsc.edu

# To disable the old-style display applications that are hardcoded into
# datatype classes, set enable_old_display_applications = False.
# This may be desirable due to using the new-style, XML-defined, display
# applications that have been defined for many of the datatypes that have the
# old-style.
# There is also a potential security concern with the old-style applications,
# where a malicious party could provide a link that appears to reference the
# Galaxy server, but contains a redirect to a third-party server, tricking a
# Galaxy user to access said site.
#enable_old_display_applications = True

# -- Next gen LIMS interface on top of existing Galaxy Sample/Request
# management code.

use_nglims = False
nglims_config_file = tool-data/nglims.yaml

# -- UI Localization

# Show a message box under the masthead.
#message_box_visible = False
#message_box_content = None
#message_box_class = info

# Append "/{brand}" to the "Galaxy" text in the masthead.
#brand = None

# Format string used when showing date and time information.
# The string may contain:
# - the directives used by Python time.strftime() function (see
# https://docs.python.org/2/library/time.html#time.strftime ),
# - $locale (complete format string for the server locale),
# - $iso8601 (complete format string as specified by ISO 8601 international
# standard).
# pretty_datetime_format = $locale (UTC)

# URL (with schema http/https) of the Galaxy instance as accessible within your
# local network - if specified used as a default by pulsar file staging and
# IPython Docker container for communicating back with Galaxy via the API.
#galaxy_infrastructure_url = http://localhost:8080

# If the above URL cannot be determined ahead of time in dynamic environments
# but the port which should be used to access Galaxy can be - this should be
# set to prevent Galaxy from having to guess. For example if Galaxy is sitting
# behind a proxy with REMOTE_USER enabled - infrastructure shouldn't talk to
# Python processes directly and this should be set to 80 or 443, etc... If
# unset this file will be read for a server block defining a port corresponding
# to the webapp.
#galaxy_infrastructure_web_port = 8080

# The URL of the page to display in Galaxy's middle pane when loaded. This can
# be an absolute or relative URL.
#welcome_url = /static/welcome.html

# The URL linked by the "Galaxy/brand" text.
#logo_url = /

# The URL linked by the "Wiki" link in the "Help" menu.
#wiki_url = https://wiki.galaxyproject.org/

# The URL linked by the "Support" link in the "Help" menu.
#support_url = https://wiki.galaxyproject.org/Support

# The URL linked by the "How to Cite Galaxy" link in the "Help" menu.
#citation_url = https://wiki.galaxyproject.org/CitingGalaxy

#The URL linked by the "Search" link in the "Help" menu.
#search_url = http://galaxyproject.org/search/usegalaxy/

#The URL linked by the "Mailing Lists" link in the "Help" menu.
#mailing_lists_url = https://wiki.galaxyproject.org/MailingLists

#The URL linked by the "Videos" link in the "Help" menu.
#screencasts_url = https://vimeo.com/galaxyproject

# The URL linked by the "Terms and Conditions" link in the "Help" menu, as well
# as on the user registration and login forms and in the activation emails.
#terms_url = None

# The URL linked by the "Galaxy Q&A" link in the "Help" menu
# The Galaxy Q&A site is under development; when the site is done, this URL
# will be set and uncommented.
#qa_url =

# Serve static content, which must be enabled if you're not serving it via a
# proxy server. These options should be self explanatory and so are not
# documented individually. You can use these paths (or ones in the proxy
# server) to point to your own styles.
#static_enabled = True
#static_cache_time = 360
#static_dir = static/
#static_images_dir = static/images
#static_favicon_dir = static/favicon.ico
#static_scripts_dir = static/scripts/
#static_style_dir = static/june_2007_style/blue
#static_robots_txt = static/robots.txt

# Incremental Display Options

#display_chunk_size = 65536

# -- Advanced proxy features

# For help on configuring the Advanced proxy features, see:
# http://usegalaxy.org/production

# Apache can handle file downloads (Galaxy-to-user) via mod_xsendfile. Set
# this to True to inform Galaxy that mod_xsendfile is enabled upstream.
#apache_xsendfile = False

# The same download handling can be done by nginx using X-Accel-Redirect. This
# should be set to the path defined in the nginx config as an internal redirect
# with access to Galaxy's data files (see documentation linked above).
#nginx_x_accel_redirect_base = False

# nginx can make use of mod_zip to create zip files containing multiple library
# files. If using X-Accel-Redirect, this can be the same value as that option.
#nginx_x_archive_files_base = False

# If using compression in the upstream proxy server, use this option to disable
# gzipping of library .tar.gz and .zip archives, since the proxy server will do
# it faster on the fly.
#upstream_gzip = False

# The following default adds a header to web request responses that
# will cause modern web browsers to not allow Galaxy to be embedded in
# the frames of web applications hosted at other hosts - this can help
# prevent a class of attack called clickjacking
# (https://www.owasp.org/index.php/Clickjacking). If you configure a
# proxy in front of Galaxy - please ensure this header remains intact
# to protect your users. Uncomment and leave empty to not set the
# `X-Frame-Options` header.
#x_frame_options = SAMEORIGIN

# nginx can also handle file uploads (user-to-Galaxy) via nginx_upload_module.
# Configuration for this is complex and explained in detail in the
# documentation linked above. The upload store is a temporary directory in
# which files uploaded by the upload module will be placed.
#nginx_upload_store = False

# This value overrides the action set on the file upload form, e.g. the web
# path where the nginx_upload_module has been configured to intercept upload
# requests.
#nginx_upload_path = False

# Galaxy can also use nginx_upload_module to receive files staged out upon job
# completion by remote job runners (i.e. Pulsar) that initiate staging
# operations on the remote end. See the Galaxy nginx documentation for the
# corresponding nginx configuration.
#nginx_upload_job_files_store = False
#nginx_upload_job_files_path = False

# Have Galaxy manage dynamic proxy component for routing requests to other
# services based on Galaxy's session cookie. It will attempt to do this by
# default though you do need to install node+npm and do an npm install from
# `lib/galaxy/web/proxy/js`. It is generally more robust to configure this
# externally managing it however Galaxy is managed. If True Galaxy will only
# launch the proxy if it is actually going to be used (e.g. for IPython).
#dynamic_proxy_manage=True

# As of 16.04 Galaxy supports multiple proxy types. The original NodeJS
# implementation, alongside a new Golang single-binary-no-dependencies
# version. Valid values are (node, golang)
#dynamic_proxy=node
#
# The NodeJS dynamic proxy can use an SQLite database or a JSON file for IPC,
# set that here.
#dynamic_proxy_session_map=database/session_map.sqlite

# Set the port and IP for the dynamic proxy to bind to, this must match
# the external configuration if dynamic_proxy_manage is False.
#dynamic_proxy_bind_port=8800
#dynamic_proxy_bind_ip=0.0.0.0

# Enable verbose debugging of Galaxy-managed dynamic proxy.
#dynamic_proxy_debug=False

# The dynamic proxy is proxied by an external proxy (e.g. apache frontend to
# nodejs to wrap connections in SSL).
#dynamic_proxy_external_proxy=False

# Additionally, when the dynamic proxy is proxied by an upstream server, you'll
# want to specify a prefixed URL so both Galaxy and the proxy reside under the
# same path that your cookies are under. This will result in a url like
# https://FQDN/galaxy-prefix/gie_proxy for proxying
#dynamic_proxy_prefix=gie_proxy

# The Golang proxy also manages the docker containers more closely than the
# NodeJS proxy, so is able to expose more container management related options

# This attribute governs the minimum length of time between consecutive HTTP/WS
# requests through the proxy, before the proxy considers a container as being
# inactive and kills it.
#dynamic_proxy_golang_noaccess = 60

# In order to kill containers, the golang proxy has to check at some interval
# for possibly dead containers. This is exposed as a configurable parameter,
# but the default value is probably fine.
#dynamic_proxy_golang_clean_interval = 10

# The golang proxy needs to know how to talk to your docker daemon. Currently
# TLS is not supported, that will come in an update.
#dynamic_proxy_golang_docker_address = unix:///var/run/docker.sock

# The golang proxy uses a RESTful HTTP API for communication with Galaxy
# instead of a JSON or SQLite file for IPC. If you do not specify this, it will
# be set randomly for you. You should set this if you are managing the proxy
# manually.
#dynamic_proxy_golang_api_key = None

# -- Logging and Debugging

# If True, Galaxy will attempt to configure a simple root logger if a
# "loggers" section does not appear in this configuration file.
#auto_configure_logging = True

# Verbosity of console log messages. Acceptable values can be found here:
# https://docs.python.org/2/library/logging.html#logging-levels
#log_level = DEBUG

# Print database operations to the server log (warning, quite verbose!).
#database_engine_option_echo = False

# Print database pool operations to the server log (warning, quite verbose!).
#database_engine_option_echo_pool = False

# Turn on logging of application events and some user events to the database.
#log_events = True

# Turn on logging of user actions to the database. Actions currently logged
# are grid views, tool searches, and use of "recently" used tools menu. The
# log_events and log_actions functionality will eventually be merged.
#log_actions = True


# Fluentd configuration. Various events can be logged to the fluentd instance
# configured below by enabling fluent_log.
#fluent_log = False
#fluent_host = localhost
#fluent_port = 24224

# Sanitize all HTML tool output. By default, all tool output served as
# 'text/html' will be sanitized thoroughly. This can be disabled if you have
# special tools that require unaltered output. WARNING: disabling this does
# make the Galaxy instance susceptible to XSS attacks initiated by your users.
#sanitize_all_html = True

# Whitelist sanitization file.
# Datasets created by tools listed in this file are trusted and will not have
# their HTML sanitized on display. This can be manually edited or manipulated
# through the Admin control panel -- see "Manage Display Whitelist"
#sanitize_whitelist_file = config/sanitize_whitelist.txt

# By default Galaxy will serve non-HTML tool output that may potentially
# contain browser executable JavaScript content as plain text. This will for
# instance cause SVG datasets to not render properly and so may be disabled
# by setting the following option to True.
#serve_xss_vulnerable_mimetypes = False

# Return an Access-Control-Allow-Origin response header that matches the Origin
# header of the request if that Origin hostname matches one of the strings or
# regular expressions listed here. This is a comma separated list of hostname
# strings or regular expressions beginning and ending with /.
# E.g. mysite.com,google.com,usegalaxy.org,/^[\w\.]*example\.com/
# See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS
#allowed_origin_hostnames = None

# Set the following to True to use IPython nbconvert to build HTML from IPython
# notebooks in Galaxy histories. This process may allow users to execute
# arbitrary code or serve arbitrary HTML. If enabled, IPython must be
# available and on Galaxy's PATH, to do this run
# `pip install jinja2 pygments ipython` in Galaxy's virtualenv.
#trust_ipython_notebook_conversion = False

# Debug enables access to various config options useful for development and
# debugging: use_lint, use_profile, use_printdebug and use_interactive. It
# also causes the files used by PBS/SGE (submission script, output, and error)
# to remain on disk after the job is complete.
debug = False

# Check for WSGI compliance.
#use_lint = False

# Run the Python profiler on each request.
#use_profile = False

# Intercept print statements and show them on the returned page.
#use_printdebug = True

# Enable live debugging in your browser. This should NEVER be enabled on a
# public site. Enabled in the sample config for development.
use_interactive = True

# Write thread status periodically to 'heartbeat.log', (careful, uses disk
# space rapidly!). Useful to determine why your processes may be consuming a
# lot of CPU.
#use_heartbeat = False

# Control the period (in seconds) between dumps. Use -1 to disable. Regardless
# of this setting, if use_heartbeat is enabled, you can send a Galaxy process
# (unless running with uWSGI) SIGUSR1 (`kill -USR1`) to force a dump.
#heartbeat_interval = 20

# Heartbeat log filename. Can accept the template variables {server_name} and
# {pid}
#heartbeat_log = heartbeat_{server_name}.log

# Log to Sentry
# Sentry is an open source logging and error aggregation platform. Setting
# sentry_dsn will enable the Sentry middleware and errors will be sent to the
# indicated sentry instance. This connection string is available in your
# sentry instance under -> Settings -> API Keys.
#sentry_dsn = None

# Log to statsd
# Statsd is an external statistics aggregator (https://github.com/etsy/statsd)
# Enabling the following options will cause galaxy to log request timing and
# other statistics to the configured statsd instance. The statsd_prefix is
# useful if you are running multiple Galaxy instances and want to segment
# statistics between them within the same aggregator.
#statsd_host=
#statsd_port=8125
#statsd_prefix=galaxy

# -- Data Libraries

# These library upload options are described in much more detail in the wiki:
# https://wiki.galaxyproject.org/Admin/DataLibraries/UploadingLibraryFiles

# Add an option to the library upload form which allows administrators to
# upload a directory of files.
#library_import_dir = None

# Add an option to the library upload form which allows authorized
# non-administrators to upload a directory of files. The configured directory
# must contain sub-directories named the same as the non-admin user's Galaxy
# login ( email ). The non-admin user is restricted to uploading files or
# sub-directories of files contained in their directory.
#user_library_import_dir = None

# Add an option to the admin library upload tool allowing admins to paste
# filesystem paths to files and directories in a box, and these paths will be
# added to a library. Set to True to enable. Please note the security
# implication that this will give Galaxy Admins access to anything your Galaxy
# user has access to.
#allow_library_path_paste = False

# Users may choose to download multiple files from a library in an archive. By
# default, Galaxy allows users to select from a few different archive formats
# if testing shows that Galaxy is able to create files using these formats.
# Specific formats can be disabled with this option, separate more than one
# format with commas. Available formats are currently 'zip', 'gz', and 'bz2'.
#disable_library_comptypes =

# Some sequencer integration features in beta allow you to automatically
# transfer datasets. This is done using a lightweight transfer manager which
# runs outside of Galaxy (but is spawned by it automatically). Galaxy will
# communicate with this manager over the port specified here.
#transfer_manager_port = 8163

# Search data libraries with whoosh
#enable_whoosh_library_search = True
# Whoosh indexes are stored in this directory.
#whoosh_index_dir = database/whoosh_indexes

# Search data libraries with lucene
#enable_lucene_library_search = False
# maximum file size to index for searching, in MB
#fulltext_max_size = 500
#fulltext_noindex_filetypes = bam,sam,wig,bigwig,fasta,fastq,fastqsolexa,fastqillumina,fastqsanger
# base URL of server providing search functionality using lucene
#fulltext_url = http://localhost:8081

# -- Toolbox Search

# The following boosts are used to customize this instance's toolbox search.
# The higher the boost, the more importance the scoring algorithm gives to the
# given field. Section refers to the tool group in the tool panel. Rest of
# the fields are tool's attributes.
# tool_name_boost = 9
# tool_section_boost = 3
# tool_description_boost = 2
# tool_label_boost = 1
# tool_stub_boost = 5
# tool_help_boost = 0.5

# Limits the number of results in toolbox search. Can be used to tweak how many
# results will appear.
# tool_search_limit = 20

# -- Users and Security

# Galaxy encodes various internal values when these values will be output in
# some format (for example, in a URL or cookie). You should set a key to be
# used by the algorithm that encodes and decodes these values. It can be any
# string. If left unchanged, anyone could construct a cookie that would grant
# them access to others' sessions.
# One simple way to generate a value for this is with the shell command:
# python -c 'import time; print time.time()' | md5sum | cut -f 1 -d ' '
#id_secret = USING THE DEFAULT IS NOT SECURE!

# User authentication can be delegated to an upstream proxy server (usually
# Apache). The upstream proxy should set a REMOTE_USER header in the request.
# Enabling remote user disables regular logins. For more information, see:
# https://wiki.galaxyproject.org/Admin/Config/ApacheProxy
#use_remote_user = False
use_remote_user = True

# If use_remote_user is enabled and your external authentication
# method just returns bare usernames, set a default mail domain to be appended
# to usernames, to become your Galaxy usernames (email addresses).
#remote_user_maildomain = None

# If use_remote_user is enabled, the header that the upstream proxy provides
# the remote username in defaults to HTTP_REMOTE_USER (the 'HTTP_' is prepended
# by WSGI). This option allows you to change the header. Note, you still need
# to prepend 'HTTP_' to the header in this option, but your proxy server should
# *not* include 'HTTP_' at the beginning of the header name.
remote_user_header = HTTP_EPPN
#remote_user_header = HTTP_REMOTE_USER

# If use_remote_user is enabled, anyone who can log in to the Galaxy host may
# impersonate any other user by simply sending the appropriate header. Thus a
# secret shared between the upstream proxy server, and Galaxy is required.
# If anyone other than the Galaxy user is using the server, then apache/nginx
# should pass a value in the header 'GX_SECRET' that is identical to the one
# below.
#remote_user_secret = USING THE DEFAULT IS NOT SECURE!

# If use_remote_user is enabled, you can set this to a URL that will log your
# users out.
#remote_user_logout_href = None
remote_user_logout_href = https://koko-galaxy.fau.edu/Shibboleth.sso/Logout

# If your proxy and/or authentication source does not normalize e-mail
# addresses or user names being passed to Galaxy - set the following option
# to True to force these to lower case.
#normalize_remote_user_email = False

# If an e-mail address is specified here, it will hijack remote user mechanics
# (``use_remote_user``) and have the webapp inject a single fixed user. This
# has the effect of turning Galaxy into a single user application with no
# login or external proxy required. Such applications should not be exposed to
# the world.
#single_user =

# Administrative users - set this to a comma-separated list of valid Galaxy
# users (email addresses). These users will have access to the Admin section
# of the server, and will have access to create users, groups, roles,
# libraries, and more. For more information, see:
# https://wiki.galaxyproject.org/Admin/Interface
admin_users = [email protected], [email protected]

# Force everyone to log in (disable anonymous access).
#require_login = False

# Show the site's welcome page (see welcome_url) alongside the login page
# (even if require_login is True)
#show_welcome_with_login = False

# Allow unregistered users to create new accounts (otherwise, they will have to
# be created by an admin).
#allow_user_creation = True

# Allow administrators to delete accounts.
#allow_user_deletion = False

# Allow administrators to log in as other users (useful for debugging)
#allow_user_impersonation = False

# Allow users to remove their datasets from disk immediately (otherwise,
# datasets will be removed after a time period specified by an administrator in
# the cleanup scripts run via cron)
allow_user_dataset_purge = True

# By default, users' data will be public, but setting this to True will cause
# it to be private. Does not affect existing users and data, only ones created
# after this option is set. Users may still change their default back to
# public.
#new_user_dataset_access_role_default_private = False

# Expose user list. Setting this to True will expose the user list to
# authenticated users. This makes sharing datasets in smaller galaxy instances
# much easier as they can type a name/email and have the correct user show up.
# This makes less sense on large public Galaxy instances where that data
# shouldn't be exposed.
# For semi-public Galaxies, it may make sense to expose
# just the username and not email, or vice versa.
#expose_user_name = False
#expose_user_email = False

# -- Beta features
# Enable new run workflow form
#run_workflow_toolform_upgrade = False

# Enable Galaxy to communicate directly with a sequencer
#enable_sequencer_communication = False

# Separate tool command from rest of job script so tool dependencies
# don't interfere with metadata generation.
# Despite the name, this feature is enabled by default
# in 16.01 and the option will go away in the future.
#enable_beta_tool_command_isolation = True

# Enable the new interface for installing tools from Tool Shed
# via the API. Admin menu will list both if enabled.
#enable_beta_ts_api_install = False


# Set the following to a number of threads greater than 1 to spawn
# a Python task queue for dealing with large tool submissions (either
# through the tool form or as part of an individual workflow step across
# large collection). The size of a "large" tool request is controlled by
# the second parameter below and defaults to 10. This affects workflow
# scheduling and web processes, not job handlers.
#tool_submission_burst_threads = 1
#tool_submission_burst_at = 10

# Enable beta workflow modules that should not yet be considered part of Galaxy's
# stable API.
#enable_beta_workflow_modules = False

# Force usage of Galaxy's beta workflow scheduler under certain circumstances -
# this workflow scheduling forces Galaxy to schedule workflows in the background
# so initial submission of the workflows is significantly sped up. This does
# however force the user to refresh their history manually to see newly scheduled
# steps (for "normal" workflows - steps are still scheduled far in advance of
# them being queued and scheduling here doesn't refer to actual cluster job
# scheduling).
# Workflows containing more than the specified number of steps will always use
# the Galaxy's beta workflow scheduling.
#force_beta_workflow_scheduled_min_steps=250
# Switch to using Galaxy's beta workflow scheduling for all workflows involving
# collections.
#force_beta_workflow_scheduled_for_collections=False

# Enable authentication via OpenID. Allows users to log in to their Galaxy
# account by authenticating with an OpenID provider.
#enable_openid = False
# .sample used if default does not exist
#openid_config_file = config/openid_conf.xml
#openid_consumer_cache_path = database/openid_consumer_cache

# XML config file that allows the use of different authentication providers
# (e.g. LDAP) instead or in addition to local authentication (.sample is used
# if default does not exist).
#auth_config_file = config/auth_conf.xml
auth_config_file = config/auth_conf.xml

# Optional list of email addresses of API users who can make calls on behalf of
# other users.
#api_allow_run_as = None

# Master key that allows many API admin actions to be used without actually
# having a defined admin user in the database/config. Only set this if you
# need to bootstrap Galaxy, you probably do not want to set this on public
# servers.
#master_api_key = changethis

# Enable tool tags (associating tools with tags). This has its own option
# since its implementation has a few performance implications on startup for
# large servers.
#enable_tool_tags = False

# Enable a feature when running workflows. When enabled, default datasets
# are selected for "Set at Runtime" inputs from the history such that the
# same input will not be selected twice, unless there are more inputs than
# compatible datasets in the history.
# When False, the most recently added compatible item in the history will
# be used for each "Set at Runtime" input, independent of others in the Workflow
#enable_unique_workflow_defaults = False

# The URL to the myExperiment instance being used (omit scheme but include port)
#myexperiment_url = www.myexperiment.org:80

# Enable Galaxy's "Upload via FTP" interface. You'll need to install and
# configure an FTP server (we've used ProFTPd since it can use Galaxy's
# database for authentication) and set the following two options.

# This should point to a directory containing subdirectories matching users'
# identifier (defaults to e-mail), where Galaxy will look for files.
ftp_upload_dir = /home/eclark28/galaxy_production/galaxy/galaxy_users_upload_directory

# This should be the hostname of your FTP server, which will be provided to
# users in the help text.
ftp_upload_site = koko.fau.edu

# User attribute to use as subdirectory in calculating default ftp_upload_dir
# pattern. By default this will be email so a user's FTP upload directory will be
# ${ftp_upload_dir}/${user.email}. Can set this to other attributes such as id or
# username though.
#ftp_upload_dir_identifier = email

# Python string template used to determine an FTP upload directory for a
# particular user.
#ftp_upload_dir_template = ${ftp_upload_dir}/${ftp_upload_dir_identifier}

# This should be set to False to prevent Galaxy from deleting uploaded FTP files
# as it imports them.
#ftp_upload_purge = True

# Enable enforcement of quotas. Quotas can be set from the Admin interface.
#enable_quotas = False

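# Worked example (illustrative; the user address below is hypothetical and
# not taken from this deployment): with the ftp_upload_dir set above and the
# default identifier (email) and template left commented out, a user
# registered as user@example.edu would have Galaxy look for FTP uploads in
#   /home/eclark28/galaxy_production/galaxy/galaxy_users_upload_directory/user@example.edu
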
# This option allows users to see the full path of datasets via the "View
# Details" option in the history. Administrators can always see this.
#expose_dataset_path = False

# Data manager configuration options
# Allow non-admin users to view available Data Manager options.
#enable_data_manager_user_view = False
# File where Data Managers are configured (.sample used if default does not
# exist).
#data_manager_config_file = config/data_manager_conf.xml
# File where Tool Shed based Data Managers are configured.
#shed_data_manager_config_file = config/shed_data_manager_conf.xml
# Directory to store Data Manager based tool-data; defaults to tool_data_path.
#galaxy_data_manager_data_path = tool-data

# -- Job Execution

# To increase performance of job execution and the web interface, you can
# separate Galaxy into multiple processes. There is more than one way to do
# this, and they are explained in detail in the documentation:
#
# https://wiki.galaxyproject.org/Admin/Config/Performance/Scaling

# By default, Galaxy manages and executes jobs from within a single process and
# notifies itself of new jobs via in-memory queues. Jobs are run locally on
# the system on which Galaxy is started. Advanced job running capabilities can
# be configured through the job configuration file.
#job_config_file = config/job_conf.xml

# In multiprocess configurations, notification between processes about new jobs
# must be done via the database. In single process configurations, this can be
# done in memory, which is a bit quicker.
#track_jobs_in_database = True

# This enables splitting of jobs into tasks, if specified by the particular tool
# config.
# This is a new feature and not recommended for production servers yet.
#use_tasked_jobs = False
#local_task_queue_workers = 2

# Enable job recovery (if Galaxy is restarted while cluster jobs are running,
# it can "recover" them when it starts). This is not safe to use if you are
# running more than one Galaxy server using the same database.
#enable_job_recovery = True

# Although it is fairly reliable, setting metadata can occasionally fail. In
# these instances, you can choose to retry setting it internally or leave it in
# a failed state (since retrying internally may cause the Galaxy process to be
# unresponsive). If this option is set to False, the user will be given the
# option to retry externally, or set metadata manually (when possible).
#retry_metadata_internally = True

# Very large metadata values can cause Galaxy crashes. This will allow
# limiting the maximum metadata key size (in bytes used in memory, not the end
# result database value size) Galaxy will attempt to save with a dataset. Use
# 0 to disable this feature. The default is 5MB, but as low as 1MB seems to be
# a reasonable size.
#max_metadata_value_size = 5242880

# If (for example) you run on a cluster and your datasets (by default,
# database/files/) are mounted read-only, this option will override tool output
# paths to write outputs to the working directory instead, and the job manager
# will move the outputs to their proper place in the dataset directory on the
# Galaxy server after the job completes.
#outputs_to_working_directory = False

# If your network filesystem's caching prevents the Galaxy server from seeing
# the job's stdout and stderr files when it completes, you can retry reading
# these files. The job runner will retry the number of times specified below,
# waiting 1 second between tries. For NFS, you may want to try the -noac mount
# option (Linux) or -actimeo=0 (Solaris).
#retry_job_output_collection = 0

# Clean up various bits of jobs left on the filesystem after completion. These
# bits include the job working directory, external metadata temporary files,
# and DRM stdout and stderr files (if using a DRM). Possible values are:
# always, onsuccess, never
#cleanup_job = always

# For sites where all users in Galaxy match users on the system on which Galaxy
# runs, the DRMAA job runner can be configured to submit jobs to the DRM as the
# actual user instead of as the user running the Galaxy server process. For
# details on these options, see the documentation at:
#
# https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster
#
#drmaa_external_runjob_script = scripts/drmaa_external_runner.py
#drmaa_external_killjob_script = scripts/drmaa_external_killer.py
#external_chown_script = scripts/external_chown_script.py

# File to source to set up the environment when running jobs. By default, the
# environment in which the Galaxy server starts is used when running jobs
# locally, and the environment set up per the DRM's submission method and
# policy is used when running jobs on a cluster (try testing with `qsub` on the
# command line). environment_setup_file can be set to the path of a file on
# the cluster that should be sourced by the user to set up the environment
# prior to running tools. This can be especially useful for running jobs as
# the actual user, to remove the need to configure each user's environment
# individually.
#environment_setup_file = None

# Optional file containing job resource data entry fields definition.
# These fields will be presented to users in the tool forms and allow them to
# overwrite default job resources such as number of processors, memory and
# walltime.
#job_resource_params_file = config/job_resource_params_conf.xml

# If using job concurrency limits (configured in job_config_file), several
# extra database queries must be performed to determine the number of jobs a
# user has dispatched to a given destination. By default, these queries will
# happen for every job that is waiting to run, but if cache_user_job_count is
# set to True, it will only happen once per iteration of the handler queue.
# Although better for performance due to reduced queries, the tradeoff is a
# greater possibility that jobs will be dispatched past the configured limits
# if running many handlers.
#cache_user_job_count = False

# ToolBox filtering

# Modules from lib/galaxy/tools/toolbox/filters/ can be specified in
# the following lines. tool_* filters will be applied for all users
# and can not be changed by them. user_tool_* filters will be shown
# under user preferences and can be toggled on and off by
# runtime. Example shown below are not real defaults (no custom
# filters applied by default) but can be enabled by renaming the
# example.py.sample in the filters directory to example.py.

#tool_filters =
#tool_label_filters =
#tool_section_filters =
#user_tool_filters = examples:restrict_upload_to_admins, examples:restrict_encode
#user_tool_section_filters = examples:restrict_text
#user_tool_label_filters = examples:restrict_upload_to_admins, examples:restrict_encode

# The base modules that are searched for modules as described above
# can be modified and modules external to Galaxy can be searched by
# modifying the following option.
#toolbox_filter_base_modules = galaxy.tools.toolbox.filters,galaxy.tools.filters

# Galaxy Application Internal Message Queue

# Galaxy uses AMQP internally TODO more documentation on what for.
# For examples, see http://ask.github.io/kombu/userguide/connections.html
#
# Without specifying anything here, galaxy will first attempt to use your
# specified database_connection above. If that's not specified either, Galaxy
# will automatically create and use a separate sqlite database located in your
# /database folder (indicated in the commented out line below).

#amqp_internal_connection = sqlalchemy+sqlite:///./database/control.sqlite?isolation_level=IMMEDIATE

# Galaxy real time communication server settings
#enable_communication_server = False
#communication_server_host = http://localhost
#communication_server_port = 7070
# persistent_communication_rooms is a comma-separated list of rooms that should be always available.
#persistent_communication_rooms =


# ---- Galaxy External Message Queue ------

# Galaxy uses Advanced Message Queuing Protocol (AMQP) to receive messages from
# external sources like barcode scanners. Galaxy has been tested against
# RabbitMQ AMQP implementation. For Galaxy to receive messages from a message
# queue, the RabbitMQ server has to be set up with a user account and other
# parameters listed below. The 'host' and 'port' fields should point to where
# the RabbitMQ server is running.

[galaxy_amqp]

#host = 127.0.0.1
#port = 5672
#userid = galaxy
#password = galaxy
#virtual_host = galaxy_messaging_engine
#queue = galaxy_queue
#exchange = galaxy_exchange
#routing_key = bar_code_scanner
#rabbitmqctl_path = /path/to/rabbitmqctl
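
The two FTP options that are actually set in this configuration (ftp_upload_dir and ftp_upload_site) combine with the commented-out default ftp_upload_dir_template to give each user a personal upload directory. The following Python sketch illustrates that substitution under stated assumptions: the [app:main] section name, the helper names load_ftp_settings and user_upload_dir, and the address user@example.org are hypothetical and used only for illustration, and the sketch is not taken from Galaxy's own source code.

```python
from configparser import ConfigParser
from string import Template

# Excerpt holding only the two FTP options that are uncommented in the
# configuration file. The [app:main] section header is an assumption.
INI_TEXT = """\
[app:main]
ftp_upload_dir = /home/eclark28/galaxy_production/galaxy/galaxy_users_upload_directory
ftp_upload_site = koko.fau.edu
"""

# Default template as it appears (commented out) in the configuration file.
DEFAULT_TEMPLATE = "${ftp_upload_dir}/${ftp_upload_dir_identifier}"


def load_ftp_settings(ini_text):
    """Parse the excerpt and return the two FTP settings as a dict."""
    # Interpolation is disabled so that '$' and '%' are read literally.
    parser = ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["app:main"]
    return {
        "ftp_upload_dir": section["ftp_upload_dir"],
        "ftp_upload_site": section["ftp_upload_site"],
    }


def user_upload_dir(ftp_upload_dir, identifier_value, template=DEFAULT_TEMPLATE):
    """Fill the template with the base directory and a user's identifier.

    identifier_value is the value of the attribute named by
    ftp_upload_dir_identifier (the e-mail address by default).
    """
    return Template(template).substitute(
        ftp_upload_dir=ftp_upload_dir,
        ftp_upload_dir_identifier=identifier_value,
    )


if __name__ == "__main__":
    settings = load_ftp_settings(INI_TEXT)
    # user@example.org is a placeholder address, not a real account.
    print(user_upload_dir(settings["ftp_upload_dir"], "user@example.org"))
```

With the default identifier, each user's files would therefore be expected in a subdirectory of ftp_upload_dir named after that user's e-mail address.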


BIBLIOGRAPHY

1. Paz S, Krainer AR, Caputi M. 2014. HIV-1 transcription is regulated by

splicing factor SRSF1. Nucleic Acids Res 42:13812-13823.

2. Ji X, Zhou Y, Pandit S, Huang J, Li H, Lin CY, Xiao R, Burge CB, Fu XD. 2013.

SR proteins collaborate with 7SK and promoter-associated nascent RNA to

release paused polymerase. Cell 153:855-868.

3. Anczukow O, Rosenberg AZ, Akerman M, Das S, Zhan L, Karni R,

Muthuswamy SK, Krainer AR. 2012. The splicing factor SRSF1 regulates

apoptosis and proliferation to promote mammary epithelial cell

transformation. Nat Struct Mol Biol 19:220-228.

4. Zou L, Zhang H, Du C, Liu X, Zhu S, Zhang W, Li Z, Gao C, Zhao X, Mei M,

Bao S, Zheng H. 2012. Correlation of SRSF1 and PRMT1 expression with

clinical status of pediatric acute lymphoblastic leukemia. J Hematol Oncol

5:42.

5. Jablonski JA, Caputi M. 2009. Role of cellular RNA processing factors in

human immunodeficiency virus type 1 mRNA metabolism, replication, and

infectivity. J Virol 83:981-992.

6. Paz S, Lu ML, Takata H, Trautmann L, Caputi M. 2015. SRSF1 RNA

Recognition Motifs Are Strong Inhibitors of HIV-1 Replication. J Virol

89:6275-6286.

7. Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Čech M,

Chilton J, Clements D, Coraor N, Eberhard C, Grüning B, Guerler A,

Hillman-Jackson J, Von Kuster G, Rasche E, Soranzo N, Turaga N, Taylor

J, Nekrutenko A, Goecks J. 2016. The Galaxy platform for accessible,

reproducible and collaborative biomedical analyses: 2016 update. Nucleic

Acids Research doi:10.1093/nar/gkw343.

8. Blankenberg D, Von Kuster G, Bouvier E, Baker D, Afgan E, Stoler N,

Taylor J, Nekrutenko A. 2014. Dissemination of scientific software with

Galaxy ToolShed. Genome Biology 15:1-3.

9. Andrews S. 2010. FastQC: a quality control tool for high throughput

sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc.

Accessed

10. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G,

Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The

Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078-2079.

11. Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for

comparing genomic features. Bioinformatics 26:841-842.

12. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ,

Salzberg SL, Wold BJ, Pachter L. 2010. Transcript assembly and

quantification by RNA-Seq reveals unannotated transcripts and isoform

switching during cell differentiation. Nat Biotechnol 28:511-515.

13. Katz Y, Wang ET, Airoldi EM, Burge CB. 2010. Analysis and design of RNA

sequencing experiments for identifying isoform regulation. Nat Methods

7:1009-1015.

14. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre

C, Singh H, Glass CK. 2010. Simple combinations of lineage-determining

transcription factors prime cis-regulatory elements required for macrophage

and B cell identities. Mol Cell 38:576-589.

15. Reimand J, Kull M, Peterson H, Hansen J, Vilo J. 2007. g:Profiler--a web-

based toolset for functional profiling of gene lists from large-scale

experiments. Nucleic Acids Res 35:W193-200.

16. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. 1998. GeneCards: a novel

functional genomics compendium with automated data mining and query

reformulation support. Bioinformatics 14:656-664.

17. Marzluff WF, Gongidi P, Woods KR, Jin J, Maltais LJ. 2002. The human and

mouse replication-dependent histone genes. Genomics 80:487-498.
