THE BIOINFORMATICS AND MATHEMATICAL BIOSCIENCE LAB (BMBL) BIWEEKLY SCIENCE REPORT

Reporting Time: Due at 5pm on Mar. 25th

Institution: South Dakota State University

Report prepared by: Jinyu Yang    Advisor: Qin Ma

Project Name: How to do differential expression analysis from FASTQ format data on HPC (required packages or software: R, ShortRead, BBMap, Bowtie2, TopHat2, Samtools, HTSeq and DESeq)

SIGNIFICANT SCIENCE ACCOMPLISHMENTS: (Examples: major achievement in meeting a milestone, new collaborations, publication in high impact journal.)

1. Summary of Science Activities (<1 page):

raw data (.fq / .fasta file)
  --- [ ShortRead (QA, filtering and trimming), then BBMap (QC, filtering and trimming) ] ---
preprocessed data (.fq / .fasta file)
  --- [ Bowtie2 + TopHat2 ] --- accepted_hits.bam
  --- [ samtools ] --- XXX.sam
  --- [ HTSeq ] --- XXX.count
  --- [ DESeq ] --- result files (e.g., p-value table)

2. Future Work Plans (Brief summary of the tasks/milestones working on next month):

3. Issues to Resolve (Issues that need input from another partner, FA Lead, Science Coordinator or BESC Director for resolution, etc.):

4. Publications:

5. Presentations:

6. News / Awards:

7. Personnel changes (New, reassigned, or departed):

8. Intellectual Property:

9. Quality Assurance:


10. Environment, Safety and Health:

Please complete and return to Qin Ma ([email protected]). If no activity, please indicate “N/A”.


Supplementary Tables and Figures (to support above accomplishments)

Now we use several real datasets in .fq format to do differential expression analysis. In this example, the datasets are derived from two different conditions (gu and ye), and each condition has two biological replicates. (Please install the required software following the instructions in “How to use DESeq do differential expression analysis” before starting.)

1. Prepare your datasets

Upload your raw datasets (.fq or .fastq files) and related files (e.g., the reference genome and gene model annotation) to the HPC or another server, as follows:

where GL.fa is the reference genome and GL.gb.gff is the gene model annotation. Then extract the compressed files into the current directory with a command like this:

> gunzip *.fq.gz

Then you need to use ShortRead to do QA (quality assessment), filtering and trimming, so make sure that you have installed R. If not, don’t worry, just follow the steps below; otherwise, you can skip them.

First, download the latest R source package ( https://cran.r-project.org/ ), e.g., R-3.2.3.tar.gz, upload this file to the HPC, then execute the following commands:


> tar -zxvf R-3.2.3.tar.gz
> cd R-3.2.3
> ./configure
> make
> make install
> vim ~/.bash_profile

Now add the following two lines to ~/.bash_profile:

PATH=/home/yangj/R-3.2.3/bin:$PATH
export PATH

then save and quit (/home/yangj/R-3.2.3 is the absolute path of my R installation; remember to substitute yours). Finally, execute:

> source ~/.bash_profile
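Why the order of PATH entries matters can be seen with a small self-contained sketch. It uses a throwaway directory and a fake R launcher (stand-ins for /home/yangj/R-3.2.3/bin and the real R binary); nothing here touches an actual installation:

```shell
# Demonstrate PATH prepending with a throwaway directory and a fake "R".
set -e
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho fake-R-3.2.3\n' > "$demo_dir/R"
chmod +x "$demo_dir/R"

# Same pattern as in ~/.bash_profile: prepend, then export.
PATH="$demo_dir:$PATH"
export PATH

command -v R   # resolves to $demo_dir/R, because prepended entries win
R              # prints: fake-R-3.2.3
```

Because the new directory is prepended rather than appended, the freshly built R-3.2.3 shadows any older system-wide R.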

2. ShortRead (QA (quality assessment), filtering and trimming)

a) QA

First, we need to start the R environment:

> R

Then install ShortRead and DESeq, if you have not installed them before:

> source("https://bioconductor.org/biocLite.R")

> biocLite(c("ShortRead", "DESeq"))

then use setwd() to set the working directory to where the FQ files are located:

> setwd("/home/yangj/DESeq-SDU/Sample")


Load the ShortRead library:

> library("ShortRead")

Now we can run the quality assessment with ShortRead:

> fls = dir("./", "*fq$", full=TRUE)
> qaSummary = qa(fls, type="fastq")
> report(qaSummary, type="html", dest="fqQAreport")

However, the HPC does not support X11, so you will get an error like the following:

So we need to download qaSummary to our local machine and process it with RStudio:

> save(qaSummary, file="qaSummary")

Download qaSummary to your local machine, then start RStudio:

> load("C:/BaiduYunDownload/qaSummary")
> report(qaSummary, type="html", dest="fqQAreport")

You will get a directory named “fqQAreport” containing the QA report.

OK, back on the HPC:

> qaSummary[["readCounts"]]
                 read filter aligned
gu2_read1.fq 35006259     NA      NA
gu2_read2.fq 35006259     NA      NA
gu3_read1.fq 30748467     NA      NA
gu3_read2.fq 30748467     NA      NA
ye1_read1.fq 31187333     NA      NA
ye1_read2.fq 31187333     NA      NA
ye3_read1.fq 33479655     NA      NA
ye3_read2.fq 33479655     NA      NA

As you can see, the command above reports the number of reads, the number of reads surviving the Solexa filtering criteria, and the number of reads aligned to the reference genome for each lane (the last two are NA because filtering and alignment have not been done yet). Meanwhile, you can inspect the base-call details of each FQ file as follows:

> qaSummary[["baseCalls"]]
                    A        C        G        T    N
gu2_read1.fq 21685857 28412307 28130219 21767568 4049
gu2_read2.fq 21729895 28722063 27816884 21730591  567
gu3_read1.fq 21723444 28346939 28174527 21751407 3683
gu3_read2.fq 21734048 28697947 27824570 21742736  699
ye1_read1.fq 21675483 28443112 28095517 21781660 4228
ye1_read2.fq 21702486 28762839 27785294 21748745  636
ye3_read1.fq 21795076 28360347 27968237 21872354 3986
ye3_read2.fq 21807964 28695030 27618484 21877864  658
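These tallies can also be cross-checked from the shell. The sketch below runs on a tiny synthetic FASTQ (sample.fq is made up; substitute one of the real files such as gu2_read1.fq): a FASTQ record is four lines, so the read count is the line count divided by four, and one awk pass over the sequence lines reproduces a baseCalls-style composition table.

```shell
# Build a tiny two-read FASTQ to demonstrate the checks.
cat > sample.fq <<'EOF'
@read1
ACGTN
+
IIIII
@read2
GGCC
+
IIII
EOF

# Read count: a FASTQ record is 4 lines.
echo $(( $(wc -l < sample.fq) / 4 ))        # -> 2

# Base composition over the sequence lines (every 2nd line of each record),
# analogous to qaSummary[["baseCalls"]].
awk 'NR % 4 == 2 { for (i = 1; i <= length($0); i++) n[substr($0, i, 1)]++ }
     END { for (b in n) print b, n[b] }' sample.fq | sort
```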

b) Filtering and trimming

Construct a function for filtering and trimming. Here we use nFilter() to guarantee that the reads do not contain ‘N’. The sliding window (trimTailw(object, k, a, halfwidth)) starts at the left-most nucleotide, tabulating the number of cycles in a window of 2 * halfwidth + 1 surrounding the current nucleotide whose quality scores fall at or below a. The read is trimmed at the first nucleotide for which this number >= k. Finally, we drop reads that are shorter than 35 nt.

myFilterAndTrim <- function(fl, destination=sprintf("%s_SR.fq.gz", substr(fl, 1, nchar(fl)-6))) {
    stream <- open(FastqStreamer(fl))
    on.exit(close(stream))
    repeat {
        fq <- yield(stream)
        if (length(fq) == 0)
            break
        fq <- fq[nFilter()(fq)]         # drop reads containing 'N'
        fq <- trimTailw(fq, 2, "4", 2)  # sliding-window quality trimming
        fq <- fq[width(fq) >= 35]       # drop reads shorter than 35 nt
        writeFastq(fq, destination, "a")
    }
}

Execute the function in a “for” loop:

dataSets = c("gu2_read1.fq.gz", "gu2_read2.fq.gz", "gu3_read1.fq.gz", "gu3_read2.fq.gz",
             "ye1_read1.fq.gz", "ye1_read2.fq.gz", "ye3_read1.fq.gz", "ye3_read2.fq.gz")

for (i in 1:length(dataSets)) {
    myFilterAndTrim(dataSets[i])
}
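As a rough cross-check of what myFilterAndTrim() keeps, the nFilter() and width(fq) >= 35 steps can be sketched in plain awk (the quality-trimming step is omitted here). The demo FASTQ is made up; only the 40-nt, N-free read should survive:

```shell
# Three test reads: clean 40 nt, one containing 'N', one too short.
cat > demo.fq <<'EOF'
@ok
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@has_n
ACGTNACGTACGTACGTACGTACGTACGTACGTACGTACG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@too_short
ACGT
+
IIII
EOF

# Keep records whose sequence has no 'N' and is at least 35 nt long.
awk 'NR % 4 == 1 { h = $0 }
     NR % 4 == 2 { s = $0 }
     NR % 4 == 3 { p = $0 }
     NR % 4 == 0 { if (s !~ /N/ && length(s) >= 35)
                       print h "\n" s "\n" p "\n" $0 }' demo.fq > demo_filtered.fq

grep -c '^@' demo_filtered.fq   # -> 1 (only @ok survives)
```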

3. fastqc (QC (quality control))

Also, we use fastqc to get a quality control report. First, create a new folder “fastqcReport-1” and a file fastqc.pbs, then edit fastqc.pbs as follows:

#!/bin/bash
# File: fastqc.pbs
# fastqc script for blackjack
# ~April 2015 [email protected]

# Job name
#PBS -N fastqc

# Request 10 hours of wall clock time
#PBS -l walltime=10:00:00

# Request a single node with 1 core
#PBS -l nodes=1:ppn=1

# The environment variable $PBS_O_WORKDIR specifies the directory
# from which you submitted the job
cd $PBS_O_WORKDIR

# Modify input and output files below to match your run!
# You will also need the correct ancillary files (parameters, etc.)
# in the working directory.

# Load the module
. /usr/share/modules/init/sh
module load bio/FastQC/0.11.3

fastqc -o ./fastqcReport-1 *fq

#fastqc --help

then submit it with the command:

qsub fastqc.pbs

ShortRead drops the reads containing ‘N’, but it looks like low-quality bases still exist, so we decided to filter and trim the ShortRead results with BBMap.

4. BBMap (Filtering and Trimming)


(Note: in this section, all the operations are executed on zcluster, a server at UGA.)

First, create bbmap.sh and fill in the commands as follows:

#!/bin/bash
cd ~/Jinyu/DESeq-SDU/shortReadRes

time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu2_read1_SR.fq out=gu2_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu2_read2_SR.fq out=gu2_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4
time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu3_read1_SR.fq out=gu3_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu3_read2_SR.fq out=gu3_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4
time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye1_read1_SR.fq out=ye1_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye1_read2_SR.fq out=ye1_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4
time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye3_read1_SR.fq out=ye3_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye3_read2_SR.fq out=ye3_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4

Submit it with the command:

qsub -q rcc-30d bbmap.sh
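The qin=64/qout=64 flags declare the input qualities to be Phred+64 (Illumina 1.3-1.7 style) rather than the Phred+33 used by Sanger and current Illumina machines. If you are unsure which encoding a file uses, one common heuristic is to find the smallest ASCII code on the quality lines: codes below 59 (';') can only occur in Phred+33. The sketch below uses a made-up FASTQ whose qualities look Phred+64-like:

```shell
# Made-up FASTQ with Phred+64-style quality characters ('b', 'h').
cat > enc_demo.fq <<'EOF'
@r1
ACGT
+
bbbb
@r2
ACGT
+
hhhh
EOF

# Report the smallest ASCII code seen on the quality lines.
min=$(awk 'BEGIN { for (j = 33; j <= 126; j++) ascii = ascii sprintf("%c", j) }
     NR % 4 == 0 { for (i = 1; i <= length($0); i++) {
                       c = index(ascii, substr($0, i, 1)) + 32   # ASCII code
                       if (min == "" || c < min) min = c } }
     END { print min }' enc_demo.fq)
echo "$min"   # -> 98 ('b'): well above 59, consistent with Phred+64
```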

Then use fastqc to generate the quality control report again, to check whether the filtered and trimmed reads are reasonable. Create fastqc.sh and fill in the commands as follows:

#!/bin/bash
cd ~/Jinyu/DESeq-SDU/shortReadRes
export PATH=${PATH}:/usr/local/fastqc/latest/
time /usr/local/fastqc/latest/fastqc -o ./fastqcReport-2 *BBM.fq

Execute it by the command:

qsub -q rcc-30d fastqc.sh

You should now see that the quality control results look good. You can also rerun the ShortRead quality assessment.

5. Align the reads to the reference genome with Bowtie2 and TopHat2

First, put all the *BBM.fq files, the reference genome GL.fa, and the gene model annotation GL.gb.gff into the same folder. Second, create the folders tophat_out_gu2, tophat_out_gu3, tophat_out_ye1 and tophat_out_ye3. Then create a tophat.pbs file and fill in the commands as follows:


#!/bin/bash
# File: tophat.pbs
# tophat script for bigjack
# ~Jan. 2015 [email protected]

# Job name
#PBS -N tophat

# Request 3 days of wall clock time
#PBS -l walltime=3:00:00:00

# Request a single node with 12 cores
#PBS -l nodes=1:ppn=12

# The environment variable $PBS_O_WORKDIR specifies the directory
# from which you submitted the job
cd $PBS_O_WORKDIR

# Modify input and output files below to match your run!
# You will also need the correct ancillary files (parameters, etc.)
# in the working directory.

. /usr/share/modules/init/sh
module load bio/bowtie2/2.2.4
module load bio/tophat/2.0.13

# Alternatively, you can load the Tuxedo module:
# module load bio/Tuxedo/2015.0

# By default, TopHat runs a single thread. For efficiency and speed you
# probably want to use all the available cores on the node, so set the
# '-p' parameter to the number of cores on the node (never larger!).

date

# Note: for paired-end data the two mate files are given as two
# space-separated arguments (a comma would make TopHat treat both
# files as single-end left reads).
bowtie2-build -f GL.fa GL
tophat -G GL.gb.gff -p 12 -o ./tophat_out_gu2 GL gu2_read1_BBM.fq gu2_read2_BBM.fq
tophat -G GL.gb.gff -p 12 -o ./tophat_out_gu3 GL gu3_read1_BBM.fq gu3_read2_BBM.fq
tophat -G GL.gb.gff -p 12 -o ./tophat_out_ye1 GL ye1_read1_BBM.fq ye1_read2_BBM.fq
tophat -G GL.gb.gff -p 12 -o ./tophat_out_ye3 GL ye3_read1_BBM.fq ye3_read2_BBM.fq

date

Execute the command:

qsub tophat.pbs


6. Organize the BAM files into a single directory, sort and index them, and create SAM files

First, create samtools.pbs and place it as shown in the picture below:

Then fill in the commands as follows:


#!/bin/bash
# File: samtools.pbs
# samtools script for bigjack
# ~Jan. 2015 [email protected]

# Job name
#PBS -N samtools

# Request 20 hours of wall clock time
#PBS -l walltime=20:00:00

# Request a single node with 12 cores
#PBS -l nodes=1:ppn=12

# The environment variable $PBS_O_WORKDIR specifies the directory
# from which you submitted the job
cd $PBS_O_WORKDIR

# Modify input and output files below to match your run!
# You will also need the correct ancillary files (parameters, etc.)
# in the working directory.

. /usr/share/modules/init/sh
module load bio/tophat/2.0.13
module load bio/cufflinks/2.2.1

date

# Sort each BAM by read name, then convert it to SAM
samtools sort -n tophat_out_gu2/accepted_hits.bam tophat_out_gu2/gu2_sn
samtools view -o tophat_out_gu2/gu2_sn.sam tophat_out_gu2/gu2_sn.bam

samtools sort -n tophat_out_gu3/accepted_hits.bam tophat_out_gu3/gu3_sn
samtools view -o tophat_out_gu3/gu3_sn.sam tophat_out_gu3/gu3_sn.bam

samtools sort -n tophat_out_ye1/accepted_hits.bam tophat_out_ye1/ye1_sn
samtools view -o tophat_out_ye1/ye1_sn.sam tophat_out_ye1/ye1_sn.bam

samtools sort -n tophat_out_ye3/accepted_hits.bam tophat_out_ye3/ye3_sn
samtools view -o tophat_out_ye3/ye3_sn.sam tophat_out_ye3/ye3_sn.bam

date

Execute it:


qsub samtools.pbs
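A note on why the script sorts with -n (by read name): htseq-count reads the SAM file sequentially and, for paired-end data, expects the two mates of each pair on adjacent lines. The toy example below mimics the effect with plain sort on made-up read names:

```shell
# After name-sorting, the /1 and /2 mates of each pair become adjacent.
printf 'read2/1\nread1/2\nread1/1\nread2/2\n' | LC_ALL=C sort
```

This prints read1/1, read1/2, read2/1, read2/2 in order, i.e. each pair grouped together.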

7. Count reads using htseq-count (Note: these operations are executed on zcluster)

Upload the files with the suffix .sam to zcluster as follows:

Then create “htseq.sh” in the same folder, and fill in the commands like this:

#!/bin/bash

export PYTHONPATH=${PYTHONPATH}:/usr/local/htseq/0.5.3p3/lib/python

time /usr/local/htseq/0.5.3p3/bin/htseq-count -s no -i Parent gu2_sn.sam GL.gb.gff > ./gu2.count
time /usr/local/htseq/0.5.3p3/bin/htseq-count -s no -i Parent gu3_sn.sam GL.gb.gff > ./gu3.count
time /usr/local/htseq/0.5.3p3/bin/htseq-count -s no -i Parent ye1_sn.sam GL.gb.gff > ./ye1.count
time /usr/local/htseq/0.5.3p3/bin/htseq-count -s no -i Parent ye3_sn.sam GL.gb.gff > ./ye3.count

Then execute the following command:

qsub -q rcc-30d htseq.sh

Finally, you will get the results:
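Each *.count file is a plain two-column, tab-separated table of feature ID and read count. htseq-count also appends a few special summary rows at the end (written with a leading "__" in recent HTSeq versions; older releases such as the 0.5.x used here print them without the prefix). The gene names below are made up; a quick awk pass totals the reads assigned to real features:

```shell
# Fabricated example of an htseq-count output file.
printf 'gene0001\t120\ngene0002\t0\ngene0003\t57\n' >  gu2_demo.count
printf '__no_feature\t13\n__ambiguous\t2\n__not_aligned\t5\n' >> gu2_demo.count

# Total reads assigned to real features (skip the __ summary rows).
awk -F'\t' '$1 !~ /^__/ { sum += $2 } END { print sum }' gu2_demo.count   # -> 177
```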

8. DESeq

For convenience, I downloaded gu2.count, gu3.count, ye1.count and ye3.count to my local computer. You can also work on the HPC if you want.

Start up R or RStudio, then execute the following commands:


# import the .count data
gu2 <- read.delim("C:/Data/BMBL/SDU/GuYe/HTSeqRes/gu2.count", header = FALSE)
gu3 <- read.delim("C:/Data/BMBL/SDU/GuYe/HTSeqRes/gu3.count", header = FALSE)
ye1 <- read.delim("C:/Data/BMBL/SDU/GuYe/HTSeqRes/ye1.count", header = FALSE)
ye3 <- read.delim("C:/Data/BMBL/SDU/GuYe/HTSeqRes/ye3.count", header = FALSE)

# create the count table: the row names are the gene names, and the
# other columns are read counts
countTable = data.frame(row.names = gu2$V1,
                        gu2 = gu2$V2, gu3 = gu3$V2,
                        ye1 = ye1$V2, ye3 = ye3$V2)
condition = factor(c("gu", "gu", "ye", "ye"))
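This count-table construction silently assumes that all four .count files list the genes in exactly the same order, which holds when htseq-count is run against the same GFF. Under that same assumption the merge can also be sketched in the shell with paste (toy files and made-up gene names):

```shell
# Two toy count files in the same gene order.
printf 'geneA\t10\ngeneB\t20\n' > gu2_demo.count
printf 'geneA\t5\ngeneB\t8\n'   > ye1_demo.count

# paste glues them side by side; awk keeps one gene column plus the counts.
paste gu2_demo.count ye1_demo.count |
    awk -F'\t' 'BEGIN { print "gene\tgu2\tye1" }
                { print $1 "\t" $2 "\t" $4 }'
```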

### install DESeq (if you haven't installed DESeq, execute these commands)
#source("https://bioconductor.org/biocLite.R")
#biocLite("DESeq")

# import the DESeq library
library("DESeq")

# create the count data set
cds = newCountDataSet(countTable, condition)

# estimate the size factors from the count data
cdsEstSize = estimateSizeFactors(cds)
sizeFactors(cdsEstSize)

# estimate the dispersions
# cdsEstDisp = estimateDispersions(cdsEstSize, method="blind", sharingMode="fit-only")
cdsEstDisp = estimateDispersions(cdsEstSize)

# inspect the estimated dispersions using the plotDispEsts function
plotDispEsts(cdsEstDisp)

# perform the test for differential expression
res = nbinomTest(cdsEstDisp, "gu", "ye")
write.csv(res, file="res_DESeq.csv")

# display differential expression (log fold changes) versus expression
# strength (log average read count)
plotMA(res)

# plot the histogram of p-values
hist(res$pval, breaks=100)
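For orientation when reading res_DESeq.csv: the nbinomTest() result reports baseMeanA and baseMeanB (mean normalized counts in the "gu" and "ye" groups), foldChange = baseMeanB / baseMeanA, and log2FoldChange = log2(foldChange). The toy numbers below only illustrate that arithmetic; they are not real results:

```shell
# Toy check of the foldChange / log2FoldChange arithmetic (made-up numbers).
awk 'BEGIN {
    baseMeanA = 50.0    # mean normalized count under "gu"
    baseMeanB = 200.0   # mean normalized count under "ye"
    fc  = baseMeanB / baseMeanA
    lfc = log(fc) / log(2)    # awk has no log2(), so use a log ratio
    printf "foldChange=%g log2FoldChange=%g\n", fc, lfc
}'
# prints: foldChange=4 log2FoldChange=2
```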


References:

1. http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

2. https://www.bioconductor.org/help/course-materials/2013/CSAMA2013/tuesday/afternoon/DESeq_protocol.pdf

3. http://bioconductor.org/packages/release/bioc/html/DESeq.html

4. https://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf
