The Bioinformatics and Mathematical Bioscience Lab (BMBL) Biweekly Science Report
Reporting Time: Due at 5pm on Mar. 25th
Institution: South Dakota State University
Report prepared by: Jinyu Yang
Advisor: Qin Ma
Project Name: How to do differential expression analysis from FASTQ format data on HPC (required packages or software: R, ShortRead, BBMap, Bowtie2, TopHat2, Samtools, HTSeq and DESeq)

SIGNIFICANT SCIENCE ACCOMPLISHMENTS:
(Examples: major achievement in meeting a milestone, new collaborations, publication in high impact journal.)

1. Summary of Science Activities (<1 page):
raw data (.fq / .fasta file) ---[ ShortRead (QA, filtering and trimming), then BBMap (QC, filtering and trimming) ]--- preprocessed data (.fq / .fasta file) ---[ Bowtie2 + TopHat2 ]--- accepted_hits.bam ---[ samtools ]--- XXX.sam ---[ HTSeq ]--- XXX.count ---[ DESeq ]--- result files (e.g., P-value table)
2. Future Work Plans (Brief summary of the tasks/milestones working on next month):
3. Issues to Resolve (Issues that need input from another partner, FA Lead, Science Coordinator or BESC Director for resolution, etc.):
4. Publications:
5. Presentations:
6. News / Awards:
7. Personnel changes (New, reassigned, or departed):
8. Intellectual Property:
9. Quality Assurance:
10. Environment, Safety and Health:

Please complete and return to Qin Ma ([email protected]). If no activity, please indicate "N/A".

Supplementary Tables and Figures (to support above accomplishments)

Here we use several real datasets in .fq format to do differential expression analysis. In this example, the datasets come from two conditions (gu and ye), and each condition is represented by two replicate samples (gu2 and gu3; ye1 and ye3). Before starting, please install the required software following the instructions in "How to use DESeq do differential expression analysis".

1.
Prepare your datasets

Upload your raw datasets (.fq or .fastq files) and the related files (e.g., reference genomes and gene model annotations) to the HPC or another server. In this example, GL.fa is the reference genome and GL.gb.gff is the gene model annotation. Then extract the compressed files in the current directory with a command like this:
> gunzip *.fq.gz

Next, we use ShortRead for QA (quality assessment), filtering and trimming, so make sure that R is installed. If it is not, don't worry: follow the steps below. Otherwise, you can skip them.
First, download the latest R source package (https://cran.r-project.org/) for Linux, e.g., R-3.2.3.tar.gz, upload this file to the HPC, then execute the following commands:
> tar -zxvf R-3.2.3.tar.gz
> cd R-3.2.3
> ./configure
> make
> make install
> vim ~/.bash_profile
Now add the following two lines to ~/.bash_profile (no spaces around "=", and "export" in lower case):
PATH=/home/yangj/R-3.2.3/bin:$PATH
export PATH
Then save and quit (/home/yangj/R-3.2.3 is the absolute path of the R directory on my machine; remember to substitute yours). Finally, execute:
> source ~/.bash_profile

2. ShortRead (QA (Quality Assessment), filtering and trimming)

a) QA
First, start an R session:
> R
Then install ShortRead and DESeq if you have not installed them before:
> source("https://bioconductor.org/biocLite.R")
> biocLite(c("ShortRead", "DESeq"))
Use setwd() to set the working directory to where the FASTQ files are located:
> setwd("/home/yangj/DESeq-SDU/Sample")
Load the ShortRead library:
> library("ShortRead")
Now we can assess read quality with ShortRead:
> fls = dir("./", "*fq$", full=TRUE)
> qaSummary = qa(fls, type="fastq")
> report(qaSummary, type="html", dest="fqQAreport")
However, the HPC does not support X11, so report() fails with an error there. We therefore save qaSummary to a file, download it to our local machine, and generate the report there with RStudio:
> save(qaSummary, file="qaSummary")
Download qaSummary to your local machine, then start RStudio:
> load("C:/BaiduYunDownload/qaSummary")
> report(qaSummary, type="html", dest="fqQAreport")
You will get a directory named "fqQAreport", which contains the QA report.

OK, back on the HPC:
> qaSummary[["readCounts"]]
                 read filter aligned
gu2_read1.fq 35006259     NA      NA
gu2_read2.fq 35006259     NA      NA
gu3_read1.fq 30748467     NA      NA
gu3_read2.fq 30748467     NA      NA
ye1_read1.fq 31187333     NA      NA
ye1_read2.fq 31187333     NA      NA
ye3_read1.fq 33479655     NA      NA
ye3_read2.fq 33479655     NA      NA
As you can see, this command returns the number of reads, the number of reads surviving the Solexa filtering criteria, and the number of reads aligned to the reference genome for each lane (the last two are NA because filtering and alignment have not been done yet). You can also inspect the base-call counts of each FASTQ file as follows:
> qaSummary[["baseCalls"]]
                    A        C        G        T    N
gu2_read1.fq 21685857 28412307 28130219 21767568 4049
gu2_read2.fq 21729895 28722063 27816884 21730591  567
gu3_read1.fq 21723444 28346939 28174527 21751407 3683
gu3_read2.fq 21734048 28697947 27824570 21742736  699
ye1_read1.fq 21675483 28443112 28095517 21781660 4228
ye1_read2.fq 21702486 28762839 27785294 21748745  636
ye3_read1.fq 21795076 28360347 27968237 21872354 3986
ye3_read2.fq 21807964 28695030 27618484 21877864  658

b) Filtering and trimming
Construct a function for filtering and trimming. Here we use nFilter() to ensure that the retained reads contain no 'N'. The sliding window (trimTailw(object, k, a, halfwidth)) starts at the left-most nucleotide, tabulating the number of cycles in a window of 2 * halfwidth + 1 surrounding the current nucleotide with quality scores that fall at or below a. The read is trimmed at the first nucleotide for which this number >= k. Then we drop reads shorter than 35 nt.
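The sliding-window rule described above can be illustrated with a small toy in bash. This is only a sketch for building intuition, not ShortRead's implementation: the function name trim_tail_window is hypothetical, and two details are assumptions of this sketch rather than documented behaviour, namely that the window is clipped at the ends of the read and that the trimmed read keeps only the bases before the first failing position. Quality characters are compared by their ASCII codes.

```shell
#!/bin/bash
# Toy sliding-window tail trimmer (illustration only, not ShortRead's code).
# Arguments: sequence, quality string, k, a (threshold quality character), halfwidth.
# Assumptions of this sketch: the window is clipped at the read ends, and the
# trimmed read keeps only the bases before the first failing position.
trim_tail_window() {
    local seq=$1 qual=$2 k=$3 a=$4 halfwidth=$5
    local n=${#qual} threshold i j lo hi count code
    threshold=$(printf '%d' "'$a")          # ASCII code of the threshold character
    for (( i = 0; i < n; i++ )); do
        lo=$(( i - halfwidth )); (( lo < 0 )) && lo=0
        hi=$(( i + halfwidth )); (( hi >= n )) && hi=$(( n - 1 ))
        count=0
        for (( j = lo; j <= hi; j++ )); do
            code=$(printf '%d' "'${qual:j:1}")
            (( code <= threshold )) && count=$(( count + 1 ))
        done
        if (( count >= k )); then
            printf '%s\n' "${seq:0:i}"      # first window with >= k low-quality bases
            return 0
        fi
    done
    printf '%s\n' "$seq"                    # no window failed: keep the whole read
}

# Same parameters as the R call used in this report: k=2, a="4", halfwidth=2.
trim_tail_window ACGTACGTAC IIIII4I4II 2 4 2   # prints ACGTA
```

In the example call, positions 6 and 8 of the read carry the low quality character "4"; the first window containing both is centred on position 6, so the read is cut to its first five bases. The actual filtering in this report is done by ShortRead's trimTailw inside the R function that follows.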
myFilterAndTrim <- function(fl, destination=sprintf("%s_SR.fq.gz", substr(fl, 1, nchar(fl)-6))) {
    # Stream the input file in chunks so the whole file is never held in memory
    stream <- open(FastqStreamer(fl))
    on.exit(close(stream))
    repeat {
        fq <- yield(stream)
        if (length(fq) == 0) break
        fq <- fq[nFilter()(fq)]            # drop reads that contain 'N'
        fq <- trimTailw(fq, 2, "4", 2)     # sliding-window tail trimming (k=2, a="4", halfwidth=2)
        fq <- fq[width(fq) >= 35]          # drop reads shorter than 35 nt
        writeFastq(fq, destination, "a")   # append the chunk to the output file
    }
}

Execute the function in a for loop:
dataSets = c("gu2_read1.fq.gz", "gu2_read2.fq.gz", "gu3_read1.fq.gz", "gu3_read2.fq.gz", "ye1_read1.fq.gz", "ye1_read2.fq.gz", "ye3_read1.fq.gz", "ye3_read2.fq.gz")
for (i in seq_along(dataSets)) { myFilterAndTrim(dataSets[i]) }

3. FastQC (QC (quality control))
We also use FastQC to get a quality control report. First, create a new folder "fastqcReport-1" (the output directory used in the script below) and a file fastqc.pbs, then edit fastqc.pbs as follows:
    #!/bin/bash
    # File fastqc.pbs
    # fastqc script for blackjack
    # ~April 2015 [email protected]
    # Job name
    #PBS -N fastqc
    # To request 10 hours of wall clock time
    #PBS -l walltime=10:00:00
    # To request a single node with 1 core
    #PBS -l nodes=1:ppn=1
    # The environment variable $PBS_O_WORKDIR specifies the directory from which you submitted the job
    cd $PBS_O_WORKDIR
    # Modify input and output files below to match your run!!
    # You will also need correct ancillary files (parameters, etc) in the
    # working directory to get your simulation to run.
    # load module
    . /usr/share/modules/init/sh
    module load bio/FastQC/0.11.3
    fastqc -o ./fastqcReport-1 *fq
    #fastqc --help
Then execute it with the command:
> qsub fastqc.pbs
ShortRead drops the reads containing 'N', but low-quality bases still appear to remain, so we further filter and trim the ShortRead output with BBMap.

4.
BBMap (Filtering and Trimming)
(Note: in this section, all operations are executed on zcluster, a server at UGA.)
First, create bbmap.sh and edit it as follows:
    #!/bin/bash
    cd ~/Jinyu/DESeq-SDU/shortReadRes
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu2_read1_SR.fq out=gu2_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu2_read2_SR.fq out=gu2_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu3_read1_SR.fq out=gu3_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=gu3_read2_SR.fq out=gu3_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye1_read1_SR.fq out=ye1_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye1_read2_SR.fq out=ye1_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye3_read1_SR.fq out=ye3_read1_BBM.fq qin=64 qout=64 qtrim=r trimq=4
    time /usr/local/bbmap/latest/bbmap.sh ref=GL.fa in=ye3_read2_SR.fq out=ye3_read2_BBM.fq qin=64 qout=64 qtrim=r trimq=4
Execute it with the command:
> qsub -q rcc-30d bbmap.sh
Then run FastQC again to check whether the filtered and trimmed reads are reasonable. Create fastqc.sh and edit it as follows:
    #!/bin/bash
    cd ~/Jinyu/DESeq-SDU/shortReadRes
    export PATH=${PATH}:/usr/local/fastqc/latest/
    time /usr/local/fastqc/latest/fastqc -o ./fastqcReport-2 *BBM.fq
Execute it with the command:
> qsub -q rcc-30d fastqc.sh
You should now see that the quality control results are good. You can also run the ShortRead quality assessment again.

5. Align the reads to the reference genome with Bowtie2 and TopHat2
First, put all the *BBM.fq files, the reference genome GL.fa, and the gene model annotation GL.gb.gff in the same folder.
Second, create the folders tophat_out_gu2, tophat_out_gu3, tophat_out_ye1 and tophat_out_ye3. Then create the file tophat.pbs with the following commands:
    #!/bin/bash
    # File tophat.pbs
    # tophat script for bigjack
    # ~Jan. 2015 [email protected]
    # Job name
    #PBS -N tophat
    # To request 3 days of wall clock time
    #PBS -l walltime=3:00:00:00
    # To request a single node with 12 cores
    #PBS -l nodes=1:ppn=12
    # The environment variable $PBS_O_WORKDIR specifies the directory from which you submitted the job
    cd $PBS_O_WORKDIR
    # Modify input and output files below to match your run!!
    # You will also need correct ancillary files (parameters, etc) in the
    # working directory to get your simulation to run.