RNA-Seq Day Two
Joshua Ainsley, PhD
Postdoctoral Researcher
Lab of Leon Reijmers
Neuroscience Department
Tufts University
[email protected]

RNA-Seq Day Two: Intro to Unix, File formats, RNA-Seq QC

Lecture outline
• What is Unix?
• Tufts Cluster
• Unix Introduction
• File formats
  – FASTA
  – FASTQ
  – SAM/BAM
  – GFF/GTF
  – BED

What is Unix?
An operating system

Why does bioinformatics use Unix?
• Open source fits with academic ideals
• Best software development environment
• Programming languages already installed and configured (difficult on Windows)
• Software is free and easily available
• Shell tools: bioinformatics is mostly about processing text in some way
• Mac OS is based on Unix and (for the most part) works the same

Unix directory structure
[Figure: Unix directory structure]

Unix components
• Kernel: The operating system. Allocates hardware resources in response to software and user requests.
• Shell: The interface between the user and the kernel. We will use the Bash shell.

Files and Processes
Everything in Unix is a file or a process
• A file is a destination for or a source of data (this includes directories, the screen, printers)
• A process is a program that is running
• A file stores the instructions for a process, and a process may interact with files

Interacting with Unix
• Text-based command line
• You type in commands and the OS assigns resources to utilize the appropriate process(es) and file(s)
$ command -options targets

Tufts High-performance computing research cluster
• 172 RedHat 6 systems
• 8-16 cores/node
• 16-128 GB RAM/node
• Access using ssh (secure shell)
  – Terminal on Mac/Linux
  – PuTTY on Windows

Why use a cluster?
• NGS data is big and getting bigger
• Your desktop/laptop aren’t good enough
• Run many simultaneous programs (jobs)
• Cloud computing from Amazon, Illumina (computer rental)

Lecture outline
• What is Unix?
• Tufts Cluster
• Unix Introduction
• File formats
  – FASTA
  – FASTQ
  – SAM/BAM
  – GFF/GTF
  – BED

Login to Tufts Cluster
Host: cluster.uit.tufts.edu
• Mac: Terminal
  $ ssh [email protected]
• Windows: Putty.exe

Your first Unix commands
$ ls               lists the contents of a directory
$ pwd              shows your current directory
$ whoami           shows your user name
$ touch file.txt   creates an empty file, or updates the timestamp of an existing one (requires a target)

Command manuals
$ man              shows the manual for a target command
$ man ls
Navigate with arrow keys, PgUp, PgDn; press “q” to exit
Some useful options for ls: -l -a -t -S -lat

Editing files in Unix
Requires a text editor. We’ll use nano.
$ nano

Unix intro exercises

Lecture outline
• What is Unix?
• Tufts Cluster
• Unix Introduction
• File formats
  – FASTA
  – FASTQ
  – SAM/BAM
  – GFF/GTF
  – BED

Important file formats
FASTA     nucleotide sequence
FASTQ     sequence + quality information
SAM/BAM   alignments
GFF/GTF   transcript information
BED       misc. feature coordinates

FASTA
>gi|212549564|ref|NM_015981.3| Homo sapiens calcium/calmodulin-dependent protein kinase II alpha (CAMK2A), transcript variant 1, mRNA
GGTTGCCATGGGGACCTGGATGCTGACGAAGGCTCGCGAGGCTGTGAGCAGCCACAGTGCCCTGCTCAGA
AGCCCCGGGCTCGTCAGTCAAACCGGTTCTCTGTTTGCACTCGGCAGCACGGGCAGGCAAGTGGTCCCTA
GGTTCGGGAGCAGAGCAGCAGCGCCTCAGTCCTGGTCCCCCAGTCCCAAGCCTCACCTGCCTGCCCAGCG
CCAGGATGGCCACCATCACCTGCACCCGCTTCACGGAAGAGTACCAGCTCTTCGAGGAATTGGGCAAGGG
AGCCTTCTCGGTGGTGCGAAGGTGTGTGAAGGTGCTGGCTGGCCAGGAGTATGCTGCCAAGATCATCAAC
ACAAAGAAGCTGTCAGCCAGAGACCATCAGAAGCTGGAGCGTGAAGCCCGCATCTGCCGCCTGCTGAAGC
ACCCCAACATCGTCCGACTACATGACAGCATCTCAGAGGAGGGACACCACTACCTGATCTTCGACCTGGT
CACTGGTGGGGAACTGTTTGAAGATATCGTGGCCCGGGAGTATTACAGTGAGGCGGATGCCAGTCACTGT
ATCCAGCAGATCCTGGAGGCTGTGCTGCACTGCCACCAGATGGGGGTGGTGCACCGGGACCTGAAGCCTG

Tab completion
Save time typing and reduce spelling errors by using tab completion. Type in enough letters to uniquely identify a command/file/path, and press tab. The shell will automatically fill in the rest. If pressing tab does nothing, what you have typed does not yet identify a unique match.
Press tab twice to see a list of possible matches.

Text manipulation commands
less   pages through a file one screen at a time
head   displays the beginning of a file
tail   displays the end of a file
sort   sorts the lines of a file
cut    selects columns of a file
tr     replaces or removes characters
grep   searches a file for lines matching a pattern
sed    stream editor, edits a file line by line
awk    programming language, very useful for advanced text manipulation

Handling text output
>    Redirects output to a target (overwrites)
>>   Appends output to a target
|    “Pipe”: sends output to another program. Very useful for multi-step text manipulation.

FASTA exercises

Minute cards

Break

FASTQ
Stores sequence information and quality scores associated with the sequence
Quality represented as a Phred score:
\[ Q = -10 \log_{10} P \]
where Q = quality and P = error probability

Phred   Prob. incorrect   Accuracy
10      1 in 10           90%
20      1 in 100          99%
30      1 in 1000         99.9%
40      1 in 10000        99.99%
50      1 in 100000       99.999%

FASTQ
To save space, Phred scores are represented as ASCII characters

Phred scales
Format                Encoding    Range
Sanger                Phred+33    0 to 40
Illumina 1.0/Solexa   Solexa+64   -5 to 40
Illumina 1.3+         Phred+64    3 to 40
Illumina 1.8+         Phred+33    0 to 41
Look for “B” tails or “#” tails: a run of “B” at the end of a quality string suggests a +64 encoding, and a run of “#” suggests +33 (each character encodes Q = 2 in its own scale).
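The Phred formula and the ASCII offsets above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are illustrative, the default offset of 33 assumes Sanger or Illumina 1.8+ encoding, and the offset guess is a rough heuristic based only on the ranges in the table.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


def decode_quality(qual, offset=33):
    """Convert an ASCII quality string to a list of integer scores."""
    return [ord(ch) - offset for ch in qual]


def guess_offset(qual):
    """Heuristic guess at the ASCII offset: 33, 64, or None if ambiguous."""
    lowest = min(ord(ch) for ch in qual)
    highest = max(ord(ch) for ch in qual)
    if lowest < 59:    # below ';' (64 - 5): too low for any +64 scale
        return 33
    if highest > 74:   # above 'J' (33 + 41): too high for a +33 scale
        return 64
    return None        # every character fits both scales


if __name__ == "__main__":
    # Reproduce the Phred / error probability table
    for q in (10, 20, 30, 40, 50):
        print(q, phred_to_error_prob(q))
    print(decode_quality("#5I"))
    print(guess_offset("hhhhgfBBBB"))
```

A single read can be ambiguous, so in practice the guess is more reliable when applied to the quality strings of many reads at once.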
FASTQ
@42JV5AAXX_HWI-EAS229_1:6:87:886:1289
CTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGC
+
B?6<..11@A(=11//664.7.46<6888646.6886668688846588888
@42JV5AAXX_HWI-EAS229_1:6:91:843:1848
CATATTTAGGAGTCTACTGAGACCAAACAGCATATGCTCCGGGTGTTTCCCT
+
B@B:BBB>AA@C>B@@ABB@A;AB@@>B?@@@@?AA@A@@BBA5C>>?>?7;

First line = unique identifier (starts with “@”)
Second line = sequence
Third line = spacer (may repeat identifier, starts with “+”)
Fourth line = quality scores

FastQC
• FastQC is used to generate summary information about FASTQ sequences
• You will use this every time you receive RNA-Seq data
Babraham Bioinformatics
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FastQC report modules
• Basic Statistics
• Per base sequence quality
• Per base sequence content
• Sequence duplication levels
• Kmer content

What’s wrong here? #1
• Random hexamer bias at start
• AT rich due to mitochondrial and polyA reads
Solution: remove the reads before analysis

What’s wrong here? #2
• Adapter sequences due to short insert
Solution: trim the reads before mapping

What’s wrong here? #3
• This sample is almost entirely adapter-adapter ligation products
Solution: None. Data unusable. Add the ligase last!!
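The four-line FASTQ record layout described above can be read with a short Python function. This is a minimal sketch, not FastQC's method: the names `read_fastq` and `mean_quality` are illustrative, it assumes each record is exactly four lines (no wrapped sequences), and it assumes Phred+33 quality encoding. The demo uses the first record from the FASTQ slide.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of input (or a blank line)
        seq = handle.readline().rstrip("\r\n")
        spacer = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        # Line 1 must start with "@" and line 3 with "+"
        if not header.startswith("@") or not spacer.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # One quality character per base
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def mean_quality(qual, offset=33):
    """Mean quality score of one read, assuming the given ASCII offset."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


if __name__ == "__main__":
    example = io.StringIO(
        "@42JV5AAXX_HWI-EAS229_1:6:87:886:1289\n"
        "CTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGC\n"
        "+\n"
        "B?6<..11@A(=11//664.7.46<6888646.6886668688846588888\n"
    )
    for name, seq, qual in read_fastq(example):
        print(name, len(seq), round(mean_quality(qual), 1))
```

For a real file, `open("reads.fastq")` (a hypothetical file name) can be passed in place of the `io.StringIO` object.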
LSF (Load Sharing Facility)
• Resource allocator - how programs are assigned to nodes in the cluster
• You interact with the head node, along with everyone else
• Programs should not be run on the head node

Submitting jobs to the cluster
bsub      Submits a job to the cluster
bjobs     Displays information about current jobs
bkill     Stops a job
bqueues   Displays information about the queues (different nodes that can run programs)

Cluster modules
• Modules exist for specific software packages on the cluster
• Sets environment parameters to correctly run the software
• Path: tells the OS where to find programs

Using WinSCP
• WinSCP is an FTP/SFTP program for transferring files
• Useful for transferring files between the cluster and your computer
• Login credentials are the same as with PuTTY

WinSCP
Create stored sessions on your personal computers

FastQC exercises

SAM (Sequence Alignment/Map)
• Standard format for alignment data
• Tab delimited text format
• Header lines and alignment lines
• All header lines start with “@”
• Header contains metadata about alignments
• One line per alignment
• BAM is a binary form of SAM
http://samtools.sourceforge.net/SAM1.pdf

SAM header
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr2 LN:243199373
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566

• @HD is the header line. Shows sort order
• @SQ are the sequence dictionary lines. Show what sequences the reads were aligned to.
• Other lines specified in SAM format document

SAM alignment section
Col   Field   Type     Brief description
1     QNAME   String   Query template NAME
2     FLAG    Int      bitwise FLAG
3     RNAME   String   Reference sequence NAME
4     POS     Int      1-based leftmost mapping POSition
5     MAPQ    Int      MAPping Quality (sometimes)
6     CIGAR   String   CIGAR string
7     RNEXT   String   Ref. name of the mate/next segment
8     PNEXT   Int      Position of the mate/next segment
9     TLEN    Int      observed Template LENgth
10    SEQ     String   segment SEQuence
11    QUAL    String   ASCII of Phred-scaled base QUALity+33

42JV5AAXX_HWI-EAS229_1:6:87:886:1289 272 chr1 11320 1 76M * 0 0 TTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCT… 62664(1666646648848668688888856488868666886.6468886<64.7.466…

bitwise FLAG
FLAG   Description
1      Read is paired
2      Both paired reads mapped
4      Read unmapped
8      Mate unmapped
16     Read reverse strand
32     Mate reverse strand
64     First in pair
128    Second in pair
256    Not primary alignment
512    not passing quality controls
1024   PCR or optical duplicate
http://picard.sourceforge.net/explain-flags.html

CIGAR string
Symbol   Description
M        alignment match (can be a sequence match or mismatch)
I        insertion to the reference
D        deletion from the reference
N        skipped region from the reference
S        soft clipping (clipped sequences present in SEQ)
H        hard clipping (clipped sequences NOT present in SEQ)
P        padding (silent deletion from padded reference)
=        sequence match
X        sequence mismatch

SAM optional fields
Aligner specific information
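The bitwise FLAG and CIGAR tables above can be turned into a small decoder. This is a minimal sketch in Python, not a replacement for samtools or Picard: the names `decode_flag`, `parse_cigar`, and `reference_span` are illustrative, the flag descriptions are copied from the table, and the demo uses the first six fields of the example alignment line (its truncated SEQ and QUAL fields are left out).

```python
import re

# Bit values and descriptions from the bitwise FLAG table
FLAG_BITS = {
    1: "read is paired",
    2: "both paired reads mapped",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "not passing quality controls",
    1024: "PCR or optical duplicate",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the descriptions of every bit that is set in a SAM FLAG value."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs; "*" means unavailable."""
    if cigar == "*":
        return []
    ops = CIGAR_OP.findall(cigar)
    # Reject strings containing anything other than length/operation pairs
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def reference_span(cigar):
    """Reference bases covered by the alignment (M, D, N, = and X consume the reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


if __name__ == "__main__":
    # SAM is tab delimited; split() also accepts the spaces used on the slide
    line = "42JV5AAXX_HWI-EAS229_1:6:87:886:1289 272 chr1 11320 1 76M * 0 0"
    qname, flag, rname, pos, mapq, cigar = line.split()[:6]
    print(decode_flag(int(flag)))
    print(parse_cigar(cigar))
    print(rname, int(pos), int(pos) + reference_span(cigar) - 1)
```

The flag 272 is 256 + 16, so the example read is a non-primary alignment on the reverse strand, which is why its SEQ is the reverse complement of the sequence in the FASTQ file.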