Joshua Ainsley, PhD
Postdoctoral Researcher
Lab of Leon Reijmers
Neuroscience Department
Tufts University

RNA-Seq Day two

Intro to File formats
RNA-Seq QC

Lecture outline

• What is Unix?
• Tufts Cluster
• Unix Introduction
• File formats
  – FASTA
  – FASTQ
  – SAM/BAM
  – GFF/GTF
  – BED

What is Unix?

Why does bioinformatics use Unix?

Open source fits with academic ideals

Best software development environment

Programming languages come already installed and configured (this is difficult on Windows)

Software is free and easily available

Shell tools – bioinformatics is mostly about processing text in some way

Mac OS is based on Unix and (for the most part) works the same way

Unix directory structure

Unix components

Kernel: The operating system. Allocates hardware resources in response to software and user requests.

Shell: The interface between the user and the kernel. We will use the shell.

Files and Processes

Everything in Unix is a file or a process

A file is a destination for or a source of data (this includes directories, the screen, printers)

A process is a program that is running

A file stores the instructions for a process, and a process may interact with files

Interacting with Unix

Text-based command line

You type in commands and the OS assigns resources to utilize the appropriate process(es) and file(s)

$ command -options targets

Tufts High-performance computing research cluster

172 RedHat 6 systems
8-16 cores/node
16-128 GB RAM/node

Access using ssh (secure shell)

Terminal on Mac / PuTTY on Windows

Why use a cluster?

NGS data is big and getting bigger

Your desktop or laptop is not powerful enough

Run many simultaneous programs (jobs)

Cloud computing from Amazon, Illumina (computer rental)

Login to Tufts Cluster

Mac: Terminal

cluster.uit.tufts.edu

Windows: Putty.exe

ssh username@cluster.uit.tufts.edu

Your first Unix commands

$ ls        lists the contents of a directory
$ pwd       shows your current directory
$ whoami    shows your user name
$ <command> file.txt    interacts with a file (requires a target)

Command manuals

$ man shows the manual for a target command

$ man ls

Navigate with arrow keys, PgUp, PgDn
Press “q” to exit

Some useful options for ls: -l -a -t -S -lat

Editing files in Unix

Requires a text editor. We’ll use nano.

$ nano

Unix intro exercises

Important file formats

FASTA – nucleotide sequence
FASTQ – sequence + quality information
SAM/BAM – alignments
GFF/GTF – transcript information
BED – misc. feature coordinates

FASTA

>gi|212549564|ref|NM_015981.3| Homo sapiens calcium/calmodulin-dependent protein kinase II alpha (CAMK2A), transcript variant 1, mRNA
GGTTGCCATGGGGACCTGGATGCTGACGAAGGCTCGCGAGGCTGTGAGCAGCCACAGTGCCCTGCTCAGA
AGCCCCGGGCTCGTCAGTCAAACCGGTTCTCTGTTTGCACTCGGCAGCACGGGCAGGCAAGTGGTCCCTA
GGTTCGGGAGCAGAGCAGCAGCGCCTCAGTCCTGGTCCCCCAGTCCCAAGCCTCACCTGCCTGCCCAGCG
CCAGGATGGCCACCATCACCTGCACCCGCTTCACGGAAGAGTACCAGCTCTTCGAGGAATTGGGCAAGGG
AGCCTTCTCGGTGGTGCGAAGGTGTGTGAAGGTGCTGGCTGGCCAGGAGTATGCTGCCAAGATCATCAAC
ACAAAGAAGCTGTCAGCCAGAGACCATCAGAAGCTGGAGCGTGAAGCCCGCATCTGCCGCCTGCTGAAGC
ACCCCAACATCGTCCGACTACATGACAGCATCTCAGAGGAGGGACACCACTACCTGATCTTCGACCTGGT
CACTGGTGGGGAACTGTTTGAAGATATCGTGGCCCGGGAGTATTACAGTGAGGCGGATGCCAGTCACTGT
ATCCAGCAGATCCTGGAGGCTGTGCTGCACTGCCACCAGATGGGGGTGGTGCACCGGGACCTGAAGCCTG

Tab completion

Save time typing and reduce spelling errors by using tab completion. Type in enough letters to uniquely identify a command/file/path, and press tab. Unix will automatically fill in the rest. If pressing tab does nothing, what you have typed is not enough to identify a single match. Press tab twice to see a list of possible matches.

Text manipulation commands

less – displays part of a file
head – displays the beginning of a file
tail – displays the end of a file
sort – sorts a file
cut – selects columns of a file
tr – replaces or removes characters
grep – searches a file
sed – stream editor, edits a file line by line
awk – programming language, very useful for advanced text manipulation

Handling text output

>    Redirects output to a target (overwrites)
>>   Appends output to a target
|    “Pipe”: sends output to another program. Very useful for multi-step text manipulation.

FASTA exercises

Minute cards

Break

FASTQ

Stores sequence information and quality scores associated with the sequence

Quality represented as a Phred score: $Q = -10 \log_{10} P$, where Q = quality and P = error probability
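The relationship between Q and P can be checked at the command line. This is a minimal sketch using awk; the function names `phred_to_prob` and `prob_to_phred` are hypothetical helpers introduced here for illustration only.

```shell
# Convert a Phred quality score Q to an error probability P = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Convert an error probability P back to a Phred score Q = -10 * log10(P).
# awk's log() is the natural logarithm, so divide by log(10).
prob_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

# Print the error probability for a few common Phred scores.
for q in 10 20 30 40; do
    printf 'Q%s -> P = %s\n' "$q" "$(phred_to_prob "$q")"
done
```

For example, `phred_to_prob 20` corresponds to the "1 in 100" row of the Phred table.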

Phred   Prob. Incorrect   Accuracy
10      1 in 10           90%
20      1 in 100          99%
30      1 in 1000         99.9%
40      1 in 10000        99.99%
50      1 in 100000       99.999%

To save space, Phred scores are represented as ASCII characters

Phred scales

Format                 Encoding    Score range
Sanger                 Phred+33    0 to 40
Illumina 1.0/Solexa    Phred+64    -5 to 40
Illumina 1.3+          Phred+64    3 to 40
Illumina 1.8+          Phred+33    0 to 41

Look for “B” tails or “#” tails.

@42JV5AAXX_HWI-EAS229_1:6:87:886:1289
CTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGC
+
B?6<..11@A(=11//664.7.46<6888646.6886668688846588888
@42JV5AAXX_HWI-EAS229_1:6:91:843:1848
CATATTTAGGAGTCTACTGAGACCAAACAGCATATGCTCCGGGTGTTTCCCT
+
B@B:BBB>AA@C>B@@ABB@A;AB@@>B?@@@@?AA@A@@BBA5C>>?>?7;
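Because every FASTQ record spans exactly four lines, awk can step through a file by line number. The sketch below uses the first record above as sample data. The function names (`count_reads`, `fastq_summary`, `guess_offset`) are hypothetical, and the offset guess is only a heuristic that assumes well-formed, unwrapped records.

```shell
# Sample record copied from the FASTQ example above.
fastq_record='@42JV5AAXX_HWI-EAS229_1:6:87:886:1289
CTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGC
+
B?6<..11@A(=11//664.7.46<6888646.6886668688846588888'

# Count reads: a well-formed FASTQ file has four lines per record.
count_reads() {
    awk 'END { print NR / 4 }'
}

# Report ID, read length and mean Phred score per read for a given ASCII
# offset (33 for Sanger/Illumina 1.8+, 64 for the older Illumina scales).
fastq_summary() {
    awk -v offset="$1" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { id = substr($0, 2) }      # drop the leading "@"
        NR % 4 == 0 {
            total = 0; n = length($0)
            for (i = 1; i <= n; i++) total += ord[substr($0, i, 1)] - offset
            printf "%s\t%d\t%.1f\n", id, n, total / n
        }'
}

# Heuristic (an assumption, not a guarantee): characters below ASCII 59
# cannot occur in the Phred+64 scales, so their presence implies Phred+33.
guess_offset() {
    awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 0 {
            for (i = 1; i <= length($0); i++)
                if (ord[substr($0, i, 1)] < 59) low = 1
        }
        END { print (low ? 33 : 64) }'
}

printf '%s\n' "$fastq_record" | count_reads
printf '%s\n' "$fastq_record" | guess_offset
printf '%s\n' "$fastq_record" | fastq_summary 33
```

On a real file the same functions read from standard input, for example `fastq_summary 33 < reads.fastq`, where `reads.fastq` is a placeholder file name.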

First line = unique identifier (starts with “@”)
Second line = sequence
Third line = spacer (may repeat identifier, starts with “+”)
Fourth line = quality scores

FastQC

FastQC is used to generate summary information about FASTQ sequences

You will use this every time you receive RNA-Seq data

Babraham Bioinformatics
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FastQC – Basic Statistics
FastQC – Per base sequence quality
FastQC – Per base sequence content
FastQC – Sequence duplication levels
FastQC – Kmer content

What’s wrong here? #1

Random hexamer bias at start

AT rich due to mitochondrial and polyA reads

Solution: remove the reads before analysis

What’s wrong here? #2

Adapter sequences due to short insert

Solution: trim the reads before mapping

What’s wrong here? #3

This sample is almost entirely adapter-adapter ligation products

Solution: None. Data unusable. Add the ligase last!!

LSF (Load Sharing Facility)

Resource allocator - how programs are assigned to nodes in the cluster

You interact with the head node, along with everyone else

Programs should not be run on the head node

Submitting jobs to the cluster

bsub      Submits a job to the cluster
bjobs     Displays information about current jobs
bkill     Stops a job
bqueues   Displays information about the queues (different nodes that can run programs)

Cluster modules

Modules exist for specific software packages on the cluster

A module sets the environment parameters needed to run the software correctly

Path: tells the OS where to find programs

Using WinSCP

WinSCP is an FTP/SFTP program for transferring files

Useful for transferring files between the cluster and your computer

Login credentials are the same as with PuTTY

Create stored sessions on your personal computers

FastQC exercises

SAM (Sequence Alignment/Map)

Standard format for alignment data
Tab delimited text format
Header lines and alignment lines
All header lines start with “@”
Header contains metadata about alignments
One line per alignment
BAM is a binary form of SAM

http://samtools.sourceforge.net/SAM1.pdf

SAM header

@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr2 LN:243199373
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566

@HD is the header line. Shows sort order.

@SQ are the sequence dictionary lines. Show what sequences the reads were aligned to.

Other lines specified in SAM format document

SAM alignment section

Col   Field   Type     Brief description
1     QNAME   String   Query template NAME
2     FLAG    Int      bitwise FLAG
3     RNAME   String   Reference sequence NAME
4     POS     Int      1-based leftmost mapping POSition
5     MAPQ    Int      MAPping Quality (sometimes)
6     CIGAR   String   CIGAR string
7     RNEXT   String   Ref. name of the mate/next segment
8     PNEXT   Int      Position of the mate/next segment
9     TLEN    Int      observed Template LENgth
10    SEQ     String   segment SEQuence
11    QUAL    String   ASCII of Phred-scaled base QUALity+33
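Since SAM is tab-delimited text, the columns in this table can be pulled out with awk. The sketch below builds a tiny SAM stream from two header lines and the alignment line used in this section. `make_sam`, `sam_header` and `sam_core_fields` are hypothetical helper names, and SEQ and QUAL are shortened for the example.

```shell
# Two header lines plus one alignment line (TAB-separated, as SAM requires).
# SEQ and QUAL are truncated here, so they are shorter than the 76M CIGAR implies.
make_sam() {
    printf '@HD\tVN:1.0\tSO:coordinate\n'
    printf '@SQ\tSN:chr1\tLN:249250621\n'
    printf '%s\t272\tchr1\t11320\t1\t76M\t*\t0\t0\t%s\t%s\n' \
        '42JV5AAXX_HWI-EAS229_1:6:87:886:1289' 'TTGCTTACTG' '62664(1666'
}

# Print only the header: every header line starts with "@".
sam_header() {
    grep '^@'
}

# Print QNAME, FLAG, RNAME, POS, MAPQ and CIGAR (columns 1-6) for each
# alignment line, skipping the header.
sam_core_fields() {
    awk 'BEGIN { FS = OFS = "\t" }
        !/^@/ { print $1, $2, $3, $4, $5, $6 }'
}

make_sam | sam_header
make_sam | sam_core_fields
```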

42JV5AAXX_HWI-EAS229_1:6:87:886:1289 272 chr1 11320 1 76M * 0 0 TTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCT… 62664(1666646648848668688888856488868666886.6468886<64.7.466…

bitwise FLAG

FLAG   Description
1      Read is paired
2      Both paired reads mapped
4      Read unmapped
8      Mate unmapped
16     Read reverse strand
32     Mate reverse strand
64     First in pair
128    Second in pair
256    Not primary alignment
512    Not passing quality controls
1024   PCR or optical duplicate

http://picard.sourceforge.net/explain-flags.html

CIGAR string

Symbol   Description
M        alignment match (can be a sequence match or mismatch)
I        insertion to the reference
D        deletion from the reference
N        skipped region from the reference
S        soft clipping (clipped sequences present in SEQ)
H        hard clipping (clipped sequences NOT present in SEQ)
P        padding (silent deletion from padded reference)
=        sequence match
X        sequence mismatch

SAM optional fields

Aligner-specific information added to reads
Some fields are specified in the SAM format
Good for filtering reads
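As a sketch of such filtering, the functions below decode a FLAG value into the bits from the FLAG table and keep only primary, mapped reads carrying `NH:i:1`. This assumes the aligner writes the standard NH tag (number of reported alignments), as TopHat does. The reads `readA` to `readC` and all function names are hypothetical, made up for this example.

```shell
# List the bits that are set in a FLAG value, using the FLAG table above.
explain_flag() {
    flag=$1
    for entry in \
        '1:read is paired' \
        '2:both paired reads mapped' \
        '4:read unmapped' \
        '8:mate unmapped' \
        '16:read reverse strand' \
        '32:mate reverse strand' \
        '64:first in pair' \
        '128:second in pair' \
        '256:not primary alignment' \
        '512:not passing quality controls' \
        '1024:PCR or optical duplicate'
    do
        bit=${entry%%:*}
        if [ $(( flag & bit )) -ne 0 ]; then
            printf '%s\t%s\n' "$bit" "${entry#*:}"
        fi
    done
}

# Hypothetical demo data: one header line and three made-up alignments.
make_demo_sam() {
    printf '@SQ\tSN:chr1\tLN:249250621\n'
    printf 'readA\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\n'
    printf 'readB\t272\tchr1\t11320\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:4\n'
    printf 'readC\t16\tchr1\t500\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2\n'
}

# Keep the header plus primary, mapped alignments tagged NH:i:1.
# int(flag / bit) % 2 tests a single bit without needing gawk extensions.
filter_unique_primary() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^@/ { print; next }                  # pass header lines through
        int($2 / 256) % 2 == 1 { next }       # drop "not primary alignment"
        int($2 / 4) % 2 == 1 { next }         # drop unmapped reads
        {
            for (i = 12; i <= NF; i++)        # optional fields start at column 12
                if ($i == "NH:i:1") { print; next }
        }'
}

explain_flag 272
make_demo_sam | filter_unique_primary
```

The FLAG value 272 from the alignment example breaks down into 16 + 256, so that read is a reverse-strand, non-primary alignment and the filter drops it.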

For Tophat:
AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:25A50 YT:Z:UU NH:i:4 CC:Z:chr15 CP:i:102519634 HI:i:0

SAM/BAM exercises

Break

GFF (General Feature Format)
GTF (Gene Transfer Format)

File formats to store gene structures

1. seqname - chromosome
2. source - The program that generated this feature
3. feature - Feature name. "CDS", "start_codon", "stop_codon", "exon"
4. start - Starting position. (1-based)
5. end - The ending position of the feature (inclusive)
6. score - A score between 0 and 1000, or “.”
7. strand - '+', '-', or '.'
8. frame - For coding exons, number between 0-2 that represents the reading frame of the first base. Otherwise, '.'

GFF (General Feature Format)

GFF2
9. group – All lines with the same group are linked together into a single item. Usually a single string.

GFF3
9. attributes – tag=value format

ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
ctg123 . exon 1300 1500 . + . Parent=gene00001

GTF (Gene Transfer Format)

9. attribute – gene_id “…”; transcript_id “…”
gene_id: A globally unique identifier for the genomic source of the sequence.
transcript_id: A globally unique identifier for the predicted transcript.
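The attribute column can be unpacked with awk's `match()`. The sketch below pulls out the `transcript_id` and computes each feature's length from the 1-based, inclusive start and end. It uses the UCSC-style GTF lines from this lecture as sample data, and `make_gtf` and `gtf_features` are hypothetical names.

```shell
# Sample GTF lines (TAB-separated; the ninth column holds the attributes).
make_gtf() {
    printf 'chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\tgene_id "uc010nxr.1"; transcript_id "uc010nxr.1";\n'
    printf 'chr1\thg19_knownGene\tstart_codon\t12190\t12192\t0.000000\t+\t.\tgene_id "uc010nxq.1"; transcript_id "uc010nxq.1";\n'
    printf 'chr1\thg19_knownGene\tCDS\t12190\t12227\t0.000000\t+\t0\tgene_id "uc010nxq.1"; transcript_id "uc010nxq.1";\n'
}

# Print transcript_id, feature, start, end and length for every feature.
# Length is end - start + 1 because GTF coordinates are 1-based and inclusive.
gtf_features() {
    awk 'BEGIN { FS = OFS = "\t" }
        !/^#/ {
            id = ""
            # The text transcript_id " is 15 characters long; the value
            # follows it and ends before the closing quote.
            if (match($9, /transcript_id "[^"]*"/))
                id = substr($9, RSTART + 15, RLENGTH - 16)
            print id, $3, $4, $5, $5 - $4 + 1
        }'
}

make_gtf | gtf_features
```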

Most commonly used format for recording gene structures for use in NGS applications.

Necessary to determine which reads map to which genes/transcripts

Coordinates change - make sure your GTF was created from the same genome build you used for alignment!
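One quick sanity check is to compare the sequence names in a GTF file with the `@SQ` lines of a SAM header. This is only a partial check: it catches naming mismatches such as `chr1` versus `1`, but matching names do not prove that the genome builds are the same. The function name and demo files below are hypothetical.

```shell
# List sequence names used in a GTF file that are absent from a SAM header.
# usage: gtf_names_missing_from_sam annotation.gtf alignments.sam
gtf_names_missing_from_sam() {
    gtf_names=$(mktemp) && sam_names=$(mktemp) || return 1
    # Column 1 of every non-comment GTF line is the sequence name.
    grep -v '^#' "$1" | cut -f 1 | sort -u > "$gtf_names"
    # Each @SQ header line carries the sequence name in its SN: field.
    awk -F '\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)
    }' "$2" | sort -u > "$sam_names"
    # comm -23 prints lines that appear only in the first (sorted) file.
    comm -23 "$gtf_names" "$sam_names"
    rm -f "$gtf_names" "$sam_names"
}

# Demo files (made up): the GTF uses both "chr1" and "1", the SAM only "chr1".
demo_dir=$(mktemp -d)
printf 'chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\tgene_id "uc010nxr.1"; transcript_id "uc010nxr.1";\n' > "$demo_dir/demo.gtf"
printf '1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> "$demo_dir/demo.gtf"
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:chr1\tLN:249250621\n' > "$demo_dir/demo.sam"

gtf_names_missing_from_sam "$demo_dir/demo.gtf" "$demo_dir/demo.sam"
```

Any name printed by the function has no matching reference sequence in the alignment file, so reads can never be assigned to features on it.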

UCSC Table Browser is the simplest place to obtain GTF files

chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";

How to get a GTF file

BED

File format for defining genomic intervals

Required fields:
1. chrom – chromosome
2. chromStart – starting position (0-based)
3. chromEnd – ending position (exclusive: the base at chromEnd is not part of the feature)
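Because BED starts are 0-based and BED ends are exclusive, while GTF coordinates are 1-based and inclusive, converting a GTF feature to BED means subtracting 1 from the start and keeping the end unchanged. A minimal sketch follows. The function name `gtf_to_bed` is hypothetical, and using the feature type as the BED name with a score of 0 is a choice made only for this example.

```shell
# Convert GTF lines to 6-column BED: chrom, chromStart, chromEnd, name,
# score, strand. Only the start coordinate changes (1-based -> 0-based).
gtf_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
        !/^#/ { print $1, $4 - 1, $5, $3, 0, $7 }'
}

# Feature length in BED is simply chromEnd - chromStart.
bed_lengths() {
    awk 'BEGIN { FS = OFS = "\t" } { print $4, $3 - $2 }'
}

printf 'chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\tgene_id "uc010nxr.1"; transcript_id "uc010nxr.1";\n' |
    gtf_to_bed
```

The exon length comes out the same in both formats: 14409 - 13221 + 1 in GTF and 14409 - 13220 in BED.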

Optional fields:

4. name – Name of the BED line.
5. score - A score between 0 and 1000, or “.”
6. strand - Defines the strand - either '+' or '-'.
7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
9. itemRgb - An RGB value for display.
10. blockCount - The number of blocks (exons) in the BED line.
11. blockSizes - A comma-separated list of the block sizes.
12. blockStarts - A comma-separated list of block starts.

How to get a BED file from the UCSC Table browser

BED exercises

Questions?

Minute cards!
