Bioinformatics 101 – Lecture 2
Introduction to the Command Line
Alberto Riva ([email protected]), Tongjun Gu ([email protected])
ICBR Bioinformatics Core

Outline

1. Introduction to HiPerGator

2. Basic commands

3. Fastq files

Introduction to HiPerGator

Computing environments

§ Standalone application – for local, interactive use;

§ Command-line – local or remote, interactive use;

§ Cluster-oriented: remote, not interactive, highly parallelizable.

HiPerGator

§ HiPerGator is a high-performance computing cluster managed by UF Research Computing.

§ It currently consists of more than 50,000 computing cores, plus a number of specialized nodes (e.g. GPUs), and petabytes of storage.

§ HiPerGator is investment-based: users pay for access, not for actual usage.

§ More information: https://www.rc.ufl.edu/.

Working on HiPerGator

§ Accessible only through ssh. Only the four head nodes (login1 to login4) can receive ssh connections.
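As an illustration, connecting from a terminal on your own machine might look like this (gatorlink stands for your own GatorLink username, and hpg.rc.ufl.edu is assumed to be the current login address; check the UFRC documentation if in doubt):

ssh [email protected]    # log in to one of the head nodes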

§ Head nodes should not be used for intensive computations.

§ Programs to run are submitted as jobs to an execution queue. The scheduler assigns jobs to available nodes according to multiple criteria: resources required, priority, etc.

Working on HiPerGator

1. Users connect to head nodes to develop scripts and manage datasets.

2. Scripts are executed by submitting them as jobs. Jobs run unattended. Scheduler commands allow checking the status of jobs, canceling them, etc.

3. The scheduler (slurm) manages job submission and execution according to priorities, policies, available resources, etc. (A minimal job script is sketched after this list.)

Working on HiPerGator

4. Different “quality of service” queues are available to balance resources and priority: more resources needed = lower priority.

5. Extension of shell scripting: the same commands are executed in parallel on multiple nodes (possibly with different arguments).
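A minimal sketch of a job script (the job name, resource values, module name and input file are all hypothetical stand-ins for your own analysis):

#!/bin/bash
#SBATCH --job-name=count_bam        # name shown in the queue
#SBATCH --ntasks=1                  # one task
#SBATCH --cpus-per-task=1           # one CPU core
#SBATCH --mem=4gb                   # memory requested
#SBATCH --time=01:00:00             # maximum run time (1 hour)
#SBATCH --output=count_bam_%j.log   # log file (%j is replaced by the job id)

module load samtools                # load the tool the job needs
samtools view -c input.bam          # count the alignments in input.bam

Typical scheduler commands:

sbatch count_bam.sh    # submit the job (prints the job id)
squeue -u $USER        # check the status of your jobs
scancel <jobid>        # cancel a job, given its id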

Modules

§ Software on HiPerGator is organized into modules. Module commands:

§ module load      Load named module
§ module spider    Show matching module(s)
§ module unload    Unload named module
§ module purge     Unload all modules
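For example, using samtools as a stand-in for whatever tool you need (the exact module name may differ; use module spider to check):

module spider samtools    # list modules whose name matches "samtools"
module load samtools      # load the module
module unload samtools    # unload it again
module purge              # unload all loaded modules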

§ Hundreds of modules are available. Modules may automatically load other modules they depend on.

Basic Unix Commands

Command-line basics

§ Commands are typed at a prompt. The program that reads your commands and executes them is the shell.

§ Interaction style originated in the 70s, with the first visual terminals (connections were slow…).

§ A command consists of a program name followed by options and/or arguments.

§ Syntax may be obscure and inconsistent (but efficient!).

Command-line basics

§ Example: to view the names of files in the current directory, use the “ls” command (short for “list”)

§ ls            plain list
§ ls -l         long format (size, permissions, etc)
§ ls -l -t      newest to oldest
§ ls -l -t -r   reverse sort (oldest to newest)
§ ls -lrt       options can be combined

§ Command names and options are case sensitive!

Command-line basics

Main advantage: commands can be scripted.

§ Sequences of commands can be saved to a file (a script) and repeated easily.

§ Shell syntax supports variables and control structures (conditionals, loops, etc). Simple but powerful.

§ Basic Unix philosophy: simple building blocks can be easily combined to create new complex tools, as in the sketch below.
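A minimal sketch of a shell script, assuming there are some .csv files in the current directory (the file names are hypothetical):

#!/bin/bash
# report the number of lines in every .csv file in this directory
for f in *.csv; do
    n=$(wc -l < "$f")          # count the lines in file $f
    echo "$f has $n lines"     # print the result
done

This combines a loop, a variable and an existing command (wc) into a small new tool.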

File System

§ The Unix environment is centered on the file system. Huge tree of directories and subdirectories containing all files.

§ Everything is a file. Unix provides a lot of commands to operate on files.

§ File extensions are not necessary and are not recognized by the system (but may still be useful).

§ Please do not put spaces in filenames!

Permissions

§ Different privileges and permissions apply to different areas of the filesystem.

§ Every file has an owner and a group. A user may belong to more than one group.

§ Permissions specify read, write, and execute privileges for the owner, the group, and everyone else. Example:

§ ls -l genes.csv
  -rw-rw---- 1 ariva bioinf 2478937 Feb 23 14:50 genes.csv

Filesystem commands

§ pwd      where are we?
§ cd       move around the filesystem
§ ls       view contents of directory
§ mkdir    create new directory
§ chmod    change file permissions
§ cp       copy files
§ mv       move files
§ rm       delete files (use with caution!)
§ rmdir    remove (empty) directory

Every user has a home directory. “cd” with no arguments takes you back there.
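A short example session (genes.csv and the project1 directory are hypothetical names):

pwd                    # where are we? e.g. /home/ariva
mkdir project1         # create a new directory
cd project1            # move into it
cp ~/genes.csv .       # copy a file from the home directory to here
chmod g+r genes.csv    # make the copy readable by the group
ls -l                  # list the contents of the directory
cd                     # go back to the home directory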

Working with text files

§ Plain text files are the “lingua franca” of Unix systems.

§ Most files are structured, either in a simple way (rows and columns, comma- or tab-delimited) or in a more complex one (e.g. XML, JSON).

§ Text files can be created manually with an editor (e.g. nano) or generated by other programs.

Working with text files

Viewing files:
§ file    show type of file
§ cat     display contents of a file
§ more    display contents better
§ less    display contents even better
§ head    display top (10) lines
§ tail    display last (10) lines
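For example, using the genes.csv file from the earlier slide:

file genes.csv       # report what kind of file this is
head genes.csv       # show the first 10 lines
head -20 genes.csv   # show the first 20 lines
tail genes.csv       # show the last 10 lines
less genes.csv       # scroll through the whole file (press q to quit)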

Working with text files

Finding information on or in files

§ wc      count rows / words / chars in file
§ grep    print rows containing text / pattern
§ cut     select columns from delimited file
§ diff    display differences between two files
§ find    search for files matching tests (file name, size, modification date, etc).
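A few illustrative examples (the file names and the BRCA1 pattern are hypothetical):

wc -l genes.csv                # how many lines does the file have?
grep BRCA1 genes.csv           # print the lines that mention BRCA1
grep -c BRCA1 genes.csv        # just count them
cut -d, -f1,3 genes.csv        # extract columns 1 and 3 of a comma-delimited file
diff genes.csv genes_v2.csv    # show the differences between two versions of the file
find . -name "*.csv"           # find all .csv files below the current directory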

Working with text files

Other file operations…
§ sort     sort file lines (alphabetically, numerically, based on one or more fields)
§ uniq     detect (and remove / print / count) duplicate lines
§ split    split a file into chunks
§ paste    files side by side
§ shuf     shuffle file lines
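For instance, to list the most common values in column 2 of a (hypothetical) comma-delimited file:

cut -d, -f2 genes.csv | sort | uniq -c | sort -rn
# extract column 2, sort it, count identical values, and list the most frequent ones first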

More advanced processing can be accomplished with tools like awk and sed.
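For example (results.txt and regions.bed are hypothetical file names):

awk -F'\t' '$3 > 100' results.txt                # print the rows of a tab-delimited file where column 3 is greater than 100
sed 's/^chr//' regions.bed > regions_nochr.bed   # remove a leading "chr" from every line, writing the result to a new file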

Other useful commands

§ echo                display message
§ screen              run multiple terminals over one connection
§ watch               execute a command periodically
§ man                 get help on a command
§ gzip                compress a file
§ gunzip              uncompress a file
§ zcat/zmore/zgrep    view / search compressed files
§ zip                 create zip archive
§ time                measure command execution time
§ wget                download file given URL (http, ftp, etc)
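A small illustrative sequence (the URL and file names are made up):

wget https://example.org/annotations.gtf.gz    # download a compressed file
zcat annotations.gtf.gz | head                 # peek at its first lines without uncompressing it
zgrep -c "gene" annotations.gtf.gz             # count matching lines inside the compressed file
gunzip annotations.gtf.gz                      # uncompress it (produces annotations.gtf)
time gzip annotations.gtf                      # compress it again, measuring how long that takes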

Fastq files

Short Reads - Structure

Short reads from NGS instruments are saved in fastq files. Each read is represented by four lines of text:

@M01105:32:000000000-BFW77:1:1101:15747:1360 1:N:0:CATGGC
CTTACTAATACATCCCCTTCCTTAAACCCACCAAAACTTTTCCAACTTTCCTCTCTT
+
6BC9BF<,C,C8,,,;@,,,;;,,0;,,34;=

1. Read header
2. Sequence
3. Separator (“+”, otherwise empty)
4. Quality scores

Short Reads - Structure

§ Read length can range from 35 to 300nt; a typical fastq file may contain tens of millions of reads, several GB in size.

§ Fastq files may contain reads from different samples (multiplexing); index sequences are used to separate them.

§ Paired-end sequencing produces two fastq files, one containing R1 reads, the other containing the corresponding R2 reads.
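Since every read takes four lines, the number of reads in a (hypothetical) gzipped fastq file can be checked with the commands seen earlier; the R1 and R2 files of a pair should give the same count:

echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))    # number of reads in the R1 file
echo $(( $(zcat sample_R2.fastq.gz | wc -l) / 4 ))    # should print the same number for R2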

Short Reads - Structure

Paired reads:

@M01105:32:000000000-BFW77:1:1101:15747:1360 1:N:0:CATGGC
CTTACTAATACATCCCCTTCCTTAAACCCACCAAAACTTTTCCAACTTTCCTCTCTT
+
6BC9BF<,C,C8,,,;@,,,;;,,0;,,34;=

@M01105:32:000000000-BFW77:1:1101:15747:1360 2:N:0:CATGGC
TTCATCCTCCTCTATTTCTTACTCTTCCTCTTGCCTTTTCTGCCCCGTCCCCTCGTC
+
6,8,,<,,;,,6;,,,,,;,,,,;,4;44=,3,,,,,,,,+,33****)*1)*0)58

Short Reads – Quality Scores

§ Each character represents a quality value.
§ Sanger encoding is now the most commonly used.

Short Reads – Quality Scores

Interpretation:

Q = -10 log10(p) where p is the probability that the base is incorrect.
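For example, Q = 30 corresponds to p = 10^(-30/10) = 0.001, i.e. a 1 in 1,000 chance that the base is wrong; Q = 20 corresponds to p = 0.01, and Q = 10 to p = 0.1. In Sanger encoding, Q is stored as the single character whose ASCII code is Q + 33, so the character “?” (ASCII 63) represents Q = 30.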

§ Highest possible value is 40 (or 41), p = 0.0001.
§ Values above 30 are acceptable.
§ Bases with Q < 20 should never be used.
§ Reads are trimmed to ensure average / minimum quality.

Short Reads – Quality Scores

Understanding the error profile of your reads is very important. Different technologies have different error profiles. For example:

§ Illumina – Quality quickly reaches max, then slowly decreases towards 3’ end of read.

§ PacBio – Higher error rate but with uniform distribution across read.

Short Reads – Quality Scores

Why are quality scores important?

Certainty about a base call can only be obtained by observing that base multiple times (coverage).

How many times? Depends on the application.

PacBio sequencing has built-in repeated observations through CCS (Circular Consensus Sequencing) – allows trading read length for higher quality. E.g. a 20kb read can become a 5kb read with 4X coverage.

Short Reads – Quality Scores

Examples of tools to work with quality scores:

§ trimmomatic
§ cutadapt
§ trim_galore
§ sickle
§ FastQC
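For example, on HiPerGator a quality check followed by adapter/quality trimming might look like the sketch below (the module names and fastq file names are assumptions; use module spider to find the exact names):

module load fastqc trim_galore
mkdir qc_reports
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports    # quality-control report for each file
trim_galore --paired sample_R1.fastq.gz sample_R2.fastq.gz    # trim adapters and low-quality bases from a read pair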

These tools are usually executed automatically at the beginning of each analysis. Pipelines help automate data analysis -> faster, more reliable, reproducible.