Catalina Vasilescu
Student number: 014438655
Advanced practical course in genome bioinformatics 529028
Report Day1

Background
Next-generation sequencing machines produce a huge number of short sequences called reads, which have to be aligned and merged in order to reconstruct the original sequence. This is done with assembly tools. The first step is quality control and preprocessing of the data (adapter sequence removal, etc.). The reads are then aligned, and overlapping reads are assembled into contigs. Scaffolding then orders the contigs and defines the gaps between them. Sequencing the gap regions provides the full genomic sequence. There are two types of genome assembly tools: OLC (overlap-layout-consensus) based assemblers and graph based assemblers. We test three genome assemblers: Minimo, Velvet and SPAdes. Minimo is OLC based; Velvet and SPAdes are graph based.

Setting up the system
We use resources and software located at CSC, so we have to connect remotely to the CSC server (taito.csc.fi). For this we use PuTTY, a terminal program for Windows and Unix systems that provides remote access to computers over the internet. On taito, we first create the folders we need and navigate between them with the Linux commands mkdir and cd. We then copy the materials with the rsync command and decompress them with unzip. In each login session we need to load a module from the server with the command module load biokit. This loads many programs and biological databases and makes them available for use.

Read quality checking
One of the first steps in data analysis is checking the quality of the data. We use the Java based program FastQC, which performs simple quality control checks to ensure that the raw data is within normal parameters. The program accepts fastq, sam and bam files as input. We use fastq files.
FastQ is a text-based format that stores both biological (nucleotide) sequences and their corresponding quality scores. The R1 and R2 reads are in separate fastq files. The output of the program is an HTML QC report, which can reveal problems originating either in the sequencing machine or in the preparation of the library material. The left side of the report displays a summary of the analysis modules with color-coded warnings (green-orange-red) that signal possible flaws in the data. The body of the report shows graphical representations of the analysed parameters.

Q1:
R1 and R2 are the forward and reverse reads of each pair from paired-end sequencing. The fragmented DNA is ligated to different, non-complementary adapters at the 3’ and 5’ ends, so that the primer for one adapter begins synthesis on only one strand. First we sequence using the primer for one adapter and then using the primer for the second adapter. We obtain a few hundred bp from the 3’ end of the fragment and a few hundred bases from the 5’ end, in the opposite orientation to the first. Using FastQC we generated two HTML QC reports, one for R1 reads and another for R2 reads. The most important parameters to check are: per base sequence quality, sequence duplication levels, overrepresented sequences and adapter content. The last three show us that the adapter sequences have been removed and that there are no other repetitive sequences. Per base sequence quality illustrates how the quality score varies with the position of the base in the read. The quality score starts to decrease after position 150 in the reads. We can see a difference in quality between R1 and R2 reads, with R2 sequences showing lower quality. For R2 reads, the quality drops below a score of 20 for bases beyond position 250 in the reads.
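To make the quality scores discussed above more concrete, the following is a minimal Python sketch of how a fastq quality line can be decoded and averaged per read position. It assumes the Phred+33 encoding (the usual one for current Illumina fastq files) and uses made-up quality strings; the function names are illustrative and this is not how FastQC itself is implemented.

```python
def phred_scores(quality_line, offset=33):
    """Decode one fastq quality string into integer Phred scores.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def mean_quality_per_position(quality_lines, offset=33):
    """Average Phred score at each read position over many reads.

    Reads may have different lengths; each position is averaged over
    the reads that are long enough to cover it.
    """
    totals, counts = [], []
    for line in quality_lines:
        for i, score in enumerate(phred_scores(line, offset)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # Hypothetical quality strings, not taken from the course data.
    toy_quality_lines = ["IIII5", "II55!", "III"]
    print(mean_quality_per_position(toy_quality_lines))
    print(error_probability(20))
```

A Phred score of 20 corresponds to an error probability of 1%, which is why positions where the score falls below 20 are commonly treated as low quality.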

Fig1. Per base sequence quality for R1 reads (upper panel) and R2 reads (lower panel). We included this figure to illustrate the quality differences between R1 and R2 reads in the analysed data set.

De novo assembly tools
Assembly tools based on the overlap-layout-consensus (OLC) approach do not cope with huge numbers of reads. Most of the available tools that can cope with typical data generated by Illumina use a de Bruijn graph based k-mer approach. Minimo is an OLC based assembler. The reads obtained from the sequencing machine are aligned to find all possible pairwise overlaps. We run the assembly without specifying the minimum overlap length and the minimum identity percentage. Velvet and SPAdes are de Bruijn graph based k-mer assemblers. We run Velvet (velveth + velvetg) and SPAdes for k-mer sizes 21, 55 and 99.

Q2:
I find Minimo the easiest assembler to use. It takes as input the R1 and R2 sequences in fasta files. The minimum we need to define is the output format and the file name. We can further specify the “minimum overlap length” and “minimum identity percentage”, but this is not mandatory. For Velvet and SPAdes it is necessary to make multiple runs with different k-mer values to find the best assembly.

Q5:
Next-generation sequencing
Next-generation sequencing represents a collection of sequencing methods that extend the sequencing process across millions of reactions in a massively parallel fashion. The methods can be applied to whole-genome sequencing, exome sequencing, RNA sequencing, targeted gene sequencing and other types of projects.

De novo assembly
De novo assembly is the process of aligning and merging fragments of DNA sequence in order to reconstruct the original genome of a species. Short fragments called reads are aligned to find overlapping common regions. Their sequences are merged into longer sequences called contigs.
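The merging of overlapping reads described above can be illustrated with a small Python sketch. It only handles exact suffix-prefix matches between two reads and uses hypothetical function names and toy sequences; real OLC assemblers such as Minimo also tolerate mismatches (the minimum identity setting) and use far more efficient overlap detection, so this is an illustration of the idea and not their implementation.

```python
def longest_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads into one longer sequence if they overlap.

    Returns the merged sequence, or None when the overlap is too short.
    """
    length = longest_overlap(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]


if __name__ == "__main__":
    # Toy reads: the last three bases of the first read ("TAC")
    # match the first three bases of the second read.
    print(merge_reads("ACGTAC", "TACGGA"))   # ACGTACGGA
    print(merge_reads("AAAA", "CCCC"))       # None
```

Raising `min_overlap` in this sketch rejects short, possibly spurious overlaps, which mirrors the role of the minimum overlap length parameter mentioned for Minimo.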
If a reference genomic sequence has already been published for a specific organism, the reads obtained from the sequencing machine can be assembled by mapping them onto the reference. When the genome of an organism is sequenced for the first time, the original sequence has to be reconstructed from scratch. This process is called de novo assembly.

Minimo
Minimo is a de novo assembler that follows the overlap-layout-consensus paradigm. It accepts fasta files as input. The output files are in ace or fasta format and contain the contigs. Two parameters control the assembly: minimum overlap length and minimum identity. Decreasing the minimum identity gives a less fragmented assembly, but one with more errors. The same holds for decreasing the minimum overlap length. However, increasing the minimum overlap length is thought to sometimes produce better assemblies by resolving small repeated regions. Minimo can be used with short reads, but not if the number of reads is very high. It cannot be used to assemble large genomes, but for small genomes it offers very good quality assemblies.

Velveth
Velveth is a program that is part of the Velvet assembler. It takes sequence files as input and produces two output files (“sequences” and “roadmaps”). These two files are required by Velvetg.

Velvetg
Velvetg is the core of Velvet, where the de Bruijn graph is built and manipulated.

SPAdes (St. Petersburg genome assembler)
SPAdes is an assembler developed for bacterial genomes. Version 3.5.0 supports paired-end reads, mate-pairs and unpaired reads. The output folder contains the contigs and scaffolds in fasta and fastg formats. SPAdes is a de Bruijn graph based assembler and uses k-mers to build the graph.

K-mer
The reads are sequence fragments generated through DNA sequencing. K-mers are all possible subsequences of length k from a read.
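The k-mer definition above can be sketched in a few lines of Python. The second function shows the textbook way k-mers define a de Bruijn graph, where each k-mer is an edge from its first k-1 bases to its last k-1 bases. The function names and the toy read are illustrative assumptions, and this is not the internal data structure of Velvet or SPAdes.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all subsequences of length k from a read, in order."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Build a simple de Bruijn graph from the k-mers of the reads.

    Each k-mer contributes one edge: (first k-1 bases) -> (last k-1 bases).
    The graph is returned as a dict mapping a node to the set of its successors.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    toy_read = "ACGTA"  # hypothetical read, far shorter than real reads
    print(kmers(toy_read, 3))              # ['ACG', 'CGT', 'GTA']
    print(de_bruijn_edges([toy_read], 3))
```

A read of length n yields n - k + 1 k-mers, so a larger k gives fewer but more specific k-mers per read.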
K-mers are used in sequence assembly, among other applications. In sequence assembly, decreasing the k-mer size makes computation faster. Decreasing the size too much, however, increases ambiguity and makes the reconstruction of the genome more difficult. On the other hand, increasing the k-mer size can give a better assembly of repeated regions.

Contig (contiguous sequence of overlapping reads)
A contig is a set of overlapping DNA fragments that have been assembled together and represent a consensus region of DNA.

Scaffold
A scaffold is a series of contigs separated by gaps of known length. The gap lengths are known from paired-end sequencing. The gaps between the contigs in a scaffold can subsequently be sequenced, either by PCR amplification followed by sequencing for small gaps, or by BAC cloning followed by sequencing for larger gaps.

N50
N50 is a statistic used in genome assembly: it is the largest length such that contigs of this length or longer contain at least 50% of the assembled bases. N50 is similar to a median, but weighted by contig length, so long contigs count for more than short ones.
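As a worked example of the N50 definition, here is a short Python sketch. The function name and the contig lengths are made up for illustration and do not come from the assemblies in this report.

```python
def n50(contig_lengths):
    """Compute N50 for a list of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    summed until at least half of the total assembly length is reached;
    the length of the contig that crosses this threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
    # Total length is 290, so half is 145; 80 + 70 = 150 >= 145.
    print(n50(toy_lengths))  # 70
```

For these toy lengths the N50 is 70, while the mean contig length is about 48 and the median is 45, which shows how N50 gives more weight to long contigs.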
