Catalina Vasilescu
Student number: 014438655
Advanced practical course in genome bioinformatics 529028
Report Day1

Background
Next generation sequencing machines produce a huge number of short sequences, called reads, that have to be aligned and merged in order to reconstruct the original sequence. This is done with assembly tools. The first step is quality control and preprocessing of the data (adapter sequence removal, etc.). Then the reads are aligned and the overlapping reads are assembled into contigs. Scaffolding then orders and orients the contigs and estimates the size of the gaps between them. Sequencing the gap regions then yields the full genomic sequence. There are two types of genome assembly tools: OLC (overlap-layout-consensus) based assemblers and (de Bruijn) graph-based assemblers. We are testing three genome assemblers: Minimo, Velvet and SPAdes. Minimo is OLC based; Velvet and SPAdes are graph based.

Setting up the system
We use resources and software located at CSC, so we have to connect remotely to the CSC server (taito.csc.fi). For this we use PuTTY, a terminal emulator and SSH client for Windows and Unix systems that provides remote access to computers over the internet. In taito, we first create the folders we need and navigate between them with the Linux commands mkdir and cd. We then copy the materials with the rsync command and decompress them with unzip. In each login session we need to load a module from the server with the command module load biokit. This loads many programs and biological databases and makes them available for use.

Read quality checking
One of the first steps in data analysis is checking the quality of the data. We use the Java-based program FastQC, which performs simple quality control checks to verify that the raw data fall within expected ranges. The program accepts FASTQ, SAM and BAM files as input. We use FASTQ files.
FASTQ is a text-based format that stores both biological (nucleotide) sequences and their corresponding quality scores. The R1 and R2 reads are in separate FASTQ files. The output of the program is an HTML QC report that can reveal problems originating either in the sequencer or in the preparation of the library material. The left side of the report displays a summary of the analysis modules with color-coded flags (green, orange, red) that signal possible flaws in the data. The body of the report shows graphical representations of the analysed parameters.

Q1:
R1 and R2 are the forward and reverse reads of each pair from paired-end sequencing. The fragmented DNA is ligated to different, non-complementary adapters at its 3’ and 5’ ends, so that the primer for one adapter starts synthesis on only one strand. First we sequence using the primer for the adapter at one end and then using the primer for the second adapter. We obtain a few hundred bp from the 3’ end of the fragment and a few hundred bases from the 5’ end, in the opposite orientation to the first read.

Using FastQC we generated two HTML QC reports, one for the R1 reads and another for the R2 reads. The most important parameters to check are: per base sequence quality, sequence duplication levels, overrepresented sequences and adapter content. The last three show us that the adapter sequences have been removed and that there are no other repetitive sequences. Per base sequence quality shows how the quality score varies with the position of the base in the read. The quality score starts to decrease after position 150 in the reads. We can see a difference in quality between the R1 and R2 reads, with the R2 reads showing a stronger drop in quality. For the R2 reads, the quality drops below a score of 20 for bases at positions beyond 250 in the reads.
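
To make the per base sequence quality check more concrete, here is a minimal Python sketch of the idea behind that plot: decode the quality string of each FASTQ record into Phred scores, average them per read position, and report the first position where the mean falls below 20. This is not how FastQC itself is implemented. The function names, the toy reads and the Phred+33 encoding are assumptions made for illustration only; the real R1 and R2 files were not used here.

```python
import io
from itertools import islice

# Assumption: qualities are Phred+33 encoded (Sanger / Illumina 1.8+ style).
PHRED_OFFSET = 33


def read_fastq(handle):
    """Yield (name, sequence, quality string) from a 4-lines-per-record FASTQ stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:  # end of file (or a truncated last record)
            return
        header, seq, _plus, qual = record
        yield header[1:], seq, qual


def mean_quality_per_position(records):
    """Return the mean Phred score at each read position (index 0 = position 1)."""
    totals, counts = [], []
    for _name, _seq, qual in records:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - PHRED_OFFSET
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def first_position_below(means, threshold=20):
    """Return the first 1-based position whose mean quality is below threshold, or None."""
    for position, mean in enumerate(means, start=1):
        if mean < threshold:
            return position
    return None


def error_probability(q):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical toy data: two 6 bp reads whose quality drops towards the 3' end.
TOY_FASTQ = """@read1
ACGTAC
+
IIII5+
@read2
ACGTAC
+
II?I5+
"""

means = mean_quality_per_position(read_fastq(io.StringIO(TOY_FASTQ)))
print(means)
print(first_position_below(means))
```

A Phred score of 20 corresponds to an error probability of 1 in 100 (error_probability(20)), which is why 20 is a common threshold when judging the tail of the reads. On real data the same functions could be applied to an opened FASTQ file instead of the in-memory toy string.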