De Bruijn Graph Based De Novo Genome Assembly
JOURNAL OF SOFTWARE, VOL. 9, NO. 8, AUGUST 2014
© 2014 ACADEMY PUBLISHER doi:10.4304/jsw.9.8.2160-2168

Mohammad Ibrahim Khan
Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong, Bangladesh

Md. Sarwar Kamal
Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong, Bangladesh

Abstract: Next Generation Sequencing (NGS) is an important process that enables inexpensive handling of very large raw sequence data sets compared with traditional sequencing systems or methods. Several aspects of NGS, such as template preparation, sequencing and imaging, and genome alignment and assembly, shape the overall sequencing and alignment workflow. The de Bruijn graph (dBG) is an important mathematical tool that graphically analyzes how orientations are constructed in groups of nucleotides. In essence, a de Bruijn graph describes how genome segments are formed in a circular, iterative fashion. Some pivotal de Bruijn graph based de novo algorithms and software packages, namely T-IDBA, Oases, IDBA-tran, Euler, Velvet, ABySS, AllPaths, SOAPdenovo and SOAPdenovo2, are illustrated here.

Index Terms: Next Generation Sequencing (NGS), de Bruijn Graph (dBG), SOAPdenovo2

I. INTRODUCTION

In the age of the information superhighway, invention, discovery, analysis, synthesis, anti-synthesis and measurement are all managed by computing devices and algorithmic steps. To make real progress in computational analysis, changes should be made in the algorithms rather than in the systems. For this reason, researchers and scientists all over the world are working toward more efficient algorithms and automated methods. Over the last few years they have made a fundamental shift away from automated Sanger DNA sequencing for genome analysis. The automated Sanger method dominated the field for the last twenty-five years and led to a number of historic achievements, including the complete sequencing of the human genome. Despite the considerable theoretical and technical advances made during this era, the limitations of automated Sanger sequencing indicate the need for new and improved methods for analyzing vast volumes of genomes, whether from humans, plants, viruses or microorganisms. Nowadays, a number of novel ideas and practical efforts are directed toward developing new techniques that overcome the limitations of Sanger sequencing. Continuing innovation in DNA, RNA and protein sequencing techniques has accompanied the development of sequencing systems, with sharply reduced costs and better performance compared with the systems of two or three years ago. Sequencers from established vendors such as 454 Life Sciences (Roche), Illumina and Applied Biosystems are under continued development, and Helicos instruments have entered use. All these sequencers will produce more and more data and consequently increase the volume of the data sets. To manage this growing volume of data, the National Institutes of Health (NIH) expects to generate proper sequences within a couple of years. Consequently, the updated systems with higher throughput allow researchers to perform sequencing in less time and space.

A. Influence of Next Generation Technique

Sequencing of biological data sets has improved swiftly due to the appearance of massively parallel sequencing methodologies, which are collectively called Next Generation Sequencing (NGS). There is a huge difference between serial and parallel computing systems: a parallel system executes multiple instructions in the time a serial system takes to execute a single one. Since NGS methods support parallel computation, it is easy to obtain high-throughput sequencing of short reads at reduced cost [1], [2]. Consequently, NGS technologies are accelerating Computational Biology (CB) research in fields such as genomics, metagenomics, transcriptomics, proteogenomics, gene expression analysis, DNA sequencing, non-coding RNA discovery and protein-protein interaction alignment [3]. Proper genome assembly is cumbersome and costly in both time and space complexity. Moreover, it is an NP-hard problem, which means that no efficient algorithm is known for exact assembly. The assembly problem arises because current sequencing methods cannot read a complete genome sequence in a single read.

B. Genome Assembler Necessity

Genome assembly is the process of combining many small segments into one complete sequence after the genome has been subdivided into segments for sequencing. The shotgun sequencing process subdivides the complete genome into arbitrary sections and reads each section independently. With NGS technology, a sequencer can produce ample amounts of DNA sequencing data in a single run at very low cost. However, most of the generated data are complicated by a high density of sequencing errors and by genomic repeats. For this reason, designing a genome assembler for NGS is difficult: exact computational methods, and systems that implement them, are lacking. To handle the difficulties of NGS assembly, a strong and fundamental structure is needed that supports efficient data management. A genome assembler requires a proper input data set consisting of two files, a sequence container and a score container. NGS methods produce short reads efficiently and in large numbers, so processing these files is memory-intensive. Most assemblers use graph-oriented data structures to simplify the assembly process and to save time and memory. Assemblers nevertheless differ in graph formation, graph design, graph traversal and graph resolution [4].

Scientists and researchers use two distinct types of genome assembly in DNA sequencing research. These are:
1) The comparative process of genome assembly.
2) The de novo process of genome assembly.

In comparative genome assembly, a reference genome from the same organism or a similar species is used as a guide that directs the assembly through pairwise alignment of the genome segments. This process is mainly used in genome re-sequencing [5]. De novo assembly, on the other hand, does not use any reference as a guide and is used to reconstruct genomes that are not identical to previously sequenced genomes [6]. A comparison between comparative and de novo genome assembly is given in Table 1.

TABLE 1. Comparison between comparative and de novo genome assembly

Comparative genome assembly              | De novo genome assembly
Short reads can map to several locations | Performs well when there is no reference data set
Repeated segments are problematic        | Two important paradigms: overlap graph and k-mer graph
Accuracy rate is high                    | Accuracy rate is lower than that of comparative genome assembly

C. Next-Generation Sequencing Technologies

The latest methods that offer reduced cost, accurate sequencing and alignment, and high-throughput computing are called Next Generation. These updated approaches follow a blueprint that depends on the integration of template design, sequencing, pairwise genome alignment and genome assembly methods. The progress of NGS approaches in computational biology, computational chemistry, genetic engineering and genetic breeding has changed the way we think about automated and scientific processes in all aspects of research and analysis. The major advance offered by NGS is the ability to produce an immense volume of data at reasonable cost, on the order of a few billion short reads per run. This capability extends sequencing beyond simply determining the order of the bases in a genome. For example, microarray techniques have been replaced by sequence-based techniques, which can identify and verify rare transcripts without prior knowledge of any specific gene and can provide information about alternative splicing and sequence variation in the identified genes [7], [8]. The ability to sequence the complete genomes of many related organisms has enabled substantial comparative studies and new lines of investigation that were inconceivable just a couple of years ago. The broad applicability of NGS eases analysis across different animals and leads toward a more exact understanding of influences on health and disease.

The diversity of NGS features makes it valuable for new genomic discoveries in any research lab. Notable platforms such as 454, GA, MiSeq, HiSeq, SOLiD, Ion Torrent, the RS system and HeliScope share the ability to process data in parallel, which greatly increases the volume of data generated in a single run [9], [10]. Some platforms produce short reads: about 75 bp for SOLiD [11], 100 to 150 bp for Illumina [12], 200 bp for Ion Torrent [12] and 400 to 500 bp for 454 [12]. Pacific Biosciences produces long reads, but these suffer from multiple errors [13]. In addition, each of these implementations has some limitations.
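
As a rough illustration of the k-mer graph paradigm that Table 1 lists for de novo assembly, the following Python sketch builds a small de Bruijn graph from short overlapping reads and walks it to recover a contig. The function names (kmers, build_de_bruijn, extend_contig), the toy reads and the choice k = 4 are assumptions made for this example only; they are not taken from any assembler discussed in this paper, and real assemblers additionally deal with sequencing errors, repeats and reverse-complement strands.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k from a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer
    contributes one edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk from `start` while the current node has exactly one successor.
    The walk stops at a dead end, at a branch, or when a node repeats."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # avoid looping forever on a cycle
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical error-free reads that overlap along a 10-base toy sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

On these toy reads the walk recovers the full 10-base sequence because every 3-mer node has a single successor. When a genome contains a repeated (k-1)-mer followed by different bases, the corresponding node acquires several successors and this simple walk stops there, which hints at why repeats and errors make de novo assembly hard.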