
Lossless Segment with Lempel-Ziv-Welch Compression Algorithm Based DNA Compression

Dr. S. JOTHI, Associate Professor - CSE, St.Joseph’s College of Engineering, Chennai, India, [email protected]
Dr. A. CHANDRASEKAR, Professor & Head - CSE, St.Joseph’s College of Engineering, Chennai, India, [email protected]
Mr. R. RANJITH, Assistant Professor - CSE, St.Joseph’s College of Engineering, Chennai, India, [email protected]

Abstract: The DNA (Deoxyribonucleic Acid) sequences stored in databases are so numerous that researchers have proposed a variety of algorithms for their compression. The lossless segment compression employed here, together with the LZW (Lempel-Ziv-Welch) method, is based on the characteristics of the DNA bases. LZW is a dictionary-based compression scheme, and combining lossless segmentation with LZW makes the proposed algorithm more efficient: the DNA sequence is first segmented and then encoded, which reduces the compression ratio. The data are processed through compression and decompression so that the complete DNA sequence can be recovered, and the compressed and decompressed DNA base data occupy equally sized memory regions used for storing the bases. In this work the gene locations are identified and the distances between the nucleotide bases are calculated, which reduces the volume of memory needed for gene analysis and drug-discovery procedures. The proposed method achieves a higher compression ratio than the other compression methods compared.

Keywords: DNA sequence, LZW, correlation, compression ratio.

I. INTRODUCTION

Real-world data sets are growing into huge volumes because data is accumulated continuously. This causes three problems: first, large storage is required; second, the cost of computation and processing increases; and third, the time complexity increases. Bioinformatics combines biology, information technology and computer science into a single discipline. DNA carries the genetic information used in research on the development of all living micro- and macro-organisms. A DNA sequence consists of a double-helix structure formed by two biopolymer strands, whose subunits are joined in a chain by a covalent bond between the 5' phosphate (5'PO4) group of one nucleotide and the 3' hydroxyl (3'OH) group of the next. The nucleotide bases Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) occur as repeated patterns in a DNA sequence, as shown in Figure 1. DNA sequencing supports the developing theoretical foundations of many fields, and the DNA sequences that encode life are known to be non-random. The central dogma of life is hidden in DNA: DNA is transcribed into mRNA, which is translated into proteins, and proteins play a major role in regulating all biological functions. Although the human genome contains about 3 billion base pairs, only about 3% of it encodes proteins. Because of this large size, genomes cannot be downloaded or shared over the network except by those with a large amount of available resources. Biological fields such as biotechnology, bioinformatics and forensic science are built on knowledge of DNA sequences. In this proposed work the DNA sequence is compressed.

Fig. 1. Normal structure of a DNA sequence
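As a brief preview of the dictionary step used in the proposed pipeline, the following Python snippet is a minimal, generic LZW encoder applied to a single nucleotide segment. It is only an illustrative sketch under the assumption that the dictionary is seeded with the four bases; it is not the authors' exact implementation.

def lzw_encode(segment):
    # Minimal LZW encoder for one DNA segment (illustrative sketch only).
    dictionary = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed initial alphabet
    next_code = 4
    current = ""
    codes = []
    for base in segment:
        candidate = current + base
        if candidate in dictionary:
            current = candidate                      # keep extending the match
        else:
            codes.append(dictionary[current])        # emit code for the longest known prefix
            dictionary[candidate] = next_code        # learn the new pattern
            next_code += 1
            current = base
    if current:
        codes.append(dictionary[current])
    return codes

print(lzw_encode("ATATATATGGGG"))                    # -> [0, 3, 4, 6, 3, 2, 9, 2]

On repetitive segments the emitted code list grows much more slowly than the input, because longer and longer repeated patterns are replaced by single dictionary codes; this is the effect that segmentation followed by LZW exploits.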
Reducing the number of bits needed to represent data is called compression. Compression has two elements. One is encoding, the process of changing the text into a code; the result is called the compressed data. The other is decoding, the reverse of encoding, in which the code is converted back into text; the result is called the decompressed data. There are two types of data compression, lossy and lossless. Lossless compression reduces the number of bits while still being able to reproduce the original bits (data) exactly; there is no data loss in this method. Examples of lossless compression methods commonly used for text data are Huffman coding, arithmetic coding, Shannon-Fano coding, adaptive Huffman coding, Run Length Encoding (RLE) and Modified Run Length Encoding (MRLE). RLE is a compression technique in which runs of identical nucleotide bases are consolidated and represented numerically; for example, the input AAAATTTTCCCGGGG is encoded as 4A4T3C4G. In MRLE the nucleotide bases are converted to corresponding binary values according to the arrangement of the bases [25]. The Extended-ASCII method is an enhancement of RLE [4]. COMRAD (COMpression using Redundancy of DNA) is a method that extracts repeats and identifies palindromes [6]. Lossy compression reduces bits by removing redundant information from images, video and audio.

The most commonly used human genomic data are available in public databases such as GenBank, a huge repository whose stored DNA sequence data grows exponentially. Resources for DNA sequence compression are available over the internet from various sources. Generally, sequences are deposited in the following locations for domain-oriented research: 1. the NCBI databases (www.ncbi.nlm.nih.gov/), 2. the DNA Data Bank of Japan (DDBJ) (www.ddbj.nig.ac.jp/), and 3. the European Molecular Biology Laboratory (EMBL) (www.ebi.ac.uk/embl/). The data sets used for the DNA sequence comparison in this work are taken from these sources.

II. RELATED WORKS

DNA is the abbreviation of Deoxyribonucleic Acid. DNA sequences consist of the nucleotide bases A, G, T and C, the basic subunits of genetics found in all living micro- and macro-organisms. DNA sequences can be used for storing data, but such storage occupies a large amount of memory, so the data in DNA sequences must be compressed. Many algorithms have been developed and implemented for this purpose. The compression is challenging because of the complexity of the sequences, and it is unlike the compression of a normal file.

DNA sequences can be compressed in two modes: (i) Horizontal Mode (HM) and (ii) Vertical Mode (VM). In HM, the genetic sequence is compressed by using information extracted from the DNA structure itself, through simple references to substrings of the sequence; most compression methods have evolved in this mode [8]. Horizontal-mode methods can be classified into A. substitution-based methods, B. statistical-based methods, C. substitution and statistical based methods, and D. grammar-based methods.

A. Substitution Based Methods

Biocompress-1 and Biocompress-2 are compression algorithms for DNA and RNA sequences. In these methods the DNA sequence bases are converted into binary values, and palindromes are treated as the specific form of redundancy among the DNA bases. The approach cannot handle amino acids, and the algorithm is not suited to all types of DNA sequences [11].
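Both MRLE and the substitution-based methods above start from the observation that an alphabet of four bases needs only two bits per symbol instead of one byte. The sketch below shows this baseline two-bit packing in Python; the particular mapping A=00, C=01, G=10, T=11 is an assumption made for illustration and is not taken from any of the cited algorithms.

BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}   # assumed mapping
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def pack(sequence):
    # Encode a DNA string at two bits per base.
    return "".join(BASE_TO_BITS[base] for base in sequence)

def unpack(bits):
    # Decode the bit string back to the DNA string; the round trip is lossless.
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

sequence = "AAAATTTTCCCGGGG"
assert unpack(pack(sequence)) == sequence
print(len(pack(sequence)), "bits instead of", 8 * len(sequence))   # 30 bits instead of 120

DNA-specific compressors are normally judged against this two-bits-per-base baseline: repeat detection, palindrome handling or dictionary coding must push the rate below two bits per base to be worthwhile.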
Cfact uses a two-pass algorithm to search for repeats and reverse complements: the first pass builds the suffix tree of the sequence and the second pass performs the actual encoding in the LZ style; non-repeat regions are also encoded. It is not easy to compress data with this algorithm because it takes more time [24].

The GenCompress algorithm performs compression with three edit operations: replacing a character, inserting a character and deleting a character; using these three operations it can transform one string into another. Repeated strings are identified using the Hamming distance method [6]. DNACompress produces a slightly better compression ratio with faster compression than other algorithms and can deal with large DNA sequences [8]. The DNABIT compression algorithm is another technique, but its executable code and source are not available, which makes reproducing the compression process very difficult and also increases the time and space complexity [14]. DNAPack uses the Hamming distance for repeats and inverted repeats, while non-repeat regions are encoded by the best choice from options such as order-2 arithmetic coding; because of its expensive calculation, the algorithm is not expected to be suitable for long sequences [23][5].

B. Statistical Based Methods

The finite context model algorithm is a lossless compression technique in which two finite-context models of different orders are used to complete the encoding process. The lower-order model and the higher-order model are compared, and only the model that currently represents the data best is used, while both models keep being updated [15].

C. Substitution and Statistical Based Methods

CTW+LZ is the abbreviation of Context Tree Weighting + Lempel-Ziv 77. The method detects approximately repeated bases with CTW; its execution time is high and depends on the DNA sequence, but it gives a good compression ratio [18]. NML (Normalized Maximum Likelihood) is a compression technique in which the algorithm first divides the DNA bases into blocks of a fixed size and then finds a "regressor" that has the minimum Hamming distance from the current block [26]. The Burrows-Wheeler Transform (BWT) is a block-sorting process: the sequence is transformed before the real compression, and although simple it gives unexpectedly good performance; an RLE procedure is then applied. The DNA sequence consisting of the various bases is given as the data input, the consecutive bases are converted into binary format, and in the consecutive series, if the 9th bit is equal it is represented as 1, otherwise as 0 [25]. The DNACRAMP tool was developed for DNA sequence compression.
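To make the block-sorting idea behind BWT concrete, here is a textbook sketch in Python: all rotations of a block are sorted and the last column is emitted. The "$" end marker and the toy input are illustrative assumptions; this is not the DNACRAMP or [25] implementation.

def bwt(block, end_marker="$"):
    # Textbook Burrows-Wheeler Transform of one block (sketch only).
    s = block + end_marker                                     # unique marker keeps the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))   # the block-sorting step
    return "".join(rotation[-1] for rotation in rotations)     # last column of the sorted rotations

print(bwt("GATTACA"))                                          # -> 'ACTGA$TA'

Sorting the rotations groups symbols that share the same context, so the last column tends to contain long runs of identical bases, which is why a subsequent RLE pass, as described above, compresses the transformed block well.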