
ISSN: 1748-0345 (Online) www.tagajournal.com

Lossless Segment with Lempel-Ziv-Welch Compression Based DNA Compression

Dr. S. JOTHI, Associate Professor - CSE, St. Joseph’s College of Engineering, Chennai, India, [email protected]

Dr. A. CHANDRASEKAR, Professor & Head - CSE, St. Joseph’s College of Engineering, Chennai, India, [email protected]

Mr. R. RANJITH, Assistant Professor - CSE, St. Joseph’s College of Engineering, Chennai, India, [email protected]

Abstract: The number of DNA (Deoxyribonucleic Acid) sequences stored in databases is so large that it has led researchers to propose various methods for their compression. The lossless segment compression employed here together with the LZW (Lempel-Ziv-Welch) method is based on the characteristics of DNA bases. LZW is a dictionary-based compression method. Combining lossless segment compression with LZW makes our algorithm more efficient. In this algorithm the DNA sequence is first segmented and then encoded, so the compression ratio is reduced. The data are processed through compression and decompression to recover the complete DNA sequence. The compressed and decompressed DNA base data should occupy equally spaced memory, which is used for storing the DNA bases. In the proposed work, the gene locations are identified and the distances between the nucleotide bases are calculated. The method reduces the volume of memory needed to store the data used in gene analysis and drug discovery procedures. The method achieves higher compression than the other compression methods.

Keywords: DNA sequence, LZW, correlation, compression ratio.

I. INTRODUCTION

Real-world data sets are multiplying into huge volumes owing to the continuous accumulation of data. This causes three problems: first, large storage is required; second, the cost of computation and processing increases; third, the time complexity increases. Bioinformatics combines biology, technology and computer science into a single discipline. DNA holds the genetic information that is used in research on the development of all living micro and macro organisms. A DNA molecule has a double helix structure formed by two biopolymer strands. The DNA subunits are joined in a chain by a covalent bond between the 5’ PO4 (5 prime phosphate) group of one nucleotide and the 3’ OH (3 prime hydroxyl) group of the next. The nucleotide bases Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) occur as repeated patterns in a DNA sequence, as shown in Figure 1. DNA sequencing supports the development of many fields and their theoretical foundations. The DNA sequences that encode life are nonrandom. The central dogma of life is hidden in the DNA: DNA is transcribed into mRNA, which is translated into proteins. Proteins play a major role in regulating all biological functions. Although 3 billion base pairs are present, only 3% of them encode proteins. Because of this large size, the sequences cannot be downloaded or shared over the network by anyone except those with a large amount of available resources. Biological fields such as biotechnology, bioinformatics and forensic science rely on the knowledge of DNA sequences. In this proposed work the DNA sequence is compressed.

Fig. 1. Normal structure of a DNA sequence

© 2018 SWANSEA PRINTING TECHNOLOGY LTD 1548 TAGA JOURNAL VOL. 14 ISSN: 1748-0345 (Online) www.tagajournal.com

Compression is the process of reducing the size of the data. Compression has two elements. The first is encoding, the process of converting the text into a code, which is called the compressed data. The second is decoding, the reverse of encoding, in which the code is converted back to text, which is called the decompressed data. There are two types of compression: lossy and lossless. Lossless compression reduces the number of bits but can reproduce the original bits (or data) exactly; there is no data loss in this method. Examples of lossless compression well suited to text data include, among others, Shannon-Fano coding, Run Length Encoding (RLE) and Modified Run Length Encoding (MRLE). RLE is a compression technique in which runs of identical nucleotide bases are consolidated and represented by their counts.

For example:

Input data : AAAATTTTCCCGGGG

Output data: 4A4T3C4G.
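The RLE example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `rle_encode` and `rle_decode` are illustrative, and the decoder assumes that the symbols are non-digit characters such as A, T, C and G.

```python
from itertools import groupby


def rle_encode(sequence: str) -> str:
    """Encode each run of identical symbols as <count><symbol>."""
    return "".join(f"{len(list(run))}{base}" for base, run in groupby(sequence))


def rle_decode(encoded: str) -> str:
    """Invert rle_encode; assumes symbols are never digits."""
    out = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch  # counts may span several digits
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


print(rle_encode("AAAATTTTCCCGGGG"))  # 4A4T3C4G
```

Note that RLE only shrinks the input when runs are long; a sequence such as ACGT would grow to 1A1C1G1T.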

MRLE is a compression technique in which the nucleotide bases are converted to their respective binary values according to the arrangement of the bases [25]. The Extended-ASCII method is an enhancement of RLE [4]. COMRAD (Compression using Redundancy of DNA) is a method in which repeats and palindromes are identified and extracted [6]. Lossy compression reduces bits by removing redundant information from data such as images and audio. The most commonly used human genomic data are available in public databases such as GenBank. GenBank is a huge database, and the DNA sequence data stored in it are growing exponentially. Resources for DNA sequence compression are available over the internet from various sources. Generally, the resources are deposited in the following locations for domain-oriented research.

1. (www.ncbi.nlm.nih.gov/) the NCBI databases,

2. (www.ddbj.nig.ac.jp/) the DNA Database of Japan(DDBJ),

3. (www.ebi.ac.uk/embl/) the European Molecular Biology Laboratory (EMBL).

The data sets of DNA sequences used for comparison in the proposed work are taken from these sources.

II. RELATED WORKS

DNA stands for Deoxyribonucleic Acid. DNA sequences consist of the nucleotide bases A, G, T and C. These are the basic subunits of genetics, found in all living micro and macro organisms. DNA sequences can be used for storing data, but such storage occupies a large amount of memory. Thus, the data in DNA sequences must be compressed. Many algorithms have been developed and implemented for this purpose. This compression is challenging because of its complexity, and it is not like the process of ordinary file compression.

DNA sequences can be compressed in two modes: (i) Horizontal Mode (HM) and (ii) Vertical Mode (VM). In HM, the genetic sequence is compressed using information extracted from the DNA structure itself, by making simple references to the substrings of the sequence. The compression methods of this mode are evaluated here [8]. Horizontal mode can be classified into:

A. Substitution Based Methods

B. Statistical based methods

C. Substitution and statistical based methods

D. Grammar based methods

A. Substitution Based methods:

Biocompress-1 and Biocompress-2 are compression algorithms for DNA and RNA sequences. In this method the DNA sequence bases are converted into binary values. Palindromes are a specific form of redundancy in the DNA bases. The method does not support amino acids, and it is not suited to all types of DNA sequences [11].

Cfact uses a two-pass algorithm to search for reverse complements, and this process repeats. The first pass builds the suffix tree of the sequence and the second does the actual encoding using LZ. Non-repeat regions are also encoded. This algorithm is not easy to apply to the data because it takes more time [24].


The GenCompress algorithm performs compression using three steps: it first replaces a character, then identifies a particular character, and then deletes a character. Using these three steps, one string can be transformed into another. Repeated strings are identified using the Hamming distance [6]. DNACompress produces a slightly better compression ratio, with faster compression, than other compression algorithms, and it can handle large DNA sequences [8]. DNABIT Compress is another compression technique. Its executable code and source are not available, which makes the compression process very hard to carry out, and it also increases the time and space complexity [14]. DNAPack uses the Hamming distance for repeats and inverted repeats. Non-repeat regions are encoded by the best option among alternatives that include order-2 arithmetic coding. It is expected that the algorithm will be slow due to the expensive calculation, and it will not be suitable for long sequences [23][5].
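Several of the methods above use the Hamming distance to find approximate repeats. The following is a minimal sketch of that idea, not the search procedure of any cited tool: `hamming_distance` counts mismatching positions between two equal-length strings, and the hypothetical helper `closest_earlier_block` scans all earlier, non-overlapping windows by brute force.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def closest_earlier_block(sequence: str, start: int, length: int):
    """Return (offset, distance) of the earlier window most similar to
    sequence[start:start + length], or None if no earlier window exists."""
    if start + length > len(sequence):
        raise ValueError("block runs past the end of the sequence")
    target = sequence[start:start + length]
    best = None
    # Only windows that end at or before `start` are considered.
    for offset in range(0, start - length + 1):
        d = hamming_distance(sequence[offset:offset + length], target)
        if best is None or d < best[1]:
            best = (offset, d)
    return best


print(hamming_distance("ACGTAC", "ACCTAG"))  # 2
```

The brute-force scan is quadratic in the sequence length, which is consistent with the remark above that such calculations are expensive for long sequences.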

B. Statistical Based Methods.

The finite-context model algorithm is a lossless compression technique. In this method two finite-context models of different orders are used to complete the encoding process. One of them acts as the low-order model. The two models are compared and only the better one is used for encoding, while both models continue to be updated [15].

C. Substitutional and statistical based methods

CTW+LZ stands for Context Tree Weighting + Lempel-Ziv 77. This method detects approximately repeated bases with CTW. Its execution time is high and depends on the DNA sequence, but it has a good compression ratio [18]. NML (Normalized Maximum Likelihood) is another compression technique. The algorithm first divides the DNA bases into blocks of a fixed size and then finds a “regressor”, the block that has the minimum Hamming distance from the current block [26].

The Burrows-Wheeler Transform (BWT) is a block-sorting process. It transforms a sequence before the actual compression; it is simple and performs unexpectedly well. An RLE procedure is then used. The DNA bases in a sequence are given as the input. The bases are consecutive in nature and are converted into binary format. In the consecutive series, if the 9th is equal it is represented as 1; otherwise it is represented as 0 [25]. The DNACRAMP tool was developed for DNA sequence compression. Encoding and decoding are performed in two stages with an index-bounded array, a linear data structure. This method can be applied to any kind of DNA set.

D. Grammar Based Method

The grammar-based method is a compression method in which a context-free grammar is used. The DNA Sequitur method is an on-line, linear-time algorithm in which the DNA bases, represented by lowercase letters such as a, c, t and g, are grouped into pairs, and the pairs are converted into uppercase letters assigned to the base pairs [10]. Huffbit is a compression technique in which Huffman codes are used. The DNA bases arranged in a sequence are given as input. The frequency value, which is the number of occurrences of each base, is calculated. Based on the frequency values, the bases are stored in a tree data structure. The tree is labeled with the Huffman code bits (0, 1), and each base is represented in binary form [1]. The hash-based algorithm transforms the completely scanned DNA bases into factors of length 4. This transformation is hash based: it constructs a hash table and assigns a unique character to each factor as a key, called the hash key [2]. LSBD is a compression technique in which a part-by-part compression method is adopted. During the compression, a lookup table consisting of the index value of each of the bases is first constructed. The bases arranged in the sequence are read continuously; when there are characters other than the nucleotide bases, they are removed. After removal, the bases are compressed accordingly [9].
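The frequency-then-tree procedure described for Huffman-based coding can be sketched as follows. This is generic Huffman coding under stated assumptions, not the Huffbit implementation itself: the function name `huffman_codes` is illustrative, and the tree is kept implicitly as dictionaries of partial codes inside a priority queue.

```python
import heapq
from collections import Counter


def huffman_codes(sequence: str) -> dict:
    """Build a prefix-free binary code from symbol frequencies."""
    freq = Counter(sequence)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps the dictionaries from being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lightest subtree gets bit 0
        w2, _, right = heapq.heappop(heap)  # next lightest gets bit 1
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


codes = huffman_codes("AAAAAAAACCCCGGT")
encoded = "".join(codes[base] for base in "AAAAAAAACCCCGGT")
print(codes, len(encoded))
```

For this skewed example the most frequent base receives a 1-bit code and the rarest bases receive 3-bit codes, so the 15 bases take 25 bits instead of the 30 bits of a fixed 2-bit encoding.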

Vertical mode makes use of the information shared between two DNA sequences. Of these sequences, one is considered the reference sequence.

III. PROPOSED WORK

DNA sequence analysis is useful in diverse areas such as forensics, drug discovery, pharmacy and agriculture. It is necessary to address the storage problem of these exponentially growing data. This project proposes the lossless segment with LZW (Lempel-Ziv-Welch) algorithm, a DNA sequence compression technique. The Lempel-Ziv compression algorithm is very effective compared with other compression techniques. However, an individual file must be maintained for the data.


Advantages are:

• Compression ratio is less

• Time consumption is less

• Performance is high

• Efficient storage

IV. ARCHITECTURE DIAGRAM

The proposed system architecture, shown in Figure 2, consists of the modules of the proposed work.

Fig. 2. Architecture Diagram

V. IMPLEMENTATION

The implementation is divided into four modules. Together, these modules compress the DNA sequence, which reduces both the memory space required and the compression ratio.

A. DNA sequence uploaded

B. Sequence alignment

C. Correlation based feature selection

D. LZW compression.

A. DNA sequence uploaded

The DNA sequence is generated, the sequences are aligned based on the genome proteins, and correlation-based selection is used for feature extraction from the DNA sequences. When the lossless segment-based method is adopted, the compression ratio is reduced to half. When the LZW method is then applied, the compression ratio is reduced further. The software is used for the compression and decompression of files in different formats.

B. Sequence Alignment

Genomic alignment tools concentrate on DNA alignments while accounting for characteristics present in genomic data. Sequence alignment is the process of arranging the DNA bases to discover regions of similarity, to calculate the distance between identical bases, and to find locations that may indicate relationships between the bases. The sequences are the amino acids for residues 120-180 of the proteins.

As DNA sequences proliferate into huge volumes, many algorithms are being proposed and developed for DNA compression, making the domain a rapidly growing one. Sequence alignment is the initial step in comparative sequence analysis; it identifies homologous nucleotide base positions in the given data.


Many algorithms have been proposed and implemented for DNA sequence compression, but many of them are slow. Even though various forms of dynamic programming are adopted for large data sets, there is no guarantee of identifying the correct method.

C. Correlation Based Feature Selection

Correlation is the process of checking the similarity between two random DNA sequences. If the two random bases are the same, the coefficient value is +1 or -1; otherwise the coefficient value is 0. In the proposed method, correlation-based feature selection is used to remove unwanted data by calculating the similarity between two random variables. It also reduces the size of the DNA sequence.

D. LZW Compression

LZW is a dictionary-based method. The input string consists of the four nucleotide bases of the DNA and is long. The LZW process includes both compression and decompression. During compression, the length of the input is reduced. The method is based on index values: every base is initially assigned a code. When an input string is read, it is replaced with its code. When no code is found for a particular input string, a new code is generated at that instant and assigned to that string. The process continues until the whole input has been read and completely replaced by its equivalent code values. Thus, the compression is done. During decompression, the codes are converted back to the input string.
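The dictionary procedure described above can be sketched in Python as follows. This is a minimal sketch of textbook LZW over the four-base alphabet only; it does not include the lossless segment step of the proposed method, and the function names and the initial code assignment (A=0, C=1, G=2, T=3) are assumptions made for illustration.

```python
def lzw_compress(sequence: str, alphabet: str = "ACGT") -> list:
    """Replace the longest known strings by their dictionary codes."""
    dictionary = {base: i for i, base in enumerate(alphabet)}
    codes = []
    current = ""
    for base in sequence:
        if base not in alphabet:
            raise ValueError(f"unexpected symbol: {base!r}")
        candidate = current + base
        if candidate in dictionary:
            current = candidate  # keep extending the match
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)  # new code on the fly
            current = base
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: list, alphabet: str = "ACGT") -> str:
    """Rebuild the dictionary while decoding, mirroring lzw_compress."""
    if not codes:
        return ""
    dictionary = {i: base for i, base in enumerate(alphabet)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Code refers to the entry that is about to be created.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[len(dictionary)] = previous + entry[0]
        previous = entry
    return "".join(out)


sequence = "AAAATTTTCCCGGGG"
codes = lzw_compress(sequence)
assert lzw_decompress(codes) == sequence
print(len(sequence), "bases ->", len(codes), "codes")
```

The decoder never needs the dictionary to be transmitted, because it reconstructs each new entry from the codes it has already seen. The actual saving in bits depends on how the codes are written out, which this sketch leaves open.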

VI. EXPERIMENTAL DISCUSSIONS

During data acquisition the DNA sequences are first uploaded as input. The sequences consist of the bases A, T, G and C only. The uploading process is shown in Figure 3.

Fig. 3. Upload the DNA sequence.

The distance between occurrences of each individual base is calculated; i.e., the distance between one adenine component and the next adenine component is calculated. The same process is repeated for the other bases. The distance calculation is shown in Figure 4.
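One way to read this distance calculation is as the gap, in positions, between consecutive occurrences of the same base. The sketch below follows that reading, which is an assumption since the text does not define the distance formally; the function name `same_base_distances` is illustrative.

```python
from collections import defaultdict


def same_base_distances(sequence: str) -> dict:
    """Map each base to the gaps between its consecutive occurrences."""
    last_seen = {}
    distances = defaultdict(list)
    for position, base in enumerate(sequence):
        if base in last_seen:
            distances[base].append(position - last_seen[base])
        last_seen[base] = position
    return dict(distances)


print(same_base_distances("ATGCATTA"))  # {'A': [4, 3], 'T': [4, 1]}
```

Bases that occur only once (G and C in the example) have no following occurrence and therefore no entry in the result.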


Fig. 4. Character Distance calculation

The DNA bases adenine (A), thymine (T), guanine (G) and cytosine (C) are placed in the DNA sequence using the suffix tree method, as shown in Figure 5.

Fig. 5. DNA sequence alignment.

The DNA sequence containing the adenine, thymine, guanine and cytosine base components is shifted in such a way that it can be reduced by the pruning process and by the correlation feature selection method. The reduced DNA sequence is shown in Figure 6.

Fig. 6. DNA Sequence correlation feature selection.


VII. CONCLUSION

The dictionary-based compression algorithm LZW is better than the other compression methods. The compressed data can be stored in minimum memory space, but the sequence arranging process takes a longer time. Even though various algorithms have been proposed for DNA sequences, they may not be suitable for bases having certain arrangements. The survey found that the compression ratios of various DNA sequence compression methods are greater in value than that of the proposed method. For instance, the Extended-ASCII algorithm has a compression ratio of 1.68 bpb, the MRLE algorithm has a compression ratio of 1.67 bpb and COMRAD has a ratio of 1.88 bpb, whereas the compression ratio of the proposed algorithm is lower than these values. In application, the lower the value of the compression ratio, the greater the efficiency of the compression.
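The compression ratios above are quoted in bits per base (bpb). Assuming the usual definition, compressed size in bits divided by the number of bases, the figure can be computed as in the sketch below. The function name and the sizes in the example are hypothetical and are not measurements from this work.

```python
def bits_per_base(compressed_size_bytes: int, base_count: int) -> float:
    """Compressed size in bits divided by the number of bases encoded."""
    if base_count <= 0:
        raise ValueError("base_count must be positive")
    return 8 * compressed_size_bytes / base_count


# Hypothetical example: 1,000,000 bases stored in 250,000 bytes.
print(bits_per_base(250_000, 1_000_000))  # 2.0
```

A plain fixed-length encoding of four bases needs 2 bits per base, so under this definition any value below 2.0 bpb indicates a saving over that baseline.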

REFERENCES

[1] Afify, H., Islam, M., Abdel-Wahed, M., et al., 2010, Genomic Sequences Differential Compression Model, Proc., 27th National Radio Science Conf., Egypt.

[2] Ateet Mehta, et al., 2010, “DNA Compression using Hash Based Data Structure”, IJIT&KM, Vol. 2, No. 2, pp. 383-386.

[3] Bao, S., Chen, S., Jing, Z., et al., 2005, A DNA Sequence Compression Algorithm Based on LUT and LZ77, Processing and Information Technology, Proc., 50th IEEE International Symposium, 23–28.

[4] Bacem Saada, Jing Zhang, “DNA Sequences Compression Algorithm Based on Extended-ASCII Representation”, Proceedings of the World Congress on Engineering and Computer Science 2015, Vol. II.

[5] Bharti, R. K., and Singh, R. K., 2011, A Biological Sequence Compression based on Look up Table (LUT) using Complementary Palindrome of Fixed Size, International Journal of Computer Applications, 35(11), 55-58.

[6] Biji C. L. and Achuthsankar S. Nair, “Benchmark Dataset for Whole Genome Sequence Compression”, IEEE Transactions on Computational Biology and Bioinformatics, 1545-5963, 2016.

[7] Chen.X, Kwong.S and M. Li, "A Compression Algorithm for DNA Sequences and It's Applications in Genome Comparison," The 10th Workshop on Genome and Informatics, (GIW99), 1999, vol. 10, 51-61.

[8] Chen, X., Li, M., Ma, B., et al., 2002, DNACompress: fast and effective DNA sequence Compression, Bioinformatics, 18(12), 1696–1698.

[9] Choi Ping Paula Wu, 2008, et al., “ Cross chromosomal similarity for DNA sequence compression”, Bioinformatics 2(9): 412-416

[10] Cherniavsky.N and Ladner.R, "Grammar-based Compression of DNA Sequences," UW CSE Technical Report (TR2007-05-02), presented at the DIMACS Working Group, 2004.

[11] Grumbach, S., and Tahi, F., 1994, A new challenge for compression algorithms: genetic sequences, Information Processing & Management, 30(6), 875–886.

[12] Grumbach.S and Tahi.F, "Compression of DNA Sequences," in Proc. of the Data Compression Conf., (DCC '93), 1993, 340–350.

[13] Genbank size, (2013),[Online]. Available:http://ftp.ncbi.nih.gov/genbank/gbrel.txt

[14] Lee A. J. T., Chang.C and Chen.C, "DNAC: An Efficient Compression Algorithm for DNA Sequences," National Taiwan University, Taipei, Taiwan 10617, R.O.C., 2004.

[15] Loewenstern.D, and P. N. Yianilos, "Significantly lower entropy estimates for natural DNA sequences," in Proc. of the Data Compression Conf., (DCC '97), 1997, 151–160.

[16] Ma, B., Tromp, J. and Li, M., 2002, Pattern Hunter: faster and more sensitive homology search, Bioinformatics, 18(3), 440–445.

[17] Manzini, G., and Rastero, M., 2004, A Simple and Fast DNA Compressor, Software: Practice and Experience, 34(14), 1397–1411.

[18] Matsumoto, T., Sadakane, K., and Imai, H., 2000, Biological Sequence Compression Algorithms, Genome Informatics, vol. 11, 43–52.

[19] Postolico.A, et al., Eds., DNA Compression Challenge Revisited: A Dynamic Programming Approach, Lecture Notes in Computer Science, Island, Korea: Springer, 2005, vol. 3537, 190–200.

[20] Pinho.A.J, Neves.A.J.R,Martins.D.A, et al., Finite-Context Models for DNA Coding, Signal Processing Lab, DETI/IEETA, S. Miron, Ed., University of Aveiro, Portugal, Chapter 6, 117-130, 2010.

[21] Prasad, V. H., and Kumar, P. V., 2012, A New Revised DNA Cramp Tool Based Approach of Chopping DNA Repetitive and Non-Repetitive Genome Sequences, International Journal of Computer Science Issues (IJCSI), 9(6), 448-454.

[22] Priyanka, Savitha Goel, “A Compression Algorithm for DNA that Uses ASCII Values”, 2014 IEEE International Advance Computing Conference (IACC).

[23] Postolico.A, et al., Eds., DNA Compression Challenge Revisited: A Dynamic Programming Approach, Lecture Notes in Computer Science, Island, Korea: Springer, 2005, vol. 3537, 190–200.

[24] Rivals.E, Dauchet.M, Delahaye.J-P., et al., "Fast Discerning Repeats in DNA Sequences with a Compression Algorithm," The 8th Workshop on Genome and Informatics, (GIW97), 1997, vol. 8, 215-26.

[25] Rajeswari, P. R., and Apparao, A., 2011, DNABIT Compress – Genome compression algorithm, Bioinformation, 5(8), 350-360.

[26] Tabus.I., Korodi.G, and Rissanen.J, "DNA sequence compression using the normalized maximum likelihood model for discrete regression," in Proc. of the Data Compression Conf. (DCC2003), 2003, 253–262.

[27] Tembe, W., Lowey, J., and Suh, E., 2010, G-SQZ: compact encoding of genomic sequence and quality data, Bioinformatics, 26(17), 2192- 2194.

[28] Vey, G., 2009, Differential direct coding: a compression algorithm for nucleotide sequence data, Database, Oxford University Press, vol. 2009, ID bap013.
