
International Journal of Pure and Applied Mathematics Volume 118 No. 9 2018, 669-675 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu Special Issue ijpam.eu

Lossless Text Compression Technique Based on Static Dictionary for Tamil Document

B. Vijayalakshmi, Ph.D. Research Scholar, Department of Computer Science, Vidyasagar College of Arts and Science, Udumalpet, Tamilnadu, India
Dr. N. Sasirekha, Associate Professor, Department of Computer Science, Vidyasagar College of Arts and Science, Udumalpet, Tamilnadu, India

Abstract- Text compression is an effective technique that reduces storage requirements and increases the data transfer rate during communication. This paper presents a new lossless text compression technique for Tamil documents made of Unicode Tamil characters. The compression and decompression processes, which use a static dictionary compression scheme, are described. The technique reduces the storage size of a Tamil document by about 50% on average, and decompression restores the original document.

Keywords-Text compression, decompression, Unicode and ASCII.

I. INTRODUCTION

Data compression is the process of converting an input data stream into another data stream of smaller size; it reduces the number of bits needed to represent data. Compressing data can save storage capacity, increase transmission speed and decrease the cost of storage hardware and network bandwidth [11, 12, 8].

Text compression can be as simple as removing all unneeded characters, replacing a run of repeated characters with a single repeat marker, and substituting a smaller bit string for a frequently occurring bit string. Lossless text compression restores a file to its original state, without the loss of a single bit of data, when the file is uncompressed [16].

Many compression techniques are available; one of the most popular is dictionary-based compression. The dictionary contains a list of strings of possible symbols stored in a table-like structure, and the index of an entry is used to represent a longer, repeated word or character by a shorter one [1]. A dictionary scheme can be static or dynamic. In this paper, the compression technique is based on a static dictionary, which is simple and permanent. The static dictionary contains a subset of the common patterns of Unicode Tamil characters, indexed by ASCII characters. The size of a Unicode character ranges from 1 byte to 4 bytes, depending on the encoding used to store the document.

There is a strong need to develop dedicated compression techniques for different world languages and Indian regional languages. Much research is under way on compression techniques for languages such as Chinese, Japanese and Arabic. In India, Tamil is a popular language spoken by the people of the state of Tamilnadu and by over 80 million people in India and worldwide. Tamil is one of the most widely used languages on the web today.

The Tamil script is an abugida. An abugida is a kind of syllabary in which the vowel is changed by modifying the base consonant symbol, so that all the forms that represent a given consonant plus each vowel resemble one another. Amharic, Hindi and Burmese are also written in abugidas. The Tamil script has 12 vowels, 18 consonants and 1 aytam character (neither vowel nor consonant). In addition, there is a set of 216 combining letters formed by adding a vowel marker to a consonant. In total there are 247 characters in the basic Tamil script [2]. Some vowels require the basic shape of the consonant to be altered in a way that is specific to that vowel. Others are written by adding a vowel-specific suffix to the consonant, a prefix to the consonant, or both a suffix and a prefix. At present there is a high demand for an effective compression technique for Tamil language documents. This paper fills the gap by presenting an effective lossless compression technique.

II. RELATED WORKS

Many compression techniques are available for English and other European languages. For Indian languages, however, a dedicated compression technique has to be designed and developed.

The authors of [11] describe Malayalam text compression by variable-length encoding, in which a Unicode character is represented by a smaller number of bits.

Seethalakshmi et al. [13] presented the importance of the Unicode encoding technique, which is followed by Tamil documents and software.

Ševčík and Dvorský [14] present lossless data compression for the Czech language. The grammatical rules and properties of natural languages differ from those of English, so compression should be specially designed for each language.

In [15], the authors create a font by mapping ASCII characters to Unicode characters. For Indian languages, a combination of characters can be replaced by ASCII characters.

Paper [17] presents the character set of the Tamil language. Unicode is a universally accepted encoding technique for representing and transmitting text.

III. EXISTING COMPRESSION TECHNIQUES FOR TAMIL DOCUMENT

There has been no compression technique designed exclusively for the Tamil language. The existing compression techniques mainly deal with European languages such as English, French and German [4]. Research is also under way on other languages, such as Japanese and Chinese [7]. A Tamil Unicode character needs 2 bytes, whereas an ASCII character occupies one byte [15]. The proposed technique compresses by substituting a single ASCII character for each Tamil Unicode character (16 to 32 bits).

Existing lossless compression tools include WinZip, a popular Windows program that compresses files and packages them in an archive. Archive file formats that support compression include ZIP and RAR. The bzip2 and gzip formats are widely used for compressing individual files of English documents. For natural languages like Tamil, however, a dedicated compression technique needs to be designed.

IV. PROPOSED COMPRESSION TECHNIQUE FOR TAMIL DOCUMENT

In computing, a character encoding technique is used to represent a collection of characters, both for transmission and for storage in memory. Depending on the abstraction level and context, the corresponding code points and the resulting code space may be represented as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage and transmission of textual data. Many encoding file types, such as ANSI, UTF-8, UTF-16 and UTF-32, are available for storage.

A. Tamil Documents In Digital Form

There are over 65 million Tamils in India and 80 million worldwide [3]. Millions of petitions, commercial transaction registrations and birth/death records are generated in the Tamil language every year. The Tamilnadu government is continuously involved in digitizing its billions of records, and has issued an order to use Unicode as the current standard for Tamil encoding [18].

The encoding techniques available for Tamil characters are ISCII (7 bits), TSCII/TAB (7 bits), TAM (8 bits), 7 bit – Unicode and proprietary encodings (7/8 bits) [19]. These encodings are insufficient to represent all Tamil characters and are inefficient for storing, transmitting and retrieving documents. These problems can be solved by the Unicode Tamil character set.

B. Unicode Tamil Characters

Unicode is the most widely accepted industry standard for storing, transmitting and documenting text [2]. It was developed in conjunction with the Universal Coded Character Set (UCS) standard and published as the Unicode Standard. The latest version of Unicode contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets.

Unicode is designed to represent almost all characters in every language in the world [15]. All the characters of the Tamil language are now encoded as per the Universal Principle of Unicode. The Tamil characters range from U+0B80 to U+0BFF in the Unicode character set [17].

Unicode is large enough to encompass all characters that are likely to be used in general text interchange, including those in major international, national and industry character sets. Unicode occupies more space in memory during storage [15]. Fig 4.1 shows the recent Unicode version for Tamil Unicode characters.
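To make the storage cost of Unicode Tamil text concrete, the following minimal sketch measures how many bytes a short Tamil word occupies in three common encodings and checks that its code points fall in the Tamil block U+0B80 to U+0BFF. The sample word and the helper names `is_tamil` and `encoded_sizes` are illustrative assumptions and do not come from the paper.

```python
# Illustrative sketch only: the sample word and helper names are assumptions,
# not part of the paper's implementation.

TAMIL_BLOCK = range(0x0B80, 0x0C00)  # Unicode Tamil block, U+0B80..U+0BFF


def is_tamil(ch: str) -> bool:
    """Return True if the single character lies in the Unicode Tamil block."""
    return ord(ch) in TAMIL_BLOCK


def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in several encodings (no byte-order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
        "utf-32-le": len(text.encode("utf-32-le")),
    }


# Five code points: U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
sizes = encoded_sizes(sample)

print(all(is_tamil(c) for c in sample))
print(sizes)
```

Each Tamil code point takes 3 bytes in UTF-8, 2 bytes in UTF-16 and 4 bytes in UTF-32, which is why replacing each one with a single byte can roughly halve a UTF-16 file.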


Fig 4.1: Unicode Version 10.0 for Tamil Characters

The proposed system substitutes an ASCII character for a Unicode Tamil character, since an ASCII character occupies one byte (8 bits) whereas a Unicode character occupies between 1 byte (8 bits) and 4 bytes (32 bits), depending on the encoding technique.

Table 4.1: Encoding techniques available for storing a .txt file and their character sizes

| Encoding Type | Character Type | Range of Character Size |
|---|---|---|
| ANSI encoding | ASCII character | 0 to 7 bits |
| UTF-8 | ASCII character | 8 bits to 32 bits |
| Unicode | Unicode character | 16 bits |
| Unicode Big Endian | Unicode character | 16 bits to 32 bits |

Table 4.1 shows how storage varies across the encoding techniques available for storing a text (.txt) file.

C. Architecture of Compression

Lossless data compression for a natural language requires a technique designed exclusively for that language (here Tamil), which differs from the techniques used for English or other languages [14]. The proposed substitution method performs lossless text compression of Tamil language documents effectively. Lossless decompression reconstructs a document that is exactly the same as the original [5]. The method saves almost 50% of storage space, which makes it suitable for transmission and also saves hard disk space. The substitution of predefined content for natural-language text is increasing with the growth of internet usage [6].

Fig 4.2: Lossless Compression Technique Architecture

Figures 4.2 and 4.3 show the lossless technique before and after the proposed substitution method. A text file containing a Tamil document in any one of the encodings Unicode, Unicode big endian or UTF-8 is given as input to the proposed method. The Unicode Tamil characters (16 bits to 32 bits) are replaced with ASCII characters (8 bits) using the proposed substitution method, so the compressed file contains only ASCII characters and is stored with the ANSI encoding type. As a result, the compressed file is about 50% of the original size. The file can be compressed further by any lossless compression technique, such as Run-length, Huffman or Lempel-Ziv coding, which gives a further 20% to 40% reduction in storage space [9].

Decompression applies the same method in reverse, taking an ANSI file as input [14]. This ANSI file contains the compressed data as a collection of unreadable ASCII characters. The reverse of the substitution method is performed, i.e. the ASCII characters are replaced by Tamil Unicode characters. The resulting decompressed file may be in any one of the encoding types shown in figure 4.3.

Fig 4.3: Architecture of Decompression Technique

V. RESULTS AND DISCUSSIONS

The compression and decompression process was developed as a web application using ASP.NET, so that users can later perform compression and decompression online. ASP.NET is an open-source server-side web application framework designed for web development to produce dynamic web pages. It was developed by Microsoft to allow programmers to build dynamic websites, web applications and web services [20].

The following example shows how the compression process takes place for a Tamil word.
i) The word seems to have 6 characters, but it is actually a combination of 11 Unicode characters, listed in Table 5.1.
ii) The combination of Unicode characters for the word [13] is given below.
iii) The proposed substitution method replaces the Unicode Tamil characters of the above example with ASCII characters. Table 5.1 shows the Unicode values of the Tamil characters of the word. The replacement of each Unicode character by an ASCII character is shown in Table 5.2.

Table 5.1: Combination of Unicode characters for a single Tamil character.

Table 5.2 shows the Unicode characters of the word and the ASCII characters that replace them.

iv) After the substitution method, the word is replaced by öÿÜÑ£á. The actual size of the word is 11 bytes in a text file with the Unicode encoding type. After compression, öÿÜÑ£á is 6 bytes when stored as a text file with the ANSI encoding type. This example shows how the size of the Tamil Unicode characters is reduced from 11 bytes to 6 bytes by compression. The same process is carried out for the Tamil document given below.
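The substitution idea in the example above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' ASP.NET implementation: the paper's dictionary (Table 5.2) is not reproduced here, so the sketch assumes a hypothetical static mapping in which each code point of the Tamil block U+0B80 to U+0BFF is mapped to one byte in the range 0x80 to 0xFF, while plain ASCII characters pass through unchanged. The function names `compress` and `decompress` and the sample text are also assumptions.

```python
# Hypothetical static dictionary: NOT the paper's Table 5.2.
# Assumption: each Tamil code point U+0B80..U+0BFF maps to one byte 0x80..0xFF,
# and ASCII characters (0x00..0x7F) are copied through unchanged.

TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def compress(text: str) -> bytes:
    """Replace every Tamil code point with a single byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if TAMIL_START <= cp <= TAMIL_END:
            out.append(0x80 + (cp - TAMIL_START))
        elif cp < 0x80:
            out.append(cp)
        else:
            # Anything outside ASCII and the Tamil block is not covered
            # by this illustrative dictionary.
            raise ValueError(f"unsupported character U+{cp:04X}")
    return bytes(out)


def decompress(data: bytes) -> str:
    """Reverse the substitution: bytes >= 0x80 become Tamil code points."""
    return "".join(
        chr(TAMIL_START + (b - 0x80)) if b >= 0x80 else chr(b) for b in data
    )


# Two Tamil words separated by an ASCII space (10 code points in total).
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd \u0bae\u0bca\u0bb4\u0bbf"
original = sample.encode("utf-16-le")  # 2 bytes per code point, no BOM
packed = compress(sample)
restored = decompress(packed)

print(len(original), len(packed), restored == sample)
```

Under this assumed mapping the packed form is exactly half the size of the UTF-16 form, and decoding is unambiguous because the ASCII and Tamil byte ranges do not overlap. A dictionary that reuses ASCII codes for Tamil characters would not have that property.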


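The percentage of compression is calculated by the following formula [11]:

\[ \text{Compression percentage} = \frac{\text{original file size} - \text{compressed file size}}{\text{original file size}} \times 100 \]

For the proposed method, substituting the sizes of bharathiar.txt (14762 bytes, original file) and bharathiar_cprs.txt (7406 bytes, compressed file) gives

\[ \frac{14762 - 7406}{14762} \times 100 \approx 49.83\% \]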
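As a quick check of the arithmetic, the short sketch below recomputes the percentage for each file from the sizes listed in Table 5.3. It assumes the percentage is defined as the size reduction divided by the original size, times 100, which is the definition that reproduces the reported 49.83% for the 14762-byte and 7406-byte pair. The function name `compression_percentage` is an illustrative choice.

```python
def compression_percentage(original_bytes: int, compressed_bytes: int) -> float:
    """Size reduction as a percentage of the original size (assumed definition)."""
    return (original_bytes - compressed_bytes) / original_bytes * 100


# (original size, compressed size) in bytes, copied from Table 5.3
table_5_3 = {
    "Bharathi.txt": (14762, 7406),
    "Chennaithagaval.txt": (4266, 2150),
    "Ettuthokkai.txt": (3082, 1540),
    "Natrrinai.txt": (5634, 2832),
    "Pathittrupathu.txt": (13074, 6536),
    "Vairamuthuvaralaru.txt": (26222, 13226),
}

percentages = {
    name: round(compression_percentage(orig, comp), 2)
    for name, (orig, comp) in table_5_3.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```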

Figure 5.1: File bharathi.txt before compression

The compression process was carried out on the file bharathiar.txt, of size 14762 bytes, which was given as input to the application as shown in figure 5.1. The file contains Unicode Tamil characters and is stored with the Unicode encoding file type.

Figure 5.2: File bharathi_cpr.txt after compression

After compression, the output is the file bharathiar_cprs.txt, of size 7406 bytes, shown in figure 5.2. This compressed file contains ASCII characters and is stored automatically by the application with the ANSI encoding file type. The reverse process performs the decompression effectively.

This compression technique reduces the original files by nearly 50%. A compressed file can be restored to the original file by decompressing it. The percentage of compression is calculated by comparing the size of the file before and after compression [11]. Table 5.3 shows the percentage of compression of Tamil files, with the corresponding sizes in bytes before and after compression. For all the files, the compression percentage is almost 50%. The decompression process is also successful.

Table 5.3: File sizes before and after compression with the percentage of compression.

| File Name | Original File Size (bytes) | Compressed File Size (bytes) | % of Compression |
|---|---|---|---|
| Bharathi.txt | 14762 | 7406 | 49.83% |
| Chennaithagaval.txt | 4266 | 2150 | 49.6% |
| Ettuthokkai.txt | 3082 | 1540 | 50.03% |
| Natrrinai.txt | 5634 | 2832 | 49.73% |
| Pathittrupathu.txt | 13074 | 6536 | 50.01% |
| Vairamuthuvaralaru.txt | 26222 | 13226 | 49.56% |
| Average Percentage | | | 49.76% |

The average percentage of file compression is 49.76%; this is due to the replacement of Unicode characters by ASCII characters.

VI. CONCLUSION

Tamil is a Dravidian language spoken by millions of people in India and all over the world. It is the official language of the Indian state of Tamilnadu. There is a strong need to store Tamil documents in digital form. Many applications are being developed for both computers and mobile phones. New technologies are needed to preserve the literary, artistic and scientific work of mankind digitally. This lossless compression technique paves the way for storing Tamil documents in minimum storage: the compressed document is reduced to almost 50% of its original size. The technique can be applied to other abugida languages too. The compression and decompression process was successful. Decompression restores the compressed file to its original form without any loss of data.

VII. FUTURE ENHANCEMENT

The compression technique works perfectly if the original document contains only Tamil characters. This is because, during decompression, an ASCII character may wrongly be replaced by a Unicode Tamil character. The technique can be improved by placing a special separator character around the ASCII characters of the original document before starting the compression. Further, the compression can be enriched by finding the frequency of occurrence of every Tamil character in Tamil documents, so that it can be applied effectively in the compression technique. The technique can easily be applied to all languages that have a Unicode character format.

REFERENCES

[1] A Carus, A Mesut, “Fast Text Compression Using Multiple Static Dictionaries”, Information Technology Journal 9(5): 1013-1021, 2010, ISSN 1812 (2010).
[2] Ajantha Devi, Dr.S.Santhosh Baboo, “Embedded Optical Character Recognition on Tamil Text Image Using Raspberry Pi”, International Journal of Computer Science Trends and Technology, Vol 2, Issue 4, Jul-Aug 2014.
[3] Apte, Akshay, and Harshad Gado. "Tamil character recognition using structural features." (2010).
[4] Arafat Awajan and Enas Abu Jrai, “Hybrid Techniques for Arabic Text Compression”, Global Journal of Computer Science and Technology: C Software and Data Engineering, Vol 15 Issue 1 Version 1.0 2015, Print ISSN: 0975-4350 (2015).
[5] Blelloch, Guy E. “Introduction to Data Compression.” Computer Science Department, Carnegie Mellon University (2001).
[6] Frank E., Chang Chui, and Ian H. Witten. “Text categorization using Compression Models.” (2000).
[7] Hewavitharana, S., and H. C. Fernando. "A two stage classification approach to Tamil handwriting recognition." Proc. TI (2002).
[8] Graefe, Goetz, and Leonard D. Shapiro. "Data compression and database performance." Applied Computing, 1991, [Proceedings of the 1991] Symposium on. IEEE, 1991.
[9] Kodituwakku, S. R., and U. S. Amarasinghe. "Comparison of lossless data compression algorithms for text data." Indian journal of computer science and engineering 1.4 (2010): 416-425.
[10] Radescu, Radu. “Transform methods used in lossless compression of text files.” Romanian Journal of Information Science and Technology 12.1 (2009): 101-115.
[11] Sajilal Divakaran, Biji C.L., Anjali. C, Achuthsankar S. Nair, “Malayalam Text Compression”, International Journal of Information Systems and Engineering, Vol 1, No. 1, April 2013.
[12] Salomon, “Data Compression: The Complete Reference”, Springer, pp. 1-14 (2004).
[13] Seethalakshmi.R, Sreeranjani.T.R, Balachandaar.T, “Optical Character Recognition for Printed Tamil Text using Unicode”, Journal of Zhejiang University Science, ISSN 1009-3095, 2005 6A(11):1297-1305 (2005).
[14] Ševčík, Jiří, and Jiří Dvorský. "Techniques of Czech Language Lossless Text Compression." IFIP International Conference on Computer Information Systems and Industrial Management. Springer International Publishing, 2016.
[15] Siva Jyothi Chandra, Ashlesha Pandhare, Mamatha Vani, “Multilingual Font Creation by Mapping Unicode to Ascii”, International Journal of Advanced Research in Computer Science and Software Engineering, Vol 5, Issue 9, Sep 2015, ISSN: 2277 128X (2015).
[16] Storer, James A., ed. Image and text compression. Vol. 176. Springer Science & Business Media, 2012.
[17] Dr.J.Venkatesh and C.Sureshkumar, “Tamil Handwritten Character Recognition Using Kohonon’s Self Organizing Map”, International Journal of Computer Science and Network Security, Vol. 9 No. 12, December 2009.
[18] www.tamilvu.org/doc_file/it_e_5_2013.pdf, Accessed on 3 July 2017.
[19] www.unicode.org/L2/L2007/07175-tamil-resentation.pdf, Accessed on 10 June 2017.
[20] https://en.wikipedia.org/wiki/ASP.NET, Accessed on 10 June 2017.
