SSE Lossless Compression Method for the Text of the Insignificance of the Lines Order

Juncai Xu1, Weidong Zhang1, Qingwen Ren1, Xin Xie2, and Zhengyu Yang2 1 College of Mechanics and Materials, Hohai University, Nanjing, China 2 Department of Electrical and Computer Engineering, Northeastern University, Boston, MA USA

Abstract: There is a special type of text in which the order of the lines makes no difference (e.g., a word list). For such texts, traditional lossless compression methods are not the ideal choice. A new method that achieves better compression results for this type of text is proposed: the texts are preprocessed by a method named SSE and are then compressed with a traditional lossless compression method. Comparison shows that an improved compression result is achieved.

Keywords: lossless compression, compression method, Sort and Set Empty (SSE)

Corresponding author: Juncai Xu, PhD. E-mail: [email protected]


1 Introduction

Compression methods can be classified into lossless and lossy. Lossy compression achieves significant data reduction at the cost of relatively unimportant details; it is used to compress audio files, images, and videos. For text compression, however, lossless compression is the only choice when the aim is to retain all details. Lossless compression is based on the removal of redundant data, so its effectiveness depends on the existence of redundancy. The simplest lossless compression method is Run-Length Encoding (RLE), which encodes runs of repeated data as the datum plus a count. It operates on bytes or longer data units, with the byte being the most common. Another common lossless compression method is entropy coding, introduced by Shannon, the father of information theory. The most widely used entropy coder is Huffman coding; Arithmetic Coding extends the idea further. Given that Arithmetic Coding is protected by various patents, Range Coding, which is similar to Arithmetic Coding, is more popular in open-source communities. The compression limit of entropy coding is the entropy, represented by H/8 in the binary system, where H = ∑ pi * log2(1/pi) is the entropy and pi is the probability of each character appearing in the text. Entropy coding has no relation to context, so it cannot remove contextual redundancy in the text. Compression methods that can remove such redundancy are called dictionary encoding. The first dictionary encoder was LZ77; L and Z are the initials of the two inventors' names, and 77 is the year of invention. LZ77 adopts a sliding window to dynamically generate an implicit dictionary. This effectively removes redundant information in the context and thus achieves good results on repetitive data. A series of derivatives has therefore been proposed; the well-known ones include DEFLATE/DEFLATE64 and LZMA/LZMA2.
Another version of LZ is called LZ78. Unlike LZ77, LZ78 generates an explicit dictionary and then substitutes dictionary indices for the data in the dictionary; otherwise, the method is similar to LZ77. The most famous derivative of LZ78 is LZW, which yields a better compression ratio after some improvements. RLE can also be regarded as a simple dictionary algorithm that removes contextual redundancy. Other compression algorithms of this type include Dynamic Markov Coding (DMC), Context Tree Weighting (CTW), Prediction by Partial Matching (PPM), and PAQ. Essentially, these algorithms model or count redundant data in the context and then employ Arithmetic Coding or other means to remove it effectively. Still other algorithms create redundancy through reversible transformations; such algorithms include the BWT algorithm, differential encoding, and the MTF algorithm. These reversible transformations produce good compression results by creating redundancy and then applying other compression methods. Extensive effort has been exerted to develop


and update lossless compression methods and obtain better compression ratios; examples include optimizing existing algorithms on the one hand and finding more effective algorithms (e.g., CSE) on the other. A new method called Sort and Set Empty (SSE) is proposed in this paper. The method can be utilized for the compression of special texts in which the order of the lines makes no difference. The compression result is improved by preprocessing the text with SSE and then applying ordinary lossless compression methods.

2 Compression and Decompression Method

2.1 Terminology

1) Empty Symbol. An Empty Symbol is a symbol used to occupy a position. It must be different from any symbol in the text; in a word list, for example, a space can serve as the Empty Symbol. Several consecutive Empty Symbols can be represented by an Empty Symbol followed by a count (RLE); corresponding processing must then be applied during decompression.

2) Set Empty. Setting empty means replacing a character with an Empty Symbol. The operation can be applied to a run of characters; setting several consecutive characters empty can likewise be represented by an Empty Symbol followed by a count (RLE).

3) Sort and Set Empty. This process sorts the special text and then sets empty the characters that equal those at the same positions in the previous line.

The SSE lossless compression method employed in this study is a preprocessing method for texts in which the order of lines makes no difference. Owing to this insignificance of line order, the text is first sorted; whether the sort is case-sensitive depends on the application layer. After sorting, the beginnings of adjacent lines are highly duplicated, so a large number of characters can be Set Empty, producing a text with a tree (trie) structure. Finally, a common lossless compression method is adopted to compress the text.

Decompression is the reverse process. First, a common lossless decompression method is applied. Second, the Empty Symbols are replaced with the characters at the same positions in the previous line. Because the order of the lines is of little importance, no further processing is required after decompression. Note that the original line order is lost after SSE; hence, the method applies only to such special texts, or to general texts that are already sorted.


2.2 Pseudo Code

const Empty = ' ';

fun Compress(Lines)
    Sort(Lines);
    var o = '';
    for each s in Lines do
        var i = 0;
        var l = o.length();
        loop
            exit when i >= l or o[i] <> s[i];
            i += 1;
        end loop;
        o = s;
        println(Empty.x(i) & s.substr(i));
    end do;
end fun;
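The Compress routine above can be sketched in Python as follows (a minimal illustration, not the paper's implementation; `sse_compress` is a hypothetical name, a space is used as the Empty Symbol, and RLE of the Empty Symbols is omitted, as in the example table below):

```python
def sse_compress(lines, empty=" "):
    """Sort the lines, then replace each line's common prefix with
    the previous (sorted) line by one Empty Symbol per character."""
    out, prev = [], ""
    for s in sorted(lines):
        i = 0
        # length of the common prefix with the previous sorted line
        while i < len(prev) and i < len(s) and prev[i] == s[i]:
            i += 1
        out.append(empty * i + s[i:])
        prev = s
    return out
```

For example, `sse_compress(["aah", "aa", "aahed"])` sorts the list and yields `["aa", "  h", "   ed"]`.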

fun Decompress(Lines)
    var o = '';
    for each s in Lines do
        var i = 0;
        loop
            exit when s[i] <> Empty;
            i += 1;
        end loop;
        o = o.substr(len: i) & s.substr(i);
        println(o);
    end do;
end fun;
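The Decompress routine admits the same kind of sketch (again an illustration; `sse_decompress` is a hypothetical name and a space is assumed as the Empty Symbol). Each line's leading Empty Symbols are replaced by the characters at the same positions in the previously restored line, so the round trip recovers the sorted text:

```python
def sse_decompress(lines, empty=" "):
    """Replace leading Empty Symbols with the characters at the same
    positions in the previously restored line."""
    out, prev = [], ""
    for s in lines:
        i = 0
        while i < len(s) and s[i] == empty:
            i += 1
        prev = prev[:i] + s[i:]   # keep prefix of length i, append the rest
        out.append(prev)
    return out
```

For example, `sse_decompress(["aa", "  h", "   ed"])` restores `["aa", "aah", "aahed"]`.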

2.3 Experiment

The example was obtained from the Free Scrabble Dictionary. The following result was obtained using SSE.


     Source text     Sort and Set Empty 1, 2
 1   aa              aa
 2   aah             ‐‐h
 3   aahed           ‐‐‐ed
 4   aahing          ‐‐‐ing
 5   aahs            ‐‐‐s
 6   aal             ‐‐l
 7   aalii           ‐‐‐ii
 8   aaliis          ‐‐‐‐‐s
 9   aals            ‐‐‐s
10   aardvark        ‐‐rdvark
11   aardvarks       ‐‐‐‐‐‐‐‐s
12   aardwolf        ‐‐‐‐wolf
13   aardwolves      ‐‐‐‐‐‐‐ves
14   aargh           ‐‐‐gh
15   aarrgh          ‐‐‐rgh
16   aarrghh         ‐‐‐‐‐‐h
17   aas             ‐‐s
18   aasvogel        ‐‐‐vogel
19   aasvogels       ‐‐‐‐‐‐‐‐s
20   ab              ‐b
21   aba             ‐‐a
22   abaca           ‐‐‐ca
23   abacas          ‐‐‐‐‐s
24   abaci           ‐‐‐‐i
25   aback           ‐‐‐‐k
26   abacterial      ‐‐‐‐terial
27   abacus          ‐‐‐‐us
28   abacuses        ‐‐‐‐‐‐es
29   abaft           ‐‐‐ft
30   abaka           ‐‐‐ka
31   abakas          ‐‐‐‐‐s
32   abalone         ‐‐‐lone
33   abalones        ‐‐‐‐‐‐‐s
......

Table 1

1 Using ‐ as the Empty Symbol here.
2 RLE coding is not used here, for comparison with the original text.


3 Compression Ratio Evaluation

The compression method was evaluated comparatively in two respects: the theoretical compression ratio and the actual compression ratio.

3.1 Theory Compression Ratio

Suppose that {x1, …, xn} is the alphabet (the text alphabet, including the line break, in this paper), {q1, …, qn} are the numbers of times the symbols occur in the text, m = ∑ qi is the length of the text, and {p1, …, pn}, with pi = qi/m, are the probabilities of the symbols appearing in the text; ∑ pi = 1.

The information entropy is H = ∑ pi * log2(1/pi), and the compression ratio is H/8. The texts considered here are special texts with an independent line sequence, so the SSE (Sort and Set Empty) method can be applied. The above model then changes as follows:

Suppose that {x0, x1, …, xn} is the new alphabet (adding x0 as the Empty Symbol), {q'0, q'1, …, q'n} are the numbers of occurrences in the new text, m is still the length of the text, and {p'0, p'1, …, p'n} are the probabilities of appearing in the new text; ∑ p'i = 1.

The information entropy of the new text is H' = ∑ p'i * log2(1/p'i), and the compression ratio is H'/8.

Consequently, the following relations are obtained:

(1) q'1 ≤ q1, …, q'n ≤ qn
(2) q'0 = ∑ (qi − q'i)
(3) 0 ≤ p'1 ≤ p1 ≤ 1, …, 0 ≤ p'n ≤ pn ≤ 1
(4) 0 ≤ p'0 = ∑ (pi − p'i) ≤ 1
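The entropy H and the entropy compression ratio H/8 (and likewise H' and H'/8 for the transformed text) can be computed directly from character counts. Below is a minimal sketch, assuming 8-bit characters; `entropy_ratio` is an illustrative helper name:

```python
import math
from collections import Counter

def entropy_ratio(text):
    """Return (H, H/8), where H = sum of p_i * log2(1/p_i) over the
    distinct characters of the text and H/8 is the entropy
    compression ratio for 8-bit characters."""
    m = len(text)
    h = sum((q / m) * math.log2(m / q) for q in Counter(text).values())
    return h, h / 8
```

For a text of a single repeated character the entropy is 0; for "aabb" it is exactly 1 bit per character, giving a ratio of 0.125.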


Function y = x * log2(1/x) is shown below:

Figure 1

The function attains its maximum value 1/(e · ln 2) ≈ 0.530737845423043 at x = 1/e ≈ 0.367879441171442. It increases steeply on the left branch of 1/e and decreases on the right branch, though less steeply than it increases on the left. Compared with H, H' has one additional term, p'0 * log2(1/p'0), whose maximum is about 0.53. For every p'i with i from 1 to n, the following holds:

(5) p'i * log2(1/p'i) ≤ pi * log2(1/pi), when pi ≤ 1/e.

(6) If pi > 1/e, the relation between the two sides of (5) is indeterminate; the left side may be greater than, equal to, or less than the right side.
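The extremum of y = x * log2(1/x) quoted above can be checked numerically (a small sketch; `y` is the function plotted in Figure 1):

```python
import math

def y(x):
    """The per-character entropy contribution x * log2(1/x)."""
    return x * math.log2(1 / x)

x_star = 1 / math.e   # location of the maximum
peak = y(x_star)      # equals 1/(e * ln 2), about 0.5307
# the curve rises steeply to the left of 1/e and falls
# more gently to the right of it
```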

According to the two formulas above, the SSE method reduces entropy whenever the probability of a character appearing in the text is at most 1/e (approximately 0.37; for simplicity, this approximation is used below). When every character probability in the text is larger than 1/e, the SSE method may increase entropy and produce a larger compressed file. Since the probabilities sum to 1, at most two characters can have a probability greater than 0.37. The Sort and Set Empty method therefore fails only in the case of a two-character alphabet; for alphabets with more than two characters it takes effect, and the magnitude of the effect correlates positively with the number of characters whose occurrence probability is at most 0.37: the more such characters there are, the more significant the effect. For those characters the entropy contribution is reduced; if this reduction exceeds the entropy added by the new Empty Symbol (at most 0.53) plus any increase contributed by the few characters with probability above 0.37, the total entropy decreases. Under this condition the Sort and Set Empty method takes effect and obtains a better compression ratio.
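The entropy reduction can be illustrated on a prefix-heavy word list (a toy demonstration, not the paper's experiment; `entropy` and `sse` are minimal helpers as defined earlier, with `-` as the Empty Symbol):

```python
import math
from collections import Counter

def entropy(text):
    """H = sum of p * log2(1/p) over the characters of the text."""
    m = len(text)
    return sum((q / m) * math.log2(m / q) for q in Counter(text).values())

def sse(lines, empty="-"):
    """Sort, then blank out each line's common prefix with its predecessor."""
    out, prev = [], ""
    for s in sorted(lines):
        i = 0
        while i < len(prev) and i < len(s) and prev[i] == s[i]:
            i += 1
        out.append(empty * i + s[i:])
        prev = s
    return out

# an assumed toy word list with heavy prefix sharing
words = ["aah", "aahed", "aahing", "aahs", "aal", "aalii", "aals"]
before = entropy("\n".join(sorted(words)))
after = entropy("\n".join(sse(words)))
# the shared prefixes become Empty Symbols, concentrating probability
# mass on one frequent symbol and lowering the per-character entropy
```

The text length is unchanged by the transformation, but the entropy of the SSE output is lower than that of the sorted source.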

Below are statistics for alphabets with randomly assigned probabilities and 2 to 52 characters. In the average-entropy graph, "Source" is the average entropy before processing and "Target" is the average entropy after SSE. The compression effect is significant for alphabets with five or more characters.

Figure 2

The chart below shows a statistical comparison of the average, maximum, and minimum entropy.

Figure 3

The ratio of the average entropies, i.e., the average compression ratio (60% to 80%), is shown below.

Figure 4


3.2 Comparison of Compression Ratios

Three types of text are compared in the table below: a word list, a URL list, and an MD5 list. Many files of each type were tested, and the average values are reported. Two types of compression ratio are listed: the entropy compression ratio, which uses the binary entropy value, and the actual compression ratio, which uses the Ultra LZMA setting of 7-Zip.

Comparison of compression ratios: Source and SSE

            Source compression ratio    SSE compression ratio    Ratio of compression ratios 1
            Entropy 2    Actual 3       Entropy    Actual        Entropy    Actual
Word List   53.64%       19.45%         25.92%     11.37%        48.32%     58.46%
URL List    57.53%       6.01%          21.27%     4.39%         36.98%     72.96%
MD5 List    52.12%       43.50%         47.93%     37.89%        91.95%     87.10%

Table 2

The source entropy compression ratio is greater than 50% in every case, and all ratios are reduced significantly after SSE. In particular, the entropy compression ratio of the URL list drops to 21.27%: many repeated prefixes exist in the URL list, so SSE reduces its entropy considerably. Fewer repeated prefixes exist in the MD5 list, which consists largely of random data; hence its entropy is reduced only slightly. The word list lies in between. The actual compression ratios of LZMA are all better than the entropy compression ratios, because LZMA employs dictionary compression (the 7-zip LZMA Ultra option was used). The best actual compression ratio is for the URL list: 6.01% for the source and 4.39% after SSE. SSE's improvement in the actual compression ratio is weaker than its improvement in the entropy ratio. For the word list, SSE is ideal: the actual compression ratio falls from 19.45% to 11.37%, a ratio of the ratios of 58.46%. For the MD5 list, the improvement by SSE is the weakest in both the entropy and the actual compression ratios.

1 Ratio of compression ratios = SSE compression ratio / Source compression ratio.
2 Entropy compression ratio = H/8, H = ∑ pi * log2(1/pi).
3 Using 7-zip LZMA Ultra; actual compression ratio = compressed size / source size.


4 Conclusion

The analysis and the actual tests presented above show that the SSE method is useful. For a special text that is row-order independent, SSE can be used to preprocess the text before compressing it with other compression methods, and the compression ratio is reduced significantly. SSE is simple and efficient and has good application potential. Our next task is to establish a proper way of applying SSE to general text: before applying SSE, another method would record the original line positions so that they can be restored after decompression. Such work would extend the application scope of SSE and is worth pursuing.

Acknowledgement

This research was funded by the National Natural Science Foundation of China (Grant No. 11132003) and the Jiangsu Province Post-Doctoral Foundation of China (Grant No. 1401124C).

References

Shannon, C.E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal, 27, pp. 379–423 & 623–656, July & October 1948.
Huffman, D.A. (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the I.R.E., September 1952, pp. 1098–1102.
Huffman, K. (1991). "Profile: David A. Huffman". Scientific American, September 1991, pp. 54–58.
Rissanen, J., and Langdon, G.G. (1979). "Arithmetic coding". IBM J. Res. Dev. 23, 2 (Mar.), pp. 149–162.
Martin, G.N.N. (1979). "Range encoding: An algorithm for removing redundancy from a digitised message". Video & Data Recording Conference, Southampton, UK, July 24–27, 1979.
Ziv, J., Lempel, A. (1977). "A Universal Algorithm for Sequential Data Compression". IEEE Transactions on Information Theory, Vol. 23, No. 3, pp. 337–343.
Ziv, J., Lempel, A. (1978). "Compression of Individual Sequences via Variable-Rate Coding". IEEE Transactions on Information Theory, Vol. 24, No. 5, pp. 530–536.
Bell, T., Witten, I., Cleary, J. (1989). "Modeling for Text Compression". ACM Computing Surveys, Vol. 21, No. 4.
Cormack, G., Horspool, N. (1987). "Data Compression using Dynamic Markov Modelling". Computer Journal 30:6, December 1987.
Willems, F.; Shtarkov, Y.; Tjalkens, T. (1995). "The Context-Tree Weighting Method: Basic Properties". IEEE Transactions on Information Theory 41.
Cleary, J.; Witten, I. (April 1984). "Data Compression Using Adaptive Coding and Partial String Matching". IEEE Trans. Commun. 32 (4), pp. 396–402.
Salomon, D., Motta, G. (with contributions by David Bryant) (2009). Handbook of Data Compression, 5th edition, Springer, ISBN 1-84882-902-7, 5.15 PAQ, pp. 314–319.


Burrows, M.; Wheeler, D.J. (1994). A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation.
Elias, P. (March 1975). "Universal codeword sets and representations of the integers". IEEE Transactions on Information Theory 21 (2), pp. 194–203.
Ryabko, B.Ya. (1980). "Data compression by means of a 'book stack'". Problems of Information Transmission, v. 16 (4), pp. 265–269.
Ryabko, B.Ya.; Horspool, R.N.; Cormack, G.V. (1987). "Comments to: 'A locally adaptive data compression scheme' by J.L. Bentley, D.D. Sleator, R.E. Tarjan and V.K. Wei". Comm. ACM 30 (9), pp. 792–794.
Bentley, J.L., Sleator, D.D., Tarjan, R.E., Wei, V.K. (1986). "A Locally Adaptive Data Compression Scheme". Communications of the ACM 29 (4).
Iwata, K., Arimura, M., and Shima, Y. (2011). "An Improvement in Lossless Data Compression via Substring Enumeration". 2011 IEEE/ACIS 10th International Conference on Computer and Information Science (ICIS).
Senthil, S., Robert, L. (2012). "Lossless Preprocessing Algorithms for Better Compression". Proceedings of the 2012 IEEE International Conference on Computer Science and Automation Engineering (CSAE 2012), Vol. 02.
Senthil, S., Rexiline, S.J., Robert, L. (August 2012). "RIDBE: A Lossless, Reversible Text Transformation Scheme for Better Compression". International Journal of Computer Applications 51 (12), pp. 35–40.
Claude, F., Nicholson, P.K., Seco, D. (2014). "On the compression of search trees". Information Processing and Management, 50 (2).
Ristov, S., Korenčić, D. (2014). "Using static suffix array in dynamic application: Case of text compression by longest first substitution". Information Processing Letters.
Nakamura, R., Inenaga, S., Bannai, H., Funamoto, T., Takeda, M., Shinohara, A. (2009). "Linear-Time Text Compression by Longest-First Substitution". Algorithms 2 (4), pp. 1429–1448.
Fariña, A., Navarro, G., Paramá, J.R. (2012). "Boosting Text Compression with Word-Based Statistical Encoding". The Computer Journal, Vol. 55 (1), pp. 111–131.
Gardner-Stephen, P., Bettison, A., Challans, R., Hampton, J., Lakeman, J., Wallis, C. (2013). "Improving Compression of Short Messages". Int'l J. of Communications, Network and System Sciences, Vol. 06 (12), pp. 497–504.
Aprilianto, M., Abdurohman, M. (2014). "Improvement Text Compression Performance Using Combination of Burrows Wheeler Transform, Move to Front, and Methods". Journal of Physics: Conference Series, Vol. 495 (1). IOP.
Conley, E.S., Klein, S.T. (2013). "Improved Alignment-Based Algorithm for Multilingual Text Compression". Mathematics in Computer Science, Vol. 7 (2), pp. 1–153. Springer.
Gog, S., Ohlebusch, E. (2013). "Compressed suffix trees". Journal of Experimental Algorithmics (JEA), Vol. 18, pp. 2.1–2.31. ACM.
Sin'ya, R. (2013). "Text compression using abstract numeration system on a regular language". Computer Software, Vol. 30 (3), pp. 163–179.
Radescu, R., Honciuc, A. (2012). "Advanced lossless text compression algorithm based on splay tree adaptive methods". Revue Roumaine des Sciences Techniques, Serie Electrotechnique et Energetique, Vol. 57 (3), pp. 311–320.