Entropy and Compressibility of Symbol Sequences
PhysComp96, full paper. Draft, February 23, 1997

Werner Ebeling [email protected]
Thorsten Pöschel [email protected]
Alexander Neiman [email protected]
Institute of Physics, Humboldt University, Invalidenstr. 110, D-10115 Berlin, Germany

The purpose of this paper is to investigate long-range correlations in symbol sequences using methods of statistical physics and nonlinear dynamics. Besides the principal interest in the analysis of correlations and fluctuations comprising many letters, our main aim here is related to the problem of sequence compression. In spite of the great progress achieved in this field in the work of Shannon, Fano, Huffman, Lempel, Ziv and others [1], many questions still remain open. In particular one must note that since the basic work by Lempel and Ziv, the improvement of the standard compression algorithms has been rather slow, not exceeding a few percent per decade. On the other hand, several experts have expressed the idea that the long-range correlations which clearly exist in texts, computer programs etc. are not sufficiently taken into account by the standard algorithms [1]. Thus, our interest in compressibility is twofold: (i) We would like to explore how far compressibility is able to measure correlations. In particular we apply the standard algorithms to model sequences with known correlations such as, for instance, repeat sequences and symbolic sequences generated by maps. (ii) We aim to detect correlations which are not yet exploited in the standard compression algorithms and which therefore belong to potential reservoirs for compression algorithms. First, the higher-order Shannon entropies are calculated. For repeat sequences analytic estimates are derived, which in some approximation apply to DNA sequences too. For symbolic strings obtained from special nonlinear maps and for several long texts a characteristic root law for the entropy scaling is detected. Then the compressibilities are estimated using grammar representations and several standard computing algorithms; in particular we use the Lempel-Ziv compression algorithm. Further, the mean square fluctuations of the composition with respect to the letter content and several characteristic scaling exponents are calculated. We show finally that all these measures are able to detect long-range correlations. However, as demonstrated by shuffling experiments, different measuring methods operate on different length scales. The algorithms based on entropy or compressibility operate mainly on the word and sentence level. The characteristic scaling exponents reflect mainly the long-wave fluctuations of the composition, which may comprise a few hundred or thousand letters.

1 Correlation measures and investigated sequences

Characteristic quantities which measure long correlations are dynamic entropies [2, 3, 4, 5, 6], correlation functions and mean square deviations, 1/f^δ noise [7, 8], scaling exponents [9, 10], higher-order cumulants [11] and mutual information [12, 13]. Our working hypothesis, which we formulated in earlier papers [3, 8], is that texts and DNA show some structural analogies to strings generated by nonlinear processes at bifurcation points. This is demonstrated here first by the analysis of the behaviour of the higher-order entropies. Further analysis is based on the mapping of the text to random-walk and to "fluctuating gas" models as well as on the spectral analysis. These methods have found several applications to DNA sequences [9, 14, 12] and to human writings [11, 15].

At first we study model sequences with simple construction rules (nonlinear maps such as the logistic map and the circle map, stochastic sequences containing repeating blocks etc.). A repeat sequence is defined as a random (Bernoulli-type) sequence on an alphabet A_1 ... A_λ into which a given word of length s, written on the same alphabet, is introduced ν times (ν being large). This simple type of sequence, which most easily admits analytical calculations, was introduced by Herzel et al. [13] for modelling the structure of DNA strings. Repeat sequences may be generated in the following way (a short program sketch is given at the end of this section). First we define the repeat by a string of length s,

    R = A_1 A_2 ... A_s ,

i.e. len(R) = s. Then we generate a Bernoulli sequence of length L_0 with N_0 = L_0 letters on the alphabet R, A_1 ... A_λ. This sequence consists of ν repeats and N_0 − ν "random" letters. Finally we replace the letters R by the defining string. In this way a string with N letters is obtained which is like a sea of random letters with interspersed "ordered repeats". This string contains sν repeat letters. Going along the string, each time we meet a repeat we first have to identify it. Let us assume we need the first s_c ≪ s letters for the identification. After any identification of a repeat we know where we are in the string and know how to continue, i.e. the uncertainty is decreased. The repeat sequences defined in the way described above have well-defined correlations with a range of s. Beyond the distance s the correlations are destroyed by the stochastic symbols interspersed between the repeats. This procedure may be continued in a hierarchical way in order to generate longer correlations and hierarchical structures with some similarity to texts. In the special case that the sequence consists of only one kind of repeat we get periodic sequences with the period s; in this case the range of correlations is infinite. Sequences with slowly decaying long correlations are obtained from the logistic map and the circle map at critical points [3, 8].

Further we study long standard information carriers such as books and DNA strings and compare them with the model sequences. In particular we studied the book "Moby Dick" by Melville (L ≈ 1,170,200 letters), the German edition of the Bible (L ≈ 4,423,030 letters), Grimm's Tales (L ≈ 1,435,820 letters), the "Brothers Karamazov" by Dostoevsky (L ≈ 1,896,000 letters), the DNA sequence of the lambda virus (L ≈ 50,000 letters) and that of yeast (L ≈ 300,000 letters).

In order to find out the origin of the long-range correlations we studied the effects of shuffling long texts on different levels (letters, words, sentences, pages and chapters); see the second sketch below. The shuffled texts were always compared with the original one (without any shuffling). Of course the original and the shuffled files have the same letter distribution. However, only the correlations on scales below the shuffling level are conserved. The correlations (fluctuations) on higher levels, which are based on the large-scale structure of texts as well as on semantic relations, are destroyed by shuffling.
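The repeat-sequence construction described above is easy to state in code. The following Python sketch is our illustration (the function and parameter names are ours, not the authors'); it draws the Bernoulli background, places ν copies of the letter R, and expands each R into its defining string, yielding N = L_0 + ν(s − 1) letters.

    import random

    def repeat_sequence(lam, s, L0, nu, seed=0):
        """Repeat sequence after Herzel et al. [13]: a Bernoulli string on
        lam letters into which a fixed repeat word of length s is inserted
        nu times."""
        rng = random.Random(seed)
        alphabet = [chr(ord('a') + i) for i in range(lam)]  # A_1 ... A_lam (lam <= 26)
        repeat = [rng.choice(alphabet) for _ in range(s)]   # the word R = A_1 ... A_s
        # Bernoulli sequence of length L0 on {R, A_1, ..., A_lam}:
        # nu positions carry the letter R, the remaining L0 - nu are random letters.
        r_positions = set(rng.sample(range(L0), nu))
        out = []
        for i in range(L0):
            if i in r_positions:
                out.extend(repeat)          # replace the letter R by its defining string
            else:
                out.append(rng.choice(alphabet))
        return ''.join(out)                  # final length N = L0 + nu*(s - 1)

    # Example: 4-letter alphabet, repeat length s = 20, about 20% repeat letters
    seq = repeat_sequence(lam=4, s=20, L0=10000, nu=100)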
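The shuffling experiments can be sketched in the same spirit. A minimal version for three levels, assuming plain whitespace and period tokenization (a simplification we chose; the authors also shuffled on the page and chapter level):

    import random

    def shuffle_on_level(text, level, seed=0):
        """Shuffle a text on a given level; correlations on scales below the
        shuffling level survive, larger-scale correlations are destroyed."""
        separators = {'letters': '', 'words': ' ', 'sentences': '. '}
        sep = separators[level]
        units = list(text) if level == 'letters' else text.split(sep)
        rng = random.Random(seed)
        rng.shuffle(units)
        return sep.join(units)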
2 Entropy and complexity of sequences

Let A_1 A_2 ... A_n be the letters of a given substring of length n ≤ L. Let further p^(n)(A_1 ... A_n) be the probability of finding in the string a block with the letters A_1 ... A_n. Then we may introduce the entropy per block of length n:

    H_n = - Σ p^(n)(A_1 ... A_n) log p^(n)(A_1 ... A_n) ,    (1)

    h_n = H_{n+1} - H_n ,    (2)

where the summation runs over all possible words of length n, i.e. over all words which could be found if the text had infinite length.
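In practice the block probabilities p^(n) are estimated from the frequencies of overlapping n-blocks. A minimal Python sketch of the estimators of eqs. (1) and (2) follows (in log-base-`base` units; the severe finite-sample corrections needed at large n are ignored here, and the function names are ours):

    import math
    from collections import Counter

    def block_entropy(text, n, base=32):
        """H_n of eq. (1): entropy of the n-block distribution, in log_base units."""
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(counts.values())
        return -sum(c / total * math.log(c / total, base) for c in counts.values())

    def uncertainty(text, n, base=32):
        """h_n of eq. (2): the entropy added by the (n+1)-th letter."""
        return block_entropy(text, n + 1, base) - block_entropy(text, n, base)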
For the case of repeat sequences we can carry out analytical calculations for the entropies [13]. For the periodic case N = νs the higher-order entropies H_n are constant and the uncertainties h_n are zero,

    h_n = 0 ,  H_n = H_s    if n ≥ s .    (3)

The lower-order entropies for n ≤ s depend on the concrete structure of the individual repeat and can easily be calculated by simple counting. For the case of proper repeat sequences, N ≥ sν, approximate formulae for the entropy are available [13]. We find, for example (in log_λ units),

    h_n = 1 - (s - s_c) ν / N ,    (4)

    H_n = H_s + (n - s) [1 - (s - s_c) ν / N] ,    (5)

if s ≤ n. Our methods for the analysis of the entropy of natural sequences have been explained in detail elsewhere [16]. We have shown that, at least in a reasonable approximation, the scaling of the entropy with the word length is given by a root law. Our best fit of the data obtained for texts on the 32-letter alphabet (measured in log_32 units) reads [17]

    H_n ≈ 0.5 √n + 0.05 n + 1.7 ,    (6)

    h_n ≈ 0.25 n^(-1/2) + 0.05 .    (7)

The dominating term is given by a root law corresponding to a rather long memory tail. We mention that a scaling law of the root type was first found by Hilberg, who made a new fit to Shannon's original data. We used our own data for n = 1, ..., 26 but included Shannon's result for n = 100 as well.

The idea of algorithmic entropy goes back to Kolmogorov and Chaitin and is based on the idea that sequences have a "shortest representation". The relation between the Shannon entropy and the algorithmic entropy is given by the theorems of Zvonkin and Levin. Several concrete procedures to construct shorter representations are used in data compression algorithms [1]. One may assume that those algorithms are still far from being optimal, since as a rule they do not take into account long correlations. A rather simple algorithm for finding compressed representations, which was proposed by Thiele and Scheidereiter, is called the grammar complexity [6]. Let us consider a word p on a certain alphabet and let K(p) be the length of the shortest representation of the given string.
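As a concrete example of a "shorter representation", the following sketch implements the LZ78 incremental parsing underlying the standard Lempel-Ziv compressors; by the Ziv-Lempel theorem the phrase count c(N) gives a crude upper estimate of the entropy per letter. This is our illustration of the compression side of the paper, not the grammar-complexity algorithm of Thiele and Scheidereiter.

    import math

    def lz78_phrase_count(text):
        """Parse text into the shortest phrases not seen before (LZ78)
        and return the number of phrases c(N)."""
        seen = set()
        phrase = ''
        count = 0
        for ch in text:
            phrase += ch
            if phrase not in seen:
                seen.add(phrase)
                count += 1
                phrase = ''
        return count + (1 if phrase else 0)   # count a trailing partial phrase

    def lz_entropy_rate(text, alphabet_size=32):
        """Crude entropy-rate estimate (log_alphabet units per letter):
        h ~ c(N) log c(N) / N for long sequences."""
        c = lz78_phrase_count(text)
        return c * math.log(max(c, 2), alphabet_size) / len(text)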