
PhysComp96. Full paper, draft of February 23, 1997.

Entropy and Compressibility of Symbol Sequences

Werner Ebeling [email protected]
Thorsten Pöschel [email protected]
Alexander Neiman [email protected]

Institute of Physics, Humboldt University,
Invalidenstr. 110, D-10115 Berlin, Germany

The purpose of this paper is to investigate long-range correlations in symbol sequences using methods of statistical physics and nonlinear dynamics. Beside the principal interest in the analysis of correlations and fluctuations comprising many letters, our main aim here is the problem of sequence compression. In spite of the great progress achieved in this field by the work of Shannon, Fano, Huffman, Lempel, Ziv and others [1], many questions still remain open. In particular one must note that since the basic work by Lempel and Ziv the improvement of the standard compression algorithms has been rather slow, not exceeding a few percent per decade. On the other hand several experts have expressed the idea that the long-range correlations which clearly exist in texts, computer programs etc. are not sufficiently taken into account by the standard algorithms [1]. Thus, our interest in compressibility is twofold: (i) We would like to explore how far compressibility is able to measure correlations. In particular we apply the standard algorithms to model sequences with known correlations such as, for instance, repeat sequences and symbolic sequences generated by maps. (ii) We aim to detect correlations which are not yet exploited by the standard compression algorithms and which therefore belong to potential reservoirs for improved compression algorithms.

First, the higher-order Shannon entropies are calculated. For repeat sequences analytic estimates are derived, which apply in some approximation to DNA sequences too. For symbolic strings obtained from special nonlinear maps and for several long texts a characteristic root law for the entropy scaling is detected. Then the compressibilities are estimated using grammar representations and several standard computing algorithms; in particular we use the Lempel-Ziv compression algorithm. Further, the mean square fluctuations of the composition with respect to the letter content and several characteristic scaling exponents are calculated. We show finally that all these measures are able to detect long-range correlations. However, as demonstrated by shuffling experiments, the different measuring methods operate on different length scales. The algorithms based on entropy or compressibility operate mainly on the word and sentence level. The characteristic scaling exponents reflect mainly the long-wave fluctuations of the composition, which may comprise a few hundreds or thousands of letters.

1 Correlation measures and investigated sequences

Characteristic quantities which measure long correlations are the dynamic entropies [2, 3, 4, 5, 6], correlation functions and mean square deviations, 1/f^δ noise [7, 8], scaling exponents [9, 10], higher order cumulants [11] and mutual information [12, 13]. Our working hypothesis, which we formulated in earlier papers [3, 8], is that texts and DNA show some structural analogies to strings generated by nonlinear processes at bifurcation points. This is demonstrated here first by the analysis of the behaviour of the higher order entropies. Further analysis is based e.g. on the mapping of the text to random walk and to "fluctuating gas" models as well as on the spectral analysis. These methods have found several applications to DNA sequences [9, 14, 12] and to human writings [11, 15].

At first we study model sequences with simple construction rules (nonlinear maps as e.g. the logistic map and the circle map, stochastic sequences containing repeating blocks etc.). A repeat sequence is defined as a random (Bernoulli-type) sequence on an alphabet A_1 ... A_λ into which a given word of length s, written on the same alphabet, is introduced ν times (ν being large). This simple type of sequence, which admits analytical calculations most easily, was introduced by Herzel et al. [13] for modelling the structure of DNA strings. Repeat sequences may be generated in the following way. First we define the repeat by a given string of length s,

    R = A_1 ... A_s,

i.e. len(R) = s. Then we generate a Bernoulli sequence of length L_0 with N_0 = L_0 letters on the extended alphabet R, A_1 ... A_λ. This sequence contains ν repeat symbols R and N_0 − ν "random" letters. Finally we replace each symbol R by its defining string. In this way a string with N letters is obtained which is like a sea of random letters with interspersed "ordered repeats". This string contains sν repeat letters.
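For concreteness, this construction may be sketched in a few lines of Python (a minimal illustration only; the function name and parameter values are ours and not part of the recipe above):

import random

def make_repeat_sequence(alphabet, s, L0, nu, seed=0):
    """Generate a repeat sequence: a Bernoulli string of L0 symbols on the
    extended alphabet {R, A1, ..., A_lambda}, with nu occurrences of the
    symbol R, which is finally expanded into a fixed random word of length s."""
    rng = random.Random(seed)
    repeat = [rng.choice(alphabet) for _ in range(s)]   # the repeat R = A1...As
    positions = set(rng.sample(range(L0), nu))          # where the nu copies of R sit
    out = []
    for i in range(L0):
        if i in positions:
            out.extend(repeat)                          # replace R by its defining word
        else:
            out.append(rng.choice(alphabet))            # "sea" of random letters
    return "".join(out), "".join(repeat)

# Example with a DNA-like alphabet: the result has N = L0 + nu*(s-1) letters,
# s*nu of which belong to repeats.
seq, rep = make_repeat_sequence("ACGT", s=20, L0=10_000, nu=200)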

Going along the string, each time we meet a repeat we first have to identify it. Let us assume we need the first s_c ≪ s letters for the identification. After any identification of a repeat we know where we are in the string and how to continue, i.e. the uncertainty is decreased. The repeat sequences defined in the way described above have well defined correlations with a range of s. Beyond the distance s the correlations are destroyed by the stochastic symbols interspersed between the repeats. This procedure may be continued in a hierarchical way in order to generate longer correlations and hierarchical structures with some similarity to texts. In the special case that the sequence consists of only one kind of repeat we get periodic sequences with the period s. In this case the range of correlations is infinite. Sequences with slowly decaying long correlations are obtained from the logistic map and the circle map at critical points [3, 8]. Further we study long standard information carriers, e.g. books and DNA strings, and compare them with the model sequences. In particular we studied the book "Moby Dick" by Melville (L ≈ 1,170,200 letters), the German edition of the Bible (L ≈ 4,423,030 letters), Grimm's Tales (L ≈ 1,435,820 letters), the "Brothers Karamazov" by Dostoevsky (L ≈ 1,896,000 letters), the DNA sequence of the lambda-virus (L ≈ 50,000 letters) and that of yeast (L ≈ 300,000 letters).

In order to find out the origin of the long-range correlations we studied the effects of shuffling long texts on different levels (letters, words, sentences, pages and chapters). The shuffled texts were always compared with the original one (without any shuffling). Of course the original and the shuffled files have the same letter distribution. However, only the correlations on scales below the shuffling level are conserved. The correlations (fluctuations) on higher levels, which are based on the large-scale structure of texts as well as on semantic relations, are destroyed by shuffling.
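The shuffling experiments may be sketched as follows (a minimal Python illustration; the splitting rules, e.g. sentences at full stops and pages as fixed-size chunks, are our simplifications and not necessarily the procedure used for the data reported below):

import random

def shuffle_on_level(text, level, page_len=2000, seed=0):
    """Shuffle a text on a given structural level; correlations below that
    level survive, correlations above it are destroyed."""
    rng = random.Random(seed)
    if level == "letters":
        units = list(text)
    elif level == "words":
        units = text.split(" ")
    elif level == "sentences":
        units = text.split(".")
    elif level == "pages":   # fixed-size chunks as a stand-in for pages
        units = [text[i:i + page_len] for i in range(0, len(text), page_len)]
    else:
        raise ValueError("unknown level: " + level)
    rng.shuffle(units)
    sep = {"words": " ", "sentences": "."}.get(level, "")
    return sep.join(units)

Note that all variants leave the letter distribution of the file unchanged, as required for the comparison.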
2 Entropy and complexity of sequences

Let A_1 A_2 ... A_n be the letters of a given substring of length n ≤ L. Let further p^(n)(A_1 ... A_n) be the probability to find in a string a block with the letters A_1 ... A_n. Then we may introduce the entropy per block of length n:

    H_n = − Σ p^(n)(A_1 ... A_n) log p^(n)(A_1 ... A_n),   (1)

    h_n = H_{n+1} − H_n,   (2)

where the summation runs over all possible words of length n, i.e. over all words which could be found if the text had infinite length. For the case of repeat sequences we can carry out analytical calculations for the entropies [13]. For the periodic case N = νs the higher order entropies H_n are constant and the uncertainties h_n are zero,

    h_n = 0;   H_n = H_s,   (3)

if n ≥ s. The lower order entropies for n ≤ s depend on the concrete structure of the individual repeat and can easily be calculated by simple counting. For the case of proper repeat sequences N ≥ sν approximative formulae for the entropies are available [13]. We find, for example (in log λ units),

    h_n = 1 − (s − s_c) ν / N,   (4)

    H_n = H_s + (n − s) [1 − (s − s_c) ν / N],   (5)

if s ≤ n. Our methods for the analysis of the entropy of natural sequences were explained in detail elsewhere [16]. We have shown that, at least in a reasonable approximation, the scaling of the entropy with the word length is given by a root law. Our best fit of the data obtained for texts on the 32-letter alphabet (measured in log_32 units) reads [17]

    H_n ≈ 0.5 √n + 0.05 n + 1.7,   (6)

    h_n ≈ 0.25 / √n + 0.05.   (7)

The dominating term is given by a root law corresponding to a rather long memory tail. We mention that a scaling law of the root type was first found by Hilberg [2], who made a new fit to Shannon's original data. We used our own data for n = 1, ..., 26 but included Shannon's result for n = 100 as well.

The idea of algorithmic entropy goes back to Kolmogorov and Chaitin and is based on the idea that sequences have a "shortest representation". The relation between the Shannon entropy and the algorithmic entropy is given by the theorems of Zvonkin and Levin. Several concrete procedures to construct shorter representations are used in data compression algorithms [1]. One may assume that those algorithms are still far from being optimal, since as a rule they do not take into account long correlations.
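Before turning to compression algorithms we note that the naive estimate of H_n and h_n from a finite text is a simple n-gram count, sketched below in Python. For large n this estimator is severely biased by undersampling, which is why the finite-sample corrections discussed in [16] are needed in practice; they are not reproduced here.

import math
from collections import Counter

def block_entropy(text, n, base=32):
    """Empirical block entropy H_n of eq. (1), in log_base units."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total, base) for c in counts.values())

def uncertainty(text, n, base=32):
    """h_n = H_{n+1} - H_n, eq. (2)."""
    return block_entropy(text, n + 1, base) - block_entropy(text, n, base)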

A rather simple algorithm for finding compressed representations, which was proposed by Thiele and Scheidereiter, is called the grammar complexity [6]. Let us consider a word p on a certain alphabet and let K(p) be the length of the shortest representation of the given string. Then we define the compressibility with respect to a given algorithm by

    c(p) = K(p) / K_0(p),   (8)

where K_0(p) is the length of the representation of the corresponding Bernoulli string on the same alphabet. For repeat sequences of length N we may find c(p) by using grammar representations [6]. We get the following compressed representation:

    S → A_1 ... A_k R A_1 ... A_s R A_1 ... A_n R ...,
    R → A_1 ... A_s.

From this representation we calculate the "grammar compressibility":

    K(p) ≤ (N − ν(s − 1)) log(λ + 1) + s log λ,   (9)

    c(p) = 1 − ρ(s − 1) + s log(λ + 1) / (N log λ),   (10)

where ρ = ν/N is the density of repeats.

[Figure 1: Lempel-Ziv complexities (dashed line) and scaling exponents of diffusion (full line) in dependence on the level of shuffling.]
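Numerically the bound is immediate (a small Python check computing c(p) directly from eqs. (8) and (9); the parameter values are arbitrary examples of ours):

import math

def grammar_compressibility(N, s, nu, lam):
    """Upper bound on c(p) = K(p)/K0(p) for a repeat sequence of length N
    containing nu repeats of length s on a lambda-letter alphabet."""
    K = (N - nu * (s - 1)) * math.log(lam + 1) + s * math.log(lam)  # eq. (9)
    K0 = N * math.log(lam)      # coding length of the Bernoulli string
    return K / K0

# c(p) < 1 signals compressibility; for a thin repeat density the factor
# log(lam + 1)/log(lam) from the enlarged alphabet can outweigh the gain.
print(grammar_compressibility(N=10_000, s=20, nu=200, lam=4))  # about 0.72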

The algorithmic entropy according to Lempel and Ziv is introduced as the ratio of the length of the compressed sequence (with respect to a Lempel-Ziv compression algorithm) to the original length. Explicit results obtained for the Lempel-Ziv complexities (entropies) of several sequences were given in earlier work [17] (tab. 1). The complexities in dependence on the shuffling level are graphically represented for the text "Moby Dick" in fig. 1.

[Table 1: The Lempel-Ziv complexities for several original and for the corresponding shuffled sequences.]
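In practice any Lempel-Ziv-type compressor yields such a ratio directly. As an illustration we use zlib, an LZ77 variant, as a stand-in for the compressor actually used for tab. 1, which is an assumption on our part:

import zlib

def lz_compressibility(text):
    """Ratio of compressed to original length; smaller values mean
    stronger (exploitable) correlations."""
    data = text.encode("latin-1", errors="replace")
    return len(zlib.compress(data, 9)) / len(data)

Comparing an original file with its shuffled version then shows directly how much of the compression rests on correlations rather than on the letter statistics alone.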

3 Mean square fluctuations and correlation functions

In this part we follow the methods proposed by Peng et al. [9, 10] and the invariant representation proposed by Voss [7]. However, we shall use another language: instead of formulating the problem in terms of random walks we express it in terms of the fluctuation theory of statistical physics. Let us consider a sequence with the total length L. Then the total number of letters is N = L and the mean density of letters is equal to 1. However, the number density of the different symbols may fluctuate along the string. In an earlier work [11] we considered for example the fluctuating local density of blanks in "Moby Dick" and pointed out the existence of rather long-wave fluctuations. We represented there the local frequency of the blanks (and other letters), averaged over windows of length 4,000, in dependence on the position along the text. The original text shows a large-scale structure extending over many windows. This reflects the fact that in some parts of the text we have many short words, e.g. in conversations (yielding the peaks of the space frequency), and in others we have more long words, e.g. in descriptions and in philosophical considerations (yielding the minima of the space frequency). The shuffled text shows a much weaker non-uniformity; the lower the shuffling level, the larger is the uniformity. More uniformity means smaller fluctuations and more similarity to a Bernoulli sequence. For the case of DNA sequences no analogies of pages, chapters etc. are known. Nevertheless the reaction to shuffling is similar to that of texts.

In order to quantify these findings let us define the number of letters of kind k inside a substring of length l as N(k, l). In the limit l → ∞ we get the average density n(k) = lim N(k, l)/l. Since we have λ different symbols we get in this way a λ-dimensional composition space. Let us now consider the fluctuations of N(k, l) as a function of l. We expect that N(k, l) fluctuates around the mean value ⟨N(k)⟩ = n(k) l. Further we assume that the mean square fluctuations scale with a certain power of the mean (particle) numbers:

    ⟨[N(k, l) − ⟨N(k)⟩]²⟩ = const · ⟨N(k)⟩^{α_k}.   (11)

The exponent α_k is called the characteristic mean square fluctuation exponent. In an analogous way we consider the sum of the mean square fluctuations, defining an exponent α by

    Σ_k ⟨[N(k, l) − ⟨N(k)⟩]²⟩ = const · N^α.   (12)

The case α(k) = 0.5 corresponds to the normal behaviour of mean square fluctuations which is expected in the absence of long-range correlations. If α(k) > 0.5 we have an anomalous fluctuation behaviour which reflects the existence of long-range correlations. In this respect we may use the term "coherent fluctuations" [17].

One can easily see that the above definitions give the same numbers for the α-coefficients as the mapping to a random walk in a λ-dimensional space [11]. Let us consider this procedure in brief: instead of the original string consisting of λ different symbols we generate λ strings on the binary alphabet (0, 1) (λ = 32 for texts). In the first string we place a "1" on all positions where there is an "a" in the original string and a "0" on all other positions. The same procedure is carried out for the remaining symbols too. Then we generate random processes corresponding to these strings, moving one step upwards for any "1" and remaining on the same level for any "0". The resulting move over a distance l is called y(k, l), where k denotes the symbol. Then, by defining a λ-dimensional vector space and considering y(k, l) as the component k of the state vector at the (discrete) "time" l, we can map the text to a trajectory. The corresponding procedure is carried out e.g. for the DNA sequences, which are mapped to a random walk on a 4-dimensional discrete space.

Let us now study the anomalous diffusion coefficients [10]. The mean-square displacement for symbol k is determined by

    F²(k, l) = ⟨y²(k, l)⟩ − (⟨y(k, l)⟩)²,   (13)

where the brackets ⟨·⟩ mean the averaging over all initial positions. The behaviour of F²(k, l) for l ≫ 1 is the focus of interest. It is expected that F(k, l) follows a power law [10],

    F(k, l) ∝ l^{α(k)},   (14)

where α(k) is the diffusion exponent for symbol k. We note that the diffusion exponent is related to the exponent of the power spectrum [10, 11]. Besides the individual diffusion exponents for the letters we get an averaged diffusion exponent α for the state space. The comparison of the formulae given above shows the complete equivalence of the fluctuation picture and the random walk picture.

We have investigated the characteristic exponents for several typical symbol sequences which were described above [17]. The data are summarized in tab. 1. In the same way we can obtain other important statistical quantities, such as higher-order moments and cumulants of y(k, l) (see [11]). By calculating the Hölder exponents D_q up to q = 6 we have shown that the higher order moments have (within the limits of accuracy) the same scaling behaviour as the second moment. We repeated the procedure described above for the shuffled files. The results of the calculations in dependence on the shuffling level are also shown in tab. 1. A graphical representation of the results for "Moby Dick" is given in fig. 1.

[Figure 2: Double-logarithmic plot of the power spectrum for "Moby Dick". The full line corresponds to the original text, the dashed line corresponds to the text shuffled on the chapter level and the dotted line to the text shuffled on the page level. Shuffling on a level below pages destroys the low-frequency branch.]

We see from the numbers in tab. 1 and from fig. 2 that the original texts and DNA sequences show strong long-range correlations, i.e. the coefficients of anomalous diffusion are clearly different from 1/2. After shuffling below the page level the sequences become practically Bernoullian in comparison with the original ones, since the diffusion coefficients decrease to a value of about 1/2. The decrease occurs in the shuffling regime between the page level and the chapter level. For DNA sequences the characteristic level of shuffling where the diffusion coefficient goes to 1/2 is about 500-1000. Our result demonstrates that shuffling on the level of symbols, words, sentences or pages, or segments of length 500-1000 in the DNA case, destroys the long-range correlations which are felt by the mean square deviations.
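The estimation of the exponents α(k) from eqs. (13) and (14) reduces to a linear fit in a double-logarithmic plot. A minimal numpy sketch follows; the window sizes and the use of a plain least-squares fit are our own choices:

import numpy as np

def diffusion_exponent(text, symbol, window_sizes=(16, 32, 64, 128, 256, 512)):
    """Fit alpha(k) in F(k, l) ~ l**alpha(k), eqs. (13)-(14)."""
    steps = np.fromiter((c == symbol for c in text), dtype=float)
    walk = np.concatenate(([0.0], np.cumsum(steps)))    # walk y(k, .) for symbol k
    F = []
    for l in window_sizes:
        disp = walk[l:] - walk[:-l]     # displacement over all windows of length l
        F.append(np.sqrt(disp.var()))   # F(k, l), eq. (13)
    # slope of log F versus log l gives alpha(k), eq. (14)
    alpha, _ = np.polyfit(np.log(window_sizes), np.log(F), 1)
    return alpha

# alpha near 0.5: Bernoulli-like; alpha clearly above 0.5: long-range correlations.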

4 Conclusions

Our results show that the dynamic entropies, the compressibilities and the scaling of the mean square deviations are appropriate measures for the long-range correlations in symbolic sequences. However, as demonstrated by the shuffling experiments, different measures operate on different length scales. The longest correlations found in our analysis comprise a few hundreds or thousands of letters and may be understood as long-wave fluctuations of the composition. These correlations (fluctuations) give rise to the anomalous diffusion and to the coherent fluctuations. There is some evidence that these correlations are based on the hierarchical organization of the sequences and on the structural relations between the levels. In other words, these correlations are connected with the grouping of the sentences into hierarchical structures such as paragraphs, pages, chapters etc. Usually, inside a certain substructure, the text shows a greater uniformity on the letter level. Possibly a more careful comparison of the correlations in texts and in DNA sequences may contribute to a better understanding of the informational structure of DNA, in particular of its modular structure.

Our results clearly demonstrate that the longest-range correlations in information carriers are of structural origin. The entropy-like measures studied in part two operate on the sentence and the word level. In some sense entropies are the most complete quantitative measures of correlation relations. This is due to the fact that the entropies include many-point correlations. On the other hand, the calculation of the higher order entropies is extremely difficult, and at the present moment there is no hope to extend the entropy analysis to the level of thousands of letters.

Hopefully the analysis of entropies, compressibilities and scaling exponents can be developed into useful instruments for studies of the large-scale structure of information-carrying sequences and may finally contribute to the finding of improved compression algorithms.

References

[1] Storer, James A., Data Compression: Methods and Theory, Computer Science Press (1988).

[2] Hilberg, W., "title???", Frequenz 44 (1990), 243–???; 45 (1991), 1–???.

[3] Ebeling, Werner, and Gregoire Nicolis, "title???", Chaos, Solitons & Fractals 2 (1992), 635–???.

[4] Ebeling, Werner, and Thorsten Pöschel, "Entropy and long range correlations in literary English", Europhys. Lett. 26 (1994), 241–246.

[5] Ebeling, Werner, Thorsten Pöschel, and Karl Friedrich Albrecht, "Transinformation and Word Distribution of Information-Carrying Sequences", Int. J. Bifurcation & Chaos 5 (1995), 51–61.

[6] Ebeling, Werner, and Miguel Ángel Jiménez-Montaño, "title???", Math. Biosci. 52 (1980), 53–???.

[7] Voss, Richard F., "title???", Phys. Rev. Lett. 68 (1992), 3805–???; Voss, Richard F., "???", Fractals 2 (1994), 1–???.

[8] Anishchenko, Vadim S., Werner Ebeling, and Aleksander B. Neiman, "???", Chaos, Solitons & Fractals 4 (1994), 69–???.

[9] Peng, C.-K., et al., "???", Nature 356 (1992), 168–???.

[10] Stanley, H. Eugene, et al., "title???", Physica A 205 (1994), 214–???.

[11] Ebeling, Werner, and Aleksander Neiman, "title???", Physica A 215 (1995), 233–???.

[12] Li, Wentian, and Kaneko, K., Europhys. Lett. 17 (1992), 655–???.

[13] Herzel, Hans-Peter, Werner Ebeling, and Armin O. Schmitt, "Entropies of biosequences: the role of repeats", Phys. Rev. E 50 (1994), 5061–???.

[14] Peng, C.-K., et al., "title???", Phys. Rev. E 49 (1994), 1685–???.

[15] Schenkel, A., Zhang, J., and Zhang, Y., "title???", Fractals 1 (1993), 47–???.

[16] Pöschel, Thorsten, Werner Ebeling, and Helge Rosé, "Guessing probability distributions from small samples", J. Stat. Phys. 80 (1995), 1443–1452.

[17] Ebeling, Werner, Aleksander Neiman, and Thorsten Pöschel, "Dynamic entropies, long-range correlations and fluctuations in complex linear structures". In: Suzuki, M., and Kawashima, N. (eds.), Coherent Approaches to Fluctuations, World Scientific (1995), ???–???.
