INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A COMMUNICATION SYSTEM

Liuling Gong, Nidhal Bouaynaya∗ and Dan Schonfeld

University of Illinois at Chicago, Dept. of Electrical and Computer Engineering,

ABSTRACT can be investigated in the context of engineering communica- In this paper, we investigate the information theoretic bounds tion codes. In particular, it is legitimate to ask at what rate of the channel of evolution introduced in [1]. The channel of can the genomic information be transmitted. And what is the evolution is modeled as the iteration of protein communica- average distortion between the transmitted message and the tion channels over time, where the transmitted messages are received message at this rate? Shannon’s channel capacity protein sequences and the encoded message is the DNA. We theorem states that, by properly encoding the source, a com- compute the capacity and the rate-distortion functions of the munication system can transmit information at a rate that is protein communication system for the three domains of life: as close to the channel capacity as one desires with an arbi- Achaea, Prokaryotes and Eukaryotes. We analyze the trade- trarily small transmission error. Conversely, it is not possi- off between the transmission rate and the distortion in noisy ble to reliably transmit at a rate greater than the channel ca- protein communication channels. As expected, comparison pacity. The theorem, however, is not constructive and does of the optimal transmission rate with the channel capacity in- not provide any help in designing such codes. In the case dicates that the biological fidelity does not reach the Shan- of biological communication systems, however, evolution has non optimal distortion. However, the relationship between already designed the code for us. The encoded message is the channel capacity and rate distortion achieved for differ- the DNA sequence. Comparison of the genomic transmis- ent biological domains provides tremendous insight into the sion rate with the channel capacity will reveal whether the ge- dynamics of the evolutionary processes. We rely on these re- nomic code is efficient from an information theoretic perspec- sults to provide a model of protein sequence evolution based tive. However, even if the channel capacity is not exceeded, on the two major evolutionary processes: and un- we are assured that biological communication systems do not equal crossover. rely on codes that produce negligible errors since the level of distortion presented must account for evolutionary processes. Index Terms— Biological communication system; Chan- It is, therefore, interesting to ask ourselves whether biologi- nel capacity; Rate-distortion theory. cal communication systems maintain an optimal balance be- tween the transmission rate and the desired distortion level 1. INTRODUCTION needed to support adaptive evolution. Rate-distortion theory analyzes the optimal tradeoff between the transmission rate, The genetic information storage and transmission apparatus R(D), and distortion, D, in noisy communication channels. resembles engineering communication systems in many ways: Given the fidelity, D, present in biological communication The genomic information is digitally encoded in the DNA. systems, comparison of the genomic transmission rate with By decoding into , organisms come into be- the optimal rate R(D) can be used to determine whether or ing. The protein communication system, proposed in [1], not the genomic code achieves the optimal rate-distortion cri- [2] and shown in Fig. 1, is a communication model of the teria. Moreover, by equating the optimal rate R(D) with the genetic information storage and transmission apparatus. The channel capacity, C, we can determine whether the biological protein communication system abstracts a as a set of pro- fidelity, D, reaches the Shannon optimum distortion. In this teins and models the process of cell division as an informa- paper, we will only compare the channel capacity and rate tion communication system between protein sets. Using this distortion functions of a single source memoryless protein mathematical model of protein communication, the problem communication system, modelling asexual reproduction. The of a species’ evolution will be represented as the iteration of two-source protein communication system, modelling sexual a communication channel over time. reproduction, is more involved mathematically and will not The genome is viewed as the joint source-channel en- be addressed here. coded message of the protein communication system and hence

∗Nidhal Bouaynaya is currently in the Department of Systems Engineer- ing at the University of Arkansas at Little Rock.

1-4244-1198-X/07/$25.00 ©2007 IEEE 1 SSP 2007 Fig. 1. Protein communication system

2. PROTEIN CHANNEL CAPACITY And no information can be obtained at the output after in- finite transmissions. The reason is that, in communications, Assuming a first-order Markov channel, the protein commu- only the initial message is used to convey information and not nication channel is characterized by the probability transition the channel. In , on the other hand, the output matrix, Q = {qi,j }1≤i,j≤20, of the amino acids. In this pa- message captures the information of the channel (i.e. the mu- per, we use two different probability transition matrices: Day- tations) regardless of the initial message. In particular, a par- hoff’s Point Accepted (PAM) matrices [3], and a ent organism cannot transmit reliably (channel capacity zero) first-order Markov transition probability matrix P [1], [2]. An its genetic information to its offspring of many generations element of a PAM matrix, Mij, gives the probability that the no matter how small the rate is as long as it aminoacidinrowi will be replaced by the in col- is not zero. After a long enough time of evolution, the final umn j after a given evolutionary interval which is interpreted distribution of amino acids in offspring only depends on the as the evolutionary distance of PAM matrices. The first-order channel characteristics regardless of the parent organism. It is Markov transition probability matrix P is constructed from also interesting to observe that organisms with lower mutation the using a point mutation rate α, which rep- rates have higher channel capacity, and therefore their genetic resents the probability of a base interchange of any one nu- information can be reliably transmitted at a higher rate. cleotide and is assumed to be constant over time (see [1] for Having computed the capacity of the protein communica- the computation of P). tion channel, a comparison of the genomic transmission rate The capacity of a channel is the maximum rate at which with the channel capacity will reveal whether the genomic information can be reliably conveyed by the channel. It is code is efficient from an information theoretic perspective. defined as

  Qjk C I p, Q pjQjk  , 3. PROTEIN RATE DISTORTION =maxn ( )=maxn j k log p∈P p∈P k pjQjk  (1) The rate distortion function, R(D), is the effective rate at P n {p ∈ Rn p ≥ ∀j p } where = : j 0 ; j j =1 is the which the source produces information subject to the con- set of all probability distributions on the channel input, Q is straint that the receiver can tolerate an average distortion D. I p, Q the probability transition matrix of the channel, and ( ) A distortion matrix with elements ρi,j specifies the distortion is known as the mutual information between the channel in- associated with reproducing the ith source letter by the jth put and output. Evaluation of the channel capacity involves reproducing letter. The rate-distortion function for discrete solution of a convex programming problem. In most cases, memoryless source is defined as analytic solutions cannot be found. Blahut [4] suggested an   Q iterative algorithm for computing the channel capacity.  jk R(D)= min I(p, Q)= min j k pjQjk log , Figure 2(a) (resp. 2(b)) shows the capacity of the pro- Q∈QD Q∈QD k pjQjk tein communication system as a function of the evolutionary  (2) Q {Q ∈ Rn × Rn Q ,Q ≥ distance of PAM matrices (resp. point mutation rate α). As where D =   : k jk =1 jk expected, the channel capacity decreases to zero as the evolu- 0,d(Q) ≤ D}, d(Q)= j k pjQjkρjk,andp = {pj} tionary distance or the point mutation rate α increases. This is the probability vector of the channel input. result has different ramifications on bioinformatics than on We define the distortion between a pair of amino acids communication engineering: In engineering, it is interpreted as their distance in the Principal Component Analysis (PCA) as a loss of information after a great number of transmissions. plane obtained from 7 physico-chemical properties (volume,

2 4.5 4 4 3.5

3 3.5

2.5 3 2

1.5 2.5 Channel Capacity Channel Capacity

1 2 0.5

0 1.5 0 200 400 600 800 1000 0 0.02 0.04 0.06 0.08 0.1 0.12 PAM ,k the parameter alfa k (a) (b)

Fig. 2. Channel Capacity: (a) Channel capacity v.s. the evolutionary distance of PAM matrices; (b) Channel capacity v.s. the point mutation rate α

3 W R 2 F Y I K 1 L M H V 0 Q E T P −1 C N D Second component −2 A S −3

G −4 −3 −2 −1 0 1 2 3 4 First component

Fig. 3. Plot of the amino acids on the first two components of the PCA analysis. The amino acids are labelled by their one-letter standard abbreviations. bulkiness, polarity, PH index, hydrophobicity and surface area). curves of Archaea, Bacteria and Eukaryotes, respectively. At The amino acid data was obtained from [Chapter 2] [5]. The about D ≈ 1.4, the above order switches to result of PCA analysis is shown in Fig 3. The amino acid probability distributions in Archaea, Bacteria and Eukaryote R(D)Eu

3

Archaea 2 Archaea 0.6 Archaea 4.2 Bacteria Bacteria Bacteria Eukaryote 1.9 Eukaryote Eukaryote 4 0.5 1.8

3.8 1.7 0.4 3.6 1.6

1.5 0.3 rate R rate R 3.4 rate R 1.4 3.2 0.2 1.3 3 1.2 0.1 2.8 1.1

1 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 4 4.5 5 5.5 6 6.5 7 7.5 distorion D distorion D distorion D (a) (b) (c)

Fig. 4. Rate-distortion curves for Archaea, Bacteria and Eukaryotes: (a) low distortion region; (b) intersection region; (c) high distortion region. the three domains. Table 1. Average Rate-Distortion for three domains of life The actual average distortion over the protein communi- Archaea Bacteria Eukaryote cation channel is defined as   Distortion 9.1491 8.9964 8.8979 D = pjqjkρjk, (5) j k where Q = {qi,j } is the probability transition matrix of the Table 2. Scaled norm of odd moments Archaea Bacteria Eukaryote channel, p = {pj} is the distribution of the channel input and scaled norm 5.0653 73.5401 1.0000 ρi,j is the distortion between amino acids i and j. By trial and error Dayhoff et al. [3] found that the 250 PAM matrix works well for scoring of actual protein sequences. At this evolu- tionary distance (250 substitutions per hundred residues) only distortion. Thus, the biological rate-distortion values R∗(D) one amino acid in five remains unchanged. Table 1 displays corresponding to the average distortions given in Table 1 are the actual average distortion for Archaea, Bacteria and Eu- still less than the Shannon channel capacity and the bounds karyotes, where PAM250 was used as the probability transi- on the R-D curves still exhibit the same reversal phenomenon tion matrix of the channel. Observe that the biological rate- depicted in Fig. 4. distortion values R(D), corresponding to the average distor- tions given in Table 1, are less than the Shannon channel ca- 3.1. Evolutionary Model: Amino Acid Distribution pacity (C =0.8197 >R(D)). So, from rate-distortion the- ory, we can ascertain that the genetic information is encoded It is well known from information theory that the Gaussian so that the system reproduces the initial input with fidelity D. input maximizes the mutual information in an additive Gaus- In particular, the biological communication system does not sian noise [8], i.e., rely on codes that produce negligible errors since the level of I X X Z∗ ≤ I X∗ X∗ Z∗ , distortion present must account for evolutionary processes. ( ; + ) ( ; + ) (7) The formula of the rate-distortion function given in Eq. where I(a, b) is the mutual information between input a and (2) is valid only for discrete-time stationary and memoryless ∗ output b, X is the input, Z is the channel noise and denotes sources. For discrete-time stationary sources with memory, Gaussianity. We will show that the amino acid distribution Wyner and Ziv derived bounds for their rate-distortion func- in Eukaryotes is “more Gaussian” than Bacteria and Archaea. tion [7] as follows: Since a distribution is uniquely characterized by the set of its ∗ R(D) − Δ ≤ R (D) ≤ R(D) , (6) moments and given that the odd moments of the Gaussian dis- tribution are identically zero, we compute the odd moments where R(D) is the rate-distortion function of the memoryless of the amino acid distribution for the three branches of life source with the same marginal statistics, Δ is a measure of [6]. Table 2 displays the scaled norm of the first 4 odd mo- the memory of the source and is independent of the distor- ments (3rd, 5th, 7th and 9th) for Archaea, Bacteria and Eukary- tion measure and the distortion value D. So, the R-D curves otes. The odd moments norm of Archaea (resp. Bacteria) for a source with memory are always shifted down compared are 5 (resp. 73) times higher than Eukaryotes, asserting that to the R-D curves of the corresponding memoryless source. the amino acid distribution of Eukaryotes is “more Gaussian” Moreover, the shift is a function of the source and not the than the two other groups of life. To explain the low-distortion

4 region in Fig.4 and the switching-over of the R-D curves, rithmica, Special issue on Algorithmic Methodologies for we have to dig deeper into the evolutionary processes, which Processing Protein Structures, Sequences and Networks, shaped the three groups of life. to appear. [2] N. Bouaynaya and D. Schonfeld, “Biological evolution: 3.2. Evolutionary Process: Mutation and Crossover Distribution and convergence analysis of amino acids,” It is widely accepted today that the main driving forces of evo- in 28th Annual International Conference of the IEEE En- lution are mutations and unequal crossover 1. Furthermore, gineering in Medicine and Biology Society (EMBS’06), Archaea and Bacteria rely mostly on mutations for adaptabil- August 2006, pp. 2045–2048. ity and survival. So, we can fairly postulate that mutations drive the evolution of Archaea and Bacteria whereas unequal [3] M. Dayhoff, R. Schwartz, and B. Orcutt, “A model of crossovers drive the evolution of Eukaryotes. A mutation in- evolutionary change in proteins,” Altas of Protein Se- volves one or a very short sequence of . quence and Structure, vol. 5, no. 3, pp. 345–352, 1978. Therefore, it induces much less modifications to the genome [4] R. Blahut, “Computation of channel capacity and rate- sequence than any unequal crossover. So at the beginning of distortion functions,” IEEE Transactions on Information evolution, the distortion caused by mutations is small com- Theory, vol. 18, pp. 460–472, July 1972. pared to the distortion caused by unequal crossovers. How- ever, with time, mutations accumulate much faster than the [5] P. Higgs and T. Attwood, Bioinformatics and Molecular rare unequal crossovers. So, the distortion caused by mu- Evolution, Blackwell publishing, 2005. tations exceeds, over time, the distortion caused by unequal crossovers. This implies higher fidelity, over time, in Eu- [6] N. Bogatryreva, A. Finkelstein, and O. Galzitskaya, karyotes than Bacteria and Archaea. For example, assume “Trend of amino acid composition of proteins of differ- that mutations and unequal crossovers follow a Non Homo- ent taxa,” Journal of Bioinformatics and Computational geneous Poisson Process (NHPP) within the genome. The Biology, vol. 4, no. 2, pp. 597–608, 2006. NHPP process is a Poisson Process with a time-dependent [7] A. Wyner and J. Ziv, “Bounds on the rate-distortion func- λ t n rate parameter, ( ). The Probability that there are events tion for stationary sources with memory,” IEEE Transac- r, r s in the interval ( + ) is calculated as follows tions on Information Theory, vol. 17, no. 5, pp. 508–513,   − r+s λ(t) dt r+s n September 1971. e r ( λ(t)) P (N(r, r + s) − N(r)=n)= r n! [8] T. M. Cover and J. A. Thomas, Element of Informa- (8) λ t tion Theory, Wiley-Interscience, New York, 2nd edition, It can be shown that there exist parameters ( )mutation 2006. and λ(t)unequal crossover representing different functions of time which induce the trend of the R-D curves observed in Fig. 4.

4. CONCLUSION

By modeling evolution as the iteration of a protein commu- nication system over time, we were able to study it from an information theoretic perspective. Investigation of the biolog- ical communication channel capacity and the rate-distortion curves of the three branches of life, Archaea, Bacteria and Eu- karyotes, reveals that the biological fidelity D does not reach the Shannon optimum distortion. Furthermore, we relied on these results to provide an evolutionary model of the three groups of life based on mutations and unequal crossovers.

5. REFERENCES

[1] N. Bouaynaya and D. Schonfeld, “Protein communica- tion system: Evolution and genomic structure,” Algo-

1Unequal crossover is a crossover between homologous chromosomes that are not perfectly aligned. It results in a duplication of genes on one chromosome and a deletion of these on the other.

5