Information-Theoretic Bounds of Evolutionary Processes Modeled As a Protein Communication System
Total Page:16
File Type:pdf, Size:1020Kb
INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM Liuling Gong, Nidhal Bouaynaya∗ and Dan Schonfeld University of Illinois at Chicago, Dept. of Electrical and Computer Engineering, ABSTRACT can be investigated in the context of engineering communica- In this paper, we investigate the information theoretic bounds tion codes. In particular, it is legitimate to ask at what rate of the channel of evolution introduced in [1]. The channel of can the genomic information be transmitted. And what is the evolution is modeled as the iteration of protein communica- average distortion between the transmitted message and the tion channels over time, where the transmitted messages are received message at this rate? Shannon’s channel capacity protein sequences and the encoded message is the DNA. We theorem states that, by properly encoding the source, a com- compute the capacity and the rate-distortion functions of the munication system can transmit information at a rate that is protein communication system for the three domains of life: as close to the channel capacity as one desires with an arbi- Achaea, Prokaryotes and Eukaryotes. We analyze the trade- trarily small transmission error. Conversely, it is not possi- off between the transmission rate and the distortion in noisy ble to reliably transmit at a rate greater than the channel ca- protein communication channels. As expected, comparison pacity. The theorem, however, is not constructive and does of the optimal transmission rate with the channel capacity in- not provide any help in designing such codes. In the case dicates that the biological fidelity does not reach the Shan- of biological communication systems, however, evolution has non optimal distortion. However, the relationship between already designed the code for us. The encoded message is the channel capacity and rate distortion achieved for differ- the DNA sequence. Comparison of the genomic transmis- ent biological domains provides tremendous insight into the sion rate with the channel capacity will reveal whether the ge- dynamics of the evolutionary processes. We rely on these re- nomic code is efficient from an information theoretic perspec- sults to provide a model of protein sequence evolution based tive. However, even if the channel capacity is not exceeded, on the two major evolutionary processes: mutations and un- we are assured that biological communication systems do not equal crossover. rely on codes that produce negligible errors since the level of distortion presented must account for evolutionary processes. Index Terms— Biological communication system; Chan- It is, therefore, interesting to ask ourselves whether biologi- nel capacity; Rate-distortion theory. cal communication systems maintain an optimal balance be- tween the transmission rate and the desired distortion level 1. INTRODUCTION needed to support adaptive evolution. Rate-distortion theory analyzes the optimal tradeoff between the transmission rate, The genetic information storage and transmission apparatus R(D), and distortion, D, in noisy communication channels. resembles engineering communication systems in many ways: Given the fidelity, D, present in biological communication The genomic information is digitally encoded in the DNA. systems, comparison of the genomic transmission rate with By decoding genes into proteins, organisms come into be- the optimal rate R(D) can be used to determine whether or ing. The protein communication system, proposed in [1], not the genomic code achieves the optimal rate-distortion cri- [2] and shown in Fig. 1, is a communication model of the teria. Moreover, by equating the optimal rate R(D) with the genetic information storage and transmission apparatus. The channel capacity, C, we can determine whether the biological protein communication system abstracts a cell as a set of pro- fidelity, D, reaches the Shannon optimum distortion. In this teins and models the process of cell division as an informa- paper, we will only compare the channel capacity and rate tion communication system between protein sets. Using this distortion functions of a single source memoryless protein mathematical model of protein communication, the problem communication system, modelling asexual reproduction. The of a species’ evolution will be represented as the iteration of two-source protein communication system, modelling sexual a communication channel over time. reproduction, is more involved mathematically and will not The genome is viewed as the joint source-channel en- be addressed here. coded message of the protein communication system and hence ∗Nidhal Bouaynaya is currently in the Department of Systems Engineer- ing at the University of Arkansas at Little Rock. 1-4244-1198-X/07/$25.00 ©2007 IEEE 1 SSP 2007 Fig. 1. Protein communication system 2. PROTEIN CHANNEL CAPACITY And no information can be obtained at the output after in- finite transmissions. The reason is that, in communications, Assuming a first-order Markov channel, the protein commu- only the initial message is used to convey information and not nication channel is characterized by the probability transition the channel. In bioinformatics, on the other hand, the output matrix, Q = {qi,j }1≤i,j≤20, of the amino acids. In this pa- message captures the information of the channel (i.e. the mu- per, we use two different probability transition matrices: Day- tations) regardless of the initial message. In particular, a par- hoff’s Point Accepted Mutation (PAM) matrices [3], and a ent organism cannot transmit reliably (channel capacity zero) first-order Markov transition probability matrix P [1], [2]. An its genetic information to its offspring of many generations element of a PAM matrix, Mij, gives the probability that the no matter how small the point mutation rate is as long as it aminoacidinrowi will be replaced by the amino acid in col- is not zero. After a long enough time of evolution, the final umn j after a given evolutionary interval which is interpreted distribution of amino acids in offspring only depends on the as the evolutionary distance of PAM matrices. The first-order channel characteristics regardless of the parent organism. It is Markov transition probability matrix P is constructed from also interesting to observe that organisms with lower mutation the genetic code using a point mutation rate α, which rep- rates have higher channel capacity, and therefore their genetic resents the probability of a base interchange of any one nu- information can be reliably transmitted at a higher rate. cleotide and is assumed to be constant over time (see [1] for Having computed the capacity of the protein communica- the computation of P). tion channel, a comparison of the genomic transmission rate The capacity of a channel is the maximum rate at which with the channel capacity will reveal whether the genomic information can be reliably conveyed by the channel. It is code is efficient from an information theoretic perspective. defined as Qjk C I p, Q pjQjk , 3. PROTEIN RATE DISTORTION =maxn ( )=maxn j k log p∈P p∈P k pjQjk (1) The rate distortion function, R(D), is the effective rate at P n {p ∈ Rn p ≥ ∀j p } where = : j 0 ; j j =1 is the which the source produces information subject to the con- set of all probability distributions on the channel input, Q is straint that the receiver can tolerate an average distortion D. I p, Q the probability transition matrix of the channel, and ( ) A distortion matrix with elements ρi,j specifies the distortion is known as the mutual information between the channel in- associated with reproducing the ith source letter by the jth put and output. Evaluation of the channel capacity involves reproducing letter. The rate-distortion function for discrete solution of a convex programming problem. In most cases, memoryless source is defined as analytic solutions cannot be found. Blahut [4] suggested an Q iterative algorithm for computing the channel capacity. jk R(D)= min I(p, Q)= min j k pjQjk log , Figure 2(a) (resp. 2(b)) shows the capacity of the pro- Q∈QD Q∈QD k pjQjk tein communication system as a function of the evolutionary (2) Q {Q ∈ Rn × Rn Q ,Q ≥ distance of PAM matrices (resp. point mutation rate α). As where D = : k jk =1 jk expected, the channel capacity decreases to zero as the evolu- 0,d(Q) ≤ D}, d(Q)= j k pjQjkρjk,andp = {pj} tionary distance or the point mutation rate α increases. This is the probability vector of the channel input. result has different ramifications on bioinformatics than on We define the distortion between a pair of amino acids communication engineering: In engineering, it is interpreted as their distance in the Principal Component Analysis (PCA) as a loss of information after a great number of transmissions. plane obtained from 7 physico-chemical properties (volume, 2 4.5 4 4 3.5 3 3.5 2.5 3 2 1.5 2.5 Channel Capacity Channel Capacity 1 2 0.5 0 1.5 0 200 400 600 800 1000 0 0.02 0.04 0.06 0.08 0.1 0.12 PAM ,k the parameter alfa k (a) (b) Fig. 2. Channel Capacity: (a) Channel capacity v.s. the evolutionary distance of PAM matrices; (b) Channel capacity v.s. the point mutation rate α 3 W R 2 F Y I K 1 L M H V 0 Q E T P −1 C N D Second component −2 A S −3 G −4 −3 −2 −1 0 1 2 3 4 First component Fig. 3. Plot of the amino acids on the first two components of the PCA analysis. The amino acids are labelled by their one-letter standard abbreviations. bulkiness, polarity, PH index, hydrophobicity and surface area). curves of Archaea, Bacteria and Eukaryotes, respectively. At The amino acid data was obtained from [Chapter 2] [5]. The about D ≈ 1.4, the above order switches to result of PCA analysis is shown in Fig 3.