
BioSystems 159 (2017) 1–11

Contents lists available at ScienceDirect

BioSystems

journal homepage: www.elsevier.com/locate/biosystems

Equalizing the information amounts of protein and mRNA by information theory

Y. Adiguzel

Biophysics Department, School of Medicine, Istanbul Kemerburgaz University, Istanbul, Turkey

ARTICLE INFO

Article history:
Received 16 December 2016
Received in revised form 21 April 2017
Accepted 19 May 2017
Available online 15 June 2017

Keywords:
Codon evolution
Protein conformation
mRNA
Protein
Shannon’s communication theory
Information amount

ABSTRACT

Based on Shannon’s information communication theory, the information amount of the entire length of a polymeric macromolecule can be calculated in bits by adding the entropies of each building block. Proteins, DNA and RNA are such macromolecules. When only the variation of the building blocks is considered as the source of entropy, there is seemingly lower information in the protein if this approach is applied directly to a protein of specific size and to the coding sequence of the mRNA corresponding to that length of protein. This decrease in the information amount seems contradictory, but the apparent conflict is resolved by considering the conformational variations in proteins as a new variable in the calculation and by balancing the approximated entropy of the coding part of the mRNA and the protein. Probabilities can change; therefore we also assigned hypothetical probabilities to the conformational states, which represent the uneven distribution as the time spent in one conformation, providing the probability of the presence in either or one of the possible conformations. Results that are obtained by using hypothetical probabilities are in line with the experimental values of variations in the conformational state of protein populations. This equalization approach has the further biological relevance that it compensates for the degeneracy in the codon usage during protein translation, and it leads to the conclusion that the alphabet size for the protein is rather optimal for proper protein functioning within the thermodynamic milieu of the cell. The findings are also discussed in relation to the codon bias and have implications in relation to the codon evolution concept. Eventually, this work brings the fields of protein structural studies and molecular protein translation processes together with a novel approach.

© 2017 Elsevier B.V. All rights reserved.

Correspondence to: Present address: Department of Biophysics, School of Medicine, Istanbul Kemerburgaz University, Kartaltepe Mah Incirli Cad. No: 11, Bakirkoy, Istanbul, Turkey.
E-mail address: [email protected]
http://dx.doi.org/10.1016/j.biosystems.2017.05.003
0303-2647/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Shannon’s information communication theory (Shannon, 1948; Apter and Wolpert, 1965; Rontó, 1999) is a fruitful field that is well-known mostly in bioinformatics (Koslicki, 2011; Orosz and Ovádi, 2011). In the simplest sense, it is based on taking the logarithm of the variation in a condition that typically bears the information to be communicated. In mathematical terms, taking the logarithm has a rationale similar to representing an exponentially increasing variable on a logarithmic axis when the relation of that variable with another entity is sketched as a graph. Variations in distinct entities can be immensely diverse in number, and comparing them by using their actual quantities may practically seem meaningless even when that is not actually the case. Taking the logarithms can make sense in such situations. A similar approach is followed, in the crude sense, in the quantification of information, which is characterized by the number of possible distinct messages that could be processed at a communication event. Obviously, this depends on the type of communication and on what is being regarded as a message. Whatever they are, the universal notion in information communication is that the simplest condition of such an event involves two (equally) possible distinct messages, which could be for example the on or off situation of a device, or of an enzyme. This simplest condition could even be the presence or absence of a signal in any sort of signal communication event. Accordingly, choosing 2 as the base of the logarithm was recognised as the common approach, which resulted in the calculation of the information content of the ‘simplest’ signal communication event with two (equally) possible distinct messages as 1 bit. By this means, 1 bit is the unit of information, which can simply represent the presence or absence of the informing entity. Through calculating the information content of any two diverse signal communication
events in bits, one can make a comparison of the two events, since this calculation also takes into account the realisation probabilities of the individual messages of the respective events. Such a comparison would not be immediately possible without such a quantification (or in other words, standardization). So, this standardization enables quantification in various information communication systems and hence comparing and analysing them. The information content of a message can be calculated through the number of all possible messages, along with their realisation propensities, that are eliminated once a certain message is received. Messages are comprised of sequential or multiple symbols, the entropy of which would be proportional to the variation in them. This explanation is valid and understandable in terms of biological macromolecules such as DNA, RNA or proteins, wherein the nucleotides or amino acids, as the building blocks of the respective macromolecules, are the symbols, and the macromolecules themselves, namely the proteins or the mRNAs that code for proteins, are the messages. In relation, the suitability of information theoretic approaches in biosciences was proved by the presence of redundancy in DNA (Battail, 2014). That redundancy serves well for reliable message transmission through the unreliable channels of biological environments. In a review, by applying the formula equivalent to information transmission channel capacity to molecular systems, the efficiency of protein binding was suggested to be 70% (Schneider, 2010).

Proteins are important functional components of the cells and they take active roles in the molecular communication networks, nanonetworks (Akyildiz et al., 2008; Dressler and Akan, 2010; Adiguzel, 2016a). They are also part of a network that involves several other proteins, DNA and RNAs during their synthesis, namely the protein translation. Two entities of that network, mRNA and proteins, are treated here with an information theoretic approach, since information theory can be used for information quantification in distinct types of messages. Information processing capacities of various mechanisms can be related through determining the information contents of molecular structures (Sarkar et al., 1978; Sullivan et al., 2003; Aynechi and Kuntz, 2005a; Aynechi and Kuntz, 2005b). Accordingly, we aim to implement the information communication theory in the calculation and balancing of the information amounts of the messenger Ribonucleic Acid (mRNA) and protein macromolecules, which are part of the protein translation machinery as the inputs (mRNA) and the outputs (protein).

Information theoretic approaches have been applied to the problem of protein structure determination (Orosz and Ovádi, 2011; Rocha et al., 2012) and to calculating topological entropies (Koslicki, 2011). Common to these approaches, Shannon’s communication theory (Shannon, 1948; Apter and Wolpert, 1965; Rontó, 1999) is the basis for calculating the information amount of a message that can be carried through molecules. Huang and Hwang (2005) previously computed the conformational entropy from protein sequences using a machine learning approach. They termed it sequence structural entropy and found a close relation between low sequence structural entropy and the slow hydrogen exchange regions of the proteins.

Because they are linear polymeric chains composed mainly of building blocks, the information amounts of the proteins and the protein-coding parts of the mRNAs can be calculated as the sum of the entropies of each of their respective building blocks, based on the inherent variations in these building blocks. Namely, there are 20 amino acids, each of which is encoded by a nucleotide triplet (codon), making a total of 64 different combinations due to the presence of 4 different nucleotides (start and stop codons are not eliminated). The immediate outcome of this calculation approach is that it results in a lower information amount for the protein, as a linear polymeric chain of amino acids, compared to that of the mRNA, which is a linear polymeric chain of nucleotide bases. Here we show that the discrepancy vanishes when the variation of the protein conformation is introduced as an additional source of entropy per residue. Basically, the statistical variation of the conformations in the folded state of the protein is considered to represent the missing entropy per residue, and hence the missing information amounts of the proteins compared to those of the mRNAs. So, the statistical distribution of conformations in the protein population provides additional entropy per amino acid residue of the protein. In this sense, the work has an inherent relation to the Boltzmann entropy as well.

Different from the calculations for the information contents of the molecular structures (Sarkar et al., 1978; Sullivan et al., 2003; Aynechi and Kuntz, 2005a; Aynechi and Kuntz, 2005b), it is presented here that equalizing the information contents of the protein and the coding part of the mRNA compensates for the degeneracy in the codon usage during protein translation. Further, the concept is evaluated together with the concept of the reduced alphabet (Murphy et al., 2000; Solis and Rackovsky, 2000; Li et al., 2003; Bacardit et al., 2009; Hod et al., 2013), which can be used for the computational approaches of the protein folding studies, or so. (Reduced alphabet refers to the least number of amino acid types that could be relied on for attaining all the present/available structural variations in the proteins.) It is concluded that the alphabet size is somewhat optimal for the thermodynamic milieu of the cell. The findings will further be discussed in relation to the codon bias and codon evolution concepts (Sorimachi and Okayasu, 2008; Cannarozzi and Schneider, 2012).

2. Calculation

Below in Eq. (1), H is the entropy in bits per residue, which would also be the information amount in each residue.

$$H = -\sum_{i=1}^{20} P_i \log_2 P_i \qquad (1)$$

The subscript i in Eq. (1) is the protein residues’ variation as the amino acid type. The equation is the same, but the upper limit will be 4, when the mRNA nucleotides are the variables and hence the calculation is performed for the mRNA. Here the calculation is performed for the protein and therefore i goes up to 20, since there are 20 different amino acids, any of which can be present at each residue. The Pi in Eq. (1) stands for the probability distribution of the amino acid types in the corresponding subsequence (Oliver et al., 1999). So, it is the probability of the amino acids’ occurrence in the protein or peptide, and it would have been that of the bases’ occurrence in the mRNA. If the probabilities of each variable are equal, their value is 0.05 each for the amino acids’ probability of occurrence in the protein. It would have been 0.25 for the probability of occurrence of each of the bases in the coding part of the mRNA when the probabilities of each are equal. Making the same calculation in Eq. (1) for every amino acid through the full length (l) of the protein will result in the information amount of that length of a protein. In other words, the information amount will be the sum of the entropies at each residue as a discrete entity. This sum, which is shown in Eq. (2), gives the information amount of a specific length of the polymeric macromolecule of interest. Therefore, it would be the same for mRNA, except that the upper limit of the inner sum will be 4 and the length will be 3 times that of the protein to be compared with, since groups of 3 nucleotides in the mRNA, termed codons, each code for one amino acid in the protein.

$$H = -\sum_{j=1}^{l} \sum_{i=1}^{20} P_{i,j} \log_2 P_{i,j} \qquad (2)$$

H in Eq. (2) is the information amount in bits, for a specific length l of a protein or peptide chain. The subscript i is again the protein
residues’ variation as the amino acid type, while Pi,j is the Pi amino acid type-probabilities at every position of a sequence with length l. As mentioned, the upper limit value of the inner sum needs to be 4 in the case of the protein-coding mRNA molecule, and the length l will be 3 times that of the protein to be compared with. A nucleotide triplet (codon) codes for each amino acid in the protein and therefore the variation in those triplet sequences is 64, which is the number of all possible combinations of the 4 nucleotides. Pi is 0.05 in peptides and proteins, representing the equal probability of the presence of each of the 20 different amino acids when there is no bias for any amino acid type. So, there is lowered variation per residue in proteins compared to that of the mRNA codons. This lowered variation results in a decrease of about the same extent in the information amount of a certain length of protein in comparison to the mRNA molecule that would be encoding a protein of the same length. No conformational variation is considered until here. However, the entropy per residue of the protein can be equalized to that of a triplet nucleotide (codon) by accounting for the conformational variations in the proteins, which would assume 2 or 3 alternative states for each residue, as shown in Eq. (3), which shows the case of 3 alternative states since the upper limit of the inner sum at the second term is 3. This is based on the fact that proteins exist in more than one conformational state, and this is the case in the cell as well. Conformational variation is not considered for the entropy of the mRNA nucleotides, since the information that is transmitted from mRNA to protein resides in its sequence.

$$H = -\left( \sum_{j=1}^{l} \sum_{i=1}^{20} P_{i,j} \log_2 P_{i,j} + \sum_{j=1}^{l} \sum_{c=1}^{3} P_{c,j} \log_2 P_{c,j} \right) \qquad (3)$$

H in Eq. (3) is also the information amount in bits, for a specific length l of a protein or peptide chain, but when the proteins’ conformational variation is taken into account. The first term inside the parenthesis is the same as that in Eq. (2), while in the second term, Pc,j is the Pc amino acid conformation type-probabilities at every position of a sequence with length l. So, the subscript i is the protein residues’ variation and the subscript c is the conformational variation. The c goes up to 3 here, representing the calculations for the condition of 3 conformational states. Therefore, the upper limit of the inner sum at the second term is 2 when two conformational states are considered, or else, according to the number of conformational states that are being accounted for.

Hypothetical probabilities are also attributed to the conformational states in the calculations, in order to represent the uneven distribution of the protein conformations as the time spent in one conformation. The same fact, namely the probability of the presence in either or one of the possible conformations, will be observed as separation of the protein population into more than one conformational state, which would be prone to changes under the influence of modulating factors like pH, temperature, presence of ligands, etc. If Pc is the probability estimated from the conformational variation per residue, then the entropy per residue will be the sum of the variation due to this conformational variation and that due to the amino acid types’ variation. Pc is 0.5 when there are two conformations and no bias. This is based on the assumption that the protein in question, and thus all of its residues, spend an equal amount of time in one conformation as in the other conformation. In other words, half of the proteins are populated in one and the rest in the other conformation, which could well be the native conformation versus the unfolded state, or else. Similarly, Pc is 0.33 when there are three conformations and no bias. This time the assumption is that the proteins are equally populated at each of the one unfolded and two folded conformations. The latter can both be native conformations, but one could be the active and the other the inactive conformation. This is well observed in cases of proteins with enzymatic functions, which are generally under the regulation of the presence of the substrate or the substrate concentration, in addition to the regulation of other physiological parameters. Such situations of co-existing populations of conformational states are present in the literature as well (Dunker et al., 2002; Gunesakaran et al., 2003). This does not contradict the fact that the proteins’ conformational states primarily rely on the primary (amino acid) sequence of the proteins. For the sake of simplicity, and due to the fact that we did not focus on a certain type of organism, or so, no bias was accounted for in any type of variation that has been considered here, but the respective probability values are required to be used when there is bias, and this is almost always the case. The information amount of proteins is denoted as Hp in Fig. 1 and Fig. 2.

By using the approach that is described above, the information amounts of 124 representative proteins’ “full-length” mRNAs, namely the coding parts and the untranslated regions (UTRs) of the mRNAs, are also calculated. The data of mRNA lengths was obtained from the database of the National Institute of Health. Detailed information on the respective gene ID numbers and the mRNA and protein lengths is provided in the Supporting file, along with the

Fig. 1. Information amount changes (in bits) with increasing protein chain length up to 2000 amino acids. The information amounts are calculated for the coding part of the mRNA, for the protein, and again for the protein by accounting for 2 and 3 major conformational states. Graphs are drawn from the data obtained through the calculations using the formulas given in the Calculation section, by considering equal probabilities of all variables. The information amount is denoted as Hp for the protein. The graph for the protein that was recalculated for 3 different conformational states depicts a situation wherein the information amount is quite close to that of the coding part of the mRNA of the size that would be coding for the corresponding chain length of protein.
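The equal-probability curves described in the Fig. 1 caption can be sketched in a few lines. This is a minimal illustration, assuming no bias in any variable, so that each entropy term of Eqs. (1)–(3) reduces to the logarithm of the number of states; the function and constant names are hypothetical and not taken from the paper.

```python
import math

AMINO_ACID_TYPES = 20  # alphabet size of proteins, as stated in the text
NUCLEOTIDE_TYPES = 4   # alphabet size of mRNA, as stated in the text
CODON_LENGTH = 3       # nucleotides that code for one amino acid


def uniform_entropy(n_states: int) -> float:
    """Entropy in bits of n equally probable states, i.e. log2(n)."""
    return math.log2(n_states)


def mrna_coding_information(protein_length: int) -> float:
    """Information amount of the coding part of the mRNA for a protein of this length."""
    return CODON_LENGTH * protein_length * uniform_entropy(NUCLEOTIDE_TYPES)


def protein_information(protein_length: int, conformations: int = 1) -> float:
    """Information amount of the protein; conformations=1 adds no conformational term."""
    per_residue = uniform_entropy(AMINO_ACID_TYPES) + uniform_entropy(conformations)
    return protein_length * per_residue


if __name__ == "__main__":
    # Columns: length, mRNA coding part, protein, protein (2 states), protein (3 states)
    for length in (10, 500, 2000):
        print(
            length,
            round(mrna_coding_information(length), 1),
            round(protein_information(length), 1),
            round(protein_information(length, 2), 1),
            round(protein_information(length, 3), 1),
        )
```

Under these assumptions the coding part of the mRNA carries 6 bits per encoded residue, while the protein carries about 4.32, 5.32 and 5.91 bits per residue with 1, 2 and 3 equally populated conformational states, which is why the 3-state curve lies closest to the mRNA curve.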

Fig. 2. Information amount changes (in bits) with increasing protein chain length up to 15160 amino acids. The information amounts are calculated for the coding part of the mRNA, for the protein, and again for the protein by accounting for 3 and 5 major conformational states. Graphs are drawn from the data obtained through the calculations using the formulas given in the Calculation section, by considering equal probabilities of all variables. The information amount is denoted as Hp for the protein. The graph for the information amounts of the proteins recalculated for 3 different conformational states with equal probabilities was still found to be the one closest to the linear fit graph of the calculated information amounts of the “full-length” mRNAs. “Full-length” mRNAs are the coding part plus the untranslated regions (UTRs) of the mRNA (Supporting file). Yet, it may well be observed that the information amounts of the proteins recalculated for 5 different conformational states with equal probabilities would be closer to the information amounts of the “full-length” mRNAs at low molecular lengths, as can be seen through the grey coloured data points.
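As a worked illustration of why both the 3-state and the 5-state curves appear in Fig. 2, the per-residue entropies under equal probabilities follow directly from Eqs. (1)–(3). The symbols $H_{\mathrm{coding}}$, $H_p^{(3)}$ and $H_p^{(5)}$ are shorthand introduced here for illustration only; the values are rounded.

```latex
\begin{align*}
H_{\mathrm{coding}} &= 3 \log_2 4 = 6 \ \text{bits per encoded residue} \\
H_p^{(3)} &= \log_2 20 + \log_2 3 \approx 4.322 + 1.585 = 5.907 \ \text{bits per residue} \\
H_p^{(5)} &= \log_2 20 + \log_2 5 \approx 4.322 + 2.322 = 6.644 \ \text{bits per residue}
\end{align*}
```

The coding part alone (6 bits) sits between the two protein values. A “full-length” mRNA also carries untranslated regions, so its information amount per encoded residue exceeds 6 bits, which is consistent with the 5-state curve lying closer to some of the “full-length” data points.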

very first references that were cited in the web-sites. The graph for the entropies of the proteins is recalculated for a higher number of (5) different conformational states as well.

2.1. Bias condition

The condition of bias in the conformational states is calculated as an example by considering distinct hypothetical bias conditions. For the 2 conformational states conditions, the bias was present in the Pc probabilities of being at one or the other conformational state, which was 0.75–0.25 instead of the equal probabilities case. The respective H values are represented as H’ in Table 1. In another one, the Pc values were 0.90–0.10 and the H values are denoted as H” in Table 1. In the third one, the Pc values were 0.95–0.05, and the H values are indicated as H‡ in Table 1. In the fourth one, the Pc values were 0.99–0.01, and the H values are shown as H¥ in Table 1. For the 3 conformational states conditions, the first bias was introduced as the Pc values of 0.50, 0.25, and 0.25. In the second bias condition, the Pc values were 0.45, 0.45, and 0.10. In the third bias condition, the Pc values were 0.475, 0.475, and 0.05. In the fourth bias condition, the Pc values were 0.495, 0.495, and 0.01. However, amino acid bias was again not represented. The results are summarized in Table 1 for the following amino acid lengths: 10, 50, 100, 250, 500, 505, 750, 1000, 1500, and 2000. The amino acid chain length of 505 amino acids is specifically included because the calculations for the same length are performed by using the experimental conformational distribution values (Bernadó and Blackledge, 2010) of the 505 amino acid long (Dos Santos et al., 2013) enzyme, tyrosine-protein kinase Hck (Table 1, the last two columns). This protein has 3 primary conformational states denoted as the assembled, partly assembled, and disassembled states (Fig. 3). Accordingly, the protein is in dynamic equilibrium of 3 conformational states in its free form, with the following percentages: 82% assembled, 10% partly disassembled, and 8% disassembled (Bernadó and Blackledge, 2010). When the signalling peptide ligand is bound, the conformations’ populations shift. The conformational states’ percentages become as follows: 50% partly disassembled, 45% disassembled, 5% several other conformational states (Bernadó and Blackledge, 2010). It was mentioned above that the probability of the presence in either or one of the possible conformations is observed as separation of the protein population into more than one conformational state, which could be prone to shifts. This is the situation here. Therefore, the Pc probabilities that need to be attributed to the varying conformational states of each respective protein, and of all the amino acids that comprise that protein, are 0.82, 0.10 and 0.08. So, this probability distribution is under the control of ligand binding, which makes the probabilities 0.50, 0.45 and 0.05 in the ligand-bound condition.

3. Results and discussion

The protein length and the conformational variation are inspected here to evaluate the regulation of the information capacity variations between the proteins and the coding parts of the mRNA sequences that are used during protein translation. The proteins’ conformational variation is included here in the Shannon’s or Shannon-Weaver’s information theoretic calculations for the proteins. Our assumption was the presence of a 2-state, e.g. active-inactive state, or a 3-state, e.g. active-inactive state and unfolded conformation. The information amounts of the coding parts of the mRNAs and of the proteins were balanced when the conformational variation is sensibly considered as a 3-state model (Fig. 1). This is actually justified in the literature several times (Dunker et al., 2002; Gunesakaran et al., 2003). The apparent information content variation between the proteins and the coding parts of the mRNA sequences was thus eliminated.

A lowered information amount is normally revealed in the protein when the proteins are compared with the mRNA molecules of the appropriate coding length, considering the fact that protein amino acids are encoded by mRNA triplet codons (Fig. 1). This is expected since there are 20 amino acids that are encoded by a three-letter code of 4 bases, making up a total of 64 triplet codons. The excess number of codons leads to multiple codons for a single amino acid. This excess was commonly evaluated as a mechanism that reduces errors, and it is termed the degeneracy in the codon usage. When the proteins’ conformational variation is introduced to the calculations, it compensates for the differences in the entropies of the longer chains of the coding parts of the mRNAs and the shorter chains of the translated products, namely the peptides and proteins. This is achieved especially when the conformational variation of the full

Table 1
Information amounts for different sizes of proteins, with no bias and with hypothetical bias conditions in conformational states, together with one example of conformational state probabilities that are derived through the experimental values.*

Length  Hp(2)    Hp(3)    Hp’(2)   Hp’(3)   Hp”(2)   Hp”(3)   Hp‡(2)   Hp‡(3)   Hp¥(2)   Hp¥(3)   H^Hck-free   H^Hck-bd
(aa)    Pc(2)    Pc(3)    Pc’(2)   Pc’(3)   Pc”(2)   Pc”(3)   Pc‡(2)   Pc‡(3)   Pc¥(2)   Pc¥(3)   Pc^Hck-free  Pc^Hck-bd
        0.500    0.333    0.750    0.500    0.900    0.450    0.950    0.475    0.990    0.495    0.820        0.500
        0.500    0.333    0.250    0.250    0.100    0.450    0.050    0.475    0.010    0.495    0.100        0.450
                 0.333             0.250             0.100             0.050             0.010    0.080        0.050
10      53.219   59.069   51.332   58.219   47.909   56.909   46.083   55.583   44.027   53.927
50      266.10   295.34   256.66   291.10   239.55   284.55   230.42   277.92   220.14   269.64
100     532.19   590.69   513.32   582.19   479.09   569.09   460.83   555.83   440.27   539.27
250     1330.5   1476.7   1283.3   1455.5   1197.7   1422.7   1152.1   1389.6   1100.7   1348.2
500     2661.0   2953.4   2566.6   2911.0   2395.5   2845.5   2304.2   2779.2   2201.4   2696.4
505     2687.6   2983.0   2592.3   2940.1   2419.4   2873.9   2327.2   2807.0   2223.4   2723.3   2616.1       2806.0
750     3991.4   4430.2   3849.9   4366.4   3593.2   4268.2   3456.2   4168.7   3302.0   4044.5
1000    5321.9   5906.9   5133.2   5821.9   4790.9   5690.9   4608.3   5558.3   4402.7   5392.7
1500    7982.9   8860.3   7699.8   8732.9   7186.4   8536.4   6912.5   8337.5   6604.1   8089.1
2000    10644    11814    10266    11644    9581.8   11382    9216.7   11117    8805.4   10785

* Results of the hypothetical bias conditions in conformational states that are evaluated for the information amounts (in bits) of the proteins with different amino acid (aa) chain lengths and 2 or 3 major conformational states, which are indicated in parentheses in the first two rows. Except for the first column, wherein the amino acid chain length is given, the second row provides the Pc values of the respective column and the rest is the result of the information amounts’ calculation. In the columns from left to right, the no bias condition is followed by the bias condition and gradually increasing biases, for both states in case of the 2 conformational states condition and for only one state in case of the 3 conformational states condition. The last two columns are calculated with the Pc values of the free and bound (bd) forms of the Hck protein, respectively, derived from the literature (Bernadó and Blackledge, 2010). No bias in the amino acid type is considered here.
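The entries of Table 1 can be reproduced with a short calculation. The sketch below assumes, as in the text, that the same conformational probability distribution applies to every residue and that the 20 amino acid types are equally probable; the function names are hypothetical and chosen only for this illustration.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits; states with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def biased_protein_information(length, conformation_probs, residue_probs=None):
    """Information amount in bits of a protein of the given length.

    Sums the amino acid type entropy and the conformational entropy per
    residue and multiplies by the length, assuming every residue shares the
    same two distributions.
    """
    if residue_probs is None:
        residue_probs = [1 / 20] * 20  # no amino acid bias, as in Table 1
    per_residue = shannon_entropy(residue_probs) + shannon_entropy(conformation_probs)
    return length * per_residue


# Conformational populations of Hck quoted from Bernadó and Blackledge (2010)
hck_free = biased_protein_information(505, [0.82, 0.10, 0.08])
hck_bound = biased_protein_information(505, [0.50, 0.45, 0.05])

if __name__ == "__main__":
    print(round(hck_free, 1), round(hck_bound, 1))
```

With these inputs the free and bound forms come out at about 2616.1 and 2806.0 bits, matching the last two columns of the 505 amino acid row, and the hypothetical bias columns follow in the same way from their Pc values.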

Fig. 3. Adoption of several conformational states by the multidomain Hck enzyme in solution. Reprinted with permission from Nature Publishing Group. It was indicated by

Bernadó and Blackledge (2010) that the figure was adapted from Yang et al. (2010).

length proteins or peptides is represented by 3 major distinct conformations. Other possible minor conformational variations can be omitted or considered all together because of their much smaller presence in the proteins’ conformational state distributions. In other words, when the protein conformation of interest is not e.g. the active state, that conformation could be accounted as the inactive conformation, which would be so in the biological realm as well. It is also shown in Table 1. There, the information amounts of the 3 conformational states given in the 11th column, in between the 3rd and 12th rows, are very close to those of the 2 conformational states with no bias (the 2nd column, the same rows). This is observed when the population of the third conformation is actually 1%.

The equalization approach above, until here, was based on the assumption of the presence of the same conformational variation in all residues of a protein, but this is unlikely. Yet, similar outcomes can easily be attained by other means, such as boosted conformational variation in a limited number of amino acids compensating for the rest of the amino acids with no variation, for instance. In other words, two amino acids with two conformational states each would have the same outcome as two amino acids with 1 state and 4 states, respectively (1 + 1 = 0 + 2 bits). As a crude analogy, it is like the B factor in protein structure modelling with bioinformatics.

In Table 1, calculations with the Pc probabilities that are derived from experimental conformational distribution values (Bernadó and Blackledge, 2010) of the 505 amino acid long (Dos Santos et al., 2013) tyrosine-protein kinase Hck (Fig. 3) are also shown (the last two columns). This protein is an enzyme. It is in dynamic equilibrium of 3 conformational states in its free (unbound) form. In this form, the assembled conformation is dominating (82%) (Bernadó and Blackledge, 2010). When it is in bound form, namely when the signalling peptide is bound, the conformation distribution of the protein population shifts and the partly disassembled (50%) and disassembled (45%) conformations dominate, while the remaining 5% of the proteins are in other conformational states (Bernadó and Blackledge, 2010). Accordingly, the Pc probabilities are 0.82, 0.10 and 0.08 in the former (free, unbound) case and 0.50, 0.45 and 0.05 in the latter bound state. The information amount in case of a 505 amino acids-long protein with the same conformational state distributions in its amino acids as that of the free Tyrosine-protein kinase Hck is 2616.1 bits. This is close to the information amount of a protein that has 2 distinct, equally populated conformations that are pertinent to its amino acids when there is no bias (2687.6 bits; see row 8, column 2 of Table 1; written in bold) or relatively little bias (2592.3 bits; see row 8, column 4 of Table 1; written in bold). On the other hand, the information amount in case of a 505 amino acids-long protein with the same conformational state distributions in its amino acids as that of the bound form of Tyrosine-protein kinase Hck is 2806.0 bits. This is about 190 bits more than the information amount in the previous, free, unbound state. This increase is expected since the extreme bias condition

3.1. Biological relevance

The approximations that were used here are biologically relevant or, better to say, biologically significant. First of all, proteins’ conformational changes that are in dynamic equilibrium (Bernadó and Blackledge, 2010) are widely described as two-state models (Lumry et al., 1966; Barrick, 2009). Two-state models involve both the native and denatured states (Gittis et al., 1994) and the allosteric regulation of proteins, which is generally represented by proteins undergoing conformational transitions upon ligand binding (Goh et al., 2004; Youn et al., 2008). In case of allosteric modulation, the population range that will undergo conformational transition is determined by the effective ligand concentration. However, which individual protein will be found bound to the ligand is stochastically determined and can be influenced by many other factors such as physical conditions, presence of competitive binding partners, concentration gradients of the protein (Kiekebusch and Thanbichler, 2014) and the ligand, overall molecular crowding (Van den Berg et al., 2000; Van den Berg et al., 1999), and physiochemical parameters like temperature and pH, which can vary at the subcellular level as well (Talley and Alexov, 2010). Proteins can be intrinsically disordered but are still stabilized in the cell so that their functionalities are maintained (Theillet et al., 2014). The population of proteins at transient conformational states (Gray et al., 2012) under certain circumstances is another relevant aspect, if the proteins are not mutants that are frozen in a transient state like that in a constitutively active state (Kwon et al., 2010; Castellano and Downward, 2011). Proteins switch between certain conformations rather than generating considerably crowded populations of numerous intermediates (Lumry et al., 1966; Barrick, 2009). Further, there are conformational transition pathways that proceed with intermediates, and such pathways are modulated in the cellular milieu but still with protein conformation populations, which are stochastically distributed and the stochastic distributions of which shift e.g. by molecular binding (Erdmann et al., 2013; Swift and McCammon, 2009). Also, please see the section “3.5 Relevant Examples” for further insight into the biological relevance of this work.

3.2. Attributing protein conformational variation to amino acids

Conformation, which is sourced principally by the H-bonding variation between residues that leads to a structure of the known type, and amino acid type, which is principally a chemical variation, are treated as independent of each other. This can also be objected by arguing that amino acids are the sole determinants of

of the previous case, the free-state, is diminished. However, this the structure, within or outside the cell. However, environmental

information amount is in between the information amounts of the factors are having a strong influence on protein conformations and

same-length proteins that have 2 and 3 distinct conformations can lead the proteins shift from one conformation to the other. This

when there is no bias (2687.6 bits and 2983.0 bits, respectively). is the basis of the present approach. On the other hand, structural

The case that is rather unexpected is that there are 3 conforma- variation of the whole protein is attributed to its amino acids, as dis-

tional states and the extreme bias that is favouring heavily only one crete entities. Abstracting probabilities for amino acids as discrete

single conformational state ends up in the information amount that entities from the experimental observations of protein populations

is close to the condition with 2 conformational states and no bias. is a rough approximation. On the other hand, transition of a protein

In this condition, one would expect in the first hand that bias that from one structure to another needs to follow an allosteric mecha-

is favouring two states equally and un-favouring one state would nism, or so. It means that the structural transition by itself is prone

rather be ending up in the information amount that is close to the to such probabilistic events of the discrete amino acids, which may

condition with 2 conformational states and no bias. On the other be through quite different mechanisms but they should eventually

hand, information amount calculation results with the probabil- be in line with that of the protein as a whole entity and in a manner

ities that are derived from the literature for the bound-state of that will provide us with the final probabilistic distribution of the

the protein are close to the hypothetical conditions with similar protein structure, together with the other amino acids that are also

probability values, which is normal. Overall, it is observed that the discrete entities by themselves.

presence of bias in the probabilities of conformational states and

the increase in this bias decreases the information amount, wherein 3.3. Considering untranslated regions (UTRs) of the mRNAs

it gets closer to the condition of the presence of less number of

conformational states. For comparison, information amounts of the 124 representa-

tive proteins’ “full-length” mRNAs, namely coding parts and the

untranslated regions (UTRs) of the mRNAs, are calculated (Fig. 2). The data of mRNA lengths were obtained from the database of the NIH, and the decrease in the number of representative proteins’ “full-length” mRNAs with longer mRNA molecules represents the diminished abundance with increased molecular size. In Fig. 2, the graph for the information amounts of the proteins recalculated for 3 different conformational states with equal probabilities was still found to be the one closest to the linear fit of the calculated information amounts of the “full-length” mRNAs. A “full-length” mRNA is the coding part plus the untranslated regions (UTRs) of the mRNA (Supporting file). Still, it may well be observed that the information amounts of the proteins recalculated for 5 different conformational states with equal probabilities would be closer to the information amounts of the “full-length” mRNAs at low molecular lengths, as can be seen in Fig. 2. However, we do not prefer this approximation, also because mRNA still has untranslated regions that are functionally important after removal of the introns (Pesole et al., 2001; Kawaguchi and Bailey-Serres, 2005; Matoulkova et al., 2012) but our attempt here was to take the protein-related ones into consideration rather than the translation process itself. The structure of mRNA has regulatory implications in translation (Mauger et al., 2013). Therefore, UTRs are out of the scope of the current work, although there are studies on protein similarity searches under restricted mRNA structures (Backofen et al., 2002; Gurski, 2008; Blin et al., 2008), which may be a means to improve this approach in the future. Similarly, alternative splicing is also out of the scope. Varying translated proteins of alternative splicing events would be evaluated by comparing the specific amino acid sequences that will be coded by the respective alternative splicing products.

3.4. Further inferences and relation of codon bias

It should be mentioned here that the information amounts that were discussed here are quantitative, as indicated in the terminology, and can supplement a qualitative evaluation rather than indicating a qualitative measure such as functionality. Yet, the approach may evolve in that direction. In addition, conformational states with small probabilities, which correspond to small distributions of a protein’s certain conformational states, can be of high importance in the proteins’ reaction mechanisms considering the transition-state intermediates. This can be understood better by recalling the relevant description in the explanation of calculations. There, it was written that (hypothetical) probabilities can be regarded as assignments to the conformational states through correlations of the uneven conformation distributions as the time spent in each conformation that is considered, as a measure providing the probability of the presence in the respective conformation. Huang and Hwang (2005) previously computed the conformational entropy from protein sequences using the machine learning approach. They termed it sequence structural entropy and used the eight secondary structures which are defined by the DSSP. However, one could also account for only 4 major secondary structural elements (helices, sheets, coils, and turns) or even only 3 major secondary structural elements, by considering also the turns among the coils. This is in line with the presented assumption of this work since changes in the secondary structures result in the conformational variations. Huang and Hwang (2005) further found a close relation between the low sequence structural entropy and slow hydrogen exchange regions of the proteins. This is in relation to, and actually a good example of, what was meant above by stating that the information amounts that we discuss are quantitative but can well supplement a qualitative evaluation and the approach may evolve in that direction. Such an attempt would be claiming that the protein conformational variations are determined mainly by the current thermodynamic milieu of the cell and the entropies and the related information amounts of mRNA and proteins are matched with the present amino acid variation. Conformational variations in the proteins may be represented with a reduced amino acid alphabet (Murphy et al., 2000; Solis and Rackovsky, 2000; Li et al., 2003; Bacardit et al., 2009; Hod et al., 2013) but then the unit entities’ entropies and the related information amounts of mRNA and proteins would be equalized under conditions of higher conformational variations in the current thermodynamic milieu of the cell, which favours the present alphabet size. Yet, possible cell-specific variations in the thermodynamic states may be observed together with reduced alphabet size propensities and related conformational variations. This is further discussed below, in relation to the codon degeneracy and codon bias concepts.

3.4.1. Relation of codon bias

Codon evolution deals with the codon usage and “a cellular amino acid composition based on phenotype (that) eventually resembles the genomic amino acid composition based on genotype” (Sorimachi and Okayasu, 2008; Cannarozzi and Schneider, 2012). Each respective genome represents varying codon degeneracy since there are certain genes with distinct copy numbers and expression patterns, which distinguishes them from the other species. According to this work, varying degeneracies should be expected to end up in different conformational variations and hence stabilities of the same protein. It is known that environmental physiological parameters determine the conformational variation of a protein with a certain sequence. Therefore, the right causality relation that would not conflict with this work and the present state-of-the-art knowledge would be a change in the apparent codon bias and hence the codon degeneracy. By this means, possible cell-specific variations in the thermodynamic states may be observed together with reduced alphabet size propensities and related conformational variations. This may explain the protein stability and the interrelated conformational variation modulation by distinct species, within the same environments. As a result, these findings can contribute further to the species’ variation concept. Additionally, through these explanations, thermodynamic entropy is treated as a means of application of Shannon’s information theory, in parallel with the view of Jaynes (1957), or in other words, postulations of the information theory are driven by thermodynamics.

3.4.2. General perspective in relation to this work and codon bias

The average symbol quantities (information) of the bases in DNA or the amino acids in protein, as discrete sources, are not memoryless in practice since the evolutionary force on the available sequences resulted in the selection of functional proteins, which provided an advantage for survival. However, we are not evaluating a conscious being that is calculating the information amount in a given length of a sequence; it is a cell that is acting with chemical and physical constraints. Therefore, the available chemical variation that is posed by all possible bases’ variation in the nucleotide sequences of DNA and all possible amino acids’ variation in the peptides or proteins is the source of information. One should make the distinction that we assumed the cell is a chemical information processor even if it is acting as part of a semantic information processor (Adiguzel, 2016b). This is the case for stochastic events-based living systems. The situation would be different in computational approaches, which are closer to the semantic information processors rather than individual cells. In computational approaches, the more representative information one provides to such programs, the better is the calculation outcome, as long as the processing capacity is high enough to evaluate the input information and the algorithm is written accordingly, to be able to process that additional type of information. So, functions of a program, including what type of information to evaluate, depend on the algorithm. In cells, functions can be prone to adaptations to the internal or environmental changes thus vary in
time, and those ‘novel’ functions can even be encoded in DNA as epigenetic alterations. Consider codon bias as the most relevant example of such a variation, although it is not considered an epigenetic alteration. However, that information, namely the favoured preference of different sets of codons by the organisms (Sharp and Li, 1986), is specific to a given type of organism and prone to change even at the tissue level for higher organisms (Plotkin and Kudla, 2011). Obviously, we did not work here on a specific type of organism. Therefore, our current approach is lacking the implementation of such rigorous facts as codon bias, which is observed in cells. So, there is codon bias in the protein translation machinery but this is not accounted for here since it is assumed to be valid in specific conditions that are of evolutionary origin (Biro, 2008a), as a consequence of the driving factors such as faster cell division (Quax et al., 2015), etc. (Sharp et al., 2010). Yet, in this work, we considered the conformational types of the proteins as an additional source of information besides the residue-based variation. Accordingly, one can argue against this approach by saying that the present number of conformations in the proteins is also our own observation as another form of bias in all the possible conformations that could have been attained by the amino acid chains and therefore it should not have been included in the calculations that are performed here. This is actually true but, unlike the case of bias in the codon usage, that information is not specific for the organisms and encoded in their genome; it is rather universal in the sense that protein populations in certain conditions would attain similar conformational distributions. Namely, codon bias is informative but for different aspects than those considered here and further, as already mentioned, we did not work on a specific organism. Therefore, codon bias is assumed not to have an effect on the (maximum) entropy per nucleotide in case of the DNA sequences. Such information has to be accounted for in the studies that would involve certain types of organisms, and in conditions such as slow or fast cell division rates. Intriguingly, codon bias was recently reported to regulate co-translational protein folding (Yu et al., 2015). That makes sense since the degeneracy in the codon usage is regarded here to be indirectly linked to the protein structure and folding, based on the assumption of equal information contents in a certain length of amino acid chain and a piece of mRNA that would be encoding for the same length of a peptide or protein.

3.4.3. Comments on differences of eukaryotes and prokaryotes, and codon bias

This work suggests a link between the major conformational variation of the proteins and the mRNA, in terms of the information content. The mRNA has been chosen specifically here since it lacks the introns, the nucleotide regions that are not translated but present in the coding region of the DNA and transcribed into the pre-mRNA but then removed before the translation. The aim in keeping the introns away from the discourse of this work through this approach is a subject of another study in progress, which is relating the presence of introns in the eukaryotic genome with the average sizes of the proteins (Adiguzel, 2015). It is of importance here to mention the protein translation in bacteria as well since the prokaryotes lack the introns in their genome and would expectedly have limited average sizes of proteins compared to the eukaryotes, with such a perspective. There are protein length differences between the prokaryotic and eukaryotic genomes, which is in accordance with this perspective (Brocchieri and Karlin, 2005). We will not go further into the details here since it is out of the scope of the current work (interested readers could contact the author for further details).

Since prokaryotes are just mentioned, the implications of codon bias in bacteria could be elaborated here. In both eukaryotes and prokaryotes, the same tRNA can recognize more than one codon coding for the same amino acid. (tRNAs recognize the mRNA codons through their anticodons and carry the amino acids specific for the sequences that they are in charge of.) However, commonly, there are different tRNAs for distinct codons and change in the abundances of these tRNAs results in a bias at the state of translation (Dale and Park, 2004). Quax and co-workers (Quax et al., 2015) highlighted this in their review by stating that 45 tRNAs with distinct anticodons are present in the eukaryote H. sapiens, 39 tRNAs in the bacterium E. coli, 35 tRNAs in the archaeon S. solfataricus, and 28 tRNAs in some mycoplasma species and related species. Considering 61 different codons, these numbers are also indicative of bias, which is not the same in prokaryotes and eukaryotes. In addition, the differences in the affinities of codon-anticodon pairs between the eukaryotes and prokaryotes lead to some differences in the co-occurrences, a situation that is also termed the co-occurrence bias (Quax et al., 2015). In a broad sense, Quax and co-workers defined “codon bias as a means to fine-tune gene expression.” Different, kingdom-specific modification strategies that improve translation efficiencies were reported to be considered at least partially as the source of diverse codon bias of bacteria and eukarya (Novoa et al., 2012). Earlier, Carbone (Carbone, 2008) stated that “in bacteria and archaea, translational bias provides information on living environment and on genes involved in essential metabolic functions and stress response, which are crucial for the bacteria wildlife.” This seems quite plausible considering the immense variation in the environmental conditions that bacteria could adapt to. However, from the perspective of the current study, maybe one could assume an effect on the protein structure or conformation variations in the prokaryotes compared to that in the eukaryotes, considering variations in the codon bias. Observation of such variations that can also be explained by other means does not rule out the claim of the current approach since this approach is a means of measure, which is the quantification of information.

3.5. Relevant examples

The encoding of the type of amino acids in a protein within the DNA and the mRNA is rather clear but it may be hard to think of such a clear distinctive information transfer or encoding in case of the protein structure since many tend to think that a protein’s structure is primarily dictated by its amino acid sequence and there should principally not be any reason to think about the involvement of the so-called redundant information in the mRNA. Why should one ever think that synonymous codes (the alternative codes for the same amino acid) could make a difference if they eventually end up in the addition of the same amino acid at the same position of the protein? However, assuming such a correlation with the mRNA and the protein structure or with the protein translation machinery and the protein structure is not a new idea and emanated from some certain experimental observations. For instance, silent polymorphisms or mutations (changes in a nucleotide that turn the triplet codon into one that is still coding for the same amino acid) can have biochemical and physiological effects (Biro, 2008b) and there can be problems in achieving human protein expression in the native state within bacteria (Adzhubei et al., 1996). So, there have been valuable efforts to elucidate the obscured path of this type of communication and other than investigating by biochemical means, many used combinatorics, information theory (Biro, 2008b). As an example, D’Onofrio and co-workers (D’Onofrio et al., 2002) detected systematic compositional differences both among and within the secondary structure levels of a non-redundant set of 62 human proteins, in addition to the confirmation of the earlier findings that the third positions of the codons are related to the protein secondary structures (Adzhubei et al., 1996). As another example, Orešič and Shalloway (Orešič and Shalloway, 1998) reported correlations between relative synonymous codon usage and protein secondary structure through comparing the
mRNA sequences with the respective three-dimensional structures of the 35 human and 31 E. coli proteins. Gupta and co-workers (Gupta et al., 2000) similarly looked at the correlation of the synonymous codon usage and the alpha helices and beta sheets secondary structural elements of the proteins, for the organisms E. coli (with 28 non-redundant entries, namely proteins), B. subtilis (with 28 non-redundant entries), S. cerevisiae (with 29 non-redundant entries), and Homo sapiens (with 68 non-redundant entries). They found correlation irrespective of the species in question. Further, the secondary structural units of the proteins could be recognised through the bases’ occurrences at the second codon position. Mukhopadhyay and co-workers (Mukhopadhyay et al., 2007) reported the synonymous codon usage in different protein secondary structural classes (all-alpha, all-beta, alpha+beta, alpha/beta) by investigating through 401 human genes. The study was indicative of a significantly increased number of GC3 rich genes towards the protein stability. Zhou and co-workers (Zhou et al., 2015) reported that relatively optimal codons are observed to be coding for the structured protein domains while the non-optimal codons are preferentially used in coding of the intrinsically disordered protein regions. Importantly, this correlation was confirmed by in vivo tests, through manipulating the codons that are found to be of structural importance, in the eukaryote Neurospora circadian clock gene frequency (frq). This codon optimization of those coding for the “predicted disordered but not well structured” regions of the FRQ protein resulted in the changes of the FRQ protein structures and the impairment of the clock functions.

As early as 1996, Brunak and Engelbrecht (Brunak and Engelbrecht, 1996) performed “a direct comparison of experimentally determined protein structures and their corresponding protein coding mRNA sequences.” They analysed 719 unique protein chains, 82 of which were from enterobacteria, 181 of which were from mammals, and the rest of which were non-organism specific. Clustering of the rare codons with the structural units of the encoded proteins or the effect of codon bias on the translational rate of proteins as a means of protein folding regulation were already hypothesized then, which were elaborated later together with the introduction of the concepts like the co-translational folding (Marin, 2008; Deane and Saunders, 2011; Pechmann and Frydman, 2013; Faure et al., 2016). Brunak and Engelbrecht found a possible relation of the mRNA sequences within the regions that code for the amino acids that surround the helix and sheet type secondary structures of the proteins. This was suggested by the similarity of the codons that code for the amino acids that are at the N- and C-termini (starting and ending regions) of the respective structures. In the same year, Adzhubei and co-workers (Adzhubei et al., 1996) reported relatedly that the synonymous codons reveal diverse, even contrary, preferences at the N- or C-termini of the protein structure fragments, compared to their presence rates in the coding of the individual amino acids. Yet, what is additionally of methodological relevance to the current study is that Brunak and Engelbrecht (Brunak and Engelbrecht, 1996) quantified the sequence information content with the Shannon information measure, and the Shannon information representations of their analyses were visually highly informative. The data were shown for 14 amino acid-long sequences at the interface regions of the sheet and coil structures or the helix and coil structures.

computational observations can be considered as the other means to elaborate the current approach that is presented herein. Efforts in this field need to tackle inconsistencies but the advancements and gathering of knowledge did and will bring about supportive findings as well.

4. Summary and conclusion

This work presents an information theoretic method for equalizing the information amounts of the coding part of the mRNA and the protein of a corresponding length such that it would be the same length of the protein that would be translated from that mRNA. It achieves this by systematically exploiting the proteins’ being in dynamic equilibrium of conformational states. While doing this, we first assumed that the lower entropy of the protein residues compared to that of the mRNA codons as nucleotide triplets that code for the amino acids (protein residues) vanishes when the variation of the protein conformation is introduced as an additional source of entropy per residue. So, the protein conformation was introduced as an additional variable, which was assumed to be valid at residue-level as well. The probabilities assigned to the conformational states provide us with the probability of the presence in either or one of the possible conformations. These probabilities are represented at the protein level by corresponding protein population distributions that spend time in the possible conformations, respectively. We assigned hypothetical probabilities to the conformational states and also used experimentally-derived probabilities, which were in line with the hypothetical ones. This approach implied that the balanced entropies compensate for the degeneracy in the codon usage during the translation process. Codon degeneracy is mainly regarded as a source of redundancy that enables reliable transmission of signals through noisy channels, with the perspective of information communication theory. If we translate this to the molecular biology terminology, codon degeneracy diminishes the errors. However, it is claimed here that ‘reliable transmission of signals’, namely decreasing the error rate, may not be the only task of codon degeneracy. Further implications of this study involve the relation of change in the codon degeneracy at the organism- or tissue-level, which is known as codon bias, with a possible variation of the conformations of the related proteins. Since change in the conformation variation of a protein can be dictated by environmental variables, it is concluded that the thermodynamic milieu of the cell, and so the intracellular physiology, may be a driving force for codon bias. Quite a wide range of concepts and biologically relevant issues can be of relevance, such as protein folding and misfolding in neurodegenerative diseases (Ovádi and Orosz, 2009). Such clinical implications and the additional inferences of this work for the codon evolution (Sorimachi and Okayasu, 2008; Cannarozzi and Schneider, 2012), life’s origin (Yockey, 2002) concepts, and the secondary information in the genetic code (Maraia and Iben, 2014) are potentially among our future interests, together with the parallel fields including the information theoretic approaches in cellular signalling (Waltermann and Klipp, 2011) and the semantic processing in cells (Görlich et al., 2011). So, this fruitful field deserves growing attention and dedication of the scholars.
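As a rough numerical check of the protein-side information amounts quoted earlier for the 505-residue tyrosine-protein kinase Hck, the calculation can be sketched in a few lines of Python. The sketch rests on assumptions that are inferred from the quoted totals rather than restated from the methods: each residue is taken to contribute log2(20) bits for the amino acid type (20 equiprobable types) plus the Shannon entropy of the conformational-state probabilities Pc, with residue type and conformation treated as independent. The function and constant names are illustrative only.

```python
import math

# Assumption: 20 equiprobable amino acid types per residue.
AMINO_ACID_TYPES = 20


def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def protein_information_bits(length, state_probabilities):
    """Information amount, in bits, of a protein with `length` residues.

    Assumption: residue type and conformational state are independent,
    so their per-residue entropies simply add.
    """
    per_residue = math.log2(AMINO_ACID_TYPES) + shannon_entropy(state_probabilities)
    return length * per_residue


if __name__ == "__main__":
    hck_length = 505
    cases = {
        "free Hck (0.82, 0.10, 0.08)": [0.82, 0.10, 0.08],
        "bound Hck (0.50, 0.45, 0.05)": [0.50, 0.45, 0.05],
        "2 states, no bias": [0.5, 0.5],
        "3 states, no bias": [1 / 3] * 3,
    }
    for label, probs in cases.items():
        bits = protein_information_bits(hck_length, probs)
        print(f"{label}: {bits:.1f} bits")
```

Under these assumptions the four cases evaluate to about 2616.1, 2806.0, 2687.6 and 2983.0 bits, which agrees with the values quoted in the text for the free and bound forms of Hck and for 2 and 3 equally populated conformational states.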

Beyond these examples and discussions, as mentioned, the mRNA structure was considered irrelevant here in this study due to the stance of this work. However, it is a good candidate to be a means to improve the current approach. Further, Luo and Jia (Luo and Jia, 2007) studied the codon usage effect on the protein structure and for humans, preferences of codons in 45 (or 79) di-peptides (peptides that are two amino acids in length) were found to be distinct from that in case of the individual amino acids. This was 36 (or 60) in case of E. coli. All such experimental or theoretical and

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at 10.1016/j.biosystems.2017.05.003.

References

Adiguzel, Y., 2015. Information theoretic approach for the DNA and protein length relation. In: Ecology and Evolutionary Biology Symposium, Ankara, Turkey, August 2015.

Adiguzel, Y., 2016a. Biophysical and Biological Perspective in Biosemiotics. Progress in Biophysics and Molecular Biology, 11., pp. 55–61.

Adiguzel, Y., 2016b. Information theoretic approach in molecular interactions and implications in molecular evolution. Nano Commun. Networks, http://dx.doi.org/10.1016/j.nancom.2016.09.002.

Adzhubei, A.A., Adzhubei, I.A., Krasheninnikov, I.A., Neidle, S., 1996. Non-random usage of ‘degenerate’ codons is related to protein three-dimensional structure. FEBS Lett. 399, 78–82.

Akyildiz, I.F., Brunetti, F., Blázquez, C., 2008. Nanonetworks: a new communication paradigm. Comput. Networks 52, 2260–2279.

Apter, M., Wolpert, L., 1965. Cybernetics and development. I. information theory. J. Theoret. Biol. 8, 244–257.

Aynechi, T., Kuntz, I.D., 2005a. An information theoretic approach to macromolecular modeling. I. Sequence alignments. Biophys. J. 89, 2998–3007.

Aynechi, T., Kuntz, I.D., 2005b. An information theoretic approach to macromolecular modeling II. Force fields. Biophys. J. 89, 3008–3016.

Bacardit, J., Stout, M., Hirst, J.D., Valencia, A., Smith, R.E., Krasnogor, N., 2009. Automated alphabet reduction for protein databases. BMC Bioinf. 10, 6.

Backofen, R., Narayanaswamy, N.S., Swidan, F., 2002. Protein similarity search under mRNA structural constraints: application to targeted selenocysteine insertion. In Silico Biol. 2, 275–290.

Barrick, D., 2009. What have we learned from the studies of two-state folders, and what are the unanswered questions about two-state protein folding? Phys. Biol. 10, 015001.

Battail, G., 2014. Information and Life. Springer, Dordrecht Heidelberg London New York.

Bernadó, P., Blackledge, M., 2010. Proteins in dynamic equilibrium. Nature 468, 1046–1048.

Biro, J.C., 2008a. Does codon bias have an evolutionary origin? Theor. Biol. Med. Model. 5, 16.

Biro, J.C., 2008b. Discovery of Proteomic code with mRNA assisted protein folding. Int. J. Mol. Sci. 9, 2424–2446.

Blin, G., Fertin, G., Hermelin, D., Vialette, S., 2008. Fixed-parameter algorithms for protein similarity search under mRNA structure constraints. J. Discrete Algorithms 6, 618–626.

Hod, R., Kohen, R., Mandel-Gutfreund, Y., 2013. Searching for protein signatures using a multilevel alphabet. Proteins 81, 1058–1068.

Huang, S.-W., Hwang, J.-K., 2005. Computation of conformational entropy from protein sequences using the machine-learning method: Application to the study of the relationship between structural conservation and local structural stability. Proteins: Struc., Funct., Bioinf. 59, 802–809.

Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106, 620–630.

Kawaguchi, R., Bailey-Serres, J., 2005. mRNA sequence features that contribute to translational regulation in Arabidopsis. Nucleic Acids Res. 33, 955–965.

Kiekebusch, D., Thanbichler, M., 2014. Spatiotemporal organization of microbial cells by protein concentration gradients. Trends Microbiol. 22, 65–73.

Koslicki, D., 2011. Topological entropy of DNA sequences. Bioinformatics 27, 1061–1067.

Kwon, O., Jeong, S.J., Kim, S.O., He, L., Lee, H.G., Jang, K.L., Osada, H., Jung, M., Kim, B.Y., Ahn, J.S., 2010. Modulation of E-cadherin expression by K-Ras; involvement of DNA methyltransferase-3b. Carcinogenesis 31, 1194–1201.

Li, T., Fan, K., Wang, J., Wang, W., 2003. Reduction of protein sequence complexity by residue grouping. Protein Eng. 16, 323–330.

Lumry, R., Biltonen, R., Brandts, J.F., 1966. Validity of the two-state hypothesis for conformational transitions of proteins. Biopolymers 4, 917–944.

Luo, L., Jia, M., 2007. Messenger RNA information: its implication in protein structure determination and others. In: Feng, J., Jost, J., Qian, M. (Eds.), Networks: From Biology to Theory. Springer-Verlag, London, pp. 291–308.

Maraia, R.J., Iben, J.R., 2014. Different types of secondary information in the genetic code. RNA 20, 977–984.

Marin, M., 2008. Folding at the rhythm of the rare codon beat. Biotechnol. J. 3, 1047–1057.

Matoulkova, E., Michalova, E., Vojtesek, B., Hrstka, R., 2012. The role of the 3’ untranslated region in post-transcriptional regulation of protein expression in mammalian cells. RNA Biol. 9, 563–576.

Mauger, D.M., Siegfried, N.A., Weeks, K.M., 2013. The genetic code as expressed through relationships between mRNA structure and protein function. FEBS Lett. 587, 1180–1188.

Mukhopadhyay, P., Basak, S., Ghosh, T.C., 2007. Synonymous codon usage in

Brocchieri, L., Karlin, S., 2005. Protein length in eukaryotic and prokaryotic different protein secondary structural classes of human genes: implication for

proteomes. Nucleic Acids Res. 33 (10), 3390–3400. increased non-randomness of GC3 rich genes towards protein stability. J.

Brunak, S., Engelbrecht, J., 1996. Protein structure and the sequential structure of Biosci. 32 (5), 947–963.

mRNA: ␣-helix and ␤-sheet signals at the nucleotide level. Proteins: Struc., Murphy, L.R., Wallqvist, A., Levy, R.M., 2000. Simplified amino acid alphabets for

Funct., Genet. 25, 237–252. protein fold recognition and implications for folding. Protein Eng. 13, 149–152.

Cannarozzi, G.M., Schneider, A., 2012. Codon Evolution: Mechanisms and Models. Novoa, E.M., Pavon-Eternod, M., Pan, T., de Pouplana, L.R., 2012. A role for tRNA

Oxford University Press, UK. modifications in genome structure and codon usage. Cell 149, 202–213.

Carbone, A., 2008. Codon bias is a major factor explaining phage evolution in Oliver, J.L., Román-Roldán, R., Pérez, J., Bernaola-Galván, P., 1999. SEGMENT:

translationally biased hosts. J. Mol. Evol. 66 (3), 210–223. identifying compositional domains in DNA sequences. Bioinformatics 15,

Castellano, E., Downward, J., 2011. RAS interaction with PI3K. Genes Cancer 2, 974–979.

261–274. Oresiˇ c,ˇ M., Shalloway, D., 1998. Specific correlations between relative synonymous

D’Onofrio, G., Ghosh, T.C., Bernardi, G., 2002. The base composition of the human codon usage and protein secondary structure. J. Mol. Biol. 281, 31–48.

genes is correlated with the secondary structures of the encoded proteins. Orosz, F., Ovádi, J., 2011. Proteins without 3D structure: definition, detection and

Gene 300, 179–187. beyond. Bioinformatics 27, 1444–1454.

Dale, J.W., Park, S.F., 2004. Codon usage. In: Molecular Genetics of Bacteria, 4th ed. Ovádi, J., Orosz, F., 2009. Protein Folding and Misfolding: Neurodegenerative

John Wiley & Sons Inc., pp. 101. Diseases. Springer.

Deane, C.M., Saunders, R., 2011. The imprint of codons on protein structure. Pechmann, S., Frydman, J., 2013. Evolutionary conservation of codon optimality

Biotechnol. J. 6, 641–649. reveals hidden signatures of cotranslational folding. Nat. Struct. Mol. Biol. 20,

Dos Santos, C., McDonald, T., Ho, Y.W., Liu, H., Lin, A., Forman, S.J., Kuo, Y.H., Bhatia, 237–243.

R., 2013. The Src and c-Kit kinase inhibitor dasatinib enhances p53-mediated Pesole, G., Mignone, F., Gissi, C., Grillo, G., Licciulli, F., Liuni, S., 2001. Structural and

targeting of human acute myeloid leukemia stem cells by chemotherapeutic functional features of eukaryotic mRNA untranslated regions. Gene 276, 73–81.

agents. Blood 122, 1900–1913. Plotkin, J.B., Kudla, G., 2011. Synonymous but not the same: the causes and

Dressler, F., Akan, O.B., 2010. A survey on bio-inspired networking. Comput. consequences of codon bias. Nature Rev. Genet. 12, 32–42.

Networks J. 54 (6), 881–900. Quax, T.E.F., Claassens, N.J., Söll, D., van der Oost, J., 2015. Codon bias as a means to

Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., Obradovic,´ Z., 2002. fine-Tune gene expression. Mol. Cell 59, 149–161.

Intrinsic disorder and protein function. Biochemistry 41, 6573–6582. Rocha, J.R., van der Linden, M.G., Ferreira, D.C., Azevêdo, P.H., de Araújo, A.F.P.,

Erdmann, T., Albert, P.J., Schwarz, U.S., 2013. Stochastic dynamics of small 2012. Information-theoretic analysis and prediction of protein atomic burials:

ensembles of non-processive molecular motors: the parallel cluster model. J. on the search for an informational intermediate between sequence and

Chem. Phys. 139, 175104. structure. Bioinformatics 28, 2755–2762.

Faure, G., Ogurtsov, A.Y., Shabalina, S.A., Koonin, E.V., 2016. Role of mRNA structure Rontó, G., 1999. The elements of biocybernetics, communication and control. In:

in the control of protein folding. Nucleic Acids Res. 44 (22), 10898–10911. Rontó, G., Tarján, I. (Eds.), An Introduction to Biophysics −with Medical

Görlich, D., Artmann, S., Dittrich, P., 2011. Cells as semantic systems. Biochim. Orientation. Semmelweis Kiadó, Budapest, pp. 369–388.

Biophys. Acta 1810, 914–923. Sarkar, R., Roy, A.B., Sarkar, P.K., 1978. Topological information content of genetic

Gittis, A.G., Stites, W.E., Lattman, E.E., 1994. A first order phase transition between molecules −I. Math. Biosci. 39, 299–312.

a compact denatured state and a random coil state in Staphylococcal nuclease. Schneider, T.D., 2010. A brief review of molecular information theory. Nano

In: Doniach, S. (Ed.), Statistical Mechanics, Protein Structure, and Protein Commun. Networks 1, 173–180.

Substrate Interactions. NATO Science Series B, pp. 39–47. Shannon, C.E., 1948. A mathematical theory of communication. Bell Syst. Tech. J.

Goh, C.-S., Milburn, D., Gerstein, M., 2004. Conformational changes associated with 27, 379–423.

protein–protein interactions. Curr. Opin. Struct. Biol. 14, 104–109. Sharp, P.M., Li, W.H., 1986. An evolutionary perspective on synonymous codon

Gray, R.D., Buscaglia, R., Chaires, J.B., 2012. Populated intermediates in the thermal usage in unicellular organisms. J. Mol. Evol. 24, 28–38.

unfolding of the human telomeric quadruplex. J. Am. Chem. Soc. 134, Sharp, P.M., Emery, L.R., Zeng, K., 2010. Forces that influence the evolution of

16834–16844. codon bias. Phil. Trans. R. Soc. B 365, 1203–1212.

Gunesakaran, K., Tsai, C.-J., Kumar, S., Zanuy, D., Nussinov, R., 2003. Extended Solis, A.D., Rackovsky, S., 2000. Optimized representations and maximal

disordered proteins: targeting function with less scaffold. Trends Biochem. Sci. information in proteins. Proteins: Struct., Funct., Genet. 38, 149–164.

28, 81–85. Sorimachi, K., Okayasu, T., 2008. Codon evolution is governed by linear formulas.

Gupta, S.K., Majumdar, S., Bhattacharya, T.K., Ghosh, T.C., 2000. Studies on the Amino Acids 34, 661–668.

relationships between the synonymous codon usage and protein secondary Sullivan, D.C., Aynechi, T., Voelz, V.A., Kuntz, I.D., 2003. Information content of

structural units. Biochem. Biophys. Res. Commun. 269, 692–696. molecular structures. Biophys. J. 85, 174–190.

Gurski, F., 2008. Polynomial algorithms for protein similarity search for restricted Swift, R.V., McCammon, J.A., 2009. Substrate induced population shifts and

mRNA structures. Inf. Process. Lett. 105, 170–176. stochastic gating in the PBCV-1 mRNA capping enzyme. J. Am. Chem. Soc. 131, 5126–5133.


Talley, K., Alexov, E., 2010. On the pH-optimum of activity and stability of proteins. Proteins 78, 2699–2706.

Theillet, F.-X., Binolfi, A., Frembgen-Kesner, T., Hingorani, K., Sarkar, M., Kyne, C., Li, C., Crowley, P.B., Gierasch, L., Pielak, G.J., Elcock, A.H., Gershenson, A., Selenko, P., 2014. Physicochemical properties of cells and their effects on intrinsically disordered proteins (IDPs). Chem. Rev. 114, 6661–6714.

Van den Berg, B., Ellis, R.J., Dobson, C.M., 1999. Effects of macromolecular crowding on protein folding and aggregation. EMBO J. 18, 6927–6933.

Van den Berg, B., Wain, R., Dobson, C.M., Ellis, R.J., 2000. Macromolecular crowding perturbs protein refolding kinetics: implications for folding inside the cell. EMBO J. 19, 3870–3875.

Waltermann, C., Klipp, E., 2011. Information theory based approaches to cellular signaling. Biochim. Biophys. Acta 1810, 924–932.

Yang, S., Blachowicz, L., Makowski, L., Roux, B., 2010. Multidomain assembled states of Hck tyrosine kinase in solution. Proc. Natl. Acad. Sci. USA 107, 15757–15762.

Yockey, H.P., 2002. Information theory: evolution and the origin of life. Inf. Sci. 141, 219–225.

Youn, H., Koh, J., Roberts, G.P., 2008. Two-state allosteric modeling suggests protein equilibrium as an integral component for cyclic AMP (cAMP) specificity in the cAMP receptor protein of Escherichia coli. J. Bacteriol. 190, 4532–4540.

Yu, C.-H., Dang, Y., Zhou, Z., Wu, C., Zhao, F., Sachs, M.S., Liu, Y., 2015. Codon usage influences the local rate of translation elongation to regulate co-translational protein folding. Mol. Cell 59 (5), 744–754.

Zhou, M., Wang, T., Fu, J., Xiao, G., Liu, Y., 2015. Non-optimal codon usage influences protein structure in intrinsically disordered regions. Mol. Microbiol. 97 (5), 974–987.

Yekbun Adiguzel received her undergraduate degree from the Molecular Biology and Genetics Department, Middle East Technical University (METU), and her M.Sc. degree from the Pediatric Molecular Genetics Department, Ankara University Medical Faculty, Turkey. She received her Ph.D. degree from the Biophysics Department, Ruhr-University Bochum (RUB), Germany. Her Ph.D. was funded by the International Max Planck Research School in Chemical Biology, Max Planck Institute of Molecular Physiology, Dortmund, Germany. She was involved in postdoctoral studies in Neurobiochemistry, RUB. She then became European Union 7th Framework Programme project personnel within the BioMEMS Division, METU-MEMS Research and Application Center. She is currently an associate professor in the Biophysics Department, School of Medicine, Istanbul Kemerburgaz University, Turkey.