Deviation of amino acid utilization and correlation with G+C composition in bacterial genome
Sajia Akhter, Hochul K Lee, Barbara Bailey, Peter Salamon, Robert Edwards
Computational Science Research Center, San Diego State University, 2008

Kullback-Leibler Divergence on Amino Acid Composition

The Kullback-Leibler divergence (KLD) was calculated to compare the distribution of amino acids in different protein coding sequences, as a measure of how much those sequences deviate from the standard. The KLD was calculated for 372 whole bacterial genomes and for proteins in subsystems by

$$D_{KL}(P \,\|\, Q) = \sum_i P_i \log\frac{P_i}{Q_i}$$

As used here, $P_i$ is the frequency of the ith amino acid in a given bacterial genome and $Q_i$ is the average frequency of the ith amino acid calculated from all complete genomes.

Divergence of Amino Acid Utilization and G+C content

The relationship between %GC and amino acid divergence is given by the equation $y = 2(x - 0.5)^2$, where x is the %GC (expressed as a fraction, so x = 0.5 corresponds to 50% GC) and y is the divergence of amino acid composition.

- The most divergent genomes have low GC content (ranging from 22% to 28%).
- GC-poor bacteria have few codons for alanine, glycine, proline, and arginine.
- GC-rich bacteria have few codons for phenylalanine, isoleucine, lysine, asparagine, and tyrosine.

Divergence of Amino Acid Utilization in different Subsystems

Most subsystems have a similar parabolic equation with a high regression coefficient, which suggests that DNA content and amino acid composition are related.

Secondary metabolism has a poor correlation between GC content and amino acid composition:
- High level of horizontal gene transfer.
- Only a limited number of bacteria (167) have this subsystem, and most of those have GC content between 40% and 60%.

Non-divergent genomes: the divergence of amino acid utilization is not significantly different from the mean for all subsystems.

[Figure: Kullback-Leibler divergence by subsystem for non-divergent genomes (Bifidobacterium adolescentis, Bacillus B-14905, Nostoc sp. PCC 7120, Salmonella bongori 12149, Chlamydophila pneumoniae CWL029) and the mean. Subsystems: Amino Acids and Derivatives, Carbohydrates, Cell Division and Cell Cycle, DNA Metabolism, Membrane Transport, Nitrogen Metabolism, Phosphorus Metabolism, Protein Metabolism, RNA Metabolism, Sulfur Metabolism.]

Most divergent genomes: the divergence is significantly different from the mean for all subsystems. The differences are not restricted to one or a few metabolic processes but occur across all subsystems.

Possible Explanation for Divergence of Amino Acid Utilization

Lifestyle of Organism
- The organisms with the most skewed amino acid compositions are intracellular pathogens with a very limited ecological niche range and a restricted lifestyle.

Phylogenetic Effects
- There is a significant difference between amino acid utilization in different phylogenetic groups of bacteria.

[Figure: mean of KLD with SEM by class of bacteria: Bacilli, Aquificae, Chlorobia, Clostridia, Chlamydiae, Deinococci, Mollicutes, Fusobacteria, Spirochaetes, Thermotogae, Actinobacteria, Bacteroidetes, Flavobacteria, Cyanobacteria, Sphingobacteria, Betaproteobacteria, Alphaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria.]

Predicting Amino Acid composition based on G+C content

An explicit expression for the information content is available once a surprisal/deviation analysis is carried out (Levine, 1978):

$$\ln\frac{Q_i}{P_i} = \lambda_0 + \sum_{r=1}^{m} \lambda_r A_r(i)$$

where $A_r(i)$ are a set of m properties for the state i. For the deviation of amino acid composition, the only property of interest is GC content, so the model becomes

$$\ln\frac{Q_i}{P_i} = \lambda_0 + \lambda\,(\mathrm{GC\%}) \qquad \text{[eqn 1]}$$

From eqn 1,

$$Q_i = P_i \exp\bigl(\lambda_0 + \lambda\,(\mathrm{GC\%})\bigr)$$

where $\lambda$ is obtained by fitting equation 1 to the actual frequencies and $\lambda_0$ is set by the weighted average of G+C content, $\lambda_0 = -\lambda\,\mathrm{avg}(\mathrm{GC\%})$. Finally,

$$Q_i = P_i \exp\bigl(\lambda\,(\mathrm{GC\%} - \mathrm{avg}(\mathrm{GC\%}))\bigr)$$

Amino Acids and their GC Sensitivity

[Table: the 20 amino acids classified as "GC% not significant" or "GC% is significant", the latter split by the sign of the slope.]

[Figure: frequency of amino acid utilization, by amino acid.]

Previous Work

According to Knight (2001), the correlation between amino acid frequency and GC% is linear:

$$Q_i = \lambda_0 + \lambda\,(\mathrm{GC\%})$$

Significance of the Exponential Relationship over the Linear Relationship

The exponential relationship uses one parameter ($\lambda$) instead of two ($\lambda$ and $\lambda_0$), although the regression coefficient ($R^2$) is almost the same for both relationships.
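
As a rough illustration of the two calculations described above, the following sketch computes a Kullback-Leibler divergence between a genome's amino acid frequencies and a reference distribution, and estimates the single parameter λ of the exponential model. It is not the authors' code. The function names (`aa_frequencies`, `kl_divergence`, `fit_lambda`) are hypothetical, the natural logarithm is assumed (to match the ln in eqn 1), the mean GC is taken as a plain unweighted average rather than the weighted average the poster mentions, and λ is estimated by ordinary least squares with no intercept, which is only one plausible reading of "fitting equation 1".

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_frequencies(proteins):
    """Return {amino acid: relative frequency} over a list of protein strings."""
    counts = Counter(
        c for seq in proteins for c in seq.upper() if c in AMINO_ACIDS
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """D(P||Q) = sum_i P_i * ln(P_i / Q_i).

    Terms with P_i == 0 contribute 0. Assumes Q_i > 0 wherever P_i > 0
    (natural log is an assumption; the poster does not state the base).
    """
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in p if p[aa] > 0)


def fit_lambda(gc, log_ratio):
    """Least-squares slope of ln(Q_i/P_i) against (GC - mean GC), no intercept.

    gc        : GC content of each genome
    log_ratio : ln(Q_i/P_i) for one amino acid i in each genome
    The unweighted mean of gc is an assumption made for this sketch.
    """
    mean_gc = sum(gc) / len(gc)
    x = [g - mean_gc for g in gc]
    return sum(a * b for a, b in zip(x, log_ratio)) / sum(a * a for a in x)
```

A possible usage, with a uniform reference distribution standing in for the average over all complete genomes:

```python
genome = aa_frequencies(["MKKLA", "MAAGP"])
reference = {aa: 1 / 20 for aa in AMINO_ACIDS}
print(kl_divergence(genome, reference))
```

Centering GC on its mean is what lets the model drop the intercept: the offset $\lambda_0$ is absorbed into the $-\lambda\,\mathrm{avg}(\mathrm{GC\%})$ term, leaving one free parameter per amino acid.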

[Figure: Kullback-Leibler divergence by subsystem for the most divergent genomes (Wigglesworthia glossinidia, Borrelia garinii, Mycoplasma mycoides, Ureaplasma parvum serovar, Buchnera aphidicola) and the mean. Subsystems: Amino Acids and Derivatives, Carbohydrates, Cell Division and Cell Cycle, DNA Metabolism, Membrane Transport, Nitrogen Metabolism, Phosphorus Metabolism, Protein Metabolism, RNA Metabolism, Sulfur Metabolism.]

References

Levine (1978), "Information Theory Approach to Molecular Reaction Dynamics", Annual Review of Physical Chemistry, 29(1):59

Knight (2001), "A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes", Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544, USA