Deviation of amino acid utilization and correlation with G+C composition in bacterial genomes
Sajia Akhter, Hochul K Lee, Barbara Bailey, Peter Salamon, Robert Edwards
Computational Science Research Center, San Diego State University
2008

Kullback-Leibler Divergence on Amino Acid Composition

The Kullback-Leibler divergence (KLD) was calculated to compare the distribution of amino acids in different protein coding sequences, as a measure of how much those sequences deviate from the standard. The KLD was calculated for 372 whole bacterial genomes and for proteins in subsystems by

$$D_{KL}(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}$$

As used here, $P_i$ is the frequency of the $i$th amino acid in a given bacterial genome and $Q_i$ is the average frequency of the $i$th amino acid calculated from all complete genomes.

Divergence of Amino Acid Utilization and G+C content

- The relationship between %GC and amino acid divergence is given by the equation $y = 2(x - 0.5)^2$, where $x$ is the %GC (expressed as a fraction) and $y$ is the divergence of amino acid composition.
- Most subsystems have a similar parabolic equation with a high regression coefficient, which suggests that DNA content and amino acid composition are related.
- Secondary metabolism shows a poor correlation between GC content and amino acid composition:
  - High level of horizontal transfer.
  - Only a limited number of genomes (167) have this subsystem, and most of those have GC content between 40% and 60%.

Frequency of Amino Acid Utilization

[Figure: Frequency of Amino Acid Utilization. y-axis: frequency of amino acid; x-axis: amino acid.]

Amino Acids and their GC Sensitivity

GC% not significant    GC% is significant, +ve slope    GC% is significant, -ve slope
Methionine (M)         Alanine (A)                      Glutamic acid (E)
Glutamine (Q)          Cysteine (C)                     Phenylalanine (F)
Threonine (T)          Aspartic acid (D)                Isoleucine (I)
                       Glycine (G)                      Lysine (K)
                       Histidine (H)                    Asparagine (N)
                       Leucine (L)                      Serine (S)
                       Proline (P)                      Tyrosine (Y)
                       Arginine (R)
                       Valine (V)
                       Tryptophan (W)

- The most divergent genomes have low GC content (ranging from 22% to 28%).
- GC-poor bacteria have few codons for alanine, glycine, proline, and arginine.
- GC-rich bacteria have few codons for phenylalanine, isoleucine, lysine, asparagine, and tyrosine.

Possible Explanation for Divergence of Amino Acid Utilization

Life Style of Organism
- The organisms with the most skewed amino acid compositions are intracellular, with a very limited ecological niche range and a restricted lifestyle.

Phylogenetic Effects
- There is a significant difference in amino acid utilization between different phylogenetic groups of bacteria.

[Figure: Mean of KLD with SEM (y-axis) by class of bacteria (x-axis): Bacilli, Aquificae, Chlorobia, Clostridia, Chlamydiae, Deinococci, Mollicutes, Fusobacteria, Spirochaetes, Thermotogae, Actinobacteria, Bacteroidetes, Flavobacteria, Cyanobacteria, Sphingobacteria, Alphaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria.]

Divergence of Amino Acid Utilization in different Subsystems

Non-divergent Genomes
- The divergence of amino acid utilization is not significantly different from the mean for all subsystems.

[Figure: Kullback-Leibler divergence (y-axis) across different subsystems (x-axis: Amino Acids and Derivatives, Carbohydrates, Cell Division and Cell Cycle, DNA Metabolism, Membrane Transport, Nitrogen Metabolism, Phosphorus Metabolism, Protein Metabolism, RNA Metabolism, Sulfur Metabolism). Legend: Bifidobacterium adolescentis, B-14905, Nostoc sp. PCC 7120, bongori 12149, Chlamydophila pneumoniae CWL029, Mean.]

The most Divergent Genomes
- The most divergent genomes are significantly different from the mean for all subsystems.
- The differences are not restricted to one or a few metabolic processes but occur across all subsystems.

Predicting Amino Acid composition based on G+C content

An explicit expression for the information content is available once a surprisal/deviation analysis is carried out (Levine, 1978):

$$\ln(Q_i/P_i) = \lambda_0 + \sum_{r=1}^{m} \lambda_r A_r(i)$$

where $A_r(i)$ are a set of $m$ properties for the state $i$. For the deviation of amino acid composition, the only property of interest is GC content, so the model becomes

$$\ln(Q_i/P_i) = \lambda_0 + \lambda\,(\mathrm{GC\%}) \qquad \text{[eqn 1]}$$

From eqn 1,

$$Q_i = P_i \exp\big(\lambda_0 + \lambda\,(\mathrm{GC\%})\big)$$

where $\lambda$ is obtained by fitting eqn 1 to the actual frequencies and $\lambda_0$ is fixed by the weighted average of G+C content, $\lambda_0 = -\lambda\,\mathrm{avg}(\mathrm{GC\%})$. Finally,

$$Q_i = P_i \exp\big(\lambda\,(\mathrm{GC\%} - \mathrm{avg}(\mathrm{GC\%}))\big)$$

Previous Work
According to Knight (2001), the correlation between amino acid frequency and GC% is linear:

$$Q_i = \lambda_0 + \lambda\,(\mathrm{GC\%})$$

Significance of the Exponential Relationship over the Linear Relationship
- The exponential relationship uses 1 parameter ($\lambda$) instead of 2 ($\lambda$ and $\lambda_0$), although the regression coefficient ($R^2$) is almost the same for both relationships.
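The two calculations the poster relies on, the Kullback-Leibler divergence and the one-parameter exponential model, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code. The function names (`amino_acid_frequencies`, `kl_divergence`, `fit_lambda`) are hypothetical. The natural logarithm is assumed for the KLD, a plain mean stands in for the weighted average of G+C content, and one λ is assumed to be fitted per amino acid across genomes by least squares through the origin.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_frequencies(protein_sequences):
    """Relative frequency of each standard amino acid over a set of proteins."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for seq in protein_sequences:
        for residue in seq.upper():
            if residue in counts:  # skip ambiguous or non-standard symbols
                counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """D(P||Q) = sum_i P_i ln(P_i / Q_i); terms with P_i == 0 contribute 0.

    Assumption: natural log; every Q_i is > 0 wherever P_i > 0.
    """
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in p if p[aa] > 0)


def fit_lambda(gc_values, log_ratios):
    """Fit ln(Q_i/P_i) = lam * (GC - avg(GC)) for one amino acid.

    gc_values:  GC fraction of each genome.
    log_ratios: ln(Q_i / P_i) for the same genomes.
    Assumption: unweighted mean GC and a least-squares slope through the origin.
    """
    mean_gc = sum(gc_values) / len(gc_values)
    centered = [g - mean_gc for g in gc_values]
    numerator = sum(c * y for c, y in zip(centered, log_ratios))
    denominator = sum(c * c for c in centered)
    return numerator / denominator, mean_gc


if __name__ == "__main__":
    # Toy sequences, purely illustrative (not real genome data).
    genome = amino_acid_frequencies(["MKKLIFNNY", "MKIYNKF"])
    average = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    print("KLD vs. uniform average:", kl_divergence(genome, average))
```

Centering GC% on its average is what lets the model drop λ0, so the only quantity left to estimate for each amino acid is the slope λ.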

[Figure: Kullback-Leibler divergence (y-axis) across different subsystems (x-axis: Amino Acids and Derivatives, Carbohydrates, Cell Division and Cell Cycle, DNA Metabolism, Membrane Transport, Nitrogen Metabolism, Phosphorus Metabolism, Protein Metabolism, RNA Metabolism, Sulfur Metabolism) for the most divergent genomes. Legend: Wigglesworthia glossinidia, Borrelia garinii, Mycoplasma mycoides, Ureaplasma parvum serovar, Buchnera aphidicola, Mean.]

References

Levine (1978), "Information Theory Approach to Molecular Reaction Dynamics", Annual Review of Physical Chemistry, 29(1):59.

Knight (2001), "A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes", Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544, USA.