COMPARISON OF PROTEIN AND MRNA PROFILES OF ESCHERICHIA COLI: DATA VISUALIZATION AND ANALYSIS OF SPECIFIC GENE GROUPS

Oleg Paliy,1,2 Brian Thomas,3 Rebecca Corbin,4 Feng Yang,4 Jeffrey Shabanowitz,4 Mark Platt,4 Charles E. Lyons, Jr.,4 Karen Root,4 Donald Hunt,4,5 and Sydney Kustu2

Department of Biochemistry and Molecular Biology, Wright State University, Dayton, Ohio, 45435,1 Department of Plant & Microbial Biology,2 and College of Natural Resources,3 University of California, Berkeley, CA 94720, Department of Chemistry, University of Virginia, Charlottesville, VA 22901,4 and Department of Pathology, University of Virginia, Charlottesville, VA 22908,5

ABSTRACT

Despite recent progress in protein identification in whole cell lysates, many laboratories interested in global gene expression depend on assessment of mRNA rather than protein. Hence knowledge of the relationship between the two remains important. Though it has been explored in eukaryotic cells, there are few studies of this relationship in bacteria. In addition, previous studies have generally not considered illustrative examples. We previously detected with high reliability about one quarter of the proteins of E. coli (1147 proteins) in cells grown under a single condition in minimal medium and compared these proteins to global mRNA levels. To understand the relationship between protein detection and mRNA abundance in greater depth, we here consider it for specific gene groups (translation apparatus, energy metabolism, motility and chemotaxis, cofactor biosynthesis, transcriptional regulators, and membrane and membrane-associated proteins). We also present a data visualization tool that facilitates comparison of whole cell mRNA and protein profiles. In most instances protein detection was associated with a high level of mRNA, as well as with greater protein length and solubility. We failed to detect cognate mRNA for only 34 of the proteins we identified.

INTRODUCTION

Physiological studies of organisms whose genome sequences have been determined have been greatly advanced by development of new techniques to sample cell composition globally at mRNA, protein, and metabolite levels. Availability of global data has led to introduction of the term “systems biology”. Affymetrix GeneChip microarrays allow the comparison of mRNA levels under different growth conditions and also provide statistical estimates of mRNA abundance and of which genes are transcribed under a single condition [1-4]. Analysis of complex mixtures of tryptic peptides by mass spectrometry provides a powerful method for determining the protein composition of cells [5-7]. Availability of both types of data has allowed comparisons between them, and several such comparisons have been made for the yeast Saccharomyces cerevisiae [8-10].

We recently made general comparisons between proteins and mRNAs detected in E. coli strain MG1655 (CGSC 6300) grown in minimal medium with glycerol as carbon source [11]. Globally there appeared to be a positive relationship between protein detection, performed under conditions that favored proteins of greatest abundance, and mRNA levels. Here we extend our earlier studies by presenting a simple visualization approach to facilitate comparisons between protein and mRNA data, and we consider examples of specific genes and operons for which the biological literature allows more meaningful analysis.

RESULTS

Data visualization

We created a simple visualization tool that allowed us to display protein and mRNA presence calls in genome order (Fig. 1, http://coli.berkeley.edu/protein_profile/), as we did previously for DNA microarray data [12-15]. The resulting genome image facilitates analysis of the data, particularly in terms of operon organization. The tool displays protein and mRNA detection either as separate squares, each corresponding to a detected protein or mRNA, or as a vertical rectangle for cases where both protein and mRNA were detected for a particular gene. The image map on the web site allows the gene ID number [16], gene name, and gene description to be displayed above the image. Clicking on a spot of interest transfers the user to the E. coli Entry Point database (http://coli.berkeley.edu/


Figure 1 - Genome image of protein and mRNA presence calls for E. coli MG1655 grown in minimal medium with glycerol as carbon source and NH4Cl as nitrogen source. Genes are arranged in their order on the chromosome of E. coli (according to the original E. coli annotation [16]) beginning with Blattner (b) number b0001 and progressing from left to right. There are 100 genes per row. To assist in viewing, every tenth gene is marked with a tick and a narrow vertical line, and the background for the rows alternates between light and dark gray. Green bars indicate genes for which protein and mRNA were both detected. Yellow squares indicate the 34 genes for which protein but not mRNA was detected (see text), whereas blue squares indicate genes for which mRNA but not protein was detected. Boxes correspond to some of the examples given in Table 1 and discussed in the text. Red boxes denote operons or clusters of operons with relatively abundant protein products. They are (in b number order): the trp operon (b1260-1265; b1265 is the trp leader); the his operon (b2018-2026; b2018 is the his leader); the nuo operon (b2276-2288); a cluster of ribosomal protein operons (b3294-3321); and the atp operon (b3731-3739). White boxes denote operons or clusters of operons with less abundant protein products. They are (in b number order): a cluster of murein-fts operons (b0081-0095); the lac operon and lac regulatory gene (b0342-0345; b0345 is lacI); and a cluster of flagellar (fli) operons (b1937-1950). At our web site (Protein data display: http://coli.berkeley.edu/protein_profile/), a cursor can be used to determine the b number, name, and description of each gene in the image. Links to the E. coli Entry Point (http://coli.berkeley.edu/ecoli/) facilitate obtaining additional information.
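The layout and color code described in the Fig. 1 legend can be sketched in a few lines. This is only an illustrative Python sketch, not the code behind the web tool: the function names are hypothetical, and it assumes for simplicity that b numbers run consecutively from b0001 with no gaps, which the real annotation need not satisfy.

```python
from typing import Optional, Tuple

GENES_PER_ROW = 100  # stated in the Fig. 1 legend


def cell_for_gene(b_number: int) -> Tuple[int, int]:
    """Return the 0-based (row, column) of a gene in the genome image.

    Assumption (ours, not the paper's): b numbers are consecutive, so
    b0001 is the first gene of row 0.
    """
    index = b_number - 1
    return index // GENES_PER_ROW, index % GENES_PER_ROW


def colour_for_calls(protein_detected: bool, mrna_detected: bool) -> Optional[str]:
    """Map a pair of presence calls to the color code of the Fig. 1 legend."""
    if protein_detected and mrna_detected:
        return "green"   # both protein and mRNA detected (drawn as a bar)
    if protein_detected:
        return "yellow"  # protein but not mRNA detected
    if mrna_detected:
        return "blue"    # mRNA but not protein detected
    return None          # neither detected: nothing is drawn


# Example: where the first gene of the atp operon (b3731) would fall.
row, column = cell_for_gene(3731)
```

Because genes of an operon have adjacent b numbers, they map to adjacent columns of one row (or wrap onto the next), which is why operons appear as strings of one color.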

cgi-bin/ecoli/coli_entry.pl) [14], where useful information about the gene can be retrieved easily. In our comparison, cognate mRNAs were not detected (called “absent” by the Affymetrix algorithm) for only 34 proteins out of the total list of 1147 proteins (yellow boxes, Fig. 1) because none of the genes for these proteins had a high mRNA signal on the array (5-650; average for the transcriptome was 2000). Three of the proteins (products of mcrB [b0149], ftsX [b3462], and alr [b4053]) are known to be required for murein synthesis and cell division (see below) and one for fatty acid biosynthesis (product of fabF [b1095]) [16]. We happen to know independently that the GlnG regulatory protein (=NtrC; product of b3868) is also present [12, 17, 18]. Expression of the genes for these five proteins should have been detected at the mRNA level. We estimate that the limit for protein detection in our experiments was at 50-100

protein copies per cell, whereas Affymetrix microarrays can detect 1 molecule of RNA in a complex mixture of 100,000 distinct RNA molecules [11, 19]. Another 9 proteins whose cognate mRNAs were “absent” were designated ORFs.

Gene examples

To make our understanding of the protein profile concrete, we looked at a number of specific examples (Table 1, Fig. 1; see also supplementary material).

Abundant proteins

We began with abundant proteins we expected to find on the protein list: ribosomal proteins, enzymes of glycerol and central carbon metabolism, and amino acid biosynthetic enzymes. Ribosomal proteins, which are among the most abundant in the cell (~15,000 copies under our growth conditions [20]), were well represented in the total list of proteins. Of the 55 ribosomal proteins, 49 were identified and mRNA was detected for all 55 genes. The ribosomal proteins that were missed had very few predicted tryptic peptides (1 to 4, whereas the average was 6.0). Nine of the 12 proteins involved in glycerol catabolism were detected, including all of those known to be required for growth, and expression of all 12 genes was detected at the mRNA level. The glycerol facilitator (GlpF) and glycerol phosphate permease (GlpT) were detected only in the membrane sample [11]. Most proteins categorized as glycolytic (14 out of 18), gluconeogenic (all 4) or as components of the tricarboxylic acid cycle (15 out of 17) [21] were detected and their mean mRNA signal intensities, which can be considered approximations of mRNA levels (Affymetrix Inc., Santa Clara, CA, technical note, 2001), were high. Those not detected are not required for the process involved (products of fruK and fumB), or would not have been detected due to low mRNA levels or small numbers of tryptic fragments (gapC_1, gapC_2, fruL and farR). Membrane-bound components of succinate dehydrogenase were detected only in the membrane sample. All the enzymes required for synthesis of the amino acids histidine and tryptophan were detected, as were the corresponding mRNAs. As expected, expression of the regulatory leader regions [22] for the his and trp operons, hisL and trpL, was detected at the mRNA but not the protein level. This was true generally for leader regions (including fruL mentioned above).

Proteins of low abundance

We next looked for proteins that were not expected to be abundant: flagellar proteins, proteins of the lactose degradative operon, and transcriptional regulators. Flagellar proteins are poorly expressed in strain MG1655 (CGSC 6300) [23, 24]. Of 40 proteins classified as flagellar proteins [21], the only one detected was the product of fliY (b1920), which lies at the edge of a cluster of flagellar operons. The fliY gene had an mRNA signal intensity of 5500, whereas the mean signal intensity for all other flagellar genes was <200 (the average for an E. coli protein was 2000). It has been reported that fliY may not be a flagellar gene [25], and there is direct evidence that its product is a periplasmic binding protein for cystine [26, 27]. Products of the lactose utilization operon were not detected and mRNA was detected (unreliably) only for lacZ (Table 1 legend). Of 160 known DNA-binding transcriptional regulators [28], we detected only 37 (23%), whereas 124 were considered expressed at the mRNA level. The average mRNA signal intensity for genes corresponding to regulators detected at the protein level was eight-fold higher than that for genes corresponding to undetected regulators.

We considered examples of proteins utilized for synthesis of co-factors that are required for growth in minimal medium because we thought their expression levels might not be high. All of the genes whose products are thought to be required for synthesis of NAD, pyridoxine, riboflavin, thiamine, and biotin (35 total) ([29] and J. Cronan, personal communication) were expressed at the mRNA level, and half of their protein products were present in the total list. The average mRNA signal intensities for these five groups of genes were between 1300 and 2900 (Table 1). It is known that some of the enzymes involved in co-factor synthesis have low turnover numbers (J. Cronan, personal communication), and hence both transcripts and proteins may be more abundant than anticipated.

Finally, we considered a group of proteins whose expression levels were expected to vary widely within the group: proteins required for cell division (fts gene products). Of the 12 Fts proteins [30, 31], most of which are membrane bound or associated, five (FtsI, N, X, Z, and ZipA) were detected and 11 were expressed at the mRNA level. FtsZ is by far the most abundant of the Fts proteins [32] and is soluble. FtsI and ZipA are also relatively abundant [33, 34] and have large soluble domains [30]. Several Fts proteins that were not detected are thought to be present at much lower levels or are of unknown abundance ([30] and D. Weiss, personal communication). One of these, FtsW, is an intrinsic membrane protein, and FtsL and FtsB (=YgbQ) are small transmembrane proteins [30].

Membrane and membrane-associated proteins

We looked at additional membrane proteins or proteins associated with membranes: the F1F0 ATPase and NADH dehydrogenase I, abundant proteins known to be present in cells grown in minimal medium, and products of the 23 experimentally studied ATP-binding cassette (ABC) transport operons [35] for which we detected at least one protein product. Six of the nine gene products that constitute the ATP synthase were detected and expression of all nine genes, which constitute a single operon (atp), was detected at the mRNA level. Two of the three proteins that were missed had only one or three predicted tryptic peptides, whereas the average number for proteins of the operon was 10.6. Ten of the 13 gene products that constitute NADH dehydrogenase I were detected, several only in the membrane sample. Expression of all 13 genes,


which constitute a single operon (nuo), was detected at the mRNA level. The three proteins that were missed were transmembrane subunits and were short relative to those detected (average of 4.3 vs. 14.8 predicted tryptic peptides).

ABC transport systems are usually composed of three different types of proteins: soluble periplasmic binding proteins, trans-membrane transport proteins, and inner-membrane associated ATP-binding proteins. Twenty soluble periplasmic binding protein components of ABC transporters were detected in the cell extract (the total number of periplasmic components in the 23 operons considered was 21). The genes corresponding to them generally had the highest mRNA signal intensities in their operons. By contrast, about half of the membrane-associated ATP-binding components coded by these operons and only one quarter of the integral membrane proteins (much shorter than the others) were detected. With one or two exceptions, membrane and membrane-associated components were detected exclusively in the membrane sample. Thus, preparation of the membrane sample was essential for detection of membrane components of ABC transport operons and useful for detection of other membrane and membrane-associated proteins discussed above.

General observations

Consideration of the above examples indicated that protein detection showed a positive relationship to mRNA level for the corresponding gene, a relationship that pertains globally [11]. Protein detection also shows a positive relationship to protein hydrophilicity and protein length (expressed as the number of predicted tryptic peptides). Our ability to detect E. coli proteins was progressively better with a higher number of predicted tryptic peptides per protein (Fig. 2). The positive relationship was especially strong for short proteins (0-12 tryptic peptides) but became gradually weaker for longer proteins (Fig. 3) because detection of only a single tryptic peptide was sufficient to call the protein “present” [11]. We showed previously that there was not a positive linear relationship between the number of tryptic peptides per protein and mRNA signal intensity for the corresponding gene [11].

Comparison of the protein list to the lists of proteins identified on 2D gels

Neidhardt and colleagues pioneered the use of two-dimensional (2D) gel electrophoresis to determine the protein composition of E. coli [39], an approach that has been intensively pursued by others [40-43]. We extracted from the SWISS-2DPAGE protein database (http://us.expasy.org/ch2d/) a list of proteins identified on 2D gels from cells grown under a variety of conditions, not just the single growth condition we used for our experiments. A search for E. coli proteins yielded a list of 336 unique protein names (as of August 2004). We detected 86% of these unique proteins and detected the cognate mRNAs for 96% of them (Table 2). As was true for proteins on our total list [11], cognate mRNAs for proteins identified on 2D gels had a higher average signal intensity than those for all E. coli proteins or all expressed proteins (Tables 1 and 2). Although the average signal intensity of cognate mRNAs for proteins we detected from the 2D gel list was higher than that for proteins we failed to detect, the average signal intensity for proteins we failed to detect was nevertheless higher than overall averages. The proteins identified on 2D gels were longer than the average E. coli protein, and those we detected were longer than those we failed to detect. Proteins identified on 2D gels were five to six fold low in membrane proteins and more than threefold low in proteins of unknown function.

Recently, a large-scale analysis of E. coli 2D gel maps was carried out by Lopez-Campistrous and colleagues [44], who were able to identify 575 E. coli proteins. Among these, 450 (78%) were present in our protein list, and mRNAs were detected for 94% of them. Properties of the 575 proteins and their corresponding mRNAs were similar to those from the SWISS-2DPAGE protein list, except that the number of membrane proteins and proteins of unknown function was increased (Table 2). Those proteins among the 575 that we did not detect had somewhat low mRNA levels (average signal of 1500 versus 2000 for E. coli protein-coding genes).

[Figure 2 plot. Y axis: Proteins in list detected (%), 0% to 20%. X axis: E. coli proteins sorted by the number of predicted tryptic peptide fragments per protein (classes 1-10, lowest to highest).]

Figure 2 - Detection of E. coli proteins as a function of codon usage bias and the number of tryptic peptide fragments per protein. To get the X axis, all 4290 proteins were first sorted from lowest to highest number of tryptic peptides per protein. They were then divided into 10 equal classes (429 each). Class 1 contained the 10% of proteins with the smallest number of tryptic peptides and so on. The proportion (percent) of all detected proteins in each class was plotted on the Y axis. The dotted line represents the hypothetical percent of detected proteins in each class (10%) if there is no relationship between protein detection and the number of tryptic peptides per protein. The ranges of the number of predicted tryptic peptides for classes 1-10 were as follows: (1) 0-4, (2) 4-6, (3) 6-7, (4) 7-9, (5) 9-11, (6) 11-13, (7) 13-15, (8) 15-19, (9) 19-25, (10) 25-80.
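The class-based procedure described in the Figure 2 legend (sort by predicted tryptic peptides, cut into 10 equal classes, report the share of all detected proteins that falls in each class) can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the authors' Excel/Visual Basic scripts: the function name is hypothetical, ties in peptide number are broken by input order, and any remainder after equal division is placed in the last class.

```python
def detection_share_by_class(peptide_counts, detected, n_classes=10):
    """Percent of all detected proteins that fall in each size class.

    peptide_counts: predicted tryptic peptides per protein
    detected:       parallel list of booleans (protein on the detected list?)
    """
    total_detected = sum(1 for d in detected if d)
    if total_detected == 0:
        raise ValueError("no detected proteins")
    # Sort protein indices from fewest to most predicted peptides (stable sort).
    order = sorted(range(len(peptide_counts)), key=lambda i: peptide_counts[i])
    class_size = len(order) // n_classes  # e.g. 4290 proteins -> 429 per class
    shares = []
    for c in range(n_classes):
        start = c * class_size
        end = len(order) if c == n_classes - 1 else start + class_size
        hits = sum(1 for i in order[start:end] if detected[i])
        shares.append(100.0 * hits / total_detected)
    return shares


# Toy data: 20 proteins, only the 10 longest are detected.
toy_counts = list(range(1, 21))
toy_detected = [n > 10 for n in toy_counts]
toy_shares = detection_share_by_class(toy_counts, toy_detected)
```

With no relationship between length and detection, every class would hold about 10% of the detected proteins, which is the dotted reference line in the figure.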

[Figure 3 plots. Panel (a): Y axis, Number of proteins (0-300); stacked columns for undetected and detected proteins. Panel (b): Y axis, Ratio of detected proteins (0.0-1.0). X axis for both panels: Number of predicted tryptic peptide fragments for E. coli protein (up to 80).]

Figure 3 - Protein detection as a function of the number of predicted tryptic peptide fragments per E. coli protein. (a) The X axis shows the number of predicted tryptic peptide fragments per protein. The numbers of detected (yellow) and undetected (blue) proteins are plotted on the Y axis as stacked columns. Each vertical combination of yellow and blue bars represents all E. coli proteins with that particular number of predicted tryptic peptides. (b) The X axis is as in (a). The Y axis shows the ratio of detected proteins to all proteins for each group of proteins with a particular number of predicted tryptic peptides. The purple line represents a log-fit curve of the data (only values for proteins with 1-60 tryptic peptide fragments were used to produce the curve). The dotted orange line represents the hypothetical case if there is no relationship between protein detection (26.7% overall) and the number of tryptic peptide fragments per protein.
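The quantity plotted in panel (b) of Figure 3, and a curve of the kind described there, can be sketched as below. This is a hedged Python illustration only: the function names are hypothetical, and the functional form ratio = a + b ln(n) is our assumption about what "log-fit" means (it is the form of a spreadsheet logarithmic trendline); the legend does not state the formula.

```python
import math
from collections import Counter


def detection_ratio_by_peptide_count(peptide_counts, detected):
    """Ratio of detected proteins to all proteins for each peptide number."""
    totals = Counter(peptide_counts)
    hits = Counter(n for n, d in zip(peptide_counts, detected) if d)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


def fit_log_curve(ratios, max_peptides=60):
    """Least-squares fit of ratio = a + b*ln(n), using n = 1..max_peptides.

    The 1-60 window follows the Figure 3 legend; the model form is assumed.
    """
    points = [(math.log(n), r) for n, r in ratios.items() if 1 <= n <= max_peptides]
    if len(points) < 2:
        raise ValueError("need at least two peptide numbers to fit a curve")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points share the same peptide number")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Toy data, not values from the paper.
toy_ratios = detection_ratio_by_peptide_count(
    [1, 1, 2, 2, 2, 5], [False, True, True, True, False, True]
)
```

A flat ratio equal to the overall detection rate would correspond to the dotted reference line in panel (b).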

Table 1 - Protein examples

Functional category and description (a)   No. of proteins   Average mRNA signal (b)   Proteins detected   mRNAs detected (c)
Ribosomal                                 55                28100                     49                  55
Glycerol catabolism                       12                6700                      9                   12
Glycolysis and gluconeogenesis            22                6200                      18                  22
TCA cycle                                 17                12700                     15                  16
Histidine biosynthesis (d)                8                 4700                      8                   8
Tryptophan biosynthesis (d)               5                 3900                      5                   5
Flagella                                  40                300                       1                   12
Lactose catabolism                        4                 200                       0                   2 (e)
Transcriptional regulators (f)            160               1600                      37                  124
Co-factor biosynthesis (g)                35                1900                      19                  35
Cell division (Fts) (h)                   12                2500                      5                   11
F1F0 ATPase                               9                 18000                     6                   9
NADH dehydrogenase I                      13                7600                      10                  13
ABC transporters (i)
  - Periplasmic                           21                6600                      20                  21
  - ATP-binding                           23                1900                      11                  23
  - Membrane                              31                1500                      8                   28
All E. coli proteins                      4291              2000 (j)                  1147                2826

(a) Unless otherwise stated, according to Riley and Labedan [21].
(b) For genes corresponding to all proteins in the group, rounded to the nearest hundred.
(c) Affymetrix presence calls (see [11]).
(d) Leader peptides not included.
(e) The mRNA for lacI was called present in all three experiments, whereas lacZ mRNA was present in one, marginal in one, and absent in one [11].
(f) Known DNA-binding transcriptional regulators, according to Perez-Rueda and Collado-Vides [28].
(g) NAD, pyridoxine, riboflavin, and thiamine, according to Koonin and Galperin [29], and biotin, according to J. Cronan (personal communication).
(h) [30, 31], D. Weiss, personal communication.
(i) Known ABC transporters according to Paulsen (http://66.93.129.133/transporter/wb/index2.html; [35]). This is not the entire category. Rather, if any protein product of an experimentally studied ABC transport operon was detected, we considered all members of that operon. Proportions of proteins detected were essentially the same for predicted ABC transport operons.
(j) For the 2826 protein coding genes called present at the mRNA level, the average mRNA signal was 2900.
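The percentages quoted in the text can be recomputed directly from the counts in Table 1. The short Python sketch below does this for a few rows; the dictionary and function names are ours, and only the counts themselves come from the table.

```python
# Counts copied from Table 1: (no. of proteins, proteins detected, mRNAs detected).
TABLE_1_COUNTS = {
    "Ribosomal": (55, 49, 55),
    "Flagella": (40, 1, 12),
    "Transcriptional regulators": (160, 37, 124),
    "Co-factor biosynthesis": (35, 19, 35),
    "Cell division (Fts)": (12, 5, 11),
    "All E. coli proteins": (4291, 1147, 2826),
}


def percent_detected(category):
    """Return (% proteins detected, % mRNAs detected), rounded to one decimal."""
    total, proteins, mrnas = TABLE_1_COUNTS[category]
    return round(100.0 * proteins / total, 1), round(100.0 * mrnas / total, 1)


for name in TABLE_1_COUNTS:
    protein_pct, mrna_pct = percent_detected(name)
    print(f"{name}: {protein_pct}% of proteins, {mrna_pct}% of mRNAs detected")
```

For example, 37 of 160 transcriptional regulators corresponds to the 23% given in the text, and 1147 of 4291 proteins corresponds to the 26.7% overall detection rate cited in the Figure 3 legend.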

Table 2 - E. coli proteins detected on 2D gels

Profile element                                     Swiss-2DPAGE (a)   Lopez-Campistrous et al. (b)
Proteins identified on 2D gels                      336                575
Number detected in this work (c)                    291                450
Number with mRNA detected (c)                       322                543
Average mRNA signal (d)                             7100               5300
Average number of tryptic peptides predicted (e)    16.3               16.0
Membrane proteins (c)                               22                 76
Proteins of unknown function (f)                    29                 83

(a) E. coli proteins in the Swiss-2DPAGE database (http://us.expasy.org/ch2d/) as of August 2004 (complete list available on web site).
(b) List of E. coli proteins identified on 2D gels by Lopez-Campistrous et al. [44].
(c) See [11].
(d) Rounded to the nearest hundred. The average signals for proteins detected and undetected were 7700 and 3200, respectively, for the Swiss-2DPAGE list; and 6300 and 1500, respectively, for the Lopez-Campistrous et al. list. The average signal for an E. coli protein was 2000.
(e) The average numbers for genes corresponding to proteins detected and undetected were 16.9 and 12.5, respectively, for the Swiss-2DPAGE list; and 16.3 and 14.7, respectively, for the Lopez-Campistrous et al. list. The average E. coli protein has 13.1 peptides.
(f) Open reading frames [21].

Table 3 - Minimal gene sets

Profile element                             Mushegian-Koonin MGS (a)   Gil et al. MGS (b)   Buchnera aphidicola BAp genome (c)
Genes/proteins in the set                   255                        206                  582
Genes/proteins with E. coli orthologues     243                        203                  574
Proteins we detected (d)                    184                        160                  387
mRNAs we detected (d)                       236                        200                  540
Average mRNA signal (e)                     10300                      11400                6900

(a) A minimal gene set as defined by Mushegian and Koonin [45] by comparing the genomes of Mycoplasma genitalium and Haemophilus influenzae.
(b) A minimal gene set as defined by Gil et al. [47] by comparing genomes of five completely sequenced endosymbionts.
(c) Protein-coding genes for Buchnera aphidicola BAp [48, 49].
(d) See [11].
(e) Rounded to the nearest hundred. The average signals for genes corresponding to proteins detected and undetected in the minimal lists were 12200 and 4200, respectively, for the Mushegian and Koonin list; and 13200 and 4900, respectively, for the Gil et al. list; and those for genes corresponding to proteins detected and undetected in the Buchnera list were 9000 and 2800, respectively. The average signal for an E. coli protein was 2000.


Comparison of the protein and mRNA lists to minimal gene sets and to the genome of Buchnera spp

By comparing the genomes of Haemophilus influenzae (about 1700 genes) and Mycoplasma genitalium (about 470 genes), Mushegian and Koonin [45] identified a set of 255 orthologous genes [46]. In a more recent study, Gil and co-workers [47] determined a common core set of 206 genes by comparing five sequenced genomes of endosymbionts. We were interested in whether we detected proteins and mRNAs from these sets, which were inferred to be essential, and in whether genes from these sets were expressed at a level higher than average for E. coli. Likewise, we considered a list of E. coli genes orthologous to those of Buchnera aphidicola [48, 49], an obligate endosymbiont that is closely related to E. coli [48]. The genome of B. aphidicola contains only about 580 protein-coding genes and apparently evolved from the genome of the last common ancestor of Buchnera and E. coli through gene loss [49]. Many Buchnera genes are considered essential for cellular function [48].

We detected high percentages of proteins in both minimal gene sets and the Buchnera / E. coli orthologous gene list (Table 3). Properties of proteins in the minimal and Buchnera lists and their cognate mRNAs differed from those of all E. coli proteins in the ways described above for proteins in the 2D gel lists. Proteins we detected or failed to detect in each list also differed from each other in the ways described above (legend to Table 3). A number of proteins we failed to detect in the Buchnera list have already been discussed above (e.g. Fts proteins and members of their operons required for murein synthesis; components of the F1F0 ATPase and NADH dehydrogenase I). In addition, more than half of the 155 proteins that we failed to detect in the Buchnera list were proteins of unknown or putative function (44 and 11 proteins, respectively) or were "flagellar" proteins (27 proteins). Our failure to detect the latter is due to a peculiarity of the MG1655 (CGSC 6300) strain used in these experiments (see above). Buchnera does not have flagella and lacks, in particular, genes coding for flagellar filament proteins [48]. It has been speculated that the proteins designated "flagellar" in Buchnera may constitute a type III secretion system [48].

DISCUSSION

It has long been estimated that E. coli expresses at least a quarter of its genome under a single growth condition [20]. It is now clear that this is a minimum estimate. Considering the list of 1147 proteins detected by high pressure liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) in the context of the operons containing their cognate genes indicates that E. coli probably expresses at least one-third of its proteins (~1600) in minimal glycerol medium [11]. The community is now poised to begin determining global differences in the amounts of E. coli proteins under different growth conditions [44, 50, 51], an undertaking that will help with the biological interpretation of the failure to detect a particular protein under any one condition. For example, although we failed to detect more than half of the Fts proteins, these are essential for growth under all conditions ([30, 31] and D. Weiss, personal communication), and hence must be present. By contrast, enzymes of the lac operon are in some sense absent in cells grown on glycerol.

The “genome image” visualization approach shown in Fig. 1 can help biologists interpret mRNA and protein profiles of cells more easily. Such data visualization allows researchers to make quick qualitative assessments of these profiles, especially for bacteria and archaea. In organisms belonging to these two kingdoms of life, genes of common function often form multi-gene expression units (operons) and hence are adjacent on the chromosome. Genes in operons are seen as strings of the same color on a genome image (see footnote to Fig. 1) and thus their qualitative behavior is readily determined. As quantitative comparisons between protein levels in different samples become available, the program can be further extended to incorporate them together with comparisons of the corresponding mRNA levels.

CONCLUSIONS

Considering specific examples of genes, operons, and groups of genes enriched understanding of protein and mRNA detection in E. coli grown under a single condition and global relationships between them. A simple visualization tool facilitated qualitative comparisons between protein and mRNA profiles across operons.

METHODS

Data analysis

The data set was as described in [11]. The protein composition of a whole cell lysate of E. coli strain MG1655 (CGSC 6300) [24] grown in minimal-glycerol medium was acquired from a trypsin-digested protein extract using HPLC-MS/MS. Affymetrix E. coli Antisense GeneChips were used to obtain mRNA levels and mRNA presence calls under the same conditions.

Gene and protein names, ID numbers, and descriptions were taken from the E. coli Entry Point database (http://coli.berkeley.edu/cgi-bin/ecoli/coli_entry.pl). These were as defined by Blattner et al. [16]. Gene functional categories were as originally defined by Riley and Labedan [21]. The list of E. coli proteins identified by 2D electrophoresis was downloaded from the SWISS-2DPAGE database (http://us.expasy.org/ch2d/). All data comparisons were performed in EXCEL with the help of VISUAL BASIC scripts. The complete gene lists used here are available at the web site.

ACKNOWLEDGEMENTS

We thank John Cronan, Kelly Hughes, and David Weiss for information on co-factor synthesis, flagellar function, and cell division, respectively, Julio Collado-Vides for providing a list of known transcriptional regulators, and Francisco J. Silva and Andres Moya for providing a list of the genes of Buchnera.

This work was supported by National Institutes of Health grants GM37537 to DH and GM38361 to SK and by Wright Brothers Institute grant WBSC9004A to OP.

REFERENCES

1. Selinger DW, Cheung KJ, Mei R, Johansson EM, Richmond CS, Blattner FR, Lockhart DJ, Church GM: RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nat Biotechnol 2000, 18(12):1262-1268.
2. Wassarman KM, Repoila F, Rosenow C, Storz G, Gottesman S: Identification of novel small RNAs using comparative genomics and microarrays. Genes Dev 2001, 15(13):1637-1651.
3. Hubbell E, Liu WM, Mei R: Robust estimators for expression analysis. Bioinformatics 2002, 18:1585-1592.
4. Liu WM, Mei R, Di X, Ryder TB, Hubbell E, Dee S, Webster TA, Harrington CA, Ho MH, Baid J et al: Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 2002, 18:1593-1599.
5. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999, 17(10):994-999.
6. Florens L, Washburn MP, Raine JD, Anthony RM, Grainger M, Haynes JD, Moch JK, Muster N, Sacci JB, Tabb DL et al: A proteomic view of the Plasmodium falciparum life cycle. Nature 2002, 419:520-526.
7. Lipton MS, Pasa-Tolic L, Anderson GA, Anderson DJ, Auberry DL, Battista JR, Daly MJ, Fredrickson J, Hixson KK, Kostandarithes H et al: Global analysis of the Deinococcus radiodurans proteome by using accurate mass tags. Proc Natl Acad Sci U S A 2002, 99:11049-11054.
8. Futcher B, Latter GI, Monardo P, McLaughlin CS, Garrels JI: A sampling of the yeast proteome. Mol Cell Biol 1999, 19(11):7357-7368.
9. Griffin TJ, Gygi SP, Ideker T, Rist B, Eng J, Hood L, Aebersold R: Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae. Mol Cell Proteomics 2002, 1(4):323-333.
10. Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O'Shea EK, Weissman JS: Global analysis of protein expression in yeast. Nature 2003, 425:737-741.
11. Corbin RW, Paliy O, Yang F, Shabanowitz J, Platt M, Lyons CE, Jr., Root K, McAuliffe J, Jordan MI, Kustu S et al: Toward a protein profile of Escherichia coli: comparison to its transcription profile. Proc Natl Acad Sci U S A 2003, 100(16):9232-9237.
12. Zimmer DP, Soupene E, Lee HL, Wendisch VF, Khodursky AB, Peter BJ, Bender RA, Kustu S: Nitrogen regulatory protein C-controlled genes of Escherichia coli: scavenging as a defense against nitrogen limitation. Proc Natl Acad Sci U S A 2000, 97(26):14674-14679.
13. Wendisch VF, Zimmer DP, Khodursky A, Peter B, Cozzarelli N, Kustu S: Isolation of Escherichia coli mRNA and comparison of expression using mRNA and total RNA on DNA microarrays. Anal Biochem 2001, 290:205-213.
14. Zimmer DP, Paliy O, Thomas B, Gyaneshwar P, Kustu S: Genome image programs: visualization and interpretation of Escherichia coli microarray experiments. Genetics 2004, 167:2111-2119.
15. Gyaneshwar P, Paliy O, McAuliffe J, Popham DL, Jordan MI, Kustu S: Sulfur and nitrogen limitation in Escherichia coli K-12: specific homeostatic responses. J Bacteriol 2005, 187(3):1074-1090.
16. Blattner FR, Plunkett G, 3rd, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF et al: The complete genome sequence of Escherichia coli K-12. Science 1997, 277:1453-1474.
17. Magasanik B: Regulation of transcription of the glnALG operon of Escherichia coli by protein phosphorylation. Biochimie 1989, 71(9-10):1005-1012.
18. Reitzer LJ, Magasanik B: Transcription of glnA in E. coli is stimulated by activator bound to sites far from the promoter. Cell 1986, 45(6):785-792.
19. GeneChip arrays provide optimal sensitivity and specificity for microarray expression analysis. Affymetrix Technical Note. Affymetrix, Santa Clara, CA.
20. Neidhardt FC, Ingraham JL, Schaechter M: Physiology of the bacterial cell: a molecular approach. Sunderland, MA: Sinauer Associates; 1990.
21. Riley M, Labedan B: E. coli gene products: Physiological functions and common ancestries. In: Escherichia coli and Salmonella: Cellular and Molecular Biology. Edited by Neidhardt F, Curtiss R, Lin ECC, Ingraham J, Low KB, Magasanik B, Reznikoff W, Riley M, Schaechter M, Umbarger HE. Washington, D.C.: ASM Press; 1996: 2118-2202.
22. Landick R, Turnbough CL, Jr., Yanofsky C: Transcription Attenuation. In: Escherichia coli and Salmonella: Cellular and Molecular Biology. Edited by Neidhardt F, Curtiss R, Lin ECC, Ingraham J, Low KB, Magasanik B, Reznikoff W, Riley M, Schaechter M, Umbarger HE. Washington, D.C.: ASM Press; 1996: 1263-1286.
23. Lehnen D, Blumer C, Polen T, Wackwitz B, Wendisch VF, Unden G: LrhA as a new transcriptional key regulator of flagella, motility and chemotaxis genes in Escherichia coli. Mol Microbiol 2002, 45(2):521-532.
24. Soupene E, van Heeswijk WC, Plumbridge J, Stewart V, Bertenthal D, Lee H, Prasad G, Paliy O, Charernnoppakul P, Kustu S: Physiological studies of Escherichia coli strain MG1655: growth defects and apparent cross-regulation of gene expression. J Bacteriol 2003, 185(18):5611-5626.
25. Ikebe T, Iyoda S, Kutsukake K: Structure and expression of the fliA operon of Salmonella typhimurium. Microbiology 1999, 145 (Pt 6):1389-1396.
26. Butler JD, Levin SW, Facchiano A, Miele L, Mukherjee AB: Amino acid composition and N-terminal sequence of purified cystine binding protein of Escherichia coli. Life Sci 1993, 52(14):1209-1215.
27. Quadroni M, Staudenmann W, Kertesz M, James P: Analysis of global responses by protein and peptide fingerprinting of proteins isolated by two-dimensional gel electrophoresis. Application to the sulfate-starvation response of Escherichia coli. Eur J Biochem 1996, 239(3):773-781.
28. Perez-Rueda E, Collado-Vides J: The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res 2000, 28(8):1838-1847.

29. Koonin EV, Galperin MY: Sequence-evolution-function: Computational approaches in comparative genomics. Boston, USA: Kluwer Academic Publishers; 2003.
30. Errington J, Daniel RA, Scheffers DJ: Cytokinesis in bacteria. Microbiol Mol Biol Rev 2003, 67(1):52-65, table of contents.
31. Gill DR, Hatfull GF, Salmond GP: A new cell division operon in Escherichia coli. Mol Gen Genet 1986, 205(1):134-145.
32. Lu C, Stricker J, Erickson HP: FtsZ from Escherichia coli, Azotobacter vinelandii, and Thermotoga maritima--quantitation, GTP hydrolysis, and assembly. Cell Motil Cytoskeleton 1998, 40(1):71-86.
33. Dougherty TJ, Kennedy K, Kessler RE, Pucci MJ: Direct quantitation of the number of individual penicillin-binding proteins per cell in Escherichia coli. J Bacteriol 1996, 178(21):6110-6115.
34. Hale CA, de Boer PA: Direct binding of FtsZ to ZipA, an essential component of the septal ring structure that mediates cell division in E. coli. Cell 1997, 88(2):175-185.
35. Dassa E, Hofnung M, Paulsen IT, Saier MH, Jr.: The Escherichia coli ABC transporters: an update. Mol Microbiol 1999, 32(4):887-889.
36. Carbone A, Zinovyev A, Kepes F: Codon adaptation index as a measure of dominating codon bias. Bioinformatics 2003, 19(16):2005-2015.
37. Sharp PM, Li WH: The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 1987, 15(3):1281-1295.
38. Ma J, Campbell A, Karlin S: Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J Bacteriol 2002, 184:5733-5745.
39. VanBogelen RA, Abshire KZ, Pertsemlidis A, Clark RL, Neidhardt FC: Gene-Protein Database of Escherichia coli K-12, Edition 6. In: Escherichia coli and Salmonella: Cellular and Molecular Biology. Edited by Neidhardt F, Curtiss R, Lin ECC, Ingraham J, Low KB, Magasanik B, Reznikoff W, Riley M, Schaechter M, Umbarger HE. Washington, D.C.: ASM Press; 1996: 2067-2117.
40. Champion KM, Nishihara JC, Joly JC, Arnott D: Similarity of the Escherichia coli proteome upon completion of different biopharmaceutical fermentation processes. Proteomics 2001, 1:1133-1148.
41. Link AJ, Robison K, Church GM: Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12. Electrophoresis 1997, 18:1259-1313.
42. Loo RR, Cavalcoli JD, VanBogelen RA, Mitchell C, Loo JA, Moldover B, Andrews PC: Virtual 2-D gel electrophoresis: visualization and analysis of the E. coli proteome by mass spectrometry. Anal Chem 2001, 73:4063-4070.
43. Tonella L, Hoogland C, Binz PA, Appel RD, Hochstrasser DF, Sanchez JC: New perspectives in the Escherichia coli proteome investigation. Proteomics 2001, 1:409-423.
44. Lopez-Campistrous A, Semchuk P, Burke L, Palmer-Stone T, Brokx SJ, Broderick G, Bottorff D, Bolch S, Weiner JH, Ellison MJ: Localization, annotation, and comparison of the Escherichia coli K-12 proteome under two states of growth. Mol Cell Proteomics 2005, 4(8):1205-1209.
45. Mushegian AR, Koonin EV: A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 1996, 93:10268-10273.
46. Koonin EV: How many genes can make a cell: the minimal-gene-set concept. Annu Rev Genomics Hum Genet 2000, 1:99-116.
47. Gil R, Silva FJ, Pereto J, Moya A: Determination of the core of a minimal bacterial gene set. Microbiol Mol Biol Rev 2004, 68(3):518-537.
48. Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H: Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 2000, 407(6800):81-86.
49. Silva FJ, Latorre A, Moya A: Why are the genomes of endosymbiotic bacteria so stable? Trends Genet 2003, 19:176-180.
50. Champion MM, Campbell CS, Siegele DA, Russell DH, Hu JC: Proteome analysis of Escherichia coli K-12 by two-dimensional native-state chromatography and MALDI-MS. Mol Microbiol 2003, 47:383-396.
51. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999, 17:994-999.
52. Glasner JD, Liss P, Plunkett G, 3rd, Darling A, Prasad T, Rusch M, Byrnes A, Gilson M, Biehl B, Blattner FR et al: ASAP, a systematic annotation package for community analysis of genomes. Nucleic Acids Res 2003, 31:147-151.
