Comparison of Proteogenomic Approaches Across Kingdoms: A Joint Effort in Modeling

Samuel Purvine¹, Kim Hixson¹, Carrie Nicora¹, Muktak Aklujkar³, Ellen Panisko¹, Lee Ann McCue¹, Matthew Monroe¹, Athanasios Lykidis², Nikos Kyrpides², Deanna Auberry¹, Derek Lovley³, Igor Grigoriev², Scott Baker¹, Mary Lipton¹, Gordon Anderson¹, Richard Smith¹

¹Pacific Northwest National Laboratory, Richland, WA; ²Lawrence Berkeley National Laboratory, Berkeley, CA; ³University of Massachusetts, Boston, MA

Goals

• Combine direct observation of gene products with sequence information to:
  – Examine the accuracy of gene models for diverse bacteria and archaea
  – Provide modifications to gene models, thereby improving genome annotation
  – Validate exon borders in fungal genomes
  – Provide evidence to support selection of the best gene model from among several possibilities at each locus in fungal genomes
• Primary benefits are:
  – Improvements to genome annotations
  – Evidence to promote hypothetical and conserved hypothetical gene products to expressed genes of unknown function
  – Information about post-translational events, such as signal peptide cleavage, N-terminal amino acid removal, and chemical modifications such as sulfur oxidation
  – Feedback that will result in improvements to gene finding algorithms

Challenges – Bacterial & Archaeal

The Genomic Encyclopedia of Bacteria and Archaea (GEBA) project at the Joint Genome Institute (JGI) is systematically filling in gaps in the bacterial and archaeal branches of the tree of life, which have been largely ignored by genome sequencing projects. However, gene finding algorithms were developed based on observations and knowledge from a narrow range of well-studied organisms. Their performance on diverse species is uncertain.

Accumulated data for microbial proteogenomics

Organism                                    Observed   Observed ORFs    Total annotated   % coverage       N-terminus
                                            peptides   (≥ 2 peptides)   ORFs              (≥ 2 peptides)   correction
Escherichia coli K-12                       24054      1809             4331              42%              236
Geobacter sulfurreducens PCA                29123      2005             3446              58%              138
Halorhabdus utahensis AX-2, DSM 12940       12595      1308             3037              43%              222
Brachybacterium faecium DSM 4810            18828      1442             3129              46%              311
Cryptobacterium curtum DSM 15641            11793      900              1368              66%              168
Sanguibacter keddieii DSM 10542             13086      1384             3746              37%              265
Slackia heliotrinireducens DSM 20476        14048      1080             2819              38%              198
Saccharomonospora viridis P101, DSM 43017   9125       1063             3959              27%              186

Row key: extensive representation in genomic databases; some representation in genomic databases; poor representation in genomic databases.

Proteomics data on a model microbe (E. coli), a distantly related but experimentally well-studied microbe (G. sulfurreducens), and six microbes sequenced by the GEBA project. All analyses were performed on six-frame translations of the respective genomes, and results were compared to the annotated open reading frames (ORFs).

Artemis (1) views

Figure key: predicted gene; observed peptide. Panels:
• Complete coverage of 67 protein from S. keddieii
• Alternate upstream start site indicated for Sked_08350
• Alternate ORFs at the Sked_15870 and Sked_18880 loci
• Processed N-terminus

Examples comparing the proteomics data to the GeneMark (3) predicted ORFs, showing confirmation or improvement of the gene models.

Methods

• Protein extraction, isolation, and enzymatic digestion were performed on both soluble (i.e., largely cytosolic proteins from the post-centrifugation supernatant) and insoluble (i.e., post-centrifugation pellet) fractions.
• Bacterial genomes were translated in all six reading frames, and polypeptide sequences of at least 16 amino acids were saved for comparison to LC-MS/MS data using SEQUEST™ (2).
• The LC-MS/MS data for the fungus Sporotrichum thermophile were compared to all the gene models predicted at every locus, as well as to the “filtered” gene models consisting of the single “best” gene model at each locus.
• Peptide identifications are reported at ~1% false discovery rate (FDR).

Challenges – Fungal

The presence of introns and exons makes gene prediction challenging in eukaryotes. To arrive at a single “best” gene model at each locus, JGI combines the results of several gene prediction algorithms:

• ab initio predictions of Fgenesh (www.softberry.com)
• homology-based predictions of Fgenesh+
• hidden Markov model based predictions of GeneWise (4)
• EST-based predictions of estExt (Grigoriev, unpublished)

S. thermophile

Figure: filtered peptides across an fgenesh1_pm prediction, specifically gene prediction 99296.
Figure: AllModels and FilteredModel searches (labels in the figure: 15,036 unique peptides; 29,996 unique peptides).

Conclusions

• GeneMark gene predictions for bacteria and archaea show excellent fidelity, with the vast majority of observed gene products falling within predicted ORFs
• Proteomics data identify numerous alternate protein start sites, as well as alternate ORFs
• Signal peptide cleavage site identifications are providing information to improve SignalP and other prediction algorithms
• Proteomics data on the fungus S. thermophile have confirmed many exon boundaries and are providing evidence supporting alternate gene models
• Incorporation of proteomics results will provide feedback into the criteria used for filtering the multiple gene models predicted for eukaryotes

Acknowledgements

This work was funded by U.S. Department of Energy Biological and Environmental Research (DOE/BER). Samples were analyzed using capabilities developed under the support of the NIH National Center for Research Resources (RR18522) and the U.S. Department of Energy Biological and Environmental Research (DOE/BER). Significant portions of the work were performed in the Environmental Molecular Science Laboratory, a DOE/BER national scientific user facility at Pacific Northwest National Laboratory (PNNL) in Richland, Washington. PNNL is operated for the DOE by Battelle under contract DE-AC05-76RLO-1830.

References

1. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B. Artemis: sequence visualization and annotation. Bioinformatics 2000, 16: 944-945. PMID: 11120685.
2. Eng JK, McCormack AL, Yates JR III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5: 976-989.
3. Lukashin A, Borodovsky M. GeneMark.hmm: new solutions for gene finding. NAR 1998, 26: 1107-1115.
4. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res 2004, 14: 988-995.

CONTACT: Samuel Purvine
Instrument Development Laboratory
Environmental & Molecular Sciences Laboratory
Pacific Northwest National Laboratory
K8-91 P.O. Box 999, Richland, WA 99352
E-mail: [email protected]
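The six-frame translation step described in the Methods (all six reading frames, keeping polypeptides of at least 16 amino acids) can be sketched roughly as follows. This is an illustrative sketch only, not the pipeline used for the poster. It assumes the standard genetic code, treats each stop-to-stop stretch as one candidate polypeptide, and marks codons containing ambiguous bases as `X`. The function names (`reverse_complement`, `translate`, `six_frame_polypeptides`) and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific codon reassignments are needed).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon; '*' marks stops, 'X' unknown codons."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_polypeptides(dna, min_len=16):
    """Yield (frame, polypeptide) for stop-to-stop stretches >= min_len.

    Frames are labelled +1..+3 (forward strand) and -1..-3 (reverse strand).
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frame = f"{strand}{offset + 1}"
            for segment in translate(seq[offset:]).split("*"):
                if len(segment) >= min_len:
                    yield frame, segment


if __name__ == "__main__":
    # Hypothetical toy sequence: a start codon, 20 alanine codons, a stop.
    toy = "ATG" + "GCT" * 20 + "TAA"
    for frame, polypeptide in six_frame_polypeptides(toy):
        print(frame, len(polypeptide), polypeptide)
```

Splitting on stop codons rather than requiring a start codon keeps candidate sequences upstream of annotated starts in the search space, which is what allows peptide evidence to point to alternate start sites or unannotated ORFs.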