Protein Bioinformatics

To appear in: Medical Applications of Mass Spectrometry, Vekey, K., Telekes, A., Vertes, A. (Eds.), Elsevier Science

Peter McGarvey, Hongzhan Huang and Cathy H. Wu
Department of Biochemistry and Molecular Biology, Georgetown University Medical Center, Washington, DC 20057, USA

1. Introduction

Bioinformatics can be defined as the field of science in which biology, computer science, and information technology merge into a single discipline. It comprises several important sub-disciplines, including the development of new algorithms and statistics; the analysis and interpretation of data; the development of tools that mine and manage various types of information; and database development and data integration. All these sub-disciplines have played a role in developing the mass spectrometry methods reviewed in this book. As described in this volume and elsewhere, the high-throughput proteomics methods used in mass spectrometry require accurate databases of both protein sequences and post-translational modifications, as well as algorithms and tools to match spectra to peptides and peptides to proteins [MacCoss, 2005; Johnson et al., 2005]. After identification of a protein, further interpretation and knowledge discovery come from the integration of protein sequence data with all forms of additional biomedical data. There are many approaches to data integration, and the field is evolving as different approaches and data collections merge. Here we describe our bottom-up approach to data integration, which starts with protein sequence information and brings in a wide variety of structural, functional, genetic and disease information related to proteins. We also discuss some future efforts to link this information to other data collections, as well as broader community efforts and approaches to data integration.

High-throughput genome and proteome projects have resulted in the rapid accumulation of genome sequences for a large number of organisms.
Meanwhile, scientists have begun to systematically tackle other complex regulatory processes by studying organisms at the global scale of transcriptomes (RNA and gene expression), metabolomes (metabolites and metabolic networks), interactomes (protein-protein interactions), and physiomes (physiological dynamics and functions of whole organisms). Associated with the enormous quantity and variety of data being produced is the growing number of databases that are being generated and maintained. Meta-databases (databases of databases) have been compiled to catalog and categorize these databases, such as the Molecular Biology Database Collection [Galperin, 2005]. This online Collection (http://www.oxfordjournals.org/nar/database/cap/) lists over 700 key biological databases that add new value to the underlying data by virtue of curation, provide new types of data connections, or implement other innovative approaches to facilitate biological discovery. Based on the type of information they provide, these databases can be conveniently classified into subcategories. Examples of major database categories include genomic sequence repositories (e.g., GenBank [Benson et al., 2005]), gene expression (e.g., SMD [Ball et al., 2005]), model organism genomes (e.g., MGD [Eppig et al., 2005]), mutation databases (e.g., dbSNP [Sherry et al., 2001]), RNA sequences (e.g., RDP [Cole et al., 2005]), protein sequences (e.g., UniProt [Bairoch et al., 2005]), protein family (e.g., InterPro [Mulder et al., 2005]), protein structure (e.g., PDB [Deshpande et al., 2005]), intermolecular interactions (e.g., BIND [Alfarano et al., 2005]), metabolic pathways and cellular regulation (e.g., KEGG [Kanehisa et al., 2004]), and taxonomy (e.g., NCBI taxonomy [Wheeler et al., 2005]). To fully explore these data sets, advanced bioinformatics infrastructures must be developed for biological knowledge extraction and management.
The Protein Information Resource (PIR) [Wu et al., 2003] is an integrated bioinformatics resource that supports genomic and proteomic research in this manner. PIR is a member of UniProt (the Universal Protein Resource), the world's most comprehensive catalog of information on proteins, which unifies the previously separate PIR, Swiss-Prot, and TrEMBL databases [Bairoch et al., 2005]. The core resources and bioinformatics framework for large-scale proteomic data mining at PIR include: the UniProt Knowledgebase of all known proteins; the iProClass database, which integrates information from over 90 biological databases [Wu et al., 2004a]; the PIRSF classification-driven and rule-based system for protein functional annotation [Wu et al., 2004b; Natale et al., 2005]; the iProLINK literature mining resource [Hu et al., 2004]; and some new tools for proteomics data analysis and target identification.

2. Methodology

2.1. UniProt Sequence Databases

The Universal Protein Resource (UniProt) provides the scientific community with a single, centralized, authoritative resource for protein sequences and functional information. It has three database components, each addressing a key need in protein bioinformatics.

The UniProt Knowledgebase (UniProtKB) is the central protein sequence database, with accurate, consistent, and rich sequence and functional annotation, full classification, and extensive cross-references. Produced through a combination of automated annotation and over 25 years of human curation, the annotations in UniProtKB include protein name and function, taxonomy, enzyme-specific information (catalytic activity, cofactors, metabolic pathway, regulation mechanisms), domains and sites, post-translational modifications, subcellular locations, tissue- or developmentally-specific expression, interactions, splice isoforms, polymorphisms, diseases, and sequence conflicts.
The UniProt Archive (UniParc) provides a stable and comprehensive sequence collection by storing the complete body of publicly available protein sequence data. While a protein sequence may exist in multiple databases, UniParc stores each unique sequence only once and assigns it a unique UniParc identifier. Cross-references back to the source databases are provided and include source accession numbers, sequence versions, and status (active or obsolete). The archive thus provides a history of protein sequences.

The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from UniProtKB (including splice variants and isoforms) and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The sequence compression is achieved by merging sequences and subsequences that are 100% (UniRef100), 90% (UniRef90), or 50% (UniRef50) identical, regardless of source organism. Reducing sequence redundancy speeds up sequence similarity searches and makes their results more informative.

The UniProt databases can be accessed online (http://www.uniprot.org) or downloaded in several formats (ftp://ftp.uniprot.org/pub). New releases are published every two weeks.

2.2. PIRSF Protein Family Classification

The PIRSF family classification system applies a network structure for protein classification from superfamily to subfamily levels on the UniProtKB [Wu et al., 2004b]. The primary PIRSF classification unit is the homeomorphic family, whose members are homologous (sharing common ancestry) and homeomorphic (sharing full-length sequence similarity with common domain architecture). PIRSF classification considers both full-length similarity and domain architecture, discriminates between single- and multi-domain proteins, and shows functional differences associated with the presence or absence of one or more domains.
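
The role that domain architecture plays in this kind of classification can be illustrated with a minimal Python sketch. It is not the PIRSF procedure: it assumes domain assignments are already available, groups proteins only by their exact ordered domain architecture, and ignores the full-length sequence similarity that a homeomorphic family also requires. All protein identifiers and domain labels below are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy input: protein id -> ordered domain labels (N- to C-terminus).
# Neither the ids nor the labels come from PIRSF or any real database.
proteins = {
    "protA": ["receiver", "dna_binding"],
    "protB": ["receiver", "dna_binding"],
    "protC": ["receiver", "rna_binding"],
    "protD": ["receiver"],
    "protE": ["receiver", "enzymatic"],
}


def group_by_architecture(domain_map):
    """Group protein ids by their exact ordered domain architecture."""
    groups = defaultdict(list)
    for protein_id, domains in domain_map.items():
        # A tuple is hashable, so the whole architecture can serve as the key.
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


def is_multi_domain(architecture):
    """Return True when the architecture contains more than one domain."""
    return len(architecture) > 1


groups = group_by_architecture(proteins)
for architecture, members in sorted(groups.items()):
    kind = "multi-domain" if is_multi_domain(architecture) else "single-domain"
    print(" + ".join(architecture), "|", kind, "|", ", ".join(sorted(members)))
```

In this toy data, proteins that share one domain but differ in their second domain fall into separate groups, and the single-domain protein forms its own group. A real classification would additionally require full-length similarity among the members of each group.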
For example, the relationship between domain architecture and function can be illustrated by the various types of response regulator proteins that share the CheY-like phosphoacceptor domain (Pfam domain PF00072) (Figure 1) and are involved in signal transduction by two-component signaling systems. These response regulators usually consist of an N-terminal CheY-like receiver domain and a C-terminal output (usually DNA-binding) domain. In addition to the "classical" well-known response regulators (e.g., PIRSF003173 with the winged helix-turn-helix DNA-binding domain), bacterial genomes encode a variety of response regulators with other types of DNA-binding domains (e.g., PIRSF006198, PIRSF036392), RNA-binding domains (PIRSF036382), enzymatic domains (e.g., PIRSF000876, PIRSF006638), or a combination of these types of domains (e.g., PIRSF003187).

For a biologist seeking to collect and analyze information about a protein, matching a protein sequence to a curated protein family provides a tool that is usually faster and more accurate than searching against a protein sequence database, which may return only a sequence and a name submitted by a genomic sequencing project. Human curation of families provides richer information on protein structure and function because it draws from a wider pool of information. A classification-driven and rule-based system for automating protein functional annotation has been developed using PIRSF families [Wu et al., 2004b; Natale et al., 2005].

The protein family classifications and associated information are stored in the PIRSF database and can be searched by a variety of methods (http://pir.georgetown.edu:60888/pirwww/dbinfo/pirsf.shtml). The PIRSF family reports (Figure 2) (e.g., http://pir.georgetown.edu:60888/cgi-bin/ipcSF?id=PIRSF000186) provide classification and annotation summaries organized in several sections: (i) general information:
