The Chromosome-Centric Human Proteome Project for Cataloging Proteins Encoded in the Genome
CORRESPONDENCE

Nature Biotechnology, Volume 30, Number 3, March 2012, p. 221
© 2012 Nature America, Inc. All rights reserved.

To the Editor:

The Chromosome-Centric Human Proteome Project (C-HPP) aims to define the full set of proteins encoded in each chromosome through development of a standardized approach for analyzing the massive proteomic data sets currently being generated from dedicated efforts of national and international teams. The initial goal of the C-HPP is to identify at least one representative protein encoded by each of the approximately 20,300 human genes¹,². The proteins will be characterized for tissue localization and major isoforms, including post-translational modifications (PTMs), using quantitative mass spectrometry and antibody reagents. Our rationale is that effective integration of proteomics data into a genomic framework will lead to improved knowledge of complex biological systems and facilitate access to protein-level data. Although the intent to engage in a C-HPP program has been noted¹⁻³, our objective here is to define the goals and process for its development as a multinational program.

Over the past three years, the Human Proteome Organization (HUPO) has developed a strategy for the first phase of the Human Proteome Project (HPP; http://thehpp.org/; Supplementary Fig. 1). HPP1 goals will be achieved through cooperation with the C-HPP to characterize the human proteome on a chromosome-by-chromosome basis and with the biology- and disease-driven projects (B/D-HPP). Human genome studies, such as the 1000 Genomes Project and ENCODE, and transcriptome sequencing provide a basis for identification of protein isoforms generated by alternative splicing transcripts (ASTs) and by nonsynonymous single-nucleotide polymorphisms (nsSNPs; Supplementary Fig. 2). Additional protein forms will be identified through characterization of post-translational modifications. A basic premise of the HPP is that C-HPP data sets will have substantial utility for biological and disease studies. With development of new tools for in-depth characterization of the transcriptome and proteome, the HPP is well positioned to have a strategic role in addressing the complexity of human phenotypes.

With this in mind, HUPO has organized national chromosome teams that will collaborate with well-established laboratories building complementary proteotypic peptides, antibodies and informatics resources. An important C-HPP goal is to encourage capture and open sharing of proteomic data sets from diverse samples to enhance a gene- and chromosome-centric display. This will display several layers of biological information on a common reference platform comparable to a genome browser. Such context will effectively integrate transcriptomics data such as RNA-Seq with proteomic data sets (Fig. 1).

Although the C-HPP program has some similarities to the Human Genome Project (HGP)⁴ in its quest for complete coverage across the genome, the C-HPP has the added challenge of characterizing protein expression at the tissue, cellular and subcellular levels, as well as PTMs, ASTs and protease-processed protein variants. An example of protein variation is shown for six selected genes on chromosome 13 (BRCA2, 3 ASTs and 54 SNPs in protein-coding regions (nsSNPs); RB1, 2 ASTs and 3 nsSNPs; and IRS2, 1 AST and 3 nsSNPs) and chromosome 17 (BRCA1, 24 ASTs and 24 nsSNPs; ERBB2, 6 ASTs and 13 nsSNPs; and TP53, 14 ASTs and 5 nsSNPs; Table 1 and Supplementary Table 1).

Table 1 Features of salient genes on chromosomes 13 and 17

Gene(a)          AST    nsSNPs
Chromosome 13
  BRCA2            3        54
  RB1              2         3
  IRS2             1         3
Chromosome 17
  BRCA1           24        24
  ERBB2            6        13
  TP53            14         5

(a) Ensembl protein and AST information can be found at http://www.ensembl.org/Homo_sapiens/.
AST, alternative splicing transcript; nsSNP, nonsynonymous single-nucleotide polymorphism assembled from data from the 1000 Genomes Project.

The C-HPP will build on the three HPP pillars that provide both technology and resources for mapping the human proteome: mass spectrometry–based SRMAtlas, antibody reagents in the Human Protein Atlas and bioinformatics knowledge linked by ProteomeXchange, specifically the proteomics identification database (PRIDE), Tranche, PeptideAtlas, the global proteome machine database (GPMDB), UniProt and neXtProt (Supplementary Fig. 3).

The C-HPP does not propose any alteration in the workflow of a typical proteomics laboratory; instead, it seeks more effective use of data encompassed in existing bioinformatics resources, which will be combined with targeted studies to generate a robust list of observed protein isoforms (Supplementary Fig. 3). A potential challenge to data collection from different laboratories is the diversity of instrument and bioinformatics platforms and quality criteria. The C-HPP will work closely with proteomics journals, and use existing data (GPMDB and PeptideAtlas), literature curation (UniProt and neXtProt) and standardization programs (PSI, CPTAC, Unimod, ABRF and ASMS) to ensure that the data collection is efficient, with consistent quality assurance and quality control. Journal mandates for deposition of raw data upon publication will reinforce this process⁵. The C-HPP has already encouraged formation of chromosome-formatted databases (http://www.nextprot.org/; http://www.gpm.org/) in which new data sets are integrated with existing ones. In this manner the C-HPP will capture the protein evidence emerging from the hundreds of laboratories worldwide engaged in hypothesis-driven [...] example of such a global view for selected regions of chromosomes 13 and 17 (Fig. 1) summarizes the following extensive data sets on the basis of existing data compilations: protein evidence, mass spectrometry data, antibody availability, major PTMs, disease information and transcript level, including ASTs from three different samples, in a format viewable for associations between data sets and information gaps in specific chromosome regions.

In phase 1 (~6 years), the C-HPP plans to map all proteins currently lacking high-quality mass spectrometry evidence, three major classes of PTMs, many representative AST products and many nsSNP sequence variants. The characterizations will be followed by antibody-based detection in selected tissues and cell lines. In phase 2 (~4 years), identified proteins will be characterized and validated with additional proteomic and antibody measurements. Throughout this 10-year project, the C-HPP aims to generate information useful in the search for new biomarkers and drug targets and also in the study of disease gene families clustered in each chromosome (for example, the cytokeratin gene family on chromosome 17). C-HPP outputs will be integrated with output from the parallel B/D-HPP project.

[Figure 1 image not reproduced. Panels a-d; column headings in the panels: Molecular function, Biological process, Cellular component.]

Figure 1 Genomic, transcriptomic and protein information for the set of genes present in selected regions of chromosomes 13 and 17. (a,b) The information provides a comprehensive landscape with respect to protein evidence, quality of mass spectrometry–based protein identification, availability of antibody, disease relationship, and phosphorylation, acetylation, glycosylation and transcriptomic information. It shows the degree of protein annotation on two important regions on chromosomes 13 (a) and 17 (b) and regions with little annotated protein information on chromosomes 13 (c) and 17 (d). PE, protein evidence from UniProt; Mq, mass quality from GPMDB; Mo, number of mass spectrometry data sets in GPMDB; Ab, antibody availability; Di, disease information; Ph, Ac and Gl, phosphoryl, acetyl and glyco, respectively; Pt, placenta transcript; Pa, placenta

The C-HPP has selected the UniProt protein list (based on Ensembl genome builds) as the starting point for identified proteins. Individual chromosome teams will use information collected in well-annotated databases (for example, GPMDB, PeptideAtlas and neXtProt) to develop a list of missing or poorly identified proteins for a particular chromosome. A plot of