
correspondence

The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome

To the Editor:
The Chromosome-Centric Human Proteome Project (C-HPP) aims to define the full set of proteins encoded in each chromosome through development of a standardized approach for analyzing the massive proteomic data sets currently being generated from dedicated efforts of national and international teams. The initial goal of the C-HPP is to identify at least one representative protein encoded by each of the approximately 20,300 human genes1,2. The proteins will be characterized for tissue localization and major isoforms, including post-translational modifications (PTMs), using quantitative mass spectrometry and antibody reagents. Our rationale is that effective integration of proteomics data into a genomic framework will lead to improved knowledge of complex biological systems and facilitate access to protein level data. Although the intent to engage in a C-HPP program has been noted1–3, our objective here is to define the goals and process for its development as a multinational program.

Over the past three years, the Human Proteome Organization (HUPO) has developed a strategy for the first phase of the Human Proteome Project (HPP; http://thehpp.org/; Supplementary Fig. 1). HPP1 goals will be achieved through cooperation with the C-HPP to characterize the human proteome on a chromosome-by-chromosome basis and with the biology- and disease-driven projects (B/D-HPP).

Human genome studies, such as the 1000 Genomes Project and Encode, and transcriptome sequencing provide a basis for identification of protein isoforms generated by alternative splicing transcripts (ASTs) and by nonsynonymous single-nucleotide polymorphisms (nsSNPs; Supplementary Fig. 2). Additional protein forms will be identified through characterization of post-translational modifications. A basic premise of the HPP is that C-HPP data sets will have substantial utility for biological and disease studies. With development of new tools for in-depth characterization of the transcriptome and proteome, the HPP is well positioned to have a strategic role in addressing the complexity of human phenotypes. With this in mind, the HUPO has organized national chromosome teams that will collaborate with well-established laboratories building complementary proteotypic peptides, antibodies and informatics resources.

An important C-HPP goal is to encourage capture and open sharing of proteomic data sets from diverse samples to enhance a gene- and chromosome-centric display. This will display several layers of biological information on a common reference platform comparable to a genome browser. Such context will effectively integrate transcriptomics data such as RNA-Seq with proteomic data sets (Fig. 1).

Although the C-HPP program has some similarities to the Human Genome Project (HGP)4 in its quest for complete coverage across the genome, the C-HPP has the added challenge of characterizing protein expression at the tissue, cellular and subcellular levels, as well as PTMs, ASTs and protease-processed protein variants. An example of protein variation is shown for 6 selected genes on chromosome 13 (BRCA2, 3 ASTs and 54 SNPs in protein-coding regions (nsSNPs); RB1, 2 ASTs and 3 nsSNPs; and IRS2, 1 AST and 3 nsSNPs) and chromosome 17 (BRCA1, 24 ASTs and 24 nsSNPs; ERBB2, 6 ASTs and 13 nsSNPs; and TP53, 14 ASTs and 5 nsSNPs; Table 1 and Supplementary Table 1).

Table 1 Features of salient genes on chromosomes 13 and 17

Gene (a)         AST    nsSNPs
Chromosome 13
  BRCA2          3      54
  RB1            2      3
  IRS2           1      3
Chromosome 17
  BRCA1          24     24
  ERBB2          6      13
  TP53           14     5

(a) Ensembl protein and AST information can be found at http://www.ensembl.org/Homo_sapiens/. AST, alternative splicing transcript; nsSNP, nonsynonymous single-nucleotide polymorphism assembled from data from the 1000 Genomes Project.

The C-HPP will build on the three HPP pillars that provide both technology and resources for mapping the human proteome: mass spectrometry–based SRMAtlas, antibody reagents in the Human Protein Atlas and bioinformatics knowledge linked by ProteomeXchange, specifically the proteomics identification database (PRIDE), Tranche, PeptideAtlas, the global proteome machine database (GPMDB), UniProt and neXtProt (Supplementary Fig. 3).

The C-HPP does not propose any alteration in the workflow of a typical proteomics laboratory; instead, it seeks more effective use of data encompassed in existing proteomics resources, which will be combined with targeted studies to generate a robust list of observed protein isoforms (Supplementary Fig. 3). A potential challenge to data collection from different laboratories is the diversity of instrument and bioinformatics platforms and quality criteria. The C-HPP will work closely with proteomics journals, and use existing data (GPMDB and PeptideAtlas), literature curation (UniProt and neXtProt) and standardization programs (PSI, CPTAC, Unimod, ABRF and ASMS) to ensure that the data collection is efficient, with consistent quality assurance and quality control. Journal mandates for deposition of raw data upon publication will reinforce this process5. The C-HPP has already encouraged formation of chromosome-formatted databases (http://www.nextprot.org/; http://www.gpm.org/) in which new data sets are integrated with existing ones. In this manner the C-HPP will capture the protein evidence emerging from the hundreds of laboratories worldwide engaged in hypothesis-driven research or high-throughput proteome-wide studies.

© 2012 Nature America, Inc. All rights reserved.

nature biotechnology volume 30 number 3 march 2012 221 correspondence

Although chromosome-based protein data curation is a relatively new concept in proteomics2, our justification is based in part on compatibility of this data format with the output of RNA-Seq. We think the search for yet-to-be-discovered protein products of genes can be informed by transcriptomics measurements of selected tissues and cell lines. The C-HPP will also prioritize specific tasks to laboratories with expertise in particular protein subsets (for example, membrane proteins), specific protein variations (PTMs, alternative splicing and protease-processed variants), deep profiling for low-abundance proteins and targeted subcellular localization studies. We recognize the popularity of other current bioinformatic methods used to organize complex data sets by functional classes; we will incorporate this information into the C-HPP browser. An example of such a global view for selected regions of chromosomes 13 and 17 (Fig. 1) summarizes the following extensive data sets on the basis of existing data compilations: protein evidence, mass spectrometry data, antibody availability, major PTMs, disease information and transcript level, including ASTs from three different samples in a format viewable for associations between data sets and information gaps in specific chromosome regions.

In phase 1 (~6 years), the C-HPP plans to map all proteins currently lacking high-quality mass spectrometry evidence, three major classes of PTMs, many representative AST products6 and many nsSNP sequence variants. The characterizations will be followed by antibody-based detection in selected tissues and cell lines. In phase 2 (~4 years), identified proteins will be characterized and validated with additional proteomic and antibody measurements. Throughout this 10-year project, the C-HPP aims to generate information useful in the search for new biomarkers and drug targets and also in the study of disease gene families clustered in each chromosome (for example, the cytokeratin gene family in chromosome 17). C-HPP outputs will be integrated with output from the parallel B/D-HPP project.

The C-HPP has selected the UniProt protein list (based on Ensembl genome builds) as the starting point for identified proteins. Individual chromosome teams will use information collected in well-annotated databases (for example, GPMDB, PeptideAtlas and neXtProt) to develop a list of missing or poorly identified proteins for a particular chromosome. A plot of such data (for example, Fig. 1) can identify chromosomal regions with low amounts of data. For example, there is protein paucity for regions on chromosome 17 that contain olfactory receptors and keratin-binding proteins; this may be expressed in limited proteomic data sets for nasal epithelium and bone and hair samples, respectively (Fig. 1). The missing data can be obtained through collaborations with laboratories with expertise in such samples or by selection of new sample sets for protein identifications guided by transcriptomics measurements. To facilitate selection of samples suitable for mass spectrometry discovery of an individual missing protein, the C-HPP will collaborate with RNA-Seq laboratories to take advantage of specimens and transcriptomics data (Supplementary Fig. 2). We recognize that some proteins may not be suited for mass spectrometry measurements owing to their physical properties or lack of appropriate biological samples; other approaches such as generation of ribosomal DNA standards, antibody localization approaches and molecular biology tools will be used.

[Figure 1: panels a–d. Each panel lists Molecular function, Biological process and Cellular component annotations for the genes shown.]

Figure 1 Genomic, transcriptomic and protein information for the set of genes present in selected regions of chromosomes 13 and 17. (a,b) The information provides a comprehensive landscape with respect to protein evidence, quality of mass spectrometry–based protein identification, availability of antibody, disease relationship, and phosphorylation, acetylation, glycosylation and transcriptomic information. It shows the degree of protein annotation on two important regions on chromosomes 13 (a) and 17 (b) and regions with little annotated protein information on chromosomes 13 (c) and 17 (d). PE, protein evidence from UniProt; Mq, mass quality from GPMDB; Mo, number of mass spectrometry data sets in GPMDB; Ab, antibody availability; Di, disease information; Ph, Ac and Gl, phosphoryl, acetyl and glyco, respectively; Pt, placenta transcript; Pa, placenta AST; St, SKBR3 breast cancer cell line transcript; Sa, SKBR3 breast cancer cell line AST; At, A431 transcript; Aa, A431 AST. Green denotes presence and black denotes lack of information in the following data sets: transcript, disease, PTMs and antibodies. For protein evidence, green, yellow and red represent high (protein evidence), medium (transcriptomic evidence) and low (neither) evidence, respectively. Number of individual data sets and quality of mass spectrometry evidence was established according to GPMDB scores: green, >20 observations, log(e) < –5; yellow, 6–19 observations, –3 ≤ log(e) < –5; red, 1–5 observations, –1 ≤ log(e) < –3. For the relationship of each protein to disease, we used both Online Mendelian Inheritance in Man (OMIM; NCBI, confirmed Mendelian phenotype) and Cancer Gene Census (CGC, Sanger Center; cancer gene information). For the PTM information, we used UniProt/UniPep containing experimental PTM site information and GPMDB providing mass spectrometry information for the PTMs.
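The colour thresholds given in the Figure 1 legend for the Mo and Mq columns, and the idea of listing missing or poorly identified proteins for a chromosome, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the C-HPP's own procedure: the function names, the record layout and the placeholder gene labels are hypothetical; the legend's log(e) ranges are read as bands between the stated values (for example, yellow as −5 to −3); a count of exactly 20 observations, which the legend leaves unassigned, is treated as green; and "poorly identified" is taken to mean a red or absent tier in either column.

```python
from typing import Optional

def observation_tier(n_observations: int) -> Optional[str]:
    """Colour tier for the Mo column (number of GPMDB data sets)."""
    # Assumption: the legend says ">20" for green and "6-19" for yellow,
    # leaving exactly 20 unassigned; it is treated as green here.
    if n_observations >= 20:
        return "green"
    if n_observations >= 6:
        return "yellow"
    if n_observations >= 1:
        return "red"
    return None  # no observations at all


def quality_tier(log_e: float) -> Optional[str]:
    """Colour tier for the Mq column (GPMDB log(e) score)."""
    # Assumption: the legend's ranges are read as bands between the stated
    # values, with boundary values falling into the less confident band.
    if log_e < -5:
        return "green"
    if log_e <= -3:
        return "yellow"
    if log_e <= -1:
        return "red"
    return None  # weaker than the weakest band in the legend


def poorly_identified(records: list) -> list:
    """Names of proteins with a red or absent tier in either column.

    This definition of "poorly identified" is hypothetical.
    """
    weak = {"red", None}
    return [
        r["gene"]
        for r in records
        if observation_tier(r["n_obs"]) in weak or quality_tier(r["log_e"]) in weak
    ]


# Placeholder records with made-up values; not data from the article.
EXAMPLE = [
    {"gene": "GENE_A", "n_obs": 42, "log_e": -12.0},
    {"gene": "GENE_B", "n_obs": 8, "log_e": -4.1},
    {"gene": "GENE_C", "n_obs": 2, "log_e": -1.5},
    {"gene": "GENE_D", "n_obs": 0, "log_e": 0.0},
]

if __name__ == "__main__":
    print(poorly_identified(EXAMPLE))
```

The two tiers are computed separately because the legend reports them as separate columns (Mo and Mq). A chromosome team could apply a different rule for combining them.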


Given expected refinements in the human gene list, the C-HPP protein list will reflect updates in UniProt that are captured in proteomic databases. To ensure consistent data quality across chromosome groups, the C-HPP will encourage prompt deposition of data. For antibody-based studies, the C-HPP will promote the use of cultured primary or transformed cells, including induced pluripotent stem cells, which can be maintained in perpetuity for reanalysis and for subcellular fractionation. Such cell-based studies will be augmented with tissue profiling, as in the Human Protein Atlas project. Enrichment for nuclear, mitochondrial and other subcellular organelles may be especially informative7,8. The C-HPP will integrate antibody- and mass spectrometry–based measurements.

Another goal of the C-HPP is to procure high-quality reagents. To augment commercially available sources, the national teams will establish centralized antibody banks. This will be achieved through a close collaboration between each chromosome group and antibody resource groups or suppliers. In a similar manner, selected reaction monitoring peptide banks will be developed for quantitative mass spectrometry measurements.

The project will meet its aims when the comprehensive C-HPP database is 100% matched with the 20,300 protein-coding genes annotated on the human genome sequence, including at least one representative AST and nsSNP, tissue localization and three classes of PTMs in whole-chromosome sets (22 autosomal, X and Y; Supplementary Table 2).

The C-HPP is led by cochairs Young-Ki Paik (Korea), Bill Hancock (USA) and György Marko-Varga (Sweden), an executive committee and a council of principal investigators of each of the chromosome teams (thus far, 15 investigators for 14 chromosomes; Supplementary Fig. 2). The initial C-HPP team emerged from an exploratory group in Korea that selected chromosome 13; it has several key metabolic disease genes (for example, IRS2, which is associated with diabetes, and CLF, which is associated with cholesterol metabolism). Diverse approaches have been pursued by other countries and teams. A US team has focused on breast cancer, selecting chromosome 17, which contains the oncogenes ERBB2 and BRCA1. Similarly, the Australia-New Zealand team selected chromosome 7, with a focus on colon cancer and epidermal growth factor receptor. C-HPP guidelines now have been set for the assignment and progress review of chromosome-based teams and standardization of outputs (Y.-K. Paik, G.S. Omenn, M. Uhlen, S. Hanash, G. Marko-Varga et al., unpublished data). As of December 2011, based on their interests in a specific disease (for example, male infertility in Iran) or gene cluster (for example, -origin proteins in China), other international teams have chosen chromosomes 1 (China), 2 (Switzerland), 3 (Japan), 6 (Canada), 11 (Korea), 14 (France), 18 (Russia), 19 (Sweden and Germany, Norway, India, China and Spain), 21 (Canada), X (Japan) and Y (Iran). A Swedish team has published extensive findings for chromosome 21 (ref. 9). The C-HPP guidelines specify management of the project, data quality and data sharing metrics, reporting formats, and processes and criteria by which countries or researchers are designated to take the lead for a specific chromosome (Y.-K. Paik et al., unpublished data).

In conclusion, we envision that effective integration of transcriptomics and proteomics data will provide insights through a more complete ‘parts list’ and enhance a comprehensive understanding of human biology. The HPP and the C-HPP represent an even larger endeavor than the HGP. This challenge has led the HUPO to promote an efficient approach of recruiting national teams with clear areas of responsibility and effective collaborations among leading proteomic laboratories in the HPP consortium. Recognizing the complexity of the human proteome, we have set 10-year goals for characterizing the major forms of the complete set of proteins. The C-HPP will provide a global open Web interface for data collection, curation and presentation of the proteome parts list and will stimulate availability of high-quality protein capture and signature peptide reagents (Supplementary Table 2). Importantly, the C-HPP will work with governmental funding bodies to address major gaps in proteomics infrastructure, such as secure archiving of large data sets.

Note: Supplementary information is available on the Nature Biotechnology website.

Author contributions
Y.-K.P. and W.S.H. conceived strategies in coordination with G.S.O., M.U., S.H., G.M.-V., E.W.D., R.A., A.B., D.H. and P.L.; S.-K.J. carried out profile analysis with programs developed. J.-Y.K. and H.K. provided clinical samples and information; H.-J.L., F.Y., F.Z., Y.Z., S.Y.C., K.N., K.Y.K., E.-Y.L., E.-Y.C., Y.C., R.C. and A.D.T. carried out various experiments including sample preparation, proteomic analysis and RNA sequencing of cell lines.

Acknowledgments
We thank R. Beavis for his critical comments and support for this work. Y.-K.P. thanks the Korean Human Proteome Organization HPP planning committee members for their contribution to the project in the early phase. Work involving chromosomes 13 and 17 in this paper was supported in part by the World Class University program funded by the Korean Ministry of Education, Science and Technology (to Y.-K.P. and W.S.H.), a grant from the Korean Ministry of Health and Welfare (to Y.-K.P.) and grants from the US National Cancer Institute (to M.S. and W.S.H.).

Competing Financial Interests
The authors declare no competing financial interests.

Young-Ki Paik1, Seul-Ki Jeong1, Gilbert S Omenn2,3, Mathias Uhlen4, Samir Hanash5, Sang Yun Cho1,13, Hyoung-Joo Lee1, Keun Na1, Eun-Young Choi1, Fangfei Yan6, Fan Zhang6, Yue Zhang6, Michael Snyder7, Yong Cheng7, Rui Chen7, György Marko-Varga8, Eric W Deutsch3, Hoguen Kim9, Ja-Young Kwon9, Ruedi Aebersold10, Amos Bairoch11, Allen D Taylor4, Kwang Youl Kim1, Eun-Young Lee1, Denis Hochstrasser11, Pierre Legrain12 & William S. Hancock1,5

1Yonsei Proteome Research Center, Yonsei University, Seoul, Korea. 2Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA. 3Institute for Systems Biology, Seattle, Washington, USA. 4Royal Institute of Technology, Stockholm, Sweden. 5Fred Hutchinson Cancer Research Center, Seattle, Washington, USA. 6Northeastern University, Boston, Massachusetts, USA. 7Stanford University, Palo Alto, California, USA. 8Lund University, Lund, Sweden. 9Yonsei University College of Medicine, Seoul, Korea. 10Department of Biology, Institute of Molecular Systems Biology, Eidgenössische Technische Hochschule, Zürich, Switzerland, and Faculty of Science, University of Zurich, Zurich, Switzerland. 11Swiss Institute of Bioinformatics and University of Geneva, Geneva, Switzerland. 12Ecole Polytechnique, Palaiseau, France. 13Present address: Korean National Institute of Health, Osong, Korea.
e-mail: Y.-K.P. (paikyk@yonsei.ac.kr) or W.S.H. ([email protected])

1. Legrain, P. et al. Mol. Cell. Proteomics 10, M111.009993 (2011).
2. Hancock, W. et al. J. Proteome Res. 10, 210 (2011).
3. Service, R.F. Science 321, 1758–1761 (2008).
4. Lander, E.S. et al. Nature 409, 860–921 (2001).
5. Farrah, T. et al. Mol. Cell. Proteomics 10, M110.006353 (2011).
6. Menon, R. & Omenn, G.S. Methods Mol. Biol. 696, 319–326 (2011).
7. Gnad, F. et al. Mol. Cell. Proteomics 9, 2642–2653 (2011).
8. Walther, T.C. & Mann, M. J. Cell Biol. 190, 491–500 (2010).
9. Uhlén, M. et al. Mol. Cell. Proteomics published online, doi:10.1074/mcp.M111.013458 (31 October 2011).
