A Draft Map of the Human Proteome
HHS Public Access. Author manuscript; available in PMC 2015 April 20.
Published in final edited form as: Nature. 2014 May 29; 509(7502): 575–581. doi:10.1038/nature13302.

A full list of authors and affiliations appears at the end of the article.

Abstract

The availability of human genome sequence has transformed biomedical research over the past decade. However, an equivalent map for the human proteome with direct measurements of proteins and peptides does not exist yet. Here, we present a draft map of the human proteome using high resolution Fourier transform mass spectrometry. In-depth proteomic profiling of 30 histologically normal human samples including 17 adult tissues, 7 fetal tissues and 6 purified primary hematopoietic cells resulted in identification of proteins encoded by 17,294 genes accounting for ~84% of the total annotated protein-coding genes in humans. A unique and comprehensive strategy for proteogenomic analysis enabled us to discover a number of novel protein-coding regions, which includes translated pseudogenes, non-coding RNAs and upstream ORFs. This large human proteome catalog (available as an interactive web-based resource at http://www.humanproteomemap.org) will complement available human genome and transcriptome data to accelerate biomedical research in health and disease.

Analysis of the complete human genome sequence has thus far led to the identification of ~20,687 protein-coding genes [1] although the annotation still continues to be refined. Mass spectrometry has revolutionized proteomics studies in a manner analogous to the impact of next generation sequencing on genomics and transcriptomics [2–4]. Several groups, including ours, have employed mass spectrometry to catalog complete proteomes of unicellular organisms [5–7] and to explore proteomes of higher organisms including mouse [8] or human [9,10].
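As a quick check on the coverage figure quoted above, the ~84% value follows directly from the two gene counts given in the text. The sketch below is only an illustration of that arithmetic; the function name `coverage_percent` and the constant names are hypothetical and not part of the study's pipeline.

```python
def coverage_percent(genes_with_evidence: int, annotated_genes: int) -> float:
    """Return the share of annotated genes with protein evidence, in percent."""
    if annotated_genes <= 0:
        raise ValueError("annotated_genes must be positive")
    return 100.0 * genes_with_evidence / annotated_genes


# Counts quoted in the text; the annotated total is itself approximate (~20,687).
IDENTIFIED_GENES = 17_294
ANNOTATED_GENES = 20_687

coverage = coverage_percent(IDENTIFIED_GENES, ANNOTATED_GENES)
print(f"{coverage:.1f}% of annotated protein-coding genes")  # prints 83.6%, i.e. ~84%
```

Because the annotated gene count continues to be refined, the percentage should be read as approximate.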
Reprints and permissions information is available at www.nature.com/reprints.

Correspondence and requests for materials should be addressed to H.G. ([email protected]) or A.P. ([email protected]).

§Current address: Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA.

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Author Contributions
A.P., H.G., R.C., M.-S.K. designed the study; A.P., H.G., M.-S.K. managed the study; D.G., C.L.K., C.A.I.-D., K.R.M. collected human cells/tissues; M.-S.K., R.C., D.G. developed the pipeline of experiment and analysis; D.G., M.-S.K., S.M.P., K.M., R.C., S.R., J.Z., X.W., P.G.S., M.S.Z., T.C.H. prepared peptide samples for LC-MS/MS; M.-S.K., R.S.N., S.M.P., R.C., D.S.K., S.R., G.J.S. performed LC-MS/MS; M.-S.K., S.M.P., S.P., S.S.M., C.J.M., J.A. and A.K.M. processed MS data and managed data; A.K.M., S.S.M., B.G., A.H.P., Y.S., M.-S.K. performed comparison analysis with PeptideAtlas, neXtProt and GPMDB; R.I., S.J., G.D.B. performed interaction and complex analysis; M.-S.K., S.M.P., S.S.M., P.K., A.K.M., N.A.S., R.S.N., L.B., L.D.N.S., D.S.K., V.N., A.R., T.S., M.K., S.K.S., G.D., A.M., R.R., S.C., K.K.D., A.S., S.D.Y., S.J., P.R., A.H.P., B.G., J.S., N.S., R.G., G.J.S., A.A.K., S.A., D.F., T.S.K.P., H.G., A.P. performed proteogenomic analysis; A.C., H.L., R.S., J.T.S., K.K.M., S.S., A.M., S.K.S., P.S., S.D.L., C.G.D., A.M., M.K.H., R.H.H., C.L.K., C.A.I.-D. assisted with analysis of the data; M.-S.K., S.M.P., T.C.H., P.L.R. performed Western blot experiments; M.-S.K., J.K.T., A.K.M., B.M., S.P., S.M.P. designed the HPM web portal; M.-S.K., A.K.M., J.K.T. generated SRM database; M.-S.K., K.M., G.D., S.M.P., S.S.M. illustrated figures with help of other authors; A.P., M.-S.K., H.G. wrote the manuscript with inputs from other authors.
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the dataset identifier PXD000561.

The authors declare no competing financial interests.

To develop a draft map of the human proteome by systematically identifying and annotating protein-coding genes in the human genome, we carried out proteomic profiling of 30 histologically normal human tissues and primary cells using high resolution mass spectrometry. We generated tandem mass spectra corresponding to proteins encoded by 17,294 genes, accounting for ~84% of the annotated protein-coding genes in the human genome, the largest coverage of the human proteome reported thus far. This includes mass spectrometric evidence for proteins encoded by 2,535 genes that have not been previously observed, as evidenced by their absence from large community-based proteomic datasets: PeptideAtlas [11], GPMDB [12] and neXtProt [13] (which includes annotations from Human Protein Atlas [14]).

A general limitation of current proteomics methods is their dependence on predefined protein sequence databases for identifying proteins. To overcome this, we also employed a comprehensive proteogenomic analysis strategy to identify novel peptides/proteins that are currently not part of annotated protein databases. This approach revealed novel protein-coding genes in the human genome that are missing from current genome annotations, in addition to evidence of translation of several annotated pseudogenes as well as non-coding RNAs. As discussed below, we provide evidence for revising hundreds of entries in protein databases based on our data. This includes novel translation start sites, gene/exon extensions and novel coding exons for annotated genes in the human genome.
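The comparison against community resources described above, in which a gene counts as not previously observed when it is absent from PeptideAtlas, GPMDB and neXtProt, can be pictured as a set difference. The following is a minimal sketch under that assumption, not the authors' actual comparison procedure; the function name and the placeholder gene identifiers are hypothetical and carry no real data.

```python
def genes_without_prior_evidence(observed, *reference_sets):
    """Return the observed genes that appear in none of the reference sets."""
    previously_seen = set().union(*reference_sets)
    return set(observed) - previously_seen


# Placeholder identifiers for illustration only (not real results).
observed_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
peptideatlas_genes = {"GENE_A"}
gpmdb_genes = {"GENE_A", "GENE_B"}
nextprot_genes = {"GENE_C"}

novel = genes_without_prior_evidence(
    observed_genes, peptideatlas_genes, gpmdb_genes, nextprot_genes
)
print(sorted(novel))  # ['GENE_D']
```

In practice the identifiers would first have to be mapped to a common namespace across the resources, a step this sketch leaves out.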
A high quality mass spectrometry dataset to define the normal human proteome

To generate a baseline proteomic profile in humans, we studied 30 histologically normal human cell and tissue types, including 17 adult tissues, 7 fetal tissues, and 6 hematopoietic cell types (Fig. 1a). Pooled samples from three individuals per tissue type were processed and fractionated at the protein level by SDS-PAGE and at the peptide level by basic RPLC and analyzed on high resolution Fourier transform mass spectrometers (LTQ-Orbitrap Elite and LTQ-Orbitrap Velos) (Fig. 1b). To generate a high quality dataset, both precursor ions and HCD-derived fragment ions were measured using the high resolution and high accuracy Orbitrap mass analyzer. Approximately 25 million high resolution tandem mass spectra, acquired from >2,000 LC-MS/MS runs, were searched against NCBI’s RefSeq [15] human protein sequence database using MASCOT [16] and SEQUEST [17] search engines. The search results were rescored using the Percolator [18] algorithm and a total of ~293,000 non-redundant peptides were identified at a q value <0.01 with a median mass measurement error of ~260 parts per billion (Extended Data Fig. 1a). The median number of peptides and corresponding tandem mass spectra identified per gene are 10 and 37, respectively, while the median protein sequence coverage was ~28% (Extended Data Fig. 1b, c). It should be noted, however, that false positive rates for subgroups of peptide-spectrum matches can vary with the nature of the peptides, such as size, charge state of precursor peptide ions or missed enzymatic cleavage (Extended Data Fig. 1d–f and Supplementary Information).

We compared our dataset with two of the largest human peptide-based resources, PeptideAtlas and GPMDB. These two databases contain curated peptide information that has been collected from the entire proteomics community over the last decade. Strikingly,
almost half of the peptides we identified were not deposited in either one of these resources. Also, the novel peptides in our dataset constitute 37% of the peptides in PeptideAtlas and 54% of peptides in the case of GPMDB (Extended Data Fig. 1g, h). This dramatic increase in the coverage of human proteomic data was made possible by the breadth and depth of our analysis as most of the cells and tissues that we have analyzed have not previously been studied using similar methods. The depth of our analysis enabled us to identify protein products derived from two-thirds (2,535 out of 3,844) of proteins designated as ‘missing proteins’ [19] for lack of protein-based evidence. Several hypothetical proteins that we identified have a broad tissue distribution indicating the inadequate sampling of the human proteome thus far (Extended Data Fig. 2a).

Landscape of protein expression pattern across cells and tissues

Based on gene expression studies, it is clear that there are several genes that are involved in basic cellular functions that are constitutively expressed in almost all the cells/tissues. Although the concept of ‘housekeeping genes’ as genes that are expressed in all tissues and cell types is widespread among biologists, there is no readily available catalog of such genes. However, the extent to which these transcripts are translated into proteins remains unknown. We detected proteins encoded by 2,350 genes across all human cells/tissues with these highly abundant ‘housekeeping proteins’ constituting ~75% of total protein mass based on spectral counts (Extended Data Fig.