Cancer Mutant Proteome Database

Published online 14 November 2014 Nucleic Acids Research, 2015, Vol. 43, Database issue D849–D855 doi: 10.1093/nar/gku1182 CMPD: cancer mutant proteome database Po-Jung Huang1,2,†, Chi-Ching Lee1,†, Bertrand Chin-Ming Tan3, Yuan-Ming Yeh4, Lichieh Julie Chu2, Ting-Wen Chen1, Kai-Ping Chang5, Cheng-Yang Lee1, Ruei-Chi Gan1, Hsuan Liu6,* and Petrus Tang1,* 1Bioinformatics Core Laboratory, Chang Gung University, Taoyuan 333, Taiwan, 2Molecular Medicine Research Center, Chang Gung University, Taoyuan 333, Taiwan, 3Department of Biomedical Sciences, Chang Gung University, Taoyuan 333, Taiwan, 4Bioinformatics Division, Tri-I Biotech, Inc., Taipei 221, Taiwan, 5Department of Otolaryngology, Head and Neck Surgery, Chang Gung Memorial Hospital, Lin-Kou, Taoyuan 333, Taiwan and 6Department of Molecular and Cellular Biology, Chang Gung University, Taoyuan 333, Taiwan Received August 15, 2014; Accepted November 02, 2014 ABSTRACT INTRODUCTION Whole-exome sequencing, which centres on the Cancer is a genetic disease, which arise as a consequence protein coding regions of disease/cancer asso- of genomic abnormalities stemming from somatically ac- ciated genes, represents the most cost-effective quired mutations or inherited gene mutations. Genomic se- method to-date for deciphering the association be- quence can be altered on difference scales, which include tween genetic alterations and diseases. Large-scale single nucleotide variations (SNVs), small insertions and deletions (INDELs), rearrangements of genome segments whole exome/genome sequencing projects have and changes in the copy number of DNA fragments. Since been launched by various institutions, such as NCI, specific mutations of cancer-associated genes are generally Broad Institute and TCGA, to provide a comprehen- considered as DNA biomarkers for diagnosis and as molec- sive catalogue of coding variants in diverse tissue ular markers for therapeutic drug selection in clinical set- samples and cell lines. Further functional and clini- tings, several large-scale sequencing studies have been per- cal interrogation of these sequence variations must formed with the aim to further understand cancer genomics. rely on extensive cross-platforms integration of se- To this end, the NCI Cancer Genome Atlas (TCGA) project quencing information and a proteome database that has sequenced the genomes of over 10 000 tumour samples explicitly and comprehensively archives the corre- in 33 cancer types (https://tcga-data.nci.nih.gov/tcga/)(1). sponding mutated peptide sequences. While such Sequencing data on the NCI-60 cell lines (2) as well as 947 data resource is a critical for the mass spectrometry- cancer cell lines (3) also provide an extensive catalogue of cancer relevant variants as well as pharmacogenomics cor- based proteomic analysis of exomic variants, no relations between specific variants and anticancer agents. database is currently available for the collection of As mutations that alter the protein sequences have the most mutant protein sequences that correspond to recent significant impact on protein stability and functionality, re- large-scale genomic data. To address this issue and searchers have started to design multilayer experiments en- serve as bridge to integrate genomic and proteomics compassing both genomic and proteomic methods in order datasets, CMPD (http://cgbc.cgu.edu.tw/cmpd)col- to better understand the roles of coding variants in tumori- lected over 2 millions genetic alterations, which not genesis and progression, as well as their clinical implications only facilitates the conﬁrmation and examination of (4). potential cancer biomarkers but also provides an in- Sequence database search is the most widely used method valuable resource for translational medicine research for protein identification in the field of mass spectrometry- and opportunities to identify mutated proteins en- based proteomics (5,6). However, search results are directly affected by the specificity and completeness of sequence coded by mutated genes. database––the mass-spectrometry search engine will likely miss the mutated peptides, which are not included in the reference databases. Recently, a mutated peptide database, named XMAn (7), was developed to translate disease- and *To whom correspondence should be addressed. Tel: +886 3 2118800 (Ext 5136); Fax: +886 3 2118122; Email: [email protected] Correspondence may also be addressed to Hsuan Liu. Email: [email protected] †The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors. C The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] D850 Nucleic Acids Research, 2015, Vol. 43, Database issue cancer-related mutations from gene-level into mutated pep- CMPD to identifying mutated peptides was demonstrated tide sequences for proteomics search, based on known by whole-exome sequencing and proteome resources of the mutations gathered from widely accepted sources such as COLO 205 cancer cell line. UniProt, IARC P53, OMIM and COSMIC (8–11). How- ever, no existing mutated peptide database has been developed for global exploration of gene alterations discov- ered in large-scale sequencing studies such as those men- MATERIALS AND METHODS tioned above, and efforts on mapping these mutations to Construction of tryptic peptide database protein sequences for further proteomics analysis are still very limited. Only two existing databases––Human Pro- A comprehensive list of gene mutations was extracted tein Mutant Database (HPMD) (12) and Human Cancer from large-scale cancer genomic sequencing projects Proteome Variation Database (CanProVar) (13)––were de- including NCI-60 (2), CCLE (3), and TCGA (1,15–17). signed for collecting mutated protein specific to colon can- The Variant Call Format (VCF) as well as the Mu- cer cell lines and single amino acid alternations from human tation Annotation Format (MAF) files corresponded cancer proteome, respectively. However, as the data sources to these sequencing projects can be downloaded from were limited to specific cell lines or cancer types, it isun- CellMiner (http://discover.nci.nih.gov/cellminer/), CCLE likely that these resources archive the full repertoire of pro- website (http://www.broadinstitute.org/ccle/home), and the tein sequence variations derived from currently known ge- MAF Dashboard (https://confluence.broadinstitute.org/ netic alterations. Moreover, these databases only annotate display/GDAC/MAF+Dashboard) of the Broad Institute’s variant information in the header line without providing Genome Data Analysis Center (GDAC), respectively. the mutated protein FASTA sequences. Therefore, existing Gene information and FASTA-formatted sequence files peptide mass fingerprint search engines cannot directly as- at both protein and mRNA levels can be downloaded sess these information portals, limiting their applicability from ENSEMBL (22) and UCSC through BioMart (23) in identifying mutant proteins derived from genetic varia- and UCSC Table Browser (24), respectively. Additional tion databases (8,14) and cancer genome/exome sequencing information on genes and proteins, such as HGNC gene studies (1,15,16). symbol (25), identifier in external databases such as Ref- Cancer Mutant Proteome Database (CMPD) is designed Seq (26) and UniProtKB (11), GO annotations (27)and to address this issue, aiming at improving the link between disease descriptions, and KEGG pathway information genomic and proteomics mutations. The mutated protein (28), is available through the cross-reference functionality sequence collection was based on the exome or genome provided by BioMart (23). Moreover, ANNOVAR (29) sequencing datasets from NCI-60 cell lines, 947 cancer was used to evaluate the overall consequences of SNVs cell lines from Cancer Cell Line Encyclopaedia project, and INDELs on corresponding transcripts. In addition, and 5500 more cases from 20 TCGA cancer cohort stud- observed allele frequencies in the 1000 Genomes Project ies (1,15–17). The identified genetic alterations (SNVs and (30) and NHLBI Exome Sequencing Project (31), single INDELs) were converted to all possible mutated protein- nucleotide polymorphisms reported in dbSNP (14), so- coding sequences according to sample-specific transcript matic mutations categorized in COSMIC (8), medically isoforms. A wide variety of databases, such as dbSNP (14), important variants collected in ClinVar (19), and func- dbNFSP (18), ClinVar (19), COSMIC (8) and OMIM (10), tional prediction results of all non-synonymous SNVs has been integrated to our annotation database to facilitate from popular algorithms (18,32–35) were compiled into compilation and exploration of associations between dis- an integrated SQLite database to facilitate variant-based eases and mutations and to eliminate any inconsistency in prioritization. Non-synonymous coding variants and small data format between annotation sources. Functional pre- INDELs identified by NCI-60, CCLE, and TCGA studies diction results are pre-compiled from various prediction were introduced into protein sequences to create sequence algorithms for evaluating the potential functional

Cancer Mutant Proteome Database

Environmental Influences on Endothelial Gene Expression

A Gene Expression Resource Generated by Genome-Wide Lacz

Supplementary Data

The Tumor Suppressor Notch Inhibits Head and Neck Squamous Cell

Mouse Ctnnal1 Knockout Project (CRISPR/Cas9)

Genome Wide Methylome Alterations in Lung Cancer

Transdifferentiation of Human Mesenchymal Stem Cells

393LN V 393P 344SQ V 393P Probe Set Entrez Gene

Genome-Scale Analysis of DNA Methylation in Lung Adenocarcinoma and Integration with Mrna Expression

Leveraging Models of Cell Regulation and GWAS Data in Integrative Network-Based Association Studies

A Network Inference Approach to Understanding Musculoskeletal

Coexpression Network Based on Natural Variation in Human Gene Expression Reveals Gene Interactions and Functions