
Sequence Databases …and your Mass Spectrometry-based Proteomics Experiment

© 2015 Regents of the University of Minnesota. All rights reserved. Center for Mass Spectrometry and Proteomics | Phone | (612)625-2280 | (612)625-2279

Outline
• Protein Database (DB)
• Origin
• Sources
• Size
• Composition
• Selecting a database for mass spec search
• Effect of DB on mass spec search results
• Post MS analysis: protein annotation, ontology, alignment

Terminology
• FASTA
• Database repository
• Format
• NCBI database
• UniProtKB
• Swiss Prot
• Ref Seq (reference sequence)
• Contaminants DB
• Ontology

FASTA Protein Sequence
• Name and Origin
• FASTA (pronounced ‘fast-aye’)
• ORIGIN: created for a sequence similarity alignment tool (1985)
• REF: DJ Lipman, WR Pearson (1985) PMID: 2983426
"The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC)."
• Stands for “fast all”: the file format worked with ‘all’ alphabets (protein and nucleotide)

FASTA Protein Sequence Format

• Structure: TEXT file
• Line 1: description line with sequence identifier
• Line 2: protein sequence in single-letter amino acid code, typically wrapped at 80 characters per line
• Allowed characters:
  • AMINO ACID ONE-LETTER CODE
  • X
  • *
  • -
  • Custom one-letter amino acid codes
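The two-part structure above can be sketched as a small reader. This is a minimal illustration, not the parser used by any particular search program. It assumes the standard convention that the description line begins with ">", and it treats only the 20 standard one-letter codes plus X, * and - as allowed; any custom codes would have to be added to ALLOWED. The function names and the example entries are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Allowed characters in this sketch: the 20 standard amino acid one-letter
# codes plus X, * and -. Custom codes are NOT included (assumption).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "X*-")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # Line 1: description line
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before a description line")
            chunks.append(line.upper())  # Line 2 may be wrapped over many lines
    if header is not None:
        yield header, "".join(chunks)


def invalid_characters(sequence: str) -> List[str]:
    """Return the characters in a sequence that are not in ALLOWED."""
    return sorted(set(sequence) - ALLOWED)


# Hypothetical two-entry file; the second entry contains a stray digit.
example = """>demo_protein_1 hypothetical example entry
SELEEQLTPVAEETR
EQVAEVR
>demo_protein_2 entry with an unexpected character
MKT1LV
""".splitlines()

records = list(read_fasta(example))
for description, sequence in records:
    print(description, len(sequence), invalid_characters(sequence))
```

Running a check like this before importing a custom database is one way to catch formatting problems early.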

FASTA Format Header Line Sequence Identifiers

Line 1: description line with sequence identifier

FASTA Protein Sequence from NCBI: example

[Figure: NCBI FASTA record with Line 1 (description line) and Line 2 (sequence) labeled]

NOTE: In Sept 2016, gi numbers were replaced with accession.version identifiers
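Because identifier styles differ between repositories and changed over time, a search program has to pull the accession out of each description line. The sketch below shows one way to do that for three common layouts: the legacy NCBI "gi|number|db|accession.version|" style, the current NCBI style that starts with accession.version, and the UniProtKB "sp|accession|name" or "tr|accession|name" style. The function name and all example identifiers are made up for illustration.

```python
def extract_accession(header: str) -> str:
    """Return the accession from a FASTA description line (sketch only)."""
    text = header.lstrip(">").strip()
    if not text:
        raise ValueError("empty description line")
    first_token = text.split()[0]  # the identifier ends at the first whitespace
    fields = first_token.split("|")
    # UniProtKB style: sp|ACCESSION|ENTRY_NAME or tr|ACCESSION|ENTRY_NAME
    if fields[0] in ("sp", "tr") and len(fields) >= 2:
        return fields[1]
    # Legacy NCBI style: gi|NUMBER|DB|ACCESSION.VERSION|
    if fields[0] == "gi" and len(fields) >= 4:
        return fields[3]
    # Current NCBI style (or anything else): first token is the identifier
    return first_token


# Hypothetical headers, not real database entries.
legacy_ncbi = ">gi|12345|ref|NP_000000.1| hypothetical protein [Homo sapiens]"
current_ncbi = ">NP_000000.1 hypothetical protein [Homo sapiens]"
uniprot = ">sp|P00000|DEMO_HUMAN Hypothetical protein OS=Homo sapiens"

for h in (legacy_ncbi, current_ncbi, uniprot):
    print(extract_accession(h))
```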

Selecting a Protein Database
• Public repositories, such as
  • NCBI
  • UniProtKB
    • Swiss Prot: manually annotated and reviewed
    • TrEMBL: automatically annotated and not reviewed
• Custom (from customer)
  • NOTE: format is important!
• Should represent the species (1 or more) from which the protein sample originated
  • Example: Mouse protein expressed in E. coli
• Ideal size range: ~2000 to < 1 million entries

Selecting a Protein Database: UniProtKB repository

Selecting a Protein Database: NCBI Ref Seq repository

Choose Your Taxonomy or Taxonomies

NOTES:
• If a recombinant protein is expressed in a host cell, include host & expressed protein(s)
• If the protein database for your species has <2000 proteins, merge it with another protein database (e.g., yeast) for statistical reasons
• Protein sequence headers must be parsed correctly
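The first two notes can be sketched as a simple count-and-merge step. This is an illustration under stated assumptions: the 2000-entry threshold comes from the slide, entries are identified by the first whitespace-delimited token of the description line, and duplicate identifiers are dropped on merge. The function names and the tiny example databases are hypothetical.

```python
MIN_ENTRIES = 2000  # lower bound taken from the notes above


def count_entries(fasta_text: str) -> int:
    """Count the description lines (one per protein entry)."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def needs_padding(fasta_text: str) -> bool:
    """True if the database is smaller than the suggested minimum."""
    return count_entries(fasta_text) < MIN_ENTRIES


def merge_fasta(*fasta_texts: str) -> str:
    """Concatenate FASTA texts, keeping the first copy of each identifier."""
    seen = set()
    kept_lines = []
    for text in fasta_texts:
        keep = False
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                identifier = parts[0] if parts else ""
                keep = identifier not in seen  # drop repeated entries
                seen.add(identifier)
            if keep:
                kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"


# Hypothetical example: one expressed mouse protein plus two host proteins.
target = ">mouse_demo_1 hypothetical expressed mouse protein\nSELEEQLTPVAEETR\n"
host = (">ecoli_demo_1 hypothetical host protein\nEQVAEVR\n"
        ">ecoli_demo_2 hypothetical host protein\nMKTLV\n")

search_db = merge_fasta(target, host, host)  # host passed twice on purpose
print(count_entries(search_db), needs_padding(search_db))
```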

Taxonomy specification - UniProtKB

(19996)

Taxonomy specification - NCBI

Protein Database repository content for Thirteen-lined Ground Squirrel

Database Source               Number of Proteins
Swiss-Prot* (reviewed)        20
TrEMBL* (unreviewed)          20,076
UniProt Reference Proteome    19,966
NCBI (‘non-redundant’)        30,130
NCBI Reference Sequence       29,842

* From UniProt

Protein Database Characteristics …related to your mass spectrometry experiment

Splice Form Variants

Sequence alignments: Protein Cytochrome P450 2D6

Protein Sequence Variants

Natural variants

SNPs (single nucleotide polymorphisms)

https://hive.biochemistry.gwu.edu

In silico trypsin digest, ‘native’ protein

In silico trypsin digest, with VARIANTS

[Figure: digest with variant peptides 1 and 2 marked]
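An in silico trypsin digest can be sketched in a few lines. The version below is a simplified illustration: it cleaves after every K or R and ignores the common exception for a following proline, which real search programs usually offer as an option. The function name and the missed_cleavages parameter are choices made for this example. The two sequences are the native and Q -> K variant forms of peptide example 1 from this presentation.

```python
from typing import List


def tryptic_digest(sequence: str, missed_cleavages: int = 0) -> List[str]:
    """Cleave C-terminal to K or R (simplified: no proline rule)."""
    pieces = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal piece without K/R
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


native = "SELEEQLTPVAEETR"
variant = "SELEEKLTPVAEETR"  # Q -> K introduces a new cleavage site

print(tryptic_digest(native))
print(tryptic_digest(variant))
print(tryptic_digest(variant, missed_cleavages=1))
```

The Q -> K substitution adds a lysine, so the variant yields two shorter peptides in addition to the full-length peptide seen when one missed cleavage is allowed.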

Effect of Variant on Peptide Mass

Peptide example         Peptide Mass *   Peptide Sequence
1 - native              1730.8443        SELEEQLTPVAEETR
1 - variant (Q -> K)    1730.8806        SELEEKLTPVAEETR
1 - variant (Q -> K)    734.3566         SELEEK
1 - variant (Q -> K)    1015.5418        LTPVAEETR

2 - native              830.4366         EQVAEVR
2 - variant (V -> E)    860.4108         EQEAEVR

* Monoisotopic [M + H]+1
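The masses in the table can be reproduced from standard monoisotopic residue masses. The sketch below sums the residue masses, adds one water for the peptide termini and one proton for the singly charged ion. The residue mass values are standard published figures rounded to five decimals; the function name is an illustrative choice.

```python
# Monoisotopic residue masses in Da (standard values, five decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free N- and C-termini
PROTON = 1.007276   # charge carrier for the [M + H]+ ion


def mh_plus(peptide: str) -> float:
    """Monoisotopic [M + H]+ mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


for seq in ("SELEEQLTPVAEETR", "SELEEKLTPVAEETR", "SELEEK",
            "LTPVAEETR", "EQVAEVR", "EQEAEVR"):
    print(f"{seq:16s} {mh_plus(seq):.4f}")

# Mass shift caused by the Q -> K substitution alone
print(f"Q -> K shift: {RESIDUE_MASS['K'] - RESIDUE_MASS['Q']:.4f} Da")
```

The Q -> K shift is only about 0.036 Da, so the intact variant peptide is hard to distinguish from the native one by precursor mass alone, while the V -> E change shifts the mass by roughly 30 Da.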

Proteomics Search Program Meets Protein Sequence Database
• Protein sequence file is downloaded to local computer
• Merge with common lab contaminants (keratins and more) database
  • http://www.thegpm.org/crap/
• Protein database is imported or indexed in the proteomics search program (sequence format is critical)
• REVERSED sequences are generated for False Discovery Rate (FDR) calculations
• Protein sequences are digested in silico
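The reversed-sequence step can be sketched as follows. This is an illustration only: each search program builds its own decoys, the "REV_" prefix is a hypothetical naming choice, and the decoy-to-target ratio is just one common way of estimating FDR, not necessarily the formula a given program uses. The example records are invented.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (description, sequence)


def with_reversed_decoys(records: List[Record],
                         decoy_prefix: str = "REV_") -> List[Record]:
    """Return the target entries followed by one reversed decoy per entry."""
    decoys = [(decoy_prefix + header, sequence[::-1])
              for header, sequence in records]
    return list(records) + decoys


def estimated_fdr(target_hits: int, decoy_hits: int) -> float:
    """Simple decoy/target ratio (assumed estimator for this sketch)."""
    if target_hits == 0:
        return 0.0
    return decoy_hits / target_hits


# Hypothetical sample database already merged with a contaminant entry.
sample = [
    ("demo_protein_1", "SELEEQLTPVAEETR"),
    ("demo_keratin_contaminant", "EQVAEVR"),
]

database = with_reversed_decoys(sample)
for header, sequence in database:
    print(header, sequence)

print(estimated_fdr(target_hits=200, decoy_hits=2))
```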

Database search > Protein List

• Database search algorithm matches spectrum > peptide > protein
• RESULTS: List of protein identifications with accession numbers
• POST Database search options (outside CMSP):
  1. Protein annotation
  2. Sequence alignment
  3. Obtain related Ontology information

POST Database search options: what you can do with your protein list.

1) Protein Annotation from UniProtKB

2) Sequence alignment with UniProt alignment tool

2) Sequence alignment with UniProt alignment tool: numerous amino acid labeling options

* (asterisk) indicates positions which have a single, fully conserved residue.
: (colon) indicates conservation between groups of strongly similar properties, scoring > 0.5 in the Gonnet PAM 250 matrix.
. (period) indicates conservation between groups of weakly similar properties, scoring <= 0.5 in the Gonnet PAM 250 matrix.

2) Sequence alignment with NCBI BLAST

3) Link information to Proteins
• Define: “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases.”
• Ontologies/Vocabularies
  • molecular function: molecular activities of gene products
  • cellular component: where gene products are active
  • biological process: pathways and larger processes made up of the activities of multiple gene products

(http://geneontology.org/page/documentation)

Molecular Function Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)

Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171

Protein Class Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)

Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171

Biological Process Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)

Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171

Database Tools for Proteins

• http://geneontology.org/
• http://string-db.org/
• http://www.pantherdb.org/
• http://www.ingenuity.com/products/ipa (licensed at UM via MSI)

ALSO: Match mass spec data to your RNA Seq data with: • https://galaxyp.msi.umn.edu/
