PROTEOGENOMICS & METAPROTEOMICS USING THE GALAXY PLATFORM
PRATIK JAGTAP
Center for Mass Spectrometry and Proteomics
University of Minnesota, St. Paul / Minneapolis, Minnesota, USA
http://cbs.umn.edu/cmsp/home

MINNEAPOLIS TO HYDERABAD
MINNEAPOLIS, MINNESOTA
• PROTEOMICS
• PROTEOGENOMICS
• CHALLENGES
• BIOLOGICAL INSIGHTS
• METAPROTEOMICS
• GALAXYP
• LINKS AND ACKNOWLEDGMENTS

GENOMIC AND PROTEOMIC DATABASES
Finished and Published Genomes
• 3551 Bacterial Genomes
• 211 Archaeal Genomes
• 58 Eukaryal Genomes
• 3363 Viral Genomes
http://www.genomesonline.org/index
DEFINING PROTEOMICS : LOOKING WITHIN
Mass spectrum + Reference Protein Database (from genomic annotation) → Peptide Spectral Match

PROTEOMICS WORKFLOW
Search Database
Peaklist Generation
Eng et al. (2011) Mol Cell Proteomics. 10(11): R111.009522.
Database Search
Peptide Spectral Match : Statistical Validation
Protein Inference

PROTEOGENOMICS
Nat Methods. 11(11): 1114–1125.

DEFINING PROTEOGENOMICS: LOOKING WITHIN AND WITHOUT
Mass spectrum
Reference Protein Database from genomic annotation
RNASeq data
Genome: six-frame translation
cDNA: three-frame translation

PROTEOGENOMICS : BIOINFORMATIC CHALLENGES
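Six-frame translation of a genome is one reason proteogenomic search databases become so large: every nucleotide sequence yields six candidate protein sequences. The following is a minimal Python sketch of the idea, not the tool used in Galaxy-P. It assumes the standard genetic code and an unambiguous A/C/G/T input; the function names are illustrative only.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

A three-frame translation of cDNA is the same procedure restricted to the forward strand. In practice each frame would also be split at stop codons (`*`) before being written to a FASTA search database.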
• Large database sizes (6-frame and 3-frame translation and metagenomic databases).
• False Discovery Rate (FDR) estimation strategies (for novel peptides).
• False-positive sources and their elimination.
• Validation of the peptide identification (search using BLAST-P).
• PSM evaluation / targeted proteomics of identified peptides.
• Genomic localization.
• Validating biological interpretation.
• Disparate tools and numerous processing steps.

PROTEOGENOMICS: STEPS INVOLVED
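The slides do not say which FDR estimation strategy was used. One common choice is the target-decoy approach, sketched below under that assumption: PSMs are ranked by score, and the FDR at each rank is estimated as the number of decoy hits divided by the number of target hits. The function name, the tuple layout and the scores are hypothetical.

```python
def filter_psms_at_fdr(psms, fdr_threshold=0.01):
    """Keep target PSMs above the lowest score cut-off that meets the FDR.

    psms: iterable of (score, is_decoy) tuples, where a higher score is better.
    Returns the accepted target PSMs, best score first.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff_index = 0
    for index, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR at this rank: decoy hits / target hits.
        if targets and decoys / targets <= fdr_threshold:
            cutoff_index = index
    return [psm for psm in ranked[:cutoff_index] if not psm[1]]


if __name__ == "__main__":
    example = [(90, False), (85, False), (80, False),
               (70, True), (60, False), (50, True)]
    print(filter_psms_at_fdr(example, fdr_threshold=0.25))
```

For novel peptides the same calculation is often run separately on the subset of PSMs that map only to the translated databases, because that subset tends to carry a higher error rate than the search as a whole.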
~ 2 million proteins
~ 10,000 proteins
~ 5,000 proteins
~ 1,000 peptides
~ 100 peptides
~ 50 peptides

3D FRACTIONATED SALIVARY SUPERNATANT
6 individuals
Salivary supernatant
Digested O/N with trypsin
Hexapeptide library enrichment (ProteoMiner™) followed by SCX / IEF and LC-MS (200 Fractions)
Thermofinnigan Orbitrap (Orbi MS, MS/MS LTQ)
The dataset was searched against a FASTA database with human proteins, contaminant proteins, a 3-frame translated cDNA database from EnSEMBL and the Human Oral Microbiome database (HOMD).
200 RAW Files
INPUTS : PEAKLISTS and SEARCH db
Bandhakavi et al (2009) J. Prot. Res

PROTEOGENOMICS WORKFLOW
Galaxy-P provides an integrated platform for every step of proteogenomic analysis.
• Build target database – download and translate EST databases.
• Numerous tools for identification and text manipulation.
• Workflow utilizing BLAST to identify novel peptides.
• Tool to assess peptide-spectrum matches and visualize spectra.
• Visualize identified peptides on the genome.
• 140 steps: Seamless, integrated proteogenomic workflow.
Flexible and accessible workflows for improved proteogenomic analysis using Galaxy framework. J. Proteome Res. (2014) DOI: 10.1021/pr500812t Link: z.umn.edu/pgfirstlook
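The workflow uses BLAST to decide which identified peptides are novel. As a simplified stand-in for that step, the sketch below flags peptides that have no exact match in a reference proteome. It is an illustration only: exact substring matching replaces BLAST-P, isoleucine and leucine are treated as equivalent because they have the same mass, and the function name and sequences are hypothetical.

```python
def find_novel_peptides(peptides, reference_proteins):
    """Return peptides with no exact match in any reference protein.

    peptides: iterable of peptide sequences (strings).
    reference_proteins: dict mapping protein accession to sequence.
    I and L are collapsed to L before comparison.
    """
    normalized_refs = [
        sequence.upper().replace("I", "L")
        for sequence in reference_proteins.values()
    ]
    novel = []
    for peptide in peptides:
        key = peptide.upper().replace("I", "L")
        if not any(key in ref for ref in normalized_refs):
            novel.append(peptide)
    return novel


if __name__ == "__main__":
    reference = {"P1": "MKTAYIAKQR", "P2": "GGSLVPEK"}
    identified = ["TAYIAK", "SIVPEK", "AAGTWQK"]
    print(find_novel_peptides(identified, reference))
```

Peptides that survive a filter of this kind would still need PSM evaluation and genomic localization before being reported as novel.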
NOVEL PROTEOFORMS : PRB1 and PRB2 region

PROTEOGENOMICS WORKFLOW
Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J Proteome Res. (2014) 13(12):5898-908. doi: 10.1021/pr500812t

HIBERNATION PROTEOGENOMICS
Tracing of core body temperature (Tb, black line) from a single animal measured by a surgically implanted transmitter, along with the controlled ambient temperature (blue line) over the course of the hibernation season.
* TOR (Torpor), J-IBA (January IBA), M-IBA (March IBA)

HIBERNATION PROTEOGENOMICS
• The datasets were run in triplicate and were searched against a proteomic database derived from RNASeq data.
• Differentially expressed genes from RNASeq data and differentially expressed proteins from iTRAQ data were compared.
  - OCT shows the highest correlation between protein and transcript expression.
• Functional analysis of differentially expressed proteins revealed that:
  - Protein expression in APRIL relative to AUGUST shows increased mitochondrial function and regulation of muscle contraction.
  - Protein expression in OCTOBER relative to AUGUST reveals increased hypoxia tolerance and hypertrophy at a time of decreased metabolism.
  - Protein expression in hibernation relative to AUGUST highlights fatty acid and ketone metabolism and altered calcium handling and contractile function in the heart.
• 162 novel peptide sequences were identified in all three replicates.
METAPROTEOMICS / COMMUNITY PROTEOMICS / MICROBIOMES
“The large-scale characterization of the entire protein complement of environmental microbiota at a given point in time”
Wilmes and Bond (2004) Environ. Microbiol. 6, 911–920.
“Through the application of metaproteomics to different microbial consortia over the past decade, we have learnt much about key functional traits in the various environmental settings where they occur.”
Wilmes P, Heintz-Buschart A, Bond PL. (2015) Proteomics. doi: 10.1002/pmic.201500183.

DEFINING METAPROTEOMICS : LOOKING WITHIN AND WITHOUT
Mass spectrum
Reference Protein Database from genomic annotation
RNASeq data
Genome: six-frame translation
cDNA: three-frame translation
Metagenomic sequences

DEFINING METAPROTEOMICS
Mass spectrum
Metagenomic sequences

MATCHED PAIR OF CONTROL VERSUS LESION
Oral premalignant lesion (OPML) versus control
Digested O/N with trypsin
SCX Fractionation and LC-MS (7 Fractions)
Thermofinnigan Orbitrap (Orbi MS, MS/MS LTQ)
The dataset was searched against a FASTA database with human proteins, contaminant proteins, a 3-frame translated cDNA database from EnSEMBL and the Human Oral Microbiome database (HOMD).
7 RAW Files each
INPUTS : PEAKLISTS and SEARCH db
Kooren et al (2010) Clin. Prot

TAXONOMIC AND FUNCTIONAL ANALYSIS

DEFINING METAPROTEOMICS: STEPS INVOLVED
Metaproteomic analysis using the Galaxy framework. Proteomics. (2015) doi: 10.1002/pmic.201500074.

METAPROTEOMICS : BIOLOGICAL INSIGHTS
METAPROTEOMICS OF CHILDHOOD CARIES
Prof. Joel Rudney
[Figure panels: Sucrose / No Sucrose]
• In vitro investigation of sucrose-induced changes in the metaproteomes of children with caries.
• Major shifts in taxonomy and function in paired microcosm oral biofilms grown without and with sucrose, respectively. Twelve replicates have been analyzed.
• SEED analysis of oral microcosm biofilms showed characteristic no-sucrose (NS) and with-sucrose (WS) patterns of protein expression that were highly conserved across taxonomically diverse communities. Moreover, many of the proteins that differed between the two pH phenotypes had functions that would act to promote maintenance of neutral pH under NS conditions, or acid production and tolerance under WS conditions.
• Targeted proteomic approaches can then be used to determine whether those proteins are also expressed when plaque is exposed to sucrose in the mouth. In that case, it may be possible to define a set of dysbiosis biomarkers that could be used to detect at-risk tooth surfaces before the development of overt carious lesions.

GALAXY PLATFORM
Goecks J et al. Genome Biol. 2010;11(8):R86.

Benefits of Galaxy
• A web-based bioinformatics data analysis platform.
• Software accessibility and usability.
• Share-ability of tools, workflows and histories.
• Reproducibility and ability to test and compare results after using multiple parameters.
• Software tools can be used in a sequential manner to generate analytical workflows that can be reused, shared and creatively modified for multiple studies.
GALAXY-P : IMPLEMENTATION OF PROTEOMICS TOOLS WITHIN GALAXY ENVIRONMENT.
OUTPUT
INPUT
TOOLS / CENTRAL PANE / HISTORY

TOOLS & WORKFLOWS
• Software tools can be used in a sequential manner to generate analytical workflows that can be reused, shared and creatively modified for multiple studies.
For example, Protein Database Downloader downloads UniProt protein FASTA databases of various organisms.

GALAXY-P : IMPLEMENTATION OF PROTEOMICS TOOLS WITHIN GALAXY ENVIRONMENT.

PROTEOGENOMICS: STEPS INVOLVED
~ 2 million proteins
~ 10,000 proteins
~ 5,000 proteins
~ 1,000 peptides
~ 100 peptides
~ 50 peptides

RNASeq DERIVED PROTEOMIC DATABASES
“Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations.” Sheynkman G et al. BMC Genomics. doi: 10.1186/1471-2164-15-703.
Gloria Sheynkman, James Johnson

PROTEOGENOMICS WORKFLOW
Galaxy-P provides an integrated platform for every step of proteogenomic analysis.
• Build target database – download and translate EST databases or perform gene prediction with Augustus.
• Numerous tools for identification and text manipulation.
• Workflow utilizing BLAST to identify novel peptides.
• Tool to assess peptide-spectrum matches and visualize spectra.
• Visualize identified peptides on the genome.
• 140 steps: Seamless, integrated proteogenomic workflow.
Flexible and accessible workflows for improved proteogenomic analysis using Galaxy framework. J. Proteome Res. (2014) DOI: 10.1021/pr500812t Link: z.umn.edu/pgfirstlook

PSM EVALUATION
GENOME VISUALIZATION USING IGV BROWSER

GALAXYP : ONGOING PROJECTS
REPERTOIRE OF WORKFLOWS
• Sharing of analytical workflows that can be reused, shared and creatively modified for multiple studies.
• Multiple workflows for metaproteomics, quantitative proteomics, proteogenomics and RNASeq are being developed, shared and used.
COMMUNITY BASED SOFTWARE DEVELOPMENT
• A community-based software development model should prove effective for future implementation, testing and continued improvement of command-line driven software tools.
• We plan to offer the many functionalities of MS-GF+ and PeptideShaker in Galaxy, along with opportunities for integration with other software tools via use of workflows.

COMMUNITY-BASED SOFTWARE DEVELOPMENT
[Diagram: community-based development cycle]
• Software developers (SearchGUI / PeptideShaker)
• Galaxy wrapper to the software tool
• Software tool deposited in Galaxy Toolshed
• Software tool installed in GalaxyP
• Users test the tools and provide feedback to developers (USER FORUM / GITHUB)
• Improvements
Proteomics tools accessible via the Tool Shed
https://toolshed.g2.bx.psu.edu/
https://github.com/galaxyproteomics/tools-galaxyp
GALAXY-P : IMPLEMENTATION OF PROTEOMICS TOOLS WITHIN GALAXY ENVIRONMENT.
NSF award 1458524 “A unified Galaxy-based platform for multi-omic data analysis and informatics”.
PI: Tim Griffin
Award dates: 9/1/15-8/31/18
• Enhance the Galaxy environment with new interactive visualization tools and data exchange functionalities necessary for effective multi-omic data analysis.
• Extend the Galaxy environment to analyze and process diverse metabolomics data and support workflows for metabolic activity profiling.
• Extend the Galaxy environment for integrative genomic-proteomic data analysis supporting proteogenomic and metaproteomic applications.
• Promote usage of Galaxy for multi-omics by the research community and provide undergraduate training opportunities in computational systems biology via a local area institutional network.
COMMUNITY BASED SOFTWARE DEVELOPMENT
Department of Medicine: Brian Sandri, Kevin Viken, Maneesh Bhargava, Chris Wendt
Department of Biology (Duluth): Matt Andrews, Katie Vermillion, Kyle Anderson
School of Dentistry: Joel Rudney
Laboratory Medicine and Pathology: Somi Afiuni, Amy Skubitz
Biochemistry, Molecular Biology and Biophysics: Laurie Parker, Tzu-Yi Yang
Biochemistry, Molecular Biology & Biophysics: Tim Griffin, Candace Guerrero, Kevin Murray
Center for Mass Spectrometry and Proteomics: Ebbing de Jong, LeeAnn Higgins, Todd Markowski
James Johnson, Tom McGowan, Trevor Wennblom, Getiria Onsongo, Bill Gallip, Ben Lynch
Ira Cooke (La Trobe University, Melbourne, Australia)
Bjoern Gruening (University of Freiburg, Freiburg, Germany)
Lennart Martens (VIB Department of Medical Protein Research, UGent, Belgium)
Harald Barsnes and Marc Vaudel (University of Bergen, Bergen, Norway)
John Chilton (Galaxy Team, Penn State University)
Gloria Sheynkman, Sarah Parker, Lloyd Smith, Michael Shortreed, Sean Seymour, Sricharan Bandhakavi
Funding: NSF, NIH
QUESTIONS?
Mail [email protected]
Visit http://usegalaxyp.org
Follow twitter.com/usegalaxyp