PROTEOGENOMICS & METAPROTEOMICS USING THE GALAXY PLATFORM

PRATIK JAGTAP

Center for Mass Spectrometry and Proteomics

University of Minnesota, St. Paul / Minneapolis, Minnesota, USA

MINNEAPOLIS TO HYDERABAD

http://cbs.umn.edu/cmsp/home

• PROTEOMICS

• PROTEOGENOMICS

• CHALLENGES

• BIOLOGICAL INSIGHTS

• METAPROTEOMICS

• GALAXYP

• LINKS AND ACKNOWLEDGMENTS

GENOMIC AND PROTEOMIC DATABASES

Finished and Published Genomes

• 3551 Bacterial Genomes.

• 211 Archaeal Genomes.

• 58 Eukaryal Genomes.

• 3363 Viral Genomes.

http://www.genomesonline.org/index

DEFINING PROTEOMICS : LOOKING WITHIN

Mass spectrum

Reference Database from genomic annotation

Peptide Spectral Match

PROTEOMICS WORKFLOW

Search Database

Peaklist Generation

Eng et al 2011 Mol Cell Proteomics. 10(11): R111.009522.

Database Search

Peptide Spectral Match : Statistical Validation

Protein Inference

PROTEOGENOMICS

Nat Methods. 11(11): 1114–1125.

DEFINING PROTEOGENOMICS: LOOKING WITHIN AND WITHOUT

Mass spectrum Reference Protein Database from genomic annotation

RNASeq data

Genome: six-frame translation

cDNA: three-frame translation

PROTEOGENOMICS : BIOINFORMATIC CHALLENGES

• Large database sizes (6-frame and 3-frame translation and metagenomic databases).

• False Discovery Rate (FDR) estimation strategies (for novel peptides).

• False-positive sources and their elimination.

• Validation of the peptide identification (search using BLAST-P).

• PSM evaluation / targeted proteomics of identified peptides.

• Genomic localization.

• Validating biological interpretation.

• Disparate tools and numerous processing steps.

PROTEOGENOMICS: STEPS INVOLVED

~ 2 million proteins

~ 10,000 proteins

~ 5,000 proteins

~ 1,000 peptides

~ 100 peptides

~ 50 peptides

3D FRACTIONATED SALIVARY SUPERNATANT

6 individuals

Salivary supernatant

Digested overnight (O/N) with trypsin

Hexapeptide library enrichment (ProteoMiner™) followed by SCX / IEF and LC-MS (200 fractions)

Thermofinnigan Orbitrap (Orbi MS, MS/MS LTQ)

The dataset was searched against a FASTA database with human proteins, contaminant proteins, a 3-frame translated cDNA database from Ensembl and the Human Oral Microbiome Database (HOMD).

200 RAW Files

INPUTS : PEAKLISTS and SEARCH db

Bandhakavi et al (2009) J. Prot. Res

PROTEOGENOMICS WORKFLOW

Galaxy-P provides an integrated platform for every step of proteogenomic analysis.

• Build target database – download and translate EST databases.

• Numerous tools for identification and text manipulation.

• Workflow utilizing BLAST to identify novel peptides.

• Tool to assess peptide-spectrum matches and visualize spectra.

• Visualize identified peptides on the genome.

• 140 steps: Seamless, integrated proteogenomic workflow.

Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J. Proteome Res. (2014) DOI: 10.1021/pr500812t Link: z.umn.edu/pgfirstlook
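The "build target database" step above relies on translating nucleotide sequences in several reading frames. The following is a minimal sketch of three-frame (cDNA/EST) and six-frame (genome) translation using the standard genetic code. It is an illustration only, not the Galaxy-P tool itself; the function names are hypothetical, and real pipelines typically also split the translation at stop codons and discard short open reading frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq, frame=0):
    """Translate one reading frame (0, 1 or 2); unknown codons become 'X'."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame(seq):
    """Forward-strand translations, as used for cDNA / EST sequences."""
    return [translate(seq, frame) for frame in range(3)]


def six_frame(seq):
    """Forward and reverse-strand translations, as used for genomic DNA."""
    return three_frame(seq) + three_frame(reverse_complement(seq))


if __name__ == "__main__":
    # Toy sequence for illustration only.
    for index, protein in enumerate(six_frame("ATGGCCAAATAA")):
        print(index, protein)
```

Because every input sequence yields three or six candidate protein sequences, databases built this way grow quickly, which is one source of the large database sizes listed among the bioinformatic challenges.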

NOVEL PROTEOFORMS : PRB1 and PRB2 region

PROTEOGENOMICS WORKFLOW

Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J Proteome Res. (2014) 13(12):5898-908. doi: 10.1021/pr500812t

HIBERNATION PROTEOGENOMICS

Tracing of core body temperature (Tb, black line) from a single animal measured by a surgically implanted transmitter, along with the controlled ambient temperature (blue line) over the course of the hibernation season.

* TOR (Torpor), J-IBA (January IBA), M-IBA (March IBA)

HIBERNATION PROTEOGENOMICS

• The datasets were run in triplicate and were searched against a proteomic database derived from RNASeq data.

• Differentially expressed transcripts from RNASeq data and differentially expressed proteins from iTRAQ data were compared.

- OCT shows the highest correlation between protein and transcript expression.

• Functional analysis of differentially expressed proteins revealed that:

- Protein expression in APRIL relative to AUGUST shows increased mitochondrial function and regulation of muscle contraction.

- Protein expression in OCTOBER relative to AUGUST reveals increased hypoxia tolerance and hypertrophy at a time of decreased metabolism.

- Protein expression in hibernation relative to AUGUST highlights fatty acid and ketone metabolism and altered calcium handling and contractile function in the heart.

• 162 novel peptide sequences were identified in all three replicates.
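Requiring a novel peptide to appear in every replicate is a simple reproducibility filter. A minimal sketch of that filter follows, assuming each replicate has already been reduced to a list of peptide sequences. The function name and the toy sequences are hypothetical and are not taken from the hibernation dataset.

```python
from collections import Counter


def peptides_in_at_least(replicates, min_count):
    """Return peptides observed in at least `min_count` replicates.

    `replicates` is a list of iterables of peptide sequences; duplicates
    within one replicate are counted once.
    """
    counts = Counter()
    for replicate in replicates:
        counts.update(set(replicate))
    return {peptide for peptide, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy, made-up peptide lists for three replicates.
    replicate_1 = ["AAAK", "LLDR", "GGSK"]
    replicate_2 = ["AAAK", "GGSK", "GGSK"]
    replicate_3 = ["GGSK", "AAAK", "TTPR"]
    runs = [replicate_1, replicate_2, replicate_3]

    # Peptides seen in all three replicates.
    print(sorted(peptides_in_at_least(runs, min_count=len(runs))))
```

Setting `min_count` below the number of replicates relaxes the filter, at the cost of keeping identifications that were not reproduced in every run.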

METAPROTEOMICS / COMMUNITY PROTEOMICS /

“The large-scale characterization of the entire protein complement of environmental microbiota at a given point in time”

Wilmes and Bond (2004) Environ. Microbiol. 6, 911–920.

“Through the application of metaproteomics to different microbial consortia over the past decade, we have learnt much about key functional traits in the various environmental settings where they occur.”

Wilmes P, Heintz-Buschart A, Bond PL. (2015) Proteomics. doi: 10.1002/pmic.201500183.

DEFINING METAPROTEOMICS : LOOKING WITHIN AND WITHOUT

Mass spectrum Reference Protein Database from genomic annotation

RNASeq data

Genome: six-frame translation

cDNA: three-frame translation

Metagenomic sequences

DEFINING METAPROTEOMICS

Mass spectrum

Metagenomic sequences

MATCHED PAIR OF CONTROL VERSUS LESION

Oral premalignant lesion (OPML) versus control

Digested overnight (O/N) with trypsin

SCX fractionation and LC-MS (7 fractions)

Thermofinnigan Orbitrap (Orbi MS, MS/MS LTQ)

The dataset was searched against a FASTA database with human proteins, contaminant proteins, a 3-frame translated cDNA database from Ensembl and the Human Oral Microbiome Database (HOMD).

7 RAW Files each

INPUTS : PEAKLISTS and SEARCH db

Kooren et al (2010) Clin. Prot

TAXONOMIC AND FUNCTIONAL ANALYSIS

DEFINING METAPROTEOMICS: STEPS INVOLVED

Metaproteomic analysis using the Galaxy framework. Proteomics. (2015) doi: 10.1002/pmic.201500074.

METAPROTEOMICS : BIOLOGICAL INSIGHTS

METAPROTEOMICS OF CHILDHOOD CARIES

Prof. Joel Rudney

Figure panels: Sucrose, No Sucrose

• In vitro investigation of sucrose-induced changes in the metaproteomes of children with caries.

• Major shifts in taxonomy and function in paired microcosm oral biofilms grown without and with sucrose, respectively. Twelve replicates have been analyzed.

• SEED analysis of oral microcosm biofilms showed characteristic no-sucrose (NS) and with-sucrose (WS) patterns of protein expression that were highly conserved across taxonomically diverse communities. Moreover, many of the proteins that differed between each pH phenotype had functions that would act to promote maintenance of neutral pH under NS conditions, or acid production and tolerance under WS conditions.

• Targeted proteomic approaches can then be used to determine whether those proteins are also expressed when plaque is exposed to sucrose in the mouth. In that case, it may be possible to define a set of dysbiosis biomarkers that could be used to detect at-risk tooth surfaces before the development of overt carious lesions.

GALAXY PLATFORM

Goecks J et al Genome Biol. 2010;11(8):R86.

Benefits of Galaxy

• A web-based data analysis platform.

• Software accessibility and usability.

• Share-ability of tools, workflows and histories.

• Reproducibility and ability to test and compare results after using multiple parameters.

• Software tools can be used in a sequential manner to generate analytical workflows that can be reused, shared and creatively modified for multiple studies.

GALAXY-P : IMPLEMENTATION OF PROTEOMICS TOOLS WITHIN GALAXY ENVIRONMENT.

Galaxy interface panels: TOOLS, CENTRAL PANE, HISTORY (labels: INPUT, OUTPUT)

TOOLS & WORKFLOWS

• Software tools can be used in a sequential manner to generate analytical workflows that can be reused, shared and creatively modified for multiple studies.

For example, Protein Database Downloader downloads UniProt protein FASTA databases of various organisms.

GALAXY-P : IMPLEMENTATION OF PROTEOMICS TOOLS WITHIN GALAXY ENVIRONMENT.

PROTEOGENOMICS: STEPS INVOLVED

~ 2 million proteins

~ 10,000 proteins

~ 5,000 proteins

~ 1,000 peptides

~ 100 peptides

~ 50 peptides

RNASeq DERIVED PROTEOMIC DATABASES

“Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations.” Sheynkman G et al BMC Genomics. doi: 10.1186/1471-2164-15-703.

Gloria Sheynkman, James Johnson

PROTEOGENOMICS WORKFLOW

Galaxy-P provides an integrated platform for every step of proteogenomic analysis.

• Build target database – download and translate EST databases or perform prediction with Augustus.

• Numerous tools for identification and text manipulation.

• Workflow utilizing BLAST to identify novel peptides.

• Tool to assess peptide-spectrum matches and visualize spectra.

• Visualize identified peptides on the genome.

• 140 steps: Seamless, integrated proteogenomic workflow.

Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J. Proteome Res. (2014) DOI: 10.1021/pr500812t Link: z.umn.edu/pgfirstlook

PSM EVALUATION
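One common way to assess a peptide-spectrum match is to compute the theoretical singly charged b- and y-ion m/z values for the candidate peptide and check how many are found among the observed peaks. The sketch below illustrates that idea under simplifying assumptions (unmodified peptides, charge 1+, a fixed absolute tolerance). It is not the method implemented by the Galaxy-P PSM evaluation tool, and the function names and the 0.02 tolerance are hypothetical choices.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, added to y ions
PROTON = 1.00728   # charge carrier for 1+ ions


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:            # b1 .. b(n-1), from the N-terminus
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):   # y1 .. y(n-1), from the C-terminus
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matched_ions(peptide, observed_mz, tolerance=0.02):
    """Count theoretical b/y ions with an observed peak within `tolerance`."""
    b_ions, y_ions = fragment_ions(peptide)
    return sum(
        any(abs(theoretical - peak) <= tolerance for peak in observed_mz)
        for theoretical in b_ions + y_ions
    )


if __name__ == "__main__":
    # Toy peak list for illustration only.
    peaks = [148.0604, 227.1026, 500.0]
    print(count_matched_ions("PEPTIDE", peaks), "of 12 ions matched")
```

A low matched-ion count for a putative novel peptide is a reason to inspect the annotated spectrum manually before reporting the identification.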

6 GENOME VISUALIZATION USING IGV BROWSER

7 GALAXYP : ONGOING PROJECTS

REPERTOIRE OF WORKFLOWS

• Sharing of analytical workflows that can be reused, shared and creatively modified for multiple studies.

• Multiple workflows for metaproteomics, quantitative proteomics, proteogenomics and RNASeq are being developed, shared and used.

COMMUNITY BASED SOFTWARE DEVELOPMENT

• A community-based software development model should prove effective for future implementation, testing and continued improvement of command-line driven software tools.

• We plan to offer the many functionalities of MS-GF+ and PeptideShaker in Galaxy, along with opportunities for integration with other software tools via use of workflows.

COMMUNITY-BASED SOFTWARE DEVELOPMENT

Software Developers: SearchGUI / PeptideShaker

Galaxy Improvements

Wrapper to the software tool

Software tool deposited in Galaxy Toolshed

Software tool installed in GalaxyP

USER FORUM / GITHUB: Users test the tools and provide feedback to developers.

Proteomics tools accessible via the Tool Shed

https://toolshed.g2.bx.psu.edu/

https://github.com/galaxyproteomics/tools-galaxyp


GALAXY-P : IMPLEMENTATION OF PROTEOMICS TOOLS WITHIN GALAXY ENVIRONMENT.

NSF award 1458524 “A unified Galaxy-based platform for multi-omic data analysis and informatics”.

PI: Tim Griffin

Award dates: 9/1/15-8/31/18

• Enhance the Galaxy environment with new interactive visualization tools and data exchange functionalities necessary for effective multi-omic data analysis.

• Extend the Galaxy environment to analyze and process diverse data and support workflows for metabolic activity profiling.

• Extend the Galaxy environment for integrative genomic-proteomic data analysis supporting proteogenomic and metaproteomic applications.

• Promote usage of Galaxy for multi-omics by the research community and provide undergraduate training opportunities in computational via a local area institutional network.

COMMUNITY BASED SOFTWARE DEVELOPMENT

Ira Cooke: La Trobe University, Melbourne, Australia

Bjoern Gruening: University of Freiburg, Freiburg, Germany

Lennart Martens: VIB Department of Medical Protein Research, Ugent, Belgium

Harald Barsnes and Marc Vaudel: University of Bergen, Bergen, Norway

Galaxy Team: Penn State University

Biochemistry, Molecular Biology & Biophysics: Tim Griffin, Candace Guerrero, Kevin Murray

James Johnson, Tom McGowan, John Chilton, Trevor Wennblom, Getiria Onsongo, Bill Gallip, Ben Lynch

Department of Medicine: Brian Sandri, Kevin Viken, Maneesh Bhargava, Chris Wendt

Department of Biology (Duluth): Matt Andrews, Katie Vermillion, Kyle Anderson

School of Dentistry: Joel Rudney

Laboratory Medicine and Pathology: Somi Afiuni, Amy Skubitz

Biochemistry, Molecular Biology and Biophysics: Laurie Parker, Tzu-Yi Yang

Center for Mass Spectrometry and Proteomics: Ebbing de Jong, LeeAnn Higgins, Todd Markowski

Gloria Sheynkman, Sarah Parker, Lloyd Smith, Sean Seymour, Michael Shortreed, Sricharan Bandhakavi

Funding: NSF, NIH

QUESTIONS?

Mail [email protected]

Visit http://usegalaxyp.org

Follow twitter.com/usegalaxyp