Biology ︱ Drs Tim Griffin and Pratik Jagtap

Determine Open-source Mass novel spectrometry Galaxy-P workflows peptide bioinformatic solutions data sequence variants for ‘Big Data’ analysis Sequence Peptide Proteogenomics database spectral search matches Drs Tim Grifin and Pratik urrently, a major limitation to what of progress becoming ever more Metaproteomics Jagtap along with the Galaxy-P we can discover from complex saturated. team from the University of Cdatasets derived from next- Minnesota are working to generation technologies is our ability to But what inluence has this increase develop worklows on an open analyse them. This is where the work of in computer power had on science? Functional Database source platform for the analysis Dr Tim Grifin, Dr Pratik Jagtap and their One of the major advances has been & taxonomy generation of multi-omic data. They are research team will play an important role. the ability to generate data using analysis currently focusing on using a next generation, high throughput Galaxy-based framework to THE ‘BIG DATA’ ERA techniques, resulting in ‘Big Data’. investigate the integration of Moore’s Law predicts that computing Although ‘Big Data’ has been used genomic datasets with mass power will double approximately to deine many datasets, the term spectrometry-based ‘’ every two years, and with this, the often corresponds to what are now data. But in the long term, they cost of high-powered machines will commonly known as ‘omics datasets’ – aim to expand the platform to also decrease. However, this cannot , , , cope with many other ‘Big Data’ continue indeinitely and 2017 may transcriptomics and system-wide approaches being used a multi-disciplinary, collaborative GALACTIC PLATFORM domains. be the crunch point at which physical to name but a few. For example, in more and more commonly. These project between Dr Grifin’s lab and the Galaxy was originally developed over limitations intervene, with the rate biomedical science, we see large scale, include the 1000 Genomes Project, the Minnesota Supercomputing Institute, a decade ago to solve problems in emergence of personalised medicine – which involves software developers, genomic informatics. It can be hosted tailored to an individual’s needs – and data scientists and wet-bench on a scalable compute infrastructure, , examining multiple, biological researchers. Speciically, helping to cope with the problem interacting pathways concurrently as the team are focusing on mass of large data volume, and can be one giant network. spectrometry (MS)-based ‘omics’ data accessed remotely by researchers (metabolomics and proteomics) and across the globe. Supported by a team However, the analysis of these large how they can harness an existing open- of experts and software developers, and complex datasets requires an source framework, called Galaxy. Galaxy integrates many individual analytical platform which can cope with ‘omics tools in a single environment, the intense informatics requirements, Put simply, and also has many functionalities as well as the ability to access disparate represents a high throughput technique that promote worklow sharing and software from different ‘omics’ that sorts ions based on their mass to reproducibility. The latter is particularly domains. Many wet- important, as there may bench researchers will be multiple research not have access to this One of the major advancements is projects that can utilise level of compute-power the ability to generate data using one particular dataset or or expertise locally, worklow. Data sharing and therefore there is next generation, high throughput and transparency an increase in remote, also encourages or cloud, open-access techniques, resulting in ‘Big Data’ collaboration, and platforms being used increases the number to access the necessary bioinformatic charge ratio. Once certain signatures of expert approaches that can be tools needed to cope with the complex have been recorded for individual ions, combined to maximise novel indings. results that researchers are obtaining. this information can, for example, be extrapolated to identify peptides, the In particular, the Galaxy for proteomics ONE SOLUTION FOR ALL building blocks of proteins. Tandem (Galaxy-P) team investigates ways in At the University of Minnesota, Drs mass spectrometry (MS/MS) further which genomic and transcriptomic Tim Grifin, Pratik Jagtap and team expands on this by using at least two data can be integrated with MS- are working on solutions to analyse stages of mass analysis. based proteomics data. From here, these complicated datasets. This is they aim to verify the expression of

www.researchoutreach.org www.researchoutreach.org Behind the Bench Professor Professor Tim Grifin Pratik Jagtap

E: [email protected] E: [email protected] T: +1 612 624 5249 W: http://galaxyp.org/ @usegalaxyp

Research Objectives Bio Contact Drs Grifin and Jagtap’s research focuses on Professor Tim Grifin serves as the Dr Tim Grifin PhD the Galaxy-P project – developing, testing, Principal Investigator on the Galaxy for Professor and Director, Center for Mass optimising and applying multi-omics proteomics (Galaxy-P) project, as well as Spectrometry and Proteomics software tools to a variety of biological the Faculty Director for the Center for University of Minnesota questions, including cancer and big data Mass Spectrometry and Proteomics at the Dept. of Biochemistry, Molecular Biology research. University of Minnesota. and Biophysics 6-155 Jackson Hall Funding Research Assistant Professor Pratik Jagtap 321 Church Street SE • National Science Foundation (NSF) has been the co-leader of the Galaxy-P Minneapolis, MN 55455 • National Institutes of Health (NIH) project since its inception in 2012, USA helping to develop and apply software Collaborators and worklows in metaproteomics, • Minnesota Supercomputing Institute proteogenomics and more recently data- • Galaxy software platform developers independent acquisition methods. • Jetstream research computing resource

sub-ields and how different software seen a gradual increase in interest in using tools work, as well as which are at the Galaxy-P amongst researchers as we have Q&A forefront in terms of functionalities. promoted it via research publications, Members of the Galaxy-P team, (http://galaxyp. analysis using tools for functional analysis TO INFINITY AND BEYOND org/people/) If your research were awarded a However, the biggest challenge and workshops and presentations worldwide. such as MEGAN, provide information Drs Grifin and Jagtap hope their work considerable amount of money and priority in efforts has been to maintain Along with the Galaxy community of about the functional categories will provide a novel environment to granted access to the world’s most the relevance of worklows in a constantly developers and researchers, we have protein sequence variants that result of microbial protein expression. integrate multiple ‘omics’ datasets, and powerful computer – which informatics emerging environment where the inputs been working on making the worklows from sequence variations at the DNA Metaproteomics can provide us with that this approach will provide unique tool would you develop? are diverse and outputs offer deeper and available via downloadable tool containers or RNA level. This approach, known functional data to complement the opportunities for future discovery. A tool that integrates outputs from all newer interpretations. or by making public instances available so ‘omics’ platforms and provides a ‘Google that researchers can access pre-installed as proteogenomics, commonly uses taxonomical indings of a metagenomic So far, the Galaxy-P team has advanced earth’ like interactive visual data. Such What is the most niche/unexpected tools and worklows for the research areas transcriptomic data translated in silico approach. The main draw of this the abilities of Galaxy to cope with a tool would be extremely useful to a dataset that you’ve been asked to of their interest. The vision for the future is to produce a customised protein approach is that it can potentially be the many challenges of multi-omics biological researcher in both providing analyse? that researchers will access these software sequence database. This database is used to analyse data from diverse informatics. An accessible, uniied an overview of ‘data landscape’ for The breadth of biological research and tools remotely, where they are housed on subsequently used to match proteins sample types – ranging from clinical to environment now exists to help non- biological interpretation while providing lexibility of the Galaxy-P worklows has powerful cloud based hardware. obtained through MS technologies. environmental samples. experts navigate the analysis of MS- opportunities to dive-in into regions of exposed us to many interesting datasets. interest for validation and actionable These range from human salivary datasets The major advantage of this approach based proteomics and metabolomics Leading on from this, do you think intervention/follow-up. We continue to for metaproteomics and proteogenomics, that younger students and early is that no existing reference sequence An example of where Galaxy-P data, in addition to a platform with be amazed and fascinated by the depth to dental plaque metaproteomes career researchers should be given is required, and so novel protein (galaxyp.org) provides ideal tools the potential to develop worklows for of analyses that the Galaxy platform in presence of sugar to the study of compulsory bioinformatic training as sequence variants, which may could be in helping cancer researchers proteogenomic and metaproteomic offers in challenging ields of research. metaproteomes from the North Paciic part of their studies? previously have gone undetected, can identify which protein sequences may analyses. Another avenue might be to use such a Oceans. But the most unexpected dataset Absolutely! has become be identiied. The analysis can also be have a functional role in causing a powerful compute platform to re-analyse has been the study of cardiomuscular a necessary research skillset for extended to compare expression levels speciic cancer. Not only does Galaxy-P The next steps will continue to existing publically available proteomic and protein expression in hibernation of experimental researchers. Programming transcriptomic datasets using newer multi- ground squirrels. Human hearts lose the skills enable young researchers to of and proteins. provide the necessary tools required for involve the consultation of biological omic tools, and develop tools to mine for ability to function at temperatures of 20°C perform novel analyses of previously Similar to proteogenomics, complex analyses, it can also potentially researchers to help the team translate new discoveries. and below. The study tried to shed light acquired data. For users, analytical metaproteomics is also based on train non-expert, bench scientists their informatics indings into basic on how the heart of hibernating animals and data interpretation skills expand integration of metagenomic data with through public Galaxy platforms (tiny. biological contexts, and to aid projects What was the biggest challenge you can withstand these low temperatures. their ability to seek newer avenues MS-derived proteomics data. However, cc/galaxyp-proteogenomics; z.umn. which address human diseases. The had to overcome when developing We are certain that we will continue to see in their research ields. We strongly unlike the previous approach, this edu/metaproteomicsgateway). This team will also continue to develop Galaxy-P? more of these interesting datasets as we believe that bioinformatics training will concentrates on integrating these with platform provides small-scale data for visualisation tools that can help with the The development of tools and worklows continue our research work. help in introducing and honing skills in for multi-omic analysis of mass programming and data processing, and sequence data derived from bacterial users to access and use with already interpretation of outputted data. spectrometry data provided challenges at In the future, do you see Galaxy-P helps in continuing to expand the breadth communities (). As before, published worklows. Existing studies many levels. Be it at the conceptualisation becoming a desk-based tool that can and depth of questions that can be sought in metagenomic data are translated have already used the Galaxy-P There is also potential to add extra stage, or at grant seeking stage, or at tool easily and universally used by anyone, by the future generation of scientists. ‘Big silico to create a protein sequence platform successfully to look at a range layers of omics to the analysis. So, selection or worklow stage, we looked anywhere in the world? Data’ will only continue to be generated database. MS/MS peak lists, derived of topics, from proteogenomic analysis for example, metabolomics could at all the challenges as opportunities. The research community has been using in biological research, and having the from the raw data, are matched against of hibernating mammals, to protein be included in the mix. Using this Deciding which of the many effective Galaxy platform for genomics studies ability to speak both languages in terms of software tools to implement in Galaxy for quite some time now and there is a biology and computational science will be the database. Once peptides of interest expression in the lungs of patients with approach, the possibilities for new has been a challenge, as well as stable ecosystem of developers and users, a critical skill, and one that is very much in have been identiied, they are assigned acute respiratory distress syndrome. discoveries are endless. understanding the many different ‘omic which makes this sustainable. We have demand in years to come. to taxonomies and veriied. Additional

www.researchoutreach.org www.researchoutreach.org