University of Calgary PRISM: University of Calgary's Digital Repository

Graduate Studies The Vault: Electronic Theses and Dissertations

2020-03-13 Digestomics: elucidating protein catabolism through the quantitative mapping of in vivo, endogenous peptides to the proteome at an amino acid level of resolution.

Bingeman, Travis Shane

Bingeman, T. S. (2020). Digestomics: elucidating protein catabolism through the quantitative mapping of in vivo, endogenous peptides to the proteome at an amino acid level of resolution. (Unpublished master's thesis). University of Calgary, Calgary, AB. http://hdl.handle.net/1880/111752

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Digestomics: elucidating protein catabolism through the quantitative mapping of in vivo, endogenous peptides to the proteome at an amino acid level of resolution.

by

Travis Shane Bingeman

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN BIOLOGICAL SCIENCES

CALGARY, ALBERTA

MARCH, 2020

© Travis Shane Bingeman 2020

Abstract

Plasmodium falciparum is the most lethal malaria-causing parasite. During the human stage of its lifecycle it invades erythrocytes and rapidly degrades 65-70% of the cytoplasmic hemoglobin. Despite extensive in vitro investigation of the proteases responsible for this impressive catabolic activity, the definitive purpose of this digestion remains unresolved. Many of the known proteases have been knocked out, yet the resulting parasites have minimal growth phenotypes.

The common explanation is that proteolytic redundancy allows parasites to survive. I, however, hypothesize that the protease(s) essential for hemoglobin catabolism have yet to be identified. To test this, I developed a new approach to directly investigate in vivo protein catabolism. I analyze a panel of knockout parasites and show that in vivo proteolytic cleavage patterns are unchanged in the knockouts, that published cut sites match poorly to in vivo data, and that convincing proteolytic patterns suggest the presence of previously unidentified cut sites.


Acknowledgements

I would like to thank a few key people for their contributions to my research and their support during these challenging few years. I would like to start by thanking my supervisors Dr. Doug Storey and Dr. Ian Lewis. Dr. Lewis, thank you for challenging me, encouraging me and always pushing for more. Dr. Storey, thank you for your wisdom, encouragement and efforts to keep me sane. To my committee members, Dr. Hans Vogel and Dr. Joe Harrison, your understanding, suggestions and feedback were invaluable. I would also like to thank all past and present members of both the Lewis and Storey laboratories for their support and feedback along the way.

Dedication

To my parents Cathie and Lorne and my partner Angela.

Thank you for your continued love and support.

Table of Contents

Abstract
Acknowledgements
Dedication
Table of Contents
List of Tables
List of Figures and Illustrations
List of Symbols, Abbreviations and Nomenclature
Epigraph

CHAPTER ONE: INTRODUCTION
1.1 Background
1.2 Lifecycle
1.3 Hemoglobin digestion
1.4 P. falciparum proteases
1.5 Protease knockouts
1.6 Summary
1.7 Hypothesis and aims

CHAPTER TWO: DIGESTOMICS: AN EMERGING STRATEGY FOR COMPREHENSIVE ANALYSIS OF PROTEIN CATABOLISM
2.1 Abstract
2.2 Introduction
2.3 Barriers to digestomics
2.4 Digestomics workflow
2.5 Bioinformatics in digestomics
2.6 Biological applications and future directions
2.7 Conclusions

CHAPTER THREE: DIGESTR: OPEN SOURCE SOFTWARE FOR THE COMPREHENSIVE ANALYSIS OF PROTEIN CATABOLISM
3.1 Abstract
3.2 Introduction
3.3 Results and Discussion
3.3.1 Program design objectives
3.3.2 Data pipeline overview
3.3.3 Differentiating signal from noise
3.3.4 Coincidence mapping
3.3.5 False discovery rate estimation
3.3.6 DigestR user interface
3.4 Conclusions

CHAPTER FOUR: CRITICAL EVALUATION OF PROTEIN CATABOLISM IN PLASMODIUM FALCIPARUM
4.1 Abstract
4.2 Introduction
4.3 Methods
4.4 Results and Discussion
4.4.1 Critical evaluation of published cut sites
4.4.1.1 Hemoglobin-α
4.4.1.2 Hemoglobin-β
4.4.2 Predicted loci of unidentified proteases
4.4.2.1 Hemoglobin-α
4.4.2.2 Hemoglobin-β
4.4.3 Analysis of in vivo proteolysis
4.4.3.1 3D7 - wildtype
4.4.3.2 FP1KO – falcipain-1 knockout
4.4.3.3 F2 – falcipain-2 knockout
4.4.3.4 C10 – I, II, III and HAP knockout
4.4.4 Knockout summary
4.5 Conclusions
4.6 Future directions

REFERENCES

List of Tables

Table 3-1. Mascot Daemon parameter settings

Table 3-2. Mascot Daemon export search result settings

Table 4-1. Outline of 3D7 file organization

List of Figures and Illustrations

Figure 1-1. Malaria lifecycle
Figure 2-1. Digestomics workflow
Figure 2-2. Genome-level maps of the Plasmodium falciparum digestome via traditional proteomics analysis versus the coincidence scoring algorithm
Figure 2-3. Gene-level view of proteolytic coincidence maps from chloroquine resistant versus sensitive strains of Plasmodium falciparum
Figure 3-1. Data pipeline overview
Figure 3-2. Coincidence score generation
Figure 3-3. Coincidence mapping
Figure 3-4. Process Mascot files
Figure 3-5. File open / close window
Figure 3-6. File Operations window
Figure 3-7. Protein label threshold window
Figure 3-8. Plot settings: plot colors window
Figure 3-9. Plot settings: spectra window
Figure 3-10. Overlays example
Figure 3-11. Overlays window
Figure 4-1. Data processing outline for the critical evaluation of in vivo protein catabolism
Figure 4-1. Proteome-wide view of 3D7
Figure 4-2. Protein-wide coincidence map (amino acid detail)
Figure 4-3. Full dataset overlay hemoglobin-α
Figure 4-4. Full dataset overlay hemoglobin-β
Figure 4-5. Predicted proteolytic cut sites hemoglobin-α
Figure 4-6. Predicted proteolytic cut sites hemoglobin-β
Figure 4-7. 3D7 Hemoglobin-α
Figure 4-8. Wildtype Hemoglobin-β
Figure 4-9. FP1KO Hemoglobin-α
Figure 4-10. FP1KO Hemoglobin-β
Figure 4-11. F2 Hemoglobin-α
Figure 4-12. F2 Hemoglobin-β
Figure 4-13. C10 Hemoglobin-α
Figure 4-14. C10 Hemoglobin-β

List of Symbols, Abbreviations and Nomenclature

Symbol          Definition
WHO             World Health Organization
P. falciparum   Plasmodium falciparum
RBC             Red blood cells
DV              Digestive vacuole
HAP             Histo-aspartic protease
MS              Mass spectrometry
LC-MS           Liquid chromatography mass spectrometry
HBB             Hemoglobin beta
HBA1            Hemoglobin alpha
HBD             Hemoglobin delta
HBG1            Hemoglobin gamma 1
HBZ             Hemoglobin subunit zeta
S/N             Signal to noise ratio
FDR / FDRe      False discovery rate estimation
PfCRT           Plasmodium falciparum chloroquine resistance transporter
GUI             Graphical user interface
CSV             Comma separated values
3D7             Wild-type Plasmodium falciparum strain isolated from a Dutch patient
FP1KO           Plasmodium falciparum of 3D7 background with falcipain-1 knocked out
F2              Plasmodium falciparum of 3D7 background with falcipain-2 knocked out
C10             Plasmodium falciparum of 3D7 background with all four plasmepsins knocked out

Epigraph

"Tell me what you eat, and I will tell you what you are."

Jean Anthelme Brillat-Savarin


Chapter One: Introduction


1.1 Background

Malaria is an ancient disease. Since the discovery in 1880 that malaria patients had parasites in their blood1, malariologists have been working to understand the causative protozoan parasites belonging to the genus Plasmodium. There have been tremendous global investments in malaria prevention, control, diagnosis and treatment; however, malaria remains a major social and economic burden. While these efforts have significantly reduced mortality and morbidity, the emergence of drug resistance coupled with insufficient resources has slowed progress of late. In 2017, an estimated 219 million cases of malaria occurred worldwide, causing 435,000 deaths. Of all the malaria cases reported, 92% were in the World Health Organization (WHO) African Region, where Plasmodium falciparum (P. falciparum) is the most prevalent species2. Economically, the direct costs of treatment, illness and premature death have been estimated to be at least US$12 billion per year3.

1.2 Lifecycle

When a malaria-infected female Anopheles mosquito takes a blood meal from a human, it inoculates sporozoites into the bloodstream. These sporozoites infect liver cells and mature into schizonts, which rupture and release merozoites. The merozoites quickly invade red blood cells (RBC), where the parasite replicates asexually and forms ring stage trophozoites that mature into schizonts, which once again rupture and release merozoites. Some merozoites differentiate into sexual gametocytes, available for uptake by further female Anopheles mosquitos to continue the cycle, while others invade further RBCs and produce more progeny (Figure 1-1).


Figure 1-1. Malaria lifecycle. Lifecycle of P. falciparum depicting both mosquito and human stages4.

1.3 Hemoglobin digestion

Replication within human RBCs is a critical element of the Plasmodium lifecycle. To facilitate this process, P. falciparum is known to consume and digest approximately 65–70% of the total hemoglobin contained within the RBC cytosol5,6. This proteolytic activity takes place within a specialized organelle known as the digestive vacuole (DV)7. Despite years of extensive research, the purpose of hemoglobin digestion remains unclear. Hemoglobin catabolism has been proposed as a method to generate amino acids for growth and energy; however, only an estimated 16% are incorporated into proteins by the parasite6. It has also been suggested that the parasite may be creating space for its daughter cells6, or that it is digesting hemoglobin to regulate RBC osmotic stability to prevent premature lysis of the host cell5. Regardless of the purpose, recent evidence has demonstrated that parasites which have an impaired hemoglobin digestion capacity suffer a significant fitness cost8.

1.4 P. falciparum proteases

Extensive effort has been invested in discovering and elucidating the DV-localized proteases that P. falciparum uses to catabolize hemoglobin. It has been believed that understanding these enzymes may provide critical insight for new treatments in the fight against malaria. Four main categories of enzymes have been identified as active within the DV of P. falciparum: the plasmepsins, the falcipains, falcilysin and the exopeptidases. The plasmepsins are aspartic proteases, and the four localized to the DV are plasmepsins I, II, IV and histo-aspartic protease (HAP)9–11. The falcipains are cysteine proteases, of which there are three: falcipain II, falcipain II’ and falcipain III12–15. Falcilysin is a metalloprotease16,17. Finally, the exopeptidases include dipeptidyl aminopeptidase I (DPAP I)18, the metalloenzyme aminopeptidase P (APP)19, and M1-family aminopeptidase (PfA-M1)20.

As outlined, there is ample evidence indicating the importance of hemoglobin digestion in the human phase of the malaria lifecycle. Additionally, there is evidence that even slight perturbations in hemoglobin catabolism have significant fitness costs for the parasite8. However, most of the known P. falciparum proteases were characterized using in vitro techniques. Enzymatic inhibitors12,15,17,21,22, expression of recombinant enzymes in E. coli 9,13,23, protease activity assays10,13,21, mass spectrometry (MS)8,11,17,24–26 and visualization techniques such as immunofluorescence10,23, transmission electron microscopy21,27, and gel electrophoresis15,17,23 were common approaches used to elucidate the P. falciparum protease collection. Unfortunately, with the development of enzymatic knockouts11,28 it became apparent that what is determined in vitro does not always prove to be true in vivo.


1.5 Protease knockouts

For more than two decades there have been numerous arguments suggesting that certain enzymes are essential to the hemoglobin degradation pathway, but as subsequent gene knockout experiments have demonstrated, this has often proven false11,27,29,30. The four DV plasmepsins have successfully been knocked out individually, in combinations and all four simultaneously, and while the resulting mutant parasite’s growth is slowed, it remains viable in vitro27. The falcipains have similarly been targeted by knockout experiments. Falcipain-1, falcipain-2, and falcipain-2’ have all successfully been knocked out while the parasite remains viable; however, knocking out falcipain-3 appears to be fatal to the parasite31. Interestingly, falcipain-2 knockouts display an altered phenotype wherein the early trophozoite DV appears swollen and stains darker; however, as the parasites mature, they become indistinguishable from wildtype cells21. Falcilysin and DPAP I, like falcipain-3, appear to be essential, as attempts to knock them out have proven unsuccessful32. Retained viability has been suggested in numerous publications to be the result of evolved proteolytic redundancy both within and across enzymatic classes in the DV11,21,28.

1.6 Summary

A recurring theme throughout the literature has been the attempt to define the hemoglobin proteolytic cascade. The original effort was focused on discovering the responsible enzymes and uncovering their specificity and activity. The belief was that by gaining knowledge of the proteases responsible, effective antimalarials could be developed through the inhibition of one or more of these key enzymes. In 1991, Dan Goldberg used mass spectrometry to identify the peptide fragments created by an aspartic protease that would later become known as plasmepsin. Using this technique he identified a cut site in α-hemoglobin between 33F and 34L24. In 1999, Goldberg and his collaborators used recombinant plasmepsin II to digest human hemoglobin and incubated it both with and without falcilysin. The hemoglobin treated with falcilysin generated additional peptide products, confirming that falcilysin could cleave hemoglobin previously degraded by recombinant plasmepsin II17. Through these experiments, and many others, pioneers of malaria research uncovered the in vitro cut sites and the initial steps of hemoglobin metabolism.

It has become clear from reviewing the literature that the data regarding P. falciparum DV-related proteases and their activity have several things in common. Typically, they were collected in vitro, using bacterial expression vectors and recombinant proteases, often on hemoglobin that was purified and/or pre-cleaved or denatured, and sometimes non-human in origin. This raises the question: how well does the current literature reflect what is taking place within live cells?

1.7 Hypothesis and aims

Many of the proteases thought to be essential can be knocked out with little to no impact. The explanation for this has been that enzymatic redundancy enables the parasite to survive11,21,28. I suggest an alternative hypothesis: the known proteases of P. falciparum do not play a critical role in hemoglobin catabolism and, moreover, proteolysis in vivo does not match that observed in vitro. To test my hypothesis, I have developed a method for directly studying hemoglobin proteolysis in vivo, which allows the accuracy and relevance of the existing in vitro work to be evaluated and may uncover enzymatic activity that has yet to be characterized.

Recent advances in MS have created an opportunity to finally investigate the questions surrounding protein catabolism. Current MS hardware can accurately identify and quantify the peptides in complex biological mixtures. What is missing is software that matches peptides to their loci of origin in the proteome and maps these proteolysis patterns in vivo. In my research, I developed a software tool that allows the analysis and visualization of in vivo proteolytic cleavage. Using the ‘Digestomics’ approach33 (Chapter two) and the ‘DigestR’ software that I developed to support it (Chapter three), I investigate the significance of the known proteases and determine whether published in vitro protease cut sites match the proteolytic cut sites as mapped in live cells (Chapter four). Moreover, I provide data to suggest that as-of-yet unidentified enzymes are likely critical to hemoglobin degradation.


Chapter Two: Digestomics: an emerging strategy for comprehensive analysis of protein catabolism

Adapted from:

Travis S Bingeman, David H Perlman, Douglas G Storey and Ian A Lewis. Digestomics: an emerging strategy for comprehensive analysis of protein catabolism. (Current Opinion in Biotechnology 2017, 43:134-140)

In this chapter we introduce the digestomics concept, the theoretical framework for my approach and show preliminary results attained by analyzing the in vivo digestion of hemoglobin by P. falciparum. The analyses were performed using a preliminary version of the DigestR software package that I developed as described in detail in Chapter 3.


2.1 Abstract

When cells mobilize nutrients from protein, they generate a fingerprint of peptide fragments that reflects the net action of proteases and the identities of the affected proteins. Analyzing these mixtures falls into a grey area between proteomics and metabolomics that is poorly served by existing technology. Herein, we describe an emerging digestomics strategy that bridges this gap and allows mixtures of proteolytic fragments to be quantitatively mapped with an amino acid level of resolution. We describe recent successes using this technique, including a case where digestomics provided the link between hemoglobin digestion by the malaria parasite and the world-wide distribution of chloroquine resistance. We highlight other areas of microbiology and cancer research that are well-suited to this emerging technology.

2.2 Introduction

Metabolism is the biological conversion of nutrients into energy, biomass, and waste products. The molecular details of metabolism have a direct impact on the types of nutrients organisms use, the amount of energy that can be harvested from finite resources, and the types of ecosystems organisms inhabit. Traditionally, metabolism has been studied from the perspective of small molecule nutrients. Almost everything that is known about metabolic pathways and regulation has been defined using a small collection of amino acids, sugars, nucleic acids, and fats. However, metabolites only represent about 3% of the dry mass of the cell whereas protein constitutes 50–60%34,35. Consequently, organisms that can use this resource for metabolism can have a significant competitive advantage.

Although protein catabolism is a well-recognized mechanism for satisfying nutritional needs, it is surprisingly difficult to study in a comprehensive manner. Amino acids are mobilized through the combined action of proteases and peptidases. This enzymatic digestion produces a complex mixture of related peptides ranging in size from dimers to nearly intact proteins. Decoding these mixtures is a significant technical challenge that requires high resolution mass spectrometry (MS), custom bioinformatics, and a strong understanding of biology. Until recently, these challenges have made comprehensive analyses of protein digestion impractical. However, a constellation of emerging techniques has made this formidable problem more tractable, and has opened the door to diverse applications ranging from elucidating selective pressures acting on drug resistant microbes to investigating metabolic adaptation in cancer cells8,36. This emerging digestomics approach has been previously defined in the narrow context of mammalian and insect digestion37,38. Here, we recast this concept to encompass the study of any living system that uses protein catabolism for nutrition.

Digestomics: the comprehensive analysis of protein catabolism and quantitative mapping of peptides to genomes at an amino acid level of resolution.

2.3 Barriers to digestomics

Digestomics exists in a middle ground between metabolomics and proteomics that is poorly served by the existing analytical workflows. In the case of metabolomics, sample preparation conditions are optimized for the recovery of small molecules and have been specifically designed to minimize the solubility of proteins and large peptides39. Furthermore, mass spectrometry-based metabolomics is typically restricted to low molecular weight molecules, optimized for singly charged species, uses high-flow liquid chromatography conditions with condensed chromatographic separations, and is often conducted with the MS instrument in negative ion mode40. These conditions are poorly suited to analyses of peptides. Moreover, data analyses in metabolomics typically involve either targeted analyses of vetted species or untargeted assignments driven by database matches to libraries of metabolite standards, which only contain a few peptides41–43.

Standard proteomic workflows are also inappropriate for digestomics. Proteomic sample preparation methods target intact proteins, which leads to poor recovery of small peptides. In addition, proteomic sample processing often includes enzymatic digestion (e.g. trypsinolysis), which introduces non-biological peptides into the mixture44. Moreover, MS analyses in proteomics employ automated fragmentation strategies that typically exclude low mass and singly charged species. Finally, proteomic data analyses typically employ database searching and spectral matching algorithms that bias results towards larger peptides. These larger species are favored because they can be identified with higher confidence and are more likely to yield unique matches when aligned with a total proteome database45. This is problematic in the context of digestomics because small peptides are frequently biologically significant8,46,47. In summary, routine metabolomics methods exclude large molecules, routine proteomics methods exclude small molecules, and digestomics must encompass both. Not surprisingly, the emerging digestomics workflow is a hybrid of existing metabolomics and proteomics methods (Figure 2-1).


Figure 2-1. Digestomics workflow. The generalized workflow for digestomics presented here is adapted from a previous study8. (a) Cells mobilize peptides and amino acids via proteolysis. The net action of these enzymes creates (b) a characteristic mixture of peptides that can (c) be decoded using high-resolution mass spectrometry. Naturally occurring peptides are extracted from biological samples using sample preparation methods adapted from metabolomics. Briefly, cells are extracted in acidified 90% methanol containing 3% acetonitrile. Mass spectra are acquired using a modified proteomics approach including nanoflow reverse phase ultrahigh performance liquid chromatography and top-ten MS/MS fragmentation in positive mode. Once these fragment data have been generated, they are assigned using Mascot and aligned to the appropriate genome at an amino-acid level of resolution. (d) Coincidence scores are then generated to reflect the aggregate dataset.

2.4 Digestomics workflow

The need to capture a diverse range of naturally occurring peptides shapes the sample preparation and data analysis workflow. The methods outlined here are adapted from a previous study, which quantified peptides ranging from 2 amino acids to 32 amino acids in length8. To achieve this, peptides can be extracted from cells using relatively high dilutions (1:20; cell:solvent volume) in 90% methanol containing 3% acetonitrile and 0.1% formic acid. The high dilution helps improve the recovery of sparingly soluble peptides, and the acetonitrile helps improve the extraction of hydrophobic species and ensures miscibility with the chromatographic mobile phase. Most peptides can then be separated on a capillary reversed-phase chromatography column (e.g. 75 µm × 25 cm, packed with 1.7 µm, 130 Å BEH C18 resin) and analyzed by nanoflow LC–MS using a linear gradient (120 min; A: 3% acetonitrile/0.1% formic acid; B: 97% acetonitrile/0.1% formic acid) at a flow rate of 400 nL/min. We have found that foregoing the use of a trapping column (which is frequently used in proteomics) significantly improves the recovery of small peptides. High resolution MS and fragmentation (MS/MS) data are then acquired in positive mode using a top-10 MS/MS strategy to facilitate peptide identification. These datasets must include +1 ions and low-molecular weight species, which are often excluded from conventional proteomic analyses. These complex spectra can then be interpreted using complete proteomes digested in silico via the ‘no enzyme’ cleavage specificity option available in existing proteomics software (e.g. Mascot and SEQUEST). These software tools can then identify peptides by matching the observed MS/MS patterns to the reference database. These assignments must be made using low scoring thresholds (e.g. Mascot ion score threshold ≥10) to allow for short peptide matching. Once the tentative assignments have been generated, peptides can be mapped to all the possible proteins of origin and scored using the genome-wide coincidence mapping approach (Figure 2-1).
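The final mapping step, in which each assigned peptide is aligned to every possible protein of origin, can be illustrated with a minimal Python sketch. This is not the DigestR implementation; the function name `find_peptide_loci`, the dictionary layout of the proteome and the toy sequences are all assumptions made for illustration. The sketch uses exact substring matching and reports every occurrence, including overlapping ones, because short peptides frequently match more than one locus.

```python
from typing import Dict, List, Tuple


def find_peptide_loci(peptide: str, proteome: Dict[str, str]) -> List[Tuple[str, int, int]]:
    """Return (protein_id, start, end) for every exact occurrence of a peptide.

    Positions are 0-based and `end` is exclusive. Overlapping occurrences
    within the same protein are all reported.
    """
    loci: List[Tuple[str, int, int]] = []
    if not peptide:
        return loci
    for protein_id, sequence in proteome.items():
        start = sequence.find(peptide)
        while start != -1:
            loci.append((protein_id, start, start + len(peptide)))
            # Advance by one residue so overlapping matches are not skipped.
            start = sequence.find(peptide, start + 1)
    return loci


# Hypothetical toy proteome (made-up sequences, not real proteins).
proteome = {
    "protA": "MKVLSPADKTAAW",
    "protB": "GGSPADKGG",
}
hits = find_peptide_loci("SPADK", proteome)
for protein_id, start, end in hits:
    print(protein_id, start, end)
```

In this toy case the same pentamer maps to two proteins, which is the ambiguity that the coincidence mapping approach is intended to resolve by considering all aligned peptides together.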

2.5 Bioinformatics in digestomics

There are significant underlying differences in data complexity and structure that differentiate digestomics from proteomics. Proteomics data generally contain diverse collections of peptides mapping to proteins widely distributed across the genome. In contrast, digestomics typically involves complex mixtures of related peptides originating from a relatively restricted set of proteins (i.e. a food source). For example, peptides found in malaria parasites are largely derived from hemoglobin8, analyses of wound exudates are heavily represented by clotting factors48,49, and pancreatic cancer cells are associated with albumin catabolites scavenged from the extracellular environment36. The low protein diversity in each of these cases reflects the fact that protein digestion generally targets a small subset of accessible proteins rather than the entire proteome. This observation has significant implications for the downstream bioinformatics.

In proteomics, bioinformatics strategies have been optimized for making confident protein assignments from relatively sparse collections of peptides. This process begins with the digestion of proteins using well-defined proteases (e.g. trypsin) to generate predictable cleavage patterns that can be matched to in silico reference libraries of protein digestion50. This strategy minimizes false discovery by restricting the bioinformatics search space45. Proteins can then be identified by matching observed MS/MS fragments to the reference library45. Since the matching score from existing algorithms increases proportionally with peptide length, there is an intrinsic bias against small peptides (Mascot; URL: http://www.matrixscience.com/help/feb2000.html). Although this strategy increases the confidence of protein identification, it de-emphasizes the small, but biologically significant, peptides generated through proteolysis8. Once peptides have been assigned a score, a threshold is established for reducing false positives. These false discovery thresholds can be empirically determined by conducting peptide matches against a decoy database51. Generally, the highest match score of all peptides associated with a particular protein is a significant factor in the confidence of the protein identification. Although this strategy is optimized for the practical constraints of proteomics, it fails to leverage the significant data redundancy implicit to digestomics analyses.
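The decoy-based thresholding described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used by Mascot or any other search engine: the decoy is built by shuffling residues, the false discovery rate at a threshold is estimated as the ratio of decoy hits to target hits, and the function names and score values are hypothetical.

```python
import random
from typing import List, Optional


def scramble_sequence(sequence: str, seed: int = 0) -> str:
    """Build a decoy by shuffling residues (same composition, scrambled order)."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def estimate_fdr(target_scores: List[float], decoy_scores: List[float], threshold: float) -> float:
    """Estimate FDR as (decoy hits) / (target hits) at or above a score threshold."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0


def lowest_threshold_at_fdr(
    target_scores: List[float], decoy_scores: List[float], max_fdr: float
) -> Optional[float]:
    """Return the lowest observed target score whose estimated FDR is <= max_fdr."""
    for threshold in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, threshold) <= max_fdr:
            return threshold
    return None


# Hypothetical match scores from a target search and a decoy search.
target_scores = [12, 15, 22, 30, 41, 55]
decoy_scores = [11, 13, 16, 21]
threshold = lowest_threshold_at_fdr(target_scores, decoy_scores, max_fdr=0.05)
print(threshold)
```

With these made-up scores, every decoy match scores below 22, so 22 is the lowest threshold that meets a 5% limit. A fixed threshold of this kind also discards the three lowest-scoring target matches, which illustrates how score-based filtering can penalize short peptides.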

The distribution of peptides observed in digestomics differs considerably from those seen in proteomics. Proteolytic cascades often involve the semi-ordered degradation of proteins via the ensemble action of multiple proteases. The overlapping specificity of proteases and nonlinear action of digestion produces a complex mixture of related peptides, each of which can be mapped to the same parent protein(s). Whereas spurious peptide matches occurring from false discovery map randomly across the genome, true positives map linearly to specific loci (Figure 2-2). This overlap creates an opportunity for improving signal-to-noise through signal averaging (Eq. 2-1), which is analogous to the routine data processing methods used in nuclear magnetic resonance spectroscopy52. These coincidence scores can be generated regardless of peptide length, uniqueness, or matching score, which circumvents many of the traditional barriers to integrating data from both large and small peptides. Our analyses show that coincidence-based scoring can introduce a significant improvement in the signal-to-noise of protein assignments relative to traditional proteomics algorithms for certain datasets (Figure 2-2). This data analysis approach may be portable to a variety of other proteomics applications that require similar peptide alignment strategies53–59.

\[
\text{coincidence score} = \frac{1}{n} \sum_{i=1}^{n} C_i \tag{2-1}
\]

where $C_i$ is the number of peptides aligned to locus $i$ in a protein containing $n$ amino acids.
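As a worked illustration of Eq. 2-1, the following Python sketch counts the peptides covering each residue of a protein and averages those counts over the protein length. The function names and the toy alignments are assumptions made for illustration and are not taken from the DigestR source; alignments are given as 0-based (start, end) pairs with the end position exclusive.

```python
from typing import List, Tuple


def per_residue_counts(protein_length: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Count how many aligned peptides cover each residue (C_i in Eq. 2-1)."""
    counts = [0] * protein_length
    for start, end in alignments:
        for i in range(start, end):
            counts[i] += 1
    return counts


def coincidence_score(counts: List[int]) -> float:
    """Average the per-residue counts over the protein length n (Eq. 2-1)."""
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical protein of 10 residues with three overlapping peptides.
alignments = [(0, 4), (2, 6), (2, 5)]
counts = per_residue_counts(10, alignments)
score = coincidence_score(counts)
print(counts)  # [1, 1, 3, 3, 2, 1, 0, 0, 0, 0]
print(score)   # 1.1
```

Because the sum of the per-residue counts equals the summed length of the aligned peptides (here 4 + 4 + 3 = 11), overlapping peptides from one region raise the score of their parent protein regardless of their individual lengths or match scores.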

Figure 2-2. Genome-level maps of the Plasmodium falciparum digestome via traditional proteomics analysis versus the coincidence scoring algorithm. Proteolytic digestion profiles of human erythrocytes infected with late trophozoite stage Plasmodium falciparum analyzed using established proteomics methods and compared to the coincidence mapping approach. These data were adapted, with permission, from previously published work8. Each mapping strategy is shown relative to an empirical false discovery control generated by decoy searches against a sequence scrambled human proteome. Both strategies start by translating MS/MS fragmentation data into putative peptide identities using Mascot searches. (a) Coincidence scores are then generated by aligning all the putative assignments (irrespective of peptide length and including all peptides with Mascot scores of 10 or greater) to their respective genomic loci. The number of peptides aligned to each locus is quantified and expressed as an average representation for each gene. Correctly identified peptides consistently map to the same loci whereas false discoveries resulting from incorrect peptide assignments are randomly distributed across the genome. Consequently, gene-by-gene averages of the total number of peptides mapped to each locus improve the signal to noise. (b) In traditional proteomics, the maximum Mascot score of all peptides matched to a particular protein is a significant factor in the confidence of the protein assignment. Fixed matching thresholds are used to minimize false discovery. Here, the maximum Mascot score associated with every gene is plotted to illustrate the improvement in signal to noise that can be captured via coincidence mapping. The signal to noise ratio was calculated by dividing the maximum match score by the standard deviation of the corresponding FDR control.
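The signal to noise calculation stated in the caption above can be sketched as follows. This is a minimal Python illustration with hypothetical values; the caption does not specify whether a population or sample standard deviation was used, so the choice of the population form (`statistics.pstdev`) is an assumption, as is the function name.

```python
import statistics
from typing import List


def signal_to_noise(scores: List[float], control_scores: List[float]) -> float:
    """Divide the maximum score by the standard deviation of the FDR control.

    Assumption: the population standard deviation is used; the source text
    does not state which form was applied.
    """
    noise = statistics.pstdev(control_scores)
    if noise == 0:
        raise ValueError("control scores have zero variance")
    return max(scores) / noise


# Hypothetical per-gene scores and decoy (FDR control) scores.
gene_scores = [2.0, 8.0, 40.0]
control_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ratio = signal_to_noise(gene_scores, control_scores)
print(ratio)  # 20.0
```

The same function applies to either panel of the figure: the scores may be per-gene coincidence scores or per-gene maximum Mascot scores, provided the control scores come from the matching decoy search.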

2.6 Biological applications and future directions

Malariology is one of the most active emerging areas in digestomics. The malaria parasite, Plasmodium falciparum, digests approximately 75% of the hemoglobin present in the red blood cell over the course of its 48-hour intraerythrocytic lifecycle60. Recent reports have found that chloroquine resistance is linked to seemingly minor defects in the parasite’s ability to digest hemoglobin8. These perturbations result in a dramatic reduction in parasite fitness and may explain why drug sensitive strains have re-emerged in regions of the world that have discontinued chloroquine therapy8,46. The three-way association between drug resistance, parasite fitness, and metabolism was uncovered through an untargeted metabolomics screen, which identified several small peptides (2–4mers) that were consistently elevated in drug-resistant strains. Connecting these peptides with an underlying biological mechanism required a new analytical approach, digestomics, which unambiguously linked the small peptides to hemoglobin catabolism (Figure 2-3). Since the original report, several other studies have shown that mutations in the Plasmodium falciparum chloroquine resistance transporter (PfCRT) modulate both drug sensitivity and hemoglobin catabolism46,61–64.


Figure 2-3. Gene-level view of proteolytic coincidence maps from chloroquine resistant versus sensitive strains of Plasmodium falciparum. Proteolytic profiles of isogenic parasite lines containing either drug sensitive (CQS; Hb3) or resistant alleles (CQR; Dd2 and 7G8) of the P. falciparum chloroquine resistance transporter gene (PfCRT) were analyzed. These profiles, which were derived with permission from previously published data8, were analyzed using the digestomics approach and the coincidence scores from β-hemoglobin and α-hemoglobin are shown. The figure inset depicts how detailed peptide alignments contribute to the per-residue coincidence score. Quantitative differences between the CQS and CQR lines shown here recapitulate published results, which linked elevated peptides from certain loci to both drug resistance and reduced fitness.

Other exciting applications of digestomics have been emerging from microbiology. A recent article by the Juillard laboratory employed this technology to decode the complex mix of peptides found in the media of Lactococcus lactis. Detailed characterization of these peptides clarified both the source of these molecules and the cell surface proteolytic mechanism through which they are generated47. Similar extracellular proteolytic activities have been found in other microbes and may play a significant role in host-pathogen dynamics. Lactobacillus plantarum colonization of Drosophila, for example, elevates proteolytic activity in the gut, which accelerates fly development65. Similarly, both the lungs of cystic fibrosis patients and wounds contain rich mixtures of proteins and peptides49,66,67; the pathogens that colonize these sites are known to secrete proteases that have been linked to pathogenesis and metabolism68,69. Other interesting host-pathogen examples include enteroaggregative Escherichia coli and Shigella flexneri colonization of the colon, which has been linked to mucin digestion by Pic protease70; and aspergillosis, which requires the secretion of fungal proteases to mobilize biosynthetic precursors71,72. Although several host-pathogen dynamics are mediated through proteolytic mechanisms, these connections are still poorly understood. The emerging digestomics technology offers a promising new avenue for investigating these clinically relevant phenomena.

Cancer biology is another application where digestomics has shown clear relevance. Tumorigenesis in several types of cancer has been directly linked to protein scavenging. Pancreatic ductal adenocarcinoma cells, for example, access extracellular pools of albumin through macropinocytosis and can use this protein as their sole source of essential amino acids36. Similarly, peptidomics analysis of plasma from breast cancer patients showed a greater than 4000-fold enrichment in certain peptides73. These cancer/proteolytic connections are even more pronounced when intracellular processes are considered. Autophagy, the lysosomal degradation of intracellular components, is an active area of research and a promising therapeutic target74–77. Digestomics is of obvious relevance to these efforts and may help shape our understanding of tumorigenesis.

Beyond these health-related applications, digestomics remains pertinent to the food-related conception of its original definition37,38. Recent articles have harnessed this approach to examine the mixtures of peptides generated from in vitro digestion of meat and the role Maillard reactions play in this process78,79. In summary, understanding the details of protein digestion may provide critical insights into diverse subjects ranging from cooking to cancer.

2.7 Conclusions

Mixtures of proteases generate a characteristic fingerprint of peptides that reflects nutritional preferences and enzymatic activities. The emerging digestomics approach allows these complex mixtures to be decoded and quantitatively mapped on a genome-wide basis. These detailed maps offer direct insights into the fundamental selective forces that shape organisms and, as the malaria example shows, can link seemingly minor molecular details to world-wide population dynamics8. We anticipate that this strategy will eventually become a routine technique that fills the gap between metabolomics and proteomics.


Chapter Three: DigestR: open source software for the comprehensive analysis of protein catabolism

From my review of the literature it was evident that no single software tool was well suited to projecting peptidomic data onto a proteome-wide framework, or for visualizing protein catabolism. To address this, I developed the first software solution capable of revealing the proteolytic activity that occurs in living cells and mapping the resulting peptides to the source proteome. While the immediate goal was to improve the understanding of P. falciparum hemoglobin degradation, protein metabolism is not unique to malaria parasites. We believe this technology has great potential to provide insight into organisms and systems that utilize protein as a source of nutrition.


3.1 Abstract

Despite the important role that protein plays in the metabolism of organisms with the pathways to access it, no publicly available tools currently exist for characterizing or visualizing protein catabolism. Here, we introduce a new open source tool, DigestR, which offers a simple graphics-based method for visualizing in vivo proteolytic cleavage on both the proteome and amino acid scale. DigestR allows identified peptides to be aligned to the proteome while tracking the coinciding frequency at each locus. The resulting ‘coincidence map’ reveals the distinct blueprint of proteolytic activity specific to that organism. We describe how this tool can be used to analyze biological extracts for insight into protein-based nutrient acquisition.

3.2 Introduction

Metabolism has traditionally been studied through the lens of small molecule biochemistry. Typically, time and focus are spent studying pathways such as the citric acid cycle and glycolysis. While obviously important, small molecules only constitute about 3% of the cell while protein constitutes 50-60%34,35. Technical limitations make studying this rich source of nutrition challenging. Consequently, the role and importance that protein catabolism plays in satisfying nutrient acquisition remains poorly characterized. To address this challenge, we have developed the digestomics strategy for the comprehensive analysis of protein catabolism33.

The main hurdle facing the analysis of protein catabolism is not hardware or protocol related. MS has advanced to the point where peptides can be accurately sequenced, identified and quantified in complex biological mixtures56. One challenge is to place these peptides in a biological context and present them to the user in a meaningful way. Peptide mapping is not unique to digestomics and, unsurprisingly, software for it already exists, including Peptigram80, iPiG81, PeptideExtractor (PepEx)82 and EnzymePredictor83. These software tools map peptides of known proteins or proteases, as is typical of a proteomics-style workflow, but they are not well suited to the task of mapping proteome-wide possibilities to empirical data. They rely on the confident knowledge of a parent-child relationship between protein and peptide to generate their maps. In a digestomics workflow, the target proteins of a proteolytic cascade necessary for mobilizing nutrients are typically unknown, resulting in peptides of ambiguous origin. This has direct consequences for how software tools that interact with digestomics must behave.

Another key design objective is to take a list of peptides and perform an untargeted peptide alignment across the entire proteome, whereas the existing tools only perform peptide alignments at the protein level. The untargeted approach provided by DigestR allows for discovery of novel enzymatic activity on known or unknown biological systems alike. Digestomics requires the ability to map short and long peptides. To ensure a confident protein identification from a peptide, short peptides are typically excluded from traditional peptidomic or proteomic searches84,85. To summarize, no publicly available software tools allow peptides to be quantitatively mapped proteome-wide at an amino acid level of resolution. Consequently, I have developed a new, easy to use, open-source graphics-based software tool (DigestR) that specifically addresses this complexity. When a cell is presented with a proteome-wide possibility of nutritional substrates, we want to know what it chooses, and what proteolytic cascades occur in vivo.

3.3 Results and Discussion

3.3.1 Program design objectives

DigestR was designed with four main capabilities in mind: 1) receive a list of peptide sequences as input, 2) map each peptide to the human proteome to generate a global coincidence map, 3) display the coincidence map at a proteome and protein level view, 4) provide a graphical user interface (GUI) for analyzing peptide alignment and protein cleavage details, and for file management and mathematical operations. I developed DigestR using the R statistical software environment and utilized file handling and graphical user interface functions from the previously published rNMR package86.

Figure 3-1. Data pipeline overview. Raw mass spectrometry data are searched using Mascot to identify the list of peptides. The DigestR GUI is used to import the list of peptides and perform preprocessing on the peptide list, to generate coincidence maps and determine the false discovery rate threshold. Data are saved as .dcf files on the file system. The user can access .dcf files for data analysis using the DigestR GUI.

3.3.2 Data pipeline overview

Current software solutions for identifying peptides from tandem mass spectra include SEQUEST87, Mascot45 and MaxQuant88. DigestR uses Mascot to translate raw MS data into the peptide lists that DigestR uses as input (Fig. 3-1). The modular design of DigestR allows alternatives to Mascot to be accommodated with minor programmatic modifications, thus achieving the first design objective. To standardize the data input process, raw mass spectrometry files are converted using the ProteoWizard 3.0.18312 MSConvert program with the following settings: chargeStates 1-7, msLevel 1-2, TPP compatibility enabled, binary encoding precision set to 64-bit and output format set to ‘mgf’. Each converted file is then searched using the Mascot Daemon utility that is included with Mascot Server, using settings configured as outlined in Table 3-1. Finally, the database to search against must be selected. For my project I performed searches against both a full human proteome and a much smaller α and β hemoglobin database.

By default, Mascot filters out all peptides that are less than six residues in length. This is not suitable for digestomics, so the server settings are changed to include peptides three amino acids and longer.

Table 3-1 Mascot Daemon parameter settings. These are the settings used to generate my data. After initial configuration in the parameter editor, desired settings can be saved for future use.

Setting                  Value
Enzyme                   None
Max. missed cleavages    2
Peptide charge           1+, 2+ and 3+
Peptide tol. ±           1.2 Da
MS/MS tol. ±             0.6 Da
MS/MS ions search        On
Variable modification    Deamidated (NQ); Oxidation (HW); Oxidation (M); Trioxidation (C)
Data format              Mascot generic

Actual searches are performed by creating a task in the Mascot Daemon. With a new task created, the converted mgf files must be added to the task and the configured parameter file must be associated with the task. As a final step, by pressing the auto-export button the task can be set to auto-export the search results in a format suitable for DigestR (Table 3-2). Once the searches have completed, Mascot generates a report for each mgf file it received. These reports serve as the input for DigestR, fulfilling the requirements of the first design objective.


Table 3-2 Mascot Daemon export search result settings. These are the settings used to export search results from the Mascot Daemon. Configuring the results in this format ensures that DigestR can import the data in a meaningful manner.

Setting                                              Value
Export format                                        CSV
Significance threshold p<                            0.3 homology
Ions score cut-off                                   0
Max. number of hits                                  AUTO
Protein scoring                                      Standard
Include same-set protein hits                        1
  (additional proteins that span a sub-set of peptides)
Group protein families                               On
Require bold red                                     Off
Preferred Taxonomy*                                  All entries
Search Information                                   Off
Protein Hit Information                              On
  Score                                              On
  Description*                                       On
  Mass (Da)*                                         On
  Other settings under Protein Hit Information       Off
Peptide Match Information                            On
  Experimental Mr (Da)                               On
  Experimental charge                                On
  Calculated Mr (Da)                                 On
  Mass error (Da)                                    On
  Number of missed cleavages                         On
  Score                                              On
  Sequence                                           On
  Variable Modifications                             On
  Other settings under Peptide Match Information     Off
Query Level Information                              Off

3.3.3 Differentiating signal from noise

In proteomics the objective is to characterize or identify a protein; therefore, short peptides are filtered out and longer peptides are weighted more heavily because they can provide a more confident identification of the parent protein. MOWSE, the precursor to Mascot, recommended excluding peptides with a mass less than 800 Da84. The current Mascot implementation recommends striving for as many peptides as possible in the 1000 to 3500 Da mass range for protein identification. At an average of 110 Da per amino acid, this amounts to peptides between 9 and 32 amino acids in length. In the digestomics approach all peptides three amino acids in length and longer are included. However, including these shorter sequences in searches against the human proteome, which features a combined length of well over 11 million residues, will generate many spurious, random hits (noise) along with real results (signal). In digestomics, these small peptides cannot be ignored as they represent an important component of the catabolic processes being studied. As the peptide length increases, the probability of finding a random alignment decreases. After the full peptide list has been aligned across the entire proteome, the noise is randomly distributed, whereas the signal has accumulated in a linear fashion where multiple peptides have overlapped and stacked at coinciding loci.
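The dependence of random alignments on peptide length can be illustrated with a rough calculation. The following Python sketch is not part of DigestR; it assumes, purely for illustration, that residues are independent and uniformly distributed over the 20 standard amino acids and that the proteome is one contiguous sequence of 11 million residues. The function name and constants are hypothetical.

```python
# Hypothetical back-of-the-envelope model (not DigestR code).
# Assumptions: residues are independent and uniformly distributed over the
# 20 standard amino acids, and protein boundaries are ignored.
PROTEOME_LENGTH = 11_000_000  # approximate proteome size quoted in the text
ALPHABET_SIZE = 20


def expected_random_hits(peptide_length: int,
                         proteome_length: int = PROTEOME_LENGTH) -> float:
    """Expected number of chance alignments for one peptide of a given length."""
    positions = proteome_length - peptide_length + 1
    return positions / ALPHABET_SIZE ** peptide_length


if __name__ == "__main__":
    for k in range(3, 10):
        print(f"{k}-mer: ~{expected_random_hits(k):.3g} random alignments expected")
```

Under these simplifying assumptions a tripeptide is expected to align by chance more than a thousand times, whereas a 9-mer is expected to align by chance far less than once, which is why short peptides dominate the noise.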

3.3.4 Coincidence mapping

The second objective of DigestR is to provide proteome-wide visualization of protein catabolism. To realize this objective, it was necessary to develop a new method of scoring alignment. Our strategy is called coincidence scoring and using this across the entire proteome allows the creation of a coincidence map (Fig 3-2). To display this information in an interactive and responsive way, it is necessary to precompute the coincidence map and false discovery rate estimation. DigestR can preprocess peptide list files individually or in batch, allowing time intensive operations to be performed with minimal user interaction. Preprocessed files are saved as dcf files which can be opened, viewed and analyzed at the user’s convenience. Preprocessing time depends primarily on the number of peptides in the input list.


Figure 3-2. Coincidence score generation. Coincidence scores are generated by aligning peptides to the protein sequences proteome-wide. Once all peptides have been aligned to the proteome sequence, the number of alignments at each locus is summed (frequency per locus). The per locus score is the coincidence score, and the distribution of scores creates a coincidence map. This example also demonstrates the peptide visualization option.

Upon receiving a list of peptides, DigestR must generate a coincidence map. This is achieved by aligning each peptide to each coinciding locus across the entire proteome. The frequency with which any peptide has aligned at each locus on the proteome becomes the coincidence score for that locus (Fig. 3-2). The series of coincidence scores from the start to the end of the proteome becomes the coincidence map for the proteome (Fig 3-3A). Protein scores are determined by calculating the average coincidence score per residue contained in the protein.
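The scoring procedure described above can be sketched in a few lines. The following Python example is a minimal illustration under stated assumptions, not the DigestR implementation (which is written in R): it assumes exact string matching, counts every occurrence of every peptide, and uses hypothetical function names and a toy proteome.

```python
from typing import Dict, Iterable, List


def coincidence_map(peptides: Iterable[str],
                    proteome: Dict[str, str]) -> Dict[str, List[int]]:
    """Count, for every residue of every protein, how many peptide alignments cover it."""
    scores = {name: [0] * len(seq) for name, seq in proteome.items()}
    for peptide in peptides:
        if not peptide:
            continue  # skip empty entries in the input list
        for name, seq in proteome.items():
            start = seq.find(peptide)
            while start != -1:
                for i in range(start, start + len(peptide)):
                    scores[name][i] += 1
                # Resume one residue later so overlapping matches are also counted.
                start = seq.find(peptide, start + 1)
    return scores


def protein_score(per_residue: List[int]) -> float:
    """Average coincidence score per residue of one protein."""
    return sum(per_residue) / len(per_residue) if per_residue else 0.0


if __name__ == "__main__":
    toy_proteome = {"frag": "VHLTPEEKSAV", "toy": "AAAA"}  # illustrative sequences only
    toy_peptides = ["VHLT", "LTPEE", "HLTPE", "AA"]
    cmap = coincidence_map(toy_peptides, toy_proteome)
    for name, per_residue in cmap.items():
        print(name, per_residue, round(protein_score(per_residue), 3))
```

In this toy case the three overlapping peptides stack at the same residues of the first sequence, while the short peptide "AA" aligns three times to the second, showing how both true overlap and repetitive short matches raise the per-locus count.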

When viewing the coincidence map at the proteome level, gene names for proteins that have been calculated to be above a threshold are displayed by default (Fig 3-3A). DigestR also displays protein-level coincidence maps, thus realizing the third design objective for the software. Through the GUI, the user can enter the gene name of interest and view the coincidence map for the corresponding protein at an amino acid level of detail (Fig 3-3B).


Figure 3-3. Coincidence mapping. A. Example of a human proteome-wide coincidence map of P. falciparum digested human red blood cells. B. Example of a protein-level human hemoglobin β coincidence map of P. falciparum digested red blood cells. Proteome and protein level coincidence map visualization of protein cleavage fulfills the third design objective.

3.3.5 False discovery rate estimation

DigestR uses a target-decoy approach for false discovery rate estimation89. Since the human proteome is serving as the target search space, a decoy is created by taking a copy of the human proteome and randomly scrambling the amino acid sequence. By randomly reordering the amino acids, the sequence length and residue proportions remain identical. This ensures that the decoy proteome is representative of the target proteome. Any alignments to the decoy proteome can be considered the result of random chance and are representative of false positives. As has been established in the field of NMR, when the noise in the system is random, signal enhancement is proportional to the number of scans and noise grows at a rate proportional to the square root of the number of scans90. An analogous phenomenon occurs in digestomics each time a peptide aligns to a locus where a previous peptide has been mapped, thus improving the signal-to-noise ratio at that location (Eq 3-1).

\[ \frac{\text{Signal}}{\text{Noise}} = \sqrt{N} \tag{3-1} \]
Where N = number of scans.

During the preprocessing step, the mean and standard deviation of the decoy proteome-wide coincidence map scores are calculated. The default false discovery rate threshold is estimated empirically by adding six times the standard deviation of the decoy coincidence scores to the mean of the decoy coincidence scores (Eq 3-2). This threshold is used to determine the default value above which proteins are labelled on the coincidence map plot (Fig 3-3A). The user can override this threshold and adjust it as desired to control the number of peaks being labeled.

\[ \mathrm{FDR}_e = \mu + 6\sigma \tag{3-2} \]
Where FDRe = false discovery rate estimation. µ = mean of decoy proteome-wide coincidence scores. σ = standard deviation of decoy proteome-wide coincidence scores.
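The decoy construction and the threshold of Eq 3-2 can be sketched as follows. This Python example is illustrative only and is not taken from DigestR; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
import random
import statistics
from typing import Sequence


def scramble_sequence(sequence: str, rng: random.Random) -> str:
    """Randomly reorder residues; length and residue proportions are preserved."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def fdr_threshold(decoy_scores: Sequence[float], k: float = 6.0) -> float:
    """Mean of the decoy coincidence scores plus k standard deviations (Eq 3-2 with k = 6).

    Assumption: the population standard deviation is used here.
    """
    return statistics.fmean(decoy_scores) + k * statistics.pstdev(decoy_scores)


if __name__ == "__main__":
    rng = random.Random(2020)  # fixed seed so the decoy is reproducible
    target = "VHLTPEEKSAVTALWGKV"  # illustrative sequence only
    decoy = scramble_sequence(target, rng)
    print(decoy, sorted(decoy) == sorted(target))
    print(fdr_threshold([0, 0, 1, 0, 2, 0, 0, 1]))
```

The decoy coincidence scores passed to the threshold function would come from aligning the same peptide list to the scrambled proteome, so that any accumulation observed there reflects chance alone.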

3.3.6 DigestR user interface

The fourth objective of DigestR is to provide a user-friendly interface to reduce the barrier to entry for new users. The first step in performing any analysis is to have DigestR preprocess the peptide list. To facilitate this process, a dedicated user interface window guides the user through processing Mascot files (Fig. 3-4).


Figure 3-4. Process Mascot files. A screenshot of the process Mascot files window that allows for the preprocessing of single or multiple files depending on the user selection. If ‘All files in folder and sub folders’ is selected, DigestR will ask the user to select a folder and it will attempt to preprocess all files in that folder and all subfolders recursively. Alternatively, if a single file is selected, that file alone will be preprocessed.

Once the files have been preprocessed and are ready for analysis, they can be managed from the file open / close window. This GUI allows the user to browse for files to open, displays currently open files and allows the user to close files from the list of open files (Fig. 3-5).

Figure 3-5. File open / close window. A screenshot displaying the DigestR file open / close window demonstrating the point and click nature of the GUI and ease with which data can be managed.


DigestR also features the ability to add, subtract, multiply or divide all coincidence scores in a sample by a single value as required for scaling purposes. In addition, multi-file operations include unique merge, additive merge, mean, add and subtract for the purposes of assembling and comparing multiple samples of a dataset (Fig 3-6).

Figure 3-6. File Operations window. This is a screenshot of the window that provides advanced user options that affect complete files, such as mathematical vector operations and multi-file operations and utilities. When multiple files are selected, merge and mean operations are enabled; when a single file is selected, only mathematical operations are permitted. Operations are context sensitive; for example, when two files are selected, multiply is not available.
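The single-file and multi-file operations can be thought of as element-wise operations on coincidence score vectors. The following Python sketch illustrates scaling, additive merge and mean under that assumption; the function names are hypothetical and the code does not reproduce the DigestR file format or its R implementation.

```python
from typing import List, Sequence


def scale(scores: Sequence[float], factor: float) -> List[float]:
    """Multiply every coincidence score in one map by a single value."""
    return [s * factor for s in scores]


def additive_merge(*maps: Sequence[float]) -> List[float]:
    """Sum several coincidence maps locus by locus."""
    if len({len(m) for m in maps}) != 1:
        raise ValueError("all coincidence maps must cover the same proteome")
    return [sum(column) for column in zip(*maps)]


def mean_merge(*maps: Sequence[float]) -> List[float]:
    """Average several coincidence maps locus by locus."""
    total = additive_merge(*maps)
    return [t / len(maps) for t in total]


if __name__ == "__main__":
    replicate_a = [1, 2, 0, 4]
    replicate_b = [3, 2, 0, 0]
    print(additive_merge(replicate_a, replicate_b))
    print(mean_merge(replicate_a, replicate_b))
    print(scale(replicate_a, 0.5))
```

Requiring equal lengths reflects the assumption that all maps were generated against the same proteome, which is what makes locus-by-locus comparison meaningful.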

When performing data analysis at the proteome-wide level the default protein labelling threshold is calculated automatically as described in the false discovery rate estimation section and illustrated in figure 3-3A. When the results of the default calculation do not provide what the user desires, it can be overridden using the Name threshold window (Fig. 3-7).


Figure 3-7. Protein label threshold window. Screenshot of the protein label threshold window that allows the user to override the threshold at which proteins are labeled when viewing data on the proteome-wide level. By lowering the default value, more peaks will be labeled.

Once a dataset has been loaded the user can configure the colors of the plot to their preference using the plot settings: plot colors window (Fig. 3-8). In addition to colors, many plot options can be configured according to the needs of the user in the plot settings: spectra window (Fig. 3-9).

Figure 3-8. Plot settings: plot colors window. Screenshot of the plot settings: plot colors window showing the customizability. Colors can be changed for axes, background, peak labels and plot. Additionally, default and high contrast options are available.


Figure 3-9. Plot settings: spectra window. Screenshot of the plot settings: spectra window demonstrating many of the key features of DigestR. Display single protein button prompts the user to type in a gene name and displays the peptide cleavage patterns at that location on the proteome. Display full proteome returns the user to the full proteome view. Peptide accumulation, peptide end points and peptide visualization are all different options for visualizing proteolysis at the amino acid level. Variance can be displayed if the selected dataset is the mean of multiple datasets.

One of the most powerful features provided by DigestR is overlays. Overlays allow the visual comparison of data in the same context. Whether it is replicates or comparing wildtype samples to mutant samples, it is critical to be able to evaluate data visually in the same context (Fig. 3-10). By loading multiple files into DigestR, select files can then be displayed concurrently by adding them to an overlay list (Fig 3-11).


Figure 3-10. Overlays example. Example figure showing six datasets: three replicates each of wildtype and of a falcipain knockout in P. falciparum, where each replicate was collected on a different day. Using the overlay functionality, replicates can be compared within a sample type and compared to replicates of another type concurrently to get both a qualitative and quantitative view of the data.


Figure 3-11. Overlays window. Screenshot of the overlays window demonstrating how files can be added or removed from a list of files to be displayed concurrently on the plot. Individual files may be assigned their own color and a legend will automatically be generated. Legend can be suppressed by unchecking the ‘Display names of overlaid spectrum on plot’ option. Path of spectra can be suppressed by checking the corresponding checkbox.

3.4 Conclusions

DigestR is an open source software tool designed and developed to allow the visualization and analysis of protein catabolism from a proteome-wide to a protein specific resolution. This software provides critical insight into in vivo protease cleavage patterns, determines where peptides are derived from within a proteome and allows differential digestion to be compared using protease knockout experiments. Ultimately, this software provides the bioinformatics tools necessary to realize the aims of the digestomics approach33.


Chapter Four: Critical evaluation of protein catabolism in Plasmodium falciparum


4.1 Abstract

Hemoglobin catabolism is known to be a vital activity in the P. falciparum life cycle7,25,60. Despite the critical nature of this enzymatic pathway, strains with single enzyme knockouts have demonstrated minimal impact on growth, while strains with multiple enzyme knockouts retain viability11,27,29,30,32. The contrast between these two facts has been recognized; however, it has generally been rationalized as the result of enzymatic redundancy in the proteolytic cascade11,21,28. Although this hypothesis may be true, the obvious alternative hypothesis is that the known proteases are not responsible for essential functions. Differentiating between these hypotheses requires investigating protein catabolism in live cells, which until recently has not been feasible in malaria parasites. The digestomics strategy and DigestR software I have developed allow me to perform an in vivo analysis of protease activity in a collection of P. falciparum knockouts. My data suggest that the tested knockouts have no discernible impact on the in vivo catabolism of hemoglobin. Furthermore, obvious proteolytic cleavage patterns emerge, indicating strong candidate loci for previously unidentified cut sites and suggesting that published cut sites may not accurately predict in vivo activity.

4.2 Introduction

Most of what is currently known about hemoglobin metabolism has been discovered through in vitro methods. Whether these methods rely on recombinant enzymes9,10,22,23, expression vectors10,22,23,91–94 or the exposure of purified proteases to panels of hemoglobin-derived peptides, the problem with this approach is context. In vitro derived data must be evaluated in vivo to determine biological relevance. The data presented here were collected in vivo from human RBCs infected with P. falciparum of the 3D7 genotype background. Analyses were performed on both the α and β chains of hemoglobin for each of the four sample types.


For reference, published enzymatic cut sites have been included on each protein-level figure and are indicated by a dotted line with a vertical label naming the associated protease. At loci where multiple proteases have been identified, multiple labels appear adjacent to the line. Plasmepsin I, II and falcipain enzymatic cut sites were published as aspartic hemoglobinase I, II and cysteine protease respectively25, whereas the falcilysin cut sites were published independently17.

4.3 Methods

4.3.1 Parasite growth

Parasites were grown synchronously using established methods95 in RPMI 1640 (Sigma) with 25 mM HEPES (Sigma), 100 µM hypoxanthine (Sigma), 10 µg/mL gentamycin (Gibco) and 2.5 g/L Albumax II (Gibco). Cultures were maintained at 5% hematocrit in a 37 °C incubator with a gas mixture of 5% CO2, 6% O2, and 89% N2 and treated with 5% sorbitol at 0, 48 and 56 hours to triple synchronize. Starting at 34 hours after the last sorbitol treatment, blood smears were performed every two hours to determine invasion time. Time zero was designated when cultures reached 95% rings. Samples were harvested at 12, 24 and 36 hours8.

4.3.2 Sample preparation

Samples for digestomics analysis were derived from metabolite extracts. Aliquots of each metabolite extract were harvested and diluted 1:4 in a 3% acetonitrile and 0.1% formic acid solution (final concentrations)8,96. The metabolite extraction protocol was adapted from an established method96. In brief, metabolites were extracted by suspension of 50 µL packed cells in 1 mL of 4 °C 90% methanol. Samples were vortexed and sonicated as necessary to disrupt the cell pellet and generate a uniform homogenate. The homogenates were centrifuged (13,000 x g) for 10 minutes and the supernatants were harvested and stored at -80 °C as 90% methanolic extracts until metabolic analysis. Prior to analysis, extracts were dried under a stream of N2 gas and resuspended in 200 µL H2O8,96.

4.3.3 Data acquisition

Data were acquired in positive mode on an LTQ-Orbitrap using nanospray from a 120-minute hydrophilic interaction liquid chromatography (HILIC) gradient. Each sample was scanned at both low (150-500 m/z) and high (450-1800 m/z) mass ranges to allow for multiple charge states, resulting in two raw mass spectrometry files for each sample. MS2 scans were automatically performed on fragments from each of the top seven signals observed at any given time. These data were collected at the Princeton mass spectrometry facility8.

4.3.4 Data analysis

Four 3D7 background strains were selected for these analyses. Included were two DV protease knockouts, a non-DV protease knockout to serve as a negative control and a wildtype strain without a knockout to serve as a control. In total, six biological replicates were analyzed for each strain (Table 4-1).

Table 4-1. Outline of 3D7 file organization. All samples were cultured and collected in parallel to the wildtype 3D7. Two replicates were collected on each day. The experiment was performed on three different days. (Other sample types not shown.)

Date of collection  Sample type  Source Files                    Merged File    Replicate
2012.01.18          3D7          3D7_2_150_500, 3D7_2_450_1800   3D7_2__merged  1
2012.01.18          3D7          3D7_3_150_500, 3D7_3_450_1800   3D7_3__merged  2
2013.01.31          3D7          3D7_2_150_500, 3D7_2_450_1800   3D7_2__merged  3
2013.01.31          3D7          3D7_3_150_500, 3D7_3_450_1800   3D7_3__merged  4
2013.02.19          3D7          3D7_2_150_500, 3D7_2_450_1800   3D7_2__merged  5
2013.02.19          3D7          3D7_3_150_500, 3D7_3_450_1800   3D7_3__merged  6

The low and high mass MS files were each processed by Mascot for the identification of peptides, which were saved as lists in their respective files (Fig 4-1). These peptide lists were imported into DigestR, where the unique peptides of the merged lists were used to generate a coincidence map that was saved to the file system. After coincidence map generation was completed, the two files per date for each sample type were merged additively, resulting in a single coincidence map representing each sample type for that collection date.

Figure 4-1. Data processing outline for the critical evaluation of in vivo protein catabolism. Raw MS data are organized by collection date. Each sample is collected at low mass range and high mass range. Mascot performs a search on each file to generate a list of identified peptides. The peptide lists are imported into DigestR in corresponding pairs, where the unique peptides are compiled into a complete peptide list and used to generate a coincidence map. This coincidence map is saved to the file system. After the generation of all coincidence files, I used DigestR to merge the sample replicates that were collected on the same date.
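The step in which the low and high mass range peptide lists are combined into one list of unique peptides can be sketched as below. This Python example is a hypothetical illustration, not DigestR code; it assumes peptides are compared as exact sequence strings and that the order of first appearance is kept.

```python
from typing import Iterable, List


def unique_merge(*peptide_lists: Iterable[str]) -> List[str]:
    """Combine peptide lists, keeping each sequence once in order of first appearance."""
    seen = set()
    merged = []
    for peptides in peptide_lists:
        for peptide in peptides:
            if peptide not in seen:
                seen.add(peptide)
                merged.append(peptide)
    return merged


if __name__ == "__main__":
    low_mass = ["VHL", "LTPEE"]    # illustrative peptides from the 150-500 m/z file
    high_mass = ["LTPEE", "KSAV"]  # illustrative peptides from the 450-1800 m/z file
    print(unique_merge(low_mass, high_mass))
```

Deduplicating before alignment means a peptide detected in both mass ranges contributes once to the coincidence map for that sample, whereas the later additive merge across replicates sums the resulting maps.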

4.4 Results and Discussion

Although the importance of the proteome-wide coincidence map has been stressed in previous chapters, a single representative figure is included in these analyses due to the overwhelmingly similar nature of their appearance. This proteome-wide view shows strong peaks in chromosome 10 and chromosome 16 corresponding with the coding regions of hemoglobin-β and hemoglobin-α, respectively (Fig 4-1). Hemoglobin, being the well-established target of P. falciparum catabolism and the most abundant protein in RBCs, unsurprisingly stands out in this 3D7 analysis. In organisms with poorly characterized or more varied diets, I would anticipate this figure to be more intriguing.

Figure 4-1. Proteome-wide view of 3D7. Figure generated using DigestR displaying the proteome-wide coincidence map of protein digestion by 3D7 P. falciparum. Default settings.

4.4.1 Critical evaluation of published cut sites

Overlaying all available datasets to examine the overall trends provides the opportunity to critically evaluate published cut sites. Coincidence map analysis is rather intuitive: peaks correspond to the accumulation of peptides at a specific locus, and troughs correspond to loci where peptides have aligned less frequently (Fig. 4-2). Additionally, extended regions of low accumulation could be candidates for cut sites that are obscured by the follow-up activity of exopeptidases removing single amino acids and dipeptides.


Figure 4-2. Protein-wide coincidence map (amino acid detail). Protein-wide coincidence map displaying amino acid detail to illustrate peptide accumulation and the correlation with peptide alignment and proteolytic cleavage. Peaks correlate with peptide accumulation, whereas troughs correlate with loci where peptides less frequently span the nearby sequence.
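The link between troughs and cleavage can be made concrete by looking at peptide bonds rather than residues. The following Python sketch is an illustrative complement to the coincidence map, not a feature of DigestR. It assumes exact sequence matching and considers only the first match of each peptide; the function name and the toy data are hypothetical. A bond that few peptides span but at which many peptides end is the pattern expected at a cut site.

```python
from typing import Iterable, List, Tuple


def bond_profile(protein: str, peptides: Iterable[str]) -> Tuple[List[int], List[int]]:
    """For each peptide bond, count peptides spanning it and peptides ending at it.

    Bond b joins residue b and residue b + 1 (0-based indices).
    """
    n_bonds = len(protein) - 1
    spanning = [0] * n_bonds
    termini = [0] * n_bonds
    for peptide in set(peptides):
        if not peptide:
            continue
        start = protein.find(peptide)  # first exact match only
        if start == -1:
            continue
        end = start + len(peptide) - 1  # index of the last residue
        for b in range(start, end):
            spanning[b] += 1
        if start > 0:  # N-terminal end created at bond start - 1
            termini[start - 1] += 1
        if end < n_bonds:  # C-terminal end created at bond end
            termini[end] += 1
    return spanning, termini


# Toy data, invented for illustration only.
toy_protein = "ACDEFGHIKL"
toy_peptides = ["ACDE", "FGHI", "CDE", "FGHIKL"]
spanning, termini = bond_profile(toy_protein, toy_peptides)
# Bond 3 (between E and F) is spanned by no peptide and flanked by four termini.
```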

4.4.1.1 Hemoglobin-α

The leftmost and rightmost published falcilysin in vitro loci appear to have translated very well in vivo, as they map to significant valleys in peptide accumulation (Fig 4-3). The middle falcilysin cut site, however, looks highly questionable, as it corresponds with the locus of the greatest accumulation of contiguous peptides. Unfortunately, falcilysin has not been successfully knocked out, so it was not available for inclusion in the knockout analyses32. The leftmost cut site is difficult to evaluate: it corresponds with a peptide valley, suggesting a cut site in the area, but no perturbations were observed when the falcipains were knocked out, which gives reason to doubt the identity of the responsible protease.

At the second cut site from the left, no visible disruptions were observed. Three proteases have been identified as capable of cleaving at that locus, so this lack of disruption could be due to enzymatic redundancy. Both the fourth cut site from the left and the third from the right have been identified as plasmepsin I cut sites; however, there are simply too many intact peptides spanning these loci to support the published in vitro characterizations. The two rightmost cut sites are situated in peptide accumulation valleys, suggesting likely cut site locations; however, the lack of perturbation when the plasmepsins are knocked out raises questions regarding the identification of the causative proteases.

Figure 4-3. Full dataset overlay of hemoglobin-α. Protein-level coincidence map of hemoglobin-α digestion by P. falciparum. Included are FP1KO, F2, C10 and 3D7 datasets. Each line represents two biological replicates.

4.4.1.2 Hemoglobin-β

There are four cut sites attributed to falcilysin on hemoglobin-β (Fig 4-4). As mentioned earlier, falcilysin has not been successfully knocked out, so these sites are not evaluated on knockout response. Looking strictly at correlation with peptide accumulation, the leftmost and rightmost seem reasonable. The falcilysin site second from the right appears near a peak of peptide accumulation and is unlikely.

The falcilysin site third from the right falls in an area between two peptide accumulation peaks, however the high degree of variability between the datasets in this region makes a confident evaluation difficult. The plasmepsin I cut sites look valid, yet the lack of change in peptide distribution in the knockout mutants does raise questions.

The second cut site from the left, shared by falcipain and plasmepsin II, did not show perturbation under any knockout conditions. However, with multiple known proteases capable of cleaving at this locus, this result is not surprising. The relatively low peptide accumulation around this locus reinforces the idea that this is likely a true cut site. Finally, fifth from the left is the second falcipain cut site, which is noisy across these data. Due to the lack of a clear trend, this locus cannot be evaluated.


Figure 4-4. Full dataset overlay of hemoglobin-β. Protein-level coincidence map of hemoglobin-β digestion by P. falciparum. Included are FP1KO, F2, C10 and 3D7 datasets. Each line represents two biological replicates.
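The evaluations in this section were made by visual inspection. As a hedged illustration of the same reasoning, the Python sketch below labels each published site by comparing local peptide accumulation with the protein-wide median. The function name, the window size, the site names and the example map are all hypothetical, and the median is only one of several possible reference levels.

```python
from statistics import median
from typing import Dict, List


def classify_sites(cmap: List[int], sites: Dict[str, int], window: int = 2) -> Dict[str, str]:
    """Label each site 'valley' or 'peak' by its local mean accumulation.

    A site in a valley is consistent with cleavage at that locus; a site
    under a peak is spanned by many intact peptides and is questionable.
    """
    reference = median(cmap)
    labels = {}
    for name, position in sites.items():
        lo = max(0, position - window)
        hi = min(len(cmap), position + window + 1)
        local_mean = sum(cmap[lo:hi]) / (hi - lo)
        labels[name] = "valley" if local_mean < reference else "peak"
    return labels


# Invented coincidence map and site positions, for illustration only.
example_map = [1, 1, 1, 9, 9, 9, 9, 9, 1, 1]
example_sites = {"site_a": 1, "site_b": 5}
labels = classify_sites(example_map, example_sites, window=1)
```

A label of "valley" does not identify the responsible protease; as discussed above, that assignment also depends on the response of the locus to the corresponding knockout.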

4.4.2 Predicted loci of unidentified proteases

Using the same approach as that used to evaluate the published cut sites, the pooled data can be analyzed for evidence of proteolytic cleavage that has yet to be documented. Surprisingly, very strong patterns emerge suggesting undiscovered loci of in vivo enzymatic activity.

4.4.2.1 Hemoglobin-α

Analysis of the α-chain provides strong evidence for at least eight undocumented candidate cut sites (Fig. 4-5). At each of these loci a dramatic decrease in peptide accumulation occurs that is common across all datasets.

Figure 4-5. Predicted proteolytic cut sites of hemoglobin-α. Protein-level coincidence map of hemoglobin-α digestion by P. falciparum. Included are FP1KO, F2, C10 and 3D7 datasets. Red arrows represent predicted locations of in vivo protease cut sites. Each line represents two biological replicates.


4.4.2.2 Hemoglobin-β

Analysis of the β-chain suggests at least ten undocumented cut sites (Fig. 4-6). At each of these loci, a dramatic change in peptide accumulation occurs that is consistent across all datasets.

Figure 4-6. Predicted proteolytic cut sites of hemoglobin-β. Protein-level coincidence map of hemoglobin-β digestion by P. falciparum. Included are FP1KO, F2, C10 and 3D7 datasets. Red arrows represent predicted locations of in vivo protease cut sites. Each line represents two biological replicates.
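The candidate sites in Figures 4-5 and 4-6 were identified by eye. One way the criterion of a dramatic change shared by all datasets could be automated is sketched below in Python. This is an assumption-laden illustration rather than the method used here: each map is scaled to its own maximum, and a bond is reported only if the change in scaled accumulation across it meets a threshold in every dataset. The threshold value and the example maps are invented.

```python
from typing import List, Sequence


def normalize(cmap: Sequence[float]) -> List[float]:
    """Scale a coincidence map so that its highest point equals 1."""
    peak = max(cmap)
    return [value / peak if peak else 0.0 for value in cmap]


def consensus_steps(maps: Sequence[Sequence[float]], min_step: float = 0.3) -> List[int]:
    """Return bonds i (between residues i and i + 1) where every dataset
    changes by at least min_step in normalized accumulation."""
    scaled = [normalize(m) for m in maps]
    length = len(scaled[0])
    if any(len(m) != length for m in scaled):
        raise ValueError("all maps must cover the same protein")
    return [
        i
        for i in range(length - 1)
        if all(abs(m[i + 1] - m[i]) >= min_step for m in scaled)
    ]


# Two invented datasets for the same six-residue stretch.
dataset_a = [10, 10, 2, 2, 8, 8]
dataset_b = [5, 5, 1, 1, 5, 4]
candidates = consensus_steps([dataset_a, dataset_b])
```

Requiring agreement across every dataset is deliberately conservative: a step seen in only one strain or replicate is not reported.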

4.4.3 Analysis of in vivo proteolysis

4.4.3.1 3D7 - wildtype

The 3D7 coincidence map serves two purposes. It is the first in vivo map of proteolysis by P. falciparum, and it is a standard against which to compare proteolytic activity of the knockout mutants. While there is variability between replicates, distinct digestion patterns are immediately obvious. Comparing the hemoglobin-α coincidence map (Fig 4-7) to the hemoglobin-β coincidence map (Fig 4-8) reveals characteristic digestive fingerprints distinct to each protein.

Figure 4-7. 3D7 Hemoglobin-α. Protein-level coincidence map of hemoglobin-α digestion by wildtype 3D7 P. falciparum. Each line represents two biological replicates.


Figure 4-8. Wildtype Hemoglobin-β. Protein-level coincidence map of hemoglobin-β digestion by wildtype 3D7 P. falciparum. Each line represents two biological replicates.

4.4.3.2 FP1KO – falcipain-1 knockout

FP1KO is a 3D7 P. falciparum parasite with the falcipain-1 protease knocked out. This mutant was graciously provided by the Rosenthal lab. Falcipain-1 was originally believed to participate in hemoglobin digestion in the DV, but was later determined to be localized externally to the DV and to play a role in erythrocyte invasion97. This strain serves as a negative control because it carries a knockout of a protease that does not participate in hemoglobin proteolysis. Comparing the hemoglobin-α coincidence map of FP1KO with that of 3D7 reveals no discernible difference (Fig 4-9). Variability between replicates appears as great as that between sample types. Knocking out falcipain-1 has no observable effect on proteolytic cleavage, which matches the expected outcome.
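The observation that replicates differ from one another as much as strains do was made visually. A minimal Python sketch of how it could be quantified follows. It is an illustration under stated assumptions, not an analysis performed in this work: maps are scaled to their own maximum, distance is the mean absolute difference per residue, and the group names and values are invented.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def normalize(cmap: Sequence[float]) -> List[float]:
    """Scale a coincidence map so that its highest point equals 1."""
    peak = max(cmap)
    return [value / peak if peak else 0.0 for value in cmap]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-residue difference between two normalized maps."""
    return mean(abs(x - y) for x, y in zip(normalize(a), normalize(b)))


def within_between(groups: Dict[str, List[Sequence[float]]]) -> Tuple[float, float]:
    """Average distance between replicates of one strain and between strains."""
    within = [
        distance(a, b)
        for maps in groups.values()
        for a, b in combinations(maps, 2)
    ]
    between = [
        distance(a, b)
        for (_, maps_1), (_, maps_2) in combinations(groups.items(), 2)
        for a in maps_1
        for b in maps_2
    ]
    return mean(within), mean(between)


# Invented maps: replicates agree within a strain but strains differ.
toy_groups = {"wildtype": [[2, 4], [1, 2]], "knockout": [[4, 2], [2, 1]]}
within_distance, between_distance = within_between(toy_groups)
```

A between-strain distance no larger than the within-strain distance would correspond to the situation described above, where the knockout cannot be distinguished from replicate-to-replicate variability.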

Figure 4-9. FP1KO Hemoglobin-α. Protein-level coincidence map of hemoglobin-α digestion by 3D7 P. falciparum with protease falcipain-1 knocked out.

The hemoglobin-β coincidence map of FP1KO and 3D7 again demonstrates no significant difference (Fig 4-10). Again, this matches the expected outcome.


Figure 4-10. FP1KO Hemoglobin-β. Protein-level coincidence map of hemoglobin-β digestion by 3D7 P. falciparum with protease falcipain-1 knocked out.

4.4.3.3 F2 – falcipain-2 knockout

F2 is a 3D7-derived P. falciparum strain with the falcipain-2 protease knocked out, courtesy of the Goldberg lab. Much like what was observed in the falcipain-1 knockout, nothing stands out between the knockout and wildtype coincidence phenotypes. While there is a single replicate suggesting an accumulation of peptides upstream of the first published falcipain cut site, it is not reflected in the other two replicates and therefore cannot be considered significant at this time (Fig 4-11).

Figure 4-11. F2 Hemoglobin-α. Protein-level coincidence map of hemoglobin-α digestion by 3D7 P. falciparum with protease falcipain-2 knocked out.

The coincidence map for hemoglobin-β of the falcipain-2 knockout yet again demonstrates no apparent perturbations in digestion resulting from falcipain-2 being knocked out (Fig 4-12).


Figure 4-12. F2 Hemoglobin-β. Protein-level coincidence map of hemoglobin-β digestion by 3D7 P. falciparum with protease falcipain-2 knocked out.

4.4.3.4 C10 – plasmepsin I, II, III and HAP knockout

The last knockout mutant I examined is C10. It is a 3D7 P. falciparum parasite in which all four DV plasmepsins have been knocked out, courtesy of the Dame laboratory. Of all the datasets, this was the one I considered most likely to display a phenotype. Depriving a parasite of four proteases simultaneously, representing the entire aspartic protease family of the DV, would seem likely to impact the parasite’s proteolytic cascade. Yet, in both the α (Fig. 4-13) and β (Fig. 4-14) chains, the coincidence maps remained unperturbed both overall and at published cut sites.

Figure 4-13. C10 Hemoglobin-α. Protein-level coincidence map of hemoglobin-α digestion by 3D7 P. falciparum with all DV localized plasmepsin proteases knocked out. Each line represents two biological replicates.


Figure 4-14. C10 Hemoglobin-β. Protein-level coincidence map of hemoglobin-β digestion by 3D7 P. falciparum with all DV localized plasmepsin proteases knocked out. Each line represents two biological replicates.

4.4.4 Knockout summary

Comparing the in vivo digestion results of wildtype P. falciparum against each of three different knockout mutants revealed no significant variation in peptide generation or accumulation. This offers compelling evidence that the knocked-out proteases are not essential for P. falciparum protein catabolism.

4.5 Conclusions

My data indicate that the tested knockouts do not impact the in vivo generated peptides or the overall proteolytic cleavage patterns created by the digestion of hemoglobin by P. falciparum.

Had the knocked-out proteases played an essential role in the hemoglobin proteolytic cascade, I would predict that an obvious, observable change would be seen in the coincidence maps of the knockout strains. The evaluation of published cut sites suggests that in vitro discoveries are not always reflective of what is occurring in live cells. Finally, using the combined data I have highlighted loci where proteolysis is likely occurring. These locations should prove helpful in the identification of yet to be characterized proteases.


To summarize, I created the digestomics approach (Chapter 2) and the accompanying DigestR software (Chapter 3) for the elucidation of protein catabolism to investigate the P. falciparum proteolytic cascade (Chapter 4). Through my studies I have discovered that most of what is known about protein catabolism has been determined in vitro. Unfortunately, what is uncovered in vitro does not always equate to what happens in vivo. This analysis provides a blueprint for how digestomics can be used to reveal in vivo protein catabolism. More P. falciparum datasets stand ready for analysis and current graduate students in our lab plan to use this software for their projects. Digestomics, however, is not limited to hemoglobin catabolism, but can be applied to any protein and potentially any proteome.

4.6 Future directions

4.6.1 Limitations

4.6.1.1 Positive control

Due to time constraints, a planned positive control experiment was not completed. The proposed experiment was to digest a purified, sequenced, human protein with a well-characterized protease such as trypsin and to run the resulting peptides through the described digestomics platform. Comparing the resulting digestomics analysis with the predicted cleavage patterns of the known protease on the protein sequence would allow evaluation of the accuracy of the platform.
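The predicted side of the proposed comparison can be sketched in Python. The code below is illustrative only and was not run as part of this work. It uses the commonly applied trypsin rule (cleavage C-terminal to lysine or arginine, except when the next residue is proline); the function names, the toy sequence, the hypothetical observed sites and the tolerance parameter are all assumptions introduced for the example.

```python
from typing import List


def tryptic_sites(protein: str) -> List[int]:
    """Indices i where trypsin is expected to cut between residues i and i + 1."""
    return [
        i
        for i in range(len(protein) - 1)
        if protein[i] in "KR" and protein[i + 1] != "P"
    ]


def digest(protein: str) -> List[str]:
    """Peptides expected from a complete digest at the predicted sites."""
    peptides, start = [], 0
    for i in tryptic_sites(protein):
        peptides.append(protein[start : i + 1])
        start = i + 1
    peptides.append(protein[start:])
    return peptides


def site_recovery(predicted: List[int], observed: List[int], tolerance: int = 0) -> float:
    """Fraction of predicted sites with an observed site within tolerance residues."""
    if not predicted:
        return 0.0
    hits = sum(any(abs(p - o) <= tolerance for o in observed) for p in predicted)
    return hits / len(predicted)


# Toy sequence and hypothetical observed cut sites, for illustration only.
toy_protein = "MVLSPADKTNVKAAWGKVGA"
predicted = tryptic_sites(toy_protein)
expected_peptides = digest(toy_protein)
recovery = site_recovery(predicted, observed=[7, 12, 30], tolerance=1)
```

A recovery close to 1 for a known protease would support the accuracy of the platform, while predicted sites that are never observed would point to detection limits of the kind discussed in the next section.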

4.6.1.2 Sources of concern

During coincidence map interpretation, peptide absence is associated with loci of proteolytic cleavage. There is, however, another possibility. The digestomics platform relies upon LC-MS for data acquisition and is therefore vulnerable to the inherent limitations of the LC-MS technique. For an analyte to be detected by MS it must be ionized, and to be ionized it must be capable of holding a charge. MS can be strongly impacted by “matrix effects”, a term used to describe the variability with which the instrument responds to an analyte under differing conditions. The most common matrix effect is ion suppression, which occurs when components of the matrix impair ionization of a target analyte by interfering with its ability to gain a charge or reach gas phase. Unfortunately, positive mode, electrospray ionization LC-MS is most prone to ion suppression and this constraint must be kept in mind when forming conclusions98.


References

1. Cox, F. E. History of the discovery of the malaria parasites and their vectors. Parasites and Vectors 3, (2010).

2. WHO. World malaria report 2018. Available at: https://www.who.int/malaria/publications/world-malaria-report-2018/en/. (Accessed: 16th July 2019)

3. CDC. Malaria’s Impact Worldwide. (2019). Available at: https://www.cdc.gov/malaria/malaria_worldwide/impact.html. (Accessed: 15th August 2019)

4. CDC. Malaria - Biology - Lifecycle. (2019). Available at: https://www.cdc.gov/malaria/about/biology/index.html. (Accessed: 7th October 2019)

5. Lew, V. L., Tiffert, T. & Ginsburg, H. Excess hemoglobin digestion and the osmotic stability of Plasmodium falciparum-infected red blood cells. Blood 101, 4189–4194 (2003).

6. Krugliak, M., Zhang, J. & Ginsburg, H. Intraerythrocytic Plasmodium falciparum utilizes only a fraction of the amino acids derived from the digestion of host cell cytosol for the biosynthesis of its proteins. Molecular and Biochemical Parasitology 119, 249–256 (2002).

7. Francis, S. E., Sullivan, D. J. & Goldberg, D. E. Hemoglobin metabolism in the malaria parasite Plasmodium falciparum. Annual Review of Microbiology 51, 97–123 (1997).

8. Lewis, I. A., Wacker, M., Olszewski, K. L., Cobbold, S. A., Baska, K. S., Tan, A., Ferdig, M. T. & Llinás, M. Metabolic QTL Analysis Links Chloroquine Resistance in Plasmodium falciparum to Impaired Hemoglobin Catabolism. PLoS Genetics 10, e1004085 (2014).

9. Tyas, L., Gluzman, I., Moon, R. P., Rupp, K., Westling, J., Ridley, R. G., Kay, J., Goldberg, D. E. & Berry, C. Naturally-occurring and recombinant forms of the aspartic proteinases plasmepsins I and II from the human malaria parasite Plasmodium falciparum. FEBS Letters 454, 210–214 (1999).

10. Banerjee, R., Liu, J., Beatty, W., Pelosof, L., Klemba, M. & Goldberg, D. E. Four plasmepsins are active in the Plasmodium falciparum food vacuole, including a protease with an active-site histidine. Proceedings of the National Academy of Sciences of the United States of America 99, 990–995 (2002).

11. Omara-Opyene, A. L., Moura, P. A., Sulsona, C. R., Bonilla, J. A., Yowell, C. A., Fujioka, H., Fidock, D. A. & Dame, J. B. Genetic disruption of the Plasmodium falciparum digestive vacuole plasmepsins demonstrates their functional redundancy. Journal of Biological Chemistry 279, 54088–54096 (2004).

12. Singh, N., Sijwali, P. S., Pandey, K. C. & Rosenthal, P. J. Plasmodium falciparum: Biochemical characterization of the cysteine protease falcipain-2′. Experimental Parasitology 112, 187–192 (2006).

13. Salas, F., Fichmann, J., Lee, G. K., Scott, M. D. & Rosenthal, P. J. Functional expression of falcipain, a Plasmodium falciparum cysteine proteinase, supports its role as a malarial hemoglobinase. Infection and Immunity 63, 2120–2125 (1995).

14. Francis, S. E., Gluzman, I. Y., Oksman, A., Banerjee, D. & Goldberg, D. E. Characterization of native falcipain, an enzyme involved in Plasmodium falciparum hemoglobin degradation. Molecular and Biochemical Parasitology 83, 189–200 (1996).

15. Rosenthal, P. J., McKerrow, J. H., Aikawa, M., Nagasawa, H. & Leech, J. H. A malarial cysteine proteinase is necessary for hemoglobin degradation by Plasmodium falciparum. Journal of Clinical Investigation 82, 1560–1566 (1988).

16. Murata, C. E. & Goldberg, D. E. Plasmodium falciparum falcilysin: A metalloprotease with dual specificity. Journal of Biological Chemistry 278, 38022–38028 (2003).

17. Eggleson, K. K., Duffin, K. L. & Goldberg, D. E. Identification and characterization of falcilysin, a metallopeptidase involved in hemoglobin catabolism within the malaria parasite Plasmodium falciparum. Journal of Biological Chemistry 274, 32411–32417 (1999).

18. Klemba, M., Gluzman, I. & Goldberg, D. E. A Plasmodium falciparum dipeptidyl aminopeptidase I participates in vacuolar hemoglobin degradation. Journal of Biological Chemistry 279, 43000–43007 (2004).

19. Ragheb, D., Bompiani, K., Dalal, S. & Klemba, M. Evidence for catalytic roles for Plasmodium falciparum aminopeptidase P in the food vacuole and cytosol. Journal of Biological Chemistry 284, 24806–24815 (2009).

20. Ragheb, D., Dalal, S., Bompiani, K. M., Ray, W. K. & Klemba, M. Distribution and biochemical properties of an M1-family aminopeptidase in Plasmodium falciparum indicate a role in vacuolar hemoglobin catabolism. Journal of Biological Chemistry 286, 27255–27265 (2011).

21. Sijwali, P. S. & Rosenthal, P. J. Gene disruption confirms a critical role for the cysteine protease falcipain-2 in hemoglobin hydrolysis by Plasmodium falciparum. Proceedings of the National Academy of Sciences of the United States of America 101, 4384–9 (2004).

22. Siripurkpong, P., Yuvaniyama, J., Wilairat, P. & Goldberg, D. E. Active site contribution to specificity of the aspartic proteases plasmepsins I and II. Journal of Biological Chemistry 277, 41009–41013 (2002).

23. Shenai, B. R., Sijwali, P. S., Singh, A. & Rosenthal, P. J. Characterization of native and recombinant falcipain-2, a principal trophozoite cysteine protease and essential hemoglobinase of Plasmodium falciparum. Journal of Biological Chemistry 275, 29000–29010 (2000).

24. Goldberg, D. E., Slater, A. F. G., Beavis, R., Chait, B., Cerami, A. & Henderson, G. B. Hemoglobin degradation in the human malaria pathogen Plasmodium falciparum: A catabolic pathway initiated by a specific aspartic protease. Journal of Experimental Medicine 173, 961–969 (1991).

25. Gluzman, I. Y., Francis, S. E., Oksman, A., Smith, C. E., Duffin, K. L. & Goldberg, D. E. Order and specificity of the Plasmodium falciparum hemoglobin degradation pathway. Journal of Clinical Investigation 93, 1602–8 (1994).

26. Subramanian, S., Hardt, M., Choe, Y., Niles, R. K., Johansen, E. B., Legac, J., Gut, J., Kerr, I. D., Craik, C. S. & Rosenthal, P. J. Hemoglobin cleavage site-specificity of the Plasmodium falciparum cysteine proteases falcipain-2 and falcipain-3. PLoS ONE 4, (2009).

27. Bonilla, J. A., Bonilla, T. D., Yowell, C. A., Fujioka, H. & Dame, J. B. Critical roles for the digestive vacuole plasmepsins of Plasmodium falciparum in vacuolar function. Molecular Microbiology 65, 64–75 (2007).

28. Liu, J., Istvan, E. S., Gluzman, I. Y., Gross, J. & Goldberg, D. E. Plasmodium falciparum ensures its amino acid supply with multiple acquisition pathways and redundant proteolytic enzyme systems. Proceedings of the National Academy of Sciences of the United States of America 103, 8840–8845 (2006).

29. Liu, J., Gluzman, I. Y., Drew, M. E. & Goldberg, D. E. The role of Plasmodium falciparum food vacuole plasmepsins. The Journal of Biological Chemistry 280, 1432–7 (2005).

30. Bonilla, J. A., Moura, P. A., Bonilla, T. D., Yowell, C. A., Fidock, D. A. & Dame, J. B. Effects on growth, hemoglobin metabolism and paralogous gene expression resulting from disruption of genes encoding the digestive vacuole plasmepsins of Plasmodium falciparum. International Journal for Parasitology 37, 317–327 (2007).

31. Sijwali, P. S., Koo, J., Singh, N. & Rosenthal, P. J. Gene disruptions demonstrate independent roles for the four falcipain cysteine proteases of Plasmodium falciparum. Molecular and Biochemical Parasitology 150, 96–106 (2006).

32. Goldberg, D. E. & Rosenthal, P. J. Hemoglobin Digestion. Encyclopedia of Malaria 1–9 (Springer New York, 2013). doi:10.1007/978-1-4614-8757-9_7-1

33. Bingeman, T. S., Perlman, D. H., Storey, D. G. & Lewis, I. A. Digestomics: an emerging strategy for comprehensive analysis of protein catabolism. Current Opinion in Biotechnology 43, 134–140 (2017).

34. Schönheit, P., Buckel, W. & Martin, W. F. On the Origin of Heterotrophy. Trends in Microbiology 24, 12–25 (2016).

35. Valgepea, K., Adamberg, K., Seiman, A. & Vilu, R. Escherichia coli achieves faster growth by increasing catalytic and translation rates of proteins. Molecular BioSystems 9, 2344–58 (2013).

36. Kamphorst, J. J., Nofal, M., Commisso, C., et al. Human Pancreatic Cancer Tumors Are Nutrient Poor and Tumor Cells Actively Scavenge Extracellular Protein. Cancer Research 75, 544–553 (2015).

37. Picariello, G., Mamone, G., Nitride, C., Addeo, F. & Ferranti, P. Protein digestomics: Integrated platforms to study food-protein digestion and derived functional and active peptides. TrAC - Trends in Analytical Chemistry 52, 120–134 (2013).

38. Scharf, M. E. & Tartar, A. Termite digestomes as sources for novel lignocellulases. Biofuels, Bioproducts and Biorefining 2, 540–552 (2008).

39. Lewis, I. A., Shortreed, M. R., Hegeman, A. D. & Markley, J. L. Novel NMR and MS Approaches to Metabolomics. in The Handbook of Metabolomics (eds. Fan, W.-M. T., Lane, N. A. & Higashi, M. R.) 199–230 (Humana Press, 2012). doi:10.1007/978-1-61779-618-0_7

40. Lu, W., Clasquin, M. F., Melamud, E., Amador-Noguez, D., Caudy, A. A. & Rabinowitz, J. D. Metabolomic analysis via reversed-phase ion-pairing liquid chromatography coupled to a stand alone orbitrap mass spectrometer. Analytical Chemistry 82, 3212–3221 (2010).

41. Wishart, D. S., Knox, C., Guo, A. C., et al. HMDB: A knowledgebase for the human metabolome. Nucleic Acids Research 37, 603–610 (2009).

42. Cui, Q., Lewis, I. A., Hegeman, A. D., Anderson, M. E., Li, J., Schulte, C. F., Westler, W. M., Eghbalnia, H. R., Sussman, M. R. & Markley, J. L. Metabolite identification via the Madison Metabolomics Consortium Database. Nature Biotechnology 26, 162–164 (2008).

43. Markley, J. L., Anderson, M. E., Cui, Q., et al. New bioinformatics resources for metabolomics. Pacific Symposium on Biocomputing 157–68 (2007).

44. Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).

45. Perkins, D. N., Pappin, D. J., Creasy, D. M. & Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–67 (1999).

46. Gabryszewski, S. J., Dhingra, S. K., Combrinck, J. M., et al. Evolution of Fitness Cost-Neutral Mutant PfCRT Conferring P. falciparum 4-Aminoquinoline Drug Resistance Is Accompanied by Altered Parasite Metabolism and Digestive Vacuole Physiology. PLOS Pathogens 12, e1005976 (2016).

47. Guillot, A., Boulay, M., Chambellon, E., Gitton, C., Monnet, V. & Juillard, V. Mass spectrometry analysis of the extracellular peptidome of Lactococcus lactis: lines of evidence for the co-existence of extracellular protein hydrolysis and intracellular peptide excretion. Journal of Proteome Research 15, 3214–3224 (2016).

48. Niessen, S., Hoover, H. & Gale, A. J. Proteomic analysis of the coagulation reaction in plasma and whole blood using PROTOMAP. Proteomics 11, 2377–2388 (2011).

49. Sabino, F., Hermes, O., Egli, F. E., Kockmann, T., Schlage, P., Croizat, P., Kizhakkedathu, J. N., Smola, H. & Keller, U. A. D. In vivo assessment of protease dynamics in cutaneous wound healing by degradomics analysis of porcine wound exudates. Molecular & Cellular Proteomics 14, 354–370 (2015).

50. Steen, H. & Mann, M. The abc’s (and xyz’s) of peptide sequencing. Nature Reviews Molecular Cell Biology 5, 699–711 (2004).

51. Jeong, K., Kim, S. & Bandeira, N. False discovery rates in spectral identification. BMC Bioinformatics 13, S2 (2012).

52. Allen, L. C. & Johnson, L. F. Chemical Applications of Sensitivity Enhancement in Nuclear Magnetic Resonance and Electron Spin Resonance. Journal of American Chemical Society 85, 2668–2670 (1963).

53. Snyder, G. A., Deredge, D., Waldhuber, A., et al. Crystal structures of the Toll/Interleukin-1 Receptor (TIR) domains from the brucella protein TcpB and host adaptor TIRAP reveal mechanisms of molecular mimicry. Journal of Biological Chemistry 289, 669–679 (2014).

54. Bandeira, N., Olsen, J. V., Mann, J. V., Mann, M. & Pevzner, P. A. Multi-spectra peptide sequencing and its applications to multistage mass spectrometry. Bioinformatics (Oxford, England) 24, 416–423 (2008).

55. Dallas, D. C., Guerrero, A., Khaldi, N., Castillo, P. A., Martin, W. F., Smilowitz, J. T., Bevins, C. L., Barile, D., German, J. B. & Lebrilla, C. B. Extensive in vivo human milk peptidomics reveals specific proteolysis yielding protective antimicrobial peptides. Journal of Proteome Research 12, 2295–2304 (2013).

56. Dallas, D. C., Guerrero, A., Parker, E. A., Robinson, R. C., Gan, J., German, J. B., Barile, D. & Lebrilla, C. B. Current peptidomics: Applications, purification, identification, quantification, and functional analysis. Proteomics 15, 1026–1038 (2015).

57. López-Otín, C. & Overall, C. M. Protease degradomics: a new challenge for proteomics. Nature reviews. Molecular cell biology 3, 509–19 (2002).

58. Doucet, A., Butler, G. S., Rodríguez, D., Prudova, A. & Overall, C. M. Metadegradomics: toward in vivo quantitative degradomics of proteolytic post-translational modifications of the cancer proteome. Molecular & Cellular Proteomics 7, 1925–1951 (2008).

59. Dufour, A. & Overall, C. M. Missing the target: Matrix metalloproteinase antitargets in inflammation and cancer. Trends in Pharmacological Sciences 34, 233–242 (2013).

60. Goldberg, D. E. Hemoglobin degradation. Current topics in microbiology and immunology 295, 275–91 (2005).

61. Callaghan, P. S., Hassett, M. R. & Roepe, P. D. Functional Comparison of 45 Naturally Occurring Isoforms of the Plasmodium falciparum Chloroquine Resistance Transporter (PfCRT). Biochemistry 54, 5083–5094 (2015).

62. Pulcini, S., Staines, H. M., Lee, A. H., et al. Mutations in the Plasmodium falciparum chloroquine resistance transporter, PfCRT, enlarge the parasite’s food vacuole and alter drug sensitivities. Scientific Reports 5, 14552 (2015).

63. Adjalley, S. H., Scanfeld, D., Kozlowski, E., Llinás, M. & Fidock, D. A. Genome-wide transcriptome profiling reveals functional networks involving the Plasmodium falciparum drug resistance transporters PfCRT and PfMDR1. BMC Genomics 16, 1090 (2015).

64. Eastman, R. T., Khine, P., Huang, R., Thomas, C. J. & Su, X. PfCRT and PfMDR1 modulate interactions of artemisinin derivatives and ion channel blockers. Scientific Reports 6, 25379 (2016).

65. Erkosar, B., Storelli, G., Mitchell, M., Bozonnet, L., Bozonnet, N. & Leulier, F. Pathogen Virulence Impedes Mutualist-Mediated Enhancement of Host Juvenile Growth via Inhibition of Protein Digestion. Cell Host and Microbe 18, 445–455 (2015).

66. Palmer, K. L., Aye, L. M. & Whiteley, M. Nutritional cues control Pseudomonas aeruginosa multicellular behavior in cystic fibrosis sputum. Journal of Bacteriology 189, 8079–8087 (2007).

67. Turner, K. H., Wessel, A. K., Palmer, G. C., Murray, J. L. & Whiteley, M. Essential genome of Pseudomonas aeruginosa in cystic fibrosis sputum. Proceedings of the National Academy of Sciences of the United States of America 112, 4110–5 (2015).

68. Kida, Y., Shimizu, T. & Kuwano, K. Cooperation between LepA and PlcH contributes to the in vivo virulence and growth of Pseudomonas aeruginosa in mice. Infection and Immunity 79, 211–219 (2011).

69. Aristoteli, L. P. & Willcox, M. D. P. Mucin degradation mechanisms by distinct Pseudomonas aeruginosa isolates in vitro. Infection and Immunity 71, 5565–5575 (2003).

70. Harrington, S. M., Sheikh, J., Henderson, I. R., Ruiz-Perez, F., Cohen, P. S. & Nataro, J. P. The pic protease of enteroaggregative Escherichia coli promotes intestinal colonization and growth in the presence of mucin. Infection and Immunity 77, 2465–2473 (2009).

71. Lee, J. D. & Kolattukudy, P. E. Molecular cloning of the cDNA and gene for an elastinolytic aspartic proteinase from Aspergillus fumigatus and evidence of its secretion by the fungus during invasion of the host lung. Infection and Immunity 63, 3796–3803 (1995).

72. Monod, M., Capoccia, S., Léchenne, B., Zaugg, C., Holdom, M. & Jousson, O. Secreted proteases from pathogenic fungi. International Journal of Medical Microbiology 292, 405–419 (2002).

73. Shen, Y., Tolić, N., Liu, T., et al. Blood peptidome-degradome profile of breast cancer. PloS one 5, e13133 (2010).

74. Jiang, X., Overholtzer, M. & Thompson, C. B. Autophagy in cellular metabolism and cancer. Journal of Clinical Investigation 125, 47–54 (2015).

75. Lu, S.-Z. & Harrison-Findik, D. D. Autophagy and Cancer. World Journal of Biological Chemistry 4, 64–70 (2013).

76. Perera, R. M., Stoykova, S., Nicolay, B. N., et al. Transcriptional control of autophagy-lysosome function drives pancreatic cancer metabolism. Nature 524, 361–5 (2015).

77. Kenific, C. M. & Debnath, J. Cellular and metabolic functions for autophagy in cancer cells. Trends in Cell Biology 25, 37–45 (2014).

78. Moscovici, A. M., Joubran, Y., Briard-Bion, V., Mackie, A., Dupont, D. & Lesmes, U.

The impact of the Maillard reaction on the in vitro proteolytic breakdown of bovine

lactoferrin in adults and infants. Food & Function 5, 1898 (2014).

79. Sayd, T., Chambon, C. & Santé-Lhoutellier, V. Quantification of peptides released during

in vitro digestion of cooked meat. Food Chemistry 197, 1311–1323 (2016).

80. Manguy, J., Jehl, P., Dillon, E. T., Davey, N. E., Shields, D. C. & Holton, T. A.

Peptigram: A Web-Based Application for Peptidomics Data Visualization. Journal of

Proteome Research 16, 712–719 (2017).

81. Kuhring, M. & Renard, B. Y. iPiG: Integrating Peptide Spectrum Matches into Genome

Browser Visualizations. PLoS ONE 7, (2012).

82. Guerrero, A., Dallas, D. C., Contreras, S., Chee, S., Parker, E. A., Sun, X., Dimapasoc, L.,

Barile, D., German, J. B. & Lebrilla, C. B. Mechanistic Peptidomics: Factors That Dictate

Specificity in the Formation of Endogenous Peptides in Human Milk. Molecular &

Cellular Proteomics 13, 3343–3351 (2014).

83. Vijayakumar, V., Guerrero, A. N., Davey, N., Lebrilla, C. B., Shields, D. C. & Khaldi, N.

EnzymePredictor: a tool for predicting and visualizing enzymatic cleavages of digested

proteins. Journal of Proteome Research 11, 6056–6065 (2012).

84. Pappin, D. J. C., Hojrup, P. & Bleasby, A. J. Rapid identification of proteins by peptide-

mass fingerprinting. Current Biology (1993). doi:10.1016/0960-9822(93)90195-T

85. Mann, M., Højrup, P. & Roepstorff, P. Use of mass spectrometric molecular weight

information to identify proteins in sequence databases. Biological Mass Spectrometry 22,

338–345 (1993).

86. Lewis, I. A., Schommer, S. C. & Markley, J. L. rNMR: Open source software for

identifying and quantifying metabolites in NMR spectra. Magnetic Resonance in

Chemistry 47, S123-6 (2009).

87. Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass

spectral data of peptides with amino acid sequences in a protein database. Journal of the

American Society for Mass Spectrometry 5, 976–89 (1994).

88. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized

p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature

Biotechnology 26, 1367–1372 (2008).

89. Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-

scale protein identifications by mass spectrometry. Nature Methods 4, 207–214 (2007).

90. Cooper, J. W. Computers in NMR—I: Signal averaging in Fourier transform NMR.

Computers & Chemistry 1, 55–60 (1976).

91. Westling, J., Cipullo, P., Hung, S.-H., Saft, H., Dame, J. B. & Dunn, B. M. Active site

specificity of plasmepsin II. Protein Science 8, 2001–2009 (1999).

92. Wyatt, D. M. & Berry, C. Activity and inhibition of plasmepsin IV, a new aspartic

proteinase from the malaria parasite, Plasmodium falciparum. FEBS Letters 513, 159–162

(2002).

93. Wang, F., Krai, P., Deu, E., Bibb, B., Lauritzen, C., Pedersen, J., Bogyo, M. & Klemba,

M. Biochemical characterization of Plasmodium falciparum dipeptidyl aminopeptidase 1.

Molecular and Biochemical Parasitology 175, 10–20 (2011).

94. Sijwali, P. S., Shenai, B. R., Gut, J., Singh, A. & Rosenthal, P. J. Expression and

characterization of the Plasmodium falciparum haemoglobinase falcipain-3. Biochemical

Journal 360, 481–489 (2001).

95. Trager, W. & Jensen, J. B. Human malaria parasites in continuous culture. Science 193,

673–675 (1976).

96. Olszewski, K. L., Morrisey, J. M., Wilinski, D., Burns, J. M., Vaidya, A. B., Rabinowitz,

J. D. & Llinás, M. Host-Parasite Interactions Revealed by Plasmodium falciparum

Metabolomics. Cell Host and Microbe 5, 191–199 (2009).

97. Greenbaum, D. C., Baruch, A., Grainger, M., Bozdech, Z., Medzihradszky, K. F., Engel,

J., DeRisi, J., Holder, A. A. & Bogyo, M. A role for the protease falcipain 1 in host cell

invasion by the human malaria parasite. Science 298, 2002–2006 (2002).

98. Panuwet, P., Hunter, R. E., D’Souza, P. E., Chen, X., Radford, S. A., Cohen, J. R.,

Marder, M. E., Kartavenka, K., Ryan, P. B. & Barr, D. B. Biological Matrix Effects in

Quantitative Tandem Mass Spectrometry-Based Analytical Methods: Advancing

Biomonitoring. Critical Reviews in Analytical Chemistry 46, 93–105 (2016).
