DEVELOPMENT OF HIGH-EFFICIENCY UNDECANAL-BASED N TERMINI ENRICHMENT (HUNTER) FOR MONITORING PROTEOLYTIC PROCESSING IN LIMITED SAMPLES

by Shao Huan Samuel Weng

B.Sc., Simon Fraser University, 2013

M.Sc., Simon Fraser University, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Pathology and Laboratory Medicine)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

November 2019

© Shao Huan Samuel Weng, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis/dissertation entitled:

Development of High-efficiency Undecanal-based N Termini EnRichment (HUNTER) for Monitoring Proteolytic Processing in Limited Samples submitted by Shao Huan Samuel Weng in partial fulfillment of the requirements for the degree of Master of Science in Pathology and Laboratory Medicine

Examining Committee:

Dr. Philipp F. Lange, Pathology and Laboratory Medicine

Supervisor

Dr. Thibault Mayor, Biochemistry

Supervisory Committee Member

Dr. Mari DeMarco, Pathology and Laboratory Medicine

Additional Examiner

Dr. James Lim, Pediatrics

Additional Examiner

Abstract

Genes encode the information for the amino acid backbone of proteins. This information can be altered by genetic variation, alternative splicing, or alternative initiation of translation. After translation, the protein can be altered further by post-translational modification. All the different versions of a protein encoded by one gene are termed proteoforms. Protein N termini can be used to identify truncated (proteolytically cleaved), alternatively translated, or N-terminally modified proteoforms that often have distinct functions. Cleavage of proteins by proteases is frequently altered in disease, including cancers, and following the occurrence and loss of protein N termini can pinpoint abnormal proteolytic activity.

Selective enrichment of N-terminal peptides is necessary to achieve proteome-wide coverage for unbiased identification of site-specific proteolytic processing and substrates. However, most N termini enrichment techniques for the comprehensive study of N termini, so-called N-terminome analysis, require relatively large amounts of starting material, in the range of several hundred micrograms to milligrams of crude protein lysate. Because of these sample constraints, this type of analysis cannot be routinely applied to clinical biopsies, especially those from pediatric patients.

We present High-efficiency Undecanal-based N Termini EnRichment (HUNTER), a robust, sensitive, and scalable method for the analysis of previously inaccessible microscale samples. With this approach, >1,000 N termini are identified from a minimum of 2 µg raw HeLa cell lysate and >5,000 termini from 200 µg of raw HeLa lysate with high-pH pre-fractionation. We demonstrate the broad applicability of HUNTER with the first N-terminome analysis of sorted human primary immune cells and enriched mitochondrial fractions from pediatric cancer patients. The workflow was implemented on a liquid handling system to demonstrate the feasibility of automated liquid biopsy processing from pediatric cancer patients. Overall, HUNTER facilitates the analysis of rare and precious clinical samples.

Lay Summary

The leading string of amino acids in a protein (i.e., the N terminus) can provide useful information about protein characteristics and functions. Abnormal cutting of proteins by deregulated enzymes called proteases influences cellular regulation and communication in most cancers. Proteases constitute a major target for cancer drugs, and their breakdown activity can be monitored by following the occurrence and loss of protein termini. Therefore, there is a growing interest in identifying protein N termini and their modifications; however, most termini selection methods require large amounts of raw material (>100 µg). Here, we developed a robust, sensitive, and automated method, termed High-efficiency Undecanal-based N Termini EnRichment (HUNTER), for the identification of thousands of N termini from limited sample amounts (>1 µg). HUNTER facilitates the analysis of rare and precious clinical samples. We have also applied HUNTER to identify the N-terminal profiles of sorted immune cells, subcellular compartments, and plasma from childhood cancer patients.

Preface

The work on this dissertation was done under the guidance and mentorship of Dr. Philipp Lange. Janice Tsui assisted me with HUNTER automation. Lorenz Nierves and Janice Tsui helped me with automated N termini analysis in patient blood plasma (BP) and bone marrow interstitial fluid (BM). Dr. Anuli Uzozie and Lorenz Nierves designed, performed, and analyzed mitochondria experiments. Dr. Anuli Uzozie assisted with optimizing, performing, and analyzing data-independent acquisition experiments. Dr. Fatih Demir conducted rat brain and Arabidopsis leaf experiments. All data analysis was completed with assistance from Enes Ergin. I carried out all other experiments, data analysis, and data interpretation shown in this thesis, in collaboration with Dr. Pitter F. Huesgen and Dr. Fatih Demir. Chapters 2 and 3 are reproduced in part from Weng et al. Sensitive determination of proteolytic proteoforms in limited microscale proteome samples, Mol. Cell. Proteom., doi:10.1074/mcp.TIR119.001560 (2019).

Primary pediatric B-cell acute lymphoblastic leukemia and acute myeloid leukemia patient mononuclear cells enriched from bone marrow aspirates, BP and BM were retrospectively sourced from the Biobank at BC Children’s Hospital following informed consent and approval by the University of British Columbia Children’s and Women’s Research Ethics Board (REB #H15-01994) in agreement with the Declaration of Helsinki. Patient BP and BM samples were collected at the time of diagnosis and 29 days after induction chemotherapy. Peripheral blood mononuclear cells from healthy donors were obtained following informed consent and approval by the University of British Columbia Children’s and Women’s Research Ethics Board (REB #H10-01954).

Table of Contents

Abstract ...... iii

Lay Summary ...... iv

Preface ...... v

Table of Contents ...... vi

List of Figures ...... x

List of Abbreviations ...... xii

Acknowledgements ...... xv

Dedication ...... xvi

Chapter 1 Introduction ...... 1

1.1 General introduction ...... 1

1.1.1 Proteoforms ...... 1

1.1.2 Post-translational modifications...... 2

1.1.2 Proteases and proteolysis ...... 3

1.1.3 Proteases in cancer development ...... 4

1.2 Current N termini enrichment methods: from positive to negative selection ...... 6

1.2.2 Positive selection N termini enrichment methods ...... 7

1.2.3 Negative selection N termini enrichment methods ...... 10

1.3 Objective ...... 22

Chapter 2 Development of High-efficiency Undecanal-based N Termini EnRichment (HUNTER) ...... 23

2.1 Introduction ...... 23

2.2 Method ...... 24

2.2.2 Materials and reagents ...... 24

2.2.2.1 HeLa cell line ...... 24

2.2.2.2 Plant material ...... 24

2.2.2.3 Rat brain samples ...... 24

2.2.3 Preparation of single-pot solid-phase-enhanced sample-preparation beads ...... 25

2.2.4 Preparation of stop-and-go-extraction tips ...... 25

2.2.5 Fluorometric and colorimetric protein and peptide measurements ...... 25

2.2.6 High-pH reversed phase fractionation ...... 25

2.2.7 Optimization of HUNTER ...... 26

2.2.7.1 Optimizing dimethyl labeling of proteins ...... 26

2.2.7.2 Evaluation of removal of dimethyl labeling reagents ...... 26

2.2.7.3 Optimizing undecanal modification of peptide α-amines ...... 27

2.2.7.4 Optimizing undecanal removal ...... 27

2.2.7.5 Evaluation of peptide recovery dependency on solvent concentrations ... 28

2.2.8 HUNTER protocol ...... 28

2.2.8.1 Preparation of HeLa cell lysates ...... 28

2.2.8.2 Preparation of Arabidopsis thaliana leaf and rat brain lysates...... 29

2.2.8.3 SP3 bead binding and proteome clean-up ...... 29

2.2.8.4 Protein dimethyl labeling ...... 29

2.2.8.5 Enrichment of N termini by undecanal-based negative selection ...... 30

2.2.9 HUNTER automation ...... 31

2.2.10 Mass spectrometry ...... 32

2.2.10.1 Data-dependent acquisition ...... 32

2.2.10.2 Data-independent acquisition ...... 33

2.2.10.3 Data processing ...... 33

2.2.10.4 Data availability ...... 35

2.3 Results ...... 35

2.3.2 Optimization of HUNTER ...... 35

2.3.3 Validation of HUNTER performance on HeLa cells ...... 43

2.3.4 Improved reproducibility on automated HUNTER ...... 46

2.4 Discussions ...... 48

2.5 Conclusions ...... 51

Chapter 3 Application of HUNTER on limited samples ...... 52

3.1 Introduction ...... 52

3.2 Method ...... 53

3.2.2 HUNTER on limited samples ...... 53

3.2.2.1 Preparation of peripheral blood mononuclear sorted cell lysates ...... 54

3.2.2.2 Preparation of mitochondrial enrichment samples ...... 54

3.2.2.3 Automated HUNTER on patient blood and bone marrow plasma ...... 54

3.2.3 Mass spectrometry...... 55

3.2.3.1 Label free quantification ...... 55

3.2.3.2 Data and statistical analysis ...... 55

3.2.3.3 Data availability ...... 57

3.3 Results ...... 58

3.3.2 N-terminome analysis of sorted peripheral blood mononuclear cells ...... 58

3.3.3 Mitochondrial N-terminome analysis from 2.5 million cancer cells ...... 60

3.3.4 Application of HUNTER on B-cell acute lymphoblastic leukemia (B-ALL) patient plasma samples ...... 64

3.4 Discussion ...... 69

3.5 Conclusions ...... 71

Chapter 4 General conclusions ...... 72

4.1 Concluding remarks ...... 72

Bibliography ...... 73

List of Figures

Figure 1.1 A schematic view of proteoform alterations ...... 2
Figure 1.2 Positive N termini selection with subtiligase ...... 8
Figure 1.3 Schematic workflow for N-CLAP ...... 10
Figure 1.4 Schematic workflow of COFRADIC ...... 12
Figure 1.5 Scheme of the ChaFRADIC workflow ...... 14
Figure 1.6 Schematic diagram for ChaFRAtip strategy ...... 15
Figure 1.7 Schematic representation of the TAILS workflow ...... 17
Figure 1.8 Workflow for N-terminal peptides enrichment by the PTAG strategy ...... 19
Figure 1.9 Scheme of STagAu strategy ...... 20
Figure 1.10 Schematic workflow for negative enrichment of N-terminal peptides by HyTANE strategy ...... 21
Figure 2.1 HUNTER workflow ...... 36
Figure 2.2 Completeness of on-bead dimethyl labeling of 10 µg HeLa cell lysates based on quantitative fluorometric peptide assay ...... 37
Figure 2.3 Completeness of on-bead dimethyl labeling of HeLa cell lysates based on LC-MS/MS analysis ...... 37
Figure 2.4 Effect of SP3 and tris buffer in dimethyl labeling step ...... 38
Figure 2.5 Effect of different C18 cartridges and mobile phases (EtOH and ACN) on the depletion of free undecanal ...... 39
Figure 2.6 Evaluation of peptide recovery from C18 StageTips using 40% EtOH as mobile phase ...... 40
Figure 2.7 Degree of HeLa-derived peptide undecanal modification in organic solvents, assessed by reactivity of unlabeled amines ...... 40
Figure 2.8 Effect of incubation time for undecanal modification of peptides on the number of identified N-terminal unique peptides and pullout efficiency ...... 41
Figure 2.9 Comparison of hydrophobic tagging materials used in HUNTER and HyTANE methods ...... 42
Figure 2.10 The effects of enrichment on amount of starting materials from rat brain and leaf proteome ...... 43
Figure 2.11 Evaluation of HUNTER workflow using varying starting amounts of HeLa cells ...... 45
Figure 2.12 Number of N termini identified from 10K to 1M HeLa cells ...... 45
Figure 2.13 Comparison of coefficients of variation (CV) for N termini from 20K and 1M HeLa cell quantified by label free DDA or DIA ...... 46
Figure 2.14 Pearson-correlation between two manually handled HUNTER sets on HeLa cells by DDA ...... 47
Figure 2.15 Pearson-correlation between three manually handled HUNTER replicates from HeLa quantified by DIA ...... 48
Figure 2.16 Evaluation of technical robustness and reproducibility of the implemented automation ...... 48
Figure 3.1 N-terminome analysis of sorted human PBMC ...... 59
Figure 3.2 Number of N termini with acetylation on genome encoded position 1 or 2 (co- and/or post-translational) or position >2 (post-translational only) ...... 60
Figure 3.3 Evaluation of subcellular enrichment efficiency for B-ALL cell line 697 ...... 61
Figure 3.4 Analysis of mitochondrial N-terminomes by HUNTER ...... 62
Figure 3.5 Mitochondrial protein N termini enriched from 2.5 million cancer cells ...... 63
Figure 3.6 Consensus sequence logo of 257 mitochondrial N termini identified from AML-1 patient cells ...... 64
Figure 3.7 Study design on patient samples ...... 65
Figure 3.8 Proteins identified exclusively in plasma before or after N termini enrichment, or in both preparations, mapped to known concentrations in plasma ...... 65
Figure 3.9 Automated HUNTER protein level analysis from three pediatric B-ALL patients ...... 66
Figure 3.10 Automated HUNTER termini level analysis from three pediatric B-ALL patients ...... 67
Figure 3.11 Relative intensity-based quantification of complement pathway proteins and N termini identifying previously described activation products ...... 68

List of Abbreviations

ACN  Acetonitrile
ADAM  A disintegrin and metalloprotease
ADAM-TS  ADAM with thrombospondin motif
AML  Acute myeloid leukemia
B/D-HPP  Biology/Disease HPP
B-ALL  B-cell acute lymphoblastic leukemia
BCCH  BC Children’s Hospital
BM  Bone marrow interstitial fluid
BP  Blood plasma
CAA  2-chloroacetamide
ChaFRADIC  Charge-based fractional diagonal chromatography
ChaFRAtip  Charge-based fractional diagonal chromatography in a pipette tip
C-HPP  Chromosome-centric HPP
COFRADIC  Combined fractional diagonal chromatography
CVs  Coefficients of variation
D0  Time of diagnosis
D29  29 days after induction chemotherapy
DDA  Data-dependent acquisition
DIA  Data-independent acquisition
DTT  Dithiothreitol
EGF  Epidermal growth factor
EtOH  Ethanol
GAP3  Glyceraldehyde-3-phosphate
HPG-ALD  Hyperbranched polyglycerols-aldehyde-derivatized
HPLC  High-performance liquid chromatography
HPP  Human proteome project
HUNTER  High-efficiency Undecanal-based N Termini EnRichment
HUPO  Human proteome organization
HyTANE  Hydrophobic tagging-assisted N-termini enrichment
iTRAQ  Isobaric tags for relative and absolute quantification
LC  Liquid chromatography
MAP12  Mitochondrial methionine aminopeptidase 1D
MMPs  Matrix metalloproteinases
MPPB  Mitochondrial-processing peptidase subunit beta
MS  Mass spectrometry
NCE  Normalized collision energy
N-CLAP  N-terminalomics by chemical labeling of the α-amine of proteins
NHS  N-hydroxysuccinimide
NK  Natural killer
NKT cells  NK T-cells
PBMC  Peripheral blood mononuclear cells
PCT  Pressure cycling technology
PITC  Phenyl isothiocyanate
PTAG  Titanium dioxide affinity chromatography for phospho-tagging
PTMs  Post-translational modifications
RP  Reverse phase
RT  Room temperature
SCX  Strong cation exchange
SD  Standard deviation
SILAC  Stable isotope labeling by amino acids in cell culture
SPE  Solid phase extraction
SP3  Single-pot solid-phase-enhanced sample-preparation
STagAu  Sulfhydryl-tagging and gold-nanoparticle-based depletion
StageTip  Stop-and-go-extraction tip
SUMO  Small ubiquitin-like modifiers
TAILS  Terminal amine isotopic labeling of substrates
T-ALL  T-cell acute lymphoblastic leukemia
TEV  Tobacco etch virus
TFA  Trifluoroacetic acid
TiO2  Titanium dioxide
TMT  Tandem mass tag
TNBS  2,4,6-trinitrobenzenesulfonic acid
TNC  Total net charge
TNP  Trinitrophenyl-peptides
t-SNE  t-distributed stochastic neighbor embedding
UDC  Undecanal

Acknowledgements

First and foremost, I would like to express my profound gratitude to my senior supervisor, Dr. Philipp Lange, for his guidance and for giving me the opportunity to study in his group. I am very grateful to have been able to pursue my interest in proteomics and cancer research with an easy-going and enthusiastic supervisor. I have always believed that “every new chapter in the life will require a new version of myself” and Dr. Philipp Lange is the one who shaped me into a better and more capable person. Under his guidance and support throughout the years, I have had the opportunity to bring my expertise in analytical and material chemistry to cancer proteomics. I have learnt not only solid and valuable research skills and knowledge, but also life lessons that will help me begin the next chapter of my life.

I would also like to express my deepest respect and sincere appreciation to Dr. Catherine Pallen, Dr. Thibault Mayor, and Dr. Graham Sinclair for being my committee members and giving me constructive advice and suggestions for my research.

My special thanks go to Dr. Amina Kariminia and Dr. Kirk Schultz for providing sorted peripheral blood monocytes. I also want to thank Janice Tsui, who helped me with optimizing the automated liquid handler, Enes Ergin who spent his time to help me with data analysis, Anuli Uzozie who assisted me with DIA and mitochondria experiments, and Lorenz Nierves who helped me with performing N termini analysis in patient plasma and bone marrow fluid. I would like to thank my collaborators from Germany, Dr. Pitter Huesgen and Dr. Fatih Demir, who assisted me in establishing and optimizing the undecanal-based N termini enrichment method. A sincere thank you to Dr. Hua-Zhong Yu, my former supervisor during my master’s degree in Chemistry. Without him, I would not be who I am today. Because of his inspiration, I came up with the idea to bring superhydrophobic materials into the proteomics field for assisting sample preparation. In addition, he also generously offered free superhydrophobic filter papers for my research.

I would like to thank the Department of Pathology and Laboratory Medicine, including the past program director Dr. Haydn Pritchard, current program director Dr. Dana Devine and Heather Cheadle, for their care and assistance. In addition, I am very thankful to the past and current members of the Lange laboratory for their friendship and support. Finally, I would like to express my deepest appreciation to my family for their support and encouragement.

Dedication

To my family and friends

Chapter 1 Introduction

1.1 General introduction

1.1.1 Proteoforms

During gene expression, protein-coding genes are transcribed into mRNA, and most human genes express a variety of protein isoforms through alternative RNA splicing. These isoforms can be further modified by post-translational modifications (PTMs). All versions of a protein originating from the same gene but altered by mutation, alternative splicing, alternative translation or post-translational modification are termed proteoforms1,2,3 (Figure 1.1). Proteoforms can possess distinct biological functions and cellular activities, and alternate proteoforms have been implicated in phenotype diversity and disease1. Approximately 1 million proteoforms are currently predicted to be generated from 20,300 human genes2. Proteoforms can differ in their subcellular localization, binding proteins, structures, and kinetics4. Mass spectrometry (MS) is widely employed to identify and characterize proteoforms, and is frequently coupled with a separation technology, commonly liquid chromatography (LC)3,4. To date, only a limited number of proteoforms have been experimentally observed; therefore, more in-depth characterization is needed to better understand the human proteome4. In this thesis, I will focus on a peptide-centric bottom-up proteomics approach for proteoform identification. In brief, the standard bottom-up proteomics approach employs enzymatic protein digestion, in which each protein is broken down into terminal and internal peptides. Subsequently, peptides are separated by LC prior to MS analysis. For protein identification, the raw MS data are searched against a database of known proteins, taking the specific enzymatic cleavage sites into account.
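As a minimal illustration of the digestion step in the bottom-up workflow outlined above, the following Python sketch cleaves a protein sequence in silico and labels the resulting peptides as N-terminal, internal, or C-terminal. It assumes a simplified trypsin rule (cleavage after every lysine or arginine, with no missed cleavages and no proline exception); the function names and the example sequence are hypothetical and are not part of the software used in this thesis.

```python
import re
from typing import List, Tuple


def tryptic_digest(sequence: str) -> List[str]:
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    # The zero-width lookbehind splits directly after K/R without consuming residues.
    return [pep for pep in re.split(r"(?<=[KR])", sequence) if pep]


def classify_peptides(sequence: str) -> List[Tuple[str, str]]:
    """Label each peptide by its position in the parent protein."""
    peptides = tryptic_digest(sequence)
    labelled = []
    for i, pep in enumerate(peptides):
        if i == 0:
            kind = "N-terminal"  # carries the original protein N terminus
        elif i == len(peptides) - 1:
            kind = "C-terminal"
        else:
            kind = "internal"
        labelled.append((pep, kind))
    return labelled


if __name__ == "__main__":
    # Hypothetical 16-residue sequence, used for illustration only.
    for peptide, kind in classify_peptides("MKTAYIAKQRQISFVK"):
        print(peptide, kind)
```

In this toy digest only one of the four peptides carries the protein N terminus, which illustrates why terminal peptides are a small minority in a standard digest.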

Figure 1.1 A schematic view of proteoform alterations. (a) The canonical pathway of the gene, to transcript, to final protein. (b) Single nucleotide polymorphisms on the genomic level can translate to an altered amino acid sequence, and therefore an altered protein or proteoform. (c) Alternative splicing at the mRNA level leads to exclusion of exons, creating different proteoforms that originate from the same genomic sequence. (d and e) Post-translational modifications create proteoforms that may have altered functions from the canonical protein. (e) Proteolysis creates protease-cleaved proteoforms which are distinct from the original protein. (f) The fusion of two different genes may lead to the formation of an entirely new protein. Reprinted with permission from reference 3.

1.1.2 Post-translational modifications

Post-translational modification (PTM) is a biochemical mechanism by which amino acids in a protein are altered through covalent modification5. This process comprises the addition, removal, exchange, and rearrangement of functional groups on amino acids, as well as the cleavage of peptide bonds. These protein modifications introduce new functionalities and dynamic or irreversible control of protein activity, equivalent to “on/off switches”, by modulating intra- and intermolecular interactions5. Because PTMs are fundamental to maintaining cellular homeostasis, aberrant PTMs play critical roles in pathogenesis and are often associated with various diseases6.

Over 200 different forms of PTMs have been discovered, influencing cellular functionalities such as metabolism7, signal transduction8, and protein stability7. Typical and well-recognized PTMs can be classified based on the type of modification. For instance, PTMs with the addition of chemical groups include phosphorylation, acetylation, methylation, and redox-based modifications; addition of polypeptides includes ubiquitination, small ubiquitin-like modifier-conjugation (SUMOylation), and ubiquitin-like protein conjugation; addition of complex motifs includes prenylation, glycosylation, ADP-ribosylation, and AMPylation; direct amino acid modifications include deamidation and eliminylation; and protein cleavage forms a further class. As there are more than 200 types of PTMs, this thesis focuses only on proteolysis.

1.1.2 Proteases and proteolysis

Proteolysis, which is catalyzed by proteases, is one of the major PTMs and is irreversible9. During proteolysis, proteoforms are truncated, resulting in new protein forms, so-called proteolytic proteoforms, whose functions may correlate only weakly with the canonical function3. Protein cleavage creates neo-N and neo-C protein termini, which can be further modified, leading to increased functional diversity10.

Proteases are integral to proteolysis. These enzymes hydrolyze peptide bonds in protein substrates, creating the diverse proteoforms mentioned above. The human genome encodes more than 550 putative proteases that regulate proteolytic events and control downstream signalling11. Documented proteases are found in extracellular (e.g. the digestive system, blood, and extracellular matrix), membrane, and intracellular locations12. There is great diversity among proteases. Based on their catalytic mechanism, proteases are classified into aspartic, cysteine, metallo, serine, and threonine proteases. These groups can be further divided into families and sub-families, according to their catalytic domains and tertiary structure12. A detailed classification can be found in the protease database MEROPS (http://merops.sanger.ac.uk/)13.

In brief, each of the main protease classes has distinct proteolytic functions. Aspartic proteases activate water molecules via aspartic acid residues at the catalytic site and are optimally active at acidic pH12. All aspartic proteases are recognized as breaking internal peptide bonds within a protein. Pepsin, an aspartic protease, is one of the predominant digestive enzymes in human digestion12. Cysteine proteases are lysosomal proteases and play an essential role in apoptosis12. Within the cysteine protease group, the cysteine cathepsins comprise 11 members and are located in lysosomes and late endosomes, where they contribute to protein turnover12.

Metalloproteases are metal ion-containing proteases that polarize water to hydrolyze proteins12. Several metalloproteases, e.g. a disintegrin and metalloprotease (ADAM) proteases, ADAM with thrombospondin motif (ADAM-TS) proteases, and the matrix metalloproteinases (MMPs), have been implicated in the pathogenesis of human diseases12,14,15. In humans, there are 12 ADAMs and 19 ADAM-TS that play important roles in membrane protein shedding, matrix protein processing, and coagulation factor activation12. In addition, based on the cleavage product, MMP family members can be further categorized into collagenases (e.g. MMP-1, -8, -13 and the membrane-type MMP-14), gelatinases (MMP-2, -9), stromelysins (e.g. MMP-3, -7, -10), and metalloelastase (e.g. MMP-12)12.

Serine proteases contain a serine residue as the nucleophilic amino acid at the active site12. There are 175 predicted serine proteases in humans16, which are involved in several physiological processes including digestion, blood coagulation, and immune response12. Trypsin is a well-known serine protease, hydrolyzing proteins at the carboxyl side of lysine and arginine residues12. Threonine proteases harbour a threonine residue at their N termini and are a major component of the proteasome complex12.

1.1.3 Proteases in cancer development

De-regulated proteases are known drivers of disease, resulting in aberrant activation, inactivation or change in function, stability or localization of specific proteins3,17,18. In cancer, proteases have been shown to participate in processes such as tumor initiation, angiogenesis, inflammation, invasion and metastasis12. In this section, we will discuss examples of proteases involved in cancer development and of protease inhibitors used in cancer treatment.

In cancer, loss of extracellular matrix mediated by MMPs often indicates a transition from benign to malignant tumors. For example, MMP-14 influences the invasive process of a tumor in the collagen-containing extracellular matrix19. Additionally, it is involved in activating MMP-2, a gelatin-hydrolyzing enzyme (gelatinase). Both MMP-2 and MMP-9 are linked to actin-based protrusions known as invadopodia, which further degrade the extracellular matrix and facilitate cell invasion12. Upregulated MMP-12 has been found in lung cancer, but not all MMPs promote cancer: some, such as MMP-3, have been reported to exert antitumor effects20. Regardless, MMPs represent valuable cancer targets, although their diverse roles require careful consideration during therapeutic development. This includes precise targeting to reduce side-effects in cancer patients, for example with inhibitors that target exosites, unique secondary binding sites proximal to the active site on different MMPs. For example, LEM-2/15, an exosite-based inhibitor, arrests MMP-14 gelatin degradation without affecting cell surface activation of progelatinase A (the inactive zymogen of MMP-2)21.

ADAMs are involved in tumor initiation, promotion, and expansion. For instance, overexpression of ADAM17 promotes tumor development via epidermal growth factor (EGF) ligand regulation22. Additionally, the ADAM metalloproteases and γ-secretases are found to overactivate the Notch signaling pathway in T-cell acute lymphoblastic leukemia (T-ALL)23,24. Consequently, ADAMs and γ-secretases are considered therapeutic targets in cancer. Several inhibitors have been developed for these targets and have shown efficacy in clinical trials25,26.

Cysteine proteases have often been implicated in tumor progression. Cathepsin B, for example, has been linked to tumor growth, migration, invasion, angiogenesis, and metastasis27,28,29. It is also recognized as a diagnostic biomarker, as it is overexpressed in many cancers. Meanwhile, cathepsin K has a critical role in bone resorption30 and is linked to osteolytic metastases in various cancers (e.g. prostate31 and breast32 cancers). Prostate tumor bone metastases were reduced in cathepsin K knockout mice, indicating the potential value of cathepsin K in the treatment of bone metastases33.

The proteasome, a threonine protease, can also be involved in cancer development. As it controls cellular protein stability, aberrant proteolytic activity can impact cancer development and progression on many levels. Interestingly, the proteasome is also involved in drug resistance, since it regulates apoptosis-signaling pathways34. Several proteasome inhibitors have been developed and tested in clinical trials35. Bortezomib was the first proteasome inhibitor used in clinical trials and approved by the US Food and Drug Administration, with encouraging efficacy and safety results in a phase 2 trial for relapsed multiple myeloma patients35.

In summary, proteases are currently viewed as promising drug targets18,36,37, and proteolytic proteoforms can be utilized as clinical biomarkers38.

1.2 Current N termini enrichment methods: from positive to negative selection

The collective protein N termini proteome, or terminome, has been used to identify truncated proteolytic proteoforms, monitor abnormal proteolytic activities, and reveal perturbations in diseases10. However, a comprehensive terminome analysis requires N termini enrichment techniques, because standard bottom-up proteomics workflows generally lose information on protease-generated neo-termini. Development of a selective enrichment method for N termini is thus valuable and critical.

Developing an N termini enrichment protocol is challenging. Protein N termini carry an α-amine whose reactivity is similar to that of the ε-amines on the side chains of lysine residues in a polypeptide chain, which makes selective enrichment difficult. In addition, new α-amines are generated on internal peptides during sample preparation (e.g. by enzymatic digestion). It is thus difficult to precisely enrich for N termini39. This section provides an overview of the two basic strategies for N termini enrichment: positive and negative selection. The former directly targets the termini of interest, whereas the latter focuses on the removal of protein internal peptides.

1.2.2 Positive selection N termini enrichment methods

To avoid co-enrichment of lysine-containing peptides, the design of specific N termini “tagging” is essential. In 1994, subtiligase, an enzymatic N termini site-specific ligase derived from the protease subtilisin BPN, was pioneered by Dr. Wells and co-workers40. This rationally engineered N-terminal ligase can ligate a peptide ester (or a thioester) specifically onto an N-terminal peptide, and forms the basis of a positive N termini selection method (Figure 1.2)41,42. In brief, the protein mixture is first N-terminally labeled with biotin-peptide ester tags by subtiligase, followed by trypsin digestion. Then, N-terminal peptides are captured with avidin beads and internal peptides are washed away. Cleavage at the tobacco etch virus (TEV) protease site releases the tagged N termini for further LC-MS/MS analysis. This method has been applied to study caspase-driven apoptotic events41.

Figure 1.2 Positive N termini selection with subtiligase. (a) Structure of peptide ester with ligation and TEV protease cleavage sites. (b) Proteomics workflow for N-terminal labeling using subtiligase. The protein mixture is enzymatically attached with a biotinylated peptide ester. Tagged N termini are captured by avidin beads after trypsin digestion and released by TEV protease treatment for subsequent LC-MS/MS analysis. Reprinted with permission from reference 42.

In addition to enzymatic methods, selective labeling of free α-amines on N termini can also be achieved by chemical modification. Xu et al. demonstrated an accessible and affordable positive selection concept based on chemical tagging, named N-terminalomics by chemical labeling of the α-amine of proteins (N-CLAP) (Figure 1.3)43. Their approach uses well-established Edman degradation chemistry to block all amine groups within a proteome with phenyl isothiocyanate (PITC). Subsequently, acidification with trifluoroacetic acid (TFA) detaches only the PITC-labelled amines on N termini, but not the modified ε-amines on lysine residues. The resulting unblocked amine groups at the N termini react with an amine-specific labeling reagent, EZ-Link Sulfo-N-hydroxysuccinimide (NHS)-SS-biotin, prior to protein digestion. The final biotin-labeled N-terminal peptides are recovered with avidin-based resins, released by reduction of the disulfide linkers, and then identified by MS analysis. With this method, the protein is shortened by one amino acid after TFA treatment, which must be taken into consideration during data analysis. In conjunction with chemical modification, guanidination of ε-amines on lysine residues using o-methylisourea can also be used for labeling during positive N termini enrichment.
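The one-residue shortening noted above can be handled during data analysis with a simple position correction. The following Python sketch is an illustration only: it assumes that exactly one residue is lost from each N terminus and that the observed peptide maps uniquely to the protein sequence. The function name and example sequence are hypothetical and do not describe the published N-CLAP data analysis.

```python
from typing import Optional


def original_terminus_position(protein: str, observed_peptide: str) -> Optional[int]:
    """Return the 1-based position of the N terminus before one residue was removed.

    Returns None if the peptide is not found in the protein, or if it starts at
    residue 1 (no preceding residue could have been removed).
    """
    index = protein.find(observed_peptide)  # 0-based start of the first match, -1 if absent
    if index <= 0:
        return None
    observed_start = index + 1  # convert to 1-based numbering
    # Assumption: exactly one residue was removed from the N terminus,
    # so the terminus present in the sample lay one position upstream.
    return observed_start - 1


if __name__ == "__main__":
    # Hypothetical protein sequence; the peptide "STK" starts at residue 3.
    print(original_terminus_position("MASTKQLR", "STK"))
```

A peptide observed to start at residue 3 would thus be reported as an N terminus at residue 2 under these assumptions.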

Although positive enrichment of N-terminal peptides provides direct measurement and analysis of the target of interest, this approach has some limitations. For instance, it offers only a limited view of the N-terminome: approximately 80% of N termini in mammalian cells are naturally modified, mostly by acetylation, and these are typically lost in the enrichment process44. Another drawback is the incompatibility with chemical stable isotope labeling for quantitative analysis. Furthermore, it is difficult to design N termini-specific tags. For these reasons, fewer positive selection N termini enrichment approaches are in development than negative selection approaches.

Figure 1.3 Schematic workflow for N-CLAP. After PITC treatment, all amines in a protein are blocked (filled circle). Reaction with TFA selectively deblocks the N terminus but not PITC-modified lysines. The newly generated amino groups at the N-termini then react with an amine-specific labeling reagent, such as EZ-Link Sulfo-NHS-SS-biotin (oval). After protein digestion, the N-terminal peptides are recovered using avidin-based resins, eluted with a reducing agent, and then identified by MS. Reprinted with permission from reference 43.

1.2.3 Negative selection N termini enrichment methods

Compared to positive selection, negative selection is the more flexible enrichment strategy. This method does not depend on differentiating between the α-amines on N termini and the ε-amines on lysine residues, and it allows integration of stable isotope labeling for quantification purposes. In general, the first step in negative selection is the blocking of all amine groups on N termini and lysine residues by chemical labeling (e.g. acetylation, dimethylation, or tandem mass tag (TMT)) prior to proteolytic digestion39. The methods differ mainly in the strategy used to deplete internal peptides. In this section, I will describe various negative selection methodologies being developed to isolate N termini.

In the early 2000s, an ingenious and pioneering N-terminomics strategy, termed combined fractional diagonal chromatography (COFRADIC), was reported by Gevaert et al.45. As summarized in Figure 1.4, two consecutive reverse-phase high-performance liquid chromatography (HPLC) steps, coupled with an in-between modification step, are employed to isolate N-terminal peptides. More specifically, after blocking primary amines and protein digestion, peptides are fractionated via the first HPLC step. Then, protease-generated neo-primary amines of internal peptides in the collected fractions are modified with 2,4,6-trinitrobenzenesulfonic acid (TNBS), resulting in an increase of hydrophobicity and a delay of retention time, thereby allowing their specific separation from the original blocked N termini in the second HPLC step. However, TNBS has a poor labeling efficiency on internal peptides, particularly on pyroglutamyl peptides. To overcome this problem, Gevaert et al. improved the original COFRADIC protocol by introducing an additional strong cation exchange (SCX)-based pre-enrichment and enzymatic removal of pyroglutamate. The improved protocol can deplete most internal peptides, as well as pyroglutamyl residues, prior to COFRADIC46. Building on the SCX step, N- and C-terminal peptides can be further distinguished in a subsequent separation: butyrylation of C-terminal peptides increases their hydrophobicity and results in a delay of retention time. With the improved technique, an increase in the identification of both termini has been reported47.

Figure 1.4 Schematic workflow of COFRADIC. (1) All protein-cysteine residues are alkylated (open circles). (2) All free amines are acetylated (indicated by filled diamonds) and the proteins are digested with trypsin (3). Following the primary reverse phase-HPLC (RP-HPLC) fractionation of the generated peptide mixture, all peptides present in one HPLC fraction are treated with TNBS (4). Only the internal peptides are labelled (indicated by TNP in open boxes). (5) N-terminal peptides are collected at the same retention time from the second RP-HPLC run and analyzed further by LC-MS/MS. (6) Tagged internal peptides are discarded after the second RP-HPLC. Reprinted with permission from reference 45.

Inspired by COFRADIC, Dr. Zahedi and his co-workers introduced charge-based fractional diagonal chromatography (ChaFRADIC), incorporating two consecutive SCX separation steps into N termini enrichment48. As shown in Figure 1.5, amine groups on N termini and lysine residues are first blocked with dimethyl labeling prior to digestion, and the resulting peptides are separated by the first SCX fractionation into five defined charge states (+1, +2, +3, +4, and >+4). Within each fraction, the free amines of internal peptides are then deutero-acetylated, which reduces their charge state by one. The resulting charge differences separate the N termini and internal peptides during the second SCX chromatography fractionation. As the modified internal peptides shift to an earlier retention time while the retention time of N termini remains the same, N termini can be selectively enriched.
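The charge-shift principle behind ChaFRADIC can be sketched in a few lines. This is a simplified illustration under the assumption that acetylation lowers the net charge of every internal peptide by exactly one; the class and function names are hypothetical and the example peptides are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Peptide:
    name: str
    charge: int             # net charge observed in the first SCX separation
    free_alpha_amine: bool  # True for internal peptides created by digestion


def charge_after_acetylation(peptide: Peptide) -> int:
    """Acetylating a free alpha-amine removes one positive charge."""
    return peptide.charge - 1 if peptide.free_alpha_amine else peptide.charge


def second_dimension_pool(peptides: List[Peptide], fraction_charge: int) -> List[Peptide]:
    """Return the peptides of one charge fraction that keep their charge state."""
    fraction = [p for p in peptides if p.charge == fraction_charge]
    return [p for p in fraction if charge_after_acetylation(p) == fraction_charge]


peptides = [
    Peptide("blocked N-terminal peptide", charge=2, free_alpha_amine=False),
    Peptide("internal tryptic peptide", charge=2, free_alpha_amine=True),
    Peptide("internal peptide with missed cleavage", charge=3, free_alpha_amine=True),
]
kept = second_dimension_pool(peptides, fraction_charge=2)
print([p.name for p in kept])
```

Each charge fraction is processed separately, so an internal peptide that drops from +3 to +2 does not contaminate the +2 pool.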

Recently, Zahedi and his co-workers have replaced HPLC with tip-based SCX chromatography, termed charge-based fractional diagonal chromatography in a pipette tip (ChaFRAtip), which can identify N termini from 40 µg of pooled purified proteomes (Figure 1.6)49. Even with a 5-fold lower sample amount, tip-based ChaFRADIC demonstrated similar performance and reproducibility compared to the HPLC-based method. Compared to COFRADIC, ChaFRADIC has a simpler protocol and requires less starting material. Although the ChaFRAtip method significantly reduces the amount of starting material, this strategy is still labour-intensive. As an alternative, a simplified method proposed by Lai et al. depletes internal peptides by derivatizing them with negatively charged disulfonate groups50. This weakens the binding of internal peptides to SCX material, so that they can be effectively removed with a single SCX chromatography step. However, loss of acetylated N termini is an unavoidable limitation of this charge-reversal approach, due to the lack of positive charge on these N termini.

Figure 1.5 Scheme of the ChaFRADIC workflow. (1) Proteome is reduced and alkylated prior to dimethylation. (2) Samples are pooled and digested with trypsin, generating mixtures of N-terminal and internal peptides with different charge states at pH 2.7. (3) The pooled sample is separated using an optimized SCX gradient and charge states are selectively collected. (4) Free amines on internal peptides are deutero-acetylated for each charge-state fraction separately, inducing a charge-state reduction, whereas N-terminal peptide charge states remain unchanged. (5) Each derivatized sample is subjected to a second SCX separation under the same conditions, and only those peptides retaining their charge states are collected for the subsequent LC-MS analysis. Reprinted with permission from reference 48.

Figure 1.6 Schematic diagram for ChaFRAtip strategy. Free protein N termini and lysines are labeled with isobaric tags for relative and absolute quantitation (iTRAQ) or TMT on the protein level; samples are pooled and digested with trypsin. Peptides are fractionated on-tip according to their total net charge (TNC). Each collected fraction is acetylated, thus reducing the TNC of internal peptides by 1. In a second tip-based fractionation, internal peptides with reduced TNC are removed and N-terminal peptides are collected for subsequent LC-MS analysis. Reprinted with permission from reference 49.

Aside from COFRADIC-based enrichments, a conceptually different negative N termini selection strategy, so-called terminal amine isotopic labeling of substrates (TAILS), has been developed by Dr. Overall and his co-workers51,52. Their protocol also begins with the blocking of amine groups on N termini and lysine residues by dimethylation prior to protein digestion. Internal peptides are then captured by home-made hyperbranched polyglycerol-aldehyde-derivatized (HPG-ALD) polymers, which increases their size and allows their separation through a size exclusion filter (Figure 1.7). Recently, this well-established N termini enrichment method has been applied to studies of mitochondrial pathways and has identified 475 unique mitochondrial N-terminal peptides from approximately 15 million HeLa cells53. However, this method has some limitations. A relatively large amount of starting material (>100 µg protein) is required to trigger nucleation during protein precipitation in the multiple protein extraction steps. Polymer degradation is also a concern for long-term LC-MS/MS analysis, as residual polymers cause LC column clogging and damage.

Figure 1.7 Schematic representation of the TAILS workflow. Primary amines of natural and protease-generated N termini (red NH2), and lysine residues (red K) in a proteome are chemically modified and blocked by dimethylation or iTRAQ (red stars) at the protein level. Then, the sample is digested with trypsin, which generates internal tryptic peptides with free N termini (green NH2). The newly formed internal tryptic peptides are tagged by amine-reactive hyperbranched aldehyde-derivatized polymer and isolated by size-based ultrafiltration. The natural and neo-N-terminal peptides are collected and analyzed by LC-MS/MS. Reprinted with permission from reference 52.

Besides the use of standard polarity-based affinity columns and size-based filters for negative selection, titanium dioxide (TiO2) affinity chromatography has been employed by Mommen et al. for the depletion of phospho-tagged (PTAG) internal peptides (Figure 1.8)54. This concept and approach are well-established for enrichment of phosphorylated peptides55. With this method, acidic conditions are required to promote the interaction between positively charged TiO2 and negatively charged phosphopeptides; the bound peptides are then eluted under basic conditions. However, some non-phosphorylated peptides, particularly acidic peptides, can also be retained under the same conditions. A clear disadvantage of this enrichment method is that both PTAG-derived peptides and naturally occurring phosphorylated N termini are removed, as they have similar charges and binding affinity to TiO2. Therefore, stringent optimization of this strategy is necessary to minimize or avoid selection bias.

Li et al. describe another metal-based selection method to isolate N-terminal peptides using sulfydryl-tagging and gold-nanoparticle-based depletion (STagAu)56. As shown in Figure 1.9, their strategy converts the amine groups on internal peptides to reactive sulfydryl groups, which can then be captured by gold nanoparticles through thio-gold interaction.

Figure 1.8 Workflow for N-terminal peptide enrichment by the PTAG strategy. A reduced and alkylated proteome is prepared, followed by dimethylation of free amines (gray double dot). Then, the proteome is purified from excess reagent by acetone precipitation and remaining residual amounts are quenched by ammonium hydroxide. After proteolytic digestion, peptides with an N-terminal glutamine are enzymatically converted into pyroglutamate by Qcyclase and subsequently cleaved by pGAPase. Internal peptides are phospho-tagged (PTAG) by reaction with glyceraldehyde-3-phosphate (GAP3). PTAG-peptides are depleted by TiO2 affinity chromatography, enriching dimethylated (gray double dot) and naturally acetylated (open triangle) N-terminal peptides in the flow-through fraction. Reprinted with permission from reference 54.

Figure 1.9 Scheme of STagAu strategy. (a) The STagAu method for isolation of protein N-terminal peptides. The amine groups on protein are first blocked by dimethylation. After digestion, sulfydryl tagging is applied on internal peptides. Then, the internal peptides are removed through thio-gold interaction. (b) Sulfydryl tagging reaction of primary amine by Traut’s reagent. Reprinted with permission from reference 56.

Other well-known interactions used in biochemistry have also been adapted for negative N termini enrichment. For instance, McDonald et al. removed proteolytically generated internal peptides based on the biotin-streptavidin interaction, where biotin-labeled internal peptides are removed with streptavidin beads57. Additionally, they demonstrated a one-step procedure where the internal peptides are coupled to NHS-activated Sepharose beads and unbound N termini are isolated by centrifugation57. However, a major drawback of this method is that histidine residues show non-specific side reactivity with NHS-Sepharose, which can introduce bias and cause sample loss.

To overcome the problem of polymer degradation in the well-established TAILS protocol and selection bias in many existing protocols, Chen et al. have proposed an alternative negative enrichment method, termed hydrophobic tagging-assisted N-termini enrichment (HyTANE)58. As shown in Figure 1.10, instead of using HPG-ALD polymers for internal peptide elimination, they utilized a commercially available hexadecanal tagging agent on internal peptides. Due to their increased hydrophobicity, the tagged internal peptides are trapped on C18 columns by hydrophobic interactions. In contrast, N termini are collected in the flow-through, via elution with 80% acetonitrile (ACN). Unlike SCX-based or TiO2-assisted approaches, the HyTANE enrichment method is not biased towards histidine (His)-containing and acidic N-terminal peptides.

Figure 1.10 Schematic workflow for negative enrichment of N-terminal peptides by HyTANE strategy. All primary amines in the sample are blocked. After proteolytic digest, primary amines of internal peptides are modified with a long-chain aldehyde that promotes depletion on a reverse phase column. Reprinted with permission from reference 58.

In general, the main challenges in the development of an effective negative selection enrichment strategy are labeling specificity and efficiency for the exclusion of internal peptides, as well as simplicity and effectiveness of the enrichment procedure. Shortcomings in any of these factors can introduce enrichment bias, cause sample loss, or leave substantial contamination with internal peptides, and they must be considered when selecting an appropriate enrichment strategy.

1.3 Objective

Deep understanding of proteolytic activities and identification of protease-generated proteoforms would greatly benefit personalized cancer therapeutics and biomarker development. As described in Section 1.2, current N termini enrichment methods require large amounts of starting material (>100 µg protein), which is infeasible for limited or precious biological specimens.

Hypothesis: I hypothesize that precipitation and nonspecific binding to surfaces are sources of significant sample loss, which underlie the requirement for large sample amounts. I further hypothesize that utilizing a combination of carboxylate-modified magnetic beads, hydrophobic tagging and internal peptide removal techniques may reduce sample loss and increase sensitivity of N termini identification.

The objective of this thesis is to develop an automated and sensitive mass spectrometry method to monitor protein termini and in turn, proteolytic activities relevant to pediatric cancers. To achieve this, the objective is divided into three specific aims.

Aim I: Develop a highly sensitive microscale N termini enrichment method for use in limited samples.

Aim II: Implement the developed enrichment protocol in an automated liquid handling system to facilitate profiling of aberrant proteolytic processes in clinical cohorts, thus improving on labor-intensive sample handling procedures and minimizing reproducibility issues.

Aim III: Demonstrate the application of the novel automated mass spectrometry method for in-depth medical and biological studies.

After this introductory chapter, Chapter 2 will describe the development, optimization, and validation of the protocol, termed HUNTER. Chapter 3 will demonstrate and discuss the application of HUNTER. Chapter 4 will conclude this research and suggest future directions.

Chapter 2 Development of High-efficiency Undecanal-based N Termini EnRichment (HUNTER)

2.1 Introduction

Current peptide-based bottom-up proteomics enables protein identification and quantification on a proteome-wide scale59, but standard protocols and database search parameters exclude N termini identification, especially protease-generated neo-N-terminal peptides39. To overcome this challenge, dedicated methods for selective enrichment and unbiased identification of N-terminal peptides from complex proteomes have been developed41,45,48,51,58. Such N-terminome profiling has greatly advanced our understanding of apoptosis41,60, revealed novel proteolytic proteoforms in human tissues61,62 and animal models of diseases63, identified protease substrates underlying common diseases and rare genetic disorders64,65, and enabled characterization of alternative protein translation initiation sites66,67 and protein N-terminal modifications68. However, critical proteolytic processes in development69 and disease pathogenesis70 are often confined in space and time. With current N termini enrichment technologies, it is often difficult to study the N-terminome, due to insufficient starting material71. The most sensitive protocol available to date has enabled N termini enrichment from 40 µg of 10 pooled isobarically labeled, purified proteomes obtained from milligrams of cultured cell lysate49. In contrast, improved and automated proteome49 and phosphoproteome72 sample processing now enable comprehensive analyses of cell-type specific processes and clinically relevant microscale samples (<20 µg). To achieve a similar leap in the analysis of proteolytic proteoforms, we have developed HUNTER, an automatable workflow for the sensitive enrichment of N-terminal peptides from as little as 2 µg crude protein in any cell or tissue lysate using off-the-shelf reagents.

2.2 Method

2.2.2 Materials and reagents

2.2.2.1 HeLa cell line

HeLa cells (American Type Culture Collection; cat. no. CCL-2) were cultured in RPMI 1640 medium (ThermoFisher Scientific; cat. no. 11875-093) with 10% Cosmic Calf Serum (GE Healthcare Life Sciences; cat. no. SH30087.04) and maintained in a humidified incubator at 37°C with 5% CO2. Cultured cells were collected using 0.25% Trypsin-EDTA (ThermoFisher Scientific; cat. no. 25200056), centrifuged at 800 x g and washed with PBS (ThermoFisher Scientific; cat. no. 10010023) to collect pellets of different cell quantities. Cell pellets were frozen and stored at -80°C until lysis.

2.2.2.2 Plant material

Arabidopsis thaliana Col-8 wild type (accession N60000) seed stock was obtained from the Nottingham Arabidopsis Stock Center. A. thaliana Col-8 plants were stratified for 3 d at 4°C and subsequently grown on soil under short day conditions (9 h light with an intensity of 100 µE m⁻² s⁻¹ at 22°C and 15 h darkness at 18°C, 75% RH). Leaves of 6-week-old plants were harvested and snap frozen in liquid nitrogen.

2.2.2.3 Rat brain samples

Rat brains were obtained from Wistar rats that were sacrificed for liver perfusion experiments at the University Hospital Düsseldorf as approved by local authorities (LANUV NRW #G287/15) and immediately snap frozen in liquid nitrogen.

2.2.3 Preparation of single-pot solid-phase-enhanced sample-preparation beads

Hydrophilic (10 µg/µL; GE Life Sciences; cat. no. 4515-2105-050250) and hydrophobic (10 µg/µL; GE Life Sciences; cat. no. 6515-2105-050250) Sera-Mag SpeedBeads carboxylate-modified magnetic beads were combined at a 1:1 (v/v) ratio in a 1.5 mL flex tube (Eppendorf, cat. no. 022364111), which was then placed on a magnetic stand (Life Technologies, cat. no. 12321D) for removal of the supernatant. The beads were washed twice with HPLC water (Fisher Scientific; cat. no. W6-4), reconstituted in HPLC water, and stored at 4°C.

2.2.4 Preparation of stop-and-go-extraction tips

Four small circular Empore™ SPE C18 disks (Sigma, cat. no. 66883-U) were punched with a flat-end needle (Hamilton, cat. no. 90517). A straightened paper clip was used to gently push the C18 disks down into a P200 pipette tip (VWR, cat. no. 89079-474).

2.2.5 Fluorometric and colorimetric protein and peptide measurements

To evaluate labeling, binding and elution efficiencies during protocol optimization, peptide concentration and primary amine reactivity were quantified using the Pierce quantitative fluorometric peptide assay (Thermo Fisher Scientific; cat. no. 23290) and Pierce quantitative colorimetric peptide assay (Thermo Fisher Scientific; cat. no. 23275) following the assay protocols.

2.2.6 High-pH reversed phase fractionation

Fractionation was performed with an Agilent 1100 HPLC system equipped with a diode array detector (254, 260, and 280 nm). The HPLC system was fitted with a Kinetex EVO C18 column (2.1 mm × 150 mm, 1.7 µm core shell, 100 Å pore size, Phenomenex). The samples were run at a flow rate of 0.2 mL per minute using a gradient of mobile phase A (10 mM ammonium bicarbonate, pH 8, Fisher Scientific, cat. no. BP2413-500) and mobile phase B (acetonitrile, Sigma-Aldrich, cat. no. 34998-4L) from 3% to 35% B over 60 min. Fractions were collected every minute across the elution window for a total of 48 fractions, then concatenated to a final set of 12 (e.g. fraction 1+13+25+37 as final fraction 1). All the fractions were dried in a SpeedVac centrifuge and resuspended in 0.1% FA in water (Thermo Scientific, cat. no. SC2352911) prior to mass spectrometry analysis.
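The concatenation scheme can be written out explicitly. The following is a minimal sketch that assumes every pool follows the same interval-of-12 pattern as the example given for final fraction 1; the function name is illustrative.

```python
def concatenate_fractions(n_fractions: int = 48, n_pools: int = 12) -> dict:
    """Assign collected fractions to concatenated pools.

    Fraction i goes to pool ((i - 1) mod n_pools) + 1, so each pool combines
    fractions spaced n_pools apart across the elution window.
    """
    pools = {pool: [] for pool in range(1, n_pools + 1)}
    for fraction in range(1, n_fractions + 1):
        pool = (fraction - 1) % n_pools + 1
        pools[pool].append(fraction)
    return pools


pools = concatenate_fractions()
for pool, members in pools.items():
    print(pool, members)
```

Pooling early, middle, and late eluting fractions in this way spreads peptides of different hydrophobicity across each final fraction.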

2.2.7 Optimization of HUNTER

2.2.7.1 Optimizing dimethyl labeling of proteins

Ten micrograms of reduced and alkylated HeLa protein in 200 mM HEPES, pH 7.0 (Sigma, cat. no. H4034-1KG) were used as starting material. Refer to Section 2.2.8.1 for detailed lysis conditions. Freshly prepared 2 M formaldehyde and 1 M sodium cyanoborohydride were added to final concentrations of 30 mM and 15 mM, respectively, in 200 mM HEPES, pH 7.0. The reaction was incubated at 37°C as previously indicated. After the first incubation, fresh labeling reagents were added and the reaction was incubated at 37°C for 1 h. Both LC-MS/MS analysis and the amine-reactive quantitative fluorometric peptide assay were performed to evaluate the dimethyl labeling efficiency.

2.2.7.2 Evaluation of removal of dimethyl labeling reagents

Freshly prepared 2 M formaldehyde (Sigma-Aldrich, cat. no. 252549) and 1 M sodium cyanoborohydride (Sigma-Aldrich, cat. no. 296813) were diluted to final concentrations of 30 mM and 15 mM, respectively, in 200 mM HEPES, pH 7.0, with a total volume of 30 µL. The reaction was incubated at 37°C for 1 h. Subsequently, fresh labeling reagents were added and the reaction was incubated for another hour. Four molar Tris buffered to pH 6.8 (Fisher BioReagents, cat. no. BP153-1) was added to a final concentration of 600 mM to quench the reaction, followed by incubation at 37°C for 3 hours. The single-pot solid-phase-enhanced sample-preparation (SP3) procedure was performed as the final cleanup step. Ten micrograms of reduced and alkylated HeLa protein were introduced after the dimethyl labeling and cleanup steps. The amine-reactive quantitative fluorometric peptide assay was performed to evaluate the efficiency of labeling reagent removal by SP3 and Tris quenching buffer.
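The reagent volumes implied by these concentrations follow from the dilution relation C1·V1 = C2·V2. The sketch below is illustrative only: it assumes ideal mixing, ignores the small volume change from the second reagent addition, and uses hypothetical function names.

```python
def stock_volume_for_final_volume(stock_mM: float, final_mM: float, final_volume_uL: float) -> float:
    """Volume of stock needed when the total reaction volume is fixed (C1*V1 = C2*V2)."""
    if stock_mM <= final_mM:
        raise ValueError("stock must be more concentrated than the target")
    return final_mM * final_volume_uL / stock_mM


def stock_volume_to_add(stock_mM: float, final_mM: float, start_volume_uL: float) -> float:
    """Volume of stock to add on top of an existing volume.

    Solves stock * v = final * (start + v) for v, because the added
    stock itself increases the total volume.
    """
    if stock_mM <= final_mM:
        raise ValueError("stock must be more concentrated than the target")
    return final_mM * start_volume_uL / (stock_mM - final_mM)


# 2 M formaldehyde to 30 mM and 1 M cyanoborohydride to 15 mM, each in 30 uL total
formaldehyde_uL = stock_volume_for_final_volume(2000, 30, 30)
cyanoborohydride_uL = stock_volume_for_final_volume(1000, 15, 30)
# 4 M Tris added to a 30 uL reaction to reach 600 mM
tris_uL = stock_volume_to_add(4000, 600, 30)
print(round(formaldehyde_uL, 2), round(cyanoborohydride_uL, 2), round(tris_uL, 2))
```

Sub-microlitre volumes such as these are difficult to pipette accurately, so in practice an intermediate dilution of the stock may be preferable; that choice is not specified in the text.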

2.2.7.3 Optimizing undecanal modification of peptide α-amines

Ten micrograms of HeLa peptides in 200 mM HEPES, pH 7.0 were chosen as starting material. Subsequently, 0.25 µL undecanal (~200 µg; 0.825 g/mL) and 1 M sodium cyanoborohydride, diluted to a final concentration of 30 mM, were added, and the mixture was topped up to 40% ethanol (40% EtOH) or 50% acetonitrile (50% ACN) v/v for a final volume of 20 µL. Then, the samples were incubated at 37°C for an hour. Finally, the performance was examined by the quantitative fluorometric peptide assay. In this assay, HeLa peptide alone was used as a positive control, whereas water with labeling reagents was used as a negative control. Peptide with undecanal alone, with sodium cyanoborohydride alone, and with both labeling reagents were tested. N termini enrichment efficiency was further examined by MS analysis with varying organic solvents (40%/50% EtOH and ACN), incubation time (from 30 min to overnight), starting material (25 µg to 400 µg proteome), and hydrophobic labeling (hexadecanal and undecanal).
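The undecanal amounts quoted in this chapter follow from the stated density. As a minimal sketch (function names are hypothetical; the density of 0.825 g/mL is the value given above), the conversion between volume and mass is:

```python
UNDECANAL_DENSITY_G_PER_ML = 0.825  # value quoted in the text


def undecanal_mass_ug(volume_uL: float, density_g_per_mL: float = UNDECANAL_DENSITY_G_PER_ML) -> float:
    """Mass in micrograms; 1 g/mL equals 1000 ug/uL."""
    return volume_uL * density_g_per_mL * 1000.0


def undecanal_volume_uL(peptide_ug: float, ratio_w_w: float = 20.0,
                        density_g_per_mL: float = UNDECANAL_DENSITY_G_PER_ML) -> float:
    """Volume of neat undecanal needed for a given undecanal/peptide mass ratio."""
    return peptide_ug * ratio_w_w / (density_g_per_mL * 1000.0)


print(undecanal_mass_ug(0.25))              # mass delivered by 0.25 uL
print(round(undecanal_volume_uL(10.0), 3))  # volume for a 20:1 w/w ratio with 10 ug peptide
```

For 10 µg of peptide, 0.25 µL corresponds to roughly a 20:1 w/w excess of undecanal, in line with the ratio used in Section 2.2.8.5.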

2.2.7.4 Optimizing undecanal removal

In this experiment, the removal of undecanal in 40% or 50% ethanol or acetonitrile was tested using three different C18 columns. Undecanal (412 µg, 1650 µg, or 8250 µg) in 0.1% TFA (Sigma-Aldrich, cat. no. T6508-100 mL) in 40%/50% ethanol or acetonitrile was spun through a fully conditioned stop-and-go-extraction tip (StageTip), microspin column, or Sep-Pak C18 column, respectively. The flow-through was collected in 1.5 mL Eppendorf tubes and the volume was reduced in a SpeedVac. Ten micrograms of HeLa peptides were added, followed by 1 M sodium cyanoborohydride to a final concentration of 30 mM. The volume was adjusted with 40% ethanol to a final volume of 20 µL. The samples were incubated at 37°C for 1 h and then measured using the quantitative fluorometric peptide assay. The undecanal calibration curve was constructed from 0.21 µg/µL to 41.3 µg/µL.

2.2.7.5 Evaluation of peptide recovery dependency on solvent concentrations

StageTips with 4 C18 disks were prepared and conditioned with methanol and 0.1% TFA in water. Ten micrograms of HeLa peptides in 0.1% TFA in water were loaded on StageTips and centrifuged at 1200 x g. The peptides were sequentially eluted with 40% ethanol, 50% acetonitrile, and 80% acetonitrile and collected in 1.5 mL Eppendorf tubes. Then, the samples were dried in a SpeedVac and topped up with HPLC water to 10 µL. The samples were sonicated before performing colorimetric peptide quantification. Elution with 80% and 100% acetonitrile, respectively, and the initial HeLa peptides in 0.1% TFA in water were used as controls.

2.2.8 HUNTER protocol

2.2.8.1 Preparation of HeLa cell lysates

The HeLa cells were first lysed in 1.5 mL protein Lobind tubes (Eppendorf; cat. no. 022431081) with lysis buffer consisting of 1% sodium dodecyl sulphate (Fisher BioReagents, cat. no. BP8200-500) and 2X Thermo Halt protease inhibitor cocktail (Thermo Scientific, cat. no. 1861279) in 50 mM HEPES, pH 8.0. The lysate was heated at 95°C for 5 min, then chilled on ice for another 5 min. Any liquid condensation or droplets were spun down by centrifugation. Benzonase (EMD Millipore; cat. no. 70664-3) was added at a ratio of 1 unit to 37 µg of DNA and incubated at 37°C for 30 min. Then dithiothreitol (DTT; Fisher BioReagents; cat. no. BP172-25) was added to 10 mM and incubated at 37°C for 30 min, followed by addition of 2-chloroacetamide (CAA; Sigma-Aldrich; cat. no. C0267-100G) to 50 mM and further incubation at room temperature (RT) in the dark for 30 min. To quench the alkylation, DTT was added to a final concentration of 50 mM and incubated at RT in the dark for 20 min. Protein Lobind tubes were used during all sample handling steps.

2.2.8.2 Preparation of Arabidopsis thaliana leaf and rat brain lysates

Frozen plant leaves or rat brains were lysed in a lysis buffer consisting of 6 M guanidine hydrochloride, 0.1 M HEPES pH 7.4, 1 mM DTT, 5 mM EDTA and 1X Thermo Halt Protease inhibitor and then homogenized twice for 30 s at 18,000 rpm with a Kinematica Polytron PT-2500 (Kinematica, Luzern, Switzerland). The homogenate was filtered through Miracloth (Merck, Darmstadt, Germany) and cell debris was pelleted at 500 x g for 5 min at 4°C. Proteins in the supernatant were chloroform/methanol precipitated or SP3 extracted, then resuspended in 1:2 diluted homogenization buffer. Proteomes were reduced with 5 mM DTT for 30 min at 56°C, alkylated with 15 mM iodoacetamide (IAA) for 30 min in the dark at RT, and quenched by addition of a further 15 mM DTT and incubation for 15 min at RT.

2.2.8.3 SP3 bead binding and proteome clean-up

After reduction and alkylation, prepared SP3 beads were added to protein mixtures at a 1:10 (w/w) ratio of protein to SP3 beads. Pure 100% ethanol was added to a final concentration of 80% v/v to initiate binding. After 18 min incubation at RT, the supernatant was removed with the assistance of a magnetic stand and the beads were rinsed two times with 400 µL 90% ethanol. Beads were resuspended by pipet mixing, with a 30 s break between each step to allow the beads to settle on the magnetic stand. The remaining ethanol was spun down prior to the removal of supernatant and the beads were resuspended in 30 µL 200 mM HEPES, pH 7.0.
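The bead amount and ethanol volume for the binding step can be estimated with simple arithmetic. This is a rough sketch with hypothetical function names; it assumes ideal volume additivity (ethanol-water mixtures contract slightly on mixing) and leaves the bead suspension concentration out, since the reconstitution volume is not stated.

```python
def sp3_bead_mass_ug(protein_ug: float, beads_per_protein_w_w: float = 10.0) -> float:
    """Bead mass for a given protein amount at a protein:bead ratio of 1:N (w/w)."""
    return protein_ug * beads_per_protein_w_w


def ethanol_to_add_uL(sample_volume_uL: float, target_fraction: float = 0.80) -> float:
    """Volume of 100% ethanol that brings a sample to the target v/v fraction.

    Solves x / (V + x) = f for x, giving x = V * f / (1 - f).
    """
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target fraction must lie between 0 and 1")
    return sample_volume_uL * target_fraction / (1.0 - target_fraction)


print(sp3_bead_mass_ug(10.0))             # beads for 10 ug protein at 1:10
print(round(ethanol_to_add_uL(30.0), 1))  # ethanol for a 30 uL sample at 80% v/v
```

Reaching 80% v/v therefore requires four sample volumes of ethanol, which sets a practical upper limit on the sample volume that fits in a given tube or well.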

2.2.8.4 Protein dimethyl labeling

Freshly prepared 2 M formaldehyde solution and 1 M sodium cyanoborohydride were added to 30 mM and 15 mM final concentration, respectively. The lysate was incubated at 37°C for 1 h in an incubator, before repeated addition of fresh labeling reagents and incubation for another hour. To quench the reaction, 4 M Tris buffered to pH 6.8 was added to a final concentration of 600 mM and incubated at 37°C for 3 hours. For removal of excess reagents, new SP3 beads were added at a 1:5 ratio and protein was bound by addition of 100% ethanol to a final concentration of 80% v/v ethanol. Beads were settled on a magnetic stand after 15 min incubation at RT, the supernatant was removed, and the beads were washed twice with 400 µL of 90% ethanol. The tube was briefly centrifuged to collect and remove the remaining wash solution before resuspension of the beads in 30 µL trypsin (1 mg/mL, Promega, cat. no. V5113) in 200 mM HEPES buffer, pH 8.0. Beads were fully immersed in the solution and the trypsin to protein ratio was at minimum 1:100. After incubation at 37°C in an incubator for at least 13 hours, 10% of the sample was removed to assess dimethyl labeling efficiency or to quantify protein abundance (pre-HUNTER sample). The reaction was mixed by tapping and 30 s sonication after addition of each new reagent.

2.2.8.5 Enrichment of N termini by undecanal-based negative selection

Anhydrous ethanol was added to the proteome digest to 40% v/v before addition of undecanal (EMD Millipore, cat. no. 8410150025) at an undecanal/peptide ratio of 20:1 w/w and addition of 1 M sodium cyanoborohydride to a final concentration of 30 mM. The pH was confirmed to be between pH 7 and 8 before incubation at 37°C for 1 hour. The reaction was sonicated in a water bath at 60 kHz for 15 seconds and placed on a magnetic rack for 1 min. The supernatant was transferred to a new Lobind tube and acidified with 0.5% TFA in 40% ethanol to pH 3-4 before loading onto a C18 column for removal of undecanal-tagged peptides. Different columns were chosen to provide sufficient binding capacity for excess undecanal reagent: self-packed 4-layered C18 StageTips were chosen for 1 to 5 µg protein; microspin columns (Nest Group Inc, cat. no. SEM SS18V) for 5 to 20 µg protein; macrospin columns (Nest Group Inc, cat. no. SMM SS18V) for 20 to 100 µg protein; Sep-Pak columns (Waters, cat. no. WAT054960) for 100-1000 µg protein; and HR-X (M) spin columns (Macherey-Nagel, cat. no. 730525) for experiments with Arabidopsis and rat brain proteome. The sample volumes were topped up with 0.1% TFA in 40% ethanol to a loading volume of 80 µL, 200 µL, 400 µL and 500 µL for StageTips, microspin, HR-X (M) spin, macrospin column, and Sep-Pak respectively. Before loading the samples, the StageTips were conditioned with 100 µL methanol followed by 100 µL 0.1% TFA in 40% ethanol, whereas microspin columns, macrospin columns, HR-X (M) spin columns and Sep-Pak columns were conditioned with a volume of 200 µL, 200 µL, 400 µL and 700 µL respectively. After the conditioning of C18 columns, the samples were loaded and the flow-through was collected in 1.5 mL protein Lobind tubes. The ethanol in the collected flow-through was removed by vacuum-supported evaporation, and peptides were resuspended in 0.1% TFA in HPLC water and desalted using home-made C18 StageTips or commercial reverse-phase C18 spin columns.
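The column choice described above can be summarized as a simple lookup. The sketch below is an illustration, not part of the protocol: the stated ranges share their boundary values (e.g. 5 µg), and assigning a boundary amount to the smaller column is an assumption made here. The HR-X (M) spin column is omitted because it was selected by sample type rather than by protein amount.

```python
# (upper limit in ug protein, column type), taken from the ranges in the text
COLUMN_UPPER_LIMITS = [
    (5.0, "C18 StageTip (4 layers)"),
    (20.0, "microspin column"),
    (100.0, "macrospin column"),
    (1000.0, "Sep-Pak column"),
]


def select_c18_column(protein_ug: float) -> str:
    """Pick the smallest column whose stated range covers the protein amount.

    Boundary amounts are assigned to the smaller column (an assumption).
    """
    if protein_ug < 1.0 or protein_ug > 1000.0:
        raise ValueError("protein amount outside the 1-1000 ug range described")
    for upper_limit, column in COLUMN_UPPER_LIMITS:
        if protein_ug <= upper_limit:
            return column
    raise AssertionError("unreachable: range check above covers all cases")


for amount in (2, 10, 50, 500):
    print(amount, "ug ->", select_c18_column(amount))
```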

2.2.9 HUNTER automation

Human peripheral blood plasma (STEMCELL Technologies, cat. no. 70039) was processed on an epMotion M5073 automated liquid handling system (Eppendorf) controlled by an EasyCon tablet (Eppendorf). The HUNTER protocol was programmed with epBlue Studio (ver. 40.4.0.38). The M5073 was configured with dispensing tools TS50 (1.0-50 µL) and TS1000 (40-1000 µL), epT.I.P.S. Motion racks (1.0-50 µL and 40-1000 µL), epMotion gripper, Thermoadapter for 96-PCR plate (skirted), Alpaqua Magnum FLX 96 magnet plate, Eppendorf rack for 24x safe lock, and Twin.tec PCR plate 96 (semi-skirted; max. well volume 250 µL).

The following adaptations to the HUNTER protocol were made to achieve optimal automation. 250-300 µg protein (maximum 5 µL plasma) was processed. Dimethylation was performed at RT with a final formaldehyde concentration of 35 mM and a final sodium cyanoborohydride concentration of 15 mM. Two units of benzonase were added to 5 µL of plasma. Wash steps were programmed to aspirate 10 µL more than the dispense volume to ensure full removal of all wash buffers. During the digestion and undecanal labeling steps, the plate was covered with thermal adhesive sealing film (Diamed Lab Supplies Inc., cat. no. DLAU658-1) and incubated at 37°C. Samples and/or beads were mixed on the heater/shaker at 1500 rpm for 2 min. To prevent air bubbles from forming in tips and to ensure uniform dispensing, the aspiration speed was set to 10 mm/s. All pipetting steps were programmed to aspirate from the bottom and dispense from the top. Undecanal and ethanol were combined before dispensing into each well.

2.2.10 Mass spectrometry

2.2.10.1 Data-dependent acquisition

Pre-HUNTER and post-HUNTER HeLa and clinical samples were analyzed on a Q Exactive HF plus Orbitrap mass spectrometer coupled to an Easy-nLC 1200 liquid chromatography system (Thermo Scientific) with a 3 cm-long homemade precolumn (Polymicro Technologies capillary tubing, 360 µm OD, 100 µm ID) and a 35 cm-long homemade analytical column (Self-pack PicoFrit columns, 360 µm OD, 75 µm ID, 15 µm tip ID), packed with Dr. Maisch beads (ReproSil-Pur 120 C18-AQ, 3 µm), at a flow rate of 300 nL/min and a constant temperature of 50°C. Mobile phase A (0.1% formic acid in water) and mobile phase B (0.1% formic acid in 95% acetonitrile) were used for a 65 min gradient (3-8% B in 3 min, 8-27% B in 37 min, 27-42% B in 12 min, 42-100% B in 13 min). For data-dependent acquisition (DDA), a full-scan MS spectrum (350-1600 m/z) was collected with a resolution of 120,000 at m/z 200, a maximum acquisition time of 246 ms and an AGC target value of 1e6. MS/MS scans were acquired in the Orbitrap at a resolution of 60,000 with a maximum acquisition time of 118 ms, an AGC target value of 2e5 and an isolation window of 1.4 m/z. The top 12 precursors were selected. Normalized collision energy (NCE) was set to 28. Dynamic exclusion duration was set to 15 sec. Charge state exclusion was set to exclude unassigned, singly charged, and 5+ and higher charge states. The heated capillary temperature was set to 275°C. It should be noted that 0.8 µg peptides in plasma samples, 1 µg peptides in 500K and 1M HeLa post-HUNTER samples and all peptides in 10K, 20K, and 100K HeLa post-HUNTER samples were injected for LC-MS/MS analysis.
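As a quick consistency check, the four gradient segments listed above can be written as a table and summed to the stated 65 min. The sketch below also interpolates the mobile phase B percentage at a given time, assuming each segment is a linear ramp; the text gives only the segment endpoints, and the names used here are hypothetical.

```python
# (duration in min, %B at segment start, %B at segment end), taken from the text
GRADIENT = [
    (3.0, 3.0, 8.0),
    (37.0, 8.0, 27.0),
    (12.0, 27.0, 42.0),
    (13.0, 42.0, 100.0),
]


def total_minutes(segments=GRADIENT):
    """Total gradient length in minutes."""
    return sum(duration for duration, _, _ in segments)


def percent_b(t_min, segments=GRADIENT):
    """%B at time t_min, assuming linear ramps within each segment."""
    if t_min < 0 or t_min > total_minutes(segments):
        raise ValueError("time outside the gradient")
    start = 0.0
    for duration, b_start, b_end in segments:
        end = start + duration
        if t_min <= end:
            return b_start + (b_end - b_start) * (t_min - start) / duration
        start = end
    return segments[-1][2]


if __name__ == "__main__":
    print("total gradient length (min):", total_minutes())
    for t in (0, 3, 21.5, 40, 52, 65):
        print(f"t = {t} min: {percent_b(t):.1f}% B")
```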

Arabidopsis leaf and rat brain samples were analyzed on a two-column nano-HPLC setup (Ultimate 3000 nano-RSLC system with Acclaim PepMap 100 C18, ID 75 µm, particle size 3 µm columns: a trap column of 2 cm length and an analytical column of 50 cm length; ThermoFisher) with a binary gradient from 5-32.5% B over 80 min (A: H2O + 0.1% FA, B: ACN + 0.1% FA) and a total runtime of 2 h per sample, coupled to a high-resolution Q-TOF mass spectrometer (Impact II, Bruker) as described63. Data were acquired with the Bruker HyStar Software (v3.2, Bruker Daltonics) in line-mode in a mass range from 200-1500 m/z at an acquisition rate of 4 Hz. The top 17 most intense ions were selected for fragmentation, with dynamic exclusion of previously selected precursors for the next 30 sec unless intensity increased three-fold compared to the previous precursor spectrum. Intensity-dependent fragmentation spectra were acquired between 5 Hz for low intensity precursor ions (> 500 cts) and 20 Hz for high intensity (> 25k cts) spectra. Fragment spectra were averaged from two sets of stepped parameters, each used for 50% of the acquisition time: 61 µs transfer time, 7 eV collision energy and a collision RF of 1500 Vpp, followed by 100 µs transfer time, 9 eV collision energy and a collision RF of 1800 Vpp.

2.2.10.2 Data-independent acquisition

The samples were resolubilized in 0.1% formic acid and spiked with iRT peptides before analysis on the Q Exactive HF system (Thermo) described above. For 1 million HeLa cells (1 µg of protein was injected), a full-scan MS spectrum (350-1650 m/z) was collected with a resolution of 120,000 at m/z 200, a maximum acquisition time of 60 ms and an AGC target value of 3e6. Data-independent acquisition (DIA) segment spectra were acquired with a twenty-four-variable window format at a resolution of 30,000 with an AGC target value of 3e6, using 25% normalized collision energy (NCE) with 10% stepped NCE. The stepped collision energy was 10% at 25% (NCE=25.5 - 27.0 - 30.0). The maximum acquisition time was set to “auto”. The DIA method for 20,000 HeLa cell samples was slightly adjusted to accommodate low-complexity samples. A 10-variable window format was applied with a resolution of 60,000 and an AGC target of 3e6. The stepped collision energy was 28. A default charge state of 3 was applied for MS2 acquisition scans.

2.2.10.3 Data processing

Raw MS DDA data acquired on the Q Exactive HF were processed and searched with MaxQuant73 version 1.6.2.10 using the built-in Andromeda search engine. A first search peptide tolerance of 20 ppm and a main search peptide tolerance of 4.5 ppm were used. The human protein database was downloaded from UniProt (release 2018_09; 20,410 sequences) and common contaminants were included as embedded in MaxQuant. The “revert” option was enabled for decoy database generation. For analysis of enriched N termini (post-HUNTER) samples, dimethyl (peptide N-term and K) were selected as fixed modifications, whereas oxidation (M), acetyl (N-term), Gln→pyro-Glu, and Glu→pyro-Glu were dynamic modifications. ArgC semi-specific digestion with free N-terminus was used with a maximum of two missed cleavage sites. The label-free quantification minimum ratio count was 1. “Match between runs” was only enabled for clinical samples. The false discovery rates for PSM, peptide and protein were set to 1%. Label-free quantification was used to quantify the difference in abundance of N termini between samples. To determine dimethyl labeling efficiency and pullout efficiency from pre- and post-HUNTER samples, respectively, oxidation (M), acetyl (N-term), dimethyl (K), dimethyl (N-term), Gln→pyro-Glu, and Glu→pyro-Glu were selected as dynamic modifications. ArgC specific digestion mode was used in the first search and Trypsin/P semi-specific digestion mode was selected in the main search. To calculate pullout efficiencies, dimethyl (peptide N-term) was defined as a variable modification, and to calculate labeling efficiencies, both dimethyl (peptide N-term and K) were set as variable modifications.

Arabidopsis and rat brain DDA data acquired with Impact II Q-TOF instruments were processed and searched with MaxQuant73 v.1.6.3.3 using embedded standard Bruker Q-TOF settings that included peptide mass tolerances of 0.07 Da in the first search and 0.006 Da in the main search. The Arabidopsis and rat protein databases were downloaded from UniProt (Arabidopsis: release 2018_01, 41,350 sequences; rat: release 2017_12, 31,571 sequences) with appended common contaminants as embedded in MaxQuant. The “revert” option was enabled for decoy database generation. Database searches were performed as described above, except that enzyme specificity was set as Arg-C semi-specific with free N-terminus also in the first search and heavy dimethylation with 13CD2O formaldehyde was set as label (K), whereas oxidation (M), acetyl (N-term), heavy dimethyl (N-term), Gln→pyro-Glu, and Glu→pyro-Glu were set as dynamic modifications.

DIA data were analyzed with Spectronaut Pulsar X (version 12.0.20491.0.21112, Biognosys, Schlieren, Switzerland). First, a spectral library was generated by searching the DIA raw files for samples together with 36 DDA files acquired on 12 high-pH fractions for triplicate HeLa samples in Spectronaut Pulsar. The default settings were applied with the following changes: digest type was semi-specific (free N-terminus) for ArgC, and minimum peptide length was 6. Carbamidomethyl (C) and dimethyl (K) were fixed modifications, while variable modifications consisted of oxidation (M), acetyl (N-term), dimethyl (N-term), Gln→pyro-Glu, and Glu→pyro-Glu. The resulting spectral library contained precursor and fragment annotation and normalized retention times. This library was used for targeted analysis of DIA data using the default Spectronaut settings. In brief, the MS1 and MS2 tolerance strategies were ‘dynamic’ with a correction factor of 1. A similar setting was maintained for the retention time window for the extracted ion chromatogram. For calibration of MS run precision, iRT was activated with local (non-linear) regression. Feature identification was based on the ‘mutated’ decoy method, with ‘dynamic’ strategy and a library size fraction of 0.1. Precursor and protein false discovery rates were each set to 1%. The report generated from Spectronaut was filtered for N-terminal peptides with dimethyl and acetyl modifications.

2.2.10.4 Data availability

MS data have been deposited to the ProteomeXchange Consortium74 (http://www.proteomexchange.org) via the PRIDE (https://www.ebi.ac.uk/pride/archive/)75 and MassIVE (https://massive.ucsd.edu/) partner repositories with the following accession numbers: PXD012821 for HUNTER termini enrichment with Arabidopsis leaf extracts, PXD012844 for HUNTER termini enrichment with rat brain extracts, and PXD012915 for development of HUNTER on HeLa cells and commercially-available plasma.

2.3 Results

2.3.2 Optimization of HUNTER

The HUNTER protocol combines two main techniques to reduce sample loss while providing sensitive and unbiased N termini enrichment (Figure 2.1): carboxylate-modified SP3 beads76,77 for cleanup steps and hydrophobic tagging58 for exclusion of internal peptides. In this section, we discuss step-by-step optimization of the HUNTER protocol.

Figure 2.1 HUNTER workflow. (1) Lysis and protein purification with SP3 magnetic beads; (2) modification of N-terminal α- and lysine ε-amines; (3) proteome digestion; (4) modification of digestion-generated peptide α-amines with undecanal in 40% ethanol (grey background); (5) retention of undecanal-modified peptides on reverse-phase columns, while N-terminal peptides pass through.

First, we examined the dimethyl labeling efficiency of proteins bound to SP3 beads using an amine-reactive quantitative fluorometric peptide assay. As the number of amine groups from lysine and N-terminal protein residues determines the fluorescent signal intensity, their modification by dimethylation results in a reduced signal. In this experiment, we tested overnight dimethyl labeling, one-hour dimethyl labeling repeated twice, HeLa proteins alone (positive control), and labeling reagents alone (negative control). The dimethyl labeling efficiency is calculated as one minus the measured signal relative to the average of the positive control, expressed as a percentage. Compared to overnight incubation, two hours of incubation, with labeling repeated after the first hour, was sufficient to reach 99% dimethyl coverage (Figure 2.2). Repeated labeling can therefore shorten the incubation time from overnight to two hours. To validate the measurements obtained with the fluorescent amine-reactive label, LC-MS/MS analysis was performed as a final confirmation (Figure 2.3).
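The efficiency calculation described above (one minus the measured signal relative to the mean positive control, as a percentage) can be written out as follows. This is a minimal sketch: the function name and the fluorescence readings are hypothetical, and no background subtraction with the negative control is applied because the text does not describe one.

```python
from statistics import mean


def labeling_efficiency_percent(sample_signal, positive_control_signals):
    """Labeling efficiency (%) = (1 - sample / mean(positive control)) * 100."""
    reference = mean(positive_control_signals)
    if reference <= 0:
        raise ValueError("positive control signal must be positive")
    return (1.0 - sample_signal / reference) * 100.0


if __name__ == "__main__":
    # Hypothetical fluorescence readings in arbitrary units.
    unlabeled_hela = [1000.0, 1020.0, 980.0]  # positive control replicates
    labeled_sample = 10.0
    efficiency = labeling_efficiency_percent(labeled_sample, unlabeled_hela)
    print(f"labeling efficiency: {efficiency:.1f}%")
```

The same ratio applies to any assay in which blocked amines lower the signal relative to an unlabeled control.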

Figure 2.2 Completeness of on-bead dimethyl labeling of 10 µg HeLa cell lysates (n=3 technical replicates) based on quantitative fluorometric peptide assay. Error bars indicate standard deviation (SD).

Figure 2.3 Completeness of on-bead dimethyl labeling of HeLa cell lysates (n=3 technical replicates) based on LC-MS/MS analysis. Error bars indicate SD. Calculation of dimethyl labeling efficiency is based on number of dimethyl lysine divided by total lysine.

As excess dimethyl labeling reagents can react with neo-N termini after trypsin digestion and thereby reduce the efficiency of N termini isolation, removal of excess dimethyl labeling reagents is critical. To evaluate the cleanup ability of SP3 and quenching with Tris buffer, unlabeled (control), dimethyl labeled, and cleanup conditions were examined by amine-reactive quantitative fluorometric peptide assay (Figure 2.4a). In this experiment, HeLa proteins were spiked in after the dimethyl labeling and cleanup steps. Since N termini and lysine residue amine groups contribute to the fluorescent signal, a reduced signal would be observed in dimethyl labeled conditions, whereas successful removal of labeling reagents would result in a recovered signal. As shown in Figure 2.4a, a consistently high relative fluorescent signal was detected, demonstrating effective cleanup by the SP3 and Tris quenching steps. As Tris buffer contains amine groups that may contribute to this enhanced fluorescent signal, a follow-up experiment was designed to investigate elimination of Tris buffer with SP3 (Figure 2.4b). The reduced fluorescent signal after the SP3 cleanup shown in Figure 2.4b suggests successful removal of Tris buffer.

Figure 2.4 Effect of SP3 and Tris buffer in dimethyl labeling step. (a) Removal of excess dimethyl labeling reagent with SP3 and quenching buffer. The relative signal is calculated based on the average signal from the unlabeled condition. (b) Removal of Tris buffer with SP3 cleanup step. Plus and minus signs indicate the presence and absence of reagents or steps, respectively. Water was used for the conditions with absence of Tris buffer. n=3 technical replicates, error bars indicate SD. The relative signal is calculated based on the average signal from Tris buffer.

After improving the dimethyl labeling step, we optimized the hydrophobic tagging conditions for elimination of internal peptides, including incubation time, labeling condition, tagging material, and removal of residual tagging reagent. In HUNTER, we selected undecanal as the hydrophobic tag to increase the hydrophobicity of internal peptides; undecanal was attached to protease-generated amines on internal peptides via a Schiff base reaction. Without addition of organic solvent, we observed phase separation between undecanal and water. Therefore, addition of organic solvent is required during the undecanal labeling step. To simplify the workflow, we sought to determine the labeling conditions most appropriate for undecanal removal and N termini elution.

We first tested depletion of free undecanal in 40% EtOH and 50% ACN on various commercially-available C18 columns, including StageTips, microspin columns, and Sep-Pak. To measure undepleted undecanal in the flow-through, HeLa peptides were introduced; the peptides react with excess undecanal, which decreases the fluorescent signal in the aforementioned quantitative peptide assay. Undecanal concentrations in the samples were determined from the calibration curve shown in Figure 2.5. We observed that 40% EtOH provided the best and most consistent depletion efficiency, whereas the other organic conditions showed signal drops, indicating the presence of undecanal in the flow-through.

Figure 2.5 Effect of different C18 cartridges and mobile phases (EtOH and ACN) on the depletion of free undecanal. Eluent is incubated with tryptic peptides and measured using an amine reactive quantitative fluorometric peptide assay to determine the relative amount of free primary amines on tryptic peptides after reaction with undecanal from the eluent. The undecanal (UDC) calibration curve uses free undecanal. n=3 technical replicates, error bars indicate SD.

In addition, we assessed N-terminal peptide recovery with a commercially-available colorimetric peptide quantification assay, in which the signal intensity is proportional to the concentration of eluted peptides. We eluted pre-loaded HeLa peptides sequentially with 40% EtOH, 50% ACN, and 80% ACN and compared the recovery to a common elution condition with 80% ACN. We observed full peptide recovery with 40% EtOH elution, which is slightly better than with 80% ACN (Figure 2.6).

Figure 2.6 Evaluation of peptide recovery from C18 StageTips using 40% EtOH as mobile phase. Recovery measured by the amine-reactive colorimetric peptide assay relative to the starting material, in a sequential elution using 40% EtOH followed by 50% ACN and 80% ACN. Tryptic peptides desalted on C18 using 60% ACN were used as starting material. Mean of n=3 technical replicates, error bars indicate SD.

To evaluate the ideal organic solvent condition for the undecanal labeling reaction, we examined tagging in 40% EtOH and 50% ACN. Validation was done with the same quantitative fluorometric peptide assay used for dimethyl labeling optimization, under the previously described conditions. In this experiment, undecanal labeling in 40% EtOH and in 50% ACN, HeLa protein alone (positive control), and labeling reagents alone (negative control) were examined. As for the dimethyl labeling efficiency, the percentage is calculated as one minus the measured signal relative to the average of the positive control. Figure 2.7 indicates that both conditions are suitable for effective undecanal tagging (99%) of internal peptides.

Figure 2.7 Degree of HeLa-derived peptide undecanal modification in organic solvents, assessed by reactivity of unlabeled amines (n=3 technical replicates).

The dependence of the number of identified N-terminal peptides (Figure 2.8a) and of pullout efficiency (Figure 2.8b) on the undecanal incubation time was explored in greater detail by our collaborators, using MS analysis of the enriched end products. In this experiment, the full HUNTER workflow was performed with varying incubation times. With one hour of undecanal incubation, we observed high N termini identification (an average of 903 N termini from rat samples; 1,387 N termini from leaf samples) and high pullout efficiency (an average of 95.6% from rat samples; 96.5% from leaf samples) (Figure 2.8a and b). However, identification decreased after 30 minute and overnight incubations on rat brain and Arabidopsis samples, respectively. The decrease was likely due to incomplete labeling at 30 min and degradation during overnight incubation. As a result, a one-hour undecanal incubation time was chosen for the HUNTER protocol.

Figure 2.8 Effect of incubation time for undecanal modification of peptides on the number of identified N-terminal unique peptides (a) and pullout efficiency (b). 200 µg of rat total brain and Arabidopsis Col 8 whole leaf lysates were used as starting material. Mean of n=3 biological replicates, error bars indicate SD. Pullout efficiency is calculated based on number of termini divided by number of total peptides.
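The pullout efficiency defined in the caption (number of termini divided by number of total peptides) amounts to a simple count over the identified peptides. The sketch below assumes, for illustration only, that each identified peptide is annotated with its N-terminal modification and that peptides carrying a dimethyl or acetyl N terminus count as termini; the labels and function name are hypothetical and do not correspond to any search engine output format.

```python
# Assumption for this sketch: these N-terminal labels mark protein N termini.
N_TERMINAL_LABELS = {"dimethyl", "acetyl"}


def pullout_efficiency_percent(n_term_mods):
    """Percentage of identified peptides whose N terminus is blocked.

    n_term_mods: one label per identified peptide, e.g. "dimethyl",
    "acetyl", or "free" for digest-generated peptides with a free α-amine.
    """
    mods = list(n_term_mods)
    if not mods:
        raise ValueError("no peptides given")
    termini = sum(1 for mod in mods if mod in N_TERMINAL_LABELS)
    return 100.0 * termini / len(mods)


if __name__ == "__main__":
    # Hypothetical post-enrichment identification list.
    peptides = ["dimethyl"] * 15 + ["acetyl"] * 4 + ["free"]
    print(f"pullout efficiency: {pullout_efficiency_percent(peptides):.1f}%")
```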

To further evaluate the depletion performance, we compared the two hydrophobic tagging materials, undecanal and hexadecanal, used in the HUNTER and HyTANE methods58. Similar N termini identification and pullout efficiency were observed between the two types of hydrophobic tags and enrichment methods in the leaves (Figure 2.9). However, hexadecanal-assisted HyTANE showed poor enrichment on rat brain samples (average pullout efficiency of 46.9%, 91.5%, and 98.5% for hexadecanal-based HyTANE, undecanal-based HyTANE, and undecanal-based HUNTER, respectively).

Figure 2.9 Comparison of hydrophobic tagging materials used in HUNTER and HyTANE methods. (a) Unique peptides identified with adapted hexadecanal or undecanal-based HyTANE protocols with removal of organic solvent after tagging before C18-mediated depletion and after direct depletion of tagged peptides in the presence of organic solvent as implemented in HUNTER. (b) Enrichment efficiency assessed as N-terminally modified peptides compared to number of digest-generated peptides with free α-amine.

Additionally, we studied the effect of the amount of starting material on enrichment, using 25 µg to 400 µg of rat brain and Arabidopsis thaliana leaf proteome. Figure 2.10a shows that N termini identification increases with the amount of starting proteome within the same sample type (25 to 400 µg rat brain: average of 414 to 1,169 N termini; 25 to 400 µg Arabidopsis: average of 750 to 1,780 N termini). More than 95% pullout efficiency was observed across samples (Figure 2.10b).

Figure 2.10 The effects of enrichment on amount of starting materials from rat brain and leaf proteome. (a) Number of unique peptides and fraction of N-terminal peptides identified from different amounts of tryptic peptides used for undecanal modification. (b) pullout efficiency across different amounts of tryptic peptides. t=90 min undecanal modification. Mean of n=3 biological replicates, error bars indicate SD. A maximum of 1 µg enriched peptides was analysed by nano-LC-MS/MS.

2.3.3 Validation of HUNTER performance on HeLa cells

To evaluate the sensitivity, enrichment efficiency, and reproducibility of HUNTER, we performed HUNTER on a wide range of starting material amounts, from 2 µg to 200 µg of HeLa cell lysate (Figure 2.11 and 2.12). As a confirmation, dimethyl labeling and enrichment efficiency were checked with LC-MS/MS analysis across all samples to ensure the quality of the enrichment process. Figure 2.11a illustrates high dimethyl labeling coverage (average of 94% across all HeLa cell samples) on N termini and lysine residues in pre-enriched samples and a large N-terminal peptide population (average of 91% pullout efficiency across all HeLa cell samples) in enriched samples. A relatively larger standard deviation (~6%) was observed in the pullout efficiency from the lysate of 10,000 HeLa cells, which indicates that handling starting material below 2 µg proteome would be challenging with HUNTER (Figure 2.11a). Additionally, the percent distribution of N termini in pre- and post-enriched 20K HeLa samples was investigated to validate HUNTER enrichment performance. As shown in Figure 2.11b, enrichment increased the fraction of N termini from 4.4% to 93.2% in the 20K HeLa proteome. The results suggest that the method successfully depletes internal peptides and specifically enriches for N termini.

As shown in Figure 2.12, analysis with an Orbitrap Q Exactive HF mass spectrometer identified an average of 1,057, 1,230 and 1,454 N-terminal peptides from 2 µg, 4 µg and 20 µg crude protein lysate, respectively, within one hour. For larger sample amounts, only 1 µg of the recovered N-terminal peptides was injected, resulting in the identification of 1,810 N-terminal peptides on average. High-pH fractionation facilitated an in-depth termini profile for 200 µg HeLa proteome, which readily increased the identification to >5,000 N-terminal peptides (Figure 2.12).

A DIA workflow fragments every peptide signal in the sample, whereas DDA takes a subset of the most abundant peptide signals forward for fragmentation. The DDA fractionation data were further used to build a DIA spectral library, which provides the confidence to identify peptides by matching ion chromatograms against the library. To compare the two MS data acquisition modes, we performed DDA and DIA on enriched 20,000 and 1 million HeLa cell samples (Figure 2.13). DIA resulted in a nearly two-fold increase in N termini identification (DDA: total of 1,141 and 1,760 N termini in 20K and 1M HeLa samples respectively; DIA: total of 2,004 and 3,351 N termini in 20K and 1M HeLa samples respectively) and more than a two-fold increase in single termini quantification with <10% coefficient of variation (CV) in 1 hour DIA (DDA: 90 and 192 N termini with <10% CV in 20K and 1M HeLa samples respectively; DIA: 224 and 988 N termini with <10% CV in 20K and 1M HeLa samples respectively). By DIA, 988 N termini were quantified in 200 µg with single peptide CVs <10%, whereas DDA only quantified 192 N termini with CVs <10%. DIA further allowed quantification of more N termini at CVs <20% than the total number of termini identified by DDA.
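Counting termini below a CV threshold, as reported above, can be illustrated with a short sketch. It is a simplified stand-in for the actual analysis: CV is taken here as the sample standard deviation divided by the mean across replicates, whereas Figure 2.13 reports the average CV of pairwise comparisons; the names and intensities are hypothetical.

```python
from statistics import mean, stdev


def cv_percent(intensities):
    """CV (%) across replicates; None if quantified in fewer than two replicates."""
    values = [v for v in intensities if v is not None and v > 0]
    if len(values) < 2:
        return None  # corresponds to "n.a." (quantified in only one replicate)
    return 100.0 * stdev(values) / mean(values)


def count_below(intensities_by_terminus, threshold=10.0):
    """Number of termini whose CV is below threshold (%)."""
    count = 0
    for intensities in intensities_by_terminus.values():
        cv = cv_percent(intensities)
        if cv is not None and cv < threshold:
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical label-free intensities from triplicate injections.
    termini = {
        "terminus_A": [100.0, 102.0, 98.0],
        "terminus_B": [50.0, 100.0, 150.0],
        "terminus_C": [75.0, None, None],
    }
    for name, values in termini.items():
        print(name, cv_percent(values))
    print("termini with CV < 10%:", count_below(termini))
    print("termini with CV < 20%:", count_below(termini, threshold=20.0))
```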

Figure 2.11 Evaluation of HUNTER workflow using varying starting amounts of HeLa cells. (a) Dimethyl labeling efficiency assessed by lysine side chain modification before depletion (top) and N termini enrichment efficiency after enrichment (bottom). (b) Fraction of N-terminal and digest-generated peptides in 20K HeLa proteome digests before (pre-HUNTER) and after (post-HUNTER) enrichment. Shown are average values from three independent biological replicates. Error bars indicate SD.

Figure 2.12 Number of N termini identified from 10K to 1M HeLa cells. Circles, samples with <1 µg peptide analyzed in a single injection; triangles, larger samples with 1 µg enriched peptides analyzed per replica; diamonds, with offline high-pH pre-fractionation. Datapoints represent technical replicates, lines indicate the mean.

Figure 2.13 Comparison of coefficients of variation (CV) for N termini from 20K and 1M HeLa cells quantified by label-free DDA or DIA. Spectral library built on DDA of 12 fractions from 1M termini-enriched HeLa cells. Average CV of pairwise comparisons of quantified terminal peptides in triplicate analysis is displayed. n.a. denotes termini quantified in only one replicate.

2.3.4 Improved reproducibility on automated HUNTER

Since technical variation is introduced during sample processing, we compared the reproducibility of HUNTER when performed manually or by an automated liquid handler. We first assessed the reproducibility of manual processing by mapping the intensity-based Pearson correlation for two manually handled HeLa datasets (DDA, each set n=3 technical replicates) previously shown in Figure 2.12. As shown in Figure 2.14, the overall correlations are 0.84, 0.80, 0.78, 0.78, and 0.77 for enriched 10K, 20K, 100K, 500K, and 1M HeLa proteome, respectively. Specifically, the correlation factors between technical replicates are 0.89, 0.89, 0.88, 0.87, and 0.88 for enriched 10K, 20K, 100K, 500K, and 1M HeLa proteome, respectively, whereas the day-to-day correlation factors are 0.82, 0.74, 0.71, 0.72, and 0.70 for enriched 10K, 20K, 100K, 500K, and 1M HeLa proteome, respectively. With DIA, the correlation factors from three technical replicates increased to 0.91 and 0.97 with 20K and 1M HeLa proteome, respectively (Figure 2.15).
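An intensity-based Pearson correlation between two runs can be computed over the peptides quantified in both. The sketch below is illustrative only: the function names and intensities are hypothetical, and whether intensities are log-transformed before correlation is not stated in the text, so the values are used as passed in.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


def replicate_correlation(run_a, run_b):
    """Pearson correlation over peptides quantified in both runs.

    run_a, run_b: dicts mapping a peptide identifier to its intensity.
    """
    shared = sorted(set(run_a) & set(run_b))
    return pearson([run_a[p] for p in shared], [run_b[p] for p in shared])


if __name__ == "__main__":
    # Hypothetical peptide intensities from two replicate injections.
    replicate_1 = {"pep1": 1.0e6, "pep2": 4.0e6, "pep3": 2.5e6, "pep4": 8.0e5}
    replicate_2 = {"pep1": 1.2e6, "pep2": 3.6e6, "pep3": 2.4e6, "pep5": 5.0e5}
    print(f"Pearson r: {replicate_correlation(replicate_1, replicate_2):.3f}")
```

Restricting the comparison to shared peptides avoids imputing missing values, at the cost of ignoring peptides detected in only one run.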

To enhance reproducibility for normal DDA runs, we implemented the HUNTER protocol on an automated liquid handling system. Ultimately, we wanted to develop a user-friendly, sensitive, and robust platform for use in a laboratory or hospital setting. To mimic clinical settings, commercially-available human peripheral blood plasma was used for performance evaluation. We processed the plasma samples twice, with four technical replicates per set, on the automated liquid handler and analyzed them with LC-MS/MS. Figure 2.16 demonstrates a high Pearson correlation factor within technical replicates (manual: 0.88; automated: 0.93) and between days (manual: 0.74; automated: 0.86), indicating improved reproducibility with automated liquid handling.

Figure 2.14 Pearson-correlation between two manually handled HUNTER sets on 10K (a), 20K (b), 100K (c), 500K (d) and 1M (e) HeLa cells by DDA. The correlation is based on peptide intensity with scale from 0 (blue) to 1 (red), n=3 technical replicates. The scatter plot contains the relation within the same set (technical replicates) or different sets (day-to-day). The line in the scatter plot indicates relationship between two variables.

Figure 2.15 Pearson-correlation between three manually handled HUNTER replicates from 20k (a) and 1M (b) HeLa quantified by DIA. The correlation is based on sum of elution groups (precursors) with scale from 0 (blue) to 1 (red). The scatter plot contains the relation within the same set (technical replicates; n=3). The line in the scatter plot indicates relationship between two variables.

Figure 2.16 Evaluation of technical robustness and reproducibility of the implemented automation. Intensity-based Pearson-correlation between two automatically processed sets (each n=4 technical replicates) from human plasma samples, acquired by DDA with scale from 0 (blue) to 1 (red). The scatter plot contains the relation within the same set (technical replicates) or different sets (day-to-day). The line in the scatter plot indicates relationship between two variables.

2.4 Discussions

The most sensitive protocols established to date enrich protein N termini by negative selection, where protein amines are blocked with amine-reactive reagents before proteome digestion71. This generates new peptide N-terminal α-amines that are then exploited for depletion. Unspecific losses of N termini leading to low reproducibility mainly occur in three critical steps (Figure 2.1): removal of free amino acids and other interfering compounds from the protein lysate, removal of amine-reactive labeling reagents prior to digestion, and selective depletion of proteome digestion-generated non-N-terminal peptides. To overcome the first two limitations, we replaced protein precipitation, the common proteome purification procedure, with reversible high-efficiency binding to carboxylate-modified magnetic beads as used in the SP3 method76,77.

We first established compatibility of SP3 with protein-level dimethyl labeling and found that within 2 hours >99% of primary amines on proteins were successfully blocked (Figure 2.2). We then sought to address depletion, where unspecific binding of diluted N-terminal peptides to surfaces of filters, beads, and other consumables is particularly problematic for microscale samples.

To overcome the third limitation, we adapted a strategy based on attaching hydrophobic hexadecanal to the free peptide α-amines generated by proteome digestion58. This reaction increased the hydrophobicity of tryptic non-N-terminal peptides, allowing their retention on a reverse-phase liquid chromatography column, while N-terminal peptides were eluted and directly analyzed by MS/MS. However, hexadecanal-containing reactions solidified at RT and underwent phase separation; this resulted in losses and lowered experimental reproducibility. We therefore tested the shorter-chain undecanal, which is a liquid at RT. After optimizing reaction time (Figure 2.8) and solvent conditions (Figure 2.7), we found that the ideal reaction occurs in 40% ethanol for 60 min at 37°C, followed by passing the reaction mixture through commercial C18 reverse-phase resins (Figure 2.5), thus allowing direct enrichment with minimal loss of N-terminal peptides (Figure 2.6). This depletion of undecanal-tagged peptides was equally or more efficient compared to hexadecanal (Figure 2.9), resulting in enrichment of N-terminally modified peptides from baseline levels of <10% to >92% after enrichment. The enrichment efficiency was independent of the amount and source of digested proteome used for pullout, as shown with Arabidopsis thaliana leaf and rat brain proteomes (Figure 2.10).

After optimizing SP3-based labeling and undecanal-mediated depletion individually, we evaluated the performance of the combined workflow in a one-pot reaction from lysis to cleanup. Across a wide range of starting amounts, from 1 million HeLa cells, equivalent to 200 µg protein lysate, down to as few as 10,000 HeLa cells, or 2 µg crude protein lysate, we observed >94% dimethyl-modified lysine residues and enrichment efficiencies >90% (Figure 2.11). Within an hour of MS analysis, approximately 1,000 N termini were identified from only 2 µg proteome (Figure 2.12). In contrast, most current N termini enrichment techniques cannot handle such low quantities of starting material71. Both offline high-pH fractionation and the DIA method enabled a comprehensive and in-depth N-terminome analysis (Figure 2.12 and 2.13). Interestingly, more N termini were identified by DIA from 20,000 HeLa cells than by DDA from 1 million HeLa cells. The reproducibility was similar across the range of starting material, with Pearson correlation factors of 0.89 between manually pipetted replicates of 4 µg HeLa material and 0.74 between different days (Figure 2.14), as reported for label-free single-peptide quantification with minimal starting amounts72. DIA of 4 µg and 200 µg HeLa lysate showed correlation coefficients of 0.91 and 0.97, respectively, between manually pipetted replicates (Figure 2.15) and markedly improved quantitative precision.

Finally, we established HUNTER on a basic liquid handling system in 96-well format to enable high-throughput sample analysis and to reduce variation from manual pipetting. Automated enrichment of protein N termini from 1 to 5 µL of human plasma in four technical replicates achieved an improved intra-assay Pearson correlation of 0.93. The inter-assay Pearson correlation for automated assays on different days and different chromatography columns was 0.86 on average (Figure 2.16). Importantly, this platform addressed a common reproducibility problem, in which technical variation increases with limited starting material of as little as 5 µg49. With the new protocol in place, we set out to explore the performance and utility of HUNTER with samples that had so far not been amenable to N-terminome characterization.

2.5 Conclusions

In summary, HUNTER is a sensitive, reproducible, and robust protocol for enrichment of protein N termini from crude protein lysates. HUNTER is well-suited for automation, even on basic liquid handling systems. It is based on standard magnetic bead and cartridge technology, does not require protein precipitation, and avoids phase separations. We have optimized and validated HUNTER performance with diverse samples such as rat brain, plant leaf, human plasma, and HeLa cells. With sensitive identification and reproducible quantification of >1,000 protein termini from starting amounts of as little as 10,000 HeLa cells or 2 µg of protein lysate and >5,000 termini from 200 µg of protein lysate, HUNTER is a robust method for studying limited samples in depth.

Chapter 3 Application of HUNTER on limited samples

3.1 Introduction

As mentioned in the introduction of Chapter 2, current N termini enrichment methods used to identify termini on a proteome-wide level, so-called positional proteomics, provide insights into the proteolytic mechanisms underlying disease and into novel proteolytic proteoforms. However, most studies are based on cell culture samples because most N termini enrichment methods require large amounts of starting material. Hence, it remains challenging to capture protease-driven physiological changes in humans.

Almost a decade ago, the Human Proteome Project (HPP) was launched by the Human Proteome Organization (HUPO; https://hupo.org)78. The goal of this project is to identify at least one proteoform of each protein translated from the ~20,000 human genes and to offer significant insights into human biology78. The HPP comprises two complementary initiatives: the Chromosome-centric HPP (C-HPP) and the Biology/Disease HPP (B/D-HPP). The former focuses on mapping and characterizing the whole human proteome, based on the 23 pairs of human chromosomes and the mitochondrial genome. The latter integrates targeted proteomics technologies to analyze and study protein perturbations in disease. Ultimately, completion of the HPP would provide a foundation for elucidating biological functions as well as developing diagnostic, predictive, prognostic, and therapeutic applications61,78,79. All of this would lead to a new era of medicine: precision medicine. However, to date, 2,186 proteins are still denoted as “missing” proteins78 due to a lack of confident assignments or poor discovery.

Positional proteomics focuses on identification of enriched termini and has unique potential to improve confidence in determining specific proteolytic proteoforms. As discussed previously in Section 1.1.4, positional proteomics strategies are critical to successful precision medicine79,80. Compared to conventional shot-gun proteomics, positional proteomics provides an additional layer of information for revealing protease networks, biological functions, and regulation, making it a well-suited complement for filling the missing-protein gaps in the human proteome79. In this Chapter, we use sorted human immune cells, mitochondrial fractions from cancer cells, and blood plasma and bone marrow plasma from B-ALL patients to demonstrate biomedical applications of HUNTER and its potential use in supporting the HPP by revealing proteolytic mechanisms as well as proteolytic proteoforms.

3.2 Method

3.2.2 HUNTER on limited samples

B-ALL cell lines 380 (ACC 39) and 697 (ACC 42) were procured from DSMZ (Braunschweig, Germany). B-ALL cell lines were cultured in RPMI 1640 media supplemented with 10% heat-inactivated fetal bovine serum (ThermoFisher Scientific; cat. no. 10082147) and 2 mM L-glutamine (ThermoFisher Scientific; cat. no. 25030081) and maintained at 37 °C in 5% CO2. Commercial human blood plasma was purchased from STEMCELL Technologies (cat. no. 70039). Primary pediatric B-ALL and AML patient mononuclear cells enriched from bone marrow aspirates, blood plasma (BP) and bone marrow interstitial fluid (BM) were retrospectively sourced from the Biobank at BCCH following informed consent and approval by the University of British Columbia Children’s and Women’s Research Ethics Board (REB #H15-01994) in agreement with the Declaration of Helsinki. Patient BP and BM samples were collected at the time of diagnosis (D0) and 29 days after induction chemotherapy (D29). Peripheral blood mononuclear cells (PBMC) from healthy donors were obtained following informed consent and approval by the University of British Columbia Children’s and Women’s Research Ethics Board (REB #H10-01954). Individual populations were obtained by fluorescence-activated cell sorting using the following antibody combinations: CD19+ for B-cells, CD14+ for monocytes, CD3- CD56+ for natural killer (NK) cells and CD3+ CD56+ for NK T-cells (NKT cells).

3.2.2.1 Preparation of peripheral blood mononuclear sorted cell lysates

Sorted cell samples were prepared as described in Section 2.2.7.1, with sorted immune cells in place of HeLa cells.

3.2.2.2 Preparation of mitochondrial enrichment samples

Mitochondrial enrichment was performed on 2.5 million cells from two B-ALL cell lines (697, 380), and 2.5 million bone marrow monocytes from a pediatric AML patient (AML-1). All samples were processed in technical replicates (n=2 or n=3). Cells in mitochondrial isolation buffer (1 mM EGTA/HEPES pH 7.4, 200 mM Sucrose, 1 x Halt protease inhibitor) were disrupted by Pressure Cycling Technology (PCT) using a Barocycler EXT2320 and a PCT 30 µL MicroTube (Pressure BioSciences, Easton, Massachusetts, United States). The cell samples were homogenised and lysed at 26°C using 15 cycles of 20 seconds at 25 kpsi followed by 20 seconds at ambient pressure. The homogenate was subsequently centrifuged at 900 x g and the pellet fraction (Mitochondrial fraction 1, M1) was collected. The supernatant was transferred to a new tube and centrifuged at 13,000 x g to collect the second pellet fraction (Mitochondrial fraction 2, M2) and the cytosolic supernatant (cytosolic fraction, C). Pellet fractions M1 and M2 made up the mitochondria-enriched portion. Proteins were reduced and denatured as described for HeLa samples.

3.2.2.3 Automated HUNTER on patient blood and bone marrow plasma

Plasma and bone marrow interstitial fluid samples from three pediatric B-ALL patients (B-ALL-1, -2, -3) were processed on an epMotion M5073 automated liquid handling system (Eppendorf). Refer to Section 2.2.8 for a detailed experimental procedure.

3.2.3 Mass spectrometry

Detailed mass spectrometry settings are given in Section 2.2.9.

3.2.3.1 Label free quantification

For label-free quantification, data pre-processed with muda.pl, with peptide intensities determined by MaxQuant, were processed further by eliminating termini with intensity values in <20% of the analyzed samples. Data were median normalized, followed by multiplication by the overall data median and log10 transformation. Pearson correlations, coefficients of variation, and LIMMA-moderated t-test p-values were calculated using standard implementations in R or Python. To retain sample-specific termini, missing values were imputed with values randomly selected from a distribution modeled after the 10th to 20th percentile of the whole data and down-shifted by a random factor of 50-100, placing imputed values into the very low intensity area of the data. Radar plots display the z-score standardized intensity on the y-axis, and fuzzy c-means cluster membership is encoded as the line color. Unsupervised characterization of relationships was represented by radar plots and t-distributed stochastic neighbor embedding (t-SNE) followed by fuzzy clustering, based on imputed data.
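The filtering and normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the muda.pl or R code used for this thesis: the function name `filter_and_normalize`, the dictionary layout, and the reading of "median normalized" as dividing each sample by its own median are assumptions. Imputation is left out because the sampling scheme is not described in enough detail to reproduce.

```python
import math
from statistics import median


def filter_and_normalize(table, min_fraction=0.2):
    """Filter sparse termini, median-normalize each sample, and log10-transform.

    table maps a terminus identifier to a list of intensities, one per
    sample, with None marking a missing value (hypothetical layout).
    """
    n_samples = len(next(iter(table.values())))

    # Keep termini with intensity values in at least min_fraction of samples.
    kept = {
        terminus: values
        for terminus, values in table.items()
        if sum(v is not None for v in values) / n_samples >= min_fraction
    }

    # Median of each sample (column) and of all observed values.
    sample_medians = [
        median(values[j] for values in kept.values() if values[j] is not None)
        for j in range(n_samples)
    ]
    overall_median = median(
        v for values in kept.values() for v in values if v is not None
    )

    # Assumption: divide by the sample median, then rescale by the overall median.
    return {
        terminus: [
            None if v is None else math.log10(v / sample_medians[j] * overall_median)
            for j, v in enumerate(values)
        ]
        for terminus, values in kept.items()
    }


# Toy data: sample 2 is a two-fold scaled copy of sample 1.
example = {
    "t1": [10.0, 20.0],
    "t2": [100.0, 200.0],
    "t3": [1000.0, 2000.0],
    "t4": [None, None],  # never quantified, so it is filtered out
}
normalized = filter_and_normalize(example)
```

In the toy data the two samples differ only by a constant factor, so after normalization each terminus has the same value in both samples.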

3.2.3.2 Data and statistical analysis

Data evaluation and positional annotation were performed using an in-house Perl script (muda.pl) that combines information provided by MaxQuant, UniProt and TopFINDer81 to annotate and classify identified N-terminal peptides. The script (muda.pl) is publicly available (http://muda.sourceforge.io). In short, MaxQuant peptide identifications are consolidated by removing non-valid identifications (peptides identified with N-terminal pyro-Glu that do not contain Glu or Gln as the N-terminal residue, peptides with dimethylation at N-terminal Pro), contaminants, reverse-database peptides, and non-quantifiable acetylated peptides in multi-channel experiments (no K in the peptide sequence to determine the labeled channel). For peptides mapping to multiple entries in the UniProt protein database, a “preferred” entry was determined by first selecting protein entries where the identified peptide matches position 1 or 2 and then favoring manually reviewed UniProt protein entries. If multiple entries persisted, the first entry in alphabetical order was chosen by default.
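The selection of a preferred entry amounts to a three-level ranking, which can be sketched as follows. This is an illustration of the rule as worded above, not an excerpt from muda.pl: the `Mapping` record, its field names, the example accessions, and the use of the accession for alphabetical ordering are all assumptions.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    accession: str  # protein entry identifier (hypothetical values below)
    start: int      # protein position of the peptide's first residue
    reviewed: bool  # True for a manually reviewed entry


def preferred_entry(candidates):
    """Rank by: start at position 1 or 2, then reviewed, then alphabetical."""
    return min(
        candidates,
        # False sorts before True, so each criterion is negated.
        key=lambda m: (m.start not in (1, 2), not m.reviewed, m.accession),
    )


candidates = [
    Mapping("Q00003", 45, True),
    Mapping("P00002", 2, False),
    Mapping("P00001", 1, True),
]
best = preferred_entry(candidates)
```

Encoding the rule as a sort key keeps the fallback behavior explicit: when no entry starts at position 1 or 2, reviewed entries still win, and the alphabetical order only breaks remaining ties.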

Proteins identified in human plasma before and after N termini enrichment were annotated with their previously reported plasma protein concentration82. N-terminal peptides identified from mitochondria were compared to recently reported N termini identified in HeLa cells53 and to entries listed in the MitoCarta2.0 database83. Cleavage site patterns surrounding identified mitochondrial N termini were visualized as sequence logos using WebLogo84 (https://weblogo.berkeley.edu/logo.cgi).

Dimensionality reduction techniques reduce the number of random variables so that the data can be represented by a smaller set of principal variables that best explain the data. The reduced-dimension space can improve exploration and visualization of relationships within the data. t-SNE is a non-linear dimensionality reduction method. In this thesis, t-SNE is used to project high-dimensional data into a two-dimensional space in order to compare relationships within the data. The Python library scikit-learn was used to generate the t-SNE plots. The unsupervised t-SNE plot in the sorted immune cell application is based on raw log10 transformed N-terminal peptide intensities. When HUNTER was applied to patient BM and BP, the quantitative data based on intensities at the termini and protein levels were standardized using row-wise z-score standardization. The t-SNE function was initialized with the standardized data and the following parameters: n_components=2, perplexity=50, n_iter=2000, learning_rate=200, random_state=1. Here, n_components refers to the dimension of the embedded space, perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms, n_iter refers to the maximum number of iterations for the optimization, learning_rate is the learning rate for t-SNE (usually in the range from 10 to 1000), and random_state is the seed used by the random number generator.
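As an illustration of the row-wise z-score standardization applied before t-SNE, the following standard-library sketch standardizes each row to mean 0 and standard deviation 1. The function name `zscore_rows`, the choice of the population standard deviation, and the handling of constant rows are assumptions, since the text does not specify them. The t-SNE settings quoted above are collected in a dictionary only for reference; passing them to scikit-learn's TSNE class as keyword arguments is assumed, and recent scikit-learn releases may name the iteration argument differently.

```python
from statistics import mean, pstdev


def zscore_rows(rows):
    """Standardize each row (one terminus or protein) to mean 0 and SD 1."""
    standardized = []
    for row in rows:
        mu = mean(row)
        sd = pstdev(row)  # assumption: population SD (ddof = 0)
        # A constant row has no spread; map it to zeros instead of dividing by 0.
        standardized.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return standardized


# Settings quoted in the text, kept as plain data (not executed here).
tsne_settings = {
    "n_components": 2,
    "perplexity": 50,
    "n_iter": 2000,
    "learning_rate": 200,
    "random_state": 1,
}

# Toy rows; the four columns could stand for D0_BP, D29_BP, D0_BM, D29_BM.
z = zscore_rows([[1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0]])
```

Row-wise standardization puts every terminus or protein on the same scale, so the embedding reflects the shape of each abundance profile across conditions and not its absolute intensity.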

Clustering splits the data into a defined number of groups based on the similarity between points, and aims to optimize pattern finding. Fuzzy clustering is similar to the commonly used k-means clustering. One major difference is that in fuzzy clustering a data point can belong to more than one cluster. As a result, each point has a membership value defined for each cluster. The closer a data point is to a cluster, the higher its value for that cluster, and vice versa.

In this thesis, fuzzy c-means clustering was used to show the groupings of the dimensionality-reduced data. Data points in the two-dimensional t-SNE space can theoretically belong to all groups, each with a membership score as described above. The R library cluster was used to perform fuzzy c-means clustering on the t-SNE data with 6 clusters.
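The membership scores can be made concrete with the standard fuzzy c-means membership formula, evaluated here for fixed cluster centres. This sketch is not the R implementation used in this thesis; the function name `fuzzy_memberships`, the fuzzifier value m=2, and the toy coordinates are assumptions, and the iterative update of the centres is omitted.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Membership of one point in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]

    # A point lying exactly on a centre belongs fully to that centre.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]

    # u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1))
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


centers = [(0.0, 0.0), (10.0, 0.0)]   # two toy cluster centres in 2-D
near_first = fuzzy_memberships((2.0, 0.0), centers)
midway = fuzzy_memberships((5.0, 0.0), centers)
```

A point close to the first centre receives most of its membership there, while a point midway between the two centres is shared equally.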

Radar plots are included to communicate the results by displaying the most represented grouping in each cluster. To do this, we used standardized data values in each grouping. For instance, in the patient BP and BM data the groupings are D0_BP, D29_BP, D0_BM, and D29_BM. A colormap is applied to each data point to represent the strength of the membership values obtained from fuzzy clustering.

3.2.3.3 Data availability

MS data have been deposited to the ProteomeXchange Consortium74 (http://www.proteomexchange.org) via the MassIVE (https://massive.ucsd.edu/) partner repository with the following accession numbers: PXD012918 for analysis of sorted human peripheral blood mononuclear cells by HUNTER, PXD012916 for analysis of proteolytic processes in plasma and bone marrow interstitial fluid of B-ALL patients by HUNTER, and PXD012919 for analysis of mitochondrial N termini by HUNTER.

3.3 Results

3.3.2 N-terminome analysis of sorted peripheral blood mononuclear cells

We first isolated four types of human immune cells (B-cells, monocytes, NK cells, and NKT cells) from blood samples of two healthy donors by fluorescence-activated cell sorting. From only 30,000 cells of each cell type, an average of 723±49, 761±49, 700±38, and 709±74 N termini were identified from B-cells, monocytes, NK cells, and NKT cells, respectively (Figure 3.1a). To visualize termini profiles in each cell type, we employed unsupervised t-distributed stochastic neighbor embedding (t-SNE) based on the abundance of each terminus (Figure 3.1b). We chose an unsupervised approach because it is data driven, introduces no external bias during analysis, and shows whether samples cluster by type without prior labels. B-cells and monocytes clustered distinctly from each other and from NK and NKT cells.

Figure 3.1 N-terminome analysis of sorted human PBMC. (a) N termini identified from 30,000 B-cells, monocytes, NK cells and NKT-cells, mean indicated by line. (DDA, n=3 technical replicates) (b) Unsupervised t-SNE based on raw log10 transformed N-terminal peptide intensities. Cell types are indicated by colors and the circles are manually drawn for clarity.

In addition, we investigated the position of acetylation in each cell population, as it is closely related to co- and post-translational modification. Based on TopFINDer results, the percentage of acetylation at positions 1 and 2 was calculated as the number of acetylated N termini at positions 1 and 2 divided by the total number of N termini at positions 1 and 2. Likewise, the percentage of acetylation after position 2 was calculated as the number of acetylated N termini after position 2 divided by the total number of N termini after position 2. As shown in Figure 3.2, acetylation at positions 1 and 2, which can occur both co- and post-translationally, was 82±8%, 81±9%, 67±9%, and 80±7% in B-cells, monocytes, NK cells, and NKT cells, respectively. Acetylation after position 2, which can only occur post-translationally, was 16±2%, 11±1%, 11±1%, and 12±3% in B-cells, monocytes, NK cells, and NKT cells, respectively. Interestingly, no significant cell-type specific differences in N-terminal acetylation were observed.
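The two percentages defined above are simple ratios, shown here as a short sketch. The function name `percent_acetylated`, the (position, acetylated) tuple layout, and the toy input are assumptions for illustration and do not reproduce the TopFINDer output format.

```python
def percent_acetylated(termini):
    """Return (% acetylated at positions 1-2, % acetylated after position 2).

    termini is an iterable of (position, is_acetylated) pairs, where
    position is the genome-encoded position of the N terminus.
    """
    early = [acetylated for position, acetylated in termini if position <= 2]
    late = [acetylated for position, acetylated in termini if position > 2]

    def percent(flags):
        # True counts as 1, so sum() gives the number of acetylated termini.
        return 100.0 * sum(flags) / len(flags) if flags else float("nan")

    return percent(early), percent(late)


toy_termini = [
    (1, True), (2, True), (1, False), (2, True),          # positions 1 and 2
    (35, False), (40, True), (57, False), (100, False),   # after position 2
]
pct_early, pct_late = percent_acetylated(toy_termini)
```

With the toy input, three of four termini at positions 1 and 2 are acetylated (75%) and one of four termini after position 2 is acetylated (25%).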

Figure 3.2 Number of N termini with acetylation on genome encoded position 1 or 2 (co- and/or post-translational) or position >2 (post-translational only). No significant differences (t test) between cell types were observed.

3.3.3 Mitochondrial N-terminome analysis from 2.5 million cancer cells

To help fill the missing-protein gaps in the HPP, we investigated the potential application of HUNTER to proteolytic events in subcellular compartments from limited samples. In this experiment, PCT and HUNTER coupled with LC-MS/MS were employed to investigate the mitochondrial N-terminome from 2.5 million leukemic cells, including two B-ALL cell lines and a pediatric AML patient cell sample. After the cells were homogenized and lysed by PCT, the first pellet was isolated as mitochondrial fraction 1 (M1) by centrifugation. Subsequently, a greater centrifugal force was applied to separate the second mitochondrial fraction (M2) from the cytosolic supernatant (C). Throughout this experiment, termini- and protein-level identifications from enriched mitochondrial fractions were matched with human data in the MitoCarta2.0 database to assess the quality of enrichment.

To evaluate subcellular enrichment efficiency, total mitochondrial termini in unfractionated and subcellular fractionated B-ALL 697 cell samples were compared. As shown in Figure 3.3, mitochondrial enrichment resulted in a three-fold increase in mitochondrial N termini identification, from a total of 64 N termini in the unfractionated sample to 195 N termini in the fractionated sample. The 195 termini in the fractionated sample consisted of 152 termini found only in mitochondrial fractions, 12 termini found only in cytosolic fractions, and 31 termini shared between cytosolic and mitochondrial fractions.

Figure 3.4 illustrates overall protein and termini identification in each sample. Fewer mitochondrial N termini and proteins were identified in the cytosolic fractions than in the mitochondrial fractions, indicating successful mitochondrial enrichment by centrifugation. For B-ALL 697, the average protein counts were 167±4 (M1), 64±18 (C), and 107±26 (M2), and the average termini counts were 187±8 (M1), 70±19 (C), and 114±33 (M2). For B-ALL 380, the average protein counts were 172±9 (M1), 64±3 (C), and 119±19 (M2), and the average termini counts were 194±16 (M1), 67±6 (C), and 126±22 (M2). For AML-1, the average protein counts were 180±6 (M1), 47±1 (C), and 152±2 (M2), and the average termini counts were 221±14 (M1), 52±1 (C), and 181±6 (M2). This pattern was consistent across both B-ALL cell lines and the patient AML sample.

Figure 3.3 Evaluation of subcellular enrichment efficiency for B-ALL cell line 697. Total number of identified mitochondrial termini before and after crude subcellular fractionation of 2.5 million cells in 30 µL buffer using pressure cycling technology (PCT) assisted lysis.

Figure 3.4 Analysis of mitochondrial N-terminomes by HUNTER. (a) N termini identified in each fraction obtained by mild PCT assisted lysis of 2.5 million cells followed by differential centrifugation. N termini annotated as mitochondrial by the MitoCarta resource are displayed in light orange. n=2-3, error=SD. (b) The total number of proteins identified by N termini in each fraction. Proteins annotated as mitochondrial by the MitoCarta resource are displayed in light blue. n=2-3, error=SD.

To evaluate the performance of our PCT-assisted HUNTER method, the mitochondrial N termini and proteins identified by HUNTER were compared with those reported by Marshall et al. in a recent noteworthy study of the human mitochondrial N-terminome53. In brief, Marshall et al. applied a refined TAILS strategy incorporating stable isotope labeling by amino acids in cell culture (SILAC) to investigate proteolytic processing in the mitochondria of HeLa cells, using 12 to 15 million cells as starting material. They identified an average of 311 unique mitochondrial N-terminal peptides from 227 mitochondrial proteins. HUNTER applied to mitochondrial fractions from less than 2.5 million AML cells obtained by bone marrow aspiration from a pediatric patient enabled detection of 257 N termini from 193 mitochondrial proteins. Similar numbers were identified in two B-ALL cell lines, with averages of 211 termini and 178 proteins in the 697 cell line and 221 termini and 184 proteins in the 380 cell line (Figure 3.5). This shows that the human mitochondrial N-terminome coverage obtained by HUNTER is comparable to that obtained by TAILS.

Figure 3.5 Mitochondrial protein N termini enriched from 2.5 million B-ALL 697, B-ALL 380 and AML patient cells compared to N termini identifications from 12-15 million HeLa cells as reported by Marshall et al. Orange, N termini; Blue, proteins.

Finally, the proteolytic cleavage pattern in AML-1 was examined to provide insights into pediatric leukemia-associated proteolytic activities and to demonstrate the suitability of PCT-assisted HUNTER for studying limited patient samples. To explore the presence of conserved amino acid sequence motifs, sequences from all mitochondrial N termini were aligned using WebLogo. Figure 3.6 shows that an Arg-Arg-X | Ser/Arg-Ser consensus was dominant in proteolytic events in the mitochondria of leukemic cells. The most frequent protease substrate patterns in these mitochondrial N-terminome data were attributed to mitochondrial methionine aminopeptidase 1D (MAP12) and mitochondrial-processing peptidase subunit beta (MPPB).

Figure 3.6 Consensus sequence logo of 257 mitochondrial N termini identified from AML-1 patient cells. The logo is dominated by sequences matching the sequence specificity of the classical mitochondrial transit peptide cleavage site.

3.3.4 Application of HUNTER on B-cell acute lymphoblastic leukemia (B-ALL) patient plasma samples

Beyond terminome analysis of human cells, B-ALL patient plasma samples were chosen to mimic clinical settings and to demonstrate potential applications in personalized medicine and biomarker development. Figure 3.7 illustrates the study design, in which blood plasma (BP) and aspirated bone marrow interstitial fluid (BM) were collected from three B-ALL patients at the time of diagnosis (D0) and after induction chemotherapy (D29). To avoid labor-intensive manual handling in hospital or clinical settings, 3 µL of cancer patient plasma (250-300 µg protein) was processed on an automated liquid handler as described in Section 2.2.8.

In this experiment, 69 proteins were identified only in pre-HUNTER samples, 39 proteins were identified only in post-HUNTER samples, and 64 proteins were identified in both conditions. Mapping the experimentally identified proteins to known concentrations in plasma82 shows that HUNTER identified more low-abundance proteins (Figure 3.8).

Figure 3.7 Study design for patient samples. BP and BM were collected from three pediatric B-ALL patients at D0 and D29. All samples were processed by automated HUNTER, and pre-enrichment (total protein) and post-enrichment (N termini reflecting intact and proteolytic proteoforms) samples were analyzed by DDA and label free quantification.

Figure 3.8 Proteins identified exclusively in plasma before or after N termini enrichment, or in both preparations, mapped to known concentrations in plasma.

For quantitative analysis of BP and BM from three B-ALL patients, unsupervised t-SNE followed by fuzzy c-means clustering was applied based on the abundance of proteins and termini. At the protein level (Figure 3.9), changes in protein profile in response to treatment were observed in pre-enrichment samples. Based on the respective clustering shown in Figure 3.9b, it was challenging to differentiate clusters between sample types (e.g. BM vs. BP). Only Clusters 1 and 2 demonstrated specificity towards bone marrow plasma and blood plasma, respectively. In contrast, Figure 3.10 shows distinct N termini profiles for BP and BM, as well as distinct profiles at the beginning and end of treatment. These distinct profiles indicate potential applications in biomarker or therapeutic target discovery.

Figure 3.9 Automated HUNTER protein level analysis of BP and BM from three pediatric B-ALL patients. (a) Unsupervised t-SNE plot for individual quantified proteins split into 6 clusters by fuzzy c-means. Annotation based on spatio-temporal protein abundance profiles. Each point is an individual protein identified from patient samples, colored by its associated cluster. Black lines are used to separate different conditions based on the sample types (BP and BM) or before (D0) and after (D29) treatment. (b) Radar plot representation of the spatio-temporal abundance profiles of proteins assigned to the 6 clusters identified from (a). Radial axis represents protein z-score. Color scaled by fuzzy c-means cluster membership score (light to dark red indicating increased cluster membership fit).

Figure 3.10 Automated HUNTER termini level analysis of BP and BM from three pediatric B-ALL patients at D0 and D29. (a) t-SNE dimensionality reduction of protein N termini abundances followed by fuzzy c-means clustering. Annotation based on spatio-temporal N termini abundance profiles. Each point is an individual terminus identified from patient samples, colored by its associated cluster. Black lines are used to separate different conditions based on the sample types (BP and BM) or before (D0) and after (D29) treatment. (b) Radar plot representation of the spatio-temporal abundance profiles of termini assigned to the 6 clusters identified from (a). Radial axis represents N termini z-score. Color scaled by fuzzy c-means cluster membership score (light to dark red indicating increased cluster membership fit).

Since the complement pathway is a major part of the innate immune system, and its activation is initiated by proteolytic activity and results in formation of the membrane attack complex that ruptures target cell membranes85, changes in the cleavage profile within this pathway before and after induction chemotherapy were monitored. In this study, quantitation was based on the relative intensity of proteins or N termini, as shown in Figure 3.11. Genome-encoded N termini were found for complement components C3, C4, and C9, as well as complement factors B and H. N termini for well-established complement fragments were also detected, including C3a, C3b-alpha chain, C3g, C4b, C4d, and C4 gamma chain. Interestingly, there was a dramatic decrease in the abundance of C9 fragment termini after induction chemotherapy. Only moderate changes in C3 and C4 proteins were observed, suggesting that termini information provides better insight into proteolytic events in the complement pathway than protein information.

Figure 3.11 Relative intensity-based quantification of complement pathway proteins (blue) and N termini identifying previously described activation products (orange). Each bar represents the intensity ratio of a protein or terminus in one of the four sample types, calculated as the intensity in that sample type divided by the average intensity across the four sample types.

3.4 Discussion

As discussed in previous Chapters, processing limited amounts of biological samples is challenging for most N termini enrichment methods. To demonstrate the application of HUNTER in medical and biological studies, we first investigated the N-terminomes of primary human peripheral blood mononuclear cell populations. With only 30,000 sorted immune cells from healthy donors, similar and consistent numbers of N termini (between 646 and 803) were identified from the different cell types (Figure 3.1a). This suggests promising reproducibility of the HUNTER method. Unsupervised dimensionality reduction based on N termini abundance separated the different cell types, with replicates of each cell type grouped in close proximity (Figure 3.1b), demonstrating N termini profile specificity. Interestingly, the NK and NKT populations grouped more closely to each other than the B-cell and monocyte populations did. This is expected, as NKT cells express molecular markers associated with NK cells, although NKT cells uniquely express the T-cell receptor. Additionally, Figure 3.2 suggests no observable cell-type specific difference in internal protein processing or protease activity, as there was no statistically significant difference in either co- or post-translational acetylation between cell types.

Next, HUNTER was employed in an investigation of proteolytic processes in subcellular compartments of limited samples, focusing on pediatric patient biopsies. This type of study was previously restricted to cultured cells, from which large quantities of source material can be obtained. A crude subcellular fractionation of mitochondria was optimized using mild pressure cycling assisted cell lysis of 2.5 million cells in 30 µL buffer, resulting in a three-fold increase of N termini originating from known mitochondrial proteins (Figure 3.3). Compared to a recent study of mitochondrial protein processing in human cells using the TAILS enrichment method53, HUNTER identified on average 73% of the mitochondrial protein termini and 81% of the mitochondrial proteins from only about 1/10th of the starting material (Figure 3.5). An alignment of the mitochondrial N termini matched the previously described pattern resulting from transit peptide cleavage and subsequent aminopeptidase processing of nuclear-encoded proteins imported into mitochondria (Figure 3.6)53. Based on these results, HUNTER appears to be a sensitive tool for identifying mitochondrial proteins, termini, and proteolytic patterns in limited samples.
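The percentages quoted above follow from the counts reported in Section 3.3.3, as the short calculation below illustrates. The variable names are chosen for illustration, and averaging the three HUNTER samples with equal weight is an assumption about how the comparison was made.

```python
from statistics import mean

# Mitochondrial identifications by HUNTER (Section 3.3.3, Figure 3.5).
hunter_termini = {"AML-1": 257, "B-ALL 697": 211, "B-ALL 380": 221}
hunter_proteins = {"AML-1": 193, "B-ALL 697": 178, "B-ALL 380": 184}

# Averages reported by Marshall et al. for TAILS on HeLa cells.
tails_termini = 311
tails_proteins = 227

# Assumption: unweighted mean over the three HUNTER samples.
termini_pct = 100 * mean(list(hunter_termini.values())) / tails_termini
protein_pct = 100 * mean(list(hunter_proteins.values())) / tails_proteins

print(f"termini: {termini_pct:.1f}% of TAILS, proteins: {protein_pct:.1f}% of TAILS")
```

Under this assumption the ratios come out at roughly 73.8% for termini and 81.5% for proteins, close to the 73% and 81% quoted above.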

We then tested the utility of the automated platform for the characterization of liquid biopsies from cancer patients. We analyzed 3 µL of non-depleted BP and BM from three pediatric B-ALL patients before and after induction chemotherapy (Figure 3.7). As serum albumin is the most abundant protein in human plasma, constituting about 50% of human plasma protein82, removal of internal peptides is critical. HUNTER facilitated termini identification in low-abundance plasma proteins82, with more low-abundance plasma proteins identified after N termini enrichment than in a standard proteome analysis (Figure 3.8).

Quantitation, t-SNE dimensionality reduction and fuzzy c-means clustering of overall protein abundance, as determined from aliquots withdrawn before N termini enrichment (Figure 3.9; protein level separation), and of N-terminal peptides identified after enrichment (Figure 3.10; peptide level separation) revealed strong treatment-induced changes. Interestingly, only a few proteins showed differential abundance between blood plasma and bone marrow interstitial fluid. At the same time, markedly different proteolytic proteoforms were observed in the two compartments, suggesting the presence of distinct protease activities (Figure 3.10). This suggests potential HUNTER applications in biomarker development, such as diagnostic biomarkers for disease or treatment success.

We further studied the cleavage profiles of these patient samples. Notably, while C3 and C4 protein levels did not change or changed only moderately, N termini matching complement protein activation sites86 showed a marked decrease during chemotherapy (Figure 3.11), in line with the chemotherapy-induced complement defects previously reported in ALL87. Based on these results, HUNTER also enables monitoring of proteolytic activities in the complement pathway in leukemia patients.

3.5 Conclusions

We have shown successful application of HUNTER in systems as diverse as sorted peripheral blood cell populations, subcellular fractions enriched for mitochondria, and human plasma. Based on the preliminary data, PCT-assisted HUNTER coupled with subcellular fractionation is a sensitive and reproducible technique that can be applied to investigate proteolysis in clinical specimens at near subcellular levels, using minimal sample amounts. Implementation of the HUNTER protocol in an automated sample handling system facilitates profiling of aberrant proteolytic processes in clinical cohorts and reduces hands-on time. Overall, HUNTER enables comprehensive analysis of proteolytic processes and protein N-terminal modifications at the microscale in a wide range of precious, limited biological samples.

Chapter 4 General conclusions

4.1 Concluding remarks

As genomic and transcriptomic information does not directly reflect the expression or function of proteins88–91, identifying disease-associated proteoforms and the aberrant biological activities responsible for malignant transformation is more beneficial for the development of effective treatments.

In this thesis, the performance of the HUNTER enrichment strategy, which adapts SP3 and hydrophobic tagging techniques, has been demonstrated for microscale N-terminome analysis; this included identification of proteolytic proteoforms and derivation of proteolytic mechanisms in diverse human samples. The approach combines protein extraction on carboxylate-modified magnetic beads, which minimizes sample loss, with depletion of internal peptides by versatile undecanal tagging, and it lends itself to an automated and user-friendly workflow. It substantially reduces the amount of required starting material (a 20- to 500-fold reduction compared to existing technology) while maintaining sensitive (identifications similar to TAILS with one tenth of the starting material) and reproducible (Pearson correlation of 0.93 between technical replicates in automated HUNTER) N-terminal peptide identification. This method thus opens exciting avenues towards a better understanding of crucial proteolytic events in limited biological specimens. As noted in earlier sections, HUNTER has been shown to support the HPP by filling in missing proteolytic proteoforms and to aid the understanding of proteolytic processes and their functions down to the subcellular level. Overall, the work described in this thesis strengthens the case for employing an automated HUNTER enrichment method in in-house positional proteomics studies that monitor proteolytic activities and determine proteolytic proteoforms in biological samples. It is worth re-emphasizing the remarkable sensitivity of termini identification with as little as 2 µg of proteome, which supports its future application as a microscale N termini enrichment strategy for rare and precious samples.
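As a reminder of how a replicate reproducibility figure such as the Pearson correlation quoted above is obtained, the following minimal sketch computes the coefficient for two replicate measurements of the same peptides. The function name `pearson` and the intensity values are hypothetical and serve only as an illustration; they are not data from this thesis.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log2 intensities of the same peptides in two technical replicates
replicate_a = [20.1, 22.4, 18.7, 25.0, 21.3]
replicate_b = [20.4, 22.0, 19.1, 24.6, 21.9]
r = pearson(replicate_a, replicate_b)
```

A value close to 1 indicates that peptide intensities rank and scale consistently between the two replicates.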

Bibliography

1. Smith, L. M. & Kelleher, N. L. Proteoforms as the next proteomics currency. Science 359, 1106–1108 (2018).

2. Aebersold, R. et al. How many human proteoforms are there? Nat. Chem. Biol. 14, 206–214 (2018).

3. Lorentzian, A., Uzozie, A. & Lange, P. F. Origins and clinical relevance of proteoforms in pediatric malignancies. Expert Rev. Proteomics 16, 185–200 (2019).

4. Toby, T. K., Fornelli, L. & Kelleher, N. L. Progress in Top-Down Proteomics and the Analysis of Proteoforms. Annu. Rev. Anal. Chem. 9, 499–519 (2016).

5. Walsh, C. T. Posttranslational Modification of Proteins. (Roberts and Company Publishers, 2006).

6. Karve, T. M. & Cheema, A. K. Small Changes Huge Impact: The Role of Protein Posttranslational Modifications in Cellular Homeostasis and Disease. J. Amino Acids 207691 (1–13) (2011). doi:10.4061/2011/207691

7. Zhao, S. et al. Regulation of Cellular Metabolism by Protein Lysine Acetylation. Science 327, 1000–1004 (2010).

8. Deribe, Y. L., Pawson, T. & Dikic, I. Post-translational modifications in signal integration. Nat. Struct. Mol. Biol. 17, 666–672 (2010).

9. Rogers, L. D. & Overall, C. M. Proteolytic Post-translational Modification of Proteins: Proteomic Tools and Methodology. Mol. Cell. Proteomics 12, 3532–3542 (2013).

10. Lange, P. F. & Overall, C. M. Protein TAILS: when termini tell tales of proteolysis and function. Curr. Opin. Chem. Biol. 17, 73–82 (2013).

11. Puente, X. S., Sánchez, L. M., Overall, C. M. & López-Otín, C. Human and mouse proteases: a comparative genomic approach. Nat. Rev. Genet. 4, 544–558 (2003).

12. Brix, K. & Stöcker, W. Proteases: Structure and Function. (Springer, 2013).

13. Rawlings, N. D. et al. The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res. 46, D624–D632 (2018).

14. Fingleton, B. Matrix metalloproteinases: roles in cancer and metastasis. Front. Biosci. 11, 479–491 (2006).

15. Kessenbrock, K., Plaks, V. & Werb, Z. Matrix Metalloproteinases: Regulators of the Tumor Microenvironment. Cell 141, 52–67 (2010).

16. Puente, X. S., Sanchez, L. M., Gutierrez-Fernandez, A., Velasco, G. & Lopez-Otin, C. A genomic view of the complexity of mammalian proteolytic systems. Biochem. Soc. Trans. 33, 331–334 (2005).

17. Klein, T., Eckhard, U., Dufour, A., Solis, N. & Overall, C. M. Proteolytic Cleavage-Mechanisms, Function, and “Omic” Approaches for a Near-Ubiquitous Posttranslational Modification. Chem. Rev. 118, 1137–1168 (2018).

18. Eatemadi, A. et al. Role of protease and protease inhibitors in cancer pathogenesis and treatment. Biomed. Pharmacother. 86, 221–231 (2017).

19. Sabeh, F. et al. Tumor cell traffic through the extracellular matrix is controlled by the membrane-anchored collagenase MT1-MMP. J. Cell Biol. 167, 769–781 (2004).

20. Witty, J. P., Lempka, T., Coffey, R. J. & Matrisian, L. M. Decreased Tumor Formation in 7,12-Dimethylbenzanthracene-treated Stromelysin-1 Transgenic Mice Is Associated with Alterations in Mammary Epithelial Cell Apoptosis. Cancer Res. 55, 1401–1407 (1995).

21. Winer, A., Adams, S. & Mignatti, P. Matrix Metalloproteinase Inhibitors in Cancer Therapy: Turning Past Failures Into Future Successes. Mol. Cancer Ther. 17, 1147–1156 (2018).

22. Katakowski, M. et al. Tumorigenicity of cortical astrocyte cell line induced by the protease ADAM17. Cancer Sci. 100, 1597–1604 (2009).

23. Ferrando, A. A. The role of NOTCH1 signaling in T-ALL. Hematol. Am. Soc. Hematol. Educ. Progr. 353–361 (2009). doi:10.1182/asheducation-2009.1.353

24. Andersson, E. R. & Lendahl, U. Therapeutic modulation of Notch signalling — are we there yet ? Nat. Rev. Drug Discov. 13, 357–378 (2014).

25. Yuan, X. et al. Notch signaling: An emerging therapeutic target for cancer treatment. Cancer Lett. 369, 20–27 (2015).

26. Mullooly, M., McGowan, P. M., Crown, J. & Duffy, M. J. The ADAMs family of proteases as targets for the treatment of cancer. Cancer Biol. Ther. 17, 870–880 (2016).

27. Aggarwal, N. & Sloane, B. F. Cathepsin B: Multiple roles in cancer. Proteomics - Clin. Appl. 8, 427–437 (2014).

28. Mijanović, O. et al. Cathepsin B: A sellsword of cancer progression. Cancer Lett. 449, 207–214 (2019).

29. Ruan, H., Hao, S., Young, P. & Zhang, H. Targeting Cathepsin B for Cancer Therapies. Horiz. Cancer Res. 56, 23–40 (2015).

30. Saftig, P. et al. Impaired osteoclastic bone resorption leads to osteopetrosis in cathepsin-K-deficient mice. Proc. Natl. Acad. Sci. 95, 13453–13458 (1998).

31. Podgorski, I. et al. Bone Marrow-Derived Cathepsin K Cleaves SPARC in Bone Metastasis. Am. J. Pathol. 175, 1255–1269 (2009).

32. Tomita, A. et al. Human breast adenocarcinoma (MDA-231) and human lung squamous cell carcinoma (Hara) do not have the ability to cause bone resorption by themselves during the establishment of bone metastasis. Clin. Exp. Metastasis 25, 437–444 (2008).

33. Herroon, M. K. et al. Macrophage cathepsin K promotes prostate tumor progression in bone. Oncogene 32, 1580–1593 (2013).

34. Adams, J. The development of proteasome inhibitors as anticancer drugs. Cancer Cell 5, 417–421 (2004).

35. Chen, D., Frezza, M., Schmitt, S., Kanwar, J. & Dou, Q. P. Bortezomib as the First Proteasome Inhibitor Anticancer Drug: Current Status and Future Perspectives. Curr. Cancer Drug Targets 11, 239–253 (2011).

36. Drag, M. & Salvesen, G. S. Emerging principles in protease-based drug discovery. Nat. Rev. Drug Discov. 9, 690–701 (2010).

37. Quancard, J. et al. An allosteric MALT1 inhibitor is a molecular corrector rescuing function in an immunodeficient patient. Nat. Chem. Biol. 15, 304–313 (2019).

38. Huesgen, P. F., Lange, P. F. & Overall, C. M. Ensembles of protein termini and specific proteolytic signatures as candidate biomarkers of disease. Proteomics - Clin. Appl. 8, 338–350 (2014).

39. Niedermaier, S. & Huesgen, P. F. Positional proteomics for identification of secreted proteoforms released by site-specific processing of membrane proteins. BBA - Proteins Proteomics (2018). doi:10.1016/j.bbapap.2018.09.004

40. Chang, T. K., Jackson, D. Y., Burnier, J. P. & Wells, J. A. Subtiligase: A tool for semisynthesis of proteins. Proc. Natl. Acad. Sci. 91, 12544–12548 (1994).

41. Mahrus, S. et al. Global Sequencing of Proteolytic Cleavage Sites in Apoptosis by Specific Labeling of Protein N Termini. Cell 134, 866–876 (2008).

42. Yoshihara, H. A. I., Mahrus, S. & Wells, J. A. Tags for labeling protein N-termini with subtiligase for proteomics. Bioorg. Med. Chem. Lett. 18, 6000–6003 (2008).

43. Xu, G., Shin, S. B. Y. & Jaffrey, S. R. Global profiling of protease cleavage sites by chemoselective labeling of protein N-termini. PNAS 106, 19310–19315 (2009).

44. Brown, J. L. & Roberts, W. K. Evidence that Approximately Eighty per Cent of the Soluble Proteins from Ehrlich Ascites Cells Are Nα-Acetylated. J. Biol. Chem. 251, 1009–1014 (1976).

45. Gevaert, K. et al. Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides. Nat. Biotechnol. 21, 566–569 (2003).

46. Staes, A. et al. Improved recovery of proteome-informative, protein N-terminal peptides by combined fractional diagonal chromatography (COFRADIC). Proteomics 8, 1362–1370 (2008).

47. Van Damme, P. et al. Complementary positional proteomics for screening substrates of endo- and exoproteases. Nat. Methods 7, 512–515 (2010).

48. Venne, A. S., Vögtle, F.-N., Meisinger, C., Sickmann, A. & Zahedi, R. P. Novel Highly Sensitive, Specific, and Straightforward Strategy for Comprehensive N‑Terminal Proteomics Reveals Unknown Substrates of the Mitochondrial Peptidase Icp55. J. Proteome Res. 12, 3823–3830 (2013).

49. Shema, G. et al. Simple, scalable, and ultrasensitive tip-based identification of protease substrates. Mol. Cell. Proteomics 17, 826–834 (2018).

50. Lai, Z. W. et al. Enrichment of protein N-termini by charge reversal of internal peptides. Proteomics 15, 2470–2478 (2015).

51. Kleifeld, O. et al. Isotopic labeling of terminal amines in complex samples identifies protein N-termini and protease cleavage products. Nat. Biotechnol. 28, 281–288 (2010).

52. Kleifeld, O. et al. Identifying and quantifying proteolytic events and the natural N terminome by terminal amine isotopic labeling of substrates. Nat. Protoc. 6, 1578–1611 (2011).

53. Marshall, N. C. et al. Global Profiling of Proteolysis from the Mitochondrial Amino Terminome during Early Intrinsic Apoptosis Prior to Caspase-3 Activation. J. Proteome Res. 17, 4279–4296 (2018).

54. Mommen, G. P. M. et al. Unbiased Selective Isolation of Protein N-terminal Peptides from Complex Proteome Samples Using Phospho Tagging (PTAG) and TiO2-based Depletion. Mol. Cell. Proteomics 11, 832–842 (2012).

55. Thingholm, T. E. & Larsen, M. R. The Use of Titanium Dioxide for Selective Enrichment of Phosphorylated Peptides. in Phospho-Proteomics Methods and Protocols 135–146 (Springer, 2016).

56. Li, L. et al. A novel method to isolate protein N-terminal peptides from proteome samples using sulfydryl tagging and gold-nanoparticle-based depletion. Anal. Bioanal. Chem. 408, 441–448 (2016).

57. McDonald, L. & Beynon, R. J. Positional proteomics: preparation of amino-terminal peptides as a strategy for proteome simplification and characterization. Nat. Protoc. 1, 1790–1798 (2006).

58. Chen, L. et al. Hydrophobic Tagging-Assisted N‑Termini Enrichment for In-Depth N‑Terminome Analysis. Anal. Chem. 88, 8390–8395 (2016).

59. Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).

60. Agard, N. J. et al. Global kinetic analysis of proteolysis via quantitative targeted proteomics. PNAS 109, 1913–1918 (2012).

61. Lange, P. F., Huesgen, P. F., Nguyen, K. & Overall, C. M. Annotating N Termini for the Human Proteome Project: N Termini and Nα-Acetylation Status Differentiate Stable Cleaved Protein Species from Degradation Remnants in the Human Erythrocyte Proteome. J. Proteome Res. 13, 2028–2044 (2014).

62. Eckhard, U. et al. The Human Dental Pulp Proteome and N‑Terminome: Levering the Unexplored Potential of Semitryptic Peptides Enriched by TAILS to Identify Missing Proteins in the Human Proteome Project in Underexplored Tissues. J. Proteome Res. 14, 3568–3582 (2015).

63. Rinschen, M. M. et al. N-Degradomic Analysis Reveals a Proteolytic Network Processing the Podocyte Cytoskeleton. J. Am. Soc. Nephrol. 28, 2867–2878 (2017).

64. Klein, T. et al. The paracaspase MALT1 cleaves HOIL1 reducing linear ubiquitination by LUBAC to dampen lymphocyte NF-κB signalling. Nat. Commun. 6, 8777(1–17) (2015).

65. Saita, S. et al. PARL mediates Smac proteolytic maturation in mitochondria to promote apoptosis. Nat. Cell Biol. 19, 318–328 (2017).

66. Van Damme, P., Gawron, D., Van Criekinge, W. & Menschaert, G. N-terminal Proteomics and Ribosome Profiling Provide a Comprehensive View of the Alternative Translation Initiation Landscape in Mice and Men. Mol. Cell. Proteomics 13, 1245–1261 (2014).

67. Willems, P. et al. N-terminal Proteomics Assisted Profiling of the Unexplored Translation Initiation Landscape in Arabidopsis thaliana. Mol. Cell. Proteomics 16, 1064–1080 (2017).

68. Castrec, B. et al. Structural and genomic decoding of human and plant myristoylomes reveals a definitive recognition pattern. Nat. Chem. Biol. 14, 671–679 (2018).

69. Schardon, K. et al. Precursor processing for plant peptide hormone maturation by subtilisin-like serine proteinases. Science 354, 1594–1597 (2016).

70. Jackson, H. W., Defamie, V., Waterhouse, P. & Khokha, R. TIMPs: versatile extracellular regulators in cancer. Nat. Rev. Cancer 17, 38–53 (2016).

71. Perrar, A., Dissmeyer, N. & Huesgen, P. F. New beginnings and new ends: methods for large-scale characterization of protein termini and their use in plant biology. J. Exp. Bot. 70, 2021–2038 (2019).

72. Post, H. et al. Robust, Sensitive, and Automated Phosphopeptide Enrichment Optimized for Low Sample Amounts Applied to Primary Hippocampal Neurons. J. Proteome Res. 16, 728–737 (2017).

73. Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc. 11, 2301–2319 (2016).

74. Deutsch, E. W. et al. The ProteomeXchange consortium in 2017: supporting the cultural change in proteomics public data deposition. Nucleic Acids Res. 45, D1100–D1106 (2017).

75. Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).

76. Hughes, C. S. et al. Ultrasensitive proteome analysis using paramagnetic bead technology. Mol. Syst. Biol. 10, 1–14 (2014).

77. Hughes, C. S. et al. Single-pot, solid-phase-enhanced sample preparation for proteomics experiments. Nat. Protoc. 14, 68–85 (2019).

78. Omenn, G. S. et al. Progress on Identifying and Characterizing the Human Proteome: 2018 Metrics from the HUPO Human Proteome Project. J. Proteome Res. 17, 4031–4041 (2018).

79. Eckhard, U., Marino, G., Butler, G. S. & Overall, C. M. Positional proteomics in the era of the human proteome project on the doorstep of precision medicine. Biochimie 122, 110–118 (2016).

80. Mnatsakanyan, R. et al. Detecting post-translational modification signatures as potential biomarkers in clinical mass spectrometry. Expert Rev. Proteomics 15, 515–535 (2018).

81. Lange, P. F. & Overall, C. M. TopFIND, a knowledgebase linking protein termini with function. Nat. Methods 8, 703–705 (2011).

82. Geyer, P. E., Holdt, L. M., Teupser, D. & Mann, M. Revisiting biomarker discovery by plasma proteomics. Mol. Syst. Biol. 13, 1–15 (2017).

83. Calvo, S. E. et al. Comparative Analysis of Mitochondrial N-Termini from Mouse, Human, and Yeast. Mol. Cell. Proteomics 16, 512–523 (2017).

84. Crooks, G. E., Hon, G., Chandonia, J.-M. & Brenner, S. E. WebLogo: A Sequence Logo Generator. Genome Res. 14, 1188–1190 (2004).

85. Noris, M. & Remuzzi, G. Overview of Complement Activation and Regulation. Semin. Nephrol. 33, 479–492 (2013).

86. Afshar-Kharghan, V. The role of the complement system in cancer. J. Clin. Invest. 127, 780–789 (2017).

87. Keizer, M. P. et al. The High Prevalence of Functional Complement Defects Induced by Chemotherapy. Front. Immunol. 7, 1–14 (2016).

88. Tuch, B. B. et al. Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations. PLoS One 5, e9317 (1-17) (2010).

89. Ning, K., Fermin, D. & Nesvizhskii, A. I. Comparative Analysis of Different Label-Free Mass Spectrometry Based Protein Abundance Estimates and Their Correlation with RNA-Seq Gene Expression Data. J. Proteome Res. 11, 2261–2271 (2012).

90. Akbani, R. et al. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nat. Commun. 5, 3887 (1–15) (2014).

91. Wilhelm, M. et al. Can we predict protein from mRNA levels? Nature 547, E19–E23 (2017).
