UC San Diego
UC San Diego Electronic Theses and Dissertations

Title Methods and Characterization of smORF-Encoded Microproteins

Permalink https://escholarship.org/uc/item/25k5d8z0

Author Rathore, Annie

Publication Date 2018

Peer reviewed|Thesis/dissertation

eScholarship.org Powered by the California Digital Library University of California

UNIVERSITY OF CALIFORNIA SAN DIEGO

Methods and Characterization of smORF-Encoded Microproteins

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Biology

by

Annie Rathore

Committee in charge:

Professor Alan Saghatelian, Chair
Professor Christopher Benner
Professor Christopher Glass
Professor Randolph Hampton
Professor Martin Hetzer
Professor Jens Lykke-Anderson

2018

©

Annie Rathore, 2018

All rights reserved.

The Dissertation of Annie Rathore is approved, and is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California San Diego

2018

TABLE OF CONTENTS

Signature Page
Table of Contents
List of Figures
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1: Discovery and Emerging Biology of smORFs and Microproteins
1.1 Introduction
1.2 Functions of Small Open Reading Frames and Microproteins
1.3 Discovery of Small Open Reading Frames and Microproteins
1.4 Characteristics of Small Open Reading Frames and Microproteins
1.5 Approach for Functional Characterization
1.6 Conclusion and Future Directions
1.7 Figures
1.8 Acknowledgements
1.9 References

Chapter 2: Identification of Microprotein-Protein Interactions via APEX Tagging
Abstract
2.1 Introduction
2.2 Results and Discussion
2.3 Conclusion
2.4 Figures
2.5 Materials and Methods
2.6 Acknowledgements
2.7 References

Chapter 3: MIEF1 Microprotein Regulates Mitochondrial Translation
Abstract
3.1 Introduction
3.2 Results and Discussion
3.3 Conclusion
3.4 Figures
3.5 Materials and Methods
3.6 Acknowledgements
3.7 References

Chapter 4: Functional Characterization of StAR smORF
Abstract
4.1 Introduction
4.2 Results and Discussion
4.3 Conclusion and Future Directions
4.4 Figures
4.5 Materials and Methods
4.6 References

LIST OF FIGURES

Chapter 1

Figure 1.1. uORF mediated regulation of downstream ORF expression
Figure 1.2. Biologically active microproteins
Figure 1.3. Summary table for smORF-encoded biologically active microproteins
Figure 1.4. Methods for smORF and microprotein discovery
Figure 1.5. Web-based integrated tools for smORF discovery and characterization
Figure 1.6. Position of smORFs within RNAs

Chapter 2

Figure 2.1. Schematic illustration of identification of microprotein-associated proteins by APEX tagging
Figure 2.2. Identification of MRI microprotein-protein interactions in live cells by APEX tagging
Figure 2.3. Loading control of APEX and MRI-APEX FLAG immunoprecipitation
Figure 2.4. Biotinylated proteomes of untransfected, APEX and MRI-APEX transfected cells
Figure 2.5. Loading control of MRI-APEX biotinylation
Figure 2.6. Identification of C11orf98 microprotein associated proteins
Figure 2.7. Schematic illustration of APEX control and C11orf98-APEX constructs
Figure 2.8. Subcellular localization of biotinylated enriched proteins
Figure 2.9. Validation of C11orf98 microprotein interaction with NPM1 and NCL
Figure 2.10. Loading control of C11orf98-APEX and APEX-C11orf98 biotinylation
Figure 2.11. Validation of C11orf98 interaction with NPM1 and NCL by FLAG immunoprecipitation
Figure 2.12. Loading control of C11orf98-FLAG FLAG immunoprecipitation
Figure 2.13. Loading control of reciprocal anti-HA immunoprecipitation
Figure 2.14. Validation of C11orf98 interaction with NPM1 and NCL by reciprocal immunoprecipitation
Figure 2.15. Loading control of reciprocal anti-HA immunoprecipitation
Figure 2.16. C11orf98 microprotein was identified in public protein interaction database
Figure 2.17. Co-localization of C11orf98 microprotein and NPM1 in cell nucleolus

Chapter 3

Figure 3.1. MIEF1-MP expression and conservation
Figure 3.2. MIEF1-MP is a mitochondrial matrix microprotein
Figure 3.3. MIEF1-MP contains an LYR motif and interacts with ACPM/NDUFAB1
Figure 3.4. Effect of MIEF1-MP on ACPM/NDUFAB1-related cellular functions
Figure 3.5. MIEF1-MP interacts with the mitoribosome
Figure 3.6. Spectral counts of MIEF1-MP for different baits in the Bioplex database
Figure 3.7. Constructs and controls used for the MIEF1-MP-APEX experiment
Figure 3.8. Identification of additional MIEF1-MP mitoribosome protein interactors
Figure 3.9. Ontology analysis of the MIEF1-MP STRING interaction network
Figure 3.10. Effect of MIEF1-MP on the mtDNA encoded protein steady state levels
Figure 3.11. MIEF1-MP modulates the mitochondrial translation rate
Figure 3.12. Controls for BONCAT method measuring mitochondrial translation rate

Chapter 4

Figure 4.1. Discovery of the StAR smORF-encoded microprotein
Figure 4.2. Tissue-specific expression of StAR smORF transcript
Figure 4.3. Identification and quantitation of the StAR-MP in human cell lines and tissues
Figure 4.4. Schematic illustration of the StAR constructs used
Figure 4.5. Upstream StAR smORF regulates downstream ORF translation

ACKNOWLEDGEMENTS

I have been extremely fortunate to come across great mentors in my academic career. I am especially grateful to have Dr. Alan Saghatelian as my Ph.D. advisor. Alan is an excellent scientist and his guidance has helped me develop critical scientific thinking and problem-solving skills. More importantly, he has been incredibly supportive of my immediate and long-term goals, and I couldn’t have asked for a better mentor than him.

I would also like to thank members of my dissertation committee: Dr. Jens Lykke-Anderson, Dr. Chris Benner, Dr. Chris Glass, Dr. Randolph Hampton and Dr. Martin Hetzer, for their time, support, and scientific guidance.

None of this would have been possible without the support of current and past Saghatelian lab members who provided a very positive work environment. I’m thankful to have worked closely with Dr. Qian Chu, Dr. Dan Tan, Dr. Thomas F. Martinez, Dr. Jiao Ma, Raymond Mak, and Dr. Jolene K. Diedrich, and for the invaluable scientific discussions we had. Cynthia Donaldson and Joan Vaughan have been instrumental in providing constant lab support.

I’m also deeply grateful to my previous mentors, Dr. Partha Roy, Dr. Shailly Tomar, Dr. Pravindra Kumar, Dr. R P Singh, Dr. Aseem Ansari and Dr. Michael Schultz, for helping me realize my passion for science, inspiring me to pursue graduate school and providing immense support to date.

I absolutely couldn’t have made it without my friends and family. Thank you, Dinner/Hike/Spam, for keeping me sane through these years. Lastly, I owe all the success in life to my parents, husband, and sister. Thank you for believing in me, encouraging me and standing by my side always. I hope I will continue to make you proud.

Chapter 1, in part, is currently being prepared for submission for publication of the material. The dissertation author was the primary investigator and author of this material. The other author is Alan Saghatelian.

Chapter 2, in full, is a reprint with minor modifications of the material as it appears in Identification of Microprotein-Protein Interactions via APEX Tagging, Biochemistry 56, 26, 3299-3306 (2017). The dissertation author was the secondary investigator and author of this paper. Other authors include Qian Chu, Jolene K. Diedrich, Cynthia J. Donaldson, John R. Yates III and Alan Saghatelian.

Chapter 3, in part, has been submitted for publication of the material as it may appear in MIEF1 Microprotein Regulates Mitochondrial Translation, Biochemistry (2018). The dissertation author was the primary investigator and author of this paper. Other authors include Qian Chu, Dan Tan, Thomas F. Martinez, Cynthia J. Donaldson, Jolene K. Diedrich, John R. Yates III and Alan Saghatelian.

VITA

2014 Bachelor of Technology, Biotechnology, Indian Institute of Technology Roorkee
2018 Doctor of Philosophy, Biology, University of California San Diego

PUBLICATIONS

1. Annie Rathore, Qian Chu, Dan Tan, Thomas F. Martinez, Cynthia J. Donaldson, Jolene K. Diedrich, John R. Yates III, Alan Saghatelian. MIEF1 Microprotein Regulates Mitochondrial Translation, Biochemistry (2018), submitted.

2. Annie Rathore and Alan Saghatelian. Discovery and Emerging Biology of smORFs and Microproteins, in preparation.

3. Qian Chu, Annie Rathore, Jolene K. Diedrich, Cynthia J. Donaldson, John R. Yates III, Alan Saghatelian. Identification of Microprotein-Protein Interactions via APEX Tagging, Biochemistry 56, 26, 3299-3306 (2017).

4. Tobias M Franks, Asako McCloskey, Maxim Shokirev, Chris Benner, Annie Rathore, Martin W Hetzer. Nup98 recruits the Wdr82–Set1A/COMPASS complex to promoters to regulate H3K4 trimethylation in hematopoietic progenitor cells, Genes & Development 31: 2222-2234 (2017).

ABSTRACT OF THE DISSERTATION

Methods and Characterization of smORF-Encoded Microproteins

by

Annie Rathore

Doctor of Philosophy in Biology

University of California San Diego, 2018

Professor Alan Saghatelian, Chair

Microproteins are peptides and small proteins encoded by small open reading frames (smORFs). Newer technologies have led to the recent discovery of hundreds to thousands of new microproteins. The biological functions of a few microproteins have been elucidated, and these microproteins have fundamental roles in biology ranging from limb development to muscle function, highlighting the value of characterizing these molecules. Additionally, a majority of smORFs are found upstream of long open reading frames; these upstream ORFs (uORFs) have been shown to play a critical role in regulating translation of the main ORF.

The identification of microprotein-protein interactions (MPIs) has proven to be a successful approach to the functional characterization of these genes; however, traditional immunoprecipitation methods result in the enrichment of non-specific interactions for microproteins. Here, we test and apply an in situ proximity tagging method that relies on an engineered ascorbate peroxidase (APEX) to elucidate MPIs. The results demonstrate that APEX tagging is superior to traditional immunoprecipitation methods for microproteins. Furthermore, the application of APEX tagging to an uncharacterized microprotein called C11orf98 revealed that this microprotein interacts with the nucleolar proteins nucleophosmin and nucleolin, demonstrating the ability of this approach to identify novel hypothesis-generating MPIs.

We also characterize the protein-interaction partners of mitochondrial elongation factor 1 microprotein (MIEF1-MP) using a proximity labeling strategy that relies on APEX2. MIEF1-MP localizes to the mitochondrial matrix, where it interacts with the mitochondrial ribosome (mitoribosome). Functional studies demonstrate that MIEF1-MP regulates mitochondrial translation via its binding to the mitoribosome. Loss of MIEF1-MP decreases the mitochondrial translation rate, while elevated levels of MIEF1-MP increase the translation rate. The identification of MIEF1-MP reveals a new gene involved in this process.

Lastly, we report a smORF present upstream of and partially overlapping with the Steroidogenic Acute Regulatory (StAR) gene that encodes a 109 amino acid long StAR microprotein (StAR-MP). Expression studies at the transcript and peptide level show highly tissue-specific expression of the StAR microprotein, correlating with StAR protein expression in sites of steroidogenesis. Genetic analysis of the smORF start codons revealed translational repression of the downstream StAR ORF (dORF) due to the presence of the upstream smORF, pointing towards an additional mode of StAR protein regulation.

Chapter 1:

Discovery and Emerging Biology of smORFs and Microproteins

1.1 Introduction

The genome of any organism contains thousands of open reading frames (ORFs),1-3 defined as the potential protein-coding sequence between in-frame start and stop codons. The identification of protein-coding ORFs from the total pool of potential ORFs is challenging. The issue becomes more pronounced when focusing on small open reading frames (smORFs or sORFs), which encode peptides and small proteins referred to as microproteins (<100 amino acids in length). The probability that a start and stop codon occur in frame by random chance decreases with length, so the likelihood that an ORF encodes a real protein increases with length. For this reason, most genome annotation pipelines used a threshold length of 300 nucleotides, or 100 amino acids, for detection, thus excluding smORFs.1 For example, when plotting all ORFs in the yeast genome, Basrai, Hieter, and Boeke found ~260,000 smORFs between 2 and 99 codons,4 highlighting the need for a length cutoff to identify bona fide ORFs. This manuscript defined the term “smORF” and was also prescient about the future of the field, as the authors conclude their abstract by saying, “Yet the subset of small, functional ORFs (here abbreviated smORFs) probably encode very interesting proteins in all organisms, including humans.”
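The length cutoff described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the annotation pipeline of any cited study: it scans only the forward strand, uses only the canonical ATG start codon and the three standard stop codons, and the names `find_orfs` and `split_by_length` are hypothetical helpers introduced here.

```python
# Illustrative sketch only: names, defaults, and conventions are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, n_codons) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive, and the span includes the stop codon.
    n_codons counts sense codons only (start codon included, stop codon excluded).
    For each stop codon, only the most upstream in-frame ATG is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None
    return sorted(orfs)


def split_by_length(orfs, cutoff_codons=100):
    """Split ORFs into (smORFs, long ORFs) using a codon-count cutoff."""
    small = [orf for orf in orfs if orf[2] < cutoff_codons]
    long_orfs = [orf for orf in orfs if orf[2] >= cutoff_codons]
    return small, long_orfs
```

With the default 100-codon cutoff, an ORF of 99 sense codons is classed as a smORF and one of 100 codons is not, which mirrors why a fixed threshold removes every smORF from an annotation regardless of whether it is translated.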

Recent improvements in computational,5-7 genomic,8-12 and proteomic13 technologies have made it easier to identify protein-coding smORFs, revealing a larger proteome than was previously appreciated. New protein-coding smORFs have been identified in several organisms,14-15 including yeast,4, 16-18 flies,5 plants,19 mice, and humans.20 Moreover, at least some of these smORFs encode biologically relevant peptides and small proteins, or microproteins.21-22 For instance, a new class of evolutionarily similar smORFs encoding peptides that regulate muscle function has been identified and characterized in flies and mice.23-26 The identification of biologically active microproteins fuels efforts to identify and characterize smORFs.

Here we discuss examples of smORFs and microproteins that have been shown to play important roles in a variety of cellular processes. We also review the development and application of computational and experimental methods to identify novel smORFs. Lastly, we highlight the possible approaches for functional characterization going forward and the biological implications of these studies for human health.

1.2 Functions of Small Open Reading Frames and Microproteins

Functional smORFs:

Most of the detected smORFs are found upstream of longer open reading frames (ORFs).13 Originally it was unclear whether these upstream smORFs, or uORFs, were coding, but recent studies demonstrate that they are indeed translated.27-30 Computational analysis determined that over 40% of mammalian mRNAs contain uORFs, and the translation of these smORFs correlates with reduced levels of the protein produced from the longer downstream ORF (CDS).31 The presence of a uORF could control translation of the downstream ORF and mRNA stability by several potential mechanisms: (1) translational reinitiation at the downstream CDS after uORF translation, (2) impeding downstream CDS translation due to ribosome stalling on a uORF, (3) ribosome dissociation from the mRNA after uORF translation, (4) a CDS with an overlapping out-of-frame uORF that results in ribosome termination downstream of the CDS start codon, or (5) detection of the uORF stop codon as a premature stop codon, leading to activation of the nonsense-mediated decay pathway and degradation of the mRNA.31-34 uORFs use the canonical AUG start codon as well as non-canonical initiation codons to initiate translation, increasing the number of genes with uORFs but also making it more challenging to use sequence alone to identify uORFs.35-37 Hence, uORFs represent a class of prevalent translational repressors of the downstream ORF genome-wide, a feature well conserved across species.28, 38-39
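The distinction between a uORF that terminates before the CDS and one that overlaps the CDS out of frame (mechanism 4) can be made from sequence coordinates alone. The following Python sketch is a hedged illustration, not a method from the cited studies: `classify_uorfs` and its parameters are hypothetical, the transcript is written in the DNA alphabet, and only start codons supplied by the caller are considered, so non-canonical initiation codons have to be passed in explicitly.

```python
# Illustrative sketch only: function name, parameters, and labels are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_uorfs(mrna, cds_start, start_codons=("ATG",)):
    """List uORFs whose start codon lies entirely within the 5' UTR.

    mrna is the transcript sequence; cds_start is the 0-based index of the
    first base of the main ORF's start codon. Each result is (start, end, kind),
    with 0-based, end-exclusive coordinates that include the stop codon:
      "upstream"    - the uORF terminates at or before the CDS start codon
      "overlapping" - the uORF is out of frame with the CDS and terminates
                      downstream of the CDS start codon
    Upstream starts that are in frame with the CDS and have no intervening
    stop codon are N-terminal extensions rather than uORFs and are skipped.
    """
    mrna = mrna.upper()
    results = []
    # A start codon fits inside the 5' UTR only if start + 3 <= cds_start.
    for start in range(0, cds_start - 2):
        if mrna[start:start + 3] not in start_codons:
            continue
        # Read in frame from the start codon until the first stop codon.
        for i in range(start + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                end = i + 3
                if end <= cds_start:
                    results.append((start, end, "upstream"))
                elif (cds_start - start) % 3 != 0:
                    results.append((start, end, "overlapping"))
                break
    return results
```

Passing something like `start_codons=("ATG", "CTG")` would extend the scan to a non-canonical start codon, at the cost of many more candidate uORFs, which reflects the point above that sequence alone is a weak predictor of translated uORFs.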

One of the earliest examples of regulatory uORFs comes from the 5’-untranslated region (UTR) of the transcriptional activator GCN4 in yeast (ATF4 in vertebrates). GCN4 is a transcription factor that increases expression of genes involved in nutrient import, metabolism, and alleviation of oxidative stress. The GCN4 mRNA contains four short uORFs encoding di- or tri-peptides.40 Mechanistic studies have led to a detailed model of how the four uORFs regulate GCN4 translation, and these studies have revealed that uORFs differ in their ability to allow ribosome reinitiation and can therefore regulate downstream CDS translation to different extents (Figure 1.1A). Ribosomes bind to the GCN4 mRNA and scan until they recognize and translate uORF1. After translation of uORF1, the ribosome remains attached to the GCN4 mRNA, which allows it to reinitiate and translate downstream ORFs.41 uORF2 is similar to uORF1 and also allows reinitiation and translation of downstream ORFs.42 Based on these results, uORF1 and uORF2 downregulate but do not completely inhibit CDS translation. uORF3 and uORF4, on the other hand, favor release of the ribosome from the mRNA and prevent it from reaching the GCN4 AUG start codon,42-43 completely abrogating GCN4 expression. Under starvation conditions there are lower levels of the eIF2-GTP-Met-tRNAi ternary complex (TC), which is required for translation initiation (and reinitiation), resulting in fewer ribosomes initiating at uORF2-4, which increases CDS translation. The uORFs thus provide a cis-regulatory mechanism for GCN4 translation under stress conditions.44-45

The mammalian ATF4 mRNA uses two uORFs to regulate the expression of the ATF4 transcription factor but uses a different strategy from GCN4 (Figure 1.1B). uORF2 is 59 codons long and overlaps with the ATF4 CDS but is out of frame. Through this arrangement, under stress-free conditions the ribosome translates uORF2 and in the process bypasses the ATF4 CDS start site, inhibiting ATF4 production. The induction of cell stress results in lower ternary complex levels, leading the ribosome to scan past uORF2 and initiate translation of ATF4. Increased ATF4 production leads to the transcription of several key target genes that are necessary for the cellular stress response.46-47 One advantage of this strategy is that it produces proteins faster by obviating the need for mRNA transcription, and a faster response to cellular stress is likely vital for preventing cell and tissue damage.

CHOP is another example of a mammalian gene whose expression is controlled by the presence of a conserved uORF in the 5’ UTR, but in this case the regulation depends on the sequence of the smORF. CHOP encodes a transcription factor related to the CCAAT/enhancer-binding protein (C/EBP) family of transcription factors that regulates cellular processes such as growth, differentiation, and programmed cell death. Cell stress prompts ribosomes to bypass the uORF, facilitating the expression of CHOP, which leads to apoptosis (Figure 1.1C). What is unique about the CHOP uORF is that the protein sequence of the uORF dictates the efficiency with which the uORF inhibits CHOP expression, and the working hypothesis is that certain elements of this sequence stall ribosome elongation by preventing efficient exit of the peptide from the ribosome peptide exit tunnel.34, 48 The CHOP uORF demonstrates evolutionary selection of smORFs for protein sequence.

Though most studied examples of uORFs highlight their role as cis-acting modulators of translation,49 it remains unclear whether uORFs can produce trans-acting functional peptides. An example of a uORF-encoded peptide that functions in a process other than translational regulation is the 70 amino acid long MIEF1 microprotein (MIEF1-MP), translated from a smORF upstream of the Mitochondrial Elongation Factor 1 (MIEF1) gene. The mitochondrial MIEF1-MP regulates the mitochondrial fission process through the Drp1 pathway, similar to the function of the MIEF1 protein.50 Overall, these examples highlight the prevalence of functional uORFs that regulate translation of downstream coding sequences across the genome or produce functional peptides.

Functional microproteins:

In this section, we highlight significant examples, including the newest examples, of functional microproteins. Identifying functions for these molecules reveals molecular pathways that influence important biology, and the growing number of functional microproteins provides a rationale for continued investigation into the discovery of smORFs and microproteins. This section is divided into three parts. The first part provides an overview of smORFs and microproteins with varied functions. Muscle-regulating smORFs and microproteins form the largest class of physiologically active smORFs, and the second part is dedicated to outlining the important contributions from this area. Lastly, we highlight examples of smORFs that were identified during the search for disease-regulating genes. We apologize because we have not been able to cover all the work in this area and, in particular, have excluded work in plants.

A. smORFs and microproteins have roles in many different biological processes

Recent interest in smORFs and microproteins can be traced to two contemporaneous manuscripts that describe the discovery of a set of smORFs that produce microproteins responsible for development in flies. Genetic screening identified tarsal-less (tal)51 or polished rice (pri)52 (referred to as tal/pri) as a putative non-coding RNA in Drosophila that is regulated during development.53 Though the transcript was initially suspected to be a non-coding RNA, no obvious non-coding functions, such as antisense regulation of other genes, could be found. Additional analysis of this transcript revealed the existence of four smORFs that were hypothesized to code for three 11-amino acid microproteins and one 32-amino acid microprotein. This unexpected discovery was validated to ensure that these were actually protein-coding smORFs, including by placement of a GFP that could only be expressed if the smORFs are translated.51-52 Functional experiments demonstrated that the peptides were sufficient for the biological activity, revealing a new class of developmental regulators that are as short as 11 amino acids. The tal/pri microproteins are likely some of the shortest unprocessed biologically active proteins reported. Subsequent reports have provided a mechanistic rationale for tal/pri function by showing that these microproteins control epidermal differentiation by activating the transcription factor Shavenbaby (Svb): they mediate the formation of a key protein interaction54 that drives the developmental program through Svb ubiquitination and processing.55

The discovery of the non-annotated but functional tal/pri suggested that genomes, especially the Drosophila genome, might harbor additional bioactive, yet non-annotated smORFs and microproteins. An elegant study searching for putative functional smORFs in the Drosophila genome5 validated this hypothesis by identifying Hemotin (hemo) as a previously uncharacterized regulator of phagocytosis that regulates the fly’s ability to fight infections. Proteomics56 and Ribo-Seq9 analysis of fly S2 cells revealed that the hemo smORF is translated into an 88-amino acid long transmembrane microprotein. As its name implies, Hemotin is expressed in hemocytes, which are insect immune cells that regulate phagocytosis. Subcellular localization experiments localized hemotin to early endosomal vesicles that are responsible for trafficking of material between the cell membrane and the cytoplasm. In hemocytes, endosomal trafficking has a fundamental role in phagocytosis of infectious particles, and the localization of hemotin to endosomes suggested a role in immune processing of bacteria. Loss of hemotin resulted in a vacuolation phenotype with enlarged and abnormal endosomes, delaying the digestion of phagocytosed bacteria and reducing immunity.57 Mechanistic experiments revealed that hemotin interacts with the multifunctional 14-3-3ζ protein to repress the activity of 14-3-3ζ with the phosphatidylinositol (PI) kinase PI3K68D.57 Genetic, cellular and biochemical evidence suggests that hemotin modulates the PI(3)P levels of early endosomes through its interaction with 14-3-3ζ to inhibit PI3K68D,57 providing an example of a microprotein that can regulate kinase activity. Furthermore, a functional homologue of hemo in vertebrates is stannin (snn), a conserved 88-amino acid mitochondrial microprotein involved in mediating cellular toxicity to metals.58 Like hemotin, stannin also has a role in innate immunity, as loss of stannin led to dysfunctional phagocytosis.

The discovery of new smORFs and microproteins in Drosophila inspired searches in other genomes for smORFs. One of these studies13 led to the identification of a human microprotein that was named non-annotated P-body dissociating polypeptide (NoBody).59 The smORF that encodes NoBody was found on an mRNA annotated as non-coding at the time, which has since been revised to a coding transcript. Experimental evidence of NoBody translation was observed during a proteogenomics analysis of K562 cells.13 NoBody is highly conserved, which was the first clue that this microprotein might be functional; however, NoBody has no homology to any functional proteins. NoBody immunoprecipitation followed by proteomics revealed that NoBody is bound to the mRNA decapping complex, based on the enrichment of enhancer of decapping protein 3 and 4 (EDC3 and EDC4) and decapping protein 1A, 1B, and 2 (Dcp1A, Dcp1B and Dcp2). The decapping complex removes the 5’-cap from mRNAs to promote 5’-3’ decay. Biochemical and cross-linking experiments confirmed a direct interaction between NoBody and EDC4. At low expression levels NoBody localizes to mRNA decay-associated RNA-protein granules called P-bodies, providing cellular evidence for an interaction between NoBody and the decapping complex. When NoBody was overexpressed the number of P-bodies was dramatically reduced, and higher NoBody levels in a particular cell correlated with fewer P-bodies. Levels of key decapping proteins were similar in low and high expressing NoBody samples, indicating that higher NoBody levels were regulating the aggregation state but not the degradation of P-body proteins. Mechanistic studies confirmed that the translated peptide, and not the RNA, along with its interaction with EDC4, is required for the observed changes in P-body numbers. The mRNA decapping complex has a role in the nonsense mediated decay (NMD) pathway, which degrades RNAs with premature termination codons60-61 to prevent translation of truncated protein products. This led to the hypothesis that NoBody may affect cellular levels of endogenous NMD substrates, which was tested by measuring the levels of a mutant p53 gene with a premature termination codon in Calu-6 cells. Loss of NoBody led to lower levels of the p53 NMD substrate, indicating that loss of NoBody is coupled to increased decay and revealing a critical role for this newly identified microprotein.

Originally annotated as modulator of retroviral infection (MRI)62 because of its function in a screen, CYREN (cell cycle regulator of NHEJ) is a microprotein with a critical function in cell biology. There are three CYREN mRNA isoforms that produce three different microproteins, with isoforms 1 and 2 having an N-terminal domain required for CYREN activity. Isoform 2 of CYREN, a 69-amino acid microprotein, was first detected as an endogenous microprotein in a proteogenomics study in K562 cells,13 validating CYREN as a stably expressed microprotein. CYREN was characterized as a functional microprotein regulator of retroviral infection, but the mechanism underlying such regulation remained enigmatic. Protein interaction studies revealed that CYREN interacts with the Ku70/80 heterodimer, the key protein complex for DNA double-strand break repair by non-homologous end joining (NHEJ),62 but the significance of CYREN only became clear after its role in cellular DNA repair was determined (Figure 1.2A). Cells can utilize the NHEJ and homologous recombination (HR) pathways for the repair of double-strand breaks. NHEJ is the faster, more common, but more error-prone process, while HR provides more fidelity but requires another chromosome to template the repair of the broken DNA. As a result, HR only occurs in the S and G2 phases of the cell cycle, when another chromosome is present. Because NHEJ is the faster of the two processes, the cell must somehow inhibit NHEJ for HR to take place. Knockdown studies of CYREN revealed it to be an endogenous suppressor of NHEJ during the S and G2 phases of the cell cycle63 (Figure 1.2A). The characterization of CYREN’s critical role in DNA repair highlights the immense knowledge that can be gained from studying microproteins.

B. Microproteins and smORFs in muscle biology

Recently, two groups identified a biologically active, skeletal muscle-enriched, 56-amino acid long microprotein named mitoregulin (Mtln)64 or micropeptide regulator of β-oxidation (MOXI)65 (herein referred to as Mtln/MOXI). As with other microproteins, the RNA that encodes Mtln/MOXI was misannotated as a non-coding RNA prior to these reports. Both reports determined that Mtln/MOXI localizes to the mitochondrial inner membrane. Protein-lipid interaction studies showed Mtln/MOXI to be a cardiolipin binder,64 while protein interaction studies revealed binding of Mtln/MOXI to the Mitochondrial Trifunctional Protein.65 Cell biology experiments revealed a role for Mtln/MOXI in several cellular processes. Expression of Mtln/MOXI in HeLa cells increased membrane potential, decreased levels of reactive oxygen species (ROS), and enhanced respiration, while knockdown reversed many of these effects. Knockout or overexpression of Mtln/MOXI in mice resulted in a clear muscle phenotype, with changes to the muscle fibers and to the size and structure of the mitochondria. Ex vivo measurements of fatty acid oxidation revealed enhanced capacity for fatty acid oxidation in transgenic animals, whereas knockout mice had impaired β-oxidation leading to exercise intolerance. These studies revealed an exciting microprotein with roles in cellular and physiological energy extraction.

Another group of muscle-regulating peptides operates through intramembrane protein-protein interactions with the endoplasmic reticulum calcium pump called sarco-/endoplasmic reticulum Ca2+-ATPase (SERCA). Bioinformatic66 analysis of RNA-Seq data53 from Drosophila melanogaster identified two smORFs, named sarcolambans (SCLs) A and B, on a misannotated long non-coding RNA. The SCLA and SCLB smORFs encode 28- and 29-amino acid microproteins, respectively, that share a highly conserved transmembrane domain. Indeed, analysis of the sequence of these microproteins revealed their homology to a family of annotated microproteins called phospholamban (PLN) and sarcolipin (SLN),23 leading to the naming of the Drosophila genes as sarcolambans. PLN and SLN bind to the transmembrane domain of SERCA and regulate the ability of this pump to transport calcium, which, in turn, regulates muscle performance. This led to the hypothesis that the Drosophila SCL microproteins might also bind to SERCA and control muscle performance in the fly. Localization of SCL gene expression revealed high levels in heart muscle, and modeling of the SclA and SclB structures demonstrated that they are short transmembrane helices capable of associating with SERCA. Genetic studies in Drosophila revealed a clear phenotype, with loss of SCL resulting in an irregular heartbeat, while gain-of-function overexpression of SCL resulted in an increased periodicity of the fly heartbeat. The discovery of SCL highlighted the existence of additional, functionally important smORFs and microproteins in the fly genome and paved the way for similar discoveries in mammals.

Myoregulin (MLN) is a conserved microprotein, related to SLN and PLN, that is encoded by a skeletal muscle-specific RNA annotated as a putative non-coding RNA. Compared to SLN and PLN, which are expressed in cardiac and slow skeletal muscle in mice, respectively, MLN is expressed in all skeletal muscle, suggesting a broad role in muscle function. Deletion of MLN in mice enhances Ca2+ handling and improves skeletal muscle performance,67 with mice lacking MLN being able to last longer than wild-type mice in a treadmill distance test. In addition to discovering MLN, the Olson group identified three additional transmembrane microproteins, endoregulin (ELN), another-regulin (ALN), and DWORF (dwarf open reading frame), that regulate SERCA activity. ELN and ALN inhibit SERCA activity in non-muscle cell types,24 while the 34-amino acid DWORF localizes to the sarcoplasmic reticulum (SR) membrane and enhances SERCA activity by displacing the SERCA inhibitors (SLN, PLN and MLN). DWORF is the only endogenous peptide reported to activate the SERCA pump by physical interaction and enhance muscle contractility68 (Figure 1.2B).

Microproteins also have an important role in muscle formation. A microprotein inducer of fusion was independently discovered by three groups and named minion, myomerger, or myomixer.69 We will refer to this microprotein as minion for simplicity. Minion was identified through a whole-transcriptome RNA-Seq analysis that searched for smORF-containing transcripts showing strong temporal regulation during mouse myoblast differentiation in vitro. Minion is an 84-amino acid microprotein that is specifically expressed in skeletal muscle. It is required for the fusion of mononuclear progenitors to form myotubes (muscle fibers) during development (Figure 1.2C).

Loss of minion resulted in failure to form syncytial myotubes and a marked reduction in fused muscle fibers in mice, leading to perinatal death. Functional studies suggested that minion and myomaker, a key regulator of muscle fusion, work synergistically to form muscle, whereby myomaker induces the interaction of cell membranes,70 while minion drives the cytoskeletal organization needed for efficient fusion. Thus, the microprotein minion is part of an intriguing two-component program that promotes mammalian cell fusion along with rapid cytoskeletal rearrangement.

C. Disease Conditions

As microproteins are a relatively new class of molecules, very few have been linked to disease, but there are examples of microproteins that may be considered disease genes. Hashimoto and colleagues identified a microprotein called humanin (HN) during a functional screen for genes that inhibit neuronal death, with the aim of finding new genes that could protect against Alzheimer’s disease. This screen identified humanin as a 24-amino acid microprotein derived from a mitochondrial ribosomal RNA that could prevent neuronal cell death induced by the beta amyloid (Aβ) peptide.71 Subsequent mechanism of action studies revealed that humanin has anti-apoptotic activity through a physical interaction with Bax (Bcl2-associated X protein) that inhibits the proapoptotic functions of Bax72 (Figure 1.2D). The discovery of humanin could open avenues for the development of new neuroprotective therapeutics for Alzheimer’s disease.73

Cancer Associated Small Integral Membrane Open reading frame 1 (CASIMO1),74 an 83-amino acid microprotein, was discovered during an expression profiling study that sought to identify genes involved in hormone receptor-positive primary breast cancer. An RNA transcript containing a smORF, initially annotated as a putative non-coding RNA, showed 6-fold elevated levels in breast tumors compared to normal tissue. CASIMO1 is conserved between humans and rodents, indicative of a functional microprotein. Deletion of CASIMO1 in the MCF7 ER/PR+ breast cancer cell line resulted in deregulation of the actin cytoskeleton, impaired migration, a reduced proliferation rate, and accumulation of cells in the G0/G1 phase. Mechanism of action studies revealed that CASIMO1 interacts with squalene epoxidase (SQLE) to modulate cellular lipid levels and signal transduction. CASIMO1 is the first example of a microprotein with a role in breast cancer.74

Most of the examples described above are summarized in Figure 1.3, along with information about the discovery and characterization methods, biological function, and mode of action of each microprotein. Biologically active smORFs and microproteins drive interest in this field, but because smORFs are difficult to identify, there is still a great deal of interest in developing and applying strategies for smORF and microprotein discovery.

1.3 Discovery of Small Open Reading Frames and Microproteins

As sequencing technologies advanced and it became possible to sequence whole genomes, the need emerged for algorithms to annotate the protein-coding genes within those genomes. In bacteria, gene annotation looked for in-frame start and stop codons, while eukaryotic genomes required additional steps to account for RNA splicing prior to the annotation of protein-coding genes. These methods have successfully identified most of the annotated protein-coding genes, but most smORFs are not annotated because they are too short and are difficult to detect by other means. To address this problem, additional efforts have been made to predict and validate smORFs and microproteins, with approaches focusing on computational, genomic and proteomic strategies.

Computational Methods:

Genome sequencing projects aimed to identify all protein-coding genes, which required algorithms to reveal ORFs from sequencing data. Because three of the 64 codons are stop codons, a random sequence of genomic length should contain many in-frame start and stop codons by chance, and therefore any gene-finding analysis that relied solely on the location of in-frame starts and stops would overestimate the number of protein-coding genes. To illustrate the scale of this problem, in their 1997 paper Basrai, Hieter, and Boeke estimated that there are ~260,000 smORFs of 2-99 codons, highlighting the challenge of finding interesting smORFs among a large group of irrelevant short ORFs.4 One way to thin the pile is to pick a length of sequence that is long enough that evolution would have had to select against stop codons to maintain a translated ORF. The most common length cutoff for gene-finding algorithms ended up being 100 codons, and histograms of protein length versus number reveal a precipitous drop in the number of annotated or predicted smORFs below 100 codons.20 Additionally, many smORFs use near-cognate non-ATG initiation codons,13 which makes it even more difficult to find these smORFs informatically.
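The naive start-to-stop scan described above can be illustrated with a minimal Python sketch. It is not the algorithm of any cited gene finder: the function name `find_orfs` and its `min_codons`/`max_codons` parameters are hypothetical, only the forward strand is scanned, and only ATG starts are considered, so the near-cognate initiation codons mentioned above would be missed.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, max_codons=None, min_codons=2):
    """Return (start, end, n_codons) for ATG-initiated ORFs on the forward strand.

    n_codons counts the ATG and all sense codons but excludes the stop codon.
    For each stop codon, only the most upstream in-frame ATG is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOPS:
                n = (i - start) // 3
                if n >= min_codons and (max_codons is None or n < max_codons):
                    orfs.append((start, i + 3, n))
                start = None  # close the ORF and keep scanning this frame
    return sorted(orfs)


# Keep only smORF-sized hits, i.e. below the conventional 100-codon cutoff.
smorfs = find_orfs("CCATGAAATGGTAACC", max_codons=100)
```

Running such a scan without any length, conservation, or expression filter on a real genome returns a very large number of short hits, which is the overestimation problem the paragraph describes.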

Despite these inherent challenges, several groups have developed valuable computational methods to identify non-annotated smORFs.5-9, 12, 20, 75 The reanalysis of the human or mouse genomes using a tool called CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) led to the identification of ~3000 additional smORFs,20 providing a valuable estimate for the number of missed smORFs. An informatic analysis of Drosophila melanogaster non-coding euchromatic DNA in search of smORFs revealed ~600,000 potential smORFs. This large number was trimmed to ~400-4,500 smORFs, depending on the stringency, by utilizing conservation between D. melanogaster and Drosophila pseudoobscura, the presence of RNAs in the transcribed regions of the genome, and the ratio of conservative versus non-conservative nucleotide changes (indicative of protein-coding potential76).5 Mackowiak and colleagues used a similar approach to develop an integrated computational pipeline for the identification of conserved smORFs in human, mouse, zebrafish, fruit fly and C. elegans. They identified ~2,000 novel smORFs located in the untranslated regions of canonical mRNAs or on transcripts that were thought to be non-coding.7 Of these, the translation of more than 100 and 70 novel smORFs was supported by published ribosome profiling data and mass spectrometry evidence, respectively. The encoded peptides showed little homology to known proteins,7, 20 but were rich in protein interaction motifs and in some cases conserved over large evolutionary distances,6-7 as reported in previous studies. Overall, computational methods based on sequence analysis, conservation scores, scanning for functional domains/motifs, and homology (Figure 1.4A) have played a critical role in the discovery and function prediction of smORF-encoded microproteins, thus expanding our understanding of the human proteome.
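One way to picture the nucleotide-change filter mentioned above is to count, codon by codon, how many differences between two aligned orthologous ORFs preserve the encoded amino acid and how many alter it. The sketch below is a deliberately simplified illustration under stated assumptions: it uses the standard genetic code, expects gap-free in-frame alignments without ambiguous bases, and the function name `count_codon_changes` is hypothetical. It is not the statistic used in the cited Drosophila study.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(orf_a, orf_b):
    """Count (synonymous, non-synonymous) codon differences between two
    equal-length, gap-free, in-frame aligned ORFs."""
    if len(orf_a) != len(orf_b) or len(orf_a) % 3:
        raise ValueError("ORFs must be aligned, equal length, and a multiple of 3")
    syn = nonsyn = 0
    for i in range(0, len(orf_a), 3):
        codon_a = orf_a[i:i + 3].upper()
        codon_b = orf_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # nucleotide change, same amino acid
        else:
            nonsyn += 1   # nucleotide change alters the amino acid
    return syn, nonsyn


# Toy pair: AAA->AGA changes K to R, CTG->CTA keeps L.
syn, nonsyn = count_codon_changes("ATGAAACTGTAA", "ATGAGACTATAA")
```

An excess of amino acid-preserving changes over amino acid-altering ones is the kind of signal that suggests selection on the protein product; real analyses normalize by the number of possible sites of each type, which this sketch omits.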

Newer efforts to find ORFs, including smORFs, analyze genomes and make the resulting information available online for global access. sORF.org77 is a public database created to identify smORFs and assess their coding potential. It contains smORFs identified by ribosome profiling using state-of-the-art tools and metrics. The coding potential of each smORF was assessed using a variety of distinct methods including PhyloCSF,78 ribosome profiling data analysis with FLOSS,12, 79 and MS data. The repository identified a total of 263,354 smORFs across three cell lines from human (HCT116), mouse (E14_mESC) and fruit fly (S2) (Figure 1.5). More recently, SmProt, an online database that contains more than 255,000 microproteins identified computationally and experimentally in 291 cell lines/tissues derived from eight popular species, was published.80 In addition to basic information (sequence, location, organism, etc.), the database also provides specific information (function, disease type, etc.). The microprotein functions were predicted using InterProScan81 and, for some microproteins with specific functions, confirmed using the published literature (Figure 1.5).

Another tool, the functional smORF-encoded peptides predictor (FSPP), was developed to predict microproteins and their functions in a high-throughput manner. FSPP uses the overlap between microproteins detected by Ribo-Seq82 and mass spectrometry83 as the authentic microprotein targets. Using transcriptional and translational expression data, along with co-location information, a relation network was built. FSPP annotated microprotein function based on the sub-network to which the target microprotein belonged and the functions of its known neighbors. The FSPP model was evaluated by successfully predicting the function of 856 out of 960 annotated proteins. The method was then used to predict 568 functional microproteins, some of which were confirmed to be consistent with known functions.84

Genomic Methods:

Computational methods, discussed above, have predicted hundreds to thousands of smORFs across species; however, the actual number of translated smORFs is still an open question. Ribosome profiling (Ribo-Seq) is a strategy based on the deep sequencing of ribosome-protected mRNA fragments, called ribosome footprints, which are used to identify translated regions of the transcriptome85 (Figure 1.4B). Ribosome profiling provided the first experimental evidence that non-coding regions, including the untranslated regions of mRNAs (UTRs)9, 79 and long non-coding RNAs (lncRNAs),9, 79, 86 contain protein-coding smORFs. Ribo-Seq has also provided a method to identify non-AUG start codon10, 35 usage and polycistronic transcripts,9, 35 which has helped identify new smORFs.9-10, 12, 35 Ribo-Seq is ideal because it provides the most sensitive and broad view of protein translation, and it has been instrumental in the detection of non-annotated smORFs and in assessing their protein-coding potential in yeast,11 flies,9 zebrafish,12 mice,8 and humans.12, 87 One issue with Ribo-Seq is that ribosome binding does not always correlate with translation, as non-productive binding of single ribosomes or scanning 40S ribosomal subunits could result in footprints without translation of smORFs; however, improvements in the informatic analysis of Ribo-Seq data have addressed this problem by accounting for the 3-nt periodicity that is a signature of active translation (Figure 1.4B).
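The 3-nt periodicity check can be sketched as follows. This is an illustrative toy, not the scoring used by any published Ribo-Seq pipeline: it assumes footprints have already been reduced to P-site coordinates on the transcript, and the function names and the `min_in_frame` and `min_reads` thresholds are hypothetical values chosen only for the example.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start):
    """Fraction of footprint P-sites falling in frames 0, 1 and 2,
    measured relative to the first nucleotide of the candidate ORF."""
    counts = Counter((p - orf_start) % 3 for p in psite_positions)
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts[f] / total for f in range(3))


def has_periodicity(psite_positions, orf_start, min_in_frame=0.6, min_reads=10):
    """Crude test: enough reads, and most of them in the ORF's own frame."""
    positions = list(psite_positions)
    if len(positions) < min_reads:
        return False
    return frame_fractions(positions, orf_start)[0] >= min_in_frame


# Eight of ten P-sites sit in frame 0 of an ORF starting at position 100.
translated_like = [100, 103, 106, 109, 112, 115, 118, 121, 101, 102]
# Reads spread evenly over the three frames, as expected without translation.
scanning_like = [100, 101, 102] * 4
```

Footprints from elongating ribosomes advance one codon at a time and therefore pile up in a single frame, whereas non-productive binding is expected to show no frame preference.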

Ribosome profiling was used to identify novel protein-coding smORFs in zebrafish and mice. Crappé and colleagues used a bioinformatic method to search for putative smORFs in the mouse genome and combined this information with conservation and Ribo-Seq to annotate translated smORFs.8 Similarly, Bazzini and colleagues analyzed the zebrafish transcriptome using ribosome profiling to define actively translated smORFs, and identified 190 smORFs in long non-coding RNAs, 311 ORFs in 5’ UTRs and 93 in 3’ UTRs, some of which were validated by mass spectrometry.12 Ribo-Seq has also been used to annotate protein-coding smORFs in yeast11 and flies.9 In yeast, more than 1100 unannotated transcripts thought to lack protein-coding capacity were analyzed with Ribo-Seq.11 These uncharacterized RNAs were enriched on polysomes, indicating that they might be translated, and sequence analysis predicted that these transcripts contain smORFs of 10-96 codons in length. In flies, Ribo-Seq more than doubled the number of annotated smORFs in the Drosophila genome, from 107 smORFs to 228 smORFs with evidence of translation. Thus, Ribo-Seq has proven to be the ideal method for assigning novel smORFs.

Proteomic Methods:

Though a large number of coding smORFs were predicted by computational methods20 and detected by Ribo-Seq,10 these methods do not indicate how much microprotein is being produced or how stable the translated product is. Thus, the ultimate validation that a smORF produces a stable microprotein is the detection of that microprotein by proteomics. These methods require a proteomics database that contains microprotein sequences, which can be computationally predicted, derived from Ribo-Seq, or generated by in silico translation of the transcriptome to account for all possibilities.

Vanderperre and colleagues used their computationally predicted alternate open reading frame (AltORF) database to generate a searchable proteomics database. Using this database in the analysis of several human tissue samples provided proteomics evidence for 1259 microproteins. Proteomics can also be used to discover smORFs through the unbiased detection of microproteins, by translating the entire transcriptome of a cell or tissue in three reading frames to generate a proteomics database that contains all possible translated products. Searching this database with proteomics data from the same cells and tissues can detect microproteins and reveal new smORFs.88-89 Initial attempts at using this approach were successful but required better mass spectrometers to dig deeper into the proteome. For example, Oyama and colleagues applied this approach to K562 cells and identified four non-annotated ORFs including one smORF of less than codons.90 Improvements in mass spectrometers and next-generation RNA sequencing (RNA-Seq) led to the discovery of 90 microproteins, 86 of which were from novel smORFs, in K562 cells13 (Figure 1.4C). Based on length assignment using AUG-to-stop, Kozak sequence-to-stop, or upstream stop-to-stop for microproteins detected by proteomic methods, the microproteins showed a length distribution of 18-149 amino acids, with the majority being less than 100 aa.13
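The three-frame translation step used to build such a search database can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the cited studies: the function names, the FASTA header layout, and the `min_len` cutoff are assumptions made for the example, the standard genetic code is assumed, and stop-to-stop segments are kept so that no start codon needs to be chosen in advance.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def three_frame_peptides(transcript, min_len=8):
    """Translate the three forward frames and split each at stop codons.

    Returns (frame, peptide) pairs for every stop-to-stop segment of at
    least min_len residues.
    """
    seq = transcript.upper().replace("U", "T")
    entries = []
    for frame in range(3):
        for segment in translate(seq[frame:]).split("*"):
            if len(segment) >= min_len:
                entries.append((frame, segment))
    return entries


def to_fasta(transcript_id, entries):
    """Format (frame, peptide) pairs as FASTA text for a database search."""
    lines = []
    for n, (frame, peptide) in enumerate(entries, 1):
        lines.append(f">{transcript_id}|frame{frame}|seg{n}")
        lines.append(peptide)
    return "\n".join(lines)


entries = three_frame_peptides("ATGGCCAAATAAGG", min_len=3)
fasta_text = to_fasta("tx1", entries)
```

Because every stop-to-stop segment is retained, peptides from non-AUG-initiated smORFs remain searchable, at the cost of a much larger database.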

Improved fractionation methods increased the number of new human smORFs, with an additional 237 microproteins detected from MCF10A, MDAMB231, K562, and a breast tumor sample.91 These experiments also revealed one limitation of proteomics: it has limited sensitivity compared to Ribo-Seq, as increased fractionation continues to increase the number of microproteins/smORFs detected. Several factors contribute to this limited sensitivity, including the natural abundance and stability of microproteins and the fact that smaller microproteins often yield only a single tryptic peptide, making their detection difficult.
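The single-tryptic-peptide limitation can be made concrete with a small in silico digest. The sketch below applies the commonly used trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages. The `min_len` and `max_len` bounds for a detectable peptide are illustrative assumptions, not values taken from the studies cited here, and the example sequence is invented.

```python
import re


def tryptic_peptides(protein, min_len=7, max_len=30):
    """Fully digest a protein with the trypsin rule and keep peptides whose
    length falls in an assumed MS-detectable window."""
    # Zero-width split: after K or R, but not when followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", protein.upper())
    return [p for p in pieces if min_len <= len(p) <= max_len]


# A made-up 19-residue microprotein: digestion gives MSTLLEAVK, GPGR and
# WDELLK, and only the first is long enough to pass the length window.
detectable = tryptic_peptides("MSTLLEAVKGPGRWDELLK")
```

A long protein typically produces many peptides in the detectable window, so losing one to poor ionization or a modification still leaves others, whereas a microprotein may have no second chance.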

Overall, these studies have demonstrated that microproteins represent an important class of non-annotated cellular polypeptides that expands our understanding of the human proteome and requires further investigation.

1.4 Characteristics of Small Open Reading Frames and Microproteins

Position of smORFs within RNAs: Sequencing studies have revealed that only a fraction of the eukaryotic genome encodes protein-coding mRNAs. About half of mammalian RNAs are considered to be non-coding and are referred to as long noncoding RNAs (lncRNAs).92-96 However, the prevalence of smORFs and the evidence for their translation across organisms suggest that at least some of these lncRNAs might be coding.5, 10-11, 97-102 smORFs have been reported in several RNAs considered non-coding, including introns of pre-mRNAs,103-104 lncRNAs,10, 51-52, 97 primary transcripts of miRNAs105 and rRNAs106-107 (Figure 1.6).

Non-Canonical Start Sites: Analysis of initiation codon usage for the putative smORFs reveals that approximately 40% are translated from the canonical AUG start codon; however, a large fraction use non-AUG start sites with or without a strong Kozak sequence,13 consistent with Ribo-Seq findings in mouse that show translation initiation of most ORFs at non-AUG start sites.10 The FRAT2 microprotein, for example, is translated using an ACG start codon but still uses a methionine to initiate translation.13

Subcellular Localization: Transient expression of cloned microproteins has confirmed localization to specific cellular compartments such as the cytoplasm (ASNSD1-MP, PHF19-MP, DNLZ-MP, EIF5-MP, FRAT2-MP, YTHDF3-MP, CCNA2-MP, DRAP1-MP, TRIP6-MP, C7ORF47-MP, and H2AFx-MP),13 nucleus (c11orf98-MP108), mitochondria9 (MKKS-MPs,109 MIEF1-MP,110 DEDD2-MP13), and ER (sarcolamban-MP23, 25).

Amino Acid Composition: To further understand smORF features and indicators of smORF translation, the amino acid composition of translated smORFs was compared to that of canonical long proteins.9 Annotated smORFs showed a lower than random usage of arginine, which is characteristic of translated proteins.111 However, several amino acids that are characteristic of alpha-helices in canonical proteins showed dissimilar usage in smORF translation products, such as enrichment of lysine and phenylalanine and depletion of serine.112 This is consistent with the abundance of putative transmembrane alpha-helix motifs in about a third of the translated and predicted smORFs, compared to about 20% in canonical proteins.113 Similar observations have been made in bacteria114 and suggest that smORFs may represent a class of uncharacterized transmembrane peptides.
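A composition comparison of this kind reduces to computing per-residue frequencies in two sequence sets and comparing them. The sketch below is a minimal illustration with invented toy sequences. The function names and the pseudocount are assumptions, and the cited analyses may have used different statistics.

```python
from collections import Counter
from math import log2


def aa_frequencies(sequences):
    """Pooled amino acid frequencies across a collection of protein sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def log2_enrichment(test_seqs, background_seqs, pseudocount=1e-6):
    """log2 ratio of residue frequency in test_seqs over background_seqs.

    Positive values mean the residue is enriched in the test set.
    """
    test = aa_frequencies(test_seqs)
    background = aa_frequencies(background_seqs)
    residues = set(test) | set(background)
    return {aa: log2((test.get(aa, 0.0) + pseudocount) /
                     (background.get(aa, 0.0) + pseudocount))
            for aa in residues}


# Toy data only: lysine is over-represented and serine absent in the test set.
scores = log2_enrichment(["KK"], ["KS"])
```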

1.5 Approach for Functional Characterization

One of the biggest questions in the field remains whether microproteins are functionally important. The examples of biologically functional smORF-encoded microproteins discussed above highlight the importance of their characterization. One approach to characterizing microproteins is based on the identification of microprotein-protein interactions. Several examples in this review highlight that, because microproteins are small, they are likely to function via interactions with other proteins and protein complexes within the cell. Humanin and tal/pri were initially identified, in transcripts previously annotated as non-coding, through functional genetic screens. Functional studies of their molecular mode of action showed that these microproteins operate through interactions with protein complexes. Later, a similar strategy of functional proteomics helped characterize microproteins such as MRI-2, NoBody, and CASIMO1.

Though the traditional FLAG-tagged immunoprecipitation method has been successful in the identification of MPIs, it results in the enrichment of non-specific interactions in the lysate.62 To overcome this, an improved method is required for capturing microprotein interactions in vivo. In the subsequent chapter, I show that application of the engineered ascorbate peroxidase 2 (APEX) tagging method successfully captured endogenous microprotein interactions in living cells with a lower background.108, 115 Overall, these examples demonstrate the ability of this approach to identify novel, hypothesis-generating MPIs and to elucidate biological functions.

1.6 Conclusion and Future Directions

As discussed in this review, the identification of smORFs and their encoded microproteins has advanced rapidly in recent years. Advanced computational and experimental methods have allowed functional studies to highlight their roles in a wide spectrum of biological events, such as transcriptional and translational regulation, essential cellular processes, and disease. The discovery of microproteins such as humanin and CASIMO1 in disease-specific cell lines or patients raises the question of whether there are other disease-specific microproteins. Targeted approaches for identifying microproteins in disease states can potentially improve therapeutics development. Several of the novel smORF-encoded peptides discovered have been found to bind intracellular protein complexes in order to function in a cellular context, suggesting that smORF-encoded peptides provide an unexplored reservoir of protein interactors that regulate a wide range of cellular activities. Here, I have developed methods and characterized a few of the smORF-encoded microproteins.

1.7 Figures

Figure 1.1 uORF-mediated regulation of downstream ORF expression. (A) GCN4 in yeast is translationally regulated by four smORFs upstream of the CDS. Under normal conditions, the ribosome continues to scan after the translation of uORF1 and uORF2 but dissociates after uORF3 or uORF4 translation, downregulating GCN4 translation. In stress conditions, limited availability of the ternary complex (TC) results in delayed translation reinitiation at the GCN4 CDS start codon, thus upregulating GCN4 translation. (B) ATF4, the GCN4 homolog in vertebrates, is also regulated by two upstream smORFs. In normal conditions, translation of the uORFs causes the ribosome to pass the CDS start codon, as uORF2 overlaps with the CDS. In stress conditions, the limited availability of TC results in delayed translation initiation at the CDS start site rather than at uORF2, thereby upregulating ATF4 expression. (C) CHOP translation is inhibited by the uORF in a peptide-dependent manner under normal conditions; under stress conditions, however, the ribosome bypasses the uORF to initiate translation at the CHOP start codon.


Figure 1.2 Biologically active microproteins. (A) Human MRI-2/CYREN (cell cycle regulator of NHEJ) microprotein is a cell-cycle-specific inhibitor of classical NHEJ (cNHEJ) that promotes error-free DNA repair by homologous recombination during S and G2 phases by interacting with the Ku70/80 heterodimer. (B) DWORF (dwarf open reading frame) is a 34-amino acid long microprotein that localizes to the SR membrane and enhances SERCA activity by displacing SERCA inhibitors such as sarcolipin (SLN), which in turn controls muscle contraction. (C) Minion (microprotein inducer of fusion), an 84-amino acid long microprotein, is required for fusion of mononuclear progenitors to form myotubes during muscle development. (D) Humanin (HN) is a 24-amino acid long polypeptide translated from the mitochondrial genome that rescues neuronal cell death. It also exhibits anti-apoptotic activity by interacting with Bax (Bcl2-associated X protein), an apoptosis-inducing protein, to prevent its translocation from the cytosol to mitochondria and the downstream release of cytochrome C and other apoptotic proteins. HN also functions extracellularly by suppressing c-Jun N-terminal kinase (JNK) and ASK/JNK-mediated neuronal cell death.


Figure 1.3 Summary table for smORF-encoded biologically active microproteins. The examples have been categorized based on crucial cellular and physiological roles of microprotein, and, in some cases, involvement in disease. The microprotein length, organism detected in, method of discovery, expression data, molecular mode of action and biological function are highlighted here.


Figure 1.4 Methods for smORF and microprotein discovery. (A) Computational methods of discovery are based on sequence analysis and conservation scores. Scanning for functional domains/motifs and homology helps in functional characterization. (B) Ribosome profiling methods allow deep sequencing of ribosome-protected mRNA fragments, called ribosome footprints, to identify translated regions of the transcriptome in vivo, thus improving smORF discovery platforms. (C) The proteomics approach enables detection of the translation products, thus confirming protein-coding smORFs. The cells/tissues are enriched for the short proteome, followed by trypsin digestion for detection via mass spectrometry. The detected peptides are searched against a custom RNA-Seq database for discovery of novel smORF-encoded microproteins.


Figure 1.5 Web-Based integrated tools for smORF discovery and characterization. The online databases have been specifically designed to identify and provide functional evidence for smORFs/ microproteins across species. The resource weblink, method of ORF detection, coding potential evidence used, conservation and features used for predicting function are highlighted here.


Figure 1.6 Position of smORFs within RNAs. smORFs with coding potential exist upstream of mRNA CDSs (uORFs), within or overlapping the mRNA CDS out-of-frame (CDS-ORFs), downstream of mRNA CDSs (dORFs), in the introns of pre-mRNAs, on long non-coding RNAs (lncRNAs) and on ribosomal RNAs (rRNAs).

1.8 Acknowledgements

Chapter 1, in part, is currently being prepared for submission for publication of the material. The dissertation author was the primary investigator and author of this material. The other author is Alan Saghatelian.

1.9 References

1. Goffeau, A.; Barrell, B. G.; Bussey, H.; Davis, R. W.; Dujon, B.; Feldmann, H.; Galibert, F.; Hoheisel, J. D.; Jacq, C.; Johnston, M.; Louis, E. J.; Mewes, H. W.; Murakami, Y.; Philippsen, P.; Tettelin, H.; Oliver, S. G., Life with 6000 Genes. Science 1996, 274 (5287), 546-567.

2. Venter, J. C.; Smith, H. O.; Adams, M. D., The Sequence of the . Clinical Chemistry 2015, 61 (9), 1207-1208.

3. Mouse Genome Sequencing, C., Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420, 520.

4. Basrai, M. A.; Hieter, P.; Boeke, J. D., Small Open Reading Frames: Beautiful Needles in the Haystack. Genome Research 1997, 7 (8), 768-771.

5. Ladoukakis, E.; Pereira, V.; Magny, E. G.; Eyre-Walker, A.; Couso, J. P., Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology 2011, 12 (11), R118-R118.

6. Vanderperre, B.; Lucier, J. F.; Bissonnette, C.; Motard, J.; Tremblay, G.; Vanderperre, S.; Wisztorski, M.; Salzet, M.; Boisvert, F. M.; Roucou, X., Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS One 2013, 8 (8), e70698.

7. Mackowiak, S. D.; Zauber, H.; Bielow, C.; Thiel, D.; Kutz, K.; Calviello, L.; Mastrobuoni, G.; Rajewsky, N.; Kempa, S.; Selbach, M.; Obermayer, B., Extensive identification and analysis of conserved small ORFs in animals. Genome Biology 2015, 16 (1), 179.

8. Crappé, J.; Van Criekinge, W.; Trooskens, G.; Hayakawa, E.; Luyten, W.; Baggerman, G.; Menschaert, G., Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics 2013, 14, 648-648.

9. Aspden, J. L.; Eyre-Walker, Y. C.; Phillips, R. J.; Amin, U.; Mumtaz, M. A. S.; Brocard, M.; Couso, J.-P., Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. eLife 2014, 3, e03528.

10. Ingolia, N. T.; Lareau, L. F.; Weissman, J. S., Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity of Mammalian Proteomes. Cell 2011, 147 (4), 789-802.

11. Smith, Jenna E.; Alvarez-Dominguez, Juan R.; Kline, N.; Huynh, Nathan J.; Geisler, S.; Hu, W.; Coller, J.; Baker, Kristian E., Translation of Small Open Reading Frames within Unannotated RNA Transcripts in Saccharomyces cerevisiae. Cell Reports 2014, 7 (6), 1858-1866.

12. Bazzini, A. A.; Johnstone, T. G.; Christiano, R.; Mackowiak, S. D.; Obermayer, B.; Fleming, E. S.; Vejnar, C. E.; Lee, M. T.; Rajewsky, N.; Walther, T. C.; Giraldez, A. J., Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. The EMBO Journal 2014, 33 (9), 981-993.

13. Slavoff, S. A.; Mitchell, A. J.; Schwaid, A. G.; Cabili, M. N.; Ma, J.; Levin, J. Z.; Karger, A. D.; Budnik, B. A.; Rinn, J. L.; Saghatelian, A., Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat Chem Biol 2013, 9 (1), 59-64.

14. Andrews, S. J.; Rothnagel, J. A., Emerging evidence for functional peptides encoded by short open reading frames. Nature Reviews Genetics 2014, 15, 193.

15. Mumtaz, M. Ali S.; Couso, Juan P., Ribosomal profiling adds new coding sequences to the proteome. Biochemical Society Transactions 2015, 43 (6), 1271-1276.

16. Kastenmayer, J. P.; Ni, L.; Chu, A.; Kitchen, L. E.; Au, W.-C.; Yang, H.; Carter, C. D.; Wheeler, D.; Davis, R. W.; Boeke, J. D.; Snyder, M. A.; Basrai, M. A., Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Research 2006, 16 (3), 365-373.

17. Kessler, M. M.; Zeng, Q.; Hogan, S.; Cook, R.; Morales, A. J.; Cottarel, G., Systematic discovery of new genes in the Saccharomyces cerevisiae genome. Genome Res 2003, 13.

18. Cheng, H.; Chan, W. S.; Li, Z.; Wang, D.; Liu, S.; Zhou, Y., Small Open Reading Frames: Current Prediction Techniques and Future Prospect. Current protein & peptide science 2011, 12 (6), 503-507.

19. Hanada, K.; Zhang, X.; Borevitz, J. O.; Li, W. H.; Shiu, S. H., A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res 2007, 17.

20. Frith, M. C.; Forrest, A. R.; Nourbakhsh, E.; Pang, K. C.; Kai, C.; Kawai, J.; Carninci, P.; Hayashizaki, Y.; Bailey, T. L.; Grimmond, S. M., The Abundance of Short Proteins in the Mammalian Proteome. PLoS Genetics 2006, 2 (4), e52.

21. Pueyo, J. I.; Magny, E. G.; Couso, J. P., New Peptides Under the s(ORF)ace of the Genome. Trends in Biochemical Sciences 2016, 41 (8), 665-678.

22. Saghatelian, A.; Couso, J. P., Discovery and characterization of smORF-encoded bioactive polypeptides. Nat Chem Biol 2015, 11 (12), 909-916.

23. Magny, E. G.; Pueyo, J. I.; Pearl, F. M. G.; Cespedes, M. A.; Niven, J. E.; Bishop, S. A.; Couso, J. P., Conserved Regulation of Cardiac Calcium Uptake by Peptides Encoded in Small Open Reading Frames. Science 2013, 341 (6150), 1116-1120.

24. Anderson, D. M.; Makarewich, C. A.; Anderson, K. M.; Shelton, J. M.; Bezprozvannaya, S.; Bassel-Duby, R.; Olson, E. N., Widespread control of calcium signaling by a family of SERCA- inhibiting . Science signaling 2016, 9 (457), ra119-ra119.

25. Anderson, D. M.; Anderson, K. M.; Chang, C.-L.; Makarewich, C. A.; Nelson, B. R.; McAnally, J. R.; Kasaragod, P.; Shelton, J. M.; Liou, J.; Bassel-Duby, R.; Olson, E. N., A Micropeptide Encoded by a Putative Long Non-coding RNA Regulates Muscle Performance. Cell 2015, 160 (4), 595-606.

26. Nelson, B. R.; Makarewich, C. A.; Anderson, D. M.; Winders, B. R.; Troupes, C. D.; Wu, F.; Reese, A. L.; McAnally, J. R.; Chen, X.; Kavalali, E. T.; Cannon, S. C.; Houser, S. R.; Bassel-Duby, R.; Olson, E. N., A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science (New York, N.Y.) 2016, 351 (6270), 271-275.

27. Andreev, D. E.; O'Connor, P. B. F.; Fahey, C.; Kenny, E. M.; Terenin, I. M.; Dmitriev, S. E.; Cormican, P.; Morris, D. W.; Shatsky, I. N.; Baranov, P. V., Translation of 5′ leaders is pervasive in genes resistant to eIF2 repression. eLife 2015, 4, e03971.

28. Chew, G.-L.; Pauli, A.; Schier, A. F., Conservation of uORF repressiveness and sequence features in mouse, human and zebrafish. Nature Communications 2016, 7, 11663.

29. von Arnim, A. G.; Jia, Q.; Vaughn, J. N., Regulation of plant translation by upstream open reading frames. Plant Science 2014, 214, 1-12.

30. Duncan, C. D. S.; Mata, J., The translational landscape of fission-yeast meiosis and sporulation. Nature Structural & Molecular Biology 2014, 21, 641.

31. Calvo, S. E.; Pagliarini, D. J.; Mootha, V. K., Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proceedings of the National Academy of Sciences of the United States of America 2009, 106 (18), 7507-7512.

32. Cabrera-Quio, L. E.; Herberg, S.; Pauli, A., Decoding sORF translation – from small proteins to gene regulation. RNA Biology 2016, 13 (11), 1051-1059.

33. Ye, Y.; Liang, Y.; Yu, Q.; Hu, L.; Li, H.; Zhang, Z.; Xu, X., Analysis of human upstream open reading frames and impact on gene expression. Human Genetics 2015, 134 (6), 605-612.

34. Young, S. K.; Palam, L. R.; Wu, C.; Sachs, M. S.; Wek, R. C., Ribosome Elongation Stall Directs Gene-Specific Translation In the Integrated Stress Response. Journal of Biological Chemistry 2016.

35. Ingolia, N. T.; Ghaemmaghami, S.; Newman, J. R. S.; Weissman, J. S., Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science 2009, 324 (5924), 218-223.

36. Fritsch, C.; Herrmann, A.; Nothnagel, M.; Szafranski, K.; Huse, K.; Schumann, F.; Schreiber, S.; Platzer, M.; Krawczak, M.; Hampe, J.; Brosch, M., Genome-wide search for novel human uORFs and N-terminal protein extensions using ribosomal footprinting. Genome Research 2012, 22 (11), 2208-2218.

37. Gao, X.; Wan, J.; Liu, B.; Ma, M.; Shen, B.; Qian, S.-B., Quantitative profiling of initiating ribosomes in vivo. Nature methods 2015, 12 (2), 147-153.

38. Johnstone, T. G.; Bazzini, A. A.; Giraldez, A. J., Upstream ORFs are prevalent translational repressors in vertebrates. The EMBO Journal 2016, 35 (7), 706-723.

39. Kumar, M.; Srinivas, V.; Patankar, S., Upstream AUGs and upstream ORFs can regulate the downstream ORF in Plasmodium falciparum. Malaria Journal 2015, 14 (1), 512.

40. Hinnebusch, A. G., Evidence for translational regulation of the activator of general amino acid control in yeast. Proceedings of the National Academy of Sciences of the United States of America 1984, 81 (20), 6442-6446.

41. Munzarová, V.; Pánek, J.; Gunišová, S.; Dányi, I.; Szamecz, B.; Valášek, L. S., Translation Reinitiation Relies on the Interaction between eIF3a/TIF32 and Progressively Folded cis-Acting mRNA Elements Preceding Short uORFs. PLOS Genetics 2011, 7 (7), e1002137.

42. Gunišová, S.; Valášek, L. S., Fail-safe mechanism of GCN4 translational control—uORF2 promotes reinitiation by analogous mechanism to uORF1 and thus secures its key role in GCN4 expression. Nucleic Acids Research 2014, 42 (9), 5880-5893.

43. Gunišová, S.; Beznosková, P.; Mohammad, M. P.; Vlčková, V.; Valášek, L. S., In-depth analysis of cis-determinants that either promote or inhibit reinitiation on GCN4 mRNA after translation of its four short uORFs. RNA 2016, 22 (4), 542-558.

44. Mueller, P. P.; Hinnebusch, A. G., Multiple upstream AUG codons mediate translational control of GCN4. Cell 1986, 45 (2), 201-207.

45. Hinnebusch, A. G., TRANSLATIONAL REGULATION OF GCN4 AND THE GENERAL AMINO ACID CONTROL OF YEAST. Annual Review of Microbiology 2005, 59 (1), 407-450.

46. Lu, P. D.; Harding, H. P.; Ron, D., Translation reinitiation at alternative open reading frames regulates gene expression in an integrated stress response. The Journal of Cell Biology 2004, 167 (1), 27-33.

47. Vattem, K. M.; Wek, R. C., Reinitiation involving upstream ORFs regulates ATF4 mRNA translation in mammalian cells. Proceedings of the National Academy of Sciences of the United States of America 2004, 101 (31), 11269-11274.

48. Jousse, C.; Bruhat, A.; Carraro, V.; Urano, F.; Ferrara, M.; Ron, D.; Fafournoux, P., Inhibition of CHOP translation by a peptide encoded by an open reading frame localized in the chop 5′UTR. Nucleic Acids Research 2001, 29 (21), 4341-4351.

49. Ito, K.; Chiba, S., Arrest Peptides: Cis-Acting Modulators of Translation. Annual Review of Biochemistry 2013, 82 (1), 171-202.

50. Samandi, S.; Roy, A. V.; Delcourt, V.; Lucier, J.-F.; Gagnon, J.; Beaudoin, M. C.; Vanderperre, B.; Breton, M.-A.; Motard, J.; Jacques, J.-F.; Brunelle, M.; Gagnon-Arsenault, I.; Fournier, I.; Ouangraoua, A.; Hunting, D. J.; Cohen, A. A.; Landry, C. R.; Scott, M. S.; Roucou, X., Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins. eLife 2017, 6, e27860.

51. Galindo, M. I.; Pueyo, J. I.; Fouix, S.; Bishop, S. A.; Couso, J. P., Peptides encoded by short ORFs control development and define a new eukaryotic gene family. PLoS Biol 2007, 5 (5), e106.

52. Kondo, T.; Hashimoto, Y.; Kato, K.; Inagaki, S.; Hayashi, S.; Kageyama, Y., Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA. Nature Cell Biology 2007, 9, 660.

53. Tupy, J. L.; Bailey, A. M.; Dailey, G.; Evans-Holm, M.; Siebel, C. W.; Misra, S.; Celniker, S. E.; Rubin, G. M., Identification of putative noncoding polyadenylated transcripts in Drosophila melanogaster. Proc Natl Acad Sci USA 2005, 102.

54. Kondo, T.; Plaza, S.; Zanet, J.; Benrabah, E.; Valenti, P.; Hashimoto, Y.; Kobayashi, S.; Payre, F.; Kageyama, Y., Small Peptides Switch the Transcriptional Activity of Shavenbaby During Drosophila Embryogenesis. Science 2010, 329 (5989), 336-339.

55. Zanet, J.; Benrabah, E.; Li, T.; Pélissier-Monier, A.; Chanut-Delalande, H.; Ronsin, B.; Bellen, H. J.; Payre, F.; Plaza, S., Pri sORF peptides induce selective proteasome-mediated protein processing. Science 2015, 349 (6254), 1356-1358.

56. Brunner, E.; Ahrens, C. H.; Mohanty, S.; Baetschmann, H.; Loevenich, S.; Potthast, F.; Deutsch, E. W.; Panse, C.; de Lichtenberg, U.; Rinner, O.; Lee, H.; Pedrioli, P. G. A.; Malmstrom, J.; Koehler, K.; Schrimpf, S.; Krijgsveld, J.; Kregenow, F.; Heck, A. J. R.; Hafen, E.; Schlapbach, R.; Aebersold, R., A high-quality catalog of the Drosophila melanogaster proteome. Nature Biotechnology 2007, 25, 576.

57. Pueyo, J. I.; Magny, E. G.; Sampson, C. J.; Amin, U.; Evans, I. R.; Bishop, S. A.; Couso, J. P., Hemotin, a Regulator of Phagocytosis Encoded by a Small ORF and Conserved across Metazoans. PLoS Biology 2016, 14 (3), e1002395.

58. M.L., B.; J., Y.; B.E., R.; C.E., D.; B.A., B. K.; G., V., Functional and structural properties of stannin: Roles in cellular growth, selective toxicity, and mitochondrial responses to injury. Journal of Cellular Biochemistry 2006, 98 (2), 243-250.

59. D'Lima, N. G.; Ma, J.; Winkler, L.; Chu, Q.; Loh, K. H.; Corpuz, E. O.; Budnik, B. A.; Lykke-Andersen, J.; Saghatelian, A.; Slavoff, S. A., A human microprotein that interacts with the mRNA decapping complex. Nat Chem Biol 2017, 13 (2), 174-180.

60. Lejeune, F.; Li, X.; Maquat, L. E., Nonsense-Mediated mRNA Decay in Mammalian Cells Involves Decapping, Deadenylating, and Exonucleolytic Activities. Molecular Cell 2003, 12 (3), 675-687.

61. Li, Y.; Song, M.; Kiledjian, M., Differential utilization of decapping enzymes in mammalian mRNA decay pathways. RNA 2011, 17 (3), 419-428.

62. Slavoff, S. A.; Heo, J.; Budnik, B. A.; Hanakahi, L. A.; Saghatelian, A., A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining. J Biol Chem 2014, 289 (16), 10950-7.

63. Arnoult, N.; Correia, A.; Ma, J.; Merlo, A.; Garcia-Gomez, S.; Maric, M.; Tognetti, M.; Benner, C. W.; Boulton, S. J.; Saghatelian, A.; Karlseder, J., Regulation of DNA repair pathway choice in S and G2 phases by the NHEJ inhibitor CYREN. Nature 2017, 549, 548.

64. Stein, C. S.; Jadiya, P.; Zhang, X.; McLendon, J. M.; Abouassaly, G. M.; Witmer, N. H.; Anderson, E. J.; Elrod, J. W.; Boudreau, R. L., Mitoregulin: A lncRNA-Encoded Microprotein that Supports Mitochondrial Supercomplexes and Respiratory Efficiency. Cell Reports 2018, 23 (13), 3710-3720.e8.

65. Makarewich, C. A.; Baskin, K. K.; Munir, A. Z.; Bezprozvannaya, S.; Sharma, G.; Khemtong, C.; Shah, A. M.; McAnally, J. R.; Malloy, C. R.; Szweda, L. I.; Bassel-Duby, R.; Olson, E. N., MOXI Is a Mitochondrial Micropeptide That Enhances Fatty Acid β-Oxidation. Cell Reports 2018, 23 (13), 3701-3709.

66. Ladoukakis, E.; Pereira, V.; Magny, E. G.; Eyre-Walker, A.; Couso, J. P., Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology 2011, 12 (11), R118.

67. Anderson, Douglas M.; Anderson, Kelly M.; Chang, C.-L.; Makarewich, Catherine A.; Nelson, Benjamin R.; McAnally, John R.; Kasaragod, P.; Shelton, John M.; Liou, J.; Bassel-Duby, R.; Olson, Eric N., A Micropeptide Encoded by a Putative Long Noncoding RNA Regulates Muscle Performance. Cell 2015, 160 (4), 595-606.

68. Nelson, B. R.; Makarewich, C. A.; Anderson, D. M.; Winders, B. R.; Troupes, C. D.; Wu, F.; Reese, A. L.; McAnally, J. R.; Chen, X.; Kavalali, E. T.; Cannon, S. C.; Houser, S. R.; Bassel-Duby, R.; Olson, E. N., A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science 2016, 351 (6270), 271-275.

69. Zhang, Q.; Vashisht, A. A.; O’Rourke, J.; Corbel, S. Y.; Moran, R.; Romero, A.; Miraglia, L.; Zhang, J.; Durrant, E.; Schmedt, C.; Sampath, S. C.; Sampath, S. C., The microprotein Minion controls cell fusion and muscle formation. Nature Communications 2017, 8, 15664.

70. Bae, G.-U.; Gaio, U.; Yang, Y.-J.; Lee, H.-J.; Kang, J.-S.; Krauss, R. S., Regulation of Myoblast Motility and Fusion by the CXCR4-associated Sialomucin, CD164. The Journal of Biological Chemistry 2008, 283 (13), 8301-8309.

71. Hashimoto, Y.; Niikura, T.; Tajima, H.; Yasukawa, T.; Sudo, H.; Ito, Y.; Kita, Y.; Kawasumi, M.; Kouyama, K.; Doyu, M.; Sobue, G.; Koide, T.; Tsuji, S.; Lang, J.; Kurokawa, K.; Nishimoto, I., A rescue factor abolishing neuronal cell death by a wide spectrum of familial Alzheimer's disease genes and Aβ. Proceedings of the National Academy of Sciences of the United States of America 2001, 98 (11), 6336-6341.

72. Guo, B.; Zhai, D.; Cabezas, E.; Welsh, K.; Nouraini, S.; Satterthwait, A. C.; Reed, J. C., Humanin peptide suppresses apoptosis by interfering with Bax activation. Nature 2003, 423 (6938), 456-61.

73. Zapala, B.; Kaczynski, L.; Kiec-Wilk, B.; Staszel, T.; Knapp, A.; Thoresen, G. H.; Wybranska, I.; Dembinska-Kiec, A., Humanins, the neuroprotective and cytoprotective peptides with antiapoptotic and anti-inflammatory properties. Pharmacol Rep 2010, 62 (5), 767-77.

74. Polycarpou-Schwarz, M.; Groß, M.; Mestdagh, P.; Schott, J.; Grund, S. E.; Hildenbrand, C.; Rom, J.; Aulmann, S.; Sinn, H.-P.; Vandesompele, J.; Diederichs, S., The cancer-associated microprotein CASIMO1 controls cell proliferation and interacts with squalene epoxidase modulating lipid droplet formation. Oncogene 2018.

75. Vanderperre, B.; Lucier, J.-F.; Roucou, X., HAltORF: a database of predicted out-of-frame alternative open reading frames in human. Database: The Journal of Biological Databases and Curation 2012, 2012, bas025.

76. Nekrutenko, A.; Makova, K. D.; Li, W. H., The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res 2002, 12.

77. Olexiouk, V.; Crappé, J.; Verbruggen, S.; Verhegen, K.; Martens, L.; Menschaert, G., sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Research 2016, 44 (Database issue), D324-D329.

78. Lin, M. F.; Jungreis, I.; Kellis, M., PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 2011, 27 (13), i275-i282.

79. Ingolia, N. T.; Brar, G. A.; Stern-Ginossar, N.; Harris, M. S.; Talhouarne, G. J. S.; Jackson, S. E.; Wills, M. R.; Weissman, J. S., Ribosome Profiling Reveals Pervasive Translation Outside of Annotated Protein-Coding Genes. Cell reports 2014, 8 (5), 1365-1379.

80. Hao, Y.; Zhang, L.; Niu, Y.; Cai, T.; Luo, J.; He, S.; Zhang, B.; Zhang, D.; Qin, Y.; Yang, F.; Chen, R., SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Briefings in Bioinformatics 2017, bbx005-bbx005.

81. Jones, P.; Binns, D.; Chang, H.-Y.; Fraser, M.; Li, W.; McAnulla, C.; McWilliam, H.; Maslen, J.; Mitchell, A.; Nuka, G.; Pesseat, S.; Quinn, A. F.; Sangrador-Vegas, A.; Scheremetjew, M.; Yong, S.-Y.; Lopez, R.; Hunter, S., InterProScan 5: genome-scale protein function classification. Bioinformatics 2014, 30 (9), 1236-1240.

82. Calviello, L.; Mukherjee, N.; Wyler, E.; Zauber, H.; Hirsekorn, A.; Selbach, M.; Landthaler, M.; Obermayer, B.; Ohler, U., Detecting actively translated open reading frames in ribosome profiling data. Nature Methods 2015, 13, 165.

83. Risk, B. A.; Spitzer, W. J.; Giddings, M. C., Peppy: Proteogenomic Search Software. Journal of proteome research 2013, 12 (6), 3019-3025.

84. Li, H.; Xiao, L.; Zhang, L.; Wu, J.; Wei, B.; Sun, N.; Zhao, Y., FSPP: A Tool for Genome-Wide Prediction of smORF-Encoded Peptides and Their Functions. Frontiers in Genetics 2018, 9, 96.

85. Ingolia, N. T.; Brar, G. A.; Rouskin, S.; McGeachy, A. M.; Weissman, J. S., The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nature protocols 2012, 7 (8), 1534-1550.

86. Chew, G.-L.; Pauli, A.; Rinn, J. L.; Regev, A.; Schier, A. F.; Valen, E., Ribosome profiling reveals resemblance between long non-coding RNAs and 5′ leaders of coding RNAs. Development (Cambridge, England) 2013, 140 (13), 2828-2834.

87. Lee, S.; Liu, B.; Lee, S.; Huang, S.-X.; Shen, B.; Qian, S.-B., Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proceedings of the National Academy of Sciences 2012, 109 (37), E2424-E2432.

88. Castellana, N.; Bafna, V., Proteogenomics to discover the full coding content of genomes: a computational perspective. Journal of proteomics 2010, 73 (11), 2124-2135.

89. Branca, R. M.; Orre, L. M.; Johansson, H. J.; Granholm, V.; Huss, M.; Perez-Bercoff, A.; Forshed, J.; Kall, L.; Lehtio, J., HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nat Methods 2014, 11 (1), 59-62.

90. Oyama, M.; Kozuka-Hata, H.; Suzuki, Y.; Semba, K.; Yamamoto, T.; Sugano, S., Diversity of Translation Start Sites May Define Increased Complexity of the Human Short ORFeome. Molecular & Cellular Proteomics 2007, 6 (6), 1000-1006.

91. Ma, J.; Ward, C. C.; Jungreis, I.; Slavoff, S. A.; Schwaid, A. G.; Neveu, J.; Budnik, B. A.; Kellis, M.; Saghatelian, A., Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res 2014, 13 (3), 1757-65.

92. Carninci, P.; Kasukawa, T.; Katayama, S.; Gough, J.; Frith, M. C.; Maeda, N.; Oyama, R.; Ravasi, T.; Lenhard, B.; Wells, C.; Kodzius, R.; Shimokawa, K.; Bajic, V. B.; Brenner, S. E.; Batalov, S.; Forrest, A. R. R.; Zavolan, M.; Davis, M. J.; Wilming, L. G.; Aidinis, V.; Allen, J. E.; Ambesi-Impiombato, A.; Apweiler, R.; Aturaliya, R. N.; Bailey, T. L.; Bansal, M.; Baxter, L.; Beisel, K. W.; Bersano, T.; Bono, H.; Chalk, A. M.; Chiu, K. P.; Choudhary, V.; Christoffels, A.; Clutterbuck, D. R.; Crowe, M. L.; Dalla, E.; Dalrymple, B. P.; de Bono, B.; Gatta, G. D.; di Bernardo, D.; Down, T.; Engstrom, P.; Fagiolini, M.; Faulkner, G.; Fletcher, C. F.; Fukushima, T.; Furuno, M.; Futaki, S.; Gariboldi, M.; Georgii-Hemming, P.; Gingeras, T. R.; Gojobori, T.; Green, R. E.; Gustincich, S.; Harbers, M.; Hayashi, Y.; Hensch, T. K.; Hirokawa, N.; Hill, D.; Huminiecki, L.; Iacono, M.; Ikeo, K.; Iwama, A.; Ishikawa, T.; Jakt, M.; Kanapin, A.; Katoh, M.; Kawasawa, Y.; Kelso, J.; Kitamura, H.; Kitano, H.; Kollias, G.; Krishnan, S. P. T.; Kruger, A.; Kummerfeld, S. K.; Kurochkin, I. V.; Lareau, L. F.; Lazarevic, D.; Lipovich, L.; Liu, J.; Liuni, S.; McWilliam, S.; Babu, M. M.; Madera, M.; Marchionni, L.; Matsuda, H.; Matsuzawa, S.; Miki, H.; Mignone, F.; Miyake, S.; Morris, K.; Mottagui-Tabar, S.; Mulder, N.; Nakano, N.; Nakauchi, H.; Ng, P.; Nilsson, R.; Nishiguchi, S.; Nishikawa, S.; Nori, F.; Ohara, O.; Okazaki, Y.; Orlando, V.; Pang, K. C.; Pavan, W. J.; Pavesi, G.; Pesole, G.; Petrovsky, N.; Piazza, S.; Reed, J.; Reid, J. F.; Ring, B. Z.; Ringwald, M.; Rost, B.; Ruan, Y.; Salzberg, S. L.; Sandelin, A.; Schneider, C.; Schönbach, C.; Sekiguchi, K.; Semple, C. A. M.; Seno, S.; Sessa, L.; Sheng, Y.; Shibata, Y.; Shimada, H.; Shimada, K.; Silva, D.; Sinclair, B.; Sperling, S.; Stupka, E.; Sugiura, K.; Sultana, R.; Takenaka, Y.; Taki, K.; Tammoja, K.; Tan, S. L.; Tang, S.; Taylor, M. S.; Tegner, J.; Teichmann, S. A.; Ueda, H. R.; van Nimwegen, E.; Verardo, R.; Wei, C. L.; Yagi, K.; Yamanishi, H.; Zabarovsky, E.; Zhu, S.; Zimmer, A.; Hide, W.; Bult, C.; Grimmond, S. M.; Teasdale, R. D.; Liu, E. T.; Brusic, V.; Quackenbush, J.; Wahlestedt, C.; Mattick, J. S.; Hume, D. A.; Kai, C.; Sasaki, D.; Tomaru, Y.; Fukuda, S.; Kanamori-Katayama, M.; Suzuki, M.; Aoki, J.; Arakawa, T.; Iida, J.; Imamura, K.; Itoh, M.; Kato, T.; Kawaji, H.; Kawagashira, N.; Kawashima, T.; Kojima, M.; Kondo, S.; Konno, H.; Nakano, K.; Ninomiya, N.; Nishio, T.; Okada, M.; Plessy, C.; Shibata, K.; Shiraki, T.; Suzuki, S.; Tagami, M.; Waki, K.; Watahiki, A.; Okamura-Oho, Y.; Suzuki, H.; Kawai, J.; Hayashizaki, Y., The Transcriptional Landscape of the Mammalian Genome. Science 2005, 309 (5740), 1559-1563.

93. Cech, Thomas R.; Steitz, Joan A., The Noncoding RNA Revolution—Trashing Old Rules to Forge New Ones. Cell 2014, 157 (1), 77-94.

94. Derrien, T.; Johnson, R.; Bussotti, G.; Tanzer, A.; Djebali, S.; Tilgner, H.; Guernec, G.; Martin, D.; Merkel, A.; Knowles, D. G.; Lagarde, J.; Veeravalli, L.; Ruan, X.; Ruan, Y.; Lassmann, T.; Carninci, P.; Brown, J. B.; Lipovich, L.; Gonzalez, J. M.; Thomas, M.; Davis, C. A.; Shiekhattar, R.; Gingeras, T. R.; Hubbard, T. J.; Notredame, C.; Harrow, J.; Guigó, R., The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Research 2012, 22 (9), 1775-1789.

95. Ota, T.; Suzuki, Y.; Nishikawa, T.; Otsuki, T.; Sugiyama, T.; Irie, R.; Wakamatsu, A.; Hayashi, K.; Sato, H.; Nagai, K.; Kimura, K.; Makita, H.; Sekine, M.; Obayashi, M.; Nishi, T.; Shibahara, T.; Tanaka, T.; Ishii, S.; Yamamoto, J.-i.; Saito, K.; Kawai, Y.; Isono, Y.; Nakamura, Y.; Nagahari, K.; Murakami, K.; Yasuda, T.; Iwayanagi, T.; Wagatsuma, M.; Shiratori, A.; Sudo, H.; Hosoiri, T.; Kaku, Y.; Kodaira, H.; Kondo, H.; Sugawara, M.; Takahashi, M.; Kanda, K.; Yokoi, T.; Furuya, T.; Kikkawa, E.; Omura, Y.; Abe, K.; Kamihara, K.; Katsuta, N.; Sato, K.; Tanikawa, M.; Yamazaki, M.; Ninomiya, K.; Ishibashi, T.; Yamashita, H.; Murakawa, K.; Fujimori, K.; Tanai, H.; Kimata, M.; Watanabe, M.; Hiraoka, S.; Chiba, Y.; Ishida, S.; Ono, Y.; Takiguchi, S.; Watanabe, S.; Yosida, M.; Hotuta, T.; Kusano, J.; Kanehori, K.; Takahashi-Fujii, A.; Hara, H.; Tanase, T.-o.; Nomura, Y.; Togiya, S.; Komai, F.; Hara, R.; Takeuchi, K.; Arita, M.; Imose, N.; Musashino, K.; Yuuki, H.; Oshima, A.; Sasaki, N.; Aotsuka, S.; Yoshikawa, Y.; Matsunawa, H.; Ichihara, T.; Shiohata, N.; Sano, S.; Moriya, S.; Momiyama, H.; Satoh, N.; Takami, S.; Terashima, Y.; Suzuki, O.; Nakagawa, S.; Senoh, A.; Mizoguchi, H.; Goto, Y.; Shimizu, F.; Wakebe, H.; Hishigaki, H.; Watanabe, T.; Sugiyama, A.; Takemoto, M.; Kawakami, B.; Yamazaki, M.; Watanabe, K.; Kumagai, A.; Itakura, S.; Fukuzumi, Y.; Fujimori, Y.; Komiyama, M.; Tashiro, H.; Tanigami, A.; Fujiwara, T.; Ono, T.; Yamada, K.; Fujii, Y.; Ozaki, K.; Hirao, M.; Ohmori, Y.; Kawabata, A.; Hikiji, T.; Kobatake, N.; Inagaki, H.; Ikema, Y.; Okamoto, S.; Okitani, R.; Kawakami, T.; Noguchi, S.; Itoh, T.; Shigeta, K.; Senba, T.; Matsumura, K.; Nakajima, Y.; Mizuno, T.; Morinaga, M.; Sasaki, M.; Togashi, T.; Oyama, M.; Hata, H.; Watanabe, M.; Komatsu, T.; Mizushima-Sugano, J.; Satoh, T.; Shirai, Y.; Takahashi, Y.; Nakagawa, K.; Okumura, K.; Nagase, T.; Nomura, N.; Kikuchi, H.; Masuho, Y.; Yamashita, R.; Nakai, K.; Yada, T.; Nakamura, Y.; Ohara, O.; Isogai, T.; Sugano, S., Complete sequencing and characterization of 21,243 full-length human cDNAs. Nature Genetics 2003, 36, 40.

96. Guttman, M.; Russell, P.; Ingolia, N. T.; Weissman, J. S.; Lander, E. S., Ribosome profiling provides evidence that large non-coding RNAs do not encode proteins. Cell 2013, 154 (1), 240-251.

97. Ji, Z.; Song, R.; Regev, A.; Struhl, K., Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife 2015, 4, e08890.

98. Pauli, A.; Valen, E.; Schier, A. F., Identifying (non-)coding RNAs and small peptides: Challenges and opportunities. BioEssays : news and reviews in molecular, cellular and developmental biology 2015, 37 (1), 103-112.

99. Raj, A.; Wang, S. H.; Shim, H.; Harpak, A.; Li, Y. I.; Engelmann, B.; Stephens, M.; Gilad, Y.; Pritchard, J. K., Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling. eLife 2016, 5, e13328.

100. Dinger, M. E.; Pang, K. C.; Mercer, T. R.; Mattick, J. S., Differentiating Protein-Coding and Noncoding RNA: Challenges and Ambiguities. PLOS Computational Biology 2008, 4 (11), e1000176.

101. Hsu, P. Y.; Calviello, L.; Wu, H.-Y. L.; Li, F.-W.; Rothfels, C. J.; Ohler, U.; Benfey, P. N., Super-resolution ribosome profiling reveals unannotated translation events in Arabidopsis. Proceedings of the National Academy of Sciences 2016, 113 (45), E7126-E7135.

102. Firth, A. E.; Brown, C. M., Detecting overlapping coding sequences in virus genomes. BMC Bioinformatics 2006, 7, 75-75.

103. Apcher, S.; Millot, G.; Daskalogianni, C.; Scherl, A.; Manoury, B.; Fåhraeus, R., Translation of pre-spliced RNAs in the nuclear compartment generates peptides for the MHC class I pathway. Proceedings of the National Academy of Sciences of the United States of America 2013, 110 (44), 17951-17956.

104. Duvallet, E.; Boulpicante, M.; Yamazaki, T.; Daskalogianni, C.; Prado Martins, R.; Baconnais, S.; Manoury, B.; Fahraeus, R.; Apcher, S., Exosome-driven transfer of tumor-associated Pioneer Translation Products (TA-PTPs) for the MHC class I cross-presentation pathway. OncoImmunology 2016, 5 (9), e1198865.

105. Lauressergues, D.; Couzigou, J.-M.; Clemente, H. S.; Martinez, Y.; Dunand, C.; Bécard, G.; Combier, J.-P., Primary transcripts of microRNAs encode regulatory peptides. Nature 2015, 520, 90.

106. Lee, C.; Zeng, J.; Drew, Brian G.; Sallam, T.; Martin-Montalvo, A.; Wan, J.; Kim, S.-J.; Mehta, H.; Hevener, Andrea L.; de Cabo, R.; Cohen, P., The Mitochondrial-Derived Peptide MOTS-c Promotes Metabolic Homeostasis and Reduces Obesity and Insulin Resistance. Cell Metabolism 2015, 21 (3), 443-454.

107. Cobb, L. J.; Lee, C.; Xiao, J.; Yen, K.; Wong, R. G.; Nakamura, H. K.; Mehta, H. H.; Gao, Q.; Ashur, C.; Huffman, D. M.; Wan, J.; Muzumdar, R.; Barzilai, N.; Cohen, P., Naturally occurring mitochondrial-derived peptides are age-dependent regulators of apoptosis, insulin sensitivity, and inflammatory markers. Aging (Albany NY) 2016, 8 (4), 796-808.

108. Chu, Q.; Rathore, A.; Diedrich, J. K.; Donaldson, C. J.; Yates, J. R.; Saghatelian, A., Identification of Microprotein–Protein Interactions via APEX Tagging. Biochemistry 2017, 56 (26), 3299-3306.

109. Akimoto, C.; Sakashita, E.; Kasashima, K.; Kuroiwa, K.; Tominaga, K.; Hamamoto, T.; Endo, H., Translational repression of the McKusick-Kaufman syndrome transcript by unique upstream open reading frames encoding mitochondrial proteins with alternative polyadenylation sites. Biochim Biophys Acta 2013, 1830 (3), 2728-38.

110. Samandi, S.; Roy, A. V.; Delcourt, V.; Lucier, J.-F.; Gagnon, J.; Beaudoin, M. C.; Vanderperre, B.; Breton, M.-A.; Jacques, J.-F.; Brunelle, M.; Motard, J.; Gagnon-Arsenault, I.; Fournier, I.; Ouangraoua, A.; Hunting, D. J.; Cohen, A. A.; Landry, C. R.; Scott, M. S.; Roucou, X., Deep transcriptome annotation suggests that small and large proteins encoded in the same genes often cooperate. bioRxiv 2017.

111. King, J. L.; Jukes, T. H., Non-Darwinian Evolution. Science 1969, 164 (3881), 788-798.

112. Chou, P. Y.; Fasman, G. D., Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 1974, 13 (2), 211-222.

113. Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. L. L., Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology 2001, 305 (3), 567-580.

114. Hemm, M. R.; Paul, B. J.; Schneider, T. D.; Storz, G.; Rudd, K. E., Small membrane proteins found by comparative genomics and ribosome binding site models. Molecular microbiology 2008, 70 (6), 1487-1501.

115. Hung, V.; Udeshi, N. D.; Lam, S. S.; Loh, K. H.; Cox, K. J.; Pedram, K.; Carr, S. A.; Ting, A. Y., Spatially resolved proteomic mapping in living cells with the engineered peroxidase APEX2. Nat. Protocols 2016, 11 (3), 456-475.

Chapter 2:

Identification of Microprotein-Protein Interactions via APEX Tagging

Abstract

Microproteins are peptides and small proteins encoded by small open reading frames (smORFs). Newer technologies have led to the recent discovery of hundreds to thousands of new microproteins. The biological functions of a few microproteins have been elucidated, and these microproteins have fundamental roles in biology ranging from limb development to muscle function, highlighting the value of characterizing these molecules. The identification of microprotein-protein interactions (MPIs) has proven to be a successful approach to the functional characterization of these genes; however, traditional immunoprecipitation methods result in the enrichment of non-specific interactions for microproteins. Here, we test and apply an in situ proximity tagging method that relies on an engineered ascorbate peroxidase (APEX) to elucidate MPIs. The results demonstrate that APEX tagging is superior to traditional immunoprecipitation methods for microproteins. Furthermore, the application of APEX-tagging to an uncharacterized microprotein called C11orf98 revealed that this microprotein interacts with nucleolar proteins nucleophosmin and nucleolin, demonstrating the ability of this approach to identify novel hypothesis-generating MPIs.

2.1 Introduction

Bioactive peptides have essential roles in biology. For example, the hormone insulin, which is secreted by β-cells in the pancreas, regulates blood glucose levels.1-3 Most peptide hormones and neuropeptides share a common biosynthetic pathway that produces the mature peptide after limited proteolysis of a longer precursor protein (i.e. a prepropeptide or propeptide).4-5 Outside of some notable exceptions, such as angiotensin, most bioactive peptides were thought to be produced through this secretory pathway, but this view has begun to change as novel peptides, or microproteins, encoded by small open reading frames (smORFs) are steadily being discovered.6-7

smORFs were missed during genome annotations because they are too short for gene-finding algorithms, or the RNAs that encode these smORFs were not known.8-9 Instead, the detection of smORFs and microproteins has relied on the advent and application of novel genomics and proteomics methods.10-11 So far, hundreds to thousands of smORFs and the corresponding microprotein products have been identified from prokaryotic and eukaryotic genomes.12-14 And biological studies in flies and mammals have demonstrated that at least some of these microproteins have fundamental biological functions in development, metabolism, and muscle function.15-17

Because only a few of the discovered smORFs are characterized, the functional characterization of these microproteins represents a major challenge in the field. A common feature of functional microproteins is that they all seem to partake in protein-protein interactions or, more specifically, microprotein-protein interactions to regulate biology. For example, Magny and colleagues demonstrated that a Drosophila microprotein called sarcolamban (Scl) binds to an ER calcium pump, SERCA, regulating pump function and heart muscle contraction.17 Mammalian homologs of Scl that inhibit or activate SERCA have also been identified and provide new insights into musculoskeletal biology.18-20 Other examples, such as a smORF that regulates limb development in flies16, 21-23 and a smORF called NoBody that regulates mRNA decapping,24 also operate through microprotein-protein interactions. Consequently, the elucidation of the proteins and protein complexes that associate with microproteins can be used to characterize the functions of microproteins.

Immunoprecipitation of FLAG-tagged microproteins provides a general approach for revealing microprotein-associated proteins. These proteins could be direct interactors of the microprotein or parts of a larger complex that contains the microprotein. For example, the 69-amino acid microprotein modulator of retroviral infection (MRI) associates with Ku70 and Ku80, two essential proteins that mediate cellular repair of double-stranded DNA breaks (i.e. non-homologous end joining DNA repair).25 The interaction between the MRI microprotein and Ku70/Ku80 suggests that this microprotein is involved in cellular DNA repair, highlighting the utility of defining microprotein-associated proteins as a powerful hypothesis-generating approach.

In addition to Ku70 and Ku80, the immunoprecipitation of the MRI microprotein also enriched housekeeping and heat shock proteins. Imaging studies ruled out cytosolic heat shock proteins as bona fide interactors because the MRI microprotein localizes to the nucleus, where it associates with Ku70 and Ku80.25-26 We believe, however, that MRI may be intrinsically unfolded and that the interaction with heat shock proteins occurs after the cells are lysed during the immunoprecipitation. The identification of many of the same heat shock proteins during the immunoprecipitation of completely unrelated microproteins being studied in our lab (unpublished results) indicates that microproteins might be particularly susceptible to artifacts generated by lysate interactions with heat shock proteins. Therefore, we needed to find a better approach for identifying microprotein-associated proteins and protein complexes to characterize these novel genes.

Ting and colleagues developed an ingenious in situ proximity labeling method using an engineered ascorbate peroxidase,27 and then optimized the stability and activity of this enzyme through evolution to afford APEX228-31 (which we used in this work and refer to as APEX in the text and figures throughout). In their approach, APEX is fused to a protein of interest. Expression of the APEX-fusion protein followed by treatment of cells with hydrogen peroxide (H2O2) in the presence of biotin-phenol covalently labels proteins proximal to the APEX-fusion protein with biotin. In this scheme, the H2O2 fuels the catalytic oxidation of biotin-phenol by APEX to generate a highly reactive biotin-phenoxyl radical. The lifetime of the radical is less than one millisecond (ms), which restricts the labeling radius to 20 nm. These biotinylated proteins can then be enriched and analyzed by mass spectrometry. Since proteins adjacent to the APEX-fusion protein are preferentially biotinylated, the resulting mass spectrometry data provide a readout of the protein environment around the fusion protein (Figure 2.1).
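One way to picture the readout described above is to compare, protein by protein, how often each protein is detected in the APEX-fusion sample versus an APEX-only control. The following is a minimal sketch under stated assumptions, not the analysis used in this work: the function name rank_enrichment, the pseudocount, and the spectral counts are all hypothetical and were invented for illustration.

```python
from math import log2


def rank_enrichment(fusion_counts, control_counts, pseudocount=1.0):
    """Rank proteins by log2(fusion / control) spectral-count ratio.

    Both arguments map protein name -> spectral count. The pseudocount
    (an assumption of this sketch) avoids division by zero for proteins
    detected in only one sample.
    """
    proteins = set(fusion_counts) | set(control_counts)
    scores = {}
    for protein in proteins:
        fusion = fusion_counts.get(protein, 0) + pseudocount
        control = control_counts.get(protein, 0) + pseudocount
        scores[protein] = log2(fusion / control)
    # Highest enrichment first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical spectral counts, invented purely for illustration.
fusion = {"Ku70": 30, "Ku80": 27, "HSP70": 12}
control = {"Ku70": 3, "Ku80": 2, "HSP70": 11}

ranked = rank_enrichment(fusion, control)
for protein, score in ranked:
    print(f"{protein}\t{score:.2f}")
```

In this toy input, proteins labeled mainly in the fusion sample score high, while a protein labeled about equally in both samples scores near zero, which is the behavior expected of a non-specific background protein.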

Using this method, Ting and coworkers comprehensively mapped proteins to distinct intracellular compartments such as the mitochondrial intermembrane space and mitochondrial matrix.27, 32 APEX-fusions have also been used to identify the proteome at junctions between the plasma membrane (PM) and endoplasmic reticulum (ER), demonstrating the generality of this approach.33 We hypothesized that APEX-microprotein fusions could solve the background issues with identifying microprotein-associated proteins and protein complexes since the interactions take place in the context of a living cell, not a lysate. Here, we demonstrate that APEX-microprotein fusions provide a superior approach to identify microprotein-protein interactions by comparing MRI-FLAG immunoprecipitation to an MRI-APEX experiment. Then we demonstrate that APEX-microprotein fusions can be used to identify novel microprotein-interacting proteins and protein complexes using a microprotein from the C11orf98 gene. These results highlight the value of APEX tagging for discovering microprotein-associated proteins.

2.2 Results and Discussion

Live Cell Proximity Labeling Using the MRI-APEX Fusion

We used the MRI microprotein to optimize the elucidation of microprotein-associated proteins. As mentioned, MRI interacts with the Ku70/Ku80 heterodimer and, therefore, our readout would be the addition of biotin to these proteins. Our construct consists of an APEX-myc fusion at the C-terminus of an N-terminal FLAG-tagged MRI microprotein (MRI-APEX), and APEX was used as a control in these experiments (Figure 2.2A). We were concerned that the addition of the much larger APEX protein to MRI might interfere with Ku70/Ku80 binding, so we used the FLAG tag to test whether the complex was intact. Transient transfection of HEK293T cells with MRI-APEX, followed by FLAG immunoprecipitation, enriched Ku70 and Ku80 and confirmed that MRI-APEX still binds the Ku70/Ku80 heterodimer (Figures 2.2B and 2.3).

Next, we tested whether Ku70 and Ku80 are biotinylated after treatment of MRI-APEX expressing HEK293T cells with biotin-phenol and H2O2. After H2O2 treatment, cells were lysed, and biotinylated proteins were enriched from the lysate using streptavidin beads. Western blotting of the biotin pulldown revealed that Ku70 and Ku80 were biotinylated in cells transfected with MRI-APEX but not the APEX control or untransfected samples (Figure 2.2C), indicating successful labeling of known MRI-associated proteins. Furthermore, total cell lysates from MRI-APEX and APEX are biotinylated to the same extent, showing that the increase in Ku70 and Ku80 biotinylation is due to their proximity to MRI and not due to a difference in biotinylation activity between the two samples (Figures 2.4 and 2.5).

To determine whether the APEX approach provided superior data to the FLAG immunoprecipitation, we performed a shotgun proteomics experiment and analyzed the enrichment of Ku70 and Ku80. The APEX conditions provided a dramatically improved fold enrichment for Ku70 and Ku80 (Figure 2.2D). In the FLAG sample, where the pulldown occurs in cellular lysates, we find a ~2-fold enrichment for Ku70 and Ku80 (MRI-FLAG vs. pcDNA3.1 transfected samples), but in situ proximity labeling with MRI-APEX resulted in an 8- to 10-fold increase for the same proteins (MRI-APEX vs. APEX transfected samples). This superior enrichment is due to a significantly lower background of these proteins under the APEX conditions. We also analyzed the enrichment of tubulin (TUBB) and heat shock 70 kDa protein 9 (HSPA9). These two proteins were enriched in our previous studies.25 FLAG pulldown enriched these proteins about 2-fold, but they were not detected (n.d.) in the MRI-APEX samples. The observed enrichment of TUBB and HSPA9 by MRI-FLAG highlights the problem with immunoprecipitations from cellular lysates. The FLAG immunoprecipitation, which disrupts cellular localization due to cell lysis, indicates a protein interaction between MRI and both TUBB and HSPA9. By contrast, the MRI-APEX experiment that takes place in an intact cell shows no enrichment of TUBB or HSPA9. Since the cytosolic TUBB and HSPA9 proteins should not interact with the nuclear MRI microprotein, it follows that APEX provides a more accurate picture of cellular microprotein-protein interactions by avoiding interactions that only occur upon cell lysis. These results demonstrate the benefit of using APEX tagging to identify microprotein-associated proteins while decreasing non-specific interactions and improving fold enrichment for target proteins.
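The fold-enrichment comparison described above can be illustrated with a minimal sketch. The spectral counts below are hypothetical values chosen only to show the arithmetic, and the pseudocount is an assumption of this sketch (it avoids division by zero when a protein is absent from the control); neither is taken from the experiments reported here.

```python
from statistics import mean


def fold_enrichment(bait_counts, control_counts, pseudocount=1.0):
    """Ratio of mean spectral counts, bait sample over control sample.

    The pseudocount is added to both means so that a protein with zero
    counts in the control still yields a finite ratio. It is an assumption
    of this sketch, not a parameter reported in the text.
    """
    return (mean(bait_counts) + pseudocount) / (mean(control_counts) + pseudocount)


# Hypothetical spectral counts from three biological replicates.
bait_replicates = [40, 44, 42]      # e.g. a known interactor in the bait sample
control_replicates = [4, 5, 3]      # the same protein in the control sample

enrichment = fold_enrichment(bait_replicates, control_replicates)
print(f"fold enrichment: {enrichment:.1f}")
```

A lower background in the control sample raises this ratio even when the counts in the bait sample are unchanged, which is the effect attributed to the APEX conditions above.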

APEX Fusions of C11orf98 to Discover Interactors

We sought to apply the APEX labeling approach to identify interaction partners for an uncharacterized microprotein. We chose an uncharacterized 123 amino acid microprotein encoded by the C11orf98 smORF. A tryptic peptide from this microprotein was detected in HEK293T cells by shotgun mass spectrometry (Figure 2.6A). The C11orf98 transcript is identified as a validated protein-coding gene in RefSeq, but this gene remains uncharacterized. The C11orf98 microprotein is also highly conserved between human and mouse, suggesting a potential function (Figure 2.6B).

We expressed an N-terminal APEX-tagged C11orf98 (APEX-C11orf98) or a C-terminal APEX-tagged C11orf98 (C11orf98-APEX) fusion protein in HEK293T cells to characterize this microprotein (Figure 2.7). Cells were then treated with biotin-phenol and H2O2, followed by harvesting of the cells and cell lysis. Biotinylated proteins were isolated from each sample by streptavidin beads and analyzed by proteomics. Unfused APEX was used as the control in these experiments.

Candidate C11orf98-interacting proteins were identified by filtering the proteomics data for proteins with a spectral count greater than 5, a greater than 2-fold increase versus the control sample, and a p-value less than 0.05. This analysis resulted in 112 proteins for APEX-C11orf98 and 137 proteins for C11orf98-APEX, with 99 proteins present in both data sets. The strong overlap between the APEX-C11orf98 and C11orf98-APEX datasets provides additional confidence in the reliability of this data.
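The three filtering criteria and the overlap comparison can be sketched as follows. The thresholds come from the text; the `ProteinResult` record, the function names, and every value in the example lists are hypothetical and are used only to show how the filter behaves.

```python
from dataclasses import dataclass


@dataclass
class ProteinResult:
    gene: str
    spectral_count: float  # spectral count in the APEX-fusion sample
    fold_change: float     # fusion sample versus unfused APEX control
    p_value: float


def passes_filters(result, min_count=5, min_fold=2.0, max_p=0.05):
    """Apply the three criteria: count > 5, fold change > 2, p < 0.05."""
    return (
        result.spectral_count > min_count
        and result.fold_change > min_fold
        and result.p_value < max_p
    )


def candidate_genes(results):
    """Return the set of gene names that pass all three filters."""
    return {r.gene for r in results if passes_filters(r)}


# Hypothetical results for the N- and C-terminal fusions (illustrative only).
n_terminal = [
    ProteinResult("NCL", 160, 80.0, 0.001),
    ProteinResult("NPM1", 70, 7.0, 0.004),
    ProteinResult("TUBB", 12, 1.1, 0.60),    # fails the fold-change filter
    ProteinResult("GENE_X", 3, 9.0, 0.01),   # fails the spectral-count filter
]
c_terminal = [
    ProteinResult("NCL", 150, 78.0, 0.002),
    ProteinResult("NPM1", 64, 6.8, 0.006),
    ProteinResult("GENE_Y", 9, 3.0, 0.03),
]

n_hits = candidate_genes(n_terminal)
c_hits = candidate_genes(c_terminal)
shared = n_hits & c_hits
print(sorted(shared))
```

Applied to the reported numbers, the 99 shared proteins correspond to roughly 88% of the 112 APEX-C11orf98 candidates and roughly 72% of the 137 C11orf98-APEX candidates.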

The majority of proteins enriched by APEX-C11orf98 and C11orf98-APEX are reported to have a nuclear localization according to the Human Protein Atlas (97 out of 112 for APEX-C11orf98 and 115 out of 137 for C11orf98-APEX; Figure 2.8). Two of the most robust C11orf98-associated proteins are nucleolin (NCL) and nucleophosmin (NPM1) (Figure 2.6C). NCL had the highest number of spectral counts in both the APEX-C11orf98 and C11orf98-APEX datasets, and about an 80-fold increase with respect to the control. NPM1, an interaction partner of NCL, was also found amongst the top hits with a strong signal and a ~7-fold increase in the APEX-C11orf98 and C11orf98-APEX datasets. Given the high spectral counts and robust fold changes for NCL and NPM1, and the established interaction between NCL and NPM1,34-35 our data indicate that C11orf98 is associated with an NCL and NPM1 complex. These criteria led us to focus on characterizing the C11orf98/NPM1/NCL interactions, but we cannot completely rule out the possibility that C11orf98 also interacts with other proteins in the list, which may prove to be biologically important.

Validation of C11orf98 Microprotein Interactions with NPM1/NCL

To validate the association between NPM1, NCL, and C11orf98 microprotein, we repeated the APEX-C11orf98 and C11orf98-APEX experiments and performed Western blots using NPM1 and NCL specific antibodies. Consistent with the proteomics data, we observed the enrichment of NPM1 and NCL after streptavidin enrichment of lysates from the APEX-C11orf98 and C11orf98-APEX samples. NPM1 and NCL were not enriched in the control sample with an unfused APEX (Figures 2.9A and 2.10). In addition, immunoprecipitation of a FLAG-tagged C11orf98 microprotein also enriched NPM1 and NCL, which demonstrated that the interactions between C11orf98, NPM1 and NCL are mediated by C11orf98 and are not unique to the C11orf98-APEX fusions (Figures 2.11 and 2.12).

Furthermore, we validated the C11orf98 microprotein association with NPM1 by a reciprocal immunoprecipitation experiment. Cells co-expressed a FLAG-tagged C11orf98 microprotein and HA-tagged NPM1. As a control, cells co-expressed the FLAG-tagged C11orf98 microprotein and an empty vector. Lysates were prepared and immunoprecipitated with an anti-HA antibody against HA-tagged NPM1. Western blot analysis of the eluates using anti-HA and anti-FLAG antibodies revealed that HA-tagged NPM1 enriched FLAG-tagged C11orf98, further supporting an interaction between these proteins (Figures 2.9B and 2.13). Control experiments using a control IgG did not enrich FLAG-tagged C11orf98, which provided additional evidence for an interaction between NPM1 and C11orf98 microprotein (Figures 2.14 and 2.15).

Moreover, Gygi’s and Mann’s groups have each recently published proteome-wide protein interaction databases, and the data are available to be searched.35-36 We downloaded MS raw files for the NPM1 IP from Mann’s data, and for both the NPM1 and NCL IPs from Gygi’s Bioplex database. Reanalyzing this data using the human UniProt proteome appended with the C11orf98 microprotein revealed that C11orf98 is detected with multiple peptide and spectral counts (Figure 2.16). This experiment demonstrates that C11orf98 interacts with NPM1 and NCL even when it is not overexpressed.
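Appending a microprotein sequence to a reference proteome before a database search can be sketched as below. This is only an illustration of the idea: the FASTA header strings, the reference entry, and the placeholder sequence are all hypothetical, and the real C11orf98 sequence is deliberately not reproduced.

```python
def append_fasta_entry(fasta_text, header, sequence, width=60):
    """Return fasta_text with one additional FASTA record appended."""
    # Wrap the sequence into fixed-width lines, as is conventional for FASTA.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    record = "\n".join([f">{header}"] + wrapped) + "\n"
    # Make sure the existing text ends with a newline before appending.
    if fasta_text and not fasta_text.endswith("\n"):
        fasta_text += "\n"
    return fasta_text + record


def count_records(fasta_text):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Hypothetical one-entry "proteome" and a placeholder microprotein sequence.
reference = ">ref|X00000|EXAMPLE_HUMAN hypothetical reference entry\nMKTAYIAKQR\n"
PLACEHOLDER_SEQUENCE = "MPLACEHLDERSEQK"  # not the real C11orf98 sequence

search_database = append_fasta_entry(
    reference, "smORF|C11orf98_microprotein", PLACEHOLDER_SEQUENCE
)
print(count_records(search_database))
```

Without the appended entry, spectra originating from the microprotein cannot be assigned to it, because the search engine only considers sequences present in the database.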

C11orf98 Localizes to the Nucleolus

NPM1 and NCL localize to the nucleolus, an organelle within the nucleus that is the site of ribosome biogenesis and that has newly emerging functions in protein regulation using non-coding RNAs.37-38 The large percentage of nucleolar proteins enriched by C11orf98 suggests that this microprotein is also localized to the nucleolus. Confocal imaging of overexpressed FLAG-tagged C11orf98 in HeLa cells validated this hypothesis by revealing a nucleolar localization for this microprotein (Figure 2.17). Furthermore, the FLAG-tagged C11orf98 microprotein overlaps entirely with the HA-tagged NPM1 in the nucleoli, providing additional evidence that NPM1 and C11orf98 microprotein are likely to interact with each other.

Functional studies have revealed that both the NCL and NPM1 proteins are multifunctional, with roles in ribosome biogenesis, cell cycle and apoptosis, transcriptional regulation, and DNA replication and repair.39-40 As a putative interaction partner of NPM1 and NCL, we hypothesize that C11orf98 microprotein will likely have a role in at least some of these processes. This data highlights the value of MPIs for generating novel hypotheses that can lead to the functional characterization of microproteins.

2.3 Conclusion

Here, we demonstrate the value of APEX for identifying MPIs. Unlike traditional approaches that immunoprecipitate from lysates, APEX captures endogenous interactions in the context of a living cell. This feature is particularly useful for microproteins since we have observed that microproteins undergoing immunoprecipitation enrich many non-specific interactors, which are not found in the APEX experiments. We suspect that microproteins are becoming unstructured during FLAG-based immunoprecipitations leading to more non-specific interactions.

Furthermore, the reduction in background afforded by the APEX technology also leads to a greater fold increase for the MRI binding partners Ku70 and Ku80, which makes it much easier to identify bona fide microprotein interacting partners.

Indeed, the application of APEX tagging to the C11orf98 microprotein led to the discovery that this microprotein interacts with NCL and NPM1, and several other nucleolar proteins. NCL and NPM1 are multifunctional proteins found in the nucleolus where they participate in synthesis and maturation of ribosomes. NCL and NPM1 are also reported to interact with each other.

Mutations and amplifications of NPM1 have been linked to many cancers including acute myelogenous leukemia, though the exact mechanism for this connection is still being worked out.41-42 Thus, the identification of an MPI between C11orf98 microprotein, NCL, and NPM1 is of fundamental and clinical interest and will lead to new testable hypotheses about the function of the C11orf98 microprotein. Furthermore, the apparent improvement in the APEX data supports the application of APEX to the remaining uncharacterized microproteins.

2.4 Figures

Figure 2.1. Schematic illustration of identification of microprotein-associated proteins by APEX tagging. (A) Biotin-phenol is an APEX substrate and the phenol is converted to a phenoxyl radical by APEX upon H2O2 treatment. The highly reactive phenoxyl radical forms covalent bonds with nearby aromatic residues such as tyrosine. (B) APEX tagging can be applied to the identification of microprotein-associated proteins by fusing the microprotein of interest (MPOI) to APEX. Cells expressing the MPOI-APEX fusion protein are pretreated with biotin-phenol followed by the addition of H2O2 to initiate biotin labeling. The hyperreactivity of the biotin-phenoxyl radical results in a short half-life that favors labeling of nearby proteins, and any biotinylated proteins are considered to be near the microprotein. The biotinylated proteins are identified by streptavidin enrichment and proteomics.

Figure 2.2. Identification of MRI microprotein-protein interactions in live cells by APEX tagging. (A) Schematic illustration of APEX and MRI-APEX constructs. (B) Anti-FLAG immunoprecipitation of HEK293T cells expressing APEX or MRI-APEX. Eluted proteins were separated by SDS-PAGE and visualized by Western blotting using indicated antibodies. (WCL = whole cell lysate) (C) Western blot of MRI-APEX labeling proteome indicating that Ku70/Ku80 complex is selectively biotinylated by MRI-APEX. (D) Spectral counts analysis indicated that APEX labeling has an improved fold change and lower background compared to FLAG IP. Error bars represent S.E.M. of three biological replicates.

Figure 2.3. Loading control of APEX and MRI-APEX FLAG immunoprecipitation (Figure 2.2B). Anti-FLAG immunoprecipitation of HEK293T cells expressing APEX or MRI-APEX. Eluted proteins were separated by SDS-PAGE and visualized by Coomassie staining. (WCL = whole cell lysate)

Figure 2.4. Biotinylated proteomes of untransfected, APEX and MRI-APEX transfected cells. HEK293T cells were transfected with FLAG-MRI-APEX or APEX control construct. 24 hours post-transfection, cell culture medium was changed to fresh growth media containing 500 µM biotin-tyramide. After incubation for 30 min, H2O2 was added at a final concentration of 1 mM and treated for 1 min. Cells were then lysed and biotinylated proteins were enriched by streptavidin beads and analyzed by western blotting using anti-biotin antibody, (A) short exposure and (B) long exposure. Biotin-ladder was purchased from Cell Signaling (#7727).

Figure 2.5. Loading control of MRI-APEX biotinylation (Figures 2.2C and 2.4). HEK293T cells were transfected with FLAG-MRI-APEX or APEX control construct. 24 hours post-transfection, cell culture medium was changed to fresh growth media containing 500 µM biotin-tyramide. After incubation for 30 min, H2O2 was added at a final concentration of 1 mM and treated for 1 min. Cells were then lysed and biotinylated proteins were enriched by streptavidin beads and analyzed by Coomassie staining.

Figure 2.6. Identification of C11orf98 microprotein-associated proteins. (A) MS/MS spectrum of the unique C11orf98 tryptic peptide, with detected fragment ions marked in blue (b-ions) and red (y-ions). (B) Multiple sequence alignment of C11orf98 microprotein indicates that it is highly conserved in mammals. (C) Semi-quantitative proteomics by spectral counting revealed that NPM1 and NCL were enriched about 7-fold and 80-fold respectively in both N- and C-term APEX fusion samples compared to APEX control samples. Error bars represent the standard deviation of three biological replicates.

Figure 2.7. Schematic illustration of APEX control and C11orf98-APEX constructs


Figure 2.8. Subcellular localization of enriched biotinylated proteins. The majority of proteins enriched by APEX-C11orf98 and C11orf98-APEX are reported to have a nuclear localization.

Figure 2.9. Validation of C11orf98 microprotein interaction with NPM1 and NCL. (A) Western blot of C11orf98 APEX labeling proteome with anti-NCL, anti-NPM1 and anti-myc tag antibodies. (B) Reciprocal anti-HA immunoprecipitation of NPM1-HA from HEK293T cells co-expressing C11orf98-FLAG, with cells expressing C11orf98-FLAG alone as a control. Eluted proteins were analyzed by Western blotting.

Figure 2.10. Loading control of C11orf98-APEX and APEX-C11orf98 biotinylation (Figure 2.9A). HEK293T cells were transfected with C11orf98-APEX, APEX-C11orf98 or APEX control construct. 24 hours post-transfection, cell culture medium was changed to fresh growth media containing 500 µM biotin-tyramide. After incubation for 30 min, H2O2 was added at a final concentration of 1 mM and treated for 1 min. Cells were then lysed and biotinylated proteins were enriched by streptavidin beads and analyzed by Coomassie staining.

Figure 2.11. Validation of C11orf98 interaction with NPM1 and NCL by FLAG immunoprecipitation. HEK293T cells were transfected with C11orf98-FLAG or pcDNA3.1(+) empty vector. 48 hours post-transfection, cells were harvested, and cell lysates were subjected to FLAG immunoprecipitation by anti-FLAG M2 affinity gel. Bound proteins were eluted by incubating the beads with 3× FLAG peptide for 1 hour and analyzed by western blotting with indicated antibodies.


Figure 2.12. Loading control of C11orf98-FLAG immunoprecipitation (Figure 2.11). Anti-FLAG immunoprecipitation of HEK293T cells expressing C11orf98-FLAG or pcDNA3.1 control vector. Eluted proteins were separated by SDS-PAGE and visualized by Coomassie staining. (WCL = whole cell lysate)

Figure 2.13. Loading control of reciprocal anti-HA immunoprecipitation (Figure 2.9B). Reciprocal anti-HA immunoprecipitation of NPM1-HA from HEK293T cells co-expressing C11orf98-FLAG, with cells expressing C11orf98-FLAG alone as a control. Eluted proteins were analyzed by Coomassie staining.

Figure 2.14. Validation of C11orf98 interaction with NPM1 and NCL by reciprocal immunoprecipitation. HEK293T cells were co-transfected with C11orf98-FLAG and NPM1-HA. 48 hours post-transfection, cell lysates were subjected to immunoprecipitation using anti-HA agarose beads (mouse IgG beads as control). Bound proteins were eluted and separated by SDS-PAGE. NPM1 and C11orf98 proteins were visualized by western blotting using anti-HA and anti-FLAG antibodies.


Figure 2.15. Loading control of reciprocal anti-HA immunoprecipitation (Figure 2.14). HEK293T cells were co-transfected with C11orf98-FLAG and NPM1-HA. 48 hours post-transfection, cell lysates were subjected to immunoprecipitation using anti-HA agarose beads (mouse IgG beads as control). Bound proteins were eluted and separated by SDS-PAGE. NPM1 and C11orf98 proteins were visualized by Coomassie staining.

Figure 2.16. C11orf98 microprotein was identified in public protein interaction databases. Searching IP-MS raw data of NPM1 (from Mann’s dataset and Gygi’s Bioplex) and NCL (from Gygi’s Bioplex) revealed multiple (A) spectral counts and (B) peptide fragments of C11orf98 microprotein.

Figure 2.17. Co-localization of C11orf98 microprotein and NPM1 in cell nucleolus. HeLa cells were transfected with C11orf98-FLAG and NPM1-HA, fixed and stained with anti-FLAG and anti-HA antibodies to visualize C11orf98 and NPM1 proteins. Nuclei were stained with Hoechst. Scale bars, 5 µm.

2.5 Materials and Methods

Biotin-phenol labeling in live cells.

Biotin-phenol labeling in live cells was performed according to previously published protocols.29 Briefly, constructs harboring microprotein-APEX fusion proteins or APEX control were transiently transfected into HEK293T cells using Lipofectamine 2000. 24 hours post-transfection, cell culture medium was changed to fresh growth medium containing 500 μM biotin-tyramide (CDX-B0270, Adipogen). After 30 min incubation at 37°C, H2O2 was added to each plate at a final concentration of 1 mM and the plates were gently agitated for 1 min. Cells were then washed three times with quenching solution (5 mM Trolox, 10 mM sodium azide and 10 mM sodium ascorbate in PBS) and the pellet was collected by centrifugation at 1,000×g for 5 min.

Western blot and proteomic analysis of biotin-phenol labeling.

Cell pellets were lysed on ice for 20 min in RIPA buffer (Thermo #89901) supplemented with Roche complete protease inhibitor cocktail tablet and 1 mM PMSF followed by centrifugation at 20,000×g for 20 min at 4°C to remove cell debris. Cell lysates were added to pre-washed streptavidin agarose resin (Thermo #20359) and rotated at 4°C for 4 hours, then washed three times with TBST + 0.5% SDS (v/v). Bound proteins were eluted with 2× SDS loading buffer and analyzed by western blotting. For proteomics, eluted samples were precipitated with trichloroacetic acid (TCA, MP Biomedicals #196057) overnight at 4°C. Dried pellets were dissolved in 8 M urea, reduced with 5 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP, Thermo #20491) and alkylated with 10 mM iodoacetamide (Sigma I1149). Proteins were then digested overnight at 37°C with trypsin (Promega V5111). The reaction was quenched with formic acid at a final concentration of 5% (v/v). Digested samples were analyzed on a Q Exactive Hybrid Quadrupole-Orbitrap mass spectrometer.

Co-immunoprecipitation.

FLAG-tagged microprotein constructs (or empty pcDNA3.1(+) vector) were transfected into a 10-cm dish of HEK293T cells using Lipofectamine 2000 according to the manufacturer’s protocol. 48 hours post-transfection, cells were harvested and lysed in RIPA buffer (Thermo #89901) supplemented with Roche complete protease inhibitor cocktail tablet and 1 mM PMSF. Cells were lysed on ice for 20 min followed by centrifugation at 20,000×g for 20 min at 4°C to remove cell debris. Cell lysates were added to pre-washed mouse IgG agarose beads (Sigma A0919) and rotated at 4°C for 1 hour. The supernatants were collected and added to pre-washed anti-FLAG M2 Affinity Gel (A2220, Sigma). The suspensions were rotated at 4°C overnight and washed four times with TBST. Bound proteins were eluted with 3× FLAG peptide (F4799, Sigma) at 4°C for 1 hour. The eluents were then separated by SDS-PAGE and analyzed by western blotting using indicated antibodies.

Reciprocal immunoprecipitation.

HEK293T cells were co-transfected with C11orf98-FLAG and NPM1-HA (empty vector as control). Lysates from both samples were incubated with mouse anti-HA agarose beads (A2095, Sigma) to immunoprecipitate HA-tagged NPM1. Alternatively, lysates from HEK293T cells co-expressing C11orf98-FLAG and NPM1-HA were incubated with either mouse IgG beads (A0919, Sigma) or mouse anti-HA agarose beads. After washing three times with TBST, bound proteins were eluted with HA peptide (I2149, Sigma) at 4°C for 1 hour. The eluents were then separated by SDS-PAGE and analyzed by Western blotting using indicated antibodies.

Immunofluorescence and confocal imaging.

HeLa cells were seeded onto a coverslip (12-541-B, Fisher Scientific) in a 6-well plate, which was pre-treated with 50 µg/mL poly-L-lysine (P1399, Sigma). The next day, cells were co-transfected with 1 µg NPM1-HA and 1 µg C11orf98-FLAG using Lipofectamine 2000. 48 hours post-transfection, cells were fixed with 4% paraformaldehyde (#18814, Polysciences, Inc.) and permeabilized with 0.1% saponin (A18820, Alfa Aesar). After incubating with 4% BSA in PBS for 1 hour at room temperature, cells were stained with primary antibodies (rabbit anti-FLAG and mouse anti-HA) at 1:1000 dilution overnight at 4°C. Then cells were washed three times with PBS, followed by incubating with secondary antibodies (goat anti-mouse Alexa Fluor 647 and goat anti-rabbit Alexa Fluor 488, 1:500 in PBS) for 1 hour at room temperature. Nuclei were counterstained with Hoechst 33258 (#94403, Sigma, 1:2000 in PBS). After three PBS washes, the coverslip was mounted with Prolong® Gold Antifade Mountant (P36930, Life Technologies) and submitted for confocal imaging using a Zeiss LSM 710 laser scanning confocal microscope with a 63× oil immersion objective. Images were analyzed with FIJI software.

Mass Spectrometry and Data Analysis.

The digested samples were analyzed on a Q Exactive mass spectrometer (Thermo). The digest was injected directly onto a 30cm, 75µm ID column packed with BEH 1.7µm C18 resin (Waters). Samples were separated at a flow rate of 200nl/min on a nLC 1000 (Thermo). Buffer A and B were 0.1% formic acid in water and acetonitrile, respectively. A gradient of 5-40%B over 110min, an increase to 50%B over 10min, an increase to 90%B over another 10min and held at 90%B for a final 10min of washing was used for 140min total run time. Column was re-equilibrated with 20µl of buffer A prior to the injection of sample. Peptides were eluted directly from the tip of the column and nanosprayed directly into the mass spectrometer by application of 2.5kV voltage at the back of the column. The Q Exactive was operated in a data dependent mode. Full MS44 scans were collected in the Orbitrap at 70K resolution with a mass range of 400 to 1800 m/z and an AGC target of 5e6. The ten most abundant ions per scan were selected for MS/MS analysis with HCD fragmentation of 25NCE, an AGC target of 5e6 and minimum intensity of 4e3. Maximum fill times were set to 60ms and 120ms for MS and MS/MS scans respectively. Quadrupole isolation of 2.0m/z was used, dynamic exclusion was set to 15 sec and unassigned charge states were excluded.
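The gradient program described above can be written as a small table and checked for consistency with the stated 140 min run time. This is a sketch only: the segment durations and endpoints come from the text, while the assumption that each ramp is linear, and the names `GRADIENT`, `total_run_time`, and `percent_b`, belong to this illustration.

```python
# Each segment: (duration in min, starting %B, ending %B), taken from the text.
GRADIENT = [
    (110, 5, 40),   # 5-40 %B over 110 min
    (10, 40, 50),   # increase to 50 %B over 10 min
    (10, 50, 90),   # increase to 90 %B over 10 min
    (10, 90, 90),   # hold at 90 %B for a 10 min wash
]


def total_run_time(segments):
    """Sum of all segment durations in minutes."""
    return sum(duration for duration, _, _ in segments)


def percent_b(t, segments):
    """Percent buffer B at time t (min), assuming linear ramps per segment."""
    if t < 0:
        raise ValueError("time must be non-negative")
    elapsed = 0.0
    for duration, start, end in segments:
        if t <= elapsed + duration:
            fraction = (t - elapsed) / duration
            return start + fraction * (end - start)
        elapsed += duration
    raise ValueError("time exceeds the total run time")


print(total_run_time(GRADIENT))   # total minutes
print(percent_b(55, GRADIENT))    # halfway through the first ramp
```

The four segment durations sum to the 140 min total run time given in the text.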

Protein and peptide identification were done with Integrated Proteomics Pipeline – IP2 (Integrated Proteomics Applications). Tandem mass spectra were extracted from raw files using RawConverter44 and searched with ProLuCID45 against human UniProt database appended with microprotein sequences. The search space included all fully-tryptic and half-tryptic peptide candidates with maximum of two missed cleavages. Carbamidomethylation on cysteine was counted as a static modification. Data was searched with 50 ppm precursor ion tolerance and 50 ppm fragment ion tolerance. Data was filtered to 10 ppm precursor ion tolerance post search. Identified proteins were filtered using DTASelect46 and utilizing a target-decoy database search strategy to control the false discovery rate to 1% at the protein level.
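The mass tolerances and the target-decoy filter can be illustrated with a short sketch. The 50 ppm and 10 ppm tolerances come from the text; the m/z values and hit counts are hypothetical, and the decoy-to-target ratio shown is one common way to estimate the false discovery rate, which may differ from the exact calculation performed by DTASelect.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tolerance_ppm):
    """True if the absolute mass error does not exceed the tolerance."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tolerance_ppm


def decoy_fdr(n_target_hits, n_decoy_hits):
    """Simple target-decoy FDR estimate: decoy hits divided by target hits.

    This ratio is an assumption of the sketch; other estimators exist.
    """
    if n_target_hits == 0:
        return 0.0
    return n_decoy_hits / n_target_hits


# Hypothetical precursor: about 40 ppm from its theoretical m/z.
theoretical = 500.0
observed = 500.02

passes_search = within_tolerance(observed, theoretical, 50)        # search window
passes_post_filter = within_tolerance(observed, theoretical, 10)   # post-search filter

# Hypothetical hit counts after filtering.
estimated_fdr = decoy_fdr(n_target_hits=1000, n_decoy_hits=10)
print(passes_search, passes_post_filter, estimated_fdr)
```

In this example the precursor would be considered during the 50 ppm search but removed by the 10 ppm post-search filter.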

Materials.

Primary antibodies: rabbit anti-myc tag (#2272, Cell Signaling Technology), mouse anti-FLAG (F1804, Sigma), rabbit anti-FLAG (#2368, Cell Signaling Technology), anti-biotin, HRP-linked antibody (#7075, Cell Signaling Technology), rabbit anti-Ku70 (#4104, Cell Signaling Technology) and rabbit anti-Ku80 (#2180, Cell Signaling Technology), mouse anti-HA (H9658, Sigma). Secondary antibodies: goat anti-rabbit IRDye® 800CW (926-32211, LI-COR), goat anti-mouse IRDye® 800CW (926-32210, LI-COR), goat anti-mouse Alexa Fluor 647 (A21235, Life Technologies), goat anti-rabbit Alexa Fluor 488 (A11008, Life Technologies).

Plasmids.

pcDNA3-MRI-1 was generated as previously described.47 pcDNA3-APEX-NES was a gift from Alice Ting (Addgene plasmid #49386). pcDNA3-FLAG-APEX was generated by inserting a stop codon before the NES sequence using the QuikChange II kit (Agilent Technologies). The APEX coding sequence was subcloned into pcDNA3.1(+) vector with an N-terminal myc-tag to generate the pcDNA3.1-myc-APEX construct. The C11orf98 microprotein coding sequence was obtained by PCR amplification of a HEK293T cDNA pool. FLAG-MRI-APEX, C11orf98-APEX-myc and myc-APEX-C11orf98 were amplified by overlap extension PCR and subcloned into pcDNA3.1(+) vector. NPM1 cDNA (BC050628) was obtained from the MGC cDNA library at the Salk Institute, and subcloned into pcDNA3.1(+) vector with a C-terminal HA tag.

2.6 Acknowledgements

Chapter 2, in full, is a reprint with minor modifications of the material as it appears in Identification of Microprotein-Protein Interactions via APEX Tagging, Biochemistry 56, 26, 3299-3306 (2017). The dissertation author was the secondary investigator and author of this paper. Other authors include Qian Chu, Jolene K. Diedrich, Cynthia J. Donaldson, John R. Yates III and Alan Saghatelian.

2.7 References

1. Andrali, Sreenath S.; Sampley, Megan L.; Vanderford, Nathan L.; Özcan, S., Glucose regulation of insulin gene expression in pancreatic β-cells. Biochemical Journal 2008, 415 (1), 1-10.

2. Saltiel, A. R.; Kahn, C. R., Insulin signalling and the regulation of glucose and lipid metabolism. Nature 2001, 414, 799.

3. Wilcox, G., Insulin and Insulin Resistance. Clinical Biochemist Reviews 2005, 26 (2), 19-39.

4. Proteases for Processing Proneuropeptides into Peptide Neurotransmitters and Hormones. Annual Review of Pharmacology and Toxicology 2008, 48 (1), 393-423.

5. Hökfelt, T.; Broberger, C.; Xu, Z.-Q. D.; Sergeyev, V.; Ubink, R.; Diez, M., Neuropeptides — an overview. Neuropharmacology 2000, 39 (8), 1337-1356.

6. Saghatelian, A.; Couso, J. P., Discovery and characterization of smORF-encoded bioactive polypeptides. Nat Chem Biol 2015, 11 (12), 909-916.

7. Andrews, S. J.; Rothnagel, J. A., Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 2014, 15 (3), 193-204.

8. Chu, Q.; Ma, J.; Saghatelian, A., Identification and characterization of sORF-encoded polypeptides. Critical Reviews in Biochemistry and Molecular Biology 2015, 50 (2), 134-141.

9. Trimble, W. L.; Keegan, K. P.; D’Souza, M.; Wilke, A.; Wilkening, J.; Gilbert, J.; Meyer, F., Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinformatics 2012, 13 (1), 183.

10. Aspden, J. L.; Eyre-Walker, Y. C.; Phillips, R. J.; Amin, U.; Mumtaz, M. A. S.; Brocard, M.; Couso, J.-P., Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. eLife 2014, 3, e03528.

11. Slavoff, S. A.; Mitchell, A. J.; Schwaid, A. G.; Cabili, M. N.; Ma, J.; Levin, J. Z.; Karger, A. D.; Budnik, B. A.; Rinn, J. L.; Saghatelian, A., Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat Chem Biol 2013, 9 (1), 59-64.

12. Ladoukakis, E.; Pereira, V.; Magny, E. G.; Eyre-Walker, A.; Couso, J. P., Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology 2011, 12 (11), R118.

13. Vanderperre, B.; Lucier, J. F.; Bissonnette, C.; Motard, J.; Tremblay, G.; Vanderperre, S.; Wisztorski, M.; Salzet, M.; Boisvert, F. M.; Roucou, X., Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS One 2013, 8 (8), e70698.

14. Yagoub, D.; Tay, A. P.; Chen, Z.; Hamey, J. J.; Cai, C.; Chia, S. Z.; Hart-Smith, G.; Wilkins, M. R., Proteogenomic Discovery of a Small, Novel Protein in Yeast Reveals a Strategy for the Detection of Unannotated Short Open Reading Frames. Journal of Proteome Research 2015, 14 (12), 5038-5047.

15. Lee, C.; Zeng, J.; Drew, Brian G.; Sallam, T.; Martin-Montalvo, A.; Wan, J.; Kim, S.-J.; Mehta, H.; Hevener, Andrea L.; de Cabo, R.; Cohen, P., The Mitochondrial-Derived Peptide MOTS-c Promotes Metabolic Homeostasis and Reduces Obesity and Insulin Resistance. Cell Metabolism 2015, 21 (3), 443-454.

16. Kondo, T.; Hashimoto, Y.; Kato, K.; Inagaki, S.; Hayashi, S.; Kageyama, Y., Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA. Nature Cell Biology 2007, 9, 660.

17. Magny, E. G.; Pueyo, J. I.; Pearl, F. M. G.; Cespedes, M. A.; Niven, J. E.; Bishop, S. A.; Couso, J. P., Conserved Regulation of Cardiac Calcium Uptake by Peptides Encoded in Small Open Reading Frames. Science 2013, 341 (6150), 1116-1120.

18. Anderson, Douglas M.; Anderson, Kelly M.; Chang, C.-L.; Makarewich, Catherine A.; Nelson, Benjamin R.; McAnally, John R.; Kasaragod, P.; Shelton, John M.; Liou, J.; Bassel-Duby, R.; Olson, Eric N., A Micropeptide Encoded by a Putative Long Noncoding RNA Regulates Muscle Performance. Cell 2015, 160 (4), 595-606.

19. Nelson, B. R.; Makarewich, C. A.; Anderson, D. M.; Winders, B. R.; Troupes, C. D.; Wu, F.; Reese, A. L.; McAnally, J. R.; Chen, X.; Kavalali, E. T.; Cannon, S. C.; Houser, S. R.; Bassel-Duby, R.; Olson, E. N., A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science 2016, 351 (6270), 271-275.

20. Anderson, D. M.; Makarewich, C. A.; Anderson, K. M.; Shelton, J. M.; Bezprozvannaya, S.; Bassel-Duby, R.; Olson, E. N., Widespread control of calcium signaling by a family of SERCA-inhibiting micropeptides. Science Signaling 2016, 9 (457), ra119.

21. Pueyo, J. I.; Couso, J. P., The 11-aminoacid long Tarsal-less peptides trigger a cell signal in Drosophila leg development. Dev Biol 2008, 324.

22. Galindo, M. I.; Pueyo, J. I.; Fouix, S.; Bishop, S. A.; Couso, J. P., Peptides Encoded by Short ORFs Control Development and Define a New Eukaryotic Gene Family. PLoS Biology 2007, 5 (5), e106.

23. Zanet, J.; Benrabah, E.; Li, T.; Pélissier-Monier, A.; Chanut-Delalande, H.; Ronsin, B.; Bellen, H. J.; Payre, F.; Plaza, S., Pri sORF peptides induce selective proteasome-mediated protein processing. Science 2015, 349 (6254), 1356-1358.

24. D'Lima, N. G.; Ma, J.; Winkler, L.; Chu, Q.; Loh, K. H.; Corpuz, E. O.; Budnik, B. A.; Lykke-Andersen, J.; Saghatelian, A.; Slavoff, S. A., A human microprotein that interacts with the mRNA decapping complex. Nat Chem Biol 2017, 13 (2), 174-180.

25. Slavoff, S. A.; Heo, J.; Budnik, B. A.; Hanakahi, L. A.; Saghatelian, A., A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining. J Biol Chem 2014, 289 (16), 10950-7.

26. Grundy, G. J.; Rulten, S. L.; Arribas-Bosacoma, R.; Davidson, K.; Kozik, Z.; Oliver, A. W.; Pearl, L. H.; Caldecott, K. W., The Ku-binding motif is a conserved module for recruitment and stimulation of non-homologous end-joining proteins. Nature Communications 2016, 7, 11242.

27. Rhee, H.-W.; Zou, P.; Udeshi, N. D.; Martell, J. D.; Mootha, V. K.; Carr, S. A.; Ting, A. Y., Proteomic Mapping of Mitochondria in Living Cells via Spatially Restricted Enzymatic Tagging. Science 2013, 339 (6125), 1328-1331.

28. Hung, V.; Udeshi, N. D.; Lam, S. S.; Loh, K. H.; Cox, K. J.; Pedram, K.; Carr, S. A.; Ting, A. Y., Spatially resolved proteomic mapping in living cells with the engineered peroxidase APEX2. Nat. Protocols 2016, 11 (3), 456-475.

29. Lam, S. S.; Martell, J. D.; Kamer, K. J.; Deerinck, T. J.; Ellisman, M. H.; Mootha, V. K.; Ting, A. Y., Directed evolution of APEX2 for electron microscopy and proximity labeling. Nat Meth 2015, 12 (1), 51-54.

30. Han, S.; Udeshi, N. D.; Deerinck, T. J.; Svinkina, T.; Ellisman, M. H.; Carr, S. A.; Ting, A. Y., Proximity Biotinylation as a Method for Mapping Proteins Associated with mtDNA in Living Cells. Cell chemical biology 2017, 24 (3), 404-414.

31. Loh, K. H.; Stawski, P. S.; Draycott, A. S.; Udeshi, N. D.; Lehrman, E. K.; Wilton, D. K.; Svinkina, T.; Deerinck, T. J.; Ellisman, M. H.; Stevens, B.; Carr, S. A.; Ting, A. Y., Proteomic analysis of unbounded cellular compartments: synaptic clefts. Cell 2016, 166 (5), 1295-1307.e21.

32. Hung, V.; Zou, P.; Rhee, H.-W.; Udeshi, N. D.; Cracan, V.; Svinkina, T.; Carr, S. A.; Mootha, V. K.; Ting, A. Y., Proteomic mapping of the human mitochondrial intermembrane space in live cells via ratiometric APEX tagging. Molecular cell 2014, 55 (2), 332-341.

33. Jing, J.; He, L.; Sun, A.; Quintana, A.; Ding, Y.; Ma, G.; Tan, P.; Liang, X.; Zheng, X.; Chen, L.; Shi, X.; Zhang, S. L.; Zhong, L.; Huang, Y.; Dong, M.-Q.; Walker, C. L.; Hogan, P. G.; Wang, Y.; Zhou, Y., Proteomic mapping of ER–PM junctions identifies STIMATE as a regulator of Ca2+ influx. Nature Cell Biology 2015, 17, 1339.

34. Li, Y.-P.; Busch, R. K.; Valdez, B. C.; Busch, H., C23 Interacts with B23, A Putative Nucleolar-Localization-Signal-Binding Protein. European Journal of Biochemistry 1996, 237 (1), 153-158.

35. Huttlin, Edward L.; Ting, L.; Bruckner, Raphael J.; Gebreab, F.; Gygi, Melanie P.; Szpyt, J.; Tam, S.; Zarraga, G.; Colby, G.; Baltier, K.; Dong, R.; Guarani, V.; Vaites, Laura P.; Ordureau, A.; Rad, R.; Erickson, Brian K.; Wühr, M.; Chick, J.; Zhai, B.; Kolippakkam, D.; Mintseris, J.; Obar, Robert A.; Harris, T.; Artavanis-Tsakonas, S.; Sowa, Mathew E.; De Camilli, P.; Paulo, Joao A.; Harper, J. W.; Gygi, Steven P., The BioPlex Network: A Systematic Exploration of the Human Interactome. Cell 2015, 162 (2), 425-440.

36. Hein, M. Y.; Hubner, N. C.; Poser, I.; Cox, J.; Nagaraj, N.; Toyoda, Y.; Gak, I. A.; Weisswange, I.; Mansfeld, J.; Buchholz, F.; Hyman, A. A.; Mann, M., A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell 2015, 163 (3), 712-723.

37. Boisvert, F.-M.; van Koningsbruggen, S.; Navascues, J.; Lamond, A. I., The multifunctional nucleolus. Nat Rev Mol Cell Biol 2007, 8 (7), 574-585.

38. Matera, A. G.; Terns, R. M.; Terns, M. P., Non-coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nature Reviews Molecular Cell Biology 2007, 8, 209.

39. Colombo, E.; Alcalay, M.; Pelicci, P. G., Nucleophosmin and its complex network: a possible therapeutic target in hematological diseases. Oncogene 2011, 30 (23), 2595-2609.

40. Scott, D. D.; Oeffinger, M., Nucleolin and nucleophosmin: nucleolar proteins with multiple functions in DNA repair. Biochemistry and Cell Biology 2016, 94 (5), 419-432.

41. Grisendi, S.; Mecucci, C.; Falini, B.; Pandolfi, P. P., Nucleophosmin and cancer. Nature Reviews Cancer 2006, 6, 493.

42. Box, J. K.; Paquet, N.; Adams, M. N.; Boucher, D.; Bolderson, E.; O’Byrne, K. J.; Richard, D. J., Nucleophosmin: from structure and function to disease development. BMC Molecular Biology 2016, 17 (1), 19.

Chapter 3:

MIEF1 Microprotein Regulates Mitochondrial Translation

Abstract

Recent technological advances led to the discovery of hundreds to thousands of peptides and small proteins (microproteins) encoded by small open reading frames (smORFs). Characterization of new microproteins demonstrates their roles in fundamental biological processes and highlights the value of discovering and characterizing more microproteins. The elucidation of microprotein-protein interactions (MPIs) is useful for determining the biochemical and cellular roles of microproteins. In this study, we characterize the protein-interaction partners of mitochondrial elongation factor 1 microprotein (MIEF1-MP) using a proximity labeling strategy that relies on APEX2. MIEF1-MP localizes to the mitochondrial matrix, where it interacts with the mitochondrial ribosome (mitoribosome). Functional studies demonstrate that MIEF1-MP regulates mitochondrial translation via its binding to the mitoribosome. Loss of MIEF1-MP decreases the mitochondrial translation rate, while elevated levels of MIEF1-MP increase the translation rate. The identification of MIEF1-MP reveals a new gene involved in this process.

3.1 Introduction

Protein-coding small open reading frames (smORFs) are a group of understudied genes with fundamental roles in biology.1-2 Most smORFs are non-annotated because gene-finding algorithms impose a minimum length requirement for gene assignment.3 About half of all annotated smORFs are upstream of longer open reading frames (ORFs)4 where they function as cis-regulators of protein translation; in most cases, the presence of an upstream ORF (uORF) reduces translation of the downstream ORF.4 However, the discovery of hundreds to thousands of previously non-annotated smORFs and microproteins, which has come about due to advances in computational,5-8 genomic,9-13 and proteomic7, 14-16 technologies, revealed many smORFs that are not uORFs. Evolutionary conservation indicated that some of these smORFs would be translated into functional microproteins, and recent work in the field has validated this thinking.17-21

Functional studies in flies and mammals, for instance, have demonstrated that at least some microproteins play fundamental biological roles in development, metabolism, and muscle function.22-24 Knockout of the tarsal-less/polished rice (Tal/Pri) gene in flies led to pronounced developmental phenotypes.25 Tal/Pri is a polycistronic mRNA with four smORFs that encode three 11-amino acid microproteins and one 32-amino acid microprotein that are required for proper Drosophila embryogenesis.23, 26 Subsequent studies revealed that the Tal/Pri microproteins bind to the Ubr3 ubiquitin ligase in flies and promote the degradation of the Shaven baby (Svb) transcription factor, driving epidermal differentiation.17-18

There are also biologically active microproteins in mammals.19-21, 27-28 For example, myoregulin (MLN) is a conserved microprotein encoded by a skeletal muscle-specific transcript; it interacts with the SERCA calcium pump to impede Ca2+ uptake into the sarcoplasmic reticulum (SR) and thereby controls muscle function. Deletion of MLN in mice enhances Ca2+ handling in skeletal muscle and improves exercise performance in vivo, highlighting the role of MLN as a regulator of skeletal muscle physiology.19 CYREN is a mammalian microprotein that interacts with the Ku70/80 heterodimer to inhibit error-prone DNA repair during the S and G2 phases of the cell cycle.20-21 These examples highlight the importance of studying microproteins to understand critical biological processes.

Here, we focus on the molecular and biochemical characterization of a uORF located in the 5’ UTR of the mitochondrial dynamics protein MID51 (MIEF1) gene. MIEF1 regulates mitochondrial fission,29 and the initial characterization of the MIEF1 uORF, which encodes the MIEF1 microprotein (MIEF1-MP), revealed a mitochondrial microprotein that is also involved in mitochondrial fission.30-31 MIEF1-MP is 70 amino acids long and has been detected by proteomic analysis in several cell lines, including HEK293, HeLa, and THP1 cells, as well as in vivo in intestinal tissue.7, 30, 32 Publicly available ribosome profiling data also confirmed the translation of the MIEF1 uORF in HEK293, HeLa, PC3, and THP1 cells.30, 33 MIEF1-MP is evolutionarily conserved in vertebrates and contains an LYR motif. The LYR motif is found in 11 mammalian proteins, and six of these interact with the mitochondrial respiration machinery.34-38 Others are involved in Fe-S cluster biogenesis (LYRM4),39 insulin signaling (LYRM1),40 complex I phylogenetic profile COPP (LYRM5),38 or have unknown functions.41 Previous work indicates that MIEF1-MP promotes mitochondrial fission using the LYR domain, as LYR mutants were inactive.30-31 One caveat of these functional studies is that they relied on MIEF1-MP overexpression, and in this study, we observe that overexpression results in the detection of non-functional MIEF1-MP-protein interactions.

We utilized a biochemical approach to identify the MIEF1-MP protein interactors using an in vivo proximity labeling method that relies on the engineered APEX2 (APEX) protein.42-45 Using this approach, we found that MIEF1-MP interacts with the mitochondrial ribosome (mitoribosome). The mitochondrial genome encodes 37 genes: 13 protein-coding genes, 22 mitochondrial tRNA genes, and 2 mitochondrial rRNA genes. The mitoribosome translates the mitochondrially encoded mRNAs to produce the 13 subunits of respiratory complexes I, III, IV, and V.46 We measured the levels of newly synthesized mitochondrial-encoded proteins and found that MIEF1-MP is a regulator of mitochondrial translation. Loss of MIEF1-MP decreases the rate of global mitochondrial translation, while elevated levels of MIEF1-MP increase the rate at which new mitochondrial proteins are synthesized. These results reveal MIEF1-MP as a new gene involved in regulating mitochondrial translation.

3.2 Results and Discussion

MIEF1 microprotein is translated from the smORF upstream of the MIEF1 gene

MIEF1 is an outer mitochondrial membrane protein that regulates mitochondrial membrane dynamics. MIEF1 interacts with Drp1 and Mff in a trimeric complex to regulate Mff-induced Drp1 accumulation. Overexpression of MIEF1 sequesters Drp1 at the outer mitochondrial membrane and promotes mitochondrial fusion and elongation, whereas MIEF1 knockdown results in mitochondrial fission and fragmentation.47-48

Recently, a protein-coding smORF that produces the MIEF1 microprotein (MIEF1-MP) was identified in the 5’ UTR of the MIEF1 gene. Currently, there are no antibodies for MIEF1-MP, so we needed to confirm that it is expressed in the HEK293T cells that we were using. Ribosome profiling (Ribo-Seq) is a ribosome footprinting method that identifies translated regions of the transcriptome. Analysis of publicly available HEK293T Ribo-Seq data indicates translation of the MIEF1 smORF (Figure 3.1A), and our proteomics data validated this result through detection of a tryptic peptide, “YTDRDFYFASIR”, from MIEF1-MP (Figure 3.1B). MIEF1-MP is a 70-amino acid microprotein (Figure 3.1B) that is conserved between humans and zebrafish (Figure 3.1C). Previous work indicated a role for MIEF1-MP in mitochondrial fission.30-31 Here, we sought to obtain a molecular understanding of MIEF1-MP biochemistry and use this information to explain or predict MIEF1-MP function.

MIEF1 microprotein localizes to the mitochondrial matrix

Mitochondria are essential organelles found in mammalian and other eukaryotic cells. Mitochondria are cellular powerhouses that produce more than 60% of the ATP in a cell through the mitochondrial oxidative phosphorylation (OXPHOS) pathway. OXPHOS requires five multiprotein complexes in the inner mitochondrial membrane composed of proteins encoded by the nuclear and mitochondrial genomes. In addition to energy production, mitochondria have other roles, such as mediating programmed cell death (apoptosis).49-50 Our goal in this study is to understand the mitochondrial pathways regulated by MIEF1-MP.

MIEF1 and MIEF1-MP localize to the mitochondria,30-31 but the localization of MIEF1-MP within the mitochondria is unknown. Determining the mitochondrial location of MIEF1-MP would help in understanding its function. We used a proteinase K protection assay, which tests protein stability under conditions that selectively permeabilize the mitochondria, to determine whether the protein resides at the outer or inner mitochondrial membrane or in the matrix. Transient expression of C-terminally FLAG-tagged MIEF1-MP (MIEF1-MP-FLAG) resulted in mitochondrial localization (Figure 3.2A).30-31 Mitochondria were isolated from MIEF1-MP-FLAG expressing 293T cells by differential centrifugation.51 The isolated mitochondria were treated under three conditions that differentially regulate proteinase K access to sub-compartments of the mitochondria. Addition of proteinase K to mitochondria in isolation buffer degrades the mitochondrial outer membrane marker TOM20 (Figure 3.2B, lane 2), but leaves TIM50, an inner mitochondrial membrane marker, and HSP60, a mitochondrial matrix marker, unperturbed. Pre-treatment of mitochondria with 2 mM HEPES results in osmotic swelling and selective rupture of the outer mitochondrial membrane, resulting in the degradation of TIM50 (Figure 3.2B, lane 4), and solubilizing mitochondria with Triton X-100 is necessary to proteolyze HSP60 (Figure 3.2B, lane 6). MIEF1-MP-FLAG degradation in the proteinase K assay had an identical pattern to that of HSP60, indicating that MIEF1-MP is a mitochondrial matrix microprotein.
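The readout logic of the protection assay can be sketched in a few lines of Python. This is an illustrative sketch only and not part of the original analysis: the condition labels, the function name `infer_compartment`, and the boolean encoding of each blot lane are assumptions introduced here, and the marker patterns simply restate the results described above.

```python
# Conditions ordered by increasing proteinase K access (labels are illustrative).
CONDITIONS = ("isolation_buffer", "hypotonic_swelling", "triton_x100")

# The first condition under which a protein is degraded suggests where it resides.
COMPARTMENT_BY_FIRST_LOSS = {
    "isolation_buffer": "outer membrane",
    "hypotonic_swelling": "inner membrane",
    "triton_x100": "matrix",
}


def infer_compartment(degraded):
    """Return the compartment implied by a degradation pattern.

    `degraded` maps each condition label to True if the protein band is lost
    after proteinase K treatment under that condition.
    """
    for condition in CONDITIONS:
        if degraded[condition]:
            return COMPARTMENT_BY_FIRST_LOSS[condition]
    return "protease-resistant"


# Patterns restating the marker behavior described in the text.
PATTERNS = {
    "TOM20": dict(isolation_buffer=True, hypotonic_swelling=True, triton_x100=True),
    "TIM50": dict(isolation_buffer=False, hypotonic_swelling=True, triton_x100=True),
    "HSP60": dict(isolation_buffer=False, hypotonic_swelling=False, triton_x100=True),
    "MIEF1-MP-FLAG": dict(isolation_buffer=False, hypotonic_swelling=False, triton_x100=True),
}

for protein, pattern in PATTERNS.items():
    print(protein, "->", infer_compartment(pattern))
```

Because MIEF1-MP-FLAG shares its pattern with HSP60, the sketch assigns both to the matrix, which mirrors the reasoning used to interpret the blot.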

MIEF1 microprotein-protein interactions

MIEF1-MP contains a Complex1_LYR domain41 that is defined by a conserved tripeptide L-Y-R (leucine/tyrosine/arginine) sequence close to its N-terminus and a downstream F (phenylalanine) residue (Figure 3.3A). The human genome contains 11 LYR-containing proteins, and 8 of these are mitochondrial proteins.41 Most LYR proteins are subunits (LYRM3 and LYRM6) or assembly factors (LYRM7, LYRM8, ACN9, and FMC1) of the oxidative phosphorylation (OXPHOS) core complexes in mitochondria. Human LYRMs are linked with diseases, such as insulin resistance (LYRM1),40 muscular hypotonia (LYRM3),52 deficiency of multiple OXPHOS complexes (LYRM4),39 apoptosis in HIV-1 infection (LYRM6),53 encephalopathy and lactic acidosis (LYRM7/MZM1L),35 infantile leukoencephalopathy (LYRM8/succinate dehydrogenase assembly factor (SDHAF) 1),37 and alcohol dependence (acetate non-utilizing protein (ACN) 9/SDHAF3).36

Several of the known LYRMs (LYRM3, LYRM4, LYRM6, and FMC1) interact with ACPM/NDUFAB1, which is part of complex I,34, 41, 54-55 and we sought to determine if MIEF1-MP associates with ACPM/NDUFAB1 via its LYR domain. We carried out an immunoprecipitation (IP) from lysates of cells overexpressing MIEF1-MP-FLAG or transfected with an empty pcDNA vector (mock) using an anti-FLAG antibody and then analyzed the immunoprecipitates by proteomics. We found ACPM/NDUFAB1 enriched in the MIEF1-MP-FLAG overexpressing sample over the mock sample (Figure 3.3B). We mutated the L-Y-R (leucine/tyrosine/arginine) and downstream F (phenylalanine) residues of MIEF1-MP to alanines, repeated the immunoprecipitation-mass spectrometry experiment, and found that the MIEF1-MP:ACPM/NDUFAB1 interaction requires the LYR motif (Figure 3.3B).

This experiment reassured us that MIEF1-MP could partake in predicted protein-protein interactions, but we are always cautious not to overinterpret lysate experiments with structurally unknown microproteins because we might be observing spurious, non-endogenous interactions.

We carried out a proximity labeling experiment using a MIEF1-MP-APEX fusion protein to obtain cellularly relevant interactions, since proximity labeling identifies protein complexes in the context of a living cell.43-45 In this approach, the APEX-fused protein of interest is expressed in cells, followed by treatment with hydrogen peroxide (H2O2) in the presence of biotin-phenol. H2O2 fuels the catalytic oxidation of biotin-phenol by APEX to generate a highly reactive biotin-phenoxyl radical. The radical has a lifetime of <1 ms, which restricts labeling to a 20 nm radius. This results in the covalent biotin labeling of proteins proximal to the APEX fusion protein, which are then enriched and analyzed to identify protein interactors in vivo. The proximity labeling experiments indicate that ACPM/NDUFAB1 interacts with the MIEF1-MP-APEX fusion protein but not MIEF1mut-MP-APEX (LYR and F are all mutated to alanines), consistent with the FLAG immunoprecipitation experiments (Figure 3.3C). Together, the data confirm that overexpressed MIEF1-MP interacts with the ACPM/NDUFAB1 protein via its LYR motif.

ACPM/NDUFAB1 has a role in lipid biosynthesis and respiratory complex I activity,56 and this led us to hypothesize that MIEF1-MP might influence these pathways in the mitochondria. However, we observed no effects of MIEF1-MP on mitochondrial complex I activity (Figure 3.4A) or lipid biosynthesis (Figure 3.4B), suggesting that the MIEF1-MP:ACPM/NDUFAB1 interaction is not functionally relevant. One problem might be that we observed these protein-protein interactions in cells overexpressing MIEF1-MP, where the higher concentrations of MIEF1-MP might promote biologically irrelevant interactions. Even the prior functional characterization of MIEF1-MP relied on overexpression, since knockdown of the MIEF1 mRNA would knock down both MIEF1-MP and MIEF1. This led us to realize that we needed a better strategy to find and validate microprotein interactions, especially when the microprotein has to be overexpressed.

Bioinformatic Reciprocal Immunoprecipitations to Discover Relevant MIEF1-MP Interactions

We searched the proteomics data from the MIEF1-MP-APEX experiments to determine whether we had missed other potentially significant interactors. Our informatics workflow consisted of analyzing the data using Significance Analysis of INTeractome Express (SAINTexpress),57 which uses a predicted distribution of spectral counts for a real protein-protein interaction to filter out false positives, and the Contaminant Repository for Affinity Purification (CRAPome),58 which removes proteins that are repeatedly and non-specifically enriched in protein-interaction studies. Application of SAINTexpress and CRAPome to the MIEF1-MP-APEX dataset resulted in 153 enriched proteins with an average spectral count of greater than or equal to 5. Though this greatly reduces our list of potential interactors, validating 153 protein interactions would be an experimental challenge.
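The final filtering step described above (drop known contaminants, keep proteins with an average spectral count of at least 5) can be illustrated with a short Python sketch. It does not reimplement SAINTexpress scoring or query the CRAPome; the function name `filter_candidates`, the replicate counts, and the contaminant set are hypothetical values chosen for illustration, and only the threshold of 5 comes from the text.

```python
from statistics import mean

MIN_AVG_SPECTRAL_COUNT = 5  # threshold stated in the text


def filter_candidates(replicate_counts, contaminants, min_avg=MIN_AVG_SPECTRAL_COUNT):
    """Keep proteins that are not flagged contaminants and whose mean
    spectral count across replicates is at least `min_avg`."""
    kept = []
    for protein, counts in replicate_counts.items():
        if protein in contaminants:
            continue  # frequently enriched, non-specific background protein
        if mean(counts) >= min_avg:
            kept.append(protein)
    return sorted(kept)


# Hypothetical spectral counts from three replicates (not real data).
example_counts = {
    "PROTEIN_A": [12, 9, 15],
    "PROTEIN_B": [6, 5, 7],
    "BACKGROUND_X": [40, 35, 50],
    "PROTEIN_C": [1, 0, 2],
}
example_contaminants = {"BACKGROUND_X"}

print(filter_candidates(example_counts, example_contaminants))
```

In this toy input, the abundant background protein is removed despite its high counts, and the low-count protein falls below the threshold.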

Reciprocal immunoprecipitations are the gold standard for validating protein-protein interactions. To do this for 153 proteins would require antibodies for each of the proteins and for human MIEF1-MP, which we do not have and which are not commercially available. To solve this problem, we turned to a bioinformatic approach. First, we used the bioinformatics tool STRING59 (Figure 3.5A) to identify which of the 153 proteins are known to interact with each other. The idea here is that the identification of multiple proteins from known complexes increases confidence that MIEF1-MP is involved in a legitimate protein-protein interaction. We removed enriched proteins with two or fewer interactors (singletons or proteins that are not part of complexes containing more than three proteins) and identified five MIEF1-MP-bound protein clusters (Figure 3.5A, color-coded clusters). The five clusters include complexes linked to the mitochondrial ribosome, isocitrate/succinate/malate dehydrogenase, pyruvate dehydrogenase, hydroxyacyl-CoA dehydrogenase, and the electron transport chain.
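The network pruning described above can be sketched as a degree filter followed by a search for connected groups. This is a minimal sketch under stated assumptions: the edge list stands in for interactions exported from STRING, the protein names are placeholders, the function name `clusters_after_degree_filter` is hypothetical, and the filter is applied in a single pass using each protein's number of partners in the full network (the text does not say whether pruning was iterated).

```python
from collections import defaultdict


def clusters_after_degree_filter(edges, min_partners=3):
    """Drop proteins with fewer than `min_partners` interaction partners,
    then return the connected groups formed by the remaining proteins."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # "Two or fewer interactors" are removed, so three or more are kept.
    kept = {p for p, partners in adjacency.items() if len(partners) >= min_partners}

    clusters, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk restricted to kept proteins
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend((adjacency[node] & kept) - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Placeholder network: a four-protein complex, an isolated pair, and a dangling protein.
example_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P2", "P4"), ("P3", "P4"),
    ("P5", "P6"),
    ("P1", "P7"),
]

print(clusters_after_degree_filter(example_edges))
```

Only the densely connected four-protein group survives in this toy example; the isolated pair and the dangling protein are discarded, which is the behavior the filter is meant to capture.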

This analysis reduced the number of likely MIEF1-MP interactors to 100 proteins, but we were still limited by a lack of suitable antibodies to validate these interactions. Instead, we reasoned that we could obtain the information we needed by searching publicly available proteome-wide protein-protein interaction datasets. We decided to search the Bioplex protein interaction network database,60 generated from affinity-purification mass spectrometry data for more than 2500 human proteins in HEK293T cells. A lentiviral library of FLAG-HA-tagged ORFs was constructed and used to infect 293T cells, followed by immunopurification, trypsin digestion, and analysis by LC-MS. Specific interactors for each bait protein were identified and assembled into the Bioplex human interactome network. The resulting database contains 23,744 interactions among 7,668 proteins. This unbiased mapping of interactions also aids in the characterization of unknown or poorly studied proteins by suggesting possible cellular localization, biological process, and molecular function.60 Our strategy is to use this database to determine whether any MIEF1-MP-binding proteins enriched MIEF1-MP when they were used as bait proteins, providing us with reciprocal IP data without the need for any antibodies.

We downloaded the proteomics data, where available, from the Bioplex database for the MIEF1-MP binding partners that served as baits in that experiment. When the Bioplex data were collected and analyzed, MIEF1-MP was not annotated in the human RefSeq or UniProt databases and, therefore, would not appear as a member of any protein complex in the original analysis of the Bioplex data. We searched the downloaded Bioplex data against the human UniProt database with MIEF1-MP included. The analysis of ACPM/NDUFAB1 detected only a single spectral count from MIEF1-MP, whereas the mitoribosome proteins MRPL4, MRPL10, MRPL12, MRPL21, and MRPL39 had many spectral counts for MIEF1-MP (Figure 3.6). Multiple mitoribosome proteins also had high spectral counts in the MIEF1-MP-APEX experiment (Figure 3.5B). Overall, the mitoribosome proteins were the only cluster with substantial evidence of MIEF1-MP, indicating that under endogenous conditions (i.e., no overexpression of MIEF1-MP) MIEF1-MP interacts primarily with the mitoribosome. We validated the MRPL4 and MRPS27 interactions by Western blot (Figure 3.5C and Figure 3.7). Lastly, we analyzed additional mitoribosome proteins in the Bioplex database that were not enriched in our MIEF1-MP-APEX experiment and discovered eight additional mitoribosome proteins that enriched MIEF1-MP (Figure 3.5D and Figure 3.8). We also find that other mitochondrial proteins (COX4I1, Figure 3.5D) do not enrich MIEF1-MP, demonstrating that MIEF1-MP selectively binds to the mitoribosome. The data provide the first evidence that MIEF1-MP is a component of the mitoribosome.
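The "in silico" reciprocal IP reduces to a simple lookup once the bait datasets have been re-searched with MIEF1-MP in the sequence database: for each bait, ask how many spectral counts were assigned to MIEF1-MP. The Python sketch below illustrates that lookup only. The function name `reciprocal_hits`, the `min_count` cutoff, and every number in the example are assumptions made for illustration; they are not values from the Bioplex data.

```python
def reciprocal_hits(prey_counts_by_bait, prey="MIEF1-MP", min_count=2):
    """Return baits whose pulldown contains at least `min_count` spectral
    counts for `prey`, mapped to that count."""
    hits = {}
    for bait, prey_counts in prey_counts_by_bait.items():
        count = prey_counts.get(prey, 0)  # absent prey means zero counts
        if count >= min_count:
            hits[bait] = count
    return hits


# Hypothetical re-searched pulldown data: bait -> {prey: spectral counts}.
example_pulldowns = {
    "BAIT_RIBOSOMAL": {"MIEF1-MP": 9, "OTHER_PREY_1": 20},
    "BAIT_SINGLE_COUNT": {"MIEF1-MP": 1},
    "BAIT_UNRELATED": {"OTHER_PREY_2": 14},
}

print(reciprocal_hits(example_pulldowns))
```

A cutoff above one count is used here so that a bait with a single MIEF1-MP spectral count, as described for ACPM/NDUFAB1, would not be reported as a reciprocal hit; the appropriate cutoff for real data is a judgment call.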

Effect on the Rate of Mitochondrial Translation

The mitoribosome is responsible for the translation of mitochondrial RNAs, and the data indicated a role for MIEF1-MP in this process (Figure 3.9). Although nuclear DNA encodes most mitochondrial proteins, several vital proteins are encoded by mitochondrial DNA (mtDNA) and synthesized by an independent mitochondrial translation system.46 Mitochondrially encoded mRNAs produce 13 subunits of the respiratory complexes: NADH dehydrogenase subunits of complex I (MT-ND1, MT-ND2, MT-ND3, MT-ND4L, MT-ND4, MT-ND5, MT-ND6), cytochrome b of complex III (MT-CYB), cytochrome c oxidase subunits of complex IV (MT-CO1, MT-CO2, MT-CO3), and ATP synthase subunits of complex V (MT-ATP6, MT-ATP8). To determine whether MIEF1-MP modulates mitochondrial translation, we began by measuring the steady-state levels of the mtDNA-encoded proteins by quantitative proteomics in the MIEF1-MP overexpressing cells. We observed a 6-fold increase in MIEF1-MP levels, but no change in the concentrations of mtDNA-encoded proteins (Figure 3.10), indicating that MIEF1-MP does not affect steady-state levels of mitochondrially encoded proteins. Next, we used a mitochondrial translation assay that measures newly synthesized proteins to determine whether MIEF1-MP can influence the translation rate of the mitochondrial proteome, which has been observed for other proteins in this complex.61-62

The newly synthesized mtDNA-encoded proteins were quantified using the BioOrthogonal Non-Canonical Amino Acid Tagging (BONCAT) approach.63 BONCAT introduces a chemical handle into newly synthesized proteins that can be used to enrich and detect the protein(s). BONCAT relies on the introduction of the non-canonical amino acid azidohomoalanine (AHA) in place of methionine into newly synthesized proteins. Labeling with AHA places an azide functional group onto new proteins, and these proteins are subsequently enriched, detected, and quantified by clicking on a fluorophore or biotin.63 We overexpressed or knocked down MIEF1-MP in HEK293T cells and then incubated these cells with AHA to allow protein synthesis and AHA incorporation. These experiments were carried out in the presence of emetine, a cytosolic translation inhibitor, so that we could focus solely on the products of mitochondrial translation. After incubation, the cells were lysed, and click chemistry was used to append an alkyne-bearing dye to newly synthesized mitochondrial proteins.

SDS-PAGE separation of the labeled mitochondrial translation products revealed the role of MIEF1-MP in mitochondrial translation. Comparison of MIEF1-MP overexpressing samples to control samples revealed increased band densities for most proteins, particularly MT-ND4, MT-CYB, MT-CO2/MT-ATP6, and MT-ND3, indicating that MIEF1-MP promotes mitochondrial translation (Figure 3.11A and Figure 3.12A). By contrast, knockdown of MIEF1-MP led to lower band densities for most proteins, including MT-CO1, MT-ND4, MT-CYB, MT-ND2, MT-CO3, MT-CO2/MT-ATP6, MT-ND6, and MT-ND3, indicating decreased mitochondrial translation (Figure 3.11B and Figure 3.12B). Quantitative analysis of the mtDNA-encoded protein band densities revealed a robust increase (54%) in protein levels upon MIEF1-MP overexpression, and a significant decrease (32%) in cells lacking MIEF1-MP (Figure 3.11C). These numbers compare favorably to other mitoribosome components such as MRPL14 and DAP3 that modulate nascent mitochondrial translation by ~40-50%.61-62 Thus, MIEF1-MP is a biologically active component of the mitochondrial translation pathway through its interaction with the mitoribosome.
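The percentages quoted above are relative changes in band density with respect to the control lane. A minimal Python sketch of that arithmetic follows; the function name `percent_change` and the normalized densities in the example are illustrative assumptions (the densitometry values themselves are not given in the text, and whether the reported figures were computed per band or on summed densities is not stated).

```python
def percent_change(treated, control):
    """Relative change of `treated` versus `control`, in percent."""
    if control == 0:
        raise ValueError("control density must be non-zero")
    return 100.0 * (treated - control) / control


# Illustrative densities normalized so that the control lane equals 1.0.
control_density = 1.00
overexpression_density = 1.54
knockdown_density = 0.68

print(round(percent_change(overexpression_density, control_density), 1))
print(round(percent_change(knockdown_density, control_density), 1))
```

With these example inputs, the overexpression lane corresponds to a 54% increase and the knockdown lane to a 32% decrease relative to control.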

3.3 Conclusion

Here, we further characterize MIEF1-MP, a 70-amino acid microprotein encoded upstream of the MIEF1 gene. Previous success in using protein-protein interactions to discover micropeptide functions prompted us to use a similar approach in this study. MIEF1-MP enriched many proteins, but functional follow-up of the MIEF1-MP and ACPM/NDUFAB1 interaction suggested that this interaction is not functionally relevant. We reasoned that MIEF1-MP overexpression led to the appearance of non-functional interactions, such as that with ACPM/NDUFAB1, and that we therefore needed a better approach for validating MIEF1-MP interaction partners. We settled on a two-pronged computational approach: first, we looked for MIEF1-MP interaction partners that are already known to interact with each other, which reduced the total number of proteins; second, we performed an “in silico” reciprocal IP that utilized Bioplex data to identify proteins that could enrich MIEF1-MP from HEK293T cells. This analysis revealed proteins that are part of the mitoribosome as the most relevant MIEF1-MP interaction partners, which we validated experimentally using MRPL4 and MRPS27 antibodies (Figure 3.5C). We tested the hypothesis that MIEF1-MP might be a functional component of the mitoribosome and obtained data from overexpression and knockdown studies that indicate that MIEF1-MP can regulate mitochondrial translation.

Mitochondrial protein synthesis is an essential process in mammals responsible for producing key components of the oxidative phosphorylation (OXPHOS) complexes. The complete molecular details of this deceptively simple process remain unclear. Problems in mitochondrial protein synthesis are a leading cause of OXPHOS dysfunction, resulting in metabolic and developmental disorders. The heart and brain are particularly sensitive to defects in mtDNA-encoded protein synthesis during late embryonic or early postnatal development, due to massive mitochondrial biogenesis at that stage. Impaired mitochondrial translation causes severe pediatric cardiomyopathy and brain disease with OXPHOS abnormalities.64 The identification of MIEF1-MP reveals a new gene involved in regulating mitochondrial translation, thus highlighting the need for further investigation and the value of characterizing microproteins.

3.4 Figures

Figure 3.1. MIEF1-MP expression and conservation. (A) MIEF1-MP is encoded by an upstream smORF (uORF) in the MIEF1 mRNA. RNA-Seq (magenta) and Ribo-Seq (green) data from HEK293T cells indicate translation of the MIEF1-MP smORF and the MIEF1 ORF. (B) A tryptic peptide from the MIEF1 microprotein detected by shotgun mass spectrometry validates that MIEF1-MP is a stable member of the proteome. (C) Sequence alignment of MIEF1-MP demonstrates that this microprotein is conserved in numerous species at the protein level.

Figure 3.2. MIEF1-MP is a mitochondrial matrix microprotein. (A) MIEF1-MP-FLAG colocalizes with the mitochondrial marker TOM20 in HeLa cells, verifying its mitochondrial localization (DAPI nuclear stain (blue), TOM20 (red), MIEF1-MP (green), overlap (yellow/orange)). (B) Proteinase K digestion of MIEF1-MP-FLAG had a similar pattern to HSP60, a mitochondrial matrix marker, confirming its sub-cellular localization to the mitochondrial matrix in HEK293T cells.

Figure 3.3. MIEF1-MP contains an LYR motif and interacts with ACPM/NDUFAB1. (A) A LOGO plot of MIEF1-MP with known LYR motif-containing proteins highlights the conserved tripeptide L-Y-R (leucine/tyrosine/arginine) and downstream F (phenylalanine) (LYR and F) motif. (B) Immunoprecipitation (IP) of MIEF1-MP-FLAG enriches ACPM/NDUFAB1, while mutating the LYR and F to alanines abrogates this binding. (C) Proximity labeling with MIEF1-MP-APEX demonstrates ACPM/NDUFAB1 enrichment in living cells.

Figure 3.4. Effect of MIEF1-MP on ACPM/NDUFAB1-related cellular functions. (A) Mitochondrial Complex I activity (in mOD/min, measured using the Abcam ab109721 assay kit) for HEK293T cells with MIEF1-MP overexpression or knockdown. Bovine heart mitochondrial extract was used as a positive control. (B) Fold change in steady-state lipid levels in HEK293T cells after overexpression or knockdown of MIEF1-MP (all indicated lipid families show a significant fold change, p-value < 0.05 using t-test).

Figure 3.5. MIEF1-MP interacts with the mitoribosome. (A) The interaction network of the proteins enriched in the MIEF1-MP-APEX experiment after STRING analysis reveals MIEF1-MP enriched proteins that interact with each other. (B) Spectral counts for different members of the mitoribosome after the MIEF1-MP-APEX experiment. (C) Western blot validation of MIEF1-MP protein-protein interaction with some members of the mitoribosome (MRPL4 and MRPL27) by proximity labeling. The experiment used an MIEF1-MP-APEX-MYC expression construct for the proximity labeling, and expression of MIEF1-MP-APEX-MYC was confirmed by western blot using anti-MYC antibodies. (D) An “in silico” reverse IP searching for MIEF1-MP peptides in mitoribosome protein data from the Bioplex database. The bar graph shows the number of spectral counts for MIEF1-MP peptides in the pulldown data for each of the mitoribosome proteins on the x-axis from the Bioplex database. In addition to the mitoribosome proteins enriched in the proximity labeling experiment (grey), additional mitoribosome proteins that were able to enrich MIEF1-MP were identified (blue).

Figure 3.6. Spectral counts of MIEF1-MP for different baits in the Bioplex database (colors correspond to clusters in Figure 3.5A).

Figure 3.7. Constructs and controls used for the MIEF1-MP-APEX experiment in Figure 3.5C. (A) Schematic illustration of the APEX control and MIEF1-MP-APEX constructs. Total lysates and the biotinylated proteomes (enriched by streptavidin beads) of the APEX and MIEF-APEX transfected HEK293T cells were analyzed for biotinylation (by western blotting using anti-biotin antibody) and loading control (using silver staining) with (B) low exposure and (C) high exposure.

Figure 3.8. Identification of additional MIEF1-MP–mitoribosome protein interactors. The table indicates the number of MIEF1-MP spectral counts identified in the raw Affinity Purification-Mass Spectrometry (AP-MS) data for all the mitoribosome proteins available in the Bioplex database, allowing identification of new mitoribosome interactors (highlighted in blue) in addition to the mitoribosome interactors identified by MIEF1-MP-APEX PD and proteomic analysis (grey).

Figure 3.9. Analysis of the MIEF1-MP STRING interaction network (Figure 3.5A). The interaction with the mitoribosome suggested a role for the MIEF1 microprotein in regulating mitochondrial translation. Proteins highlighted in red have biological functions related to mitochondrial translation.

Figure 3.10. Effect of MIEF1-MP on the steady-state levels of mtDNA-encoded proteins. Total levels of mtDNA-encoded proteins were quantified using TMT labeling and proteomic analysis with MIEF1 microprotein overexpression in HEK293T cells. Fold change of protein levels in MIEF1-MP expressing cells compared to the vector control is shown along the y-axis (*, p-value < 0.05 using t-test).

Figure 3.11. MIEF1-MP modulates the mitochondrial translation rate. (A) Using the BONCAT method, newly synthesized mtDNA-encoded proteins are labeled with IRDye® 800CW dye and visualized on a gel. MIEF1-MP-FLAG overexpression in HEK293T cells increases the amount of newly synthesized mitochondrial proteins versus the control (pcDNA vector). (B) MIEF1 siRNA treatment of HEK293T cells decreased the levels of newly synthesized proteins compared to control siRNA-treated cells. (C) Quantitative analysis showed a robust change in the levels of newly synthesized mtDNA-encoded proteins upon MIEF1 microprotein overexpression and knockdown. Comparison of the MIEF1-MP perturbed sample to the control was used to determine statistical significance. For example, MIEF1-MP overexpression leads to a statistically significant increase in MT-ND4 versus the pcDNA control, and MIEF1-MP siRNA leads to a statistically significant decrease in MT-ND4 translation versus the siRNA control (*, p-value < 0.05 using t-test).

Figure 3.12. Controls for the BONCAT method measuring mitochondrial translation rate in Figure 3.11. (A) Western blot using anti-FLAG antibody to confirm overexpression of FLAG-tagged MIEF1 microprotein in the FLAG-immunoprecipitated eluate, normalized to total cell lysate. Anti-β-actin antibody was used as a loading control. (B) RT-qPCR using primers targeted to the MIEF1 microprotein sequence to confirm the knockdown of the MIEF1 mRNA transcript with siRNA treatment (*, p-value < 0.05 using t-test).

3.5 Materials and Methods

Immunofluorescence and Confocal Imaging

HeLa cells were seeded onto a coverslip (Fisher Scientific, #12-541-B) in a six-well plate, which was pretreated with 50 μg/mL poly-L-lysine (Sigma, #P1399). The next day, cells were transfected with 1 μg of pcDNA3.1 plasmid encoding C-terminally FLAG-tagged MIEF1 microprotein (MIEF1-MP-FLAG) using Lipofectamine 2000. Forty-eight hours post-transfection, cells were fixed with 4% paraformaldehyde (Polysciences, Inc., #18814) and permeabilized with 0.1% saponin (Alfa Aesar, #A18820). After being incubated with 4% BSA in phosphate buffered saline (PBS) for 1 h at room temperature, cells were stained with primary antibodies (mouse anti-FLAG and rabbit anti-TOM20) at a 1:1000 dilution overnight at 4 °C. Cells were washed three times with PBS, followed by incubation with secondary antibodies (goat anti-mouse Alexa Fluor 488 and goat anti-rabbit Alexa Fluor 546, 1:500 in PBS) for 1 h at room temperature. Nuclei were counterstained with Hoechst 33258 (Sigma, #94403; 1:2000 in PBS). After three PBS washes, the coverslip was mounted with Prolong Gold Antifade Mountant (Life Technologies, #P36930) and analyzed by confocal imaging using a Zeiss LSM 710 laser scanning confocal microscope with a 63× oil immersion objective. Images were analyzed with FIJI software.

Proteinase K Protection Assay for Sub-Mitochondrial Localization

HEK293T cells were seeded in 10 cm cell culture plates, and the next day the cells were transfected with MIEF1-FLAG plasmid using Lipofectamine 2000. The following day, cells were harvested, suspended in isolation buffer (225 mM mannitol, 75 mM sucrose, 50 mM Na+ HEPES, pH 7.4, supplemented with Roche complete protease inhibitor cocktail tablet), Dounce homogenized, and spun to isolate different cellular fractions.51 The isolated mitochondrial pellet was washed three times and then suspended in isolation buffer, 2 mM HEPES, or 0.3% Triton X-100, followed by addition of Proteinase K to select samples. After 30 min of incubation on ice, 1 mM phenylmethanesulfonyl fluoride (PMSF) was added to quench the protease, and trichloroacetic acid (TCA, 12% final concentration) was added to precipitate the samples, which were then washed with acetone. The pellets were dissolved in 2× SDS loading dye, separated by sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS–PAGE), and analyzed by Western blotting using the indicated antibodies.
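The final TCA concentration above is reached by adding stock to an existing sample volume, so the volume to add follows from a simple mass balance. The sketch below is illustrative only: the stock concentration (100% w/v) and the sample volume are assumptions, not values reported in this chapter, and the function name is hypothetical.

```python
def stock_volume_to_add(sample_volume_ul: float, stock_pct: float, final_pct: float) -> float:
    """Volume of stock (uL) to add so that the mixture reaches final_pct.

    Solves stock_pct * v / (sample_volume_ul + v) = final_pct for v.
    """
    if not 0 < final_pct < stock_pct:
        raise ValueError("final concentration must be positive and below the stock concentration")
    return sample_volume_ul * final_pct / (stock_pct - final_pct)


# Assumed example values: 500 uL of sample and a 100% (w/v) TCA stock.
tca_volume_ul = stock_volume_to_add(sample_volume_ul=500.0, stock_pct=100.0, final_pct=12.0)
print(f"Add {tca_volume_ul:.1f} uL of stock to reach 12% final")
```

Note that the added volume dilutes the sample itself, which is why the denominator is the difference between stock and final concentrations rather than the stock concentration alone.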

Co-Immunoprecipitation

FLAG-tagged microprotein constructs or the empty pcDNA3.1 (+) vector were transfected into a 10-cm dish of HEK293T cells using Lipofectamine 2000 according to the manufacturer’s protocol. Twenty-four hours post-transfection, cells were harvested and lysed in IP lysis buffer (Thermo catalog no. 87787) supplemented with a Roche complete protease inhibitor cocktail tablet and 1 mM PMSF. Cells were lysed on ice for 10 min followed by centrifugation at 20,000g for 20 min at 4 °C to remove cell debris. Cell lysates were added to prewashed mouse IgG agarose beads (Sigma catalog no. A0919) and rotated at 4 °C for 1 hour. The supernatants were collected and added to prewashed anti-FLAG M2 Affinity Gel (Sigma, catalog no. A2220). The suspensions were rotated at 4 °C for 2 hours and washed three times with TBST. Bound proteins were eluted with 3× FLAG peptide (Sigma, catalog no. F4799) at 4 °C for 1 hour.

Biotin-Phenol Labeling in Live Cells and Streptavidin Pull Down

Biotin-phenol labeling in live cells was performed according to previously published protocols.44-45 Briefly, constructs harboring the MIEF1-MP–APEX fusion proteins or APEX (control sample) were transfected into HEK293T cells using Lipofectamine 2000. Twenty-four hours post-transfection, cell culture medium was changed to fresh growth medium containing 500 μM biotin-tyramide (CDX-B0270, Adipogen). After incubation at 37 °C for 30 min, H2O2 was added to each plate at a final concentration of 1 mM, and the plates were gently agitated for 1 min. Cells were then washed three times with a quenching solution (5 mM Trolox, 10 mM sodium azide, and 10 mM sodium ascorbate in PBS), and pelleted by centrifugation at 1000g for 5 min. Cell pellets were lysed on ice for 20 min in RIPA buffer (Thermo catalog no. 89901) supplemented with a Roche complete protease inhibitor cocktail tablet and 1 mM PMSF, followed by centrifugation at 20,000g for 20 min at 4 °C to remove cell debris. Cell lysates were added to prewashed streptavidin agarose resin (Thermo catalog no. 20359), rotated at 4 °C overnight, and then washed three times with TBST and 0.5% (v/v) sodium dodecyl sulfate (SDS) at room temperature. Bound proteins were eluted with 2× SDS by boiling at 95 °C and analyzed by proteomics and Western blotting.

Proteomic Analysis of Co-Immunoprecipitation and Biotin-Phenol Labeling

For proteomics, eluted samples were precipitated with trichloroacetic acid (TCA, MP Biomedicals catalog no. 196057) overnight at 4 °C or using methanol-chloroform. Dried pellets were dissolved in 8 M urea, reduced with 5 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP, Thermo catalog no. 20491), and alkylated with 10 mM iodoacetamide (Sigma I1149). Proteins were then digested overnight at 37 °C with trypsin (Promega V5111). The reaction was quenched with formic acid at a final concentration of 5% (v/v). Digested samples were analyzed on a Q Exactive Hybrid Quadrupole-Orbitrap Mass Spectrometer.

Metabolic Labeling of Mitochondrial Translation Products Using AHA

HEK293T cells were transfected with MIEF1-MP-FLAG (1 μg per well) for 24 hours or MIEF1 siRNA (100 pmol) for 72 hours in triplicate, with pcDNA vector and negative siRNA as the respective controls. The pulse labeling of mitochondrial translation products in the overexpressing and siRNA-treated HEK293T cells was then performed as per a reported protocol with modifications.63 Specifically, cells at 80% confluence were incubated with 100 μM AHA in HBS buffer containing 100 μg/mL of the cytosolic translation inhibitor emetine for 3 h. After AHA labeling, the cells were lysed, lysates were normalized, and a “click” reaction was performed by incubating the cell lysates with IRDye800®-conjugated alkyne (LI-COR P/N 929-60002) for 3 h at room temperature. The proteins were TCA precipitated, washed with ice-cold acetone, and air-dried before being subjected to SDS-PAGE separation using the Bolt 4–12% Bis-Tris Protein Gel (Invitrogen). The gel was visualized, and the images were analyzed using LI-COR Odyssey CLx.
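The figure legends in this chapter report fold changes from triplicate samples with significance assessed by t-test. A minimal sketch of that comparison is shown below, assuming an unpaired, equal-variance (Student's) t-test; the chapter does not state which variant was used, and the band intensities are made-up numbers for illustration only. The critical value 2.776 is the two-sided 5% point of the t distribution for 4 degrees of freedom (two groups of three).

```python
from math import sqrt
from statistics import mean, variance

T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t, 4 degrees of freedom


def fold_change(treated, control):
    """Ratio of the mean treated signal to the mean control signal."""
    return mean(treated) / mean(control)


def student_t(treated, control):
    """Unpaired, equal-variance t statistic for two samples."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Hypothetical gel band intensities (arbitrary units), three replicates each.
control_bands = [100.0, 110.0, 90.0]
treated_bands = [150.0, 160.0, 140.0]

fc = fold_change(treated_bands, control_bands)
t_stat = student_t(treated_bands, control_bands)
print(f"fold change = {fc:.2f}, t = {t_stat:.2f}, significant = {abs(t_stat) > T_CRIT_DF4}")
```

With only three replicates per group the test has little power, so a comparison of this kind is best read alongside the size of the fold change rather than the p-value alone.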

Measuring Mitochondrial Complex 1 Activity

HeLa cells transfected with MIEF1-MP-FLAG construct (with pcDNA control) and MIEF1 siRNA (with siRNA control) were tested for mitochondrial OXPHOS Complex I enzyme activity using the ab109721 assay kit. Protein extracts from the samples were loaded into microplate wells that were pre-coated with Complex I capture antibodies and incubated for 3 hours at room temperature. After immobilization of the target in the wells, the wells were washed and assay solution containing NADH and dye was added. Complex I activity was measured by the oxidation of NADH to NAD+ and the simultaneous reduction of a dye, which led to increased absorbance at OD = 450 nm. Bovine heart mitochondrial extract (ab110338) was used as a positive control.
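Because the activity in Figure 3.4 is reported in mOD/min, the readout is effectively the slope of absorbance at 450 nm over time. The sketch below shows one way such a rate could be computed, using an ordinary least-squares fit; the time points and absorbance readings are invented for illustration, and the kit's own analysis procedure may differ.

```python
def rate_mod_per_min(times_min, od_values):
    """Least-squares slope of OD versus time, converted from OD/min to mOD/min."""
    n = len(times_min)
    if n != len(od_values) or n < 2:
        raise ValueError("need at least two paired time/OD readings")
    t_mean = sum(times_min) / n
    od_mean = sum(od_values) / n
    sxx = sum((t - t_mean) ** 2 for t in times_min)
    sxy = sum((t - t_mean) * (od - od_mean) for t, od in zip(times_min, od_values))
    return 1000.0 * sxy / sxx


# Hypothetical kinetic readings: one OD450 value per minute.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
od450 = [0.100, 0.105, 0.110, 0.115, 0.120]

activity = rate_mod_per_min(times, od450)
print(f"Complex I activity: {activity:.2f} mOD/min")
```

Fitting all points in the linear range, rather than taking the difference between the first and last reading, makes the estimate less sensitive to noise in any single well read.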

Lipidomics Analysis

HeLa cells were transfected with MIEF1-MP-FLAG construct (with pcDNA control) and MIEF1 siRNA (with siRNA control) in triplicate and grown in DMEM media with delipidated fetal bovine serum (Cocalico Biologicals Cat. No. 55-0115). The cell pellet was then lysed in MeOH:PBS:CHCl3 (1:1:2) with d7 cholesterol and PA as internal standards. The samples were then homogenized, followed by centrifugation at 2400g for 5 min at 4 °C to collect the bottom layer containing lipids for lipidomic analysis. LC-MS/MS analyses were carried out on a Vanquish UHPLC system interfaced to a Q-Exactive Plus mass spectrometer (Thermo Fisher Scientific, Waltham, MA). Lipids were separated using a Bio-Bond 5U C4 column (Dikma). A 70 min gradient was used as follows: 0% B hold 5 min, 0-20% B in 0.1 min, 20-100% B in 50 min, hold 8 min, 100-0% B in 0.1 min, then hold 7 min. Solvent A consisted of 95:5 water:methanol, and solvent B was 60:35:5 isopropanol:methanol:water. 0.1% formic acid and 5 mM ammonium formate were added for positive ionization mode, and 0.03% ammonium hydroxide was added for negative ionization mode. MS/MS spectra were acquired in the data-dependent mode with one full MS scan (resolution, 70K; AGC target, 1e6; mass range, 200-1500) followed by 5 MS2 scans (resolution, 17.5K; AGC target, 1e5; stepped normalized collision energy of 20, 30, 40). Lipids were identified using Lipid Search.65 Precursor tolerance and product tolerance were set to 5 ppm and 10 ppm, respectively. M-score threshold was set to 3. Quantification was performed using in-house software.
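The lipidomics gradient described above can be written down as a timetable and checked for total run time and solvent composition at any time point. The sketch below transcribes the stated segments; the data structure and function names are hypothetical and are not taken from any instrument control software.

```python
# Each step: (kind, duration in min, %B at start, %B at end), transcribed from the text.
LIPID_GRADIENT = [
    ("hold", 5.0, 0.0, 0.0),
    ("ramp", 0.1, 0.0, 20.0),
    ("ramp", 50.0, 20.0, 100.0),
    ("hold", 8.0, 100.0, 100.0),
    ("ramp", 0.1, 100.0, 0.0),
    ("hold", 7.0, 0.0, 0.0),
]


def total_runtime(steps):
    """Sum of all step durations in minutes."""
    return sum(duration for _, duration, _, _ in steps)


def percent_b_at(steps, t_min):
    """Linearly interpolated %B at time t_min since the start of the run."""
    if t_min < 0:
        raise ValueError("time must be non-negative")
    elapsed = 0.0
    for _, duration, start_b, end_b in steps:
        if t_min <= elapsed + duration:
            fraction = (t_min - elapsed) / duration
            return start_b + fraction * (end_b - start_b)
        elapsed += duration
    raise ValueError("time is beyond the end of the gradient")


print(f"total run time: {total_runtime(LIPID_GRADIENT):.1f} min")
print(f"%B at 30.1 min: {percent_b_at(LIPID_GRADIENT, 30.1):.1f}")
```

The segments sum to 70.2 min, consistent with the nominal 70 min gradient once the two 0.1 min switching steps are counted.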

Measuring Steady State Protein Levels using TMT Labeling

The crude mitochondrial fraction was isolated from HEK293T cells transfected with the MIEF1-MP-FLAG construct and pcDNA control in triplicate, according to a reported protocol.51 The mitochondrial pellet was resuspended in RIPA buffer and sonicated, followed by protein concentration normalization. The equivalent of 100 μg of each sample was precipitated using methanol-chloroform. Dried pellets were dissolved in 8 M urea, reduced with 5 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP), and alkylated with 50 mM chloroacetamide. Proteins were then trypsin digested overnight at 37 °C. The digested peptides were labeled with TMT 6plex (Lot RH239931) and fractionated by basic reverse phase (Thermo 84868). The TMT labeled samples were analyzed on a Fusion Lumos mass spectrometer (Thermo). Samples were injected directly onto a 25 cm, 100 μm ID column packed with BEH 1.7 μm C18 resin (Waters). Samples were separated at a flow rate of 300 nL/min on an nLC 1200 (Thermo). Buffer A and B were 0.1% formic acid in water and 90% acetonitrile, respectively. A gradient of 1–25% B over 120 min, an increase to 40% B over 80 min, an increase to 100% B over another 30 min and held at 90% B for a final 10 min of washing was used for 240 min total run time. The column was re-equilibrated with 20 μL of buffer A prior to the injection of sample. Peptides were eluted directly from the tip of the column and nanosprayed directly into the mass spectrometer by application of 2.8 kV voltage at the back of the column. The Lumos was operated in a data-dependent mode. Full MS1 scans were collected in the Orbitrap at 120k resolution. The cycle time was set to 3 s, and within this 3 s the most abundant ions per scan were selected for CID MS/MS in the ion trap. MS3 analysis with multinotch isolation (SPS3) was utilized for detection of TMT reporter ions at 60k resolution. Monoisotopic precursor selection was enabled, and dynamic exclusion was used with an exclusion duration of 10 seconds.

Protein and peptide identification was done with Integrated Proteomics Pipeline – IP2 (Integrated Proteomics Applications). Tandem mass spectra were extracted from raw files using RawConverter66 and searched with ProLuCID67 against the human UniProt database appended with MIEF1-MP. The search space included all fully-tryptic and half-tryptic peptide candidates. Carbamidomethylation on cysteine and TMT labels on N-terminus and lysine were considered as static modifications. Data was searched with 50 ppm precursor ion tolerance and 600 ppm fragment ion tolerance. Identified proteins were filtered using DTASelect68, utilizing a target-decoy database search strategy to control the false discovery rate to 1% at the protein level.69
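The search tolerances above are given in parts per million, so the absolute m/z window scales with the mass being matched. The following sketch converts a ppm tolerance into an m/z window and tests whether an observed value falls inside it; the m/z values are arbitrary examples, and the function names are hypothetical rather than part of ProLuCID.

```python
def ppm_window(mz: float, tol_ppm: float):
    """Lower and upper m/z bounds for a symmetric tolerance given in ppm."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def within_tolerance(observed_mz: float, theoretical_mz: float, tol_ppm: float) -> bool:
    """True if the observed m/z deviates from the theoretical m/z by at most tol_ppm."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= tol_ppm


# Arbitrary example: a precursor at m/z 500 with the 50 ppm and 600 ppm tolerances.
low, high = ppm_window(500.0, 50.0)
print(f"50 ppm window at m/z 500: {low:.3f} to {high:.3f}")
frag_low, frag_high = ppm_window(500.0, 600.0)
print(f"600 ppm window at m/z 500: {frag_low:.3f} to {frag_high:.3f}")
print(within_tolerance(500.02, 500.0, 50.0))  # 40 ppm error
print(within_tolerance(500.03, 500.0, 50.0))  # 60 ppm error
```

At m/z 500, 50 ppm corresponds to plus or minus 0.025 and 600 ppm to plus or minus 0.3, which illustrates how much looser the fragment tolerance is than the precursor tolerance.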

Mass Spectrometry and Data Analysis

For proteomics analysis, the eluted samples were precipitated with trichloroacetic acid (TCA, MP Biomedicals #196057) overnight at 4 °C or using methanol-chloroform. Dried pellets were dissolved in 8 M urea, reduced with 5 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP, Thermo #20491), and alkylated with 10 mM iodoacetamide (Sigma I1149). Proteins were then digested with trypsin (Promega V5111) overnight at 37 °C. The reaction was quenched with formic acid at a final concentration of 5% (v/v). Digested samples were analyzed on a Q Exactive Hybrid Quadrupole-Orbitrap Mass Spectrometer. Samples were injected directly onto a 25 cm, 100 μm ID column packed with BEH 1.7 μm C18 resin (Waters). Samples were separated at a flow rate of 300 nL/min on an nLC 1000 (Thermo). Buffer A and B were 0.1% formic acid in water and 0.1% formic acid in 90% acetonitrile, respectively. A gradient of 1–30% B over 110 min, an increase to 40% B over 10 min, an increase to 90% B over another 10 min, and a hold at 90% B for 10 min was used for a 140 min total run time. The column was re-equilibrated with 20 μL of buffer A prior to the injection of sample. Peptides were eluted directly from the tip of the column and nanosprayed directly into the mass spectrometer by application of 2.5 kV voltage at the back of the column. The Q Exactive was operated in a data-dependent mode. Full MS1 scans were collected in the Orbitrap at 70K resolution with a mass range of 400 to 1800 m/z. The 10 most abundant ions per cycle were selected for MS/MS, and dynamic exclusion was used with an exclusion duration of 15 seconds.
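Data-dependent acquisition with dynamic exclusion, as described above, can be pictured as a per-cycle selection rule: take the most intense precursors that have not been fragmented within the exclusion window. The sketch below is a simplified model of that logic, not the instrument firmware; matching precursors by m/z rounded to two decimals is an assumption made for brevity (real instruments use a mass tolerance), and the peak lists are invented.

```python
def select_precursors(scan_time_s, peaks, excluded_until, top_n=10, exclusion_s=15.0):
    """Pick up to top_n most intense peaks that are not currently excluded.

    peaks: list of (mz, intensity) from one MS1 scan.
    excluded_until: dict mapping rounded m/z to the time its exclusion ends;
    it is updated in place for every selected precursor.
    """
    eligible = [
        (mz, intensity)
        for mz, intensity in peaks
        if excluded_until.get(round(mz, 2), float("-inf")) <= scan_time_s
    ]
    eligible.sort(key=lambda peak: peak[1], reverse=True)
    chosen = [mz for mz, _ in eligible[:top_n]]
    for mz in chosen:
        excluded_until[round(mz, 2)] = scan_time_s + exclusion_s
    return chosen


# Hypothetical MS1 peak list reused across three cycles; top_n=2 keeps the example small.
ms1_peaks = [(400.10, 1e5), (500.20, 5e5), (600.30, 2e5)]
exclusion = {}
for t in (0.0, 5.0, 16.0):
    print(t, select_precursors(t, ms1_peaks, exclusion, top_n=2, exclusion_s=15.0))
```

In this toy run the two most intense ions are picked first, the weaker ion is picked while they are excluded, and the intense ions become eligible again once 15 s have passed.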

Protein and peptide identification was done with Integrated Proteomics Pipeline – IP2 (Integrated Proteomics Applications). Tandem mass spectra were extracted from raw files using RawConverter66 and searched with ProLuCID67 against the human UniProt database appended with MIEF1-MP. The search space included all fully-tryptic and half-tryptic peptide candidates. Carbamidomethylation on cysteine was considered as a static modification. Data was searched with 50 ppm precursor ion tolerance and 600 ppm fragment ion tolerance. Identified proteins were filtered using DTASelect68, utilizing a target-decoy database search strategy to control the false discovery rate to 1% at the protein level.69
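The target-decoy strategy mentioned above estimates the false discovery rate from the number of decoy hits that pass a score cutoff relative to the number of target hits. The sketch below illustrates that general idea with a simple decoys/targets estimate; it is not DTASelect's actual filtering algorithm, and the scores are invented. A loose 25% cutoff is used in the example only because the toy list is short.

```python
def fdr_score_cutoff(hits, max_fdr=0.01):
    """Lowest score cutoff at which decoys/targets stays at or below max_fdr.

    hits: list of (score, is_decoy) pairs; higher scores are better.
    Returns None if no cutoff satisfies the requested rate.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(hits, key=lambda hit: hit[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Hypothetical protein-level scores; True marks a decoy database hit.
example_hits = [
    (10, False), (9, False), (8, False), (7, False),
    (6, True), (5, False), (4, True), (3, True),
]
cutoff = fdr_score_cutoff(example_hits, max_fdr=0.25)
accepted = [score for score, is_decoy in example_hits if not is_decoy and score >= cutoff]
print(cutoff, accepted)
```

Scanning the full ranked list, rather than stopping at the first violation, lets the cutoff extend past an isolated high-scoring decoy when enough target hits follow it.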

Antibodies

Primary antibodies: mouse anti-FLAG (F1804, Sigma), rabbit anti-TOM20 (sc-11415, Santa Cruz Biotechnology), goat anti-TIMM50 (ab23938, Abcam), rabbit anti-HSP60 (ab46798, Abcam), rabbit anti-myc tag (#2272, Cell Signaling Technology), rabbit anti-MRPS27 (#17280-1-AP, Proteintech), rabbit anti-MRPL4 (HPA051261, Atlas Antibodies), rabbit anti-biotin (D5A7) (#5597S, Cell Signaling Technology).

Secondary antibodies: goat anti-rabbit IRDye® 800CW (926-32211, LI-COR), goat anti-mouse IRDye® 800CW (926-32210, LI-COR), donkey anti-goat IRDye® 800CW (926-32214, LI-COR), goat anti-mouse Alexa Fluor 488 (A11001, Life Technologies), goat anti-rabbit Alexa Fluor 546 (A11010, Life Technologies).

Plasmids

pcDNA3-APEX-NES was a gift from Alice Ting (Addgene plasmid #49386). pcDNA3-FLAG-APEX was generated by inserting a stop codon before the NES sequence using the QuikChange II kit (Agilent Technologies). The APEX coding sequence was subcloned into the pcDNA3.1(+) vector with an N-terminal myc-tag to generate the pcDNA3.1-myc-APEX construct. The MIEF microprotein coding sequence was obtained by PCR amplification from a HEK293T cDNA pool. MIEF-APEX-myc was amplified by overlap extension PCR and subcloned into the pcDNA3.1(+) vector. The conserved L-Y-R (leucine/tyrosine/arginine) and downstream F (phenylalanine) residues in the wild-type MIEF sequence were mutated to A (alanine) using the Q5 Site-Directed Mutagenesis Kit (New England Biolabs, E0552S), generating the mutant MIEF-APEX-myc construct in the pcDNA3.1(+) vector.
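The alanine substitutions described above can be modeled at the sequence level by locating the LYR tripeptide and the next downstream phenylalanine and replacing those residues with alanine. The sketch below uses a made-up toy sequence, which is NOT the MIEF1-MP sequence, purely to show the bookkeeping; the function names are hypothetical.

```python
def lyr_f_positions(protein: str):
    """1-based positions of the first L-Y-R tripeptide and the next downstream F."""
    start = protein.find("LYR")
    if start < 0:
        raise ValueError("no LYR tripeptide found")
    positions = [start + 1, start + 2, start + 3]
    f_index = protein.find("F", start + 3)
    if f_index >= 0:
        positions.append(f_index + 1)
    return positions


def mutate_to_alanine(protein: str, positions):
    """Return a copy of the sequence with the given 1-based positions set to alanine."""
    residues = list(protein)
    for pos in positions:
        if not 1 <= pos <= len(residues):
            raise ValueError(f"position {pos} is outside the sequence")
        residues[pos - 1] = "A"
    return "".join(residues)


# Toy sequence for illustration only (not MIEF1-MP).
toy_wild_type = "MKSLYRQWDEFGH"
sites = lyr_f_positions(toy_wild_type)
toy_mutant = mutate_to_alanine(toy_wild_type, sites)
print(sites, toy_mutant)
```

Tracking positions explicitly, instead of doing a blind string replacement, keeps the mutation confined to the motif even if the same residues occur elsewhere in the sequence.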

3.6 Acknowledgements

Chapter 3, in part, has been submitted for publication of the material as it may appear in MIEF1 Microprotein Regulates Mitochondrial Translation, Biochemistry (2018). The dissertation author was the primary investigator and author of this paper. Other authors include Qian Chu, Dan Tan, Thomas F. Martinez, Cynthia J. Donaldson, Jolene K. Diedrich, John R. Yates III and Alan Saghatelian.

3.7 References

1. Cabrera-Quio, L. E.; Herberg, S.; Pauli, A., Decoding sORF translation – from small proteins to gene regulation. RNA Biology 2016, 13 (11), 1051-1059.

2. Saghatelian, A.; Couso, J. P., Discovery and characterization of smORF-encoded bioactive polypeptides. Nat Chem Biol 2015, 11 (12), 909-916.

3. Burge, C. B.; Karlin, S., Finding the genes in genomic DNA. Current opinion in structural biology 1998, 8 (3), 346-354.

4. Calvo, S. E.; Pagliarini, D. J.; Mootha, V. K., Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proceedings of the National Academy of Sciences of the United States of America 2009, 106 (18), 7507-7512.

5. Frith, M. C.; Forrest, A. R.; Nourbakhsh, E.; Pang, K. C.; Kai, C.; Kawai, J.; Carninci, P.; Hayashizaki, Y.; Bailey, T. L.; Grimmond, S. M., The Abundance of Short Proteins in the Mammalian Proteome. PLoS Genetics 2006, 2 (4), e52.

6. Ladoukakis, E.; Pereira, V.; Magny, E. G.; Eyre-Walker, A.; Couso, J. P., Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology 2011, 12 (11), R118-R118.

7. Vanderperre, B.; Lucier, J. F.; Bissonnette, C.; Motard, J.; Tremblay, G.; Vanderperre, S.; Wisztorski, M.; Salzet, M.; Boisvert, F. M.; Roucou, X., Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS One 2013, 8 (8), e70698.

8. Mackowiak, S. D.; Zauber, H.; Bielow, C.; Thiel, D.; Kutz, K.; Calviello, L.; Mastrobuoni, G.; Rajewsky, N.; Kempa, S.; Selbach, M.; Obermayer, B., Extensive identification and analysis of conserved small ORFs in animals. Genome Biology 2015, 16 (1), 179.

9. Smith, Jenna E.; Alvarez-Dominguez, Juan R.; Kline, N.; Huynh, Nathan J.; Geisler, S.; Hu, W.; Coller, J.; Baker, Kristian E., Translation of Small Open Reading Frames within Unannotated RNA Transcripts in Saccharomyces cerevisiae. Cell Reports 2014, 7 (6), 1858-1866.

10. Aspden, J. L.; Eyre-Walker, Y. C.; Phillips, R. J.; Amin, U.; Mumtaz, M. A. S.; Brocard, M.; Couso, J.-P., Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. eLife 2014, 3, e03528.

11. Bazzini, A. A.; Johnstone, T. G.; Christiano, R.; Mackowiak, S. D.; Obermayer, B.; Fleming, E. S.; Vejnar, C. E.; Lee, M. T.; Rajewsky, N.; Walther, T. C.; Giraldez, A. J., Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. The EMBO Journal 2014, 33 (9), 981-993.

12. Crappé, J.; Van Criekinge, W.; Trooskens, G.; Hayakawa, E.; Luyten, W.; Baggerman, G.; Menschaert, G., Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics 2013, 14, 648-648.

13. Ingolia, N. T.; Lareau, L. F.; Weissman, J. S., Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity of Mammalian Proteomes. Cell 2011, 147 (4), 789-802.

14. Slavoff, S. A.; Mitchell, A. J.; Schwaid, A. G.; Cabili, M. N.; Ma, J.; Levin, J. Z.; Karger, A. D.; Budnik, B. A.; Rinn, J. L.; Saghatelian, A., Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat Chem Biol 2013, 9 (1), 59-64.

15. Ma, J.; Ward, C. C.; Jungreis, I.; Slavoff, S. A.; Schwaid, A. G.; Neveu, J.; Budnik, B. A.; Kellis, M.; Saghatelian, A., Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res 2014, 13 (3), 1757-65.

16. Ma, J.; Diedrich, J. K.; Jungreis, I.; Donaldson, C.; Vaughan, J.; Kellis, M.; Yates, J. R., 3rd; Saghatelian, A., Improved Identification and Analysis of Small Open Reading Frame Encoded Polypeptides. Anal Chem 2016, 88 (7), 3967-75.

17. Zanet, J.; Benrabah, E.; Li, T.; Pélissier-Monier, A.; Chanut-Delalande, H.; Ronsin, B.; Bellen, H. J.; Payre, F.; Plaza, S., Pri sORF peptides induce selective proteasome-mediated protein processing. Science 2015, 349 (6254), 1356-1358.

18. Kondo, T.; Plaza, S.; Zanet, J.; Benrabah, E.; Valenti, P.; Hashimoto, Y.; Kobayashi, S.; Payre, F.; Kageyama, Y., Small Peptides Switch the Transcriptional Activity of Shavenbaby During Drosophila Embryogenesis. Science 2010, 329 (5989), 336-339.

19. Anderson, D. M.; Anderson, K. M.; Chang, C.-L.; Makarewich, C. A.; Nelson, B. R.; McAnally, J. R.; Kasaragod, P.; Shelton, J. M.; Liou, J.; Bassel-Duby, R.; Olson, E. N., A Micropeptide Encoded by a Putative Long Non-coding RNA Regulates Muscle Performance. Cell 2015, 160 (4), 595-606.

20. Slavoff, S. A.; Heo, J.; Budnik, B. A.; Hanakahi, L. A.; Saghatelian, A., A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining. J Biol Chem 2014, 289 (16), 10950-7.

21. Arnoult, N.; Correia, A.; Ma, J.; Merlo, A.; Garcia-Gomez, S.; Maric, M.; Tognetti, M.; Benner, C. W.; Boulton, S. J.; Saghatelian, A.; Karlseder, J., Regulation of DNA repair pathway choice in S and G2 phases by the NHEJ inhibitor CYREN. Nature 2017, 549, 548.

22. Lee, C.; Zeng, J.; Drew, Brian G.; Sallam, T.; Martin-Montalvo, A.; Wan, J.; Kim, S.-J.; Mehta, H.; Hevener, Andrea L.; de Cabo, R.; Cohen, P., The Mitochondrial-Derived Peptide MOTS-c Promotes Metabolic Homeostasis and Reduces Obesity and Insulin Resistance. Cell Metabolism 2015, 21 (3), 443-454.

23. Kondo, T.; Hashimoto, Y.; Kato, K.; Inagaki, S.; Hayashi, S.; Kageyama, Y., Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA. Nature Cell Biology 2007, 9, 660.

24. Magny, E. G.; Pueyo, J. I.; Pearl, F. M. G.; Cespedes, M. A.; Niven, J. E.; Bishop, S. A.; Couso, J. P., Conserved Regulation of Cardiac Calcium Uptake by Peptides Encoded in Small Open Reading Frames. Science 2013, 341 (6150), 1116-1120.

25. Tupy, J. L.; Bailey, A. M.; Dailey, G.; Evans-Holm, M.; Siebel, C. W.; Misra, S.; Celniker, S. E.; Rubin, G. M., Identification of putative noncoding polyadenylated transcripts in Drosophila melanogaster. Proc Natl Acad Sci USA 2005, 102.

26. Galindo, M. I.; Pueyo, J. I.; Fouix, S.; Bishop, S. A.; Couso, J. P., Peptides Encoded by Short ORFs Control Development and Define a New Eukaryotic Gene Family. PLoS Biology 2007, 5 (5), e106.

27. Zhang, Q.; Vashisht, A. A.; O’Rourke, J.; Corbel, S. Y.; Moran, R.; Romero, A.; Miraglia, L.; Zhang, J.; Durrant, E.; Schmedt, C.; Sampath, S. C.; Sampath, S. C., The microprotein Minion controls cell fusion and muscle formation. Nature Communications 2017, 8, 15664.

28. D'Lima, N. G.; Ma, J.; Winkler, L.; Chu, Q.; Loh, K. H.; Corpuz, E. O.; Budnik, B. A.; Lykke-Andersen, J.; Saghatelian, A.; Slavoff, S. A., A human microprotein that interacts with the mRNA decapping complex. Nat Chem Biol 2017, 13 (2), 174-180.

29. Losón, O. C.; Song, Z.; Chen, H.; Chan, D. C., Fis1, Mff, MiD49, and MiD51 mediate Drp1 recruitment in mitochondrial fission. Molecular biology of the cell 2013, 24 (5), 659-667.

30. Samandi, S.; Roy, A. V.; Delcourt, V.; Lucier, J.-F.; Gagnon, J.; Beaudoin, M. C.; Vanderperre, B.; Breton, M.-A.; Motard, J.; Jacques, J.-F.; Brunelle, M.; Gagnon-Arsenault, I.; Fournier, I.; Ouangraoua, A.; Hunting, D. J.; Cohen, A. A.; Landry, C. R.; Scott, M. S.; Roucou, X., Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins. eLife 2017, 6, e27860.

31. Samandi, S.; Roy, A. V.; Delcourt, V.; Lucier, J.-F.; Gagnon, J.; Beaudoin, M. C.; Vanderperre, B.; Breton, M.-A.; Jacques, J.-F.; Brunelle, M.; Motard, J.; Gagnon-Arsenault, I.; Fournier, I.; Ouangraoua, A.; Hunting, D. J.; Cohen, A. A.; Landry, C. R.; Scott, M. S.; Roucou, X., Deep transcriptome annotation suggests that small and large proteins encoded in the same genes often cooperate. bioRxiv 2017.

32. Kim, M.-S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; Thomas, J. K.; Muthusamy, B.; Leal-Rojas, P.; Kumar, P.; Sahasrabuddhe, N. A.; Balakrishnan, L.; Advani, J.; George, B.; Renuse, S.; Selvan, L. D. N.; Patil, A. H.; Nanjappa, V.; Radhakrishnan, A.; Prasad, S.; Subbannayya, T.; Raju, R.; Kumar, M.; Sreenivasamurthy, S. K.; Marimuthu, A.; Sathe, G. J.; Chavan, S.; Datta, K. K.; Subbannayya, Y.; Sahu, A.; Yelamanchi, S. D.; Jayaram, S.; Rajagopalan, P.; Sharma, J.; Murthy, K. R.; Syed, N.; Goel, R.; Khan, A. A.; Ahmad, S.; Dey, G.; Mudgal, K.; Chatterjee, A.; Huang, T.-C.; Zhong, J.; Wu, X.; Shaw, P. G.; Freed, D.; Zahari, M. S.; Mukherjee, K. K.; Shankar, S.; Mahadevan, A.; Lam, H.; Mitchell, C. J.; Shankar, S. K.; Satishchandra, P.; Schroeder, J. T.; Sirdeshmukh, R.; Maitra, A.; Leach, S. D.; Drake, C. G.; Halushka, M. K.; Prasad, T. S. K.; Hruban, R. H.; Kerr, C. L.; Bader, G. D.; Iacobuzio-Donahue, C. A.; Gowda, H.; Pandey, A., A draft map of the human proteome. Nature 2014, 509 (7502), 575-581.

33. Andreev, D. E.; O'Connor, P. B. F.; Fahey, C.; Kenny, E. M.; Terenin, I. M.; Dmitriev, S. E.; Cormican, P.; Morris, D. W.; Shatsky, I. N.; Baranov, P. V., Translation of 5′ leaders is pervasive in genes resistant to eIF2 repression. eLife 2015, 4, e03971.

34. Angerer, H.; Radermacher, M.; Mańkowska, M.; Steger, M.; Zwicker, K.; Heide, H.; Wittig, I.; Brandt, U.; Zickermann, V., The LYR protein subunit NB4M/NDUFA6 of mitochondrial complex I anchors an acyl carrier protein and is essential for catalytic activity. Proceedings of the National Academy of Sciences of the United States of America 2014, 111 (14), 5207-5212.

35. Invernizzi, F.; Tigano, M.; Dallabona, C.; Donnini, C.; Ferrero, I.; Cremonte, M.; Ghezzi, D.; Lamperti, C.; Zeviani, M., A Homozygous Mutation in LYRM7/MZM1L Associated with Early Onset Encephalopathy, Lactic Acidosis, and Severe Reduction of Mitochondrial Complex III Activity. Human Mutation 2013, 34 (12), 1619-1622.

36. Dennis, R. A.; McCammon, M. T., Acn9 is a novel protein of gluconeogenesis that is located in the mitochondrial intermembrane space. European Journal of Biochemistry 1999, 261 (1), 236-243.

37. Ghezzi, D.; Goffrini, P.; Uziel, G.; Horvath, R.; Klopstock, T.; Lochmüller, H.; D'Adamo, P.; Gasparini, P.; Strom, T. M.; Prokisch, H.; Invernizzi, F.; Ferrero, I.; Zeviani, M., SDHAF1, encoding a LYR complex-II specific assembly factor, is mutated in SDH-defective infantile leukoencephalopathy. Nature Genetics 2009, 41, 654.

38. Pagliarini, D. J.; Calvo, S. E.; Chang, B.; Sheth, S. A.; Vafai, S. B.; Ong, S.-E.; Walford, G. A.; Sugiana, C.; Boneh, A.; Chen, W. K.; Hill, D. E.; Vidal, M.; Evans, J. G.; Thorburn, D. R.; Carr, S. A.; Mootha, V. K., A mitochondrial protein compendium elucidates complex I disease biology. Cell 2008, 134 (1), 112-123.

39. Lim, S. C.; Friemel, M.; Marum, J. E.; Tucker, E. J.; Bruno, D. L.; Riley, L. G.; Christodoulou, J.; Kirk, E. P.; Boneh, A.; DeGennaro, C. M.; Springer, M.; Mootha, V. K.; Rouault, T. A.; Leimkühler, S.; Thorburn, D. R.; Compton, A. G., Mutations in LYRM4, encoding iron–sulfur cluster biogenesis factor ISD11, cause deficiency of multiple respiratory chain complexes. Human Molecular Genetics 2013, 22 (22), 4460-4473.

40. Zhu, G.-Z.; Zhang, M.; Kou, C.-Z.; Ni, Y.-H.; Ji, C.-B.; Cao, X.-G.; Guo, X.-R., Effects of Lyrm1 knockdown on mitochondrial function in 3T3-L1 murine adipocytes. Journal of Bioenergetics and Biomembranes 2012, 44 (1), 225-232.

41. Angerer, H., Eukaryotic LYR Proteins Interact with Mitochondrial Protein Complexes. Biology 2015, 4 (1), 133.

42. Lam, S. S.; Martell, J. D.; Kamer, K. J.; Deerinck, T. J.; Ellisman, M. H.; Mootha, V. K.; Ting, A. Y., Directed evolution of APEX2 for electron microscopy and proximity labeling. Nat Meth 2015, 12 (1), 51-54.

43. Han, S.; Udeshi, N. D.; Deerinck, T. J.; Svinkina, T.; Ellisman, M. H.; Carr, S. A.; Ting, A. Y., Proximity Biotinylation as a Method for Mapping Proteins Associated with mtDNA in Living Cells. Cell chemical biology 2017, 24 (3), 404-414.

44. Chu, Q.; Rathore, A.; Diedrich, J. K.; Donaldson, C. J.; Yates, J. R.; Saghatelian, A., Identification of Microprotein–Protein Interactions via APEX Tagging. Biochemistry 2017, 56 (26), 3299-3306.

45. Hung, V.; Udeshi, N. D.; Lam, S. S.; Loh, K. H.; Cox, K. J.; Pedram, K.; Carr, S. A.; Ting, A. Y., Spatially resolved proteomic mapping in living cells with the engineered peroxidase APEX2. Nat. Protocols 2016, 11 (3), 456-475.

46. Mai, N.; Chrzanowska-Lightowlers, Z. M. A.; Lightowlers, R. N., The process of mammalian mitochondrial protein synthesis. Cell and Tissue Research 2017, 367 (1), 5-20.

47. Zhao, J.; Liu, T.; Jin, S.; Wang, X.; Qu, M.; Uhlén, P.; Tomilin, N.; Shupliakov, O.; Lendahl, U.; Nistér, M., Human MIEF1 recruits Drp1 to mitochondrial outer membranes and promotes mitochondrial fusion rather than fission. The EMBO Journal 2011, 30 (14), 2762-2778.

48. Yu, R.; Liu, T.; Jin, S.-B.; Ning, C.; Lendahl, U.; Nistér, M.; Zhao, J., MIEF1/2 function as adaptors to recruit Drp1 to mitochondria and regulate the association of Drp1 with Mff. Scientific Reports 2017, 7 (1), 880.

49. Bhola, P. D.; Letai, A., Mitochondria – judges and executioners of cell death sentences. Molecular cell 2016, 61 (5), 695-704.

50. Green, D. R.; Reed, J. C., Mitochondria and Apoptosis. Science 1998, 281 (5381), 1309-1312.

51. Wieckowski, M. R.; Giorgi, C.; Lebiedzinska, M.; Duszynski, J.; Pinton, P., Isolation of mitochondria-associated membranes and mitochondria from animal tissues and cells. Nature Protocols 2009, 4, 1582.

52. Haack, T. B.; Madignier, F.; Herzer, M.; Lamantea, E.; Danhauser, K.; Invernizzi, F.; Koch, J.; Freitag, M.; Drost, R.; Hillier, I.; Haberberger, B.; Mayr, J. A.; Ahting, U.; Tiranti, V.; Rötig, A.; Iuso, A.; Horvath, R.; Tesarova, M.; Baric, I.; Uziel, G.; Rolinski, B.; Sperl, W.; Meitinger, T.; Zeviani, M.; Freisinger, P.; Prokisch, H., Mutation screening of 75 candidate genes in 152 complex I deficiency cases identifies pathogenic variants in 16 genes including NDUFB9. Journal of Medical Genetics 2012, 49 (2), 83.

53. Ladha, J. S.; Tripathy, M. K.; Mitra, D., Mitochondrial complex I activity is impaired during HIV-1-induced T-cell apoptosis. Cell Death And Differentiation 2005, 12, 1417.

54. Vinothkumar, K. R.; Zhu, J.; Hirst, J., Architecture of mammalian respiratory complex I. Nature 2014, 515 (7525), 80-84.

55. Zickermann, V.; Wirth, C.; Nasiri, H.; Siegmund, K.; Schwalbe, H.; Hunte, C.; Brandt, U., Mechanistic insight from the crystal structure of mitochondrial complex I. Science 2015, 347 (6217), 44-49.

56. Feng, D.; Witkowski, A.; Smith, S., Down-regulation of Mitochondrial Acyl Carrier Protein in Mammalian Cells Compromises Protein Lipoylation and Respiratory Complex I and Results in Cell Death. The Journal of Biological Chemistry 2009, 284 (17), 11436-11445.

57. Teo, G.; Liu, G.; Zhang, J.; Nesvizhskii, A. I.; Gingras, A.-C.; Choi, H., SAINTexpress: improvements and additional features in Significance Analysis of Interactome software. Journal of proteomics 2014, 100, 37-43.

58. Mellacheruvu, D.; Wright, Z.; Couzens, A. L.; Lambert, J.-P.; St-Denis, N. A.; Li, T.; Miteva, Y. V.; Hauri, S.; Sardiu, M. E.; Low, T. Y.; Halim, V. A.; Bagshaw, R. D.; Hubner, N. C.; al-Hakim, A.; Bouchard, A.; Faubert, D.; Fermin, D.; Dunham, W. H.; Goudreault, M.; Lin, Z.-Y.; Badillo, B. G.; Pawson, T.; Durocher, D.; Coulombe, B.; Aebersold, R.; Superti-Furga, G.; Colinge, J.; Heck, A. J. R.; Choi, H.; Gstaiger, M.; Mohammed, S.; Cristea, I. M.; Bennett, K. L.; Washburn, M. P.; Raught, B.; Ewing, R. M.; Gingras, A.-C.; Nesvizhskii, A. I., The CRAPome: a contaminant repository for affinity purification-mass spectrometry data. Nat Meth 2013, 10 (8), 730-736.

59. Franceschini, A.; Szklarczyk, D.; Frankild, S.; Kuhn, M.; Simonovic, M.; Roth, A.; Lin, J.; Minguez, P.; Bork, P.; von Mering, C.; Jensen, L. J., STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Research 2013, 41 (Database issue), D808-D815.

60. Huttlin, Edward L.; Ting, L.; Bruckner, Raphael J.; Gebreab, F.; Gygi, Melanie P.; Szpyt, J.; Tam, S.; Zarraga, G.; Colby, G.; Baltier, K.; Dong, R.; Guarani, V.; Vaites, Laura P.; Ordureau, A.; Rad, R.; Erickson, Brian K.; Wühr, M.; Chick, J.; Zhai, B.; Kolippakkam, D.; Mintseris, J.; Obar, Robert A.; Harris, T.; Artavanis-Tsakonas, S.; Sowa, Mathew E.; De Camilli, P.; Paulo, Joao A.; Harper, J. W.; Gygi, Steven P., The BioPlex Network: A Systematic Exploration of the Human Interactome. Cell 2015, 162 (2), 425-440.

61. Fung, S.; Nishimura, T.; Sasarman, F.; Shoubridge, E. A., The conserved interaction of C7orf30 with MRPL14 promotes biogenesis of the mitochondrial large ribosomal subunit and mitochondrial translation. Molecular Biology of the Cell 2013, 24 (3), 184-193.

62. Xiao, L.; Xian, H.; Lee, K. Y.; Xiao, B.; Wang, H.; Yu, F.; Shen, H.-M.; Liou, Y.-C., Death-associated Protein 3 Regulates Mitochondrial-encoded Protein Synthesis and Mitochondrial Dynamics. The Journal of Biological Chemistry 2015, 290 (41), 24961-24974.

63. Dieterich, D. C.; Lee, J. J.; Link, A. J.; Graumann, J.; Tirrell, D. A.; Schuman, E. M., Labeling, detection and identification of newly synthesized proteomes with bioorthogonal non-canonical amino-acid tagging. Nature Protocols 2007, 2, 532.

64. Kamps, R.; Szklarczyk, R.; Theunissen, T. E.; Hellebrekers, D. M. E. I.; Sallevelt, S. C. E. H.; Boesten, I. B.; de Koning, B.; van den Bosch, B. J.; Salomons, G. S.; Simas-Mendes, M.; Verdijk, R.; Schoonderwoerd, K.; de Coo, I. F. M.; Vanoevelen, J. M.; Smeets, H. J. M., Genetic defects in mtDNA-encoded protein translation cause pediatric, mitochondrial cardiomyopathy with early-onset brain disease. European Journal of Human Genetics 2018, 26 (4), 537-551.

65. Taguchi, R.; Ishikawa, M., Precise and global identification of phospholipid molecular species by an Orbitrap mass spectrometer and automated search engine Lipid Search. Journal of Chromatography A 2010, 1217 (25), 4229-4239.

66. He, L.; Diedrich, J.; Chu, Y.-Y.; Yates, J. R., Extracting Accurate Precursor Information for Tandem Mass Spectra by RawConverter. Analytical Chemistry 2015, 87 (22), 11361-11367.

67. Xu, T.; Park, S. K.; Venable, J. D.; Wohlschlegel, J. A.; Diedrich, J. K.; Cociorva, D.; Lu, B.; Liao, L.; Hewel, J.; Han, X.; Wong, C. C. L.; Fonslow, B.; Delahunty, C.; Gao, Y.; Shah, H.; Yates, J. R., ProLuCID: an improved SEQUEST-like algorithm with enhanced sensitivity and specificity. Journal of proteomics 2015, 129, 16-24.

68. Tabb, D. L.; McDonald, W. H.; Yates, J. R., DTASelect and Contrast: Tools for Assembling and Comparing Protein Identifications from Shotgun Proteomics. Journal of proteome research 2002, 1 (1), 21-26.

69. Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P., Evaluation of Multidimensional Chromatography Coupled with Tandem Mass Spectrometry (LC/LC−MS/MS) for Large-Scale Protein Analysis: The Yeast Proteome. Journal of Proteome Research 2003, 2 (1), 43-50.

Chapter 4:

Functional Characterization of StAR smORF

Abstract

The discovery of small open reading frames (smORFs) that were missed by previous annotation efforts is now possible thanks to advanced computational and experimental methods. Most newly discovered smORFs are found upstream of longer open reading frames, and this class of upstream smORFs is called uORFs. Recent studies underscore the importance of uORFs in regulating translation of downstream ORFs or in producing functional microproteins. Here, we report a novel uORF that overlaps with the Steroidogenic Acute Regulatory (StAR) gene, a key regulator of the steroid production pathway. The StAR-uORF encodes a stable 109 amino acid long StAR microprotein (StAR-MP), which is detected in human cell lines and tissues using proteomics and biochemical methods. Expression studies at the transcript and peptide level show tissue-specific expression of StAR-MP at sites of steroidogenesis (ovaries, testes, adrenal glands). We could not detect any protein binding partners for StAR-MP, but did observe that the StAR-uORF represses translation of the downstream StAR ORF (dORF), suggesting that the StAR smORF is a new regulator of StAR translation.

4.1 Introduction

Recent technological advances have led to the discovery of hundreds to thousands of protein-coding small open reading frames (smORFs),1-9 that were missed by gene-finding algorithms due to length cutoffs.10 Most of the detected smORFs are found in the 5’-untranslated region (UTR) of mRNAs that contain an open reading frame (ORF). These upstream smORFs, referred to as uORFs, are often translated,11-15 but the microproteins they encode are rarely considered. Instead, uORFs are thought to function as post-transcriptional regulators of mRNA translation, and their presence in the 5’-UTR has been shown to reduce translation of the downstream ORF.16 In some cases, ribosomes can bypass the uORF and proceed with translation of the downstream ORF; however, in most cases the uORF regulates translation of the downstream ORF by: (1) modulating translational reinitiation at the downstream ORF, (2) impeding translation due to ribosome stalling, (3) dissociating the ribosome after uORF translation, (4) overlapping with the downstream ORF such that translation of the uORF bypasses the start codon of the downstream ORF, or (5) promoting nonsense-mediated decay due to the additional stop codon.16-19

uORFs have been shown to play an important role in the stress response, as proteins that are actively translated under stress conditions are often produced from mRNAs containing uORF(s). The uORFs repress translation under normal conditions; upon stress, however, they enable translation of the main coding sequence. One of the earliest examples of uORF-mediated translational regulation under stress is the transcriptional activator GCN4 in yeast and its homolog ATF4 in vertebrates. The GCN4 mRNA contains four short uORFs encoding di- or tri-peptides,20 and the mammalian ATF4 mRNA contains two uORFs: one 3 codons in length at the 5’ end and the other 59 codons in length, overlapping out-of-frame with the ATF4 main coding sequence (CDS).21-22 Under normal conditions, ribosomes bound to the GCN4 mRNA scan until they recognize and translate uORF1 and uORF2 and remain attached, allowing reinitiation at a downstream AUG.23 However, translation of uORF3 and uORF4 favors release of the ribosome, preventing it from reaching the GCN4 AUG start codon23-24 and thus repressing GCN4 translation. Similarly, the ribosomal subunit continues scanning after translating ATF4 uORF1 and reinitiates at uORF2. Because uORF2 overlaps the CDS, its translation terminates downstream of the ATF4 CDS start site, thus reducing ATF4 translation.21-22 During cellular stress, however, the limited availability of the ternary complex (TC) causes reinitiation at the uORFs subsequent to uORF1 to be skipped.21, 25-26 uORF1 therefore positively regulates GCN4 and ATF4 protein expression by allowing ribosomes to evade the other, inhibitory uORFs. In this way, uORFs regulate downstream ORF translation under stress conditions.11-12, 19, 22 The increased synthesis of GCN4/ATF4 protein in turn enhances transcription of their target genes, helping the cell respond to the stress condition. Hence, uORFs represent a class of prevalent translational repressors of the downstream ORF (dORF) genome-wide, a feature well conserved across species.14, 27-28

Estimates of uORF prevalence initially relied on the use of AUG start codons; however, recent ribosome profiling data indicate that non-canonical initiation codons can also serve as competent translation initiation sites.29-31 Recently, uORF-encoded peptides have also been detected in several studies.9, 13, 15, 32-33 In some cases regulation of the dORF by the uORF depends on the uORF peptide sequence,15, 34 while in others it does not.16 Most studied examples of uORFs highlight their role as cis-acting modulators of translation,33 and it remains unclear whether uORFs can produce trans-acting functional peptides. One example of a uORF-encoded peptide that functions in a process other than translational regulation is the 70 amino acid long MIEF1 microprotein (MIEF1-MP), which is translated from a smORF upstream of the Mitochondrial Elongation Factor 1 (MIEF1) gene. The mitochondrial MIEF1-MP regulates the mitochondrial fission process through the Drp1 pathway, similar to the MIEF1 protein.35 Overall, these findings highlight the prevalence of functional uORFs that regulate translation of downstream coding sequences or produce functional peptides.

Here, we report the discovery of a smORF present upstream of, and partially overlapping out-of-frame with, the Steroidogenic Acute Regulatory (StAR) gene. A 109 amino acid long StAR microprotein (StAR-MP) is translated from the smORF and is a stable part of the proteome. Expression studies at the transcript and peptide level show highly tissue-specific expression across human cell lines and tissues. We adopted a biochemical approach to identify StAR-MP protein interactors and to test whether the microprotein functions as part of a larger protein complex in cells. The lack of microprotein-protein interactions suggests that the StAR smORF might instead be involved in translational regulation of the StAR protein. Mutational analysis of the smORF start codons changed the translation levels of both the smORF and the downstream ORF (dORF), pointing towards an additional mode of StAR protein regulation.

4.2 Results and Discussion

Discovery of StAR smORF-encoded microprotein

Steroidogenic Acute Regulatory protein, commonly referred to as StAR (STARD1), is a transport protein that catalyzes the movement of cholesterol from the outer mitochondrial membrane to the inner membrane, the rate-limiting step in steroid hormone production. Cholesterol is the substrate for all steroid hormones. At the inner membrane, cholesterol is cleaved by the enzymatic activity of cytochrome P450 to produce the first steroid, pregnenolone. In this process, StAR protein is imported into the mitochondria and processed to smaller forms.36-37

Recently, we discovered a protein-coding smORF in the 5’-UTR of the StAR gene through our proteogenomic discovery workflow in K562 human bone marrow cells.38 Briefly, K562 cells were boiled in water for extraction, the extract was enriched by acid precipitation, and the sample was analyzed by shotgun mass spectrometry. The detected peptides were then searched against a custom 3-frame translated RNA-Seq database to identify novel peptide-encoding smORFs.
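The database search described above depends on translating each assembled transcript in all three forward reading frames. The following Python snippet is a minimal sketch of that idea only, not the workflow used in this work. The codon table is the standard genetic code; the function names (`translate`, `three_frame_translate`, `frames_containing`) and the toy transcript are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, built with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate a nucleotide string codon by codon; '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def three_frame_translate(seq):
    """Translations of the three forward reading frames (offsets 0, 1, 2)."""
    return [translate(seq[offset:]) for offset in range(3)]


def frames_containing(peptide, seq):
    """Offsets of the forward frames whose translation contains the peptide."""
    return [offset for offset, protein in enumerate(three_frame_translate(seq))
            if peptide in protein]


# Hypothetical toy transcript, not the StAR sequence.
toy_transcript = "GATGAAAGATGAAGAATAAC"
print(three_frame_translate(toy_transcript))
print(frames_containing("KDEE", toy_transcript))
```

In this toy case the peptide maps only to the frame at offset 1, which is how a detected peptide can be assigned to a reading frame that differs from an annotated one.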

The StAR smORF is present upstream of, and partially overlaps out-of-frame with, the downstream StAR ORF (Figure 4.1A). Interestingly, the smORF starts with three consecutive AUG start codons and its stop codon is downstream of the StAR CDS start site. The StAR smORF encodes a 109 amino acid long StAR microprotein (StAR-MP), and its translation was validated through detection of a tryptic peptide, “KDEEPPLREEAAAAAAAAAAATPPLPHLPGNNAASDIQAVR”, from StAR-MP in our proteomics data (Figure 4.1B). To confirm that the detected peptide was translated from the StAR smORF, we cloned and expressed a FLAG-tagged version of StAR-MP in HEK293T cells and detected the originally identified peptide along with multiple others from the microprotein (data not shown). The StAR smORF and the encoded microprotein are not annotated in the databases; thus, we report here the presence of the StAR smORF and microprotein for the first time and sought to understand their cellular function.

Expression Studies of the StAR smORF-encoded microprotein

The StAR protein is a key player in the steroidogenesis pathway and is expressed primarily in the steroid-producing cells of the ovary, testis and adrenal glands. We therefore tested the expression of the StAR smORF and its encoded microprotein across multiple human cell lines and tissues.

Expression of the StAR smORF at the mRNA transcript level was measured using the publicly available RNA-seq datasets from 34 human cell lines and tissues in the Human BodyMap 2.0. The genomic coordinates of the StAR smORF were used to count the reads piled up over the smORF as a measure of its expression. The analysis suggested strong tissue-specific expression of the StAR smORF transcript at sites of steroidogenesis, including adrenal glands, placenta and ovaries, while no expression was observed in HEK293T cells (Figure 4.2A; red indicates high expression and green indicates low expression). Quantitative expression data for the StAR smORF and the StAR CDS in human cell lines (K562 and HEK293T) and tissues (adrenal glands, ovary and placenta) were correlated (Figure 4.2B).
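Counting reads over a set of genomic coordinates can be sketched as a simple interval-overlap test. The Python snippet below is an illustration under stated assumptions, not the analysis code used here: alignments are represented as hypothetical `(chromosome, start, end)` tuples with half-open coordinates, and the chromosome name and positions are invented.

```python
def count_overlapping_reads(reads, region):
    """Count aligned reads that overlap a region.

    reads:  iterable of (chrom, start, end) tuples, half-open [start, end)
    region: (chrom, start, end) tuple, half-open [start, end)
    """
    chrom, start, end = region
    return sum(1 for r_chrom, r_start, r_end in reads
               if r_chrom == chrom and r_start < end and r_end > start)


# Hypothetical alignments and a hypothetical smORF interval.
toy_reads = [
    ("chrA", 100, 150),   # overlaps the region
    ("chrA", 140, 190),   # overlaps the region
    ("chrA", 300, 350),   # same chromosome, no overlap
    ("chrB", 120, 170),   # different chromosome
]
toy_smorf = ("chrA", 145, 200)
print(count_overlapping_reads(toy_reads, toy_smorf))  # prints 2
```

A raw count like this is only comparable across samples after accounting for sequencing depth and region length, a step the sketch leaves out.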

To confirm translation of the smORF, the encoded peptide was detected and quantified across human cell lines and tissues by radioimmunoassay (RIA). HEK293 cells expressing the pcDNA vector or FLAG-tagged StAR-MP were used as negative and positive controls, respectively, for detection of StAR-MP in K562 (bone marrow) and H295R (adrenal) cells (Figure 4.3A). StAR-MP was also detected in purified extracts of human ovarian and adrenal tissues (Figure 4.3B and C), confirming the existence and authenticity of StAR-MP in human tissues and ruling out the possibility that it is a cell line artifact.

Characterization of StAR-MP by Identification of Protein Interactors

Because we detected a peptide from the StAR microprotein by proteomic analysis, we hypothesized that the StAR microprotein may be functional as a peptide. Multiple functional microproteins described in the field rely on microprotein-protein interactions (MPIs) to play essential cellular roles. Using a biochemical approach, we assessed whether StAR-MP interacts with other proteins in a complex, with the aim of using that information to develop hypotheses about StAR-MP function.

We carried out an immunoprecipitation (IP) from lysates of H295R adrenal cells overexpressing C-terminal or N-terminal FLAG-tagged StAR-MP, or transfected with an empty pcDNA vector (mock), using an anti-FLAG antibody, and then analyzed the immunoprecipitates by proteomics. No proteins (except for some housekeeping proteins) were enriched in the FLAG-tagged microprotein samples over the mock sample (fold change >=2 and p-value<0.05), suggesting that the StAR microprotein is likely not functional as a peptide.
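The enrichment criterion (fold change >=2 and p-value<0.05) can be written as a simple filter. The Python snippet below is a hedged sketch: the protein names, spectral counts and p-values are invented, the p-values are assumed to have been computed beforehand by whatever statistical test is appropriate, and the pseudocount that avoids division by zero is an assumption of this sketch rather than part of the analysis described here.

```python
from statistics import mean


def enriched_proteins(proteins, min_fold=2.0, max_p=0.05, pseudocount=1.0):
    """Return names of proteins enriched in tagged samples over mock.

    proteins maps a name to (tagged_counts, mock_counts, p_value).
    The pseudocount is an assumption added to keep the ratio finite
    when a protein is absent from the mock sample.
    """
    hits = []
    for name, (tagged, mock, p_value) in proteins.items():
        fold = (mean(tagged) + pseudocount) / (mean(mock) + pseudocount)
        if fold >= min_fold and p_value < max_p:
            hits.append(name)
    return hits


# Hypothetical spectral counts (three replicates each) and p-values.
toy_proteins = {
    "PROT_A": ([10, 12, 11], [1, 0, 2], 0.01),   # high fold, significant
    "PROT_B": ([5, 6, 5], [5, 5, 6], 0.80),      # no enrichment
    "PROT_C": ([20, 2, 1], [2, 3, 2], 0.30),     # high fold, not significant
}
print(enriched_proteins(toy_proteins))  # prints ['PROT_A']
```

Requiring both conditions excludes proteins such as the hypothetical PROT_C, whose apparent enrichment is driven by a single replicate.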

Role of upstream StAR smORF in regulating downstream ORF translation

Several uORFs are involved in post-transcriptional regulation of the dORF through a variety of mechanisms. The presence of uORFs is crucial for upregulating multiple stress-responsive genes under stress conditions by evading global translational repression. We hypothesized a similar role for the StAR smORF in regulating expression of the downstream StAR ORF.

Sequence analysis of the StAR promoter region from human through mouse indicates that at least two of the three ATG start sites of the StAR smORF, located between the GATA4 and SF-1 binding sites, are conserved. However, at the mRNA transcript level (based on the transcripts annotated in RefSeq), only the human mRNA contains the smORF upstream of the StAR CDS, whereas mouse and rat lose the upstream smORF start site, possibly through a splicing event. These results might point to a mode of regulation in humans that is not present in other species such as mouse and rat. However, it is also possible that the RefSeq annotations are incomplete or incorrect.

Currently, only one human StAR mRNA transcript is annotated, and it contains the smORF partially overlapping out-of-frame with the downstream StAR ORF. We would expect that as the ribosome scans the transcript, it recognizes the 3 AUG start sites of the smORF and begins translation there, resulting in repression or loss of StAR ORF translation. To test this, we cloned the StAR transcript, from the smORF start site to the StAR ORF stop site, into the pcDNA vector. A FLAG tag was inserted after the 3 smORF start sites and a MYC tag before the StAR ORF stop site, yielding an N-terminal FLAG-tagged StAR-MP and a C-terminal MYC-tagged StAR protein upon translation in the overexpressing cells (Figure 4.4). We used 293T cells because, as suggested by our expression studies, they have no endogenous StAR-MP or StAR protein.

Measuring the FLAG-tagged and MYC-tagged protein levels by western blot, we observed robust expression of StAR-MP but low expression of the StAR protein, as expected (Figure 4.5A and B). We then mutated either all 3 ATG start sites of the smORF (No ATG) or 2 of them, leaving only one active (ATG1, ATG2, ATG3), and measured translation (Figure 4.4). The ‘No ATG’ construct, in which all the smORF start sites were mutated to ACG, resulted in negligible translation of the uORF but high translation of the dORF. With at least 1 active ATG start site in the smORF, we observed robust expression of FLAG-tagged StAR-MP and some leaky expression of the MYC-tagged StAR protein from the dORF (Figure 4.5A and B). These results raise the possibility that splice forms of human StAR transcripts lacking the uORF, an intricate mode of uORF-mediated regulation, or both allow translation of functional StAR protein from the dORF.
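The set of start-codon mutants (No ATG, ATG1, ATG2, ATG3) can be enumerated programmatically. The Python snippet below is an illustrative sketch, assuming a sequence that begins with three consecutive ATG codons; the toy sequence is hypothetical, the function name `make_start_codon_mutants` is invented for this example, and the FLAG and MYC tags of the actual constructs are omitted.

```python
def make_start_codon_mutants(seq, n_starts=3):
    """Build ATG -> ACG start-codon variants of a construct.

    seq must begin with n_starts consecutive ATG codons. Returns a dict with
    the unmodified sequence, a 'No ATG' variant (all starts mutated) and one
    'ATGk' variant per start codon in which only the k-th ATG is left intact.
    """
    seq = seq.upper()
    if seq[:3 * n_starts] != "ATG" * n_starts:
        raise ValueError("sequence does not begin with the expected ATG codons")
    rest = seq[3 * n_starts:]
    constructs = {"StAR": seq, "No ATG": "ACG" * n_starts + rest}
    for keep in range(n_starts):
        codons = ["ATG" if i == keep else "ACG" for i in range(n_starts)]
        constructs[f"ATG{keep + 1}"] = "".join(codons) + rest
    return constructs


# Hypothetical toy sequence, not the StAR smORF.
for name, sequence in make_start_codon_mutants("ATGATGATGGCTTAA").items():
    print(f"{name:7s} {sequence}")
```

Each single-ATG variant differs from the unmodified sequence by exactly two point mutations, so any change in translation can be attributed to the loss of the other two start sites.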

4.3 Conclusion and Future Directions

Here, we report for the first time the discovery of the StAR smORF and its encoded microprotein through our proteogenomic workflow. Our results show that the smORF is translated to produce a stable microprotein that can be detected by proteomics and biochemical assays in human cell lines and tissues. Expression studies across human cell lines and tissues show that smORF expression is highly tissue-specific and strongly correlated with expression of the downstream CDS. IP and proteomics analysis did not identify any MPIs through which the encoded microprotein might function in a cellular context. Similar to other uORFs that regulate the translation of dORFs under stress conditions, the StAR smORF likely regulates translation of the StAR protein. Mutations in the start codons of the smORF changed translation of both the smORF and the dORF, suggesting a new mode of StAR protein regulation that needs further investigation.

Several questions about the molecular mechanism of action and regulation still need to be addressed. For example, is this regulation transcriptional or post-transcriptional? If it is transcriptional, are there multiple StAR transcripts? If post-transcriptional, is the StAR mRNA regulated by the presence of the uORF under certain stress conditions? Do mouse/rat StAR transcripts contain a StAR smORF and encoded peptide?

Studies have underscored the role the StAR protein plays in steroidogenesis and the regulation of its expression by a host of transcription factors and/or co-regulators. However, the process of StAR regulation remains elusive.37 It is critical to understand the factors regulating the StAR gene in a tissue-specific and temporally specific manner. Future efforts to define the regulatory effect of the upstream smORF on StAR protein expression can provide more insight into StAR function.

4.4 Figures

Figure 4.1 Discovery of the StAR smORF-encoded microprotein. (A) StAR microprotein encoded by a smORF upstream and partially overlapping out-of-frame with the StAR coding sequence (B) A tryptic peptide from the StAR microprotein detected by shotgun mass spectrometry in K562 cell line validates that StAR-MP is a stable member of the proteome.

Figure 4.2 Tissue-specific expression of StAR smORF transcript. (A) Heat map depicts StAR smORF expression across 34 human cell lines and tissues using the Human BodyMap 2.0 RNA-Seq dataset (B) Quantitative expression data of the StAR smORF and CDS at transcript level in human cell lines and tissues.

Figure 4.3 Identification and quantitation of the StAR-MP in human cell lines and tissues. StAR-MP detected in (A) human K562 bone marrow and H295R adrenal cells (B) human ovarian tissues (n=7) and (C) human adrenal tissues (n=6) using radioimmunoassay confirms the existence and authenticity of StAR-MP in purified human tissues.

Figure 4.4 Schematic illustration of the StAR constructs used in Figure 4.5. The endogenous ‘StAR’ construct was cloned and the 3 smORF start sites were mutated. The ‘No ATG’ construct had all ATG start sites mutated to ‘ACG’, whereas ATG1, ATG2 and ATG3 each contained one intact ATG (as numbered) with the other 2 mutated to ‘ACG’. A FLAG tag was inserted after the smORF start site and a MYC tag before the CDS stop site, resulting in labeled translation products.

Figure 4.5 Upstream StAR smORF regulates downstream ORF translation. (A) Western blot measuring the translation of smORF and the downstream CDS. The StAR-MP and StAR protein levels were detected using the FLAG and MYC antibodies, respectively. (B) Quantitative analysis of the StAR-MP and StAR protein levels normalized to β actin levels, measured in Figure 4.5A.

4.5 Materials and Methods

Cell Culture

K562 cells were maintained, extracted and enriched for microprotein discovery by the proteogenomic workflow as described previously.38 HEK293T cells were cultured in DMEM with 10% fetal bovine serum (FBS). H295R cells were cultured in DMEM/F12 with 1% ITS+Premix (BD Bioscience #354352) and 2.5% NuSerum (BD Bioscience #355100). Cells were grown under an atmosphere of 5% CO2 at 37 °C until confluent.

Production of human StAR microprotein antiserum

All animal procedures were approved by the Institutional Animal Care and Use Committee of the Salk Institute and were conducted in accordance with the PHS Policy on Humane Care and Use of Laboratory Animals (PHS Policy, 2015), the U.S. Government Principles for Utilization and Care of Vertebrate Animals Used in Testing, Research and Training, the NRC Guide for Care and Use of Laboratory Animals (8th edition) and the USDA Animal Welfare Act and Regulations. Animal care and antigen injection and sera harvesting were performed as detailed previously.39

For production of antiserum against human StAR microprotein, three 10- to 12-week-old female New Zealand white rabbits, weighing 3.0 to 3.2 kg at the beginning of the study and procured from Irish Farms (I.F.P.S. Inc., Norco, California, USA), were used. Rabbits were injected with the peptide fragment (Cys38) human StAR (4-38) coupled to keyhole limpet hemocyanin (KLH) via maleimide per the manufacturer’s instructions (ThermoFisher, Waltham MA). The peptide, HSLQRGTFKTQNTRSRLQLRDSEAKLEGLRKDEEC, was synthesized, HPLC purified to >98%, and sequence verified by RS Synthesis (Louisville, KY). Each bleed from each rabbit was tested at multiple doses for the ability to recognize the synthetic peptide antigen; bleeds with the highest titers were further analyzed by western immunoblot for the ability to recognize the recombinantly expressed full-length StAR microprotein and by radioimmunoassay for the ability to detect endogenous StAR microprotein in tissue extracts. The StAR antiserum obtained from the rabbit (code PBL #7396, 5/05/96 bleed) with the best titer against the synthetic peptide antigen and the best ability to recognize the endogenous protein was used for all experiments.

Human StAR microprotein RadioImmunoAssay (RIA)

The peptide analog (Tyr38) human StAR (4-38), HSLQRGTFKTQNTRSRLQLRDSEAKLEGLRKDEEY, was synthesized, HPLC purified to >98%, and sequence verified by RS Synthesis (Louisville, KY). The peptide was radiolabeled with 125I using chloramine T and purified using a diphenyl HPLC column with a 0.1% trifluoroacetic acid-acetonitrile solvent system for use as a tracer in the RIA. The procedure for the StAR microprotein RIA was similar to that described in detail for inhibin subunits.40 Briefly, rabbit PBL #7396, 5/05/96 bleed, serum was used at 1/60,000 final dilution and synthetic (Tyr38) human StAR(4-38) was used as a standard. Human tissues, obtained from the Cooperative Human Tissue Network, were extracted in 1 N acetic acid/0.1 N HCl and centrifuged, and the supernatant was purified using C8 silica cartridges as described.38 Cells in culture, endogenously expressing StAR microprotein or transfected with StAR DNA constructs, were extracted and purified as described.38 Lyophilized tissue or cell extracts were reconstituted in RIA assay buffer, the pH was checked and adjusted if necessary, and three to seven dose levels were tested. Free tracer was separated from tracer bound to antibody by the addition of sheep anti-rabbit γ-globulins and 10% (wt/vol) polyethylene glycol. The minimum detectable dose and ED50 for (Tyr38) human StAR(4-38) peptide ranged from 1-2 and 15-20 pg/tube, respectively. Results were calculated using a logit/log RIA data processing program of Faden, Hutson, Munson, and Rodbard (NICHD, NIH).
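The logit/log transformation used for RIA standard curves linearizes the relationship between bound tracer and dose: logit(B/B0) = a + b ln(dose), where B/B0 is the fraction of tracer bound relative to the zero-dose standard. The ED50 is the dose at which B/B0 = 0.5, where the logit is zero, so ED50 = exp(-a/b). The Python snippet below is a simplified sketch of this calculation and is not the NIH program cited above: it uses an unweighted least-squares fit, ignores nonspecific binding, and the standard-curve values are synthetic.

```python
import math


def logit(fraction):
    """Logit transform of a bound fraction strictly between 0 and 1."""
    return math.log(fraction / (1.0 - fraction))


def fit_logit_log(doses, bound_fractions):
    """Unweighted least-squares fit of logit(B/B0) = intercept + slope * ln(dose)."""
    xs = [math.log(d) for d in doses]
    ys = [logit(b) for b in bound_fractions]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


def dose_from_binding(bound_fraction, intercept, slope):
    """Invert the fitted line to estimate the dose for a measured B/B0."""
    return math.exp((logit(bound_fraction) - intercept) / slope)


# Synthetic standards (pg/tube) generated from an ideal curve with ED50 = 16.
standard_doses = [2, 4, 8, 16, 32, 64]
standard_binding = [1.0 / (1.0 + d / 16.0) for d in standard_doses]

intercept, slope = fit_logit_log(standard_doses, standard_binding)
ed50 = dose_from_binding(0.5, intercept, slope)
print(f"slope = {slope:.3f}, ED50 = {ed50:.2f} pg/tube")
```

Unknown samples are read off the same line by passing their measured B/B0 to `dose_from_binding`.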

Co-Immunoprecipitation

FLAG-tagged (N- or C-terminal) StAR microprotein constructs or the empty pcDNA3.1 (+) vector were transfected into a 10-cm dish of H295R cells using Lipofectamine LTX according to the manufacturer’s protocol. Twenty-four hours post-transfection, cells were harvested and lysed in IP lysis buffer (Thermo catalog no. 87787) supplemented with a Roche complete protease inhibitor cocktail tablet and 1 mM PMSF. Cells were lysed on ice for 10 min followed by centrifugation at 20,000g for 20 min at 4 °C to remove cell debris. Cell lysates were added to prewashed mouse IgG agarose beads (Sigma catalog no. A0919) and rotated at 4 °C for 1 hour. The supernatants were collected and added to prewashed anti-FLAG M2 Affinity Gel (Sigma, catalog no. A2220). The suspensions were rotated at 4 °C for 2 hours and washed three times with TBST. Bound proteins were eluted with 3× FLAG peptide (Sigma, catalog no. F4799) at 4 °C for 1 hour.

Proteomic Analysis of Co-Immunoprecipitation

For proteomics, eluted samples were precipitated with trichloroacetic acid (TCA, MP Biomedicals catalog no. 196057) overnight at 4 °C or with methanol-chloroform. Dried pellets were dissolved in 8 M urea, reduced with 5 mM tris(2-carboxyethyl) phosphine hydrochloride (TCEP, Thermo catalog no. 20491), and alkylated with 10 mM iodoacetamide (Sigma I1149). Proteins were then digested overnight at 37 °C with trypsin (Promega V5111). The reaction was quenched with formic acid at a final concentration of 5% (v/v). Digested samples were analyzed on a Q Exactive Hybrid Quadrupole-Orbitrap Mass Spectrometer.

Mass Spectrometry and Data Analysis

Samples were injected directly onto a 25 cm, 100 µm ID column packed with BEH 1.7 µm C18 resin (Waters). Samples were separated at a flow rate of 300 nl/min on an nLC 1000 (Thermo). Buffers A and B were 0.1% formic acid in water and 0.1% formic acid in 90% acetonitrile, respectively. A gradient of 1-30% B over 110 min, an increase to 40% B over 10 min, an increase to 90% B over another 10 min and a hold at 90% B for 10 min was used, for a total run time of 140 min. The column was re-equilibrated with 20 µl of buffer A prior to the injection of sample. Peptides were eluted directly from the tip of the column and nanosprayed directly into the mass spectrometer by application of 2.5 kV voltage at the back of the column. The Q Exactive was operated in a data-dependent mode. Full MS1 scans were collected in the Orbitrap at 70K resolution with a mass range of 400 to 1800 m/z. The 10 most abundant ions per cycle were selected for MS/MS, and dynamic exclusion was used with an exclusion duration of 15 seconds.

Protein and peptide identification were done with Integrated Proteomics Pipeline. Tandem mass spectra were extracted from raw files using RawConverter3 and searched with ProLuCID4 against the human UniProt database appended with StAR-MP. The search space included all fully-tryptic and half-tryptic peptide candidates. Carbamidomethylation on cysteine was considered as a static modification. Data were searched with 50 ppm precursor ion tolerance and 600 ppm fragment ion tolerance. Identified proteins were filtered using DTASelect5 with a target-decoy database search strategy to control the false discovery rate at 1% at the protein level6.
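Two of the search parameters above reduce to simple arithmetic: a ppm tolerance defines an m/z window that scales with the mass, and a target-decoy strategy estimates the false discovery rate as the ratio of decoy to target hits above a score cutoff. The sketch below illustrates both ideas under simplifying assumptions. It is not the ProLuCID or DTASelect implementation, and the function names and input format are hypothetical.

```python
def ppm_window(mz, tol_ppm):
    """Return the (low, high) m/z bounds for a symmetric ppm tolerance."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def score_cutoff_at_fdr(hits, max_fdr=0.01):
    """Find the lowest score cutoff whose accepted set stays within max_fdr.

    hits: list of (score, is_decoy) pairs, where a higher score is better.
    FDR is estimated as (decoys accepted) / (targets accepted), a common
    simplification; returns None if no cutoff satisfies max_fdr.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

For a precursor at 1000 m/z, `ppm_window(1000.0, 50)` gives a window of roughly ±0.05 m/z, whereas the 600 ppm fragment tolerance corresponds to ±0.6 m/z at the same mass.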

Western Blot analysis

HEK293T cells were transfected with StAR constructs using Lipofectamine 2000 according to the manufacturer’s protocol. The next day, the cells were harvested, lysed in the IP lysis buffer, sonicated, and centrifuged. 4× SDS loading buffer was added to the cell lysates, which were separated by sodium dodecyl sulphate-polyacrylamide gel electrophoresis (SDS-PAGE) and analyzed by Western blotting using the indicated antibodies. Primary antibodies used: mouse anti-FLAG (F1804, Sigma), rabbit anti-myc tag (#2272, Cell Signaling Technology), and rabbit anti-β actin (926-42210, LI-COR). Secondary antibodies used: goat anti-rabbit IRDye® 800CW (926-32211, LI-COR) and goat anti-mouse IRDye® 800CW (926-32210, LI-COR). Blots were visualized and quantified using the LI-COR Odyssey CLx.

4.6 References

1. Ladoukakis, E.; Pereira, V.; Magny, E. G.; Eyre-Walker, A.; Couso, J. P., Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology 2011, 12 (11), R118.

2. Vanderperre, B.; Lucier, J. F.; Bissonnette, C.; Motard, J.; Tremblay, G.; Vanderperre, S.; Wisztorski, M.; Salzet, M.; Boisvert, F. M.; Roucou, X., Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS One 2013, 8 (8), e70698.

3. Mackowiak, S. D.; Zauber, H.; Bielow, C.; Thiel, D.; Kutz, K.; Calviello, L.; Mastrobuoni, G.; Rajewsky, N.; Kempa, S.; Selbach, M.; Obermayer, B., Extensive identification and analysis of conserved small ORFs in animals. Genome Biology 2015, 16 (1), 179.

4. Smith, Jenna E.; Alvarez-Dominguez, Juan R.; Kline, N.; Huynh, Nathan J.; Geisler, S.; Hu, W.; Coller, J.; Baker, Kristian E., Translation of Small Open Reading Frames within Unannotated RNA Transcripts in Saccharomyces cerevisiae. Cell Reports 2014, 7 (6), 1858-1866.

5. Aspden, J. L.; Eyre-Walker, Y. C.; Phillips, R. J.; Amin, U.; Mumtaz, M. A. S.; Brocard, M.; Couso, J.-P., Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. eLife 2014, 3, e03528.

6. Bazzini, A. A.; Johnstone, T. G.; Christiano, R.; Mackowiak, S. D.; Obermayer, B.; Fleming, E. S.; Vejnar, C. E.; Lee, M. T.; Rajewsky, N.; Walther, T. C.; Giraldez, A. J., Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. The EMBO Journal 2014, 33 (9), 981-993.

7. Crappé, J.; Van Criekinge, W.; Trooskens, G.; Hayakawa, E.; Luyten, W.; Baggerman, G.; Menschaert, G., Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics 2013, 14, 648.

8. Ingolia, N. T.; Lareau, L. F.; Weissman, J. S., Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity of Mammalian Proteomes. Cell 2011, 147 (4), 789-802.

9. Slavoff, S. A.; Mitchell, A. J.; Schwaid, A. G.; Cabili, M. N.; Ma, J.; Levin, J. Z.; Karger, A. D.; Budnik, B. A.; Rinn, J. L.; Saghatelian, A., Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat Chem Biol 2013, 9 (1), 59-64.

10. Burge, C. B.; Karlin, S., Finding the genes in genomic DNA. Curr Opin Struct Biol 1998, 8 (3), 346-354.

11. Andreev, D. E.; O'Connor, P. B. F.; Fahey, C.; Kenny, E. M.; Terenin, I. M.; Dmitriev, S. E.; Cormican, P.; Morris, D. W.; Shatsky, I. N.; Baranov, P. V., Translation of 5′ leaders is pervasive in genes resistant to eIF2 repression. eLife 2015, 4, e03971.

12. Starck, S. R.; Tsai, J. C.; Chen, K.; Shodiya, M.; Wang, L.; Yahiro, K.; Martins-Green, M.; Shastri, N.; Walter, P., Translation from the 5′ untranslated region shapes the integrated stress response. Science (New York, N.Y.) 2016, 351 (6272), aad3867.

13. Sendoel, A.; Dunn, J. G.; Rodriguez, E. H.; Naik, S.; Gomez, N. C.; Hurwitz, B.; Levorse, J.; Dill, B. D.; Schramek, D.; Molina, H.; Weissman, J. S.; Fuchs, E., Translation from unconventional 5′ start sites drives tumour initiation. Nature 2017, 541 (7638), 494-499.

14. Chew, G.-L.; Pauli, A.; Schier, A. F., Conservation of uORF repressiveness and sequence features in mouse, human and zebrafish. Nature Communications 2016, 7, 11663.

15. von Arnim, A. G.; Jia, Q.; Vaughn, J. N., Regulation of plant translation by upstream open reading frames. Plant Science 2014, 214, 1-12.

16. Calvo, S. E.; Pagliarini, D. J.; Mootha, V. K., Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proceedings of the National Academy of Sciences of the United States of America 2009, 106 (18), 7507-7512.

17. Cabrera-Quio, L. E.; Herberg, S.; Pauli, A., Decoding sORF translation – from small proteins to gene regulation. RNA Biology 2016, 13 (11), 1051-1059.

18. Ye, Y.; Liang, Y.; Yu, Q.; Hu, L.; Li, H.; Zhang, Z.; Xu, X., Analysis of human upstream open reading frames and impact on gene expression. Human Genetics 2015, 134 (6), 605-612.

19. Young, S. K.; Palam, L. R.; Wu, C.; Sachs, M. S.; Wek, R. C., Ribosome Elongation Stall Directs Gene-Specific Translation In the Integrated Stress Response. Journal of Biological Chemistry 2016.

20. Hinnebusch, A. G., Evidence for translational regulation of the activator of general amino acid control in yeast. Proceedings of the National Academy of Sciences of the United States of America 1984, 81 (20), 6442-6446.

21. Vattem, K. M.; Wek, R. C., Reinitiation involving upstream ORFs regulates ATF4 mRNA translation in mammalian cells. Proceedings of the National Academy of Sciences of the United States of America 2004, 101 (31), 11269-11274.

22. Lu, P. D.; Harding, H. P.; Ron, D., Translation reinitiation at alternative open reading frames regulates gene expression in an integrated stress response. The Journal of Cell Biology 2004, 167 (1), 27-33.

23. Gunišová, S.; Valášek, L. S., Fail-safe mechanism of GCN4 translational control—uORF2 promotes reinitiation by analogous mechanism to uORF1 and thus secures its key role in GCN4 expression. Nucleic Acids Research 2014, 42 (9), 5880-5893.

24. Gunišová, S.; Beznosková, P.; Mohammad, M. P.; Vlčková, V.; Valášek, L. S., In-depth analysis of cis-determinants that either promote or inhibit reinitiation on GCN4 mRNA after translation of its four short uORFs. RNA 2016, 22 (4), 542-558.

25. Mueller, P. P.; Hinnebusch, A. G., Multiple upstream AUG codons mediate translational control of GCN4. Cell 1986, 45 (2), 201-207.

26. Hinnebusch, A. G., Translational regulation of GCN4 and the general amino acid control of yeast. Annual Review of Microbiology 2005, 59 (1), 407-450.

27. Johnstone, T. G.; Bazzini, A. A.; Giraldez, A. J., Upstream ORFs are prevalent translational repressors in vertebrates. The EMBO Journal 2016, 35 (7), 706-723.

28. Kumar, M.; Srinivas, V.; Patankar, S., Upstream AUGs and upstream ORFs can regulate the downstream ORF in Plasmodium falciparum. Malaria Journal 2015, 14 (1), 512.

29. Ingolia, N. T.; Ghaemmaghami, S.; Newman, J. R. S.; Weissman, J. S., Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science 2009, 324 (5924), 218-223.

30. Fritsch, C.; Herrmann, A.; Nothnagel, M.; Szafranski, K.; Huse, K.; Schumann, F.; Schreiber, S.; Platzer, M.; Krawczak, M.; Hampe, J.; Brosch, M., Genome-wide search for novel human uORFs and N-terminal protein extensions using ribosomal footprinting. Genome Research 2012, 22 (11), 2208-2218.

31. Gao, X.; Wan, J.; Liu, B.; Ma, M.; Shen, B.; Qian, S.-B., Quantitative profiling of initiating ribosomes in vivo. Nature methods 2015, 12 (2), 147-153.

32. Oyama, M.; Itagaki, C.; Hata, H.; Suzuki, Y.; Izumi, T.; Natsume, T.; Isobe, T.; Sugano, S., Analysis of Small Human Proteins Reveals the Translation of Upstream Open Reading Frames of mRNAs. Genome Res 2004, 14.

33. Ito, K.; Chiba, S., Arrest Peptides: Cis-Acting Modulators of Translation. Annual Review of Biochemistry 2013, 82 (1), 171-202.

34. Ebina, I.; Takemoto-Tsutsumi, M.; Watanabe, S.; Koyama, H.; Endo, Y.; Kimata, K.; Igarashi, T.; Murakami, K.; Kudo, R.; Ohsumi, A.; Noh, A. L.; Takahashi, H.; Naito, S.; Onouchi, H., Identification of novel Arabidopsis thaliana upstream open reading frames that control expression of the main coding sequences in a peptide sequence-dependent manner. Nucleic Acids Research 2015, 43 (3), 1562-1576.

35. Samandi, S.; Roy, A. V.; Delcourt, V.; Lucier, J.-F.; Gagnon, J.; Beaudoin, M. C.; Vanderperre, B.; Breton, M.-A.; Motard, J.; Jacques, J.-F.; Brunelle, M.; Gagnon-Arsenault, I.; Fournier, I.; Ouangraoua, A.; Hunting, D. J.; Cohen, A. A.; Landry, C. R.; Scott, M. S.; Roucou, X., Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins. eLife 2017, 6, e27860.

36. Stocco, D. M., StAR Protein and the Regulation of Steroid Hormone Biosynthesis. Annual Review of Physiology 2001, 63 (1), 193-213.

37. Stocco, D., The role of the StAR protein in steroidogenesis: challenges for the future. Journal of Endocrinology 2000, 164 (3), 247-253.

38. Ma, J.; Diedrich, J. K.; Jungreis, I.; Donaldson, C.; Vaughan, J.; Kellis, M.; Yates, J. R., 3rd; Saghatelian, A., Improved Identification and Analysis of Small Open Reading Frame Encoded Polypeptides. Anal Chem 2016, 88 (7), 3967-75.

39. Sonntag, T.; Vaughan, J. M.; Montminy, M., 14-3-3 proteins mediate inhibitory effects of cAMP on salt-inducible kinases (SIKs). The FEBS Journal 2018, 285 (3), 467-480.

40. Vaughan, J. M.; Rivier, J.; Corrigan, A. Z.; McClintock, R.; Campen, C. A.; Jolley, D.; Voglmayr, J. K.; Bardin, C. W.; Rivier, C.; Vale, W., Detection and purification of inhibin using antisera generated against synthetic peptide fragments. In Methods in Enzymology, Academic Press: 1989; Vol. 168, pp 588-617.
