The persistence of haploinsufficiency and its role in genome evolution

by

Summer Ashlee Morrill

B.S. Biology Tufts University, 2015

Submitted to the Department of Biology in Partial Fulfillment of the requirements for the Degree of

Doctor of Philosophy in Biology

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2020

©2020 Summer A. Morrill. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: ______Department of Biology July 31st, 2020

Certified by: ______Angelika Amon Kathleen and Curtis Marble Professor of Cancer Research Investigator, Howard Hughes Medical Institute Thesis Supervisor

Accepted by: ______Amy Keating Professor of Biology and Biological Engineering Co-Director, Biology Graduate Committee


The persistence of haploinsufficiency and its role in genome evolution

by

Summer Ashlee Morrill

Submitted to the Department of Biology on July 31st, 2020 in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Biology

Abstract

In diploid organisms there are two copies of every gene, one from each parent. While the majority of genes are robust to deletion of one of the two copies, a subset of genes remains highly dosage sensitive, causing a significant decrease in fitness when heterozygously deleted. These genes, known as haploinsufficient (HI) genes, are present in eukaryotic species from yeast to humans. Why haploinsufficiency persists over evolutionary time is not known. To answer this, I systematically tested two existing models of haploinsufficiency: 1) the dosage balance hypothesis, which states that haploinsufficiency is caused by imbalances among protein complex members, and 2) the insufficient amounts hypothesis, which says that haploinsufficient gene products are limiting for growth. In this thesis I find that having a single extra copy of haploinsufficient genes was sufficient to cause a growth defect in Saccharomyces cerevisiae. This showed that HI genes are sensitive to both over- and under-expression. Although having an extra copy of HI genes resulted in heightened sensitivity to proteotoxic stress agents, proteotoxicity could not wholly explain the fitness defect that occurred when HI genes were heterozygously deleted. Haploinsufficiency was still present even when all members of a complex were deleted at once, restoring protein balance but not expression levels. In creating a new dosage sensitivity dataset by pooled fitness competition, I found that genes sensitive to increased copy number and HI genes are not mutually defined. All together, these data suggested that HI genes are unique among dosage sensitive genes, and that HI genes must also be rate-limiting for maximal growth. Many HI genes showed strong evidence of growth-limiting phenotypes, including ribosomal genes and genes involved in protein folding. I propose a “dosage-stabilizing” model for haploinsufficiency, which states that HI genes are unable to increase or decrease their expression without fitness penalty. This is due both to the growth-limiting nature of HI genes and to the proteotoxicity of dosage imbalance. Under these two selective pressures, HI genes have very narrow ranges of expression and are unable to modulate expression over time. This has caused haploinsufficiency to persist throughout the evolution of the eukaryotic genome.

Thesis supervisor: Angelika Amon Title: Kathleen and Curtis Marble Professor of Cancer Research


For my parents – who taught me to ask questions, find my own answers, and never be afraid to try a different path.

“Curiosity is an essential part of the way human beings learn, and it always has been. In order to learn something, we must first wonder about it.”

― Joshua R. Eyler, How Humans Learn


Acknowledgements

Foremost, I would like to thank my thesis advisor, Angelika, for her enthusiasm, support, and seemingly never-ending source of ideas. You are a wonderful scientific role model and an incredibly thoughtful mentor. Thank you for talking me through failed experiments with kindness and compassion, and always celebrating my accomplishments with a hug.

Thank you to my amazing lab mates, who helped to make lab a home away from home. I will miss seeing you every day! I am grateful to have had such smart, kind, and silly people by my side through the ups and downs of graduate school. I especially need to thank Chris for being the best first bay mate, showing me the ropes, and tolerating all of my random questions. Thank you to Ian for being my early morning breakfast buddy. And of course, a huge thank you to my lab ladies – Juliet, Allegra, Cassie, Becca, Wendy, Teresa, and Franny (honorary Amon Lab lady) – for your encouragement, excellent cooking/baking skills, and deep love of Netflix romcoms. I am so thankful for all of our adventures outside of lab that kept me sane, especially pink wine nights, old lady nights, reality TV Fridays, and pizza making parties. I look forward to celebrating all of the Amon Lab successes in the years to come.

Thank you to my thesis committee members, Steve Bell and Gene-Wei Li, for providing me with a lot of help and advice on this long journey, especially as I explored new scientific areas. Thank you to Mike Springer for generously serving as my outside committee member.

Thank you to my undergraduate mentor, Steve Fuchs, who somehow convinced me that graduate school was the right decision (he was right!), and to my undergraduate advisor, Susan Koegel, who encouraged me to apply to MIT and fueled my passion for teaching in biology.

Thank you to my fellow BioREFS for making MIT Biology such a wonderful community. You are such a huge asset to our program – I could always count on you for cookies, and to point me in the right direction through any issue.

Thank you to the MIT Teaching and Learning Lab for continuing to foster my love for teaching, and for helping me to develop my educational philosophy. Thank you to Iain Cheeseman and Anu Seshan for being such wonderful mentors and role models in teaching, and for helping me to explore a career path in education.

Thank you to the real actual doctor in my life who beat me to the degree (but still has lots of school left), Catherine Coughlin. I can’t thank you enough for your endless love and pep talks.

Finally, and most importantly, thank you to my husband, Alex, and my parents, Tami and Jeff, who have been my constant source of motivation. You have been my greatest cheerleaders, and I am so happy to get to share in this accomplishment with you – I’m sure we will shed many happy tears now that the moment is here. None of us really knew what I was getting myself into, but I’m finally done with school and I couldn’t have done it without you!


Table of Contents

Abstract
Acknowledgements
Table of Contents

Chapter 1: Introduction

The diversity of genomes

Genome imbalance
  Copy number variation
  The origin of CNVs
  Replicative mechanisms
  Non-replicative mechanisms
  Functional deletions
  Chromosome missegregation
  Consequences of genomic imbalance on fitness
  Gene expression and gene dosage
  Global cellular consequences
  Gene-specific cellular consequences
  Organismal consequences

Haploinsufficiency
  Defining haploinsufficiency in model organisms
  Conditional and morphological phenotypes
  Computational predictions of haploinsufficiency
  Early theories of
  The cause of haploinsufficiency: current hypotheses
  Evolutionary conservation of haploinsufficient genes

Haploinsufficiency and human disease
  Human case studies
  Ribosomopathies
  Tumor suppressors
  Implications for clinical treatment
  Gene therapies
  Drug screening

Concluding Remarks

References


Chapter 2: Why haploinsufficiency persists

Significance Statement

Abstract

Introduction

Results
  Haploinsufficient genes are sensitive to increased copy number
  Why are haploinsufficient genes toxic upon dosage increase?
  Most dosage sensitive genes are not haploinsufficient
  Dosage imbalance does not fully explain haploinsufficiency
  Haploinsufficient genes are rate limiting for cellular fitness
  Haploinsufficient genes have a narrow expression range

Discussion

Materials and Methods

Strain Tables

Acknowledgements

References

Chapter 3: Conclusions and Future Directions

Summary of key conclusions

Adaptations to haploinsufficiency
  Duplication of haploinsufficient genes
  Haploinsufficiency and karyotype maintenance

Concluding remarks

References


Chapter 1: Introduction


The diversity of genomes

Eukaryotic genomes range in size from 10 million base pairs in molds and unicellular fungi to >100 billion base pairs in some flowering plants (1). This 10,000-fold difference reflects large changes to genome content, regulation, and complexity. Yet, all organisms are tasked with the same objective: to faithfully maintain and transmit genetic material throughout every division, and into every new generation. How do organisms achieve this objective when there is such variability in genome size? How often do organisms get it wrong, and do these mistakes matter? In particular, does the number of copies of each gene or each chromosome impact organismal fitness?

Genome imbalance

Every genome, big or small, is characterized by a balanced karyotype, meaning that all chromosomes are inherited proportionally with respect to each other. If two copies of chromosome 1 are present, then one would expect to find two copies of chromosome 2, chromosome 3, and so on, excluding the sex chromosomes. From the perspective of protein complex stoichiometry, this makes sense. Different chromosomes encode different members of shared protein complexes and cellular pathways, and the expression of these must remain coordinated to carry out their functions. Necessarily, gene regulation must also be a part of maintaining the balance of expression. To disrupt this balance, either at the level of a single gene or a whole chromosome, is to disrupt the proportionality of the cell. Such disruption can have severe consequences for the cell, leading to decreased cellular proliferation and fitness, development of disease, or cell death. Characterizing the prevalence of genome imbalance, and studying the mechanisms for how such changes occur, has helped us to better understand the critical role that balance plays in shaping genomes.

Copy number variation

A copy number variant (CNV) is any change to the number of copies of genetic information carried by a cell or organism. Such changes can happen during DNA replication, DNA repair, or chromosome segregation, and may result in both gains and losses of genetic information. CNVs range in size from a few base pairs to millions of base pairs, and a single alteration event may span one or many genes. For the purpose of this thesis, I will also consider gene-inactivating mutational events to be CNVs, as they functionally reduce the copy number of the gene. Using this broad definition of a CNV, we can identify three main sources of copy number alteration: changes to chromosome structure, disruption of gene function, or altered karyotype. As a result, chromosomes may contain amplifications and deletions, genes may be inactivated, or whole chromosomes may be gained and lost (Fig. 1).

The detection of CNVs is achieved primarily through microarray and next-generation sequencing technologies. CNVs may be found in all cells of an organism if the variant is inherited through the germline, or may vary from cell to cell. Although analysis of bulk DNA collected from tissues provides robust detection of germline CNVs across individuals, recent advances in single-cell sequencing technologies have also allowed the detection of somatic cell CNVs within particular cells of individuals (2). Single cells collected from normal, healthy human tissues harbor megabase-scale, non-clonal CNVs at low frequencies across multiple tissue types (2). Humans are not unique in this. All branches of life, from bacteria to humans, carry CNVs that must be accommodated by cells and organisms throughout development. It is therefore crucial to understand both how CNVs occur, and what impact they have on the cells and organisms that harbor them.

[Figure 1. Panel (A) Changes to chromosome structure: reference genome A B C D E; amplification A B B C D E; deletion A C D E. Panel (B) Disruption of gene function: inactivating mutation A B C D E; inversion A C B D E. Panel (C) Changes to karyotype: euploid cell; aneuploid cell.]

Fig. 1: Types of copy number variants (CNVs).

The chromosomal DNA shown is split into five segments (A, B, C, D, E) for visual reference; these segments could contain one or multiple genes, ranging from kilobases to megabases in size. Copy number alterations focus on chromosome segment B.

(A) Changes to chromosomal structure such as the amplification or deletion of region B, increase or decrease the copy number of genes within B, respectively.

(B) Disruption of gene function is a mechanism for decreasing copy number. This could occur through a gene-inactivating mutation, or an inversion event which disrupts the function of genes lying at the boundaries of the inversion.

(C) Whole-chromosome copy number alterations lead to gain or loss of chromosomes. Cells with a normal karyotype are referred to as euploid; those which have an abnormal number of chromosomes (not a multiple of the haploid number) are considered aneuploid. Here, the aneuploid cell has gained a copy of the orange chromosome.
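
The copy number changes in Fig. 1 can be made concrete with a short sketch. The Python below is illustrative only and is not part of the analyses in this thesis: it treats a chromosome as a list of the segment labels A to E used in the figure, and the helper names (`amplify`, `delete`, `invert`, `copy_number`) are hypothetical names introduced here for the example.

```python
from collections import Counter

# Reference chromosome, using the segment labels from Fig. 1.
REFERENCE = ["A", "B", "C", "D", "E"]


def amplify(chromosome, segment):
    """Return a copy of the chromosome with a tandem duplicate of `segment`."""
    i = chromosome.index(segment)
    return chromosome[: i + 1] + [segment] + chromosome[i + 1 :]


def delete(chromosome, segment):
    """Return a copy of the chromosome with the first occurrence of `segment` removed."""
    i = chromosome.index(segment)
    return chromosome[:i] + chromosome[i + 1 :]


def invert(chromosome, start, end):
    """Return a copy with the block from `start` to `end` (inclusive) reversed."""
    i, j = chromosome.index(start), chromosome.index(end)
    return chromosome[:i] + chromosome[i : j + 1][::-1] + chromosome[j + 1 :]


def copy_number(chromosomes, segment):
    """Count copies of `segment` across a list of chromosomes (a toy karyotype)."""
    return sum(Counter(c)[segment] for c in chromosomes)


# A diploid cell carries two homologs of the reference chromosome.
print(copy_number([REFERENCE, REFERENCE], "B"))                # 2
print(copy_number([amplify(REFERENCE, "B"), REFERENCE], "B"))  # 3
print(copy_number([delete(REFERENCE, "B"), REFERENCE], "B"))   # 1
print(copy_number([REFERENCE] * 3, "B"))                       # 3 (whole-chromosome gain)
print(invert(REFERENCE, "B", "C"))                             # ['A', 'C', 'B', 'D', 'E']
```

Note that the inversion leaves the count of every segment unchanged; in this toy model it only alters copy number in the functional sense described in panel (B), when a breakpoint disrupts a gene.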

The origin of CNVs

Any event which leads to DNA breakage, misalignment, deleterious mutation, or the misappropriation of DNA from one cell into the next can lead to the formation of CNVs. Such events can result from the repair of exogenous attacks on the genome, like the DNA damage that occurs with the sun’s intense radiation, or through problems encountered during DNA replication and mitosis. The type of mutational event depends on which cellular process is affected, and where in the genome it occurs. For example, sub-chromosomal rearrangements are particularly likely to occur in repetitive areas of the genome, where templates for replication and repair can misalign, whereas whole chromosomal gains and losses are only likely to occur during cell division. Here we will explore five potential mechanisms of generating CNVs: replicative, non-replicative, functional deletion, chromosome missegregation, and whole genome duplication.

Replicative mechanisms

When normal DNA replication becomes blocked, attempts to continue DNA synthesis can lead to expansion or contraction of DNA at the site of replication fork stalling. Replication may become blocked as a result of DNA damage or the depletion of nucleotide pools. There are also many intrinsic mechanisms of fork stalling; DNA-binding proteins, transcriptional machinery, and unusual DNA secondary structures can all create barriers to fork progression that lead to blocked or stalled replication (3). For DNA synthesis to continue and the cell cycle to be completed, the cell must make a choice to either synthesize through the lesion that caused the stall (if possible) or to find another template. To synthesize across a stalled region requires a specialized DNA polymerase known as a translesion polymerase, which is capable of bypassing DNA lesions because of the increased size of its active site. Employing translesion synthesis can lead to a contraction event (Fig. 2B), deleting the affected region of the genome. Alternatively, replication forks may “skip over” the affected region until an alternative template can be found, as in template switching. Template switching is thought to resolve stalled forks in vivo through recombination-mediated progression (Fig. 2C), where partially replicated sister strands serve as templates for repair at the site of stalling (4). Template switching can result in duplication, deletion, inversion, or translocation depending on how the template used for homology is aligned, and how the new replication fork is oriented (5).


Non-replicative mechanisms

Cells can encounter a broad range of events that damage DNA throughout their lifetime. This can occur through chemical or physical means, such as through contact with carcinogenic agents or irradiation, or through the natural accumulation of oxidative stress in proliferating cells. DNA damage which results in a double-stranded break of the DNA helix is most relevant to the creation of CNVs as it requires either a rough end-joining of the broken DNA ends, or the process of seeking out a homologous template for repair. Here, we will specifically consider what happens to free DNA ends in the absence of active DNA replication. Ideally, DNA double-stranded breaks are repaired through homologous recombination with a properly aligned template, such that all genetic information is retained. In this case, no copy number variants are generated. If, however, a homologous template for repair is found but misaligned, the subsequent repair event can lead to duplication or deletion of the affected chromosomal region, increasing or decreasing its copy number in the genome, respectively (5). In the case that there is no suitable homologous template, the break must be repaired by removing regions of the surrounding DNA and ligating them back together in the process of non-homologous end-joining (NHEJ). NHEJ results in deletions of regions of the genome surrounding the DNA break, leading to a loss of genetic information.

Non-replicative duplications and deletions can also occur in the absence of exogenous DNA damage, as occurs during unequal crossover events in meiosis. In meiosis, double-stranded breaks are induced endogenously and require homology to resolve. Misalignments during crossing over result in alterations to chromosome structure (Fig. 2A).

[Figure 2. Panel (A) Unequal crossover between homologous regions of chromosomes A B C D E, yielding a duplication (A B B C D E) and a deletion (A C D E). Panel (B) Translesion synthesis.]

Fig. 2: Potential mechanisms leading to copy number alteration.

(A) Unequal crossover of chromosome arms between repetitive regions of the chromosome can lead to gene duplication or deletion events, depending on how repeats are aligned. Such misalignments can occur during meiotic crossover as depicted here, or during recombinational repair events following DNA damage during the mitotic cell cycle.

(B) Copy number changes can occur when DNA replication is blocked. In some cases, this can be due to the formation of secondary structure on the lagging strand template, which is single-stranded. If the fork continues to replicate over the extruded DNA, as with a translesion polymerase, this results in a deletion of the affected region. Such an event is often referred to as replication slippage.

(C) Exposed single-stranded DNA at a stalled replication fork can seek out a new template to copy from. The 3’ end is freed from its blocked template on the lagging strand, seeking out other regions of homology in the genome from which to replicate. This mechanism of replication is known as template switching, and can lead to a variety of copy number changes depending on the source and orientation of the new template.

Figures adapted from Hastings et al. (2009) (5).

Functional deletions

So far, the mechanisms we have focused on involve direct rearrangements of chromosomal structure. Some mutations which affect copy number, however, involve gene inactivation rather than whole-gene deletion. Ranging from point mutations to single nucleotide insertions and deletions, these mutations are relatively small in scale, but have a large effect on the gene product. A missense point mutation which changes a critical portion of the protein’s structure could prevent it from binding with essential binding partners, or disrupt the active site of the protein. Nonsense mutations introduce premature stop codons, truncating the protein product, often making it non-functional or leading to the degradation of its mRNA or protein. Small insertions or deletions of just a few nucleotides can cause a frameshift in reading out the genetic code of a protein, leading to a protein that no longer takes on the same shape or function. Without a working copy of the protein being produced, the gene copy number is effectively reduced by half, functionally deleting one of two copies of the gene. Notably, such functional deletions may also result from large-scale chromosomal rearrangements, such as inversions and translocations, when they disrupt the coding sequence of genes at the boundaries of the rearrangement.
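
To illustrate how small mutations can functionally delete a gene copy, the following Python sketch translates a short open reading frame before and after a missense, a nonsense, and a frameshift mutation. It is a minimal example under stated assumptions: the 21-nucleotide sequence is invented for illustration and does not correspond to any real gene, the `translate` helper is hypothetical, and the codon table is the standard genetic code written in the conventional TCAG order.

```python
from itertools import product

# Standard genetic code, listed in the conventional order of bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i : i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy open reading frame: ATG GCT TGG AAA GAA CTG TAA
wild_type = "ATGGCTTGGAAAGAACTGTAA"

missense = wild_type[:4] + "A" + wild_type[5:]  # GCT -> GAT (Ala -> Asp)
nonsense = wild_type[:8] + "A" + wild_type[9:]  # TGG -> TGA (Trp -> stop)
frameshift = wild_type[:3] + wild_type[4:]      # one-base deletion after ATG

print(translate(wild_type))   # MAWKEL
print(translate(missense))    # MDWKEL
print(translate(nonsense))    # MA
print(translate(frameshift))  # MLGKNC
```

The missense change alters a single residue, the nonsense change truncates the product after two residues, and the one-base deletion changes every residue downstream of the start codon.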


Chromosome missegregation

Whole chromosome changes to DNA copy number must occur through errors in mitosis or meiosis. Such errors can result from genetic perturbation to the mitotic machinery itself, or to the surveillance mechanisms that monitor chromosome attachments and tension. For example, mutations in kinetochore components that attach chromosomes to the spindle (Fig. 3C), or cohesion proteins that hold replicated sister chromatids together (Fig. 3B), can result in widespread chromosome missegregation events. Chromosome missegregation can also occur if the structure or stability of the mitotic spindle is disrupted. In particular, correct spindle formation depends on the presence of two centrosomes (or spindle pole bodies in yeast), one at each pole. If more than two centrosomes are present, a multipolar spindle develops, leading to the production of highly aneuploid cells (Fig. 3D). Mitotic spindle formation also depends on cells having kinetically dynamic microtubules. Spindle dynamics are potently altered by cold temperatures, and by microtubule-stabilizing or -destabilizing agents, leaving cells unable to form a stable mitotic spindle and segregate chromosomes. Finally, the spindle assembly checkpoint (SAC) is in place to monitor all chromosome attachments to the spindle. If chromosomes are attached properly, they experience equal tension on sister chromatids at the metaphase plate, originating from the microtubules that attach them to the spindle poles on either side of the cell. If the forces pulling on each side of the chromosome are not balanced, the SAC arrests cells until proper chromosome attachments and tension can be restored, so that chromosomes are divided evenly between daughter cells. As such, perturbation of the SAC can lead to cells being unaware of unattached or improperly attached chromosomes in mitosis or meiosis, leading to chromosome missegregation (Fig. 3A). Chromosome missegregation events occur at a frequency of 10^-4 to 10^-5, meaning that chromosome copy number alterations occur frequently even in otherwise wild-type populations. If we compare this rate of mutation to the number of cells in the human body, which is approximately 10^13, this means that ~10^8-10^9 cells have potentially undergone copy number changes throughout the course of human development.
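
The estimate above is a simple product of the number of cells and the missegregation frequency. The short Python sketch below reproduces that arithmetic; it is a back-of-the-envelope calculation only, the function name `expected_missegregated_cells` is a hypothetical helper, and treating the estimate as a single multiplication is a simplifying assumption that ignores how events accumulate across cell lineages.

```python
def expected_missegregated_cells(total_cells, frequency):
    """Rough estimate: total cell count multiplied by the event frequency."""
    return total_cells * frequency


# Approximate number of cells in the human body, as quoted in the text.
TOTAL_CELLS = 1e13

# The two missegregation frequencies quoted in the text.
for frequency in (1e-4, 1e-5):
    estimate = expected_missegregated_cells(TOTAL_CELLS, frequency)
    print(f"frequency {frequency:.0e} -> ~{estimate:.0e} cells")
```

The two frequencies give roughly 10^9 and 10^8 cells, respectively, which is the range stated in the text.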

Fig. 3: Mechanisms of chromosomal gain and loss.

(A) Inactivation of the spindle assembly checkpoint (SAC) causes cells to enter anaphase without checking that all chromosomes were properly attached and oriented.

(B) Premature loss of sister chromatid cohesion results in the release of sister chromatid pairs prior to anaphase. This leads to missegregation of the affected chromosomes.

(C) Incorrect attachments of spindle microtubules to the kinetochores, either resulting in too few or too many attachments, can lead to improper segregation of the aberrantly attached chromosome. Here, a merotelic attachment is shown, where a single kinetochore becomes attached to microtubules emanating from both spindle poles.

(D) Microtubule organizing centers, such as centrosomes, dictate the axis of spindle formation. Having the wrong number of centrosomes leads to formation of a multipolar spindle instead of a bipolar spindle, and thus results in improper segregation of chromosomes to the daughter cells.

Figure adapted from Siegel & Amon (2012) (6).

Whole genome duplication

Several times throughout eukaryotic evolution, genomes have seen increases in ploidy. These increases in ploidy are typically unstable, and are followed by widespread gene copy loss, generating a large number of CNVs at once (7). One such shift occurred in the Saccharomyces clade of budding yeast approximately 100 million years ago (Fig. 4A), leading to the rapid accumulation of CNVs (8). Approximately 20% of all genes that were duplicated during the whole genome duplication event are still maintained as paralogs in the modern descendants of the ancestral strain, including the model organism Saccharomyces cerevisiae (Fig. 4B) (9). Paralogs formed in this way are also called ohnologs, to honor Susumu Ohno, who proposed that whole genome duplication was a potential mechanism for driving evolutionary novelty through copy number variation (10). Because the copy number of so many genes is affected at once, whole genome duplication and subsequent gene loss represent a powerful and relatively swift mechanism for generating widespread CNVs.

[Figure 4. Panel (A) includes a time scale of ~10^8 years; panel (B).]

Fig. 4: Gene dosage imbalance following whole genome duplication in Saccharomyces cerevisiae.

(A) An abbreviated phylogenetic history of the Saccharomycetaceae family of yeast, depicting the whole genome duplication event in the lineage of budding yeasts.

Figure modified from Byrne et al. (2005) (9).

(B) Schematic showing how duplication of the ancestral yeast genome led to massive genomic rearrangement and gene loss. Here, the ancestral genome is shown in pink, with individual ORFs numbered for reference. Immediately following the whole genome duplication (WGD), two copies of each ORF were present, one shown in purple and the other in orange. As large-scale rearrangements occurred following the whole genome duplication, causing widespread loss of gene copies, some duplicated genes were maintained over evolutionary time as paralogs (highlighted in the bottom panel). These paralogs have altered copy number compared to the reference genome, and are interleaved throughout the modern yeast genome. Saccharomyces cerevisiae is considered a post-WGD species of budding yeast, maintaining these paralogs.

Figure adapted from Kellis et al. (2004) (7).

Consequences of genomic imbalance on fitness

Changes to DNA copy number, whether at the level of whole chromosomes or single genes, have the potential to negatively affect the fitness of organisms. From a theoretical standpoint, duplications and chromosome gains lead to increased gene expression, placing burden on transcriptional, translational, and protein homeostasis machinery. Deletions and chromosome loss events lead to the depletion of key cellular components, and can also lead to protein imbalances that must be dealt with by the cell. This section will discuss how genome balance, and its effect on gene expression, impacts cellular and organismal fitness.

In general, the fitness defect of an organism which has acquired a CNV is proportional to its degree of genome imbalance (11). For example, a cell which gains a whole chromosome is likely to show a greater fitness defect on average than one with a sub-chromosomal gain. However, we also know that changes in the expression of individual genes can contribute significantly to the fitness defect of organisms, and that the strength of the fitness defect and the range of expression across which it occurs varies widely among genes (12). It is therefore necessary to consider both the global and the gene-specific effects of copy number alterations when describing the consequences of genome imbalance on cellular fitness. Additionally, among multicellular organisms, one must also consider whether the CNV is present in one cell or many cells of the organism, and what consequence this has on organismal fitness.

Gene expression and gene dosage

To understand the consequence of genomic imbalance on fitness, we must first know what happens to the functionality of DNA regions which are affected. DNA copy number changes have been shown to cause corresponding changes to gene expression, both at the RNA and protein level (Fig. 5) (6, 13). Duplications result in increased expression of genes within the affected region, and deletions cause decreased expression. Although it is thought that some genes possess mechanisms to correct for these changes in dosage, the majority of genes are not dosage-compensated fully, or their expression does not return to wild-type levels (13–15). A notable exception to this is genes which are contained on sex chromosomes, which must be dosage-compensated to keep expression consistent between the biological sexes. In general though, DNA copy number changes are proportionally relayed into their RNA and protein products, carrying the potential to affect the cellular processes in which they play a role.

Fig. 5: Changes to DNA copy number lead to corresponding changes in gene expression.

A gain of chromosome V in yeast, as shown by microarray DNA content analysis (top), is accompanied by a 2-fold increase in expression of genes encoded on chromosome V (log2 ratio over wild-type = 1). Microarray gene expression data are used to measure RNA expression (middle), and SILAC protein profiling is used to measure protein levels (bottom).

Data from Torres et al. (2007, 2010a) (11, 16); Figure from Siegel & Amon (2012) (6).
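
The log2 ratio used in Fig. 5 can be worked through in a few lines. The Python sketch below is illustrative only: it assumes, as the text states in general terms, that expression scales proportionally with gene copy number, and it assumes that the 2-fold change in the caption corresponds to going from one copy to two. The function name `log2_ratio` is a hypothetical helper.

```python
import math


def log2_ratio(copies_altered, copies_wild_type):
    """log2 of the altered-to-wild-type copy number ratio (proportional expression assumed)."""
    return math.log2(copies_altered / copies_wild_type)


print(log2_ratio(2, 1))            # 1.0   one copy to two: 2-fold increase
print(log2_ratio(1, 2))            # -1.0  two copies to one: 2-fold decrease
print(round(log2_ratio(3, 2), 2))  # 0.58  two copies to three: 1.5-fold increase
```

On this scale, a gain and a loss of equal fold-change are symmetric around zero, which makes duplications and deletions directly comparable.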

Global cellular consequences

Large-scale dosage imbalances are known to have a number of severe physiological consequences which affect the fitness and proliferation of cells. This has been studied extensively within the context of aneuploidy, where whole chromosomes are gained or lost. There are several stresses found to be shared amongst aneuploid strains in eukaryotes, regardless of the identity of the chromosome(s) gained or lost. Shared cellular responses among aneuploid cells include proteotoxicity, the induction of a generalized stress-associated transcriptional program, and genome instability, among others (17).

Cells containing copy number alterations experience proteotoxic stress due to the increased burden placed on protein homeostasis machinery (18). The burden on protein homeostasis triggered by CNVs can originate in two ways. First, copy number gains can increase the amount of protein being produced, overburdening chaperones which aid in protein folding. Second, both gains and losses create stoichiometric imbalances among protein complex members, and these uncomplexed protein subunits tend to misfold. Therefore, depending on the amount of uncomplexed subunits and the level at which the proteins are expressed, a potentially large quantity of misfolded protein must be remedied by the cell, either through protein degradation (19), aggregation (20), or autophagy (21) (Fig. 6).

Fig. 6: Proteotoxicity due to gene dosage imbalance of protein complex members.

Members of protein complexes that are produced in stoichiometric imbalance with each other require the aid of chaperones to remain soluble, eventually engaging with protein degradation or autophagy machinery, or forming insoluble cellular aggregates in order to correct for the imbalance. Particularly with large imbalances, or with highly expressed genes, this may lead to an overwhelming of cellular protein quality control systems.

Figure adapted from Santaguida & Amon (2015) (17).
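
The stoichiometric argument can be sketched numerically. The Python below is a toy model, not a description of any experiment in this thesis: it assumes a complex whose subunits assemble in a 1:1 ratio, assumes that each subunit is produced in proportion to its gene copy number, and uses hypothetical subunit names (X, Y, Z) and a hypothetical helper, `uncomplexed_subunits`.

```python
def uncomplexed_subunits(copies):
    """Return the excess of each subunit left over after 1:1 complex assembly.

    `copies` maps subunit name to gene copy number; production is assumed
    to be proportional to copy number (arbitrary units).
    """
    # The least abundant subunit limits how many complete complexes can form.
    complexes = min(copies.values())
    return {name: amount - complexes for name, amount in copies.items()}


# Diploid wild type: every subunit is fully incorporated.
print(uncomplexed_subunits({"X": 2, "Y": 2, "Z": 2}))  # {'X': 0, 'Y': 0, 'Z': 0}

# One extra copy of X: the surplus X subunit is left uncomplexed.
print(uncomplexed_subunits({"X": 3, "Y": 2, "Z": 2}))  # {'X': 1, 'Y': 0, 'Z': 0}

# Heterozygous deletion of X: its partners Y and Z are now in excess.
print(uncomplexed_subunits({"X": 1, "Y": 2, "Z": 2}))  # {'X': 0, 'Y': 1, 'Z': 1}

# Heterozygous deletion of all three: balance is restored at a lower level.
print(uncomplexed_subunits({"X": 1, "Y": 1, "Z": 1}))  # {'X': 0, 'Y': 0, 'Z': 0}
```

In this simplified picture, both a gain and a loss of a single subunit leave uncomplexed protein behind, whereas reducing all subunits together restores balance while halving the amount of complex formed.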

The transcriptional response among aneuploid cells is well-characterized in eukaryotes. In budding yeast, aneuploid cells exhibit a common transcriptional signature known as the environmental stress response (ESR), a set of ~300 upregulated and ~600 downregulated genes that are coordinately controlled (11). Interestingly, though the source of the stress response in aneuploid cells is clearly endogenous, or genetic, the ESR is typically induced upon exogenous stresses such as heat shock or oxidative stress (22). It is thought that the ESR may represent a more generalized “slow growth” response, as most of the ESR genes also change in expression when cells are grown under nutrient-limiting conditions (23). With the strong growth defect of cells with large-scale CNVs, it is therefore unsurprising that they exhibit such a transcriptional response. A meta-analysis of aneuploid transcriptional responses indicates that there may be a shared transcriptional signature not just across aneuploid cells within a specific organism but in aneuploid organisms across species, including yeast, plants, mice, and humans (24). Such a powerful correlation across organisms suggests that the cellular consequences of genomic imbalance are shared among eukaryotes.

Cells with genome imbalances are also known to have increased genomic instability (an inability of the genome to be maintained faithfully as it is passed from cell to cell), as well as varied defects in DNA replication and repair. Genomic instability describes the frequency of genetic alterations occurring within the genome, as well as the rate of chromosome missegregation. Both of these metrics are increased in aneuploid cells (25), indicating that the presence of large-scale CNVs makes it difficult for cells to properly maintain the genome. Interestingly, this tendency toward genomic instability among cells containing CNVs puts them at further risk of gaining mutations and copy number changes, potentially leading to greater fitness defects in dosage-imbalanced cells. Despite sharing a tendency toward genomic instability, aneuploid cells vary significantly in their specific defects of DNA replication and repair. Some aneuploid cells exhibit increased S phase length, indicative of problems with DNA replication, and others show increased persistence of DNA damage into anaphase, indicating a failure to repair DNA double-stranded breaks before mitosis (25, 26). Additionally, some aneuploids show extreme sensitivity to treatment with hydroxyurea, a ribonucleotide reductase inhibitor which depletes nucleotide pools, while others are more susceptible to phleomycin, which induces double-stranded breaks. It is thought that this variation in phenotypes is due to chromosome- or gene-specific sensitivities to changes in dosage.

It is reasonable to expect that a change of any magnitude to DNA copy number, not just whole chromosomes, may affect these same core cellular processes – cell proliferation, protein homeostasis, transcription, and genome stability – to some degree. While we know that, in general, the cellular consequences of genome imbalance scale with the degree of genome imbalance (11), there are many cellular defects which cannot be explained by degree of genome imbalance. Such exceptions are likely due to the presence of specific genes within the segment of the genome gained or lost that independently affect proliferation and cellular homeostasis.

Gene-specific cellular consequences

There are several prominent examples of individual genes whose expression level has a strong effect on organismal fitness. When encoded in extra copy, the β subunit of tubulin causes a severe proliferation defect, which in some cases is even lethal (27). Similarly, the α subunit of tubulin slows growth when one copy is deleted in a diploid (28); this is likely due to the β-tubulin subunit now being in excess over its binding partner α-tubulin. The robust phenotypes of single-gene duplication and deletion events are also evident in multicellular organisms, often leading to the development of disease. A well-known example is the duplication of amyloid-β precursor protein, APP, which is thought to lead to early onset Alzheimer’s disease (29). While a gene-specific duplication event can certainly result in a strong fitness defect, as in the examples outlined above, it is much more common to find fitness defects among organisms with gene deletions. Such dosage-sensitive genes are known as haploinsufficient genes, and will be the focus of this thesis.

Organismal consequences

Any copy number alteration that is present in the germline of an organism is transmitted to all cells within that organism. This results in an entire organism that carries the CNV. While large-scale CNVs are relatively well-tolerated in single-cell organisms such as budding yeast, still allowing for proliferation and growth, they are not as well tolerated in multicellular organisms. The majority of whole-chromosome gains and losses are lethal in higher eukaryotes, including flies, worms, mice, and humans (6). In general, plants remain relatively robust to copy number changes, but even partial chromosome gains/losses can lead to deformity and stunted growth (30). Although whole-chromosomal changes are an extreme example of genome imbalance, this observation points towards the key conclusion that genome imbalance presents many more hurdles for development of multicellular organisms than for the fitness of individual cells. Whether one gene or many genes are affected by a CNV, this is important to take into consideration.

Haploinsufficiency

Haploinsufficient (HI) genes cause a measurable decrease in the fitness of diploid organisms when one copy of the gene is deleted. Simply put, heterozygous deletions of HI genes negatively impact the growth and development of eukaryotic organisms. HI genes are thought to be relatively rare, though present in all eukaryotes. Surveys across different growth conditions and carbon sources show that haploinsufficiency is context-dependent, and the genes which cause haploinsufficiency in single-celled organisms may fundamentally differ in mechanism from those HI genes that disrupt the development of multicellular organisms. Given 1) the developmental and environmental differences between unicellular and multicellular organisms, 2) the relatively high frequency of gene-inactivating mutation over time, and 3) the high penalty of haploinsufficiency, it is therefore surprising that the haploinsufficiency of these genes is strongly evolutionarily conserved. This set of observations has sparked many studies seeking to characterize HI genes, and has led to the creation of several different hypotheses about the biological basis of haploinsufficiency. This section will focus on understanding haploinsufficiency through the study of model organisms and computational biology, whereas the following section focuses on haploinsufficiency in the context of human disease.

Defining haploinsufficiency in model organisms

To date, comprehensive studies identifying HI genes have been carried out in model organisms, such as the budding yeast Saccharomyces cerevisiae and the fruit fly Drosophila melanogaster. The ease of genetic manipulation in these species lends itself to systematic heterozygous deletion of genes on a genome-wide scale.

Dosage studies in S. cerevisiae have utilized the yeast heterozygous deletion collection, a collection of diploid yeast strains in which every gene has been systematically deleted by homologous recombination at one of two copies (31). Each site of deletion contains upstream and downstream tags, or barcodes, that uniquely identify the gene of interest, allowing for high-throughput analysis (Fig. 7A). As such, the relative fitness of heterozygous deletion strains can be assessed using a pooled fitness competition, where all strains in the collection are grown together in the same flask (28). The change in relative abundance of each strain in the population over time is a measure of haploinsufficiency, reflecting the fitness defect of that strain (Fig. 7B). Although the relative fitness of strains was found to be continuously distributed, those strains which decreased in the population most sharply (>1SD from the population mean) were classified as haploinsufficient (Fig. 7C). By this criterion, the study concluded that 3% of the yeast genome (184 genes) was haploinsufficient under maximal growth conditions. HI genes were enriched for essential genes, members of protein complexes, and genes associated with ribosomal biogenesis and translation, including ribosomal proteins (28).
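
The logic of such a pooled competition can be sketched in a few lines of Python. This is only an illustration and not the analysis pipeline of the cited study: the strain names and tag counts are made up, and scoring fitness as the least-squares slope of log2 relative tag abundance per generation is an assumption chosen for simplicity. Strains are called haploinsufficient when their score falls more than one standard deviation below the pool mean.

```python
import math
from statistics import mean, pstdev


def fitness_scores(tag_counts, generations):
    """Slope of log2 relative tag abundance per generation, for each strain."""
    n_points = len(generations)
    # Total tag counts in the pool at each time point.
    totals = [sum(counts[t] for counts in tag_counts.values()) for t in range(n_points)]
    g_bar = mean(generations)
    denom = sum((g - g_bar) ** 2 for g in generations)
    scores = {}
    for strain, counts in tag_counts.items():
        log_abund = [math.log2(counts[t] / totals[t]) for t in range(n_points)]
        a_bar = mean(log_abund)
        numer = sum((g - g_bar) * (a - a_bar) for g, a in zip(generations, log_abund))
        scores[strain] = numer / denom  # least-squares slope
    return scores


def call_haploinsufficient(scores, n_sd=1.0):
    """Strains whose score lies more than n_sd standard deviations below the mean."""
    values = list(scores.values())
    cutoff = mean(values) - n_sd * pstdev(values)
    return {strain for strain, score in scores.items() if score < cutoff}


# Made-up tag counts at four time points (hypothetical, not data from any study).
generations = [0, 5, 10, 20]
tag_counts = {
    "het_del_1": [1000, 1020, 990, 1010],
    "het_del_2": [1000, 980, 1010, 1000],
    "het_del_3": [1000, 1000, 1005, 995],
    "het_del_4": [1000, 1010, 1000, 990],
    "het_del_5": [1000, 990, 1000, 1005],
    "het_del_6": [1000, 700, 480, 240],  # steadily drops out of the pool
}

scores = fitness_scores(tag_counts, generations)
hi_strains = call_haploinsufficient(scores)
print(sorted(hi_strains))
```

Because fitness is continuously distributed, the set of strains called haploinsufficient depends directly on the chosen cutoff (here the hypothetical `n_sd` parameter), which is worth keeping in mind when comparing HI gene sets across studies.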

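[Fig. 7 panels: (A) deletion cassette with UPTAG, kanMX4, and DNTAG in place of the yeast ORF; (B) relative tag abundance (log2) over time for neutral fitness (not HI) and negative fitness (HI) strains; (C) fitness distribution with the >1SD threshold.]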

Fig. 7: Identification of haploinsufficient genes by fitness competition.

(A) Yeast deletion by homologous recombination. Deletions contain unique tagged regions upstream (UPTAG) and downstream (DNTAG) of the kanMX4 deletion cassette, which provides resistance to the antibiotic G418. These 20bp tags are flanked by universal primer sequences that can be used for all deletion strains.

Figure adapted from Giaever & Nislow (2014) (32).

(B) A theoretical plot showing abundance of heterozygous deletion tags over time in a pooled fitness competition. Strains with decreasing abundance over time, represented by the black line, have decreased fitness. These strains likely contain a deletion of a haploinsufficient gene. The majority of other strains show neutral fitness and remain constant over time, represented by the grey line.

(C) Average distribution of fitness values for the competition of heterozygous deletion strains over 20 generations. Haploinsufficiency threshold = >1SD from the mean. Strains were grown continuously at 30°C in YPD.

Figure modified from Deutschbauer et al. (2005) (28).

Haploinsufficient regions of the Drosophila genome were first identified through a series of ordered deletions and segmental aneuploidies (33). Many of these regions caused severe growth defects of the animal when deleted in single copy, a so-called Minute phenotype characterized by slow development, abnormal body morphology, and low fertility and viability (Fig. 8A) (33). Later studies showed that all of the 65 known Minute phenotypes can be attributed to loss of cytoplasmic ribosomal function, or defects in translation initiation (34).

Although flies are in general robust to large-scale changes in gene dosage, tolerating imbalances of up to 1% of the genome (35), there still remains a subset of highly haploinsufficient genes which cause developmental defects in heterozygous mutants. Remarkably, despite differences in the constraints of unicellular and multicellular development and more than 1 billion years of evolutionary separation between the species, we see that HI genes in D. melanogaster and S. cerevisiae overlap significantly in function. D. melanogaster is not unique among model organisms in possessing haploinsufficient phenotypes. Studies in Arabidopsis thaliana, Danio rerio, Caenorhabditis elegans, and Mus musculus all show evidence of haploinsufficiencies, particularly among genes involved in ribosome function and biogenesis (Fig. 8B). More detailed discussions of the evolutionary conservation of haploinsufficiency and its manifestation in human disease are to follow.

[Fig. 8 panels A and B; panel A labels: WT, mfl1/+]

Fig. 8: Haploinsufficient phenotypes in multicellular model organisms.

Heterozygous deletion or inactivation of proteins involved in translation and ribosomal biogenesis causes a Minute phenotype in many model organisms, including Drosophila melanogaster (A) and Arabidopsis thaliana (B). In D. melanogaster, these mutations are characterized by small body size, shorter bristles, misrotated genitalia, notched wings, and reduced sex combs (34, 36, 37). Here, we see a heterozygous mutant for ribosomal biogenesis factor Nop50, a mutation also known as minifly (mfl1) (A). In A. thaliana, Minute phenotypes include reduced growth, aberrant leaf morphology, and delayed flowering (38, 39). Heterozygous mutations for ribosomal protein Rpl7b are shown in two different genetic backgrounds (B).

Figure from Giordano et al. (1999) (36) and Horiguchi et al. (2011) (39).

Conditional and morphological phenotypes

Though the original characterization of haploinsufficient genes in budding yeast was performed under maximal growth conditions (in rich medium at 30°C), several studies have since gone on to define haploinsufficiency in a variety of growth-limiting conditions. Interestingly, in a direct comparison of the fitness profiles for heterozygotes grown in rich medium vs. minimal medium, it was found that there was very little overlap in gene sets defined as haploinsufficient under these two conditions (28). Upon reflection, this makes sense if we consider that the “rate-limiting factors” for a cell undergoing maximal proliferation are likely different from those for a cell that is nutrient-restricted. Glucose-limited, ammonium-limited, and phosphate-limited cultures also identified different subsets of genes that are haploinsufficient, representing an additional 12-20% of the genome (40). Intriguingly, a comparison of HI genes across the three nutrient-limited conditions shows significant overlap. This suggests that limiting the steady-state growth rate of organisms, regardless of which type of medium is used, reveals a common profile of haploinsufficient genes (40). The set of HI genes identified under these slow growth conditions is still different, however, from the set of genes that are haploinsufficient under maximal growth conditions. How exactly haploinsufficiency depends on growth context has yet to be discovered.

Though competitive fitness is the predominant measure of haploinsufficiency in unicellular organisms, recent efforts have been made to look more closely at the morphological defects caused by heterozygous gene deletion as well. Microscopic analysis of single-cell phenotypes in yeast with heterozygous deletions revealed morphological defects for 60% of all essential genes and ~33% of the tested nonessential genes in rich medium (41). This number expanded when cells were analyzed in minimal medium (41). Cells were tested across 501 different computationally-scored metrics such as the morphology of cell walls, nuclei, or cytoskeletal structures. If morphological defects are equated with functional defects, then we can extrapolate that at least 50% of the yeast genome is haploinsufficient from this study alone. Clearly, a comprehensive understanding of haploinsufficiency across the genome will require fitness to be measured under a variety of growth conditions, integrating several measures of growth and morphology.

The conditional and morphological studies of haploinsufficiency in yeast are perhaps more analogous to how haploinsufficiency must be defined in multicellular organisms. The development of multicellular organisms introduces cells to a variety of conditions, depending on cell type, tissue environment, and motility constraints. Different cell types also do not express the same genes as they differentiate and mature, meaning that a particular HI gene and its corresponding phenotypes may only affect certain tissues. It is not until we observe the morphological characteristics of the fully developed organism that haploinsufficiency can truly be assessed. Because defining a metric for multicellular haploinsufficiency is so complex, many groups have explored computational mechanisms for classifying HI genes.

Computational predictions of haploinsufficiency

Combining information about copy number variation with known properties and features of HI genes, several groups have tried to computationally predict the likelihood of haploinsufficiency genome-wide. One such study in yeast utilized properties like degree of protein-protein interaction, degree of genetic interaction, rate of evolutionary conservation, and levels of expression to define a set of novel HI genes (42). Though the predicted gene set was highly enriched for haploinsufficiency when tested across a variety of growth conditions (48% compared to 3% of the genome), there were many false positives, perhaps suggesting that more growth conditions need to be probed to get a complete picture of haploinsufficiency (42). Studies using human genetic data have implemented similar methods of HI gene prediction, incorporating exome sequencing data from many thousands of individuals (43–45). Most recently, haploinsufficiency has been defined by a loss of function intolerance score (pLI score) that calculates the probability of finding protein-truncating variants across 60,706 human exomes (46). This study concludes that 3,230 human genes are likely to be haploinsufficient, representing 17% of the genome. Of these predicted HI genes, 70% have no known phenotype in human disease. Though further validation is needed, this comprehensive dataset is a useful tool for understanding the genomic landscape of haploinsufficiency in humans and other vertebrates, and may be able to make predictions for disease-causing variants in humans.

Early theories of dominance

For nearly a century scientists and mathematicians have tried to formulate a theory to explain the existence of haploinsufficiency. In the 1920s, mathematician R.A. Fisher found that dominance of the wild-type allele was much more common than dominance of the mutant allele in mutant populations of flies (47). In other words, it was rare for a defect to be observed when the mutation was present in only one of the two copies, as it is for HI genes. He predicted that dominance of mutant alleles should ultimately be selected against in heterozygotes, and that haploinsufficiency merely represented a failure of the wild-type allele to maintain protective dominance (47). This idea was ultimately disproven by calculations showing that the frequency of haploinsufficiency does not decrease in populations over time (48), and the observation of equivalent rates of haploinsufficiency in organisms that are primarily haploid, but can be cultured as diploids (49). Later theories invoked a more physiological explanation of haploinsufficiency: the function of the gene dictates its sensitivity to changes in dosage (50). For example, metabolic enzyme function should remain robust in response to changes in dosage because pathway flux changes negligibly, while more structural proteins remain vulnerable to gene copy loss. This is supported by the genome-wide studies of haploinsufficiency in yeast, described earlier (28), which showed that HI genes are more likely to encode structural components and members of protein complexes of the cell, rather than independently-acting enzymatic proteins. How these properties of haploinsufficient genes relate to the cause of their haploinsufficiency is still under investigation.

The cause of haploinsufficiency: current hypotheses

Reducing gene dosage by half, as would occur with deletion of a single copy, results in a concordant 50% decrease in gene expression (13). One hypothesis for the origin of haploinsufficiency suggests that the decrease in fitness caused by HI gene deletion could be due to insufficient amounts of the protein being produced (Fig. 9A) (28, 51). In other words, there is a biological threshold that must be reached to maintain wild-type function, and HI gene expression falls below this threshold when one copy is deleted. It is thought that developmental factors such as transcription factors may behave in this way, unable to mount a normal transcriptional response at half dosage (52). The insufficient amounts hypothesis is also supported by the observation that translation-associated factors, including ribosomal proteins, are enriched among HI genes under maximal growth conditions, but are not considered HI once growth rate is slowed in minimal medium. This observation suggests that there is a threshold amount of translational capacity needed for maximal growth that is not necessary when cells are growing more slowly (28).

A second hypothesis for haploinsufficiency focuses on the role of stoichiometric imbalance. Because HI genes are enriched for members of protein complexes, perhaps it is the dosage balance of HI genes that matters, and is disrupted upon HI gene deletion (Fig. 9B) (53, 54). The function of a protein complex depends on all its members coming together in set ratios. A 50% decrease in expression of a single subunit, as would occur with heterozygous deletion, would be expected to decrease the function of the complex by at least 50%, depending on complex stoichiometry, assembly, and rates of degradation (Fig. 10) (55). The idea of dosage balance may also apply when thinking about cellular signaling pathways, where stoichiometry of upstream components affects the function of downstream components in the pathway, ultimately affecting pathway flux (56, 57).
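
As a rough illustration of this reasoning, the Python sketch below treats complex formation as limited purely by the scarcest subunit. It ignores assembly order, degradation, and any feedback on expression, and the function name, subunit labels, and expression units are all hypothetical.

```python
def assemble(expression, stoichiometry):
    """Return (complete complexes, uncomplexed subunits left over).

    expression:    dict of subunit name -> units of protein produced
    stoichiometry: dict of subunit name -> copies required per complex
    Assumes assembly is limited only by the scarcest subunit.
    """
    complexes = min(expression[name] // copies for name, copies in stoichiometry.items())
    leftover = {
        name: expression[name] - complexes * copies
        for name, copies in stoichiometry.items()
    }
    return complexes, leftover


# Hypothetical ABB heterotrimer: one A subunit and two B subunits per complex.
stoichiometry = {"A": 1, "B": 2}
scenarios = {
    "wild type": {"A": 100, "B": 200},
    "one copy of A deleted": {"A": 50, "B": 200},
    "one extra copy of A": {"A": 150, "B": 200},
}

results = {label: assemble(expr, stoichiometry) for label, expr in scenarios.items()}
for label, (complexes, leftover) in results.items():
    print(f"{label}: {complexes} complexes, uncomplexed subunits {leftover}")
```

In this simplest case, halving subunit A halves the number of complexes and leaves B in excess, while an extra copy of A leaves the number of complexes unchanged but produces uncomplexed A. Both directions of imbalance therefore generate orphan subunits, but only the deletion reduces complex abundance. Assembly effects of the kind shown in Fig. 10 are not modeled here.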

Though it is known that members of protein complexes are enriched for sensitivity to gene overexpression (53), a systematic comparison of the sensitivity of HI genes to both under-expression and over-expression has not yet been undertaken to test the dosage balance hypothesis explicitly. There are, however, several existing datasets that have looked at sensitivity to gene overexpression, particularly in yeast (58–60). Genes sensitive to dosage increases as defined by these datasets show very little overlap with the 3% of yeast genes defined as haploinsufficient. Importantly, the level of gene overexpression and the growth conditions differed widely from the conditions under which haploinsufficiency was assayed. To more fully understand the cause of haploinsufficiency, a systematic single-copy overexpression library is needed. Such experiments are the focus of this thesis work.

[Fig. 9 panels: (A) Insufficient amounts hypothesis; (B) Dosage imbalance hypothesis, comparing 2N and 2N-1 cells for gene A and gene B and their protein products A and B.]

Fig. 9: Prevailing hypotheses of haploinsufficiency.

(A) The insufficient amounts hypothesis of haploinsufficiency. Loss of function mutations that inactivate a single copy of the haploinsufficient gene lead to a decrease in its expression. This 50% decrease in expression is insufficient to maintain wild-type functionality of the organism, decreasing fitness.

Figure from Rice & McLysaght (2017) (61).

(B) The dosage balance hypothesis of haploinsufficiency. Loss of a haploinsufficient gene (green) creates imbalances among protein complex members, leading to unassembled complexes which must be degraded or sequestered. This burden on protein homeostasis machinery leads to a decrease in fitness.

Fig. 10: Degree of dosage imbalance is dependent on complex stoichiometry.

The configuration of protein complexes, and how their subunits are assembled, can affect the level of functional complex that is able to form. This is true even with identical complex stoichiometry: here, a 1:2 heterotrimeric complex (ABB) is shown in two different configurations (with and without subunit A forming a bridge).

Figure from Veitia & Potier (2015) (62).

Evolutionary conservation of haploinsufficient genes

Haploinsufficient genes show much higher rates of conservation across eukaryotic species than the genome average, independent of gene essentiality or fitness of the null mutant (Table 1) (63). This suggests that HI genes are critical for the function of eukaryotes, and must be maintained in the genome over time. We also know that HI genes are likely to remain haploinsufficient over time: approximately 50% of genes which are haploinsufficient in yeast are also haploinsufficient in humans (n=83, P < 10^-4) (63). It is curious that across billions of years of evolutionary time HI genes have not modulated their expression to account for insufficiency of the single copy, especially given that gene inactivation is not uncommon on the evolutionary time scale. In the context of the insufficient amounts hypothesis, if each copy of the gene were to increase expression two-fold, then deletion of a single copy would no longer dip its expression below the necessary biological threshold for wild-type growth of the organism. Thus, an HI gene would no longer be haploinsufficient if its expression was modulated in this way. In the context of the balance hypothesis, however, both increases and decreases in the expression of HI genes could negatively impact growth, as stoichiometry is disrupted in either direction. A deeper mechanistic model of haploinsufficiency is needed to be able to understand the evolution of HI genes, and their impact on the eukaryotic genome.
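
The contrast between the two hypotheses can be made concrete with a toy calculation in Python. The functional forms below are assumptions chosen only for illustration (a saturating threshold for the insufficient amounts hypothesis and a symmetric penalty around an optimum for the balance hypothesis); the names and units are hypothetical and are not taken from the cited work.

```python
def fitness_insufficient_amounts(expression, threshold):
    """Fitness is reduced only when expression falls below the required threshold."""
    return min(1.0, expression / threshold)


def fitness_dosage_balance(expression, optimum):
    """Fitness falls as expression departs from the optimum in either direction."""
    return max(0.0, 1.0 - abs(expression - optimum) / optimum)


REQUIRED = 2.0  # arbitrary units: the output of two ancestral gene copies


def compare(per_copy_expression):
    """Fitness under each toy model for a diploid and a heterozygous deletion."""
    out = {}
    for label, copies in [("two copies", 2), ("one copy", 1)]:
        expression = copies * per_copy_expression
        out[label] = (
            fitness_insufficient_amounts(expression, REQUIRED),
            fitness_dosage_balance(expression, REQUIRED),
        )
    return out


ancestral = compare(1.0)  # each copy makes 1 unit
doubled = compare(2.0)    # each copy has evolved two-fold higher expression

for name, result in [("ancestral", ancestral), ("doubled", doubled)]:
    for label, (amounts, balance) in result.items():
        print(f"{name}, {label}: insufficient-amounts={amounts}, balance={balance}")
```

Under the threshold model, doubling per-copy expression rescues the heterozygote at no cost to the diploid, so haploinsufficiency would be expected to disappear. Under the balance model, the same change rescues the heterozygote only by penalizing the diploid, which is the situation in which haploinsufficiency could persist.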

Table 1: Retention of haploinsufficient gene orthologs across model eukaryotes (63).

Haploinsufficient genes are highly conserved across plants, fungi, and animals. Each row is a comparison between the Saccharomyces cerevisiae genome and the published genome listed. The significance of HI gene conservation was determined using the number of HI genes with orthologs in a given species (based on HI genes in S. cerevisiae), compared to the number of genes with orthologs across the genome. The number of orthologs retained is significantly higher for haploinsufficient genes than for the genome average across representative eukaryotic species (chi-squared test P<0.05).
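
A comparison of this kind can be sketched as a 2x2 chi-squared test in Python. The counts below are invented purely for illustration and are not the values in Table 1; the layout of the table (HI genes versus all other genes, with or without an ortholog) and the absence of a continuity correction are also assumptions.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows are HI genes and all other genes,
# columns are genes with and without an ortholog in the comparison species.
hypothetical_counts = [[80, 20], [400, 600]]
statistic, p_value = chi_squared_2x2(hypothetical_counts)
print(f"chi-squared = {statistic:.2f}, significant at 0.05: {p_value < 0.05}")
```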

Haploinsufficiency and human disease

There are ~300 haploinsufficient genes in humans known to cause disease (64), though computational predictions suggest the true number of HI genes may be up to 10 times higher (43–45). It is therefore important to consider 1) what groups of diseases HI genes are most likely to cause, and 2) how clinicians can use an understanding of haploinsufficiency to approach treatment for disease.

Human case studies

Systematic studies of haploinsufficiency, such as the fitness profiling that has been done in model organisms, are not possible in humans. However, studies of inherited deletion syndromes can yield important information about the organismal effects of heterozygous gene deletions. It is estimated that thousands of individual case studies have been done which could lead to the discovery of HI genes in humans, and several groups have worked to compile this information to better understand the range of haploinsufficiencies humans face (43, 64). Among human HI genes there are two groups which stand out as major contributors to human disease: ribosomal genes and tumor suppressor genes, whose heterozygous deletion leads to the development of ribosomopathies and cancer, respectively.

Ribosomopathies

Ribosomopathies are the set of diseases caused by haploinsufficiency of structural proteins of the ribosome, or of crucial ribosomal biogenesis factors. Decreased production of a single ribosomal protein can lead to defects in ribosomal assembly and reduce overall abundance of ribosomes in the cell, leading to reduced levels of protein synthesis (65). Such disorders are developmental in nature, affecting many different tissue types, including major defects in erythropoiesis and skeletal development. A commonly cited example of a ribosomopathy is Diamond-Blackfan anemia, a disease that can cause limb and craniofacial abnormalities, urogenital malformations, heart defects, and growth retardation. Though the physical deformities caused by the disease are highly variable, almost all patients present with chronic anemia due to a decrease in erythroid progenitors (66). Diamond-Blackfan anemia was originally thought to be linked to a mutation of the small ribosomal subunit protein RPS19, but is now known to be caused by mutations in at least 14 different proteins of the small and large ribosomal subunits (66, 67).

Tumor suppressors

Tumor suppressor genes are the protectors of our cells, preventing abnormal cell cycle progression and loss of cell cycle checkpoint function that lead to the development of cancer. At the same time, tumor suppressor genes represent a major vulnerability in human health and disease when their function is reduced. Although mutations to tumor suppressor genes are commonly thought to promote cancer progression only after both alleles are inactivated (68), a handful of tumor suppressor genes are known to be haploinsufficient, causing cellular defects when just one of two copies is lost or inactivated (69). PTEN, a phosphatase that negatively regulates the PI3K pathway, is known to be inactivated in a broad spectrum of gliomas, melanomas, and carcinomas. Heterozygous deletion of PTEN, without loss of heterozygosity, has been shown to promote tumor progression in mice and in humans, particularly in prostate cancer (70). Models for pancreatic and intestinal cancers have also shown that heterozygous deletion of SMAD4, a key regulator of the TGF-β and BMP pathways, accelerates tumor initiation in mice prior to loss of heterozygosity (71). Importantly, many DNA repair genes that affect genomic stability are also haploinsufficient, which could further increase cellular mutation rates and accelerate cancer progression (69, 72).

[Fig. 11 panels: Two-hit; Haploinsufficiency.]

Fig. 11: Haploinsufficiency in tumorigenesis, compared to the two-hit model.

The two-hit hypothesis of tumorigenesis says that a tumor suppressor gene (TSG) only plays a role in the formation of cancer once both copies have been inactivated, or “hit” (68). With haploinsufficiency, inactivation of a single copy of the TSG is enough to either confer a selective advantage to tumor cells, or cause genetic instability that contributes to tumorigenesis upon accumulation of additional tumor-promoting events.

Figure from Fodde & Smits (2002) (69).

Implications for clinical treatment

How can we use our understanding of haploinsufficiency and its origins to better inform our treatment of human disease? Given the causal role that haploinsufficiency plays in many human diseases, it is first worth considering how to develop gene therapies that can directly counteract them. Second, scientists have also been able to exploit the vulnerabilities of cells with haploinsufficiencies to identify targets of novel drugs. Both measures have the potential to lead to improved patient outcomes and prognosis, and should be the subject of further study.

Gene therapies

Whether haploinsufficiency is caused by insufficient amounts of protein product or an imbalance in components of key protein complexes and cellular pathways, the clinical solution is the same: restore expression of the affected gene to wild-type levels. Because haploinsufficiency necessarily only affects one of the two copies of a gene in a diploid, there remains a wild-type copy to work with in altering gene expression. Groups have taken a variety of approaches to altering gene expression of the functional wild-type copy, including CRISPR-mediated promoter activation (CRISPRa) (73). Alternative approaches include augmenting the function of the affected enzyme or protein. For example, the decreased activity of a positive regulator of H3K4 monomethylation can be restored using a histone demethylase inhibitor (74). Importantly, although both of the existing models for haploinsufficiency advise the same course of treatment (restoring gene function), they do differ in how the treatment should be administered. The dosage balance hypothesis suggests that overexpression of haploinsufficient gene products could also impact organismal fitness, as both over- and under-expression create protein imbalances. If this is the case, clinicians would want to ensure that gene therapies do not “overshoot” their targeted range of expression when restoring function. The insufficient amounts hypothesis makes no such prediction. It will therefore be critical for scientists to gain a full understanding of the origins of haploinsufficiency before implementing gene therapies as a regular form of treatment for these diseases.

Drug screening

Identifying the mechanism of action of novel drugs and medications is a crucial step toward improving clinical treatment. It can help physicians better understand drug toxicities and interactions, classify patients most likely to respond to a treatment, and identify other diseases where the drug may prove useful. A robust way to identify drug targets is through drug screening of genome-wide deletion collections: deletion of key molecular components involved in a drug’s mechanism of action can sensitize cells to treatment with the drug. In particular, it is useful to screen heterozygous deletion collections because even targets which are essential can be identified. For example, cells with heterozygous deletions of genes encoding tubulin are highly sensitive to microtubule depolymerizing agents such as benomyl, confirming its mechanism of action (75). Screening of heterozygous deletion collections is called haploinsufficiency profiling (HIP), and has identified potential drug targets for over 1800 known and novel compounds (75–77). In this way, understanding haploinsufficiency helps to identify sensitivities in the context of drug treatment that we can capitalize on in medicine to improve health outcomes.

Concluding Remarks

Maintaining balance within the genome is critical for organismal growth and survival. Copy number variations, whether gene-level or chromosome-level, have the potential to impact cellular fitness, disrupting key cellular processes such as protein homeostasis and cell-cycle progression. Certain genes, known as haploinsufficient genes, are especially sensitive to changes in DNA copy number. It is unclear why the expression of these genes in particular, which are a minority in the genome, has such a large effect on fitness. Perhaps there is a biological threshold for these critical gene products, or maybe their stoichiometry is particularly important for maintaining optimal organismal fitness. Despite the vulnerability that they pose for organisms, haploinsufficient genes are maintained over evolutionary time. Genes which are haploinsufficient in yeast remain haploinsufficient in humans, where they cause a variety of diseases. It is therefore critical for us to understand the cause of haploinsufficiency, and determine why it has persisted over evolutionary time.

In this thesis, I find that the expression of haploinsufficient genes is limited by the toxicity of their overexpression. I have created a comprehensive dosage sensitivity dataset, comparing the growth of strains where a single copy of a gene is deleted to strains where a single copy of the same gene is overexpressed, all under the native promoter. From this dataset I conclude that while all haploinsufficient genes confer a growth disadvantage when subtly overexpressed, the reverse is not true. Many genes exist, including genes encoding known protein complex members, that impair proliferation when subtly overexpressed but not when heterozygously deleted. It is likely that the source of overexpression toxicity of haploinsufficient genes has to do with defects in protein homeostasis, caused by stoichiometric imbalance of highly expressed genes.

What sets haploinsufficient genes apart among dosage sensitive genes is that they are also limiting for important cellular processes when under-expressed. This means that haploinsufficient genes are limiting for maximal cell growth, yet their expression cannot increase due to the toxicity of overexpression. As a consequence, haploinsufficient genes have become dosage-stabilized, possessing very narrow ranges of gene expression across cells in a population compared to other genes in the genome. I conclude that haploinsufficient genes are evolutionarily “stuck”, unable to modulate their expression over time, perhaps explaining the conservation of haploinsufficiency from yeast to humans.

References

1. Schubert I, Vu GTH (2016) Genome Stability and Evolution: Attempting a Holistic View. Trends Plant Sci 21(9):749–757.

2. Knouse KA, Wu J, Amon A (2016) Assessment of megabase-scale somatic copy number variation using single cell sequencing. Genome Res 26:376-384.

3. Mirkin E V, Mirkin SM (2007) Replication Fork Stalling at Natural Impediments. Microbiol Mol Bio R 71(1):13–35.

4. Ghosal G, Chen J (2013) DNA damage tolerance: a double-edged sword guarding the genome. Transl Cancer Res 2(3):107–129.

5. Hastings PJ, Lupski JR, Rosenberg SM, Ira G (2009) Mechanisms of change in gene copy number. Nat Rev Genet 10(8):551–564.

6. Siegel JJ, Amon A (2012) New insights into the troubles of aneuploidy. Annu Rev Cell Dev Biol 28:189–214.

7. Kellis M, Birren BW, Lander ES (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428:617–624.

8. Wolfe KH, Shields DC (1997) Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713.

9. Byrne KP, Wolfe KH (2005) The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res 15(10):1456–1461.

10. Ohno S (1970) Evolution by Gene Duplication (George Allen and Unwin, London, UK).

11. Torres EM, et al. (2007) Effects of Aneuploidy on Cellular Physiology and Cell Division in Haploid Yeast. Science 317:916–924.

12. Keren L, et al. (2016) Massively Parallel Interrogation of the Effects of Gene Expression Levels on Fitness. Cell 166:1282–1294.

13. Springer M, Weissman JS, Kirschner MW (2010) A general lack of compensation for gene dosage in yeast. Mol Syst Biol 6(368).

14. Stingele S, et al. (2012) Global analysis of genome, transcriptome and proteome reveals the response to aneuploidy in human cells. Mol Syst Biol 8(608).

15. Torres EM, Springer M, Amon A (2016) No current evidence for widespread dosage compensation in S. cerevisiae. Elife 5:e10996.

16. Torres EM, et al. (2010) Identification of aneuploidy-tolerating mutations. Cell 143:71–83.

17. Santaguida S, Amon A (2015) Short- and long-term effects of chromosome mis-segregation and aneuploidy. Nat Rev Mol Cell Biol 16:473–485.

18. Oromendia AB, Amon A (2014) Aneuploidy: implications for protein homeostasis and disease. Dis Model Mech 7:15–20.

19. Dephoure N, et al. (2014) Quantitative proteomic analysis reveals posttranslational responses to aneuploidy in yeast. Elife 3:e03023.

20. Brennan CM, et al. (2019) Protein aggregation mediates stoichiometry of protein complexes in aneuploid cells. Genes Dev 33:1031–1047.

21. Santaguida S, Vasile E, White E, Amon A (2015) Aneuploidy-induced cellular stresses limit autophagic degradation. Genes Dev 29:2010–2021.

22. Gasch AP, Werner-Washburne M (2002) The genomics of yeast responses to environmental stress and starvation. Funct Integr Genomics 2:181–192.

23. Brauer MJ, et al. (2008) Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast. Mol Biol Cell 19:352–367.

24. Sheltzer JM, Torres EM, Dunham MJ, Amon A (2012) Transcriptional consequences of aneuploidy. Proc Natl Acad Sci USA 109:12644–12649.

25. Sheltzer JM, et al. (2011) Aneuploidy Drives Genomic Instability in Yeast. Science 333:1026–1030.

26. Blank H, Sheltzer J, Meehl C, Amon A (2015) Mitotic entry in the presence of DNA damage is a widespread property of aneuploidy in yeast. Mol Biol Cell 26:2031–2037.

27. Katz W, Weinstein B, Solomon F (1990) Regulation of tubulin levels and microtubule assembly in Saccharomyces cerevisiae: consequences of altered tubulin gene copy number. Mol Cell Biol 10:5286–5294.

28. Deutschbauer AM, et al. (2005) Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics 169:1915–1925.

29. Isacson O, Seo H, Lin L, Albeck D, Granholm A (2002) Alzheimer’s disease and Down’s syndrome: roles of APP, trophic factors and ACh. Trends Neurosci 25(2):79–84.

30. Birchler JA, Veitia RA (2012) Gene balance hypothesis: Connecting issues of dosage sensitivity across biological disciplines. Proc Natl Acad Sci 109(37):14746–14753.

31. Giaever G, et al. (2002) Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–391.

32. Giaever G, Nislow C (2014) The yeast deletion collection: A decade of functional genomics. Genetics 197:451–465.

33. Lindsley DL, et al. (1972) Segmental aneuploidy and the genetic gross structure of the Drosophila genome. Genetics 71:157–184.

34. Marygold SJ, et al. (2007) The ribosomal protein genes and Minute loci of Drosophila melanogaster. Genome Biol 8(10):R216.

35. Ashburner M (1989) Drosophila. A laboratory handbook. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York).

36. Giordano E, Peluso I, Senger S, Furia M (1999) minifly, A Drosophila gene required for ribosome biogenesis. J Cell Biol 144(6):1123–1133.

37. Schulze SR, Sinclair DAR, Fitzpatrick KA, Honda BM (2005) A genetic and molecular characterization of two proximal heterochromatic genes on chromosome 3 of Drosophila melanogaster. Genetics 169:2165–2177.

38. Ito T, Kim G, Shinozaki K (2000) Disruption of an Arabidopsis cytoplasmic ribosomal protein S13-homologous gene by transposon-mediated mutagenesis causes aberrant growth and development. Plant J 22(3):257–264.

39. Horiguchi G, et al. (2011) Differential contributions of ribosomal protein genes to Arabidopsis thaliana leaf development. Plant J 65:724–736.

40. Delneri D, et al. (2008) Identification and characterization of high-flux-control genes of yeast through competition analyses in continuous cultures. Nat Genet 40:113–117.

41. Ohnuki S, Ohya Y (2018) High-dimensional single-cell phenotyping reveals extensive haploinsufficiency. PLoS Biol 16(5):e2005130.

42. Norris M, Lovell S, Delneri D (2013) Characterization and Prediction of Haploinsufficiency Using Systems-Level Gene Properties in Yeast. G3: Genes, Genomes, Genetics 3:1965–1977.

43. Huang N, Lee I, Marcotte EM, Hurles ME (2010) Characterising and Predicting Haploinsufficiency in the Human Genome. PLoS Genet 6(10):e1001154.

44. Steinberg J, Honti F, Meader S, Webber C (2015) Haploinsufficiency predictions without study bias. Nucleic Acids Res 43(15):e101.

45. Lek M, et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285-291.

46. Lek M, et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285-291.

47. Fisher RA (1928) The possible modification of the response of the wild type to recurrent mutations. Am. Nat. 62:115–126.

48. Wright S (1929) Fisher’s Theory of Dominance. Am. Nat. 63:274–279.

49. Orr HA (1991) A test of Fisher’s theory of dominance. Proc Natl Acad Sci 88(24):11413–11415.

50. Wright S (1934) Physiological and Evolutionary Theories of Dominance. Am Nat 68:24–53.

51. Wilkie AOM (1994) The molecular basis of genetic dominance. J Med Genet 31:89–98.

52. Gilchrist MA, Nijhout HF (2001) Nonlinear developmental processes as sources of dominance. Genetics 159:423–432.

53. Papp B, Pál C, Hurst LD (2003) Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.

54. Veitia RA (2002) Exploring the etiology of haploinsufficiency. Trends Genet 21:175–184.

55. Veitia RA, Birchler JA (2015) Models of buffering of dosage imbalances in protein complexes. Biol Direct 10(1):1–11.

56. Kacser H, Burns JA (1981) The molecular basis of dominance. Genetics 97:639–666.

57. Kondrashov FA, Rogozin IB, Wolf YI, Koonin E V (2002) Selection in the evolution of gene duplications. Genome Biol 3(2):research0008.1-0008.9.

58. Sopko R, et al. (2006) Mapping pathways and phenotypes by systematic gene overexpression. Mol Cell 21:319–330.

59. Makanae K, et al. (2013) Identification of dosage-sensitive genes in Saccharomyces cerevisiae using the genetic tug-of-war method. Genome Res 23:300–311.

60. Payen C, et al. (2016) High-Throughput Identification of Adaptive Mutations in Experimentally Evolved Yeast Populations. PLoS Genet 12(10):e1006339.

61. Rice AM, McLysaght A (2017) Dosage-sensitive genes in evolution and disease. BMC Biol 15(1):1–10.

62. Veitia RA, Potier MC (2015) Gene dosage imbalances: Action, reaction, and models. Trends Biochem Sci 40(6):309–317.

63. de Clare M, Pir P, Oliver SG (2011) Haploinsufficiency and the sex chromosomes from yeasts to humans. BMC Biol 9(15).

64. Dang VT, Kassahn KS, Marcos AE, Ragan MA (2008) Identification of human haploinsufficient genes and their genomic proximity to segmental duplications. Eur J Hum Genet 16(11):1350–1357.

65. Mills EW, Green R (2017) Ribosomopathies: There’s strength in numbers. Science 358:eaan2755.

66. Draptchinskaia N, et al. (1999) The gene encoding ribosomal protein S19 is mutated in Diamond-Blackfan anaemia. Nat Genet 21(2):169–175.

67. Boria I, et al. (2010) The ribosomal basis of diamond-blackfan anemia: Mutation and database update. Hum Mutat 31(12):1269–1279.

68. Knudson AG (1971) Mutation and Cancer: Statistical Study of Retinoblastoma. Proc Natl Acad Sci USA 68(4):820–823.

69. Fodde R, Smits R (2002) A Matter of Dosage. Science 298:761–764.

70. Kwabi-Addo B, et al. (2001) Haploinsufficiency of the Pten tumor suppressor gene promotes prostate cancer progression. Proc Natl Acad Sci U S A 98(20):11563–11568.

71. Alberici P, et al. (2006) Smad4 haploinsufficiency in mouse models for intestinal cancer. Oncogene 25(13):1841–1851.

72. Coelho MC, Pinto RM, Murray AW (2019) Heterozygous mutations cause genetic instability in a yeast model of cancer evolution. Nature 566:275–278.

73. Matharu N, et al. (2019) CRISPR-mediated activation of a promoter or enhancer rescues obesity caused by haploinsufficiency. Science 363:eaau0629.

74. Fulcoli FG, et al. (2016) Rebalancing gene haploinsufficiency in vivo by targeting chromatin. Nat Commun 7:11688.

75. Hoepfner D, et al. (2014) High-resolution chemical dissection of a model eukaryote reveals targets, pathways and gene functions. Microbiol Res 169(2–3):107–120.

76. Giaever G, et al. (1999) Genomic profiling of drug sensitivities via induced haploinsufficiency. Nat Genet 21:278–283.

77. Lum PY, et al. (2004) Discovering Modes of Action for Therapeutic Compounds Using a Genome-Wide Screen of Yeast Heterozygotes. Cell 116:121–137.

Chapter 2: Why haploinsufficiency persists

Reproduced with permission from Proc Natl Acad Sci.

Morrill S.A., Amon A. (2019) “Why haploinsufficiency persists.” Proc Natl Acad Sci U S A. 116(24):11866-11871.

Significance Statement

For most genes, a single copy is enough to support normal growth and development of diploid organisms, but a small subset of genes known as haploinsufficient (HI) genes exhibit extreme sensitivity to decreased gene dosage. Given the relatively high frequency of gene-inactivating mutations over the lifespan of an organism, and cell-to-cell variability in gene expression, haploinsufficiency represents a significant barrier to organismal fitness. Why the expression of these genes has not been modulated over evolutionary time to eliminate their haploinsufficiency remains unexplained. We find that the limit of haploinsufficient genes on organismal fitness cannot be overcome by an increase in expression because haploinsufficient genes also confer a fitness disadvantage when encoded in extra copy, leaving these genes evolutionarily “stuck”.

Abstract

Haploinsufficiency describes the decrease in organismal fitness observed when a single copy of a gene is deleted in diploids. We investigated the origin of haploinsufficiency by creating a comprehensive dosage sensitivity dataset for genes under their native promoters. We demonstrate that the expression of haploinsufficient genes is limited by the toxicity of their overexpression. We further show that the fitness penalty associated with excess gene copy number is not the only determinant of haploinsufficiency. Haploinsufficient genes represent a unique subset of genes sensitive to copy number increases, as they are also limiting for important cellular processes when present in one copy instead of two. The selective pressure to decrease gene expression due to the toxicity of over-expression, combined with the pressure to increase expression due to their fitness-limiting nature, has made haploinsufficient genes extremely sensitive to changes in gene expression. As a consequence, haploinsufficient genes are dosage-stabilized, showing much narrower ranges in cell-to-cell variability of expression compared to other genes in the genome. We propose a dosage-stabilizing hypothesis of haploinsufficiency to explain its persistence over evolutionary time.

Introduction

For nearly a century scientists and mathematicians have worked to formulate a theory to explain the origin of haploinsufficiency. Why do these genes exhibit an abnormal phenotype upon deletion of one of their two homologous copies, when the majority of genes do not? Early theories considered haploinsufficiency to be an artifact of diploidy, a rare failure of the wild-type allele to maintain protective dominance (5). This idea was ultimately disproven by the observation that equivalent rates of haploinsufficiency are present in organisms that primarily exist in the haploid state (2). Later theories invoked a more physiological explanation, whereby the specific function of the gene dictates its sensitivity to changes in dosage (3). For example, genes encoding enzymes are sparse among haploinsufficient genes, but genes encoding proteins that perform structural and regulatory functions in the cell are enriched amongst them (6). More recent studies suggest that the context of gene function is also important. In particular, genes whose products function as members of macromolecular complexes or cellular signaling networks may be especially vulnerable to changes in gene dosage (7).

High throughput screens, metadata analyses, and computational predictions have been applied to define which genes are haploinsufficient. In budding yeast, about 3% of the genome is considered haploinsufficient under maximal growth conditions, resulting in substantial defects in cellular proliferation when heterozygously deleted (4). In humans, ~300 genes are known to be haploinsufficient, contributing to a wide range of human health issues including neurodevelopmental disorders and tumorigenesis when heterozygously deleted (8), although computational predictions estimate this number to be much higher (9, 10). Importantly, haploinsufficiency of many genes is conserved from yeast to humans (11), indicating that strong selective forces exist that prevent the upregulation of their expression.

Two theories have been put forth to explain the cause of haploinsufficiency: the dosage balance hypothesis and the insufficient amounts hypothesis. The dosage balance hypothesis (Fig. 1A) states that growth defects caused by changes to gene dosage, whether over- or under-expression, are due to stoichiometric imbalances of protein complexes interfering with cellular functions (12, 13). This hypothesis predicts that haploinsufficient genes also confer a growth defect when present in excess by as little as one copy. In other words, haploinsufficiency and sensitivity to increased gene dosage are mutually defined. This hypothesis elegantly explains why haploinsufficiency has persisted over evolutionary time. Upregulation of the gene is not possible because too much protein, like too little protein, disrupts protein complex stoichiometries and thereby interferes with cellular function. The ‘insufficient amounts’ hypothesis (Fig. 1B) postulates that haploinsufficiency is the physiological result of reduced levels of protein product being insufficient to perform its cellular function (4). This hypothesis, unlike the dosage balance hypothesis, neither makes predictions about the effects of overexpressing haploinsufficient genes nor explains why haploinsufficiency has persisted over evolutionary time.

In this study we set out to experimentally test the dosage balance and insufficient amounts hypotheses of haploinsufficiency, and conclude that neither adequately explains the persistence of haploinsufficiency. We find that while all haploinsufficient genes confer a growth disadvantage when subtly overexpressed, the reverse is not true. Many genes exist, including genes encoding known protein complex members, that impair proliferation when subtly overexpressed but not when heterozygously deleted, arguing against the dosage balance hypothesis as a general explanation for the persistence of haploinsufficiency. Instead, our analyses of the growth defects of strains heterozygously deleted for haploinsufficient genes indicate that HI genes are limiting for cellular growth and proliferation when present in one copy instead of two. Based on these observations we propose an expansion of the current hypotheses for haploinsufficiency. Our “dosage-stabilizing” hypothesis stipulates that haploinsufficiency persists in organisms over evolutionary time because a balance must be struck between a gene product being limiting for a biological process and avoiding the toxicity of its overproduction.

Results

Haploinsufficient genes are sensitive to increased copy number

The dosage balance hypothesis of haploinsufficiency predicts that HI genes are toxic when subtly overexpressed, that is, they should also be sensitive to increased copy number (SIC; Fig. 1A). The budding yeast, Saccharomyces cerevisiae, is an ideal system to explore this prediction because several tools exist to generate comprehensive dosage-altered libraries of genes that are haploinsufficient or are toxic when overexpressed. The heterozygous deletion collection (14), in which one copy of each of the ~6000 yeast genes has been systematically deleted in diploid strains, allowed us to study haploinsufficiency at genome-wide resolution. To study sensitivity to increased copy number, we utilized the previously constructed MoBY-CEN plasmid library, which comprises centromeric plasmids that express nearly all yeast genes (4981/5915 confirmed ORFs) from their endogenous promoters (15). For the purposes of this study we consider genes introduced via MoBY-CEN vectors as present in “single-extra copy”, though the copy number of CEN vectors can vary, depending on ploidy and the type of selection (16).

Fig. 1: Models of haploinsufficiency.

Theoretical plots relating gene dosage to the fitness of strains for haploinsufficient genes. HI = haploinsufficiency, SIC = sensitivity to increased copy number.

(A) The dosage balance hypothesis. Strains exhibiting changes in HI gene dosage show decreased fitness for both under- and over-expression, due to altered stoichiometry of protein complexes or cellular pathways.

(B) The insufficient amounts hypothesis: HI genes cause decreased fitness of cells as gene dosage decreases. HI gene products are limiting for growth.
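The qualitative difference between the two models in Fig. 1 can be sketched with toy fitness functions. This is an illustration only, not the curves used in the figure: the function names, the linear shapes, the `penalty` parameter, and the plateau above wild-type dosage in the second model are all assumptions made for this sketch (the insufficient amounts hypothesis itself makes no prediction for overexpression).

```python
def dosage_balance_fitness(dosage, penalty=0.2):
    """Toy dosage balance model (hypothetical shape).

    `dosage` is gene dosage relative to wild type (1.0). Fitness drops as
    dosage departs from wild type in either direction, because both too
    little and too much product disturb complex stoichiometry.
    """
    return max(0.0, 1.0 - penalty * abs(dosage - 1.0))


def insufficient_amounts_fitness(dosage):
    """Toy insufficient amounts model (hypothetical shape).

    Fitness tracks dosage while the product is limiting. The plateau above
    wild-type dosage is an assumption; the hypothesis is silent on it.
    """
    return min(dosage, 1.0)


# Relative dosage: 0.5 = heterozygous deletion in a diploid,
# 1.0 = wild type, 1.5 = one extra copy in a diploid.
for dosage in (0.5, 1.0, 1.5):
    print(dosage,
          dosage_balance_fitness(dosage),
          insufficient_amounts_fitness(dosage))
```

Under these assumptions only the dosage balance curve penalizes the extra copy, which is the prediction tested in the sections that follow.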

We generated a high-confidence data set of haploinsufficient genes in yeast. Deutschbauer et al. (2005) used the heterozygous deletion collection to identify 184 genes that are haploinsufficient under maximal growth conditions (YEP medium containing 2% glucose at 30°C). Of these, we chose 100 highly haploinsufficient genes to pursue further (henceforth top_HI) based on the following criteria: 1) accuracy of the gene deletion in the heterozygous deletion collection, 2) a confirmed growth defect (>5%) in heterozygous knockout strains, and 3) presence of the gene in the MoBY-CEN library of plasmids. We note that the growth defect we measured (Fig. 2A) was in excellent agreement with previously defined fitness values (Fig. 2B), except for a small number of strains that harbor deletions of ribosomal subunits, which are known to cause genomic alterations (17). The complete list of top_HI genes is shown in Strain Table 1, with excluded genes in Strain Table 2. The majority of genes that exhibit severe haploinsufficiency encode ribosomal proteins, and proteins required for transcription and translation as well as proteostasis (Fig. 3A).

Fig. 2: Evaluation of growth rate measurements for highly haploinsufficient genes.

(A) Doubling times for strains harboring a CEN plasmid carrying one of the top_HI genes, measured for 3 separate transformants in technical duplicate. Strains are shaded according to the degree of proliferation defect observed in strains with heterozygous deletions in the same gene. The darker the bars, the more severe the proliferation defect of the heterozygous deletion strain. Strains exhibiting extremely variable doubling time measurements are identified by (‡). Strains whose doubling times were not significantly slower than that of wild type are identified by (†). Error bars are SD. See Methods for explanation of statistical significance.

(B) Pearson correlation plotting the measured doubling time data for strains containing heterozygous deletions of HI genes against existing data on the relative fitness of these strains (4) (n=101).

(C) Doubling times of diploid strains carrying a CEN plasmid harboring an extra copy of top_HI genes (2N+1). Data for 2N-1 and 1N+1 strains are the same as in Fig. 2B.

(D) The DNA copy number of RPL20A and RPL30 in three transformants containing a CEN plasmid copy of each gene. Measurements are based on qPCR amplification of DNA samples collected from cells in stationary phase (OD600nm>3). The strength of the doubling time defect in these transformants is indicated by coloring of the bars (grey = weak defect, black = strong defect).

Fig. 3: Toxicity of haploinsufficient gene copy number increase.

(A) Functional categories that define highly haploinsufficient genes (top_HI).

(B) Doubling time of strains containing dosage-altered top_HI genes, with measurements made at 30 °C in YPD. 2N-1 are diploids with heterozygous deletions of HI genes (wild-type diploid) and 1N+1 are haploids with a single extra copy of HI genes, contained on a CEN plasmid (wild-type haploid plus empty vector).

(C) The relationship between 2N-1 and 1N+1 relative growth rates. The black dots identify genes that are disproportionately toxic when in excess, which are thought to be correlation outliers due to high variability as shown in (D).

(D) A plot comparing doubling time and variability in growth measurements for 1N+1 cells. Note that data points highlighted in black are the same as black data points in (C).

(E) Degradation of HI genes when encoded in single extra copy (18). The fraction of excess protein degraded is the relative protein expression in 1N+1 compared to WT cells (1-log2 ratio), where a value of 0 represents no degradation and 1 represents complete degradation (two-tailed t test with Welch correction p<0.0001). The percent of proteins in each category considered degraded (>0.4) is indicated below each bar. Central line = median. Plot whiskers 10-90th percentile.

(F-G) Growth of strains containing top_HI genes on a CEN plasmid, treated with 100 µM MG132 (F) or 25 µM radicicol (G), alongside untreated controls (DMSO) in YPD at 30°C. Green bars represent the doubling time of strains harboring an empty vector control. The number of strains that are more sensitive to drug treatment than expected is indicated below each condition.
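The "fraction of excess protein degraded" metric in panel (E) can be worked through in a few lines. This is a minimal sketch of the formula as stated in the legend (1 minus the log2 ratio of 1N+1 to wild-type protein levels) with the >0.4 cutoff; the function names and the example protein levels are hypothetical.

```python
import math

# Cutoff stated in the Fig. 3E legend for calling a protein "degraded".
DEGRADED_CUTOFF = 0.4


def fraction_degraded(protein_1n_plus_1, protein_wt):
    """Return 1 - log2(1N+1 / WT).

    A haploid with one extra gene copy has two copies instead of one, so
    with no degradation the protein level doubles (log2 ratio 1, result 0).
    If all excess protein is removed the level equals wild type
    (log2 ratio 0, result 1).
    """
    return 1.0 - math.log2(protein_1n_plus_1 / protein_wt)


def is_degraded(protein_1n_plus_1, protein_wt):
    """Apply the >0.4 cutoff from the legend."""
    return fraction_degraded(protein_1n_plus_1, protein_wt) > DEGRADED_CUTOFF


# Hypothetical protein levels, in arbitrary units relative to wild type.
print(fraction_degraded(2.0, 1.0))  # doubled protein level: no degradation
print(fraction_degraded(1.0, 1.0))  # wild-type level: complete degradation
print(is_degraded(1.5, 1.0))        # intermediate level, just above the cutoff
```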

Having created a high-confidence haploinsufficient gene set, we then compared growth rates of strains heterozygous for haploinsufficient genes with growth rates of strains harboring an extra copy of the top_HI genes. Remarkably, most top_HI genes interfered with proliferation when expressed on a CEN plasmid, with 85/100 strains exhibiting a statistically significant growth defect under maximal growth conditions (Fig. 3B, Fig. 2A, Dunn’s multiple comparison test p<0.05, Dataset S1). Of the 15 strains that did not meet statistical significance, 9 strains displayed highly variable doubling times (identified as ‡ in Fig. 2A). The remaining 6 strains showed only a very slight increase in doubling time (identified as † in Fig. 2A), despite evidence that the genes were expressed and that their coding sequences harbored no mutations. We propose that the genes present in excess in these 6 strains cause toxicity in situations other than maximal growth that yeast cells encounter as part of their natural life cycle. Growth defects were also observed in diploid cells expressing top_HI CEN plasmids, though the effect was smaller, indicating that increased ploidy buffers against phenotypes caused by copy number alteration (Fig. 2C).

Comparison of the growth defect of strains heterozygously deleted for a haploinsufficient gene (henceforth 2N-1) with the growth defect of strains harboring an extra copy of the same gene (henceforth 1N+1) showed that the magnitude of the growth defect was proportional, with the phenotype of 2N-1 strains being generally more severe (Fig. 3C, Spearman correlation p=0.0008). This is in agreement with a previous study which found that out of ~100 genes analyzed, the majority of genes whose fitness was affected by both over- and under-expression had more severe consequences for decreased expression than increased expression (19). We note that for several genes the growth of 2N-1 and 1N+1 strains was not particularly correlated (black data points in Fig. 3C). This is most likely due to significant variability in the doubling time measurements for these 1N+1 strains (black data points in Fig. 3D). Our data further suggest that this variability is a consequence of copy number variation between different strain isolates (Fig. 2D), which disproportionately affected the doubling times of strains containing genes that are most sensitive to increased copy number.

Why are haploinsufficient genes toxic upon dosage increase?

Our results lead to the conclusion that, under maximal growth conditions, most genes that confer a significant growth defect in diploids when deleted in single copy also cause a growth disadvantage when in extra copy. Why are haploinsufficient genes toxic when overproduced? We can envision two non-mutually exclusive possibilities: 1) increased levels of the gene could interfere with a specific cellular function, or 2) production and potential subsequent degradation of the excess gene product could be costly.

A well-known example of gene-specific toxicity (option 1) in budding yeast is the β-tubulin encoding gene TUB2. Expression of a single extra copy of the gene leads to severe growth defects, as does deletion of the two α-tubulin encoding genes TUB1 and TUB3 in diploids (20, 21). Is there evidence that production and degradation of HI gene products is generally costly (option 2)? Looking at median expression, the RNAs and proteins produced from the 100 most haploinsufficient genes are 11-fold more abundant than those for the rest of the genome (Fig. 4A-B). This preponderance of highly expressed genes among top_HI genes is driven by genes encoding ribosomal proteins (Fig. 4A-B). While producing large amounts of excess protein is known to place a burden on the cell’s transcription and translation machinery (22), more recent studies suggest that it is the demand for protein degradation that most impacts cellular growth when additional copies of highly expressed genes are introduced into cells (23). We found that approximately 65% of HI genes produce proteins known to be degraded by the proteasome when encoded in single copy excess, compared to 26% of non-HI genes (Fig. 3E, Student’s t test p<0.0001). The enrichment for proteasomal targets is again driven by ribosomal proteins (Fig. 3E). This observation raises the possibility that degradation of excess HI proteins could be costly to cells. Consistent with this idea, we found that 98/100 haploid strains bearing an extra copy of top_HI genes exhibited increased sensitivity to the proteasome inhibitor MG132 (Fig. 3F). 96/100 strains were more sensitive to the Hsp90 inhibitor radicicol (Fig. 3G). When we excluded ribosomal proteins, the majority of strains were still sensitive to these proteotoxicity-inducing agents (Fig. 3F, G). These observations indicate that the toxicity of HI gene overexpression is in part due to excess proteins placing a burden on the cell’s protein homeostasis machinery.

[Figure 4 plots: (A) RNA abundance (log10) and (B) protein abundance (log10) for top_HI genes, the rest of the genome, and top_HI genes excluding ribosomal protein genes.]

Fig. 4: Haploinsufficient (HI) genes are highly expressed.

The relative RNA (A) and protein (B) abundance of top_HI genes, compared to top_HI genes excluding ribosomal protein genes and other non-HI genes in the genome (top_HI vs. genome, two-tailed Mann-Whitney test ****p<0.0001). Central line = median. Plot whiskers span 10-90th percentile. Median fold-change compared to genome is reported.

Most dosage sensitive genes are not haploinsufficient

Our observation that top_HI genes are toxic when subtly overexpressed supports the dosage balance hypothesis of haploinsufficiency. An additional prediction of this theory is that haploinsufficiency and sensitivity to increased copy number are mutually defined, at least for members of protein complexes. To test this prediction, we needed to create a genome-wide data set defining the genes that confer a fitness disadvantage when present in single extra copy. While previous studies had characterized the sensitivity of genes to high-level overexpression (24, 25) or tested their copy number limit (23), none had defined the genes that confer a fitness defect when expressed in single extra copy under conditions of maximal growth. Again utilizing the MoBY-CEN collection of yeast plasmids, we generated two independent transformant pools of haploid strains, where each strain contained a single extra copy of a gene, for all genes in the genome. We competed pools in liquid culture, and monitored plasmid representation by sequencing plasmid-specific tags every 8 hours for a period of 48 hours (~30 generations). In following each tag’s abundance over time, we were able to extract a linear slope for 2646 strains (from 4981 plasmids), of which 1588 passed criteria for reproducibility and were assigned a fitness score (Fig. 5; Table 1). We converted each slope into a relative fitness value where 1 represents a zero slope with neutral fitness, and values <1 or >1 have negative or positive slopes, respectively. The distribution of fitness values showed approximately equal numbers of strains increasing and decreasing in the population (Fig. 6A, Dataset S2). Genes detected in both transformant pools showed good correlation between relative fitness values (Fig. 6B, Pearson correlation p<0.0001). A gene was considered to have a negative impact on fitness, and defined as SIC, when its relative fitness was <1 and more than 1SD below the population average. Using this criterion, we defined 251 genes to be SIC (FDR=0.072, Fig. 5C). Conversely, there were 247 genes which fell more than 1SD above the mean and had a relative fitness >1. We note that strains whose abundance increases in the population likely do not have a growth advantage, based on a comparison with known reference points (Fig. 5D). Rather, these strains increase in abundance because of the relative decrease of other strains in the culture.
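The SIC criterion described above (relative fitness <1 and more than 1SD below the population average) can be illustrated with a minimal sketch. The function name `classify_sic`, the gene labels, and the fitness values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def classify_sic(fitness_by_gene, n_sd=1.0):
    """Return genes with relative fitness <1 and more than n_sd SD below the mean."""
    values = list(fitness_by_gene.values())
    mu = mean(values)
    sd = stdev(values)  # ASSUMPTION: sample SD; the estimator is not stated in the text
    cutoff = mu - n_sd * sd
    return sorted(g for g, f in fitness_by_gene.items() if f < 1 and f < cutoff)

# Hypothetical relative fitness values (1 = neutral); not taken from Dataset S2.
example = {
    "geneA": 1.00, "geneB": 1.02, "geneC": 0.98,
    "geneD": 0.90, "geneE": 1.01, "geneF": 0.99,
}
sic_genes = classify_sic(example)
print(sic_genes)
```

In this toy population only the strain with the clearly reduced fitness falls below the mean minus 1SD cutoff; strains just under 1 are not called SIC.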

[Fig. 5 plots: (A) MCM2, relative abundance (log2) vs. doublings, combined forward/reverse reads, linear fit y=-0.055x+0.199. (B) Slopes from amplification 1 vs. amplification 2 with y=x reference, shown through the filtering steps: remove noisy slopes using confidence intervals, then remove points that do not correlate between amplifications. (C) Cumulative FDR vs. threshold number of genes; 1SD, FDR=0.072. (D) N+1 doubling time (hr) vs. competition fitness score; y=-1.162x+2.985, Pearson R = -0.4304, **p=0.005; WT marked.]

Fig. 5: Pooled competition data analysis.

(A) Sample read abundance over time (log2 fold change from t=0) for a strain containing a single extra copy of MCM2 across one competition experiment. Data points are an average of two separate rounds of amplification, with forward and reverse reads.

(B) Sample data analysis progression plotting the slopes for two independent amplifications of the same competition replicate. After a linear model was calculated for the abundance of each strain over time, strains showing noisy slopes with large confidence intervals were removed. Subsequently, we removed strains whose slopes were inconsistent between amplifications (>1SD from y=x).

(C) Iterative FDR calculations based on the threshold number of genes defined as SIC. We note the FDR value for the 1SD cutoff used to define SIC genes in this study.

(D) Extrapolation of wild-type fitness score in the pooled competition based on measured doubling times for reference strains.


Fig. 6: Identification of genes that are sensitive to increased copy number (SIC).

(A) The relative fitness distribution for 1588 strains across two independent pools (n=6). Each strain contains a single extra copy of a gene on a centromeric vector. Strains were competed for ~30 generations in YPD at 30°C. Bins = 0.005.

(B) Reproducibility between the two independent pools of transformants.

(C) The percentage of genes encoding protein complex members that are haploinsufficient among top SIC genes – defined as those genes which show decreased fitness of greater than 1SD in pooled competition. For comparison, the percentage of genes that are SIC among known haploinsufficient genes (from Fig. 2B) is shown on the right.

Table 1: Summary of MoBY library competition results.

Sequencing coverage for each transformant pool in the competition to identify SIC genes.

Forward and reverse are the different reads gathered from forward and reverse primer sets surrounding the tag sequence for the plasmid in each strain. Unique tags are the number of unique sequences detected by plasmid-specific primers in the pool amplification. Assigned ORFs refers to the number of unique tags identified that correspond to specific ORFs in the MoBY plasmid library. The last rows of the table represent the final set of cutoffs used for data analysis of the pooled competition. Only strains whose tags were found in more than 3 data points across a given experiment (>3 points) were analyzed, from which a threshold was created to ensure reproducibility (see Methods).

                     poolA                poolB
                     forward   reverse    forward   reverse    combined
unique tags          4085      4123       4801      4577
assigned ORFs        3045      3149       3100      3188
>3 points            2316                 2448                 2646
Threshold criteria   1121                 1292                 1566

Under identical growth conditions, Deutschbauer et al. (2005) found 3% of the yeast genome to be haploinsufficient, defined as heterozygous gene deletions that cause a decrease in fitness greater than 1 SD below the population mean. By the same definition, we found 251/1588 genes to confer a fitness defect when present in single extra copy (SIC), which when extrapolated to the entire yeast genome suggests that SIC genes may represent up to 15% of the genome. This observation indicates that genes are much more likely to be sensitive to increased copy number than gene copy loss, though more comprehensive studies in diploid yeast will be needed to make an absolute comparison. Strikingly, while 85% of haploinsufficient genes were SIC, only 10% of the genes identified as highly SIC were haploinsufficient (n=26). When we restricted our analysis to members of protein complexes found in our competition data set, we observed a similar result: 15% of SIC genes (22/89) are HI, while 83% of HI genes (33/40) were identified as SIC by doubling time measurements. This is true even when we only considered genes whose products are known protein complex members (Fig. 6C). We conclude that while haploinsufficient genes are toxic when present in excess, the converse is not true – many genes which are found to be highly SIC are not haploinsufficient. This observation also leads to the conclusion that dosage imbalance among protein complex members alone cannot explain haploinsufficiency.

Dosage imbalance does not fully explain haploinsufficiency

The finding that haploinsufficiency and sensitivity to increased copy number are not mutually defined prompted us to test additional predictions of the dosage balance model.

The dosage imbalance hypothesis predicts that deletion of a HI gene in a haploid cell results in the same phenotype as deleting one copy in a diploid cell because both strains experience the same degree of stoichiometric imbalance. This is not what we observe. Deletion of the non-essential top_HI genes in haploid cells causes an increase in doubling time of 11-100% compared to wild-type cells (Fig. 7A, Dataset S1). Doubling time increased only 5-29% in heterozygously deleted diploid cells (Fig. 3B, Dataset S1). Importantly, the growth defect of the haploid strain lacking the HI gene was invariably more severe than the growth defect of the corresponding heterozygously deleted diploid strain (Fig. 7A).

Fig. 7: Testing the dosage balance and insufficient amount hypotheses.

(A) Doubling time of haploid strains containing deletions in non-essential top_HI genes, with measurements made at 30 °C in YPD. 2N-1 diploids are included for comparison (paired t test, p<0.0001). Connecting lines show corresponding genes.

(B) Doubling time of strains carrying heterozygous deletions for genes encoding members of the eIF2 complex, in combination and in isolation. Measurements were made at 30 °C in YPD. Significance compared to WT is shown above each bar; select other comparisons are made with brackets (ANOVA with multiple comparisons, ****p<0.0001 Bonferroni correction).

(C) Volume of diploid cycling cells with heterozygous deletions for all top_HI genes, and a subset of HI genes encoding ribosomal proteins (two-tailed t test with Welch correction **p=0.0076).

(D) A 5-fold serial dilution of strains containing confirmed heterozygous deletions for CCT complex members, plated on benomyl (15 µg/ml) and YPD control.

(E) Haploinsufficient profiles of the heterozygous deletion collection for latrunculin treatment (0.9 µM) or benomyl treatment (27 µM). Data are from Hoepfner et al. (2014) (25). Genome position of a heterozygous deletion is plotted against normalized drug sensitivity score, where a negative score represents impaired proliferation in the presence of the drug. Purple dots are subunits of the CCT chaperone complex.

Another prediction of the dosage balance hypothesis is that deleting all subunits of a protein complex should alleviate the haploinsufficiency of deletions of individual complex members. We tested this prediction for the eIF2 complex. eIF2 is an initiation factor for translation with 3 obligate subunits, 2 of which (SUI2 and SUI3) are known to be haploinsufficient. If stoichiometry were the cause of the genes’ haploinsufficiency, a strain heterozygously deleted for all three subunit genes should not exhibit a growth defect. This is not what we observed. The dosage-balanced strain for eIF2 exhibited a strong growth defect (13 min increase in doubling time), on par with that of strains harboring single eIF2 gene deletions (9-18 min) (Fig. 7B). While the triple deletion strain grew significantly slower than the wild-type strain, the phenotype was not as severe as that of strains with single deletions of the SUI2 and SUI3 genes. Based on the relative differences in doubling time, we estimate that protein stoichiometry imbalance can explain only ~33% of the haploinsufficient growth defect for this protein complex.

Finally, our results show that subtle overexpression of top_HI genes causes increased sensitivity to proteotoxic agents such as radicicol (Fig. 3G). According to the dosage balance hypothesis, this proteotoxicity should arise from stoichiometric imbalance of functional complexes (26) and should thus also occur in strains heterozygously deleted for HI genes. This is not the case. Strains containing heterozygous deletions of HI genes are not, on average, more sensitive to radicicol, although a subtle increase in sensitivity to radicicol might have been missed in this bulk measurement (Fig. 8A). Taken together our data indicate that stoichiometric imbalance is not the sole cause of haploinsufficiency.

[Fig. 8 plots: (A) Radicicol, growth z-score (treated/untreated) for top_HI vs. genome. (B) Mean volume (µm³) for WT (N), 1N+1, WT (2N), and 2N-1. (C) 5-fold dilution of strains carrying empty vector, pCCT2, pCCT4, pCCT5, or pCCT7, with cct4∆/+ and cct5∆/+, on YPD and benomyl (15 µg/ml). (D) Expression variability (CV²) vs. mean expression (fluorescence intensity) for HI, HI_excl. ribo, and genome.]

Fig. 8: Phenotypic and cell-to-cell variability in dosage sensitivity.

(A) Haploinsufficient profiles of the heterozygous deletion collection for radicicol (88 µM), from Hoepfner et al. (2014) (27). The normalized drug sensitivity score is plotted for top_HI genes, where a negative score represents impaired proliferation in the presence of the drug. Scores for the rest of the genome are plotted for comparison (two-tailed t test with Welch correction, p=0.1518).

(B) Volume of haploid cycling cells containing a single extra copy of top_HI ribosomal genes, compared to diploid cycling cells with heterozygous deletions for the same genes (two-tailed t test with Welch correction, (1N) p=0.3057, (2N) **p=0.0076).

(C) A 5-fold serial dilution of strains containing a single extra copy of top_HI CCT complex members, plated on benomyl (15 µg/ml) and YPD control.

(D) A plot comparing cell-to-cell expression variability (CV²) and mean expression for 1000 genes, based on flow cytometry analysis of fluorescence reporters (see also Fig. 9A). Highly expressed HI genes are indicated in blue; more lowly expressed, non-ribosomal HI genes are indicated in yellow. Both axes are plotted in log10 scale. The plot shows that the expression variability is much lower for both classes of HI genes compared to non-haploinsufficient genes.

Haploinsufficient genes are rate limiting for cellular fitness

The conclusion that the dosage imbalance hypothesis alone cannot explain haploinsufficiency prompted us to test elements of the insufficient amounts hypothesis. We hypothesized that HI genes, when heterozygously deleted, are limiting for organismal functions such as cell growth or proliferation. To test the growth-limiting nature of haploinsufficiency, we examined the importance of HI genes for key processes in cell growth and proliferation. Ribosomal proteins are particularly enriched among HI genes as a class (4), and it is well established that ribosomes are rate-limiting for cell growth (28). Consistent with previous reports (29), we found that deletion of one copy of ribosomal genes and of other haploinsufficient genes required for translation caused a reduction in mass accumulation in diploid strains, as judged by a smaller average cell size (Fig. 7C; Dataset S1). Strains carrying heterozygous deletions in HI genes not involved in protein synthesis were the same size or larger than control strains (Fig. 7C; Dataset S1). Importantly, introducing an extra copy of these ribosomal genes did not lead to a decrease in cell size (Fig. 8B), arguing that stoichiometric imbalances among ribosome subunits are not responsible for the small-cell size phenotype observed in diploid cells heterozygously deleted for these ribosomal protein genes. Instead, as small size is a characteristic of reduced protein synthesis (28), our data indicate that insufficient amounts of ribosomal proteins result in the growth defect of strains heterozygously deleted for their genes.

Is there evidence that other haploinsufficient genes are rate limiting in processes critical for cell proliferation? The top_HI genes include 5/8 subunits of the CCT chaperone complex known to fold actin and tubulin. Strains harboring heterozygous deletions in these genes are sensitive to the actin and microtubule assembly inhibitors latrunculin A and benomyl, respectively (Fig. 7D-E), suggesting CCT subunits are indeed rate limiting for folding these cytoskeleton constituents. We note that strains with excess CCT subunits are not sensitive to benomyl (Fig. 8C), again arguing against stoichiometric imbalances and for insufficient amounts of protein as the source of this benomyl sensitivity. We predict that the remaining top_HI genes are also limiting for processes important for growth or proliferation under maximal growth conditions.

Haploinsufficient genes have a narrow expression range

Taken together, our results lead to the conclusion that two properties determine whether a gene is haploinsufficient in a specific environment or growth condition: (1) the gene product is rate-limiting for maximal organismal fitness and at the same time (2) the gene confers a fitness disadvantage when in excess. We propose that the fitness penalty when in excess prevents upregulation of the gene to counteract haploinsufficiency, and the rate-limiting nature of the gene causes the observed fitness defect in heterozygotes. This model of dual, counteracting selective pressures over evolutionary time makes a very strong prediction: the expression range of HI genes, in particular its variation between cells in a population, should be narrow relative to that of other genes in the genome. Using a previously published set of single-cell gene expression data (30), we observe that the cell-to-cell variability in gene expression is significantly narrower among HI genes compared to other genes in the genome (Fig. 9A). While low variation can be driven by high expression (the example in Fig. 9A shows the cell-to-cell variability in the expression of 100 highly expressed, non-haploinsufficient genes), we still observed narrow ranges of expression among more lowly expressed HI genes (Fig. 9A, Fig. 8D). We conclude that haploinsufficient genes are narrowly expressed irrespective of their expression level. This conclusion is consistent with recent data showing that variability in gene expression across a population of cells is decreased for genes where small changes in expression had a large impact on fitness (19).
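For readers unfamiliar with the variability measure used here, CV² is the squared coefficient of variation of single-cell expression, i.e. the variance divided by the squared mean. The following is a minimal sketch only; the function name `cv_squared` and the intensity values are hypothetical, and the use of the population variance is an assumption rather than the procedure of the cited dataset (30).

```python
from statistics import mean, pvariance

def cv_squared(intensities):
    """Squared coefficient of variation: variance / mean^2."""
    m = mean(intensities)
    # ASSUMPTION: population variance; the cited study may use a different estimator.
    return pvariance(intensities) / (m * m)

# Hypothetical single-cell fluorescence intensities with the same mean (100).
narrow = [95, 100, 105, 100]   # tightly distributed expression
broad = [50, 100, 150, 100]    # widely distributed expression

print(cv_squared(narrow), cv_squared(broad))
```

Because CV² is normalized by the mean, two genes with the same mean expression can still differ a hundredfold in CV², which is the sense in which an expression range is called narrow.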

Fig. 9: The gene expression range of haploinsufficient genes is narrow.

(A) Cell-to-cell variability (CV²) in gene expression for HI genes compared to genes that are highly expressed but are not haploinsufficient (high expression, not HI), genes that are haploinsufficient but not highly expressed (HI, not high expression) and all non-HI genes in the genome (genome), based on FACS fluorescence measurements of promoter-YFP fusions for 1000 genes (30) (****p<0.0001, **p=0.0058, *p=0.0197). Central line = median. Plot whiskers span 10-90th percentile.

(B) The dosage stabilizing hypothesis: HI genes cause a decrease in fitness when under- and over-expressed. This narrowed fitness distribution is driven by the evolutionary pressure to both increase and decrease expression.

Discussion

While changes to gene dosage can lead to imbalances in protein complex stoichiometry that adversely affect cellular fitness (12, 13), several lines of evidence indicate that haploinsufficiency cannot be explained by fitness decrease due to stoichiometric imbalance alone. First, haploinsufficiency and overexpression toxicity are not mutually defined, even among genes encoding protein complex subunits. Second, deletion of haploinsufficient genes in a haploid strain is more detrimental than in a diploid strain even though the number of uncomplexed protein subunits is the same in both cell types. Third, overexpression of individual members of a complex leads to a different phenotype than their underexpression, as is showcased by ribosomal proteins and CCT subunits. Finally, at least in the case of eIF2, heterozygous deletion of all complex member genes does not alleviate the growth defect caused by deletion of individual subunit genes.

Our data support a new hybrid model for haploinsufficiency, which we call the dosage-stabilizing hypothesis (Fig. 9B). It builds upon and incorporates core principles from both the insufficient amounts hypothesis and the dosage balance hypothesis. The dosage-stabilizing hypothesis posits that HI gene products are limiting for fitness when under-expressed (insufficient amounts hypothesis), and toxic when over-expressed, most likely due to adverse effects on protein homeostasis and imbalances in protein complex stoichiometry (dosage balance hypothesis). In other words, haploinsufficient genes are evolutionarily “stuck”, unable to increase or decrease expression over time to accommodate fluctuations in gene dosage because a fitness penalty is associated with both downregulating and upregulating HI gene expression. Thus, haploinsufficient genes are “living on the edge,” representing a unique class of genes that must carefully balance their expression, ensuring that the cost of over-production does not outweigh the potential benefit of maximizing growth.

It is worth noting that haploinsufficiency is extremely context dependent. Deutschbauer et al. (2005) showed that there is little overlap between genes that are haploinsufficient under maximal growth conditions and those that are limiting when cells are grown in minimal medium. Applying more stringent pressures on cell growth through glucose-, ammonium-, or phosphate-limited continuous culture revealed a set of haploinsufficient genes that is highly conserved across nutrient-limiting conditions, yet distinct from those defined under maximal growth conditions (31). Interestingly, under these severe growth restriction conditions, the frequency of haploinsufficient genes appears to increase (12-20%) (31) and could be as high as 76%, at least among essential yeast genes, based on single-cell morphological phenotyping (32). Together, these studies indicate that conditions exist for most if not all genes under which they are haploinsufficient. We speculate that this could account for why all genes are maintained in two copies over evolutionary time even in organisms that propagate by a predominantly non-sexual lifestyle.

Given the persistence of haploinsufficient genes, how has cellular physiology been driven by, and perhaps adapted to, their presence? Given the enrichment of transcription and translation factors among HI genes, which are by nature growth limiting, our data lead us to wonder whether haploinsufficient genes might play a role in setting the division rate of cells. In other words, the expression levels of haploinsufficient genes may be partially responsible for budding yeast cells having a doubling time of 90 minutes in YEPD medium at 30°C. Changing the expression of any individual gene will have little effect on cell division length, but increasing their expression coordinately will, we predict, produce yeast cells with shorter doubling times. Additionally, if increasing gene expression of individual genes is not a solution to the problem of haploinsufficiency, do organisms have another way to escape it evolutionarily? Previous work (4, 33) has suggested that gene duplication, where two copies of the gene now split the overall expression level, may be better buffered against gene expression fluctuation and loss, providing organisms a way out of haploinsufficiency. Future studies should seek to address these questions.

Materials and Methods:

Strains and growth measurements.

Strains harboring heterozygous deletions of haploinsufficient genes were obtained from the BY4743 yeast knockout collection; haploid deletion strains are BY4741 (MATa) (14).

Centromeric (CEN) plasmids were obtained from the Molecular barcoded ORF collection of yeast genes (MoBY) (15), and were transformed into a haploid (MATα) or diploid (MATa/α) strain (W303; ade2-1, leu2-3, ura3, trp1-1, his3-11,15, can1-100, GAL, [phi+]) using the LiAc/SS carrier DNA/PEG method of transforming yeast (34). For a list of haploinsufficient genes that were included in this study refer to Table S2; for those excluded see Table S3.

Doubling times were measured in 96-well format on a Biotek Synergy 2 plate reader set to continuous shaking at 30°C. Strains were grown overnight in glucose-containing medium (YEP medium or synthetic complete medium lacking uracil, SC-URA, for strains containing plasmids) and diluted to grow for 3-5 generations before inoculating 100 µL of fresh YPD with OD600nm<0.1 cells per well. The OD600nm of each well was then measured every 15 minutes for a period of 24 hours. For heterozygous diploid strains, measurements were made in triplicate from 3 separate starting cultures. For strains containing plasmids, at least 3 different transformants were analyzed in duplicate. Proteotoxic stress agents (25 µM radicicol or 100 µM MG132, Fig. 2F-G) were added at the start of the cell division measurement, with an equivalent amount of DMSO added to untreated controls. The “expected” growth of drug-treated strains was calculated based on the growth of untreated controls, multiplied by the relative doubling time increase of empty vector control upon drug treatment. For the serial dilution of CCT heterozygous deletion strains (Fig. 4C), cells were grown overnight in YPD, diluted to OD600nm = 0.2 to reenter log phase growth (~2 doublings), and plated onto YPD or YPD+benomyl (15 µg/ml) in 5-fold dilution increments, starting at OD600nm = 0.5. Plates were incubated at 30°C and imaged daily.
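The “expected” growth calculation for drug-treated strains described above can be sketched as follows. This is an illustration under stated assumptions: the function names and all doubling times are hypothetical, and expressing sensitivity as the ratio of observed to expected doubling time is an assumption made for the example, not a scoring rule given in the text.

```python
def expected_treated_doubling_time(strain_untreated, ev_untreated, ev_treated):
    """Expected doubling time of a treated strain if the drug slowed it
    by the same relative amount as the empty vector (EV) control."""
    relative_increase = ev_treated / ev_untreated
    return strain_untreated * relative_increase

def observed_over_expected(strain_treated, expected):
    # ASSUMPTION: a ratio >1 is read as extra drug sensitivity beyond the EV control.
    return strain_treated / expected

# Hypothetical doubling times in minutes (not measured values).
expected = expected_treated_doubling_time(
    strain_untreated=100.0, ev_untreated=90.0, ev_treated=108.0
)
ratio = observed_over_expected(strain_treated=150.0, expected=expected)
print(expected, ratio)
```

Scaling by the empty vector response separates the general effect of the drug on growth from any additional sensitivity caused by the extra gene copy.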

Pooled library preparation and competition.

The complete library of 4981 E. coli strains containing the MoBY collection of centromeric plasmids (CEN) was thawed and plated systematically onto LB plates (10 g tryptone, 5 g yeast extract, 5 g NaCl) containing kanamycin (100 µg/mL) and chloramphenicol (10 µg/mL). Plates were grown overnight at 37°C for no more than 18 hours, and then collected by scraping. Plasmid DNA was extracted using QIAprep spin miniprep kit, pooled, aliquoted, and stored at -20°C. 10 µg of plasmid DNA was transformed into 100 mL of W303 haploid yeast culture following the yeast high efficiency transformation protocol (35). Cells were plated for isolation onto SC-URA and grown at 30°C. Approximately 150,000 transformants (30x genome coverage) were subsequently pooled into 100 mL cold YPD (YEP + 2% glucose), then prepared for storage in glycerol at -80°C. This process was repeated for a second independent pool of transformants.

To begin the competition experiment, 500 µL of thawed transformant pool was added to 100 mL YPD and allowed to grow at 30°C for 3 hours to re-enter log phase growth, at which point the first sample was taken, t=0. Cells were subsequently diluted back to OD600nm = 0.02 in 0.5 L YPD (no selection) and collected every 8 hours (~5 generations) for 48 hours. Collected samples contained ~2 OD600nm cells and were stored at -20°C for DNA collection and at -80°C in glycerol for storage. Assuming equal starting contribution, bottle-necking of the population should be prevented by this dilution strategy, as ~10^4 cells from each strain are carried over at each time point. For each transformant pool, 3 separate competitions were performed.
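The library coverage and carry-over figures quoted in this section can be checked with a short back-of-the-envelope sketch. The conversion of roughly 10^7 yeast cells per mL at OD600nm = 1 is an assumption introduced here for illustration; it is not stated in the text, and the true value depends on strain and spectrophotometer.

```python
LIBRARY_SIZE = 4981          # MoBY plasmids in the library (from the text)
TRANSFORMANTS = 150_000      # pooled transformants (from the text)
DILUTION_OD = 0.02           # OD600nm after dilution (from the text)
CULTURE_ML = 500             # 0.5 L culture (from the text)
CELLS_PER_ML_PER_OD = 1e7    # ASSUMPTION: approximate yeast density at OD600nm = 1

# Average number of independent transformants per plasmid.
coverage = TRANSFORMANTS / LIBRARY_SIZE

# Cells carried over per strain at each dilution, assuming equal representation.
cells_after_dilution = DILUTION_OD * CULTURE_ML * CELLS_PER_ML_PER_OD
cells_per_strain = cells_after_dilution / LIBRARY_SIZE

print(round(coverage, 1), round(cells_per_strain))
```

Under this assumed density, each strain is carried over at an order of magnitude of 10^4 cells, consistent with the estimate given above.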

DNA preparation, sequencing and analysis of fitness.

Plasmid DNA was isolated from competition samples using a modified version of QIAprep plasmid miniprep protocol (add 0.35 mg glass beads to cell pellet with buffer P1 and vortex 5 min). The relative fitness of strains was captured by the contribution of a particular strain to the pool over time. Each strain, and the gene that it contained, was identifiable by a unique barcode sequence, or tag, flanked by a shared primer sequence on the plasmid. Amplification of the plasmid-specific tag sequences was performed as previously described (36), using primers with indices for multiplex sequencing. To account for amplification bias, the barcode amplification was performed twice for each experiment and sequenced separately. No more than 14 samples were run at once on the Illumina Next-seq platform, accounting for at least 10^7 forward and reverse reads per time point, the equivalent of 2000 potential reads per tag (assuming all 4981 MoBY plasmids were in the starting pool in equivalent number). The relative contribution of each tag was calculated by dividing the number of reads per tag by the total number of reads for each time point. Additionally, the relative abundance was log2 normalized to the relative number of reads at t=0, to account for variability in the starting population. By following the relative contribution of a tag within the pool over ~30 generations, a slope could be calculated to represent fitness. A decreasing slope represents decreased relative fitness (fitness score < 1), while an increasing slope can represent normal or increased fitness (fitness score ≥ 1). The fitness score was calculated by adding 1 to the slope.

Only genes with more than 3 data points across the 30 generation competition were included in the subsequent analyses. To eliminate reads with sequencing errors, the barcode sequence was required to exactly match reported MoBY tag sequences, and barcodes with more than one gene, or genes with more than one barcode, were excluded. Additionally, time points where read abundances did not match (>10%) between forward and reverse reads were discarded. To eliminate errors in amplification, variable slopes (those with large 95% confidence intervals) and slopes which did not agree across amplifications (>1SD residuals from y=x) were discarded. One replicate from pool B was discarded due to lack of reproducibility between amplifications. Finally, the average fitness value for each gene was calculated across all replicates within a pool, and across both pools. FDR was calculated as the number of SIC genes where >50% of replicates do not pass variability cutoffs (likely false positives), divided by the total number of genes considered to be SIC.
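The fitness score calculation described above (read fraction per time point, log2 normalization to t=0, linear slope, fitness = 1 + slope) can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names and read counts are hypothetical, the slope is fit by ordinary least squares, and measuring time in generations is an assumption based on the "doublings" axis in Fig. 5A.

```python
import math

def log2_relative_abundance(tag_reads, total_reads):
    """log2 of the tag's read fraction at each time point, relative to t=0."""
    fractions = [r / t for r, t in zip(tag_reads, total_reads)]
    # Time points with zero reads must be dropped by the caller (log2 is undefined).
    return [math.log2(f / fractions[0]) for f in fractions]

def least_squares_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def fitness_score(generations, tag_reads, total_reads):
    """Fitness score = 1 + slope; requires more than 3 data points, as in the text."""
    if len(generations) <= 3:
        raise ValueError("more than 3 data points are required")
    y = log2_relative_abundance(tag_reads, total_reads)
    return 1.0 + least_squares_slope(generations, y)

# Hypothetical tag that loses 0.05 log2 units of relative abundance per generation.
generations = [0, 5, 10, 15, 20, 25, 30]   # ASSUMPTION: x-axis in generations
total_reads = [1e7] * len(generations)
tag_reads = [2000 * 2 ** (-0.05 * g) for g in generations]
score = fitness_score(generations, tag_reads, total_reads)
print(round(score, 3))
```

With this convention a tag whose relative abundance is flat scores 1, and the MCM2 example fit in Fig. 5A (slope -0.055) would correspond to a score of 0.945.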

DNA copy number analysis by qPCR.

Individual transformants, each streaked out to single colonies, were grown to stationary phase in selective medium (OD600nm >3) and collected in equal numbers. Cells were flash frozen and resuspended in 0.5 mL sorbitol buffer (1 M sorbitol, 10 mM NaPi pH 7.0, 10 mM EDTA pH 8.0). Samples were then subject to 30 minute zymolyase digestion (5 mg/mL) in the presence of β-mercaptoethanol (0.1%), treatment with proteinase K (20 mg/mL) and 0.3% SDS for 30 minutes at 65°C, and phenol extraction, followed by RNAse digestion (20 mg/mL) for 1 hour at 37°C and a second phenol extraction. DNA was resuspended in water and used as template for Real-Time qPCR at a concentration of 25 ng/µL. Primer sets used were specific to the two genes of interest, RPL20A (forward 5′-CCGTCGTTTGCCAACTGAAT-3′ and reverse 5′-TGACCTTGGTTGGATGAGCT-3′) and RPL30 (forward 5′-AAGTTGATCATCATTGCCGCT-3′ and reverse 5′-ATCAGAGTCACCAGCTTCCA-3′). Each reaction contained 0.25 µM of each primer, 2 µL of template, and 1x SYBR Green mix (TaKaRa mix for real time PCR: SYBR Premix Ex Taq, cat#TAKRR420B). Standard curves were calculated separately for each primer set, and DNA copy number was measured relative to ACT1. The corresponding strength of the dosage sensitivity phenotype (weak or strong) was obtained from growth curve doubling times measured for each transformant, as described earlier.

Cell volume measurements.

Strains from the heterozygous deletion collection containing deletions of top_HI genes were grown overnight in YPD (YEP + 2% glucose) and diluted to grow for 3 generations in fresh YPD medium, shaking at 30°C. 200 µL of log phase culture was diluted into 10 mL Isoton II Diluent (Beckman Coulter) and loaded onto a Multisizer 3 Coulter Counter to measure cell volume after sonication. For each sample, 100,000 particles were counted, with a threshold diameter of 2 µm. To improve the measurement accuracy and eliminate debris, all volume measurements with <500 particles per bin were removed from the dataset. Mean volume is reported.

Existing datasets used to characterize dosage sensitive genes.

The list of haploinsufficient genes used in our analysis is from Deutschbauer et al. (2005), which found 184 genes to confer reduced fitness by pooled competition when heterozygously deleted.

RNA expression data (Fig. S2A) were obtained from (Neurohr et al. 2018, in press). Data used here are from a single time point, taken after 1 hour of growth in YEPD medium following the isolation of wild-type G1 cells by centrifugal elutriation.

Protein abundance data (Fig. S2B) were obtained from a mass spectrometry data set created by Kulak et al. (2014) (37), and represent an estimation of the absolute abundance of proteins in a cell. Data on protein degradation (Fig. 2E) were adapted from Dephoure et al. (2014) (18). In this study, degradation is a measure of the relative abundance of a protein in cells where it is encoded in single extra copy compared to cells with wild-type copy number. This was done on a chromosome by chromosome basis in a set of disomic yeast strains. The data were modified by the authors of this paper to reflect the fraction of excess protein degraded (1 - “disome ratio”). Therefore, in the absence of degradation, the fraction of excess protein degraded is expected to be 0. With complete degradation, the level of protein in cells where it is encoded in excess is equal to that of wild-type cells (fraction of excess protein degraded = 1). The threshold for degradation (>0.4) is derived from the corresponding threshold of 0.6 set by Dephoure et al. (2014).
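The transformation described above can be written out as a short sketch. It assumes, as the text implies, that the “disome ratio” is scaled so that 1 corresponds to no degradation of the excess protein and 0 to complete degradation; the function names are hypothetical and are not taken from Dephoure et al.

```python
DEGRADATION_THRESHOLD = 0.4  # from the text: fraction of excess protein degraded > 0.4


def fraction_excess_degraded(disome_ratio):
    """Fraction of excess protein degraded, defined in the text as 1 - disome ratio."""
    return 1.0 - disome_ratio


def is_degraded(disome_ratio, threshold=DEGRADATION_THRESHOLD):
    """True if the protein counts as degraded under the >0.4 threshold."""
    return fraction_excess_degraded(disome_ratio) > threshold


print(fraction_excess_degraded(1.0))  # no degradation of the excess protein
print(fraction_excess_degraded(0.0))  # complete degradation of the excess protein
print(is_degraded(0.5), is_degraded(0.75))
```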

Haploinsufficiency profiles (Fig. 4B, S2C) for latrunculin (0.9 µM), benomyl (27 µM), and radicicol (88 µM) were obtained from a high throughput dataset created by Hoepfner et al. (2014) (27), screening the yeast heterozygous knockout collection for drug sensitivities. The growth sensitivity score is the logarithmic ratio of treated vs control growth measurements, converted to a z-score that accounts for the variability of the strain across all tested conditions.


Data on variability in gene expression of haploinsufficient genes (Fig. 4E) were obtained from Keren et al. (2015) (30), which measured single-cell gene expression by FACS for ~1000 genes using promoter-YFP fusions.

Statistical Analysis.

The correlation between relative doubling times for 2N-1 and 1N+1 top_HI strains (n=100, Fig. 2C) was determined by Spearman correlation (r=0.3271, p=0.0008) to reduce the effect of experimental outliers. Similarly, a Spearman correlation (r=0.5221, p<0.0001) was carried out for the comparison of 1N+1 doubling time and the measurement CV within each strain (n=100, Fig. 2D).

The fraction of excess protein degraded for top_HI genes (Fig. 2E) was determined to be different from the genome based on a two-tailed t test with Welch correction (t=7.095, df=78.39, p<0.001). The same test for top_HI genes excluding those encoding ribosomal proteins was not significant (t=0.6508, df=37.96, p=0.5191).
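The non-integer degrees of freedom reported for these tests follow from the Welch correction, which does not assume equal variances in the two groups. A minimal sketch of the t statistic and the Welch-Satterthwaite degrees of freedom is given below; it uses only the Python standard library, the example samples are hypothetical, and it is not the software used to produce the values above.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; generally not an integer.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical samples with unequal variances.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), round(dof, 4))
```

The p value would then be obtained from the t distribution with `dof` degrees of freedom, which is omitted here to keep the sketch free of external dependencies.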

The comparison of poolA and poolB relative fitness values for the competition (Fig. 3B) was assessed by Pearson correlation (n=655, r=0.492, p<0.0001).

The difference between growth of haploid deletion strains and matched diploid heterozygous deletion strains for non-essential top_HI genes (Fig. 4A) was assessed by a paired t test (t=9.786, df=44, p<0.0001).

Doubling times of eIF2 complex single and triple mutants (Fig. 4B) were compared to WT and to each other using an ANOVA test with multiple comparisons and Bonferroni correction.


The mean volume of top_HI heterozygous deletion strains (Fig. 4C) was not significantly different from a set of wild-type diploid controls by Welch’s t-test (t=0.8279, df=41.38, p=0.4125). A comparison of strains containing heterozygous deletions of ribosomal proteins alone with wild-type controls yielded statistically significant results (t=2.854, df=31.16, p=0.0076).

Differences in cell-to-cell variability measurements for the classes of genes in Fig. 5A were assessed by Dunn’s multiple comparisons test, due to the extreme non-normality of each distribution of measurements. Individual p values are reported in the figure legend, with all other comparisons being non-significant.

The doubling time of 1N+1 strains was compared to an empty vector control (Fig. 2B, Fig. S1A) in two groups (n=50 and n=51), based on the time that the strains were transformed and tested. Statistical significance was assessed by a Kruskal-Wallis test with Dunn’s multiple comparisons test, due to the experimental variability within each measurement across three separate transformants. Strains with corrected p-values <0.05 were considered to be significantly different from the empty vector strain.


Strain Table

Strain Table 1: Summary of HI genes included
Supplemental Table 2: Top_HI list and their characterization.

general function     genes
total                100
ribosomal protein    51
RNA processing       13
translation          11
transcription        7
proteostasis         7
other                11

ORF        gene    general function
YPL235W    RVB2    chromatin remodeling
YBR156C    SLI15   chromosome segregation
YNL126W    SPC98   chromosome segregation
YML085C    TUB1    chromosome segregation/structural
YDR394W    RPT3    degradation
YFR004W    RPN11   degradation
YBR202W    CDC47   DNA synthesis
YBR173C    UMP1    folding
YDL143W    CCT4    folding
YIL142W    CCT2    folding
YJL111W    CCT7    folding
YJR064W    CCT5    folding
YBR265W    TSC10   metabolism
YER133W    GLC7    metabolism
YJL102W    MEF2    mitochondrial translation
YBR236C    ABD1    mRNA processing
YMR061W    RNA14   mRNA processing
YHR170W    NMD3    nuclear export of large ribosomal subunit
YFR002W    NIC96   nuclear pore complex
YKL057C    NUP120  nuclear pore complex
YBL027W    RPL19B  ribosomal protein
YBR048W    RPS11B  ribosomal protein
YBR084C-A  RPL19A  ribosomal protein
YBR191W    RPL21A  ribosomal protein
YDL061C    RPS29B  ribosomal protein
YDL075W    RPL31A  ribosomal protein
YDL136W    RPL35B  ribosomal protein
YDL191W    RPL35A  ribosomal protein
YDR025W    RPS11A  ribosomal protein
YDR064W    RPS13   ribosomal protein
YDR382W    RPP2B   ribosomal protein
YDR418W    RPL12B  ribosomal protein
YEL054C    RPL12A  ribosomal protein
YER056C-A  RPL34A  ribosomal protein
YER074W    RPS24A  ribosomal protein
YER117W    RPL23B  ribosomal protein
YGL030W    RPL30   ribosomal protein
YGL031C    RPL24A  ribosomal protein
YGR034W    RPL26B  ribosomal protein
YGR118W    RPS23A  ribosomal protein
YGR148C    RPL24B  ribosomal protein
YHL015W    RPS20   ribosomal protein
YHR010W    RPL27A  ribosomal protein
YHR021C    RPS27B  ribosomal protein
YIL052C    RPL34B  ribosomal protein
YIL148W    RPL40A  ribosomal protein
YJL136C    RPS21B  ribosomal protein
YJR123W    RPS5    ribosomal protein
YJR145C    RPS4A   ribosomal protein
YKL006W    RPL14A  ribosomal protein


Strain Table 2: Summary of HI genes excluded
Supplemental Table 3: Summary of HI strains excluded.

ORF       Gene     hetko       hetko forward primer        hetko reverse primer          MoBY_CEN growth
YGL092W   NUP145   absent      not checked                 not checked
YMR143W   RPS16A   absent      not checked                 not checked
YBL023C   MCM2     haploid     not checked                 not checked
YFL008W   SMC1     haploid     not checked                 not checked
YDL193W   YDL193W  haploid     not checked                 not checked
YPL202C   YPL202C  haploid     not checked                 not checked
YDR184C   ATC1     incorrect   AGTTACACTAAGGTTGTGATAGGG    TGGACCATCTTTGTAGTAAAGACA
YJL014W   CCT3     incorrect   TCGAGTGATCATACGTTGAAA       CACCAAATATTAAACGGCAGTA
YBR078W   ECM33    incorrect   AACAAGTCCCTTTGAGCTATCA      GTGATGAACCAACCGTCTCA
YBR133C   HSL7     incorrect   GTGTGTGTGTGTGTGGAATTG       CTTGTGGTACTCAGGACGTATG
YEL034W   HYP2     incorrect   TCATAGACTCCCAAACACACAC      CGGTTCAGCGAAGAGTACAT
YKL009W   MRT4     incorrect   GCTAAGCAGAGCTGATTCCT        AGGCTTCCAAATAATAGTTCAGC
YPR131C   NAT3     incorrect   TCAAGGAAAGAGACAGGAGGA       TTCTGAGTATGAGGACGAGGTA
YML031W   NDC1     incorrect   AGACTCTTTGGCGATGAGTAAG      GTGCTCCTCGGTTGAATTGT
YMR091C   NPL6     incorrect   ACATGTTACTGTGAGCAGACAA      ACGCGTGAAATGCATATGTAGA
YAR002W   NUP60    incorrect   CCGCAAGATATCCTAAAATCG       CGGTAATTATGTCACGGCTAA
YBR154C   RPB5     incorrect   TGTGCGCAGGTGGATATTAC        GGGTAAATCCAAATGCCACT
YMR142C   RPL13B   incorrect   GTGCTCACCATCCTCCATTG        CAGTATGCGCTTTGGATGTTT
YNL069C   RPL16B   incorrect   GGGAGGGTGGTGATCTCTTC        GGGTGGTTTGTTGTTTTGCA
YOR312C   RPL20B   incorrect   TTTCTTTCCGCTCGAGTTGG        CGCTATGGACCCTGGATTTTG
YOL127W   RPL25    incorrect   TCTTCTGCTGTTGAAAAGGCT       GTAAGGCACAGGAAACCCC
YHR062C   RPP1     incorrect   AGTGAAAGCATTATAGAACCGA      GTCGCGATTCGTGGATTGT
YDL130W   RPP1B    incorrect   CGCGCGAAGGAGAAATCTTA        CTATCCCGCAGTGCCATTTC
YBR189W   RPS9B    incorrect   GACGCGCTTCACTCATGTAG        AGCGCCAGTTATGTACCTCT
YGR095C   RRP46    incorrect   CAAGGAGCCCAATACCAAGAA       AGTCTCCGGCATACGACATA
YIL126W   STH1     incorrect   CAAAGAGGCAGCACAGTTTAG       GGGAAAGGGATATAGTCGTAAA
YDR212W   TCP1     incorrect   AGGCCACTGAGAGAGTACAA        CGTTCGGATTACGCAAAGAA
YPR016C   TIF6     incorrect   CTCATCCCTCGTTGGTCTTAT       TGTAAACAGACTTGAGGAAGGAG
YHR168W   YHR168W  incorrect   AGGATGTGTACGTACTGAGAAT      TCCTCTTCTCACGGCTTCTA
YPR034W   ARP7     no product  TGGTAAACAAGAGAGATAGAACAGAG  CTTCTAGCCGCCTACAATCC
YER165W   PAB1     no product  CCCTGGTACCACCACCTAATA       GTTTGTTGAGTAGGGAAGTAGGT
YCR057C   PWP2     no product  AACAGGTAGACAGGAGAGCA        TGTGCATAAATAGAGGACAGTGA
YNR016C   ACC1                 CCCGAAACAGCGCAGAAA          AAGAGGAAACAGACCGATCAC         weak
YDR188W   CCT6                 GGAAGCAGTGAGAAGCAGAA        TCAACGCGAAGAGGGAAAC           weak
YLR115W   CFT2                 CATACATTGAGGGCGACGAA        GGCAGTCCACGAACGTAAA           weak
YGR255C   COQ6                 TGGTCTTTCAGTGAACCTTGT       GGAGCTAGTCTATTTCTATTTACATACC  weak
YNL118C   DCP2                 TCCACAGTTGTTTAATCCTCCT      TCGGCTGCCTTCATTTACA           weak
YHR011W   DIA4                 ACAGACACTGAACAAGACAAGA      ACAGAGGATTTCGTCCAACAC         weak
YBR247C   ENP1                 TGACGAAAGGAAATATGCACTAAG    GGAAAGACCGAGCGATATAAA         weak
YHR148W   IMP3                 AGATAAACCACCAGGCATAATCA     GGCGCCTCACATCATTTGTA          weak
YIR005W   IST3                 AGTGCAGCGTGCAATTCT          GGGCACAACCAGTTTCTACT          weak
YFR001W   LOC1                 AAGAAGATCGAATGCAAACCA       AGACTTTGCCTTCAGGATGTT         weak
YDR405W   MRP20                TGATTAGCAGGCTCCAACAG        GGGTATTTGCGTGTGGGTTA          weak
YIR009W   MSL1                 CGAAACTCACGCATCATTCAC       CGGTATATACGTGCTGCTATGA        weak
YIR006C   PAN1                 GAGAAGCAGGAAGAGGAAGAA       CGAAGAACAATGCTGAGAAACT        weak
YNL216W   RAP1                 GCAACCGCCCTACATAAGA         ACGTGAATCAGTGAAATAAAGGAG      weak
YOR224C   RPB8                 GCCAGCTCATTGCTTCCATAA       GCAGTAAGTGATCGCCCTTT          weak
YDL082W   RPL13A               CAGCACAGCAGGAATCGTAC        AAGTGATGGCTTCCCGTTTA          weak
YDR471W   RPL27B               TACTACCTGTCAAACCCGGC        AGGCTCCATTGGGCATTTTG          weak
YJL189W   RPL39                CTTTTGGGCTGTCGAGAACC        GGGAAGGATGGAAGACAAATGA        weak
YGR214W   RPS0A                ATAGTAGCAGGGCAGGACAC        TGGCTAAATGTGTGACAAGATGA       weak
YOL121C   RPS19A               TCTAATACCCCGGTGCGTTT        ATTCCCCTCCTCGTTGTAGC          weak


Acknowledgements:

We thank Andrew Murray, Gene-Wei Li and members of the Amon lab for suggestions and critical reading of this manuscript. This work was supported by NIH grants CA206157 and GM118066 to A.A., who is an investigator of the Howard Hughes Medical Institute and the Paul F. Glenn Center for Biology of Aging Research at MIT. On behalf of S.A.M., this material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1122374. Sequencing work at the MIT BioMicro Center was supported in part by the Koch Institute Support (core) Grant P30-CA14051 from the National Cancer Institute.

References:

1. Fisher RA (1928) The Possible Modification of the Response of the Wild Type to Recurrent Mutations. Am Nat 62(679):115–126.

2. Orr HA (1991) A test of Fisher’s theory of dominance. Proc Natl Acad Sci 88(24):11413–11415.

3. Wright S (1934) Physiological and Evolutionary Theories of Dominance. Am Nat 68(714):24–53.

4. Deutschbauer AM, et al. (2005) Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics 169:1915–1925.

5. Fisher RA (1928) The Possible Modification of the Response of the Wild Type to Recurrent Mutations. Am Nat 62(679):115–126.

6. Kondrashov FA, Koonin EV (2004) A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet 20(7):283–287.

7. Veitia RA, Potier MC (2015) Gene dosage imbalances: Action, reaction, and models. Trends Biochem Sci 40(6):309–317.

8. Dang VT, Kassahn KS, Marcos AE, Ragan MA (2008) Identification of human haploinsufficient genes and their genomic proximity to segmental duplications. Eur J Hum Genet 16(11):1350–1357.


9. Huang N, Lee I, Marcotte EM, Hurles ME (2010) Characterising and Predicting Haploinsufficiency in the Human Genome. PLoS Genet 6(10):e1001154.

10. Steinberg J, Honti F, Meader S, Webber C (2015) Haploinsufficiency predictions without study bias. Nucleic Acids Res 43(15):e101.

11. de Clare M, Pir P, Oliver SG (2011) Haploinsufficiency and the sex chromosomes from yeasts to humans. BMC Biol 9(15).

12. Papp B, Pál C, Hurst LD (2003) Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.

13. Veitia RA (2002) Exploring the etiology of haploinsufficiency. Trends Genet 21:175–184.

14. Giaever G, et al. (2002) Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–391.

15. Ho CH, et al. (2009) A molecular barcoded yeast ORF library enables mode-of-action analysis of bioactive compounds. Nat Biotechnol 27(4):369–377.

16. Karim AS, Curran KA, Alper HS (2013) Characterization of plasmid burden and copy number in Saccharomyces cerevisiae for optimization of metabolic engineering applications. FEMS Yeast Res 13(1):107–116.

17. Hughes TR, et al. (2000) Widespread aneuploidy revealed by DNA microarray expression profiling. Nat Genet 25:333–337.

18. Dephoure N, et al. (2014) Quantitative proteomic analysis reveals posttranslational responses to aneuploidy in yeast. Elife:e03023.

19. Keren L, et al. (2016) Massively Parallel Interrogation of the Effects of Gene Expression Levels on Fitness. Cell 166:1282–1294.

20. Burke D, Gasdaska P, Hartwell L (1989) Dominant effects of tubulin overexpression in Saccharomyces cerevisiae. Mol Cell Biol 9:1049–1059.

21. Katz W, Weinstein B, Solomon F (1990) Regulation of tubulin levels and microtubule assembly in Saccharomyces cerevisiae: consequences of altered tubulin gene copy number. Mol Cell Biol 10(10):5286–5294.

22. Wagner A (2005) Energy constraints on the evolution of gene expression. Mol Biol Evol 22(6):1365–1374.

23. Makanae K, Kintaka R, Makino T, Kitano H, Moriya H (2013) Identification of dosage-sensitive genes in Saccharomyces cerevisiae using the genetic tug-of-war method. Genome Res 23:300–311.

24. Gelperin DM, et al. (2005) Biochemical and genetic analysis of the yeast proteome with a movable ORF collection. Genes Dev 19:2816–2826.

25. Sopko R, et al. (2006) Mapping pathways and phenotypes by systematic gene overexpression. Mol Cell 21(3):319–330.

26. Veitia RA, Birchler JA (2015) Models of buffering of dosage imbalances in protein complexes. Biol Direct 10(1):1–11.

27. Hoepfner D, et al. (2014) High-resolution chemical dissection of a model eukaryote reveals targets, pathways and gene functions. Microbiol Res 169(2–3):107–120.

28. Jorgensen P, Tyers M (2004) How cells coordinate growth and division. Curr Biol 14(23):1014–1027.

29. Jorgensen P, Nishikawa JL, Breitkreutz B-J, Tyers M (2002) Systematic Identification of Pathways That Couple Cell Growth and Division in Yeast. Science 297:395–400.

30. Keren L, et al. (2015) Noise in gene expression is coupled to growth rate. Genome Res 25:1893–1902.

31. Delneri D, et al. (2008) Identification and characterization of high-flux-control genes of yeast through competition analyses in continuous cultures. Nat Genet 40(1):113–117.

32. Ohnuki S, Ohya Y (2018) High-dimensional single-cell phenotyping reveals extensive haploinsufficiency. PLoS Biol 16(5):e2005130.

33. Guan Y, Dunham MJ, Troyanskaya OG (2007) Functional analysis of gene duplications in Saccharomyces cerevisiae. Genetics 175(2):933–943.

34. Gietz RD, Schiestl RH (2007) Quick and easy yeast transformation using the LiAc/SS carrier DNA/PEG method. Nat Protoc 2(1):35–37.

35. Gietz RD, Schiestl RH (2007) High-efficiency yeast transformation using the LiAc/SS carrier DNA/PEG method. Nat Protoc 2(1):31–34.

36. Payen C, et al. (2016) High-Throughput Identification of Adaptive Mutations in Experimentally Evolved Yeast Populations. PLoS Genet 12(10):e1006339.

37. Kulak NA, Pichler G, Paron I, Nagaraj N, Mann M (2014) Minimal, encapsulated proteomic-sample processing applied to copy-number estimation in eukaryotic cells. Nat Methods 11(3):319–324.


Chapter 3: Conclusions and Future Directions


Summary of key conclusions

In this thesis, I have created a comprehensive model for haploinsufficiency that unites all previous observations and models. Importantly, I have discovered that haploinsufficient (HI) genes are uniquely sensitive to changes in gene dosage in both directions – causing a fitness defect when both under- and overexpressed. The “dosage-stabilizing” model that I have developed (Fig. 1A) posits that haploinsufficiency is caused both by 1) dosage imbalances within protein complexes and cellular pathways (previously the dosage-balance hypothesis) and 2) protein products being limiting for cellular growth (previously the insufficient amounts hypothesis). Further, it states that haploinsufficient genes are evolutionarily “stuck”, unable to increase or decrease expression over evolutionary time, explaining their conservation from yeast to humans.

First, to show that increased expression of HI genes has an effect on fitness, HI genes were systematically tested for low-copy gene overexpression phenotypes. Finding that the majority of HI genes exhibit growth defects when a single extra copy of the gene is introduced into cells, I concluded that HI gene overexpression is toxic to cells. This is likely due to stress on protein folding and degradation machinery, as treatment with proteotoxic agents exacerbated the growth defects of cells containing extra copies of HI genes. To improve our understanding of which genes are sensitive to increased copy number (SIC), I also created a large-scale dataset to identify genes that impair growth when expressed in low-copy excess under their native promoters. This dataset is directly comparable to a widely-used haploinsufficiency dataset (1), collected under maximal growth conditions. From it, I concluded that haploinsufficiency and sensitivity to increased copy number are not always mutually defined.


Although dosage balance no doubt contributes to haploinsufficiency, it cannot account for the strength of haploinsufficient phenotypes in their entirety. When I looked at the contribution of dosage imbalance to haploinsufficiency among members of a protein complex, I found that proteotoxicity alone could not fully explain the fitness defect. I tested the eIF2 complex, which consists of three subunits, Sui2, Sui3 and Gcd11, each possessing a haploinsufficient growth defect. Yet, deleting a copy of all three subunits in the complex at once, which maintains the stoichiometric balance of the complex, still caused a significant growth defect (Fig. 1B). This finding indicates that haploinsufficiency is a compound phenotype: copy number variation of HI genes is proteotoxic, but the genes must also be growth-limiting in some manner. There is extensive evidence for the growth-limiting nature of individual HI genes in published datasets. For example, ribosomal genes are known to be limiting for biomass accumulation when heterozygously deleted, causing a decrease in cell size (2). Yet, I found that cells which have an extra copy of a ribosomal gene do not possess this phenotype. Similarly, heterozygous deletion of individual subunits of the CCT chaperone, which folds tubulin and actin, results in a sensitivity to microtubule- and actin-depolymerizing agents (3) that does not occur when CCT subunits are overexpressed. From these data I concluded that there is evidence for both the dosage balance and the insufficient amounts hypotheses, and I developed a new hypothesis, which we call the dosage-stabilizing hypothesis of haploinsufficiency, to reflect the dual nature of haploinsufficiency.


[Figure 1, panels A and B. Panel B, growth defect: sui2Δ/+, 17%; sui3Δ/+, 17%; gcd11Δ/+, 9%; sui2Δ/+ sui3Δ/+ gcd11Δ/+ (dosage balanced), 12%.]

Fig 1: Dosage imbalance alone cannot explain haploinsufficiency.

(A) The dosage stabilizing hypothesis: HI genes cause a decrease in fitness when under- and over-expressed. This results in a narrowed fitness distribution that is driven by the evolutionary pressure to both increase and decrease expression.

(B) Sui2, Sui3, and Gcd11 are all members of the heterotrimeric translation initiation factor complex eIF2. Heterozygous deletion of individual subunits results in dosage imbalance, while heterozygous deletion of all three subunits creates a dosage-balanced haploinsufficient strain. The percentage increase in doubling time of heterozygous deletion strains compared to a wild-type diploid is noted, under maximal growth conditions (YPD, 30°C).
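As a small worked example of the quantity reported in panel B, the percentage increase in doubling time relative to wild type can be computed as below. The growth defect percentages are those shown in the figure; the wild-type doubling time of 90 minutes, the strain labels, and the function names are hypothetical and serve only to make the arithmetic concrete.

```python
def percent_increase(mutant_doubling_time, wild_type_doubling_time):
    """Percentage increase in doubling time relative to wild type."""
    difference = mutant_doubling_time - wild_type_doubling_time
    return 100.0 * difference / wild_type_doubling_time


def doubling_time_from_defect(wild_type_doubling_time, defect_percent):
    """Doubling time implied by a given percentage growth defect."""
    return wild_type_doubling_time * (1.0 + defect_percent / 100.0)


# Growth defects from Fig. 1B; strain labels are illustrative shorthand.
growth_defects = {"sui2_het": 17, "sui3_het": 17, "gcd11_het": 9, "triple_het_balanced": 12}

WILD_TYPE_MINUTES = 90.0  # hypothetical value, not a measurement from this work

for strain, defect in growth_defects.items():
    implied = doubling_time_from_defect(WILD_TYPE_MINUTES, defect)
    print(strain, round(implied, 1), round(percent_increase(implied, WILD_TYPE_MINUTES), 1))
```

The point of interest is that the dosage-balanced triple heterozygote still carries a nonzero defect, which is the observation the caption describes.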

To confirm these results, I then looked for evolutionary evidence for the dosage-stabilizing hypothesis. I was able to use the validated set of haploinsufficient genes to show that HI genes have decreased ranges of expression, independent of their absolute levels of expression. The narrowed ranges of expression for HI genes indicate a strong evolutionary selective pressure to keep these highly dosage sensitive genes tightly regulated, so that organismal fitness is not compromised by small fluctuations in gene expression over time. Overall, this model clarifies several important characteristics of haploinsufficient genes, and explains the persistence of haploinsufficiency over evolutionary time.

Adaptations to haploinsufficiency

With the persistence of haploinsufficiency, organisms have had many opportunities to adapt and respond to the presence of haploinsufficient genes in the genome throughout evolution.

It is reasonable to suppose that such powerfully dosage-sensitive genes, highly conserved yet unable to modulate their expression over time, have impacted genome evolution. There are two proposed adaptations that I find probable, and that warrant further exploration: 1) duplicating HI genes may help to escape the consequences of haploinsufficiency, and 2) haploinsufficient genes may impact chromosome maintenance.

Duplication of haploinsufficient genes

Duplicated genes, or paralogs, are ubiquitous in nature, found in genomes across all domains of life. Genes may be duplicated individually or as part of small chromosomal segments, referred to as small-scale duplication (SSD) events, or they may duplicate as part of a large-scale whole genome duplication (WGD) event. Gene duplication is thought to be a primary driver in the formation of new genes (4). After duplication, the two copies may remain functionally redundant, retain partial function (subfunctionalization), gain new functions (neofunctionalization), or one of the two copies may be lost (5). Whether or not the duplicate gene copy is retained depends on several factors including gene essentiality, gene functional category, number of alternatively spliced forms, protein evolution rate, membership in protein complexes, number of protein interaction partners, and organismal complexity (6).

It is thought that gene duplication may mitigate haploinsufficiency because loss of a single copy would have less severe effects on fitness. From this work and others, we know that the expression of HI genes is limiting for growth, and causes a fitness defect when expression is too low. Having a second copy of the HI gene present ensures that deletion of just one copy no longer causes HI gene expression to dip below the threshold required for wild-type growth (Fig. 2). This is potentially still true even if gene expression diverges from the ancestral level (modeled as “split expression” in Fig. 2). Analysis of gene duplication events throughout evolution shows preliminary evidence that gene duplication may offer some benefit for haploinsufficiency. In humans we see that HI genes possess much larger paralogous gene families than non-HI genes (mean family size = 4.65 vs. 1.77) (7), suggesting that it is beneficial for HI genes to be maintained in multiple copies. In yeast, HI genes are also more likely than non-HI genes to be retained after duplication (8). However, for HI gene duplication to actually alleviate the fitness defects of haploinsufficiency two things must be true: 1) paralogs must share the essential function of the HI gene, and 2) the increased gene expression resulting from the duplication itself cannot cause a fitness defect. What can eukaryotic genomes tell us about these two stipulations of our hypothesis?


[Figure 2. Panels: pre-duplication; post-duplication; post-duplication (gene expression is split). Rows: 2N and 2N-1, showing the relative expression of alleles A, A1 and A2 against the threshold required for wild-type fitness. Overall expression in 2N-1: 50% (pre-duplication), 150% (post-duplication), 75% (post-duplication, gene expression is split).]

Fig. 2: Gene duplication as a mechanism for escape of haploinsufficiency.

Heterozygous deletion of an ancestral haploinsufficient gene reduces gene expression by half, falling below the threshold needed for wild-type function (left). If genes are duplicated, then the duplicated copy can now supplement expression of the haploinsufficient gene to retain wild-type function, assuming that it maintains some of the critical functions of the ancestral gene (middle).

Even if gene expression is split between paralogs following gene duplication, heterozygous deletion of one allele after duplication now only reduces overall expression by 25%. This still provides a buffer in loss of gene expression that is likely to maintain wild-type function (right).
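The arithmetic behind the three scenarios in Fig. 2 can be reproduced with a very simple model. This sketch assumes that expression scales linearly with the number of alleles present and that the paralogs are fully interchangeable; both are simplifying assumptions made for illustration, and the function name and data layout are hypothetical.

```python
def overall_expression(paralogs):
    """Total expression relative to the ancestral diploid level (1.0 = 100%).

    paralogs: list of (diploid_expression, alleles_present) pairs, where
    diploid_expression is the output of that paralog with both alleles intact
    and alleles_present is 0, 1 or 2.
    """
    return sum(expression * alleles / 2 for expression, alleles in paralogs)


# Heterozygous deletion (2N-1) of one allele in each scenario of Fig. 2.
pre_duplication = overall_expression([(1.0, 1)])
post_duplication = overall_expression([(1.0, 1), (1.0, 2)])
post_duplication_split = overall_expression([(0.5, 1), (0.5, 2)])

print(pre_duplication, post_duplication, post_duplication_split)
```

Under these assumptions the same heterozygous deletion that halves expression before duplication removes only a quarter of total expression when expression is split evenly between two paralogs.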

The Saccharomycetaceae family of yeast is an excellent model to study gene duplication because of the whole genome duplication (WGD) event it underwent in its recent evolutionary history. This gives us a large number of paralogs to study, and allows us to split Saccharomycetaceae species into pre- and post-WGD lineages. The genomes of these lineages can be compared to understand how large-scale duplication events impact paralog divergence and expression.

In budding yeast, approximately 25% of WGD paralogs are known to retain shared interaction partners and/or shared functional relationships for paralogs with as little as 25% sequence identity (9). This suggests that paralogs may, in part, remain functionally redundant after whole genome duplication, making WGD paralogs good candidates for answering our question. According to one study, 23% of HI genes have WGD-duplicates compared to just 6% of non-HI genes (6). Probabilistically, this means that HI genes are more likely than non-HI genes to be retained after the whole genome duplication event, suggesting a selective advantage to their duplication (8). There is already some evidence suggesting that gene duplication impacts haploinsufficiency phenotypes in yeast. Among ribosomal genes, the majority of which are haploinsufficient, we find that unduplicated ribosomal genes have more severe haploinsufficiency phenotypes than duplicated ribosomal genes (1). For ribosomal genes at least, gene duplication appears to be beneficial in mitigating haploinsufficiency.

Although the phenotype of duplicated ribosomal genes is strong evidence for gene duplication mitigating haploinsufficiency, the increased expression that initially results from gene duplication remains confounding. Work in this thesis has shown that increased gene dosage of HI genes is harmful to cells because it causes dosage imbalances between members of protein complexes and pathways (Fig. 3A). How then can we get duplication of haploinsufficient genes without causing deleterious effects on fitness? Previously, groups have posited that the nature of whole genome duplication may provide an explanation for how duplication of dosage sensitive genes could escape the consequences of dosage imbalance (10). By duplicating all genes at once, and slowly losing paralogs over time, those genes which are most sensitive to changes in dosage can remain in balance with one another (Fig. 3B). Data from this thesis would support the WGD hypothesis: we know that HI gene overexpression is likely harmful due to proteotoxicity; therefore, increased expression after gene duplication is only a problem if there is protein imbalance.

Fig. 3: Gene duplication and dosage balance


(A) The relationships between gene duplication and protein complex balance. The protein products of genes 1 and 4 form a heterodimeric complex. If gene 4 alone is duplicated (left), then the protein complex formed is unbalanced, and the excess subunits must be degraded or aggregated in the absence of a partner. This is costly for the cell, and leads to a fitness defect. If genes 1 and 4 are both duplicated, then the protein complex remains balanced and there is likely no effect on fitness. Though, notably, expression of the balanced complex is higher overall.

(B) A schematic depicting how whole genome duplication (WGD) could act as a mechanism for copy number increase without disrupting genome balance. Genes which are haploinsufficient and therefore particularly sensitive to dosage imbalance (ex: genes 1 and 4) are more likely to be retained during the period of gene loss following WGD. This ensures that the gene products remain balanced and in complex with each other, protecting against the fitness defect of dosage imbalance. Genes which are less dosage sensitive (ex: genes 3 and 6) do not need to remain in balance with one another, and so are lost gradually and asynchronously.
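The balance argument in panel A can be made concrete with a toy calculation. The sketch below assumes a heterodimer with 1:1 stoichiometry and protein output proportional to gene copy number; these are illustrative assumptions, and the function name is hypothetical.

```python
def heterodimer_balance(copies_gene1, copies_gene4):
    """Complexes formed and unpaired subunits for a 1:1 heterodimer.

    Assumes each gene copy yields one unit of protein, so the number of
    complexes is limited by the less abundant subunit.
    """
    complexes = min(copies_gene1, copies_gene4)
    return {
        "complexes": complexes,
        "excess_gene1": copies_gene1 - complexes,
        "excess_gene4": copies_gene4 - complexes,
    }


# Gene 4 duplicated alone: unpaired subunits must be degraded or aggregate.
print(heterodimer_balance(2, 4))
# Genes 1 and 4 duplicated together: balanced, but more complex overall.
print(heterodimer_balance(4, 4))
```

In this toy model, duplicating one subunit alone produces excess unpaired protein without any additional complex, whereas duplicating both subunits doubles the amount of complex and leaves no excess.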

In the future I would want to more directly test whether gene duplication mitigates haploinsufficiency. First I would want to identify a set of haploinsufficient genes with paralogs in Saccharomyces cerevisiae that have orthologs of one or both paralogs in a Saccharomycetaceae outgroup that has not undergone WGD. We would expect that orthologs within the outgroup which are unduplicated have more severe haploinsufficiency phenotypes than duplicated orthologs, and that the haploinsufficiency for the unduplicated outgroup genes is more severe than for S. cerevisiae duplicated orthologs. Given that this experiment is so dependent on HI gene conservation across the outgroup, an assumption which has not yet been tested, I would also like to investigate de novo duplication of HI genes in S. cerevisiae. By evolving a strain after an HI gene has been duplicated artificially, I could test what happens to the gene expression of the two copies, whether the functionality of the two copies remained the same, and whether presence of a second gene copy mitigated the haploinsufficiency of heterozygous deletion of the other. Together, these experiments would determine whether gene duplication can play a role in helping genes escape haploinsufficiency, thus adapting to their presence in the genome.

Haploinsufficiency and karyotype maintenance

I noted earlier that genome size differs across eukaryotes by more than 10,000-fold, but we also know that the packaging of genomes into chromosomes can be very diverse. Organisms may have as few as one chromosome or more than 1000 chromosomes, ranging from kilobases to megabases in scale. Unlike genome size, however, chromosome number does not generally correlate with organismal complexity. Although humans carry more chromosomes (2n=46) than flies (2n=8) or worms (2n=12), there are other mammals like Indian muntjacs which have fewer chromosomes still (2n=6), and plants that have many times more, like the black mulberry plant (2n=308). Some organisms even have very small chromosomes that carry as little as 0.01% of the genome. Yet, despite these profound differences in how genetic material is packaged, all organisms are able to maintain a stable karyotype within individuals and among species.

The dosage-stabilizing effect of haploinsufficient genes makes them excellent candidates for regulators of karyotype stability and maintenance. Both gain and loss of HI genes confer a fitness penalty, providing a gene-specific consequence for genome imbalance. It is easy to see how, within a population of cells, losing or gaining a copy of an HI gene could lead to the selective pressure needed to compete out unfit cells. Therefore, the consequence of gaining or losing many HI genes at once, all packaged into a chromosome, would be more detrimental still. One group has hypothesized that HI genes are critically important for the karyotype maintenance of organisms which exhibit bipolar sexuality, such as fungi (8). In these organisms there are two sexes, or mating types, that are controlled by two alleles (MATa and MATα) at the same locus of a particular chromosome (Fig. 4A). In haploids, the allele present at the mating type locus determines the sex pheromones that are produced, and thus the mating ability of the organism. In diploids, mating is suppressed due to the presence of both sex-specific alleles, but loss of either allele can lead to an otherwise diploid cell initiating mating, creating triploid zygotes. Thus, loss of the sex chromosome would be detrimental to the overall fitness of a population, destabilizing karyotypes by creating triploids that give rise to highly aneuploid progeny with low viability. This same research group found that sex chromosomes in yeast are enriched for HI genes, whether defined in nutrient-limited or nutrient-rich conditions (8, 11).

Preferential allocation of HI genes to the mating type chromosome is conserved across the yeast lineage, suggesting that HI gene distribution may protect yeast populations from the consequences of losing the mating type locus. In other words, the penalty of losing all of the HI genes on the chromosome carrying the mating type locus causes cells to be outcompeted from the population before they have the chance to mate and create more complex aneuploid cells (Fig. 4B). Interestingly, organisms that possess sex chromosomes which are silenced, such as humans and worms, often exclude HI genes from sex chromosomes (8). This prevents HI gene loss through chromosome silencing events. Exceptions to this include genes that are encoded on the pseudo-autosomal regions of X chromosomes, where there is a second copy of the gene present on the Y chromosome or in the regions protected from silencing on the X.

Fig. 4: HI genes could help protect against loss of the sex chromosome in yeast.

(A) In many fungi, sex is determined by a mating type locus. In the case of Saccharomyces cerevisiae, this is located on chromosome III. Haploids, which only have one copy of the mating type locus, possess either a MATa or a MATα allele that determines their mating type. When crossed to another haploid of the opposite mating type, they will form a diploid. Diploids have two copies of the mating type locus and are unable to mate with haploids. Diploid cells can undergo meiosis to produce haploid spores under the appropriate conditions as part of the sexual reproduction cycle of the fungus. If, however, the chromosome which contains the mating type locus is lost in a diploid, the cell can no longer sporulate. Instead, it gains the ability to mate with other cells, creating triploid zygotes upon mating. The resulting triploid progeny would then sporulate to create highly aneuploid offspring, a disastrous fitness consequence for the population. However, loss of the chromosome containing the mating type locus itself confers a large decrease in fitness compared to wild-type diploids and haploids, likely causing the monosomic cell to be competed out of the population before triploids can accumulate.

(B) It is proposed that the fitness penalty of chromosome loss for chromosomes containing the mating type locus can be augmented by the enrichment of HI genes on that chromosome (8). In a hypothetical graph comparing the relative fitness of cells that have lost the chromosome encoding the mating type locus, the fitness defect would be much greater when that chromosome is enriched for HI genes than when it is not. If this were true, then the allocation of HI genes to chromosomes could lead to more rapid loss of such cells from the population, preventing the generation of complex aneuploid cells.

I propose that we can directly test the contribution of haploinsufficient genes to the penalty of chromosome loss by using yeast as a model organism. We can test the HI penalty by comparing the fitness of strains in which haploinsufficient genes are present on a given chromosome to strains in which they are absent from it. Critically, this comparison in fitness should be made when the given chromosome is lost. The fitness of the monosomic strain (containing only one copy of the chromosome), with or without the loss of HI genes contributing to the penalty of chromosome loss, can then be measured by competition, that is, by following the relative abundance of monosomes in the population over time. Importantly, HI genes must not be deleted from the genome, but rather moved to an ectopic site. I would want to test this for chromosomes which carry the mating type locus, because we already know that they are enriched for HI genes, as well as for other smaller chromosomes. Additionally, we can compare the effect of losing one HI gene versus losing many. I would predict that loss of a chromosome will carry a higher fitness penalty with HI genes present than without, and that HI genes act in combination to provide a higher fitness penalty when lost. This project would allow us to measure the impact of individual haploinsufficient genes on karyotype maintenance, and would give insight into how haploinsufficiency may shape eukaryotic genomes, playing a role in stabilizing organismal karyotypes over evolutionary time.
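The competition measurement proposed above could be summarized in several ways; the text does not specify an analysis. One common approach, sketched here in Python as an illustration only, is to estimate a per-generation selection coefficient as the slope of ln(monosome abundance / reference abundance) against generations, and then to compare the coefficient for a strain whose HI genes are lost with the chromosome to that of a strain whose HI genes were moved to an ectopic site. The function names, variable names, and all counts below are hypothetical and are not taken from this thesis.

```python
import math


def selection_coefficient(generations, strain_counts, reference_counts):
    """Slope of ln(strain / reference) versus generations (ordinary least squares).

    A negative value means the strain is being outcompeted by the reference.
    """
    n = len(generations)
    if n < 2 or len(strain_counts) != n or len(reference_counts) != n:
        raise ValueError("need at least two time points and equal-length inputs")
    if any(c <= 0 for c in list(strain_counts) + list(reference_counts)):
        raise ValueError("counts must be positive to take a log ratio")

    log_ratios = [math.log(s / r) for s, r in zip(strain_counts, reference_counts)]
    mean_t = sum(generations) / n
    mean_y = sum(log_ratios) / n
    sxx = sum((t - mean_t) ** 2 for t in generations)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(generations, log_ratios))
    return sxy / sxx


def hi_gene_contribution(s_hi_on_chromosome, s_hi_relocated):
    """Part of the monosomy penalty attributable to HI genes on the lost chromosome."""
    return s_hi_on_chromosome - s_hi_relocated


# Hypothetical, made-up counts for illustration only (not data from this thesis).
generations = [0, 5, 10, 15, 20]
reference = [1000, 1000, 1000, 1000, 1000]
monosome_hi_present = [1000, 610, 370, 220, 135]    # HI genes lost with the chromosome
monosome_hi_relocated = [1000, 900, 820, 740, 670]  # HI genes moved to an ectopic site

s_present = selection_coefficient(generations, monosome_hi_present, reference)
s_relocated = selection_coefficient(generations, monosome_hi_relocated, reference)
print(f"s with HI genes on the lost chromosome: {s_present:.3f} per generation")
print(f"s with HI genes relocated:              {s_relocated:.3f} per generation")
print(f"difference attributed to HI genes:      {hi_gene_contribution(s_present, s_relocated):.3f}")
```

Under the prediction stated above, the coefficient for the strain with HI genes on the lost chromosome would be more negative than for the relocated strain. The log-ratio slope assumes roughly exponential competition and strictly positive counts at every time point; real sequencing or colony counts would likely need replicates and an error model that this sketch omits.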

Concluding remarks

It is intriguing that highly dosage-sensitive genes, which make up only a minority of the genome, can provide such a far-reaching perspective on cellular biology and eukaryotic evolution. Haploinsufficient genes and the phenotypes they cause are shared among organisms as distantly related as yeast and humans. Their persistence over evolutionary time gives us insight into basic cellular processes that are rate-limiting across eukaryotes. From them, we can understand more about the role of protein stoichiometry in cellular fitness, identify the critical molecular players for optimal organismal fitness, and learn more about the regulation of gene expression. Undoubtedly, we also have much more to learn about the role of haploinsufficient genes in disease, which increasingly sophisticated computational datasets allow us to predict with greater accuracy.

On an even grander scale, we know that haploinsufficient genes are capable of shaping our genomes. Deletions of haploinsufficient genes can drive compensatory genomic rearrangements that resolve the fitness defects of misexpression (12). Gene duplications in particular appear to be an evolutionary mechanism for resolving haploinsufficiency, as demonstrated by the conserved amplification of important HI gene families. Finally, we know that the dosage-stabilizing effect of haploinsufficiency makes HI genes likely candidates for more global regulators of genome imbalance, acting to keep karyotypes stable. Understanding haploinsufficiency will surely help to shed new light on the forces that shape our genomes, and how we can take advantage of these forces to improve the fitness of organisms across the eukaryotic lineage.

References

1. Deutschbauer AM, et al. (2005) Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics 169:1915–1925.

2. Jorgensen P, Nishikawa JL, Breitkreutz B-J, Tyers M (2002) Systematic Identification of Pathways That Couple Cell Growth and Division in Yeast. Science 297:395–400.

3. Hoepfner D, et al. (2014) High-resolution chemical dissection of a model eukaryote reveals targets, pathways and gene functions. Microbiol Res 169:107–120.

4. Ohno S (1970) Evolution by Gene Duplication (George Allen and Unwin, London, UK).

5. Force A, et al. (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.

6. Qian W, Zhang J (2008) Gene Dosage and Gene Duplicability. Genetics 179:2319–2324.

7. Kondrashov FA, Koonin EV (2004) A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet 20(7):283–287.

8. de Clare M, Pir P, Oliver SG (2011) Haploinsufficiency and the sex chromosomes from yeasts to humans. BMC Biol 9:15.

9. Guan Y, Dunham MJ, Troyanskaya OG (2007) Functional analysis of gene duplications in Saccharomyces cerevisiae. Genetics 175:933–943.

10. Papp B, Pál C, Hurst LD (2003) Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.

11. Delneri D, et al. (2008) Identification and characterization of high-flux-control genes of yeast through competition analyses in continuous cultures. Nat Genet 40(1):113–117.

12. Hughes TR, et al. (2000) Widespread aneuploidy revealed by DNA microarray expression profiling. Nat Genet 25:333–337.