High-resolution sequencing of mitochondrial DNA

by Sofia Nicolai Annis

B.A. in Biology, Smith College

A dissertation submitted to

The Faculty of the College of Science of Northeastern University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

July 27th, 2020

Dissertation directed by

Konstantin Khrapko Department of Biology

Acknowledgments

I would first like to thank Dr. Konstantin Khrapko for inviting me into his lab and graciously letting me take control of the projects that interested me while always being available with guidance and support. I appreciate all of the meandering conversations about our science, unrelated science, and life in general. I would not have been able to keep going without the continuous supply of coffee, chocolate, and good humor. The additional mentorship and support from Dr. Jonathan Tilly and particularly Dr. Dori Woods was invaluable and truly helped me grow as a scientist.

Many aspects of this work required a true team effort. Zoe Fleischmann provided invaluable contributions to the development of LUCS and showed that bench and computational scientists can work together in harmony. Her patient guidance in teaching me the basics of coding is greatly appreciated. Other members of the Laboratory for Aging and Infertility Research were excellent technical and theoretical sounding boards, and they helped create a supportive and happy community.

I want to thank the many, many undergraduate, graduate, and high school students who I worked with in the Khrapko lab. Through teaching them, I was able to expand and deepen my own knowledge, and I had to challenge myself to always have an answer waiting. Their willingness to take on a wide variety of projects let me chase many ideas at once. Particular thanks are owed to Housaiyin Li, Melissa Franco, and Zachary Mullin-Bernstein.

Finally, I want to thank my family for fostering my interest in science from a young age. They encouraged me to perform my own genetics experiments with my many rabbits, sent me away to science camps in the summers, and pretended to not be too sad when I continued to spend my summers away from home in labs across the country. Last but not least, I want to thank my boyfriend Mike, who along with his family has been so patient and kind, for his unwavering support and devotion.

Abstract of Dissertation

Mitochondria, heralded as the powerhouses of the cell, contain small, highly conserved genomes. This mitochondrial DNA (mtDNA) is abundant in most eukaryotic cells, reflecting its crucial role in providing protein subunits that are necessary for oxidative phosphorylation. mtDNA is asexually transmitted through the maternal germline, but the precise mechanisms that govern the maintenance of genomic integrity through the generations are still a focus of research and lively debate. The general consensus is that germline mitochondria undergo a population bottleneck and are exposed to strong purifying selection, a combination that can generate a high degree of genetic variability among offspring. Details are lacking, however, on the exact structure of the bottleneck and the timing of selective pressures. Current mtDNA sequencing strategies that aim to elucidate these topics largely rely on Next Generation Sequencing (NGS) datasets that supply an abundance of data but offer no insight into mtDNA as discrete and diverse genetic elements. This research offers a high-resolution analysis of mtDNA in oocytes, with a particular focus on low-frequency variants that are often overlooked by NGS approaches but could have important clinical significance. To further enhance this whole-mtDNA genome sequencing approach, this work describes a novel sequencing strategy that couples Oxford Nanopore’s long single-read capabilities with unique molecular identifiers to generate highly accurate consensus sequences and ultimately capture the true genetic diversity of mtDNA populations.

Table of Contents

Acknowledgments
Abstract of Dissertation
Table of Contents
Abbreviations
Chapter 1: High-resolution analysis of oocyte mtDNA
    Introduction
    Materials and methods
    Results
    Discussion
    Figures
    Tables
    References
Chapter 2: Quasi-Mendelian Paternal Inheritance of mitochondrial DNA: A notorious artifact, or anticipated mtDNA behavior?
    Abstract
    Introduction
    Methods and results
    Figures
    References
Chapter 3: LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules
    Introduction
    Materials and methods
    Results
    Discussion
    Figures
    Tables
    References

Abbreviations

bp: Base pair
CCS: Circular consensus sequencing
COX: Cytochrome c oxidase
ETC: Electron transport chain
Gbp: Giga base pair (1,000,000,000 bp)
GERP: Genomic evolutionary rate profiling
HGP: Human Genome Project
Indel: Insertion or deletion in a DNA sequence
Kbp: Kilo base pair (1,000 bp)
LHON: Leber hereditary optic neuropathy
LUCS: Long UMI-driven consensus sequencing
Mbp: Mega base pair (1,000,000 bp)
mtDNA: Mitochondrial DNA
nDNA: Nuclear DNA
NGS: Next-generation sequencing
NUMT: Nuclear mitochondrial DNA
PGC: Primordial germ cell
PolG: Polymerase gamma
SMRT: Single molecule real time sequencing
SNP: Single nucleotide polymorphism
SNV: Single nucleotide variant
ssDNA: Single-stranded DNA
UMI: Unique molecular identifier
VEP: Variant effect predictor
WGS: Whole-genome sequencing

Chapter 1: High-resolution analysis of oocyte mtDNA

Introduction

Often called the powerhouses of the cell, mitochondria are crucial components of eukaryotic cells that fulfill the bulk of the cells’ energetic demands through oxidative phosphorylation [1]. Mitochondria arose following the endosymbiosis of a single-celled organism, likely an α-proteobacterium, into the common, likely archaeal, ancestor of all eukaryotes around 2 billion years ago [2-4]. This monophyletic origin has been traced back by sequencing and phylogenetic analysis of the mitochondrial DNA (mtDNA) that remains in the organelle as a vestige of its previous free-living lifestyle [5]. The emergence of mitochondria as specialized intracellular energy producers gave an immense evolutionary advantage to their host cells and is one of the key reasons for the growth in cell size and complexity of eukaryotes [6-8]. Over time, mitochondria lost most of their genetic complexity, shifting to a reliance on the nuclear genome to provide the organelle with essential proteins. Mitochondrial genome size varies across the eukaryotic domain, but in mammals is typically 15,000-17,000 bp. Encoding just 13 protein subunits, 22 tRNAs, and 2 rRNAs, mammalian mtDNA is a compact, intron-free circular genome with minimal non-coding sequence (Figure 1.1). Around 1,500 nuclear gene products are involved in governing mitochondrial function, with specific expression varying across tissue types [9, 10]. The mitochondrial genome exists in high abundance, with an average of ~4.6 copies per organelle [11, 12] and hundreds to hundreds of thousands of organelles per cell.

Across eukaryotic evolution, mitochondria have lost differing amounts of their genomic content. Plant mitochondrial genomes are significantly larger, typically ranging from 100-1,000 kbp and encoding ~40-156 genes; the lower gene-to-genome size ratio is due to a more complex genetic structure featuring introns and repetitive sequences [13]. Despite the broad diversity of mitochondrial genomes, a shared trend across eukaryotes is the retention of components of the electron transport chain (ETC) [14]. The proper function of these retained mitochondrial genes is vital for life; indeed, pathogenic mtDNA mutations that disrupt gene function can manifest in a variety of clinical syndromes ranging from mild to fatal. Leber hereditary optic neuropathy (LHON) is the most prevalent mitochondrial disorder, occurring in an estimated 1 in 10,000 to 1 in 50,000 Europeans [15, 16]. LHON was the first disorder to be directly linked to a mutation in the mitochondrial genome [17] and is characterized by the degeneration of retinal ganglion cells, leading to severe or complete vision loss. Like many mitochondrial diseases, LHON can arise from any of several point mutations, with three point mutations accounting for 90-95% of cases and dozens of other mutations eliciting a similar pathology [15]. Because mitochondrial function is vital for almost all cells in the human body, dysfunction caused by mtDNA mutations affects a broad spectrum of organ systems and results in a varied range of clinical phenotypes.

A key feature of pathogenic mtDNA mutations is that they must be at relatively high frequency within the cell in order to exert a negative phenotypic effect. For example, the pathogenic mutations that cause LHON typically need to be present in at least 60% of the mtDNA genomes in order to result in loss of vision [15]. Because mammals are diploid, a pathogenic nuclear mutation present in as few as one copy can have a detrimental effect on mitochondrial function. The presence of a single pathogenic mutation in mtDNA within a cell, however, will not have a noticeable effect, as it is merely one disruption among hundreds or thousands of genomes. Even slightly higher levels of heteroplasmy (the state of having multiple mtDNA genotypes within a cell, as contrasted with the uniform state of homoplasmy) are generally not sufficient to have a phenotypic effect, owing to the dynamic nature of mitochondrial populations. While mitochondria can exist as discrete organelles, they can also combine to form large, interconnected networks, and they transiently fuse with and split apart from other mitochondria [18]. mtDNA encodes membrane-bound proteins, so an organelle containing a dysfunctional protein product can simply fuse with a healthy organelle to acquire the correct form of the protein, even though it still contains the mutated genome [19]. Figure 1.2 illustrates this dynamic: at low abundance, pathogenic mutations are virtually invisible to the cell’s regulatory machinery but can spread throughout the cell’s mitochondrial population. Only at high heteroplasmy levels, typically ~80%, is the detrimental mutation able to impair organellar and ultimately cellular function, at which point it can manifest as a mitochondrial disease [20].
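The threshold effect described above can be made concrete with a small calculation. The following is a minimal sketch, not a clinical tool: the function names are hypothetical, the copy numbers are invented for illustration, and the default threshold of 0.8 simply encodes the "typically ~80%" figure quoted in the text (0.6 would correspond to the LHON example).

```python
def heteroplasmy(mutant_copies: int, total_copies: int) -> float:
    """Fraction of a cell's mtDNA copies that carry a given variant."""
    if total_copies <= 0:
        raise ValueError("total_copies must be positive")
    if not 0 <= mutant_copies <= total_copies:
        raise ValueError("mutant_copies must lie between 0 and total_copies")
    return mutant_copies / total_copies


def exceeds_threshold(mutant_copies: int, total_copies: int,
                      threshold: float = 0.8) -> bool:
    """True when heteroplasmy reaches the (assumed) phenotypic threshold."""
    return heteroplasmy(mutant_copies, total_copies) >= threshold


# Invented example cells, each holding 1,000 mtDNA copies.
for mutant in (1, 300, 850):
    level = heteroplasmy(mutant, 1000)
    print(f"{mutant:>4} mutant copies -> heteroplasmy {level:.3f}, "
          f"above threshold: {exceeds_threshold(mutant, 1000)}")
```

A single mutant genome among 1,000 copies corresponds to a heteroplasmy of 0.1%, far below either threshold, which is why an isolated de novo mutation has no detectable effect until it expands.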

An important question in mitochondrial biology is how these pathogenic mutations are able to take hold in cells and what mechanisms are in place to maintain healthy mitochondrial genomes. Mitochondrial mutations are known to increase with age, and examining this phenomenon can provide important insight into mitochondrial population dynamics. An interesting case study is the decline in cytochrome c oxidase (COX) activity in neurons of the substantia nigra, a midbrain structure. COX is the final enzyme complex in the electron transport chain and is composed of 10 nuclear-encoded subunits and 3 larger mitochondrial-encoded subunits [21]. The substantia nigra has been specifically pinpointed as a critical site for the development and progression of Parkinson’s Disease, which is characterized by the loss of dopaminergic neurons in that region [22]. Older brains (70+ years of age) showed significant loss of COX activity in substantia nigra neurons [23]; upon further investigation, the COX defects appeared to be due to mtDNA deletions that were present at a high proportion in defective cells but at under 60% prevalence in phenotypically normal cells [24]. mtDNA deletions are typically excisions of several thousand nucleotides from the mitochondrial genome and often arise as a result of faulty DNA repair activity, which worsens with age [25]. A key element of this study, however, was that the deletions were clonally expanded rather than independently created: that is, a single mtDNA molecule acquired a deletion and was replicated until it reached a high enough heteroplasmy level to cause detectable defects. In fact, much of the age-dependent increase in mtDNA mutations is attributable to clonal expansion rather than the spontaneous generation of a wide variety of distinct mutations [26].

On its surface, clonal expansion seems to have a fundamental flaw in terms of maintaining healthy mitochondrial populations. While certain mutations can confer a replicative advantage, often the reason a particular genome expands is not understood. This provides potentially deleterious mutations with the opportunity to reach phenotypically relevant concentrations, at the cost of the organism. Clonal expansion, however, is beneficial for the organism on the whole for several reasons. First, while expansion can shift a single cell’s mitochondrial population, it is far more challenging for that change to take hold across an entire tissue type or organ. Admittedly, a deleterious mutation expanding in a stem cell niche, such as the colonic crypt, can result in a substantial spread of that defect throughout the tissue [27]. An important distinction is that this expansion takes time; by the time a de novo deleterious mutation expands to a clinically significant threshold, the organism is often beyond reproductive age, or at least has had time to reproduce. Natural selection works in part by penalizing changes that impede fitness, but if a defect only manifests itself after the reproductive period of the organism’s life then it cannot be subject to selection, barring the selective advantage provided by the contributions of healthy aging individuals to their communities.

A major advantage of clonal expansion becomes evident when contrasted with the alternative: the accumulation of many mildly detrimental mutations. Any single mtDNA mutation within a population will not have any effect on the health of that cell, as discussed above. In fact, as more independent mutations arise, the mitochondria will initially be able to function better on the whole than if a single deleterious point mutation had expanded clonally. As mitochondria with diverse mutations fuse together, they can readily compensate for each other’s singular defects. If the accumulation of mutations carries on unchecked, however, the population will reach a point where the mutated genomes are no longer able to sufficiently counterbalance the ill effects of so many deficiencies. This scenario of mutational meltdown is the crux of Muller’s Ratchet, the postulated decline of asexual populations that are unable to maintain their genomic integrity [28]. In effect, clonal expansion serves to expose deleterious mutations to natural selection by propagating polymorphisms to levels where they are phenotypically detectable rather than allowing for an insidious creep towards mutational meltdown.

Maternal mitochondrial inheritance and the germline bottleneck

Mitochondrial diseases typically manifest themselves in infancy or early childhood. While adult-onset disorders occur, they are much rarer than childhood disease and are more often the result of nuclear-encoded defects rather than mitochondrial mutations [29, 30]. This demonstrates that although clonal expansion of deleterious mtDNA mutations accelerates in later life and may have a variety of aging-related health implications, the majority of mitochondrial disorders are inherited rather than generated de novo. Although the following chapter highlights the ongoing investigation of paternal transmission of mtDNA [31], this phenomenon appears to be exceedingly rare [32]. Notwithstanding these rare exceptions, the scientific consensus of the last 50 years is that mitochondria are, in nearly all cases, exclusively maternally derived [33, 34]. As described in Chapter 2, successful transmission of paternal mtDNA appears to rely on a combination of a faulty gatekeeper and a strong replicative or selective advantage for the paternal haplotype (the series of mtDNA variants characteristic of that person) [35]. The gatekeeper in this instance is one of the mechanisms in place to specifically prevent paternal mtDNA from surviving in the zygote. Typically, sperm mitochondria are ubiquitinated while in the male reproductive tract [36]. Following fertilization, the oocyte’s autophagy machinery recognizes these ubiquitinated mitochondria and destroys them, generally by the 16-cell stage [37]. The primary reason for this targeted degradation may be to prevent potentially damaged mitochondria (which may therefore bear damage-induced mutations) from proliferating when there is already an abundance of presumably healthier organelles provided by the oocyte. Another explanation is the potential for incompatibility between two distinct mitochondrial haplotypes. As much of mitochondrial function is governed by the nuclear genome, functional nucleus-to-mitochondria cross-talk is essential to maintain cell energetics. nDNA/mtDNA incompatibility appears to be a significant factor in speciation. Subspecies and same-species cybrids (cytoplasmic hybrids) have been demonstrated to have impaired fertility and generally degraded health [38-40]. Interestingly, in the paternal inheritance cases investigated by Luo et al. [31], the paternal haplotype was detected when the probands’ mtDNA was sequenced on suspicion of mitochondrial disease. Although the authors did not divulge the ultimate source of the study subjects’ health complications, it is tempting to consider that the health impairments may be due to incompatibility between the maternal and paternal haplotypes.

Between generations, mtDNA populations can undergo substantial shifts in heteroplasmy. Mutations that are rare in early progenitors can become fixed within a few generations, a phenomenon that was first observed in Holstein cows in the early 1980s [41, 42]. The rapid change in population composition points to the presence of a germline mtDNA bottleneck, where offspring inherit only a small subset of their mother’s available mtDNA, one that is not necessarily reflective of the mother’s total mtDNA composition. Originally, the bottleneck effect was attributed simply to a reduction in mtDNA copy number during early germline development. During the pre-implantation stage of embryonic development, mtDNA does not replicate [43], with each round of cell division resulting in diminished mtDNA content per cell. Early mitochondrial copy number estimates indicated as few as 200 organelles per primordial germ cell (PGC, the precursors to adult sex cells) [44], a sufficiently small number to explain the rapid segregation of mtDNA populations as an act of chance or random genetic drift. In this scenario, natural selection for the maintenance of healthy mtDNA genomes would occur at the level of the cell or the organism, with no need for active selection on the level of the mitochondrion. Later investigations demonstrated that PGCs in fact contain over 1,000 mitochondria at their lowest point [45, 46]; this larger founding population makes genetic drift a much less likely mechanism to explain the degree of change that can occur between two generations. An alternative explanation for the observed bottleneck effect is a reduction in the number of replicating mtDNA genomes rather than simply a reduction in the total copy number. In this scenario, despite having a large total population, only a small subset of mtDNA genomes replicate in early germ cells [47]. Although this explanation results in the same effect (i.e. early segregation of mtDNA heteroplasmies) as simple population restriction, it introduces the intriguing question of how the replicating populations are chosen. While the mechanism may be random [48], a better understanding of the intricacies of mtDNA populations and a better characterization of the replicating subpopulation is still needed.
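To illustrate why the size of the founding population matters for the drift argument above, the sketch below treats the bottleneck as a single round of random sampling. This is a deliberate simplification and an assumption of the example, not a model taken from the cited studies: real germline development involves repeated sampling over many cell divisions, and the function names and the starting heteroplasmy of 0.5 are hypothetical. The same arithmetic applies if `n` is read as the size of a small replicating subpopulation rather than the total copy number.

```python
import math
import random


def drift_sd(p: float, n: int) -> float:
    """Standard deviation of heteroplasmy after one random draw of n genomes
    from a pool with mutant fraction p (binomial sampling)."""
    return math.sqrt(p * (1 - p) / n)


def sample_heteroplasmy(p: float, n: int, rng: random.Random) -> float:
    """Simulate one founder population of n genomes drawn at random."""
    mutants = sum(rng.random() < p for _ in range(n))
    return mutants / n


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for n in (200, 1000):
    simulated = [round(sample_heteroplasmy(0.5, n, rng), 3) for _ in range(5)]
    print(f"n={n}: expected SD {drift_sd(0.5, n):.4f}, simulated {simulated}")
```

Under these assumptions, the expected spread in heteroplasmy shrinks with the square root of the founder number, so a fivefold larger founding population (1,000 versus 200) reduces the spread by a factor of about 2.2.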

A question remains regarding the role of natural selection in the development of mtDNA populations in early germ cells. In recent years there has been increasing support for purifying selection in oocyte mtDNA [49-51]; under purifying selection, deleterious mutations are eliminated from the population. One shortfall of these studies is the resolution with which they examine the mtDNA populations within a cell. Most groups use a variety of Next Generation Sequencing (NGS) tools to sequence mtDNA from oocytes and PGCs, with the result that the data reflect only population averages, do not capture rare variants, and do not distinguish between potential subpopulations within the cell. NGS platforms are also inconsistent at identifying low-frequency variants: even with a coverage depth of 3000X, variants that are below 1.5% frequency in the population are not reliably identifiable [52]. These limitations mean that the full diversity of mtDNA heteroplasmy in the germline has not yet been appreciated. Measures of selective pressure have only been applied to mutations that become widespread throughout a population, even though rare variants may have an important physiological role in the later stages of an organism’s life.

If purifying selection is indeed a key component of maintaining mitochondrial integrity across generations, there is still uncertainty as to where that selection occurs. Selection could occur at the level of the organism (some offspring live and others die, an energetically costly mechanism), the cell (some oocytes or early embryonic cells die), the mtDNA genome (through mitophagy of the organelle containing that genome), or some combination of all three [53]. A recent study in Drosophila melanogaster revealed a potential mechanism for how selection can actively remove single mtDNA genomes [54]. During early oogenesis, mitochondrial fusion is reduced, resulting in smaller organelles with fewer genomes, typically just a single copy. This makes respiratory defects from mutant mtDNA more apparent, as organelles can no longer rely on the compensation that fusion affords them. The organelles harboring deleterious mutations are then targeted by the mitophagy proteins Atg1 and BNIP3. While the same mechanisms have not yet been validated in mammalian cells, mammalian oocytes are known to have small mitochondria that typically contain a single mtDNA genome [55], which would allow for a similar selective mechanism.

To get a better understanding of the complexities underlying the mtDNA populations, in this study we used a single molecule sequencing approach to measure the selective pressures that are active at the level of the mtDNA genome. By sequencing mtDNA genomes in their entirety, selection can be measured on the genome as a whole rather than on singular mutations. This approach also offers higher accuracy for detecting rare variants, which may have a significant role in aging and aging-related disorders but are inconsistently identified by NGS. The timing of selection can also be assessed by comparing the strength of selection in older versus newer variants.

Genomic evolutionary rate profiling as a pathogenicity indicator

There are several resources for predicting the phenotypic impact of mtDNA mutations. The proportion of synonymous changes in a population can be an indicator of whether those mutations are under the influence of natural selection. When there is no active selection pressure (neutral evolution), we would expect the proportion of synonymous substitutions to reflect random chance. If the proportion of nonsynonymous substitutions is lower than what we expect from random chance, those mutations are under purifying selection. A challenge arises when considering the potential impact of a nonsynonymous change. Some nonsynonymous changes do not have a significant impact on the final protein structure, either due to similarities between the interchanged amino acids or because their position in the polypeptide does not have a significant effect on protein structure [56]. Nonsynonymous changes can confer a selective advantage, which is how evolution progresses, but are more often deleterious [57]. Merely knowing that a mutation is nonsynonymous is not sufficient to predict the ultimate impact of that change. Examining only the proportion of synonymous and nonsynonymous substitutions also excludes the RNA and non-coding portions of the genome, even though changes in these regions may have a substantial impact on the healthy function of mitochondria.
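The comparison between observed and expected nonsynonymous proportions can be framed as a one-sided binomial test. The sketch below is an illustration under stated assumptions rather than the analysis used in this study: the function names are hypothetical, the neutral expectation of 0.75 and the variant counts are invented, and in practice the neutral expectation would have to be derived from the codon composition and mutational spectrum of the genome in question.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def purifying_selection_pvalue(nonsyn_observed: int, coding_variants: int,
                               neutral_nonsyn_fraction: float) -> float:
    """Probability of seeing this few (or fewer) nonsynonymous variants
    if every coding variant were nonsynonymous with the neutral probability."""
    return binom_cdf(nonsyn_observed, coding_variants, neutral_nonsyn_fraction)


# Invented numbers: 100 coding variants, 60 of them nonsynonymous,
# against an assumed neutral expectation of 75% nonsynonymous.
p_value = purifying_selection_pvalue(60, 100, 0.75)
print(f"one-sided p-value: {p_value:.2e}")
```

A small p-value indicates a deficit of nonsynonymous changes relative to the neutral expectation, which is the signature of purifying selection described above. As the text notes, the test says nothing about RNA genes or non-coding positions.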

Two commonly used analysis options are PolyPhen and Sorting Intolerant From Tolerant (SIFT) scores. PolyPhen predicts the effect of a substitution by analyzing computed changes in protein folding and crystal structure, while SIFT makes predictions based on evolutionary homology [58, 59]. Although these tools can be generally useful for large scale mutational analysis, they are not particularly well-adapted for mtDNA. When tested against known human mtDNA variants, they are not consistently able to distinguish between pathogenic and benign mutations [60]. As with synonymity, these metrics also only apply to the coding regions of DNA, leaving the non-protein-coding regions without predictive scores. A number of pathogenic mutations have been annotated in the human mitochondrial genome as they have been detected in patients. These publicly available annotations are certainly useful for clinical screening, but they still fall short for large scale analysis. The list is not exhaustive, it can be difficult to translate to model organisms such as mice, and some mutations may be so severe as to be embryonic lethal and thus not detectable in the living human population [61].

An alternative to these metrics is genomic evolutionary rate profiling, or GERP. Originally developed to detect coding regions in full genome sequences, GERP determines how constrained a genetic region or single position is [62, 63]. For a given region, the number of observed substitutions within a phylogenetic group is compared to a neutral rate of substitution that would be expected for that group. GERP determines the rate of rejected substitutions (fewer substitutions than predicted by the neutral model) to measure how constrained a region is. In essence, GERP can be used to assign a score to every position in the mitochondrial genome reflecting how conserved or variable that site is across evolutionary history. For our study, we used a broad spectrum GERP that encompasses all mtDNA sequences from the Euarchontoglires superorder, a group composed of lagomorphs, tree shrews, flying lemurs, rodents, and primates [64]. Using a phylogenetic grouping that is inclusive of both humans and mice permits easy comparison between the two, as the GERP scores were constructed from the same sequences. While including more distantly-related species could add more depth to the metric, it could also lead to alignment issues that would diminish the accuracy of the site-specific resolution.
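The idea of rejected substitutions can be shown with a short worked example. This is a sketch of the underlying arithmetic only: rejected substitutions are taken as the neutral expectation minus the observed count, the function names are hypothetical, and the per-site numbers are invented. It does not reproduce the rescaling onto the 0-to-1 scale used in this study, which is not specified in this section.

```python
def rejected_substitutions(expected_neutral: float, observed: float) -> float:
    """Rejected substitutions (RS): substitutions expected under the neutral
    model minus substitutions actually observed across the phylogeny."""
    if expected_neutral < 0 or observed < 0:
        raise ValueError("substitution counts must be non-negative")
    return expected_neutral - observed


def region_rs(sites) -> float:
    """Sum per-site RS over a region given (expected, observed) pairs."""
    return sum(rejected_substitutions(e, o) for e, o in sites)


# Invented (expected, observed) pairs for three sites.
example_sites = [(2.0, 0.0), (2.0, 1.9), (2.0, 3.5)]
for expected, observed in example_sites:
    rs = rejected_substitutions(expected, observed)
    print(f"expected {expected}, observed {observed}, RS {rs:+.1f}")
print(f"region RS: {region_rs(example_sites):+.1f}")
```

A site with no observed substitutions has the largest positive RS and is the most constrained; a site with more substitutions than the neutral model predicts has a negative RS and is highly variable.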

As shown in Figure 1.3A, GERP scores are not normally distributed but instead reflect which portions of the genome are highly conserved, neutrally changing, and highly variable. Despite spanning over 60 million years of evolution [65], ~25.1% of positions in the mitochondrial genome have remained completely unchanged within the superorder; on our GERP scale these sites have a score of 1 and will be termed constrained sites. A smaller proportion, ~7.7%, are highly mutable positions that acquire substitutions at a higher than expected rate; on our GERP scale these sites have a score of 0 and will be termed permissive sites. One interpretation of GERP scores is that the sites that are not highly constrained (scores below 1) have been demonstrated to be compatible with life, at least in some species. It is certainly possible that a polymorphism that is tolerated in one species is fatal or otherwise disadvantageous in another: the context of the rest of that species’ genetic makeup is sure to influence the effect of any single nucleotide change. Mutations in the constrained sites, however, are likely to be deleterious: no healthy organism that has been sequenced contained variants at these sites, so the penalty for mutation is likely a harsh drop in fitness. In essence, these mutations have not been demonstrated to be compatible with life and can likely be considered harmful or pathogenic. As would be expected, synonymous variants have low GERP scores, with 37.8% occurring at permissive sites and none found at constrained sites (Figure 1.3B). Conversely, nonsynonymous variants are found less frequently at permissive sites (2.6%) and more frequently at constrained sites (34.3%).

Although it is helpful to contextualize GERP against other metrics, a critical step is to verify that GERP can reliably be used to identify deleterious SNVs. One test of the GERP scale is to examine mutations that are known to be pathogenic compared to benign population-level polymorphisms. Mitochondrial haplogroups are ancestral lineages of mtDNA and are characterized by distinct series of inherited polymorphisms. Of the 321 most common haplogroup markers, each present in at least 80% of its respective haplogroup, none can be categorized as constrained (Figure 1.4). For the 68 coding pathogenic mutations, however, 70.6% occur at constrained sites. Some of the remaining mutations that are not at constrained sites are arguably less impactful on fitness, with 35% of them associated with hearing loss and Alzheimer’s and Parkinson’s Disease, which most commonly develop after the reproductive prime [66]. These patterns of GERP distributions highlight three putative classifications for variants: variants at permissive sites, found exclusively in healthy populations; variants at constrained sites, found exclusively among pathogenic variants; and variants at intermediately conserved sites, found in both groups. While these intermediate sites are not readily predictable as pathogenic or benign, sites with GERP scores at the polar ends of the scale are indeed helpful indicators. Populations that are under purifying selection can be expected to contain more mutations at permissive sites and fewer mutations at constrained sites.
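The three-way classification described above lends itself to a simple tally. The following is a minimal sketch assuming the scale defined in the text, where a score of 1 marks a constrained site and a score of 0 marks a permissive site; the function names and the example scores are hypothetical and are not data from this study.

```python
from collections import Counter

CLASSES = ("constrained", "intermediate", "permissive")


def classify_site(gerp_score: float) -> str:
    """Assign a site to one of the three classes used in the text."""
    if gerp_score >= 1.0:
        return "constrained"
    if gerp_score <= 0.0:
        return "permissive"
    return "intermediate"


def class_fractions(scores) -> dict:
    """Fraction of variants falling into each GERP class."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(classify_site(s) for s in scores)
    return {name: counts.get(name, 0) / len(scores) for name in CLASSES}


# Invented GERP scores for the variants found in one sample.
print(class_fractions([1.0, 0.0, 0.5, 1.0]))
```

Comparing these fractions between samples, or against the genome-wide proportions of constrained and permissive sites, gives a direct readout of the expectation stated above: a population under purifying selection should be depleted for constrained-site variants and enriched for permissive-site variants.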

PolG mutator mouse model

A commonly used model for studying the mtDNA germline bottleneck is the PolG mutator mouse. Polymerase gamma (PolG) is a nuclear-encoded polymerase that is exclusively responsible for replicating mtDNA. A single amino acid substitution in the proofreading domain results in a progressive accumulation of mtDNA mutations over time, as the enzyme is no longer capable of using its 3’-5’ exonuclease activity to correct misincorporation events [67]. As a result, homozygous PolG mice have a roughly 500-fold higher mutation burden than wild type animals [68]. Although the precise mechanism is unclear, these PolG mice display a progeroid phenotype, developing many of the characteristics of aging at ~25 weeks of age, including sarcopenia, reduced bone mineral density, severely impaired fertility, kyphosis, and cardiomyopathy, ultimately culminating in a significantly shortened lifespan [67, 69]. The PolG mutator mouse is a favored model for mtDNA germline inheritance because it generates an intense pulse of mutations that can be measured for signs of selection and tracked through successive generations. Although the mutational burden far exceeds what is found in wild type mice and humans, the model can still provide broader insight into mtDNA population dynamics.

Materials and methods

Animals and oocyte retrieval

Heterozygous PolG mice (PolgD257A/+) were sourced from Jackson Laboratory and bred to generate two sister pairs of homozygous mutator mice (PolgD257A/D257A). Wild type female C57BL/6 mice were obtained from Charles River and bred to heterozygous PolG males to produce two first-generation heterozygous females (PolgD257A/+). A homozygous (PolgD257A/D257A) female was bred to a wild type male to produce two heterozygous daughters (PolgD257A/+). All experiments were reviewed and approved by the Institutional Animal Care and Use Committee at Northeastern University.

Oocyte retrieval was conducted as described in [70]: 10 IU of pregnant mare serum gonadotropin (Sigma-Aldrich) was injected intraperitoneally to induce ovulation in 2-month-old mice (4 months for the homozygous mother). The mice were euthanized by CO2 asphyxiation 15 hours post-injection and the ovaries and oviducts were collected. Oocytes were collected and denuded of the surrounding cumulus cells by incubation in 80 IU/mL of hyaluronidase (Sigma-Aldrich) for 2 minutes at 37°C and subsequently washed three times with human tubal fluid (Irvine Scientific) at 37°C. Single oocytes were incubated in 1 µl lysis buffer (10 mM EDTA, 0.5% SDS, 0.1 mg/mL Proteinase K) for 3 hours at 37°C. Oocyte lysate was stored under ~20 µl of mineral oil (CVS) in 0.5 mL conical tubes at -80°C.

PCR and DNA sequencing

Oocyte lysate was serially diluted in order to perform single molecule PCR (smPCR), wherein each amplicon originated from a single mtDNA template and the majority of PCR wells were negative. mtDNA was initially amplified with primers m3092F and m3031R from Table 1.1 in 15 µl reactions using Q5 Hot Start Polymerase (New England Biolabs); reagent concentrations were: 1X Q5 reaction buffer, 0.2 mM dNTPs, 10 µM primer (each), 0.3 units polymerase. Reactions underwent 45 cycles (30 seconds denaturing at 95°C, 16 minutes combined annealing and extending at 68°C). Following initial PCR, amplicons were re-amplified for 15 additional cycles with Ex Taq Hot Start Polymerase (Takara) using primers m3140F and m3003R and the following reagent concentrations: 1X LA Taq buffer, 0.2 mM dNTPs, 10 µM primer (each), 0.15 units polymerase. Amplicons were sequenced across 24 sequencing reactions on a 3730xl DNA Analyzer (Applied Biosystems). Reads were assembled and aligned against the C57BL/6 mouse mtDNA reference genome (GenBank ID AY172335.1). CodonCode Aligner (CodonCode Corporation) software was used for assemblies and alignments, and each mutation was manually confirmed. Sequences with overlapping peaks were discarded as mixed molecules (derived from multiple rather than single templates).
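The reasoning behind limiting dilution can be illustrated with a simple Poisson model. The sketch below is not part of the original analysis: it assumes templates are distributed independently across wells, and the function name and the example fraction of negative wells (0.8) are hypothetical.

```python
import math


def single_template_probability(negative_fraction):
    """Probability that a positive well started from exactly one template.

    Assumes the number of templates per well is Poisson distributed, so the
    fraction of negative wells equals exp(-lam).
    """
    if not 0.0 < negative_fraction < 1.0:
        raise ValueError("negative_fraction must be strictly between 0 and 1")
    lam = -math.log(negative_fraction)    # mean number of templates per well
    p_one = lam * math.exp(-lam)          # P(exactly one template)
    p_positive = 1.0 - negative_fraction  # P(at least one template)
    return p_one / p_positive


# Hypothetical plate on which 80% of wells are negative.
print(round(single_template_probability(0.8), 3))
```

Under these assumptions, a plate with 80% negative wells yields positive wells that are single-template roughly 89% of the time, which is why mixed molecules still need to be screened out by inspecting the traces for overlapping peaks.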

Human polymorphic and pathogenic mutations

Human haplogroup markers (here termed polymorphic mutations, N=321) and pathogenic coding mutations (N=68) were downloaded from the MITOMAP database [66].

Synonymous/nonsynonymous, Ka/Ks, and GERP identification

A custom mutation analysis tool was built from the open-source Variant Effect Predictor (VEP) from Ensembl, which reports the site-specific effects of polymorphisms [71]. The primary information obtained from VEP was coding consequence (e.g. synonymous or nonsynonymous) and codon position. Using the assigned synonymous/nonsynonymous identity, Ka/Ks was calculated as:

$$\frac{K_a}{K_s} = \frac{\text{nonsynonymous substitutions} \,/\, \text{available nonsynonymous sites}}{\text{synonymous substitutions} \,/\, \text{available synonymous sites}}$$

The number of available sites was determined either on the whole-genome scale or by specific protein-coding regions.
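As a minimal sketch of this calculation, the function below computes Ka/Ks under the conventional definition (Ka is nonsynonymous substitutions per available nonsynonymous site, Ks is synonymous substitutions per available synonymous site). The function name, the example counts, and the handling of regions with no synonymous substitutions are assumptions made for illustration; the source does not say how such cases were treated.

```python
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Ka/Ks from substitution counts and counts of available sites."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_subs == 0:
        # Assumption: the ratio is undefined (NaN) with no substitutions at
        # all, and infinite with only nonsynonymous ones.
        return float("inf") if nonsyn_subs > 0 else float("nan")
    ka = nonsyn_subs / nonsyn_sites  # nonsynonymous rate per available site
    ks = syn_subs / syn_sites        # synonymous rate per available site
    return ka / ks


# Hypothetical region: 6 nonsynonymous substitutions over 750 available
# sites and 4 synonymous substitutions over 250 available sites.
print(ka_ks(6, 750, 4, 250))
```

In this made-up example the ratio is 0.5, which would be read as a signature of purifying selection.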

GERP scores were assigned using the GERP++ tool [63], based on the alignment of all available mtDNA sequences from the superorder Euarchontoglires, which includes rodents, lagomorphs, and primates. Whole-genome descriptive statistics were generated by analyzing the effect of every possible variant (one transition, two transversions) at each position. The C57BL/6J sequence (accession ID: AY172335.1) was used for the mouse reference sequence and the revised Cambridge Reference Sequence (accession ID: NC_012920.1) for the human reference sequence.
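The enumeration of every possible variant can be sketched as follows. This is an illustrative outline only; the function names and the toy sequence are hypothetical, and assigning consequences or GERP scores to each variant is left out.

```python
PURINES = {"A", "G"}


def possible_variants(ref_base):
    """Return the three alternative bases, each labeled by substitution class."""
    ref = ref_base.upper()
    if len(ref) != 1 or ref not in "ACGT":
        raise ValueError("ref_base must be one of A, C, G, T")
    variants = []
    for alt in "ACGT":
        if alt == ref:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        same_class = (ref in PURINES) == (alt in PURINES)
        variants.append((alt, "transition" if same_class else "transversion"))
    return variants


def enumerate_all_variants(sequence):
    """Yield (1-based position, ref, alt, class) for every possible SNV."""
    for index, ref in enumerate(sequence.upper()):
        for alt, kind in possible_variants(ref):
            yield (index + 1, ref, alt, kind)


# Toy sequence standing in for a reference genome.
for record in enumerate_all_variants("AC"):
    print(record)
```

Each position contributes exactly three records, one transition and two transversions, so a genome of length N yields 3N candidate variants to annotate.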

Mutation modeling

Randomly simulated mtDNA populations were generated using the population structure and mutation frequencies of the Sanger-sequenced 2-month-old homozygous PolG mutator mice. Each simulation contained 112 molecules with the same mutation frequency distribution as the Sanger data. Positions were randomly selected without repetition for a given simulated genome using the NumPy Python package [72]. The mutation direction was also randomly generated with NumPy following the observed transition/transversion ratios from the Sanger data set (A>(C: 2.83%, G: 49.12%, T: 48.06%); C>(A: 3.26%, G: 3.26%, T: 93.49%); G>(A: 79.17%, C: 16.67%, T: 4.17%); T>(A: 12.83%, C: 82.89%, G: 4.27%)). Additional simulations were generated by randomly distributing the known Sanger-derived mutations across a simulated mtDNA population. The population simulations were each run 10,000 times. Mutation consequences and associated GERP scores were assigned following the same methodology outlined above.
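A minimal sketch of this kind of simulation is shown below. It uses Python's standard `random` module rather than NumPy so that it has no dependencies, and it is an illustration rather than the code used for the thesis. The substitution weights are the percentages reported above; the function names, the toy reference sequence, and the per-genome mutation counts are hypothetical.

```python
import random

# Substitution frequencies (%) reported for the Sanger data set.
SUBSTITUTION_WEIGHTS = {
    "A": {"C": 2.83, "G": 49.12, "T": 48.06},
    "C": {"A": 3.26, "G": 3.26, "T": 93.49},
    "G": {"A": 79.17, "C": 16.67, "T": 4.17},
    "T": {"A": 12.83, "C": 82.89, "G": 4.27},
}


def simulate_genome(reference, n_mutations, rng):
    """Place n_mutations at distinct random positions of the reference."""
    positions = rng.sample(range(len(reference)), n_mutations)  # no repeats
    variants = []
    for pos in sorted(positions):
        ref = reference[pos]
        alts = list(SUBSTITUTION_WEIGHTS[ref])
        weights = [SUBSTITUTION_WEIGHTS[ref][alt] for alt in alts]
        alt = rng.choices(alts, weights=weights, k=1)[0]
        variants.append((pos + 1, ref, alt))  # report 1-based positions
    return variants


def simulate_population(reference, mutations_per_genome, seed=0):
    """Simulate one genome for each entry in mutations_per_genome."""
    rng = random.Random(seed)
    return [simulate_genome(reference, n, rng) for n in mutations_per_genome]


# Toy reference and mutation counts; both are placeholders.
toy_reference = "ACGT" * 50
population = simulate_population(toy_reference, [3, 7, 12])
print([len(genome) for genome in population])
```

In a full run, `mutations_per_genome` would hold the observed mutation count of each of the 112 sequenced molecules, and the whole population would be regenerated once per iteration.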

Data analysis and presentation

Statistical analysis of synonymous and nonsynonymous substitution rates was performed using two-tailed Chi-square tests in GraphPad Prism 7.0e for Mac OS X. GERP and Ka/Ks distributions were analyzed with the k-sample Anderson-Darling test using the SciPy Python package [73]. Statistical significance of constrained and permissive GERP proportions was determined using Fisher’s randomization inference in NumPy [72]. Mean differences were analyzed with the Mann-Whitney U test. Plots were generated with Matplotlib [74] and Seaborn [75].
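Randomization inference can take several forms, and the exact statistic used for the GERP comparisons is not specified here. The sketch below shows one generic version, a two-sided permutation test for a difference in proportions, written with the standard library instead of NumPy. The function name and the example flags are hypothetical.

```python
import random


def randomization_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided randomization test for a difference in proportions.

    group_a and group_b are sequences of 0/1 flags (for example, 1 if a
    variant falls at a permissive site and 0 otherwise).
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical flags: 3 of 20 observed variants versus 2 of 40 simulated ones.
observed_flags = [1] * 3 + [0] * 17
simulated_flags = [1] * 2 + [0] * 38
print(randomization_p_value(observed_flags, simulated_flags, n_permutations=2000))
```

The add-one correction means the smallest reportable value is 1/(n_permutations + 1), so the number of permutations sets the resolution of the P value.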

Results

Homozygous PolG oocyte mutations

In total, 112 near full-length (mean 14,919.5 bp per sequence) mtDNA genomes were sequenced from 8 oocytes. Mutant fractions ranged from 3.9 × 10⁻⁴ (~6.4 mutations/genome) to 1.4 × 10⁻³ (~23.0 mutations/genome), with a mean of 8.2 × 10⁻⁴ (~13.5 mutations/genome) (Figure 1.5A). A total of 1,389 SNVs were identified across 1,670,985 bp sequenced. Of those variants, 578 were observed on only one genome or in less than 1% of the population (here termed “heteroplasmic variants”), while 180 SNVs were observed in 1.5-100% of genomes (here termed “polymorphic variants”) (Figure 1.5B). In all subsequent analyses, polymorphic variants are treated as single mutational events and are thus counted only once, regardless of their population-level frequency.
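The per-genome counts quoted above follow from multiplying a mutant fraction by the genome length. The short sketch below reproduces that arithmetic; it assumes the 16,299 bp mouse mtDNA length given in the Figure 1.1 caption, and the function names are illustrative rather than taken from the original analysis.

```python
MOUSE_MTDNA_LENGTH = 16299  # bp, mouse mtDNA length as given in Figure 1.1


def mutations_per_genome(mutant_fraction, genome_length=MOUSE_MTDNA_LENGTH):
    """Convert a per-base mutant fraction into expected mutations per genome."""
    return mutant_fraction * genome_length


def pooled_mutant_fraction(n_variants, bases_sequenced):
    """Mutant fraction pooled over all sequenced bases."""
    return n_variants / bases_sequenced


print(round(mutations_per_genome(3.9e-4), 1))
print(pooled_mutant_fraction(1389, 1670985))
```

The lowest reported fraction converts to about 6.4 mutations per genome, and the pooled fraction comes out at about 8.3 × 10⁻⁴, close to the reported mean of 8.2 × 10⁻⁴. A pooled ratio and a mean of per-genome fractions need not coincide exactly.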

Polymorphic but not heteroplasmic variants are significantly more synonymous than random chance in homozygous PolG oocyte mtDNA

To establish a neutral model that can be used as a baseline for detecting selection, 10,000 in silico models were generated based on the population structure of the mtDNA. For coding positions in the mouse mtDNA genome, 23.3% of fully random mutations (equal probability of transitions and transversions) result in a synonymous change. Because mammals have a strong transition bias in mtDNA mutations [76], randomizations used the same transition/transversion frequencies as the homozygous PolG mutations, resulting in a mean of 33.7% of variants being synonymous.

When all homozygous PolG mutations were considered (heteroplasmic variants and polymorphic variants), there was not a significant increase in the percentage of synonymous substitutions as compared to the random models (36.7%, P = 0.1304) (Figure 1.6).

Heteroplasmic variants alone were 34.3% synonymous (P = 0.7657) but polymorphic variants were 44.0% synonymous, significantly higher than both the random models and heteroplasmic variants (P = 0.0113 and P = 0.0441, respectively).

Polymorphic and heteroplasmic variants are more frequent at permissive sites and less frequent at constrained sites as measured by GERP

GERP scores were assigned to all variants from the homozygous PolG mtDNA and the randomized mtDNA. The overall GERP distributions were significantly different between the randomized mtDNA and all homozygous PolG variants, heteroplasmic variants, and polymorphic variants (P<0.001, P=0.032, and P<0.001, respectively) (Figure 1.7). Additionally, heteroplasmic and polymorphic GERP score distributions were significantly different from each other (P<0.001). Much of the difference between distributions can be attributed to the significant differences in the proportion of permissive and constrained GERP scores in all groups (Table 1.2). The proportion of permissive scores was higher than expected from the neutral model, rising from 7.67% in randomized mtDNA to 10.55% in heteroplasmic variants (P = 0.0003) and 15.56% in polymorphic variants (P<0.0001). Conversely, the proportion of constrained GERP scores was lower than expected, falling from 25.9% in randomized mtDNA to 22.15% in heteroplasmic variants (P=0.0057) and 19.44% in polymorphic variants (P<0.0001).

Increases in the relative abundance of synonymous mutations are unevenly distributed across coding products

The Ka/Ks ratio for a coding region of DNA represents the number of nonsynonymous substitutions per available nonsynonymous site (Ka) divided by the number of synonymous substitutions per available synonymous site (Ks). A Ka/Ks ratio of 1 indicates neutrality, with an equal proportion of synonymous and nonsynonymous variants relative to the number of available synonymous and nonsynonymous sites [77]. A ratio below 1 indicates purifying selection, with a relative overabundance of synonymous substitutions due to strong selection against potentially deleterious nonsynonymous substitutions. In most contexts a ratio over 1 indicates positive selection for nonsynonymous variants that confer a selective advantage, but in the homozygous PolG context it can be an indicator of mutational meltdown or the accumulation of deleterious variants.

While the overall proportion of synonymous variants was elevated in the polymorphic variants, the disproportionate prevalence of synonymous substitutions was limited to just 7 protein-coding regions: COXI, COXII, ND1, ND3, ND4, ND4L and ND5 (Figure 1.8). Although in total the heteroplasmic variants did not show a significant increase in the proportion of synonymous substitutions, there was a significant decrease in Ka/Ks in COXI and COXII and an increase in ND1.

A higher proportion of individual mtDNA genomes are under purifying selection as measured by Ka/Ks and GERP

To measure the extent of purifying selection on whole mtDNA genomes, Ka/Ks ratios were calculated for each molecule. Given that the homozygous variants were significantly more synonymous than the randomized model, the mean Ka/Ks per genome was, as expected, significantly lower among the homozygous variants (1.151 in homozygous PolG versus 1.488 in the random model, P=0.0011), as was the overall Ka/Ks distribution (P<0.001) (Figure 1.9A).

A more meaningful comparison is whether, given an already more synonymous pool of variants, that increase in synonymous substitutions is distributed evenly across all genomes. To that end, a new series of 10,000 in silico randomizations was conducted, here termed shuffled genomes, with variants randomly pulled from the pool of identified homozygous PolG mtDNA variants. The shuffled genomes had a mean Ka/Ks that was not significantly higher than that of the actual mtDNA variants (1.333, P=0.0866), nor was the distribution significantly different (P=0.1484) (Figure 1.9B). Of note, a significantly higher proportion of homozygous PolG molecules had a Ka/Ks ratio below 1 (60.91%), indicating purifying selection, than in either the randomized (41.80%, P<0.0001) or shuffled models (49.03%, P=0.0047).
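One way to picture the shuffled-genome control is as a permutation of the observed variants across molecules. The sketch below assumes that each genome keeps its original number of variants and that the pool is redistributed without replacement; the source does not specify these details, so both are labeled assumptions, and the function name and toy data are hypothetical.

```python
import random


def shuffle_genomes(genomes, seed=0):
    """Redistribute pooled variants across genomes, keeping each genome's count.

    Assumption: variants are drawn without replacement from the observed
    pool, so the overall mix of synonymous and nonsynonymous variants is
    preserved while their assignment to molecules is randomized.
    """
    rng = random.Random(seed)
    pool = [variant for genome in genomes for variant in genome]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for genome in genomes:
        shuffled.append(pool[start:start + len(genome)])
        start += len(genome)
    return shuffled


# Toy population in which each variant is labeled only by its consequence.
toy_genomes = [["syn", "nonsyn", "nonsyn"], ["syn"], ["nonsyn", "nonsyn"]]
print(shuffle_genomes(toy_genomes, seed=1))
```

Because the pooled composition is unchanged, any difference between observed and shuffled per-genome Ka/Ks values reflects how variants are grouped on individual molecules rather than how synonymous the pool is overall.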

The proportion of permissive and constrained GERP scores per molecule was compared to the expected proportions based on shuffled genomes. The median proportion of permissive scores per genome (14.3%) was significantly higher than expected (11.1%, P=0.0033), accompanied by a significant reduction in the proportion of genomes that had no variants at permissive sites (from expected 23.1% to 15.2%, P=0.0182) (Figure 1.9C). Conversely, the median proportion of constrained scores per genome (16.7%) was significantly lower than expected (20.0%, P=0.0038) and a larger proportion of genomes had no variants at constrained sites (from expected 6.8% to 11.6%, P=0.0431) (Figure 1.9D).

First-generation heterozygous oocyte mutations

A total of 159 mtDNA genomes were sequenced from 18 oocytes of two sister heterozygous PolG mice with a wild type mother. Mutant fractions ranged from 1.28 × 10⁻⁴ (~2.1 mutations/genome) to 6.63 × 10⁻⁴ (~10.8 mutations/genome), with a mean of 2.5 × 10⁻⁴ (~4.2 mutations/genome) (Figure 1.10A). A total of 599 SNVs were identified across 2,356,191 bp sequenced. Of these, 204 were categorized as heteroplasmic variants and 27 as polymorphic variants, as described above (Figure 1.10B).

Polymorphic but not heteroplasmic variants are significantly more synonymous than random chance in heterozygous PolG oocyte mtDNA

Similar to the effect observed in homozygous PolG variants, heteroplasmic variants were not significantly more synonymous than the neutral model of randomized variants (40.6% compared to 33.7%, P = 0.0822) (Figure 1.11). Polymorphic variants were significantly more synonymous than the neutral model at 60.0% (P = 0.0128) but were not significantly more synonymous than heteroplasmic variants (P = 0.0999). Heteroplasmic variants from homozygous and heterozygous mtDNA were not significantly different from each other (P=0.1847), nor were polymorphic variants (P=0.1814).

Heterozygous mtDNA has similar GERP distributions for heteroplasmic variants but significantly improved for polymorphic variants

In heterozygous mice, GERP distributions were significantly different between the randomized variants and heteroplasmic and polymorphic variants (P=0.0067 and P<0.001, respectively), and heteroplasmic and polymorphic distributions were significantly different from each other (P=0.0051) (Figure 1.12). The GERP distributions of heteroplasmic variants from homozygous and heterozygous mtDNA were not significantly different (P>0.25), nor were the polymorphic variants (P=0.0900), although the lack of significance may be due to the small N in heterozygous samples as manual inspection of the distributions suggests a substantial difference.

Polymorphic variants were substantially more abundant at permissive sites (26.92%, P<0.0001) and less frequently found at constrained sites (7.69%, P<0.0001) (Table 1.3).

Mother and daughter mtDNA variants differ only in frequency, not by other metrics

Oocyte mtDNA was sequenced from a homozygous PolG mother (distinct from the previous homozygous PolG analysis) and two heterozygous daughters, with a total of 32 molecules and 426,452 bp and 27 molecules and 402,304 bp, respectively. Analyses similar to those described above were conducted on these data. Both mother and daughters had significantly different GERP score distributions for polymorphic variants as compared to the random model (P<0.001 for both), but the distributions were not significantly different from each other (P>0.25) (Figure 1.13A). The proportion of synonymous variants was slightly but not significantly elevated in the mother as compared to the daughters (46.67% and 43.48%, P=0.7324). The mean proportion of permissive GERP scores per molecule was not different between the mother (10.93%) and daughters (10.97%) (P>0.25) (Figure 1.13C). The mean proportion of constrained GERP scores was higher, though not significantly, in the mother (23.03%) than in the daughters (16.15%) (P=0.0516) (Figure 1.13D).

The most significant difference between the mother and daughter mtDNA genomes was the mutant fraction of the individual genomes. The mother mtDNA genomes had a mean mutant fraction of 1.3 × 10⁻³ (~22.1 mutations/genome) and daughter mtDNA genomes had a mean mutant fraction of 1.0 × 10⁻³ (~16.4 mutations/genome) (P<0.0001) (Figure 1.13B).

Discussion

Previous studies on purifying selection in the mtDNA maternal germline have typically centered on aggregate variant data rather than single molecule analysis. While this approach is efficient and high-throughput, it comes at the cost of low precision for the detection of rare variants in the form of false negatives (true variants being discarded as noise) or false positives (noise misclassified as a true variant). Variants that are rare in germ cells and the earliest stages of development may become more prevalent later in life due to clonal expansion, and recent studies have highlighted a potential role for the clonal expansion of germline mutations as contributors to aging-related decline [78-81]. With a higher-resolution analysis, oocytes that would previously have been considered homoplasmic may reveal a mtDNA population that has a high abundance of individually rare SNVs. The diminished level of purifying selection seen in rare heteroplasmic variants as opposed to the more common polymorphic variants has wider implications for previous studies on selection: while the intensity of selection as measured by the proportion of synonymous variants in our polymorphic variants matches previous similar reports on selection [49, 82], the effect greatly diminishes when the individually rare but collectively high-frequency heteroplasmic variants are taken into account. While our findings are consistent with a distinct signature of purifying selection, they highlight an important caveat that not every mutation is subject to the same selective forces.

The study of rare variants also gives insight into the timing of selective pressures during oocyte development. Polymorphic variants arose earlier than heteroplasmic variants and have had time to propagate within the cell and spread through a larger proportion of the population. The cell’s maintenance machinery has therefore had more time to act upon potentially deleterious variants. It is consistent, therefore, that polymorphic variants show significantly stronger markers of selection than heteroplasmic variants. While heteroplasmic variants do show some degree of selection by a relative increase in the proportion of mutations at permissive sites and a decrease at constrained sites, they still harbor a higher proportion of potentially deleterious mutations than is observed in healthy populations. One potential explanation for the weakening of selective pressure in these heteroplasmic variants is their sheer abundance: these variants account for 76.3% of the mutation burden in homozygous mice. While heterozygous mice have an even higher proportion of heteroplasmic variants relative to total variants, the absolute frequency of heteroplasmic variants per genome is significantly lower. Despite the reduction in total number of heteroplasmic variants, the cell is no more efficient at purifying out potentially deleterious ones. It is not the sheer abundance of heteroplasmic variants, therefore, that is taxing the cell’s ability to select against them. There are a couple of plausible explanations for this phenomenon.

First, the rate of mitophagy may simply be outstripped by the rate of mtDNA replication, causing deleterious variants to accumulate. Between the PGC and the mature oocyte, the mtDNA population increases around 100-fold per cell [83], which would be a substantial burden on the cell’s regulatory machinery. Both homozygous and heterozygous PolG mice have an equivalent mtDNA copy number per oocyte, so both would be disadvantaged by an overly taxed regulatory pathway. Second, the regulatory machinery itself may be less active in later stages of oogenesis.

The pathways that detect mitochondrial dysfunction in early germ cells, such as changes in membrane polarization and ROS overproduction [84], merit study to determine their activity in later development. If the diminished selection effect is due to a general decline in mitophagy activity, there are potential interventions that could help promote the clearance of dysfunctional genomes that may be appropriate for women who are carriers of known mtDNA disease-causing mutations [85, 86].

While purifying selection is stronger in polymorphic variants, there are limits: many potentially deleterious mutations are still not successfully cleared from the cell, even in the less aggressively mutagenic heterozygous animals. Just under half of all coding products did not show evidence of selection in terms of an increase in the relative abundance of synonymous substitutions (Ka/Ks). Prior studies have generally failed to highlight differences across coding products [51] or have focused on 4-fold degenerate sites [49] when 2-fold degenerate sites are almost as frequently synonymous, given the strong bias toward transitions. It is possible that dysfunction in some protein products is more difficult for the cell to detect, especially if not widespread throughout the cell, and thus more challenging to select against. The inability to purge all deleterious mutations from the cell, coupled with the potential for certain mtDNA defects to evade detection, highlights the importance of coupling purifying selection with a population bottleneck. While purifying selection can act to eliminate the most damaging mutations and protect the most conserved proteins, mildly detrimental mutations and mutations that were otherwise missed by selection on the level of the organelle must be exposed to selection at some point or otherwise face the mutational meltdown postulated by Muller’s Ratchet. If only a small subpopulation of mtDNA genomes ends up constituting the next generation of maternal germ cells, that allows for wide swings in the overall mtDNA genetic composition and could randomly segregate deleterious and healthy genomes across cells, allowing for selection on the level of the cell. Healthy mitochondrial function is necessary for proper oocyte maturation and fertilization [87-89], providing an opportunity to eliminate cells that are overly burdened by mutations that escaped previous exposure to selection.

A remaining question is how random the replicating subpopulations truly are. Although previous in silico models have supported a stochastic phenomenon [44, 48], they do not specifically characterize subpopulations within the cell. In addition to not being evenly applied across protein products, purifying selection is not evenly applied across all genomes. Some genomes show marked signatures of selection, such as Ka/Ks ratios well below 1 or low abundances of variants at constrained sites. Additionally, a small number of genomes appear to bear a disproportionately high burden of nonsynonymous variants, as seen in the tail of Ka/Ks ratios exceeding 2. Particularly dysfunctional genomes may be less readily replicated, as an increased exposure to ROS can lead to strand breaks or other damage that would impair polymerase activity [90, 91]. The mother-daughter mtDNA comparison indicates that the quantity of mutations, rather than their overall quality, is a more reliable predictor of which genomes get transmitted. The relatively low mutation frequency of these inherited genomes indicates a potential niche of low-mutant genomes that are continuously replicated preferentially over their daughter genomes, which will invariably acquire more mutations.

This high-resolution, single-molecule mtDNA analysis has confirmed previous reports of strong purifying selection among higher-frequency variants while highlighting the weakened effect of selection on the often unexamined low-frequency variants. These rare variants may have important implications in health and aging and merit further study, such as examining their frequency in natural human oocyte populations. While our approach offers high accuracy and sensitivity, the single molecule PCR method combined with Sanger sequencing is laborious, costly, and ultimately low-throughput. To examine mitochondrial subpopulations at the same resolution but larger capacity, some form of next generation sequencing is necessary. Chapter 3 outlines a novel molecular barcoding approach to Nanopore sequencing that will allow for a higher-throughput analysis of these and other oocyte samples.

Figure 1.1: Human mtDNA genome

Human mtDNA is a circular genome that is 16,569 bp long (16,299 bp in mice) with a genetic structure that is conserved across mammals. All protein products except ND6 are encoded on the guanine-rich heavy strand, with tRNA sequences dispersed across both strands between protein-coding regions.

Image credit: Emmanuel Douzery [92]

Permission granted through Creative Commons open access license.

Figure 1.2

“Mitochondrial DNA inheritance in somatic and germ cells. Different mitochondria in a single somatic cell (A) are interconnected by constant events of fusion and fission, allowing them to share membranes, solutes, metabolites, proteins, RNAs and DNA (mitochondrial DNA – mtDNA). Hence, when a mutation in mtDNA arises, it can rapidly spread throughout the mitochondrial network. In this case, mutant (red circles) and wild-type (green circles) mtDNAs may co-exist, which is known as heteroplasmy. In comparison, homoplasmic mitochondria contain a single mtDNA genotype, either mutant or wild-type. Unless the mutation level exceeds a critical threshold necessary to cause a biochemical defect (i.e., above 60-90%; red mitochondria), the mutation effect will be masked by wild-type molecules (green mitochondria with both mutant and wild-type mtDNA).”

From: Chiaratti, M.R., Macabelli, C.H., Augusto Neto, J.D., Grejo, M.P., Pandey, A.K., Perecin, F. and Collado, M.D., 2020. Maternal transmission of mitochondrial diseases. Genetics and Molecular Biology, 43(1). [93]

Permission granted through Creative Commons open access license.

Figure 1.3: Mouse mtDNA GERP score distribution

A.) GERP scores are non-normally distributed in the mtDNA genome, with a particularly high abundance of constrained sites that are conserved across recent evolutionary history. B.) GERP scores map predictably with whether a given mutation is synonymous or nonsynonymous. Few nonsynonymous substitutions occur at permissive sites while no synonymous substitutions occur at constrained sites. Dotted lines represent the boundary for permissive sites and dashed lines represent the boundary for constrained sites.

Figure 1.4: GERP profile of pathogenic and polymorphic human mutations

The most common human haplogroup markers (polymorphic variants) have a distinctly different GERP score profile than the general genomic distribution. There is an overabundance of polymorphic variants at permissive sites and none at constrained sites. Pathogenic variants display an opposite effect, being disproportionately found at constrained sites. [66]

Figure 1.5: Composition of homozygous PolG mtDNA

A.) 112 mtDNA genomes were sequenced with a mean mutant fraction of 8.2 × 10⁻⁴. B.) The frequency of each SNV was measured within the population. 76.3% of SNVs were observed in only a single genome, or less than 1% of the population. 21.0% were found at a frequency of 1.5-5% of the population, 2.4% in 5-15% of the population, and just 0.4% (3 SNVs total) were found in 50-100% of the population.

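[Figure 1.6 panel: proportion of synonymous variants. Randomized 33.7%; all variants 36.7% (NS, P = 0.1304 vs. randomized); heteroplasmic 34.3% (NS, P = 0.7657 vs. randomized); polymorphic 44.0% (* P = 0.0113 vs. randomized; * P = 0.0441 vs. heteroplasmic).]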

Figure 1.6: Proportion of homozygous PolG variants resulting in synonymous substitutions

Taken as a whole, the homozygous PolG variants are not significantly more synonymous than expected. The polymorphic variants, which include only SNVs at a higher relative abundance (≥1.5% of the population), show a substantial increase in the proportion of synonymous mutations, significantly higher than in both randomized and heteroplasmic (<1.5% of the population) variants.

Figure 1.7: GERP distribution for PolG homozygous variants

The GERP score distributions for all (A, red) and both heteroplasmic (B, orange) and polymorphic variants (B, blue) were significantly different from randomized models. The total randomized GERP distribution is in black, with the ranges of the distribution by model iteration displayed in grey. Dotted lines represent the boundary for permissive sites and dashed lines represent the boundary for constrained sites for the given data set.

Figure 1.8: Ka/Ks ratios by product for homozygous PolG mtDNA

Although the proportion of synonymous substitutions is significantly higher in polymorphic variants, not every coding product is affected equally by selection. While polymorphic variants are significantly more synonymous than expected by random chance in COXI, COXII, ND1, ND3, ND4, ND4L and ND5, in other coding regions the Ka/Ks ratio tracks closer to the mean or above it.

Figure 1.9: Measures of selection at the genome level in homozygous PolG mtDNA

Single molecule analysis allows for the discrete measurement of selection on individual mtDNA genomes. The Ka/Ks distribution is significantly altered from randomized genomes (A) but not from shuffled genomes (B), which account for the skewed proportion of synonymous mutations in the data. The significant increase in Ka/Ks ratios under 1, however, indicates that a larger number of genomes are under purifying selective pressure, even if that pressure is not evenly distributed across the population. A relative increase in the proportion of permissive sites per genome (C) and decrease in constrained sites per genome (D) as measured by GERP shows a stronger effect of selection at the population level.

Figure 1.10: Composition of first-generation heterozygous PolG mtDNA

A.) 159 mtDNA genomes were sequenced with a mean mutant fraction of 2.5 × 10⁻⁴. B.) The frequency of each SNV was measured within the population. 88.3% of SNVs were observed in only a single genome, or less than 1% of the population. 7.7% were found at a frequency of 1-1.5% of the population, 3.0% in 1.5-15% of the population, and just 0.9% (2 SNVs total) were found in 50-100% of the population.

Figure 1.11: Proportion of first-generation heterozygous PolG variants resulting in synonymous substitutions

The total proportion of synonymous variants is significantly higher than expected by chance, but this increase is largely due to the significant increase in synonymous substitutions in polymorphic variants. Heteroplasmic variants have a higher proportion of synonymous substitutions than expected by chance, but not significantly so.

Figure 1.12: GERP score distribution for first-generation PolG heterozygous mtDNA

As with homozygous PolG variants, both heteroplasmic and polymorphic variants from heterozygous PolG mtDNA showed significantly different GERP distributions than expected by random chance (P=0.0067 and P<0.001, respectively). While heteroplasmic and polymorphic GERP distributions were significantly different from each other (P=0.0128), neither was significantly different from its homozygous counterpart. Dotted lines represent the boundary for permissive sites and dashed lines represent the boundary for constrained sites for the given data set.

Figure 1.13: Summary statistics for mother-daughter PolG oocytes

Homozygous mother mtDNA variants and genomes were not significantly different from the heterozygous daughters as measured by total GERP distribution (A) or the proportion of permissive (C) and constrained (D) sites per genome. There was a significant reduction in the mean mutant fraction from mother to daughters (B).

Primer location   Primer name   Primer sequence                          Direction
3092              3092F         CTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAA    Forward
3140              3140F         CGGAGCTTTACGAGCCGTAGCCCAAACAAT           Forward
3003              3003R         GACTTAATGCTAGTGTGAGTGATAGGGTAGGTGCAA     Reverse
3031              3031R         GGGTGTGGTATTGGTAGGGGAACTCATAGACTTA       Reverse

Table 1.1: Primers

PCR primers used for mtDNA amplification as outlined in Materials and methods.

Table 1.2: Statistics for permissive and constrained GERP scores for homozygous PolG mtDNA

The proportion of permissive and constrained GERP scores for homozygous mtDNA variants, as seen in Figure 1.7.

Table 1.3: Statistics for permissive and constrained GERP scores for first-generation heterozygous PolG mtDNA

The proportion of permissive and constrained GERP scores for heterozygous mtDNA variants, as seen in Figure 1.12.

References

1. Siekevitz, P., Powerhouse of the cell. Scientific American, 1957. 197(1): p. 131-144.

2. Margulis, L., Origin of eukaryotic cells: Evidence and research implications for a theory of the origin and evolution of microbial, plant and animal cells on the precambrian Earth. 1970: Yale University Press.

3. Gray, M.W., G. Burger, and B.F. Lang, Mitochondrial evolution. Science, 1999. 283(5407): p. 1476-81.

4. Zaremba-Niedzwiedzka, K., et al., Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature, 2017. 541(7637): p. 353-358.

5. O’Malley, M.A., Endosymbiosis and its implications for evolutionary theory. Proceedings of the National Academy of Sciences, 2015. 112(33): p. 10270-10277.

6. Zachar, I., et al., Farming the mitochondrial ancestor as a model of endosymbiotic establishment by natural selection. Proceedings of the National Academy of Sciences, 2018. 115(7): p. E1504-E1510.

7. Roger, A.J., S.A. Muñoz-Gómez, and R. Kamikawa, The origin and diversification of mitochondria. Current Biology, 2017. 27(21): p. R1177-R1192.

8. Lane, N. and W. Martin, The energetics of genome complexity. Nature, 2010. 467(7318): p. 929-934.

9. Wallace, D.C., A mitochondrial paradigm of metabolic and degenerative diseases, aging, and cancer: a dawn for evolutionary medicine. Annu. Rev. Genet., 2005. 39: p. 359-407.

10. Mootha, V.K., et al., Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria. Cell, 2003. 115(5): p. 629-640.

11. Satoh, M., Organization of multiple nucleoids and DNA molecules in mitochondria of a human cell. Experimental Cell Research, 1991. 196(1): p. 137-140.

12. MacDonald, J.A., et al., A nanoscale, multi-parametric flow cytometry-based platform to study mitochondrial heterogeneity and mitochondrial DNA dynamics. Commun Biol, 2019. 2: p. 258.

13. Kolesnikov, A. and E. Gerasimov, Diversity of mitochondrial genome organization. Biochemistry (Moscow), 2012. 77(13): p. 1424-1435.

14. Johnston, I.G. and B.P. Williams, Evolutionary inference across eukaryotes identifies specific pressures favoring mitochondrial gene retention. Cell systems, 2016. 2(2): p. 101-111.

15. Yu-Wai-Man, P., P.G. Griffiths, and P.F. Chinnery, Mitochondrial optic neuropathies - disease mechanisms and therapeutic strategies. Prog Retin Eye Res, 2011. 30(2): p. 81-114.

16. Rosenberg, T., et al., Prevalence and genetics of Leber hereditary optic neuropathy in the Danish population. Investigative ophthalmology & visual science, 2016. 57(3): p. 1370-1375.

17. Wallace, D.C., et al., Mitochondrial DNA mutation associated with Leber's hereditary optic neuropathy. Science, 1988. 242(4884): p. 1427-1430.

18. Chan, D.C., Fusion and fission: interlinked processes critical for mitochondrial health. Annual review of genetics, 2012. 46.

19. Youle, R.J. and A.M. Van Der Bliek, Mitochondrial fission, fusion, and stress. Science, 2012. 337(6098): p. 1062-1065.

20. Chiaratti, M.R., et al., Embryo mitochondrial DNA depletion is reversed during early embryogenesis in cattle. Biology of reproduction, 2010. 82(1): p. 76-85.

21. Capaldi, R.A., Structure and function of cytochrome c oxidase. Annual review of biochemistry, 1990. 59(1): p. 569-596.

22. Lang, A.E. and A.M. Lozano, Parkinson's disease. Second of two parts. N Engl J Med, 1998. 339(16): p. 1130-43.

23. Bender, A., et al., High levels of mitochondrial DNA deletions in substantia nigra neurons in aging and Parkinson disease. Nature genetics, 2006. 38(5): p. 515-517.

24. Kraytsberg, Y., et al., Mitochondrial DNA deletions are abundant and cause functional impairment in aged human substantia nigra neurons. Nat Genet, 2006. 38(5): p. 518-20.

25. Gorbunova, V., et al., Changes in DNA repair during aging. Nucleic acids research, 2007. 35(22): p. 7466-7474.

26. Greaves, L.C., et al., Clonal expansion of early to mid-life mitochondrial DNA point mutations drives mitochondrial dysfunction during human ageing. PLoS genetics, 2014. 10(9).

27. Greaves, L.C., et al., Mitochondrial DNA mutations are established in human colonic stem cells, and mutated clones expand by crypt fission. Proceedings of the National Academy of Sciences, 2006. 103(3): p. 714-719.

28. Muller, H.J., The Relation of Recombination to Mutational Advance. Mutat Res, 1964. 106: p. 2-9.

29. Chinnery, P.F., Mitochondrial disease in adults: what's old and what's new? EMBO Mol Med, 2015. 7(12): p. 1503-12.

30. Fernandez-Sola, J., et al., Adult-onset mitochondrial myopathy. Postgraduate medical journal, 1992. 68(797): p. 212-215.

31. Luo, S., et al., Biparental inheritance of mitochondrial DNA in humans. Proceedings of the National Academy of Sciences, 2018. 115(51): p. 13039-13044.

32. Rius, R., et al., Biparental inheritance of mitochondrial DNA in humans is not a common phenomenon. Genetics in Medicine, 2019: p. 1.

33. Hutchison, C.A., et al., Maternal inheritance of mammalian mitochondrial DNA. Nature, 1974. 251(5475): p. 536-538.

34. Giles, R.E., et al., Maternal inheritance of human mitochondrial DNA. Proceedings of the National Academy of Sciences, 1980. 77(11): p. 6715-6719.

35. Annis, S., et al., Quasi-Mendelian paternal inheritance of mitochondrial DNA: A notorious artifact, or anticipated behavior? Proceedings of the National Academy of Sciences, 2019. 116(30): p. 14797-14798.

36. Sutovsky, P., et al., Ubiquitin tag for sperm mitochondria. Nature, 1999. 402(6760): p. 371-372.

37. Song, W.-H., et al., Autophagy and ubiquitin–proteasome system contribute to sperm mitophagy after mammalian fertilization. Proceedings of the National Academy of Sciences, 2016. 113(36): p. E5261-E5270.

38. Ma, H., et al., Incompatibility between nuclear and mitochondrial genomes contributes to an interspecies reproductive barrier. Cell metabolism, 2016. 24(2): p. 283-294.

39. Yan, H., et al., Association between mitochondrial DNA haplotype compatibility and increased efficiency of bovine intersubspecies cloning. Journal of genetics and genomics, 2011. 38(1): p. 21-28.

40. Sharpley, M.S., et al., Heteroplasmy of mouse mtDNA is genetically unstable and results in altered behavior and cognition. Cell, 2012. 151(2): p. 333-343.

41. Olivo, P.D., et al., Nucleotide sequence evidence for rapid genotypic shifts in the bovine mitochondrial DNA D-loop. Nature, 1983. 306(5941): p. 400-402.

42. Hauswirth, W.W. and P.J. Laipis, Mitochondrial DNA polymorphism in a maternal lineage of Holstein cows. Proceedings of the National Academy of Sciences, 1982. 79(15): p. 4686-4690.

43. St. John, J.C., et al., Mitochondrial DNA transmission, replication and inheritance: a journey from the gamete through the embryo and into offspring and embryonic stem cells. Human reproduction update, 2010. 16(5): p. 488-509.

44. Jenuth, J.P., et al., Random genetic drift in the female germline explains the rapid segregation of mammalian mitochondrial DNA. Nat Genet, 1996. 14(2): p. 146-51.

45. Cao, L., et al., The mitochondrial bottleneck occurs without reduction of mtDNA content in female mouse germ cells. Nat Genet, 2007. 39(3): p. 386-90.

46. Cao, L., et al., New evidence confirms that the mitochondrial bottleneck is generated without reduction of mitochondrial DNA content in early primordial germ cells of mice. PLoS Genet, 2009. 5(12): p. e1000756.

47. Wai, T., D. Teoli, and E.A. Shoubridge, The mitochondrial DNA genetic bottleneck results from replication of a subpopulation of genomes. Nat Genet, 2008. 40(12): p. 1484-8.

48. Johnston, I.G., et al., Stochastic modelling, Bayesian inference, and new in vivo measurements elucidate the debated mtDNA bottleneck mechanism. Elife, 2015. 4: p. e07464.

49. Stewart, J.B., et al., Strong purifying selection in transmission of mammalian mitochondrial DNA. PLoS Biol, 2008. 6(1): p. e10.

50. Hill, J.H., Z. Chen, and H. Xu, Selective propagation of functional mitochondrial DNA during oogenesis restricts the transmission of a deleterious mitochondrial variant. Nat Genet, 2014. 46(4): p. 389-92.

51. Floros, V.I., et al., Segregation of mitochondrial DNA heteroplasmy through a developmental genetic bottleneck in human embryos. Nature cell biology, 2018. 20(2): p. 144-151.

52. del Mar González, M., et al., Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing. Mitochondrion, 2020. 50: p. 88-93.

53. Stewart, J.B., et al., Purifying selection of mtDNA and its implications for understanding evolution and mitochondrial disease. Nature Reviews Genetics, 2008. 9(9): p. 657-662.

54. Lieber, T., et al., Mitochondrial fragmentation drives selective removal of deleterious mtDNA in the germline. Nature, 2019. 570(7761): p. 380-384.

55. Piko, L. and K.D. Taylor, Amounts of Mitochondrial-DNA and Abundance of Some Mitochondrial Gene Transcripts in Early Mouse Embryos. Developmental Biology, 1987. 123(2): p. 364-374.

56. Dobson, R.J., et al., Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC bioinformatics, 2006. 7(1): p. 217.

57. Loewe, L., et al., Estimating selection on nonsynonymous mutations. Genetics, 2006. 172(2): p. 1079-1092.

58. Adzhubei, I.A., et al., A method and server for predicting damaging missense mutations. Nat Methods, 2010. 7(4): p. 248-9.

59. Kumar, P., S. Henikoff, and P.C. Ng, Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc, 2009. 4(7): p. 1073-81.

60. Wang, J., et al., An integrated approach for classifying mitochondrial DNA variants: one clinical diagnostic laboratory's experience. Genet Med, 2012. 14(6): p. 620-6.

61. Fan, W., et al., A mouse model of mitochondrial disease reveals germline selection against severe mtDNA mutations. Science, 2008. 319(5865): p. 958-962.

62. Cooper, G.M., et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Res, 2005. 15(7): p. 901-13.

63. Davydov, E.V., et al., Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol, 2010. 6(12): p. e1001025.

64. Kumar, V., B.M. Hallström, and A. Janke, Coalescent-based genome analyses resolve the early branches of the euarchontoglires. PLoS One, 2013. 8(4).

65. Benton, M.J. and P.C. Donoghue, Paleontological evidence to date the tree of life. Molecular biology and evolution, 2007. 24(1): p. 26-53.

66. Lott, M.T., et al., mtDNA variation and analysis using MITOMAP and MITOMASTER. Current protocols in bioinformatics, 2013. 44(1): p. 1.23.1-1.23.26.

67. Trifunovic, A., et al., Premature ageing in mice expressing defective mitochondrial DNA polymerase. Nature, 2004. 429: p. 417-423.

68. Vermulst, M., et al., Mitochondrial point mutations do not limit the natural lifespan of mice. Nature genetics, 2007. 39(4): p. 540-543.

69. Kujoth, G.C., et al., Mitochondrial DNA mutations, oxidative stress, and apoptosis in mammalian aging. Science, 2005. 309(5733): p. 481-4.

70. Faraci, C., et al., Impact of exercise on oocyte quality in the POLG mitochondrial DNA mutator mouse. Reproduction, 2018. 156(2): p. 185-194.

71. McLaren, W., et al., The Ensembl Variant Effect Predictor. Genome Biol, 2016. 17(1): p. 122.

72. Oliphant, T.E., A guide to NumPy. Vol. 1. 2006: Trelgol Publishing USA.

73. Virtanen, P., et al., SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods, 2020. 17(3): p. 261-272.

74. Hunter, J.D., Matplotlib: A 2D graphics environment. Computing in science & engineering, 2007. 9(3): p. 90-95.

75. Waskom, M., et al., mwaskom/seaborn: v0.8.1 (September 2017). Zenodo, doi, 2017. 10.

76. Belle, E.M., et al., An investigation of the variation in the transition bias among various animal mitochondrial DNA. Gene, 2005. 355: p. 58-66.

77. Li, W.-H., C.-I. Wu, and C.-C. Luo, A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Molecular biology and evolution, 1985. 2(2): p. 150-174.

78. Ma, H., et al., Germline and somatic mtDNA mutations in mouse aging. Plos one, 2018. 13(7): p. e0201304.

79. Keogh, M. and P.F. Chinnery, Hereditary mtDNA heteroplasmy: a baseline for aging? Cell metabolism, 2013. 18(4): p. 463-464.

80. Khrapko, K., et al., Clonal expansions of mitochondrial genomes: implications for in vivo mutational spectra. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 2003. 522(1-2): p. 13-19.

81. Khrapko, K., The timing of mitochondrial DNA mutations in aging. Nat Genet, 2011. 43(8): p. 726-7.

82. Stewart, J.B. and N.G. Larsson, Keeping mtDNA in shape between generations. PLoS Genet, 2014. 10(10): p. e1004670.

83. Mahrous, E., Q. Yang, and H.J. Clarke, Regulation of mitochondrial DNA accumulation during oocyte growth and meiotic maturation in the mouse. Reproduction, 2012. 144(2): p. 177.

84. Mishra, P. and D.C. Chan, Mitochondrial dynamics and inheritance during cell division, development and disease. Nature reviews Molecular cell biology, 2014. 15(10): p. 634-646.

85. Gilkerson, R.W., et al., Mitochondrial autophagy in cells with mtDNA mutations results from synergistic loss of transmembrane potential and mTORC1 inhibition. Human molecular genetics, 2012. 21(5): p. 978-990.

86. Wallace, D.C., Mitochondrial genetic medicine. Nature genetics, 2018. 50(12): p. 1642-1649.

87. Tarazona, A., et al., Mitochondrial activity, distribution and segregation in bovine oocytes and in embryos produced in vitro. Reproduction in Domestic Animals, 2006. 41(1): p. 5-11.

88. Wang, L.-y., et al., Mitochondrial functions on oocytes and preimplantation embryos. Journal of Zhejiang University Science B, 2009. 10(7): p. 483-492.

89. May-Panloup, P., et al., Ovarian ageing: the role of mitochondria in oocytes and follicles. Human reproduction update, 2016. 22(6): p. 725-743.

90. Cline, S.D., Mitochondrial DNA damage and its consequences for mitochondrial gene expression. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 2012. 1819(9-10): p. 979-991.

91. McCulloch, S.D. and T.A. Kunkel, The fidelity of DNA synthesis by eukaryotic replicative and translesion synthesis polymerases. Cell research, 2008. 18(1): p. 148-161.

92. Douzery, E., Map of the human mitochondrial genome, M.o.t.h.m. genome.svg, Editor. 2017: Wikimedia Commons.

93. Chiaratti, M.R., et al., Maternal transmission of mitochondrial diseases. Genetics and Molecular Biology, 2020. 43(1).

Chapter 2: Quasi-Mendelian Paternal Inheritance of mitochondrial DNA: A notorious artifact, or anticipated mtDNA behavior?

Originally published as: Annis, S., Fleischmann, Z., Khrapko, M., Franco, M., Wasko, K., Woods, D., Kunz, W.S., Ellis, P. and Khrapko, K., 2019. Quasi-Mendelian paternal inheritance of mitochondrial DNA: A notorious artifact, or anticipated behavior?. Proceedings of the National Academy of Sciences, 116(30), pp.14797-14798.

Abstract

A recent report by Luo et al. (2018) in PNAS (DOI:10.1073/pnas.1810946115) [1] presented evidence of biparental inheritance of mitochondrial DNA. The pattern of inheritance, however, resembled that of a nuclear gene. The authors explained this peculiarity with Mendelian segregation of a faulty gatekeeper gene that permits survival of paternal mtDNA in the oocyte. Three other groups [2-4], however, posited that the observation was an artifact of inheritance of mtDNA nuclear pseudogenes (NUMTs) present in the father’s nuclear genome.

We present justification that both interpretations are incorrect, but that the original authors did, in fact, observe biparental inheritance of mtDNA. Our alternative model assumes that, because of the initially low paternal mtDNA copy number, these copies are randomly partitioned into nascent cell lineages. The paternal mtDNA haplotype must have a selective advantage, so ‘seeded’ cells will tend to proceed to fixation of the paternal haplotype in the course of development. We use modeling to emulate the dynamics of paternal genomes and predict their mode of inheritance and distribution in somatic tissue. The resulting offspring is a mosaic of cells that are purely maternal or purely paternal, including in the germline. This mosaicism explains the quasi-Mendelian segregation of the paternal mtDNA. Our model is based on known aspects of mtDNA biology and explains all of the experimental observations outlined in Luo et al., including maternal inheritance of the grand-paternal mtDNA.

Introduction

A recent report [1] presented the long-awaited confirmation of paternal inheritance of mtDNA in humans [5]. Surprisingly, paternal transmission of mtDNA [1] follows a bi-modal pattern: about half of the offspring show fairly uniform paternal/maternal heteroplasmy levels while the rest do not inherit paternal mtDNA at all. This pattern resembles the inheritance of a dominant nuclear gene. The authors explain this pattern as permissive inheritance resulting from a faulty ‘gatekeeper’ gene [1]. However, three groups [2] [3] [4] instead suspect contamination with mtDNA nuclear pseudogenes (NUMTs), a notorious artifact [6].

Based on our extensive NUMT experience, we support the authors’ response (Luo et al., 2019), which asserts that a NUMT artifact is unlikely (further explanations: Notes 1-2). However, we also demonstrate that the authors’ dominant gatekeeper explanation [1] is incorrect, because spermatozoa are functionally diploid (Note 3) and thus are equally affected by the faulty gatekeeper. This leaves the quasi-Mendelian inheritance unexplained.

We offer an alternative explanation based on our analysis of the intracellular population dynamics of paternal mtDNA (Fig.1): With a defective gatekeeper, a spermatozoid delivers ≤100 paternal mtDNA molecules (Note 4) to the oocyte. In the beginning, cleavage of the embryo proceeds without mtDNA replication, and mtDNA molecules, including paternal ones, are randomly distributed among the blastomeres. Because the number of paternal molecules is low, some blastomeres are not seeded with paternal mtDNA (Fig.1B) or lose them due to intracellular genetic drift. Early replication of a mtDNA subpopulation counterbalances genetic drift, with about half of cells getting seeded with paternal mtDNA (Note 7.2).
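The seeding step can be illustrated with a short calculation. The sketch below is not the simulation referred to in Note 7.2; it assumes only that each paternal molecule lands independently and uniformly in one of the blastomeres, and it ignores replication, drift and selection. The function names and the cell counts are illustrative assumptions.

```python
import random


def expected_seeded_fraction(n_paternal: int, n_cells: int) -> float:
    """Probability that a given cell receives at least one paternal molecule
    when n_paternal molecules are placed uniformly at random into n_cells."""
    return 1.0 - (1.0 - 1.0 / n_cells) ** n_paternal


def simulate_seeded_fraction(n_paternal: int, n_cells: int, rng: random.Random) -> float:
    """One random partition; returns the fraction of cells holding
    at least one paternal molecule."""
    seeded_cells = {rng.randrange(n_cells) for _ in range(n_paternal)}
    return len(seeded_cells) / n_cells


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for n_cells in (8, 64, 512):  # hypothetical stages of cleavage
        analytic = expected_seeded_fraction(100, n_cells)
        trials = [simulate_seeded_fraction(100, n_cells, rng) for _ in range(500)]
        print(n_cells, round(analytic, 3), round(sum(trials) / len(trials), 3))
```

Under these simplifying assumptions the seeded fraction falls as the same ~100 molecules are spread over more cells, which is the qualitative behavior the paragraph above relies on.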

Because paternal inheritance includes a reproducible ~1000-fold enrichment of paternal mtDNA, the paternal mtDNA haplotype must have a selective advantage over the maternal haplotype (Note 5). Competition between mtDNA haplotypes in somatic tissues is well-documented [7], and can be sufficiently strong to drive the enrichment of the paternal haplotype to ~100% in the seeded cell lineages (Note 6). Non-seeded lineages naturally remain 100% maternal, so a mosaic of cells with mostly paternal and pure maternal mtDNA is created (Fig.1E). In mosaic tissue, the heteroplasmy level is equal to the proportion of paternally-derived cells, or seeding density, i.e., the fraction of somatic cell lineages that received and kept paternal mtDNA during embryogenesis.

Similarly, not all germline lineages are seeded with paternal mtDNA, which ultimately results in a mosaic of paternally- and maternally-derived spermatozoa (Fig.1F). This explains the quasi-Mendelian bi-modal inheritance, with a transmission rate equal to the seeding density of germline lineages. The maternal germline follows a different fate (Fig.1G, Note 7.3.2).

Intuitively, seeding density should depend on the initial number of paternal mtDNA, the timing of mtDNA replication initiation, and the dynamics of the intracellular population of mtDNA molecules. To investigate the combined effect of these factors quantitatively, we performed numerical simulations of mtDNA behavior (Note 7.2). Simulations predict intermediate seeding densities and heteroplasmy levels in somatic and germ cells (Note 7.3), in agreement with the observations (Note 7.1).

In conclusion, despite apparent inconsistencies and some missing quality control items (Note 10), Luo et al. likely describe genuine cases of paternal inheritance, and seemingly impossible inheritance patterns are compatible with known mtDNA biology.

Methods and Results

Note 1. NUMT contamination: an unlikely possibility.

While this letter was in preparation, a recent letter [3] and the authors’ response [8] treated the problem of NUMT contamination in some detail. Because some of us have been involved in research on NUMT effects in mtDNA mutational analysis since the issue was first discovered in the early nineties by Gengxi Hu in Bill Thilly’s laboratory at MIT [9], we would like to add our perspective on the NUMT contamination problem.

Note 1.1 mtDNA-specific damage boosts the perceived NUMT fraction.

In our experience, one of the causes of apparent NUMT contamination is mitochondrion-specific DNA damage. Specific damage of mtDNA was the cause of NUMT contamination in the case where this problem was first identified [6]. While specific damage of mtDNA does decrease the apparent number of mtDNA copies in a sample, and in severe cases may result in parity of mitochondrial and nuclear DNA or even the prevalence of nuclear DNA, this is not typical of DNA isolated from fresh or frozen blood using modern laboratory practices; it is more typical of partially degraded autopsy samples and similar material.

Suggested test for mtDNA damage: The presence of mtDNA damage can be retrospectively tested on the existing samples of purified DNA from blood samples: PCR should be performed using much shorter PCR fragments. The effect of damage dwindles dramatically with shorter fragment length, so if NUMT contamination is involved because of mtDNA-specific damage, shorter PCR will result in a sharp decrease of the paternal haplotype fraction.
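The length dependence behind this test can be sketched numerically. The model below is an assumption made for illustration, not a method from the original letter: lesions are taken to be Poisson-distributed along mtDNA, any lesion within the amplicon blocks amplification, and NUMT templates are taken to be undamaged. All function names and parameter values are hypothetical.

```python
import math


def intact_fraction(lesions_per_kb: float, amplicon_kb: float) -> float:
    """Poisson probability that an amplicon of the given length carries no lesion."""
    return math.exp(-lesions_per_kb * amplicon_kb)


def apparent_numt_fraction(numt_copies: float, mtdna_copies: float,
                           lesions_per_kb: float, amplicon_kb: float) -> float:
    """Share of amplifiable templates that are NUMT copies.
    Assumes NUMT templates are undamaged and only mtDNA carries lesions."""
    amplifiable_mtdna = mtdna_copies * intact_fraction(lesions_per_kb, amplicon_kb)
    return numt_copies / (numt_copies + amplifiable_mtdna)


if __name__ == "__main__":
    # Hypothetical sample: 2 NUMT copies per 1000 mtDNA copies, 0.5 lesions per kb.
    for amplicon_kb in (16.0, 4.0, 1.0, 0.2):
        print(amplicon_kb, round(apparent_numt_fraction(2, 1000, 0.5, amplicon_kb), 4))
```

With these illustrative values the NUMT-derived signal dominates for a near full-length amplicon and drops to well under 1% for a 0.2 kb amplicon, which is the sharp decrease the suggested test looks for.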

Another factor could be cell lysis prior to DNA extraction. If the sample is not stored carefully, the plasma membrane might be damaged, leading to loss of cytoplasm. Then, when the cells are centrifuged to extract the DNA, only nuclei may be pelleted, while the mitochondria may be lost in the supernatant, resulting in enrichment for nuclear NUMTs. This possibility emphasizes the need for measurements in other tissues (such as buccal cells). We note, however, that the paternal haplotype may well be enriched in blood but not in buccal cells. Indeed, it is known that NZB/Balb selection differs dramatically between different tissues (from positive to negative). Similarly, in the well-established case of paternally inherited mtDNA [5], the paternal haplotype was prominent in muscle but completely absent in blood. So, ideally, the confirmatory tests should be performed in blood.

Note 1.2 NUMT must be full length, tandem repeat of a perfect mtDNA copy.

The scheme in Fig.2 shows that, in the case of a linear mtDNA copy inserted into the nuclear genome, the ability to perform full-length PCR with either of two different pairs of primers (as reported in the materials and methods of [1]) requires that the flanks of the pseudogene sequence in the genome are also mtDNA sequences continuously extending the sequence of the central copy, as if this were a circular mtDNA molecule. Biologically, such a situation can most likely arise by tandem multiplication of the unaltered mtDNA sequence in the nuclear genome.

Luo et al. (2018) [1] amplified mtDNA using two separate pairs of primers (A and B) that each generated nearly full-length products (green and orange lines). If the observed paternal haplotypes were indeed NUMT contaminants, the NUMT would have to be a perfect tandem repeat that encompasses the forward and reverse priming sites for both primer combinations.

While the existence of a full-length tandem repeat of unaltered copies of mtDNA in the nuclear genome cannot be completely excluded, such an arrangement has not been observed in sequenced human genomes or other genomes. Even merely large mtDNA fragments are very rare in the human nuclear genome. There are a few cases of almost full-genome NUMTs, one recently detected in a cancer tissue [10] and one in a colobine monkey [11], but all of them contain at least some deletion. Moreover, NUMTs appear to be rather stable in the nuclear genomes of various species. We have observed a 5 kb NUMT which did not change (other than acquiring point mutations) for at least 8 million years while residing in the human, chimpanzee and gorilla genomes [12]. The colobine monkey NUMTs also showed no structural change since their divergence about 4 million years ago [13]. This implies that the lack of tandemly repeated perfect mtDNA copies is likely not a result of the disappearance of such structures with time but rather of their failure to arise in the first place.

In conclusion, while the presence of perfect multicopy NUMTs in the families described in Luo 2018 cannot be fully excluded, this appears to be a very unlikely possibility. If such NUMT arrays can be proven to exist, that would be, in our opinion, a discovery similar in scientific value to the confirmation of the paternal transmission case.

Note 2. Tests for NUMT contamination: platelet and nuclei analysis.

The NUMT hypothesis can be most easily tested by comparing heteroplasmy levels in tissues with different mtDNA/nDNA ratios. For example, in the case of the blood samples of Luo et al., analysis of the purified platelet fraction would help. Platelets contain no nuclear DNA, so if the NUMT hypothesis is correct, the proportion of the paternal genotype should be drastically reduced when re-measured in purified platelets. Platelets should be easy to isolate even from frozen archived blood samples, so this is hopefully something that can be done even if new sampling is not possible.

In addition, it would be important to do the converse analysis by purifying nuclei from a blood sample using hypotonic lysis. That should eliminate most mtDNA and enrich NUMTs.

Moreover, the degree of enrichment of nuclear DNA can be directly checked using PCR to test for constitutive NUMT sequences.

The original case of paternal inheritance is not a NUMT.

Of note, the first detected case of paternal inheritance [5] did pass a similar NUMT test: the fraction of paternal mtDNA was high in muscle, a tissue rich in mtDNA, and undetectable in blood, where relative mtDNA content is low, i.e. the extreme reverse of what the NUMT hypothesis would have predicted. Also, as we explored the presence of mtDNA recombinants (essentially products of reciprocal repair), we extensively tested the DNA from the patient (both blood and muscle) at different PCR fragment lengths, and detected no difference in the maternal/paternal haplotype ratio [14], arguing against the differential damage possibility (Note 1.1). Because the original case of paternal inheritance [5] passed both tests, the NUMT hypothesis is at least not generally applicable (Note 4).

Note 3: Functional diploidy of spermatozoa: Mendelian segregation of a gatekeeper gene is not a valid explanation of bi-modal inheritance.

Luo et al. (2018) propose that the quasi-Mendelian inheritance pattern exhibited among the offspring of heteroplasmic fathers that transmit their mitochondria is due to the underlying segregation of an autosomal dominant gatekeeper gene mutation that promotes paternal mitochondrial transmission. Under this hypothesis, the gatekeeper gene must function cell-autonomously after meiosis, i.e. only those sperm cells that inherit the mutant allele pass on their mitochondria. This contradicts the known biology of germline development. A fundamentally conserved aspect of spermatogenesis throughout the animal kingdom is that cytokinesis remains incomplete during the premeiotic and meiotic divisions. This means that all the sister cells arising from the same germline stem cell remain linked by cytoplasmic bridges [15] [16]. The bridges are large, up to 3 microns in diameter [17], i.e. of sufficient size to enable sharing of cytoplasmic contents between cells [18]. This sharing is an active process, in which mRNAs are trafficked between cells via specific binding proteins and shuttling of transcripts within a dedicated organelle known as the chromatoid body [19] [20]. In addition to the facilitated transport of mRNAs within the cytoplasm, both rough and smooth endoplasmic reticulum are continuous across the intercellular bridges, allowing sharing of protein products between cells ([21], figure 18). Consequently, sperm are functionally diploid, not haploid.

Could it be possible for the gatekeeper gene to avoid transcript sharing and act in a cell-autonomous manner? While theoretically possible, this would be virtually unprecedented since only two examples of non-shared genes are known in mice [22] [23] and none in humans.

However, even a non-shared gatekeeper gene cannot explain the observed inheritance pattern in the families studied by Luo et al. Whole mitochondria have been observed within the intracellular bridges by electron microscopy ([24], figures 19 and 20), implying that mitochondria are also freely exchanged between sister cells. Thus, the mutant gatekeeper gene product would not only have to escape sharing across the bridges, but also modify the mitochondria in such a way that they are no longer shared across the bridges. We judge this to be extremely implausible. Consequently, any gatekeeper mutation, whether dominant or recessive, should affect all sperm cells equally regardless of which alleles they happen to inherit during meiosis.

Note 4: How many paternal mtDNA molecules are delivered (and survive) by the spermatozoid into the oocyte in the cases of paternal inheritance?

The number of mtDNA molecules in a sperm cell is controversial, and estimates vary from ~100 to essentially none. This variation may be due to the self-destructive potential for elimination of mtDNA in spermatozoa, as described [25]. This may be a premature artificial in vitro phenomenon, and in vivo sperm may in fact enter the oocyte with its mtDNA still intact. We therefore assume that once the elimination of sperm mtDNA is disabled, about 100 paternal mtDNA molecules will survive in the fertilized oocyte. It might be even fewer, since the mid sperm tail, where the mitochondria are localized, does not penetrate the egg.

Nevertheless, it’s probably safe to assume that a substantial fraction of sperm mitochondria (and, hence, mtDNA) are injected into the oocyte, since [26] conservatively observed ~40 paternal mitochondrial objects in the fertilized oocyte at 36hrs (two cell embryo). This count is conservative in that overlapping mitochondrion images were counted as one, and also in that counts were taken at 2-cell stage, at which it has been shown that the disaggregation of paternal mitochondria is far from complete [25], so many individual mitochondrial objects probably still stick to each other and are counted as one. This interpretation is supported by the highly uneven luminosity of the objects, implying that some of them are aggregates (see Figure 2 from [25]) and the number of objects increases at least until the morula stage. Importantly, when sperm mtDNA does make it into the oocyte, it can be identified and traced all the way to multiple tissues of the newborn [25].

We note that the number of paternal mtDNA copies delivered by the spermatozoid is one of the inexact parameters of our model, and a lower-than-assumed number of delivered copies may be compensated by an earlier-than-assumed onset of mtDNA replication and a smaller replicating subpopulation (Note 7.3.3).

Note 5: Reproducible paternal mtDNA transmission requires selection of the paternal haplotype.

Note 5.1. Paternal expansion is reproducible.

Paternal transmission described by [1] is reproducible, meaning that it is repeatedly observed within a family. In each of the three families described in [1], paternal transmission must have happened more than once. This may not be immediately obvious from the tree itself: there is only one case (family C) where more than one child clearly inherits paternal mtDNA (CIII6&7). Nevertheless, paternal transmission should have happened more than once in all three families because in each family, the path of paternally inherited mtDNA starts with a heteroplasmic man, who already carries two highly divergent haplotypes. Even if we do not consider the great-grandparents, whose DNA was not available and whose genotypes are uncertain (despite the authors’ attempt to infer them), the very existence of the hetero-haplotypic person implies that at least one additional paternal inheritance event has happened sometime in the past to initially create this unusual distant heteroplasmy of two haplotypes. This means that in all three families, paternal transmission was indeed reproducible.

The reproducibility of paternal transmission is conceptually very important. Paternal inheritance of mtDNA in the cases reported by Luo et al. involves a reproducible ~1000-fold enrichment of paternal mtDNA, from ~0.05% in the egg (i.e., ~100 molecules among 200,000 maternal molecules) to ~50% heteroplasmy in blood. While such enormous enrichment apparently can happen due to random drift alone (Note 5.2), such events are quite rare, and the probability that two such events happen within the same family is very small, so a selective process is the only reasonable explanation of the reproducible transmission of paternal mtDNA in a family.

Note 5.2. Random enrichment of paternal mtDNA.

The efficient enrichment of a rare mtDNA variant in one generation can happen, though rarely, via a random selection-free process; this is basically a manifestation of the mtDNA germline bottleneck phenomenon, which in turn is probably based on intracellular random genetic drift. Perhaps the most dramatic example of random enrichment is the expansion, in tumors, of random somatic mtDNA mutations, including fully synonymous ones, which excludes the possibility that any selection is involved in the process of their enrichment. These mutations appear initially in a single mtDNA molecule and remarkably make it to 100% of the entire tumor [27]. We have previously shown that this process can be realistically modeled as random intracellular genetic drift [28]. As far as the germline is concerned, the abrupt enrichment of one genotype over the other has been recognized as the mtDNA bottleneck effect since a seminal work in heteroplasmic cows [29]. More recently, extensive population studies on human mother/offspring pairs showed a significant number of cases where a genotype that was not detectable in the mother showed up as significant heteroplasmy in the child [30] [31]. This implies that a rare genotype, perhaps a nascent mutant molecule, may in some cases enjoy a dramatic enrichment among germline mitochondria in normal humans within a single generation.

If paternal mtDNA that has leaked into the oocyte due to a defective gatekeeper happens to appear in the role of such a randomly enriched sporadic genotype (and there is no reason why it can’t be in this role), the paternal genotype may overcome the maternal mtDNA and result in paternal inheritance. In fact, this is the same path that any neutral nascent mutation that eventually gets fixed in an organism has to take. And because new neutral mutations do get fixed from time to time, we know that this path indeed exists and is quite efficient on the evolutionary time scale. This implies that reproducible selection is not necessarily needed for paternal inheritance, and thus the only necessary condition for paternal inheritance is a defective gatekeeper. While the coincidence of a lack of paternal mtDNA elimination and random enrichment is still a very rare event, this scenario explains how any haplotype (not necessarily one enjoying a reproducible selective advantage) can potentially be paternally inherited.

Note 6. Haplotype selection of mtDNA – a known process, sufficient to explain paternal mtDNA enrichment observed in paternal inheritance cases.

Strong sequence-dependent selection is a natural property of mtDNA. The most studied example is the heteroplasmic NZB/BALB mouse, constructed by cytoplasmic fusion between mice with highly divergent haplotypes [7]. Relative proliferative fitness of the NZB haplotype is as high as 1.16 per duplication, depending on the tissue type. This means that the proportion of the NZB haplotype increases 1.16 times per mtDNA duplication. Interestingly, this selective advantage may be mechanistically related to slower turnover of the NZB haplotype rather than faster replication [32]. Note that with a fitness of 1.16 per duplication, it would take fewer than 50 duplications to achieve the ~1000-fold enrichment that is needed to equalize maternal and paternal mtDNA when the latter starts from 0.05% (indeed, 1.16^50 ≈ 1,700). It is not unrealistic to expect hundreds of cell duplications in the somatic tissue cell lineage, which gives the lineage more than enough cell duplications to achieve the necessary ~1000x enrichment of the paternal haplotype.

Also, NZB/BALB enrichment was essentially a randomly chosen pair of haplotypes; it is easy to imagine that other combinations of haplotypes may result in even higher relative fitness.

Furthermore, the cases in [1] are not average ones, but were actually selected for detectability among a large set of samples. Thus the haplotype combinations in these samples are expected to be the ones that display appropriate relative fitness and end up as detectable heteroplasmy/paternal inheritance.
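The enrichment arithmetic of this note can be checked with a few lines of Python. This is only a minimal sketch: it treats enrichment as simple compounding of the per-duplication fitness, as the estimate above does, and the function name and variables are illustrative rather than part of our simulation code.

```python
import math

def duplications_needed(fold_enrichment, fitness_per_duplication):
    """Smallest whole number n such that fitness**n >= fold_enrichment."""
    return math.ceil(math.log(fold_enrichment) / math.log(fitness_per_duplication))

start_fraction = 0.0005    # ~0.05% paternal mtDNA in the fertilized egg
target_fraction = 0.5      # ~50% heteroplasmy
fold = target_fraction / start_fraction   # ~1000-fold enrichment

n = duplications_needed(fold, 1.16)
print("duplications needed:", n)
print("enrichment after 50 duplications:", round(1.16 ** 50))
```

Under these assumptions the required number of duplications falls just below 50, consistent with the estimate in the text.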

Note 7: A detailed model of paternal mtDNA transmission.

We devised a numerical model that simulates the intracellular behavior of individual mitochondria in proliferating cell lineages. In this model, intracellular populations of mtDNA molecules are represented by virtual cells, i.e., lists of sequences (some of which are identified as paternal and some as maternal). Sequences can be duplicated (representing mtDNA replication) and removed (representing cell division and/or mtDNA turnover) in any order and proportion.

This model allows for adjustments in the number of mitochondria per cell, the size of the replicating subpopulation, the proportion of molecules of each haplotype, their relative proliferative fitness, and the number of cell divisions. All operations are kept biologically reasonable, e.g., to increase cell size, we duplicate more sequences (randomly chosen) than we remove, and thus create a positive balance in the number of sequences per virtual cell. The model is composed of several sequential blocks with different parameters representing the different developmental stages. The numbers of germline cell divisions were chosen according to classical work [33]. The algorithm is written in Python and is conceptually based on our original numerical simulations of mtDNA intracellular dynamics in cancer lineages [28].
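The virtual-cell representation described above can be sketched as follows. This is an illustrative reconstruction rather than the actual simulation code: the labels "M" and "P" and the function names are assumptions made for the example.

```python
import random

def make_cell(n_maternal, n_paternal):
    """A virtual cell: one list entry per mtDNA molecule, labeled by origin."""
    return ["M"] * n_maternal + ["P"] * n_paternal

def duplicate(cell, n_copies, rng):
    """Copy n_copies randomly chosen molecules (represents mtDNA replication)."""
    return cell + rng.sample(cell, n_copies)

def remove(cell, n_removed, rng):
    """Discard n_removed randomly chosen molecules (cell division or turnover)."""
    return rng.sample(cell, len(cell) - n_removed)

def heteroplasmy(cell):
    """Fraction of paternal molecules in the cell."""
    return cell.count("P") / len(cell) if cell else 0.0

rng = random.Random(0)
cell = make_cell(994, 6)
# Growing the cell: duplicate more molecules than are removed.
cell = duplicate(cell, 600, rng)
cell = remove(cell, 400, rng)
print(len(cell), heteroplasmy(cell))
```

Duplicating 600 molecules and removing 400 leaves a positive balance of 200, which is how an increase in cell size is expressed in this representation.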

This model is capable of predicting the observables of the Luo et al. 2018 study: the transmission rates and the paternal heteroplasmy levels of somatic tissues in the offspring. As we show below, the predictions are in reasonable agreement with the observations. This does not necessarily mean that the processes at the foundation of our model are actually operating in vivo. The goal of our modeling was to show that Luo’s observations, however enigmatic, are compatible with a biologically plausible scenario. This goal has been accomplished in full.

Note 7.1: The observables to predict: Transmission rates of paternal mtDNA and heteroplasmy levels in the offspring.

First, we need to determine which observables our model is expected to predict. For the purpose of this discussion we consider the transmission rate of paternal mtDNA among families that have already shown paternal transmission (not in the general population, where it is obviously very low). In particular, we will use two rates: a) the transmission rate from fathers already known to be transmitters of mtDNA to their offspring, and b) the transmission rate from mothers carrying paternally transmitted mtDNA to their offspring. Note that when counting transmission events we need to exclude any parent/offspring transmissions to or directly upstream of the probands, i.e., individuals in whom paternal transmission was initially discovered during the search. The probands are positive for paternal transmission by the anthropic principle: if they did not show paternal transmission, the families would not have been included in the analysis. With that in mind, there are 6 such transmissions in all three families where offspring were analyzed (to AII1, 2, 3; to AIII7; to BIII1; and to CIII7), of which 3 (to AII1, AII3, and CIII7) are positive [1], i.e., the observed transmission rate is ~50%.

As far as the rate of maternal transmission of a paternally acquired haplotype is concerned, there are 3 independent events of this type: transmission to individuals AIV1 and AIV3 (AIV2 is omitted as proband), and CIV1 [1]. All these transmissions are positive, so the transmission rate here is 3/3, i.e., 100%. Of course, the number of events is extremely low, so this figure is a rough approximation at best. It appears that the transmission rate along the maternal line is higher than along the paternal one. Reassuringly, our detailed model (Note 7.5) does predict a higher transmission rate in females than in males (this is related to a larger cell size in the female germline).

The average heteroplasmy level of paternally transmitted mtDNA (positive individuals only) is 0.48 ± 0.16. Note that estimating the heteroplasmy levels does not require exclusion of the probands and their direct ancestors because being positive carries very little information about the actual heteroplasmy levels (we are interested in the heteroplasmy levels in the positive individuals only). The average heteroplasmy level of maternally transmitted paternal mtDNA is 0.33 ± 0.15.

Note 7.2. Model step-by-step: Justification of the scenario and of the parameters used.

Note 7.2.1. Block 1: Fertilization and cleavage of the embryo prior to the re-start of mtDNA replication.

We assumed that ~100 paternal mitochondria (Note 4, Note 7.3.3) were randomly distributed among 16 blastomeres in the morula. The morula blastomere is expected to contain ~12,000 (200,000/16) total mtDNA molecules, of which there should be, on average, 6 molecules of paternal mtDNA (100/16=6). So a cell lineage at this stage was simulated by randomly distributing an average of 6 paternal genomes (Poisson-distributed) among virtual blastomeres, i.e., lists 12,000 sequences long.
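The seeding step of Block 1 can be illustrated with a short sketch. As a simplifying assumption, each of the ~100 paternal genomes is assigned to one of 16 blastomeres uniformly at random; the resulting count per blastomere is binomial and close to Poisson with a mean of about 6. The function names and the "M"/"P" labels are hypothetical and used only for this example.

```python
import random
from collections import Counter

def seed_blastomeres(n_paternal=100, n_blastomeres=16, rng=None):
    """Assign each paternal genome to a randomly chosen blastomere and
    return the number of paternal genomes that landed in each blastomere."""
    rng = rng or random.Random()
    hits = Counter(rng.randrange(n_blastomeres) for _ in range(n_paternal))
    return [hits[i] for i in range(n_blastomeres)]

def make_blastomere(n_paternal, total=12_000):
    """Virtual blastomere: a list of `total` molecules, n_paternal of them paternal."""
    return ["P"] * n_paternal + ["M"] * (total - n_paternal)

counts = seed_blastomeres(rng=random.Random(0))
blastomeres = [make_blastomere(k) for k in counts]
print(counts, sum(counts))
```

Each virtual blastomere produced this way can then be carried through the subsequent blocks as an independent cell lineage.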

Justification: After entering the oocyte, paternal mtDNA temporarily persists as a cloud of mitochondria, which gradually dissipates and gets uniformly dispersed around the embryo by the morula stage, as recorded in studies where paternal mitochondria were stained with specific fluorescent tags [26] [25]. The initial clustering of paternal mitochondria might naturally result in an excess of lineages lacking paternal mtDNA (and others overcrowded with paternal mtDNA); however, with the data at hand there is no way we can reliably model this deviation, so we do not take it into consideration. Our estimates are conservative with respect to the expected rate of paternal transmission and can overestimate the level of paternal heteroplasmy for this reason.

Note 7.2.2. Block 2: Onset of mtDNA replication: the timing of the onset and the size of the replicating subpopulation.

We assumed that mtDNA replication resumes within mtDNA subpopulations containing 1000 molecules (~10%) within the morula blastomeres, and that all paternal mtDNA molecules are replicating. The 10% relative size of the replicating pool was based on our preliminary estimates in PolG oocytes, as discussed in the ‘Justification’ below (see also ‘Note of caution’ 7.3.3). The virtual replicating subpopulation was uniformly duplicated 3 times while the entire population of mtDNA was equally but randomly halved after each duplication to emulate the ongoing cleavage (thus the replicating population oscillated between 1000 and 2000). Equally but randomly means that the total mitochondrial population is divided in half (reflecting approximately equal distribution of mtDNA among blastomeres), but the few paternal copies are distributed randomly (and, being few in number, are subject to much larger relative variance). We assumed that the rest of the mtDNA population either undergoes mitophagy (according to the [34] data) or is diluted out by the newly replicating mtDNA. The completion of this block mimics the implantation stage with ~128 cells, each now dominated by the replicating mtDNA population, 1000 molecules strong. We then performed 2 cycles to emulate post-implantation proliferation of embryonic cells prior to the commitment to the germline at CS5.

Note that during this period the average number of mtDNA molecules per cell is 1500 ((1000+2000)/2=1500), which is in keeping with current estimates [35].
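One duplication/halving cycle of Block 2 can be sketched as below. This is a simplified illustration that follows only the replicating pool (the non-replicating molecules, which are degraded or diluted out, are not tracked), and the function names are assumptions made for the example.

```python
import random

def cleavage_cycle(n_paternal, pool_size, rng):
    """Duplicate every molecule in the replicating pool (1000 -> 2000),
    then pass a random half to the daughter cell that is being followed."""
    doubled = ["P"] * (2 * n_paternal) + ["M"] * (2 * (pool_size - n_paternal))
    daughter = rng.sample(doubled, pool_size)   # random halving
    return daughter.count("P")

def run_block2(n_paternal=6, pool_size=1000, n_cycles=3, seed=None):
    """Follow the paternal copy number in one lineage through n_cycles cycles."""
    rng = random.Random(seed)
    history = [n_paternal]
    for _ in range(n_cycles):
        n_paternal = cleavage_cycle(n_paternal, pool_size, rng)
        history.append(n_paternal)
    return history

print(run_block2(seed=0))
```

Because the pool is halved to a fixed size, the total stays at 1000 after each cycle, while the handful of paternal copies fluctuates from cycle to cycle and can be lost from a lineage entirely.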

Justification: The timing and the extent of mtDNA replication reinitiation in the embryo is a critical parameter in our model, so we will spend extra time on the justification of this parameter choice. In particular, it is likely (see below) that mtDNA replication in the embryo starts earlier than conventionally assumed, but is initially limited to a small subpopulation of mtDNA molecules. Paternal mtDNA is likely to be a part of this replicating subpopulation, which means its frequency in the embryo should get a boost.

In support of the idea of partial replication, note that the replication machinery is being gradually rebuilt in the preimplantation embryo, and most likely will start to work at partial capacity. Independent studies indicate that mtDNA replication may be starting as early as in the morula (16-cell embryo), because this is the point by which mtDNA-replication-related proteins (PolG and TFAM) have already been produced de novo in significant amounts [36] (Fig. 1 therein). In addition, the intrinsic ability of the early embryo to replenish its mtDNA has been demonstrated in experiments on the depletion of oocyte mtDNA in cattle [37]. Depleted mtDNA was fully replenished as early as the blastocyst stage.

A more dramatic observation is that mtDNA in the early embryo is apparently degraded to a significant extent and then is replenished by replication of a limited stock. According to [34] (Fig. 3 therein), total mtDNA copy number in the embryo is sharply reduced in the 8-cell embryo, and then rebounds to pre-cleavage levels by the expanded blastocyst stage. These observations may seem to contradict other studies that reported a relatively constant amount of mtDNA in the early embryo, such as [38]. We note, however, that the mtDNA content of the embryo is typically assessed by real-time PCR that employs a very short PCR fragment, about 100 nt long [38]. In contrast, [34] used a 1000 bp PCR fragment, which is far more sensitive to the acute mtDNA degradation that should take place in the embryo after activation of autophagy in the 4-cell embryo [39]. Note that mtDNA that has just started to be degraded may not have had enough time to be cut into 100 nt fragments, so regular 100 bp real-time PCR may not be sensitive enough to the initial stages of degradation, which is later masked by subsequent replenishment of mtDNA in the embryo by de novo replication.

Of note, the conventional idea that there is no mtDNA replication until the implantation stage is mostly based on the observation that the total amount of mtDNA in the preimplantation embryo is roughly constant. However, the above arguments imply that the apparent stability of the mtDNA content may either result from initial replication of a small subset of mtDNA, which is not expected to significantly affect the total mtDNA amount and thus remains undetected, or reflect a balance between replication and degradation.

The observation that mtDNA is degraded and then replenished in the early embryo is further corroborated by the report of intense segregation of mtDNA haplotypes in the chimeric preimplantation embryo combined from two oocytes containing heterologous mtDNA [40]. This effect is very difficult to explain without assuming a dramatic degradation/replenishment of the mtDNA population. mtDNA degradation in the early embryo actually makes a lot of practical sense: because there is no mitophagy in the oocyte [41], apparently much of the mtDNA in the oocyte has not been renewed for many years of the oocyte’s lifetime and perhaps is significantly damaged and likely to fail replication.

In corroboration of this view, we independently observed that only a small subpopulation of mtDNA in the mature oocyte (~10%) has been actively replicated, and that mtDNA molecules from this subpopulation are preferentially inherited in the next generation (Annis et al., in preparation).

Early replication of a subpopulation of mtDNA significantly affects our model as long as paternal mtDNA specifically is a part of this replicating subpopulation. The reason to expect this is that, unlike the majority of mtDNA of the oocyte, paternal mtDNA has been recently replicated as a part of proliferation in the spermatogonia and thus is expected to be free of replication-impeding damage. Moreover, sperm mtDNA is pre-loaded with PolG and TFAM [42] (Figure 3 therein), putting it in a position to start replicating ahead of the rest of the oocyte mtDNA, which needs de novo protein synthesis to provide PolG and other factors essential for mtDNA replication.

Note 7.2.3. Block 3: Germline/soma specification to birth.

At CS5, after 9 embryonic cell divisions, the germline is specified as a separate group of cells, Primordial Germ Cells (PGCs), which are distinctly larger than soma-to-be cells. In keeping with this milestone, from this point simulations were carried out for 7 duplications along two paths representing the germline and the somatic lineage. The separate paths reflect the different average size of the PGCs (1500 mtDNA/cell) and somatic cell lineages (500 mtDNA/cell) [38], and the size of virtual cells was adjusted accordingly. We used the same approach as in previous blocks: 2-fold oscillation around the average cell size, with 7 cell duplications performed in this regimen. After 7 cell duplications, at the point corresponding to CS17-18, the germline is subject to sex differentiation resulting in differently sized male and female germline cells (~1750 mtDNA/cell on average for female PGCs, ~1000 mtDNA/cell for male PGCs [35]). After that, the germline is carried through 13 more cycles until birth (as before, two-fold oscillation around average cell size).

While the germline was carried through 7+13 cell divisions, somatic cells were carried through a different path with smaller virtual cells. An important question is how many generations the somatic lineage goes through prior to birth. We found no definite answer in the literature, and apparently the actual number of duplications can vary significantly even within the same cell type [43]. We thus performed simulations with a range of numbers of duplications: 20, 30, and 40, which essentially covers the entire range of plausible possibilities. These differences in duplication number resulted in only moderate deviations of the anticipated heteroplasmy levels (~20%), so we have chosen to present the entire range in the results table (Table 1). The whole range is sufficiently narrow to support our conclusions.

Note 7.2.4. Block 4: Post-natal mtDNA behavior.

Male germline and somatic cell lineages were simulated with selective pressure in favor of the paternal haplotype. To emulate selective pressure, in each duplication, every fifth randomly selected paternal mtDNA molecule was replicated twice, to represent the ~20% higher fitness of the paternal haplotype. Mechanistically this bias most likely results from slower turnover of the fit haplotype, not from faster replication [32], but this does not matter as far as the simulation is concerned. The simulations with selection pertain only to the somatic lineages and post-puberty male germline lineages; the female germline is assumed not to be subject to selection [7].
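The selection step can be sketched as follows. This is one possible reading of the rule above, not the actual implementation: every molecule is copied once, and each paternal molecule additionally receives one extra copy with probability 1/5 (a probabilistic stand-in for "every fifth randomly selected" molecule, chosen here so that lineages with fewer than five paternal copies are handled as well). The function name and the fixed cell size are assumptions for the example.

```python
import random

def duplicate_with_selection(n_paternal, cell_size, rng, extra_fraction=0.2):
    """One duplication with a paternal advantage, followed by random halving
    back to cell_size. Returns the paternal copy number in the daughter cell."""
    n_maternal = cell_size - n_paternal
    # About one paternal molecule in five is replicated an extra time.
    n_extra = sum(rng.random() < extra_fraction for _ in range(n_paternal))
    pool = ["P"] * (2 * n_paternal + n_extra) + ["M"] * (2 * n_maternal)
    daughter = rng.sample(pool, cell_size)
    return daughter.count("P")

rng = random.Random(0)
n_paternal = 5                      # a lightly seeded somatic lineage
for _ in range(40):
    n_paternal = duplicate_with_selection(n_paternal, 500, rng)
print(n_paternal / 500)
```

In a lineage that retains the paternal haplotype, repeated application of this step drives the paternal fraction upward; a lineage that loses its last paternal copy stays at zero, since selection has nothing left to act on.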

Post-natal behavior of mtDNA in the female germline was simulated by proportionally increasing the mtDNA populations of female PGCs (obtained in Block 3 above) to 200,000 genomes, reflecting the oocyte growth process. For example, a PGC containing 2 paternal mtDNA among 1000 mtDNA was expanded to a 200,000 mtDNA oocyte with 400 paternal mtDNA molecules. Because the replicating subpopulation of mtDNA is only 10%, 40 out of 400 paternal genomes were assigned to this subpopulation. The resulting virtual oocyte was treated as described above for the oocytes fertilized with sperm with paternally inherited mtDNA. The difference is that in this case the next-generation spermatozoid contributed no paternal mtDNA (because the gatekeeper was intact and paternal mtDNA was destroyed); instead, grand-paternal mtDNA has been carried over from the preceding generation through the female germline.
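The proportional expansion of a PGC into an oocyte reduces to simple scaling, sketched below with the numbers from the example above. The function name and its default arguments are illustrative assumptions.

```python
def expand_pgc_to_oocyte(pgc_paternal, pgc_total,
                         oocyte_total=200_000, replicating_fraction=0.10):
    """Scale a PGC's paternal copy number up to oocyte size and return
    (paternal copies in the oocyte, paternal copies in the replicating pool)."""
    oocyte_paternal = round(pgc_paternal * oocyte_total / pgc_total)
    replicating_paternal = round(oocyte_paternal * replicating_fraction)
    return oocyte_paternal, replicating_paternal

# A PGC with 2 paternal genomes among 1000 mtDNA molecules.
print(expand_pgc_to_oocyte(2, 1000))
```

Note that, unlike the sperm-derived case where all paternal molecules are assumed to replicate, only the 10% share of the carried-over paternal genomes enters the replicating subpopulation here.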

Note 7.3 Simulation Results.

The most important conclusion from our simulations is that, first, fairly simple and biologically reasonable assumptions (like early onset of proliferation of a small subset of mtDNA) are sufficient to explain the necessary boost of the initially infinitesimal fraction of the paternal mtDNA to the observed high heteroplasmy levels. Second, the model explains the observed differences/similarities between heteroplasmy levels and transmission rates in males and females and between germline and somatic cells (which in our model are linked to each other in a complex but sensible way). It should be noted that the model’s reaction to parameter changes is not trivial: this is a system of checks and balances. For example, moving the onset of mtDNA proliferation into an earlier developmental stage helps to explain the remarkable gain in paternal mtDNA frequency and the high paternal transmission rate, but going too far results in a too-high transmission rate and heteroplasmy levels in maternal transmission of the paternal haplotype. The ability of the model with a biologically reasonable set of parameters to satisfy complex and often conflicting observations gives us additional confidence in our conclusions. The results are described in more detail below.

Note 7.3.1. Rate of paternal inheritance and levels of somatic heteroplasmy of paternal mtDNA in the offspring.

The main output of our modeling is the fraction of cell lineages seeded by the paternal haplotype by the time of birth, at which time haplotype selection turns on [7]. According to post-natal simulations (with selection), this transition effectively prevents any further loss of the paternal haplotype from cell lineages. This is because the intracellular genetic drift that had been primarily responsible for haplotype loss slows down as the fraction of paternal mtDNA is boosted by positive selection. The fraction of seeded lineages becomes frozen while intracellular selection within each seeded lineage eventually makes that lineage homoplasmic for the paternal haplotype; the final number of cell lineages homoplasmic for the paternal haplotype is essentially equal to the number of seeded lineages at the time of birth. This observation is illustrated in Figure 3. This implies that the level of somatic heteroplasmy of the paternal haplotype in the progeny should be approximately equal to the fraction of seeded somatic lineages. Similarly, the paternal inheritance rate (i.e., the proportion of progeny showing measurable paternal heteroplasmy in somatic tissues) should be equal to the fraction of seeded male germline lineages.

The anticipated heteroplasmy levels and inheritance rate of paternal mtDNA, as derived from our simulations (with parameters as described in Note 7.2), are presented and compared to the experimentally observed values in Table 1 below. As one can see, the correspondence is fairly good. This good fit is not intended to be used to corroborate our proposed scenario or justify our choice of parameters. Instead, it is intended to demonstrate that the ‘impossible’ observations of Luo et al. 2018 in fact might have resulted from at least one very plausible scenario.

Note 7.3.2. Maternal inheritance of the grand-paternal mtDNA.

In addition to paternal mtDNA transmission, Luo et al. report maternal transmission of the once paternally inherited haplotype (i.e., a haplotype inherited from the male parent at the grand-paternal level or even earlier). The authors incorrectly call this normal maternal transmission. Unlike in normal maternal transmission, the grand-paternal heteroplasmy level in the maternal germline must be substantially lower than the somatic heteroplasmy of the offspring, because higher maternal germline heteroplasmy would have caused dense seeding of the embryonic cells and resulted in 100% paternal homoplasmy in the offspring due to selection in favor of the grand-paternal haplotype (instead, 22-57% heteroplasmy was observed). Our model correctly predicts the observed maternal inheritance rate and heteroplasmy levels of the paternal haplotype (Note 7.2.4, Table 1).

An explanation for the maternal transmission of the grand-paternal haplotype that has been reported by Luo et al. for 4 cases requires a different approach. Unlike the male germline and somatic cells (Note 7.3.1), the female germline is not subject to haplotype selection, in keeping with classical results [7], and it also has very different mtDNA dynamics. The virtual PGCs (primordial germ cells) in our simulations show a wide range of paternal heteroplasmy levels, which are transferred almost unchanged to the virtual oocytes derived from these PGCs (red curve in Figure 4A below). Because there is no selection, there is no digitization of the heteroplasmy in individual oocytes (unlike in spermatozoa). The variable levels of heteroplasmy in the oocytes should be translated into variable levels of somatic heteroplasmy in the offspring (Fig. 4A, blue curve). The distribution predicts that any level of heteroplasmy is similarly likely to appear in the offspring (Fig. 4B), with moderate bias towards the higher levels.

Surprisingly, offspring of the mothers that carry paternal mtDNA (Luo et al.) have a fairly narrow range of heteroplasmy levels: 0.22, 0.22, and 0.29 in a set of 3 siblings from family A and 0.57 in an individual from family C. This distribution appears too narrow to have been independently sampled from the distribution shown in Figure 4B, and this was our major concern as this model was being developed. We note, however, that sampling may not have been independent. It has been shown [43] that it is not uncommon for a majority of oocytes of the same individual to have a common ancestral cell after germline fate has been specified. According to our modeling, heteroplasmy levels in such sibling oocytes should be very uniform. In other words, the three sibling heteroplasmy levels may be so similar because they are closely related, which makes these measurements non-independent and essentially resolves the problem.

Assuming that the three siblings in fact represent one ancestral germ cell, we present their average in Table 1.

Of note, our modeling implies that haplotype selection, if it were present in the maternal germline at a level similar to that in somatic cells/male germline, would have resulted in very high paternal somatic heteroplasmy in most offspring of mothers inheriting the paternal haplotype. This is because under selection the frequency distribution of PGC/oocyte heteroplasmy (red) in Figure 4A will shift to the right, which implies a steep increase in the resulting anticipated heteroplasmy in the offspring (blue curve). This would be in stark contrast with the observed fairly low paternal heteroplasmy in these individuals (35% on average, Table 1). In other words, the independently established lack of selection in the germline is nicely in line with the simulations.

Note 7.3.3. Note of caution.

Here we must once more emphasize that our model contains several inexact parameters and different combinations of these parameters could result in equally realistic predictions. For example, the size of the replicating subpopulation strongly affects the output of the simulations. Smaller subpopulations result in a higher percentage of the paternal haplotype in the offspring and an increased inheritance rate (being part of a smaller replicating subpopulation gives the paternal haplotype a stronger initial boost in relative copy number). Decreasing the size of the replicating subpopulation combined with an even earlier onset of mtDNA replication (which makes sense with a smaller subpopulation) can compensate for a smaller input of mtDNA from the spermatozoid. Neither of these parameters is currently known with certainty (e.g., we do not really know how many mtDNA molecules are delivered by a spermatozoid in the absence of a gatekeeper; the 100 copies used in our simulations is no more than a realistic estimate). Thus the results reported herein should not be taken as corroboration of certain parameter values (such as proving that 10% is the size of the replicating subpopulation). Instead, the value of these simulations is in demonstrating the importance of these parameters (e.g., without assuming a small replicating mtDNA subpopulation in the early embryo, Luo’s data cannot be explained in a realistic way). In this way, the new cases of paternal inheritance will hopefully serve as a catalyst for studies of mtDNA dynamics in the germline and development.


Note 8. The Vissing case of paternal transmission.

The high paternal heteroplasmy in Vissing’s case, i.e., 90% in muscle [5], appears to be too high for our model. The chances of achieving a seeding density of 90% appear low.

This difficulty can be resolved, however, if we note that Vissing’s case and Luo’s cases differ in the way the subjects were searched for. Vissing’s patient was discovered as suffering from mtDNA disease, directly related to the pathological and highly detrimental mutation that he had inherited as a part of paternal mtDNA. This means that the search in this case was targeted at high heteroplasmy of the paternal mtDNA, because the mtDNA microdeletion of the type present in this patient would only exert a phenotype when heteroplasmy reaches over ~90% in muscle. In other words, Vissing’s patient would be invisible to the search had his paternal heteroplasmy not exceeded 90%. Keeping this in mind, we can allow relatively rare circumstances to be used as an explanation for unusually high heteroplasmy levels in this patient (see next paragraph). Note that the search in Luo’s case was not for high heteroplasmy, but merely heteroplasmy detectable by sequencing of a blood sample, with the sensitivity of NGS sequencing (~15-20%).

According to our model, high heteroplasmy can only be achieved via high seeding density of the paternal mtDNA in the cell precursors of the target tissue. There are several ways this can happen as a relatively rare event. One possibility is that the entire tissue, such as muscle in this case, had a relatively recent common ancestor which happened to be highly seeded with paternal mtDNA (such cells do appear at early stages in our models). Indeed, as noted earlier for oocytes, all or the great majority of cells in a tissue can have a relatively recent common ancestor cell. This has been observed for oocytes and for mesenchymal stem cells [43]. The other possibility is that the embryo proper in this case happened to be particularly enriched with paternal mtDNA because the early blastomere encompassing the cloud of paternal mtDNA at the point of sperm penetration happened to found a majority of embryo proper cells. Such cases are relatively frequent and have been recorded multiple times in blastomere lineage tracing experiments [44]. In other words, spatial clustering of the type mentioned in Note 7.2.1, which favored preferential seeding of muscle precursors with the paternal haplotype, could have resulted in the high-density seeding needed to explain high heteroplasmy in Vissing’s patient. Indeed, a little over a 4-fold increase of seeding density (which will happen if the paternal mtDNA cluster is entirely encompassed by one blastomere of the 4-cell embryo that then founds the entire embryo proper) is expected to be sufficient to boost heteroplasmy from the regular 0.36 to ~0.90, as observed in Vissing’s patient’s muscle.
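The arithmetic behind this estimate can be sketched as follows. The sketch rests on two simplifying assumptions that are ours and are not spelled out in the text: paternal genomes are distributed among cells according to a Poisson distribution, and the final heteroplasmy is approximately equal to the fraction of cells seeded with at least one paternal genome (as suggested by Figure 3). The function names are illustrative only.

```python
import math

def seeded_fraction(mean_genomes_per_cell):
    """Fraction of cells that receive at least one paternal genome,
    assuming Poisson-distributed seeding (an assumption of this sketch)."""
    return 1.0 - math.exp(-mean_genomes_per_cell)

def mean_for_seeded_fraction(fraction):
    """Inverse of seeded_fraction: the mean number of paternal genomes
    per cell needed to seed the given fraction of cells."""
    return -math.log(1.0 - fraction)

baseline_mean = 0.5         # paternal genomes per cell used in the Figure 3 simulation
target_heteroplasmy = 0.90  # approximate level reported for the patient's muscle

required_mean = mean_for_seeded_fraction(target_heteroplasmy)
fold_increase = required_mean / baseline_mean

print(f"baseline seeded fraction: {seeded_fraction(baseline_mean):.2f}")
print(f"fold increase needed to reach {target_heteroplasmy:.2f}: {fold_increase:.1f}")
print(f"seeded fraction after an exact 4-fold increase: {seeded_fraction(4 * baseline_mean):.2f}")
```

Under these assumptions the required increase is roughly 4.6-fold, in line with the "little over a 4-fold increase" stated above.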

Interestingly, all the probands in [1] were identified by suspected mtDNA disease, as was Vissing’s original patient. This observation could therefore be related to a selective advantage of the paternal haplotype: it is a known phenomenon that inactivating mtDNA mutations often gain a selective intracellular advantage under certain conditions, and there is a voluminous literature striving to find the mechanism(s) of this counterintuitive phenomenon [45]. However, clearly inactivating mtDNA mutations were apparently absent in the reported paternal haplotypes. Alternatively, an mtDNA-disease-like phenotype may be related to the dominant gatekeeper mutations which enabled leakage of sperm mitochondria into the oocyte in these families; such a mutation may have systemic consequences in other aspects of mitochondrial metabolism, resulting in the alleged mtDNA-disease-like phenotype in these individuals.


Note 9. Testable predictions of the model.

Single cell analysis should reveal intercellular heteroplasmy and intracellular homoplasmy (mtDNA mosaic). This may be most easily tested in buccal swabs.

Note 10. Quality control items:

- NGS raw data must be made available to the reader.

- Tests for NUMT contamination must be performed (Note 1.3)

- Our own analysis of mitochondrial DNA heteroplasmy in the exome sequencing data from 1339 unrelated individuals revealed three clear instances of multiple heteroplasmies (up to 25% heteroplasmy), potentially originating from mixtures of different mitochondrial haplotypes. However, we failed to confirm these cases of multiple heteroplasmy by conventional Sanger sequencing, and the analysis of available family members did not show any evidence for paternal inheritance. Thus, our cases were very likely generated by barcoding errors inherent to modern NGS technology. This proves that biparental mtDNA inheritance is obviously a very rare phenomenon. In light of this rarity, an independent confirmation of multiple heteroplasmies in the patients from Luo et al., 2018 by conventional Sanger sequencing would have been desirable; the presented RFLP data appear to us not to be sufficient proof to exclude potential methodological problems.


Figure 1 As paternal mtDNA (A) is distributed among blastomeres, not all of them receive a copy (B). Later, the germline is parted from the soma (C), and then the sex of the germline is set (D). After birth (dotted line), somatic and male germline cells exert selection in favor of the paternal mtDNA haplotype, so that somatic cells and spermatozoa end up as a paternal/maternal mtDNA mosaic. This explains somatic heteroplasmy and quasi-Mendelian inheritance, respectively. In the female germline, there is no selection for the paternal haplotype (8), so oocytes keep low and variable proportions of paternal mtDNA (G) (Note 7.3.2).


Figure 2


Figure 3 Population dynamics of paternal haplotypes under positive selection: preservation of seeded cell lineages. 100 cells (500 genomes per cell) were randomly seeded with 0.5 paternal genomes per cell on average (which resulted in an initial seeding density of ∼0.39). The cells were then carried through 100 cell duplications with a relative fitness of 1.2 assigned to paternal molecules. The graph shows gradual expansion of the paternal haplotype in the lineages. Importantly, the asymptotic perceived heteroplasmy, i.e., the average across all cells (∼0.36), appears essentially equal to the initial seeding density of the paternal haplotype (0.39), implying that there is essentially no additional loss of seeded lineages under the positive selection regime at the seeding densities used in our simulations. Colors reflect the progression of simulations.
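The initial seeding density quoted in this caption can be checked with a small Monte Carlo sketch. It covers only the seeding step, not the subsequent cell duplications or selection, and it assumes a specific seeding scheme that is ours: exactly 50 paternal genomes are dropped uniformly at random into 100 cells. The function name, trial count, and random seed are illustrative choices.

```python
import random

def simulate_seeding(n_cells=100, n_paternal_genomes=50, n_trials=2000, seed=0):
    """Average fraction of cells that receive at least one paternal genome
    when the genomes are placed into cells uniformly at random."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # The set keeps each seeded cell once, however many genomes it received.
        seeded_cells = {rng.randrange(n_cells) for _ in range(n_paternal_genomes)}
        total += len(seeded_cells) / n_cells
    return total / n_trials

mean_seeded_fraction = simulate_seeding()
# Analytical value for this scheme: 1 - P(a given cell receives no genome).
expected = 1.0 - (1.0 - 1.0 / 100) ** 50

print(f"simulated: {mean_seeded_fraction:.3f}, analytical: {expected:.3f}")
```

Both values come out close to 0.39, the seeding density given in the caption.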


Figure 4 A: Red: the distribution of simulated heteroplasmy among virtual PGCs of mothers who inherited a fraction of mtDNA from their fathers. Note the wide range of heteroplasmy levels. Blue: the predicted final somatic heteroplasmy levels in virtual offspring resulting from PGCs with the heteroplasmy levels indicated on the X-axis. The saturated appearance of the blue curve reflects the fact that paternal genomes present in PGCs (and in the resulting oocytes) are subject to strong positive enrichment during the development of virtual somatic tissues of the virtual offspring. B: The simulated probability distribution of somatic heteroplasmy in the offspring of mothers who had inherited paternal mtDNA.

Note: zero heteroplasmy cases are not shown (they would have looked like an off-scale point at zero).

References

1. Luo, S., et al., Biparental inheritance of mitochondrial DNA in humans. Proceedings of the National Academy of Sciences, 2018. 115(51): p. 13039-13044.

2. Vissing, J., Paternal comeback in mitochondrial DNA inheritance. Proceedings of the National Academy of Sciences, 2019. 116(5): p. 1475-1476.

3. Lutz-Bonengel, S. and W. Parson, No further evidence for paternal leakage of mitochondrial DNA in humans yet. Proceedings of the National Academy of Sciences, 2019. 116(6): p. 1821-1822.

4. Salas, A., et al., Extraordinary claims require extraordinary evidence in the case of asserted mtDNA biparental inheritance. bioRxiv, 2019: p. 585752.

5. Schwartz, M. and J. Vissing, Paternal inheritance of mitochondrial DNA. New England Journal of Medicine, 2002. 347(8): p. 576-580.

6. Hirano, M., et al., Apparent mtDNA heteroplasmy in Alzheimer’s disease patients and in normals due to PCR amplification of nucleus-embedded mtDNA pseudogenes. Proceedings of the National Academy of Sciences, 1997. 94(26): p. 14894-14899.

7. Jenuth, J.P., A.C. Peterson, and E.A. Shoubridge, Tissue-specific selection for different mtDNA genotypes in heteroplasmic mice. Nature genetics, 1997. 16(1): p. 93-95.

8. Luo, S., et al., Reply to Lutz-Bonengel et al.: Biparental mtDNA transmission is unlikely to be the result of nuclear mitochondrial DNA segments. Proceedings of the National Academy of Sciences, 2019. 116(6): p. 1823-1824.

9. Khrapko, K., et al., Mutational spectrometry: means and ends, in Progress in nucleic acid research and molecular biology. 1994, Elsevier. p. 285-312.

10. Yuan, Y., et al., Comprehensive molecular characterization of mitochondrial genomes in human cancers. Nature genetics, 2020: p. 1-11.

11. Wang, B., et al., Full‐length Numt analysis provides evidence for hybridization between the Asian colobine genera Trachypithecus and Semnopithecus. American journal of primatology, 2015. 77(8): p. 901-910.

12. Popadin, K., et al., Mitochondrial pseudogenes suggest repeated inter-species hybridization in hominid evolution. BioRxiv, 2017: p. 134502.

13. Roos, C., et al., Nuclear versus mitochondrial DNA: evidence for hybridization in colobine monkeys. BMC evolutionary biology, 2011. 11(1): p. 77.

14. Kraytsberg, Y., et al., Recombination of human mitochondrial DNA. Science, 2004. 304(5673): p. 981-981.

15. Burgos, M.H. and D.W. Fawcett, Studies on the fine structure of the mammalian testis: I. Differentiation of the spermatids in the cat (Felis domestica). The Journal of Cell Biology, 1955. 1(4): p. 287-300.

16. Fawcett, D.W., S. Ito, and D. Slautterback, The occurrence of intercellular bridges in groups of cells exhibiting synchronous differentiation. The Journal of Cell Biology, 1959. 5(3): p. 453-460.

17. Weber, J.E. and L.D. Russell, A study of intercellular bridges during spermatogenesis in the rat. American journal of anatomy, 1987. 180(1): p. 1-24.

18. Braun, R.E., et al., Genetically haploid spermatids are phenotypically diploid. Nature, 1989. 337(6205): p. 373-376.

19. Morales, C.R., et al., A TB-RBP and Ter ATPase complex accompanies specific mRNAs from nuclei through the nuclear pores and into intercellular bridges in mouse male germ cells. Developmental biology, 2002. 246(2): p. 480-494.

20. Ventela, S., J. Toppari, and M. Parvinen, Intercellular organelle traffic through cytoplasmic bridges in early spermatids of the rat: mechanisms of haploid gene product sharing. Molecular biology of the cell, 2003. 14(7): p. 2768-2780.

21. Clermont, Y. and A. Rambourg, Evolution of the endoplasmic reticulum during rat spermiogenesis. American Journal of Anatomy, 1978. 151(2): p. 191-211.

22. Zheng, Y., X. Deng, and P. Martin-DeLeon, Lack of sharing of Spam1 (Ph-20) among mouse spermatids and transmission ratio distortion. Biology of Reproduction, 2001. 64(6): p. 1730-1738.

23. Véron, N., et al., Retention of gene products in syncytial spermatids promotes non-Mendelian inheritance as revealed by the t complex responder. Genes & development, 2009. 23(23): p. 2705-2710.

24. Dym, M. and D.W. Fawcett, Further observations on the numbers of spermatogonia, spermatocytes, and spermatids connected by intercellular bridges in the mammalian testis. Biology of reproduction, 1971. 4(2): p. 195-215.

25. Luo, S.-M., et al., Unique insights into maternal mitochondrial inheritance in mice. Proceedings of the National Academy of Sciences, 2013. 110(32): p. 13038-13043.

26. Rojansky, R., M.-Y. Cha, and D.C. Chan, Elimination of paternal mitochondria in mouse embryos occurs through autophagic degradation dependent on PARKIN and MUL1. Elife, 2016. 5: p. e17896.

27. Polyak, K., et al., Somatic mutations of the mitochondrial genome in human colorectal tumours. Nature genetics, 1998. 20(3): p. 291-293.

28. Coller, H.A., et al., High frequency of homoplasmic mitochondrial DNA mutations in human tumors can be explained without selection. Nature genetics, 2001. 28(2): p. 147-150.

29. Hauswirth, W.W. and P.J. Laipis, Mitochondrial DNA polymorphism in a maternal lineage of Holstein cows. Proceedings of the National Academy of Sciences, 1982. 79(15): p. 4686-4690.

30. Rebolledo-Jaramillo, B., et al., Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA. Proceedings of the National Academy of Sciences, 2014. 111(43): p. 15474-15479.

31. Li, M., et al., Transmission of human mtDNA heteroplasmy in the Genome of the Netherlands families: support for a variable-size bottleneck. Genome research, 2016. 26(4): p. 417-426.

32. Battersby, B.J. and E.A. Shoubridge, Selection of a mtDNA sequence variant in hepatocytes of heteroplasmic mice is not due to differences in respiratory chain function or efficiency of replication. Human Molecular Genetics, 2001. 10(22): p. 2469-2479.

33. Drost, J.B. and W.R. Lee, Biological basis of germline mutation: comparisons of spontaneous germline mutation rates among drosophila, mouse, and human. Environmental and molecular mutagenesis, 1995. 25(S2): p. 48-64.

34. Spikings, E.C., J. Alderson, and J.C.S. John, Regulated mitochondrial DNA replication during oocyte maturation is essential for successful porcine embryonic development. Biology of reproduction, 2007. 76(2): p. 327-335.

35. Floros, V.I., et al., Segregation of mitochondrial DNA heteroplasmy through a developmental genetic bottleneck in human embryos. Nature cell biology, 2018. 20(2): p. 144-151.

36. Bowles, E.J., et al., Contrasting effects of in vitro fertilization and nuclear transfer on the expression of mtDNA replication factors. Genetics, 2007. 176(3): p. 1511-1526.

37. Chiaratti, M.R., et al., Embryo mitochondrial DNA depletion is reversed during early embryogenesis in cattle. Biology of reproduction, 2010. 82(1): p. 76-85.

38. Cao, L., et al., The mitochondrial bottleneck occurs without reduction of mtDNA content in female mouse germ cells. Nature genetics, 2007. 39(3): p. 386-390.

39. Tsukamoto, S., A. Kuma, and N. Mizushima, The role of autophagy during the oocyte-to-embryo transition. Autophagy, 2008. 4(8): p. 1076-1078.

40. Lee, H.-S., et al., Rapid mitochondrial DNA segregation in primate preimplantation embryos precedes somatic and germline bottleneck. Cell reports, 2012. 1(5): p. 506-515.

41. Boudoures, A.L., et al., Obesity-exposed oocytes accumulate and transmit damaged mitochondria due to an inability to activate mitophagy. Developmental biology, 2017. 426(1): p. 126-138.

42. Amaral, A., J. Ramalho-Santos, and J.C. St John, The expression of polymerase gamma and mitochondrial transcription factor A and the regulation of mitochondrial DNA content in mature human sperm. Human Reproduction, 2007. 22(6): p. 1585-1596.

43. Reizel, Y., et al., Cell lineage analysis of the mammalian female germline. PLoS genetics, 2012. 8(2).

44. Tabansky, I., et al., Developmental bias in cleavage-stage mouse blastomeres. Current Biology, 2013. 23(1): p. 21-31.

45. Kowald, A. and T.B. Kirkwood, Resolving the enigma of the clonal expansion of mtDNA deletions. Genes, 2018. 9(3): p. 126.


Chapter 3: LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules

Portions previously published as: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.

Introduction

The field of DNA sequencing is constantly evolving, with technological and computational advancements allowing for high-throughput data acquisition that is faster, cheaper, and increasingly accurate. The Human Genome Project (HGP) was started in 1990 with the goal of sequencing the entire three billion base pair human genome and was completed thirteen years later at a cost of approximately $5.1 billion, adjusted for inflation [1]. Since the project’s completion in 2003, sequencing costs have plummeted, with the cost per human genome dropping below $1,000 in 2019 [1] and sequencing taking a matter of days, with assembly and analysis requiring additional time. In 1965, Gordon Moore observed that the number of components in integrated circuits doubled every year [2]. This principle became known as Moore’s Law, which has been used to predict the pace at which new technologies will advance. An interesting aspect of this sharp decline in sequencing cost is that it far outpaces the decline predicted by Moore’s Law (Figure 3.1). Moore’s own genome was fully sequenced by a single lab in 2011 [3] with a mean coverage of 10.6-fold and at an estimated cost of $2 million [4]. Today, multiple biotechnology companies are making whole genome sequencing (WGS) packages commercially available to consumers for under $1,000 and with promises of 30-fold coverage [5, 6].

These major advances in sequencing speed and price are due to the continual development of novel technologies that focus in particular on higher throughput. One of the earliest and most enduring sequencing techniques was developed by Fred Sanger in the late 1970s [7, 8]. While Sanger sequencing can provide high accuracy for genomic sequencing, it is a labor-intensive process that yields short 600-700 base pair (bp) fragments. Although the cost dropped to around $1 per read following the completion of the HGP [9], WGS using the Sanger method would cost approximately $500 per megabase (Mbp; 1 million bases), or ~$1.5 million for the whole genome, and would take 500 days on a standard 96-capillary instrument [10]. While Sanger sequencing’s high single-read accuracy (≥99.999% per-base accuracy [11]) makes it a useful tool, for example, in investigating specific genetic regions to identify a disease-causing single nucleotide polymorphism (SNP), it simply does not scale well to longer fragments or whole genome assembly.

The limitations of Sanger sequencing spurred the development of the Next Generation Sequencing (NGS) platforms. The focus of these NGS approaches is to parallelize the sequencing process, allowing for hundreds to thousands of DNA strands to be read simultaneously. Two of the early pioneers were the Roche 454 Genome Sequencer developed by the Rothberg lab [12] and the SOLiD platform developed by the Church lab [13], both initially published in 2005. Both methods produce reads that are shorter than Sanger reads, but because the reads are produced significantly more quickly and at a lower cost, these methods are much more efficient than Sanger sequencing [9]. The Illumina platform soon followed with commercialized sequencing in 2007 and by 2015 had become the most popular NGS system, capturing more than 70% of the market [14]. The intense competition among these developing technologies hastened the dramatic cost decline of sequencing seen in Figure 3.1, as the years immediately following the commercial availability of these new platforms were marked by the steepest decline.

Despite the incredible progress in terms of cost efficiency and speed, the early NGS approaches still had flaws that make them ill-suited for certain projects. One hurdle is the fact that these platforms generate very short reads. While millions or billions of short reads can be assembled to provide accurate and full-length genomic coverage, accurate assembly is particularly hindered in highly repetitive regions [15]. Because individual reads are only a few hundred base pairs long, it becomes immensely challenging to resolve tandem repeats, which can lead to improperly organized genomes. Although there are computational algorithms that can greatly enhance the assembly of short fragments into whole genomes, they still cannot resolve all of the issues that stand in the way of a fully accurate assembly [16]. Another hurdle for using earlier NGS platforms to sequence complex populations is that the short fragments do not allow for later identification of the particular origin of each fragment, i.e., the short fragments can be used to assemble a generic genome but cannot differentiate between different genomic variants in the same sample. While useful for full genome assembly, these platforms fall short when sequencing discrete individuals that are naturally mixed within a sample, such as complex bacterial populations or the focus of this work, mitochondria.

Next-generation sequencing of the mitochondrial genome

Reflecting its origin as a formerly free-living organism turned endosymbiont, the mitochondrion contains its own circular genome that is between 15,000 and 17,000 bp in mammals. Its thirty-seven genes encode the translational machinery for the organelle as well as thirteen protein subunits necessary for the electron transport chain and oxidative phosphorylation. Most organelles contain multiple copies of the genome, averaging 4.6 copies [17, 18]. The total mitochondrial DNA (mtDNA) content varies widely across cell types and in different disease states [19, 20], but mammalian cells typically contain several hundred to thousands of copies of the genome. Despite this high copy number, mtDNA still represents a tiny fraction (typically <1%) of the total DNA content of the cell.

In addition to being numerous, the mitochondrial genome is also distinguished by its heterogeneity, attributable in part to the elevated mutation rate of mtDNA, which is at least 10-fold higher than that of nuclear DNA [21, 22]. Mitochondria comprise dynamic populations within cells, frequently fusing and undergoing fission within continuously changing complex networks [23]. Mitochondrial replication is independent of the cell cycle [24], with ongoing mitogenesis (the biogenesis of new mitochondria) replenishing cells as mitophagy (the controlled autophagy of mitochondria) clears older or defective organelles. This continual replication coupled with the elevated mutation rate ensures that some cells will contain heterogeneous genetic populations, a state termed heteroplasmy. NGS can be used to assess heteroplasmy in cells or tissues but comes with several limitations. Critically, most NGS pipelines do a poor job of detecting ultra-rare variants; if a heteroplasmic mutation is below 1.5% of the total population, it cannot reliably be identified [25]. Although such rare mutations can seem trivial on the surface, the unique characteristics of mitochondrial population dynamics can result in those rare mutations clonally expanding and dominating in certain niches with potential physiological and clinical significance [26], discussed in more detail in Chapter 3.

One downside of using NGS approaches for mtDNA is fragmentation, the use of many small overlapping reads to reconstruct the whole sequence. When there is a mixed population, it becomes impossible to determine whether SNPs are co-associated. An example is outlined in Figure 3.2A: if SNPs are detected within fragments that are further away from each other than the maximum sequencing distance of a few hundred bp, then it is not possible to discern whether the SNPs are evenly spread throughout the population (Population 1) or whether some genomes disproportionately shoulder the mutational burden (Population 2). Although NGS can give insight into the types of mutations found within a population, it cannot describe the types of individuals within that population. Advancements in technology enable a more granular examination of cellular substructures, and with the development of enhanced techniques such as electron microscopy or single-cell transcriptomics it has become increasingly apparent that mitochondria form diverse subpopulations with varied roles within cells [18, 27, 28]. In order to further the characterization of these subpopulations, it is important to examine the mtDNA genomes as individuals within a population rather than as a group average.
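The ambiguity described for Figure 3.2A can be illustrated with a toy sketch. The two four-genome populations below are hypothetical and are not data from this study; the two SNP sites are assumed to lie further apart than a single short read can span. The function names are illustrative.

```python
from collections import Counter

# Each genome is a tuple (allele at site A, allele at site B);
# 1 = variant, 0 = reference. Hypothetical toy populations.
population_1 = [(1, 0), (0, 1), (0, 0), (0, 0)]  # SNPs carried by different genomes
population_2 = [(1, 1), (0, 0), (0, 0), (0, 0)]  # both SNPs carried by one genome

def short_read_view(population):
    """Per-site variant frequencies: all that unlinked short reads can report."""
    n_genomes = len(population)
    n_sites = len(population[0])
    return tuple(sum(genome[i] for genome in population) / n_genomes
                 for i in range(n_sites))

def long_read_view(population):
    """Counts of whole-molecule haplotypes, available when each genome is read in full."""
    return Counter(population)

print(short_read_view(population_1), short_read_view(population_2))
print(long_read_view(population_1))
print(long_read_view(population_2))
```

The per-site frequencies are identical for the two populations, whereas the haplotype counts differ, which is the distinction that only whole-molecule reads can recover.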

Another caveat for the use of NGS in the particular context of mtDNA sequencing is potential contamination from nuclear mitochondrial DNA (NUMT). First identified in 1982, NUMTs are fragments of mtDNA that are transposed into the nuclear genome [29]. Once in the nucleus, the mitochondrial genes are no longer functional and are therefore no longer subject to selection pressures that would otherwise limit mutations. The NUMTs therefore accumulate mutations at a steady rate and can serve as helpful molecular clocks for evolutionary biology studies [30, 31]. The SNPs found in NUMTs can artificially inflate the mutation rate of a given subsample and skew the interpretation of the types of mutations observed [32]. Many NUMTs are ancient and well-characterized, but due to the short length of NGS reads it is impossible to link every contaminant SNP with a known NUMT polymorphism that would allow it to be excluded from the data set. While known NUMT polymorphisms can be eliminated, a given sample may contain NUMTs that are not found in the reference sequence and are therefore not readily excluded. Potential NUMT contamination is at the heart of the current debate regarding rare paternal inheritance of mtDNA, as discussed in Chapter 2. Similarly, contaminant mtDNA can be introduced during sample preparation; without the ability to associate contaminant SNPs with the known polymorphisms of the individuals preparing the samples or, inversely, to accept only SNPs found on molecules with the known polymorphisms of the particular sample, contaminant SNPs can appear to be low-level heteroplasmies and inappropriately distort the data (Figure 3.2B).

The in-depth study of these individual mtDNA genomes within larger population pools necessitates long-read capabilities, where the DNA molecule is sequenced in its entirety. This form of single molecule analysis can be accomplished with the Sanger sequencing method, and much of the data in this study was generated in that fashion, but the process is time- and labor-intensive and limits the scope of the analysis. Briefly, the mtDNA samples are serially diluted in a long-range (~16,000 bp) PCR reaction such that each positive well contains amplicons originating from a single mtDNA template. The samples are then subdivided across 24 separate Sanger sequencing reactions so that the resulting overlapping reads can later be assembled into a complete genome. The benefits of this strategy include the high accuracy of Sanger reads, the ability to analyze mtDNA molecules as discrete individuals within the population, and the ability to confidently exclude contaminants. Long-range PCR prevents the amplification of NUMTs, as they are almost never full-length copies [32], and contamination during sample preparation can be excluded by genotyping the operators and excluding sequences matching their unique haplotypes. The primary hurdle of Sanger sequencing still remains, however: it is slow, costly, and does not scale to the analysis of large populations.
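The logic of diluting templates until positive wells are clonal can be sketched with Poisson statistics. This is an illustration under the assumption, ours rather than the text's, that the number of amplifiable templates per well is Poisson-distributed; the dilution values in the loop are illustrative and are not those used in this study.

```python
import math

def positive_well_fraction(mean_templates_per_well):
    """Fraction of wells containing at least one template (Poisson assumption)."""
    return 1.0 - math.exp(-mean_templates_per_well)

def single_template_share(mean_templates_per_well):
    """Among positive wells, the share that started from exactly one template."""
    m = mean_templates_per_well
    return m * math.exp(-m) / positive_well_fraction(m)

for m in (1.0, 0.5, 0.1):  # illustrative mean templates per well
    print(f"mean {m:.1f}: {positive_well_fraction(m):.1%} of wells positive, "
          f"{single_template_share(m):.1%} of positives from a single template")
```

The trade-off is visible in the numbers: stronger dilution leaves fewer positive wells, but a larger share of them originates from a single molecule.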

Fortunately, the past decade has seen the emergence of third-generation sequencing platforms that offer both long-read capabilities and a high-throughput capacity. The two leading technologies, Pacific Biosciences’ single molecule real time (SMRT) sequencing platform and Oxford Nanopore Technologies’ MinION sequencer, both offer the ability to sequence single molecules that are tens of thousands of base pairs long, all in a single, continuous read.

Pacific Biosciences SMRT Sequencing

PacBio’s SMRT sequencing platform uses a unique approach of sequencing by synthesis to generate real-time sequence data. A single immobilized DNA polymerase occupies the bottom of a well; as phospholinked dNTPs are brought into the enzyme’s active site to be incorporated into the target DNA strand, a fluorescent pulse is released that is recorded using a zero-mode waveguide, a sensitive detection tool that allows for the direct observation and recording of events at the single molecule level [33, 34]. As the polymerase synthesizes DNA complementary to the target molecule, the fluorescent signature of each nucleotide incorporation is recorded to create the sequence. Although this method can quickly generate long reads, each single read is highly error prone, with an average accuracy of under 85% [35]. Single-read accuracy, however, becomes less important in the context of massively high-throughput sequencing. Even though a single read may be littered with errors, sequencing the same region in greater depth can eliminate that error. Importantly, the errors that occur during SMRT sequencing are randomly distributed [35, 36], and therefore the exact same error is unlikely to occur in multiple reads covering the same region. Despite the error-prone nature of each individual read, with enough depth of coverage whole genome assembly can result in accuracies of upwards of 99.999% [37].

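Why randomly distributed errors can be removed by depth can be sketched with a simple binomial model. The model is an assumption made for illustration and is not PacBio's consensus algorithm: each read is taken to be independently correct at a given position with probability 0.85, and the consensus is counted as correct only when a strict majority of reads agree on the true base. The function name and the coverage depths are illustrative.

```python
import math

def majority_consensus_accuracy(per_read_accuracy, n_reads):
    """Probability that more than half of n_reads independent reads are
    correct at a given position (use an odd n_reads to avoid ties)."""
    p = per_read_accuracy
    k_min = n_reads // 2 + 1  # smallest number of correct reads forming a majority
    return sum(math.comb(n_reads, k) * p**k * (1 - p)**(n_reads - k)
               for k in range(k_min, n_reads + 1))

for depth in (1, 5, 15, 25):  # illustrative coverage depths
    print(f"depth {depth:>2}: consensus accuracy "
          f"{majority_consensus_accuracy(0.85, depth):.6f}")
```

Requiring a strict majority is a conservative choice, since random errors are spread over several incorrect bases and rarely agree with one another; the per-position accuracy nonetheless rises steeply with depth.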

One way that the SMRT platform can achieve such high levels of accuracy for WGS is through a hybrid approach of long and short molecule sequencing. Long genomic fragments are used to resolve repetitive regions that are particularly problematic for short fragment assembly, as noted previously. This makes the SMRT system particularly useful for creating genomic scaffolds, mapping the general layout of the genome and resolving problematic repetitive regions. Short but high-depth NGS reads can then be mapped on to these scaffolds, as was done when the large (32.4 Gbp; 1 Gbp = 1,000,000,000 bp) and particularly repetitive axolotl genome was sequenced in 2018 [38]. Although this hybrid approach has now been fruitfully applied to several de novo genome assemblies [39-41], a similar method can be performed using PacBio’s platform exclusively. For sequencing shorter fragments with higher accuracy, the circular consensus sequencing (CCS) method can be used. In this method, bell adapters are added to both ends of the DNA template, circularizing the molecule, as illustrated in. The polymerase in the SMRT system is then able to continuously read the template until several iterations are concatenated into one long read [42]. When originally developed, this tool was applicable only for shorter fragments due to the limitations of how many times the polymerase was able to circumnavigate the DNA molecule; while the polymerase may be able to read around a 1 kilobase pair (kbp) fragment 10 times, it would only be able to make the transit once in a 10 kbp fragment, achieving a depth that is insufficient to correct for the raw error rate [42]. As explained previously, fragmenting genomes from large mixed populations like mitochondria is not a suitable method; the alterations in library preparation would negate the high-throughput sequencing capabilities by requiring each genome to be individually prepared before fractioning, labeling, and sequencing.

Recent advances in PacBio’s chemistry and independently developed library preparation strategies have allowed for prolonged activity of the polymerase, yielding longer reads and making the CCS method attainable for longer fragments [43]. So far, this improved method has yielded an accuracy of 99.8% for 13.5 kbp targets. Although this is a vast improvement over the raw single-read accuracy for long molecules, it still falls far short of what is needed when it comes to rare variant identification in large populations, such as heterogeneous mtDNA populations. A 0.2% error rate when sequencing a single human mitochondrial genome would yield on average 33 erroneous mutations, far higher than the expected mutant fraction of ~1 × 10⁻⁴, or roughly 1.6 mutations per mtDNA genome [44]. With the background noise so high, the intended mutational signal would be lost. Extra sequencing depth cannot be added because once sequenced, that initial molecule is gone and cannot be sequenced again. While these advances in long-read CCS hold great promise for genome assembly, they are still poorly suited for the task of comparing large numbers of subtly different genomes.
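The noise-versus-signal comparison above can be reproduced with a few lines of arithmetic. The genome length of 16,569 bp is the length of the human mtDNA reference sequence and is supplied here as an assumption, since the text does not state the value it used; the constant names are illustrative.

```python
MTDNA_LENGTH_BP = 16_569  # human mtDNA reference length (assumed for this sketch)
CCS_ERROR_RATE = 0.002    # 1 - 0.998, from the 99.8% accuracy quoted above
MUTANT_FRACTION = 1e-4    # expected mutant fraction quoted above

expected_errors = MTDNA_LENGTH_BP * CCS_ERROR_RATE
expected_true_mutations = MTDNA_LENGTH_BP * MUTANT_FRACTION
noise_to_signal = expected_errors / expected_true_mutations

print(f"expected sequencing errors per genome: {expected_errors:.1f}")
print(f"expected true mutations per genome:    {expected_true_mutations:.2f}")
print(f"errors per true mutation:              {noise_to_signal:.0f}")
```

With these inputs, sequencing errors outnumber genuine mutations by about twenty to one, which is why the mutational signal would be lost.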

A final hurdle in using the PacBio platform is the upfront cost of the sequencing system. The Sequel II System costs $350,000, a price tag that is far out of reach for most academic research labs. There are 20 certified service providers in the US, and demand on them can result in weeks-long queues. Ultimately, while the per-base sequencing cost when outsourcing to these facilities is a fraction of the cost of Sanger sequencing, the process is slowed significantly by the high demand on the limited number of sequencing facilities.

Oxford Nanopore Sequencing

The idea of using membrane-embedded nanopores to sequence DNA was first published in 1996; the principle was that as single-stranded DNA (ssDNA) passes through the channel of an α-hemolysin pore, it disrupts the ion current in that pore in a consistent and measurable way [45]. These early experiments demonstrated that ssDNA length could be measured as it traversed the channel and hinted that it may be possible to determine more detailed information about the particular composition of that molecule. Nearly a decade later, it was shown that individual nucleotides had distinct ion current perturbation signatures, such that single base pair differences could be ascertained in oligomers passed through a pore [46]. Although this advance demonstrated the potential for nanopores to be harnessed for sequencing, it also showed that much work was needed for pore engineering before precise recording of nucleotide sequences could be accomplished. Issues of particular importance were slowing the translocation speed of ssDNA through the pore and shortening the sensing region. In earlier designs, nucleotides passed through the nanopores in under 20 microseconds [47], and while high speeds are desirable for improving throughput of sequencing platforms, speeds that are too high impair the detection ability of the sensors that record the raw signal. In order to slow the processing of the DNA strands through the pores, phi29 DNA polymerase can be coupled to α-hemolysin so that the phi29 essentially feeds the ssDNA into the pore ~10,000 times slower than ssDNA moving through the pore alone [48]. While slowing the progression of DNA through the pore is crucial for improving read accuracy, another important component is the number of nucleotides passing through the sensing region of the pore. In wild type versions of the α-hemolysin pore, 12 nucleotides are present in the sensing region of the pore at a time [47]. Such a wide reading frame would make translating the raw signal into a sequence incredibly challenging, as the individual bases become significantly harder to distinguish when they are part of a larger group. Researchers have developed mutated versions of the α-hemolysin pore and other protein pore candidates that shorten the sensing regions and also allow for detection at multiple sites within the pore [49], helping improve accuracy by making it possible to cross-check the readings at both sites. Figure 3.3 illustrates how a single DNA strand is fed into the nanopore via a polymerase, with the resulting current changes being recorded over time.

In 2014, Oxford Nanopore Technologies released the first commercial nanopore sequencing device, the MinION. The primary attributes that set Nanopore apart from other NGS platforms are its low up-front cost and high portability. Currently $1,000 for a device and reagent starter kit, the MinION sequencer is affordable for most labs, while larger and higher-throughput options are available for labs with high sequencing demands. Uniquely, the MinION device is easily portable, weighing just 87 grams, making it particularly useful for real-time disease tracking of viruses such as Ebola [50], Zika [51], influenza [52], and the current COVID-19 coronavirus outbreak [53], the latter of which was sequenced for the first time with a hybrid Nanopore/Illumina approach. The standard flow cells, which are the interchangeable components where the sequencing occurs, each contain 2,048 membrane-embedded protein pores that allow for parallel sequencing of many DNA molecules at once. While read length in PacBio sequencing is constrained by the limited lifespan of the polymerase [54], the nanopores do not have the same limitation and can achieve massive read lengths. The longest single read to date was 2.29 Mb [55], and more routinely reads are in the low tens of thousands of base pairs [56-58]. Such long reads are particularly useful for de novo genome assembly through either entirely Nanopore or hybrid Nanopore and short-read NGS sequencing because their extended length allows for better resolution of genomic arrangements.

Similar to PacBio sequencing, Nanopore sequencing suffers from low single-read accuracy, which was as low as 60% in earlier implementations of the technology but now averages 85% or higher [57, 59, 60] due to improvements in chemistry and basecalling, the computational process of translating raw signal into sequence. Nanopore has undergone several changes in pore composition and in the algorithms used to basecall, resulting in a steady increase in accuracy over time (Figure 3.4). With high depth of sequencing, Nanopore reads can be assembled into genomes with accuracy as high as 99.9% [56]. Recently, a lab demonstrated Nanopore’s effectiveness and efficiency for WGS by sequencing 11 complete and diverse human genomes in 9 days, with assembly taking a mere 6 hours and costing around $6,000 per genome [60]. The speed and accessibility of Nanopore sequencing make it an excellent resource for WGS, but it still does not provide a baseline way to examine complex populations such as a heterogeneous sample of mtDNA. Barcoding kits are available in the Nanopore store and can be used to label different preparations of DNA that can then be pooled and sequenced together. The limitation here, however, is that each mtDNA genome would have to be individually prepared through single molecule PCR, barcoded, and then sequenced. While this is a feasible workflow, it detracts from the high-throughput capabilities of third generation sequencing while also being more costly and time-consuming. In order to better utilize Nanopore’s large-scale potential, a different barcoding approach is required.

Long UMI-driven Consensus Sequencing

Although single-read accuracy is relatively low across NGS platforms, and especially in third generation technologies, the inherent error rate can be virtually eliminated as long as sufficient depth of sequencing is achieved. Because sequencers such as Nanopore’s MinION device allow for parallel sequencing across hundreds of pores simultaneously, it takes little time to amass a sufficient number of reads to build an accurate consensus. If a sample of native mtDNA (i.e. purified DNA that has not been amplified or otherwise altered) is sequenced with a Nanopore device, any single mitochondrial genome can only be sequenced once and therefore the errors cannot be separated from the actual mutational signal. In order to examine the population at the level of individual mtDNA molecules, each genome must first be amplified so that a consensus sequence can be derived from those copies. The errors in Nanopore reads are randomly distributed [61], so while multiple copies of the same molecule will share the same mutations, their error signal will be different, as illustrated in Figure 3.5. The challenge comes in identifying which reads are sister amplicons; while sister molecules will share identical mutations, the mutational signal will be overshadowed by the errors. Mutations alone are not sufficient for clustering molecules, especially in wild type samples where most mitochondrial genomes will have no mutations at all. In order to address this problem, we have developed a novel barcoding system to label DNA molecules with a unique molecular identifier (UMI) that allows for individual molecule identification following sequencing. Once the molecules are appropriately grouped or clustered by their matching UMIs, an accurate consensus can be constructed, a process we are terming long UMI-driven consensus sequencing (LUCS). Figure 3.6A provides an overview of the process: first, a sample of DNA molecules, each with its own unique mutational signal, is subjected to PCR where in the initial cycle a long, random UMI is added to each molecule, as is shown in more detail in Figure 3.6B. In addition to a UMI that is unique to each oligonucleotide primer, a secondary identifier is present to allow for the pooling of multiple different samples so that they can be run on the same flow cell. In the subsequent cycles of PCR, each molecule is copied along with its UMI by a second set of external primers that are complementary to the artificial sequence on the 5’ end of the UMI-containing primers. Through the processes of PCR amplification and sequencing, errors will accumulate on each molecule or read. Finally, the sequencing reads can be identified and clustered by their UMIs, and a consensus sequence can be generated by recording all of the mutations that occur across the molecules within a cluster and excluding as error the variants that occur in only a minority of reads.
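As a minimal illustration of the final consensus step, the Python sketch below takes reads that are assumed to be already clustered by UMI and aligned to equal length, and applies a simple per-position majority vote. This is a simplification for illustration only and is not the pipeline used in this work; the function names, the toy reads, and the 50% threshold are all hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads, min_fraction=0.5):
    """Majority-vote consensus over equal-length, pre-aligned reads.

    Positions where no base exceeds min_fraction are reported as 'N'.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to equal length")
    consensus = []
    for column in zip(*aligned_reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) > min_fraction else "N")
    return "".join(consensus)


def call_variants(reference, consensus):
    """List (position, reference base, consensus base) for each mismatch."""
    return [(i, r, c) for i, (r, c) in enumerate(zip(reference, consensus)) if r != c]


# Toy cluster: every read carries the true A>T change at index 3, while
# sequencing errors (indices 2, 7, and 4) each appear in only one read.
reference = "ACGATAGC"
reads = ["ACGTTAGC", "ACGTTAGC", "ACCTTAGC", "ACGTTAGA", "ACGTAAGC"]

consensus = column_consensus(reads)
variants = call_variants(reference, consensus)
print(consensus, variants)
```

The shared mutation survives the vote because it is present in every sister read, whereas each random error is outvoted at its position.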

The re-identification and subsequent clustering of sequencing reads based on UMI represents a computational challenge. The UMIs are random sequences that cannot be known in advance; in order to barcode with known UMIs, the molecules must first be isolated singly. Applying the UMIs as randomly generated primers allows for a simpler single-step preparation process and a much larger number of potential UMIs for a fraction of the cost. The challenge arises from having to match previously unknown sequences while also dealing with a 10-15% error rate. The fact that Nanopore errors are predominantly insertions and deletions (indels) only adds to the difficulty of differentiating UMIs. Two general strategies are used to improve the accuracy of clustering. First, the UMIs are long; in this study, each UMI is 24 consecutive bases of A, G, or T on one strand or A, C, or T on the other. This results in over 282 billion (3^24) possible combinations, so the likelihood of exact repeats is vanishingly small and the UMIs in general are likely to be dissimilar from each other. Second, a separate UMI is attached to each end of the molecule, as shown in Figure 3.6B, which helps verify proper UMI clustering. Some Nanopore reads do not span the full length of the initial product, as a read can be terminated or truncated due to quality issues. If the UMI is only on one end, then any partial reads that happen to not contain the UMI will have to be fully excluded from the dataset. A UMI on each end ensures that each read will contain at least one identifier.

Further, the UMIs on either end can be used to confirm the reliability of the clustering. For example, if two 5’ end UMIs end up being similar to each other, they may initially be clustered together and considered sister amplicons. The result of this improper clustering will be weak support for each of the mutations in that pool: if half of the molecules have a mutation at one site but the other half are wild type, that site may appear to be a spot of error when actually there are just two different sequences confounding the consensus. By using a double-ended approach, molecules within that joint cluster can be appropriately segregated based on their 3’ UMI, as again the odds of having two similar UMIs are low.
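The size of the UMI space, and the chance that two molecules in a sample receive exactly the same UMI, can be estimated with the short Python sketch below. It assumes that UMIs are drawn uniformly at random from the three permitted bases and uses the standard birthday-problem approximation; the function name and the sample sizes are illustrative and not taken from this study.

```python
import math

UMI_LENGTH = 24       # bases per UMI, as described above
ALPHABET_SIZE = 3     # three permitted bases per strand

umi_space = ALPHABET_SIZE ** UMI_LENGTH


def collision_probability(n_molecules: int, space: int = umi_space) -> float:
    """Birthday approximation of P(at least two molecules share one UMI).

    Assumes every UMI is drawn uniformly and independently.
    """
    exponent = n_molecules * (n_molecules - 1) / (2 * space)
    return -math.expm1(-exponent)


p_100 = collision_probability(100)
p_100k = collision_probability(100_000)

print(f"possible UMIs: {umi_space:,}")
print(f"P(exact repeat) for 100 molecules:     {p_100:.2e}")
print(f"P(exact repeat) for 100,000 molecules: {p_100k:.2e}")
```

Exact repeats are therefore negligible at the sample sizes used here; the practical difficulty lies in similar UMIs being confused after sequencing errors, which is what the double-ended design addresses.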

A particular advantage of applying UMIs to molecules in this fashion is that very small samples can be used without significant loss. The 3’ ends of the UMI primers (Table 3.1) are ~30 bp complements to native mtDNA that permit accurate amplification of the mitochondrial genome without having to specifically isolate the mtDNA from the nuclear genome. The primers generate a 13 kbp amplicon, preventing the unintentional amplification of the smaller NUMTs. Since the sample does not require any purification steps, mtDNA analysis with the LUCS approach can be conducted at the single-cell level, allowing for a level of resolution that was previously unattainable.

Basecalling models

A critical step in sequencing is basecalling, the process of converting raw signal into the appropriate string of nucleotides. Deciphering Sanger chromatograms into sequences is a straightforward and intuitive task, requiring a simple one-to-one conversion of fluorescent peaks to their corresponding nucleotides. With high-throughput NGS methods, however, translating the raw reads is a much more involved process requiring complex computational algorithms. A key difficulty with Nanopore basecalling is that while the detection sensitivity is high enough to differentiate between two bases at the level of single nucleotide resolution, the sensing portion of the pore accommodates a span of 5 nucleotides that do not move with consistent timing (randomly staying longer or shorter periods inside the pore), meaning that each recorded signal does not neatly translate in a one-to-one manner as with Sanger sequencing. There are 1,024 (4^5) possible translations of a single 5 bp reading frame which, compounded with the high noise level associated with taking such precise single molecule measurements, makes simple translation impossible. Instead, Oxford Nanopore and other independent developers have employed neural networks and machine learning algorithms to address the basecalling challenge.
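The number of distinct signal states implied by a given sensing window follows directly from the four-letter DNA alphabet. The Python sketch below enumerates the 5-mers explicitly and, for comparison, counts the states for the 12-nucleotide window of the wild type α-hemolysin pore described earlier; the function name is illustrative.

```python
from itertools import product

BASES = "ACGT"


def kmer_states(k: int) -> int:
    """Number of distinct k-mers over the four DNA bases."""
    return len(BASES) ** k


# Explicit enumeration of every 5-mer that can occupy the sensing region.
five_mers = ["".join(p) for p in product(BASES, repeat=5)]

print(len(five_mers))     # 5-nucleotide sensing window
print(kmer_states(12))    # 12-nucleotide wild type window
```

Each additional nucleotide in the sensing region multiplies the number of states the basecaller must distinguish by four, which is one reason that shortening the window was a priority in pore engineering.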

Neural networks were first applied to Nanopore basecalling in 2017 with the independently developed DeepNano model, followed soon by the Chiron model, and have now become the standard strategy for deciphering raw reads [62-64]. These neural networks employ a layered, bi-directional approach to processing the sequence. An initial model is required as a starting point: raw sequencing reads and the corresponding sequences are fed into the neural network so that it can start to learn and associate patterns within the data. As it basecalls, the network takes both local (the immediately surrounding nucleotides) and global (nucleotides across the entire sequence) patterns into consideration to develop predictions on how to accurately translate the signal into sequence. Average single read accuracy when using neural network basecallers has been observed as high as 93.69% [63] and continues to advance over time. In the less-common 1D² protocol from Oxford Nanopore, the single strands of DNA can essentially be pair-ended (one strand passing through the pore followed immediately by its complementary strand); when basecalled using the Bonito neural network, median raw paired-read accuracy jumps to an impressive 96.15% [65]. In step with these basecalling advances are improvements to the nanopores themselves. The newly released R10 pores, which differentiate themselves from previous versions by their twin shortened sensing regions, have achieved mean raw single read accuracies of around 95% and can be used to reach consensus accuracies of 99.995% [66, 67].

Materials and Methods

The following sections were previously published as: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.

Animals and sample collection

All studies with animals reported herein were reviewed and approved by the institutional animal care and use committee of Northeastern University. Heterozygous mice with a single amino acid substitution (D257A) in the nuclear-encoded DNA polymerase-γ gene (PolgD257A/+) were obtained from the Jackson Laboratory (Bar Harbor, ME, USA) and bred to generate homozygous mtDNA mutator mice (PolgD257A/D257A) [15, 23]. Oocytes were collected after superovulation of young adult (2-month-old) homozygous female mice and denuded of all adherent somatic cells, as detailed previously [23]. Individual oocytes were incubated in 1 µl of lysis buffer (10 mM EDTA, 0.5% SDS, 0.1 mg/ml Proteinase-K) for 3 hours at 37 °C and then stored under mineral oil at –80 °C.

Barcoding PCR

For barcoding primers, we used 125-bp oligonucleotides with three distinct regions. The 5’-end was designed as a 64–73-bp synthetic code devoid of guanines, followed by a random 24-bp barcode also devoid of guanines, and ending with a 28–37-bp 3’-end complementary to the target DNA sequence. The synthetic primers comprised the first 29 bp of the 5’-end of their corresponding barcoding primer. All primer sequences are available in Supplementary Table 3.1. The initial 4 cycles of PCR were conducted in 2-µl reactions containing 1X-concentrated LA Taq reaction buffer (Takara Bio USA, Mountain View, CA, USA), 0.2 mM of each dNTP, 2.5 µM of each barcoding primer, 10 µM of each synthetic primer, and 0.1 units of Hot Start Ex Taq DNA Polymerase (Takara Bio USA), along with lysate prepared from individual oocytes as follows: lysate (1-µl frozen stock; see above) was diluted 10,000-fold in ultrapure water, resulting in an estimated 10 mtDNA molecules per reaction well. Reactions were cycled at 95 °C for 30 sec of denaturation followed by 14 min of combined annealing and extension at 68 °C. After 4 cycles, reactions were held at 68 °C while 48 µl of additional barcoding primer-free PCR mix was added, bringing the final 50-µl reaction to 1X-concentrated LA Taq reaction buffer, 0.2 mM of each dNTP, 0.1 µM of each barcoding primer, 10 µM of each synthetic primer, and 1.25 units of Hot Start Ex Taq DNA Polymerase. Reactions were continued for an additional 45 cycles as described above.
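The primer design and the two-stage reaction volumes can be checked with simple arithmetic, sketched below in Python. The calculation assumes that the 48 µl of added mix contains no barcoding primer (as stated above) and that the three primer regions always sum to the 125-bp total; the function and variable names are illustrative.

```python
def diluted_concentration(c_initial_um: float, v_initial_ul: float, v_final_ul: float) -> float:
    """Final concentration after dilution, from C1 * V1 = C2 * V2."""
    return c_initial_um * v_initial_ul / v_final_ul


# Barcoding primer: 2.5 uM in the initial 2-ul reaction, diluted to 50 ul.
barcoding_final_um = diluted_concentration(2.5, 2.0, 50.0)

# Primer regions (bp): the synthetic 5'-end and the target-specific 3'-end
# vary in length, but together with the 24-bp barcode must total 125 bp.
BARCODE_BP = 24
PRIMER_BP = 125
shortest_synthetic, longest_synthetic = 64, 73
target_for_shortest = PRIMER_BP - BARCODE_BP - shortest_synthetic
target_for_longest = PRIMER_BP - BARCODE_BP - longest_synthetic

print(f"final barcoding primer concentration: {barcoding_final_um} uM")
print(f"3'-end length range: {target_for_longest}-{target_for_shortest} bp")
```

The 25-fold drop in barcoding primer concentration after cycle 4 is what lets the synthetic primers, held at 10 µM throughout, dominate the remaining cycles.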

Library preparation and sequencing

For this experiment, 10 wells of barcoding PCR product were pooled and sequenced. Based on mtDNA copy number estimates from mtDNA content per mouse oocyte [24], this yielded ~100 unique molecules per sample. Products were cleaned using SPRIselect beads (Beckman Coulter Life Sciences, Indianapolis, IN, USA) at a 1:4 ratio of product to bead-buffer to diminish the retention of short, non-target by-products. The QuantiFluor dsDNA System (Promega, Madison, WI, USA) was used to quantify DNA concentrations, and 1.5 µg of cleaned DNA was prepared using the 1D amplicon/cDNA by Ligation SQK-LSK-109 protocol (ACDE_9064_v109_revD_23May2018; Oxford Nanopore Technologies, Oxford, United Kingdom) for sequence analysis using a MinION R9 flow cell and MinION software version 19.10.1 (Oxford Nanopore Technologies). Sequence data were base called using Guppy software (version 2.3.7) and the dna_r9.4.1_450bps_flipflop.cfg model (Oxford Nanopore Technologies).

Single-molecule PCR and Sanger sequencing

Lysate from individual oocytes was serially diluted in order to perform single-molecule PCR, wherein each amplicon originates from a single mtDNA template, as described [25]. In brief, lysate was diluted 300,000-fold in ultrapure water, so that approximately 1/3 of the wells were positive and 2/3 were negative for mtDNA. Mitochondrial DNA was initially amplified with primers m3092F and m3031R (Table 3.1) in 15 µl reactions using Q5 Hot Start Polymerase (New England Biolabs, Ipswich, MA, USA), with final reagent concentrations of 1X-concentrated Q5 reaction buffer, 0.2 mM of each dNTP, 10 µM of each primer and 0.3 units of Q5 Hot Start Polymerase. Reactions were cycled 45 times (30 sec of denaturation at 95 °C followed by 16 min of combined annealing and extension at 68 °C). Following the initial cycles of PCR, amplicons were re-amplified for 15 additional cycles with Hot Start Ex Taq Polymerase using primers m3140F and m3003R (Table 3.1) and the following reagent concentrations: 1X-concentrated LA Taq reaction buffer, 0.2 mM of each dNTP, 10 µM of each synthetic primer and 0.15 units of Hot Start Ex Taq DNA Polymerase. Amplicons were sequenced across 24 sequencing reactions on a 3720xl DNA Analyzer (Applied Biosystems, Foster City, CA, USA). Reads were assembled and aligned against the C57BL/6 mouse mtDNA reference genome (GenBank AY172335.1). CodonCode Aligner software (CodonCode Corporation, Centerville, MA, USA) was used for assemblies and alignments, and each mutation identified was manually confirmed. Sequences with overlapping peaks were discarded as mixed molecules derived from multiple, rather than single, templates.
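The rationale for diluting until roughly one third of wells are positive can be illustrated with limiting-dilution statistics. The Python sketch below assumes that template molecules are distributed among wells according to a Poisson distribution, an assumption that is standard for limiting dilution but is not stated explicitly above; the function names are illustrative.

```python
import math


def mean_templates_per_well(positive_fraction: float) -> float:
    """Poisson mean implied by the observed fraction of positive wells."""
    return -math.log(1.0 - positive_fraction)


def single_template_fraction(positive_fraction: float) -> float:
    """Fraction of positive wells expected to hold exactly one template."""
    lam = mean_templates_per_well(positive_fraction)
    p_exactly_one = lam * math.exp(-lam)
    return p_exactly_one / positive_fraction


mean_templates = mean_templates_per_well(1 / 3)
single_fraction = single_template_fraction(1 / 3)

print(f"mean templates per well: {mean_templates:.3f}")
print(f"positive wells with a single template: {single_fraction:.1%}")
```

Under this assumption about four in five positive wells start from a single molecule, and the remainder are the mixed-template reactions that are recognized by overlapping peaks and discarded.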

Data processing and analysis

Short reads were removed with Filtlong (https://github.com/rrwick/Filtlong) using a minimum length of 13,700 bp, which is 300 bp shorter than the expected length of a UMI-labelled PCR fragment. Reads were then processed in Porechop (https://github.com/rrwick/Porechop) to remove residual ONT adapters. Forward and reverse reads were sorted using Cutadapt (http://journal.embnet.org/index.php/embnetjournal/article/view/200) in paired-end mode. The reverse complement of the reverse reads (https://github.com/lh3/seqtk) and the forward reads were concatenated into a single FASTQ file. Read UMIs were extracted using the template sequence in Cutadapt, leaving two FASTA files: forward-read UMIs and reverse-read UMIs. Read UMIs were clustered in Python using a network-based approach, which leverages the repetitiveness of read UMIs and the linkage information between forward- and reverse-read UMIs. Chimeric clusters were pruned by removing read UMIs whose metric longest common subsequence (LCS) distance from the largest UMI in the cluster exceeded 0.125. This limit was chosen because it allows for no more than three differences between read and centroid, in line with the expected error rate of ONT-based reads. Metric longest common subsequence is defined for two sequences, a and b, of length |a| and |b|, where:

$$d(a, b) = 1 - \frac{|\mathrm{LCS}(a, b)|}{\max(|a|, |b|)}$$
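As an illustration of the pruning criterion, the Python sketch below computes a metric LCS distance and applies the 0.125 threshold. It assumes the standard definition of metric LCS, 1 - |LCS(a, b)| / max(|a|, |b|), and is not the code used in the published pipeline; the function names and the example UMIs are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def metric_lcs(a: str, b: str) -> float:
    """Metric LCS distance, assuming 1 - |LCS| / max(|a|, |b|)."""
    if not a and not b:
        return 0.0
    return 1.0 - lcs_length(a, b) / max(len(a), len(b))


def prune_cluster(centroid: str, umis, threshold: float = 0.125):
    """Keep only the UMIs whose distance from the centroid is within the threshold."""
    return [u for u in umis if metric_lcs(centroid, u) <= threshold]


# Hypothetical 24-base UMI without guanines, plus noisy and unrelated reads.
centroid = "ACTTCATCCATTACCTTACATCAT"
candidates = [centroid, centroid[:-1], centroid[3:], centroid[4:], "T" * 24]
kept = prune_cluster(centroid, candidates)
print(len(kept), [round(metric_lcs(centroid, u), 3) for u in candidates])
```

For a 24-base UMI the threshold corresponds to a shared subsequence of at least 21 bases, so reads with up to three missing bases are retained while a fourth difference or an unrelated UMI is pruned.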

Filtered clusters were written to separate FASTQ files. Reads were aligned to the mtDNA reference sequence (GenBank AY172335.1) using minimap2 (https://github.com/lh3/minimap2) and were polished with medaka (https://github.com/nanoporetech/medaka). Genotypes were called using nanopolish (https://github.com/jts/nanopolish) on all medaka alignments to confirm that no chimeras were present, and to generate base-called and raw signal support fractions for all variants. Compiled data (mean ± SEM) were analyzed by ANOVA and Student’s t-test.

In total, 548,000 reads were sequenced, of which 78,105 reads (14.25%) met the hard length threshold of 13,700 bp. Forward and reverse reads sorted in Cutadapt yielded 31,367 forward reads and 15,036 reverse reads that had adapters on both ends, for a total of 46,403 reads left at this stage. Only 11,588 reads had UMIs at both ends that met the length and quality cutoffs for UMIs. Of these UMIs, 1001 5' UMIs and 892 3' UMIs occurred more than once in the dataset, which accounted for 5208 and 5732 reads and averaged 5.2 and 6.4 repetitions, respectively. In clustering, 1147 reads were clustered by their 5'-UMIs and 1645 reads by their 3'-UMIs. After filtering for chimeras, 12 clusters remained at a minimum depth of 20 reads. The average depth was 42.1 reads/consensus.
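The read attrition through the pipeline can be summarized with the counts reported above. The Python sketch below only restates that arithmetic; the variable names are illustrative, and the estimate of reads contributing to a consensus is simply the number of clusters multiplied by the average depth.

```python
total_reads = 548_000
length_passed = 78_105
forward_reads, reverse_reads = 31_367, 15_036

pct_length_passed = 100 * length_passed / total_reads
both_adapters = forward_reads + reverse_reads

# Average number of reads per repeated UMI at each end.
mean_reps_5p = 5208 / 1001
mean_reps_3p = 5732 / 892

# Approximate reads that end up in a consensus: 12 clusters x 42.1 reads.
reads_in_consensus = 12 * 42.1
pct_in_consensus = 100 * reads_in_consensus / total_reads

print(f"passed length filter: {pct_length_passed:.2f}%")
print(f"adapters on both ends: {both_adapters:,}")
print(f"mean repeats (5', 3'): {mean_reps_5p:.1f}, {mean_reps_3p:.1f}")
print(f"reads used in a consensus: ~{reads_in_consensus:.0f} ({pct_in_consensus:.2f}%)")
```

Roughly 0.1% of the raw reads contribute to a final consensus, so most of the sequencing output is lost to the length and UMI quality filters rather than to chimera pruning.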

End of published material.

Additional unpublished materials and methods:

Basecaller model comparison

To compare the Guppy and Flappie basecaller models, 560 sequences from a single UMI cluster were processed with either the template_r9.5_450bps_5mer_raw.jsn (Guppy) or template_r9.4.1_450bps_large_flipflop.jsn (Flappie) model. Sequence alignments were generated using SAMtools [68]. Statistics and visualizations were generated with the NanoPack [69] and counterr [70] packages.

Results

Basecalling performance

Oxford Nanopore distributes its basecaller, Guppy, within its MinKNOW software that controls the sequencer and as a standalone command line toolkit. The most recent version (since v2.1.3), called Flappie, uses a flip-flop algorithm that analyzes each base in two alternate states to make a better prediction of the true base. To compare these two basecalling approaches, sequences were initially basecalled with Guppy and filtered to include reads that were a minimum of 10 kbp with a Phred quality score (Q) of at least 7. The reads were aligned with the reference sequence and clustered based on their barcodes. Having passed all of these benchmarks, this subset of reads is among the highest initial quality in the dataset; the purpose of this subsequent analysis is to determine how much additional accuracy can be obtained by using different analysis algorithms. A representative cluster of 560 sequences (essentially, 560 reads from sister amplicons of the same original single molecule) was basecalled using the Flappie algorithm.

We observed a substantial improvement across several metrics when using the Flappie model as opposed to the Guppy version. As expected, the predominant error types in both versions were insertions and deletions (indels) (Figure 3.7). Although the distribution of indel length was similar for both basecallers (~80% of indels being 1 bp, and rarely exceeding 3 bp), the prevalence of insertions was reduced by 29.1% between Guppy and Flappie (2.517% and 1.784% of called bases, respectively) and deletions were reduced by 31.9% (1.6% to 1.09%), excluding indels bordering homopolymeric stretches. The rate of substitution errors of each type also decreased with Flappie basecalling, although the same error patterns persisted, such as relatively high rates of A to G and G to A transitions. Furthermore, Flappie yielded a higher median base quality score of 13.0 as compared to Guppy’s 11.4 (Figure 3.8, Table 3.2), which is likewise reflected in an increased median percent identity (the proportion of bases in a read that accurately map to the consensus) of 93.1% in Flappie over 90.6% in Guppy. There was a slight 0.79% increase in the number of bases included in the Flappie analysis, as more bases from the ends of the sequences were able to pass the quality score threshold of 7 and were not automatically trimmed before alignment.
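The relationship between the quality scores and error rates quoted above can be made explicit with the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The Python sketch below applies that conversion and recomputes the relative reductions in indel rates; the function names are illustrative, and the quality scores are the basecaller's own estimates rather than measured identities.

```python
def phred_to_error(q: float) -> float:
    """Estimated per-base error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from 'before' to 'after'."""
    return (before - after) / before


threshold_error = phred_to_error(7)     # read filter used above
guppy_error = phred_to_error(11.4)      # median quality, Guppy
flappie_error = phred_to_error(13.0)    # median quality, Flappie

insertion_drop = relative_reduction(2.517, 1.784)
deletion_drop = relative_reduction(1.6, 1.09)

print(f"Q7 -> {threshold_error:.1%} error; Q11.4 -> {guppy_error:.1%}; Q13 -> {flappie_error:.1%}")
print(f"insertions reduced by {insertion_drop:.1%}, deletions by {deletion_drop:.1%}")
```

A gain of 1.6 quality units therefore corresponds to an estimated error probability falling from about 7% to about 5%, a relative change of similar size to the observed reduction in indels.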

Despite the above improvements that the Flappie model makes in basecalling, a major source of error still remains. One particular difficulty with Nanopore sequencing is accurately deciphering homopolymeric stretches of DNA, regions where a single base is repeated several times in a row [71]. This issue has a basis in the structure of the pore itself: the current Oxford Nanopore pores have a detection window of 4 bp. DNA passes through the pore at an inconsistent rate [72], so a given 4-mer does not have an accurately predictable dwell time within the pore. A 5 bp homopolymer could have the same transit time as a 4 bp one, so resolving the difference between the two is a substantial challenge. This artifact is clearly seen in Figure 3.9 for both Guppy and Flappie, with Flappie showing modest improvements. While both basecalling models correctly call homopolymers of up to 4 bp in over 80% of occurrences, homopolymers over 5 bp immediately suffer a drop in accurate calling, demonstrating the impact that the size of the detection window can have. For homopolymers longer than 6 bp, Flappie correctly assesses the homopolymer length in less than half of all occurrences.
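Homopolymer length errors of this kind can be tallied by locating runs in a reference and comparing them with the corresponding runs in a basecalled read. The Python sketch below shows only the run-finding step on toy sequences; it is not the analysis used to produce Figure 3.9, and the function name and example sequences are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_length: int = 2):
    """Return (start, base, length) for each run of identical bases."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((pos, base, length))
        pos += length
    return runs


# Toy example: the read undercalls both long homopolymers by one base.
reference = "ACG" + "T" * 5 + "A" + "C" * 6 + "AG"
called = "ACG" + "T" * 4 + "A" + "C" * 5 + "AG"

reference_runs = homopolymer_runs(reference, min_length=4)
called_runs = homopolymer_runs(called, min_length=4)

for (_, base, true_len), (_, _, called_len) in zip(reference_runs, called_runs):
    print(f"{base} x {true_len} called as {base} x {called_len}")
```

Pairing runs by order, as done here, works only when no run is lost entirely; a real comparison would pair runs through the read-to-reference alignment.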

The following sections were previously published as: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.

Consensus sequence analysis

Following UMI-based clustering, a consensus of each cluster was generated. Support for a variant was assessed using reads aligned to this consensus in two ways: base-called support and signal support. Base-called support reports the fraction of aligned reads that support the variant at a given position in the consensus. Meanwhile, signal support is derived from the raw feature files (in fast5 format) and estimates the likelihood of the variant. The distinction between base-called and signal support fractions is particularly important here because of the atypical GC-skew of mtDNA. Raw signal support was used to identify spurious variants that could arise from training the neural network of the nanopore base-caller for more standard or methylated genomes. A variant with high base-called support but low signal support would suggest that the variant is a false positive, whereas the opposite would suggest a true variant despite lower support from the read alignment. Training an mtDNA-specific model for base calling is not feasible due to the limited training of the ONT base caller, and attempting it would result in over-fitting.

We filtered sites where the base-called support fraction (viz. the percentage of reads within the cluster that contained a given base) for the wild type variant was less than 0.2, yielding a total of 132 putative variants from 12 molecules. Across all of these variants, the average base-called support was 89.4% (with average signal support of 91.7%), and 95.7% of all variants had raw signal support ranging between 80–100% (Figure 3.10). Positions with signal support lower than 80% could be PCR artefacts from misincorporation of a nucleotide by the Taq polymerase early in the PCR cycling. Five such variants with signal support less than 80% showed support on only one strand and were therefore excluded from further analysis. Variants with low signal support, viz. those below 80%, were randomly distributed across consensus sequences, and no consensus contained more than one variant with low signal support. The presence of variants with high signal support (80% or higher) within the same consensus sequences suggests that the low support variants are a product of random error and not poor-quality clustering or issues with consensus building.
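The base-called support fraction and the strand check described above can be sketched in a few lines of Python. This is an illustrative calculation on made-up base calls and is not how nanopolish computes its support values; the function name, the input format, and the example counts are hypothetical.

```python
def support_summary(calls, variant_base):
    """Summarize support for a variant at one consensus position.

    calls: list of (base, strand) tuples, one per aligned read,
           with strand given as '+' or '-'.
    """
    if not calls:
        raise ValueError("no reads cover this position")
    supporting_strands = [strand for base, strand in calls if base == variant_base]
    return {
        "fraction": len(supporting_strands) / len(calls),
        "both_strands": set(supporting_strands) == {"+", "-"},
    }


# 20 reads: 18 carry the variant T (10 forward, 8 reverse), 2 carry C.
well_supported = [("T", "+")] * 10 + [("T", "-")] * 8 + [("C", "+"), ("C", "-")]
# 20 reads: the variant appears only on forward-strand reads.
one_strand_only = [("T", "+")] * 9 + [("C", "-")] * 11

good = support_summary(well_supported, "T")
suspect = support_summary(one_strand_only, "T")
print(good, suspect)
```

A variant seen on only one strand, as in the second example, is the pattern that led to the exclusion of the five low-support variants.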

To further corroborate the sensitivity of base-called and signal support for variants, we next compared polymorphic sites to heteroplasmic variants. Polymorphic sites were defined as the variant positions that are found in all molecules sequenced by Sanger from the same sample. Here, "heteroplasmic" refers to the frequency at which a variant occurs in the sample, not in the cluster of reads that generate the single molecule consensus. Both heteroplasmic and polymorphic variants would be expected in all reads of the same cluster, but only polymorphic variants would be expected in all reads across all clusters. Therefore, within a cluster of reads that represents a single molecule, base-called support and signal support for both heteroplasmic and polymorphic variants should be similar. The average base-called support fractions for polymorphic (n = 36) and heteroplasmic (n = 96) variants were comparable at 90.1% ± 1.7% and 89.2% ± 0.4%, respectively (mean ± SEM, P = 0.39). Likewise, average signal support fractions were comparable in heteroplasmic (92.1% ± 0.6%) versus polymorphic (93.5% ± 0.7%) variants (mean ± SEM, P = 0.17) (Figure 3.10). Collectively, the consistency of support fractions between polymorphic sites and heteroplasmic variants strongly suggests that the lower frequency variants are of high quality.

Verification of LUCS mutation frequency analysis by Sanger sequencing

Mitochondrial DNA from the same oocyte was amplified in single-molecule PCRs without UMI primers and then Sanger sequenced to determine whether LUCS variants showed a distribution similar to that of variants identified by Sanger sequencing. We observed that LUCS variants with support fractions above 80% were 38.5% synonymous, compared to 36.0% using Sanger sequencing (Figure 3.11). A characteristic feature of Sanger-sequenced variants was a relatively high proportion of transversions, with 45.7% of mutated adenines converted to thymine. While this mutational profile was not observed in variants identified by LUCS (only 36.8% of adenines mutated as transversions), adenine to thymine transversions were more frequently represented than all other transversions (Figure 3.12). Finally, the mutation rate associated with variants identified using Sanger sequencing was 8.3 × 10⁻⁴ (± 2.11 × 10⁻⁵) mutations/bp, or ~13.60 mutations per mitochondrial genome. Using LUCS, the mutation rate was 7.9 × 10⁻⁴ (± 8.2 × 10⁻⁵) mutations/bp, or ~12.81 mutations per mitochondrial genome (mean ± SEM) (Figure 3.13), in close alignment with Sanger sequencing (P = 0.12).
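The conversion from a per-base mutation rate to a per-genome burden is a simple scaling by genome length. The sketch below illustrates it under one loudly stated assumption: a mouse mtDNA reference length of 16,299 bp, which is not given in the text. The function and constant names are illustrative only.

```python
# Assumed reference length in bp for mouse mtDNA; this value is not given in the text.
ASSUMED_MTDNA_LENGTH_BP = 16_299


def mutations_per_genome(rate_per_bp: float,
                         genome_length_bp: int = ASSUMED_MTDNA_LENGTH_BP) -> float:
    """Scale a per-base mutation rate to an expected count per genome."""
    return rate_per_bp * genome_length_bp


# Per-bp rates as reported above (rounded to two significant figures).
sanger_per_genome = mutations_per_genome(8.3e-4)
lucs_per_genome = mutations_per_genome(7.9e-4)

print(f"Sanger: ~{sanger_per_genome:.1f} mutations per genome")
print(f"LUCS:   ~{lucs_per_genome:.1f} mutations per genome")
```

With these inputs the estimates fall within about 0.1 mutations of the reported per-genome values; the small gap presumably reflects rounding of the reported per-bp rates or a slightly different effective length.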

End of published material.

Discussion

Over the past two decades, next-generation sequencing (NGS) has dramatically reduced sequencing costs while also permitting faster and larger-scale workflows. Early NGS platforms require DNA fragmentation, which impedes the study of mixed and diverse populations. Additionally, rare variants occurring in a small proportion of the population are impossible to detect reliably, as they often fall at or below the noise-filtering threshold during data processing [25]. Due to the unique population dynamics of mitochondria, rare variants can clonally expand and become much more abundant late in life, as discussed in Chapter 1 regarding maternal mtDNA inheritance and in Chapter 2 regarding paternal mtDNA leakage. Currently, the exact role of rare variants or low-level heteroplasmies is poorly understood because most sequencing approaches do not capture them. The true diversity of mitochondria has not been fully tapped, and high-resolution sequencing of mtDNA is necessary to shed light on these complex populations.

Third-generation sequencing platforms such as Oxford Nanopore offer long-read capabilities at the cost of a higher error rate. Since the commercial launch of the MinION in 2014, substantial improvements have been made in both pore chemistry and the algorithms that process the raw signals. A recent shift in Oxford Nanopore's standard basecalling model has led to large increases in overall read quality and raw single-read accuracy. Although some of these improvements may seem modest on the surface, these gains in accuracy and quality all occur before the consensus is built, so the cumulative benefits have the potential to yield an even better consensus sequence. Additionally, these improvements are purely computational: any previously acquired data can be subjected to a new round of basecalling to improve quality and accuracy. Third-generation sequencing thus advances on two fronts, the chemistry of the sequencing reactions and the data processing. While researchers await new upgrades in pore chemistry, for example, substantial progress is continually being made in the algorithms that process the complex data sets.

Challenges remain for the platform, including particular difficulty at homopolymer sites. Compounding this homopolymer detection issue is the inherent difficulty that PCR DNA polymerases have in accurately replicating these same regions. Polymerases may stutter when replicating these sites, adding or omitting a base, although this is predominantly an issue for longer (>9 bp) stretches [73, 74]. Homopolymer errors are not unique to Nanopore; they affect all NGS platforms to varying degrees [75-77]. The most recently released pore chemistry from Oxford Nanopore, R10.3 (not used in this study), has a smaller detection window and advertises improved resolution of homopolymeric regions [66]. True indels of biological origin are relatively rare, so most observed changes in homopolymer length are likely to be artifacts; the few that do arise in the dataset can be manually inspected to determine whether they have appropriate support.

Although single-read accuracy has improved greatly in recent years, the basal error rate is still too high for the meaningful analysis of discrete individuals within a mixed population. LUCS is a powerful tool that enables high-coverage sequencing of long DNA targets such as the mitochondrial genome. While any given read within a consensus cluster will have an expected accuracy of ~85%, taken as a whole the reads can be used to generate an accurate consensus sequence. The key metric for determining whether a variant is a valid mutation or merely sequencing error is the support fraction, both from the raw signal and from the consensus cluster as a whole. At a given site, if the reference base is cytosine but 94% of reads are thymine, 2% are cytosine, and 4% are adenine, the base-called support fraction is 94% and the site can be confidently labeled as a mutation. Support fractions above 80% are in line with expected error rates and, as long as they are accompanied by correspondingly high raw signal support, can be categorized as mutations. Variants with support fractions below 80% are more likely to result from error than to reflect an actual mutation. For example, a PCR error occurring in the second cycle of amplification would result in a DNA population with roughly 75% of one nucleotide and 25% of another at that site. An accurate consensus cannot be built from a simple majority of reads but instead must take into account the expected error profile of the sequences. By using a hard cutoff of 80% support fraction, some valid mutations may be lost, but the retained variants are high-confidence. The mutation rate was modestly but not significantly lower for the LUCS molecules than for the Sanger data, indicating that most mutations are successfully identified. Furthermore, variants falling below the support fraction threshold can be manually inspected and considered for inclusion at the researchers’ discretion.
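The base-called support fraction and the 80% cutoff described above can be illustrated with a minimal sketch. It assumes the bases observed at one site across one UMI cluster are already available as a string; the function names are hypothetical, raw signal support is not modeled, and the PCR helper assumes idealized perfect doubling in every cycle.

```python
from collections import Counter

SUPPORT_THRESHOLD = 0.80  # hard cutoff described in the text


def base_called_support(bases: str, reference_base: str):
    """Return (majority_base, support_fraction, passes_cutoff) for one site
    of one read cluster, using base calls only."""
    majority_base, count = Counter(bases).most_common(1)[0]
    fraction = count / len(bases)
    passes = majority_base != reference_base and fraction > SUPPORT_THRESHOLD
    return majority_base, fraction, passes


def pcr_error_fraction(cycle: int) -> float:
    """Expected share of final molecules carrying an error that arose in a
    given PCR cycle, assuming perfect doubling (an idealization)."""
    return 0.5 ** cycle


# Worked example from the text: reference C; reads are 94% T, 2% C, 4% A.
site = "T" * 94 + "C" * 2 + "A" * 4
print(base_called_support(site, "C"))  # majority T at 0.94, passes the cutoff

# A second-cycle PCR error ends up in about a quarter of the molecules.
print(pcr_error_fraction(2))
```

A true mutation whose cluster also carries a second-cycle PCR error at the same site would sit near 75% support and be excluded by the hard cutoff, which is the kind of loss the text accepts in exchange for high-confidence calls.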

An important characteristic of the variant support scores in the LUCS consensus sequences is that they varied across the sequence, with each consensus having a wide range of support fractions. This indicates that lower support fractions (e.g., 82% versus 92%) are not due to widespread quality issues or improper sequence clustering but are instead attributable to more localized and sporadic error. Similarly, support fractions were comparable between polymorphic and heteroplasmic variants. This demonstrates that the rare variants are as reliably identified as more frequent variants. Critically, rare variants are not filtered out because of their rarity, as happens with other NGS approaches. Rare variants are particularly susceptible to being overlooked as false negatives, with various analysis pipelines showing a high degree of discordance in their ability to accurately identify rare SNPs [78, 79]. Because error noise is eliminated at the consensus level rather than the population level, rare variants are more apparent and less likely to be excluded as error, which is further supported by the concordance between the LUCS and Sanger datasets.
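The principle that noise is removed per molecule rather than per population can be sketched with a toy example. It assumes reads are already aligned, equal in length, and tagged with an exact UMI string. None of that holds for real Nanopore data, where the support fractions in this study were determined by nanopolish, and all names and sequences below are made up for illustration.

```python
from collections import Counter, defaultdict


def cluster_by_umi(reads):
    """Group (umi, sequence) pairs into one read cluster per UMI."""
    clusters = defaultdict(list)
    for umi, seq in reads:
        clusters[umi].append(seq)
    return dict(clusters)


def build_consensus(seqs):
    """Column-wise majority consensus plus per-site support fractions."""
    consensus, support = [], []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        support.append(count / len(column))
    return "".join(consensus), support


# Hypothetical reads; the reference would be ACGTT.
reads = [
    ("UMI_A", "ACGTT"), ("UMI_A", "ACGTT"), ("UMI_A", "ACCTT"),  # one random error
    ("UMI_A", "ACGTT"), ("UMI_A", "ACGTT"),
    ("UMI_B", "ATGTT"), ("UMI_B", "ATGTT"), ("UMI_B", "ATGTA"),  # C>T in every read
    ("UMI_B", "ATGTT"), ("UMI_B", "ATGTT"),
]

consensi = {umi: build_consensus(seqs) for umi, seqs in cluster_by_umi(reads).items()}
for umi, (sequence, support) in consensi.items():
    print(umi, sequence, support)
```

In this toy input the variant in UMI_B is present in only one of two molecules, yet it has full support within its own cluster, while the sporadic errors disappear from both consensus sequences.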

A final advantage of LUCS is the ability to adapt to a broad range of sample types and sizes. As there are no cleaning steps required between the application of the UMI and PCR, very small DNA inputs can be used. LUCS is therefore a robust system for single-cell analysis with the ability to scale to much larger samples. This sequencing strategy allows for high-resolution analysis of complex populations, and while this study focused on mtDNA from a single cell, LUCS can be readily adapted to other targets and populations such as bacterial and viral genomes. The error rate for this study was 7.9 × 10⁻⁴, or ~99.921% accuracy; however, this rate is more reflective of the elevated mutational burden of the PolgD257A/D257A mutator mouse than of the true error of this sequencing pipeline. The concordance of mutation rates and mutation characteristics, such as synonymity, between LUCS and Sanger sequences indicates low false positive and false negative rates. Additional validation of LUCS on low-mutant samples will provide better insight into the true error rate, with the caveat that mutant fractions of wildtype samples may currently be underestimated by NGS platforms due to the elimination of low-frequency variants as noise. Ultimately, LUCS allows for higher-resolution sequencing analysis of heterogeneous DNA populations that will further elucidate the complex population dynamics of mitochondria.

[Plot: cost per human genome (log scale, $100 to $100,000,000) by quarter from 2001 to 2019. Series: Moore's Law; Cost per Genome.]

Figure 3.1: Sequencing costs outpace Moore’s law predictions. The cost of DNA sequencing, here presented as cost per full human genome, has decreased faster than predicted by other trends in technological advancement. Full-length human genomes are now attainable for under $1000 [1].

[Diagram. Panel A labels: Fragmented library; Population 1; Population 2. Panel B labels: Fragmented library; Polymorphisms; Contaminant DNA; Target DNA.]

Figure 3.2: Fragmentation of sequencing libraries. When sequencing libraries are fragmented, discrete characteristics of the population are lost. (A) A fragmented library (rainbow boxes) containing mutations (white boxes) can yield the total mutational burden of the population but does not provide details on the composition of that population. In this example, the mutations found in the sequenced fragmented library could be evenly distributed (Population 1) or clustered on a specific DNA molecule (Population 2). (B) Sample contamination either during library preparation or from previous sequencing runs can be difficult to eliminate from sequencing results. While known contaminant polymorphisms (orange boxes) can be readily identified and eliminated during data processing, any mutations associated with those contaminated molecules will not be distinguishable.

Figure 3.3: Nanopore overview. “(A) Two ionic solution-filled chambers are separated by a voltage-biased membrane. A single-stranded polynucleotide (black) is electrophoretically driven through an MspA nanopore (green) that provides the only path through which ions or polynucleotides can move from the cis to the trans chamber. Translocation of the polynucleotide through the nanopore is controlled by an enzyme (red). (B) Portion of a record showing the ionic current through a nanopore measured by a sensitive ammeter. In nanopore strand-sequencing, the stepping rate is usually 30 bases per second, but this experiment was carried out using exceptionally low concentrations of ATP to slow the helicase activity, thereby increasing the duration of each level to illustrate the resolution that can be achieved.” From: Deamer, D., Akeson, M. and Branton, D., 2016. Three decades of nanopore sequencing. Nature biotechnology, 34(5), p.518. [80]

Copyright permission granted by Springer Nature and Nature Biotechnology.

Figure 3.4: Advances in Nanopore sequencing accuracy. “Timeline of reported MinION read accuracies and Oxford Nanopore Technologies (ONT) technological developments. Nanopore chemistry updates and advances in base-caller software are represented as colored bars. The plotted accuracies are ordered on the basis of the chemistry and base-calling software used, not according to publication date.” From: Rang, F.J., Kloosterman, W.P. and de Ridder, J., 2018. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome biology, 19(1), p.90. [81] Permission granted through Creative Commons open access license.

[Diagram labels: Mutations; Errors; Amplification and sequencing; Consensus building.]

Figure 3.5: Consensus sequencing to eliminate errors. Consensus sequencing can be used to reduce errors from PCR and sequencing. Initial DNA molecules (purple) will have a given set of mutations (white boxes). Following PCR and sequencing, errors (orange boxes) will accumulate across the sequencing reads. A consensus can be generated to determine the true mutations from the original template; the mutations should be consistent across all amplicons while errors will be randomly distributed.

Figure 3.6: Overview of the LUCS technology. (A) Each individual DNA molecule in a complex mixture, bearing its own unique pattern of mutations (white), has a UMI applied to it via PCR (each UMI represented by a different end-color), which is specific for that molecule (Step 1). The pool of DNA molecules is then amplified and sequenced (Step 2), during which time artefacts (i.e., PCR errors and sequencing errors) are introduced in a random fashion across molecules (red). All reads are then clustered based on their UMI (Step 3), and a consensus read is built for each molecule (Step 4). This final step removes random errors introduced during the process (red) but retains true mutations (white) found in the original molecule and in all amplicons of that molecule. (B) Two-step PCR process for UMI application and dilution. In the first 4 cycles of PCR, the targeted DNA template is amplified by 125-bp oligonucleotide barcoding primers, each containing a random UMI sequence. The initial reaction is then diluted 25-fold within a larger PCR reaction containing only synthetic primers that amplify the UMI-containing molecules after 45 additional cycles of PCR. The resultant elimination of barcoding primer 're-priming' allows for high-resolution pair-end clustering and, in particular, the detection and removal of chimeras (artificial recombinant molecules) caused by PCR jumping. Published in: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.

[Panels A and B. Panel titles: Guppy basecalling errors; Flappie basecalling errors.]

Figure 3.7: Error profiles from different basecalling models. (A) The prevalence of insertion and deletion errors with the Guppy (upper) and Flappie (lower) basecalling models. While both have similar distributions of indel size, Guppy had significantly more indels than Flappie. (B) Heatmap of specific mutation type errors. Rows represent the reference base (Truth) and columns are the sequenced base (Observed). For all nucleotides, Flappie showed improved accuracy in calling the correct base and a decrease in all types of basecalling error.

[Panels A and B, each comparing Guppy and Flappie.]

Figure 3.8: Increases in quality and percent identity metrics. The same subset of raw Nanopore sequences was basecalled with two different models. The newer Flappie model showed a marked increase in median basecall quality (A), from 11.4 to 13.0. The increase in general quality was accompanied by an increase in the median per-read percent identity (B), from 90.6% to 93.1%.

[Panel A: Guppy. Panel B: Flappie.]

Figure 3.9: Observed versus expected homopolymer length. Homopolymer stretches of all lengths (3-9 bp) were analyzed to determine how frequently the basecallers (A, Guppy; B, Flappie) accurately determined the homopolymer length. Dashed vertical lines represent the expected homopolymer length from the reference sequence and colored dots represent the proportion of homopolymers (y-axis) of the expected length that were called as a given length (x-axis).

Figure 3.10: Support fraction distributions for polymorphic and heteroplasmic variants. Average base-called support fractions for polymorphic (blue, n = 36) and heteroplasmic (orange, n = 96) variants were 90.1% ± 1.7% and 89.2% ± 0.4%, respectively (mean ± SEM). Likewise, signal support fractions were comparable across polymorphic (93.5% ± 0.7%) and heteroplasmic (92.1% ± 0.6%) variants (mean ± SEM). Distributions are kernel density estimates of base-called and signal support fractions, as determined by nanopolish for all variants. Base-called support and signal support fraction distributions were not significantly different (P = 0.39 and P = 0.17, respectively). Published in: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.

Figure 3.11: Comparison of synonymity distributions between the LUCS and Sanger sequencing datasets. LUCS variants with support fractions above 80% and Sanger sequencing variants were comparatively analyzed and displayed similar proportional synonymity in coding regions, indicative of a low error rate. Published in: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.

Figure 3.12: Proportional mutational spectra for the LUCS and Sanger sequencing datasets. (A, B) The mutation spectrum was determined for each reference nucleotide for the LUCS (A) and Sanger sequencing (B) datasets. Each bar represents the proportion of a variant for a given reference base. For example, the A>G bar is the number of A>G mutations divided by the number of mutated positions that are adenines in the reference sequence. For cytosine, guanine and thymine positions, both LUCS and Sanger mutations exhibited a strong bias towards transitions. Adenine positions were more likely to mutate as a thymine transversion than as a transition in the Sanger dataset, which was reflected to a slightly lesser degree in the LUCS dataset. Published in: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.
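The per-reference-base normalization described in this caption can be sketched as follows. The input format, a list of (reference base, variant base) pairs with one entry per mutated position, and the counts used are hypothetical and chosen only to show the arithmetic; they are not data from this study.

```python
from collections import Counter, defaultdict


def mutation_spectrum(variants):
    """Return {ref_base: {alt_base: proportion}}, where each proportion is
    the count of ref>alt divided by all mutated positions with that ref base."""
    counts = defaultdict(Counter)
    for ref, alt in variants:
        counts[ref][alt] += 1
    return {
        ref: {alt: n / sum(alts.values()) for alt, n in alts.items()}
        for ref, alts in counts.items()
    }


# Made-up variant list: 10 mutated adenines and 10 mutated cytosines.
variants = (
    [("A", "G")] * 5 + [("A", "T")] * 4 + [("A", "C")] * 1
    + [("C", "T")] * 9 + [("C", "A")] * 1
)

spectrum = mutation_spectrum(variants)
for ref, alts in spectrum.items():
    for alt, proportion in sorted(alts.items()):
        print(f"{ref}>{alt}: {proportion:.2f}")
```

Because each reference base is normalized separately, the bars for a given base sum to one, so spectra from datasets of different sizes can be compared directly.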

Figure 3.13: Mutation rates estimated for single molecules from Sanger sequencing and LUCS datasets. Violin plots showing that mutation rates per molecule sequenced, determined by dividing the number of mutations by the coverage of a given molecule, were similar between the two technologies (P = 0.12; see text for details). Published in: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.

| Primer name | Primer sequence | Secondary barcode sequence | Primer location |
|---|---|---|---|
| NPb02H24m3092F (barcoding primer) | CCACTACTCACACACCAATTCCTCTCATTACCACGCACTACCTATTAGATGCTGATGACGCGCTHHHHHHHHHHHHHHHHHHHHHHHHCTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAA | ATGCTGATGACGCGCT | 3092 Forward (3092F) |
| NPbAdPr (synthetic primer) | CCACTACTCACACACCAATTCCTCTCATTA | N/A | 5’-end of NPb02H24m3092F |
| NPc02H24m786R (barcoding primer) | CCCACACTACAAAACCCACTCATATACACTACACTCTATCAACATACTATCATGCGAGACTATCGCGAHHHHHHHHHHHHHHHHHHHHHHHHGCCCATTTCTTCCCATTTCATTGGCTACACCTT | TGCGAGACTATCGCGA | 786 Reverse (786R) |
| NPcAdPr (synthetic primer) | CCCACACTACAAAACCCACTCATATACACT | N/A | 5’-end of NPc02H24m786R |
| 3092F | CTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAA | N/A | 3092 Forward |
| 3140F | CGGAGCTTTACGAGCCGTAGCCCAAACAAT | N/A | 3140 Forward |
| 3003R | GACTTAATGCTAGTGTGAGTGATAGGGTAGGTGCAA | N/A | 3003 Reverse |
| 3031R | GGGTGTGGTATTGGTAGGGGAACTCATAGACTTA | N/A | 3031 Reverse |

Table 3.1: LUCS PCR primers. Sequences of oligonucleotide primers, in 5’ to 3’ orientation, utilized for UMI-based barcoding and PCR amplification (H = A, C, or T). Primer location determined from the position in reference sequence AY172335.1. Published in: Annis, S., Fleischmann, Z., Logan, R., Mullin-Bernstein, Z., Franco, M., Saürich, J., Tilly, J.L., Woods, D.C. and Khrapko, K., 2020. LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules. Aging (Albany NY), 12(8), p.7603.
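Because each UMI position is the degenerate base H (A, C, or T), the number of distinct UMIs per primer follows directly from the number of H positions. The sketch below assumes 24 H positions per barcoding primer, as counted from the primer sequences in this table, and uses the forward secondary barcode and the first bases of the 3092F target sequence as flanking anchors. The function names, the exact-match regular expression, and the synthetic read are illustrative only; real reads carry errors and would need tolerant matching.

```python
import random
import re

H_BASES = "ACT"          # IUPAC code H = A, C, or T
UMI_LENGTH = 24          # assumed: number of H positions counted in the primers

FORWARD_BARCODE = "ATGCTGATGACGCGCT"   # secondary barcode from the table
FORWARD_TARGET_START = "CTCCATTC"      # first bases of the 3092F target sequence

UMI_PATTERN = re.compile(
    FORWARD_BARCODE + "([" + H_BASES + "]{" + str(UMI_LENGTH) + "})" + FORWARD_TARGET_START
)


def umi_diversity(length: int = UMI_LENGTH) -> int:
    """Number of distinct UMIs possible with three choices per position."""
    return len(H_BASES) ** length


def random_umi(rng: random.Random, length: int = UMI_LENGTH) -> str:
    """Draw one UMI uniformly at random from the H alphabet."""
    return "".join(rng.choice(H_BASES) for _ in range(length))


def extract_umi(read: str):
    """Return the UMI between the barcode and the target, or None if absent."""
    match = UMI_PATTERN.search(read)
    return match.group(1) if match else None


print(f"{umi_diversity():,} possible UMIs per primer")

rng = random.Random(0)
umi = random_umi(rng)
synthetic_read = "TTAG" + FORWARD_BARCODE + umi + FORWARD_TARGET_START + "TATGATCAGG"
print(extract_umi(synthetic_read) == umi)
```

With three options at each of 24 positions there are 3^24 (about 2.8 × 10^11) possible UMIs per primer, so two molecules in a small input are very unlikely to receive the same tag by chance.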

| Metric | Guppy mean | Guppy median | Flappie mean | Flappie median |
|---|---|---|---|---|
| Percent identity (%) | 88.9 | 90.6 | 91.4 | 93.1 |
| Read quality | 11.2 | 11.4 | 12.8 | 13.0 |
| Read length (bp) | 13,118.3 | 13,532.5 | 13,208.5 | 13,639.5 |

Table 3.2: Basecalling quality metrics. Mean and median values for the standard Guppy and Flappie basecalling models. Percent identity is the proportion of a single read that aligns accurately with the reference sequence (accession AY172335.1). Read quality is the standard quality estimate from the Nanopore software, with scores lower than 7 considered failed reads. Read length is given in bp and reflects read length after automated quality-based trimming.
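For orientation, a quality score can be translated into an approximate per-base error probability. The sketch below assumes the read quality values are Phred-scaled, so that Q = -10 log10(p); the text does not state this explicitly, and the function name is illustrative.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


# Median read qualities from the table, plus the pass/fail cutoff of 7.
for label, q in [("Fail cutoff", 7.0), ("Guppy median", 11.4), ("Flappie median", 13.0)]:
    p = phred_to_error_probability(q)
    print(f"{label}: Q{q} -> ~{p * 100:.1f}% expected error")
```

Under that assumption, the shift in median read quality from 11.4 to 13.0 corresponds to a drop in expected per-base error from roughly 7% to roughly 5%, which is the same direction and similar in size to the measured gain in percent identity.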

Citations

1. DNA Sequencing Costs: Data. October 30, 2019 [cited 2019]; Available from: https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data.

2. Moore, Cramming more components onto integrated circuits. 1965.

3. Rothberg, J., et al., An integrated semiconductor device enabling non-optical genome sequencing. Nature, 2011. 475(7356): p. 348-352.

4. Wade, N., Decoding DNA with Semiconductors, in New York Times. 2011. p. 13.

5. Crow, D., A New Wave of Genomics for All. Cell, 2019. 177(1): p. 5-7.

6. Church, G.M. List of Personal Genome Sequencing and Interpretation Services. November 15, 2018; Available from: http://arep.med.harvard.edu/gmc/genome_services.html.

7. Sanger, F., et al., Nucleotide sequence of bacteriophage φX174 DNA. Nature, 1977. 265(5596): p. 687-695.

8. Sanger, F. and A.R. Coulson, A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology, 1975. 94(3): p. 441-448.

9. Shendure, J., et al., DNA sequencing at 40: past, present and future. Nature, 2017. 550(7676): p. 345-353.

10. Kircher, M. and J. Kelso, High-throughput DNA sequencing--concepts and limitations. Bioessays, 2010. 32(6): p. 524-36.

11. Shendure, J. and H. Ji, Next-generation DNA sequencing. Nature biotechnology, 2008. 26(10): p. 1135-1145.

12. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80.

13. Shendure, J., et al., Accurate multiplex polony sequencing of an evolved bacterial genome. Science, 2005. 309(5741): p. 1728-32.

14. Kulski, J.K., Next-Generation Sequencing — An Overview of the History, Tools, and “Omic” Applications, in Next Generation Sequencing - Advances, Applications and Challenges. 2016.

15. Schatz, M.C., A.L. Delcher, and S.L. Salzberg, Assembly of large genomes using second-generation sequencing. Genome Res, 2010. 20(9): p. 1165-73.

16. Zerbino, D.R. and E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 2008. 18(5): p. 821-9.

17. Satoh, M., Organization of multiple nucleoids and DNA molecules in mitochondria of a human cell. Experimental Cell Research, 1991. 196(1): p. 137-140.

18. MacDonald, J.A., et al., A nanoscale, multi-parametric flow cytometry-based platform to study mitochondrial heterogeneity and mitochondrial DNA dynamics. Commun Biol, 2019. 2: p. 258.

19. Miller, F.J., et al., Precise determination of mitochondrial DNA copy number in human skeletal and cardiac muscle by a PCR-based assay: lack of change of copy number with age. Nucleic Acids Res, 2003. 31(11): p. e61.

20. Reznik, E., et al., Mitochondrial DNA copy number variation across human cancers. eLife, 2016. 5: p. e10769.

21. Allio, R., et al., Large Variation in the Ratio of Mitochondrial to Nuclear Mutation Rate across Animals: Implications for Genetic Diversity and the Use of Mitochondrial DNA as a Molecular Marker. Mol Biol Evol, 2017. 34(11): p. 2762-2772.

22. Brown, W.M., M. George, Jr., and A.C. Wilson, Rapid evolution of animal mitochondrial DNA. Proc Natl Acad Sci U S A, 1979. 76(4): p. 1967-71.

23. Chan, D.C., Fusion and fission: interlinked processes critical for mitochondrial health. Annual review of genetics, 2012. 46.

24. Bogenhagen, D. and D.A. Clayton, Mouse L cell mitochondrial DNA molecules are selected randomly for replication throughout the cell cycle. Cell, 1977. 11(4): p. 719-727.

25. del Mar González, M., et al., Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing. Mitochondrion, 2020. 50: p. 88-93.

26. Nekhaeva, E., et al., Clonally expanded mtDNA point mutations are abundant in individual cells of human tissues. Proceedings of the National Academy of Sciences, 2002. 99(8): p. 5521-5526.

27. Saunders, J.E., C.C. Beeson, and R.G. Schnellmann, Characterization of functionally distinct mitochondrial subpopulations. J Bioenerg Biomembr, 2013. 45(1-2): p. 87-99.

28. Kuznetsov, A.V., et al., Mitochondrial subpopulations and heterogeneity revealed by confocal imaging: possible physiological role? Biochim Biophys Acta, 2006. 1757(5-6): p. 686-91.

29. van den Boogaart, P., J. Samallo, and E. Agsteribbe, Similar genes for a mitochondrial ATPase subunit in the nuclear and mitochondrial genomes of Neurospora crassa. Nature, 1982. 298(5870): p. 187-9.

30. Hu, G. and W.G. Thilly, Evolutionary trail of the mitochondrial genome as based on human 16S rDNA pseudogenes. Gene, 1994. 147(2): p. 197-204.

31. Gunbin, K., et al., Integration of mtDNA pseudogenes into the nuclear genome coincides with speciation of the human genus. A hypothesis. Mitochondrion, 2017. 34: p. 20-23.

32. Li, M., et al., Fidelity of capture-enrichment for mtDNA genome sequencing: influence of NUMTs. Nucleic Acids Res, 2012. 40(18): p. e137.

33. Eid, J., et al., Real-time DNA sequencing from single polymerase molecules. Science, 2009. 323(5910): p. 133-8.

34. Zhu, P. and H.G. Craighead, Zero-mode waveguides for single-molecule analysis. Annu Rev Biophys, 2012. 41: p. 269-93.

35. Koren, S., et al., Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol, 2012. 30(7): p. 693-700.

36. Carneiro, M.O., et al., Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics, 2012. 13.

37. Chin, C.S., et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods, 2013. 10(6): p. 563-9.

38. Nowoshilow, S., et al., The axolotl genome and the evolution of key tissue formation regulators. Nature, 2018. 554(7690): p. 50-55.

39. Bashir, A., et al., A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol, 2012. 30(7): p. 701-707.

40. Utturkar, S.M., et al., Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences. Bioinformatics, 2014. 30(19): p. 2709-16.

41. Wallberg, A., et al., A hybrid de novo genome assembly of the honeybee, Apis mellifera, with chromosome-length scaffolds. BMC Genomics, 2019. 20(1): p. 275.

42. Travers, K.J., et al., A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res, 2010. 38(15): p. e159.

43. Wenger, A.M., et al., Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol, 2019. 37(10): p. 1155-1162.

44. Wanrooij, S., et al., Twinkle and POLG defects enhance age‐dependent accumulation of mutations in the control region of mtDNA. Nucleic acids research, 2004. 32(10): p. 3053-3064.

45. Kasianowicz, J.J., et al., Characterization of individual polynucleotide molecules using a membrane channel. Proc Natl Acad Sci U S A, 1996. 93(24): p. 13770-3.

46. Ashkenasy, N., et al., Recognizing a single base in an individual DNA strand: a step toward DNA sequencing in nanopores. Angew Chem Int Ed Engl, 2005. 44(9): p. 1401-4.

47. Meller, A., Dynamics of polynucleotide transport through nanometre-scale pores. Journal of physics: condensed matter, 2003. 15(17): p. R581.

48. Lieberman, K.R., et al., Processive replication of single DNA molecules in a nanopore catalyzed by phi29 DNA polymerase. J Am Chem Soc, 2010. 132(50): p. 17961-72.

49. Stoddart, D., et al., Single-nucleotide discrimination in immobilized DNA oligonucleotides with a biological nanopore. Proc Natl Acad Sci U S A, 2009. 106(19): p. 7702-7.

50. Quick, J., et al., Real-time, portable genome sequencing for Ebola surveillance. Nature, 2016. 530(7589): p. 228-232.

51. Faria, N.R., et al., Mobile real-time surveillance of Zika virus in Brazil. Genome Med, 2016. 8(1): p. 97.

52. Wang, J., et al., MinION nanopore sequencing of an influenza genome. Front Microbiol, 2015. 6: p. 766.

53. Zhu, N., et al., A Novel Coronavirus from Patients with Pneumonia in China, 2019. N Engl J Med, 2020. 382(8): p. 727-733.

54. Rhoads, A. and K.F. Au, PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics, 2015. 13(5): p. 278-89.

55. Payne, A., et al., Whale watching with BulkVis: A graphical viewer for Oxford Nanopore bulk fast5 files. BioRxiv, 2018.

56. Koren, S., et al., Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res, 2017. 27(5): p. 722-736.

57. Jain, M., et al., Improved data analysis for the MinION nanopore sequencer. Nat Methods, 2015. 12(4): p. 351-6.

58. Tyler, A.D., et al., Evaluation of Oxford Nanopore's MinION Sequencing Device for Microbial Whole Genome Sequencing Applications. Sci Rep, 2018. 8(1): p. 10931.

59. Goodwin, S., et al., Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res, 2015. 25(11): p. 1750-6.

60. Shafin, K., et al., Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit. BioRxiv, 2019: p. 715722.

61. Malmberg, M., et al., Assessment of low-coverage nanopore long read sequencing for SNP genotyping in doubled haploid canola (Brassica napus L.). Scientific reports, 2019. 9(1): p. 1-12.

62. Teng, H., et al., Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning. Gigascience, 2018. 7(5).

63. Wick, R.R., L.M. Judd, and K.E. Holt, Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol, 2019. 20(1): p. 129.

64. Boža, V., B. Brejová, and T. Vinař, DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads. PloS one, 2017. 12(6).

65. Silvestre-Ryan, J. and I. Holmes, Pair consensus decoding improves accuracy of neural network basecallers for nanopore sequencing. BioRxiv, 2020.

66. R10.3: the newest nanopore for high accuracy nanopore sequencing – now available in store. 2020, Oxford Nanopore Technologies: nanoporetech.com.

67. Karst, S.M., et al., 2020.

68. Li, H., et al., The sequence alignment/map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-2079.

69. De Coster, W., et al., NanoPack: visualizing and processing long-read sequencing data. Bioinformatics, 2018. 34(15): p. 2666-2669.

70. counterr. 2018, Day Zero Diagnostics, Inc.

71. Sarkozy, P., Á. Jobbágy, and P. Antal, Calling Homopolymer Stretches from Raw Nanopore Reads by Analyzing k-mer Dwell Times, in EMBEC & NBC 2017. 2018. p. 241-244.

72. Cherf, G.M., et al., Automated forward and reverse ratcheting of DNA in a nanopore at 5-Å precision. Nat Biotechnol, 2012. 30(4): p. 344-8.

73. Shinde, D., et al., Taq DNA polymerase slippage mutation rates measured by PCR and quasi-likelihood analysis: (CA/GT)n and (A/T)n microsatellites. Nucleic Acids Res, 2003. 31(3): p. 974-80.

74. Fazekas, A., R. Steeves, and S. Newmaster, Improving sequencing quality from PCR products containing long mononucleotide repeats. Biotechniques, 2010. 48(4): p. 277-85.

75. Ivady, G., et al., Analytical parameters and validation of homopolymer detection in a pyrosequencing-based next generation sequencing system. BMC Genomics, 2018. 19(1): p. 158.

76. Weirather, J.L., et al., Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res, 2017. 6: p. 100.

77. Loman, N.J., et al., Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol, 2012. 30(5): p. 434-9.

78. O'Rawe, J., et al., Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome medicine, 2013. 5(3): p. 28.

79. Hwang, K.-B., et al., Comparative analysis of whole-genome sequencing pipelines to minimize false negative findings. Scientific reports, 2019. 9(1): p. 1-10.

80. Deamer, D., M. Akeson, and D. Branton, Three decades of nanopore sequencing. Nat Biotechnol, 2016. 34(5): p. 518-24.

81. Rang, F.J., W.P. Kloosterman, and J. de Ridder, From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol, 2018. 19(1): p. 90.
