Interpreting Human Genetic Variation through Genomics and In Vivo Models

by

Stephan George Frangakis

Department of Cell Biology Duke University

Date:______Approved:

______Nicholas Katsanis, Supervisor

______Blanche Capel, Chair

______Vann Bennett

______Michael Hauser

______Christopher Kontos

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Cell Biology in the Graduate School of Duke University

2015

ABSTRACT

Interpreting Human Genetic Variation through Genomics and In Vivo Models

by

Stephan George Frangakis

Department of Cell Biology Duke University

Date:______Approved:

______Nicholas Katsanis, Supervisor

______Blanche Capel, Chair

______Vann Bennett

______Michael Hauser

______Christopher Kontos

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Cell Biology in the Graduate School of Duke University

2015

Copyright by Stephan George Frangakis 2015

Abstract

Improvements in genomic technology, both in the increased speed and reduced cost of sequencing, have expanded the appreciation of the abundance of human genetic variation. However, the sheer amount of variation, as well as the varying type and genomic content of variation, poses a challenge in understanding the clinical consequence of a single . This work uses several methodologies to interpret the observed variation in the , and presents novel strategies for the prediction of allele pathogenicity.

Using the zebrafish model system as an in vivo assay of allele function, we identified a novel driver of Bardet-Biedl Syndrome (BBS) in CEP76. A combination of targeted sequencing of 785 cilia-associated in a cohort of BBS patients and subsequent in vivo functional assays recapitulating the human phenotype gave strong evidence for the role of CEP76 in the pathology of an affected family. This portion of the work demonstrated the necessity of functional testing in validating disease-associated mutations, and added to the catalogue of known BBS disease genes.

Further study into the role of copy-number variations (CNVs) in a cohort of BBS patients showed the significant contribution of CNVs to disease pathology. Using high- density array comparative genomic hybridization (aCGH) we were able to identify pathogenic CNVs as small as several hundred bp. Dissection of constituent and in

iv

vivo experiments investigating epistatic interactions between affected genes allowed for an appreciation of several paradigms by which CNVs can contribute to disease. This study revealed that the contribution of CNVs to disease in BBS patients is much higher than previously expected, and demonstrated the necessity of consideration of CNV contribution in future (and retrospective) investigations of human genetic disease.

Finally, we used a combination of comparative genomics and in vivo complementation assays to identify second-site compensatory modification of pathogenic alleles. These pathogenic alleles, which are found compensated in other species (termed compensated pathogenic deviations [CPDs]), represent a significant fraction (from 3 – 10%) of human disease-associated alleles. In silico pathogenicity

prediction algorithms, a valuable method of allele prioritization, often misrepresent

these alleles as benign, leading to omission of possibly informative variants in studies of

human genetic disease. We created a mathematical model that was able to predict CPDs

and putative compensatory sites, and functionally showed in vivo that second-site

mutation can mitigate the pathogenicity of disease alleles. Additionally, we made

publically available an in silico module for the prediction of CPDs and modifier sites.

These studies have advanced the ability to interpret the pathogenicity of multiple

types of human variation, as well as made available tools for others to do so as well.

v

Dedication

I would like to dedicate this work to my family. To my parents, Crist and Maria, for nurturing a curious boy that never stopped asking “Why?” To my brothers, Jon and

Gabe, for keeping me grounded and aware of what matters most in my life. To my amazing wife Jen, who is my best friend and partner and given me happiness greater than I knew I could have. And finally to my two newborn children, Christopher and

Charlotte, who have shown me a love within myself I didn’t know existed; I can’t wait to watch you grow up.

vi

Contents

Abstract ...... iv

List of Tables ...... xii

List of Figures ...... xiii

Acknowledgements ...... xv

Chapter 1 – Introduction ...... 1

1.1 Challenges and Opportunities for Human Genetics ...... 1

1.1.1 Novel Gene Discovery ...... 2

1.1.2 Copy Number Variation Contribution ...... 6

1.1.3 Abundance of Variants of Unknown Significance...... 8

1.2 as a Model Disorder ...... 15

1.2.1 Function of Primary in Development, Homeostasis, and Signaling ..... 15

1.2.2 The Ciliopathies as a Model of Human Genetic Disease ...... 17

1.3 Animal Models of Human Genetic Disease ...... 21

1.3.1 Zebrafish as a Model Organism ...... 26

1.3.2 Zebrafish Complementation Studies ...... 27

1.4 Initial Screenings: Forward Genetics in Zebrafish ...... 28

1.5 Reverse Genetics: From Candidate Gene to Animal Model ...... 30

1.5.1 Antisense Morpholino Suppression of Endogenous Genes ...... 31

1.5.2 Genome Editing and Transgenic Lines ...... 32

1.6 Humanizing Zebrafish to Evaluate Human Mutations ...... 35

vii

1.6.1 Recessive Disease ...... 35

1.6.2 Dominant Disease...... 40

1.6.3 De Novo Variants ...... 41

1.6.4 Copy Number Variations ...... 42

1.6.5 Second-Site Modifiers ...... 43

1.7 Implications ...... 44

1.8 Chapter Outlines...... 46

Chapter 2 – Methods ...... 48

2.1 Research Participants for CNV Analysis ...... 48

2.2 Whole Genome Amplification ...... 48

2.3 Oligonucleotide aCGH analyses ...... 49

2.4 Junction Sequence Analyses...... 49

2.5 qPCR Confirmation ...... 49

2.6 Resequencing of BBS Genes Identified in CNV Analyses ...... 50

2.7 CNV Genes Suppression, Rescue, and Genetic Interaction Experiments ...... 50

2.8 Datasets of Known Benign and Pathogenic Variants ...... 51

2.9 Comparative Genomics Screen for CPDs ...... 52

2.10 Statistical Models of Variant Density ...... 53

2.11 Fitting Observed Density to Statistical Models ...... 54

2.12 CPD Prediction Method ...... 55

2.13 DM048 Whole Exome Sequencing ...... 55

2.14 , rpgrip1L, btg2, and nos2 Morpholino Design ...... 56

viii

2.15 Site-Directed Mutagenesis ...... 56

2.16 bbs4, , btg2, and nos2 mRNA Synthesis and Zebrafish Embryo Injection ...... 56

2.17 Classification and Scoring of Embryos ...... 57

Chapter 3 – Identification of a Novel Driver of Bardet–Biedl Syndrome ...... 59

3.1 Introduction ...... 59

3.2 Sequencing the Ciliary Proteome in a Cohort ...... 60

3.3 Mutational Load in Bardet-Biedl Syndrome...... 61

3.4 Dissection of the 25 Patients without Molecular Genetic Diagnoses for the Identification of Novel BBS Candidate Genes ...... 67

3.5 Sanger Confirmation of Novel Candidate Genes ...... 70

3.6 Early Gastrulation Defects in CEP76 and E2F4 dysfunction ...... 72

3.7 In Vivo Retinal Assay of CEP76 Dysfunction ...... 76

3.8 Renal Defects in CEP76 Morphants ...... 82

3.9 Discussion ...... 86

Chapter 4 – Copy Number Variation Contributes to the Mutational Load of Bardet–Biedl Syndrome ...... 88

4.1 Introduction ...... 89

4.2 Results ...... 92

4.2.1 Identification of Intragenic CNVs in BBS ...... 92

4.2.2 Breakpoint Analysis Shows Multiple Mechanisms of CNV Formation ...... 97

4.2.3 Enrichment of Ciliary Gene CNVs in BBS Cases ...... 99

4.2.4 BBS Gene CNVs Contribute Recessive Alleles ...... 100

ix

4.2.5 IFT74, a Novel BBS Candidate Gene...... 104

4.2.6 Extensive Oligogenic Load Among Individuals Harboring BBS CNVs ...... 109

4.2.7 In Vivo Analysis of Oligogenic BBS Loci ...... 114

4.3 Discussion ...... 117

Chapter 5- Identification of cis-suppression of human disease mutations by comparative genomics ...... 121

5.1 Introduction ...... 122

5.2 Prevalence of CPDs Among Pathogenic Variants ...... 123

5.3 Modeling Compensatory Interaction ...... 128

5.4 In Vivo Modeling of CPDs and Compensatory Sites ...... 133

5.5 Implications ...... 142

Chapter 6 – Discussion ...... 144

6.1 Interpreting Human Genetic Variation ...... 144

6.2 Use of Targeted Sequencing ...... 146

6.3 Functional Testing to Interpret Genetic Variation ...... 148

6.4 Contribution of CNVs to Human Disease ...... 151

6.5 Implications of CPDs to Evolution and Interpreting Genetic Variation . 157

6.5.1 Comparative Genomics and False Negatives ...... 158

6.5.2 Compensation of Pathogenic Alleles ...... 160

6.5.3 Sporadic Evidence of Compensation in Humans ...... 162

6.5.4 Mechanism of CPD Fixation ...... 164

Appendix A...... 168

x

References ...... 175

Biography ...... 201

xi

List of Tables

Table 1: Summary of In Silico Prediction Algorithms ...... 14

Table 2: Examples of Genes Associated with Ciliopathies ...... 20

Table 3: General Attribute Similarities of Laboratory Organisms Used to Model Human Genetic Disease ...... 22

Table 4: Anatomical Comparisons Between Zebrafish and Humans ...... 23

Table 5: Patients with Recessive Changes in Known BBS Genes ...... 63

Table 6: Recessive Changes in Candidate BBS Loci ...... 68

Table 7: Rhodopsin Staining ...... 77

Table 8: Average Cilia Count and Length ...... 77

Table 9: 5dpf Retinal TUNEL Staining and Quantification ...... 78

Table 10: Mutational Findings of 17 Individuals Affected with BBS Harboring CNVs Encompassing BBS genes ...... 94

Table 11: Breakpoint Features of 18 CNVs Detected in 17 BBS Cases ...... 95

Table 12: Clinical Phenotypes of BBS Cases with CNVs ...... 96

Table 13: Embryo Counts for In Vivo Complementation Assays of Novel BBS Missense Variants ...... 108

Table 14: Embryo Counts for In Vivo Genetic Interaction Experiments ...... 116

Table 15: Range of Estimates for Prevalence of CPDs in Human Disease ...... 127

Table 16: Occurance of CPDs Among Independent Pathogenic Lists ...... 128

xii

List of Figures

Figure 1: Gene Identification Strategies Through Multiple Inheritance Models ...... 5

Figure 2: Probability of Mutation Causing Disease Increases with Increased Conservation ...... 12

Figure 3: Tolerability Profile Based on Site Conservation ...... 13

Figure 4: Zebrafish Models of Human Genetic Disease ...... 37

Figure 5: Candidate Gene Mutation Segregation ...... 71

Figure 6: cep76 and e2f4 Gastrulation Assay ...... 75

Figure 7: Immunostaining of Retinal Cryosections for Rhodopsin and Cilia of cep76 Morphant and Rescue embryos ...... 79

Figure 8: TUNEL Staining of Retinas from 5dpf cep76 Morphant and Rescue embryos .. 81

Figure 9: Na+/K+ ATPase Staining of 5dpf cep76 Morphant Embryos ...... 84

Figure 10: Quantification of Tubule Convolution ...... 85

Figure 11: Segregation of BBS Gene CNVs Assigns a Primary Causal to 10 Pedigrees ...... 103

Figure 12: IFT74 is a Novel Candidate BBS Locus...... 106

Figure 13: Segregation and In Vivo Analysis of Oligogenic BBS Loci Demonstrate Additive and Epistatic Effects ...... 111

Figure 14: Genetic Architecture of BBS Gene Variation in 17 BBS Cases Harboring CNVs ...... 113

Figure 15: Distribution of Variants Found in Sequence Alignments ...... 126

Figure 16: Relationship Between Variants and Evolutionary Distance ...... 132

Figure 17: Compensatory Mutations Rescue Pathogenic Alleles in BBS4 and RPGRIP1L ...... 135

xiii

Figure 18: Protein Domain Structure of Functionally Tested Human Disease Genes .... 136

Figure 19: A De Novo BTG2 p.V141M Encoding Allele Causes Microcephaly ...... 140

Figure 20: Evaluation of Post-Mitotic Neurons in 2 dpf Zebrafish Embryos Confirms the Pathogenicity of BTG2 V141M ...... 141

Figure 21: Deletions are able to unmask normally recessive alleles ...... 156

Figure 22: Androgen Receptor G491S Mutation is Conserved in Rodents ...... 160

Figure 23: Fitness Landscape of Individual Compensated Genotypes ...... 165

xiv

Acknowledgements

There are a great many people to whom I owe so much, it would probably be best to go chronologically. My parents challenged me from early on, pushing me through my morose teenage years to develop the independence and confidence I found in college. They nurtured my evolving love for science, but also encouraged me to explore all that an education at Rutgers had to offer. They helped me through the worst times in my life, personally and professionally, and without their unconditional support

I would not be completing this PhD.

I also owe a great deal of thanks to the many people in the CHDM. My many colleagues there, from fellow graduate students to postdocs to technicians, have been a great source of advice, personally, professionally, and scientifically. Maria, Perciliz,

Phoebe, and Ed were there the day I joined the lab and have helped me navigate the oftentimes confusing maze that is graduate school and research. I also have to thank

Erica and Christelle, who as faculty members have worked hard and helped my research tremendously.

Of course, I owe much to my mentor, Nico. Nico took a chance on an orphan graduate student, gave me funding and a place in his lab, but most importantly gave me purpose. He helped reinvigorate my love of science, and has nurtured and helped develop me into who I am today. He is a great scientist, and it is a privilege and pleasure

xv

to talk with him not just about science, but all things that give joy to life. I can never repay him for the chance he took on me, the time he spent, and the research we did together.

Finally, the best things in my life I owe to Jen. She brought our children into the world (with a little help from me) and has given me a future to look forward to. She has been a constant support through the many rough patches of graduate school, as well as my biggest champion during the successes. She gives me a reason to work hard, as well as a reason not to work too hard, and gives me a kiss and hug every day just for coming home. She is simply the best thing that has ever happened to me

xvi

Chapter 1 – Introduction

This introduction is adapted and expanded from Davis et al (2013), a review published in Biochimica et Biophysica Acta [1]

1.1 Challenges and Opportunities for Human Genetics

Major advances in genomic technologies have always been accompanied by

accelerated discovery of lesions associated with human pathologies. The development of

the first karyotype led rapidly to the discovery of syndromes of polyploidy [2], while the

then nascent technologies of genome mapping, cloning and sequencing yielded early

insights into rare disease pathogenesis [3]. As the field has progressed, whole exome and

whole genome sequencing (WES/WGS) have hyper-accelerated disease gene discovery

both in historical cohorts and in the real-time clinical setting [4]. Additionally, molecular

cytogenetics at the sub-Mb and ultimately kb-level resolution revealed the high contribution of copy number variants to both rare and common human genetic disorders [5]. Recently, it has become apparent that numerous human disease-causing alleles are conserved in other species, further complicating the interpretation of results of genomic analyses [6].

The goal of this work is to show how genomic tools and in vivo models can be used to interrogate the contribution of multiple types of genetic lesions in the development of human disease.

1

1.1.1 Novel Gene Discovery

Roughly half of the >5000 rare Mendelian diseases do not have known causal disease genes [7], and an increasing number of more common disorders (such as autism,

schizophrenia, and intellectual disability) that have been previously thought of as

complex, multifactorial diseases have been shown in several instances to be collections

of several monogenic disorders [8-11]. The identification of the causal genes driving human disease phenotypes paves the way for a further understanding of pathophysiology and is a foundation for the development of clinical therapies.

Previously, putative causal genes were identified through Sanger sequencing of potential candidates. These candidates were selected through similarity with genes already known to be associated with disease, relevance of predicted protein function, or localization to specific genomic regions via positional mapping [7]. The development of next-generation sequencing (NGS), and the resultant drop in per-base sequencing costs, allowed for the entirety of the genomic content (WGS) or protein-coding regions (WES) to be analyzed [12].

Though WES does not currently asses for the impact of non-coding alleles, it is a validated approach for the discovery of rare alleles contributing to Mendelian phenotypes [13]. Most alleles known to causes Mendelian disease have been shown to be located in protein-coding regions [14], and most mis/nonsense SNVs or exonic insertion-

2

deletions are predicted to have pathogenic consequenes [15]. However, though the exome is a portion of the genome that is highly enriched for variants of significant clinical importance, there remains the challenge of properly determining which set of targets best comprises the exome. There are several approaches to addressing this challenge; one is to focus only on a subset of high-confidence genes, such as those identified by the Consensus CDS project [16]. Alternately, the entirety of the RefSeq collection [17] as well as a large number of hypothetical genes can be targeted. A third approach is to narrow the scope of targeted exons to just those in genes of interest. This has the advantages of greatly reducing costs and increasing site coverage, but will miss variation in any gene not specifically targeted.

A remaining challenge is the identification of disease-associated alleles among the large amount of variants identified. A single sequenced exome can have between

20,000 and 50,000 identified variants [7,13]. Typically greater than 95% of these variants are known polymorphisms.

Identification of disease genes and alleles is a multi-step process. The list of candidate disease-associated alleles obtained from sequencing can be culled through various inheritance models (Figure 1). Those that are present de novo, as a two-hit

variant, or in multiple affected individuals represent the best candidates for further

investigation [7]. Alleles segregating with disease in large, multi-generational pedigrees

and/or present in multiple affected families can be determined to be disease-associated

3

with reasonable conviction. However, the best gauge for pathogenicity remains in vitro or in vivo functional study of gene variants and their protein products. The use of such assays allows for conclusions regarding the pathogenicity of specific variants to be made.

Recent concerns over the ability of in vitro assays to fully recapitulate the endogenous function of alleles have raised in vivo functional testing as the gold standard for determining allele pathogenicity [18]. Knockout or knockdown of endogenous followed by complementation with an allele of interest allows for analysis of the degree of pathogenicity as well as direction of effect of an allele [19]. This strategy of disease gene discovery will be detailed in Chapter 3 with the identification of CEP76 as a novel driver of Bardet-Biedl Syndrome (BBS).

4

Figure 1: Gene Identification Strategies Through Multiple Inheritance Models

Pedigrees indicate the inheritance model loosely underlying the strategy; filled symbols represent affected individuals, empty symbols represent presumably healthy individuals, and carriers are depicted by a symbol with a dot. Circles below each pedigree symbolize sets of genetic variants identified in the exomes. Colored sections within the circles represent the portion of the exomes in which the causal mutation falls.

A) Homozygosity-based strategy, B) two-hit based strategy, C) overlap strategy, D) de novo strategy

5

Adapted with permission from [7].

1.1.2 Copy Number Variation Contribution

Several areas of the genome contain architectural features that are prone to DNA rearrangements and can lead to disease [20,21]. Though usually caused by random de novo rearrangements, these genomic structural variations have been shown to also be inherited and present at varying incidence subject to specific populations [22].

The advancement of next-generation sequencing has greatly increased the rate of

discovery of structural variations in the genome. Paired-end sequencing (mapping pairs

of sequence reads to a reference genome) was recently able to identify more than 1000

genomic structural variation at a resolution which was, at the time, unparalleled [23].

The development and expanded capabilities of array comparative genomic

hybridization (aCGH) has further allowed for the identification of genomic deletions,

duplications/triplications, insertions, and translocations on a genome-wide scale [24].

These types of genomic events have been termed copy number variations (CNVs).

In the past two decades CNVs have emerged as major contributors to the genetic

burden of both rare and common disorders [25]. Genes contained either partially or

completely within a CNV region undergo a change in genomic “dosage”. This is

important for both deletions and duplications as several genes have been shown to

demonstrate haploinsufficiency [26] or pathology with increased dosage [27]. In fact,

6

several CNVs have shown reciprocal phenotypes upon deletion and duplication [5].

Further, deletion of genes causing recessive disease can contribute to a carrier state for an individual, as well as unmask recessive disease-causing alleles [28]

In addition to alterations in gene dosage, CNVs can contribute to pathology through the creation of chimeric genes, where two genomic segments containing part of independent genes combine to form a novel protein [29]. Finally, CNVs can also disrupt regulatory elements; the resulting changes in may be less apparent than deletions/duplications and not affect protein function [30].

Traditional Sanger-sequencing based modalities are not robust enough to detect

most CNVs, as normal gene transcripts will be amplified for both heterozygous

deletions and duplications. Low-copy repeats (LCRs), characterized as DNA fragments

over 1 kb and 90% sequence identity, can induce genomic rearrangements. Breakpoints

for CNVs are therefore often repeated segments that are difficult to sequence [31]. Due

to the pathogenic potential of CNVs, yet their difficulty in identification through traditional sequencing methods, evaluating their contribution to common and rare disease presents a considerable challenge to the interpretation of human genetic variation.

7

1.1.3 Abundance of Variants of Unknown Significance

Amid the excitement of advancements in technology and rapid rates of gene discovery, the realization has also emerged that each individual human genome is burdened with a large number of rare and ultra-rare alleles. Studies have reported a

median of 50–150 nonsense mutations in the average human exome, several in homozygosity [32]. The number of rare and ultra-rare SNVs has continued to increase proportionately to the number of available exomes and genomes [33], indicating that we are unlikely to reach saturation of such alleles soon.

Yet despite, or perhaps because of, the large number of novel and rare variants that are generated through multiple sequencing modalities, the determination of which select few of these thousands of variants are indeed causal for disease remains a challenge [34]. Though most of these variations are common, each individual genome may contain up to several thousand rare and private (limited to a single individual or family) variations [32]. These observations have generated a significant interpretive problem for disease gene discovery and for clinical genomics, as population-based arguments alone have been unable to dissect the contribution of the majority of these alleles to clinical phenotypes.

Experimental approaches for demonstrating allele pathogenicity have therefore been used to support causality of genetic variants, given a limited number of candidate alleles. In vivo and in vitro assays evaluating the function of proteins containing

8

variations found in human cases can lend great insight into the functional effects of specific variants. While some groups have had success studying a large number of protein variants simultaneously [35], most in vitro and in vivo functional assays are limited to a smaller, more manageable number of candidate alleles. Because whole- exome or targeted sequencing routinely yields a large number of potential causal variants, prioritization of variants based on predicted pathogenicity has become a convenient tool to reduce the number of candidate alleles [36,37].

Predicting the pathogenicity of variants is challenging, due to the multitude of ways a single amino acid change can affect the function of an entire protein. Ignoring nonsense changes (for which the vast majority are pathogenic, and are straightforwardly predicted as such), amino acid substitutions can alter sites critical to protein function

(such as catalytic or binding domains), induce aberrant folding, or alter the stability of the protein [38]. However most in silico prediction methods do not consider many of these confounding variables, but instead employ protein evolution as a functional test of variation. These models are based upon assumptions of natural selection of protein variants; namely that mutations affecting protein function, and thereby the fitness of an organism, will be selected for or against depending on the direction of effect [34].

Mutations in regions which are highly conserved throughout evolution are more likely to be pathogenic, whereas those at more variable sites (or conversion to an allele that is present in other species) are more likely to be benign (Figure 2). Multiple species

9

alignments allow for most online in silico prediction algorithms to evaluate the conservation at each site in a protein, and further varying analyses determine the divergence of inputted variants from the profile of allowed amino acids (Figure 3, Table

1). Hundreds of potential causal variants can be quickly analyzed and prioritized based on the predicted likelihood that they disrupt conserved sites or protein domains, and therefore the likelihood that they are pathogenic.

More recent aggregative in silico algorithms utilize several combined methodologies to identify pathogenic alleles and prioritize analysis. The Variant

Annotation, Analysis, and Search Tool (VAAST) is one such tool, and employs an aggregative variant association test that merges residue substitution information with allele frequency [39]. This allows for a better analysis of variants that fall in regions without strong homology between species, as algorithms such as SIFT and PolyPhen must have an adequate multiple species alignment for effective prediction. In regions with high inter-species homology, aggregative tools can use existing algorithms (i.e.

SIFT and PolyPhen) to improve their accuracy. As such, VAAST has been specifically designed to identify disease-associated alleles within the large scale output of next-gen sequencing data [40], including high-throughput identification in pedigrees with an additional module (pVAAST, [41]).

A second example of aggregative in silico prediction integrates multiple annotation and estimation into a single measure for each variant. Combined Annotation-

10

Dependent Depletion (CADD) correlates its scores with allelic diversity (similar to other methods) as well as functional annotations, disease severity, complex trait associations, and regulatory effects [42]. The algorithm contrasts the annotations of inputted variants with a catalog of 14.7 million simulated variants; those that reduce fitness are reduced by natural selection, but not in the simulated variant model. It also has the potential advantage or relating the phenotypic consequence of a single variant to the genomic context in which it appears [42].

While in silico prediction algorithms are invaluable for the evaluation of allele pathogenicity, they should not in themselves be employed as direct evidence of pathogenicity. One recent study of ENCODE targets found only a modest correlation between the constraint estimates (predicted conservation and intolerance to variance) of sites and in vitro functional analyses of variants at these sites [43]. And while several human mutations are predicted to be benign due to presence in other species (and thus predicted tolerability of a specific residue at a specific site), many of these variants have strong evidence of pathogenicity through both genetic and functional analyses [44,45].

These examples raise the possibility that the genomic context of an allele is important to its function, either compensating for or contributing to pathogenicity.

11

Figure 2: Probability of Mutation Causing Disease Increases with Increased Conservation

The probability that a mutation will cause a disease increases monotonically with an increase in the degree of site conservation. An increase in the degree of evolutionary conservation increases the probability of deleterious mutations and decreases the probability of nonsynonymous benign SNPs. The benign synonymous SNPs do not

12

change amino acids and should be predominantly neutral. As a result, their probability is uniform across sites, regardless of whether or not the site is conserved.

Reproduced with permission from [46]

Figure 3: Tolerability Profile Based on Site Conservation

A map of site conservation can be created through alignment of the orthologous protein sequences from multiple species. Sites that are highly conserved throughout all species are assumed to be intolerant to variation. Here, multiple alignment of the RPLP0 protein creates a map highly conserved sites and the amino acids which are tolerated at each site. Variations away from allowed residues are more likely to be pathogenic.

13

Table 1: Summary of In Silico Prediction Algorithms

Training Conservation Structural Method Based on Annotation Set Analysis Attributes HGMD, SIFT, PFam, Predicted MutPred RF - Swiss-Prot PSI-BLAST attributes Homolog nsSNPAnalyzer RF Swiss-Prot SIFT - mapping Alignment Panther Panther - - - Scores library, HMMs Sequence environment, PhD-SNP SVM Swiss-Prot - - sequence profiles Homolog Empirical PolyPhen - PSIC profiles mapping/ Swiss-Prot Rules predictions Swiss- Prot, Homolog Bayesian Pfam PolyPhen2 Neutral PSIC profiles mapping/ classification domain pseudo- predictions mutations Alignment SIFT - MSAs - - Scores PMD, PSIC profiles, Neutral SNAP NN Pfam, PSI- Predictions - pseudo- BLAST mutations Sequence environment, SNPs&GO SVM Swiss-Prot sequence - GO profiles, Panther GO, ; HGMD, Human Gene Mutation Database; HMM, Hidden Markov model; NN, neural network; MSA, multiple sequence alignment; PMD, Protein Mutant Database; PSIC, position-specific independent counts; RF, random forest; SVM, support vector machine. Reproduced with permission from [37]

14

1.2 Ciliopathies as a Model Disorder

Although rare, ciliopathies have become model disorders for the dissection of human genetic disease. The term ‘ciliopathy’ describes multiple diseases that are all characterized by dysfunction in a single organelle, the cilia. Polycystic disease

(PKD), nephronophthisis (NPHP), retinitis pigmentosa, Bardet-Biedl syndrome (BBS),

Joubert syndrome, and Meckel-Gruber syndrome (MKS) can all be classified as ciliopathies [47]. Investigations into dysfunctions of cilia and the resultant ciliopathies have yielded insights into allele causality and expanded our ability to interpret human genetic variation.

1.2.1 Function of Primary Cilium in Development, Homeostasis, and Signaling

Nonmotile, or primary, cilia were first identified as drivers of human disease less than two decades ago [48]. Until that point it was believed that these small hair-like projections, present on nearly all cell types, were ancient vestigial remnants of previously functional organelles, a cellular equivalent of the appendix. Even their structural parts seemed long ago scavenged: the aoneme shaft is composed of a scaffold of nine radially organized triplets similar to motile cilia, but lacking (with rare exceptions) the central pair; and the cilia are tethered to the cell membrane through the , a modified co-opted from the mitotic spindle poles.

15

However, numerous studies over the past 20 years have demonstrated the important role of primary cilia in multiple cellular processes [49].

Rather than being static structural organelles, primary cilia exist in a dynamic equilibrium with proteins being trafficked up and down the length of the . This (IFT) has been shown to be necessary for proper development of left-right axis asymmetry [50], as well as intracellular Ca2+ release in response to

mechanical stimuli [51]. Homology of certain IFT proteins to coat protein I and clathrin- coated vesicle proteins have suggested an origin of the primary cilia from the Golgi apparatus and vesicular transport machinery [52], underscoring its importance in transport and signaling.

Evidence for the role of the primary cilium in signaling pathways and signal transduction is increasing. G protein-coupled receptors and β-arrestin 1/2, which regulates seven-transmembrane receptor signaling, have been shown to localize to neuronal cilia [53-55]. Additionally, planar cell polarity (PCP) signaling molecules interact with proteins known to be part of the cilia basal body [56]. PCP signaling is a subset of non-canonical Wnt signaling, so it is therefore not surprising that several features of disrupted Wnt signaling are present in ciliary mutants, such as impaired convergent extension, neural tube closure, and stereocilia organization [49,56].

The primary cilium has also been associated with the establishment of apical- basolateral polarity. Disruption of cell polarization will consequently lead to defects in

16

cilia formation. Loss of components of the apical transport machinery (which sorts and transports cellular cargo bound for the apical portion of the cell) results in defects of ciliogenesis and regulation of cilia length [57,58].

1.2.2 The Ciliopathies as a Model of Human Genetic Disease

Because nearly all cells in the body have a primary cilium and can be affected in different ways by dysfunction of cilia, the ciliopathies as a group have wide variability in class and severity of phenotypes. These phenotypes can overlap greatly between the clinical diagnoses, a fact which prompted the recognition that several rare genetic diseases (such as NPHP, BBS, MKS, Joubert syndrome, and others) to be identified as driven by the same cause: cilia dysfunction (see Table 2). The broad yet distinctive features of the ciliopathies aid in the investigations of the phenotypic effects of multiple alleles found in patients [59].

The ciliopathies serve as illustrations of the variable pathogenicity of multiple alleles within the same gene; in the case of NPHP3, hypomorphic alleles have been shown to cause nephronophthisis (NPHP) [60], while null alleles cause the neonatal lethal Meckel-Gruber syndrome (MKS) [61]. NPHP is a cystic kidney disease and main genetic cause of childhood kidney failure, while MKS is a much more severe disorder consisting of renal cystic disease as well as CNS defects and polydactyly. Ciliopathies are therefore oftentimes thought of as individual diagnoses within a single organellar

17

disorder, with individual allelic variants located on a continuum of phenotypic severity

[62].

The complex gene interactions and inheritance patterns now identified as epistasis and oligogenic inheritance are observed in many of the ciliopathies. Unaffected individuals with homozygous mutations in BBS genes gave the first evidence that complex inheritance is oftentimes a feature of some ciliopathies [63]. The concept of trialleleic inheritance (where three mutations divided between two genes is required for pathology), while initially thought to be a rare esoteric phenomenon, has been shown to account for the inheritance of close to 25% of BBS families [64].

How oligogenic inheritance can affect the manifestation of pathology is explained through two models: first, the interaction of two causal genes is important to normal physiologic function; therefore the disruption of both produces a phenotype that is more severe than the either one alone. Second, a known causal ciliopathy gene can interact with a secondary modifier locus, that itself would not be sufficient to cause disease, but can augment the effects of the primary locus [62]. The presence of epistasis and modifier loci contribute to what is known as mutational load, a facet of many ciliopathies in which the total burden of pathogenic alleles in genes across the ciliopathy landscape shapes the observed clinical phenotype. It is predicted that close to 1000 genes may influence the phenotypic variability of ciliopathies and contribute to mutational load [65].

18

Because the differing inheritance patterns and clinical phenotypes of various diseases are derived from dysfunctions of the same organelle, ciliopathies provide an exceptional opportunity to interpret and understand human genetic disease. One has the ability to explore both simple and complex inheritance patterns, as well as clinical manifestations, while focusing on a unique cellular structure. The ciliopathies present a useful model of the variety of genetic disease. Application of this to an appropriate animal model system represents the bulk of the work described here.

19

Table 2: Examples of Genes Associated with Ciliopathies

Ciliopathy Clinical Manifestation Gene(s) Motile – Planar DNAI1, DNAH5, DNAH11, DNAI2, Primary ciliary Chronic bronchitis, rhinosinusitis, otitis media, KTU, TXNDC3, dyskinesia laterality defects, infertility, CHD LRRC50, RSPH9, RSPH4A, CCDC40,

Non-Motile – Embryonic Node and Sensory Autosomal recessive polycystic RFD, CHF PKHD1 kidney disease RFD, interstitial nephritis, CHF, RP, skeletal defects, NPHP1-8, ALMS1, Nephronophthisis retinopathy, disease, cognitive defects, renal CEP290 cysts Obesity, polydactyly, ID, RP, renal anomalies, Bardet-Biedl BBS1-12, MKS1, anosmia, CHD, situs inversus, retinopathy, liver syndrome MKS3, CEP290 disease, cognitive defects, renal cysts Obesity, RFD, polydactyly, ID, CNS anomalies, Meckel-Gruber MKS1-6, CC2D2A, Encephalocele, liver disease, CHD, cleft lip, cleft syndrome CEP290, TMEM216 palate, retinopathy, cognitive defects, renal cysts NPHP1, JBTS1, CNS anomalies, ID, ataxia, RP, polydactyly, cleft lip, JBTS3, JBTS4, CORS2, Joubert syndrome cleft palate, situs inversus, liver disease, renal cysts. AHI1, CEP290, TMEM216 Obesity, RP, DM, hypothyroidism, hypogonadism, Alstrom syndrome skeletal dysplasia, cardiomyopathy, pulmonary ALMS1 fibrosis, retinopathy, deafness Jeune asphyxiating Narrow thorax, RFD, RP, dwarfism, Polydactyly, IFT80 thoracic dystrophy skeletal defects, retinopathy, liver disease, renal cysts Polydactyly, syndactyly, cleft lip, cleft palate, CNS Orofaciodigital anomalies, ID, RFD, skeletal defects, cognitive OFD1 syndrome type 1 defects, liver disease, renal cysts, situs inversus Ellis van Creveld Chondrodystrophy, polydactyly, ectodermal EVC, EVC2 syndrome dysplasia, CHD Sensenbrenner Dolichocephaly, ectodermal dysplasia, dental IFT122, IFT43, syndrome dysplasia, narrow thorax, RFD, CHD WDR35 Short rib- Narrow thorax, short limb dwarfism, polydactyly, WDR35, DYNC2H1, polydactyly renal dysplasia NEK1 syndromes Adapted with permission from [59]

20

1.3 Animal Models of Human Genetic Disease

Animal studies combine the identification of candidate alleles for human diseases with mutant organisms that recapitulate the human mutation or loss of gene function, and have improved our understanding of the causal link between genetic mutation and phenotypic trait [66].

Numerous animal models have been developed to study both monogenic and complex disease. Each model system has its advantages and limitations, such as genetic and anatomic homology to humans, the size of the genetic toolkit, generation time, and cost. Here I have used the zebrafish as a method to model human genetic disease. The zebrafish is an organism that has gained utility by bridging the gap between the high throughput abilities of invertebrates and the orthologous structure of mammals (Table 3 and Table 4).

21

Table 3: General Attribute Similarities of Laboratory Organisms Used to Model Human Genetic Disease

C. elegans D. melanogaster D. rerio M. musculus Percent identity with humans 43% 61% 70% 80% Genome size 9.7 × 107 bp 1.3 × 108 bp 1.4 × 109 bp 2.5 × 109 bp Exome size 28.1 Mb 30.9 Mb 96 Mb 49.6 Mb Husbandry demands $ $ $ $$$ Cost per animal $ $ $ $$$ Characterized inbred strains + + + ++++ Outbred laboratory strains + + +++ ++ Germline/embryonic Yes No Yes Yes cryopreservation Lifespan 2 weeks 0.3 years 2–3 years 1.3–3 years Generation interval 5.5 days 2 weeks 3 months 6–8 weeks Number of offspring > 200 10–12 300 larva 10–20 eggs embryos pups/litter Embryonic development ex vivo ex vivo ex vivo in utero Transgenesis ++++ +++ ++++ +++ Gene targeting ++++ +++ + ++++ Conditional gene targeting + ++ + ++++ Transient in vivo assays +++ ++ ++++ + Allelic series from TILLING +++ +++ ++++ ++ Affordability of large scale ++++ ++++ +++ + screens Cell lines and tissue culture + ++ + ++++ Antibody reagents + ++ + ++++ In situ probes + +++ ++++ +++ Birth defects + ++ ++++ ++++ Adult-onset diseases ++ + + ++++ Behavioral modeling ++ ++ ++ ++ Aging modeling +++ ++ ++ ++ Metabolic disease ++ ++ +++ +++

22

Table 4: Anatomical Comparisons Between Zebrafish and Humans

Anatomy Key Similarities Key Differences • Cleavage, early patterning, • Rapid Embryology gastrulation, somitogenesis, • Influence of maternal transcripts organogenesis are all represented • Non-placental, involves hatching • Lack long bone, cancellous bone, • Ossified skeleton comprising cartilage Skeletal system and bone marrow and bone • Joints are not weight-bearing • Fast- and slow-twitch muscle • Axial and appendicular muscle groups topographically separate • Skeletal, cardiac and smooth muscle • Tail-driven locomotion depends Muscle cell types, with similar cellular on alternating contraction of architecture and machinery myotomal muscle • Appendicular muscle bulk is • Fast and slow fibers proportionately small • Central nervous system anatomy: fore-, mid- and hind-brain, including • Telencephalon has only a diencephalon, telencephalon and rudimentary cortex cerebellum • Peripheral nervous system has motor • Fish-specific sensory organs, and sensory components such as the lateral line • Fish behaviors and cognitive • Enteric and autonomic nervous function are simplified compared Nervous system and systems with human behavior behavior • Significant difference in • Specialized sensory organs, eye, population of dopaminergic olfactory system and vestibular system, neurons (telencephalic vs are well conserved midbrain) • Complex behaviors and integrated • Some immediate early genes and neural function: memory, conditioned neuropeptides not conserved in responses and social behaviors (for zebrafish example, schooling) • Multiple hematopoietic cell types: erythrocytes, myeloid cells (neutrophils, • Erythrocytes are nucleated eosinophils, monocytes and Hematopoietic and macrophages), T- and B-lymphocytes lymphoid/immune • Possess thrombocytes rather than • Coagulation cascade for hemostasis systems platelets • Innate and adaptive humoral and • Kidney interstitium is the cellular immunity hematopoietic site •Has left–right distinctions in • Multi-chamber heart with an atrium cardiac anatomy, but does not have and ventricle separate left–right circulations, that Cardiovascular system is, the heart has only two chambers •So far no evidence for secondary • Circulation within arteries and veins heart field derivatives • Separate lymphatic circulation • Lymph nodes observed 23

• Cardiac differentiation occurs through • Embryos are not dependent on similar signaling pathways functioning CV system for larval (eg,nkx2.5, bmp2b) development • Atria and ventricles express different myosin heavy chains • Similar electrical properties and during development (human conduction patterns (SA node, slow hearts only later differentiate atrial conductance, AV node, fast between atrial and ventricular ventricular conductance) mhc) • Heart has high regenerative capacity, even in adult animals • Respiration occurs in gills, not • Cellular gas exchange • No pulmonary circulation • Endoderm-derived swim bladder Respiratory system • Oxygenation is dependent on (functioning as a variable circulation and hemoglobin carriage buoyancy device), which corresponds embryologically but not functionally to the lungs • Major organs: liver, exocrine and • Lack an acidified digestive organ endocrine , gall bladder Gastrointestinal • Zonal specializations along the length • Have an intestinal bulb rather system of the absorptive alimentary tract than stomach • Immune cells in lamina propria • Intestinal Paneth cell not present • Filtration occurs in anterior and posterior kidneys • Mesonephric rather than metanephric adult kidney Renal and urinary • Glomerular anatomy and function • No bladder or prostate gland systems • No structure in zebrafish homologous to descending or ascending thin limb of nephron in mammals • Molecular and embryological biology • No sex of germ-cell development • Mechanism of sex determination is uncertain • Fertilization isex vivo (no uterus or the related internal female reproductive organs) Reproductive system • Cellular anatomy of germ-cell organs, • Oocytes are surrounded by a the testis and ovary chorion, not the zona pellucida, which must be penetrated by sperm • Non-lactating; no breast equivalent

Endocrine system • Most endocrine systems represented, • Differences in anatomical

24

including hypothalamic/ hypophyseal distribution of glands, for example, axis (glucocorticoids, growth hormone, discrete parathyroid glands do not thyroid hormone, prolactin), parathyroid seem to be present hormone, insulin and rennin • Prolactin has a primary role in osmoregulation • Structures unique to fish that are specialized for the aquatic • Ectodermal derivative environment (including, elasmoid scales, mucous cells) Skin and appendages • Lack appendages (hair follicles, • Pigmentation pattern is due to neural- sebaceous glands) crest-derived pigment cells including • Additional pigment cell types: melanocytes xanthophores and iridophores

25

1.3.1 Zebrafish as a Model Organism

The zebrafish is a tropical teleost that lives in the freshwaters of south-east Asia.

In the late 1960s, George Streisinger transitioned this common aquarium species to a model for basic research of embryogenesis and organ development because of its

“desirable attributes,” including a relatively short generation time of three to four months, the ability of mating pairs to generate several hundred embryos that develop rapidly and synchronously ex vivo, and the small size of adult fish (3 cm in length), making them easy to care for [67]. Moreover, embryos are transparent, allowing facile microscopic visualization in the first days of development, with major organ formation occurring 24 hours post-fertilization. Zebrafish have a diploid genome, but differ notably from the genomic structure of other vertebrates by the major teleost specific genome duplication that has resulted in sub-functionalization and neo-functionalization of genes [68-70].

Importantly, the biomedical research community now has a publicly available, extensively annotated version of the zebrafish genome at its disposal, of which 70% of genes have an identifiable human ortholog [71]. Additionally, a vast catalog of mutants, transgenic reporters, and gene-specific expression data has been generated from over two decades of dedicated D. rerio use for “phenotype-driven” forward genetic screens and “gene-driven” reverse genetic approaches. These data are curated in ZFIN (the

Zebrafish Model Organism Database), a community-wide resource warehousing

26

genomic information, anatomical atlases, molecular tools and links to zebrafish publications (www.zfin.org and [72]).

1.3.2 Zebrafish Complementation Studies

Although not a panacea, the implementation of zebrafish complementation studies [73] in human and medical genomics has facilitated disease gene discovery in both monogenic and complex traits. These studies involve suppression of the orthologous zebrafish gene and rescue with either a mutant or wild-type human mRNA to determine pathogenicity. Evaluation of the function of the mutant protein can determine whether a specific nonsynonymous variant is a functional null (loss of function), hypomorph (reduced function), or benign change (no difference from wild- type protein). Additionally, dominant pathology can be evaluated through injection of the mutant mRNA without concomitant knockdown of the endogenous gene. A dominant negative or gain-of-function mutation can be identified through phenotypic

abnormalities caused by the mutant mRNA alone.

Complementation studies have also found application in modeling more

intricate (and challenging) genetic lesions that include CNVs and epistatic interactions

[27,74]. However, there remain some phenotypes that are less tractable: adult-onset

disorders, slowly progressing degenerative phenotypes, and diseases affecting organs

and structures for which there is no acceptable ortholog in the zebrafish (for example,

27

the lungs). Despite these potential limitations, complementation studies using the zebrafish model are exceptional tools for modeling human disease genes and alleles.

1.4 Initial Screenings: Forward Genetics in Zebrafish

Initial forays in zebrafish research predated the precise knowledge of gene content or location within the zebrafish genome, and were not necessarily motivated by targeted questions of human pathology. Rather, most forward screens were conducted to understand vertebrate embryonic development by 1) introducing random mutations throughout the genome; 2) conducting an informative breeding scheme to generate progeny with homozygous recessive mutations; 3) evaluating animals for a measurable phenotypic readout; and 4) identifying the mutation and gene underscoring the phenotype. Used widely across multiple model organisms, the application of this traditionally laborious strategy in zebrafish has been reviewed extensively elsewhere

[75].

The first zebrafish screens were reported in the 1980s and involved the application of gamma rays to induce recessive lethal mutations. However, this approach resulted in significant chromosomal breaks that rendered mapping to a single locus difficult [76,77]. Alkylating agents, primarily N-ethyl-N-nitrosourea (ENU), replaced gamma rays as an effective mutagen and application resulted in discrete genomic mutagenesis in zebrafish germ cells that could be mapped to a single gene

28

(approximately one mutant per genome evaluated) [78,79]. This discovery led to large- scale efforts by labs in Tübingen and Boston to apply ENU screening to zebrafish.

Within two years, their combined efforts led to the characterization of ~ 4000 embryonic lethal phenotypes; these include gastrulation [80,81]; somitogenesis [82]; brain [83-85]; cardiovascular [86]; and craniofacial development mutants [87-89].

Although forward genetic screens in zebrafish contributed significantly to the fundamental understanding of early embryonic development, the impact on such findings to inherited disease in humans has been sporadic. This modest connection can be attributed to four main reasons. First, such screens are unable to capture alleles that confer a dominant negative (antagonizes the wild-type protein function) or gain of function (mutation confers a protein function different from that of wild-type protein) effect. Second, phenotypes must have a measurable phenotypic readout in early embryonic or larval stages, decreasing the possibility of detecting adult-onset or degenerative phenotypes. Third, although such screens were able to uncover discrete gene functions, the odds of generating precisely the same allele by ENU as one which has been seen in a patient are remote. Finally, this approach is confounded further by the fact that the zebrafish genome underwent a teleost specific duplication [68-70].

Among the genes for which there is an identifiable human ortholog, 47% have a one-to-one orthologous relationship with a human counterpart, while the remainder of zebrafish genes have complicated one-to-many or many-to-one orthology in comparison

29

to the human gene [71]. As a result, duplicated gene function may either be a) retained in both copies making them functionally redundant; b) lost in one of the two copies wherein it becomes a pseudogene; or c) a novel and divergent gene function is acquired by one of the two copies. Therefore, mutations in only one of two functionally redundant orthologs might not display a phenotype.

Nonetheless, ENU mutants have been successful in drawing anatomical correlates for genes implicated in recessive human disorders that cause anatomical birth defects. For instance, the craniofacial mutant crusherm299 is caused by a nonsense

mutation in sec23a [90]; at the same time as this discovery, SEC23A mutations in humans were shown to cause a clinically relevant craniofacial dysmorphology, cranio-lenticulo- sutural dysplasia, bolstering the evidence of causality in both species [91]. Importantly, the recent application of WGS [92], WES [93], and improved mapping strategies [94] to zebrafish ENU mutants has enabled the rapid and cost-effective identification of mutations, justifying the continued use of forward genetics to assist with assigning causality in human genetic disease.

1.5 Reverse Genetics: From Candidate Gene to Animal Model

Forward genetic screening involves the unbiased examination of phenotypes resulting from mutations in the zebrafish genome. However, the randomness of this approach is hampered by the inability to specifically target every coding gene and/or

30

specific mutations implicated in human pathology. To circumvent this problem, the precise targeting of candidate genes and alleles can be achieved through several methods that have been developed over the past ~ 15 years.

1.5.1 Antisense Morpholino Suppression of Endogenous Genes

The first method is transient gene manipulation that can be achieved through the injection of either morpholino (MO) antisense oligonucleotides (suppression) or capped in vitro transcribed mRNA (ectopic expression) into zebrafish embryos. MOs are stable molecules that consist of a large, nonribose morpholine backbone with the four

DNA bases pairing stably with mRNA at either the translation start site (to disrupt protein synthesis) or at intron–exon boundaries (to disrupt mRNA splicing) [95]. The use of MOs to confer effective gene knockdown was first shown in zebrafish in 2000;

Nasevicius and Ekker recapitulated the developmental phenotypes of five different embryonic lethal mutants and also developed models of the human genetic disorders hepatoerythropoietic porphyria and holoprosencephaly through the suppression of urod and shh, respectively [96]. Since this report, MOs have been used broadly to study vertebrate development and disease; co-injection of MO and orthologous mRNA has been employed for the systematic functional testing of alleles identified in humans, offering a powerful approach for analysis of variant pathogenicity and direction of effect[73].

31

Still, this methodology does have notable drawbacks: 1) MO efficacy is limited to

~ 3–5 days [96], and similarly, the presence of mRNA is limited to the same embryonic timeframe; 2) with few exceptions [97], injected MOs and mRNAs do not confer spatial

or temporal specific activity; and 3) MOs can give rise to spurious phenotypes resulting from off-target effects [98]. Even so, the use of this methodology within the appropriate developmental stage, and with the appropriate experimental controls a) targeting with a splice-blocking MO to demonstrate incorrectly spliced RNA; b) specific rescue of MO

phenotypes with orthologous wild-type mRNA; c) demonstration of a similar

phenotype with multiple MOs targeting the same gene; or d) comparison with a mutant

when possible, and if appropriate [98] can allow for the correct interpretation of MO

phenotypes to establish relevance to human genetic disease through the recapitulation

of loss-of-function or dominant negative effects.

1.5.2 Genome Editing and Transgenic Lines

A second method of precise targeting of candidate genes is to obtain germ line zebrafish mutants for a gene of interest; this avoids the phenotypic variability associated with MOs, and allows the observation of phenotypes beyond early larval stages.

Targeting Induced Local Lesions in Genomes (TILLING) was the first reverse genetic approach to produce germ line mutations in a gene of interest. Similar to forward screens, TILLING involves ENU mutagenesis of adult male zebrafish and generation of

32

F1 families. Sperm from F1 males is then cryopreserved while genomic lesions are screened in target genes, typically in early exons or near exonic regions encoding critical protein domains, through PCR amplicon screening [99].

The completion of the zebrafish genome coupled to next-generation sequencing

has significantly increased the throughput of the screening aspect of this strategy.

TILLING mutants have been identified for > 38% of all known zebrafish protein coding

genes (http://www.sanger.ac.uk/Projects/D_rerio/zmp and [100]); this corresponds to

~ 60% of orthologous genes associated with a human phenotype in the Online

Mendelian Inheritance in Man (OMIM; http://www.omim.org) database. The ongoing

TILLING efforts hope to generate a comprehensive resource of putative null or

hypomorphic models of human genetic disease, however, it is critical to be cognizant of

the possibility that ENU may introduce multiple lesions in the genome. Ideally, multiple

mutants with different alleles in the same gene should be phenotypically characterized

to ensure that the pathology is specific. The same guidelines are true for retrovirus

[101] or transposon [102] insertional mutants used in similar reverse genetics

approaches.

Both forward ENU screens and TILLING are laborious; alternative approaches

have recently expanded the utility of the zebrafish by enabling precise and germ line

transmittable gene targeting that does not require excessive downstream screening to

identify mutations (see ref [103] in this issue for a detailed review of genome editing

33

endonucleases). First, zinc finger nucleases (ZFN) utilize a zinc finger array to enable target sequence specificity (typically the early exon of a gene), and a FokI endonuclease to guide cleavage and subsequent repair at the target site [104]; this was first utilized to target the gol locus (mutation of which results in absence of pigment), ntl (a regulator of early embryogenesis), and kdr (vascular endothelial growth factor-2 receptor), as visible proof-of-principle phenotypes [105,106].

Second, transcription activator-like TAL effector nucleases (TALEN) have similarly been optimized to achieve locus-specific genome editing and have been shown to achieve greater specificity of and alteration of target sequences than ZFNs [107]. A third, more recent advancement in zebrafish genome editing technology involves clustered, regularly interspaced, short palindromic repeats (CRISPR), bacterial type II systems that guide RNAs to direct site-specific DNA cleavage by the Cas9 endonuclease

[108]. Each of ZFN, TALEN, and CRISPR technologies have expanded the molecular toolkit of the zebrafish (for comparisons, see Table 3), accelerating studies of vertebrate development and improving our understanding of analogous phenotypes to human disease. For example, CRISPR/Cas9 was used to edit the gata5 locus [109], and mutant embryos displayed a cardia bifida phenotype mimicking both the fautm236a zebrafish

mutant [110], and also humans with congenital heart defects [111-113].

However, there is currently a relative paucity of reports in which human-driven

WES/WGS studies have been followed with the generation of such stable mutants; this is

34

largely due to the relative newness of the technology, and/or the amount of time and labor still required to generate and characterize mutants; we anticipate the landscape of the field to change rapidly in the coming months and years.

1.6 Humanizing Zebrafish to Evaluate Human Mutations

Taken together, the zebrafish exemplifies a tractable and physiologically relevant tool to model genetic variation in humans. Each of the forward and reverse genetics tools have limitations, and in particular, place significant emphasis toward the study of loss-of-function effects of single genes potentially making them an overly simplistic

model to investigate oligogenic or even complex traits. In a growing number of

instances, however, it has been possible to balance experimental tractability, specificity,

and cross-species phenotypic similarity to establish: 1) physiological relevance of a gene

to a human clinical phenotype; 2) allele pathogenicity; and 3) direction of allele effect for

a vast array of human genetic disorders with diverse models of inheritance, phenotypes,

and ages of onset.

1.6.1 Recessive Disease

Disorders that segregate under a recessive mode of inheritance, especially congenital or pediatric-onset disorders with an abnormality in a structure with an anatomical counterpart in the developing zebrafish, have achieved widespread use

35

toward demonstrating physiological relevance. This often represents the extent of functional data presented in instances when the human mutations have an unambiguous loss-of-function effect on the protein (nonsense, frameshift, or splice-site).

For instance, causal mutations identified in primary ciliary dyskinesia (PCD) are almost exclusively null changes, and transient MO-based studies in zebrafish have shown that proteins of a priori unknown function including CCDC39, ARMC4, andZMYND10 give rise to left-right asymmetry defects phenotypes found in humans [114-116].

In other recessive disorders, such as pontocerebellar hypoplasia (PCH), the zebrafish has assisted in establishing clinical relevance and allele pathogenicity (Figure

4A). Wan et al. identified nonsynonymous changes inEXOSC3, encoding exosome component 3, following WES of four affected siblings; MO-induced suppression resulted in phenotypes that were relevant to the human clinical features of microcephaly and reduced motility in exosc3 morphants. Additionally, in vivo complementation of exosc3 MO phenotypes with either zebrafish or human mRNA harboring the missense mutations found in patients failed to rescue the phenotype, indicating that these were loss-of-function alleles [117]. Even so, transient in vivo complementation assays are not applicable to every gene; human genes with an open reading frame (ORF) larger than

~ 6 kb are challenging to transcribe in vitro.

36

Figure 4: Zebrafish Models of Human Genetic Disease

37

A. Mutations in pontocerebellar hypoplasia are caused by EXOSC3 and in vivo complementation studies in zebrafish recapitulate the brain phenotypes observed in patients and demonstrate that missense mutations are functional null variants. Left panel, neuroimaging of affected individuals (A1–A4; top row) and control images (A5–

A8; bottom row); right panel, whole-mount in situ hybridization in splice blocking MO- injected embryos in lateral view (A9; inset – dorsal view, with rostral to the left) demonstrated diminished expression of dorsal hindbrain progenitor-specific marker atoh1a and cerebellar-specific marker pvalb7 (A10; quantification). Images reproduced from Wan et al. 2012. Nat Genet. 44: 704–708 with permission.

B. Congenital abnormalities of the kidney and urinary tract are caused by haploinsufficiency of DSTYK. Left, panels B1–B3, with hypoplasia of the left kidney

(Panel B1, kidneys outlined by dashed lines), bilateral hydronephrosis (Panel B2, arrows) caused by ureteropelvic junction obstruction detected at birth, and hydronephrosis only of the left kidney (Panel B3, arrow) caused by ureteropelvic junction obstruction. The intravenous pyelogram in Panel B4 shows blunting of fornices on the right side (white arrow); right, MO-induced knockdown of dstyk embryos, live lateral images show absence of the patent pronephric duct opening (arrows). Images reproduced from Sanna-Cherchi et al. 2013. NEJM. 369: 621–629 with permission.

38

C. Adult-onset limb-girdle muscular dystrophy is caused by dominant negative mutations in DNAJB6. C1, Transmission electron microscopy showed early disruption of

Z-disks (arrows; left) and autophagic pathology (right) in LGMD1D; C2–C10, lateral views of zebrafish embryos 2 days post fertilization subjected to whole-mount immunofluorescence staining of slow myosin heavy chain display myofiber abnormalities (arrows). Images reproduced from Sarparanta et al. 2012. Nat Genet. 44:

450–455 with permission.

D. PLS3 overexpression exerts a protective effect on SMN1 deletion to rescue motor neuron defects in spinal muscular atrophy (SMA). D1, pedigrees of SMA-discordant families showing unaffected (gray) and affected (black) SMN1-deleted siblings; D2, lateral view of zebrafish embryos injected with control MO, smn MO, PLS3 RNA, and smn MO + PLS3 RNA. Motor axons were visualized with znp1 antibody at 36 hours post fertilization and show rescue of smn MO with PLS3 RNA. Images reproduced from

Oprea et al. 2008. Science. 320: 524–527 with permission.

E. SCRIB and PUF60 suppression drive the multisystemic phenotypes of the 8q24.3 copy number variant. E1, Photographs of five individuals with the 8q23.4 CNV show craniofacial abnormalities and microcephaly; E2, Lateral and dorsal views of control and scrib or puf60a MO-injected embryos at 5 dpf show head size and craniofacial defects

39

observed in affected individuals. Images reproduced from Daubert et al. 2013. Am J Hum

Genet. 93: 798–811 with permission. LOF, loss-of-function; DN, dominant negative.

1.6.2 Dominant Disease

In contrast to recessive disorders, in which the allele effect is typically loss-of- function, traits that segregate under an autosomal dominant inheritance pattern are the result of a haploinsufficiency, dominant negative, or gain-of-function mechanism. For some dominant pediatric-onset disorders, the genetic evidence of a heterozygous null variant segregating in a large pedigree with fully penetrant disease is sufficient to suggest that the direction of allele effect is haploinsufficiency, and these predictions have been confirmed in zebrafish through MO-induced gene suppression. For example, dilated cardiomyopathy is caused by nonsense, splice-site or missense mutations in the gene encoding heat shock protein co-chaperone BCL2-associated athanogene 3 (BAG3) and gene suppression results in similar cardiac phenotypes in zebrafish embryos [118].

Similarly, congenital abnormalities of the kidney and the urinary tract (CAKUT) associated with a loss-of-function splice site mutation segregating in a dominant pedigree were identified in the dual serine-threonine kinase encoded by DSTYK ( Figure

4B); the human phenotypes were recapitulated in dstyk morphant embryos [119].

Transient experiments in zebrafish embryos can also determine allele pathogenicity and capture a dominant negative direction of effect that is isoform-

40

specific. This is exemplified by the transient functional studies of missense mutations in the co-chaperone protein, DNAJB6, associated recently with adult-onset limb girdle

muscular dystrophy [120] (Figure 4C). Co-injection of mutant DNAJB6 mRNA in the presence of equivalent amounts of wild-type transcript resulted in myofiber defects in zebrafish embryos; injection of increasing amounts wild-type mRNA with a fixed concentration of mutant resulted in a phenotypic rescue, indicating the dominant toxicity of the mutant alleles. Given the nature of these mutations (deleterious in heterozygosity), the use of zebrafish for dissecting dominant disorders will likely remain restricted to transient MO- and mRNA-based studies until the development of more sophisticated conditional gene suppression/expression techniques in zebrafish.

1.6.3 De Novo Variants

Variants that arise de novo as a product of either germline mosaicism or early developmental DNA replication errors are significant contributors to the human mutational burden [121]. Similar to variants that underscore autosomal dominant disorders, de novo changes may give rise to clinical phenotypes produced from falling below a gene dosage threshold, dominant negative effects, or acquisition of a novel function. As such, an unbiased approach toward dissecting the direction of de novo allele effect is critical once physiological relevance has been determined.

41

For instance, transient approaches have been carried out in zebrafish to investigate de novo missense mutations in CACNA1C, encoding the voltage-gated calcium channel Cav1.2, in the pathophysiology of Timothy syndrome (TS), a pediatric disorder characterized by cardiac arrhythmias, syndactyly, and craniofacial abnormalities. Ectopic expression of mutant mRNA and suppression of cacna1c in zebrafish embryos not only revealed that the mutation confers a gain-of-function effect, but also demonstrated a novel role for Cav1.2 in the non-excitable cells of the developing jaw [122].

1.6.4 Copy Number Variations

As described above, copy number variants (CNVs) represent a significant molecular basis for human genetic disease [123]. These variations in genomic structure can range in size from a few thousand to millions of base pairs, are not identifiable by conventional chromosomal banding, and can encompass from one to hundreds of genes

[25]. Although genotype–phenotype correlations among affected individuals with overlapping CNVs can assist in narrowing specific genetic drivers, CNVs have been historically intractable to functional interpretation in animal models, with sparse reports of human CNVs being modeled in the mouse [124]. Zebrafish models have emerged recently as powerful tools to dissect both recurrent and non-recurrent CNVs. First, systematic zebrafish modeling of the 29 genes in the recurrent reciprocal 16p11.2

42

duplication/deletion CNV – associated with a range of neurocognitive defects – found the main driver of the neuroanatomical phenotypes to be KCTD13, causing mirrored

macro- and microcephaly upon suppression or overexpression in zebrafish, respectively

[27]. Second, MO-induced suppression of three genes in the 8q24.3 non-recurrent

deletion CNV in zebrafish embryos revealed that the planar cell polarity effector SCRIB,

and the splicing factor PUF60 could be linked to distinct aspects of the renal, short

stature, coloboma, and cardiac phenotypes observed in five individuals with

overlapping microdeletions at this locus [125] ( Figure 4E).

1.6.5 Second-Site Modifiers

The demonstration of second-site phenotype modification in primarily recessive human genetic disease has been fuelled by the use of in vivo assays in zebrafish. The ciliopathies, disorders underscored by dysfunction of the primary cilium, have been causally linked with over 50 different loci, can give rise to a constellation of human phenotypes, and have been an ideal system to study epistasis [62]. The recent dissection of the genetic architecture of Bardet–Biedl syndrome (BBS), a ciliopathy hallmarked by retinal degeneration, obesity, postaxial polydactyly, renal abnormalities, and intellectual disability, 1) informed the pathogenic potential of missense BBS alleles contributing to the disorder (null, hypomorphic, or dominant negative); 2) revealed the surprising contribution of dominant negative alleles in oligogenic pedigrees with BBS; and 3)

43

provided sensitivity (98%) and specificity (82%) metrics for the zebrafish in vivo complementation assay to predict allele pathogenicity [126].

Transient zebrafish in vivo complementation assays have similarly been used to identify RPGRIP1L A229T as a modulator of retinal endophenotypes [127], RET as a modifier of Hirschsprung phenotypes in BBS [74], and TTC21B as a frequent contributor to mutational load in ciliopathies [128]. Second-site modification phenomena are not unique to the ciliopathies; for example, overexpression of plastin 3 (PLS3) to mimic the gene expression in unaffected individuals, improved the axon length and growth defects associated with SMN1 deletion in spinal muscular atrophy (SMA). The interaction of these two genes was shown, in part, through modeling of SMA genotype and phenotype correlates in zebrafish embryos [129] (Figure 4D).

1.7 Implications

The remarkable speed and depth at which human genomes and exomes can now be sequenced has identified an overabundance of human genetic variations with pathogenic potential. As the costs and efficiency of next-gen sequencing become more optimized, the challenge of first prioritizing and subsequently modeling likely disease- associated variants increases as well. How we can probe the questions raise in human disease and genetic variation is the focus of the following work.

44

In vivo animal studies, previously used mainly to model specific human diseases or genes, can now be used to functionally test an extensive series of variants.

Advancements in genomics have allowed for multiple candidate disease-associated alleles to be investigated in parallel, as well as multiple paradigms of disease inheritance and etiology. This has enabled us to ask questions which were previously irresolvable: how do individual alleles interact with one another; how does each contribute to overall phenotype; what is the effect of the total mutational load on human disease?

Additionally, recent advances in genomic technology has allowed for finer resolution of structural variation that is a newly recognized source of human pathology..

Genomic variation contributes to disease in multiple ways, and through both Mendelian and non-Mendelian inheritance. Through advancement in genomic technology and in vivo modeling, we are able to interrogate the contribution of structural variations as causal drivers as well as modifiers of disease.

In addition to functional analysis of the large number of candidate alleles obtained through whole genome and whole exome sequencing, the initial task of prioritizing likely pathogenicity is an increasing difficulty. In silico tools assist with

pathogenicity prediction, depending on a variety of projection algorithms. While these

tools can be exceedingly useful in categorizing candidate alleles, they have some

disadvantages that affect predictive power. Understanding and improving upon these

algorithms will lead to a greater ability to interpret human genetic variation.

45

1.8 Chapter Outlines

Chapter 2 is a complete description of the methods and reagents used during the course of conducting the studies described in this work.

Chapter 3 describes the combination of genomics and in vivo modeling in the

identification of a novel driver of BBS. Targeting sequencing of a large cohort of BBS

patients without a molecular diagnosis identified 17 families which had mutations in

genes associated with the cilia but not yet known to be causal for BBS. A combination of

Sanger segregation, complementation assays, live embryo analysis, whole-embryo

immunofluorescence, and cryo-histology identified a gene that is a likely causal driver

for BBS in a single family.

Chapter 4 will detail the use of aCGH to identify the contribution of CNVs to the

mutational load of a cohort of BBS patients. A genomic resolution of several hundred bp

was obtained, much smaller than previous levels of 5-10 kb. We were able to detect

CNVs contributing to mutational load in patients in multiple ways: CNVs can act in a

Mendelian paradigm as recessive alleles themselves; CNVs can unmask recessive

mutations that would ordinarily be compensated by a normal allele on the homologous

; finally, CNVs can also act in epistasis with mutations in other genes. This

work demonstrated that CNVs contribute a mutational load more frequently than was

46

previously thought and demonstrated the need to consider the role of CNVs in disease gene discovery.

Chapter 5 will examine the contribution of genomic context on the pathogenicity of disease-associated alleles and their tolerability in other species. These alleles, known as compensated pathogenic deviations (CPDs), make up a significant number of pathogenic human variants, and functional testing demonstrates the ability of secondary sites in cis to compensate for the loss of function conferred by CPDs.

47

Chapter 2 – Methods 2.1 Research Participants for CNV Analysis

A total of 92 unrelated BBS individuals and 229 unaffected controls were analyzed by aCGH. Affected individuals were selected irrespective of previously identified pathogenic changes. The control samples included 137 controls of northern

European descent from our Age Related Macular Degeneration control cohort [130,131], five HapMap samples of western European ancestry, and anonymized samples from 87 healthy internal lab controls of northern European descent. We used standard methods to isolate genomic DNA from peripheral blood. Informed consent was obtained from all participating individuals, with approval from the Institutional Review Boards of the

Duke University Medical Center and the Baylor College of Medicine.

2.2 Whole Genome Amplification

30 ng of genomic DNA from affected participants and gender-matched male

(NA10851) and female (NA15510) control DNAs (obtained from Coriell Cell

Repositories; http://ccr.coriell.org) were amplified by whole genome amplification

(GenomePlex Whole Genome Amplification Kit, Sigma). Amplified product was verified by gel electrophoresis and quantified by UV absorbance at 260 nm with a NanoDrop

(ND-1000) spectrophotometer (NanoDrop Technologies).

48

2.3 Oligonucleotide aCGH analyses

We used a 4x180k array format at an average coverage of one probe per 100 base

pairs (bp) in coding sequences and one probe per 500 bp in intragenic non-coding

sequences. Briefly, 2 μg of each amplified BBS patient DNA sample and 1.2 μg of

unamplified control DNA sample were then labeled and hybridized as described [132].

2.4 Junction Sequence Analyses

Long-range PCR was conducted with the Phusion® High-Fidelity PCR Kit (NEB,

Ipswich, Massachusetts, E0553L) following the manufacturer’s protocol, and PCR

products that were unique to carriers and not observed in controls were Sanger

sequenced.

2.5 qPCR Confirmation

When junction fragments failed to amplify by multiple primer pairs, we used quantitative (q)PCR to confirm CNVs. TaqMan® Copy Number Assays (Life

Technologies, Carlsbad, CA) were selected from within the CNV regions. We conducted triplicate reactions according to an ABI protocol on an ABI7900HT Fast Real-Time PCR

System for both cases and matched controls. A copy neutral reference assay was performed in parallel (RNase P) for normalization purposes. Data were analyzed and visualized with CopyCaller™ (v1.0).

49

2.6 Resequencing of BBS Genes Identified in CNV Analyses

We amplified all exons and splice junctions of BBS1, BBS2, ARL6/BBS3, BBS4,

BBS5, MKKS/BBS6, BBS7, TTC8/BBS8, BBS9, BBS10, TRIM32/BBS11, BBS12,

MKS1/BBS13, CEP290/BBS14, WDPCP/BBS15, SDCCAG8/BBS16, IFT27/BBS19, and

NPHP1/BBS20 in the seventeen individuals harboring CNVs, and we conducted bidirectional Sanger sequencing on an ABI3730 according to standard protocols.

Mutations identified were segregated by Sanger sequencing in all available family members. Variants (CNVs or SNVs) were named according to the following NCBI reference transcript accession numbers: BBS1, NM_024649; BBS2, NM_031885;

ARL/BBS3, NM_032146; BBS4, NM_033028; BBS5, NM_152384; MKKS/BBS6,

NM_018848; BBS7, NM_176824; TTC8/BBS8, NM_144596; BBS9, NM_198428; BBS10,

NM_024685; ALMS1, NM_015120; CEP290, NM_025114; SDCCAG8, NM_006642;

NPHP1, NM_000272; IFT74, NM_025103.

2.7 CNV Genes Suppression, Rescue, and Genetic Interaction Experiments

Translation-blocking and splice-blocking morpholino (MO) antisense

oligonucleotides were designed by and obtained from Gene Tools, LLC. To determine

the optimal dose for in vivo complementation assays and genetic interaction studies, we

50

injected increasing doses of MO into one-to-four cell stage wild-type (WT) zebrafish embryos. To determine MO efficiency (splice-blocking MOs), we harvested injected

embryos in Trizol (Invitrogen), extracted total RNA according to the manufacturer’s

instructions, and generated oligo-dT-primed cDNA with the QuantiTect reverse

transcription kit (Qiagen) for use as a template for RT-PCR; subsequent Sanger

sequencing of RT-PCR products identified the precise alteration of endogenous

transcript. For rescue experiments, human IFT74 transcript variant 1 (long; NM_025103)

and IFT74 transcript variant 2 (short; NM_001099224) messages were amplified from

cDNA generated from a human fibroblast cell line; SDCCAG8 (NM_006642; clone T8331)

was obtained commercially (Genecopoeia); and all ORFs were subcloned into the pCS2+

vector and transcribed in vitro using the mMessageMachine kit (Ambion). All other BBS

gene ORF clones, and mutagenesis procedures have been described [126,133,134].

Embryo batches were assessed live at the 8-10 somite stage for early developmental

defects in gastrulation and were categorized into class I or class II according to

established objective criteria[126,127,133-137]. Live embryo images were acquired on a

Nikon AZ100 microscope at 8x magnification facilitated by NIS Elements software.

2.8 Datasets of Known Benign and Pathogenic Variants

Our primary training dataset was HumVar, one of the training datasets for

PolyPhen-2.2.3. We used the most recent public release at the time of this publication

51

(December 2011), available for download at http://genetics.bwh.harvard.edu/pph2/ dokuwiki/downloads. This dataset is derived from SwissVar variant annotations[138]. It contains 22,207 variants annotated as pathogenic and 21,433 variants annotated as benign. We also used a dataset of pathogenic variants derived from the June 2014 release of the ClinVar database[139]. This dataset consists of all missense variants from ClinVar that are unambiguously (i.e. classified the same by all submitters) and confidently (i.e. not a “Likely” annotation) annotated as “Pathogenic.” It contains 10,596 variants annotated as pathogenic and 1,926 variants annotated as benign. The intersection of these two datasets contains 3,563 variants annotated as pathogenic and 454 variants annotated as benign. As an additional control, we required that pathogenic variants be absent from 6,503 human exomes[33] (Exome Variant Server: http://evs.gs.washington.edu/EVS/). This most stringent dataset contains 3,062 variants annotated as pathogenic.

2.9 Comparative Genomics Screen for CPDs

We used the UCSC MultiZ whole-genome alignments of 100 vertebrate sequences [140], downloaded from UCSC as translated exons. As an alternate alignment strategy, we used the EPO alignment of 37 eutherian mammal species [141], downloaded as nucleotide sequences and translated for all aligned species using the human open reading frame. In cases where the alignment contained multiple sequences

52

from the same species, only the sequence most similar to the human sequence was retained. Variants were classified as CPDs if the variant amino acid was found in the translated sequence of any vertebrate ortholog other than human or chimpanzee, with chimpanzee being excluded because presence in the chimpanzee sequence may be used as evidence for neutrality in variant annotation databases. The resulting dataset of neutral and deleterious variants found in vertebrate orthologs is available as supplementary materials.

2.10 Statistical Models of Variant Density

We modeled the density of benign variants with an exponential distribution, with scale parameter βneut representing the mean time to fixation of neutral alleles. We used three different models for the density of CPDs:

1. k compensatory changes fix at rate 1/θ, followed by the now-neutral CPD at the

same rate. This is represented by a gamma distribution with shape parameter k +

1 and scale parameter θ.

2. k compensatory changes fix at rate 1/θ1, followed by the now-neutral CPD at an

independent rate 1/θ2. This is represented by the convolution of a gamma

distribution with shape parameter k and scale parameter θ1, and an exponential

distribution with scale parameter θ2.

53

3. k compensatory changes fix at rate 1/βcomp, followed by the now-neutral CPD at

the neutral rate 1/βneut.

We assume a reversible model of evolution, so that the same three models can

apply both to the fixation of CPDs not present in an ancestral sequence and to the loss of

CPDs that are present in an ancestral sequence. The random variable used in these

models is the sequence distance to the closest sequence containing the variant, where

sequence distance is defined as the fraction of aligned (i.e. non-gapped) positions that

are identical.

2.11 Fitting Observed Density to Statistical Models

We recorded for each variant found in a vertebrate ortholog the minimum number of amino acid differences between that variant and a vertebrate ortholog, not counting gapped sites, normalized by the length of the sequence. We then used maximum likelihood to fit the neutral model and each of the three pathogenic models described above, using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm to maximize the likelihood functions. We repeated the fit using each of our filtered variant datasets and alignment methodologies, as well as discarding all variants that were only found in the alignment in a single sequence. All three models fit reasonably well, and all produced qualitatively similar results on all datasets and

alignment methodologies.

54

2.12 CPD Prediction Method

Our prediction method is implemented in Perl, and the source code is available.

To calculate the probability that a variant is a CPD, we find the minimum distance to the variant in the multiple sequence alignment and apply Bayes’ Law. We use the third likelihood model described above, where the CPD fixes at the neutral rate, using the maximum likelihood inferred parameter values. As our prior we use the well- established result that 10% of variants seen in another sequence are CPDs. Candidate compensation sites are identified by collecting all substitutions found in any sequence containing the candidate CPD, prioritizing sites that are substituted in many sequences over sites that are substituted in only a few sequences.

2.13 DM048 Whole Exome Sequencing

Research study participants were enrolled upon informed consent according to protocols approved by the Duke University Internal Review Board. We conducted paired-end pre-capture library preparation by fragmenting genomic DNA through sonication, ligating it to the Illumina multiplexing PE adapters, and PCR amplification using indexing primers. For target enrichment/exome capture we enriched the pre- capture library by hybridizing to biotin labeled VCRome 2.1 [142] in-solution Exome

Probes at 47°C for 64-72 hours. For massively parallel sequencing, the post-capture

55

library DNA was subjected to sequence analysis on an Illumina Hiseq platform for 100 base-pair (bp) paired-end reads (130x median coverage, >95% target coverage at 10x).

Primary data were interpreted and analyzed by Mercury 1.0; the output data from

Illumina Hiseq were converted from bcl files to FastQ files by Illumina CASAVA 1.8 software, and mapped by the BWA program. We performed variant calls using Atlas-

SNP and Atlas-indel [143].

2.14 bbs4, rpgrip1L, btg2, and nos2 Morpholino Design

Morpholinos targeting zebrafish bbs4 and rpgrip1L were obtained from Gene

Tools, and described previously [126,144]. Morpholinos against zebrafish btg2 and nos2

were obtained from Gene Tools (Extended Figure 3; sequences available upon request).

2.15 Site-Directed Mutagenesis

Mutant alleles were generated as described[126]. Sequences were validated via

Sanger sequencing on Applied Biosystems 3730xl DNA Analyzer.

2.16 bbs4, rpgrip1l, btg2, and nos2 mRNA Synthesis and Zebrafish Embryo Injection

mRNA was transcribed in vitro as described[145] using SP6 Message Machine kit

(Ambion). MO and mRNA concentrations were determined based on the combination

56

by which WT mRNA efficiently rescued the morphant phenotype. The same concentrations were used for rescue with mutant mRNA or injection of mRNA alone.

The MO and mRNA concentrations injected were as follows: 0.7ng bbs4 MO and 100 pg

BBS4 mRNA; 5ng rpgrip1l MO and 100 pg RPGRIP1L mRNA; 3ng btg2 MO and 150 pg

BTG2 mRNA; 8ng nos2a; 8ng nos2b. All animal work was performed in accordance with

the protocols and guidelines of the Duke Institutional Animal Care & Use Committee.

2.17 Classification and Scoring of Embryos

Embryos injected with bbs4, rpgrip1l , cep76, or e2f4 MOs were classified into two graded phenotypes on the basis of the relative severity compared with age-matched uninjected controls from the same clutch, as described [126]. Comparisons between injections of MO alone, mRNA alone, mutant rescue, and WT rescue were performed by

χ2 test.

Embryos injected with btg2 MO at were fixed in 4% paraformaldehyde at either 2

dpf or 4 dpf. Two dpf embryos were stained for HuC/D or phospho-histone H3. HuC/D

scored and quantified as described [126]. Four dpf embryos were transferred to 1x PBS

and bright-field dorsal images were captured; we assessed head size by measuring the

distance from the anterior-most region of the forebrain to the hindbrain as defined by

the attachment of the pectoral fins.

57

Phosphohistone H3 cell quantification was done using the Image-based Tool for

Counting Nuclei (ITCN) plugin for the ImageJ software. Rolling background subtraction

(25 pixel radius) and outlier removal (2.5 pixel radius, threshold = 5) were used to process images. Linear measurement of a typical cell was used to determine cell radius for ITCN analysis. Threshold level was set to 0.5. Statistical comparisons between groups were performed with a student’s t-test.

58

Chapter 3 – Identification of a Novel Driver of Bardet– Biedl Syndrome 3.1 Introduction

Bardet-Biedl syndrome (BBS; MIM 209900) is a phenotypically variable disorder inherited primarily in an autosomal recessive paradigm, though oligogenic inheritance has been demonstrated for a number of causal genes [146]. Hallmarks of the disorder include retinal degeneration, obesity, polydactyly, intellectual disability, kidney disease, and hypogonadism [147,148]. Over the last decade and a half, 21 causal drivers of BBS have been identified, identified almost exclusively via locus-specific DNA sequencing

[133,134,136,137,149-166]. This disorder is caused by defects in the function of a singular organelle, the primary cilium, which is present on nearly all somatic cells and functions in a wide array of cellular signalling processes, most notably in kidney tubule formation, trafficking of rhodopsin in the photoreceptor cells of the retina, and development of planar cell polarity [150,167,168].

This genetic heterogeneity among patients affected with BBS has complicated diagnosis and genetic counselling of families, not only because a significant number of causal drivers remain undiscovered (roughly 20% of cases of Northern European descent do not have an identified causal locus [169]), but also due to the high incidence of additive and/or epistatic interactions between genes [170-172]. Inheritance has also been shown to be non-Mendelian in some cases, involving at least three mutant alleles in two genes that act as modifiers of either penetrance or expressivity [173,174]. 59

Here we describe the targeted resequencing of a cohort of BBS patients, the contribution to mutational load from known BBS genes, and the subsequent identification of CEP76 as a novel driver of BBS. In vitro complementation assays allowed for robust interrogation of two mutant alleles present within one family, and led to insight on the potential mechanism of pathology observed in the patient.

3.2 Sequencing the Ciliary Proteome in a Ciliopathy Cohort

As part of our ongoing efforts to determine how the mutational burden of the ciliary proteome can contribute to, at least in part, the phenotypic variability underscoring the ciliopathies, we conducted targeted resequencing of 457 affected individuals. This cohort spanned the ciliopathy spectrum of severity, ranging from the severe primary ciliopathies Meckel-Gruber Syndrome (MKS), Jeune Asphyxiating

Thoracic Dystrophy (JATD) and Orofaciodigital syndrome (OFD); to moderate primary ciliopathies Bardet-Biedl Syndrome (BBS), Joubert syndrome (JBTS) and Usher syndrome (USH); and also included the motile ciliopathy Primary Ciliary Dyskinesia

(PCD). Importantly, these individuals were included without prior selection based on known causal mutations.

We prioritized 785 genes from the ciliary proteome (www.ciliaproteome.org and

[65]), identified by at least two different ciliary proteomics studies (n=693, [175-184]), as well as genes with a published or putative association with ciliary disease that were not

60

in the ciliary proteome database (n=92). In collaboration with the Human Genome

Sequencing Center at Baylor College of Medicine (BCM-HGSC), we conducted a targeted exon capture followed by next-generation sequencing on an Illumina platform.

Specifically, we conducted pre-capture pooling of 23 samples/pool, NimbleGen targeted liquid capture of 785 ciliary genes (12,000 exons; 1.9 Mb of target) and next-gen sequencing using the Illumina HiSeq2000 (paired-end 100bp reads with two pools/lane;

10 lanes total). This paradigm resulted in a mean of 110x coverage with 92% of bases achieving >20x coverage and 99% of targeted regions captured. This dataset contained a putative ~80,000 single nucleotide variants (SNVs) and ~2,000 insertion-deletions

(indels) with a minor allele balance >20% and coverage >10x.

3.3 Mutational Load in Bardet-Biedl Syndrome

Next, we focused our analysis on a cohort of 100 affected individuals who fulfilled the clinical diagnostic criteria of BBS ([147,148]). First, the cohort was evaluated for mutations in known BBS genes (BBS1, BBS2, BBS3, BBS4, BBS5, BBS6, BBS7, BBS8,

BBS9, BBS10, BBS11, BBS12, BBS13, BBS14, BBS15, BBS16 and BBS17) using a paradigm

that was unbiased by prior mutational findings. To obtain a comprehensive

understanding of the subset of patients with mendelizing lesions in known BBS genes,

the data from the sequencing of the ciliary proteome were combined with a high

resolution array Comparative Genomic Hybridization (CGH) analysis interrogating the

61

same a subset of ciliary genes. A total of 75 affected individuals harbored recessive changes and/or CNVs in the known BBS loci evaluated (Table 5), while 25 patients remained without a primary recessive driver gene.

62

Table 5: Patients with Recessive Changes in Known BBS Genes

Patient Gene Transcript Nucleotide Protein Previous Zygosity identifier name identifier change change Reports 32-3 BBS4 NM_033028 n/a ex3-4del Hom 44-3 BBS7 NM_176824 c.87_88del p.29_30del Hom [134] 53-II-3 ARL6 NM_177976 c.497-2A>T Splice defect Hom AR003- NM_0010336 c.190C>T p.Q64X Het BBS9 05 04 c.1120C>T p.R374X Het AR037- SDCCAG c.696T>G p.Y232X Het [185] NM_006642 07 8 c.740delG p.R247fs Het [185] AR045- c.1169T>G p.M390R Het [186] BBS1 NM_024649 05 c.1285C>T p.R429X Het [186] AR075- Hom [186] BBS1 NM_024649 c.1169T>G p.M390R 03 AR112- c.436C>T p.R146X Het [187,188] BBS1 NM_024649 04 c.1169T>G p.M390R Het [187,188] AR124- c.472delG p.V158fs Het [188] BBS2 NM_031885 03 c.1371delG p.L457fs Het [188] c.1114_1115de Het [137] AR198- p.372_372del BBS12 NM_152618 l 04 c.1619G>T p.G540V Het [137] AR201- Hom [189] BBS10 NM_024685 c.272_273insT p.C91fs 06 AR231- c.1169T>G p.M390R Het BBS1 NM_024649 02 c.1553T>C p.L518P Het Unpub AR232- c.272_273insT p.C91fs Het BBS10 NM_024685 03 c.1677delC p.Y559X Het AR236- NM_0010336 c.801_802insT p.N267fs Het Unpub BBS9 03 04 c.1198-2A>G Splice defect Het Unpub AR240- c.1169T>G p.M390R Het BBS1 NM_024649 03 n/a ex10-17del Het AR244- Hom BBS1 NM_024649 c.1169T>G p.M390R 03 AR246- c.1169T>G p.M390R Het BBS1 NM_024649 03 n/a ex14-17del Het AR249- Hom BBS1 NM_024649 c.1169T>G p.M390R 03 AR291- Hom BBS1 NM_024649 c.1169T>G p.M390R 05 c.169A>G p.T57A Het [155] AR301- MKKS NM_018848 c.433_434del p.145_145del Het 03 c.429_430del p.143_144del Het c.175C>T p.Q59X Het AR309- BBS2 NM_031885 c.1528_1539de Het 03 p.510_513del l AR316- c.1092delA p.T364fs Het [137] BBS12 NM_152618 03 c.2133G>C p.Z711Y Het [137] AR323- Hom BBS10 NM_024685 c.272_273insT p.C91fs 03 63

AR332- Hom BBS10 NM_024685 c.272_273insT p.C91fs 04 AR347- Hom BBS1 NM_024649 c.1169T>G p.M390R 03 AR438- NM_0010996 c.38C>T p.A13V Het TRIM32 03 79 c.325C>T p.R109W Het AR349- Hom BBS10 NM_024685 c.272_273insT p.C91fs 04 AR365- c.272_273insT p.C91fs Het [189] BBS10 NM_024685 03 c.687delT p.P229fs Het [189] AR372- c.272_273insT p.C91fs Het BBS10 NM_024685 04 c.791G>T p.C264F Het AR380- c.1169T>G p.M390R Hom BBS1 NM_024649 03 ex9-12del n/a Het c.145C>T p.R49W Het [189] AR381- BBS10 NM_024685 c.2119_2120de Het 03 p.707_707del l AR385- Hom Unpub BBS2 NM_031885 c.2010insG n/a 03 AR400- Hom BBS4 NM_033028 n/a ex5-6del 03 AR433- c.909_912del p.303_304del Het [189] BBS10 NM_024685 05 c.2030G>T p.G677V Het [189] AR596- c.72C>G p.Y24X Het BBS2 NM_031885 04 c.534+1G>A Splice defect Het AR603- c.436C>T p.R146X Het BBS1 NM_024649 05 c.1169T>G p.M390R Het AR634- c.878A>T p.Q293L Het Unpub BBS7 NM_018190 03 n/a ex16-17del Het AR707- c.272_273insT p.C91fs Het [189] BBS10 NM_024685 06 c.1407T>G p.Y469X Het [189] AR752- c.121G>C p.G41R Het MKKS NM_018848 03 c.1649A>C p.Q550P Het AR758- Hom Unpub BBS2 NM_031885 c.814C>T p.R272X 03 AR796- Hom BBS1 NM_024649 c.1169T>G p.M390R 04 AR799- c.1169T>G p.M390R Het Unpub BBS1 NM_024649 04 n/a p.Y434S Het Unpub AR800- Hom BBS4 NM_033028 n/a ex5-6del 03 AR829- Hom BBS1 NM_024649 c.1169T>G p.M390R 04 AR831- Hom BBS1 NM_024649 c.1169T>G p.M390R 03 AR839- c.1237C>T p.R413X Het [188] BBS2 NM_031885 04 c.1438C>T p.R480X Het [188] AR840- Hom [190] BBS7 NM_018190 c.198T>G p.I66M 04 AR847- BBS9 NM_198428 c.1876_1879de p.626_627del Het [188]

64

03 l c.2001_2002in Het [188] p.S667fs sGA AR873- c.299G>A p.R100Q Hom BBS1 NM_024649 03 c.951-2A>G Splice defect Hom AR879- Hom BBS1 NM_024649 c.1169T>G p.M390R 05 AR880- Hom BBS10 NM_024685 c.272_273insT p.C91fs 04 AR883- c.2119_2120de Hom BBS10 NM_024685 p.707_707del 04 l AR888- c.1645G>T p.E549X Het [134] BBS1 NM_024649 03 n/a ex1-11del Het DM002- c.1670G>A p.R557H Het CEP290 NM_025114 003 c.2512A>G p.K838E Het DM003- MKKS NM_018848 c.121G>C p.G41R Hom 003 DM006- NM_0011780 Hom BBS12 c.1080delG p.E360fs 003 07 DM007- c.298_300del p.100_100del Het BBS12 NM_152618 003 c.1237C>G p.L413V Het DM008- BBS1 NM_024649 c.1169T>G p.M390R Hom 003 DM009- MKKS NM_018848 c.110A>G p.Y37C Hom 003 DM010- c.52_53insC p.S18fs Het BBS5 NM_152384 003 c.618-1G>C Splice defect Het DM012- BBS10 NM_024685 c.272_273insT p.C91fs Hom 003 DM014- BBS1 NM_024649 c.1169T>G p.M390R Hom 003 c.272_273insT p.C91fs Het Unpub DM015- BBS10 NM_024685 c.1813_1814in p.N605delins Unpub 003 Het sGGT GN DM032- c.3G>A p.M1I Het BBS1 NM_024649 004 c.724_726del p.242_242del Het DM034- ARL6 n/a n/a ex4-9del Hom 004 c.190C>T p.Q64X Het DM035- NM_0010336 c.1363G>T p.A455S Het BBS9 003 04 c.2343C>A p.C781X Het Unpub n/a int21del Het DM036- BBS1 NM_024649 c.1169T>G p.M390R Hom 003 DM038- c.6191A>G p.E2064G Het CEP290 NM_025114 003 c.3710G>A p.R1237H Het c.272_273insT p.C91fs Het Unpub F1-04 BBS10 NM_024685 c.391delC p.Q131fs Het NM_0010336 I2220B BBS9 c.1373C>G p.S458X Hom 04

65

KK011- Unpub ARL6 n/a n/a ex8del Hom 04 KK044- [188] BBS4 NM_033028 c.156-2A>G Splice defect Hom 03 KK047- BBS2 NM_031885 n/a ex8-15del Hom 05 c.9_14del p.3_5del Het [134] RC1-03 BBS10 NM_024685 c.926T>C p.L309P Het [134] Whole gene RC2-03 NPHP1 n/a n/a Hom del

66

3.4 Dissection of the 25 Patients without Molecular Genetic Diagnoses for the Identification of Novel BBS Candidate Genes

To identify candidate novel drivers of BBS, these 25 patients were analyzed further for variants in genes of the ciliary proteome which were present under a recessive paradigm. The variants (single nucleotide polymorphisms [SNPs] and small insertion-deletion changes [indels]) called for each patient were filtered against public databases (Exome Variant Server: http://evs.gs.washington.edu/EVS/; NCBI:

http://www.ncbi.nlm.nih.gov/) and a dataset generated from a cohort of 2,300

atherosclerosis patients (ARIC; internal to BCM-HGSC) to determine rare variants; variants above a minor allele frequency of 1% were removed. Next, all variants that were mapped as intronic, intergenic, synonymous or affecting a non-coding RNA

transcript (ncRNA) were also excluded from further analysis. Finally each patient’s

filtered SNP and indel files, was intersected and evaluated for genes that harbored

mutations consistent with a recessive model of inheritance (two mutations in

heterozygosity or a single homozygous mutation).

To minimize the possibility of false-positive calls, the quality of each identified

variant was evaluated with the Integrative Genomics Viewer

(https://www.broadinstitute.org/igv/). Variants with an alternate allelic balance of <20%

and variants lying in areas covered by <5 reads were excluded from further analysis. In

toto, 14 patients were identified with putative recessive mutations in candidate novel

BBS candidate genes (Table 6).

67

Table 6: Recessive Changes in Candidate BBS Loci

Patient Gene Transcript Nucleotide change Protein change identifier name identifier

AR112-04 No recessive candidates in the ciliary proteome

c.83_85del p.28_29del AR234-02 LRRC48 NM_001130092 c.1091C>T p.A364V

c.1537G>A p.A513T AR243-03 RPRIP1 NM_020366 c.1918G>A p.A640T

c.10499+1G>A splicing defect PKD1 NM_001009944 c.3101A>G p.N1034S AR333-04 c.83_84insG p.I28fs CEP76 NM_024899 c.1520_1521insT p.S507fs

c.189-1insA splicing defect TXNDC9 NM_005783 c.563-4delT splicing defect AR386-03 c.698T>G p.V233G SEPT1 NM_052838 c.848A>G p.D283G

c.3153G>T p.D1051Y

c.3191A>C p.E1064A AR605-05 GPR98 NM_032119 c.8407G>A p.A2803T

c.12269C>A p.T4090N

AR704-03 No recessive candidates in the ciliary proteome

c.722G>A p.R241H E2F4 NM_001950 c.917_918insCAG p.D306delinsDS AR719-03 splicing defect TMEM216 NM_001173991 c.431-2insA (hom) (hom)

AR736-03 No recessive candidates in the ciliary proteome

68

AR786-03 No recessive candidates in the ciliary proteome

AR811-03 No recessive candidates in the ciliary proteome

AR838-02 No recessive candidates in the ciliary proteome

AR884-03 No recessive candidates in the ciliary proteome

c.349G>A p.D117N AR896-04 TTC37 NM_014639 c.3014-4delT splicing defect

AR898-03 No recessive candidates in the ciliary proteome

DM0004-003 No recessive candidates in the ciliary proteome

DM005-003 No recessive candidates in the ciliary proteome

DM029-003 No recessive candidates in the ciliary proteome

c.12557T>C p.I4186T DM033-003 USH2A NM_206933 c.15029A>T p.Q5010L

USH2A cc.7541A>G p.N2514S NM_206933 KK007-03 c.9110G>A p.R3037H

WDR52 NM_001164496 c.2674A>G (hom) p.M892V (hom)

c.1144A>G p.T382A

KK015-03 ALMS1 NM_015120 c.3035A>G p.Y1012C

c.10559A>G p.K3520R

p.T250T/ splicing c.750A>T defect LE1-03 KDM3A NM_018433 c.3115G>A p.E1039K

69

3.5 Sanger Confirmation of Novel Candidate Genes

Filtered alleles in genes previously unknown to be causal for BBS were Sanger- confirmed in patients and any related family members for whom DNA was available.

Fourteen families were found to have mutations in ciliary genes not yet associated with disease; in four families mutations segregated with disease (Figure 5). Families AR333,

AR605, AR719, and DM033 had candidate pathogenic mutations in CEP76, GPR98, E2F4, and USH2A, respectively. The size of the GPR98 and USH2A proteins (6306 and 5202 residues, respectively), precluded us from performing in vivo testing; as such, the pathogenicity of the discovered alleles could not be determined. We therefore proceeded with in vivo functional assays of CEP76 and E3F4 and the mutations found in families

AR333 and AR719. Patient AR333-04 carried two frameshift mutations in CEP76, each inherited from one parent, and introducing premature termination within several codons (Figure 5A). Patient AR719-03 carried a point mutation in E2F4 encoding an

R241H missense change, and a trinucleotide CAG insertion (insCAG) into a region of 13

CAG repeats (Figure 5C).

70

Figure 5: Candidate Gene Mutation Segregation

A) Patient AR333-04 inherited one frameshift mutation from each parent, I28fs from his father and S507fs from his mother. Each mutation creates a stop codon downstream. B)

Patient AR605-05 inherited two point mutations each from his mother (A2803T/T4090N) and father (D1051Y/E1064A). C) Patient AR719-03 inherited both a point mutation

(R241H) from his mother and an in-frame insertion (insCAG) from his father. The insCAG mutation occurs at a stretch of 13 CAG trinucleotide repeats and yields an additional CAG repeat, causing an additional serine to be translated. D) Patient DM033-

03 is compound heterozygous for the I4186T/Q5010L mutations. DNA from the parents was not available. 71

3.6 Early Gastrulation Defects in CEP76 and E2F4 dysfunction

Previous studies have shown that suppression of BBS genes can induce gastrulation defects in zebrafish embryos [126,137,167]. It is likely that defects in planar cell polarity (PCP) precipitate this phenotype [56], as well as the various phenotypes observed in patients with BBS, such as retinopathy, kidney disease, situs inversus, obesity, and developmental delay [191]. Additionally, these phenotypes stem from the

same pathophysiology of aberrant ciliary function, which is the causative pathology for

all genes identified as drivers of BBS [191]. We posited that the gastrulation assay of

convergent extension (CE) would be a useful surrogate assay for characterizing the

candidacy of additional BBS genes.

Reciprocal BLAST of CEP76 and E2F4 against the zebrafish genome identified a

single orthologue for each gene. Upon confirmation by RT-PCR (data not shown), we

designed splice-blocking morpholinos (MOs) against both cep76 and e2f4, and validated

the efficiency of knockdown of endogenous message by RT-PCR. Injection of increasing

concentrations of MO showed dose-dependent early developmental defects in

gastrulation, as measured by CE (Figure 6B). This phenotype is consistent with previous

reports of morphant phenotypes of other bbs genes ([126,137,189]).

Because of the high level of homology between human and zebrafish bbs genes,

we reasoned that expression of human mRNAs should be able to rescue the phenotype

observed in the two morphants. Co-injection of wild-type human CEP76 mRNA with

72

cep76 MO was able to rescue the morphant phenotype, from 60% of embryos affected when injected with MO alone, to 27% affected when injected with both cep76 MO and wild-type CEP76 mRNA (n=50-100 embryos, p >0.0001, Figure 6C). Likewise, co- injection of wild-type human E2F4 mRNA with e2f4 MO was able to rescue, from 58% to

25% affected (n=50-100 embryos, p < 0.0001, Figure 6C).

We next evaluated the function of the two mutations in each of CEP76 and E2F4 found in our families. Human CEP76 mRNA containing either the I28fs (n= 50-100 embryos, done in triplicate, p = 0.15) or S507fs (n= 50-100 embryos, in triplicate, p = 0.03, rescue worse than MO alone) mutation was unable to rescue the cep76 MO induced

CE/gastrulation defect, being indistinguishable from MO alone (Figure 6C). Human

E2F4 mRNA containing the R241H mutation was able to partially rescue the morphant phenotype, but to a level lower than that of wild-type mRNA, indicating that this mutation is a functional hypomorph (n= 50-100 embryos, in triplicate; p = 0.024 vs MO, p

= 0.002 vs wt). However, E2F4 insCAG mRNA rescued to a level indistinguishable from wild-type, suggesting that this mutation is functionally benign (n= 50-100 embryos, in triplicate; p = 0.80, Figure 6C). Injection of E2F4 R241H mRNA alone was not significantly different from controls (data not shown), indicating that the R241H mutation does not act in a dominant negative fashion (n= 60 embryos, p = 0.51).

The functionally benign quality of the E2F4 insCAG mutation, and the evidence that R241H is not a dominant negative, lead us to reject E2F4 as a primary driver of BBS

73

in AR719. While this was a negative result, it nonetheless increased our confidence in the specificity of the assay. We therefore focused solely on CEP76 and the mutations found in family AR333.

74

Figure 6: cep76 and e2f4 Gastrulation Assay

A) Normal, Class I and Class II defects in early gastrulation is defined by reduced convergent extension (CE) of the embryo around the yolk. B) Increasing dosage of both cep76 and e2f4 morpholinos results in a concomitant increase in the proportion of Class I and II embryos at 8-10 somite stage. C) Left Panel: CE defects induced by cep76 morpholino can be rescued with co-injection of wild-type human CEP76 mRNA, but not with mRNA containing either the I28fs or S507fs mutations. Right Panel: CE defects induced by e2f4 morpholino can be rescued with co-injection of wild-type human E2F4 mRNA, as well as mRNA containing the insCAG mutation. Co-injection of R241H mRNA results in a partial rescue. 75

3.7 In Vivo Retinal Assay of CEP76 Dysfunction

Although useful, the CE assay is a distant proxy to the BBS pathology. As such, we sought to evaluate the candidacy of CEP76 by in vivo complementation of phenotypes directly relevant to the patient. Upon review of the clinical data, we noted that the patient had many of the hallmarks of BBS, including retinopathy and kidney disease. Retinopathy is a primary feature of several ciliopathies; in patients with BBS, legal blindness is reached by an average age of 15 [147]. In retinal rod cells, rhodopsin is transported from the cell body to the outer segment via the connecting cilia (known as the transition zone) [192]. Defects in rhodopsin transport to the outer segment cause rod cell apoptosis and subsequent vision loss. Previous studies have demonstrated that loss of BBS gene function in vivo results in rhodopsin mislocalization followed by

progressive photoreceptor degeneration [193,194].

We asked whether cep76 morphant embryos also had retinal defects.

Cryosectioning of 5 dpf control and MO injected embryos showed a reduction of

rhodopsin staining in the outer segment of rod cells, with a concomitant increase in

rhodopsin staining in the inner segment and cell body (Figure 7A). This aberrant

rhodopsin localization could be rescued by co-injection with wild-type mRNA, but not

with mutant S507fs RNA (Table 7, Figure 7B). To test whether this deficiency in

rhodopsin trafficking was associated with a defect in ciliary function or structure, we

evaluated cilia structure in photoreceptor cells by co-staining 5dpf retina sections with

76

acetylated tubulin and centrin (labeling the cilia axoneme and basal body, respectively), as well as DAPI staining (labeling the nuclei, allowing for localization of the cell body).

We found that treatment of zebrafish embryos with cep76 MO resulted in significant defects in ciliary structure in photoreceptors as evidenced by a reduction in both the length and overall number of cilia, and that these defects could be rescued by co- injection of wild-type mRNA, but not mutant S507fs mRNA (Table 8, Figure 7C).

Table 7: Rhodopsin Staining

N % Abnormal p vs Ctrl p vs MO Control 12 8.3 - <0.0001 cep76 MO 11 72.7 <0.0001 - cep76 MO + WT 10 20.0 0.18 0.0003 cep76 MO + 12 66.7 <0.0001 0.56 S507fs

Table 8: Average Cilia Count and Length

Control cep76 MO cep76 MO + WT cep76MO + S507fs Ave Cilia Count 24.4 9 22 10.4 Standard Dev 4.2 1.6 3.1 2.4 p vs ctrl - <0.0001 0.24 >0.0001 p vs MO - - <0.0001 0.22 Ave Cilia Length 100 62 106 66 (a.u) Standard Dev 8.3 11.8 10.1 12.1 p vs Ctrl - <0.0001 0.07 <0.0001 p vs MO - - <0.0001 0.27 N = 5 for average cilia count, N = 20 for average cilia length

77

Next, we evaluated photoreceptors in control or MO-injected embryos for cell death. TUNEL staining of the retina of control and MO-injected embryos showed an increase in TUNEL-positive nuclei in the inner nuclear layer (INL) and outer nuclear layer (ONL), though to a lesser extent (Figure 8A, B). In 5 dpf control embryos the retina contains an average of 0.2% apoptotic cells, compared with an average of 2.6% in morphant embryos (Table 9, Figure 8A, B). This increased cell death in morphants can be rescued with co-injection of wt CEP76 mRNA (0.2% apoptotic cells), but not with mutant

S507fs mRNA (2.8% apoptotic cells) (Table 9, Figure 8C, D). This recapitulates previous studies in Bbs null mice that have shown that rhodopsin mislocalization begins at 6-8 weeks of age postnatally (as rhodopsin is not produced prior to birth), causing photoreceptor degeneration through apoptosis in the following months [193-195].

Table 9: 5dpf Retinal TUNEL Staining and Quantification

cep76 MO + cep76MO + Control cep76 MO WT S507fs Total Nuclei (ave, ± St Dev) 901 ± 46.8 971.5 ± 61.5 910 ± 71.7 1014 ± 74.2 TUNEL positive (ave, ± St 1.75 ± 0.4 25.0 ± 6.1 1.5 ± 0.2 28.3 ± 5.6 Dev) % TUNEL (ave, ± St Dev) 0.2 ± 0.05 2.6 ± 0.6 0.2 ± 0.06 2.8 ± 0.7 p vs Ctrl - 0.007 0.537 0.008 p vs MO - - 0.006 0.661 N = 4 for all sections

78

Figure 7: Immunostaining of Retinal Cryosections for Rhodopsin and Cilia of cep76 Morphant and Rescue embryos 79

A) In 5dpf control embryos, rhodopsin staining is localized to the outermost section of the photoreceptor cell (PRC) layer. B) cep76 morphant embryos show reduced rhodopsin staining in the outer segment of the rod cell, with increased staining in the inner segment and cell body. This aberrant rhodopsin localization can be rescued with co- injection of wild-type mRNA, but not with mutant S507fs RNA. C) Staining for cilia in the PRC of 5dpf control and morphant embryos shows a decrease in cilia number and length in morphants, which can be rescued with co-injection of wild-type mRNA, but not with mRNA containing the S507fs mutation. GCL, ganglion cell layer; IPL, inner plexiform layer; INL, inner nuclear layer; OPL, outer plexiform layer; PRC, photoreceptor cells; RPE, retinal pigment epithelium.

80

Figure 8: TUNEL Staining of Retinas from 5dpf cep76 Morphant and Rescue embryos

81

A) TUNEL staining of the retina of 5dpf control embryos demonstrates TUNEL-positive nuclei in 0.2% of cells. B) In morphant embryos, 2.6% of nuclei are TUNEL-positive. C)

Rescue with wild-type mRNA decreases TUNEL staining to 0.2% of nuclei, whereas D) rescue with S507fs mRNA does not rescue morphant TUNEL staining (2.8% TUNEL- positive nuclei).

3.8 Renal Defects in CEP76 Morphants

Given that our index patient also presented with kidney disease, we tested whether cep76 morphants had observable renal defects. At 5 dpf, morphant embryos had marked edema around the gut, heart, and eyes (Figure 9A). Edema in embryos can be a marker of reduced kidney function; however it is also a non-specific phenotype of poor embryo health. Whole mount staining of PFA fixed embryos for the Na+/K+ ATPase allowed us to visualize the proximal and distal tubules of the kidneys. In control uninjected embryos, the proximal tubules are convoluted, and feed into the straight distal tubules, which run parallel along the ventral body axis. In embryos injected with cep76 MO, however, the proximal tubules are distended and non-convoluted, while the distal tubules follow a tortuous route along the ventral side (Figure 9B). It has been shown previously that morpholino-induced knockdown of the zebrafish homolog for the BBS gene NPHP1 causes abnormal renal morphology [134]. Similar to cep76

82

morphants, these embryos had atrophy and reduced convolution of the proximal tubule, however the distal tubule remained unaffected.

We quantified the degree of convolution of the distal tubules by measuring the total length of each tubule and dividing it by the total distance of displacement (Figure

10A). We found that the degree of distal tubule convolution was increased significantly in morphants when compared to controls, from an average degree of convolution of 1.02

± 0.003 (SEM; n=7) for controls to 1.10 ± 0.009 (SEM; n=11, p < 0.0001) for MO-injected embryos (Figure 10B). Co-injection with wild-type CEP76 mRNA was able to rescue the convolution (1.01 ± 0.004 SEM, n=10, p = 0.55 vs controls), while injecting with S507fs mutant mRNA was not able to rescue (1.10 ± 0.010 SEM, n=16, p = 0.82 vs MO; Figure

10B).

83

Figure 9: Na+/K+ ATPase Staining of 5dpf cep76 Morphant Embryos

A) At 5dpf, cep76 morphant embryos have marked edema around gut, heart, and eyes.

B) Na/K ATPase staining shows straight distal tubules and convoluted proximal tubules in control embryos. cep76 morphant embryos have tortuous distal tubules, with dilated and less convoluted proximal tubules. 84

Figure 10: Quantification of Tubule Convolution

A) Degree of convolution was determined by measuring the length along the path of the distal tubules (red line), and dividing by the total distance covered (yellow line). B) The degree of convolution for cep76 morphant embryos is significantly increased from controls. Co-injection with wild-type mRNA rescues this convolution, whereas injection with S507fs mRNA does not. Injection of mutant RNA on its own was not significantly different than controls 85

3.9 Discussion

Identification of primary drivers of BBS has been based traditionally on positional cloning approaches, which are limited due to the requirement of large pedigrees for preliminary linkage analysis [155,161,165,196]. Here we describe a targeted approach to the identification of novel drivers of BBS in a cohort of 100 BBS patients, for which extended familial pedigrees were not available. Targeted sequencing of 785 genes comprising the cilia proteome ([65]) enabled high depth of coverage as well as broad interrogation of functional candidates.

Targeted sequencing of the ciliary proteome in 100 BBS patients identified 14 for which mutations were found under a recessive paradigm in genes not yet associated with BBS, and which did not have more than one mutation (or a single mutation in homozygosity) in known BBS genes. Sanger confirmation of patients and related family members for which DNA was available revealed mutations in four genes (CEP76, E2F4,

GPR98, and USH2A) that segregated with disease in four separate families. In vivo complementation assays confirmed loss of function in both mutations found in CEP76

(I28fs and S507fs), which were inherited in a compound heterozygous manner in family

AR333.

CEP76 encodes the CEP76 protein, a centriolar protein that interacts with CP110, a suppressor of ciliogenesis [197-199]. Additionally, it contains a transglutaminase-like

86

(TGL) peptidase domain, also found in many known ciliopathy protein that are thought to mediate protein-protein interactions during ciliogenesis [200]. The , as a main constituent of the centrosome, has a dual role in not only the formation of spindle poles during but also as the originating structure of the basal body that nucleates the primary cilium [201,202]. In corroboration of these non-mitotic functions of the centrosome and its components, proteomic analysis of centrosomal proteins found that a number of known drivers of ciliopathies, with phenotypes that had no obvious connection to cell-cycle defects [182]. Conversely, BBS4 has been shown to localize to the centrosome, in addition to their canonical localization at the basal body of the cilia [203].

atient AR333-04 exhibited many phenotypes that are hallmarks for BBS, including retinal degeneration and kidney disease. Similarly, our in vivo of CEP76 dysfunction displayed phenotypes similar to those previously seen in other BBS genes, such as gastrulation defects, aberrant rhodopsin trafficking and retinal degeneration, and anomalous kidney tubule morphology [134,193,194]. While the assertion that CEP76 is a novel driver of BBS stems from a single familial pedigree, the combination of phenotypic, genetic, and functional evidence strongly supports this claim.

Finally, these findings add to the gradually increasing catalog of genes shown to cause BBS when the function of the translated protein is disturbed. Furthermore, the inclusion of CEP76 as a BBS gene may lead to resolution of unresolved cases in which

CEP76 may act as a modifier of, or in epistasis to, previously identified gene variants.

87

Chapter 4 – Copy Number Variation Contributes to the Mutational Load of Bardet–Biedl Syndrome

This chapter is adapted from Lindstrand et al (2015), currently under revision for

The American Journal of Human Genetics.

Bardet-Biedl syndrome (BBS) is a defining ciliopathy, notable for extensive allelic and genetic heterogeneity and oligogenic inheritance; almost all of which has been identified through locus-specific DNA sequencing. Recent data, however, have suggested that copy number variants (CNVs) also contribute to BBS. We used a custom oligonucleotide array comparative genomic hybridization (aCGH) covering all 20 genes that encode intraflagellar transport (IFT) components and 74 ciliopathy loci to screen 92

unrelated individuals diagnosed with BBS, irrespective of their known mutational

burden. We identified 17 individuals with exon-disruptive CNVs (18.5%), including 13 different deletions in eight BBS genes (BBS1, BBS2, ARL6/BBS3, BBS4, BBS5, BBS7, BBS9,

and NPHP1); and a deletion and a duplication in other ciliopathy genes (ALMS1 and

NPHP4, respectively). By contrast, we found a single heterozygous exon-disruptive

event in a BBS gene (BBS9) in 229 controls. Superimposing these data with BBS gene

resequencing revealed CNVs to: a) be sufficient to cause disease; b) Mendelize

heterozygous deleterious alleles; and c) contribute oligogenic alleles by combining point

mutations and exonic CNVs in multiple genes. Finally, we found a deletion and a splice

88

site mutation in IFT74, inherited under a recessive paradigm, defining a new candidate BBS locus. Our data suggest that CNVs contribute pathogenic alleles to a substantial, hitherto unappreciated, fraction of BBS individuals. Importantly, the small size (<20 kb) of most of these CNVs suggests that a combination of exon resolution copy number analysis and resequencing unbiased for previous mutational burden is necessary to delineate the complexity of disease architecture.

4.1 Introduction

Bardet–Biedl syndrome (BBS; MIM209900) is a rare (1:13,500-1:160,000) multisystemic developmental disorder characterized by retinal dystrophy, obesity, polydactyly, intellectual disability, renal dysfunction, and hypogonadism. During the past fifteen years, 21 different causal genes have been identified [133,134,136,137,149-

166], and in vivo and in vitro studies have established the cellular basis of BBS as a defect of ciliary function [150]. Biochemical characterization of the cilium has shown that it is composed of at least four different functional complexes: the BBSome, the transition zone, and two intraflagellar transport complexes [62]. BBS proteins are thought to localize mainly to the BBSome and the transition zone[191], with the exception of recent reports of pathogenic mutations in IFT27 [149] and IFT172 [166] encoding components of

IFT complex B, localizing primarily to cilia. Furthermore, some BBS genes have also been linked to other ciliopathies, including Meckel-Gruber syndrome (MKS,

89

MIM249000), Joubert syndrome (JBTS, MIM213300), Senior-Løken syndrome (SLS,

MIM266900), Nephronophthisis (NPHP, MIM256100), and Leber congenital amaurosis

(LCA, MIM204000), illustrating the genetic, biological and clinical overlap within these genetic disorders [133,204-207]. Alström syndrome (ALMS; OMIM 203800), caused by mutations in ALMS1, overlaps clinically with BBS, sharing common characteristics such as obesity, diabetes, and retinal dystrophy [208,209].

Although BBS is viewed traditionally as an autosomal recessive Mendelian trait, oligogenic inheritance has been documented for most BBS genes. In rare cases, at least one deleterious (typically heterozygous) allele in a different BBS gene modifies the penetrance of the disorder [63,150,209-212]. More commonly, trans modifier alleles contribute to variable expressivity, not only in BBS, but across the ciliopathy spectrum

[127,135]. For example, a common c.685G>A (p.Ala229Thr) mutation in RPGRIP1L is associated with retinitis pigmentosa in ciliopathies [127], similar to an NPHP1 mutation in a genetic background sensitized by recessive AHI1 mutations [213]. Moreover, resequencing across different ciliopathy cohorts unbiased for known mutations has demonstrated an enrichment of deleterious heterozygous changes in TTC21B, a key component of the retrograde IFT machinery, in affected individuals versus controls

[135]. These data have created a model in which, in addition to primary causal mutations, the mutational burden or ‘load’ in the ciliary proteome, defined as the total number, mutational type, and locus distribution of deleterious alleles, has a role in

90

defining disease presentation, especially the elaboration of endophenotypes [62].

Notably, trans modifying alleles can be deleterious or protective, as evidenced by the

ability of homozygous loss of function Mkks/Bbs6 alleles to ameliorate retinal degeneration in a Cep290 mouse model [214].

In the past decade, copy number variants (CNVs) have emerged as major contributors to the genetic burden of both rare and common disorders [25]. A number of

CNVs have been associated with human genomic disorders, such as microdeletion and duplication syndromes[21,215], as well as dose changes involving single loci [216,217], and recessive carrier states[28]. In BBS, CNVs have been reported episodically [218,219].

More recently, we have shown that the 290 kb recurrent deletion of NPHP1, usually underlying renal ciliopathies including NPHP [220,221] and SLS [222], is both a likely

driver of BBS in one family and is also enriched in patients with BBS [134]. These observations prompted us to investigate systematically whether CNVs contribute to

mutational load in BBS. Here, through high-resolution analysis of bona fide genes involved causally in BBS, other ciliopathies, and/or core ciliary components, we report exon-disruptive CNVs in 18.5% of unrelated BBS individuals, most of which are driven by recombination at intronic repetitive elements. Our analyses indicate that: a) these events can represent recessive mutational drivers, with homozygous exonic or whole gene deletions; b) they can Mendelize heterozygous deleterious variants; and c) they can unmask genetic interactions with recessive mutations in other BBS genes. When

91

combined with in vivo modeling of likely pathogenic SNVs, our observations improve the resolution of known BBS gene contribution to disease; support further a mutational burden model that includes distinct classes of variant alleles (SNV plus CNV); facilitate the identification of novel loci; and they illuminate further the genetic complexity of this disorder.

4.2 Results

4.2.1 Identification of Intragenic CNVs in BBS

To evaluate the contribution of CNVs to BBS, we investigated a cohort of 92 unrelated BBS individuals, without preselection based on known mutation status. To improve the accuracy of CNV detection and to achieve sub-kb resolution, we focused on a genetically and biologically plausible preselected set of genes. We therefore generated a custom high-resolution oligonucleotide aCGH targeting of a total of 94 genes. These included the 17 known BBS genes (at the time of the design of the array); 20 genes encoding components of the IFT complex; and 57 additional loci associated causally with ciliopathies. Using this defined gene set, we designed probes at a density of one probe per 100 bp of coding sequence and one probe per 500 bp in intragenic regions.

In addition to the previously reported recurrent deletion at the NPHP1 locus

(found in one homozygous and two heterozygous BBS individuals) and a 17 kb heterozygous deletion in BBS1 identified in the same cohort [134], we detected another

92

14 unique intragenic CNVs in 13 individuals with BBS: 13 deletions (ranging in size from

730 bp to 172 kb) and one complex intragenic duplication (Table 10, Table 11, Table 12).

These data, combined with our recent report [134], total 16 different CNVs among 17 unrelated individuals from the same BBS cohort.

93

Table 10: Mutational Findings of 17 Individuals Affected with BBS Harboring CNVs Encompassing BBS genes

In vivo Sample Primary Additional Mutational Genotype Counts in 6,503 Allele 1 Allele 2 Pathogenicity ID causal locus Burden Controls (EVS All) Score Primary Causal Locus: Homozygous CNV KK047-05 BBS2 del: exon8_15 del: exon8_15 (Pat)* (Mat) DM034-004 ARL6/BBS3 del: exon4_9 del: exon4_9 (Pat) (Mat) KK11-04 ARL6/BBS3 del: exon8 del: exon8 SDCCAG8: c.1594G>A 0 hypomorph (Pat) (Mat) (p.Glu532Lys) het (Mat) 32/3 BBS4 del: exon3_4 del: exon3_4 (U) (U) AR800-03 BBS4 del: exon5_6 del: exon5_6 MKKS/BBS6: c.724G>T AA=0/AC=58/CC=6445 dominant (Pat) (Mat) (p.Ala242Ser) het (ref 68) negative (ref 53) BBS2: c.367A>G CC=240/CT=2060/TT=419 hypomorph (p.Ile123Val) hom (Pat, Mat) 8 (ref 53) AR400-03 BBS4 del: exon5_6 del: exon5_6 BBS9 c.1993C>T TT=0/TC=42/CC=6461 null (Pat)* (Mat) (p.Leu665Phe) het (Pat)* ALMS1 del exon16_17 het (Pat)* RC2-03 NPHP1 del: whole gene del: whole gene BBS2: c.367A>G CC=240/CT=2060/TT=419 hypomorph (ref 11) (Pat) (Mat) (p.Ile123Val) hom (Pat, Mat) 8 (ref 53) Primary Causal Locus: Compound Heterozygous CNV + SNV AR380-03 BBS1 del: exon9_12 c.1169T>G (U) (p.Met390Arg) (U) AR888-03 BBS1 del: exon1_11 c.1645G>T NPHP1: c.14G>T AA=0/AC=6/CC=6497 hypomorph (ref 11) (Pat) (p.Glu549*) (Mat) (p.Arg5Leu) het (Mat) (ref 11) BBS4: c.439T>A 0 null (p.Tyr147Asn) het (Pat) BBS5: c.584A>G GG=0/GA=4/AA=6497 null (p.Asp195Gly) het (Mat) BBS7: c.442A>C 0 null (p.Asn148His) het (Mat) AR240-03 BBS1 del: exon10_17 c.1169T>G BBS5: c.551A>G GG=0/GA=61/AA=6440 null (Mat) (p.Met390Arg) (p.Asn184Ser) het (Mat) (ref 53) (Pat) AR246-03 BBS1 del: exon14_17 (U) c.1169T>G BBS10: c.2118_2119delTG 0 (p.Met390Arg) (p.Val707fs*708) het (U, ref 20) (U) AR634-03 BBS7 del: exon16_17 c.878A>C BBS4: c.137A>G GG=0/GA=92/AA=6403 null (Pat) (p.Gln293Pro) (p.Lys46Arg) het (Mat) (ref 53) (Mat) IFT74: c.1735G>A AA=0/AG=97/GG=5884 benign (p.Val579Met) het AR672-04 IFT74 del: exon14-19 c.1685-1G>T (Mat) (Pat) Heterozygous CNV Contributes Additional Mutational Burden AR883-04 BBS10 c.2118_2119delTG c.2118_2119delTG BBS5: del: exon8_12 het (Mat) (p.Val707fs*708) (p.Val707fs*708) (Pat) (Mat) 44/3 BBS7 c.87_88delCA c.87_88delCA NPHP1: del whole gene het (ref 11) (p.His29Glnfs*12) (p.His29Glnfs*12) (Mat) (Pat) (Mat) CEP290 c.2236G>C 0; GG=0/GA=64/AA=5768 hypomorph (p.E2243Q); c.6401T>C (p.I2134T) het (Mat) AR704-03 NPHP1: del whole gene het (ref 11) (Mat) BBS10: c.145C>T (p.Arg49Trp) 0 null het (Pat) (ref 53) TTC8/BBS8: c.1253A>G GG=0/GA=27/AA=6476 hypomorph (p.Gln418Arg) het (Mat) AR811-03 NPHP4: het dup exon1_28; normal exon29; het dup exon 30 (U)

94

Table 11: Breakpoint Features of 18 CNVs Detected in 17 BBS Cases

Sample hg19 CNV Array Rearrangement Breakpoint % identical Gene Size (bp) Mechanism ID coordinates Result Type Feature nucleotides

chr11:66274840- het del: simple AR888-03 BBS1 17734 AluY - AluY 92% Alu-Alu 66292574 exon1_11 nonrecurrent

1 bp chr11:66288383- het del: simple AR380-03 BBS1 5266 microhomology no homology NHEJ/ RBM 66293649 exon9_12 nonrecurrent (T)

chr11:66290130- het del: AR240-03 BBS1 14892* - - - - 66305022* exon10_17

chr11:66296361- het del: simple AR246-03 BBS1 7906 AluSp - AluSp 87% Alu-Alu 66304267 exon14_17 nonrecurrent

chr16:56529926- hom del: simple KK047-05 BBS2 8271 no homology no homology NHEJ 56538197 exon8_15 nonrecurrent

chr3:97498355- hom del: simple 5 bp DM034-004 ARL6/BBS3 85004 no homology NHEJ/ RBM 97583359 exon4_9 nonrecurrent microhomology

TG)n Simple chr3:97508904- hom del: simple Repeat KK11-04 ARL6/BBS3 4108 100% RBM/SSA 97513012 exon8 nonrecurrent 41 bp homology

chr15:73000565- hom del: simple 32/3 BBS4 6261 AluSc8 - AluSx 83% Alu-Alu 73006826 exon3_4 nonrecurrent

chr15:73006570- hom del: simple 1 bp AR800-03 BBS4 7719 no homology NHEJ/ RBM 73014289 exon5_6 nonrecurrent microhomology

chr15:73006584- hom del: simple AR400-03 BBS4 6204 AluSx - AluSg 82% Alu-Alu 73012788 exon5_6 nonrecurrent

chr2:73617393- het del: 5 bp insertion at AR400-03 ALMS1 172476 no homology NHEJ/ RBM 73789869 exon2_15 brp

chr2:170355189- het del: simple AR883-04 BBS5 9762 AluYc3 - AluSg 82% Alu-Alu 170364951 exon8_12 nonrecurrent

chr4:122749644- het del: simple AR634-03 BBS7 730 5 bp insertion no homology NHEJ/RBM 122750374 exon16_17 nonrecurrent

chr2:110875689- hom del: RC2-03 NPHP1 91840* recurrent - - - 110967529* whole gene

chr2:110875689- het del: whole 44/3 NPHP1 91840* recurrent - - - 110967529* gene

chr2:110875689- het del: whole AR704-03 NPHP1 91840* recurrent - - - 110967529* gene

chr1:5910699- het dup: AR811-03 NPHP4 120535* - - - - 6038368* exon1_28

normal:

exon29

chr1: 6051187- het dup: 79389* 6158763* exon30

95

Table 12: Clinical Phenotypes of BBS Cases with CNVs

Primary Individual Sex Age RP Ob PD HG RD ID DD Others Locus AR240-03 BBS1 F 21 ✓ ✓ ✓ − ✓ − S, OM

AR246-03 BBS1 M 12 ✓ ✓ ✓ ✓ − ✓ ✓ M, DC, MT, E

AR380-03 BBS1 F 9 ✓ ✓ ✓ − ✓ − PS, OM, H

KK047-05 BBS2 M 9 ✓ ✓ ✓ ✓ ✓

KK11-04 ARL6/BBS3 M 17 ✓ ✓ ✓ ✓ ✓ Hs, UT

KK11-09 ARL6/BBS3 F 7m ✓

DM034-004 ARL6/BBS3 M 10 ✓ ✓ ✓ − − As, DM1, FL

32/3 BBS4 M 8 ✓

AR800-03 BBS4 M 5 ✓ ✓ ✓ ✓ ✓ St, SS, F A, M, As, OM, AR400-03 BBS4 F 11 ✓ ✓ ✓ ✓ ✓ UH, ST AR634-03 BBS7 F 17 ✓ ✓ − − ✓ ✓ M, As, DM2

AR634-04 BBS7 M 15 ✓ ✓ − ✓ ✓ M, AS, HL

AR883-04 BBS10 M 4 ✓ ✓ ✓ ✓ ✓ SA, Sz, OM

AR672-03 IFT74 M 36 ✓ ✓ ✓ ✓ − ✓ − MC

AR811-03 Unknown M 4 ✓ ✓ ✓ ✓ SA

Phenotypes are indicated as present (✓) or absent (-). Age at assessment is indicated in years, unless otherwise noted (m, months). Empty boxes indicate not evaluated. Phenotypes assessed:

RP, retinitis pigmentosa; Ob, obesity; PD, polydactyly; ID, intellectual disability; RD, renal disease; DD, developmental delay. Others: MC, microcephaly; M, myopia; S, scoliosis; OM, otitis media; DM1, Diabetes Mellitus type 1; DM2, Diabetes Mellitus type 2; HL, hearing loss;

DC, dental crowding; MT, missing teeth; E, eczema; PS, pulmonary stenosis; H, hypertension;

Ns, nystagmus; Sz, seizures; St; strabismus; SS, short stature; A, asthma; UH, umbilical hernia;

ST, sickle trait; As, Astigmatism; FL, fatty liver; F, farsighted, SA, sleep apnea; Hs,

Hypospadias; UT, undescended testis

96

4.2.2 Breakpoint Analysis Shows Multiple Mechanisms of CNV Formation

To refine CNV endpoints, to map precisely the genomic content, and to examine rearrangement end products as a means of inferring possible mechanisms, we characterized 13 of 15 non-recurrent breakpoint junctions by long range PCR and Sanger sequencing (Table 11). None of the 15 CNVs detected by high-resolution aCGH in the

BBS genes (BBS1, BBS2, ARL6/BBS3, BBS4, BBS5, BBS7), ALMS1, IFT74, and NPHP4 in

our cohort have been reported previously or are present in the Database of Genomic

Variants (DGV), except for the heterozygous deletion in BBS1 (AR888-03) that we

published recently [134], and a BBS5 deletion of similar size and position to our patient

32/3 that has been found (in heterozygosity) in two individuals: one with reported

developmental delay and one with obesity. We found three individuals with unique

CNVs in BBS4; two of them involve exons 4-6 as determined by breakpoint mapping.

Remarkably, individuals with deletions and differing breakpoints in BBS4 involving the

same exons have been reported [159,219], suggesting that this region might be

susceptible to genomic instability (Table 10, Table 11).

In six of 15 non-recurrent CNVs (40%; AR888-03, BBS1; AR246-03, BBS1; 32/3,

BBS4; AR400-03, BBS4; AR883-04, BBS5; AR672-03, IFT74), the deletions were mediated

by Alu elements with variable percentage of sequence identity (77%-92%). Alu-Alu

recombination forming an Alu hybrid is a prominent mechanism underlying the

formation of pathogenic CNVs associated with distinct diseases [223-227]. Our data

97

indicate that Alu-Alu recombination plays an important role in BBS as well. In two junctions substrate pairs of Alu repeats from the same family (AR888-03, AluY-AluY;

AR246-03, AluSp-AluSp) are involved whereas the remaining four have substrate pairs consisting of distinct Alu repeat families mediating CNV formation (32/3, AluSc8 - AluSx;

AR400-03, AluSx - AluSg; AR883-04, AluYc3 - AluSg; AR672-03, AluSx - AluY). Notably, a

particular AluSx is present at the proximal junction of non-recurrent BBS4 deletions

present in two unrelated BBS individuals (AR400-03 and 32/3). In addition, a simple

dinucleotide microsatellite repeat, (TG)n, 4 kb apart and consisting of 41 nt, was

involved in the formation of the CNV in individual KK11-04 (ARL6/BBS3).

The formation of complex rearrangements such as triplications and inversions

mediated by Alu-Alu recombination and the co-occurrence of point-mutations and

indels at or nearby the CNV breakpoint junctions implicate replication-based

mechanisms (RBM; ref [228] and unpublished data), but simple rearrangements may be

caused by double-strand breaks repaired by single-strand annealing (SSA) by the

sequence similarity for annealing the broken DNA [229]. In two Alu-Alu mediated CNVs

additional point mutations were observed: a T>C SNV 178 bp from the junction of BBS1

in AR888-03 [134] and a deletion of nine nucleotides in a polyA region involved in the

formation of the breakpoint junction of IFT74 in AR672-04, potentially implicating an

error prone DNA polymerase and a RBM underlying formation of those variants.

98

Because those CNVs are inherited, we could not confirm whether the SNVs and the rearrangement were formed concomitantly.

Six additional deletion CNVs presented with no homology, 1 to 5 bp homology,

or 5 bp nucleotide insertions at the breakpoint junctions (Table 11), indicating that

canonical non-homologous end joining (NHEJ), microhomology-mediated end joining

(MMEJ) [230], or RBM as underlying mechanisms. Interestingly, breakpoint junctions of

CNVs spanning ALMS1 (AR400-03) and BBS7 (AR634-03) presented five

insertions (Table 11) that could have originated from within the deleted region as a

templated insertion (indicating either RBM or MMEJ); nonetheless the short length of

such insertions makes it difficult to rule out random bp insertion that can occur during

NHEJ.

In two cases we were unable to obtain junction fragments, and therefore

confirmed the aberrations with TaqMan® Copy Number Assays; both CNVs, the 14.9 kb

BBS1 deletion (AR240-03) and the complex duplication in NPHP4 (AR811-03) displayed

the expected dosage alteration (data no shown).

4.2.3 Enrichment of Ciliary Gene CNVs in BBS Cases

Subsequent to breakpoint analysis and CNV confirmation, we asked whether there was an increased prevalence of CNVs within our target ciliary gene set in BBS cases. A search in the DGV revealed a single heterozygous CNV in BBS5 (9969) [231]

99

overlapping with those identified in our BBS cohort; the CNV was of similar size and genomic location as the deletion detected in AR883-04. However, mindful of likely resolution differences between CNVs in this database and our study, we performed aCGH with the same reagent, hardware, and analytic software on 229 healthy control individuals. We did not detect any of the discovered CNVs in these controls; analysis for any exon-disrupting CNV identified a sole heterozygous deletion of exons 5-7 of BBS9, predicted to result in an in-frame deletion of 133 amino acids. These data, together with our previously reported NPHP1 and BBS1 CNVs [134], suggest an incidence of exon- disruptive CNVs at 15.6% (25/184) of BBS chromosomes compared to 0.2% (1/458) of control chromosomes, respectively, constituting a significant enrichment (p<0.0001,

Fisher’s exact test).

4.2.4 BBS Gene CNVs Contribute Recessive Alleles

Next, we investigated how the CNVs disrupting BBS genes contribute to disease manifestation. For cases in which parental samples were available (13 of 17), segregation analysis showed all CNVs to be inherited (Table 10, Figure 11). Among the six BBS

pedigrees harboring non-recurrent homozygous deletions, familial DNAs segregated

according to Mendelian expectations demonstrating anticipated carrier status for both

parents in three pedigrees (AR800, DM034, KK11); and a single carrier parent in two

pedigrees (KK47 and AR400; DNA was unavailable for the other parent; Figure 11). Five

100

of six pedigrees had only a single affected individual. Nonetheless, the nature of the gene disruptive deletion, ranging from two to eight exons, suggested strongly that these

CNVs are likely BBS drivers. Although cell lines were unavailable for mRNA studies for

KK47, AR800, and AR400, the putative splicing of exons flanking the deletions is predicted to result in a frameshift and premature termination codon. In DM034, the six exons encoding the C-terminal end of ARL6/BBS3 are deleted, predicted to result in a 41

amino acid truncated protein bereft of the majority of the ADP-ribosylation factor

domain. Affected individual 32/3 harbored a two-exon deletion that would not alter the

reading frame if spliced, but would delete part of the first tetratricopeptide (TPR)

domain in BBS4, known to be important for protein-protein interactions [232]. Finally,

deletion of a single exon in Saudi pedigree KK11 is predicted to result in truncation of 23

amino acids from the C-terminal end of ARL6/BBS3; the presence of the same

homozygous exon eight deletion in affected sibling KK11-09 bolstered our confidence in

the pathogenicity of this CNV as the primary driver of disease.

Unmasking a recessive allele on one chromosome by a deletion on the other is a

recognized disease-causing mechanism [233-236]. For BBS, we reported previously that

this phenomenon contributes to pathology in a BBS1 pedigree [134]. Consistent with this

observation, segregation analysis of a previously reported [188] maternally inherited

BBS7 c.878A>C (p.Gln293Pro) change in AR634 showed this variant to be in trans with

101

the novel paternally inherited exon 16-17 deletion; both affected siblings carried the two

BBS7 alleles (Figure 11).

To determine whether this mechanism could explain disease in any of the three pedigrees with novel heterozygous BBS1 deletions, we sequenced the coding exons and splice junctions in each of AR240, AR246 and AR380. In each case, we identified the c.1169T>G (p.Met390Arg) mutation, known to be one of the two most frequently mutated sites among BBS loci and a known hypomorphic allele [126,160,237,238], in trans with the deletion and segregating according to Mendelian expectations with disease in each pedigree (Figure 11; Table 10). Notably, Sanger sequencing of BBS1 exon

12 in AR380-03 showed c.1169T>G as an apparently homozygous mutation because the encompassing deletion masks the zygosity of the change. Sanger sequencing of BBS5 by a similar strategy in AR883, a pedigree carrying a heterozygous exon 8-12 deletion, yielded no novel functional variants in BBS5. Together, the presence of either a homozygous gene deletion (n=7), or a heterozygous deletion in trans with a pathogenic

SNV in the same gene (n=5) assigned primary causal genes for a previously underappreciated proportion of our cohort (13%; Figure 11).

102

Figure 11: Segregation of BBS Gene CNVs Assigns a Primary Causal Locus to 10 Pedigrees

Segregation analysis results indicate that CNVs can either a) be sufficient to cause disease (panels D, E, F, G, H, I; blue coloring); or b) Mendelize heterozygous deleterious alleles (panels A, B, C, J; green coloring). Squares, males; circles, females; black symbols, individuals affected with BBS; double lines, consanguinity

103

4.2.5 IFT74, a Novel BBS Candidate Gene

Next, we asked whether CNV deletions could assist in the identification of new

BBS genetic drivers. Among the families studied, there was a single individual, AR672-

03, who was bereft of known driver mutations, but harbored a heterozygous deletion

CNV. Our array data, confirmed by breakpoint sequencing, showed a maternally inherited deletion that removed the coding exons 14-19 of IFT74 (Figure 12), a locus encoding a component of the IFT complex that to date has not been associated with BBS or any other ciliopathy. We sequenced all 20 coding exons and splice junctions of IFT74 in the index case and both parents. We identified a paternally-inherited heterozygous c.1685-1G>T change (Figure 12). This change maps to the conserved intronic splice acceptor site of exon 20, and is absent from 11,868 publicly available healthy control chromosomes (NHLBI Exome Variant Server (EVS); evs.gs.washington.edu). Expanded resequencing of IFT74 in our 92 BBS affected individuals yielded only a single additional rare heterozygous missense change (c.1735G>A; p.Val579Met) in AR634-03 (Table 10).

To test functionally the protein resulting from the maternally-inherited IFT74 deletion, we used in vivo complementation in zebrafish embryos according to established methods[73]. Reciprocal BLAST identified a single zebrafish IFT74 ortholog

(51% identity; 71% similarity versus human), against which we designed a splice- blocking MO targeting the splice donor site of ift74 exon 2 that we used to inject one-to- two-cell stage WT embryos. Suppression of endogenous message resulted in significant

104

gastrulation defects (shortened body axes, broader and thinner somites, and broad and kinked notochords) in embryo batches scored at the mid-somitic stage (n=69-86 embryos/injection, masked scoring, repeated twice, p<0.0001 vs. uninjected control), consistent with that reported for other BBS and IFT gene morphants [126,133,135,239].

The phenotype was specific, since it could be rescued with the long WT human IFT74

mRNA isoform (p<0.0001; Figure 12, Table 13). In contrast, the short IFT74 isoform

rescued only partially (Figure 12, Table 13), consistent with the contention that the

AR672 proband is hypomorphic for IFT74 function.

105

Figure 12: IFT74 is a Novel Candidate BBS Locus

A. Chromosomal location of human IFT74 on chromosome 9p21 is indicated with a vertical red bar. B. Schematic of two IFT74 transcript variants (long, NM_025103; short, 106

NM_001099224) with alternate 3’ exon usage. Horizontal line, gene locus; vertical black bars, exons; black star, paternally-inherited point mutation; green bar, heterozygous deletion. C. aCGH plot indicates a heterozygous deletion of ~20 kb encompassing exons

14-19 of the long transcript. D. Enlarged view of the CNV and SNV-bearing region of

IFT74 (corresponds to the dashed blue box in panel B) and location of AluSz and AluY repeat elements. E. BBS pedigree AR672 and segregation of the paternally-inherited

c.1685-1G>T splice variant and maternally-inherited exon 14-19 deletion. F. Breakpoint

characterization of the IFT74 deletion. The junction sequence and corresponding

reference location is highlighted in blue and microhomology is shown in red. G.

Chromatogram corresponding to the microhomology region shown in panel F. H.

Representative live images of ift74 morphant zebrafish embryos at the mid-somitic stage

(top, lateral; bottom, dorsal) display gastrulation defects typical of IFT and other ciliary

gene suppression models. I. In vivo complementation studies indicate that the short

IFT74 transcript is a hypomorphic allele; heterozygous allele p.Val579Met (identified in

BBS pedigree AR634) and negative control variant c.165A>G; p.Ile55Met (rs10812505;

present in homozygosity in EVS controls; EVS All: GG=10/GA=336/AA=5654) are benign

in this assay; n=39-86 embryos/injection batch; masked scoring, repeated at least twice;

(*) indicates p<0.0001; NS, not significant; WT, wild-type.

107

Table 13: Embryo Counts for In Vivo Complementation Assays of Novel BBS Missense Variants

p-value p-value Class Class % p-value Injection Normal n= vs WT vs WT Interpretation I II Affected vs MO rescue RNA Uninjected (UI) control 135 3 1 139 3% <0.0001 <0.0001 <0.0001 bbs4 MO 61 56 14 131 53% N/A <0.0001 <0.0001 BBS4 WT RNA 63 2 2 67 6% <0.0001 <0.0001 N/A BBS4 c.439T>A (p.Tyr147Asn) RNA 49 5 5 59 17% <0.0001 <0.0001 0.0023 MO + BBS4 WT RNA 99 19 9 127 22% <0.0001 N/A <0.0001 MO + BBS4 c.439T>A (p.Tyr147Asn) RNA 69 43 24 136 49% 0.0037 <0.0001 <0.0001 null UI control 204 4 0 208 2% <0.0001 <0.0001 <0.0001 bbs5 MO 51 30 25 106 52% N/A <0.0001 <0.0001 BBS5 WT RNA 49 8 2 59 17% <0.0001 <0.0001 N/A BBS5 c.584A>G (p.Asp195Gly) RNA 56 7 3 66 15% <0.0001 <0.0001 0.4437 MO + BBS5 WT RNA 99 15 6 120 18% <0.0001 N/A <0.0001 MO + BBS5 c.584A>G (p.Asp195Gly) RNA 61 26 30 117 48% 0.1743 <0.0001 <0.0001 null UI control 139 2 0 141 1% <0.0001 <0.0001 <0.0001 bbs7 MO 68 37 22 127 46% N/A <0.0001 <0.0001 BBS7 WT RNA 66 4 1 71 7% <0.0001 <0.0001 N/A BBS7 c.442A>C (p.Asn148His) RNA 60 8 4 72 17% <0.0001 <0.0001 0.0011 MO + BBS7 WT RNA 103 16 6 125 18% <0.0001 N/A <0.0001 MO + BBS7 c.442A>C (p.Asn148His) RNA 67 24 23 114 41% 0.0989 <0.0001 <0.0001 null UI control 148 2 0 150 1% <0.0001 <0.0001 <0.0001 bbs8 MO 51 38 24 113 55% N/A <0.0001 <0.0001 BBS8 WT RNA 61 3 3 67 9% <0.0001 0.0002 N/A BBS8 c.1253A>G (p.Gln418Arg) RNA 55 6 2 63 13% <0.0001 0.0001 0.1406 MO + BBS8 WT RNA 86 9 11 106 19% <0.0001 N/A <0.0001 MO + BBS8 c.1253A>G (p.Gln418Arg) RNA 76 27 19 122 38% 0.0003 <0.0001 <0.0001 hypomorph UI control 127 6 0 133 5% <0.0001 <0.0001 <0.0001 bbs9 MO 59 40 24 123 52% N/A <0.0001 <0.0001 BBS9 WT RNA 59 5 3 67 12% <0.0001 0.0002 N/A BBS9 c.1993C>T (p.Leu665Phe) RNA 51 5 5 61 16% <0.0001 0.0001 0.2985 MO + BBS9 WT 79 8 16 103 23% <0.0001 N/A <0.0001 MO + BBS9 c.1993C>T (p.Leu665Phe) RNA 61 31 23 115 47% 0.3440 <0.0001 <0.0001 null UI control 151 1 3 155 3% <0.0001 <0.0001 <0.0001 cep290 MO 40 37 25 102 61% N/A <0.0001 <0.0001 CEP290 WT RNA 52 2 1 55 5% <0.0001 <0.0001 N/A CEP290 c.6401T>C (p.I2134T) RNA 44 6 3 53 17% <0.0001 <0.0001 0.0013

CEP290 c.2236G>C (p.E2243Q) RNA 56 3 2 61 8% <0.0001 <0.0001 0.4050 CEP290 c.2236G>C (p.E2243Q); c.6401T>C (p.I2134T) RNA 49 7 5 61 20% <0.0001 <0.0001 <0.0001 MO + CEP290 WT RNA 94 22 4 120 22% <0.0001 N/A <0.0001 MO + CEP290 c.2236G>C (p.E2243Q) RNA 81 25 13 119 32% <0.0001 <0.0001 <0.0001 MO + CEP290 c.6401T>C (p.I2134T) RNA 57 38 28 123 54% 0.0222 <0.0001 <0.0001 MO + CEP290 c.2236G>C (p.E2243Q); c.6401T>C (p.I2134T) RNA 69 42 33 144 52% <0.0001 <0.0001 <0.0001 hypomorph

108

4.2.6 Extensive Oligogenic Load Among Individuals Harboring BBS CNVs

Oligogenic inheritance is a hallmark feature of BBS [191]. However, exploration of even that mechanism has been based almost exclusively on sequencing data. Given our current observations on the CNV burden in BBS, and the fact that all affected individuals screened were not preselected for or against prior known pathogenic BBS alleles, we asked how exon-disruptive CNVs and pathogenic SNVs might coalesce in our cohort. First, we expanded our mutational screening efforts to the entire coding exons and splice junctions of BBS1-16, IFT27/BBS19, and NPHP1/BBS20 in all 17 individuals harboring CNVs. We confirmed variants in BBS2, MKKS/BBS6, BBS7, BBS10,

and NPHP1/BBS20, that were identified in previous candidate gene [134,136] or targeted

resequencing [188] studies. Further, we detected 10 additional rare alleles that were

predicted to disrupt amino acid sequence and were present at <1% minor allele

frequency in publicly available databases (13,006 chromosomes; EVS; Table 10). With the

exceptions of a homozygous 2 bp deletion in BBS10 (c.2118_2119delTG; p.Val707fs*708)

that segregated with disease in BBS pedigree AR883, and two heterozygous variants

shown previously to be functional null alleles (BBS5 c.551A>G [p.Asn184Ser] in AR240,

and BBS4 c.137A>G [p.Lys46Arg] in AR634) [126], all other variants were heterozygous

missense changes of unknown pathogenic potential. We therefore employed in vivo

complementation in zebrafish embryos by established, highly sensitive, and specific

assays for BBS4, BBS5, BBS7, BBS8, BBS9, BBS14/CEP290, and BBS16/SDCCAG8

109

[126,133,163]. Upon masked scoring of embryos rescued with either WT or mutant mRNA (in triplicate), we found that 7 of 7 were pathogenic (Table 10, Table 13).

Next, we asked a) whether CNVs contribute oligogenic alleles to BBS; and b) the extent to which pathogenic BBS gene variation (CNV or SNV) is present in addition to primary recessive loci. Supported by both genetic and in vivo functional data, we have shown previously that the common NPHP1 deletion can contribute oligogenic alleles to

BBS [134]. We found additional such examples (Figure 13). For instance, in AR883, the proband harbors both a homozygous BBS10 c.2118_2119delTG mutation (the recessive driver) and a heterozygous BBS5 exon 8-12 deletion. In other families, we found a range of combinations of deleterious SNVs and CNVs across two or more BBS loci. Overall, of the 17 BBS individuals bearing CNVs, 11 had additional pathogenic mutations in one or more BBS genes in addition to their driver locus. Some of these second sites were SNVs, some were CNVs, with one family harboring as many as four BBS associated pathogenic alleles (Figure 13, Figure 14).

110

Figure 13: Segregation and In Vivo Analysis of Oligogenic BBS Loci Demonstrate Additive and Epistatic Effects

Example pedigrees harboring BBS gene CNVs in either primary causal loci (A, B), or

heterozygous CNVs that are likely second-site contributors to disease (C) are shown.

Left panels show pedigrees and segregation of each BBS gene variant (CNV or SNV; separate pedigrees for each gene) with the primary causal locus shown on the far left.

Gene name color indicates homozygous deletion CNV (blue); heterozygous oligogenic

CNV (magenta); point mutation (gray). D, E, F. Bar charts indicate in vivo assessment of bbs gene interaction by the comparison of either single-gene, pairwise, or triple co-

injection of sub-effective doses of MOs and phenotypic scoring of zebrafish embryo

111

batches at the mid-somitic stage. Objective scoring criteria correspond to images shown in Figure 12H. Panel D demonstrates an additive effect in the triple-MO batch; panels E and F show an epistatic effect in the triple and double MO-injected batches, respectively, in which the combined effect is more severe than the sum of the individual MO-injected batches.

112

Figure 14: Genetic Architecture of BBS Gene Variation in 17 BBS Cases Harboring CNVs

Each slice of the pie chart represents one BBS case and the primary causal BBS locus is indicated with gene names and colors (homozygous CNV, blue; compound heterozygous CNV + SNV, green; homozygous or compound heterozygous SNV, gray).

Circles overlapping each slice represent the additional mutational burden in each BBS case (CNV, magenta; SNV, gray). 5/17 individuals (29%) harbor no additional BBS gene variants outside their primary causal driver; 12/17 individuals (71%) have one-to-four additional heterozygous deleterious variants in BBS1-16, IFT27/BBS19, or NPHP1/BBS20.

113

4.2.7 In Vivo Analysis of Oligogenic BBS Loci

Genetic interaction has been assessed extensively among the BBS genes and in concert with other ciliopathy loci [126,133-137,240,241]. To investigate whether the pairwise BBS gene combinations observed here result in additive or epistatic effects, we co-injected sub-effective doses of MO targeting the orthologous zebrafish genes, and we compared the fraction of affected embryos scored at the mid-somitic stage to that of embryos injected with single-gene MO concentrations alone. To approximate the BBS mutational burden in affected individuals, we tested four novel pairwise gene suppression combinations; we also generated an additional model in which three BBS genes were suppressed simultaneously (Figure 13, Table 10, Table 14). In three instances, and consistent with our previous observations of interactions with , , , and [134] we observed an additive effect. However, in one pairwise ( and bbs10) or one triple gene suppression model (bbs2, bbs4, and bbs6), we saw an epistatic effect in which the combined gene suppression model exhibits increased percentages of phenotypic severity, especially Class II (severe) in comparison to the addition of any of the single suppression models alone (Figure 13).

Finally, we used this approach to test the oligogenic contribution of the heterozygous ALMS1 CNV detected in BBS pedigree AR400. The proband harbors a recessive BBS4 locus (homozygous exon 5-6 deletion), in addition to a heterozygous

BBS9 c.1993C>T (p.Leu665Phe) null allele and a heterozygous ALMS1 exon 2-15

114

deletion. To model the combined effects of these loci, we first developed a zebrafish model of suppression. We targeted the sole ALMS1 ortholog in zebrafish with an sb-MO, and coinjected subeffective doses into embryo batches either alone or in combination with bbs4 and/or MOs. We observed an additive effect, suggesting that

ALMS1 is neither redundant with the other two BBS proteins nor does it exacerbate

significantly the phenotypic severity (Figure 13). Together, these data substantiate

further the complexity of the genetic architecture of BBS and lend further support to the

contribution of mutational burden, in addition to primary recessive loci.

115

Table 14: Embryo Counts for In Vivo Genetic Interaction Experiments

Class Class % p-value vs p-value vs p-value vs Injection Normal n= Interpretation I II Affected bbs MO (1) bbs MO (2) bbs MO (3)

UI control 143 6 1 150 5% N/A N/A N/A (1) bbs4 MO 80 21 9 110 27% N/A N/A N/A (2) bbs7 MO 115 15 8 138 17% N/A N/A N/A bbs4 + bbs7 MOs 71 24 25 120 41% 3.2392E-07 2.1248E-13 N/A additive UI control 164 4 0 168 2% N/A N/A N/A (1) bbs1 MO 95 15 12 122 22% N/A N/A N/A (2) bbs5 MO 88 13 14 115 23% N/A N/A N/A bbs1 + bbs5 MOs 68 28 30 126 46% 1.0572E-10 1.922E-09 N/A additive UI Ctrl 161 0 0 161 0% N/A N/A N/A (1) bbs10 MO 97 13 8 118 18% N/A N/A N/A (2) bbs5 MO 86 13 9 108 20% N/A N/A N/A bbs10 + bbs5 MOs 55 46 26 127 57% 1.1650E-31 2.5728E-28 N/A epistatic UI control 149 4 0 153 3% N/A N/A N/A (1) bbs2 MO 104 21 19 144 28% N/A N/A N/A (2) bbs4 MO 79 20 10 109 28% N/A N/A N/A (3) bbs6 MO 91 9 10 110 17% N/A N/A N/A bbs2 + bbs4 MOs 49 28 25 102 52% 5.8321E-08 8.8203E-09 N/A bbs2 + bbs6 MOs 79 20 24 123 36% 0.0250 N/A 3.0965E-08 bbs4 + bbs6 MOs 101 14 11 126 20% N/A 0.0180 0.0740 bbs2 + bbs4 + bbs6 MOs 23 24 51 98 77% 3.1972E-26 5.0572E-46 6.0897E-43 epistatic UI control 139 2 1 142 2% N/A N/A N/A (1) MO 66 15 4 85 22% N/A N/A N/A (2) bbs3 MO 95 22 9 126 25% N/A N/A N/A sdccag8 + bbs3 MOs 67 38 10 115 42% 2.4228E-10 0.0017 N/A additive UI control 154 4 2 160 4% N/A N/A N/A (1) alms1 MO 90 26 13 129 30% N/A N/A N/A (2) bbs4 MO 98 19 12 129 24% N/A N/A N/A (3) bbs9 MO 87 16 13 116 25% N/A N/A N/A alms1 + bbs4 MOs 60 29 9 98 39% 0.0030 0.0011 N/A alms1 + bbs9 MOs 60 22 14 96 38% 0.0047 N/A 0.1663 bbs4 + bbs9 MOs 79 16 19 114 31% N/A 0.0428 0.2445 alms1 + bbs4 + bbs9 MOs 54 35 22 111 51% 6.9757E-06 5.2393E-06 0.03466 additive

116

4.3 Discussion

Genetic and functional studies of the ciliopathies have informed the complexity of biological systems with regard to phenotypic variability and have afforded us the opportunity to begin to understand how the distribution of mutations beyond the primary causal locus can influence penetrance and expressivity [62]. However, essentially all that work was driven by mutational data derived by sequencing; some examples notwithstanding, the contribution of CNVs to both causality and overall mutational burden has been under-recognized. Here, using a high-density aCGH with sub-kb resolution to test an unselected cohort of BBS patients, we have found that 18% of these individuals harbor at least one exon-disruptive CNV. These lesions, which

range in size from ~700 bp to >100kb, can contribute recessive alleles; can Mendelize

pathogenic SNVs on the other chromosome; and likely interact with both CNVs and

SNVs in other BBS loci, a hypothesis substantiated by genetic interaction of the

discovered BBS gene combinations in vivo. Of note, analyses of either control databases

or intramural studies of control samples on the same platform as our discovery cohort

showed a stark absence of such CNVs from the general population, indicating that this

CNV enrichment is unlikely to represent random background variation. Even so, the

study of larger cohorts, both for BBS and other ciliopathies, will be required to measure

the precise contribution of CNVs to both recessive and oligogenic paradigms.

We also note that the observed genetic interactions did not always lead to the

117

same magnitude of effect, as measured by the severity of the combinatorial suppression of two or more bbs genes in zebrafish embryos. Acknowledging that the interaction experiments in zebrafish embryos are a coarse approximation of the genetic architecture of these specific BBS allelic contributions to trait manifestation in patients, we speculate that these patterns, once understood more fully, have the potential to inform the effect of trans alleles on phenotypic expression (an exercise that the current cohort is underpowered to accomplish). Moreover, we recognize that not all genetic interactions will necessarily be deleterious. In that context, the analysis of individuals with clinically mild ciliopathies, or individuals protected from a specific organ disorder within the ciliopathy spectrum, might be informative at documenting protective CNV/SNV combinations.

Finally, our studies highlight the continued utility of CNV analysis to identify driver loci. In addition to previous studies that identified recessive CNVs in NPHP1 in

BBS and ARMC4 in Primary Ciliary Dyskinesia [114,134], examination of the non- deleted chromosome identified IFT74 as a likely new BBS locus. We discovered only a single nuclear family; sequencing of the remainder of our cohort identified a sole rare

SNV at this locus, functional testing of which showed it to be benign (Figure 3, Table S4).

Therefore, we remain cautious until additional affected individuals are identified.

Nonetheless, several lines of evidence support the candidacy of this locus. First, this is not the first IFT component to be mutated in this disorder; mutations in IFT27 and

118

IFT172 were shown recently in BBS sibships, respectively [149,166]. Second, not only are the mutations likely damaging and potentially clinically severe for their functional impact (deletion/splice), but both mutant alleles affect specifically the long isoform of this gene, rendering the index case a functional hypomorph for this locus, thus offering an attractive hypothesis to explain the relatively mild phenotype in this family compared to the severe, often prenatal or perinatal, lethal phenotypes caused by recessive null alleles in IFT genes in humans or model organisms [239,242-244]. Thus, it will be important to determine the differential functions, including interacting partners and cargo, of the two IFT74 splice isoforms, since these might inform pathomechanism.

In conclusion, our studies have highlighted a substantial role for CNVs in BBS. In addition to improving our appreciation of the complexity of the genetic architecture of this and other ciliopathies, our observations also highlight two critical points for clinical molecular diagnosis. First, our data highlight the need to consider all types of genetic and genomic lesions that can impact gene and protein function. Computational algorithms that rely on depth of coverage [245,246] could not have detected most of the deletions that we discovered due to their small size, and the same is true for some of the commonly used techniques in molecular cytogenetic laboratories [4]. It is possible that the eventual transition to whole genome diagnostic sequencing will improve the accuracy of detection of such structural variants. Second, our data also emphasize why it remains important to continue the study of the exomes and genomes of affected

119

individuals beyond the discovery of a primary disease driver. Even though most current human genetic studies lack the resolution to interpret the effect of alleles beyond the recessive locus, the accrual of bona fide deleterious lesions in biological systems will ultimately inform the management of individuals who have the same recessive alleles but divergent clinical presentations.

120

Chapter 5- Identification of cis-suppression of human disease mutations by comparative genomics

This chapter is adapted from Jordan & Frangakis, et al (2015), currently accepted and in proof in Nature.

Patterns of amino acid conservation have served as a tool for understanding protein evolution [247]. The same principles have also found broad application in human genomics, driven by the need to interpret the pathogenic potential of variants in patients [34]. We performed a systematic comparative genomics analysis of human disease-causing missense variants. We found that an appreciable fraction of disease- causing alleles are fixed in the genomes of other species, suggesting a role for genomic context. We developed a model of genetic interactions that predicts most of these to be simple pairwise compensations. Functional testing of this model on two known human disease genes [144,248] revealed discrete cis amino acid residues that, although benign on their own, could rescue the human mutations in vivo. This approach was also applied to ab initio gene discovery to support the identification of a de novo disease driver in

BTG2 that is subject to protective cis-modification in >50 species. Finally, based on our data and models, we developed a computational tool to predict candidate residues subject to compensation. Taken together, our data highlight the importance of cis- genomic context as a contributor to protein evolution; they inform the complexity of

121

allele effect on phenotype; and they are likely to assist methods for predicting allele pathogenicity [249,250].

5.1 Introduction

Understanding the nature and prevalence of genetic interactions has the potential to inform the evolutionary forces that act on protein residues, protein complexes and, more broadly, genomes. Some studies have reported that interactions are ubiquitous and contribute significantly to the evolutionary landscape [251,252], while others found that interactions are rare and that protein evolution can be modeled without taking them into account [253]. Even among those who agree that interactions are important, the architecture of these interactions remains unclear: some studies find distinct interactions between two or three sites [254,255], while others propose a complex interaction network, effectively responding to aggregate properties of the entire protein or the entire genome [256,257].

One practical utility of comparative genomics has been highlighted by our appreciation of the large number of rare variants in humans and the difficulty in inferring their contribution to disease [34]. To prioritize variants of interest, frequency in control populations and evolutionary conservation have become two prominent filters.

Conserved regions are considered more likely to be biologically important and intolerant of variation[247]; programs such as PolyPhen [249] and SIFT [250] have

122

employed this principle to predict functional effects of variants in both the research and clinical setting [34]. Although clearly useful, these strategies are constrained, in part because they do not take into consideration the genomic context of the mutated allele.

An allele can appear damaging in one sequence yet be neutral in an orthologous sequence of another species. This phenomenon, referred to as compensated pathogenic deviation (CPD), contributes an unknown, but potentially significant, number of false negatives to the evaluation of functional sites [6,258].

5.2 Prevalence of CPDs Among Pathogenic Variants

To examine the prevalence of CPDs, and to identify such sites, we used comparative genomics on a genome-wide scale. A typical, non-CPD allele should cause

the same phenotype in any orthologous sequence, regardless of genetic background. By

contrast, when a variant that causes human genetic disease is found in a wild-type (wt)

orthologous sequence, it is likely that the genetic background of that species exerts a

compensatory effect on such a variant: it suppresses the phenotype, and thus protects

the variant from negative selection [6,255,258,259]. Previous studies have used this insight to quantify the fraction of CPDs interspecies substitutions at ~10% [6,258,259].

Other studies have reported estimates of the inverse value, namely the fraction of pathogenic variants that are present as CPDs in other species, ranging from 2%–18%

[260,261]. We set out to produce a new estimate of this value. We collected two datasets

123

of missense single-nucleotide variants (SNVs), annotated as either benign or pathogenic, derived from two databases, one based on literature (“HumVar”) [138,139,249] and one based on clinical genetic laboratories and investigator submissions (“ClinVar”) [139].

Although the two databases are not fully independent , the majority of pathogenic variants were listed in one or the other (Figure 15A). Overall, these datasets comprised

69,905 human missense mutations across 13,040 genes. We compared this dataset to

orthologous proteins from 100 vertebrates. As expected, we found the mutant residue

for a large number of likely neutral human variants to be fixed in orthologs. However,

the number of pathogenic missense variants found in orthologs (CPDs) was surprisingly

high: 5.6% ± 0.5% of ClinVar and 6.7% ± 0.4% of HumVar were found in the alignment of

mammals. For all vertebrates, these numbers increase to 10.2% ± 0.7% and 12.0% ± 0.5%,

respectively (Table 15).

Mindful of the possibility of false pathogenic annotations, we applied several

filtering steps, including cross-referencing HumVar and ClinVar variant annotations

with population frequency data; filtering based on alignment quality [262]; using

alternate alignment methodologies; and requiring presence in multiple species (Table 15;

Appendix A). Some filters did remove bona fide recessive alleles (since we did not allow

carriers), as well as disease variants with incomplete penetrance, even though this class

of alleles is, by definition, sensitive to genomic context and thus likely to be affected by

compensation [263]. Nevertheless, all filtering steps retained a substantial number of

124

variants. Including only variants that pass all filtering steps and are detected in >1 vertebrate, we predict that the minimum estimate of CPDs in human patients is 3%

(Figure 15B). This is consistent with previous analyses, which have found that stringent

filtering does not change the observed properties of CPDs significantly [259,260]. As a

final test, we extracted post hoc pathogenic alleles from three different sources, each of

which used independent means for assessing pathogenicity in acute pediatric disorders

(Table 16); overall, CPD rates ranged once again between 3%-to 9%; additional analyses

of other possible sources of bias were likewise consistent with our initial observations

and previous studies.

125

Figure 15: Distribution of Variants Found in Sequence Alignments a) Venn diagram showing sizes and overlap of the ClinVar and HumVar datasets, and how many are found in the multiple sequence alignment. b) Estimated number of human disease variants found in the alignment. The smallest estimate (3.0%, dark blue) comes from using the intersection of both variant datasets, requiring the variant be absent from 6,503 human exomes, and filtering out alignments with low quality scores.

With any methodology, at least 88% of human disease variants (grey) are not found in the alignment. c, d) Potential mechanisms for the occurrence of CPDs in evolution.

Branches where the variant is fixed are purple; branches where the variant is pathogenic

126

are red. In panel c, the variant is present neutrally in an ancestor, but is lost in primates.

Subsequent substitutions cause the ancestral allele to become pathogenic. In panel d, the variant is pathogenic in the ancestor, but mutations in a non-human branch cause it to become tolerated, and it arises later by mutation and becomes fixed.

Table 15: Range of Estimates for Prevalence of CPDs in Human Disease

Unfiltered High-Quality Mammalian Present in >1 EPO MultiZ MultiZ Subset of Species in Alignment Alignment Alignments MultiZ MultiZ Alignment Alignment HumVar 12.0% ± 0.5% 11.5% ± 0.5% 6.7% ± 0.4% 6.1% ± 0.3% 7.5% ± 0.4% ClinVar 10.2% ± 0.7% 9.9% ± 0.7% 5.6% ± 0.5% 4.7% ± 0.5% 6.5% ± 0.6% HumVar+ClinVar 9.3% ± 1.0% 8.5% ± 1.0% 5.3% ± 0.8% 3.9% ± 0.7% 5.5% ± 0.9% HumVar+ClinVar+ESP 7.5% ± 1.0% 7.0 ± 1.0% 3.8% ± 0.7% 3.0% ± 0.6% 4.0% ± 0.8%

127

Table 16: Occurance of CPDs Among Independent Pathogenic Lists

The Deciphering Developmental Disorders Study, 2014 Total Total Vert Total Mam Variants % of Total Alignment % of Total Alignment % of Total Total 183 - 26 14% 18 9.8%

Total De Total De Novo in Total De Novo in Novo Vert Align Mam Align Total 111 61% 5 4.5% 3 2.7%

In vivo validated BBS alleles Total Total Vert Total Mam Variants % of Total Alignment % of Total Alignment % of Total Total 96 - 23 24% 13 14%

Total Total Recessive Total Recessive Recessive in Vert Align in Mam Align % of Total Total 49 51% 8 16% 2 4%

Neuronal Ceroid-Lipofuscinoses (Kousi et al, 2012) Total Total Vert Total Mam Variants % of Total Alignment % of Total Alignment % of Total Total 164 - 20 12% 9 5%

5.3 Modeling Compensatory Interaction

We next turned to the question of the structure of the genetic interactions underlying such sites. Broadly, there are two possibilities: suppression of the disease phenotype may be the result of a small number of discrete compensatory substitutions; or suppression may be caused by global shift in the properties of the gene, or the whole

128

genome, caused by numerous substitutions that, individually, have small effects. The difference between these two models should be visible in the distribution of CPDs among orthologous sequences. During evolution, variants arise stochastically through a

Poisson process: the expected amount of evolutionary time required to produce a given substitution is distributed exponentially [264]. For a CPD, the distribution should be different; the presence of a CPD mandates the presence of all compensatory substitutions necessary for the CPD to be rendered neutral. As such, the expected evolutionary time required to produce a CPD is the sum of the times required to produce each compensatory substitution, followed by the time required to produce the

CPD.

Previous studies have proposed different processes by which CPDs arise. The most plausible option is a neutral mechanism, where the compensatory substitutions are neutral and arise/fix neutrally before the pathogenic substitution appears (Figure 15C, D;

Figure 16A). In this case, the time required for each substitution to arise is given by an exponential distribution, and the time for all compensatory sites to arise is approximated by the convolution of multiple exponential distributions (a gamma distribution, where all exponential distributions are identical). The number of exponential distributions included in the convolution corresponds to one plus the number of compensatory substitutions required, and it can be inferred from the shape of the distribution (Figure

16B).

129

Although the evolutionary time separating two sequences is not observable directly, we can approximate it using sequence distance (one minus sequence identity)

[265]. We plotted the number of missense variants observed as a function of sequence distance for neutral variants and for CPDs. Qualitatively, the shapes of both distributions match theoretical expectations. The two distributions are distinct from each other (p = 1.6 x 10-68, Kolmogorov-Smirnov 2-sample test). Additionally, the observed distribution of CPDs is weighted toward shorter evolutionary distances, as expected if most CPDs require a small number of individual compensatory substitutions, and does not resemble the normal distribution expected if CPDs require many individual compensatory substitutions (Figure 16B, D). To obtain a more precise estimate of the number of compensatory substitutions, we used maximum likelihood to fit several versions of the convolution-of-exponentials model with different combinations of

variant datasets and alignment strategies (Figure 16C, D). Most versions of the model fit

best as the convolution of approximately two exponential distributions, supporting a

mechanism where most CPDs are compensated by simple pairwise interactions.

Additionally, most models reported similar rates of evolution for neutral variants, CPDs,

and compensatory variants, suggesting that the target size for compensatory changes is

small. We repeated these analyses with multiple different variant datasets and

alignment strategies, finding similar results each time.

130

These analyses predict that most CPDs could be rescued by one large-effect compensatory substitution. We tested this prediction experimentally. We posited that each vertebrate sequence that includes a CPD should also include its cis-compensatory allele. Therefore, every amino acid difference between the human sequence and the sequence of the ortholog(s) containing a CPD is a candidate compensatory substitution.

Given the practical constraints of examining all possible compensatory substitutions in macromolecular complexes, we focused on substitutions within the same gene as the

CPD.

131

Figure 16: Relationship Between Variants and Evolutionary Distance a) Model for fixation of CPDs. Neutral changes (Xs) arise neutrally. Some of these (blue) compensate for alleles that would otherwise be pathogenic. Once k of compensatory changes have fixed, the CPD (red) fixes neutrally. b) The relationship between evolutionary distance and the number of variants in the alignment is expected to be different for individual benign variants

(black) and pathogenic variants with different numbers of compensations (blue: one, red: two, cyan: five, magenta: ten). c, d) Observed distribution of missense variants annotated as neutral (c) or pathogenic (d) in HumVar and present in vertebrate orthologs (bars), with maximum likelihood fits (black lines) and 95% confidence bands (gray shading). Panel d corresponds to a fitted value for k of 1.44 ± 0.07.

132

5.4 In Vivo Modeling of CPDs and Compensatory Sites

Scanning our list of candidate CPDs, we noted two alleles in genes involved in ciliopathies: a p.N165H change in BBS4 and a p.R937L variant in RPGRIP1L, which contribute pathogenic alleles to Bardet-Biedl syndrome and Meckel-Gruber syndrome, respectively [144,248]. These alleles were prioritized because: a) those disorders have a severe effect on reproductive fitness; b) previous studies have established loss-of- function zebrafish phenotypes rescuable by human mRNA for both genes [126,144]; c) in vivo complementation has indicated both alleles to be deleterious to human protein function [126,144]; and d) we observed multiple species with the human mutant allele fixed: six species for BBS4 165H and four for RPGRIP1L 937L (Figure 17A, B); for this reason, both alleles were predicted to be benign (PolyPhen-2, SIFT, MutationAssessor).

Comparative genomic analysis identified nine candidate sites in BBS4 and 32 candidate sites in RPGRIP1L. To test each site, we took advantage of the established convergent extension (CE) defects induced by morpholino (MO)-mediated suppression of bbs4 or rpgrip1l in zebrafish [126,144]. Consistent with previous observations, suppression of bbs4 or rpgrip1l induced CE defects in 80% and 50% of embryos respectively (n=50-100 embryos; Figure 17C-E). Co-injection of MO with human wt mRNA rescued this phenotype, whereas injection with human mutant mRNA showed no improvement (Figure 17D, E). We next tested the entire candidate complementing

133

allelic series for each gene. For BBS4, the introduction of 2/9 candidate residues in cis with the 165H-encoding mRNA ameliorated the phenotype in a manner indistinguishable from wt mRNA. Strikingly, both complementing alleles affected the same amino acid and were specific to the compensatory changes: the 165H/366N or the

165H/366S behaved as null, whereas 165H/366R was indistinguishable from wt;

165H/366T converted the functional null to a hypomorph (Figure 17D; Figure 18A).

We observed a similar pattern for RPGRIP1L. Testing each of the 32 candidate

sites identified three complementing events, two of which map to the same region:

937L/189L, 937L/193L and 937L/961T (Figure 17E; Figure 18B). Testing each

complementing allele individually showed them to be either extremely mild or benign.

Finally, comparative genomic analysis showed that these data could explain the

tolerance of the RPGRIP1L 937L change in all four species and of the BBS4 165H change

in 4/6 species (Figure 17A, B).

134

Figure 17: Compensatory Mutations Rescue Pathogenic Alleles in BBS4 and RPGRIP1L a) The pathogenic BBS4 165H allele is fixed in six species. Secondary sites 160, 163, and 366 are possible CPDs. b) The pathogenic RPGRIP1L 937L allele is fixed in four species. The 189L, 193L, and 961T alleles are present in all four species. c) Examples of zebrafish convergent extension phenotypic groups. d) Human RNA encoding the BBS4 165H mutation and either 366R or 366T can rescue the morphant phenotype; RNA encoding 165H mutation alone cannot. e) Mutation of

189L, 193L, or 961T, in the background of 937L RPGRIP1L mRNA, rescues the loss of function observed in 937L RNA. Significance determined by χ2 test

135

Figure 18: Protein Domain Structure of Functionally Tested Human Disease Genes a) Schematic of BBS4 (519 amino acids) is depicted with eight tetratricopeptide (TPR) domains (yellow); b) RPGRIP1L (1315 amino acids) has multiple coiled-coil domains

(green rectangles) and two Protein kinase C conserved region 2 (C2) domains (green hexagons); and c) BTG2 (158 amino acids) has one BTG1 domain (purple pentagon).

Disease-causing alleles are shown with red stars; complementing alleles are represented with blue stars; amino acid number scale in increments of 100 is shown below each schematic

136

The above analysis is limited by its retrospective nature. We therefore tested the usefulness of our model in ab initio gene discovery. We have initiated recently a whole exome sequencing (WES) and functional testing paradigm to accelerate gene discovery in young children: The Task Force for Neonatal Genomics (TFNG). Patients who display anatomical phenotypes amenable to functional modeling in zebrafish are evaluated by trio-based WES and have candidate alleles tested systematically in vivo [266].

We enrolled a 17-month-old female with an undiagnosed neuroanatomical condition hallmarked by microcephaly (Figure 19A). We filtered WES data for non- synonymous and splice variants with a minor allele frequency of <1%, and we conducted a proband-centric trio analysis that yielded four candidates: de novo missense changes in BTG2 and NOS2; and recessive missense variants in TTN and LAMA1.

Testing of an unaffected sibling excluded LAMA1; TTN, a known dominant

cardiomyopathy locus [267], is an unlikely driver.

To investigate the pathogenicity of the BTG2 (p.V141M) and NOS2 (p.P795A)

changes, we studied btg2 and nos2 in zebrafish. Reciprocal BLAST identified a single

zebrafish btg2 ortholog and two zebrafish nos2 orthologs. We injected splice-blocking

MO or translational-blocking MO (sb-MO and tb-MO) into zebrafish embryos (3 ng;

n=80 embryos/injection) and scored for head size defects at 4 days post-fertilization (dpf)

by measuring the anterior-posterior distance between the forebrain and the hindbrain

(Figure 19B). For nos2a/b MO-injected embryos, we saw no differences at the highest

137

dose injected (8 ng for nos2a/b sb-MOs). In contrast, we found a significant reduction of anterior structures in btg2 morphants (p<0.0001; Figure 19B, C). Co-injection of wt human BTG2 mRNA with tb-MO resulted in significant rescue (p<0.0001; Figure 19C). In contrast, injection of 141M was significantly worse at rescue than wt (p < 0.0001; Figure

19C).

BTG2 is a regulator of cell cycle checkpoint in neuronal cells [268] and is strikingly intolerant to variation in humans (Exome Variant Server). To test the pathogenicity of 141M by a different assay, we performed antibody staining at 2 dpf (a time prior to the manifestation of microcephaly). We marked post-mitotic neurons in the forebrain with HuC/D, and we scored (blind, triplicate) based on an established paradigm [269]. btg2 morphants displayed a significant decrease in HuC/D staining

(p<0.0001; Figure 20). This defect was rescued with WT BTG2 mRNA (p<0.05); but could not be ameliorated by 141M-encoded mRNA co-injection (Figure 20). Importantly, co- injection of btg2 tb-MO with two rare control EVS alleles (p.A126S and p.R145Q) resulted in rescue, providing evidence for assay specificity (Figure 20B). As a third test, we stained whole embryos with a phospho-histone H3 (PH3) antibody that marks proliferating cells. We counted the number of positive cells in a defined anterior region of embryos. We saw a significant reduction in cell proliferation in the heads of 2 dpf btg2 morphants (p<0.0001); this defect was likewise rescued by co-injection of wt mRNA, while 141M mutant rescue was indistinguishable from btg2 tb-MO alone (p = 0.38; Figure

138

19D, B). Combined, all three assays indicated that BTG2 p.V141M is pathogenic and that haploinsufficiency of this gene likely contributes to the microcephaly of the proband.

Despite our functional and genetic data for p.V141M, this allele was predicted computationally to be benign. A likely reason is that, with the exception of primates, most BTG2 orthologs encode Met at the orthologous position (Figure 19F). These data suggested that V141 might represent a CPD site in primates that branched from the ancestral methionine. To test this possibility, we identified nine BTG2 sites that co- evolved with 141M, which we mutagenized into the human construct encoding 141M.

We then injected embryos with btg2 MO; MO+wt human BTG2 mRNA; MO+ 141M- encoding mRNA; or MO+ 141M in cis with one of the nine candidate complementing alleles. Seven of the alleles had no effect. However, R80K- or L128V-encoded mRNA on

the 141M backbone rescued the number of PH3-positive cells to wt levels (Figure 19E;

Figure 18C); both alleles were benign on their own. Taken together, our data indicated that 141M is deleterious in the human background, but the protection to this residue afforded by either Lys 80 or by Val 128 can explain >90% (54/59) of species encoding

141M (Figure 19F).

139

Figure 19: A De Novo BTG2 p.V141M Encoding Allele Causes Microcephaly a) Pedigree DM048. Chromatograms show a de novo c.421G>A change. b) Suppression of btg2 leads to head size defects. Dorsal view of uninjected (UI) control and btg2 MO-injected zebrafish embryos at 4 dpf. White arrows show the distance measured from forebrain to hindbrain. Red line shows the protrusion of the pectoral fins in UI controls. c) Distribution of head size measurements at 4 dpf (Table S10; white arrows in c), a.u., arbitrary units. d) Two day-post- fertilization zebrafish embryos stained for phospho-histone H3. Human RNA containing the

V141M mutation is unable to rescue the reduction of proliferation of btg2 MOs. e) Quantification of pH3: human RNA with mutations V141M and either R80K or L128V can rescue knockdown of btg2. Error bars represent standard deviation. f) The 141M allele is fixed in 59/87 species besides primates, examples displayed here.

140

Figure 20: Evaluation of Post-Mitotic Neurons in 2 dpf Zebrafish Embryos Confirms the Pathogenicity of BTG2 V141M a) Suppression of btg2 leads to a decrease of HuC/D levels at 2 dpf. Representative ventral images of control, btg2 morphants (images show unilateral or absent HuC/D expression), and a rescued embryo injected with a btg2 MO plus human BTG2 wt mRNA. b) Percentage of embryos with normal, bilateral HuC/D protein levels in the anterior forebrain or decreased/unilateral HuC/D protein levels in embryos injected with btg2 MOs alone or MOs plus human BTG2 WT or variant mRNAs (p.V141M, index case; p.A126S and p.R145Q, control alleles). *, p<0.05 (two-tailed t-test comparisons between MO-injected and rescued embryos; n=38-78/injection batch).

141

5.5 Implications

To improve the scalability of detecting CPDs, we used our model of CPD evolution to develop a computational predictor for distinguishing variants that are unlikely to be CPDs from those that might be CPDs, and to identify candidate compensations to aid experimental design (http://genetics.bwh.harvard.edu/cpd/). Initial

testing of this tool intimated high negative predictive values but modest positive

predictive values, likely due to the dearth of known CPDs (Appendix A).

Our results contrast some previous studies that claim that epistasis is ubiquitous

[251,254]; or that it is practically nonexistent [253]; or that it is commonly of higher order

[256,257]. The most likely explanation for this discrepancy is that such studies have

examined different kinds of variation and traits. For example, studies on the evolution of

genetic incompatibilities rely on assumptions of high mutation rate and weak negative

selection, assumptions that generally do not hold for the case of pathogenic missense

variation [254,270]. The difference with the studies suggesting higher-order cis-

interactions may have to do with the scale of evolutionary time our analyses probes: the

span of hundreds of millions of years of evolution represented by the vertebrate

alignment may not be long enough to reveal higher-order combinations of

nonsynonymous SNVs. Indeed, using neutral SNVs from the HumVar dataset as a

control, we estimate the vertebrate alignment has explored 12% of pairwise interactions

between SNVs, compared to 0.6% of three-way interactions between SNVs. It is possible

142

that higher-order interactions are common, but are not detectable without a deeper alignment.

Finally, in the backdrop of accelerated use of genome editing to model human pathogenic mutations in a variety of model organisms, our data highlight the critical need to not only pair computational predictions with functional studies, but to also evaluate the effect of human mutations in the context of the human sequence.

143

Chapter 6 – Discussion 6.1 Interpreting Human Genetic Variation

Major progress in science often comes as a result of advancements in technology that allow for new ways to solve challenges that previously seemed insurmountable.

Genomics is no different, as the increasing speed, accuracy, and breadth of sequencing, all while per base costs have continued to fall, has allowed for the study of more complex disorders and genetic interactions.

With this increasing genomic capability a new challenge has risen. While geneticists and clinicians previously bemoaned the difficulty of genetic diagnoses with scant data, now the torrent of genomic data obtained readily for a patient and their family members has made genetic diagnosis again difficult, but for a different reason.

Specifically, for each patient undergoing whole exome sequencing (WES), between

20,000 and 50,000 variants may be identified [7,13]. Though greater than 95% are known common polymorphisms, this still leaves hundreds of variants, many of which are nonsynonymous changes, most of which will be seen rarely, if at all, in control populations.

The challenge therefore is to effectively filter variants irrelevant to the phenotype from those that are likely disease-causing. Of the several hundred potentially pathogenic variants, only a few are likely to be associated with a specific disease of interest [271], although even that model should be approached with caution, since our insights from

144

the genetic architecture of rare disease comes largely from the study of unusual, large pedigrees and the study of single genes. One method to filter through the large number of variants is intersection filtering, where several individuals with a shared disorder are sequenced, and genes with rare missense and nonsense variants in all or most of the individuals of the cohort are identified [272].

At present, however, there are a decreasing number of undiscovered highly penetrant disorders with simple Mendelian inheritance, which are also present in large pedigrees or in multiple individuals. More and more we are seeing that what once looked like monolithic and homogenous genetic disorders are in fact caused by a variety of lesions across multiple genes (for example, the ciliopathy BBS). Therefore a targeted approach to investigating the genes most likely involved in specific disease processes is beneficial. In Chapter 3, I showed how targeting sequencing can reduce the total number of genes sequenced (and therefore the number of resulting discovered variants) while still retaining power to identify genes involved in a disease of interest.

Additionally, the wide variety of genetic lesions that can now be identified complicates the interpretation and filtering of resultant variants. In addition to single nucleotide variants (SNVs), sequencing technology is capable of identifying small insertions and deletions (indels) and larger structural variation such as copy number variations (CNVs). Chapter 4 detailed an approach to not only find CNVs as small as

145

only a few hundred bp, but also to resolve the variety of ways in which they are able to contribute to disease and modify the effects of genes.

6.2 Use of Targeted Sequencing

Until recently, sequencing tests in the clinical setting were available for only those disorders for which a small number of causative genes were responsible. These tests were based mainly on Sanger sequencing technology, which has remained relatively unchanged for over 30 years (though it is still the gold standard for molecular diagnosis [273]). Traditionally, linkage analysis in related individuals would identify candidate genomic regions presumably containing a causal gene, which could be isolated through further narrowing of the region of interest and subsequent systematic sequencing of the candidate genes within that region [274].

The advent of high-throughput next-gen sequencing (NGS) methods have allowed for the massively parallel sequencing of an entire genome or exome, yet the increasing amount of sequence data for each patient is itself a barrier to analysis.

Additionally, NGS reduces the number of samples required for identification of candidate genes, which is highly advantageous as many of the diseases listed in the database of Mendelian disorders in humans (http://www.ncbi.nlm.nih.gov/omim) are

rare, which is a challenge for linkage analysis studies that commonly require large

affected pedigrees [275].

146

The goal of targeted sequencing is to reduce the amount of variants needed to be filtered through, while still retaining the power to identify those that are causal for disease. Exome sequencing itself is based on enriching a large targeted library, in this case the entirety of coding regions. Without prior enrichment of selected regions, any genomic region will be sequenced equally. This will include large intergenic and non- coding regions, which are both less likely to be causal for disease as well as more difficult to interpret [276]. By designing specific targeting arrays, any region of the genome can be enriched for sequencing including exons of specified genes (such as those that are known to be associated with a disease or cellular process) or promoter/modifier regions.

What we demonstrated in Chapter 3 is that targeted sequencing of a cohort of unrelated patients was capable of identifying a novel driver of disease in one family.

Our targeted library was the ciliary proteome, a compilation of 785 genes for which their protein products are found to be associated with the cilia and/or basal body [65]. This allowed for sequencing of the most likely candidates for disease.

Furthermore, we resequenced patients for whom a molecular diagnosis was not previously possible. The use of a large targeted array of candidate genes allowed for the evaluation of many candidate genes at once. And as the sequencing of the ciliary proteome was targeted to those genes associated with the cilia, we functionally tested our candidate alleles through assays designed to evaluate cilia function. It is therefore

147

possible that future investigations assaying for cilia function may exclude a truly causal gene, however it is more likely that any truly pathogenic mutation in a protein associated with the cilia will result in a cilia phenotype.

6.3 Functional Testing to Interpret Genetic Variation

Sequencing of extended multi-generational pedigrees and large cohort genomic analyses can provide a strong indication that a particular gene is associated with disease.

However, the gold standard to provide evidence that an individual gene or allele is causal for a given disease is through functional testing. Predicting the pathologic effect of a variant is of major importance to clinical diagnosis and treatment, which is made easier with functional information on the variants present in a patient, provided that the functional data are derived from assays that are a) relevant to the patient phenotype; and b) with measurable specificity and sensitivity.

Functional testing of an allele can be performed through in vitro or in vivo assays.

The ideal is to disrupt the endogenous protein (through either silencing of the gene, aberrant splicing of mRNA, or translational inhibition) to induce an observable and measurable phenotype, and subsequently attempt to rescue the pathologic phenotype through expression of exogenous protein. An additional method is through an enzymatic assay that measures activity of a mutant protein and compares to the wild- type version. However, this method only works for the subset of genes and proteins that

148

have quantifiable enzymatic activity. While in vitro functional studies are often easier and can test several alleles at once, they have a disadvantage in that a cell culture or chemical assay cannot recapitulate the wide array of contexts to which a protein can be exposed to within an organism. Furthermore, many substrates, especially immortalized cell lines, are abnormal in their genetics, signaling pathways, surface expression, or activities, making them imperfect reproductions of intra- and extracellular environments

[277]. In vivo assays, while not perfect, have the advantage of using the whole organism as a proving ground for the function and stability of an allele.

It is also worth noting that, in addition to selecting the appropriate model organism, thought must be given to the appropriate phenotypic readout. In the case of the functional testing of variants CEP76, E2F4, BBS4, and RPGRIP1L, we used assays specific for ciliary function as the genes were known to be involved in ciliopathies (or so suspected, as in the case of CEP76 and E2F4). Additionally, it was important that for

CEP76 and E2F4, neither of which was known before the present study to be a driver of ciliopathy, the assays we selected to perform were relevant to the phenotype observed in the patient. The convergent extension (CE) assay does capture the overall cilia dysfunction analysis (that is, if cilia function is perturbed then CE will be affected), but as a marker of gastrulation it is not specific to cilia. Therefore, to prove the specificity of the phenotypes we were observing and the claim that CEP76 is a ciliopathy gene, we

149

followed up with assays that phenocopied the patient’s pathology and more specifically interrogated cilia function.

It is not always necessary, however, to demonstrate the equivalence of an in vivo assay with the phenotype observed in humans. In Chapter 5 we assayed allele function in BBS4, RPGRIP1L, and BTG2. CE was selected for BBS4 and RPGRIP1L because they are both known ciliary genes, and this assay was utilized previously to assess function of alleles in these genes. The purpose of the study was not to recapitulate the phenotypes observed in the patients, but to use the zebrafish in vivo model as a system to test the function of the respective proteins. Likewise, while the purpose of the assay of embryo head size in btg2 morphants was to phenocopy the microcephaly observed in the patient, and the subsequent phospho-histone 3 staining to give a quantifiable readout of a possible mechanism, we used the zebrafish in vivo assays as systems to test the function of proteins. This allowed direct evaluation of whether the second-site mutations could restore function to null alleles.

The focus of Chapter 3 was to identify a novel driver of BBS and provide strong evidence for its role in disease. Further investigations can now be built on that foundation; specifically how disruption of CEP76 contributes to ciliary dysfunction, whether it has dual functions in both the cell cycle and ciliogenesis, and what additional signaling pathways may be required for ciliogenesis or trafficked through the cilium. We showed that kidney tubule structure was affected in zebrafish by disruption of cep76;

150

further investigations may be able to look at possible defects in tubule polarity or cellular junctions. The cilium has been shown to play an important role in the development of planar cell polarity [56], which is mediated in part through Wnt signaling and necessary for the development of the kidney tubule [278,279]. Appropriate structure is also determined by the coordinated direction of proliferation of epithelial cells parallel to the path of the tubule [278]. It would be of interest to investigate the direction of cell division in morphant tubules, with the hypothesis that a centrosome lacking CEP76 may be defective in appropriate organization of and division. Furthermore, inquiry into how CEP76 dysfunction perturbs Wnt signaling during kidney and retina morphogenesis (if at all) may lend insight into the mechanism of pathology in the human patient.

6.4 Contribution of CNVs to Human Disease

For much of the past forty years, the concept of large structural variations concerned mainly to alterations in the heterochromatin; those changes large enough to be seen through light microscopy or fluorescent in situ hybridization (FISH). With the advent of new genomic technology that does not rely solely on traditional sequencing modalities, we have over the past several decades become aware of submicroscopic variations in the structure of the genome. Segments that are deleted or duplicated are termed copy-number variations (CNVs), and are now thought to contribute more to

151

human variation (in terms of total base-pairs involved) than single nucleotide polymorphisms (SNPs) [25].

This has implications in the interpretation of human genetic variation; the most commonly known genetic changes driving disease are SNPs, that is the perception in discussions regarding mutations of disease genes, but the greatest contributor to human disease may be previously under-recognized CNVs, which form at rates greater than other sources of variation [229]. CNV formation occurs through a variety of mechanisms, in both germline and somatic cells; differences in CNVs have been observed both between identical (monozygotic) twins [280] as well as separate tissues within the same individual [281]. Mechanisms of genomic structural changes that give rise to CNVs are the same that normally repair structural damage to the DNA strand; specifically homologous (HR) and nonhomologus (NR) recombination. Errors in DNA repair mechanisms (such as incorrect recombination partner in HR or replication slippage and template switching in NR) can occur at similar or greater instances as polymerase/proofreading errors [229].

Just as fascinating is the recent research indicating the contribution of CNVs to human and other species’ evolution and adaption [282]. For example, a recent study found the number of copies of the salivary amylase gene (AMY1) is correlated with amylase protein level and activity in the saliva, and that individuals from populations of high starch consumption had more copies of AMY1 [283]. CNVs can provide the raw

152

material for gene duplication and subsequent gene family expansion and adaptation. As mentioned in Chapter 1, the teleost genome underwent a major duplication event, giving rise to multiple gene pairs that have specialized functions for their related members (this can present a challenge in in vivo gene knockdown experiments, as it is occasionally necessary to knockdown both related homologs of a human target gene)

[69]. Additionally, structural variation in the primate lineage reveals several segmental duplication events that created novel gene families as well as specialized regulation and function [284,285]. (This can also present challenges when aligning human genes to

those of other species, challenges that we had to specifically address through our

multiple-alignment strategies in Chapter 5).

As important as CNVs are to human disease and genomic evolution, they can be

overlooked because traditional sequencing methods often have difficulty identifying

them. A complete homozygous deletion of a gene or region would be straightforward to

detect by Sanger and next-gen sequencing methods, as there will be no amplicons from

that region. However, for heterozygous deletions or duplications of a gene or larger

region, amplicons specific for that section of the genome would be detected, and not

recognized as variations.

What we did to circumvent this issue was to use a comparative genomic

hybridization array (array CGH, or aCGH) that measures the amount of signal

generated by the hybridization of a specific probe with its target sequence. We used an

153

array that had fluorescently tagged oligonucleotides that probed for targets every 100bp in exonic regions, and 500bp in intronic regions. Instead of looking at the sequence of the targeted region (that is partially assured by the specificity of the oligo itself), aCGH instead looks at the number of hybridization events that occur for each oligo. Areas of the genome which have a lower hybridization signal are therefore likely to have a deletion of that region, whereas a higher hybridization signal indicates a likely duplication of the region.

The ability to detect CNVs is dependent on the number of oligos in a given region. The more densely spaced the oligos are, the more sensitive the array will be to variations, especially smaller CNVs. Our oligo spacing of 100/500 bp allowed for the detection of CNVs as small as a few hundred bp. This resolution has become an important factor in CNV detection, as several deletion/duplications cover only single genes, or even just parts of a gene [28,286]. Thus even with a homozygous deletion traditional sequencing would amplify most of the gene, but not the deleted exons. It is possible that the investigators would assume that the gene is intact but those exons are not amplifying for one of several reasons (and the explanation that the exons are deleted would not be considered).

Sub-gene resolution of CNVs in this array allowed us to gain insight into the

manner in which CNVs can contribute to disease. In addition to behaving as recessive

loci, wherein an affected patient has inherited the same deletion (or possibly

154

overlapping CNVs) from each parent and contains a homozygous deletion across a causal gene, CNVs can also function to uncover recessive alleles on the opposite chromosome (Figure 20). Additionally, CNVs can act in epistasis to alleles in other genes outside of their affected region. By behaving similarly to a LOF allele or gene knockout, a CNV can contribute to disruption of gene interactions, resulting in a synergistic effect when there is a pathogenic allele in a secondary gene.

Our investigation into the involvement of CNVs in BBS revealed a contribution to mutational load that was much higher than previously thought (~18% in our cohort).

The involvement of CNVs in other complex diseases is likely similarly high as well, indicating that a large amount of disease-contributing variation in human disease is liable to be completely overlooked. It is also worth considering individuals from historic cohorts for which Sanger or next-gen sequencing was unable to give a molecular diagnosis, as there is a strong possibility that previously undetectable CNVs are contributing to their disease. Additionally, as genomic technology becomes cheaper and more accessible, higher resolution arrays will become available, likely increasing the detection of CNVs in disease; indeed, several of the deletions we detect were small enough that they would have been missed by arrays only a few years older.

Even after detection, the interpretation of the clinical significance of one or multiple CNVs presents a challenge due to the variance in location and genomic content.

Recently a “CNV map” of the human genome has been published [287], which will aid

155

in the interpretation of new CNVs. Yet, the best method of interpreting the contribution of a particular CNV to disease, as shown in Chapter 4, is through the dissection of its constitutive genes. This method has been employed to identify primary drivers of pathology in several CNVs [125,288,289]. Future studies may benefit through investigations into the role of non-coding genomic regions (such as regulatory and intronic regions) in CNV pathology, as much of the current research has concentrated on disrupted coding regions. Disruption of these non-coding regions could disrupt expression of nearby coding regions as much as a deletion or duplication of these regions [30]. Knowledge of how structural disruptions of regulatory regions contribute to pathology in complex disorders such as the ciliopathies would assist in the interpretation of human genomic variation.

Figure 21: Deletions are able to unmask normally recessive alleles

156

Dosage-sensitive genes that are encompassed by a structural variant can cause disease through a duplication or deletion event (upper panel; a deletion is shown here). Dosage- insensitive genes can also cause disease if a deletion of the gene unmasks a recessive mutation on the homologous chromosome (lower panel). From [290], used with permission

6.5 Implications of CPDs to Protein Evolution and Interpreting Genetic Variation

As has been noted, the high number of variants obtained from the multitude of sequencing methods poses a challenge to the interpretation of each allele’s contribution to the observed pathology. Researchers have developed numerous tools to filter, or at the very least prioritize, the large number of variants of unknown significance (VUS).

Some of the more highly employed tools are the in silico programs for predicting allele pathogenicity. The most used of these, Polyphen (and now the updated Polyphen2),

SIFT, and Panther are open-access prediction algorithms; SIFT and Polyphen 1/2 are web-based, whereas Panther is a freely downloadable package. These tools have dramatically helped investigators prioritize their catalogs of VUS to focus on the most likely candidates of disease alleles.

157

6.5.1 Comparative Genomics and False Negatives

While these in silico programs consider a wide variety of factors in their prediction algorithms (see Table 1), one of the more heavily weighted components is consensus mapping using multiple alignments from various species. Sites that are highly conserved between species (and by extension conserved throughout evolution) are more likely to lead to pathology when altered. This reasoning makes the logical assumption that sites that have remained constant throughout evolution, and through multiple evolutionary branches, must have selective pressure to do so and thereby be intolerant of change.

It is this assumption that, while incredibly powerful for identifying sites in which any change from the highly conserved residue is pathogenic, leads to some inaccuracies in the negative calls. This is because the converse is that, if a site is poorly conserved between species, then that variation at this site is likely to be benign, or at least that the site is tolerant to the residues found among the species. Therefore if an investigator queries a candidate allele (for example BTG2 V141M) and the program identifies one or more species that contain the mutant allele (such as the 59 species with 141M as wild- type), then the program is likely to assume that this allele, since it is tolerated in other species, is likely tolerated in humans and therefore benign.

Yet there are numerous instances in which alleles with strong direct evidence of pathogenicity in humans are present as wild-type in other species. Many of these are

158

under strong negative selective pressure, so it is highly unlikely that they are pathogenic in these species but are somehow endured. A prime example is the G491S mutation in the androgen receptor. In humans and nearly all other mammals the 491 residue is glycine, and a change at this site to serine causes complete androgen insensitivity syndrome, which, while not lethal, does render the organism sterile. Thus the 491S mutation is highly deleterious and by most assumptions have no opportunity to ever fix in a species. Yet this supposedly pathogenic allele is fixed as wild-type in one order of

mammals, the rodents (Figure 21). They have somehow escaped the pathogenic effects

of this mutation and, by most estimation, are reproductively thriving.

159

Figure 22: Androgen Receptor G491S Mutation is Conserved in Rodents

From [45], used with permission.

6.5.2 Compensation of Pathogenic Alleles

Our hypothesis was that there is something in the genome of these species that is able to compensate for the loss of fitness generated by these alleles. The phenomenon of pathogenic human alleles present in other species is relatively well known, and these

160

alleles have previously been termed compensated pathogenic deviations (CPDs) [6]. It was previously estimated that up to 10% of pathogenic human alleles are CPDs, that is, they are conserved in other species. By systematically looking through several repositories of curated pathogenic alleles (almost 70,000 alleles in ~13,000 genes), we were able to arrive at a similar figure of between 5 and 12% depending on how stringent our filtering modalities were. This in itself is admittedly not a huge contribution to knowledge regarding CPDs, but we were subsequently able to predict the occurrence of the secondary compensating sites and functionally validate our prediction.

We wanted to know what the architecture of compensation is in the human genome, that is, how many secondary compensating sites would be required to mitigate the loss of fitness caused by a CPD. Previous calculations have predicted that CPDs could be compensated by a handful of strong-effect secondary sites, while others have said that it is likely the structure of the entire genomic region and perhaps whole genome that contributes countless weakly modifying compensations ([254] and [256], respectively). The structure of compensatory interaction would greatly influence our ability to both find CPDs and their potential compensatory sites, but also to functionally model this interaction.

Our collaborators designed a mathematical model, with input and optimization help from us, that allowed us to predict the required number of sites needed to compensate for the loss of fitness of a CPD. By regarding the evolutionary distance

161

(measured as the inverse of sequence identity) between humans and other species as a surrogate for evolutionary time, we are able to model the time required for a CPD and to arise and fix (and possible be lost as well) as a function of the compensatory sites. We charted the expected distribution of CPDs across species under various models, each calculated based on a different number of required secondary sites for a CPD. We found that the actual distribution of CPDs throughout species best matched the model of a simple pairwise interaction between a CPD and its compensatory site.

Because a single secondary site can mitigate the pathogenicity of a CPD, we were able to model this interaction in a relatively straightforward manner. If a CPD required 3 or even more sites to compensate, functional modeling would have been considerably more difficult, with the required combination of mutations to test exponentially larger.

Additionally, previous studies in multiple organisms and mathematical models had predicted that most compensation occur within a single locus, meaning that mutations within the same gene are the likely candidates of compensation.

6.5.3 Sporadic Evidence of Compensation in Humans

Though we presented a systematic analysis of the presence of CPDs among pathogenic alleles and showed functional evidence of compensation, there have been sporadic reports of pairwise compensation of pathogenic alleles in humans. One study reported the mosaic compensation of a R880Q change in the Fanconi anemia gene

162

FANCA. Monochorionic/monoamniotic twin girls had germline mutations in FANCA that would normally cause Fanconi anemia, which affects connective tissue as well as hematopoietic lineage cells [291]. However, mutation at a secondary site, E966K, occurred in a hematopoietic progenitor, and restored function to the FANCA protein.

The twin girls had hallmark connective tissue phenotypes of Fanconi anemia, including short stature and skeletal defects, but because of the hematopoietic compensation they did not have hematologic symptoms.

Another example of previously reported pathogenic allele compensation is in the gene mutated in ornithine transcarbamylase deficiency (OTCD), OTC [292]. The pathogenic human allele T125M is fixed in the chimpanzee. The chimp OTC protein diverges from the human at two sites; the first is the aforementioned 125M, the second is

135T (in humans, alanine substitutes threonine at position 135). In vitro assay of OTC activity in human cell lines showed that T125M does indeed have reduced activity, but when combined with the A135T allele (that is, a change of both alleles, creating a protein with identical sequence to chimpanzee) activity is restored.

In the FANCA study, the compensated site has not been shown to be present in other species, and so is not truly a CPD (as a species where this mutation is not compensated does not exist). However, the principle of compensation is the same. We have only focused on CPDs (those conserved in other species) because their presence in other species gives us indication that there can be a compensating effect somewhere.

163

Similarly, the OTC study is an example of a compensation that was fairly simple to predict; if the human and chimp OTC differ at only two residues, mutation of one would logically require mutation of the second to return to fitness. We have been able to predict the likely sites for candidate compensatory alleles for human proteins with much less one-to-one conservation with another species, and functionally validate our prediction.

6.5.4 Mechanism of CPD Fixation

If CPDs are pathogenic and require the presence of a secondary mutation, how do they arise and fix initially? Our model predicts, and our functional data show, that compensatory sites themselves are likely to be neutral. Though proteins are highly regulated molecules that are on the edge of stability, there is some room for benign missense fluctuation [293]. In silico experiments on virtual organisms undergoing hundreds of thousands of rounds of proliferation (where fitness is assessed and likelihood of reproduction calculated) demonstrate that benign or mildly pathogenic mutations are intermediaries for subsequent highly beneficial mutations that would be pathogenic otherwise (Figure 22, [294]).

164

Figure 23: Fitness Landscape of Individual Compensated Genotypes

Fitness landscape of four genotypes corresponding to the evolution of a compensated allele. (ab) is the neutral progenitor genotype, while the (AB) genotype has higher fitness due to the beneficial combination of the A and B alleles. The path from (ab) to (AB) can go through the (Ab) genotype, which in this case has a moderate loss of fitness. The B allele cannot arise in the absence of A without a significant loss of function. Used with permission from [294].

165

It is possible that the mutation that we call the CPD is the first to arise, and does so neutrally. An explanation is such: a single mutation A  B is likely to arise and fix neutrally, thereby allowing a second mutation that would otherwise be pathogenic X 

Y to arise and fix neutrally as well. This first mutation B is what is called the CPD, as reversion back to A will expose the pathogenic mutation Y, which no longer is compensated. Thus it is that B  A change that we see as pathogenic, as it can no longer compensate for the deleterious Y allele. Another possibility is that both mutations are able to rise and fix with close proximity (in evolutionary time), so that they compensate for one another. Because of this, these two alleles (B and Y) will tend to co-evolve from the point fixation.

Furthermore, we can trace the branching of fixed alleles through CPD pairs. In the case of BBS4 and RPGRIP1L, the pathogenic human mutation is shared by 4 and 6 species, respectively. Most other species have the wild-type human allele as their wild- type as well. This indicates that these species likely branched off from the main mammalian line near the time of CPD fixation. In contrast, only primates carry the BTG2

141V allele, whereas nearly every other mammal has a methionine at this position (the pathogenic human allele). This indicates that it is us, the primates, who are separate from the main mammalian line. The V141M mutation is actually just a reversion to the ancestral allele.

166

What had originally begun as an investigation into the source of false negatives in in silico prediction algorithms progressed towards a result that has influence how pathogenic alleles are conceptualized. The constant meandering of millions of residues over billions of years of evolution is a functional study that we could not hope to recreate in the lab.

167

Appendix A

Additional Characterization of CPDs

To further characterize our detected CPDs (for complete list see suppl file 1), we performed enrichment analysis on functional annotations and modes of inheritance. We performed Gene Ontology (GO) term enrichment analysis using the AmiGO term enrichment tool (http://amigo1.geneontology.org/cgi-bin/amigo/term_enrichment),

using the default thresholds of maximum p-value = 0.01 (before multiple test correction)

and minimum number of gene products = 2. No significant term enrichments were

found. We also performed annotation cluster enrichment analysis using the Database for

Annotation, Visualization and Integrated Discovery (DAVID) functional annotation

clustering tool [295]. This analysis showed modest enrichment for terms representing secreted extracellular compounds, integral membrane proteins, ion channels, and vacuoles. Other enrichments span a broad range of biological systems and processes, including blood clotting, blood lipid regulation, perception and cognition, hormone response, and embryonic development.

Next, we performed enrichment analysis on modes of inheritance. Modes of inheritance were downloaded from Orphanet (http://www.orpha.net/), which labels

diseases with inheritance categories including Autosomal Dominant, Autosomal

Recessive, X-Linked, Sporadic, and Multigenic/Multifactorial. In both the ClinVar and

HumVar datasets, enrichment analysis found small but highly significant depletion for

168

Autosomal Dominant, Autosomal Recessive, and X-Linked modes of inheritance, and significant enrichment for Multigenic/Multifactorial inheritance. This is in keeping with our expectations, since diseases with multigenic and multifactorial modes of inheritance inherently involve multiple sites and thus should be more readily subject to compensation.

Finally, we searched dbSNP for candidate CPDs whose presence in the alignment may be explained by the variant being polymorphic in the other species. We found a total of 10 candidate CPDs that are present in dbSNP for another species. Out of these, all but one were present in multiple species other than the one with a known polymorphism, suggesting that in general there are very few cases where a variant is found in the alignment solely because it is polymorphic. However, we cannot fully address the intersection of polymorphism and CPDs without more polymorphism data from a wider range of species.

These analyses complement previous studies that have examined the structural and chemical properties of CPDs, finding that CPDs are likely to be structurally destabilizing rather than impairing a specific biochemical function, and that they are likely to have milder effects on folding stability than other pathogenic SNVs [260,296].

Multiple Alignment Modalities

169

To help control for alignment artifacts, we used several alternate alignment strategies. First, we used the EPO "low-coverage” alignment of 37 eutherian mammal species, maintained by Ensembl [141]. This alignment shows good concordance with the

MultiZ mammalian alignment [140], with 94% of variants found in the original alignment in these species are also present in the alternate alignment. We also extended the analysis to non-mammalian vertebrates, which is expected to include lower quality alignments and orthology assignments. However, after scoring the alignment quality[262], we found that in aggregate, these alignments are of acceptable quality: the median score was 0.977, meaning that 97.7% of positions in the median alignment are correct. We excluded all genes with a quality score below 0.95 to create a third “high- quality” alignment methodology. Finally, we conducted a separate analysis filtering out variants found in only one sequence, since this cohort would be most sensitive to false positives due to sequencing or alignment errors. All these alignment strategies retained a sizable fraction of pathogenic missense variants, and qualitatively reproduced the distribution of variants and model-fitting results reported in the main text.

Post-hoc Analysis of CPD Rates in Multiple List of Pathogenic Variants

To further show that the observed incidence of CPDs among pathogenic alleles was correct, we evaluated three lists of bona fide pathogenic variants, each of which used independent means for assessing pathogenicity. The first is the 183 non-synonymous

170

pathogenic alleles reported recently in the Deciphering Developmental Disorders[297].

This list is not based on evolution but on relative incidence in known pathogenic genes, as well as population frequency and amount of background variation in these genes. The second is 49 alleles implicated as recessive drivers of Bardet-Biedl syndrome that have also been annotated functionally in vivo [126]. Lastly we looked at the 161 curated non- synonymous alleles from the genes that cause neuronal ceroid lipofuscinosis [298], for which the disorder is pediatric and severe, there exists a large allelic series, and for a subset of genes/alleles there exist data for either enzymatic activity or biopsy data.

Of note, when we only considered alleles de novo from DDD or recessive in BBS, in the mammalian alignment exclusively and absent from control databases, the most conservative boundary for CPD incidence from this analysis remains at 3%.

CPD Predictor Usage

Based on our experience with the identification of a novel human disease gene containing a CPD, we suggest a Bayesian approach to CPD identification, supported by our probabilistic predictor. The BTG2 V141M allele, identified as pathogenic in vivo but predicted as benign by multiple computational tools, should be considered initially an ambiguous case due to conflicting evidence. Our CPD predictor reports that it is unlikely but possible that this variant is a CPD, with probability 1.6%. This value represents the Bayesian posterior based solely on the alignment. However, by

171

presenting human genetic evidence, demonstrating experimentally that the V141M allele is functional, and locating the compensating alleles that account for its presence in almost all orthologous sequences, we are able to plausibly claim that this variant falls in the 1.6% of variants seen in similar alignments that are CPDs, and dismiss the computational prediction as a false negative. If, on the other hand, the CPD predictor had reported a probability well below 1%, we might remain unconvinced even in the face of our genetic and functional evidence; meanwhile, if it had reported a probability well above 5%, the functional and genetic evidence might have been sufficient without identifying the compensating alleles. In our dataset, approximately 1,800 benign variants and 60 pathogenic variants are assigned a probability below 1%, while approximately

5,600 benign variants and 1,800 pathogenic variants are assigned a probability above 5%.

The web interface outputs three distinct messages that repeat these recommendations.

Predicting specific compensating sites, though also desirable, is not feasible based on our current knowledge. There are several theoretical models to explain the biochemical basis of compensatory events. These include reconstitution of destabilized tertiary structure, restoration of protein stability, and improvement of protein-protein interaction capabilities within a complex [293,296]. However, we do not know the biochemical basis of the compensatory events discovered here; for each of BBS4,

RPGRIP1L, and BTG2, there was little a priori evidence for any of these interactions. The validated interactions can be long-range in terms of primary sequence, with one

172

spanning more than 700 residues; none of the three proteins tested have known 3D structures; and none of these interactions suggest obvious biochemical mechanisms for rescue, such as balancing, electrostatic charge, or replacement of phosphorylation sites.

It would be challenging for any computational method to account for these interactions and to make the correct prediction. Until more validated examples are collected, and/or until we have more biochemical information about those examples we have collected, predicting specific interactions in a principled way is not feasible. Instead, we have made available the method we used to design the experiments we reported here. This method simply treats any substitution that co-occurs with the candidate CPD as a candidate compensation, prioritizing sites that are substituted in multiple different species. These candidate lists may require fairly high-throughput experimental systems to test them. A more principled approach would not have this limitation.

Both the CPD prediction tool and the candidate compensation tool are available online at http://genetics.bwh.harvard.edu/cpd/. These tools should make it easier to interpret the output of computational tools like PolyPhen and SIFT, and to design experiments like those we report here. We expect future studies to develop more accurate predictors, possibly incorporating known functional and structural features of

CPDs [260,296] and/or attempting to predict compensation sites in a principled way.

Manual Evaluation of False-Positive Pathogenic Alleles

173

To evaluate the false-positive annotation of variants as pathogenic in the unfiltered HumVar dataset, we selected 100 alleles randomly from the HumVar pathogenic alleles list, and another 100 random alleles from the HumVar compensated alleles list (a subset of the HumVar pathogenic alleles that are also found in other species). We inspected the evidence supporting the alleles manually, discarding any that were obvious false positives based on the following criteria: a) in vitro or in vivo functional studies showed benign or only minor effects not significantly different from wild type; b), no evidence of pathogenicity; c) common polymorphisms (present in homozygosity in healthy controls, and/or minor allele frequency >5%); or d) incorrect annotation. We found that 5/100 alleles in the HumVar pathogenic list and 6/100 alleles in the HumVar compensated list were annotated falsely as pathogenic.

174

References

1 Davis, E. E., Frangakis, S. & Katsanis, N. Interpreting human genetic variation with in vivo zebrafish assays. Biochim Biophys Acta 1842, 1960-1970, doi:10.1016/j.bbadis.2014.05.024 (2014).

2 Lejeune, J., Gautier, M. & Turpin, R. [Study of somatic chromosomes from 9 mongoloid children]. C R Hebd Seances Acad Sci 248, 1721-1722 (1959).

3 Collins, F. S. Positional cloning: let's not call it reverse anymore. Nat Genet 1, 3-6, doi:10.1038/ng0492-3 (1992).

4 Katsanis, S. H. & Katsanis, N. Molecular genetic testing and the future of clinical genomics. Nat Rev Genet 14, 415-426, doi:10.1038/nrg3493 (2013).

5 Golzio, C. & Katsanis, N. Genetic architecture of reciprocal CNVs. Curr Opin Genet Dev 23, 240-248, doi:10.1016/j.gde.2013.04.013 (2013).

6 Kondrashov, A. S., Sunyaev, S. & Kondrashov, F. A. Dobzhansky-Muller incompatibilities in protein evolution. Proc Natl Acad Sci U S A 99, 14878-14883, doi:10.1073/pnas.232565499 (2002).

7 Gilissen, C., Hoischen, A., Brunner, H. G. & Veltman, J. A. Disease gene identification strategies for exome sequencing. Eur J Hum Genet 20, 490-497, doi:10.1038/ejhg.2011.258 (2012).

8 Betancur, C. & Buxbaum, J. D. SHANK3 haploinsufficiency: a "common" but underdiagnosed highly penetrant monogenic cause of autism spectrum disorders. Mol Autism 4, 17, doi:10.1186/2040-2392-4-17 (2013).

9 Oddi, D., Crusio, W. E., D'Amato, F. R. & Pietropaolo, S. Monogenic mouse models of social dysfunction: implications for autism. Behav Brain Res 251, 75-84, doi:10.1016/j.bbr.2013.01.002 (2013).

10 Xu, B. et al. Exome sequencing supports a de novo mutational paradigm for schizophrenia. Nat Genet 43, 864-868, doi:10.1038/ng.902 (2011).

11 Vissers, L. E. et al. A de novo paradigm for mental retardation. Nat Genet 42, 1109-1112, doi:10.1038/ng.712 (2010).

12 Mardis, E. R. A decade's perspective on DNA sequencing technology. Nature 470, 198-203, doi:Doi 10.1038/Nature09796 (2011). 175

13 Bamshad, M. J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nature Reviews Genetics 12, 745-755, doi:Doi 10.1038/Nrg3031 (2011).

14 Stenson, P. D. et al. The Human Gene Mutation Database: providing a comprehensive central mutation database for molecular diagnostics and personalised genomics. Human genomics 4, 69 (2009).

15 Kryukov, G. V., Pennacchio, L. A. & Sunyaev, S. R. Most rare missense alleles are deleterious in humans: Implications for complex disease and association studies. American Journal of Human Genetics 80, 727-739, doi:Doi 10.1086/513473 (2007).

16 Pruitt, K. D. et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 19, 1316-1323, doi:10.1101/gr.080531.108 (2009).

17 Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61-65, doi:10.1093/nar/gkl842 (2007).

18 Cirulli, E. T. & Goldstein, D. B. In vitro assays fail to predict in vivo effects of regulatory polymorphisms. Hum Mol Genet 16, 1931-1939, doi:10.1093/hmg/ddm140 (2007).

19 Niederriter, A. R. et al. In Vivo Modeling of the Morbid Human Genome using Danio rerio. Jove-Journal of Visualized Experiments, e50338, doi:UNSP e50338, DOI 10.3791/50338 (2013).

20 Lupski, J. R. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends in genetics 14, 417-422, doi:Doi 10.1016/S0168-9525(98)01555-8 (1998).

21 Lupski, J. R. Genomic disorders ten years on. Genome Med 1, 42, doi:10.1186/gm42 (2009).

22 Lupski, J. R. Genomic rearrangements and sporadic disease. Nat Genet 39, S43-47, doi:10.1038/ng2084 (2007).

23 Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420-426, doi:10.1126/science.1149504 (2007).

24 Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525-528, doi:10.1126/science.1098918 (2004).

176

25 Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu Rev Med 61, 437-455, doi:10.1146/annurev-med-100708- 204735 (2010).

26 Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet 6, e1001154, doi:10.1371/journal.pgen.1001154 (2010).

27 Golzio, C. et al. KCTD13 is a major driver of mirrored neuroanatomical phenotypes of the 16p11.2 copy number variant. Nature 485, 363-U111, doi:Doi 10.1038/Nature11091 (2012).

28 Boone, P. M. et al. Deletions of recessive disease genes: CNV contribution to carrier states and disease-causing alleles. Genome Res 23, 1383-1394, doi:10.1101/gr.156075.113 (2013).

29 Backx, L., Seuntjens, E., Devriendt, K., Vermeesch, J. & Van Esch, H. A balanced translocation t(6;14)(q25.3;q13.2) leading to reciprocal fusion transcripts in a patient with intellectual disability and agenesis of corpus callosum. Cytogenet Genome Res 132, 135-143, doi:10.1159/000321577 (2011).

30 Spielmann, M. & Mundlos, S. Structural variations, the regulatory landscape of the genome and their alteration in human disease. BioEssays 35, 533-543, doi:10.1002/bies.201200178 (2013).

31 Gu, W., Zhang, F. & Lupski, J. R. Mechanisms for human genomic rearrangements. Pathogenetics 1, 4, doi:10.1186/1755-8417-1-4 (2008).

32 Genomes Project, C. et al. A map of human genome variation from population- scale sequencing. Nature 467, 1061-1073, doi:10.1038/nature09534 (2010).

33 Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64-69, doi:10.1126/science.1219240 (2012).

34 Cooper, G. M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet 12, 628-640, doi:10.1038/nrg3046 (2011).

35 Fowler, D. M. et al. High-resolution mapping of protein sequence-function relationships. Nat Methods 7, 741-746, doi:10.1038/nmeth.1492 (2010).

177

36 MacArthur, D. G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469-476, doi:10.1038/nature13127 (2014).

37 Thusberg, J., Olatubosun, A. & Vihinen, M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat 32, 358-368, doi:10.1002/humu.21445 (2011).

38 Thusberg, J. & Vihinen, M. Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat 30, 703- 714, doi:10.1002/humu.20938 (2009).

39 Hu, H. et al. VAAST 2.0: improved variant classification and disease-gene identification using a conservation-controlled amino acid substitution matrix. Genet Epidemiol 37, 622-634, doi:10.1002/gepi.21743 (2013).

40 Kennedy, B. et al. in Current Protocols in Human Genetics (John Wiley & Sons, Inc., 2001).

41 Hu, H. et al. A unified test of linkage analysis and rare-variant association for analysis of pedigree sequence data. Nat Biotechnol 32, 663-669, doi:10.1038/nbt.2895 (2014).

42 Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46, 310-315, doi:10.1038/ng.2892 (2014).

43 Margulies, E. H. et al. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res 17, 760-774, doi:10.1101/gr.6034307 (2007).

44 Xu, J. & Zhang, J. Why human disease-associated residues appear as the wild- type in other species: genome-scale structural evidence for the compensation hypothesis. Mol Biol Evol 31, 1787-1792, doi:10.1093/molbev/msu130 (2014).

45 Gao, L. & Zhang, J. Why are some human disease-associated mutations fixed in mice? Trends Genet 19, 678-681, doi:10.1016/j.tig.2003.10.002 (2003).

46 Ng, P. C. & Henikoff, S. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7, 61-80, doi:10.1146/annurev.genom.7.080505.115630 (2006).

47 Hildebrandt, F., Benzing, T. & Katsanis, N. Ciliopathies. N Engl J Med 364, 1533- 1543, doi:10.1056/NEJMra1010172 (2011).

178

48 Barr, M. M. & Sternberg, P. W. A polycystic kidney-disease gene homologue required for male mating behaviour in C. elegans. Nature 401, 386-389, doi:10.1038/43913 (1999).

49 Gerdes, J. M., Davis, E. E. & Katsanis, N. The vertebrate primary cilium in development, homeostasis, and disease. Cell 137, 32-45, doi:10.1016/j.cell.2009.03.023 (2009).

50 Tabin, C. J. The key to left-right asymmetry. Cell 127, 27-32, doi:10.1016/j.cell.2006.09.018 (2006).

51 Yoder, B. K. Role of primary cilia in the pathogenesis of polycystic kidney disease. J Am Soc Nephrol 18, 1381-1388, doi:10.1681/ASN.2006111215 (2007).

52 Jekely, G. & Arendt, D. Evolution of intraflagellar transport from coated vesicles and autogenous origin of the eukaryotic cilium. BioEssays 28, 191-198, doi:10.1002/bies.20369 (2006).

53 Brailov, I. et al. Localization of 5-HT6 receptors at the plasma membrane of neuronal cilia in the rat brain. Brain Research 872, 271-275, doi:Doi 10.1016/S0006- 8993(00)02519-1 (2000).

54 Berbari, N. F., Lewis, J. S., Bishop, G. A., Askwith, C. C. & Mykytyn, K. Bardet- Biedl syndrome proteins are required for the localization of G protein-coupled receptors to primary cilia. Proc Natl Acad Sci U S A 105, 4242-4246, doi:10.1073/pnas.0711027105 (2008).

55 Kovacs, J. J. et al. Beta-arrestin-mediated localization of smoothened to the primary cilium. Science 320, 1777-1781, doi:10.1126/science.1157983 (2008).

56 Gerdes, J. M. et al. Disruption of the basal body compromises proteasomal function and perturbs intracellular Wnt response. Nat Genet 39, 1350-1360, doi:10.1038/ng.2007.12 (2007).

57 Torkko, J. M., Manninen, A., Schuck, S. & Simons, K. Depletion of apical transport proteins perturbs epithelial cyst formation and ciliogenesis. J Cell Sci 121, 1193-1203, doi:10.1242/jcs.015495 (2008).

58 Vieira, O. V. et al. FAPP2, cilium formation, and compartmentalization of the apical membrane in polarized Madin-Darby canine kidney (MDCK) cells. Proc Natl Acad Sci U S A 103, 18556-18561, doi:10.1073/pnas.0608291103 (2006).

179

59 Powles-Glover, N. Cilia and ciliopathies: classic examples linking phenotype and genotype-an overview. Reprod Toxicol 48, 98-105, doi:10.1016/j.reprotox.2014.05.005 (2014).

60 Olbrich, H. et al. Mutations in a novel gene, NPHP3, cause adolescent nephronophthisis, tapeto-retinal degeneration and hepatic fibrosis. Nat Genet 34, 455-459, doi:10.1038/ng1216 (2003).

61 Bergmann, C. et al. Loss of nephrocystin-3 function can cause embryonic lethality, Meckel-Gruber-like syndrome, situs inversus, and renal-hepatic- pancreatic dysplasia. American Journal of Human Genetics 82, 959-970, doi:10.1016/j.ajhg.2008.02.017 (2008).

62 Davis, E. E. & Katsanis, N. The ciliopathies: a transitional model into systems biology of human genetic disease. Curr Opin Genet Dev 22, 290-303, doi:10.1016/j.gde.2012.04.006 (2012).

63 Katsanis, N. et al. Triallelic inheritance in Bardet-Biedl syndrome, a Mendelian recessive disorder. Science 293, 2256-2259, doi:10.1126/science.1063525 (2001).

64 Deveault, C. et al. BBS genotype-phenotype assessment of a multiethnic patient cohort calls for a revision of the disease definition. Hum Mutat 32, 610-619, doi:10.1002/humu.21480 (2011).

65 Gherman, A., Davis, E. E. & Katsanis, N. The ciliary proteome database: an integrated community resource for the genetic and functional dissection of cilia. Nat Genet 38, 961-962, doi:10.1038/ng0906-961 (2006).

66 Aitman, T. J. et al. The future of model organisms in human disease research. Nat Rev Genet 12, 575-582, doi:10.1038/nrg3047 (2011).

67 Streisinger, G., Walker, C., Dower, N., Knauber, D. & Singer, F. Production of clones of homozygous diploid zebra fish (Brachydanio rerio). Nature 291, 293-296 (1981).

68 Amores, A. et al. Zebrafish hox clusters and vertebrate genome evolution. Science 282, 1711-1714 (1998).

69 Meyer, A. & Schartl, M. Gene and genome duplications in vertebrates: the one- to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr Opin Cell Biol 11, 699-704 (1999).

180

70 Postlethwait, J. H. et al. Vertebrate genome evolution and the zebrafish gene map. Nat Genet 18, 345-349, doi:10.1038/ng0498-345 (1998).

71 Howe, K. et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature 496, 498-503, doi:10.1038/nature12111 (2013).

72 Howe, D. G. et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 41, D854-860, doi:10.1093/nar/gks938 (2013).

73 Niederriter, A. R. et al. In vivo modeling of the morbid human genome using Danio rerio. J Vis Exp, e50338, doi:10.3791/50338 (2013).

74 de Pontual, L. et al. Epistasis between RET and BBS mutations modulates enteric innervation and causes syndromic Hirschsprung disease. Proc Natl Acad Sci U S A 106, 13921-13926, doi:10.1073/pnas.0901219106 (2009).

75 Lawson, N. D. & Wolfe, S. A. Forward and reverse genetic approaches for the analysis of vertebrate development in the zebrafish. Dev Cell 21, 48-64, doi:10.1016/j.devcel.2011.06.007 (2011).

76 Chakrabarti, S., Streisinger, G., Singer, F. & Walker, C. Frequency of gamma-Ray Induced Specific Locus and Recessive Lethal Mutations in Mature Germ Cells of the Zebrafish, BRACHYDANIO RERIO. Genetics 103, 109-123 (1983).

77 Streisinger, G. Extrapolations from species to species and from various cell types in assessing risks from chemical mutagens. Mutat Res 114, 93-105 (1983).

78 Mullins, M. C., Hammerschmidt, M., Haffter, P. & Nusslein-Volhard, C. Large- scale mutagenesis in the zebrafish: in search of genes controlling development in a vertebrate. Curr Biol 4, 189-202 (1994).

79 Solnica-Krezel, L., Schier, A. F. & Driever, W. Efficient recovery of ENU-induced mutations from the zebrafish germline. Genetics 136, 1401-1420 (1994).

80 Hammerschmidt, M. et al. Mutations affecting morphogenesis during gastrulation and tail formation in the zebrafish, Danio rerio. Development 123, 143-151 (1996).

81 Solnica-Krezel, L. et al. Mutations affecting cell fates and cellular rearrangements during gastrulation in zebrafish. Development 123, 67-80 (1996).

181

82 van Eeden, F. J. et al. Mutations affecting somite formation and patterning in the zebrafish, Danio rerio. Development 123, 153-164 (1996).

83 Brand, M. et al. Mutations in zebrafish genes affecting the formation of the boundary between midbrain and hindbrain. Development 123, 179-190 (1996).

84 Jiang, Y. J. et al. Mutations affecting neurogenesis and brain morphology in the zebrafish, Danio rerio. Development 123, 205-216 (1996).

85 Schier, A. F. et al. Mutations affecting the development of the embryonic zebrafish brain. Development 123, 165-178 (1996).

86 Stainier, D. Y. et al. Mutations affecting the formation and function of the cardiovascular system in the zebrafish embryo. Development 123, 285-292 (1996).

87 Neuhauss, S. C. et al. Mutations affecting craniofacial development in zebrafish. Development 123, 357-367 (1996).

88 Piotrowski, T. et al. Jaw and branchial arch mutants in zebrafish II: anterior arches and cartilage differentiation. Development 123, 345-356 (1996).

89 Schilling, T. F. et al. Jaw and branchial arch mutants in zebrafish I: branchial arches. Development 123, 329-344 (1996).

90 Lang, M. R., Lapierre, L. A., Frotscher, M., Goldenring, J. R. & Knapik, E. W. Secretory COPII coat component Sec23a is essential for craniofacial chondrocyte maturation. Nat Genet 38, 1198-1203, doi:10.1038/ng1880 (2006).

91 Boyadjiev, S. A. et al. Cranio-lenticulo-sutural dysplasia is caused by a SEC23A mutation leading to abnormal endoplasmic-reticulum-to-Golgi trafficking. Nat Genet 38, 1192-1197, doi:10.1038/ng1876 (2006).

92 Obholzer, N. et al. Rapid positional cloning of zebrafish mutations by linkage and homozygosity mapping using whole-genome sequencing. Development 139, 4280- 4290, doi:10.1242/dev.083931 (2012).

93 Ryan, S. et al. Rapid identification of kidney cyst mutations by whole exome sequencing in zebrafish. Development 140, 4445-4451, doi:10.1242/dev.101170 (2013).

94 Leshchiner, I. et al. Mutation mapping and identification by whole-genome sequencing. Genome Res 22, 1541-1548, doi:10.1101/gr.135541.111 (2012).

182

95 Summerton, J. & Weller, D. Morpholino antisense oligomers: design, preparation, and properties. Antisense Nucleic Acid Drug Dev 7, 187-195 (1997).

96 Nasevicius, A. & Ekker, S. C. Effective targeted gene 'knockdown' in zebrafish. Nat Genet 26, 216-220, doi:10.1038/79951 (2000).

97 Shestopalov, I. A., Sinha, S. & Chen, J. K. Light-controlled gene silencing in zebrafish embryos. Nat Chem Biol 3, 650-651, doi:10.1038/nchembio.2007.30 (2007).

98 Eisen, J. S. & Smith, J. C. Controlling morpholino experiments: don't stop making antisense. Development 135, 1735-1743, doi:10.1242/dev.001115 (2008).

99 Wienholds, E., Schulte-Merker, S., Walderich, B. & Plasterk, R. H. Target-selected inactivation of the zebrafish rag1 gene. Science 297, 99-102, doi:10.1126/science.1071762 (2002).

100 Kettleborough, R. N. et al. A systematic genome-wide analysis of zebrafish protein-coding gene function. Nature 496, 494-497, doi:10.1038/nature11992 (2013).

101 Wang, D. et al. Efficient genome-wide mutagenesis of zebrafish genes by retroviral insertions. Proc Natl Acad Sci U S A 104, 12428-12433, doi:10.1073/pnas.0705502104 (2007).

102 Sivasubbu, S., Balciunas, D., Amsterdam, A. & Ekker, S. C. Insertional mutagenesis strategies in zebrafish. Genome Biol 8 Suppl 1, S9, doi:10.1186/gb- 2007-8-s1-s9 (2007).

103 Wijshake, T., Baker, D. J. & van de Sluis, B. Endonucleases: new tools to edit the mouse genome. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 1842, 1942-1950, doi:http://dx.doi.org/10.1016/j.bbadis.2014.04.020 (2014).

104 Urnov, F. D., Rebar, E. J., Holmes, M. C., Zhang, H. S. & Gregory, P. D. Genome editing with engineered zinc finger nucleases. Nat Rev Genet 11, 636-646, doi:10.1038/nrg2842 (2010).

105 Doyon, Y. et al. Heritable targeted gene disruption in zebrafish using designed zinc-finger nucleases. Nat Biotechnol 26, 702-708, doi:10.1038/nbt1409 (2008).

183

106 Meng, X., Noyes, M. B., Zhu, L. J., Lawson, N. D. & Wolfe, S. A. Targeted gene inactivation in zebrafish using engineered zinc-finger nucleases. Nat Biotechnol 26, 695-701, doi:10.1038/nbt1398 (2008).

107 Bedell, V. M. et al. In vivo genome editing using a high-efficiency TALEN system. Nature 491, 114-118, doi:10.1038/nature11537 (2012).

108 Hwang, W. Y. et al. Efficient genome editing in zebrafish using a CRISPR-Cas system. Nat Biotechnol 31, 227-229, doi:10.1038/nbt.2501 (2013).

109 Chang, N. et al. Genome editing with RNA-guided Cas9 nuclease in zebrafish embryos. Cell Res 23, 465-472, doi:10.1038/cr.2013.45 (2013).

110 Reiter, J. F. et al. Gata5 is required for the development of the heart and endoderm in zebrafish. Genes Dev 13, 2983-2995 (1999).

111 Padang, R., Bagnall, R. D., Richmond, D. R., Bannon, P. G. & Semsarian, C. Rare non-synonymous variations in the transcriptional activation domains of GATA5 in bicuspid aortic valve disease. J Mol Cell Cardiol 53, 277-281, doi:10.1016/j.yjmcc.2012.05.009 (2012).

112 Wei, D. et al. GATA5 loss-of-function mutations underlie tetralogy of fallot. Int J Med Sci 10, 34-42, doi:10.7150/ijms.5270 (2013).

113 Wei, D. et al. GATA5 loss-of-function mutation responsible for the congenital ventriculoseptal defect. Pediatr Cardiol 34, 504-511, doi:10.1007/s00246-012-0482-6 (2013).

114 Hjeij, R. et al. ARMC4 mutations cause primary ciliary dyskinesia with randomization of left/right body asymmetry. Am J Hum Genet 93, 357-367, doi:10.1016/j.ajhg.2013.06.009 (2013).

115 Merveille, A. C. et al. CCDC39 is required for assembly of inner arms and the dynein regulatory complex and for normal ciliary motility in humans and dogs. Nat Genet 43, 72-78, doi:10.1038/ng.726 (2011).

116 Zariwala, M. A. et al. ZMYND10 is mutated in primary ciliary dyskinesia and interacts with LRRC6. Am J Hum Genet 93, 336-345, doi:10.1016/j.ajhg.2013.06.007 (2013).

184

117 Wan, J. et al. Mutations in the RNA exosome component gene EXOSC3 cause pontocerebellar hypoplasia and spinal motor neuron degeneration. Nat Genet 44, 704-708, doi:10.1038/ng.2254 (2012).

118 Norton, N. et al. Genome-wide studies of copy number variation and exome sequencing identify rare variants in BAG3 as a cause of dilated cardiomyopathy. Am J Hum Genet 88, 273-282, doi:10.1016/j.ajhg.2011.01.016 (2011).

119 Sanna-Cherchi, S. et al. Mutations in DSTYK and dominant urinary tract malformations. N Engl J Med 369, 621-629, doi:10.1056/NEJMoa1214479 (2013).

120 Sarparanta, J. et al. Mutations affecting the cytoplasmic functions of the co- chaperone DNAJB6 cause limb-girdle muscular dystrophy. Nat Genet 44, 450-455, S451-452, doi:10.1038/ng.1103 (2012).

121 Veltman, J. A. & Brunner, H. G. De novo mutations in human genetic disease. Nat Rev Genet 13, 565-575, doi:10.1038/nrg3241 (2012).

122 Ramachandran, K. V. et al. Calcium influx through L-type CaV1.2 Ca2+ channels regulates mandibular development. J Clin Invest 123, 1638-1646, doi:10.1172/JCI66903 (2013).

123 Inoue, K. & Lupski, J. R. Molecular mechanisms for genomic disorders. Annu Rev Genomics Hum Genet 3, 199-242, doi:10.1146/annurev.genom.3.032802.120023 (2002).

124 Lindsay, E. A. Chromosomal microdeletions: dissecting del22q11 syndrome. Nat Rev Genet 2, 858-868, doi:10.1038/35098574 (2001).

125 Dauber, A. et al. SCRIB and PUF60 are primary drivers of the multisystemic phenotypes of the 8q24.3 copy-number variant. American Journal of Human Genetics 93, 798-811, doi:10.1016/j.ajhg.2013.09.010 (2013).

126 Zaghloul, N. A. et al. Functional analyses of variants reveal a significant role for dominant negative and common alleles in oligogenic Bardet-Biedl syndrome. Proc Natl Acad Sci U S A 107, 10602-10607, doi:10.1073/pnas.1000219107 (2010).

127 Khanna, H. et al. A common allele in RPGRIP1L is a modifier of retinal degeneration in ciliopathies. Nat Genet 41, 739-745, doi:ng.366 [pii] 10.1038/ng.366 (2009).

185

128 Davis, E. E. et al. TTC21B contributes both causal and modifying alleles across the ciliopathy spectrum. Nat Genet 43, 189-196, doi:10.1038/ng.756 (2011).

129 Oprea, G. E. et al. Plastin 3 is a protective modifier of autosomal recessive spinal muscular atrophy. Science 320, 524-527, doi:10.1126/science.1155085 (2008).

130 Raychaudhuri, S. et al. A rare penetrant mutation in CFH confers high risk of age-related macular degeneration. Nat Genet 43, 1232-1236, doi:10.1038/ng.976 (2011).

131 van de Ven, J. P. et al. A functional variant in the CFI gene confers a high risk of age-related macular degeneration. Nat Genet 45, 813-817, doi:10.1038/ng.2640 (2013).

132 Carvalho, C. M. et al. Complex rearrangements in patients with duplications of MECP2 can occur by fork stalling and template switching. Hum Mol Genet 18, 2188-2203, doi:ddp151 [pii] 10.1093/hmg/ddp151 (2009).

133 Leitch, C. C. et al. Hypomorphic mutations in syndromic encephalocele genes are associated with Bardet-Biedl syndrome. Nat Genet 40, 443-448, doi:ng.97 [pii] 10.1038/ng.97 (2008).

134 Lindstrand, A. et al. Recurrent CNVs and SNVs at the NPHP1 locus contribute pathogenic alleles to Bardet-Biedl syndrome. Am J Hum Genet 94, 745-754, doi:10.1016/j.ajhg.2014.03.017 (2014).

135 Davis, E. E. et al. TTC21B contributes both causal and modifying alleles across the ciliopathy spectrum. Nat Genet 43, 189-196, doi:ng.756 [pii] 10.1038/ng.756 (2011).

136 Stoetzel, C. et al. BBS10 encodes a vertebrate-specific -like protein and is a major BBS locus. Nat Genet 38, 521-524, doi:10.1038/ng1771 (2006).

137 Stoetzel, C. et al. Identification of a novel BBS gene (BBS12) highlights the major role of a vertebrate-specific branch of chaperonin-related proteins in Bardet-Biedl syndrome. Am J Hum Genet 80, 1-11, doi:10.1086/510256 (2007).

138 Mottaz, A., David, F. P., Veuthey, A. L. & Yip, Y. L. Easy retrieval of single amino-acid polymorphisms and phenotype information using SwissVar. Bioinformatics 26, 851-852 (2010).

139 Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res, gkt1113 (2013).

186

140 Karolchik, D. et al. The UCSC Genome Browser database: 2014 update. Nucleic Acids Res 42, D764-770 (2014).

141 Flicek, P. et al. Ensembl 2014. Nucleic Acids Res 42, D749-755 (2014).

142 Bainbridge, M. N. et al. Targeted enrichment beyond the consensus coding DNA sequence exome reveals exons with higher variant densities. Genome Biol 12, R68 (2011).

143 Challis, D. et al. An integrative variant analysis suite for whole exome next- generation sequencing data. BMC bioinformatics 13, 8 (2012).

144 Khanna, H. et al. A common allele in RPGRIP1L is a modifier of retinal degeneration in ciliopathies. Nature Genetics 41, 739-745 (2009).

145 Niederriter, A. R. et al. In Vivo Modeling of the Morbid Human Genome using Danio rerio. JoVE, e50338 (2012).

146 Katsanis, N. The oligogenic properties of Bardet-Biedl syndrome. Hum Mol Genet 13 Spec No 1, R65-71, doi:10.1093/hmg/ddh092 (2004).

147 Beales, P. L., Elcioglu, N., Woolf, A. S., Parker, D. & Flinter, F. A. New criteria for improved diagnosis of Bardet-Biedl syndrome: results of a population survey. J Med Genet 36, 437-446 (1999).

148 Green, J. S. et al. The cardinal manifestations of Bardet-Biedl syndrome, a form of Laurence-Moon-Biedl syndrome. N Engl J Med 321, 1002-1009, doi:10.1056/NEJM198910123211503 (1989).

149 Aldahmesh, M. A. et al. IFT27, encoding a small GTPase component of IFT particles, is mutated in a consanguineous family with Bardet-Biedl syndrome. Hum Mol Genet 23, 3307-3315, doi:10.1093/hmg/ddu044 (2014).

150 Ansley, S. J. et al. Basal body dysfunction is a likely cause of pleiotropic Bardet- Biedl syndrome. Nature 425, 628-633, doi:Doi 10.1038/Nature02030 (2003).

151 Badano, J. L. et al. Identification of a novel Bardet-Biedl syndrome protein, BBS7, that shares structural features with BBS1 and BBS2. Am J Hum Genet 72, 650-658, doi:10.1086/368204 (2003).

187

152 Chiang, A. P. et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11). Proc Natl Acad Sci U S A 103, 6287-6292, doi:10.1073/pnas.0600158103 (2006).

153 Chiang, A. P. et al. Comparative genomic analysis identifies an ADP-ribosylation factor-like gene as the cause of Bardet-Biedl syndrome (BBS3). Am J Hum Genet 75, 475-484, doi:10.1086/423903 (2004).

154 Fan, Y. et al. Mutations in a member of the Ras superfamily of small GTP-binding proteins causes Bardet-Biedl syndrome. Nat Genet 36, 989-993, doi:10.1038/ng1414 (2004).

155 Katsanis, N. et al. Mutations in MKKS cause obesity, retinal dystrophy and renal malformations associated with Bardet-Biedl syndrome. Nat Genet 26, 67-70, doi:10.1038/79201 (2000).

156 Kim, S. K. et al. Planar cell polarity acts through septins to control collective cell movement and ciliogenesis. Science 329, 1337-1340, doi:10.1126/science.1191184 (2010).

157 Li, J. B. et al. Comparative genomics identifies a flagellar and basal body proteome that includes the BBS5 human disease gene. Cell 117, 541-552 (2004).

158 Marion, V. et al. Exome sequencing identifies mutations in LZTFL1, a BBSome and smoothened trafficking regulator, in a family with Bardet--Biedl syndrome with situs inversus and insertional polydactyly. J Med Genet 49, 317-321, doi:10.1136/jmedgenet-2012-100737 (2012).

159 Mykytyn, K. et al. Identification of the gene that, when mutated, causes the human obesity syndrome BBS4. Nat Genet 28, 188-191, doi:10.1038/88925 (2001).

160 Mykytyn, K. et al. Identification of the gene (BBS1) most commonly involved in Bardet-Biedl syndrome, a complex human obesity syndrome. Nat Genet 31, 435- 438, doi:10.1038/ng935 (2002).

161 Nishimura, D. Y. et al. Positional cloning of a novel gene on chromosome 16q causing Bardet-Biedl syndrome (BBS2). Hum Mol Genet 10, 865-874 (2001).

162 Nishimura, D. Y. et al. Comparative genomics and gene expression analysis identifies BBS9, a new Bardet-Biedl syndrome gene. Am J Hum Genet 77, 1021- 1033, doi:10.1086/498323 (2005).

188

163 Otto, E. A. et al. Candidate exome capture identifies mutation of SDCCAG8 as the cause of a retinal-renal ciliopathy. Nat Genet 42, 840-850, doi:10.1038/ng.662 (2010).

164 Scheidecker, S. et al. Exome sequencing of Bardet-Biedl syndrome patient identifies a null mutation in the BBSome subunit BBIP1 (BBS18). J Med Genet 51, 132-136, doi:10.1136/jmedgenet-2013-101785 (2014).

165 Slavotinek, A. M. et al. Mutations in MKKS cause Bardet-Biedl syndrome. Nat Genet 26, 15-16, doi:10.1038/79116 (2000).

166 Bujakowska, K. M. et al. Mutations in IFT172 cause isolated retinal degeneration and Bardet-Biedl syndrome. Hum Mol Genet 24, 230-242, doi:10.1093/hmg/ddu441 (2015).

167 Ross, A. J. et al. Disruption of Bardet-Biedl syndrome ciliary proteins perturbs planar cell polarity in vertebrates. Nat Genet 37, 1135-1140, doi:10.1038/ng1644 (2005).

168 Badano, J. L., Mitsuma, N., Beales, P. L. & Katsanis, N. The ciliopathies: an emerging class of human genetic disorders. Annu Rev Genomics Hum Genet 7, 125- 148, doi:10.1146/annurev.genom.7.080505.115610 (2006).

169 Ashkinadze, E., Rosen, T., Brooks, S. S., Katsanis, N. & Davis, E. E. Combining fetal sonography with genetic and allele pathogenicity studies to secure a neonatal diagnosis of Bardet-Biedl syndrome. Clin Genet 83, 553-559, doi:10.1111/cge.12022 (2013).

170 Badano, J. L. et al. Dissection of epistasis in oligogenic Bardet-Biedl syndrome. Nature 439, 326-330, doi:10.1038/nature04370 (2006).

171 Badano, J. L. et al. Heterozygous mutations in BBS1, BBS2 and BBS6 have a potential epistatic effect on Bardet–Biedl patients with two mutations at a second BBS locus. Hum Mol Genet 12, 1651-1659, doi:10.1093/hmg/ddg188 (2003).

172 de Pontual, L. et al. Epistatic interactions with a common hypomorphic RET allele in syndromic Hirschsprung disease. Hum Mutat 28, 790-796, doi:10.1002/humu.20517 (2007).

173 Badano, J. L. & Katsanis, N. Beyond Mendel: an evolving view of human genetic disease transmission. Nat Rev Genet 3, 779-789, doi:10.1038/nrg910 (2002).

189

174 Eichers, E. R., Lewis, R. A., Katsanis, N. & Lupski, J. R. Triallelic inheritance: a bridge between Mendelian and multifactorial traits. Annals of medicine 36, 262- 272, doi:Doi 10.1080/07853890410026214 (2004).

175 Li, J. B. et al. Comparative Genomics Identifies a Flagellar and Basal Body Proteome that Includes the BBS5 Human Disease Gene. Cell 117, 541-552, doi:10.1016/s0092-8674(04)00450-7 (2004).

176 Efimenko, E. et al. Analysis of xbx genes in C. elegans. Development 132, 1923- 1934, doi:10.1242/dev.01775 (2005).

177 Stolc, V., Samanta, M. P., Tongprasit, W. & Marshall, W. F. Genome-wide transcriptional analysis of flagellar regeneration in Chlamydomonas reinhardtii identifies orthologs of ciliary disease genes. Proc Natl Acad Sci U S A 102, 3703- 3707, doi:10.1073/pnas.0408358102 (2005).

178 Ostrowski, L. E. et al. A proteomic analysis of human cilia: identification of novel components. Mol Cell Proteomics 1, 451-465, doi:10.1074/mcp.M200037-MCP200 (2002).

179 Pazour, G. J., Agrin, N., Leszyk, J. & Witman, G. B. Proteomic analysis of a eukaryotic cilium. J Cell Biol 170, 103-113, doi:10.1083/jcb.200504008 (2005).

180 Avidor-Reiss, T. et al. Decoding cilia function: defining specialized genes required for compartmentalized cilia biogenesis. Cell 117, 527-539, doi:10.1016/S0092-8674(04)00412-X (2004).

181 Keller, L. C., Romijn, E. P., Zamora, I., Yates, J. R., 3rd & Marshall, W. F. Proteomic analysis of isolated chlamydomonas reveals orthologs of ciliary-disease genes. Curr Biol 15, 1090-1098, doi:10.1016/j.cub.2005.05.024 (2005).

182 Andersen, J. S. et al. Proteomic characterization of the human centrosome by protein correlation profiling. Nature 426, 570-574, doi:10.1038/nature02166 (2003).

183 Blacque, O. E. et al. Functional genomics of the cilium, a sensory organelle. Curr Biol 15, 935-941, doi:10.1016/j.cub.2005.04.059 (2005).

184 Broadhead, R. et al. Flagellar motility is required for the viability of the bloodstream trypanosome. Nature 440, 224-227, doi:http://www.nature.com/nature/journal/v440/n7081/suppinfo/nature04541_S1. html (2006).

190

185 Otto, E. A. et al. Candidate exome capture identifies mutation of SDCCAG8 as the cause of a retinal-renal ciliopathy. Nat Genet 42, 840-850, doi:http://www.nature.com/ng/journal/v42/n10/abs/ng.662.html#supplementary- information (2010).

186 Katsanis, N. et al. Delineation of the critical interval of Bardet-Biedl syndrome 1 (BBS1) to a small region of 11q13, through linkage and haplotype analysis of 91 pedigrees. Am J Hum Genet 65, 1672-1679, doi:10.1086/302684 (1999).

187 Beales, P. L. et al. Genetic interaction of BBS1 mutations with alleles at other BBS loci can result in non-Mendelian Bardet-Biedl syndrome. Am J Hum Genet 72, 1187-1199, doi:Doi 10.1086/375178 (2003).

188 Janssen, S. et al. Mutation analysis in Bardet-Biedl syndrome by DNA pooling and massively parallel resequencing in 105 individuals. Hum Genet 129, 79-90, doi:10.1007/s00439-010-0902-8 (2011).

189 Stoetzel, C. et al. BBS10 encodes a vertebrate-specific chaperonin-like protein and is a major BBS locus. Nat Genet 38, 521-524, doi:http://www.nature.com/ng/journal/v38/n5/suppinfo/ng1771_S1.html (2006).

190 Harville, H. M. et al. Identification of 11 novel mutations in eight BBS genes by high-resolution homozygosity mapping. J Med Genet 47, 262-267, doi:10.1136/jmg.2009.071365 (2010).

191 Zaghloul, N. A. & Katsanis, N. Mechanistic insights into Bardet-Biedl syndrome, a model ciliopathy. J Clin Invest 119, 428-437, doi:10.1172/JCI37041 (2009).

192 Tobin, J. L. & Beales, P. L. The nonmotile ciliopathies. Genet Med 11, 386-402, doi:10.1097/GIM.0b013e3181a02882 (2009).

193 Nishimura, D. Y. et al. Bbs2-null mice have neurosensory deficits, a defect in social dominance, and retinopathy associated with mislocalization of rhodopsin. Proc Natl Acad Sci U S A 101, 16588-16593, doi:10.1073/pnas.0405496101 (2004).

194 Abd-El-Barr, M. M. et al. Impaired photoreceptor protein transport and synaptic transmission in a mouse model of Bardet-Biedl syndrome. Vision Res 47, 3394- 3407, doi:10.1016/j.visres.2007.09.016 (2007).

195 Jan, L. Y. & Revel, J.-P. ULTRASTRUCTURAL LOCALIZATION OF RHODOPSIN IN THE VERTEBRATE RETINA. The Journal of Cell Biology 62, 257- 273, doi:10.1083/jcb.62.2.257 (1974).

191

196 Mykytyn, K. et al. Identification of the gene that, when mutated, causes the human obesity syndrome BBS4. Nature genetics 28, 188-191, doi:Doi 10.1038/88925 (2001).

197 Tsang, W. Y. et al. Cep76, a centrosomal protein that specifically restrains centriole reduplication. Dev Cell 16, 649-660, doi:10.1016/j.devcel.2009.03.004 (2009).

198 Spektor, A., Tsang, W. Y., Khoo, D. & Dynlacht, B. D. Cep97 and CP110 suppress a cilia assembly program. Cell 130, 678-690, doi:10.1016/j.cell.2007.06.027 (2007).

199 Tsang, W. Y. et al. CP110 suppresses primary cilia formation through its interaction with CEP290, a protein deficient in human ciliary disease. Dev Cell 15, 187-197, doi:10.1016/j.devcel.2008.07.004 (2008).

200 Zhang, D. & Aravind, L. Novel transglutaminase-like peptidase and C2 domains elucidate the structure, biogenesis and evolution of the ciliary compartment. Cell Cycle 11, 3861-3875, doi:10.4161/cc.22068 (2012).

201 Kobayashi, T. & Dynlacht, B. D. Regulating the transition from centriole to basal body. J Cell Biol 193, 435-444, doi:10.1083/jcb.201101005 (2011).

202 Badano, J. L., Teslovich, T. M. & Katsanis, N. The centrosome in human genetic disease. Nat Rev Genet 6, 194-205, doi:10.1038/nrg1557 (2005).

203 Kim, J. C. et al. The Bardet-Biedl protein BBS4 targets cargo to the pericentriolar region and is required for microtubule anchoring and cell cycle progression. Nature genetics 36, 462-470, doi:Doi 10.1038/Ng1352 (2004).

204 den Hollander, A. I. et al. Mutations in the CEP290 (NPHP6) gene are a frequent cause of Leber congenital amaurosis. Am J Hum Genet 79, 556-561, doi:S0002- 9297(07)62755-4 [pii] 10.1086/507318 (2006).

205 Karmous-Benailly, H. et al. Antenatal presentation of Bardet-Biedl syndrome may mimic Meckel syndrome. Am J Hum Genet 76, 493-504, doi:S0002- 9297(07)63344-8 [pii] 10.1086/428679 [doi] (2005).

206 Sayer, J. A. et al. The centrosomal protein nephrocystin-6 is mutated in Joubert syndrome and activates transcription factor ATF4. Nat Genet 38, 674-681, doi:ng1786 [pii]

10.1038/ng1786 (2006).

192

207 Valente, E. M. et al. Mutations in CEP290, which encodes a centrosomal protein, cause pleiotropic forms of Joubert syndrome. Nat Genet 38, 623-625, doi:ng1805 [pii] 10.1038/ng1805 (2006).

208 Aliferis, K. et al. Differentiating Alstrom from Bardet-Biedl syndrome (BBS) using systematic ciliopathy genes sequencing. Ophthalmic Genet 1, 18-22, doi:10.3109/13816810.2011.620055 (2011).

209 Pereiro, I. et al. Arrayed primer extension technology simplifies mutation detection in Bardet-Biedl and Alstrom syndrome. Eur J Hum Genet 19, 485-488, doi:ejhg2010207 [pii] 10.1038/ejhg.2010.207 [doi] (2011).

210 Katsanis, N. The oligogenic properties of Bardet-Biedl syndrome. Hum Mol Genet 13 Spec No 1, R65-71, doi:10.1093/hmg/ddh092, ddh092 [pii] (2004).

211 Katsanis, N. et al. BBS4 is a minor contributor to Bardet-Biedl syndrome and may also participate in triallelic inheritance. Am J Hum Genet 71, 22-29, doi:S0002- 9297(07)60031-7 [pii] 10.1086/341031 [doi] (2002).

212 Muller, J. et al. Identification of 28 novel mutations in the Bardet-Biedl syndrome genes: the burden of private mutations in an extensively heterogeneous disease. Hum Genet 127, 583-593, doi:10.1007/s00439-010-0804-9 [doi] (2010).

213 Louie, C. M. et al. AHI1 is required for photoreceptor outer segment development and is a modifier for retinal degeneration in nephronophthisis. Nat Genet 42, 175-180, doi:10.1038/ng.519 (2010).

214 Rachel, R. A. et al. Combining Cep290 and Mkks ciliopathy alleles in mice rescues sensory defects and restores ciliogenesis. J Clin Invest 122, 1233-1245, doi:10.1172/JCI60981 (2012).

215 Lupski, J. R. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet 14, 417-422 (1998).

216 Boone, P. M. et al. Detection of clinically relevant exonic copy-number changes by array CGH. Hum Mutat 31, 1326-1342, doi:10.1002/humu.21360 (2010).

217 Zhang, F. et al. The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans. Nat Genet 41, 849-853, doi:ng.399 [pii] 10.1038/ng.399 (2009).

193

218 Chen, J. et al. Molecular analysis of Bardet-Biedl syndrome families: report of 21 novel mutations in 10 genes. Invest Ophthalmol Vis Sci 52, 5317-5324, doi:10.1167/iovs.11-7554 (2011).

219 Redin, C. et al. Targeted high-throughput sequencing for diagnosis of genetically heterogeneous diseases: efficient mutation detection in Bardet-Biedl and Alstrom syndromes. J Med Genet 49, 502-512, doi:10.1136/jmedgenet-2012-100875 (2012).

220 Saunier, S. et al. Characterization of the NPHP1 locus: mutational mechanism involved in deletions in familial juvenile nephronophthisis. Am J Hum Genet 66, 778-789, doi:S0002-9297(07)64008-7 [pii]10.1086/302819 (2000).

221 Hildebrandt, F. et al. A novel gene encoding an SH3 domain protein is mutated in nephronophthisis type 1. Nat Genet 17, 149-153, doi:10.1038/ng1097-149 (1997).

222 Caridi, G. et al. Renal-retinal syndromes: association of retinal anomalies and recessive nephronophthisis in patients with homozygous deletion of the NPH1 locus. Am J Kidney Dis 32, 1059-1062, doi:S0272638698003643 [pii] (1998).

223 Boone, P. M. et al. The Alu-rich genomic architecture of SPAST predisposes to diverse and functionally distinct disease-associated CNV alleles. Am J Hum Genet 95, 143-161, doi:10.1016/j.ajhg.2014.06.014 (2014).

224 Boone, P. M. et al. Alu-specific microhomology-mediated deletion of the final exon of SPAST in three unrelated subjects with hereditary spastic paraplegia. Genet Med 13, 582-592, doi:10.1097/GIM.0b013e3182106775 (2011).

225 Verdin, H. et al. Microhomology-mediated mechanisms underlie non-recurrent disease-causing microdeletions of the FOXL2 gene or its regulatory domain. PLoS Genet 9, e1003358, doi:10.1371/journal.pgen.1003358 (2013).

226 Stankiewicz, P. et al. Genomic and genic deletions of the FOX gene cluster on 16q24.1 and inactivating mutations of FOXF1 cause alveolar capillary dysplasia and other malformations. Am J Hum Genet 84, 780-791, doi:10.1016/j.ajhg.2009.05.005 (2009).

227 Shlien, A. et al. A common molecular mechanism underlies two phenotypically distinct 17p13.1 microdeletion syndromes. Am J Hum Genet 87, 631-642, doi:10.1016/j.ajhg.2010.10.007 (2010).

228 Carvalho, C. M. et al. Replicative mechanisms for CNV formation are error prone. Nat Genet 45, 1319-1326, doi:10.1038/ng.2768 (2013).

194

229 Hastings, P. J., Lupski, J. R., Rosenberg, S. M. & Ira, G. Mechanisms of change in gene copy number. Nat Rev Genet 10, 551-564, doi:10.1038/nrg2593 (2009).

230 McVey, M. & Lee, S. E. MMEJ repair of double-strand breaks (director's cut): deleted sequences and alternative endings. Trends Genet 24, 529-538, doi:10.1016/j.tig.2008.08.007 (2008).

231 Wang, K. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17, 1665-1674, doi:10.1101/gr.6861907 (2007).

232 Kim, J. C. et al. The Bardet-Biedl protein BBS4 targets cargo to the pericentriolar region and is required for microtubule anchoring and cell cycle progression. Nat Genet 36, 462-470, doi:10.1038/ng1352 (2004).

233 Roa, B. B. et al. Evidence for a recessive PMP22 point mutation in Charcot-Marie- Tooth disease type 1A. Nat Genet 5, 189-194, doi:10.1038/ng1093-189 (1993).

234 Yang, Y. et al. Molecular findings among patients referred for clinical whole- exome sequencing. JAMA 312, 1870-1879, doi:10.1001/jama.2014.14601 (2014).

235 Stray-Pedersen, A. et al. PGM3 mutations cause a congenital disorder of glycosylation with severe immunodeficiency and skeletal dysplasia. Am J Hum Genet 95, 96-107, doi:10.1016/j.ajhg.2014.05.007 (2014).

236 Bayer, D. K. et al. Vaccine-associated varicella and rubella infections in severe combined immunodeficiency with isolated CD4 lymphocytopenia and mutations in IL7R detected by tandem whole exome sequencing and chromosomal microarray. Clin Exp Immunol 178, 459-469, doi:10.1111/cei.12421 (2014).

237 Beales, P. L. et al. Genetic interaction of BBS1 mutations with alleles at other BBS loci can result in non-Mendelian Bardet-Biedl syndrome. Am J Hum Genet 72, 1187-1199 (2003).

238 Davis, R. E. et al. A knockin mouse model of the Bardet-Biedl syndrome 1 M390R mutation has cilia defects, ventriculomegaly, retinopathy, and obesity. Proc Natl Acad Sci U S A 104, 19422-19427, doi:10.1073/pnas.0708571104 (2007).

239 McIntyre, J. C. et al. Gene therapy rescues cilia defects and restores olfactory function in a mammalian ciliopathy model. Nat Med 18, 1423-1428, doi:10.1038/nm.2860 (2012).

195

240 Putoux, A. et al. KIF7 mutations cause fetal hydrolethalus and acrocallosal syndromes. Nat Genet 43, 601-606, doi:ng.826 [pii]10.1038/ng.826 (2011).

241 Zhang, Y. et al. BBS mutations modify phenotypic expression of CEP290-related ciliopathies. Hum Mol Genet 23, 40-51, doi:10.1093/hmg/ddt394 (2014).

242 Beales, P. L. et al. IFT80, which encodes a conserved intraflagellar transport protein, is mutated in Jeune asphyxiating thoracic dystrophy. Nat Genet 39, 727- 729, doi:10.1038/ng2038 (2007).

243 Halbritter, J. et al. Defects in the IFT-B component IFT172 cause Jeune and Mainzer-Saldino syndromes in humans. Am J Hum Genet 93, 915-925, doi:10.1016/j.ajhg.2013.09.012 (2013).

244 Huangfu, D. et al. Hedgehog signalling in the mouse requires intraflagellar transport proteins. Nature 426, 83-87, doi:10.1038/nature02061 (2003).

245 Fromer, M. et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet 91, 597-607, doi:10.1016/j.ajhg.2012.08.005 (2012).

246 Krumm, N. et al. Copy number variation detection and genotyping from exome sequence data. Genome Res 22, 1525-1532, doi:10.1101/gr.138115.112 (2012).

247 Alföldi, J. & Lindblad-Toh, K. Comparative genomics as a tool to understand evolution and disease. Genome Res 23, 1063-1068 (2013).

248 Katsanis, N. et al. BBS4 is a minor contributor to Bardet-Biedl syndrome and may also participate in triallelic inheritance. Am J Hum Genet 71, 22-29 (2002).

249 Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat Methods 7, 248-249 (2010).

250 Sim, N. L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res 40, W452-457 (2012).

251 Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A. Epistasis as the primary factor in molecular evolution. Nature 490, 535-538 (2012).

252 Weinreich, D. M., Delaney, N. F., DePristo, M. A. & Hartl, D. L. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111-114 (2006).

196

253 McCandlish, D. M., Rajon, E., Shah, P., Ding, Y. & Plotkin, J. B. The role of epistasis in protein evolution. Nature 497, E1-2; discussion E2-3 (2013).

254 Corbett-Detig, R. B., Zhou, J., Clark, A. G., Hartl, D. L. & Ayroles, J. F. Genetic incompatibilities are widespread within species. Nature 504, 135-137, doi:10.1038/nature12678 (2013).

255 Gao, L. & Zhang, J. Why are some human disease-associated mutations fixed in mice? Trends Genet 19, 678-681 (2003).

256 Weinreich, D. M., Lan, Y., Wylie, C. S. & Heckendorn, R. B. Should evolutionary geneticists worry about higher-order epistasis? Curr Opin Genet Dev 23, 700-707 (2013).

257 Chou, H. H., Chiu, H. C., Delaney, N. F., Segre, D. & Marx, C. J. Diminishing returns epistasis among beneficial mutations decelerates adaptation. Science 332, 1190-1192 (2011).

258 Kulathinal, R. J., Bettencourt, B. R. & Hartl, D. L. Compensated deleterious mutations in insect genomes. Science 306, 1553-1554 (2004).

259 Soylemez, O. & Kondrashov, F. A. Estimating the rate of irreversibility in protein evolution. Genome Biol Evol 4, 1213-1222 (2012).

260 Ferrer-Costa, C., Orozco, M. & de la Cruz, X. Characterization of compensated mutations in terms of structural and physico-chemical properties. J Mol Biol 365, 249-256 (2007).

261 Waterston, R. H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562 (2002).

262 Ahola, V., Aittokallio, T., Vihinen, M. & Uusipaikka, E. Model-based prediction of sequence alignment quality. Bioinformatics 24, 2165-2171 (2008).

263 Giudicessi, J. R. & Ackerman, M. J. Determinants of incomplete penetrance and variable expressivity in heritable cardiac arrhythmia syndromes. Transl Res 161, 1-14 (2013).

264 Kimura, M. The neutral theory of molecular evolution. (Cambridge University Press, 1984).

197

265 Povolotskaya, I. S. & Kondrashov, F. A. Sequence space and the ongoing expansion of the protein universe. Nature 465, 922-926 (2010).

266 Katsanis, N., Cotten, M. & Angrist, M. Exome and genome sequencing of neonates with neurodevelopmental disorders. Future Neuro 7, 655-658 (2012).

267 Herman, D. S. et al. Truncations of titin causing dilated cardiomyopathy. N Engl J Med 366, 619-628, doi:10.1056/NEJMoa1110186 (2012).

268 Montagnoli, A., Guardavaccaro, D., Starace, G. & Tirone, F. Overexpression of the nerve growth factor-inducible PC3 immediate early gene is associated with growth inhibition. Cell Growth & Differentiation 7, 1327-1336 (1996).

269 Beunders, G. et al. Exonic deletions in AUTS2 cause a syndromic form of intellectual disability and suggest a critical role for the C terminus. Am J Hum Genet 92, 210-220 (2013).

270 Fraisse, C., Elderfield, J. A. & Welch, J. J. The genetics of speciation: are complex incompatibilities easier to evolve? J Evol Biol 27, 688-699 (2014).

271 Robinson, P. N. Whole-exome sequencing for finding de novo mutations in sporadic mental retardation. Genome Biol 11, 144, doi:10.1186/gb-2010-11-12-144 (2010).

272 Ng, S. B. et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 42, 30-35, doi:10.1038/ng.499 (2010).

273 Rehm, H. L. Disease-targeted sequencing: a cornerstone in the clinic. Nat Rev Genet 14, 295-300, doi:10.1038/nrg3463 (2013).

274 Lin, X. et al. Applications of targeted gene capture and next-generation sequencing technologies in studies of human deafness and other genetic disabilities. Hear Res 288, 67-76, doi:10.1016/j.heares.2012.01.004 (2012).

275 Bick, D. & Dimmock, D. Whole exome and whole genome sequencing. Curr Opin Pediatr 23, 594-600, doi:10.1097/MOP.0b013e32834b20ec (2011).

276 Kumar, V. et al. Human disease-associated genetic variation impacts large intergenic non-coding RNA expression. PLoS Genet 9, e1003201, doi:10.1371/journal.pgen.1003201 (2013).

198

277 Ouellette, M. M., McDaniel, L. D., Wright, W. E., Shay, J. W. & Schultz, R. A. The establishment of telomerase-immortalized cell lines representing human chromosome instability syndromes. Hum Mol Genet 9, 403-411, doi:10.1093/hmg/9.3.403 (2000).

278 Karner, C. M. et al. Wnt9b signaling regulates planar cell polarity and kidney tubule morphogenesis. Nat Genet 41, 793-799, doi:10.1038/ng.400 (2009).

279 Patel, V. et al. Acute kidney injury and aberrant planar cell polarity induce cyst formation in mice lacking renal cilia. Hum Mol Genet 17, 1578-1590, doi:10.1093/hmg/ddn045 (2008).

280 Bruder, C. E. et al. Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. American Journal of Human Genetics 82, 763-771, doi:10.1016/j.ajhg.2007.12.011 (2008).

281 Piotrowski, A. et al. Somatic mosaicism for copy number variation in differentiated human tissues. Hum Mutat 29, 1118-1124, doi:10.1002/humu.20815 (2008).

282 Perry, G. H. et al. Copy number variation and evolution in humans and chimpanzees. Genome Res 18, 1698-1710, doi:10.1101/gr.082016.108 (2008).

283 Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nat Genet 39, 1256-1260, doi:10.1038/ng2123 (2007).

284 Bailey, J. A. & Eichler, E. E. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet 7, 552-564, doi:10.1038/nrg1895 (2006).

285 Perry, G. H. The evolutionary significance of copy number variation in the human genome. Cytogenet Genome Res 123, 283-287, doi:10.1159/000184719 (2008).

286 Tsai, A. C. et al. Exon deletions of the EP300 and CREBBP genes in two children with Rubinstein-Taybi syndrome detected by aCGH. Eur J Hum Genet 19, 43-49, doi:10.1038/ejhg.2010.121 (2011).

287 Zarrei, M., MacDonald, J. R., Merico, D. & Scherer, S. W. A copy number variation map of the human genome. Nat Rev Genet 16, 172-183, doi:10.1038/nrg3871 (2015).

199

288 Golzio, C. et al. KCTD13 is a major driver of mirrored neuroanatomical phenotypes of the 16p11.2 copy number variant. Nature 485, 363-367, doi:http://www.nature.com/nature/journal/v485/n7398/abs/nature11091.html#su pplementary-information (2012).

289 Yu, S. et al. Cardiac defects are infrequent findings in individuals with 8p23.1 genomic duplications containing GATA4. Circ Cardiovasc Genet 4, 620-625, doi:10.1161/CIRCGENETICS.111.960302 (2011).

290 Feuk, L., Carson, A. R. & Scherer, S. W. Structural variation in the human genome. Nat Rev Genet 7, 85-97, doi:10.1038/nrg1767 (2006).

291 Mankad, A. et al. Natural gene therapy in monozygotic twins with Fanconi anemia. Blood 107, 3084-3090, doi:10.1182/blood-2005-07-2638 (2006).

292 Suriano, G. et al. In vitro demonstration of intra-locus compensation using the ornithine transcarbamylase protein as model. Hum Mol Genet 16, 2209-2214, doi:10.1093/hmg/ddm172 (2007).

293 DePristo, M. A., Weinreich, D. M. & Hartl, D. L. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet 6, 678-687, doi:10.1038/nrg1672 (2005).

294 Covert, A. W., 3rd, Lenski, R. E., Wilke, C. O. & Ofria, C. Experiments on the role of deleterious mutations as stepping stones in adaptive evolution. Proc Natl Acad Sci U S A 110, E3171-3178, doi:10.1073/pnas.1313424110 (2013).

295 Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57 (2008).

296 Baresic, A., Hopcroft, L. E., Rogers, H. H., Hurst, J. M. & Martin, A. C. Compensated pathogenic deviations: analysis of structural effects. J Mol Biol 396, 19-30 (2010).

297 Study, T. D. D. D. Large-scale discovery of novel genetic causes of developmental disorders. Nature, doi:10.1038/nature14135 (2014).

298 Kousi, M., Lehesjoki, A. E. & Mole, S. E. Update of the mutation spectrum and clinical correlations of over 360 mutations in eight genes that underlie the neuronal ceroid lipofuscinoses. Human Mutation 33, 42-63, doi:10.1002/humu.21624 (2012).

200

Biography

Stephan George Frangakis was born on June 25th, 1984 in Albany, New York.

Stephan attended Rutgers University in New Jersey, where he double majored in

Molecular Biology and Chemistry, and minored in Philosophy. He completed the Henry

Rutgers Scholars Program, presenting his honors thesis on the role of microRNAs in cell differentiation. He was elected to the Phi Beta Kappa Society and graduated magna cum laude in 2006 with a Bachelor of Arts. Stephan came to Duke University in 2006 as part of the Medical Scientist Training Program, and after 2 years of medical education and 3 years in the lab of Howard Rockman, came to the Center for Human Disease Modeling as a Ph.D. candidate under the mentorship of Nicholas Katsanis in 2012, in the department of Cell Biology. He will complete his Ph.D. work in June 2015, and after completing his last year of medical school, be awarded his combined MD/PhD in May of

2016.

Publications:

Casad ME, Abraham D, Kim IM, Frangakis S, Dong B, Lin N, Wolf MJ, Rockman HA, Cardiomyopathy Is Associated with Ribosomal Protein Gene Haplo-Insufficiency in Drosophila melanogaster, Genetics, 189, 861-870, 2011

Davis EE, Frangakis S, Katsanis N, Interpreting human genetic variation with in vivo zebrafish assays, Biochimica et Biophysica Acta, 1842, 10, 2014

201

Jordan DM*, Frangakis S*, Golzio C, Cassa C, Kurtzberg J, Task Force for Neonatal Genomics, Davis EE, Sunyaev SR, Katsanis N, Nature, in press, 2015

Lindstrand A, Frangakis S, Carvalho CMB, Willer JR, Pehlivan D, Liu P, Sabo A, Lewis RA, Banin E, Lupski JR, Davis EE, Katsanis N, The American Journal of Human Genetics, in revision, 2015

Frangakis S, Kousi M, Liu J, Davis EE, Katsanis N, Targeted sequencing reveals increased mutational load in BBS patients and identifies CEP76 as a novel driver of BBS. in preparation, 2015

* These authors contributed equally to this work

202