Finding Mutations Which Cause Disease DNA - RNA - Protein

HTS and medical genetics Finding mutations which cause disease DNA - RNA - protein

gene Mutation

DNA intron intron exon exon exon

Transcription

RNA

Translation

Protein Aim

@EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ;;3;;;;;;;;;;;;7;;;;;;;88 @EAS54_6_R1_2_1_540_792 R|G TTGGCAGGCCAAGGCCGATGGATCA + ;;;;;;;;;;;7;;;;;-;;;3;83 @EAS54_6_R1_2_1_443_348 Compare to GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 reference ;;;;;;;;;;;9;7;;.7;39333

FASTQ format

3 billion reads 3 billion bases 1 mutation Genetic disease: the challenge

Single gene (monogenic) disorders Approximately 50% of children with congenital syndromes/mental retardation do not receive a firm etiological diagnosis Many of these children have disorders that are primarily genetic in origin. Many disorders are individually rare and difficult to recognize even for experts. Clinical phenotypes can be non-specific and variable. Many phenotypes are genetically heterogeneous.

Human genome is (quite) big ~23 000 genes ~3 billion bases pairs ! How can we (rapidly) identify a mutation causing disease? Mendelian disease in man

OMIM Statistics 2951 of the well-characterized phenotypes registered in OMIM have a known molecular basis

1965 3743 registered phenotypes with known or 2951 suspected Mendelian basis, no associated gene has been identiﬁed

1778 !

Improve speed/cost of diagnosis of known Mendelian, gene known genetic disorders Mendelian, gene unknown Suspected Mendelian Improve speed/cost of identiﬁcation of the cause new/suspected genetic disorders Methods for identifying variants/aberrations

DNA (Sanger) sequencing

Southern blotting Fragment analysis MLPA FISH

Array CGH

Karyotyping SNP array

Low throughput (limited number of loci per run) Detect speciﬁc types of variation Mendelian disease in man

OMIM Statistics 2951 of the well-characterized phenotypes registered in OMIM have a known molecular basis

3743 registered phenotypes with known or 1965 suspected Mendelian basis, no associated gene 2951 has been identiﬁed

Protein coding regions of the human genome (the 1778 exome) constitute approximately 1.5% of the total, but harbour ~85% of the mutations with large effects on disease-related traits Mendelian, gene known Mendelian, gene unknown Suspected Mendelian Exome sequencing

HTS in research and routine diagnostics? REVIEWS

3 Box 1 | Workflow for exome sequencing variants for monogenic diseases . Second, most alleles that are known to underlie Mendelian disorders disrupt Since 2007, there has been tremendous progress in the development of diverse protein-coding sequences13. Third, a large fraction of technologies for capturing arbitrary subsets of a mammalian genome at a scale rare, protein-altering variants, such as missense or commensurate with that of massively parallel sequencing8,10,72–79. To capture all nonsense single-base substitutions or small insertion– protein-coding sequences, which constitute less than 2% of the human genome, the field has largely converged on the aqueous-phase, capture-by-hybridization deletions (that is, indels), are predicted to have functional 14 approach described below. consequences and/or to be deleterious . As such, the The basic steps required for exome sequencing are shown in the figure. Genomic exome represents a highly enriched subset of the genome DNA is randomly sheared, and several micrograms are used to construct an in which to search for variants with large effect sizes. shotgun library; the library fragments are flanked by adaptors (not shown). Next, the library is enriched for sequences corresponding to exons (dark blue fragments) by Defining the exome. One particular challenge for apply- aqueous-phase hybridization capture: the fragments are hybridized to biotinylated ing exome sequencing has been how best to define the DNA or RNA baits (orange fragments) in the presence of blocking oligonucleotides set of targets that constitute the exome. Considerable that are complementary to the adaptors (not shown). Recovery of the hybridized uncertainty remains regarding which sequences of the fragments by biotin–streptavidin-based pulldown is followed by amplification and human genome are truly protein coding. When sequence massively parallel sequencing of the enriched, amplified library and the mapping and calling of candidate causal variants. Barcodes to allow sample indexing can capacity was more limiting, initial efforts at exome potentially be introduced during the initial library construction or during sequencing erred on the conservative side (for exam- post-capture amplification. Key performance parameters include the degree of ple, by targeting the high-confidence subset of genes enrichment, the uniformity with which targets are captured and the molecular identified by the Consensus Coding Sequence (CCDS) complexity of the enriched library. Project). Commercial kits now target, at a minimum, all At least three vendors (Agilent, Illumina and Nimblegen) offer kitted reagents of the RefSeq collection and an increasingly large num- for exome capture. Although there are technical differences between them (for ber of hypothetical proteins. Nevertheless, all existing example, Agilent relies on RNA baits, whereas Illumina and Nimblegen use DNA baits targets have limitations. First, our knowledge of all truly — the kits vary in the definition of the exome), we find the performance of these kits protein-coding exons in the genome is still incomplete, to be largely equivalent, and each is generally scalable to 96-plex robotic so current capture probes can only target exons that have automation. The fact that the costs of exome sequencing are not directly proportional to the fraction of the genome targeted is a consequence of several been identified so far. Second, the efficiency of capture factors, including imperfect capture specificity, skewing in the uniformity of target probes varies considerably, and some sequences fail to coverage introduced by the capture step and the fixed or added costs that are be targeted by capture probe design altogether (FIG. 1). associated with sample processing (for example, library construction and exome Third, not all templates are sequenced with equal effi- capture). This ratio will fall as the cost ofExome whole-genome sequencing drops. sequencingciency, and not all sequences can be aligned to the ref- Although methods for calling single nucleotide substitutions are maturing80, there erence genome so as to allow base calling. Indeed, the is considerable room for improvement in detecting small insertion–deletions and effective coverage (for example, 50×) of exons using especially copy number changes from short-read exome sequence data81 (for currently available commercial kits varies substantially. example,detecting a heterozygous, single-exon deletion with breakpoints that fall Finally, there is also the issue of whether sequences other within adjacent introns). Exome sequencing also needs improvements of a technical than exons should be targeted (for example, microRNAs nature. First, input requirements (several micrograms of high-quality DNA) are such ultra-conserved elements that many samples that have already been collected are inaccessible. Protocols using (miRNAs), promoters and ). whole-genome amplification or transposase-based library construction offer a These caveats aside, exome sequencing is rapidly prov- solution82, but additional work is required to fully integrate and validate these ing to be a powerful new strategy for finding the cause methods.oligonucleotides Second, as the minimum ‘unit’ of sequencing of massively parallel of known or suspected Mendelian disorders for which sequencing continues to increase, sample indexing with minimal performance loss the genetic basis has yet to be discovered. and minimal crosstalk between samples will be required to lower the costs of exome sequencing. Third, a substantial fraction of the exome (~5–10%, depending on the kit) Identifying causal alleles is poorly covered or altogether missed, largely owing to factors that are not specific A key challenge of using exome sequencing to find to exome capture itself. novel disease genes for either Mendelian or complex traits is how to identify disease-related alleles among the background of non-pathogenic polymorphism and sequencing errors. On average, exome sequencing identifies ~24,000 single nucleotide variants (SNVs) in African American samples and ~20,000 in European American samples (TABLE 1). More than 95% of these variants areDesign already known exome as polymorphisms capture in array human populations. Strategies for finding causal alleles against this background vary, as they do for traditional approachesMake to gene discovery, sequence depending library on factors from patient DNA such as: the mode of inheritance of a trait; the pedigree or population structure; whether a phenotype arises owing to de novo or inherited variants; and the extent AGGTCGTTACGTACGCTAC Hybridize and capture GACCTACATCAGTACATAG of locus heterogeneity for a trait. Such factors also influ- GCATGACAAAGCTAGGTGT ence both the sample size needed to provide adequate power to detect trait-associated alleles and the selection of the mostSequence successful analytical framework.

FASTQ format

3 billion reads 3 billion bases 1 mutation Aligning sequence

Reference CGATGCTGTTGCATGATGCTAGTCGATGCTGTTGGGTTTGATCGTGGGCTAGCTAGC

GTTGATGCTGTTGGGTTTG TGATGCTAGTTGATGCTGTTGGG Reads CGATGCTGTTGCATGA TTGATCGTGGGCTAGCTAGC GATGCTAGTTGATGCTGTT

TGTTGCATGATGCTAGTT ATGATGCTAGTTGATGCTGTTGGGT TGGGTTTGATCGTGGGCTA TGTTGCATGATGCTAGTTGATGCTG

Consensus CGATGCTGTTGCATGATGCTAGTTGATGCTGTTGGGTTTGATCGTGGGCTAGCTAGC Coverage 111111333333466555556665555554544443332222222222111111111 Quality (&_(&%£C*H*WH C(*HQWHHFHW DWQH::WKNVIUHWIUBIWUB CIUWIUUBW Exome sequencing data Types of genetic variation

A Homozygous A SNP/SNV Heterozygous A

Deletion (hemizygous)

CNV/indel Duplication

SNP/SNV: single nucleotide polymorphism/variant CNV: copy number variant indel: insertion/deletion Effect of variants

missense - change amino acid nonsense - premature stop codon frameshift - shift the codon frame Indels - multiple effects What’s in an exome?

> 20 000 variants

R128 Human Molecular Genetics, 2010, Vol. 19, Review Issue 2 Downloaded from Figure 3. Reported numbers of LOF variants per individual genome in several published large-scale sequencing studies. Individuals labelled European, East Asian or Nigerian are HapMap individuals from reference (27). Numbers for Venter and Lupski are from references (25) and (28), respectively. Small grey segments in Venter and Lupski histograms indicate unreported numbers for splice-disrupting SNPs and frame-shift indels, respectively.

Since the publication~100 of Venter’s genome,loss-of-function several dozen complex diseases. variants Finally, however, the sheer number of these complete genomes and exomes (targeted sequencing of the variants present within each human genome also poses major hmg.oxfordjournals.org protein-coding regions of the genome) have been generated analytical challenges for clinical genome sequencing studies using ‘second-generation’ sequencing technologies (26). Unfor- seeking to identify a single disease-causing mutation, especially tunately, few of the resulting publications have provided sys- in cases whereMacArthur there is & little Tyler-Smith or no additional Hum information Mol Genet (such 19 (2010). as tematic counts of observed LOF variants. Even where such linkage data) available to filter out spurious variants. numbers are provided, differences between studies in terms of the sequencing technology, read-mapping and variant-calling algorithms and annotation sets make it difficult to compare studies. MOVING FORWARD at University of Oslo Library on October 13, 2010 These differences, and the problems with sequencing and As mentioned above, comparison of published genomic data annotation artefacts raised above, mean that there is as yet no sets is challenging due to heterogeneity in the technologies clear consensus on the number of LOF variants present in an and analysis techniques employed, the annotation sets used individual genome. To illustrate the extent of the discrepancy, and the degree of filtering and validation of variants. In order one recent exome sequencing study (27) reported an average to better understand the full spectrum of LOF variation in the of 45 nonsense SNPs, 16 splice-disrupting SNPs and 46 frame- human genome, it will be necessary to take a more systematic shift indels per genome in individuals of non-African ancestry, approach to the analysis of genome-wide sequence data. whereas a whole-genome sequence of a European male (28) We and others are currently performing this type of analysis reported 121 stop SNPs and 112 splice-disrupting SNPs but as part of the 1000 Genomes Project, an international collab- did not report frame-shift indels (Figure 3). At least some of oration generating low-coverage whole-genome and high- this discrepancy relates to the relatively conservative consensus coverage exome sequence data from 2500 individuals from coding sequence (CCDS) gene set (29) targeted by the exome 27 diverse populations (http://www.1000genomes.org/page. sequencing study, but differences in sequencing technology php); the results of a pilot study are expected to be published and variant calling thresholds likely also played a role. around the same time as this review. The project is stimulating Several lessons can be drawn from the data generated so far. improvements in the annotation of both gene models and First, the current catalogue of human LOF variants is incom- disease-associated variants in databases such as the Human plete—especially for insertion/deletion variants, which are still Gene Mutation Database (HGMD; http://www.hgmd.cf.ac. more difficult than SNPs to ascertain from short-read sequence uk/ac/index.php). It is expected to provide a catalogue of data due to their frequently repetitive sequence context—and most coding LOF variants present at 5% frequency in the also likely contains many sequencing and annotation artefacts. populations studied by the end of 2010≥ and 0.1% when Second, although there is wide variation in per-individual the project is complete. ≥ numbers between studies (Figure 3), LOF variants are certainly Simply collecting observed LOF variants from a high- more prevalent within human genomes than most observers throughput survey, however, is not sufficient to provide a would have predicted prior to the genomic era; using even the useful resource, since we expect a significant minority of more conservative estimates thus far from large-scale sequen- these variants to be false positives for the reasons outlined cing (27), each individual carries at least 100 of these variants, above. Therefore, we are subjecting many of the LOF variants and at least 30 in the homozygous state. collected by the project to both experimental validation and Third, the distribution of allele frequencies of LOF variants careful manual reannotation of the surrounding gene structure, suggests—as might be expected—that they as a class tend to thus providing a core set of true LOF variants to serve as the be evolutionarily deleterious (4), in turn suggesting that they basis for interpretation of these variants in other sequencing may provide a rich source of potentially causal variation for studies. In the long term, this catalogue will prove most Finding causitive mutations is not easy

Many, many variants will be found

Which variants are deleterious?

Novel? (dbSNP, 1000genomes, HGMD)

Synonymous/non-synonymous?

Conserved? SNPnexus Alter protein structure? PolyPhen2 MutationTaster ANNOVAR SeattleSeq Annotation

This is the hard part Strategies to identify mutations

REVIEWS multiple individuals same disease large multigenerational pedigrees de novo mutations population frequency for complex diseases

Figure 2 | Strategies for finding disease-causing rare variants using exome sequencing. Four main strategies are illustrated. a | Sequencing and filtering across multiple unrelated, affected individuals (indicated by the three coloured circles). This approach is used to identify novel variants in the same gene (or genes), as indicated by the shaded region that is shared by the three individuals in this example. b | Sequencing and filtering among multiple affected individuals from within a pedigree (shaded circles and squares) to identify a gene (or genes) with a novel variant in a shared region of the genome. c | Sequencing parent–child trios for identifying mutations. d | Sampling and comparing the extremes of the distribution (arrows) for a quantitative phenotype. As shown in panel d, individuals with rare variants in the same gene (red crosses) are concentrated in one extreme of the distribution.

Bamshad et al. Nature Reviews Genetics Nov 2011 Analytical failures can follow from the limitations Application to clinical diagnostics and assumptions of discrete filtering. Perhaps the major Discovery of variants that underlie both Mendelian limitation of discrete filtering is that its power is substan- and complex traits will naturally lead to a much deeper tially reduced by genetic heterogeneity. For example, if understanding of disease mechanisms that should, in alleles of one gene account for only a fraction of cases, turn, facilitate development of improved diagnostics, no single gene will be found to have disease-causing prevention strategies and targeted therapeutics48,49. For alleles in all cases, and several other genes may carry example, the finding that several families with domi- neutral mutations in as many cases, depending on the nantly inherited adult-onset arterial calcifications had sample size. In this scenario, it is impossible to separate mutations in NT5E — a gene that encodes a protein the causal alleles from the non-causal alleles. From an involved in adenosine metabolism — allowed for con- analytical perspective, false-negative calls, the presence sideration of specific therapeutic interventions that of disease-causing alleles in the comparative data set and would otherwise not have been considered50. Some of reduced penetrance result in a reduced signal-to-noise these improvements will soon be realized (for example, ratio that is practically indistinguishable from genetic better diagnostics for Mendelian disorders and disorders heterogeneity. False-positive calls will result in detection of unknown aetiology), whereas others (for example, Processed pseudogenes of candidate genes that cannot logically be eliminated risk profiling for complex traits) are likely to be a more Copies of the coding by filtering alone. False-positive calls are frequently distant realization. sequences of genes that lack observed in segmental duplications and processed promoters and introns, contain pseudogenes poly(A) tails and are flanked . Particularly notorious are processed pseu- Diagnosis. One of the immediate applications of exome by target-site duplications. dogenes that are not currently represented in the human sequencing will be facilitating the accurate diagnosis genome (for example, CDC27; see REF. 11). Finally, quan- of individuals with Mendelian disorders that: present Posterior probability tifying the strength of the results of discrete filtering with atypical manifestations; are difficult to confirm The probability of an event by significance testing (for example, by P value or by using clinical or laboratory criteria alone (for exam- after combining prior posterior probability knowledge of the event with ) is problematic in the absence of a ple, when symptoms are shared among multiple dis- the likelihood of that event better understanding about the nature and distribution orders); or require extensive or costly evaluation (for given by observed data. of variation across the exome. example, when there is a long list of possible candidate

more exomes ARTICLES

Table 3 Number of candidate genes identified based on different filtering strategies Number of affected exomes Subsets of 3 exomes Subsets of all 4 exomes

Dominant model 1 2 3 Any 1 Any 2 Any 1 Any 2 Any 3 NS/SS/I 4,645–4,687 3,358–3,940 2,850–3,099 6,658 4,489 6,943 5,167 3,920 Not in dbSNP129 634–695 136–369 72–105 1,617 274 1,829 553 172 Not in HapMap 8 898–979 161–506 55–117 2,336 409 2,628 835 222 Not in either 453–528 40–228 10–26 1,317 109 1,516 333 44 Predicted damaging 204–284 10–83 3–6 682 37 787 126 11

Recessive model NS/SS/I 2,780–2,863 1,993–2,362 1,646–1,810 4,097 2,713 4,293 3,172 2,329 Not in dbSNP129 92–115 30–53 22–31 226 61 270 90 42 stricter criteria Not in HapMap 8 111–133 13–46 5–13 329 32 397 75 19 Not in either 31–45 2–9 2–3 100 6 121 14 4 Predicted damaging 6–16 0–2 0–1 35 2 44 4 1 Under the dominant model, at least one nonsynonymous variant, splice acceptor or donor site variant or coding indel (NS/SS/I) in a gene was required in the gene. Under the recessive model, at least two novel variants were required, and these could be either at the same position (a homozygous variant) or at two different positions in the same gene (a potential compound heterozygote, though we are unable to ascertain phase at this stage). In each column are the range for the number of candidate genes for exomes considered individually (column 1) and all combinations of 2–4 exomes (columns 2–4). Note that the upper bound on the ranges may be inflated relative to what would be the case if four unrelated affected individuals had been used because the comparisons in which the two siblings were included provided reduced power compared to unrelated individuals. Columns 5–9 show the number of candidate genes when at least 1, 2 or 3 individuals is required to have one variant in a gene (dominant model) or two or more variants in a gene (recessive model). This is a simple model of genetic heterogeneity or incomplete data. For example, the total number of candidate genes common to any 3 of all 4 exomes is shown in column 9. For columns 5–6, one of the siblings (kindred 1-B) was not included in the analysis as siblings share 50% of variants. Comparing two exomes identiﬁes ~20 000 SNPs ing. All four individuals were found to be compound heterozygotes was no genetic heterogeneity in our sample of individuals with Miller for missense mutations in DHODH that are predicted to be delete- syndrome. In the presence of heterogeneity, it is possible to relax strin- rious. Collectively, 11 different mutations in 6 kindreds with Miller gency by allowing genes common to subsets of all affected individuals syndrome wereWhich identified in DHODHis the by a combination causal of exome variant? and to be considered candidates, although this method will reduce power targeted resequencing (Table 4 and Fig. 2). Each parent of an affected (Table 3). Fourth, all of the individuals with Miller syndrome for individual who was tested was found to be a heterozygous carrier, and whom exomes were sequenced were of European ancestry. Sequencing none of the mutations appeared to have arisen de novo. In the kindreds exomes of affected individuals sampled from populations with a dif- with affectedIn siblings, a nonefamily, of the unaffected compare siblings were compound more ferent geographicalexomes ancestry who have a higher number of novel and/ heterozygotes. None of these mutations were found in 200 control or rare variants (for example, individuals with sub-Saharan African chromosomes from unaffected individuals of matched geographical or East Asian ancestry) will make the identification of candidate genes ancestry that were genotyped. Ten of these mutations were missense more difficult. This will become less of an issue as databases of human mutations, two of which affected the same amino acid codon, and one polymorphisms become increasingly comprehensive. was a 1-bp indel that is predicted to cause a frameshift resulting in a Additional factors could facilitate the future application of this strat- termination codon seven amino acids downstream. One mutation, egy. Mapping information, such as blocks of homozygosity, could focus C1036T, was shared between two unrelated individuals with Miller the search to a smaller pool of candidates. The number of candidate syndrome who are of different self-identified geographical ancestry. variants can also be reduced further by comparison between variants Each of the amino acid residues affected by a DHODH mutation is in an affected individual to those found in each parent. For autosomal highly conserved among homologs studied to date (Supplementary dominant disorders, this strategy can discover de novo coding variants, Fig. 1). A single, validated nonsynonymous polymorphism in human as neither parent is predicted to have a mutation that causes a fully DHODH has been studied previously19. This polymorphism causes a penetrant dominant disorder; by contrast, in recessive disorders, par- lysine-to-glutamine substitution in the relatively diverse N-terminal ents are predicted to be carriers of the disease-causing variants. extension of dihydroorotate dehydrogenase that is responsible for the There are at least three aspects of this approach where we see sub- association of the enzyme with the inner mitochondrial membrane. stantial scope for improvement. The first relates to missed variant calls, either due to low coverage or because some variants are not identified DISCUSSION easily with current sequencing platforms—for example, those within We show that the sequencing of the exomes of affected individuals repeat tracts in coding sequences. The second is that our filtering relied from a few unrelated kindreds, with appropriate filtering against public on a public SNP database (dbSNP) that has a highly uneven ascertain- SNP databases and a small number of HapMap exomes, is sufficient ment of variation across the genome. It would be better to rely on to identify a single candidate gene for a monogenic disorder whose catalogs of common variation that are ascertained in a single study cause had previously been unknown, Miller syndrome. Several factors either exome wide (as with the eight HapMap exomes2) or genome were important to the success of this study. First, Miller syndrome is a wide (for example, as with the 1,000 genomes project) and where very rare disorder that is inherited in an autosomal recessive pattern. estimates of allele frequency are available. Increasing the number of Therefore, the causal variants were unlikely to be found in public SNP control exomes progressively reduces the relevance of dbSNP to this databases or in control exomes. Second, genes for recessive diseases analysis (Supplementary Fig. 2). Furthermore, as increasingly deep will, in general, be easier to find than genes for dominant disorders catalogs of polymorphism become available, it may be necessary to because fewer genes in any single individual have two or more new establish frequency-based thresholds for defining common variation or rare nonsynonymous variants. Third, we were fortunate that there that is unlikely to be causal for disease. A third concern is that the

NATURE GENETICS | VOLUME 42 | NUMBER 1 | JANUARY 2010 33 Example Finding the mutation in ARAS syndrome ARAS syndrome

Born blind, due to anopthalmia/ micropthalmia Early onset (infant) neurodegenerative disease Diagnostic test? MRI shows brain atrophy Often fatal early Linkage analysis

Mendelian disorder Multiple pedigress Dominant ! Linkage analysis ! Identify region of genome with disease locus Genetic markers (known) and disease locus (unknown) are co-inherited Linkage results

chr17 (q25.1-q25.3) p13.3 p13.2 17p13.1 17p12 17p11.2 17q11.2 17q12 21.2 q21.31 17q22 23.2 q24.2 q24.3 q25.1 17q25.3 4 LOD=3 2 0 LOD − 2 Combined Family_22028 Family_60080 − 4 Family_50031

0 10 20 30 40 50 60 70

Linkage to chromosome 17q 2.4 Mb region (large) Candidate region

chr17 (q25.1-q25.3) p13.3 p13.2 17p13.1 17p12 17p11.2 17q11.2 17q12 21.2 q21.31 17q22 23.2 q24.2 q24.3 q25.1 17q25.3

Scale 1 Mb hg19 chr17: 73,500,000 74,000,000 74,500,000 75,000,000 75,500,000 ATP5H HN1 MRPS7 MIR3678 LLGL2 ITGB4 TRIM65 CDK3 FOXJ1 QRICH2 AANAT MXRA7 MGAT5B SEC14L1 SEPT9 ATP5H HN1 GRB2 LLGL2 ITGB4 TRIM65 EVPL FAM100B SPHK1 CYGB MXRA7 MGAT5B SEC14L1 SEPT9 KCTD2 NUP85 KIAA0195 RECQL5 H3F3B FBF1 GALR2 PRPSAP1SSP AANAT MXRA7 MGAT5B SEC14L1 MIR4316 SLC16A5 GGA3 CASKIN2 RECQL5 UNK ACOX1 ZACNSSP SPHK1 PRCD JMJD6 LINC00338 SEPT9 ARMC7 GGA3 CASKIN2 RECQL5 UNK ACOX1 RNF157 SPHK1 PRCD JMJD6 SCARNA16 SEPT9 NT5C GGA3 TSEN54 SAP30BP UNC13D TEN1 SPHK1 SNHG16 METTL23 SEC14L1 SEPT9 NT5C GGA3 LLGL2 ITGB4 MRPL38 SRP68 UBE2O METTL23 SEC14L1 SEPT9 NT5C GGA3 MYO15B GALK1 ACOX1 RHBDF2 METTL23 SEC14L1 LOC100507351 HN1 GRB2 C17orf109 MIR4738 TEN1-CDK3 RHBDF2 METTL23 SEC14L1 SUMO2 C17orf109 WBP2 SRP68 SNHG16 METTL23 SEPT9 SUMO2 C17orf110 TRIM47 SRP68 SNHG16 METTL23 MIF4GD SRP68 SNHG16 METTL23 MIF4GD EXOC7 SNORD1C SRSF2 MIF4GD EXOC7 SNORD1B SRSF2 genes MIF4GD EXOC7 SNORD1A SRSF2 LOC100287042 EXOC7 ST6GALNAC2 MIR636 SLC25A19 EXOC7 ST6GALNAC1 SLC25A19 EXOC7 MFSD11 SLC25A19 RNF157-AS1 MFSD11 MFSD11 MFSD11 MFSD11 MFSD11 MFSD11 genome.ucsc.edu

72 genes Candidate genes? Function? Expression pattern? SSP and others What are our options?

Sanger sequencing of candidate genes?

Potentially simple and quick

Potentially slow and expensive Capture/sequencing region under linkage peak?

Good chance of success (can capture exons+++)

Design and test new capture array - expensive/time Exome sequencing?

Off-the-shelf reagents available, comprehensive exons

Mutation may not be exonic Resequencing: mutation detection

Genomic region unknown Genomic region known Linkage peak Rare Mendelian GWAS disorders ! ! Sequence capture Sequence capture - exome Region of interest Exome RNAseq

51 Exome sequencing

affected unaffected 4 individuals 2 affected, 2 unaffected Identify variants only in affected √ x x

intersection Aim

FASTQ format

Sequence Mutation Exome capture in 4 easy steps

VN&J#B6360&46'4363)#,$ PN&;'92'$*'&*34)26'

GENOMIC SAMPLE (Set of chromosomes) SureSelect Target Enrichment System %'$,1#*&G:A Capture Process NGS Kit

/63%1'$)&%G:A ++ SureSelect GENOMIC SAMPLE (PREPPED) SureSelect HYB BUFFER BIOTINYLATED RNA LIBRARY “BAITS” /#--E#$

Hybridization 3..&3.34)'6" STREPTAVIDIN COATED MAGNETIC BEADS + "'92'$*#$%&-#B6360 Library Preparation

<6 h (<3 h hands-on) <4 h (<10 min hands-on) Bead capture UNBOUND FRACTION Wash Beads DISCARDED and Digest RNA Amplify Sequencing

Figure 1. SureSelect Target Enrichment System Workﬂ ow 3 WN&L--21#$3&"'92'$*#$% HiSeq2000

cBot T

G A T A C C G G C A A T T T G G G G A C C C Cluster Generation Sequencing by Synthesis Cluster sequence library <4 h (<10 min hands-on) 1.5-8 days (<10 min hands-on) Sequencing by synthesis

QN&A$3-0"#" Software

A-#%$&6'3."&),&6'/'6'$*'&%'$,1' !"#$ !%&"'()# *+,-

./0$)#$)%1#22+,3 4(2"./ !""#$%%&&&'()*)+,*-./")01'((1-0'/0'23%#-*450"1%,/1"60%

Q7)'$ GF"<3 76+GNS GFONIFFF GFO_FFFF GFO_IFFF GFOHFFFF GFOHIFFF GFIFFFFF ?,$+"QA--'($&"B+)7< (%*+1%V^^!^!k! 45!6786%%9-+" !""#$%%!/++*+7/('01!7'582%,/1"9:"**73)"% (%*+1%];B^^LG^nB (%*+1%]^^V^VkV (%*+1%V^^!^!k! 71&(%0;,/%1%/.1A,V^^!^!k! (%*+1%!^^P^!kV .(,,$%,$]^^B^BkB 71&(%0;,/%1%/.1A,B^^V^VkV 71&(%0;,/%1%/.1A,!^^V^VkV (%*+1%B^^]^]k] 71&(%0;,/%1%/.1A,B^^]^]k] (%*+1%!!^^\E^;!! 71&(%0;,/%1%/.1A,B^^]^]k] (%*+1%*^^]^]k] 71&(%0;,/%1%/.1A,V^^B^BkB (%*+1%B^^V^VkV (%*+1%]^^B^BkB (%*+1%V^^!^!k! (%*+1%!;B^^LG^nB (%*+1%!^^V^VkV (%*+1%V^^!^!k! (%*+1%VB^^\E^;VB P$4Q$@"V$%$, 59+3,+,3 :%;%(9+3, !""#$%%&&&'+*;*0-/,"'0*. #bYM ]-V"L,')%&,"8L,')%&,"`"MFF"a),$,")+$"W(06*"V+$$%= ]-V"L,')%&, ?]Q]"V$%$,"a),$&"1%"P$4Q$@9"?%(U+1*9"V$%a)%<9"]]\Q")%&"]1.-)+)*(2$"V$%1.(7, #bYM P$-$)*(%0"['$.$%*,"3/"P$-$)*#),<$+"2$+,(1%"M>E>N QLZ[ WLZ[ WBP \Z! Q(.-'$ W15"]1.-'$C(*/ <=5 !""#$%%()*<(&/'1*2-05,*-=5'+5"% Q)*$''(*$ PZ! K*6$+ ?%<%15% Q(.-'$"ZA7'$1*(&$"U1'/.1+-6(,.,"8&3QZU"3A('&"GMF= <3--&K36#3$)" +,NMHNR_R_ +,GEOIGHFM +,EE_IONH +,MOFERFR_ +,REFRFRHO +,EE_IONI +,_NRRIH +,EE_IOR_ +,EEMHHMI +,EEMHHMG +,MEGOGGO +,MN_IHM_ +,NGMRINHG +,NME_MFHF +,MIFINGNG +,GGRIIGIE +,EE_ION_ +,GRHOMINN +,GEHIFOOE +,RGNMIMIM +,_NRRIN +,GRHOMIHG +,EEMHHMO +,NG_MH_NN +,GH_H_GG +,NEGF_OM +,RIFMMGR +,GN_GNEFM +,IRENRHHF +,EFGIHI +,GH_GIGO +,I_MOMOHH +,GEHIFHRG +,MOF__FGO +,MEGR__O +,EFOFIRH +,EEMHHMM +,MNRFOME +,GH_H_GF +,NEGOOMR +,M_GIFGM +,EEMHHMF +,GEHOFGRG +,NMHNR_NF +,NMHNR_NE +,MIEMFEOG +,NMHNR_NN +,EE_IONM +,MIGREOHM +,NME_MFRF +,GMHRRFF +,MNRFOMM +,IHRIFFRN +,EE_IORN +,M_GIFGE +,NMHNO_FR +,M_MFOIH +,NEEMGRE +,NMHNR_NM +,NMHNR_NO +,RGNMIMI_ +,IRGRMM_H +,NMHO_E +,IHNMNMHF +,EEMHHME +,NME_MFNF +,REFRFRHR +,EE_IORO +,RRIN_EHF +,GEHIFIMH +,NMHNR_RH +,EEMHHMR +,MOM_OG_R +,RNO__OOI +,EE_IONN +,EE_IORH +,MEGR__M +,NOFRIRH +,ONHEFFH +,MIGHRNMR +,MIEH_FMF +,GFIHROEF +,MO_OEFNH +,MEGR__R +,MEGR__I +,GGFN__IO +,GEHOGGHN +,_NRRRF +,RGNMIMO_ +,NGMRIN_H +,IRNRERRE +,NME_MFNE +,NEFHO_M +,GGRINIIH +,E_ONI_RI +,MIEHEGEN +,OFIRHI_ +,MOFOEOHR +,EFGRER +,_NRRI_ +,MORHINN_ +,RR_RIEII +,RNMNF_O_ +,EFHNRIN +,NEGIIF_ +,NME_MF_M +,MOMHMRFG +,NME_GFNN +,NMHNR_NI +,EE_IONR +,IREIHMHG +,GFIN___F +,GRHOMIH_ +,_FRNGF_ +,EE_IORR +,M_GIFGG +,NMHNR_NG +,IIH_FHNR +,EE_IONO +,EFGRGHM +,RGIONFEM +,NEIE_MH_ +,NOFRHMM +,EE_IORI +,M_GIFFH +,NEEMFFOE +,NMHNR_NR +,EE_IONE +,NEFMEGN_ +,MRF_EINI +,M_MOIHN +,MORHIMI_ +,EE_IORM +,M_GIFF_ +,NGMRIN_N +,GFRGHOEI +,I_GHENH +,GGRIOFGH +,RFOMRENF +,GEHORI_I +,M_GIFGF +,GFIHII_E +,EE_IONG +,REFRFRHI +,NGMRINHF +,GEHMRR_G +,NEFH_FR +,M_GIFFN +,RRIFOOOG +,EE_IONF +,MIEINGO_ +,GGRIMOIO +,MOOIRME_ +,MO__GGRN +,ONHEFF_ +,I_NMENMI +,MFIG_RF +,IRFMONRN +,MIRM_HH_ +,M_GIFFR +,MOGRIO_F +,RFHOFGNM +,NMHNO_FO +,NEMNMRN_ +,M_GIFFI +,NGMRIN__ +,GRHOMRFO +,NMHNO_FI +,MOE_FHMR +,RNRNNOO_ +,NGMRFE_I >()+(,"?1(99+,3 !(@"%%92 !""#$%%1/."**71'1*2-05,*-=5'+5"% +,RNFNR_H_ +,GGRIFMEG +,NEMHNMRM +,MOOIENN_ +,MOENOFEF +,MII_HIFG +,NGGMHFIR +,NEEGEIFN +,INNOG_NO +,RMM_NORF +,NEFHEGGR +,NEG_G_RF +,MOMHHNIN +,GFRGFOG_ +,NEFEG__F >/4"%%92 !""#$%%;0,"**71'1*2-05,*-=5'+5"% +,I_GHE_F +,NEFGIEFE +,I_HEGHIF U')7$%*)'"#)..)'"a),$5(,$"]1%,$+2)*(1%"3/"U6/'1U #A'*(J"!'(0%.$%*,"14"OO"d$+*$3+)*$, P6$,A, #1A,$ \10 ['$-6)%* K-1,,A. U')*/-A, ]6(7<$% W(J)+& :^*+1-(7)'(, Q*(7<'$3)7< >()+(,"?(,,%"("+%, !#(""9#!#A?5,,%"("+%, !""#$%%=;1'=1'&/1!)+="*+'582%>5/""75>56?++*"/")*+%

X#-)'6&K36#3$)" B("(?;+#'+,3 CD> !""#$%%&&&'(-*/8)+1")"2"5'*-=%1*,"&/-5%)=;% E/!/?<)%'2#) !""#$%%=5+*.5'2010'582%

F+21 "(G+H !""#$%%1/."**71'1*2-05,*-=5'+5"%"/()9'1!".7 I#)9 !""#$%%&&&'#5-7'*-=% J !""#$%%&&&'-<#-*450"'*-=% Y#'( Resequencing steps

Capture Sequence

! Align to reference Call variants The hard part Filter } View ExomeSeq results

All variants 21456 10987 4356 2489

Novel 1345 547 245 148

Missense 567 296 143 67

Damaging 298 139 56 22

Highly conserved 143 30 9 2

Very good candidate variants

How can we conﬁrm these? Identify correct variant? Exome sequencing data Data viewer - IGV Individual

SSP Heterozygous variant in affected individuals

Conﬁrm in further individuals/familes by Sanger

Mutation in SSP causes ARAS syndrome Genetic diagnosis and rare diseases

2009 2013 ! ! Conﬁrm diagnosis Conﬁrm or clarify diagnosis ! ! Test series of single genes 1000s genes tested at once ! ! Often international labs Carried out in Norway ! ! Very expensive Costs decreasing ! ! Time-consuming (years) Fast (weeks)

A diagnostic revolution (common/rare) Summary High-throughput sequencing Dramatic increase in sequence production Many applications on one platform Field new and moving very quickly Diagnostic (exome) sequencing in place Huge impact on human/medical genetics

Challenges/opportunities Data storage/backup/distribution

Data analysis Whole-genome sequencing? People NSC Siri Okkenhaug MedGen OUS Magnus Leithaug Sari Thiele Hanne Sorte Kristine Fjelland Gregor Gilﬁllan Monica Solbakken Martin Hammerø Rune Moe Yvan Strahm Tim Hughes Beate Skinningsrud Ave Tooming-Klunderud Trine Prescott Morten Skage Asbjørg Stray-Pedersen Gregor Gilﬁllan Olaug Rødningen Ying Sheng Robert Lyle Lex Nederbracht Dag Undlien Sissel Jentoft Robert Lyle Kjetill Jakobsen Dag Undlien [email protected] [email protected]