! ! ! ! ! ! ! ! This! is! copyrighted! material.! Please! do! not! distribute! or! use! other! than! for! personal! purposes! without! the! permission! of! Suzanne!Leal!or!Michael!Nothnagel.!

i ii Table&of&Contents&

Monday' '...... '1' Introduction*to*finding*pathogenic*variants*using*NGS*data* *...... *1* Introduction*to*the*basics* *...... *3* Calculation*of*LOD*scores* *...... *13* Cloud*computing* *...... *20*

Tuesday' '...... '25' Penetrance* *...... *25* Software*to*perform*linkage*analysis* *...... *28* Genotyping*error*detection*using*Merlin* *...... *33* Homozygosity*mapping* *...... *35*

Wednesday' '...... '39' File*formats*for*sequence*data* *...... *39*

Thursday' '...... '49' Filtering*approaches*for*the*analysis*of*NGS*data* *...... *49* Association*analysis*for*Mendelian*traits* *...... *63* Performing*linkage*analysis*using*sequencing*data*...... *69*

Friday' '...... '73' Evaluating*power*using*simulation*studies* *...... *73* Functional*studies* *...... *80* Variant*annotation* *...... *85* *

iii iv Analysis of Family Data Via Filtering Strategies Samples in a Family Undergoing NGS Two or More Affected & Unaffected One Affected Introduction to Finding Pathogenic Affected Individuals Individuals Individual Variants Using NGS Data: Exclude variants which are not shared by affected Linkage Analysis and Filter Approaches individuals & (present in unaffected individuals )

Suzanne M. Leal [email protected] Exclude non-coding variants and & coding variants which are found Center for Statistical Genetic in databases (ExAC) or are not rare e.g. >0.5% Baylor College of Medicine https://www.bcm.edu/research/labs/center-for-statistical-genetics Test for segregation of identified variants with disease phenotype & sequence variants in ethnically matched controls

This Strategy can Fail! Performing Linkage Analysis • DNA samples from all informative pedigree None of the variants completely segregate with members are genotyped using arrays disease status • Parametric two-point and multipoint linkage Ø Affected individuals are phenocopies or incorrectly diagnosed analysis performed Ø Unaffected individuals are disease variant carriers (reduced D • For consanguineous pedigrees segregating penetrance) autosomal recessive traits Ø Sample swaps have occurred – Homozygosity mapping can also be used Ø Locus heterogeneity within the pedigree Homozygoisty Linkage Analysis Homozygosity Mapping

++ D+ Region of ++ D+ mapper score ++ D+

Phenocopy Reduced Penetrance Sample Swaps Homozyosity

Benefits of Performing Linkage Analysis Trouble Shooting Using Linkage Analysis Using Genotyping Arrays • Linkage analysis can be performed using genotyping arrays or sequence data • Aids in selection of individuals for sequencing • Observed LOD scores compared to • Maps the disease locus to specific genomic – Expected maximum LOD (EMLOD) region(s) – Maximum LOD (MLOD) • Filtering can be performed within several Mb, i.e. • Deflated LOD scores can be due to – Incorrect phenotype information linkage region, instead of the entire genome – Locus heterogeneity within the pedigree – Reducing the number of variants which need to be • Genotypes can also aid in detection of incorrect followed-up familial relationships • Testing for segregation in pedigrees • Evaluating frequencies in ethnically matched controls

1 Non-syndromic Hearing Impairment (NSHI)

• 893 NSHI families ascertained REPORT American Journal of Human Genetics – Pakistan, USA, Switzerland, Turkey, Jordan, Hungry in KARS, Encoding Lysyl-tRNA (Roma), Poland & Germany Synthetase, Cause Autosomal-Recessive Nonsyndromic Hearing Impairment DFNB89

• Intra-familial heterogeneity in the collection Regie Lyn P. Santos-Cortez,1,8 Kwanghyuk Lee,1,8 Zahid Azeem,2,3 Patrick J. Antonellis,4,5 Lana M. Pollock,4,6 Saadullah Khan,2 Irfanullah,2 Paula B. Andrade-Elizondo,1 • 15.3% (95% CI 11.9 - 19.9%) Santos-Cortez et al. 2015 EJHG Ilene Chiu,1 Mark D. Adams,6 Sulman Basit,2 Joshua D. Smith,7 University of Washington Center for Mendelian Genomics, Deborah A. Nickerson,7 Brian M. McDermott, Jr.,4,5,6 • Linkage analysis followed by exome sequencing led Wasim Ahmad,2 and Suzanne M. Leal1,* to the identification of a number of NSHI Previously, DFNB89, a locus associated with autosomal-recessive nonsyndromic hearing impairment (ARNSHI), was mapped to chromo- somal region 16q21–q23.2 in three unrelated, consanguineous Pakistani families. Through whole-exome sequencing of a hearing- – impaired individual from each family, missense mutations were identified at highly conserved residues of lysyl-tRNA synthetase KARS (Santos-Cortez et al. 2013 AJHG) (KARS): the c.1129G>A (p.Asp377Asn) variant was found in one family, and the c.517T>C (p.Tyr173His) variant was found in the other two families. Both variants were predicted to be damaging by multiple bioinformatics tools. The two variants both segregated with the – ADCY1 (Santos-Cortez et al. 2014 Hum Mol Genet) nonsyndromic-hearing-impairment phenotype within the three families, and neither was identified in ethnically matched controls or within variant databases. Individuals homozygous for KARS mutations had symmetric, severe hearing impairment across – TBC1D24 (Rehman et al. 2014 AJHG) all frequencies but did not show evidence of auditory or limb neuropathy. It has been demonstrated that KARS is expressed in hair cells of zebrafish, chickens, and mice. Moreover, KARS has strong localization to the spiral ligament region of the cochlea, as well as to Deiters’ cells, the sulcus epithelium, the basilar membrane, and the surface of the spiral limbus. It is hypothesized that KARS variants affect ami- noacylation in inner-ear cells by interfering with binding activity to tRNA or p38 and with tetramer formation. The identification of rare KARS variants in ARNSHI-affected families defines a that is associated with ARNSHI.

Hearing impairment (HI) affects nearly 300 million people ilies. The knowledge that has been gained from functional, of all ages globally and increases in prevalence per decade expression, and localization studies after the identification of life.1 Children and adults with bilateral, moderate-to- of genes with mutations that cause NSHI has immensely profound HI have a poorer quality of life, which encom- expanded our understanding of inner-ear physiology. passes not only problems in physical function but also Previously, an ARNSHI-associated locus, DFNB89, was socioemotional, mental, and cognitive difficulties.2,3 In mapped to chromosomal region 16q21–q23.2 in two unre- particular, children with congenital HI must be identified lated, consanguineous Pakistani families.6 The two fam- and habilitated within the first 6 months of life so that de- ilies, 4338 and 4406 (Figures 1A and 1B), had maximum lays in theRare Homozygous Variants in the acquisition of speech, language, and reading multipoint LOD scores of 6.0 and 3.7, respectively. The ho- DFNB89 Locus (16q21-q23.2) skills can be prevented.4 mozygosity regions that overlap in the two families led to Among children with congenitalDFNB89 Region sensorineural HI, more the identification of a 16.1 Mb locus (chr16: 63.6–79.7 Mb) Bilateral symmetric moderate-to-profound SNP genotyping (Illumina linkage panel) than 80% do not display syndromic features and ~60% that includes 180 genes. Additionally, a third consanguin- hearing impairment across all frequencies Region of homozygosity 16.1 Mb: have a family history of HI or a confirmed genetic etiol- eous Pakistani family, 4284 (Figure 1C), was identified, and ogy.5 Because of the complex cellular organization of the showed suggestive linkage to the DFNB89 region with a 4338 (LOD 6.0) Containing ~180 genes inner ear, hundreds of genes and are predicted maximum multipoint parametric LOD score of 1.93. For Familyto influence auditoryGene mechanisms. ToVariant date, for nonsyn- familyFrequency 4284, linkage analysisDamaging* was performed for ~6,000 dromic HI (NSHI), about 170 loci have been localized SNPExAC markers that were genotyped across the genome and mutations in ~75 genes have been identified in hu- with the Illumina Linkage Panel IVb. 4284 (LOD 1.9) 4406mans (Hereditary HearingCOG4 Loss Homepage).p.Ile271Val Of the gene Consanguineous0.0005 familiesMT, LRT 4284, 4338, and 4406 from variants that have been implicated in NSHI, almost 60% Pakistan are affected by ARNSHI, which was found to 4406are autosomal recessiveZFHX3 (AR) in inheritance,p.Pro1929Ser and 95% of segregate0.0005 with unique haplotypesNone within the DFNB89 lo- the genes that harbor mutations that cause ARNSHI were cus (Figures 1A–1C). From the medical history, no other 4406,initially4284 mapped andKARS identified in consanguineousp.Tyr173His fam- risk0.00002 factors were identified asAll a possible cause of HI. For

43381Center for Statistical Genetics,KARS Department ofp.Asp377Asn Molecular and Human Genetics, Baylor College0 of Medicine, Houston,All Texas 77030, USA; 2Department of 4406 (LOD 3.7) Biochemistry, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan; 3Department of Biochemistry, Azad Jammu Kashmir Medical College, Muzaffarabad, Azad Jammu and Kashmir 13100, Pakistan; 4Department of Otolaryngology Head and Neck Surgery, Case Western Reserve 4338University, Cleveland, OHCNTNAP 44106, USA; 5Departmentp.Ala1235Thr of Biology, Case Western Reserve0.00002 University, Cleveland, OH 44106,MT, LRT USA; 6Department of Genetics and Genome Sciences, Case Western Reserve University, Cleveland, OH 44106, USA; 7Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA 4 8These authors contributed equally to this work *Correspondence: [email protected] http://dx.doi.org/10.1016/j.ajhg.2013.05.018. Ó2013 by The American Society of Human Genetics. All rights reserved. One individual from each family *Bioinformatics Tools: CADD, LRT, MutationAssessor, MutationTaster (MT), PolyPhen-2, SIFT selected for exome sequencing All variant sites were deemed to be conserved (PhyloP & GERP) based upon region of homozygosity Basit et al., Hum. Genet. 2010 132 The American Journal of Human Genetics 93, 132–140, July 11, 2013

KARS Variants Segregate with HI in Analysis of Family Based Data (Mendelian) DFNB89 Families Genotype Informative family Pedigree Members Perform Linkage Analysis KARS encodes lysyl- Select Pedigree Member(s) for Sequencing tRNA synthetase Remove Variants Which are not Rare in ExAC, e.g. MAF> 0.5%

Investigate Functionality using Bioinformatic Tools

Determine if Variant Segregates with Phenotype

Population Specific Frequencies for Variant

Acquire Additional Families with Variants with the Same Gene p.Try173His & pAsp377Asn not observed in 750 ethnically matched Pakistani chromosomes

2 Goal of Linkage Studies § To localize disease/trait/susceptibility loci to a unique position on the genome § Only family data can be used to carry out linkage Introduction to the Basics studies § Extend families § Pedigrees with multiple branches and/or multigenerational Suzanne M. Leal § Nuclear Families [email protected] § Parents and offspring Center for Statistical Genetic, Baylor College of Medicine https://www.bcm.edu/research/labs/center-for-statistical-genetics § Trios (parents and proband) cannot be used for www.statgen.us linkage studies § Suitable for association studies

Copyrighted © S.M. Leal 2015

Types of families for Linkage Analysis Linkage Analysis & Homozygosity Mapping Extended Pedigrees • Can be used to reduce the region to be followed up with sequencing – Thus greatly reducing the number of variants Nuclear Pedigrees – May lead to identification of the causal variant where other approaches have failed • Genotype all available informative families member to perform linkage analysis/homozygosity Trios mapping

Chromosome in meiosis with two crossovers Parametric Linkage Analysis • For Mendelian traits – Mode of inheritance must be known – Autosomal Recessive – Autosomal Dominant – X-linked – Trait can have reduced penetrance or phenocopies

Two homologous Four gametic chromosomes, each Two products (egg, crossovers with two chromatids sperm cells)

3 Linkage Analysis – Allele Sharing Methods Parametric Linkage - Analysis • Also known as nonparametric or model free • Goal method – To test whether there is linkage between a – Neither nonparametric or model free disease locus and a marker or set of marker loci – Mode of Inheritance does not need to be known – Null hypothesis • Complex traits • No linkage - recombination fraction (θ=0.5) – Underlying genetic model is not specified in the analysis – Recombination rate 50% » Disease locus and marker locus/loci far apart – Loci on two different chromosomes

Parametric Linkage - Analysis Linkage Analysis - Allele Sharing Methods § • Alternative hypothesis Compare the amount of allele sharing between • linkage θ<0.5 § Affected Sibling – Wish to reject the null hypothesis of no linkage § Other affected relative pairs • Use a LOD score criterion of 3.3 (p<0.05) § Avuncular § e.g. uncle-Niece § Cousins – Estimate the recombination fraction (genetic distance) between the disease and the marker loci

Linkage Analysis - Allele Sharing Methods Polymorphisms & Variants § Variety of tests to elucidate if there is an § Polymorphism excess of allele sharing § A region of the genome that varies between individual members of a population § Mean test § Usually with a frequency of at least 1 or 5% § Null hypothesis § Variant or mutation § Under no linkage § An alteration in a genome compared to some reference § Affected siblings share 50% of their alleles state § Does not have to be causal or functional § Alternative Hypothesis § Types of Variants § Under linkage § Pathogenic § Affected siblings share > 50% of their alleles § Of unknown significance § Benign

4 Loci & Alleles Loci & Alleles

§ Locus: A specific position on the genome § Codominant § For the autosomes 2 alleles are observed at each § Both alleles are expressed in the heterozygous state locus § Dominant § Alleles: Are alternative forms of DNA § Expression is the same in heterozygous as in the homozygous state sequence that occur at a locus § The homozygous state can sometimes produce a more severe § e.g. the A, B, 0 alleles of the AB0 gene phenotype than the heterozygous state § Homozygous lethal § Recessive § Homozygous state is necessary for expression

Hardy Weinberg Equilibrium (HWE) HWE • For the autosomes the proportion of each • The organism is diploid genotype follows the laws of HWE • Reproduction is sexual – p2, 2pq & q2 • Generations are non-overlapping • Which is based upon the observed allele • Mating is random frequencies • Population size is very large • Migration is negligible • Mutation can be ignored • Natural selection does not affect the alleles under consideration

HWE HWE

Example 2 allelic system (e.g. SNP marker) The following genotype counts are observed

Allele 1 frequency p= (2N11 + N12)/2N Observed Expected 1 1 300 ? Allele 2 frequency q= (2N22 + N12)/2N 1 2 500 ? 2 2 200 ? Expected proportions of heterozygotes and homozygotes under HWE Allele frequencies 2 1 1= p 1 allele: p=(600+500)/2000=0.55 1 2 = 2pq 2 allele: q=(500+400)/2000=0.45 2 2 = q2 Note q=(1-p)

5 HWE HWE

2 Expected genotype frequencies under HWE 2 (observed-Expected) Χ =Σ Expected 11 p2=0.3025 12 2pq=0.495 Observed Expected 22 q2= 0.2025 1 1 300 302.5 1 2 500 495.0 2 2 200 202.5 Expected number of genotypes under HWE* Χ2= (300-302.5)2/302.5+(500-495)2/495+(200-202.5)2/202.5=0.102 11 302.5 12 495 2 22 202.5 Χ = 0.102 p=0.75 1 df *For a sample size of 1,000 individuals

Testing for deviations in HWE Reasons for Deviation from HWE

• Chi-square tests • Population Admixture • Exact tests • Heterozygous Advantage • Likelihood ratio tests • Copy number variants • Genotyping Error • Chance

Loci, Genotypes & Haplotypes Locus, Genotype & Haplotype § Multiple marker on a chromosome Genotypes are known Chromosome1 § Microsatellites Genotype for Locus A 1 1 § Single nucleotide polymorphisms (SNPs) Locus A: 1 1 Locus B 2 2 Locus C 1 2 § Single nucleotide variants (SNVs) Locus B: 2 2 Locus D 1 2 § The haplotype for each Genotype chromosome of a pair § The two alleles at a locus comprise a genotype usually needs to be § Haplotype reconstructed Haplotypes for § The alleles on each chromosome Locus A & B: 1 2, 1 2 Locus C & D: 1 1, 2 2 or 1 2, 2 1

6 Linkage Studies-Genetic Maps Genetic Maps

• A map provides the position and order of marker • Map distance given in Centimorgans (cM) loci • Recombination (Θ) fractions cannot be • Physical position added • Genetic position – Except in the case of complete interference – Based upon interpolation for SNV (single nucleotide variant) x=Θ and SNP (single nucleotide polymorphism • Under complete interference multiple crossovers • Genetic position necessary to perform multipoint between two loci can be excluded linkage analysis – It can also be assumed there is complete interference when two loci are closely linked (θ<0.05) » Then recombination fractions can be added

Genetic Maps Haldane Map Function (Haldane 1919)

• Can convert Θ to map distances using • – Map functions Assumption that crossovers in different intervals • Haldane occur according to a Poisson probability law • Kosombi – Note x is given in Morgans • Sturt – The distances can then be summed • X = -1/2 ln(1-2θ) if 0 < θ < ½ • { No one-to-one correspondence between map • infinity otherwise, distance and number of base pairs – Recombination events variable across the genome • The Inverse is θ = ½[1-exp(-2|x|)]

Genome Scan Data (Marker Loci) for Genetic Maps Linkage Analysis • Most SNP and SNVs are not on genetic maps • Microsatellite Marker loci • Physical position and order known – Not currently usually used – Unknown genetic map distance • Genotyping Arrays • Using Genetic Maps such as • SNP and SNV marker Loci • Rutgers Combined Linkage-Physical Map • Exome and whole genome sequence data – http://compgen.rutgers.edu/mapinterpolator • SNV and SNP marker loci • Interpolation can be used to estimate the genetic distance of markers to perform linkage analysis

7 Microsatellite Markers Heterozygosity (H) § For the most part have been replaced by § Provides information on what proportion SNP marker loci of individuals that will be heterozygous for § Microsatellite markers have many alleles a particular marker locus § Heterozygosity >0.71 § Assumption Hardy Weinberg Equilibrium § Usually denoted by a D# H=1-Σp 2 § Linkage whole genome scans i § 10 cM scan § ~400 marker loci § 5 cM whole genome scan § ~800 marker loci

SNP and SNV Marker loci SNP and SNV Marker loci

• Most commonly used markers for linkage • Denoted by an rs# analysis are SNP loci • SNP have a minor allele frequency (MAF) of >5% – Base change at a single nucleotide – Can also be defined as having a MAF >1% • Most have only two alleles (diallelic) • SNVs have a MAF < 1% – But can have up to four alleles – Usually diallelic but can have up to four alleles • Those which have more than two alleles are not used • Heterozygosity <0.5

Genotyping Arrays SNP/SNV Marker Loci* Genotyping Arrays § Illumina HumanCore-24 Bead Chip § ~300,000 SNP marker loci • Higher density arrays are overkill for linkage § Up to a additional 300,000 custom markers analysis § Illumina HumCoreExome-24 Bead Chip – A subset of informative markers can be used § ~300,000 SNP marker loci § ~240,000 Exome marker loci • e.g. ~0.20cM § Up to an additional 100,000 custom markers • Once linkage has been established denser maps of markers § Illumina HumanOmni5-Quad can be analyzed within the linkage region § ~4.2 Million SNP/SNV marker loci • Using entire set of markers extremely slow to § Up to an additional 500,000 custom markers analyze § Illumina HumanOmni5Exome – § ~4.5 Million SNP/SNV marker (including exome content) May not be able to complete linkage analysis within a § Up to an additional additional 200,000 custom markers reasonable amount of time *These arrays were all developed for association studies- the Illumina linkage array has been discontinued

8 Features of Mendelian Traits Heterogeneity • Allelic Heterogeneity • Non-Allelic/Locus/Linkage Heterogeneity – Multiple separate alleles at the same locus are • Allelic heterogeneity responsible for the disease phenotype • Cystic fibrosis • Phenocopies • Non-allelic/Locus/Linkage Heterogeneity • Reduced penetrance – Different individual genes are responsible for disease • Age specific reduced penetrance etiology • Charcot-Marie-tooth disease • Adult polycystic kidney disease (APKD) • Non-syndromic hearing loss

Phenocopies Phenocopies § Traditional definition § The term phenocopy (although used § An environmentally induced phenotype that incorrectly) is also used to describe resembles the phenotype produced by a § Genetic heterogeneity mutation § An individual(s) within a pedigree which is affected § Examples due to a different gene than the other pedigree members § Individuals taking meperidien which is tainted § E.g. BRCA1 families with breast cancer patients with out a BRCA1 with its by product MPTP variant § Causes the destruction of dopaminergic neurons and § Misdiagnosed cases within a pedigree produces a Parkinson disease phenotype § Alzheimer’s disease pedigrees with cases of dementia which are not Alzheimer’s disease § Epilepsy due to traumatic brain injury

Reduced Penetrance Familial & Founder Effect § Age specific • Familial § Sex specific/Sex limited – Any trait which is more common in relatives of an § Exposure specific affected individual than in the general population § Incomplete penetrance – Can be genetic or environmental or both § A proportion of disease gene carriers never • Prion disease Kuru develop the phenotype • Founder Effect – A high frequency of a disease allele in a population § Can reduce the power of detecting linkage founded by a small ancestral group due to one or more founders being carriers of this allele § Unaffected individuals below the age of onset provide no linkage information

9 Assortative Mating Epistasis & Pleiotropy • Selection of mate with preference to a certain • Epistasis phenotype/genotype (that is non-random mating) – Interaction between alleles at two different loci – Positive • Pleiotropy • preference for a mate with the same phenotype – Multiple phenotype effects of a single gene – Negative • Example Marfans Syndrome • Preference for a mate with a different phenotype

Mendelian Traits Pedigree Drawings -Symbols

§ Modes of Inheritance

§ Autosomal Dominant Inheritance Affected male Affected male proband § Autosomal Recessive Inheritance

§ X-linked Inheritance unaffected female Deceased affected female § Dominant § Recessive Sex unknown affected

Male affected with two traits

Pedigree Drawings Phenotype Quantitative & Qualitative Traits • Quantitative trait –Continuous founder – Dichotomizing based upon an arbitrary or clinical cut-off

Non founder • Can lead to loss of power (due to misclassification)

• Qualitative – binary disease trait –Affected or unaffected Dizygotic twins Monozygotic twins

10 Quantitative Trait - Example BMI Autosomal Dominant Pedigree

Autosomal Dominant Mode of Inheritance Autosomal Dominant Mode of Inheritance § On average 50% of the children from an • If trait is fully penetrant with no phenocopies the heterozygous affected individual will also be following is true: heterozygous for the disease allele and affected • Each affected individual carries at least one copy § 100% of all children of an affected homozygous of the disease/trait allele individual will be affected • Each unaffected individual must be homozygous § Equal number of males and females affected: wild type § There are exceptions • If an affected individual has an unaffected parent § e.g. sex limited traits they must by heterozygous for the disease/trait § Affected (heterozygous) and unaffected individuals allele provide equal linkage information

Consanguineous Autosomal Recessive Autosomal Recessive Pedigree with Pedigree Unrelated Parents

&AMILY D + D +

D D D + or + + D + or + + D D D + or + + D + or + +

11 Autosomal Recessive Mode of Inheritance Autosomal Recessive Mode of Inheritance § The following hold true for fully penetrant diseases § Unaffected individuals can either by homozygous with with no phenocopies wild type or carry one copy (heterozygous) of the pathogenic variant § Each affected individual must be either homozygous or a compound heterozygous for the § 1/3 homozygous wild type pathogenic variant(s) § 2/3 carriers, heterozygous for the pathogenic variant

§ Approximately 25% of all children whose parents are carriers will be affected

Autosomal Recessive Mode of Inheritance X-Linked Recessive Pedigree

§ Offspring of two affected individuals will all be affected § If both parents have the same pathogenic variant or pathogenic variants within the same gene § If pathogenic variant(s) are rare § Usually only nuclear families are observed, with both parents unaffected. § Exceptions are § For consanguineous kindreds where multiple affected sibships can be observed in the pedigree. § Quasidominant/ Pseudodominant Inheritance

X-linked Recessive Mode of Inheritance • No male to male transmission • For fully penetrant traits disease with no phenocopies the following is true – 50% female children of female carriers will also be carriers – 50% of male children of carriers females will be affected – All female children of affected males will be carriers • In some circumstances carrier females are also affected – But have a milder phenotype than affected males

12 Pedigree Drawing with Marker Loci

Calculation of LOD scores 1 2 1 3

Suzanne M. Leal [email protected] 1 3 2 2 Center for Statistical Genetic Baylor College of Medicine https://www.bcm.edu/research/labs/center-for-statistical-genetics www.statgen.us 2 3 2 3 1 2 1 2

Copyrighted © S.M. Leal 2015

Autosomal Dominant Pedigree Informative Individuals for Linkage Phase Known § An individual to be informative § Most be heterozgyous at the marker locus & § And a second locus 1 2 § Disease locus 1 1 § Marker locus Phase 2 1 2 1 D + 2 2

§ In order to observe whether a recombination 2 2 1 2 2 2 1 2 event occurred or not between the two loci NR NR NR NR

Autosomal Dominant Pedigree Autosomal Dominant - Phase Phase Known Unknown Phase I Phase II H0: Θ = ½ 2 1 2 1 1 3 D + + D H1: Θ <½ 1 2 1 1 θ (1−θ ) Z(θ ) = log10 1 3 ½ (1−½) 1 2 2 2 Phase 2 1 ^ R 2 1 2 2 Z(θ) = 0.227 = ! D + Θ NR + R 1 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 1 2 2 2 P 2 2 1 2 1 2 1 2 I NR NR NR NR R NR NR NR NR NR θ =0.25 NR NR R NR P R R R R NR R R R R R II

13 REVIEWS

LOD score even all markers on a given chromosome. For multipoint marker loci and the individual genes within a region had x x LOD score Z( ) = log10[L( )/L(∞)] is the analysis, the , Z(x) = log10[L(x)/L(∞)], is com- to be sequenced using, for example, Sanger sequencing, logarithm of the likelihood puted as the logarithm of the likelihood ratio, with false-positive regions would not be followed up owing ratio, with the numerator the numerator specifying a position, x, of the putative to reasons of time and cost, so it was important for a being calculated under the assumption of linkage and the disease locus on the marker map. For the denomina- pedigree or a group of pedigrees to meet the genome- denominator under no linkage. tor, one assumes the disease locus to be off the map wide significance level. There is less concern now with A LOD score of 3.3 or higher — that is, infinitely far away from the markers (FIG. 2). meeting this criterion because it is quick and relatively has been shown to correspond The multipoint LOD score can furnish a curve over inexpensive to follow by WGS of associated regions. to a genome-wide significance all markers on a chromosome (FIG. 3); the maximum of Smaller pedigrees with suggestive LOD scores can still level of 0.05. this curve, over all chromosomes, then represents the be followed up with WGS, although there may be mul- estimated position of the disease locus on the human tiple linkage regions that could potentially harbour the gene map provided that the maximum LOD score is at causative variant. If a putative causal variant is identi- least equal to 3.3 (REF. 101). Evidence for linkage can be fied in a small pedigree, it is imperative that additional obtained from a single pedigree or multiple pedigrees families are identified that segregate either the same with LOD scores summed at the same θ or map position. variant or another putatively causal variant within the When linkage analysis was previously performed with same gene. If a variant is identified that segregates with a

Figure 2 | Linkage information for a first-cousin a Generation I mating for an autosomal recessive trait and a phase-known autosomal dominant trait. The disease is fully penetrant without phenocopies and has a minor Generation II allele frequency of 0.0001. Circles represent females and squares males. Individuals represented by solid black symbols are affected, and individuals represented 0.3 0.3 by white symbols are unaffected. Shown below each individual in generation IV are the possible underlying disease genotypes. a | An autosomal recessive trait Generation III pedigree in which the affected children are offspring 0.0625 0.3 of first-cousin parents is shown. Consanguinity is REVIEWS Autosomal Dominant indicated by the double horizontal line. The affected individualsLOD scoreLOD are homozygousScoreseven all markers on a given for chromosome. a variant For multipoint that markeris either loci and the individual genes within a region had x x LOD score 0.3 0.3 0.3 0.0625 Z( ) = log10[L( )/L(∞)] is the analysis, the , Z(x) = log10[L(x)/L(∞)], is com- to be sequenced using, for example, Sanger sequencing, Phase Unknown causallogarithm or of the in likelihood perfect puted linkage as the logarithm disequilibrium of the likelihood ratio, with with false-positive the regions would not be followed up owing ratio, with the numerator the numerator specifying a position, x, of the putative to reasons of time and cost, so it was important for a being calculated under the disease locus on the marker map. For the denomina- pedigree or a group of pedigrees to meet the genome- § causalassumption variant. of linkage and the The unaffected sibling is homozygous LOD Scores can be added across familiesdenominator under no linkage. tor, one assumes the disease locus to be off the map wide significance level. There is less concern now with H0: Θ = ½ wildA type. LOD score of 3.3The or higher arrows — that is, show infinitely fareach away from informative the markers (FIG. 2). meiosismeeting this criterion because it is quick and relatively § Must be summed at the same theta value or map distancehas been shown to correspond The multipoint LOD score can furnish a curve over inexpensive to follow by WGS of associated regions. to a genome-wide significance all markers on a chromosome (FIG. 3); the maximum of Smaller pedigrees with suggestive LOD scores can still and levelthe of 0.05. contribution to the LOD score. For this H : Θ <½ § If all members of the family are genotypedthis curve, over all chromosomes, then represents the be followed up with WGS, although there may be mul- 1 pedigree configuration,estimated position the of therare disease variant locus on the humanmust tiple have linkage regions that could potentially harbour the Generation IV 1 9 9 1 § The genotype frequencies are not used in the LOD score gene map provided that the maximum LOD score is at causative variant. If a putative causal variant is identi- (1 ) (1 ) entered the pedigreeleast equal through to 3.3 (REF. 101). Evidenceone of for linkage the can great- be fied in a small pedigree, it is imperative that additional θ −θ +θ −θ calculation obtained from a single pedigree or multiple pedigrees families are identified that segregate either the same Z(θ ) = log10 10 10 grandparents. Thewith meiosis LOD scores summed events at the same from θ or map the position. variant or another putatively causal variant within the Underlying disease genotype dd dd d+ or ++ § Misspecification of allele frequenciesWhen linkage will analysis not biaswas previously the LOD performed score with same gene. If a variant is identified that segregates with a ½ + ½ great-grandparents to their children do not contribute § When genotype data is not available for all family Variant genotype GG GG AA to the LOD score; however, the meiosis events from the Figure 2 | Linkage information for a first-cousin members a Generation I affected children’s grandparents to their parentsmating forand an autosomal recessive trait and a Maximum LOD Score occurs at 1.3 at Θ=0.1 phase-known autosomal dominant trait. The disease § Misspecification of allele frequencies can increase type I is fully penetrant without phenocopies and has a minor from the parents to the first affected child each LOD=1.2 LOD=0.6 LOD=0.125 Generation II allele frequency of 0.0001. Circles represent females error and squares males. Individuals represented by solid Total pedigree LOD score = 1.925 contribute 0.3 to the LOD score, yielding a totalblack symbolsLOD are affected, and individuals represented 0.3 0.3 by white symbols are unaffected. Shown below each score of 1.2. The second affected child only addsindividual 0.6 in generation to IV are the possible underlying disease genotypes. a | An autosomal recessive trait the LOD scoreGeneration for III the family because only thepedigree meioses in which the affected children are offspring 0.0625 0.3 of first-cousin parents is shown. Consanguinity is from her parents yield new linkage information.indicated Each by the double horizontal line. The affected individuals are homozygous for a variant that is either 0.3 0.3 0.3 0.0625 causal or in perfect linkage disequilibrium with the additional unaffected child only yields an additionalcausal variant. The unaffected sibling is homozygous b wild type. The arrows show each informative meiosis Generation I LOD score of 0.125 because for unaffected childrenand the contribution it is to the LOD score. For this pedigree configuration, the rare variant must have Generation IV not possible to elucidate whether they are homozygousentered the pedigree through one of the great- D+ ++ Underlying disease genotype dd dd d+ or ++ grandparents. The meiosis events from the Linkage Information Obtained from an wild type or causal-variant carriers; each of thesegreat-grandparents to their children do not contribute Variant genotype GG GG AA to the LOD score; however, the meiosis events from the Linkage AnalysisGA AA possibilities have a probability of 1/3 and 2/3,affected children’s grandparents to their parents and Autosomal Dominant PedigreeLOD = 1.2 LOD = 0.6 LOD = 0.125 from the parents to the first affected child each Total pedigree LOD score = 1.925 contribute 0.3 to the LOD score, yielding a total LOD respectively. These two probabilities are incorporatedscore of 1.2. The second affected child only adds 0.6 to • For traits which are fully penetrant with no • Each offspring both affected and the LOD score for the family because only the meioses Generation II unaffected adds 0.3 to the into the calculationLod of the LOD score, and linkagefrom her parents yield new linkage information. Each score additional unaffected child only yields an additional phenocopies informationb is thereforeGeneration I lost. b | A phase-knownLOD score of 0.125 because for unaffected children it is – LOD Score 1.5 for displayed not possible to elucidate whether they are homozygous – When θ=0 and there are not recombination eventsD+ ++ pedigree autosomal dominant pedigreeD+ with++ five childrenwild type is or causal-variant carriers; each of these • The only informative meioses in GA AA possibilities have a probability of 1/3 and 2/3, this example are from the shown. Thisfather to pedigree with five offspring for whichrespectively. there These two probabilities are incorporated – When the marker is fully informative GA AA Generation II into the calculation of the LOD score, and linkage his offspring are no recombination events will lead to a maximuminformation is therefore lost. b | A phase-known • What is the LOD score if parental D+ ++ autosomal dominant pedigree with five children is • Autosomal dominant traits (phase-known 5 5 genotype data was not available? GA AA shown. This pedigree with five offspring for which there LOD score of 1.5 at θ = 0, where Z(θ) = log10 are θ no) recombination/(½) ]. events will lead to a maximum pedigrees) LOD score of 1.5 at = 0, where Z( ) = log )5/(½)5]. θ θ 10 θ However, if no genotype information is availableHowever, for if no genotype information is available for – Generation III Generation III the grandparents (shown in generation I), making the Each affected and unaffected individuals adds 0.3 to pedigree phase-unknown, the pedigree will yield a the grandparents Underlying disease (shown genotype D+in generation++ ++ ++ I),D+ making the maximum LOD score of 1.2 at θ = 0, where the LOD score Variant genotype GA AA AA AA GA Z( ) = log )5 + 5)/((½)5 + (½)5)]. pedigree phase-unknown, the pedigree will yieldθ 10a θ θ Underlying disease genotype D+ ++ ++ ++ D+ REVIEWS maximum LOD score of 1.2 at =Nature 0, whereReviews | Genetics GENETICS θ 3 Variant genotype GA AA AA AA GA NATURE REVIEWS | 5 5 5 5 ADVANCE ONLINE PUBLICATION | LOD scoreZ(θ) = logeven10 all markers onθ a) given + chromosome.θ )/((½) For© multipoint +2015 (½) Macmillan )].marker Publishers loci Limited.and the All individual rights reserved genes within a region had x x LOD score Z( ) = log10[L( )/L(∞)] is the analysis, the , Z(x) = log10[L(x)/L(∞)], is com- to be sequenced using, for example, Sanger sequencing, logarithm of the likelihood puted as the logarithm of the likelihood ratio, with false-positive regions would not be followed up owing ratio, with the numerator the numerator specifying a position, x, of the putative to reasons of time and cost, so it was important for a being calculated under the assumption of linkage and the disease locus on the marker map. For the denomina- pedigree or a group of pedigrees to meet the genome- Nature Reviews | Geneticsdenominator under no linkage. tor, one assumes the disease locus to be off the map wide significance level. There is less concern now with A LOD score of 3.3 or higher — that is, infinitely far away from the markers (FIG. 2). meeting this criterion because it is quick and relatively NATURE REVIEWS | GENETICS has been shown to correspond The multipoint LOD score ADVANCE can furnish a curve overONLINE inexpensive PUBLICATIONto follow by WGS of associated regions.| 3 to a genome-wide significance all markers on a chromosome (FIG. 3); the maximum of Smaller pedigrees with suggestive LOD scores can still level of 0.05. this curve, over all chromosomes, then represents the be followed up with WGS, although there may be mul- © 2015 Macmillan Publishers Limited. All rights reserved estimated position of the disease locus on the human tiple linkage regions that could potentially harbour the gene map provided that the maximum LOD score is at causative variant. If a putative causal variant is identi- least equal to 3.3 (REF. 101). Evidence for linkage can be fied in a small pedigree, it is imperative that additional obtained from a single pedigree or multiple pedigrees families are identified that segregate either the same Linkage Information obtained from a with LOD scores summed at the same θ or map position. variant or another putatively causal variant within the When linkage analysis was previously performed with same gene. If a variant is identified that segregates with a Linkage Analysis Consanguineous Autosomal Recessive Pedigree

• Autosomal recessive traits Figure 2 | Linkage information for a first-cousin a Generation I mating for an autosomal recessive trait and a – First affected individual is not informative for linkage phase-known autosomal dominant trait. The disease is fully penetrant without phenocopies and has a minor Generation II allele frequency of 0.0001. Circles represent females – Except if parental mating is consanguineous and squares males. Individuals represented by solid black symbols are affected, and individuals represented • How much information the first affect individual provides 0.3 0.3 by white symbols are unaffected. Shown below each individual in generation IV are the possible underlying depends on the frequency of the haplotype/marker disease genotypes. a | An autosomal recessive trait Generation III pedigree in which the affected children are offspring • How distantly related are the parents 0.0625 0.3 of first-cousin parents is shown. Consanguinity is indicated by the double horizontal line. The affected – The more distantly related the parents and the lower the frequency of individuals are homozygous for a variant that is either 0.3 0.3 0.3 0.0625 the haplotype/variant the higher the LOD score causal or in perfect linkage disequilibrium with the causal variant. The unaffected sibling is homozygous » Maximum LOD score first cousin mating one affected LOD=1.2 wild type. The arrows show each informative meiosis and the contribution to the LOD score. For this » Maximum LOD score second cousin matting one affected LOD=1.8 pedigree configuration, the rare variant must have Generation IV entered the pedigree through one of the great- – Each additional affected individual adds Underlying disease genotype dd dd d+ or ++ grandparents. The meiosis events from the great-grandparents to their children do not contribute • Adds 0.6 to the LOD score Variant genotype GG GG AA to the LOD score; however, the meiosis events from the affected children’s grandparents to their parents and from the parents to the first affected child each – Each additional unaffected individual LOD = 1.2 LOD = 0.6 LOD = 0.125 Total pedigree LOD score = 1.925 contribute 0.3 to the LOD score, yielding a total LOD score of 1.2. The second affected child only adds 0.6 to • Adds 0.125 to the LOD score the LOD score for the family because only the meioses from her parents yield new linkage information. Each additional unaffected child only yields an additional b Generation I LOD score of 0.125 because for unaffected children it is not possible to elucidate whether they are homozygous D+ ++ wild type or causal-variant carriers; each of these GA AA possibilities have a probability of 1/3 and 2/3, respectively. These two probabilities are incorporated Generation II into the calculation of the LOD score, and linkage information is therefore lost. b | A phase-known D+ ++ autosomal dominant pedigree with five children is GA AA shown. This pedigree with five offspring for which there 14 are no recombination events will lead to a maximum 5 5 LOD score of 1.5 at θ = 0, where Z(θ) = log10 θ) /(½) ]. However, if no genotype information is available for Generation III the grandparents (shown in generation I), making the pedigree phase-unknown, the pedigree will yield a Underlying disease genotype D+ ++ ++ ++ D+ maximum LOD score of 1.2 at θ = 0, where Variant genotype GA AA AA AA GA 5 5 5 5 Z(θ) = log10 θ) + θ )/((½) + (½) )].

Nature Reviews | Genetics NATURE REVIEWS | GENETICS ADVANCE ONLINE PUBLICATION | 3

© 2015 Macmillan Publishers Limited. All rights reserved Lod Score Curve for Autosomal Dominant Size of Mapped Regions Pedigree -Phase Known 10 fully informative meioses • For Mendelian disease/traits L – Where a large sample (many informative meiosis ~200) 0 No Recombination Events – 3 are available D maximum LOD score 3.0 at θ=0.0 • Highly unusual that a disease locus can be mapped to a region 2 S For 10 fully informative meioses one which is < 1cM (~ < 1 Mb) C recombination event occurred – O 1 – However the genetic/physical region is usually much larger R maximum LOD 1.2 score at θ=0.1 E – For consanguineous kindreds S 0 0.0 0. 1 0. 2 0. 3 0. 4 0.5 • Even when linkage can be established Recombination Fractions – A limited number of informative meiosis For 50% of the meioses a recombination event » Large genetic region of homozygosity with many genes occurred –maximum LOD score 0.0 at θ=0.5 - ∞

The Effect of Using Incorrect Marker Allele Frequencies on LOD Scores Obtaining Allele Frequencies • If there are pedigree members with missing • Estimate from pedigree founders genotype data – Must have a sufficient number of founders • Using incorrect marker allele frequencies • Obtain from the manufacture of genotype – Can increase type I error array • Important to obtain accurate population specific – Usually allele frequencies provided for estimates of allele frequencies Europeans, African Americans and Asians • If there is missing genotype data it is advisable • For e.g. Illumina HumanOmni5-Quad not to use equal allele frequencies for marker loci – http://support.illumina.com/array/array_kits/humanomni 5-4-beadchip-kit/downloads.html

Obtaining Allele Frequencies Obtaining Allele Frequencies • Alohomora • UCSC Genome Binoinformatics – Provides frequencies for Europeans, African Americans – For customized SNP arrays & population specific allele and Asians for popular SNP arrays frequencies • Creates datafile with allele frquencies – Use Table browser • http://gmc.mdc-berlin.de/alohomora/maps/ • http://genome.ucsc.edu/cgi-bin/hgTables – Populations specific allele frequencies can be downloaded using • HapMap project or • HGDP (Human Genome Diversity Project). – Select ‘Variation’ in group menu – Select ‘HapMap SNPs’ or ‘HGDP Allele Freq’ in track menu – Then SNP list can be pasted or uploaded to appropriate file

15 Two-point Linkage Analysis Multipoint-point Linkage Analysis • Can be performed between the disease and • Can increase the informativeness of markers marker loci for parametric linkage analysis within the region – Extremely important when SNP marker loci are • For SNP data two-point linkage analysis is not analyzed very informative • Helps to fine map a locus to a smaller region • Can be used to elucidate linked regions – Compared to two-point linkage analysis – Which can be followed-up with multipoint analysis

Multipoint-point Linkage Analysis Multipoint-point Linkage Analysis

• Incorrect specified genetic map • Intermarker linkage disequilibrium (LD) – Can bias LOD scores – Can increase type I error – Bias the position of the genetic locus • When parental genotypes are missing

• For parametric linkage analysis when the genetic • For consanguineous pedigrees model is mis-specified and a susceptibility locus is – when parental, grandparent, etc. genotypes are placed between two flanking markers: missing – Can result in false negative results – Pushes the disease locus outside of the map of markers

ASP, no parents EMLOD model-free Examples EMLOD parametric 25

• Affected sibpairs 20 – Without parental genotype data – Without parental genotype data and genotype data 15 from one unaffected sibling EMLOD 10 – Missing genotype data from one parent • Consanguineous pedigree – first cousin mating 5 – Parents genotypes 0 • Various relative missing genotype data 0 0.2 0.4 0.6 0.8 1 D'

Huang et al. 2004

16 ASP, one unaffected sib EMLOD model-free ASP, one parent EMLOD model-free EMLOD parametric EMLOD parametric 25 25

20 20

15 15 EMLOD EMLOD 10 10

5 5

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 D' D'

Huang et al. 2004 Huang et al. 2004

Pedigree Structure- Indicating ASP, two parents EMLOD model-free EMLOD parametric Individuals with Missing Genotype data 25

20 Great Grandparents

15

EMLOD 10 Sibling Married-in Grandparents Grandparents 5

0 0 0.2 0.4 0.6 0.8 1 D'

Huang et al. 2004

Avoiding Inflation of LOD Scores due to Parents of the Proband Genotyped Inter-marker LD • Trim marker loci so that LD is weak between marker loci e.g. r2< 0.5 – Can lead to a loss of power • Analyze data using programs that incorporate haplotype frequencies – e.g. Merlin

17 Advantages of Two-point Linkage Analysis Error Detection in Pedigree data

• Not influenced by intermaker-LD • First need to remove markers which are missing a large number of genotypes – Therefore no inflation of the LOD score – e.g. >5% • Not influenced by incorrect genetic maps • A more stringent criterion can be used for SNPs with MAF<5% – Which can cause incorrect map position and deflation – e.g. >1% of the LOD score • These markers can have higher genotyping error rates for the non-missing genotypes

Error Detection in Pedigree data Error Detection in Pedigree data

• Check for Mendelian errors • SNP markers are not very informative and therefore often not possible to detect errors – Marker should be removed for the entire through Mendelian inconsistencies pedigree – Those markers which are most informative (H=0.5) • Do not just remove individuals involved in the produce the least number of Mendelian Mendelian inconsistency inconsistencies – PedCheck • Can check for double recombination events over short genetic distances • Useful program to detect Mendelian inconsistencies • This is an indication that a genotyping error has occurred – https://watson.hgen.pitt.edu/register/docs/pedcheck.html • Merlin (Abecasis et al. 2002 Nat Genet) – Can be used to detect double recombination events – http://csg.sph.umich.edu//abecasis/Merlin/

Type I error Type I error

§ Reject the null hypothesis even when it is true § If a nominal criterion of p=0.05 is use as the § e.g. reject the null hypothesis of no linkage even criterion to reject the null hypothesis when it is true § One test performed 1 out of 20 chance null hypothesis § The null hypothesis of no linkage should have rejected when it should not have been not been rejected § False positive § If 1,000 tests are performed § By chance for ~50 tests the null will be rejected § Even though the null hypothesis is true

18 Type I error-Parametric Linkage Analysis Type I error-Parametric Linkage Analysis

§ If many tests are performed must adjust for multiple § A LOD Score of 0.59 testing § Nominal p-value 0.05 [one sided]) § Family wise error rate § Is not used to reject the null hypothesis of no linkage § LOD score criterion takes into consideration § For parametric linkage analysis a LOD score of 3.3* is § Multiple testing used to reject the null hypothesis § Size of the genome § Nominal one sided p-value 0.000049 § Number of chromosomes § Genome wide p-value 0.05

*Lander E, Kruglyak L (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11:241-247

Type II error (Power)

• Type II error False positive true positive – When the null hypothesis is false and it is not rejected – Represented by β • Power – The ability to reject the null hypothesis when it is false – Most studies require a power (1-β) of at least 0.8

True False negative negative

19 Outline

• Motivation Cloud Computing • Basic idea

• Providers • Costs

• Advantages vs. disadvantages

• Concerns

Michael Nothnagel, [email protected], 2015

Motivation Basic idea

• Own IT infrastructure is expensive to set up, maintain and update – Projects may be one-time endeavors – Not cost-efficient for single, short-time projects • Urgent need for immediate access to computing power – IT resources may not be present at location – Promised/Planned setup is delayed • Setup of an IT infrastructure serving high demand in computing, storage and archiving requires substantial

expertise createdbySamJohnston, distributed underaCreative Commons license – Limited pool of personnel Instant, demand-driven access to a network of pooled configurable resources for computing and storage without or with minimal management effort by the service provider

Types of cloud services Levels of cloud computing

IaaS PaaS SaaS

Infrastructure Platform Software as a service as a service as a service (“cloud foundation”)

Own administration Running own Use of an exisiting of virtual servers, application within the application offered by

complete access to cloud, no the cloud instance administration

Amazon’s Web Microsoft’s Windows Apple’s iCloud, Services (Elastic Azure, Goolge’s App Google’s Drive, www.wikipedia.org Compute Cloud, Simple Engine Microsoft’s oneDrive, Storage Service S3) ownCloud, DropBox

20 Location of clouds Providers

• Public cloud • Amazon – Access to virtual IT infrastructure via the internet – EC2 cor computing – Commercial service providers – S3 for storage (Web services) • Private cloud • Google – Access to virtual IT infrastructure within an organization – Compute Engine – Usually located within the same country as the users • IBM – Protected against outside access – Focus on businesses – Increasingly used at high-performance computing (HPC) centers, e.g. at universities, by provision of virtual computers to users • T Systems (Deutsche Telekom) instead of real ones – Focus in businesses • Hybrid & Community clouds • There are many more providers.

Amazon Costs for Amazon: storage

• Amazon’s EC2 for elastic web-based computing AWS S3 Frankfurt (Germany): prices per GB per month • Virtual servers (“instances”) with – look & feel of a real server, with own IP address – root privileges – choice of operating system – flexible configuration of working memory, cores (CPUs), hard disk space • Storage of data using Amazon’s S3 • Booking as: – On-demand instance: payment by the hour, extremely flexible – Reserved instance: reservation of computing capacity for one or e.g. 3 TB per month: ~ $ 35-100 three years, less expensive than on-demand – Spot instance: bidding for unused EC2 capacity, execution of Additional fees apply for access and data transfer. instance as long as bid is above actual spot price (August 2015)

Costs for Amazon: computing Costs for Amazon: computing

Single server in Frankfurt (Germany), some data: price per month Single server in Frankfurt (Germany), lots of data: price per month

e.g. 3 TB per month: ~ $ 35-100

(August 2015) (August 2015)

21 Costs for Google: storage Costs for Google: computing

AWS S3 Frankfurt (Germany): prices per GB per month Single server in Europe/APAC: price per hour

e.g. 3 TB per month: ~ $ 120-510

Additional fees apply. (August 2015) e.g. per month: ~ $185 (August 2015)

Costs for Google: computing Advantages vs. disadvantages

OS for single server in Europe/APAC: price per hour • Scalability both • Limited bandwidth: – by computational demand – NGS datasets can be huge – by storage demand (up to several TB), depending on the type of • Costs transmitted file (fastq, bam, – proportional to usage vcf) – occur at time of usage – Transfer can last several days or even weeks, with • Immediate availability risks of interruption – No need to wait for that new HPC cluster for another six • Costs: months – No “flatrate” for usage • Data backup • Potential lock-in effect with – Usually automatically provider when using more provided cloud-specific elements e.g. per month: ~ $95 (August 2015)

Concerns: data privacy Concerns: data protection

• Security during transfer between client and server • >90% of all cloud infrastructure is located in the USA. – Known and unknown but potentially exploited bugs in encryption • National laws may prohibit transfer to another country or software (e.g. SSL/TSL) outside the European Union. • Security at server • Germany: Regular checks for compliance of used cloud – encryption of databases and file systems with standards set by law (“Bundesdatenschutzgesetz”) at • Profiling based on user data (Google) physical location are mandatory for personalized data • NSA and other secret services (genetic identifiably!). • Account hijacking • Some US providers have set up computing centers within the European Union. – Ireland & Germany (Amazon), Denmark (Google) – They are still required under the US Patriot Act to transfer data to the US government if requested. – Big Brother Award 2012 for cloud computing as a technology awarded by digitalcourage, Chaos Computer Club e.V., and others

22 Literature on cloud computing

• Cloud computing security risk assessment by the European Union: http://www.enisa.europa.eu/activities/risk-management /files/deliverables/cloud-computing-risk-assessment • Cloud Security Alliance: Top Threats https://cloudsecurityalliance.org/topthreats /csathreats.v1.0.pdf

23 24 Elementary Usage of the Affection Status Locus Type

• Penetrance: Probability(Phenotype |genotype) • Autosomal dominant - Complete penetrance no Penetrance phenocopies

1/1 1/2 2/2 Suzanne M. Leal P(affected|genotype) 0 1 1 [email protected] P(unaffected|genotype) 1 0 0 Center for Statistical Genetic Baylor College of Medicine https://www.bcm.edu/research/labs/center-for-statistical-genetics All unaffected individuals must be 1/1 www.statgen.us Affected individuals can be either 1/2 or 2/2

Note: 2 denotes the disease allele

Copyrighted © S.M. Leal 2015

Elementary Usage of the Affection Status Locus Type Elementary Usage of the Affection Status Locus Type § X-linked recessive - Complete penetrance no • Penetrance: Probability(Phenotype |genotype) phenocopies • Autosomal recessive - Complete penetrance no § Specify separate penetrances for males & females phenocopies 1/1 1/2 2/2 P(affected female|genotype) 0 0 1 1/1 1/2 2/2 P(unaffected female|genotype) 1 1 0 P(affected|genotype) 0 0 1 P(unaffected|genotype) 1 1 0 1 2 P(affected male|genotype 0 1 All affected individuals must be 2/2 P(unaffected male|genotype 1 0 Unaffected individuals can be either 1/1 or 1/2 All affected females must be 2/2 and affected males hemizygous 2

Note: 2 denotes the disease allele Note: 2 denotes the disease allele

Elementary Usage of the Affection Status Locus Type Use of Affection Status Locus § X-linked dominant or X-linked recessive with milder expression in females - Complete • Do not get confused by thinking the following is penetrance no phenocopies always true § Specify separate penetrances for males & females – 2 for affected individuals 1/1 1/2 2/2 – 1 for unaffected individuals P(affected female|genotype) 0 0 1 P(unaffected female|genotype) 1 1 0 • Code 2 – Use penetrances as denoted in the datafile 1 2 P(affected male|genotype 0 1 • Code 1 P(unaffected male|genotype 1 0 Affected females are either 1/2 or 2/2 & affected males are hemizygous 2 – 1-penetrances are used Note: 2 denotes the disease allele

25 Use of Affection Status Locus - Example Autosomal Dominant Multiple Liability Classes • Used when penetrance is not the same for all individuals due to reduced penetrance and 1/1 1/2 2/2 Coded in datafile 0 1 1 phenocopies – Age of onset Penetrance for individuals coded with a 2 – Different penetrance for males and females 1/1 1/2 2/2 0 1 1 • In pedigree file Affection – First column affection status status ã ä liability class Penetrance for individuals coded with a 1. 1 0 0 1 2 1 5 6 – Second column liability class 2 0 0 2 2 2 4 4 1/1 1/2 2/2 1-0 1-1 1-1 3 1 2 2 2 3 5 4 → 1 0 0 4 1 2 1 2 3 5 4 5 1 2 1 2 4 6 4 Pedfile.pre

Examples Example § The ratios between genotypes are the most important § Autosomal dominant disease with reduced factor in any penetrance model. penetrance and no phenocopies. § A risk ratio of 1:1 gives no linkage information at the disease Codes in Pedigree File locus. Aff. Ind. Unaff Ind. § Can be used for an individual whose phenotype is unknown 1/1 1/2 2/2 0.0 0.6 0.6 2 1 1 1 Example 1/1 1/2 2/2 0.0 0.8 0.8 2 2 1 2 0.5 0.5 0.5 0.0 1.0 1.0 2 3 1 3 § In the cases where the risk ratio is ∞:1 the individual either does or does not carry a copy of the disease allele. In the 1.) Does it matter which liability class an affected example below all individuals assigned to this penetrance individual is assigned to? Why? class must be 2/2 at the disease locus. Example 1/1 1/2 2/2 0.0 0.0 1.0

Do the Penetrances make Sense on a Example Population Level? § Autosomal dominant disease with reduced penetrance and phenocopies • If the population prevalence of a disease Φ is Codes in Pedigree File Aff. Ind. Unaff Ind. known then the disease gene frequency p and 1/1 1/2 2/2 penetrances f for an autosomal trait should 0-10 yrs 0.01 0.6 0.6 2 1 1 1 satisfy: 11-25 yrs 0.02 0.8 0.8 2 2 1 2 >25 yrs 0.05 1.0 1.0 2 3 1 3 2 2 Φ =fDDp +2fDdp(1-p) + fdd(1-p) How would the following be coded? • For sex-linked recessive traits § An unaffected 15 year old § An unaffected 9 year old § An affected 25 year old Φ =pfd +(1-p)f+ § An affected 10 year old

26 Do the Penetrances make Sense on a Do the Penetrances make Sense on a Population Level? Population Level? § The frequency of phenocopies is given by: § If the population prevalence of a disease Φ is 2 § C=fdd(1-p) known then the disease gene frequency p and penetrances f for an autosomal trait should satisfy: § If the disease is rare then § A≈ 2pfDd 2 2 Φ =fDDp +2fDdp(1-p) + fdd(1-p) § C ≈ (1-2p) fdd § The phenocopy rate § The population frequency for genetic cases for an autosomal dominant trait: § The proportion of phenocopies amongst all affected 2 individuals is § A= fDDp +2fDdp(1-p) § equal to C/(A+C).

Example How can Penetrance Data be Obtained

§ Calculate the population prevalence and phenocopy § From the literature rate for an autosomal dominant trait where: § All necessary information not always available

§ fDD=fDd =0.8 § Estimate it from the data § § fdd=0.02 Usually biased due to way data was ascertained § p=0.001 § May not have enough data for reliable estimates § Linkage programs § Φ =f p2 (0.0000008) +2f p(1-p) (0.00160) + f (1-p)2 DD Dd dd § Ageon (SAGE 3.1) (0.01996) § Approximate methods

§ Φ =0.0216

§ The phenocopy rate equals 0.926

27 Frequent!QuesHons! ! ! ! ! • Why!are!you!telling!me!about!so!many!programs?! ! – Please!just!let!me!know!the!best!one! So8ware!to!Perform!Linkage!Analysis!!! ! • Unfortunately!not!so!easy! Genotype!Array!Data!! ! – No!single!program!works!equally!well!in!all!situaHons! ! ! • Some!programs!can!handle! Suzanne!M.!Leal!! [email protected]! ! – Large!pedigrees!but!not!many!markers! Center!for!StaHsHcal!GeneHc!! ! Baylor!College!of!Medicine! – Many!markers!but!not!large!pedigrees! ! !hJps://www.bcm.edu/research/labs/centerMforMstaHsHcalMgeneHcs! ! – Both!large!pedigrees!and!many!markers!but!does!not! www.statgen.us! ! ! provide!exact!LOD!scores! ! !! ! ! !! ! ! !! ! ! !! !Copyrighted!©!S.M.!Leal!2015! • Here!is!an!abbreviated!list!of!linkage!programs! !

LINKAGE(Lathrop!et!al.!1984)/! LINKAGE(Lathrop!et!al.!1984)/! FASTLINK!(Cocngham!et!al.!1993)! FASTLINK!(Cocngham!et!al.!1993)! • Parametric!analysis!only!! • QuanHtaHve!and!QualitaHve!analysis! • Suitable!for!relaHvely!large!pedigrees! • Allows!for!esHmaHon!of!various!parameters:!theta,! • Limited!in!the!number!of!loci!for!mulHpoint! penetrance,!allele!frequencies! analysis! – ILINK! – LINKAGE!can!allow!for!slightly!more!alleles/markers! • TwoMpoint!linkage!can!be!performed!using! but!slower!than!FASTLINK! – MLINK! • ElstonMStewart!Algorithm!scales!exponenHally! – ILINK! with!the!number!of!loci!and!linearly!with!the! • Can!esHmate!haplotype!frequencies!and! number!of!nonMfounders! incorporate!them!in!the!analysis! – MLINK! – LINKMAP!

LINKMAP! Sliding!the!Disease!Locus!Across!a! LINKAGE/FASTLINK! Map!of!Marker!Loci! • LINKMAP!can!be!used!to!calculate!mulHpoint!LOD! scores! 1.) 1___2___3___4 1-Disease Locus 2, 3, 4, 5 and 6-Marker • Due!to!ElstonMStewart!algorithm!can!calculate!LOD! 2.) 2___1____3____4 Loci

scores!for!large!pedigrees!but!very!limited!in!the! 3.) 1__ _3____4___5 (only one step) number!marker!loci! 4.) 3____1____4___5 – Number!of!marker!loci!dependent!on!number!of!alleles! 5.) 1____4____5___6 (only one step) • Maxhap – Product of the allele frequencies including disease locus 6.) 4____1____5___6

• For SNP marker loci can only perform multipoint analysis 7.) 4____5____1___6 using ~7 marker loci – Can!use!sliding!window! 8.) 4____5____6___1

28 Superlink!(Silberstein!et!al.!2006)! Superlink!(Silberstein!et!al.!2006)! • Can!analyze!complex!pedigrees!quickly! • Can!quickly!calculate!genome!wide!twoM – Parametric!linkage!analysis! point!LOD!scores! • Computes exact LOD scores • MulHpoint!linkage!analysis! – Ideal!for!pedigrees!with!many!loops!(marriage!or! – Use!sliding!window!to!calculate!LOD!scores!for!>~3! consanguinity)! marker!loci! • In particular animal pedigrees – Not!suitable!for!genomeMwide!mulHpoint!linkage! – Dogs analysis! – Cattle • Efficient!use!of!parallelizaHon!of!the! • Can!perform!mulHpoint!linkage!analysis! algorithm! – Limited!in!the!number!of!marker!loci!2M4! • No!need!to!install!program! – Implements!Bayesian!networks! – Use!of!Superlink!is!available!free!online!

Genehunter!(Kruglyak!et!al.!1996)! Genehunter!(Kruglyak!et!al.!1996)! • Parametric!and!NonMparametric!linkage!analysis! • Handles!a!large!number!of!marker!loci! – But!only!pedigrees!of!small!to!moderate!to!moderate! • Provides!exact!rapid!calculaHon!of!mulHpoint! size! LOD!scores!through!the!implementaHon!of! • Maxbit (2n-f)≤21 hidden!Markov!Models! • QualitaHve!Traits! – This!approach!scales!linearly!with!the!number!of!loci,! • NPL!counts!the!numbers!of!alleles!shared!IBD! but!exponenHally!with!the!number!of!nonMfounders! amongst!2!or!more!affected!relaHves! – Implements!the!Lander!&!Green!Algorithm! – Calculates!pMvalues!using!either!exact!distribuHon!or! normal!approximaHon! !

! Genehunter!2.0!(Daly!et!al.!1998)! Allegro!(Gudbjartsson!et!al.!2000)! • Allegro!has!the!same!basic!funcHonality!as!Genehunter!! • Performs!variance!component!analysis!for!mapping! – quanHtaHve!traits! !Includes!the!features!of!Genehunter!plus! • Performs!all!sibMpair!analysis!contained!in!the! • Supported!features! Mapmaker/sib!so8ware! – • Constructs!Haplotypes! Parametric!and!nonparametric!LOD!scores! – • Implements!a!large!pedigree!approximaHon!for!the! Nonparametric!NPL!scores,!! computaHon!of!a!nonMparametric!allele!sharing! – InformaHon! staHsHc!on!extended!pedigrees!of!arbitrary!size!and! – Exact!pMvalues! complexity! – Expected!crossover!rate! • Computes!tradiHonal!and!mulHlocus!Transmission! – Constructs!Haplotypes!! Disequilibrium!Test!(TDT)! – SimulaHon!!

29 ! MERLIN!(Abecasis!et!al.!2002)! Allegro!(Gudbjartsson!et!al.!2000)! • Typical!speedup!compared!to!Genehunter!is!30M! • Handles!small!to!medium!sized!pedigrees! fold.!! – Implements!Lander!&!Green!Algorithm! • On!a!computer!with!four!Gb!of!memory!the! • Parametric!analysis! program!can!handle!pedigrees!with!up!to!about! • NonMparametric!analysis!! 28!bits! • Same!data!format!as!Genehunter! • Variance!Components!Analysis! • ALLEGRO2!can!handle!even!larger!pedigrees! • Regression!based!linkage!analysis!(quanHtaHve!traits)! • Incorporates!LD!in!analysis! • Error!checking!–!double!recombinaHon!events!over! small!geneHc!distances!

SIMWALK2!(Sobel!and!Lange!1996)! Simwalk2! • A!Markov!Chain!Monte!Carlo!(MCMC)!algorithm!is! • The!MCMC!random!walk!proceeds!to!sample!the! implemented!in!order!to!transverse!the!space!of! possible!underlying!configuraHons!in!proporHon!to! their!likelihood! inheritance!vectors!for!each!pedigree!! – A!sample!average!is!then!used!to!give!esHmated! • The!iniHal!legal!descent!state!is!found!for!using!an! results!for!the!pedigree! iteraHve!genotype!eliminaHon!technique.!!! ! – Simulated!annealing!is!then!performed!to!search!for!find! • Can!analyze!large!families!with!complex!structures! – >1000!individuals!! the!single!most!likely!descent!graph.!!! • Handles!a!large!number!of!markers!! – >30!markers! • Performs!! – Constructs!Haplotypes!! – Parametric!Analysis! – Nonparametric!analysis!

Integrated!Suites!for!Linkage!Analysis!M! Integrated!SuitesMEasy!Linkage! Alohmora! • Facilitates!Analysis!of!a!large!number!of!markers!! – IncorporaHng!geneHc!mappings! • Runs!on!windows! – Allows!for!Analysis!of!a!subset!of!markers! • Data!preparaHon! • Error!Checking! – Allows!for!analysis!of!a!subset!of!markers! – Pedcheck! • – Merlin! Calls!! – Genehunter! • Linkage!Analysis! – Allegro,!etc! – Allegro!! – Merlin! • Graphical!representaHon!of!results! – Genehunter! – Simwalk2!

30 Haplotypes/3MUnit!Support!Interval! ! ! • A8er!compleHon!of!linkage!analysis! ! ! – Haplotypes!should!be!constructed! ! ! • e.g Allegro, SimWalk2 ! ! • AddiHonally!a!3Munit!support!interval!should!be! In!Summary!! ! obtained! ! ! • !If!linkage!was!established!the!causal!variant!should! ! ! lie!within!the!haplotype!and/or!3Munit!support! ! interval!! ! ! ! ! ! ! ! !! ! ! !! ! ! !! ! ! !!

Error!DetecHon!in!Pedigree!data! Analysis!Programs!

• PedCheck! • ElstonMStewart!Algorithm! – Mendelian!errors! – Large!Pedigrees! • Merlin! – limited!number!of!markers! – Double!recombinaHon!events!over!short!geneHc! • Linkage! distances! • Fastlink! • Vitesse! • Superlink!

Analysis!Programs! Analysis!Programs!

• LanderMGreen!Algorithm! • Other!methods!–!Bayesian!networks! – SmallMmedium!sized!pedigrees! – Superlink! – Large!number!of!Marker!loci! • Suited for pedigrees with many inbreeding or marriage loops • Genehunter • Allegro • Merlin • Approximate!methods!M!MCMC! – Simwalk2! – LOKI!!

31 Pedigree!Drawing!Programs! Integrated!Suites!

• Haplopainter! • Easy!Linkage! – Can!draw!pedigrees!and!haplotypes! ! • Alohomora! • Pelican!!

32 Sources of error in genetic data

Genotyping Error Detection • Genotyping errors Appear as:

Using MERLIN • False paternities • Mendelian errors • Probes mix-up • Unlikely genotypes • Data mix-up • Double recombinants – Introduction – • Wrong model over short genetic (e.g. marker distances) distances • Wrong phenotype definition • ...

Michael Nothnagel, [email protected], 2015

The MERLIN software What can MERLIN be used for?

Pedigree file X.ped IBD calculations • „Multipoint Engine for Rapid Likelihood INference“ Marker file Non-parametric linkage • „Works like magic! (G. Abecasis) X.dat Variance components linkage • Available for Linux, Solaris , MacOS X, Windows MERLIN Haplotyping • Efficient data storage through sparse binary trees for modelling gene Marker map flow in pedigrees and corresponding algorithms Pedigree regression • Used via command line mode X.map Error detection • Reference: ... – Abecasis et al. (2002) Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97-101. • Download & Docs: (linkage / QTDT / [PLINK] format ) http://www.sph.umich.edu/csg/abecasis/Merlin/ • Current version: 1.1.2

Unlikely genotypes (I) Unlikely genotypes (II)

Question: Do a particular marker with genotype g and its neighboring Sib-ship 1 Sib-ship 2 markers (G\g) provide consistent information?

Ratio of likelihood ratios (LR) for two marker map models: Likelihood for: L(G\g|θ) g set to unknown using LR L(G|θ) model map linked all markers distances r = = double LR L(G\g|θ=0.5) unlinked g set to assuming recombination unknown event L(G|θ=0.5) unlinked all markers markers Sibs are identical at all markers One marker contradicts sharing ! sibs very likely share this stretch information from all other markers g consistent with G\g under θ ! r << 1 MERLIN reports 1/r of the chromosome ! very unlikely case ! check! g inconsistent with G\g under θ ! r >> 1 as mistyping score!! (identity-by-descent [IBD] = 2) (from MERLIN online tutorial)

33 Running MERLIN Error detection using MERLIN

At the command line: 1. Are there genotype errors in the data?

merlin –p X.ped –d X.dat [–m X.map] ! Detection of Mendelian errors / unlikely genotypes: merlin –p … –d … --error

Pedigree file Marker file Marker map file 2. Are reported errors simply due to chance? (optional) ! Estimation of the false-positive rate for error detection merlin –p … –d … --error –-simulate –r --reruns Other programs in bundle: Analysis options (listed whenever MERLIN starts) • pedstats 3. „Wipe errors from data! • pedwipe • General: --error [ON], --information, … • IBD States: --ibd, --kinship, --matrices, … ! Erase problematic genotypes (requires error file merlin.err) … • NPL Linkage: --npl, --pairs, --qtl, … pedwipe –p … –d … …

Other programs

• Other multipoint error detection methods implemented in: – SimWalk2 (Sobel & Lange. Am J Hum Genet 1996;58:1323–1337) https://www.genetics.ucla.edu/software/simwalk – Mendel v14.4.2 (Sobel E, et al. Am J Hum Genet 2002;70:496–508, Lange K, et al. Bioinformatics 2013;29:1568-1570) https://www.genetics.ucla.edu/software/mendel – Sibmed (Douglas JA, et al. Am J Hum Genet 2000; 66:1287–1297) http://csg.sph.umich.edu/boehnke/sibmed.php • Performance comparison in: – Mukhopadhyay, et al. Comparative study of multipoint methods for genotype error detection. Hum Hered 2004;58:175-189.

34 Homozygosity Mapping - Concept

• Useful tool to map autosomal recessive traits Homozygosity Mapping – Particularly for consanguineous pedigrees • Surrounding the pathogenic variant, multiple markers will be homozygous Suzanne M. Leal – Pointing to one or several regions of the genome [email protected] where the pathogenic variant occurs Center for Statistical Genetic, Baylor College of Medicine https://www.bcm.edu/research/labs/center-for-statistical-genetics www.statgen.us

Copyrighted © S.M. Leal 2015

Homozygosity Mapping - Concept I:1 I:2

• Can look at homozygosity within a single II:1 II:2 II:3 II:4 individuals III:1 III:2 III:3 III:4 • However information from several affected IV:1 IV:2

individuals D3S3045 1 2 2 1 D3S2496 1 1 1 2 D3S4018 1 1 1 2 D3S1278 1 1 1 2 D3S4523 1 1 1 2 – Usually but not necessarily from the same pedigree D3S1765 1 1 1 2 D3S3720 1 1 1 2 D3S2453 1 2 1 2 • Can help to reduce the number of regions D3S1764 1 2 1 3 – And the size of the region containing the putative causal variant V:1 V:2 V:3 V:4 V:5 V:6 V:7

D3S3045 1 2 1 2 1 2 1 1 1 2 1 2 1 2 D3S2496 1 1 1 2 1 1 1 2 1 1 1 2 1 2 D3S4018 1 1 1 2 1 1 1 2 1 1 1 2 1 2 D3S1278 1 1 1 2 1 1 1 2 1 1 1 2 1 2 D3S4523 1 1 1 2 1 1 1 1 1 1 1 2 1 2 D3S1765 1 1 1 2 1 1 1 1 1 1 1 2 1 2 D3S3720 1 1 1 2 1 1 1 1 1 1 1 2 1 2 D3S2453 1 1 1 1 1 2 1 1 1 1 1 2 1 1 D3S1764 1 3 1 1 1 3 1 3 1 1 1 3 1 1

Concept Identify by Descent (IBD)/Identify by State (IBS)

• Two segments of the chromosome are inherited from a common ancestor – Sharing is identical-by-descent (IBD) 1/2 1/3 • Often occurs in consanguineous pedigrees 1/2 1/3 1/2 1/3 • Two different haplotypes, surrounding the pathogenic variant, in affected individuals who are 1/3 1/2 1/1 1/3 1/2 1/2 offspring of a consanguineous mating IBD=0 IBS=1 IBD=1 IBS=1 IBD=2 IBS=2 – Disease variant(s) entered the pedigree more than once in order to observe this phenomenon • Highly unusual

35 Concept Concept

• Can also observe regions of homozygosity in • Can occur in small populations “outbreed” populations – Geographically isolated • Mountains, Island populations – Small breeding pool – Socially isolated • Individuals, although not known to be, st nd related (1 or 2 cousins) • An individual can also inherit two copies of – Are in reality quite closely related the same variant by “chance” – Often inbreeding coefficients can be high due to – Usually parents are distantly related but this generations of intermarriage relationship is unknown

Performing Homozygosity Mapping Performing Homozygosity Mapping • In a single individual • Information can be used from multiple • Often more than one run of homozygosity individuals affected individuals (same • Difficult to determine which run of phenotype) who may or may not related homozygosity contains the causal variant – If multiple individuals are homozygous for an overlapping interval on the chromosomes – Can lead to identifying the correct regions of homozygosity • Also aids in reducing the size of the interval

Performing Homozygosity Mapping Performing Homozygosity Mapping • Can determine that two individuals are • Even if two or more individuals are not distantly related because they are homozygous for the same haplotypes homozygous for the same haplotype • Can still examine the haplotypes to determine the smallest interval containing the causal gene • Examine the region of homozygosity across • Caveat it can be possible that not all individuals individuals in order to obtain the smallest have the same phenotype due to the same gene region in common – Unusual but can occur when there are multiple – Likely to contain the pathogenic variant genes responsible for the same phenotype within a small genetic region • Nonsyndromic hearing impairment – 13q11-13q12 » GJB2 and GJB6

36 Performing Homozygosity Mapping Performing Homozygosity Mapping • In this situation by examining the region of • Most beneficial in consanguineous pedigrees overlap between individuals • If pedigree is sufficiently large • Can accidently exclude region containing – Can usually map the causal variant to one region the causal variant • Caution should be used when trying to refine • This can also occur when examining interval using unaffected individuals smallest region of homozygosity between – May not have disease phenotype due to reduced families penetrance – i.e. when analyzing families which do not have the • Carrying two copies of causal variant and thus are same haplotype within the region of homozygosity homozygous where the disease variant lies

Performing Homozygosity Mapping Performing Homozygosity Mapping • Can help to quickly zoom in on the region • May be advantageous to only use affected containing the causal variant individuals • For homozygosity mapping analyzing thousands of • Dependent on disease etiology marker loci takes seconds – Can use a wide variety of genotyping arrays • Likewise phenocopies can cause rejection • Illumina HumCoreExome-24 Bead Chip of true region – Also can use exome or whole genome data • Multipoint linkage analysis can be time consuming • Phenotyping is extremely important – Homozygosity mapping can be used to elucidate the region where initial linkage analysis should be carried out – And most likely contains the pathogenic variant

Performing Homozygosity Mapping Performing Homozygosity Mapping • Region of homozygosity and 3-unit linkage support interval usually perfectly overlap • Regions of homozygosity can give • Performing multipoint linkage analysis not additional support to a linkage finding for correcting for intermarker linkage disequilibrium autosomal recessive traits when analysis is can inflate LOD scores performed in consanguineous pedigrees – This can occur if family members are missing genotype – Robust to intermarker linkage disequilibrium data • e.g. parental genotypes – For consanguineous pedigrees missing grandparental data can also cause an increase in false positive LOD scores

37 Programs • HomozygosityMapper (Seelow et al. 2009) – http://www.homozygositymapper.org/

• IBDfinder (Carr et al. 2009) – http://dna.leeds.ac.uk/ibdfinder/

• AutoSNPa (Carr et al. 2006) – http://dna.leeds.ac.uk/autosnpa/

• PLINK (Purcell et al. 2008) – http://pngu.mgh.harvard.edu/~purcell/plink/

38 Outline

• NGS technologies File formats for sequence data • Workflow and corresponding data files • FASTQ files: reads fresh from the sequencer • SAM/BAM files: read mapping • VCF files: variants, genotypes and more

Michael Nothnagel, [email protected], 2015

Whole-genome sequencing (WGS) Whole-exome sequencing (WES) Exons'as'prime'funcIonal'candidates'represent'1K2%'of'total'sequence' Genomic' Flanked' Hybridized' Pulldown'&' Genomic' Flanked' DNA' Mapping,' DNA' fragments' fragments' washing' DNA' fragments' sequencing' alignment,' variant'calling'

Random'shearing,'flanking'of' HybridizaIon' Pulldown' fragments'by'adaptors' Exonic' DNA' Mapping,' fragments' sequencing' alignment,' Random'shearing,'flanking'of' amplificaIon' fragments'by'adaptors' variant'calling' WholeKgenome'sequencing'(WGS)'does'not'cover'the'whole'genome.'

WGS'is'cheap'on'a'perKbase'basis,'but'sIll'expensive'in'total'costs.' amplificaIon' Sequencing'allows'assessment'of'geneIc'variaIon'that'is'unknown'in' Vendors'for'exome'capture'kits:'Agilent,'Illimuna,'Nimblegen,'and'others.' advance'(must'be'known'for'genotyping).'' WES'does'not'cover'the'whole'exome.'Covered'regions'depend'on'the'used'library.' modified'from'Bamshad'et'al.'(2011)'Nat'Rev'Genet' modified'from'Bamshad'et'al.'(2011)'Nat'Rev'Genet'

Bioinformatic workflow Next-Generation Sequencing (NGS)

• Aim: – Full sequences, rare variants – Direct assessment of genetic variation directly • 'Generations': – First: Sanger sequencing – Second: 'next-generation sequencing' – Third and Fourth are coming • Platforms: – Roche 454 – Illumina / Solexa / Sequenom – Applied Biosystems (ABI) SOLiD – Helicos BioSciences – Pacific Biosciences – Ion Torrent, … Jobling, et al. (2014)

39 NGS: Roche 454 NGS: Illumina/Solexa

1. Emulsion PCR 1. Bridge amplification of DNA fragments

2. Pyrosequencing

2. Sequencing by labeled reversible terminators

Metzker (2010) Nat Rev Genet Metzker (2010) Nat Rev Genet

NGS: Applied Biosystems (ABI) SOLiD NGS: read length

1. Emulsion PCR

2. Sequencing by ligation

Metzker (2010) Nat Rev Genet Metzker (2010) Nat Rev Genet

Capacity of sequencing instruments Data generation throughput

Mardis (2011) Nature Stratton, et al. (2009) Nature

40 NGS: sequenced individuals back then Sequencing costs

per Megabase (Mb) per genome

Moore’s law (1965, co-founder of Intel) “The number of transistors in a dense integrated circuit doubles approximately every two years”

Metzker (2010) Nat Rev Genet http://www.genome.gov/SequencingCosts/

In-silico storage of NGS data Output of sequencers

Base pairs per $ (NGS)

Disk space per $

Base pairs per $ (before NGS)

Stein (2010) Genome Biology Strachan, et al. (2015)

Read length vs. throughput Illumina HiSeq X for WGS

All information has been obtained from the Illumina web site.

Lex Nederbragt (2013): developments in NGS. figshare. http://dx.doi.org/10.6084/m9.figshare.100940. http://www.illumina.com/systems/hiseq-x-sequencing-system/

41 FASTQ file format Base quality: Phred score Q (I)

Contains all unaligned reads obtained from the sequencer Sanger sequencing: Plain text file, four lines per sequence Ewing & Green (1998) Q = - 10 log10p Genome Res Sequence identifier (optionally some more description) Sequence letters

… @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT Illumina/Solexa have used a slightly + different quality score in the past: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65en.wikipedia.org … p Q = - 10 log “+” Quality code for respective letter 10 (optionally some more description) 1-p

!"#$%&'()*+,-./0123456789:;<=>@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

low high p – probability of an incorrect base call

Quality: Phred score Q (II) An example: the Tyrolean Iceman (“Ötzi”)

Sanger sequencing: • Discovery in Sept. 1991 off a mountain pass (3210m) near Quality p=10-(Q/10) Probability of incorrect Proportion of Tisenjoch (Ötztal Alps) by hiking score Q base call accurate base calls German couple 10 10-1 = 0.1 1 in 10 90% (92 m inside Italy, off Austria) 20 10-2 = 0.01 1 in 100 99% • Ötzi died ~5250 YBP during the Copper Age (Chalcolithic) 30 10-3 = 0.001 1 in 1,000 99.9% 40 10-4 = 0.0001 1 in 10,000 99.99% • First genetic study in 1994 (Handt et al., Science; on mtDNA -5 50 10 = 0.00001 1 in 100,000 99.999% variation) 60 10-6 = 0.000001 1 in 1,000,000 99.9999% • Full nuclear sequencing (WGS) in 2012 (Keller, Graefen, Ball, et al., Phred score 0 1 … 20 … 50 … 92 93 Nat Comm) ASCII coding 33 34 … 53 … 83 … 125 126 FASTQ symbol ! “ … 5 … S … } ~ Google Earth & Helmut Simon (en.wikipedia.org) Data deposited at European Nucleotide Archive (ENA): Quality shows a trend of decreasing towards the end of a read. http://www.ebi.ac.uk/ena; accession number: ERP001144

Ötzi’s sequence data A FASTQ file from Ötzi

http://www.ebi.ac.uk/ena/data/view/ERP001144 @ERR069107.83594103 61_656_767/1 GGCTGAGGCAGGAGAATTGCTTGAACCCAGGACACGGAGGTTGTG + Sequence identifier IIIIIIIIIIIIIIGIHHIIIII:HI.,,(AI1(((F)&BIIIIE @ERR069107.83594104 1361_1032_197/1 TGGAATGGAATGGAATGGAATGGAATGGAAC Sequence (31 bases) + IIGIIICEIIICCIIIIIIIICAI((D2IF; @ERR069107.83594105 2174_789_1143/2 TGTGTGTGTGTGTGTGTGTGTGTGT + Quality IIIIIIIIIIIIIIIIIIIIIIII@ @ERR069107.83594106 744_1322_1870/1 TCTCCACTTCATTCCATTCCATTTCATCCTATTCCA probability of incorrect base call + IIIIIIIIIIIIIIIIIIIIIIIIIIB>I??IIIII I ! Q=73 ! p=10-73/10=5.0x10-8 @ERR069107.83594107 533_1805_480/1 TCTCAAAAGAAGACATTTATGCAGCCAAAAAATACATGAAAAAATGCTCA -40/10 -4 + ( ! Q=40 ! p=10 =1.0x10 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII1 @ERR069107.83594108 997_1205_1612/1 ; ! Q=59 ! p=10-59/10=1.3x10-6 CATTCCATTGCATTCCATTCCATTCCATTAGTTTCCATTCCATTC + IIIIIIIIIIIIIIIIIIIIIIIIIIII"""ABIIIIIIIIIIII @ERR069107.83594109 1010_363_607/1 CGCAATGGCATTCCTAATGCTTACCGAACGAAAAAT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHII :

42 Bioinformatic workflow Sequence alignment

Read'length' CATC ACGATTCG' ACACCATG ACGA' Reads&

ACCATG ACGATT' ACCATG ACGATT' CCATC ACGATTC'

Sequence& alignment&

…GTGACGAACTGACACCATGACGATTCGATTAGTACCATGAT…'' Reference&sequence& (or'de#novo'assembly)'

Jobling, et al. (2014)

Sequence alignment BWA & bowtie

• Matching single sequences/reads (pairwise alignment) or • Aligner using Burrows-Wheeler Transform (approach for multiple sequences/reads (multiple alignment) to a data compression, published 1994) reference • Mapping low-divergent sequences against a large • Local alignment; multitude of approaches, many NP reference genome -complete (classical algorithm: Smith-Waterman algorithm) • BWA: three algorithms: • Frequently used software for WGS/WES: BWA, bowtie – BWA-backtrack: short reads up to 100bp (previous Illumina) • Many others available, often for specialized tasks (or – BWA-MEM: longer reads (70bp-1Mb) outdated): – BWA-SW: like BWA-MEM, but better for frequent alignment gaps ELAND (Illumina), MAQ, Partek, VelociMapper, GEM, – Li & Durbin (2009, 2010) Bioinformatics SOAP/SOAP2/SOAP3, … – Extensions: BarraCUDA, UGENE (visual interface) • Memory-consuming!! (>2 GB) • bowtie/bowtie 2: • Repetitive and pseudo-autosomal regions are hard to align – Very fast, memory-efficient and therefore barely accessible to NGS – Alignment of short reads – Langmead, et al. (2009) Genome Biol

Mapping of reads to a reference genome Mapping quality score MAPQ

Probability (on a log scale) of a read being misplaced

Li, Ruan, Durbin (2008) MAPQ = - 10 log10(1-p) Genome Res

p – probability of the read coming from the correct position

This probability is approximately modeled using a Bayesian approach, assuming that sequencing errors at different sites of read are independent of each other. best alignment

all possible alignments p(z|x,..)=10-(ΣQi)/10 Read mapping is complicated mismatches (errors or variants), InDels, duplications, insertions, etc. sum of single-base phred Q e.g. two mismatches with Q=20 and Q=10: -(20+10)/10 https://www.broadinstitute.org/gatk/guide/best-practices values at mismatch positions p(z|x,..)=10 =0.001

43 SAM/BAM file format A BAM file from Ötzi

Sequence Alignment/Map format Plain text file containing the reads that could be aligned/mapped

reference

aligned reads split alignment

clipped from the alignment -specs hts SAM file: / header (format version, sorting, …) reference sequence dictionary (length, …) samtools position /

mapping quality (MAPQ) Phred score (* - missing) github.com https:// query template combination of descriptive flags sequence BAM format stores the same data in a compressed, indexed, binary form.

Genetic sample of Ötzi Bioinformatic workflow

Sequencing errors, sample contamination, and other factors can lead to unmapped reads.

Jobling, et al. (2014) Jobling, et al. (2014)

Types of variation Variant calling for SNVs and InDels

Read'length' SNVs InDels SVs CATC ACGATTCG' ACACCATG ACGA' Reads& • Single-nucleotide • Single-base or • Structural variants variants small insertions • Copy number ACCATG ACGATT' and deletions variants (CNVs) ACCATG ACGATT' • Inversions, CCATC ACGATTC' translocations Sequence& alignment&

Read&depth/Coverage' (e.g.'5x)' …GTGACGAACTGACACCATG ACGATTCGATTAGTACCATGAT…'' Reference&sequence& (or'de#novo'assembly)'

G/C& Single:nucleo;de&variant&&(SNV)& Jobling, et al. (2014)

44 Variant calling for SNVs and InDels Variant calling

Unified Genotyper Haplotype Caller • SAMtools • SNVs and InDels are called • SNVs and InDels are called – Outgrew from the 1000 Genomes project separately simultaneously (in local – mpileup for calling SNVs and InDels • aka Consensus Calling (in neighborhood) – Li, et al. (2009) Bioinformatics 25, 2078-9; MAP software) • Re-assembly of genomic Li, et al. (2011) Bioinformatics 27:2987-93 • Bayesian approach; the regions with large variation; – http://samtools.sourceforge.net/ likelihood for each genotype is identification of possible • GATK haplotypes per region expressed depending on the Q – Genome Analysis ToolKit; developed at the Broad Institute and MAPQ error probabilities • Calculation of haplotype – Requires Java • The genotype with highest likelihood for given data – McKanna, et al. (2010) Genome Res 20:1297-303; posterior probability is selected • Bayesian approach as with DePristo, et al. (2011) Nat Genet 43:491-498; consensus calling • First implemented in MAQ Van der Auwera, et al. (2013) Curr Protocols Bioinformatics (Li, et al., 2008, Genome Biol) • Implemented in GATK 43:11.10.1-11.10.33 • Faster, any ploidy • More accurate (for InDels) – https://www.broadinstitute.org/gatk/ No software is optimal for every task. • MAQ, FreeBayes, … Different callers are used for different sorts of variants.

VCF/BCF file format VCF: variant information

chromosome identifier (RefSeq Variant Call Format number etc.) allele frequency ##fileformat=VCFv4.2 physical position reference alternative MAPQ ##fileDate=20090805 meta information (field description, filter criteria, etc.) value of ALT ##source=myImputationProgramV3.1 allele allele ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig= ##phasing=partial #CHROM POS ID REF ALT QUAL FILTER INFO ##INFO= : 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 ##INFO= 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 ##FILTER= ##FILTER= 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T ##FORMAT= : header 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G ##FORMAT= #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51. number of 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 call is made 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 samples with 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 (variant has passed all filters) in dbSNP : data no call is made in HapMap2 combined depth called variants (MAQ<10) variant information variant quality samples (coverage) ancestral across samples allele

mandatory columns ##INFO= BCF format stores the same data in a compressed, binary form. See https://github.com/samtools/hts-specs for a specification of entries.

Genotypes in REF and ALT columns VCF: sample information Data format: GT: inferred genotype GQ: conditional genotype quality (phred scale) (described in DP: read depth (coverage) VCF header!) Sample information: … REF ALT … ##FORMAT= … G A … substitution (SNV), 2 alleles … FORMAT NA00001 NA00002 … G A,T … substitution (SNV), 3 alleles … GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 … GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 … T . … monomorphic (no variant) … GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 … GT:GQ:DP 0/1:35:4 0/2:17:2 … GTC G … deletion, 2 bp Individual NA00002: … G GTCT … insertion, 3 bp … T … large deletion (1 kb) 1|0: genotype phased, heterozygous 0/2: genotype unphased, hetero- 48: probability of 10-4.8=0.000016 for zygous for second allele in ALT an erroneous call 17: probability of 10-1.7=0.02 for an 8: read depth (coverage) erroneous call 2: read depth (coverage)

VCF uses a general syntax system and flexible for coding different information.

45 Technical information A VCF file from Ötzi

##fileformat=VCFv4.1 Variant information Genotype information ##FILTER= ##FORMAT= ##FORMAT= ##FORMAT= AC: allele count in genotypes, 0-REF, 1,2,..-ALT allele :

respectively for each ALT DP: read depth at this site : ##contig= allele GL: log10 genotype likelihoods: ##contig= ##contig= BQ: base quality Q at this site GT:GL 0/1:-323.0,-99.1,-802.5 : -323.0 DB: dbSNP membership ! L(G=0,0)=10 : -99.1 #CHROM POS ID REF ALT QUAL FILTER INFO … DP: combined read depth cross L(G=0,1)=10 1 10002 . A C 116.55 PASS ABHom=0.87;AC=2;AF=1.00;AN=2;BaseCounts=0,7,0,1;DP=8;Dels=0.00; … -802.5 1 10003 . A T 49.02 PASS ABHet=0.13;AC=1;AF=0.50;AN=2;BaseCounts=1,0,0,7; … L(G=1,1)=10 1 233571 . G A 11.34 LowQual ABHom=1.00;AC=2;AF=1.00;AN=2;BaseCounts=2,0,0,0;DP=2;Dels=0.00; … samples 1 234466 . C T 13.41 LowQual ABHom=1.00;AC=2;AF=1.00;AN=2;BaseCounts=0,0,0,2;DP=2;Dels=0.00; … PL: 10*log (phred-scaled) 1 546897 . G T 14.49 LowQual ABHom=1.00;AC=2;AF=1.00;AN=2;BaseCounts=0,0,0,2;DP=2;Dels=0.00; … H2/3: HapMap2/3 membership 10 1 546900 . G A 14.49 LowQual ABHom=1.00;AC=2;AF=1.00;AN=2;BaseCounts=2,0,0,0;DP=2;Dels=0.00; … genotype likelihoods 1 565286 . C T 131.66 PASS ABHom=1.00;AC=2;AF=1.00;AN=2;BaseCounts=0,0,0,40;DP=40;Dels=0.0 … MQ: mapping quality MAPQ 1 567493 . C T 25.63 LowQual ABHet=0.70;AC=1;AF=0.50;AN=2;BaseCounts=0,7,0,3;BaseQRankSum=- … GP: phred-scaled genotype 1 569492 . T C 157.27 PASS ABHom=0.93;AC=2;AF=1.00;AN=2;BaseCounts=1,14,0,0;DP=15;DS;Dels= … MQ0: number of reads with 1 800383 . C T 14.58 LowQual ABHom=1.00;AC=2;AF=1.00;AN=2;BaseCounts=0,0,0,2;DP=2;Dels=0.00 … posterior probabilities : MAPQ=0 covering this site : PQ: phasing quality 1000G: 1000 genomes : membership MQ: mapping quality MAPQ and more and more

Genomic VCF (gVCF) file format VCFtools

https://vcftools.github.io/ • VCF file with – a record for every position (also for non-called variants) • Software for manipulating VCF files – per-sample reference confidence estimation for invariant sites • Possible tasks: • Produced by Haplotype Caller – Filter and summarize variants, create • Developed at the Broad Institute intersections and subsets – Compare, validate and merge VCF files – Convert to different formats (e.g. PLINK, IMPUTE, BEAGLE) – Perform some analyses: • Calculation of population-genetic parameters (nucleotide diversity, FST, Tajima’s D, Hardy -Weinberg proportions & test, etc.) • Linkage disequilibrium calculation (r2) • SNP density, sample relatedness, etc. • Danecek, et al. (2011) Bioinformatics

Bioinformatic workflow GATK Best Practice

https://www.broadinstitute.org/gatk/guide/best-practices FASTQ

SAM/BAM

VCF/gVCF This is a simplified picture. There are more steps that genotype files are taken to produce genotype calls.

Jobling, et al. (2014)

46 Exome pipeline at the CCG Visualization: Integrated Genome Viewer (IGV)

• Interactive visualization tool for large integrated datasets • Many supported file formats, including SAM/BAM and VCF files • http://www.broadinstitute.org/software/igv/ • Robinson, et al. (2011) Nat Biotech 29, 24–26; Thorvaldsdóttir, et al. (2013) Brief Bioinf 14, 178-192

IGV: view of aligned sequence reads Ötzi’s mtDNA in IGV

variants from VCF file

coverage

non-reference alleles (errors, variants?) reads from BAM file

reference

Jobling, et al. (2014)

Cautionary notes on variant calling Estimated proportions of false SNV detections

• Models for variant calling are tuned for sensitivity 1000 Genomes, Pilot 2, chromosomes 1-22 – Project-specific trade-off between between sensitivity and specificity • Variant calls are error-prone NA12878 (CEU) NA19240 (YRI) All Consensus All Consensus – Substantial proportions of false-positives are to be expected (!) [%] P P SNVs only SNVs only • Variant calling quality depends on the experiment 6.3 0.7 2.9 2.6 454 FLXTM <10-4 0.08 – Raw DNA isolation (6.1-6.5) (0.5-3.6) (2.7-3.2) (1.2-4.7) – Library preparation 8.4 3.5 11.1 3.9 GA IIxTM <10-4 <10-4 – Sequencing (inter-lane differences) (8.0-8.7) (3.1-3.9) (10.9-11.3) (3.5-4.3) 17.1 0.8 7.3 4.0 • Variant calls require subsequent filtering before meaningful SOLiDTM <10-4 <10-4 analyses can be conducted (16.9-17.4) (0.1-2.6) (6.8-7.8) (3.1-4.8)

P values obtained from a permutation test.

Nothnagel, Herrmann, et al. (2011) Hum Genet

47 Literature on file formats (& Ötzi) Ötzi’s museum

• Metzker ML (2010) Sequencing technologies - the next generation. Nat Rev Genet 11:31-46. • https://github.com/samtools/hts-specs • https://www.broadinstitute.org/gatk/guide

• http://www.ebi.ac.uk/ena • Keller & Graefen & Ball, et al. (2012) New insights into the Tyrolean Iceman’s origin and phenotype as inferred by whole-genome sequencing. Nat Comm 3:698.

Hubert Berberich (de.wikipedia.org)

South Tyrolean Archeological Museum, Bozen, Italy

48 Filtering Approaches for the A Few Words About Next Analysis of NGS Data Generation Sequencing

Suzanne M. Leal [email protected]

Copyrighted © S.M. Leal 2015

Generation of NGS Data Whole Exome Sequencing

• Capture arrays can be used with sequencing • Not really whole exome to generate data on – Not all genes are targeted • Exomes • Great variability between capture arrays – Different arrays capture different proportions of the exome –Aligent SureSelect 38MB – Not all targeted genes are captured –Aligent SureSelect 50Mb – Not all targeted sequences call be aligned –Illumina TrueSeq Exome Enrichment (62Mb) – Not all aligned sequences can be accurately called • Targeted regions – Not all captured regions have sufficient depth to call • Genes variants

Why is Exome Sequencing Currently Used NGS Data More Frequently than Whole Genome? • For exome sequencing high quality data consists of • Number 1 Reason - Cost! of a median depth of >80X – An exome is ~1/3rd of the cost of a genome • With >90% of the exome covered with a depth of • >10X Easier interpretation of the data – • Whole Genome sequencing (good quality) ~30X Focuses on regions of the genome we understand best coverage • Ideal for the study of highly penetrant diseases – Not necessary to use such high depth for whole genome • Exome sequencing a stop-gap measure until the as for exome sequencing price of whole genome sequencing becomes more • Reads are distributed more evenly across genome reasonable • Sequence data for an exome is ~1/15th of the data for a genome • Already starting to see a switch – More studies performing whole genome sequencing

49 REVIEWS

that have been followed up by sequencing10 across entire Association analysis versus linkage analysis families. However, it should be noted that although link- Pertinent reviews of family-based association analysis age analysis provides statistical evidence that a variant is have previously been published13–15, and only high- involved in disease aetiology, false positives can occur lights are therefore presented here. Genetic linkage when the variant that is tested is only in linkage disequi- and association between two loci are both related to librium with the causal variant. When filter approaches recombination — in the former, recombination events are used, pheno copies11,12 and reduced penetrance can are scored over a limited number of observed genera- Genetic linkage inhibit the ability to elucidate the causal variant but, tions, whereas the latter relies on large numbers of A phenomenon whereby two alleles, one each at two because parametric linkage analysis incorporates a unobserved recombination events in past generations. different loci, are transmitted penetrance model, even under these circumstances As generations go by after an initial disease mutation together from parents to the causal variant can usually be mapped. has occurred, recombination events (crossing over) offspring more often than Filtering to Identify Pathogenic Variants with surrounding markers tend to occur closerIdentifying Casual Genes Using and expected by chance. It leads closer to the disease locus so that measurable asso- to a recombination fraction smaller than 0.5. ciation between disease and marker lociExome/Whole Genome Sequencing extends only Family Based Sequencing Family-based whole-genome (Exome or Whole sequencing Genome) over short distances of up to 100 kb16,17, corresponding Phenocopies approximately to a recombination• fractionInformation on mode of inheritance may give clues (represented Individuals that exhibit the Two or more Affected and One by θ) of 0.001, given 1 Mb ≈ 1 cM. Most differences phenotype of a Mendelian affected unaffected affected between association and linkage analysisto type of variants which you are looking for are due to trait but that are not carriers individuals individuals individual of a susceptible genotype. this difference in the number of generations.– Autosomal recessive phenotype for consanguineous Phenocopies were thought to Association analysis using common variants gen- result from non-genetic factors, erally allows for finer mapping than linkagepedigree analy- but genes at locations other Exclude variants that are not shared sis using SNP loci, but one potentially problematic• Homozygous variants than those under current by affected individuals and that are aspect of association analysis is population stratifica- consideration can also lead present in unaffected individuals – Autosomal recessive phenotype from outbreed pedigree to (genetic) phenocopies. tion, which can lead to an increased number of false- positive results if not properly accounted for• Compound18. This is heterozygous variants Penetrance not a problem in linkage analysis because children’s Exclude non-coding variants and coding variants that • Homozygous variants The conditional probability of genotypes depend on those of their parents and not REVIEWSbeing affected given one of the are found in databases (e.g. 1000 Genomes Project or genotypes at the disease locus, ExAC) or that are not rare (e.g. >0.5% frequency) on population genotype frequencies. However,– Suspected if some de Novo event ‘++’, ‘+d’ or ‘dd’, where ‘d’ is parental genotype data are missing, using• Heterozygousincorrect variant –which is absent in parents the disease allele and ‘+’ the marker allele frequencies can increase type I and II non-disease (wild-type) allele. that have been followed up by sequencing10 across entire errors.Association It has analysis thus been versus tempting linkage to– combineanalysisAutosomal Dominant posi- More generally, penetrance is Test for segregation of identified genetic variants with disease phenotype. Sequence variants in ethnically the conditional probability of a families. However, it should be noted that although link- tivePertinent aspects reviews of linkage of family-based and association association analysis,• Heterozygous analysis which variant matched controls. 13–15 phenotype given a genotype. age analysis provides statistical evidence that a variant is mayhave bepreviously achieved been by using published family-based, and rather only high than- involved in disease aetiology, false positives can occur population-basedlights are therefore control presented individuals. here. Genetic Consider linkage an Figure 1 | Workflow for the whole-genome Recombination when the variant that is tested isNature only in Reviews linkage | Geneticsdisequi- affectedand association individual between and his two or lociher parents.are both Atrelated a given to sequencing filtering approach in human family Two alleles, one from each of librium with the causal variant. When filter approaches markerrecombination locus, —the in alleles the former, inherited recombination by the child may events be two loci, can be inherited from data. Usually, one, two or more affected individuals, or pheno copies11,12 penetrance 19,20 one parent but originate from affectedare used, and unaffected individuals, and reduced in a family have can contrastedare scored over with a the limited alleles number that are of notobserved inherited genera-, Genetic linkage two different grandparents. theirinhibit genomes the ability or exomes to elucidate sequenced. the causal Variants variant that are but, wheretions, whereasthe latter the can latter be shown relies to on be large representative numbers of A phenomenon whereby two If the two marker loci are on 21 alleles, one each at two notbecause predicted parametric to be nonsense, linkage missenseanalysis orincorporates splice-site a theunobserved alleles in recombination the population events. The in most past generations.well-known the same chromosome, a different loci, are transmitted penetrancevariants are usuallymodel, excluded even under from thesefurther circumstances analyses useAs generations of such family-based go by after an controls initial disease is probably mutation the recombination is the result of together from parents to because it is unlikely that they are causal. When the crossing22 over an odd number of crossovers the causal variant can usually be mapped. transmissionhas occurred, disequilibrium recombination test events (TDT) ( . For this to) offspring more often than between the markers. mode of Filtering to Identify Pathogenic Variantsinheritance of a disease is known, this applywith surrounding to multiple offspring, markers thetend null to occurhypothesis closer of and the expected by chance. It leads information can be used to aid the selection of variants. TDTcloser must to the include disease absence locus so of that linkage measurable (θ = 0.5), asso soScreening Databases- Crossingto a recombination over fraction For example, for an autosomal dominant disease, the smaller than 0.5. theciation TDT between is a test diseasefor linkage and that marker is only loci powerful extends when only A cytogenetic phenomenon affected pedigreeFamily-based member’s whole-genome sequence sequencing data should Family Based Sequencing (Exome or Whole Genome) thereover short is both distances linkage of and up toassociation 100 kb16,1721, .corresponding The TDT has that occurs during the display a heterozygous causal variant. Sequence data on Phenocopies recombination• fractionDatabases of 23 Exome and genome Data formation of human gametes additional pedigree members can help to reduce the beenapproximately extended (theto a rare variant-TDT (RV (represented-TDT)) for Individuals that exhibit the (egg or sperm cells). The salient Two or more Affected and One useby θ with) of 0.001,WGS datagiven incorporating 1 Mb ≈ 1 cM. severalMost differences rare vari- phenotype of a Mendelian number of variants that could potentially be disease – Contain individuals who have not been phenotyped feature of crossing over is that affected unaffected affected trait but that are not carriers causing. A final filtering step is performed in which antbetween association association tests and and has linkage been implementedanalysis are due in the to it occurs semi-randomly along individuals individuals individual • e.g. 1000 Genome data of a susceptible genotype. those variants that are present in the databases dbSNP, Family-Basedthis difference Association in the number Test of Toolkit generations. (FBAT) suite of chromosomes, with at least Phenocopies were thought to 1000 Genomes, ExAC and Exome Variant Server are 24 25 one crossover occurring on programsAssociation. Some analysis rare variant using commonassociation– Were ascertained because of disease phenotype variants tests genana- result from non-genetic factors, each chromosome in meiosis. excluded. Additionally, bioinformatic tools, such as lyseerally variants allows infor aggregate finer mapping (usually than across linkage a genomic analy- but genes at locations other PolyphenExclude variants-2 (REF. that102 )are, and not measures shared of conservation, for regionsis using such SNP as aloci, gene) but instead one potentially of analysing• problematicCoronary individual heart disease Recombinationthan those under fractioncurrent example,by affected PhyloP individuals103, are and often that used are to predict whether a consideration can also lead present in unaffected individuals rareaspect variants. of association It has beenanalysis shown is population that– Several analysing stratifica drareatabases available- (θ). The expected proportion variant is deleterious and therefore likely to be disease to (genetic) phenocopies. tion, which can lead to an increased number of false- of recombinant children variants in aggregate is much more powerful than the causing. Even after filtering steps, there may be many 25,26 • 100018 Genomes divided by the total number individualpositive results analysis if not of properlyrare variants accounted. for . This is Penetrance variants that need to be followed up in the remaining of recombinant and not a problem in linkage analysis because children’s– http://www.1000genomes.org/ The conditional probability of familyExclude members non-coding to elucidate variants whether and coding the variants variant that(or non-recombinant children. For Approaches for linkage analysis being affected given one of the variants)are found segregate in databases with the(e.g. disease 1000 Genomes phenotype. Project If theor genotypes depend on those of their parents• Exome and notVariant Server two loci in close proximity to genotypes at the disease locus, familyExAC) is from or that a population are not rare that (e.g. is>0.5% not represented frequency) in LODon population scores. Linkage genotype analysis frequencies. can be carried However, out between if some each other, θ is small owing to – http://evs.gs.washington.edu/EVS/ ‘++’, ‘+d’ or ‘dd’, where ‘d’ is the randomness of crossing databases, then ethnically matched controls need to aparental putative genotype disease locus data and are a missing,single marker using locus incorrect (two- the disease allele and ‘+’ the • ExAC over, but it increases to 0.5 for be sequenced to evaluate the frequency of the variant pointmarker linkage) allele frequencies or across a setcan of increase markers type (multipoint I and II non-disease (wild-type) allele. loci that are far apart. (or variants). analysis)errors. It consistinghas thus been of a smalltempting number to combine of markers– http:// posi or-exac.broadinstitute.org / More generally, penetrance is Test for segregation of identified genetic variants with disease phenotype. Sequence variants in ethnically tive aspects of linkage and association analysis, which the conditional probability of a matched controls. phenotype given a genotype. may be achieved by using family-based rather than 2 | ADVANCE ONLINE PUBLICATION population-based control www.nature.com/reviews/genetics individuals. Consider an Figure 1 | Workflow for the whole-genome Recombination Nature Reviews | Genetics affected individual and his or her parents. At a given sequencing filtering© approach 2015 Macmillan in human Publishers family Limited. All rights reserved Two alleles, one from each of marker locus, the alleles inherited by the child may be two loci, can be inherited from data. Usually, one, two or more affected individuals, or 19,20 one parent but originate from affected and unaffected individuals, in a family have contrasted with the alleles that are not inherited , two different grandparents. their genomes or exomes sequenced. Variants that are where the latter can be shown to be representative of If the two marker loci are on not predicted to be nonsense, missense or splice-site the alleles in the population21. The most well-known the same chromosome, a variants are usually excluded from further analyses use of such family-based controls is probably the recombination is the result of because it is unlikely that they are causal. When the 22 an odd number of crossovers transmission disequilibrium test (TDT) . For this to between the markers. mode of inheritance of a disease is known, this apply to multiple offspring, the null hypothesis of the information can be used to aid the selection of variants. TDT must include absence of linkage (θ = 0.5), Contributing Projectsso Crossing over For example, for an autosomal dominant disease, the the TDT is a test for linkage that is only powerful when A cytogenetic phenomenon affected pedigree member’s sequence data should • 1000 Genomes21 that occurs during the there is both linkage and association• Bulgarian Trios. The TDT has display a heterozygous causal variant. Sequence data on 23 formation of human gametes additional pedigree members can help to reduce the been extended (the rare variant-TDT• Finland (RV--TDTUnited States Investigation of NIDDM Genetics (FUSION))) for (egg or sperm cells). The salient number of variants that could potentially be disease use with WGS data incorporating• GoT2Dseveral rare vari- feature of crossing over is that causing. A final filtering step is performed in which ant association tests and has been• implementedInflammatory Bowel Disease in the it occurs semi-randomly along • METabolic Syndrome In Men (METSIM) those variants that are present in the databases dbSNP, Family-Based Association Test Toolkit• (FBAT) suite of chromosomes, with at least 24 Jackson Heart Study25 one crossover occurring on 1000 Genomes, ExAC and Exome Variant Server are programs . Some rare variant association• Myocardial Infarction Genetics Consortium: tests ana- each chromosome in meiosis. excluded. Additionally, bioinformatic tools, such as lyse variants in aggregate (usually• Italian Atherosclerosis, Thrombosis, and Vascular Biology Working Groupacross a genomic Polyphen-2 (REF. 102), and measures of conservation, for region such as a gene) instead of• analysingOttawa Genomics Heart Study individual 103 Recombination fraction example, PhyloP , are often used to predict whether a rare variants. It has been shown• thatPakistan Risk of Myocardial Infarction Study (PROMIS) analysing rare ( ). The expected proportion • Precocious Coronary Artery Disease Study (PROCARDIS) θ variant is deleterious and therefore likely to be disease variants in aggregate is much more powerful than the of recombinant children causing. Even after filtering steps, there may be many • Registre Gironi del COR (REGICOR) individual analysis of rare variants25,26. divided by the total number variants that need to be followed up in the remaining • NHLBI-GO Exome Sequencing Project (ESP) • Most extensive database with data on 60,706 individuals of recombinant and family members to elucidate whether the variant (or • National Institute of Mental Health (NIMH) Controls non-recombinant• children.Provides break For -downs by different ethnic groups • variants) segregate with the disease phenotype. If the Approaches for linkage analysis SIGMA-T2D two loci in close proximity• Although to contains individuals with disease, e.g. schizophrenia • Sequencing in Suomi (SISu) family is from a population that is not represented in LOD scores. Linkage analysis can be carried out between each other, θ is small owing• No individuals to with diagnosed early onset disease included • Swedish Schizophrenia & Bipolar Studies databases, then ethnically matched controls need to a putative disease locus and a single marker locus (two- the randomness of• crossingInformation on allele frequencies • T2D-GENES be sequenced to evaluate the frequency of the variant point linkage) or across a set of markers (multipoint over, but it increases to• 0.5N forumbers of heterozyous and homozygous individuals for a variant • Schizophrenia Trios from Taiwan loci that are far apart. (or variants). • • Can evaluate read depth to determine if variant site of interest is covered analysis) consisting of a small numberThe Cancer Genome Atlas (TCGA) of markers or • Tourette Syndrome Association International Consortium for Genomics (TSAICG) with adequate read depth and in how many individuals

2 | ADVANCE ONLINE PUBLICATION www.nature.com/reviews/genetics

© 2015 Macmillan Publishers Limited. All rights reserved

50 REVIEWS

that have been followed up by sequencing10 across entire Association analysis versus linkage analysis families. However, it should be noted that although link- Pertinent reviews of family-based association analysis age analysis provides statistical evidence that a variant is have previously been published13–15, and only high- involved in disease aetiology, false positives can occur lights are therefore presented here. Genetic linkage when the variant that is tested is only in linkage disequi- and association between two loci are both related to librium with the causal variant. When filter approaches recombination — in the former, recombination events are used, pheno copies11,12 and reduced penetrance can are scored over a limited number of observed genera- Genetic linkage inhibit the ability to elucidate the causal variant but, tions, whereas the latter relies on large numbers of A phenomenon whereby two alleles, one each at two because parametric linkage analysis incorporates a unobserved recombination events in past generations. different loci, are transmitted penetrance model, even under these circumstances As generations go by after an initial disease mutation together from parents to the causal variant can usually be mapped. has occurred, recombination events (crossing over) offspring more often than Filtering to Identify Pathogenic Variants with surrounding markers tend to occur closer and Avoid 0% Cut-off When Filtering expected by chance. It leads closer to the disease locus so that measurable asso- to a recombination fraction smaller than 0.5. ciation between disease and marker loci extends only • Databases do not consist of disease free Family Based Sequencing Family-based whole-genome (Exome or Whole sequencing Genome) over short distances of up to 100 kb16,17, corresponding Phenocopies approximately to a recombination fraction (represented individuals Individuals that exhibit the Two or more Affected and One by θ) of 0.001, given 1 Mb ≈ 1 cM. Most differences phenotype of a Mendelian affected unaffected affected between association and linkage analysis are due to trait but that are not carriers individuals individuals individual • Mendelian variants may not be 100% penetrant of a susceptible genotype. this difference in the number of generations. Phenocopies were thought to Association analysis using common variants gen- • For autosomal recessive traits result from non-genetic factors, erally allows for finer mapping than linkage analy- but genes at locations other Exclude variants that are not shared sis using SNP loci, but one potentially problematic – Carriers may be present in databases than those under current by affected individuals and that are consideration can also lead present in unaffected individuals aspect of association analysis is population stratifica- to (genetic) phenocopies. tion, which can lead to an increased number of false- • Frequency cut-offs should be disease specific positive results if not properly accounted for18. This is Penetrance not a problem in linkage analysis because children’s – Unlikely pathogenic variants will have a frequency of The conditional probability of Exclude non-coding variants and coding variants that being affected given one of the are found in databases (e.g. 1000 Genomes Project or genotypes depend on those of their parents and not >0.5% genotypes at the disease locus, ExAC) or that are not rare (e.g. >0.5% frequency) on population genotype frequencies. However, if some • There are rare exceptions ‘++’, ‘+d’ or ‘dd’, where ‘d’ is parental genotype data are missing, using incorrect the disease allele and ‘+’ the marker allele frequencies can increase type I and II – GJB2 35delG variant for nonsydromic hearing impairment non-disease (wild-type) allele. errors. It has thus been tempting to combine posi- More generally, penetrance is Test for segregation of identified genetic variants with disease phenotype. Sequence variants in ethnically tive aspects of linkage and association analysis, which the conditional probability of a matched controls. phenotype given a genotype. may be achieved by using family-based rather than population-based control individuals. Consider an Figure 1 | Workflow for the whole-genome Recombination Nature Reviews | Genetics affected individual and his or her parents. At a given sequencing filtering approach in human family Two alleles, one from each of marker locus, the alleles inherited by the child may be two loci, can be inherited from data. Usually, one, two or more affected individuals, or 19,20 one parent but originate from affected and unaffected individuals, in a family have contrasted with the alleles that are not inherited , two different grandparents. their genomes or exomes sequenced. Variants that are where the latter can be shown to be representative of If the two marker loci are on not predicted to be nonsense, missense or splice-site the alleles in the population21. The most well-known the same chromosome, a variants are usually excluded from further analyses use of such family-based controls is probably the recombination is the result of because it is unlikely that they are causal. When the transmission disequilibrium test (TDT)22. For this to Test for Segregation with Disease an odd number of crossovers between the markers. mode of inheritance of a disease is known, this apply to multiple offspring, the null hypothesis of the Screening Control Individualsinformation can be used to aid the selection of variants. TDT must include absence of linkage (θ = 0.5), so Phenotype Crossing over For example, for an autosomal dominant disease, the the TDT is a test for linkage that is only powerful when A cytogenetic phenomenon affected pedigree member’s sequence data should there is both linkage and association21. The TDT has that occurs during• theIs it not always necessary to screen controls given display a heterozygous causal variant. Sequence data on • Is a variant ruled out if it does not completely 23 formation of human gametesthe large available databases additional pedigree members can help to reduce the been extended (the rare variant-TDT (RV-TDT)) for segregate with the disease phenotype? (egg or sperm cells). The salient number of variants that could potentially be disease use with WGS data incorporating several rare vari- feature of crossing over is that causing. A final filtering step is performed in which ant association tests and has been implemented in the it occurs semi-randomly along • Depends if the study population is well those variants that are present in the databases dbSNP, Family-Based Association Test Toolkit (FBAT) suite of • What are reasons for incomplete segregation? chromosomes, with at least 1000 Genomes, ExAC and Exome Variant Server are programs24. Some rare variant association tests25 ana- one crossover occurringrepresented in the public databases on – Variant was a false positive call each chromosome in meiosis. excluded. Additionally, bioinformatic tools, such as lyse variants in aggregate (usually across a genomic Polyphen-2 (REF. 102), and measures of conservation, for region such as a gene) instead of analysing individual • For under represented populations variant 103 – Not pathogenic Recombination fraction example, PhyloP , are often used to predict whether a rare variants. It has been shown that analysing rare (θ). The expected proportionfrequencies should be examined in controls variant is deleterious and therefore likely to be disease variants in aggregate is much more powerful than the of recombinant children causing. Even after filtering steps, there may be many – Locus heterogeneity within the pedigree individual analysis of rare variants25,26. divided by the total number– Or individuals from the same populations who were variants that need to be followed up in the remaining of recombinant and family members to elucidate whether the variant (or – “Phenocopies” within the pedigree non-recombinant children. For ascertained for another phenotypevariants) segregate with the disease phenotype. If the Approaches for linkage analysis two loci in close proximity to family is from a population that is not represented in LOD scores. Linkage analysis can be carried out between – Reduced penetrance each other, θ is small owing to the randomness of crossing databases, then ethnically matched controls need to a putative disease locus and a single marker locus (two- – Incorrect pedigree structure over, but it increases to 0.5 for be sequenced to evaluate the frequency of the variant point linkage) or across a set of markers (multipoint loci that are far apart. (or variants). – Sample swaps analysis) consisting of a small number of markers or

2 | ADVANCE ONLINE PUBLICATION www.nature.com/reviews/genetics

© 2015 Macmillan Publishers Limited. All rights reserved

Proof of Principal - Miller Syndrome NIH Public Access Author Manuscript Nat Genet. Author manuscript; available in PMC 2010 July 1.

NIH-PA Author ManuscriptPublished NIH-PA Author Manuscript in final edited NIH-PA Author Manuscript form as: A Few Examples of Successful NGS Nat Genet. 2010 January ; 42(1): 30–35. doi:10.1038/ng.499. Studies Using Filtering Approaches Exome sequencing identifies the cause of a Mendelian disorder Sarah B. Ng1,*, Kati J. Buckingham2,*, Choli Lee1, Abigail W. Bigham2, Holly K. Tabor2, Karin M. Dent3, Chad D. Huff4, Paul T. Shannon5, Ethylin Wang Jabs6,7, Deborah A. Nickerson1, Jay Shendure1,†, and Michael J. Bamshad1,2,8,†

1Department of Genome Sciences, University of Washington, Seattle, Washington, USA 2Department of Pediatrics, University of Washington, Seattle, Washington, USA 3Department of Pediatrics, University of Utah, Salt Lake City, Utah, USA 4Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA 5Institute of Systems Biology, Seattle WA, USA 6Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, New York, USA 7Department of Pediatrics, Johns Hopkins University, Baltimore, Maryland 8Seattle Children’s Hospital, Seattle, Washington, USA

Abstract We demonstrate the first successful application of exome sequencing to discover the gene for a rare, Mendelian disorder of unknown cause, Miller syndrome (OMIM %263750). For four affected individuals in three independent kindreds, we captured and sequenced coding regions to a mean coverage of 40X, and sufficient depth to call variants at ~97% of each targeted exome. Filtering against public SNP databases and a small number of HapMap exomes for genes with two novel variants in each of the four cases identified a single candidate gene, DHODH, which encodes a key enzyme in the pyrimidine de novo biosynthesis pathway. Sanger sequencing confirmed the presence of DHODH mutations in three additional families with Miller syndrome. Exome sequencing of a small number of unrelated, affected individuals is a powerful, efficient strategy for identifying the genes underlying rare Mendelian disorders and will likely transform the genetic analysis of monogenic traits.

Rare monogenic diseases are of substantial interest because identification of their genetic basis provides important knowledge about disease mechanisms, biological pathways, and 51 potential therapeutic targets. However, to date, allelic variants underlying fewer than half of all monogenic disorders have been discovered. This is because the identification of allelic variants for many rare disorders is fundamentally limited by factors such as the availability of only a small number of cases/families, locus heterogeneity, or substantially reduced reproductive fitness, each of which lessens the power of traditional positional cloning strategies and often restricts the analysis to a priori identified candidate genes. In contrast, deep resequencing of all human genes for discovery of allelic variants could potentially identify the gene underlying any given rare monogenic disease. Massively parallel DNA sequencing technologies1 have rendered the whole genome resequencing of individual

†Corresponding authors: Mike Bamshad, MD, Department of Pediatrics, University of Washington School of Medicine, Box 356320, 1959 NE Pacific Street, Seattle, WA 98195, Jay Shendure, MD, PhD, Department of Genome Sciences, University of Washington School of Medicinem, Box 355065, 1705 NE Pacific Street, Seattle, WA 98195. *These authors contributed equally to this work Author contributions The project was conceived and experiments planned by M.B., D.A.N., and J.S. Review of phenotypes and sample collection were performed by E.W.J. and M.B. Experiments were performed by S.B.N., K.J.B., and C.L. Genetic counseling and ethical consultation were provided by K.M.D and H.K.T. Data analysis were performed by S.B.N., K.J.B., A.W.B., C.D.H., P.T.S., and J.S. The manuscript was written by M.B., J.S., S.B.N., and K.J.B. All aspects of the study were supervised by M.B., D.A.N., and J.S. DHODH Gene Identified Kabuki Syndrome NIH Public Access Author Manuscript • Four individuals with Miller syndrome Nat Genet. Author manuscript; available in PMC 2011 March 1.

NIH-PA Author ManuscriptPublished NIH-PA Author Manuscript in final edited NIH-PA Author Manuscript form as: underwent exome sequencing Nat Genet. 2010 September ; 42(9): 790–793. doi:10.1038/ng.646. – From three families Exome sequencing identifies MLL2 mutations as a cause of • An additional three Miller families where Kabuki syndrome Ng et al. Page 11 followed up with Sanger Sequencing Sarah B. Ng1,*, Abigail W. Bigham2,*, Kati J. Buckingham2, Mark C. Hannibal2,3, Margaret McMillin2, Heidi Gildersleeve2, Anita E. Beck2,3, Holly K. Tabor2,3, Greg M. Cooper1, Heather C. Mefford2, Choli Lee1, Emily H. Turner1, Josh D. Smith1, Mark J. Rieder1, Koh- NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript ichiro Yoshiura4, Naomichi Matsumoto5, Tohru Ohta6, Norio Niikawa6, Deborah A. Nickerson1, Michael J. Bamshad1,2,3,†, and Jay Shendure1,† 1Department of Genome Sciences, University of Washington, Seattle, Washington, USA • 2Department of Pediatrics, University of Washington, Seattle, Washington, USA 3ExomeSeattle Children’ssequenced Hospital, Seattle, Washington, USA 4Department– 10 unrelated of Human Genetics,probands Nagasaki Universitywith Kabuki syndrome Graduate School of Biomedical Sciences, From Ng et al. 2010 Nagasaki, Japan 5 DHODH is composed of 9 exons that encode the Figure 2. Genomic structure of the exons encodinguntranslated the open readingregion (orange) and coding frame of DHODH Department of Human Genetics, Yokohama City University Graduate School of Medicine, region (blue). Arrows indicate the location of 11 different variants found in six Miller familiesDHODH is composed of 9 exons that encode untranslated regions (orange) and protein Yokohama, Japan coding sequence (blue). Arrows indicate the locations of 11 different mutations found in 6 6Research Institute of Personalized Health Sciences, Health Sciences University of Hokkaido, families with Miller syndrome. Hokkaido, Japan

Abstract We demonstrate the successful application of exome sequencing1–3 to discover a gene for an autosomal dominant disorder, Kabuki syndrome (OMIM %147920). The exomes of ten unrelated probands were subjected to massively parallel sequencing. After filtering against SNP databases, there was no compelling candidate gene containing novel variants in all affected individuals. Less stringent filtering criteria permitted modest genetic heterogeneity or missing data, but identified multiple candidate genes. However, genotypic and phenotypic stratification highlighted MLL2, a

Users may view, print, copy, download and text and data- mine the content in such documents, for the purposes of academic research, Kabuki Syndrome subject always to the full ConditionsKabuki Syndrome of use: http://www.nature.com/authors/editorial_policies/license.html#terms †Corresponding authors: Jay Shendure, MD, PhD, Department of Genome Sciences, University of Washington School of Medicine, Box 355065, 1705 NE Pacific Street, Seattle, WA 98195, Mike Bamshad, MD, Department of Pediatrics, University of Washington School of Medicine, Box 356320, 1959 NE Pacific Street, Seattle, WA 98195. *These authors contributed equally to this work. URLs • Refseq 36.3 : ftp://ftp.ncbi.nlm.nih.gov/genomes/MapView/Homo_sapiens/sequence/BUILD.36.3/updates/seq_gene.md.gz Could have tackled problem by sequencing triosPhaster : http://www.phrap.org SeattleSeq Annotation : http://gvs.gs.washington.edu/SeattleSeqAnnotation/ –1000Suspected to be Genomes Project: : http://www.1000genomes.org/page.phpde Novo Data access Exome data for the discovery cohort will be available via the NCBI dbGaP repository. • Article describes how initial strategy failed since not Author contributions The project was conceived and experiments planned by M.J.B., D.A.N., and J.S. Review of phenotypes and sample collection were performed by M.J.B., M.C.H., M.M., K.Y., N.M., T.O. and N.N. Experiments were performed by S.B.N., K.J.B., A.E.B., C.L., all children have Kabuki syndromH.C.M. J.D.S., M.J.R., E.H.T., and H.G. Ethical consultation was provided by H.K.T. Datae analysisdue to variants in was performed by A.W.B., M.J.B., K.J.B., G.M.C., S.B.N. and J.S. The manuscript was written by M.J.B., S.B.N., and J.S. All aspects of the study were supervised by M.J.B. and J.S. the same gene (locus heterogeneity)Competing Interests Statement From The authors declare no competing financial interests. Ng et al. 2010

• Dysmorphic, skeletal, immunologic & mild intellectual disabilities • 1/30,000 to 1/50,0000 • Most cases simplex – Very few cases of parental transmissionNat Genet. Author manuscript; available in PMC 2010 July 1.

Kabuki Syndrome Kabuki Syndrome

• After failure to identify gene • Clinicians ranked the patients from typical Kabuki syndrome to atypical • Predicted functional assessment of variants • Manual review of data highlighted previously Not the correct unidentified nonsense variant in MLL2 gene gene – Identified in the four highest ranked cases 1, 2, 3 & 4 From Ng et al. 2010 – Additional found in patients 6, 7 & 9

52 Discover of de novo events using Exome Kabuki Syndrome Sequence Data for Autism

ARTICLE doi:10.1038/nature13908 • Additional Kabuki cases identified to have MLL2 gene mutations The contribution of de novo coding – 26/43 cases mutations to autism spectrum disorder Ivan Iossifov1*, Brian J. O’Roak2,3*, Stephan J. Sanders4,5*, Michael Ronemus1*, Niklas Krumm2, Dan Levy1, Holly A. Stessman2, Kali T. Witherspoon2, Laura Vives2, Karynne E. Patterson2, Joshua D. Smith2, Bryan Paeper2, Deborah A. Nickerson2, Jeanselle Dea4, Shan Dong5,6, Luis E. Gonzalez7, Jeffrey D. Mandell4, Shrikant M. Mane8, Michael T. Murtha7, • 12/12 patients with available parents had de Novo Catherine A. Sullivan7, Michael F. Walker4, Zainulabedin Waqar7, Liping Wei6,9, A. Jeremy Willsey4,5, Boris Yamrom1, Yoon-ha Lee1, Ewa Grabowska1,10, Ertugrul Dalkic1,11, Zihua Wang1, Steven Marks1, Peter Andrews1, Anthony Leotta1, Jude Kendall1, Inessa Hakker1, Julie Rosenbaum1, Beicong Ma1, Linda Rodgers1, Jennifer Troge1, Giuseppe Narzisi1,10, Seungtai Yoon1, Michael C. Schatz1, Kenny Ye12, W. Richard McCombie1, Jay Shendure2, Evan E. Eichler2,13, variants Matthew W. State4,5,7,14 & Michael Wigler1 Nature

Whole exome sequencing has proven to be a powerful tool for understanding the genetic architecture of human disease. • Exome sequence data from 2,517 simplex families from the Here we apply it to more than 2,500 simplex families, each having a child with an autistic spectrum disorder. By com- paring affected to unaffected siblings, we show that 13% of de novo missense mutations and 43% of de novo likely gene- disrupting (LGD) mutations contribute to 12% and 9% of diagnoses, respectively. Including copy number variants, coding Simons Simplex Collection (SSC) was analyzedde novo mutations contribute to about 30% of all simplex and 45% of female diagnoses. Almost all LGD mutations occur opposite wild-type alleles. LGD targets in affected females significantly overlap the targets in males of lower intelligence quotient (IQ), but neitheroverlaps significantly with targets in males of higher IQ. We estimate that LGD mutation in about – Probands400 genes canwith autism spectrum disorder and their parents contribute to the joint class of affected females and males of lower IQ, with an overlapping and similar number of genes vulnerable to contributory missense mutation. LGD targets in the joint class overlap with published targets for intellectual disability and schizophrenia, and are enriched for chromatin modifiers, FMRP-associated genes sequencedand embryonically expressed genes. Most of the significance for the latter comes from affected females.

– 1,911 families also had sequence data on an unaffected siblingAutism spectrum disorder (ASD) is characterized by impaired social unaffected siblings and the parents of each family. Within the SSC, the interaction and communication, repetitive behaviour and restricted inter- overall gender bias in affected individuals, 7 males to 1 female, is nearly ests. It has a strong male bias, especially in high-functioning affected twice that typically reported. Exomes were analysed at Cold Spring Har- individuals. The contribution from transmission has long been suspected bor Laboratory (CSHL), Yale School of Medicine, and University of from increased sibling risk1, but more recently the role of germline de Washington (Extended Data Figs 1 and 2 and Supplementary Table 1). novo (DN) mutation has been established, first from large-scale copy Pipelines were blind with respect to affected status. For uniformity, all number variation in simplex families2–5, and subsequently from exome data were reanalysed with the CSHL pipeline, allowing comparison of sequencing. The smaller DN variants observed by DNA sequencing pin- analysis tools. All calls were validated or strongly supported, as listed point candidate gene targets6–8. These developments have promoted a (Supplementary Table 2) and described (Methods). new model for causation, and re-evaluation of sibling risk9,10. Here we report whole exome sequencing of the Simons Simplex Col- Rates and targets of DN mutation lection (SSC)11 and an extensive list of DN mutated targets, including For greatest precision we measured DN rates in quad families (one affec- 27 recurrent LGD (nonsense, frameshift and splice site) targets. The size ted and one unaffected child) over genomic positions at which all family and uniformity of this study allow an unprecedented evaluation of gen- members had $403 sequence coverage (Methods and Supplementary etic vulnerability to ASD. We subdivide target sets by mutation type Table 3). This ‘joint 403 region’ in the SSC was 32 gigabases (Gb) in total, (missense and LGD) and affected child status (gender and non-verbal or 48% of the targeted exome, from 1,867 quads. DN events were shared IQ, to which we refer throughout as ‘IQ’), and explore the overlap be- by siblings 1% of the time (Supplementary Table 2); and 1% of mutations tween target sets and their enrichment for certain gene categories. We had nearby nucleotide positions altered, presumably by singlemutagenic make estimates of the number of genes vulnerable to a given mutation events12–14 (Supplementary Table 4). The overall rate of base substitu- type and the proportion of simplex autism resulting from DN mutation tion is 1.8 3 1028 (61029) per base pair (Supplementary Table 5). for each affected subpopulation. Rates of DN synonymous mutation in affected (0.34 per child) and unaffected (0.33 per child) siblings do not differ significantly (Fig. 1). SSC sequencing and validation By contrast, LGD mutations occur at significantly higher rates in affec- We report on 2,517 of ,2,800 SSC families including ,800 that were ted versus unaffected siblings (Fig. 1 and Extended Data Fig. 3). The previously published6–8. We sequenced 2,508 affected children, 1,911 rate of LGD mutations is 0.12 in unaffected siblings and 0.21 in affected

RESEARCH ARTICLE 1Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA. 2Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA. 3Molecular & Rates of de novo Events by Variant Type 4 5 Medical Genetics, Oregon Health & Science University, Portland,ARTICLE Oregon 97208, USA. Department ofRESEARCH Psychiatry, University of California, San Francisco, San Francisco, California 94158, USA. Department Genes with Recurrent Hits and Non6 -Verbal IQ of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520, USA. Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China. 7Child Study Center, Yale University School of Medicine, New Haven, Connecticut 06520, USA. 8Yale Center for Genomic Analysis, Yale University School of Medicine, New Haven, Connecticut 06520, USA. 9National Institute of Biological Sciences, Beijing 102206, China. 10New York Genome Center, New York, New York 10013, USA. 11Department of Medical Biology, Bulent Ecevit University School of Medicine, 67600 Zonguldak, Turkey. 12Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York 10461, USA. 13Howard Hughes Medical Institute, Seattle, Washington 98195, USA. 14Department of Psychiatry, Yale University School of Medicine, New Haven, Connecticut 06520, USA. *These authors contributed equally to this work. Frameshift Male F FMRP targets 1.5 overlap of DN LGD targets in216affected | NATURE | VOL probands515 | 13 NOVEMBERwith 2014 FMRP targets (55 Nonsense Female ©2014 Macmillan Publishers Limited. All rights reserved C Chromatin modifers 0.10 Splice site 24 E Embryonically expressed observed versus 34.1 expected; P 5 4 3 10 ) and chromatin modifiers P Postsynaptic density proteins 1.0 (26 observed versus 11.8 expected; P 5 3 3 1024). We also observed sig- 0.05 20 90 100 23 160 Proband Unaffected 0.5 nal from mutation in genes expressedNon-verbal in embryonic IQ developmentGene (65Categories LGD MS LGD MS 23 observed versus 45.0 expected; P 5 2 3 10 ). The latter signalCHD8 comesF C7 DYRK1A E 4 Events per child 0.0 0.00 ANK2 F P31 mainly from the small number of female affected individualsGRIN2B (23 ob-F P 31 Substitutions Synonymous Missense Nonsense Splice site 26DSCAM F 3 1 –3 –3 served versus 8.5 expected from 67 LGD targets; P 5 5 3 10 ).CHD2 The 27C3 P = 1 × 10 P = 0.56 P = 0.01 P = 9 × 10 P = 0.03 ARID1B F C21 KDM6B F C2 genes with recurrent LGDs show strong enrichment for FMRPADNP targetsF E 2 28 MED13L F E 2 1 0.2 0.3 (14 observed versus 2.6 expected; P 5 4 3 10 ) and chromatinNCKAP1 modi-F P 2 24 ANKRD11 F 2 5 3 DIP2A F 2 ASD fiers (6 observed versus 0.9 expected; P 2 10 ). By contrast,SCN2A noF 2 4 0.2 TNRC6B F 2 Sib significant enrichment for these gene sets is seen for the DN LGDWDFY3 tar-F 2 1 0.1 PHF2 C E 2 gets in unaffected siblings. WAC C 2 0.1 KDM5B E 2 2 22 POGZ E 2 2 1 The 1,500 DN missense targets in probands are also enrichedRIMS1 for P 2 FOXP1 2 Events per child FMRP targets and embryonically expressed genes. We observeGIGYF1 171 2 0.0 0.0 KATNAL2 2 Indels No frameshift Frameshift Likely gene disruptive (LGD)LGDs FMRP targets (144.8 expected; P 5 0.03), and 220 embryonicallyKMT2E ex- 2 1 –3 –5 TBR1 2 1 P = 0.03 P = 0.47 P = 7 × 10 P = 2 × 10 pressed genes (191.4 expected; P 5 0.03). As before, the signalTCF7L2 for em- 2 From Iossitov et al. 2014 From Iossitov et al. 2014 Figure 1 | Rates of de novo events by mutational type in the SSC. Rates bryonically expressed genes100 comes almost entirely from the small number LGDs LGDs LGDs LGDs LGDs Missense Missense Synonymous per child are estimated from the 403 joint coverage target region, then of female affected individuals90 (48all observed,recurrent 31.1in FXG expectedin CHM fromin EMB 244 tar-all recurrent all

1.0 80 0.4 extrapolated for the entire exome. Mutation types are displayed by class, gets; P 5 0.002). With the exception of chromatin modifiers, contrib- 0.2 and the combined rate for all LGDs is shown at the bottom right. For each event 0.01 0.01 utory DN missense and70 LGD mutations tend to strike similar functional0.9 0.9 type, the significance between probands and unaffected siblings is given. 0.5 0.001 0.2 0.5 0.4 classes of genes. 0.00001 Non-verbal IQ 60 0.4 Sib, unaffected siblings. The errors bar represent 95% confidence interval for 0.06 0.10 Male with de novo type Male without de novo type 50 the mean rates. Female with de novo type De novo mutation and IQ Female without de novo type 40 Higher IQ probands are heavily skewed towards males24. For further Short List of Genes Identified Using Exome probands, an ‘ascertainment differential’ of 0.21 2 0.12 5 0.09 (P 5 Figure 2 | Recurrently hitHow Many Variants will be Identified genes and non-verbal IQ. Affected females DN mutations are considered: all LGDs, recurrent LGDs, LGDs in FMRP 25 analyses,account we for chose 13.5% of to the divideSSC with a mean the IQ affected of 78, whereas male affected population males have targets roughly (FXG), LGDs in in chromatin modifiers (CHM), LGDs in embryonically 2 3 10 ). Thus, we estimate ,43% (0.09 out of 0.21) of LGD events a mean IQ of 86 (top, P 5 1027 by Student’s t-test). Vertical dashed line expressed genes (EMB), all missense mutations, recurrent missense mutations half intoindicates higher an IQ and of 90. lower Middle (left) IQ shows sets. IQ We for affected investigated children with LGD whetherand synonymous higher IQmutations. Probands are divided by the presence of DN in probands contribute toSequencing ASD diagnoses. For DN missense, the rate is (.90) malesmutations comprise in genes hit recurrently a population (right).Using Filtering Approaches? Recurrently with mutated a distinguishable genes are mutations genetic and gender. sig- Means, 95% confidence intervals and P values 0.82 for unaffected siblings and 0.94 for affected probands, an ascer- clustered into four categories as shown. The last four columns give overall (Student’s t-test) are shown. nature.numbers There of is DN a LGD decreased and missense (MS)ascertainment mutations. Bottom, differential eight classes of for DN LGD tainment differential of 0.12 (P 5 0.01). We estimate only ,13% (0.12 • Depends on mutationsLGD in mutation male target children genes with with shared higher functionality, IQ the relative same genes to othersets of vulnerable affected genes. We next sought evidence for excess recurrence out of 0.94) of DN missense events in probands contribute to ASD individualsare sometimes (Extended targeted. Data Fig. 3 and Supplementary Tableof targets. 6). WeThis first is examined synonymous mutations and mutations in diagnoses. There is a wide confidence interval for the missense ascer- – Mode of inheritanceunaffected children. Among the 647 synonymous events in probands, not statisticallyNumber of significant vulnerable genes over the joint 403 region. However,there are 25 gene over targets found in more than one child, close to the null tainment differential (Supplementary Table 6); for this reason, we con- the entireOur data analysis set, of functional the– dropNumber of individuals sequenced clustering in IQ and is overlaps 5 points within for target males classes withexpectation DN of LGD 19.9 (P 5 0.13). Recurrent LGD (n 5 3 out of 179 events) sider primarily the LGD events for our analysis and look on missense mutationsuggests compared that the mutations to those ascertained without in probands mutation target( restrictedP 5 0.01;or Fig.missense 2). targets The (70 out of 1,143 events) in unaffected siblings are data as supporting. mean IQTable of 1affected| Enrichment males of DN mutations with recurrent in six gene classes DN LGDs drops 20 points – Type of sequence data rDN LGD (ASD) DN LGD (ASD) DN miss (ASD) DN LGD (sib) DN miss (sib) To identify gene targets for DN mutation, we examined all family (P 5 0.00001, Fig. 2). Significance is also evident as we examine targets data including trios. We provide a complete list of all mutations (Sup- No. of genes • ExomeOverlap (27) Overlap (353) Overlap (1,513) Overlap (176) Overlap (1,066) by functionalGene class class. Males withObs LGD Exp P mutationsObs Exp in FMRPP targetsObs Exp haveP Obs Exp P Obs Exp P plementary Table 2) along with the number of mutations of each type an averageFMRP 14-point842 drop (P 5 140.001). 2.6 4 3 This1028 55 trend 34.1 continues 4 3 1024 171 with 144.8 LGD 0.03 14 17.0 0.52 117 102.9 0.15 Chromatin 428• 6Whole 0.9 2 3 genome1024 26 11.8 3 3 1024 57 50.0 0.31 5 5.9 1.00 37 35.6 0.80 in each gene (Supplementary Table 7). A total of 391 DN LGD muta- targets inEmbryonic the other functional 1,912 6 classes—chromatin 3.4 0.15 65 45.0 modifiers 2 3 1023 220 and embry- 191.4 0.03 20 22.5 0.65 142 136.0 0.58 tions in 353 target genes were identified and validated in autism pro- PSD 1,445 4 2.5 0.31 34 32.5 0.78 159 138.1 0.07 22 16.2 0.15 113 98.1 0.12 onicallyEssential expressed genes—but1,750 7 with 3.2 reduced 0.04 50 significance. 42.4 0.22 We 201 observe 180.3 0.10 20 21.2 0.91 127 128.1 0.96 bands. Of these, 27 target genes were recurrent (Fig. 2). Among 1,500 Mendelian 256 0 0.6 1.00 3 8.0 0.07 31 34.0 0.66 5 4.0 0.61 20 24.1 0.47 little signalDN LGD from (Scz) DN 93missense20.30.0393.70.011615.70.9021.80.718 mutation, even in recurrent targets, either 11.20.45 missense targets in probands, 145 were recurrent. becauseDN these LGD (ID) events 30 are less likely30.11 to contribute3 1024 8 1.2 or because 3 3 1025 they10 are 4.9 less 0.04 0 0.6 1.00 5 3.5 0.41 We tested eight classes (Methods) for enrichment against five lists of targets of DN mutations. These include genes with (1) recurrent DN LGD mutations in probands (rDN LGD (ASD)); (2) DN LGD mutations in We examined all alleles transmitted opposite a DN LGD target. We severe. Femaleprobands (DN LGD probands (ASD)); (3) DN missense show mutations the in probands same (DN miss trends (ASD)); (4) as DN LGDs males, in siblings (DNbut LGD as(sib)); they and (5) DN missense mutations in siblings (DN miss (sib)). Observed (obs) and expected (exp) numbers are shown with P values obtained from two-sided binomial tests. Expected numbers and P values are based on a length model in which DN mutations occur randomly in all genes, saw no instance in 391 observations in which the allele opposite an compriseproportional a smaller to length. population, the significance is weak (Fig. 2). , LGD target carried a rare transmitted LGD variant (in 1% of parental Further218 evidence| NATURE | VOLfor 515a distinguishable | 13 NOVEMBER 2014 signature among the higher exomes), and only four in which such an allele carried a rare missense IQ comes from the functional enrichment©2014 Macmillan within Publishers DN target Limited. geneAll rights sets. reserved variant. Thus, the DN mutations do not generally cause homozygous LGD targets in females are enriched for all three functional gene classes. loss-of-function of their target (Supplementary Table 8). LGD targets in lower IQ affected males are significantly enriched for Confirming previous results7,8,15, observed DN mutations arise three the FMRP-associated and chromatin-modifier gene classes (Supplemen- times as often in the paternal background, and mutation rates rise with tary Table 6). However, for LGD targets in higher IQ males we see no age of either parent (Extended Data Fig. 4 and Methods). The latter may statistically significant enrichment for any of the gene categories. provide a partial explanation for increased autism rates in children born 53 of older parents. Target overlaps in children and mutation-type groups We partitioned children into four primary groups: unaffected siblings, Functional clustering in target genes affected females, affected males with higher IQ, and affected males Previous studies presented evidence of functional clustering in targets with lower IQ. We analysed these and various combinations for three of DN LGD mutation in affected individuals6–8,16. Our larger data set types of DN mutations: LGDs, missense and synonymous (Supplemen- was examined with an improved null ‘length model’ for mutation in tary Table 6). Targets of synonymous mutations in all children and which the probability of DN mutation in a gene is proportional to its targets of LGD and missense mutations in unaffected siblings have no length (Methods and Extended Data Fig. 5). We tested for enrichment significant overlap with targets from any other group. We see no signi- within DN LGD and missense targets in probands and siblings for the ficant overlap between targets in higher IQ males with targets from other following six classes: (1) FMRP target genes, with transcripts bound by groups. In strong contrast, the 67 LGD targets from affected females the fragile X mental retardation protein8,17;(2)genesencodingchromatin overlap significantly with the 166 LGD targets from lower IQ affected modifiers; (3) genes expressed preferentially in embryos18,19; (4) genes males (10 observed, 1.3 expected, P 5 7 3 1027). We therefore refer to encoding postsynaptic density proteins20; (5) essential genes21; and (6) the group of lower IQ males and affected females as a ‘joint’ class. In this genes identified as Mendelian disease genes22 (Table 1, Supplementary class, the 874 missense and 223 LGD targets also overlap significantly (39 Table 6 and Methods). These data provide the strongest evidence yet for observed, 22.1 expected, P 5 0.0008). Thus, not only do missense and

13 NOVEMBER 2014 | VOL 515 | NATURE | 217 ©2014 Macmillan Publishers Limited. All rights reserved Pathogenic Variant Identification Using How Many Variants will be Identified on Data from a Trio or a Single individual Average for an Exome of Single Individual? • Assume complete penetrance – Removing those variants with >0.5% which are in databases • ExAC

• Limiting analysis to protein coding mutations – Missense – Nonsense – Splice sites Is it possible to identify the causal gene using exome or genome sequencing?

Recessive phenotypes Autosomal Dominant Mode of Inheritance Rare compound heterozygous and homozygous variants

§ 0-5 compound heterozygous SNVs § Data from parents must be available § To determine if variants are compound heterozygous § 1-2 homozygous SNVs ~200-300 variants § A much larger number of homozygous sites will be observed when the child is an offspring of a consanguineous mating

De novo mutations Mode of Inheritance

• If Mode of Inheritance is unknown can try more than one model • Filtering a single individual often leads to many variants that can reasonably be followed-up by – Testing for segregation • If family members are available – Functional studies • 0-3* coding non-synonymous mutation per individual NGS data from additional family members can be Exome data must be available from parents helpful in narrowing down the number of variants

*More may be observed due to false positive variant calls

54 Selection of Additional Family Members to Selection of Additional Family Members to Reduce the Number of Variants Reduce the Number of Variants • Avoid performing NGS on unaffected family • Do not sequence parents of affected individuals members – Except for the study – Variant frequencies can be obtained from • de novo events – databases Offspring will always inherit one parental allele • ExAC • More distantly related family members are the most informative – Unaffected individuals can also be pathogenic – variant carriers due to reduce penetrance e.g. cousins • Can lead to exclusion of the causal variant

Selection of Additional Family Members to Maximum LOD scores - Autosomal Reduce the Number of Variants Dominant Pedigree • Who to select can be guided by basic linkage • If two affected Individuals from an a pedigree are principals sequenced what are the maximum LOD scores – Those sets of individuals providing the highest “LOD” which can be obtained? scores should be selected Autosomal Dominant Pedigrees Maximum LOD scores Siblings 1st Cousins 2nd Cousins Avuncular Parent-Child 0.00 Siblings 0.176 Avuncular 0.301 1st Cousins 0.602 2nd Cousins 1.201

Reducing the Number of Variants For Selecting Individuals for NGS Follow-up • Which individuals should be selected can be • Sequence multiple unrelated individuals with evaluated by simulations studies the same phenotype – SLINK/MSIM – Look for rare variants which are predicted to be – Calculating maximum LOD score functional that occur within the same gene • If genotype array data is available • Due to allelic heterogeneity not all affected individuals will share the same variant – GIGI- Pick (Chueng et al 2014 AJHG) • https://faculty.washington.edu/wijsman/progdists/gigi/software/ GIGI-Pick/GIGI-Pick.html – ExomePick • http://genome.sph.umich.edu/wiki/ExomePicks

55 Reducing the Number of Variants For Selection of Individuals for NGS Follow-up de Novo • If there is locus heterogeneity • If it is of interest to detect de Novo variants – There may be no single gene for which all affected – Child and both parents should be sequenced individuals have a pathogenic variant • For extreme locus heterogeneity – None of the individuals may share pathogenic variants • Would not expect to find de Novo variants in within the same gene families with more than one affected individual • Particularly if the sample size is small – Can occur if de novo variant occurs in a founder that is – Therefore not possible to narrow down results to a passed to offspring single gene

de Novo Events • A single validated LGD de novo event is not sufficient to implement a gene in disease etiology • Multiple LGD de novo events must be observed within What are the Success Rates of NGS a gene region – The number which must be observed to be significant is Studies? What are the Success Rates of NGS Studies • Dependent on the sample size • The mutation rate within the gene region • Significance can be evaluated – By comparing the de novo variant rate in controls • e.g. unaffected siblings of probands Data Dependent – Iossitov et al. 2014 Nature – Estimating the gene specific mutation rates • Neale et al. 2012 Nature

NIH Public Access Author Manuscript N Engl J Med. Author manuscript; available in PMC 2014 October 28. European Journal of Human Genetics (2012) 20, 490–497 NIH-PA Author ManuscriptPublished NIH-PA Author Manuscript in final edited NIH-PA Author Manuscript form as: & 2012 Macmillan Publishers Limited All rights reserved 1018-4813/12 N Engl J Med. 2013 October 17; 369(16): 1502–1511. doi:10.1056/NEJMoa1306555. www.nature.com/ejhg

REVIEW Clinical Whole-Exome Sequencing for the Diagnosis of Disease gene identification strategies for exome Mendelian Disorders sequencing Yaping Yang, Ph.D, Donna M. Muzny, M.Sc, Jeffrey G. Reid, Ph.D, Matthew N. Bainbridge, Christian Gilissen*,1, Alexander Hoischen1, Han G Brunner1 and Joris A Veltman1 Ph.D, Alecia Willis, Ph.D, Patricia A. Ward, M.S, Alicia Braxton, M.S, Joke Beuten, Ph.D, Fan Xia, Ph.D, Zhiyv Niu, Ph.D, Matthew Hardison, Ph.D, Richard Person, Ph.D, Mir Reza Next generation sequencing can be used to search for Mendelian disease genes in an unbiased manner by sequencing the entire protein-coding sequence, known as the exome, or even the entire human genome. Identifying the pathogenic mutation amongst Bekheirnia, M.D, Magalie S. Leduc, Ph.D, Amelia Kirby, M.D, Peter Pham, M.Sc, Jennifer thousands to millions of genomic variants is a major challenge, and novel variant prioritization strategies are required. The Scull, Ph.D, Min Wang, Ph.D, Yan Ding, M.D, Sharon E. Plon, M.D., Ph.D, James R. Lupski, choice of these strategies depends on the availability of well-phenotyped patients and family members, the mode of inheritance, the severity of the disease and its population frequency. In this review, we discuss the current strategies for Mendelian disease M.D., Ph.D, Arthur L. Beaudet, M.D, Richard A. Gibbs, Ph.D, and Christine M. Eng, M.D gene identification by exome resequencing. We conclude that exome strategies are successful and identify new Mendelian Departments of Molecular and Human Genetics (Y.Y., A.W., P.A.W., A.B., J.B., F.X., Z.N., M.H., disease genes in approximately 60% of the projects. Improvements in bioinformatics as well as in sequencing technology will likely increase the success rate even further. Exome sequencing is likely to become the most commonly used tool for Mendelian R.P., M.R.B., M.S.L., A.K., J.S., S.E.P., J.R.L., A.L.B., C.M.E.) and Pediatrics (S.E.P., J.R.L.) and disease gene identification for the coming years. the Human Genome Sequencing Center (D.M.M., J.G.R., M.N.B., P.P., M.W., Y.D., J.R.L., European Journal of Human Genetics (2012) 20, 490–497; doi:10.1038/ejhg.2011.258; published online 18 January 2012

R.A.G.), Baylor College of Medicine, Houston. Keywords: Mendelian disease; gene identification; strategies; next generation sequencing; exome sequencing

Abstract • INTRODUCTION24 families of which 14 lead to a novel gene identificationgenes in a genomic region.8 This last approach has been most The number of rare monogenic diseases is estimated to be 45000 and successful as it does not rely on prior biological or medical knowledge BACKGROUND—Whole-exome sequencing is a diagnostic approach for the identification of for half of these the underlying genes are unknown.1 In addition, an and can be applied in an unbiased fashion. The most important • 58% success rate 95% CI 36%-78% 9 10 molecular defects in patients with suspected genetic disorders. increasing proportion of common diseases, such as intellectual dis- genetic mapping approaches rely on karyotyping, linkage analysis, ability, schizophrenia, and autism, previously thought to be due to homozygosity mapping,11 copy number variation analysis,12 and SNP- In a clinical setting • complexThree families segregated known disease genes multifactorial inheritance, are now thought to represent a based association analysis.13 One problem associated with using METHODS—We developed technical, bioinformatic, interpretive, and validation pipelines for heterogeneous collection of rare monogenic disorders,2–5 the large genetic mapping approaches is that it is difficult, if not impossible, whole-exome sequencing in a certified clinical laboratory to identify sequence variants underlying majority• ofOverall success rate of 71% 95 CI 51% which is still unknown. The identification of genes to predict whether a disease is- caused85% by a single nucleotide mutation ~25% of Mendelian disoders solved responsible for these diseases enables molecular diagnosis of patients, or by structural genomic variation. Without family information it is disease phenotypes in patients. as well as testing gene carriers and prenatal testing. Gene identification also often difficult to predict whether a disease is dominantly or represents the first step to a better understanding of the physiological recessively inherited. Therefore, different mapping approaches often RESULTS—We present data on the first 250 probands for whom referring physicians ordered role of the underlying protein and disease pathways, which in turn need to be applied in a sequential order before a disease locus is 6 whole-exome sequencing. Patients presented with a range of phenotypes suggesting potential serves as a starting point for developing therapeutic interventions. identified. In addition, these mapping approaches commonly do not Recent advances in next generation sequencing technologies have reduce the number of candidate genes sufficiently for follow-up by genetic causes. Approximately 80% were children with neurologic pheno-types. Insurance dramatically changed the process of disease gene identification, in Sanger sequencing, when the disease locus remains very large. This is coverage was similar to that for established genetic tests. We identified 86 mutated alleles that particular by using exome sequencing in which the protein-coding especially the case if these mapping approaches are applied to only a part of the genome of a patient can be studied in a single experiment single patient or family with a limited number of informative relatives. were highly likely to be causative in 62 of the 250 patients, achieving a 25% molecular diagnostic (see Majewski et al7 for an overview of exome sequencing technology Combining data from multiple unrelated but phenotypically similar rate (95% confidence interval, 20 to 31). Among the 62 patients, 33 had autosomal dominant and its applications). As tens of thousands of genomic variants can be patients or families is useful to reduce this to a manageable number, identified in each exome, it is important to carefully consider strategies but carries a risk that patients with similar phenotypes are affected by disease, 16 had auto-somal recessive disease, and 9 had X-linked disease. A total of 4 probands for efficiently and robustly prioritizing pathogenic variants. In order mutations in different genes. Alternatively, a candidate gene approach received two nonoverlapping molecular diagnoses, which potentially challenged the clinical to do so, much can be learned from traditional disease gene identi- can be used to select the best candidate genes from the large disease fication approaches, but also novel strategies need to be established. locus for Sanger sequencing. Many bioinformatics tools have been 56 diagnosis that had been made on the basis of history and physical examination. A total of 83% of developed to prioritize candidate disease genes from disease gene 14–16 the autosomal dominant mutant alleles and 40% of the X-linked mutant alleles occurred de novo. TRADITIONAL DISEASE GENE IDENTIFICATION loci. Candidate gene selection is however critically dependent on Past identification of Mendelian disease genes was carried out by prior knowledge and only few disease genes have been identified by Recurrent clinical phenotypes occurred in patients with mutations that were highly likely to be Sanger sequencing of candidate genes. Candidate genes can be selected specifically using these bioinformatics tools.17 A particular category of causative in the same genes and in different genes responsible for genetically heterogeneous because they resemble genes associated with similar diseases, because diseases that has remained largely unresolved is that of the rare genetic the predicted protein function seems relevant to the physiology of the disorders that occur sporadically, and for which neither a family-based disorders. disease, or because a positional mapping approach pointed to these approach nor an association-based approach can be used. These

1Department of Human Genetics, Nijmegen Centre for Molecular Life Sciences and Institute for Genetic and Metabolic Disorders, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands *Correspondence: C Gilissen, Department of Human Genetics, Nijmegen Centre for Molecular Life Sciences and Institute for Genetic and Metabolic Disorders, Radboud University Nijmegen Medical Centre, 6500 HB Nijmegen, The Netherlands. Tel: +31243668940; E-mail: [email protected] Received 24 August 2011; revised 31 October 2011; accepted 7 December 2011; published online 18 January 2012 Copyright © 2013 Massachusetts Medical Society. Address reprint requests to Dr. Eng at the Department of Molecular and Human Genetics, NAB 2015, Baylor College of Medicine, Houston, TX 77030, or at [email protected].. Disclosure forms provided by the authors are available with the full text of this article at NEJM.org. Benefits of Obtaining Genotype Array Data • If multiple family members are available Using Genotype Array Data, Linkage – Advantageous to perform SNP genotyping using one Analysis and Homozygosity of the current microarrays Mapping to Increase Success of – All informative individuals should be genotyped • Can also aid in accessing the quality of DNA Gene/Pathogenic Variant samples Identification – Help to ensure NGS data will successfully be generated

Benefits of Obtain Genotype Array Data Benefits of Linkage Analysis • Can be used to validate the pedigree structure • Can identify problems with pedigrees • Help to ensure that samples have not been swapped – Incorrect phenotype information • A variety of programs have been developed to provide probabilities on • Affected individuals labeled and unaffected relationships within pedigrees • Unaffected individuals labeled as affected – GRR • Abecasis et al. 2001 Bioinformatics • Collaborators and Families can be re-contacted – RELATIVE • Goring and Ott, 1997 Eur J Hum Genet – To correct errors – SIBPAIR • Ehm and Wagner 1998 AJHG • Errors which can not be resolved – RELCHECK – Should be removed from analysis • Broman and Weber 1998 AJHG – RELPAIR • Boehnke and Cox 1997 AJHG • Pedigree data can also be reconstructed from genotype data ++ D+ ++ D+ – PRIMUS ++ D+ • Staples et al. 2014 AJHG Sample Swaps Phenocopy Reduced Penetrance

Benefits of Linkage Analysis Inter-sibship Locus Heterogeneity

European Journal of Human Genetics (2015) 23, 1207–1215 & 2015 Macmillan Publishers Limited All rights reserved 1018-4813/15 www.nature.com/ejhg

ARTICLE Family 8 Family 9 Challenges and solutions for gene identification in the presence of familial locus heterogeneity

Atteeq U Rehman1,12, Regie Lyn P Santos-Cortez2,12, Meghan C Drummond1, Mohsin Shahzad3,4, Kwanghyuk Lee2, Robert J Morell1, Muhammad Ansar2,5, Abid Jan5, Xin Wang2, Abdul Aziz5, Saima Riazuddin3,4, Joshua D Smith6, Gao T Wang2, Zubair M Ahmed4, Khitab Gul3, A Eliot Shearer7, Richard J H Smith7, Jay Shendure6, Michael J Bamshad6, Deborah A Nickerson6, University of Washington Center for Mendelian Genomics13, John Hinnant8,14, Shaheen N Khan3, Rachel A Fisher9, Wasim Ahmad5, Karen H Friderici9,10, Sheikh Riazuddin3,11, Thomas B Friedman1, Ellen S Wilch10 and Suzanne M Leal*,2

Next-generation sequencing (NGS) of exomes and genomes has accelerated the identification of genes involved in Mendelian phenotypes. However, many NGS studies fall short of identifying causal variants, with estimates for success rates as low as 25% for uncovering the pathological variant underlying disease etiology. An important reason for such failures is familial locus heterogeneity, where within a single pedigree causal variants in two or more genes underlie Mendelian trait etiology. As examples § Can ofbe used to detect locus heterogeneity within pedigrees intra- and inter-sibship familial locus heterogeneity, we present 10 consanguineous Pakistani families segregating hearing impairment due to homozygous variants in two different hearing impairment genes and a European-American pedigree in which § hearing impairment is caused by four variants in three different genes. We have identified 41 additional pedigrees with Linkage syndromicanalysis/ and nonsyndromic hearinghomozygous mapping can resolve which impairment for which a single previously reported hearing impairment gene has been WES identified but only segregates with the phenotype in a subset of affected pedigree members. We estimate that locus branchesheterogeneity/individuals occurs in 15.3% (95% conare fidencesegregating the same pathogenic variant interval: 11.9%, 19.9%) of the families in our collection. We demonstrate novel approaches to apply linkage analysis and homozygosity mapping (for autosomal recessive consanguineous pedigrees), § Aid in selection of individuals for NGwhich can be used to detect locus heterogeneity using either NGS or SNPS array data. Results from linkage analysis and MLOD LOD Region MLOD LOD Region homozygosity mapping can also be used to group sibships or individuals most likely to be segregating the same causal variants and thereby increase the success rate of gene identification. Family 8 5.73 1.48 7q21.11-q21.3 (HGF) Family 9 6.27 1.65 4q21.21 European Journal of Human Genetics (2015) 23, 1207–1215; doi:10.1038/ejhg.2014.266; published online 10 December 2014 Branch 1 3.59 3.59 7q21.11-q22.2 (HGF) Branch 1 2.53 1.60 7q22.3-q31.1 (SLC26A4) INTRODUCTION and unaffected family members and perform NGS. If multiple Identification of genetic variants that cause Mendelian disorders, family members have undergone NGS, filtering is then performed Branch 2 2.53 2.53 9q21.12-q21.13 (TMC1) Branch 2 2.53 2.41 15q24.1-q26.1 (CIB2) which segregate in large families, has been facilitated through linkage based upon variant sharing in affected family members and lack of analysis coupled with Sanger sequencing.1–3 In the past few years, sharing in unaffected family members. Additional filtering is next-generation sequencing (NGS) has supplanted this experimental performed using variant databases such as Exome Variant Server strategy for variant detection for monogenic traits,4 and many articles or 1000 Genomes retaining low-frequency variants (eg, o0.1%) report successful gene identification.5–7 Failures are rarely published. that are predicted to be deleterious by bioinformatics tools. The Estimates of the success rate for gene identification of Mendelian traits selected variants are then tested for co-segregation with the using NGS in clinical settings is as low as 25%.8 However, for studies phenotype in the entire family. However, this approach is based using large pedigrees with Mendelian segregation, the gene identifica- on the assumptions that clinical information is reliable, the disease tion success rate can be much higher, for example, in a study of 24 is fully penetrant, no phenocopies exist and there is locus homo- families with multiple affected individuals the success rate was 60% geneity. If these conditions do not hold, identification of the causal (95% confidence interval (CI): 36%, 78%).9 variant can be problematic, because segregation with disease will A frequent strategy used for Mendelian trait gene identification is not be observed. In our cohort of pedigrees segregating hearing to select DNA samples from one or more affected or both affected impairment (HI), initially unrecognized locus heterogeneity was

1Laboratory of Molecular Genetics, National Institute on Deafness and Other Communication Disorders, National Institutes of Health, Bethesda, MD, USA; 2Center for Statistical 57 Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA; 3National Center of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan; 4Department of Otorhinolaryngology Head and Neck Surgery, School of Medicine, University of Maryland, Baltimore, MD, USA; 5Department of Biochemistry, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan; 6Department of Genome Sciences, University of Washington, Seattle, WA, USA; 7Molecular Otolaryngology and Renal Research Labs, Department of Otolaryngology—Head and Neck Surgery, University of Iowa, Iowa City, IA, USA; 8Department of Religious Studies, Michigan State University, East Lansing, MI, USA; 9Department of Pediatrics and Human Development, Michigan State University, East Lansing, MI, USA; 10Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, USA; 11Allama Iqbal Medical College-Jinnah Hospital Complex, University of Health Sciences, Lahore, Pakistan *Correspondence: Professor SM Leal, Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza 700D, Houston, TX 77030, USA. Tel: +1 713 798 4011; Fax: +1 713 798 4012; E-mail: [email protected] 12These authors contributed equally to this work. 13See appendix. 14Present address: Jalan Mawar, Kalibukbuk, Lovina, Bali, Indonesia 81119. Received 8 August 2014; revised 25 October 2014; accepted 31 October 2014; published online 10 December 2014 Intra-Familial Locus Heterogeneity is not Rare Benefits of Linkage Analysis

• 15.3% of the families in a study of nonsydromic • Unlike filtering approaches linkage analysis can hearing impairment incorporate reduced penetrance and – 95% Confidence Interval (11.9 - 19.9%) phenocopies in the analysis – These families segregate at least one published HI gene – Allowing for success identification of a gene region Families with Families Classification based on variants – Even for pedigrees where there is phenocopies locus without locus Total identified and tests performed heterogeneity heterogeneity and/or reduced penetrance Variant identified via screening GJB2 19 98 117 (exon 2), CIB2 (p.Phe91Ser) or HGF (del 3) Variants identified in other known HI genes Linkage analysis + Sanger sequencing 8 87 95 Linkage analysis + NGS 18 64 82 Total 45 249 294

Benefits of Linkage analysis Benefits of Linkage analysis • Linkage analysis/homozygosity mapping can • Information on haplotypes can be used to select identify a small genomic region where the pedigree member(s) for NGS causal variant lies • Selecting those individuals with the smallest – Filtering can be applied within the possible haplotype linkage/homozygous region • Individuals which are phenocopies can be • Greatly reduce the number of variants which need to excluded from selection for NGS be followed-up – Test identified variant(s) for segregation within pedigree • This is particularly true for whole genome data where – Even a small genomic regions can contain hundreds of rare variants

Benefits of Linkage Analysis Benefits of Linkage Analysis • Examining haplotypes can also give clues if two or • If multiple families linked to same locus are more families are segregating the same disease available gene or variant – Sequencing individual(s) from more than one family – Overlapping haplotype which are not the same can aid in gene identification • Potentially disease phenotype due to the same disease gene – When they share variants in the same gene • But unlikely due to the same pathogenic variant – Disease haplotype is identical – although not of the • Provide additional evidence of genes same length involvement in disease etiology • Likely the two families are segregating the same causal – Compared to a single family variant

58 Selection of Family Members for NGS Selection of Family Members for NGS Autosomal Dominant Pedigrees Autosomal Recessive Pedigrees • Sequence >2 individuals from each family • A single individual can be selected – Two individuals are often sufficient – With the smallest homozygous region – Distantly related as possible – With overlapping haplotypes which span the smallest • i.e. from two different branches region – Choose those affected individuals within the pedigree which • If compound heterozygous segregate the same haplotypes • Helps to exclude individuals who are potentially phenocopies • Sequencing additional affected family members – i.e. have the phenotype due to different causal variants may aid in gene identification – Select >2 individuals with smallest overlapping haplotypes – Can greatly increase cost • Reduces the size of the interval in which the pathogenic variant lies – Usually not necessary

Selection of Family Members for NGS Prioritize Families for Study Using NGS Autosomal Recessive Pedigrees • For compound heterozygous individuals • Prioritize families with multiple affected individuals – Variants identified within a gene region – 1.) Significant linkage LOD >3.3 • Can be sequenced in parents, e.g. Sanger – 2.) Suggestive linkage 3.3 1.2 » Or lay on the same haplotype – 4.) Small families with only 1-2 affected individuals LOD • Parents can also undergo NGS <1.2 – Currently not as cost effective • Single affected individuals can also be studied – 5.) Trios – Highest priority if looking for de Novo events – 6.) Single affected individuals with family history

Data Quality Control Data Quality Control

• Extremely important when testing for association for • Proceed with caution it is possible to remove true complex traits variant sites including causal variants when • Also important for Mendelian traits filtering data – Many false positive variant sites if data is not cleaned – May wish to loosen stringency of filtering/cleaning if • Data cleaning for exome sequence data is data unable to identify causal variant specific • Working with “dirty data” can lead to many false – e.g. remove variant sites that positive variant sites/genotype • Fail Variant Quality Score Recalibration (VQSR) • Fail HWE p<5 X 10-8 – e.g. remove variants with • a read depth of <10x • GQ score <20

59 Software to Perform Variant Annotation Integrative Genomic Viewer (IGV) and Filtering • Can be used to investigate variant calls • FamAnn – To determine if a variant is a false positive call – Yao et al. 2014 Bioinformatics – Before following-up of variants – https://sites.google.com/site/famannotation/docu • e.g. testing for segregation mentation

• Gemini

1" – Paili et al. 2013 PLoS Comput Biol 4"

2" 3" – https://gemini.readthedocs.org/en/latest/ 5" 6"

• Jannovar

– Jaeger et al. 2014 Hum Mutation

7"

– http://jannovar.readthedocs.org/en/master/install.

Robinson et al Nat Biotechnology 2011 html

http://www.broadinstitute.org/igv/

[1] The upper panel with vertical gray bars shows depth of coverage per variant site. Pointing the mouse arrow on each bar will open a pop-up box that shows the actual depth of coverage (number of reads) at the specific site, and the number and percentage of reads with a specific allele (see screenshot on page 3). The black bar below the vertical gray bars indicates sites where reads have been downsampled. [2] An insertion was detected in a single read, as indicated by purple ‘ƚ’. When the mouse arrow is pointed on the insertion, a pop-up box shows the inserted bases. In this example, a single base A was inserted in the specific read. [3] When the arrow is pointed at a specific read, a pop-up box appears with details on the specific read, e.g. read name, alignment start, mapping quality, base phred quality, insert size, etc. [4] At high resolution the specific base or variant site that falls at the center of the IGV window is marked by two vertical dashed lines. When a variant site has an allele that is different from reference in more than 25% of reads, the alternate allele is indicated in a different color. [5] A deletion is indicated by a black horizontal bar within gaps in a read. [6] An insert size that is smaller than expected is represented as a blue bar. [7] An insert size that is bigger than expected is represented as a red bar.

2" Software to Perform Variant Annotation " Data Analysis Using Filtering and Filtering An Additional Note • VAAST • If multiple samples are analyzed – Hu et al. 2013 Genet Epidemiol • Multisample calling should be used to identify – http://www.hufflab.org/software/vaast/ variants • Variant Mendelian Tools • A multisample VCF file should be analyzed – http://varianttools.sourceforge.net/VMT/VMT • If variant is calling is performed on single • VARank samples – Geoffroy et al. 2015 PeerJ – No information on read depth, etc for variant sites – http://www.lbgi.fr/VaRank/#requirements where there is no alternative allele

Steps After Variant Identification Expression and Functional Studies

• Additional families with same phenotype and • Can aid in implicating a variant/gene in disease putatively pathogenic variants in the same gene etiology – Help support involvement of the gene in disease etiology – Particularly important if the variant/gene is found in • Form collaborations to identify additional families a single family • Matchmaker Exchange • Identified variant may be in LD with functional mutation – http://www.matchmakerexchange.org/ • Brings about a better understanding of disease etiology and the role the identified gene plays – Can help to identify investigators who have families with the same phenotype and variants within the same gene

60 Steps When NGS Does not Reveal Reasons for Failure of NGS Putatively Causal Variant • Insufficient samples for gene identification • When linkage region is known – e.g. single individual with no additional family members – Examine the region to determine which genes have not • Locus heterogeneity • been captured • Phenocopies, misdiagnosed or mislabeled individuals • missing data due to poor read depth coverage or within a pedigree • Variants have not been called • Sample swaps – Examine regions with IGV • Variant not captured • Follow-up with Sanger Sequencing if – Missing regions – Can be potentially be resolved by whole genome sequencing – Poor quality variants • Inadequate depth of coverage to call variant • If exome sequencing was perform proceed to whole • Indel/Copy number variants genome sequencing – Difficult to accurately call • Sensitivity can be low – The causal variant could lie outside of the coding region

NIH Public Access Author Manuscript Nat Genet. Author manuscript; available in PMC 2014 May 26.

NIH-PA Author ManuscriptPublished NIH-PA Author Manuscript in final edited NIH-PA Author Manuscript form as: Nat Genet. ; 44(8): 916–921. doi:10.1038/ng.2348. An Example of Using Linkage TGFB2 loss of function mutations cause familial thoracic aortic aneurysms and acute aortic dissections associated with mild Analysis and NGS to Identify systemic features of the Marfan syndrome Pathogenic Variants Catherine Boileau1,2,3,4,14,15, Dong-Chuan Guo5,14, Nadine Hanna1,2,3, Ellen S. Regalado5, Delphine Detaint1,2,6, Limin Gong5, Mathilde Varret1, Siddharth Prakash5,12, Alexander H. Li5, Hyacintha d’Indy1,3, Alan C. Braverman7, Bernard Grandchamp2,8, Callie S. Kwartler5, Laurent Gouya2,3,4, Regie Lyn P. Santos-Cortez9, Marianne Abifadel1, Suzanne M. Leal9, Christine Muti2, Jay Shendure10, Marie-Sylvie Gross1, Mark J. Rieder10, Alec Vahanian6,8, Deborah A. Nickerson10, Jean Baptiste Michel1, National Heart Lung and Blood Institute (NHLBI) Go Exome Sequencing Project11, Guillaume Jondeau1,2,6,8,14, and Dianna M. Milewicz5,12,13,14,15 1Institut National de la Santé et de la Recherche Médicale (INSERM) U698, Hôpital Bichat, Paris France 2Assistance Publique-Hôpitaux de Paris (AP-HP), Hôpital Bichat, Centre National de Référence pour les syndromes de Marfan et apparentés, Service de Cardiologie, Paris, France 3AP-HP, Ambroise Paré, Service de Biochimie d’Hormonologie et de Génétique moléculaire, Boulogne, France 4Université Versailles SQY, Guyancourt, France 5Department of Internal Medicine, University of Texas Health Science Center at Houston, Houston, Texas, USA 6APHP, service de Cardiologie, Hopital Bichat, Paris, France 7Department of Internal Medicine, Washington University School of Medicine, St. Louis, Missouri, USA 8UFR de Médecine, Université Paris 7, Paris, France

Correspondence should be addressed to D.M.M. ([email protected]) or C.B. ([email protected]). Analysis of Two TAA Pedigrees with Mild 11 Exome Sequencing Project Banner available in the supplementary materials Clinical Features of Two TAA Families with TGFB2 Variants 14These authors contributed equally to this work. 15These authors jointly directed this work. Systemic Features of Author contributions Marfan Syndrome C.B., D.M.M., and G.J. planned the project, designed the experiments, oversaw all aspects of the research and wrote the manuscript. Pedigree ID C.B., S.L.L. R.L.R.-C., M.V., D.G., M.G., and M.A. performed linkage analyses and analyzed exome sequencing data. M.J.R., J.S., and D.A.N. did the exome sequencing in TAA288. D.G., A.H.L. and N.H. analyzed exome sequencing data and performed Sanger Phenotypic Information TAA288 MS239 • sequencing analysis with H.d’I. L.G., D.G. and C.S.K. performed the TGFB2 transcript and protein assays. J.B.M. performed Whole genome linkage analysis performedhistological and imunostaining analyses with MS.G. S.P. did the genomic deletion analysis. E.S.R., B.G., C.M., A.C.B. and L.G. Number affected pedigree members 7 TAA 6 TAA provided clinical and pedigree data. D.D., D.M.M., and G.J. provided clinical data and performed the sample collection. A.V. revised –the manuscript.Pedigree TAA288 Age at diagnosis (years) 5 – 41 (median 32) 27 – 53 (median 36) Competing financial interests The authors declare no competing financial interests. Surgical Intervention 1 TAA 1 TAA, 1 MVP* ACCESSION• Affymetrix Number 50k SNP Array Arterial tortuosity No Yes The TGFB2 mRNA– Samples reference sequence from is available 9 informative at NCBI (NM_003238.3). pedigree members genotyped Other cardiac disease 2 MVP 1 MVP* – Pedigree MS239 Lens dislocation No 1/6 minor Flat cornea Unknown 2/6 • 1,056 microsatellite markers (deCode array) Pectus deformity 3/7 mild 2/6 definite, 1/6 mild – Samples from 14 informative pedigree members genotyped Scoliosis 2/7 definite, 1/7 mild 1/6 mild Flat feet 5/7 6/7 Joint hyperflexibility 5/7 3/6 High-arched palate 6/7 3/6 Striae atrophicae 4/7 4/6

*Mitral valve prolapse

61 Analysis of Two TAA Pedigrees with Mild Pedigree TAA288 Systemic Features of Marfan Syndrome • Multipoint and two-point analysis performed – Autosomal dominant mode of inheritance • 90% Penetrance • Disease allele frequency 0.0001 • Computer Software – Pedcheck – Merlin, Superlink & SimWalk2 • Both pedigrees mapped to 1q41 – TAA288 • Multipoint LOD score 2.4 – MS239 • Multipoint LOD score 1.6 *Individuals underwent whole genome genotyping Circled individuals underwent exome sequencing Multipoint LOD score 2.4 at 1q41

Pedigree MS239 Exome Sequencing

• Family TAA288 – Two affected individuals selected for exome sequencing • 16 variants shared by affected pedigree members • Family MA239 – Three affected and one unaffected individuals selected for exome sequencing • 5 variants shared by affected pedigree members • In family TAA288 two variants within linkage region and in family MA239 only one variant • Both families only had TGFB2 variants in common *Individuals genotyped for whole genome scan Circled individuals underwent exome sequencing Multipoint LOD score 1.6 at 1q41

Identification of TGFB2 variants Additional Screening of TGFB2 • French probands from a Marfan referral clinic • Family TAA288 – 62 familial cases – 5-bp deletion c.1021_1025del-TACAA in exon 6 which – 74 sporadic cases leads to a premature stop codon p.Try341Cysfs*25 • USA probands with thoracic aortic disease • Two-point LOD score 3.3 – 214 familial cases • Family MS239 – 57 sporadic cases – Stop-gain variant in exon 4 p.Cys229* • In the French familial probands two variants were – Two-point LOD score 4.4 found • Neither variant found in ExAC database – p.Glu102* – – 61,486 “control” individuals Frameshift duplication c.873_888dup leading to p.Asn297* • Both probands had TAAD • In both pedigrees all affected individuals were heterozygotes for respective variants • Neither variant was observed in ExAC – 60,706 “controls” individuals • In both pedigrees there was reduced penetrance

62 Association Analysis of Rare Variants • Analysis of single rare variants are very poorly powered Association Analysis for Mendelian Traits • Many methods have been developed specifically to test for rare variant associations – To overcome the low power of testing for associations with individual rare variants Suzanne M. Leal Center for Statistical Genetics • Rare variant association methods are frequently Baylor College of Medicine referred to as [email protected] – Aggregate – Burden Copyrighted © S.M. Leal 2015 – Collapsing

Association Analysis of Rare Variants Association Analysis - Mendelian • If pedigree data are available – Linkage analysis and filtering approaches should be used • Generally only Rare variants are analyzed, e.g. for data analysis MAF< 0.5% • When only the proband is available for study • Which are – Or a limited number of family members – Missense variants • e.g. unaffected family members, a single affected sibling • If the proband has a family history or – Stop loss, gain variant • It is suspected that the disease is due to a de novo – Spice site variants variant – No parental data is available • Association analysis can aid in finding genes which harbor pathogenic variants

Association Analysis - Mendelian Controls • If convenience controls are used • Rare variant association analysis can be used in these situations – BAM files should be obtained and variants called for both cases and controls together • Affected probands are compared to control • Although frequencies for individual variants can individuals be obtained from databases such as ExAC • Care must be used in selecting controls – These frequencies/counts should not be used to • Sequencing conditions should be the same for perform rare variant association analysis both cases and controls • Can lead to an increase in type I – Read depth – Capture array, etc

63 Data Quality Control Sample Size & Power • Unlike for filtering approached stringent data quality control should be performed • For complex traits extremely large sample sizes – Removing variant sites which are necessary • Fail variant quality score recalibration [ (VQSR) GATK] – Tens of thousands of individuals • High rates of missing variant calls, e.g. >10% • Fail Hardy Weinberg equilibrium , e.g. p < 10-7 • Due to low effect sizes of disease susceptibility variants – Removing variant genotypes with • For Mendelian diseases many fewer cases are • Low read depth e.g < 10X • Low GQ scores e.g. < 20 necessary to detect an association – For some studies <50 cases may be necessary • The quality control is data specific – To increase power large numbers of controls can be • A balance must be met used – between removal of data & false positive calls • Although there is a diminishing return when the ratio of control to case is > 3:1

Influences on Power Types of Aggregate Analyses

• Mode of Inheritance • Frequency cut offs used to determine which variants to • Locus heterogeneity include in the analysis – Increasing locus heterogeneity leads to a decrease – Rare Variants (e.g. <1% frequency) in power – Rare and low (1-5%) frequency variants • Allelic heterogeneity • Maximization approaches – Will not impact power • Tests developed to detection associations when • Unless benign variants are included in association test. variants effects are bidirectional e.g. protective and detrimental

• Incorporate weights based upon – frequency or functionality

Misclassification Caveats • When performing aggregate analysis • For exome data natural regions to aggregate rare – Misclassification of variants within a region can reduce variants are power – Genes • Exclusion of causal variants – Genes within pathways – Variants which are causal are erroneously not included • Analysis of genome sequence data outside of in the analysis exonic regions is problematic • Inclusion of non-causal variants – Unlikely a sliding window approach will work • Size of window unknown and will differ across the genome – Variants which are non-causal are included in the – A better understanding functionality outside the analysis coding regions is necessary • Predicted functional regions, enhancer regions, transcription factors, DNase I hypersensitivity sites, etc.

64 A Few Rare Variant Association Tests A Few Rare Variant Association Tests

• Combined Multivariate Collapsing (CMC) • Combined Multivariate Collapsing (CMC) – Li and Leal AJHG 2008 – Li and Leal AJHG 2008 • Burden of Rare Variants (BRV) • – Auer, Wang, Leal Genet Epidemiol 2013 Burden of Rare Variants (BRV) – Auer, Wang, Leal Genet Epidemiol 2013 • Weighted Sum Statistic (WSS) Fixed Effect – Madsen and Browning PloS Genet 2009 Tests • Weighted Sum Statistic (WSS) Fixed Effect • Kernel based adaptive cluster (KBAC) – Madsen and Browning PloS Genet 2009 Tests – Liu and Leal PloS Genet 2010 • Kernel based adaptive cluster (KBAC) • Variable Threshold (VT) – – Price et al. AJHG 2010 Liu and Leal PloS Genet 2010 • Sequence Kernel Association Test (SKAT) Random Effect • Variable Threshold (VT) – Wu et al. AJHG 2011 Test – Price et al. AJHG 2010

Methods to Detect Rare Variant Associations Using Variant Frequency Cut-offs CMC

• Define covariate Χj for individual j as • Combined multivariate & collapsing (CMC)

– Li & Leal, AJHG 2008 • Compute Fisher exact test for 2x2 table • Collapsing scheme which can be used in the Number of cases for which one or more rare Number of cases regression framework variants are observed e.g. nonsynonymous X=1 X=0 without a rare – Can use various criteria to determine which variants to variants freq. <1% variant collapse into subgroups cases • Variant frequency controls

• Predicted functionality Number of controls for Number of which one or more rare controls without a variants are observed rare variant

Can also use same coding in a regression framework

Methods to Detect Rare Variant Associations CMC Using Variant Frequency Cut-offs

• Example of coding used in regression framework: • Gene-or Region-based Analysis of Variants of – Binary coding Intermediate and Low frequency (GRANVIL) Gene region with 5 variant sites Individual Coding – Aggregate number of rare variants used as regressors 1 1 in a linear regression model – Can be extended to case-control studies 2 1 • Morris & Zeggini 2010 Genet. Epidemiol – Test also referred to as MZ 3 0

Rare Variant Sites Green bars: Major allele is observed in the study subject Red bars: Minor allele has been observed

65 GRANVIL Methods to Detect Rare Variant Associations Weighted Approaches • Example of coding used in regression framework – Gene region with 5 variant sites – data available on all sites Individual 1 • Group-wise association test for rare variants using Coded 2/5 (0.4) the Weighted Sum Statistic (WSS) Individual 2 – Variants are weighted inversely by their frequency in Coded 2/5 (0.4) Note same coding for heterozygous and homozygous controls (rare variants are up-weighted) genotypes • Madsen & Browning, PLoS Genet 2009

• Gene region with 5 variant sites but missing data on three variant sites • Kernel based adaptive cluster (KBAC) Individual 3 – Adaptive weighting based on multilocus genotype Coded 1/2 (0.5) Burden Rare Variant (BRV) extension (Auer et al. 2013 Genet Epidemiol) • Liu & Leal, PLoS Genet 2010 Individual 1: Coded 2 Individual 2: Coded 3 Individual 3: Coded 1

Methods to Detect Rare Variant Associations Significance Level for Rare Variant Maximization Approaches Association Tests • For exome data where individual genes are • Variable Threshold (VT) method analyzed usually a Bonferroni correction for the – Uses variable allele frequency thresholds and maximizes the test statistic number of genes tested is used. – Also can incorporate weighting based on functional – There is very little to no linkage disequilibrium between information genes • Price et al. AJHG 2010 • A Bonferroni correction for testing 20,000 genes is often used as the significance level cut-off • RareCover -6 – Maximizes the test statistic over all variants with a – 2.5 x 10 region using a greedy heuristic algorithm • Bhatia et al. 2010 PLoS Computational Biology

Rare Variant Aggregate Methods Rare Variant Aggregate Methods

• Ideally should be performed in a regression – If the proportion of cases and controls sampled analysis framework from each populations is different – Logistic • Can occur due to – Linear regression – Disease frequency is different between populations • Almost all methods have been extended to be – Sloppy sampling implemented within a regression framework – Population substructure\admixture can cause – Can control for covariates which are potential detection of differences in variant frequencies confounders within a gene which is due to sampling and not – Age disease status – Sex • False positive findings can be increased – Population substructure/admixture

66 Rare Variant Aggregate Methods Related Individuals

• Population substructure\admixture is often a • Remove related individuals from the analysis confounder for genetic studies – Only retain one member of a related pair/group – A particular problem for rare variants in the analysis • Perform analysis using mixed models • Currently Principal Components Analysis (PCA) • Ignoring that related individuals are included or Multidimensionality Scaling (MDS) is used to in the analysis can increase type I error control for population substructure\admixture – For both studies of common & rare variants

Software to Perform Rare Variant Testing for Associations using Trio Data Association Testing using NGS Data • PLINK/SEQ • Trio data are often sequenced to detect de novo – Developed by Shaun Purcell events. • https://atgu.mgh.harvard.edu/plinkseq/tutorial.shtml • Variant Association Tools (VAT) • However, transmitted as well de novo events – Reference Wang, Peng & Leal, 2014 can be analyzed • http://varianttools.sourceforge.net/Association/HomePage • The transmission disequilibrium test (TDT) is a natural choice to analyze trio data • The TDT design can also be used to analyze Mendelian traits

Controlling for Population Admixture and Substructure Using the Trio Design TDT Control Alleles • The trio (two parents and an affected child) approach was developed to control for Case Alleles population substructure and admixture •Transmitted parental alleles 1 and 2 – Falk and Rubinstein 1987 Ann Hum Genet Control Alleles 1 3 4 2 – Many additional trio methods have been described •Non-transmitted parental alleles 3 and 4 • The Transmission Disequilibrium Test (TDT) is currently the mostly widely used trio method 1 2 – Spielman et al. 1993 AJHG Case Alleles

*Phenotype information from parents is not used in the analysis

67 TDT TDT-Aggregate Analysis

Not Transmitted • The TDT was extended to incorporate rare variant association methods 1 allele 2 allele – Combined Multivariate Collapsing (CMC) 1 allele a b 1 2 12 • RV-TDT-CMC – Burden of Rare Variants (BRV) 2 allele c d Transmitted • RV-TDT-BRV – Weighted Sum Statistic (WSS) (McNemar’s Test) 11 • RV-TDT-WSS – Variable Threshold (VT) Only transmission events from heterozygous parents are informative, • RV-TDT-VT i.e. quadrants b & c

RV-TDT-CMC & RV-TDT-WSS RV-TDT-WSS & RV-TDT-VT

• RV-TDT-WSS • RV-TDT-CMC – Aggregate transmissions within a region weighted by the – Collapse transmissions within a region frequency of non-transmitted (“control”) alleles • For parent j • For parent j – b=0 and c=1 – b=0 and c=

• RV-TDT-BRV • RV-TDT-VT – Aggregate transmissions within a region – Maximizes the test statistic over minor allele frequencies • For parent j using either CMC or BRV coding – b=0 and c=3 • For parent j – b=0 and c=1 or 3

What are the Necessary Sample Sizes for Evaluating Significance the Trio Design for a Mendelian Trait • Analytical • Assuming No Locus Heterogeneity – (one-sided test) • Power 0.80 – CMC method only Number of Trios Mode of Inheritance Alpha Haplotype Permutation -6 • Empirical 0.05 2.5 x 10 – Haplotype permutation Autosomal Recessive 4 15 • Shuffle parental haplotypes Autosomal Dominant 13 59 – All methods

References He et al. 2014 AJHG RV-TDT Software Krumm et al. 2015 Nature Genetic http://bioinformatics.org/rv-tdt/

68 Performing!Linkage!Analysis!! Using!Exome!and!Genome!Sequence!Data! • As!cost!of!performing!sequencing!falls! ! – DNA!samples!from!all!informaGve!pedigree!members! The!Collapsed!Haplotype!Pa0ern!(CHP)! can!undergoing!sequencing! Method!for!Performing!Linkage!Analysis! Using!Sequence!Data! • Several!studies!have!generated!exome!and!genome! sequence!data!on!all!informaGve!family!members! Suzanne!M.!Leal! – T2DMGenes!(type!2!diabetes!study)! [email protected]! • Genome!sequence!data!on!20!Mexican!families!(1,043!Individuals)! Center!for!StaGsGcal!GeneGc! Baylor!College!of!Medicine! • Caveat!performing!linkage!analysis!on!individual! !h0ps://www.bcm.edu/research/labs/centerMforMstaGsGcalMgeneGcs! rare!variants!is!not!a!powerful!approach!

Collapsed!Haplotype!Pa0ern!(CHP)!Method!! CHP!Method! • MoGvated!by!rare!variant!aggregate!associaGon!! • LanderMGreen!algorithm!is!used!for!geneGc!phasing! methods! and!reconstrucGon!of!haplotypes! – Analysis!of!regions,!usually!genes! • Missing!genotypes!are!imputed!! • Instead!of!analyzing!individual!rare!variants! – CondiGonal!on!family!members!genotypes!and!! • Rare!variant!aggregate!associaGons!methods!are! – PopulaGon!allele!frequencies! more!powerfully!than!analyzing!individual!variants! • Obtained!from!founders!if!sample!size!is!sufficiently!large!or! • Frequencies!are!obtained!from!databases!(e.g.!ExAC)!

CHP!Method! ExampleMCHP! • For!each!pedigree!variants!on!a!regional! haplotypes,!e.g.!LD!blocks! – Are!assigned!a!single!numeric!value!e.g.!!! • 0!no!minor!alleles! • 1!at!least!one!minor!allele! • Each!regional!haplotype!within!a!family!is!uniquely! represented!

! These!two!pedigrees!both!have!rare!variants!in!the!same!gene!region!!! ! Although!they!segregate!a!different!rare!variant!and!haplotype! ! The!same!coding!can!be!used!without!a!lost!of!informaGon!

69 ExampleMCHP! CHP!method!

• Each!pedigree!is!analyzed!separately! – Using!allele!frequencies!that!are!correct!for!the! haplotypes!segregaGng!in!the!pedigree! • Parametric!LOD!score!results!are!summed!across! families!by!gene!region!! – At!the!same!Θ!value,!e.g.!Θ=0.0!

! A!unique!coding!is!provided!for!each!haplotype!!! ! To!avoid!lost!of!informaGon!

EvaluaGon!of!the!CHP!Method! EvaluaGon!of!the!CHP!Method!

• Data!was!generated!for!four!nonsyndromic!hearing! • Families!were!generated!with!3M8!children! impairment!genes! – Based on the number of children per family in the United States in 2012, rescaled to sum to 100% – Autosomal!recessive!mode!of!inheritance! • 3 children: 69.34% • GJB2,!SLC26A4. • 4 children: 20.52%, – Autosomal!dominant!mode!of!inheritance! • 5 children: 6.84 • MYO7A!and!MYH9! • 6 children: 2.28% • All!variants!were!generated!based!upon!their! • 7 children 0.76%, frequency!in!EuropeanMAmericans! • 8 children 0.26% – Using!data!from!Exome!Sequencing!Project! • RarePedSim was used to generate the pedigree data • Causal!status!of!variants!obtain!from!NCBIMClinVar! – http://bioinformatics.org/simped/rare/

Results!Autosomal!Recessive!Model! EvaluaGon!of!the!CHP!Method! Genes!SLC26A4!and!GJB2.. • Pedigrees!were!generated!with!varying!degrees!of! locus!heterogeneity! – e.g.!20%!of!families!linked!to!GJB2!and!80%!to!SLC26A4. • Families!with!>2!affected!children!were!“ascertained”! • Variants!with!a!MAF<0.01%!analyzed! • Power!was!evaluated!using!500!replicates!! – For!a!genomeMwide!α!level!of!

• LOD!>!3.3! ProporGon!of!linked!families!! ProporGon!of!linked!families!! • HLOD!>!3.6! • The!CHP!method!was!compared!to!single!variant! !Single!variant!Analysis!! analysis!! !CHP!method! Power!is!displayed!in!curves!!

70 Results!Autosomal!Dominant!Model! !Gene!MYH9.! Linkage!Analysis! • Unlike!filtering!approaches,!linkage!can!provides! Analysis!under!locus!homogeneity!! Analysis!under!locus!heterogeneity!! staGsGcal!evidence!of!a!variant’s!“involvement”!in! trait!eGology! – CauGon!should!be!used,!variant!may!only!be!in!LD!with! the!pathological!variant! • Because!linkage!incorporates!mode!of!inheritance! informaGon!and!penetrance!models!! – Less!likely!than!filtering!to!exclude!causal!variants!in!the! presents!of!phenocopies!and/or!reduced!penetrance!! References!! So+ware! Wang!et!al.!2015!EJHG! CHP!incorporated!in!SEQLinkage! !Single!variant!Analysis!! !CHP!method! Power!is!displayed!in!curves!! O0,!Wang,!Leal!2015!NRG! h0p://bioinformaGcs.org/seqlink! !

71 72 Simula1on!

• Simulation studies are used in many situations Evalua&ng)Power)Using)Simula&on)Studies! – Predict traffic jams – Flow from volcano eruptions, etc. • For genetic studies simulation can be used for a

Suzanne!M.!Leal! variety of situations Center!for!Sta1s1cal!Gene1cs! – Estimate the power to detect linkage for a given data Baylor!College!of!Medicine! [email protected]! set ! ! – Estimate empirical p-values Copyrighted © S.M. Leal 2015 – Compare various analysis methods

Example!Genera1ng!Genotype!Data! Genera1ng!Genotypes!for!a!Pedigree!

• A marker the following allele frequencies will be generated – 1=0.4 I.1 I.2 – 2=0.1 – 3=0.45 – 4=0.05

II.1 II.2

Genera1ng!Genotypes!for!a!Pedigree! Genera1ng!Genotypes!

• A!random!number!generator!is!used! • If!the!random!number!selected!!is!between!<0.4!! – Random!numbers!between!0!and!1!are!generated! – Then!the!1!allele!is!chosen!! • The!numbers!are!generated!according!to!a! • If!the!random!number!selected!is!between!>0.4!and!

73 Genera1ng!Genotypes! Genera1ng!Parental!Genotypes!

• Since!each!individual!needs!two!alleles!to!construct! their!genotype!J!two!random!numbers!are!generated!! • Father! 0! !!!!!!!!!!!!!0.4!!!!!!!!0.5 ! !!!!!!!0.95!!!!!1 !! I.1 I.2 – 0.84! |______1______|__2__|______3______|_4__|! • Assign!a!3!allele! 1! 3! 2 3! – 0.31! • Assign!a!1!allele! • Mother! – 0.44! • Assign!a!2!allele! II.1 II.2 – 0.63! • Assign!a!3!allele!

Genera1ng!Offspring!Genotypes! Genera1ng!Offspring!Genotypes!

Should!the!random!number!generator!be!used!to!generate!two! • more!genotypes!for!!the!children?! No!the!alleles!must!segregate!from!the!parents.! • It!must!be!determined!which!of!the!two!parental! alleles!each!offspring!inherits! I.1 I.2 1! 3! 2 3! – With!50%!probability! !

!!0!!child!receives!first!parental!allele!!0.5!!child!receives!second!parental!allele!!!1! !|______|______|

II.1 II.2

Random!Numbers!are!Generated! Genera1ng!Offspring!Genotypes! ! • For!child!II.1! – Random!#!0.21!! I.1 I.2 • From!father!!first!allele! 1! 3! 2 3! – obtains!a!1!allele! – Random!#!0.11! • From!mother!first!allele!! – obtains!a!2!allele!! • For!child!II.2! – Random!#!0.76! • From!father!second!allele!! – obtains!a!3!allele! II.1 II.2 – Random!#!0.31!! 1! 2! 2! 3! • From!mother!first!allele!! – obtains!a!2!allele!

74 Genera1ng!Genotypes!Condi1onal!on! Genera1ng!Genotypes!for!Pedigrees! Disease!Locus!

• The!genotypes!in!the!previous!example!were! D!!+! +!+! generated!uncondi1onal!(unlinked)!to!the!disease! I.1 I.2 phenotype! 1! 3! 2 3! • Next!marker!will!be!generated!linked!to!the! disease!locus! • Assump1on!! – The!disease!phenotype!is!autosomal!dominant! • No!phenocopies! D!!+! +!+! • No!reduced!penetrance! II.1 II.2 – The!marker!and!the!disease!locus!are!linked!! • ϴ=0.04!

Genera1ng!Genotypes!Condi1onal!on! Genera1ng!Genotypes!Condi1onal!on! the!Disease!Locus! Disease!Locus! Phase!I! Phase!II! • Need!to!Generate!offspring!genotypes!condi1onal! +!+! D!!!+! D!!!+! on!parental!genotypes!and!underlying!disease! I.1 I.2 1!!!!3! 3!!!!1! genotype! 1! 3! 2 3! • Since!the!pedigree!is!phase!unknown! Each!with!probability!50%! – Do!not!know!grandparental!genotypes! • Have!to!determine!phase! • Assump1on!the!disease!the!disease!and!marker! loci!are!in!linkage!equilibrium! II.1 II.2 – Each!phase!has!50%!probability

Genera1ng!Genotypes!Condi1onal!on!the! Fathers!Phase!is!Determined! Disease!Locus!

• A!random!number!generator!is!used!to! D!!+! +!+! I.1 I.2 determine!the!phase!for!the!father! Phase!II! 1! 3! 2 3! D!!!+! !!! 3!!!!1! !!!!!!!0!!!!!!Phase!I!for!the!father!!!!!!!!!!!!!!!!!!!!!0.5!!!!!Phase!II!for!the!father!!!!!!!!!!!!!!!!!!!!!1! !!!!!!!|______!_|______|!

• The!random!#!generated!is!0.76! – The!fathers!phase!is!II! D!!+! +!+! II.1 II.2

75 Genera1ng!the!Offspring!Genotypes! Determining!Offspring!Genotypes! • The!genotypes!must!be!assigned!condi1onal!on! • The!first!child!II.1!is!affected!! the!disease!genotypes!and!whether!or!not!a! • He!either!receives!from!his!father!! recombina1on!event!has!occurred! – A!3!allele!with!probability!1JΘ!! • For!this!example!1J0.04! • Whether!or!not!a!recombina1on!has!occurred! – Or!a!1!allele!with!probability!Θ!! will!be!determined!by!genera1ng!a!random! • For!this!example!0.04!!! • number! The!second!child!II.2!is!unaffected! • She!either!receives!from!her!father!! – A!!1!allele!with!probability!1JΘ!! – Or!with!probability!Θ!receives!a!3!allele! ! !0!!!!!!!!!!Θ!(0.04)!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!1! |_____|______|

Determining!Offspring!Genotypes! Determining!the!Offspring!Genotypes!

• Child!II.1!(affected)!! • The!mother!provides!no!linkage!informa1on! – Random!#!0.88!is!generated! – Each!child!can!be!assigned!either!a!2!or!a!3!allele! • He!is!assigned!the!3!allele!from!his!father!! • With!probability!0.5! ! • Two!Random!numbers!are!generated! • Child!II.2!(unaffected)! – For!Child!II.1! • Random!#!0.01!is!generated! • The!random!#!0.32!is!generated! – The!2!allele!is!assigned!from!the!mother! – She!is!assigned!the!3!allele!from!her!father! – For!Child!II.2! • The!random!#!0.43!is!generated! – The!2!allele!is!assigned!from!the!mother! 0!!!!!!!!!!Θ!(0.04)!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!1!

|_____|______| 0!!!child!receives!first!parental!allele!!!0.5!!child!receives!second!parental!allele!!!!!!!1! |______|______|

Genera1ng!Genotypes!Condi1onal!on!the! Disease!Locus! Genera1ng!Haplotype!Data!

D!!+! +!+! • Instead!of!genera1ng!and!assigning!individuals! I.1 I.2 alleles!! 1! 3! 2 3! – Haplotypes!are!generated!! • When!haplotypes!are!generated!uncondi1onal! on!disease!phenotype!or!quan1ta1ve!trait! • Based!upon!haplotype!frequencies!two! D!!+! +!+! haplotypes!are!assigned!to!each!individual!in! II.1 II.2 the!parental!genera1on! 2 3! 2 3! – Random!numbers!are!used!to!determine!which!two! haplotypes!are!assigned!

76 Genera1ng!Haplotype!Data! Genera1ng!Sequence!Data!for!Pedigrees!

• For!each!offspring!recombina1on!events!between! • Haplotype!data!can!be!!generated!using! the!two!parental!haplotypes!are!determined!by! gene1c!maps! popula1on!demographic!models!! – Posi1ons!of!recombina1on!events!are!determined!by! • Data!generated!on!16,568!genes! random!numbers! • Simula1ng!variant!data!using! • One!paternal!and!one!maternal!new!haplotypes! reference!sequence!data!a!European! is!assigned!to!the!offspring!from!each!of!their! popula1on!demographic!model! parents! – Gazave!et!al.!2013! – Haplotype!pool!generated!for!each! – Each!of!the!two!parental!haplotype!has!equal! gene! probability!of!beginning!assigned!to!the!offspring! • Each!pool!contains!1,308,000!haplotypes! • Which!haplotypes!are!assigned is determined by random ! numbers !

!

Genera1ng!Sequence!data!for! Genera1ng!Sequence!Data!for!Pedigrees! Pedigrees!! • Variant!data!frequencies!can!also!be!used!from! • To!then!generate!the!variant!data! databases! condi1onal!on!the!disease!phenotypes! – e.g.!ExAC! – To!generate!data!under!the!alterna1ve!! • Cau1on!should!be!used!that!a!sufficient!large! • A penetrance model is used sample!sizes!is!used!to!obtain!variant!frequencies! • The penetrance model should mimic the mode of – Otherwise!very!rare!variants!will!be!underJrepresented! inheritance in the pedigree • Too few singletons, doubletons etc. – Autosomal dominant, Autosomal Recessive or X-linked • Determine!which!variants!are!pathogenic!using! – Fully penetrant or clinical!databases! – Reduced penetrance and phenocopies – e.g.!ClinVar!

Genera1ng!Sequence!data!for! First!Step!JGenera1ng!Pedigree!Data! Pedigrees!! • Variants!can!be!generated!uncondi1onal!on!the! • Empirical!pJvalues! disease!phenotype! – Data!is!generated!under!the!null!hypothesis! • Markers!and!disease!are!unlinked! – !To!generate!data!under!the!null! – Not!necessary!to!know!the!underlying!gene1c! • Variants!are!only!generated!for!pedigree!members! model! which!are!available!for!study! • Can!be!used!for!Mendelian!and!nonJMendelian!traits!! • Power! Family ID Phenotype BD32 BD210 – Marker(s)!are!generated!linked!(ϴ<0.5)!to!the! maxLOD Polydactyly PostAxial Polydactyly 4.05 2.66 disease!locus! AP216 Woolly Hairs on Scalp 1.33 – Must!know!underlying!gene1c!model! 3 2 1

2 1 • 9 10 5 4 7 6 8 For!pedigree!data!can!only!be!used!for!Mendelian!traits! 1 5 6

5 3 4 6 4 2 3 7 8 !

ED104 ED112 ED129 Pure hair and nail ectodermal dysplasia Skin Ichthyosis Skin Ichthyosis 1.88 2.65 6.04

5 2 1 2 9

3 4 7 6 1 7 5 4 3 6 10 13 12 11 77 9 6 11 4 10 1

8 7 2 ED201 ED173 Ichthyosis Cutaneous Syndactyly 5.91 2.86

1 2 3 1

2 3 4 5 4

7 5 6 7 6 10 8 9

11 16 12 14 13 15 ED205 ED212 Skin Ichthyosis Skin Ichthyosis 2.05 2.91

1 2

6 4 3 5

1 2

5 6 3 4 Second!Step!Analyzing!Data! Second!Step!Analyzing!Data! Power,!ELOD!&!EMLOD!! Power,!ELOD!&!EMLOD!! • Must!know!underlying!disease!model! • Power! • Simulated!data!is!analyzed!using!the!same! – The!propor1on!of!replicates!where!the!null!hypothesis!of! no!linkage!is!rejected!based!upon!a!LOD!score!criterion! model!as!was!used!for!data!genera1on! (e.g.!LOD!score!>3.3)! – Allele!frequencies!(marker!and!disease)! • ELOD! – Penetrances! – Is!es1mated!by!the!average!LOD!score!across!at!the! recombina1on!frac1on!the!data!was!generated!at!across! • Can!evaluate!the!informa1veness!of!pedigree! all!replicates! data!using!several!measures! • EMLOD! – Power! • Is!es1mated!by!the!average!of!the!maximum!LOD!score! – ELOD!(Expected!LOD!Score)! across!all!replicates! – EMLOD!(Expected!Maximum!LOD!Score)! • Maximum!LOD!score! • Largest!LOD!score!observed!for!all!replicates! – Maximum!LOD!score! • Only valid for fully penetrant disease without phenocopies

How!Many!Replicates!Should!be!Generated?! How!Many!Replicates!Should!be!Generated?! • Depends!on!how!accurate!of!an!es1mate!is! • Power! necessary.! • Usually!need!fewer!replicates! • When!es1ma1ng!empirical!pJvalues!will!be! dependent!on!how!small!of!a!pJvalue!is!being! • ~500!replicates! es1mated.!! • But!is!some!instances!there!can!be!great!variability! – The!smaller!the!pJvalues!the!more!replicates! and!many!more!replicates!are!necessary!for! • For!example!if!the!pJvalue!is!in!the!range!of!0.00001! need!to!generate!many!more!than!1,000!replicates! accurate!power!es1mates! – Since!by!chanced!under!the!null!my!never!observe!a!pJ value!of!!<0.00001! • If!only!interested!in!es1ma1ng!if!an!empirical!pJ value!is!<0.05! – ~5,000!replicates!may!be!sufficient!

Exercises!! Simula1on!Programs! • SLINK • Simulate!pedigree!data!using!SLINK! – Generates genotype and haplotype data conditional or – Generate!marker!data! unconditional on affection status or quantitative trait – Generates phenotype data • !Analyze!data!with!using!MSIM! • Quantitative • Qualitative – Perform!parametric!twoJpoint!linkage!analysis! – Large and complex pedigree structures • Simulate!rare!variant!data!using!RareSimPed! – Small number of marker loci ~< 7 can be generated – Simulates!sequence!data! • SIMULATE • Generates a VCF file – Generates genotype data unconditional on affection status • Analysis!data!using!SEQLinkage! – Large and complex pedigree structures – Large number of marker loci can be generated – Performs!the!Collapsed!Haplotype!Parern!(CHP)! method!

78

• SIMLINK! • GASP – Generates!genotype!data!condi1onal!and!uncondi1onal!on! – Generates quantitative and qualitative phenotype data affec1on!status!or!quan1ta1ve!trait! – Large!pedigree!structure! • Gene-gene and gene-environmental interaction – Small!number!of!marker!loci!can!be!generated! – Generates genotype data conditional on generated • One!disease!and!one!marker!locus! phenotype data – Must!modify!the!program!in!order!to!have!it!supply!generated! – Limited in size and structure of pedigrees pedigree!structures! • At most three generations ! – Can generate up to 400 marker loci • MERLIN! – Generates!genotypes!data!uncondi1onal!on!affec1on!status!or! quan1ta1ve!trait! • SimPed – Large!and!complex!pedigree!structures! – Large!number!of!marker!loci!can!be!generated! – Pedigrees of virtually any size or complexity – Generation of >10,000 diallelic or multiallelic marker loci • Generates data for the autosomes and X chromosome • SOLAR! – Haplotype data – Generate!genotypes!uncondi1onal!on!affec1on!status!or! quan1ta1ve!trait! » Markers in linkage disequilibrium – Large!and!complex!pedigree!structures! – Genotype data – Large!number!of!marker!loci!can!be!generated! » Markers in linkage equilibrium

• SIMLA – Generates qualitative phenotype data • Gene-gene and gene-environmental interaction • Assigns affection status to pedigree members – Limited in pedigree structures that can be generated • user cannot provide pedigree structure – Large number of marker loci can be generated – Can also generate sequence data

• RareSimPed – Generates sequence data for Mendelian and Complex traits (qualitative and quantitative) regardless of pedigree structure • Using population based frequencies or demographic models – Generates genotype data conditional and unconditional on the phenotype – Generates phenotype data conditional on the generated genotype data

79 ! ! !! ! FuncA (p.Asp377Asn) variant was found in one family, and the c.517T>C (p.Tyr173His) variant was found in the other ! [email protected]! two families. Both variants were predicted to be damaging by multiple bioinformatics tools. The two variants both segregated with the ! ! !! ! ! !! ! ! !! ! ! !! nonsyndromic-hearing-impairmentDescribes!the!iden<fica

Hearing impairment (HI) affects nearly 300 million people ilies. The knowledge that has been gained from functional, of all ages globally and increases in prevalence per decade expression, and localization studies after the identification of life.1 Children and adults with bilateral, moderate-to- of genes with mutations that cause NSHI has immensely profound HI have a poorer quality of life, which encom- expanded our understanding of inner-ear physiology. passes not only problems in physical function but also Previously, an ARNSHI-associated locus, DFNB89, was socioemotional, mental, and cognitive difficulties.2,3 In mapped to chromosomal region 16q21–q23.2 in two unre- particular, children with congenital HI must be identified lated, consanguineous Pakistani families.6 The two fam- and habilitated within the first 6 months of life so that de- ilies, 4338 and 4406 (Figures 1A and 1B), had maximum KARS*Variants!Segregate!with!HI!in! lays in the acquisition of speech, language, and reading multipoint LOD scores of 6.0 and 3.7, respectively. The ho- skills can be prevented.4 mozygosity regions that overlap in the two families led to DFNB89!Families! Among childrenDFNB89!Hearing!Impairment! with congenital sensorineural HI, more the identification of a 16.1 Mb locus (chr16: 63.6–79.7 Mb) than 80% do not display syndromic features and ~60% that includes 180 genes. Additionally, a third consanguin- * have a family history of HI or a confirmed genetic etiol- eous Pakistani family, 4284 (Figure 1C), was identified, and ogy.5 Because of the complex cellular organization of the showed suggestive linkage to the DFNB89 region with a inner ear, hundreds of genes and proteins are predicted maximum multipoint parametric LOD score of 1.93. For to influence auditory mechanisms. To date, for nonsyn- family 4284, linkage analysis was performed for ~6,000 dromic HI (NSHI), about 170 loci have been localized SNP markers that were genotyped across the genome and mutations in ~75 genes have been identified in hu- with the Illumina Linkage Panel IVb. mans (Hereditary Hearing Loss Homepage). Of the gene Consanguineous families 4284, 4338, and 4406 from variants that have been implicated in NSHI, almost 60% Pakistan are affected by ARNSHI, which was found to are autosomal recessive (AR) in inheritance, and 95% of segregate with unique haplotypes within the DFNB89 lo- the genes that harbor mutations that cause ARNSHI were cus (Figures 1A–1C). From the medical history, no other initially mapped and identified in consanguineous fam- risk factors were identified as a possible cause of HI. For

1Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA; 2Department of Biochemistry, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan; 3Department of Biochemistry, Azad Jammu Kashmir Medical College, Muzaffarabad, Azad Jammu and Kashmir 13100, Pakistan; 4Department of Otolaryngology Head and Neck Surgery, Case Western Reserve University, Cleveland, OH 44106, USA; 5Department of Biology, Case Western Reserve University,(+)!ABR!waveforms! Cleveland, OH 44106, USA; 6Department of Genetics and GenomeBilateral!symmetric!moderateNtoNprofound!! Sciences, Case Western Reserve University, Cleveland, OH 44106, USA; 7Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA Absent!OAE! 8These authorshearing!impairment!across!all!frequencies! contributed equally to this work *Correspondence: [email protected] Normal!EMGNNCV! http://dx.doi.org/10.1016/j.ajhg.2013.05.018. Ó2013 by The American SocietyRuling!out!auditory!neuropathy!and! of Human Genetics. All rights reserved. suppor

KARS*Variants!Predicted!to!Lower! KARS*Variants!at!Conserved!Residues! Cataly

Loss!of!βN2!strand!!

Interfere!with!tRNA!binding!

Decreased(cataly,c(ac,vity(

Affect!tetramer!binding!

p.Asp377!iden

80 KARS*is!Expressed!in!Zebrafish!and! KARS*is!Expressed!in!Chicken!Hair!Cells! Mouse!Hair!Cells!and!Maculae! Zebrafish! Mouse!

HC!=!Hair!Cell! M!=!Macula! WF!=!Whole!Fish! Expression!of!KARS!in!purified!chicken!hair!cells! (5)!=!Control! was!detected!by!RNANseq!

KARS*Localized!to!Cochlear!Duct! KARS*Localized!to!Cochlear!and! Ves

outer!sulcus! cells! spiral! hair!cells! limbus!

Deiters’!cells! basilar! membrane! hair!cells! spiral! inner! *!=!hair!cell!nuclei!! ligament! sulcus! fibrocytes! cells! Labeling!with!KARS!polyclonal!an

Human Molecular Genetics, 2014, Vol. 23, No. 12 3289–3298 Conclusions!KARS! doi:10.1093/hmg/ddu042 Advance Access published on January 29, 2014 Adenylate cyclase 1 (ADCY1) mutations cause • KARS*muta

gene!and!a!novel!phenotype!for!KARS* Regie Lyn P. Santos-Cortez1, Kwanghyuk Lee1, Arnaud P. Giese3,4, Muhammad Ansar1,5, Muhammad Amin-Ud-Din6, Kira Rehn4, Xin Wang1, Abdul Aziz5, Ilene Chiu2, Raja Hussain Ali5, • Joshua D. Smith7, University of Washington Center for Mendelian Genomics, Jay Shendure7, KARS*is!expressed!in!inner!ears!and!hair! Michael Bamshad7, Deborah A. Nickerson7, Zubair M. Ahmed3, Wasim Ahmad5, Saima Riazuddin4 1, cells!of!chicken,!zebrafish!and!mouse! and Suzanne M. Leal ∗ 1Department of Molecular and Human Genetics, Center for Statistical Genetics and 2Bobby R. Alford Department of Otolaryngology—Head and Neck Surgery, Baylor College of Medicine, Houston, TX 77030, USA, 3Division of Pediatric Ophthalmology and 4Division of Pediatric Otolaryngology—Head and Neck Surgery, Cincinnati Children’s Hospital • KARS!strongly!localizes!to!oT(p.Arg1038∗)withinadenylatecyclase1(ADCY1) was identified. This stop-gained mutation segregated with hearing impairment within the family and was not identified in ethnically matched controls or within variant databases. This mutation is predicted to cause the loss of 82 amino acids from the carboxyl tail, including highly conserved residues within the catalytic domain, plus a calmodulin-stimulation defect, both of which are expected to decrease enzymatic efficiency. Individuals who are homozygous for this mutation had symmetric, mild-to-moderate mixed hearing impairment. Zebrafish adcy1b morphants had no FM1-43 dye uptake and lacked startle response, indicating hair cell dysfunc- tion and gross hearing impairment. In the mouse, Adcy1 expression was observed throughout inner ear devel- opment and maturation. ADCY1 was localized to the cytoplasm of supporting cells and hair cells of the cochlea and vestibule and also to cochlear hair cell nuclei and stereocilia. Ex vivo studies in COS-7 cells suggest that the carboxyl tail of ADCY1 is essential for localization toactin-basedmicrovilli.Theseresults demonstrate that ADCY1 has an evolutionarily conserved role in hearing and that cAMP signaling is important to hair cell function within the inner ear. 81

∗ To whom correspondence should be addressed at: Department of Molecular and Human Genetics, Center for Statistical Genetics, Baylor College of Medicine, 1 Baylor Plaza 700D, Houston, TX 77030, USA. Tel: 1 7137984011; Fax: 1 7137984012; Email: [email protected] + +

# The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] Family!4009!N!DFNB44!Mapped!to!7p14.1Nq11.22!! Nonsense*Variant*c.311C>T*p.ARG1038**in* ADCY1!Iden<fied!through!exome!sequencing!!!

Anar!et!al.!2004! Mapped!with!STR!Markers! Ansar!et!al.!! Missense*variant*in*** Maximum!Mul

Bilateral!symmetric!mildNtoNmoderate!mixed!hearing! Predicted!loss!of!two!terminal!betaNsheets! impairment!in!5!of!6!!family!members!of!4009! due!to!ADCY1!p.Arg1038*!

Predicted!to!cause!loss!of!82!amino!acids!from!the!!cytoplasmic! carboxyl!tail!and!include!highly!conserved!residues!of!the!C2! domain!

Failure!of!FM1N43!dye!uptake!and!lack!of!startle! ADCY1!is!expressed!in!mouse!inner!ear!at! response!in!adcy1b*but!not!adcy1a*morphant! various!developmental!stages,!with!highest! zebrafish! expression!at!P16!

82 ADCY1!is!localized!to!cochlear!outer!and!inner! ADCY1!localizes!to!the!ves

ADCY1!is!localized!to!the!adult!rat!inner!hair!cell! bodies!and!along!the!length!of!the!stereocilia!of! ConclusionsNADCY1* both!inner!and!outer!hair!cells! • ADCY1*p.Arg1038*!causes!bilateral!mildNtoN moderate!mixed!hearing!impairment!in! humans! • This!muta

Conclusions!–ADCY1* ConclusionsN!overall!

• ADCY1*is!expressed!throughout!inner!ear! • With!fast!pace!of!NGS!gene!discovery,!func

83 Conclusions!N!Overall! Expression!and!Func

84 Outline

• Forms of variant annotation Variant Annotation • Databases for annotation • Software for annotation

• Notes of caution

Michael Nothnagel, [email protected], 2015

Forms of variant annotation Bioinformatic workflow

Technical information Database annotation

• Sequencing instrument • What is already known about • Quality metrics for filtering the variant? • Retrieval of information from databases

Functional annotation Multiple layers annotation

• Prediction of functionality, • Overlap with others sorts of pathogenicity etc. of variant genomic information, e.g. • Inference based on various expression levels, transcription factors

Jobling, et al. (2014)

Database annotation Potential effects of small-scale gene mutations

• A wealth of information is already available from public databases for many variants – RefSeq numbers and other identifiers – Population frequencies (both global and population-specific) – Type of variant for coding regions (missense, stop, etc.) – Implication in human Mendelian diseases – Implication in human inherited diseases – Implication in human diseases and traits (GWAS?) – Literature • Database annotation involves scripted or web-based analyses for – querying of public databases – storing retrieved information

Jobling, et al. (2014)

85 (Human) genomic databases (I) (Human) genomic databases (II)

dbSNP, dbVar Genomic Variants Archive • SNPs, indels, SV • SNPs, indels www.ncbi.nlm.nih.gov/SNP www.ebi.ac.uk/dgva • Global and population-specific frequencies UCSC Genome Browser 1000 Genomes • the reference sequence and working draft • SNPs, indels, SV genome.ucsc.edu www.1000genomes.org assemblies for a large collection of genomes • Global and population-specific frequencies • portal to ENCODE data at UCSC (2003-12) and to the Neanderthal project HapMap Ensemble • SNPs, indels www.hapmap.org • European Bioinformatics Institute • Global and population-specific frequencies www.ensembl.org • SNPs, indels

GoNL ExAC • SNPs, indels • Exome Aggregation Consortium www.nlgenome.nl exac.broadinstitute.org • Dutch population frequencies • Exome data (including variants) for >60,000 unrelated individuals

Human disease databases (I) HGMD

OMIM Cooper & Krawczak (1993), Cooper, et al. (1998), Krawczak, et al. (2000), Stenson, et al. (2003), Stenson, et al. (2014) • Online Mendelian Inheritance in Men www.ncbi.nlm.nih.gov/omim • Catalog of human genes/disorders/traits • Focus on molecular relationship between genetic variation and phenotypic expression HGMD • Human Gene Mutation Database www.hgmd.cf.ac.uk • Collate known (published) gene lesions responsible for human inherited disease GWAS catalog • QC-ed, manually curated, literature-derived www.ebi.ac.uk/gwas collection of all published GWAS assaying Manually curated collection of published gene lesions responsible for >100,000 SNPs; all SNPs with p<10-5 human inherited disease; includes the first example of all mutations PubMed causing or associated with human inherited disease plus functional studies • >24 million citations for biomedical literature www.ncbi.nlm.nih.gov/pubmed Free access to mutations included >=3 years ago for registered from MEDLINE, journals, and online books academic users; otherwise professional version for up-to-date access

Human disease databases (II) Number of genic SNPs per genome

ClinVar • Public archive of reports on relationships www.ncbi.nlm.nih.gov/clinvar among human variations and phenotypes • Supporting evidence and submitter visible • Focus in medical genetics

The Cancer Genome Atlas (TCGA) • Public catalog of genomic changes in tumors cancergenome.nih.gov • Search, download, and analysis of data sets generated by TCGA

COSMIC • Public catalog of genomic changes in tumors cancer.sanger.ac.uk/cosmic • Search, download, and analysis of data sets generated by TCGA

There are (many) more databases. Jobling, et al. (2014)

86 Motivation for functional annotation Approaches to functional annotation

• Each individual carries multiple deleterious mutations: Variant is located in … – On average 2% of all individuals carry a missense mutation in any Coding sequence Non-coding sequence given gene. (Andrews, et al., 2013, Trends Immunol) – For a given disease, multiple missense mutations will by chance be • Focus on proteins • Focus on regulation present in those genes that seemingly relate to the disease in an • Primarily applied with exome • Primarily applied with whole- affected individual. data genome data • Do these missense mutations • Possible analyses • Possible analyses: – actually alter gene function or, more precise, • Protein sequence • Methylation, epigenetic – actually cause the disease/phenotype at hand? conservation alterations • Functional assays or model organism experiments are • Protein features • Transcription-factor (TF) – too costly for all observed missense mutations • Gene relationships, binding sites – too time-consuming interaction • RNA interference – may raise ethical issues • Pathway analysis • Expression • … • … • Computational inference may address some of these issues. – Trade-off between sensitivity and specificity • Selection, reduced nucleotide diversity, etc.

Homology: sequence divergence by evolution Physicochemical properties of proteins

Sequences( Sequence(divergence( • Chemical composition and biophysical characteristics – Solubility in water, polarity, charge, ACCAGGTACACAATGAGT---CAGC – Cyclic, sulfur-containing – Molecular volume evolu0on( subs0tu0on((“muta0on”)( • Structure ACC---TACACATTGAGTCCGCAGC – Primary: amino acid chain

dele0on( inser0on( – Secondary: chain folding (mainly a and b helixes) – Tertiary: spatial arrangement through folding and coiling, sometimes including molecular chaperones duplica0on( phylogeny( ACCAGGAGGTACACAATGAGTCAGC – Quaternary: complex of two more polypeptide chains transloca0on( • Sequence conservation ACCAGGTACACAATGAGTCAGC ACCAGGTACACAATGAGTAGGCAGC • … inversion( Fixation of mutations with dissimilar amino acids is rare. ACCGGATACACAATGAGTCAGC (Grantham, 1974, Science) Cozzone((2002)(Encyclopedia(of(Life(Sciences;(www.els.net(

Conservation through selection Transcriptional regulation

Thioredoxin(of(E.#coli#and(15(homologs( Places of action for transcription factors (TF)

active site with highest conservation TFBS: TF binding site

CRM: cis-regulatory module (set of TFs) Lesk((2014)( Wasserman(&(Sandelin((2004)(Nat(Rev(Genet(

87 Epigenetic changes RNA interference

Silencing of gene expression by targeted degradation of mRNA

hQp://commonfund.nih.gov/epigenomics/( Robinson((2004)(PLoS(Biol(

Data basis for functional annotation UniProt database

• Universal Protein Resource • http://www.uniprot.org/ • Comprehensive catalog of – protein sequence – functional information (annotation data) • Several databases: – UniProtKB: Knowledgebase (annotation) – UniRef: Reference Clusters (sequences for UniProtKB) – UniParc: Archive (all sequences) – UniMES: metagenomes • Merger of previous Swiss-Prot and TrEMBL databases

PANTHER PANTHER

Thomas, et al. (2003) Genome Res 13:2129-41; Information types stored in PANTHER Thomas & Kejariwal (2004) PNAS 101:15398–15403. • Protein ANalysis THrough Evolutionary Relationships • http://www.pantherdb.org/ • Classification system of proteins and their genes • Classification by: – Family (evolutionarily related proteins) and subfamily (related proteins that have the same function) – Molecular protein function (e.g. kinase) – Biological protein function (e.g. mitosis) – Pathway relationships • Compilation by human curation as well as bioinformatic algorithms • >11,900 protein families, >83,000 subfamilies in 2015 http://www.pantherdb.org

88 PFAM database Ensemble database

Punta, et al. (2012) Nucleic Acids Res 40:D290-D301; Cunningham, et al. (2015) Nucleic Acids Res 43:D662-D669 Finn, et al. (2014) Nucleic Acids Res 42:D222-D230

• Protein families • http://www.ensembl.org • Genomic interpretation system • http://pfam.xfam.org/ • Annotations, querying tools, access • Database of protein families methods for chordates and key (>16,200 in 2015) model organisms • Contains, for each family, • Annotation includes: multiple sequence – Gene annotation (GENCODE gene set) – Regulatory region / epigenetic annotation alignments and Hidden – Variation annotation (germline & Markov models (HMMs) for somatic), also including 1000G, HapMap, EVS and other data seed alignment – Comparative annotation (mutation age, • Contains information about multiple sequence alignment, secondary protein structures, …) protein domains • Web-based queries and API • Grouping of families into clans

JASPAR database ENCODE database

Portales-Casamar, et al. (2010) Nucleic Acids Res 38:D105–D110; The ENCODE Project Consortium (2012) Nature 489:57–74 Mathelier, et al. (2014) Nucleic Acids Res 42:D142-D147 • http://jaspar.genereg.net/ • Collection of databases: • Encyclopedia of DNA Elements – JASPAR CORE: • https://www.encodeproject.org/ database of • Projects aims to build a transcription factor comprehensive catalog of all binding motifs functional elements in the – JASPAR human genome COLLECTIONS: • International collaboration funded by the National Human databases for splice Genome Research Institute forms, meta-models, (NHGRI and others

ENCODE: author list ENCODE: annotations

• Candidate enhancers and promoters for DNase hypersensitivity • Gene expression over ~60 cell types • Transcription start sites (TSS) • Peaks (sites of transcription factor binding or DNase hypersensitivity) • Amount of RNA for different types of RNA and in various cell lines • Promoter regions • Predicted enhancers • Semi-automated genome annotation (SAGA); summarization of chromatin accessibility, patterns of histone modifications, transcription factor binding, … • High Occupancy of Target (HOT) regions (regions in which a large number of different transcription-related factors bind) • Connectivity of transcription factors • Motifs (DNA binding sites) for transcription related factors • and more … The ENCODE Project Consortium (2012) Nature https://www.encodeproject.org/

89 ENCODE FANTOM5 database

Co-association between transcription factors FANTOM Cons., RIKEN PMI & CLST, et al. (2014) Nature 507:462-70

• Functional Annotation of the Mammalian Genome • http://fantom.gsc.riken.jp/5/ • Annotation of regulation, expression and function of mammalian genes • Promotor atlas, cell-type-specific TF • Tools for visualization and exploration • Based on systematic sampling of the distinct mammalian cell types (975 human and 399 mouse samples, including primary cells), tissues and cancer cell lines • RIKEN-led consortium The ENCODE Project Consortium (2012) Nature

FANTOM5 Human epigenome

Collapsed co-expression network of 4882 co-expression groups Roadmap Epigenomics Consortium, et al. (2015) Nature 518:317-30 [124,090 promoters across all primary cell types, tissues & cell lines]

• http://www.roadmapepigenomics.org/ • Map of – DNA methylation – Histone modifications – Chromatin accessibility – Small RNA transcripts • Considered locations: – Stem cells – Primary ex vivo tissues • Sites selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease ! Convenience control for such studies FANTOM Consortium, et al. (2014) Nature

Human epigenome Human epigenome

Profiled tissues and cell types Chromatin state annotation in 127 epigenomes

Roadmap Epigenomics Consortium (2015) Nature Roadmap Epigenomics Consortium (2015) Nature

90 STRING database COSMIC database

Franceschini, et al. (2012) Nucleic Acids Res 41:D808-D815 Forbes, et al. (2015) Nucleic Acids Res 43: D805-D811

• http://string-db.org/ • http://cancer.sanger.ac.uk/cosmic • Database of known and predicted • Catalog of somatic mutations in protein-protein interactions cancer (both direct [physical] and indirect • Two types of data: [functional] associations) – Manual curation data from peer • Based on: reviewed publications by – Genomic context COSMIC expert curators (aka non-systematic/targeted – High-throughput experiments screen data) – Co-expression – Systematic screen data: – Previous knowledge uploads from large scale • Builds upon numerous other genome screening databases publications and from other • >9,600,000 proteins from >2000 databases (TCGA, ICGC); organisms in 2015 unbiased molecular profiling of diseases

GenomeRNAi database

Horn, et al. (2007) Nucleic Acids Res 35:D492-7; Gilsdorf, et al. (2010) Nucleic Acids Res 38:D448-52; Schmidt, et al. (2013) Nucleic Acids Res 41:D1021-6

• http://www.genomernai.org/GenomeRNAi/ There are more databases… • Database containing phenotypes from RNA interference screens in Drosophila and Homo sapiens • Provision of RNAi reagents and their predicted quality.

Software MAPP

Stone & Sidow (2005) Genome Res 15:978-86

• Multivariate Analysis of Protein Polymorphism • http://mendel.stanford.edu/sidowlab/downloads /MAPP • Steps: 1. Multiple alignment of homologous sequences, phylogeny-weighted scores 2. Interpretation of scores by quantified physicochemical properties, yielding constraints on these properties for each variant

Schiffahrtsmuseum Brake, Germany 3. Create new feature space by PCA of all physicochemical properties 4. MAPP score: distance to the origin of the new feature space

91 GERP/GERP++ PhastCons

Cooper, et al. (2005) Genome Res 15:901-13; Siepel, et al. (2005) Genome Res 15:1034-1050 Davydov, et al. (2010) PLoS Comp Biol 6:e1001025. • Genomic Evolutionary Rate Profiling • Part of the PHAST (Phylogenetic Analysis with Space • http://mendel.stanford.edu/sidowlab/downloads/gerp /Time Models) package: – http://compgen.bscb.cornell.edu/phast/ • Identification of constrained elements by a deficit of – Engine behind the Conservation tracks in the UCSC Genome substitution events due to purifying selection Browser • Comparison of estimated evolutionary rates between • Aims at conservation scoring and identification of – individual alignment column (residue/variant) and conserved elements from multiple sequence alignment – a tree describing neutral substitution rates (ML-based phylogenic inference) • Predicting sequences as being conserved / not conserved – using a phylogenetic Hidden Markov Model (HMM) • Constraint regions exhibit fewer than expected changes – different values for branch length scaling parameter (average • RS score (metric of constraint): rejected substitutions substitution rate) in phylogenetic tree between both types • GERP++: additional aggregation of constrained sites into – Unsupervised learning without use of external information constrained sequences • Calculation of conservation score

PhastCons PhastCons: conservation score

Assumed tree topologies and branch length Conservation score: posterior probability that each site was generated from a conserved state in the phylo-HMM

LOD score: log-ratio of the likelihoods for a region under the conserved phylogenetic model compared to the nonconserved model Note 1: LOD – logarithm of the odds Note 2: This is not the LOD score from linkage analysis (although scaled in a similar way).

Conservation track in UCSC Genome Browser:

Siepel, et al. (2005) Genome Res Siepel, et al. (2005) Genome Res

PhyloP LogRE

Pollard, et al. (2010) Genome Res 20:110-121 Clifford, et al. (2004) Bioinformatics 20:1006-14

• phylogenetic P-values • http://compgen.bscb.cornell.edu/phast/ • http://lpgws.nci.nih.gov/cgi-bin/GeneViewer.cgi • Aims at detecting deviations from the • Prediction whether amino acid (AA) changes in conserved neutral rate of substitutions domains are likely to affect protein function – Conservation: less than under drift • Based on output of the HMMER/2/3 software (multiple – Acceleration: more than under drift sequence alignment using HMMs and profiles) and Pfam • Additionally allows for clade-specific differences in the phylogeny profiles (conservation in protein families) • Software implementation of four tests, • E-value in sequence alignment: expected number of including likelihood-ratio and score tests, sequences with an alignment score equal to or even more a number-of-substitutions test (SPH), and extreme than that of the observed sequence GERP • The conservation track of the UCSC • LogRE value: genome browser contains PhyloP scores log10 of ratio E(deviant AA) / E(canonical AA) (SPH p-values for deviation from drift).

92 SIFT SIFT: workflow

Kumar, et al. (2009) Nat Protoc 4:1073-81, and others

• Sorting Tolerant From Intolerant • http://sift.jcvi.org/ • Protein function prediction due to an AA substitution (nsSNP) • Based on – Multiple sequence alignment – Conservation with respect to functionally related protein sequences – Similarity between the alternate amino acids – No incorporation of protein structure • Output – Score: probability of substitution for being tolerated (i.e. values near 0 imply high probability for being deleterious) High sequence conservation in functionally related protein sequences – Qualitative prediction of being ‘tolerated’ or ’deleterious’ by thresholding ! nsSNP unlikely to be tolerated Kumar, et al. (2009) Nat Protoc

SIFT score SIFT score

Multiple sequence alignment of homologous amino acid (AA) sequences Thioredoxin(of(E.#coli#and(15(homologs( For a given position, calculation of the relative frequencies of the 20 AA at this position in the alignment, normalized by the maximum relative frequency

SIFT score: normalized probability of the observed AA (i.e. frequency of the observed AA relative to the most common AA at this position in the alignment)

SIFT score close to 0: Relative The observed AA almost never occurs at this position in the homologous f(G)=2/16 f(L)=1/16 f(K)=10/16 f(E)=1/16 f(Q)=2/16 sequences, indicating high conservation and a probably deleterious effect. frequency SIFT S(G)=2/10 S(L)=1/10 S(K)=10/10 S(E)=1/10 S(Q)=2/10

Kumar, et al. (2009) Nat Protoc Lesk((2014)(

PolyPhen-2 PolyPhen-2: workflow

Adzhubei, et al. (2010) Nat Methods 7(4):248-249

• http://genetics.bwh.harvard.edu/pph2/ • Prediction of the functional effects of an amino acid substitution on the structure and function of a protein • Naïve Bayes classifier based on – Sequence conservation – Chemical properties of amino acids – Protein structure – Sequence context • Output – Score: probability of substitution for being deleterious (i.e. values near 1 imply high probability) – Qualitative prediction of being ‘probably damaging', 'possibly damaging', 'benign’ or 'unknown' Adzhubei, et al. (2010) Nat Methods

93 SNP Effect Predictor (SEP) ANNOVAR (I)

McLaren, et al. (2010) Bioinformatics 26:2069-70. Wang, et al. (2010) Nucleic Acids Res 38:e164.

Predicted consequences • Annotation of SNVs in • http://annovar.openbioinformatics.org/ transcripts (i.e. coding • Widely used tool; builds upon numerous databases and sequence) many other tools • Part of Ensemble; annotation • Annotation of SNVs, InDels and CNVs based on Ensemble • Conversion utilities for numerous file types (including VCF) databases • Perl command line tool • Web-based tool and • Web-based access to some functionality via wANNOVAR Application Programme Interface (API, written in Perl) (http://wannovar.usc.edu/) available • Gene-based annotation: • http://www.ensembl.org/info – Identification of protein-coding changes /docs/api/ – Flexible use of many gene definition systems (RefSeq, UCSC, ENSEMBL, GENCODE, AceView, and others)

ANNOVAR (II) SnpEff

• Region-based annotation: Cingolani, et al. (2012) Fly 6:1-3 – Identification of conserved regions among 44 species, – Prediction of transcription factor binding sites, segmental • SNP Effect duplication regions, GWAS hits, database of genomic variants, • http://SnpEff.sourceforge.net/ ENCODE sites, ChIP-Seq peaks, RNA-Seq peaks, … • Annotation of SNVs, InDels, MNP (multiple nucleotide • Filter-based annotation: polymorphism) in coding sequence – Presence (and reported frequency) in specific databases (dbSNP, 1000 Genome, NHLBI-ESP 6500 exomes, ExAC, and others) • Multiple input file formats (VCF, mpileup, text) – Calculation of scores (e.g. SIFT, PolyPhen-2, LRT, MutationTaster, • Gene annotation has similar scope as in ANNOVAR MutationAssessor, FATHMM, MetaSVM, MetaLR, GERP++) • Integration with computational biology platform Galaxy • Other functionalities: (http://gmod.org/wiki/Galaxy) and GATK – Retrieval of nucleotide sequence in any user-specific genomic • Superseded ANNOVAR when integrated in GATK positions in batch – Candidate gene list for Mendelian diseases from exome data • Tool SnpSift for VCF file manipulaitn and filtering – and more

Condel FATHMM, FATHMM-MKL

González-Pérez & López-Bigas (2011) Am J Hum Genet 88:400-9 Shihab, et al. (2013) Hum Mutat 34:57-65; Shihab, et al. (2015) Bioinform.

• Consensus deleteriousness score of missense mutations • Functional Analysis through Hidden Markov Models • http://bg.upf.edu/condel • http://fathmm.biocompute.org.uk/ • Mulitple sequence alignment of homolous sequences • Prediction of functional consequences for both coding and • Weighted combination of five predictors: Logre, MAPP, non-coding SNVs Mutation Assessor, PolyPhen-2 and SIFT • Web service • Definition of different simple and averaged scores for the • Based on 0/1 prediction and the normalized scores of each of the – conservation of homologous sequences, protein domain five predictors functionality and pathogenicity (inferred from relative frequencies of disease-associated variants) • Combinations these derived scores used for classification – SVM using functional annotation from numerous ENCODE tracks of a variant being deleterious or neutral • Incorporates numerous databases, e.g. HGMD, UniProt, VariBench and SwissVar

94 Mutalyzer Mutation Assessor

Wideman, et al. (2008) Hum Mutat 29:6-13 Reva, et al. (2011) Nucleic Acids Res 39:e118

• https://mutalyzer.nl/ • http://mutationassessor.org/ • Checking sequence variant nomenclature according to the • Functionality predicted from inter-species conservation and guidelines of HGVSt (Human Genome Variation Society) known 3D structures • Some automated extraction of variant annotation • Somatic cancer mutations are additionally evaluated for • Web-based service recurrence, multiplicity and annotation based on the COSMIC database

Mutation Assessor: functional impact score (FIS) MutationTaster / MutationTaster2

Multiple sequence alignment of a large number of homologs for both Schwarz, et al. (2010) Nat Methods 7:575-6; protein families and subfamilies Schwarz, et al. (2014) Nat Methods 11:361-2

• http://www.mutationtaster.org/ • Web-based service, upload of VCF files • Prediction of functional consequences for amino acid residues conserved with subfamily, residues conserved substitutions (nsSNVs), intronic and synonymous SNVs vary between subfamilies across entire family and InDels and exon-intron border variants Strength of residue conservation: distributional entropy of alignment column • Bayes classifier trained on 1000G and HGMD Professional Conservation score: effect of mutation described as difference in residue • Integration of: 1000G, HapMap, ClinVar, HGMD Public, conservation ENCODE, JASPAR, PhyloP/PhastCons [conservation], Specificity score: conservation score within data-defined sequence subfamily NNSplice [splicing], …

Reva, et al. (2011) Nucleic Acids Res

VAAST VAT

Yandell, et al. (2011) Genome Res 21:1529-42 Wang, et al. (2014) Am J Hum Genet 94:770-83

Search procedure • Variant Annotation, Analysis, and Search Tool • Variant Analysis Tools • Annotation of amino acid cases controls • http://varianttools.sourceforge.net/ substitutions (coding sequence) and non-coding • Different gene set references • Likelihood-ratio test for disease • Presence in dbSNP, ExAC, 1000G, HapMap, database of association; aggregation of rare genomic variants, catalog of somatic mutations in cancer variants (similar to CMC • Prediction scores from dbNSFP database (SIFT, approach) PolyPhen, MutationTaster, and others) • Severity of SNVs assessed by comparison to OMIM • Conserved or duplicated regions • Scoring of non-coding and • Automatic annotation using ANNOVAR and SnpEff synonymous variants by use of • Many more tasks possible; coded in Python sequence conservation, OMIM, 1000 Genomes, ENCODE,

95 PROVEAN CADD

Choi, et al. (2012) PLoS ONE 7: e46688 Kircher, et al. (2014) Nat Genet 46:310-5

• Protein Variation Effect Analyzer • Combined Annotation - Dependent Depletion • http://provean.jcvi.org/ • http://cadd.gs.washington.edu/ • Annotation of the functional impact based on conservation of • Annotation of SNVs and InDels homologous protein sequences • Based on 63 partially different annotations (VEP, • Focus on InDels, multiple ENCODE, GERP, phyloP, TF binding, SIFT, PolyPhen, substitutions …) • Impact measured by Delta Score Δ: • Integration of numerous annotations into a single C score – defined as the difference in the alignment scores for the given • Assessment of the “deleteriousness” of a variant by protein and a homologous simulation sequence, average over many – Genome-wide simulation of de-novo germline variation without homologous sequences selection – Thresholding Δ for prediction – Comparison against fixed or nearly fixed derived alleles in humans (as compared to chimpanzee) with respect to annotation

CADD: C score CADD: typical C scores for SNVs

Support-vector machine (SVM) for distinguishing nearly fixed variation from simulated neutral variation (14.7x106 vs. 14.7x106)

SVM trained on 63 annotations and some selected interaction terms (but 949 features in the model due to dummy coding of categorical variables!)

Application to all 8.6 billion possible substitutions in GRCh37, yielding the distribution of the combined score from the SVM (C-score) for variants in the human reference genome

Phred-scaling of the rank of the score (scaled C-sore):

−10log10 (rank/total number of substitutions).

Comparison of the scaled C-score of a variant at hand against this distribution Example: A variant with a scaled C-score of 20 indicates that it is rank at 1% of the most deleterious substitutions in the human genome

Kircher, et al. (2014) Nat Genet Kircher, et al. (2014) Nat Genet

GWAVA SuRFR

Ritchie, et al. (2014) Nat Methods 11:294-6 Ryan, et al. (2014) Genome Med 6:79

• Genome-wide annotation of variants • SNP Ranking by Function R package • https://www.sanger.ac.uk/resources/software/gwava/ • http://www.cgem.ed.ac.uk/resources/ • Functional annotation of non-coding sequence variants • Annotation of non-coding variants • Integration of genomic and epigenomic annotations • Incorporation of 1000G, ENCODE, FANTOM5, Epigenome (1000G frequencies and ancestral allele calls, GERP Roadmap scores [conservation], several ENCODE tracks, TF • Prioritization of variants by a rank-of-ranks approach: binding motifs) • Classification of variants having a pathogenic effect or not r – ranks within annotation category, w – weight for category, R – overall rank via random forest, trained on HGMD and 1000G ij j • Three pre-trained weighting schemes available • Validation by application to the COSMIC database • Implemented as package for the R statistical language

96 There are more annotation tools… Performance of prediction

http://omictools.com/variant-annotation-c104-p1.html

Comparison of methods Comparison of methods Comparison of 1,100 common polymorphisms (1000G) and 1,100 known disease mutations (HGMD)

Schwarz, et al. (2014) Nat Methods Knecht & Krawczak (2014) Hum Genet

Comparison of methods Comparison of methods

Same software (ANNOVAR), different annotation databases Same annotation database (Ensembl), different annotation software [~81 million variant calls from 276 samples of immune disease & cancer cases; [~81 million variant calls from 276 samples of immune disease & cancer cases; from the WGS500 project (University of Oxford)] from the WGS500 project (University of Oxford)]

log10 number of variants

RefSeq annotation

Ensembl annotation McCarthy, et al. (2014) Genome Med McCarthy, et al. (2014) Genome Med

97 Correlation between different annotations Limits of in-silico functional prediction

14.7 millions human-derived alleles with ≥95% population frequency Back to square one: motivation for in-silico prediction Do these missense mutations Model organism experiments are actually cause the disease/ too costly for all observed phenotype at hand? missense mutations.

(A) Generation of random mutations in mouse pedigrees using ENU; Breeding to homozygosity and phenotyping of mice with 1 of 33 potentially disruptive de-novo points mutations in 23 essential immune system genes [already known to produce a fully penetrant detectable phenotype] (B) In vitro phenotyping (translational activity) of all possible TP53 mutations Prediction by PolyPhen-2, CADD, SIFT, GERP, MutAssessor &PANTHER Kircher, et al. (2014) Nat Genet Miosge et al. (2015) PNAS

Limits of in-silico functional prediction Limits of in-silico functional prediction

Predicted damage vs. experimentally measured activity for TP53 “The discordance between the predicted and actual effect of missense mutations revealed here creates the potential for many FP conclusions in clinical whole genome sequencing. … Hence, for interpretation of a clinical genome sequence at present, it is essential to measure experimentally the consequence of any missense mutation thought to be causal.”

… We conclude that for de novo or low-frequency missense mutations found by genome sequencing, half those inferred as deleterious correspond to nearly neutral mutations that have little impact on the clinical phenotype of individual cases but will nevertheless become subject to purifying selection.

Miosge et al. (2015) PNAS Miosge(et(al.((2015)(PNAS(

That’s it!

98