! ! ! ! ! ! ! ! This! is! copyrighted! material.! Please! do! not! distribute! or! use! other! than! for! personal! purposes! without! the! permission! of! Suzanne!Leal!or!Michael!Nothnagel.!
i ii Table&of&Contents&
Monday' '...... '1' Introduction*to*finding*pathogenic*variants*using*NGS*data* *...... *1* Introduction*to*the*basics* *...... *3* Calculation*of*LOD*scores* *...... *13* Cloud*computing* *...... *20*
Tuesday' '...... '25' Penetrance* *...... *25* Software*to*perform*linkage*analysis* *...... *28* Genotyping*error*detection*using*Merlin* *...... *33* Homozygosity*mapping* *...... *35*
Wednesday' '...... '39' File*formats*for*sequence*data* *...... *39*
Thursday' '...... '49' Filtering*approaches*for*the*analysis*of*NGS*data* *...... *49* Association*analysis*for*Mendelian*traits* *...... *63* Performing*linkage*analysis*using*sequencing*data*...... *69*
Friday' '...... '73' Evaluating*power*using*simulation*studies* *...... *73* Functional*studies* *...... *80* Variant*annotation* *...... *85* *
iii iv Analysis of Family Data Via Filtering Strategies Samples in a Family Undergoing NGS Two or More Affected & Unaffected One Affected Introduction to Finding Pathogenic Affected Individuals Individuals Individual Variants Using NGS Data: Exclude variants which are not shared by affected Linkage Analysis and Filter Approaches individuals & (present in unaffected individuals )
Suzanne M. Leal [email protected] Exclude non-coding variants and & coding variants which are found Center for Statistical Genetic in databases (ExAC) or are not rare e.g. >0.5% Baylor College of Medicine https://www.bcm.edu/research/labs/center-for-statistical-genetics Test for segregation of identified variants with disease phenotype & sequence variants in ethnically matched controls
This Strategy can Fail! Performing Linkage Analysis • DNA samples from all informative pedigree None of the variants completely segregate with members are genotyped using arrays disease status • Parametric two-point and multipoint linkage Ø Affected individuals are phenocopies or incorrectly diagnosed analysis performed Ø Unaffected individuals are disease variant carriers (reduced D • For consanguineous pedigrees segregating penetrance) autosomal recessive traits Ø Sample swaps have occurred – Homozygosity mapping can also be used Ø Locus heterogeneity within the pedigree Homozygoisty Linkage Analysis Homozygosity Mapping
++ D+ Region of ++ D+ mapper score ++ D+
Phenocopy Reduced Penetrance Sample Swaps Homozyosity
Benefits of Performing Linkage Analysis Trouble Shooting Using Linkage Analysis Using Genotyping Arrays • Linkage analysis can be performed using genotyping arrays or sequence data • Aids in selection of individuals for sequencing • Observed LOD scores compared to • Maps the disease locus to specific genomic – Expected maximum LOD (EMLOD) region(s) – Maximum LOD (MLOD) • Filtering can be performed within several Mb, i.e. • Deflated LOD scores can be due to – Incorrect phenotype information linkage region, instead of the entire genome – Locus heterogeneity within the pedigree – Reducing the number of variants which need to be • Genotypes can also aid in detection of incorrect followed-up familial relationships • Testing for segregation in pedigrees • Evaluating frequencies in ethnically matched controls
1 Non-syndromic Hearing Impairment (NSHI)
• 893 NSHI families ascertained REPORT American Journal of Human Genetics – Pakistan, USA, Switzerland, Turkey, Jordan, Hungry Mutations in KARS, Encoding Lysyl-tRNA (Roma), Poland & Germany Synthetase, Cause Autosomal-Recessive Nonsyndromic Hearing Impairment DFNB89
• Intra-familial heterogeneity in the collection Regie Lyn P. Santos-Cortez,1,8 Kwanghyuk Lee,1,8 Zahid Azeem,2,3 Patrick J. Antonellis,4,5 Lana M. Pollock,4,6 Saadullah Khan,2 Irfanullah,2 Paula B. Andrade-Elizondo,1 • 15.3% (95% CI 11.9 - 19.9%) Santos-Cortez et al. 2015 EJHG Ilene Chiu,1 Mark D. Adams,6 Sulman Basit,2 Joshua D. Smith,7 University of Washington Center for Mendelian Genomics, Deborah A. Nickerson,7 Brian M. McDermott, Jr.,4,5,6 • Linkage analysis followed by exome sequencing led Wasim Ahmad,2 and Suzanne M. Leal1,* to the identification of a number of NSHI genes Previously, DFNB89, a locus associated with autosomal-recessive nonsyndromic hearing impairment (ARNSHI), was mapped to chromo- somal region 16q21–q23.2 in three unrelated, consanguineous Pakistani families. Through whole-exome sequencing of a hearing- – impaired individual from each family, missense mutations were identified at highly conserved residues of lysyl-tRNA synthetase KARS (Santos-Cortez et al. 2013 AJHG) (KARS): the c.1129G>A (p.Asp377Asn) variant was found in one family, and the c.517T>C (p.Tyr173His) variant was found in the other two families. Both variants were predicted to be damaging by multiple bioinformatics tools. The two variants both segregated with the – ADCY1 (Santos-Cortez et al. 2014 Hum Mol Genet) nonsyndromic-hearing-impairment phenotype within the three families, and neither mutation was identified in ethnically matched controls or within variant databases. Individuals homozygous for KARS mutations had symmetric, severe hearing impairment across – TBC1D24 (Rehman et al. 2014 AJHG) all frequencies but did not show evidence of auditory or limb neuropathy. It has been demonstrated that KARS is expressed in hair cells of zebrafish, chickens, and mice. Moreover, KARS has strong localization to the spiral ligament region of the cochlea, as well as to Deiters’ cells, the sulcus epithelium, the basilar membrane, and the surface of the spiral limbus. It is hypothesized that KARS variants affect ami- noacylation in inner-ear cells by interfering with binding activity to tRNA or p38 and with tetramer formation. The identification of rare KARS variants in ARNSHI-affected families defines a gene that is associated with ARNSHI.
Hearing impairment (HI) affects nearly 300 million people ilies. The knowledge that has been gained from functional, of all ages globally and increases in prevalence per decade expression, and localization studies after the identification of life.1 Children and adults with bilateral, moderate-to- of genes with mutations that cause NSHI has immensely profound HI have a poorer quality of life, which encom- expanded our understanding of inner-ear physiology. passes not only problems in physical function but also Previously, an ARNSHI-associated locus, DFNB89, was socioemotional, mental, and cognitive difficulties.2,3 In mapped to chromosomal region 16q21–q23.2 in two unre- particular, children with congenital HI must be identified lated, consanguineous Pakistani families.6 The two fam- and habilitated within the first 6 months of life so that de- ilies, 4338 and 4406 (Figures 1A and 1B), had maximum lays in theRare Homozygous Variants in the acquisition of speech, language, and reading multipoint LOD scores of 6.0 and 3.7, respectively. The ho- DFNB89 Locus (16q21-q23.2) skills can be prevented.4 mozygosity regions that overlap in the two families led to Among children with congenitalDFNB89 Region sensorineural HI, more the identification of a 16.1 Mb locus (chr16: 63.6–79.7 Mb) Bilateral symmetric moderate-to-profound SNP genotyping (Illumina linkage panel) than 80% do not display syndromic features and ~60% that includes 180 genes. Additionally, a third consanguin- hearing impairment across all frequencies Region of homozygosity 16.1 Mb: have a family history of HI or a confirmed genetic etiol- eous Pakistani family, 4284 (Figure 1C), was identified, and ogy.5 Because of the complex cellular organization of the showed suggestive linkage to the DFNB89 region with a 4338 (LOD 6.0) Containing ~180 genes inner ear, hundreds of genes and proteins are predicted maximum multipoint parametric LOD score of 1.93. For Familyto influence auditoryGene mechanisms. ToVariant date, for nonsyn- familyFrequency 4284, linkage analysisDamaging* was performed for ~6,000 dromic HI (NSHI), about 170 loci have been localized SNPExAC markers that were genotyped across the genome and mutations in ~75 genes have been identified in hu- with the Illumina Linkage Panel IVb. 4284 (LOD 1.9) 4406mans (Hereditary HearingCOG4 Loss Homepage).p.Ile271Val Of the gene Consanguineous0.0005 familiesMT, LRT 4284, 4338, and 4406 from variants that have been implicated in NSHI, almost 60% Pakistan are affected by ARNSHI, which was found to 4406are autosomal recessiveZFHX3 (AR) in inheritance,p.Pro1929Ser and 95% of segregate0.0005 with unique haplotypesNone within the DFNB89 lo- the genes that harbor mutations that cause ARNSHI were cus (Figures 1A–1C). From the medical history, no other 4406,initially4284 mapped andKARS identified in consanguineousp.Tyr173His fam- risk0.00002 factors were identified asAll a possible cause of HI. For
43381Center for Statistical Genetics,KARS Department ofp.Asp377Asn Molecular and Human Genetics, Baylor College0 of Medicine, Houston,All Texas 77030, USA; 2Department of 4406 (LOD 3.7) Biochemistry, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan; 3Department of Biochemistry, Azad Jammu Kashmir Medical College, Muzaffarabad, Azad Jammu and Kashmir 13100, Pakistan; 4Department of Otolaryngology Head and Neck Surgery, Case Western Reserve 4338University, Cleveland, OHCNTNAP 44106, USA; 5Departmentp.Ala1235Thr of Biology, Case Western Reserve0.00002 University, Cleveland, OH 44106,MT, LRT USA; 6Department of Genetics and Genome Sciences, Case Western Reserve University, Cleveland, OH 44106, USA; 7Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA 4 8These authors contributed equally to this work *Correspondence: [email protected] http://dx.doi.org/10.1016/j.ajhg.2013.05.018. Ó2013 by The American Society of Human Genetics. All rights reserved. One individual from each family *Bioinformatics Tools: CADD, LRT, MutationAssessor, MutationTaster (MT), PolyPhen-2, SIFT selected for exome sequencing All variant sites were deemed to be conserved (PhyloP & GERP) based upon region of homozygosity Basit et al., Hum. Genet. 2010 132 The American Journal of Human Genetics 93, 132–140, July 11, 2013
KARS Variants Segregate with HI in Analysis of Family Based Data (Mendelian) DFNB89 Families Genotype Informative family Pedigree Members Perform Linkage Analysis KARS encodes lysyl- Select Pedigree Member(s) for Sequencing tRNA synthetase Remove Variants Which are not Rare in ExAC, e.g. MAF> 0.5%
Investigate Functionality using Bioinformatic Tools
Determine if Variant Segregates with Phenotype
Population Specific Frequencies for Variant
Acquire Additional Families with Variants with the Same Gene p.Try173His & pAsp377Asn not observed in 750 ethnically matched Pakistani chromosomes
2 Goal of Linkage Studies § To localize disease/trait/susceptibility loci to a unique position on the genome § Only family data can be used to carry out linkage Introduction to the Basics studies § Extend families § Pedigrees with multiple branches and/or multigenerational Suzanne M. Leal § Nuclear Families [email protected] § Parents and offspring Center for Statistical Genetic, Baylor College of Medicine https://www.bcm.edu/research/labs/center-for-statistical-genetics § Trios (parents and proband) cannot be used for www.statgen.us linkage studies § Suitable for association studies
Copyrighted © S.M. Leal 2015
Types of families for Linkage Analysis Linkage Analysis & Homozygosity Mapping Extended Pedigrees • Can be used to reduce the region to be followed up with sequencing – Thus greatly reducing the number of variants Nuclear Pedigrees – May lead to identification of the causal variant where other approaches have failed • Genotype all available informative families member to perform linkage analysis/homozygosity Trios mapping
Chromosome in meiosis with two crossovers Parametric Linkage Analysis • For Mendelian traits – Mode of inheritance must be known – Autosomal Recessive – Autosomal Dominant – X-linked – Trait can have reduced penetrance or phenocopies
Two homologous Four gametic chromosomes, each Two products (egg, crossovers with two chromatids sperm cells)
3 Linkage Analysis – Allele Sharing Methods Parametric Linkage - Analysis • Also known as nonparametric or model free • Goal method – To test whether there is linkage between a – Neither nonparametric or model free disease locus and a marker or set of marker loci – Mode of Inheritance does not need to be known – Null hypothesis • Complex traits • No linkage - recombination fraction (θ=0.5) – Underlying genetic model is not specified in the analysis – Recombination rate 50% » Disease locus and marker locus/loci far apart – Loci on two different chromosomes
Parametric Linkage - Analysis Linkage Analysis - Allele Sharing Methods § • Alternative hypothesis Compare the amount of allele sharing between • linkage θ<0.5 § Affected Sibling – Wish to reject the null hypothesis of no linkage § Other affected relative pairs • Use a LOD score criterion of 3.3 (p<0.05) § Avuncular § e.g. uncle-Niece § Cousins – Estimate the recombination fraction (genetic distance) between the disease and the marker loci
Linkage Analysis - Allele Sharing Methods Polymorphisms & Variants § Variety of tests to elucidate if there is an § Polymorphism excess of allele sharing § A region of the genome that varies between individual members of a population § Mean test § Usually with a frequency of at least 1 or 5% § Null hypothesis § Variant or mutation § Under no linkage § An alteration in a genome compared to some reference § Affected siblings share 50% of their alleles state § Does not have to be causal or functional § Alternative Hypothesis § Types of Variants § Under linkage § Pathogenic § Affected siblings share > 50% of their alleles § Of unknown significance § Benign
4 Loci & Alleles Loci & Alleles
§ Locus: A specific position on the genome § Codominant § For the autosomes 2 alleles are observed at each § Both alleles are expressed in the heterozygous state locus § Dominant § Alleles: Are alternative forms of DNA § Expression is the same in heterozygous as in the homozygous state sequence that occur at a locus § The homozygous state can sometimes produce a more severe § e.g. the A, B, 0 alleles of the AB0 gene phenotype than the heterozygous state § Homozygous lethal § Recessive § Homozygous state is necessary for expression
Hardy Weinberg Equilibrium (HWE) HWE • For the autosomes the proportion of each • The organism is diploid genotype follows the laws of HWE • Reproduction is sexual – p2, 2pq & q2 • Generations are non-overlapping • Which is based upon the observed allele • Mating is random frequencies • Population size is very large • Migration is negligible • Mutation can be ignored • Natural selection does not affect the alleles under consideration
HWE HWE
Example 2 allelic system (e.g. SNP marker) The following genotype counts are observed
Allele 1 frequency p= (2N11 + N12)/2N Observed Expected 1 1 300 ? Allele 2 frequency q= (2N22 + N12)/2N 1 2 500 ? 2 2 200 ? Expected proportions of heterozygotes and homozygotes under HWE Allele frequencies 2 1 1= p 1 allele: p=(600+500)/2000=0.55 1 2 = 2pq 2 allele: q=(500+400)/2000=0.45 2 2 = q2 Note q=(1-p)
5 HWE HWE
2 Expected genotype frequencies under HWE 2 (observed-Expected) Χ =Σ Expected 11 p2=0.3025 12 2pq=0.495 Observed Expected 22 q2= 0.2025 1 1 300 302.5 1 2 500 495.0 2 2 200 202.5 Expected number of genotypes under HWE* Χ2= (300-302.5)2/302.5+(500-495)2/495+(200-202.5)2/202.5=0.102 11 302.5 12 495 2 22 202.5 Χ = 0.102 p=0.75 1 df *For a sample size of 1,000 individuals
Testing for deviations in HWE Reasons for Deviation from HWE
• Chi-square tests • Population Admixture • Exact tests • Heterozygous Advantage • Likelihood ratio tests • Copy number variants • Genotyping Error • Chance
Loci, Genotypes & Haplotypes Locus, Genotype & Haplotype § Multiple marker on a chromosome Genotypes are known Chromosome1 § Microsatellites Genotype for Locus A 1 1 § Single nucleotide polymorphisms (SNPs) Locus A: 1 1 Locus B 2 2 Locus C 1 2 § Single nucleotide variants (SNVs) Locus B: 2 2 Locus D 1 2 § The haplotype for each Genotype chromosome of a pair § The two alleles at a locus comprise a genotype usually needs to be § Haplotype reconstructed Haplotypes for § The alleles on each chromosome Locus A & B: 1 2, 1 2 Locus C & D: 1 1, 2 2 or 1 2, 2 1
6 Linkage Studies-Genetic Maps Genetic Maps
• A map provides the position and order of marker • Map distance given in Centimorgans (cM) loci • Recombination (Θ) fractions cannot be • Physical position added • Genetic position – Except in the case of complete interference – Based upon interpolation for SNV (single nucleotide variant) x=Θ and SNP (single nucleotide polymorphism • Under complete interference multiple crossovers • Genetic position necessary to perform multipoint between two loci can be excluded linkage analysis – It can also be assumed there is complete interference when two loci are closely linked (θ<0.05) » Then recombination fractions can be added
Genetic Maps Haldane Map Function (Haldane 1919)
• Can convert Θ to map distances using • – Map functions Assumption that crossovers in different intervals • Haldane occur according to a Poisson probability law • Kosombi – Note x is given in Morgans • Sturt – The distances can then be summed • X = -1/2 ln(1-2θ) if 0 < θ < ½ • { No one-to-one correspondence between map • infinity otherwise, distance and number of base pairs – Recombination events variable across the genome • The Inverse is θ = ½[1-exp(-2|x|)]
Genome Scan Data (Marker Loci) for Genetic Maps Linkage Analysis • Most SNP and SNVs are not on genetic maps • Microsatellite Marker loci • Physical position and order known – Not currently usually used – Unknown genetic map distance • Genotyping Arrays • Using Genetic Maps such as • SNP and SNV marker Loci • Rutgers Combined Linkage-Physical Map • Exome and whole genome sequence data – http://compgen.rutgers.edu/mapinterpolator • SNV and SNP marker loci • Interpolation can be used to estimate the genetic distance of markers to perform linkage analysis
7 Microsatellite Markers Heterozygosity (H) § For the most part have been replaced by § Provides information on what proportion SNP marker loci of individuals that will be heterozygous for § Microsatellite markers have many alleles a particular marker locus § Heterozygosity >0.71 § Assumption Hardy Weinberg Equilibrium § Usually denoted by a D# H=1-Σp 2 § Linkage whole genome scans i § 10 cM scan § ~400 marker loci § 5 cM whole genome scan § ~800 marker loci
SNP and SNV Marker loci SNP and SNV Marker loci
• Most commonly used markers for linkage • Denoted by an rs# analysis are SNP loci • SNP have a minor allele frequency (MAF) of >5% – Base change at a single nucleotide – Can also be defined as having a MAF >1% • Most have only two alleles (diallelic) • SNVs have a MAF < 1% – But can have up to four alleles – Usually diallelic but can have up to four alleles • Those which have more than two alleles are not used • Heterozygosity <0.5
Genotyping Arrays SNP/SNV Marker Loci* Genotyping Arrays § Illumina HumanCore-24 Bead Chip § ~300,000 SNP marker loci • Higher density arrays are overkill for linkage § Up to a additional 300,000 custom markers analysis § Illumina HumCoreExome-24 Bead Chip – A subset of informative markers can be used § ~300,000 SNP marker loci § ~240,000 Exome marker loci • e.g. ~0.20cM § Up to an additional 100,000 custom markers • Once linkage has been established denser maps of markers § Illumina HumanOmni5-Quad can be analyzed within the linkage region § ~4.2 Million SNP/SNV marker loci • Using entire set of markers extremely slow to § Up to an additional 500,000 custom markers analyze § Illumina HumanOmni5Exome – § ~4.5 Million SNP/SNV marker (including exome content) May not be able to complete linkage analysis within a § Up to an additional additional 200,000 custom markers reasonable amount of time *These arrays were all developed for association studies- the Illumina linkage array has been discontinued
8 Features of Mendelian Traits Heterogeneity • Allelic Heterogeneity • Non-Allelic/Locus/Linkage Heterogeneity – Multiple separate alleles at the same locus are • Allelic heterogeneity responsible for the disease phenotype • Cystic fibrosis • Phenocopies • Non-allelic/Locus/Linkage Heterogeneity • Reduced penetrance – Different individual genes are responsible for disease • Age specific reduced penetrance etiology • Charcot-Marie-tooth disease • Adult polycystic kidney disease (APKD) • Non-syndromic hearing loss
Phenocopies Phenocopies § Traditional definition § The term phenocopy (although used § An environmentally induced phenotype that incorrectly) is also used to describe resembles the phenotype produced by a § Genetic heterogeneity mutation § An individual(s) within a pedigree which is affected § Examples due to a different gene than the other pedigree members § Individuals taking meperidien which is tainted § E.g. BRCA1 families with breast cancer patients with out a BRCA1 with its by product MPTP variant § Causes the destruction of dopaminergic neurons and § Misdiagnosed cases within a pedigree produces a Parkinson disease phenotype § Alzheimer’s disease pedigrees with cases of dementia which are not Alzheimer’s disease § Epilepsy due to traumatic brain injury
Reduced Penetrance Familial & Founder Effect § Age specific • Familial § Sex specific/Sex limited – Any trait which is more common in relatives of an § Exposure specific affected individual than in the general population § Incomplete penetrance – Can be genetic or environmental or both § A proportion of disease gene carriers never • Prion disease Kuru develop the phenotype • Founder Effect – A high frequency of a disease allele in a population § Can reduce the power of detecting linkage founded by a small ancestral group due to one or more founders being carriers of this allele § Unaffected individuals below the age of onset provide no linkage information
9 Assortative Mating Epistasis & Pleiotropy • Selection of mate with preference to a certain • Epistasis phenotype/genotype (that is non-random mating) – Interaction between alleles at two different loci – Positive • Pleiotropy • preference for a mate with the same phenotype – Multiple phenotype effects of a single gene – Negative • Example Marfans Syndrome • Preference for a mate with a different phenotype
Mendelian Traits Pedigree Drawings -Symbols
§ Modes of Inheritance
§ Autosomal Dominant Inheritance Affected male Affected male proband § Autosomal Recessive Inheritance
§ X-linked Inheritance unaffected female Deceased affected female § Dominant § Recessive Sex unknown affected
Male affected with two traits
Pedigree Drawings Phenotype Quantitative & Qualitative Traits • Quantitative trait –Continuous founder – Dichotomizing based upon an arbitrary or clinical cut-off
Non founder • Can lead to loss of power (due to misclassification)
• Qualitative – binary disease trait –Affected or unaffected Dizygotic twins Monozygotic twins
10 Quantitative Trait - Example BMI Autosomal Dominant Pedigree
Autosomal Dominant Mode of Inheritance Autosomal Dominant Mode of Inheritance § On average 50% of the children from an • If trait is fully penetrant with no phenocopies the heterozygous affected individual will also be following is true: heterozygous for the disease allele and affected • Each affected individual carries at least one copy § 100% of all children of an affected homozygous of the disease/trait allele individual will be affected • Each unaffected individual must be homozygous § Equal number of males and females affected: wild type § There are exceptions • If an affected individual has an unaffected parent § e.g. sex limited traits they must by heterozygous for the disease/trait § Affected (heterozygous) and unaffected individuals allele provide equal linkage information
Consanguineous Autosomal Recessive Autosomal Recessive Pedigree with Pedigree Unrelated Parents
&AMILY