Annotating Gene Sequence Variation Watch for Multiple Transcripts!

Identifying functionally significant causal variants in Segregation data Functional data Annotating gene sequence variation Frequency Animal in Interpretation Models Controls Shamil Sunyaev Department of Biomedical Informatics Harvard Medical School Bioinformatics Broad Institute of M.I.T. and Harvard Map variants on genomic annotation Nonsense variants One of most significant types of variants usually leading to the complete loss of function. Watch for multiple transcripts! Nonsense variants are enriched in sequencing artifacts Watch for conflicting annotations! Important considerations: i) location along the gene, ii) does the variant cause NMD? iii) is the variant in a commonly skipped exon? Tool: LOFTEE Variants involved in splicing SpliceAI Variants in canonic splicing sites Variants in exonic or intronic splicing enhancers Gain of splicing variants Missense variants: computational Does the mutation fit the pattern of past predictions evolution? This image cannot currently be displayed. human VVSTADLCAPSSTKLDERA dog FVSTSELCAGSTTRLEERA fish FLSTSELCVPSTLKVNEKV Statistical issues: -sequences are related by phylogeny -generally, we have too few sequences Does the mutation fit the pattern of past evolution? Continuous time Markov model • We assume a constant fitness landscape: what is good for fish is good for human! GLY VAL ALA GLY ALA • We can estimate whether the mutation fits the pattern of amino acid changes. • We can also estimate rate of evolution at the amino acid site Continuous time Markov model Protein structure view P – matrix of transition probabilities 푃 푡 = 푒-/ p – stationary distribution • Most of pathogenic mutations are important for stability (good news?). • DDG is difficult to estimate. 푄휋1# 0 • Unfolded protein response pathway has to be taken into account. • Heuristic structural parameters help but less than comparative genomics. PolyPhen2 Weakly deleterious mutations • Multiple independent lines of evidence suggest abundance of weakly deleterious alleles in humans • Weakly deleterious variants may occur in highly conserved positions • Weakly deleterious alleles probably contribute to complex phenotypes but not to simple Mendelian phenotypes www.genetics.bwh.harvard.edu/pph2 Adzhubei, et al. Nature Methods 2010 Conservation can be due to very weak selection Constant evolutionary landscape Every new mutation eventually will be either fixed or lost AA s – selection coefficient b B Ne - effective population size For humans estimated to be ~ 10 000 1 . 0 K/K0 0 . 9 0 . 8 A 0 . 7 β 0 . 6 Complete Neutral 0 . 5 conservation behavior 0 . 4 0 . 3 0 . 2 0 . 1 0 . 0 10-6 10-5 10-4 10-3 Selection coefficient, s Epistatic interactions Compensatory mutations AA b B A β The phenomenon of compensatory Ridges on the fitness landscape mutations in different fields • Biochemistry – protein stability, allosteric effects • Genetics – incomplete penetrance • Evolutionary biology – speciation, epistatic models of evolution Dobzhansky-Muller incompatibility Looking at vertebrate species Many human disease mutations are found in How complex genetic suppression can be? vertebrates HumVar "Disease" LETTER doi:10.1038/nature12678 (22,207 variants) ClinVar "Pathogenic" (10,596 variants) Genetic incompatibilities are widespread within species 6,503 1 1 2,3 1 1,2,4 16,544 3,250 5.5-6.5% of presumably Russell B. Corbett-Detig , Jun Zhou , Andrew G. Clark , Daniel L. Hartl & Julien F. Ayroles 313 530 pathogenic human 2,100 mutations are detected in mammals Numerous Dobzhansky-Muller 24,304,185 incompatibilities in fly population Found in MultIZ 100-Way alignment (24,307,128 variants) How complex genetic suppression can be? Model: accumulation of neutral variants Variant of interest time x 1 change x x 2 changes x x x 3 changes Stability x x x 4 changes 噣 time xx x x x x n changes Disease variants do not fit the Poisson Exponential fit expectation Model: accumulation of disease variants Model: Erlang distribution target size m Suppressors Disease variant time x 1 change . .01 00/ x x 2 changes 휆 푡 푒 퐿 푘, 푡, 휆 = 푘 3 1 5 x x x 3 changes x x x 4 changes 噣 xx x x x x n changes k out of m required for suppression Model: fit for target size and number of Evolutionary model of cis- compensatory changes complementation 1 suppressor required 2 suppressors available C“Pathogenic” ! Compensation! D mutation is lost is lost human human macaque macaque mouse mouse rabbit rabbit Compensation! “Pathogenic” ! is ﬁxed mutation is ﬁxed Model Summary Predicting and testing suppressors • ~5-6% of human disease mutations have potential suppressors (i.e. are present in another mammalian species) • In most cases, one large-effect suppressor is sufficient, out of only 1-2 available x • These values allow simple experiment to identify suppressors Zebrafish model Experiment interpretation • Model of Bardet-Biedl Normal Syndrome (obesity, renal Human gene with failure, vision loss) No injection disease mutant • Caused by defects in primary cilium Class I Double mutant Knockdown (no suppression) • Embryonic convergence / extension phenotype in Rescue with Double mutant zebrafish human gene (full suppression) Class II • Easily scorable phenotype Images: Phoebe Images: Phoebe Liu Liu Bardet-Biedl syndrome – BBS4 N165H Bardet-Biedl syndrome – BBS4 N165H *** *** Severely % % of embryos Affected Affected % % of embryos Normal Double mutants Double *** p < mutants 0.0001 Out of 9 candidates (7 shown here), 1 complete rescue Figure: Stephan Frangakis Figure: Stephan Frangakis Bardet-Biedl syndrome – RPGRIP1L R937L A newly identified gene Clinical features Phe193Le Global developmental delay u ** ** microcephaly feeding issues - - failure to thrive 001 002 abnormal muscle tone low immunoglobulins Severely Affected frequent respiratory infections Affected % of embryos Normal Clinical testing normal female microarray metabolic testing – negative * p < 0.05 extensive genetic testing – Mut WT ** p < 0.001 negative - ctrl + ctrl + Double mutants - -- BTG2 TTN 003 004004 De novo Compound het NOS2 LAMA1 Out of 32 candidates, 1 complete rescue De novo Compound het Figure: Stephan Frangakis StephanErica Frangakis Davis BTG2 is the disease driver The mutation is a reversal to the mammalian+ ancestral state ef BTG2 R80 L128 Q140 V141 L142 H. sapiens RLQVL P. troglodytes ••••• G. gorilla ••••• M. musculus KV •MM R. norvegicus KV •MM H. glaber •V•MM S. domesticus KV •MM B. primigenius KV •MM E. ferus caballus KV •MM F. catus KV •MM C. lupus familiaris KV •MM D. novemcinctus KV •MM Microcephaly G. gallus KP•MM êNeuronal expr. êcell proliferation BTG2 has two compensatory mutations Methods PolyPhen2 SIFT * * LRT MutationTaster Number of cells/embryo FatHMM SNPs3D DeepSequence *P<0.01 vs V141M rescue Injection alone Stephan Frangakis Umbrella methods Incorporating regional constraint Condel CCR REVEL M-CAP CADD PrimateAI M-CAP Regulatory variants • Regulation: variants in promoters, enhancers, silencers, insulators Non-coding variants Chromatin accessibiliy Chromatin modification Epigenomics Why do we think that non-coding variation is of importance? Regions and individual nucleotides conserved along phylogeny show signals of purifying selection in humans Epigenetic studies report many well-localized regulatory marks GWAS signals are predominantly located in non-coding regions Non-disease alleles of large effect Ultraconserved elements PLoS BIOLOGY Deletion of Ultraconserved Elements Yields Viable Mice Eye color Nadav Ahituv1,2¤, Yiwen Zhu1, Axel Visel1,AmyHolt1, Veena Afzal1, Len A. Pennacchio1,2, Edward M. Rubin1,2* 1 Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America, 2 United States Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America Ultraconserved elements have been suggested to retain extended perfect sequence identity between the human, mouse, and rat genomes due to essential functional properties. To investigate the necessities of these elements in vivo, we removed four noncoding ultraconserved elements (ranging in length from 222 to 731 base pairs) from the mouse genome. To maximize the likelihood of observing a phenotype, we chose to delete elements that function as enhancers in a mouse transgenic assay and that are near genes that exhibit marked phenotypes both when completely inactivated in the mouse and when their expression is altered due to other genomic modifications. Remarkably, all four resulting lines of mice lacking these ultraconserved elements were viable and fertile, and failed to reveal any critical abnormalities when assayed for a variety of phenotypes including growth, longevity, pathology, and metabolism. In addition, more targeted screens, informed by the abnormalities observed in mice in which genes in proximity to the investigated elements had been altered, also failed to reveal notable abnormalities. These results, while not inclusive of all the possible phenotypic impact of the deleted sequences, indicate that extreme sequence constraint does not necessarily reflect crucial functions required for viability. Lactase tolerance Enrichment of GWAS signals in Enrichment of GWAS signals in putative regulatory elements putative regulatory elements Partitioning heritability GWAS SNPs co-localize with eQTLs Trait-Associated SNPs Are More Likely to Be eQTLs: Annotation to Enhance Discovery from GWAS Dan L. Nicolae1,2,3, Eric Gamazon1, Wei Zhang1, Shiwei Duan1¤, M. Eileen Dolan1,2, Nancy J. Cox1,2* Whole blood The Genotype-Tissue Expression ) 5 p Crohn’s (GTEx)

Annotating Gene Sequence Variation Watch for Multiple Transcripts!

Complexity of a Small Non-Protein Coding Sequence in Chromosomal Region 22Q11.2: Presence of Specialized DNA Secondary Structures and RNA Exon/Intron Motifs Delihas

A Different View on DNA Amplifications Indicates Frequent, Highly Complex, and Stable Amplicons on 12Q13-21 in Glioma

Variation in Protein Coding Genes Identifies Information Flow

Accurate Prediction of Kinase-Substrate Networks Using

Gnomad Lof Supplement

Table S1. 103 Ferroptosis-Related Genes Retrieved from the Genecards

Systematic Detection of Brain Protein-Coding Genes Under Positive Selection During Primate Evolution and Their Roles in Cognition

Whole Genome Analyses of a Well-Differentiated Liposarcoma Reveals Novel SYT1 and DDR2 Rearrangements

Mouse Tsfm Knockout Project (CRISPR/Cas9)

Annotation of Functional Variation Within Non-MHC MS Susceptibility Loci Through Bioinformatics Analysis

A Multi- Tissue Transcriptomic Network META- Analysis Rosa Faner1* , Jarrett D

Hipsc-Derived Cardiomyocyte Model of LQT2 Syndrome Derived