Diagnosing the Undiagnosed: Exploring the Bioinformatics of Extreme Precision Medicine

Anika, Eduardo, Laurie, and Layla
BMIF 201 Final Project Presentation

Background

[Figure: flow diagram with labels Referral, Accepted, UDN Nodes, Clinical Phenotyping, WES or WGS, Multidisciplinary Specialist Team, Causal, ?, Treatment Plan]

Background

- National Institutes of Health Common Fund
- Patient care motive: ameliorate rare disease burden (largely Mendelian)
- Research motive: human population = saturated genome mutagenesis!

Extreme Precision Medicine

[Figure labels: Time, Money, Information, Expertise]

Background

[Figure: UDN Pipeline. Sequencing Data → BGM Pipeline → Candidate Variant List → Expert Medical Team (marked "!!!") → Causal Gene + Treatment Plan]

[Figure: pipeline stages, spanning Variant Identification and Variant Evaluation]

Data Pre-Processing
- Input unmapped reads (BAM, FASTQ)
- Map to ref genome
- Base quality scores

Variant Discovery
- Rare variant callers (SNPs + Indels, bayes de novo dom)
- Joint calling (GATK haplotype caller)
- CNV tools (SvABA or MoChA)

Variant Annotation
- Callability analysis
- Ancestry analysis
- Functional annots
- Allele frequency
- Known disease (ClinVar, HGMD)

Motivation

[Figure: BGM Pipeline → Candidate Variant List → "!!!" → Causal Gene]

This process is:
- Time-intensive
- Resource-intensive
- Dependent upon expert-level knowledge
- Subjective
- Variable
- Limited

Project Aims

Can we formalize the variant evaluation process?

Can we automate the variant evaluation process?

Can we expand the variant evaluation process?

[Figure: Candidate Variant List → "!!!" → Causal Gene]

Project Data

Whole Genome Sequencing

- Cases 344 and 343: Can we formalize the variant evaluation process? Can we automate the variant evaluation process?
- Cases 72 and 146: Can we expand the variant evaluation process?

Project Data: Unsolved Cases

[Figure: pedigrees. Case 344: 4-year-old male with two unaffected parents. Case 343: 11-year-old male with one unaffected parent and one parent marked "?"]

Case 344:
● Global developmental delay
● West syndrome (infantile seizures)
● Hypotonia
● Absent olfactory bulbs
● Significant intellectual disability
● Dysphagia
● Submucous cleft palate
● Significantly disordered sleep

Case 343:
● Asthma
● Dysmorphic features
● Episodic chorea
● Chronic diarrhea
● Persistent craniopharyngeal canal
● Adrenal insufficiency
● Low IgG

Approach: Can we formalize the variant evaluation process?

- Which resources are used?
- What information is taken from those resources?
- How can we convert subjective data into quantified or categorical values?

- Variant Validation
- Gene-Level Curation
- Variant-Level Curation

Variant Validation

- Called correctly? Binary YES/NO
  - Do alleles from trio actually match the searched-for inheritance pattern?
- Genotype Quality? Three (two for duo) numeric values
  - Note if genotype quality < 99
- Alignment plausible? 1) Categorical Good/OK/Bad, 2) Text entry: description of why OK or Bad
  - Make sure alignment of reads looks legitimate
  - Variant is present consistently in reads
  - Variant present at differently-located reads
  - Variant present in reads going in both directions (no strand bias)
  - No other “variants” in our variant-containing reads, since those might suggest the reads would map better to another location
  - Same for parents as applicable
- Hemizygote frequency in gnomAD (for ChrX variants)
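The genotype-quality and inheritance-pattern checks above could be recorded programmatically. The following is a minimal sketch, not the team's actual tooling: the `Call` record, the function names, and the pattern labels are all hypothetical, and only two inheritance patterns are covered (X-linked and compound heterozygous cases would need their own rules).

```python
from dataclasses import dataclass

GQ_FLAG_BELOW = 99  # threshold named on the slide: note any genotype quality < 99


@dataclass
class Call:
    """One sample's genotype call at the candidate site (hypothetical record)."""
    alleles: tuple  # e.g. ("G", "A")
    gq: int         # genotype quality


def low_gq_samples(calls):
    """Names of samples whose genotype quality should be noted (GQ < 99)."""
    return [name for name, call in calls.items() if call.gq < GQ_FLAG_BELOW]


def is_het(call, ref):
    return ref in call.alleles and len(set(call.alleles)) == 2


def is_hom_ref(call, ref):
    return set(call.alleles) == {ref}


def is_hom_alt(call, ref):
    return len(set(call.alleles)) == 1 and ref not in call.alleles


def matches_pattern(pattern, child, parents, ref):
    """Check trio/duo genotypes against the searched-for inheritance pattern."""
    if pattern == "de_novo_dominant":
        return is_het(child, ref) and all(is_hom_ref(p, ref) for p in parents)
    if pattern == "homozygous_recessive":
        return is_hom_alt(child, ref) and all(is_het(p, ref) for p in parents)
    raise ValueError(f"unsupported pattern: {pattern}")


# Invented example trio, reference allele "G"
calls = {
    "child": Call(("G", "A"), 99),
    "parent1": Call(("G", "G"), 99),
    "parent2": Call(("G", "G"), 72),
}
parents = [calls["parent1"], calls["parent2"]]
print(low_gq_samples(calls))                                            # ['parent2']
print(matches_pattern("de_novo_dominant", calls["child"], parents, "G"))  # True
```

For a duo, the same functions apply with a single-element `parents` list, which is why the check is written over "every sequenced parent" rather than exactly two.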

Gene-Level Curation

- Is there information to suggest there is a good story? Are you finding anything that excludes it as a candidate?
- Gene statistics
  - 3 constraint values (missense, synonymous, LOF)
  - pLI: probability of LOF intolerance
- OMIM (Online Mendelian Inheritance in Man) associations with gene
  - Binary YES/NO; text entry: if YES, which
  - Disease associations that overlap?
  - Is there a plausible biological story? (e.g., known to be active in early development)
- GTEx Expression: where is our gene expressed? Samples come mostly from individuals >18 years old who died, so they do not indicate expression during development
  - In tissue? YES/NO
  - Differentially expressed in tissue? YES/NO
- Allen Brain Atlas: good for developmental transcriptomes
  - YES/NO expressed in early development before patient age
  - Text entry on overall level of expression and trends of time periods

Variant-Level Curation

- gnomAD: How much variation occurs at that specific site in the “healthy” genome?
  - Supportive/not supportive categorical: Good/OK/Bad
    - Good: specific variant not found among healthy population, other variants at that site not found at high frequencies
    - OK: specific variant present but at low frequency (<5% MAF)
    - Bad: found our specific variant at nontrivial frequency in healthy population
  - Text entry: elaborate on categorical score
- UCSC Genome Browser: is this allele conserved across species?
  - Good/OK/Bad
    - Good: conserved across all species
    - OK: not conserved across all species
    - Bad: not conserved across primate species
- UniProt: impact within the protein
  - Known variants at the site
  - Known functional role(s): domains, interaction sites, modification site
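The Good/OK/Bad categories above can be written down as explicit rules. This is a hedged sketch only: the slide gives "<5% MAF" for OK but does not quantify "high" or "nontrivial" frequency, so reusing 5% for both is an assumption, as is rating an absent variant as OK when another variant at the site is common. Function and parameter names are hypothetical.

```python
LOW_FREQ_MAF = 0.05  # the "<5% MAF" cut-off from the slide


def gnomad_category(variant_af, other_site_afs=()):
    """Rate gnomAD support for a candidate variant as Good/OK/Bad.

    variant_af: allele frequency of the exact variant (0.0 if absent).
    other_site_afs: allele frequencies of other variants at the same site.
    """
    if variant_af >= LOW_FREQ_MAF:
        return "Bad"   # assumption: "nontrivial frequency" read as >= 5%
    if variant_af > 0:
        return "OK"    # present, but at low frequency
    if any(af >= LOW_FREQ_MAF for af in other_site_afs):
        return "OK"    # assumption: the slide does not rate this case
    return "Good"      # absent, and no common variants at the site


def conservation_category(conserved_in_primates, conserved_in_all_species):
    """Rate cross-species conservation as Good/OK/Bad, following the slide."""
    if not conserved_in_primates:
        return "Bad"
    return "Good" if conserved_in_all_species else "OK"


print(gnomad_category(0.0))                # Good
print(gnomad_category(0.001))              # OK
print(gnomad_category(0.12))               # Bad
print(conservation_category(True, False))  # OK
```

Encoding the categories this way still leaves the free-text elaboration to the curator; it only makes the categorical part reproducible.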

- We manually retrieved 23 metrics for all 239 variants across our 2 new cases
  - Case 344
    - Bayes de novo: 3 variants
    - Compound heterozygous: 31 variants
    - Homozygous recessive: 14 variants
  - Case 343
    - Autosomal dominant: 182 variants
    - Homozygous recessive: 9 variants

How are the various metrics weighted in their contribution to a final evaluation of each candidate variant?

- De novo dominant
  - pLI ≥ 0.9
  - Allele frequency (if present in ≥2 individuals, rule out)
  - Missense constraint
- Compound heterozygous
  - Allele frequency
  - Homozygotes for either variant in the pair
  - Gene expression in brain
  - If ‘benign’ by Polyphen, likely not candidates
- Homozygous recessive
  - Allele frequency (if ≥1 homozygote in gnomAD, rule out)
  - LOF constraint, missense constraint
- Autosomal dominant (for a duo)
  - pLI ≥ 0.9
  - Missense constraint
  - Allele frequency (if present in ≥2 individuals, rule out)
  - LOF prioritization (deprioritize synonymous)
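The hard cut-offs in the list above (the "rule out" conditions and the pLI threshold) lend themselves to a small rule function. The sketch below covers only those explicit thresholds; the softer criteria (constraint scores, brain expression, PolyPhen) are left out because the slide gives no numeric rule for them. All names are hypothetical.

```python
DOMINANT_MODES = {"de_novo_dominant", "autosomal_dominant"}


def rule_out(mode, gnomad_individuals=0, gnomad_homozygotes=0):
    """Apply the explicit allele-frequency exclusions for each inheritance mode."""
    if mode in DOMINANT_MODES:
        return gnomad_individuals >= 2   # present in >= 2 individuals: rule out
    if mode == "homozygous_recessive":
        return gnomad_homozygotes >= 1   # >= 1 homozygote in gnomAD: rule out
    return False  # compound heterozygous: no hard cut-off stated on the slide


def meets_pli_criterion(mode, pli):
    """pLI >= 0.9 is listed only for the dominant modes."""
    return mode in DOMINANT_MODES and pli >= 0.9


print(rule_out("de_novo_dominant", gnomad_individuals=2))      # True
print(rule_out("homozygous_recessive", gnomad_homozygotes=0))  # False
print(meets_pli_criterion("autosomal_dominant", 0.9999))       # True
```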

Can the final “likelihood” of each variant be formalized into a categorical system?

Categorical Rating: Rationale

● Strong candidate: plausibly causal of phenotype
● Weak candidate: some reasons for skepticism
● Unlikely candidate: unlikely to be causal
● False variant call: variant not present in patient

Results: Can we formalize the variant evaluation process?

- How do our final ratings for variants compare to the ratings of an expert medical geneticist on the UDN clinical team?
  - “Expert evaluation” for Case 344 (trio): 43 variants (excluding false calls)
  - Comparison against our own evaluation using the formalized process

Binary ratings (rows: Test (Us); columns: Truth (Joel))

             Plausible  Unlikely
Plausible       20          2
Unlikely         2         19

Three-level ratings (rows: Test (Us); columns: Truth (Joel))

             Strong  Unlikely  Weak
Strong         10       0        1
Unlikely        0      19        2
Weak            0       2        9

Approach: Can we automate the variant evaluation process?
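The two agreement tables from the preceding Results slide each reduce to a single agreement rate: the diagonal count divided by the 43 variants. A minimal sketch of that arithmetic follows; the function name is illustrative, and the matrices are copied from the tables with rows as our ratings and columns as the expert's.

```python
def agreement_rate(matrix):
    """Fraction of variants on the diagonal, i.e. rated identically by both raters."""
    total = sum(sum(row) for row in matrix)
    agree = sum(matrix[i][i] for i in range(len(matrix)))
    return agree / total


# Rows: Test (Us); columns: Truth (Joel)
binary = [
    [20, 2],   # Plausible
    [2, 19],   # Unlikely
]
three_level = [
    [10, 0, 1],   # Strong
    [0, 19, 2],   # Unlikely
    [0, 2, 9],    # Weak
]

print(round(agreement_rate(binary), 3))       # 0.907  (39 of 43)
print(round(agreement_rate(three_level), 3))  # 0.884  (38 of 43)
```

The column totals of the binary table (22 plausible, 21 unlikely) match the "ground truth" class sizes used for the classifier.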

- If we feed our manually-curated parameters to a classifier, can it predict variant ratings with some accuracy?
  - Binary: unlikely causative variant vs. variant of any degree of plausibility
  - Performed on Case 344 data, for which we have a “ground truth” evaluation of variant candidacy from Dr. Krier
  - Removed all “false” calls, where the variant was not truly present in the reads on manual review via IGV
  - Removed manually-curated parameters that were dependent upon human interpretation (e.g., UniProt, OMIM)
  - 43 total variants
    - 21 were “ground truth” unlikely and 22 were “ground truth” plausible

- 19 total parameters
  - In-silico predictions (4): PolyPhen, SIFT, MutationTaster, FATHMM
  - Genotype quality scores for all individuals (2-3)
  - Alignment quality categorical rating
  - Constraint metrics: Synonymous_Z, Missense_Z, LoF_Z, pLI
  - Expression levels (3): Y/N in brain or nerve, Y/N elevated in brain or nerve, Y/N in early development
  - Variant frequency: gnomAD allele frequency and # homozygotes, gnomAD categorical evaluation, categorical conservation across species

Results: Can we automate the variant evaluation process?
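The results on this slide are per-fold AUCs from 6-fold cross-validation. As a minimal sketch (not the project's actual code; the `auc` function and the toy labels and scores are illustrative), AUC can be computed as the fraction of plausible/unlikely pairs that a classifier ranks correctly, and the six fold values listed on the slide average to the stated means:

```python
from statistics import mean


def auc(labels, scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy fold: 1 = plausible, 0 = unlikely; scores are invented classifier outputs
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75

# Per-fold AUCs as reported on the slide
fold_aucs = {
    "Naive Bayes": [0.75, 0.88, 0.88, 0.83, 0.78, 0.99],
    "Random Forest": [0.83, 0.83, 0.99, 0.83, 0.78, 0.67],
    "Logistic Regression": [0.56, 0.99, 0.50, 0.58, 0.67, 0.67],
}
mean_aucs = {name: round(mean(values), 2) for name, values in fold_aucs.items()}
print(mean_aucs)
# {'Naive Bayes': 0.85, 'Random Forest': 0.82, 'Logistic Regression': 0.66}
```

With 43 variants split six ways, each test fold holds only 7 or 8 variants, so a single fold's AUC can take only a few distinct values and the fold-to-fold spread is expected to be large.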

- Naive Bayes with 6-fold CV: 0.75, 0.88, 0.88, 0.83, 0.78, 0.99 → mean AUC of 0.85
- Random Forest with 6-fold CV: 0.83, 0.83, 0.99, 0.83, 0.78, 0.67 → mean AUC of 0.82
- Logistic Regression with 6-fold CV: 0.56, 0.99, 0.50, 0.58, 0.67, 0.67 → mean AUC of 0.66
- Further refinement (especially feature selection) forthcoming
- Future Directions: expand data set (in size, in certainty), classification nuance (strong vs. weak plausible candidates)

Project Data: Previously-Failed Cases

Can we expand the variant evaluation process? Case 72, Case 146

[Figure: pedigrees. Case 72: affected male with two unaffected parents. Case 146: affected female with an unaffected mother and one parent marked "?"]

Case 72:
● Intrauterine growth restriction
● Failure to thrive
● Congenital hypothyroidism

Case 146:
● Eosinophilia
● Autoantibody production
● Bone abnormalities

Approach: Can we expand the variant evaluation process?

- Apply existing CNV callers to our cases that failed the standard BGM pipeline. Can we discover causal variants that were missed?
- Based on our experience in this endeavor, offer recommendations about
  - The value of expanding the BGM pipeline to include CNV callers
  - Considerations about caller selection
  - Considerations about caller integration

Caller 1: SvABA

- Uses clipped, discordant, unmapped, and indel reads; assembles them into contigs for every 25 kb window using String Graph Assembler, then aligns the contigs using BWA-MEM
- Identifies indels, structural variants, complex rearrangements, and viral integration

Caller 2: MoChA

● Main idea: ○ Phase variants, establish haplotype blocks

○ Analyze heterozygous sites ■ Expected BAF= 50%

■ Deviations from 50% = presence of CNV

○ Coverage = indication of type

Methods
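The MoChA main idea outlined above (B-allele fraction near 50% at heterozygous sites, with deviations signalling a CNV and coverage indicating its type) can be illustrated with a toy classifier. This is a deliberately simplified sketch and not MoChA's algorithm, which works on phased haplotypes; the function name, thresholds, and event labels are assumptions chosen for illustration.

```python
from statistics import mean


def classify_segment(baf_values, relative_coverage, baf_tol=0.05, cov_tol=0.1):
    """Label a genomic segment from BAF at heterozygous sites and coverage.

    baf_values: B-allele fractions at heterozygous sites in the segment.
    relative_coverage: segment read depth divided by the genome-wide depth.
    baf_tol, cov_tol: hypothetical tolerances, not values used by MoChA.
    """
    deviation = mean(abs(b - 0.5) for b in baf_values)
    if deviation <= baf_tol:
        return "no event"           # BAF stays near the expected 50%
    if relative_coverage > 1 + cov_tol:
        return "gain"               # allelic imbalance with extra coverage
    if relative_coverage < 1 - cov_tol:
        return "loss"               # allelic imbalance with reduced coverage
    return "copy-neutral LOH"       # imbalance without a coverage change


print(classify_segment([0.50, 0.52, 0.48], 1.0))  # no event
print(classify_segment([0.33, 0.67, 0.35], 1.5))  # gain
print(classify_segment([0.10, 0.90, 0.95], 0.5))  # loss
print(classify_segment([0.10, 0.90], 1.0))        # copy-neutral LOH
```

Because the signal comes from heterozygous sites only, a segment with no heterozygous sites gives the function nothing to work with.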

● SvABA
○ Software already installed on Shamil’s cluster
○ Ran SvABA on subjects’ BAM files
■ Run time: around 6 hours per case
○ SV classification is in BND format
■ Requires self-made parser to know whether INS/DEL, DUP, TRANS, INV
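The team's BND parser is not shown in the slides. As a hedged illustration of what such a parser might do, the sketch below classifies a single breakend record from the bracket notation defined in the VCF specification (`t[p[`, `t]p]`, `]p]t`, `[p[t`). The function name and the returned labels are assumptions; the heuristic does not separate insertions from deletions and ignores complex rearrangements.

```python
import re

# VCF breakend ALT: optional bases, bracket, mate "chrom:pos", bracket, optional bases
_BND_ALT = re.compile(
    r"^(?P<pre>[ACGTNacgtn.]*)(?P<bracket>[\[\]])"
    r"(?P<chrom>[^:\[\]]+):(?P<pos>\d+)[\[\]](?P<post>[ACGTNacgtn.]*)$"
)


def classify_bnd(chrom, pos, alt):
    """Heuristically label one BND record as DEL, DUP, INV, or TRANS."""
    match = _BND_ALT.match(alt)
    if match is None:
        return "UNKNOWN"
    mate_chrom, mate_pos = match["chrom"], int(match["pos"])
    if mate_chrom != chrom:
        return "TRANS"                      # mate lies on another chromosome
    ref_first = bool(match["pre"])          # True for t[p[ and t]p]
    bracket = match["bracket"]
    # t]p] and [p[t join the mate in reverse-complement orientation
    if (ref_first and bracket == "]") or (not ref_first and bracket == "["):
        return "INV"
    # t[p[ continues rightwards from the mate; ]p]t continues from its left
    if ref_first:
        return "DEL" if mate_pos > pos else "DUP"
    return "DEL" if mate_pos < pos else "DUP"


print(classify_bnd("chr1", 1000, "N[chr1:2000["))  # DEL
print(classify_bnd("chr1", 2000, "N[chr1:1000["))  # DUP
print(classify_bnd("chr1", 1000, "N]chr1:2000]"))  # INV
print(classify_bnd("chr1", 1000, "N[chr5:300["))   # TRANS
```

Both breakends of a pair should receive the same label under this scheme, which gives a simple consistency check when parsing real output.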

● MoChA
○ Required patch extensions of bcftools and htslib
○ Requires GNU compiler version 5 or newer for patches
■ Shamil’s cluster GNU compiler was outdated (version 3)
■ Updated Shamil’s cluster protocol to allow for version 5 or newer usage
○ Uses GRCh37-aligned resources
■ BGM data in UCSC hg19 format
■ Had to run LiftOver (Picard) and find a parser to convert between different aligners
● It would have taken 2-5 days to re-align each vcf file
○ Developed filter to screen variants present in parents
○ Run time: 10 hours using 12 threads, but ran multiple samples at the same time

CNV Caller Outputs
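The parental filter mentioned in the Methods above, which underlies the child-only counts reported here, is not shown in the slides. A minimal sketch of one way to implement it follows, assuming calls are represented as intervals and that a child call counts as inherited when a same-type parental call covers at least half of it; the `Cnv` record, the function names, and the 50% threshold are all hypothetical.

```python
from typing import NamedTuple


class Cnv(NamedTuple):
    """A CNV call as a half-open interval [start, end) (hypothetical record)."""
    chrom: str
    start: int
    end: int
    kind: str  # e.g. "loss", "gain"


def overlap_fraction(child, parent):
    """Fraction of the child call covered by a parental call of the same type."""
    if child.chrom != parent.chrom or child.kind != parent.kind:
        return 0.0
    shared = min(child.end, parent.end) - max(child.start, parent.start)
    return max(shared, 0) / (child.end - child.start)


def child_only(child_calls, parent_calls, min_overlap=0.5):
    """Keep child calls that no parental call covers by at least min_overlap."""
    return [
        c for c in child_calls
        if all(overlap_fraction(c, p) < min_overlap for p in parent_calls)
    ]


# Invented example calls
child_calls = [Cnv("chr1", 100, 200, "loss"), Cnv("chr2", 500, 900, "gain")]
parent_calls = [Cnv("chr1", 90, 210, "loss")]
print(child_only(child_calls, parent_calls))
# [Cnv(chrom='chr2', start=500, end=900, kind='gain')]
```

For a duo, only one parent's calls are available, so a call surviving this filter may still be inherited from the unsequenced parent.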

- SvABA
  - BGM0072 (trio): 1997 SV (child), 3764 SV (parent1), and 3922 SV (parent2)
    - 548 child only
    - 360 classified as CNVs
    - 1096 child-specific
  - BGM0146 (duo): 2532 SV (child), 4838 SV (mom) called
    - 1270 child only
    - 887 classified as CNVs

- MoChA
  - BGM0146 (duo): 270 (child), 268 (mom) CNVs called
    - Filtered to exclude variants present in both child and parents: left 17 variants
  - BGM0072 (trio): 232 (child), 265 (parent1), 252 (parent2)

MoChA Output: BGM0146

Recall symptoms:
● Eosinophilia
● Autoantibody production

- Selection of genes lying within deletion regions:
  - SLX1B
    - endonuclease
    - mutations in this family associated with Fanconi anemia
    - deletion
  - RPS17L
    - small ribosomal subunit
    - mutations in this gene linked to Diamond-Blackfan anemia
    - deletion
- Selection of genes lying within loss of heterozygosity regions:
  - USP54
    - contributes to the ubiquitin proteolysis pathway

Feedback on incorporating CNV callers

● SvABA:
○ For germline cases, the structural variant classification is not intuitive
○ Requires a parser to screen for variants in unaffected individuals and proband
○ It is agnostic to the previous alignment
○ Easy to run, with relatively fast processing time
● MoChA:
○ Depends on having a sequence alignment compatible with the reference databases
○ Use of LiftOver to convert to a compatible reference genome leads to loss of variants
○ Relies on phasing, thus it cannot detect events in regions of homozygosity
○ Relatively fast when running multiple samples
○ Output is straightforward to interpret for downstream analysis

Causal Variant Findings/Suggestions

CASE 1 (BGM0344, trio):

- Compound heterozygous
  - CEP170B: centrosomal protein 170 kDa B
    - Allele 1: missense variant with significant amino acid change (R>Q); no domains specified in UniProt
    - Allele 2: synonymous variant at 3’ end of intron between exons 2 and 3; may be a splice acceptor
  - ANKFY1: involved in vesicular fusion & endosomal transport
    - Allele 1: missense variant with non-significant amino acid change (K>R); highly conserved residue
    - Allele 2: synonymous

Causal Variant Findings/Suggestions: DUO BGM0343

Autosomal dominant (likely de novos):

- ZBTB17 (encodes Myc-interacting protein; has been implicated in neuropathy)
- ZZZ3 (involved in chromatin binding as part of the ATAC complex)
- CACNA1D
- ATP13A3 (member of the P-type ATPase family of proteins that transport a variety of cations across membranes)
- RUFY3 (required for maintenance of neuronal polarity)
- SEPT8 (involved in neuron polarity and vesicle trafficking)
- HIST1H2AC (histone protein involved in nucleosome)
- ZNF462
- LEO1: involved in transcription, PAF1 complex
- CASKIN1: synaptic scaffolding protein
- CACNA1A
- SLIT1

Homozygous recessive:

- CLCN5: implicated in Dent’s disease
- AMER1: associated with high bone mass

CACNA1D

● Missense variant chr3:53761021 G>A, p.6759D in cytoplasmic portion of the protein
● Encouraging data:
○ pLI: 1; missense Z score: 5.5684; LoF Z score: 8.9245
○ Expressed in early development
○ Variant not found in gnomAD, highly conserved across species
● Encodes voltage-gated calcium channel, alpha1 subunit
● Implicated in idiopathic hyperaldosteronism (Omata K, et al 2018, Hypertension)
● Mutation present in an individual with hearing impairment, developmental delay, and epilepsy (Garza Lopez E. et al 2018, J Biol Chem)
● A CACNA1D mutation was reported in a patient with persistent hyperinsulinaemic hypoglycaemia, heart defects, and severe hypotonia (Flanagan, SE. et al 2017, Pediatric Diabetes)
● Recommend: further cardiac evaluation of the patient

CACNA1A

● Missense variant chr19:13318252 A>G
● Encouraging data:
○ pLI: 0.9999999998; missense Z score: 5.5684; LoF Z score: 8.9245
○ Present on neurodevelopmental gene list
○ Expressed in early development with highest expression in the brain per GTEx
○ Variant not found in gnomAD, highly conserved across species
● Encodes voltage-gated calcium channel
● Implicated in motor seizures in generalized epilepsy (Jiang X. et al 2018, Ann Neurol)
● Mutations in this gene have been implicated in episodic ataxia (Ling X. et al 2018, Int J Neurosci; Lance S. et al 2018, Case Rep Neurol Med)
● Recommend: neuropsych evaluation

Zinc finger nuclear factor (ZNF462)

● Missense variant at position 3310 (C to A)
● Regulates chromatin structure/organization to control differentiation of ESCs
● Specifically regulates SOX2, POU5F1/OCT4, NANOG, and PBX1
● Directs neuronal development and differentiation
● Encouraging data
○ Constraints: pLI 0.9999, missense Z score 3.22, LoF Z score 7.53
○ Appears in neurodevelopmental gene list
○ Expression: GTEx → in brain and nerve, ABA → highly expressed during fetal development
○ Variant not found in gnomAD, UCSC shows perfectly conserved across species
○ Literature support: 8 patients with haploinsufficiency of ZNF462 → craniofacial anomalies, corpus callosum dysgenesis, ptosis, and developmental delay (Weiss et al Eur J Hum Genet. 2017)
● Caveats
○ 3 of 4 in-silico prediction tools called “benign”/“tolerated” (Mutation Taster called pathogenic)
○ Variant call is imperfect but plausible on manual review → recommend Sanger sequencing

Project Limitations

- Missing data
  - Some parameters for a given variant may be unknown or not yet captured
  - Makes classifiers/automation more difficult, especially as the variants that are missing data could very well be causal variants
- Refining our “ground truth”
  - Curating multiple expert opinions
  - Use cases validated by follow-up testing, clinical success
- Expanding data set
  - Proof of concept, as xBrowse will be retiring, and our dataset is limited
- Compatibility of CNV callers with xBrowse output
  - Upstream and downstream

Future Directions

CNV Callers:

- Have UDN cases with known CNVs as benchmarks to test these
- Balance ease of use with accuracy
- Develop metrics for accuracy of CNV calls

Solving Cases:

- ML pipeline with wider set of cases and ground truth from experts
- More historic cases with experimental/clinical validation as ground truth

Conclusions

Can we formalize the variant evaluation process?

Can we automate the variant evaluation process? Yes!

Can we expand the variant evaluation process?

Thank you!

● Dr. Joel Krier
● Dr. Shamil Sunyaev & Lab members
● Brigham UDN Clinical Group
● Boston Children’s UDN Clinical Group
● BMIF Teaching Team