[Supplementary Documents]
Divine: Prioritizing Genes for Rare Mendelian Disease in Whole Exome Sequencing Data

Changjin Hong, Jean R. Clemenceau, Yunku Yeu, and TaeHyun Hwang*
Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195

Contents
1 Workflow
2 Annotation
3 Requirement
  3.1 Acceptable HPO inputs
  3.2 Input and output
  3.3 Installation and manual
4 Methods:
  4.1 Comparing known disease phenotypes with patient phenotypes
  4.2 Damage prediction from genetic information
    4.2.1 Pathogenic likelihood from AA
    4.2.2 Functional impact by variant location
    4.2.3 Pathogenic variant density in a protein domain
    4.2.4 Pathogenic score for a mutated gene
  4.3 Phenotype gene enrichment
    4.3.1 Gene ontology enrichment
  4.4 KEGG pathway enrichment
  4.5 Combining Gi and Pi and a final ranking by a heat diffusion on STRING network.
5 Experiments
  5.1 The other methods for comparison
  5.2 26 WES retrospective study samples
  5.3 AUC scores
  5.4 A Case Study with Atypical Hemolytic Uremic Syndrome (aHUS) [14]
  5.5 Divine under noise HPO queries
6 Reference

1 Workflow

S. Figure 1. Divine Workflow: Divine takes either a VCF file or patient phenotypes given as HPO IDs. Divine annotates each variant with up to 30 databases and features at either the variant level or the gene level.
Divine also supports a discovery mode to infer genes that have never been associated with a certain disease model. Both a single proband sample and trio family samples can be analyzed. Divine assesses the pathogenicity of each gene by analyzing both the patient phenotype information and the genetic variants, and provides a prioritized gene ranking list in Microsoft Excel format as an annotation table.

2 Annotation

S. Table 1. Divine uses Varant [11] as an annotation framework. Originally, 22 features were available. In the release of Divine, eight new annotations are added or supported (the eight items at the bottom of the table, items 23 to 30).

1. dbSNP, 1000 Genomes Minor Allele Frequency (MAF) & ESP (MAF)
2. Clinically significant variants from ClinVar DB
3. GWAS Phenotype
4. Genomic region - Intergenic, Intronic, Exonic & UTR
5. Downstream and upstream gene for intergenic variants
6. Splice Site (Donor/Acceptor)
7. Mutation Type - NonSyn, Syn, StartGain, StartLoss, StopGain, StopLoss, SynStop
8. Codon Usage in Human
9. Exonic splice enhancer / silencer site
10. Flag variants that span a boundary region such as Intron-Exon or UTR-CDS
11. Distance of intronic variants from splice sites
12. UTR Functional Motifs
13. miRNA Binding Site
14. Polyphen2, SIFT & CADD prediction
15. Gene-Disease association - OMIM, NCBI-GAD
16. Position Conservation - Gerp++ Score
17. Interpro Domain
18. TFBS
19. eQTL
20. Low complexity region
21. Pseudo autosomal region
22. Capture region
23. COSMIC
24. HGMD (a license is required)
25. Gene Ontology and KEGG pathway
26. ExAC
27. ClinVitae
28. Protein domain pathogenicity
29. Amino acid change pathogenicity
30. Genetic model (autosomal recessive/recessive, homozygous, heterozygous, compound heterozygous)

3 Requirement

Divine requires either a standard-format VCF file or a text file of Human Phenotype Ontology (HPO) IDs that describe the patient's clinical features.
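As a minimal sketch of preparing the HPO input, the snippet below checks that each entry is an HPO ID (the prefix `HP:` followed by seven digits) and not a free-text term. It assumes a text file with one ID per line; the exact file layout Divine expects is not stated here, and the function name `parse_hpo_ids` is hypothetical and not part of Divine.

```python
import re

# An HPO ID is the prefix "HP:" followed by seven digits, e.g. HP:0002803.
HPO_ID = re.compile(r"HP:\d{7}")

def parse_hpo_ids(text):
    """Split text into valid HPO IDs and rejected entries.

    Assumption: one entry per line. Blank lines are skipped, stray
    whitespace and lowercase prefixes are normalized, and duplicates
    are dropped while preserving the original order.
    """
    valid, rejected, seen = [], [], set()
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry:
            continue
        # Remove internal whitespace ("HP: 0002803" -> "HP:0002803").
        candidate = "".join(entry.split()).upper()
        if HPO_ID.fullmatch(candidate):
            if candidate not in seen:
                seen.add(candidate)
                valid.append(candidate)
        else:
            # Free-text terms such as "Congenital contracture" land here.
            rejected.append(entry)
    return valid, rejected

if __name__ == "__main__":
    sample = "HP:0002803\nCongenital contracture\nhp: 0001250\n"
    ids, bad = parse_hpo_ids(sample)
    print("accepted:", ids)
    print("rejected:", bad)
```

Rejected entries would need to be converted to HPO IDs by hand before being passed on.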
3.1 Acceptable HPO inputs

HPO [2] provides phenotype-to-disease associations that allow large-scale computational analysis of the human phenome, so describing the patient with HPO terms is very helpful. Currently, Divine only accepts HPO IDs (e.g., HP:0002803) rather than terms or vocabularies (e.g., “Congenital contracture”) describing a patient's clinical feature. Several websites can convert a phenotypic description into an appropriate HPO ID: https://mseqdr.org/search_phenotype.php, http://compbio.charite.de/phenomizer, or https://hpo.jax.org/.

3.2 Input and output

When only HPO IDs are given, Divine generates an inferred disease list with associated genes. If a VCF file (with HPO IDs) is given, it generates an annotated variant table and an annotated inferred disease ranking list in Microsoft® Excel format. For the best result, provide both a VCF file (e.g., generated by the GATK germline variant caller [16]) and a set of HPO IDs. Divine mainly uses an existing annotation framework, Varant [11], which originally provided 22 annotations; we add eight new features in Divine (see the last eight items in S. Table 1).

3.3 Installation and manual

https://github.com/cjhong/divine

4 Methods:

4.1 Comparing known disease phenotypes with patient phenotypes

Given a patient query HPO set, H = {1, 2, …, m, …, M}, we calculate its semantic similarity to the HPO set of each known disease phenotype j, D_j = {1, 2, …, n, …, N}. In total, M × N term-to-term similarities ($s_{m,n}$) are available. We use the simRel [10] semantic measure defined in [s.eq2]. In the equation, $p_m$ indicates the information content of m, and CLA stands for the common lowest ancestor in the ontology graph. To summarize the M-by-N similarity matrix into a single value, $s(H, D_j)$, we use a method suggested in [20], adapted to our application: we average the row-wise maxima, average the column-wise maxima, and take the larger of the two averages.
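The set-level similarity can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Divine's implementation: the function names are hypothetical, `p` is treated as a term probability strictly between 0 and 1 (1 only at the ontology root), and the logarithm in the size penalty of [s.eq1] is assumed to be base 10, so that the penalty equals 1 when M = N.

```python
import math

def sim_rel(p_m, p_n, p_cla):
    """Term-to-term simRel as in [s.eq2].

    Assumption: p_m, p_n, p_cla are term probabilities in (0, 1],
    with p_cla >= max(p_m, p_n) because CLA is an ancestor of both.
    """
    if p_cla >= 1.0:
        # The only shared ancestor is the root: no specific similarity.
        return 0.0
    return 2.0 * math.log(p_cla) / (math.log(p_m) + math.log(p_n)) * (1.0 - p_cla)

def set_similarity(sim):
    """Summarize an M x N term-similarity matrix as in [s.eq1].

    sim[m][n] is the similarity between patient term m and disease term n.
    """
    M, N = len(sim), len(sim[0])
    # Average of the best match for each patient term (row-wise maxima).
    row_avg = sum(max(row) for row in sim) / M
    # Average of the best match for each disease term (column-wise maxima).
    col_avg = sum(max(sim[m][n] for m in range(M)) for n in range(N)) / N
    # Assumed base-10 log: the divisor is exactly 1 when M == N.
    return max(row_avg, col_avg) / math.log10(abs(M - N) + 10)

if __name__ == "__main__":
    # Two patient terms against three disease terms (made-up similarities).
    example = [[0.8, 0.1, 0.0],
               [0.2, 0.6, 0.3]]
    print(round(set_similarity(example), 4))
```

In this toy matrix the row-wise average (0.7) exceeds the column-wise average (about 0.567), and the size mismatch |M − N| = 1 shrinks the score slightly.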
Symptoms or phenotypic descriptions related to a disease are incomplete and sparse, and the number of phenotypes describing a disease varies significantly among diseases. Thus, we penalize the maximum average value by dividing it by |M − N| on a log scale:

$$ s(H, D_j) = \frac{\max\left\{ \dfrac{\sum_{m} \max_{n} \{ s_{m,n} \}}{M},\; \dfrac{\sum_{n} \max_{m} \{ s_{m,n} \}}{N} \right\}}{\log(|M - N| + 10)} \tag{s.eq1} $$

$$ s_{m,n} = \frac{2 \log(p_{CLA})}{\log(p_m) + \log(p_n)} \, (1 - p_{CLA}) \tag{s.eq2} $$

One gene can be associated with two or more diseases, which is often the case when the diseases are very similar to each other. We retain only the maximum $s(H, D_j)$ among those and assign it as the phenotypic score of the gene i directly associated with $D_j$:

$$ P_i = \max_{j \,:\, D_j \text{ associated with gene } i} s(H, D_j) \tag{s.eq3} $$

4.2 Damage prediction from genetic information

Divine uses hg19 (i.e., GRCh37) as the reference genome sequence. By default, Divine filters out any variant outside exonic regions or UTRs, allowing +/- 20 bp of flanking sequence. Note that the user can change this option to handle either whole-genome or targeted sequencing reads. As a gene model, we use the NCBI RefSeq gene annotation, containing 52,065 isoform transcripts across 26,668 genes. As described in the main manuscript, Divine filters out any variant frequently observed in the general population; the user can define the MAF (Minor Allele Frequency) cutoff value. Divine predicts the pathogenicity of a gene from the variants in a VCF file using the following three components: 1) a pathogenic likelihood by amino acid change predicted from known pathogenic databases, 2) an impact score by the variant location within a transcript, and 3) pathogenic density per active protein domain.

4.2.1 Pathogenic likelihood from AA

Taking positive controls from pathogenic variants that appeared in ClinVar [15] or HGMD Professional [9], we train a beta distribution (i.e., a cumulative distribution function, $F_P[a]$) of either Gerp++ [18] or CADD [19] scores by each amino acid change (a). Similarly, the other beta distribution (i.e., a cumulative distribution