[Supplementary Documents]

Divine: Prioritizing Genes for Rare Mendelian Disease in Whole Exome Sequencing Data

Changjin Hong, Jean R. Clemenceau, Yunku Yeu, and TaeHyun Hwang*

Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195

Contents

1 Workflow
2 Annotation
3 Requirement
  3.1 Acceptable HPO inputs
  3.2 Input and output
  3.3 Installation and manual
4 Methods
  4.1 Comparing known disease phenotypes with patient phenotypes
  4.2 Damage prediction from genetic information
    4.2.1 Pathogenic likelihood from AA
    4.2.2 Functional impact by variant location
    4.2.3 Pathogenic variant density in a protein domain
    4.2.4 Pathogenic score for a mutated gene
  4.3 Phenotype gene enrichment
    4.3.1 Gene ontology enrichment
  4.4 KEGG pathway enrichment
  4.5 Combining Gi and Pi and obtaining a final ranking by a heat diffusion on STRING network
5 Experiments
  5.1 The other methods for comparison
  5.2 26 WES retrospective study samples
  5.3 AUC scores
  5.4 A Case Study with Atypical Hemolytic Uremic Syndrome (aHUS) [14]
  5.5 Divine under noise HPO queries
6 Reference

1 Workflow

S. Figure 1. Divine workflow. Divine takes a VCF file, patient phenotypes given as HPO IDs, or both. It annotates each variant with up to 30 databases and features at either the variant level or the gene level. Divine also supports a discovery mode to infer genes that have never been associated with a certain disease model. Trio familial samples, not only a proband sample, can be analyzed. Divine assesses the pathogenicity of each gene by analyzing both the patient phenotype information and the genetic variants, and it provides a prioritized gene ranking list, in Microsoft Excel format, as an annotation table.

2 Annotation

S. Table 1. Divine uses Varant [11] as its annotation framework. Originally, 22 features were available. In this release of Divine, eight new annotations are added or supported (the last eight items of the table).

1. dbSNP, 1000 Genomes Minor Allele Frequency (MAF) & ESP (MAF)
2. Clinically significant variants from ClinVar DB
3. GWAS Phenotype
4. Genomic region - Intergenic, Intronic, Exonic & UTR
5. Downstream and upstream gene for intergenic variants
6. Splice Site (Donor/Acceptor)
7. Mutation Type - NonSyn, Syn, StartGain, StartLoss, StopGain, StopLoss, SynStop
8. Codon Usage in Human
9. Exonic splice enhancer / silencer site
10. Flag variants that span a boundary region such as Intron-Exon or UTR-CDS
11. Distance of intronic variants from splice sites
12. UTR Functional Motifs
13. miRNA Binding Site
14. Polyphen2, SIFT & CADD prediction
15. Gene-Disease association - OMIM, NCBI-GAD
16. Position Conservation - GERP++ Score
17. Interpro Domain
18. TFBS
19. eQTL
20. Low complexity region
21. Pseudo autosomal region
22. Capture region
23. COSMIC
24. HGMD (a license is required)
25. Gene Ontology and KEGG pathway
26. ExAC
27. ClinVitae
28. Protein domain pathogenicity
29. Amino acid change pathogenicity
30. Genetic model (autosomal dominant/recessive, homozygous, heterozygous, compound heterozygous)

3 Requirement

Divine requires either a standard-format VCF file or a text file of Human Phenotype Ontology (HPO) IDs that describe the patient's clinical features.

3.1 Acceptable HPO inputs

The HPO [2] provides phenotype-to-disease associations that allow for large-scale computational analysis of the human phenome. Currently, Divine only accepts HPO IDs (e.g., HP:0002803) rather than the terms or vocabularies (e.g., "Congenital contracture") that describe a patient's clinical features. Several websites can convert a phenotypic description into the appropriate HPO ID: https://mseqdr.org/search_phenotype.php, http://compbio.charite.de/phenomizer, and https://hpo.jax.org/.
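As an illustration of the ID-only requirement, the following minimal sketch checks a list of query entries before they are handed to a tool such as Divine. It assumes the usual HPO identifier layout ("HP:" followed by seven digits); the function names `normalize_hpo_id` and `read_hpo_ids` are hypothetical and are not part of Divine.

```python
import re

# Assumed layout of an HPO identifier: "HP:" followed by exactly seven digits.
HPO_ID_PATTERN = re.compile(r"HP:\d{7}")


def normalize_hpo_id(raw):
    """Return a canonical HPO ID, or None when the text is not an ID."""
    candidate = raw.strip().upper().replace(" ", "")
    return candidate if HPO_ID_PATTERN.fullmatch(candidate) else None


def read_hpo_ids(lines):
    """Split text lines into valid HPO IDs and rejected entries.

    Blank lines and lines starting with '#' are skipped (an assumed convention).
    """
    valid, rejected = [], []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        hpo_id = normalize_hpo_id(text)
        if hpo_id is not None:
            valid.append(hpo_id)
        else:
            rejected.append(text)
    return valid, rejected
```

Entries such as free-text term names end up in the rejected list, so they can be converted to IDs with one of the websites above before the analysis is run.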

3.2 Input and output

When only HPO IDs are given, Divine generates an inferred disease list with associated genes. If a VCF file (with HPO IDs) is given, it generates an annotated variant table and an annotated inferred disease ranking list in Microsoft® Excel format. For the best results, provide both a VCF file (e.g., generated by the GATK germline variant caller [16]) and a set of HPO IDs. Divine mainly relies on an existing annotation framework, Varant [11], which originally provided 22 annotations; we add eight new features in Divine (see the last eight items in S. Table 1).

3.3 Installation and manual

https://github.com/cjhong/divine

4 Methods:

4.1 Comparing known disease phenotypes with patient phenotypes

Given a patient query HPO set, H = {1, 2, …, m, …, M}, we calculate its semantic similarity to the HPO set of each known disease j, Dj = {1, 2, …, n, …, N}. This yields an M-by-N matrix of term-to-term similarities, $s_{m,n}$.

We use the simRel [10] semantic measure defined in [s.eq2]. In the equation, $p_m$ is the annotation probability of term m (its information content is $-\log p_m$), and CLA stands for the common lowest ancestor of m and n in the ontology graph. To summarize the M-by-N similarity matrix into a single value, s(H, Dj), we adapt a method suggested in [20]: we average the row-wise maxima (the best match for each patient term) and the column-wise maxima (the best match for each disease term), and take the larger of the two averages.

Phenotypic descriptions of diseases are incomplete and sparse, and the number of phenotypes describing a disease varies significantly among diseases. Thus, we penalize the maximum average value by dividing it by a logarithmic function of |M − N|,

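$$ s(H, D_j) = \frac{\max\left\{ \frac{1}{M}\sum_{m=1}^{M} \max_{n} s_{m,n},\; \frac{1}{N}\sum_{n=1}^{N} \max_{m} s_{m,n} \right\}}{\log(|M - N| + 10)} \qquad \text{[s.eq1]} $$

$$ s_{m,n} = \frac{2\log(p_{CLA})}{\log(p_m) + \log(p_n)}\,(1 - p_{CLA}) \qquad \text{[s.eq2]} $$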
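The two equations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term probabilities are taken as precomputed inputs, the logarithm in the penalty of [s.eq1] is assumed to be base 10 (so that the penalty equals 1 when M = N), and the function names `sim_rel` and `set_similarity` are hypothetical.

```python
import math


def sim_rel(p_m, p_n, p_cla):
    """Term-to-term similarity of [s.eq2] from annotation probabilities in (0, 1]."""
    denominator = math.log(p_m) + math.log(p_n)
    if denominator == 0.0:  # both terms are the ontology root (p = 1)
        return 0.0
    return 2.0 * math.log(p_cla) / denominator * (1.0 - p_cla)


def set_similarity(sim):
    """Aggregate an M x N similarity matrix (list of rows) as in [s.eq1]."""
    m_terms, n_terms = len(sim), len(sim[0])
    # Average of the best match found for each patient term (row-wise maxima).
    row_avg = sum(max(row) for row in sim) / m_terms
    # Average of the best match found for each disease term (column-wise maxima).
    col_avg = sum(max(sim[m][n] for m in range(m_terms))
                  for n in range(n_terms)) / n_terms
    # Assumption: base-10 logarithm, so equal set sizes give a penalty of 1.
    penalty = math.log10(abs(m_terms - n_terms) + 10)
    return max(row_avg, col_avg) / penalty
```

With a matrix such as `[[1.0, 0.2], [0.1, 0.5]]`, both averages equal 0.75 and no size penalty applies; adding unmatched disease terms increases N and lowers the score through the denominator.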

One gene can be associated with more than one disease, which is often the case when the diseases are very similar to each other. For each gene i, we retain only the maximum s(H, Dj) over all diseases Dj directly associated with it and assign that value as the gene's phenotypic score,

$$ P_i = \max_{j \,:\, \text{gene } i \text{ is associated with } D_j} s(H, D_j) \qquad \text{[s.eq3]} $$
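The assignment in [s.eq3] amounts to a maximum over the diseases linked to each gene. A minimal sketch follows; the dictionary layouts and the name `gene_phenotype_scores` are illustrative assumptions, not Divine's data model.

```python
def gene_phenotype_scores(disease_scores, disease_genes):
    """Assign each gene the best s(H, Dj) among its associated diseases.

    disease_scores: {disease_id: similarity to the patient query}
    disease_genes:  {disease_id: iterable of associated gene symbols}
    """
    scores = {}
    for disease, similarity in disease_scores.items():
        for gene in disease_genes.get(disease, ()):
            # Keep only the maximum over all diseases linked to this gene.
            scores[gene] = max(scores.get(gene, 0.0), similarity)
    return scores
```

A gene linked to two similar diseases therefore inherits the higher of the two similarity values rather than their sum.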

4.2 Damage prediction from genetic information

Divine uses hg19 (i.e., GRCh37) as the reference genome sequence. By default, Divine filters out any variant outside of exonic regions or UTRs with +/- 20 bp flanking. Note that the user can change this option to handle either whole genome or targeted sequencing reads. As a gene model, we use the NCBI RefSeq gene annotation, containing 52,065 isoform transcripts across 26,668 genes. As described in the main manuscript, Divine filters out any variant frequently observed in a common population, where the user can define the MAF (Minor Allele Frequency) cutoff value. Divine predicts the pathogenicity of a gene from the variants in a VCF file using the following three components: 1) a pathogenic likelihood by amino acid change predicted from known pathogenic databases, 2) an impact score by the variant location within a transcript, and 3) pathogenic density per active protein domain.

4.2.1 Pathogenic likelihood from AA

Taking positive controls from pathogenic variants that appear in ClinVar [15] or HGMD Professional [9], we train a beta distribution (i.e., a cumulative distribution function, FP[a]) of either GERP++ [18] or CADD [19] scores for each amino acid change (a). Similarly, a second beta distribution (i.e., a cumulative distribution function, FB[a]) for a negative control is built from variants appearing in the 1000 Genomes Project VCF files. For a given SNP of interest and its corresponding amino acid change, we look up the two beta distributions of the same amino acid change and compute a pathogenic likelihood ratio,

$$ p_a = \frac{F_P[a](z)}{F_P[a](z) + F_B[a](z)} \qquad \text{[s.eq4]} $$

where z is the GERP++ or CADD score assigned to the variant of interest.
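The ratio in [s.eq4] can be sketched with the standard library alone. The sketch makes several assumptions that are not stated in the text: the score z has been rescaled to the unit interval before the beta distributions are evaluated, the pathogenic CDF is the numerator of the ratio, and the beta parameters in the usage note are invented for illustration. The CDF is obtained by simple midpoint integration of the density rather than by a library routine.

```python
import math


def beta_cdf(x, a, b, steps=2000):
    """Beta(a, b) cumulative distribution at x, by midpoint integration."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    # log of the normalising constant 1 / B(a, b)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h  # midpoint of the k-th slice, strictly inside (0, 1)
        total += math.exp(log_norm
                          + (a - 1.0) * math.log(t)
                          + (b - 1.0) * math.log(1.0 - t))
    return total * h


def pathogenic_likelihood(z, pathogenic_params, benign_params):
    """Assumed reading of [s.eq4]: F_P(z) / (F_P(z) + F_B(z)), z in [0, 1]."""
    f_p = beta_cdf(z, *pathogenic_params)
    f_b = beta_cdf(z, *benign_params)
    if f_p + f_b == 0.0:
        return 0.0
    return f_p / (f_p + f_b)
```

For example, `pathogenic_likelihood(0.5, (2, 1), (1, 2))` evaluates one hypothetical pair of distributions for a single amino acid change; in practice one pair would be fitted per amino acid change from the positive and negative control sets.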

4.2.2 Functional impact by variant location

Each variant receives a distinct pathogenicity factor (pf) according to its location within a transcript; the factors and offsets are combined linearly, in the order of precedence summarized in S. Table 2.

4.2.3 Pathogenic variant density in a protein domain

An in silico conservation score indicates how well each genomic element has been preserved during evolution among mammals such as primates; PhastCons and PhyloP scores are commonly used for this purpose. To extend this idea, we access the IntAct [24] protein domain database and look for regions where known pathogenic variants are densely clustered. From both ClinVar [15] and the HGMD database [9], we count the total numbers of pathogenic and non-pathogenic variants in each domain region. Then, we compute a target Bernoulli parameter (p) from the Wilson score interval at Z-score = 0.75. To account for databases that focus on storing more pathogenic than benign variants, we build a pathogenic density distribution in a log scale from all retained variants and apply a simple logistic regression that assigns a target value of 1.0 to the 50th percentile (pd).

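$$ \hat{p} = \frac{n_P}{n}, \qquad p = \frac{\hat{p} + \frac{z^2}{2n} - z\sqrt{\frac{\hat{p}(1 - \hat{p}) + \frac{z^2}{4n}}{n}}}{1 + \frac{z^2}{n}} \qquad \text{[s.eq5]} $$

where n is the total number of database variants counted in the domain region and $n_P$ is the number of pathogenic variants among them.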
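The lower bound of the Wilson score interval used in [s.eq5] is a standard formula and can be computed directly. In the sketch below the function name and the treatment of an empty domain (returning 0) are assumptions; z = 0.75 is the value stated in the text.

```python
import math


def wilson_lower_bound(n_pathogenic, n_total, z=0.75):
    """Lower bound of the Wilson score interval for a Bernoulli parameter."""
    if n_total == 0:
        return 0.0  # assumption: a domain without any variant gets no density
    p_hat = n_pathogenic / n_total
    centre = p_hat + z * z / (2.0 * n_total)
    margin = z * math.sqrt((p_hat * (1.0 - p_hat) + z * z / (4.0 * n_total)) / n_total)
    return (centre - margin) / (1.0 + z * z / n_total)
```

The bound shrinks the raw fraction of pathogenic variants towards zero when a domain contains few variants, so a domain with 8 pathogenic variants out of 10 scores lower than one with 8,000 out of 10,000.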

4.2.4 Pathogenic score for a mutated gene

Finally, the three values discussed above are combined into a damaging genetic score (Gi), which is normalized by the length (Tl) of the transcript in which the mutation is located,

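$$ G_i = \frac{0.5\,\bigl(p_{(\cdot)} + p_{(\cdot)}\bigr) + 0.5\,\bigl(p_{(\cdot)} + p_{(\cdot)}\bigr)}{T_l} \qquad \text{[s.eq6]} $$

Each $p_{(\cdot)}$ is one of the component scores defined in Sections 4.2.1 to 4.2.3; the subscripts identifying them are not legible here.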

S. Table 2. Parameters to assess pathogenicity depending on a variant location in the transcript, MAF, and zygosity.

| Category | Location of variant | Pathogenicity factor | Description |
| --- | --- | --- | --- |
| Variant-level pathogenicity constant = v | Non-coding region | 0.050 | A variant at non-coding RNA or a transcript whose start/stop codon is unknown. |
| | Intronic region | 0.055 | An intronic variant within 20 bp from a splice site. |
| | Exonic region | 0.350 | A variant on a coding region causing loss of function (nonsense, frameshift, or codon change). |
| | Splice site in the exonic region | 0.450 | An exonic variant at a splice site and its alteration leading to exon skipping. |
| | Splice site in the exonic region (synonymous amino acid alteration) | 0.060 | Same as above but the amino acid is not changed. |
| | Reported as (likely) pathogenic | 0.500 | The identical SNV is reported as pathogenic in literature or a previous lab report (ClinVar, ClinVitae, or HGMD). |
| Gene-level damage offset | Pathogenic variant reported in the other location | v + 0.040 | The gene harboring the variant of interest has a known pathogenic mutation. |
| MAF offset | Logistic regression coefficient in converting MAF to damage factor | v + (1 − β1·exp(1000 × MAF)) / β2 | A max MAF damage offset (in the same VCF file) is assigned to a de-novo mutation. β1 = 0.25; β2 = 1.50 |
| Zygosity weighting factor = δ(a) | Heterozygosity | v × 0.40 for de-novo; v × 0.10 for rare MAF | An autosomal recessive disease model |
| | Homozygosity/Compound Heterozygosity in the same transcript | v × 0.85 for de-novo; v × 0.75 for rare MAF | Either autosomal recessive or dominant disease model |
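To make the use of these parameters concrete, the sketch below stores the constants of S. Table 2 in lookup tables and applies them to a single variant. The dictionary keys, the function name, and the order of operations (gene-level offset first, zygosity weight second) are assumptions made for illustration; the MAF offset is deliberately left out.

```python
# Variant-level constants (v) from S. Table 2; the key names are hypothetical.
LOCATION_FACTOR = {
    "non_coding": 0.050,
    "intronic": 0.055,
    "exonic": 0.350,
    "exonic_splice": 0.450,
    "exonic_splice_synonymous": 0.060,
    "reported_pathogenic": 0.500,
}

# Offset added when the gene already harbors a known pathogenic mutation.
GENE_LEVEL_OFFSET = 0.040

# Zygosity weights from S. Table 2, split by de-novo versus rare-MAF variants.
ZYGOSITY_WEIGHT = {
    "heterozygous": {"de_novo": 0.40, "rare": 0.10},
    "homozygous_or_compound_het": {"de_novo": 0.85, "rare": 0.75},
}


def variant_factor(location, zygosity, frequency_class,
                   gene_has_known_pathogenic=False):
    """Location constant, optional gene-level offset, then zygosity weight.

    Assumption: the offset is applied before the zygosity weighting.
    """
    v = LOCATION_FACTOR[location]
    if gene_has_known_pathogenic:
        v += GENE_LEVEL_OFFSET
    return v * ZYGOSITY_WEIGHT[zygosity][frequency_class]
```

For instance, a heterozygous de-novo loss-of-function variant in an exon receives 0.350 × 0.40 under this reading.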

4.3 Phenotype gene enrichment

The number of genes known to be associated with a given disease is very limited, and discovering a new gene-to-disease association without cohort studies or family trio samples is challenging. Divine utilizes the Gene Ontology (GO) database [17], KEGG pathways [25], and protein-protein interactions [13] to discover new gene-to-disease associations.

4.3.1 Gene ontology enrichment

We start from a seed gene set whose associated diseases have very high phenotype similarity with the patient's phenotypes (e.g., the top 3 diseases), regardless of the genes' Gi. Among the genes with Gi > 0 and Pi = 0, Divine rescues those whose GO semantic similarity score with a seed gene is greater than 0.95 in at least two of the three GO categories (molecular function, biological process, and cellular component). We transfer 40% of the phenotype score, Pi, of the seed gene to the matched gene. Intuitively, a gene (x) with a very high Gx that has never been associated with a disease receives a phenotypic score from a seed gene (y) whose associated disease descriptions largely overlap with the patient's symptoms and whose gene function is similar to that of gene x.
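The rescue rule can be sketched as follows. The thresholds (0.95, two categories, 40%) come from the text; everything else is an assumption for illustration: seeds are approximated by the genes with the highest phenotype scores, GO similarities are supplied as a precomputed dictionary, and when several seeds match the same gene the largest transferred value is kept.

```python
def rescue_by_go(genetic, phenotype, go_similarity, n_seeds=3,
                 threshold=0.95, min_categories=2, transfer=0.40):
    """Transfer part of a seed gene's phenotype score to GO-similar genes.

    genetic:       {gene: G score}
    phenotype:     {gene: P score}; genes without an entry count as P = 0
    go_similarity: {(seed, gene): {"MF": s, "BP": s, "CC": s}}, precomputed
    """
    # Assumption: the seed set is the n_seeds genes with the highest P score.
    seeds = sorted((g for g, p in phenotype.items() if p > 0),
                   key=lambda g: phenotype[g], reverse=True)[:n_seeds]
    updated = dict(phenotype)
    for gene, g_score in genetic.items():
        # Only mutated genes without any phenotype evidence are candidates.
        if g_score <= 0 or phenotype.get(gene, 0.0) != 0:
            continue
        for seed in seeds:
            per_category = go_similarity.get((seed, gene), {})
            hits = sum(1 for s in per_category.values() if s > threshold)
            if hits >= min_categories:
                candidate = transfer * phenotype[seed]
                updated[gene] = max(updated.get(gene, 0.0), candidate)
    return updated
```

A gene that matches a seed in only one GO category keeps a phenotype score of zero, which mirrors the requirement of agreement in at least two categories.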

S. Figure 2. Gene enrichment by computing the GO (gene ontology) similarity score between seed genes with a higher Pi and a mutated gene (e.g., private members of the gene set G* in the Venn diagram of P* and G*) that was never previously associated with a true positive disease of interest. Gene i in P* enriches one private member gene j in G* since the two genes (i and j, in red) show a high GO similarity. The same concept is applied to KEGG pathway enrichment.

4.4 KEGG pathway enrichment

Similarly, among the genes with Gi > 0 and Pi = 0, we can also transfer a seed gene's phenotype score to genes belonging to a KEGG pathway in which another member gene (including the seed gene) has a higher Pi. The enrichment process is formulated on a bipartite graph between genes and KEGG pathways, using a graph Laplacian [26], to assign phenotypic scores indirectly. Label propagation is performed on the graph, where KEGG membership defines the edge set and Pi provides the initial activation values. The final value accumulated at each gene is assigned as its indirect phenotype score.
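One way to picture this step is standard normalized label propagation on the gene-pathway bipartite graph, sketched below. This is a generic scheme chosen for illustration and not necessarily the exact formulation used in Divine; the damping value `alpha`, the iteration count, and the function name are assumptions.

```python
import math


def propagate_over_kegg(pathways, phenotype, alpha=0.5, iterations=50):
    """Spread phenotype scores across a gene-pathway bipartite graph.

    pathways:  {pathway_id: set of member genes}
    phenotype: {gene: P score}, used as the initial activation
    """
    # Build the bipartite adjacency; nodes are tagged to keep names distinct.
    neighbours = {}
    for pathway, genes in pathways.items():
        p_node = ("pathway", pathway)
        for gene in genes:
            g_node = ("gene", gene)
            neighbours.setdefault(p_node, set()).add(g_node)
            neighbours.setdefault(g_node, set()).add(p_node)
    degree = {node: len(adj) for node, adj in neighbours.items()}
    # Initial activation: P on gene nodes, zero on pathway nodes.
    initial = {node: (phenotype.get(node[1], 0.0) if node[0] == "gene" else 0.0)
               for node in neighbours}
    current = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node, adj in neighbours.items():
            # Symmetrically normalised inflow from neighbouring nodes.
            spread = sum(current[other] / math.sqrt(degree[node] * degree[other])
                         for other in adj)
            updated[node] = alpha * spread + (1.0 - alpha) * initial[node]
        current = updated
    return {node[1]: value for node, value in current.items()
            if node[0] == "gene"}
```

A gene that shares a pathway with a high-scoring seed accumulates a positive value after two propagation steps (seed to pathway, pathway to gene), whereas genes in unrelated pathways stay at zero.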

4.5 Combining Gi and Pi and obtaining a final ranking by a heat diffusion on STRING network.

Now that both Pi and Gi are available for each gene, Divine combines the two pathogenic scores in a Bayesian framework [4],

$$ y_i = \frac{P_i G_i}{P_i G_i + (1 - P_i)(1 - G_i)}. $$

Then, yi is fed into a heat diffusion process [12] on the STRING database [13] as the excitation node value in [s.eq7]. STRING is a functional protein association network (A) in which a node represents a gene product (i.e., a protein) and an edge represents how strongly two proteins interact or are functionally associated.

$$ x^{(n+1)} = \left(1 - \frac{\gamma}{N}\right) x^{(n)} + \lambda A x^{(n)} + (1 - \lambda)\, y \qquad \text{[s.eq7]} $$

We set γ = 0.9, N = 100, and λ = 2, respectively, such that the predicted damage score x is smoothed over the network while avoiding inflation. For each gene i, a steady-state value xi is obtained either after a maximum of n = 150 recursions or when the stop condition |x^(n+1) − x^(n)| < 1e-4 is met. Finally, xi is used to prioritize all variants that appear in the VCF file.
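The two steps of this section can be illustrated with a small sketch. It rests on explicit assumptions: the Bayesian combination is taken to be the product form P·G / (P·G + (1 − P)(1 − G)), and the diffusion is a simplified discrete heat-kernel iteration in the style of DiffusionRank [12] on row-normalized edge weights that omits the λ-weighted terms of [s.eq7]. The function names and the toy network are hypothetical.

```python
def combine_scores(p, g):
    """Assumed Bayesian-style combination of a phenotype and a genetic score."""
    numerator = p * g
    denominator = numerator + (1.0 - p) * (1.0 - g)
    return numerator / denominator if denominator > 0 else 0.0


def heat_diffusion(edges, y, gamma=0.9, n_steps=100, tol=1e-4):
    """Simplified heat diffusion of initial values y over a weighted network.

    edges: {(gene_a, gene_b): weight}, treated as undirected
    y:     {gene: initial heat}, e.g. the combined scores
    """
    neighbours = {gene: {} for gene in y}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight
    x = {gene: y.get(gene, 0.0) for gene in neighbours}
    if not x:
        return x
    rate = gamma / n_steps
    for _ in range(n_steps):
        new_x = {}
        for gene, adj in neighbours.items():
            total = sum(adj.values())
            # Weighted average of the neighbours; isolated genes keep their heat.
            inflow = (sum(w * x[other] for other, w in adj.items()) / total
                      if total else x[gene])
            new_x[gene] = (1.0 - rate) * x[gene] + rate * inflow
        change = max(abs(new_x[gene] - x[gene]) for gene in x)
        x = new_x
        if change < tol:  # stop once the values have settled
            break
    return x
```

In this toy form, a gene without phenotype evidence that interacts strongly with a high-scoring gene gains heat from it, which is the qualitative behaviour described for the C3AR1 case in Section 5.4.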

5 Experiments

5.1 The other methods for comparison

Five similar methods (Phen-Gen [4], Exomiser_v7.2, Exomiser_v10 [6], eXtasy [3], and PhenIx [5]) are evaluated together with Divine. These programs also accept HPO terms and VCF files as input. Note that all public databases used in Divine were updated by March 2016 and that no variant-level pathogenic mutation database (e.g., ClinVar or HGMD) is used for the Divine runs.

5.2 26 WES retrospective study samples

A total of 26 in silico WES samples are used for testing. The samples cover a wide spectrum of disease cases (S. Table 5). Among them, 23 cases correspond to real patient samples studied in [7, 14] and published around early 2015. For the other three samples, we spike a variant (in SLC9A3 [21], VPS13C [22], or SEPSECS [23]) confirmed as a pathogenic mutation in recent studies into an NA12878 WES VCF file. The three simulated samples and one real patient sample are specifically designed to test how well each method can discover a new gene-to-disease association.

Patient clinical features are extracted from the original publications and converted into HPO IDs. The number of HPO terms per case ranges from 3 to 24 (mean 10.8, standard deviation 5.87); see S. Table 5. After an initial filtration of variants (e.g., highly frequent SNPs or poorly annotated variants such as those in intergenic regions), an average of 1,814 variants across 1,635 genes are retained.

5.3 AUC scores

S. Table 3 summarizes the rankings obtained by the six methods. Divine includes 25 pathogenic mutations in its prioritization report, with a higher ranking than the other methods. Both Divine and Exomiser_v10 (downloaded March 2018) outperform the other methods, mainly because the relevant disease-gene association information is not populated in the other methods' databases.

Except for one case, in which both Divine and Exomiser_v10 fail to detect the disease-causing gene because it is affected by a copy number alteration (see the SEMA3D case in S. Table 5), they rank the genes confirmed by the original studies highly.

S. Table 3. AUC scores and the total number of failed samples. A total of 26 WES VCF files (including three simulated samples), each with its patient HPOs, are used to evaluate six methods. The original studies were mainly published between 2014 and early 2016. These results show that both Exomiser_v10 (Area Under Curve (AUC) score: 0.9542) and Divine (AUC: 0.9585) outperform the other methods. Also, both methods report the genes carrying a disease-causing variant confirmed by the original study within the first 1,000 genes, except for the one case that is associated with a copy number alteration.

| Method | AUC | # of samples not detected |
| --- | --- | --- |
| Divine | 0.9585 | 1 |
| Phen-Gen | 0.482 | 13 |
| Exomiser_v7.2 | 0.5766 | 10 |
| Exomiser_v10 | 0.9542 | 1 |
| eXtasy | 0.4995 | 4 |
| PhenIx | 0.7276 | 5 |

S. Figure 3. Scatter plots of the disease-causing gene rank between Divine and each of the other methods.

5.4 A Case Study with Atypical Hemolytic Uremic Syndrome (aHUS) [14]

We run Divine on the case of a 64-year-old male patient with schistocytes in the peripheral blood smear and a complex, life-threatening coagulation disorder causing recurrent venous thromboembolic events, severe thrombocytopenia, and subdural hematomas. The patient's WES VCF contains a frameshift mutation in C3AR1 (c.355-356dup, p.Asp119Alafs*19). The mutation produces an unusual receptor for complement factor C3a, which is speculated to affect host cell opsonization. The frameshift is proposed as the principal mechanism for the dysregulation of the alternative complement pathway in the patient: the amplification loop of C3 convertase results in greater than normal activation of the terminal pathway, which could explain the microangiopathic hemolytic anemia and consumptive thrombocytopenia. Consistent with this, the patient recovered after being treated with monoclonal antibodies that inhibit the terminal pathway.

The case result was published in April 2016, after the Divine database was built for testing. C3 is linked to Complement Component 3 Deficiency (OMIM: 613779), a disease that Divine's phenotype matching ranks highly against the patient's phenotypic descriptions. Genes associated with such a disease usually receive a higher phenotype score. However, C3AR1 is not registered with any complement deficiency disease.

In the KEGG pathway "Complement and Coagulation Cascades", C3AR1 (complement component 3a receptor 1, the receptor for C3a) is directly connected to C3. In the STRING interaction network database, a combined score of 0.968 is reported for the two proteins. Before the Divine enrichment process, the phenotype score of C3AR1 was 0; after the enrichment and heat diffusion process, Divine ranks the C3AR1 frameshift mutation 4th from the top.

5.5 Divine under noise HPO queries

In practice, incorrect phenotype information or clinical features that are irrelevant to the patient's diagnosis are often collected when a patient is examined. To test the robustness of Divine under such noisy phenotypic terms, we randomly sample additional noise HPO terms, half as many as the original query HPOs. For each case, we repeat the experiment on Divine with 100 HPO input sets that contain both the original HPOs and the noise HPOs. Overall, the ranks drop. For the cases of CACNA1B, DTNA, SLC9A3, and VPS13C, the rank is significantly impacted by the additional noisy HPOs (on average 32 times lower than the original rank from the top). For the other cases, the average rank is 8.10, suggesting that the Divine reports are still helpful.

S. Figure 4. Ranked genes of interest by Divine after adding irrelevant query HPOs to the original patient HPO terms. The number of noise HPOs is 50% of the number of original HPO queries. The noise HPOs are sampled 100 times for each case from the terms whose HPO semantic similarity with the original HPO query set is lower than 10%. Red dots indicate the original Divine gene ranking.
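The construction of the noisy query sets can be sketched as below. The sketch assumes that the candidate pool has already been restricted to terms with low semantic similarity to the original query, rounds half the query size down (with a minimum of one noise term), and uses a fixed random seed for reproducibility; these choices and the function name are illustrative assumptions.

```python
import random


def make_noisy_queries(original, candidate_pool, n_replicates=100,
                       noise_fraction=0.5, seed=0):
    """Build replicate query sets: original HPO IDs plus sampled noise IDs.

    candidate_pool is assumed to be pre-filtered to terms whose semantic
    similarity with the original query set is low.
    """
    rng = random.Random(seed)
    original_set = set(original)
    # Never sample a term that is already part of the patient query.
    pool = [term for term in candidate_pool if term not in original_set]
    # Assumption: half the query size, rounded down, but at least one term.
    n_noise = max(1, int(len(original) * noise_fraction))
    return [list(original) + rng.sample(pool, n_noise)
            for _ in range(n_replicates)]
```

Each replicate would then be run through the prioritization, and the mean and standard deviation of the resulting ranks summarized per case.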

S. Table 4. Summary of S. Figure 4 in table format, comparing the original ranking with the ranking after adding an irrelevant HPO set to the original HPO IDs.

| Gene studied in the original patient diseases | Ranked from the top | After adding noisy HPO queries | Standard deviation |
| --- | --- | --- | --- |
| AFF4 | 3 | 15.30 | 3.82 |
| C3AR1 | 4 | 18.27 | 56.84 |
| CACNA1B | 3 | 36.71 | 35.09 |
| CEP120 | 1 | 1.20 | 0.41 |
| CHCHD10 | 1 | 5.84 | 5.35 |
| COL17A1 | 2 | 7.01 | 4.63 |
| COQ4 | 1 | 6.08 | 5.67 |
| DCDC2 | 1 | 1.44 | 0.90 |
| DDX58 | 5 | 8.14 | 3.01 |
| DPM2 | 1 | 1.10 | 0.31 |
| DTNA | 1 | 24.01 | 26.41 |
| ETV6 | 2 | 9.44 | 9.25 |
| KCNA2 | 9 | 16.46 | 8.88 |
| KCNC1 | 13 | 35.48 | 24.61 |
| NALCN | 1 | 1.56 | 0.72 |
| PKLR | 1 | 1.82 | 1.45 |
| PNKP | 1 | 1.39 | 0.77 |
| PTRH2 | 1 | 1.03 | 0.16 |
| SEMA3D | - | - | - |
| SEPSECS | 2 | 2.72 | 1.69 |
| SLC9A1 | 3 | 5.06 | 2.45 |
| SLC9A3 | 2 | 106.07 | 116.86 |
| SNRPB | 21 | 23.55 | 8.25 |
| USP8 | 6 | 5.39 | 3.01 |
| VPS13C | 6 | 234.62 | 141.57 |
| WWOX | 2 | 1.87 | 1.04 |

S. Table 5. Disease-causing gene rankings for 26 WES samples (including three simulated datasets) by Divine, Phen-Gen, Exomiser (v7.2 and v10), eXtasy, and PhenIx. Note that '-' (shown as NA) in a ranking column indicates 'failed or not detected' within the top 1,000 rankings. The highlighted cases are used to test new gene-to-disease discovery.

| ID | Disease-causing gene (# of phenotypes) | Pathogenic variant reported in the original studies | Publication | Published date | Spiked? | Divine | Phen_Gen | Exomiser_v7.2 | Exomiser_v10 | eXtasy | PhenIx |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | AFF4 (24) | NM_014423:c.760A>G | Germline gain-of-function mutations in AFF4 cause a developmental syndrome functionally linking the super elongation complex and cohesin [Izumi K.] | 2015/04 | N | 3 | NA | 22 | 1 | 1116 | 192 |
| 2 | C3AR1 (14) | NM_004054:c.355dupG | Whole-exome sequencing of a patient with severe and complex hemostatic abnormalities reveals a possible contributing frameshift mutation in C3AR1 | 2016/07 | N | 4 | 246 | NA | 21 | NA | NA |
| 3 | CACNA1B (12) | NM_000718:c.4166G>A | CACNA1B mutation is linked to unique myoclonus-dystonia syndrome [Groen JL.] vs. https://academic.oup.com/hmg/article/24/18/5326/688881 | 2015/02 | N | 3 | NA | 37 | 1 | NA | 159 |
| 4 | CEP120 (16) | NM_153223:c.595G>C | A founder CEP120 mutation in Jeune asphyxiating thoracic dystrophy expands the role of centriolar proteins in skeletal ciliopathies [Shaheen R.] | 2015/03 | N | 1 | 24 | 120 | 1 | 666 | NA |
| 5 | CHCHD10 (19) | NM_213720:c.172G>C | Mutation in the novel nuclear-encoded mitochondrial protein CHCHD10 in a family with autosomal dominant mitochondrial myopathy [Ajroud-Driss S.] | 2015/01 | N | 1 | NA | 2 | 2 | NA | 185 |
| 6 | COL17A1 (6) | NM_000494:c.2816C>T | Mutations in collagen, type XVII, alpha 1 (COL17A1) cause epithelial recurrent erosion dystrophy (ERED) [Jonsson F.] | 2015/04 | N | 2 | NA | 371 | 4 | 113 | 14 |
| 7 | COQ4 (19) | NM_016035:c.433C>G | COQ4 mutations cause a broad spectrum of mitochondrial disorders associated with CoQ10 deficiency [Brea-Calvo G.] | 2015/02 | N | 1 | 28 | NA | 6 | 2578 | 279 |
| 8 | DCDC2 (13) | NM_001195610:c.649A>T | DCDC2 mutations cause a renal-hepatic ciliopathy by disrupting Wnt signaling [Schueler M.] | 2015/01 | N | 1 | 29 | NA | 2 | 512 | 431 |
| 9 | DDX58 (5) | NM_014314:c.1118A>C | Mutations in DDX58, which encodes RIG-I, cause atypical Singleton-Merten syndrome [Jang MA.] | 2015/02 | N | 5 | NA | 3 | 3 | 791 | 175 |
| 10 | DPM2 (13) | NM_003863:c.68A>G | Low budget analysis of Direct-To-Consumer genomic testing familial data [Glusman] | 2012/07 | N | 1 | 10 | NA | 1 | 99 | 1 |
| 11 | DTNA (3) | NM_001198938:c.1963G>T | Identification of two novel mutations in FAM136A and DTNA genes in autosomal-dominant familial Meniere's disease [Requena T.] | 2015/02 | N | 1 | NA | 129 | 124 | 257 | 196 |
| 12 | ETV6 (12) | NM_001987:c.1195C>T | Germline ETV6 mutations in familial thrombocytopenia and hematologic malignancy [Zhang MY.] | 2015/02 | N | 2 | NA | 1 | 1 | 7 | 18 |
| 13 | KCNA2 (8) | NM_004974:c.440G>A | De novo loss- or gain-of-function mutations in KCNA2 cause epileptic encephalopathy [Syrbe S.] | 2015/04 | N | 9 | NA | 12 | 3 | 280 | 279 |
| 14 | KCNC1 (9) | NM_001112741:c.959G>A | A recurrent de novo mutation in KCNC1 causes progressive myoclonus epilepsy [Muona M.] | 2015/01 | N | 13 | NA | 7 | 4 | 187 | 142 |
| 15 | NALCN (3) | NM_052867:c.1733A>C | De novo mutations in NALCN cause a syndrome characterized by congenital contractures of the limbs and face, hypotonia, and developmental delay [Chong JX.] | 2015/03 | N | 1 | 42 | NA | 1 | 765 | 6 |
| 16 | PKLR (3) | NM_000298:c.1706G>A, c.1022G>C | Exome sequencing and unrelated findings in the context of complex disease research: ethical and clinical implications [Lyon GJ.]; ADHD; Idiopathic Hemolytic Anemia | 2011/07 | N | 1 | NA | 1 | 1 | 1 | 2 |
| 17 | PNKP (11) | NM_007254:c.1123G>T | Mutations in PNKP cause recessive ataxia with oculomotor apraxia type 4 [Bras J.] | 2015/03 | N | 1 | 28 | 753 | 1 | 179 | 64 |
| 18 | PTRH2 (10) | NM_001015509:c.257A>C | Accelerating novel candidate gene discovery in neurogenetic disorders via whole-exome sequencing of prescreened multiplex consanguineous families [Alazami AM.] | 2015/01 | N | 1 | 37 | NA | 1 | 319 | 134 |
| 19 | SEMA3D (10) | CNV [gain] | Disruption of the SEMA3D gene in a patient with congenital heart defects [Sanchez-Castro M.] | 2015/01 | N | NA | NA | NA | NA | 3695 | NA |
| 20 | SEPSECS (9) | NM_016955:c.356A>G, c.77delG | Milder progressive cerebellar atrophy caused by biallelic SEPSECS mutations [Iwama K.] | 2016/02 | Y | 2 | 89 | 226 | 3 | 2435 | NA |
| 21 | SLC9A1 (17) | NM_003047:c.913G>A | Mutation of SLC9A1, encoding the major Na⁺/H⁺ exchanger, causes ataxia-deafness Lichtenstein-Knorr syndrome [Guissart C.] | 2015/01 | N | 3 | 28 | NA | 3 | 69 | 149 |
| 22 | SLC9A3 (3) | NM_001284351:c.1718delC, c.1145G>A | Reduced sodium/proton exchanger NHE3 activity causes congenital sodium diarrhea [Janecke AR.] | 2015/12 | Y | 2 | 60 | 12 | 1 | NA | 93 |
| 23 | SNRPB (14) | NM_198216:c.166G>C | Mutations in SNRPB, encoding components of the core splicing machinery, cause cerebro-costo-mandibular syndrome [Bacrot S.] | 2015/02 | N | 21 | NA | 359 | 226 | 2394 | 1080 |
| 24 | USP8 (3) | NM_005154:c.2152T>C | Mutations in the deubiquitinase gene USP8 cause Cushing's disease [Reincke M.] | 2015/01 | N | 6 | NA | 13 | 1 | 2480 | 171 |
| 25 | VPS13C (7) | NM_018080:c.8582C>T, c.8316+2T>G, c.3743G>T | Loss of VPS13C Function in Autosomal-Recessive Parkinsonism Causes Mitochondrial Dysfunction and Increases PINK1/Parkin-Dependent Mitophagy [Lesage S.] | 2016/03 | Y | 6 | 29 | NA | 2 | 3889 | NA |
| 26 | WWOX (17) | NM_001291997:c.666G>A | WWOX-related encephalopathies: delineation of the phenotypical spectrum and emerging genotype-phenotype correlation [Mignot C.] | 2015/01 | N | 2 | 24 | NA | 1 | 88 | 383 |

6 Reference

[1] Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), (2016). World Wide Web URL: https://omim.org/

[2] Sebastian Köhler et al. (2017). The Human Phenotype Ontology in 2017 Nucl. Acids Res. doi: 10.1093/nar/gkw1039

[3] Sifrim, A. et al. (2013). eXtasy: variant prioritization by genomic data fusion. Nat. Methods, 10, 1083– 1084.

[4] Javed A. et al. (2014) Phen-Gen: combining phenotype and genotype to analyze rare disorders. Nat. Methods, 11:935–937.

[5] Zemojtel T. et al. (2014). Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci. Transl. Med. ; 6:252ra123.

[6] Robinson, P.N. et al. (2014) Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res., 24, 340–348.

[7] Antanaviciute A. et al. (2015). OVA: integrating molecular and physical phenotype data from multiple biomedical domain ontologies with variant filtering for enhanced variant prioritization. Bioinformatics. 31(23):3822–3829. 10.1093/bioinformatics/btv473

[8] Richards, S. et al. (2015). Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–423

[9] Stenson PD. et al. (2017). The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 136:665-677

[10] Schlicker,A. et al. (2006). A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics, 7, 302.

[11] Kunal Kundu et al. (2014). Varant, Intelligent Systems for Molecular Biology (ISMB) conference, http://compbio.berkeley.edu/proj/varant

[12] Haixuan Yang et al. (2007). DiffusionRank: a possible penicillin for web spamming. SIGIR '07. ACM, New York, NY, USA, 431-438. DOI: https://doi.org/10.1145/1277741.1277815

[13] Szklarczyk D. et al. (2017). The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45:D362-68

[14] Leinoe E. et al. (2016). Whole-exome sequencing of a patient with severe and complex hemostatic abnormalities reveals a possible contributing frameshift mutation in C3AR1. Cold Spring Harbor Molecular Case Studies, 2, a000828

[15] Landrum MJ et al. (2018), ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. PMID:29165669

[16] Van der Auwera GA et al., (2013), From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline, CURRENT PROTOCOLS IN BIOINFORMATICS 43:11.10.1-11.10.33

[17] The Gene Ontology Consortium, 2017, Expansion of the Gene Ontology knowledgebase and resources, Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D331–D338

[18] Davydov EV. et al. (2010). Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Comput Biol. 6(12):e1001025

[19] Kircher M. et al. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. doi: 10.1038/ng.2892. PubMed PMID: 24487276.

[20] Schlicker A. et al. (2008), FunSimMat: a comprehensive functional similarity database. Nucleic Acids Research, 36(Database issue):D434-439. doi: 10.1093/nar/gkm806

[21] Janecke AR. et al. (2015). Reduced sodium/proton exchanger NHE3 activity causes congenital sodium diarrhea. Human Molecular Genetics, 24(23):6614–6623. https://doi.org/10.1093/hmg/ddv367

[22] Lesage S. et al. (2016). Loss of VPS13C Function in Autosomal-Recessive Parkinsonism Causes Mitochondrial Dysfunction and Increases PINK1/Parkin-Dependent Mitophagy. https://doi.org/10.1016/j.ajhg.2016.01.014

[23] Iwama, K et al., Milder progressive cerebellar atrophy caused by biallelic SEPSECS mutations, Journal of Human Genetics 61(6) · February 2016 DOI: 10.1038/jhg.2016.9

[24] Sandra Orchard et al. 2014, The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases, Nucleic Acids Research, Volume 42, Issue D1, 1, Pages D358–D363

[25] Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., and Morishima, K. (2017). KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353-D361

[26] Tae Hyun Hwang, Hugues Sicotte, Ze Tian, Baolin Wu, Dennis A. Wigle, Jean-Pierre Kocher, Vipin Kumar and Rui Kuang., Robust and Efficient Identification of Biomarkers by Classifying Features on Graphs, Bioinformatics, Vol. 24, No. 18, pages 2023-2029, 2008