Development of A Transcriptome-Based Genome Assembly Tool and Whole Genome Sequencing for Autism Spectrum Disorders

Robert Baldwin

Centre for Biotechnology

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Faculty of Biological Sciences, Brock University, St. Catharines, Ontario

© 2018

Abstract

This thesis consisted of two independent projects. The first involved developing a software tool that uses transcriptome data to improve genome assemblies. The second involved processing and analyzing whole genome sequencing (WGS) data from the ASPIRE autism spectrum disorder (ASD) cohort.

The first project produced the bioinformatics software called RDNA. This free tool was written in Perl and should be valuable for users interested in genome assembly. Comparative assessment between RDNA and the leading transcript based scaffolding software showed that RDNA can significantly improve genome assemblies while making relatively few scaffolding connection errors. RDNA also makes possible the assembly of scaffolding connections, including gap filling, using BLAST.

The second project was undertaken with collaborators and involved processing and analyzing whole genome sequencing (WGS) data from the ASPIRE ASD cohort. The ASPIRE ASD cohort consisted of several hundred probands from both simplex and multiplex families. Sequencing occurred for 120 of these individuals who were selected based upon membership in two phenotype clusters (C1 and C2). These individuals had a relatively high rate of intellectual disability (ID) compared to heavily studied ASD cohorts such as the Simons Simplex Collection (SSC), indicating a significant involvement of de novo sequence variants. Analysis of rare single nucleotide variants (SNVs) and insertion/deletions (indels) identified large risk factors for severe neurodevelopmental disorders (NDDs), two of which were previously observed de novo among individuals with severe, undiagnosed NDDs. On this basis, ABCA1 was found to be a novel candidate risk gene. Gene ontology (GO) analysis of rare loss of function and missense SNVs indicated the importance of lipid metabolic processes and synaptic signalling. Overall, the genetic variation examined by this study pertained to a modest number of cases, consistent with previous findings that ASD is a genetically heterogeneous disorder with a complex genetic architecture.

Table of Contents

Chapter I: Improving de novo genome assemblies using RDNA

Chapter II: Whole genome sequencing and variant discovery in the ASPIRE autism cohort

References

Appendices

Acknowledgments

I’d like to thank my supervisor Ping Liang for taking me on as his student and supervising my work, including helping edit this manuscript. I’d also like to thank Radesh for his work helping prepare the genome assemblies needed to finish the evaluations for the assembly software, and my committee members, Fiona Hunter and Feng Li.

Tables and Figures

Figure 1.1: Overview of RDNA scaffolding method

Figure 1.2: Scaffolding performance of RDNA and LRNA for human data

Table 1.1: Scaffolding performance of RDNA and LRNA for human transcriptome data

Table 1.2: RDNA and L_RNA_scaffolder BUSCO assessment results for human transcriptome data

Figure 1.3: Error rates for RDNA and LRNA for human data

Figure 1.4: Scaffolding performance of RDNA and LRNA for lavender data

Table 1.3: RDNA and L_RNA_scaffolder genome assembly quality metrics for lavender data

Table 1.4: RDNA and L_RNA_scaffolder BUSCO results for lavender data

Table 2.1: GO enrichment results from the combined C1 and C2 SNV gene set

Table 2.2: Previously validated de novo variants

Table 2.3: Prioritized novel PTVs and missense variants

Appendix, Table 1: RDNA output example

Appendix, Table 2: Variants identified in this study previously prioritized by collaborators

Appendix, Table 3: All missense variants identified in this study not previously prioritized by collaborators

Appendix, Table 4: All PTVs identified in this study not previously prioritized by collaborators

Appendix, Table 5: C1 and C2 missense and PTV gene sets used for GO analysis

Appendix, Table 6: Key variant calling metrics collected with PICARD

Abbreviations

ASD – autism spectrum disorder

BAM – binary alignment/map file

BGI – Beijing Genomics Institute

BQSR – base quality score recalibration

CADD – combined annotation dependent depletion

CNV – copy number variant

DDD – Deciphering Developmental Disorders Study

DSM – diagnostic and statistical manual of mental disorders

EST – expressed sequence tags

ExAC – Exome Aggregation Consortium

GATK – genome analysis toolkit

GDD – global developmental delay

gnomAD – Genome Aggregation Database

GO – gene ontology

ID – intellectual disability

LGD – likely gene disrupting variant

MSSNG – MISSING Project

NDD – neurodevelopmental disorder

PE – putative error

PTV – protein truncating variant

pLI – probability of loss of function intolerance (ExAC)

RefSeq – NCBI reference sequence

SFARI – Simons Foundation Autism Research Initiative

SSC – Simons Simplex Collection

SRA – sequence read archive

SNV – single nucleotide variant

VEP – Ensembl variant effect predictor

VCF – variant call format

VQSR – variant quality score recalibration

WES – whole exome sequencing

WGS – whole genome sequencing

1000GP – One thousand genomes project

Chapter I: Improving de novo genome assemblies using transcriptome sequences

Abstract

Genome assembly is a major challenge due to the short reads generated by next generation sequencing (NGS) technology and can be assisted using the connection information provided by NGS RNA sequencing (RNA-seq) data. In particular, transcriptomes are commonly being assembled as part of genome projects. Yet the development of assembly tools designed to work with this data is lacking and limited to a single tool, L_RNA_scaffolder. Presented here is a transcript based genome assembly tool called RDNA. Assessment using human data showed that the connection error rate increased dramatically for both RDNA and L_RNA_scaffolder using a transcriptome assembled de novo compared to the NCBI reference sequence (RefSeq) transcripts. However, the connection error rate for RDNA was much lower than for L_RNA_scaffolder. The higher error rate for L_RNA_scaffolder was the cost of making a greater number of scaffolding connections. In addition, RDNA offers utilities for gap filling and for joining and collapsing overlapping scaffolds. Overall, RDNA has advantages over L_RNA_scaffolder and should be especially useful for users wishing to minimize assembly errors.

Introduction:

The scaffolding stage of genome assemblies generally relies on paired reads obtained from long insert fosmid or jumping libraries to order, orient, and connect contigs into larger sequences called scaffolds. Unfortunately, this method of scaffolding is complex, time consuming, and leads to a high rate of incorrect and missed connections. In the ongoing effort to improve genome assemblies there is an interest in novel scaffolding methods that make use of data generated from NGS RNA-seq approaches.

Transcriptome profiling will likely be included as part of many genome projects, and the data can be used to scaffold the transcribed regions of a genome (Xue et al. 2013). These regions correspond to features such as genes, non-coding RNA, and small RNA, which most likely will be the primary subject of subsequent biological research. As the quality of these annotations is affected by the quality of the underlying assembly, scaffolding with RNA-seq data addresses the need for the genomic regions containing these features to be well assembled (Denton et al. 2014).

Although mRNAs and expressed sequence tags (ESTs) were used to help assemble the human reference genome, there are very few readily available tools that are specifically designed to work with RNA-seq. For transcriptome based scaffolding there is a single tool, L_RNA_scaffolder (Xue et al. 2013). Furthermore, the performance of this tool has been poorly evaluated. Perhaps for these reasons some genome projects are continuing to rely upon custom RNA-seq based scaffolding strategies (Warren et al. 2015), or have decided to avoid them entirely.

Presented here is a transcript based scaffolding tool called RDNA that is capable of outperforming L_RNA_scaffolder based on assembly quality metrics and BUSCO genome completeness analysis (Simao et al. 2015). Comparative assessment showed the connection error rate of RDNA to be more than two times lower than that of L_RNA_scaffolder regardless of what transcript data was used as input. RDNA therefore offers an alternative scaffolding method that provides a much lower connection error rate than L_RNA_scaffolder. Furthermore, it includes sequence merging, including gap filling, as a unique alternative to scaffolding connections. The Perl code for RDNA is freely available at GitHub (http://github.com/RobertWBaldwin).

Materials and Methods:

The main problem and general strategy used to solve it

RDNA was written in Perl (version 5.22.1) and uses a familiar “divide and conquer” strategy whereby a complex problem is broken down into a set of smaller, easier to solve problems. Most of these smaller problems were solved using graph algorithms, or modifications of such algorithms, which are used extensively when working with biological sequence data, including genome assembly problems (Jones and Pevzner 2009). Using this approach, a genomic scaffold is represented as an ordered, linear array of vertices (i.e., genomic sequences).

Determining the correct scaffold for a genomic region using transcriptome data is challenging because what can be complex alignment data must be reduced to a simple data structure. A genomic region may have multiple different transcripts, some of which were reverse transcribed and/or underwent alternative splicing. Utilizing this data, therefore, requires both determining the correct edges for each transcript and combining edges across multiple transcripts. The former problem may be solved by assigning a score, called the alignment score, to each RNA/DNA alignment. The correct scaffold for a transcript then becomes the set of vertices whose sum of vertex scores (i.e., alignment scores) is the greatest. In a directed graph, this set of vertices is called the maximum scoring path (Figure 1.1 A).
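The maximum scoring path idea can be illustrated with a short Python sketch. This is not the RDNA implementation (RDNA is written in Perl); it is a minimal example that assumes the graph for one transcript is acyclic and uses a dynamic programming pass over a topological order in place of enumerating every path. The function name, contig names, and scores are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def max_scoring_path(scores: Dict[str, float],
                     edges: Dict[str, List[str]]) -> Tuple[float, List[str]]:
    """Return (score, path) for the best vertex-weighted path in a DAG.

    scores: alignment score of each genomic sequence (vertex).
    edges:  allowed connections, keyed by the upstream vertex.
    Assumes all scores are positive and the graph has no cycles.
    """
    # Kahn's algorithm: build a topological order of the vertices.
    indegree = {v: 0 for v in scores}
    for targets in edges.values():
        for v in targets:
            indegree[v] += 1
    ready = [v for v in scores if indegree[v] == 0]
    order: List[str] = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in edges.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(scores):
        raise ValueError("graph contains a cycle")

    # best[v] is the highest score of any path that ends at v.
    best = dict(scores)
    prev: Dict[str, Optional[str]] = {v: None for v in scores}
    for u in order:
        for v in edges.get(u, []):
            if best[u] + scores[v] > best[v]:
                best[v] = best[u] + scores[v]
                prev[v] = u

    # Trace back from the best end vertex.
    node: Optional[str] = max(best, key=best.get)
    top = best[node]
    path: List[str] = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return top, path[::-1]


# Hypothetical transcript aligned to four contigs.
scores = {"c1": 40, "c2": 35, "c3": 20, "c4": 10}
edges = {"c1": ["c2", "c4"], "c2": ["c3"], "c4": ["c3"]}
print(max_scoring_path(scores, edges))
```

In this toy case the path c1, c2, c3 (score 95) is preferred over c1, c4, c3 (score 70) because its summed alignment scores are larger.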

Methods

When a transcript is only partially covered by any one genomic sequence, and has an alignment with more than one sequence in the assembly, it is possible that the locus to which it corresponds has been fragmented into separate sequences. The genomic sequence for a transcript is considered fragmented when the transcript is partially covered by any single genomic sequence (<95% coverage), can be represented by at least one path, and the score for the maximum scoring path is significantly higher than the score for the maximum scoring alignment (i.e., the highest vertex score).
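The three conditions above can be written as a single predicate. The sketch below is illustrative only: the function and argument names are hypothetical, and it assumes that "significantly higher" means the path score exceeds the best single alignment score by at least an additive margin (5 is used here as an assumed default).

```python
from typing import Optional


def is_fragmented(best_single_coverage: float,
                  best_vertex_score: float,
                  best_path_score: Optional[float],
                  min_margin: float = 5.0,
                  full_coverage: float = 95.0) -> bool:
    """Decide whether a transcript's locus looks fragmented.

    best_single_coverage: highest % of the transcript covered by one sequence.
    best_vertex_score:    highest single alignment (vertex) score.
    best_path_score:      score of the maximum scoring path, or None if the
                          alignments cannot be arranged into any path.
    min_margin:           assumed additive margin for "significantly higher".
    """
    if best_path_score is None:
        return False  # no path can represent the transcript
    if best_single_coverage >= full_coverage:
        return False  # one sequence already covers the transcript
    return best_path_score - best_vertex_score >= min_margin


print(is_fragmented(60.0, 60.0, 95.0))   # partially covered, path much better
print(is_fragmented(97.0, 97.0, 110.0))  # already covered by one sequence
```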

For each transcript RDNA builds a graph of all logically possible connections, finds all possible paths, and selects the highest scoring path for scaffolding (Figure 1.1 A). An edge between any two sequences is considered logically possible if they do not overlap significantly on the transcript, the estimated intron size does not exceed the maximum intron size, and they are not separated by a third sequence. A minimum path score is used to remove transcripts that are poorly represented in the assembly. When the maximum scoring paths of two or more transcripts share at least one vertex in common they are assumed to represent the same locus and a consensus path is sought to incorporate as many of the connections as possible into the scaffold (Figure 1.1 B).

Figure 1.1: Graphical representation of how RDNA represents alignment information and computes scaffolding paths. (A) The maximum scoring path is computed for an RNA showing evidence of fragmentation. Vertices are weighted with alignment scores. (B) Maximum scoring paths sharing vertices are clustered and reordered based on strand positions (+/-) to find a consensus path. In this example, the consensus path may be found by reversing the edges and strand positions of the bottom maximum scoring path.
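The clustering step in panel B, where maximum scoring paths that share a vertex are grouped as one locus, can be sketched with a union-find structure. This is an assumed approach for illustration, not a description of RDNA's internals, and it covers only the grouping; reordering by strand to obtain a consensus path is not shown. All names are hypothetical.

```python
from typing import Dict, List


def cluster_paths(paths: Dict[str, List[str]]) -> List[List[str]]:
    """Group transcripts whose paths share at least one genomic sequence.

    paths maps a transcript name to its maximum scoring path (a list of
    genomic sequence names). Returns sorted groups of transcript names.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # All vertices on one path belong to the same component.
    for path in paths.values():
        for vertex in path:
            find(vertex)
        for vertex in path[1:]:
            union(path[0], vertex)

    # Transcripts are grouped by the component of their first vertex.
    clusters: Dict[str, List[str]] = {}
    for rna, path in paths.items():
        if path:
            clusters.setdefault(find(path[0]), []).append(rna)
    return sorted(sorted(group) for group in clusters.values())


paths = {"rna1": ["c1", "c2"], "rna2": ["c2", "c3"], "rna3": ["c7", "c8"]}
print(cluster_paths(paths))
```

Here rna1 and rna2 share c2 and fall into one cluster, while rna3 forms its own.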

Prior to scaffolding, connections may optionally be checked for alignments with BLAST and merged. Otherwise, connections are represented by an arbitrary length sequence consisting of a string of fifty ‘n’ characters (i.e., a sequence gap with an arbitrary given size). The merging step allows for the assembly of overlapping ends and gap filling. The tool may also be used with only the gap filling and/or end merging option set.
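As a small illustration of the unmerged case, the sketch below joins the sequences on a path with a gap of fifty 'n' characters, reverse complementing any sequence on the minus strand. It is a hypothetical helper written for this example, not code taken from RDNA.

```python
from typing import List, Tuple

GAP = "n" * 50  # arbitrary-size gap placed between connected sequences
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(parts: List[Tuple[str, str]]) -> str:
    """Join (sequence, strand) pairs in path order, separated by GAP."""
    oriented = [seq if strand == "+" else reverse_complement(seq)
                for seq, strand in parts]
    return GAP.join(oriented)


scaffold = build_scaffold([("ACGT", "+"), ("AAC", "-")])
print(len(scaffold))  # 4 + 50 + 3 bases
```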

RDNA makes extensive use of a graph module developed by Jarkko Hietaniemi and currently maintained by Neil Bowers (https://github.com/neilbowers/Graph). It also utilizes an object-oriented programming system for Perl called Moose (https://metacpan.org/release/Moose). Using the RNA-guided assembly options requires stand-alone Basic Local Alignment Search Tool (BLAST; Altschul et al. 1990).

Program inputs, parameters, and options

RDNA takes as input a psl file describing RNA to DNA alignments and two FASTA format files corresponding to the DNA and RNA sequence files represented in the psl file. The psl file format is the default output for the Blast-like alignment tool (BLAT; Kent 2002), which is typically used when computing large numbers of RNA to DNA alignments. While users may set options when running BLAT, RDNA will filter alignments using its own calculation of percentage of identity as well as several other parameters.

Command line options for RDNA control three aspects of the program. The first set of options, which includes the percentage of identity (--pid), minimum alignment score (--score), minimum block size (--bsize), RNA maximum (--rmax), and DNA maximum (--dmax), determines which alignments from the psl file are used for generating scaffolding paths. The percentage of identity (default 98) is calculated from the sum of matches and repetitive matches (matches occurring in low complexity sequence) divided by the sum of matches, repetitive matches, and mis-matches (((matches + repetitive matches) / (matches + repetitive matches + mis-matches)) × 100). The minimum alignment score or node score is calculated as a percent coverage score (((matches + mis-matches + repetitive matches) / query size) × 100) minus a penalty for mismatches and gaps, and therefore ranges from > 0 to ~100 (default is 5). Alignments must also have at least 1 block greater than or equal to the minimum block size (default is 75 bp). The RNA and DNA maximums restrict the number of DNA and RNA alignments, respectively, to the specified number of highest scoring alignments (by default, each RNA and DNA may have up to 50 DNA and 300 RNA alignments, respectively).

A second set of options controls how scaffolding paths are created and used. This includes the maximum intron length (--intron), minimum path score (--pscore), RNA overlap (--xrna), maximum edge (--maxEdge), and maximum scoring path constant (--Pconstant) options. The RNA base pair overlap option specifies the maximum number of bases by which two DNA sequences may overlap with each other on an RNA (default is 5 bp), which determines whether or not two DNA sequences can be connected by an edge. Conversely, when two DNA sequences do not overlap on the RNA, the maximum intron length is used to remove edges where the estimated intron size, as calculated from the sequence lengths, is too large (default is 100,000 bp). The RNA path score is the sum of the individual vertex scores (i.e., alignment scores), and should therefore have a similar range as the vertex scores, depending on what is used for the RNA base pair overlap option. Under the recommended default settings a path score of ~100 indicates that the RNA is fully covered by the DNA sequences in the scaffolding path. The minimum path score option places a lower limit on an acceptable maximum path score (default is 50), meaning a maximum scoring path with a path score less than the minimum path score will not be scaffolded. As computing all possible scaffolding paths from a large number of edges can be computationally expensive, the maximum edge option (default 300) allows users to skip RNA with a large number of possible edges. Finally, the maximum scoring path constant specifies by how much the maximum scoring path must exceed the score for the maximum scoring vertex. Maximum scoring paths that do not exceed the maximum scoring vertex by this magnitude will not be scaffolded (default is 5).
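The edge rules controlled by the overlap and intron options can be sketched as follows. This is a simplified illustration with hypothetical names: the estimated intron size is taken as an input because its calculation is not reproduced here, and the check that two sequences are not separated by a third sequence is omitted.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    dna: str        # name of the genomic sequence
    rna_start: int  # start on the transcript (0-based, inclusive)
    rna_end: int    # end on the transcript (exclusive)


def rna_overlap(a: Alignment, b: Alignment) -> int:
    """Number of transcript bases covered by both alignments."""
    return max(0, min(a.rna_end, b.rna_end) - max(a.rna_start, b.rna_start))


def edge_allowed(a: Alignment, b: Alignment, estimated_intron: int,
                 max_overlap: int = 5, max_intron: int = 100_000) -> bool:
    """Apply the overlap and intron limits using the stated defaults."""
    if a.dna == b.dna:
        return False  # an edge connects two different sequences
    if rna_overlap(a, b) > max_overlap:
        return False  # sequences overlap too much on the RNA
    return estimated_intron <= max_intron


first = Alignment("c1", 0, 500)
second = Alignment("c2", 497, 900)  # overlaps the first by 3 bases
third = Alignment("c3", 450, 900)   # overlaps the first by 50 bases
print(edge_allowed(first, second, estimated_intron=2_000))
print(edge_allowed(first, third, estimated_intron=2_000))
```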

By default the program runs with the --scaffold option set. In order to assemble edges before scaffolding the user must set the --gap and/or --overlap flags. This allows for the merging of sequences adjacent to each other on a scaffolding path when they have a high quality alignment. What constitutes a high quality alignment is controlled by a third set of BLAST related options for filtering high-scoring segment pairs (HSPs). This includes the alignment length (--HSPlength; default 50 bp) and percentage of identity (--HSPpid; default 99). When using the --gap or --overlap options users must also specify the location of the BLAST program on the computer system using the --blast option. A typical command would be as follows:

perl rdna.pl --scaffold --gap --overlap --psl=/home/user/file.psl --dna=/home/user/DNA.fa --rna=/home/user/RNA.fa --blast=/work/user/bin/blast/

Output

The output of RDNA consists of a new assembly and a log file. The new assembly consists of three non-overlapping sequence sets in fasta format: the untouched sequences (remainder.fa) from the input genome assembly file, new scaffolds (scaffolds.fa), and updated sequences (updated.fa). The new scaffolds file will contain all sequences, including any updated sequences, used in scaffolding. The updated sequence file only includes updated sequences not used in scaffolding. The log file provides a summary of the old and new assemblies and documents all of the updated sequences (Appendix, Table 1).

Assessment

The performance of RDNA and L_RNA_scaffolder was compared using de novo assemblies for both Homo sapiens and Lavandula angustifolia. The human genome was chosen because a highly accurate reference genome was needed to evaluate scaffolding connections and merged/updated sequences. The input for test runs was a de novo human genome assembled using Fermi (Li 2012) and Opera (Wing-Kin Sung et al. 2011) with Illumina HiSeq 2500 reads downloaded from the NCBI SRA database (sample SRR068201; https://www.ncbi.nlm.nih.gov/sra), together with either a de novo transcriptome assembled with SOAPdenovoTrans (Xie et al. 2014) from RNA-seq data in the NCBI SRA database or the NCBI RefSeq transcripts (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/), consisting of 1,424,992 and 163,319 sequences, respectively. RDNA and L_RNA_scaffolder were run with both sets of transcripts to allow for a comparison of how performance varies with the quality of the underlying transcript data. The lavender assembly was an ongoing project in Dr. Liang’s research group and included a transcriptome with 46,102 sequences. Both tools were run using the default settings.

Scaffolding connections in the human genome assembly were evaluated by splitting the scaffolds on arbitrary gap sequences and mapping the non-gap sequences to the hg19 human reference genome with MUMmer3.23 nucmer and show-tiling programs (Kurtz et al. 2004). Only scaffolds with all sequences mapping to the reference assembly were considered for assessment. The order and orientation of the sequences within a scaffold along the human reference genome was considered correct if all sequences had the same orientation and the same order as they appeared in the original scaffold. All scaffolds that did not map as expected to the reference genome were manually evaluated. Any scaffolds having the original scaffolding connections mapping incorrectly were ignored to exclude assembly errors introduced by the original assembly. The remainder were considered putative errors (PEs).
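The evaluation steps above can be approximated in a few lines of Python. This sketch is illustrative and is not the pipeline used for the assessment, which relied on MUMmer; it assumes gaps are runs of at least fifty 'n' characters and that mapping results are already available as (chromosome, reference start, strand) tuples listed in scaffold order. Treating a scaffold that maps entirely to the minus strand in descending order as consistent is also an assumption.

```python
import re
from typing import List, Tuple

Hit = Tuple[str, int, str]  # (chromosome, reference start, strand)


def split_on_gaps(scaffold: str, min_gap: int = 50) -> List[str]:
    """Split a scaffold into its non-gap sequences at runs of n/N."""
    pieces = re.split(r"[nN]{%d,}" % min_gap, scaffold)
    return [piece for piece in pieces if piece]


def maps_consistently(hits: List[Hit]) -> bool:
    """True if all pieces map to one chromosome and strand, in scaffold order."""
    chromosomes = {hit[0] for hit in hits}
    strands = {hit[2] for hit in hits}
    if len(chromosomes) != 1 or len(strands) != 1:
        return False
    starts = [hit[1] for hit in hits]
    # A scaffold mapped to the minus strand appears in descending order.
    return starts == sorted(starts, reverse=(strands == {"-"}))


pieces = split_on_gaps("ACGT" + "n" * 50 + "GGCC")
print(pieces)
print(maps_consistently([("chr1", 100, "+"), ("chr1", 5000, "+")]))
print(maps_consistently([("chr1", 100, "+"), ("chr2", 5000, "+")]))
```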

PEs were counted and categorized as a translocation, inversion, or relocation. A translocation is when sequences on either side of a connection map to different chromosomes; an inversion is when they map with different strand positions on the same chromosome; and a relocation is when they are separated on the same chromosome by a distance greater than 100 kb. Any instance of both a relocation and inversion was counted as an inversion so that no error was counted twice. Connection error rates are expressed as the number of connection errors divided by the number of new scaffolds. Sequences assembled with the help of BLAST were validated by aligning 100 randomly selected sequences to the human reference genome using web-based BLAST.
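The classification rules and the error rate can be summarized in a short sketch. It is a hypothetical helper, assuming each side of a connection is described by a (chromosome, position, strand) tuple and that distance is measured between the two positions. Checking for an inversion before a relocation reproduces the rule that a connection showing both is counted once, as an inversion.

```python
from typing import Tuple

Side = Tuple[str, int, str]  # (chromosome, position, strand)


def classify_connection(left: Side, right: Side,
                        max_distance: int = 100_000) -> str:
    """Label a scaffolding connection from where its two sides map."""
    if left[0] != right[0]:
        return "translocation"  # different chromosomes
    if left[2] != right[2]:
        return "inversion"      # same chromosome, different strands
    if abs(right[1] - left[1]) > max_distance:
        return "relocation"     # same chromosome and strand, too far apart
    return "consistent"


def connection_error_rate(n_errors: int, n_new_scaffolds: int) -> float:
    """Connection errors divided by the number of new scaffolds."""
    return n_errors / n_new_scaffolds


print(classify_connection(("chr1", 1_000, "+"), ("chr5", 2_000, "+")))
print(classify_connection(("chr1", 1_000, "+"), ("chr1", 900_000, "-")))
print(classify_connection(("chr1", 1_000, "+"), ("chr1", 900_000, "+")))
print(connection_error_rate(5, 100))
```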

Results:

Connection error rates and sequence updates with de novo human genome assembly

For both human transcript data sets (the de novo transcript assembly and RefSeq transcripts), the number of genomic sequences incorporated into scaffolds was higher for L_RNA_scaffolder than RDNA. This difference was less significant using the RefSeq data (Figure 1.2). As expected, it was found that the RefSeq data produced a better assembly for both tools, generating a greater number of connections and a lower PE rate (Figure 1.2 and Figure 1.3).

Figure 1.2: Comparison of L_RNA_scaffolder and RDNA in scaffolding performance. Bar plots showing the numbers of genome sequences incorporated into scaffolds and the numbers of new scaffolds generated for the RefSeq (top) and de novo transcriptome (bottom) by each tool.

The greater number of connections made by L_RNA_scaffolder for both data sets also resulted in a larger N50 statistic and improved BUSCO assessment results (Table 1.1; Table 1.2). RDNA, however, assembled 147 and 331 updated sequences for the RefSeq and transcriptome assembly data, respectively, resulting in the closure of 1,870 bp and 5,432 bp of gaps. These assembled sequences were validated by taking the alignment regions and adjacent 100 bp of sequence for 100 randomly selected sequences and aligning these to the human reference genome using web-based BLAST. All alignments were found to be correct.
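For readers unfamiliar with the N50 statistic referred to above, the following sketch shows the standard calculation: sequences are sorted from longest to shortest, and the N50 is the length of the sequence at which the running total first reaches half of the assembly length. The lengths used are made up for illustration.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Total is 290 bp; 80 + 70 = 150 bp passes the halfway mark of 145 bp.
print(n50([80, 70, 50, 40, 30, 20]))
```

Joining sequences into longer scaffolds moves more bases into the longest sequences, which is why additional scaffolding connections raise the N50.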

Table 1.1: Comparison of scaffolding improvements from L_RNA_scaffolder (LRNA) and RDNA using transcriptome data

Assembly metrics                          LRNA             RDNA
# contigs (>= 1000 bp)                    155,139          160,269
# contigs (>= 5000 bp)                    71,970           76,084
# contigs (>= 10000 bp)                   41,783           43,660
# contigs (>= 25000 bp)                   15,440           16,097
# contigs (>= 50000 bp)                   7,133            7,269
# contigs                                 192,317          196,308
Largest contig                            2,658,163        2,175,155
Total length bp                           2,704,814,580    2,702,967,863
Estimated reference length bp             3,000,000,000    3,000,000,000
Non-gap bases in sequences >= 500 bp      2,651,103,758    2,651,106,787
GC (%)                                    40.65            40.65
N50                                       101,433          82,025
# N's per 100 kbp                         1,976.90         1,909.97

Table 1.2: Comparison of BUSCO results from L_RNA_scaffolder and RDNA using transcriptome data

BUSCO category                  L_RNA    RDNA
Complete Single-Copy BUSCOs     277      249
Complete Duplicated BUSCOs      23       25
Fragmented BUSCOs               52       61
Missing BUSCOs                  100      119
Total BUSCO groups searched     429      429

Evaluating the PE rates showed the connection error rate for L_RNA_scaffolder to be more than twice that of RDNA for both sets of data. This was driven largely by translocations (Figure 1.3). For the RefSeq data, RDNA had a translocation error rate of 0.14% compared to 4.1% for L_RNA_scaffolder. Error rates increased for both tools using the de novo transcriptome data. The translocation error rate was 3.8% and 9.9% for RDNA and L_RNA_scaffolder, respectively. Although the rate of inversions, like translocations, also increased for both tools using the transcriptome data, the contribution from this class of PEs was relatively minor and these differences had a negligible impact on the PE rate. In contrast to translocations and inversions, the number of relocations was higher using the RefSeq data for both tools, indicating that many of these connections were correct. The rate of relocations, connections representing a distance greater than 100 kbp, was also higher in both cases for RDNA. This would be expected if RDNA was missing connections made by L_RNA_scaffolder resulting in the exclusion of contigs carrying exonic sequence.

Figure 1.3: Comparison of errors and error rates between L_RNA_scaffolder and RDNA. Bar plots showing counts (Y-axis) for the different classes of putative errors using the RefSeq (top) and transcriptome (middle) data, as well as error rates (bottom). Error rates are calculated as the number of connection errors divided by the number of new scaffolds. As there was mostly one error per scaffold, the error rate also estimates the percentage of new scaffolds harboring a connection error.

Although L_RNA_scaffolder performed better on the human data for some assembly quality metrics (e.g., the N50), this advantage came at the cost of a much higher rate of PEs. The rate of translocation errors was especially high using the de novo transcriptome assembly data. As this class of PEs was the one most likely to include genuine connection errors, something reflected in the large increase in the translocation error rate for the poorer quality de novo transcriptome assembly, these results suggested that as many as 10% of the scaffolds created by L_RNA_scaffolder may contain at least one connection error. Whether this error rate varies significantly for other assemblies, such as those for different species, could not be determined. The elevated error rate for L_RNA_scaffolder would only account for a modest proportion of the greater number of sequences scaffolded. Thus, RDNA was likely missing a significant number of valid connections.

Lavender

In addition to the human genome assembly, we also compared the performance of the two tools using a lavender de novo genome assembly based on genome assembly quality metrics and BUSCO results. As with the human data, RDNA incorporated fewer sequences into scaffolds and produced fewer scaffolds than L_RNA_scaffolder (Figure 1.4). However, the difference between the tools was not as great as with the human transcriptome assembly. The difference was also offset by RDNA assembling a significantly greater number of sequences compared to the human data, including the closing of 25,711 bp of gaps.

Figure 1.4: Performance comparison of RDNA and L_RNA_scaffolder for the lavender genome assembly. Data showing the number of genome sequences incorporated into scaffolds and the number of new genome scaffolds generated using the lavender transcriptome.

Although the results of both tools showed an improvement over the original assembly, the greatest improvement was with the use of RDNA (Table 1.3). The improvement in basic assembly quality metrics was also reflected in the BUSCO analysis, showing a greater number of complete and partially complete gene models for RDNA (Table 1.4). No assessment of the connection error rates was possible because of the lack of a lavender reference genome.

Table 1.3: Comparison of scaffolding improvements for de novo genome assembly of Lavandula angustifolia using L_RNA_scaffolder and RDNA.

Assembly metrics                          L_RNA          RDNA
# Contigs (>= 1,000 bp)                   127,534        128,027
# Contigs (>= 5,000 bp)                   2,419          2,582
# Contigs (>= 10,000 bp)                  82             81
# Contigs (>= 25,000 bp)                  1              1
# Contigs (>= 50,000 bp)                  0              0
# Contigs                                 337,328        335,579
Largest contig                            26,990
Total length bp                           380,880,530    382,744,188
Estimated reference length bp             879,915,704    879,915,704
Non-gap bases in sequences >= 500 bp      364,078,391    365,802,783
GC (%)                                    37             37
N50                                       1,270          1,291
# N's per 100 kb                          4,337          4,224

Table 1.4: Comparison of BUSCO results for the Lavandula angustifolia genome using L_RNA_scaffolder and RDNA

BUSCO category                  L_RNA    RDNA
Complete Single-Copy BUSCOs     77       80
Complete Duplicated BUSCOs      16       19
Fragmented BUSCOs               135      141
Missing BUSCOs                  217      208
Total BUSCO groups searched     429      429

Discussion:

Short read, next generation sequencing technologies create challenges for accurate genome assembly that may be partially addressed by assembly approaches utilizing complementary data such as RNA-seq. Comparative evaluation of RDNA and L_RNA_scaffolder using human data showed that the connection error rate for both tools increased sharply using the lower quality transcriptome data. This was a significant problem for L_RNA_scaffolder with as many as 10% of scaffolds harboring at least one connection error. Although RDNA made fewer scaffolding connections than L_RNA_scaffolder, the difference was less significant as the quality of the transcriptome data improved. Furthermore, the trade-off between maximizing the number of scaffolding connections and minimizing the connection error rate was partly overcome by assembling scaffolding connections with the use of BLAST. This had the largest impact using the lavender data. Although L_RNA_scaffolder made more scaffolding connections, the assembly produced by RDNA was better, having a larger N50, fewer N bases per 100 kbp, and a larger number of base pairs in non-gap sequences over 500 bp. Nonetheless, many of the additional connections made by L_RNA_scaffolder were likely correct.

As next generation sequencing reads are too short to cover anything but the smallest classes of RNA molecules such as microRNA and PIWI-interacting RNAs (piRNAs), transcriptomes must be assembled de novo with tools such as SOAPdenovoTrans (Xie et al. 2014) and Trinity (Haas et al. 2013). Unfortunately, these assemblies are challenging, error prone, and time consuming. Chimeric transcript reconstructions caused by trans-splicing (Li et al. 2008), gene fusions (Gustincich et al. 2006), and artefacts of the library preparation method, can lead to incorrect scaffolds. These problems were reflected in the increased connection error rates for both tools using the human transcriptome data compared to the RefSeq data. An important objective of transcriptome based scaffolding should therefore be to achieve a high specificity of connections.

Two recently published tools, AGOUTI (Zhang et al. 2016) and Rascaf (Song et al. 2016), suggested that building scaffolding connections using the raw RNA-seq reads may have significant advantages over the use of assembled transcripts. The utility of the latter approach however has only been explored by a single tool, L_RNA_scaffolder (Xue et al. 2013). The results here demonstrated that the connection error rate of L_RNA_scaffolder may be too high for some assemblies, and may remain high even as the quality of transcriptome assemblies improves. Using human data it was found that RDNA makes 2-3 times fewer connection errors than L_RNA_scaffolder. As the quality of the underlying RNA sequence data improved the difference in the number of connections made became less significant, whereas the connection error rates decreased similarly. An unexpected result was how many connection errors L_RNA_scaffolder made: using the transcriptome data as many as 10% of scaffolds may have harbored a translocation error. Despite the lack of a thorough and independent evaluation, L_RNA_scaffolder has been used in numerous genome projects (Vij et al. 2016; Pavey et al. 2017; Rehan et al. 2016; Tollis et al. 2017).

The option to close gaps and merge sequences adjacent to each other on a scaffolding path is a unique and valuable feature of RDNA that contributes both to the improvement in the N50 statistic and connection accuracy. As many of the sequences assembled by RDNA were found to be correct when mapped back to the human reference genome, representing edges as scaffolding connections may be incorrect in many cases. One reason for the success of this method is that the tools used in the de novo genome assembly of raw sequencing reads require perfect sequence identity and as a result may fail to properly assemble homologous alleles and redundant copies of the same genomic regions carrying sequencing errors. In contrast, the alignment of sequences by RDNA is done through BLAST, which allows for adjustment of the threshold for sequence similarity.

RDNA offers an alternative to L_RNA_scaffolder with some advantages for transcriptome based genome assembly. It may benefit from a more sophisticated approach for selecting maximum scoring paths whereby each RNA may have multiple, equally plausible scaffolding paths. It was noted, for example, that many transcripts had multiple maximum scoring paths (i.e., the score for the maximum scoring path was only marginally higher than the score for the second highest scoring path) that are identical to each other except for a single sequence. Consequently, the specific sequence that is chosen for one path may be substituted by a different sequence in the maximum scoring path for a different transcript. Although the resulting connections will be topologically inconsistent, the solution becomes apparent when considering all of the alternative maximum scoring paths simultaneously. Hence, methods that can consider multiple alternative paths for any transcript and then combine these paths across multiple transcripts to produce a consensus path may improve results.
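One way such a consensus step might look is sketched below. This is a hypothetical illustration and not a feature of RDNA: the function names, the 5% score tolerance, and the voting rule are all assumptions, and scores are assumed to be positive. Each transcript keeps every path scoring within the tolerance of its best path, each transcript votes once for every connection that appears in its kept paths, and the path with the most supported connections is chosen.

```python
from collections import Counter


def near_max_paths(candidates, tolerance=0.05):
    """Keep paths whose score is within `tolerance` of the best score.

    `candidates` is a list of (score, path) pairs, where a path is a
    list of sequence identifiers. Scores are assumed to be positive.
    """
    best = max(score for score, _ in candidates)
    return [path for score, path in candidates if score >= best * (1 - tolerance)]


def connections(path):
    """Return the adjacent pairs (scaffolding connections) along a path."""
    return list(zip(path, path[1:]))


def consensus_paths(candidates_by_transcript, tolerance=0.05):
    """Choose, per transcript, the near-maximal path best supported overall."""
    kept = {t: near_max_paths(c, tolerance) for t, c in candidates_by_transcript.items()}
    support = Counter()
    for paths in kept.values():
        # Each transcript votes once per distinct connection it could use.
        for edge in {e for p in paths for e in connections(p)}:
            support[edge] += 1
    return {
        t: max(paths, key=lambda p: sum(support[e] for e in connections(p)))
        for t, paths in kept.items()
    }


# Made-up example: t1 has two near-equal paths differing by one sequence.
example = {
    "t1": [(100, ["A", "X", "C"]), (99, ["A", "B", "C"])],
    "t2": [(80, ["B", "C", "D"])],
}
print(consensus_paths(example))
```

In this made-up example the marginally lower scoring path A-B-C is selected for t1 because its B-C connection is also supported by t2, which removes the topological inconsistency that the top scoring path A-X-C would introduce.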

Chapter II: Whole genome sequencing and rare variant discovery in the ASPIRE autism cohort

Abstract

Processing the raw WGS data of individuals in the ASPIRE autism spectrum disorder (ASD) cohort involved mapping raw sequencing reads to the human reference genome as well as calling, filtering, and annotating variants. As the ASPIRE cohort had a large proportion of cases with comorbid ID compared to heavily studied ASD cohorts such as the SSC, it was expected that there would be a significant contribution from de novo events. Analysis therefore looked at the class of rare loss of function (SNVs and indels) and missense variants to identify likely causal de novo variants for validation. Overall, the class of rare variation defined by this analysis included approximately 100 loss of function variants (stop gained, splice acceptor, splice donor, and frameshift variants) and 600 missense variants predicted to be damaging. Previous work by our collaborators identified five de novo events, including two events in the major ASD risk gene, SCN2A. This study found previously unreported variants predicted to be damaging in genes associated with ASD by sequencing studies, such as SCN2A and PPP2R5D. ABCA1 was found to be a novel ASD candidate gene on the basis of a missense variant in that gene having been previously observed de novo in a case with a severe, undiagnosed neurodevelopmental disorder (NDD).

Gene ontology (GO) analysis of the class of rare SNVs predicted to be damaging indicated the relevance of glutamate signalling and lipid metabolic processes to the ASPIRE cohort, but no terms were specific to any one phenotype cluster. Nearly all the genes driving these enrichments were also not associated with ASD by sequencing studies, suggesting that rare inherited loss of function and missense variants contribute to risk through a set of genes largely distinct from genes that have reached de novo significance.

Introduction:

Autism spectrum disorder (ASD) is a common neurodevelopmental disorder that can have a strong genetic basis. The diagnostic criteria specified by the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) include impaired social communication and social interaction as well as repetitive behaviors and restricted interests. DSM-5 also recognizes that individuals vary significantly in their symptomology and level of severity within these domains and frequently have overlapping comorbidities related, for example, to intellectual and language impairments. A high level of comorbidity among NDDs and the association of the same locus with multiple disorders raise important questions about how they may be distinguished in terms of their underlying etiology (Moreno-De-Luca et al. 2013). The same may be said about ASD subtypes recognized by the DSM-IV such as Asperger’s disorder and autistic disorder, which were abandoned because different diagnoses were being applied inconsistently across clinics, making them meaningless except as a general index of severity (Fischbach and Lord 2010).

ASD is characterized by extreme etiologic and genetic heterogeneity. Recent studies have estimated that there are hundreds to as many as a thousand risk genes representing variants with a range of effect sizes, a large proportion of which when disrupted can significantly increase risk (He et al. 2013; Ronemus et al. 2014). The Simons Foundation Autism Research Initiative (SFARI; https://sfari.org) database (last updated 12/2015) lists some 150 genes that are ranked as syndromic, high confidence, or a strong candidate, and there are hundreds more in the remaining categories. A dramatic increase in the pace of gene discovery over the past decade was possible because of the introduction of array and sequencing based technologies to molecular genetic research. Whole exome sequencing (WES) of large family based cohorts such as the Simons Simplex Collection (SSC; Fischbach and Lord 2010) identified 65 high confidence risk genes (Sanders et al. 2012; O’Roak et al. 2012a,b; Iossifov et al. 2012; Neale et al. 2012; De Rubeis et al. 2014; Iossifov et al. 2014; O’Roak et al. 2014; Krumm et al. 2014; Sanders et al. 2015; Krumm et al. 2015), and early results from a larger whole genome sequencing (WGS) project have added more to this list (Yuen et al. 2017). Strong statistical support is required to implicate new genes. Candidate genes that have yet to reach significance may be cost effectively sequenced in follow up studies using single-molecule molecular inversion probes (smMIPs) (O’Roak et al. 2012; Stessman et al. 2017).

Progress in identifying ASD risk loci has been driven almost entirely by the study of rare de novo variants. This approach is informed by basic evolutionary theory, which states that pathogenic variants with a significant impact on fitness should remain at a low frequency within a population because of negative selection. Such variants persist in the population through recessive inheritance, incomplete penetrance, balancing selection, and new mutation, but should remain rare. For pathogenic variants that occur de novo in the parental germ line, the relative risk should generally be greater than for those that are inherited because they have not been subject to selection pressure in previous generations. The investigation of de novo variation in ASD was supported by the sporadic nature of most cases as well as evidence from cytogenetic and early microarray studies showing significant associations with de novo events (Vorstman et al. 2006; Jacquemont et al. 2006; Sebat et al. 2007). This provided empirical support for an influential theory positing that most cases of ASD are caused by recent mutation in the parental germ line (Zhao et al. 2007).

The discovery of genes that are risk factors for ASD has accelerated rapidly over the past five years, and numerous studies have integrated these findings with complementary data such as RNA sequencing (RNA-seq) and protein-protein interactions (PPIs), demonstrating several points of convergent biology. A main finding is that many of the genes being identified by sequencing studies belong to one of two distinct sets. The first set includes genes such as chromodomain helicase DNA-binding protein 8 (CHD8) that have an early to mid-fetal brain expression profile and are involved in biological processes related to chromatin and transcriptional regulation. The second set includes genes such as SHANK3 that have a post-natal brain expression profile and are involved in biological processes related to synaptic development and function (Sanders et al. 2015b). Although this model is a simplification, it holds generally across studies using different data and methodologies. Analyses using a combination of PPI networks and gene ontologies have all observed a set of genes enriched for chromatin/transcriptional regulation and a second, distinct set of genes enriched for synaptic components (O’Roak et al. 2012b; Pinto et al. 2014; Sanders et al. 2015a; De Rubeis et al. 2014; Yuen et al. 2016; Yuen et al. 2017). Gene co-expression network analyses using human mRNA expression data for multiple brain regions and developmental stages (Kang et al. 2011) have supported these findings, identifying temporally distinct sets of co-expression modules enriched for gene ontology terms, PPIs, and ASD risk genes (Willsey et al. 2013; Parikshak et al. 2013).

The genes involved in the formation and functioning of synapses represent a key point of convergence in ASD neurobiology. Synaptic genes encode proteins involved in a broad array of molecular functions such as actin and microtubule cytoskeletal regulation (MYO16, CTTNBP2, ADNP, SYNGAP1), cell adhesion (cadherins, neurexins, neuroligins, immunoglobulins), surface reception (glutamate receptors such as GRIN2B and GRIK, tyrosine kinase receptors), signaling (protein kinases such as DYRK1A and CDKL5, phosphatases such as PTEN), and scaffolding (SHANKs) (Lin et al. 2016). Although synaptic genes are particularly well represented in the SFARI database, only a small number have been implicated by sequencing studies. There are, for example, as many as 45 synaptic cell adhesion genes in the SFARI database (Baig et al. 2017), but only 3 of these (NLGN3, NLGN4X, NRXN1) have reached de novo significance (Yuen et al. 2017). The evidence for most SFARI genes comes from candidate gene studies, genes within ASD associated CNVs, and to a smaller extent syndromic forms of ASD (Parikshak et al. 2013). Syndromic forms of ASD have been recognized for decades. One review described 103 ID genes associated with ASD (Betancur 2011). Importantly, the association between ASD and de novo loss of function variants appears to come entirely from the severe end of the autism spectrum, characterized by ID, dysmorphic features, and a higher proportion of female cases (Kosmicki et al. 2017).

Analysis of the WGS data from the ASPIRE ASD cohort complemented a previous analysis by our collaborators and was intended to identify candidate de novo loss of function and missense variants (unpublished). ASD sequencing studies normally identify de novo variants using a comprehensive approach involving the sequencing of large numbers of nuclear families, providing a sufficient number of de novo variants for downstream analysis. This study could only propose candidate de novo variants for validation due to the absence of WGS data for parental samples. Based on previous findings (Iossifov et al. 2014), it was estimated that around 25% of cases harbored a high risk de novo missense or loss of function variant. As there were 5 confirmed de novo events (4% of cases) and about as many candidate de novo events identified by a previous analysis, it was likely that more could be found. This study identified at least several new de novo events that were likely major risk factors for their carriers. In addition, gene ontology (GO) analysis of rare loss of function SNVs and missense variants predicted to be damaging suggested the importance of biological processes related to synaptic signaling and lipid metabolism.

Material and Methods:

Cohort:

The 119 subjects from 77 simplex and 42 multiplex families were selected from a larger cohort of 318 ASD subjects recruited through the research registry of the Autism Spectrum Disorders – Canadian-American Research Consortium (ASD-CARC). All subjects underwent a comprehensive screening of medical systemic and morphological features. All had normal karyotypes, normal targeted 22q11/22q13 and 15q11-q13 FISH and subtelomeric FISH studies, and negative Fragile X and clinical chemistry screening (serum lactate, ammonia, creatine phosphokinase, lead, complete blood cell count and microscopy, uric acid, TSH, urine purine/pyrimidine and creatine metabolites). ASD diagnoses were based on DSM-IV criteria using the Autism Diagnostic Interview-Revised (ADI-R) and/or Autism Diagnostic Observation Schedule-Generic (ADOS-G) standards (Lord et al. 2000; Lord et al. 1994).

As part of the selection process, subjects were divided into two phenotype clusters, C1 and C2, based on an analysis of somatic and clinical features, most of which were craniofacial. All of this phenotype data was available to assist with variant analysis. In addition to the features used for clustering, 53 subjects were diagnosed with ID, 42 were diagnosed with global developmental delay (GDD), and 17 experienced seizures.

Information on the sex and ethnic background of samples was provided by collaborators and the sex was confirmed using the filtered read depth (DP) of variants on the Y chromosome. Most case samples were of European ancestry (85/120) and male (90/120). The 20 control samples were half male and mostly of European ancestry.

DNA Collection:

Blood samples were collected from all 119 subjects, and from family members (parents and siblings) in 90 cases. DNA was extracted from whole blood using the Puregene (Gaithersburg, MD, USA) DNA Isolation Kit.

Sequencing:

Genome sequencing of the blood genomic DNA for 120 probands was done using Illumina HiSeq 2000 paired-end (2x100 bp) sequencing by collaborators from the Beijing Genomics Institute (BGI) in Shenzhen, China. The average paired-end read count per sample was 11.2 million. Sequencing depth ranged from 25.24X to 48.56X coverage (mean 35.88X). Quality sequencing data was obtained for 119 samples. In addition, WGS data for 20 control samples was provided by BGI from their sample pools of normal individuals. These controls were selected to have roughly equal numbers of the two sexes and represented mostly European ancestry. No further details were provided to us.

Quality Control (QC):

Data quality metrics were collected for the WGS data with Picard (broadinstitute.github.io/picard) after both sequence alignment and variant calling (Appendix, Table 2-3). A sample was considered an outlier for a metric when it fell more than 4 median absolute deviations (MAD) from the median. As these metrics may vary considerably due to differences in the population origin of the samples, the European samples were in some cases evaluated separately from the non-European samples. Two samples, 21705-34281 and 15164-25527, had significantly fewer total reads (TOTAL_READS), while the percentage of reads mapping to different chromosomes or with incorrect insert sizes (PCT_CHIMERAS) was significantly higher for sample 187-13717 at 4.4%. These observations, however, were deemed insufficient to exclude these samples from variant calling. Samples in the control group were similar across all metrics. Between group comparisons revealed no significant difference in alignment metrics except for MEAN_READ_LENGTH, which was constant within groups but shorter for the control group (100 and 90, respectively), and the percentage of aligned PF_READS (PCT_PF_READS_ALIGNED). Variant QC considered the following metrics: number of deletions, number of insertions, number of SNVs, insertion vs. deletion ratio, transition vs. transversion (TiTv) ratio, and heterozygous vs. homozygous ratio.
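The outlier rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this study: the function name is hypothetical, the MAD is used unscaled (no normal-consistency constant is applied, which is an assumption), and the sample identifiers and metric values are made up.

```python
from statistics import median


def mad_outliers(metric_by_sample, n_mads=4.0):
    """Return the samples lying more than `n_mads` MADs from the median.

    `metric_by_sample` maps a sample identifier to one QC metric value.
    The MAD here is the raw median of absolute deviations (unscaled).
    """
    values = list(metric_by_sample.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return {s for s, v in metric_by_sample.items() if abs(v - center) > n_mads * mad}


# Made-up PCT_CHIMERAS-like values (percent) for six hypothetical samples.
pct_chimeras = {
    "sample_1": 4.4,
    "sample_2": 1.0,
    "sample_3": 1.1,
    "sample_4": 0.9,
    "sample_5": 1.2,
    "sample_6": 1.0,
}
print(mad_outliers(pct_chimeras))
```

Because both the center and the spread are medians, a single extreme sample has little influence on the threshold used to flag it, which is the usual motivation for preferring the MAD over the standard deviation in QC.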

Variant Calling:

Sequence alignment and variant calling adhered to the best practices recommended by the Broad Institute (Van der Auwera et al. 2013). Raw sequence reads in fastq format were mapped to the human reference genome GRCh37 with the Burrows-Wheeler Aligner (BWA 0.7.9; Li and Durbin 2009), alignments were stored in sorted binary (BAM) format, and PCR duplicates were removed with Picard tools. The Genome Analysis Toolkit (GATK, version 3.7; https://software.broadinstitute.org/gatk) was used to perform indel realignment, base quality score recalibration, variant calling and joint genotyping, variant quality score recalibration (VQSR), genotype refinement, and variant filtering. All samples were genotyped together on a per chromosome basis and merged before VQSR. A 1000 Genomes Project Phase 1 SNV data set was used as supporting evidence for genotype refinement.

Variant sites were retained if they met the following criteria: FILTER is ‘PASS’, QUAL > 30, DP > 10, and quality by depth (QD) > 2. Only genotypes having a genotype quality (GQ) > 35 and filtered read depth (DP) > 10 at these sites were considered for downstream analysis. Only variants impacting canonical transcripts were considered.
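The site-level and genotype-level thresholds above can be expressed as two small predicates. The sketch below is illustrative only: the function names are hypothetical, the values are assumed to have already been parsed from the VCF fields of the same names, and it does not reproduce the GATK commands used in this study.

```python
def site_passes(filter_field, qual, dp, qd):
    """Site-level filter: FILTER is PASS, QUAL > 30, DP > 10, QD > 2."""
    return filter_field == "PASS" and qual > 30 and dp > 10 and qd > 2


def genotype_passes(gq, dp):
    """Genotype-level filter: GQ > 35 and filtered read depth DP > 10."""
    return gq > 35 and dp > 10


# Made-up values: the site passes, but one of two genotypes is dropped.
if site_passes("PASS", qual=812.4, dp=4310, qd=14.2):
    genotypes = {"subject_a": (99, 31), "subject_b": (20, 8)}
    kept = [s for s, (gq, dp) in genotypes.items() if genotype_passes(gq, dp)]
    print(kept)
```

Keeping the two levels separate mirrors the text: a site can pass while individual low-confidence genotypes at that site are still excluded from downstream analysis.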

Variant Annotation:

Annotation of SNVs and indels was performed with the Ensembl Variant Effect Predictor (VEP) version 87 (McLaren et al. 2016). VEP plugins provided allele frequencies and counts from ExAC (Lek et al. 2016), CADD scores (Kircher et al. 2014), and loss of function transcript effect estimator (LOFTEE) predictions (https://github.com/konradjk/loftee). A custom Perl script was used to provide allele frequencies and counts from the whole genome sequencing call set provided by the Genome Aggregation Database (gnomAD; gnomad.broadinstitute.org/downloads). Spidex (noncommercial v1.0) provided predictions for the effect of near splice site SNVs on exon inclusion/exclusion (Xiong et al. 2016). Variants previously identified as de novo in origin by sequencing studies related to ASD and other NDDs were annotated using the database of de novo variants (denovo-db version 1.5; Turner et al. 2017).

The ExAC and gnomAD data sets contained exome and genome variant calls on 60,706 and 15,000 unrelated individuals, respectively. Individuals were ascertained for case/control status for various common diseases, excluding NDDs or any other severe pediatric disorder.

Version 1.5 of denovo-db included de novo SNVs and indels in cases and controls from various sequencing studies. This included 5,624 unrelated individuals from 13 published ASD WES, WGS, and targeted sequencing studies, 4,293 unrelated individuals with severe NDDs from the Deciphering Developmental Disorders (DDD) study (McRae et al. 2017), and 2,278 controls, among others.

Although the sequencing data set lacked parental DNA sequences, identifying de novo variants among probands was possible in cases where parental DNA was available by prioritizing variants through computational analysis and determining the inheritance pattern by targeted sequencing of the parents’ DNA. At the time of this writing, however, none of the novel variants prioritized by this analysis were validated.

Gene Ontology (GO) Analysis:

Subjects were originally selected for sequencing on the basis of membership in the phenotype clusters. Analysis by our collaborators found that C1 and C2 were similar with respect to their burden of different variant classes (unpublished data). This study explored cluster characteristics using gene set enrichment analysis for GO biological processes. Analysis was done in R using the hypergeometric test (hyperGTest) in the Bioconductor GOstats package (Gentleman and Falcon 2007). The background gene set included ~10,000 genes derived from the same application of ExAC constraint metrics as was used to filter variants for the class of novel variants (i.e., a gene was included in the reference set if either pLI > 0.9 (0.7 for X chromosome) or missense z-score > 0.3), and the input gene set was restricted to genes carrying missense and loss of function SNVs (Appendix, Table 4). Only terms with an unadjusted p-value < 0.001 were reported. As the purpose of this analysis was to provide some biological insight into the data, a correction for the false discovery rate (FDR) was not applied. The hierarchical nature of GO also means that tests are not independent of each other, making it difficult to select an appropriate adjustment for the FDR.
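The over-representation test for a single GO term can be illustrated with a short sketch. It is written in plain Python rather than with the GOstats hyperGTest call used in this study, and the function name and argument names are hypothetical. It computes the expected count and the upper-tail hypergeometric probability, using the counts reported for GO:0043046 in Table 2.1 (5 of the 11 term genes among 487 input genes, with 9219 background genes).

```python
from math import comb


def hypergeom_enrichment(count, term_size, input_size, background_size):
    """Return (expected count, P(X >= count)) for one GO term.

    X is the number of term genes seen when `input_size` genes are drawn
    without replacement from `background_size` genes, of which
    `term_size` are annotated to the term.
    """
    expected = input_size * term_size / background_size
    total = comb(background_size, input_size)
    upper = min(term_size, input_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, input_size - k)
        for k in range(count, upper + 1)
    )
    # Exact integer arithmetic; converted to a float only at the end.
    return expected, tail / total


expected, p_value = hypergeom_enrichment(
    count=5, term_size=11, input_size=487, background_size=9219
)
print(round(expected, 2), p_value)
```

Under these assumptions the sketch gives an expected count of about 0.58 and a p-value of about 0.00014, in line with the corresponding row of Table 2.1.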

Results:

This analysis detected a total of 19,232,662 SNVs and 1,907,086 indels after variant quality filtering. Within this search space, a class of high impact variation previously associated with ASD was first defined, hereafter referred to as novel protein truncating variants (novel PTVs), using the following criteria: SNVs or small (< 20 bp) indels, predicted to be high confidence loss of function variants by LOFTEE, impacting genes considered to be intolerant of heterozygous loss of function variation (pLI) by ExAC (autosomal pLI >= 0.9; X-linked pLI >= 0.7), scaled CADD score > 15 (SNVs only), and absent from both the ExAC and gnomAD (WGS data only) reference samples. This class of variation included 116 PTVs impacting 47 cases (37 in C1 and 10 in C2).

A corresponding class of missense variants (hereafter referred to as novel missense variants) was also defined using the following criteria: SNVs, predicted to be damaging by SIFT (< 0.05) and PolyPhen-2 (>= 0.9), scaled CADD score > 20, impacting genes with an ExAC missense z-score (misZ) > 0.3, and absent from population reference samples. A novel measure of missense regional constraint, the Chi-squared value (Chi) for the deviation of the region’s number of missense variants in ExAC from expectation, was also used to remove variants in missense constrained regions with a Chi < 5 (Samocha et al. 2017).
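The novel missense criteria can be summarized as a single predicate, sketched below. The class and field names are hypothetical and the values are assumed to have been parsed from the variant annotations. The handling of the regional constraint value is an interpretation: a variant is dropped only when a Chi value is available for its region and that value is below 5, and variants without a regional value are kept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MissenseCall:
    sift: float           # SIFT score (lower is more damaging)
    polyphen: float       # PolyPhen-2 score (higher is more damaging)
    cadd: float           # scaled CADD score
    mis_z: float          # ExAC gene-level missense z-score
    in_reference: bool    # seen in ExAC or gnomAD reference samples
    chi: Optional[float] = None  # regional constraint Chi, if available


def is_novel_missense(v: MissenseCall) -> bool:
    """Apply the novel missense thresholds described in the text."""
    if v.in_reference:
        return False
    if not (v.sift < 0.05 and v.polyphen >= 0.9 and v.cadd > 20 and v.mis_z > 0.3):
        return False
    # Assumed interpretation: only drop when a regional Chi exists and is < 5.
    if v.chi is not None and v.chi < 5:
        return False
    return True


# Made-up calls: the second fails only on its PolyPhen-2 score.
print(is_novel_missense(MissenseCall(0.0, 0.99, 34.0, 3.1, False, chi=64.0)))
print(is_novel_missense(MissenseCall(0.0, 0.70, 34.0, 3.1, False)))
```

Writing the filter as a conjunction makes it clear that a variant failing any one predictor is excluded, regardless of how strongly the other predictors support it.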

This class of novel missense variation included 596 variants impacting 118 cases (88 in C1 and 30 in C2). Two C1 subjects lacked this class of variation but carried novel PTVs. Thus, all subjects in this study carried at least one of the novel variants defined here. C1 member 1444-21622 carried 51 of these novel variants, suggesting a higher rate of false positives, and was therefore removed from analysis.

Following the removal of 1444-21622, the variants defined here included a total of 21 frameshift, 16 stop gained, 60 splice donor/acceptor, and 491 missense variants. These 588 variants impacted 523 different genes. There were 527 remaining variants after removing those previously prioritized (Appendix, Table 4-5).

Phenotype Cluster Characteristics:

The set of 412 C1 related genes showed enrichment for several terms including “DNA methylation involved in gamete generation” (p-value = 0.00004) and “AV node cell action potential” (p-value = 0.00027), among others. In contrast, the set of genes for C2 (132 genes) showed no enrichments. For the terms showing the strongest enrichment in C1, the associated gene set was used to assess whether more individuals in C1 than in C2 carried a novel variant in at least one of the genes. Three terms were tested and none were significant at p = 0.05. There was no evidence to indicate that these biological processes were unique to C1. Consequently, the analysis was repeated using the combined set of C1 and C2 genes (Appendix, Table 5).

GO analysis for the combined C1 and C2 SNV gene set showed 11 GO biological processes with P<0.01 before any type of control for the false discovery rate (FDR) (Table 2.1).

Table 2.1: Eleven over represented GO biological processes for combined C1 and C2 SNV gene set

GOBPID      Pvalue   OddsRatio  ExpCount  Count  Size  Term
GO:0043046  0.00014  15.09      0.58      5      11    DNA methylation involved in gamete generation
GO:0006639  0.00023  3.60       4.17      13     79    acylglycerol metabolic process
GO:0006638  0.00026  3.55       4.23      13     80    neutral lipid metabolic process
GO:0019752  0.00032  1.82       28.10     47     532   carboxylic acid metabolic process
GO:0046339  0.00045  18.07      0.42      4      8     diacylglycerol metabolic process
GO:0086016  0.00056  54.12      0.21      3      4     AV node cell action potential
GO:0086027  0.00056  54.12      0.21      3      4     AV node cell to bundle of His cell signaling
GO:0035235  0.00076  6.80       1.16      6      22    ionotropic glutamate receptor signaling pathway
GO:0030534  0.00077  2.83       5.92      15     112   adult behavior
GO:0046498  0.00078  14.45      0.48      4      9     S-adenosylhomocysteine metabolic process
GO:0006732  0.00097  2.35       9.30      20     176   coenzyme metabolic process

*5036 GO BP terms tested; 487 input genes; 9219 background genes

Comparison with previously prioritized variants:

As a previous analysis by collaborators prioritized ~100 variants, the intersection between these variants and those defined here was evaluated. Among the 33 and 67 previously prioritized PTVs and missense variants, respectively, 20 and 41 were also found here (Appendix, Table 3). This included 4 of 5 confirmed de novo variants impacting SCN2A, ARID2, WDR45, and NIPBL. The fifth de novo variant, a missense change in SCN2A (2:166153564-G/A), was filtered out because the PolyPhen score (0.7) was below threshold (Table 2.2).

Table 2.2: Previously confirmed de novo variants in the ASPIRE ASD cohort.

Gene    Genomic Change (hg19)  Variant Effect   Subject_ID
SCN2A*  2:166153564-G/A        p.Arg102Gln      21730-34469
SCN2A   2:166168534-G/A        splice_acceptor  21705-34281
NIPBL   5:37059219-T/C         p.Leu2546Pro     1543-22102
WDR45   X:48933022-C/T         splice_donor     21974-35254
ARID2   12:46231491-G/A        splice_donor     1885-23669

*indicates a variant absent from the class of variation defined by this analysis

The previous analysis validated and in most cases determined the inheritance pattern for 16 other PTVs. Ten of these variants were present here: 7 were inherited, 2 failed validation, and 1 passed validation but the inheritance pattern was not determined. Other priority, albeit non-validated, PTVs included those impacting ADNP, ASXL3, HNRNPUL2, and KIF11. Among the 41 previously prioritized missense variants 25 were inherited, 1 failed validation, and 14 lacked validation (Appendix, Table 5).

Overall, it was found that the class of novel variants included many of the same variants prioritized by our collaborators. Given the somewhat different variant filtering methods the absence of some candidate variants was expected. As the primary purpose of this analysis was to identify any novel, high impact variants related to ASD, it was next asked whether any of the remaining variants were significant risk factors for ASD.

Variants Not Previously Prioritized:

This study prioritized 30 variants for validation not prioritized by our collaborators (Table 2.3).

Table 2.3: Novel variants prioritized by this study.

Gene      Genomic Change (hg19)  Variant Effect   CADD  Chi  Subject_ID
DIP2C     10:332272-G/A          p.Arg1354Trp     35    -    871-19574
DOCK9     13:99483615            p.Arg1518Gln     35    -    1569-22238
ACACA     17:35454050-G/A        p.Arg2258Cys     35    253  1569-22238
PLXNA1    3:126752807-C/T        p.Arg1880Trp     35    95   993-20365
KIRREL3   11:126391224-C/T       p.Arg140His      34    -    48-14470
ATN1      12:7050172-G/A         p.Arg1115His     34    75   1886-23674
SCN2A     2:166201267-G/A        p.Arg922His      34    64   21686-34166
PLXNB1    3:48451382-G/A         p.Arg1904Trp     34    51   1885-23669
ABCA1*    9:107583669-C/T        p.Val983Met      34    25   21974-35254
CMIP      16:81726765-C/G        p.Ala486Gly      33    -    1711-22976
AGAP1     2:236708163            p.Phe318Leu      33    -    1946-23968
ANK2      4:114170951-G/A        p.Arg308Gln      33    54   1548-22146
PLXNA1    3:126748891-G/A        p.Arg1682Gln     32    95   2007-24164
ARHGEF1   19:42408413-G/C        p.Arg695Pro      32    91   20010-31644
ARID5B    10:63759873-G/A        p.Val176Met      32    22   1454-21674
PDK2      17:48185592-T/G        p.Phe253Cys      29.7  -    19275-30820
FASN      17:80045614-C/A        p.Arg997Leu      28.2  -    21758-34760
CACNA2D3  3:55021748-G/C         p.Met886Ile      27.8  -    1946-23968
PACS2     14:105848902-C/G       p.Ser501Trp      27.7  -    1454-21674
PPP2R5D*  6:42975698-A/C         p.Asp251Ala      27.7  85   15210-25677
EP300     22:41547886-C/G        p.Ser956Cys      27.1  -    635-18522
CLCN7     16:1498451-G/A         p.Arg640Trp      26.4  -    1694-22917
WARS      14:100813098-C/G       p.Asp271His      26.2  -    1876-23644
SPTBN1    2:54872506-G/T         p.Leu1470Phe     26    103  1457-21708
PREX1     20:47309240-T/C        p.Met336Val      25.2  83   1972-24038
KAT6A     8:41906027-G/A         p.His157Tyr      25.2  9    19283-30834
TOPORS    9:32542941-C/T         p.Asp528Asn      24.2  -    902-19869
HMGCS1    5:43292741-T/C         splice_acceptor  24    -    20620-32415, 20014-31660, 2063-24398, 19844-31412
ACTG1     17:79478592-G/A        p.Leu142Phe      23.7  -    2007-24164, 1457-21708
CELF2     10:11312712-G/T        p.Gln234His      23.6  74   532-14596

*variants previously observed among NDD cases in denovo-db

This analysis may have uncovered a second de novo missense variant in SCN2A (p.Arg922His). Searching denovo-db for all the variants detected in this study showed that two of them, PPP2R5D (p.Asp251Ala) and ABCA1 (p.Val983Met), had previously occurred de novo among NDD cases in denovo-db.

Discussion:

This study complemented a previous analysis by collaborators intended to prioritize de novo loss of function and missense variants for validation. It identified several novel variants that were likely major contributors to diagnoses for their carriers. Although these variants were not validated at the time of this writing, integrating accompanying phenotype information suggested that some of these were causal. The missense variants in Protein Phosphatase 2 Regulatory Subunit B Delta (PPP2R5D; p.Asp251Ala) and ATP-binding Cassette Transporter Member 1 (ABCA1; p.Val983Met) carried additional evidence as they were observed de novo in cases from a separate study involving a cohort of ~4500 probands with severe, undiagnosed NDDs (DDD; McRae et al. 2017). Both of these variants had passed quality filters in the previous analysis and were therefore considered high confidence calls. Many of the other variants prioritized by this analysis would also be strong de novo candidates provided they are true calls.

This study identified several new likely de novo variants in ASD risk genes such as SCN2A and PPP2R5D. PPP2R5D was associated with ASD based on the recurrence of specific de novo missense variants (Geisheker et al. 2017). As the variant reported here was only the second known occurrence, it may be useful for defining the breadth of phenotypes associated with these mutations. ABCA1 had not previously been considered a candidate ASD risk gene. Although this study was unable to confirm the origin of the variant, the observation of the same de novo missense variant in two unrelated cases could provide very strong evidence for association (Geisheker et al. 2017).

Is ABCA1 a severe ASD risk gene?

ABCA1 encodes a key member of the reverse cholesterol transport pathway. Biallelic loss of function defects in ABCA1 result in a severe HDL deficiency syndrome called Tangier disease (TD) (Oram and Lawn 2001). ABCA1 knockout mice fed a high fat diet showed a dramatic increase in dietary cholesterol absorption (McNeish et al. 2000), suggesting that ABCA1 also regulates the uptake of cholesterol during digestion. Although the primary diagnosis of the female carrying the missense variant in ABCA1 was not known, she appeared to be severely affected with numerous dysmorphic and other features, which is generally consistent with the carrier from the DDD study, who was described as having a severe NDD (McRae et al. 2017).

Cholesterol is a major component of cell membranes, including myelin in the nervous system, and a precursor of steroid hormones and bile acids. ASD associated syndromes, Smith-Lemli-Opitz syndrome (SLOS, DHCR7) and Rett syndrome (MECP2; Buchovecky et al. 2013), are disorders of cholesterol metabolism. The prevalence of ASD in Rett is approximately 61% (females only), giving it perhaps the highest rate of ASD among genetic disorders (Richards et al. 2015). SLOS appears to have a similarly high rate of comorbid ASD, between 50% and 75% (Tierney et al. 2006; Sikora et al. 2006). Cholesterol disorders are also believed to underlie some idiopathic ASD (Tierney et al. 2006; Gillberg et al. 2017). More recently, homozygous missense variants in DHCR24 were found to underlie an uncharacterized ASD/ID disorder (Lim et al. 2014).

Although the SFARI database contains a large number of candidate ASD risk genes, the class of severe haploinsufficient risk genes has been thoroughly explored, making it unlikely to include many more members (McRae et al. 2017). In contrast, many activating and dominant-negative NDD associated genes remain to be described because these disorders are rarer and caused by a relatively small number of missense mutations within each gene (McRae et al. 2017). These genes should be identified by the clustering of missense variants (Lelieveld et al. 2017), and particularly the recurrence of specific de novo missense variants (Geisheker et al. 2017). As with PPP2R5D, ABCA1 may belong to this second class of risk genes for severe NDDs.

Many of the genes that have reached de novo significance in sequencing studies are involved in biological processes related to chromatin/transcription regulation. MECP2, the gene disrupted in Rett syndrome, is also a transcription factor, raising questions about what other ASD risk genes also control the expression of genes involved in cholesterol metabolism. In general, the function of many ASD risk genes is poorly understood (Krishnan et al. 2016).

Are genes involved in fatty acid metabolism risk factors for ASD?

The class of rare SNVs predicted to be damaging may have converged on biological processes related to glutamate signalling and lipid metabolism. These results were interpreted cautiously because most of the variants were not validated and the class consisted largely of missense variants, the functional consequences of which remain challenging to predict (Miosge et al. 2015). An even larger problem was that the control group was too small to test whether the class of variation defined here was associated with ASD. Previous analyses indicated that it would be. Inherited loss of function variants in pLI ≥ 0.9 genes that are absent from the ExAC database are significantly more frequent in ASD cases than in controls, making this one of the only types of genetic variation, aside from de novo variation, associated with ASD (Kosmicki et al. 2017). Previous work using residual variation intolerance scores (RVIS) also showed that among genes in the lower 50% of RVIS values, PTVs were transmitted significantly more often to probands than to their unaffected siblings, and that the significance increased with progressively lower RVIS values (Krumm et al. 2015). Overall, PTVs that are absent from ExAC and occur in genes constrained against loss of function variation are reliably associated with ASD, and most of these PTVs are transmitted. Given that the greatest proportion of de novo variants contributing to ASD are missense changes, the same should also be true for rare inherited variants. GO analysis was therefore conducted under the assumption that, in addition to the de novo variants, there was a larger group of rare, inherited variants contributing to risk among individuals in the ASPIRE cohort.
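The variant class described above (inherited loss of function variants, in genes with pLI ≥ 0.9, absent from ExAC) can be expressed as a simple filter. The following is a minimal sketch only: the record layout, field names, gene names, and the set of consequence terms treated as loss of function are assumptions made for illustration and do not reproduce the pipeline used in this study.

```python
from dataclasses import dataclass


# Hypothetical record layout; the field names are assumptions for illustration.
@dataclass
class Variant:
    gene: str
    consequence: str   # e.g. "stop_gained", "frameshift", "missense"
    pli: float         # gene-level pLI score
    in_exac: bool      # True if the allele is present in ExAC
    inherited: bool    # True if transmitted from a parent


# Assumed set of consequence terms counted as loss of function.
LOF_CONSEQUENCES = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}


def is_constrained_novel_lof(v: Variant, pli_cutoff: float = 0.9) -> bool:
    """Inherited loss of function variant in a pLI >= cutoff gene, absent from ExAC."""
    return (
        v.inherited
        and v.consequence in LOF_CONSEQUENCES
        and v.pli >= pli_cutoff
        and not v.in_exac
    )


# Made-up example records (placeholder gene names).
variants = [
    Variant("GENE_A", "stop_gained", 0.99, False, True),  # passes every criterion
    Variant("GENE_B", "stop_gained", 0.10, False, True),  # gene not constrained
    Variant("GENE_C", "frameshift", 0.95, True, True),    # present in ExAC
    Variant("GENE_D", "missense", 0.99, False, True),     # not loss of function
]

selected = [v.gene for v in variants if is_constrained_novel_lof(v)]
print(selected)
```

Extending the filter to the missense class would require an additional pathogenicity criterion, which is deliberately left out of this sketch.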

Disco-interacting protein 2 (DIP2C) has no GO annotations or functional literature and was associated with ASD based on recurrent de novo frameshift variants (Yuen et al. 2017). A functional study found that the Drosophila ortholog, dip2, was likely involved in fatty acid metabolism, perhaps taking part in the biosynthesis of membrane phospholipids and/or lipid rafts (Nitta et al. 2017), which are critical for neuronal development.

Only one of the genes annotated with lipid metabolic processes in this study was a candidate ASD risk gene. Acetyl-CoA carboxylase alpha (ACACA) has a key role in fatty acid synthesis and was predicted to be the primary ASD risk gene within the 17q12 deletion region (Gamsiz et al. 2015; Krishnan et al. 2016). The penetrance of 17q12 deletions was estimated to be high, and they are associated with kidney disease, neuropsychiatric disorders, and mild dysmorphic features such as frontal bossing, depressed nasal bridge, downslanting palpebral fissures, and high arched eyebrows (Mitchel et al. 2016). Within the 17q12 deletion region, the loss of HNF1B causes developmental kidney disease but is unrelated to the neurodevelopmental phenotype (Clissold et al. 2016). Notably, there have been no reported cases of a neurological disease in connection with a single gene deletion or mutation of ACACA (Mitchel et al. 2016).

Supposing some of the genes annotated with lipid metabolic processes contribute significantly to risk for ASD, they likely do so predominantly through inherited events. This class of variation has been poorly explored, and only in the context of recessive genes (Stein et al. 2013). Previous studies found that rare inherited variants contribute to risk through a set of genes largely distinct from those reaching de novo significance (Krumm et al. 2015; Kosmicki et al. 2017). This suggests that association testing for the class of rare inherited loss of function and missense variation will identify novel risk genes. The results of this study could therefore indicate that lipid metabolic processes are an important point of convergence for this class of variation.

Protein Phosphatases: an emerging group of ASD risk genes

There were four protein phosphatases impacted by novel missense variants among cases (PPP1CA, PPP1R13B, PPP2R5A, and PPP2R5D), and none among controls. PPP2R5D was one of 36 genes found carrying a recurrence of specific de novo missense variants in NDDs (Geisheker et al. 2017); these also included a number of other genes impacted by missense variants prioritized here, such as SCN2A, SCN8A, and CHD8. A similar study implicated both PPP2R5D and PPP2R1A based on the clustering of de novo missense variants (Lelieveld et al. 2017). Most of the de novo missense variants in PPP2R5D reported in denovo-db were recurrent: among the 17 cases with de novo missense variants, 10 carried the same variant (6:42975003 G>A) and three more had variants within 28 bp of this position. The missense variant in PPP2R5D reported here (p.Asp251Ala) was well outside of this region and may therefore help broaden the phenotype associated with these disorders. The phenotypes associated with de novo missense variants in PPP2R5D have been described from a few individuals and include ID, macrocephaly, hypotonia, and ASD (Shang et al. 2017). The carrier of the novel missense variant in PPP2R5D had severe ASD, GDD, ID, macrocephaly, spasticity, facial asymmetry, and other dysmorphic features.
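The degree of clustering described above can be made concrete with a short calculation. The sketch below uses only the counts and the hotspot coordinate given in the text (17 cases, 10 with 6:42975003 G>A, three more within 28 bp); the helper function and the example positions passed to it are hypothetical and serve only to illustrate the window test.

```python
HOTSPOT_POS = 42975003  # chr6 position of the recurrent G>A variant (from the text)
WINDOW_BP = 28          # distance within which three further variants fell (from the text)


def near_hotspot(pos: int, hotspot: int = HOTSPOT_POS, window: int = WINDOW_BP) -> bool:
    """Return True if a chr6 position lies within `window` bp of the hotspot."""
    return abs(pos - hotspot) <= window


# Counts reported in the text for de novo missense carriers in denovo-db.
n_cases = 17
n_same_variant = 10
n_within_window = 3

frac_identical = n_same_variant / n_cases
frac_clustered = (n_same_variant + n_within_window) / n_cases
print(f"identical variant: {frac_identical:.1%}")
print(f"at or near hotspot: {frac_clustered:.1%}")

# Hypothetical positions, used only to show how the window test behaves.
example_positions = [HOTSPOT_POS, HOTSPOT_POS + 28, HOTSPOT_POS - 29, HOTSPOT_POS + 5000]
print([near_hotspot(p) for p in example_positions])
```

Under these counts, 13 of the 17 carriers have a variant at or near the hotspot, which is why a variant well outside the window is of interest.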

Downstream biological processes do not support phenotype clusters

None of the biological processes indicated by the class of rare loss of function and missense SNVs represented variants in more than a small proportion of the individuals in this study, and none pertained to individuals from a specific phenotype cluster. Previous efforts by our collaborators to distinguish the phenotype clusters using a class of variation similar to the one defined here failed. Risk was apparently being moderated by variants in many different genes with different effect sizes, and there was no evidence that they pointed to any cluster-specific biology. An example is provided by the previously prioritized variants in NIPBL and SMC3. Both genes encode members of the cohesin complex; both are causes of Cornelia de Lange Syndrome (CDLS); and the carriers belonged to different phenotype clusters. CDLS is a highly variable syndrome and mutations in SMC3 generally result in a milder phenotype.

What is the significance of recurrent SCN2A de novo variants?

An important development in ASD genetics is the ‘genotype first’ approach to defining ASD subtypes, involving the identification and careful phenotypic evaluation of individuals with the same genotype (Stessman et al. 2014). There is a growing literature describing how specific genotypes, for example the disruption of a specific gene such as CHD8, manifest with recognizable symptoms (Bernier et al. 2016; Stessman et al. 2016). Thus, it should be possible to enrich a cohort for causal variants within the same gene by reducing phenotypic heterogeneity. This approach would require that the initial cohort be of sufficient size to include those genotypes in the first place. In contrast, the ASPIRE cohort was a relatively small and genetically heterogeneous sample.

In general, the data here are consistent with a small subset of patients carrying high risk de novo variants and having genetically distinct disorders with some overlapping features. Sodium Voltage-Gated Channel Alpha Subunit 2 (SCN2A) was an exception, as it harbored at least two and perhaps three de novo events. SCN2A encodes the Nav1.2 voltage-gated sodium channel and is one of the largest contributors of de novo variants to idiopathic ASD. The carrier of the novel missense variant in SCN2A was in the C1 cluster and appeared to have a very similar set of traits to the carriers with confirmed de novo events, consistent with this individual having a genuinely pathogenic variant. SCN2A is also strongly associated with epilepsy. Recent evidence suggests that de novo loss of function and missense variants observed in ASD result in the loss of channel function, whereas those occurring in epilepsy have a gain of function effect (Ben-Shalom et al. 2017).

Concluding remarks

Although studies of de novo loss of function and missense variants have contributed greatly to ASD locus discovery, the genes reaching significance for this class of variation appear to be associated with severe autistic traits in the presence of ID, whereas autistic traits are more widespread, being continuous with the general population (Constantino 2011). The results of this study are consistent with these findings: all the confirmed or highly likely de novo events were carried by severely affected individuals with some form of ID. As autistic traits are measured independently of IQ, this raises the question of whether autistic traits are associated with de novo variants in the absence of ID. Future studies should clarify this point.


References Cited

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.

Baig, Deeba Noreen, Toru Yanagawa, and Katsuhiko Tabuchi. "Distortion of the normal function of synaptic cell adhesion molecules by genetic variants as a risk for autism spectrum disorders." Brain research bulletin 2017, 129: 82-90.

Betancur, Catalina. “Etiological heterogeneity in autism spectrum disorders: More than 100 genetic and genomic disorders and still counting.” Brain Research 2011, 1380:42-77.

Ben-Shalom, Roy, et al. "Opposing effects on NaV1.2 function underlie differences between SCN2A variants observed in individuals with autism spectrum disorder or infantile seizures." Biological psychiatry 2017, 82: 224-232.

Ben-David, E., and S. Shifman. "Combined analysis of exome sequencing points toward a major role for transcription regulation during brain development in autism." Molecular psychiatry 2013, 18: 1054-1063.

Bernier, Raphael, et al. "Disruptive CHD8 mutations define a subtype of autism early in development." Cell 2014, 158:263-276.

Birol, Inanc, et al. "Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data." Bioinformatics 2013, 29:1492-1497.

Buchovecky, Christie M., et al. "A suppressor screen in Mecp2 mutant mice implicates cholesterol metabolism in Rett syndrome." Nature genetics 2013, 45: 1013-1022.

Bourgeron, Thomas. "From the genetic architecture to synaptic plasticity in autism spectrum disorder." Nature Reviews Neuroscience 2015, 16:551-563.

Chaisson, Mark J., Dumitru Brinza, and Pavel A. Pevzner. "De novo fragment assembly with short mate-paired reads: Does the read length matter?." Genome research 2009, 19: 336-346.


Chang, Jonathan, et al. "Genotype to phenotype relationships in autism spectrum disorders." Nature neuroscience 2015, 18: 191-198.

Clissold, Rhian, et al. “Chromosome 17q12 microdeletions but not intragenic HNF1B mutations link developmental kidney disease and psychiatric disorder.” Kidney International 2016, 90: 203-211.

Constantino, John. “The Quantitative Nature of Autistic Social Impairment”. Pediatr Res 2011, 69: 55R-62R.

De Rubeis, Silvia, et al. "Synaptic, transcriptional and chromatin genes disrupted in autism." Nature 2014, 515: 209-215.

Denton, James F., et al. "Extensive error in the number of genes inferred from draft genome assemblies." PLoS computational biology 2014, 10: e1003998.

Fischbach, Gerald D., and Catherine Lord. "The Simons Simplex Collection: a resource for identification of autism genetic risk factors." Neuron 2010, 68: 192-195.

Gao, Song, Wing-Kin Sung, and Niranjan Nagarajan. "Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences." Journal of Computational Biology 2011, 18:1681-1691.

Gamsiz, Ece D., et al. "Discovery of rare mutations in autism: elucidating neurodevelopmental mechanisms." Neurotherapeutics 2015, 12: 553-571.

Gentleman, Seth and Robert Falcon. "An introduction to bioconductor’s expressionset class." (2007).

Geisheker, Madeleine R., et al. "Hotspots of missense mutation identify neurodevelopmental disorder genes and functional domains." Nature neuroscience 2017, 20:1043-1049.

Gilman, Sarah R., et al. "Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses." Neuron 2011, 70: 898-907.


Gillberg, Christopher, et al. “The role of cholesterol metabolism and various steroid abnormalities in autism spectrum disorders: A hypothesis paper.” Autism Research 2017, 10: 1022-1044.

Haas BJ., et al. “De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis”. Nat Protoc 2013, 8:1494-1512.

He, Xin, et al. "Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes." PLoS genetics 2013, 9:e1003671.

Iossifov, Ivan, et al. "De novo gene disruptions in children on the autistic spectrum." Neuron 2012, 74:285-299.

Iossifov, Ivan, et al. "The contribution of de novo coding mutations to autism spectrum disorder." Nature 2014, 515:216-221.

Jacquemont, Marie-Line, et al. "Array-based comparative genomic hybridisation identifies high frequency of cryptic chromosomal rearrangements in patients with syndromic autism spectrum disorders." Journal of medical genetics 2006, 43: 843-849.

Kang, Hyo Jung, et al. "Spatio-temporal transcriptome of the human brain." Nature 2011, 478: 483-489.

Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res 2002, 12:656-664.

Marçais, Guillaume, et al. "MUMmer4: A fast and versatile genome alignment system." PLoS computational biology 2018, 14: e1005944.

Kircher, Martin, et al. "A general framework for estimating the relative pathogenicity of human genetic variants." Nature genetics 2014, 46: 310-315.

Krishnan, Arjun, et al. "Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder." Nature neuroscience 2016, 19: 1454-1462.

Krumm, Niklas, et al. "Excess of rare, inherited truncating mutations in autism." Nature genetics 2015, 47:582-588.


Kosmicki, Jack A., et al. "Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples." Nature Genetics 2017, 49:504-510.

Lelieveld, Stefan., et al. “Spatial Clustering of de novo Missense Mutations Identifies Candidate Neurodevelopmental Disorder-Associated Genes.” The American Journal of Human Genetics 2017, 101: 478-484.

Lek, Monkol, et al. "Analysis of protein-coding genetic variation in 60,706 humans." Nature 2016, 536: 285-291.

Li H, Wang J, Mor G, Sklar J: A neoplastic gene fusion mimics trans-splicing of RNAs in normal human cells. Science 2008, 321:1357-1361.

Li H: Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics. 2012, 28:1838-1844.

Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 2009, 25:1754-1760.

Lin, Yu-Chih, et al. "A subset of autism-associated genes regulate the structural stability of neurons." Frontiers in cellular neuroscience 2016, 10: 263.

Lim, Teng Ting. “Exploring the genetic landscape of complex diseases using the recessive model.” Doctoral dissertation, Harvard University (2014).

Lord, Catherine, et al. "The Autism Diagnostic Observation Schedule—Generic: A standard measure of social and communication deficits associated with the spectrum of autism." Journal of autism and developmental disorders 2000, 30: 205- 223.

Lord, Catherine, Michael Rutter, and Ann Le Couteur. "Autism Diagnostic Interview-Revised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders." Journal of autism and developmental disorders 1994, 24: 659-685.

McLaren, William, et al. "The ensembl variant effect predictor." Genome biology 2016, 17: 122.


McRae, Jeremy F., et al. "Prevalence and architecture of de novo mutations in developmental disorders." Nature 2017, 542: 433-438.

Miosge, Lisa A., et al. "Comparison of predicted and actual consequences of missense mutations." Proceedings of the National Academy of Sciences 2015, 112: E5189-E5198.

Mitchel, Marissa W., et al. "17q12 recurrent deletion syndrome." (2016).

Moreno-De-Luca, Andres, et al. "Developmental brain dysfunction: revival and expansion of old concepts based on new genetic evidence." The Lancet Neurology 2013, 12:406-414.

Neale, Benjamin M., et al. "Patterns and rates of exonic de novo mutations in autism spectrum disorders." Nature 2012, 487:242-245.

Nitta, Yohei, et al. "DISCO interacting protein 2 regulates axonal bifurcation and guidance of Drosophila mushroom body neurons." Developmental Biology 2017, 421: 233-244.

Oram, John F., and Richard M. Lawn. "ABCA1: the gatekeeper for eliminating excess tissue cholesterol." Journal of lipid research 2001, 42:1173-1179.

O’Roak, Brian J., et al. "Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders." Science 2012 (a), 338:1619-1622.

O’Roak, Brian J., et al. "Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations." Nature 2012 (b), 485: 246-250.

O'Roak, B. J., et al. "Recurrent de novo mutations implicate novel genes underlying simplex autism risk." Nature communications 2014, 5: 5595-5609.

Packer, Alan. "Neocortical neurogenesis and the etiology of autism spectrum disorder." Neuroscience & Biobehavioral Reviews 2016, 64: 185-195.



Pinto, Dalila, et al. "Convergence of genes and cellular pathways dysregulated in autism spectrum disorders." The American Journal of Human Genetics 2014, 94: 677-694.

Parikshak, Neelroop N., et al. "Integrative functional genomic analyses implicate specific molecular pathways and circuits in autism." Cell 2013, 155: 1008-1021.

Parikshak, Neelroop N., Michael J. Gandal, and Daniel H. Geschwind. "Systems biology and gene networks in neurodevelopmental and neurodegenerative disorders." Nature Reviews Genetics 2015, 16: 441-458.

Pavey, Scott A., et al. "Draft genome of the American eel (Anguilla rostrata)." Molecular ecology resources 2017, 17: 806-811.

Pevzner, Pavel., and Neil Jones. “An Introduction to Bioinformatics Algorithms”. MIT Press (2004).

Rehan SM, Glastad KM, Lawson SP, Hunt BG: The Genome and Methylome of a Subsocial Small Carpenter Bee, Ceratina calcarata. Genome Biol Evol 2016, 8:1401-1410.

Richards, Caroline., et al. “Prevalence of autism spectrum disorder phenomenology in genetic disorders: a systematic review and meta-analysis.” Lancet Psychiatry 2015, 2: 909-916.

Ronemus, Michael., et al. “The role of de novo mutations in the genetics of autism spectrum disorders.” Nature Reviews 2014, 15: 133-141.

Samocha, Kaitlin., et al. “Regional missense constraint improve Gao S, Sung WK, Nagarajan N: Opera: reconstructing optimal genomic scaffolds with high- throughput paired-end sequences. J Comput Biol 2011, 18:1681-1691.

Sanders, Stephan J. "First glimpses of the neurobiology of autism spectrum disorder." Current opinion in genetics & development 2015a, 33: 80-92.

Sanders, Stephan J., et al. "Insights into autism spectrum disorder genomic architecture and biology from 71 risk loci." Neuron 2015b, 87: 1215-1233.

Sebat, Jonathan, et al. "Strong association of de novo copy number mutations with autism." Science 2007, 316: 445-449.


Sikora, Darryn M., et al. "The near universal presence of autism spectrum disorders in children with Smith–Lemli–Opitz syndrome." American journal of medical genetics Part A 2006, 140: 1511-1518.

Simao, Felipe, et al. “BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs”. Bioinformatics 2015: 31-42.

Sugathan, Aarathi, et al. "CHD8 regulates neurodevelopmental pathways associated with autism spectrum disorder in neural progenitors." Proceedings of the National Academy of Sciences 2014, 111: E4468-E4477.

Song, Li, et al. “Rascaf: Improving Genome Assembly with RNA Sequencing Data.” The Plant Genome 2016, 9: 569-581.

Stein, Jason, et al. “Rare inherited variation in autism: beginning to see the forest and a few trees.” Neuron 2013, 77: 209-211.

Stessman, Holly AF, et al. "Targeted sequencing identifies 91 neurodevelopmental-disorder risk genes with autism and developmental-disability biases." Nature genetics 2017, 49: 515-529.

Tierney, Elaine., et al. “Abnormalities of Cholesterol Metabolism in Autism Spectrum Disorders.” American Journal of Medical Genetics 2006, 6: 666-668.

Tkatchenko AV., et al. “Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences”. J Comput Biol 2011, 18:1681-1691.

Tollis M., et al. “The Agassiz's desert tortoise genome provides a resource for the conservation of a threatened species”. PLoS One 2017, 12: e0177708.

Turner, Tychele., et al. “denovo-db: a compendium of human de novo variants.” Nucleic Acids Research 2017, 45: 237-251.

Van der Auwera, Geraldine A., et al. "From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline." Current protocols in bioinformatics 2013: 11.10.1-33.


Weiner, Daniel J., et al. "Polygenic transmission disequilibrium confirms that common and rare variation act additively to create risk for autism spectrum disorders." Nature genetics 2017, 49: 978-986.

Vij, Shubha, et al. "Chromosomal-level assembly of the Asian seabass genome using long sequence reads and multi-layered scaffolding." PLoS genetics 2016, 12.4: e1005954.

Vorstman, J. A. S., et al. "Identification of novel autism candidate regions through analysis of reported cytogenetic abnormalities associated with autism." Molecular psychiatry 2006, 11: 18-28.

Warren RL., et al. “Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism”. Plant J 2015, 83:189-212.

Willsey, A. Jeremy, et al. "Coexpression networks implicate human midfetal deep cortical projection neurons in the pathogenesis of autism." Cell 2013, 155: 997-1007.

Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, Huang W, He G, Gu S, Li S et al: SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics 2014, 30:1660-1666.

Xue, Wei, et al. “L_RNA_scaffolder: scaffolding genomes with transcripts.” BMC Genomics 2013, 14: 604-625.

Yuen, Ryan KC, et al. "Genome-wide characteristics of de novo mutations in autism." NPJ genomic medicine 2016, 1: 16027-16038.

Yuen, Ryan KC, et al. "Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder." Nature neuroscience 2017, 20: 602-615.

Zhang SV, Zhuo L, Hahn MW: AGOUTI: improving genome assembly and annotation using transcriptome data. Gigascience 2016, 5


Appendices

Table 1: An example of the standard results report in the log file from RDNA describing the original and new (remainder, updated, and new files) genome assemblies.

File                         seq_count  defined_bases  N_bases   n_bases  Tot_bases
original_genome_assembly.fa  5006644    1069861165     24756536  0        1094617701
remainder.fa                 5001296    1066212423     24638038  0        1090850461
updated_sequences.fa         6          11350          331       0        11681
new_scaffolds.fa             2352       3636682        118103    148900   3903685
New assembly total           5003654    1069860455     24756472  148900   1094765827
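The figures in Table 1 can be checked for internal consistency. The sketch below is not part of RDNA; it simply copies the numbers from the table and assumes that Tot_bases is the sum of the three base columns and that the "New assembly total" row is the sum of the remainder, updated, and new scaffold files. Both assumptions are borne out by the values.

```python
# Values copied from Table 1:
# (seq_count, defined_bases, N_bases, n_bases, Tot_bases)
rows = {
    "original_genome_assembly.fa": (5006644, 1069861165, 24756536, 0, 1094617701),
    "remainder.fa": (5001296, 1066212423, 24638038, 0, 1090850461),
    "updated_sequences.fa": (6, 11350, 331, 0, 11681),
    "new_scaffolds.fa": (2352, 3636682, 118103, 148900, 3903685),
    "New assembly total": (5003654, 1069860455, 24756472, 148900, 1094765827),
}

# Check 1: within every row, Tot_bases = defined_bases + N_bases + n_bases.
for name, (_, defined, upper_n, lower_n, total) in rows.items():
    assert defined + upper_n + lower_n == total, name

# Check 2: the new assembly total is the column-wise sum of its three files.
parts = ["remainder.fa", "updated_sequences.fa", "new_scaffolds.fa"]
summed = tuple(sum(rows[p][i] for p in parts) for i in range(5))
assert summed == rows["New assembly total"]

# Difference between the new and original assemblies, column by column.
delta = tuple(
    new - old
    for new, old in zip(rows["New assembly total"], rows["original_genome_assembly.fa"])
)
print(delta)
```

In this example the new assembly has 2990 fewer sequences than the original, while the total base count rises by 148126, driven by the 148900 bases reported in the n_bases column.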

Table 2: All variants in C1 and C2 previously prioritized by our collaborators, and belonging to the class of variation defined by this study.

CHR POS REF ALT Gene Chi CQ SIFT PolyPhen CADD Sample Origin

1 7796467 A G CAMTA1 - p.Lys1044Glu 0.03 0.99 28.3 1442-22354 inherited

1 23648092 C T HNRNPR 35.5 p.Arg247Lys 0 0.95 33 1569-22238 inherited

1 31414077 C A PUM1 84.3 p.Gly1055Cys 0 1 35 15542-26229 inherited

1 36384789 C T AGO1 67.6 p.Pro800Leu 0 1 28.1 21773-34925 inherited

1 156105105 T C LMNA - splice_donor - - 24.7 21304-33563 -

1 200969864 G A KIF21B - p.Arg483Trp 0 0.99 35 1792-23269 inherited

1 208202316 A G PLXNA2 50.3 p.Ile1766Thr 0.01 0.97 24.9 1855-23570 -

1 237659974 G A RYR2 40.1 p.Gly709Arg - 1 32 1946-23968 inherited

2 25462044 G A DNMT3A 6.3 p.Ala788Val 0 0.98 32 20572-32334 -

2 111921708 A C BCL2L11 - splice_acceptor - - 23.6 1987-24091 inherited

2 166168534 G A SCN2A - splice_acceptor - - 26.5 21705-34281 denovo

2 170885854 A G UBR3 - splice_acceptor - - 24.8 21684-34154 inherited

2 218713221 G C TNS1 - stop_gained - - 34 21762-34808 inherited

2 225422467 T C CUL3 - p.Tyr58Cys 0 1 25.6 19275-30820 -

2 234377170 C T DGKD - p.Leu1176Phe 0 1 32 21758-34760 -

2 241696767 C T KIF1A 24.2 p.Asp943Asn 0.04 1 27.8 21870-36159 inherited


3 9786121 G A BRPF1 7.2 p.Arg950Gln 0.04 0.99 26.7 19844-31412 inherited

3 114070363 C G ZBTB20 - p.Glu188Gln 0 0.99 26.4 21507-33839 inherited

4 860151 C T GAK - splice_donor - - 24.9 1886-23674 inherited

4 53464810 T G USP46 - p.Asp328Ala 0 0.97 31 2007-24164 -

4 57887162 G T POLR2B - splice_donor - - 24.3 1876-23644 -

4 74007509 G A ANKRD17 10.6 p.Pro761Ser - 0.95 29.8 16260-27149 inherited

5 37059219 T C NIPBL 142.6 p.Leu2546Pro 0 1 28.9 1543-22102 denovo

5 161309585 T A GABRA1 32 p.Val194Asp 0 1 32 1792-23269 failed

6 83872513 C T DOPEY1 - stop_gained - - 55 21762-34808 inherited

7 100280017 TG T GIGYF1 - frameshift - - - 1883-23662 -

8 53071669 G C ST18 - stop_gained - - 40 871-19574 inherited

9 140953048 G A CACNA1B 86.8 p.Ala1446Thr 0.01 0.99 34 16260-27149 -

10 71158568 T C HK1 - p.Tyr869His 0.01 1 28.5 1855-23570 inherited

10 94408018 C G KIF11 - stop_gained - - 35 18436-29827 -

10 112352910 T C SMC3 112.8 p.Leu631Pro 0 1 28 19991-31601 -

11 9441980 A G IPO7 - p.Asp250Gly 0.04 0.99 25.2 21236-33440 -

11 62484506 G A HNRNPUL2 - stop_gained - - 41 1989-24099 -

11 117308080 G C DSCAML1 97 p.Ser1553Cys 0.02 0.92 27.2 1792-23269 inherited

12 46231491 G A ARID2 - splice_donor - - 23.9 1885-23669 denovo

12 52099273 G A SCN8A 90 p.Val403Met 0 1 28.7 1937-23920 -

12 116399093 T C MED13L 42.1 p.Asn2204Ser 0.01 0.97 25.7 635-18522 inherited

12 123480051 C T PITPNM2 39.9 p.Asp647Asn 0.03 1 34 1792-23269 inherited

12 123519119 G A PITPNM2 39.9 p.Arg7Trp 0 1 34 21762-34808 inherited

13 35632895 T G NBEA 44.8 p.Asp378Glu 0.02 1 27 21879-36229 -

14 21875188 G A CHD8 149.9 p.Arg912Cys 0 0.98 35 21705-34281 inherited


14 50296078 G A NEMF - stop_gained - - 38 1792-23269 failed

14 103470357 C T CDC42BPB 71.4 p.Ala119Thr 0 1 32 779-18300 inherited

15 28422649 C T HERC2 311.8 p.Arg3057Gln - 0.98 35 21969-37814 inherited

15 41772874 G A RTF1 - p.Gly708Arg 0 1 27.4 2007-24164 -

15 48788393 G A FBN1 65.6 p.Leu775Phe - 1 31 1876-23644 -

16 1793424 G A MAPK8IP3 63.6 p.Asp231Asn 0.01 0.96 27.3 19993-31610 inherited

17 29206439 A G ATAD5 - splice_acceptor - - 24.2 19844-31412 inherited

17 36935652 C T PIP4K2B - p.Arg213His 0.01 0.95 35 613-14848 -

18 31324700 CAAAA C ASXL3 - frameshift - - - 20705-32533 -

19 7172377 G A INSR - p.Arg398Cys 0.01 1 34 21740-34579 inherited

19 9971445 C T OLFM2 - p.Gly30Asp 0.01 0.99 32 2007-24164 -

19 17318004 G A MYO9B 13.2 p.Asp1859Asn 0 1 28.2 16260-27149 inherited

19 18778860 T C KLHL26 - p.Leu218Pro 0 1 25.4 21742-34594 inherited

19 18968255 C T UPF1 - stop_gained - - 45 21970-37819 failed

19 38934225 C T RYR1 33.1 p.Leu100Phe - 1 28.2 21686-34166 inherited

19 38949911 G A RYR1 33.1 p.Gly765Ser - 1 28.5 1548-22146 inherited

19 42525615 T C GRIK5 74.7 p.Tyr570Cys 0 0.98 27.2 928-19074 inherited

20 49508452 AC A ADNP - frameshift - - - 1986-24087 -

20 62836451 G T MYT1 - splice_donor - - 25.6 2047-24322 passed

X 48933022 C T WDR45 - splice_donor - - 25 21974-35254 denovo
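Because each row of Table 2 is a whitespace-separated record with twelve fields, it can be read programmatically. The following is a minimal sketch, assuming the column order given in the table header; the four example rows are copied from the table, while the function name and the tallying step are illustrative only.

```python
from collections import Counter

# Column order taken from the Table 2 header.
COLUMNS = ["CHR", "POS", "REF", "ALT", "Gene", "Chi", "CQ",
           "SIFT", "PolyPhen", "CADD", "Sample", "Origin"]

# Four rows copied from Table 2 ("-" marks a value that is not reported).
ROWS = """\
1 7796467 A G CAMTA1 - p.Lys1044Glu 0.03 0.99 28.3 1442-22354 inherited
2 166168534 G A SCN2A - splice_acceptor - - 26.5 21705-34281 denovo
2 225422467 T C CUL3 - p.Tyr58Cys 0 1 25.6 19275-30820 -
5 161309585 T A GABRA1 32 p.Val194Asp 0 1 32 1792-23269 failed
"""


def parse_row(line: str) -> dict:
    """Split one table row into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}: {line!r}")
    return dict(zip(COLUMNS, fields))


records = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Tally the Origin column as written in the table.
origin_counts = Counter(record["Origin"] for record in records)
print(origin_counts)
```

Values are kept as strings so that the "-" placeholders survive parsing; numeric conversion of POS, SIFT, PolyPhen, and CADD would need to handle them explicitly.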

Table 3: All C1 and C2 missense variants identified by this analysis and not previously prioritized chr position ref alt gene Chi HGVSp PP CADD Sample 1 69149 T A OR4F5 - ENSP00000334393.3:p.Leu20Gln 0.98 23 21731-34484 1 1372478 C T VWA1 7 ENSP00000417185.1:p.Pro82Leu 1.00 24.3 18436-29827 1 3322107 G A PRDM16 17 ENSP00000270722.5:p.Val361Met 0.97 29.6 2047-24322 1 3380087 C T ARHGEF16 - ENSP00000367629.4:p.Arg147Trp 1.00 26.8 19861-31432

54

1 9932101 C A CTNNBIP1 - ENSP00000366474.1:p.Gly8Trp 1.00 35 21730-34469 1 12998857 C T PRAMEF6 - ENSP00000365360.1:p.Cys360Tyr 1.00 16.87 2047-24322 1 13448282 A C PRAMEF13 - ENSP00000365302.3:p.Leu398Trp 0.99 21 1876-23644 1 22456169 G A WNT4 - ENSP00000290167.5:p.Arg85Trp 1.00 35 1569-22238 1 24782648 G A NIPAL3 - ENSP00000363520.4:p.Val220Ile 0.92 29.1 20705-32533 1 28293080 G A XKR8 - ENSP00000362991.5:p.Arg186His 0.92 34 19844-31412 1 28586363 T C SESN2 - ENSP00000253063.3:p.Ile2Thr 0.98 26.1 15497-26119 1 36054149 C T TFAP2E - ENSP00000362332.3:p.Arg261Cys 1.00 34 635-18522 1 38343870 A G INPP5B - ENSP00000362115.3:p.Phe556Ser 1.00 29.2 1792-23269 1 42045555 G T HIVEP3 - ENSP00000361664.1:p.Asn1638Lys 1.00 25.3 21699-34238 1 44683986 G A DMAP1 11 ENSP00000361363.2:p.Val133Met 0.96 26.9 21507-33839 1 44820667 C T ERI3 - ENSP00000361331.2:p.Gly11Glu 0.93 26.5 21799-35302 1 45252262 G T BEST4 - ENSP00000361281.3:p.Asp118Glu 0.94 26.4 1565-22223 1 49224665 G A BEND5 - ENSP00000360899.3:p.Leu218Phe 0.99 26.9 1710-22973 1 53546470 A G PODN - ENSP00000308315.5:p.Tyr576Cys 1.00 28.2 20569-32326 1 113239306 G A MOV10 42 ENSP00000399797.2:p.Gly679Glu 1.00 31 1291-21555 1 116941252 G A ATP1A1 - ENSP00000445306.1:p.Val712Met 0.92 28 21705-34281 1 119575851 T C WARS2 - ENSP00000235521.4:p.Thr256Ala 0.96 26.3 16262-27157 1 145756605 T G PDZK1 44 ENSP00000342143.2:p.Tyr328Asp 0.98 23.5 928-19074 1 155294968 A G RUSC1 48 ENSP00000357336.5:p.Lys511Arg 1.00 27.5 20010-31644 1 156020104 T C UBQLN4 - ENSP00000357292.3:p.Asn240Ser 1.00 24.9 1883-23662 1 156562416 A C APOA1BP - ENSP00000357218.3:p.Gln157Pro 1.00 28.1 1550-22157 1 167384827 T C POU2F1 - ENSP00000356840.2:p.Val694Ala 0.98 25.3 21189-33353 1 174417723 G C GPR52 - ENSP00000356658.2:p.Leu158Phe 0.99 23.2 1855-23570 1 179605014 T G TDRD5 - ENSP00000406052.1:p.Ile504Met 1.00 25.5 21852-36011 1 200817771 A G CAMSAP2 7 ENSP00000351684.2:p.Glu625Gly 0.99 26.9 21799-35302 1 200818883 C T CAMSAP2 7 
ENSP00000351684.2:p.Arg996Trp 0.92 31 20620-32415 1 205030533 C G CNTN2 - ENSP00000330633.4:p.Arg320Gly 1.00 32 1548-22146 1 205632471 G C SLC45A3 - ENSP00000356113.3:p.Leu150Val 0.93 23.5 21974-35254 1 207078378 T C FAIM3 - ENSP00000356058.3:p.Asn387Asp 1.00 27.5 20705-32533 1 212459528 C T PPP2R5A - ENSP00000261461.2:p.Arg26Trp 0.96 35 21733-34496 1 214557966 G A PTPN14 - ENSP00000355923.4:p.Ser411Phe 1.00 24.7 21733-34496 1 229661817 C A ABCB10 - ENSP00000355637.3:p.Gly591Val 1.00 32 20569-32326 1 248084767 G C OR2T8 - ENSP00000326225.4:p.Gly150Arg 1.00 23.3 2047-24322 1 248436669 C G OR2T33 - ENSP00000324687.2:p.Gly150Arg 1.00 23.5 2047-24322 2 1680735 C T PXDN - ENSP00000252804.4:p.Gly271Asp 1.00 32 902-19869 2 25387541 C A POMC - ENSP00000384092.1:p.Cys34Phe 1.00 32 20569-32326 2 27589956 A G EIF2B4 - ENSP00000394869.2:p.Ile353Thr 0.99 28.1 21723-34401 2 54872506 G T SPTBN1 103 ENSP00000349259.4:p.Leu1470Phe 0.90 26 1457-21708 2 74709393 C A CCDC142 34 ENSP00000290418.4:p.Arg191Leu 1.00 25.7 20620-32415 2 96932210 G C CIAO1 - ENSP00000418287.1:p.Arg41Pro 0.99 33 21799-35302

55

2 96937031 C T CIAO1 - ENSP00000418287.1:p.Ser321Phe 0.99 33 1711-22976
2 100006699 G C EIF5B 36 ENSP00000289371.5:p.Leu807Phe 1.00 27.1 19844-31412
2 105979751 G A FHL2 - ENSP00000350846.4:p.Pro227Ser 0.95 28.9 15492-26109
2 107423268 G C ST6GAL2 - ENSP00000386942.3:p.Leu486Val 0.99 26.8 1925-23858
2 120677735 T C PTPN4 - ENSP00000263708.2:p.Phe307Leu 0.99 28.7 1885-23669
2 130951802 C T TUBA3E - ENSP00000318197.7:p.Asp205Asn 0.98 25.3 21852-36011
2 136511755 T G UBXN4 - ENSP00000272638.9:p.Phe81Val 0.95 28.7 21801-35317
2 136566064 C A LCT 75 ENSP00000264162.2:p.Asp1285Tyr 1.00 24.9 466-14420
2 157352727 G C GPD2 - ENSP00000308610.5:p.Gly92Arg 0.99 28.4 1999-24135
2 166201267 G A SCN2A 64 ENSP00000349973.3:p.Arg922His 0.97 34 21686-34166
2 166514315 G A CSRNP3 26 ENSP00000318258.7:p.Val65Ile 0.99 25.1 21856-36031
2 182543230 C A NEUROD1 - ENSP00000295108.3:p.Ala120Ser 1.00 31 779-18300
2 186655167 A G FSIP2 - ENSP00000344403.5:p.Ile1191Val 0.99 22.9 21762-34808
2 186669653 A C FSIP2 - ENSP00000344403.5:p.Glu5296Ala 1.00 25.8 1256-20693
2 192711179 A G SDPR - ENSP00000305675.4:p.Leu158Pro 1.00 27 15210-25677
2 197986075 C T ANKRD44 - ENSP00000387233.1:p.Gly296Glu 1.00 34 1769-23158
2 198318289 C T COQ10B - ENSP00000263960.2:p.Ala2Val 0.91 33 1989-24099
2 203162141 C A NOP58 - ENSP00000264279.5:p.Arg371Ser 0.97 34 1457-21708
2 210683909 A G UNC80 - ENSP00000391088.1:p.Tyr629Cys 0.99 27.2 15164-25527
2 210752819 G T UNC80 - ENSP00000391088.1:p.Asp1373Tyr 0.95 29.6 1565-22223
2 211057534 A T ACADL - ENSP00000233710.3:p.Ile398Asn 0.98 32 22124-39156
2 211441092 C T CPS1 - ENSP00000402608.2:p.Pro93Ser 1.00 32 2031-24261
2 211476857 G A CPS1 - ENSP00000402608.2:p.Arg809His 1.00 34 21870-36159
2 216288234 C T FN1 - ENSP00000346839.4:p.Arg411Gln 0.99 35 532-14596
2 220110807 G A STK16 - ENSP00000386928.3:p.Gly29Asp 0.98 32 1769-23158
2 220161733 T G PTPRN 13 ENSP00000295718.2:p.Lys737Thr 1.00 25.7 21723-34401
2 220315996 C T SPEG 156 ENSP00000311684.7:p.Pro751Leu 0.96 27.4 1923-23847
2 230744722 G A TRIP12 42 ENSP00000283943.4:p.Pro25Leu 0.98 23.9 993-20365
2 232660927 C G COPS7B - ENSP00000272995.4:p.Gln147Glu 0.98 26.5 21762-34808
2 236708163 C G AGAP1 - ENSP00000307634.7:p.Phe318Leu 0.98 33 1946-23968
2 242009470 G A SNED1 24 ENSP00000308893.8:p.Ser1148Asn 1.00 28.6 20984-32938
2 242607580 C T ATG4B - ENSP00000384259.3:p.Ser252Phe 0.93 33 635-18522
2 242743087 G C GAL3ST2 - ENSP00000192314.6:p.Val235Leu 0.94 28.4 16262-27157
3 10188288 G A VHL - ENSP00000256474.2:p.Gly144Glu 0.94 29.3 21852-36011
3 32776343 A G CNOT10 - ENSP00000399862.2:p.Ile523Met 0.99 24.1 19275-30820
3 38523702 G A ACVR2B - ENSP00000340361.3:p.Arg363Gln 0.99 35 15164-25527
3 38595836 G A SCN5A 33 ENSP00000410257.1:p.Arg1583Cys 0.94 33 21505-33825
3 39110289 C T WDR48 40 ENSP00000307491.5:p.Ser170Phe 1.00 33 902-19869
3 46245509 C A CCR1 - ENSP00000296140.3:p.Trp99Leu 1.00 26 1034-19683
3 48339975 A C NME6 - ENSP00000416658.1:p.Leu19Arg 1.00 28.9 2047-24322
3 48451382 G A PLXNB1 51 ENSP00000351338.4:p.Arg1904Trp 1.00 34 1885-23669
3 49459544 A T AMT - ENSP00000273588.3:p.Met84Lys 0.95 24.6 1034-19683
3 49569800 T C DAG1 - ENSP00000442600.1:p.Val619Ala 0.95 22.8 21505-33825
3 52556585 G A STAB1 - ENSP00000312946.6:p.Val2209Ile 1.00 32 19275-30820
3 55021748 G C CACNA2D3 - ENSP00000419101.1:p.Met886Ile 0.92 27.8 1946-23968
3 58083671 A G FLNB 8 ENSP00000420213.1:p.Asn372Asp 1.00 28.1 993-20365
3 70014001 C G MITF - ENSP00000295600.7:p.Leu389Val 1.00 28 21189-33353
3 97329696 C T EPHA6 - ENSP00000374323.5:p.Arg858Trp 1.00 32 1876-23644
3 113684108 T A KIAA1407 - ENSP00000295878.3:p.Lys902Met 0.97 24.2 20620-32415
3 122078750 C G CCDC58 - ENSP00000291458.5:p.Cys134Ser 0.93 28 1972-24038
3 123221467 C T PTPLB - ENSP00000373153.5:p.Arg148His 1.00 35 19283-30834
3 126748891 G A PLXNA1 95 ENSP00000377061.2:p.Arg1682Gln 1.00 32 2007-24164
3 126752807 C T PLXNA1 95 ENSP00000377061.2:p.Arg1880Trp 0.92 35 993-20365
3 127335738 T C MCM2 - ENSP00000265056.7:p.Ile517Thr 0.99 25.4 16262-27157
3 156877706 G A CCNL1 41 ENSP00000295926.3:p.Pro60Ser 0.99 24.6 1710-22973
3 183858330 C T EIF2B5 - ENSP00000273783.3:p.Pro323Leu 0.95 33 19844-31412
3 183881498 G C DVL3 - ENSP00000316054.3:p.Gly72Ala 0.98 29 21762-34808
3 184075258 C T CLCN2 - ENSP00000265593.4:p.Glu264Lys 0.99 35 21304-33563
3 185347619 A G SENP2 - ENSP00000296257.5:p.Gln586Arg 1.00 23.9 1924-23853
3 185540959 G A IGF2BP2 - ENSP00000371634.2:p.Ser74Leu 0.99 25.4 21758-34760
3 193363543 A G OPA1 25 ENSP00000354681.3:p.Ile552Val 1.00 26.7 1711-22976
3 196050833 A T TM4SF19 - ENSP00000273695.3:p.Val162Asp 0.95 24.1 21699-34238
3 196304328 G A FBXO45 - ENSP00000310332.6:p.Arg108His 0.94 26.7 21183-33343
4 2749499 C A TNIP2 - ENSP00000321203.7:p.Glu150Asp 1.00 29.7 1925-23858
4 15646174 G T FBXL5 12 ENSP00000344866.3:p.Ser81Tyr 1.00 27.7 1457-21708
4 54257236 G A FIP1L1 - ENSP00000336752.6:p.Cys189Tyr 1.00 26.8 1588-22326
4 77102138 T A SCARB2 - ENSP00000264896.2:p.Asp131Val 0.99 25.7 993-20365
4 83601884 T G SCD5 - ENSP00000316329.4:p.Asp182Ala 1.00 29 21799-35302
4 96091348 C T UNC5C - ENSP00000406022.1:p.Gly863Ser 1.00 34 19283-30834
4 114170951 G A ANK2 54 ENSP00000349588.4:p.Arg308Gln 1.00 33 1548-22146
4 118005652 C A TRAM1L1 - ENSP00000309402.4:p.Val300Phe 0.97 27.3 20574-32342
4 122261682 C T QRFPR 16 ENSP00000377948.2:p.Glu142Lys 0.96 34 21505-33825
4 146686237 G A ZNF827 21 ENSP00000368761.4:p.Arg1045Trp 1.00 34 21773-34925
4 184243146 A C CLDN24 - ENSP00000438400.1:p.Val145Gly 0.95 24.5 15492-26109
5 1079591 G A SLC12A7 - ENSP00000264930.5:p.Arg440Trp 1.00 33 1569-22238
5 6748591 C T PAPD7 41 ENSP00000230859.6:p.Arg242Trp 1.00 34 1457-21708
5 16481171 A T FAM134B - ENSP00000304642.9:p.Phe206Tyr 1.00 27.6 1711-22976
5 38960546 G A RICTOR 19 ENSP00000349959.3:p.Thr602Met 1.00 32 21929-37306
5 61694348 G A DIMT1 - ENSP00000199320.4:p.Pro146Ser 1.00 29.4 16262-27157
5 68840897 T A OCLN 51 ENSP00000347379.2:p.Tyr402Asn 1.00 31 1399-21439
5 79470796 G A SERINC5 - ENSP00000420863.1:p.His178Tyr 0.98 33 635-18522
5 89769853 A C MBLAC2 - ENSP00000314776.6:p.Phe86Cys 1.00 28.1 20014-31660
5 95748122 A T PCSK1 - ENSP00000308024.2:p.Val261Glu 1.00 29.5 1769-23158
5 118728619 G A TNFAIP8 - ENSP00000422245.1:p.Ser47Asn 0.91 24.5 21405-33683
5 131298298 T A ACSL6 - ENSP00000368566.2:p.Glu596Val 1.00 32 19273-30812
5 131325107 C T ACSL6 - ENSP00000368566.2:p.Arg182Gln 1.00 32 20984-32938
5 131543523 G A P4HA2 - ENSP00000384999.1:p.Pro320Ser 0.94 26.1 21856-36031
5 140167686 A G PCDHA1 - ENSP00000420840.2:p.Asn604Ser 1.00 23.1 2063-24398
5 140222828 A T PCDHA8 - ENSP00000434655.1:p.Asp641Val 0.92 23.9 19844-31412
5 140307774 G A PCDHAC1 - ENSP00000253807.2:p.Val433Met 0.97 25.1 635-18522
5 140626348 A G PCDHB15 - ENSP00000231173.3:p.Tyr401Cys 0.99 23.6 21733-34496
5 140784365 C A PCDHGA9 - ENSP00000460274.1:p.Leu616Ile 0.92 25.5 20569-32326
5 140789421 T C PCDHGB6 - ENSP00000428603.1:p.Leu551Ser 0.96 24.5 21686-34166
5 146798117 C G DPYSL3 - ENSP00000343690.5:p.Gly183Ala 1.00 28.7 19275-30820
5 169484602 A C DOCK2 - ENSP00000256935.8:p.Thr1467Pro 1.00 32 20620-32415
6 7585566 G C DSP - ENSP00000369129.3:p.Ala2691Pro 1.00 27 1034-19683
6 13365207 C T GFOD1 - ENSP00000368589.3:p.Arg314His 0.98 34 21684-34154
6 24865638 A T FAM65B - ENSP00000259698.4:p.Leu152His 0.98 28.4 21739-34571
6 26093359 C T HFE - ENSP00000417404.1:p.Ser302Phe 0.95 23.3 21686-34166
6 27101002 A G HIST1H2AG - ENSP00000352119.2:p.Tyr51Cys 0.99 26.7 1876-23644
6 31122377 C G CCHCR1 - ENSP00000379566.3:p.Glu233Gln 1.00 26.5 21733-34496
6 42975698 A C PPP2R5D 85 ENSP00000417963.1:p.Asp251Ala 1.00 27.7 15210-25677
6 43030656 T C KLC4 - ENSP00000259708.3:p.Val105Ala 0.97 25.8 21740-34579
6 43420832 A G DLK2 17 ENSP00000349893.3:p.Val61Ala 0.92 25.8 21854-36022
6 43550090 G A POLH - ENSP00000361310.4:p.Val12Met 0.95 28.3 533-15609
6 45459744 G C RUNX2 - ENSP00000360493.1:p.Arg251Pro 1.00 33 1987-24091
6 75813511 T C COL12A1 24 ENSP00000325146.8:p.Lys2761Glu 1.00 26.2 21739-34571
6 82461558 C T FAM46A - ENSP00000318298.6:p.Val101Met 0.97 24.4 993-20365
6 101100623 T C ASCC3 - ENSP00000358159.2:p.Tyr989Cys 1.00 28.5 21762-34808
6 137338210 G A IL20RA - ENSP00000314976.5:p.Pro40Leu 1.00 32 21762-34808
7 2472938 C T CHST12 - ENSP00000258711.6:p.Arg222Cys 0.99 34 1986-24087
7 7612455 A G MIOS - ENSP00000339881.4:p.Asn117Asp 1.00 25.3 1454-21674
7 8791076 G C NXPH1 - ENSP00000384551.1:p.Val165Leu 0.97 26 21686-34166
7 29606003 A T PRR15 - ENSP00000317836.2:p.Ser20Cys 0.94 22.8 1694-22917
7 44619082 A G TMED4 - ENSP00000404042.2:p.Val227Ala 0.90 26.8 21740-34579
7 47440445 T C TNS3 15 ENSP00000381854.1:p.Thr264Ala 0.98 26 20984-32938
7 66548433 G A TYW1 - ENSP00000352645.5:p.Val431Met 0.95 32 993-20365
7 75630263 C T STYXL1 - ENSP00000248600.1:p.Arg252His 1.00 31 1543-22102
7 142724119 T C OR9A2 - ENSP00000316518.3:p.Tyr34Cys 0.97 19.92 21731-34484
7 143555936 G A FAM115A 38 ENSP00000419235.1:p.Ala829Val 1.00 28.7 15210-25677
7 150068607 C T REPIN1 - ENSP00000417291.2:p.Arg150Cys 1.00 26.3 15210-25677
7 150731592 G A ABCB8 - ENSP00000351717.4:p.Gly189Arg 0.98 27.2 21799-35302
8 12047388 G A FAM86B1 - ENSP00000407067.2:p.His57Tyr 0.97 23.3 15492-26109
8 20073963 C T ATP6V1B2 41 ENSP00000276390.2:p.Thr373Ile 0.99 34 21758-34760
8 22862891 G A RHOBTB2 42 ENSP00000427926.1:p.Glu89Lys 0.96 33 21189-33353
8 27294661 C T PTK2B - ENSP00000380638.1:p.Ala455Val 0.99 26.1 21870-36159
8 27987135 T C ELP3 - ENSP00000256398.8:p.Ile245Thr 0.97 26.9 21740-34579
8 28651439 G A INTS9 - ENSP00000429065.1:p.Leu308Phe 1.00 28.9 21705-34281
8 41906027 G A KAT6A 9 ENSP00000380136.3:p.His157Tyr 0.98 25.2 19283-30834
8 68965366 T G PREX2 18 ENSP00000288368.4:p.Asn326Lys 0.99 29.4 1876-23644
8 73993374 G A SBSPON - ENSP00000297354.6:p.Arg97Trp 1.00 35 1769-23158
8 74334918 T G STAU2 - ENSP00000428756.1:p.Lys517Thr 0.99 32 20572-32334
8 86392971 C T CA2 - ENSP00000285379.4:p.Pro246Ser 0.95 32 1925-23858
8 98735185 T C MTDH - ENSP00000338235.3:p.Ser534Pro 1.00 25.6 1883-23662
8 110587469 A G SYBU - ENSP00000407118.1:p.Leu553Pro 0.99 23.6 1886-23674
8 131181311 A C ASAP1 47 ENSP00000350297.1:p.Phe250Cys 0.97 26.7 1885-23669
8 143996132 A T CYP11B2 - ENSP00000325822.2:p.Ile263Asn 1.00 26 1855-23570
8 145541102 T C DGAT1 - ENSP00000332258.4:p.Asn330Asp 1.00 26.4 1855-23570
9 32427367 C T ACO1 - ENSP00000309477.5:p.Pro473Ser 1.00 33 1565-22223
9 32542941 C T TOPORS - ENSP00000353735.2:p.Asp528Asn 0.91 24.2 902-19869
9 33465871 C T NOL6 - ENSP00000297990.4:p.Val797Met 0.93 26 21856-36031
9 33470033 C T NOL6 - ENSP00000297990.4:p.Asp179Asn 1.00 33 19993-31610
9 35376055 G T UNC13B - ENSP00000367756.3:p.Ala467Ser 0.91 28.5 466-14420
9 37440721 C T ZBTB5 - ENSP00000307604.4:p.Ala610Thr 0.99 28.3 21974-35254
9 71628474 C T PRKACG - ENSP00000366488.2:p.Gly179Ser 1.00 25 21773-34925
9 71840940 C G TJP2 - ENSP00000438262.1:p.Ile384Met 1.00 25 1291-21555
9 74975562 C G ZFAND5 8 ENSP00000237937.3:p.Gly45Arg 0.98 29.5 1946-23968
9 107583669 C T ABCA1 25 ENSP00000363868.3:p.Val983Met 1.00 34 21974-35254
9 115651913 A T SLC46A2 - ENSP00000363345.4:p.Met350Lys 1.00 26.1 1550-22157
9 136573477 A G SARDH - ENSP00000360938.4:p.Tyr468His 1.00 27.2 21686-34166
9 136577759 A G SARDH - ENSP00000360938.4:p.Met437Thr 0.99 27.1 21856-36031
9 136595252 G A SARDH - ENSP00000360938.4:p.Arg250Trp 0.90 34 20705-32533
9 136596525 C G SARDH - ENSP00000360938.4:p.Asp198His 1.00 32 1034-19683
9 138710883 T C CAMSAP1 24 ENSP00000374183.4:p.Glu1311Gly 0.92 25.1 635-18522
9 139092555 A C LHX3 29 ENSP00000360811.3:p.Phe47Val 1.00 23.5 1883-23662
10 332272 G A DIP2C - ENSP00000280886.6:p.Arg1354Trp 0.99 35 871-19574
10 3158983 T A PFKP - ENSP00000370517.4:p.Trp463Arg 0.98 33 15492-26109
10 11312712 G T CELF2 74 ENSP00000389951.1:p.Gln234His 0.98 23.6 532-14596
10 14880414 G A HSPA14 - ENSP00000367623.3:p.Gly5Arg 1.00 34 21236-33440
10 26789831 A G APBB1IP - ENSP00000365411.4:p.Thr82Ala 0.99 24.9 21856-36031
10 50533772 C T C10orf71 - ENSP00000363259.3:p.Ser1061Phe 1.00 24 1543-22102
10 50820093 C G SLC18A3 - ENSP00000363229.3:p.Pro436Arg 1.00 27.3 1769-23158
10 63759873 G A ARID5B 22 ENSP00000279873.7:p.Val176Met 0.97 32 1454-21674
10 71124620 A C HK1 - ENSP00000384774.2:p.Thr157Pro 1.00 27.7 613-14848
10 87898778 A T GRID1 31 ENSP00000330148.7:p.Ile175Asn 0.93 29.5 1550-22157
10 94291566 T A IDE - ENSP00000265986.6:p.Arg200Ser 0.97 26.2 1034-19683
10 95993906 C G PLCE1 5 ENSP00000360431.2:p.Ser684Cys 1.00 23.4 1999-24135
10 99190776 T G PGAM1 27 ENSP00000359991.4:p.Ile160Ser 0.92 32 993-20365
10 101294855 G A NKX2-3 - ENSP00000342828.7:p.Ala158Thr 0.95 34 21505-33825
10 102743737 G A SEMA4G - ENSP00000210633.3:p.Arg794Gln 1.00 25.6 15164-25527
10 102782137 T A PDZD7 - ENSP00000359234.3:p.Asp183Val 1.00 28.9 1923-23847
10 111624967 G A XPNPEP1 - ENSP00000421566.1:p.Thr659Met 1.00 33 20572-32334
10 115962912 C T TDRD1 - ENSP00000251864.2:p.Leu260Phe 0.99 29.3 19273-30812,21974-35254
10 117059702 A T ATRNL1 - ENSP00000347152.3:p.Leu858Phe 0.95 23.5 19273-30812
10 126686685 G A CTBP2 47 ENSP00000311825.6:p.Thr678Ile 0.96 29.9 993-20365
10 129868590 C G PTPRE 36 ENSP00000254667.3:p.Ala390Gly 1.00 32 1454-21674
11 488272 A G PTDSS2 - ENSP00000308258.5:p.Glu232Gly 1.00 33 1986-24087
11 1491580 T C MOB2 - ENSP00000328694.6:p.Tyr210Cys 0.98 23.8 21740-34579
11 3038491 G C CARS - ENSP00000369897.4:p.Leu588Val 0.92 23.7 21740-34579
11 6240226 C T FAM160A2 - ENSP00000265978.4:p.Asp416Asn 1.00 28.8 1256-20693
11 6431908 A C APBB1 6 ENSP00000299402.6:p.Trp224Gly 0.96 25.6 1987-24091
11 6703590 C T MRPL17 - ENSP00000288937.6:p.Arg96Gln 0.99 28.4 21183-33343
11 10521674 C G AMPD3 - ENSP00000379802.3:p.Ser542Arg 0.98 24.5 21405-33683
11 13387084 G A ARNTL - ENSP00000374357.4:p.Arg166Gln 0.97 34 20618-32410
11 45671885 A T CHST1 - ENSP00000309270.2:p.Cys197Ser 1.00 23.6 21870-36159
11 58962715 T C DTX4 - ENSP00000227451.3:p.Ile470Thr 1.00 27.1 20014-31660
11 60889293 G A CD5 - ENSP00000342681.3:p.Gly339Glu 1.00 21.2 21597-33936
11 61511762 C T DAGLA 17 ENSP00000257215.5:p.Ala977Val 0.99 31 15497-26119
11 61646028 C T FADS3 39 ENSP00000278829.2:p.Val235Met 1.00 28.4 1548-22146
11 64010761 G A FKBP2 - ENSP00000378046.3:p.Gly88Ser 1.00 33 21731-34484
11 64882515 T C TM7SF2 - ENSP00000279263.7:p.Leu285Pro 0.99 24.4 532-14596
11 65303482 C A SCYL1 - ENSP00000270176.5:p.Pro482Gln 0.99 32 19273-30812
11 66039719 A C RAB1B 33 ENSP00000310226.6:p.Gln60Pro 1.00 26.5 1974-24044
11 66239885 C T PELI3 - ENSP00000322532.7:p.Arg134Trp 1.00 35 2047-24322
11 66259045 C A DPP3 - ENSP00000353701.2:p.His293Gln 1.00 27.1 1548-22146
11 67270038 A G PITPNM1 24 ENSP00000348772.3:p.Leu77Pro 1.00 26.3 1565-22223
11 73022270 G A ARHGEF17 - ENSP00000263674.3:p.Glu863Lys 0.99 34 1876-23644
11 94758805 C A KDM4E - ENSP00000397239.2:p.Phe28Leu 1.00 25.7 21854-36022
11 103908449 G C DDI1 - ENSP00000302805.3:p.Gly300Ala 1.00 23.1 20618-32410
11 114270897 G A C11orf71 - ENSP00000325508.4:p.Pro53Ser 0.92 23.7 21705-34281
11 117152656 C G RNF214 16 ENSP00000431643.1:p.Pro461Arg 0.99 26 20014-31660
11 120328821 G T ARHGEF12 29 ENSP00000380942.2:p.Ser753Ile 0.99 25.8 19275-30820
11 120690540 C A GRIK4 - ENSP00000435648.1:p.Pro141His 1.00 32 2016-24201
11 123894102 C A OR10G9 - ENSP00000364164.1:p.Pro128Gln 0.97 29.5 1256-20693
11 124644677 G A MSANTD2 - ENSP00000239614.4:p.Ser183Leu 1.00 29.7 993-20365
11 126391224 C T KIRREL3 - ENSP00000435466.2:p.Arg140His 0.94 34 48-14470
12 1299146 G A ERC1 - ENSP00000380386.2:p.Arg760Gln 1.00 27 21879-36229
12 6833962 T C COPS7A - ENSP00000438115.1:p.Leu47Pro 0.99 29.4 1792-23269
12 7050172 G A ATN1 75 ENSP00000349076.3:p.Arg1115His 1.00 34 1886-23674
12 7177312 A G C1S - ENSP00000385035.1:p.His475Arg 1.00 23.4 187-13717
12 7310285 G A CLSTN3 - ENSP00000266546.6:p.Glu910Lys 0.98 34 1543-22102
12 8192617 G C FOXJ2 - ENSP00000162391.3:p.Gln63His 0.97 24.2 19890-31485
12 32772646 G C FGD4 - ENSP00000394487.2:p.Leu451Phe 1.00 26.1 1986-24087
12 39079325 G A CPNE8 - ENSP00000329748.5:p.Thr413Ile 1.00 31 20572-32334
12 48578082 G T C12orf68 - ENSP00000320849.3:p.Arg59Ser 0.99 22.9 20010-31644
12 49162754 C T ADCY6 16 ENSP00000311405.4:p.Arg1116His 1.00 34 1548-22146
12 49435746 A G KMT2D 21 ENSP00000301067.7:p.Met2046Thr 0.92 22.8 19303-30889
12 49691167 G A PRPH - ENSP00000257860.4:p.Glu342Lys 0.94 34 1550-22157
12 52695761 C T KRT86 19 ENSP00000293525.5:p.Arg21Trp 0.99 25.6 1291-21555
12 52910979 T C KRT5 - ENSP00000252242.4:p.Asp377Gly 0.99 29.9 21739-34571
12 56481866 A G ERBB3 - ENSP00000267101.3:p.Tyr265Cys 1.00 25.8 19993-31610
12 56868500 C T GLS2 - ENSP00000310447.4:p.Cys351Tyr 0.99 32 21189-33353
12 64819614 G C XPOT - ENSP00000327821.5:p.Arg531Thr 0.98 25.9 21405-33683
12 77252772 C T CSRP2 - ENSP00000310901.5:p.Gly181Asp 0.99 34 20569-32326
12 99126322 A G APAF1 - ENSP00000448165.1:p.Tyr1242Cys 0.96 27 1710-22973
12 102141004 G A GNPTAB - ENSP00000299314.7:p.Arg1237Trp 0.99 34 19993-31610
12 104337566 G T HSP90B1 - ENSP00000299767.4:p.Leu647Phe 1.00 22.8 2007-24164,16260-27149,1588-22326,1457-21708
12 122497017 C A BCL7A - ENSP00000445868.1:p.Pro215His 1.00 28.5 993-20365
12 133235933 T A POLE 32 ENSP00000322570.5:p.Ser1075Cys 0.99 29.2 1694-22917
13 26621198 A C SHISA2 - ENSP00000313079.2:p.Ile114Ser 1.00 28.1 1710-22973
13 28498834 G A PDX1 - ENSP00000370421.4:p.Arg283Gln 0.95 29.2 1937-23920
13 38320300 C T TRPC4 34 ENSP00000369003.3:p.Ser224Asn 0.97 28.2 20569-32326
13 45694851 G T GTF2F2 - ENSP00000340823.6:p.Val21Phe 0.92 32 1543-22102
13 50050750 G C SETDB2 - ENSP00000326477.8:p.Lys160Asn 0.99 23.4 20618-32410
13 79932464 T C RBM26 - ENSP00000267229.7:p.Asn545Ser 1.00 25.5 1550-22157
13 79933810 T A RBM26 - ENSP00000267229.7:p.Asn477Tyr 0.96 31 871-19574
13 99483615 C T DOCK9 - ENSP00000365643.1:p.Arg1518Gln 1.00 35 1569-22238
14 21957480 G C TOX4 - ENSP00000385102.1:p.Gly243Ala 0.97 24.8 1999-24135
14 23345424 G C LRP10 - ENSP00000352601.4:p.Asp423His 1.00 26.7 1548-22146
14 24550507 G A NRL 27 ENSP00000454062.1:p.Arg218Cys 1.00 35 19993-31610
14 24727514 G A TGM1 - ENSP00000206765.6:p.Ala460Val 0.98 28.7 1543-22102
14 24773721 G A NOP9 - ENSP00000267425.3:p.Cys557Tyr 1.00 29.8 1569-22238
14 34266679 G T NPAS3 - ENSP00000348460.4:p.Asp440Tyr 0.91 32 15497-26119
14 36142228 C T RALGAPA1 - ENSP00000302647.6:p.Ala1134Thr 1.00 34 20984-32938
14 66188555 C T FUT8 - ENSP00000353910.5:p.Arg300Cys 1.00 35 21304-33563
14 74058719 T A ACOT4 - ENSP00000323071.4:p.Val19Glu 0.99 32 19275-30820
14 75555253 T C NEK9 10 ENSP00000238616.5:p.Tyr845Cys 0.98 26.2 1937-23920
14 76906027 T G ESRRB - ENSP00000370270.2:p.Ser111Ala 0.99 23.4 21773-34925
14 94088144 G C UNC79 14 ENSP00000256339.4:p.Cys1345Ser 1.00 23.1 871-19574
14 96706865 T A BDKRB2 - ENSP00000307713.3:p.Leu67Gln 0.91 23.9 993-20365
14 100813098 C G WARS - ENSP00000347495.2:p.Asp271His 0.99 26.2 1876-23644
14 104205230 A G PPP1R13B - ENSP00000202556.9:p.Ile908Thr 1.00 28.2 1711-22976
14 105848902 C G PACS2 - ENSP00000399732.2:p.Ser501Trp 1.00 27.7 1454-21674
15 30385632 A C GOLGA8J - ENSP00000456401.1:p.His610Pro 1.00 18.74 928-19074
15 40462740 A G BUB1B - ENSP00000287598.6:p.Tyr81Cys 1.00 26.4 1710-22973
15 45467570 G A SHF - ENSP00000290894.8:p.Pro167Ser 0.99 28.3 1457-21708
15 51783791 C T DMXL2 - ENSP00000441858.2:p.Arg1646Gln 1.00 27.5 1565-22223
15 51783795 G T DMXL2 - ENSP00000441858.2:p.Leu1645Ile 1.00 23.7 19275-30820
15 51795186 G C DMXL2 - ENSP00000441858.2:p.Pro937Ala 1.00 25 21731-34484
15 63972918 A G HERC1 33 ENSP00000390158.2:p.Trp2095Arg 1.00 27.1 1988-24094
15 64067780 G T HERC1 40 ENSP00000390158.2:p.His15Asn 0.98 25.4 1769-23158
15 75308993 G A SCAMP5 - ENSP00000355387.6:p.Gly66Arg 1.00 34 2007-24164
15 75942810 T C SNX33 - ENSP00000311427.5:p.Met456Thr 0.97 24.5 19283-30834
15 76018456 G C ODF3L1 - ENSP00000329584.2:p.Arg96Pro 1.00 32 20618-32410
15 83218354 G C CPEB1 - ENSP00000457881.1:p.Gln419Glu 0.95 24.2 1924-23853
15 89415309 A G ACAN - ENSP00000387356.2:p.His2394Arg 0.99 25.5 2007-24164
15 89416136 C A ACAN - ENSP00000387356.2:p.Gln2405Lys 1.00 32 21856-36031
15 92988137 C T ST8SIA2 - ENSP00000268164.3:p.Arg274Cys 0.99 31 18436-29827
16 596972 A T CAPN15 - ENSP00000219611.2:p.Glu45Val 1.00 25.1 19844-31412
16 735271 C G WDR24 12 ENSP00000293883.4:p.Asp669His 1.00 29.7 19275-30820
16 1498451 G A CLCN7 - ENSP00000372193.4:p.Arg640Trp 0.95 26.4 1694-22917
16 2258614 G T MLST8 - ENSP00000456405.1:p.Ala288Ser 1.00 24.6 21304-33563
16 2834709 C T PRSS33 - ENSP00000293851.5:p.Arg260His 0.99 34 21974-35254
16 4242940 C G SRL - ENSP00000382518.3:p.Gln212His 0.99 26.5 15164-25527
16 28948349 G C CD19 - ENSP00000437940.1:p.Ala364Pro 1.00 27.6 21870-36159
16 31235538 A C TRIM72 - ENSP00000312675.3:p.His299Pro 1.00 25.5 1999-24135
16 52480042 T A TOX3 - ENSP00000219746.9:p.Lys257Met 1.00 28.5 19991-31601
16 56536696 C T BBS2 - ENSP00000245157.5:p.Gly277Arg 1.00 33 1399-21439
16 68056562 A G DDX28 - ENSP00000332340.5:p.Tyr182His 1.00 25.3 2047-24322
16 68288877 C T PLA2G15 - ENSP00000219345.5:p.Arg114Cys 0.95 35 20984-32938
16 81726765 C G CMIP - ENSP00000446100.2:p.Ala486Gly 0.99 33 1711-22976
17 1639493 G A WDR81 - ENSP00000386609.1:p.Gly1829Glu 1.00 27.3 20014-31660
17 5064857 C G USP6 - ENSP00000460380.1:p.Leu955Val 0.91 23.8 21507-33839
17 5347635 G A DHX33 - ENSP00000225296.3:p.Leu672Phe 0.93 26.4 21870-36159
17 7132729 G C DVL2 - ENSP00000005340.4:p.Ser262Cys 1.00 23 1987-24091
17 7189147 C T SLC2A4 - ENSP00000320935.8:p.Arg416Cys 1.00 35 20618-32410
17 7399382 A C POLR2A - ENSP00000314949.6:p.Gln72His 0.97 17.46 2047-24322
17 7759021 G T TMEM88 16 ENSP00000301599.6:p.Val157Phe 0.90 24.4 21742-34594
17 7852428 C A CNTROB - ENSP00000369614.3:p.Pro869Thr 0.92 29.5 21742-34594
17 8218837 C T ARHGEF15 23 ENSP00000355026.3:p.Arg456Cys 1.00 34 20574-32342
17 8383603 G A MYH10 16 ENSP00000353590.3:p.Arg1808Cys 0.90 35 21686-34166
17 16526629 C G ZNF624 - ENSP00000310472.7:p.Gly524Ala 1.00 24.9 2076-24438
17 26816245 C T SLC13A2 - ENSP00000392411.3:p.Ala39Val 0.91 24.9 1974-24044
17 26875096 C A UNC119 - ENSP00000337040.3:p.Asp120Tyr 0.99 32 20574-32342
17 29855737 G C RAB11FIP4 9 ENSP00000312837.8:p.Glu532Gln 0.98 31 1999-24135
17 29855756 G A RAB11FIP4 9 ENSP00000312837.8:p.Arg538His 1.00 34 20620-32415
17 35454050 G A ACACA 253 ENSP00000344789.5:p.Arg2258Cys 1.00 35 1569-22238
17 36916790 G T PSMB3 - ENSP00000225426.4:p.Asp135Tyr 0.91 32 15554-26282
17 40695973 G A NAGLU - ENSP00000225927.1:p.Gly650Glu 1.00 24.6 2076-24438
17 42168805 A C HDAC5 5 ENSP00000225983.5:p.Leu408Arg 1.00 28.1 928-19074
17 45914179 T C LRRC46 - ENSP00000269025.4:p.Leu220Pro 0.96 24.8 21762-34808
17 46804277 C T HOXB13 - ENSP00000290295.7:p.Asp244Asn 1.00 34 1925-23858
17 46928923 T G CALCOCO2 - ENSP00000398523.2:p.Leu236Arg 0.97 28.8 15554-26282
17 48185592 T G PDK2 - ENSP00000420927.1:p.Phe253Cys 1.00 29.7 19275-30820
17 56400055 C T BZRAP1 - ENSP00000345824.4:p.Gly426Asp 0.99 25 48-14470
17 62230425 T G TEX2 - ENSP00000258991.3:p.Glu1014Ala 0.91 29.8 2047-24322
17 65337128 G T PSMD12 - ENSP00000348442.3:p.Thr401Asn 0.97 26.5 15210-25677
17 71434256 C T SDK2 - ENSP00000376421.3:p.Gly255Arg 1.00 27.6 15497-26119
17 72846072 C T GRIN2C - ENSP00000293190.5:p.Val498Met 0.99 29.8 19275-30820
17 73016680 G C ICT1 - ENSP00000301585.5:p.Arg155Pro 0.99 34 21879-36229
17 74079808 T C EXOC7 - ENSP00000334100.6:p.Lys710Arg 1.00 25.9 21879-36229
17 74266441 C T UBALD2 - ENSP00000331298.6:p.Pro117Leu 1.00 32 2063-24398
17 78332220 G C RNF213 31 ENSP00000464087.1:p.Trp3665Cys 1.00 34 1543-22102
17 79478592 G A ACTG1 - ENSP00000458162.1:p.Leu142Phe 0.99 23.7 2007-24164,1457-21708
17 79803767 T C P4HB - ENSP00000327801.4:p.Lys386Arg 0.98 24.1 20620-32415
17 80045614 C A FASN - ENSP00000304592.2:p.Arg997Leu 0.92 28.2 21758-34760
18 2922055 C T LPIN2 - ENSP00000261596.4:p.Ala773Thr 1.00 35 15492-26109
18 30254645 G C KLHL14 - ENSP00000352314.4:p.Pro621Arg 0.99 32 1876-23644
18 40554052 T C RIT2 - ENSP00000321805.4:p.Asp74Gly 0.99 28.2 928-19074
18 48439216 T G ME2 - ENSP00000321070.5:p.Phe96Leu 1.00 25.6 21739-34571
18 59212318 A G CDH20 - ENSP00000262717.3:p.Tyr530Cys 0.91 24.3 20010-31644
19 3980939 C G EEF2 - ENSP00000307940.5:p.Leu350Phe 0.92 22.8 2007-24164,16260-27149,1588-22326,1457-21708
19 6364529 C T CLPP - ENSP00000245816.3:p.Thr145Ile 1.00 31 1886-23674
19 6586042 C T CD70 - ENSP00000245903.2:p.Val191Met 0.96 24.5 21304-33563
19 7677226 C T CAMSAP3 - ENSP00000416797.1:p.Ala643Val 1.00 26.9 21762-34808
19 7965775 C T LRRC8E - ENSP00000306524.5:p.Arg790Trp 0.99 35 1548-22146
19 10572332 C G PDE4A 33 ENSP00000270474.5:p.Ser532Arg 0.96 24.2 21183-33343
19 11832922 C T ZNF823 - ENSP00000340683.5:p.Cys476Tyr 1.00 25.6 19275-30820
19 11833648 G A ZNF823 - ENSP00000340683.5:p.Ser234Phe 1.00 24.3 1510-21940
19 12575960 T C ZNF709 - ENSP00000380840.3:p.Lys259Arg 0.99 25.6 1711-22976
19 13876932 G A MRI1 - ENSP00000040663.5:p.Gly179Asp 1.00 29.6 20620-32415
19 15508354 T C AKAP8L - ENSP00000380557.3:p.Tyr461Cys 1.00 24.1 21405-33683
19 15582891 G A PGLYRP2 18 ENSP00000345968.4:p.Arg385Cys 1.00 29 15542-26229
19 15661528 G A CYP4F22 - ENSP00000269703.1:p.Arg460His 1.00 34 16262-27157
19 15918660 A G OR10H1 - ENSP00000335596.2:p.Leu63Pro 0.99 25.5 1886-23674
19 17670211 C T COLGALT1 - ENSP00000252599.3:p.Arg118Trp 0.99 29.6 19283-30834
19 19049227 A G HOMER3 - ENSP00000439937.1:p.Trp80Arg 1.00 25.2 21929-37306
19 19329904 C G NCAN 5 ENSP00000252575.4:p.Ser85Trp 1.00 27.3 1946-23968
19 34991071 G T WTIP - ENSP00000466953.2:p.Arg397Leu 0.97 28.8 928-19074
19 36045968 G C ATP4A 62 ENSP00000262623.2:p.Ile779Met 0.99 24.5 21731-34484
19 41094629 T C SHKBP1 21 ENSP00000291842.4:p.Ile479Thr 0.94 26.8 768-18527
19 41928216 G A BCKDHA 30 ENSP00000269980.2:p.Arg265Gln 1.00 35 1987-24091
19 42408413 G C ARHGEF1 91 ENSP00000337261.3:p.Arg695Pro 0.95 32 20010-31644
19 45821233 G T CKM - ENSP00000221476.2:p.His66Gln 0.92 25.8 21267-33506
19 47597334 G A ZC3H4 16 ENSP00000253048.4:p.His129Tyr 0.91 28.7 1937-23920
19 49110389 G A SPACA4 - ENSP00000312774.1:p.Gly52Arg 1.00 25.5 20618-32410,928-19074
19 49143086 G C CA11 - ENSP00000084798.3:p.Arg176Gly 0.99 22.9 1792-23269
19 49225129 G A RASIP1 30 ENSP00000222145.3:p.Arg892Trp 0.95 33 928-19074
19 49399764 A T TULP2 18 ENSP00000221399.3:p.Met45Lys 1.00 28 15497-26119
19 49518346 C T RUVBL2 - ENSP00000473172.1:p.Thr397Met 1.00 34 21507-33839
19 49564683 C T NTF4 - ENSP00000301411.3:p.Arg191Gln 1.00 32 21852-36011
19 49862754 G A TEAD2 - ENSP00000472109.1:p.Arg79Trp 1.00 35 19273-30812
19 50750342 G A MYH14 26 ENSP00000470298.1:p.Gly431Asp 0.98 29.5 1457-21708
19 50753075 C T MYH14 26 ENSP00000470298.1:p.Arg551Trp 0.91 34 21740-34579
19 51135970 A G SYT3 - ENSP00000340914.3:p.Trp83Arg 1.00 25.7 1855-23570
19 52149981 C A SIGLEC14 50 ENSP00000354090.5:p.Trp11Leu 1.00 29 21969-37814
19 55785916 G A HSPBP1 - ENSP00000255631.4:p.Arg164Trp 0.99 33 15210-25677
19 57176173 G A ZNF835 - ENSP00000444747.1:p.His132Tyr 1.00 24.2 928-19074
19 57765119 G A ZNF805 - ENSP00000412999.1:p.Gly311Glu 1.00 26.1 1588-22326
19 58982794 C T ZNF324 - ENSP00000444812.1:p.Pro312Leu 1.00 28.9 1886-23674
20 2777036 G A CPXM1 - ENSP00000369979.2:p.Arg367Trp 1.00 31 1972-24038
20 5539460 C G GPCPD1 - ENSP00000368305.4:p.Arg513Pro 1.00 34 15210-25677
20 6750829 G A BMP2 - ENSP00000368104.3:p.Gly19Asp 0.99 26.8 1924-23853
20 18142821 C T CSRP2BP - ENSP00000392318.2:p.Pro347Leu 0.94 29.7 1291-21555
20 30585127 A T XKR7 - ENSP00000477059.1:p.Asp536Val 1.00 26.2 15164-25527
20 30898905 T C KIF3B - ENSP00000364864.3:p.Leu442Pro 1.00 26 19303-30889
20 31967335 C A CDK5RAP1 - ENSP00000217372.2:p.Val347Phe 0.99 32 1925-23858
20 32255701 T C ACTL10 - ENSP00000329647.4:p.Leu133Pro 0.98 23.4 1694-22917
20 34527030 G T PHF20 - ENSP00000363124.3:p.Lys904Asn 0.95 25.1 21405-33683
20 37144978 T C RALGAPB - ENSP00000262879.6:p.Met339Thr 0.99 25.6 1986-24087
20 44518912 C T NEURL2 - ENSP00000361596.4:p.Arg240His 0.96 34 20705-32533
20 44757635 G C CD40 - ENSP00000361359.3:p.Glu264Gln 1.00 26 20010-31644
20 47309240 T C PREX1 83 ENSP00000361009.3:p.Met336Val 1.00 25.2 1972-24038
20 56090799 T C CTCFL 15 ENSP00000415579.2:p.Lys384Arg 1.00 26.3 21852-36011
20 57289076 C T NPEPL1 - ENSP00000348395.6:p.Pro410Leu 0.99 34 1946-23968
20 57581530 G A CTSZ - ENSP00000217131.5:p.Arg52Trp 0.93 33 613-14848
20 60773793 G A MTG2 - ENSP00000359859.3:p.Gly191Asp 1.00 32 871-19574
20 61341152 T C NTSR1 - ENSP00000359532.3:p.Leu198Pro 0.95 24.3 19890-31485
20 61391449 C G NTSR1 - ENSP00000359532.3:p.Leu363Val 1.00 28.7 21705-34281
21 28304490 A G ADAMTS5 - ENSP00000284987.5:p.Phe628Leu 0.99 32 993-20365
21 31066215 C T GRIK1 - ENSP00000382791.1:p.Ala96Thr 1.00 32 2047-24322
21 43988482 A C SLC37A1 - ENSP00000344648.2:p.Ser453Arg 0.97 24.1 1442-22354
21 44589351 T C CRYAA - ENSP00000291554.2:p.Tyr48His 0.99 26.4 19844-31412
22 16449399 C T OR11H1 - ENSP00000252835.4:p.Asp136Asn 1.00 25.3 21969-37814
22 16449405 C T OR11H1 - ENSP00000252835.4:p.Ala134Thr 0.96 25.2 21929-37306,48-14470,1442-22354,21304-33563,1886-23674,1399-21439
22 18900831 C T PRODH - ENSP00000349577.6:p.Val554Met 0.96 34 21183-33343
22 18907082 A G PRODH - ENSP00000349577.6:p.Met378Thr 0.96 25.8 21856-36031
22 19121720 C A DGCR14 - ENSP00000252137.6:p.Asp474Tyr 1.00 32 1543-22102
22 22317152 A G TOP3B 22 ENSP00000381773.2:p.Phe440Leu 1.00 27.6 19303-30889
22 22990089 C T GGTLC2 28 ENSP00000419751.1:p.Arg179Trp 0.98 25 19993-31610
22 26107042 C T ADRBK2 - ENSP00000317578.4:p.Pro468Leu 1.00 33 21597-33936
22 26771553 C G SEZ6L 24 ENSP00000248933.6:p.Ala947Gly 0.95 27.3 15210-25677
22 31494760 C A SMTN 16 ENSP00000351593.1:p.Pro756His 1.00 34 21852-36011
22 36690980 G A MYH9 44 ENSP00000216181.5:p.Arg1210Trp 1.00 34 21799-35302
22 38470374 C T PICK1 - ENSP00000385205.3:p.Arg299Cys 1.00 35 15164-25527
22 40696487 C A TNRC6B 18 ENSP00000401946.2:p.Pro1246His 1.00 28.3 21773-34925
22 41547886 C G EP300 - ENSP00000263253.7:p.Ser956Cys 1.00 27.1 635-18522
22 44681504 A G KIAA1644 - ENSP00000370568.4:p.Trp135Arg 1.00 28 533-15609
22 45785740 T C SMC1B - ENSP00000350036.4:p.Lys528Arg 1.00 23.9 21879-36229,2031-24261
22 46628134 G A PPARA - ENSP00000379322.2:p.Gly386Glu 1.00 29.1 2063-24398
22 49042540 C T FAM19A5 - ENSP00000383933.1:p.Arg82Trp 0.99 34 21730-34469
22 50905989 G A SBF1 - ENSP00000370196.2:p.Ser137Leu 1.00 27.4 21597-33936
X 53111879 G C TSPYL2 - ENSP00000364591.4:p.Gly67Arg 1.00 25.2 1883-23662
X 92928230 G C NAP1L3 - ENSP00000362171.3:p.Ser25Trp 0.99 25.8 15164-25527
X 102632474 G C NGFRAP1 - ENSP00000361728.3:p.Glu19Gln 0.91 24.9 1457-21708

Table 4: All C1 and C2 PTVs identified by this analysis and not previously prioritized chr position ref alt gene consequence CADD Sample 1 2161156 C G SKI stop_gained 39 18436-29827 1 90493021 GA G ZNF326 frameshift - 2016-24201 1 109553675 AAT A WDR47 frameshift - 21686-34166 1 116941412 G GTCGTCTGATCTT ATP1A1 splice_donor - 20014-31660 1 160394947 C T VANGL2 stop_gained 43 1694-22917 1 168032998 G A DCAF6 splice_donor 26 2063-24398 1 231471976 T A EXOC8 stop_gained 37 1550-22157 1 236906345 T C ACTN2 splice_donor 22.2 20014-31660 1 237774119 C T RYR2 stop_gained 46 1946-23968 1 245026015 G A HNRNPU stop_gained 38 2063-24398 2 27445090 GT G CAD frameshift - 19991-31601 2 27632168 A C PPM1G splice_donor 24.5 1291-21555 2 63822654 G A MDH1 splice_donor 24.7 2063-24398 2 63832529 T G MDH1 splice_donor 26.1 20620-32415,2063- 24398,19844-31412 2 114390523 T C RABL2A splice_donor 22.9 21405-33683 2 217055058 G T XRCC5 splice_donor 23.2 2063-24398 2 217057460 T G XRCC5 splice_donor 24.7 2063-24398,19844- 31412 3 58398647 ACCTCTCACGT A PXK frameshift - 15210-25677 3 182683410 G GA DCUN1D1 frameshift - 2076-24438 3 183896911 G GATTCTAGACTT AP2M1 splice_donor - 2047-24322,19844- 31412 4 71665895 TAA T RUFY3 frameshift - 1924-23853 4 125592253 GC G ANKRD50 frameshift - 21854-36022 4 144469294 G GAGCTCCAGAGGAGA SMARCA5 splice_donor - 2063-24398 4 166389008 G C CPE splice_donor 23.4 20620-32415,20618- 32410 4 170601359 G A CLCN3 splice_donor 25.3 2063-24398 5 43292741 T C HMGCS1 splice_acceptor 24.6 20620-32415,20014- 31660,2063- 24398,19844-31412 5 65321381 G GT ERBB2IP splice_donor - 1999-24135 5 122139309 G GGA SNX2 splice_donor - 2063-24398 5 145847922 TAGTC T TCERG1 frameshift - 20574-32342 5 149212338 AC A PPARGC1B frameshift - 2076-24438 5 176887705 C A DBN1 splice_acceptor 26.3 19844-31412 5 179043961 T G HNRNPH1 splice_acceptor 24.7 2063-24398 6 43188203 TG T CUL9 frameshift - 21165-33301 6 114265575 C T HDAC2 splice_acceptor 26.6 2063-24398 6 149700653 A AGCTCTTTT 
TAB2 frameshift - 2063-24398 6 155154182 G T SCAF8 stop_gained 44 21970-37819

66

7 26236279 C A HNRNPA2B1 splice_acceptor 26.4 2063-24398,19844-31412
7 91936801 CCACA C ANKIB1 frameshift - 16260-27149,1694-22917
7 91936805 A ATGTG ANKIB1 frameshift - 16260-27149,1694-22917
7 127569384 T C SND1 splice_donor 25.5 2063-24398
8 53045827 CGT C ST18 frameshift - 19275-30820
8 146015348 C A RPL8 splice_acceptor 26.4 20620-32415,20014-31660,19844-31412
9 35616207 G A CD72 stop_gained 36 21705-34281
10 17271985 G A VIM splice_donor 24.2 19844-31412
10 33211664 C A ITGB1 splice_acceptor 27.2 20014-31660
10 70099313 G T HNRNPH3 splice_donor 26.8 2063-24398
10 104176196 AG A PSD frameshift - 1711-22976
11 2417201 G GAGGAC CD81 splice_donor - 20014-31660,20572-32334
11 68574933 A T CPT1A splice_donor 23.7 1988-24094
11 71707292 C A RNF121 stop_gained 38 20014-31660
12 57037373 T A ATP5B splice_acceptor 26.4 20574-32342,20010-31644,19844-31412
12 57927881 C T DCTN2 splice_acceptor 25.8 2063-24398
12 120783936 A C MSI1 splice_donor 24.5 2047-24322
13 66879143 G A PCDH9 stop_gained 46 20014-31660
13 100211790 T A TM9SF2 splice_donor 25.5 2063-24398
14 99959932 TGTAA T CCNK splice_donor - 1937-23920
14 102514366 G GA DYNC1H1 splice_donor - 2063-24398
17 17721231 C CTTT SREBF1 splice_acceptor - 2063-24398
17 40025379 C CTTCGAGG ACLY splice_acceptor - 20014-31660,2063-24398
17 40042563 T A ACLY splice_acceptor 26.7 20014-31660
17 79801967 T TCGTCATCATCC P4HB frameshift - 20014-31660
18 43671818 C CAG ATP5A1 splice_acceptor - 20620-32415
18 51053130 G C DCC splice_donor 18.09 1883-23662
19 42373286 T C RPS19 splice_donor 24.6 2063-24398
19 42735020 C CT GSK3A splice_acceptor - 19844-31412
19 51190838 C T SHANK1 splice_donor 24.5 20984-32938
19 51214683 C CT SHANK1 frameshift 34 1986-24087,15164-25527
19 56180159 G GCCAT U2AF2 splice_donor - 19844-31412
20 32880315 T C AHCY splice_acceptor 24.9 19844-31412
22 50719619 C CAG PLXNB2 splice_acceptor - 2063-24398
X 20211712 C CTCTTTGGATAAGCGTG RPS6KA3 splice_acceptor - 2063-24398
X 47070625 G A UBA1 splice_donor 25.7 20620-32415
X 47086869 G C CDK16 splice_donor 24.1 19844-31412
X 48460593 A AG WDR13 frameshift - 2063-24398,19844-31412

X 51640984 T C MAGED1 splice_donor 23 20014-31660
X 54835810 G GC MAGED2 splice_donor - 2063-24398
X 152981145 C CGA BCAP31 splice_acceptor - 20620-32415

Table 5: C1 and C2 SNV gene sets used for GO analysis

SIGLEC1 XABCA1 CCNK ERBB2IP KLHL26 PIP4K2B 4 WDR13 ABCB10 CCNL1 ERBB3 KMT2D PITPNM1 SKI WDR24 SLC12A ABCB8 CCR1 ERC1 KRT5 PITPNM2 4 WDR45 SLC12A ACACA CD19 ERI3 KRT86 PLA2G15 7 WDR47 SLC13A ACADL CD40 ESRRB LCT PLCD1 2 WDR48 LEPROT SLC18A ACAN CD5 EXOC7 L1 PLCE1 3 WDR81 ACLY CD70 EXOC8 LGR4 PLXNA1 SLC2A4 WNT4 SLC37A ACO1 CD72 FADS3 LHX3 PLXNA2 1 WTIP SLC45A ACOT4 CD81 FAIM3 LMNA PLXNB1 3 XKR4 CDC42B SLC46A ACSL6 PB FAM115A LPIN2 PLXNB2 2 XKR7 SLC4A1 ACTG1 CDH20 FAM134B LRP10 PODN 1 XKR8 FAM160A XPNPE ACTL10 CDK16 2 LRRC46 POLE SLIT2 P1 CDK5RA SMARC ACTN2 P1 FAM19A5 LRRC8E POLH A2 XPOT MAGED SMARC ACVR2B CELF2 FAM211A 1 POLR2A A5 XRCC5 ADAMTS MAGED 13 CHD8 FAM46A 2 POLR2B SMC1B XRCC6 ADAMTS MAPK8I ZBTB2 5 CHRNB4 FAM65B P3 POMC SMC3 0 ADCY6 CHST1 FAM86B1 MAST3 POTED SMTN ZBTB5 ZBTB7 ADD2 CHST12 FASN MBLAC2 POU2F1 SND1 C ADNP CIAO1 FBN1 MCM2 PPARA SNED1 ZC3H4 PPARGC ZFAND ADRBK2 CKM FBXL5 MDH1 1B SNX2 5 ZNF32 AGAP1 CLCN2 FBXO45 ME2 PPM1G SNX33 4 AGO1 CLCN3 FGD4 MED13L PPP1R13 SP7 ZNF32

B 6 ZNF62 AHCY CLCN7 FHL2 MED6 PPP2R5A SPACA4 4 METRN ZNF64 AKAP8L CLDN24 FIP1L1 L PPP2R5D SPEG 4 MICALL PRAMEF ZNF70 AKR1C2 CLPP FKBP2 1 13 SPIRE1 9 PRAMEF ZNF77 AMPD3 CLSTN3 FLNB MIOS 6 SPTBN1 2 ZNF80 AMT CMIP FN1 MITF PRDM16 SREBF1 5 ZNF82 ANK2 CNOT10 FOXJ2 MLST8 PREX1 SRGAP3 3 ZNF82 ANKIB1 CNTN2 FSIP2 MMP9 PREX2 SRL 7 ANKRD1 ZNF83 7 CNTROB FUT8 MOB2 PRKACG ST18 5 ANKRD4 COL12A ST6GAL 4 1 GABRA1 MOV10 PRODH 2 ANKRD5 COLGAL 0 T1 GAK MRI1 PRPF40A ST8SIA2 AP2M1 COPB2 GAL3ST2 MRPL17 PRPH STAB1 MSANT APAF1 COPS7A GDAP1 D2 PRR15 STAU2 APBB1 COPS7B GFOD1 MSI1 PRSS33 STK16 APBB1IP COQ10B GGTLC2 MTDH PSD STYXL1 APOA1B SUV420 P COX6B2 GIGYF1 MTG2 PSMB3 H1 APP CPE GLI4 MYH10 PSMC4 SYBU AQP3 CPEB1 GLS2 MYH14 PSMD12 SYNJ2 ARHGEF 1 CPNE8 GNPTAB MYH9 PTDSS2 SYT3 ARHGEF 12 CPS1 GOLGA8J MYO9B PTK2B TAB2 ARHGEF 15 CPT1A GPCPD1 MYT1 PTPLB TCERG1 ARHGEF 16 CPXM1 GPD2 NAGLU PTPN14 TDRD1 ARHGEF 17 CRYAA GPR156 NAP1L3 PTPN4 TDRD5 ARID2 CSRNP3 GPR17 NBEA PTPRE TEAD2 ARID5B CSRP2 GPR52 NCAN PTPRN TEX2 CSRP2B ARNTL P GPRC5B NEK9 PUM1 TFAP2E ARPC3 CTBP2 GRID1 NEMF PXDN TGFBRA

P1 ASAP1 CTCFL GRIK1 NEURL2 PXK TGM1 NEURO ASCC3 CTNNA2 GRIK3 D1 QRFPR TJP2 CTNNBI NGFRAP RAB11FI ASXL3 P1 GRIK4 1 P4 TLL1 TM4SF1 ATAD5 CTSZ GRIK5 NIPAL3 RAB1B 9 ATG4B CUL3 GRIN2C NIPBL RABL2A TM7SF2 RALGAP ATN1 CUL9 GSK3A NKX2-3 A1 TM9SF2 RALGAP ATOH1 CYP11B2 GSN NME6 B TMED4 ATP1A1 CYP4F22 GTF2F2 NOL6 RASIP1 TMEM88 ATP4A DAG1 HDAC2 NOP58 RBM26 TMPPE ATP5A1 DAGLA HDAC5 NOP9 REPIN1 TNFAIP8 RHOBTB ATP5B DBN1 HERC1 NPAS3 2 TNFSF11 ATP6V1 DCAF12 B2 L2 HERC2 NPEPL1 RICTOR TNIP2 ATP8B2 DCAF6 HFE NRL RIT2 TNKS HIST1H2 ATRNL1 DCC AG NRP1 RNF121 TNRC6B AVPR2 DCTN2 HIVEP3 NTF4 RNF213 TNRC6C DCUN1D BBS2 1 HK1 NTSR1 RNF214 TNS1 BCAP31 DDI1 HMGCS1 NXPH1 ROR1 TNS3 HNRNPA2 BCKDHA DDX28 B1 OCLN RPL8 TOP3B DENND4 BCL2L11 B HNRNPH1 ODF3L1 RPS19 TOPORS RPS6KA BCL7A DGAT1 HNRNPH3 OLFM2 3 TOX3 BDKRB2 DGCR14 HNRNPR OPA1 RRP1B TOX4 TRAM1L BEND5 DGKD HNRNPU OR10G9 RTF1 1 HNRNPU BEST4 DHX32 L2 OR10H1 RUFY3 TRIM67 BIRC6 DHX33 HOMER3 OR11H1 RUNX2 TRIM72 BMP2 DIMT1 HOXB13 OR2T33 RUSC1 TRIP12 BPTF DIP2C HSP90B1 OR2T8 RUVBL2 TRPC4 BRPF1 DLK2 HSPA14 OR4F5 RYR1 TSPYL2 BRWD1 DLX3 HSPBP1 OR9A2 RYR2 TTLL6 BUB1B DMAP1 ICT1 P4HA2 SALL1 TUBA3E

BZRAP1 DMBX1 IDE P4HB SARDH TULP2 C10orf71 DMXL2 IGF2BP2 PACS2 SBF1 TYW1 DNAJC1 C11orf71 3 IL13RA1 PAPD7 SBSPON U2AF2 DNMT3 C12orf68 A IL20RA PCDH9 SCAF8 UBA1 C1S DOCK2 INPP5B PCDHA1 SCAMP5 UBALD2 CA11 DOCK3 INSR PCDHA8 SCARB2 UBQLN4 PCDHAC CA2 DOCK9 INTS9 1 SCD5 UBR3 PCDHB1 CABP7 DOPEY1 IPO7 5 SCN2A UBXN4 CACNA1 PCDHG B DPP3 IRGQ A10 SCN5A UNC119 CACNA1 PCDHG D DPYSL3 IRX4 A9 SCN8A UNC13B CACNA2 DSCAM PCDHGB D3 L1 ITGB1 6 SCYL1 UNC5C CAD DSP KAT6A PCSK1 SDK2 UNC79 CALCOC O2 DTX4 KDM4E PDE4A SDPR UNC80 CAMSAP 1 DVL2 KHNYN PDK2 SEMA4G UPF1 CAMSAP 2 DVL3 KIAA1407 PDX1 SENP2 USP46 CAMSAP DYNC1H 3 1 KIAA1644 PDZD7 SERINC5 USP6 CAMTA1 EEF2 KIF11 PDZK1 SESN2 USP9X CAPN15 EIF2B4 KIF1A PELI3 SETDB2 VANGL2 CARS EIF2B5 KIF21B PFKP SEZ6L VHL CASZ1 EIF5B KIF3B PGAM1 SHANK1 VIM PGLYRP CCDC142 ELP3 KIRREL3 2 SHF VWA1 CCDC58 EP300 KLC4 PHF20 SHISA2 WARS CCHCR1 EPHA6 KLHL14 PICK1 SHKBP1 WARS2

Table 6: Key variant calling metrics for C1, C2, and controls collected with PICARD

SAMPLE TOTAL_SNPS NOVEL_SNPS TITV HET_HOM_RATIO INDELS INS_DEL
01C06309 3640807 96331 2.09 1.68 371120 0.87
01C06310 3509541 87086 2.10 1.69 351623 0.88
01C08034 3526745 105573 2.09 1.68 359604 0.87
01C08035 4147601 129251 2.09 2.19 420365 0.87
02C10017 3519896 89851 2.09 1.74 363553 0.87
02C10230 3677337 90862 2.09 1.62 374449 0.87
03C15857 3730130 102735 2.09 1.73 383058 0.87
03C15858 3676687 91483 2.09 1.63 371622 0.88
03C16009 3764244 104350 2.09 1.77 385450 0.87
03C16018 3699585 99994 2.09 1.66 378541 0.87
03C16241 3779456 124760 2.09 1.77 386555 0.87
03C16242 3746449 117477 2.09 1.69 386652 0.87
03C16383 3697906 106770 2.09 1.66 382312 0.87
03C16384 3699411 97918 2.09 1.72 378920 0.87
03C16687 3696691 98615 2.09 1.67 377988 0.87
03C16689 3725670 101746 2.09 1.73 379588 0.87
03C16811 3706130 93785 2.09 1.73 377778 0.87
03C16812 3690549 87965 2.09 1.64 378188 0.87
03C16899 3716328 88889 2.09 1.70 376820 0.87
03C16900 3708938 99896 2.09 1.65 382469 0.87
1034-19683 3699574 84770 2.09 1.61 373431 0.88
1256-20693 3712438 93799 2.09 1.39 375173 0.88
1291-21555 3680640 90158 2.09 1.64 367860 0.88
1399-21439 3466817 78775 2.09 1.60 347564 0.88
1442-22354 3760937 108162 2.09 1.72 379287 0.88
1444-21622 3539661 131371 2.09 1.66 350951 0.88
1454-21674 3610492 81812 2.09 1.63 360360 0.88
1457-21708 3536357 105046 2.09 1.66 353527 0.88
1510-21940 3605587 83479 2.09 1.59 358323 0.88
15164-25527 3658589 94832 2.09 1.73 351957 0.88
15210-25677 3745072 89968 2.09 1.81 378216 0.88
1543-22102 3518313 87372 2.08 1.47 355135 0.88
1548-22146 3707095 87738 2.09 1.61 371564 0.88
15492-26109 3669888 83213 2.09 1.60 367774 0.88
15497-26119 3735074 97112 2.09 1.41 378241 0.88
1550-22157 3706145 89422 2.09 1.66 371946 0.88
15542-26229 3767784 92971 2.09 1.72 377937 0.88
15554-26282 3725878 84211 2.09 1.70 374771 0.88
1565-22223 3709531 88872 2.09 1.61 370364 0.88
1569-22238 3680992 83494 2.10 1.62 367563 0.88

1588-22326 3717867 85972 2.09 1.69 374741 0.88
16260-27149 3700309 94310 2.09 1.64 373762 0.88
16262-27157 4264005 124371 2.09 2.30 429661 0.88
1694-22917 3831986 93816 2.09 1.82 385721 0.88
1710-22973 3883914 111778 2.09 1.85 388177 0.88
1711-22976 3695060 87606 2.09 1.64 374052 0.88
1769-23158 3708619 89068 2.09 1.62 377109 0.88
1792-23269 3665227 82794 2.10 1.70 366600 0.88
18436-29827 3638195 82103 2.09 1.59 368943 0.88
1855-23570 3699225 84443 2.09 1.63 370912 0.88
187-13717 3594621 83970 2.09 1.70 358957 0.88
1876-23644 4150783 137377 2.09 2.16 421319 0.88
1883-23662 3740220 88114 2.09 1.72 378421 0.88
1885-23669 3729331 84078 2.09 1.70 375092 0.88
1886-23674 3587981 81753 2.08 1.58 354571 0.88
1923-23847 3653919 85251 2.09 1.63 356187 0.88
1924-23853 3734800 82398 2.09 1.70 375107 0.88
1925-23858 3663466 82255 2.10 1.59 366638 0.88
19273-30812 3738553 91486 2.09 1.62 374094 0.88
19275-30820 3688396 80126 2.09 1.59 363550 0.88
19283-30834 3618428 85742 2.08 1.63 363220 0.88
19303-30889 3710827 85181 2.09 1.63 370410 0.88
1937-23920 3697275 84990 2.10 1.64 369143 0.88
1946-23968 3701628 86555 2.09 1.63 371311 0.88
1972-24038 3709069 85306 2.09 1.62 370514 0.88
1974-24044 3491437 81296 2.09 1.65 351524 0.88
19844-31412 3477523 91516 2.09 1.41 350258 0.88
19861-31432 3485996 87661 2.09 1.37 348058 0.89
1986-24087 4185435 114086 2.10 2.22 418979 0.88
1987-24091 3719049 87985 2.09 1.68 372010 0.88
1988-24094 3722415 90850 2.09 1.67 372321 0.88
19890-31485 3708880 91104 2.09 1.38 371691 0.88
1989-24099 3799646 93701 2.09 1.72 380263 0.88
19991-31601 3719148 84010 2.09 1.64 372325 0.88
1999-24135 3519350 88124 2.09 1.47 344252 0.88
19993-31610 3744445 86818 2.09 1.70 377032 0.88
20010-31644 3689478 86263 2.09 1.36 365690 0.88
20014-31660 3680238 84720 2.09 1.59 365223 0.88
2007-24164 3705142 92256 2.09 1.62 370746 0.88
2016-24201 3661710 82157 2.09 1.61 363653 0.88
2031-24261 3764341 87215 2.09 1.72 378539 0.88

2047-24322 3734456 100599 2.10 1.64 372414 0.88
20569-32326 3764600 130469 2.09 1.66 374133 0.88
20572-32334 3695083 85487 2.09 1.71 366771 0.88
20574-32342 3720045 88098 2.09 1.67 371414 0.88
20618-32410 3716129 85931 2.09 1.70 371429 0.88
20620-32415 3803126 135012 2.09 1.64 378534 0.88
2063-24398 3722232 90058 2.09 1.68 363147 0.88
20705-32533 3717794 84762 2.09 1.68 371903 0.88
2076-24438 3698620 85921 2.09 1.60 365105 0.88
20984-32938 3581367 88620 2.08 1.59 357022 0.88
21165-33301 3573757 82214 2.08 1.60 356331 0.88
21183-33343 3666537 91802 2.09 1.67 366333 0.88
21189-33353 3699338 88019 2.09 1.66 372704 0.88
21236-33440 3634792 87480 2.09 1.70 362628 0.88
21267-33506 3691965 86129 2.10 1.60 369418 0.88
21304-33563 3600650 83143 2.09 1.62 358885 0.88
21405-33683 3755820 109931 2.09 1.68 374606 0.88
21505-33825 3685450 87958 2.09 1.63 369377 0.88
21507-33839 3701823 88829 2.09 1.70 364379 0.88
21597-33936 3684146 87542 2.09 1.64 367603 0.88
21684-34154 3609825 88886 2.08 1.64 360325 0.88
21686-34166 3781557 119456 2.09 1.67 376613 0.88
21699-34238 3720970 86703 2.09 1.67 369404 0.88
21705-34281 3675616 77510 2.09 1.58 358596 0.88
21723-34401 3687331 86703 2.09 1.60 369152 0.88
21730-34469 3697385 87841 2.09 1.62 371325 0.88
21731-34484 3788854 112270 2.09 1.67 375194 0.88
21733-34496 3695803 91677 2.09 1.37 369913 0.88
21739-34571 3719965 85384 2.09 1.67 373958 0.88
21740-34579 3688377 86499 2.09 1.62 371903 0.88
21742-34594 3709623 85094 2.09 1.62 369367 0.88
21758-34760 3715652 84933 2.09 1.67 370439 0.88
21762-34808 3694735 86495 2.09 1.63 369626 0.88
21773-34925 3692276 84819 2.09 1.61 368128 0.88
21799-35302 3683997 116485 2.08 1.67 371274 0.88
21801-35317 3688483 82647 2.09 1.62 369191 0.88
21852-36011 3679898 85918 2.09 1.63 370591 0.88
21854-36022 3712571 86021 2.09 1.69 375126 0.88
21856-36031 3731789 87752 2.09 1.68 374861 0.88
21870-36159 3694497 107639 2.09 1.55 371282 0.88
21879-36229 3816536 108497 2.09 1.73 379012 0.88

21928-37301 3621282 83052 2.09 1.66 360627 0.88
21929-37306 3697356 87642 2.09 1.63 367788 0.88
21969-37814 3698801 86329 2.09 1.60 367864 0.88
21970-37819 3725535 86297 2.09 1.72 373884 0.88
21974-35254 3733288 94202 2.09 1.58 373595 0.88
22124-39156 3687256 85302 2.09 1.61 369455 0.88
466-14420 3645712 84469 2.09 1.63 364906 0.88
48-14470 3697341 86052 2.09 1.62 372601 0.87
532-14596 3685998 85223 2.09 1.59 371444 0.88
533-15609 3719610 87070 2.09 1.66 375241 0.88
613-14848 3704289 83866 2.09 1.62 373809 0.88
635-18522 3730175 86360 2.09 1.71 374818 0.88
768-18527 3682290 87137 2.09 1.59 370658 0.88
779-18300 3690991 82166 2.09 1.61 371774 0.88
871-19574 3668753 85575 2.09 1.59 371685 0.88
902-19869 3815225 89993 2.09 1.79 383988 0.88
928-19074 3810510 89391 2.09 1.81 385098 0.88
993-20365 4185988 173979 2.10 1.98 424015 0.88
