Supplementary Materials for

"Whole genome sequencing in multiplex families reveals novel inherited and de novo genetic risk in autism"

Elizabeth K. Ruzzo1,†, Laura Pérez-Cano1,†, Jae-Yoon Jung4,5, Lee-kai Wang1, Dorna Kashef-Haghighi4,5,6, Chris Hartl3, Jackson Hoekstra1, Olivia Leventhal1, Michael J. Gandal1, Kelley Paskov4,5, Nate Stockham4,5, Damon Polioudakis1, Jennifer K. Lowe1, Daniel H. Geschwind1,2,* & Dennis P. Wall4,5,*

* Correspondence to: [email protected] and [email protected]

This PDF file includes:

Materials and Methods
Supplementary Text
Figs. S1 to S13
Tables S2, S6, S8, S10, S11, and S13-S16
Captions for Tables S1, S3, S4, S5, S7, S9, and S12

Other Supplementary Materials for this manuscript include the following:

Tables S1, S3, S4, S5, S7, S9, and S12 [excel files]
Table S1: Cohort information.
Table S3: Phenotypic details for NR3C2 families.
Table S4: DLG2 structural variant details and haplotype prediction for the carrier families.
Table S5: ARC feature descriptions.
Table S7: TADA qualifying variants identified in iHART/AGRE samples.
Table S9: iHART TADA-mega analysis results for all 18,472 genes.
Table S12: NetSig significant genes.

Materials and Methods

Sample selection and pre-sequencing QC

Families including two or more individuals with ASD (those with a "derived affected status" of "autism", "broad-spectrum", "nqa", "asd", or "spectrum") were drawn from the Autism Genetic Resource Exchange (AGRE)(1). Patients with known genetic causes of ASD or syndromes with overlapping features (Table S2) were excluded from sequencing. In our first round of sequencing, we prioritized ASD families harboring affected female subjects. We also prioritized families containing monozygotic twins, in part to facilitate the development of our machine learning model (Artifact Removal by Classifier (ARC): a machine learning classifier for distinguishing true de novo variants from LCL and sequencing artifacts). A complete list of sequenced samples can be found in Table S1.

Purified DNA was obtained from the Rutgers University Cell and DNA Repository (RUCDR; Piscataway, NJ) and submitted to the New York Genome Center (NYGC) for whole-genome sequencing. Where available, DNA from whole blood was used; however, only lymphoblastoid cell line (LCL) DNA was available for many samples, as DNA was not extracted from whole blood at the time of recruitment. The use of LCL derived DNA presents an additional challenge for the detection of de novo variants. Given that cells in culture accumulate mutations, these mutations appear as candidate de novo variants because they are not present in the parental DNA.

Upon receipt at NYGC, DNA samples were examined for quality/quantity, and subsequently genotyped using Illumina Infinium Human Exome-12 v1.2 or Infinium Human Core Exome microarrays (San Diego, CA) according to standard manufacturer protocols. Identity-by-descent estimation and sex checks in PLINK v1.07(2) were used to validate expected vs. observed family relationships and confirm sample identity based on these genome-wide genotyping data. Contamination was assessed using verifyIDintensity (VII)(3); samples exceeding 3% contamination in two or more modes were excluded from sequencing.

Sequencing and NGS-processing pipeline

Samples passing array-based identity and quality checks (pre-sequencing QC) were sequenced at NYGC using the Illumina TruSeq Nano library kits and Illumina's HiSeq X (San Diego, CA) according to standard manufacturer protocols.

All iHART WGS data were processed through the same bioinformatics pipeline; this pipeline was designed based on GATK’s best practices(4, 5). The metadata for each sample are stored in a custom MySQL database where each sample was tracked as it progressed through the sequencing pipeline at the New York Genome Center (NYGC) and bioinformatic processing, and finally the quality assurance metrics were populated based on the resulting processed data. The first step in the pipeline was to align the raw short sequence reads to the human reference genome (human_g1k_v37.fasta). This was accomplished by processing the fastq files with the Burrows-Wheeler Aligner (bwa-mem, version 0.7.8)(6) to generate BAM files. BAM files were generated in a read-group-aware fashion (properly annotating sequence reads derived from the same flow cell and lane) and thus multiple BAM files were subsequently merged using BamTools (version 2.3.0)(7) to generate a single BAM file per sample. The second step in the pipeline was to mark duplicate reads in the BAM file using the Picard MarkDuplicates tool (version 1.119; http://broadinstitute.github.io/picard/). The third step in the pipeline was to perform local realignment of reads around indels using GATK's IndelRealigner (version 3.2-2). The fourth step in the pipeline was to genotype each sample, generating a gVCF file. To achieve accuracy at this stage, base quality score recalibration was run using GATK (version 3.2-2)(8). Subsequently, GATK’s HaplotypeCaller (version 3.2-2) was run on each base-recalibrated BAM to identify the variant and non-variant bases in the genome. All four of these steps were performed at the NYGC, resulting in a BAM and a gVCF file for each sample.

The fifth step in the pipeline was to jointly call variants across all AGRE/iHART samples, to generate a VCF file. This was accomplished by combining gVCF files, 200 samples at a time, using GATK’s CombineGVCFs (version 3.2-2), and then running GATK’s GenotypeGVCFs (version 3.2-2). Step 5 was accomplished by splitting data by chromosome (which increases parallelization) and resulted in one cohort-wide VCF per chromosome. Finally, to help filter out low quality variants within the call set, GATK’s Variant Quality Score Recalibration (VQSR, version 3.2-2) was run to generate well calibrated quality scores. The final step in the pipeline was to annotate the resulting variant calls, to generate an annotated VCF file. This was accomplished by annotating with ANNOVAR(9) (version 20160201) and then with Variant Effect Predictor(10) (version VEPv83). The resulting VCF contains gene-based, region-based, and filter-based annotations for each identified variant. For all the analyses described in this manuscript, we excluded VQSR failed variants and multi-allelic variants.

Whole-genome sequence coverage

We used the SAMtools v1.2(11) depth utility to calculate genome-wide (excluding gap regions in the human reference genome, downloaded from the UCSC table browser) per-base sequencing coverage for each sample. In order to reduce memory requirements, the reported depth was truncated at a maximum of 500 reads. Subsequently, we calculated two main summary statistics for each sample using custom scripts: (i) average coverage and (ii) percent of the genome (excluding gap regions) covered at 1X, 10X, 20X, 30X and 40X. On average, 98.97 ± 0.37 % of bases were covered at a depth of ≥10X (Fig. S1).
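The two per-sample summary statistics can be illustrated with a minimal sketch. This is not the authors' custom script: the function name `coverage_summary` and its input (an iterable of per-base depths for non-gap positions) are assumptions made for illustration, and only the 500-read cap and the coverage thresholds come from the text.

```python
from typing import Dict, Iterable, Sequence

MAX_DEPTH = 500  # depth cap described in the text


def coverage_summary(
    depths: Iterable[int],
    thresholds: Sequence[int] = (1, 10, 20, 30, 40),
) -> Dict[str, float]:
    """Summarize per-base depths (gap regions assumed already excluded).

    Returns the mean coverage and the percent of positions covered at
    each threshold or deeper.
    """
    n_positions = 0
    total_depth = 0
    at_least = {t: 0 for t in thresholds}
    for depth in depths:
        depth = min(depth, MAX_DEPTH)  # truncate very deep positions
        n_positions += 1
        total_depth += depth
        for t in thresholds:
            if depth >= t:
                at_least[t] += 1
    if n_positions == 0:
        raise ValueError("no positions supplied")
    summary = {"mean_coverage": total_depth / n_positions}
    for t in thresholds:
        summary[f"pct_ge_{t}x"] = 100.0 * at_least[t] / n_positions
    return summary
```

Streaming over the depths keeps memory use constant, which matters when the input is a genome-wide per-base depth file.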

Inheritance classifications

For each member of the iHART cohort with at least one parent sequenced (affected or unaffected children), all identified variants were classified based on their observed inheritance. Every variant was categorized into one of eight inheritance types: (i) de novo, (ii) maternally inherited, (iii) paternally inherited, (iv) newly homozygous, (v) newly hemizygous, (vi) missing, (vii) unknown phase, or (viii) uncertain. To perform this classification, we developed a custom script to simultaneously evaluate variant quality and inheritance within each family. Prior to this classification step, all VQSR failed variants and multi-allelic variants were excluded (see "NGS-processing pipeline" section). Additionally, we set permissive quality control thresholds in order to retain sensitivity while removing variants with a high probability of being false positives. Variants were required to have a depth of at least 10x, a genotype quality of ≥25, and a ratio of alternative allele reads/total reads ≥0.2. We assumed that if a variant met these quality thresholds, then the assigned genotype was correct.
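As an illustration of the quality thresholds, a minimal sketch follows. The `GenotypeCall` record and `passes_quality` function are hypothetical names, not part of the authors' script. The sketch assumes that "total reads" equals the reported depth, that the depth threshold is inclusive, and that the alternative-allele ratio is only evaluated for genotypes that carry the variant.

```python
from dataclasses import dataclass

MIN_DEPTH = 10          # minimum read depth (assumed inclusive)
MIN_GQ = 25             # minimum genotype quality
MIN_ALT_FRACTION = 0.2  # minimum alt reads / total reads


@dataclass
class GenotypeCall:
    """Hypothetical per-sample record for a variant-carrying genotype."""
    depth: int
    genotype_quality: int
    alt_reads: int


def passes_quality(call: GenotypeCall) -> bool:
    """Apply the three permissive thresholds to one genotype call."""
    if call.depth < MIN_DEPTH or call.genotype_quality < MIN_GQ:
        return False
    # Assumption: total reads is taken to be the reported depth.
    return call.alt_reads / call.depth >= MIN_ALT_FRACTION
```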

While maternally inherited, paternally inherited, and de novo categories are self-explanatory, definitions for the remaining inheritance classifications are more complex. A homozygous variant observed in a child is called a newly homozygous variant if it is heterozygous in both parents. Similarly, a newly hemizygous variant on the X chromosome is defined as a hemizygous variant observed in a male child which is not identified as hemizygous in the corresponding father. A variant was classified as missing (./.) if the variant was called in at least one child in the iHART cohort but did not have sufficient coverage for GATK's HaplotypeCaller to define a genotype. A variant was classified as unknown phase if a child had an inherited variant and only one biological parent sequenced (unless on a sex chromosome where inheritance can be inferred) or if both parents carry the variant and thus the phase cannot be determined from this site alone. Finally, a variant was classified as uncertain if it could not be classified into another inheritance type; this includes: Mendelian error variants (e.g., heterozygous variants on male sex chromosomes), variants failing the quality control thresholds above (in a child or a parent), or a variant that couldn't be classified with confidence (e.g., a variant identified in a child but absent in its only sequenced parent could be de novo or inherited).
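The autosomal part of these definitions can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the authors' script: genotypes are encoded as alternative-allele counts (0, 1, 2) with `None` for a missing call, sex chromosomes and quality failures are not handled, and the treatment of a homozygous child whose parents both carry the variant but are not both heterozygous (returned here as unknown phase) is an assumption of this sketch.

```python
from typing import Optional


def classify_autosomal(
    child: Optional[int],
    mother: Optional[int],
    father: Optional[int],
) -> str:
    """Classify one autosomal variant from alt-allele counts.

    None means no genotype is available (missing call or parent not
    sequenced). Label strings are illustrative.
    """
    if child is None:
        return "missing"
    if child == 0:
        return "reference"  # child does not carry the variant
    parents = [g for g in (mother, father) if g is not None]
    if len(parents) < 2:
        # One parent available: a carrier parent leaves the phase
        # unknown; a non-carrier parent cannot separate de novo
        # from inheritance via the other parent.
        if parents and parents[0] > 0:
            return "unknown_phase"
        return "uncertain"
    if child == 1:
        if mother == 0 and father == 0:
            return "de_novo"
        if mother > 0 and father == 0:
            return "maternal"
        if mother == 0 and father > 0:
            return "paternal"
        return "unknown_phase"  # both parents carry the variant
    # Homozygous child
    if mother == 1 and father == 1:
        return "newly_homozygous"
    if mother == 0 or father == 0:
        return "uncertain"  # Mendelian inconsistency
    return "unknown_phase"  # assumption of this sketch
```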

Detection of Large Structural Variants

We developed a custom pipeline for high-resolution detection of large structural variants (SVs) from whole-genome sequence data (Fig. S2). This pipeline combines four different detection algorithms: BreakDancer(12), LUMPY(13), GenomeSTRiP(14, 15) and SMuFin(16).


BreakDancer

We first used the bam2cfg.pl script (part of the BreakDancer v1.1.2 package(12)) to generate a tab-delimited configuration file per family required to run BreakDancerMax. This configuration file specifies the locations of the BAM files, the desired detection parameters (the upper and lower insert size thresholds to detect SVs) and sample metadata (e.g., read group and sequencing platform); we used default detection parameters. We then ran BreakDancerMax to call SVs per chromosome within families. The resulting output files were combined for all chromosomes and samples and converted into a single VCF file using a custom script (see next section for details about genotyping). We filtered to exclude variants if the identified variant (i) was in a sequence contig, (ii) had a quality score <80, (iii) had <4 supporting reads, or (iv) had a length of <71 base pairs (small indel).
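The four exclusion criteria can be written as a single predicate. The sketch below is illustrative only: the `BreakDancerCall` record is hypothetical, "sequence contig" is assumed to mean anything other than chromosomes 1-22, X and Y in b37 naming, and SV length is assumed to be end minus start.

```python
from dataclasses import dataclass

# Assumption: everything outside these names counts as a sequence contig.
PRIMARY_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y"}


@dataclass
class BreakDancerCall:
    """Hypothetical minimal record for one BreakDancerMax call."""
    chrom: str
    start: int
    end: int
    score: int
    supporting_reads: int


def keep_call(call: BreakDancerCall) -> bool:
    """Return True if the call survives all four exclusion filters."""
    if call.chrom not in PRIMARY_CHROMS:
        return False  # (i) located in a sequence contig
    if call.score < 80:
        return False  # (ii) low quality score
    if call.supporting_reads < 4:
        return False  # (iii) too few supporting reads
    if call.end - call.start < 71:
        return False  # (iv) small indel
    return True
```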

LUMPY

We used SAMtools v1.1(11) to extract both the discordant paired-end reads and the split-read alignments per sample, generating two different sorted BAM files required to run LUMPY v0.2.11(13). We then ran lumpyexpress to call SVs within families. We merged the resulting VCF files per family (containing raw calls) into a single genotyped VCF file for all the samples in the cohort, using a custom script (see "SV post-processing" for more genotyping details). We excluded variants identified in sequence contigs. Small insertions and inversions were excluded by restricting to variants with a length of >71 base pairs; no filter was applied for small duplications because the minimum length identified was 74 base pairs.

GenomeSTRiP

We obtained genotyped SV calls generated by the NYGC's in-house GenomeSTRiP v1.04 standard pipeline(14, 15). This pipeline consists of three main modules: i) SVPreprocess: a pre-processing module that was run per sample to generate genome-wide metadata required for subsequent processes; ii) SVDiscovery: a discovery module that was run in three large batches to call deletions, producing a VCF file with raw calls detected per batch; and iii) SVGenotyper: a module run to produce genotyped VCF files per sequencing batch. In total, we received three genotyped VCF files, for sequencing batch one (N=956 samples), two (N=538 samples), and three (N=858 samples). We filtered out variants flagged as “LowQual” and merged the final set of SV calls for downstream analyses.

SMuFin

We used SMuFin(16), a reference-free approach, to identify SVs within families. Families were processed as independent trios and SMuFin was used to directly contrast sequencing reads between the parents and the offspring (Fig. S2c). During the detection process, one parental genome is used as the reference genome to identify genetic variants in the children that were absent in that parent and then this process is repeated using the other parental genome as the reference genome. This produced one output file for each parent-offspring comparison run, containing the SVs detected per comparison. We then merged all the SV calls identified in phase-able individuals (i.e., individuals for which at least one biological parent was also sequenced) and classified them according to their inheritance patterns (see "SV post-processing" for more details).

SV post-processing

We assumed heterozygosity for all SV calls, with two exceptions: i) SVs identified in sex chromosomes from males, which were annotated as homozygous; and ii) SVs identified by GenomeSTRiP, whose genotypes were defined by its SVGenotyper module. The inheritance type for all SVs identified in phase-able individuals was classified as: de novo, maternal, paternal, newly homozygous, newly hemizygous, unknown phase, missing, or uncertain (see "Inheritance classifications" for more information). For SVs, the missing classification was only applied to BreakDancer calls with a quality score of <80 and/or <4 reads supporting the variant call.

We focused on the analysis of high-confidence SVs, specifically deletions (DELs), duplications (DUPs), and inversions (INVs), by restricting to events identified by at least two detection algorithms and removing SVs that overlapped genomic regions of low complexity (i.e., centromeres, segmental duplications, regions of low mappability, and regions subject to somatic recombination in antibodies and T-cell receptor genes)(17) by more than 50%. We made two exceptions to the rule that at least two detection algorithms must detect an SV. The first exception was to exclude SVs detected by only LUMPY and BreakDancer because this subset of SVs had very low concordance with genotype array data (Table S16). The second exception comes from the fact that we are calling SVs in families; if an SV event was called by at least two detection algorithms in one or more family members but called by only one algorithm in another family member, this SV was included in the analysis despite its identification by only one algorithm.
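The two-algorithm rule and its two exceptions can be sketched as follows. The function names and the input layout (a mapping from family member to the set of algorithms that detected the same SV event in that member) are hypothetical. The sketch also assumes that the family-level rescue applies to any member with at least one supporting algorithm once some member is independently supported; the text does not spell out every edge case.

```python
from typing import Dict, Set

# Pair of callers whose joint-only calls were excluded (Table S16).
LOW_CONCORDANCE_PAIR = frozenset({"LUMPY", "BreakDancer"})


def individually_supported(callers: Set[str]) -> bool:
    """At least two algorithms, and not the LUMPY + BreakDancer-only case."""
    return len(callers) >= 2 and frozenset(callers) != LOW_CONCORDANCE_PAIR


def high_confidence_members(family_calls: Dict[str, Set[str]]) -> Set[str]:
    """Return the family members in whom the SV event is retained.

    If any member is individually supported, members with a single
    supporting algorithm are rescued (assumption: any single caller).
    """
    if not any(individually_supported(c) for c in family_calls.values()):
        return set()
    return {member for member, callers in family_calls.items() if callers}
```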

Even though WGS theoretically enables high-resolution prediction of breakpoints, the breakpoints called by the detection algorithms can vary due to technical differences between these methods and also between samples (e.g., coverage) despite the fact that they are detecting the same underlying SV event. To adjust for this, SV calls made by different detection algorithms were considered to be the same SV event if they were: (i) called in the same individual, (ii) had a reciprocal overlap of at least 50%, and (iii) shared the same SV type (e.g., DEL) and inheritance pattern. A similar approach was subsequently applied to SVs within a family, where SV events are likely inherited and thus identical; the breakpoints of overlapping SVs (≥50% reciprocal overlap of the same SV type) identified in individuals within the same family were adjusted to the minimum start and maximum end coordinates predicted (maximum predicted size based on breakpoints) in family members with the SV call.
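Reciprocal overlap and the within-family breakpoint adjustment can be illustrated with a short sketch. The function names are hypothetical, intervals are assumed to be half-open `(start, end)` pairs on the same chromosome, and the study itself performed these intersections with BEDTools rather than with code like this.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, same chromosome


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Smaller of the two overlap fractions (overlap / length of each call)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def same_event(a: Interval, b: Interval, min_fraction: float = 0.5) -> bool:
    """True if two calls of the same SV type overlap reciprocally by >= 50%."""
    return reciprocal_overlap(a, b) >= min_fraction


def harmonize_family(calls: Iterable[Interval]) -> Interval:
    """Minimum start and maximum end across family members' matching calls."""
    calls = list(calls)
    return min(s for s, _ in calls), max(e for _, e in calls)
```

Requiring the overlap fraction to hold for both calls prevents a small SV nested inside a much larger one from being treated as the same event.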

SVs were defined as rare if they had no more than 50% overlap in (a) regions commonly disrupted by SVs in our Curated Database of Genomic Variants (cDGV; allele frequency ≥0.001) and (b) regions commonly disrupted by the same SV type (allele frequency ≥0.01) in the HNP samples (see "Control cohorts"). We also classified SVs as rare if (c) they had a region of ≥500Kb that did not overlap with common SVs in cDGV (allele frequency ≥0.001) or HNP samples (see "Control cohorts" section below) (allele frequency ≥0.01).

Finally, in order to facilitate prioritization for likely pathogenic variants, gene-based and region-based annotations were added to the final set of high confidence SV calls by using custom scripts and the Bamotate annotation tool(18).

Multi-algorithm consensus SV calls

The four algorithms chosen to call SVs use different detection strategies and are suitable for identifying different sizes and types of SVs with varying levels of sensitivity and specificity. Therefore, we ran a multi-algorithm comparison to identify high-quality SVs, identified by at least two methods (as described above). We used BEDTools(19) to intersect SV calls detected by the different algorithms, by performing an all-against-all comparison (Fig. S2 and Table S14). The start and end positions of overlapping SVs (≥50% reciprocal overlap of the same SV type and inheritance pattern) identified in the same individual were reassigned based on the coordinates from the detection algorithm predicted to be more precise in calling breakpoints.

Considering the strategy implemented to identify SVs (e.g., split-read methods can detect SVs at single base-pair resolution) for each detection algorithm, we defined the following rank for breakpoint precision: SMuFin (split-read and de novo assembly method) > LUMPY (split-read and read-pair method, with coordinates assigned within families) > GenomeSTRiP (split-read, read-pair and read-count method, with coordinates assigned within sequencing batch) > BreakDancer (read-pair detection method). We obtained a high confidence set of heterozygous deletions by restricting to the subset of AGRE samples for which heterozygous deletions were also detected (≥50% reciprocal overlap) from Illumina genotyping array data(18). We then used GATK's VariantEval tool to generate het:hom metrics for SNVs identified within 224 heterozygous deletions identified by any one detection algorithm (and also identified in genotyping array data) to assess the accuracy of this breakpoint precision ranking. A heterozygous deletion with accurate breakpoints would include only homozygous SNVs (het:hom ratio of zero). This analysis showed no significant differences between these methods (with all of them showing a median het:hom ratio of 0.01), but the ranking of mean het:hom ratios was generally consistent with our SV breakpoint precision ranking: SMuFin (0.028) < LUMPY (0.043) < BreakDancer (0.059) < GenomeSTRiP (0.067).
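A minimal sketch of how the precision ranking might be applied when the same SV event is reported by several algorithms, together with the het:hom metric used to assess it. Both function names and the input layouts are hypothetical; the study computed het:hom metrics with GATK's VariantEval, not with code like this.

```python
from typing import Dict, Optional, Tuple

# Breakpoint precision ranking stated in the text, most precise first.
BREAKPOINT_RANK = ["SMuFin", "LUMPY", "GenomeSTRiP", "BreakDancer"]


def consensus_coordinates(calls: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Pick the (start, end) from the highest-ranked algorithm present."""
    for caller in BREAKPOINT_RANK:
        if caller in calls:
            return calls[caller]
    raise ValueError("no call from a ranked algorithm")


def het_hom_ratio(n_het: int, n_hom: int) -> Optional[float]:
    """Ratio of heterozygous to homozygous SNVs inside a deletion.

    A heterozygous deletion with accurate breakpoints is expected to
    give a ratio near zero. Returns None when no homozygous SNVs exist.
    """
    if n_hom == 0:
        return None
    return n_het / n_hom
```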

Visual inspection of LUMPY-BreakDancer joint SV calls by genotyping array

Array-based SV detection is a well-established method with high accuracy for certain SV classes, in particular large deletions(20). Copy Number Variants (CNVs) can be visualized from array data by plotting the B Allele Frequency and Log R Ratio values for all array genotyped SNPs within the estimated CNV region and its flanking regions (25% of the CNV length on each side); we will refer to this as an "array visualization plot". Given the low concordance rate between LUMPY and BreakDancer SV calls with other methods (Table S16), we decided to manually inspect array visualization plots, generated by using available Illumina genotyping array data(18), for regions with LUMPY-BreakDancer joint SV calls from the iHART WGS data.

We randomly selected 218 LUMPY-BreakDancer detected SV events from different length bins and used a custom script to generate an array visualization plot for each detected SV region. For each of the 218 SVs, an array visualization plot was generated for the carrier and all corresponding family members. Manual inspection of the array visualization plots was conducted (blinded with respect to the predicted carrier(s) of the LUMPY-BreakDancer SV call), and each SV was categorized as true or false. By treating the array-based true calls as the gold standard, we were able to estimate the validation rate for LUMPY-BreakDancer joint SV calls (Table S16).

Sensitivity to detect rare SVs identified in genotyping array data

A clean set of rare SVs detected from Illumina genotyping array data (array-SVs) was available for 553 iHART fully phase-able samples(18). We used BEDTools(19) to intersect our set of SV calls (WGS SV calls, DELs and DUPs) with rare SVs identified in genotyping array data(18) in these 553 overlapping samples. We evaluated our sensitivity to detect array-SVs by considering events detected with ≥50% reciprocal overlap by both array and NGS in the same sample (Table S15).

Control cohorts

Throughout this manuscript, we reference several control cohorts used for assessing variant frequencies in samples not ascertained for ASD. These cohorts are described below.

Publicly available databases. Unless otherwise specified, the publicly available databases (all annotations provided by ANNOVAR) referenced include: the NHLBI Exome Sequencing Project (ESP, esp6500siv2_all) (http://evs.gs.washington.edu/EVS/), the Exome Aggregation Consortium (ExAC_ALL annotation from version exac03nonpsych)(21), 46 unrelated whole-genome sequenced (high coverage on the Complete Genomics platform) non-disease samples (http://www.completegenomics.com/public-data/69-genomes/, cg46) and the 1000 genomes project (1000g2015aug_all)(22).

UCLA internal controls. Throughout this manuscript, the use of "UCLA internal controls" refers to a set of 379 unrelated whole-genome sequenced (30x coverage on Illumina platform, processed by the same bioinformatics pipeline as was used for iHART) samples with a neurodegenerative disorder known as Progressive Supranuclear Palsy (PSP). There is no known etiological overlap or comorbidity between PSP and ASD.

Healthy Non-Phaseable (HNP) samples. Throughout this manuscript, the use of "HNPs" refers to the 922 healthy non-phaseable (no biological parents sequenced) iHART samples. The majority of these samples are parents of affected or unaffected children. Because these samples likely harbor genetic ASD-risk variants, these HNPs provide a helpful estimate of allele frequencies but we generally apply more permissive allele frequency filtering to retain inherited risk variants.

Curated Database of Genomic Variants (cDGV). To assess the population frequency of structural variants in a more precise manner, we manually curated the Database of Genomic Variants (DGV, release date 2015-07-23)(23). This curation involved removing studies that did not include sample identifications and/or only analyzed targeted genomic regions, as well as SVs detected in non-human samples or individuals diagnosed with intellectual disability (ID) or developmental delay (DD) (ID and DD samples in two studies, Coe et al. 2014(24) and Cooper et al. 2011(25), were flagged for exclusion by Evan Eichler's laboratory and their accession numbers were shared with us via personal communication). This resulted in a total of 26,353 unique remaining samples. We then collapsed all SV types in the remaining samples into five different categories: deletions, duplications, insertions, inversions and unknown SV types, as shown in Table S13. We finally re-calculated the frequency of the different SV categories by continuous genomic intervals spanning the entire genome, avoiding double-counting SVs (of the same type) identified in the same sample and same exact region by different studies.

Defining rare inherited variants and private variants

Rare inherited variants (SNVs and indels) were defined as those with an allele frequency (AF) less than or equal to 0.1% in public databases (1000g, ESP6500, ExACv3.0, cg46), internal controls, and iHART HNP samples and were restricted to those not missing in more than 25% of controls and that were not flagged as low-confidence by the GIAB consortium.
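The rare inherited variant definition amounts to a conjunction of three conditions, sketched below. The function name and argument layout are hypothetical, and treating a variant that is absent from a database (`None`) as having an allele frequency of zero is an assumption of this sketch.

```python
from typing import Dict, Optional


def is_rare_inherited(
    allele_freqs: Dict[str, Optional[float]],
    control_missing_rate: float,
    giab_low_confidence: bool,
    max_af: float = 0.001,       # AF <= 0.1%
    max_missing: float = 0.25,   # not missing in more than 25% of controls
) -> bool:
    """Apply the rare inherited variant criteria to one variant.

    allele_freqs maps each cohort or database name (public databases,
    internal controls, HNPs) to the variant's allele frequency there.
    """
    if giab_low_confidence:
        return False
    if control_missing_rate > max_missing:
        return False
    # Assumption: absent from a database is treated as AF = 0.
    return all((af or 0.0) <= max_af for af in allele_freqs.values())
```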

We define private variants as variants that are observed in one and only one iHART/AGRE family (AF~0.05%) and are not missing in more than 25% of iHART HNPs. Additionally, private variants are never observed in any control cohorts (AF=0) and are not missing in more than 25% of the PSP control samples. Only variants that were not flagged as low-confidence by the GIAB consortium were considered. We only report analyses for iHART private variants in the 1,177 children with both biological parents sequenced (fully phase-able). For non-coding private inherited variants, variants present in gnomAD (version 2.0.2) were also removed.

High-risk inherited variants

To characterize potential high-risk inherited rare variants, rare damaging variants transmitted to all affected individuals in a multiplex family, but not transmitted to unaffected individuals, were identified. In order to identify these high-risk inherited variants, we performed an analysis looking for rare PTVs (AF≤0.001 in public databases and internal controls) or SVs (AF≤0.001 in our curated Database of Genomic Variants (cDGV) and AF≤0.01 in healthy non-phaseable samples (HNPs)) disrupting an exon or promoter (2Kb upstream of the TSS) transmitted to all affected and no unaffected members of a family when the variant disrupted a gene(s) with a high probability of being loss-of-function (LoF) intolerant (pLI≥0.9, n=3,483 genes)(21). The families selected for the PTV analysis were restricted to a subset of 346 families with ≥2 genetically distinct (i.e., not a family with just a pair of affected MZ twins) fully phase-able affected children with a variable number of unaffected children. Given the small number of qualifying SVs in these ASD families, all families (n=493) were considered.

To determine if these high-risk inherited variants formed a protein–protein interaction (PPI) network, we used the Disease Association Protein-Protein Link Evaluator (DAPPLE)(26) and performed 1,000 permutations (within-degree node-label permutation).

Non-coding analysis

We defined non-coding SNVs and indels as those annotated by VEP to be consequences that do not occur within a coding transcript. This includes 17 of the 35 VEP consequences: "mature_miRNA_variant", "5_prime_UTR_variant", "3_prime_UTR_variant", "non_coding_transcript_exon_variant", "intron_variant", "non_coding_transcript_variant", "upstream_gene_variant", "downstream_gene_variant", "TFBS_ablation", "TFBS_amplification", "TF_binding_site_variant", "regulatory_region_ablation", "regulatory_region_amplification", "feature_elongation", "regulatory_region_variant", "feature_truncation", or "intergenic_variant". If multiple consequence annotations were present for a single variant, only the first most damaging consequence was considered to stringently filter for non-coding variants. Only variants that were not flagged as low-confidence by the GIAB consortium were considered. To increase our accuracy in assessing the allele frequency of these non-coding variants, we also annotated these variants with gnomAD (version 2.0.2) allele frequencies identified from whole-genome sequencing of over 15K samples not enriched for ASD phenotypes. We defined promoters as 2Kb upstream and 1Kb downstream from the transcription start site (TSS) by referencing the longest transcript for each gene (ties in transcript length were resolved by selecting the lower Ensembl Transcript ID). The ASD-risk genes used for this analysis are the 69 genes with an FDR<0.1 in the iHART TADA-mega analysis. iHART non-coding private variants were identified in the 1,177 children with both biological parents sequenced (fully phase-able) (Naff=960, Nunaff=217). iHART non-coding RDNVs were considered after running ARC to identify high confidence variants and were restricted to those identified in the 716 non-ARC outlier samples (Naff=575, Nunaff=141). To increase our power for non-coding variants, we obtained data from 519 whole-genome sequenced SSC quads (mother, father, affected child, unaffected child). 
These data were also generated and processed to a per sample gVCF (GATK version 3.2-2) by NYGC. We then performed joint genotyping, annotation, and quality control using the same pipeline applied to the iHART genomes. After quality control resolved 4 identity crises, we removed one likely contaminated sample and two samples with sex crises. This resulted in 516 quads and 3 trios from the SSC (Naff=517, Nunaff=518). We identified an average of 89 raw RDNVs per child in this cohort. After applying ARC to these data, we obtain an average of 61.83 RDNVs per SSC child which is very similar to the genome-wide expectation and matches the average observed for iHART RDNVs after applying ARC (60.3 RDNVs per child in LCL-derived samples and 59.4 RDNVs per child in WB-derived samples). Given that the SSC cohort is comprised entirely of WB-derived samples, we identified zero ARC outliers with >90% of their raw RDNVs removed by ARC. The resulting combined iHART + SSC whole-genome cohort includes 1,092 affected and 659 unaffected samples for RDNV analysis and 1,477 affected and 735 unaffected samples for the analysis of private inherited variants.
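The promoter definition used in the non-coding analysis (2Kb upstream and 1Kb downstream of the TSS of the longest transcript, with ties resolved by the lower Ensembl Transcript ID) can be sketched as below. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and comparing Ensembl IDs as strings assumes the usual fixed-width, zero-padded ENST format.

```python
from typing import List, Tuple


def pick_transcript(transcripts: List[Tuple[str, int]]) -> str:
    """Choose the longest transcript; ties go to the lower Ensembl ID.

    transcripts holds (ensembl_transcript_id, length) pairs.
    """
    return min(transcripts, key=lambda t: (-t[1], t[0]))[0]


def promoter_interval(
    tss: int,
    strand: str,
    upstream: int = 2000,
    downstream: int = 1000,
) -> Tuple[int, int]:
    """Strand-aware promoter window around a transcription start site."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 1), end
```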

Enrichment for NetSig significant genes in high-risk inherited PPI networks

In order to identify genes whose encoded proteins directly interact with potential ASD-risk genes more than expected by chance, we ran NetSig(27). NetSig requires two input files: (1) a set of genes and their associated q-values (or p-values) and (2) a PPI network. We input the q-values obtained in the iHART TADA mega-analysis and known protein-protein interactions from InWeb v3(28) after converting to HGNC and restricting to genes included in the iHART TADA mega-analysis (12,015 genes); this resulted in a subset of 302,991 known PPIs.

DLG2 association testing and haplotype prediction

The 2.5Kb deletion identified in the promoter region of DLG2 is significantly associated with ASD when considering the three independent ASD carriers in the iHART cohort (3 of 484 unrelated phase-able (at least one parent sequenced) affected children with SVs called) and the lack of any deletions intersecting this DLG2 promoter region deletion amongst 212 unaffected children in this cohort (iHART) and 26,353 cDGV controls (curated DGV) (two-sided Fisher's Exact Test, P=5.7 × 10^-6, OR=Inf, 95% CI=22.7-Inf). When restricting to only WGS control samples (212 unaffected children in this cohort and 2,677 cDGV controls), this association remains significant (two-sided Fisher's Exact Test, P=0.003, OR=Inf, 95% CI=2.47-Inf).
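The reported P-values can be approximately reproduced from the counts given above. The sketch below is a plain-Python two-sided Fisher's exact test written for illustration (the function name is hypothetical and the authors' software is not stated); it assumes the 2x2 table is carriers vs. non-carriers among 484 affected children and among the pooled controls, and uses the common convention of summing all tables no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(col1, row1)
    p_obs = prob(a)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# 3 carriers among 484 affected children; 0 carriers among controls.
p_all = fisher_exact_two_sided(3, 481, 0, 212 + 26353)  # all controls
p_wgs = fisher_exact_two_sided(3, 481, 0, 212 + 2677)   # WGS controls only
print(f"{p_all:.2e} {p_wgs:.4f}")
```

Under these assumptions the two values come out near 5.7 × 10^-6 and 0.003, in line with the figures quoted in the text.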


Given that all carriers of the recurrent high-risk inherited SV deletion disrupting the promoter of DLG2 (chr11:85339733-85342186) were of Hispanic or Latino origin, we wanted to eliminate the possibility that this was a rare population-specific event. When restricting to only Hispanic or Latino (AMR) WGS control samples (98 unaffected children in this cohort and 351 cDGV controls), this association remains significant (two-sided Fisher's Exact Test, P=0.006, OR=Inf, 95% CI=1.92-Inf). To determine if this SV was always found on the same haplotype, we first extracted all high confidence SNVs (genotype quality of ≥30 and ≤30% of samples with missing genotypes) in the region surrounding the SV (1Kb upstream and downstream of the start and end positions of the SV, respectively). We then ran fastPHASE(29) to estimate the haplotype in this region for all the 2,308 WGS samples included in this study, as well as the corresponding haplotype frequencies (using the -F option). Of the 40 possible estimated haplotypes, a different haplotype was found in each of the three families carrying the SV, with haplotype frequencies of 0.469, 0.024 and 0.001 (Table S4).

Gene set enrichment

The purpose of gene set enrichment (GSE) analysis is to count the number of genes in common between two sets of genes and determine if there is greater overlap than expected by chance. We use a null model in which the probability that a gene is hit by mutation is proportional to the length of this gene, as previously described in Iossifov et al. Nature 2014(30). In this model, we collapse all recurrent hits, resulting in a gene being classified as hit (1) or not hit (0) in the sample of interest (e.g., ASD affected children). We then compare T, the genes targeted by at least one mutation in the sample, to a predefined gene set (S), and obtain the intersection of T and S – the overlap (O). We estimate the probability p(S) that an exonic mutation (and hence the gene) is contained within S by taking the ratio of the sum of the coding lengths of the genes of S and the sum of the coding lengths of all genes.

$$ p(S) = \frac{\sum \text{coding lengths of genes of } S}{\sum \text{coding lengths of all genes}} $$

As described in Iossifov et al. Nature 2014(30), we then perform a two-sided binomial test of |O| outcomes in |T| opportunities given the probability of success p(S), where '||' denotes the number of gene members in a set.

For some analyses, only a portion of the genome was considered (e.g., only genes with a pLI≥0.9), and thus all parameters (T, S, and p(S)) were adjusted to remove genes (and gene lengths) not being considered in the count of T. All gene sets (S), were first converted to their HGNC symbol and then matched by HGNC symbol to the target genes (T) before intersecting to obtain O. In the analysis of high-risk inherited variants, the gene set enrichment analysis was adjusted to include only the 3,483 genes which have a pLI≥0.9 (see "High-risk inherited variants"). In the TADA mega-analysis, the gene set enrichment analysis was adjusted to include only the 18,472 gencodeV19 TADA genes (see "TADA-mega analysis").

We selected 22 gene sets with known or hypothesized biological relevance in the study of ASD. This included four transcriptome co-expression studies: (1) one module downregulated (M12) and one module upregulated (M16) in ASD brain vs. control brain(31), (2) three modules downregulated (M4, M10, M16) and three modules upregulated (M9, M19, M20) in ASD brain vs. control brain(32), (3) three neurodevelopmental co-expression networks from multiple human brain regions across human development enriched for genes hit by a single de novo PTV in ASD patients from the Simons Simplex Collection (SSC) (3_5_PFC_MS, 4_6_PFC_MSC, and 8_10_MD_CBC) – after removing the nine high-confidence ASD (hcASD) genes on which these networks were seeded(33), and (4) five neurodevelopmental co-expression modules – constructed agnostic to ASD risk genes – enriched for ASD risk genes/variants (M2, M3, M13, M16, and M17)(34). In addition to these 16 gene sets, we compiled a list of genes with ≥2 SSC probands harboring a de novo PTV(30); this gene set serves as a positive control. We also included FMRP targets(35), CHD8 targets(36), RBFOX1 targets(37), and genes encoding proteins identified in the post synaptic density in human neocortex(38). Finally, we used >8,000 samples from various tissues (e.g., brain, heart, liver) in GTEx(39) data v6 (dbGap # phs000424.v6.p1) to identify genes enriched for expression in the brain vs. other tissues. These genes had a 2-fold enrichment (FDR<0.05) after regressing out RNA Integrity Number (RIN) and various sequencing covariates (principal component 1 and 2 of sequencing statistics provided by GTEx). Unless otherwise noted, p-values are only reported for gene sets surviving multiple test correction (Bonferroni correction for the 22 gene sets tested or P<0.002).
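The gene set enrichment test described in this section can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the toy gene symbols, and the toy coding lengths are hypothetical. The background restriction (e.g., to pLI≥0.9 genes) is modeled by passing only the genes under consideration in `coding_length`, and the two-sided binomial P value is computed by summing the probabilities of all outcomes no more likely than the observed overlap.

```python
from math import exp, lgamma, log


def binom_two_sided(k, n, p):
    """Two-sided binomial test of k successes in n trials (requires 0 < p < 1)."""
    def pmf(i):
        # Log-space evaluation avoids overflow for large n.
        log_comb = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_comb + i * log(p) + (n - i) * log(1 - p))

    p_obs = pmf(k)
    total = sum(pmf(i) for i in range(n + 1) if pmf(i) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def gene_set_enrichment(hit_genes, gene_set, coding_length):
    """Return (|O|, |T|, p(S), P value) for one gene set.

    coding_length maps every gene in the background to its coding length;
    genes outside the background are dropped from both T and S.
    """
    background = set(coding_length)
    T = set(hit_genes) & background   # genes hit at least once (recurrence collapsed)
    S = set(gene_set) & background    # predefined gene set
    O = T & S                         # overlap
    p_S = sum(coding_length[g] for g in S) / sum(coding_length.values())
    return len(O), len(T), p_S, binom_two_sided(len(O), len(T), p_S)


# Toy example with hypothetical genes and lengths.
lengths = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 3000, "GENE_D": 4000}
n_overlap, n_hit, p_S, p_value = gene_set_enrichment(
    hit_genes=["GENE_A", "GENE_B", "GENE_C", "GENE_A"],  # duplicate hit is collapsed
    gene_set=["GENE_A", "GENE_B"],
    coding_length=lengths,
)
BONFERRONI_ALPHA = 0.05 / 22  # threshold for 22 gene sets tested
print(n_overlap, n_hit, round(p_S, 3), round(p_value, 3), p_value < BONFERRONI_ALPHA)
```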

Artifact Removal by Classifier (ARC): a machine learning classifier for distinguishing true de novo variants from Lymphoblastoid Cell Line (LCL) and sequencing artifacts

Artifact Removal by Classifier (ARC) is a random forest supervised model developed to distinguish true rare de novo variants from LCL-specific genetic aberrations or other types of artifacts such as sequencing and mapping errors. To train the model we used rare de novo variants identified in 76 pairs of fully phase-able monozygotic (MZ) twins with WGS data derived from LCL DNA. We performed GATK joint genotyping of variants in MZ twins together with all samples in the iHART cohort (see "NGS-processing pipeline") and identified the de novo variants (see "Inheritance classifications"). We defined rare de novo variants as de novo variants with a population frequency of zero in the publicly available databases, UCLA internal controls, and Healthy Non-Phaseable (HNP) samples. In the training set, rare de novo variants identified in both MZ twins were labeled as true variants (positive class), whereas discrepant calls were labeled as false variants (negative class). Our final training set consisted of 5,667 positive and 56,018 negative variants.

Variants in both classes were annotated with 48 features; these features are related to intrinsic genomic properties (e.g., GC content and other properties implicated in de novo hotspots(40)), sample specific properties (e.g., genome-wide number of de novo SNVs), signatures of transformation of peripheral B lymphocytes by Epstein-Barr virus (e.g., number of de novo SNVs in immunoglobulin genes), or variant properties (e.g., GATK variant metrics). We also annotated whether or not each variant fell into a region flagged as low-confidence (regions for which no high-confidence genotype calls were possible) by the Genome in a Bottle Consortium(41); variants flagged as low-confidence (deemed "GIAB variants") were retained for calculating sample-level metrics but subsequently removed prior to running the classifier (5,373 positive and 53,622 negative variants remained for use in the classifier). For the eleven features which occasionally had missing values, missing values were imputed (see "Imputing missing values for ARC features") and a feature "is.X.feature.na" was included to capture this imputation process as an independent feature. All non-missing GATK metrics were taken directly from the VCF, with the exception of ABHet (see "ABHet adjustment"). An overview of ARC is shown in Fig. S5. A complete list of features and the importance of each feature used in the random forest classifier are shown in Fig. 3c and Table S5.

A random forest classifier with 1,000 decision trees was trained on these positive and negative examples. We used the RandomForestClassifier implementation from the Python scikit-learn package (version 0.18.1). Weights associated with classes were adjusted inversely proportional to the class frequencies by using sklearn ‘balanced’ class weight option to control for class imbalance (many more negative examples than positive). We performed hyperparameter optimization by grid search.

The performance of the model was first evaluated with the receiver operating characteristic (ROC) curve analysis using a ten-fold cross validation procedure. In the ten-fold cross validation, the entire training set was divided into ten folds such that the ratio of positive to negative examples was constant across folds. We achieved an AUC of >0.98 (Fig. S7a). The ROC and precision recall curves are shown in Fig. S7a-d.
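The training setup described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the ARC implementation: the synthetic data stand in for the annotated variants, the hyperparameter grid and the ROC AUC scoring choice are hypothetical (the grid actually searched is not given here), and the demonstration uses a small forest for speed, whereas the text specifies 1,000 trees. The `class_weight="balanced"` option and the stratified ten-fold split correspond to the class-imbalance adjustment and the constant positive-to-negative ratio across folds described in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def make_arc_like_search(n_estimators=1000, param_grid=None, n_splits=10, seed=0):
    """Build a grid search over a class-balanced random forest (hypothetical helper)."""
    forest = RandomForestClassifier(
        n_estimators=n_estimators,
        class_weight="balanced",  # weights inversely proportional to class frequencies
        random_state=seed,
    )
    # Stratification keeps the positive:negative ratio constant across folds.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Assumed grid, for illustration only.
    grid = param_grid or {"max_depth": [None, 10], "max_features": ["sqrt", "log2"]}
    return GridSearchCV(forest, grid, scoring="roc_auc", cv=cv)


# Synthetic, imbalanced stand-in for the annotated variant table (~10% positives).
X, y = make_classification(n_samples=400, n_features=12, weights=[0.9, 0.1], random_state=0)

# Small forest and grid so the sketch runs quickly.
search = make_arc_like_search(n_estimators=25, param_grid={"max_depth": [None, 5]})
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```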

To assess the generalization error of our model, we additionally performed whole-genome sequencing (30X) of matched whole blood (WB) and LCL samples from 17 fully phase-able individuals from the iHART cohort. These samples were also jointly genotyped with all samples in the iHART cohort (see "NGS-processing pipeline") so as to preserve variant calling metrics between the training and test set. We followed the same procedure as in the training set to identify and extract rare de novo variants in our WB-LCL matched samples. We assumed that true de novo variants would be those identified in both the WB and LCL sample in a pair (deemed “concordant”). We further assumed that variants detected in only one sample of a pair (deemed “discordant”), would be due to LCL-specific aberrations (if called in LCL) or other sources of errors (if called in WB or LCL). In total, 1,512 concordant rare de novo variants (n=1,291 after excluding GIAB variants) were called in these samples, which we used as positive examples in our test set. Furthermore, 2,560 discordant rare de novo variants (n=1,898 after excluding GIAB variants) were called in only one sample of a pair (64% of discordant variants found only in LCL, 36% of discordant variants found only in WB) and were used as negative examples. We evaluated a model that was trained using the entire training set on this independent test set and achieved an AUC of 0.98 and an F1 score of 0.89.

To determine a cutoff point for the predicted ARC scores, below which a variant would be considered likely to be an LCL-specific genetic aberration or other type of artifact, we chose a cutoff value achieving a minimum precision of >0.9 in all folds from the 10-fold cross validation analysis on the training set (Fig. S7c). Under this constraint, we calculated the cutoff to be at 0.4 and estimated the minimum recall rate given this threshold to be ~0.8 in all folds (Fig. S7d). This cutoff achieves similar precision and recall results in the independent test set (Fig. S7h).
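The cutoff selection described above can be sketched as a scan over candidate thresholds. This is a simplified illustration with hypothetical function names and toy scores; it assumes that variants scoring at or above the cutoff are retained, and it returns the lowest cutoff on a 0.01 grid at which precision exceeds the required minimum in every fold. The grid resolution is our assumption.

```python
def precision_recall_at(scores, labels, cutoff):
    """Precision and recall when variants with score >= cutoff are retained."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def choose_cutoff(folds, min_precision=0.9):
    """Lowest cutoff whose precision exceeds min_precision in all folds, else None."""
    for cutoff in (i / 100 for i in range(101)):
        if all(precision_recall_at(s, y, cutoff)[0] > min_precision for s, y in folds):
            return cutoff
    return None


# Toy held-out predictions for two folds: (scores, labels), 1 = true de novo.
folds = [
    ([0.10, 0.20, 0.35, 0.50, 0.80, 0.90], [0, 0, 0, 1, 1, 1]),
    ([0.05, 0.30, 0.45, 0.60, 0.70, 0.95], [0, 0, 1, 1, 1, 1]),
]
cutoff = choose_cutoff(folds)
min_recall = min(precision_recall_at(s, y, cutoff)[1] for s, y in folds)
print(cutoff, min_recall)
```

Choosing the lowest qualifying cutoff keeps recall as high as possible subject to the precision constraint.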

ABHet adjustment

ABHet is a variant-level annotation from GATK that aims to estimate if biallelic variants match expected allelic ratios. An ideal heterozygous variant will have a value of close to 0.5 and an ideal homozygous variant will have a value of close to 1.0. ABHet is calculated for a variant based on all samples in the VCF which are not homozygous reference at this site. The ABHet annotation is not currently provided by GATK for indels. Using the ABHet formula below, we manually calculated the ABHet value for all indels.

$$ \mathrm{ABHet} = \frac{\#\,\text{REF reads from heterozygous samples}}{\#\,\text{REF} + \text{ALT reads from heterozygous samples}} $$

Additionally, in the training set, we manually adjusted ABHet values by only including the proband and removing his/her twin(s) from the calculation. This corrects for bias introduced by applying the raw GATK metric calculated based on two samples to a single sample because we retain only the proband metrics (and exclude the MZ twin metrics) for shared de novo variants. This systematic bias is particularly apparent when comparing to the sample-level ADDP metric (formula below).

$$ \mathrm{ADDP} = \frac{\#\,\text{ALT reads in a sample at variant site}}{\#\,\text{REF} + \text{ALT reads in a sample at variant site}} $$

If a variant is present in only one sample in the VCF, then ABHet = 1 - ADDP (diagonal line observed only for false variants due to these variants being found only once in the VCF, Fig. S6a). This strong correlation was not observed in the shared de novo variants (Fig. S6a) because when a variant is in two different samples (proband (x) and MZ twin (y)), ADDP_x is rarely equal to ADDP_y, and thus ABHet_xy != 1 - ADDP_x.
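The relationship between ABHet and ADDP can be checked numerically with a small sketch that follows the two formulas in this section. The function names and read counts are hypothetical, chosen only to illustrate why the identity holds for a singleton variant but generally fails once a second carrier contributes reads.

```python
import math


def abhet(het_samples):
    """ABHet over all heterozygous carriers; het_samples is a list of (ref, alt) read counts."""
    ref_reads = sum(ref for ref, _ in het_samples)
    all_reads = sum(ref + alt for ref, alt in het_samples)
    return ref_reads / all_reads


def addp(ref, alt):
    """Per-sample fraction of ALT reads at the variant site."""
    return alt / (ref + alt)


proband = (12, 18)  # hypothetical (REF, ALT) read counts
twin = (20, 10)

# Variant seen in one sample only: ABHet equals 1 - ADDP.
print(abhet([proband]), 1 - addp(*proband))

# Variant shared with the MZ twin: the pooled ABHet no longer matches 1 - ADDP of the proband.
print(abhet([proband, twin]), 1 - addp(*proband))

# Adjustment described in the text: recompute ABHet from the proband alone.
print(abhet([proband]))
```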

Similarly, in the test set, we manually adjusted the ABHet values by only including the LCL sample and removing its matched WB sample from the calculation (Fig. S6b).

Imputing missing values for ARC features

For eleven of the ARC features (InbreedingCoeff, ABHom, ABHet, OND, RecombRate, BaseQRankSum, MQRankSum, ReadPosRankSum, DnaseUwdukeGm12878UniPk, ReplicationTiming and EncodeCaltechRnaSeqGm12878R2x75), some variants had missing values. In general, we used the mean of all non-missing values to impute the missing values of a feature. However, for GATK's "OND" feature, missing values were imputed as zero. In order to account for missingness and capture this imputation process in the ARC model, a binary feature "is.X.feature.na" was included for all variants for each of these features, with the exception of the "OND" feature (as OND values were missing for the majority of variants).
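The imputation scheme above can be sketched as a helper that fills missing values and returns the accompanying binary missingness indicator. The function name and the toy values are hypothetical; missing values are represented as `None`, mean imputation is the default, and a fixed fill value (zero, as for OND) can be supplied instead.

```python
from statistics import mean


def impute_feature(values, fill=None):
    """Impute missing (None) values and return (imputed_values, is_na_flags).

    With fill=None the mean of the non-missing values is used, which
    requires at least one observed value; otherwise the supplied
    constant is used.
    """
    observed = [v for v in values if v is not None]
    fill_value = mean(observed) if fill is None else fill
    imputed = [fill_value if v is None else v for v in values]
    is_na = [int(v is None) for v in values]
    return imputed, is_na


# Mean imputation with an indicator feature (e.g., "is.RecombRate.feature.na").
recomb_rate, recomb_is_na = impute_feature([1.0, None, 3.0])
print(recomb_rate, recomb_is_na)

# OND: missing values set to zero; the indicator is not used as a feature.
ond, _ = impute_feature([None, 0.2, None], fill=0.0)
print(ond)
```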

For two ARC features, indels were occasionally annotated with multiple values: SimpleRepeats and EncodeCaltechRnaSeqGm12878R2x75. For SimpleRepeats, if multiple values were listed, only the lowest value was retained. For EncodeCaltechRnaSeqGm12878R2x75, only the max value was retained. We chose these values to be most conservative and least likely to bias the classifier. These exceptions are also captured by the "is.indel" feature.

ARC outlier samples

After applying ARC to all 1,377 children (partially or fully phase-able), we identified a subset of outlier samples for which >90% of their raw DN variant calls had an ARC score of less than 0.4, in other words samples for which >90% of raw DN variant calls were excluded by ARC (partially phase-able n=2; fully phase-able n=346). These outlier samples were those with the largest number of raw DN variant calls prior to running ARC (biological sequencing source was LCL) and it is likely that the classifier was unable to confidently distinguish variants in these samples. Unless otherwise mentioned, all ARC outlier samples were excluded from downstream analyses involving de novo variants.

Calculating the rates for rare de novo mutations

When calculating de novo mutation rates, we only considered the 1,177 children with both biological parents sequenced (fully phase-able). We then excluded ARC outlier samples (n=346). Consistent with what is shown in Fig. 3, there is no significant difference in the rate of rare de novo variants based on the biological sequencing source (WB vs. LCL; after ARC and after excluding ARC outliers) when including all MZ twins. However, we observed that shared de novo (TRUE) variants from the LCL MZ twins have slightly inflated ARC scores (median number of de novo variants is 69), as compared to LCL non-MZ twin samples (median number of de novo variants is 57) (Fig. S8a). This difference in de novo rates was significant when evaluated using Wilcoxon rank sum test (P=1.275 x 10^-12). The inflated ARC scores are likely due to the fact that these LCL MZ twins were used as the ARC training set; therefore, de novo variants from all MZ twin samples were excluded from all de novo rate calculations (n=158, some of which are also ARC outliers n=43). Therefore, all de novo mutation rate calculations were performed using 716 fully phase-able non-MZ twin and non-ARC outlier samples (Naff=575; Nunaff=141).

Given our observation that children from multiplex families and simplex families have comparable rates of rare de novo synonymous and missense variants in both affected and unaffected children but different rates for rare de novo PTVs (Table S6), we sought to determine if this represented a true difference in the underlying genetic architecture of multiplex ASD families by performing a Monte Carlo integration. This revealed that with the current iHART sample size (Naff=575, Nunaff=141), we had only 51% power to detect an odds ratio greater than or equal to the odds ratio reported in simplex families (OR=0.13/0.07=1.86)(30, 42) and that our power to reject the null hypothesis that affected and unaffected children have no difference in the rate of de novo PTVs was 71.2%. We estimate that once we expand our cohort by a factor of 2.5 (Naff=1,362, Nunaff=352), we will have >95% power to detect a rate difference in de novo PTVs if such a difference exists in multiplex ASD families.

Correlation of de novo mutation rate with paternal age

We evaluated the correlation between DN variant rate in 574 fully phase-able iHART affected children (excluding MZ twins, ARC outliers, and one sample without paternal age information) and paternal age at the child’s birth using a generalized quasi-Poisson linear model, assuming that the counts follow an over-dispersed Poisson distribution:

$$ R_C = T_C \times (A \times F_C + B) $$

where R_C is the rate of DN events per child, F_C is the age of the father at the birth of the child, T_C is the percent of the child’s genome covered at 10X, and A and B are whole-population parameters estimated by maximizing the likelihood over all children (as previously described in Iossifov et al. Nature 2014(30)). We performed this analysis before and after ARC, considering only DN events not falling in GIAB low-confidence regions.

Determining rate differences between sample groups (Affected vs. Unaffected)

For variants of a given class (e.g., de novo PTV), the rates were calculated by taking the number of variants per child and performing a quasi-Poisson linear regression. This method enabled us to adjust for both biological sequencing source (WB vs. LCL) and biological sex (male vs. female). Biological sex was not used as a covariate for hemizygous variants because only male children are considered. Unless otherwise noted, rates are displayed as the mean number of variants with error bars representing the standard error.

RDNV rates and power

All rare de novo variants (RDNVs) were required to be absent (AF=0) in all available control samples, including public databases, UCLA internal controls, and HNP samples. This revealed that with the current iHART sample size (Naff=575, Nunaff=141), the probability of detecting an odds ratio greater than or equal to the odds ratio reported in simplex families (OR=0.13/0.07=1.86) is 51.8% and that our power to reject the null hypothesis that affected and unaffected children have no difference in the rate of de novo PTVs was only 70.9%. We estimate that once we expand our cohort by a factor of 2.4 (Naff=1,380, Nunaff=338), we will have >95% power to detect a rate difference in de novo PTVs if such a difference exists in multiplex ASD families.

TADA mega-analysis

We used the Transmitted And De novo Association (TADA) test(43) to combine evidence from rare de novo (DN) or transmitted (inherited) protein-truncating variants (PTVs) and de novo missense variants predicted to damage the encoded protein (Mis3, a probably damaging prediction by PolyPhen-2(44)) identified in ASD cases. Within the 422 iHART families with at least one ASD case and both biological parents sequenced, there were 838 genetically non-identical (only one MZ twin retained) ASD cases available for the TADA analysis. These 838 affected samples, and their biological parents, were treated as independent trios for the TADA analysis. This approach means that siblings were treated as belonging to independent trios, an approach for which we also approximate the null distribution (see "TADA simulations").
To further increase power for the identification of novel ASD-risk genes, we combined qualifying variants found in ASD cases from the current (iHART) cohort with the most recent TADA mega-analysis(45), which included variants described in the Simons Simplex Collection (SSC) and the Autism Sequencing Consortium (ASC) cohorts, together with small de novo CNV deletions (SmallDel) identified in SSC and Autism Genome Project (AGP) probands (Table S8).

Qualifying variants in the iHART cohort included rare DN/transmitted PTV and rare DN Mis3 variants identified in the 838 affected samples, and not flagged as low-confidence by the Genome in a Bottle (GIAB) Consortium(41). Following the allele frequency threshold used in the previous TADA mega-analysis(45), we required transmitted PTVs to have an AF≤0.1% in public databases (1000g, ESP6500, ExACv3.0, cg46), internal controls, and iHART HNP samples. DN variants identified in the iHART cohort were required to be absent in all public databases, internal controls, and HNP samples (AF = 0). High confidence DN variants were obtained by ARC (see "ARC" section) for all non-MZ twin samples. DN variants shared by MZ twins (shared DN variants, used as TRUE examples in the ARC training set) were also included as qualifying variants without filtering on their ARC score. Additionally, for the 185 TADA samples identified as ARC outlier samples, we excluded DN variants in these samples and retained only their inherited PTVs as qualifying variants. We used the PolyPhen-2(44) v2.2.2r395 HDIV predictions from the Whole Human Exome Sequence Space (WHESS dataset) to annotate DN Mis3 variants in the iHART cohort. This method is highly concordant with the method (PolyPhen-2 web application) implemented for the ASC and SSC(45), with our re-annotation resulting in identical Mis3 classifications for 99.8% of the reported DN Mis3 variants in the ASC and SSC cohorts. When multiple qualifying variants in a gene were found in the same sample, only the most damaging variant was retained. The parameters used for performing the TADA analysis matched those used in previous TADA mega-analyses(43, 45, 46), including the previously observed aggregate association signals used to estimate relative risk (RR, γ) for each variant class – DN PTV (γ=20), DN SmallDel (γ=15.3), DN Mis3 (γ=4.7), and transmitted PTV (γ=2.3) (see extended results and Table S11).
PTVs classified as uncertain or missing in children were excluded (see “Inheritance classifications”).

We then tallied qualifying variants from the different cohorts into a gene by variant-type matrix for the TADA analysis, which contained variant counts for a total of 18,472 gencodeV19 genes with HGNC approved gene names (this excludes a subset of genes (193 out of 18,665) from the most recent TADA mega-analysis(45) that could not be easily converted to a single non-redundant HGNC gencodeV19 gene). In particular, counts of DN PTV/Mis3 variants come from 4,689 ASD cases from ASC(46), SSC(30), and iHART (this manuscript), while counts for the transmitted and non-transmitted PTVs are from 3,813 ASD cases and 7,609 controls from the ASC(46) and iHART (this manuscript) (Table S8). Finally, counts of DN SmallDel were calculated in 4,687 ASD cases from the SSC(45) and the AGP(47) (Table S8).

Critically, 424 AGRE samples (sample list obtained via personal correspondence with Bernie Devlin, Mark Daly, and Christine Stevens) were included as "cases" in the original ASC TADA analysis(46) meaning that all qualifying PTVs identified in these cases were treated as transmitted PTVs because de novo status could not be determined. Given that iHART sequenced 119 of these samples (or the monozygotic twin of one of these samples) and their biological parents, we were able to recover the de novo status for variants identified in these samples (71 samples after excluding ARC outliers) by using the iHART data. Thus, we used iHART data to count qualifying DN PTV and Mis3 variants in these samples and allowed transmitted PTV counts to come from the original study(46). To do this in a non-redundant way (without double-counting variants), we looked for all qualifying DN PTVs identified by iHART data in these samples in the ASC VCFs (downloaded from dbGAP(46)) and subtracted a transmitted PTV count from the iHART mega-analysis TADA-ready table for each variant found in the ASC VCFs. In three instances, the transmitted PTV count from ASC cases was already zero for the gene harboring the corresponding variant and thus we left it at zero.

The parameters used for performing the TADA analysis matched those used in previous TADA mega-analyses(43, 45, 46). In addition to the RR parameters above, we also assumed the fraction of ASD risk genes (π) to be ≈ 0.05 (1,000 ASD risk genes divided by a total of 18,472 genes), the PTV frequency parameters (required by TADA to integrate transmitted and non-transmitted PTV variant counts into the model) were ρ=0.1 and ν=200 and the gene mutation rates were taken directly from the most recent TADA mega-analysis(45), with PTV and Mis3 gene mutation rates calculated by multiplying the exome mutation rates, originally estimated by Samocha et al.(48), resulting in fractional constants of 0.074 and 0.32, respectively. This use of gene mutation rates as ground truth (rather than comparing to control samples) facilitates the use of TADA in mega-analyses because differences in sample size and variant detection between studies impact the power of TADA, but are not a potential source of bias.

TADA simulations

The distribution of the TADA statistic (under the null) is known for independent trios(43), but not for multiplex families. Therefore, we estimated the distribution of the null TADA statistic by simulating Mendelian transmission and de novo mutation across the family structures used in our TADA-mega analysis. This simulation was based on the observed qualifying variant counts and family structures from our TADA-mega analysis datasets, which included: (1) iHART multiplex families, (2) ASC and SSC trios and ASC case-control samples, and (3) small deletions from Sanders et al.(45) (Table S8). Simulations under the null model (1.1 million simulations) were conducted prior to running TADA with the same parameters used for our TADA-mega analysis (see "TADA mega-analysis").

The occurrence of rare de novo variants (RDNVs) was simulated by randomly shuffling genes carrying the observed qualifying RDNVs across the genome of each sample by redrawing in proportion to the gene-specific mutation rates (derived from Samocha et al.(48)). For example, if an affected sample harbored 8 RDNVs, these RDNVs would be placed in 8 genes in simulation 1, independently in 8 genes in simulation 2, and so on, where the probability of a gene containing an RDNV is derived from its gene mutation rate. This method was applied to simulate RDNVs in affected children from iHART multiplex families and affected children from ASC and SSC trios.

The occurrence of transmitted (inherited) and non-transmitted variants was simulated by (A) randomly shuffling genes carrying the observed qualifying variants in the parents of a given family, by redrawing in proportion to the gene-specific mutation rates and (B) randomizing the Mendelian inheritance of such variants across all children (affected and unaffected) in the family. Randomization of the Mendelian inheritance simply means that for each simulation a variant can be transmitted to each child, regardless of affected status, with a 50% probability. For example, in Family001, if mom harbored 10 qualifying PTV variants and dad harbored 10 qualifying PTV variants; then in each simulation each of these 20 variants would be randomly placed in a gene according to its gene-specific mutation rate and is either transmitted or not transmitted to each of the children in the family. This method was applied to simulate transmitted and non-transmitted PTVs in the cohorts listed in Table S8.

Finally, we simulated small deletions disrupting 2-7 genes at a time(45). For each observed small deletion containing Ngenes, we selected a contiguous set of Ngenes by redrawing in proportion to the multi-gene mutation rates. Multi-gene mutation rates were calculated by summing single-gene mutation rates of adjacent genes using sliding windows of K genes across the genome. For example, if a small deletion disrupted 5 genes in an affected sample, then for each simulation a contiguous set of 5 genes would be randomly selected for this sample with a probability based on the 5-gene-sliding-window multi-gene mutation rates.
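The three simulation steps described above (redrawing de novo variants, randomizing Mendelian transmission, and redrawing contiguous small deletions) can be sketched as follows. This is a toy illustration, not the simulation code used in the study: the function names, gene labels, and mutation rates are hypothetical, and the genes are assumed to be given in genomic order on a single chromosome so that adjacent list entries are adjacent genes.

```python
import random


def redraw_genes(n_variants, genes, rates, rng):
    """Place n_variants in genes drawn in proportion to gene-specific mutation rates."""
    return rng.choices(genes, weights=rates, k=n_variants)


def simulate_transmission(n_parental_variants, n_children, genes, rates, rng):
    """Redraw parental variants, then transmit each to every child with probability 0.5."""
    placed = redraw_genes(n_parental_variants, genes, rates, rng)
    return [(gene, [rng.random() < 0.5 for _ in range(n_children)]) for gene in placed]


def window_rates(rates, k):
    """Multi-gene mutation rates: sum of single-gene rates over sliding windows of k genes."""
    return [sum(rates[i:i + k]) for i in range(len(rates) - k + 1)]


def redraw_small_deletion(k, genes, rates, rng):
    """Select a contiguous set of k genes in proportion to the k-gene window rates."""
    starts = list(range(len(genes) - k + 1))
    start = rng.choices(starts, weights=window_rates(rates, k), k=1)[0]
    return genes[start:start + k]


rng = random.Random(0)
genes = ["G1", "G2", "G3", "G4", "G5", "G6", "G7"]         # hypothetical, in genomic order
rates = [1e-6, 3e-6, 2e-6, 5e-6, 1e-6, 4e-6, 2e-6]         # hypothetical mutation rates

print(redraw_genes(8, genes, rates, rng))                  # 8 RDNVs in one affected child
print(simulate_transmission(20, 3, genes, rates, rng)[0])  # 10 + 10 parental PTVs, 3 children
print(redraw_small_deletion(5, genes, rates, rng))         # deletion spanning 5 genes
```

Each call corresponds to one simulation for one sample or family; repeating the calls yields the simulated counts from which null Bayes factors would then be computed.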

The resulting set of simulation-based Bayes factors from TADA were multiplied together. A single p-value for each gene was generated, reflecting how unlikely it is to have observed the Bayes factor obtained in our TADA-mega analysis given the 1.1 million simulation-based Bayes factors observed for this gene (Fig. S9).

Identification of genes with a large contribution from inherited PTVs

Conservatively, we considered only variants where the inheritance was known (de novo vs. inherited). Thus, the total number of TADA-qualifying variants was adjusted to ignore PTVs from cases because these originate from case-control studies where inheritance is unknown.

We applied two methods to identify genes with a large contribution from inherited PTVs. Method 1: The total number of qualifying variants (N) in each TADA gene was defined as N_DN.PTV + N_DN.SmallDel + N_DN.Mis3 + N_Inherited.PTV; and if N_Inherited.PTV/N ≥70%, then this gene was considered to have a higher proportion of inherited risk variants. Method 2: Alternatively, we identified genes for which the main driver of the TADA association signal was from inherited PTVs. We defined this class of genes as those where the Bayes Factor from inherited PTVs was greater than the Bayes Factor from all other de novo variant classes (BF_inheritedPTV > BF_dnPTV & BF_inheritedPTV > BF_dnSmallDel & BF_inheritedPTV > BF_dnMis3).

scRNA-seq

Gene cell type enrichment scores were obtained from an unpublished scRNA-seq dataset of GW17-18 human fetal cortex, a human fetal forebrain scRNA-seq dataset from Nowakowski et al.(49), and adult brain scRNA-seq dataset from Lake et al.(50) (Fig. S12). Cell type enrichment lists were grouped into major cell classes (glutamatergic, GABAergic, glial and other support cells). Broadly expressed genes were determined by enrichment in neuronal and glial or other support cell types, or above a mean expression threshold across all cells in the dataset but without cell type specific enrichment. Enrichment log2 odds ratios were calculated using a general linear model (binomial distribution).

Online User Interface and Genotype/Phenotype Search Engines for iHART Data Access

To facilitate sharing of iHART data with the broader autism research community and patients, we implemented a set of online data access methods to preview and search genetic variants and phenotypic traits (Extended methods and http://www.ihart.org/home).

Supplementary Text

Extended methods: Online user interface and genotype/phenotype search engines for iHART data access

One of the main goals of the iHART project is to build an open-access platform enabling shared data access and analysis without requiring local high-performance computing resources. We have created two primary modes of access, raw and processed, in addition to traditional data sharing methods that enable downloading data files locally. For the raw format, we provide authorized users with direct online access to Binary Alignment Map (BAM) and Variant Call Format (VCF) files developed through alignment to the reference genome (human_g1k_b37.fasta) and variant calling by the GATK pipeline (version 3.2-2), along with the pedigree data file. For access to processed data, we provide two main paths. First, we provide SQL-like database access to variants, annotations, and pedigree data using the interactive query system called Amazon Athena(51). Athena allows users to generate instant SQL queries to extract variants among regions and/or combined with annotations of interest. Second, we created a specialized web interface, called CaseLog, for user-friendly interrogation of variant information and associated phenotype traits. We designed these methods of access to provide flexibility to approved researchers interested in new computational analysis as well as researchers interested in content-specific queries, such as those focused on a gene or a phenotype of clinical interest; phenotypic traits available are those captured by the Autism Genetic Resource Exchange (AGRE) collection(52). The current iHART database includes genetic and phenotypic data for 493 multiplex autism families, comprising 2,308 individuals, of which 1,123 have autism spectrum disorder.

1. Online Methods to Access Raw Data Files

1.1 BAM access using secured Uniform Resource Locator (URL)

BAM files maintain raw read information from sequencing machines, so they are essential to visually inspect the validity of rare calls, for example in de novo variant analysis. We provide direct access to un-encrypted BAMs and their index files via Amazon Web Services storage (AWS S3): authorized users can create pre-signed URLs for these files and can use them, without downloading, in applications that accept URLs as input, for example the Broad Institute Integrative Genomics Viewer (IGV)(53). Further details including instructions for creating pre-signed URLs are available on the iHART web site (http://www.ihart.org/home).

1.2 VCF access using AWS Virtual Machines

For online access of VCF files, we built a custom operating system image (called Amazon Machine Image (AMI), ID="ami-f20c2c89") where widely-used genomic processing applications are pre-installed, and a shared storage (called snapshot, ID="snap-02337e463aaec44fe") that contains the VCF file containing all 2,308 samples. Authorized users can launch their own virtual machine instance with this AMI and attach the snapshot as a local disk volume to the launched instance, to start variant analysis without a need for downloading files or setting up a computational environment.

1.3 Other Raw Data File Access

We provide a README to explain the file naming convention on AWS storage. Each sample corresponds to a key identifier that can be used to determine the phenotype of the AGRE sample. The phenotypes are organized in tab separated value file format and are also accessible through the command line interface on Amazon (aws-cli)(54).

2. Online Methods to Access Variants, Annotations, and Phenotypes

2.1 SQL-query using Athena For Athena genotypes, we converted the GATK-pipeline (version 3.2-2) VCF file into two tables. The variant table ("vars") has all sample-level variant information, including per-sample unfiltered allele depth (AD), call depth (DP), genotype quality (GQ), and Phred-scaled posterior probability for genotypes (PL). The total number of calls in this table is 154,274,873,692 from 2,308 samples and the total number of non-reference calls is 11,694,973,020, about 5 million variants per sample. The other table ("anno") has all site-level annotation information, including the GATK variant call metrics and the ANNOVAR(9) and VEP(10) annotations. The total number of sites in this table is 66,841,982, with 112 annotation items. For Athena phenotypes, we built a pedigree table for the current iHART samples. For each table, the data are stored in AWS S3 as a group of partitioned and compressed files in a columnar format (Apache Parquet(55)), to maximize search performance and to reduce query costs. An example of a SQL query and result combining these two tables to get the likely-gene-disrupting de novo variants in a family, together with additional details including example queries and screen shots, is available on the iHART website (http://www.ihart.org/home).
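As a local illustration of the kind of two-table query described above, the following minimal sketch uses an in-memory SQLite database as a stand-in for Athena. The table names "vars" and "anno" come from the text; every column name, the pedigree table layout ("ped"), the genotype encoding, the consequence terms, and the toy rows are assumptions made for this example and do not reflect the actual iHART schema.

```python
import sqlite3

# In-memory SQLite used only as a local stand-in for Athena.
# All column names below are hypothetical, not the iHART schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vars (sample_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
                   gt TEXT, dp INTEGER, gq INTEGER);
CREATE TABLE anno (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
                   gene TEXT, consequence TEXT);
CREATE TABLE ped  (family_id TEXT, sample_id TEXT, father_id TEXT, mother_id TEXT);
""")

conn.executemany("INSERT INTO ped VALUES (?, ?, ?, ?)", [
    ("F1", "C1", "P1", "M1"),   # child
    ("F1", "P1", None, None),   # father
    ("F1", "M1", None, None),   # mother
])
conn.executemany("INSERT INTO vars VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    # Site 1: child heterozygous, both parents homozygous reference (de novo).
    ("C1", "1", 100, "A", "T", "0/1", 35, 99),
    ("P1", "1", 100, "A", "T", "0/0", 30, 90),
    ("M1", "1", 100, "A", "T", "0/0", 32, 93),
    # Site 2: heterozygous in child and father (inherited).
    ("C1", "2", 200, "G", "GA", "0/1", 28, 99),
    ("P1", "2", 200, "G", "GA", "0/1", 31, 99),
    ("M1", "2", 200, "G", "GA", "0/0", 29, 87),
    # Site 3: de novo pattern, but annotated as synonymous.
    ("C1", "3", 300, "C", "T", "0/1", 33, 99),
    ("P1", "3", 300, "C", "T", "0/0", 30, 91),
    ("M1", "3", 300, "C", "T", "0/0", 34, 95),
])
conn.executemany("INSERT INTO anno VALUES (?, ?, ?, ?, ?, ?)", [
    ("1", 100, "A", "T", "GENE_A", "stop_gained"),
    ("2", 200, "G", "GA", "GENE_B", "frameshift_variant"),
    ("3", 300, "C", "T", "GENE_C", "synonymous_variant"),
])

# Child heterozygous, both parents homozygous reference, gene-disrupting consequence.
QUERY = """
SELECT c.sample_id, c.chrom, c.pos, c.ref, c.alt, a.gene, a.consequence
FROM ped AS p
JOIN vars AS c ON c.sample_id = p.sample_id
JOIN vars AS f ON f.sample_id = p.father_id AND f.chrom = c.chrom
              AND f.pos = c.pos AND f.ref = c.ref AND f.alt = c.alt
JOIN vars AS m ON m.sample_id = p.mother_id AND m.chrom = c.chrom
              AND m.pos = c.pos AND m.ref = c.ref AND m.alt = c.alt
JOIN anno AS a ON a.chrom = c.chrom AND a.pos = c.pos
              AND a.ref = c.ref AND a.alt = c.alt
WHERE p.family_id = ?
  AND c.gt = '0/1' AND f.gt = '0/0' AND m.gt = '0/0'
  AND a.consequence IN ('stop_gained', 'frameshift_variant',
                        'splice_donor_variant', 'splice_acceptor_variant')
ORDER BY c.chrom, c.pos
"""
rows = conn.execute(QUERY, ("F1",)).fetchall()
print(rows)
```

The self-joins on the variant table rely on reference calls being stored alongside non-reference calls, which the call counts above suggest; a production query would likely also filter on DP, GQ, and allele frequency.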

2.2 Overview and Gene/Variant Search using CaseLog For CaseLog genotype information, we used the Illumina Isaac pipeline(56) to call single nucleotide variants, insertions, deletions, and duplications. We converted the 2,308 per-sample GATK-pipeline-produced BAM files back to FASTQ format, then applied the Isaac Aligner against human_g1k_v37.fasta and Variant Caller (version 2.4.7) to build per-sample BAM and VCF files, respectively, in order to assess the impact of different variant genotyping pipelines (future versions will also include GATK 3.4 based calls). For variant annotation, we used the Illumina Nirvana annotation engine (version 1.5.3)(57) to annotate allele frequency, variant type, coding or protein changes, and loss of function (LoF) prediction. We then reformatted these variant and annotation tables into an Elasticsearch(58) index for fast query performance; data were aggregated per call-site to distinguish homozygous reference regions from no-call regions per sample. For CaseLog phenotypic traits, we first normalized pedigree information (i.e., family structure, gender, and affected status) to build a set of filters for finding corresponding samples. In addition to the pedigree information, we extracted disorder-related condition information from the medical history of each patient and incorporated it as normalized quantitative attributes. Finally, a total of 3,248 item-level phenotype attributes from the Autism Genetic Resource Exchange (AGRE), matched to the current iHART data of 2,308 samples, were stored in 21 separate tables to facilitate individual searches. These extended phenotype data include standard diagnosis measures (ADI-R: 426 items, 1,095 samples; ADOS: 109 items, 986 samples), development and learning scales (Stanford Binet: 72 items, 318 samples; Mullen scales of early learning: 278 items, 29 samples; Peabody Picture Vocabulary Test: 48 items, 945 samples), and social/behavioral measures (Social Responsiveness Scale: 107 items, 1,091 samples; Vineland adaptive behavior scales: 234 items, 853 samples).

For the web interface, we implemented the following functionalities to summarize data and to perform variant- or gene-level queries with phenotype filters: (1) variant or gene information search by ID; (2) sample information search by a combination of normalized phenotype attributes; and (3) variant or gene information search filtered by normalized phenotype attributes.

In addition to these search methods, we also implemented a set of statistical test methods (case versus control test, genetic burden test, phenotype-wide association test, and principal component analysis) to facilitate checking variants and proceeding to the targeted analysis phase. We describe each method in detail below.

2.2.1. Baseline Search We implemented three basic search interfaces for variant, gene, and phenotype data in CaseLog. For a variant search, the input is in “chr:startPos:endPos:genotype” format (e.g., "10:72477533:72477533:T" without quotes), and the output provides the variant type, possible protein change, predicted consequence, reference call, frequency information for each genotype, and information on the samples that carry a variant at this position, as shown in the screen shot on the iHART website (http://www.ihart.org/home). For a gene search, the input is a HUGO Gene Nomenclature Committee (HGNC) symbol, an EMBL-EBI Ensembl gene ID, or an NCBI gene ID, and the output visualizes how many variants of each type are located in the target gene in a needle plot, with a list of variant information, as shown in Fig. S13a. For a phenotype data search, the input can be any AND/OR combination of normalized attributes and the output is a list of samples meeting the search criteria. An example of an unaffected female sibling search is shown on the iHART website.
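The variant search string format can be illustrated with a small parser. This is a minimal sketch, not CaseLog code: the function and class names are hypothetical, and the validation rules (1-based positions, end not before start, non-empty allele) are assumptions rather than documented CaseLog behavior.

```python
from typing import NamedTuple


class VariantQuery(NamedTuple):
    """Parsed form of a chr:startPos:endPos:genotype search string."""
    chrom: str
    start: int
    end: int
    allele: str


def parse_variant_query(text: str) -> VariantQuery:
    """Split a search string into its four colon-separated fields."""
    parts = text.strip().split(":")
    if len(parts) != 4:
        raise ValueError("expected chr:startPos:endPos:genotype")
    chrom, start_s, end_s, allele = parts
    start, end = int(start_s), int(end_s)  # raises ValueError if not integers
    # Assumed sanity checks; the real interface may accept other inputs.
    if start < 1 or end < start:
        raise ValueError("positions must satisfy 1 <= startPos <= endPos")
    if not chrom or not allele:
        raise ValueError("chromosome and genotype fields must be non-empty")
    return VariantQuery(chrom, start, end, allele)


# The example input given in the text.
query = parse_variant_query("10:72477533:72477533:T")
print(query)
```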

2.2.2. Search by Variant and Phenotype In addition to the basic search options, we implemented interfaces where users can perform variant and phenotype filtering together. For example, the screen shot on the iHART website shows a search for affected patient samples that have any missense variant in the gene FMR1.

2.2.3. Case versus Control Analysis As an extension of the combined genotype/phenotype search, we implemented a case versus control test interface in CaseLog. For an input variant (or gene), users can build case and control sets defined by any combination of normalized attributes. The output is a contingency table with test statistics (if available) for the input variant. Fig. S13b shows an example result for a variant in affected probands versus unaffected siblings.

2.2.4. Phenotype-wide Association Studies CaseLog provides the capability to run phenotype-wide association studies (PheWAS) against a cohort in real time. PheWAS aims at identifying phenotypes and diseases that are significantly enriched in carriers of a specific variant versus healthy individuals. For each phenotype, the user interface displays the number of carriers versus non-carriers affected by the same phenotype, as well as the respective odds ratios and Chi-squared statistics to help discover potential associations. In the current implementation, these values are not adjusted for population stratification.
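The per-phenotype statistics described above can be sketched from a 2x2 table of carrier status by phenotype status. This is an illustrative calculation only: the function name and the toy counts are hypothetical, and the use of the Pearson Chi-squared statistic without continuity correction is an assumption, since the text does not state which variant of the test CaseLog applies.

```python
import math


def phewas_2x2(carrier_pos, carrier_neg, noncarrier_pos, noncarrier_neg):
    """Odds ratio and Pearson Chi-squared test (1 df, no continuity correction)."""
    a, b, c, d = carrier_pos, carrier_neg, noncarrier_pos, noncarrier_neg
    n = a + b + c + d
    # Sample odds ratio; infinite when an off-diagonal cell is zero.
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return odds_ratio, float("nan"), float("nan")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the Chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Toy counts: 12 of 20 carriers and 30 of 80 non-carriers have the phenotype.
odds_ratio, chi2, p_value = phewas_2x2(12, 8, 30, 50)
print(odds_ratio, round(chi2, 3), round(p_value, 3))
```

Because many phenotypes are tested per variant, any P value from such a scan would normally need a multiple-testing correction in addition to the stratification caveat noted above.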

2.2.5. Genetic Burden Test Genetic burden tests help users to identify genes or sub-regions of genes that are excessively enriched for variants in cases versus controls (or vice versa). We have implemented a de novo burden test based on the Poisson exact test and the expected rate of de novo missense and protein-truncating variation per gene(48, 59). Each run of a genetic burden test focuses on a single, user-provided gene, and results are cached so that they are readily available upon the next request.
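A one-sided Poisson exact test of this kind can be sketched as follows. This is a minimal illustration, not the CaseLog implementation: the function names are hypothetical, the per-gene mutation rate and cohort size are made-up numbers, and the expected count is taken as 2 x (number of children) x (per-chromosome, per-generation mutation rate), which is a common convention assumed here.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)   # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k     # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def denovo_burden_p(observed: int, n_children: int, mu_per_chrom: float) -> float:
    """Burden P value given an assumed per-gene de novo rate for one variant class."""
    expected = 2.0 * n_children * mu_per_chrom
    return poisson_upper_tail(observed, expected)


# Hypothetical gene: rate 2e-5 per chromosome, 3 de novo PTVs in 1,000 children.
p_burden = denovo_burden_p(observed=3, n_children=1000, mu_per_chrom=2e-5)
print(p_burden)
```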


2.2.6. Principal Component Analysis We implemented a principal component analysis (PCA) on the cohort to control for confounding variables such as population stratification. We base our analysis on 17,535 SNPs that were obtained by pruning to an average density of one SNP per 200k base pairs and by removing SNPs with a minor allele frequency ≤5% in the 1000 Genomes Project Phase 3 data. The user interface currently plots the first four principal components against each other; these mostly explain differences in ancestry among the cohort. Previous runs of PCA are cached, similarly to the results of genetic burden tests.

Extended results: High-risk inherited PPI networks In the main manuscript, we report the results of the combined PTV+SV analysis of potential high-risk inherited rare variants, rare damaging variants transmitted to all affected individuals in a multiplex family, but not transmitted to unaffected individuals (Methods; Fig. 2d). We found that the proteins encoded by the 61 unique genes harboring PTVs transmitted to all affecteds form a significant direct PPI network (Methods; 1000 permutations, P < 0.04) but the indirect PPI network is not significant (Methods; 1000 permutations, P > 0.31). We found that the proteins encoded by the 40 unique genes harboring SVs transmitted to all affecteds form a significant direct PPI network (Methods; 1000 permutations, P < 0.02) and form a significant indirect PPI network (Methods; 1000 permutations, P < 0.04).

Extended results: Enrichment for NetSig significant genes in high-risk inherited PPI networks Our NetSig analysis identified 596 genes that were significantly (P < 0.05) directly connected to potential ASD-risk genes (as defined by their TADA q-value(27)). We then used a two-sided Fisher’s Exact Test to determine if genes within the direct (n=14) (P = 0.151) or indirect (n=555) (P = 2.92 x 10^-12; OR=2.89; 95% confidence interval = 2.18-3.79) PPI networks seeded by genes hit by high-risk inherited variants (PTVs + SVs) were enriched for genes with a resulting NetSig P < 0.05 (596 in total).

Given that iHART SVs were not included in the TADA analysis (Table S8), and therefore do not contribute to the input q-values in the NetSig analysis, we also asked if the proteins in the direct and indirect PPI networks (seeded by only high-risk inherited PTVs) directly bind ASD-risk genes substantially more than expected by chance and found enrichment in both the direct (n=5) (P = 0.02; OR=12.80; 95% confidence interval = 1.07-111.92) and indirect (n=243) (P = 4.24 x 10^-16; OR=4.90; 95% confidence interval = 3.45-6.85) networks (Methods).
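For readers unfamiliar with the enrichment test used in this section, a two-sided Fisher's Exact Test on a 2x2 table can be sketched in plain Python. The counts below are toy values, not the gene counts behind the results reported here, and the function name is hypothetical. The two-sided P value is computed by summing the probabilities of all tables at least as unlikely as the observed one, which is one common convention and is assumed here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's Exact Test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over tables no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Toy table: rows = in network / not in network,
# columns = significant / not significant.
a, b, c, d = 3, 1, 1, 3
p_fisher = fisher_exact_two_sided(a, b, c, d)
sample_odds_ratio = (a * d) / (b * c)
print(p_fisher, sample_odds_ratio)
```

Statistical packages typically report a conditional maximum-likelihood odds ratio with its confidence interval, which can differ from the simple sample odds ratio computed above.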

Extended results: ASD TADA-mega analysis replication Prior to running our TADA-mega analysis, we attempted to replicate the results published in the previous TADA-mega analysis(45) using the original code (and published small deletion # = 0.14), while restricting to the published list of sample counts (with variants identified in the SSC, ASC, and AGP and listed in the supplemental table of Sanders et al.(45)). This test successfully replicated the previously published results. However, due to updates to the version of TADA implemented (confirmed by personal communication with Xin He), the resulting q-values for genes varied slightly (mean |Δ q-value| = 0.008). Nevertheless, the vast majority of genes were consistent in terms of the FDR bins (e.g., FDR<0.1, 0.2

Extended results: Relative risk parameters for the ASD TADA-mega analysis The parameters used for performing the iHART ASD TADA-mega analysis matched those used in previous TADA mega-analyses. Use of these parameters facilitated replication of previous findings prior to adding the iHART cohort to perform a mega analysis.

The Hierarchical Bayes strategy implemented by the TADA method borrows information across all genes to infer parameters that would be difficult (if not impossible) to estimate for individual genes(43). Each genetic variant has an associated relative risk (RR); in aggregate, certain classes of variants (e.g., rare and deleterious) have a higher relative risk than others (e.g., synonymous variants). Given that we do not know the relative risk for every possible genetic variant or even the relative risk associated with mutations in a given gene, the TADA model relies on estimates of the relative risk distribution obtained by pooling information across all genes for a given class of variant, and parametrized by the average relative risk, γ. Varying the average relative risk from 10 to 20 has almost no impact on the resulting TADA p-value for a gene (probability of a gene being associated with ASD), as shown in He et al.(43). The estimated relative risks for variant classes in ASD include: DN PTV (γ=20), DN SmallDel (γ=15.3), DN Mis3 (γ=4.7), and transmitted PTV (γ=2.3).

The RR for de novo PTVs was calculated using exome sequencing data from N=923 autism families(43). The RR for de novo small deletions (spanning no more than 7 genes) was estimated using CNV calls identified in affected (n=2,100) and unaffected (n=2,100) siblings from the SSC(45) and relies on the observed mean number of genes contained within these small de novo deletions (mean=2.7)(45). The RR for de novo missense variants predicted to be probably damaging by PolyPhen-2 was calculated from exome sequencing data from N=923 autism families, after restricting to 20 genes thought to be "highly confident ASD genes"(43). The RR for transmitted PTVs was calculated using exome sequencing data from 3,871 autism cases and 9,937 ancestry-matched or parental controls, after restricting to 26 genes thought to be "highly confident ASD genes"(46).

Given that our iHART cohort is likely depleted for de novo variation, application of the previous RR estimates in our TADA mega analysis conservatively limits our power for discovering novel risk genes where risk is attributable to inherited variation. However, given the large RR estimates for de novo variant classes in TADA, we sought to show how our novel gene discovery would be altered by assigning a RR of 1 to all de novo variants identified in the iHART cohort, in essence removing de novo contributions (Table S11). This overly conservative approach still results in the discovery of 6 of our 16 novel ASD risk genes (C16orf13, CCSER1, MLANA, PCM1, TMEM39B, and TSPAN4) plus an additional 5 genes which reach the FDR<0.1 threshold for significance (ASXL3, CDH13, NR3C2, SCN7A, and STARD9). ASXL3 and NR3C2 would not be considered novel as they obtained a TADA q-value<0.1 in De Rubeis et al. Nature 2014(46) (Table S11). Thus, this extremely conservative approach for estimating the RR for de novo variants in a multiplex cohort, which essentially ignores all de novo variants identified in the iHART cohort, still results in the identification of nine novel ASD-risk genes. Furthermore, we note that the iHART cohort includes de novo variants in very clearly established ASD-risk genes (e.g., CHD8), consistent with a high relative risk for this class of variant. We also note that in 70% of ASD multiplex families in which an affected child was found to harbor an established ASD-risk CNV, the CNV was not shared by all affected siblings, consistent with multiplex families where ASD-risk is derived from distinct risk factors(18).

Fig. S1.

Figure S1: Whole-genome sequence coverage statistics for 2,308 iHART/AGRE samples. There were no significant differences in the average fold coverage per sample across the cohort and no differences in the categories of (a) ASD affectation status, (b) sex, or (c) family member type – where family member type was simplified to include Mother, Father, Child (proband, sibling, MZ or DZ twin) and Other (e.g., cousin). (d) The percent of exonic and genomic bases covered at ≥10x in all family members for each of the 422 fully-phaseable iHART families. Exonic regions were defined as those annotated as protein-coding exons in Gencode V19 (>75Mb). Genomic regions were defined as all non-N bases in the reference genome (>2.8Gb). (e) The percentage of genomic bases covered at greater than or equal to 1X, 10X, 20X, 30X, and 40X for the 2,308 iHART samples with WGS data. On average, 98.97 ± 0.37 % of bases were covered at a depth of ≥10X.

Fig. S2.

Figure S2: High-resolution detection of large structural variants (SVs). (a) An overview of our custom multi-algorithm consensus SV pipeline for high-resolution detection of large structural variants (SVs) from whole-genome sequence data. The four boxes at the top list the four main algorithms used to call SVs, and the parenthetical describes the detection strategy(s) used by each algorithm: AS, de novo assembly method; SR, split-read method; RP, read-pair method; RC, read-count method. (b) Venn diagrams of structural variants detected by four different algorithms for all and rare (AF < 0.001 in cDGV and AF < 0.01 in iHART HNP samples) SVs (DELs, DUPs and INVs) detected in 1,377 phase-able WGS samples by SMuFin, LUMPY, GenomeSTRiP and BreakDancer, after excluding events with ≥50% overlap with genomic low-complexity regions(17). Additional per-algorithm filters were also applied prior to the generation of this Venn diagram as described in Methods. (c) A schematic overview of the SMuFin detection pipeline. Families are processed as independent trios, where the sequence reads from a child are aligned to the mother’s genome and then the father’s genome, treating the parental genome as the reference genome in both comparisons. Each comparison, or SMuFin execution, results in variants identified in the child by that parent-offspring comparison. All three members of the trio are considered for assigning the corresponding inheritance of variants identified in the child. A variant detected when comparing to both mom and dad is de novo, while a variant detected only when comparing to mom is paternally inherited and a variant detected only when comparing to dad is maternally inherited.
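The inheritance assignment rule described in panel (c) can be written as a short function. This is an illustrative sketch only: the function name, the representation of each SMuFin execution as a set of variant identifiers, and the toy identifiers are assumptions for this example and are not part of SMuFin.

```python
def assign_inheritance(vs_mother: set, vs_father: set) -> dict:
    """Label each child variant from the two parent-offspring comparisons.

    vs_mother: variants found when the child's reads are compared to the mother.
    vs_father: variants found when the child's reads are compared to the father.
    """
    calls = {}
    for variant in vs_mother | vs_father:
        in_mother_cmp = variant in vs_mother
        in_father_cmp = variant in vs_father
        if in_mother_cmp and in_father_cmp:
            # Absent from both parental genomes.
            calls[variant] = "de_novo"
        elif in_mother_cmp:
            # Absent from the mother only, so carried by the father.
            calls[variant] = "paternally_inherited"
        else:
            # Absent from the father only, so carried by the mother.
            calls[variant] = "maternally_inherited"
    return calls


# Toy variant identifiers (hypothetical).
inheritance = assign_inheritance(
    vs_mother={"DEL_1", "DUP_2"},
    vs_father={"DEL_1", "INV_3"},
)
print(inheritance)
```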

Fig. S3.

[Figure S3 image: bar plots of the mean number of rare SNVs + indels per child for affected and unaffected children, in four panels (inherited_mother, inherited_father, newly_homozygous, hemizygous), each split by variant consequence (missense, synonymous, PTV).]

Figure S3: Rare inherited coding variants by consequence and inheritance. The rate of rare inherited coding variants per fully phase-able child is displayed for 960 affected (red) and 217 unaffected children (light blue) by both variant consequence and inheritance, this includes newly hemizygous variants in 563 affected (red) and 100 unaffected male children (light blue). Rare inherited variants (SNVs and indels) were defined as those with an AF≤ 0.1% in public databases (1000g, ESP6500, ExACv3.0, cg46), internal controls, and iHART HNP samples and were restricted to those not missing in more than 25% of controls and that were not flagged as low-confidence by the GIAB consortium (Methods). Rates are displayed as bar plots of the mean number of variants in a sample and error bars represent the standard error.

Fig. S4.

[Figure S4 image: (a) bar plot of the number of inherited PTVs per child by phenotype (Affected, Unaffected) for all genes and PTV intolerant genes (P values shown: 0.96 and 0.40); (b) chromosome 11 locus view (85.335 to 85.345 Mb) showing the DLG2 and TMEM126B gene models, the CNV, and average ATAC-seq tracks for VZ and CP; (c-h) bar plots of the number of SVs per child by gene disruption status (0 or 1) for deletions, duplications, and inversions (P values shown range from 0.25 to 1).]

Figure S4: Additional details on rare inherited PTVs and SVs. (a) The rate of private inherited PTVs in 960 affected (red) and 217 unaffected (light blue) iHART children for all genes vs. PTV intolerant genes. Error bars represent the standard error. (b) The DLG2 promoter-disrupting 2.5Kb deletion, displayed as an orange rectangle, is transmitted to all affected children in two independent iHART/AGRE families. The average ATAC-seq peak read depth from the cortical plate (CP) and ventricular zone (VZ) of developing human brain samples (n=3) is shown below the deletion. (c-h) The rate of rare inherited SVs per fully phase-able child identified in 960 affected (red) and 217 unaffected children (light blue) by inheritance type; this includes newly hemizygous variants in 563 affected (red) and 100 unaffected male children (light blue). Rare SVs (DELs, DUPs, INVs) were defined as those with an AF < 0.001 in cDGV and an AF < 0.01 in iHART HNP samples. Rates are displayed as bar plots of the mean number of variants in a sample and error bars represent the standard error. Maternally and paternally inherited SVs by affectation status and gene disruption for deletions (c), duplications (d), and inversions (e). Newly hemizygous SVs by affectation status and gene disruption for deletions (f), duplications (g), and inversions (h).

Fig. S5.

1) Jointly genotype all samples (>2K iHART samples)

2) Identify and extract RDNVs

3) Classify positive vs. negative variants for training and test sets and perform manual ABhet calculations

Training set: RDNVs from 75 LCL MZ twins
- Shared de novo variants: True (5,373)
- Somatic variants: False (53,622)

Test set: RDNVs from 17 matched WB-LCL (LCL and WB from the same donor)
- Concordant: True (1,291)
- Discordant: False (1,898)

4) Annotate all 48 ARC features (and impute missing feature values)

5) Calculate sample-level count features (Num_Denovo_SNV, Num_IG, & Num_TCR_abparts)

6) Remove GIAB region variants

7) Build ARC random forest (after grid search for parameters) with training set

8) 10-fold cross validation with training set

9) External validation with test set

10) Select cutoff score and classify RDNVs in the iHART cohort by running ARC

Figure S5: Schematic overview of Artifact Removal by Classifier (ARC) development. A detailed outline of the major steps for developing ARC. Step 3 is only required for the training and test sets and a schematic for positive and negative variants used in the training and test sets is shown. The processing in steps 1,2, and 4-6 are completed for all iHART samples prior to running ARC in step 10.

Fig. S6.

[Figure S6 image: panels (a) and (b); see caption below.]

Figure S6: ABHet correlation with sample AD/DP for TRUE vs. FALSE variants in the ARC training and test set. For variants present in only one sample, the ABHet==ADDP (diagonal line); this phenomenon introduced systematic bias in the training set until we corrected for this by manually adjusting ABHet after removing the non-proband MZ twin. For each row in this figure, there are three columns containing variant ABHet values: the VCF extracted ABHet value from GATK, our manual calculation of ABHet, and the adjusted ABHet after removing the non-proband MZ twin prior to manually recalculating ABHet. (a) The allelic depth to total depth metric (sample ADDP metric) vs. the ABHet (variant metric) for each variant in the Training set. (b) The allelic depth to total depth metric (sample ADDP metric) vs. the ABHet (variant metric) for each variant in the Test set.

Fig. S7.

[Figure S7 image: (a) ROC curves, true positive rate (sensitivity) vs. false positive rate (1-specificity); (b) ROC AUC per fold; (c) precision and (d) recall in 10-fold CV of the training set vs. ARC score cutoff (0.10 to 0.80); (e-g) histograms of all, concordant, and discordant test set variants by ARC score; (h) precision or recall vs. ARC score.]

Figure S7: ARC performance in the training and test set. The 10-fold cross validation (CV) for the ARC training set (a-d). (a) Receiver Operating Characteristic (ROC) curves, in which the true positive rate (sensitivity) is plotted as a function of the false positive rate; (b) Area Under the ROC Curve (ROC AUC) statistics (median=0.992) for each of the 10 folds; (c) Precision rates vs. predicted score cutoffs – the dashed line at the selected score (0.4) highlights that the minimum precision across all 10 folds is >0.9; (d) Recall rates vs. predicted score cutoffs – the dashed line at the selected score (0.4) highlights that the minimum recall across all 10 folds is ~0.8. ARC performance in the test set (e-h). (e) Distribution of all test set variants by ARC score; (f) Distribution of all TRUE test set variants by ARC score – the majority of concordant variants have an ARC score of ≥0.4; (g) Distribution of all FALSE test set variants by ARC score – almost all discordant variants have an ARC score of <0.4; (h) The precision and recall rates vs. predicted ARC score cutoff in the test set – the dashed line at the selected score (0.4) highlights that the precision is >0.95 and the recall is >0.85.
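To make the cutoff-based metrics in panels (c), (d), and (h) concrete, the sketch below computes precision and recall at a score cutoff. It is an illustration with made-up scores and labels, not ARC output; the function name is hypothetical, and a variant is treated as predicted TRUE when its score is at or above the cutoff, which is an assumption about how the boundary value is handled.

```python
def precision_recall_at_cutoff(scores, labels, cutoff):
    """Precision and recall when variants with score >= cutoff are called TRUE."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Toy classifier scores with their true labels (True = real de novo variant).
toy_scores = [0.95, 0.80, 0.45, 0.30, 0.60, 0.10, 0.05, 0.20]
toy_labels = [True, True, True, True, False, False, False, False]

precision, recall = precision_recall_at_cutoff(toy_scores, toy_labels, cutoff=0.4)
print(precision, recall)
```

Sweeping the cutoff over a grid and repeating this calculation per cross-validation fold gives curves of the kind summarized in the caption.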

Fig. S8.

[Figure S8 image: (a) ARC score distributions for DNA derived from LCL (n=1028) and WB (n=149), shown for de novo variants from non-MZ twin samples (874 LCL and 145 WB samples) and for shared and somatic de novo variants from MZ twin samples (154 LCL and 4 WB samples); (b) histograms of coding RDNVs per sample before and after ARC; (c) DN variants per child vs. paternal age at child birth, before ARC (P = 0.521) and after ARC (P = 3.6 x 10^-13).]

Figure S8: RDNVs identified in iHART samples before and after ARC. (a) The ARC score distribution for raw de novo variants identified in 1,177 fully phase-able samples – displayed as non-MZ twin samples vs. MZ twin samples. Samples for which DNA was derived from LCL or WB are shown in red and green, respectively. All LCL MZ twins were included in the ARC training set and all WB MZ twins were included in the ARC test set. (b) The number of rare de novo coding variants identified per fully phase-able sample displayed as histograms. The coding RDNVs before ARC are from 1,177 fully phase-able samples; the coding RDNVs after ARC (variants with an ARC score <0.4 are filtered out), after excluding ARC outlier samples (samples with >90% DNs removed by ARC), are from n=831 samples. (c) The correlation between the rate of rare de novo variants and paternal age before and after ARC. This analysis considers 574 fully phase-able ASD children (excluding MZ twins and ARC outliers) for which paternal age was known. The red line is the linear regression line. The graph on the left shows the raw number of rare de novo variants (SNVs and indels) per child by paternal age at the time of the participant's birth in years. The graph on the right shows the number of rare de novo variants (SNVs and indels) per child after running ARC by paternal age at the time of the participant's birth in years.

Fig. S9.

[Figure S9 image: scatter plots of log(simulation p-value) against (a) log(observed TADA q-value) and (b) log(observed TADA Bayes Factor).]

Figure S9: TADA simulation results vs. TADA-mega analysis observed results. (a) For each of the 18,472 TADA genes, the observed FDR in the iHART TADA-mega analysis is plotted against the simulated p-value. Genes with the smallest FDRs also have small simulated p-values, as expected. (b) The observed Bayes Factor (BF), for genes with a BF>1, in the iHART TADA-mega analysis is plotted against the simulated p-value. Genes with the largest BF also have small simulated p-values, as expected.

Fig. S10.

[Figure S10 image: (a, b) per-gene False Discovery Rate, shown as -log10(q-value), for the two studies (iHART_TADAmega_qvalue and Sanders_published_qvalue); (c, d) per-gene violin plots of log(simulated Bayes Factor), filled by simulation p-value (color scale 0.01 to 0.06 in c; 0.001 to 0.006 in d).]

Figure S10: TADA-mega analysis results from previous study vs. current iHART study. Genes are sorted by increasing difference in the FDR (q-value) obtained in Sanders et al. vs. the current iHART TADA-mega analysis. In panels (a) and (b) the per-gene TADA FDR is displayed as the -log10(q-value) (higher dots have a lower FDR) obtained in Sanders et al. (green) and the current iHART study (red) and the horizontal line marks the FDR=0.1 threshold; for (a) the 13 genes with an FDR<0.1 in Sanders et al. that failed replication in iHART (FDR>0.1), and (b) the 16 newly significant genes identified in the iHART mega analysis with an FDR<0.1. Note that the CACNA2D3 gene is significantly associated with ASD in the iHART mega-analysis, but not in the previous TADA mega-analysis(45). However, it was previously reported in De Rubeis et al.(46) and thus we do not consider it a new risk gene. Below this, in panels (c) and (d), are the per-gene violin plots of Bayes Factors (displayed as log(simulated Bayes Factor)) obtained for each of the 1.1 million TADA simulations. The grey “x” marks the median simulated Bayes Factor, the blue dot indicates the observed Bayes Factor in the iHART TADA mega analysis, and the violin plots are filled according to their simulation p-value; for (c) the 13 genes with an FDR<0.1 in Sanders et al. that failed replication in iHART (FDR>0.1) (max p-value=0.06) and (d) the 16 newly significant genes (plus CACNA2D3) identified in the iHART mega analysis with an FDR<0.1 (max p-value=0.006).

Fig. S11.

[Figure S11 image: (a) box plot of Bayes Factors from inherited PTVs by study (Current vs. Sanders 2015), P = 0.003; (b) gene-ontology terms plotted by Z-score (axis 0 to 12): histone demethylation; histone-lysine N-methyltransferase activity; negative regulation of synaptic transmission (1); positive regulation of developmental growth; histone lysine methylation; regulation of synapse organization; learning or memory (2); regulation of axon extension; positive regulation of axonogenesis; organelle localization (3); (c) PPI network with legend: High-risk SV, High-risk PTV, BAF complex member, Newly discovered ASD gene (iHART), Previously associated ASD gene, Significant seed gene.]

Figure S11: Biological insights from known and novel genes identified in the TADA-mega analysis. (a) A box plot of the inherited PTV Bayes Factors observed for the 35 genes with an FDR<0.2 in the current iHART TADA mega analysis that had an FDR>0.2 in the Sanders TADA mega analysis(45) (Kruskal–Wallis test, P=0.003). (b) Gene-ontology enrichment for the 69 ASD-risk genes with known biological pathways. Three of the enriched pathways contain one or more of the 16 novel ASD-risk genes (any of the 69 genes in the biological pathway are listed, with novel risk genes in bold): (1) negative regulation of synaptic transmission includes ADNP, SLC6A1, and RAPGEF4; (2) learning and memory includes ADNP, GRIA1, NRXN1, PRKAR1B, SLC6A1, and SYNGAP1; and (3) organelle organization includes MYO5A, PCM1, and TCF7L2. (c) The indirect PPI network formed by the 69 ASD-risk genes and the 98 genes harboring high-risk inherited variants (n=165 unique genes). The resulting indirect PPI network was significant for two connectivity metrics – seed indirect degrees mean P=0.003, and CI degrees mean P=0.005. Proteins encoded by a gene with a high-risk inherited PTV are shown in teal and those with a high-risk inherited SV are shown in gold. Proteins encoded by an ASD risk gene identified by the TADA mega analysis in Sanders et al. 2015(45) are shown in purple, proteins encoded by a newly identified ASD-risk gene from the iHART TADA mega analysis are shown in red, proteins belonging to the BAF complex are shown in blue, and any protein falling in more than one category is colored with all categorical colors that apply (e.g., ARID1B). The gene labels for significant seed genes are bold and blue.

Fig. S12.

[Pie charts, one per dataset (Fetal Drop-seq; Fetal Nowakowski (Kriegstein); Adult Lake), for panels (a) iHART 69 and (b) iHART novel 16. Classes: Broadly expressed; GABAergic; Glia or support cells; Glutamatergic; Neuron; Not detected.]

Figure S12: Enrichment of iHART ASD risk genes in single-cell RNA-seq (scRNA-seq) cell type expression signatures. Genes enriched in major cell type classes were obtained from human fetal brain datasets and an adult brain dataset, and the percentage of iHART ASD risk genes in each cell type class is shown. (a) iHART 69 ASD risk genes in fetal cell classes (left and center) and adult cell classes (right). (b) iHART 16 novel ASD risk genes in fetal cell classes (left and center) and adult cell classes (right). Significant log2 odds ratios of neuronal cell type enrichment: Fetal Drop-seq: Glutamatergic, 3; GABAergic, 4.7; Neuron, 4.8. Fetal Nowakowski et al.(49): Glutamatergic, 1.6; GABAergic, 2.4; Neuron, 5.4. Adult Lake et al.(50): Glutamatergic, 2.9; GABAergic, 0.65; Neuron, 3.4. The Broadly expressed class was defined as expression in both neuronal and glial cell types, and the Neuron class was defined as expression in both glutamatergic and GABAergic cell types. The numbers inside each pie chart indicate the percentage of iHART ASD risk genes in that cell type class.
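The log2 odds ratios quoted in this caption can be read on the linear scale by raising 2 to each value. The following is a minimal sketch that uses only the values stated in the caption; the dictionary layout and variable names are illustrative and are not part of the original analysis.

```python
# Log2 odds ratios as quoted in the Figure S12 caption.
log2_odds_ratios = {
    "Fetal Drop-seq": {"Glutamatergic": 3.0, "GABAergic": 4.7, "Neuron": 4.8},
    "Fetal Nowakowski et al.": {"Glutamatergic": 1.6, "GABAergic": 2.4, "Neuron": 5.4},
    "Adult Lake et al.": {"Glutamatergic": 2.9, "GABAergic": 0.65, "Neuron": 3.4},
}

# Convert each value to a linear-scale odds ratio: OR = 2 ** log2(OR).
odds_ratios = {
    dataset: {cell_class: 2 ** value for cell_class, value in classes.items()}
    for dataset, classes in log2_odds_ratios.items()
}

for dataset, classes in odds_ratios.items():
    for cell_class, odds_ratio in classes.items():
        print(f"{dataset:<24} {cell_class:<14} OR = {odds_ratio:.1f}")
```

For example, a log2 odds ratio of 3 corresponds to an odds ratio of 8.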

Fig. S13.

Figure S13: Schematics of the online user interface and genotype/phenotype search engines. (a) Genetic variants in the RBFOX1 gene observed across individuals. Amino acid position and protein domains (colored bars) are shown on the x-axis, with sample counts per genetic variant on the y-axis. Variant details are available in the table shown below the needle plot. (b) An example of a single-variant association test comparing affected versus unaffected children.
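The caption does not state which statistic the interface uses for the single-variant association test. As one common choice for comparing carrier counts between affected and unaffected children, a two-sided Fisher's exact test can be sketched as follows. This is an assumption made for illustration: the function name and the counts in the usage example are hypothetical and are not taken from iHART.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]].

    Rows: affected / unaffected children; columns: carriers / non-carriers.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers among the affected children.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 8 of 20 affected and 1 of 20 unaffected children carry the variant.
p_value = fisher_exact_two_sided(8, 12, 1, 19)
print(f"P = {p_value:.4f}")
```

The small relative tolerance in the comparison guards against floating-point ties between equally likely tables.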

Supplemental Tables

Table S1. (separate file) Table S1: Cohort information. Detailed sample information for the 2,308 AGRE samples whole-genome sequenced as part of iHART.

Table S2.

Category                         Exclusion
Known chromosomal abnormality    15p13 duplication
Known chromosomal abnormality    15q deletion
Known chromosomal abnormality    15q duplication
Known chromosomal abnormality    16p deletion
Known chromosomal abnormality    16p duplication
Known chromosomal abnormality    22q duplication
Known chromosomal abnormality    mosaic for deleted Y
Known chromosomal abnormality    mosaic trisomy 12
Known chromosomal abnormality    Trisomy 21 (Down Syndrome)
Known chromosomal abnormality    Fragile X
Known neurogenetic disorder      Gaucher Disease
Known neurogenetic disorder      Marfan’s Syndrome
Known neurogenetic disorder      Sotos Syndrome

Table S2: Exclusionary criteria for patients. This table contains a list of exclusionary criteria – known genetic causes or syndromes with overlapping features – used to select AGRE families for WGS as part of iHART.

Table S3. (separate file) Table S3: Phenotypic details for NR3C2 families. Phenotypic information for the three families harboring NR3C2 variants transmitted to all affected children. Shared phenotypic features identified in the physical exam column are highlighted by blue font.

Table S4. (separate file) Table S4: DLG2 structural variant details and haplotype prediction for the carrier families. This table displays coordinates, copy number, inheritance, SV detection algorithm, and predicted haplotypes for each member of the three families with a DLG2 deletion detected.

Table S5. (separate file) Table S5: ARC feature descriptions. This table provides details for all 48 random forest features used in ARC. The features are listed in order of feature rank and are categorized into one of four possible metric types: (1) Genome: feature of a particular genomic region, independent of our cohort; (2) Sample: feature of a particular sample, aggregated over the whole genome; (3) Variant by cohort: feature of a particular variant in our cohort; (4) Variant by sample: feature of a particular variant in a particular sample.

Table S6.

Cohort              Affected:    Affected     Unaffected   Affected   Unaffected   Affected   Unaffected
                    Unaffected   synonymous   synonymous   missense   missense     PTV        PTV
SSC(30)             1867:1867    0.34         0.33         0.94       0.82         0.21       0.12
ASC + SSC(30, 42)   3982:2078    0.14         0.14         0.48       0.41         0.13       0.07
iHART               545:141      0.17         0.18         0.45       0.42         0.07       0.07

Table S6: Comparison of RDNV rates in affected and unaffected children from ASD simplex vs. multiplex families. Data in the first two rows originate from two prior studies that reported coding RDNV rates in ASD(30, 42); these rates were based on the SSC and ASC cohorts, which consist primarily of simplex families. Data in the last row are from the current iHART study, in which the vast majority of families are multiplex. The rates from Iossifov et al.(30) were reported directly in the manuscript. The rates from Kosmicki et al.(30, 42) were calculated based on data in their Table S1, which contains de novo variants from individuals with ASD (n=3,982) and unaffected siblings (n=2,078); the manually calculated rates for PTVs match the PTV rates reported directly in the manuscript. Kosmicki et al. used a minor allele frequency threshold (AF=0 in ExAC) of similar stringency to that of iHART (AF=0 in ExAC and in all available controls, including iHART parents), resulting in very similar rates for synonymous and missense RDNVs. The rates of RDNVs in Iossifov et al. were derived by identifying de novo variants called in <0.3% of the SSC parents within 40X joint coverage regions and extrapolating this rate exome-wide.
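To make the affected-versus-unaffected comparison in Table S6 explicit, the ratio of the two rates can be computed per cohort and variant class. The sketch below uses only the rounded rates printed in the table; the ratio is an illustrative summary, not a statistic reported by the authors, and the names are chosen for this example.

```python
# Per-child RDNV rates from Table S6: (affected, unaffected) by variant class.
rdnv_rates = {
    "SSC": {"synonymous": (0.34, 0.33), "missense": (0.94, 0.82), "PTV": (0.21, 0.12)},
    "ASC + SSC": {"synonymous": (0.14, 0.14), "missense": (0.48, 0.41), "PTV": (0.13, 0.07)},
    "iHART": {"synonymous": (0.17, 0.18), "missense": (0.45, 0.42), "PTV": (0.07, 0.07)},
}


def rate_ratio(affected: float, unaffected: float) -> float:
    """Ratio of the rate in affected children to the rate in unaffected children."""
    return affected / unaffected


rate_ratios = {
    cohort: {vclass: rate_ratio(*rates) for vclass, rates in classes.items()}
    for cohort, classes in rdnv_rates.items()
}

for cohort, classes in rate_ratios.items():
    summary = ", ".join(f"{vclass}: {ratio:.2f}" for vclass, ratio in classes.items())
    print(f"{cohort:<10} {summary}")
```

With these rounded inputs the PTV ratio is 1.75 in the SSC and 1.00 in iHART.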

Table S7. (separate file) Table S7: TADA qualifying variants identified in iHART/AGRE samples.

Table S8.

Qualifying variants                       ASD cases                       ASD controls            Cohorts
DN PTV/Mis3                               653 + 2,591 + 1,445 = 4,689     NA                      iHART^a + SSC(30) + ASC(46)
DN SmallDel^b                             2,591 + 2,096 = 4,687           NA                      SSC(45) + AGP(47)
Transmitted and non-transmitted PTV^c     767 + 3,046 = 3,813             767 + 6,842 = 7,609     iHART^a + ASC(46)

Table S8: Samples included in the iHART TADA mega-analysis. For each category of qualifying variants analyzed in TADA, the number of samples is broken down by cohort; the order in which the individual cohort sample counts are listed corresponds to the order of the cohort names in the last column. ^a The variant contribution from the 838 iHART affected samples, after accounting for AGRE overlap in ASC and ARC outliers, includes 71 samples for which only DN variants were considered, 185 samples for which only transmitted/non-transmitted variants were considered, and 582 samples for which both DN and transmitted/non-transmitted variants were considered. ^b Counts of DN SmallDel were taken directly from Sanders et al. 2015(45), which were generated using data from the SSC(45) and AGP(47). ^c TADA combines the counts from inherited and case-control data by treating transmitted and non-transmitted PTV counts as additional cases and controls. Therefore, in the ASC the 6,842 "ASD control" samples are individual control samples, while in iHART the count of 767 represents the number of affected children for which variants were classified as non-transmitted by either parent.

Table S9. (separate file) Table S9: iHART TADA-mega analysis results for all 18,472 genes. Detailed results for the 18,472 genes included in the iHART TADA mega-analysis, including mutation rates by variant type and iHART-only and mega-analysis counts by qualifying variant type.

Table S10.

dnPTV: ADNP, AKAP9, ANK2, ARID1B, ASH1L, BCL11A, BTRC*, C16orf13*, CACNA2D3, CCSER1*, CHD2, CHD8, CMPK2*, CTTNBP2, CUL3, DDX3X*, DNMT3A, DSCAM, DYRK1A, ERBIN, ETFB, FAM98C*, FOXP1, GABRB3, GIGYF1, GRIN2B, ILF2, INTS6, KATNAL2, KDM5B, KDM6B, KMT2C, KMT2E, KMT5B, MFRP, MLANA*, MYT1L, NCKAP1, NRXN1, P2RX5, PHF2, POGZ, PRKAR1B*, PTEN, RANBP17, RAPGEF4*, SCN2A, SETD5, SHANK2, SHANK3, SMURF1*, SPAST, SYNGAP1, TBR1, TCF7L2, TMEM39B*, TNRC6B, TRIP12, TSPAN4*, USP45, WAC, WDFY3

dnMis3: AKAP9, ANK2, ARID1B, BTRC*, CAPN12, CHD2, CHD8, DNMT3A, ERBIN, GABRB3, GIGYF1, GRIA1*, GRIN2B, INTS6, KDM5B, KDM6B, KMT2C, KMT5B, MFRP, MYO5A*, MYT1L, NRXN1, PCM1*, POGZ, PRKAR1B*, PTEN, RAPGEF4*, SCN2A, SETD5, SHANK3, SLC6A1, SMURF1*, SYNGAP1, TBR1, TMEM39B*, TRIP12, TSPAN4*, UIMC1*, USP45, WDFY3, ZNF559

dnSmallDel: ARID1B, CHD2, KMT2C, NRXN1, SETD5, SHANK2, SHANK3, SYNGAP1, TCF7L2, TNRC6B, TRIP12

inheritedPTV or casePTV: ADNP, AKAP9, ANK2, ARID1B, ASH1L, BTRC*, C16orf13*, CAPN12, CCSER1*, CHD2, CHD8, CTTNBP2, CUL3, DSCAM, DYRK1A, ERBIN, ETFB, FAM98C*, GABRB3, GIGYF1, GRIA1*, GRIN2B, INTS6, KATNAL2, KDM5B, KMT2E, KMT5B, MFRP, MLANA*, MYO5A*, MYT1L, NCKAP1, NRXN1, P2RX5, PCM1*, POGZ, PTEN, RANBP17, RAPGEF4*, SCN2A, SETD5, SHANK3, SLC6A1, SMURF1*, TCF7L2, TMEM39B*, TSPAN4*, UIMC1*, USP45, WAC, WDFY3, ZNF559

Table S10: ASD risk genes identified in the TADA mega-analysis by variant class. All 69 ASD risk genes with an FDR<0.1 in the iHART TADA mega-analysis are listed in the table above, with genes containing qualifying variants in more than one variant class displayed in multiple columns. The newly associated ASD risk genes are marked with an asterisk, and genes with one or more qualifying variants (by variant class) in an iHART sample are in bold. For example, an iHART sample harbors a TRIP12 dnPTV variant (bold), but no iHART TADA samples were identified with a dnMis3 variant in TRIP12 (plain). In the "inherited PTV or case PTV" variant class, the 18 genes for which no inherited PTVs (only case PTVs) were observed are shown in grey font to highlight that these case PTVs might actually be de novo variants ascertained in single ASD cases for which no parental DNA was available to determine de novo status. For example, there were three CHD8 case PTVs, which are likely de novo variants.

Table S11.

dnPTV: ADNP, AKAP9, ANK2, ARID1B, ASH1L, ASXL3, BCL11A, BTRC*, C16orf13*, CACNA2D3, CCSER1*, CHD2, CHD8, CMPK2*, CTTNBP2, CUL3, DDX3X*, DNMT3A, DSCAM, DYRK1A, ERBIN, ETFB, FAM98C*, FOXP1, GABRB3, GIGYF1, GRIN2B, ILF2, INTS6, KATNAL2, KDM5B, KDM6B, KMT2C, KMT2E, KMT5B, MFRP, MLANA*, MYT1L, NCKAP1, NR3C2, NRXN1, P2RX5, PHF2, POGZ, PRKAR1B*, PTEN, RANBP17, RAPGEF4*, SCN2A, SCN7A*, SETD5, SHANK2, SHANK3, SMURF1*, SPAST, STARD9*, SYNGAP1, TBR1, TCF7L2, TMEM39B*, TNRC6B, TRIP12, TSPAN4*, USP45, WAC, WDFY3

dnMis3: AKAP9, ANK2, ARID1B, BTRC*, CAPN12, CDH13*, CHD2, CHD8, DNMT3A, ERBIN, GABRB3, GIGYF1, GRIA1*, GRIN2B, INTS6, KDM5B, KDM6B, KMT2C, KMT5B, MFRP, MYO5A*, MYT1L, NR3C2, NRXN1, PCM1*, POGZ, PRKAR1B*, PTEN, RAPGEF4*, SCN2A, SETD5, SHANK3, SLC6A1, SMURF1*, SYNGAP1, TBR1, TMEM39B*, TRIP12, TSPAN4*, UIMC1*, USP45, WDFY3, ZNF559

dnSmallDel: ARID1B, CDH13*, CHD2, KMT2C, NRXN1, SETD5, SHANK2, SHANK3, SYNGAP1, TCF7L2, TNRC6B, TRIP12

inheritedPTV or casePTV: ADNP, AKAP9, ANK2, ARID1B, ASH1L, ASXL3, BTRC*, C16orf13*, CAPN12, CCSER1*, CDH13*, CHD2, CHD8, CTTNBP2, CUL3, DSCAM, DYRK1A, ERBIN, ETFB, FAM98C*, GABRB3, GIGYF1, GRIA1*, GRIN2B, INTS6, KATNAL2, KDM5B, KMT2E, KMT5B, MFRP, MLANA*, MYO5A*, MYT1L, NCKAP1, NR3C2, NRXN1, P2RX5, PCM1*, POGZ, PTEN, RANBP17, RAPGEF4*, SCN2A, SCN7A*, SETD5, SHANK3, SLC6A1, SMURF1*, STARD9*, TCF7L2, TMEM39B*, TSPAN4*, UIMC1*, USP45, WAC, WDFY3, ZNF559

Table S11: ASD risk genes identified when the relative risk for de novo variants in iHART is set to one in the TADA mega-analysis. A modified version of Table S10 (newly associated ASD risk genes are marked with an asterisk, and genes with a given variant class in an iHART sample are in bold) demonstrating how TADA gene discovery changes if RR=1 for de novo variants in iHART/AGRE. The five genes in green font become significant (TADA q-value < 0.1) after assigning a relative risk (RR) of 1 to all de novo variant classes when a de novo variant was identified in an iHART/AGRE sample. The genes in red font or highlight are no longer significant (TADA q-value > 0.1) under the same assignment. Six of the 16 novel genes identified in the iHART TADA-mega analysis remain significant (C16orf13, CCSER1, MLANA, PCM1, TMEM39B, and TSPAN4).

Table S12. (separate file) Table S12: NetSig significant genes. The 596 genes that are significantly (P < 0.05) directly connected to potential ASD-risk genes (as defined by their TADA q-value).

Table S13.

SV types in DGV             Collapsed SV category
deletion                    deletion
loss                        deletion
duplication                 duplication
gain                        duplication
tandem duplication          duplication
insertion                   insertion
mobile element insertion    insertion
novel sequence insertion    insertion
inversion                   inversion
complex                     unknown
gain+loss                   unknown
sequence alteration         unknown

Table S13: Structural variant (SV) classifications from the Database of Genomic Variants.
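The collapsing in Table S13 amounts to a lookup from each DGV SV type to one of five categories. A minimal sketch follows, assuming the grouping reads as deletion/loss, duplication/gain/tandem duplication, the three insertion types, inversion, and the remaining types as unknown; the constant and function names are illustrative and not taken from the authors' pipeline.

```python
# Mapping from DGV SV types to the collapsed SV categories of Table S13.
DGV_TO_COLLAPSED = {
    "deletion": "deletion",
    "loss": "deletion",
    "duplication": "duplication",
    "gain": "duplication",
    "tandem duplication": "duplication",
    "insertion": "insertion",
    "mobile element insertion": "insertion",
    "novel sequence insertion": "insertion",
    "inversion": "inversion",
    "complex": "unknown",
    "gain+loss": "unknown",
    "sequence alteration": "unknown",
}


def collapse_sv_type(dgv_type: str) -> str:
    """Return the collapsed category for a DGV SV type (case-insensitive).

    Raises KeyError for a type that is not listed in Table S13.
    """
    return DGV_TO_COLLAPSED[dgv_type.strip().lower()]


for sv_type in ("loss", "Tandem duplication", "gain+loss"):
    print(f"{sv_type} -> {collapse_sv_type(sv_type)}")
```

Raising on unlisted types, rather than defaulting to "unknown", keeps unexpected DGV labels visible.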

Table S14.

Method (SV types)                 No. SVs      % SVs called by >=2 methods
SMuFin (DELs, DUPs and INVs)      34,693       72%
GenomeSTRiP (DELs)                40,984       61%
BreakDancer (DELs and INVs)       407,545      21%
LUMPY (DELs, DUPs and INVs)       1,054,030    8%
TOTAL                             1,537,252    15%

Table S14: Number of rare SVs identified by each of the four calling algorithms and percentage of SVs also called by at least one other algorithm.
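The TOTAL row of Table S14 can be cross-checked from the per-method rows. The sketch below sums the call counts and computes a call-weighted average of the per-method percentages. Treating the TOTAL percentage as such a weighted average is an assumption made here, and because the per-method percentages are rounded, the result (about 14%) is only expected to approximate the 15% shown in the table.

```python
# (number of rare SVs, fraction also called by at least one other method), from Table S14.
sv_calls = {
    "SMuFin": (34_693, 0.72),
    "GenomeSTRiP": (40_984, 0.61),
    "BreakDancer": (407_545, 0.21),
    "LUMPY": (1_054_030, 0.08),
}

total_svs = sum(n for n, _ in sv_calls.values())
# Weight each method's overlap fraction by its number of calls.
weighted_overlap = sum(n * frac for n, frac in sv_calls.values()) / total_svs

print(f"Total SVs: {total_svs:,}")
print(f"Call-weighted overlap: {weighted_overlap:.1%}")
```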

Table S15.

                                                                      DELs    DUPs
% array-SVs called by >=2 methods                                     73%     20%
SVs called by >=2 methods (excluding LUMPY-BreakDancer joint SVs)     68%     18%

Table S15: Sensitivity to detect high-confidence rare array-SVs called from genotyping array data.

Table S16.

Columns labeled "only" refer to SVs called by LUMPY-BreakDancer only; columns labeled "+ another" refer to SVs called by LUMPY-BreakDancer + another method.

                 Total number of SVs     Total number of TRUE SVs     Validation rate
SV length bin    only      + another     only      + another          only      + another
100Kb-200Kb      18        13            1         10                 0.06      0.77
200Kb-500Kb      22        13            1         12                 0.05      0.92
500Kb-1Mb        27        5             0         4                  0.00      0.80
1Mb-2Mb          35        0             2         NA                 0.06      NA
2Mb-5Mb          31        0             3         NA                 0.10      NA
5Mb-10Mb         32        0             2         NA                 0.06      NA
>10Mb            22        0             0         NA                 0.00      NA
TOTAL            187       31            9         26                 0.05      0.84

Table S16: Validation rate of LUMPY-BreakDancer joint SV calls (DELs) based on Illumina array intensity plots.
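The validation rates in Table S16 are the number of TRUE SVs divided by the total number of SVs in each group. The following sketch recomputes the pooled (TOTAL) rates from the per-bin counts; length bins with no calls in the "+ another method" group are omitted from that dictionary, and all names are illustrative.

```python
from typing import Dict, Optional, Tuple

# (total SVs, SVs judged TRUE) per length bin, transcribed from Table S16.
LB_ONLY: Dict[str, Tuple[int, int]] = {
    "100Kb-200Kb": (18, 1),
    "200Kb-500Kb": (22, 1),
    "500Kb-1Mb": (27, 0),
    "1Mb-2Mb": (35, 2),
    "2Mb-5Mb": (31, 3),
    "5Mb-10Mb": (32, 2),
    ">10Mb": (22, 0),
}
LB_PLUS_ANOTHER: Dict[str, Tuple[int, int]] = {
    "100Kb-200Kb": (13, 10),
    "200Kb-500Kb": (13, 12),
    "500Kb-1Mb": (5, 4),
}


def validation_rate(total: int, true: int) -> Optional[float]:
    """Fraction of calls judged TRUE; None when there are no calls (NA in the table)."""
    return true / total if total > 0 else None


def pooled(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, int, Optional[float]]:
    """Sum totals and TRUE calls across bins and return the pooled validation rate."""
    total = sum(t for t, _ in counts.values())
    true = sum(v for _, v in counts.values())
    return total, true, validation_rate(total, true)


only_total, only_true, only_rate = pooled(LB_ONLY)
plus_total, plus_true, plus_rate = pooled(LB_PLUS_ANOTHER)
print(f"LUMPY-BreakDancer only:      {only_true}/{only_total} = {only_rate:.2f}")
print(f"LUMPY-BreakDancer + another: {plus_true}/{plus_total} = {plus_rate:.2f}")
```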

References:

1. C. M. Lajonchere, Changing the landscape of autism research: the autism genetic resource exchange. Neuron 68, 187-191 (2010).
2. S. Purcell et al., PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81, 559-575 (2007).
3. G. Jun et al., Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. American journal of human genetics 91, 839-848 (2012).
4. M. A. DePristo et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491-498 (2011).
5. G. A. Van der Auwera et al., From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current protocols in bioinformatics 43, 11.10.11-33 (2013).
6. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England) 25, 1754-1760 (2009).
7. D. W. Barnett, E. K. Garrison, A. R. Quinlan, M. P. Stromberg, G. T. Marth, BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics (Oxford, England) 27, 1691-1692 (2011).
8. A. McKenna et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 20, 1297-1303 (2010).
9. K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research 38, e164 (2010).
10. W. McLaren et al., The Ensembl Variant Effect Predictor. Genome biology 17, 122 (2016).
11. H. Li et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25, 2078-2079 (2009).
12. K. Chen et al., BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature methods 6, 677-681 (2009).
13. R. M. Layer, C. Chiang, A. R. Quinlan, I. M. Hall, LUMPY: a probabilistic framework for structural variant discovery. Genome biology 15, R84 (2014).
14. R. E. Handsaker, J. M. Korn, J. Nemesh, S. A. McCarroll, Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature genetics 43, 269-276 (2011).
15. R. E. Handsaker et al., Large multiallelic copy number variations in humans. Nature genetics 47, 296-303 (2015).
16. V. Moncunill et al., Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nature biotechnology 32, 1106-1112 (2014).
17. W. M. Brandler et al., Frequency and Complexity of De Novo Structural Mutation in Autism. American journal of human genetics 98, 667-679 (2016).
18. V. M. Leppa et al., Rare Inherited and De Novo CNVs Reveal Complex Contributions to ASD Risk in Multiplex Families. American journal of human genetics 99, 540-554 (2016).
19. A. R. Quinlan, I. M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics (Oxford, England) 26, 841-842 (2010).
20. D. T. Miller et al., Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. American journal of human genetics 86, 749-764 (2010).
21. M. Lek et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-291 (2016).
22. A. Auton et al., A global reference for human genetic variation. Nature 526, 68-74 (2015).
23. J. R. MacDonald, R. Ziman, R. K. Yuen, L. Feuk, S. W. Scherer, The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic acids research 42, D986-992 (2014).
24. B. P. Coe et al., Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nature genetics 46, 1063-1071 (2014).
25. G. M. Cooper et al., A copy number variation morbidity map of developmental delay. Nature genetics 43, 838-846 (2011).
26. E. J. Rossin et al., Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS genetics 7, e1001273 (2011).
27. H. Horn et al., NetSig: network-based discovery from cancer genomes. Nature methods 15, 61-66 (2018).
28. K. Lage et al., A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature biotechnology 25, 309-316 (2007).
29. P. Scheet, M. Stephens, A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American journal of human genetics 78, 629-644 (2006).
30. I. Iossifov et al., The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216-221 (2014).
31. I. Voineagu et al., Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474, 380-384 (2011).
32. N. N. Parikshak et al., Genome-wide changes in lncRNA, splicing, and regional gene expression patterns in autism. Nature 540, 423-427 (2016).
33. A. J. Willsey et al., Coexpression networks implicate human midfetal deep cortical projection neurons in the pathogenesis of autism. Cell 155, 997-1007 (2013).
34. N. N. Parikshak et al., Integrative functional genomic analyses implicate specific molecular pathways and circuits in autism. Cell 155, 1008-1021 (2013).
35. J. C. Darnell et al., FMRP stalls ribosomal translocation on mRNAs linked to synaptic function and autism. Cell 146, 247-261 (2011).
36. A. Sugathan et al., CHD8 regulates neurodevelopmental pathways associated with autism spectrum disorder in neural progenitors. Proceedings of the National Academy of Sciences of the United States of America 111, E4468-4477 (2014).
37. S. M. Weyn-Vanhentenryck et al., HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell reports 6, 1139-1152 (2014).
38. A. Bayes et al., Characterization of the proteome, diseases and evolution of the human postsynaptic density. Nature neuroscience 14, 19-21 (2011).
39. A. Battle, C. D. Brown, B. E. Engelhardt, S. B. Montgomery, Genetic effects on gene expression across human tissues. Nature 550, 204-213 (2017).
40. J. J. Michaelson et al., Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431-1442 (2012).
41. J. M. Zook et al., Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature biotechnology 32, 246-251 (2014).
42. J. A. Kosmicki et al., Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nature genetics 49, 504-510 (2017).
43. X. He et al., Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS genetics 9, e1003671 (2013).
44. I. A. Adzhubei et al., A method and server for predicting damaging missense mutations. Nature methods 7, 248-249 (2010).
45. S. J. Sanders et al., Insights into Autism Spectrum Disorder Genomic Architecture and Biology from 71 Risk Loci. Neuron 87, 1215-1233 (2015).
46. S. De Rubeis et al., Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209-215 (2014).
47. D. Pinto et al., Convergence of genes and cellular pathways dysregulated in autism spectrum disorders. American journal of human genetics 94, 677-694 (2014).
48. K. E. Samocha et al., A framework for the interpretation of de novo mutation in human disease. Nature genetics 46, 944-950 (2014).
49. T. J. Nowakowski et al., Spatiotemporal gene expression trajectories reveal developmental hierarchies of the human cortex. Science (New York, N.Y.) 358, 1318-1323 (2017).
50. B. B. Lake et al., Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nature biotechnology 36, 70-80 (2018).
51. Amazon Athena. (2018). https://aws.amazon.com/athena.
52. Autism Genetic Resource Exchange. (2017). http://research.agre.org/.
53. H. Thorvaldsdottir, J. T. Robinson, J. P. Mesirov, Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14, 178-192 (2013).
54. AWS Command Line Interface. (2018). https://aws.amazon.com/cli.
55. Apache Parquet. (2014). http://parquet.apache.org/.
56. C. Raczy et al., Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics (Oxford, England) 29, 2041-2043 (2013).
57. Nirvana-Nimble and Robust Variant Annotator. (2018). https://github.com/Illumina/Nirvana.
58. Elasticsearch. (2018). https://www.elastic.co/.
59. Deciphering Developmental Disorders Study, Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433-438 (2017).
