Bioinformatic Pipeline for Whole Exome Sequence (WES) Analysis
Total Page:16
File Type:pdf, Size:1020Kb
SUPPLEMENT Supplement contents: Supplementary Methods Supplementary Figure 1: Bioinformatic pipeline for whole exome sequence (WES) analysis. Supplementary Figure 2: Pedigrees and HomozygosityMapper output for families with single variants identified, in addition to those shown in Figure 2. Supplementary Figure 3: Spatiotemporal expression of ID genes in human development using RNA sequencing data. Supplementary Figure 4: Developmental expression pattern of ID genes in the human prefrontal cortex. Supplementary Table 1: Family statistics Supplementary Table 2: Homozygosity-by-descent/autozygosity shared regions, as defined using HomozygosityMapper, cross-referenced with FSuite. Supplementary Table 3: Mutations identified per family. A. Single homozygous variant identified. B. Two to four variants identified. C. Dominant/de novo mutation identified. Supplementary Table 4: Pathogenic CNVs and variants of unknown significance identified by microarray analysis. Supplementary Table 5: BioGRID protein interaction and gene ontology analysis. (separate Excel file) Supplementary Table 6: Gene Ontology Pathway analysis. Supplementary Table 7: Gene List for anatomic/temporal transcription analyses. Supplementary Table 8: Top anatomical regions for ID gene expression. Supplementary Methods HBD/Autozygosity mapping Both of the below methods and the hg19 version of the genome were used to ensure a consistent and uniform genotyping. Genotyping data was uploaded to the HomozygosityMapper server to determine putative homozygous-by- descent (HBD) regions based on the allele frequencies of the markers uploaded to the server from previous studies. HBD regions were identified by manual curation and only HBD regions larger than 1 Mb shared between all affected members (and not unaffected members) of the family were chosen. These regions were extracted based on SNP RS numbers and these dbSNP identifiers were converted to a genomic position for used to represent genomic regions with NGS data. Previous methods used to identify HBD regions can vary among individual researchers due to the manual curation necessary to ensure the selected regions share both homozygosity and haploidentity between affecteds but not unaffecteds (i.e. to exclude false positives), and to try to limit false negative calls. These methods are also labour intensive and do not automatically take in family information. For these reason, we used a software called FSuite for computational HBD mapping1. FSuite makes use of allele frequencies, genotyping and pedigree structure integrated into a Hidden Markov model to determine the likelihood of a region being HBD1. In-house developed scripts were used to convert genotyping calls from the Affymetrix Genotyping Console, ChAS and Illumina GenomeStudio output to a family based PLINK input for FSuite. We also used 1000 Genomes Phase Three reference for the allele frequencies, extracting allele frequencies from the South Asian ethnic group specifically. This data was then used with standard run protocols of FSuite to generate HBD regions. Autozygosity mapping was done by extracting regions common between members of the family. These protocols were standardized and implemented on the Specialized Computing Cluster (SCC) at the Centre for Addiction and Mental Health (CAMH). CNV Analysis CNV analysis was performed to identify homozygous CNVs in HBD regions, or heterozygous/ homozygous CNVs that could indicate phenocopies that should then be excluded from the HBD/WES analyses. We have used five array types for homozygosity mapping including the Affymetrix SNP 5.0, SNP 6.0, CytoscanHD and Mapping 250K NspI, as well the Illumina HumanCoreExome platform. Due to the wide range of array types that were used in this study, we have used manufacturer-recommended and third-party software to call CNVs. The Affymetrix Genotyping Console was used to genotype and calls CNVs using the Affymetrix Birdsuite algorithm. CytoscanHD and SNP 6.0 arrays were analyzed and CNVs were called using the Chromosome Analysis Suite (ChAS). Illumina HumanCore Exome arrays were used for the largest proportion of families, and CNVs were called using the Illumina GenomeStudio cnvPartition plugin. PennCNV is an academically developed software that uses a Hidden Markov Method (HMM) to call CNVs2. CNVs were called using PennCNV for all array types and represents a standard method for CNV calling across our study. CNVs derived from all algorithms were compared against the Database of Genomic Variants (DGV) to screen for variants that showed 30% or lower overlap with DGV control variants, or were not in the database3. Sequencing Alignment and Variant Calling Illumina reads were aligned with BWA-mem (v 0.7.13) using the hg19 genome reference4. Once alignment was performed GATK (v 3.2.2) was used for base recalibration, indel realignment and the GATK Unified Genotyper was used for variant calling. Variants called were generated in the standard VCF version 4.1 and annotated with Annovar5, Alamut (v 1.4.4 Interactive Biosoftware, Rouen, France) and EDGC Annotator (v 1.0.0 Eone- Diagnomics Genome Center, Songdo Ichon, South Korea) software. SOLiD sequencing was aligned to the hg19 reference genome using Shrimp26. Once alignment files were generated, variant calling and annotation were performed as previously described. Proton data was aligned using the TMAP algorithm and variants were called using the Ion Proton Variant Caller as part of the Torrent Server analysis pipeline (ThermoFisher Scientific, Waltham, MA, US). Ion Reporter was used in addition to the annotation software mentioned above for Ion Proton VCF files (ThermoFisher Scientific, Waltham, MA, US). Variant Filtration Once annotation was performed, filtration of variants was performed by computationally and manually extracting variants that were in HBD regions provided by the autozygosity mapping of FSuite and HomozygosityMapper. The next step was to filter variants based on sequencing quality, homozygosity of allele and mutation type. Prioritization was performed by scoring truncating mutations such as stop and frameshift loss of function mutations higher than missense mutations and inframe indels. Missense mutations were scored based on Sift and Polyphen scores for prioritization. Allele frequencies were then used to filter out variants that were too common in the population (>1 in 10 000). All variants surviving filtration were checked against the Exome Aggregation Consortium (ExAC), Cambridge, MA (URL: http://exac.broadinstitute.org) accessed 03,2016) OMIM7 and Genecards8 to determine expression level in tissues and other functional characteristics. Anatomical expression analyses To look for commonalities in the anatomical and temporal expression profiles for the genes identified here and previously reported genes for ID (full list is available as eTable 7), we used a number of human expression datasets, tools, and atlases. The TopAnat Bgee tool was used to quantify tissue specific expression patterns before birth in human expression datasets 9.TopAnat uses publically available expression profiles from reference atlases of normal tissue. Within a tissue, it tests for expression enrichment using binary calls of expression (expressed/present versus not). The ID Gene list was converted to Ensembl identifiers using the Ensembl Biomart tool (Ensembl release 84). The embryo developmental stage, RNA-Seq and Affymetrix data source settings were used with the “Data quality” option was set to “All”. We removed redundant tissue terms and only present data from tissues with direct expression information (17 human fetal tissues). Developmental brain expression analyses Human spatiotemporal transcriptomic datasets were downloaded from the BrainSpan website (http://www.brainspan.org/static/download.html, December 2015). In this developmental atlas, age ranged from 8 post conception weeks to 40 years old (46% female). Brain specimens were from normal healthy donors that passed several strict exclusion criteria to ensure consistency. For example, specimens with evidence of malformations, lesions, neuronal loss, neuronal swelling, glioneuronal heterotopias, or dysmorphic neurons and neurites were excluded. Complete details of the methods used in the BrainSpan atlas are available as a technical paper (http://help.brain-map.org/download/attachments/3506181/Transcriptome_Profiling.pdf). Both RNA sequencing and exon microarray gene expression profiles were used to test for specific temporal and regional expression of the ID genes. In both datasets we removed data from brain regions with expression profiles for less than 7 donors (10 regions). These regions are early structures that were dissected at early timepoints (earlier than 14 post conception weeks). After this filtering, 465 expression profiles from 35 brains (27 timepoints) in 16 unique regions were used from the exon array data. In the RNA sequencing data, 553 expression profiles from 41 brains (30 timepoints) in 16 unique regions remained. Expression data was previously summarized to genes by the BrainSpan consortium (exon array: 22,255, RNA sequencing: 17,282 genes). The provided RPKM (Reads Per Kilobase of transcript per Million mapped reads) expression values were log scaled. Human prefrontal cortex expression profiles10 were downloaded from the supplemental files in “Mapping DNA methylation across development, genotype and schizophrenia in the human frontal cortex”11. This microarray dataset contains gene expression profiles of the dorsolateral prefrontal cortex across development.