Report

Customer Name: Demo
Institute: University of ABC
Order Number: 6667
Date Prepared: 2017-06-02

This report contains confidential materials of LC Sciences. The contents of this report are for your personal use only, and you are responsible for keeping them confidential. If the contents of this report are disclosed to any third party or company, LC Sciences will be entitled to take legal action according to the relevant laws and regulations.

Prepared by LC Sciences | www.lcsciences.com | 2575 W Bellort Ave, Houston, TX 77054 | Tel. 888-528-8818, Fax. 713-664-8181

Table of Contents

1 Exome Sequencing Introduction

2 Exome Sequencing Report
  2.1 Disease Information
  2.2 Database Information
  2.3 Data Analysis Program

3 Technical Methods and Processes
  3.1 Experimental Processes
  3.2 Analysis Process

4 Sequencing Data Overview
  4.1 Sample Collection and Grouping Information
  4.2 Sequencing Data Filtering
  4.3 Sequencing Data Quality Control
  4.4 Sequencing Depth/Coverage Distribution
  4.5 Coverage Results

5 Variant Calling Results and Analysis
  5.1 SNP Results
  5.2 SnpEff Annotation of SNP
  5.3 INDEL Results
  5.4 SnpEff Annotation of INDEL

6 Annotation and Filtering of Variants
  6.1 dbSNP Annotation and Filtering
  6.2 1000Genome Annotation and Filtering
  6.3 Coding Region Annotation
  6.4 Protein Function Annotation

7 Quality Control

8 Appendix
  8.1 Materials and Methods
  8.2 Information Analysis

9 References

10 Contact Us


1 Exome Sequencing Introduction

The exon is the part of a eukaryotic gene that is preserved after splicing and can be translated into a peptide sequence. The exome is the sum of all exon regions in the genome and contains the information needed for translation, covering most of the functional variants associated with individual phenotypes. The human genome contains approximately 180,000 exons with a total length of about 30 Mb. The human exome accounts for about 1% of the genome, but is responsible for about 85% of human pathogenic mutations.

Exome sequencing refers to the use of specially designed probes to enrich the protein-coding regions or other specific regions of interest, which are then read by high-throughput sequencing. This greatly improves the efficiency of exome studies and significantly reduces the cost of research. The technology can be used to identify and study the pathogenic genes of Mendelian diseases and of complex diseases such as cancer, diabetes, and obesity, which enables researchers to better explain the pathogenesis of these diseases.

The technical advantages of exome sequencing:
- Cost-effective: Coding-region information across the whole genome can be obtained more economically and efficiently than with whole-genome sequencing. The exon regions are sequenced to a greater depth, so the results are more accurate.
- High detection accuracy: Individual base variations can be identified across the genome-wide set of exons.
- Applicable to large sample sizes: Because exome sequencing is economically efficient, it is better suited to studies with a large number of samples.


2 Exome Sequencing Report

Species Name: Human (Homo sapiens)

2.1 Disease Information

Disease Name: Genetic Disease
Disease Type: Dominant/recessive disorders on autosomal or sex chromosomes

Figure. Demo Family Map


2.2 Database Information

Database               Version   Information
Genome Database        hg19      ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73..toplevel.fa.gz
dbSNP Database         144b      http://www.ncbi.nlm.nih.gov/SNP/
1000Genome Database    V73       http://www.1000genomes.org/
Clinvar Database       144       ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/clinical_vcf_set

2.3 Data Analysis Program

Analysis                     Tool       Version   Description
Data Quality Control         FastQC     0.10.1
                             BWA        0.7.10    Reference Genome Comparison
                             SAMtools   0.1.19    View and Sort Alignment Results
                             Picard     1.119     Merge Sample Bam Results
SNP/INDEL Detection          GATK       3.3.0     Detection and Filtration of SNP and INDEL
SNP/INDEL Gene Annotation    SnpEff     V4.1      Detected Variation Explanation


3 Technical Methods and Processes

3.1 Experimental Processes

The liquid chip capture system (Agilent, CA, USA) is used to efficiently enrich the human exon regions. High-throughput sequencing is performed on the HiSeq 2500/4000 platform. Library construction and capture experiments are carried out using the SureSelect Human All Exon V6 kit (Agilent, CA, USA).

Sample DNA quality assessment requires the amount of DNA to be >= 1.5 ug (Qubit). The agarose gel electrophoresis result should show no degradation and no RNA contamination. Additionally, the OD260/280 ratio measured by Nanodrop should range from 1.8 to 2.0. For samples with a DNA amount < 1 ug, a substitute protocol may be suggested for optimizing the sequencing library.

Genomic DNA is randomly fragmented into small 150-300 bp fragments. Following end-repair and A-tailing at both ends, the fragments are ligated with sequencing adaptors that include specific indices. The library is then hybridized with up to 738,690 biotin-labeled probes so that the exon regions (including the upstream and downstream regions), 58 Mb in total, can be captured using streptavidin beads. After PCR amplification and quality assurance, the library is loaded on the flowcell for sequencing (Figure). Paired-end reads (2 x 150 bp) are obtained for downstream data analysis. A more detailed description of library preparation is available in the supplemental materials.

Figure. Library Construction Experimental Workflow


3.2 Analysis Process

The adaptor, polyN, polyA and other low-complexity sequences are excluded from the raw reads. The remaining valid reads are mapped to the reference genome using BWA (Li H et al. and Kent WJ et al.). The mapping result is saved in the BAM format and then sorted using SAMtools (Li H et al.). Duplicate reads coming from PCR amplification are marked using Picard. Marked duplicate reads are not used for subsequent processing, as they may result in false positive results in the detection of mutations.

After marking duplicate reads, it is necessary to re-align the reads close to the regions reported as insertions/deletions (INDELs) by BWA based on the Compact Idiosyncratic Gapped Alignment Report (CIGAR) value. Mismatches close to an INDEL region reported by BWA may not be accurate because of its alignment algorithm, and they may cause false positive variant calls. Therefore, correction at these sites is required for subsequent SNP and INDEL analysis. The IndelRealigner module in GATK is used to carry out INDEL re-alignment in an effort to minimize the error rate of mismatches near each INDEL site.

Variant calling relies heavily on the quality score of each base reported by the sequencer. For example, suppose the BWA aligner reports a mismatch at a base whose quality is Q25. The probability that this mismatch was caused by a sequencing error is then still about 0.3% (Q20 corresponds to 1%), which may heavily affect the reliability of downstream analysis. Additionally, the sequencing quality at the 3' end is generally lower than the quality at the 5' end due to reagent depletion, and the quality of A/C is often lower than that of T/G. Therefore, it is necessary to recalibrate the base quality using the BaseRecalibrator module in GATK so that the quality scores of the sequence are more reliable. Note: The reads in one sample are supposed to come from the same lane for base recalibration. Otherwise, reads from different lanes need to be recalibrated separately.
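
The relationship between a base quality score and the sequencing error probability can be sketched as follows. This is a minimal illustration of the standard Phred definition (Q = -10 log10 P), not part of the analysis pipeline itself; the function name is hypothetical.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Print the error probability for a few commonly quoted quality scores.
for q in (20, 25, 30):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):.4f}")
```

A higher quality score therefore means a lower chance that the reported base is wrong, which is why recalibrated scores matter for variant calling.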

After the steps mentioned above, variant calling is performed by the UnifiedGenotyper or the HaplotypeCaller module in GATK. The UnifiedGenotyper module does not consider the impact of adjacent bases when making a call, whereas the HaplotypeCaller module makes the call based on a local de novo model. The HaplotypeCaller module first builds a De Bruijn graph and then applies a PairHMM model to predict haplotypes and make the variant call.

In dealing with such typical large-scale Bayesian inference problems for a large quantity of samples, the high-sensitivity model may produce more false positive results. Therefore, it is important to perform further corrections on the variant calling results (variant recalibration).


In general, real mutation sites are clustered together by the variant calling model. These clusters should fit a Gaussian distribution. Therefore, the VariantRecalibrator module in GATK uses a Gaussian mixture model to correct the false positive calls and find the true mutation sites.

It is well known that mutations in the coding region may be critical and cause diseases. Therefore, it is important to annotate the biological function of each mutation site. We used the SnpEff program (officially recommended by GATK) to examine structural changes at the mutation sites and further sort out the candidate regions leading to the disease. The overall flowchart of data analysis is as follows:

Flowchart of data analysis

Note: The Genome Analysis Toolkit (GATK) is developed by the Broad Institute for second-generation sequencing data analysis. It contains a variety of tools mainly used for variant calling. Data quality assurance is highly emphasized to reduce false positive results. The program has a powerful architecture, a powerful processing engine and high-performance computing capabilities that make it suitable for projects of any size. At present, GATK together with the mapping software BWA has become the most mainstream analysis pipeline for whole genome and exome sequencing.


4 Sequencing Data Overview

4.1 Sample Collection and Grouping Information

Group      Sample
sampleA    sampleA
sampleB    sampleB

A complete table can be found in summary/1 RawData/sample info mendelian.xlsx

4.2 Sequencing Data Filtering

Paired-end raw reads were obtained from the high-throughput sequencer. They may contain sequencing adaptors and low-quality reads. To ensure accurate analysis results, it is necessary to preprocess the raw reads and obtain valid data for subsequent analysis.

The preprocessing steps are as follows:
(1) Adapter removal
(2) Removal of reads containing N for more than 5% of the bases
(3) Removal of low quality reads with the quality score less than 10 for more than 20% of the bases
(4) Removal of reads after a comprehensive evaluation with Q20, Q30, and the GC content
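
Filters (2) and (3) above can be sketched as a simple per-read check. This is an illustrative sketch, not the pipeline's actual implementation: the function names are hypothetical, quality strings are assumed to use the Phred+33 encoding, and adapter removal and the step (4) evaluation are not covered.

```python
from typing import Sequence


def decode_phred33(quality_string: str) -> list:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def keep_read(seq: str, quals: Sequence[int],
              max_n_frac: float = 0.05,
              min_qual: int = 10,
              max_low_qual_frac: float = 0.20) -> bool:
    """Return True if a read passes the N-content and low-quality filters."""
    if not seq or len(seq) != len(quals):
        return False
    # Filter (2): discard reads where more than 5% of the bases are N.
    n_frac = seq.upper().count("N") / len(seq)
    # Filter (3): discard reads where more than 20% of the bases have Q < 10.
    low_frac = sum(q < min_qual for q in quals) / len(quals)
    return n_frac <= max_n_frac and low_frac <= max_low_qual_frac


print(keep_read("ACGT" * 25, [30] * 100))             # high-quality read
print(keep_read("N" * 6 + "A" * 94, [30] * 100))      # 6% N
print(keep_read("A" * 100, [5] * 21 + [30] * 79))     # 21% low-quality bases
```

For paired-end data, one would typically apply such a check to both mates and drop the pair if either fails; whether the pipeline does this is not stated in this report.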


4.3 Sequencing Data Quality Control

For paired-end sequencing (PE150), the percentage of bases with a quality score greater than 20 should be more than 90%, and the percentage of bases with a quality score greater than 30 should be more than 85%.

Table. Summary of sequencing quality control

Metric             sampleA      sampleB
Raw Data/Read      100053676    98312530
Raw Data/Base      15.01G       14.75G
Valid Data/Read    98910722     96767964
Valid Data/Base    14.84G       14.52G
Raw depth(x)       258.79       254.31
Valid%             98.86        98.43
Q20%               98.51        97.24
Q30%               96.28        93.08
GC%                47.51        48.39

Terminology:

Term               Annotation
Sample             Sequencing library name
Raw Data/Read      Number of reads obtained from the sequencer
Raw Data/Base      Number of bases in billions (Giga) obtained from the sequencer
Valid Data/Read    Number of reads after preprocessing
Valid Data/Base    Number of bases in billions (Giga) after preprocessing
Raw Depth          Raw number of bases divided by the capture size of the Agilent kit (58M)
Valid Ratio%       Percentage of the processed reads (Valid) relative to the raw reads (Raw)
Q20%               Percentage of bases with a quality score greater than 20
Q30%               Percentage of bases with a quality score greater than 30
GC count%          Percentage of the GC content in raw reads

A complete table can be found in summary/1 RawData/ReadsQC.xlsx
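
The raw depth and valid ratio defined in the terminology can be reproduced from the table values. The sketch below only restates that arithmetic; the function and constant names are hypothetical.

```python
# Capture size of the Agilent kit in bases, as stated in the terminology (58M).
CAPTURE_SIZE_BASES = 58e6


def raw_depth(raw_bases: float, capture_size: float = CAPTURE_SIZE_BASES) -> float:
    """Raw depth = raw number of bases / capture size."""
    return raw_bases / capture_size


def valid_ratio_pct(valid_reads: int, raw_reads: int) -> float:
    """Valid ratio = valid reads as a percentage of raw reads."""
    return 100.0 * valid_reads / raw_reads


# sampleA values taken from the quality control table above.
print(f"sampleA raw depth: {raw_depth(15.01e9):.2f}x")
print(f"sampleA valid ratio: {valid_ratio_pct(98910722, 100053676):.2f}%")
```

With the sampleA values this gives roughly 258.79x and 98.86%, matching the table.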

After preprocessing, the average number of valid bases per sample is 14.68G. All samples have more than 14.5G valid bases and Q30 > 90%, and therefore meet the criteria for downstream analysis.


4.4 Sequencing Depth/Coverage Distribution

Sequencing coverage/depth is estimated from the number of reads mapped to exons. The mapping rate of reads from a human sample is usually more than 95%. A variant call at a site with coverage/depth higher than 10X is more reliable.

Figure. Sequencing Depth Graph

The abscissa indicates the sequencing depth, and the ordinate indicates the ratio of bases at the corresponding depth to all bases. The graph on the right shows the cumulative base ratio (ordinate) at different depths (abscissa). For example, a cumulative depth of 50X corresponding to a base ratio of about 95% indicates that about 95% of the bases have a sequencing depth greater than 50X.

The file of the figure can be found in summary/2 MappedData/ReadsDepthCoverage.png


Figure. The average sequencing coverage/depth on each chromosome

The abscissa indicates the chromosomes, and the ordinate represents the average depth. The average depth is calculated as (number of mapped reads x length of the covered region) / total length of the exon region on each chromosome.

The file of the figure can be found in summary/2 MappedData/DepthCoverageByChr.png


4.5 Coverage Results

Term                                      sampleA                sampleB
Total                                     100053676 (100.00%)    98312530 (100.00%)
Duplicate                                 11323351 (11.32%)      9049250 (9.20%)
Mapped                                    95830841 (95.78%)      91689785 (93.26%)
TARGET TERRITORY                          60700153               60700153
NEAR AMPLICON BASES                       2500942993             2061476018
NEAR AMPLICON BASES + TARGET TERRITORY    2561643146             2122176171
PF UQ READS ALIGNED                       95830841               91689785
ON AMPLICON BASES                         7066308762             5996686766
MEAN TARGET COVERAGE                      118.66                 100.58
PCT TARGET BASES 30X                      93.23%                 90.81%
PCT TARGET BASES 20X                      95.47%                 94.66%
PCT TARGET BASES 10X                      96.87%                 96.81%
PCT TARGET BASES 2X                       97.61%                 97.71%

Terminology:

Term                    Annotation
Total                   Total number of raw reads (read 1 + read 2)
Duplicate               Number of duplicate reads
Mapped                  Number of reads mapped to the reference genome
TARGET TERRITORY        Number of unique bases covered by the intervals of all targets that should be covered
NEAR AMPLICON BASES     Number of PF aligned bases that mapped to within a fixed interval of an amplified region, but not on a baited region
PF UQ READS ALIGNED     Number of PF unique reads that are aligned with mapping score > 0 to the reference genome
ON AMPLICON BASES       Number of PF aligned bases that mapped to an amplified region of the genome
MEAN TARGET COVERAGE    Mean coverage of targets that received a coverage depth of at least 2 at one base
PCT TARGET BASES 30X    Percentage of all target bases achieving 30X or greater coverage
PCT TARGET BASES 20X    Percentage of all target bases achieving 20X or greater coverage
PCT TARGET BASES 10X    Percentage of all target bases achieving 10X or greater coverage
PCT TARGET BASES 2X     Percentage of all target bases achieving 2X or greater coverage

A complete table can be found in summary/2 MappedData/MappedStatistics.xlsx
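
The percentages and the combined row of the coverage table follow directly from the raw counts. The sketch below recomputes them for both samples as a consistency check; the dictionary keys and function name are hypothetical labels, and the numbers are copied from the table above.

```python
coverage = {
    "sampleA": {"total": 100053676, "duplicate": 11323351, "mapped": 95830841,
                "target_territory": 60700153, "near_amplicon_bases": 2500942993},
    "sampleB": {"total": 98312530, "duplicate": 9049250, "mapped": 91689785,
                "target_territory": 60700153, "near_amplicon_bases": 2061476018},
}


def pct(part: int, total: int) -> float:
    """Express part as a percentage of total."""
    return 100.0 * part / total


for name, c in coverage.items():
    dup = pct(c["duplicate"], c["total"])
    mapped = pct(c["mapped"], c["total"])
    near_plus_target = c["near_amplicon_bases"] + c["target_territory"]
    print(f"{name}: duplicate {dup:.2f}%, mapped {mapped:.2f}%, "
          f"near amplicon + target territory {near_plus_target}")
```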


Figure. Read coverage of each sample

The file of the figure can be found in summary/2 MappedData/DepthCoverageByTarget.png

The average coverage depth of each sample is more than 100X. The average coverage depth of each chromosome is more than 100X. More than 94% of the target bases in each sample are covered at 20X or greater, with good uniformity for downstream analysis.


5 Variant Calling Results and Analysis

Single Nucleotide Polymorphism (SNP) refers to a single nucleotide variation in the genome which leads to the formation of a genetic marker. Variations at individual nucleotides in the genome include substitutions, deletions, and insertions. Depending on the structure of the nucleotide base, a substitution can be classified as a transition (C to T, G to A) or a transversion (C to A, G to T, C to G, A to T). SNPs appear most frequently in the CG islands. C in the CG islands tends to be methylated during histone modification. Methylated C then turns into T through spontaneous deamination. In general, a SNP refers to a single nucleotide variation present in more than 1% of a population.
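
The distinction between transitions and transversions can be expressed compactly: a transition stays within the purines (A, G) or within the pyrimidines (C, T), while a transversion switches between the two classes. The following is a minimal sketch with a hypothetical function name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for ref, alt in [("C", "T"), ("G", "A"), ("C", "A"), ("A", "T")]:
    print(f"{ref} to {alt}: {classify_substitution(ref, alt)}")
```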

SNPs may fall within coding regions of genes, non-coding regions of genes, or intergenic regions. SNPs within non-coding regions may still affect alternative splicing, transcription factor binding, mRNA degradation or non-coding RNA sequences. A SNP of this type that affects gene expression is known as an expression SNP (eSNP) and may occur in the upstream or downstream region of the gene. SNPs within the coding region of the gene (cSNPs) are less common, and the variation rate in the exome is only 1/5 of the variation rate in the surrounding regions. However, they are more significantly correlated with the development of genetic diseases. From the perspective of a genetic trait, cSNPs can be classified into two types: synonymous and nonsynonymous cSNPs. Synonymous cSNPs are changes in the coding region that do not affect the amino acid sequence of the translated protein. Nonsynonymous cSNPs are variations that result in protein sequence changes and may therefore change the function of the protein. This change is often implicated as the direct cause of changes in biological traits. About half of cSNPs are nonsynonymous.

Insertion-Deletion (INDEL) refers to the insertion or the deletion of a small fragment (one or more bases) relative to the reference genome. INDELs may fall within the coding regions or the noncoding regions of genes. INDELs within the coding regions may cause structural and functional changes of the protein. If one or several bases (not a multiple of three) are inserted or deleted within the coding region, the mutation is called a frame shift mutation. Such mutations cause changes in the downstream amino acid sequence. INDELs within the noncoding regions (e.g., intron regions) may reduce the efficiency of transcription and the accuracy of alternative splicing. Moreover, the occurrence of INDELs is also one of the main causes of evolution. For species with a relatively close genetic relationship, the major cause of species divergence is INDELs. In general, the longer the genetic distance between species, the more INDEL fragments there are and the longer they are.
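
The multiple-of-three rule for frame shift mutations can be illustrated with the REF and ALT alleles of an INDEL. This is a simplified sketch with a hypothetical function name; it looks only at the net length change and assumes the variant lies entirely within a coding region, whereas an annotation tool such as SnpEff also takes the transcript structure into account.

```python
def indel_frame_effect(ref: str, alt: str) -> str:
    """Classify a coding-region INDEL by its net length change."""
    delta = len(alt) - len(ref)
    if delta == 0:
        return "no length change"
    # A net change that is a multiple of three keeps the reading frame intact.
    return "in-frame" if delta % 3 == 0 else "frameshift"


print(indel_frame_effect("A", "AT"))      # 1-base insertion
print(indel_frame_effect("ACGT", "A"))    # 3-base deletion
print(indel_frame_effect("ACG", "A"))     # 2-base deletion
```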

SNP and INDEL sites on the genome are analyzed using GATK. After variant calling, SnpEff is used for mutation site annotation.


5.1 SNP Results

Table. SNP location classification and annotation

SNV Class                sampleA    sampleB
NON SYNONYMOUS CODING    30352      27992
START GAINED             781        729
START LOST               57         39
STOP GAINED              280        295
STOP LOST                29         33
SYNONYMOUS CODING        38200      31879

SNV Pos                  sampleA    sampleB
DOWNSTREAM               71305      66760
INTRON                   137844     127846
SPLICE SITE ACCEPTOR     111        122
UPSTREAM                 56138      52787
UTR 3 PRIME              10858      10034
UTR 5 PRIME              6331       5948

Terminology:

Term                     Annotation

SNV Class
NON SYNONYMOUS CODING    Number of variants causing a codon that produces a different amino acid
SYNONYMOUS CODING        Number of variants causing a codon that produces the same amino acid
STOP GAINED              Number of variants causing a STOP codon
STOP LOST                Number of variants causing a stop codon to be mutated into a non-stop codon
START GAINED             Number of variants causing a START codon
START LOST               Number of variants causing a start codon to be mutated into a non-start codon

SNV Pos
DOWNSTREAM               Downstream 1K bases from the stop codon
EXON                     Exon region
INTERGENIC               Intergenic region
INTRON                   Intron region
SPLICING                 Region within 10 bases from the splicing junction
UPSTREAM                 Upstream 1K bases from the start codon
UTR3 PRIME               3'-UTR region
UTR5 PRIME               5'-UTR region

A complete table can be found in summary/3 VariantData/SNP INDEL PositionType VariantsType.xlsx


Note: The effect of mutations on the genome is as follows

High-Impact Effects:
  SPLICE SITE ACCEPTOR
  SPLICE SITE DONOR
  START LOST
  EXON DELETED
  FRAME SHIFT
  STOP GAINED
  STOP LOST

Moderate-Impact Effects:
  NON SYNONYMOUS CODING
  CODON CHANGE (note: this effect is used by SnpEff only for MNPs, not SNPs)
  CODON INSERTION
  CODON CHANGE PLUS CODON INSERTION
  CODON DELETION
  CODON CHANGE PLUS CODON DELETION
  UTR 5 DELETED
  UTR 3 DELETED

Low-Impact Effects:
  SYNONYMOUS START
  NON SYNONYMOUS START
  START GAINED
  SYNONYMOUS CODING
  SYNONYMOUS STOP
  NON SYNONYMOUS STOP
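
When screening annotated variants, the classification above can be used as a lookup table. The sketch below is illustrative only: the function name is hypothetical, effect names are assumed to be joined with underscores in the annotation output, and effects that are not listed above are simply reported as "NOT_LISTED".

```python
IMPACT_TABLE = {
    "HIGH": [
        "SPLICE_SITE_ACCEPTOR", "SPLICE_SITE_DONOR", "START_LOST",
        "EXON_DELETED", "FRAME_SHIFT", "STOP_GAINED", "STOP_LOST",
    ],
    "MODERATE": [
        "NON_SYNONYMOUS_CODING", "CODON_CHANGE", "CODON_INSERTION",
        "CODON_CHANGE_PLUS_CODON_INSERTION", "CODON_DELETION",
        "CODON_CHANGE_PLUS_CODON_DELETION", "UTR_5_DELETED", "UTR_3_DELETED",
    ],
    "LOW": [
        "SYNONYMOUS_START", "NON_SYNONYMOUS_START", "START_GAINED",
        "SYNONYMOUS_CODING", "SYNONYMOUS_STOP", "NON_SYNONYMOUS_STOP",
    ],
}

# Invert the table so that each effect name maps to its impact level.
EFFECT_TO_IMPACT = {
    effect: level for level, effects in IMPACT_TABLE.items() for effect in effects
}


def impact_of(effect: str) -> str:
    """Return the impact level of an effect name, or 'NOT_LISTED' if unknown."""
    key = effect.strip().upper().replace(" ", "_")
    return EFFECT_TO_IMPACT.get(key, "NOT_LISTED")


print(impact_of("STOP GAINED"))
print(impact_of("SYNONYMOUS_CODING"))
```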


Figure. The distribution of SNPs in different categories (left) and the distribution of SNPs (right) in different regions of the genome

The pie chart on the left shows the distribution of SNPs in different categories as defined in the previous table. The pie chart on the right shows the distribution of SNPs in different locations as defined in the previous table.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.SNV.png


Table. Summary of SNP types

Term                sampleA    sampleB
all                 55608      51936
genotype.Het        35408      32751
genotype.Hom        20200      19185
novel               3545       2527
in dbSNP            52063      49409
novel proportion    0.06       0.05
dbSNP proportion    0.94       0.95
Ts                  39295      36785
Tv                  16354      15193
novel.Ts            2303       1599
novel.Tv            1242       928

Terminology:

Term                Annotation
Sample              Sample name
novel ts            Number of novel transitions (not annotated in dbSNP)
novel tv            Number of novel transversions
ts                  Total number of transitions
tv                  Total number of transversions
all                 Number of all SNPs
genotype.Het        Number of SNPs of heterozygous genotypes
genotype.Hom        Number of SNPs of homozygous genotypes
novel               Number of novel SNPs
novel proportion    Proportion of novel SNPs

A complete table can be found in summary/3 VariantData/VariantsType SNP.xlsx
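
The summary ratios commonly derived from these counts (ts/tv, Het rate, dbSNP rate) can be recomputed from the table. The sketch below uses hypothetical dictionary keys and function names, with the counts copied from the table above.

```python
snp_summary = {
    "sampleA": {"all": 55608, "het": 35408, "in_dbsnp": 52063,
                "ts": 39295, "tv": 16354, "novel_ts": 2303, "novel_tv": 1242},
    "sampleB": {"all": 51936, "het": 32751, "in_dbsnp": 49409,
                "ts": 36785, "tv": 15193, "novel_ts": 1599, "novel_tv": 928},
}


def ts_tv_ratio(ts: int, tv: int) -> float:
    """Ratio of transitions to transversions."""
    return ts / tv


def rate(part: int, total: int) -> float:
    """Fraction of SNPs that fall in a given category."""
    return part / total


for name, s in snp_summary.items():
    print(f"{name}: ts/tv = {ts_tv_ratio(s['ts'], s['tv']):.2f}, "
          f"novel ts/tv = {ts_tv_ratio(s['novel_ts'], s['novel_tv']):.2f}, "
          f"Het rate = {rate(s['het'], s['all']):.2f}, "
          f"dbSNP rate = {rate(s['in_dbsnp'], s['all']):.2f}")
```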


Figure. Comparison of SNP types

The first bar graph shows the number of heterozygous (Het) and homozygous (Hom) SNPs in one sample. The Het rate represents the percentage of heterozygous SNPs among all SNPs. The second bar graph shows the number of annotated (dbSNP) and novel SNPs. The dbSNP rate represents the percentage of annotated SNPs among all SNPs. The third bar graph shows the number of transitions (ts) and transversions (tv). The value of ts/tv is the ratio of transitions to transversions. The fourth bar graph shows the number of transitions (ts) and transversions (tv) in novel SNPs. The value of ts/tv is the ratio of transitions to transversions.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.SNP VariantsType.png


5.2 SnpEff Annotation of SNP

SNPs identified from sequencing are annotated using SnpEff. The annotations are presented in the VCF4.1 format for each sample. The VCF files can be found in summary/3 VariantData/ for each sample, e.g., summary/3 VariantData/sampleA/sampleA.snp.annotation.fixed.function.vcf

Terminology:

Term      Description
CHROM     chromosome id
POS       chromosome position
ID        ID
REF       reference allele
ALT       alternative allele
QUAL      quality
FILTER    filter
INFO      information
AD        Allelic depths
DP        Approximate read depth
GQ        Genotype Quality
GT        Genotype
PL        Phred-scaled likelihoods of the given genotypes

A complete description of the terms in the VCF file can be found in summary/3 VariantData/readme.xlsx
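
To show how these fields fit together, the sketch below splits one tab-separated VCF data line into the fixed columns and the per-sample genotype fields. The example record is made up for illustration and is not taken from the report data; the function name is hypothetical, and a dedicated VCF library would normally be used for real files.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_record(line: str, sample_names: list) -> dict:
    """Split one VCF data line into fixed columns and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    # The FORMAT column names the colon-separated keys of each sample column.
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


# Hypothetical record: a heterozygous C>T call with 30 reference and 28 alternative reads.
example = "1\t12345\t.\tC\tT\t812.77\tPASS\tDP=60\tGT:AD:DP:GQ:PL\t0/1:30,28:58:99:841,0,901"
rec = parse_vcf_record(example, ["sampleA"])
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["samples"]["sampleA"]["GT"])
```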


5.3 INDEL Results

Table. INDEL location classification and annotation

INDEL Class                          sampleA    sampleB
CODON CHANGE PLUS CODON DELETION     187        177
CODON CHANGE PLUS CODON INSERTION    126        142
CODON DELETION                       84         78
CODON INSERTION                      137        144
FRAME SHIFT                          310        357
FRAME SHIFT + STOP GAINED            16         18

INDEL Pos                            sampleA    sampleB
DOWNSTREAM                           5474       5171
INTRON                               15091      13373
SPLICE SITE ACCEPTOR                 101        97
SPLICE SITE DONOR                    17         14
UPSTREAM                             4428       4152
UTR 3 PRIME                          618        641

Terminology:

Term                                 Annotation

INDEL Class
CODON CHANGE PLUS CODON DELETION     One codon is changed and one or more codons are deleted
CODON CHANGE PLUS CODON INSERTION    One codon is changed and one or more codons are inserted
CODON DELETION                       One or more codons are deleted
CODON INSERTION                      One or more codons are inserted
FRAME SHIFT                          Insertion or deletion causes a frame shift
FRAME SHIFT + STOP GAINED            Insertion or deletion causes a frame shift and a STOP codon

INDEL Pos
DOWNSTREAM                           Downstream 1K bases from the stop codon
EXON                                 Exon region
INTERGENIC                           Intergenic region
INTRON                               Intron region
SPLICING                             Region within 10 bases from the splicing junction
UPSTREAM                             Upstream 1K bases from the start codon
UTR3 PRIME                           3'-UTR region
UTR5 PRIME                           5'-UTR region

A complete table can be found in summary/3 VariantData/SNP INDEL PositionType VariantsType.xlsx


Figure. The distribution of INDELs in different categories (left) and the distribution of INDELs (right) in different regions of the genome

The pie chart on the left shows the distribution of INDELs in different categories as defined in the previous table. The pie chart on the right shows the distribution of INDELs in different locations as defined in the previous table.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.INDEL.png


Table. Summary of INDEL types

Term                sampleA    sampleB
all                 3846       3469
genotype.Het        2231       1904
genotype.Hom        1615       1565
novel               436        279
in dbSNP            3410       3190
novel proportion    -          -
dbSNP proportion    -          -

Terminology:

Term                Description
Sample              Sample name
all                 Number of all INDELs
genotype.Het        Number of heterozygous INDELs
genotype.Hom        Number of homozygous INDELs
novel               Number of novel INDELs
novel proportion    Proportion of novel INDELs
dbSNP proportion    Proportion of INDELs annotated in dbSNP

A complete table can be found in summary/3 VariantData/VariantsType INDEL.xlsx


Figure. Comparison of INDEL types

The first bar graph shows the number of heterozygous (Het) and homozygous (Hom) INDELs in one sample. The Het rate represents the percentage of heterozygous INDELs among all INDELs. The second bar graph shows the number of annotated (dbSNP) and novel INDELs. The dbSNP rate represents the percentage of annotated INDELs among all INDELs.

The file of the figure can be found in summary/3 VariantData/sampleA/sampleA.INDEL VariantsType.png


5.4 SnpEff Annotation of INDEL

INDELs identified from sequencing are annotated using SnpEff. The annotations are presented in the VCF4.1 format for each sample. The VCF files can be found in summary/3 VariantData/ for each sample, e.g., summary/3 VariantData/sampleA/sampleA.indel.annotation.fixed.function.vcf

Terminology:

Term      Description
CHROM     chromosome id
POS       chromosome position
ID        ID
REF       reference allele
ALT       alternative allele
QUAL      quality
FILTER    filter
INFO      information
AD        Allelic depths
DP        Approximate read depth
GQ        Genotype Quality
GT        Genotype
PL        Phred-scaled likelihoods of the given genotypes

A complete description of the terms in the VCF file can be found in summary/3 VariantData/readme.xlsx


6 Annotation and Filtering of Variants

6.1 dbSNP Annotation and Filtering

The Single Nucleotide Polymorphism Database (dbSNP; http://www.ncbi.nlm.nih.gov/SNP/) is developed by NCBI and the National Human Genome Research Institute. The database contains annotations of single base substitutions and short insertions/deletions for multiple organisms. Each SNP is annotated by an index starting with 'rs'. High frequency mutations in normal people are usually not pathogenic, so we annotate and filter out the high frequency mutations that have been included in dbSNP and retain the mutation sites that are not annotated in dbSNP for downstream analysis.

The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/sampleA.snp.dbSNP.xlsx

6.2 1000Genome Annotation and Filtering

The 1000 Genomes Project (1000Genome) was launched on January 22, 2008, with an overall target of 1,200 people. It was designed to draw the most detailed and most valuable map of human genetic polymorphism. The new map allows researchers to pin down disease-related genetic variants more quickly, enabling the use of genetic information to develop new strategies for the diagnosis, treatment, and prevention of common diseases. The project includes genetic data from Yoruba in the Ibadan region of Nigeria, Japanese living in Tokyo, Chinese living in Beijing, the descendants of Scandinavia in Western Europe and Utah, Luhya of Webuye, Maasai of Kinyawa, Toscani residents of Italy, Gujarati Indians living in Houston, Chinese people living in Denver, Mexican descendants living in Los Angeles and African descendants living in the southwestern United States. Mutations in 1000Genome with a minor allele frequency (MAF) greater than 5% are filtered out, and the mutation sites with a MAF less than 5% are retained for downstream analysis.

The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/sampleA.snp.dbSNP.KGenome.xlsx
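The MAF filter can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's implementation: the input is taken to be the alternate allele frequency of a biallelic site, a variant absent from 1000Genome is represented by `None` and is retained, and a MAF of exactly 5% (a case the text leaves open) is treated as filtered out. All names are hypothetical.

```python
MAF_CUTOFF = 0.05  # 5% threshold stated in the text


def minor_allele_frequency(alt_af: float) -> float:
    """MAF of a biallelic site: the frequency of the rarer of the two alleles."""
    return min(alt_af, 1.0 - alt_af)


def passes_maf_filter(alt_af, cutoff: float = MAF_CUTOFF) -> bool:
    """Retain a variant when it is absent from 1000Genome or its MAF is below the cutoff.

    Assumption: alt_af is None when the variant has no 1000Genome record.
    """
    if alt_af is None:
        return True
    return minor_allele_frequency(alt_af) < cutoff


for af in (0.01, 0.30, None):
    print(af, passes_maf_filter(af))
```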

6.3 Coding Region Annotation

Variants located within the coding region, or within 10 bases upstream or downstream of a splicing junction, are retained as candidate sites that may cause disease.

The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/sampleA.snp.dbSNP.KGenome.func.xlsx
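The positional rule above can be illustrated with a short sketch. It assumes coding exons are given as 1-based inclusive `(start, end)` intervals and, as a simplification, treats every exon boundary as a splicing junction (which is not strictly true for the outer ends of the first and last exons). The function name and the coordinates are hypothetical.

```python
SPLICE_WINDOW = 10  # bases on either side of a splicing junction, per the text


def in_candidate_region(pos: int, exons, window: int = SPLICE_WINDOW) -> bool:
    """Return True when pos lies inside a coding exon or within `window`
    bases outside one of its boundaries."""
    for start, end in exons:
        if start - window <= pos <= end + window:
            return True
    return False


# Made-up exon coordinates, for illustration only.
example_exons = [(100, 200), (300, 400)]

for position in (150, 205, 211, 289, 290):
    print(position, in_candidate_region(position, example_exons))
```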

6.4 Protein Function Annotation

The effect of amino acid substitutions on protein function is predicted using the SIFT program (http://sift.jcvi.org/). It determines whether an amino acid substitution is functionally neutral or deleterious. The program reports a standardized score ranging from 0 to 1. A score greater than 0.05 indicates that the mutation is tolerable; in other words, the mutation has little or no effect on protein function. A score less than 0.05 suggests that the mutation is harmful, that is, the mutation has a greater impact on protein function.

Polymorphism Phenotyping (PolyPhen2; http://genetics.bwh.harvard.edu/pph2/) is another tool for predicting the effect of amino acid substitutions on protein structure and function. Its results consist of three parts: Query, Prediction and Details. The Query section contains the query information, similar to the input file. The Prediction section shows the predicted results. The Details section shows the details of the PolyPhen prediction, including all data information. A PolyPhen2 value greater than 0.95 indicates that the mutation site has a great impact on gene function.

The annotation results can be found in summary/4 VariantMultiAnno/sampleA/SNP/sampleA.snp.dbSNP.KGenome.func.syn.xlsx
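The score thresholds quoted in this section can be applied as in the sketch below. It is only an illustration of the cutoffs: the function names and output labels are hypothetical, and because the text does not say how a SIFT score of exactly 0.05 is handled, the sketch assumes it is treated as harmful.

```python
SIFT_CUTOFF = 0.05       # below this the substitution is considered harmful
POLYPHEN2_CUTOFF = 0.95  # above this the site is considered high impact


def classify_sift(score: float) -> str:
    """Label a SIFT score; exactly 0.05 is assumed to be harmful."""
    return "tolerable" if score > SIFT_CUTOFF else "harmful"


def polyphen2_high_impact(score: float) -> bool:
    """True when the PolyPhen2 value exceeds 0.95."""
    return score > POLYPHEN2_CUTOFF


def summarize(sift_score: float, polyphen2_score: float) -> dict:
    """Combine both predictions for one amino acid substitution."""
    return {
        "sift": classify_sift(sift_score),
        "polyphen2_high_impact": polyphen2_high_impact(polyphen2_score),
    }


print(summarize(0.01, 0.99))
print(summarize(0.40, 0.10))
```

Note that the two scales run in opposite directions: a low SIFT score and a high PolyPhen2 value both point to a damaging substitution.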

7 Quality Control

The original image data obtained on Illumina HiSeq 2500 were processed for base calling and then converted to raw reads stored in the FASTQ format. The FASTQ format contains the sequence information and the corresponding base quality. The sequencing error rate per base increases along with the length of the sequencing read due to depletion of chemical reagents at the end of the sequencing cycles. This phenomenon is common in the Illumina platform (Erlich and Mitra, 2008).

Quality score per base

The figure above shows the quality score per base. A quality score of 20 (Q20) indicates a sequencing error rate of 1%, and a quality score of 30 (Q30) indicates a sequencing error rate of 0.1%. The figure below shows the sequence content per base. Ideally, the percentage of each base should be approximately equal at every position and show no bias with position.

Sequence content per base

The file of the figure can be found in summary/6 Quality Control/1 fastQC
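The Q20 and Q30 figures quoted above follow from the standard Phred relation Q = -10 log10(P), where P is the probability that a base call is wrong. The sketch below converts between the two and decodes a FASTQ quality string. The ASCII offset of 33 is an assumption about the encoding of the delivered FASTQ files, and the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string into per-base Phred scores.

    Assumption: Phred+33 encoding.
    """
    return [ord(char) - offset for char in quality_string]


for q in (20, 30):
    print(f"Q{q}: error rate = {phred_to_error_prob(q):.4%}")
print(decode_fastq_quality("5?I"))
```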

8 Appendix

8.1 Materials and Methods

Materials for Sample Quality Control
Instrument: Bioanalyzer 2100 (Agilent, CA, USA)
DNA Quality Assurance Kit: RNA 6000 Nano LabChip Kit (Agilent, CA, USA)

Materials for Sequencing Library Quality Control
Instrument: Bioanalyzer 2100 (Agilent, CA, USA)
Sequencing Library Quality Assurance Kit: High Sensitivity DNA Chip Kit (Agilent, CA, USA)

Sequencing
The sequencing library was loaded onto the flow cell to generate clusters on Illumina’s Cluster Station. In each sequencing cycle, a fluorescence signal was recorded to detect one base. Paired-end reads of 150 bases were obtained. Read 1 starts from the 5’ end of the insert and read 2 starts from the 3’ end of the insert. Depending on the insert size, paired-end reads can cover the sequence from both ends. In addition, the distance between the two reads can be estimated for downstream mapping and assembly.

8.2 Information Analysis

Sequence and primary analysis
A DNA library from human samples was sequenced on the Illumina HiSeq2500/4000 platform. Millions of paired-end reads of 150 bases were obtained. This yields an average of 195.68G bases per sample to cover the human exome (50Mb).

Alignment and duplicate marking
For the alignment step, BWA is used to align the reads contained in the paired FASTQ files to the reference genome. For the first post-alignment processing step, Picard tools is used to identify and mark duplicate reads in the BAM file.

Local realignment around INDELs
In the second post-alignment processing step, local read realignment is performed to correct for potential alignment errors around indels. Mapping of reads around the edges of indels often results in misaligned bases, thus creating false positive SNP calls. Local realignment uses these mismatched bases to determine if a site should be realigned and applies a computationally intensive algorithm to determine the most consistent placement of the reads with respect to the indel and remove misalignment artifacts.

Base quality score recalibration
Each base of each read has an associated quality score, corresponding to the probability of a sequencing error. Owing to systematic biases, the reported quality scores are known to be inaccurate and must therefore be recalibrated prior to genotyping. After recalibration, the quality scores in the output BAM correspond more closely to the probability of a sequencing error.
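The idea behind recalibration can be illustrated with a toy example: group bases by their reported quality score, count how often they disagree with the reference, and convert the observed mismatch rate back into a quality score. This is a deliberately simplified sketch and not the GATK procedure, which also conditions on further covariates such as read group, machine cycle and sequence context and excludes known variant sites. The add-one smoothing, the function name and the input layout are assumptions.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality score to an empirically observed one.

    observations: iterable of (reported_q, is_mismatch) pairs, one per base.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        if is_mismatch:
            errors[reported_q] += 1

    table = {}
    for reported_q, total in totals.items():
        # Add-one smoothing avoids taking log10(0) when no mismatch is seen.
        rate = (errors[reported_q] + 1) / (total + 2)
        table[reported_q] = round(-10 * math.log10(rate))
    return table


# Made-up data: 998 bases reported as Q30, 9 of which mismatch the reference.
toy_observations = [(30, index < 9) for index in range(998)]
print(empirical_qualities(toy_observations))
```

In this made-up data set the bases reported as Q30 behave like Q20 bases, which is the kind of discrepancy recalibration is meant to correct.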

Variant calling
Variant calls can be generated with GATK HaplotypeCaller or UnifiedGenotyper, which examine the evidence for variation from the reference via Bayesian inference.

Variant recalibration
A Gaussian mixture model is fit to assign accurate confidence scores to each putative mutation call and evaluate new potential variants.

Variant function annotation
Biological functional annotation is a crucial step in finding the links between genetic variation and disease. SnpEff is used to add biological information to a set of variants.

9 References

[1]. Ng SB, Turner EH, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 2009, 461(7261): 272-276.
[2]. Choi M, Scholl UI, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA, 2009, 106(45): 19096-19101.
[3]. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009, 25(14): 1754-1760.
[4]. Kent W J, Sugnet C W, Furey T S, et al. The human genome browser at UCSC. Genome research, 2002, 12(6): 996-1006.
[5]. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 2009, 25(16): 2078-2079.
[6]. Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research, 2001, 29(1): 308-311.
[7]. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research, 2010, 38(16): e164-e164.
[8]. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 2012, 491(7422): 56-65.
[9]. Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 2012, 6(2): 80-92.

10 Contact Us

Address: 2575 W Bellfort Ave Ste 270, Houston, TX 77054
Phone Number: (713) 664-7087
Toll-free: (888) 528-8818
Fax: (713) 664-8181
Email: [email protected]
Website: www.lcsciences.com
