
Exome Sequencing Report

Customer Name: Demo
Institute: University of ABC
Order Number: 6667
Date Prepared: 2017-06-02

This report contains confidential materials of LC Sciences. The contents of this report are for your personal use only, and you are responsible for keeping them confidential. If the contents of this report are disclosed to any third party or company, LC Sciences will be entitled to take legal action according to the relevant laws and regulations.

Prepared by LC Sciences | www.lcsciences.com | [email protected]
2575 W Bellort Ave, Houston, TX 77054
Tel. 888-528-8818, Fax. 713-664-8181

Table of Contents

1 Exome Sequencing Introduction
2 Exome Sequencing Report
    2.1 Disease Information
    2.2 Database Information
    2.3 Data Analysis Program
3 Technical Methods and Processes
    3.1 Experimental Processes
    3.2 Analysis Process
4 Sequencing Data Overview
    4.1 Sample Collection and Grouping Information
    4.2 Sequencing Data Filtering
    4.3 Sequencing Data Quality Control
    4.4 Sequencing Depth/Coverage Distribution
    4.5 Coverage Results
5 Variant Calling Results and Analysis
    5.1 SNP Results
    5.2 SnpEff Annotation of SNP
    5.3 INDEL Results
    5.4 SnpEff Annotation of INDEL
6 Annotation and Filtering of Variants
    6.1 dbSNP Annotation and Filtering
    6.2 1000Genome Annotation and Filtering
    6.3 Coding Region Annotation
    6.4 Protein Function Annotation
7 Quality Control
8 Appendix
    8.1 Materials and Methods
    8.2 Information Analysis
9 References
10 Contact Us

1 Exome Sequencing Introduction

An exon is the part of a eukaryotic gene that is retained after splicing and can be translated into a peptide sequence. The exome is the sum of all exon regions in the genome. It contains the information needed for translation and covers most of the functional variants associated with individual phenotypes. The human genome contains approximately 180,000 exons with a total length of about 30 Mb. The human exome accounts for about 1% of the genome, but is responsible for about 85% of human pathogenic mutations.

Exome sequencing refers to the use of specially designed probes to enrich the protein-coding regions or other specific regions of interest. The enriched regions are then read by high-throughput sequencing, which greatly improves the efficiency of exome studies and significantly reduces the cost of research. The technology can be used to identify pathogenic genes and to study Mendelian diseases and complex diseases such as cancer, diabetes, and obesity. This enables researchers to better explain the pathogenesis of diseases.

The technical advantages of exome sequencing:

Cost-effective: Genome-wide coding information can be obtained economically and efficiently relative to whole-genome sequencing. The exon regions are sequenced to a greater depth, so the results are more accurate.
High detection accuracy: Single-base variation can be identified genome-wide.
Applicable to analysis with a large sample size: Because exome sequencing is economical, it is well suited to studies with a large number of samples.

2 Exome Sequencing Report

Species Name: Human (Homo sapiens)

2.1 Disease Information

Disease Name: Genetic Disease
Disease Type: Dominant/recessive disorders on autosomal or sex chromosomes

Demo Family Map

2.2 Database Information

Database             Version  Source
Genome Database      hg19     ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.toplevel.fa.gz
dbSNP Database       144b     http://www.ncbi.nlm.nih.gov/SNP/
1000Genome Database  V73      http://www.1000genomes.org/
Clinvar Database     144      ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/clinical_vcf_set

2.3 Data Analysis Program

Analysis                   Tool      Version  Description
Data Quality Control       FastQC    0.10.1
                           BWA       0.7.10   Reference Genome Comparison
                           SAMtools  0.1.19   View and Sort Alignment Results
                           Picard    1.119    Merge Sample Bam Results
SNP/INDEL Detection        GATK      3.3.0    Detection and Filtration of SNP and INDEL
SNP/INDEL Gene Annotation  SnpEff    V4.1     Detected Variation Explanation

3 Technical Methods and Processes

3.1 Experimental Processes

The liquid chip capture system (Agilent, CA, USA) is used to efficiently enrich the human exon regions. High-throughput sequencing is performed on the HiSeq 2500/4000 platform. Library construction and capture experiments are carried out using the SureSelect Human All Exon V6 kit (Agilent, CA, USA).
Sample DNA quality assessment requires at least 1.5 ug of DNA (Qubit). Agarose gel electrophoresis should show no degradation and no RNA contamination. Additionally, the OD260/280 ratio measured by Nanodrop should range from 1.8 to 2.0. For samples with less than 1 ug of DNA, a substitute protocol may be suggested to optimize the sequencing library.

Genomic DNA is randomly fragmented into small pieces of 150-300 bp. Following end repair and adenylation of the 3' ends, the fragments are ligated with sequencing adaptors that carry specific indices. The library is then hybridized with up to 738,690 biotin-labeled probes so that the exon regions (including the upstream and downstream regions), about 58 Mb in total, can be captured using streptavidin beads. After PCR amplification and quality assurance, the library is loaded on the flow cell for sequencing (Figure). Paired-end reads (2 x 150) are obtained for downstream data analysis. A more detailed description of library preparation is available in the supplemental materials.

Figure. Library Construction Experimental Workflow

3.2 Analysis Process

Adaptor, polyN, polyA, and other low-complexity sequences are removed from the raw reads. The remaining valid reads are mapped to the reference genome using BWA (Li H et al. and Kent WJ et al.). The mapping result is saved in BAM format and then sorted using SAMtools (Li H et al.). Duplicate reads arising from PCR amplification are marked using Picard. Marked duplicate reads are not used for subsequent processing, as they may produce false positives in mutation detection.

After duplicate reads are marked, the reads close to regions that BWA reports as insertions/deletions (INDELs) need to be realigned. These regions are identified from the Compact Idiosyncratic Gapped Alignment Report (CIGAR) value.
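The CIGAR-based identification of INDEL sites described above can be illustrated with a short Python sketch. It is not the code used by BWA or GATK. The function name `indel_positions` and the example read are hypothetical, and the only thing taken from the SAM format is which CIGAR operations consume reference bases.

```python
import re

# CIGAR operations that advance the reference coordinate (per the SAM format).
REF_CONSUMING = set("MDN=X")

def indel_positions(pos, cigar):
    """Return (reference_position, op, length) for each I or D in a CIGAR string.

    pos is the 1-based leftmost mapping position of the read (SAM POS field).
    For a deletion, reference_position is the first deleted reference base.
    For an insertion, it is the reference base that follows the inserted bases.
    """
    indels = []
    ref = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "ID":
            indels.append((ref, op, length))
        if op in REF_CONSUMING:
            ref += length
    return indels

# Hypothetical read: 50 matches, 2-base insertion, 30 matches, 3-base deletion, 68 matches.
print(indel_positions(100, "50M2I30M3D68M"))
```

For the hypothetical read above, the sketch reports an insertion at reference position 150 and a deletion starting at position 180. Reads overlapping such positions are the ones a realignment step would revisit.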
Mismatches close to INDEL regions reported by BWA may be inaccurate because of its alignment algorithm, and they may cause false positive variant calls. Therefore, correction at these sites is required for subsequent SNP and INDEL analysis. The IndelRealigner module in GATK is used to carry out INDEL realignment in an effort to minimize the mismatch error rate near each INDEL site.

Variant calling relies heavily on the quality score of each base reported by the sequencer. For example, the BWA aligner reports a mismatch when the base quality is above Q25. At Q25 the expected sequencing error rate is about 0.3% (an error rate of 1% corresponds to Q20), which may still heavily affect the reliability of downstream analysis. Additionally, the sequencing quality at the 3' end is typically lower than the quality at the 5' end due to reagent depletion, and the quality of A/C is often lower than that of T/G. Therefore, it is necessary to recalibrate the base quality using the BaseRecalibrator module in GATK so that the quality scores are more reliable.

Note: The reads in one sample are supposed to come from the same lane for base recalibration. Otherwise, reads from different lanes need to be recalibrated separately.

After the steps mentioned above, variant calling is performed by the UnifiedGenotyper or the HaplotypeCaller module in GATK. The UnifiedGenotyper module does not consider the impact of adjacent bases when making a call, whereas the HaplotypeCaller module makes calls based on local de novo assembly. The HaplotypeCaller module first builds a De Bruijn graph and then applies a PairHMM model to predict haplotypes and call variants. Of course, in dealing with such typical large scale Bayesian inference problems
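The relation between a reported base quality and its error rate, referred to in the recalibration discussion above, follows the standard Phred definition Q = -10 * log10(p). The Python sketch below only illustrates that conversion. The function names are illustrative and are not part of BWA or GATK.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

# Print the error probability for a few commonly quoted quality scores.
for q in (20, 25, 30):
    print(f"Q{q}: error probability {phred_to_error(q):.4%}")
```

Under this definition Q20 corresponds to a 1% error probability, Q25 to roughly 0.3%, and Q30 to 0.1%. This is why small miscalibrations in reported quality can noticeably shift the evidence for a variant.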