Establishing an Automated Graphical Genome Analysis Platform
Total Page:16
File Type:pdf, Size:1020Kb
Manuscript 1 Establishing an automated graphical genome analysis platform 1 2 3 4 5 2 Wison Laochareonsuk1, Komwit Surachat2, Surasak Sangkhathat3* 6 7 8 9 3 1 Department of Biomedical Science, and Translational Medicine Research Center, Faculty of Medicine, Prince 10 11 12 4 of Songkla University, Hat Yai, Songkhla, 90110, Thailand 13 14 15 2 Department of Computing Science, Faculty of Science, Prince of Songkla University, Hat Yai, Songkhla, 16 5 17 18 19 6 90110, Thailand 20 21 22 7 3 Department of Surgery, and Translational Medicine Research Center, Faculty of Medicine, Prince of Songkla 23 24 25 8 University, Hat Yai, Songkhla, 90110, Thailand 26 27 28 9 * Corresponding author, Email address: [email protected] 29 30 31 10 32 33 34 35 11 Abstract 36 37 38 12 Precision medicine is a modern health concept which involves using a specific treatment for 39 40 41 42 13 individuals based on genetic background. This new approach not only concerns personalized 43 44 45 14 prescriptions to minimize adverse events and targeted therapy but also extends to diagnosis, 46 47 48 49 15 prognosis and prevention of relevant diseases. Emerging high-throughput sequencing 50 51 52 53 16 technology allows widespread identification of genetic variations. However, the downstream 54 55 56 17 workflow of sequencing raw data is based on command-line interface (CLI) and open-source 57 58 59 60 18 software, and general users may be confronted with the difficulty of using the basic CLI 61 62 63 64 65 1 language and constructing pipelines. Our project aimed to establish a fully automated web- 1 2 3 4 2 based analysis system for cancer genomic medicine and other human genomic fields to query 5 6 7 3 and visualize genomic data. The web-based analysis for cancer biomedicine named iCBC was 8 9 10 11 4 prepared using a user-friendly graphical interface on a high-performance computing system. 12 13 14 We also implemented various bioinformatic tools and assembled automated pipelines for 15 5 16 17 18 6 genomic sequencing. The implemented workflows included quality control, alignment and pre- 19 20 21 22 7 processing, variant discovery, annotation and prioritization with multiple filtering criteria. 23 24 25 8 Ultimately, the iCBC was developed, which is an open access online analysis platform for 26 27 28 29 9 human high-throughput genomic data in biomedical cancer research. 30 31 32 33 10 34 35 36 37 38 11 Keywords: Cancer research, Medical bioinformatics, Online analysis platform 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 1 1. Introduction 1 2 3 4 2 Precision medicine has become a modern healthcare paradigm in the new era involving 5 6 7 3 specific medical treatments based on individual lifestyle and genetic background (Goetz & 8 9 10 11 4 Schork, 2018). This novel approach not only involves personalized treatment to decrease 12 13 14 adverse events and targeted therapeutic effects but also influences solving undiagnosed 15 5 16 17 18 6 diseases, predicting outcome and encountering for preventable diseases to improve the 19 20 21 22 7 individual’s quality of life. The concept of genomic medicine was introduced after the fully 23 24 25 8 developed human genome projects and 1000-genome project were published in the 2010s 26 27 28 29 9 which provide a reference for human genetic mapping and exploring the variations in regions 30 31 32 33 10 of specific interest (The Genomes Project et al., 2015). Nowadays, emerging high-throughput 34 35 36 11 sequencing technology is enhancing the power of genetic studies. Because of high-capability 37 38 39 40 12 sequencing, this platform allows clinicians and researchers to extend their identification of 41 42 43 44 13 disease pathogenesis, application for diagnosis of diseases, select suitable treatments and 45 46 47 14 provide better prognoses. Together with the rapid decline of sequencing costs, next generation 48 49 50 51 15 sequencing has become a current application for biomedical research that delivers novel 52 53 54 55 16 medical knowledge worldwide (Nambot et al., 2018; Warr et al., 2015). 56 57 58 59 60 61 62 63 64 65 1 As precision medicine becomes a greater part of healthcare systems, with the 1 2 3 4 2 emergence of sequencing data, including whole exome sequencing or targeted sequencing, the 5 6 7 3 resources available for this work face the danger of becoming overwhelmed (He, Ge, & He, 8 9 10 11 4 2017). Moreover, downstream analysis process of the data is based on open-source software 12 13 14 (i.e., FastQC, Burrows-Wheeler Aligner, Samtools and Genome Analysis Tool Kit), running 15 5 16 17 18 6 on command line interface (CLI), which may present challenges to researchers or clinicians 19 20 21 22 7 due to the difficulty of the basic computer language and workflow assembly. According to 23 24 25 8 immersion of the data, bioinformatic analysis is usually operated on an effective infrastructure 26 27 28 29 9 system including high performance computing and extensive storage. Moreover, variant 30 31 32 33 10 prioritization is a crucial step in genomic analysis that requires various criteria to filter out non- 34 35 36 11 related variants and select only possible pathogenic variants (DePristo et al., 2011; Gong et al., 37 38 39 40 12 2016). 41 42 43 44 13 Therefore, a web-based genomic analysis program for cancer biomedicine named 45 46 47 14 integrative Computational Biology for Cancer (iCBC) was established using a graphical 48 49 50 51 15 interface on a cloud computing platform. We implemented various bioinformatic tools and 52 53 54 55 16 assembly analysis pipelines for whole exome sequencing and targeted sequencing. The 56 57 58 17 provided analyses include sequence quality control, reference alignment and pre-processing, 59 60 61 62 63 64 65 1 variant calling, variant annotation and variant prioritization by multiple filtering criteria for 1 2 3 4 2 individual sequence analysis. In addition, somatic and comparative germline analyses are also 5 6 7 3 embedded in an integrative sequence analysis, in which the user can select separate conditions 8 9 10 11 4 for each sample between case-control for germline analysis or matched tumor-normal sample 12 13 14 for somatic analysis. Finally, the integrative analysis archives workflows with variant 15 5 16 17 18 6 annotation and prioritization, similar to individual sample analysis. 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 1 2. Materials and Methods 1 2 3 4 2 The iCBC was developed as a web-based bioinformatic tool with selectable and 5 6 7 3 automated pipelines for individual and integrative human genomic analysis. The development 8 9 10 11 4 of the in-house pipeline applied open-source software with the Linux scripting language. 12 13 14 Focusing on human genomic studies, short read sequencing has been the most widely used 15 5 16 17 18 6 platform for single-end and paired-end reads. The best practice guideline recommends four 19 20 21 22 7 principle processes for an NGS pipeline including (1) quality control, (2) pre-processing, (3) 23 24 25 8 variant discovery, and (4) functional annotation (Figure 1) (DePristo et al., 2011). The analysis 26 27 28 29 9 usually starts with a sequence quality check and control followed by trimming out parts with 30 31 32 33 10 low base quality scores, short sequence lengths and remaining adaptors. Alignment is the next 34 35 36 11 crucial step in which sequences are mapped with a pre-processing reference at the best position. 37 38 39 40 12 When finished, the data are stored in a format of sequence align map (SAM)(H. Li et al., 2009). 41 42 43 44 13 Because of the large size of the data, the aligned reads are usually converted to a binary format 45 46 47 14 (binary align map-BAM) and sorted by order of chromosomes and their positions to reduce the 48 49 50 51 15 time of computation. In general, short read platforms regularly need library construction before 52 53 54 55 16 sequence reading which may lead to non-random enrichment and cause duplicated sequences. 56 57 58 17 These sequences require re-analysis for duplication and data bias. Systematic base quality score 59 60 61 62 63 64 65 1 errors in each position are improved by correcting the errors with known variant databases 1 2 3 4 2 (Tian, Yan, Kalmbach, & Slager, 2016). After pre-processing, the sequence quality is satisfied 5 6 7 3 for the discovery of nucleotides that differ between the read sequences and reference sequences 8 9 10 11 4 in each position, relying on a diploid assumption. The variants are generally recorded in variant 12 13 14 call format (VCF). Alternatively, the variants may be identified in all represented sites 15 5 16 17 18 6 (genomic VCF or GVCF) that allows joining multiple analyses together (McKenna et al., 19 20 21 22 7 2010). Moreover, somatic variant calling is a special analysis for matched tumor-normal 23 24 25 8 samples which may be interfered with by chromosomal aneuploidy or polyploidy.