Deephbv: a Deep Learning Model to Predict Hepatitis B Virus (HBV)
Total Page:16
File Type:pdf, Size:1020Kb
Wu et al. BMC Ecol Evo (2021) 21:138 BMC Ecology and Evolution https://doi.org/10.1186/s12862-021-01869-8 RESEARCH Open Access DeepHBV: a deep learning model to predict hepatitis B virus (HBV) integration sites Canbiao Wu1†, Xiaofang Guo2†, Mengyuan Li3†, Jingxian Shen1†, Xiayu Fu4, Qingyu Xie1,9, Zeliang Hou1, Manman Zhai1,5, Xiaofan Qiu1, Zifeng Cui3, Hongxian Xie6, Pengmin Qin5, Xuchu Weng1,7, Zheng Hu3,8* and Jiuxing Liang1,7* Abstract Background: The hepatitis B virus (HBV) is one of the main causes of viral hepatitis and liver cancer. HBV integration is one of the key steps in the virus-promoted malignant transformation. Results: An attention-based deep learning model, DeepHBV, was developed to predict HBV integration sites. By learning local genomic features automatically, DeepHBV was trained and tested using HBV integration site data from the dsVIS database. Initially, DeepHBV showed an AUROC of 0.6363 and an AUPR of 0.5471 for the dataset. The integra- tion of genomic features of repeat peaks and TCGA Pan-Cancer peaks signifcantly improved model performance, with AUROCs of 0.8378 and 0.9430 and AUPRs of 0.7535 and 0.9310, respectively. The transcription factor binding sites (TFBS) were signifcantly enriched near the genomic positions that were considered. The binding sites of the AR-half- site, Arnt, Atf1, bHLHE40, bHLHE41, BMAL1, CLOCK, c-Myc, COUP-TFII, E2A, EBF1, Erra, and Foxo3 were highlighted by DeepHBV in both the dsVIS and VISDB datasets, revealing a novel integration preference for HBV. Conclusions: DeepHBV is a useful tool for predicting HBV integration sites, revealing novel insights into HBV integra- tion-related carcinogenesis. Keywords: Deep learning, HBV integration sites, Genomic features, Bioinformatics Background into covalently closed circular DNA (cccDNA), which HBV is the main cause of viral hepatitis and liver cancer produces messenger RNA (mRNA) and pre-genomic (HCC) [1]. HBV can integrate into the host genome via an RNA (pgRNA) by transcription. Ten, pgRNA produces RNA intermediate due to its small size [1]. Cases of viral new rcDNA and double-stranded linear DNA (dslDNA) DNA integrated into the human genome were detected via reverse transcription in the host nucleus, which tend in 85–90% of HBV-related HCCs [2]. HBV attaches and to integrate into the host cell genome [3]. enters hepatocytes, then transports its nucleocapsid, A previous study showed HBV integration break- which contains a relaxed circular DNA (rcDNA), to the points distributed randomly across the whole genome host nucleus. In the host nucleus, rcDNA is converted with a handful of hotspots [7]. Further analysis revealed an association between HBV integration and genomic *Correspondence: [email protected]; [email protected] instability during these insertional events [8]. Moreover, †Canbiao Wu, Xiaofang Guo, Mengyuan Li, and Jingxian Shen have signifcant enrichment of HBV integration was found contributed equally to this work near the following genomic features: repetitive regions, 1 Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou 510631, Guangdong, China fragile sites, CpG islands, and telomeres in tumors com- 3 Department of Gynecological Oncology, the First Afliated Hospital, Sun pared to non-tumor tissues [3]. For instance, HBV inte- Yat-Sen University, Guangdong 510080 Guangzhou, China gration was reported to recur in the telomerase reverse Full list of author information is available at the end of the article © The Author(s) 2021. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Wu et al. BMC Ecol Evo (2021) 21:138 Page 2 of 10 transcriptase (TERT) and myeloid/lymphoid or mixed- Tis study developed DeepHBV, an attention-based lineage leukemia 4 (MLL4, also known as KMT2B) model for predicting HBV integration sites using deep genes. Te insertional events were also accompanied by learning. Te attention mechanism highlights the regions altered expression of the integrated gene [3, 7, 9], indi- concentrated upon by DeepHBV and helps determine the cating important biological impacts on the local genome. investigated patterns. DeepHBV can predict HBV inte- However, the pattern and mechanism of HBV integra- gration sites accurately and specifcally, and the attention tion remain to be explored. Many HBV integration sites mechanism highlights positions with potentially impor- are distributed throughout the human genome and seem tant biological meanings. Our work identifed novel completely random [8, 10, 11]. Whether the features and transcription factor-binding sites (TFBSs) near HBV patterns of these “random” viral integration events could integration hotspots, revealing new insights into HBV- be learned and extracted remains an open question, and induced cancer. once solved, will greatly improve the understanding of HBV integration-related carcinogenesis. Results Deep learning has shown a promising ability to dis- DeepHBV efectively predicts HBV integration sites cover intricate structures in multidimensional data by adding genomic features and automatically extract features from these data [12]. Te DeepHBV model structure and the scheme of Under the conditions of big data and a well-designed net- encoding a 2000 bp sample into a binary matrix are work structure, deep learning models can better predict shown in Fig. 1. Te DeepHBV model was tested using performance in many cases. Moreover, deep learning has the HBV integration sites database (http:// dsvis. wuhan performed excellently in computational biology research softw are. com). HBV integration sequences were pre- such as medical image identifcation [13] and protein pared according to HBV integration sites as positive sequence motif discovery [14]. Te convolutional neu- samples, following the steps in the method. Te nega- ral network (CNN) is the most important part of deep tive sample abstracting also followed the method, and learning, enabling a computer to learn and program itself the negative samples should be twice the number of from training data [15]. Although deep learning performs positive samples to maintain data balance and improve well in various felds, the detailed theory of how it makes the confdence level. Te positive samples were divided decisions is difcult to explain due to its black box efect. into 2902 and 1264 positive training datasets and test- Terefore, an approach called the attention mechanism, ing datasets, respectively. We extracted 5804 and 2528 which can highlight the outstanding parts and connect negative samples as the negative training dataset and the encoder and the decoder was invented to open the testing dataset, respectively. Tests were performed on “black box” [16, 17]. the DeepHBV model using samples of DNA sequences Fig. 1 The deep learning framework applied in DeepHBV. (a) Scheme of encoding a 2 kb DNA sequence into a binary matrix using one-hot code; (b) A brief fowchart of DeepHBV structure, the matrix shape was included in brackets, and a detailed fowchart was in Supplementary Figure 1 Wu et al. BMC Ecol Evo (2021) 21:138 Page 3 of 10 near the HBV integration sites. DeepHINT, an exist- function make it possible to use HBV integration ing deep learning model for predicting HIV integration sequences to perform tests on DeepHINT. Both mod- sites according to the surroundings [18], was also eval- els were trained using the same HBV integration train- uated using HBV integration sequences for training and ing dataset, and the same testing dataset was used for testing. Te preparation of input data for DeepHINT the evaluation. Te results showed that DeepHBV with also applied 2000 bp sequences near HIV integra- HBV integration sequences had an AUROC of 0.6363 tion sites [18]. We tested DNA sequence samples with and an AUPR of 0.5471, while DeepHINT with HBV lengths of 500 bp, 1000 bp, and 2000 bp and 4000 bp. integration sequences had an AUROC of 0.6199 an As shown in Additional fle 4: Table S5, 2000 bp had the AUPR of 0.5152 (Fig. 2). Except for AUROC and AUPR, largest accuracy (0.7368), sensitivity (0.7695), specifc- the tenfold cross-validation and confusion matrix, ity (0.7321), AUROC (0.6901) and the most of the per- which included true positives, true negatives, false formance results among all tested lengths. In this case, positives, and false negatives, followed by accuracy, 2000 bp sequences were used in our study. Te ReLU specifcity, sensitivity, Mathews’ correlation coefcient, activation function and the almost identical encoding and F−1 score were applied to