A Python Script to Merge Sanger Sequences
Total Page:16
File Type:pdf, Size:1020Kb
A Python script to merge Sanger sequences Cen Chen1,2,3,*, Bingguo Lu4,*, Xiaofang Huang5, Chuyun Bi5, Lili Zhao6, Yunzhuo Hu5, Xuanyang Chen5, Shiqiang Lin5 and Kai Huang1 1 Clinical Center for Human Genomic Research, Union Hospital, Huazhong University of Science and Technology, Wuhan, China 2 Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China 3 Department of Cardiovascular Diseases, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China 4 Provincial University Key Laboratory of Cellular Stress Response and Metabolic Regulation, College of Life Science, Fujian Normal University, Fuzhou, China 5 Key Laboratory of Crop Biotechnology, Fujian Agriculture and Forestry University, Fujian Province Universities, Fuzhou, China 6 State Key Laboratory of Infectious Disease Prevention and Control, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control, Beijing, China * These authors contributed equally to this work. ABSTRACT Merging Sanger sequences is frequently needed during the gene cloning process. In this study, we provide a Python script that is able to assemble multiple overlapping Sanger sequences. The script utilizes the overlapping regions within the tandem Sanger sequences to merge the Sanger sequences. The results demonstrate that the script can produce the merged sequence from the input Sanger sequences in a single run. The script offers a simple and free method for merging Sanger sequences and is useful for gene cloning. Subjects Bioinformatics, Genetics, Genomics, Molecular Biology, Computational Science Keywords Sanger sequencing, Merge, Python script Submitted 15 December 2020 Accepted 5 April 2021 Published 27 April 2021 INTRODUCTION Corresponding authors Gene cloning is customarily required to study the gene functions in vivo and in vitro. Shiqiang Lin, For the most part, the gene of concern is ligated to a vector with a canonical method of [email protected] restriction enzyme cutting and ligation, or the new-fashioned technique of seamless Kai Huang, [email protected] ligation (Okegawa & Motohashi, 2015). To ensure that the gene sequence is correct within Academic editor Joseph Gillespie the constructed plasmid, the Sanger sequencing is utilized to sequence the gene, and the Additional Information and results are aligned and checked (Zimmermann et al., 1988). Declarations can be found on Sanger sequencing is a DNA sequencing technology by incorporating chain-terminating page 8 dideoxynucleotides, which are fluorescently labeled and can be read-out by automated DOI 10.7717/peerj.11354 sequencing machinery. Sanger sequencing has been widely applied to determine the DNA Copyright sequence since it was reported first time by Sanger et al. (1977) and Sanger, Nicklen & 2021 Chen et al. Coulson (1977). Though the next-generation sequencing has started a new era, Sanger Distributed under sequencing is still one of the most popular methods due to its reliability, affordability, Creative Commons CC-BY 4.0 and feasibility (Schuster, 2008). However, the number of nucleotides determined by one How to cite this article Chen C, Lu B, Huang X, Bi C, Zhao L, Hu Y, Chen X, Lin S, Huang K. 2021. A Python script to merge Sanger sequences. PeerJ 9:e11354 DOI 10.7717/peerj.11354 Sanger sequencing reaction is around 1,000. For those genes with lengths of only several hundred nucleotides, it is possible to go through each one of the whole sequences with one single Sanger sequencing reaction, and the results can be directly aligned to the correct gene files. However, in other cases, the lengths of the genes exceed one thousand nucleotides, beyond the scope of one single Sanger sequencing reaction. To get the full-length sequence of a target gene, it is necessary to carry out DNA walking sequencing using a new primer based on the previous sequencing result. To sequence a large gene, several reactions might be required in both forward and reverse directions. These results have to be aligned to the correct gene file for confirmation. It is preferred to merge all the walking results before alignment with the correct gene file, rather than aligning each walking result with the target gene file separately (Tang et al., 2020a, 2020b), especially for those large genes, which may be over 10,000 bps in length. To merge the multiple walking results, several commercial software can be applied, such as DNASTAR (DNASTAR, Inc., Madison, WI, USA), DNAMAN (Lynnon Biosoft, San Ramon, CA, USA), Vector NTI (Thermo Fisher Scientific Inc., Denver, CO, USA), and SnapGene (GSL Biotech LLC., San Diego, CA, USA), which are robust, however, expensive. For protein science and molecular biology, sequencing is mostly used to confirm whether the subcloning or mutagenesis is correct. Thereby, some functions of these commercial software are not essential, and the price of these software might be too high for the routine work of molecular cloning. In terms of free software, there is an online tool that is able to merge overlapping long sequence fragments based on the program merger in the “EMBOSS” suite (Bell & Kramvis, 2013; Rice, Longden & Bleasby, 2000). Nevertheless, the web tool relies on web access and server status. “Cap3” is a free and effective program for merging DNA fragments, however, the command-line is complicated for users who are not familiar with commands of “Cap3” (Huang & Madan, 1999). “Staden”, containing a graphic user interface, provides a free and powerful package for analyzing giant genome sequencing by combining the “SPIN” and “EMBOSS” (Staden, Judge & Bonfield, 2003), again, this package is complicated for some users. Since the first trial of the assembly of DNA sequence by computer, the algorithms have developed rapidly, which can be categorized into three classes: the first is overlap-layout- consensus method (Batzoglou, 2005), which applies an overlap graph; the second is de Bruijn graph roads (Idury & Waterman, 1995; Pevzner, 1989; Pevzner, Tang & Waterman, 2001), which uses a k-mer graph; and the last is greedy algorithm, which uses a k-mer or an overlap graph (Pop, 2009; Pop & Salzberg, 2008). Most of these algorithms are coded by C, which is challenging for biologists with less programming backgrounds. Here, we introduce a Python script capable of merging multiple overlapping Sanger sequencing files by employing the alignment module of Biopython (Cock et al., 2009). To visualize the critical parameters of the merging process, we print the output of alignment and show the critical junction within the merged sequence. The script is useful for merging the Sanger sequences and increasing the efficiency of subcloning. Chen et al. (2021), PeerJ, DOI 10.7717/peerj.11354 2/10 METHODS Running environments and input files To run the script “Merge_Sanger_v2.py”, the macOS Catalina 10.15.3 or higher (Apple Inc., Cupertino, CA, USA) is needed. Python 3.7.3 (https://www.python.org/), Biopython 1.7.4, and “EMBOSS” 6.6.0 are necessary for running the script in this study (Cock et al., 2009; Rice, Longden & Bleasby, 2000). The users need to install the above three packages before running the script. Here are the sequencing result files of the pET28a- SV40NLS-hCas9-SV40NLS, which is originated from PX458 (Ran et al., 2013) (Supplemental Information S1)(Fig. 1). All input files are in the seq format. Of note, the first four characters of the input file name must consist of three numbers (000, 001, 002, 003, …) plus one capital letter F or R (F for forward Sanger sequencing and R for reverse Sanger sequencing). Adding extra four characters to the original filename provided by the sequencing company is suggested. The numbers in the filenames should be ordered from small to big along with the 5′ to 3′ of the sense strand. Running script Please follow the steps below to run the script. 1. Make a new directory and copy the sequencing result files of the above section (Running environments and input files) to the newly made directory. Also, copy the script “Merge_Sanger_v2.py” (Supplemental Information S2) to the directory. 2. Open a new terminal and “cd” to the directory. Run the following command. Notice that a space between every two items is needed and the backward slash is used for multiline command input. python3.7 Merge_Sanger_v2.py 001F_sequence_to_be_merged.seq \ 002F_sequence_to_be_merged.seq 003F_sequence_to_be_merged.seq \ 004R_sequence_to_be_merged.seq 005R_sequence_to_be_merged.seq \ 006R_sequence_to_be_merged.seq The Sanger sequencing result files must be input from lower to higher according to the first three numbers of the file name. It is convenient to input the first three numbers and then use the tab key to fill the rest of the file name automatically. Our script is able to merge dozens of (up to 1,000) Sanger sequencing result files at a time, which is far enough for usual needs in labs. 3. After running finishes, a folder named “merged_sequence”, which contains the merged sequence file with the name “merged.seq”, appears. Manual validation of the merged Sanger sequences First, all the Sanger sequence files are renamed and changed to fasta format with TextEdit. Those Sanger sequence files in the reverse sequencing direction are transformed to the reverse complement sequences with “EMBOSS” command “revseq”. All the Sanger Chen et al. (2021), PeerJ, DOI 10.7717/peerj.11354 3/10 Figure 1 The test files used for running the script. Full-size DOI: 10.7717/peerj.11354/fig-1 sequence files are now in the forward sequencing direction in fasta format. Then, the “EMBOSS”“needle” is utilized to align each pair of tandem sequence files, and the command “less” is used to open the result files of alignments. The final merged sequence file is generated by combining the Sanger sequences based on the overlapping regions. The manually combined sequence file is used to align with the script’s merged sequence, and the time consumed by the manual method is compared with that used by the script.