A Tool to Sanity Check and If Needed Reformat FASTA Files
bioRxiv preprint doi: https://doi.org/10.1101/024448; this version posted August 21, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

RESEARCH

Fasta-O-Matic: a tool to sanity check and if needed reformat FASTA files

Jennifer M Shelton1 and Susan J Brown1*

*Correspondence: [email protected]
1KSU/K-INBRE Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS, USA
Full list of author information is available at the end of the article

Abstract

Background: As the sheer volume of bioinformatic sequence data increases, the only way to take advantage of this content is to more completely automate robust analysis workflows. Analysis bottlenecks are often mundane and overlooked processing steps. Idiosyncrasies in reading and/or writing bioinformatics file formats can halt or impair analysis workflows by interfering with the transfer of data from one informatics tool to another.

Results: Fasta-O-Matic automates handling of common but minor format issues that otherwise may halt pipelines. The need for automation must be balanced by the need for manual confirmation that any formatting error is actually minor rather than indicative of a corrupt data file. To that end, Fasta-O-Matic reports any issues detected to the user with optionally color-coded and quiet or verbose logs.
Fasta-O-Matic can be used as a general pre-processing tool in bioinformatics workflows (e.g. to automatically wrap FASTA files so that they can be read by BioPerl).
It was also developed as a sanity check for bioinformatic core facilities that tend to repeat common analysis steps on FASTA files received from disparate sources. Fasta-O-Matic can be set with format requirements specific to downstream tools as a first step in a larger analysis workflow.

Availability: Fasta-O-Matic is available free of charge to academic and non-profit institutions on GitHub.

Keywords: FASTA; sequence data; bioinformatics file format

Background

Sequence data can be stored as text with each letter representing a nucleic acid (DNA and RNA) or amino acid (protein). The linear nature of these molecules makes it natural to represent them as strings, finite sequences of characters. Although it has been argued that a graph, a network of vertices connected by edges, is a more accurate way to store genomic sequences because graphs allow the inclusion of alternate alleles and alternate possible assemblies [1], all of the most common methods for storing sequences (FASTA, FASTQ, SAM/BAM) use linear strings.

Other decisions about how to represent sequence data can be more arbitrary. For example, any character that is not used as a base or an amino acid could be used to indicate the beginning of a new sequence. Additionally, text could be wrapped to limit the information content in any one line of a file. The advantage of wrapping text is that some programs can then be designed to work one line at a time, limiting the burden of each step (e.g. the program would never have to process an entire
chromosome of sequence data in a single step). The disadvantage is that code must be slightly more complex to load an entire sequence record into the working memory.

0.1 FASTA file format specifications versus recommendations
The popular FASTA file format stores sequence records and has very minimal format requirements [2]. Each sequence is preceded by a header/description line that begins with a > symbol. Sequence lines can include any standard International Union of Pure and Applied Chemistry (IUPAC) single-character symbols for nucleic acids or amino acids, or the ambiguity codes that indicate possible residues or bases [3]. They can also include - to indicate alignment gaps and * to indicate stop codons. NCBI recommends wrapping FASTA file sequence lines [2]. It is also common practice to use the first 'word' in a header (i.e. any character string to the left of the first space in the header) as the unique sequence id. Although these features are common, they are not required, leading to format compatibility issues with tools that treat these conventions as required.

0.2 Customizing FASTA files to ensure that information is properly interpreted by downstream tools
Regardless of whether a FASTA file is technically improperly formatted or its format merely violates a popular convention, it is critical to quality analysis workflows that data are converted into a format that will be correctly interpreted by downstream tools. Formatting issues can fall into multiple categories, including actual format errors and formats that are not technically wrong but are non-standard, causing some tools to throw an error.

Some format errors indicate a major problem like an attempt to use the wrong data format (e.g. the first line is not a FASTA header because it does not begin with a > character).
These types of errors will be subsequently referred to as fatal. Alternatively, some formatting issues occur commonly without indicating that the FASTA file is corrupt (e.g. improperly wrapped/unwrapped sequence lines, missing final new line characters, unusual new line characters like \r). These issues will be referred to as non-fatal. Fatal formatting issues should cause processing to stop. Non-fatal formatting issues should be automatically corrected according to the most common resolution for this type of error. While downstream processing continues, the analyst can double check the automated decision to reformat non-fatal issues. This way the workflow would not be slowed by trivial reformatting steps and the rarer problems (e.g. when a missing last new line was caused by incomplete file transfer) could still be caught.

0.3 Existing tools
Existing bioinformatics tools address FASTA format inconsistencies. However, many tools either halt and exit with an error (e.g. BioPerl [4], [5], [6]) or can produce reformatted output FASTA but cannot determine if there is a formatting issue to begin with (e.g. EMBOSS Seqret [7]).

The BioPerl module Bio::DB::Fasta will halt if a FASTA is inconsistently wrapped or if a line of sequence is too long (as in an unwrapped genome FASTA). This has the disadvantage of requiring human intervention to wrap and restart analysis.

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::Fasta; # makes a searchable db from FASTA file
my $out_file_temp = '/home/bionano/test_db/all.fa';
# Create new FASTA outfile object
my $seq_out = Bio::SeqIO->new('-file' => ">$out_file_temp", '-format' => 'fasta');
# Load FASTA file as DB
my $db = Bio::DB::Fasta->new("/home/bionano/test_db/miswrapped.fa");
my $seq_obj = $db->get_Seq_by_id('seq'); # get FASTA records using headers
# (where header = first 'word' so really header whitespace should also be
# removed for this file)
$seq_out->write_seq($seq_obj);

Input:
>seq 1
ACTGTGTGCAATCGCTGNNNNCTCTCATCGGATCTTGCAATCGCTNNNCTCTCATCGGATTGCAATCGCTNNNCTtcatcCGGAT
CGCTGNNNNCTGTGTGCAATCGCTGNNNNCTCCTGATCGCTGNNNNCTGTGTGCAATCGCTGNNNNCTCCTGCAATCGCTGNNNN
CTCCTGTTCGNATCGatcctctgtttatgcttatagctagctgatcgtagnnntcaacgt
CTAGAGCGCAGCTCTGGGGGATTACTACTCACTACATCATTAGATCAGATacgactcann
>seq 2
cttatagctagctgatAATCGCTGNNTCATCGGATCTTGCCTTGCAATCGtcatcCGtcC
CGCTGNNNNCTGTGTGCAnnnnnnnnnnncgtaaaacgcctcctccgactcgTCTCTAGG
CTAGAGCGCAGCTCTGGGGGATTACTACTCACTACATCATTAGATCAGATacgactcann
nnnctacgCTATCAGGTCTCGAG
>seq 3
ATCAGCGCTCTATATGGCTCTGATTATAGTTTGCATTCATATGCTGATCTTctcagnntc
cttgacgctcgctATCTGTAGATCTGTACTtcagacagctcTCAGCAGNNNCTCAGCAGC
CTACGACAGTcatgcagactagcagt

Output:
------------- EXCEPTION -------------
MSG: Each line of the fasta entry must be the same length except the last.
Line above #5 'CTAGAGCGCAGCTCTGGGGG..' is 61 != 86 chars...

EMBOSS seqret was designed as a very flexible tool to convert from one properly formatted file to another properly but distinctly formatted file. It was also designed to accept poorly formatted data (e.g. a FASTA that is improperly wrapped and missing the final new line) and export a reformatted file (e.g. wrapped after 60 bases with a final new line).
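
The contrast drawn above, between tools that halt on a wrapping problem and tools that rewrap without reporting one, can be illustrated with a short Python sketch that does both: it records what it found and then rewraps. This is an illustration only and is not the Fasta-O-Matic, BioPerl or EMBOSS implementation. The function names (read_records, is_consistently_wrapped, rewrap) and the issue messages are hypothetical. The consistency rule is an assumption modeled on the BioPerl error message (every sequence line in a record has the same length except the last), and the default width of 60 simply follows the seqret example.

```python
from typing import Iterator, List, Tuple


def read_records(text: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) pairs from FASTA text.

    Blank lines are skipped. Sequence data that appears before any
    '>' header is treated as a fatal error (hypothetical policy).
    """
    header = None
    lines: List[str] = []
    for raw in text.splitlines():  # splitlines() also splits on \r and \r\n
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is None:
            raise ValueError("fatal: first line is not a FASTA header")
        else:
            lines.append(line)
    if header is not None:
        yield header, lines


def is_consistently_wrapped(seq_lines: List[str]) -> bool:
    """True if all lines but the last share one length and the last is no longer."""
    if len(seq_lines) < 2:
        return True
    width = len(seq_lines[0])
    body_ok = all(len(line) == width for line in seq_lines[:-1])
    return body_ok and len(seq_lines[-1]) <= width


def rewrap(text: str, width: int = 60) -> Tuple[str, List[str]]:
    """Return (rewrapped FASTA text, list of non-fatal issues that were found)."""
    issues: List[str] = []
    if text and not text.endswith("\n"):
        issues.append("missing final newline")
    out: List[str] = []
    for header, seq_lines in read_records(text):
        if not is_consistently_wrapped(seq_lines):
            issues.append(f"inconsistent wrapping in record '{header}'")
        seq = "".join(seq_lines)
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n", issues


if __name__ == "__main__":
    # Small made-up record: uneven line lengths and no final newline.
    example = ">my header\nAAAAAAAAAAAATTTT\nATAT\nATTATTATATATATATTCTC"
    fixed, found = rewrap(example, width=10)
    for issue in found:
        print("non-fatal:", issue)
    print(fixed, end="")
```

Returning the list of issues together with the reformatted text is one way to let a workflow continue while the analyst reviews what was changed. Note that this sketch rewraps every record to the requested width, including records that were consistently wrapped at some other width, and it reports only the two issues shown.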

Code:
seqret -stdout -sequence test.fa -outseq test_reformat.fa

Input:
>my header
AAAAAAAAAAAATTTTTTCCCCGGCGCGCGCGCTATAGCGCTATANNNNNNNNNNNNNNN
ATATATATATAT
ATTATTATATATATATTCTCTCTGGGCTCGCGTCTCGCTATTTATATATATATATATATTGCGCTCTCGTCTCCT

Output: