The Biological Object Notation (BON): a Structured Fle Format for Biological Data Received: 19 January 2018 Jan P
Total Page:16
File Type:pdf, Size:1020Kb
www.nature.com/scientificreports OPEN The Biological Object Notation (BON): a structured fle format for biological data Received: 19 January 2018 Jan P. Buchmann 1, Mathieu Fourment2 & Edward C. Holmes 1 Accepted: 13 June 2018 The large size and high complexity of biological data can represent a major methodological challenge Published: xx xx xxxx for the analysis and exchange of data sets between computers and applications. There has also been a substantial increase in the amount of metadata associated with biological data sets, which is being increasingly incorporated into existing data formats. Despite the existence of structured formats based on XML, biological data sets are mainly formatted using unstructured fle formats, and the incorporation of metadata results in increasingly complex parsing routines such that they become more error prone. To overcome these problems, we present the “biological object notation” (BON) format, a new way to exchange and parse nearly all biological data sets more efciently and with less error than other currently available formats. Based on JavaScript Object Notation (JSON), BON simplifes parsing by clearly separating the biological data from its metadata and reduces complexity compared to XML based formats. The ability to selectively compress data up to 87% compared to other fle formats and the reduced complexity results in improved transfer times and less error prone applications. Biological data, which includes, but is not limited to molecular sequences, annotations and phylogenetic trees, are still predominantly exchanged as fat fles or in line-based formats despite the existence of more structured fle notations that are better suited to complex data. Here, we propose an improved format for biological data that overcomes many of the limitations associated with fat fles. Tis new “biological notation” (BON) format is based on the JSON notation and can support all data types required in biology. We do not propose a new standard, but a rather new structured and lean fle format for modern molecular biology data sets and their exchange. We imple- mented BON as a Python library that can be used in pipelines and as basis for individual adjustments. Since the emergence of the frst successful format for biological sequences – the FASTA format in 19851 – dif- ferent formats for a variety of biological data types have been developed, such as molecular sequences (FASTQ2), phylogenetic trees (Newick2, NEXUS3), and sequence alignments (Stockholm4, BAM5). Tese formats are all fat fles or block-based and hence infexible to modifcation, complicating the addition of additional information. Further, each format requires a diferent parser that must be developed and maintained. Although unstructured fat fles are commonly used for biological data sets, no new approaches have been proposed to simplify the exchange of increasingly complex sequence data sets. Te “Extensible Markup Language” (XML) has been adopted as a mark-up language for the exchange of struc- tured biological data, including TinySeq (NCBI), INSDSEQ (http://www.insdc.org/documents), NeXML6 or phy- loXML7. However, XML is complex compared to JSON (Supplementary Material Notes 1,2) and requires at least four parts: (i) the document type defnition schema language (DTD, https://www.w3.org/TR/xmlschema-1/), (ii) the XML Schema Definition schema language (XSD, https://www.w3.org/TR/xmlschema11-1/), (iii) the Extensible Stylesheet Language Transformations transformation language (XSLT, https://www.w3.org/ TR/xslt20/), and (iv) the XML Query (XQuery) and XML Path Language (XPath, https://www.w3.org/TR/ xpath-datamodel-31/). Recently, the World Wide Web Consortium proposed an “Efcient XML interchange for- mat” (EXI8). While the EXI specifcations indicate improved transfer times due to an optimized binary form of XML, the underlying structure is still XML. XML documents tend to have a high degree of “noise” or “clutter”, as the syntax can use more space than the information itself (Fig. 1a). We defne clutter as any part of the transferred 1Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Life and Environmental Sciences and Sydney Medical School, the University of Sydney, Sydney, New South Wales, 2006, Australia. 2ithree Institute, University of Technology Sydney, Ultimo, New South Wales, 2007, Australia. Jan P. Buchmann and Mathieu Fourment contributed equally to this work. Correspondence and requests for materials should be addressed to E.C.H. (email: [email protected]) SCIENTIFIC REPORTS | (2018) 8:9644 | DOI:10.1038/s41598-018-28016-6 1 www.nature.com/scientificreports/ Figure 1. Basic description of BON. (a) TinySeq XML and an equivalent JSON notation to demonstrate the ratio of clutter between these syntaxes. Te tables on the lef indicate the size of the attributes and the XML tags or JSON keys, respectively. Sizes are given in bytes. Only those parts shown in red are included in the size calculations. (b) Schematic diagram of the BON format indicating its major components. (c) Basic BON example in JSON syntax. Stars indicate positions where header and payload neighbour using non-valid JSON syntax. Te JSON snippet in (a) is an example of a possible payload item and the sequence attribute is an obvious candidate for data compression. data not related to meta- or biological data, such as fle specifc syntax like XML tags or JSON syntax. In contrast, the JSON notation can encode the same data as XML, but with cleaner syntax and less clutter. Parsing errors can occur when parts of the data are wrongly classifed. Metadata can and is increasingly incorporated into the existing fat fle formats, for example FASTA, by extending the header line and using a specifc character as delimiter. Tese extensions are highly variable as they depend on the author of the fle (Supplementary Notes 3). Terefore, if such metadata is present it has to be carefully checked since such extension are not standardized. Implementing such checks and the corresponding corrections subsequently increase the complexity of the parser, thereby increasing the change to introduce additional errors. Te reduced complexity of the JSON syntax makes it possible to write parsers with simpler code and ultimately fewer errors. Tese are important factors, considering the steadily increasing size and complexity of biological data that must be trans- ferred between computers and applications. Biological data repositories such as the European Molecular Biology Laboratory (EMBL, https://www.embl. org/) and the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/) ofer JSON as a data format within several “Representational state transfer” (REST) application programming interfaces (APIs). For example, NCBI’s Entrez utility or REST API only export biological data in FASTA and XML formats, although other information is available in JSON9. In biology, JSON is mainly used to store or exchange application related data that can include biological data, such as alignments10 or protein sequences11. Another use of JSON is to combine search results12 or represent XML schemes13. To date, however, the JSON syntax has not been used to describe biological data independent of an application or schematic dependency. We can only speculate why JSON has not been yet widely adopted for biological data. It is possible that the prevalence of XML established a de facto standard that works for most biologists. However, the increased data complexity in biology will require biologists to write their own tools, and hence parsers, specifcally tailored to their research. Importantly, JSON notation represents data the same way as used in programming, while XML SCIENTIFIC REPORTS | (2018) 8:9644 | DOI:10.1038/s41598-018-28016-6 2 www.nature.com/scientificreports/ XML 1 import xml.etree.ElementTree as ET 2 xml = open(data) 3 for event, elem in ET.iterparse(xml, events=["start", "end"]): 4 if event == 'end' and elem.tag == 'TSeq_sequence': 5 sequence = elem.text 6 seqtype = elem.attrib['value'] JSON 1 import json 2 sequnces = json.load(data) # assuming a list of JSON sequence objects 3 for i in sequences: 4 sequence = i['sequence'] 5 seqtype = i['seqtype'] Figure 2. A simplifed example illustrating the diferences in parsing XML and JSON syntax in Python. Numbers in italic indicate line numbers for each example. Python syntax is indicated in bold. Te XML syntax requires more checks since XML tags can open or close (‘start’ or ‘end’ event, respectively) and properties can be indicated as tag attributes, here ‘seqtype’ as encountered in NCBI’s TinySeq XML. Te JSON syntax facilitates data access and therefore reduces parsing complexity. Checks for the existence of the expected data have been omitted for simplicity. requires a transformation step, adding unnecessary complexity. Te existence of JSON parsers in almost all pro- gramming languages facilitates adapting existing and creating new pipelines. With the large amount of data now common in biology, human readability is of no concern and efciently inspecting data must be done by specifc tools. Herein, we use the JSON syntax in our new biological notation format to facilitate the inspection and exchange of structured biological data with the following properties. Reduced parsing complexity. Te JSON syntax describes only two structures; ordered list and named/ value pairs. Tese structures can describe virtually all biological data sets while retaining low parsing complexity, for example because additional checks like those for attribute values in XML tags are omitted. Complex data can still be described, such as the use of nested structures, but the structure limitation simplifes writing of the corresponding parsers. JSON can be therefore considered a very lean XML14. An example comparing parsing complexity is given in Fig. 2. Scheme independent. Tere are no fxed keywords, only a few rules on how to describe and handle data. Tis allows adjustments to be made for the majority, if not all, biological data sets. Te keywords are defned by the application authors or maintainers, and ofen explained in the source fles or accompanying documentation.