Hanwell et al. J Cheminform (2017) 9:55 DOI 10.1186/s13321-017-0241-z

RESEARCH ARTICLE Open Access Open chemistry: RESTful web APIs, JSON, NWChem and the modern web application Marcus D. Hanwell1*† , Wibe A. de Jong2† and Christopher J. Harris1

Abstract An end-to-end platform for chemical science research has been developed that integrates data from computational and experimental approaches through a modern web-based interface. The platform ofers an interactive and analytics environment that functions well on mobile, laptop and desktop devices. It ofers pragmatic solutions to ensure that large and complex data sets are more accessible. Existing desktop applications/frameworks were extended to integrate with high-performance computing resources, and ofer command-line tools to automate interaction—connecting distributed teams to this software platform on their own terms. The platform was developed openly, and all source code hosted on the GitHub platform with automated deployment possible using Ansible cou- pled with standard Ubuntu-based machine images deployed to cloud machines. The platform is designed to enable teams to reap the benefts of the connected web—going beyond what conventional search and analytics platforms ofer in this area. It also has the goal of ofering federated instances, that can be customized to the sites/research per- formed. Data gets stored using JSON, extending upon previous approaches using XML, building structures that sup- port calculations. These structures were developed to make it easy to process data across diferent languages, and send data to a JavaScript-based web client. Keywords: Chemistry, Web, Data, Semantic, NWChem, JSON

Background the Clean Energy Project [8] (over 2.3 million calculated Te in-silico determination of chemical and materi- structures). als properties is a vital capability that drives innovation Driven by the U.S. Materials Genome Initiative the across many market sectors. Its importance is refected in development of new and novel materials has become a the number of codes that can perform simulations over multidisciplinary research endeavor where complex sim- a broad range of levels of theory and length scales [1–5], ulation and experimental data get integrated, and analyt- and the enormous investments in experimental facilities ics such as machine learning techniques are utilized to that can also produce large data that is often difcult or aid in scientifc discovery. Tis same multidisciplinary impossible to reproduce. However, it is all too common approach is becoming essential in chemical and bio- for experimental and computational studies to take place logical research and development, in the design of new independently, with large scale studies often involving chemicals, biomolecules and drugs, or new energy ef- heroic eforts developing one-of software projects dedi- cient chemical production processes. cated to the specifc resource, such as the Protein Data Tere is a strong need to develop a collaborative scien- Bank [6] (over 100,000 experimental structures), Mate- tifc research software platform that enables researchers rials Project [7] (over 33,000 simulated materials), and to defne concepts and hypotheses, add them, and ana- lyze integrated sets of experimental and computational data to ofer efective knowledge discovery more univer- sally. Tis must go well beyond a web portal to create an *Correspondence: [email protected] interactive platform integrating simulation, experimental †Marcus D. Hanwell and Wibe A. de Jong contributed equally to this work data, and analytics, while leveraging semantic web tech- 1 Kitware, Inc., 28 Corporate Drive, Clifton Park, NY 12065, USA Full list of author information is available at the end of the article nologies to support the federated storage of data across

© The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Hanwell et al. J Cheminform (2017) 9:55 Page 2 of 10

geographically dispersed sites. Te use of language inde- with translation facilities using existing open source tools pendent programming interfaces that are consumed to existing fle formats, along with integrated visualiza- by web, desktop, and command-line clients will enable tion/analysis capabilities in the web browser. researchers to make much better use of their data. Te Te project can be used directly, extended for related platform described aims to provide an open source ser- use cases, or built upon in future work to ofer a more vice that can satisfy this basic need, and also ofer sim- comprehensive semantically enriched platform for chem- ple methods of extension to accommodate new areas of ical data. Most existing platforms use approaches that research. mix data with web page generation on the server, rather Te web has evolved signifcantly in the last two decades, than embracing modern client-server approaches using and new developments should be explored to assess what the latest web standards and frameworks. Te source can be achieved in a platform that seeks to use the latest code is often not available, and the platforms are curated open source tools, technologies, standards, and approaches centrally—such as the PDB, Materials Project and Clean to deliver an end-to-end platform for chemical/materi- Energy Project mentioned earlier. als research. Te basic approach employed was to develop a server component written in the Python language that Methods exposes RESTful (representational state transfer) endpoints An open source prototype web platform was developed to interact with the data on the server. Te Python code can to demonstrate key capabilities in addressing the needs use existing core functionality, Python modules, and wrapped outlined in the introduction, and summarized in Fig. 1. libraries developed in other programming languages. Te application has a number of components devel- Ideally no HTML, images, etc would be generated oped in several languages following modern develop- on the server, the server acts as a data server primarily ment methodologies. It was intentionally developed through RESTful endpoints that accept/return data, along using some of the latest technology innovations, which with user authentication (required by some endpoints). It means that it requires a modern web browser that sup- can trigger calculations, perform analyses, and batch jobs ports WebGL [9] in order to render 3D geometry, and it in order to make the data discoverable. Te server can makes extensive use of HTML5 [10]. It is clear that not then be consumed by a rich HTML5 web client, more tra- all devices/web browsers have full support for these tech- ditional desktop clients, and from command-line clients nologies at this time, but that this support is already sub- or other servers to perform automated workfows. stantial and will grow in the coming years. In this work a rich HTML5 web interface was developed Te server-side components were written in Python [11], that made use of the server’s API (application program- with wrapped C++ code [12] providing access to a num- ming interface) briefy discussed above. It consumes data ber of existing chemistry/materials libraries such as Avoga- from the server, uploading/editing data, and maintain- dro [13] and [14]. Te basis for the server-side ing local state in a one-page application reusing a popular project was the Girder project [15], which reuses a number open source HTML5 framework. Te application inte- of Python modules such as CherryPy [16] to provide a mini- grated other open source frameworks for client side chart- malist Python web framework, Swagger [17] to document the ing, and 3D rendering/visualization of molecular structure. programming interfaces, MongoDB [18] to store data/meta- Te web application was developed using open source data, and Virtuoso [19] to store triples. Te chemistry spe- tools, a number of frameworks, and was “built” as a static cifc functionality was developed as additional programming bundle of HTML5 web assets that are downloaded by the interfaces using the Girder project’s plugin mechanism to add web client. It dynamically constructs the page in response additional endpoints extending upon existing functionality. to user interaction, server data, and other client events Te client-side components were developed in HTML5, to provide a rich, interactive experience. Tis means using open source web frameworks/technologies such as that many interactions take place entirely on the client, AngularJS 1.6 [20] and Material Design [21] to provide a requiring no interaction with or access to the server. Tis single page web application. 3DMol.js [22] was used to ren- ofers interactive data visualization and analysis, even on der molecular geometry in 3D, D3 [23] to render charts, relatively low bandwidth links, once initial data for a mol- and responsive design elements to accommodate devices of ecule or calculation has been downloaded. various sizes/aspect rations. Te capabilities developed have Te feld of chemical sciences needs access to open been demonstrated on desktop browsers, mobile phones, source chemical data services that make use of open data and tablets on the major operating systems. Tis includes standards and formats. Te development of tools with Windows, macOS, Linux, iOS, and Android operating sys- programming language agnostic interfaces available over tems using browsers including Chrome, Firefox and Safari. standard web protocols is described. Standardized data Te features developed for the web platform made it representations in the database layer were employed, possible to expose functions originally developed for Hanwell et al. J Cheminform (2017) 9:55 Page 3 of 10

Public Client Backend

Linked Cloud Data Computational Resources

Semantic HPC Scheduler Data

Desktop MongoDB Clients Server-side Framework

RESTful Data IO APIs Triple Store

Rich Web Large Data Client Storage

Fig. 1 Architecture of the chemical data platform. Overview of the high-level architecture used for the chemical data platform

command-line/desktop use, this led to the reuse of the were also developed to upload and add data fles from the 2 libraries [24] for the ingestion of chemical command-line, enabling ingestion of existing data sets, data. Extensions to the Avogadro 2 libraries, and several and new ones as they are generated. other components, were made to support the web plat- form. Tese have been merged into the main develop- Results ment branch, and were made available in the 1.90 release Te software components were developed to serve the of the software. Capabilities, such as the visualization/ needs of data-centric chemistry research using open animation of vibrational data, and additional fle formats source approaches that embrace the use of open APIs, supporting the NWChem package were added to the open data formats, and open components. Te approach library. Te JSON readers/writers built upon the Json- made extensive use of client-side rendering/interaction Cpp library [25], and the capabilities were exposed to the wherever practical, and focused on a server-side compo- Python-based server using Boost.Python [26] wrapped nent that served data from web endpoints using the web- calls to the C++ API, and then mapped to web endpoints native JSON format where possible. Te development in the Python code developed for the server. spanned a number of programming languages (Fortran, Te computational chemistry code used as a generator C++, Python, and JavaScript) in order to ofer structured for the calculation results is the open source NWChem data that can be stored, queried, edited, and visualized. software suite [1]. Te JSON-Fortran library [27] was integrated into the NWChem source to enable the code MongoChemServer: server side platform to write out a new JSON fle in addition to the standard Te server code was developed using the Python 3 lan- output or log fle. APIs and interfaces between the For- guage as a Girder plugin. Te Girder framework is an tran-90 routines of the JSON-Fortran library and the open source project, and released under the Apache 2.0 Fortran-77 NWChem source were written to facilitate license. It has three main components: the transfer of the computational chemistry data into the JSON format. Te full JSON enabled NWChem source •• Data organization and dissemination code is available on Github [28]. To enable the end-user •• User management and authentication to convert existing log fles to the JSON format a Python •• Authorization management 3 library was created. Te library and examples are avail- able in a separate Github repository [29]. It is developed as an extensible data management plat- New JSON data structures were developed to support form, and it reuses a number of open source projects the end-to-end workfow from data generation, through including CherryPy—“a pythonic, object-oriented to ingestion, analysis, and visualization. Python scripts web framework”. Te code in the mongochemserver Hanwell et al. J Cheminform (2017) 9:55 Page 4 of 10

repository [30] extends the functionality provided in a Te access permissions can be applied at several levels plugin that is loaded by the Girder process when it starts in the code. A RESTful API must be exposed as a resource up. Te plugin adds RESTful API, reuses core function- which resolves various paths, which refer to namespaces ality and core plugins for more generic features such as within the API prefx and are documented using a system authentication, Gravatars, fle upload/download, access called “Swagger”. Tis enables developers to document control, etc. API as it is written, provides an HTML5 web client that Te platform provides integration with MongoDB, exposes this documentation, and ofers the ability to test using that to store user credentials, access permissions, API live on the web, shown in Fig. 2. Te API exposed metadata, and other elements exposed via its plugin sys- uses decorators to express whether a given piece of tem. Among the most useful abstractions provided in the API is public, or can only be accessed by authenticated context of this project are the authentication, access per- users. API that requires authentication can apply further mission, and fle storage systems. Almost all of these con- restrictions based on user privileges, and provide fltered cepts must be exposed on both the server and in the web results containing only data the authenticated user has client code to be used efectively. the access privileges for. Te existing OAuth2 plugin was used, and coupled File upload/download sounds quite simple on the surface, with Google’s OAuth2 implementation to ofer single- but it involves a number of distinct components in order to sign on. Tis can be replaced with other authentication scale and integrate well in diferent environments. Te pro- schemes, or augmented with multiple options. For sim- ject uses asset stores to abstract the storage backend, and plicity this was the only option ofered in the prototype the backend can then be mapped to fle systems, S3 stor- described, coupled with the use of encrypted SSL con- age (as provided by Amazon EC2), and others. Large fles nections to provide secure authenticated access. Tis must also be uploaded/downloaded in “chunks”, something choice enabled the deployment of a demonstration to ofered as part of the standard fle API and exposed in the multiple locations, but was not always the most appro- client application. File system storage proved sufcient for priate and will be augmented in future development the work described, but future deployments would beneft to include integration with site-wide systems where from using large fle stores, with extension to archive serv- appropriate. ers at supercomputing centers.

Fig. 2 Swagger documentation for RESTful API. An example of Swagger being used to test some of the RESTful API in the ‘molecules’ resource Hanwell et al. J Cheminform (2017) 9:55 Page 5 of 10

Te web APIs were extended with three main end- excellent glue, providing easy access to native Python code, points for the chemical data server. Te “molecules” C/C++ wrapped libraries, and even other web services. prefx provides functions to interact with molecular data, and is linked to from the other objects created. A MongoChemClient: rich HTML5 web client molecular graph is unique, and other objects such as cal- Te Girder project has its own web interface, but this was culations will refer to a molecule. Te simplest way to not used—a custom user interface was developed in the use the molecules endpoint is to use a GET query on the mongochemclient repository [31]. Te interface devel- name, InChI, or InChI key of a molecule to see if it exists. oped in this repository is a modern HTML5 web inter- If the search results in a math then a JSON array will be face. Tis means that all HTML5 assets can be served as returned with objects containing the felds id, inchikey, static fles, and the page is built up dynamically on the and name that can be used to retrieve each molecule. If client-side using the web API to authenticate (if neces- no match then an empty JSON array will be returned. sary), retrieve data, upload new date, and visualize data. Retrieving molecule records can be achieved using the A number of technologies and projects were leveraged in molecules/id endpoint, using the URL “/api/v1/molecules order to create a compelling, modern interface in a rela- /564a2fdd5573c07f61ce3db/xyz” would retrieve the mol- tively short space of time. ecule with the ID of “564a2fdd5573c07f61ce3db” in the Te main framework used to manage interaction, react to XYZ format. Changing that to “/api/v1/molecules/564a2fd events, and coordinate the single dynamic page approach d5573c07f61ce3db/cml” would retrieve the same molecule was AngularJS. Tis framework is divided into a number of in the CML format, and ending it with ’cjson’ would return modules that provide various services/extensions, such as the molecule in the Chemical JSON format described later. easy access to APIs, animations, routing (where the URL is Tere is a similar endpoint to query the database using updated to refect the current ‘location’ (state) despite being InChI keys. A “/molecules/conversions/{output_format}” in a single-page web application), and overall look and feel endpoint provides fle format conversion services. (Materials design in this instance). AngularJS was chosen A “/molecules/search” endpoint ofers a simple query for the rich feature set, maturity, and encapsulation of com- language that was exposed in the web interface, where ponents along with its powerful web page layout frame- it is possible to search on molecular mass with numeric work. An example of the single page application in action comparisons, logical AND or OR queries together. It is shown in Fig. 3, where a molecular structure can be seen contained a number of string based values that could beside a plot of vibrational modes that can be animated in be searched on, such as InChI, InChI key, name, chemi- the 3DMol.js based geometry viewer. cal formula, as well as numeric felds such as mass, atom Te static HTML5 content that serves as the frontend, count, and heavy atom count. Tis was implemented in and the dynamic programming interfaces ofering access a Python fle, and exposed in the endpoint, with inline to the data must be presented to the web client. Tis is documentation available by clicking on the search icon. where the NGINX web server [32] came in, ofering an Te other major endpoint developed was the “calcula- SSL-enabled endpoint for encryption, serving the static tions” prefx, this provides access to quantum chemi- content in the web root, and proxying requests to the / cal calculations. Tey must have a parent molecule, and api/v1 prefx to the Python-based backend. Te Python- are primarily accessed using the GET method, querying based backend also had access to the MongoDB server on the “moleculeId”. If there are calculations present for where all metadata, access controls, and links to fles a given moleculeId, a JSON array will be returned with were stored. Tere is also an asset store, where a simple objects containing the “_id” (the identifer for the calcula- on-disk asset store was used. tion), with various properties such as the name of the code Te software infrastructure involves a number of build/ performing the calculation, the theory, calculation types deployment systems, and a deployment repository [33] contained, and a fle identifer that refers to the original was necessary to coordinate the task of deploying every- output used to build the record. Te calculation identifer thing to the right location, with compatible software ver- can then be used to retrieve other elements of the calcula- sions, and ensuring services are brought up/down in the tion, such as the atomic coordinates as a JSON fle, a cube correct order. Te project employed an industry standard for a given molecular orbital, vibrational modes, etc. open source tool called Ansible [34] to document how All of these endpoints, along with the more generic ones the service was deployed on Amazon’s EC2 infrastruc- usch as fle, folder, group, item, user, etc can be viewed ture, and this approach can be adopted to other envi- using the built in Swagger. One of the important results ronments. Ansible automates the process of logging into is not the features described, but the simple mapping of specifed web hosts, setting up users, installing packages, Python code to endpoints, with inline documentation, that and placing everything in the correct place before start- can be rapidly extended and deployed. Python ofers an ing services such as the database and web services. Hanwell et al. J Cheminform (2017) 9:55 Page 6 of 10

Fig. 3 HTML5 web client with molecular structure and vibrational modes. The web client displaying a molecular structure in 3DMol.js, and selection of vibrational modes in a D3 chart

JSON data formats with the focus on providing a simple means of data stor- Over the years, various data formats have been developed age and exchange of chemical data for visualization. Te for computational chemistry to enable, among other JSON format was developed based upon the require- things, information exchange and visualization. More ments of storing data in BSON using MongoDB, com- recent work has focused on the development of ways to municating chemical data over JSON-RPC 2.0 between express results (log fles) from computational chemistry desktop applications, and later using web services. It in data formats that are suitable for knowledge discov- was also motivated by the need for a format to support ery on the semantic web. One example is the XML based an application being developed to edit and communicate Chemical Markup Language (CML) [35]. chemical structures. Te format was tested with small Here, we chose to build a data format using JSON molecules through to molecules with millions of atoms because it is less cumbersome than XML, with simpler using a philosophy of being machine readable, efcient, parsers, and is a native format of the web. It is based on and avoiding repetition. It was written to represent a sim- many of the concepts developed in the CML format as ple mapping of the in memory data structures. part of several ongoing collaborations under the broad Tis JSON format exposed the internal data model umbrella of the [36, 37]. Two JSON formats used in Avogadro 2, and was used as the transport layer were developed in parallel, one focused on providing to the 3DMol.js client-side rendering. It was extended to information essential for visualization platforms, and one expose vibrational modes, and a primitive container for with the goal to provide semantically enriched informa- volumetric data already present in the 3DMol.js project. tion from computational chemistry codes as an alterna- Te current state of the Chemical JSON format is docu- tive to plain text log fles. Comparison of the two JSON mented in a Github repository [38], and will continue to formats provides insight into the disparate needs of the evolve as it is extended to serve more use cases. use cases, and can serve as a guide for the development In parallel, the ChemLog JSON format was developed of a community wide JSON data format. with the goal of encapsulating the data from computational Te Chemical JSON format within the Avogadro 2 chemistry software, such as the open source NWChem, project has been in development for a number of years, with semantic annotation. Te goals of the two formats Hanwell et al. J Cheminform (2017) 9:55 Page 7 of 10

were somewhat orthogonal. Chemical JSON expresses the with efcient storage of atomic data. It does not provide a properties of a single molecule in machine readable arrays layout that is focused on human readability, and it is not that contain the essential data that might be visualized/dis- intended to be used as a directly editable set of objects in played. In contrast, the ChemLog JSON format is focused Python/JavaScript without some code in front of it to aid in on storing all the essential input and output data with manipulation/keeping the representation consistent. Tis semantic annotation from the computational chemistry representation is very easy to use in web platforms where software simulation (here for NWChem) for the (possibly asynchronous network calls can bring in progressively multiple) simulation(s). Tis is somewhat akin to the typi- more data about the molecule, starting with just atoms, cal molecular fle formats (XYZ, CML, SDF, etc) versus a then bonds, then molecular orbital cubes, vibrations, etc. more structured quantum mechanical log fle—a snapshot Te listing below shows the same molecule block of defnite structure versus data discovery and recording. within the ChemLog JSON format: Tis new format focuses on satisfying the need to replace a log fle with structured output, ofering output { "molecule " : of multiple calculations as a computational chemistry "id": "Molecule{ .1" , code like NWChem is executed. Te JSON fle generated " atoms" :[ by the NWChem application builds upon previous work { "id": " Atom.1.Mol.1" , done integrating CML into NWChem [39] (an XML- " elementLabel" : "o" , based format), with some related eforts extending CML " elementSymbol" : "O" , " elementNumber" :8, for the semantic web using an intermediary format called " elementName " : " Oxygen" , " cartesianCoordinates" : CSX [40]. It is designed around the notion of using objects { as the primary representation. As a starting point the CML "value" :[ 0.0E+0, naming and conventions were adopted and extended with 0.0E+0, new concepts. Te format was developed and designed to 0.2106360400000002E+0 ], be portable to multiple computational chemistry codes but "units" : "bohr" in this work frmly targeted the NWChem code. , } Te relevant section representing the geometry/ } structure of JSON fles generated in the two formats, in { "id": " Atom.2.Mol.1" , this case for a water molecule, are compared to clearly " elementLabel" : "h" , show the diference in design philosophies, i.e. arrays " elementSymbol" : "H" , " elementNumber" :1, vs objects, driven by the use case, i.e. visualization vs " elementName " : " Hydrogen " , semantically enriched semantic. Te listing below shows " cartesianCoordinates" : a molecule block within the Chemical JSON format: "value" :[ { 0.1841188380000002E+1, 0.− 0E+0, { 0.8425441600000008E+0 "chemical json" :0, − "atoms" : ], " coords{" : "units" : "bohr" { "3d":[0.00 ,0.00, 0.14 , } 0.76,0.00, 0.46 , , − − } 0.76 ,0.00, 0.46 ] { , − "id": " Atom.3.Mol.1" , "}elements " : " elementLabel" : "h" , " number" :[{8, 1, 1] " elementSymbol" : "H" , " elementNumber" :1, , } " elementName " : " Hydrogen " , } " cartesianCoordinates" : "bonds" : { " connect{ions": "value" :[ " index" :[0, {1, 0, 2] 0.1841188380000002E+1, , 0.0E+0, "}order" :[1, 1] 0.8425441600000008E+0 ],− } "units" : "bohr" } } Te format focuses on using arrays to store informa- ] } tion about atoms, bonds, etc—such as the atomic num- } ber, bond order, etc. Te 3D coordinates are in an array, } and each 3D position is ofset by 3N where N is the index In contrast to the Chemical JSON listing, all informa- of the atom. Tis ofers compact, simple representations tion and properties of each atom in ChemLog JSON is Hanwell et al. J Cheminform (2017) 9:55 Page 8 of 10

collected in a single object. A key aspect of this format JSON format has adopted the CML approach of stor- is the explicit defnition of units, a key aspect brought in ing each calculation in a “calculations” array. Te listing for the CML specifcation. It is relatively simple to map above shows an excerpt of the fourth calculation in this between the two formats, and this is now possible in series. Often, data from previous calculation steps in the Avogadro 2 libraries which feature readers for both the sequence is reused in the code. formats. To avoid duplicating information, “id” tags were used to In developing the ChemLog JSON format, some addi- provide a pointer to the frst mention of the JSON object. tional referencing features were introduced that are not An example of this in the listing above are the key “mol- natural to JSON. Te listing below shows an excerpt of a ecule” to “Molecule.2”, which is defned as the id-tag in JSON data fle containing the calculation setup and some the molecule block in the previous listing. Another exam- of the properties that can be calculated with quantum ple is the “basisSet” keyword in the “calculationSetup” chemistry software: block. Even the “calculationSetup” could be the same for multiple calculations, and could be pointed to in this fashion. Te same approach is used to identify the atom { " calculationType" : " molecularProperties" , within the molecule to which the calculated atomic prop- " molecularFormula" : " H2O" , erties belong. Here the “atom” keyword points to atom "id": " calculation .4" , " calculationSetup" : “Atom.1.Mol.2” in the earlier ChemLog JSON listing " outputVectors": ".{/ prop_h2o_run. movecs", for the molecular geometry. Essentially, this referencing " molecularSpinMultiplicity":1, approach mimics the ability to link diferent sets of data. "molecule" : "Molecule.2" , "charge" :0, Te referencing feature is missing in the current JSON " waveFunctionType" : "RHF" , specifcation but native to XML. Te JSON pointer has " numberOfElectrons" :10, been proposed in RFC 6091 [41] and a draft RFC is being "basisSet" : "BasisSet.1" , " waveFunctionTheory" : "Hartree -Fock" , developed for the JSON reference [42]. While JSON ref- " inputVectors" : "./ prop_h2o_run. movecs", erence would provide the necessary fexibility, either "id": " calculationSetup.4" could readily be adopted in the format in the near future. , "}calculationResults" : Examples of complete ChemLog JSON fles generated " molecularProperties"{ :[ by the NWChem quantum chemistry code can be found on the Github repository [29]. { " Molecule " : " Molecule .2" , " dipoleMoment" : Te JSON fle can be submitted to the web platform "totalMoment " : { { developed, and relevant metadata will be extracted. Te " units" : " atomicunits" , format is capable of encapsulating jobs that have multiple " value" :0.8052087008 , steps, and it has been demonstrated with vibrational and , } electronic structure data. Te format breaks most ele- "}quadrupoleMoment": " diamagneticSusceptibility{ " : ments of the output into distinct objects, and it has been " units" : " atomicunits" , { designed to be more semantically expressive than the " value" : 18.596483 Chemical JSON format developed as part of the Avoga- , "m} omentXZ" : dro 2 project. Future work will draw upon new develop- " units" : "{atomicunits" , ments in JSON-LD [40, 43] that aims to bring linked data " value" :0.0 and meaning to JSON using the same approaches used in } semantic data structures. , } Te two formats were used as part of this project, with } several extensions made to the Chemical JSON format { " atom" : "Atom.1.Mol.2" , " diamagneticShielding" : in order to support data needed by the project. Tey are "units" : "atomic units{" , both supported by the Avogadro 2 libraries, and exposed "value" : 23.457292 in the Python-wrapped API used on the server side. Te } Chemical JSON format is designed in a pragmatic fash- ] } ion, making extensive use of arrays, along with key-value } pairs for simple properties. Te format closely matches } in-memory structures, it is optimized for storage in BSON (the binary form of JSON used internally by Mon- Input structures in computational chemistry codes are goDB), is easy to visualize and store as a single document set up to allow a sequence of calculations to be executed in MongoDB. Te new ChemLog JSON format devel- in one single run. To store this sequence, the ChemLog oped tends to represent everything as an object, can store Hanwell et al. J Cheminform (2017) 9:55 Page 9 of 10

multiple confgurations/states of a molecule, and uses signifcant barriers to wider adoption. Te development links to reuse objects that have not changed. Tis makes of permissively licensed open source code, which can be it more complex to parse, but more expressive, and it readily inspected, reused, and modifed is one approach more closely refects what is currently stored in log fles. to improving the situation. Coupled with the use of com- As stated before, extensions were needed in the inter- munity platforms such as GitHub for collaborative devel- nal models used in the Avogadro 2 projects, the 3Dmol. opment, and workshops gathering interested experts, js project, and in the model employed in the web frame- these approaches can be iterated upon and standardized. work. Ultimately, any given visualization can only show a single state, and the web and desktop frameworks both Conclusions generally assume a fle only has a single state with mini- A “full-stack” open source web application was devel- mal links back to the input fle. Tis concept will need to oped, making heavy use of Python and open source com- be extended in order to visualize and analyze data from munity tools on the backend, and HTML5/JavaScript on more complex multi-step fles, and some of this develop- the frontend. Years of experience in using, converting, ment was started in the work described. and developing data formats was drawn upon to create some new formats for the output of structured data from Discussion the NWChem code, ingestion into databases, and the Research in chemistry is becoming more data intensive, subsequent visualization/analysis of data. Tis was dem- and as a result it is vital that we act to create platforms onstrated in a new web client, and added to the Avogadro that can be easily deployed, used, and shared with a focus 2 desktop application. Te primary data types were 3D on data. Tis requires standard formats that support a chemical structure, electronic structure, and vibrational multitude of facets of chemical data that leverage indus- modes on the web and desktop. try standard technologies such as XML, JSON, and the If the feld of chemical sciences is to progress towards semantic web. knowledge discovery on the semantic web, it is important As we move closer to a landscape where the web is to move away from disparate formats and basic data log an essential part of any workfow we must embrace fles. Te ingestion and normalization of data still repre- web-native formats, generally using JSON and similar sents a serious challenge to reaping the benefts power- containers, to enable the free exchange, analysis and visu- ful data-centric platforms promise. Te Avogadro project alization of data. Tis can be further augmented through provides translation to and from standard chemical data the use of JSON-LD, and is something the authors would formats, these can be read by Open Babel and translated like to do in order to ofer semantic meaning and enable to all supported formats. Te authors were made aware of the simple transformation of JSON-LD documents to a new efort by the European Materials Modeling Coun- triples. cil (EMMC) on enhancing interoperability in materials Te use of triples open up the semantic, linked web modeling codes. As this paper was written, the authors that is becoming increasingly important as this widens and researchers at the Molecular Sciences Software Insti- access to the data. As we move forward, and data-centric tute (MolSSI) started a working group committed to the computing proliferates embracing the standards of the development of a unifed community (web, visualization, semantic web will make it easier to fnd and use the data workfow, computational chemistry codes, and others) generated, and it will also make the meaning of specifc supported JSON data format. keys clear to machines, i.e. the defnition of a 3D coordi- nate, and the units it is expressed in, of the “energy” of a Abbreviations quantum calculation—what quantity is being expressed, 3D: three dimensional; API: application programming interface; CSS: cascading on what molecule, and in what units. Tis requires nor- style sheets; CML: Chemical Markup Language; HTML: Hypertext Markup Lan- guage; HPC: high-performance computing; JSON: JavaScript object notation; malized data, but goes well beyond it to express data less LBNL: Lawrence Berkeley National Laboratory; REST: representational state ambiguously and more formally. transfer; SSL: Secure sockets layer; XML: Extensible Markup Language. Once these components are available the open source Authors’ contributions code referenced here, and available as a demonstration MDH and CJH developed the client and server code, new fle format support capability, show what can be built around it. Te REST- in Avogadro 2, and the Chemical JSON fle format. WDJ developed the Comp- ful APIs, databases, and HTML5 frontends can be readily Chem JSON data format, the JSON writer in the NWChem application and the Python NWChem output converter, and generated the simulation data used coupled with command line and powerful desktop appli- in testing. MDH and WDJ wrote the manuscript. All authors read and approved cations to ofer a compelling ecosystem of data-centric the fnal manuscript. applications that serve the chemical research enterprise. Author details Te defnition of suitable schema, codifcation of that 1 Kitware, Inc., 28 Corporate Drive, Clifton Park, NY 12065, USA. 2 LBNL, One into JSON-LD, and community agreement still represent Cyclotron Road, Berkeley, CA 94720, USA. Hanwell et al. J Cheminform (2017) 9:55 Page 10 of 10

Acknowledgements genome approach to accelerating materials innovation. APL Mater Not applicable. 1(1):011002 8. Hachmann J, Olivares-Amaya R, Atahan-Evrenk S, Amador-Bedolla C, Competing interests Sánchez-Carrera RS, Gold-Parker A, Vogt L, Brockway AM, Aspuru-Guzik The authors declare that they have no competing interests. A (2011) The Harvard Clean Energy Project: large-scale computational screening and design of organic photovoltaics on the World Community Availability of data and materials Grid. J Phys Chem Lett 2(17):2241–2251 All code and data discussed in this manuscript is available under open licenses 9. WebGL. https://www.khronos.org/webgl/ on GitHub. The following repositories contain all the data and materials: 10. HTML5. https://www.w3.org/TR/html5/ Chemical data RESTful server [15, 30] 11. Python. https://www.python.org/ Web frontend single-page application [31] 12. C . https://isocpp.org/ Chemical JSON specifcation and examples [38] 13. Hanwell++ MD, Curtis DE, Lonie DC, Vandermeersch T, Zurek E, Hutchison NWChem with JSON output [28] GR (2012) Avogadro: an advanced semantic chemical editor, visualization, Conversion tool from NWChem log fles to JSON [29]. and analysis platform. J Cheminform 4:17 14. OBoyle N, Banck M, James C, Morley C, Vandermeersch T, Hutchison G Consent for publication (2011) Open babel: an open chemical toolbox. J Cheminform 3:1–14. Not applicable. doi:10.1186/1758-2946-3-33 15. Girder. https://github.com/girder/girder Ethics approval and consent to participate 16. CherryPy. http://cherrypy.org/ Not applicable. 17. Swagger. https://swagger.io/ 18. MongoDB. https://www.mongodb.com/ Funding 19. Virtuoso. https://github.com/openlink/virtuoso-opensource This material is based upon work supported by the U.S. Department of Energy, 20. AngularJS. https://angularjs.org/ Ofce of Science, Ofce of Basic Energy Sciences, Small Business Innovation 21. Material Design. https://material.io/ Research (SBIR) program under Award Number DE-SC0013250. 22. Rego N, Koes D (2015) 3dmol.js: molecular visualization with webgl. Bioinformatics 31(8):1322 23. Bostock M, Ogievetsky V, Heer J (2011) D3: data-driven documents. IEEE Publisher’s Note Trans Vis Comput Graph (Proc. InfoVis) Springer Nature remains neutral with regard to jurisdictional claims in pub- 24. Avogadro Libraries. https://github.com/openchemistry/avogadrolibs lished maps and institutional afliations. 25. JsonCpp. https://github.com/open-source-parsers/jsoncpp 26. Boost.Python. http://boostorg.github.io/python Received: 26 May 2017 Accepted: 23 October 2017 27. JSON-Fortran Version 2.0.0. https://github.com/jacobwilliams/ json-fortran/wiki 28. NWChem-ChemLog-JSON Writer Github Repository. https://github.com/ wadejong/NWChem-Json 29. Python ChemLog JSON Github Repository. https://github.com/ wadejong/NWChemOutputToJson References 30. mongochemserver. https://github.com/openchemistry/ 1. Valiev M, Bylaska E, Govind N, Kowalski K, Straatsma T, van Dam H, Wang mongochemserver D, Nieplocha J, Apra E, Windus T, de Jong W (2010) NWChem: a com- 31. mongochemclient. https://github.com/openchemistry/ prehensive and scalable open-source solution for large scale molecular mongochemclient simulations. Comput Phys Commun 181:1477 32. NGINX. https://www.nginx.com/ 2. Kong J, White CA, Krylov AI, Sherrill D, Adamson RD, Furlani TR, Lee MS, 33. mongochemdeploy. https://github.com/openchemistry/ Lee AM, Gwaltney SR, Adams TR, Ochsenfeld C, Gilbert ATB, Kedziora GS, mongochemdeploy Rassolov VA, Maurice DR, Nair N, Shao Y, Besley NA, Maslen PE, Dombroski 34. Ansible. https://www.ansible.com/ JP, Daschel H, Zhang W, Korambath PP, Baker J, Byrd EFC, Van Voorhis T, 35. Murray-Rust P, Rzepa HS (2002) Markup languages—how to structure Oumi M, Hirata S, Hsu C-P, Ishikawa N, Florian J, Warshel A, Johnson BG, chemistry related documents. Chem Int 4(24):9–13 Gill PMW, Head-Gordon M, Pople JA (2000) Q-chem 2.0: a high-perfor- 36. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck mance ab initio electronic structure program package. J Comput Chem C, Wegner J, Willighagen EL (2006) The blue obelisk—interoperability in 21(16):1532–1548 chemical informatics. J Chem Inf Model 46(3):991 3. Hutter J, Iannuzzi M, Schifmann F, VandeVondele J (2014) : atomistic 37. O’Boyle NM, Guha R, Willighagen EL, Adams SE, Alvarsson J, Bradley J-C, simulations of condensed matter systems. Wiley Interdiscip Rev Comput Filippov IV, Hanson RM, Hanwell MD, Hutchison GR et al (2011) Open Mol Sci 4(1):15–25 data, open source and open standards in chemistry: the Blue Obelisk fve 4. Gonze X, Amadon B, Anglade P-M, Beuken J-M, Bottin F, Boulanger P, years on. J Cheminform 3:37 Bruneval F, Caliste D, Caracas R, Côt;é M, Deutsch T, Genovese L, Ghosez 38. Chemical JSON Github Repository. https://github.com/OpenChemistry/ P, Giantomassi M, Goedecker S, Hamann DR, Hermet P, Jollet F, Jomard G, chemicaljson/blob/master/chemicaljson.md Leroux S, Mancini M, Mazevet S, Oliveira MJT, Onida G, Pouillon Y, Rangel 39. de Jong WA, Walker AM, Hanwell MD (2013) From data to analysis: linking T, Rignanese G-M, Sangalli D, Shaltaf R, Torrent M, Verstraete MJ, Zerah NWChem and Avogadro with the syntax and semantics of Chemical G, Zwanziger JW (2009) Abinit: frst-principles approach to material and Markup Language. J Cheminform 5:25 nanosystem properties. Comput Phys Commun 180(12):2582–2615 (40 40. Wang B, Dobosh PA, Chalk S, Sopek M, Ostlund NS (2017) Computational YEARS OF CPC: A celebratory issue focused on quality software for high chemistry data management platform based on the semantic web. J performance, grid and novel computing architectures) Phys Chem A 121(1):298–307 5. List of Quantum Chemistry and Solid-state Phys- 41. RFC 6091 for JSON Pointer. https://tools.ietf.org/html/rfc6901 ics Software. https://en.wikipedia.org/wiki/ 42. Draft RFC for JSON Reference. https://tools.ietf.org/html/ List_of_quantum_chemistry_and_solid-state_physics_software draft-pbryan-zyp-json-ref-03 6. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shin- 43. JSON-LD. http://json-ld.org/ dyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28:235–242 7. Jain A, Ong SP, Hautier G, Chen W, Richards WD, Dacek S, Cholia S, Gunter D, Skinner D, Ceder G, Persson Ka (2013) The materials project: a materials