
Markup Languages: How to Structure Chemistry-Related Documents

by Peter Murray-Rust and Henry S. Rzepa

Although the use of markup languages in publishing goes back to the 1960s, when IBM introduced GML (Generalized Markup Language), which subsequently evolved into the standard SGML, most authors are nowadays more familiar with the more recent implementation, HTML (HyperText Markup Language). The rapid rise in the use of HTML in conjunction with the growth of the World Wide Web was in large measure due to its ease of use for achieving presentational and visual effect. However, its limitations as a mechanism for expressing precisely defined data and meanings were not always adequately recognized. These limitations meant that in areas such as molecular sciences, where precise meanings are essential, a variety of often proprietary solutions continued to be used to define and manipulate molecular “data” and information. The publishing processes were seen as quite separate, and the process of translating data, information, and knowledge into a published entity remained an activity requiring much human perception. It is also worth noting that the reverse process of converting the published materials back into usable data remained equally human-intensive and hence expensive.

The need to reconcile these two extremes was recognized at the first World Wide Web conference in 1994. A solution gelled shortly after the conference, as a remarkable communal effort resulted in the specification of the Extensible Markup Language (XML). The ultimate vision of XML, as described by Berners-Lee, is the creation of a “Semantic Web” [1]. The rationale for this impressive effort included the following:

● Provision of a more universal infrastructure for publishing
● Recognition that the use of XML will require subject-specific vocabularies called “ontologies”. Ontology is defined as a description (such as a formal specification of a program) of the concepts and relationships that can exist for a software agent or a community of agents.
● Provision of a mechanism for enhancing quality (“validation”)
● Promotion of the creation of dynamic hyper-documents
● Recognition of the need to be able to reuse components of documents for other purposes
● Provision of a mechanism for creating smart archives, in which the reusable components (information objects) can be readily identified
● Creation of an infrastructure for underpinning the emerging areas of e-business

The extension to chemistry therefore included the creation of a new generation of ontologically rich primary publication and a clear division of the respective roles of humans and software agents (robots). Thus, humans should be able to:

● Publish all their data automatically
● Eliminate errors from publications
● Use the published literature as a database
● Understand information from other domains

Robots should be able to:

● Analyze publications (on whatever scale)
● Create secondary publications
● Purchase chemicals
● Synthesize chemicals from literature

To achieve this, we argue that a number of prerequisites must be in place:

● Automatic data capture, especially from instruments. We note that in 30 years we have moved from instruments that often captured data only in analogue form (chart paper), to standard computers that capture and process data, and most recently to an increasing tendency to place these computers online and connect them to centralized data stores.
● Common ontologies for a specific community (e.g., molecular science)
● Ontologically guided authoring

Issues Involved in “Capturing” Chemistry

The following extract [2] from a typical science journal illustrates both how precisely data and information must be represented and how much human perception is required to translate this information (e.g., into a reproducible experiment or a mechanistic interpretation):

“Thiamin phosphate synthase catalyzes the formation of thiamin phosphate from 4-amino-5-(hydroxymethyl)-2-methylpyrimidine pyrophosphate and 5-(hydroxyethyl)-4-methylthiazole phosphate. The reaction involves . . . dissociative mechanism . . . carbenium ion intermediate . . . and pyrimidine iminemethide observed in the crystal . . .”

Chemistry International, 2002, Vol 24, No. 4

Note the profusion of chemical structure information, concepts, and terms, which only a trained human chemist could easily process. Quantitative concepts and units are also ubiquitous:

“A 500 µl aliquot of 0.8 µM TP synthase in 50 mM Tris-HCl (pH 7.5) and 6 mM MgCl2 incubated at room temperature with 50 µM CF3HMP-PP.”

Human perception is also required when handling graphical chemical representations, which may contain many, often fuzzy and dangerous, human-only semantics (e.g., 2-D representations of 3-D properties, relative stereochemistry, aromaticity, hydrogen and other “weak” bonding, use of generic and “R” groups, reaction arrows, mechanisms, etc.). The challenge, therefore, is to develop an infrastructure that can be routinely used to capture, store, and appropriately filter and display such information.

The Current Position of XML

As of 2002, XML offers a general, powerful, and extensible mechanism for handling both the “capture” and the publication of chemical information. In particular, XML allows this process, for the first time, to operate equally well in both directions. Our basis for stating this derives from the following observations:

● XML is increasingly accepted as an information infrastructure.
● The protocols are all public and many of the tools are open source.
● XML is vendor neutral, but with heavy vendor involvement.
● There is a large communal investment in generic tools (e.g., business2business, e-commerce).
● XML has a modular approach; an application is built from components.
● Domains are expected to create domain-specific XML protocols and tools.
● XML is increasingly universal in back-ends, middleware, and servers.
● Support for XML from database vendors is rapidly increasing.
● XML has close interoperability with other informatics standards such as UML, OMG/CORBA, etc.
● There is increasing support for “XML over the net” and from browsers (e.g., Explorer, Netscape 6, etc.).
● XML is very well supported by books, tutorials, etc.

Global Open Activity in Scientific XML

So how has the scientific community adopted these concepts? The first World Wide Web conference specifically identified mathematics and chemistry as requiring specific markup languages. With this spark, CML (Chemical Markup Language) evolved between 1995 and 1997 to become the first scientific extended markup language. A concurrent effort led to MathML being formalized as such in 1998 [3]. We estimate that by 2002, perhaps 50 specifically scientific applications have been described in some degree. For example, one online listing quotes 37 scientific applications, and a more general listing is also available. The Science Citation Index shows around 570 references to the keyword XML, and SciFinder retrieves 38 references to the term “XML in chemistry.”

We also emphasize that XML is designed to allow markup languages to be combined, at whatever level of granularity, so that documents can contain any number of components deriving from specific XML languages. HTML, which we noted above, has evolved into one such language (XHTML), and in its latest development has been modularized into smaller, more easily implemented components (e.g., XFORMS, a data-entry and validation component, can be implemented separately from other, more display-oriented components). XHTML can coexist in a document with languages such as SVG (a scalable vector graphical language), MathML, and CML. We elaborate on this when discussing namespaces [4] (vide infra).

Some Essentials of an XML System

The following tasks will have to be accomplished in order to implement an XML solution to publishing chemical information:

● Creation of documents, both from legacy sources of data and de novo by humans
● Creation and capture of metadata (dictionaries of terms, tables of contents, codes, etc.)
● Specification of namespaces (a reserved addressing scheme for information)
● Human validation of the system (conformance to agreed specifications)
● Machine validation of documents (according to a specified and agreed-upon schema)
● Document transformation (XSLT)
● Rendering and display (XSL-FO, domain-specific such as molecular representations)
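The claim that XML lets capture and publication operate in both directions can be sketched as a small round trip. The following is only an illustration, assuming hypothetical element names (`sample`, `component`, `concentration`, `pH`) that are not taken from CML or any published schema; the quantities are those of the quoted extract, and Python's standard `xml.etree.ElementTree` module stands in for an XML writer and reader.

```python
import xml.etree.ElementTree as ET

# Capture: express the quoted quantities as explicit value/unit markup.
# All element and attribute names here are hypothetical, chosen for illustration.
sample = ET.Element("sample")
for name, value, units in [
    ("TP synthase", "0.8", "µM"),
    ("Tris-HCl", "50", "mM"),
    ("MgCl2", "6", "mM"),
]:
    component = ET.SubElement(sample, "component", {"name": name})
    ET.SubElement(component, "concentration", {"units": units}).text = value
ET.SubElement(sample, "pH").text = "7.5"

# Publication: serialize the in-memory tree to XML text.
xml_text = ET.tostring(sample, encoding="unicode")

# Reverse direction: a program recovers typed data from the published text
# without any human interpretation of prose.
parsed = ET.fromstring(xml_text)
concentrations = {}
for component in parsed.iter("component"):
    conc = component.find("concentration")
    concentrations[component.get("name")] = (float(conc.text), conc.get("units"))
ph = float(parsed.find("pH").text)

print(concentrations)
print(ph)
```

The point of the sketch is only that the same markup serves the author (writing) and the software agent (reading); a real system would use agreed vocabularies rather than ad hoc element names.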

The design of an XML-based markup language should provide for the following:

● A simple, extensible document type definition (DTD) or schema (modular and not over-complicated)
● Agreed semantics
● One or more agreed and published ontologies
● Agreed examples and conformance tests
● A community of critical mass

Appropriate tools for accomplishing this should be identified. These might include the following:

● XML writers
● XML readers (more difficult than writers, since the XML may not be normalized to a single form)
● Legacy converters (difficult because of variation and ambiguity in the original data, which may require some degree of perception for an accurate conversion)
● Validators
● Dictionaries
● Editors

Custom-written XSLT style sheets and generic editors will accomplish some of these, but a document object model (DOM), which represents a syntax-free abstraction of the data in memory, is probably essential for many subjects.

Ontologies of Relevance to Chemistry*

An overview of the types of ontologies required is shown in Table 1. Of the chemically specific information types, support should be included for:

● Molecules and substances
● Reactions
● Analytical information, especially spectra
● Computation and simulation (QM, mechanics, dynamics, etc.)
● “Data-centric” concepts (numbers, units, arrays, matrices, etc.)
● Specialist software for display, editing, searching, etc.
● “Adjoining” disciplines such as bio areas, materials science, etc.

Table 1: Types of Ontologies Relevant to XML in Chemistry and Tasks for the Chemical Community

General Non-Chemical Informatics
    Examples: Business and commerce, government, regulatory, academic, publishing, etc.
    Task: Reuse existing or emerging approaches.

Domain-Specific Non-Chemical
    Examples: Mathematics (MathML), healthcare (HL7/XML), genomics (GeneOntology), etc.
    Task: Collaborate to reuse existing or emerging approaches.

Chemical-Specific but Generic Information Types
    Examples: Numeric data, descriptive prose, safety
    Task: Create ontologies and reuse generic tools.

Chemical-Specific Information Types
    Examples: Chemical substances, molecules, analytical and spectroscopic, reactions
    Task: Build the complete tool set.

Creating Valid XML Documents

Generic tools and protocols already exist to create valid XML documents. In particular, the use of DTDs (Document Type Definitions) and Schemas can bring enormous benefits, including eliminating or reducing software failure due to the use of invalid data, and reducing the difficulty of (human) understanding due to invalid publications. The DTD is a concept rooted in SGML, and is still used in XML to constrain the markup vocabulary (i.e., the basic elements used for markup) and to some extent the (sub)structure of documents (i.e., which element can be a parent or child of another). Schemas are a more recent development and, unlike DTDs, are themselves expressed using XML. Of particular relevance to chemistry, they provide advantages over DTDs in that they can also be used for:

● Datatyping: numbers and user-defined types
● Enumeration (for example, to specify the list of chemical elements)
● Lexical patterns
● Inheritance

Moreover, schemas allow for additional user-created rules (Schematron/XSLT) and, with dictionaries, support conversion to software (e.g., CML-DOM), authoring (e.g., in editors), and validation of the data on entry by the user.

Namespaces: The Key to Making It Unique

Each information object must be uniquely named to avoid collision and ambiguity. This is achieved using XML namespacing.

The example below shows a paragraph of text (derived from XHTML, which inherits the default namespace), within which components of CML are

*In this context, the term ontology refers to a machine-readable set of definitions that create a taxonomy of classes and subclasses and relationships between them.

embedded, including prefixes using the defined namespaces:

We can supply the following set of molecules:
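A minimal sketch of such a mixed-namespace paragraph follows. It is an illustration under stated assumptions rather than the authors' markup: the XHTML namespace URI is the standard one, but the CML namespace URI, the `cml` prefix, and the `molecule` elements with their `id` and `title` attributes are assumed here for demonstration. The code parses the paragraph with Python's standard library and shows that each element carries its namespace as part of its name, so an XHTML `p` and a CML `molecule` cannot collide.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CML_NS = "http://www.xml-cml.org/schema"  # assumed URI, used for illustration only

# The paragraph inherits the default (XHTML) namespace; the molecules are
# qualified with a prefix bound to the (assumed) CML namespace.
DOC = f"""<p xmlns="{XHTML_NS}" xmlns:cml="{CML_NS}">
  We can supply the following set of molecules:
  <cml:molecule id="m1" title="water"/>
  <cml:molecule id="m2" title="ethanol"/>
</p>"""

root = ET.fromstring(DOC)

# ElementTree exposes qualified names in the form "{namespace-uri}localname".
molecules = root.findall(f"{{{CML_NS}}}molecule")
titles = [m.get("title") for m in molecules]

print(root.tag)
print(titles)
```

A CML-aware tool can select only the elements in its own namespace, while a browser can render the surrounding XHTML and ignore or delegate the rest.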

A proposal [5] for domain-independent components for Scientific-Technical-Medical information, or STMML, contains key elements such as units, dictionary, metadata, item, array, and matrix, and supports datatypes such as numbers, max/min, ranges, errors, etc. The next example illustrates how CML can be used in conjunction with the STMML namespace [5] to specify units and their constraints:

5.628 5.628 5.628 90 90 90

A more extended example of this concatenation of namespaces [6] contains up to eight namespaced components and illustrates how a complete publication in XML/CML could be achieved.

The use of namespaces can be seen in a more general context in Figure 1, which illustrates how the various specific XML components might relate to each other. In particular, we note how the original CML specification [7] can be modularized into a core namespace and extended via other schemas into the following:

● CMLReact. A reaction, containing reactantLists, productLists, and links between them.
● CMLComp. A container for computational and simulation input and results.
● CMLQuery. A generic query language.
● Hooks for other Schemas, such as SpectHook, for spectral parameters and data and links to molecular details (assignment).

Figure 1: The use of namespaces in CML.

Dictionaries and Schemas

It is useful to separate the domain ontology from the abstract . . . and which helps extensibility. Thus, with the . . .

● The data instance
● The XMLSchema describing the instance

. . . Figure 2, where, for example, units are themselves ver- . . .

Figure 2: Validation scheme using dictionaries.
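The idea of checking a data instance against datatypes, enumerations, and a dictionary can be sketched in a few lines. This is a hand-rolled check standing in for schema validation, not XML Schema or Schematron itself, and everything in it is assumed for illustration: the `cell`, `scalar`, and `atom` element names, the tiny units dictionary, and the shortened list of element symbols. Only the values 5.628 and 90 echo the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical units dictionary: unit identifier -> kind of quantity.
UNITS = {"angstrom": "length", "degree": "angle"}
# Enumeration constraint: a deliberately short, illustrative list of symbols.
ELEMENT_SYMBOLS = {"H", "C", "N", "O"}

# Hypothetical data instance containing two deliberate errors.
INSTANCE = """<cell>
  <scalar title="a" units="angstrom">5.628</scalar>
  <scalar title="alpha" units="degree">90</scalar>
  <scalar title="b" units="furlong">5.628</scalar>
  <atom elementType="C"/>
  <atom elementType="Xx"/>
</cell>"""


def validate(xml_text):
    """Return a list of human-readable problems found in the instance."""
    problems = []
    root = ET.fromstring(xml_text)
    for scalar in root.iter("scalar"):
        title = scalar.get("title")
        # Datatyping: the content must parse as a number.
        try:
            float(scalar.text)
        except (TypeError, ValueError):
            problems.append(f"{title}: value {scalar.text!r} is not a number")
        # Dictionary lookup: the units must be a known dictionary entry.
        if scalar.get("units") not in UNITS:
            problems.append(f"{title}: unknown units {scalar.get('units')!r}")
    for atom in root.iter("atom"):
        # Enumeration: the symbol must come from the permitted list.
        symbol = atom.get("elementType")
        if symbol not in ELEMENT_SYMBOLS:
            problems.append(f"atom: element {symbol!r} not in enumeration")
    return problems


problems = validate(INSTANCE)
for problem in problems:
    print(problem)
```

Keeping the units and element lists outside the checking code mirrors the separation the text describes: the dictionary can be curated and versioned independently of the schema and of the software that applies it.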

Document Structure and Metadata

Common dictionaries and compendia usually have some of the following features:

● Dictionaries consist of curated entries and many are “flat” (e.g., the IUPAC Gold Book).
● Dictionaries are compiled within a single hierarchy:
    - generic (“is A”): eukaryote <-- vertebrate <-- mammal <-- human
    - partitive (“has A”): body <-- leg <-- foot <-- toe
● Dictionaries can now be associated with a namespace for uniqueness and navigation.
● Dictionaries must have curatorial information.
● Dictionaries should support versioning.

Metadata is an important component of a document or information object, and it can serve a number of purposes:

● Navigational/Discovery: How is a piece of information to be discovered (e.g., Dublin Core and GILS)?
● Descriptive: What does the information mean and how is it to be used?
● Constraining: What constraints are there on the structure and content of the information? Is it valid? This would be accomplished mainly using XML Schemas.
● Supplementary: Additional (hyper-) data added from metadata.
● Algorithmic: Deductions can be made from metadata (e.g., using Schematron, XSLT, and RDF).
● Chemical-descriptive: For example, medicinal, physical organic chemistry, Gold Book, stereochemistry.
● Chemical-constraining: For example, theoretical chemistry and CIF.
● Chemical-supplemental: For example, tables of atomic weights, dictionaries of compounds, etc.
● Chemical-algorithmic: For example, theoretical chemistry and CIF.

Communally agreed-upon schemas for defining such metadata are again seen as an essential component of the XML infrastructure.

The existing IUPAC compendia provide a natural foundation for creating XML-based machine-processible resources. They fall into three broad categories: descriptive (e.g., medicinal chemistry, physical organic chemistry, stereochemistry, etc.), validating (e.g., theoretical chemistry), and supplemental (e.g., atomic weights). Their availability for XML-based processes would be a considerable asset.

Conclusions

In this brief review of the application of XML in chemistry, we have summarized the essential advantages of adopting the XML approach. We have discussed in particular the benefits of creating reusable namespaced information components or objects, how these can be created and validated using subject-specific ontologies and dictionaries, and how they can then be enhanced with appropriate metadata. The role of communities and global organizations, such as IUPAC, is crucial to this endeavour. The use of such XML-based documents opens the prospect of a reversible flow of data and information between the scientific publication processes and the discovery, research, and learning processes in molecular sciences; a reversibility that has hitherto been achieved only with considerable human effort and expense.

References

1. T. Berners-Lee and M. Fischetti, Weaving the Web: The Original Design and the Ultimate Destiny of the World Wide Web, Orion Business Books, London, 1999.
2. D. H. Peapus, H. J. Chiu, and N. Campobasso, Biochemistry, 2001, 40, 10103-10114.
3. See www.w3.org for details of all XML specifications.
4. G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright, Internet J. Chemistry, 2001, article 13.
5. P. Murray-Rust and H. S. Rzepa, 2002, submitted for publication. For the previous article in this series, see P. Murray-Rust and H. S. Rzepa, Data Science, 2002, 1, 84-98.
6. P. Murray-Rust, H. S. Rzepa, and M. Wright, New J. Chem., 2001, 618-634.
7. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928; P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 2001, 1113; G. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright, J. Chem. Inf. Comp. Sci., 2001, 1124.

Peter Murray-Rust is a lecturer at the Unilever Centre for Molecular Informatics, Cambridge University, United Kingdom. Henry Rzepa is a reader in the Department of Chemistry, Imperial College of Science, Technology, and Medicine, London, United Kingdom.

http://cml.sourceforge.net
