

Learned Publishing (2001) 14, 177–182

A new publishing paradigm: STM articles as part of the semantic web

Henry S. Rzepa
Department of Chemistry, Imperial College, London

Peter Murray-Rust
School of Pharmacy, University of Nottingham

© Henry S. Rzepa and Peter Murray-Rust 2001

Based on a presentation given at the ALPSP International Learned Journals Seminar, March 2001.

ABSTRACT: An argument is presented for replacing the traditional published scientific, technical, or medical article, where heavy reliance is placed on human perception of a printed or printable medium, by a more data-centric model with well-defined structures expressed using XML languages. Such articles can be regarded as self-defining or 'intelligent': they can be scaled down to fine-level details such as, for example, atoms in molecules, up into journals or higher-order collections, and can be used by software agents as well as humans. Our vision is that this higher-order concept, often referred to as the semantic web, would lay the foundation for the creation of an open and global knowledge base.

Introduction

Both scientists and publishers would agree that the processes involved in publishing, and particularly of reading, scientific articles have changed considerably over the last five years or so. We will argue, however, that these changes relate predominantly to the technical processes of publication and delivery, and that fundamentally most authors' and readers' concepts of what a learned paper is, and how it can be used, remain rooted in the medium rather than the message. The ubiquitous use of the term 'reading a paper' implies a perceptive activity that only a human can easily do, especially if complex visual and symbolic representations are included in the paper. We suggest that the learned article should instead be regarded as more of a functional tool, to be used with the appropriate combination of software-based processing and transformation of its content, with the human providing comprehension and enhancement of the knowledge represented by the article.

The current publishing processes

Much of the current debate about learned scientific, technical, and medical (STM) publishing centres around how the dynamics of authoring an article might involve review, comment, and annotation by peer groups, i.e. the preprint/self-print experiments and discussion fora.1 This leads on to the very role of publishers themselves, and the 'added value' that they can bring to the publication process, involving concepts such as the aggregation of articles and journals, ever richer contextual full-text searching of articles, and added hyperlinking both between articles and between the article and databases of subject content.


These are all layers added by the publishers and, inevitably, since they often involve human perception at some or all of these stages, they remain expensive additions to the publishing process. There is also the implicit assumption that the concept of what represents added value is largely defined by the publishers rather than by the authors and readers.

These debates largely assume that the intrinsic structure of the 'article' remains very much what an author or reader from the 19th century might have been familiar with. These structures are mostly associated with what can be described as the 'look and feel' of the journal and its articles, namely the manner in which the logical content created by the author is presented on the printed page (or its electronic equivalent, the Acrobat file). In our own area of molecular sciences, the content is serialized onto the printed page or Acrobat equivalent into sequential sections such as abstract, introduction, numerical and descriptive results, a section showing schematic representations of any new molecules reported, a discussion relating perhaps to components (atoms, bonds, etc.) of the molecules, and a bibliography. A human being can scan this serialized content, rapidly perceive its structure, and more or less accurately infer the meaning of, for example, the schematic drawing of a molecule (although perceiving the three-dimensional structure of such a molecule from a paper rendition is much more of a challenge!). A human is far less well suited to scanning thousands, if not millions, of such articles in an error-free manner, and is subject to the error-prone process of transcribing numerical data from paper. Changing the medium of the article from paper to an Acrobat file does little to change this process. Most people probably end up printing the Acrobat file; few would confess to liking to read it on the computer screen. Yet this remains the process that virtually everyone 'using' modern electronic journals goes through.

We argue here that data must be regarded as a critically important part of the publication process, with documents and data being part of a seamless spectrum. In many disciplines the data are critical for the full use of the 'article'. To achieve such seamless integration, the data content of an article must be expressed in a far more precise way than is currently achieved: precise enough to be not merely human perceivable but, if necessary, machine processable. The concept is summarized by the term 'semantic web', used by Berners-Lee2 to express his vision of how the web will evolve to support the exchange of knowledge. The semantic web by its nature includes the entire publishing process, and we feel that everyone involved in this process will come to recognize that this concept really does represent a paradigm shift in the communication, and in particular the use, of information and data.

A concept central to the semantic web is that data must be self-defining, such that decisions about what they represent, and the context in which they can be acted upon or transformed, are possible not merely by humans but by software agents created by humans for the purpose. The concepts also include some measure of error checking if the structure and associated meaning (ontology) of the data are available, and mechanisms to avoid loss of data if the meaning is not sufficiently well known at any stage.
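To make 'self-defining' concrete, compare a melting point as it might appear in printed prose with the same value carried in explicit markup of the kind discussed later in this article. The element and attribute names below are illustrative only, approximating later CML conventions rather than any single normative schema, but the principle holds: the marked-up form tells a software agent what the number is, in what units, and where its definition lives.

    <!-- As printed: meaningful to a human, opaque to software -->
    m.p. 142 degrees C

    <!-- Self-defining: an agent can locate, convert, and
         error-check this value without human perception -->
    <property dictRef="chem:meltingPoint">
      <scalar units="units:celsius">142</scalar>
    </property>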
The stages in the evolution of data and knowledge are part of the well-known scientific cycle. An example in the molecular and medicinal sciences might serve to illustrate the current process:

1. A human decides that a particular molecular substructure is of interest, on the basis of reading a journal article reporting one or more whole molecular structures and their biological properties relating to, for instance, inhibition of cancerous growth. This process is currently almost entirely dependent on human perception.
2. A search of various custom molecular databases is conducted, using a manual transcription of the relevant molecular structure. This implies a fair degree of knowledge by the human about the representational meaning of the structure they have perceived in the journal article. Chemists tend to use highly symbolic representations of molecules, ranging from text-based complex nomenclature to even more abstract two-dimensional line diagrams in which many of the components present are implied rather than declared. Licences to access the databases must be available, since most molecular databases are proprietary and closed. It is quite probable that a degree of training of the human to use each proprietary interface to these databases will be required.


3. It is becoming more common for both primary and secondary publishers to integrate steps 1 and 2 into a single 'added-value' environment. This environment is inevitably expensive, because it was created largely by human perception of the original published journal articles. In effect, although the added service is indeed valuable, the processes involved in creating it merely represent an aggregation of what the human starting the process would have done anyway.
4. The result of the search may be a methodology for creating new variants of the original molecule (referred to by chemists as the 'synthesis' of the molecule). The starting materials for conducting the synthesis have to be sourced from a supplier, and ordered by raising purchase orders with an accounts officer.
5. Nowadays, it is perfectly conceivable that a 'combinatorial' instrument or machine will need to be programmed by the human to conduct the synthesis.
6. The products of the synthesis are then analysed using other instruments, and the results interpreted in terms of both purity and molecular structure. This can often nowadays be done automatically by software agents. A comparison of the results with previously published and related data is often desirable.
7. Biological properties of the new species can be screened, again often automatically, using instrumentation and software agents.
8. The data from all these processes are then gathered, edited by the human, and (nowadays at least) transcribed into a word-processing program in which the document structures imposed are those of the journal's 'guidelines for authors' rather than those implied by the molecular data themselves. We emphasize that this step in particular is a very lossy process, i.e. lack of appropriate data structures will mean loss of data!
9. More often than not, the document is then printed and sent to referees. The data from steps 1–7 above are accessible to them only if they invoke their own human perception, since the process involved in step 8 may adhere (and then often only loosely) merely to the journal's publishing and presentational guidelines rather than to those associated with the data harvested in steps 1–7.
10. The article is finally published, the full text indexed, and the bibliography possibly hyperlinked to the other articles cited (in a monodirectional sense). The important term here is of course 'full text'. In a scientific context at least, and certainly in the molecular sciences, the prose-based textual description of the meaning inevitably carries only part of the knowledge and information accumulated during these steps. Full-text prose is inevitably a lossy carrier of data and information. Even contextual operators invoked during a search (is A adjacent to B? does A come before B?) recover only a proportion of the original data and meaning. The rest must be accomplished by humans as part of the secondary publishing process, and of course the cycle now completes with a return to step 1.

The cycle described above is clearly lossy. Much of the error correction, contextualization, and perception must be done by humans. We argue that it is too much (we certainly do not argue for eliminating the human entirely from the cycle!).

Learned articles as part of a semantic web

It is remarkable how many of the ten steps described above have the potential for the symbiotic involvement of software agents and humans. If the structures of the data passed between any two stages in the above process, and the actions resulting, could be mutually agreed, then significant automation becomes possible and, more importantly, data or their context need not be lost or marooned during the process. This very philosophy is at the heart of the development and adoption of XML (extensible markup language)3 as one mechanism for implementing the semantic web, together with the other vital concept of metadata, which serves to describe the context and meaning of data. XML is a precise set of guidelines for writing markup languages (MLs), together with a set of generic tools for manipulating and transforming the content expressed using such languages. Many MLs already exist and are in use, e.g. XHTML (for carrying prose descriptions in a precise and formal manner), MathML (for describing mathematical symbolisms),4 SVG and PlotML (for expressing numerical data as two-dimensional diagrams and charts),5 and CML (chemical markup language)6 for expressing the properties and structures of collections of molecules.
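As a flavour of what CML provides, the fragment below expresses a connection table for methanol: every atom and bond is an addressable, marked-up object rather than a picture to be perceived. The element and attribute names approximate CML as later standardized, and should be read as an illustrative sketch rather than normative syntax.

    <molecule id="methanol" xmlns="http://www.xml-cml.org/schema">
      <atomArray>
        <atom id="a1" elementType="C"/>
        <atom id="a2" elementType="O"/>
        <atom id="a3" elementType="H"/>
        <!-- remaining hydrogen atoms omitted for brevity -->
      </atomArray>
      <bondArray>
        <bond atomRefs2="a1 a2" order="1"/>
        <bond atomRefs2="a2 a3" order="1"/>
      </bondArray>
    </molecule>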


We have described in technical detail elsewhere7 how we have authored, published, and subsequently reused an article written entirely in XML languages, and so confine ourselves here to how such an approach has the potential to change some, if not all, of the processes described in steps 1–10 above. Molecular concepts such as molecular structures and properties were captured using CML, schematic diagrams were deployed as SVG, the prose was written in XHTML, the article structure and bibliography were written in DocML, metadata were captured as RDF (resource description framework),8 and the authenticity, integrity, and structural validity of the article and its various components were verified using XSIGN digital signatures. All these components interoperate with each other, and can be subjected to generic tools such as XSLT (transformations) to convert the data into the context required, or CSS (stylesheets) to present the content in, for example, a browser window. The semantics of each XML component can be machine-verified using documents known as DTDs (document type definitions) or Schemas, and where necessary components of the article (which could be as small or finely grained as individual atoms or bonds) can be identified using a combination of namespaces and identifiers.
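A compound article of this kind might, in outline, interleave the different markup languages through XML namespaces, so that generic tools can route each component to the appropriate processor. The sketch below shows XHTML prose carrying an embedded CML molecule; the namespace URIs and element names illustrate the approach and are not a reproduction of the published article.

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:cml="http://www.xml-cml.org/schema">
      <body>
        <p>Treatment of the ketone with borohydride gave the
           alcohol, whose connection table follows.</p>
        <!-- a CML island inside the XHTML prose; an XSLT stylesheet
             can extract or transform it without touching the text -->
        <cml:molecule id="m1">
          <cml:atomArray>
            <cml:atom id="a1" elementType="C"/>
            <cml:atom id="a2" elementType="O"/>
          </cml:atomArray>
        </cml:molecule>
      </body>
    </html>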
The most important new concept that emerges from the use of XML is that the boundaries of what would conventionally be thought of as a 'paper' or 'article' can be scaled both up and down. Thus, as noted above, an article could be disassembled down to an individual marked-up component such as one atom in a molecule, or instead aggregated into a journal, a collection of journals, or ultimately into the semantic web itself! This need not mean loss of identity or provenance since, in theory at least, each unit of information can be associated with metadata indicating its originator and, if required, a digital signature confirming its provenance. Because at the heart of XML is the concept that the form or style of presentation of data is completely separated from its containment, the 'look and feel' of the presentation can be applied at any scale (arguably for an individual atom, certainly for an aggregation such as a journal, and potentially for the entire semantic web).
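In RDF, such provenance metadata might, schematically, attach an originator, a date, and a publisher to any addressable unit, however small. The property names below are drawn from the Dublin Core vocabulary by way of illustration, and the identifiers and values are hypothetical.

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- metadata about one molecule (or even one atom) in an article -->
      <rdf:Description rdf:about="article.xml#m1">
        <dc:creator>A. N. Author</dc:creator>
        <dc:date>2001-03-01</dc:date>
        <dc:publisher>An STM Journal</dc:publisher>
      </rdf:Description>
    </rdf:RDF>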


It is worth now reanalysing the ten steps described above, in a context where everything is expressed with XML structures.

1. A human, or a software agent acting on their behalf, can interrogate an XML-based journal, asking questions such as 'How many molecules are reported containing a particular molecular fragment with associated biological data relating to cancer?' Technically, this would involve software searching the CML or related 'namespaces' to find molecules, and checking any occurrences for particular patterns of atoms and bonds (a minimal sketch of such a query follows this list). We have indeed demonstrated a very similar process for our own XML-based journal articles; the issue is really only one of scale. Any citations retrieved during this process are captured into the XML-based project document along with relevant information such as CML-based descriptors.
2. Any retrieved molecules can now be edited or filtered by the human (or software agent) and presented to specialized databases for further searching (if necessary preceded by the appropriate transformation of the molecule to accommodate any non-standard or proprietary representations required by that database), and any retrieved entries are again formulated in XML.
3. With publishers receiving all journal articles in XML form, the cost of validating, aggregating, and adding value to the content is now potentially much smaller. The publisher can concentrate on higher forms of added value, e.g. contracting to create similarity indices for various components, or computing additional molecular properties.
4. Other XML-based sources of secondary published information, such as 'Organic Syntheses' or 'Science of Synthesis' (both of which happen already to be available at least partially in XML form), can be used to locate potential synthetic methods for the required molecule. The resulting methodology is again returned in XML form. At this stage, purchasing decisions based on identified commercial availability of appropriate chemicals can be made, again with the help of software agents linking to e-commerce systems; many new e-commerce systems are themselves based on XML architectures.
5. The appropriate instructions, in XML form, can be passed to a combinatorial robot.
6. Processing instructions for instruments can be derived from the XML formulation, and the results similarly returned, or passed to software for heuristic (rule-based) interpretation or checking. The software itself will have an authentication and provenance that could be automatically checked, if necessary, by resolution back to a journal article and its XML-identified authorship. We also note at this stage that the original molecular fragment originating in step 1 is still part of the data, although subjected to very substantial annotation at each step, the provenance of which can be verified if necessary.
7. The compound, along with its accreted XML description, can now be passed to biological screening systems, which can extract the relevant information and return the results in the same form.
8. At this stage, much human thought will be needed to make intelligent sense of the accumulated results. To help in this process, the XML document that describes the entire project can always be presented to the human through appropriately selective filters and transforms, which may include statistical analysis or computational modelling. The human can annotate the document with appropriate prose, taking care to link technical terms to an appropriate dictionary or glossary of such terms so that other humans or agents can make the ontological associations.
9. Any referees of the subsequent article (whether open at a preprint stage, or closed in the conventional manner) will now have access not only to the annotated prose created by the author in the previous stage, but potentially also to the more important data accreted by the document in the previous stages. Their ability to perform their task can only be enhanced by having such access.
10. The article is published. The publisher may choose to add further value to any of the components of the article, depending on their speciality. They may also make the article available for annotation by others.

This revised cycle is potentially far less lossy than the conventional route. Of course, some loss of data is probably desirable, since otherwise the article would become overburdened by superseded data. The issue of how much editing is required within such a model is one the community (and commercial reality) will decide.
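To indicate how mechanical the query in step 1 can become once articles are marked up, the XSLT fragment below lists the id of every molecule that contains at least one nitrogen atom. It is a minimal sketch assuming the illustrative CML namespace used earlier; a real substructure search would of course match patterns of atoms and bonds rather than a single element type.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:cml="http://www.xml-cml.org/schema">
      <xsl:output method="text"/>
      <!-- report each molecule containing at least one nitrogen atom -->
      <xsl:template match="cml:molecule[.//cml:atom[@elementType='N']]">
        <xsl:value-of select="@id"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>
      <!-- suppress the default copying of text nodes -->
      <xsl:template match="text()"/>
    </xsl:stylesheet>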
Conclusions

The semantic web is far more than just the one particular instance we have presented of how the scientific discovery and publishing process could be implemented. It requires the recognition by humans of the importance of retaining the structure of data at all stages of the discovery process.


It involves them in recognizing the need for interoperability of data in the appropriate context, and ultimately in agreeing common ontologies for what the data mean in their own subject areas. At the heart of this model will be the creation of an open model of publishing, which will lay the foundation for a global knowledge base in a particular discipline. The seamless aggregation of published 'articles' will be the foundation of such a knowledge base.

These are grand challenges which may take a little while to achieve. The technical problems are relatively close to solution, although the business models may not be! The greatest challenge, however, will be convincing authors and readers in the scientific communities to rethink their concept of what the publishing process is, to think instead on a global scale, and to change the way they work and the way they capture and pass on data and information to the global community.

Henry Rzepa
Department of Chemistry
Imperial College
London SW7 2AY, UK
Email: h.rzepa@ic.ac.uk
Web: www.ch.ic.ac.uk/rzepa/

Peter Murray-Rust
School of Pharmacy
University of Nottingham
University Park
Nottingham NG7 2RD, UK

References

1. Harnad, S. Nature 1999:401(6752), 423. The topic is currently being debated on forums such as the Nature debates (www.nature.com/nature/debates/e-access/index.html), the American Scientist Forum (http://amsci-forum.amsci.org/archives/september98-forum.html), and chemistry preprint sites such as http://preprint.chemweb.com/. Other interesting points of view are represented by Bachrach, S.M. The 21st century chemistry journal. Quimica Nova 1999:22, 273–6, and Kircz, J. New practices for electronic publishing: quality and integrity in a multimedia environment. UNESCO–ICSU Conference on Electronic Publishing in Science, 2001.
2. Berners-Lee, T., Hendler, J. and Lassila, O. The semantic web. Scientific American, May 2001: www.scientificamerican.com/2001/0501issue/0501berners-lee.html; Berners-Lee, T. and Fischetti, M. Weaving the Web: The Original Design and the Ultimate Destiny of the World-Wide Web. London: Orion, 1999.
3. The definitive source of information about XML projects is the World-Wide Web Consortium site: www.w3c.org/
4. See www.w3c.org/Math/
5. SVG, see www.w3c.org/Graphics/SVG; PlotML, see http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII1.0/
6. Murray-Rust, P. and Rzepa, H.S. Journal of Chemical Information and Computer Sciences 1999:39, 928, and articles cited therein. See also www.xml-cml.org/
7. Murray-Rust, P., Rzepa, H.S., Wright, M. and Zara, S. A universal approach to web-based chemistry using XML and CML. Chemical Communications 2000:1471–2; Murray-Rust, P., Rzepa, H.S. and Wright, M. Development of chemical markup language (CML) as a system for handling complex chemical content. New Journal of Chemistry 2001:618–34. The full XML-based article can be seen at www.rsc.org/suppdata/NJ/B0/B008780G/index.sht
8. The RDF specifications provide a lightweight ontology system to support the exchange of knowledge on the web; see www.w3c.org/RDF/
