The OtaStat Publication Platform
Mat-2.108 Independent Research Projects in Applied Mathematics

The OtaStat Publication Platform

Jussi K. Virtanen 54233J

July 19, 2005

Contents

List of Abbreviations
1 Introduction
2 Background
3 Technical Foundation
  3.1 Unicode
    3.1.1 UTF-8
  3.2 TeX
    3.2.1 LaTeX
  3.3 XML
    3.3.1 Namespaces
    3.3.2 XHTML
    3.3.3 XSLT
  3.4 Dublin Core
4 Implementation Overview
  4.1 Document Type
  4.2 Page Processing
    4.2.1 XSLT Transformation
    4.2.2 Image Generation
  4.3 Customization
5 Conclusion
  5.1 Further Research
References

List of Abbreviations

ASCII    American Standard Code for Information Interchange
BMP      Basic Multilingual Plane
CSS      Cascading Style Sheets
DCMI     Dublin Core Metadata Initiative
DTD      Document Type Definition
GIF      Graphics Interchange Format
HTML     Hypertext Markup Language
ISO      International Organization for Standardization
MathML   Mathematical Markup Language
SP       Supplementary Plane
URI      Uniform Resource Identifier
URL      Uniform Resource Locator
UTF      Unicode Transformation Format
W3C      World Wide Web Consortium
WWW      World Wide Web
XHTML    Extensible Hypertext Markup Language
XML      Extensible Markup Language
XSL      Extensible Stylesheet Language
XSLT     XSL Transformations

1 Introduction

Although Tim Berners-Lee invented the World Wide Web (WWW) at the European Organization for Nuclear Research (CERN) in 1989, the Hypertext Markup Language (HTML), the lingua franca of the Web, has always offered very limited capabilities for mathematically oriented publishing [34]. In 1998 the World Wide Web Consortium (W3C), the governing body of Web standardization, released the new Mathematical Markup Language (MathML), a purpose-built language for embedding mathematics into documents [23]. Unfortunately, support for MathML in Web browsers has so far been weak and, thus, the standard method for including mathematics in plain HTML pages is still the use of generated images depicting mathematical constructs; the sketch at the end of this section illustrates the two approaches.

Scientific papers and other mathematical documents in academia are often prepared using Leslie Lamport's LaTeX, which is based on Donald E. Knuth's TeX typesetting system [27, 25]. Although LaTeX is also used for Web publishing with dedicated tools that generate HTML documents with embedded images for mathematical symbols and equations, the traditions of LaTeX-based tools differ greatly from the established ways of working on the World Wide Web.

The OtaStat Web publication system, developed in the OtaStat project at the Systems Analysis Laboratory at Helsinki University of Technology, Finland, during the years 2003–2004, attempts to combine native World Wide Web methods and techniques, such as the Extensible Markup Language (XML), with the benefits of LaTeX-based tools.

In this paper, the applied technologies are described both in general and in the context of the implemented system. First, Section 2 gives a short description of the OtaStat project and a rationale for the system development. Section 3 then provides a concise presentation of the underlying standards and technologies used. Section 4 explores the architecture and implementation of the system and, finally, Section 5 concludes by evaluating the implemented system and identifying the needs for further research.
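The following fragment is an illustrative sketch only, not taken from the OtaStat system: it shows the expression x^2 + 1 once as MathML markup, which a capable browser could render natively, and once as the image-based fallback typically produced by LaTeX-to-HTML converters. The file name eq-0001.png is hypothetical.

    <!-- Illustration: the same expression, x^2 + 1, first as MathML
         markup, then as the generated-image approach that is used
         when browser support for MathML is lacking. -->
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mrow>
        <msup><mi>x</mi><mn>2</mn></msup>
        <mo>+</mo>
        <mn>1</mn>
      </mrow>
    </math>

    <!-- Image fallback: the expression is typeset server-side and
         embedded as a bitmap; the markup keeps only a textual hint
         of the formula in the alt attribute. -->
    <img src="eq-0001.png" alt="x^2 + 1" />

In the markup-based approach the structure of the formula remains available to the browser and to other software; in the image-based approach it is flattened into pixels, which is precisely the limitation discussed above.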
2 Background

The OtaStat project, led by Ilkka Mellin at the Systems Analysis Laboratory at Helsinki University of Technology, produces World Wide Web resources for students of statistics. The study materials are organized so that they can aid self-access students as well as support lectures. One of the intents of the project is to establish a publicly available online encyclopædia of statistics.

The encyclopædia is at the centre of the entire OtaStat project. The aim is to combine traditional reference articles, interactive components that illustrate the concepts of the articles, and direct links to the statistics courses offered at the university. So far the project has produced material that fits the aforementioned content types, but the presentation of the material is clearly bound to the courses offered. A unifying structure and links between the different content types are missing.

The majority of the current material is produced using Microsoft Word, Microsoft PowerPoint, and the Design Science MathType equation editor, and is offered on the project Web site in Adobe Systems' open Portable Document Format (PDF). Computer exercises require either NCSS, MathWorks MATLAB, or Microsoft Excel, depending on the topic. Interactive applets that, for example, illustrate the parametrization of distributions are implemented on Sun Microsystems' Java platform.

The initial planning of the content management and publication system began in June 2003 by investigating the currently used techniques and their fitness for the encyclopædia. One quickly identified problem was the reliance on PDF for essentially all text documents. While the PDF document format is excellent for printed material, usability studies have shown online PDF documents to be problematic for average site visitors [1, 30, 31]. Clearly the advanced features required by an encyclopædia, such as extensive inter-document references, would only make matters worse.

Although Microsoft Word and Design Science MathType offer the option to convert documents into HTML, on closer inspection the conversion turns out to be of very low technical quality: the resulting documents are, in practice, "broken" and do not honor the standards of the World Wide Web Consortium. Because not much material destined exclusively for the encyclopædia existed yet, the decision to start seeking other suitable tools was not hard.

The TeX-based systems LaTeX2HTML and TeX4ht, arguably the most popular platforms for mathematical publishing on the World Wide Web, were the most promising candidates. The latter was already in use in another, similar project at the university. Initial tests revealed that neither system produces output of as high aesthetic quality as the TeX engine is capable of. In addition, both systems were quite resource-hungry. Finally, development work on a brand new system was started. The idea was to base the work on established standards and technologies and to come up with a lightweight, flexible, and easily extensible solution to the problem.

3 Technical Foundation

This section presents the fundamental standards and technologies that are used in the OtaStat system implementation. Both benefits and drawbacks of the chosen solutions are discussed and, where applicable, competing solutions and other related technologies are mentioned. The discussion begins with low-level concepts and gradually advances towards high-level abstractions.
3.1 Unicode

Although a computer is, in essence, just a powerful calculator and therefore able to process only purely numeric data, computers are nowadays used for a variety of tasks besides bare calculation. This is possible through the use of various encoding methods: non-numeric information is transformed in some systematic way into numbers and back. Most types of information require such a transformation (pictures, sounds, even plain text), and the approach and techniques used in the encoding process depend heavily on the nature of the information.

While text encoding is in principle relatively straightforward, the different writing systems and scripts used around the world complicate matters considerably. The early encoding schemes, such as the American Standard Code for Information Interchange (ASCII), bypassed these issues simply by being national standards. Later international standards, however, have had varying success in tackling the challenge.

On most modern computers, the basic unit of information is the octet, 8 bits. In text encoding, the basic unit of information is the character, a logical entity whose exact role is highly dependent on the writing system used [37]. In many encoding schemes a character is encoded as one octet and, consequently, the maximum size of the character set is 256 characters.

The still widely used ISO 8859 standard series is an enlightening example of the problems related to 8-bit text encoding. The series contains sixteen standards that specify multiple variants of the Latin character set as well as Arabic, Cyrillic, Greek, Hebrew, and Thai character sets, which augment the 7-bit ASCII character set with at most 128 additional code points each. The character sets are mutually exclusive: in different character sets, a specific code point is assigned to different characters. Thus, it is impossible to combine, say, Greek passages with a typical Finnish text, because the character set ISO-8859-7, needed for the Greek alphabet, does not contain, for example, the character 'Ä' (latin capital letter a with umlaut) [22].

The development of a truly sustainable solution to the various problems related to text representation on computers was initiated by Apple Computer in 1988. Three years later the Unicode Consortium was formally founded, and other key players in the computer industry, including IBM, Hewlett-Packard, Microsoft, and Sun Microsystems, had joined the effort. Today the Unicode Standard is implemented on almost all major operating systems for desktop computers and embedded systems.

At the core of the Unicode Standard is the Unicode character set, which consists of 1,114,112 code points. In the latest version of the standard, 96,382 of these code points