No. 3 on the Use of TEX As an Authoring Language for HTML5 SK

278 TUGboat, Volume 32 (2011), No. 3 On the use of TEX as an authoring language The difference between the above two S-expressions for HTML5 is that the former introduces a deliberate asymmetry between attributes and elements, whereas the latter S.K. Venkatesan treats attributes on a par with elements. However, Abstract both S-expressions can be considered as an improve- ment on XML as they allow further nesting within The T X syntax has been fairly successful at mark- E attributes. The corresponding XML code would be: ing up a variety of scientific and technical literature, making it an ideal authoring syntax. The brevity <title lang="en">Title of a <italic>plain</italic> article</title> of the TEX syntax makes it difficult to create overlapping structures, which in the case of HTML has In both TEX and XML syntax, further nesting of made life so difficult for XML purists. We discuss structures is not possible within attributes, which S-expressions, the TEX syntax and how it can help makes TEX ideal for authoring XML or HTML5. reduce the nightmare that HTML5 markup is going There are further similarities between the TEX to create. Apart from this we implement a new syn- and SGML/HTML syntaxes. Attribute minimization tax for marking up semantic information (microdata) used in HTML, like not quoting attribute values, is in TEX. very much practiced in TEX syntax, more as a rule rather than the exception; e.g., 1 Introduction \includegraphics[width=2cm]{myimage.gif} The brevity of T X syntax has made it fairly success- E Unlike SGML/HTML,TEX typically uses a comma ful at marking up a variety of scientific and techni- as the separator between attributes, instead of the cal literature. On the one hand, modern markup word-space used in SGML/HTML.TEX also uses languages such as (X)HTML and XML have ver- complete skipping of attribute values, similar to the bose syntax which is not only difficult to author commonly used HTML code: <option selected>. but also produces non-treelike structures such as Quite like TEX, HTML also has the practise of shrink- overlapping structures that need to be checked for ing multiple spaces to a single space. All of these well-formedness. On the other hand, T X and its E similarities make it clear that authoring HTML in macros are difficult to parse and validate, compared TEX would be an ideal proposition. to XML with a DTD or schema. Many XML versions of TEX have been proposed such as TEXML [3] and 3 Overlapping markup in HTML XLAT X[5] that are intrinsically close to (LA)T X. E E Since HTML is marked up by humans, there tend The main advantage of such a system is that one to be many situations with overlapping elements or can introduce a validator using a DTD or schema to other eccentric markup which do not confirm to a check the syntax before passing it to the T X engine. E well-formed SGML or XML syntax. Consider the However, XML syntax is difficult to author and HTML markup: in fact is prone to producing overlapping structures that need to be avoided for it to be well-formed, <p>Text with <i>unique <b>and</i> and as a result these XML versions have not become strong formatting</b> issues</title> popular for authoring. In this article, we propose A utility like HTML Tidy [6] or TagSoup [1] can something that is quite the reverse, i.e., TEX as an convert this into well-formed markup such as: authoring syntax for both XML and HTML. <p>Text with <i>unique </i><b><i>and</i> strong formatting</b> issues</title> 2 TEX, S-expressions and XML However, it is not always clear what should be Let us look at the following T X code: E done with such a non-standard markup. The HTML5 \title[lang=en]{Title of specification defines clearly how such a non-standard a \textit{plain} article} markup should be interpreted [7] but the HTML The same code in a Lisp-like S-expression would be: implementations in browsers currently deal with it differently from each other. (title (@ (lang="en")) ("Title of a ") W3C has been insisting for some time that the (italic "plain") ("article")) next generation of markup should be XML-compliant or if one would like to treat elements and attributes like XHTML+MathML+SVG profiles, with other in- in the same way: tricacies such as namespaces. However, more than (title (@lang="en") ("Title of a ") 99% of HTML pages in the wild are invalid, accord- (italic "plain") ("article")) ing to the HTML4 DTD or schema. This being the S.K. Venkatesan TUGboat, Volume 32 (2011), No. 3 279 case, W3C gave up on the idea of an XML solution 4 TEX as an input format for HTML5 and moved on to HTML5 with added elements and In this section we would like introduce LATEX environ- features, such as MathML, SVG and video, audio and ment for authoring HTML5. Many of these features additional microdata elements. have been introduced before, say, e.g., in XLATEX Given the experience with HTML4, it can be and other concepts. safely predicted that the more features one adds to HTML, the greater the scope for non-standard markup such as overlaps and entanglement that can 4.1 Main structural elements create a great deal of difficulty for browsers and of the document users. HTML5 has introduced new content elements that We will consider here, e.g., Microsoft's interpre- bring it closer to the standard LATEX classes. We tation of MathML in HTML5. Microsoft has been propose the following TEX macros. pushing for certain agenda in MathML3 (although I must say with great relief that much of it has not No. HTML LATEX Description been accepted by the MathML committee). Based on 1 <article>#1 \begin{article} their own experience with OML, a subset of OOXML </article> #1 article markup, they would like to add formatting features in \end{article} MathML such as bold, italic and paragraph elements headings: inside MathML. Consider the following markup: 2 <h1>#1</h1> \Ha{#1} | level one 3 <h2>#1</h2> \Hb{#1} | level two <math><b><mi>r</mi></b>=<mfenced><mi>x</mi> 4 <h3>#1</h3> \Hc{#1} | level three <mi>y</mi></mfenced></math> 4 <h4>#1</h4> \Hd{#1} | level four the corresponding pure MathML coding would be: 5 <p>#1</p> \p{#1} paragraph 6 <span>#1</span> \s{#1} text span <math><mi mathvariant="bold-italic">r</mi> <mo>=</mo><mfenced><mi>x</mi> <mi>y</mi></mfenced></math> 4.2 Simple formatting elements Mixing elements from different namespaces is We propose the following T X macros for HTML one of the side effects one can expect in HTML5. It E formatting elements: is not clear if MathML elements could be included within SVG elements or vice versa. One can expect No. HTML LATEX Description such new non-standard markups to be created that 1 <b>#1</b> \B{#1} bold will be quite difficult for browsers to handle. 2 <i>#1</i> \I{#1} italic New elements such as <section> have been in- 3 <b><i>#1</i></b> \BI{#1} bold-italic troduced, so one can expect more confusion: 4 <tt>#1</tt> \M{#1} text <section><h2>Section title</h2> 5 <sup>#1</sup> \sp{#1} superscript <section><h1>Another section title</h1> 6 <sub>#1</sub> \sb{#1} subscript </section> </section> The intended meaning of <h1> or <h2> is not clear 4.3 MathML elements from the above markup, and you could say either Ì We propose the following TEX macros for MathML mean what I say' or Ì say what I mean', with our formatting elements: own impressionistic interpretations. A In this article we do not want to convey the No. MathML LTEX Description impression that everything about HTML5 is out of 1 <mrow>#1</mrow> {#1} grouping the wild west; rather, it is a rich arena that needs 2 <mi>#1</mi> {#1} variables to be authored carefully, because there are so many 3 <mo>#1</mo> {#1} operators pitfalls. In fact, HTML5 introduces new features like 4 <mn>#1</mn> {#1} numbers 5 monospace MathML, SVG, video and audio features that are <mtext>#1</mtext> \mbox{#1} 6 <mfrac>#1#2</mfrac> \frac{#1}{#2} fraction essential for further enrichment of basic content [4]. 7 <msup>#1#2</msup> {#1}^{#2} superscript The important reason for using a TEX-like system is 8 <msub>#1#2</msub> {#1} {#2} subscript that it doesn't allow one to see the output if there 9 <mover>#1#2</mover> {#1}^{#2} over are errors in the code and one can only produce 10 <munder>#1#2</munder> {#1} {#2} under well-formed code. On the use of TEX as an authoring language for HTML5 280 TUGboat, Volume 32 (2011), No. 3 No. SVG LATEX Description 1 <circle cx="#1" cy="#2" r="#3" \circle[x=#1,y=#2,r=#3 circle style="stroke:#4; s=#4,sw=#5,f=#6] stroke-width:#5;fill:#6;"/> 2 <ellipse cx="#1" cy="#2" rx="#3" ry="#4" style="stroke:#5; \ellipse[x=#1,y=#2,rx=#3, ellipse stroke-width:#6;fill:#7;"/> ry=#4,s=#5,sw=#6,f=#7] 3 <rect x="#1" y="#2" width="#3" height="#4" style="stroke:#5; \rect[x=#1,y=#2,w=#3, rectangle stroke-width:#6;fill:#7;"/> h=#4,s=#5,sw=#6,f=#7] Table 1: Proposed TEX macros for SVG formatting elements.

No. 3 on the Use of TEX As an Authoring Language for HTML5 SK

Just Another Perl Hack Neil Bowers1 Canon Research Centre Europe

Sublimelinter Documentation Release 4.0.0

Notetab User Manual

A Personal Research Agent for Semantic Knowledge Management of Scientiﬁc Literature

0789747189.Pdf

Introduction, Internet and Web Basics, XHTML and HTML

Semantic Web Technologies and Legal Scholarly Publishing Law, Governance and Technology Series

Requirements for Web Developers and Web Commissioners in Ubiquitous

How to Disable Tidy HTML Corrector and Validator to Output Error and Warning Messages

Proceedings of Balisage: the Markup Conference 2012

Schema Crawled API Documentation Exploration

Towards the Unification of Formats for Overlapping Markup