Producing HTML Directly from LaTeX: the lwarp Package

TUGboat, Volume 38 (2017), No. 1

Brian Dunn

Abstract

The lwarp package allows LaTeX to directly produce HTML5 output, using external utility programs only for the final conversion of text and images. Math may be represented by SVG files or MathJax.

Documents may be produced by LaTeX, LuaLaTeX, or XeLaTeX. A texlua script removes the need for system utilities such as make and gawk, and also supports xindy and latexmk. Configuration is automatic at the first manual compile.

Print and HTML versions of each document may coexist, each with its own set of auxiliary files. Support files are self-generated on request.

A modular package-loading system uses the lwarp version of a package for HTML when available. Several dozen LaTeX packages are supported with these high-level source compatibility replacements.

A tutorial is provided to quickly introduce the user to the major components of the package.

1 Why LaTeX

Before attempting to justify yet another LaTeX-to-HTML conversion package, it may be worth stepping back for a moment to consider LaTeX itself. A quick web search for "LaTeX vs Word", or some other program, will return many web pages and discussion threads comparing the various programs and their advantages. Things change, however, and many of these discussions are now obsolete due to modern advances in each program's capabilities. As examples, LaTeX no longer has many problems dealing with fonts, and LyX plus a number of integrated development environments are now available, along with online collaborative-development websites [1, 2, 3, 4]. Meanwhile, Word can typeset nice mathematics with a LaTeX-ish input and has improved in its typesetting and stability, and commercial page-layout programs have improved in their handling of large documents.

Nevertheless, many of the traditional advantages of LaTeX still apply: the visibility, stability, and portability of plain-text markup, regular-expression search and replace of both text and formatting commands, easy revision control, the ability to handle large and complex documents, extensive programming capabilities, and the large number of user-supplied packages solving real-world problems. In many cases, it's still faster to type a few arguments than it is to open a dialog box and select and fill in entries, and a powerful programming text editor is usually more responsive than a word processor.

Another development is the large number of markup languages now available, usually with a number of options for output format. These systems are based on plain-text markup using inline tags or sequences of special characters, and thus share some of the advantages of LaTeX. While these systems are useful for smaller documents, cross-referencing is limited (although the AsciiDoc syntax does offer full cross-referencing to figures and tables), much of the customization is done at the back end, and the syntax of special symbols tends to become rather dense once things become complicated. LaTeX has the advantage of giving macros relatively readable names.

Great progress has been made in making LaTeX more widely accessible. Online collaborative LaTeX editing websites now claim a million users and thousands of institutions, and LaTeX is also now available as a browser application [5]. If anything, LaTeX seems to be building momentum, even after all these decades.

2 Why convert LaTeX to HTML

Unfortunately, modern publishing often involves submission and rounds of editing in Word format, conversion to an XML intermediate, then conversion yet again to a professional typesetting system, along with HTML or EPUB versions. Each of these stages may be performed by different groups of people in different parts of the world [6], most of whom are not familiar with the technical content, and also by imperfect algorithms whose programmers haven't thought of every possibility. (Example: An incorrect line break in a superscript, where a hyphen had been used as a minus sign.) The resulting errors are often beyond the author's control, the final product having problems which were not present in the signed-off proof.

While it is unrealistic to expect any of this to change, there is a movement towards self-publishing [7, 8, 9, 10] which removes many of these problems while also providing the benefits of quick turnaround, print on demand, and the ability to make changes or updates as needed. This requires the ability to create a professional-quality printed document in several sizes (e-tablet included), which LaTeX certainly can provide, but also the ability to create HTML or EPUB as well. Providing a high-quality PDF version is better than asking the user to print from HTML, whereas providing an HTML version provides easy accessibility and some search-engine benefits. Providing both is the best option.

Another application of LaTeX-to-HTML conversion is the creation of an informational (non-interactive) website. Many scientists, professors, and engineers would benefit from having their own website on which their own technical papers could be published, and they could apply their pre-existing LaTeX skills not just to the documents but also to the website itself.

3 Why LaTeX is hard to convert

Modern HTML5 and CSS3 are quite capable, to the point where they can be used to produce technical books [11]. Nevertheless, there are some practical problems to overcome in order to create a good conversion from LaTeX to HTML.

One of the first issues is the difference between individual printed pages versus the HTML concept of an endless scroll of variable width. Footnotes can become endnotes, but \pageref refers to what, exactly? Is \linewidth for the current screen size, or is it for a conceptual page size? The relationship between font size, image size, and screen size is broken, there is no margin for marginpars, and text may be reflowed at any time.

LaTeX knows about stretchable space, which is not true of HTML. A \vfill is almost meaningless in HTML, and an \hfill is not much better. Nor do floating objects translate well, since there are no page breaks at which to place them.

Math in HTML has been a problem for years, and the MathML standard has not been adopted by many browsers [12, 13, 14]. MathJax is nice and getting better all the time, but requires JavaScript and web access or a local copy, possibly making it unacceptable for use in EPUB documents [15], and it can be relatively slow. Drawing math as images has its own limitations.

Aside from display-related issues, another general problem with converting LaTeX to HTML is the fact that LaTeX does not use end delimiters for many of its syntactic units. A \section does not have an \endsection before the next \section, for example, and beginning the next \section may first require closing several nested levels worth of currently open subsections and paragraphs. Nor does \bfseries have a syntactically defined endpoint, and HTML/CSS do not support state switching.

Finally, LaTeX engines do not allow for the direct plain-text output of HTML tags and text content, thus requiring some kind of PDF-to-text conversion, followed by some system to optionally split the results into separate web pages of HTML, and also copy out any inline images which must be cropped and ...

4 Existing methods

Several methods already exist for converting some subset of LaTeX into HTML. These are discussed in slightly more detail in the lwarp manual.

The closest to lwarp in design principle is the internet class by Andrew Stacey [16], an interesting project which directly produces several versions of markdown, and also HTML and EPUB.

There is also the TeX4ht project [17], which uses LaTeX itself to do most of the work, along with an external program to convert special codes into HTML or several other formats.

A number of other projects use an intermediate translation program to parse LaTeX source and then convert it externally. See HeVeA [18], TTH [19], GELLMU [20], LaTeXML [21], plasTeX [22], LaTeX2HTML [23], and TeX2page [24], most of which are found on CTAN.

GladTeX [25] may be used to insert LaTeX math expressions into pre-existing HTML code.

For sake of completeness, it should be mentioned that there are plugins allowing the entry of LaTeX math expressions for Word [26, 27] and LibreOffice [28], as well as commercial page-layout programs.

5 Why another approach

Nothing except LaTeX truly understands LaTeX.

More to the point, it's easier for LaTeX to program HTML than for a third-party converter program to understand LaTeX. A larger portion of LaTeX and its associated packages can be parsed and converted when LaTeX itself does the work. Another advantage of staying with LaTeX alone is that development of the core and additional packages can be done without requiring skills in an additional language.

6 Development

6.1 AsciiDoc markup

The initial inspiration for the lwarp package was the internet class. Seeing that someone else had trained LaTeX to produce markup, it was decided to program LaTeX to generate the AsciiDoc markup syntax. AsciiDoc has several advantages over other markup languages, including improved cross-references, and its Asciidoctor variant generating modern HTML5 output. Using AsciiDoc as an intermediate syntax lifted much of the conversion load from LaTeX, while providing almost all of the functionality which would be required for a typical technical paper. Nevertheless, AsciiDoc just couldn't represent many of the concepts commonly used in LaTeX.
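
As a rough illustration of what programming LaTeX to generate AsciiDoc markup could look like, the sketch below redefines two sectioning commands so that they typeset AsciiDoc section titles as literal text. This is a hypothetical minimal example, not code taken from lwarp or the internet class: the redefinitions are assumptions made for illustration, and they ignore starred forms, optional arguments, numbering, and cross-references.

```latex
\documentclass{article}

% Hypothetical sketch, not lwarp's implementation:
% each sectioning command prints an AsciiDoc section title as plain text.
% AsciiDoc marks a level-1 section with "==" and a level-2 section with "===".
\renewcommand{\section}[1]{\par\noindent\texttt{== #1}\par}
\renewcommand{\subsection}[1]{\par\noindent\texttt{=== #1}\par}

\begin{document}

\section{Introduction}
Some introductory text.

\subsection{Details}
More text.

\end{document}
```

Compiling this file places the markup in the PDF as ordinary text, which would still need a PDF-to-text step of the kind described in section 3 to become a plain-text AsciiDoc file. AsciiDoc section titles need no closing delimiter, so the missing \endsection is not an obstacle in this particular case.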
