44 TUGboat, Volume 35 (2014), No. 1

An overview of the (few) conventions that the program follows in order to format the document. Massimiliano Dominici These languages are mainly used in two fields: Abstract documentation of code (reStructuredText, AsciiDoc, etc.) and management of contents for the web (Mark- This paper is a short overview of Pandoc, a utility down, Textile, etc.). In the case of code documenta- for the conversion of -formatted texts to tion, the use of an LML is a good choice, because the many output formats, including LAT X and HTML. E documentation is interspersed in the code itself, so it 1 Introduction should be easy to read by a developer perusing the code; but at the same time it should be able to be Pandoc is , written in Haskell, whose aim converted to presentation formats (PDF and HTML, is to facilitate conversion between some lightweight traditionally, but today many IDEs include some markup languages and the most widespread ‘final’ form of visualization for the internal documentation). document formats.1 On the program’s website [3], In the case of web content, the emphasis is placed Pandoc is described as a sort of ‘swiss army knife’ for on the ease of writing for the user. Many content converting between different formats, and in fact it management systems already provide plugins for one is able to read simple files written in LAT X or HTML; E or more of those languages and the same is true for but it is of lesser use when trying to translate LAT X E static site generators2 that are usually built around documents with non-trivial constructs such as com- one of them and often provide support for others. mands defined by the user or by a dedicated package. The various dialects can be considered another Pandoc shows its real utility, in my opinion, instance of LML. when what is needed is to obtain several output The actual ‘lightness’ of an LML depends greatly formats from a single source, as in the case of a docu- on its ultimate purpose. In general, an LML con- ment distributed online (HTML), in print form (PDF ceived for code documentation will be more complex via LAT X) and for viewing on tablets or read- E and less readable than one conceived for web content ers (EPUB). In such cases one may find that writing management, which in turn will often not be capable the document in a rich format (e.g. LAT X) and con- E of general semantic markup. A paradigmatic exam- verting later to other markup languages often poses ple of this second category is Markdown that, in its significant problems because of the different ‘philoso- original version, stays rigorously close to the mini- phies’ that underlie each language. It is advisable, in- malistic approach of the first LMLs. The following stead, to choose as a starting point a language that is from its author, John Gruber, explains his ‘neutral’ by design. A good candidate for this role is a intentions in designing Markdown: lightweight , and in particular Mark- down, of which Pandoc is an excellent interpreter. Markdown is intended to be as easy-to-read In this article we will briefly discuss the concept and easy-to-write as is feasible. Readabil- of a ‘lightweight markup language’ with particular ity, however, is emphasized above all else. reference to Markdown (§ 2), and then we will re- A Markdown-formatted document should be view Pandoc in more details (§ 3) before drawing our publishable as-is, as , without look- conclusions (§ 5). ing like it’s been marked up with tags or for- matting instructions.3 2 Lightweight markup languages: The only output format targeted by the refer- Markdown ence implementation of Markdown is HTML; indeed, Before getting to the heart of the matter, it is advis- Markdown also allows raw HTML code. Gruber has able to say a few words about lightweight markup languages (LML) in general. They are designed with 2 Static site generators are a category of programs that the explicit goal of minimizing the impact of the build a website in HTML starting from source files written in markup instructions within the document, with a a different format. The HTML pages are produced beforehand, usually on a local computer, and then loaded on the server. particular emphasis on the readability of the text by Websites built this way share a great resemblance with old a human being, even when the latter does not know websites written directly in HTML, but unlike those, in the building process it is possible to use templates, share metadata Translation by the author from his original in ArsTEXnica across pages, and create structure and content programmati- #15, April 2013, “Una panoramica su Pandoc”, pp. 31–38. cally. Static site generators constitute an alternative to the 1 Strictly speaking, LATEX isn’t a ‘final’ document format more popular dynamic server applications. in the same way PDF, ODF, DOC, EPUB, etc. are. But, from 3 [1], http://daringfireball.net/projects/markdown/ the point of view of a Pandoc user, LATEX is a ‘final’ — or syntax#philosophy. A significant contribution to the intermediate, at least — product. design of Markdown was made by Aaron Swartz.

Massimiliano Dominici TUGboat, Volume 35 (2014), No. 1 45

Table 1: Markdown syntax: inline elements. Element Markdown LATEX HTML Links [link](http://example.net) \href{link}{% http://example.net} link Emphasis _emphasis_ \emph{emphasis} emphasis *emphasis* \emph{emphasis} emphasis Strong emphasis __strong__ \textbf{strong} strong **strong** \textbf{strong} strong Verbatim ‘printf()‘ \verb|printf()| printf() Images ![Alt](/path/to/img.jpg) \includegraphics{img}

Table 2: Markdown syntax: block elements. Element Markdown LATEX HTML Sections # Title # \section{Title}

Title

## Title ## \subsection{Title}

Title

...... Quotation > This paragraph \begin{quote}

> will show This paragraph This paragraph > as quote. will show will show as quote. as quote. \end{quote}

Itemize * First item \begin{itemize}
    * Second item \item First item
  • First item
  • * Third item \item Second item
  • Second item
  • \item Third item
  • Third item
  • \end{itemize}
Enumeration 1. First item \begin{enumerate}
    2. Second item \item First item
  1. First item
  2. 3. Third item \item Second item
  3. Second item
  4. \item Third item
  5. Third item
  6. \end{enumerate}
Verbatim Text paragraph. Text paragraph.

Text paragraph.

grep -i ’\$’ grep -i ’\$’ always adhered to these initial premises and has Of course, in the reference implementation there is consistently refused to extend the language beyond no LATEX output, so I have provided the most logical the original specifications. This stance has caused a translation. In the following sections we will see how proliferation of variants, so that every single imple- Pandoc works in practice. mentation constitutes an ‘enhanced’ version. Famous websites like GitHub, reddit and Stack Overflow, all 3 An overview of Pandoc support their own Markdown flavour; and the same As mentioned in the introduction, Pandoc is pri- is true for conversion programs like MultiMarkdown marily a Markdown interpreter with several output or Pandoc itself, which also introduce new output formats: HTML,LATEX, ConTEXt, DocBook, ODF, formats. It’s not necessary, here, to examine the OOXML, other LMLs such as AsciiDoc, reStructured- details of the different flavours; the reader can get an Text and Textile (a complete list can be found in [3]). idea of the basic formatting rules from tables 1 and 2. Pandoc can also convert, with severe restrictions,

An overview of Pandoc 46 TUGboat, Volume 35 (2014), No. 1

a source file in LATEX, HTML, DocBook, Textile - name: First Author or reStructuredText to one of the aforementioned - affiliation: First Affiliation output formats. Moreover it extends the syntax of - name: Second Author Markdown, introducing new elements and providing - affiliation: Second Affiliation customization for the elements already available in --- the reference implementation. and then, in the template $for(author)$ 3.1 Markdown syntax extensions $if(author.name)$ Markdown provides, by design, a very limited set $author.name$ of elements. Tables, footnotes, formulas, and biblio- $if(author.affiliation)$ ($author.affiliation$) graphic references have no specific markup in Mark- $endif$ down. The author’s intent is that all markup exceed- $else$ ing the limits of the language should be expressed $author$ $endif$ in HTML. Pandoc maintains this approach (and, for $endfor$ LATEX or ConTEXt output, allows the use of raw TEX code) but makes it unnecessary, since it introduces to get a list of authors with (if present) affiliations. many extensions, giving the user proper markup for As we will see in section 3.1, a YAML block can each of the elements mentioned above. In the follow- also be used to build a bibliographic database. ing paragraphs we’ll take a look at these extensions. Footnotes Since the main purpose of Markdown Metadata Metadata for title, author and date can and its derivatives is readability, the mark and the be included at the beginning of the file, in a text text of a footnote should usually be split. It is block, each preceded by the character %, as in the recommended to write the footnote text just below following example. the paragraph containing the mark, but this is not strictly required: the footnotes could be collected % Title at the beginning or at the end of the document, for % First Author; Second Author % 17/02/2013 instance. The mark is an arbitrary label enclosed in the following characters: [^...]. The same label, The content for any of these elements can be followed by ‘:’ must precede the footnote text. omitted, but then the respective line must be left When the footnote text is short, it is possible blank (unless it is the last element, i.e. the date). to write it directly inside the text, without the label. % Title The output from all the footnotes is collected at the % end of the document, numbered sequentially. Here % 17/02/2013 is an example of the input: % Paragraph containing[^longnote] a % First Author; Second Author footnote too long to be written directly % 17/02/2013 inside the text.

% Title [^longnote]: A footnote too long to be % First Author; Second Author written inside the text without causing confusion. Since version 1.12 metadata support has been substantially extended. Now Pandoc accepts multi- New paragraph.^[A short note.] ple metadata blocks in YAML format, delimited by Tables Again, syntax for tables is based on consid- a line of three hyphens (---) at the top and a line erations of readability of the source. The alignment of three hyphens (---) or three dots (...) at the of the cells composing the table is immediately visi- bottom.4 This gives the user a high level of flexibil- ble in the alignment of the text with respect to the ity in setting and using variables for templates (see dashed line that divides the header from the rest section 3.1). of the table; this line must always be present, even YAML structures metadata in arrays, thus al- when the header is void.5 When the table includes lowing for a finer granularity. The user may specify cells with more than one line of text, it is mandatory in his source file the following code: --- 5 In fact, it is the header, if present, or the first line of the author: table that sets the alignment of the columns. Aligning the remaining cells is not needed, but it is recommended as an 4 YAML Ain’t Markup Language, http://www.yaml.org/. aid for the reader.

Massimiliano Dominici TUGboat, Volume 35 (2014), No. 1 47

to enclose it between two dashed lines. In this case divided by blank lines from the remaining text, it the width of each column in the source file is used to will be interpreted as a floating object with its own compute the width of the equivalent column in the caption taken from the ‘Alternative text’. output table. Multicolumn or multirow cells are not Listings In standard Markdown, verbatim text is supported. The caption can precede or follow the ta- marked by being indented by four spaces or one tab. ble; it is introduced by the markup ‘:’ (alternatively: To that, Pandoc adds the ability to specify identifiers, Table:) and must be divided from the table itself classes and attributes for a given block of ‘verbatim’ by a blank line: material. Pandoc will treat them in different ways, ------depending on the output format and the command Right Centered Left line options; in some circumstances, they will sim------Text Text Text ply be ignored. To achieve this, Pandoc introduces aligned aligned aligned an alternative syntax for listings of code: instead right center left of indented blocks, they are represented by blocks delimited by sequences of three or more tildes (~~~) New cell New cell New cell or backticks (‘‘‘); identifiers, classes and attributes ------must follow that initial ‘rule’, enclosed in braces. In the following example6 we can see a listing of Python Table: Alignment code with, in this order: an identifier, the class that There’s an alternative syntax to specify the marks it as Python code, another class that speci- alignment of the individual columns: divide columns fies line numbering and an attribute that marks the with the character ‘|’ and use the character ‘:’ in starting point of the numbering. the dashed line below the header to specify, through ~~~ {#bank .python .numberLines startFrom="5"} its placement, the column’s alignment, as shown in class BankAccount(object): the following example: def __init__(self, initial_balance=0): Right | Centered | Left self.balance = initial_balance ------:|:------:|:------def deposit(self, amount): Text | Text | Text self.balance += amount aligned | aligned | aligned def withdraw(self, amount): right | center | left self.balance -= amount | | def overdrawn(self): New cell | New cell | New cell return self.balance < 0 my_account = BankAccount(15) : Alignment by ‘:’ my_account.withdraw(5) In the examples above, the cells cannot contain print my_account.balance ‘vertical’ material (multiple paragraphs, verbatim ~~~ blocks, lists). ‘Grid’ tables (see the example below) It is possible to use identifiers, class and at- allow this, at the cost of not being to specify the tributes for inline code, too: column alignments. The return value of the ‘printf‘{.C} function +------+------+------+ is of type ‘int‘. | Text | Lists | Code | +======+======+======+ By default, Pandoc uses a simple verbatim en- | Paragraph. | * Item 1 | ~~~ | vironment for code that doesn’t need highlighting | | * Item 2 | \def\PD{% | and the Highlighting environment, defined in the | Paragraph. | | \emph{Pandoc} | template’s preamble (see 3.1) and based on Verbatim | | * Item 3 | ~~~ | from fancyvrb, when highlighting is needed. If the +------+------+------+ option --listings is given on the command line, | New cell | New cell | New cell | Pandoc uses the lstlistings environment from list- +------+------+------+ ings every time a code block is encountered. Figures As shown in table 1, Markdown allows for Formulas Pandoc supports mathematical formu- the use of inline images with the following syntax: las quite well, using the usual TEX syntax. Expres- ![Alternative text](/path/image) sions enclosed in dollar signs will be interpreted as where ‘Alternative text’ is the description that inline formulas; expressions in double dollar signs HTML uses when the image cannot be viewed. Pan- doc adds to that one more feature: if the image is 6 From http://wiki.python.org/moin/SimplePrograms.

An overview of Pandoc 48 TUGboat, Volume 35 (2014), No. 1

--- will be interpreted as displayed formulas. This is all references: comfortably familiar to a TEX user. - author: The way these expressions will be treated de- family: Gruber pends on the output format. For T X (LAT X/Con- given: E E - John TEXt) output, the expressions are passed without id: gruber13:_markd modifications, except for the substitution of the issued: delimiters for display math: \[...\] instead of year: 2013 $$...$$. When HTML (or similar) output is re- title: Markdown type: no-type quired, the behavior is controlled by command line publisher: the formulas by means of Unicode characters. Other - volume: 32 options allow for the use of some of the most com- page: 272-277 container-title: TUGboat mon JavaScript libraries for visualizing math on the author: web: MathJax, LaTeXMathML and jsMath. It is family: Kielhorn also possible, always by means of a command line given: switch, to render formulas as images or to encode - Axel id: kielhorn11:_multi 7 them as MathML. issued: Pandoc is also able to parse simple macros and year: 2011 expand them in output formats different from the title: Multi-target publishing type: article-journal supported TEX dialects. This feature, though, is only issue: 3 available in the of math rendering. ...

Citations Pandoc can build a bibliography (and Figure 1:A YAML bibliographic database (line break manage inside the text) using a database in in the url is editorial). any of several common formats (BibTEX, EndNote, ISI, etc.). The database file must always be speci- fied as the argument of the option --bibliography. Since version 1.12 native support for citations Without other options on the command line, Pan- has been split from the core functions of Pandoc. In doc will include citations and bibliographic entries order to activate this feature, one must now use an as plain text, formatted following the bibliographic external filter (--filter pandoc-, to be style ‘Chicago author–date’. The user may specify a installed separately).9 different style by means of the option --csl, whose A new feature is that bibliographic databases argument is the name of a CSL style file8 and may can now be built using the references field in- also specify that the bibliographic apparatus will side a YAML block. Finding the correct encoding be managed by natbib or biblatex. In this case Pan- for a YAML bibliographic database can be a little doc will not include in the LATEX output citations tricky, so it is recommended, if possible, to convert and entries in extended form, but only the required from an existing database in one of the formats rec- commands. The options to get this behavior are, ognized by Pandoc (among them BibTEX), using respectively, --natbib and --biblatex. the biblio2yaml utility, provided together with the The user must type citations in the form [@key1; pandoc-citeproc filter. The YAML code for the @key2;...] or @key1 if the citation should not be first two items in this article’s bibliography, created enclosed in round . A dash preceding the by converting the .bib file, is shown in figure 1. label suppresses the author’s name (when supported A YAML field can be used also for specifying the by the citation format). Bibliographic references are CSL style for citations (csl field), or the external always placed at the end of the document. bibliography file, if required (bibliography field). Raw code (HTML or TEX) All implementations 7 The web pages for these different rendering engines for of Markdown, of whichever flavour, allow for the use math on the web are http://www.mathjax.org, http://math. of raw HTML code, written without modifications in etsu.edu/LaTeXMathML, http://www.math.union.edu/~dpvc/ the output, as we mentioned in section 2. Pandoc jsmath and http://www.w3.org/Math. extends this feature, allowing for the use of T X raw 8 CSL (http://citationstyles.org), Citation Style Lan- E A guage, is an , XML-based, language to describe code, too. Of course, this works only for LTEX/ the formatting of citations and bibliographies. It is used in ConTEXt output. several citation managers, such as Zotero, Mendeley, and Pa- pers. A detailed list of the available styles can be found in 9 The filter is not needed when using natbib or biblatex http://zotero.org/styles. directly instead of the native support.

Massimiliano Dominici TUGboat, Volume 35 (2014), No. 1 49

1 \documentclass$if(fontsize)$[$fontsize$]$endif$ Pandoc; to obtain a complete document, including a 2 {article} preamble, the command line option --standalone 3 \usepackage{amssymb,amsmath} (or its equivalent -s) is used. 4 \usepackage{ifxetex} It is also possible to build more flexible tem- 5 \ifxetex plates, useful for different projects with different 6 \usepackage{fontspec,xltxtra,xunicode} features, providing for a moderate level of customiza- 7 \defaultfontfeatures{Mapping=-text, 8 Scale=MatchLowercase} tion. As the reader can see in figure 2, a template is 9 \else substantially a file in the desired output format (in 10 \usepackage[utf8]{inputenc} this case LATEX) interspersed with variables and con- 11 \fi trol flow statements introduced by a dollar sign. The 12 $if(natbib)$ expressions will be evaluated during the compilation 13 \usepackage{natbib} and replaced by the resulting text. For instance, at 14 \bibliographystyle{plainnat} line 113 of the listing in figure 2 we find the expres- 15 $endif$ 16 $if(biblatex)$ sion $body$, which will be replaced by the document 17 \usepackage{biblatex} body. Above, at lines 12–21, we can find the se- 18 $if(biblio-files)$ quence of commands that will include in the final 19 \bibliography{$biblio-files$} output all the resources needed to generate a bibli- 20 $endif$ ography by means of natbib or biblatex. This code 21 $endif$ will be activated only if the user has specified either ... --natbib or --biblatex on the command line. The code to print the bibliography can be found at the 113 $body$ end of the listing, at lines 124–130. 114 115 $if(natbib)$ In this way it is possible to define all the desired 116 $if(biblio-files)$ variables and the respective compiler options. The 117 $if(biblio-title)$ user can thus change the default template to specify, 118 $if(book-class)$ e.g., among the options that may be passed to the 119 \renewcommand\bibname{$biblio-title$} class, not only the body font, but a generic string 120 $else$ containing more options.10 We would replace the 121 \renewcommand\refname{$biblio-title$} first line in the listing of figure 2 with the following: 122 $endif$ 123 $endif$ \documentclass$if(clsopts)$[$clsopts$]$endif$ 124 \bibliography{$biblio-files$} Then we can compile with the following options: 125 $endif$ pandoc -s -t --template=mydefault \ 126 $endif$ -V clsopts=a4paper,12pt -o test.tex test.md 127 $if(biblatex)$ 128 \printbibliography given that we saved the modified template in the 129 $if(biblio-title)$[title=$biblio-title$]$endif$ current directory by the name mydefault.latex. 130 $endif$ Since version 1.12, ‘variables’ can be replaced by 131 $for(include-after)$ YAML ‘metadata’, specified either inside the source 132 $include-after$ file or on the command line using the -M option. 133 $endfor$ 134 \end{document} 4 Problems and limitations

Figure 2: Fragments of the default Pandoc v1.11 We’ve seen many nice features of Pandoc. Not sur- prisingly, Pandoc also has some limitations and short- template for LATEX. comings. Some of these shortcomings are tied to the particular LML used by Pandoc. For instance, Mark- down doesn’t allow semantic markup.11 This kind of Templates One of the most interesting features limitation can be addressed using an additional level of Pandoc is the use of customized templates for 10 the different output formats. For HTML-derived and This can be also be achieved via the variable $fontsize. (Since Pandoc 1.12, the default LATEX template includes sepa- TEX-derived output formats this can be achieved in rate variables for body font size, paper size and language, and two ways. First of all, the user may generate only a generic $classoption variable for other parameters.) the document ‘body’ and then include it inside a 11 This is not an issue necessarily pertinent to all LMLs since some of them provide methods to define objects that be- ‘master’ (for TEX output, with \input or \include). have like LATEX macros, either through pre- or post-processing In this way, an ad hoc preamble can be built be- () or by taking advantage of conceptually close struc- forehand. This is in fact the default behaviour for tures (the ‘class’ of a span in HTML, in Textile). In any case,

An overview of Pandoc 50 TUGboat, Volume 35 (2014), No. 1

of abstraction, using preprocessors like gpp or m4, as scripts (a drawback for me, at least . . . ). The current illustrated by Aditya Mahajan in [4]. Of course, this version also allows Python, thus making easier the clashes with the initial purpose of readability and task of creating such scripts.12 introduces further complexity, though the use of m4 need not significantly increase the amount of extra 5 Conclusion markup. To conclude this overview, I consider Pandoc to be Other problems, though, unexpectedly arise in the best choice for a project requiring multiple out- the process of conversion to LATEX output. For in- put formats. The use of a ‘neutral’ language in stance, the cross reference mechanism is calibrated the source file makes it easier to avoid the quirks to HTML and shows all its shortcomings with regard of a specific language and the related problems of A to the LTEX output. The cross reference is in fact translation to other languages. For a LATEX user in generated by means of a hypertext anchor and not particular, being able to type mathematical expres- by the normal use of \label and \ref. As a typical sions ‘as in LATEX’ and to use a BibTEX database for example, let’s consider a labelled section referenced bibliographic references are also two strong points. later in the text, as in the following: One should not expect to find in Pandoc an easy ## Basic elements ## {#basic} solution for every difficulty. Limitations of LMLs [...] in general, and some flaws specific to the program, As we have explained in entail the need for workarounds, making the process [Basic elements](#basic) less immediate. This doesn’t change the fact that, if we get this result: the user is aware of such limitations and the project \hyperdef{}{basic}{% can bear them, Pandoc makes obtaining multiple \subsection{Basic elements}\label{basic} output formats from a single source extremely easy. } [...] References As we have explained in \hyperref[basic]{Basic elements} [1] John Gruber. Markdown. http://daringfireball.net/projects/ which is not exactly what a LAT X user would expect E markdown/, 2013. . . . Of course one could directly use \label and \ref, but they will be ignored in all non-TEX output [2] Axel Kielhorn. Multi-target publishing. formats. Or, we could use a preprocessor to get TUGboat, 32(3):272–277, 2011. http://tug. two different intermediate source files, one for HTML org/TUGboat/tb32-3/tb102kielhorn.. and for LATEX (and maybe a third for ODF/OOXML, [3] John MacFarlane. Pandoc: a universal etc.), but by now the original point of using Pandoc document converter. http://johnmacfarlane. is being lost. net/pandoc/, 2013. Formulas, too, may cause some problems. Pan- [4] Aditya Mahajan. How I stopped worrying and doc recognizes only inline and display expressions. started using Markdown like TEX. http:// The latter are always translated as displaymath en- randomdeterminism.wordpress.com/2012/06/ vironments. It is not possible to specify a different 01/how-i-stopped-worring-and-started- kind of environment (equation, gather, etc.) unless using-markdown-like-tex/, 2012. one of the workarounds discussed above is employed, [5] Wikipedia. Lightweight markup language. with all the consequent drawbacks also noted. http://en.wikipedia.org/wiki/ It should be stressed, however, that Pandoc is a Lightweight_markup_language, 2013. program in active development and that several fea- tures present in the current version were not available a short time ago. So it is certainly possible that all or  Massimiliano Dominici some of the shortcomings that a LATEX user finds in Pisa, Italy the current version of Pandoc will be addressed in the mlgdominici (at) gmail dot com near or mid-term. It is also possible, to some extent, to extend or modify Pandoc’s behaviour by means of scripts, as noted at http://johnmacfarlane.net/ 12 Another option is writing one’s own custom ‘writer’ in pandoc/scripting.. One major drawback, un- Lua. A writer is essentially a program that translates the data til recently, was the mandatory use of Haskell for such structure, collected by the ‘reader’, in the format specified by the user. Having Lua installed on the system is not required, though, the philosophy behind LMLs doesn’t support such since a Lua interpreter is embedded in Pandoc. See http:// markup methods. johnmacfarlane.net/pandoc/README.html#custom-writers.

Massimiliano Dominici