TEX Users Group
PREPRINTS for the 2004 Annual Meeting

TEX Users Group Board of Directors
Donald Knuth, Grand Wizard of TEX-arcana†
Karl Berry∗, President
Kaja Christiansen∗, Vice President
Sam Rhoads∗, Treasurer
Susan DeMeritt∗, Secretary
Barbara Beeton
Jim Hefferon
Ross Moore
Arthur Ogawa
Gerree Pecht
Steve Peter
Cheryl Ponchin
Michael Sofka
Philip Taylor
Raymond Goucher, Founding Executive Director†
Hermann Zapf, Wizard of Fonts†
∗member of executive committee
†honorary

These preprints for the 2004 annual meeting are published by the TEX Users Group. Periodical-class postage paid at Portland, OR, and additional mailing offices. Postmaster: Send address changes to TEX Users Group, 1466 NW Naito Parkway, Suite 3141, Portland, OR 97209-2820, U.S.A.

Memberships
2004 dues for individual members are as follows: Ordinary members: $75. Students/Seniors: $45. The discounted rate of $45 is also available to citizens of countries with modest economies, as detailed on our web site. Membership in the TEX Users Group is for the calendar year, and includes all issues of TUGboat for the year in which membership begins or is renewed, as well as software distributions and other benefits. Individual membership is open only to named individuals, and carries with it such rights and responsibilities as voting in TUG elections. For membership information, visit the TUG web site: http://www.tug.org.

Institutional Membership
Institutional Membership is a means of showing continuing interest in and support for both TEX and the TEX Users Group. For further information, contact the TUG office ([email protected]) or see our web site.

Addresses
General correspondence, payments, etc.: TEX Users Group, P. O. Box 2311, Portland, OR 97208-2311, U.S.A.
Delivery services, parcels, visitors: TEX Users Group, 1466 NW Naito Parkway, Suite 3141, Portland, OR 97209-2820, U.S.A.
Telephone: +1 503 223-9994
Fax: +1 503 223-3960
World Wide Web: http://www.tug.org/

Electronic Mail (Internet)
General correspondence, membership, subscriptions: [email protected]
Submissions to TUGboat, letters to the Editor: [email protected]
Technical support for TEX users: [email protected]
Contact the Board of Directors: [email protected]
Problems not resolved? The Board wants to hear from you: Please email [email protected].

TEX is a trademark of the American Mathematical Society.

Copyright © 2004 TEX Users Group. Copyright to individual articles within this publication remains with their authors, and may not be reproduced, distributed or translated without their permission. For the editorial and other material not ascribed to a particular author, permission is granted to make and distribute verbatim copies without royalty, in any medium, provided the copyright notice and this permission notice are preserved. Permission is also granted to make, copy and distribute translations of such editorial material into another language, except that the TEX Users Group must approve translations of this permission notice itself. Lacking such approval, the original English permission notice must be included.

Printed in U.S.A. [printing date: April 2004]

PREPRINTS for the 2004 Annual Meeting
TEX Users Group Twenty-fifth Annual Meeting
Xanthi, Greece, August 30 – September 3, 2004
A COMMUNICATION OF THE TEX USERS GROUP
EDITORS KARL BERRY BADEN HUGHES STEVE PETER
PORTLAND • OREGON • U.S.A. • 2004

Introduction

Welcome to this special communication from TUG! Although it is formatted in the usual TUGboat proceedings style, as you can see, it is not officially a TUGboat issue: there is no volume or issue number, and it's being produced before the meeting. Why? These editorial notes (try to) explain the situation.

The 2004 TUG Annual Meeting (http://tug.org/tug2004) is being organized by Apostolos Syropoulos and others, associated with the Greek TEX Friends user group, in collaboration with the Democritus University of Thrace in Xanthi, Greece. The organizers were able to negotiate with Springer-Verlag for the proceedings of the meeting to be published in Springer's prestigious series Lecture Notes in Computer Science.

As TUG president, I felt such a publication was in the best interests of TEX and TUG, as it would expose our system to a wide new audience. On the other hand, it is of course absolutely necessary for every current TUG member to receive the proceedings, as that is a benefit promised upon joining. Unfortunately, it is not financially possible for TUG to purchase the hardbound Springer volume for every member. Therefore, with the consent of the TUG board, we worked out this compromise. The principal points of the agreement are these:

• The present volume of "preprints" is fully edited and prepared to usual TUGboat publication standards.
• The present volume will be one of the usual three TUG publications sent to members for 2004. The other two will be regular TUGboat issues.
• Springer's volume will be the publication of record for the 2004 meeting; this is why the present volume is not an official TUGboat issue.
• We also are not sending the present volume to libraries, as it cannot easily be cataloged. Thus, it would only cause trouble for the librarians. Libraries will receive the Springer volume.
• TUG will not sell this issue through our online store (http://tug.org/store); it will be sent only to those who join TUG for the 2004 membership year.
• Springer or the authors may make the papers available online; for any that are available, we may link to them from the TUGboat web pages, but we won't publish them online ourselves.

I sincerely hope this one-time special arrangement will meet with everyone's understanding. We expect future proceedings of TUG annual meetings to be published as normal TUGboat issues.

The theme of the meeting is TEX in the era of Unicode, and this is reflected in the contents here. There is a wealth of material on Omega and other multilingual typesetting, including a definitive study of the historical Deseret alphabet. Additional papers include a comparative study of Type 1 font generation from METAFONT sources, a significant new package for critical edition typesetting, and more.

As this volume is being prepared in advance of the meeting, please email any comments you might have directly to the authors, or to the editors at [email protected], so the presentation might be improved. And of course please consider going to the conference. (Online registration is available from the conference web site.)

Production of the three issues of TUGboat for the 2003 membership year is continuing. The proceedings of the TUG 2003 conference will probably be the first to be completed.

As always, feel free to email the TUG board ([email protected]), or the TUGboat editorial staff ([email protected]) with any questions, concerns, or (most important of all) suggestions and submissions for future articles.

⋄ Karl Berry
[email protected]

Digital Typography in the New Millennium: Flexible Documents by a Flexible Engine
Christos KK Loverdos
Department of Informatics and Telecommunications
University of Athens
TYPA Buildings, Panepistimioupolis
GR-157 84 Athens
Greece
[email protected]
Apostolos Syropoulos
Greek TEX Friends Group
366, 28th October Str.
GR-671 00 Xanthi
Greece
[email protected]
Abstract
The TEX family of electronic typesetters is the primary set of typesetting tools for the preparation of demanding documents, and has been in use for many years. However, our era is characterized, among other things, by Unicode, XML and the introduction of interactive documents. In addition, the Open Source movement, which is breaking new ground in the areas of project support and development, enables masses of programmers to work simultaneously. As a direct consequence, it is reasonable to demand the incorporation of certain facilities into a highly modular implementation of a TEX-like system. Facilities such as the ability to extend the engine using common scripting languages (e.g., Perl, Python, Ruby) will help in reaching a greater level of overall architectural modularity. Obviously, in order to achieve such a goal, it is mandatory to attract a greater programming audience and leverage the Open Source programming community. We argue that a successful TEX successor should be built around a microkernel/exokernel architecture. Thus, services such as client-side scripting, font selection and use, output routines and the design and implementation of formats can be programmed as extension modules. In order to leverage the huge amount of existing code, and to keep document source compatibility, the existing programming interface is demonstrated to be just another service/module.
1 Introduction

The first steps towards computer typesetting took place in the 1950s, but it was not until Donald E. Knuth introduced TEX in 1978 [16] that true quality was brought to software-based typesetting. The history of TEX is well known and the interested reader is referred to [16] for more details.

Today, the original TEX is a closed project in the sense that its creator has decided to freeze its development. As a direct consequence, no other programs are allowed to be called TEX. In addition, the freely available source code of the system was a major step on the road towards the formation of the Open Source movement, which, in turn, borrowed ideas and practices from the Unix world. Furthermore, the development of TEX and its companion system, METAFONT, had made obvious the need for properly documented programs. This, in turn, initiated Knuth's creation of the literate programming program development methodology. This methodology advances the idea that program code and documentation should be intermixed and developed simultaneously.

The source code of TEX and METAFONT being freely available has had enormous consequences. Anyone can not only inspect the source code, but
also experiment freely with it. Combined with TEX's (primitive, we should note, but quite effective for the time) ability to extend itself, this led to such success stories as LATEX and its enormous supporting codebase, in the form of packages. As a direct consequence of the fact that the source code is frozen, stability was brought forth. Note that this was exactly the intention Knuth had when developing his systems. A commonly referred-to core, unchanged in the passing of time and almost free of bugs, offered a "secure" environment to produce with and even experiment with.

However, in an ever-changing world, especially in the fast-paced field of computer science, almost anything must eventually be surpassed. And it is the emerging needs of each era that dictate possible future directions. TEX has undoubtedly served its purpose well. Its Turing-completeness has been a most powerful asset/weapon in the battles for and of evolution. Yet, the desired abstraction level, needed to cope with increasing complexity, has not been reached. Unfortunately, with TEX being bound to a fixed core, it cannot be reached.

Furthermore, the now widely accepted user-unfriendliness of TEX as a language poses another obstacle to TEX's evolution. It has created the myth of those few, very special and quite extraordinary "creatures"[1] able to decrypt and produce code fragments[2] such as the following:

\def\s@vig{{\EO@m=\EO@n
\divide\EO@n by20 \relax
\ifnum\EO@n>0\s@vig\fi
\EO@k=\EO@n\relax
\multiply\EO@k by-20\relax
\advance\EO@m by \EO@k\relax
\global\advance\EO@l by \@ne
\expandafter\xdef\csname EO@d\@roman{\EO@l}\endcsname{%
\ifnum\EO@m=0\noexpand\noexpand\EOzero
\else\expandafter\noexpand
\expandafter\csname EO\@roman{\EO@m}\endcsname\fi}
\expandafter\@rightappend
\csname EO@d\@roman{\EO@l}\endcsname
\t@\epi@lmecDigits}}

Of course, to be fair, programmers in several languages (C and Perl among others) are often accused of producing incomprehensible code, and the well-known obfuscated code contests just prove it. On the other hand, with the advent of quite sophisticated assemblers, today one can even write well-structured assembly language, adhering even to "advanced" techniques/paradigms, such as object-oriented programming. Naturally, this should not lead to the conclusion that we should start writing in assembly (again)! In our opinion, software complexity should be tackled with an emphasis on abstraction that will eventually lead to increased productivity, as is shown in the following figure:

  Complexity --(requires)--> Abstraction --(increases)--> Productivity

TEX's programming language is more or less an "assembly language" for electronic typesetting. It is true that higher-level constructs can be made: macros and macro packages built on top of that. But the essence remains the same. Although it is true that TEX is essentially bug-free and its macro expansion facility behaves the way it is specified (i.e., as defined in [9]), it still remains a fact that it takes a non-specialist quite some time to fully understand the macro expansion rules, in spite of Knuth's initial intentions [12, page 6].

The fact that one should program in the language of his/her choice is just another reason for moving away from a low-level language. And it is true that we envision an environment where as many programmers as possible can, and most importantly wish to, contribute. In the era of the Open Source revolution, we would like to attract the Open Source community and not just a few dedicated low-level developers. Open Source should also mean, in our opinion, "open possibilities" to evolve the source. This is one of our major motivations for reengineering the most successful typesetting engine.

Richard Palais, the founding chairman of TUG, pointed out back in 1992 [12, page 7] that when developing TEX, Knuth

  ... had NSF grant support that not only provided him with the time and equipment he needed, but also supported a team of devoted and brilliant graduate students who did an enormous amount of work helping design and write the large quantity of ancillary software needed to make the TEX system work ...

and immediately after this, he poses the fundamental question:

  Where will the resources come from for what will have to be at least an equally massive effort? And will the provider of those resources be willing, at the end of the project, to put the fruits of all his effort in the Public Domain?

The answer seems obvious now. The way has been paved by the GNU/Linux/BSD revolutionary development model, as has been explained crystal clearly in The Cathedral and the Bazaar [15].

This paper is an attempt to define a service-oriented architecture for a future typesetting engine, which will be capable of modular evolution.

[1] The second author may be regarded as one of Gandalf's famuli, while the first author is just a Hobbit, wishing to have been an Elf.
[2] Taken from the documentation of the epiolmec package by the second author.
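To give a flavor of why the macro expansion rules discussed above take time to master, here is a minimal illustration of our own (not taken from the paper): the order in which TEX expands tokens must often be rearranged explicitly with \expandafter.

```tex
% Minimal sketch (ours): expansion order in plain TEX.
\def\a{A}
\def\b{\a\a}
% \edef expands its body recursively at definition time:
\edef\fully{\b}        % \fully now holds AA
% \expandafter expands the token *after* the following one first,
% so \b is expanded exactly once before \def grabs its argument:
\expandafter\def\expandafter\once\expandafter{\b}   % \once holds \a\a
\message{\meaning\fully}   % reports: macro:->AA
\bye
```

The single-step versus full expansion distinction is exactly the kind of detail that trips up newcomers, and chains of three or more \expandafter tokens (as in the fragment above) are routine in real packages.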
We take a layered approach of designing some core functionality and then defining extensible services on top of the core. The engine is not restricted to a specific programming language, either for its basic/bootstrapping implementation or, even more important, for its future enhancement. At the same time, we are bound to provide a 100% TEX-compatible environment, as the only means of supporting the vast quantity of existing TEX-based documents. We intend to achieve such a goal by leveraging the proposed architecture's own flexibility. Specifically, a TEX compatibility mode is to be supported, and it should give complete "trip-test" compliance. Later on, we shall see that this compatibility is divided into two parts: source code compatibility and internal core compatibility. Both are provided by pluggable modules.

Structure of the paper. In the following sections we briefly review the most important and influential approaches to extending or reengineering TEX, including TEX's inherent abilities to evolve. Then we discuss a few desired characteristics for any next-generation typesetting engine. We advance by proposing an architecture to support these emerging needs. Finally, we conclude by discussing further and future work.

2 A Better TEX?

2.1 TEX the Program

TEX supports a Turing-complete programming language. Simply, this means that if it lacks a feature, it can be programmed. It contains only a few concepts and belongs to the LISP family of languages. In particular, it is a list-based macro-language with late binding [5, Sec. 3.3]:

  Its data constructs are simpler than in Common Lisp: 'token list' is the only first-order type. Glue, boxes, numbers, etc., are engine concepts; instances of them are described by token lists. Its lexical analysis is simpler than CL: One cannot program it. One can only configure it. Its control constructs are simpler than in CL: Only macros, no functions. And the macros are only simple ones; one can't compute in them.

Further analysis of TEX's notions and inner workings, such as category codes and TEX's mouth and stomach, is beyond the scope of this paper, and the interested reader is referred to the classic [9] or the excellent [3].

TEX the program is written in the WEB system of literate programming. Thus, its source code is self-documented. The programs tangle and weave are used to extract the Pascal code and the documentation, respectively, from the WEB code. The documentation is of course specified in the TEX notation. Although the TEX source is structured in a monolithic style, its architecture provides for some kind of future evolution.

First, TEX can be "extended" by the construction of large collections of macros that are simply called formats. Each format can be transformed to a quickly loadable binary form, which can be thought of as a primitive form of the module concept.

Also, by the prescient inclusion of the \special primitive command, TEX provides the means to express things beyond its built-in "comprehension". For example, TEX knows absolutely nothing about PostScript graphics, yet by using \special and the appropriate driver program (e.g., dvips), PostScript graphics can be easily incorporated into documents. Color is handled in the same way. In all cases, all that TEX does is to expand the \special command arguments and transfer the command to its normal output, that is, the DVI file (a file format that contains only page description commands).

Last, but not least, there is the notion of the change file [3, page 243]:

  A change file is a list of changes to be made to the WEB file; a bit like a stream editor script. These changes can comprise both adaptations of the WEB file to the particular Pascal compiler that will be used and bug fixes to TEX. Thus the TeX.web file never needs to be edited.

Thus, change files provide a form of incremental modification. This is similar to the patch mechanism of Unix.

Yet, no matter how foresighted these methods may be, twenty years after its conception TEX has started to show its age. Today's trends, and more importantly the programming community's continuing demand for even more flexible techniques and systems, call for new modes of expressiveness.

2.2 The LATEX Format

LATEX [10], which was released around 1985, is the most widely known TEX format. Nowadays, it seems that LATEX is the de facto standard for the communication and publication of scientific documents (i.e., documents that contain a lot of mathematical notation). LATEX "programs" have a Pascal-like structure, and the basic functionality is augmented with the incorporation of independently developed collections of macro packages. In addition, classes are used to define major document characteristics and are in essence document types, such as book, article, etc. Thus, each LATEX "program" is characterized by the document class to which it belongs, by the
packages it utilizes, and any new macro commands it may provide.

The current version of LATEX is called LATEX 2ε. Work is in progress to produce and widely distribute the next major version, LATEX3 [11]. Among the several enhancements that the new system will bring forth are:

• Overall robustness
• Extensibility, relating to the package interface
• Better specification and inclusion of graphical material
• Better layout specification and handling
• Inclusion of requirements of hypertext systems

The LATEX3 core team expects that a major reimplementation of LATEX is needed in order to support the above goals.

The ConTEXt [13] format, developed by Hans Hagen, is monolithic when compared to LATEX. As a result, the lessons learned from its development are not of great interest to our study.

2.3 NTS: The New Typesetting System

The NTS project [14] was established in 1992 as an attempt to extend TEX's typesetting capabilities and at the same time to propose a new underlying programmatic model. Its originators recognised that TEX lacked user-friendliness and as a consequence attracted many fewer users than it could (or should). Moreover, TEX (both as a name and a program) was frozen by Knuth, so any enhancements had to be implemented in a completely new system.

NTS was the first attempt to recognize that TEX's monolithic structure and implementation in an obsolete language (i.e., the Pascal programming language) are characteristics that could only impede its evolution. The techniques used to implement TEX, particularly its "tight", static and memory-conservative data structures, have no (good) reason to exist today (or even when NTS was conceived, in 1992), when we have had a paradigm shift to flexible programming techniques.

After considering and evaluating several programming paradigms [19], including functional, procedural and logic programming, the NTS project team decided to proceed with a Java-based implementation. Java's object-oriented features and its network awareness were the main reasons for adopting Java, as NTS was envisioned as a network-based program, able to download and combine elements from the network.

Today, there is a Java codebase, which has deconstructed the several functional pieces of TEX and reconstructed them in a more object-oriented way with cleaner interfaces, a property that the original TEX source clearly lacks. In spite of the promising nature of NTS, the directory listing at CTAN[3] shows that the project has been inactive since 2001.[4] It seems that the main focus is now the development of ε-TEX, which is presented in the following section.

2.4 ε-TEX

ε-TEX [17] was released by the NTS team as soon as it was recognized that NTS itself was very ambitious and that a more immediate and more easily conceivable goal should be set. So, it was decided that the first step towards a new typesetting system was to start with a reimplemented but 100% TEX-compatible program.

ε-TEX was released in 1996, after three years of development and testing. It adds about thirty new primitives to the standard TEX core, including handling of bidirectional text (right-to-left typesetting). It can operate in three distinct modes:

1. "compatibility" mode, where it behaves exactly like standard TEX.
2. "extended" mode, where its new primitives are enabled. Full compatibility with TEX is not actually sought, and the primary concern is to make typesetting easier through its new primitives.
3. "enhanced" mode, where bidirectional text is also supported. This mode is taken to be a radical departure from standard TEX.

Today, ε-TEX is part of all widely used TEX distributions and has proven to be very stable. Indeed, in 2003 the LATEX team requested that future distributions use ε-TEX by default for LATEX commands, which has since been implemented in TEX Live and other distributions.

2.5 Ω

Ω [16], which was first released in 1996, is primarily the work of two people: Yannis Haralambous and John Plaice. It extends TEX in order to support the typesetting of multilingual documents. Ω provides new primitives and new facilities for this reason. Ω's default character encoding is the Unicode UCS-2 encoding, while it can easily process files in almost any imaginable character encoding. In addition to that, Ω supports the parameterization of paragraph and page direction, thus allowing the typesetting of text in any imaginable writing method.[5]

[3] http://www.ctan.org/tex-archive/systems/nts/.
[4] We last accessed the above URL in March 2004.
[5] Currently the boustrophedon writing method is the only one not supported.
Much of its power comes from its new notion of ΩTPs (Ω Translation Processes). An ΩTP is normally used to transform a document from one character encoding to another. An ΩTP is actually a finite state automaton and, thus, it can easily handle cases where the typesetting of particular characters is context-dependent. For example, in traditional Greek typography there are two forms of the small letter theta, both supported by Unicode [namely ϑ (03D1) and θ (03B8)]. The first form is used at the beginning of a word, while the second is used in the middle of a word. The following code, borrowed from [16], implements exactly this feature:

input: 2; output: 2;
aliases:
LETTER = (@"03AC-@"03D1 | @"03D5 | @"03D6 |
          @"03F0-@"03F3 | @"1F00-@"1FFF) ;
expressions:
^({LETTER})@"03B8({LETTER} | @"0027)
  => \1 @"3D1 \3;
. => \1;

For performance reasons, ΩTPs are compiled into ΩCPs (Ω Compiled Processes).

External ΩTPs are programs in any programming language that can handle problems that cannot be handled by ordinary ΩTPs. For example, one can prepare a Perl script that inserts spaces in a Thai-language document. Technically, external ΩTPs are programs that read from the standard input and write to the standard output. Thus, Ω forks a new process to allow the use of an external ΩTP. In [16] there are a number of examples (some of them borrowed from [7]).

We should note that the field of multilingual typesetting is an active research field, which is the main reason why Ω is still an experimental system. We should also note that ε-Ω [4], by Giuseppe Bilotta, is an extension of Ω that tries to incorporate the best features of ε-TEX and Ω in a new typesetting engine.

2.6 pdfTEX

pdfTEX [18] is yet another TEX extension, one that can directly produce a file in Adobe's PDF format. Recently, pdf-ε-TEX was introduced, merging the capabilities of both pdfTEX and ε-TEX.

3 Towards a Universal Typesetting Engine

From the discussion above, it is obvious that there is a trend to create new typesetting engines that provide the best features of different existing typesetting engines. Therefore, a Universal Typesetting Engine should incorporate all the novelties that the various TEX-like derivatives have presented so far. In addition, such a system should be designed by taking into serious consideration all aspects of modern software development and maintenance. However, our departure should not be too radical, in order to be able to use the existing codebase. Let us now examine all these issues in turn.

3.1 Discussion of Features

Data Structures. TEX's inherent limitations are due to the fact that it was developed in a time when computer resources were quite scarce. In addition, TEX was developed using the now outdated structured programming development methodology. Nowadays, hardware imposes virtually no limits on the design and development of software. Also, new programming paradigms (e.g., aspect-oriented programming [8], generative programming [2], etc.) and techniques (e.g., extreme programming [1]) have emerged, which have substantially changed the way software is designed and developed.

These remarks suggest that a new typesetting engine should be free of "artificial" limitations. Naturally, this is not enough, as we have to leave behind the outdated programming techniques and make use of modern techniques to ensure the future of the Universal Typesetting Engine. Certainly, NTS was a step in the right direction, but in the light of current developments in the area of software engineering it is now a rather outdated piece of software.

New Primitive Commands. Modern document manipulation demands new capabilities that could not have been foreseen at the time TEX was created. A modern typesetting engine should provide a number of new primitive commands to meet the new challenges imposed by modern document preparation. Although the new primitives introduced by ε-TEX and Ω solve certain problems (e.g., bidirectional or, more generally, multidirectional typesetting), they are still unable to tackle other issues, such as the inclusion of audio and/or animation.

Input Formats. For reasons of compatibility, the current input format must be supported. At the same time, the proliferation of XML and its applications makes it more than mandatory to provide support for XML content. Currently, XMLTEX is a TEX format that can be used to typeset validated XML
files. In addition, XLATEX [6] is an effort to reconcile the TEX world with the XML world. In particular, XLATEX is an XML Document Type Definition (DTD) designed to provide an XMLized syntax for LATEX. However, we should learn from the mistakes of the past and make the system quite adaptable. This means that as new document formats emerge, the system should be easily reconfigurable to "comprehend" these new formats.

Output Formats  The pdfLATEX variant has become quite widespread, due to its ability to directly produce output in a very popular document format (namely Adobe's Portable Document Format). Commercial versions of TEX are capable of directly generating PostScript files without the need of any driver programs. However, as in the case of the input formats, it is quite possible that new document formats will appear. Thus, we need to make sure that these document formats will find their way into TEX sooner or later.

In addition, XML initiatives such as MathML and SVG (Scalable Vector Graphics) are increasingly common in electronic publishing of scientific documents (i.e., quite demanding documents from a typographical point of view). Thus, it is absolutely necessary to be able to choose the output format(s) from a reasonable list of options. For example, when one makes a drawing using LATEX's picture environment, it would be quite useful to have SVG output in addition to the "standard" output. Currently, Ω can produce XML content, but it cannot generate PDF files.

Innovative Ideas  The assorted typesetting engines that follow TEX's spirit are not mere extensions of TEX. They have introduced a number of useful features and capabilities. For example, Ω's ΩTPs and its ability to handle Unicode input by default should certainly make their way into a new typesetting engine. In addition, ε-TEX's new conditional primitives are quite useful in macro programming.

Typesetting Algorithms  The paragraph breaking and hyphenation algorithms in TEX make the difference when it comes to typographic quality. Robust and adaptable as they are, these algorithms may still not produce satisfactory results for all possible cases. Thus, it is obvious that we need a mechanism that will adapt the algorithms so they can successfully handle such difficult cases.

Fonts  Typesetting means to put type (i.e., font glyphs) on paper. Currently, only METAFONT fonts and PostScript Type 1 fonts can be used with all the different TEX derivatives. Although Ω is Unicode aware, it still cannot handle TrueType fonts to a satisfactory degree (one has to resort to programs like ttf2tfm in order to make use of these fonts). In addition, for new font formats such as OpenType and SVG fonts there is only experimental support, or none at all. A new typesetting engine should provide font support in the form of plug-ins so that support for new font formats can be easily provided.

Scripting  Scripting is widely accepted as a means of producing a larger software product from smaller components by "gluing" them together. It plays a significant role in producing flexible and open systems. Its realization is made through the so-called "scripting languages", which usually are different from the language used to implement the individual software components.

One could advance the idea that scripting in TEX is possible by using the TEX language itself. This is true to some extent, since TEX works in a form of "interpretive mode" where expressions can be created and evaluated dynamically at runtime—a feature providing the desired flexibility of scripting languages. But TEX itself is a closed system, in that almost everything needs to be programmed within TEX itself. This clearly does not lead to the desired openness.

A next generation typesetting engine should be made of components that can be "glued" together using any popular scripting language. To be able to program in one's language of choice is a highly wanted feature. In fact, we believe it is the only way to attract as many contributors as possible.

Development Method  Those software engineering techniques which have proven successful in the development of real-world applications should form the core of the methodology which will eventually be used for the design and implementation of a next generation typesetting engine. Obviously, generic programming and extreme programming, as well as aspect-oriented programming, should be closely examined in order to devise a suitable development method.

All the features mentioned above, as well as the desired ones, are summarized in Table 1.

6 Validation should be handled by an external utility. After all, there are a number of excellent tools that can accomplish this task, and thus it is too demanding to ask for the incorporation of this feature in a typesetting engine.
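The plug-in style of font support called for above can be sketched in a few lines. The sketch below is our own illustration (names like register_font_plugin are invented and belong to no existing engine), assuming each back-end registers a loader for one font format:

```python
from typing import Callable, Dict

# Illustrative plug-in registry: maps a font-format name to a loader.
# A TrueType or OpenType back-end is added by registration alone;
# the dispatch code below never needs to change.
_FONT_PLUGINS: Dict[str, Callable[[str], dict]] = {}

def register_font_plugin(fmt: str, loader: Callable[[str], dict]) -> None:
    """Make a new font format available to the engine."""
    _FONT_PLUGINS[fmt.lower()] = loader

def load_font(fmt: str, name: str) -> dict:
    """Dispatch to whichever plug-in claims the requested format."""
    if fmt.lower() not in _FONT_PLUGINS:
        raise ValueError("no plug-in registered for font format " + fmt)
    return _FONT_PLUGINS[fmt.lower()](name)

# A stub Type 1 back-end; a real one would parse actual font files.
register_font_plugin("type1", lambda name: {"name": name, "format": "type1"})

print(load_font("Type1", "cmr10"))  # {'name': 'cmr10', 'format': 'type1'}
```

The point of the design is that an unsupported format fails at the registry boundary, not deep inside the kernel, and support for a new format is a registration call rather than a kernel change.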
Preprints for the 2004 Annual Meeting — Digital Typography in the New Millennium: Flexible Documents by a Flexible Engine
Feature                  | TEX          | NTS       | ε-TEX        | Ω            | LATEX(3)    | Desired
-------------------------|--------------|-----------|--------------|--------------|-------------|------------------
implementation language  | traditional  | Java      | traditional  | traditional  | traditional | perhaps scripting
architecture             | monolithic   | modular?  | monolithic   | monolithic   | monolithic  | modular
TEX compatibility        | 100%         | yes       | 100%         | 100%         | 100%        | via module
input transformations    | —            | —         | —            | ΩTPs         | —           | pluggable
Unicode                  | (Babel)      | (Java)    | (Babel)      | true         | (Babel)     | true
XML                      | —            | —         | —            | yes          | via package | yes
typesetting algorithms   | TEX          | TEX-like  | TEX-like     | TEX-like     | TEX-like    | pluggable
scripting language       | TEX          | NTS (?)   | ε-TEX        | Ω            | TEX         | any
output drivers           | dvi (ps,pdf) | dvi (?)   | dvi (ps,pdf) | dvi (ps,pdf) | dvi (ps,pdf)| any
TRIP-compatible          | yes          | almost    | ε-TRIP       | yes          | yes         | yes (via module)
library mode             | no           | no        | no           | no           | no          | yes
daemon (server) mode     | no           | no        | no           | no           | no          | yes
programming community    | < LATEX      | 1 person? | < TEX        | very small   | big         | > LATEX

Table 1: Summary of features of TEX and its extensions.
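Two rows of Table 1, library mode and daemon (server) mode, are absent from every existing engine. Purely as a hypothetical illustration (every name here is invented), library mode would mean the engine is an embeddable object rather than a batch program:

```python
class TypesettingEngine:
    """Hypothetical embeddable facade. A daemon mode would expose the
    same call over a socket; a batch program would wrap it in main()."""

    def typeset(self, source: str) -> bytes:
        # A real engine would parse, break paragraphs and pages, and
        # shape glyphs; here we only tag the input to show the pattern.
        return b"typeset:" + source.encode("utf-8")

# Library mode: no subprocess, no intermediate files on disk.
engine = TypesettingEngine()
print(engine.typeset("Hello"))  # b'typeset:Hello'
```

An editor or web service could then call the engine in-process, instead of shelling out and parsing log files.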
3.2 Architectural Abstractions

Roughly speaking, the Universal Typesetting Engine we are proposing in this paper is a project to design and, later, to implement a new system that will support all the "good features" incorporated in the various TEX derivatives, plus some novel ideas which have not found their way into any existing TEX derivative. Obviously, it is not enough to just propose the general features the new system should have—we need to lay down the concrete design principles that will govern the development of the system. A reasonable way to accomplish this task is to identify the various concepts that are involved. These concepts will make up the upper abstraction layer. By following a top-down analysis we will eventually be in a position to have a complete picture of what is needed in order to proceed with the design of the system.

The next step in the design process is to choose a particular system architecture. TEX and its derivatives are definitely monolithic systems. Other commonly used system architectures include the microkernel and exokernel architectures, both well known from operating system research.

Microkernel Architecture  A microkernel-based design has a number of advantages. First, it is potentially more reliable than a conventional monolithic architecture, as it allows for moving the major part of system functionality to other components, which make use of the microkernel. Second, a microkernel implements a flexible set of primitives, providing a high level of abstraction while imposing little or no limitation on system architecture. Therefore, building a system on top of an existing microkernel is significantly easier than developing it from scratch.

Exokernel Architecture  Exokernels follow a radically different approach. As with microkernels, they take as much out of the kernel as possible, but rather than placing that code into external programs (mostly user-space servers) as microkernels do, they place it into shared libraries that can be directly linked into application code. Exokernels are extremely small, since they arbitrarily limit their functionality to the protection and multiplexing of resources.

Both approaches have their pros and cons. We believe that a mixed approach is the best solution. For example, we can have libraries capable of handling the various font formats (e.g., Type 1, TrueType, OpenType, etc.) that will be utilized by external programs that implement various aspects of the typesetting process (e.g., generation of PostScript or PDF files). Let us now elaborate on the architecture we are proposing. The underlying components are given in Figure 1.

The Typesetting Kernel (TK) is one of the two core components at the first layer. It can be viewed as a "stripped-down" version of TEX, meaning that its role as a piece of software is the orchestration of several typesetting activities. A number of basic algorithms are included in this kernel, both as abstract notions—necessary for a general-purpose typesetting engine—and as concrete implementations. So, TK incorporates the notions of paragraph and page breaking and mathematical typesetting, and is Unicode-aware (utilizing UCS-4 internally). It must be emphasized that TK "knows" the concept of paragraph breaking and the role it plays in typesetting, but it is not bound to a specific paragraph breaking algorithm. The same principle applies to all needed algorithms.
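The claim that TK "knows" paragraph breaking as a concept without being bound to one algorithm is essentially the strategy pattern. A minimal sketch, with our own invented names and a naive first-fit breaker standing in for a real algorithm:

```python
from typing import Callable, List

# A breaking algorithm turns a word list into lines of bounded width.
BreakAlgorithm = Callable[[List[str], int], List[str]]

def greedy_break(words: List[str], width: int) -> List[str]:
    """Naive first-fit line breaking; a Knuth-Plass implementation
    could be plugged in without touching the kernel."""
    lines: List[str] = []
    current = ""
    for word in words:
        candidate = (current + " " + word).strip()
        if len(candidate) <= width:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

class TypesettingKernel:
    """Holds the abstract notion 'paragraph breaking'; the concrete
    algorithm is injected and swappable at runtime."""
    def __init__(self, break_algorithm: BreakAlgorithm):
        self.break_algorithm = break_algorithm

    def typeset_paragraph(self, text: str, width: int) -> List[str]:
        return self.break_algorithm(text.split(), width)

tk = TypesettingKernel(greedy_break)
print(tk.typeset_paragraph("the quick brown fox jumps", 10))
# ['the quick', 'brown fox', 'jumps']
```

Swapping in a different breaker is an assignment to `tk.break_algorithm`, which is exactly the runtime flexibility the text asks for.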
[Figure 1 shows the proposed three-layer architecture. Layer L1 holds the two core components: TK (Typesetting Kernel) and ASK (Active Scripting Kernel), corresponding roughly to TEX, ε-TEX and Ω. Layer L2 holds the typesetting abstractions: Fonts (cf. Type 3, dvips + Type 1), TAs (Typesetting Algorithms), DMs (Document Models, cf. LATEX, XLATEX), SEs (Scripting Engines), HyP (Hyphenation Patterns), and Terms. Layer L3 holds WFs (Workflows, cf. latex && bibtex && latex) and Tools (cf. bibtex, makeindex).]

Figure 1: The proposed microkernel-based layered architecture. The arrows show rough correspondence between the several architectural abstractions and their counterparts in existing monolithic typesetting engines.
The Active Scripting Kernel (ASK) is the second of the core components, and the one that allows scripting at various levels, using a programming (scripting) language of one's choice. It is in essence a standardized way of communicating between several languages (TEX, Perl, Python), achieved by providing a consistent Application Programming Interface (API). The most interesting property of ASK is its activeness. This simply means that any extension programmed in some language is visible to any other available language, as long as they all adhere to the standard Active Scripting Kernel API. For example, an external module/service written in Perl that provides a new page breaking algorithm is not only visible but also available for immediate use from Python, C, etc.

Above TK and ASK, at the second layer, we find a collection of typesetting abstractions.

Fonts are at the heart of any typesetting engine. It is evident that font architectures change with the passing of time, and the only way to allow for flexibility in this part is to be open. Although there are many different font formats, all are used to define glyphs and their properties. So instead of directly supporting all possible font formats, we propose the use of an abstract font format (much as all font editors have their own internal font format). With the use of external libraries that provide access to popular font formats (e.g., a FreeType library, a Type 1 font library, etc.), it should be straightforward to support any existing or future font format.

The various Typesetting Algorithms (TAs)—algorithms that implement a particular typographic feature—should be coded using the Active Scripting Kernel API. In a system providing the high degree of flexibility we are proposing, it will be possible to exhibit, in the same document, the result of applying several paragraph and page breaking algorithms. By simply changing a few runtime parameters it will be possible to produce different typographic "flavors" of the same document.

A Scripting Engine (SE) is the realization of the ASK APIs for a particular scripting language. For reasons of uniformity, the TEX programming language will be provided as a Scripting Engine, along with engines for Perl, Ruby and Python. This will make all the existing TEX codebase available for immediate use, and it will provide for cooperation between existing LATEX packages and future enhancements in other languages. Thus, a level of 100% TEX compatibility will be achieved, merely as a "side-effect" of the provided flexibility.

The idea of a Document Model (DM) concerns two specific points: the document's external representation, as it is "edited" for example in an editor or "saved" on a hard disk, and its internal representation, used by the typesetting engine itself. It is clear that under this distinction, current LATEX documents follow the (fictional) "LATEX Document Model", XLATEX documents follow the "XLATEX Document Model", and an XML document with its corresponding DTD follows an analogous "XML+DTD Document Model".

We strongly believe that how a document is written should be separated from its processing. For the latter, an internal representation like the Abstract Syntax Trees (ASTs) used in compiler technology is highly beneficial. One way to think of the DM is as the typographic equivalent of the Document Object Model (DOM). That is, it will be a platform-neutral and language-neutral representation allowing scripts to dynamically access and update the content, structure and style of documents.
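The Document Model's internal representation can be pictured as a small AST with DOM-style traversal. The node layout below is our own guess at what such a representation might minimally contain; none of these names come from the proposal itself:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One node of a hypothetical internal document representation."""
    kind: str                      # e.g. "document", "section", "paragraph"
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def find_all(self, kind: str) -> List["Node"]:
        """DOM-like query: lets scripts access and update structure."""
        hits = [self] if self.kind == kind else []
        for child in self.children:
            hits.extend(child.find_all(kind))
        return hits

# Separate writers could serialize the same tree to LATEX or XML,
# keeping how a document is written apart from how it is processed.
doc = Node("document", children=[
    Node("section", "Intro", [Node("paragraph", "Hello world")]),
    Node("section", "Results", [Node("paragraph", "More text")]),
])
print([s.text for s in doc.find_all("section")])  # ['Intro', 'Results']
```

The separation of concerns is the point: the tree knows nothing about LATEX or XML syntax, so external representations can come and go without touching the processing code.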
Several Document Processors (DPs) may be applied to a specific document before actual typesetting takes place. DPs are the analog of ΩTPs. By leveraging the scripting power of ASK, the representational expressiveness of DPs is increased—as opposed to algorithmic expressiveness (Turing-completeness), which is evident, e.g., in Ω, but is not the sole issue.

The Workflows (WFs) and Tools are at the highest architectural layer. Currently, there are a number of tools that may not produce a final typeset result, but are important for the proper preparation of a document. For example, such tools include bibliography, index and glossary generation tools. In the proposed architecture, all these programs will take advantage of other architectural abstractions—such as the Document Model or the Scripting Engines—in order to be more closely integrated into the typesetting engine as a whole.

Of particular importance is the introduction of the Workflows notion. A workflow is closely related to the operation or, to be more precise, the cooperation of several tools and the typesetting engine in the course of producing a typeset document. In effect, a workflow specifies the series of (possibly conditional) execution steps and the respective inputs/outputs during the "preparation" of a document. By introducing a workflow specification for each tool, we relieve the user from manually specifying all the actions necessary to get a "final" .pdf (or whatever output format has been requested). Instead, the user will declaratively specify that the services of a tool are needed, and the engine will load the respective workflows, compose them and execute them.

We shall give a workflow example concerning a bibtex-like tool. What we do here is transform our experience of using bibtex into declarations specifying its behaviour in cooperation with latex:

WORKFLOW DEFINITION bibtex
  SERVICE bibtex NEEDS latex
  SERVICE bibtex INTRODUCES latex

In effect, this translates a hypothetical Makefile:

all:
	latex mydoc
	bibtex mydoc
	latex mydoc

for the preparation of the fictitious mydoc.tex document into a declarative specification that is given only once as part of the bibtex tool!

3.3 On Design and Evolution

Recent advances in software engineering advocate the use of multidimensional separation of concerns as a guiding design principle. Different concerns should be handled in different parts of the code and ideally should be kept separate. For example, the representation of a document and its processing are two separate concerns and should be treated as such. Their interaction is better specified outside their individual specifications. Thus, we have introduced the Document Models notion to cope with the existing TEX/LATEX base as well as any future document representation.

Several architectural abstractions of Figure 1 are candidates to be specified as "services" at different granularities. For example, any Tool of the third layer can be thought of as a service that is registered with a naming authority and discovered dynamically, for immediate use on demand. A TrueType Font Service, regarding the second-layer Font abstraction, is another example, this time of a more fine-grained nature, in the sense that a Tool (coarse-grained service) utilizes a Font (fine-grained service).

The proposed architecture makes special provisions for evolution by keeping rigid design decisions to a minimum. Built-in Unicode awareness is one such notable rigid design decision, but we feel that its incorporation is mandatory. Besides that, the ideas of pluggable algorithms and scripting are ubiquitous and help maintain the desired high degree of flexibility.

At the programming level, any style of design and development that promotes evolution can be applied. In the previous section we actually demonstrated that the proposed architecture can even handle unanticipated evolution at the workflow level: the bibtex tool workflow specification causes the execution of an existing tool (latex), but we have neither altered any workflow for latex, nor does latex need to know that "something new" is introduced using it. In effect, we have (the use of the keyword INTRODUCE was deliberate) a new aspect [8].

4 Conclusions and Future Work

In this paper we have reviewed the most widespread modern approaches to extending TEX, THE typesetting engine. After analyzing the weaknesses of these approaches and the existing support for several features, we have presented our views on the architecture of an open and flexible typesetting engine.

We have laid down the basic architectural abstractions and discussed their need and purpose.
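The NEEDS/INTRODUCES declarations shown earlier suggest a very small composition engine. The expansion logic below is our reading of the example, not the authors' specification: expanding a single request for bibtex should reproduce the latex, bibtex, latex sequence of the Makefile.

```python
# tool name -> (needs: run before the tool, introduces: re-run after it)
WORKFLOWS = {
    "latex":  ([], []),
    "bibtex": (["latex"], ["latex"]),
}

def expand(tool: str) -> list:
    """Expand one declarative service request into ordered steps."""
    needs, introduces = WORKFLOWS[tool]
    steps = []
    for dep in needs:
        steps.extend(expand(dep))      # run prerequisites first
    steps.append(tool)
    for dep in introduces:
        steps.extend(expand(dep))      # re-run whatever the tool affects
    return steps

print(expand("bibtex"))  # ['latex', 'bibtex', 'latex']
```

The declaration is written once, with the tool; every user request is then expanded automatically, which is exactly the relief from manual Makefile writing the text describes.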
Of course, the work is still at the beginning stages, and we are now working on refining the ideas and evaluating design and implementation approaches.

The introduction of the Active Scripting Kernel is of prime importance, and there is ongoing work to completely specify a) the form of a standard procedural API and b) support for other programming styles, including object-oriented and functional programming. This way, an object may, for example, take advantage of an algorithm that is better described in a functional form. There are parallel plans for transforming TEX into a Scripting Engine and at the same time providing Engines powered by Perl and Python.

We are also investigating the application of the workflow approach in several parts of the architecture other than the interaction among tools. This, in turn, may raise the need for the incorporation of a Workflow Kernel at the core layer, along with the Typesetting Kernel and the Active Scripting Kernel.

References

[1] chromatic. Extreme Programming Pocket Guide. O'Reilly & Associates, Sebastopol, CA, USA, 2003.
[2] Krzysztof Czarnecki and Ulrich Eisenecker. Generative Programming: Methods, Tools, and Applications. Addison-Wesley, Reading, MA, USA, 2002.
[3] Victor Eijkhout. TEX by Topic. http://www.cs.utk.edu/~eijkhout/tbt.
[4] ε-Ω Project home page. http://www.ctan.org/tex-archive/systems/eomega/.
[5] NTS FAQ. http://www.ctan.org/tex-archive/info/NTS-FAQ.
[6] Yannis Haralambous and John Plaice. Omega, OpenType and the XML World. The 24th Annual Meeting and Conference of the TEX Users Group, TUG 2003.
[7] Yannis Haralambous and John Plaice. Traitement automatique des langues et compositions sous Omega. Cahiers GUTenberg, pages 139–166, 2001.
[8] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-Oriented Programming. In M. Aksit and S. Matsuoka, editors, ECOOP '97—Object-Oriented Programming: 11th European Conference, Jyväskylä, Finland, June 1997, Proceedings, number 1241 in Lecture Notes in Computer Science, pages 220–242. Springer-Verlag, Berlin, 1997.
[9] Donald Ervin Knuth. The TEXbook. Addison-Wesley, 1984.
[10] Leslie Lamport. LATEX: A Document Preparation System. Addison-Wesley, Reading, MA, USA, 2nd edition, 1994.
[11] LATEX3 Project home page. http://www.latex-project.org/latex3.html.
[12] Richard Palais. Position Paper on the future of TEX. http://www.loria.fr/services/tex/moteurs/nts-9207.dvi, reached from http://tex.loria.fr/english/moteurs.html, October 1992.
[13] PRAGMA Advanced Document Engineering. ConTEXt home page. http://www.pragma-ade.com/.
[14] NTS Project home page. http://www.dante.de/projects/nts/.
[15] Eric S. Raymond. The Cathedral and the Bazaar. http://www.catb.org/~esr/writings/cathedral-bazaar/.
[16] Apostolos Syropoulos, Antonis Tsolomitis, and Nick Sofroniou. Digital Typography Using LATEX. Springer-Verlag, New York, NY, USA, 2003.
[17] NTS Team and Peter Breitenlohner. The ε-TEX manual, Version 2. MAPS, (20):248–263, 1998.
[18] Hàn Thế Thành, Sebastian Rahtz, and Hans Hagen. The pdfTEX users manual. MAPS, (22):94–114, 1999.
[19] Jiří Zlatuška. NTS: Programming Languages and Paradigms. EuroTEX 1999, http://www.uni-giessen.de/partosch/eurotex99/zlatuska.pdf.
Migrating to XML: The Case of the GUST Bulletin Archive
Włodzimierz Bzyl, Instytut Matematyki, Uniwersytet Gdański, ul. Wita Stwosza 57, 80-952 Gdańsk, Poland. [email protected]
Tomasz Przechlewski, Wydział Zarządzania, Uniwersytet Gdański, ul. Armii Krajowej 119/121, 81-824 Sopot, Poland. tomasz@gnu.univ.gda.pl
Abstract

Ten years of experience with TEX publishing of the GUST bulletin shows that Knuth's dream of highly portable TEX files is apparently an illusion in practice. Over the last decade, articles in the GUST bulletin have used at least six major formats (LATEX 2.09, transitional LATEX+NFSS, LATEX 2ε, plain-based TUGboat, Eplain, and ConTEXt), numerous macro packages, fonts, and graphic formats. Many old articles are typeset differently nowadays, and some even cause TEX errors. This situation motivates the following question: how do we avoid the same problem in the future? As the World Wide Web is quickly becoming the mainstream both of publishing and of information exchange, we argue for migration to XML, a Web-compatible data format. In the paper we examine a possible strategy for storing GUST articles in a custom XML format and publishing them with both TEX and XSLT/FO. Finally, the problems of converting the TEX files to XML and the possibility of using TEX4ht, an authoring system for producing hypertext, are discussed.
1 Introduction

The dominant role played by the Web in information exchange in modern times has motivated publishers to make printed documents widely available on the Internet. It is now common that many publications are available on the Web only, or before they are printed on paper. Articles published in the GUST bulletin are available on the Web in PostScript and PDF. Unfortunately, these formats decrease document accessibility, searching and indexing by Web search engines. For broad accessibility to automated services, it is better to use XML as the format of such data. However, one issue with XML is that it is difficult to maintain the high-quality presentation of TEX documents. This is caused by incompatibilities between browsers and incomplete or immature implementations of W3C Consortium standards.

We are optimistic that these issues will disappear in the near future, and believe that XML will become pervasive in the online environment. However, in our context, a key to the adoption of XML is the degree to which it can be integrated with existing TEXnologies.

In this paper we examine one strategy for storing GUST articles in a custom XML format and publishing them with both TEX and XSLT/FO. Also, the problems of converting the existing TEX files to XML and the possibility of using TEX4ht—an authoring system for producing hypertext—are discussed.

2 TEX/LATEX and Other Document Formats

When the authors started work with TEX (many years ago), there was only a choice between closed-source applications based on proprietary formats, or
TEX, for publishing significant documents. Nowadays, the choice is much wider, as XML-based solutions are based on open standards and supported by a huge number of free applications. We do not need to write the tools ourselves. Thus the strategy of reusing what is publicly available is key in our migration plan.

On the other hand, it would be unwise to switch to XML as the only acceptable submission format, because it would force many authors to abandon the powerful TEX-based editing environments to which they are accustomed, just to submit texts to our bulletin. Following this route, we would more likely end up with a shortage of submissions. Thus, we are preparing a mixed strategy with both TEX and XML as accepted formats. Papers submitted in LATEX will ultimately be converted to XML as an archival and retrieval format. Presentation formats will be XHTML, with corresponding PDF generated by a variety of tools. The work-flow of documents in this proposed framework is depicted in Fig. 1.

[Figure 1 shows the processing diagram: an XML document (validated against a DTD) or a LATEX document is fed to pdfTeX, tex4ht, and XSLT stylesheets (e.g., db2latex), producing PDF, HTML, and DocBook outputs.]

Figure 1: Processing diagram for XML/LATEX documents.

The XML implementation project described in the paper can be broadly divided into the following subtasks: DTD development, formatting development, and legacy information conversion [19]. We'll now describe these stages in detail.

3 DTD Development Considerations

There is no doubt (see for example [14, 19]) that the DTD development phase is of critical importance for the overall success of any SGML/XML project. Fortunately, thanks to the great interest in XML technology in recent years, there are several production-quality publicly available DTDs which could be adapted for our project. To make this choice, we preferred those which are widely used and for which formatting applications and tools are available. The following possible schemes were considered:

• DocBook [21], a complete publishing framework, i.e., schemes plus XSLT/DSSSL stylesheets for conversion to presentation formats; actively developed and maintained; the de facto standard of many Open Source projects; widely known and used.
• TEI [1], another complete, actively developed publishing framework. Not as popular as DocBook; used mainly in academia.
• The LATEX-based DTD developed in [7] (further referred to as the LWC DTD). The similarity to the structure of LATEX is an advantage of this DTD for our project.
• Others, such as the DTD for GCA/Extreme conferences, X-DiMi from the Electronic Thesis and Dissertations Project, and the LATEX-like PMLA developed by one of the authors [15].

Industry-standard DTDs tend to be too big, too complex, and too general for practical use in specific cases (cf. [14, p. 29]). In particular, the DocBook and TEI DTDs seem to be too complex for marking up documents conforming to the LATEX format.

As a result, users frequently employ the technique of using different DTDs at different stages of the editorial process. Following Maler [14], we will call the DTD common to a group of users within an interest group a reference DTD, while those used solely for editing purposes are authoring DTDs. Translation from one DTD to another may easily be performed with an XSLT stylesheet.

We decided to use a simplified LWC DTD as the authoring DTD and DocBook as the reference DTD. Providing a simple DTD should expand the group of prospective authors. For example, many GUST members are experts in typography or Web design but not necessarily TEX hackers.

The simplification consists of restricting the document hierarchy to article-like documents only, and removing back-matter tags (index, glossary) and all presentation tags (newline, hspace, etc.). Also, the optional status of meta-data elements, for example the title, abstract, and keywords tags, was changed to required. The resulting DTD contains 45 elements, compared to 64 in the original one.

For better maintainability, we rewrote our version of the LWC DTD into RNC syntax. The RNC schema was introduced by Clark [6], and recently adopted as an ISO standard. It has many advantages
over DTD or W3C Schema syntax, namely simplicity and an included documentation facility.1

As the structure of our documents is not particularly complex, it may be feasible to develop several authoring DTDs targeted at different groups of authors, for example one for those preferring ConTEXt-like documents, another for those used to GCA conference markup, etc., and then map those documents to the reference DTD with XSLT.

4 Formatting with XSLT

For presentation, LWC documents are first transformed to DocBook with a simple XSLT stylesheet. The DocBook XSL stylesheets [22] translate an XML document to XHTML or FO [18]. As they are written in a modular fashion, they can easily be customized and localized. To publish XHTML from XML documents, an XSLT engine is needed, such as Kay's saxon [11] or Veillard's xsltproc [20].2

For hard-copy output, a two-step process is used. First, the XSLT engine produces formatting objects (FO), which then must be processed with a formatting-objects processor for PDF output.3 The detailed transformation work-flow is depicted in Fig. 2.

[Figure 2 shows the processing details: an LWC document is passed through XSLT stylesheets (dblatex.xsl, html/docbook.xsl, fo/docbook.xsl, etc.) by an XSLT processor, yielding DocBook, HTML, FO, and, via latex, PDF output.]

Figure 2: Processing details of LWC documents with XSLT/FO.

With just a few customizations, the translation from XML to XHTML presents no obstacles (except for math formulas). On the other hand, the quality of the PDF produced with the publicly available fop processor from the Apache project is poor compared to that obtained with TEX.

Instead of generating FO objects, one can use XSLT to translate XML directly to high-level LATEX. That is the method used in db2latex [3] (see also a clone project: dblatex/dbcontext [9]; the latter, of course, generates files processable by ConTEXt). The output can be customized at the XSLT stylesheet level as well as by redefining appropriate LATEX style files. MathML markup is translated with XSLT to LATEX and supported natively. The translation from DocBook to LATEX implemented in these tools is incomplete. To get reasonable results, prior customization to local needs is required. The main advantage of this approach is that we use TEX—a robust and well-known application.

5 The GUST Bulletin Archive

When considering the conversion of the GUST archive to XML we have two points in mind: first, we recognize the long-term benefits of an electronic archive of uniformly and generically marked-up documents; and second, we take the opportunity to test the whole framework using 'real' data.

During the last 10 years, 20 volumes of the GUST bulletin were published, containing more than 200 papers. From the very beginning GUST was tagged in a modified TUGboat style [2]. The total output is not particularly impressive, but the conversion of all articles to XML isn't a simple one-night job for a bored TEX hacker:

• they were produced over an entire decade and were written by over 100 different authors.
• they were processed with at least six major document formats (LATEX 2.09, transitional LATEX+NFSS, LATEX 2ε, plain-based TUGboat, Eplain, and finally ConTEXt), using numerous macro packages, fonts, and graphic formats.4

As a group, the GUST authors are not amateurs producing naïve TEX code. On the contrary, they are TEX experts, writing on a diverse range of subjects using non-standard fonts, packages and macros. For example, Fig. 3 shows the detailed distribution of the TEX formats used in GUST.

In total, there were 134 plain TEX articles, compared to 87 for LATEX. LATEX authors used 74 different packages, while those preferring plain TEXnology used 139 different style files. The proportion of other formats (Eplain, ConTEXt, BLUE) was insignificant (only a few submissions). It can also be noted from Fig. 3 that in recent years, the proportion of plain TEX submissions has decreased substantially in favor of LATEX.

It is obviously very difficult to maintain a repository containing papers requiring such a diverse

1 It is possible to convert between different schema languages for XML with the trang program [5]. There is also an nxml-mode for GNU Emacs for editing XML, which features highlighting, indentation, and on-the-fly validation against an RNC schema [4].
2 Modern browsers have XSLT engines built in. So it suffices to attach appropriate stylesheets to a document to make the transformation on the fly.
3 One approach which we did not try is to format FO files with TEX. This method is implemented by S. Rahtz's Passive TEX [17].
4 Needless to say, all of these packages have been evolving during the last 10 years, many of them becoming incompatible with each other.
For example, the TUGboat format requires only the metadata elements title and author name(s); author address(es) and webpage(s) of the papers' author(s) are often absent, and there is no obligation for abstracts and keywords. Therefore, most of the 25 GUST articles lack these valuable elements. Moreover, bibliographies are inconsistently encoded.7 Having said that, our plan is to mark up as many elements as possible.

…range of external resources (macros, fonts). As a result, many old papers can no longer be typeset, owing to changes in the underlying macros or fonts.

Figure 3: Distribution of TEX formats (TUGboat, plain, LaTeX, other) used by GUST authors, by year (1993-2003).

6 Conversion from TEX to XML

It may be surprising that only a few papers report successful conversion from TEX to XML: Grimm [8] describes a successful large-scale conversion in a large academic institution, while Rahtz [16] and Key [12] describe translation to SGML at Elsevier.

Basically, when converting TEX to XML the following three approaches have been adopted [16]:
• Perl/Python hackery combined with manual retyping and/or manual XML marking-up.
• Parsing TEX source files not with tex, but with another program which generates SGML/XML. This is the approach used by ltx2x [23], tralics [8] and latex2html,8 which replace LaTeX commands in a document by user-defined strings.
• Processing files with TEX and post-processing the DVI files to produce XML. This is the way tex4ht works.

Although the conversion performed with tralics is impressive, we found the application very poorly documented. After evaluating the available tools and consulting the literature [7], we decided to use TEX4ht, a TEX-based authoring system for producing hypertext [10].

7 Translating TEX to XML with TEX4ht

Out of the box, the TEX4ht system is configured to translate from the plain, LaTeX (ltugboat, ltugproc), and Lecture Notes in Computer Science (llncs) formats to HTML, XHTML, DocBook, or TEI. To translate from, say, TUGboat to our custom XML format, the system needs to be manually configured. Because configuring TEX4ht from scratch is a non-trivial task, we considered other, more efficient possibilities.

The TEX4ht system consists of three parts: (1) style files which enhance existing macros with features for the output format (HTML, XHTML, etc.); (2) the tex4ht processor, which extracts HTML, XHTML, DocBook, or TEI files from the DVI files produced by tex; (3) the t4ht processor, which is responsible for translating DVI code fragments which need to be converted to pictures; for this task the processor uses tools available on the current platform.

Because TEX formats contain many visually-oriented tags, we could not expect a fully automatic conversion.6 As mentioned above, the conversion from a visual format to an information-oriented one cannot be done automatically. Let us illustrate this statement with the following example, marked up with plain TEX macros:9

    \noindent {\bf exercise, left as an}
    {\it adj\/} {\ss Tech} Used to complete
    a proof when one doesn't mind a
    {\bf handwave}, or to avoid one entirely.
    The complete phrase is: {\it The proof
    \rm(or \it the rest\/\rm) \it is left as an
    exercise for the reader.\/} This comment
    has occasionally been attached to unsolved
    research problems by authors possessed of
    either an evil sense of humor or a vast
    faith in the capabilities of their
    audiences.\hangindent=1em

After translation of this fragment to XHTML by tex4ht we obtain:

5 Altogether over 2.5M lines of TEX code. Compare this with the 1M lines of code of the LaTeX base macros.
6 For example, see [16, 8]. Other examples, based on GUST articles, are presented below.
7 Publicly available tools (see [13] for example) can automatically mark up manually keyed bibliographies.
8 latex2html was not considered as its output is limited to HTML.
9 The text comes from "The Project Gutenberg Etext of The Jargon File", Version 4.0.0.
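As discussed in the surrounding text, tex4ht marks font changes with XHTML 'span' elements whose class names follow the fonts in use (cmti-10, etc.). A hedged sketch of the kind of markup produced for the beginning of the jargon-file fragment (the exact class names and grouping that tex4ht emits may differ):

```html
<p class="noindent">
  <span class="cmbx-10">exercise, left as an</span>
  <span class="cmti-10">adj</span>
  <span class="cmss-10">Tech</span>
  Used to complete a proof when one doesn't mind a
  <span class="cmbx-10">handwave</span>, or to avoid one entirely.
  ...
</p>
```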
Preprints for the 2004 Annual Meeting
Migrating to XML: The Case of the GUST Bulletin Archive
and this could be rendered by a browser as:

    exercise, left as an adj Tech Used to complete a proof when one
    doesn't mind a handwave, or to avoid one entirely. The complete
    phrase is: The proof (or the rest) is left as an exercise for the
    reader. This comment has occasionally been attached to unsolved
    research problems by authors possessed of either an evil sense of
    humor or a vast faith in the capabilities of their audiences.

We can see that tex4ht uses 'span' elements to mark up font changes. These visual tags could easily be remapped to logical ones, unless fragments of text with different meanings are marked with the same tag. Here the tag cmti-10 was used to tag both the short form 'adj' and the example phrase (shown in the green italic font). To tag them differently we need different TEX macros, specially configured for TEX4ht. Note that the \hangindent=1em was ignored by tex4ht. This command could not be simulated, because hanging indentation is not supported by browsers.

So, the markup produced by the tex4ht program is not logical markup. To get logical markup, the GUST format would have to be reworked and reconfigured for TEX4ht.

Instead of configuring TEX4ht we could use an XSLT stylesheet to remap elements to the target XML format. This could be an easier route than configuring the system from scratch, while some TEX4ht configuration could also help. So, a combination of the two methods is envisaged to provide the best results.

8 Conclusion and Future Work

We have not completed the conversion yet. However, based on the experience gained so far, we estimate that almost 70% of the whole archive should be convertible with little manual intervention. Semi-automatic conversion of another 15% (34 papers) is possible, after prior extensive changes in markup. Conversion of the remaining 15% is impossible or useless, where 'impossible' means the paper is easier to retype than to recompile after adjusting tex4ht just for one particular case, and 'useless' applies to papers relying on layouts or advanced typesetting capabilities of TEX.

Although our system needs improvement (conversion of math is the most important remaining item to investigate), we are determined to start using it in a production environment.

Finally, we note that many of our conclusions and methods are also applicable to TUGboat, because the format used for typesetting the GUST bulletin differs only slightly from the one used for TUGboat.

References

[1] Lou Burnard and C. M. Sperberg-McQueen. TEI Lite: An introduction to text encoding for interchange. http://www.tei-c.org/Lite/, 2002.
[2] Włodek Bzyl and Tomasz Przechlewski. An application of literate programming: creating a format for the Bulletin of the Polish TUG. TUGboat, 14(3):296-299, October 1993.
[3] Ramon Casellas and James Devenish. DB2LaTeX XSL stylesheets. http://db2latex.sourceforge.net, 2004.
[4] James Clark. NXML mode for the GNU Emacs editor. http://www.thaiopensource.com/download, 2003.
[5] James Clark. Trang: multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html, 2003.
[6] James Clark and Makoto Murata. RELAX NG specification. http://www.relaxng.org/, 2001.
[7] Michel Goossens and Sebastian Rahtz. The LaTeX Web Companion. Addison-Wesley, 2001.
[8] José Grimm. Tralics. In EuroTEX Preprints, pages 38-49, 2003. http://www-sop.inria.fr/miaou/tralics. Final version to appear in TUGboat.
[9] Benoît Guillon. DocBook to LaTeX/ConTeXt publishing. http://dblatex.sourceforge.net, 2004.
[10] Eitan Gurari. TeX4ht: LaTeX and TEX for hypertext. http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html, 2004.
[11] Michael Kay. SAXON: the XSLT and XQuery processor. http://saxon.sourceforge.net, 2003.
[12] Martin Key. Theory into practice: working with SGML, PDF and LaTeX. Baskerville, 5(2), 1995. ftp://tug.ctan.org/pub/tex/usergrps/uktug/baskervi/5_2/.
[13] Language Technology Group. LT TTT version 1.0. http://www.ltg.ed.ac.uk/software/ttt, 1999.
[14] Eve Maler and Jeanne El Andaloussi. Developing SGML DTDs: From Text to Model to Markup. Prentice Hall PTR, 1995.
[15] Tomasz Przechlewski. Definicja dokumentu typu PMLA. http://gnu.univ.gda.pl/~tomasz/sgml/pmla/, 2002.
[16] Sebastian Rahtz. Another look at LaTeX to SGML conversion. TUGboat, 16(3):315-324, September 1995. http://www.tug.org/TUGboat/Articles/tb16-3/tb48raht.pdf.
[17] Sebastian Rahtz. PassiveTeX. http://www.tei-c.org.uk/Software/passivetex, 2003.
[18] Robert Stayton. Using the DocBook XSL Stylesheets. Sagehill Enterprises, http://www.sagehill.net/docbookxsl/index.html, 2003.
[19] Brian E. Travis and Dale C. Waldt. The SGML Implementation Guide: A Blueprint for SGML Migration. Springer-Verlag, 1996.
[20] Daniel Veillard. LIBXSLT: the XSLT C library for Gnome. http://xmlsoft.org/XSLT, 2003.
[21] Norman Walsh and Leonard Muellner. DocBook: The Definitive Guide. O'Reilly, 1999. http://www.docbook.org/tdg/en/html/docbook.html.
[22] Wiki. DocBook XSL Stylesheets. http://docbook.org/wiki/moin.cgi/DocBookXslStylesheets, 2004.
[23] Peter R. Wilson. LTX2X: A LaTeX to X Auto-tagger. http://www.ctan.org/tex-archive/support/ltx2x, 1999.

Włodzimierz Bzyl and Tomasz Przechlewski
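The XSLT remapping route mentioned above, replacing tex4ht's font-based span elements with logical elements, might start from an identity transform with a few overriding templates. A hedged sketch: the class names cmbx-10 and cmti-10 come from the discussion above, but the target element names (term, emphasis) are illustrative, not the project's actual vocabulary.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- remap font-change spans to logical elements -->
  <xsl:template match="span[@class='cmbx-10']">
    <term><xsl:apply-templates/></term>
  </xsl:template>
  <xsl:template match="span[@class='cmti-10']">
    <emphasis><xsl:apply-templates/></emphasis>
  </xsl:template>
  <!-- identity transform: copy everything else unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

As noted in the text, this only works when distinct meanings get distinct tags; where cmti-10 marks both the short-form label and example phrases, no stylesheet can tell them apart.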
Managing TEX Resources with XML Topic Maps
Tomasz Przechlewski
Uniwersytet Gdański, Wydział Zarządzania
ul. Armii Krajowej 119/121
81-824 Sopot, Poland
[email protected]
Abstract
For many years the Polish TEX Users Group newsletter has been published online on the GUST web site. The repository now contains valuable information on TEX, METAFONT, electronic documents, computer graphics and related subjects. However, access to the content is very poor: it is available as PS/PDF files with only a simple HTML page facilitating navigation. There is no integration with information resources from other sections of the site, nor with the resources of other LUG or CTAN sites.

Topic maps were initially developed for the efficient preparation of indices, glossaries and thesauri for electronic document repositories, and are now codified both as an ISO standard (ISO/IEC 13250) and as the XTM 1.0 standard. Their applications extend to the domain of electronic publishing. Topic maps and the similar RDF standard are considered to be the backbone of corporate knowledge management systems and/or the Semantic Web [3].

The paper contains an introduction to the Topic Maps standard and discusses selected problems of topic map construction. Finally, the application of Topic Maps as an interface to a repository of TEX-related resources is presented, as well as the successes and challenges encountered in the implementation.
1 Introduction

All the papers published during the last 10 years in the bulletin of the Polish TEX Users' Group (GUST, http://www.gust.org.pl/) are now available online from the GUST web site. The repository contains valuable information on TEX, METAFONT, electronic documents, computer graphics, typography and related subjects. However, access to the content itself is very poor: the papers are available as PS/PDF files with only a simple HTML interface facilitating navigation. There is no integration with other resources from that site. As CTAN and other LUGs' sites provide more resources, it would obviously be valuable to integrate them too.

At first glance, the Topic Maps framework appears to be an attractive way to integrate vast amounts of dispersed TEX-related resources. A primary goal of the proposed interface should be to support learning. If the project succeeds, we hope it will slightly change the opinion of TEX as a very difficult subject to become acquainted with.

The paper is organized as follows. The standard is introduced and selected problems of topic map construction are discussed in the subsequent three sections. Then a short comparison of Topic Maps and RDF is presented. The application of Topic Maps as an interface to the GUST resource repository is described in the last two sections.

2 What is a Topic Map?

Topic Maps are an SGML/HyTime-based ISO standard defined in [1] (ISO/IEC 13250, often referred to as HYTM). The standard was recently rewritten by an independent consortium, TopicMaps.org [19], and renamed to XML Topic Maps (XTM). XTM was developed in order to simplify the ISO specification and enable its usage on the Web through XML syntax. Also, the original linking scheme was replaced by XLink/XPointer syntax. XTM was recently incorporated as an Annex to [1].

The standard enumerates the following possible applications of TMs [1] (examples of applications of Topic Maps to real-world problems can be found in [9, 21, 5, 18, 11, 12]):

• To qualify the content and/or data contained in information objects as topics, to enable navigational tools such as indexes, cross-references, citation systems, or glossaries.
• To link topics together in such a way as to enable navigation between them.
• To filter an information set to create views adapted to specific users or purposes. For example, such filtering can aid in the management of multilingual documents, management of access modes depending on security criteria, delivery of partial views depending on user profiles and/or knowledge domains, etc.
• To add structure to unstructured information objects, or to facilitate the creation of topic-oriented user interfaces that provide the effect of merging unstructured information bases with structured ones.

In short, a topic map is a model of knowledge representation based on three key notions: topics, which represent subjects; occurrences of topics, which are links to related resources; and associations (relations) among topics.

A topic represents, within an application context, any clearly identified and unambiguous subject or concept from the real world: a person, an idea, an object, etc. A topic is an instance of a topic type. Topic types can be structured as hierarchies organized by superclass-subclass relationships.

A topic map is independent of the information it describes: a concrete topic map can describe more than one information pool, and vice versa. Each association can have an association type, which is also a topic. There are no constraints on how many topics can be related by one association. Topics can play specific roles in associations, described with association role types, which are also topics.

The concepts described above are shown in Fig. 1. Topics are represented as small ovals or circles in the upper half of the picture, while the large oval at the bottom indicates the data layer. Small objects of different shapes contained in the data layer represent resources of different types. The lines between the data layer and the topics represent occurrences, while the thick dashed ones between topics depict associations.
The standard does not provide any predefined semantics for the classes. Finally, a topic and its topic type form a class-instance relationship.

Topics have three kinds of characteristics: names (none, one, or more), occurrences, and roles in associations. The links between topics and their related information (a web page, a picture, etc.) are defined by occurrences. The linked resources are usually located outside the map. XTM uses a simple link mechanism as defined in XLink, similar to HTML hyperlinks. As with topics, occurrences can be typed; occurrence types are often referred to as occurrence roles. Occurrence types are also defined as topics.

Figure 1: Topic map and resource layer.

Besides the three fundamental notions mentioned above, the standard defines further constructs, such as subject identity and scope, discussed in the following sections. Using XML syntax, the definition of a topic is quite simple:
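A topic definition in XTM syntax might look like the following hedged sketch (the id values, names, referenced types and URL are illustrative, not taken from the GUST map):

```xml
<topic id="przechlewski-t">
  <instanceOf>
    <topicRef xlink:href="#author"/>
  </instanceOf>
  <baseName>
    <baseNameString>Przechlewski, Tomasz</baseNameString>
  </baseName>
  <occurrence>
    <instanceOf>
      <topicRef xlink:href="#homepage"/>
    </instanceOf>
    <resourceRef xlink:href="http://example.org/~tomasz/"/>
  </occurrence>
</topic>
```

The instanceOf element ties the topic to its topic type, while the occurrence links it, via an XLink href, to a resource outside the map.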
3 Subject Identity and Map Merging

From the above short tour of TM concepts it should be clear that there is an exact one-to-one correspondence between subjects and topics. Thus, the identification of subjects is crucial to individual topic map applications and to interoperability between different topic maps.

The simplest and most popular way of identifying subjects is via some system of unique labels (usually URIs). A subject identifier is simply a URI unambiguously identifying the subject. If the subject identifier points to a resource (which is not required), the resource is called a subject indicator. The subject indicator should contain human-readable documentation describing the non-addressable subject [22].

As there are no restrictions to prevent every map author from defining their own subject identifiers and resource indicators, there is a possibility that semantic or syntactic overlap will occur. To overcome this, published subject indicators (PSIs) have been proposed [17]. PSIs are stable and reliable indicators published by an institution or organization which desires to promote a specific standard. Anyone can publish PSIs and there is no registration authority. The adoption of PSIs can therefore be an open and spontaneous process [17, 6].2

2 For example, the XTM 1.0 specification contains a set of PSIs for core concepts, such as class, instance, etc., as well as for the identification of countries and languages [19].

Subject identity is of primary importance for topic map merging, when there is a need to recognize which topics describe the same subject. Two topics and their characteristics can be merged (aggregated) if the topics share the same name in the same scope (name-based merging), or if they refer to the same subject indicator (subject-based merging). Merging results in a single topic that has the union of all characteristics of the merged topics. Merged topics play the roles in all the associations that the individual topics played before [22, 15].

4 Constraining, Querying, and Navigating the Map

The notion of a topic map template is used frequently in the literature. As the name suggests, a topic map template is a sort of schema imposing constraints on topic map objects with TM syntax. The standard does not provide any means by which the designer of the TM template can put constraints on the topic map itself. Standardisation of such constraints is currently in progress [14].

Displaying lists of indexes which the user can easily navigate is the standard way of visualizing a topic map. As this approach does not scale well for larger maps, augmenting navigation with some sort of searching facility is recommended. Other visualization techniques such as hyperbolic trees [15], cone trees, and hypergraph views (Fig. 2) can be used for visualization and navigation of topic maps. They display TMs as a graph, with the topics and occurrences as nodes and the associations as arcs. The drawback of such 'advanced' techniques is that users are usually unfamiliar with them.

Figure 2: Hypergraph visualization with TMNav.

There are several proposed query languages for topic maps. None of them is part of the standard, and there are inconsistencies between different TM engines. Two of the most prominent proposals are:
• TMQL (Topic Maps Query Language, [9]), with SQL-like syntax, provides both for querying and modifying topic maps (select, insert, delete, update).
• tolog, inspired by the logic programming language Prolog, supports the requirements for TMQL with a clearer and simpler syntax.

The introduction to the TM standard presented in this paper does not cover all the details of the technology. Interested readers can find an exhaustive description in [15], which contains a detailed introduction with numerous examples, and in [16].

5 Topic Maps and RDF

The W3C promotes the Resource Description Framework (RDF) [10] as another framework for expressing metadata. RDF is a W3C standard envisioned to be a foundational layer of the Semantic Web.

The fundamental notion in the RDF data model is a statement, which is a triple composed of a
resource, property, and value. The RDF Schema (RDFS) [4] is a W3C working draft aimed at defining a description language for vocabularies in RDF. More expressive RDFS models have been proposed recently [23].

One key difference between RDF and topic maps is that topic maps are modelled on a concept-centric view of the world. For example, in RDF there are no 'predefined' properties, so to assign a name to a resource one has to use another standard (such as Dublin Core), something that is not necessary with topic maps. The notion of scope is also absent from RDF.

The RDF and Topic Maps standards are similar in many respects [7]. Both offer simple yet powerful means of expressing concepts and relationships.

6 Building Topic Maps for the Bibliographic Database

As with writing a good index for a book, creating a good topic map is carried out by combining manual labour with the help of some software applications. It is usually a two-stage task, beginning with the modelling phase of building the 'upper part' of the map, i.e., the hierarchy of topic types and association types (the schema of the map), and then populating the map with instances of topic types, their associations and occurrences.

Approaches for developing a topic map out of a pool of information resources include [2]:
• using standard vocabularies and taxonomies (e.g., www.dmoz.org) as the initial source of topic types;
• generating TMs from structured databases or documents, with topic types and association types derived from the schema of the database/document;
• extracting topics and topic associations from pools of unstructured or loosely structured documents using NLP (Natural Language Processing) software combined with manual labour.

The first approach is concerned with the modelling phase of topic map creation, while the third one deals with populating the map.

Following the above guidelines, the structure of the BibTEX records was an obvious candidate to start with in modelling our map of GUST articles. It provides a basic initial set of topics, including author, paper and keyword, and the following association types: author-paper, paper-keyword and author-keyword. Abstracts (if present in TEX databases) can be considered as occurrences of the topic paper. The publication date and language can be used as scopes, for easy navigation using them as constraints.

Other TAOs (topics, associations, and occurrences [16]) to consider are: author home pages (occurrence type), applications described in the paper (association type), and papers referenced (association type). This information is absent from BibTEX files but, at least theoretically, can be automatically extracted from the source files of the papers.

We started by analyzing the data at our disposal, i.e., TEX and BibTEX source files. Unfortunately, in the case of the GUST bulletin the BibTEX database was not maintained. This apparent oversight was rectified with simple Perl scripts and a few days of manual labour. The bibliographic database was created and saved in an XML-compatible file.3

GUST TEX documents are often visually tagged and lack information-oriented markup. The only elements marked up consistently and unambiguously in the case of the GUST bulletin are the paper titles and authors' names. Authors' home pages were rarely present, while email addresses were available but not particularly useful for our purposes. Neither abstracts nor keyword lists had been required and, as a consequence, they were absent from the majority of the papers. Similarly, any consistent scheme of marking up bibliographies (or attaching .bib files) was lacking, so there was no easy way to define the related-to association between papers.

The benefit derived from keywords is much greater if they are applied consistently according to some fixed classification; otherwise the set of keywords usually consists of many random terms which are nearly useless. Since we didn't want to define yet another 'standard' in this area, we would have liked to adopt an existing one. The following sources were considered: the TEX entry at dmoz.org, Graham Williams' catalogue,4 and collections of BibTEX and .tpm files [20].

The accuracy of the TEX taxonomy subtree at dmoz.org was somewhat questionable, and we quickly rejected the idea of using it. Williams' catalogue of TEX resources does not include any information except the location of the resource in the structure of CTAN. As for BibTEX files, it appeared that only MAPS and TUGboat were complete and up to date,5 but only the latter contains keywords. Unfortunately, they don't comply with any consistent scheme. Due to the lack of any existing standard, the keywords were added manually on a common-sense basis, with the intention of being 'in sync' with the most frequent terms used in MAPS.6

Finally, the following TAOs were defined (the language of the publication was considered to be the only scope):
• topic types: author, paper, and keyword;
• association types: author-paper, paper-keyword, and author-keyword;
• occurrence types: papers and abstracts.

The schema of the map was prepared manually, and then the rest of the map was generated from the content of an intermediate XML file with an XSLT stylesheet [8, 13]. The resulting map consists of 454 topics, 1029 associations, and 999 occurrences. A fragment of the map rendered in a web browser with Ontopia Omnigator (a no-cost but closed-source application, http://www.ontopia.net/download/) is shown in Fig. 3. The panel at the lower left has all the related topics, i.e., papers written by Kees and other subjects covered. The user can easily browse the papers authored by him and switch to pages on some other interesting subject. The panel with resources contains information on the resource type, allowing fast access to the required data.

Similar functionality can be obtained with the freely available TMNav, or even by using a simple XSLT stylesheet [13].

7 Integrating Other TEX Resources

So far there is nothing in TMs which cannot be obtained using other technologies. The same or better functionality can be achieved with any database management system (DMS). But integrating TEX resources on a global scale needs flexibility, which traditional RDBMS-based DMS applications lack. For example, topic maps can be extended easily by merging separate maps into one, while DMS-based extensions usually require some prior agreement between the parties (e.g., LUGs), schema redefinitions, and more.

3 We reused the XML schema developed for the MAPS bulletin (http://www.ntg.nl/maps/).
4 http://www.ctan.org/tex-archive/help/Catalogue
5 A BibTEX database for Cahiers GUTenberg was not found, but the impressive portal of Association GUTenberg indicates appropriate metadata are maintained, but not published.
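The population step described in Section 6, generating topics and associations from the intermediate XML bibliography with an XSLT stylesheet, could be sketched roughly as follows. This is a hedged illustration: the input element names (article, title, author) are assumptions about the intermediate format, not its actual schema, and a real stylesheet would also have to emit the author and keyword topics referenced here.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- One topic of type 'paper' per bibliographic record,
       plus an author-paper association. -->
  <xsl:template match="article">
    <topic id="{@id}">
      <instanceOf><topicRef xlink:href="#paper"/></instanceOf>
      <baseName>
        <baseNameString><xsl:value-of select="title"/></baseNameString>
      </baseName>
    </topic>
    <association>
      <instanceOf><topicRef xlink:href="#author-paper"/></instanceOf>
      <member>
        <roleSpec><topicRef xlink:href="#author"/></roleSpec>
        <topicRef xlink:href="#{author/@id}"/>
      </member>
      <member>
        <roleSpec><topicRef xlink:href="#paper"/></roleSpec>
        <topicRef xlink:href="#{@id}"/>
      </member>
    </association>
  </xsl:template>
</xsl:stylesheet>
```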
To verify this flexibility in practice, we extended the GUST map with the MAPS and TUB BibTEX databases. For easy interoperability in a multi-language environment, the upper half of the map was transferred to a separate file. With the use of scope, the design of multi-language topic types was easy, for example:
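Such a multi-language topic type could be encoded along these lines (a hedged sketch in XTM syntax; the ids and the Polish name are illustrative):

```xml
<topic id="paper">
  <baseName>
    <scope><topicRef xlink:href="#english"/></scope>
    <baseNameString>paper</baseNameString>
  </baseName>
  <baseName>
    <scope><topicRef xlink:href="#polish"/></scope>
    <baseNameString>artykuł</baseNameString>
  </baseName>
</topic>
```

Each baseName is valid only in the scope given by its language topic, so a viewer can show the map filtered to one language (cf. Fig. 4).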
(http://bibtexml.sf.net) and then transformed to MAPS XML with an XSLT stylesheet. Keywords were added to TUB entries using a very crude procedure.7

Simple name-based merging of all three maps results in over 25,000 TAOs (about 1000 authors and more than 2000 papers). Some of the subjects were represented by multiple topics. As an example, the Grand Wizard was represented as the following four distinct topics: 'Knuth, Don', 'Knuth, Donald', 'Knuth, Donald E.', 'Knuth., Donald E.'.8

As identity-based merging is regarded as more robust, some identifiers have to be devised first. Establishing a PSI for every TEX author seemed overly ambitious. Instead, a dummy subject identifier was chosen, such as http://tug.org/authors#initials-surname. This can still produce multiple topics for the same subject, but now we can eliminate unwanted duplicates by defining an additional map consisting solely of topics like the following [18]:

For further extensions, we plan to incorporate CTAN resources. For that purpose, Williams' catalogue and/or the TPM files from the TEX Live project can be used. As the catalogue contains author names, it would for example be possible to enrich the map with an author-application association. Further enrichment will result if we can link applications with the documents describing them. However, some robust classification scheme for TEX resources should be devised first.

8 Topic Map Tools

As with any other XML-based technology, topic maps can be developed with any text editor and processed with many XML tools. However, for larger-scale projects specialized software is needed. There are a few tools supporting topic map technology, developed both commercially and as Open Source projects. We have considered both Ontopia Omnigator (mentioned in the previous section) and TM4J (free software).

TM4J (http://tm4j.org) is a suite of Java packages which provide interfaces and default implementations for the import, manipulation and export of XML Topic Maps. Features of the TM4J engine include an object model which supports the XTM specification, with the ability to store a topic map in an object-oriented or relational database, and an implementation of the tolog query language.

Based on TM4J, a few projects are in progress: TMNav, for intuitive navigation and editing of topic maps, and TMBrowse, for publishing maps as a set of HTML pages (similarly to Omnigator). These projects are in early stages, and our experience with TMBrowse indicates that the current version frequently crashes with bigger maps and is significantly slower than Omnigator. There were problems with tolog queries as well. As all these projects are actively maintained

Figure 4: Topic map fragment from Fig. 3, scoped to the Polish language.
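The additional map for eliminating duplicate author topics, mentioned above, would consist of topics that carry all known name variants together with the dummy subject identifier; under name-based plus subject-based merging, the variant topics then collapse into one. A hedged sketch (the labels are illustrative; the fragment #de-knuth merely follows the initials-surname convention described above):

```xml
<topic id="knuth">
  <subjectIdentity>
    <subjectIndicatorRef
        xlink:href="http://tug.org/authors#de-knuth"/>
  </subjectIdentity>
  <baseName><baseNameString>Knuth, Donald E.</baseNameString></baseName>
  <baseName><baseNameString>Knuth, Donald</baseNameString></baseName>
  <baseName><baseNameString>Knuth, Don</baseNameString></baseName>
</topic>
```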
References

[1] ISO/IEC. Topic Maps, ISO/IEC 13250, 2002. http://www.y12.doe.gov/sgml/.
[2] Kal Ahmed, Danny Ayers, Mark Birbeck, and Jay Cousins. Professional XML Meta Data. Wrox Press, 2001.
[3] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, 284(5):35-43, 2001. http://www.sciam.com/2001/0501issue/0501berners-lee.html.
[4] Dan Brickley and R. V. Guha. RDF vocabulary description language 1.0: RDF Schema, 2002. http://www.w3.org/TR/rdf-schema/.
[5] Anna Carlstedt and Mats Nordborg. An evaluation of topic maps. Master's thesis, Göteborg University, 2002. http://www.cling.gu.se/~cl8matsn/exjobb.html.
[6] Paolo Ciancarini, Marco Pirruccio, and Fabio Vitali. Metadata on the Web. On the integration of RDF and topic maps. In Extreme Markup Languages, 2003. http://www.mulberrytech.com/Extreme/Proceedings/xslfo-pdf/2003/Presutti01/EML2003Presutti01.pdf.
[7] Lars M. Garshol. Living with Topic Maps and RDF. http://www.ontopia.net/topicmaps/materials/tmrdf.html, 2002.
[8] Michael Kay. XSLT Programmer's Reference, 2nd edition. Wrox Press, 2001.
[9] Rafał Księżyk. Trying not to get lost with a topic map. In XML Europe Conference, 1999. http://www.infoloom.com/gcaconfs/WEB/granada99/ksi.HTM.
[10] Ora Lassila and Ralph R. Swick. Resource Description Framework (RDF). Model and syntax specification, 1999. http://www.w3.org/TR/REC-rdf-syntax/.
[11] Xia Lin and Jian Qin. Building a topic map repository, 2000. http://www.knowledgetechnologies.net/proceedings/presentations/lin/xialin.pdf.
[12] Ashish Mahabal, S. George Djorgovski, Robert Brunner, and Roy Williams. Topic maps as a virtual observatory tool, 2002. http://arxiv.org/abs/astro-ph/0110184.
[13] Sal Mangano. XSLT Cookbook. O'Reilly, 2002.
[14] Mary Nishikawa and Graham Moore. Topic map constraint language requirements, 2002. http://www.isotopicmaps.org/tmcl/.
[15] Jack Park and Sam Hunting, editors. XML Topic Maps: Creating and Using Topic Maps for the Web. Addison-Wesley, 2002.
[16] Steve Pepper. The TAO of topic maps, 2000. http://www.ontopia.net/topicmaps/materials/tao.html.
[17] Steve Pepper. Published subjects: Introduction and basic requirements, 2003. http://www.oasis-open.org/committees/.
[18] Steve Pepper and Lars Marius Garshol. Lessons on applying topic maps. http://www.ontopia.net/topicmaps/materials/xmlconf.html, 2002.
[19] Steve Pepper and Graham Moore. XML Topic Maps (XTM) 1.0, 2000. http://www.topicmaps.org/xtm/1.0/.
[20] Fabrice Popineau. Directions for the TeXlive systems. In EuroTEX 2001, The Good, the Bad and the Ugly, Kerkrade, pages 152-161. NTG, 2001. http://www.ntg.nl/maps/pdf/26_20.pdf.
[21] Tomasz Przechlewski. Wykorzystanie map pojęć w zarządzaniu repozytoriami dokumentów elektronicznych. In Materiały Konferencji: MSK 2003, 2003.
[22] Hans Holger Rath. Semantic resource exploitation with topic maps. In Proceedings of the GLDV-Spring Meeting 2001, 2001. http://www.uni-giessen.de/fb09/ascl/gldv2001/.
[23] Michael K. Smith, Chris Welty, and Deborah L. McGuinness. OWL Web Ontology Language guide, 2003. http://www.w3.org/TR/owl-guide/.
Interactive Editing of MathML Markup Using TEX Syntax*
Luca Padovani
Department of Computer Science, University of Bologna
Mura Anteo Zamboni, 7
40127 Bologna, Italy
[email protected]
http://www.cs.unibo.it/~lpadovan/
Abstract

We describe the architecture of a syntax-directed editor for authoring structured mathematical documents which can be used for the generation of MathML markup [4]. The author interacts with the editor by typing TEX markup as he would do in a normal text editor, with the difference that the typed markup is parsed and displayed on-the-fly. We discuss issues regarding both the parsing and presentation phases and we propose implementations for them. In contrast with existing similar tools, the architecture we propose offers better compatibility with TEX syntax, a pervasive use of standard technologies, and a clearer separation of the content and presentation aspects of the information.
1 Introduction

MathML [4] is an XML [2] application for the representation of mathematical expressions. Like most XML applications, MathML is unsuitable to be written directly because of its verbosity, except in the simplest cases. Hence the editing of MathML documents needs the assistance of dedicated tools. As of today, such tools can be classified into two main categories:

1. WYSIWYG (What You See Is What You Get) editors that allow the author to see the formatted document on the screen while it is being composed. The editor usually provides some "export mechanism" that creates XML with embedded MathML from the internal representation of the document;
2. conversion tools that generate MathML markup from different sources, typically other markup languages for scientific documents, such as TEX [5].

Tools in the first category are appealing, but they suffer from at least two limitations: a) editing is typically presentation-oriented, so the author is primarily concerned about the "look" of the document and tends to forget about its content; b) they may slow down the editing process because they often involve the use of menus, palettes of symbols and, in general, the pointing device for completing most operations.

In this paper we describe the architecture of a tool that tries to synthesize the "best of both worlds". The basic idea is to create a WYSIWYG editor in which editing is achieved by typing concrete markup, as the author would do in an actual plain text editor. The markup is then tokenized and parsed on-the-fly, a corresponding presentation is created by means of suitable transformations, and finally displayed. The editor is meant not only as an authoring tool, but more generally as an interface for math applications.

Although in this paper we assume that the concrete markup typed by the user is TEX (more precisely, the subset of TEX concerned with mathematics) and that the presentation markup is MathML, the system we are presenting is by no means tied to these languages and can be targeted to other contexts as well. One question that could arise is: "why TEX syntax?" We can see at least three motivations: first of all, because of TEX's popularity in many communities.
Second, because macros, which are a fun- they suffer from at least two limitations: a) edit- damental concept in TEX, are also the key to editing ing is typically presentation oriented — the author at a more content-oriented level, which is a primary is primarily concerned about the “look” of the docu- requirement for many applications handling mathe- ment and tends to forget about its content. b) They matics. Finally, because, as we will see, TEX markup may slow down the editing process because they of- has good locality properties which make it suitable ten involve the use of menus, palettes of symbols, in the interactive environment of our concern.
∗ This work has been supported by the European Project IST-2001-33562 MoWGLI.
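The tokenize-parse-transform-display loop described in this introduction can be sketched as follows. This is an illustrative toy in Python (the actual EdiTEX implementation is written in an ML dialect, and the real presentation phase uses XSLT); all function names and data shapes are placeholders, not part of the system described in the paper.

```python
# Toy sketch of the edit/parse/display loop: the buffer is re-tokenized
# and re-rendered after every modification.  All names are illustrative.

def tokenize(buf):
    """A backslash starts a control-sequence name; other non-space
    characters become literal tokens (spaces are dropped here)."""
    tokens, i = [], 0
    while i < len(buf):
        c = buf[i]
        if c == " ":
            i += 1
        elif c == "\\":
            j = i + 1
            while j < len(buf) and buf[j].isalpha():
                j += 1
            tokens.append(("control", buf[i + 1:j]))
            i = j
        else:
            tokens.append(("literal", c))
            i += 1
    return tokens

def render(tokens):
    """'Display': map each token to a MathML-ish element."""
    out = []
    for kind, value in tokens:
        if kind == "literal":
            out.append(f"<mi>{value}</mi>")
        else:
            out.append(f"<mo>&{value};</mo>")
    return "".join(out)

def on_edit(buf):
    # A real editor would re-process only the affected interval of the
    # buffer; this toy version simply reprocesses the whole buffer.
    return render(tokenize(buf))
```

In the architecture described below, the flat rendering step is replaced by a proper parser producing a tree, followed by a transformation engine.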
Preprints for the 2004 Annual Meeting: Interactive Editing of MathML Markup Using TEX Syntax
The body of the paper is structured into four main sections: in Section 2 we overview the architecture of the tool, while in Sections 3, 4, 5 we describe in more detail the main phases of the editing process (lexing, parsing, and transformation). Familiarity with TEX syntax and XML-related technologies is assumed.

2 Architecture

Several tools for the conversion of TEX markup suffer from two major drawbacks that we are not willing to tolerate in our design: (1) they rely on the TEX system itself for parsing the markup. While guaranteeing perfect compatibility with TEX, this implies the installation of the whole system. Moreover, the original TEX parser does not meet the incremental requirements that we need; (2) the lack of flexibility in the generation of the target document representation, which is either fixed by the conversion tool or is only slightly customizable by the user.

To cope with problem (1) we need to write our own parser for TEX markup. This is well known to be a non-trivial task, because of some fancy aspects regarding the very nature of TEX syntax and the lack of a proper "TEX grammar". We will commit ourselves to a subset of TEX syntax which appears to be just what an average author needs when writing a document. As we will see, the loss in the range of syntactic expression is compensated by a cleaner and more general transformation phase. As for the lack of a TEX grammar, we perceive this as a feature rather than a weakness: after all, TEX is built around the fact that authors are free to define their own macros. Macros are the fundamental entities giving structure to the document.

Let us now turn our attention to problem (2): recall that the general form of a TEX macro definition (see The TEXbook, [5]) is

  \def⟨control sequence⟩⟨parameter text⟩{⟨replacement text⟩}

where the ⟨parameter text⟩ gives the syntax for invoking the macro and its parameters, whereas the ⟨replacement text⟩ defines somehow the "semantics" of the macro (typically a presentational semantics). Thus the ultimate semantic load of a macro is invariably associated with the configuration of the macro at the point of definition.

We solve problem (2) by splitting up macro definitions so that structure and semantics can be treated independently. A well-formed TEX document can be represented as a tree whose leaves are either literals (strings of characters) or macros with no parameters, and each internal node represents a macro and the node's children are the macro's parameters. Entities like delimiters, square brackets surrounding optional parameters, or literals occurring in the ⟨parameter text⟩ of macro definitions are purely syntactic and need not be represented in the tree if our main concern is capturing the structure of the document. Fig. 1(a) shows the tree structure of a simple mathematical formula.

[Figure 1: Tree representation for {1\over{x+1}^2} and corresponding MathML markup. Only the node labels survive extraction: (a) the "TEX tree" with nodes g, over, 1, sp, g, 2, x, +, 1; (b) the MathML tree with nodes math, mfrac, mn, msup, mrow, mn, mi, mo, mn.]

Once the document is represented as a tree, the process of macro expansion — that is, interpretation — can be defined as a recursive transformation on the nodes of the tree. As we will represent trees using XML, transformations can be very naturally implemented by means of XSLT stylesheets [3]. Fig. 1(b) shows the MathML tree corresponding to the TEX tree on the left hand side. The two trees are basically isomorphic except for the name of the nodes and the presence of explicit token nodes for literals in the MathML tree. This is to say that the MathML tree can be generated from the TEX tree by simple transformations. However, once the interpretation phase is independent of parsing (something which does not happen in TEX), it is natural to define much more general transformations that are not just node-by-node rewritings.

The following are the main components of an interactive, syntax-based editor for structured documents:

Input Buffer: the sequence of concrete characters typed by the author;
Lexical Analyzer: responsible for the tokenization of the characters in the input buffer;
Dictionary: a map from ⟨control sequence⟩ to ⟨parameter text⟩ which is used to know the syntax of macros;
Parser: for the creation of the internal tree structure representing the document;
Luca Padovani
Transformation Engine: for mapping the internal tree into the desired format.

No doubt these entities are common to all tools converting TEX markup into a different format, but the degree of mutual interdependence and the way they are implemented may differ considerably, especially when interactivity is a main concern. The added value of our approach is that it allows the author to independently customize both the dictionary and the transformation engine, and gives the advanced user of the editor the possibility of adapting the lexical analyzer to languages other than TEX (we will spend a few more words on this topic in the conclusions).

Notation. We will use the following conventions regarding lists. Lists are uniformly typed, that is, elements of a list are all of the same type. We use α∗ to denote the type of a list whose elements have type α. [] is the empty list; n :: x is the list with head element n and tail x; x@y is the concatenation of two lists x and y; [n1; ...; nk] is a short form for n1 :: ··· :: nk :: [].

3 Lexical Analysis

The purpose of this phase is to tokenize the input buffer. As we are talking about an interactive tool, the presence of an input buffer may look surprising. Implementations for the input buffer range from virtual buffers (there is no buffer at all, characters are collected by the lexical analyzer which outputs tokens as they are completed) to flat buffers (just a string of characters as in a text editor) to structured buffers. For efficiency, we do not investigate in detail all the possibilities in this paper, but early experiments have shown that working with virtual buffers can be extremely difficult. As long as insert operations are performed at the right end of the buffer, the restructuring operations on the parsing tree are fairly easy, but when it comes to deletion or to modifications in arbitrary positions, the complexity of restructuring operations rises rapidly to an unmanageable level. Hence, from now on we will assume that a flat input buffer is available. Whether the buffer should be visible or not is a subjective matter, and may also depend on the kind of visual feedback given by the editor on incomplete and/or incorrect typed markup.

The outcome of the lexer is a stream (list) of tokens. Each token may have one of three forms: a literal, that is a single character to be treated "as is", a space, that is a sequence of one or more space-like characters, or a control sequence, that is the name of a macro.

Since the token stream is the only interface between the lexer and the parser, the lexer has the freedom to perform arbitrary mappings from the characters in the input buffer to tokens in the stream. In particular, some TEX commands like \alpha or \rightarrow are just placeholders for Unicode characters. There is no point in communicating these entities as control sequences, as the internal tree representation (XML) is able to accommodate Unicode characters naturally; also, treating them as literals simplifies the subsequent transformation phase.

On the other hand, there are characters, such as curly braces { and } or scripting operators _ and ^, that have a special meaning. Logically these are just short names for macros that obey their own rules regarding parameters. What we propose is a general classification of parameter types which, in addition to parameters in normal TEX definitions, allows us

• to deal with optional parameters as LATEX [6] does;
• to treat { as just an abbreviation for \bgroup and make \bgroup a macro with one parameter delimited by \egroup, which we treat as the expansion for }. In order for this "trick" to work we have to design the parser carefully, as we will see in Sect. 4;
• to treat scripting operators _ and ^ as the two macros \sb and \sp, both accepting a so-called pre-parameter (a parameter that occurs before the macro in the input buffer) and a so-called post-parameter (a parameter that occurs after the macro in the input buffer);
• to deal with macros that have "open" parameters. For instance \rm affects the markup following it until the first delimiter coming from an outermost macro is met. We treat \rm as a macro with an open post-parameter that extends as far as possible to the right. Similarly, \over can be seen as a macro with open pre- and post-parameters.

In order to describe parameter types we need to define the concept of term. A term is either a literal or a macro along with all its parameters (equivalently, a term is a subtree in the parsing tree). A simple parameter consists of one sole term. A compound parameter consists of one or more terms extending as far as possible to the left or to the right of the macro depending on whether the parameter is "pre-" or "post-". A delimited parameter consists of one or more terms extending as far as possible to the right up to but not including a given token
Table 1: Examples of TEX and LATEX macros along with their signature.

  Macro            pre         post
  overline                     [simple]
  sqrt (TEX)                   [simple]
  sqrt (LATEX)                 [optional; simple]
  root                         [delimited(control(of)); simple]
  over, choose     [compound]  [compound]
  frac                         [simple; simple]
  rm, bf, tt, it               [compound]
  left                         [simple; delimited(control(right)); simple]
  sb, sp           [simple]    [simple]
  bgroup                       [delimited(control(egroup))]
  begin                        [simple; optional; delimited(control(end)); simple]
  proclaim                     [token(space); delimited(literal(.)); token(space); delimited(control(par))]
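Part of Table 1 can be written down as an executable dictionary. The tuple-based encoding below is an illustrative assumption, not the paper's representation; as required of the total map dictionary, unknown control sequences receive the empty signature.

```python
# A fragment of Table 1 encoded as a dictionary data structure, following
# the parameter-type classification of the paper.  The encoding is ours.

SIMPLE, COMPOUND, OPTIONAL = "simple", "compound", "optional"

def delimited(name):
    # a parameter delimited by the control sequence `name`
    return ("delimited", "control", name)

# name -> (pre-parameter types, post-parameter types)
DICTIONARY = {
    "overline": ([], [SIMPLE]),
    "sqrt":     ([], [OPTIONAL, SIMPLE]),          # the LaTeX variant
    "root":     ([], [delimited("of"), SIMPLE]),
    "over":     ([COMPOUND], [COMPOUND]),
    "frac":     ([], [SIMPLE, SIMPLE]),
    "rm":       ([], [COMPOUND]),
    "sb":       ([SIMPLE], [SIMPLE]),
    "sp":       ([SIMPLE], [SIMPLE]),
    "bgroup":   ([], [delimited("egroup")]),
}

def lookup(name):
    # Unknown control sequences get the empty signature, as in the paper.
    return DICTIONARY.get(name, ([], []))
```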
t. An optional parameter is either empty or it consists of one or more terms enclosed within a pair of square brackets [ and ]. The absence of the opening bracket means that the optional parameter is not given. A token parameter is a given token t representing pure syntactic sugar. It does not properly qualify as a parameter and does not appear in the parsing tree.

Formally, tokens and parameter types are defined as follows:

  token ::= literal(v) | space | control⟨p1,p2⟩(v)
  type  ::= simple | compound | delimited(t) | optional | token(t)

where t ∈ token, v ∈ string is an arbitrary string of Unicode characters, p1 ∈ {simple, compound}∗ and p2 ∈ type∗ are lists of parameter types for the pre- and post-parameters respectively. Note that pre-parameters can be of type simple or compound only.

The dictionary is a total map

  dictionary : string ↦ token

such that for each unknown control sequence v we have dictionary(v) = control⟨[],[]⟩(v). Table 1 shows part of a possible dictionary for some TEX and LATEX commands (mostly for mathematics). Note how it is possible to encode the signature for the \begin control sequence, although it is not possible to enforce the constraint that the first and the last parameters must have equal value in order for the construct to be balanced.

4 Parsing

We now come to the problem of building the TEX parsing tree starting from the stream of tokens produced by the lexical analyzer. As we have already pointed out, there is no fixed grammar that we can use to generate the parser automatically: authors are free to introduce new macros and hence new ways of structuring the parse tree. Thus we will build the parser "by hand". More reasons for writing an ad-hoc parser, namely error recovery and incrementality, will be discussed later in this section.

The following grammar captures formally the structure of a TEX parsing tree, which is the outcome of the parser:

  node  ::= empty | literal(v) | macro(v,x)    v ∈ string, x ∈ param∗
  param ::= {a}                                a ∈ node∗

Note that a parameter is made of a list of nodes and that literals are strings instead of single characters. The empty node is used to denote a missing term when one was expected; its role will be clarified later in this section.

The appendix contains the Document Type Definition for the XML representation of TEX parsing trees. It is simpler than the TEXML DTD [7] and we are providing it as mere reference.

4.1 Parsing Functions

Table 2 gives the operational semantics of the parser. In this table only, for each a ∈ node∗ we define a! = [empty] if a = [] and a! = a otherwise. There are four parsing functions: T for terms, A for pre-parameters, B for post-parameters, and C for delimited sequences of terms. Each parsing function is defined by induction on the structure of its arguments. Axioms (rules with no horizontal line) denote base cases, while inference rules define the value
Table 2: Parsing functions for the simplified TEX markup.

Signatures (for all d ∈ token∗, b ∈ bool):

  T(d)   : node∗ × token∗ → node∗ × token∗
  A      : node∗ × type∗  → node∗ × param∗
  B(d)   : type∗ × token∗ → param∗ × token∗
  C(d,b) : node∗ × token∗ → node∗ × token∗

Rules for T:
  (T.1)  a, [] →T(d) a, []
  (T.2)  a, t::l →T(d) a, t::l                        (t occurs in d)
  (T.3)  a, literal(v)::l →T(d) a@[literal(v)], l
  (T.4)  a, space::l →T(d) a′, l′                     if a, l →T(d) a′, l′
  (T.5)  a, control⟨p1,p2⟩(v)::l →T(d) a′@[macro(v, x@y)], l′
                                                      if a, p1 →A a′, x  and  p2, l →B(d) y, l′

Rules for A:
  (A.1)  a, [] →A a, []
  (A.2)  [], s::p →A a, x@[{[empty]}]                 if [], p →A a, x
  (A.3)  a@[n], simple::p →A a′, x@[{[n]}]            if a, p →A a′, x
  (A.4)  a, compound::p →A a′, x@[{a}]                if [], p →A a′, x

Rules for B:
  (B.1)  [], l →B(d) [], l
  (B.2)† token(t)::p, t′::l →B(d) x, a                if p, t′::l →B(d) x, a  and  t ≠ t′
  (B.3)† token(t)::p, [] →B(d) x, l                   if p, [] →B(d) x, l
  (B.4)  token(t)::p, t::l →B(d) x, l′                if p, l →B(d) x, l′
  (B.5)  simple::p, l →B(d) {a!}::x, l′′              if [], l →T(d) a, l′  and  p, l′ →B(d) x, l′′
  (B.6)  compound::p, l →B(d) {a!}::x, l′′            if [], l →C(d,false) a, l′  and  p, l′ →B(d) x, l′′
  (B.7)  optional::p, [] →B(d) {[]}::x, l             if p, [] →B(d) x, l
  (B.8)  optional::p, literal([)::l →B(d) {a}::x, l′′
                                                      if [], l →C(literal(])::d,true) a, l′  and  p, l′ →B(d) x, l′′
  (B.9)  optional::p, t::l →B(d) {[]}::x, l′          if p, t::l →B(d) x, l′  and  t ≠ literal([)
  (B.10) delimited(t)::p, l →B(d) {a!}::x, l′′        if [], l →C(t::d,true) a, l′  and  p, l′ →B(d) x, l′′

Rules for C:
  (C.1)  a, [] →C(d,b) a, []
  (C.2)  a, t::l →C(t::d,true) a, l
  (C.3)  a, t::l →C(d,b) a, t::l                      (t occurs in d)
  (C.4)  a, t::l →C(d,b) a′′, l′′                     if t ∉ d,  a, t::l →T(d) a′, l′  and  a′, l′ →C(d,b) a′′, l′′
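As the conclusions note, the rules of Table 2 can be turned into a small functional program (about 70 lines in an ML dialect). The Python sketch below is a reduced illustration of that idea, covering only literal, simple, compound and delimited parameters, with empty-node recovery; the names and data shapes are ours, not EdiTEX's.

```python
# Reduced executable sketch of the parsing functions of Table 2, enough
# for inputs like {1\over{x+1}^2}.  Missing parameters are recovered as
# "empty" nodes, as in rules A.2/B.5/B.10.

EMPTY = ("empty",)

def bang(a):
    # the ! operator of Sect. 4.1: recover a missing parameter as empty
    return a if a else [EMPTY]

# Tokens are ("lit", c) or ("ctrl", name, pre, post), where pre/post are
# lists of parameter types: "simple", "compound" or ("delimited", name).

def parse_T(d, a, l):
    """T: parse one term onto accumulator a; d lists expected delimiters."""
    a = list(a)
    if not l:
        return a, l                            # rule T.1
    t = l[0]
    if t[0] == "lit":
        return a + [("literal", t[1])], l[1:]  # rule T.3
    name, pre, post = t[1], t[2], t[3]
    if name in d:
        return a, l                            # rule T.2: delimiter reached
    # A: pre-parameters are taken from the terms already accumulated in a
    xs = []
    for p in reversed(pre):
        if not a:
            xs.insert(0, [EMPTY])              # rule A.2
        elif p == "simple":
            xs.insert(0, [a.pop()])            # rule A.3
        else:                                  # "compound", rule A.4
            xs.insert(0, a)
            a = []
    # B: post-parameters are read from the remaining token stream
    ys, rest = [], l[1:]
    for p in post:
        if p == "simple":                      # rule B.5
            t1, rest = parse_T(d, [], rest)
        elif p == "compound":                  # rule B.6
            t1, rest = parse_C(d, False, [], rest)
        else:                                  # ("delimited", e), rule B.10
            t1, rest = parse_C([p[1]] + d, True, [], rest)
        ys.append(bang(t1))
    return a + [("macro", name, xs + ys)], rest    # rule T.5

def parse_C(d, eat, a, l):
    """C: parse a run of terms up to a delimiter; eat = consume it."""
    while l:
        t = l[0]
        if t[0] == "ctrl" and d and t[1] == d[0] and eat:
            return a, l[1:]                    # rule C.2: eat the delimiter
        if t[0] == "ctrl" and t[1] in d:
            return a, l                        # rule C.3: leave it
        a, l = parse_T(d, a, l)                # rule C.4
    return a, l                                # rule C.1
```

For example, parsing the sole \over with no arguments yields a macro node with two empty parameters, matching the error-recovery example of Sect. 4.2.1.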
of a parsing function (the conclusion, below the line) in terms of the value of one or more recursive calls to other functions (the premises, above the line). Right arrows denote the action of parsing. Arrows are decorated with a label that identifies the parser along with its parameters, if any. The T, B, and C parsers have a parameter representing the list of delimiters in the order they are expected, with the head of the list being the first expected delimiter. The C parser also has a boolean parameter indicating whether the parser should or should not "eat" the delimiter when it is eventually met.

The root parsing function is T. Given a delimiter t ∈ token and a token stream l ∈ token∗ we have

  [], l →T([t]) [n], l′

where n ∈ node is the parsed term and l′ ∈ token∗ is the part of the token stream that has not been consumed. Spaces are ignored when parsing terms and pre-parameters (rule T.4), but not when parsing post-parameters (rule B.4). The A function differs from the other parsing functions because, by the time a macro with pre-parameters is encountered, the pre-parameters have already been parsed. The lists a ∈ node∗ in the T, A, and C parsers represent the terms accumulated before the term being parsed. Note that pre-parameters are inserted at the end of the parameter list (rules A.2 to A.4) and that post-parameters are inserted at the beginning of the parameter list (rules B.5 to B.10). This way parameter nodes appear in the parse tree in the same order as in the original token stream (rule T.5).

4.1.1 Example

Given that the input buffer contains the TEX source shown in Fig. 1, the lexical analyzer would produce the following stream of tokens:

  l0 def= [control⟨[],[delimited(control(egroup))]⟩(bgroup);
           literal(1); control⟨[compound],[compound]⟩(over);
           control⟨[],[delimited(control(egroup))]⟩(bgroup);
           literal(x); literal(+); literal(1);
           control(egroup); control⟨[simple],[simple]⟩(sp);
           literal(2); control(egroup)]

By the application of the parsing rules given in Table 2 it can be shown that

  [], l0@[control(eoi)] →T([control(eoi)]) [n], [control(eoi)]

where n ∈ node is the same tree shown in Fig. 1, except that the g nodes are labeled with bgroup.

4.2 Error Recovery

Parsing functions are all total functions: they always produce a result, even when the input token stream is malformed. Unlike parsers of batch TEX converters or the TEX parser itself, there will often be moments during the editing process when the input buffer contains incorrect or incomplete markup, for example because not all the required parameters of a macro have been entered yet. The parser must recover from such situations in a tolerant and hopefully sensible way. We distinguish three kinds of situations: missing parameters, pattern mismatch, and ambiguity, which we examine in the rest of this section.

4.2.1 Missing Parameters

Consider an input token stream representing the sole \over macro with no arguments provided:

  l1 def= [control⟨[compound],[compound]⟩(over); control(eoi)]

It is easy to check that

  [], l1 →T([control(eoi)]) [macro(over, [empty; empty])], [control(eoi)]

More generally, the parser inserts empty nodes in the parsing tree wherever an expected parameter is not found in the token stream. This behavior can be seen in rule A.2 and also in rules B.5, B.6, and B.10 where the ! operator is used. For optional parameters an empty node list is admitted (rules B.7 and B.8).

The presence of empty nodes guarantees that the generated tree is structurally well-formed, which is crucial for the subsequent transformation phase. It also allows the application to give the user feedback indicating the absence of required parameters. In the example above, for instance, the application may display something like an empty fraction template, suggesting that a fraction was entered but neither the numerator nor the denominator have been.

4.2.2 Pattern Mismatch

Rules B.2 and B.3 have been marked with a † to indicate that the parser expects a token which is not found in the token stream. In both cases the parser will typically notify the user with a warning message.

4.2.3 Ambiguities

In TEX one cannot pass a macro with parameters as the parameter of another macro, unless the parameter is enclosed within a group. For example, it is an error to write \sqrt\sqrt{x}; the correct form is \sqrt{\sqrt{x}}. Because we treat the left curly brace like any other macro, grouping would not help our parser in resolving ambiguities. However, the
parser knows how many parameters a macro needs, because the token representing the control sequence has been annotated with such information by the lexer. When processing a macro with arguments the parser behaves "recursively": it does not let an incomplete macro be "captured" if it was passed as parameter of an outer macro. A consequence of this extension is that any well-formed fragment of TEX markup is accepted by our parser resulting in the same structure, but there are some strings accepted by our parser that cause the TEX parser to fail.

4.3 Incremental Parsing

Parsing must be efficient because it is performed in real-time, in principle at every modification of the input buffer, no matter how simple the modification is. Fortunately TEX markup exhibits good locality, that is, small modifications in the document cause small modifications in the parsing tree. Consequently we can avoid re-parsing the whole source document: we just need to re-parse a small interval of the input buffer around the point where the modification has occurred, and adjust the parsing tree accordingly. Let us consider again the example of Fig. 1 and suppose that a change is made in the markup

  {1\over{1+x}^2}  ⇒  {1\over{1+x+y}^2}

(a +y is added to the denominator of the fraction). To be conservative we can re-parse the smallest term within braces that includes the modified part (the underlined fragments). Once the term has been re-parsed it has to be substituted in place of the old term in the parsing tree.

In order to compute the interval of the input buffer to be re-parsed, we annotate the nodes of the parsing tree with information about the first and the last characters of the buffer which were scanned while building the node and all of its children. A simple visit of the tree can locate the smallest interval affected by the modification. Curly braces occur frequently enough in the markup to give good granularity for re-parsing. At the same time, limiting re-parsing to braced terms helps control the costs related to the visit of the parsing tree and to the implementation of the incremental parsing and transformation machinery.

5 Transformation

The transformation phase recognizes structured patterns in the parsing tree and generates corresponding fragments of the result document. We have already anticipated that XSLT is a very natural choice for the implementation of this phase. Besides, XSLT stylesheets can be extended very easily, by providing new templates that recognize and properly handle new macros that an author has introduced.

We can see in Fig. 2 two sample templates taken from an XSLT stylesheet for converting the internal parsing tree into a MathML tree. Both templates have a preamble made of an xsl:if construct which we will discuss later in this section. Since the TEX tree and the MathML tree are almost isomorphic (Fig. 1), the transformation is generally very simple and in many cases it amounts to just renaming the node labels. Template (a) is one such case: it matches any node in the parsing tree with label macro and having the name attribute set to over. The node for the \over macro corresponds naturally to the mfrac element in MathML. The two parameters of \over are transformed recursively by applying the stylesheet templates to the first and second child nodes (p[1] means "the first p child of this node"; similarly p[2] refers to the second p child).

Template (b) is slightly more complicated and shows one case where there is some change in the structure. For combined sub/super scripts TEX accepts a sequence of _ and ^ no matter in what order they occur, but MathML has a specific element for such expressions, namely msubsup. The template matches an sb node whose first parameter contains an sp node, thus detecting a ...^..._... fragment of markup; then the corresponding msubsup element is created and its three children accessed in the proper position of the parsing tree. A symmetric template will handle the case where the subscript occurs before the superscript.

5.1 Incremental Transformation

As we have done for parsing, for transformations we also need to account for their cost. In a batch, one-shot conversion from TEX this is not generally an issue, but in an interactive authoring tool a transformation is required at every modification of the parsing tree in order to update the view of the document.

Intuitively, we can reason that if only a fragment of the parsing tree has changed, we need to re-transform only that fragment and substitute the result in the final document. This technique makes two assumptions: (1) that transformations are context-free, that is, the transformation of a fragment in the parsing tree is not affected by the context in which the fragment occurs; (2) that we are able to relate corresponding fragments between the parsing and the result trees.
Figure 2: Example of XSLT templates for the transformation of the internal parsing tree into a MathML tree. MathML elements can be distinguished by the m: prefix.
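The XSLT source of Figure 2 does not survive in this text, so the following Python sketch only mirrors what the two templates are described to do: template (a) maps \over to m:mfrac, and template (b) merges a nested sp/sb pair into a single m:msubsup. The node encoding (tuples with one node per parameter) is a simplified assumption of ours.

```python
# Illustrative stand-in for the two templates of Fig. 2 (whose actual
# XSLT source is not reproduced here).  Nodes are ("literal", s) or
# ("macro", name, [param, ...]), one node per parameter.

def to_mathml(node):
    if node[0] == "literal":
        return f"<m:mi>{node[1]}</m:mi>"
    name, params = node[1], node[2]
    # template (b): an sb node whose first parameter is an sp node
    if (name == "sb" and len(params) == 2
            and params[0][0] == "macro" and params[0][1] == "sp"):
        (base, sup), sub = params[0][2], params[1]
        return ("<m:msubsup>" + to_mathml(base) + to_mathml(sub)
                + to_mathml(sup) + "</m:msubsup>")
    # template (a): \over corresponds naturally to mfrac
    if name == "over" and len(params) == 2:
        num, den = params
        return "<m:mfrac>" + to_mathml(num) + to_mathml(den) + "</m:mfrac>"
    # generic fallback: wrap the transformed parameters in a row
    return "<m:mrow>" + "".join(to_mathml(p) for p in params) + "</m:mrow>"
```

A symmetric sp-containing-sb case would be handled by one more pattern, just as the paper describes a symmetric template.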
Template (b) in Fig. 2 shows one case where the transformation is not context-free: the deeper sp node is not processed as if it occurred alone, but it is "merged" together with its parent. More generally, we can imagine that transformations can make almost arbitrary re-arrangements of the structure. This problem cannot be solved unless we make some assumptions, and the one we have already committed to in Sect. 4 is that braces define "black-box" fragments which can be transformed in isolation, without context dependencies.

As for the matter of relating corresponding fragments of the two documents, we use identifiers and references. Each node in the parsing tree is annotated with a unique identifier (in our sample templates we are assuming that the identifier is a string in the id attribute). Templates create corresponding xref attributes in the result document "pointing" to the fragment with the same identifier in the parsing tree. This way, whenever a fragment of the parsing tree is re-transformed, it replaces the fragment in the result document with the same identifier.

More generally, back-pointers provide a mechanism for relating the view of the document with the source markup. This way it is possible to perform operations like selection or cut-and-paste that, while having a visual effect in the view, act indirectly at the content/markup level.

6 Conclusion

We have presented architectural and implementation issues of an interactive editor based on TEX syntax which allows flexible customization and content-oriented authoring. TEXmacs (http://www.texmacs.org/) is probably the existing application that most closely adopts such an architecture, with the difference that TEXmacs does not stick to TEX syntax as closely as we do and that, apart from being a complete (and cumbersome) editing tool and not just an interface, it uses encoding and transformation technologies not based on standard languages (XML [2] and XSLT [3]).

Among batch conversion tools we observe a tendency to move towards the processing of content. The TEX to MathML converter by Igor Rodionov and Stephen Watt at the University of Western Ontario [8, 9] is one such tool, and the recent Hermes converter by Romeo Anghelache [10] is another. These represent significant steps forward when compared to converters such as LATEX2HTML (http://www.latex2html.org/).

A prototype tool called EdiTEX, based on the architecture described in this paper, has been developed and is freely available along with its source code (http://helm.cs.unibo.it/software/editex/). No mention of MathML is made in the name of the tool, to remark the fact that the architecture is very general and can be adapted to other kinds of markup. The prototype is currently being used as an interface for a proof-assistant application where editing of complex mathematical formulas and proofs is required. In this respect we should remark that TEX syntax is natural for "real" mathematics, but it quickly becomes clumsy when used for writing terms of programming languages or λ-calculus. This is mainly due to the conventions regarding spaces (for instance, spaces in the λ-calculus denote function application) and identifiers (the rule "one character is one identifier" is fine for mathematics, but not for many other languages). Note however that, since the lexical analyzer is completely separate from the rest of the architecture, the token stream being its interface, it can be easily targeted to a language with different conventions than those of TEX.
The idea of using some sort of restricted TEX syntax for representing mathematical expressions is not new. For example, John Forkosh's MimeTEX (http://www.ctan.org/tex-archive/support/mimetex/) generates bitmap images of expressions to be embedded in Web pages. However, to the best of our knowledge the formal specification of the parser for simplified TEX markup presented in Sect. 4 is unique of its kind. A straightforward implementation based directly on the rules given in Table 2 amounts to just 70 lines of functional code (in an ML dialect), which can be considered something of an achievement given that parsing TEX is normally regarded as a hard task. By comparison, the parsing code in MimeTEX amounts to nearly 350 lines of C code after stripping away the comments.

One may argue that the simplified TEX markup is too restrictive, but in our view this is just the sensible fragment of TEX syntax that the average user should be concerned about. In fact, the remaining syntactic expressiveness provided by TEX is mainly required for the implementation of complex macros and of system internals, which should never surface at the document level. By separating the transformation phase we shift the mechanics of macro expansion to a different level which can be approached with different (more appropriate) languages. Since this mode of operation makes the system more flexible [...] contribution which may provide an architecture for other [...]

References

[1] Unicode Consortium: The Unicode Standard, Version 4.0, Boston, MA, Addison-Wesley (2003). http://www.unicode.org/
[2] Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, Eve Maler (editors): Extensible Markup Language (XML) 1.0, W3C Recommendation.
[3] James Clark (editor): XSL Transformations (XSLT) Version 1.0, W3C Recommendation.
[4] Ron Ausbrooks, Stephen Buswell, Stéphane Dalmas, Stan Devitt, Angel Diaz, et al.: Mathematical Markup Language (MathML) Version 2.0, W3C Recommendation. http://www.w3.org/TR/2003/REC-MathML2-20031021/
[5] Donald E. Knuth: The TEXbook, Addison-Wesley, Reading, MA, USA (1994).
[6] Leslie Lamport: A Document Preparation System: LATEX, Addison-Wesley, Reading, MA, USA (1986).
[7] Douglas Lovell: TEXML: Typesetting XML with TEX, TUGboat, 20(3), pp. 176–183 (September 1999).
[8] Sandy Huerter, Igor Rodionov, Stephen M. Watt: Content-Faithful Transformations for MathML, Proc. International Conference on MathML and Math on the Web (MathML 2002), Chicago, USA (2002). http://www.mathmlconference.org/2002/presentations/huerter/
[9] Stephen M. Watt: Conserving implicit mathematical semantics in conversion between TEX and MathML, TUGboat, 23(1), p. 108 (2002).
[10] Romeo Anghelache: LATEX-based authoring tool, Deliverable D4.d, MoWGLI Project (2003). http://relativity.livingreviews.org/Info/AboutLR/mowgli/index.html

Appendix: The TML DTD

[The DTD listing did not survive extraction.]
Animations in pdfTEX-generated PDF
Jan Holeček
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
[email protected]
http://www.fi.muni.cz/~xholecek
Petr Sojka
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
[email protected]
http://www.fi.muni.cz/usr/sojka
Abstract

This paper presents a new approach for creating animations in Portable Document Format (PDF). The method of animation authoring described uses only free software (pdfTEX). The animations are viewable by any viewer that supports at least some features of Acrobat JavaScript, particularly Adobe (Acrobat) Reader, which is available at no cost for a wide variety of platforms. Furthermore, the capabilities of PDF make it possible to have a single file with animations both for interactive viewing and printing.

The paper explains the principles of PDF, Acrobat JavaScript and pdfTEX needed to create animations for Adobe Reader using no other software except pdfTEX. We present a step-by-step explanation of animation preparation, together with sample code, using a literate programming style. Finally, we discuss other possibilities of embedding animations into documents using open standards (SVG) and free tools, and conclude with their strengths and weaknesses with respect to the method presented.
1 Introduction

Extensive use of electronic documents leads to new demands being made on their content. Developing specific document versions for different output devices is time consuming and costly. A very natural demand, especially when preparing educational materials, is embedding animations into a document.

A widely used open format for electronic documents is the Adobe PDF [2] format, which combines good typographic support with many interactive features. Even though it contains no programming language constructs such as those found in PostScript, the format allows for the inclusion of Document Level JavaScript (DLJS) [1]. Widely available PDF viewers such as Adobe Reader (formerly Acrobat Reader) benefit from this possibility, allowing interactive documents to be created.

One of the first applications showing the power of using JavaScript with PDF was Hans Hagen's calculator [5]. Further, the AcroTEX bundle [9] uses several LaTeX packages and the full version of the Adobe Acrobat software for preparing PDF files with DLJS [10]; macro support for animations is rudimentary and it is stressed in the documentation that it works only with the full commercial version of Acrobat.

Our motivation is a need for PDF animations in a textbook [3] published both on paper and on CD. We have published it using Acrobat [7, 8], and eventually discovered a method to create animations using pdfTEX [11] only.

pdfTEX facilitates the PDF creation process in several ways. We can directly write the PDF code which is actually required to insert an animation. We can also utilise the TEX macro expansion power to produce PDF code. And finally, we can write only the essential parts directly, leaving the rest to pdfTEX. pdfTEX introduces new primitives to take advantage of PDF features. The ones we are going to use will be described briefly as they appear.

In this paper, we present this new "pdfTEX only" way of embedding animations. We require no previous knowledge either of the PDF language or of pdfTEX extensions to TEX. However, the basics of TEX macro definitions and JavaScript are assumed.

The structure of the paper is as follows. In the next section we start with a description of the internal PDF document structure with respect to animations. The core of the paper consists of commented pdfTEX code that generates a simple all-in-one animation. The examples are written in plain TEX [6], so that others can use them in elaborate macro packages, in a literate programming style. In the second example the animation is taken from an external file, allowing modification of the animation without modifying the primary document. Finally, we compare this approach with the possibilities of other formats, including the new standard for Scalable Vector Graphics (SVG) [12] from the W3C.

2 The PDF Document Structure

A PDF file typically consists of a header, a body, a cross-reference table and a trailer. The body is the main part of the PDF document. The other parts provide meta-information and will not be discussed here. A PDF document is actually a graph of interconnected objects, each being of a certain type. There are basic data types (boolean, numeric, string) and some special and compound types which require some explanation.

A name object has the form /MYNAME. There is a set of names with predefined meanings when used as a dictionary key or value. Other names can be defined by the user as human-readable references to indirect objects (dictionaries and indirect objects are treated below). An array object is a one-dimensional list, enclosed by square brackets, of objects not necessarily of the same type. A dictionary object is a hash, i.e., a set of key-value pairs where the keys are name objects and the values are arbitrary objects. A dictionary is enclosed by the << and >> delimiters. Stream objects are used to insert binary data into a PDF document. There is also a special null object used as an "undefined" value.

The body of a PDF file consists of a sequence of labelled objects called indirect objects. An object of any other type which is given a unique object identifier can form an indirect object. When an object is required in some place (an array element, a value of a key in a dictionary), it can be given explicitly (a direct reference) or as an object identifier to an indirect object (an indirect reference). In this way objects are interconnected to form a graph. An indirect reference consists of two numbers. The first number is a unique object number. The second is an object version number and is always 0 in indirect objects newly created by pdfTEX; the first number therefore suffices to restore an indirect reference.

Various document elements are typically represented by dictionary objects. Each element has a given set of required and optional keys for its dictionary. For example, the document itself is represented by a Catalog dictionary, the root node of the graph. Its key-value pairs define the overall properties of the document. A brief description of concrete objects will be given when encountered for the first time. See [2] for more detailed information.

3 Insertion of the Animation Frames

We are not interested in constructing the animation frames themselves; any graphics program such as METAPOST will do. Let us hence assume we have a PDF file, each page of which forms a single animation frame, with the frames in order of appearance.

Every image is inserted into PDF as a so-called form XObject, which is actually an indirect stream object. There are three primitives that deal with images in pdfTEX. The \pdfximage primitive creates an indirect object for a given image. The image can be specified as a page of another PDF file. However, the indirect object is actually inserted only if referred to by the \pdfrefximage primitive or preceded by \immediate. \pdfrefximage takes an object number (the first number of an indirect reference) as its argument and adds the image to the TEX list currently being built. The object number of the image most recently inserted by \pdfximage is stored in the \pdflastximage register.

A general PDF indirect object can be created similarly by \pdfobj, \pdfrefobj and \pdflastobj. \pdfobj takes the object content as its argument. TEX macro expansion can be used for generating PDF code in an ordinary manner.

In our example, we first define four macros for efficiency. The \ximage macro creates a form XObject for a given animation frame (as an image) and saves its object number under a given key. The \insertobj macro creates a general PDF object and saves its object number under a given key. The \oref macro expands to an indirect reference of an object given by the argument. The last "R" is an operator that creates the actual indirect reference from
two numbers. We are not going to use the \pdfref* primitives, so \immediate must be present. References will be put directly into the PDF code by the \oref macro. The \image macro actually places an image given by its key onto the page.

1 % an image for further use
2 \def\ximage#1#2{%
3 \immediate\pdfximage
4 page #2 {frames-in.pdf}%
5 \expandafter\edef
6 \csname pdf:#1\endcsname
7 {\the\pdflastximage}}
8
9 % a general object for further use
10 \def\insertobj#1#2{%
11 \immediate\pdfobj{#2}%
12 \expandafter\edef
13 \csname pdf:#1\endcsname
14 {\the\pdflastobj}}
15
16 % expands to an indirect ref. for a key
17 \def\oref#1{%
18 \csname pdf:#1\endcsname\space 0 R}
19
20 % actually places an image
21 \def\image#1{%
22 \expandafter\pdfrefximage
23 \csname pdf:#1\endcsname}

Another new primitive introduced by pdfTEX is \pdfcatalog. Its argument is added to the document's Catalog dictionary every time it is expanded. The one below makes the document open at the first page and the viewer fit the page into the window. One more key will be described below.

24 % set up the document
25 \pdfcatalog{/OpenAction [ 0 /Fit ]}

Now we are going to insert the animation frames into the document. We will use the \ximage macro defined above. Its first argument is the name to be bound with the resulting form XObject. The second one is the number of the frame (actually a page number in the PDF file with frames). One needs to be careful here because pdfTEX has one-based page numbering while PDF uses zero-based page numbering internally.

26 % all animation frames are inserted
27 \ximage{fr0}{1} \ximage{fr1}{2}
28 \ximage{fr2}{3} \ximage{fr3}{4}
29 \ximage{fr4}{5} \ximage{fr5}{6}
30 \ximage{fr6}{7} \ximage{fr7}{8}
31 \ximage{fr8}{9}

4 Setting up an AcroForm Dictionary

The interactive features are realized by annotation elements in PDF. These form a separate layer in addition to the regular document content. Each one denotes an area on the page to be interactive and binds some actions to various events that can happen for that area. Annotations are represented by Annot dictionaries. The way pdfTEX inserts annotations into PDF is discussed in the section "Animation Dynamics" below.

Annotations are transparent by default, i.e., the page appearance is left unchanged when adding an annotation. It is up to the regular content to provide the user with the information that some areas are interactive.

We will be interested in a subtype of annotations called interactive form fields. They are represented by a Widget subtype of the Annot dictionary. Widgets can be rendered on top of the regular content. However, some resources have to be set. The document's Catalog refers to an AcroForm dictionary in which this can be accomplished.

The next part of the example first defines the name Helv to represent the Helvetica base font (a built-in font). This is not necessary, but it allows us to have a smooth control button. Next we insert the AcroForm dictionary. The DR entry stands for "resource dictionary"; we only define the Font resource, with one font. The DA entry stands for "default appearance" string: /Helv sets the font, 7 Tf sets the font size scale factor to 7, and 0 g sets the color to 0% white (i.e., black). The most important entry in the AcroForm dictionary is NeedAppearances; setting it to true (line 43) makes the Widget annotations visible. Finally, we add the AcroForm dictionary to the document's Catalog.

32 % the Helvetica basefont object
33 \insertobj{Helv}{
34 << /Type /Font /Subtype /Type1
35 /Name /Helv
36 /BaseFont /Helvetica >> }
37
38 % the AcroForm dictionary
39 \insertobj{AcroForm}{
40 << /DR << /Font <<
41 /Helv \oref{Helv} >> >>
42 /DA (/Helv 7 Tf 0 g )
43 /NeedAppearances true >> }
44
45 % add a reference to the Catalog
46 \pdfcatalog{/AcroForm \oref{AcroForm}}

To make a form XObject with an animation frame accessible to JavaScript, it has to be assigned
a name. There are several namespaces in PDF in which this can be accomplished. The one searched is determined from context. We are only interested in the AP namespace, which maps names to annotation appearance streams. pdfTEX provides the \pdfnames primitive, which behaves similarly to \pdfcatalog: each time it is expanded it adds its argument to the Names dictionary referred to from the document's Catalog. The Names dictionary contains the name definitions for various namespaces. In our example we put the definitions into a separate object, AppearanceNames.

The name definitions may form a tree to make the lookup faster. Each node has to have Limits set to the lexically least and greatest names in its subtree. There is no extensive set of names in our example, so one node suffices. The names are defined in an array of pairs containing the name string and the indirect reference.

47 % defining names for frames
48 \insertobj{AppearanceNames}{
49 << /Names
50 [ (fr0) \oref{fr0} (fr1) \oref{fr1}
51 (fr2) \oref{fr2} (fr3) \oref{fr3}
52 (fr4) \oref{fr4} (fr5) \oref{fr5}
53 (fr6) \oref{fr6} (fr7) \oref{fr7}
54 (fr8) \oref{fr8} ]
55 /Limits [ (fr0) (fr8) ] >> }
56
57 % edit the Names dictionary
58 \pdfnames{/AP \oref{AppearanceNames}}

5 Animation Dynamics

We have created all the data structures needed for the animation in the previous section. Here we introduce the code to play the animation. It uses Acrobat JavaScript [1], an essential element of interactive forms. Acrobat JavaScript is an extension of Netscape JavaScript targeted to PDF and Adobe Acrobat. Most of its features are supported by Adobe Reader. They can, however, be supported by any other viewer. Nevertheless, the Reader is the only one known to us that supports interactive forms and JavaScript.

The animation is based on interchanging frames in a single widget. Here we define the number of frames and the interchange timespan in milliseconds, to demonstrate macro expansion in JavaScript.

59 % animation properties
60 \def\frames{8}
61 \def\timespan{550}

Every document has its own instance of a JavaScript interpreter in the Reader. Every JavaScript action is interpreted within this interpreter. This means that one action can set a variable to be used by another action triggered later. Document-level JavaScript code, e.g., function definitions and global variable declarations, can be placed into a JavaScript namespace. This code should be executed when opening the document.

Unfortunately, there is a bug in the Linux port of the Reader that renders this generally unusable. The document-level JavaScript is not executed if the Reader is not running yet and the document is opened from a command line (e.g., 'acroread file.pdf'). Neither the first page's nor the document's open action is executed, which means they cannot be used as a workaround. Binding JavaScript code to another page's open action works well enough to suffice in most cases.

We redeclare everything each time an action is triggered so as to make the code as robust as possible.

First we define the Next function, which takes a frame index from a global variable, increases it modulo the number of frames, and shows the frame with the resulting index. The global variable is modified in place.

The animation actually starts at line 78, where the frame index is initialized. The frames are displayed on an interactive form widget that we name "animation" (see "Placing the Animation" below). A reference to this widget's object is obtained at line 79. Finally, line 80 says that from now on, the Next function should be called every \timespan milliseconds.

62 % play the animation
63 \insertobj{actionPlay}{
64 << /S /JavaScript /JS (
65 function Next() {
66 g.delay = true;
67 if (cntr == \frames) {
68 cntr = 0;
69 try { app.clearInterval(arun); }
70 catch(except) {}
71 } else { cntr++; }
72 g.buttonSetIcon(
73 this.getIcon("fr" + cntr));
74 g.delay=false;
75 }
76 try { app.clearInterval(arun); }
77 catch(except) {}
78 var cntr = 0 ;
79 var g = this.getField("animation");
80 var arun = app.setInterval("Next()",
81 \timespan);
82 ) >> }
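Stripped of the Acrobat-specific calls (app.setInterval, getField, buttonSetIcon), the wrap-around logic of Next can be exercised in ordinary JavaScript. The following sketch is ours, purely illustrative; makeCycler is not part of the listing above or of Acrobat JavaScript:

```javascript
// Sketch of Next()'s frame cycling: the counter runs 0..frameCount and then
// wraps to 0, so with \frames = 8 the icons shown are fr1, ..., fr8, fr0, ...
function makeCycler(frameCount) {
  let cntr = 0;                 // the global frame index in the real script
  return function next() {
    if (cntr == frameCount) {
      cntr = 0;                 // wrap around; the real code also clears the interval here
    } else {
      cntr++;
    }
    return "fr" + cntr;         // the name looked up via getIcon("fr" + cntr)
  };
}
```

For instance, with frameCount set to 2 (frames fr0, fr1, fr2), successive calls yield "fr1", "fr2", "fr0", "fr1", mirroring the modular increment performed by the listing.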
Now, let us describe the Next function in more detail. Line 66 suspends the widget's redrawing until line 74. Then the global variable containing the current frame index is tested. If the index reaches the number of frames, it is set back to zero and the periodic calling of the function is interrupted. The function would be aborted on error, but because we catch exceptions this is avoided. The getIcon function takes a name as its argument and returns the reference to the appearance stream object according to the AP names dictionary. This explains our approach of binding the names to animation frames: here we use the names for retrieving them. The buttonSetIcon method sets the object's appearance to the given icon.

Line 76 uses the same construct as line 69 to handle situations in which the action is relaunched even if the animation is not finished yet. It aborts the previous action. It would have been an error had the animation not been running, hence we must use the exception-catching approach.

6 Placing the Animation

The animation is placed on an interactive form field, a special type of annotation. There are two primitives in pdfTEX, \pdfstartlink and \pdfendlink, to produce annotations. They are intended to insert hyperlink annotations but can be used for creating other annotations as well. The corresponding \pdfstartlink and \pdfendlink must reside at the same box nesting level. The resulting annotation is given the dimensions of the box that is enclosed by the primitives. We first create a box to contain the annotation. Note that both box and annotation size are determined by the frame itself; see line 91, where the basic frame is placed into the regular page content.

We will turn now to the respective entries in the annotation dictionary. The annotation is to be an interactive form field (/Subtype /Widget). There are many field types (FT). The only one that can take any appearance and change it is the pushbutton, a special kind of the button field type (/FT /Btn). The type of button is given by the field bit flags entry Ff; the pushbutton has to have bit flag 17 set (/Ff 65536). To be able to address the field from JavaScript it has to be assigned a name; we have assigned the name animation to it as mentioned above (/T (animation)). Finally, we define the appearance characteristics dictionary MK. Its only entry, /TP 1, sets the button's appearance to consist only of an icon and no caption.

83 % an animation widget
84 \centerline{\hbox{%
85 \pdfstartlink user{
86 /Subtype /Widget /FT /Btn
87 /Ff 65536 /T (animation)
88 /BS << /W 0 >>
89 /MK << /TP 1 >> }%
90 \image{fr0}%
91 \pdfendlink}}

For the sake of brevity and clarity we are going to introduce only one control button in our example. However, we have defined a macro for creating control buttons to show a very simple way of including multiple control buttons. The \controlbutton macro takes one argument: the caption of the button it is to produce. The macro creates a pushbutton and binds it to an action defined like actionPlay.

We have chosen the control buttons to be pushbuttons again. They are a little different from the animation widget: they are supposed to look like buttons. The BS dictionary (border style) sets the border width to 1 point and the style to a 3D button look. The MK dictionary (appearance characteristics dictionary) sets the background color to 60% white and the caption (line 98). The /H /P entry tells the button to push down when clicked. Finally, an action is bound to the button by setting the value of the A key.

92 % control button for a given action
93 \def\controlbutton#1{%
94 \hbox to 1cm{\pdfstartlink user{
95 /Subtype /Widget /FT /Btn
96 /Ff 65536 /T (Button#1)
97 /BS << /W 1 /S /B >>
98 /MK << /BG [0.6] /CA (#1) >>
99 /H /P /A \oref{action#1}
100 }\hfil\strut\pdfendlink}}

And finally, we add a control button that plays the animation, just below the animation widget.

101 % control button
102 \centerline{\hfil
103 \controlbutton{Play}\hfil}
104
105 \bye

7 External Animation

Let us modify the example a little so that the animation frames are taken from an external file. This has several consequences, which will be discussed at the relevant points in the code.

We are going to completely detach the animation frames from the document. As a result, we will need only the \insertobj and \oref macros from lines 1–23 of the previous example. Lines 26–31 are no longer required.
A problem arises here: the basic frame should be displayed in the animation widget when the document is opened for the first time. This can be accomplished by modifying the OpenAction dictionary at line 25 as follows.

\pdfcatalog{ /OpenAction <<
/S /JavaScript /JS (
var g = this.getField("animation");
g.buttonImportIcon(
"frames-ex.pdf",0);
this.pageNum = 0;
this.zoomType = zoomtype.fitP;
) >> }

This solution suffers from the bug mentioned in the "Animation Dynamics" section: every time the bug comes into play, the animation widget will be empty until the user performs an action.

We still need an AcroForm dictionary, so lines 32–46 are left unchanged. Lines 47–58, on the other hand, must be omitted, as we have nothing to name. We are going to use the same animation as in the previous example, so lines 59–61 are left untouched. There is one modification of the JavaScript code to be made: the buttonSetIcon function call is to be replaced by

g.buttonImportIcon(
"frames-ex.pdf", cntr);

We used the basic frame to determine the size of the widget in the previous example. This is impossible now, because the size has to be set at compile time. The replacement for lines 83–91 is as follows.

% an animation widget
\centerline{\hbox to 6cm{%
\vrule height 6cm depth 0pt width 0pt
\pdfstartlink user{
/Subtype /Widget /FT /Btn
/Ff 65536 /T (animation)
/BS << /W 0 >>
/MK << /TP 1
/IF << /SW /A /S /P
/A [0.5 0.5] >> >> }%
\hfil\pdfendlink}}

The dimensions of the widget are specified explicitly, and an IF (icon fit) dictionary is added to the attributes of the pushbutton so that the frames will always (/SW /A) be proportionally (/S /P) scaled to fit the widget. Moreover, the frames are to be centered in the widget (/A [0.5 0.5]), which would be the default behavior anyway. The basic frame is not placed into the document; there is only glue instead.

Lines 92–105 need not be modified.

8 Two Notes on Animation Frames

The examples with full TEX source files can be found at http://www.fi.muni.cz/~xholecek/animations/. As one can see in these examples, the all-in-one approach allows all frames to share a single background, which is formed by the frame actually inserted into the page. However, it is possible to overlay pushbuttons. Elaborate constructions, the simplest of which is to use a common background frame in the example with external animations, can be achieved in conjunction with transparency.

One must ensure the proper size of all frames when fitting them into the widget. We have encountered situations (the given example being one of them) where the bounding box of METAPOST-generated graphics with a TEX label was not set properly using \convertMPtoPDF, and a white line had to be drawn around the frames as a workaround to force the proper bounding box.

9 Animations in Other Formats

It is fair to list and compare other possible ways of creating animations. In this section we give a brief overview of other formats and technologies capable of handling animations.

9.1 GIF

One of the versions of the GIF format is GIF89a, which allows multi-image support, so that bitmap-only animations can be encoded within a single GIF file. The GIF format supports transparency, interlacing and plain text blocks. It is widely supported in Internet browsers. However, there are licensing problems due to the compression methods used, and the format is not supported in freely available TEXware.

9.2 SWF

The SWF format by Macromedia allows storing frame-based animations, created e.g. by Macromedia's Flash authoring tool. The SWF authoring tools have to compute all the animation frames at export time. As proprietary Flash plug-ins for a wide range of Internet browsers are available, animations in SWF are relatively portable. The power of SWF can be enriched with scripting by ActionScript. At the time of writing, we are not aware of any TEXware supporting SWF.

9.3 Java

One can certainly program animations in a general programming language like Sun's Java. The drawback is that there are high demands on one's
programming capabilities in Java when creating portable animations. With NTS (a TEX reimplementation in Java), one can possibly combine TEX documents with fully featured animations, at the expense of studying numerous available classes, interfaces and methods.

9.4 DOM

It is possible to reference every element in an HTML or XML document by means of the W3C's Document Object Model (DOM), a standard API for document structure.

DOM offers programmers the possibility of implementing animations with industry-standard languages such as Java, or scripting languages such as ECMAScript, JavaScript or JScript.

9.5 SVG

The most promising language for powerful vector graphic animation description seems to be Scalable Vector Graphics (SVG), a W3C recommendation [12]. It is being developed for XML graphical applications, and since SVG version 1.1 there is rich support for animations. The reader is invited to look at the freely available book chapter [13] about SVG animations on the publisher's web site, or to read [4] about the first steps of SVG integration into the TEX world. There are freely available SVG viewers from Adobe (browser plug-in), Corel, and the Apache Foundation (Squiggle).

SVG offers even smaller file sizes than SWF or our method. The description of animations is time-based, using another W3C standard, SMIL, the Synchronised Multimedia Integration Language. The author can change only one object or one of its attributes in the scene at a time, allowing detailed control of animated objects in the declarative XML manner. Compared to our approach, this means a much wider range of possibilities for creators of animations.

The SVG format is starting to be supported in TEXware. There are SVG backends in VTEX and BaKoMaTEX, and a program Dvi2Svg by Adrian Frischauf, available at http://www.activemath.org/~adrianf/dvi2svg/. Another implementation of a DVI to SVG converter, in C, is currently being developed by Rudolf Sabo at the Faculty of Informatics, Masaryk University in Brno.

10 Conclusions

We have shown a method of preparing both space-efficient and high-quality vector frame-based animations in PDF format using only freely available, TEX-integrated tools.

11 Acknowledgments

The authors thank Oleg Alexandrov and Karl Berry for comments on an early draft of the paper. The work has been supported by VZ MSM 143300003.

References

[1] Adobe Systems Incorporated. Acrobat JavaScript Object Specification, Version 5.1, Technical Note #5186. Technical report, Adobe, 2003. http://partners.adobe.com/asn/developer/pdfs/tn/5186AcroJS.pdf.

[2] Adobe Systems Incorporated. PDF Reference: Adobe Portable Document Format Version 1.5. Addison-Wesley, Reading, MA, USA, fourth edition, August 2003.

[3] Zuzana Došlá, Roman Plch, and Petr Sojka. Mathematical Analysis with Maple: 2. Infinite Series. CD-ROM, http://www.math.muni.cz/~plch/nkpm/, December 2002.

[4] Michel Goossens and Vesa Sivunen. LaTeX, SVG, Fonts. TUGboat, 22(4):269–280, October 2001.

[5] Hans Hagen. The Calculator Demo, Integrating TEX, METAPOST, JavaScript and PDF. TUGboat, 19(3):304–310, September 1998.

[6] Petr Olšák. TEXbook naruby (in Czech). Konvoj, Brno, 1997.

[7] Petr Sojka. Animations in PDF. In Proceedings of the 8th Annual Conference on Innovation and Technology in Computer Science Education, ITiCSE 2003, page 263, Thessaloniki, 2003. Association for Computing Machinery.

[8] Petr Sojka. Interactive Teaching Materials in PDF using JavaScript. In Proceedings of the 8th Annual Conference on Innovation and Technology in Computer Science Education, ITiCSE 2003, page 275, Thessaloniki, 2003. Association for Computing Machinery.

[9] Donald P. Story. AcroTEX: Acrobat and TEX team up. TUGboat, 20(3):196–201, September 1999.

[10] Donald P. Story. Techniques of introducing document-level JavaScript into a PDF file from LaTeX source. TUGboat, 22(3):161–167, September 2001.

[11] Hàn Thế Thành. Micro-typographic extensions to the TEX typesetting system. TUGboat, 21(4):317–434, December 2000.

[12] W3C. Scalable Vector Graphics (SVG) 1.1 Specification, January 2003.

[13] Andrew H. Watt. Designing SVG Web Graphics. New Riders Publishing, September 2001.
Arabic Mathematical e-Documents
Mustapha Eddahibi
[email protected]

Azzeddine Lazrek
[email protected]

Khalid Sami
[email protected]

Department of Computer Sciences, Faculty of Sciences, University Cadi Ayyad
P.O. Box 2390, Marrakech, Morocco
http://www.ucam.ac.ma/fssm/RyDArab
Abstract
What problems do e-documents with mathematical expressions in an Arabic presentation pose? In addition to the known difficulties of handling mathematical expressions based on Latin script on the web, Arabic mathematical expressions flow from right to left and use specific symbols with a dynamic cursivity. How might we extend the capabilities of tools such as MathML in order to structure Arabic mathematical e-documents? Those are the questions this paper will deal with. It gives a brief description of some steps toward an extension of MathML to mathematics in Arabic exposition. In order to evaluate it, this extension has been implemented in Mozilla.

Keywords: Mathematical expressions, Arabic mathematical presentation, Multilingual documents, e-documents, Unicode, MathML, Mozilla.
1 Overview be converted to and from other mathematical for- It is well known that HTML authoring capabilities mats, they need encoding with respect to both the are limited. For instance, mathematics is difficult mathematical notation and mathematical meaning. to search and web formatting is poor. For years, The mathematical markup language MathML most mathematics on the web consisted of texts [14] offers good solutions to the previous problems. with scientific notation rendered as images. Image- MathML is an XML application for describing math- based equations are generally harder to see, read ematical notation and capturing both its structure, and comprehend than the surrounding text in the for high-quality visual display, and content, for more browser window. Moreover, the large size of this semantic applications like scientific software. XML kind of e-document can represent a serious prob- stands for eXtensible Markup Language. It is de- lem. These problems become worse when the doc- signed as a simplified version of the meta-language ument is printed. For instance, the resolution of SGML used, for example, to define the grammar and the equations will be around 72 dots per inch, while syntax of HTML. One of the goals of XML is to the surrounding text will typically be 300 or more be suitable for use on the web by separating the dots per inch. In addition to the display problems, presentation from the content. At the same time, there are encoding difficulties. Mathematical ob- XML grammar and syntax rules carefully enforce jects can neither be searched nor exchanged between document structure to facilitate automatic process- software systems nor cut and pasted for use in dif- ing and maintenance of large document collections. ferent contexts nor verified as being mathematically MathML enables mathematics to be served, re- correct. As mathematical e-documents may have to ceived, and processed on the web, just as HTML
42 Preprints for the 2004 Annual Meeting

has enabled this functionality for text. MathML elements can be included in XHTML documents with namespaces, and links can be associated to any mathematical expression through XLink. Of course, there are complementary tools. For instance, the project OpenMath [12] also aims at encoding the semantics of mathematics, without being in competition with MathML.

Now, what about some of the MathML internationalization aspects — say, for instance, its ability to structure and produce e-documents based on non-Latin alphabets, such as mathematical documents in Arabic?

2 Arabic Mathematical Presentation

Arabic script is cursive. Small curves and ligatures join adjacent letters in a word. The shapes of most of the letters are context dependent; that is, they change according to their position in the word. Certain letters have up to four different shapes.

Although some mathematical documents using Arabic-based writing display mathematics in Latin characters, in general not only the text is encoded with the Arabic script: mathematical objects and expressions are also encoded with special symbols flowing from right to left according to the Arabic writing. Moreover, some of these symbols are extensible.

Mathematical expressions are for the most part handwritten and introduced as images. A highly-evolved system of calligraphic rules governs Arabic handwriting. Though Arabic mathematical documents written by hand are sometimes of fair quality, the mere presentation of scientific documents is no longer enough, since there is a need for searchability, for using them in software, and so on.

The RyDArab [8] system makes it possible to compose Arabic mathematical expressions of high typographical quality. RyDArab complements TEX for typesetting Arabic mathematical documents. RyDArab uses the Computer Modern fonts and those of Ω [4] or ArabTEX [7]. The output is DVI, PS, PDF or HTML with mathematical expressions as images. The RyDArab [2] system does not replace or modify the functionality of the TEX engine, so it does not restrict in any way the set of macros used for authoring. Automatic translation from and to Latin-based expressions is provided beginning with the latest RyDArab version. Will this be enough to structure and typeset e-documents with mathematics even when they are based on an alternative script? Starting from this material with TEX and Ω, will MathML be able to handle Arabic mathematics?

3 MathML and Arabic Mathematics

Of course, semantically speaking, an Arabic mathematical expression is the same as a Latin-based one. Thus, only display problems need be taken into account. In any case, encoding semantics is beyond the scope of this paper.

In order to know whether there really is a need to construct a new tool or only to improve an already available one, what are the possibilities offered by the known MathML renderers? As much of the work is built around TEX, an open source community effort, it is hard to be precise about the current status of all TEX/MathML related projects. Most of these projects belong to one of three basic categories:

• Conversions from TEX to MathML. Of particular note here are Ω [5, 6] and TeX4ht [13], a highly specialized editor/DVI driver. Both of these systems are capable of writing presentation MathML from TEX documents. There are other converters such as LaTeX2HTML and tralics [1].

• Conversions from MathML to TEX. The conversion from MathML to TEX can be done, for instance, through reading MathML into Mathematica or other similar tools and then saving the result back out as TEX, or using Scientific WorkPlace for suitable LaTeX sources. The ConTEXt system is another example.

• Direct typesetting of MathML using TEX.

Currently, MathML is supported by many applications. This fact shows not only that it is the format of choice for publishing equations on the web but also that it is a universal interchange format for mathematics. More than twenty implementations are listed on the MathML official website, showing that all categories of mathematical software can handle MathML. Actually,

• most mathematical software, such as Scientific WorkPlace, Maple, MathCad and Mathematica, can export and import MathML;

• all common browsers can display MathML equations either natively or through the use of plug-ins;

• editors such as MathType, Amaya, TEXmacs, and WebEQ support MathML.

Once non-free or non-open-source tools are omitted, two web browsers remain: the well-known Mozilla system [11] and Amaya. The W3C's Amaya editor/browser allows authors to include mathematical expressions in web pages, following the MathML specification. Mathematical expressions are handled as structured components, in the same way and in the same environment as HTML elements.
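To make the presentation markup concrete, here is a minimal MathML fragment for a Latin-based expression, followed by a right-to-left sketch of the kind of markup an Arabic extension requires. The fragment is ours, not from the paper, and the dir="rtl" attribute is hypothetical here: MathML 2.0, current as these papers were written, has no such attribute (a later revision, MathML 3, adopted one).

```xml
<!-- Standard presentation markup for x^2 + 4 -->
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <msup><mi>x</mi><mn>2</mn></msup>
    <mo>+</mo>
    <mn>4</mn>
  </mrow>
</math>

<!-- Sketch of an Arabic counterpart: dir="rtl" is an assumed extension,
     not part of MathML 2.0; the tokens hold Arabic letters and digits. -->
<math xmlns="http://www.w3.org/1998/Math/MathML" dir="rtl">
  <mrow>
    <msup><mi>س</mi><mn>٢</mn></msup>
    <mo>+</mo>
    <mn>٤</mn>
  </mrow>
</math>
```

The structural markup is identical in both cases; what an Arabic extension must add is the overall right-to-left flow and mirrored, stretchy operator glyphs.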
the square root symbol is composed of the square
For the elements
The following mathematical expression is an example of the use of

References
Dynamic Arabic Mathematical Fonts
Mostafa Banouni [email protected]
Mohamed Elyaakoubi [email protected]
Azzeddine Lazrek [email protected] Department of Computer Sciences Faculty of Sciences University Cadi Ayyad P.O. Box 2390 Marrakech, Morocco http://www.ucam.ac.ma/fssm/rydarab
Abstract

This contribution describes a font family designed to meet the requirements of typesetting mathematical documents in an Arabic presentation. Thus, not only is the text written in an Arabic alphabet-based script, but specific symbols are used and mathematical expressions also spread out from right to left. Actually, this font family consists of two components: an Arabic mathematical font and a dynamic font. The construction of this font family is a first step of a project aiming at providing a complete and homogeneous Arabic font family, in the OpenType format, respecting Arabic calligraphy rules.

Keywords: Mathematical font, Dynamic font, Variable-sized symbols, Arabic mathematical writing, Multilingual documents, Unicode, PostScript, OpenType.
1 Overview

The Arabic language is native for roughly three hundred million people living in the Middle East and North Africa. Moreover, the Arabic script is used, in various slightly extended versions, to write many major languages such as Urdu (Pakistan), Persian and Farsi (Iran, India), or other languages such as Berber (North Africa), Sindhi (India), Uyghur, Kirgiz (Central Asia), Pashtun (Afghanistan), Kurdish, Jawi, Baluchi, and several African languages. A great many Arabic mathematical documents are still written by hand. Millions of learners are concerned in their daily learning by the availability of systems for typesetting and structuring mathematics.

Creating an Arabic font that follows calligraphic rules is a complex artistic and technical task, due in no small part to the necessity of complex contextual analysis. Arabic letters vary their form according to their position in the word and according to the neighboring letters. Vowels and diacritics take their place over or under the characters, and that is also context dependent. Moreover, the kashida, a small flowing curve placed between Arabic characters, is to be produced and combined with characters and symbols. The kashida is also used for text justification. The techniques for managing the kashida are similar to those that can be used for drawing curvilinear extensible mathematical symbols, such as sum, product or limit.

There are several Arabic font styles. Of course, it is not easy to make available all existing styles. The font style Naskh was the first font style adopted for computerization and standardization of Arabic typography. So far, only Naskh, Koufi, Ruqaa, and to a limited extent Farsi have really been adapted to the computer environment. Styles like Diwani or Thuluth, for example, don't allow enough simplification; they have a great variation in character shapes, the characters don't share the same baseline, and so on.
Considering all that, we have decided to use the Naskh style for our mathematical font.
The RyDArab [10] system was developed for the purpose of typesetting Arabic mathematical expressions, written from right to left, using specific symbols. RyDArab is an extension of the TEX system. It runs with K. Lagally's Arabic system ArabTEX [8] or with Y. Haralambous and J. Plaice's multilingual Ω [6] system. The RyDArab system uses characters belonging to the ArabTEX font xnsh or to the omsea font of Ω, respectively. Further Arabic alphabetic symbols in different shapes can be brought from the font NasX that has been developed, for this special purpose, using METAFONT. The RyDArab system also uses symbols from Knuth's Computer Modern family, obtained through adaptation to the right-to-left direction of Arabic.

Since different fonts are in use, it is natural that some heterogeneity will appear in mathematical expressions typeset with RyDArab [9]. Symbol sizes, shapes, levels of boldness and positions on the baseline will not quite be in harmony. So, we undertook building a new font in OpenType format with two main design goals: on the one hand, all the symbols will be drawn with harmonious dimensions, proportions, boldness, etc., and on the other hand, the font should contain the majority of the symbols in use in scientific and technical writing based on an Arabic script.

Both Arabic texts and mathematical expressions need some additional variable-sized symbols. We used the CurExt [11] system to generate such symbols. This application was designed to generate automatically curvilinear extensible symbols for TEX with the font generator METAFONT. The new extension of CurExt does the same with the font generator PostScript.

While METAFONT generates bitmap fonts and thus remains inside the TEX environment, OpenType [14] gives outline and multi-platform fonts. Moreover, since Adobe and Microsoft have developed it jointly, OpenType has become a standard combining the two technologies TrueType and PostScript. In addition, it offers some additional typographic layout possibilities thanks to its multi-table feature.

2 A Mathematical Font

The design and the implementation of a mathematical font are not easy [5]. It becomes harder when it is oriented to Arabic presentation. Nevertheless, independent attempts to build an Arabic mathematical font have been undertaken. In fact, F. Alhargan [1] has sent us proofs of some Arabic mathematical symbols in TrueType format.

Now we will describe the way we constructed the OpenType Arabic mathematical font RamzArab. The construction of the font started by drawing the whole family of characters by hand. This task was performed by a calligrapher. Then the proofs were scanned to transform them into vectors. The scanning tools alone don't produce a satisfying result, so once the design is finalized, the characters are processed and analyzed using special software to generate the file defining the font.

In Arabic calligraphy, the feather's head (kalam) is a flat rectangle. The writer holds it so that the largest side makes an angle of approximately 70° with the baseline. Except for some variations, this orientation is kept all along the process of drawing the character. Furthermore, as Arabic writing goes from right to left, some boldness is produced around segments from top left toward the bottom right and, conversely, segments from top right to the bottom left will rather be slim, as in Figure 1.

Figure 1: Sum symbol with vertical then horizontal mirror image.

The RamzArab font in Figure 4 contains only symbols specific to Arabic mathematics presentation, plus some usual symbols found even in text mode. It is mainly composed of the following symbols:

• alphabetic symbols: Arabic letters in various forms, such as characters in isolated standard form, isolated double-struck, initial standard, initial with tail, initial stretched, and with loop;
• punctuation marks;
• digits as used in the Maghreb Arab (North Africa), and as they are in the Machreq Arab (Middle East);
• accents to be combined with alphabetic symbols;
• ordinary mathematical symbols such as delimiters, arithmetic operators, etc.;
• mirror images of some symbols such as sum, integral, etc.

In Arabic mathematics, the order of the alphabetic symbols differs from the Arabic alphabetic order. Some problems can appear with the alphabetic symbols in their multi-form.

Generally, in Arabic mathematical expressions, alphabetic symbols are written without dots or diacritics. This helps to avoid confusion with accents. The dots can be added whenever they are needed, however. Thus, few symbols are left.

Moreover, some deviation from the general rules will be necessary: in a mathematical expression, the isolated form of the letter ALEF can be confused with the Machreq Arab digit ONE. The isolated form of the letter HEH can also present confusion with the Machreq Arab digit FIVE. The choice of alternative glyphs to denote these two characters will help to avoid such confusions. Even though these glyphs are not in conformity with the homogeneity of the font style and calligraphic rules, they are widely used in mathematics. In the same way, the isolated form of the letter KAF, resulting from the combination of two other basic elements, will be replaced by the KAF glyph in Ruqaa style.

For the four letters ALEF, DAL, REH and WAW, the initial and the isolated forms are the same, and these letters will be withdrawn from the list of letters in initial form. On the other hand, instead of a unique cursive element, the stretched form of each of the previous letters will result from the combination of two elements. It follows that these letters will not be present in the list of the letters in the stretched form.

The stretched form of a letter is obtained by the addition of a MADDA-FATHA or ALEF in its final form to the initial form of the letter to be stretched. The glyph of LAM-ALEF has a particular ligature that will be added to the list. The stretched form of a character is used if there is no confusion with any usual function abbreviation (such as the abbreviation of the sine function). The form with tail is obtained starting from the initial form of the letter followed by an alternative of the final form of the letter HEH. These two forms are not integrated into the font because they can be obtained through a simple composition. The form with loop is another form of letters with a tail. It is obtained through the combination of the final form with a particular curl that differs from one letter to another. This form will be integrated into the font because it cannot be obtained through a simple composition.

The elements that are used in the composition of the operators sum, product, limit and factorial in a conventional presentation are added also. These symbols are extensible. They are stretched according to the covered expression, as we will see in the next section.

Reversed glyphs, with respect to the vertical — and sometimes also to the horizontal — axis, as in Figure 1, are taken from the Computer Modern font family. Other symbols with mirror image forms already in use are not added to this font.¹ Of course, Latin and Greek alphabetic symbols can be used in Arabic mathematical expressions. In this first phase of the project, we aren't integrating these symbols into the font. They can be brought in from other existing fonts.

¹ The Bidi Mirrored property of characters used in Unicode.

3 A Dynamic Font

The composition of variable-sized letters and curvilinear symbols is one of the hardest problems in digital typography. In high-quality printed Arabic works, justification of the line is performed through using the kashida, a curvilinear variable lengthening of letters along the baseline. The composition of curvilinear extensible mathematical symbols is another aspect of dynamic fonts. Here, the distinction between fixed-size symbols and those with variable width, length, or with bidimensional variability, according to the mathematical expression covered by the symbol, is of great importance.

Certain systems [11] solve the problem of vertical or horizontal curvilinear extensibility through the a priori production of the curvilinear glyphs for certain sizes. New compositions are therefore necessary beyond these already available sizes. This option doesn't allow a full consideration of the curvilinearity of letters or composed symbols at large sizes. A better approach to get curvilinear letters or extensible mathematical symbols consists of parameterizing the composition procedure of these symbols. The parameters then give the system the required information about the size or the level of extensibility of the symbol to extend. As an example, we will deal with the particular case of the opening and closing parenthesis as a vertically extensible curvilinear symbol and with the kashida as a horizontally extensible curvilinear symbol. This can be generalized to any other extensible symbol.

The CurExt system was developed to build extensible mathematical symbols in a curvilinear way.
The previous version of this system was able to produce automatically certain dynamic characters, such as parentheses, using METAFONT. In this adaptation, we propose to use the Adobe PostScript Type 3 format [13].

The PostScript language defines several types of font: 0, 1, 2, 3, 9, 10, 11, 14, 32 and 42. Each one of these types has its own conventions to represent and to organize the font information. The most widely used PostScript font format is Type 1. However, a dynamic font needs to be of Type 3 [3].

The use of Type 3 loses certain advantages of Type 1, such as the possibility of producing hints; when the output device is of low resolution, and in the case of small glyphs, a purely geometrical treatment can't prevent the heterogeneity of characters. Another lost advantage is the possibility of using Adobe Type Manager (ATM) software. These two disadvantages won't arise in our case, since the symbols are generally without descenders or serifs, and the font is intended to be used with a composition system such as TEX, not directly in Windows.

The PostScript language [7] produces a drawing by building a path. Here, a path is a set of segments (lineto) and third-degree Bézier curves (curveto). The path can be open or closed on its origin (closepath). A path can contain several control points (moveto). Once a path is defined, it can be drawn as a line (stroke) or filled with a color (fill). From the graphical point of view, a glyph is a procedure defined by the standard operators of PostScript.

To parameterize the procedure, the form of the glyph has to be examined to determine the different parts of the procedure. This analysis allows determining exactly what should be parameterized. In the case of an opening or closing parenthesis, all the parts of the drawing depend on the size: the width, the length, the boldness and the end of the parenthesis completely depend on the size. Figure 2 shows the variation of the different parameters of the open parenthesis according to the height. We have chosen a horizontally-edged cap with a boldness equal to half of the boldness of the parenthesis. The same process is applied to the kashida.

Figure 2: Parametrization of dynamic parenthesis.

Producing a dynamic parenthesis such as that in Figure 3 follows these steps:

• collecting the various needed sizes in a parameter file par;
• generating a file pl with the local tool par2pl starting from the par file;
• converting the file pl into a metric file tfm with the application pltotf;
• compiling the document to generate a dvi file;
• converting the file from dvi to ps format.

This process should be repeated as many times as needed to resolve overlapping of extensible symbols.

The curvilinear parentheses are produced by CurExt (the Arabic-script input encoding is not reproducible here) instead of the straight parentheses given by the usual encoding in TEX:

$\left(
\matrix{1 & 2 & 3\cr
        4 & 5 & 6\cr
        7 & 8 & 9\cr
        0 & 1 & 2\cr}
\right)$

In the same way, we get the curvilinear kashida with CurExt.
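The parameterization can be sketched as a toy Type 3 font. This is our illustration, not CurExt's actual code: the /height entry, the glyph name and all coordinates are invented. The point is that a Type 3 glyph is a procedure, so its curveto control points can be computed from a size parameter instead of being frozen into the font:

```postscript
% Toy Type 3 font: one opening-parenthesis glyph whose Bezier control
% points are computed from the /height parameter (invented for this sketch).
8 dict begin
  /FontType 3 def
  /FontMatrix [0.001 0 0 0.001 0 0] def
  /FontBBox [0 0 400 1000] def
  /Encoding 256 array def
  0 1 255 { Encoding exch /.notdef put } for
  Encoding 40 /parenleft put                % character code 40 = '('
  /height 1000 def                          % the size parameter
  /BuildChar {                              % called with: font dict, char code
    exch begin                              % make the font dict current
      pop                                   % this sketch ignores the char code
      400 0 0 0 400 height setcachedevice
      newpath
      300 0 moveto
      100 height .25 mul 100 height .75 mul 300 height curveto  % outer edge
      240 height .70 mul 240 height .30 mul 300 40 curveto      % inner edge
      closepath fill
    end
  } def
  currentdict
end
/ParenSketch exch definefont pop
```

Re-running the snippet with a different /height yields a parenthesis whose length, curvature and cap scale together, which is what CurExt's parameter (par) file drives.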
This replaces the straight lengthened form obtained by RyDArab:

\amarabmath
${\lsum_{b=T-1}^{s}}$

We can stretch Arabic letters in a curvilinear way through the kashida by CurExt.

Figure 3: Generation of dynamic parentheses.

4 Conclusion

The main constraints observed in this work were:

• a close observation of the Arabic calligraphy rules, in the Naskh style, toward their formalization. It will be noticed, though, that we are still far from meeting all the requirements of Arabic calligraphy;
• the heavy use of some digital typography tools, rules and techniques.

RamzArab, the Arabic mathematical font in Naskh style, is currently available as an OpenType font. It meets the requirements of:

• homogeneity: symbols are designed with the same nib. Thus, their shapes, sizes, boldness and other attributes are homogeneous;
• completeness: it contains most of the usual specific Arabic symbols in use.

These symbols are about to be submitted for inclusion in the Unicode standard. This font is under test for Arabic mathematical e-documents [12] after having been structured for Unicode [2, 4]. The dynamic component of the font also works in PostScript under CurExt for some symbols such as the open and close parenthesis and the kashida. That will be easily generalized to other variable-sized symbols. The same adaptation can be performed within the OpenType format.

References

[1] http://www.linux.org.sa.
[2] Jacques André, Caractères numériques : introduction, Cahiers GUTenberg, vol. 26, 1997.
[3] Daniel M. Berry, Stretching Letter and Slanted-baseline Formatting for Arabic, Hebrew and Persian with ditroff/ffortid and Dynamic PostScript Fonts, Software–Practice & Experience, no. 29:15, 1999, pp. 1417–1457.
[4] Charles Bigelow and Kris Holmes, Création d'une police Unicode, Cahiers GUTenberg, vol. 20, 1995.
[5] Yannis Haralambous, Une police mathématique pour la Société Mathématique de France : le SMF Baskerville, Cahiers GUTenberg, vol. 32, 1999, pp. 5–19.
[6] Yannis Haralambous and John Plaice, Multilingual Typesetting with Ω, a Case Study: Arabic, Proceedings of the International Symposium on Multilingual Information Processing (Tsukuba), 1997, pp. 137–154.
[7] Adobe Systems Incorporated, PostScript Language Reference Manual, Second ed., Addison-Wesley, 1992.
[8] Klaus Lagally, ArabTEX — Typesetting Arabic with Vowels and Ligatures, EuroTEX '92 (Prague), 1992.
[9] Azzeddine Lazrek, Aspects de la problématique de la confection d'une fonte pour les mathématiques arabes, Cahiers GUTenberg, vol. 39–40, Le document au XXIe siècle, 2001, pp. 51–62.
[10] Azzeddine Lazrek, A package for typesetting Arabic mathematical formulas, Die TEXnische Komödie, DANTE e.V., vol. 13 (2/2001), 2001, pp. 54–66.
[11] Azzeddine Lazrek, CurExt, Typesetting variable-sized curved symbols, EuroTEX 2003 preprints, pp. 47–71 (to appear in TUGboat).
[12] Mustapha Eddahibi, Azzeddine Lazrek and Khalid Sami, Arabic mathematical e-documents, International Conference on TEX, XML and Digital Typography (TUG 2004, Xanthi, Greece), 2004.
[13] Włodzimierz Bzyl, The Tao of Fonts, TUGboat, vol. 23, 2002, pp. 27–39.
[14] Thomas W. Phinney, TrueType, PostScript Type 1 & OpenType: What's the Difference?, Version 2.00, 2001.
Figure 4: RamzArab Arabic mathematical font.
Creating Type 1 Fonts from METAFONT Sources: Comparison of Tools, Techniques and Results
Karel Píška
Institute of Physics, Academy of Sciences
182 21 Prague
Czech Republic
[email protected]
http://www-hep.fzu.cz/~piska/
Abstract

This paper summarizes experiences in converting METAFONT fonts to PostScript fonts with TEXtrace and mftrace, based on programs for autotracing bitmaps (AutoTrace and potrace), and with systems using analytic conversion (MetaFog and MetaType1, using METAPOST output or METAPOST itself). A development process is demonstrated with public Indic fonts (Devanagari, Malayalam). Examples from the Computer Modern fonts have also been included to illustrate common problems of conversion. Features, advantages and disadvantages of various techniques are discussed. Postprocessing — corrections, optimization and (auto)hinting — or even preprocessing may be necessary before even a primary contour approximation is achieved. To do fully automatic conversion of a perfect METAFONT glyph definition into perfect Type 1 outline curves is very difficult at best, perhaps impossible.

Keywords: font conversion, bitmap fonts, METAFONT, METAPOST, outline fonts, PostScript, Type 1 fonts, approximation, Bézier curves.
1 Introduction

In recent years, several free programs for creating PostScript outline fonts from METAFONT sources have been developed. The aim of this paper is to give a short comparison of these programs, with references to original sources and documentation, and to provide a brief description of their use. We will discuss advantages and drawbacks, and demonstrate numerous examples to compare important features and to illustrate significant problems. We omit technical details described in the original documentation and concentrate our attention on the quality of the output, including hinting issues.

The programs TEXtrace and mftrace read original METAFONT sources, generate high-resolution pk bitmaps, call autotracing programs (AutoTrace or potrace) and finally generate the files in the Type 1 format (pfb or pfa).

MetaType1 creates Type 1 output from METAPOST sources. Therefore it requires rewriting font definitions from METAFONT into METAPOST.

Similarly, MetaFog converts the PostScript files generated by METAPOST to other PostScript files containing only outlines, which can subsequently be assembled into Type 1 fonts. MetaFog is not a new product, but its excellent results remain, in our comparisons, unsurpassed.

Additionally, we may need adequate encoding files. If none are available, a TEX encoding (e.g., the standard TEX T1 encoding) is usually used as the default.

2 Autotracing Bitmaps

2.1 TEXtrace with AutoTrace

Péter Szabó developed TEXtrace [18]. It is a collection of Unix scripts. It reads the original METAFONT sources, rendering the font bitmaps into PostScript (via dvips). For converting the resulting bitmaps to outlines, it calls (in the version of 2001) the AutoTrace program [21] created by Martin Weber, and, finally, composites the final files in the Type 1 format. TEXtrace works fully automatically and can be invoked by a command like this:

bash traceall.sh mfname psname psnumber

where mfname.mf is the name of the METAFONT font, psname.pfb is the name of the Type 1 font file, and psnumber denotes a Type 1 UniqueID [1].

The Adobe Type 1 Font Format documentation [1, pp. 29–33] recommends observing certain Type 1
conventions: 1) points at extremes; 2) tangent continuity; 3) conciseness; and 4) consistency.

Figure 1: TEXtrace: "¨" in cmr10.

The outline results from TEXtrace (that is, from AutoTrace) are relatively faithful to the original bitmaps. Some artifacts exist, but they are invisible at usual font sizes and magnifications and for practical purposes may be negligible. Nonetheless, they spoil our attempts to automatically produce perfect, hinted, outline fonts.

The underlying reason is that the information about the control points in the original METAFONT is lost, and the Type 1 conventions are not satisfied, as exemplified in figure 1. The endpoints (double squares) are not placed at extremes (rule 1); most of the horizontal and vertical points of extrema are missing. On the other hand, the outline definition is not concise (rule 3) — due to the large numbers of control points in the glyph definitions, the font files generated by TEXtrace are huge. Furthermore, the two identical periods in the dieresis glyph "¨" are approximated by different point sets (rule 4).

The following examples show the results of conversion of Indic fonts submitted to TUG India 2002 [16], Devanagari (dvng10) and Malayalam (mm10). Typical irregularities produced by conversion with TEXtrace are bumps and holes. Figure 2 demonstrates bumps caused by the envelope being stroked along a path with a rapid change of curvature, and by cases of transition from a straight line to a significantly small arc. The second clipped part of the letter "pha" shows a hole.

Figure 2: Results of TEXtrace (AutoTrace): bumps and a hole (h).
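To make rule 1 ("points at extremes") concrete, here is what a hand-built period outline could look like in disassembled Type 1 charstring form: every on-curve node sits at a horizontal or vertical extreme, and four Bézier segments suffice. The coordinates are invented for illustration and are not taken from any of the fonts discussed.

```
37 154 hsbw                    % left sidebearing 37, advance width 154
0 40 rmoveto                   % move to the leftmost point: a horizontal extreme
0 22 18 18 22 0 rrcurveto      % left -> top: node lands on a vertical extreme
22 0 18 -18 0 -22 rrcurveto    % top -> right
0 -22 -18 -18 -22 0 rrcurveto  % right -> bottom
-22 0 -18 18 0 22 rrcurveto    % bottom -> left, closing the circle
closepath
endchar
```

Each rrcurveto starts and ends with a horizontal or vertical tangent, so the four nodes fall exactly where Type 1 hinting expects them; this is the property the autotraced outlines in figure 1 lack.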
Figure 3: Improved results achieved with postprocessing.

I tried to remove those bumps and holes, and (partially) other irregularities at the Type 1 level with a set of special programs, manually marking places to be changed in a "raw" text, translated by t1disasm and translated back by t1asm after modifications (both programs are from the t1utils package [13]), which achieves a better outline approximation, as shown in figure 3. The postprocessing consisted of:
inserting missing extrema points, changing the first nodes of contour paths (if desirable), merging pairs (or sequences) of Bézier segments together, and joining nodes in horizontal or vertical straight parts to eliminate redundant nodes.

However, when this process was applied to the Malayalam fonts, we meet another problem: undetected corners (Figure 4). Instead of attempting to correct them, I stopped my postprocessing attempts, and switched to experiments with analytic methods of conversion.

Figure 4: TEXtrace (AutoTrace) first without and then with postprocessing for the Malayalam "a", showing undetected corners.

2.1.1 Examples of CM-super

Type 1 fonts [20] generated by Vladimir Volovich (first announced in 2001) inherit typical bugs produced by tracing bitmaps with AutoTrace (as invoked by TEXtrace), such as bumps and holes, improper selection of starting points of contour paths, and problems in distinguishing sharp corners from small arcs. We illustrate them in several following figures, in order to demonstrate that fixing such irregularities automatically is difficult.

In the period "." from the sfrm1000 font (its source is the original cmr10), an optimization cannot exclude the redundant node (fig. 5); it is still the starting point of the path.

Figure 5: CM-super: period in sfrm1000 and sfsi1000, with redundant node.

The minus "−" derived from cmr10 contains a bump, and the minus from cmtt10 two bumps (fig. 6). Moreover, these bumps have been hinted and have their own hints (probably as results of autohinting).

Figure 6: CM-super: minus in sfrm1000 and sftt1000, with bumps.

Figure 7: CM-super: "M" in sfrm1000 and "i" in sfsi1000.

In the letter "M" from cmtt10, we observe missing dishes, a hole and a bad approximation of an arc
In the letter "M" from cmtt10, we observe missing dishes, a hole and a bad approximation of an arc (fig. 7). On the contrary, in "i" the corners are not detected properly; we also have a hinted bump.

Creating Type 1 Fonts from METAFONT Sources: Comparison of Tools, Techniques and Results

2.2 TEXtrace with potrace

The 2003 version of TEXtrace supports alternative bitmap tracing with potrace [17], developed by Peter Selinger. In this version, the real corners are detected, or at least detected better than with AutoTrace (see fig. 8). Thus, bumps and holes have been suppressed, but smooth connections have often been changed to sharp corners (not present originally). While the bumps demonstrated violation of consistency and may produce invalid hinting zone coordinates (fig. 6), the unwanted sharp corners mean loss of tangent continuity (the middle clip in fig. 8). Unfortunately, the approximation does not preserve horizontal and vertical directions (the right clip); the stem edges are oblique — the difference between the two arrows on the left edge is 2 units in the glyph coordinate space.

Figure 8: TEXtrace (using potrace), with different corners.

2.3 mftrace

Han-Wen Nienhuys created mftrace [15, 3], a Python script which calls AutoTrace or potrace (as with TEXtrace) to convert glyph bitmap images to outlines. The results of tracing are thus expected to be very similar to those of TEXtrace. In fact, for the analyzed Indic fonts, they are identical, as we can see in the first image in figure 9 (compare with TEXtrace results in fig. 4). With the --simplify option, mftrace calls FontForge [22] (previously named PfaEdit) to execute postprocessing simplification; this helps to exclude redundant nodes from outline contours, as in the second image in figure 9.

Figure 9: mftrace without and with --simplify.

3 Analytic Conversions

3.1 MetaType1

MetaType1 [8, 9] is a programmable system for auditing, enhancing and generating Type 1 fonts from METAPOST sources. MetaType1 was designed by Bogusław Jackowski, Janusz M. Nowacki and Piotr Strzelczyk. The MetaType1 package is available from ftp://bop.eps.gda.pl/pub/metatype1 [10].

This "auditing and enhancing" is a process of converting the Type 1 font into MetaType1 (text) files, generating proof sheets, analysis, making corrections and regenerating modified Type 1 fonts. It is an important tool for checking, verifying and improving existing Type 1 fonts.

MetaType1 works with the METAPOST language. Therefore the METAFONT font sources must be converted/rewritten into METAPOST. Macro package extensions of METAPOST and other miscellaneous programs provide generation of the proper structure of the Type 1 format, evaluate hints (not only the basic outline curves), and create pfb and also afm and pfm files.

During the rewriting process, users define several parameters of the Type 1 font, including the PostScript font encoding — PostScript glyph names and their codes — because METAFONT sources do not contain this data in a form directly usable for Type 1 encoding vectors. METAFONT output commands have to be changed to their METAPOST alternatives. Similarly, it is necessary to substitute METAFONT commands not available in METAPOST, to define METAPOST variants of pen definitions and pen stroking, etc.

Alternative METAPOST commands are defined in the MetaType1 files fontbase.mp, plain_ex.mp, et al. Other (new) commands may be defined by the user. Correspondence between METAFONT and METAPOST is approximately as shown in the following table (of course, the details may vary from font to font):

    METAFONT            METAPOST
    fill path;          Fill path;
    draw path;          pen_stroke()(path)(glyph);
                        Fill glyph;
    penlabels(1,2);     justlabels(1,2);
    beginchar( ...      beginglyph( ...
    endchar;            endglyph;

Many METAFONT commands have no counterpart in METAPOST [6]. For example, operations with bitmap pictures: in METAPOST, font data is represented as PostScript curves, not bitmaps. As a result, writing METAPOST code that would produce equivalent results as the original METAFONT code using these or other such features would be very difficult.

After the basic conversion, the next step is removing overlaps (if any are present) using the MetaType1 command find_outlines. Figure 10 shows the results before and after overlap removal for the Malayalam vowel a (font mm10, using pen stroking with a circular pen). This operation is not necessary in METAFONT, since it generates bitmaps. In the METAPOST environment of PostScript outlines, however, we need to reduce overlapping curves to single or pairs of paths.

Figure 10: MetaType1 — primary outlines and overlap removal.

MetaType1 also allows insertion of commands for automatic computation of horizontal and vertical hints (FixHStems, FixVStems). The Type 1 font can be visualized in a proof sheet form containing the control point labels (numbers) and hinting zones (figure 11).

So far, so good. But there are two crucial problems. First, the Malayalam fonts designed by Jeroen Hellingman [5] use the command

currenttransform := currenttransform shifted (.5rm, 0);

So all the glyphs should be shifted to the right. METAFONT saves the transformation command and does this operation automatically. By contrast, in METAPOST we need to insert the shift commands explicitly in all glyph programs. Also the labels must be shifted! In my experiments, I did this shift operation later, before final assembly of the Type 1 fonts.

The second problem is that in MetaType1 (I used MetaType1 version 0.40 of 2003) a regular pen stroking algorithm is not available, only a simplified method of connecting the points 'parallel' to the
nodes on the path. Therefore the approximation of the envelope is not correct. For example, in Figure 12 it should be asymmetric, but it is symmetric. Inserting additional nodes cannot help, because the bisection results will again be asymmetric. The figure shows that the outline curves do not correspond to the real pen in two midpoint locations. The envelope there looks narrow, and it is in fact narrower than it should be. I hope that this problem could be solved in a future release, at least for pen stroking with a circular pen.

Figure 11: MetaType1 — proof sheet.
Figure 12: MetaType1 — bad pen stroking.

Even more serious is the situation with the rotated elliptic pen used in the Devanagari fonts designed by Frans Velthuis [19] (and also other Indic fonts derived from dvng). The absence of regular pen stroking in MetaType1 makes it impractical for such complicated fonts. MetaType1 approximates the pen statically in path nodes, tries to connect their static end points, and ignores the complicated dynamic correlations between the path, the pen and the envelope. Unfortunately, in this case the results of the envelope approximation are not correct and cannot be used (figure 13).

Figure 13: MetaType1 — Devanagari "i", "a".

3.2 MetaFog

Two programs using analytic conversion were presented in 1995. Basil K. Malyshev created his BaKoMa collection [14] and Richard J. Kinch developed MetaFog [11]. BaKoMa is a PostScript and TrueType version of the Computer Modern fonts. Malyshev's paper discusses some problems of conversion, especially regarding hinting, but his programs and detailed information about the conversion algorithm are not available.

Kinch created MetaFog along with weeder, which supports interactive processing of outlines, and a package for making final fonts from outlines generated by MetaFog in TrueType, Type 1 and other formats. MetaFog itself (I used an evaluation version graciously donated by Richard) reads the METAPOST output from the command:

mpost '&mfplain options;' input fontname.mf

Thus, the conversion (from METAFONT sources) is limited to fonts that can be processed by METAPOST, that is, that do not contain METAFONT-specific definitions and commands. MetaFog generates another PostScript file consisting only of the outline structures. A conversion process is also described in the paper written by Taco Hoekwater [7].

MetaFog evaluates outline contours and precisely computes envelopes of an elliptical pen stroking along a Bézier curve. We must note that the envelopes in general are not cubic Bézier curves, and their representation in a Type 1 font must be an approximation. The results for a circular pen, on the other hand, can be considered perfect. Figures 14 and 15 show an example of the Malayalam letter "a" (font mm10): the initial and final contours, and the final Type 1 font with control points (stroked version) and its visual comparison with the METAPOST output embedded in a Type 3 font, respectively.

Figure 14: MetaFog — initial input contour and final result.
Figure 15: MetaFog — final Type 1 font.

3.2.1 Problems with Complex Pen-stroking

A more complicated situation is the conversion of fonts using pen stroking with a rotated elliptical pen, such as the Devanagari font. Figure 16 illustrates this case. The initial input contour and final result contour (tta1) look good — in the first image we can see the projections of the pen in nodes corresponding to the METAFONT source. But exact comparison with the original METAPOST output embedded in a Type 3 font (tta2) and the primary MetaFog conversion displayed together with the METAPOST source (tta3) shows that this approximation is not correct. Because these elements are very common in the shapes of all but the simplest Devanagari glyphs, corrections are necessary.

I therefore applied a simple pen-dependent preprocessing step before the MetaFog conversion, adapting the METAPOST output by a modified form of bisection, as discussed in the paper by R. Kinch [11]. The preprocessing scans curves, searching for points where the path direction and the direction of the main axis of the pen coincide (namely 135°), and inserts these points as additional path nodes. In our case, the transformation matrix is cos θ · [1, 1, −1, 1], so we solve only a quadratic equation and can find 0, 1 or 2 (at most) of these points. This technique corrects the MetaFog approximation of all such occurrences in the dvng font. The result of this secondary MetaFog conversion with METAPOST source is shown in the last panel of Figure 16 (tta4).

Figure 16: MetaFog contours, METAPOST output, primary and secondary conversion on the METAPOST background.
Figure 17: MetaFog output before and after modification of METAPOST source.
Figure 18: Wingtip curves in METAPOST source.

Similar improvements for the Devanagari letters "a" and "pha" are shown in figure 17. For "pha", the first 135 degree node was already present
in the path defined by the METAFONT source (first panel, pha1); on the contrary, the second occurrence of a 135 degree point was absent, and therefore it was inserted in the METAPOST output (last panel, pha2). Of course, this improvement is not universal; it only solves a special problem with a special pen for a special font.

Figure 18 illustrates the movement of a rotated elliptical pen stepping along a "nice" path (panel 1). However, the correlations with the pen are not trivial: the changes of curvature of the outer wingtip curve do not have simple monotonic behavior, and the inner wingtip curve (panel 2) is even more complicated. This means that the wingtip curves pen-stroked along a single Bézier curve cannot be approximated by single Bézier curves (compare with the starting fig. 16, panel tta1); i.e., an envelope edge of a pen along a simple path is not simple.

3.2.2 Automatic Conversion Problems

A "dark side" of improving the curve approximation is the fragmentation of an envelope curve into many segments (often more than 10, and up to 16 in Devanagari!). We achieve a faithful approximation (limited only by numerical accuracy) at the expense of conciseness. To make up for this, postprocessing is needed. The original MetaFog output and the result of my (preliminary) optimization assembled into Type 1 fonts are shown in Figure 19.

Figure 19: MetaFog converted to Type 1 — before and after postprocessing.

Unfortunately, even a small computational inaccuracy can make automatic conversion and optimization impossible, and even make it very difficult to design postprocessing algorithms. In Figure 20, we demonstrate problems with the primary approximation of an envelope stroked by a rotated elliptical pen, and also difficulties with automatic optimization of the Devanagari ligature "d+g+r".

In the first panel of fig. 20, we observe an artifact produced by MetaFog due to a complicated correlation of the pen and the path. Fortunately, those cases are very rare (less than 1 % of glyphs in Devanagari).

In the second panel, the path, and subsequently the corresponding envelope edges, are not absolutely horizontal; thus (probably) MetaFog cannot properly find intersection points and join reconstructed outline components. Those defects are present in more than 12 % of the Devanagari glyphs. In all cases, they have been successfully solved manually with the interactive weeder program.

In the last two details in fig. 20 (the lower ending part of the left stem) we can see that both nodes of the left segment are outside the filled area boundary defined by the METAPOST curve. The outer wingtip edge is split there into many segments, some being straight lines — and they should not be, e.g., the first and the third segment marked by two arrows in the clip — their curvatures are for us undefined. Additionally, we cannot detect the last segment (magnified in the figure) as horizontal, because its angle is "greater than some ε". The figure gives a visual comparison of the METAPOST output with its outline approximation.

Thus, neither node coordinates, nor segment directions, nor curvatures are reliable. Therefore, my first and "simple" idea cannot succeed. This was to classify the behavior of directions and curvatures of all the segments automatically, then to divide segments into groups according to directions and curvatures, and then to automatically merge the groups into single Bézier segments. As demonstrated, this optimization may fail or produce incorrect results and, unfortunately, human assistance is needed.

Figure 20: MetaFog — problems with automatic conversion.

4 Summary

Here we summarize the most important features of the conversion programs found in our experiments.

4.1 Approximate Conversions: TEXtrace, mftrace

Advantages:
• approximation covers original METAFONT fonts and correspondence to pk bitmaps is (reasonably) close
• simple invocation, robust solution
• fully automatic processing can generate complete, final Type 1 fonts

Disadvantages:
• approximate conversions give only approximate outlines
• lost information about nodes and other control points
• final fonts do not satisfy the Type 1 conventions
• AutoTrace: problems with recognizing corners, generation of unwanted bumps and holes
• potrace: sharp connections, thus loss of tangent continuity, violation of horizontal or vertical directions
• automatic and correct (auto)hinting may yield poor results due to these irregularities
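Returning briefly to the node-insertion rule of section 3.2.1, it can be made concrete. A path direction of 135° means the tangent is parallel to the vector (−1, 1), i.e. x′(t) + y′(t) = 0; each derivative coordinate of a cubic Bézier is quadratic in t, so their sum is too, which yields the 0, 1 or 2 insertion points per segment mentioned there. A sketch under these assumptions (my own illustration, not the actual preprocessing code):

```python
def nodes_at_135_degrees(p0, p1, p2, p3, eps=1e-12):
    """Parameters t in (0, 1) where the tangent of a cubic Bezier is
    parallel to the 135-degree pen axis, i.e. x'(t) + y'(t) = 0.
    The sum of two quadratics is a quadratic, so at most two such
    points exist per segment."""
    qa = qb = qc = 0.0
    for axis in (0, 1):
        # per-axis derivative: (a - 2b + c) t^2 + 2(b - a) t + a, times 3
        a = p1[axis] - p0[axis]
        b = p2[axis] - p1[axis]
        c = p3[axis] - p2[axis]
        qa += a - 2 * b + c
        qb += 2 * (b - a)
        qc += a
    if abs(qa) < eps:                      # degenerate: linear equation
        roots = [-qc / qb] if abs(qb) > eps else []
    else:
        disc = qb * qb - 4 * qa * qc
        if disc < 0:
            roots = []
        else:
            r = disc ** 0.5
            roots = [(-qb - r) / (2 * qa), (-qb + r) / (2 * qa)]
    return sorted(t for t in roots if 0 < t < 1)
```

For the arch (0,0), (1,0), (1,1), (0,1) the tangent passes through the 135° direction once, at t = 1/√2, where the derivative is proportional to (−1, 1).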
4.2 MetaType1

Advantages:
• complete support for Type 1 font generation
• manual insertion of hinting information possible via simple hinting commands
• font file compression via subroutines

Disadvantages:
• conversion of METAFONT to METAPOST often requires manual rewriting, possibly non-trivial and time-consuming
• bad pen stroking algorithm; in particular, results for complicated fonts using rotated elliptical pens are unusable
• difficulties with removing overlaps in tangential cases

4.3 MetaFog

Advantages:
• fully automatic conversion of METAPOST output to outlines
• "typical" fonts usually achieve perfect results
• even for very complex fonts (again, with rotated elliptical pens), adaptations of METAPOST output and manual editing with weeder make it plausible to obtain perfect outlines
• results fulfill the Type 1 conventions in most cases (except for those very complex fonts)

Disadvantages:
• MetaFog reads METAPOST output, thus cannot process METAFONT-specific definitions
• complex fonts may still need manual reduction with weeder or subsequent optimization of outlines to reach conciseness
• processing is slow

4.4 Final Font Processing and Common Problems

The conversion systems discussed here, with the exception of MetaType1, do not include internal hinting subsystems. To insert hints, we can use font editors, for example FontForge [22]. For successful automatic hinting, however, the font outlines must fulfill certain conditions. Irregularities — the absence of nodes at extrema or the presence of bumps and holes — are not compatible with autohinting, because extrema points correspond to hinting zones while bumps or holes do not fit them, thus causing outliers. The resulting difference of ±1 unit in the integer glyph coordinate system, after rounding to integers, is not acceptable for high-quality fonts. Problems may also be caused by other "rounding to integer" effects, and by the presence of close doublets or triplets.

In my view, these experiments show that the quality of the primary outline approximation is crucial to achieve perfect final Type 1 fonts. It is virtually impossible to recreate discarded METAFONT information, or to find exact conditions for a secondary fit that corrects primary contours created with irregularities or artifacts. Starting with high-resolution bitmaps is problematic, as too much information has been lost, making subsequent processes of improvement, optimization and hinting difficult at best, not possible to automate, and usually not successful.

5 Acknowledgements

I would like to thank all the authors of the free conversion programs, Richard Kinch for donating his MetaFog and weeder, the authors of the public METAFONT fonts for Indic languages and other sources used in the contribution, and Karl Berry for help with editing of this article.

References

[1] Adobe Systems Inc. Adobe Type 1 Font Format. Addison-Wesley, 1990.
[2] Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. The AWK Programming Language. Addison-Wesley, 1988.
[3] Karl Berry. "Making outline fonts from bitmap images." TUGboat, 22(4), pp. 281–285, 2001.
[4] Free Software Foundation. GNU awk, http://www.gnu.org/software/gawk.
[5] Jeroen Hellingman. Malayalam fonts, CTAN:language/malayalam.
[6] John D. Hobby. A User's Manual for METAPOST. AT&T Bell Laboratories, Computing Science Technical Report 162, 1994.
[7] Taco Hoekwater. "Generating Type 1 Fonts from METAFONT Sources." TUGboat, 19(3), pp. 256–266, 1998.
[8] Bogusław Jackowski, Janusz M. Nowacki, and Piotr Strzelczyk. "MetaType1: A METAPOST-based Engine for Generating Type 1 Fonts." Proceedings of the XII EuroTEX 2001 conference, pp. 111–119, Kerkrade, the Netherlands, 23–27 September 2001.
[9] Bogusław Jackowski, Janusz M. Nowacki, and Piotr Strzelczyk. "Programming PostScript Type 1 Fonts Using MetaType1: Auditing, Enhancing, Creating." Preprints of the XIV EuroTEX 2003 conference, pp. 151–157, Brest, France, 24–27 June 2003 (to appear in TUGboat).
[10] MetaType1 distribution: ftp://bop.eps.gda.pl/pub/metatype1.
[11] Richard J. Kinch. "MetaFog: Converting METAFONT Shapes to Contours." TUGboat, 16(3), pp. 233–243, 1995.
[12] Donald E. Knuth. The METAFONTbook. Addison-Wesley, 1986. Volume C of Computers and Typesetting.
[13] Eddie Kohler. t1utils (Type 1 tools), http://freshmeat.net/projects/t1utils.
[14] Basil K. Malyshev. "Problems of the conversion of METAFONT fonts to PostScript Type 1." TUGboat, 16(1), pp. 60–68, 1995.
[15] Han-Wen Nienhuys. mftrace, http://www.cs.uu.nl/~hanwen/mftrace.
[16] Karel Píška. "A conversion of public Indic fonts from METAFONT into Type 1 format with TEXtrace." TUGboat, 23(1), pp. 70–73, 2002.
[17] Peter Selinger. potrace, http://potrace.sourceforge.net.
[18] Péter Szabó. "Conversion of TEX fonts into Type 1 format." Proceedings of the XII EuroTEX 2001 conference, pp. 192–206, Kerkrade, the Netherlands, 23–27 September 2001. http://www.inf.bme.hu/~pts/textrace; http://textrace.sourceforge.net.
[19] Frans J. Velthuis. Devanagari fonts, CTAN:language/devanagari.
[20] Vladimir Volovich. CM-super fonts, CTAN:fonts/ps-type1/cm-super.
[21] Martin Weber. AutoTrace, http://autotrace.sourceforge.net.
[22] George Williams. FontForge: A PostScript Font Editor, http://fontforge.sourceforge.net.
OpenType and Ω: Past, Present and Future
Yannis Haralambous
Département Informatique
École Nationale Supérieure des Télécommunications de Bretagne
CS 83818, 29238 Brest Cédex, France
[email protected]
Gábor Bella
[email protected]
Abstract This article presents our plans for integrating the OpenType font format into the Ω typesetting system. Beginning with a short summary of what we have achieved so far, we compare potential methods of adaptation. Since most OpenType- related issues we present are valid not only for Ω but for all TEX-like systems, the authors hope that the article will be interesting for the whole TEX community.
1 Past be reorganized in order to take advantage of certain The OpenType font format has been officially avail- OpenType features. able since 1997. In contrast to its predecessors, In the following, we give a quick overview of TrueType and PostScript Type 1 and 2, it provides the present solution and then explain why and how 1 essential features for proper typesetting of non-LGC it needs to be revised. scripts, as well as handling LGC ones. Although 2 Present: OpenType Support, the Quick competing formats with similar capabilities (Apple Way GX/AAT and Graphite) were and still are available, the marketing force behind OpenType seems strong The word ‘quick’ may seem ironic here: in total, enough to make it a de facto standard. it took us almost a year to produce our first re- The decision to bring OpenType support to Ω sults. However, this was due more to organizational as a complement to the already available fonts was problems than to the difficulty of the task. By ‘the taken sometime in 2002 and effective work began quick way’, we mean the solution that was the most straightforward and the easiest to implement. It is around early 2003. At the EuroTEX 2003 confer- ence, we presented our initial development plans described in detail in the EuroTEX article [1]; here, and our first results [1]. Since then, considerable only a short summary is given. progress has been made, so that in the final version The present solution is based on the approach that OpenType fonts should be converted to Ω’s own of the EuroTEX proceedings we are finally able to formats, i.e., OFM (metrics), OVF (virtual fonts) announce a working implementation [1]. 2 However, this implementation is not likely to and ΩTP (Ω Translation Process ). Anish Mehta become the final solution, for two reasons. First, wrote several Python scripts to generate these files, it was never intended to be such. 
Rather, it was of which the most interesting is perhaps the one that created to verify the validity of certain conversion converts the whole OpenType GSUB table (see next methods and also to provide users with the possibil- section) into ΩTP’s. ity to install and use OpenType fonts as quickly as Moreover, as the OpenType format is generally possible, even if not all advanced features are avail- not understood by PostScript printers, a conversion able yet. Secondly, as we will see later, complete and to Type 1 (or Type 42) is necessary. To speed up robust OpenType support cannot simply be patched the process, we create Type 1 charstring collections PFC onto the existing Ω: as the two do not always share using our own tables (see [1] for details) which the same philosophy, some parts of Ω will need to
2 An ΩTP describes a finite state automaton that can be used to filter the input text to perform tasks such as contex- 1 Latin-Greek-Cyrillic. tual analysis.
Preprints for the 2004 Annual Meeting 65 Yannis Haralambous and G´abor Bella are later used by odvips to create small, subsetted for typesetting many non-LGC scripts. In Ω, the Type 1 fonts (a.k.a. minifonts) on the fly. equivalent of GSUB features are the ΩTP’s: they can The above approach assumes that OpenType do everything GSUB features can, including contex- can be converted to OFM and ΩTP files without sig- tual operations. Glyph positioning is a different is- nificant loss of information. This is not always the sue: since the ΩTPs are designed for text rearrange- case. In the next section, we will show the main dif- ment (substitutions, reordering etc.), they are not ferences between the two formats and why we have suitable for doing glyph placement as easily. In fact, decided to abandon some of our results in favour the idea of microengines—ΩTP-like Ω plugins—was of a better concept. This does not mean that the introduced some time ago, exactly to provide mod- work done so far was useless: first, we managed to ular, script- and language-specific positioning meth- prove that the conversion to Type 1 (which we are ods, along the lines of ΩTP files. With the appear- planning to keep) is a viable approach and secondly, ance of OpenType fonts, it became clear that posi- we are able to provide Ω users with a working, al- tioning features as implemented by the GPOS table beit not complete and only temporary, OpenType and microengines provide essentially the same func- support. tionality. It should then be possible to implement microengines as GPOS features and vice versa. 3 Future: Alternatives of Adaptation A closely related problem is the fundamental An OpenType font consists mainly of the following difference between Ω’s and OpenType’s way of de- parts: scribing features. Although both use the Unicode 1. font and glyph metrics; encoding, OpenType’s GSUB and GPOS features are based on strings of glyph ID’s and not of Unicode 2. 
2. Type 2 or TrueType glyph outlines (and hints or instructions);
3. advanced typographic features (mainly GSUB and GPOS);
4. other data tables.

In one form or another, all of them will certainly need to be dealt with either inside Ω or odvips. Below, we give a very concise description of our plans regarding each of these fields.

3.1 Metrics
OpenType provides extensive font metric information dispersed among various tables (post, kern, hmtx, hdmx, OS/2, VORG, etc.), both for horizontal and vertical typesetting. In most cases, Ω's (or TEX's) and OpenType's metrics can be converted from one to another, with few but important exceptions (e.g., height/depth; see [1] and [2]). Despite the occasional differences, it seems desirable to dispose of the intermediary OFM files and read OpenType metrics directly from the font file.

3.2 Conversion
As explained in the last section, conversion of OpenType's Type 2 and TrueType outlines to Type 1 has already been implemented. We are also planning to provide Type 42 support for TrueType-flavoured OpenType that would also preserve instructions.

3.3 Advanced features: GSUB and GPOS
These advanced features are the most interesting part of OpenType. The GSUB (glyph substitution) and GPOS (glyph positioning) tables are essential [...] characters. Ω and some of its ΩTP's, on the other hand, perform tasks such as contextual analysis or hyphenation on character sequences, and the passage from characters to 'real' glyph ID's happens only when (o)dvips replaces virtual fonts by real ones. Moreover, it is theoretically impossible to convert a glyph-based OpenType feature into a character-based ΩTP, as some glyphs (allographs) may not even have Unicode equivalents.³ The solution to this problem is either to use glyph- and character-based ΩTP's at the same time, or else to read GSUB and GPOS features directly from the OpenType font, without conversion to ΩTP's or microengines.

Whichever method we choose, Ω will need to maintain both glyph and character representations of the same text in parallel, to be able to perform font-specific (OpenType features) and font-independent (hyphenation) operations at the same time. This dual representation of text is also crucial for the searchability and modifiability of the output (PDF, PS, SVG or any other) document.

3.4 Extensibility
Finally, the OpenType format has the important feature of being extensible: new tables can be added into the font file, containing, for example, data needed by Ω with no OpenType equivalents (such as metrics or PFC charstrings). Of course, it is necessary that the given font's license allows such additions.

³ There exist some workarounds, though: for example, to map these glyphs to the Private Use Area of Unicode, but this solution is not very elegant, to say the least.
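Reading metrics directly from the font file, as proposed in section 3.1, involves a unit conversion that the OFM route normally hides: OpenType advances are expressed in font units relative to the unitsPerEm value of the font's head table, while Ω computes in TEX scaled points (65536 sp = 1 pt). A minimal sketch of that conversion follows; the function name is ours, not Ω's:

```cpp
#include <cstdint>

// Convert an advance width expressed in OpenType font units into TeX
// scaled points (sp), for a font loaded at a given design size.
// 'units_per_em' comes from the font's 'head' table; a typical
// CFF-flavoured OpenType font uses 1000, a TrueType-flavoured one 2048.
inline std::int32_t font_units_to_sp(std::int32_t units,
                                     std::int32_t units_per_em,
                                     std::int32_t size_sp) {
    // 64-bit intermediate: units * size_sp can exceed 2^31.
    return static_cast<std::int32_t>(
        (static_cast<std::int64_t>(units) * size_sp) / units_per_em);
}
```

For example, at a design size of 10 pt (655360 sp), an advance of 500 units in a 1000-unit em comes out as 327680 sp, i.e. 5 pt.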
66 Preprints for the 2004 Annual Meeting OpenType and Ω: Past, Present and Future
4 Conclusions
It is time to draw conclusions from the arguments made above. First, OpenType seems capable of being a base font format for Ω. Its tabular file structure is flexible enough so that missing Ω-specific information can easily be added to fonts. Metrics can be read directly from the font file (with some exceptions) and then converted on the fly, if necessary. Also, virtual fonts are not useful for OpenType, as it is more reasonable to use a Unicode-based encoding. Thus, neither OFM nor OVF files will be needed when using OpenType fonts. This is one of the arguments against the approach presented in the previous section.

Secondly, in terms of capabilities, ΩTP's and GSUB features are very much compatible. The same can be said for GPOS and microengines, although no implementation exists for the latter yet. However, incompatibility is present on the character/glyph level. As was shown, a dual approach can be helpful, keeping strings of glyph ID's and characters in parallel; this is another reason why the simple conversion method described in the previous section is not adequate.

In summary, on the one hand we have opted for a tighter integration of Ω and OpenType, by directly reading from (e.g., metrics) and writing to the font file. On the other hand, advanced OpenType features such as glyph positioning and substitution necessitate modifications in Ω's character handling as well as in the ΩTP/microengine concept. The companion article in this volume will give further details on how we are planning to implement these changes.

References
[1] Gábor Bella and Anish Mehta: Adapting Ω to OpenType Fonts. EuroTEX 2003 Proceedings. Final version to appear in TUGboat.
[2] Yannis Haralambous and John Plaice: Ω and OpenType Fonts. Kyōto University 21st Century COE Program, 2003.
[3] The OpenType Specification v1.4. http://www.microsoft.com/typography/otspec/default.htm
Moving Ω to a C++-based Platform
John Plaice
School of Computer Science and Engineering
The University of New South Wales
UNSW Sydney NSW 2052, Australia
[email protected]
Paul Swoboda [email protected]
Abstract
The code for the Omega Typesetting System has been substantially reorganised. All fixed-size arrays implemented in Pascal Web have been replaced with interfaces to extensible C++ classes. The code for interaction with fonts and Omega Translation Processes (OTPs) has been completely rewritten and placed in C++ libraries, whose methods are called by the typesetting engine. The Pascal Web part of Omega no longer uses change files. The overall Omega architecture is now much cleaner than that of previous versions.
1 Introduction
Since the first paper on Omega was presented at the 1993 Aston TUG Conference, numerous experiments have been undertaken in the realm of multilingual typesetting and document processing. This overall work has given important insights into what a future document processing system, including high quality typesetting, should look like. See, for example, the articles published in the 2003 EuroTEX and TUG conferences. Clearly, building an extensive new system will require substantial effort and time, both at the design and the implementation levels. In this article, we present the interim solution for the stabilisation of the existing Omega code base, with a view towards preparing for the design and implementation of a new system.

The standard web2c infrastructure, which assumes that a binary is created from a single Pascal Web file and a single Pascal Web change file, is simply not well suited for the development of large-scale software, of any genre. For this reason, we have eliminated the change files, and broken up the Pascal Web file into chapter-sized files. All fixed-size arrays have been reimplemented in C++ using the Standard Template Library. Characters are now 32 bits, using the wchar_t data type, and character set conversion is done automatically using the routines available in the stdc++ library. The entire Pascal Web code for fonts and OTPs, including that of Donald Knuth, has been completely rewritten in C++ and placed in libraries. Clean interfaces have been devised for the use of C++ code from the remaining Pascal code.

This exercise serves two purposes. The first is to stabilise and to simplify the existing Omega distribution. The second is to lay the groundwork for a forthcoming, complete reimplementation of open typesetting software.

2 Problems with Pascal Web
When we examine the difficulties in creating Omega as a derivation of tex.web, we should understand that there is no single source for these difficulties.

Pascal was designed so that a single-pass compiler could transform a monolithic program into a running executable. Therefore, all data types must be declared before global variables; in turn, all variables must be declared before subroutines, and the main body of code must follow all declarations. This choice sacrificed ease of programming for ease of compiler development; the resulting constraints can be felt by anyone who has attempted to seriously modify the TEX engine.

Pascal Web attempts to alleviate this draconian language vision by allowing the arbitrary use within code blocks (called modules) of pointers to other modules, with a call-by-name semantics. The result is a programming environment in which the arbitrary use of GOTOs throughout the code is encouraged, more than ten years after Dijkstra's famous paper. Knuth had responded correctly to Dijkstra's paper, stating that the reasonable use of
GOTOs simplifies code. However, the arbitrary use of GOTOs across a program, implicit in the Pascal Web methodology, restricts code scalability. Knuth himself has stated that one of the reasons he stopped working on TEX was his fear that he might break it.

For a skilled, attentive programmer, developing a piece of code that is not going to evolve, it is possible to write working code, up to a certain level of complexity. However, for a program that is to evolve significantly, this approach is simply not tenable, because the monolithic Pascal vision is inherited in the change file mechanism of Pascal Web. Modifications to TEX are supposed to be undertaken solely using change files; the problem with this approach is that the vision of the code maintainer is that they are modifying functions, procedures, and so on. However, the real structure of a Pascal Web program is the interaction between the Pascal Web modules, not the functions and procedures that they define. As a result, maintaining a Pascal Web program is a very slow process. Back in 1993, when the first Omega work was being undertaken, "slow" did not just mean slow in design and programming, but also in compilation. The slightest modification required a 48-minute recompilation.

The size limitations created by compile-time fixed-size arrays are obvious and well known. This was addressed publicly by Ken Thompson in the early 1980s and has been addressed in Omega as well as in the web2c framework, simply by changing these sizes. However, there are other problems in the way these arrays are used. The eqtb, str_pool, font_info and mem arrays all have programming interfaces using fixed-size arrays. Whenever these interfaces are insufficient, the TEX code simply makes direct accesses to these arrays. As a result, any attempt to significantly modify these basic data structures requires the modification of the entire TEX engine, and not simply the implementations of the structural interfaces.

The single input buffer for all active files of tex.web turned out to be truly problematic for implementing Omega's OTPs. Since an OTP can read in an arbitrary amount of text before processing it, a new input buffer had to be introduced to do this collection. The resulting code is anything but elegant, and could certainly be made more efficient.

Finally, problems arise from the web2c implementation of Pascal Web. Many of the routines written in C to support the web2c infrastructure make the implicit assumption that all characters are 8 bits, making it difficult to generalise to Unicode, even though C itself has a datatype called wchar_t.

3 Suitability of C++
The advantages of the use of C++ as an implementation language for stream-oriented typesetting, over the Pascal Web architecture, are manifold. The chief reason for this is that the rich set of tools and methodologies that have evolved in the twenty-five years since the introduction of TEX includes developments not only in programming languages and environments, but in operating systems, file structure, multiprocessing, and in the introduction of whole new paradigms, including object-oriented software and generic programming.

The STL offers built-in support for arbitrary generic data structures and algorithms, including extensible, random-access arrays. It would be foolish to ignore such power when it is so readily available.

The Standard C++ library also provides built-in wide character support, including, in the GNU stdc++ implementation, the full iconv functionality for character-set conversion between Unicode and any other imaginably-used character set.

C++ is the de facto standard for object-oriented systems development, with its capability to provide low-level C-style access to data structures and system resources (and, in the case of Unix-like systems, direct access to the kernel system call API), for the sake of efficiency.

Since C++ is fully compatible with C, one can still take advantage of many existing libraries associated with TEX, such as Karl Berry's kpathsea file searching library.

The abilities to use well-known design patterns for generic algorithm support (plug-in paragraphers, generic stream manipulation), as well as generic representation of typesetting data itself, add a wealth of possibilities to future, open typesetting implementations.

4 Organisation of the Omega 2 Code Base
Obviously, we are moving on. Our objective is to include the existing Omega functionality, to stretch it where appropriate, leaving clean interfaces so that, if others wish to modify the code base, they can do so. Our current objective is not to rewrite TEX.

4.1 Pascal Web Components
The tex.web file has been split into 55 files called 01.web to 55.web. The tex.ch file has been converted into 55 files, 01.ch to 55.ch. Data structure by data structure, we have passed through the code, throwing out the definitions of the data structures and replacing their uses with Pascal procedure calls
which, once passed through the web2c processor, become C++ method calls. In the process, most of the code in the change files ends up either being thrown out, or directly integrated in the corresponding .web files.

4.2 Characters, Strings and Files
For characters, TEX has two types, ASCII_code and text_char, the respective internal and external representations of 8-bit characters. The new Omega uses the standard C/C++ data type, wchar_t. On most implementations, including GNU C++, wchar_t is a 32-bit signed integer, where the values 0x0 to 0x7fffffff are used to encode characters, and the value 0xffffffff (-1) is used to encode EOF. Pascal Web strings, as interpreted by the tangle program, are each assigned a str_number, where values 0 to 255 are reserved for the 256 8-bit characters. We have modified tangle so that the strings are numbered -256 downwards, rather than 256 upwards. Hence, str_number and wchar_t can be of the same data type.

When dealing with files, there are two separate issues: the file names, and the file content. Internally, all characters are 4-byte integers, but on most systems, file names are stored using 8-bit encodings, specified according to the user's locale. Hence, character-set conversion is now built into the file-opening mechanisms, be they for reading or writing.

The actual content of the files may come from anywhere in the world, and a single file system may include files encoded with many different encoding schemes. We provide the means for opening a file with a specified encoding, as well as opening a file with automatic character encoding detection, using a one-line header at the beginning of the file. The actual character set conversion is done using the stdc++ locale routines. As a result of these choices, the vast majority of the Omega code can simply assume that characters are 4-byte Unicode characters.

4.3 Fonts and OTPs
In terms of numbers of lines written, this is the most significant part of the new Omega; however, because we are using standard OO technology, it is also the most straightforward.

With respect to fonts, significant energy was expended, both in the original code as well as in previous Omega implementations, on bit packing of fields in binary font formats, which are stored in memory as they are on disk. By providing a simple OO interface in the character-level typesetter of the Omega engine, we have been able to greatly simplify the code for both the typesetter and the font utilities for conversion between formats.

Similarly, for the OTPs, filters can be implemented as function objects over streams using iterators, tremendously simplifying the code base.

5 Prospects
At the time we are writing, this work is not completely finished. Nevertheless, it is well advanced, and detailed documentation will be forthcoming on the Omega website.

If we view things in the longer term, we are clearly moving forward with two related goals: the stabilisation of the existing Omega infrastructure, and abandonment of the TEX infrastructure for the design and implementation of a next-generation open typesetting suite.

Such a suite should be a generic framework with an efficient C++ core that is universally extensible through a number of well-known scripting interfaces, for example, Perl, Python, and Guile. Implementation of libraries similar to the popular LATEX suite could then be done directly in C++, on top of the core API, or as a linked-in C++ stream filter.
Typesetting CJK Languages with Ω
Jin-Hwan Cho
Korean TEX Users Group
[email protected]
Haruhiko Okumura
Mie University, Faculty of Education
514-8507 Japan
[email protected]
Abstract
This paper describes how to typeset Chinese, Japanese, and Korean (CJK) languages with Omega, a 16-bit extension of Donald Knuth's TEX. In principle, Omega has no difficulty in typesetting those East Asian languages because of its internal representation using 16-bit Unicode. However, it has not been widely used in practice because of the difficulties in adapting it to CJK typesetting rules and fonts, which we will discuss in the paper.
1 Introduction
Chinese, Japanese, and Korean (CJK) languages are characterized by multibyte characters covering more than 60% of Unicode. The huge number of characters prevented the original 8-bit TEX from working smoothly with CJK languages. There have been three methods for supporting CJK languages in the TEX world up to now.

The first method, called the subfont scheme, splits CJK characters into sets of 256 characters or fewer, the number of characters that a TEX font metric file can accommodate. Its main advantage lies in using 8-bit TEX systems directly. However, one document may contain dozens of subfonts for each CJK font, and it is quite hard to insert glue and kerns between characters of different subfonts, even those from the same CJK font. Moreover, without the help of a DVI driver (e.g., DVIPDFMx [2]) supporting the subfont scheme, it is not possible to generate PDF documents containing CJK characters that can be extracted or searched. Many packages are based on this method; for instance, CJK-LATEX¹ by Werner Lemberg, HLATEX² by Koaunghi Un, and the Chinese module in ConTEXt³ by Hans Hagen.

On the other hand, in Japan, the most widely used TEX-based system is pTEX [1] (formerly known as ASCII Nihongo TEX), a 16-bit extension of TEX localized to the Japanese language. It is designed for high-quality Japanese book publishing (the "p" of pTEX stands for publishing; the name jTEX was used by another system). pTEX can handle multibyte characters natively (i.e., without resorting to subfonts), and it can typeset both horizontally and vertically within a document. It is upward compatible⁴ with TEX, so it can be used to typeset both Japanese and Latin languages, but it cannot handle Chinese and Korean languages straightforwardly. pTEX supports three widely-used Japanese encodings, JIS (ISO-2022-JP), Shift JIS, and EUC-JP, but not Unicode-based encodings such as UTF-8.

The third route, Omega [3], is also a 16-bit extension of TEX, having 16-bit Unicode as its internal representation. In principle, Omega is free from the limitations mentioned above, but thus far there is no thorough treatment of how it can be used for professional CJK typesetting and how to adapt it to popular CJK font formats such as TrueType and OpenType. We set out to fill in this blank.

2 Typesetting Characteristics
Each European language has its own hyphenation rules, but their typesetting characteristics are overall fairly similar. CJK languages differ from European languages in that there are no hyphenation

¹ Available on CTAN as language/chinese/CJK/
² Available on CTAN as language/korean/HLaTeX/
³ Available on CTAN as macros/context/
⁴ Although pTEX doesn't actually pass the trip test, it is thought to be upward compatible with TEX in virtually all practical situations.
rules. All CJK languages allow line breaking almost anywhere, without a hyphen. This characteristic is usually implemented by inserting appropriate glues between CJK characters.

One fine point is the treatment of blank spaces and end-of-line (EOL) characters. Korean uses blank spaces to separate words, but Chinese and Japanese rarely use blank spaces. An EOL character is converted in TEX to a blank space and then to a skip, which is unnecessary for Chinese and Japanese typesetting. To overcome this problem, pTEX ignores an EOL when it follows a CJK character.

Moreover, whereas Korean uses Latin punctuation marks (periods, commas, etc.), Chinese and Japanese use their own punctuation symbols. These CJK punctuation symbols need to be treated somewhat differently from ordinary characters. The appropriate rules are described in this paper.

3 CJK Omega Translation Process
We introduce here the CJK Omega Translation Process (CJK-ΩTP) developed by the authors⁵ to implement the CJK typesetting characteristics mentioned above.

An Omega Translation Process (ΩTP) is a powerful preprocessor which allows text to be passed through any number of finite state automata, which can achieve many different effects. Usually it is quite hard or impossible to do the same work with other TEX-based systems.

For each CJK language, the CJK-ΩTP is divided into two parts. The first ΩTP (boundCJK.otp) is common to all CJK languages, and controls the boundaries of blocks consisting of CJK characters and blank spaces. The second ΩTP (one of interCHN.otp, interJPN.otp, and interKOR.otp) is specific to each language, and controls typesetting rules for consecutive CJK characters.

4 Common Typesetting Characteristics
The first task of boundCJK.otp is to split the input stream into CJK blocks and non-CJK blocks, and insert glue (\boundCJKglue) in between to allow line breaking. However, combinations involving some Latin and CJK symbols (quotation marks, commas, periods, etc.) do not allow line breaking. In this case, \boundCJKglue is not inserted, so that the original line breaking rule is applied. This corresponds to pTEX's primitives \xspcode and \inhibitxspcode.

boundCJK.otp defines seven character sets; the role of each set is as follows.
1. {CJK} is the set of all CJK characters; its complement is denoted by ^{CJK}.
2. {XSPCODE1} (e.g., ([{‘) is the subset of ^{CJK} such that \boundCJKglue is inserted only between {CJK} and {XSPCODE1}, in this order.
3. {XSPCODE2} (e.g., )]}’;,.) is the subset of ^{CJK} such that \boundCJKglue is inserted only between {XSPCODE2} and {CJK}, in this order.
4. {XSPCODE3} (e.g., 0-9 A-Z a-z) is the subset of ^{CJK} such that \boundCJKglue is inserted between {CJK} and {XSPCODE3}, irrespective of the order.
5. {INHIBITXSPCODE0} is the subset of {CJK} not allowing \boundCJKglue between {INHIBITXSPCODE0} and ^{CJK}, irrespective of the order.
6. {INHIBITXSPCODE1} (e.g., CJK right parentheses and periods) is the subset of {CJK} not allowing \boundCJKglue between ^{CJK} and {INHIBITXSPCODE1}, in this order.
7. {INHIBITXSPCODE2} (e.g., CJK left parentheses) is the subset of {CJK} not allowing \boundCJKglue between {INHIBITXSPCODE2} and ^{CJK}, in this order.

The second task of boundCJK.otp is to enclose each CJK block in a group '{\selectCJKfont ... }', and convert all blank spaces inside the block to the command \CJKspace. The command \selectCJKfont switches to the appropriate CJK font, and \CJKspace is defined to be either a \space (for Korean) or \relax (for Chinese and Japanese), according to the selected language.

Note that if the input stream starts with blank spaces followed by a CJK block, or ends with a CJK block followed by blank spaces, then these spaces must be preserved regardless of the language, because of math mode:

  {CJK} {SPACE} $...$ {SPACE} {CJK}

and restricted horizontal mode:

  \hbox{{SPACE} {CJK} {SPACE}}

5 Language-dependent Characteristics
The line breaking mechanism is common to all of the language-dependent ΩTPs (interCHN.otp, interJPN.otp, and interKOR.otp). The glue \interCJKglue is inserted between consecutive CJK characters, and its role is similar to the glue \boundCJKglue at the boundary of a CJK block.

⁵ Available at http://project.ktug.or.kr/omega-cjk/
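As a rough illustration of the boundary rule implemented by boundCJK.otp, the decision of whether \boundCJKglue belongs between two adjacent characters can be modelled as below. The character sets here are tiny illustrative samples of the XSPCODE classes, not the full tables carried by the ΩTP, and the CJK test is deliberately crude:

```cpp
// A toy model of boundCJK.otp's boundary rule: \boundCJKglue is inserted
// at a CJK/non-CJK boundary unless an XSPCODE class forbids it for that
// order.  Glue between two consecutive CJK characters is \interCJKglue,
// handled by the language-specific OTPs, so it is not modelled here.
inline bool is_cjk(wchar_t c) {           // crude: CJK Unified Ideographs
    return c >= 0x4E00 && c <= 0x9FFF;
}
inline bool xspcode1(wchar_t c) {         // glue only after CJK: ( [ {
    return c == L'(' || c == L'[' || c == L'{';
}
inline bool xspcode2(wchar_t c) {         // glue only before CJK: ) ] } , .
    return c == L')' || c == L']' || c == L'}' || c == L',' || c == L'.';
}

// Should \boundCJKglue be inserted between 'prev' and 'next'?
inline bool bound_cjk_glue(wchar_t prev, wchar_t next) {
    if (is_cjk(prev) && !is_cjk(next))    // CJK -> Latin boundary:
        return !xspcode2(next);           //   glue, unless next is XSPCODE2
    if (!is_cjk(prev) && is_cjk(next))    // Latin -> CJK boundary:
        return !xspcode1(prev);           //   glue, unless prev is XSPCODE1
    return false;                         // no boundary here
}
```

XSPCODE3 characters (alphanumerics) fall through to the default case and get glue in both orders, matching rule 4 above.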
Some combinations of CJK characters do not allow line breaking. This is implemented by simply inserting a \penalty 10000 before the relevant \interCJKglue. In the case of boundCJK.otp, however, no \boundCJKglue is inserted where line breaking is inhibited.

The CJK characters not allowing line breaking are defined by the following two classes in interKOR.otp for Korean typesetting.
1. {CJK_FORBIDDEN_AFTER} does not allow line breaking between {CJK_FORBIDDEN_AFTER} and {CJK}, in this order.
2. {CJK_FORBIDDEN_BEFORE} does not allow line breaking between {CJK} and {CJK_FORBIDDEN_BEFORE}, in this order.
On the other hand, interJPN.otp defines six classes for Japanese typesetting, as discussed in the next section.

6 Japanese Typesetting Characteristics
Most Japanese characters are designed on a square 'canvas'. pTEX introduced a new length unit, zw (for zenkaku width, or full width), denoting the width of this canvas. The CJK-ΩTP defines \zw to denote the same quantity.

For horizontal (left-to-right) typesetting mode, the baseline of a Japanese character typically divides the square canvas by 0.88 : 0.12. If Japanese and Latin fonts are typeset at the same size, Japanese fonts appear larger. In the sample shown in Figure 1, Japanese characters are typeset at 92.469 percent of the size of the Latin characters, so that 10 pt (1 in = 72.27 pt) Latin characters are mixed with 3.25 mm (= 13 Q; 4 Q = 1 mm) Japanese characters. Also, Japanese and Latin words are traditionally separated by about 0.25 zw, though this space is getting smaller nowadays.

  日本 Japan

Figure 1: The width of an ordinary Japanese character, 1 zw, is set to 92.469% of the design size of the Latin font, and a gap of 0.25 zw is inserted. The baseline is set to 0.12 zw above the bottom of the enclosing squares.

Some characters (such as punctuation marks and parentheses) are designed on a half-width canvas: its width is 0.5 zw. For ease of implementation, the actual glyphs may be designed on square canvases. We can use the virtual font mechanism to map the logical shape and the actual implementation.

interJPN.otp divides Japanese characters into six classes:
1. Left parentheses: ‘“(〔[{〈《「『【
Half width; may be designed on square canvases flush right. In that case we ignore the left half and pretend they are half-width, e.g., \hbox to 0.5zw{\hss}. If a class-1 character is followed by a class-3 character, then an \hskip 0.25zw minus 0.25zw is inserted in between.
2. Right parentheses: 、,’”)〕]}〉》」』】
Half width; may be designed flush left on square canvases. If a class-2 character is followed by a class-0, -1, or -5 character, then an \hskip 0.5zw minus 0.5zw is inserted in between. If a class-2 character is followed by a class-3 character, then an \hskip 0.25zw minus 0.25zw is inserted in between.
3. Centered points: ・:;
Half width; may be designed centered on square canvases. If a class-3 character is followed by a class-0, -1, -2, -4, or -5 character, then an \hskip 0.25zw minus 0.25zw is inserted in between. If a class-3 character is followed by a class-3 character, then an \hskip 0.5zw minus 0.25zw is inserted in between.
4. Periods: 。.
Half width; may be designed flush left on square canvases. If a class-4 character is followed by a class-0, -1, or -5 character, then an \hskip 0.5zw is inserted in between. If a class-4 character is followed by a class-3 character, then an \hskip 0.75zw minus 0.25zw is inserted in between.
5. Leaders: ―…‥
Full width. If a class-5 character is followed by a class-1 character, then an \hskip 0.5zw minus 0.5zw is inserted in between. If a class-5 character is followed by a class-3 character, then an \hskip 0.25zw minus 0.25zw is inserted in between. If a class-5 character is followed by a class-5 character, then a \kern 0zw is inserted in between.
0. Class 0: everything else.
Full width. If a class-0 character is followed by a class-1 character, then an \hskip 0.5zw minus 0.5zw is inserted in between. If a class-0 character is followed by a class-3 character, then an \hskip 0.25zw minus 0.25zw is inserted in between.

Chinese texts can be typeset mostly with the same rules. An exception is the comma and the
period of Traditional Chinese. These two letters are designed at the center of the square canvas, so they should be treated as class-3 characters.

7 Example: Japanese and Korean
Let us discuss how to use the CJK-ΩTP in a practical situation. Figure 2 shows sample output containing both Japanese and Korean characters, which is typeset by Omega with the CJK-ΩTP and then processed by DVIPDFMx.

  TEX はスタンフォード大学のクヌース教授によって開発された組版システムであり、組版の美しさと強力なマクロ機能を特徴としている。
  [Korean sample paragraph: not recoverable from the extracted text]

Figure 2: Sample CJK-ΩTP output.

The source of the sample above was prepared with the text editor Vim, as shown in Figure 3. Here, the UTF-8 encoding was used to see Japanese and Korean characters at the same time. Note that the backslash character (\) is replaced with the yen currency symbol in Japanese fonts.

  [Source listing: not recoverable from the extracted text]

Figure 3: Sample CJK-ΩTP source.

The first line in Figure 3 calls another TEX file, omega-cjk-sample.tex, which starts with the following code,⁶ which loads the CJK-ΩTP.

  \ocp\OCPindefault=inutf8
  \ocp\OCPboundCJK=boundCJK
  \ocp\OCPinterJPN=interJPN
  \ocp\OCPinterKOR=interKOR

Note that inutf8.otp has to be loaded first, to convert the input stream encoded with UTF-8 to UCS-2, the 16-bit Unicode.

  \ocplist\CJKOCP=
    \addafterocplist 1 \OCPboundCJK
    \addafterocplist 1 \OCPindefault
  \nullocplist
  \ocplist\JapaneseOCP=
    \addbeforeocplist 2 \OCPinterJPN \CJKOCP
  \ocplist\KoreanOCP=
    \addbeforeocplist 2 \OCPinterKOR \CJKOCP

The glues \boundCJKglue and \interCJKglue for the CJK line breaking mechanism are defined by new skip registers, to be changed later according to the language selected.

  \newskip\boundCJKskip % defined later
  \newskip\interCJKskip % defined later
  \def\interCJKglue{\hskip\interCJKskip}

Japanese typesetting requires more definitions to support the six classes defined in interJPN.otp.

  \newdimen\zw \zw=0.92469em
  \def\halfCJKmidbox#1{\leavevmode%
    \hbox to .5\zw{\hss #1\hss}}
  \def\halfCJKleftbox#1{\leavevmode%
    \hbox to .5\zw{#1\hss}}
  \def\halfCJKrightbox#1{\leavevmode%
    \hbox to .5\zw{\hss #1}}

Finally, we need the commands \japanese and \korean to select the given language. These commands have to include the actual manipulation of fonts, glues, and spaces.

  \font\defaultJPNfont=omrml
  \def\japanese{%
    \clearocplists\pushocplist\JapaneseOCP
    \let\selectCJKfont\defaultJPNfont
    \let\CJKspace\relax % remove spaces
    \boundCJKskip=.25em plus .15em minus .06em
    \interCJKskip=0em plus .1em minus .01em
  }
  \font\defaultKORfont=omhysm
  \def\korean{%
    \clearocplists\pushocplist\KoreanOCP
    \let\selectCJKfont\defaultKORfont
    \let\CJKspace\space % preserve spaces
    \boundCJKskip=0em plus .02em minus .01em
    \interCJKskip=0em plus .02em minus .01em
  }

It is straightforward to extend these macros to create a LATEX (Lambda) class file.

8 CJK Font Manipulation
At first glance, the best font for Omega seems to be the one containing all characters defined in 16-bit Unicode. In fact, such a font cannot be constructed.

⁶ Omega requires the binary form of ΩTP files, compiled by the utility otp2ocp included in the Omega distribution.
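The conversion performed by inutf8.otp in the example of section 7, from UTF-8 input to the 16-bit UCS-2 code units Omega uses internally, amounts to the following. This sketch covers only the one- to three-byte sequences that map into UCS-2, and does no validation; it is an illustration of the mapping, not of the ΩTP itself:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into UCS-2 code units, the 16-bit form
// Omega uses internally.  Handles 1-, 2- and 3-byte sequences, which
// together cover all of UCS-2; malformed input is not checked.
inline std::vector<std::uint16_t> utf8_to_ucs2(const std::string& in) {
    std::vector<std::uint16_t> out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = in[i];
        if (b < 0x80) {                        // 0xxxxxxx: ASCII
            out.push_back(b);
            i += 1;
        } else if (b < 0xE0) {                 // 110xxxxx 10xxxxxx
            out.push_back(((b & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {                               // 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(((b & 0x0F) << 12) | ((in[i + 1] & 0x3F) << 6)
                          | (in[i + 2] & 0x3F));
            i += 3;
        }
    }
    return out;
}
```

For instance, the three bytes E6 97 A5 decode to the single code unit 0x65E5, the character 日 from Figure 2.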
There are several varieties of Chinese letters: traditional letters are used in Taiwan and Korea, while simplified letters are now used in mainland China. Japan has its own somewhat simplified set. The glyphs are significantly different from country to country.

Unicode unifies these four varieties of Chinese letters into one, if they look similar. They are not identical, however. For example, the letter 'bone' has the Unicode point 9AA8, but the top part of the Chinese Simplified letter and the Japanese letter are almost mirror images of each other, as shown in Figure 4. Less significant differences are also distracting to native Asian readers. The only way to overcome this problem is to use different CJK fonts according to the language selected.

Figure 4: Two letters with the same Unicode point: (a) Chinese Simplified 骨, (b) Japanese 骨.

OpenType (including TrueType) is the most popular font format for CJK fonts. However, it is neither easy nor simple, even for TEX experts, to generate OFM and OVF files from OpenType fonts.

The situation looks simple for Japanese and Chinese fonts, having fixed width, because one (virtual) OFM is sufficient, which can be constructed by hand. However, Korean fonts have proportional width. Since most of the popular Korean fonts are in OpenType format, a utility that extracts font metrics from OpenType fonts is required.

There are two patches of the ttf2tfm and ttf2pk utilities using the freetype library (notes 7, 8). The first, written by one of the authors, Jin-Hwan Cho, generates OFM and OVF files from TrueType fonts (not OpenType fonts). The other, written by Won-Kyu Park (note 9), lets ttf2tfm and ttf2pk run with OpenType (including TrueType) fonts with the help of the freetype2 library. Moreover, the two patches can be used together.

Unfortunately, ovp2ovf 2.0, included in recent TEX distributions (e.g., teTEX 2.x), does not seem to work correctly, so the previous version 1.x must be used.

9 Asian Font Packs and DVIPDFMx

A solution avoiding the problems mentioned above is to use the CJK fonts included in the Asian font packs of Adobe (Acrobat) Reader as non-embedded fonts when making PDF output.

It is well known that Adobe Reader can display and print several common fonts even if they are not embedded in the document. These are fourteen base Latin fonts, such as Times, Helvetica, and Courier, and several CJK fonts, if Asian font packs are installed (note 10). These packs have been available free of charge since the era of Adobe Acrobat Reader 4. Four are available: Chinese Simplified, Chinese Traditional, Japanese, and Korean. Moreover, Adobe Reader 6 downloads the appropriate font packs on demand when a document containing non-embedded CJK characters is opened. Note that these fonts are licensed solely for use with Adobe Readers.

Professional CJK typesetting requires at least two font families: serif and sans serif. As of Adobe Acrobat Reader 4, Asian font packs, except for Chinese Simplified, included both families, but newer packs include only a serif family. However, newer versions of Adobe Reader can automatically substitute a missing CJK font by another CJK font installed in the operating system, so displaying both families is possible on most platforms.

If the CJK fonts included in Asian font packs are to be used, there is no need to embed the fonts when making PDF output. The PDF file should contain the font names and code points only. Some 'generic' font names are given in Table 1, which can be handled by Acrobat Reader 4 and later. However, these names depend on the PDF viewers (note 11). Note that the names are not necessarily true font names. For example, Ryumin-Light and GothicBBB-Medium are the names of commercial (rather expensive) Japanese fonts. They are installed in every genuine (expensive) Japanese PostScript printer. PDF readers and PostScript-compatible low-cost printers accept these names but use compatible typefaces instead.

Table 1: Generic CJK font names

                      Serif               Sans Serif
Chinese Simplified    STSong-Light        STHeiti-Regular
Chinese Traditional   MSung-Light         MHei-Medium
Japanese              Ryumin-Light        GothicBBB-Medium
Korean                HYSMyeongJo-Medium  HYGoThic-Medium

While TEX generates DVI output only, pdfTEX generates both DVI and PDF output. But Omega and pTEX do not have counterparts generating PDF output yet. One solution is DVIPDFMx [2], an extension of dvipdfm (note 12), developed by Shunsaku Hirata and one of the authors, Jin-Hwan Cho.

10 Conclusion

We have shown how Omega, with CJK-ΩTP, can be used for the production of quality PDF documents using the CJK languages.

CJK-ΩTP, as it stands, is poorly tested and documented. Especially needed are examples of Chinese typesetting, in which the present authors are barely qualified. In due course, we hope to upload CJK-ΩTP to CTAN.

Notes:
7. Available from the FreeType project, http://www.freetype.org.
8. Available from the Korean TEX Users Group, http://ftp.ktug.or.kr/pub/ktug/freetype/contrib/ttf2pk-1.5-20020430.patch.
9. Available as http://chem.skku.ac.kr/~wkpark/project/ktug/ttf2pk-freetype2 20030314.tgz.
10. Asian font packs for Adobe Acrobat Reader 5.x and Adobe Reader 6.0, Windows and Unix versions, can be downloaded from http://www.adobe.com/products/acrobat/acrrasianfontpack.html. For Mac OS, an optional component is provided at the time of download.
11. For example, these names are hard coded in the executable file of Adobe (Acrobat) Reader, and each version has different names.
12. The utility dvipdfm is a DVI to PDF translator developed by Mark A. Wicks. The latest version, 0.13.2c, was released in 2001. Available from http://gaspra.kettering.edu/dvipdfm/.

References

[1] ASCII Corporation. ASCII Nihongo TEX (Publishing TEX). http://www.ascii.co.jp/pb/ptex/.
[2] Jin-Hwan Cho and Shunsaku Hirata. The DVIPDFMx Project. http://project.ktug.or.kr/dvipdfmx/.
[3] John Plaice and Yannis Haralambous. The Omega Typesetting and Document Processing System. http://omega.enstb.org.
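The generic names in Table 1 lend themselves to a small lookup table. A minimal sketch in Python (the font names are those printed in Table 1; the dictionary and helper function are this example's own, not part of any tool described in the article):

```python
# Generic CJK font names from Table 1, usable as non-embedded fonts
# with Acrobat Reader 4 and later.  Illustrative helper only.
GENERIC_CJK_FONTS = {
    "Chinese Simplified":  {"serif": "STSong-Light",       "sans": "STHeiti-Regular"},
    "Chinese Traditional": {"serif": "MSung-Light",        "sans": "MHei-Medium"},
    "Japanese":            {"serif": "Ryumin-Light",       "sans": "GothicBBB-Medium"},
    "Korean":              {"serif": "HYSMyeongJo-Medium", "sans": "HYGoThic-Medium"},
}

def generic_font(language, family="serif"):
    """Return the generic PDF font name for a CJK language/family pair."""
    return GENERIC_CJK_FONTS[language][family]
```

Such a table is what a DVI-to-PDF driver needs in order to emit font names and code points only, leaving glyph selection to the viewer.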
MlBibTEX: Beyond LATEX
Jean-Michel Hufflen
LIFC (FRE CNRS 2661)
University of Franche-Comté
16, route de Gray
25030 Besançon Cedex
France
[email protected]
http://lifc.univ-fcomte.fr/~hufflen
Abstract
This article sums up our experience with MlBibTEX, our multilingual implementation of BibTEX, and points out some possible improvements for better co-operation between LATEX and MlBibTEX. Also, MlBibTEX may be used to generate bibliographies written according to other formalisms, especially formalisms related to XML, and we give some ways to ease that.

Keywords: Bibliographies, multilingual features, BibTEX, MlBibTEX, bst, nbst, XML, XSLT, XSL-FO, DocBook.
1 Introduction

MlBibTEX (for 'MultiLingual BibTEX') is a reimplementation of BibTEX [21], the bibliography processor associated with LATEX [19]. The project began in October 2000, and has resulted in two experimental versions [9, 11] and the present version (1.3), which will be available publicly by the time this article appears. As we explained in [15], a prototype using the Scheme programming language is working whilst we are developing a more robust program written in C. The prototype has allowed us to get some experience with real-sized bibliographies: this is the purpose of the first part of this article, after a short review of the modus operandi of MlBibTEX.

MlBibTEX's present version no longer uses the bst language of BibTEX for bibliography styles [20]. Such .bst files were used in MlBibTEX's first version, but since this old-fashioned language, based on simple stack manipulations, is not modular, we quickly realised that this choice would have led us to styles that were too complicated [12]. Thus, Version 1.3 uses the nbst (for 'New Bibliography STyles') language, described in [13] and similar to XSLT (note 1), the language of transformations designed for XML (note 2) texts [32]. More precisely, MlBibTEX 1.3 uses XML as a central formalism in the sense that parsing files containing bibliographical entries (.bib files) results in a DOM (note 3) tree. Bibliography styles written using nbst are XML texts, too.

Of course, nbst can be used to generate bibliographies for documents other than those processed with LATEX (note 4). In particular, nbst eases the generation of bibliographies for documents written using XML-like syntax. Nevertheless, dealing with .bib files raises some problems: we go into them thoroughly in Section 4.

Reading this article requires only a basic knowledge of LATEX, BibTEX and XML. Some examples given in the next section will use the commands provided by the multilingual babel package of LATEX 2ε [2]. Other examples given in Section 4 will use the Scheme programming language, but if need be, referring to an introductory book such as [28] is sufficient to understand them.

2 Architecture of MlBibTEX

2.1 How MlBibTEX Works

As a simple example of using MlBibTEX with LATEX, let us consider the silke1988 bibliographical entry given in Figure 1. As we explain in [15], the sequence '[...] ! ⟨idf⟩' is one of the multilingual features

Notes:
1. EXtensible Stylesheet Language Transformations.
2. EXtensible Markup Language. A good introduction to this formalism, issued by the W3C (World Wide Web Consortium), is [24].
3. Document Object Model. This is a W3C recommendation for a standard tree-based programming approach [24, pp. 306–308], very often used to implement XML trees.
4. This is also the case with the bst language of BibTEX, but in practice, it seems that this feature has not been used, except for documents written in SCRIBE [25], a predecessor of LATEX.
@BOOK{silke1988,
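The entry above uses the multilingual sequence '[...] ! ⟨idf⟩', which attaches a language identifier to a bracketed fragment. As a rough illustration of the idea only (the regular expression and function below are this sketch's own assumptions, not MlBibTEX's actual grammar):

```python
import re

# Extract the multilingual fragments written with the "[...] ! idf"
# notation, where idf is a language identifier.  Hypothetical parser,
# for illustration; MlBibTeX's real syntax is richer than this.
_ML_FRAGMENT = re.compile(r"\[([^\]]*)\]\s*!\s*(\w+)")

def multilingual_fragments(field_value):
    """Return (text, language) pairs found in a BibTeX field value."""
    return _ML_FRAGMENT.findall(field_value)
```

A bibliography style could then keep only the fragments whose language matches the document's current babel language.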
We can see that some LATEX commands and special characters are converted according to the conventions of XML.

• The commands used for accents and special letters are replaced by the letter itself. This poses no problem since DOM trees are encoded in Unicode [29]. As an example, the '\c{c}' sequence in the value of the NOTE field in Figure 1 is replaced by 'ç' in Figure 2. (By the way, let us remark that MlBibTEX can handle the 8-bit latin1 encoding (note 6): notice the 'Ü' character inside this value.)

• Likewise, the commands:
  – '\ ' for a simple space character,
  – '\\' for an end-of-line character,
  and the sequences of characters:
  – '~', for an unbreakable space character,
  – '--', and '---' for dash characters,
  are replaced by the corresponding Unicode values for these characters (note 7): – —
  An example is given by the value of the AUTHOR field; see Figures 1 & 2.

• Some characters escaped in LATEX (for example, '$', '%', '&') lose the escape character:
  \% =⇒ %
  The escape is restored if MlBibTEX generates a .bbl file to be processed by LATEX. Other characters are replaced by a reference to a character entity (note 8):
  \& =⇒ &amp;   < =⇒ &lt;   > =⇒ &gt;

• Balanced delimiters for quotations (' ... ' or `` ... '') are replaced by an emph element (note 9):
  'Tooth and Claw' =⇒

Such an XML tree, resulting from our parser, may be validated using a DTD (note 10); more precisely, by a revised version of the DTD sketched in [10].

Some examples of using nbst for bibliography styles are given in [12, 13, 14]. We give another example in Figure 3. We can see that this language is close to XSLT and it uses path expressions as in the XPath language [31]. Also, the example shows how multilingual features (for example, the sequence '[...] ! ...') are processed: we use some external functions in order to determine which LATEX command can be used to switch to another language. These external functions are written using the language of MlBibTEX's implementation: Scheme for the prototype, C for the final program.

3 MlBibTEX with LATEX

When BibTEX generates a .bbl file, it does not use the source file processed by LATEX, but only the auxiliary (.aux) file, in which the definition of all the labels provided by the commands \label and \bibitem is stored. This file also contains the name of the bibliography style to be used and the paths of bibliography data bases to be searched, so BibTEX need not look at any other file.

This is not true for MlBibTEX. It still uses the .aux file as far as possible, but it also has to determine which multilingual packages are used: first of all babel, but also some packages devoted to particular languages: french [6], german [23], . . . So we have to do a partial parsing of the .tex file for that. For better co-operation between LATEX and MlBibTEX, this could be improved, in that information about multilingual packages used, and languages available, could be put in the .aux file. In fact, the external functions of our new bibliography styles are only used to manage information extracted from a .tex file. Expressing such operations using nbst would
Figure 3: Example of calling an external function.
the german option of this package, although it could
(define-pattern "{#1}"
  ;; The asitis element is used for words that should never be
  ;; uncapitalised, that is, proper names.  In BibTeX, we specify
  ;; such behaviour by surrounding words by additional braces.
  "
Figure 6: Patterns for some LATEX commands in Scheme.
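The comment in Figure 6 describes the classic BibTEX convention: when a style lowercases a title, words surrounded by additional braces keep their capitalisation. A minimal sketch of that behaviour (one brace level, braces stripped from the result; a deliberate simplification of what BibTEX's change.case$ built-in actually does):

```python
# Lowercase a title except inside brace-protected groups, the
# convention BibTeX uses for proper names.  Illustrative sketch only.
def downcase_except_braced(title):
    out, depth = [], 0
    for ch in title:
        if ch == "{":
            depth += 1          # entering a protected group
        elif ch == "}":
            depth -= 1          # leaving a protected group
        else:
            out.append(ch if depth else ch.lower())
    return "".join(out)
```

For example, `downcase_except_braced("The {TeX} Book")` keeps "TeX" capitalised while the surrounding words are lowercased.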
as well as more complicated cases, like a cascade of \iflanguage commands [2]:

\iflanguage{...}{...}{%
  \iflanguage{...}{...}{ ... }}

which becomes:

BibTEX deals with three formalisms: .aux files, .bib files and .bst files. As Jonathan Fine notes in [5], the applications devoted to a particular formalism cannot be shared with other applications. MlBibTEX attempts to use XML as far as possible, although there is still much to do. For example,
difficult to model with DTDs (note 21). We are presently going thoroughly into replacing our DTD by a schema; when this work reaches maturity, bibliographical entries using XML syntax could be directly validated using schemas.

6 Acknowledgements

Many thanks to Karl Berry for his patience while waiting for this article. In addition he proofread a first version and gave me many constructive criticisms. Thanks also to Barbara Beeton.

Note 21: Whereas that is easy with 'old' BibTEX, provided that you use a bibliography style able to deal with additional fields.

References

[1] Harald Tveit Alvestrand: Request for Comments: 1766. Tags for the Identification of Languages. UNINETT, Network Working Group. March 1995. http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1766.html.
[2] Johannes Braams: Babel, a Multilingual Package for Use with LATEX's Standard Document Classes. Version 3.7. May 2002. CTAN:macros/latex/required/babel/babel.dvi.
[3] James Clark et al.: Relax NG. http://www.oasis-open.org/committees/relax-ng/. 2002.
[4] Nicos Drakos: The LATEX2HTML Translator. March 1999. Computer Based Learning Unit, University of Leeds.
[5] Jonathan Fine: "TEX as a Callable Function". In: EuroTEX 2002, (pp. 26–30). Bachotek, Poland. April 2002.
[6] Bernard Gaulle: Notice d'utilisation du style french multilingue pour LATEX. Version pro V5.01. Janvier 2001. CTAN:loria/language/french/pro/french/ALIRE.pdf.
[7] Michel Goossens, Sebastian Rahtz and Frank Mittelbach: The LATEX Graphics Companion. Illustrating Documents with TEX and PostScript. Addison-Wesley Publishing Company, Reading, Massachusetts. March 1997.
[8] Hans Hagen: ConTEXt, the Manual. November 2001. http://www.pragma-ade.com.
[9] Jean-Michel Hufflen: "MlBibTEX: A New Implementation of BibTEX". In: EuroTEX 2001, (pp. 74–94). Kerkrade, The Netherlands. September 2001.
[10] Jean-Michel Hufflen: "Multilingual Features for Bibliography Programs: From XML to MlBibTEX". In: EuroTEX 2002, (pp. 46–59). Bachotek, Poland. April 2002.
[11] Jean-Michel Hufflen: Towards MlBibTEX's Versions 1.2 & 1.3. MaTEX Conference. Budapest, Hungary. November 2002.
[12] Jean-Michel Hufflen: "European Bibliography Styles and MlBibTEX". EuroTEX 2003, Brest, France. June 2003. (To appear in TUGboat.)
[13] Jean-Michel Hufflen: MlBibTEX's Version 1.3. TUG 2003, Outrigger Waikoloa Beach Resort, Hawaii. July 2003.
[14] Jean-Michel Hufflen: "Making MlBibTEX Fit for a Particular Language. Example of the Polish Language". Biuletyn GUST. Forthcoming. Presented at the BachoTEX 2003 conference. 2004.
[15] Jean-Michel Hufflen: "A Tour around MlBibTEX and Its Implementation(s)". Biuletyn GUST, Vol. 20, pp. 21–28. In Proc. BachoTEX Conference. April 2004.
[16] ISO/IEC 19757: The Schematron. An XML Structure Validation Language Using Patterns in Trees. http://www.ascc.net/xml/resource/schematron/schematron.html. June 2003.
[17] Oleg Kiselyov: "A Better XML Parser through Functional Programming". In: 4th International Symposium on Practical Aspects of Declarative Languages, Vol. 2257 of Lecture Notes in Computer Science. Springer-Verlag. 2002.
[18] Donald Ervin Knuth: Computers & Typesetting. Vol. A: The TEXbook. Addison-Wesley Publishing Company, Reading, Massachusetts. 1984.
[19] Leslie Lamport: LATEX. A Document Preparation System. User's Guide and Reference Manual. Addison-Wesley Publishing Company, Reading, Massachusetts. 1994.
[20] Oren Patashnik: Designing BibTEX Styles. February 1988. Part of the BibTEX distribution.
[21] Oren Patashnik: BibTEXing. February 1988. Part of the BibTEX distribution.
[22] Dave Pawson: XSL-FO. O'Reilly & Associates, Inc. August 2002.
[23] Bernd Raichle: Die Makropakete "german" und "ngerman" für LATEX 2ε, LATEX 2.09, Plain-TEX und andere darauf basierende Formate. Version 2.5. Juli 1998.
[24] Erik T. Ray: Learning XML. O'Reilly & Associates, Inc. January 2001.
[25] Brian Keith Reid: SCRIBE Document Production System User Manual. Technical Report, Unilogic, Ltd. 1984.
[26] Thomas Schraitle: DocBook-XML — Medienneutrales und plattformunabhängiges Publizieren. SuSE Press. 2004.
[27] John E. Simpson: XPath and XPointer. O'Reilly & Associates, Inc. August 2002.
[28] George Springer and Daniel P. Friedman: Scheme and the Art of Programming. The MIT Press, McGraw-Hill Book Company. 1989.
[29] The Unicode Standard Version 3.0. Addison-Wesley. February 2000.
[30] Eric van der Vlist: Examplotron. http://examplotron.org. February 2003.
[31] W3C: XML Path Language (XPath). Version 1.0. W3C Recommendation. Edited by James Clark and Steve DeRose. November 1999. http://www.w3.org/TR/1999/REC-xpath-19991116.
[32] W3C: XSL Transformations (XSLT). Version 1.0. W3C Recommendation. Written by Sharon Adler, Anders Berglund, Jeff Caruso, Stephen Deach, Tony Graham, Paul Grosso, Eduardo Gutentag, Alex Milowski, Scott Parnell, Jeremy Richman and Steve Zilles. November 1999. http://www.w3.org/TR/1999/REC-xslt-19991116.
[33] W3C: Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation. Edited by Tim Bray, Jean Paoli, C. M. Sperberg-McQueen and Eve Maler. October 2000. http://www.w3.org/TR/2000/REC-xml-20001006.
[34] W3C: Extensible Stylesheet Language (XSL). Version 1.0. W3C Recommendation. Edited by James Clark. October 2001. http://www.w3.org/TR/2001/REC-xsl-20011015/.
[35] W3C: XML Schema. November 2003. http://www.w3.org/XML/Schema.
[36] Norman Walsh and Leonard Muellner: DocBook: The Definitive Guide. O'Reilly & Associates, Inc. October 1999.
iTEXMac: An Integrated TEX Environment for Mac OS X
Jérôme Laurens
Université de Bourgogne
[email protected]
1 Introduction

iTEXMac is an integrated suite of three major components: a text editor detailed in section 2, a PDF viewer detailed in section 3, and a TEX front end detailed in section 4. Some notes on installation are followed by remarks concerning inter-application communication in section 6 for other Mac OS X developers. Finally, the pdfsync feature and the TEX Wrapper are discussed in sections 7 and 8. Since they concern the synchronization between the TEX source and PDF output, and a definition for a shared TEX document structure, both will certainly interest the whole TEX community.

2 The Text Editor

iTEXMac can be used either with a built-in text editor or an external one. All standard text editors like TextEdit, BBEdit, AlphaX, vi, and emacs are supported, and configuring iTEXMac for other editors is very easy, even when coming from the X11 world.

The built-in text editor comes with flavours similar to emacs and AlphaX modes. It relies on a plug-in architecture that allows very different kinds of user interfaces according to the type of the file being edited. Whereas AlphaX uses Tcl and emacs uses Lisp, iTEXMac utilizes the benefits of Objective-C bundles, giving plug-ins great potential power together with the application.

Among the standard features shared by advanced text editors (like key binding management, advanced regular expressions, command completion), an interesting feature of iTEXMac's text editor is the syntax parsing policy. The syntax highlighting deeply depends on the kind of text edited, whether it is Plain, LATEX or METAPOST (support for HTML is planned). The text properties used for highlighting include not only the color of the text, but also the font, the background color and some formatting properties. Moreover, the command shortcuts that refer to mathematical and text symbols are replaced by the glyph they represent, thus replacing \alpha with α and so on. Conversely, the built-in editor can show a character palette with 42 menus gathering text and mathematical symbols, as they would appear in the output. The editor thus serves as a graphical front-end to the standard LATEX packages amsfonts.sty, amssymb.sty, mathbb.sty, mathrsfs.sty, marvosym.sty and wasysym.sty, which makes thousands of symbols available with just one click. The result is a text editor that contains much more WYSIWYG than others, with no source file format requirement.

There is also advanced management of string encoding, and iTEXMac supports more than 80 encodings with an efficient user interface. The text files are scanned for hints about the text encoding:

LATEX     \usepackage[encoding]{inputenc}
ConTEXt   \enableregime[encoding]
emacs     %-*-coding:character encoding;-*-
          %!iTeXMac(charset): character encoding
          Mac OS X hidden internals

But this is not user friendly practice and will be enhanced by the forthcoming discussion of TEX wrappers in section 8.

Spell checking for TEX input is available with Rick Zaccone's LATEX aware Excalibur (note 1) and Anton
1. http://www.eg.bucknell.edu/~excalibr/
2 Leuski’s TEX aware cocoAspell , a port of the Free of a source file. For example, the following directive, and Open Source spell checker aspell. The latter if present, will run pdflatex in escape mode. also knows about HTML code and is integrated to %!iTeXMac(typeset): pdflatex Mac OS X allowing iTEXMac to check spelling as --shell-escape $iTMInput you type, with misspelled words being underlined in The makeindex and bibtex command options red. While this service is provided to all applications can be set from panels, and other commands are sup- for free, iT XMac is the only one that truly enables E ported. Moreover, the various log files are parsed, the T X support by managing the language and the E warnings and errors are highlighted with different list of known words on a file by file basis using T X E colors and HTML links point to lines where an error Wrappers. occurred. Some navigation facilities from log file to 3 The PDF Viewer output are also provided, a string like [a number...], pointing to the output page. iT XMac can be used either with a built-in PDF E As for documents, iT XMac manages a list of viewer or an external one. The built-in viewer lacks E default settings that fit a wide range of situations, many advanced features of Acrobat or Preview, but including for example it updates the display automatically when a PDF A file has been changed externally. Moreover, it allows • LTEX documents with DVI, PS or PDF engines you to make selections and export them to other ap- • books plications, like Word, TextEdit or Keynote for ex- • METAPOST documents ample. Finally, it has support for the useful PDF • ConT Xt documents synchronization discussed below and is very well in- E tegrated in the suite. • HTML documents iTEXMac can open PS, EPS and DVI files with • B. Gaulle’s French Pro documents. a double click, by first converting them to PDF. It Users can extend this list with a built-in editor, PostScript thus plays the role of a or a DVI viewer. 
adding support for MusicTEX, maybe gcc, and so This feature is now partially obsolete since Mac OS on. X version 10.3 provides its own PS to PDF translator used by the Preview application shipped with the 5 Installing TEX and iTEXMac system. The starting point for a detailed documentation is 3 the MacOS X TEX/LATEX Web Site where one will 4 The TEX Front End find an overview of the TEX related tools available This component of the software serves two different on Mac OS X. As a graphical front-end, iTEXMac purposes. On one hand it is a bridge between the needs a TEX distribution to be fully functional. Ger- 4 5 user and the utilities of a standard TEX distribution: ben Wierda maintains on his site the TEX Live dis- a graphical user interface for the commands tex, tribution and a set of useful packages. Other teTEX latex, pdftex, and so on. On the other hand, it has 2.0.2 ports are available from fink6 (and through one to properly manage the different kinds of documents of its graphical user interfaces, such as finkcomman- one wants to typeset. der7) and from Darwin Ports8, through a CVS in- Actually, the iTEXMac interface with its under- terface. lying TEX distribution is fairly simple. Five basic The official web site of iTEXMac is hosted by actions are connected to menu items, toolbar but- the Open Source software development SourceForge tons or command shortcuts to website at: • typeset (e.g., running latex once, or twice in http://itexmac.sourceforge.net/ advanced mode) One can find in the download section the disk images • make the bibliography (e.g. running bibtex) for the following products: • make the index (e.g., running makeindex) • iTEXMac, both stable and developer release • render graphics (e.g., running dvipdf) 3 http://www.esm.psu.edu/mac-tex/ All these actions are connected to shell scripts stored 4 http://www.rna.nl/tex.html 5 on a per document basis. 
If necessary, the user can http://www.tug.org/texlive/ 6 http://fink.sourceforge.net/pdb/package.php/ customize them or even change the whole process by tetex inserting an in-line instruction at the very beginning 7 http://finkcommander.sourceforge.net/ 8 http://darwinports.opendarwin.org/ports/?by= 2 http://cocoAspell.leuski.net/ cat&substr=print
• an external editor for iTEXMac key binding
• the Hypertext Help With LATEX wrapped as a searchable Mac OS X help file
• the TEX Catalogue Online wrapped as a searchable Mac OS X help file
• the French LATEX FAQ wrapped as a searchable Mac OS X help file

An updater allows you to check easily for new versions. To install iTEXMac, just download the latest disk image archive, double click and follow the instructions in the read-me file.

Due to its Unix core, Mac OS X is no longer focused on only one user. To support multiple users, iTEXMac configuration files can be placed in different locations to change defaults for all, or just certain, users. The search path is:

• the built-in domain as shipped with the application (with default, read-only settings)
• the network domain (/Network/Library/Application Support/iTeXMac), where an administrator can put material to override or augment the default behaviour of all the machines on a network
• the local domain (/Library/Application Support/iTeXMac), where an administrator can put material to override or augment the network or default behaviour
• the user domain (~/Library/Application Support/iTeXMac), where the user can put material to override or augment the local or default behaviour

This is a way to apply modifications to iTEXMac as a whole.

6 Inter-application Communication

This section describes how iTEXMac communicates with other components, in the hope that this syntax will also be used by other applications when relevant, to avoid the current situation where there are as many AppleScript syntaxes as there are available applications for TEX on the Macintosh. It also shows why iTEXMac integrates so well with other editors or viewers.

6.1 Shell Commands

iTEXMac acts as a server, such that other applications can send it messages. Each time it starts, iTEXMac installs an alias to its own binary code in ~/Library/TeX/bin/iTeXMac. With the following syntax (note 9), either from the command line, an AppleScript or a shell script, one can edit a text file at the location corresponding to the given line and column numbers:

~/Library/TeX/bin/iTeXMac edit -file "filename"
  -line lineNumber -column colNumber

The following syntax is used to display a PDF file at the location corresponding to the given line, column and source file name:

~/Library/TeX/bin/iTeXMac display -file "filename.pdf"
  -source "sourcename.tex"
  -line lineNumber -column colNumber

6.2 AppleScript

The same feature is implemented using this scripting language. It would be great for the user if other TEX front ends on Mac OS X would implement the same syntax.

tell application "iTeXMac" to edit "filename.tex"
  at line lineNumber column colNumber

tell application "iTeXMac" to display "filename.pdf"
  at line lineNumber column colNumber
  in source "Posix source name.tex"

iTEXMac support for AppleScript likewise covers the compile, bibliography and index actions. They are not given here since there is no Apple Events Suite dedicated to TEX. However, configuration files and instructions are given to let third-party applications like AlphaX, BBEdit or emacs control iTEXMac using those scripts.

6.3 HTML

iTEXMac implements support for a URL scheme named file-special for editing, updating or displaying files; for example

file-special://localhost/"filename.tex";action=edit;
  line=lineNumber;column=columnNumber

file-special://localhost/"filename.pdf";action=display;
  line=lineNumber;column=columnNumber;
  source="Posix source name.tex"

will ask iTEXMac to edit a TEX source file or display the given file (assumed to be PDF) and, when synchronization information is available, scroll to the location corresponding to the given line and column in source (assumed to be TEX). This allows adding dynamic links in HTML pages, in a TEX tutorial for example.

Note 9: These commands should be entered all on one line. They are broken here due to the narrow TUGboat columns.
7 The pdfsync Feature

7.1 About Synchronization

As the TEX typesetting system heavily relies on a page description language, there is no straightforward correspondence between a part of the output and the original description code in the input. A workaround was introduced a long time ago by the commercial TEX front ends Visual TEX (note 10) and TEXtures (note 11), with a very efficient implementation. Then LATEX users could access the same features — though in a less-efficient implementation — through the use of srcltx.sty, which added source specials in the DVI files. The command line option -src-specials now gives this task to the TEX typesetting engine.

When used with an external DVI viewer or an external text editor, through an X11 server or not, iTEXMac fully supports this kind of synchronization feature.

For the PDF file format, Piero d'Ancona and the author elaborated a strategy that works rather well for Plain TEX, ConTEXt and LATEX users. While typesetting a foo.tex file with LATEX, for example, the pdfsync package writes extra geometry information in an auxiliary file named foo.pdfsync, subsequently used by the front ends to link line numbers in source documents with locations in pages of output PDF documents. iTEXMac and TEXShop (note 12) both support pdfsync.

The official pdfsync web site is:
http://itexmac.sourceforge.net/pdfsync.html

It was more convenient to use an auxiliary file than to embed the geometric information in the PDF output using pdftex primitives. The output file is not polluted with extraneous information, and the front ends need not parse the PDF output to retrieve such metrics.

7.2 The pdfsync Mechanism

A macro is defined to put pdfsync anchors at specific locations (for hboxes, paragraphs and maths). There are essentially three problems we must solve: the position of an object in the PDF page is not known until the whole page is composed, the objects don't appear linearly in the output (note 13), and finally, an input file can be entirely parsed long before its contents are shipped out. To solve these, at each pdfsync anchor the known information (line number and source file name) is immediately written to the pdfsync auxiliary file, and the unknown information (location and page number) will be written only at the next ship out.

7.3 The .pdfsync File Specifications

This is an ASCII text file organized into lines. There is no required end-of-line marker format from among the standard ones used by operating systems.

Only the first two lines described in table 1 are required; the other ones are optional. The remaining lines are described according to their starting characters; they consist of 2 interlaced streams. A synchronous one, detailed in table 2, is obtained with \immediate\writes and concerns the input information. An asynchronous one, detailed in table 3, is obtained with delayed \writes and concerns the output information.

The correspondence between the two kinds of information is made through a record counter, which establishes a many-to-many mapping from line numbers in TEX sources to positions in PDF output.

7.4 Known Problems

Unfortunately, the various pdfsync files for Plain, LATEX or ConTEXt are not completely safe. Some compatibility problems with existing macro packages may occur. Moreover, sometimes pdfsync actually influences the final layout; in a case like that, it should only be used in the document preparation stage.

Another mechanism widely used by ConTEXt makes pdfsync sometimes inefficient: the macro expansion only occurs long after it has been parsed, such that the \inputlineno is no longer relevant and the significant line number is no longer accessible. This makes a second argument for the implementation of the pdfsync feature at a very low level, most certainly inside the pdftex engine itself.

8 TWS: A TEX Wrapper Structure

In general, working with TEX seems difficult due to the numerous auxiliary files created. Moreover, sharing TEX documents is often delicate as soon as we do not use very standard LATEX. The purpose of this section is to lay the foundation for the TEX Wrapper Structure, which aims to help the user solve these problems.

First, it is very natural to gather all the files related to one TEX document in one folder we call a TEX Wrapper. The file extension for this directory is texd, in reference to the rtf and rtfd file extensions already existing on Mac OS X. The contents of a TEX wrapper named document.texd is divided according to different criteria:

Notes:
10. http://www.micropress-inc.com/
11. http://www.bluesky.com/
12. http://www.uoregon.edu/~koch/texshop
13. The footnote objects provide a good example.
88 Preprints for the 2004 Annual Meeting iTEXMac: An Integrated TEX Environment for Mac OS X
1st line. Format: jobName, the case-sensitive, extensionless name of the TEX file. In general, the result of an \immediate\write\file{\jobname}.
2nd line. Format: version V, where V is a 0-based non-negative integer. The current version is 0.
Table 1: pdfsync required lines format
"b" line. Format: b name, where name is a TEX file name. TEX is about to begin parsing name; all subsequent line and column numbers will refer to name. The path is relative to the directory containing the .pdfsync file; path separators are the Unix "/"; the file extension is not required ("tex" is the default if necessary); case sensitive.
"e" line. Format: e. The end of the input file has been reached; subsequent line and column numbers now refer to the calling file. Optional, but must match a corresponding "b" line.
"l" line. Format: l R L or l R L C, where R is the record number, L the line number and C an optional column number.
Table 2: pdfsync line specifications of the synchronous stream
"s" line. Format: s S, where S is the physical page number. TEX is going to ship out a new page.
"p" and "p*" lines. Format: p R x y or p* R x y, where R is the record number, x the horizontal coordinate and y the vertical coordinate, given respectively by \the\pdflastxpos and \the\pdflastypos.
Table 3: pdfsync line specification of the asynchronous stream
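To make the two interlaced streams concrete, here is a sketch in Python that reads a small hand-written fragment in the format of tables 1 to 3 and joins the streams on the record number. Both the parser and the sample data are illustrative assumptions drawn from the specification above; this is not part of pdfsync itself.

```python
# Minimal reader for the .pdfsync format sketched in tables 1-3.
# Illustrative assumption based on the specification, not pdfsync code.

def parse_pdfsync(text):
    lines = text.splitlines()
    job_name = lines[0]                     # 1st line: jobName
    version = int(lines[1].split()[1])      # 2nd line: "version V"
    assert version == 0, "only version 0 is described"
    records = {}     # record number -> (source file, line number)
    positions = {}   # record number -> (page, x, y)
    files, page = [], None
    for raw in lines[2:]:
        parts = raw.split()
        tag = parts[0]
        if tag == "b":              # begin parsing a source file
            files.append(parts[1])
        elif tag == "e":            # end of that file
            files.pop()
        elif tag == "l":            # synchronous stream: l R L [C]
            records[int(parts[1])] = (files[-1], int(parts[2]))
        elif tag == "s":            # a new physical page is shipped out
            page = int(parts[1])
        elif tag in ("p", "p*"):    # asynchronous stream: p R x y
            positions[int(parts[1])] = (page, int(parts[2]), int(parts[3]))
    # Join the two streams on the record counter.
    return {records[r]: positions[r] for r in records if r in positions}

sample = """foo
version 0
b foo
l 1 12
l 2 14
s 1
p 1 4736286 45678592
p* 2 4736286 41678592
e"""
mapping = parse_pdfsync(sample)   # (file, line) -> (page, x, y)
```

With such a mapping, a front end can answer both directions of the correspondence by inverting the dictionary.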
• required data (source, graphics, bibliography database)
• helpful data and hints (tex, bibtex, makeindex options, known words)
• user-specific data
• front-end-specific data
• cached data
• temporary data

It seems convenient to gather all the non-required information in one folder named document.texd/document.texp, such that silently removing this directory would cause no harm. As a consequence, no required data should stay inside document.texp, and this is the only rule concerning the required data. The texp file extension stands for "TEX Project".

In tables 4 to 9 we show the core file structure of the document.texp directory. This is a minimal definition, involving only string encoding and spelling information, because there is no consensus yet among users and all the developers of TEX solutions, on Mac OS X at least. We make use of the XML property list data format as defined by http://www.apple.com/DTDs/PropertyList-1.0.dtd
Info.plist: XML property list for any general-purpose information, wrapped in an info dictionary described in table 5. Optional.
spellingKey.spelling: XML property list for lists of known words, wrapped in a spelling dictionary defined in table 9 and uniquely identified by spellingKey. This format is stronger than a simple comma-separated list of words. Optional.
frontends: directory dedicated to front ends only.
frontends/name: private directory dedicated to the front end identified by name. The further contents definition is left under the front end's responsibility.
users: directory dedicated to users. Should not contain any front-end-specific data.
users/name: directory dedicated to the user identified by name (not the login name). Not yet defined, but private and preferably encrypted.
Table 4: Contents of the TEX Project directory document.texp
isa (String): required, with value info.
version (Number): not yet used but reserved.
files (Dictionary): the paths of the files involved in the project, wrapped in a files dictionary. Optional.
properties (Dictionary): attributes of the above files, wrapped in a properties dictionary. Optional.
main (String): the fileKey of the main file, if relevant, where fileKey is one of the keys of the files dictionary. Optional.
Table 5: info dictionary description.
fileKey (String): the path of the file identified by the string fileKey, relative to the directory containing the TEX project. No two different keys should correspond to the same path.
Table 6: files dictionary description.
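Since the info dictionary is an ordinary XML property list, it can be produced and read back with any plist library. Below is a minimal sketch using Python's standard plistlib; the keys follow tables 4 to 6, while the concrete file names and fileKeys are invented for illustration.

```python
import plistlib

# Hypothetical Info.plist content for a TeX project directory,
# following tables 4-6; file names and fileKeys are invented.
info = {
    "isa": "info",            # required, fixed value (table 5)
    "version": 0,             # reserved
    "files": {                # fileKey -> path relative to the project
        "main-file": "document.tex",
        "bib": "document.bib",
    },
    "main": "main-file",      # fileKey of the main file
}

# Serialize to the XML property list format and read it back.
data = plistlib.dumps(info, fmt=plistlib.FMT_XML)
restored = plistlib.loads(data)
```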
However, this mechanism doesn't actually provide the concrete information needed to typeset properly (engine, format, output format). For that we can use Makefiles or shell scripts, either embedded in the TEX Wrapper itself or shipped as a standard tool in a TEX distribution. The latter choice is less powerful but much more secure. In any case, a set of default actions to be performed on a TEX Wrapper should be outlined (compose, view, clean, archive...).

Technically, iTEXMac uses a set of private, built-in shell scripts to typeset documents. If these are not suitable, customized ones are used instead, but no warning is given then. No security problem has been reported yet, most certainly because such documents are not shared.

Notice that iTEXMac declares texd as a document wrapper extension to Mac OS X, which means that document.texd folders are seen by other applications just like other single-file documents; their contents are hidden at first glance. Using another file extension will prevent this Mac OS X feature without losing the benefit of the TEX Wrapper Structure.

A final remark concerns the version control systems in standard use among TEX users. In the current definition, only one directory level should be
fileKey (Dictionary): language, encoding, spelling information and other attributes, wrapped in an attributes dictionary described in table 8. fileKey is one of the keys of the files dictionary.
Table 7: properties dictionary description.
isa (String): required, with value attributes.
version (Number): not yet used but reserved.
language (String): according to the latest ISO 639. Optional.
codeset (String): according to ISO 3166 and the IANA Assigned Character Set Names. If absent, the standard C++ locale library module is used to retrieve the codeset from the language. Optional.
eol (String): when non-void and consistent, the string used as the end-of-line marker. Optional.
spelling (String): one of the spellingKeys of table 4, meaning that spellingKey.spelling contains the list of known words of the present file. Optional.
Table 8: attributes dictionary description
isa (String): required, with value spelling.
version (Number): not yet used but reserved.
words (Array): the array of known words.
Table 9: spelling dictionary description.

supported in a document.texp folder. The contents of the frontends and users directories should not be monitored.

9 Nota Bene

Some features discussed here are still in the development stage and are still being tested and validated (for example, advanced syntax highlighting and full TWS support).
ŞäferTEX: Source Code Esthetics for Automated Typesetting
Frank-Rene Schaefer
Franzstr. 21
50931 Cologne
Germany
[email protected]
http://safertex.sourceforge.net
Abstract

While TEX [4] provides high quality typesetting features, its usability suffers due to its macro-based command language. Many tools have been developed over the years simplifying and extending the TEX interface, such as LATEX [5], LATEX3 [6], pdfTEX [2], and NTS [8]. Front ends such as TEXmacs [10] follow the visual/graphical approach to facilitate the coding of documents. The system introduced in this paper, however, is radical in its targeting of optimized code appearance. The primary goal of ŞäferTEX is to make the typesetting source code as close as possible to human-readable text, to which we have been accustomed over the last few centuries. Using indentation, empty lines and a few triggers allows one to express interruption, scope, listed items, etc. A minimized frame of 'paradigms' spans a space of possible typesetting commands. Characters such as '_' and '$' do not have to be backslashed. Transitions from one type of text to another are detected automatically, with the effect that environments do not have to be bracketed explicitly. The following paper introduces the programming language ŞäferTEX as a user interface to the TEX typesetting engine. It is shown how the development of a language with reduced redundancy increases the beauty of code appearance.
1 Introduction

The original role of an author in the document production process is to act as an information source. To optimize the flow of information, the user has to be freed from tasks such as text layout and document design. The user should be able to delegate the implementation of visual document features and styles to another entity. With this aim in mind, the traditional relationship between an author and his typesetter before the electronic age can be considered the optimal case. Modern technology has increased the speed and reduced the cost of document processing. However, the border between information specification and document design has blurred or even vanished.

In typesetting engines with a graphical user interface, an editor often takes full control over page breaks, font sizes, paragraph indentation, references and so on. Script-oriented engines such as TEX take care of most typesetting tasks and provide high quality document design. However, quite often the task of producing a document requires detailed insight into the underlying philosophy.

ŞäferTEX tries to get back to the basics, as depicted in Figure 1. Like the traditional writer, a user shall specify information as redundancy-free as possible, with a minimum of commands that are alien to him. Layout, features, and styles shall be implemented according to predefined standards with a minimum of specification by the user.

To the user, the engine provides a simple interface, requiring only plain text, tables and figures. A second interface allows a human expert to adapt the engine to local requirements of style and output. Ideally, the added features in the second interface do not appear to the user, but are activated from context. Then the user can concentrate on the core information he wants to produce, and not be distracted by secondary problems of formatting.

[Figure 1: Neo-traditional typesetting. The human user supplies pure text, tables and figures; the human expert supplies features, style and formatting; both feed the document design and typesetting engine.]
The above-mentioned ideal configuration of engine, user, and expert can hardly be achieved with present automated text formatting systems. While relying on TEX as a typesetting engine, ŞäferTEX tries to progress towards a minimal-redundancy programming language that is at the same time intuitive to the human writer.

2 The ŞäferTEX Engine

As shown in figure 2, the ŞäferTEX engine is based on a three-phase compilation, namely: lexical analysis, parsing and code generation. Along with the usual advantages of such modularization, this structure allows us to describe the engine in a very formal manner. In this early phase of the project, it further facilitates adding new features to the language. Using interfacing tools such as SWIG [1] and .NET [9], it would be possible in the future to pass the generated parse tree to different programming languages. Such an approach would open a whole new world to typesetters and document designers. Plug-ins for ŞäferTEX could then be designed in the person's favorite programming language (C++, Java, Python, C#, anything). Currently, automated document production mainly happens by preprocessing LATEX code. Using a parse tree, however, gives access to document contents in a structured manner, i.e., through dedicated data structures such as objects of type section, item group, and so on.

GNU flex [7] (a free software package) is used to create the lexical analyzer. It was, though, necessary to deviate from the traditional idea of a lexical analyzer as a pure finite state automaton: a wrapper around flex implements inheritance between modes (start conditions). Additionally, the lexical analyzer produces implicit tokens and deals with indentation as a scope delimiter.

The parser is developed using the Lemon parser generator [3] (also free software). Using such a program, the ŞäferTEX language can be described with a context-free grammar. The result of the parser is a parse tree, which is currently processed in C++. The final product is a LATEX file that is currently fed into the LATEX engine. A detailed discussion of the engine is not the intention of this paper, though. The present text focuses on the language itself.

[Figure 2: The ŞäferTEX compilation process. Source files pass through lexical analysis (token stream), parsing (parse tree) and code generation (LATEX code) into the typesetting engine core.]

3 Means for Beauty

ŞäferTEX tries to optimize code appearance. The author identifies three basic means by which this can be achieved:

1. The first means is intuitive treatment of characters. For example, '$' and '_' are used as normal characters and do not function as commands, as they do in LATEX.

2. The second is to use indentation as the scope delimiter. This is reminiscent of the Python programming language. It allows the user to reduce brackets and enforces proper placement of scopes. For table environments, this principle is extended so that column positions can be used as cell delimiters.

3. The third principle is automatic environment detection. If an item appears, then the 'itemize' environment is automatically assumed. This reduces redundancy and makes the source file much more readable.

Applying these principles leads to the eight rules of ŞäferTEX, as explained at the end (section 5). We now discuss them in more detail.

3.1 Intuitive Treatment of Characters

In the design of a typesetting language, the user has to be given the ability to enter both normal text and commands specifying document structure and non-text content. This can be achieved by defining functions, i.e., using character sequences as triggers for a specific functionality. This happens when we define, say, sin(x) as a function computing the sine of x. For a typesetter this is not a viable option, since the character chain can easily be confused with normal text. As a result, one would have to 'bracket' normal text or 'backslash' functions. Another solution is to use extra characters. This was the method Donald Knuth chose when he designed TEX [4]. The first solution is still intuitive to most users. The second, however, is rather confusing, implying that '%', '$' and '_' have a meaning different from what one sees in the file.

Historically, at the time TEX was designed, keyboards had a very restricted number of characters. Moreover, ASCII being the standard text encoding in Knuth's cultural context, the high cost of data
Einstein clearly stated his disbelief in the boundedness of the human spirit, as becomes clear through his sayings:

\quote The difference between genius and
stupidity is that genius has its limits.
Only two things are infinite, the
universe and human stupidity, and I'm
not sure about the former.

Similar reports have been heard from Frank Zappa and others.

Table 1: Comparison of treatment of special characters in LATEX and ŞäferTEX.

LATEX:
According to Balmun \& Refish
$<$www.b-and-r.org$>$ a
conversion of module \#5, namely
'propulsion\ control,'
into a metric system increases
code safety up to 98.7\% at
cost of \~ \$17,500.

ŞäferTEX:
According to Balmun &
Refish
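The comparison in table 1 comes down to which characters a code generator must escape on its way to LATEX output. The toy filter below applies the straightforward replacements; it is an illustration of the idea, not ŞäferTEX's actual code generator.

```python
# Map characters that SaferTeX-style input treats as ordinary text
# onto their LaTeX-escaped forms. A toy illustration of what a code
# generator must do; not the actual SaferTeX implementation.
LATEX_ESCAPES = {
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
    "\\": r"\textbackslash{}",
}

def escape_latex(text):
    # Escape character by character; everything else passes through.
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)
```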
\table Food suppliers, prices and amounts.
______
Product       || Price/kg && Supplier               && kg  || Total Price
======
Sugar         || $0.25    && Jackie O'Neil          && 34  || $8.50
______
Yellow Swiss  || $12.2    && United Independent     && 100 || $1220.00
Cheese                       Farmers of Switzerland
______
Green Pepper  || $25.0    && Anonymous              && 2   || $50.00
Genuine                      Indians Tribes
Mexican
======
Sum              ~~          ~~                        ~~  && $1278.50
______
Figure 4: Example of writing a table: identifying cell borders by column.
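The "cell borders by column" idea of figure 4 can be mimicked by recording where the '&&' separators fall in the header row and cutting every subsequent row at those same offsets. The sketch below is a deliberately simplified assumption: it ignores '||' borders, rules and continuation lines, which the real syntax also handles.

```python
# Cut table rows at the column offsets of the '&&' separators found
# in the header row, in the spirit of "cell borders by column".
# Simplified sketch; not SaferTeX's actual table parser.

def column_cuts(header, sep="&&"):
    # Offsets of every separator occurrence in the header row.
    cuts, start = [], 0
    while (i := header.find(sep, start)) != -1:
        cuts.append(i)
        start = i + len(sep)
    return cuts

def split_row(row, cuts, sep="&&"):
    # Slice the row at the header's separator offsets.
    starts = [0] + [c + len(sep) for c in cuts]
    ends = cuts + [len(row)]
    return [row[b:e].strip() for b, e in zip(starts, ends)]

header = "Product      && Price/kg && kg  && Total"
row    = "Sugar        && $0.25    && 34  && $8.50"
cells = split_row(row, column_cuts(header))
```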
\parbox unnecessary. Figure 4 shows an example of a ŞäferTEX table definition.

4 Implicit Environment Detection

A basic means of improving the convenience of programming is reducing redundancy. In LATEX, for example, the environment declarations are sometimes unnecessary. To declare a list of items, one has to specify something like

\begin{itemize}
\item This is the first item and
\item this one is the second.
\end{itemize}

Considering the information content, the occurrence of the \item should be enough to know that an itemize environment has started. Using our second paradigm, 'scope by indentation', the closing of the environment could be detected by the first text block that has less indentation than the item itself. The \begin and \end statements are therefore redundant. In ŞäferTEX, the token '--' (two dashes) is used to mark an item. Thus, in ŞäferTEX, the item list above simply looks like:

-- This is the first item and
-- this one is the second.

As implied previously, this paradigm's power really unfolds in combination with scope by indentation. Subsequent paragraphs simply need to be indented more than the text block to which they belong. Nested item groups are specified by higher levels of indentation, as seen in figure 5.

Items provide a good means to
-- structure information
-- emphasize important points.
   There are three basic ways to do this:

   [[Numbers]]: Enumerations are good when
      there is a sequential order in the
      information being presented.

   [[Descriptions]]: Descriptions are
      suitable if keywords or key phrases
      are placeholders for more specific
      information.

   [[Bullets]]: Normal items indicate that
      the presented set of information does
      not define any priorization.
      -- classify basic categories

There may be other things to consider of which
the author is currently unaware.

Figure 5: Example code showing 'scope by indentation'.

Some important points from the example:

• The appearance of a '--' at the beginning of a line tells ŞäferTEX that there is an item and that an implicit token 'list begin' has to be created before the token 'item start' is sent. The next '--' signals the start of the next item.

• The '[['-symbol appears at the beginning of the line. It indicates a descriptor item. Since it has a higher indentation than the '--' items, it is
identified as a nested list. Therefore, an implicit token 'list begin' has to be created again.

• The final sentence, having less indentation than anything before, closes all lists, i.e., it produces implicit 'list end' tokens for all lists that are to be closed. Thus, the parser and code generator are able to produce environment commands corresponding to the given scopes.

The above has discussed the fundamental ideas to improve programming convenience for a typesetting system. We now turn to defining a best set of rules for expressions that implements these ideas.

5 The Eight Rules of ŞäferTEX

Rules for command design shall be consistent with the paradigms of intuitive treatment of characters, scope by indentation, and automatic environment detection. The following set of rules was designed to meet these goals for ŞäferTEX while striving for an intuitive code appearance:

[1] Every character and every symbol in the code appears in the final output as in the source document, except for alien things.

[2] Alien things look alien.

In plain TEX, characters such as '$', '%' and '_' do not appear in the document as typed. The fact that they look natural but trigger some TEX-specific behavior is prone to confuse the layman. In ŞäferTEX, they appear as typed on the screen. Alien things can be identified by their look. The next four rules define the 'alien look':

[3] Any word starting with a single backslash \. Examples are \figure and \table.

[4] Any non-letter character that appears twice or more, such as '##' (this triggers the start of an enumeration item at the beginning of the line).

[5] Parentheses (at the beginning of a line) that only contain asterisks '*' or whitespace. Sequences such as '(*)', '()', '(***)' indicate sections and subsections.

[6] The very first paragraph of the file. It is interpreted as the title of the document.

Except for the first case, alien things do not interfere with readability. In fact, the double minus '--' for items and the '(*)' for sections are used naturally in many ASCII files. Internally, alien things are translated into commands for the typesetting engine, but the user does not need to know. The last two issues are separation of the text stream and identification of the scope of an environment:

[7] Termination of paragraphs, interruptions of the text flow, etc., are indicated by an empty line.

[8] The scope of an environment, table cells, etc. is determined by its indentation. A line with less indentation closes all scopes of higher indentation.

These are the eight rules of ŞäferTEX which enable one to operate the typesetter. They are defined as 'rules' but, in fact, they do not go much beyond common organization of text files.

6 Commands

This section gives a brief overview of the commands that are currently implemented. In this early stage of development, the system's structure and language design have been in the foreground, in order to build the framework for a more powerful typesetting engine. In the current version of ŞäferTEX, the following commands are implemented:

--, ++ starts a bullet item. The two can be used interchangeably to distinguish different levels of nested item groups.

## starts an enumeration item.

[[ ]] bracket the beginning of a description item.

\table opens a table environment. It is followed by a caption and the table body as described in section 3.2.

\figure opens a figure environment. The text following this command is interpreted as the caption. Then file names of images are to be listed. Images that are to be shown side by side are separated by '&&'. Vertically adjacent images are separated by empty lines.

\quote opens a quote environment.

(*) starts a section. The number of asterisks indicates the level of the section.

() starts a section without a section number. The number of blanks indicates the section level.

.... includes a file (more than four dots is equivalent to four). The next non-whitespace character sequence is taken as the filename to be included.

\author specifies information about the author of the document.

Commands have been designed for footnotes, labels, and more. However, due to the early stage of development, no definite decision about their format has been made. In the appendix, two example files are listed in order to provide an example of ŞäferTEX code in practical applications.
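Rules [4], [7] and [8] are enough to turn an indented '--' list into LATEX without any \begin or \end in the source. The following sketch shows that translation for a single level of plain items; it is an illustration of the principle, not ŞäferTEX's lexer, which also handles nesting and the other item kinds.

```python
# Emit \begin{itemize}/\end{itemize} implicitly from '--' items and
# indentation, in the spirit of rules [4], [7] and [8]. A sketch
# only: one level of plain items, not the full SaferTeX grammar.

def items_to_latex(lines):
    out = []
    open_indent = None            # indentation column of the open list
    for line in lines:
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        # Rule [8]: less indentation closes the scope; rule [7]: an
        # empty line interrupts the text flow and closes it as well.
        if open_indent is not None and (not stripped or indent < open_indent):
            out.append(r"\end{itemize}")
            open_indent = None
        if stripped.startswith("--"):           # rule [4]: '--' marks an item
            if open_indent is None:
                out.append(r"\begin{itemize}")  # implicit 'list begin'
                open_indent = indent
            out.append(r"\item " + stripped[2:].strip())
        elif stripped:
            out.append(stripped)
    if open_indent is not None:
        out.append(r"\end{itemize}")            # implicit 'list end' at EOF
    return out

source = [
    "Items provide a good means to",
    "  -- structure information",
    "  -- emphasize important points.",
    "The final sentence closes the list.",
]
latex = items_to_latex(source)
```

The final sentence, being less indented than the items, produces the implicit 'list end' token exactly as described in section 4.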
7 Conclusion and Outlook

Using simple paradigms for improving code appearance and reducing redundancy, a language has been developed that allows more user-friendly input than is currently possible with TEX and LATEX. These paradigms are: the intuitive processing of special characters, the usage of indentation for scope, and the implicit identification of environments. As an implementation of these paradigms, the eight rules of ŞäferTEX were formed, which describe the fundamental structure of the language.

While developing ŞäferTEX, the author quickly realized that the ability to provide the parse tree to layout designers extends the usage beyond the domain of TEX. Currently, much effort remains to provide appropriate commands for document production. Functionality of popular tools such as psfrag, fancyheaders, bibtex, makeindex, etc., is to be implemented as part of the language. In the long run, however, it may be interesting to extend its usage towards a general markup language.

8 Acknowledgments

The author would like to thank the TEX Stammtisch of Cologne in Germany for their valuable comments. Special thanks to Holger Jakobs who helped translate this text from Genglish to English.

References

[1] D. M. Beazley. Automated scientific software scripting with SWIG. In Tools for program development and analysis, volume 19, pages 599-609. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 2003.
[2] T. T. Hàn, S. Rahtz, and H. Hagen. The pdftex manual. http://www.ntg.nl/doc/han/pdftex-a.pdf, 1999.
[3] R. D. Hipp. The Lemon Parser Generator. http://www.hwaci.com/sw/lemon, 1998.
[4] D. E. Knuth. The TEXbook. Addison Wesley, 1983.
[5] H. Kopka and P. Daly. A Guide to LATEX. Addison Wesley, 1992.
[6] Frank Mittelbach and Chris Rowley. The LATEX3 Project. TUGboat, 18(3):195-198, 1997.
[7] J. Poskanzer and V. Paxson. Flex, a fast lexical analyzer generator. http://sourceforge.net/projects/lex, 1995.
[8] P. Taylor, J. Zlatuška, and K. Skoupý. The NTS project: from conception to implementation. Cahiers GUTenberg, May 2000.
[9] A. Troelsen. C# and the .NET Platform. APress, 2001.
[10] Joris van der Hoeven. GNU TEXmacs. http://www.texmacs.org/tmweb/home/welcome.en.html, 2003.
Details about The Elves and The Shoemaker
\author Original: Brothers Jakob & Wilhelm Grimm
Somewhere in Germany
( ) Abstract ...... abstract.st
(*) Nocturne shoe productions ...... strange.st (**) Living in confusion ...... confusion.st (**) Women make trouble ...... trouble.st
(*) Midnight observations ...... midnight.st (**) Elves in the cold ...... freezing-elves.st
(*) New era for elves: luxury ...... luxury.st
(*) Elves leave their job undone ...... spoiled-elves.st
Figure 6: Example input ‘main.st’.
\figure ::fig:plots:: Performance a) productivity of shoemaker. b) gain.
ferry-tales/prod.eps && ferry-tales/capital.eps
Reviewing the plots of shoes produced (figure --
## He was sleepworking. Since he even used to work few when awake this assumption was quickly refuted.
## Elves must have come over night and did some charity work.
He further based his theory on the influence of the tanning material used. In fact, there were differences in the number of shoes produced depending on acid number and pH value (see table --
\table ::tab:tan-mat::Influence of tanning materials on shoe production.
Tanning Mat. && pH value && acid number && shoes prod.
______
European     && 3.4-3.7  && 30-40       && 32 @@
Indian          2.0-2.1     31-45          35 @@
African         4.5-4.6     33-37          36 @@
Australian      3.0-7.0     27-45          15 @@
______
Resourcing several leathers from indian & african suppliers allowed him to increase profit ranges tremendously. Moreover, these shoes were sold at an even higher price around $0.50. Pretty soon, the shoemaker was able to save a good sum of $201.24.
Figure 7: Example input ‘confusion.st’.
FEATPOST and a Review of 3D METAPOST Packages
L. N. Gonçalves
CFMC-UL, Av. Prof. Gama Pinto 2
1649-003 Lisboa
Portugal
[email protected]
http://matagalatlante.org
Abstract

METAPOST is able to produce figures that look almost like ray-traced raster images but that remain vector-based. A small review of three-dimensional perspective implementations with METAPOST is presented. Special emphasis is given to the abilities of the author's implementation: FEATPOST.
1 Introduction

There are at least four METAPOST packages related to three-dimensional diagrams:

• GNU 3DLDF: http://www.gnu.org/directory/graphics/3D/3DLDF.html
• 3d/3dgeom: http://tug.org/tex-archive/graphics/metapost/macros/3d/
• m3D: http://www-math.univ-poitiers.fr/~phan/m3Dplain.html
• FEATPOST: http://matagalatlante.org/nobre/featpost/doc/featexamples.html

All of these packages are individual and independent works "under construction". There has been neither collaboration nor competition among the authors. Each produces different kinds of diagrams and each uses a different graphic pipeline. The following sections of this document describe these packages, in a mainly independent way.

2 GNU 3DLDF

3DLDF is not a pure METAPOST package, as it is written in C++ using CWEB. Diagrams are also coded in C++ and are compiled together with the package. Nevertheless, this is, of all four, the package with the greatest promise for a future three-dimensional-capable METAPOST.

1. It outputs METAPOST.
2. Its syntax is similar to METAPOST.
3. It overcomes the arithmetic limitations inherent in METAPOST.
4. Both the affine transformations and the graphics pipeline are implemented through 4 × 4 matrices.
5. Its author, Laurence D. Finston, is actively improving and maintaining the package. His plan includes, among many other ideas, the development of an input routine (to allow interactive use) and the implementation of three-dimensional paths via NURBS.

Given the possible computational efficiency of this approach, one can foresee a system that merges the METAPOST language with the capabilities of standard ray-tracing software.

3 3d/3dgeom

This was the first documented extension of METAPOST into the third dimension, and also into the fourth dimension (time). Denis B. Roegel created, back in 1997, the 3d package to produce animations of polyhedra. In 2003 he added the 3dgeom "module", which is focused on space geometry. It remains the least computationally intensive package of those presented here.

1. Each component of a point or a vector is stored in a different numeric array. This eases control of a stack of points. Points are used to define planar polygons (faces of polyhedra) and the polygons are used to define convex polyhedra.
2. When defining a polygon, a sequence of points must be provided such that advancing along the sequence is the same as rotating clockwise on the polygon, when the polygon is visible. This means that, when a polyhedron is to be drawn, the selection of polygons to draw is very easy: only those whose points rotate clockwise (the visible ones). Hidden line removal is thus achieved without sorting the polygons.
3. Points can also be used to define other points according to rules that are common in the geometry of polyhedra, or according to operations involving straight lines and/or planes and/or angles.
4. The author plans to release an updated version with the ability to graph parametric lines and surfaces.

4 m3D

Anthony Phan developed this very interesting package but has not yet written its documentation. Certainly, this is, of all four, the package that can produce the most complex and beautiful diagrams. It achieves this using, almost exclusively, four-sided polygons.

1. Complex objects can be defined and composed (see figure 1). For example, one of its many predefined objects is the fractal known as the "Menger Sponge".
2. It can render revolution surfaces defined from a standard METAPOST path (see figure 2).
3. Objects or groups of polygons can be sorted and drawn as if reflecting light from a point source and/or disappearing in a foggy environment.

[Figure 1: A diagram produced by m3D showing a single object, composed of spheres and cylindrical connections, under a spherical perspective.]

[Figure 2: A diagram produced by m3D showing a revolution surface under a central perspective.]

5 FEATPOST

Geared towards the production of physics diagrams, FEATPOST sacrifices programming style and computational efficiency for a large feature set.

1. Besides the usual parallel and central perspectives, it can apply a sort of "spherical distortion", as if the diagram were observed through a fish-eye lens (also possible with m3D). This kind of perspective is advantageous for animations, as it allows the point of view to be inside or among the diagram objects. When using the central perspective, points that are as distant from the projection plane as the point of view get projected at infinity, and METAPOST overflows and crashes. The spherical projection is always finite.
2. It can mark and measure angles in space.
3. It can produce shadows of some objects (see figure 9). Shadows are calculated in much the same way as perspectives. The perspective projection, from 3D into 2D, is a calculation of the intersection of a straight line and a plane. A shadow is also a projection from 3D into 2D; only the line and the plane are different. The shadow must be projected onto the paper page before the object that creates the shadow: shadows are drawn after two projections, objects are drawn after one projection and after their shadows.
4. It can correctly draw intersecting polygons (see figure 12).
5. It knows how to perform hidden line removal on some curved surface objects. Imagine a solid cylinder. Now consider the part of the cylinder's base that is the farthest away. You only see a part of its edge. In order to draw that part, it is necessary to know the two points at which the edge becomes hidden. FEATPOST calculates this. Note that the edge is a circle, a curved line. FEATPOST does not use polygons to hide lines on curved surface objects.
6. Supported objects include: dots, vectors, angles, ropes, circles, ellipses, cones, cylinders, globes, other curved surface objects, polygons, cuboids, polyhedra, functional and parametric
surface plots, direction fields, field lines and trajectories in conservative force fields. The macro that actually does the projection is, in all cases, rp.

Many of the drawable objects are not made of polygons, but rather of two-dimensional paths. FEATPOST does not attempt to draw surfaces of these objects, only their edges. This is partly because of the use of intrinsic METAPOST functions and partly because it eases the production of diagrams that combine space and planar (on paper) objects.

One of the intrinsic METAPOST functions that became fundamental for FEATPOST is the composition makepath makepen. As this converts a path into its convex form, it very much simplifies the determination of some edges.

Another important aspect of the problem is hidden line removal. Hidden line removal of a group of polygons can, in some cases, be performed by drawing the polygons in decreasing order of distance to the point of view. FEATPOST generally uses the Shell sorting method, although when the polygons are just the faces of one cuboid FEATPOST has a small specific trick. There is also a specific method for hidden line removal on cylinders and another for other curved surface objects.

Figure 3: Parallel projection.
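The depth-sorting idea described above (draw the polygons of a group in decreasing order of distance to the point of view, so that nearer ones overpaint farther ones) can be sketched outside METAPOST. The following Python fragment is an illustration only, not FEATPOST code; all names in it are invented, and a plain sort stands in for FEATPOST's Shell sort, since only the resulting order matters:

```python
# Painter's algorithm sketch: sort polygons by decreasing distance
# from the point of view, then draw them far-to-near.

def dist(p, q):
    # Euclidean distance between two 3D points.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(polygon):
    # Average of the polygon's vertices, used as its depth reference.
    n = len(polygon)
    return tuple(sum(v[i] for v in polygon) / n for i in range(3))

def painter_order(polygons, point_of_view):
    # Farthest polygons first, so nearer ones are drawn on top.
    return sorted(polygons,
                  key=lambda poly: dist(centroid(poly), point_of_view),
                  reverse=True)

# Two unit squares at different depths, viewed from (0, 0, 5):
near = [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
far = [(0, 0, -3), (1, 0, -3), (1, 1, -3), (0, 1, -3)]
order = painter_order([near, far], (0, 0, 5))
# 'far' is drawn first, 'near' last (on top).
```

As the text notes, this only works "in some cases": mutually overlapping or intersecting polygons have no single correct depth order, which is why FEATPOST needs separate methods for cuboids, cylinders, and other curved surface objects.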
5.1 Examples

Some of the FEATPOST macros are presented here. Detailed information is available at

• http://matagalatlante.org/nobre/featpost/doc/macroMan.html
• http://www.ctan.org/tex-archive/graphics/metapost/macros/featpost/

Each perspective depends on the point of view. FEATPOST uses the global variable f, of type color, to store the (X,Y,Z) coordinates of the point of view. Also important is the aim of view (global variable viewcentr). Both together define the line of view.

The perspective consists of a projection from space coordinates into planar (u, v) coordinates on the projection plane. FEATPOST uses a projection plane that is perpendicular to the line of view and contains the viewcentr. Furthermore, one of the projection plane axes is horizontal and the other is on the intersection of a vertical plane with the projection plane. "Horizontal" means parallel to the XY plane.

One consequence of this setup is that f and viewcentr must not be on the same vertical line (as long as the author avoids solving this problem, at least!). The three kinds of projection known to FEATPOST are schematized in figures 3, 4 and 5.

Figure 4: Central projection.

Physics problems often require defining angles, and diagrams are needed to visualize their meanings. The angline and squareangline macros (see figure 6 and the code below) support this.

  f := (5,3.5,1);
  beginfig(2);
    cartaxes(1,1,1);
    color va, vb, vc, vd;
    va = (0.29,0.7,1.0);
    vb = (X(va),Y(va),0);
    vc = N((-Y(va),X(va),0));
    vd = (0,Y(vc),0);
    drawarrow rp(black)--rp(va);
    draw rp(black)--rp(vb)--
      rp(va) dashed evenly;
    draw rp(vc)--rp(vd) dashed evenly;
    drawarrow rp(black)--rp(vc);
    squareangline( va, vc, black, 0.15 );
    angline(va,red,black,0.75,
      decimal getangle(va,red),lft);
  endfig;

Figure 5: Spherical projection. The spherical projection is the composition of two operations: (i) there is a projection onto a sphere and (ii) the sphere is plaited onto the projection plane.

Figure 6: FEATPOST diagram using angline.

Visualizing parametric lines is another need of physicists. When two lines cross, one should be able to see which line is in front of the other. The macro emptyline can help here (see figure 7 and the code below).

  f := (2,4,1.8);
  def theline( expr TheVal ) =
    begingroup
      numeric cred, cgre, cblu, param;
      param = TheVal*(6*360);
      cred = -0.3*cosd( param );
      cblu = 0.3*sind( param );
      cgre = param/850;
      ( (cred,cgre,cblu) )
    endgroup
  enddef;
  beginfig(1);
    numeric axsize, zaxpos, zaxlen;
    color xbeg, xend, ybeg, yend, zbeg, zend;
    axsize = 0.85; zaxpos = 0.55; zaxlen = 2.1;
    pickup pencircle scaled 1.5pt;
    xbeg = (axsize,0,0);
    xend = (-axsize,0,0);
    ybeg = (0,0,-axsize);
    yend = (0,0,axsize);
    zbeg = (zaxpos,-zaxpos,0);
    zend = (zaxpos,zaxlen,0);
    drawarrow rp( xbeg )--rp( xend );
    drawarrow rp( ybeg )--rp( yend );
    defaultscale := 1.95;
    label.rt( "A", rp( xend ) );
    label.lft( "B", rp( yend ) );
    emptyline(false,1,black,
      0.5black,1000,0.82,2,theline);
    drawarrow rp( zbeg )--rp( zend );
    label.bot( "C", rp( zend ) );
  endfig;

Figure 7: FEATPOST diagram using emptyline.

Cuboids and labels are always needed. The macros kindofcube and labelinspace fulfill this need (see figure 8 and the code below). The macro labelinspace does not project labels from 3D into 2D; it only transforms the label in the same way as its bounding box, that is, the same way as two perpendicular sides of its bounding box. This is only exact for parallel perspectives.

  f := (2,1,0.5);
  ParallelProj := true;
  verbatimtex
  \documentclass{article}
  \usepackage{beton,concmath,ccfonts}
  \begin{document}
  etex
  beginfig(1);
    kindofcube(false,true,(0,-0.5,0),
      90,0,0,1.2,0.1,0.4);
    kindofcube(false,true,(0,0,0),
      0,0,0,0.5,0.1,0.8);
    labelinspace(false,(0.45,0.1,0.65),
      (-0.4,0,0),(0,0,0.1),
      btex
        \framebox{\textsc{Label}}
      etex);
  endfig;
  verbatimtex \end{document} etex

  axe := (0,cosd(90-wang),
    sind(90-wang));
  fws := 4;
  reflen := 0.35*fws;
  coordg := frac*fws;
  NumLines := 45;
  HoriZon := -0.5*fws;
  setthestage(0.5*NumLines,3.3*fws);
  cubevertex = (0.3*fws,-0.5*fws,0);
  tropicalglobe( 7, cubevertex,
    0.5*fws, axe );
  allellaxe := reflen*(0.707,0.707,0);
  ellaaxe := reflen*( 0.5, -0.5, 0 );
  pcb := ( -coordg, coordg, 0 );
  rigorousdisc( 0, true, pcb,
    0.5*fws, -ellaaxe );
  conecenter = ( coordg, coordg,
    -0.5*fws );
  conevertex = conecenter +
    ( 0, 0, 0.9*fws );
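Section 5 observes that both the perspective projection and the shadow computation reduce to intersecting a straight line with a plane: for the perspective, the line runs from the point of view through the object point and the plane is the projection plane; for the shadow, the line runs from the light source through the point and the plane is the ground. As a language-neutral illustration (Python rather than METAPOST; this is not the rp macro, and every name here is invented):

```python
# Line-plane intersection sketch: the shared geometric core of a
# central perspective projection and a cast shadow.

def line_plane_intersection(p, q, plane_point, plane_normal):
    """Intersect the line through points p and q with a plane.

    The plane is given by any point on it and its normal vector.
    Returns None when the line is parallel to the plane.
    """
    d = tuple(b - a for a, b in zip(p, q))          # line direction
    denom = sum(n * c for n, c in zip(plane_normal, d))
    if abs(denom) < 1e-12:
        return None                                  # parallel: no hit
    t = sum(n * (w - a)
            for n, w, a in zip(plane_normal, plane_point, p)) / denom
    return tuple(a + t * c for a, c in zip(p, d))

# Central perspective: project a point along the line from the point
# of view f onto a projection plane (here z = 0 with normal (0,0,1)).
f = (0.0, 0.0, 5.0)
point = (1.0, 2.0, 1.0)
image = line_plane_intersection(f, point, (0, 0, 0), (0, 0, 1))
# image is (1.25, 2.5, 0.0)

# A shadow is the same computation with a different line and plane:
# from the light source through the point, onto the ground plane.
light = (0.0, 0.0, 10.0)
shadow = line_plane_intersection(light, point, (0, 0, 0), (0, 0, 1))
```

This also makes the drawing order in the text concrete: the shadow point has gone through two such projections by the time it reaches the page (onto the ground, then onto the projection plane), while the object point goes through only one, so shadows must be drawn first.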
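Section 5 claims that the central perspective sends points level with the point of view to infinity (crashing METAPOST) while the spherical projection is always finite. A small numeric sketch makes the contrast visible; note this is Python, not FEATPOST, and the equidistant fish-eye formula r = f·θ is an assumption standing in for FEATPOST's actual spherical mapping:

```python
# Radial image distance as a function of the angle theta between the
# line of view and the direction to a point, for two projections.

import math

def central_radius(theta, focal=1.0):
    # Central perspective: r = f * tan(theta); diverges as theta -> 90 deg.
    return focal * math.tan(theta)

def spherical_radius(theta, focal=1.0):
    # Equidistant fish-eye mapping (an assumed stand-in): r = f * theta,
    # finite for every theta, even behind the viewer.
    return focal * theta

wide = math.radians(89.9)          # nearly level with the point of view
print(central_radius(wide))        # huge: approaching the blow-up at 90 deg
print(spherical_radius(wide))      # still small and well-behaved
```

This is why the spherical distortion is the safer choice for animations whose point of view moves in among the diagram objects: no reachable direction makes the projection overflow.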