TUGBOAT

Volume 19, Number 4 / December 1998

       347  Addresses
       348  A seasonal puzzle: XII / David Carlisle

General Delivery
       349  From the President / Mimi Jett
       351  Editorial comments / Barbara Beeton
            TUG election; TeX’98; The end of an era - Phyllis Winkler retires; Sans Serif; Sauter font distribution has a new maintainer; Goodies on CTAN

Typography
       353  Typographers’ inn / Peter Flynn
       355  Typesetting with varying letter widths: New hope for your narrow columns / Miroslava Misáková

Software & Tools
       366  Editorial: EncTeX, by Petr Olšák / Barbara Beeton
       366  EncTeX - A little extension of TeX / Petr Olšák
       372  ConcTeX: Generating a concordance from TeX input files / Laurence Finston

Language Support
       403  Cyrillic encodings for LaTeX2ε multi-language documents / A. Berdnikov, O. Lapko, M. Kolodin, A. Janishevsky, and A. Burykin
       417  Romanized Indic and LaTeX / Anshuman Pandey
       419  New Greek fonts and the greek option of the babel package / Claudio Beccari and Apostolos Syropoulos

Hints & Tricks
       426  ‘Hey - it works!’ / Jeremy Gibbons
            Controlling abbreviations in BibTeX (Jeroen H. B. Nijhof); A small minus sign (Jeremy Gibbons); Ornamental rules (Christina Thiele)
       428  The Treasure Chest: A package tour from CTAN - soul.sty / Christina Thiele

Abstracts
       431  Les Cahiers GUTenberg, Contents of issue 30

News & Announcements
       433  Calendar
       434  TUG’99 Announcement

Late-Breaking News
       432  Production notes / Mimi Burbank
       432  Future issues

TUG Business
       437  Institutional members
       438  TUG membership application

Advertisements
       439  TeX consulting and production services
       440  Y&Y Inc.
       c 3  Blue Sky Research

[A] ‘pi’ of type thrown on the floor constitutes no reading matter, though it contains its elements. The order, the sequence of letters in a sentence, is therefore paramount.
    Victor Hammer [1955], “Those Visible Marks...”, The forms of our letters, Typophile Monograph, New Series, Number 6, 1988

COMMUNICATIONS OF THE TeX USERS GROUP
EDITOR BARBARA BEETON

VOLUME 19, NUMBER 4 • DECEMBER 1998
PORTLAND • OREGON • U.S.A.

TUGboat

During 1999, the communications of the TeX Users Group will be published in four issues. The September issue (Vol. 20, No. 3) will contain the Proceedings of the 1999 TUG Annual Meeting.

TUGboat is distributed as a benefit of membership to all members.

Submissions to TUGboat are reviewed by volunteers and checked by the Editor before publication. However, the authors are still assumed to be the experts. Questions regarding content or accuracy should therefore be directed to the authors, with an information copy to the Editor.

Submitting Items for Publication

The next regular issue will be Vol. 20, No. 1. The deadline for technical items will be February 15; reports and similar items are due by March 1. Mailing is scheduled for March. Deadlines for other future issues are listed in the Calendar, page 433.

Manuscripts should be submitted to a member of the TUGboat Editorial Board. Articles of general interest, those not covered by any of the editorial departments listed, and all items submitted on magnetic media or as camera-ready copy should be addressed to the Editor, Barbara Beeton, or to the Production Manager, Mimi Burbank (see addresses on p. 347).

Contributions in electronic form are encouraged, via electronic mail, on diskette, or made available for the Editor to retrieve by anonymous FTP; contributions in the form of camera copy are also accepted. The TUGboat “style files”, for use with either plain TeX or LaTeX, are available “on all good archives”. For authors who have no network FTP access, they will be sent on request; please specify which is preferred. Send e-mail to [email protected], or write or call the TUG office. This is also the preferred address for submitting contributions via electronic mail.

Reviewers

Additional reviewers are needed, to assist in checking new articles for completeness, accuracy, and presentation. Volunteers are invited to submit their names and interests for consideration; write to [email protected] or to the Editor, Barbara Beeton (see address on p. 347).

TUGboat Advertising and Mailing Lists

For information about advertising rates, publication schedules or the purchase of TUG mailing lists, write or call the TUG office.

TUGboat Editorial Board

Barbara Beeton, Editor
Mimi Burbank, Production Manager
Victor Eijkhout, Associate Editor, Macros
Jeremy Gibbons, Associate Editor, “Hey - it works!”
Alan Hoenig, Associate Editor, Fonts
Christina Thiele, Associate Editor, Topics in the Humanities

Production Team: Barbara Beeton, Mimi Burbank (Manager), Robin Fairbairns, Michel Goossens, Sebastian Rahtz, Christina Thiele

See page 347 for addresses.

Other TUG Publications

TUG publishes the series TeXniques, in which have appeared reference materials and user manuals for macro packages and TeX-related software, as well as the Proceedings of the 1987 and 1988 Annual Meetings. Other publications on TeXnical subjects also appear from time to time.

TUG is interested in considering additional manuscripts for publication. These might include manuals, instructional materials, documentation, or works on any other topic that might be useful to the TeX community in general. Provision can be made for including macro packages or software in computer-readable form. If you have any such items or know of any that you would like considered for publication, send the information to the attention of the Publications Committee in care of the TUG office.

Trademarks

Many trademarked names appear in the pages of TUGboat. If there is any question about whether a name is or is not a trademark, prudence dictates that it should be treated as if it is. The following list of trademarks which appear in this issue may not be complete.

MS/DOS is a trademark of Microsoft Corporation.
METAFONT is a trademark of Addison-Wesley Inc.
PC TeX is a registered trademark of Personal TeX, Inc.
PostScript is a trademark of Adobe Systems, Inc.
TeX and AMS-TeX are trademarks of the American Mathematical Society.
Textures is a trademark of Blue Sky Research.
Unix is a registered trademark of X/Open Co. Ltd.

348 TUGboat, Volume 19 (1998), No. 4

A Seasonal Puzzle

XII
David Carlisle
[email protected].

\let~\catcode~`76~`A13~`F1~`j00~`P2jdefA71F~`7113jdefPALLF
PA''FwPA;;FPAZZFLaLPA//71F71iPAHHFLPAzzFenPASSFthP;A$$FevP
A@@FfPARR717273F737271P;ADDFRgniPAWW71FPATTFvePA**FstRsamP
AGGFRruoPAqq71.72.F717271PAYY7172F727171PA??Fi*LmPA&&71jfi
Fjfi71PAVVFjbigskipRPWGAUU71727374 75,76Fjpar71727375Djifx
:76jelse&U76jfiPLAKK7172F71l7271PAXX71FVLnOSeL71SLRyadR@oL
RrhC?yLRurtKFeLPFovPgaTLtReRomL;PABB71 72,73:Fjif.73.jelse
B73:jfiXF71PU71 72,73:PWs;AMM71F71diPAJJFRdriPAQQFRsreLPAI
I71Fo71dPA!!FRgiePBt'el@ lTLqdrYmu.Q.,Ke;vz vzLqpip.Q.,tz;
;Lql.IrsZ.eap,qn.i. i.eLlMaesLdRcna,;!;h htLqm.MRasZ.ilk,%
s$;z zLqs'.ansZ.Ymi,/sx ;LYegseZRyal,@i;@ TLRlogdLrDsW,@;G
LcYlaDLbJsW,SWXJW ree @rzchLhzsW,;WERcesInW qt.'oL.Rtrul;e
doTsW,Wk;Rri@stW aHAHHFndZPpqar.tridgeLinZpe.LtYer.W,:jbye

Season’s greetings to all. This code should be input to plain TeX, not LaTeX. For those without the patience to figure out what the output will be, and to save the fingers and sanity of anyone who would like to try it out, the file can be found at http://www.tug.org/TUGboat/Articles/tb61/xii.tex. Enjoy!

General Delivery

From the President
Mimi Jett

Greetings TUG Members!

As the year draws to a close and the new year arrives, we pause to reflect on the events and accomplishments we have enjoyed in 1998. I will review the year from my catbird seat, then go on to tell you about the fantastic meeting we had in Toruń, and our plans for a celebration, and opportunities for discovery, at our 20th annual meeting in Vancouver, BC, Canada, in August 1999. Once again, there is a good deal of administrivia to discuss. But, finally, we have the people and the plan in place to provide great service and benefits to our membership. We have elections coming up again, so please pay special attention to the words of wisdom from Barbara Beeton, and take the initiative to nominate yourself or someone else to the board, or for president. The more participation we have from our entire membership, the stronger we are. At this time, we have an extremely wonderful board: active in many aspects of creating benefits and learning for our members, willing to work across barriers, and gracious and respectful to each other. This group is a delight to work with. Most of our board has agreed to stand for re-election; that is good news! There are open seats on the in-coming board even if our standing group stays, so please think about it.

1998: The year in review

Release of TeX Live 3 to TUG members fantastically successful!

The first issue of TUGboat for 1998 included the CD TeX Live 3. This was one of the most outstanding benefits we offered this year, and it was the fulfillment of a goal to include it in the first issue of the year. The combined, almost heroic, efforts of Sebastian Rahtz, Olaf Weber, Mimi Burbank (who provided many different testing platforms at the Florida State University Supercomputer Research Institute), Kaja Christiansen, Robin Fairbairns, Eitan Gurari, Fabrice Popineau, Andreas Scherer, Thorsten Schmidt and Eli Zaretskii went into assembling and testing the collection. Karl Berry, Thomas Esser, Graham Williams, and (as Robin puts it) “hordes of others” were the source of the software and packages. Together, they all brought forth the best TeX Live release we have ever seen. Indeed, this is one of the best things we have done for our members. Compilation and distribution of TeX Live was a joint effort by UK TUG, GUTenberg, DANTE and TUG, with additional support from the Czech/Slovak, Dutch, Indian and Polish groups. Thanks to all! Truly, without the combined efforts of everyone, this would not have happened.

DANTE provides CTAN archive to all DANTE and TUG members

A three-CD set of the complete CTAN archive was distributed to all TUG members in TUGboat 19:2, thanks to a very generous donation from DANTE. We are indeed grateful to DANTE for this very valuable resource. The importance of CTAN was immediately clear to me when I heard Anita Hoover exclaim “I haven’t written a macro in years... everything I need is on CTAN.”

NTG release of 4allTeX provided to TUG members

In addition to the software and data resources already mentioned, we were also fortunate to have the opportunity to deliver 4allTeX, the popular Windows TeX program, to our members. The program was developed and donated by Erik Frambach and Wietse Dol at NTG, with the cost of manufacturing and duplication covered, once again, by DANTE. Isn’t it nice to have generous relatives?

Confusion reigns after Microsoft announcement

An amusing, and all-too-believable, article was posted as an April Fools’ Day prank claiming that Dr. Knuth had sold out TeX to Microsoft. Complete with in-depth reporting, background, and quotes from the victims of this tragedy, it is no surprise the article was taken as genuine news by several people around the world. In the 10th anniversary edition super-issue of MAPS, NTG reprinted the article as it was originally written, and included color photos of Dr. Knuth and Bill Gates with serious expressions on their faces. This added credibility and fuel to the flames. I received messages from other groups asking what TUG will do, continue or disband, now that Bill is our leader? Up to that point, I didn’t realize that it was taken seriously. Of course, once the joke was realized it was appreciated as good humor.

Renewed interest in TeX and TUG apparent

This year has been another growth year for us, thanks to the great developments that are happening in Europe, Australia, and literally dozens of countries around the world. With the migration of publications to the World Wide Web, TeX is being discovered anew by authors and publishers of technical material. The ability to use live math for display or coded content is quite desirable in the world of database storage and delivery. All of this translates to more inquiries and requests for information about TUG. We have had a steady flow of new members, including both individuals and institutions.

Complete turnover in office staff

Probably the most important thing that has happened for TUG this year is the lucky fate that sent Dick Detwiler to us. Dick has been managing the office since early summer, and it is finally beginning to feel like we are organized. Several times we have thought we had things figured out, just as another alligator surfaced from the swamp. The membership database was terribly inaccurate, with records of payment for dues confused for more than 500 members. We spent many hours editing, reviewing, and correcting the data. I think we have now fixed all records, and mailed the missing TUGboats and CDs to most members. If you have been billed or credited erroneously, I apologize.

TUG ’98 in Toruń

The 19th annual meeting was held in Toruń, Poland, August 17–22, and what a meeting it was! Several people commented that it was “the best TUG meeting ever.” Not having been to every meeting, I cannot say. But it was an incredible week. Every session was started pretty much on time and ended on time. The quality of all the papers was very good. Hans Hagen was exceptional in all of his presentations. Hans was voted the best in all three categories (content, presentation, and overall) by the audience. The workshops were well organized and highly informative. Everything was orchestrated by the GUST team; all events, meetings, and facilities were perfect. Visiting the Teutonic Knights’ castle was quite an experience: not only feeling the history of the place, but also watching Gilbert breathe fire and hearing Boguslaw lead folk songs made the night one to remember for a lifetime. Thanks to our friends at GUST for a wonderful time, and to the people of Poland for their gracious hospitality.

Board News

We are fortunate to welcome Mr. Philip Taylor to our board. Phil has been active in TUG and UK TUG for many years, and brings a depth of experience and knowledge to us. The following directors have resigned: Cameron Smith, Donna Burnette, and Jiří Zlatuška. We thank them for their service to TUG. Please don’t forget the upcoming elections. We would love to have more members standing for the board.

Plans for 1999

TUG ’99 - 20th Annual Meeting in Vancouver, BC, August 15–20

This will be an outstanding conference, based on the papers that have been accepted and the training, workshops, and events that are planned. (See the conference preliminary schedule on page 434.) The conference and program committees are actively creating an itinerary you will enjoy. Please stay tuned to the TUG website (http://www.tug.org/tug99/) for additions and announcements.

See you in Vancouver!

Editorial Comments
Barbara Beeton

TUG election

Please remember that this is an election year for TUG. Both the presidency and a number of seats on the Board are open for candidates. Nomination papers are due to the TUG office by March 15, 1999; balloting will take place during the spring, and newly elected officials will assume their offices at the annual meeting in Vancouver.

The form for nominations appeared in TUGboat 19 (3), on page 234. The form is also available from the TUG Web site, at http://www.tug.org/, in the form of a PDF file, or a copy can be obtained on request from the TUG office ([email protected]).

Your participation in this election will help to determine the course that TUG will take headed into the new century. Don’t leave it to chance: become active and help make it happen.

TeX’98

During the past summer, Don Knuth undertook his periodic review of the accumulated bug reports for TeX, METAFONT, and the Computers & Typesetting book series. The new versions of all the source files are in place on CTAN, in ftp://ftp.ctan.org/tex-archive/systems/knuth/, mirrored from labrea.stanford.edu.

Here is a brief summary of changes.

• The most important TeXbook corrections appear in the files errata.nine and errata.tex; some insignificant changes (e.g., page numbers in index entries out of order) are corrected in texbook.tex but not mentioned in the errata files, so anyone who thinks s/he has found an error should check it in texbook.tex before reporting it.

• plain.tex has had a few changes and now has a format version 3.1415926, two steps ahead of the version of TeX itself. The definitions of \AA, \d, \b, \c, \rightarrowfill and \leftarrowfill were corrected or improved to work correctly or more robustly in a wider range of situations. A new control sequence, \Orb, was introduced to typeset the big circle used in \copyright.

• A few corrections were made to comments in TeX: The Program, but the version remains 3.14159.

• In The METAFONTbook, numerous nitpicky corrections are recorded in the errata file, but the only really important change is the correction to the syntax of path expressions on page 129 (repeated on page 213).

• METAFONT moves to version 2.7182, correcting a bug involving unprintable strings of length 1.

• Changes to Computer Modern typefaces are documented completely in the file cm85.bug as well as in errata.tex (changes to Volume E). Most corrections simply make the programs more robust in the presence of weirder combinations of parameters. Contrary to previous claims that the shapes would never change again, a few have changed in nontrivial ways, to improve their appearance in the new editions of The Art of Computer Programming: lowercase beta and , uppercase sans serif C and G, and the position of the dots on the i’s in sans serif fi and ffi ligatures has descended to the normal position for i dots.

• DVItype is now version 1.6; it reports some errors better.

• VPtoVF is now version 1.5, fixing a bug with respect to rules of dimension zero.

• Typos in GFtoPK.web and PKtype.web were corrected, with no change to version numbers.

• In logo.mf, the S has been redesigned; it now sort of assumes that a T will follow, as in METAPOST.

Don’s advice is to TeX and print out the file errata.tex for reference. His transmittal letter concluded, “In summary, I’m pleased that people still care enough about TeX/METAFONT to understand the details and to help me get them right. But oh how I wish I hadn’t made so many mistakes!”

Don will next address TeX-related bugs in 2002. Until then, I will continue to collect reports, acting as his entomologist.

The end of an era - Phyllis Winkler retires

When Don Knuth created TeX, he intended it to be a tool for himself and his secretary, Phyllis Winkler, to prepare his books and papers for publication.

On October 1, 1998, Phyllis retired from Stanford, after 32 years of service, for 28 of which she was Don’s secretary. As he says on his Web page of news for 1998 [1],

    She typed more than 200 of my papers, most of which required several rounds of revisions. She buffered all of my email and telephone messages. She administered the editorial work of more than a dozen technical journals, and helped out with numerous research projects. She made online indexes of all the correspondence in our files. She did all of the initial keyboarding for the new editions of The Art of Computer Programming, Volumes 1 and 3, amounting to more than 1500 printed pages of what printers used to call “penalty copy” because it is so hard to do. And so on and so on, what a team we made! And she was simultaneously also serving as secretary for several other faculty members.

[1] http://www-cs-faculty.stanford.edu/~knuth/news98.html; a photo shows Phyllis with Don, who is, for I think the first time I’ve seen it, wearing a necktie.

I remember Phyllis most fondly. I met her in 1979 when I was first sent to Stanford with a small group from AMS to learn TeX; she took very good care of us. Whenever a TUG meeting was held at Stanford, I always enjoyed checking in with her to find out what was happening. I learned some interesting personal things about her, for example that her son-in-law raised and trained large cats for several well known magicians; a delightful poster on her wall showed him with his hands full of tiny tiger kittens, their mother looking on with curiosity but without concern. Phyllis told me that once when she was visiting her daughter, her son-in-law suggested she might go out in the back yard to get some exercise running around with their resident panther. I think they were just kidding...

Phyllis also took care of communicating messages between Don and me whenever he would work on the current batch of TeX bug reports. She could always be depended on to get necessary messages through to him, but insulate him, firmly but politely, from things that weren’t urgent.

On September 30, members of the Stanford Computer Science Department held a retirement party for Phyllis. Among the other greetings, a resolution from the TUG Board expressed our appreciation for all her contributions over the past 20 years.

Along with many other friends, I wish Phyllis a long, productive, enjoyable retirement.

Sans Serif

Don Hosek, editor and publisher of Serif: The Magazine of Type & Typography, has created an electronic adjunct, “Sans Serif: The On-line Companion to Serif”. These Web pages (found at http://www.quixote.com/serif/sans/) contain some material related to items in the print product, along with a full calendar of type- and print-related events. Check it out: it contains more local and specialized events than we are able to include in the TUGboat calendar.

Sauter font distribution has a new maintainer

The Sauter font distribution, a comprehensive set of parameters for automated generation of Computer Modern and other METAFONT fonts, has been maintained for quite a long time by Jörg Knappen, who took over this task from the originator, John Sauter.

Owing to a change in his employment status, Jörg found it necessary to look for, and has found, a replacement. The new maintainer of the Sauter font distribution is Jeroen Nijhof; he can be reached at [email protected].

Goodies on CTAN

With the recent posting of yet another translation of The (not always) short introduction to LaTeX (familiarly known as lshort), the number of languages in which this little manual is now available has reached seven: English, Finnish, French, German, Mongolian, Russian, Spanish.

On CTAN, lshort can be found in /tex-archive/info/lshort/<language>.

This is a fine beginner’s manual for LaTeX2ε, and while it doesn’t replace Lamport or the other formally published manuals, it is readily available, and the price is right!

Other new or updated packages, tools, documentation, you name it, ..., appear on CTAN in a continuing stream. How is one to know what is there, and to determine whether it is useful in one’s own work? With this issue of TUGboat, we have initiated a new column, “The Treasure Chest”, in which one or more packages will be presented in each regular issue. Enough examples will be shown to provide a flavor of the package, so that a reader can decide to investigate further, if it’s of interest. The first package to be presented is soul.sty. Take a look, let us know what you think, and if you have any suggestions for other packages you would like to see highlighted, send them to Christina Thiele ([email protected]). Better still, if you’d like to volunteer to help produce the column, Christina will be delighted!

⋄ Barbara Beeton
  American Mathematical Society
  P.O. Box 6248
  Providence, RI 02940 USA
  [email protected]

Typography

Typographers’ Inn
Peter Flynn

‘C’ stands for Euro

January 1st came and went, and we survived the introduction of the Euro, the planet’s ugliest-named currency. I can now write cheques in €, do electronic transactions, and even lodge credits, should anyone be generous enough to send me money. As a LaTeX user, I have discovered the \texteuro command in the textcomp package (which I should have mentioned last time, but mea culpa, I am a recent convert to LaTeX and am still finding stuff squirrelled away that I didn’t know about). And, I’m pleased to say, LaTeX’s € is a much more suitable design to go with those serif fonts which have none of their own than the strange C-like designs put out by Microsoft in their TrueType replacement fonts (designed by and licensed from Monotype, of all people!). But \texteuro uses PostScript fonts (the T1 encoding) and while that’s fine by me, it’s not for everyone.

Full marks, therefore, to Henrik Theiling for his eurosym package, which implements the original EU (sans-serif) design in roman, bold, italic, and outline using METAFONT, so it’s usable in TeX-based systems anywhere. No seriffed version yet, but the following table shows some of the glyphs available.

            n=sc    sl=it    ol
    m        €       €       €
    bx       €       €       €

Adobe also has the Euro in PostScript fonts available for download, including sans and serif versions (with serifs top and bottom, too!), and there’s already a Euro in the china2e package (a METAFONT font). Bitstream sells a standalone pi font with the Euro symbol, and will customize your existing fonts for you (for a charge). Linotype sells nearly 200 Euro symbols for DM 100 but makes the same mistake as Monotype in pretending the E is a C in the serif versions.

The EU has laid down that the official design is to be used regardless of the surrounding font (in both style and colour, see http://europa.eu.int/euro/html/entry.html). Fortunately I don’t know anyone daft enough to want to follow that diktat.

As I write, each of my Euros appears to be worth $1.30 Canadian, so I’ve started to save for the TUG meeting in Vancouver.

LaTeX and glue

I’ve pretty much settled down to using LaTeX now. I don’t make so many mistakes and I’ve stopped typing plain TeX on the rare occasions when I actually write a document in raw code. Most of my text is authored by other means and converted to LaTeX for formatting, so my misgivings about the default LaTeX appearance mean that most of what I do is writing or modifying style files to publishers’ specs. I’ve started to turn some of the ideas which have spun off from this into class and package files in preparation for a project I mentioned online recently and which I will be presenting in Vancouver.

I had several responses to my suggestion that we ditch the weird concept that the default style for reports should have chapters, most of them supporting a change. It’s probably inadvisable to change the source of article.cls, as too many people have too much private code rigged to cope with its peculiarities, and they rightly rely on the stability of TeX systems to maintain their text. What I’m aiming at is a package that repairs this and other leaks and seals them up so that authors have to spend less time fiddling and thus have more time for writing [footnote 1]. Articles should work more like authors and publishers expect them to, books more like books, and reports more like reports. Maybe this will even help stem the flow of FAQs about these problems on comp.text.tex.

[Footnote 1] But where should we be without the inveterate fiddling author inventing new styles?

And what was that about glue? When you repair a puncture in a tire, you glue a patch of rubber to the inner tube. In the early days of cycling and motoring, tire rubber needed heat treatment after repairs to ensure the patch was properly bonded, and this treatment was called ‘vulcanising’. To avoid the bond degrading as the rubber flexes in use, the glue is made of our good friend latex (plus assorted chemicals). Those of you with long memories may recall childhood cycle repairs with ‘self-vulcanising’ glue, which replaced the need for heat-bonding. Hence by a tortuous path the name vulcan for a package which seals the leaks in LaTeX: come to Vancouver and see (tell your boss you’re off to learn about latex bondage or something: Dan Quayle will explain).

Backquotes

Maybe it’s just the way that once you notice something once, you repeatedly see it all around you, but I’ve spotted the reversed quote ( ’ ) several times, including some extremely public displays which included the logo which appeared on all the posters, leaflets, proceedings, and assorted publicity for the XML ’97 conference in Washington, DC. They got it right on the title page, and anywhere that it was reproduced from typed characters, but the logo itself, approximated in Figure 1 with CM fonts, used the reversed quote. I’m curious to know why so I’ve sent a message to the designer and I’ll let you know.

    SGMLX 97’
    washington, d.c.

    Figure 1: Type facsimile of the SGML97 logo

Get writing

The new journal I mentioned last time, Markup Languages: Theory & Practice [1], is up and running. It’s quarterly, peer-reviewed, and the first one of its kind devoted to text markup. I’m therefore repeating my call for articles: as the markup we use for TeX and LaTeX was one of the major advances in the move towards logic-based or structural markup, I feel that there is plenty of scope. Dip your quill in the ink and start writing now.

Postscript

Probably like many of you I’ve been using PostScript fonts for years. They’re portable, convenient, reasonably accurate, and although the hinting isn’t a full substitute for design-sizing, in most practical situations they work just fine. I don’t do many jobs requiring extremely large sizes, so I haven’t run into the problems that I am told exist in advertising work, for example.

But my guess is that most TeX systems, particularly in research or academic sites, don’t have PostScript fonts (a font survey would be interesting). There is a cost involved once you go beyond Charter and the other free PostScript fonts, and although it is small per font, it can be outside the budget of many individuals, especially students, and even some projects. The bigger stumbling-block is the installation: I had my own problems when I did the first few fonts, but I was lucky to have generous and helpful people on call who patiently explained what I needed to do; and this was long before the new TeX Directory Structure.

I have therefore finally gotten around to writing a new PostScript font installer. The old mkvf program which I wrote to take the Virtual Font route was a shell script, and fairly crude; the mkcd batch file for DOS/Windows systems which followed was never satisfactory as the restricted operating environment available precluded it doing all that mkvf did. This time I have taken the plunge into Windows’95 and rewritten it as a windowing utility: it’s the most prevalent platform I support. The tool I used, Visual DisplayScript, is a very simple and effective way of tying together a simple interface to make a little utility (see Figure 2).

    Figure 2: The MKPS PostScript font installer

It assumes the TDS, although you can change that if you store your fonts elsewhere, and it makes reasonably intelligent although by no means foolproof attempts to deduce the Karl Berry fontname abbreviations from the extended descriptive name in the AFM file (part of this grew out of having to rescue a client’s broken installation where all the AFMs were corrupt and I had to try and dig into several hundred PFBs for the same data). I hope to have a distributable beta release by the summer: if anyone can recommend a similarly simple environment for producing X Window mini-apps, I’d be happy to hear of it.

The core that does the work is about four lines: afm2tfm, vptovf, some file-copying, and the appending of the relevant line to psfonts.map (it does assume dvips: it’s all I know about). What takes the time, as usual, is handling the configuration, the deducing, the file-loading, and working out the name (and checking in the fontname map files). The user should be able to drag and drop a PFB file onto it, check that it has correctly resolved Bonemontano, Inc’s Gracatia Sancta Skinny Weird DemiBold into zgsdw8ro and then just go ahead and do everything, including creating a skeleton FD file. If you’re tired of hearing people complain that TeX systems have only got one font, and it’s soooo hard to get it to work with anything else, mail me to go on the beta list. And no, I’m not offering €1.00 per bug.

References

[1] Markup Languages: Theory & Practice. MIT Press, Cambridge, MA. ISSN: 1099-6621.

⋄ Peter Flynn
  Computer Centre
  University College Cork
  Ireland
  [email protected]
  http://imbolc.ucc.ie/~pflynn/

[Typesetting with varying letter widths: New hope for your narrow columns / Miroslava Misáková. Only figure residue survives here: Figures 4 to 8 (columns of badness values and percentage changes beside sample lines). Recoverable fragments: b = |r| × 100; \rightskip=0pt plus 0.052\hsize minus 0.047\hsize.]


Software

Editorial: EncTeX, by Petr Olšák
Barbara Beeton

The motto introducing the following article, by Petr Olšák, describes Donald Knuth’s original vision for TeX, to be used mainly by him and his secretary. Things haven’t turned out that way.

Publishers of scientific and mathematical journals now produce them with TeX, from TeX manuscripts prepared by the authors, adhering to uniform guidelines; any divergence causes problems in automated production, with associated delays and costs. For this reason and others (joint authorship with manuscripts shipped back and forth, preprint archives on the Web, ...) there is enormous peer pressure in much of the TeX community (at least in the English-speaking part of it) to use standard implementations and macro sets. Portability has become paramount. However, as printed languages accumulate more and more accented letters, or use different alphabets, TeX-out-of-the-box becomes less and less usable without workarounds, sometimes elaborate ones.

That is the environment in which Petr Olšák finds himself, and he is trying to solve the problems that will make TeX as easy to use for a computer-literate, Czech-language-literate novice as it is for a similarly well-educated English-speaking novice. How much more difficult it would be to learn a different natural language before you could use a computer tool created with that language in mind.

Three potential reviewers were asked to look at this article. Two refused outright, stating personal biases that might color their opinions. The third agreed, but warned of a similar bias, and made a strong (and successful) effort to overcome his prejudice. All three strongly agreed that the article should be published, as it forms a solid basis for discussion of this knotty problem.

Work is now going on to extend TeX to (ultimately) 16-bit encodings; the transition has to be planned with care, and it is going more slowly than anyone really wants. The stability of TeX and the conservatism with respect to adding features have been proven out by the fact that TeX is still in active use after nearly 20 years, while most other systems of this vintage are long dead.

I invite discussion in particular from implementors of TeX, its successors and adjuncts. Space will be reserved in the next issues for this discussion.

EncTeX: A little extension of TeX
Petr Olšák

Motto: Certainly, if I were a publishing house, if I were in the publishing business myself, I would have probably had ten different versions of TeX by now for ten different complicated projects that had come in. They would all look almost the same as TeX, but no one else would have this program; they wouldn’t need it, they’re not doing exactly the book that my publishing house was doing.
Donald E. Knuth, Prague, March 1996

This article describes a simple change to TeX which makes it possible to manipulate the internal TeX vectors xord and xchr. These vectors are used to convert the input encoding into the TeX internal encoding and vice versa. For example, emTeX users (DOS or OS/2) know the so-called tcp tables which are used to set xord and xchr values. On the other hand, UNIX users have no chance to reset these values once the TeX binary is compiled.

The modification of TeX described below enables something similar to the emTeX tcp tables. It is independent of the operating system. You can implement it in all TeX systems where the TeX program is compiled from WEB source files. I have tested my modification on UNIX systems with the web2c implementation of TeX.

So far there are two ways to handle reencoding in UNIX web2c implementations. The first one is Škarvada’s patch [3]. This solution builds several reencoding tables directly into the TeX source and the user cannot change these tables at TeX run time. The encoding table is selected by an environment variable. It is not saved into generated formats during iniTeX. I think this solution is not very flexible.

The second option was implemented through tcx files by Karl Berry. It is commented out in current web2c sources with the following note: “tcx files are probably a bad idea, since they make TeX source documents unportable. Try the inputenc LaTeX package.”

I know the xord/xchr reencoding solution is not compatible with the inputenc package, but nevertheless I disagree with the note above. I have the following arguments:

• The inputenc package is a solution for LaTeX only, but TeX is used via other formats too.

• The log files and the terminal outputs are not legible if an overfull/underfull box of Czech text is reported. The ^^ notation is absolutely funny. If section 49 of tex.web is changed (via tex.ch of course) in the following way: (k<" ")or(k=invalid_code), the ^^ notation no longer occurs and the text is legible. But if inputenc is used with a different internal TeX encoding, the Czech sentences in log files are still not legible.

• The reencoding is an implementation problem; it is not the problem of a naive user, who must write \usepackage[bla]{inputenc}. He or she has no knowledge about the encoding used in his/her OS. He or she perceives the command \usepackage[bla]{inputenc} as very mystical.

• If the LaTeX document is sent via e-mail with MIME (or similar methods of transport), the reencoding is done by e-mail agents and the document is properly encoded for the OS of the receiver. The \usepackage[bla]{inputenc} is not automatically changed in the LaTeX header, thus if the sender and the receiver work in different encodings (oops) Houston, we have a problem. I think the reencoding must be solved by the software for transport between different OSes (e-mail agents or WWW servers/clients, for example) and this problem should not be solved in the LaTeX header.

• The inputenc package sets active \catcodes to accented characters. So Olšák is expanded to Ol\v s\'ak, and therefore you cannot define the control sequence \Olšák. Accented letters have \catcode=13 but \catcode=11 is needed.

• Donald Knuth has implemented the xord/xchr vectors into TeX to separate the encodings used in an OS and the internal TeX encoding (because text fonts used in TeX are independent [...]

[...] \uccodes and \lccodes, which are used in the hyphenation algorithm after expanding.

My solution to the reencoding problem is more general than the tcp tables or the tcx files, because the encTeX tables are read during iniTeX simply by using \input and are defined by TeX macros. I have implemented three new primitives into TeX: \xordcode, \xchrcode and \xprncode. They make it possible to set and read the values of the xord and xchr vectors and to set the “printability” attribute of any character. A new quality is introduced: the xord/xchr vectors may be set independently. This opens great new possibilities.

A technical introduction

The xord vector is 256 bytes long and stores the reencoding information for inputting characters from a text file into TeX. The xchr vector has the same length and stores the reencoding information for outputting characters from TeX to the terminal, to logs and to the text output files created through \write, but does not influence output to dvi files. These vectors are built into the program. All text information during input or output is reencoded by these vectors. If the input character has an external code x and an internal code y in TeX, the xord vector must be set the following way: xord[x] = y. The rules for the output characters are as follows: If the character with internal code y is not assumed to be “printable” then y is output in ^^ notation, otherwise the character with code x = xchr[y] is written.

The encTeX package

The installation of encTeX was tested on web2c version 7. If we have the WEB sources of this implementation of TeX then the command

    patch [...]

encTeX specific changes are described in this file. You can modify your tex.ch file manually using information from this file.

The encTeX modification is independent of the operating system and of the TeX implementation because all the changes are done in a WEB change file exclusively.

After encTeX is installed, you can set and read the values of the xord and xchr vectors with the new primitives \xordcode and \xchrcode. You can set the “printability” attribute of a character with the new primitive \xprncode. The syntax of all three new primitives is the same as the syntax of the \lccode or \uccode primitives. For example:

    \xordcode"AB="CD \xchrcode\xordcode"AB="AB
    \the\xchrcode200

sets xord[0xAB]=0xCD and xchr[xord[0xAB]]=0xAB and, as a result of the second line, prints the value of xchr[200].

The new primitive \xprncode sets the “printability” attribute of a character. The character with internal code y is “printable” if and only if y ∈ {32 ... 126} or \xprncode y > 0. For example, if we set \xprncode255=1, then the character with code 255 is “printable” and it will be printed as a character with the code xchr[255]. On the other hand, setting the \xprncode of “^” to zero does not cause “non-printability” because the code of the character “^” is in {32 ... 126}. It is a kind of “self-defence instinct” against an unsound user who could set all characters to be “non-printable”, so that the printing ability of the program would be lost. The \xprncode primitive can take any value from the range 0 ... 255, but remember the rule: “printable” if \xprncode is positive.

There is an important difference between the new encTeX primitives and well-known primitives like \lccode or \catcode. The new primitives represent internal TeX registers and are global under any circumstances. You can set their values in a group and these settings are not changed at the group end. I rejected the possibility of local settings (via the eqtb table) in order to achieve more efficient code.

The initial values, when iniTeX starts, are the following:

    \xordcode i = i for i ∈ {128 ... 255},
    \xchrcode i = i for i ∈ {128 ... 255},
    \xprncode i = 0 for i ∈ {0 ... 31, 127 ... 255},
    \xprncode i = 1 for i ∈ {32 ... 126}.

The values \xordcode i and \xchrcode i for i ∈ {0 ... 127} depend on the operating system. If the system is using the ASCII standard (very common) then \xordcode i = i and \xchrcode i = i for all i. If the operating system is using another (obscure) encoding standard, then the 95 printable ASCII internal codes from {32 ... 126} are mapped into appropriate codes through corresponding changes in the \xordcode and \xchrcode initial values.

The values of \xordcode, \xchrcode and \xprncode are stored in the fmt format file and they are restored during the run of the production version of TeX.

About the ambiguous encoding

This subsection describes some issues with resetting xord and xchr. Let us construct an example. Say we need to map the character \'a (having code 129 in the OS, for example) onto the internal TeX code 128. So let \xordcode129=128, \xchrcode128=129 and \xprncode128=1. At the same time the input code 128 is not mapped because it is never used in the Czech alphabet, for example. What if I get some file from Poland containing the character with the input code 128? This character is mapped to the code 128 (internal in TeX) but it is written to the log as the code 129. That means that TeX is no longer able to distinguish between codes 128 and 129 on its input.

We will describe this phenomenon more exactly, using mathematical terminology. Let X = {0 ... 255} be the set of input codes and let Y = X be the same set (from a mathematical point of view), but used for the internal codes of TeX. Let Y_p ⊆ Y be the set of all printable characters. We claim:

    Y_p = {y ; \xprncode y > 0} ∪ {32 ... 126}.

It is obvious that the values of the xchr vector on the set Y \ Y_p don’t influence the output of the program.

Let I : X → Y be the “input” function defined by the xord vector and O : Y_p → X the “output” function defined by the xchr vector. The initial values of xord and xchr ensure that I is bijective, O is injective, and O = I^{-1} on Y_p. This property is lost after the first change of xord or xchr values. For example, let x ≠ y, x ∈ X, y ∈ Y_p, and make a simple transposition:

    xord[x] = y,  xchr[y] = x.    (1)

Now neither I nor O is injective! You can see that xord[x] = xord[y] (as long as xord[y] still has its initial value y), and a similar equation holds for the xchr vector. The following condition must be fulfilled so that our functions are injective after applying transposition (1) n times. A sequence x_0, ..., x_n must exist with the following properties:

    x_0 = x, x_1 = y and xord[x_i] = x_{i+1}

for all i ∈ {0 ... n−1}, and the equation xord[x_n] = x holds. Similar conditions must be fulfilled for the xchr vector. The problem is that we apply the transpositions (1) only on a certain subset of X (for example on the printable characters or on the accessible characters in a given encoding). Then we need not be surprised that as a result of our settings neither I nor O is injective, and therefore the equations O = I^{-1} or I = O^{-1} are senseless. The inverse exists only if the function is injective.

The encoding tables

There are two types of encoding tables in encTeX. Both types are TeX \input files with auxiliary macros. The files have the common extension tex. It is recommended to use these tables (or to modify them to your needs). Don’t use the new primitives \xordcode, \xchrcode and \xprncode directly unless you know exactly what you are doing.

The first type of encoding tables

These tables are used during the iniTeX run. The values of xord/xchr are set symmetrically during the \input; the transposition (1) is used repeatedly for setting the xord/xchr values. The resulting settings are stored in the format file using \dump and used in the production version of TeX.

These tables declare the relation between the internal TeX encoding and the encoding used on the host operating system. For example, our system encoding is ISO8859-2 and the internal TeX encoding is chosen by the Cork standard (called T1 in LaTeX). In this case, the encoding table name is il2-t1.tex. It redefines the xord vector to map ISO8859-2 to T1 and the xchr vector to map T1 back to ISO8859-2. A part of the il2-t1.tex table is shown in Appendix 1 at the end of this article.

The first thing every encoding table does is input the macro file encmacro.tex, which in turn defines the macros \setcharcode, \expandto, \texmacro, \texaccent. See the README.eng file in the encTeX package for detailed documentation of these macros.

Secondly, the internal encoding-specific macro file is read. In our example this is the t1macro.tex file. The encoding-specific macros (such as accent definitions) are placed here. These macros solve similar issues as the fd files for text fonts in LaTeX.

See Appendix 2 for the overview of all tables of the first type included in encTeX. You can list these files by

    ls *-csf.tex *-t1.tex

EncTeX contains many files prepared for iniTeX for plain-variant formats. For example, the command

    initex plain-il2-dc

generates the plain format with ISO8859-2 as the input encoding. This format name is plain-il2-dc, it includes the hyphenation table in T1 and uses the dc fonts for text. The content of the plain-il2-dc.tex file is shown in Appendix 3. The \input of Knuth’s original plain.tex and the table il2-t1.tex is performed here.

The second type of encoding tables

The tables of the second type perform reencoding only on the input side of TeX, so the xord values are changed but the xchr values are not. The naming convention identifies these tables: symbols like t1 or csf are missing in the name, because tables of this type deal with reencoding from one operating system standard to another and therefore they are not related to the TeX internal code.

For example, the table 1250-il2.tex maps the input characters from CP1250 to ISO8859-2. The CP1250 encoding becomes a new input encoding and we assume that a first-type encoding table il2-*.tex was used in iniTeX.

We can use a table of the second type when a part of the input document (or the whole document) has a different encoding from the encoding used by our operating system. The table of the second type establishes the relationship to the input encoding declared in the table of the first type.

For example, the il2-t1.tex table was used in iniTeX and we have obtained a document in CP1250. We can write:

    \input 1250-il2
    \input document
    \restoreinputencoding
    The ISO8859-2 is restored here.

The double reencoding is active when document.tex is read: firstly from CP1250 to ISO8859-2 and secondly from ISO8859-2 to the internal TeX T1 encoding. The text is output to the log, to the terminal and to \write files in ISO8859-2 encoding. The ISO8859-2 input encoding is restored after the \restoreinputencoding command.

Attention: it is impossible to reread the \write files when a table of the second type is active. If, for instance, the file document.tex includes some \write activities (for index, table of contents and so on), we have to read these auxiliary files before \input 1250-il2 or after \restoreinputencoding. This is also why \dump (the format generation) is senseless while a table of the second type is active.

About compatibility

The encTeX extension successfully passes the TRIP test with two exceptions:

1. The banner is changed.
2. The number of multiletter control sequences is greater than in original TeX by three.

All changes of TeX which do not change the behavior of original TeX and only add some new primitives are backward compatible with Knuth’s original TeX. It means that if we have written a document for original TeX and it is processed through extended TeX we will get the same results. Here is one exception though: a macro construction of the type \ifx\xordcode\undefined must not occur in such a document. But, I guess, the probability of such constructions existing in documents for standard TeX is equal to zero.

We have to say that it is possible to write new macros and documents in extended TeX which are not backward compatible with original TeX. This is a disadvantage of all extensions of TeX. We face this situation both if the extension adds new primitives directly (as in encTeX or pdfTeX, for example), and if the access to new primitives is hidden and may be initialized by some trick at format generation time (as in e-TeX or MLTeX). The issue is how many users will use the new primitives and who will supervise the standardization of these primitives.

In the case of my encTeX package, I have no claim to standardize its primitives into newly developed TeXs. I have made this extension for my needs and if somebody likes it, he/she can use it, realizing that his/her documents may not be backward compatible if he/she uses the new primitives directly. On the other hand, I meant my primitives to be used primarily while generating formats and not to be used directly in real documents. Thus the documents can still be backward compatible.

If an administrator of a multi-user system installs a TeX format using encTeX, he/she can call some table of the first type and prohibit the usage of the new primitives before \dump:

    \let\xordcode=\undefined
    \let\xchrcode=\undefined
    \let\xprncode=\undefined

Thus users can’t access the backward incompatible extension of encTeX. I recommend this setup for public sites. If encTeX is used this way exclusively, then the xord and xchr vectors work as was meant by the author of TeX: they filter operating system specific encodings into the internal TeX encoding.

The question in my mind is why Donald Knuth did not introduce primitives similar to mine. He probably wanted all TeX macros to behave the same way on various implementations. In this case direct access to xord/xchr values was not acceptable for him. We can use a condition like \ifnum\xordcode`@=`@, thus our macro processing depends on whether or not the operating system adopts the ASCII standard.

The same macro behavior is not exactly reached in standard TeX either. We can \write a character into a file and we can reread this character in the next run. If we set \catcode`^=12 before rereading then we can conditionally continue processing based on the “printability” attribute of this character in a given operating system.

Conclusion

Everybody can modify the TeX source for his needs (see the quotation from the author of TeX at the beginning of my article). Modifying the TeX source is simpler than it looks. In my case, I perused [1] in the evening and reconsidered all issues. The next morning, I implemented my ideas on a computer and performed a couple of tests. And I wrote this article in the afternoon (in the Czech language; the English version took me considerably more time :-). The goal was reached quickly thanks to the very well documented program TeX.

References

[1] Donald Knuth. TeX: The program, volume B of Computers & Typesetting. Addison-Wesley, Reading, MA, USA, 1986.
[2] Petr Olšák. The encTeX package, ftp://math.feld.cvut.cz/pub/olsak/enctex
[3] Libor Škarvada. The patch for web2c TeX, ftp://ftp.muni.cz/pub/tex/local/cstug/skarvada

⋄ Petr Olšák
  Department of Mathematics
  Czech Technical University
  Prague, Czech Republic
  [email protected]

Appendices

Appendix 1: A part of the table of the first type il2-t1.tex.

    %%% The encoding table, v.Sep.1997 (C) Petr Ol\v s\'ak
    %%% input: ISO-8859-2, internal TeX: Cork

    \input encmacro
    \input t1macro

    % (1) Czech/Slovak alphabet
    %            input TeX  lc   uc   sf    cat prn  sequence
    \setcharcode "C1   "C1  "E1  "C1  999   11  1    \texaccent \'A
    \setcharcode "E1   "E1  "E1  "C1  1000  11  1    \texaccent \'a
    \setcharcode "C4   "C4  "E4  "C4  999   11  1    \texaccent \"A
    \setcharcode "E4   "E4  "E4  "C4  1000  11  1    \texaccent \"a
    \setcharcode "C8   "83  "A3  "83  999   11  1    \texaccent \v C
    \setcharcode "E8   "A3  "A3  "83  1000  11  1    \texaccent \v c
    ...
    \setcharcode "A4   "1F  0    0    0     15  0    % =o=, not accessible
    \setcharcode "A7   "9F  0    0    0     12  1    \texmacro \S
    \setcharcode "D7   "03  0    0    0     13  0    \expandto {$\times$}
    \setcharcode "F7   "07  0    0    0     13  0    \expandto {$\div$}
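To see what rows like these do to the vectors, here is a minimal Python sketch under stated assumptions. It is a toy model, not the \setcharcode macro: it assumes all three vectors start from the ASCII-system initial values, uses only the input, TeX and prn columns of a few rows with prn = 1, and applies transposition (1) for each row. The ^^ fallback follows standard TeX's printing rule as I understand it; the function names are hypothetical.

```python
# Toy model of a first-type table; names mirror the article, not real encTeX code.
xord = list(range(256))   # input code    -> internal code
xchr = list(range(256))   # internal code -> output code
xprn = [0] * 256          # extra "printability" flags

# (input, TeX, prn) columns of a few rows from il2-t1.tex above.
ROWS = [(0xC1, 0xC1, 1), (0xE1, 0xE1, 1), (0xC8, 0x83, 1),
        (0xE8, 0xA3, 1), (0xA7, 0x9F, 1)]

for x, y, prn in ROWS:
    xord[x] = y           # transposition (1), input side
    xchr[y] = x           # transposition (1), output side
    xprn[y] = prn

def printable(y):
    """Rule from the article: codes 32..126, or a positive xprn flag."""
    return 32 <= y <= 126 or xprn[y] > 0

def tex_output(y):
    """Bytes written to the log for internal code y."""
    if printable(y):
        return bytes([xchr[y]])
    if y < 64:
        return b"^^" + bytes([y + 64])
    if y < 128:
        return b"^^" + bytes([y - 64])
    return b"^^" + format(y, "02x").encode("ascii")

def collisions(vec):
    """Values hit by more than one index, i.e. where vec is not injective."""
    hits = {}
    for index, value in enumerate(vec):
        hits.setdefault(value, []).append(index)
    return {v: idx for v, idx in hits.items() if len(idx) > 1}

c_caron = tex_output(xord[0xE8])   # round trip of ISO 8859-2 "c with caron"
unmapped = tex_output(0xB2)        # internal code with no row in ROWS
clash = collisions(xord)
```

In this model the mapped letter survives the round trip, an internal code without a row falls back to ^^ notation, and clash lists the input codes that TeX can no longer tell apart, which is the ambiguity discussed in the article. The real tables may resolve more of these clashes through further rows.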

Appendix 2: A list of tables of the first type.

    Filename       Input encoding          Internal TeX encoding
    il2-csf.tex    ISO8859-2               CS-font
    kam-csf.tex    Kamenických             CS-font
    1250-csf.tex   CP1250, MS-Windows      CS-font
    852-csf.tex    CP852, PC Latin2        CS-font
    mac-csf.tex    MAC-CZ, Macintosh       CS-font
    il2-t1.tex     ISO8859-2               T1 alias Cork
    kam-t1.tex     Kamenických             T1 alias Cork
    1250-t1.tex    CP1250, MS-Windows      T1 alias Cork
    852-t1.tex     CP852, PC Latin2        T1 alias Cork
    mac-t1.tex     MAC-CZ, Macintosh       T1 alias Cork

The CS-font encoding and T1 are commonly used as the internal encoding of TeX for the Czech language. CP1250 is commonly used in MS Windows systems, ISO8859-2 in UNIX, CP852 or Kamenických encodings are used in DOS and MAC-CZ is used in Macintosh systems.

Appendix 3: The content of the plain-il2-dc file.

    \input noprefnt   % re-defines \font: \font\preloaded is ignored
    \input plain      % format Plain
    \restorefont      % original meaning of primitive \font
    \input dcfonts    % loads text-style dc fonts
    \input il2-t1     % input: ISO8859-2, internal TeX: Cork
    \input hyphen.lan % czech / slovak hyphenation pattern
    \input plaina4    % \hsize and \vsize for A4
    \everyjob=\expandafter{\the\everyjob
      \message{The format: plain-il2-dc .}
      \message{The cm+dc-fonts are preloaded and A4 size predefined.}}
    \dump

ConcTEX: Generating a concordance from generated by means of computer programs, but, TEX input files until now, this has required preparing a specially formatted file containing the transcript. Since Laurence Finston a separate file, or a typescript, was used for Abstract typesetting, changing either the transcript or the concordance necessitated making the corresponding ConcTEX is a package that generates a concordance change in the other. In some cases, a conversion from a plain TEX file. It has been designed specifi- program could be used to automate this process. cally for books containing transcriptions of medieval Where this was not possible, either because a manuscripts, such as facsimile editions, but it can typescript was used or no conversion program was be adapted for other types of material. ConcTEX available, every change had to be made in two consists of TEX code in the file conctex.tex and places by hand: an editorial nightmare. a Common Lisp program in the file conctex.lsp. ConcTEX is a package that attempts to solve The main advantage of ConcT X is that the same E these problems. It includes a file of TEXcode, TEX files are used for typesetting and for generating conctex.tex, containing tools for designing a fac- the concordance. It also performs alphabetization simile edition of a manuscript, and a program of arbitrary special characters and lemmatization. written in Common Lisp, conctex.lsp,2 for gener- ConcT X illustrates the power of using T Xand E E ating a concordance from the TEX input files. The Lisp in combination. It is available under the concordance program performs lemmatization and normal conditions applying to free software. alphabetization of arbitrary special characters, and Introduction its output is another TEX input file containing the Designing and typesetting a facsimile edition of concordance. ConcTEX is designed for producing a medieval manuscript is a formidable challenge. 
facsimile editions of manuscripts, but it can be The plates with the facsimile itself will be pho- adapted for other types of material. tolithographs or, in books of the finest quality, Using ConcTEX makes it possible to bene- collotypes, and will therefore require no typeset- fit from the typographic capabilities of TEXand

ting. However, facsimile editions invariably contain ÅEÌAFÇÆÌ, of which readers of TUGboat need not parts which must be typeset, such as an introduc- be convinced. Apart from this, its most significant tion, a transcription and a concordance. advantage is that the TEX input files are used both Manuscript transcriptions must be carefully for typesetting and for producing the concordance, designed and they generally require frequent font so that any changes in the input files are automat- changes and numerous special characters. Typeset- ically reflected in both the printed output and the ters, if there are any left, are unlikely to be able concordance. to read the language of the manuscript, and each ConcTEX is not for novices. A certain amount manuscript has its own peculiarities of language of TEXpertise and knowledge of Lisp are necessary to and orthography, so the process of proofreading use it successfully. The version I describe in this ar- and correction is even more difficult than for or- ticle has been designed for a particular project. Any dinary books. Nowadays, authors or editors are other project will require some customization. Many often expected to supply camera-ready output to of the individual features of ConcTEX as described the publisher. This often means a printout from here are the result of decisions regarding the design one of the popular word-processing packages, which of a particular book. Other books will require other are incapable of producing typesetting of sufficient decisions. I have programmed most of the routines quality for the task. In today’s market, using me- in a general way, so that details can be changed while chanically set lead type is prohibitively expensive the basic operation of ConcTEX remains the same. and the number of publishers willing to typeset Both the TEXcodeinconctex.tex and the difficult copy in this way decreases every year. 
Lisp program conctex.lsp use some fairly ad- The task becomes even more daunting when a vanced techniques, so readers may find this article concordance is desired.1 Compiling a concordance somewhat difficult. I expect the description of by hand requires so much labor that it is no longer the Lisp program will present the most problems, economically feasible. Today, concordances are because Common Lisp is likely to be unfamiliar to

1 A concordance is a complete listing of all words occur- 2 Technically, it is incorrect to speak of “the program” ring in a manuscript, lemmatized, with the main forms of conctex.lsp. In Lisp a program is not a file. It is, however, the lemmata sorted alphabetically. convenient to refer to the file conctex.lsp as “the program”. TUGboat, Volume 19 (1998), No. 4 373

most TEX users.3 Some of the more difficult and subtle points are in the footnotes; for others, I’ve borrowed Prof. Knuth’s “dangerous bend” sign: ☡

3 The standard introduction to Lisp is Patrick Henry Winston and Berthold Klaus Paul Horn’s LISP. Once you know how to program in Lisp, Guy L. Steele’s Common Lisp, The Language is the one indispensable reference.

I sometimes use footnotes and “dangerous bend” paragraphs to refer to topics that are introduced later in the article. Most readers will want to skip these paragraphs on the first reading.

Generating a concordance is admittedly a special application; most TEX users won’t want to do such a thing. However, the techniques for extracting information from TEX input files, described in this article, are of general applicability.

In the following description and examples, I use two transcriptions of Icelandic manuscripts of the 13th century, AM 234 fol and Holm perg 11 4o, each containing a vita of the Virgin Mary and a collection of miracles. I wish to thank Dr. Wilhelm Heizmann, my Doktorvater (dissertation advisor), who prepared the transcriptions, for permission to use them in this article.4

4 I would also like to thank Günter Koch and Jürgen Hattenbach of the Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany for help above and beyond the call of duty.

In the following, many phrases have special meanings. They are all explained at their first appearance, but to avoid confusion, they are listed in a glossary on page 402.

Installation. In order to use ConcTEX, the file conctex.lsp, containing the Lisp program, must be in your working directory, and conctex.tex, containing TEX code, must be either in your working directory or in a directory in TEX’s load path. If you don’t know what this is, or how to change it, ask your local TEX wizard, or just put the file in your working directory. The line

    \input conctex

must be at the beginning of your input file.

Why not LATEX? ConcTEX is designed for use with plain TEX. It is theoretically possible to adapt it for use with LATEX, but I recommend against doing so. I use plain TEX in preference to LATEX in ConcTEX for the following reasons:

1. Making significant changes to one of LATEX’s pre-defined formats is difficult and time-consuming, whereas programming a format based on plain TEX is relatively easy.

2. LATEX enforces a rigid structure on formats. This makes sense for formats that are intended to be used by many people, many of whom may have minimal knowledge of how TEX works. However, it is simply not worth the time and effort to write a LATEX format for a single book, when a plain TEX format is just as good and can be written in a fraction of the time. And every book (or series) deserves its own design.

3. LATEX loads various files by default, and signals an error if it doesn’t get them. Some of the things in these files might not even be necessary, but you’ve still got to wait until they’re loaded.

4. The more macros you use, the more likely it is that they will start to interfere with each other. The problem is even worse when using a large package with macros you a) don’t need and b) don’t understand. LATEX’s macros are very difficult to understand because pieces of them are scattered all over the place. For this reason, it can be very difficult and frustrating trying to get LATEX to stop doing something you don’t want it to do. ConcTEX changes the \catcode of several characters and includes a large number of macros. Therefore, the likelihood is great that there would be interference between ConcTEX and LATEX.

The TEX input file

Multiple input files can be used for typesetting a document and generating a concordance, so the document can be divided into several files in the customary way. In the following, I will assume, for simplicity’s sake, that there is only one input file. This file can have any name within reason.

Like any other TEX input file, an input file for ConcTEX will contain text, control sequences for typesetting and perhaps comments (using %). But it will also contain Lisp code used by the program conctex.lsp. Therefore, TEX input files used for generating a concordance are subject to greater restrictions than is ordinarily the case with plain TEX. Some parts of the input file will only be used by TEX, some will only be used by conctex.lsp, and some will be used by both, or neither. Many of the complications of ConcTEX have to do with making TEX and/or Lisp ignore items in the input file.

Environments. ConcTEX changes some category codes and redefines some control sequences in order to format the transcription correctly. This can make it difficult to type in “normal” text using plain TEX’s familiar conventions. It might be useful to do this if sections of transcription alternate with

sections of commentary. The macro \plain defines an environment where \catcodes and macros are reset to their normal values. This environment ends with the macro \endplain.

In the \trans environment, which is the default, the \catcodes of characters and macro expansions are set to the values needed for transcription lines, i.e., the lines that contain the actual transcription. The macros \endplain and \endtrans are defined like this:

    \let\endplain=\trans
    \let\endtrans=\plain

The changes made by \trans and \plain (and \endplain and \endtrans) are global. A \trans in the input file need not be matched by an \endtrans, but \plain and \endplain must be matched, otherwise they will wreak havoc in conctex.lsp.

Another environment is used for commentaries, which are described below. Commentaries also reset some category codes and macros, but the commentary environment is not identical to the \plain environment. Commentaries will usually contain material similar to that in the transcription itself, whereas material in the \plain environment should be formatted differently. The \catcodes of several characters needed in math mode are also reset by the token list \everymath.

☡ A line beginning with \trans or \endtrans in an input file will be ignored by conctex.lsp, but \plain and \endplain are replaced by *, so conctex.lsp treats text between \plain and \endplain as a commentary.

☡ Having the \trans environment be the default isn’t hard-wired into ConcTEX; however, the definitions of \- and - assume that \trans is the default, so this will need to be fixed if you want \plain to be the default environment.

☡ In both the \plain environment and within commentaries, - is reset to \catcode 12 and \- is used for discretionary hyphens, because line breaking is not performed explicitly, but rather by TEX’s line breaking routine.

Transcription lines are the lines in the input file that contain the text of the transcription itself. They are processed in whole or in part by both TEX and conctex.lsp. In order to do its job, ConcTEX redefines the \catcode of several characters. Each of these changes is explained in its proper place, but there is a list for reference on page 402. Transcription lines are formatted according to the settings in the \trans environment, and processed by conctex.lsp. Apart from the text, they may contain items like comments, commentaries, and certain macros.

Comments and commentaries. ConcTEX makes a distinction between “comments” and “commentaries”. A comment can be a normal TEX comment using %, but it can also be code that looks like this:

    \begincomment{This is a comment.}
    \endcomment

The format defined in conctex.tex uses a conditional (defined with \newif) called \ifdraft. Whenever \drafttrue, that is to say, whenever \ifdraft≡\iftrue, certain things are done which are useful for editing purposes, but which aren’t done for the final draft. One of these things is printing out comments. If \drafttrue, this line:

    helgum m{\oe}nnum
    \begincomment{This is a comment}%
    \endcomment {\ae}{\dh}r i.~helgari
    =⇒
    * This is a comment *
    helgum mœnnum
    æðr i. helgari

If \draftfalse, it yields

    helgum mœnnum æðr i. helgari

A commentary, on the other hand, is text which should be processed by TEX and appear in the output, but should be ignored by conctex.lsp, i.e., not be used for generating the concordance. This is for editorial remarks within the transcription itself. Commentaries can be coded in several different ways. Usually, a commentary begins and ends in *.5

5 The control symbol \\ in the following example is used for breaking the lines. It’s explained on page 381. The letter {\slong} −→ ſ, used in the following examples, is an alternative form of “s”. It is simply the “f” from Computer Modern Roman, with the crossbar removed.

    \catcode`\*=9
    ...
    herbergi heilag{\slong} anda.~%
    * This passage is particularly
    interesting * {\oe}llvm\\
    helgum m{\oe}nnum {\ae}{\dh}r
    i.~helgari\\
    {\tirok} haleitari.~er komin
    at k{\ydot}n\\
    =⇒
    herbergi heilagſ anda. This passage is particularly interesting œllvm
    helgum mœnnum æðr i. helgari
    ⁊ haleitari. er komin at kẏn

In this example, TEX simply ignores the character *, because its \catcode has been changed from 12 (other) to 9 (ignored), so the commentary is printed in the same font as the transcription. If I want the commentaries to be printed in a different font, I can code something like this:

    \catcode`\*=\active
    \font\ssf=cmss10
    \def\startcommentary{\begingroup\ssf}
    \let\finishcommentary=\endgroup
    \let\docommentary=\startcommentary
    \def*{\docommentary
      \ifx\docommentary
          \startcommentary
        \global\let\docommentary=%
          \finishcommentary
      \else
        \global\let\docommentary=%
          \startcommentary
      \fi}

This makes * an active char whose expansion switches back and forth between \startcommentary and \finishcommentary. The example above then comes out looking like this:

    herbergi heilagſ anda. This passage is particularly interesting œllvm
    helgum mœnnum æðr i. helgari
    ⁊ haleitari. er komin at kẏn

If you want * to do more than just change the font, you can put more code into the expansions of \startcommentary and \finishcommentary. The macros \begincommentary and \endcommentary are both \let to *, so using them is exactly the same as using *. They can be useful for marking longer commentaries, e.g.,

    herbergi heilag{\slong} anda.~{\oe}llvm\\
    \begincommentary \ [...]
    \endcommentary\space
    helgum m{\oe}nnum {\ae}{\dh}r
    i.~helgari\\
    {\tirok} haleitari.~er komin
    at k{\ydot}n\\
    =⇒
    herbergi heilagſ anda. œllvm
    ⟨This commentary is so long that it’s nice to mark it with \begincommentary and \endcommentary to make it easier to see in the input file, since * and * might be easy to overlook.⟩ helgum mœnnum æðr i. helgari
    ⁊ haleitari. er komin at kẏn

The explicit \space following \endcommentary is necessary, because ordinary spaces following \endcommentary would just be swallowed up in the usual way. It would be easy enough to change by redefining \finishcommentary, so that it always inserts a space into the current list.

    \def\finishcommentary{\endgroup\space}

However, it might be desirable to have commentaries end within a word sometimes.

Ignored lines.

• Lines that begin with % are ignored both by TEX and conctex.lsp. (A % in the middle of a line causes both TEX and conctex.lsp to discard the rest of the line.)

• Entirely blank lines are processed normally by TEX (the second ⟨return⟩ in a row is converted to \par), and ignored by conctex.lsp.

• Lines that begin with \ are treated in the normal way by TEX and ignored by conctex.lsp, with a few exceptions. This makes it possible to include undelimited macros (control sequences that are not surrounded by braces) in the input file without worrying about how the Lisp program is affected by them. These can be skips, font changes or other commands. Beginning a line with \relax or \empty makes it possible to type anything in the input file and have conctex.lsp ignore it.

☡ Certain undelimited macros, such as \lineno and macros like \putontop and \overstroke that are used for text, do not cause conctex.lsp to ignore a line, so they can appear at the beginning of text lines. Other exceptions may be programmed.

[...] complete balanced expression (s-expression or sexp); multiline Lisp expressions are not permitted in the input file. If an @ is not the first non-blank character in a text line, both TEX and conctex.lsp discard the rest of the line following the @, and Lisp does not evaluate it.
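How TEX is made to skip everything after an @ is not shown in this section. The following is a minimal plain TEX sketch of one way to get that behaviour, under the assumption (mine, not necessarily what conctex.tex does) that @ is simply given category 14, the category of the comment character %. The Lisp expressions are placeholders; TEX never sees them.

```tex
% Assumption: conctex.tex may implement this differently.
% Category 14 is the category of plain TeX's comment character %.
\catcode`\@=14
@ (print "a whole line that TeX skips")
himins ok @ (print "TeX skips the rest of this line")
iardar
\bye
```

As with %, the end of the line is discarded together with the text after the @, so the two text lines are set as one run of words with no paragraph break between them.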

☡ The most common use of evaluated lines is to reset the text position for a new leaf, side or column.

    @ (set-position "28va")

They can also be used for adding occurrences for words which TEX can’t format in the normal way, perhaps because a word is written vertically and extends over multiple lines (see page 398).

    @ (add-occurrences "drotning" "1va~1-7")

The character ¥. The program conctex.lsp considers all letters (\catcode 11) and some characters of type “other” (\catcode 12) in text lines as “word elements”. Some “other” characters are considered “word separators”: in particular, blanks and punctuation. The character ¥ (decimal 165, octal 245, hexadecimal A5)6 is a special kind of word separator: it is ignored by TEX (\catcode 9). The program conctex.lsp will ignore most lines that begin with an undelimited macro, but if the line begins with ¥, conctex.lsp will process it.

    ¥\vskip12pt DROTNING himins ok
    iar{\dh}ar ...

6 It’s sometimes convenient to know the decimal, octal or hexadecimal notation for an integer in another of these radices. ConcTEX includes a lagniappe, a C program that converts between decimal, octal and hexadecimal integers easily and quickly, and several Emacs-Lisp functions for calling this program from within Emacs.

The character ¥ may look different on your terminal.7 However it looks, it should always be used rather than ^^a5 in your input file;8 TEX will recognize ^^a5 as referring to a single token, but conctex.lsp will treat it as a string of 4 characters. It’s never necessary to use ¥, it is merely a convenience.

7 If you’re using Emacs, you can enter any character by typing Control-q followed by an octal integer, so I can type Control-q 245 to get ¥. You may get another character, though, depending on your editor, operating system, etc. A better way to type in special characters is to define a key sequence or an abbreviation to do it. There’s more about this in the documentation supplied with ConcTEX.

8 Any character can be represented in a TEX file as ^^ followed by two lowercase hexadecimal digits (The TEXbook, p. 45). If necessary, conctex.lsp can be made to recognize this notation.

Curly braces. As far as plain TEX is concerned, the characters { and } are simply set to categories 1 and 2, beginning and end of group, respectively. They can be set to other categories, and other characters can be set to categories 1 and 2. ConcTEX doesn’t support this generality; the \catcodes of { and } should not be changed and other characters should not be used in their place. The reason for this is that the program conctex.lsp uses { and } explicitly in strings. It would be possible to change this and have a general mechanism for recognizing a beginning-of-group and an end-of-group character, but I did not consider that this was worthwhile.

Braces are used in transcription lines for their normal purpose: to delimit macros and their arguments. TEX normally permits “unmotivated braces”, i.e., braces that are typed into the input file for no particular reason.

    Th{is} line has {unm}otiv{ated} br{ace}s.
    =⇒
    This line has unmotivated braces.

The unmotivated braces will only have an effect if they separate parts of a ligature, as in waf{f}le −→ waffle, or prevent kerning, as in {V}A −→ VA. Sometimes, of course, as in {shelf}ful −→ shelfful,9 this is useful, but then the braces aren’t unmotivated. ConcTEX does not permit them, and the function letter-function in conctex.lsp will signal an error when it finds an unmotivated }. If cases like “shelfful”, where the word looks better without the ligature, are desired, they must be accounted for in conctex.lsp.

9 The TEXbook, pp. 19 and 306.

Changing fonts. Manuscript transcriptions often require frequent font changes, sometimes within a single word. For example, editorial emendations may be printed in italics: prestr (Engl. “priest”), and perhaps enclosed in brackets, too: p[re]str. Some letter forms may be represented in the transcription as small capitals, such as the “ʀ” in p[re]stʀ. Different fonts may be used for various other purposes, too, according to the book design. For instance, initials, large and/or decorative letters, and letters written in a different color ink in the manuscript may all be indicated by a special font in the transcription. Font changes are handled in the normal way by TEX and discarded by conctex.lsp. Font changes can extend over multiple transcription lines.

Special characters and other macros. A fundamental decision in editing a manuscript transcription is, to what degree the transcription should correspond to the actual appearance of the manuscript, e.g., how special characters are represented, whether abbreviations are expanded, etc. With TEX and METAFONT, it is possible to imitate the appearance of the manuscript to a greater degree than is possible with other methods; however it is up

to the editor and the book designer whether to make use of this capacity or not. I believe that a transcription will usually be of greater interest if it’s not normalized, and if the abbreviations are not expanded. No matter what style of transcription is chosen, a wide range of special characters is usually required.

In TEX, special character macros can be typed with or without enclosing braces, e.g., the word “ætt” (Engl. “a quarter of the heavens, one’s family”)10 can be coded as “{\ae}tt” or “\ae tt”, or even “\ae   tt” since spaces following a control word are ignored. The second and third alternatives are impractical even under normal circumstances, because it is unclear in the input file that “\ae” and “tt” belong together. In ConcTEX, however, all special character codings must be enclosed in braces (“{\ae}tt”). TEX will handle “\ae tt” and “\ae   tt” in the normal way, but conctex.lsp will signal an error when it reads “\ae” without enclosing braces.11 All macros used in transcription lines must be accounted for in conctex.lsp. Others will cause an error.

10 Cleasby-Vigfusson, p. 760.

11 It is letter-function that signals the error, when it reads the \.

In most medieval manuscripts, words are often abbreviated. There are several methods of abbreviation. One is to write some of the letters of a word and put a stroke over them, like “mm” for “mǫnnum.” It may be desirable to expand these abbreviations in the transcription. One way of doing this is to print out the characters that do not appear in the manuscript, but in italics and underlined, e.g., “mǫnnum”. (Other solutions may be used according to the book design.) The macro \ustroke is used to do this:

    m{\ustroke{{\ohook}nnu}}m
    =⇒
    mǫnnum

You can type “m{\ustroke{{\ohook}nnu}}m” or “m\ustroke{{\ohook}nnu}m”, i.e., the macro \ustroke can be delimited or undelimited. Either way, conctex.lsp discards the macro and its braces, and the argument is treated as part of the word, so in this example, “mǫnnum” is printed in the output and “mǫnnum” will appear in the concordance (under the main form “maðr”, Engl. “man”).

Since such abbreviations occur frequently, the \catcode of the underline character _ (decimal 95, octal 137, hexadecimal 5F) has been reset to \active and \let to \ustroke. So now you can type “m_{{\ohook}nnu}m” to get “mǫnnum”, which makes the input file somewhat easier to type and read. The \catcode of _ is reset to 8 (subscript) in math mode, so it’s available for making subscripts, and also in the \plain environment, so that it’s possible to load files with the character _ in their names. But it’s not reset in commentaries, which might very well want to use _ for \ustroke.

☡ The macro \ustroke is defined as follows:

    \def\ustroke#1{%
      \def\subustroke{\leavevmode
        \ifx\next.% It's a period
          \setbox0=\hbox{{\it#1}}\else
        \ifx\next,% It's a comma
          \setbox0=\hbox{{\it#1}}\else
          % It's neither a period nor a
          % comma
          \setbox0=\hbox{{\it#1\/}}%
        \fi\fi
        $\underline{\box0}$}%
      \futurelet\next\subustroke}

Since the underlined text is put in italics, it’s nice to have \ustroke insert the italic correction (\/) automatically, if and only if \ustroke’s argument is followed by something other than a period or a comma. In order to find this out, it’s necessary to peek at the following token, using \futurelet. If \ustroke uses a non-slanted font, it can be defined more simply. It uses the \underline macro, which is available only in math mode. This makes the underlines go below the bottom of the lowest character in \ustroke’s argument.

    _{ypq} _{abc}
    _{{\ehook}{\ohook}{\oehook}}
    =⇒
    ypq abc ęǫø̨

It might be nicer to have the underlines all at the same depth, preferably close to the baseline, but unfortunately, this doesn’t turn out to look very good. If \ustroke is defined like this:

    \def\ustroke#1{%
      \def\subustroke{\leavevmode
        \ifx\next.% It's a period
          \setbox0=\hbox{\it#1}\else
        \ifx\next,%
          % It's a comma
          \setbox0=\hbox{\it#1}\else
          % It's neither a period
          % nor a comma
          \setbox0=\hbox{\it#1\/}\fi\fi
        % This makes a .25pt rule.
        \hbox to 0pt{\vrule height -.8pt

            depth 1.05pt
            width \wd0\hss}\box0}%
      \futurelet\next\subustroke}

then

    _{ypq} _{abc}
    _{{\ehook}{\ohook}{\oehook}}
    =⇒
    ypq abc ęǫø̨

I think it looks worse to have the underline stroke go through the descenders of y, p, and q, and the ogoneks (˛) of ę, ø̨, and ǫ, than it does to have the underline stroke be at different heights. To do this properly, it would really be necessary to design fonts with the underline stroke included in the individual letters.12

12 Cf. The TEXbook, p. 323.

Sometimes manuscript transcriptions will require the use of special characters which are not available in existing fonts. It may be possible to create them by manipulating existing sorts.13 For example, in Holm perg 11 4o, many words have letters with smaller letters placed over them. A logotype14 consisting of an “a” with a small “b” over it, coded as {\bOVERa}, can be defined using the existing letters “a” and “b”, boxes, and glue. For other sorts, like ⁊, the Tironian symbol for Latin “et” (“and” in English and “ok” in Old Icelandic), it may be necessary to program a font using METAFONT.

13 “Sort” is a technical term for a typographical unit, synonymous with “character”. The term “character” is ambiguous in the context of TEX, because the characters in the input file differ in nature from the characters in the fonts used for typesetting, and the coding of the latter in terms of the former is subject to modification. Therefore, I sometimes prefer the unambiguous term “sort” for a character when I mean a typographic unit belonging to a font. Williamson defines a character or sort as a “single figure, letter, punctuation mark, symbol or word-space cast as a type or generated by photocomposition, CRT [cathode-ray tube] or digital system, or typed.” (368).

14 Williamson defines “logotypes” (or “logos”) as follows: “letters joined to each other & cast on a single shank.” (p. 379.) He also notes that there is a shortage of logotypes in some photocomposition systems, where they are replaced by separate characters.

☡ \bOVERa is actually defined as \putontop{a}{b}{}{}{}. The macro \putontop is defined like this:

    \def\putontop{\begingroup
      \catcode`\-=12
      \def\subputontop##1##2##3##4##5{%
        \setbox1=\hbox{##1}%
        \setbox3=\hbox{##3}%
        \setbox4=\hbox{##4}%
        \setbox5=\hbox{##5}%
        %
        %% These are the default dimensions.
        %% For shifting the top letter
        %% to the right or left:
        \dimen3=-.8\wd1
        %% For shifting the top letter
        %% up or down:
        \dimen4=1.2\ht1
        %% For increasing or decreasing
        %% the kern following \putontop:
        \dimen5=0pt
        %
        \ifdim\wd3>0pt
          \advance\dimen3 by ##3\fi
        %
        \ifdim\wd4>0pt
          \advance\dimen4 by ##4\fi
        %
        \ifdim\wd5>0pt
          \advance\dimen5 by ##5\fi
        %
        \setbox2=\hbox{\kern\dimen3
          \raise\dimen4
          \hbox{{\small ##2}}}%
        \leavevmode\box1
        \hbox to 0pt{\hss\box2\hss}%
        \kern\dimen5\endgroup}\subputontop}

The macro \putontop takes its second argument and puts it above its first argument in a smaller size. The third and fourth arguments can be empty, or they can be used to shift the position of the second argument. The fifth argument can also be empty, or it can be used to increase or decrease the space following \putontop. The first and second arguments to \putontop need not be single characters. Here are some examples of using \putontop.

    \putontop{abc}{def}{}{}{} ghi
    =⇒
      def
    abc ghi

This shifts the raised letters to the right.

    \putontop{abc}{def}{10pt}{}{} ghi
    =⇒
        def
    abc ghi

This shifts them down.

    \putontop{abc}{def}{}{-20pt}{} ghi
    =⇒
    abc ghi
      def
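The default offsets, and what a non-empty third argument adds to them, can be checked outside \putontop. This is a minimal plain TEX sketch that repeats only the register arithmetic from the definition. It assumes box register 1 and \dimen3 and \dimen4 are free to use as scratch registers, as \putontop itself does. The \message line is for illustration only and is not part of ConcTEX.

```tex
\setbox1=\hbox{abc}% the base letters (first argument)
\dimen3=-.8\wd1 % default horizontal shift of the raised letters
\dimen4=1.2\ht1 % default vertical shift of the raised letters
\advance\dimen3 by 10pt % what a third argument of 10pt adds
\message{shift: \the\dimen3, raise: \the\dimen4}
\bye
```

Running this through plain TEX writes the two resulting dimensions to the terminal and the log file; they depend on the current font.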

This increases the kern following “abc”.

    \putontop{abc}{def}{}{}{20pt} ghi
    =⇒
      def
    abc     ghi

And this reduces it.

    \putontop{abc}{def}{}{}{-10pt} ghi
    =⇒
      def
    abcghi

At the beginning of \putontop, the \catcode of - is reset temporarily to 12 (other). This makes it possible to use negative dimensions in its third, fourth, and fifth arguments.

☡ If a certain combination of letters like {\bOVERa} appears frequently in a transcription, it’s easier to define a macro like \bOVERa for it rather than using \putontop explicitly over and over. The function replace-items can replace {\bOVERa} with “ab” or any other string, so that a word like “r{\bOVERa}bit” will appear in the concordance as “rabbit”. Another possibility is to account for {\bOVERa} in letter-function. In this case, the word will be written to the concordance with the logotype, unless “r{\bOVERa}bit” has been specified as a variant of “rabbit” in the lemmatization dictionary.

    (generate-entry "rabbit"
                    :variants "r{\\bOVERa}bit")

☡ If a word like “eptir” (Engl. “after”) is abbreviated as \putontop{e}{p}{}{2pt}{}tir (printed as “etir” with a small “p” over the “e”) in the input file, the first two arguments shouldn’t be discarded, because they belong to the word. However, the string “\putontop” and its other arguments, whether they’re empty or not, will cause an error in letter-function, if they reach it. The function process-macro inside the function discard-items discards the string “\putontop”, takes the first two arguments out of their braces and discards the rest.15

15 The function process-macro receives two arguments which tell it which arguments to take out of their braces and which to discard.

☡ This works, if words using \putontop always conform to this pattern. When they do, \putontop can be used explicitly in the input file. However, when they don’t, a different approach is required. Sometimes, for instance, letters that are put over other letters stand for parts of the word which are left out. If the word “drotning” is abbreviated frequently in the manuscript as “d” with a small “r” over it, then it would make sense to define a macro \dr as follows:

    \def\dr{\putontop{d}{r}{}{}{}}

Then, {\dr} can be made a variant of “drotning” in the lemmatization dictionary.

    (generate-entry "drotning" :variants "{\\dr}")

or

    (add-variants "drotning" "{\\dr}")

Alternatively, the function replace-items can replace all occurrences of “{\dr}” in the input file with “drotning” before current-line is passed to process-line. Note that “{\dr}” should always appear within braces, like the special character macros.16

16 The control word \dr could even be accounted for in letter-function, which would make “d” with its raised “r” appear in the concordance. If it was assigned the list (d-value r-value o-value t-value n-value i-value n-value g-value), it would even be alphabetized correctly. I don’t think this would be useful for a concordance, but it might be for some other application.

☡ If there are a lot of cases of this type, and the transcription does not expand the abbreviations, then it would be a good idea to add another argument to \putontop.

    \putontop{ma{\dh}r}{m}{r}{}{}{}

In this example, the word maðr (Engl. “man”) is abbreviated as “m” with a small “r” over it. Then, discard-items can call process-macro in such a way that the first argument, containing the whole word as it should appear in the concordance, is taken out of its braces so that it reaches process-line and read-word, and the other arguments are discarded. In this case, \putontop should discard the first argument when TEX is run, so that only the abbreviation is printed to the output.

☡ If the word “maðr” appears frequently in the manuscript, which is likely, it might make sense to define a macro as follows:

    \def\madhr{\putontop{}{m}{r}{}{}{}}

Here, the first argument can be empty, since TEX ignores it. The string “{\madhr}” must be made a variant of “maðr” in the lemmatization dictionary or replaced by “ma{\dh}r” in the function replace-items, as above. The macro \madhr can be used in words like “{\madhr}inn” (printed as “minn” with a small “r” over the “m”) for “maðrinn” (nominative singular with enclitic article). Cases where a single special character is used to represent a frequently used word, like {\hxbarhk} (abbreviation for “hans”, Engl. “his”, personal pronoun, masculine singular genitive) and {\thxbarhk} for the word

“þessu” (demonstrative pronoun, feminine singular dative), can be handled in the same way.

An advantage in using TEX is the ability to use temporary definitions for special characters. If I hadn’t programmed ⁊ yet, I could define it as follows:

    \def\tirok{\&$^{\rm Tir.}$}

Then, a line like the following:

    {\tirok} haleitari.~er komin at
    k{\ydot}n

will be printed like this:

    &Tir. haleitari. er komin at kẏn

When I’ve gotten around to programming ⁊, it will be printed like this:

    ⁊ haleitari. er komin at kẏn

but I won’t have to change my input file.17

17 TEX is generous with space for storing macros; there’s room for over 2000, so there should be enough for all the special characters you need.

☡ This is the TEX code that’s necessary for using special characters like ⁊, assuming the font is called specialfont.

    \font\specialfontnine=specialfont9
    \font\specialfontten=specialfont10
    \font\specialfonttwelve=specialfont12
    ...

Now, \specialfont must be defined where \rm, \large, \small etc. are defined, so that \specialfont accesses the correct size of the font specialfont, e.g.

    \def\rm{\let\specialfont=%
      \specialfontten ...
    \def\small{\let\specialfont=%
      \specialfontnine ...
    \def\large{\let\specialfont=%
      \specialfonttwelve ...

Finally, \tirok is defined like this:

    \def\tirok{{\specialfont\char'140}}

The character “⁊” is in position octal 140 (decimal 96, hexadecimal 60) in specialfont.

☡ In a similar way, TEX makes it possible to distinguish words in the input file which should not be distinguished in the output. For instance, there are two forms of “m” in Holm perg 11 4o which are used interchangeably. Depending upon the degree of normalization desired in the transcription, these two letters can be distinguished in the output or not, as desired. One form can be coded as m −→ “m” and the other as {\mone} (for “m-one”) which also prints as “m” and is defined as

    \def\mone{m}

The function replace-items can convert each occurrence of the string “{\mone}” in the input file to “m” before current-line is passed on to process-line, so {\mone} will never appear in the concordance. If, later, a different letter form should be used to represent {\mone} in the output, \mone can be redefined, e.g.,

    \def\mone{{\specialfont\char'001}}

Math mode material is always treated in the normal way by TEX. It is treated in different ways by conctex.lsp. Display math material is processed normally by TEX and discarded by conctex.lsp.18 Math mode is very likely to appear in transcription lines, mainly for sub- and superscripts and certain special characters. Sometimes, the math mode material will be irrelevant for the concordance, but at other times it may be an essential part of a word, or even serve to distinguish between two otherwise identical words. If math mode material is attached to a word, that is, if it isn’t separated from the word by blank space or another word-separator (like most punctuation marks), it is considered to be part of the word. If not, conctex.lsp discards it. For example,

    drotning$^\pi$

is treated as a word, “drotningπ”. It will appear as such in the concordance and will be a separate entry from “drotning”, if this word occurs in the transcription. If this appears in the transcription,

    drotning $^\pi$

conctex.lsp will discard the math mode material, “drotning π” will appear in the output and “drotning” will appear in the concordance. To have conctex.lsp discard the math mode material, but not separate it from the word in the output, type:

    drotning¥$\pi$

In this case, conctex.lsp also discards the math mode material, and “drotning” appears in the concordance, but “drotningπ” appears in the output, because ¥ is ignored by TEX (\catcode`\¥=9), and treated as a word separator by conctex.lsp. The math mode material can also be put on the next input line following a %.

18 The program conctex.lsp converts $$ to *, so display math mode material is treated as a commentary.

    drotning%
    $^\pi$ himins ...

=⇒

    drotningπ himins ...

and “drotning” appears in the concordance.

The \catcodes of some characters, such as -, _ and <, that have been changed for use in transcription lines, should have their normal values in math mode. The token list \everymath resets them.

    \everymath={\catcode`\-=12\catcode`\_=8
    \catcode`\*=12\catcode`\<=12\relax}

The \relax at the end of \everymath is necessary because assignments, such as changes to \catcodes, only take effect when TEX is building a list.19 The same effect could be achieved with \vbox{} or \hbox{}, or a harmless macro (but not \empty).

Math mode material should not extend over more than one transcription line. If it does, conctex.lsp will signal an error. Normally, it shouldn’t be necessary anyway, since most manuscripts don’t contain multiline equations. Even if yours does, the elements of the equation probably don’t belong in the concordance anyway, so the multiline math mode material can be put in a commentary. In this case, it may be necessary to reset the line number explicitly with set-position. If your application requires a lot of math mode, it would be advisable to reprogram the routines for handling it. Display math mode material is treated as a commentary, so it can extend over any number of lines.

Line ends. The treatment of line ends is an important factor in ConcTEX. A typeset manuscript transcription will generally have lines corresponding to the lines in the manuscript, so it is undesirable to have TEX do the line breaking for the text lines. (It would also be a lot of work to write a hyphenation dictionary for a medieval manuscript.) However, it will rarely if ever be desirable to use \obeylines, because there are many other items which can appear in the input lines, making it impractical for the lines in the input file to correspond to the lines in the output: font changes, index entries, footnotes, comments, etc. Therefore, the ends of transcription lines must be indicated.

Most transcription lines should end in \\. The control symbol \\ is \let to \par in conctex.tex and recognized as a line end by conctex.lsp. The program conctex.lsp keeps track of the line numbers, so this is important. A transcription line in the input file that is followed by a blank line is treated as if it ended in \\.

Normally, the lines in the output should correspond to the lines in the manuscript, so they should be wide enough, i.e., the value of \hsize should be large enough so that transcription lines don’t have to be broken. In exceptional cases, however, this may not be possible, for example, if a line includes a long commentary.

    \lineno{8} himins _{ok} *\<{\ss This
    is an extremely long commentary which
    causes this otherwise short line to
    be broken so that it results in more
    than one line in the output.}\>*
    iar{\dh}ar.~s{\ae}l {ok} dyr{\dh}\-
    \lineno{9} lig m{\o}r Maria.~mo{\dh}ir
    d_{ro}tti_{n}s\\

=⇒

    8: himins ok ⟨This is an extremely long commen-
       tary which causes this otherwise short line to
       be broken so that it results in more than one
       line in the output.⟩ iarðar. sæl ok dyrð
    9: lig mør Maria. moðir drottins

The control symbol \\ expands to \par, and the values of \parskip and \parindent have been set to 0pt so that multiple blank lines won’t cause excessive vertical space or unwanted indentation in the output. If \\ expanded to \hfil\break, a blank line following \\ would cause two \baselineskips to appear in the output, one from the \break and one from the second ^^M (⟨return⟩), which is converted to \par. On the other hand, a \par following a \par is harmless.20 The text in manuscripts is generally not divided into paragraphs, so it’s unnecessary to use blank lines to indicate paragraphs, and it’s convenient to allow them for the sake of making the input file more readable.

The control symbol \\ is not needed in the \plain environment, because TEX does the line breaking there. The values of \parskip and \parindent can also be changed in the definition of \plain, for text that should be formatted differently from the transcription itself.

Manuscript positions are defined according to leaf (folium), side (recto or verso), column (if there are multiple columns), and line number. The

19 The TEXbook, p. 373.

20 The first \par puts TEX into vertical mode, where the second \par has no effect, “except that the page builder is exercised ... and the paragraph shape parameters are cleared.” (The TEXbook, p. 283.)
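The treatment of line ends described on this page can be tried in a few lines of plain TEX. The following is only a minimal sketch under that assumption, not the contents of conctex.tex, and the sample text is made up for illustration.

```tex
% Plain TeX sketch (illustration only, not taken from conctex.tex).
% \\ ends a transcription line by ending the paragraph.
\parskip=0pt \parindent=0pt
\let\\=\par

first manuscript line\\

second manuscript line\\
third manuscript line\\
\bye
```

The blank line after the first line produces a second \par while TEX is already in vertical mode, so it adds no space, and the three lines come out evenly spaced.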

leaves of Holm perg 11 4o and AM 234 fol have two columns, so a position is identified as follows: 2ra 17 is leaf 2, recto, column a, line 17. When a new leaf, side or column begins, the input file must contain a line that looks like this:

    @ (set-position "28vb")

If that column starts with line 15, perhaps because the top of the leaf is missing or unreadable, the line should look like this:

    @ (set-position "28vb" 15)

Lines beginning with @ are ignored by TEX and evaluated by conctex.lsp. The Lisp program counts the lines and keeps track of the position in the manuscript. However, it will usually be useful to indicate line numbers in the transcription itself, so a line in the input file might look like this:

    31: f_{ra} savgn_{n} Mathevs

In the output, it will look like this:

    31: fra savgnn Mathevs

However, “31” should not appear as a lemma in the concordance. Therefore, conctex.lsp discards numbers, punctuation and blanks at the beginning of transcription lines. Another possibility is to use the macro \lineno to make them print or not, according to the value of a conditional.

    \ifprintlinenumbers
    \def\lineno#1{...}\else
    \def\lineno#1{\relax}\fi
    \lineno{1} Drotning etc.

    \newif\ifprintlinenumbers
    \printlinenumberstrue

=⇒

    1: Drotning etc.

    \printlinenumbersfalse

=⇒

    Drotning etc.

The macro \lineno is an exception to the rule about conctex.lsp discarding lines that begin with undelimited macros. It can be defined as follows:

    \ifprintlinenumbers
    \def\lineno#1{\leavevmode
    \setbox0=\hbox{33:\space}%
    \hbox to \wd0{\hss#1:\space}%
    \hangindent\wd0\hangafter 1}
    \else
    \def\lineno#1{\relax}\fi

Line numbers, whether explicit or in \lineno, are discarded by conctex.lsp. Since \\, \- and - (explained below) and a blank line all cause a \par to be added to the current list, \lineno always begins a paragraph (assuming \printlinenumberstrue). Usually, this paragraph will consist of one output line, just as it corresponds to one line in the manuscript, but if the paragraph is longer than one line, the \hangafter and \hangindent parameters cause the following lines to be indented so that they begin directly below the text of the first line in the paragraph.

The same effect could be achieved by using the token list \everypar, and separate \everypars could be maintained for the \plain and \trans environments.

The “magic number” 33 in \box0 in the definition of \lineno is there simply to make the box an appropriate size. Numbers wider than “33” extend into the left-hand margin, so the colons (or periods, or whatever) always line up.

In some applications, positions may not require line numbers. For instance, positions in a concordance of the Bible should be given as book, chapter and verse. In this case, TEX could take care of the line breaking, and \\ could be used to indicate the end of a verse. It could be \let to \relax in conctex.tex and cause conctex.lsp to increment a “verse-counter” instead of line-counter (explained below). The exact way that ConcTEX handles line numbers in the input file can and should be set for each particular project.

While it is convenient to define macros in this way:

    \ifprintlinenumbers
    \def\lineno#1{...}\else
    \def\lineno#1{\relax}\fi

it is sometimes preferable to define them like this:

    \def\lineno#1{\ifprintlinenumbers
    ...\else\relax\fi}

or like this:

    \def\linenoprint#1{...}
    \def\linenonoprint#1{\relax}

    \ifprintlinenumbers
    \let\lineno=\linenoprint
    \else\let\lineno=\linenonoprint
    \fi

The advantage of the last two alternatives is that it is possible to put the definitions of \linenoprint and \linenonoprint in a file

used for generating a preloadable format file with INITEX. If you try to generate a format file using the first example, where the definitions themselves are within the conditional construction, one version will be skipped when INITEX runs, depending on the current value of \ifprintlinenumbers. However, in the second example, the conditional construction is within the definition, so it will expand according to the value of \ifprintlinenumbers when \lineno is called. This means that the expansion of \lineno can switch back and forth during the TEX run. In the third example, the two definitions are assigned to two different macros, and \lineno is \let to one of them according to the value of \ifprintlinenumbers at the time. This conditional construction can be put in an ordinary macro file, loaded after the preloaded format. The assignment must occur before \lineno is called for the first time. A subsequent change to the value of \ifprintlinenumbers later in the TEX run has no effect on the expansion of \lineno. You can, however, switch the definition explicitly by saying \let\lineno=\linenoprint or \let\lineno=\linenonoprint at any time. If you have a lot of macros, it can be useful to create a preloadable format in this way.21

The Lisp program doesn’t use the explicit line numbers in the input file for its line counting routine, although it could. Explicit line numbers are optional, so conctex.lsp needs its own line counting routine anyway. It does, however, check that the line numbers from its routine and the explicit line numbers in the input file match up. If they don’t, it issues a warning, but doesn’t signal an error. Warnings are written to standard output when conctex.lsp is run, but also to the file warnings. If conctex.lsp is set up to write code like this to warnings:

    echo "At line 21 of input.tex:
    Line numbering is incorrect.
    (\\lineno: 12) (line-counter = 14)"

and conctex.lsp is run using a UNIX shell script, then the shell script can execute warnings by calling sh warnings, causing all of the warnings to be printed to standard output after conctex.lsp is done.

Broken words. Some lines may end in words that are continued on the next line. In AM 234 fol, some of the broken words have hyphens and some do not. It is necessary to distinguish between these two cases in the input file. I use - (hyphen) to indicate a broken word with an explicit hyphen and \- to indicate a broken word with no hyphen. Using \- indicates broken words in the input file, and causes conctex.lsp to treat such a broken word as a unit, but no hyphen appears in the typeset output. For example:

    tok j vpphafi {\slong}n{\slong}
    Gv{\dh}{\slong}pia-
    llz.~at telia {\ae}tt drottin{\slong}
    ie{\slong}us

=⇒

    tok j vpphafi ſnſ Gvðſpia-
    llz. at telia ætt drottinſ ieſus

Whereas

    tok j vpphafi {\slong}n{\slong}
    Gv{\dh}{\slong}pia\-
    llz.~at telia {\ae}tt drottin{\slong}
    ie{\slong}us

=⇒

    tok j vpphafi ſnſ Gvðſpia
    llz. at telia ætt drottinſ ieſus

If a line in the input file ends in - or \-, this obviously indicates the end of a line in the manuscript, so it would be redundant to type -\\ or \-\\.22

The program conctex.lsp recognizes - and \- at the end of a line as line ends for its line numbering routine. TEX also recognizes them as line ends at the end of a line and inserts a \\. However, - can also occur within a line, or even at the beginning of a line. Here it should not be recognized as a line end, either by TEX or by Lisp. A - at the beginning of or within a line is simply printed as - (what else?). As far as TEX is concerned, a \- within a line is harmless, but doesn’t make any sense, since it doesn’t print anything. Therefore, TEX issues a warning but doesn’t signal an error. The program conctex.lsp discards \-s that are neither at the end of an input line nor followed by \\. The strings -- and --- still yield – (en-dash) and — (em-dash) respectively, but they do not cause line ends, either in TEX or in Lisp.23

The following code is necessary to accomplish this:

     1. \catcode`\©=\active
     2. \let©=\-
     3.
     4. \def\hyphen{-}
     5. \def\dash{--}
     6. \def\Dash{---}
     7.
     8. \catcode`\-=\active
     9.
    10. \begingroup
    11. \catcode`\^^M=\active
    12. \global\let^^M=\empty
    13.
    14. \gdef\invisiblehyphen{
    15. \begingroup\catcode`\^^M=\active
    16. \def\subhyphen{\ifx\next^^M\\\else
    17. \ifx\next\\ % Do nothing
    18. \else
    19. \message{Ignoring \noexpand\-. %
    20. Don't use \noexpand\- %
    21. in the middle of a line.}
    22. \fi\fi\endgroup}
    23. \futurelet\next\subhyphen}
    24.
    25. \global\let\-=\invisiblehyphen
    26.
    27. \gdef-{\begingroup\catcode`\^^M=\active
    28. \def\aEathyphens##1{
    29. \futurelet\next%
    30. \bEathyphens}
    31. \def\bEathyphens{
    32. \ifx\next-\Dash
    33. \expandafter\cEathyphens\else
    34. \dash\endgroup\fi}
    35. \def\cEathyphens##1{
    36. \futurelet\next\endgroup}
    37. \def\subhyphen{
    38. \ifx\next^^M\hyphen\\% It's a return.
    39. \let\eathyphens=\endgroup\else
    40. \ifx\next-% It's a hyphen
    41. \let\eathyphens=%
    42. \aEathyphens\else
    43. %% It's not a return or a hyphen.
    44. \hyphen
    45. \let\eathyphens=\endgroup
    46. \fi\fi
    47. \eathyphens}
    48. \futurelet\next\subhyphen}
    49.
    50. \endgroup

The primitive \-, which is ordinarily used for inserting discretionary hyphens, must be redefined. Since the user decides where transcription lines are to be broken, discretionary hyphens are unnecessary there. In case they are needed, however, the character © (decimal 169, octal 251, hexadecimal A9) is made active and \let to \- before \- is redefined (lines 1 and 2). Make sure you use © and not ^^a9. TEX will treat ^^a9 as a single token, but conctex.lsp will not, just as with ¥ and ^^a5 (unless conctex.lsp is modified to allow this). In commentaries and the \plain environment, \- reverts to its primitive self.

TEX’s ligature routine won’t work for -, --, and --- after the \catcode of - has been changed to \active (line 8), so first I put -, -- and --- into the expansion of the new control words \hyphen, \dash and \Dash, so I can still use them. Now I define - and \invisiblehyphen and \let\- to \invisiblehyphen.24 They check whether the token following - or \- is a ⟨return⟩, ASCII code 13, which can be notated as ^^M, in which case \\ should be inserted to break the line.25

Under normal circumstances, i.e., when \catcode`\^^M=5, it wouldn’t be possible to tell whether the character following an active char is a ⟨return⟩ or not. For example:

    \def\abc#1{\show#1}
    \abc
    d

causes

    > the letter d.
    d
    \abc #1->\show #1

to be printed to the screen, or standard output, to be precise, and

    \abc

    d

causes an error.

    ?
    Runaway argument?
    ! Paragraph ended before
    \abc was complete.

21 The TEXbook, p. 344.

22 Although it’s redundant, it is nonetheless permitted.

23 If a concordance is to be generated for a manuscript which uses more characters to indicate broken words, like - or -, both conctex.tex and conctex.lsp must be modified to account for these cases.

24 The macro \invisiblehyphen is defined in order to make it possible to switch the definition of \- back and forth when changing environments. This is necessary, because \- is already a control symbol (in fact, it’s a primitive), whereas - has \catcode 12 in the \plain environment, commentaries, and math mode. Therefore, the active character - can be defined globally. When the environment is switched, it suffices to change the \catcode of -.

25 Incidentally, you might be wondering why I’m talking about “⟨return⟩s”. I use a computer running UNIX where the end-of-line character is ⟨newline⟩, ASCII 10, and not ⟨return⟩, ASCII 13. The reason is that TEX converts the external coding scheme of the input file to its own internal coding (The TEXbook, p. 43), so UNIX’ ⟨newline⟩s are converted to TEX’s ⟨return⟩s.

    \par

The first time \abc was invoked, it skipped the first ^^M, which had been converted to a space, and found the “d” in the next line. The second time it was invoked, \abc skipped the first ^^M, but found the second one, which had been converted to a par token (The TEXbook, p. 47). A normal macro can’t accept arguments that contain \pars, so an error was signalled. Here’s another version.

    \long\def\abc#1{\show#1}
    \abc

    d

=⇒

    > \par=\par.
    \par
    \abc #1->\show #1

A macro defined using \long\def won’t signal an error if a \par occurs in its arguments, but the first ^^M is still skipped. So simply reading in arguments is useless for discovering whether a macro or an active character is followed by a ^^M. In addition, \abc\par will produce the same screen output as the last example, since TEX can’t distinguish between an explicit \par and one that was converted from a ^^M.

The same behavior can be demonstrated using \futurelet:

    \def\abc{\futurelet\next\subabc}
    \def\subabc{\show\next}
    \abc
    a

=⇒

    > \next=the letter a.

and

    \abc

    a

=⇒

    > \next=\par.

In order to check whether the character following a control sequence or an active character is a ^^M, it is necessary to change the \catcode of ^^M. One possibility is to change it to \active. I do this locally within the definitions of - and \-, so that ^^M otherwise behaves normally. However, in order that the ^^Ms within the actual definitions of - and \- are handled correctly while they’re being read into TEX’s memory, it is necessary that \catcode`\^^M=\active and that ^^M is set globally to \empty (lines 11–12). It could also have been set to \relax.

Serendipitously, setting ^^M to \empty makes it unnecessary to use % at the end of lines that end in { or }, such as lines 14, 21–22, etc. However, % is still necessary in the \message command in lines 19–20, otherwise there would be no space between “\-.” and “Don’t” in the message, i.e., any trailing spaces before the ^^M would be swallowed up. A % is also necessary in the \futurelet construction in line 29, otherwise \next would be set to \bEathyphens, the token following the ^^M, which is expanded.

The way TEX handles ^^M at ordinary times, i.e., when \catcode`\^^M=5, causes the global definition of ^^M as a macro to be reset to \par. Therefore it’s necessary to set it globally in line 12, so that it expands to \empty in - and \-. Setting \catcode`\^^M=\active in line 11 is only in effect inside the group that ends on line 50; when - or \- is invoked, it must begin a group and reset \catcode`\^^M to \active. However, this being done, the \global\let^^M=\empty is in effect while - or \- is expanding. It doesn’t work to get rid of the \global\let^^M=\empty and put \let^^M=\empty in the definitions of - and \-. I’m afraid I don’t understand why not, but it is apparently connected with peculiarities of category code 5.

Since the definitions of - and \invisiblehyphen are inside of a group, they must be \gdefs (global definitions), or they wouldn’t be available outside the group. The same applies to \-, which is \let globally to \invisiblehyphen. The definitions inside - and \invisiblehyphen, however, are local to the groups begun in these macros.

Another possibility is to set \catcode`\^^M=\active and \let^^M=\empty globally (i.e., get rid of the group begun in line 10 and ended in line 50, change \gdef and \global\let to ordinary \def and \let, and reset \catcode`\^^M=5 after the definitions).

If - were defined like this:

    \gdef-#1#2{...}

i.e., with arguments, it wouldn’t be possible to check whether the tokens following - in the input file were ^^M or not, because the arguments would have been read in and expanded before - had had a chance to change ^^M’s \catcode. Once a token has been read in as an argument this is no longer possible. So - and \invisiblehyphen change the \catcode of

^^M, peek at the following token using \futurelet, and turn over control to \subhyphen. Both - and \invisiblehyphen have their own version of \subhyphen.

In \invisiblehyphen, \next contains the token following the \- in the input file. This token is examined by \subhyphen. If \next is ^^M, \subhyphen inserts \\ to break the line. If it’s \\, \subhyphen does nothing (\-\\ is redundant but permitted in the input files). If it’s anything else, \subhyphen issues a warning that \- shouldn’t be used in the middle of a line. Then \subhyphen ends the group begun by \invisiblehyphen in line 15.

The active character - functions in a similar, but more complicated way. It too uses \futurelet and \next to examine the following token.

▸ If \next is ^^M, \subhyphen inserts a hyphen with \hyphen (defined in line 4), a \\ to break the line, and \lets the control word \eathyphens to \endgroup, to end the group begun in line 27.

▸ Else, if \next is not -, \subhyphen adds a \hyphen to the current list, but no \\, and \eathyphens is \let to \endgroup.

▸ Else, if \next is -, \subhyphen \lets \eathyphens to \aEathyphens.

Now, \eathyphens takes over control. In the first two cases, i.e., \next was ^^M or anything else other than -, \eathyphens merely ends the group begun in line 27. However, if \next was -, matters are more complicated.

If - is followed by -, this could be --, which should be replaced with an en-dash, or ---, which should be replaced with an em-dash. In neither of these cases should the line be broken. Therefore, it’s necessary to peek at the token following the second -, but first we have to dispose of the second -, because it’s not possible to peek at the next token but one, and the second - is still to be read by TEX’s parser. The macro \aEathyphens takes one argument, which disposes of the second -, then peeks at the following token using \futurelet and \next, and passes control to \bEathyphens. This macro examines \next. If it’s not - (in the \else construction), \bEathyphens inserts a \dash into the current list, and ends the group begun in line 27. However, if \next is -, it inserts a \Dash into the current list and calls \cEathyphens. The \expandafter is necessary to prevent \cEathyphens from reading in \else as its argument. Instead, it reads in and thereby disposes of the third -. It could just end the group now, but it doesn’t. It peeks at the token following the third - using \futurelet and \next yet again. This has the effect that -- or --- at the end of a line is not followed by a blank space.

Normally, in plain TEX, or when using the \plain environment in ConcTEX, -- or --- at the end of a line is followed by a blank space.

    ... in lines 299--
    547 ...

    There was a sudden crash---
    then silence.

=⇒

    ... in lines 299– 547 ...
    There was a sudden crash— then silence.

However, in the \trans environment, the spaces disappear.

    ... in lines 299–547 ...
    There was a sudden crash—then silence.

This is nice, since one doesn’t want spaces following en- or em-dashes. If you do want them, it’s easy enough to redefine \bEathyphens and \cEathyphens so they’re added before ^^M, or you can type

    ... in lines 299--\space
    547 ...

    There was a sudden crash---\space
    then silence.

However, you may be puzzled as to why they disappear. If \cEathyphens just eats the third - and ends the group, a following ^^M will cause a space to be inserted into the current list, so it’s the \futurelet that causes the space to disappear.26 When \futurelet peeks at a token, it fixes its \catcode even though it doesn’t expand it.27 When the group ends and - is finished, TEX’s parser reads the ^^M that follows the second or third -. The \catcode of this ^^M was fixed at 13, so it expands to \empty, its expansion within -, rather than to a blank space.

Changing the category code of - to \active has some minor, unpleasant side effects in the \trans environment. It is no longer possible to use -, -- and --- in the arguments to some macros, such as \beginchapter and \beginsection, which are defined in conctex.tex, and code like \vskip-2pt will cause an error, because - must have \catcode 12 (other) for this to work. The control words \hyphen, \dash and \Dash can be used in macro

26 In the case of --, it’s the \futurelet in \aEathyphens that causes the space to disappear.

27 The TEXbook, p. 381: “... TEX stamps the category on each character when that character is first read from a file.”
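The peeking mechanism described above can be tried on its own. What follows is a minimal plain TEX sketch, not the conctex.tex code: the names \peekeol and \subpeekeol are hypothetical, and the macro only writes a message about what it finds instead of inserting \\. Like the numbered listing, it assumes that the active ^^M is globally \let to \empty.

```tex
% Plain TeX sketch. \peekeol and \subpeekeol are hypothetical names,
% not part of conctex.tex.
\begingroup
\catcode`\^^M=\active %
\global\let^^M=\empty %  an active <return> expands to nothing
%
% Open a group, make <return> active, then look at the next token
% without absorbing it.
\gdef\peekeol{\begingroup\catcode`\^^M=\active %
  \futurelet\next\subpeekeol}%
%
% Report what was seen and close the group opened by \peekeol.
\gdef\subpeekeol{%
  \ifx\next^^M%
    \message{[end of line follows]}%
  \else %
    \message{[something else follows]}%
  \fi\endgroup}%
\endgroup

first line\peekeol
second line\peekeol{} and more
\bye
```

The first call should report an end of line and the second something else. Because \ifx compares meanings, any token that has the same meaning as the active ^^M (here, \empty) would also pass the test, so this sketch is looser than a check on the character itself.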

arguments instead; other cases should be put into a commentary or the \plain environment, where - has \catcode=12, e.g., *\vskip-12pt*.

If certain macros that use -, --, or --- are used frequently in the \trans environment, and you don’t want to put them in commentaries or the \plain environment, you can redefine them using a similar technique to that used in the definitions of - and \invisiblehyphen.

    \begingroup
    \catcode`\-=12
    \gdef\mymacro{\begingroup
    \catcode`\-=12
    \def\submymacro##1##2##3{%
    ...\endgroup}}
    \endgroup

The macro \putontop is defined like this. It doesn’t work if you try to redefine \vskip, though.

Homograph identifiers. Homographs are words that are spelled the same, but belong to different lemmata. They must be indicated explicitly in the input file. In order to do this, I have defined the characters “<” and “>” to delimit homograph-identifiers in the \trans environment. For instance, in Old Icelandic, the word “á” can be a noun, meaning “river” or a preposition, meaning “at, on,” etc. In the input file, I can distinguish between them by typing the preposition as {\'a}<p> and the noun “á” as plain {\'a} or as {\'a} with an identifier of its own. TEX handles them in different ways, depending on the value of \ifdraft. For rough drafts, it will be useful to see the homograph-identifiers, whereas they should not appear in the final draft.

To accomplish this, the \catcode of < has been changed to \active. The definition of < depends on the value of \ifdraft:

    \catcode`\<=\active

    \def\eatit#1{\ifx#1>\else
    \expandafter\eatit\fi}

    \def\donteatit{$\{$%
    \def\subdonteatit##1{%
    \ifx##1>$\}$\else##1%
    \expandafter\subdonteatit\fi}%
    \subdonteatit}

    \ifeathomoids
    \let<=\eatit
    \else
    \let<=\donteatit\fi

The macro \eatit simply reads an argument, tests to see if it’s “>” and if it isn’t, calls itself again. The arguments it reads are discarded, hence the name. The macro \donteatit functions in a similar way, but has the added complication that the “{” should only be put into the current list once, so the main work of not eating the string is taken over by \subdonteatit. When \draftfalse, < is \let to \eatit. When \drafttrue, the homograph-identifiers shouldn’t be eaten, so < is \let to \donteatit, which puts the text between < and > inside curly braces, e.g., “á{p}”.

Homograph identifiers can be as long as you want, because \eatit and \donteatit process the tokens (or groups) within the angle brackets one by one. This prevents a long argument from burdening TEX’s memory. When \eatit or \subdonteatit calls itself recursively within the conditional construction, the \expandafter allows the conditional, and hence the current invocation of \eatit or \subdonteatit, to complete execution before the recursive call is expanded. Therefore, only one invocation of \eatit or \subdonteatit is active at any time. This is called “tail recursion” (cf. The TEXbook, p. 219). It would also be possible to define \eatit and \donteatit like this:

    \def\eatit#1>{\relax}
    \def\donteatit#1>{$\{$#1$\}$}

In this case, extremely long homograph identifiers will burden TEX’s memory, but in practice, it would be just as good, since it’s unlikely that anyone would want to type in extremely long homograph identifiers.

The Lisp program conctex.lsp

The Lisp program conctex.lsp consists of several parts:

1. A lemmatization dictionary.
2. A parser that reads the input files, discards extraneous material, and passes the transcription lines on for further processing.
3. A routine for extracting information from the transcription lines, accessing Lisp symbols, and storing the information in data structures.
4. An output routine that writes the TEX file containing the concordance.

ConcTEX does not use the page numbering information generated by TEX’s output routine, so it doesn’t require a preliminary pass. The concordance shows the position of individual words in the original manuscript, not the page numbers of the printed transcription. This makes it possible

to generate the concordance directly from the TEX input file without running TEX at all.

Making a concordance using the page numbers generated by TEX’s page breaking routine is a more difficult matter. If the input files contain explicit commands for page breaking, i.e., \eject, \supereject, and/or macros that call one or the other of these macros, ConcTEX can be used with only trivial modifications. However, it will not work if TEX is supposed to break pages automatically. One way around this problem would be to let TEX break the pages automatically, and insert code for conctex.lsp in the final version, at the page breaks TEX found, so conctex.lsp knows to increment a page counter. This would not be entirely satisfactory, however.

If explicit page breaks are used, or code is entered in the input file indicating where pages are broken, then explicit line breaks (using \break and/or \par, or macros using one or the other of these commands) would make it possible to generate a concordance using line numbering information referring to the lines in the output. If TEX does the line and page breaking automatically, however, and page breaks are not explicitly indicated, then conctex.lsp won’t know when to increment the page counter and reset the line counter to 1. There is no way to generate line numbering information without explicitly breaking the lines.

It is therefore not possible to use ConcTEX for making a concordance using line and page numbers that result from TEX’s line and page breaking routines, i.e., without explicit line and page breaking commands. But it might be possible to design a concordance program that could do this by having it process the dvi file instead of the input file. However, if this sort of concordance were desired, a better approach would be to write a new version of TEX that incorporates the features of ConcTEX in the WEB program itself, thus making it unnecessary to use an auxiliary program. However, since conctex.lsp uses several features peculiar to Lisp, an entirely different program structure would be necessary.

Lemmatization. The most difficult problem in generating a concordance is lemmatization. There are three cases that need to be accounted for:

1. Homographs. Words that are spelled the same, but belong to different lemmata, like the word “spring”, which can be one of several nouns or a verb.
2. Subsidiary forms. Distinct words that belong to the same lemma, like “man”, “man’s” and “men” in English.
3. Variants. Words that are spelled differently, but are really the same, and should appear under the same heading, like “color” and “colour” in English.

A lemma in the context of generating a concordance is a set of one or more distinct words which belong together, like the various forms in a morphological paradigm, e.g., the nominative, genitive, dative, and accusative; singular and plural, of “drotning” (Engl. “queen”).

            Sg.          Pl.
    Nom.    drotning     drotningar
    Gen.    drotningar   drotninga
    Dat.    drotningo    drotningom
    Acc.    drotning     drotningar

One of these words is denoted the main form of the lemma; with nouns, it’s usually the nominative singular, with verbs, the infinitive, etc. The subsidiary forms should appear in the concordance below the main form, indented, with their occurrences, arranged according to grammatical criteria.

    drotning (2) 1va 12, 3rb 14.
        drotningo (2) 2rb 25, 12va 30–31.
        drotninga (1) 13ra 16.

Homographs must be distinguished in the concordance. Sometimes, they can belong to the same lemma, like the three forms of “drotning” that are spelled “drotningar”: the genitive singular and the nominative and accusative plural. Sometimes they can belong to different lemmata, like the noun “á” and the preposition “á” in Old Icelandic.

Lemmatization must also account for variants. Where there are variants, all of the occurrences are assigned to a single form, which appears in the concordance. One way variants can arise is from the use of interchangeable letter forms in a manuscript. The manuscript transcription might use a macro or a font change to indicate the use of an alternative letter. For example, the word “prestr” may sometimes appear in the transcription as “prestr” −→ “prestr” and sometimes as “prest{\r}” −→ “prestr”, depending on the form of “r” used in that passage in the manuscript, so “prestr” may be termed a variant of “prestr” (or vice versa). Only one of them, say “prestr”, should appear in the concordance, and the occurrences of “prestr” should be listed under “prestr”. Subsidiary forms can also have variants.

Lemmatization is the process of sorting individual words, so that each one is put into its proper

lemma. There are several possible solutions to this problem. The one described here is a lemmatization dictionary, which must account for all subsidiary forms and variants.

Running conctex.lsp without loading a lemmatization dictionary produces a complete list of all the words in the transcription, with their occurrences, but each entry is considered a main form, and the only variants accounted for are those which replace-items or process-line catch automatically. For a concordance this is clearly inadequate, but it might be useful for some other purpose.

If concordances are to be generated for normalized transcriptions, or for a group of manuscripts whose orthographic conventions are very similar, a master lemmatization dictionary can be used. In most cases, however, the orthography and the form of the language of a given manuscript will diverge enough from that of others that it will be necessary to create a special lemmatization dictionary for that particular manuscript.

One of the first things that conctex.lsp does is to load the lemmatization dictionary. This is one or more files of Lisp code with invocations of the functions generate-entry, harmless, and add-variants. These functions cause data to be stored in word structures.

The data for the entries in the concordance are stored in structures of type word, defined like this:

  (defstruct word
    main-form
    occurrences-mariu-a
    occurrences-mariu-s
    sort-string
    tex-string
    forms
    root
    variants
    )

More slots can be added for other applications; in particular, occurrences slots can be added for additional manuscripts.

The function generate-entry. The basic function for generating a lemmatization dictionary is generate-entry. It’s invoked like this:

  (generate-entry "David")

or like this:

  (generate-entry "David" noun)

The string, which is the first, required argument to generate-entry, is used, perhaps with some modifications, as the name of a symbol which is bound to a word-structure. The original string is stored in the main-form slot of the word structure. In this example, the string “David” is surrounded by ||, and the symbol is accessed with read-from-string.28

  (set (read-from-string "|David|")
       (make-word :main-form "David"))

Surrounding a string with || has the effect of escaping all the characters within the string. This makes it possible to have symbol names with lowercase letters and other characters, which are normally not allowed in symbol names in Lisp. Evaluating

  (read-from-string "David")

without || returns a symbol named DAVID, because Lisp converts lowercase to uppercase letters in symbol names unless they are escaped using \ or ||.

The second, optional argument to generate-entry, noun in the example above, is discarded by generate-entry. It is allowed for compatibility with another program that uses the same dictionary files for a different purpose.29

Alphabetization. The first argument to generate-entry always refers to the main form of a lemma. It is treated as a heading in the concordance. Lemmata are sorted in the concordance in alphabetical order of the main forms. ConcTEX includes a special sorting routine.30 It makes it possible to sort arbitrary special characters in a user-defined order by replacing ordinary characters and special character macros with 0 or more characters from Lisp’s code table, which is based on the ASCII code table. In its current form, it allows up to 256 positions; however, the actual number of special characters that can be sorted is much greater. This is because some characters occupy no position at all, others share a position with another character, and still others use the positions of more than one other character.

The string “David” is passed to the function generate-info, which passes each letter or special character coding to letter-function in order to generate a sort-string. Each letter or special character in the string is used to create a new string using characters from Lisp’s code table. The sort-string generated for “David” might be "^D^A^[^K^D", depending on the way the alphabetization routine is set up.31 The sort-string is put into the car of a cons cell, with the symbol bound to the word structure in the cdr.

  ("^D^A^[^K^D" . |David|)

This cons cell is then put into an association list (or alist) called var-alist. This alist will contain a cons cell for each of the word structures created by generate-entry in the lemmatization dictionary. If this is the lemmatization dictionary:

  (generate-entry "David")
  (generate-entry "eiga")
  (generate-entry "eptir")

var-alist will look like this:

  (("^D^A^[^K^D" . |David|)
   ("^F^K^H^A" . |eiga|)
   ("^F^U^Y^K^W" . |eptir|))

Formatting commands like \it, \bf, \sc, etc. are ignored when the sort-string is generated; otherwise, they would cause an error in letter-function.

Words in the lemmatization dictionary can contain special characters.

  (generate-entry "dyr{\\dh}ligr")

Note that two backslashes are necessary for the {\\dh} in generate-entry’s argument. The resulting symbol that is bound to the word structure is |dyr{dh}ligr| (Engl. “glorious”), and the backslashes disappear in the symbol name.

The braces surrounding special character macros are needed so generate-info knows where they begin and end. TEX’s parser is cleverer than generate-info. When TEX finds an escape character, it reads the following tokens and uses them to make the longest control sequence possible. The function generate-info, on the other hand, needs delimiters for control sequences within words so it can pass them as a whole to letter-function.

Subsidiary forms. The function generate-entry can also have keyword arguments.

  (generate-entry "dyr{\\dh}ligr"
                  :forms "dyr{\\dh}lig")

The :forms argument tells generate-entry that “dyrðlig” is a subsidiary form of “dyrðligr” (in fact, it’s the feminine singular nominative and neuter plural nominative and accusative form). The string "dyr{\\dh}lig" is used to access a symbol, |dyr{dh}lig|, which is bound to another word structure. The symbol |dyr{dh}lig| is added to the list in the forms slot of the word structure |dyr{dh}ligr|,32 and the symbol |dyr{dh}ligr| is put into the root slot of the word structure |dyr{dh}lig|. The function generate-info is used to generate a sort-string for |dyr{dh}lig|, and the cons cells

  ("^D^^^W^E^P^K^H^W" . |dyr{dh}ligr|)

and

  ("^D^^^W^E^P^K^H" . |dyr{dh}lig|)

are added to var-alist. All of the word structures are added to var-alist, not just the ones for main forms of lemmata.

The :forms argument to generate-entry needn’t be a single string. It can also be a list of strings.

  (generate-entry "dyr{\\dh}ligr"
                  :forms '("dyr{\\dh}ligs" "dyr{\\dh}ligum"
                           "dyr{\\dh}ligan"))

In this case, the subsidiary forms listed are the masculine singular genitive, dative, and accusative strong forms “dyrðligs”, “dyrðligum”, and “dyrðligan”. Each of the forms is used to access a symbol, which is bound to a new word structure. The symbol |dyr{dh}ligr| is put in the root slot of each of these word structures, and the symbols for the subsidiary forms are put into the list in the forms slot of |dyr{dh}ligr|.

  (|dyr{dh}ligs| |dyr{dh}ligum| |dyr{dh}ligan|)

The forms are stored in the order in which they appear in the invocation of generate-entry. They will appear in this order in the concordance. They are not sorted alphabetically, because subsidiary forms should be arranged in the concordance according to grammatical criteria.

Variants. Some words in a manuscript may differ from the standard form that should appear in the concordance. For example, the word for “king”

28 What generate-entry really does is a bit more complicated, but the effect is the same.

29 This other program is called LexTEX and I plan to document it in a subsequent article. It’s for generating dictionaries, vocabulary flashcards and similar things from files of Lisp code.

30 The alphabetization routine for ConcTEX is essentially identical to the one in my Spindex package. For a more complete discussion of the alphabetization routine, see my previous article “Spindex — Indexing with Special Characters”, TUGboat 18(4)/1998, pp. 255–273. Since ConcTEX, unlike Spindex, does not require a preliminary TEX run in order to generate page numbering information by means of TEX’s output routine, and therefore uses no \write commands, special character macros in the input file can be coded in the normal way (but with obligatory braces).

31 Each project will have its own requirements with respect to alphabetization, so this must be customized by the user. The documentation supplied with ConcTEX explains in detail how to do this. The string "^D^A^[^K^D" consists of non-printing characters in Lisp’s printed representation.

32 In Lisp, nil is the same as the empty list (), so it’s possible to add an element to the list nil.
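The way a dictionary entry becomes an escaped symbol, a word structure, and a cons cell in an alist can be tried out in isolation. The following is only a minimal sketch and not the code in conctex.lsp: the names toy-generate-entry, word-symbol, toy-sort-string and *var-alist* are hypothetical, the structure has fewer slots than the real one, and the sort-string is simply the lowercased letters of the word, an assumed stand-in for what generate-info and letter-function produce.

```lisp
;; Hypothetical sketch, not the code in conctex.lsp.
(defstruct word
  main-form sort-string forms root)

(defvar *var-alist* '()
  "Toy counterpart of var-alist: a list of (sort-string . symbol) pairs.")

(defun word-symbol (string)
  "Return the symbol named by STRING with its backslashes removed.
Wrapping the name in vertical bars keeps lowercase letters and braces."
  (values (read-from-string
           (concatenate 'string "|" (remove #\\ string) "|"))))

(defun toy-sort-string (string)
  "Assumed stand-in for generate-info: keep the letters, lowercased."
  (string-downcase (remove-if-not #'alpha-char-p string)))

(defun toy-generate-entry (string &key forms root)
  "Bind a symbol to a new word structure and record it in *var-alist*."
  (let ((symbol (word-symbol string))
        (sort-string (toy-sort-string string)))
    (set symbol (make-word :main-form string
                           :sort-string sort-string
                           :root root))
    (push (cons sort-string symbol) *var-alist*)
    ;; FORMS may be one string or a list of strings; keep the given order.
    (dolist (form (if (listp forms) forms (list forms)))
      (let ((form-symbol (toy-generate-entry form :root symbol)))
        (setf (word-forms (symbol-value symbol))
              (append (word-forms (symbol-value symbol))
                      (list form-symbol)))))
    symbol))

(toy-generate-entry "David")
(toy-generate-entry "dyr{\\dh}ligr"
                    :forms '("dyr{\\dh}ligs" "dyr{\\dh}ligum"))
```

After these two calls the toy alist holds four cons cells, one per word structure, and each subsidiary form points back to its main form through the root slot.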

in Old Icelandic is “konungr”, but sometimes the form “kongr” is used (cf. Modern Danish “konge”). If a manuscript uses both forms, “konungr” and “kongr”, but only “konungr” should appear in the concordance, generate-entry can be called using the :variants keyword argument.

  (generate-entry "konungr" :variants "kongr")

The string "kongr" is used to access a symbol, |kongr|, but this symbol is not bound to a word structure; instead, it’s bound to the symbol |konungr|, i.e.,

  (setq |kongr| '|konungr|)
  (symbol-value '|kongr|) → |konungr|

(Again, what generate-entry really does is more complicated, but this is the effect.) The occurrences of the word “kongr” in the transcription, if any, will appear under “konungr” in the concordance.

The :variants keyword argument to generate-entry may also be a list of strings.

  (generate-entry "konungr"
                  :variants '("kongr" "ko{\\n}gr"
                              "kvnvngr" "konongr"))

In this case, all of the strings in the list are used to access symbols which are bound to |konungr|.

The symbols that are made from the string or strings in the :forms keyword argument to generate-entry can also have variants.

  (generate-entry "eiga" :variants "ei{\\g}a"
                  :forms '("{\\'a}"
                           ("attv" :variants '("a{\\t}v" "atto"))))

Here, there are two forms, “á” and “attv”, and the latter has two variants, “a{\t}v” and “atto”. The :forms argument can be a list comprising a combination of simple strings and/or lists such as ("attv" :variants '("a{\\t}v" "atto")).

  (generate-entry "eiga" :variants "ei{\\g}a"
                  :forms '("eigir"
                           ("eigi" :variants "ei{\\g}i")
                           "eigom"
                           ("attv" :variants
                            '("a{\\t}v" "atto"))))

The function generate-entry takes lists like ("eigi" :variants "ei{\\g}i") in its :forms argument and conses33 (or puts) the symbol sub-generate-entry onto the front so it looks like this:

  (sub-generate-entry "eigi"
                      :variants "ei{\\g}i")

Then it appends the list

  (:root "eiga" :recursive t)

to the end of it, resulting in

  (sub-generate-entry "eigi"
                      :variants "ei{\\g}i"
                      :root "eiga" :recursive t)

and evaluates this list.34

The lists in generate-entry’s :forms argument can also have :forms arguments.

  (generate-entry "abc"
                  :forms '("def" :forms "ghi"))

This is meaningless in the context of a concordance, since nesting in grammatical paradigms only has two levels. It might be useful if variants were to appear in the concordance. Then they could be indented to the value of \parindent, and their forms could be indented to the value of \parindent\parindent. It might also be useful for some purpose other than a concordance. Changes to process-line and export-entry would be necessary to make deeper nesting work.

Homograph identifiers can be assigned to a word structure in two different ways: by using the <...> syntax, as in the input file,

  (generate-entry "afl<nsg>")

or by using the :homograph-identifier keyword argument.

  (generate-entry "afl"
                  :homograph-identifier "nsg")

The string “nsg” stands for “neuter, singular”. In the concordance, this word will appear as “afl [nsg]”. If you want “afl [n. sg.]” or “afl [neut. sing.]” to appear in the concordance instead, you can write

  (generate-entry "afl<n.~sg.>")

or

33 In Lisp, a list is a chain of cons cells. Cons cells contain two elements, the car and the cdr, in that order. In the list (a b c d), the first cons cell has the symbol a in its car and a pointer to the following cons cell in its cdr. This in turn has the symbol b in its car and a pointer to the next cons cell in its cdr. The last cons cell in the list has d in its car and nil in its cdr: (d . nil). So the list (a b c d) is really (a . (b . (c . (d . nil)))).

34 Since the original invocation of generate-entry called sub-generate-entry, as explained below, the latter is called recursively for subsidiary forms.
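The indirection used for variants, a symbol whose value is another symbol rather than a word structure, is easy to demonstrate on its own. The sketch below is an illustration under stated assumptions, not the author's implementation: toy-add-variants, lookup-word and word-symbol are hypothetical names, and the structure carries only the two slots the example needs.

```lisp
;; Hypothetical sketch, not the code in conctex.lsp.
(defstruct word
  main-form (variants '()))

(defun word-symbol (string)
  "Return the escaped symbol for STRING, with backslashes removed."
  (values (read-from-string
           (concatenate 'string "|" (remove #\\ string) "|"))))

(defun toy-add-variants (main &rest variants)
  "Bind MAIN to a word structure if necessary, then bind each variant
symbol to the symbol for MAIN."
  (let ((main-symbol (word-symbol main)))
    (unless (boundp main-symbol)
      (set main-symbol (make-word :main-form main)))
    (dolist (variant variants main-symbol)
      (let ((variant-symbol (word-symbol variant)))
        (set variant-symbol main-symbol)
        (push variant-symbol
              (word-variants (symbol-value main-symbol)))))))

(defun lookup-word (string)
  "Follow variant links from STRING to a word structure, or return NIL."
  (let ((symbol (word-symbol string)))
    (loop
      (let ((value (and (boundp symbol) (symbol-value symbol))))
        (cond ((word-p value) (return value))
              ((and value (symbolp value)) (setf symbol value))
              (t (return nil)))))))

(toy-add-variants "konungr" "kongr" "ko{\\n}gr")
```

With this in place, (lookup-word "kongr") returns the structure whose main form is "konungr", which is the effect described for occurrences of a variant in the transcription.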

  (generate-entry "afl"
                  :homograph-identifier "n.~sg.")

or

  (generate-entry "afl<neut.~sing.>")

or

  (generate-entry "afl"
                  :homograph-identifier "neut.~sing.")

instead. But “afl<nsg>” is more convenient to type in the input file than “afl<neut.~sing.>”, so it’s possible to use a cons cell as the :homograph-identifier argument to generate-entry.

  (generate-entry "afl"
                  :homograph-identifier
                  '("nsg" . "neut.~sing."))

This way, you can type “afl<nsg>” in the input file, but “afl [neut. sing.]” will appear in the concordance.

If you use both the <...> syntax and an explicit :homograph-identifier argument,

  (generate-entry "afl<...>"
                  :homograph-identifier
                  '("nsg" . "neut.~sing."))

the explicit :homograph-identifier argument takes precedence, and the homograph-identifier using the <...> syntax is discarded. It doesn’t matter whether it’s a simple string or a cons cell in the :homograph-identifier argument.

When you use a homograph-identifier, no matter how you create it, the name of the symbol bound to the word structure includes the homograph-identifier. For instance, in the last example, the word structure is bound to the symbol |afl<nsg>|, its main-form is "afl", its tex-string is "afl [{\\it neut.~sing.\\/}]", and its homograph-identifier (i.e., the object stored in its homograph-identifier slot) is "nsg".

When the main form of a lemma is a homograph and has subsidiary forms, the latter may or may not also need homograph-identifiers. In Old Icelandic, there are two words spelled “afl”, a masculine noun meaning “an ironsmith’s forge” and a neuter noun meaning “physical strength”. If generate-entry is called like this:

  (generate-entry "afl<nsg>"
                  :forms '("afls" "afli"))

the genitive and dative singular forms, “afls” and “afli”, may appear in the input file with no homograph-identifier, and they will appear in the concordance as subsidiary forms of the neuter noun “afl”. This is not very satisfactory, however, because the genitive and dative singular forms of the masculine “afl” are identical to the genitive and dative singular of the neuter “afl”. So it’s necessary to type in

  (generate-entry "afl<nsg>"
                  :forms '("afls<nsg>" "afli<nsg>"))

or

  (generate-entry "afl"
                  :homograph-identifier "nsg"
                  :forms '("afls<nsg>" "afli<nsg>"))

or

  (generate-entry "afl"
                  :homograph-identifier "nsg"
                  :forms '(("afls"
                            :homograph-identifier "nsg")
                           "afli<nsg>"))

or one of the other possible alternatives. Note that the various possibilities of formulating the arguments are always allowed.

If no homograph-identifier is indicated for a subsidiary form of a lemma whose main form has one, the subsidiary form may be typed in the input file with no homograph-identifier. However, generate-entry automatically creates a variant using the homograph-identifier of the main form.

  (generate-entry "afl<...>" :forms "aflar")

This results in a symbol |aflar| which is bound to a word structure, and a symbol |aflar<...>| which is bound to the symbol |aflar|. The word “aflar”, masculine plural nominative, is unambiguous and needs no homograph-identifier, since no other form of “afl” (masculine) or “afl” (neuter) is spelled the same way.

Another problem is that words in the same lemma sometimes have identical forms. The nominative and accusative, singular and plural of the neuter noun “afl” are all spelled “afl”. If it’s desirable to keep the morphological forms of lemmata separate in the concordance, it’s necessary to have different homograph-identifiers for the forms that are spelled the same.

  (generate-entry "afl<...>"
                  :forms '("afls" "afli"
                           "afl<nsga>" "afl<npln>"
                           "afl<npla>"))

Most of the time, the homograph-identifiers of subsidiary forms are not printed to the concordance. Since the homograph-identifier of the main form is printed to the concordance, indicating for instance that “afl” is a neuter noun, it’s not necessary to
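The rules for homograph-identifiers given above (an identifier typed between < and > after the word, an explicit :homograph-identifier that takes precedence, and a cons cell that separates the typed key from the printed text) can be sketched as follows. This is a minimal illustration, not the author's code: split-homograph, toy-tex-string and homograph-entry are hypothetical names, and the exact form of the symbol name is an assumption modeled on the example in the text.

```lisp
;; Hypothetical sketch, not the code in conctex.lsp.
(defun split-homograph (string)
  "Split \"afl<nsg>\" into \"afl\" and \"nsg\".
The second value is NIL when STRING has no <...> part."
  (let ((start (position #\< string))
        (end (position #\> string :from-end t)))
    (if (and start end (< start end))
        (values (subseq string 0 start)
                (subseq string (1+ start) end))
        (values string nil))))

(defun toy-tex-string (main-form identifier)
  "IDENTIFIER is NIL, a string, or a cons (typed-key . printed-text)."
  (let ((printed (if (consp identifier) (cdr identifier) identifier)))
    (if printed
        (format nil "~A [{\\it ~A\\/}]" main-form printed)
        main-form)))

(defun homograph-entry (string &key homograph-identifier)
  "Return a property list describing the entry for STRING.
An explicit HOMOGRAPH-IDENTIFIER overrides one typed with <...>."
  (multiple-value-bind (main-form typed) (split-homograph string)
    (let* ((identifier (or homograph-identifier typed))
           (key (if (consp identifier) (car identifier) identifier)))
      (list :main-form main-form
            ;; Assumed naming scheme: main form plus <key>.
            :symbol-name (if key
                             (format nil "~A<~A>" main-form key)
                             main-form)
            :tex-string (toy-tex-string main-form identifier)))))
```

Calling (homograph-entry "afl" :homograph-identifier '("nsg" . "neut.~sing.")) yields a tex-string with the printed text in italics inside brackets, while the key "nsg" is what ends up in the symbol name.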

indicate this fact for the subsidiary forms that appear below it. Sometimes, though, it’s desirable to print something. The homograph-identifiers for subsidiary forms are printed to the concordance only when an explicit :homograph-identifier argument is used, not when the <...> syntax is used. The homograph-identifiers for main forms of lemmata are always printed to the concordance.

  (generate-entry "afl<...>"
                  :forms '("afls" "afli"
                           '("afl"
                             :homograph-identifier
                             '("nsga" . "sg.~acc."))
                           '("afl"
                             :homograph-identifier
                             '("npln" . "pl.~nom."))
                           '("afl"
                             :homograph-identifier
                             '("npla" . "pl.~acc."))))

⇒

  afl [neut.]
    afls
    afli
    afl [sg. acc.]
    afl [pl. nom.]
    afl [pl. acc.]

Note that not all the information that’s necessary for the homograph-identifier in the input file must be printed. The string “afl<npln>” in the input file indicates this form of “afl” unambiguously, i.e., neuter, plural, nominative. But only “pl. nom.” need be printed to the concordance, because it’s obvious it’s the neuter noun.

The tex-string is what is written to the concordance. It is usually determined by a combination of generate-entry’s first argument and the homograph-identifier, if any. It is possible to override this by using the :tex-string keyword argument to generate-entry.

  (generate-entry "abc" :tex-string "xyz")

This causes a symbol |abc| to be created and bound to a word structure, but “xyz” will be printed to the concordance. Occurrences of “abc” in the input file will appear under the heading “xyz”, and the sort-string will be ^X^Y^Z (depending on the values assigned by letter-function), i.e., based on the string “xyz” and not on “abc”.

A tex-string can also be set by using a cons cell for the first argument to generate-entry.

  (generate-entry '("eiga" . "agie"))

is equivalent to

  (generate-entry "eiga" :tex-string "agie")

However, if you type

  (generate-entry '("eiga" . "igea")
                  :tex-string "agie")

the explicit :tex-string keyword argument takes precedence over the tex-string in the cdr of the cons cell, which is discarded. I’m not sure whether the tex-string feature will prove to be useful, but it’s available if needed.

The function generate-entry allows the use of other keyword arguments. It doesn’t matter what they are; generate-entry simply ignores them. This is for compatibility with the program LexTEX, which uses the lemmatization dictionary files for another purpose (see page 389), or for other programs that might be written. For instance,

  (generate-entry "nema" verb
                  :forms '("nam" "n{\\ohook}mum"
                           '("nvmenn" :variants "nomenn"))
                  :class 'strong-4 :definition-english "take")

contains the keyword arguments :class and :definition-english, and their values. They are irrelevant to the function generate-entry in ConcTEX, which ignores them, but are used by the function called generate-entry in LexTEX.

ConcTEX could be extended to do more than just produce a concordance. One possibility is to sort the words into a file of TEX code according to grammatical criteria. For instance, the words could be written to an output file nouns first, then adjectives, pronouns, verbs, adverbs, etc. Within these categories, the words could be sorted according to class. In order to accomplish this, more information must be entered in the lemmatization dictionary, i.e., generate-entry must be modified to process additional arguments. Alternatively, the homograph-identifiers could be used for this purpose.

Actually, generate-entry isn’t really a function at all. It’s a macro, defined with defmacro. One reason for this is that a macro doesn’t evaluate the arguments that are passed to it. If generate-entry were defined like this,

  (defun generate-entry
      (main-form &optional word-type forms
       ...) ...)

then this would cause an error:

  (generate-entry "{\ae}tt" noun
                  :forms "{\ae}ttar")

because Lisp would try to evaluate the symbol noun, which is unbound. Since generate-entry is a macro,
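The point about macros and unevaluated arguments can be checked with a small experiment. The following is a sketch under stated assumptions rather than the real definitions: toy-generate-entry and toy-sub-generate-entry are hypothetical names, and the rule used here for recognizing the optional word type (a non-keyword symbol in second position) is an assumption made for the example.

```lisp
;; Hypothetical sketch, not the code in conctex.lsp.
(defun toy-sub-generate-entry (main-form
                               &key forms variants &allow-other-keys)
  "The function that does the work; unknown keywords are ignored."
  (list main-form :forms forms :variants variants))

(defmacro toy-generate-entry (&rest args)
  "Discard an unquoted word-type symbol such as NOUN, then pass the
remaining arguments on to TOY-SUB-GENERATE-ENTRY."
  (let ((main-form (first args))
        (rest-args (rest args)))
    ;; A non-keyword symbol right after the main form is taken to be
    ;; the word type; it is dropped before anything is evaluated.
    (when (and rest-args
               (symbolp (first rest-args))
               (not (keywordp (first rest-args))))
      (pop rest-args))
    `(toy-sub-generate-entry ,main-form ,@rest-args)))
```

Because the macro sees its arguments unevaluated, a call such as (toy-generate-entry "{\\ae}tt" noun :forms "{\\ae}ttar") never tries to look up a value for noun. The same call to an ordinary function would signal an unbound-variable error.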

  (defmacro generate-entry (&rest arg) ...)

it’s possible to manipulate its arguments before they’re passed to the real function, which is called sub-generate-entry. Another reason for defining generate-entry as a macro is to make it easier to extend ConcTEX, or change the way it behaves. The arguments can be manipulated and passed to other functions, instead of or in addition to sub-generate-entry.

The functions harmless and add-variants are two other functions that can be used in the lemmatization dictionary. These are merely convenience functions; you can accomplish the same results using generate-entry.

The function harmless is for cases where none of generate-entry’s keyword arguments are needed. Some words, like prepositions and conjunctions (in the Germanic languages, at least) have no inflected forms, and some words may have no variants in a particular manuscript. (So far, I’ve never needed a tex-string.)

  (harmless "{\\'a}<...>" "af"
            "{\\'\\i}" "ok" "yfir")

is equivalent to

  (generate-entry "{\\'a}<...>")
  (generate-entry "af")
  (generate-entry "{\\'\\i}")
  (generate-entry "ok")
  (generate-entry "yfir")

The function harmless takes one or more strings as its arguments, and calls (generate-entry ⟨string⟩) for each of the strings. Note that a homograph-identifier is permitted, using the <...> syntax, as for {\\'a}<...>.

Calling generate-entry with a lot of forms and variants can be confusing,

  (generate-entry "dyr{\\dh}ligr" adjective
                  :forms '(("dyr{\\dh}lig"
                            :variants "dyrdlig")
                           ("dyr{\\dh}ligi" :variants "dyrdligi")
                           ))

so the function add-variants can be used instead. This is equivalent to the previous example:

  (generate-entry "dyr{\\dh}ligr" adjective
                  :forms '("dyr{\\dh}lig" "dyr{\\dh}ligi"))

  (add-variants "dyr{\\dh}lig" "dyrdlig")
  (add-variants "dyr{\\dh}ligi" "dyrdligi")

All of the arguments to add-variants are strings. The first is used to access a symbol. This symbol might be bound to a word structure, which would be fine. If, however, it’s unbound, then a word structure is created, using the string as the only argument to generate-entry. Otherwise, if it’s bound and not a word structure, add-variants issues a warning, and exits. Assuming the symbol is now a word structure, the other strings are used to access symbols which are set to the symbol derived from the first string, just as generate-entry sets variants (as described on page 390f).35

The order of the subsidiary forms is important. It will usually not be desirable to have them sorted in alphabetical order in the concordance, although it’s possible; usually, the order of the subsidiary forms will be determined by the conventional order of the forms within the paradigm, e.g., singular nominative, genitive, dative, accusative, then plural nominative, genitive, dative, accusative for nouns in Old Icelandic. If you prefer the order nominative, accusative, genitive, dative, just type the forms in this order.

The parser. After the lemmatization dictionary has been loaded, conctex.lsp opens the input file and the function main reads it line-by-line. Entirely blank lines, and lines that begin with % or \ are discarded (except for lines beginning with \lineno, \putontop, \overstroke, or other exceptions, if any). Lines that begin with @ are Lisp code; the @ is discarded, and the rest of the line, which must be a balanced expression (sexp), is evaluated. Other lines are text lines, and passed to the function process-line. This function strips off leading and trailing blanks, discards line numbers and punctuation from the beginning of the line, discards certain macros and perhaps some or all of their arguments, discards unattached math mode material, and replaces some strings with others. If a @ or % appears in the middle of a line, process-line discards it together with the rest of the line. It also tests to see whether the end of the input line corresponds to the end of a transcript line. If the input line ends with \\, -, or \-, or if the following line is blank, process-line treats the end of the current line as the end of a transcript line and increments the line counter (called line-counter) when it’s done with the current line. (Advancing the leaf, side or column requires an explicit call to set-position, as does advancing more than one line at a time, which is sometimes necessary, as when part of a leaf is missing or unreadable.) Then process-line passes the line to the function read-word. This function strips characters off the line until it has stripped off a whole word. Then it returns the word, current-word, and the rest of the line to process-line. Now the routine for access and data storage takes over.

Access and data storage. The function read-word returns current-word as a string. The function process-line appends | to the beginning and end of the string, so that, for example, "{\\ae}tt" becomes "|{\\ae}tt|".36 Now the string is used to access a symbol (or variable) using read-from-string, and the symbol current-var is set to this symbol, e.g., |{ae}tt|. (In C, you’d say “current-var is a pointer to |{ae}tt|”.)

This is one reason why unmotivated braces are not permitted in the input file. The braces are needed to delimit the special characters for generate-info and to distinguish between special character codings and ordinary characters in the symbol names.37 The vertical strokes surrounding the symbol name act to escape all of the other characters, which causes \ to disappear when the string "|{\\ae}tt|" is converted to the symbol |{ae}tt|. If the input file contained the string “{\ae}tt” → “ætt” and “{ae}tt” → “aett” (using “ae” instead of the ligature), and unmotivated braces were permitted, they would both map to the same symbol, namely {ae}tt, and they would not be kept separate in the concordance.

Now process-line checks to see if the value of current-var (the symbol |{ae}tt|, in our example) is already bound. If it is, this means either that the word “ætt” is accounted for in the lemmatization dictionary, or that it’s occurred previously in the input file. If |{ae}tt| is unbound, process-line checks whether the symbol |{Ae}tt| is bound. If it is, current-var is set to |{Ae}tt|; otherwise process-line checks whether the symbol {AE}TT is bound.38 If it is, current-var is set to {AE}TT.39

Actually, the functions string-downcase, string-capitalize and string-upcase are applied to (symbol-name current-var). The symbol current-var is evaluated once to get the symbol |{ae}tt|. If I wanted to examine the symbol-name of |{ae}tt| directly, I would have to type (symbol-name '|{ae}tt|) to prevent evaluation of |{ae}tt|. Note that applying string-capitalize to a symbol name beginning with a coding like {ae} fails, because the capitalized string would need to be {AE}. It is, of course, possible to have \Ae expand to “Æ” in TEX, so that {Ae} would yield the correct result, but it seems hardly worthwhile, since words should be accounted for in the lemmatization dictionary anyway.

Unbound symbols. If |{ae}tt|, |{Ae}tt|, and {AE}TT are unbound, a word structure is created, and |{ae}tt| is bound to it. A sort-string is generated and stored in |{ae}tt|’s sort-string slot, and a tex-string is stored in its tex-string slot, just as for word structures created by the lemmatization dictionary. If the word read in from the file includes the symbols < and >, the string they enclose is considered to be the homograph-identifier, which is stored in the homograph-identifier slot, and affects the form of the tex-string.

After a word structure has been created for the symbol |{ae}tt|, the symbol itself is put into the cdr of a cons cell, while its sort-string is put into the car, e.g.,

  ("^A^F^X^X" . |{ae}tt|)

This cons cell is now appended to the association list (or alist) word-alist, not var-alist, as are the word structures from the lemmatization dictionary. The alist word-alist is used to store all of the entries that are main forms of lemmata, i.e., not subsidiary forms or variants, that actually occur in the transcription. They will appear as top-level entries in the concordance. Whenever a word in the input file maps to a symbol that is unbound, i.e., if it’s the first occurrence in the input file and not in the lemmatization dictionary, this word is considered to be the main form of a lemma, so all lemmata with subsidiary forms and most variants must be accounted for in the lemmatization dictionary. Variants that differ only in the use of capitalization and upper- and lowercase letters, or where certain special character macros are replaced, are handled automatically (as above).

Positions and occurrences. The word structure has an occurrences slot, where manuscript positions are stored. It might even have multiple

35 Actually, generate-entry uses add-variants to set variants.

36 Backslashes in strings read in from a file are automatically escaped.

37 By uncommenting two lines in letter-function, it’s possible to allow unmotivated braces. I believe it’s safer to make them signal an error, because it’s a good way of discovering mistakes in the input file. Any cases where braces are required can be accounted for in conctex.lsp.

38 No vertical strokes are needed around {AE}TT because {, }, and capital letters don’t need to be escaped.

39 Manuscripts are often orthographically inconsistent, and transcriptions often use uppercase and lowercase letters to represent different letter forms, although the original letters do not really correspond to our upper- and lowercase. If the editor wants a particular word to be represented in the concordance in lowercase, capitalized or all capitals, and the first occurrence of this word in the transcription does not have the desired form, then the word must be entered in this form into the lemmatization dictionary. Of course, capitalized, upper- and lowercase strings which map to different word structures are also possible, if they’re bound in the lemmatization dictionary.
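The first stage of the parser described above, deciding what to do with each input line and whether it closes a transcript line, can be sketched as two small predicates. This is an illustration only and not the code of main or process-line: classify-line, ends-transcript-line-p, starts-with-p, ends-with-p and *kept-macros* are hypothetical names, and testing the very first character of the untrimmed line is an assumption about what "begin with" means.

```lisp
;; Hypothetical sketch, not the code in conctex.lsp.
(defparameter *kept-macros* '("\\lineno" "\\putontop" "\\overstroke")
  "Macros that may start a text line even though they begin with \\.")

(defun starts-with-p (prefix line)
  "True if LINE begins with PREFIX."
  (let ((n (length prefix)))
    (and (>= (length line) n)
         (string= prefix line :end2 n))))

(defun ends-with-p (suffix line)
  "True if LINE ends with SUFFIX."
  (let ((start (- (length line) (length suffix))))
    (and (>= start 0)
         (string= suffix line :start2 start))))

(defun classify-line (line)
  "Return :SKIP, :LISP or :TEXT for one line of the input file."
  (cond ((string= (string-trim '(#\Space #\Tab) line) "") :skip)
        ((some (lambda (macro) (starts-with-p macro line))
               *kept-macros*)
         :text)
        ((member (char line 0) '(#\% #\\)) :skip)
        ((char= (char line 0) #\@) :lisp)
        (t :text)))

(defun ends-transcript-line-p (line next-line)
  "True if LINE ends with \\\\ or - (which covers \\-), or if
NEXT-LINE is blank."
  (let ((trimmed (string-right-trim '(#\Space #\Tab) line)))
    (or (string= (string-trim '(#\Space #\Tab) next-line) "")
        (ends-with-p "\\\\" trimmed)
        (ends-with-p "-" trimmed))))
```

A driver loop would call classify-line on each line, evaluate the rest of a :lisp line after dropping the @, and advance its line counter whenever ends-transcript-line-p is true for a :text line.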

occurrences slots, if ConcTEX is being used for more than one manuscript. When a word structure is created in the lemmatization dictionary, there are no occurrences to be stored yet. When an entry is created for a word in the input file, the position-string is stored as a string in a list in the proper occurrences slot, e.g., ("28va~7"), using access-occurrences.

The current position in the transcription (and the manuscript) will have been generated by the function set-position, which is used in the input file, and process-line’s line counting routine. In addition, process-line can tell whether a particular word is the first or last word in the transcription line. It can sometimes be useful to indicate that a word is at the beginning or end of a line in a manuscript, because the use of unusual forms is often explainable because of the position. For instance, words at the beginning of lines may have fancy initials, and words at the end may be abbreviated in order that they may be squeezed into the remaining space on the line. If a word occurs at the beginning or end of a line, this may be indicated in the concordance, if the book designer wishes. For example, if this is a line at leaf 1, verso, column a:

  \lineno{15} seg_{ir} e{\n} gaufgi
  ken_{n}i~{\m}_{a{\dh}r} [ok enn]\\

⇒

  15: segir en gaufgi kenni maðr [ok enn]

then one of the positions for “segir” will be “1va 15b” and one of the positions for “enn” will be “1va 15e”. The program conctex.lsp doesn’t print $^b$ and $^e$ explicitly to the concordance file. Instead, it uses the token lists \linebeginstring and \lineendstring,

  \newtoks\linebeginstring
  \newtoks\lineendstring
  \linebeginstring={$^b$}
  \lineendstring={$^e$}

and prints \the\linebeginstring and \the\lineendstring to the concordance file. To suppress “b” and “e” in the concordance, all that’s necessary is to redefine the token lists \linebeginstring and \lineendstring.

  \linebeginstring={}
  \lineendstring={}

This feature will probably be turned off, because it is not customary for concordances to indicate the position of words in a line.

A concordance should only be generated for one manuscript at a time. This is merely the way I’ve set things up; it’s not hard-wired into the program. The definition of the Lisp macro access-occurrences depends on the value of the variable manuscript. It is used to access the appropriate occurrences slot for setf.40

Multiple occurrence slots are useful if you want to use the information generated by conctex.lsp in another program. The word structures can be written to a file using Lisp’s read syntax. Another Lisp program can load this file, provided structures of type word are defined, and the data produced by the last run of conctex.lsp will be available. This could be useful for dictionaries or linguistic studies involving multiple sources.

Bound symbols. If the symbol pointed to by current-var (|{ae}tt| in our example) is bound, this means it was either bound in the lemmatization dictionary, or that the word “ætt” has already occurred in the input file, or both.41 If it has already occurred in the input file, and it’s a main form, word-alist already contains the cons cell ("^A^F^X^X" . |{ae}tt|), but if it’s only bound because it appears in the lemmatization dictionary, or it’s a subsidiary form or a variant, this cons cell will not be in word-alist. So, process-line checks to see whether a cons cell containing |{ae}tt| is in word-alist, using:

  (rassoc current-var word-alist)

which searches for a cons cell in word-alist using the cdr as the search key. If it’s there, it means that “ætt” has already occurred in the input file.

The situation is more complicated if the symbol is not already in word-alist. There are three possibilities: it could be the main form of a lemma, a subsidiary form, or a variant.

- If the symbol is a word-structure, it is either a main form or a subsidiary form.
  - If the root slot of the word-structure is nil, then it’s a main form. In this case, the cons cell containing the symbol and its sort-string is copied from var-alist into word-alist.
  - Else, if the root slot of the word structure is non-nil, the symbol is a subsidiary form. In this case, its root slot is accessed in order to get its main form, which is another symbol bound to a word-structure. If the cons cell containing

40 It’s worth looking at the way access-occurrences is defined, because it illustrates a limitation of Lisp’s setf access function.

41 Or there’s been a terrible mistake. It is unlikely that a symbol derived from a word in the transcription would conflict with a symbol that’s already defined by conctex.lsp or the Lisp interpreter itself. However, if this problem arises, it would be easy enough to write safety routines to catch the problem symbols.

this symbol and its sort string is not already in word-alist, it's copied to it from var-alist.
• If the symbol evaluates to another symbol, it's a variant. In this case, the symbol is replaced by this other symbol, and this process is repeated for the latter.

The Lisp macro access-occurrences then appends the position-string for the new position, e.g., "28ra~15", to the appropriate occurrences slot in the appropriate word-structure.

Exporting the concordance. After every line in the input file has been processed, the loop in main returns nil. If you are generating a concordance using multiple input files, the first input file is closed, the next one is opened and processed in the same way. When all of the input files have been processed and closed, the alists word-alist and var-alist are sorted.

    (setq word-alist
          (sort word-alist #'string< :key #'car))
    (setq var-alist
          (sort var-alist #'string< :key #'car))

This puts the cons cells in word-alist and var-alist in alphabetical order according to their cars, i.e., their sort-strings. Now the sort-strings are no longer needed, so new lists are generated by popping off the cons cells and putting the cdrs, i.e., the symbols, into new lists, word-list and var-list. Only main forms of lemmata are members of word-list, and only of those lemmata in which at least one form (which can be a variant) occurs in the transcription. Other main forms may be bound from the lemmatization dictionary, but they will not be in word-list. The list word-list is now passed to the function export-words, which writes the TEX file for the concordance. Exactly what export-words writes is a matter for the editor and the book designer, so this part of ConcTEX can be changed easily.

The function export-words pops the symbols off of the front of word-list one-by-one. The tex-string is written to the concordance file (here called concordance.tex, but any name within reason can be chosen), not indented. If there are no occurrences of the main form, the tex-string is enclosed in parentheses. If there are occurrences, the number of occurrences is printed after the tex-string, enclosed in parentheses, followed by the occurrences. Subsidiary forms are printed to the concordance if and only if there are occurrences for them.

If there are no occurrences and no subsidiary forms with occurrences, something is terribly wrong and conctex.lsp will signal an error. The way conctex.lsp is set up, occurrences at the beginning and end of a line are indicated with a superscript, e.g., “27ra 31b” and “34vb 12e”. On the other hand, if a word occurs multiple times within a line, not at the beginning or end and not broken, the position string is only printed out once, and the number of occurrences is indicated within parentheses, e.g., “51ra 5 (2)”. If the word is broken, what is printed depends on whether it's broken across a line, column, side or leaf. In the extremely unlikely event that the word occurs at the beginning or end of a line, or broken, and multiple times within the line, the different cases are listed separately, e.g., “11ra 5b, 11ra 5 (2), 11ra 5–6”.

When a symbol is popped off of word-list, export-words checks whether forms are present in the word structure's forms slot. The forms should be in the order in which they should be listed in the concordance. The function export-words accesses the symbol for each form, and checks the appropriate occurrences slot. If it's non-nil, the tex-string for that form is written to concordance.tex, with its occurrences, as above, but indented. If a form has no occurrences, nothing is printed to concordance.tex. The Lisp program currently does not print out variants in any way. This would require some additional programming, but it is possible.

The program conctex.lsp can be extended to extract grammatical information from manuscript transcriptions. One way of doing this would be to use standardized homograph-identifiers. Another would be to add one or more slots to the word structure. Then, export-words could be modified to write files for the various grammatical categories, which could then be concatenated. After word-list has been generated, word-alist and var-alist are no longer needed, and var-list is never needed by conctex.lsp. They are kept or generated only for debugging purposes.

Other data. When conctex.lsp is run, it generates some data in addition to the concordance. It counts the number of lines and words in each input file, the total number of lines, the total number of words, the total number of distinct words and the total number of lemmata. It prints this information to the TEX file count_info.tex.

The number of lines in an input file is simply the value of line-counter at the end of the file. The number of words in an input file is the number of times read-word returns a non-empty string while that file is being read. The total number of distinct words is the number of times export-occurrences

is called and the total number of lemmata is the length of word-list.

Hints on using ConcTEX. Trying to generate a concordance from an entire manuscript transcrip-

If you use UNIX, you can ensure that the concordance is always up-to-date by using a shell script.

    # This runs the concordance program.
    gcl
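The lookup and sorting steps described earlier for word-alist can be illustrated with a small, self-contained Common Lisp sketch. It is not taken from conctex.lsp: the variable *demo-word-alist*, the helper functions seen-p and sorted-symbols, and the one-letter sort-strings are all hypothetical and chosen only for illustration. Only rassoc, sort with :key, and the (sort-string . symbol) layout of the cons cells follow the description in the text.

```lisp
;;; Hypothetical miniature of the word-alist bookkeeping.
;;; Each cons cell is (sort-string . symbol); the sort-strings here
;;; are invented placeholders, not real ConcTEX sort-strings.
(defparameter *demo-word-alist*
  (list (cons "c" '|ok|)
        (cons "a" '|{ae}tt|)
        (cons "b" '|at|)))

(defun seen-p (symbol alist)
  "Return the cons cell whose cdr is SYMBOL, or NIL if there is none."
  ;; RASSOC searches an alist using the cdr as the search key.
  (rassoc symbol alist))

(defun sorted-symbols (alist)
  "Sort a copy of ALIST by its sort-strings and return only the symbols."
  ;; SORT is destructive, so work on a copy of the list structure.
  (mapcar #'cdr (sort (copy-list alist) #'string< :key #'car)))

(print (seen-p '|{ae}tt| *demo-word-alist*)) ; a cons cell: already seen
(print (seen-p '|enn| *demo-word-alist*))    ; NIL: not yet in the alist
(print (sorted-symbols *demo-word-alist*))   ; symbols in sort-string order
```

Taking the cdrs after sorting corresponds to the step in which the sort-strings are discarded and only the symbols are kept in word-list.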

    1–7: Drotning
    8: himins ok iarðar. sæl ok dyr
    9: ðlig mør Maria. moðir drottins

Since process-line ignores all lines within a commentary, the complex construction for the word “Drotning” isn't read by process-line, so this occurrence needs to be set explicitly.

    @ (add-occurrences |drotning| "1va~1–7")

The function add-occurrences uses access-occurrences to access the appropriate occurrences slot of the word structure bound to the symbol in its first argument, and appends its second argument, a string, to the list in the occurrences slot. It can be any string, so access-occurrences can be used to make non-standard occurrences by hand. An occurrence like “1va 1–7” cannot be created by conctex.lsp's ordinary routines.

Final remarks

ConcTEX demonstrates the power of Lisp and TEX in combination. Designing and typesetting a manuscript transcription, usually as part of a facsimile edition, is a challenge under the best of circumstances, and generating a concordance is always a time-consuming task. I do not promise miracles with ConcTEX, but I do believe that it can make both of these tasks easier, and indeed possible for non-professionals.

ConcTEX's most significant advantages are:
1. It makes it possible to take advantage of the typographic capabilities of TEX and METAFONT.
2. It uses the very same file for typesetting and generating the concordance.
3. It performs alphabetical sorting on arbitrary special characters.

ConcTEX is designed to be extendable. It would be possible to adapt it for use with other languages. For languages that are written left-to-right, it will only be necessary to cope with the usual difficulties with fonts and character encoding. For right-to-left text or a mixture of left-to-right and right-to-left text, the difficulties are greater, but it should be possible to solve them. Many of its features are of general utility and could be used for other kinds of programs that extract data from TEX input files.

However, there is one significant problem from the point of view of book design. The virtues of Computer Modern and its offspring notwithstanding, there simply aren't enough METAFONT fonts in the public domain suitable for use in fine printed books. Understandably, most of the other fonts available either contain special symbols or alphabets for non-Western languages, and are generally designed to be compatible with Computer Modern. But even if the Computer Modern fonts were the most beautiful fonts in the world, not every book should be printed in them. I admire Knuth's accomplishment and I like Computer Modern, but Monotype Modern 8A, on which it is based, is not universally admired.42 It is possible to use PostScript fonts with TEX, but it's inconvenient, and although they are well-designed, they have the serious defect that most sizes are produced by simple magnification or reduction. I consider it an important desideratum that more METAFONT fonts are created and made available, but I don't see how this goal can be accomplished by amateur programmers writing free software. But until such fonts are available, books typeset without the financial and technical backing of a publisher will continue to suffer from a poverty of fonts.

42 Williamson, in reference to the English Monotype Corporation's Modern Extended series 7: “Not particularly distinguished in letter form, the face has become familiar to readers of scientific works; for some years, this was one of the few series equipped with a full range of mathematical and other special sorts.” (p. 130)

I am far from being an authority on the subject, but I doubt very much that any photolithographic printing technique will ever be able to equal the quality of impression of lead type. My dream is to use METAFONT for designing lead type for use on a TEX-driven, Linotype-like linecasting machine. I say a linecasting machine only because I think that it would be easier to implement glue on a linecasting machine. It might, however, be an interesting exercise to design a TEX-driven Monotype-like typesetting machine, or even a typesetting machine based on a different principle, such as casting an entire page at once. Perhaps with such machines, we can finally equal and perhaps even surpass the great typographic achievements of the past.

Sample texts and concordances

Concordances become long very quickly, so this section contains three mini-concordances and the texts from which they were generated.

The first text is the beginning of the transcription of Holm perg 11 4o, prepared by Dr. Wilhelm Heizmann. It illustrates the use of line footnotes. These are footnotes which refer to lines in the

transcription. Unlike ordinary footnotes, there is no indication in the running text, such as an asterisk, a dagger, or a superscripted numeral, but the line number or numbers appear in the footnote where the footnote indicator usually goes.

Note that a different definition of \ustroke is used here than in the foregoing article. The emendations are in italics and underlined, but not enclosed in brackets.

Holm perg 11 4o

1va
prologus
1–7: Drotning
8: himins ok iarðar. sæl ok dyr
9: ðlig mør Maria. moðir drottins
10: Iesus Xristz. blomi hreinlifis.
11: herbergi heilags anda. øllvm
12: helgum mønnum æðri. helgari
13: ok haleitari. er komin at kyn
14: ferðj af kongligri ætt eptir þvi sem
15: segir en gaufgi kenni maðr [ok enn]

1–7 Drotning, bis auf das initiale D sind die Buchstaben untereinander angeordnet.
8/9 dyrðlig, dafür in Ausg. zumeist dýrlig.
10 Iesus, 233 Jesu, Ausg. für den Genitiv immer Jesu; Xristz, 233 cristi, Ausg. für Xrist- fast immer Crist-, einige Male auch Krist-.

Concordance to Holm perg 11 4o

æðri (1): 1va 12.
ætt (1): 1va 14.
af (1): 1va 14.
(allr)
øllvm (1): 1va 11e.
(andi)
anda (1): 1va 11.
at (1): 1va 13.
blomi (1): 1va 10.
drotning (1): 1va 1–7.
(drottinn)
drottins (1): 1va 9e.
(dyrðligr)
dyrðlig (1): 1va 8–9.
enn (2): 1va 15, 15e.
eptir (1): 1va 14.
er (1): 1va 13.
(gøfugr)
gaufgi (1): 1va 15.
(haleitr)
haleitari (1): 1va 13.
(heilagr)
heilags (1): 1va 11.
helgari (1): 1va 12e.
helgum (1): 1va 12b.
herbergi (1): 1va 11.
(himinn)
himins (1): 1va 8.
(hreinlifi)
hreinlifis (1): 1va 10e.
Iesus (1): 1va 10b.
(iǫrð)
iarðar (1): 1va 8.
kennimaðr (1): 1va 15.
(koma)
komin (1): 1va 13.
(kongligr)
kongligri (1): 1va 14.
(kynferð)
kynferðj (1): 1va 13–14.
(maðr)
mønnum (1): 1va 12.
Maria (1): 1va 9.
moðir (1): 1va 9.
mør (1): 1va 9.
ok (4): 1va 8 (2), 13b, 15.
(sæll)
sæl (1): 1va 8.
(segja)
segir (1): 1va 15b.
sem (1): 1va 14e.
(Xrist)
Xristz (1): 1va 10.
þvi (1): 1va 14.

The Bible

Genesis

1. In the beginning God created the heavens and the earth. 2. The earth was without form and void, and darkness was upon the face of the deep; and the Spirit of God was moving over the face of the waters. 3. And God said, “Let there be light”; and there was light. 4. And God saw that the light was good; and God separated the light from the darkness. 5. God called the light Day, and the darkness he called Night. And there was evening and there was morning, one day.

Concordance to the Bible

and (11): Gen. 1:1, 2 (3), 3 (2), 4 (2), 5 (3).
be (1): Gen. 1:3.
was (7): Gen. 1:2 (3), 3, 4, 5 (2).
beginning (1): Gen. 1:1.
(call)
called (2): Gen. 1:5 (2).
(create)
created (1): Gen. 1:1.
darkness (3): Gen. 1:2, 4, 5.
day (2): Gen. 1:5 (2).
deep (1): Gen. 1:2.
earth (2): Gen. 1:1, 2.
evening (1): Gen. 1:5.
face (2): Gen. 1:2 (2).
form (1): Gen. 1:2.
from (1): Gen. 1:4.
God (6): Gen. 1:1, 2, 3, 4 (2), 5.
good (1): Gen. 1:4.
he (1): Gen. 1:5.
(heaven)
heavens (1): Gen. 1:1.
in (1): Gen. 1:1.
let (1): Gen. 1:3.
light (5): Gen. 1:3 (2), 4 (2), 5.
morning (1): Gen. 1:5.
(move)
moving (1): Gen. 1:2.
night (1): Gen. 1:5.
of (3): Gen. 1:2 (3).
one (1): Gen. 1:5.
over (1): Gen. 1:2.
(say)
said (1): Gen. 1:3.
(see)
saw (1): Gen. 1:4.
(separate)
separated (1): Gen. 1:4.
spirit (1): Gen. 1:2.
that (1): Gen. 1:4.
the (14): Gen. 1:1 (3), 2 (6), 4 (3), 5 (2).
there (4): Gen. 1:3 (2), 5 (2).
upon (1): Gen. 1:2.
void (1): Gen. 1:2.
(water)
waters (1): Gen. 1:2.
without (1): Gen. 1:2.

Die Bibel

Das 1. Buch Mose (Genesis)

(With homograph identifiers.)

1. Am Anfang schuf Gott Himmel und Erde. 2. Und die{fsn} Erde war wüst und leer, und es war finster auf der{fds} Tiefe; und der{mns} Geist Gottes schwebte auf dem{nds} Wasser. 3. Und Gott sprach: werde Licht! Und es ward Licht. 4. Und Gott sah, daß das{nas} Licht gut war. Da schied Gott das{nas} Licht von der{fds} Finsternis 5. Und nannte das{nas} Licht Tag und die{fas} Finsternis Nacht. Da ward aus Abend und Morgen der{mns} erste Tag.

Die Bibel

Das 1. Buch Mose (Genesis)

(Without homograph identifiers.)

1. Am Anfang schuf Gott Himmel und Erde. 2. Und die Erde war wüst und leer, und es war finster auf der Tiefe; und der Geist Gottes schwebte auf dem Wasser. 3. Und Gott sprach: Es werde Licht! Und es ward Licht. 4. Und Gott sah, daß das Licht gut war. Da schied Gott das Licht von der Finsternis 5. Und nannte das Licht Tag und die Finsternis Nacht. Da ward aus Abend und Morgen der erste Tag.

Konkordanz zur Bibel

Abend (1): Gen. 1:5.
(an)
am (1): Gen. 1:1.
Anfang (1): Gen. 1:1.
auf (2): Gen. 1:2 (2).
aus (1): Gen. 1:5.
da (1): Gen. 1:5.
(das [n.])
das [akk. sg.] (2): Gen. 1:4, 5.
dem [dat. sg.] (1): Gen. 1:2.
der [m.] (2): Gen. 1:2, 5.
die [f.] (1): Gen. 1:2.
die [akk. sg.] (1): Gen. 1:5.
der [dat. sg.] (2): Gen. 1:2, 4.
Erde (2): Gen. 1:1, 2.
(erst)
erste (1): Gen. 1:5.
es (3): Gen. 1:2, 3 (2).
finster (1): Gen. 1:2.
Finsternis (2): Gen. 1:4, 5.
Geist (1): Gen. 1:2.
Gott (2): Gen. 1:1, 3.
Gottes (1): Gen. 1:2.
Himmel (1): Gen. 1:1.
leer (1): Gen. 1:2.
Licht (4): Gen. 1:3 (2), 4, 5.
Morgen (1): Gen. 1:5.
Nacht (1): Gen. 1:5.
(nennen)
nannte (1): Gen. 1:5.
(schaffen)

schuf (1): Gen. 1:1.
(schweben)
schwebte (1): Gen. 1:2.
(sein [verb])
war (2): Gen. 1:2 (2).
(sprechen)
sprach (1): Gen. 1:3.
Tag (2): Gen. 1:5 (2).
Tiefe (1): Gen. 1:2.
und (10): Gen. 1:1, 2 (4), 3 (2), 5 (3).
von (1): Gen. 1:4.
Wasser (1): Gen. 1:2.
(werden)
werde (1): Gen. 1:3.
ward (2): Gen. 1:3, 5.
wüst (1): Gen. 1:2.

Category code list

• \catcode`\<=\active. For homograph identifiers. Reset to 12 (other) in math mode.
• \catcode`\@=14 (Comment). Treated as a comment by TEX, equivalent to %. If @ is the first non-blank character in an input line, the Lisp interpreter evaluates the rest of the line, which should contain a balanced expression (sexp). Otherwise, if it's in a text line, conctex.lsp discards it and the rest of the line following it.
• \catcode`\¥=9 (Ignored). The character ¥ (decimal 165, octal 245, hexadecimal A5) is ignored by TEX and treated as a word separator by Lisp.
• \catcode`\*=9 or \active. Used for commentaries. Reset to \catcode 12 in math mode.
• \catcode`\_=\active. The underline character is \let to \ustroke. Reset to 8 (subscript) in the \plain environment, so _ can appear in the names of files loaded using \input, and math mode, so it can be used for subscripts.
• \catcode`\-=\active. The hyphen character - is used for line ends where words are broken. Reset to 12 (other) in commentaries, the \plain environment, and math mode.
• \catcode`\^^M=\active. This change is local to the expansions of the active character - and the redefined control symbol \-, where ^^M is \let to \empty. Otherwise, it has its normal \catcode of 5 (end of line).
• \catcode`\©=\active. The copyright symbol (decimal 169, octal 251, hexadecimal A9) is \let to \- before the latter is redefined, so it can be used for discretionary hyphens within transcription lines, if necessary.

Glossary

Braces, unmotivated: Braces that delimit unnecessary groups in the input file. Not permitted in transcription lines.

Comment: Comments can be normal TEX comments that follow a %. Usually, however, “comment” refers to invocations of the macros \begincomment and \endcomment.

    \begincomment{This is a comment.}
    \endcomment

Comments appear in the output only if \drafttrue. Comments are ignored by conctex.lsp.

Commentary: A commentary contains text which should appear within the transcription lines in the output, but which should not be used for generating the concordance. Commentaries can be coded in several ways.

    * This is a commentary. *
    \begincommentary This is also
    a commentary.\endcommentary
    \plain Yet another commentary.
    \endplain

Evaluated lines: Lines in the input file, where @ is the first non-blank character. Such lines must contain a balanced Lisp expression, which is evaluated by the Lisp interpreter. TEX ignores lines beginning with @.

Homograph identifiers: In TEX, strings of the form “”. Used for indicating homographs in the input file. Printed out or not, depending on the value of \ifdraft. For the Lisp program, homograph-identifiers can be set in various ways in the lemmatization dictionary.

Ignored lines: Lines in the input file that are ignored by TEX, conctex.lsp, or both. Completely blank lines are ignored by conctex.lsp and treated normally by TEX. Lines beginning with % are ignored both by TEX and conctex.lsp. If a line contains a % in any other position, the rest of the line is discarded by both TEX and conctex.lsp. The character @ is equivalent to % in TEX. If a line contains an @ that's not the first non-blank character in the line, conctex.lsp discards the rest of the line.

Input file: A file containing text to be typeset for a book containing a facsimile of a manuscript, or something similar. The input file or files are also used for generating a concordance.

Lemmatization dictionary: A file of Lisp code used for lemmatizing the words in the transcription. The file can contain invocations of the functions generate-entry, add-variants, and/or harmless.

Macro, delimited: A macro within braces, like {\%}, {\rm ...}, or {\dh}.

Macro, undelimited: A macro with no enclosing braces, like \%, \rm or \indexentry{nouns}{}{x}% {verbs}{}{}.

Output: The typeset result of running TEX on the input file or files. Technically, the result of running TEX is a dvi file, however I usually mean either the paper printout or a display on the computer terminal in a program like xdvi or Ghostview.

Text lines: Lines in the input file which are processed by TEX. They may be transcription lines or commentaries.

Transcription lines: Lines in the input file containing the transcription of the manuscript. They should correspond, for the most part, to the actual lines in the manuscript; however, they may also contain commentaries, which may affect the length and hence the line breaking in the output.

Word element: A character which conctex.lsp considers to be part of a word. Includes all letters (characters whose \catcode is 11), some characters of type “other” (\catcode 12), and special character macros.

Word separator: Characters that cannot be part of a word. Currently these are blanks, punctuation, and the character which is represented as ¥ on my terminal (decimal 165, octal 245, hexadecimal A5).

References

The Bible. Containing the Old and New Testaments. Revised Standard Version. American Bible Society. New York: 1952.
Die Bibel. Mit Apokryphen. Nach der Übersetzung Martin Luthers neubearbeitet. Deutsche Bibelgesellschaft. Stuttgart: 1985.
Knuth, Donald E. The TEXbook. Addison-Wesley Publishing Company. Reading, Mass.: 1986.
Steele, Guy L. Common Lisp. The Language. 2nd ed. Digital Press. 1990.
Cleasby, Richard and Gudbrand Vigfusson. An Icelandic-English Dictionary. 2nd ed. The Clarendon Press. Oxford: 1957.
Williamson, Hugh. Methods of Book Design. The Practice of an Industrial Craft. 3rd ed. Yale University Press. New Haven: 1983.
Winston, Patrick Henry and Berthold Klaus Paul Horn. LISP. 3rd ed. Addison-Wesley Publishing Company. Reading, Mass.: 1989.

⋄ Laurence Finston
Skandinavisches Seminar
Georg-August-Universität
Humboldtallee 13
D-37073 Göttingen, Germany
[email protected]

Language Support

Cyrillic encodings for LATEX2ε multi-language documents A. Berdnikov, O. Lapko, M. Kolodin, A. Janishevsky, A. Burykin

Abstract

This paper describes the X2 and the T2A, T2B, T2C encodings designed to support Cyrillic writing systems for the multi-language mode of LATEX2ε. The encoding X2 is the “Cyrillic glyph container” which can be used to insert into LATEX2ε documents text fragments from all modern Cyrillic writings, but it does not strictly obey all the rules of LATEX2ε. The encodings T2A, T2B and T2C are the “true” LATEX2ε encodings which satisfy all the requirements of the LATEX2ε kernel, but as a result three encodings are necessary to support the whole variety of languages based on the Cyrillic alphabet. These restrictions of the LATEX2ε kernel, the specific features of Cyrillic writing systems and the basic principles used to create the encodings X2 and T2A, T2B, T2C are considered. This project supports all the Cyrillic writing systems known to us, although the majority of the accented letters need to be constructed using internal TEX tools.

The X2 encoding was approved at CyrTUG-97, the annual conference of Russian-speaking TEX users, and was previously presented at the EuroTEX-98 Conference. The encodings X2, T2A, T2B and T2C were intensively discussed on the cyrtex-t2 mailing list.

1 Introduction

The project (originally named T2) to produce encodings necessary to support modern Cyrillic languages in LATEX2ε multi-language mode was initiated at the TUG-96 Conference in Dubna. This paper presents the results of that project. Although some minor corrections could still appear as new information about minor Cyrillic writings appears, the kernel of the project seems to be stable. The encoding support includes:

• LHfont font collection version 3.2: the Computer Modern Cyrillic fonts and European Computer Modern Cyrillic fonts created by O. Lapko,
• T2enc macro package created by V. Volovich and W. Lemberg: the input and output encoding and font definitions necessary for the LATEX2ε packages fontenc and inputenc,
• the hyphenation patterns in encoding-independent style: ashyphen by A. Slepuhin, ruhyphen by D. Vulis, lvhyphen by M. Vorontsova and S. Lvovski, znhyphen by S. Znamenskii,
• rusbabel package created by V. Volovich and W. Lemberg to support Cyrillics (based on new encodings) in babel.

The β-versions of these packages are on the TEXLive 3 CD-ROM distribution. The final versions are available on CTAN.1

1 /fonts/cyrillic, /languages/hyphenation/ruhyphen, and /macros/latex/contrib/supported/t2.

Support for Cyrillics includes the following encodings:

• X2: the Cyrillic glyph container which contains all the glyphs necessary to support modern Cyrillic writing. It does not obey all the specifications that LATEX2ε requires for an encoding with the prefix T, but as a result it is enough to have just one encoding to insert in LATEX2ε documents the characters, words, names, bibliography references, short sentences, citations, etc., specific for all modern Cyrillic writings without too large an increase in the number of fonts used for this purpose (i.e., this encoding is a tool which enables Latin-writing people to occasionally use Cyrillics in their documents). The price is that some Cyrillic letters do not exist as separate glyphs but must be composed from pieces (accents and modifiers) contained in X2, and the user must obey some rules of safety (described in section 6) since X2 does not satisfy all the requirements obligatory for Tn encodings;
• T2A, T2B, T2C: the encodings which strictly satisfy the requirements necessary for LATEX2ε multi-language mode. With these encodings it is safe to mix different languages inside your documents and to use large pieces of text without any problems. The price is that it is necessary to have three encodings (and an enormous number of fonts each in agreement with the encoding conventions of the European Computer Modern fonts) to support the whole variety of Cyrillic alphabets. Some Cyrillic languages are supported by one or two encodings from the T2∗ encoding family, some are supported by all three encodings. The encodings are in agreement with the LATEX2ε multi-language mode and encoding paradigm. Although there are no obstacles to the use of these encodings for original Cyrillic texts, native users may have different preferences;
• LR0 and LR1: the encodings which combine the OT1 encoding for 0–127 and Cyrillic letters and symbols from the leading languages of the Former Soviet Union for 128–255. (The encoding LR0 contains the Russian letters only, the encoding LR1 contains in addition the most frequently used national letters.) These encodings are as close as possible to X2 and T2A/T2B/T2C and are designed mainly for non-LATEX formats based on the CM font family and the original Plain TEX. There is some hope (not confirmed at this moment) that LR0 and LR1 may become the inter-platform and inter-format standard for representing Russian letters inside TEX (as ASCII is the standard for English);
• LRn: local encodings necessary to support individual Cyrillic languages. The T2∗ encodings are intended for the multi-language mode of LATEX2ε and for this reason may be not well suited to bilingual documents or to the preferences of the native users. It is definitely not the task of the T2 working group, but that of the national TEX User Groups, to organize local encodings that are useful for their language;
• TS2: the encoding containing accents, non-letter symbols, etc., necessary for Cyrillic writings which are outside X2 and T2∗ encodings. This encoding is under consideration, and with great probability all necessary additional glyphs could be added to TS1 encoding. (The latter case has the advantage that it prevents the increase of the number of fonts needed to support multi-language mode in LATEX2ε);

• LWN: the encoding which generalizes the WNCyr font family by adding new Cyrillic letters and new substitution pairs (ligatures) based on ASCII Latin input. It is suitable for Latin-writing users who use Cyrillics only occasionally. (This encoding is still under development and is not described here.);
• T5 encoding(s) to support Old Slavonic, Glagolitic, Church Slavonic, etc., writings. The project to develop these encodings has just started and its discussion is outside the scope of this publication.

2 LATEX2ε system of encodings

The following types of encodings are recognised by the LATEX Project:2

2 The following text is slightly adapted from a post by David Carlisle to the mailing list cyrtex-t2.

OTn: essentially 7 bit ‘old’ encodings. Typically these will be small modifications of the original TEX encoding, OT1 (for example, OT4, a variant for Polish).

Tn: 8 bit Text Encodings. Tn encodings are the main text encodings that LATEX uses. They have some essential technical restrictions to enable multilingual documents with standard TEX: (a) they should have the basic Latin alphabet, the digits and punctuation symbols in the ASCII positions, (b) they should be constructed so that they are compatible with the lowercase code used by T1. Further discussion of the technical requirements for Tn encodings is given in section 3.

Xn: other 8 bit Text Encodings (eXtended, or eXtra, or X=Non Latin). Sometimes it may be necessary, or convenient, to produce an encoding that does not meet the restrictions placed on the Tn encodings. Essentially arbitrary text encodings may be registered as Xn, but it is the responsibility of the maintainers of the encoding to clearly document any restrictions on the use of the encoding.

TSn: Text Symbol Encodings. Encodings of symbols that are designed to match a corresponding text encoding (for example, paragraph signs, alternative digit forms, etc.). The font style of fonts in TSn encoding will ordinarily be changed in parallel with that of the fonts in Tn encoding using NFSS mechanisms. As a result, at any moment the TSn font style is compatible with the Tn font and the glyphs from TSn font (accents, punctuation symbols, etc.) can be mixed with the glyphs from the corresponding Tn font.

Sn: Symbol encodings. The style of fonts in Sn encoding need not be synchronized with that of Tn fonts. These encodings are used for arbitrary symbols, ‘dingbats’, ornaments, frame elements, etc.

An: Encodings for special Applications (not currently used).

E∗: Experimental encodings but those intended for wide distribution (currently used for the ET5 proposal for Vietnamese).

L∗: Local, unregistered encodings (for example, the LR0, LR1 and LRn encodings mentioned above).

OM∗: 7 bit Mathematics encodings.

M∗: 8 bit Mathematics encodings.

U: Unknown (or unclassified) encoding.

3 Specifications for Tn and Xn encodings

There are two main restrictions to be fulfilled before an encoding may be considered as an encoding with the prefix ‘T’ satisfying the requirements of the LATEX2ε kernel:

• the \lccode–\uccode pairs should be the same as they are in the LATEX2ε kernel (i.e., as they are in the T1 encoding);
• the Latin characters and symbols: !, ’, (, ), *, +, ,, -, ., /, :, ;, =, ?, [, ], ‘, |, @ (questionable), 0–9, A–Z, a–z should be at the positions corresponding to ASCII, and the symbols produced by the ligatures --, ---, ‘‘, ’’ (at arbitrary positions).

If the encoding requires the redefinition of the values \lccode–\uccode, or if it does not contain the necessary Latin characters in the ASCII positions, it will produce undesirable effects in some situations inside LATEX2ε and will make use of the encoding incompatible with the general multi-language mode. The reasons for such restrictions are explained in detail in [1].

Although the LATEX Team's technical specifications for Xn encodings are less restrictive than those for ‘ordinary’ text encodings, there are corresponding restrictions on their use, and some desirable properties for them to have. In particular:

• If the encoding does not have Latin letters in ASCII slots then the users must take care not to enter such text, otherwise ‘random’ incorrect output will be produced, with no warning from

the LaTeX system. Also, care must be taken with ‘moving’ text that is generated internally within LaTeX (such as cross references), which may fail if the encodings change;
• To reduce the problems with cross reference information, the LaTeX maintainers strongly recommend that at least the digits and ‘common’ punctuation characters are placed in their ASCII slots;
• If the encoding uses a lowercase table that is incompatible with the lowercase table of T1, then it is not possible to mix this encoding and a Tn encoding within a single paragraph and obtain correct hyphenation with standard TeX.

If the Xn encoding does not use a lowercase table that is compatible with that of T1, the package supporting this encoding should ensure that encoding switches only happen between paragraphs (or that hyphenation is suppressed when temporarily switching to the new encoding). It should be noted that this restriction on the lowercase table applies only to systems using standard TeX (version 3 and later). Using ε-TeX version 2 will remove the need for this restriction as the hyphenation system has been improved: it will use a suitable lowercase table for each language (the table will be stored along with each language hyphenation table). The restriction likewise does not apply to the Omega system.

4 “Cyrillic glyph container”: the X2 encoding

The encoding X2 should include all the glyphs necessary to represent, in LaTeX2ε documents, texts from stable Cyrillic languages. The basis of X2 is the Russian alphabet (since it is the main language used for publication in Cyrillic). Taking account of the variety of old Cyrillic texts, only those modern alphabets which are still in use are included in X2. The exceptions are a few characters (such as the old Russian yat and fita) which were used in Russian and Bulgarian texts at the beginning of the 20th century.

The X2 encoding is designed so that by combining "00–"7f from OT1 and "80–"ff from X2 one can construct an encoding which is adequate to support the most common Cyrillic languages. This permits use of X2 as the base Cyrillic encoding for a variety of TeX formats (Plain, AMS-TeX, BLUeTeX, LaTeX 2.09, etc.) as well as LaTeX2ε. (This local encoding is called LR1 below. The design aim for LR1 was to select glyphs required by the most widely-used languages and to put them into the 128–255 section of X2.)

Unfortunately the full set of glyphs including accented letters is too big to fit into 256 slots, especially taking into account the \lccode–\uccode restrictions. So it is necessary to accept some principles of selection which enable us to decrease the number of Cyrillic glyphs included in X2:

1. The X2 encoding follows the LaTeX2ε agreements about \lccode–\uccode so as not to produce garbage for the headings, table of contents, hyphenations inside paragraphs, and arguments of \uppercase and \lowercase;
2. All glyphs used in publishing for some language are included in X2 if they cannot be constructed as accented letters or letters with additional modifiers using TeX commands. Variant glyphs for Cyrillic alphabets are also included in X2 if there is some free space and if different languages use different variants;
3. The X2 encoding includes all punctuation symbols, digits, mathematical symbols, accents, hyphens, dashes, etc., needed to form the full set of symbols necessary for Cyrillic typography;
4. The additional Cyrillic letters which are used in the PC 866 and MS Windows 1251 code pages are included in X2 even if they are accented forms;
5. Glyphs which are not used now but which were used at some stage in the 20th century may be included if there are good reasons to do so (as, for example, with the old Russian and Bulgarian letters);
6. Glyphs which were used in old Cyrillic texts before 1900 (Old Slavonic, Church Slavonic, Glagolitic, old phonetic symbols, etc.) should be moved to a separate glyph container. There could also be an additional glyph container to collect the exotic glyphs used in some contemporary Cyrillic texts;
7. When jettisoning accented letters it is necessary to take into account that they may be necessary for hyphenation patterns for some languages (if such patterns have been created or if there is a chance that they will be created sometime). For example, accented letters for Russian, Ukrainian and Belorussian, Kazakh, Tatar, and Bashkir are included in X2;
8. When deciding whether to jettison an accented letter that is used in a language supported by LR1, one must keep in mind that only the CM accents are available in that encoding;
9. The following priorities are used when the accented letters or letters with simple modifiers are thrown away: (0) letters which are easily constructed by the internal command \accent (so that the letters using accents available in

CM fonts have lower significance); (1) letters which contain a centered diacritic below the letter (cedilla, ogonek, dot, macron) and are easily constructed using a command similar to \c in Plain TeX; (2) letters which contain a horizontal stroke positioned symmetrically; (3) letters which require special alignment of accents and modifiers;
10. Accents and modifiers used in Cyrillic are included in X2 even if all accented forms are included in X2 for some other reasons (an example is the Cyrillic breve used for Й and Ў);
11. Latin letters or glyphs which are similar to some Latin letter (used in Macedonian, Kurdish, etc.) are placed at the same positions as the Latin letters are in ASCII. Among other things, this increases the number of languages supported by the LR1 encoding;
12. Whenever it is possible, glyphs (ASCII, accents, special symbols, etc.) are placed at the same positions as they are placed in the T1 encoding.

The X2 encoding is shown in Fig. 1. The Russian letters А–Я, а–я (except Ё and ё) are placed in the only region in the encoding table where 32 consecutive letter positions are available, i.e., positions "c0–"df and "e0–"ff. The Russian letters Ё and ё are placed at the end of the block "80–"9c and "a0–"bc, which simplifies the ordering of non-Russian letters. Latin letters and letters similar to Latin letters are placed as in ASCII (principle 11). Letters used in other Cyrillic alphabets are grouped into the parts "80–"ff and "00–"7f of the encoding table according to the “popularity” of the corresponding languages (to satisfy the requirements of the LR1 encoding). They are placed in free positions reserved by LaTeX2ε for letters in some quasi-alphabetic order. The old Russian and Bulgarian letters are placed at the end of the block of letters in "00–"7f.

Accents and modifiers are placed in X2 at "00–"1f; those also used by T1 are placed at the same positions as in T1. The same is true for additional symbols produced by the ligatures --, '', etc. The punctuation symbols, digits, mathematical symbols, etc., are placed as they are positioned in ASCII. A special case is made of the symbols which are essential for Russian typography. These symbols are placed in "80–"ff at the positions reserved for symbols, to guarantee the correctness of the LR1 encoding.

Some accents (macron, dot) can be used as lower accents as well for transliteration systems. In some specific cases the upper comma ("1b) and lower comma are also used as accents. The lower accents will be constructed using TeX commands from the upper accents available in X2.

The accents at "12 and "13 are used as stresses in Serbian; there is no letter in any Cyrillic language where these symbols are used as “normal” accents.

The quasi-letters apostrophe ("27), double apostrophe ("22) and the glyph at "0d are used like letters in some languages but do not have uppercase and lowercase forms (i.e., for these letters the uppercase form is just the same as the lowercase form).

Single quotes are not used in Cyrillic writing, and for this reason there is no need to keep single French quotes. Instead, in their place, the angle brackets ("0e and "0f) are provided. Angle brackets are used in Cyrillic typography, and it is good if their style is changed in parallel with the style of other symbols.

The Cyrillic breve ("14) is a very famous glyph (it is even included in the Adobe and WordPerfect Cyrillic fonts). Although all letters with this accent (Й/й, Ў/ў) are included in X2, it is included as a special glyph as well.

Cedilla ("0b) and ogonek ("0c) are used by some letters already included in X2. These letters have variant forms where the cedilla could be oriented to the left or to the right depending on the user’s taste. Also, some applications use ogonek instead of descender for some letters. The availability of cedilla and ogonek in X2 makes it possible to satisfy these needs.

Percentage zero ("18) is included as a useful idea borrowed from the T1 encoding and EC fonts: this symbol is used to convert ‘%’ into ‘‰’ and ‘‱’.

Punctuation ligatures, i.e., the symbols produced by the abbreviations -- (endash, "15), --- (emdash, "16), `` (opening English quotes, "10), '' (closing English quotes, "11) are used in the same manner and are placed at the same positions as in T1, as is the hyphen used for hanging hyphenation ("7f). It is worth noting that the ligature --- (emdash, "16) corresponds to the Cyrillic emdash which (following the traditions of Russian typography) is much shorter than that glyph in Latin-encoded CM and EC fonts.

There are the special cross-modifiers: horizontal at "17, grave-diagonal at "19 and acute-diagonal at "1a, which are used to construct from pieces several letters used in some minor Cyrillic languages. (These letters are included as separate glyphs in T2A/T2B/T2C, but

Figure 1: The “Cyrillic glyph container” X2
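The slot layout of section 4 leans on the case tables that the LaTeX2ε kernel sets up to match T1. As a minimal sketch (not part of the original proposal), the document below prints the \lccode and \uccode values for a few of the slots named in the text, so that they can be compared with the blocks described there. It assumes an 8-bit engine such as pdfLaTeX; the helper name \showcase is hypothetical, and the printed values depend on the installed kernel, so none are quoted here.

```latex
\documentclass{article}
% \showcase is a hypothetical helper: it prints the case codes of one slot,
% given as two hexadecimal digits (uppercase A-F, as TeX requires).
\newcommand*{\showcase}[1]{%
  \par\noindent\texttt{\string"#1}: %
  lccode \number\lccode"#1\relax, uccode \number\uccode"#1\relax}
\begin{document}
\showcase{C0}% first slot of the uppercase Russian block
\showcase{E0}% first slot of the lowercase Russian block
\showcase{9C}% extra Russian letter, uppercase slot
\showcase{BC}% extra Russian letter, lowercase slot
\showcase{9D}% a slot the text reserves for symbols
\showcase{1C}% a slot in the accent and symbol area
\end{document}
```

An encoding that keeps letters only in slots whose codes pair up in this output needs no change to the kernel tables, which is the property the X2 and T2∗ designs aim for.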

unfortunately the limit of 256 characters prevents including them in X2.)

The positions "1c/"1d and "1e/"1f are used for two pairs of exotic letters used by some minor Cyrillic alphabets. Although following LaTeX2ε rules these positions should be used for symbols, not for letters, we have made them exceptions from the severe LaTeX2ε requirements. Formally speaking, the LaTeX2ε requirements are not violated since the \lccode–\uccode data for these positions conserve the original values. Instead of the explicit use of the \lccode–\uccode mechanism, the lowercase–uppercase conversion is performed by the LaTeX2ε \MakeUppercase and \MakeLowercase transformations using the list \@uclclist of identifiers. As a result these “letters” cannot be used in hyphenation patterns and they break the automatic hyphenation whenever they appear in a word. The gain is that even exotic Cyrillic texts could be created (if necessary) using X2 only and without additional encodings and fonts.

5 “True” LaTeX2ε encodings T2A, T2B, T2C

The base features of the T2∗ encodings are determined by the LaTeX2ε specifications for Tn encodings and by the already created X2 encoding. Some more requirements are added due to the necessity to keep the fonts in the T1, X2 and T2∗ encodings compatible. For example, it is necessary to keep similar glyphs at the same positions in all fonts whenever it is possible. As a result the following basic principles appear.

1. The set of T2∗ encodings supports the full set of modern Cyrillic languages; each Cyrillic language is supported by at least one T2∗ encoding, so that there is no necessity to mix encodings for some languages;
2. Cyrillic letters occupy the positions reserved in LaTeX2ε for letters, Cyrillic symbols the positions reserved for symbols, and Cyrillic accents the positions reserved for accents;
3. Letters included in T2∗ follow the LaTeX2ε convention about uppercase and lowercase letters (i.e., have the same \uccode–\lccode assignments as in T1);
4. ASCII glyphs (Latin letters, digits, mathematical and punctuation symbols, etc.) are placed at "20–"7f;
5. The symbols produced by the ligatures --, ---, ``, '' are included at the same positions as they are in T1 and X2 (it is worth noting that, as in X2, the emdash glyph is the Cyrillic emdash which is shorter than that in OT1 and T1);
6. The ff-ligatures (ff, fi, fl, ffi, ffl) are included at "1b–"1f as in T1 to keep the full set of Latin glyphs necessary for typography;
7. The standard accents and symbols, and Cyrillic-specific accents and symbols used in "00–"1a of X2, are reproduced in T2∗ at the same positions except the cross-modifiers which are not necessary (the letters with cross-modifiers are included as separate glyphs);
8. The Russian alphabet is reproduced as in X2;
9. The symbols specific to Cyrillic typography are reproduced in "9d, "9e, "9f, "bd, "be and "bf, as in X2;
10. Positions "80–"9b and "a0–"bb are used for national Cyrillic letters (these are the only positions that differ among the T2∗ encodings since all other codes are already fixed as described above);
11. To prevent an unbounded increase in the number of encodings, the accented letters for Cyrillic languages which do not have hyphenation tables in TeX format (and are unlikely to have them in future) are not included;
12. Equivalent letters occupy just the same positions in all the T2∗ encodings whenever possible (i.e., some glyphs may be absent in some encodings, but if the glyph is included in an encoding, it occupies the same position as in the other encodings).

The encodings can therefore be decomposed into the following regions:
– the accent, non-ASCII punctuation and ligature symbols ("00–"1f),
– the ASCII-encoded Latin letters, digits, punctuation and mathematical symbols, etc. ("20–"7f),
– non-letter symbols at "9d–"9f and "bd–"bf,
– specific (uppercase and lowercase) Cyrillic letters ("80–"9b, "a0–"bb),
– uppercase and lowercase Russian letters ("c0–"df, "e0–"ff, "9c, "bc).

The accent part (see Fig. 2) is copied from X2 with minor changes:
a) the exotic letters ("1c–"1f) and the upper comma accent ("1b) are substituted by the ff-ligatures “ff”, “fi”, “fl”, “ffi”, “ffl”,

Figure 2: Common parts in T2A/T2B/T2C: a) accents, ligatures, special symbols, etc.; b) ASCII glyphs; c) Russian letters

b) the cross-modifiers are substituted by the compound-word-mark symbol³ ("17), the dotless-i ("19) and the dotless-j ("1a), as in T1.

³ The empty character with zero thickness and a height equal to 1ex, used in EC fonts and the T1 encoding for special applications such as hyphenating compound words, breaking down ligatures, and creating accents to be placed over the invisible space between two letters.

The ASCII-encoded Latin letters, digits, punctuation and mathematical symbols at "20–"7f are placed exactly as in the T1 encoding, as shown in Fig. 2.

The Russian alphabet letters and non-letter symbols at "9d–"9f, "bd–"bf are just copied from the corresponding positions in the X2 encoding (see Fig. 2).

The component which is different in T2A/T2B/T2C is the set of national letters at "80–"9b and "a0–"bb, which are shown in Fig. 3 and Table 1. Although we tried to fulfill the requirement that the equivalent letters should occupy the same positions in all encodings, it was impossible to fulfill it completely. Fortunately there are only two exceptions: the letters Љ/љ and Њ/њ are placed at "87/"a7 and "9b/"bb in T2A, but at "88/"a8 and "99/"b9 in T2B. All other letters and symbols in T2A, T2B and T2C (and in the accent part "00–"1a, symbol/digit part "20–"3f, "5b–"5f, "7b–"7f and lower part "80–"ff of X2) have fixed positions.

A summary of the languages covered by T2A/T2B/T2C is shown below. The encoding T2A contains the leading languages sorted by using statistical data on populations. The encoding T2B contains the majority of the remaining languages. Finally, the encoding T2C contains several languages with exotic letters which do not fit into T2A or T2B: Abkhazian, Orok (Uilta), Saam (Lappish), Old-Bulgarian, Old-Russian. Due to the intersections between Cyrillic alphabets some languages are supported by two or three encodings simultaneously:

T2A: Abaza, Avar, Agul, Adyghei, Azerbaidzan, Altai, Balkar, Bashkir, Belorussian, Bulgarian, Buryat, Gagauz, Dargin, Dungan, Ingush, Kabardino-Cherkess, Kazakh, Kalmyk, Karakalpak, Karachaevskii, Karelian, Kirgiz, Komi-Zyrian, Komi-Permyak, Kumyk, Lak, Lezgin, Macedonian, Mari-Mountain, Mari-Valley, Moldavian, Mongolian, Mordvin-Moksha, Mordvin-Erzya, Nogai, Oroch, Osetin, Russian, Rutul, Serbian, Tabasaran, Tadjik, Tatar, Tati, Teleut, Tofalar, Tuva, Turkmen, Udmurt, Uzbek, Ukrainian, Hanty-Obskii, Hanty-Surgut, Gipsi, Chechen, Chuvash, Crimean-Tatar;

T2B: Abaza, Avar, Agul, Adyghei, Aleut, Altai, Balkar, Belorussian, Bulgarian, Buryat, Gagauz, Dargin, Dolgan, Dungan, Ingush, Itelmen, Kabardino-Cherkess, Kalmyk, Karakalpak, Karachaevskii, Karelian, Ketskii, Kirgiz, Komi-Zyrian, Komi-Permyak, Koryak, Kumyk, Kurdian, Lak, Lezgin, Mansi, Mari-Valley, Moldavian, Mongolian, Mordvin-Moksha, Mordvin-Erzya, Nanai, Nganasan, Negidal, Nenets, Nivh, Nogai, Oroch, Russian, Rutul, Selkup, Tabasaran, Tadjik, Tatar, Tati, Teleut, Tofalar, Tuva, Turkmen, Udyghei, Uigur, Ulch, Khakass, Hanty-Vahovskii, Hanty-Kazymskii, Hanty-Obskii, Hanty-Surgut, Hanty-Shurysharskii, Gipsi, Chechen, Chukcha, Shor, Evenk, Even, Enets, Eskimo, Yukagir, Crimean Tatar, Yakut;

T2C: Abkhazian, Bulgarian, Gagauz, Karelian, Komi-Zyrian, Komi-Permyak, Kumyk, Mansi, Moldavian, Mordvin-Moksha, Mordvin-Erzya, Nanai, Orok (Uilta), Negidal, Nogai, Oroch, Russian, Saam, Old-Bulgarian, Old-Russian, Tati, Teleut, Hanty-Obskii, Hanty-Surgut, Evenk, Crimean Tatar.

6 The Cyrillic glyph container X2 versus Cyrillic encodings T2A, T2B, T2C

As was already specified, there are two main requirements essential to the reliable working of LaTeX2ε in multi-language mode:
• we must keep the \lccode–\uccode table as in T1,
• we must keep the ASCII encoding for positions 32–127.

Both requirements are fulfilled by the T2A/T2B/T2C encodings and hence they can be safely mixed with the Latin encodings OT1 and T1 inside a document. The encoding X2 conserves the \lccode–\uccode values but does not contain these ASCII glyphs. As a result it may cause problems and unexpected effects inside LaTeX2ε documents if the user is not careful enough. So, why do we need X2 when we have T2A/T2B/T2C?

The reason is that the requirement to keep all the ASCII glyphs is very restrictive: it leaves only 61 positions for non-ASCII letters.⁴ To fit all Cyrillic letters into Tn encodings requires three tables

⁴ For Cyrillic encodings it is even more restrictive: it is necessary to keep 32 base Russian letters in each encoding as well since they are encountered in almost all Cyrillic alphabets.

Figure 3: The national Cyrillic letters in T2A/T2B/T2C: a) T2A encoding; b) T2B encoding; c) T2C encoding

Code      T2A                      T2B                          T2C
"80/"a0   ghe-upturned             ghe-with-descender hcrossed  pe-with-tail
"81/"a1   ghe-hcrossed             ghe-hcrossed                 [glyph]+[glyph]
"82/"a2   (T+h with tail)          ghe-with-descender           te-with-descender
"83/"a3   (T+h)                    ghe-with-tail                ghe-with-tail
"84/"a4   h-special                h-special                    h-special
"85/"a5   [glyph]-with-descender   zhe-with-descender           er-with-descender
"86/"a6   [glyph]-with-descender   delta-phonetical             er-gravecrossed
"87/"a7   lje                      zet                          zet
"88/"a8   i-with-umlaut            lje                          em-with-descender
"89/"a9   [glyph]-with-descender   ka-with-descender            ka-with-descender
"8a/"aa   ka-with-left-poker       el-with-descender            el-with-descender
"8b/"ab   ka-vcrossed              ka-with-tail                 ka-hcrossed
"8c/"ac   a+e                      el-with-tail                 el-with-tail
"8d/"ad   en-with-descender        en-with-descender            en-with-descender
"8e/"ae   en+ghe                   en+ghe                       em-with-tail
"8f/"af   S                        en-with-tail                 en-with-tail
"90/"b0   o-barred                 o-barred                     o-barred (fita)
"91/"b1   es-with-descender        es-acutecrossed              abkhazian-ch
"92/"b2   u-with-cyrbreve          u-with-cyrbreve              abkhazian-ch with descender
"93/"b3   Y-special                Y-special                    [glyph]
"94/"b4   Y-special-hcrossed       ha-hcrossed                  [glyph]
"95/"b5   ha-with-descender        ha-with-descender            ha-with-descender
"96/"b6   tse-macedonian           ha-with-tail                 tse-macedonian
"97/"b7   [glyph]-vcrossed         che-with-left-descender      abkhazian-ha
"98/"b8   che-with-descender       che-with-descender           che-with-descender
"99/"b9   e-ukrainian              nje                          en-with-right-tail
"9a/"ba   shwa                     shwa                         shwa
"9b/"bb   nje                      epsilon                      big [glyph]

Table 1: The national Cyrillic letters in encodings T2A/T2B/T2C
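The claim in section 6 that T2A/T2B/T2C can be mixed safely with T1 can be illustrated with a short document. This is only a sketch under assumptions that go beyond the text: it presumes that T2A encoding definitions and matching fonts are installed and loadable through the standard fontenc package, and \cyrtext is a hypothetical helper name. The Cyrillic glyphs are addressed by slot number, using the Russian block positions given in section 5, so that no glyph command names have to be assumed.

```latex
\documentclass{article}
% Assumption: T2A and T1 encoding definitions and fonts are installed.
% The last encoding in the option list becomes the document default (T1).
\usepackage[T2A,T1]{fontenc}
% \cyrtext is a hypothetical helper that typesets its argument in T2A.
\newcommand*{\cyrtext}[1]{{\fontencoding{T2A}\selectfont #1}}
\begin{document}
Latin text in T1, then two slots from the Russian block of T2A:
\cyrtext{\char"C0\char"E0}, and Latin text again.

% T2A keeps the ASCII glyphs at 32--127, so Latin input is still safe here.
\cyrtext{ASCII letters and digits 0123 inside T2A.}
\end{document}
```

With X2 in place of T2A the second paragraph would no longer be safe, because X2 has no Latin letters in the ASCII slots; that difference is what the rules in section 6 guard against.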

to achieve the coverage of X2: the problem is that most characters in the Tn tables are the same. And to support each such Tn encoding it is necessary to have a separate font class like the EC fonts.

To keep such enormous numbers of fonts is too large a price for people who use Cyrillics only occasionally. On the other hand, if all the Cyrillic glyphs are put into just one table without the Latin letters in 32–127, but in a way that satisfies the \lccode–\uccode requirements, one table and one font class is enough provided the user obeys some elementary rules of safety. This “economy mode” is implemented by X2.

There is a similar situation for Old Slavonic characters and some other encodings which are only occasionally used by normal users. To resolve this problem, “glyph containers” like X2 could again be helpful. The “glyph container” encodings Xn should be an intermediate case between Tn and the “free style” Xn: such encodings do not have ASCII in 32–127, but they do have the standard, compatible \lccode–\uccode values.

Currently the LaTeX Team only supports the Tn encodings and TS encodings, while the support of an Xn encoding is entirely the responsibility of the designer of the encoding.⁵

⁵ It seems that there should be support for such “glyph container” encodings by the LaTeX Team as well (such support should include the registration procedure for glyph containers and maintenance of the official list of exceptions where the glyph container encodings produce undesirable results).

For the “glyph container” X2 such support should include listing some simple rules which should be followed by users so as to avoid strange and undesirable effects. The X2 encoding does not contain Latin ASCII letters but only digits, punctuation and mathematical symbols, etc., therefore the rules should guarantee that text containing Latin ASCII letters is never used with the X2 encoding:

1. If you use ASCII Latin letters in the text part of your document, some Latin encoding (i.e., OT1 or T1) must be active for this piece of your text;

2. The following commands should be used only outside the range where the X2 encoding is active (or should be preceded by the explicit specification of some “Latin” encoding) because they may implicitly include Latin text in your document:
\part, \chapter, \section, \subsection, \subsubsection, \paragraph, \subparagraph, \caption, \tableofcontents, \listoftables, \listoffigures, \ref, \pageref, \cite, \item, \labelenumxx, \labelitemxx, \makelabel, \numberline, \thechapter, \thesection, \thesubsection, \date, \today.
Exactly the same requirement is needed for all user-defined macros (or those loaded from external packages) that have ASCII Latin text in their body but without explicit specification of the “Latin” encodings OT1 or T1 for this text;

3. When you deal with floating objects (or moving arguments), you should not rely on the assumption that the Cyrillic letters are used by LaTeX when the floating material is inserted into the document. For example, if Cyrillic letters are used inside some command defining the floating object, the encoding X2 should be activated explicitly in front of Cyrillic letters even if X2 is active at the point where the command is issued. Among such commands are:
(a) the floating environments:
\begin{table}–\end{table}, \begin{figure}–\end{figure}, \begin{table*}–\end{table*}, \begin{figure*}–\end{figure*},
(b) the commands that define floating text explicitly:
\author, \title, \date, \address, \name, \signature, \telephone, \footnote, \footnotetext, \thanks, \marginpar, \markboth, \markright, \bibitem, \topfigrule, \botfigrule, \dblfigrule, \footnoterule,
(c) the commands that define headers, footers, margin remarks, etc., implicitly:
\part, \chapter, \section, \subsection, \subsubsection, \paragraph, \subparagraph, \caption,
(d) the commands that write, explicitly or implicitly, text to external files, which may be loaded outside the X2 encoding:
\addtocontents, \addcontentsline, \glossary, \index, \part, \chapter, \section, \subsection, \subsubsection, \paragraph, \subparagraph, \caption.
Similarly, if such a floating command includes Latin letters and the resulting object may appear inside the range where X2 is active, some Latin encoding (i.e., OT1 or T1) should be activated explicitly before the Latin-encoded text.

4. Just the same requirement holds for all commands which can occasionally insert Cyrillic or Latin text where the Latin encodings OT1/T1 or the “Cyrillic glyph container” encoding X2 are active. For example, you should be careful with the definition and re-definition of the following commands:
(a) commands which create automatically generated text used by other commands:
\labelenumxx, \labelitemxx, \makelabel, \numberline, \thechapter, \thesection, \thesubsection, \today,
(b) commands which are used in international LaTeX to define language-specific names:
\abstractname, \appendixname, \alsoname, \ccname, \chaptername, \contentsname, \enclname, \headtoname, \figurename, \indexname, \listfigurename, \listtablename, \notesname, \pagename, \partname, \prefacename, \seename, \tablename,
(c) commands which implicitly define headers, footers, margin remarks, etc., and/or implicitly write something into external files:
\part, \chapter, \section, \subsection, \subsubsection, \paragraph, \subparagraph, \caption, \addtocontents, \addcontentsline, \glossary, \index,
(d) commands which create floating text and floating environments:
\author, \title, \date, \address, \name, \signature, \telephone, \footnote, \footnotetext, \thanks, \marginpar, \markboth, \markright, \bibitem, \topfigrule, \botfigrule, \dblfigrule, \footnoterule, \begin{table}–\end{table}, \begin{figure}–\end{figure},

\begin{table*}–\end{table*}, \begin{figure*}–\end{figure*},
(e) macros and user-defined commands which may be expanded unintentionally inside or outside X2:
\def, \newcommand, \newcommand*, \renewcommand, \renewcommand*, \providecommand, \providecommand*, \newenvironment, \newenvironment*, \renewenvironment, \renewenvironment*, \newtheorem, \ProvideTextCommand, \ProvideTextCommandDefault, \AtBeginDocument, \AtEndDocument, \AtEndClass, \AtEndPackage, \DeclareRobustCommand, \DeclareTextCommand, \DeclareTextCommandDefault.

7 The weak points of X2 and T2∗

The X2 and T2∗ encodings do not contain accented letters, and (for some languages) this throws the user back on the \accent primitive, which prevents construction of correct hyphenation tables and destroys kerning pairs. The encodings (especially X2) are also overloaded (to some extent) with rare glyphs, which arise from the attempt to collect all Cyrillic glyphs in one table.

There are the cross-modifiers (horizontal stroke, vertical stroke, diagonal strokes) which are included in X2 but are absent in T2A/T2B/T2C. Although there is a great chance that these glyphs will be included in TS2 (see section 8), their status at this stage of the project is undefined. Similarly, there are the title forms⁶ for the letters Љ/љ and Њ/њ which were included in previous (intermediate) versions of X2 but are now excluded.

⁶ The title form is a combination of the uppercase “Л” or “Н” and the bowl from the lowercase “ь”. These glyphs are used for first letters in titles, etc., where the first letter is capital and other letters are in lowercase mode. For example, there is the title form “Ij” for the ligature “IJ”/“ij” used in Dutch. (Note that the title form “Ij” is absent in the T1 and TS1 encodings.)

Another disadvantage of minor importance is that there are two glyph pairs which correspond to logically different letters: one stands for both the Saam semisoft sign and the old Russian yat, and the other stands for both o-barred and the old Russian fita. Although graphically these symbols are similar, they are different logically.

This situation can be accepted taking into account the status of X2 as a glyph table rather than a table for direct text coding, and the status of T2∗ as the modern Cyrillic encodings. In structured markup, the ambiguity would be addressed by assigning two symbolic names for each glyph (say, \yat/\semisft and \fita/\obarred) and only using the semantically correct one to code texts.

Some preliminary information about exotic glyphs and pure phonetic symbols has been provided by linguists studying some minor writing systems. These letters and symbols are not currently included in X2 and T2∗ at all. The reason for not including the glyphs at this stage is that the writing systems are very unstable and are subject to change from publication to publication. There is no justification for including such symbols in the version of X2 and T2∗ proposed as a standard until the situation becomes stable.

It seems that all the specific Cyrillic glyphs used in modern Cyrillic alphabets are included in X2 and in one of the T2∗ encodings, but there is also a chance that some minor writing system has been omitted. There is also a chance that some linguists will suggest a new alphabet for some minor language using their own glyphs not available in X2. Until this happens we can consider X2 and T2A/T2B/T2C as comprehensive glyph collections for modern Cyrillic texts (although not very comfortable and not specifically adjusted for intensive Cyrillic writing).

8 Some remarks about the TS2 encoding

TS2 is expected to be the collection of accents and special symbols which are necessary for Cyrillic typography, but which are not included in the encodings X2 and T2A/T2B/T2C for some reasons (i.e., TS2 is the encoding supplementary to X2 and T2n as TS1 is supplementary to T1).

For typographical reasons, ‘wide’ versions of some accents (macron, tilde, breve, etc.) are desirable. These versions would be used for extra wide letters: as compared with the Latin alphabet, Cyrillic has a far higher proportion of wide letters. Such wide versions of the accents are good candidates for a TS2 encoding. Similarly, the lowercase/uppercase variants of cedilla, ogonek and the accents absent in T1 and TS1 may make useful additions to TS2.

The letters Љ/љ and Њ/њ used in some Cyrillic languages are actually ligatures “Л+Ь” and “Н+Ь”. As well as the uppercase and lowercase forms there is also a title form for these letters: the combination of the uppercase form for “Л” or “Н” and the bowl for the lowercase “ь”. This form is used for titles where the first letter is capital while the other letters are ordinary (a similar effect occurs for ‘IJ’ used in Dutch). Such title letters

should be placed in TS2 and shared by the X2 and T2∗ encodings.

To construct some exotic letters from pieces, special modifiers are necessary: horizontal stroke, vertical stroke and the two diagonal strokes. The diagonal strokes are used only for letters  /  (Enetz) and  /  (Saam, or Lappish). Vertical strokes are used only for letters  /  and  /  which have become obsolete since modern Azerbaijan writing is based on the Latin alphabet. Horizontal strokes are used in several Cyrillic letters ( / ,  / ,  / , etc.). There are serious reasons for keeping these modifiers in TS2: there are still minor languages for which alphabets based on Cyrillic could be proposed. The availability of these modifiers in TS2 would support such developments without the necessity to include more glyphs in the X2 and T2∗ encodings.

It is still a question how, and whether, the TS2 encoding should be realized. Taking into account that there are only a few glyphs really necessary for it and that there are several positions in TS1 reserved for future extension of this encoding, it may be a good decision just to combine these two encodings.

9 Acknowledgments

There are many people who have contributed to this project, and it is difficult to list all of them in this section. Among the people who contributed the essential components are Mikhail Grinchuk, Vladimir Volovich, Werner Lemberg, Frank Mittelbach, Jörg Knappen, Michel Goossens and Andrew Slepuhin, but the list is not restricted to these names only.

We are especially thankful to Vladimir Volovich and Werner Lemberg for their work on macro support for the X2 and T2∗ encodings and to the members of the mailing list CyrTeX-T2 who enthusiastically discussed the X2 and T2∗ problems. (To subscribe to this list send email to [email protected] with the command: subscribe cyrtex-t2 your-e-mail-address.)

We are grateful to Robin Fairbairns for his time spent polishing the text of our papers submitted to the EuroTeX-98 conference, where the preliminary results of this project (namely, the X2 encoding) were presented for the first time, and to Michel Goossens for his efforts to organize the support for Russian participants at EuroTeX-98.

Finally, although most work on this project was done on a voluntary basis (as is traditional in the TeX community), it is worth mentioning that part of the research was supported by a grant from the Dutch Organization for Scientific Research (NWO grant No 07–30–007).

References

[1] A. Berdnikov, O. Lapko, M. Kolodin, A. Janishevsky, A. Burykin. The encoding paradigm in LaTeX2ε and the projected X2 encoding for Cyrillic texts. Proceedings of EuroTeX-98, Saint-Malo, 1998.
[2] A. Berdnikov, O. Lapko, M. Kolodin, A. Janishevsky, A. Burykin. Alphabets necessary for various Cyrillic writing systems (towards X2 and T2 encodings). Proceedings of EuroTeX-98, Saint-Malo, 1998.
[3] A. Berdnikov, O. Grineva. Some problems with accents in TeX: letters with multiple accents and accents varying for uppercase/lowercase letters. Proceedings of EuroTeX-98, Saint-Malo, 1998.
[4] K. Píška. Cyrillic Alphabets, in: Proceedings of TUG'96, eds. M. Burbank and C. Thiele, pp. 1–7, JINR, Dubna; TUGboat 17(2), pp. 92–98.
[5] O. Lapko. Full Cyrillic: How many languages, in: Proceedings of TUG'96, eds. M. Burbank and C. Thiele, pp. 164–170, JINR, Dubna; TUGboat 17(2), pp. 174–180.
[6] K. Kenneth. The languages of the world. London, Henley.
[7] M. Ruhlen. A Guide to the languages of the world. Stanford University, 1975.
[8] C. F. Voegelin, F. M. Voegelin. Classification and Index of the World Languages. Academic Press, 1977.
[9] WWW page by Karel Píška: http://www-hep.fzu.cz/~piska/
[10] Ethnologue Database: ftp://ftp.std.com/obi/Ethnologue/eth.Z

⋄ A. Berdnikov, O. Lapko, M. Kolodin, A. Janishevsky, A. Burykin
Institute of Analytical Instrumentation
Rizskii pr. 26, 198103 St. Petersburg, Russia
[email protected]

Romanized Indic and LaTeX

Anshuman Pandey

1 Introduction

In 1990 at the 8th World Sanskrit Conference in Vienna, a panel of Indologists devised two encoding schemes which would enable them to exchange electronic data across a variety of platforms. These schemes are the "Classical Sanskrit" and "Classical Sanskrit eXtended" encodings, widely known in Indological circles as CS and CSX, respectively, or simply, CS/CSX.

2 CS and CSX

The CS and CSX encodings are currently the closest thing to an accepted standardization of the 8-bit transliteration of Indic scripts. CS/CSX is based on IBM Code Page 437, whose characters in the range 129–255 have been reassigned to characters traditionally used for the romanization of Sanskrit.

The accented French and German characters in the cp437 range 129–223 were not altered, in order to facilitate the input of these languages as well as English and Sanskrit. The accented characters required for Sanskrit were located as far as possible in the positions used by cp437 for graphic or mathematical symbols.

The re-encoding of cp437 was discussed in a document by Dominik Wujastyk titled Standardization of Sanskrit for Electronic Data and Screen Representation [1].

CS is a basic inventory of diacritic letters comprising the following characters traditionally used for the transliteration of Classical Sanskrit:

    ā Ā ī Ī ū Ū ṛ Ṛ ṝ Ṝ ḷ Ḷ ḹ Ḹ ṅ Ṅ
    ṭ Ṭ ḍ Ḍ ṇ Ṇ ś Ś ṣ Ṣ ṃ Ṃ ḥ Ḥ

CSX is an extension of the above which provides the following additional characters used in Vedic Sanskrit and in Prakrit:

    ṟ ā̆ ī̆ ū̆ ṉ ā́ ā̀ ī́ ī̀ ū́ ū̀ ṛ́ ṛ̀ ṝ́ ã ĩ ũ ẽ õ ĕ ŏ ḻ

Contrary to what the name indicates, CS/CSX is not limited to the transliteration of Sanskrit, and may be used to transliterate many other Indic scripts effectively.

3 ISO 15919 and CSX+

ISO/TC46/SC2/WG12, the International Standards Organization Working Group for the Transliteration of Indic, has been busy with the draft ISO 15919 standard [2]. This draft standard provides tables which enable the romanization of the Indic scripts specified in Rows 09–0D and 0F of UCS (ISO/IEC 10646 and Unicode).

This romanization is accomplished using plain ASCII 7-bit (ISO-646) characters, two or three roman characters often being required to represent a single Indic one. These tables provide for the Devanagari, Gujarati, Gurmukhi, Bengali (including Assamese), Oriya, Telugu, Kannada, Malayalam, Tamil, and Sinhala scripts. This draft is not yet a standard, although work is well advanced.

While ISO 15919 is still in draft stages, it appears that a consensus has been reached with regard to the form of transliteration. It was therefore decided that CS/CSX ought to be further extended to account for the new characters proposed in the draft standard. John Smith, Dominik Wujastyk, and I developed an extension to CS/CSX known as CSX+ (Classical Sanskrit eXtended+).

CSX+ aims to be downward compatible with CS/CSX, save for the relocation of two characters in positions used for non-breaking spaces in popular software packages. While seeking to implement the draft ISO 15919 standard, CSX+ also retains a useful set of European accented characters, dashes, and quotes.

Most of the new characters are those required for the draft ISO 15919 standard, which specifies the following characters which are unsupported by CS/CSX:

    l̃ ṁ ǣ Ǣ ŭ r̥ R̥ ŕ̥ r̥̀ r̥̄ r̥̄́
    l̥ l̥̄ ē Ē ē̃ ō Ō ō̃ ẏ
    r̆ m̆ m̐ n̆ ṯ ḵ k͟h K͟h ġ Ġ ĉ Ĉ ẖ ḫ

In addition to the above, the following further characters have been added as being centrally useful in any text encoding:

    “ ” – —

There is a single "European" accented character, ÿ, that CSX inherited from the original Code Page 437, but that is unlikely to be required for any Indian or European language. It has been eliminated to save one character slot. Other characters removed from the code page are the currency symbols sterling, yen, and cent, and the guillemets.

128 Ç      129 ü      130 é      131 â      132 ä      133 à      134 å      135 ç
136 ê      137 ë      138 è      139 ï      140 î      141 ì      142 Ä      143 Å
144 É      145 æ      146 Æ      147 ô      148 ö      149 ò      150 û      151 ù
152 ǣ      153 Ö      154 Ü      155 ŭ      156 ē̃      157 r̥      158 á      159 ṟ
160 space  161 í      162 ó      163 ú      164 ñ      165 Ñ      166 l̃      167 ṁ
168 ā̆      169 ī̆      170 ū̆      171 ā̃      172 ī̃      173 ṉ      174 r̥̄      175 l̥
176 l̥̄      177 ŕ̥      178 r̥̀      179 r̥̄́      180 m̆      181 ā́      182 ā̀      183 ī́
184 ī̀      185 ē      186 ō      187 R̥      188 ẏ      189 ū́      190 ū̀      191 r̆
192 ō̃      193 m̐      194 ṯ      195 Ē      196 Ō      197 n̆      198 ṛ́      199 ṛ̀
200 K͟h     201 ḵ      202 space  203 Ǣ      204 k͟h     205 ġ      206 ĉ      207 ṝ́
208 ã      209 ĩ      210 ũ      211 ẽ      212 õ      213 ĕ      214 ŏ      215 ḻ
216 ū̃      217 Ġ      218 Ĉ      219 ẖ      220 ḫ      221 –      222 —      223 “
224 ā      225 ß      226 Ā      227 ī      228 Ī      229 ū      230 Ū      231 ṛ
232 Ṛ      233 ṝ      234 Ṝ      235 ḷ      236 Ḷ      237 ḹ      238 Ḹ      239 ṅ
240 Ṅ      241 ṭ      242 Ṭ      243 ḍ      244 Ḍ      245 ṇ      246 Ṇ      247 ś
248 Ś      249 ṣ      250 Ṣ      251 ”      252 ṃ      253 Ṃ      254 ḥ      255 Ḥ

Table 1: CSX+ Character Encoding Table

The remaining assignments have been made on the basis that the best use for the small number of spare slots available is to use them for capitalised versions of those new characters with the most need for capital forms, i.e., characters capable of beginning a word.

4 Input encoding

As the use of LaTeX amongst Indologists has significantly increased, I felt that the CSX+ encoding scheme ought to be adapted for use with LaTeX through the inputenc package. To serve this end an input encoding definition file called cp437csx.def has been developed and placed on CTAN in the directory fonts/csx/styles/. Such an input encoding definition will make the typesetting of romanized Indic much easier, and might lead to the development of hyphenation patterns for romanized Sanskrit and other Indic languages.

The file cp437csx.def enables text encoded in CSX+ to be read and accurately typeset by LaTeX without the need for converting the CSX+ text into LaTeX accent codes. Table 1 provides a map of the CSX+ encoding character set. A screen font and driver for displaying CSX+ text on MS-DOS and OS/2 systems is also available in the directory fonts/csx/.

The stabilization of the CSX+ encoding, in tandem with the emerging ISO standard, will encourage further necessary work, such as hyphenation tables for romanized Indic.

References

[1] Wujastyk, Dominik. Standardization of Romanized Sanskrit for Electronic Data Transfer and Screen Representation [results of a session held at the 8th World Sanskrit Conference, Vienna, 1990], in Sesame Bulletin 4(1), 1991, pp. 27–29. Also available as a PostScript document from CTAN/fonts/csx/csx-doc.ps.
[2] Stone, Anthony [ed]. ISO Committee Draft 15919: Transliteration of Devanagari and Related Scripts into Latin Characters. Available at http://ourworld.compuserve.com/homepages/stone_catend/trdcd1c.htm.

⋄ Anshuman Pandey
University of Washington
Department of Asian Languages and Literature
225 Gowen Hall, Box 353521
Seattle, WA 98195
[email protected]
http://weber.u.washington.edu/~apandey/

New Greek fonts and the greek option of the babel package

Claudio Beccari and Apostolos Syropoulos

Abstract

A new complete set of Greek fonts and their use in connection with the babel greek extension is described. Some suggestions are proposed so as to enhance some TeX related utilities and some LaTeX2ε font description macros.

1 Introduction

TUGboat has already published several papers on the possibility of typesetting Greek with (La)TeX. Perhaps the first paper was the one by Silvio Levy [1] who, so to speak, set forth a standard of fonts and macros in order to set texts both in English (or any other "Latin" written language) and Greek.

His fonts, prepared to be generated with METAFONT v.1.x, were very good replicas of the "standard" Didot Greek fonts; besides digits and punctuation marks, they contained the 24 upper case letters and the 25 lower case ones with all kinds of accents and breaths, and implied such ligatures as to insert the diacritical signs by inputting the corresponding keystrokes before (or after, for the subscript) the letter to be marked. The correspondence between the keys of a "Latin" keyboard (namely the US keyboard) and the Greek letters and the diacritical signs was established in such a way that everybody, except perhaps Macintosh users who have access to utilities for configuring their keyboards for any "alphabet", got acquainted with Levy's convention. Even people, like Beccari, whose mother language is not Greek can read Greek compuscript text set on the screen with Latin characters just as easily as real Greek text set with true Greek fonts.

Levy's fonts exploited the METAFONT ability to describe fonts with 256 glyphs, but at that time drivers could generally handle only 128 glyph fonts. Haralambous [3] therefore developed a set of Greek fonts with 128 glyphs by reducing Levy's set in such a way as to keep the substantial and most frequent accent vowel ligatures, and to do away with the automatic setting of the initial/medial sigma as opposed to the final sigma.

Mylonas and Whitney [5] soon after proposed a set of fonts based on a main 256 glyph font and its adjunct font (a 128 glyph one) by which they could cover the extended necessities of the full set of Greek glyphs, including any sort of breath-accent-vowel-iota subscript combinations, and the ου ligature with its diacritics; in this way they implemented each "alphabet" with more than 300 glyphs. Of course, due to TeX limitations, they had to make some compromises for hyphenation of Greek text; as TeXies well know, TeX can hyphenate only words composed of glyphs taken from the same font, so that when the adjunct set was called for, TeX was unable to hyphenate.

Haralambous [4] also described the hyphenation of ancient Greek, but for several years no further advances were made in the field of typesetting Greek, because (we suppose) LaTeX was undergoing one of its major changes, the transformation into LaTeX2ε with its standard use of the NFSS (New Font Selection Scheme), which has the ability to deal with several font encodings, and the babel package was getting richer and richer thanks to the introduction of the Cork "double" font encoding with 256 glyphs, which tremendously facilitated typesetting all those European and extra European languages that use lots of diacritical signs; moreover Haralambous had started the enormous task of developing Ω, a descendant of TeX that can handle 16 bit Unicode coded fonts.

Dryllerakis [6] also generated with METAFONT a set of Greek fonts that included the regular, boldface, slanted and italic typefaces; these fonts are to play an important role, as we shall see soon.

For Greek it was necessary to wait for the constitution of the Greek Society of TeX Users, in order to have the enthusiasm and the will to sit down and prepare a complete set of babel macros and environments capable of handling the necessary change of font encoding (with the corresponding changes to the catcodes of the various extended ASCII codes) and the switching in and out of Greek or Latin font typesetting.

Syropoulos took the initiative of collaborating with Johannes Braams, the author and curator of babel [10], for writing the greek option to the babel package. When Syropoulos first wrote his greek.ldf language definition file, he made reference to the Dryllerakis fonts, which were the most complete set at that time.

Apparently everything was settled for the Greek authors and for the Hellenists around the world, because they now had all the necessary tools for typesetting Greek texts, both as main ones and as citations within "Latin" written ones.

Beccari, triggered by a teacher of classical Greek in Italian high schools, was induced to get strongly involved in producing a set of tools for his friend; when he had last examined the CTAN archives he had not noticed the greek extension to babel nor the Dryllerakis fonts; both had been there for a certain time, but nevertheless he missed them. He started working on Levy's fonts, but he wanted to generate something that could be just as versatile and complete as Jörg Knappen's ec "Latin" fonts are [7]. When he finally discovered the existence of both the Dryllerakis fonts and the greek babel extension by Syropoulos, he found out that his work, after all, was not a complete waste of time.

In fact Syropoulos could not avoid some short cuts because of the lack of optically compatible sans-serif Greek fonts, to the point that the "new" LaTeX2ε font changing commands had to refer to other (not by Dryllerakis) Greek font families in order to avoid too many font substitutions.

2 The cb Greek fonts

When Beccari submitted his fonts to Syropoulos, the latter agreed that the cb set was more complete and supported the former with lots of helpful suggestions. Beccari ended up with a set of fonts [8], which he immodestly called the cb Greek fonts (just as Kostis Dryllerakis' fonts are named the kd Greek fonts after him), which is very rich in families, series, and shapes, so that all LaTeX2ε font changing commands refer to a specific font, and new commands may be defined in addition to the standard ones. According to Beccari's idea, all the Greek fonts are supposed to be optically compatible with one another and with the corresponding "Latin" ones. See the appendix for a sample of mixed text written with several families and shapes.

His work led him to conclude that the NFSS notions of encoding, family, series, and shape are possibly incomplete for describing a set of fonts, or at least Beccari's imagination was not wide enough to find a better description of the font characteristics.

Beccari decided that his fonts had to be based on Knappen's algorithm for interpolating the various font parameters, just as Knappen's ec fonts are. So he "borrowed" Knappen's interpolating macros; such METAFONT macros work perfectly with the ec fonts; if they also work well with the cb fonts, the merit is Knappen's alone; should they behave improperly, the demerit is Beccari's alone.

Beccari worked on the families, series and shapes listed in Table 1; the boldface series applies to all families except the monospaced ones. The outline family has only the medium and bold extended series. Fonts for slides comprise both proportional and monospaced, visible and invisible varieties, but lack the serifed proportional shapes, just as happens for the "Latin" ec fonts.

As Table 1 clearly shows, the cb font set is even wider than the standard ec fonts directly accessible with the standard LaTeX2ε font changing commands. Actually the ec fonts include some shapes that require special commands to become usable in a document, or require some modifications to the standard font description files.

Syropoulos' greek.ldf language definition file contains all the necessary commands to invoke any of those valid family, series and shape combinations, and the accompanying .fd font description files behave accordingly.

3 METAFONT considerations

It is necessary at this point to underline a drawback of the ec and cb METAFONT source files. METAFONT produces the .tfm and .????gf¹ files, whose name derives from the jobname special METAFONT string variable; this jobname is assigned a value equal to the name of the first file input by the specific METAFONT run.

¹ The ???? part of the file extension may be void, or it is formed by the product of the resolution times the magnification of the METAFONT mode used.

This approach does not produce any inconvenience with the standard cm fonts; with the ec and cb fonts, the name of which includes a numerical part equal to (one hundred times) the design size of the specific font, it is somewhat redundant to have hundreds of small files containing just two or three lines of METAFONT code; in fact they simply specify the design size again and then input a "generic" driver file whose task is exactly that of interpolating the font parameters. Take for example the main file for the roman medium normal ec (Latin) font:

    % This is ecrm1000.mf in text format ...
    if unknown exbase: input exbase fi;
    gensize:=10;
    generate ecrm

After inputting, if necessary, the base file for the ec fonts, the file sets the design size to a value that actually is already part of the file name (apart from a factor of 100), and finally inputs the "generic" file for that family, series and shape.

The cb main files are even simpler; for example the main file for the regular medium normal cb (Greek) font is:

    input cbgreek;

(and the same line is contained in any other cb main file); the trick lies simply in the fact that the file cbgreek generates the design size directly from the jobname, and from the same jobname it extracts the "generic" driver file name specific for that family, series and shape.

Table 1: The cb Greek font families, series and shapes

Family                  Series                    Shape
regular                 medium                    normal
outline                 bold extended             oblique
sans serif              monospaced                italic
typewriter              invisible                 upright italic
sans serif for slides   bold extended invisible   caps-and-small-caps
typewriter for slides   monospace invisible

Where is the drawback, then? It consists in the echo "Nothing done " hundreds of small specific main files necessary for echo "======" generating the jobname correctly, instead of working goto endbatch the other way around. This implies that the file :dpi300 system gets overloaded with hundreds of small files, set dpi=300 individually smaller than the smallest addressable set mfmode=cx disk memory unit, that nevertheless clog the disk goto dpiset with unnecessary information. :dpi600 The excellent trick devised by Knappen of in- set dpi=600 cluding the design size directly in the file name, so set mfmode=ljfive that the jobname is assigned the correct value, is :dpiset sort of hijacked by the rigidity of METAFONT that if exist %1.mf del %1.mf does not allow to assign a value to jobname with an echo input cbgreek; > %1.mf explicit assignment. If one could run METAFONT mf \mode:=%mfmode%; mag:=1; input %1 with a command such as: if errorlevel 1 goto endbatch mf \mode:=ljfive; mag:=1; gensize:=10; gftopk %1.%dpi%gf input ecrm c:\texmf\fonts\pk\%mfmode%\beccari \cbgreek\dpi%dpi%\%1.pk with ecrm.mf starting with move %1.tfm jobname:=jobname & decimal(100gensize); c:\texmf\fonts\tfm\beccari\cbgreek no specific main files would be necessary allowing rem for an enormous saving of disk space. In comput- del %1.%dpi%gf ers with file systems that are not too smart, each del %1.mf main file, although smaller than a block or sector del %1.log (512 bytes), may reserve up to 16Kb or 32Kb of disk set dpi= space, and the hundreds of main files necessary to set mfmode= generate the ec fonts may take up several megabytes goto endbatch of disk space. 
:message With the cb fonts Beccari got used to a simple echo Font name missing batch file2 that generates the specific main file, runs :endbatch METAFONT, and then deletes the now unnecessary 3 With this strategy Beccari has the advantage main file : that unnecessary files are always immediately deleted @echo off and only the .tfm and .pk files of the fonts effec- if "%1"=="" goto message tively employed are kept on disk. if "%2"=="" goto dpi600 The disadvantage is that the above batch file if "%2"=="600" goto dpi600 can not be handled by those utilities that automat- if "%2"=="300" goto dpi300 ically run the generation of .tfm and/or .pk files echo "Density %2 non allowed" with those TEX systems that may call on the fly 4 2 such programs as MakeTeXtfm and/or MakeTeXpk . Batch file refers to DOS and related operating systems; AtthesametimebothMakeTeXtfm and MakeTeXpk other operating systems may use the terminology of script or command file. are smart enough to perform more elaborate tasks 3 For typesetting reasons some lines are wrapped to the following line; in other words an indented line should be imag- 4 Such utilities with the Windows95 or NT based MikTeX ined as a continuation of the preceding one. version 1.10 or later become maketfm and makepk respectively. 422 TUGboat, Volume 19 (1998), No. 4

than simply running METAFONT. They could be Actually the genb “file name generating func- modified so as to handle both the ec and the cb tion” that is used in the font description files for the fonts in a way similar to the above simple batch file. ec (and the cb) fonts is powerful enough to accept Since MakeTeXtfm and MakeTeXpk are able to any font size, not only those that are specified in recognize the font group from the name first let- such files. If the font description file for the T1-cmr ter(s), it would be very simple to add the following family, for example, contained simple lines such as: file to the ec file bundle: \DeclareFontShape % File ecfonts.mf {T1}{cmr}{m}{n}{<-> genb * ecmr}{} % General driver file for ec fonts a user could specify in his source .tex file: if unknown exbase: input exbase fi; string f_name, f_size; \DeclareFixedFont f_name:=substring(0,4) of jobname; {\myfont}{T1}{cmr}{m}{n}{26} f_size:=substring(4,8) of jobname; causing the NFSS macros to ask LATEX2ε to look scantokens("gensize:="&f_size&"/100"); for the external file ecrm2600.tfm, possibly shelling scantokens("generate "&f_name); out in order to run MakeTeXtfm should that file still so that substituting ecfonts to cbgreek, the pre- be missing. Analogous definitions are usable with the cb fonts. vious batch file and/or the enhanced TEX utilities could directly generate the .tfm and .pk files with- In a similar way the sizes for text, text math, out the need of overloading the file system with use- display math, script and sub-subscript, could be de- clared differently from the standard sequence with less files; notice also that since the main driver files 4 are generated on the fly, if you specify in your source geometrical ratio 1.2; why not √2 or the square root A of the golden section? LTEX2ε file something like: The enhancement of the utilities MakeTeXtfm \font\myfont=ecrm2600 and MakeTeXpk, and of the font description files, as i.e. 
an ec font with a design size that is not in- suggested in the preceding paragraphs, would be very handy in the sense that Kappen’s extended cluded in the standard font description file, on run- 5 ning LATEX the required ecmr2600.tfm is not found, fonts (besides the new cb Greek fonts) could be so that the system is forced to shell out in order to treated almost as vector fonts, at least in the range generate it; this task may be performed without er- of 5 pt–99.99 pt. MakeTeXtfm ror messages by the proposed enhanced 4 babel greek extension utility. This point sets forward another one: the font Greek typesetters as well as Ellenists do not have A description files .fd for the ec fonts (and for the to bother about the way LTEX2ε handles the en- moment also for the cb ones) make use of a “name coding of the Greek fonts compared with that of the generating” function genb that most LATEX2ε users Latin ones. Font encoding and character catcodes are unaware of, because, although it frequently gets are dealt with by the internal macros invoked by babel 6 to play its role, it always operates behind the scenes. ’s greek extension behind the scenes . This function generates the external font name when Two Greek languages are actually defined: a particular non-preloaded ec font is called for; but greek and polutonikogreek; they share the same because of the way it is used in the font description hyphenation pattern set, but they typeset internal files, it can generate names only for the specified Greek words according to the modern “monotonic” sizes, not for any size, although the ec fonts (as well (default) accent system, as opposed to the classical as the cb fonts) can be generated for any size from “polytonic” one. Actually there is no other differ- 5 pt to 99.99 pt. ence. 
Although the ec and the cb fonts are not vec- Accents are introduced by means of some or- tor fonts (well, unless their source files are compiled dinary ASCII characters, not by means of control with Mike Vulis’ Vmf METAFONT interpreter that characters as it is the case with Latin alphabets: comes with VTEX, or unless they are treated with specifically ’ and " are the only ones dealt with Syropoulos’ perl script mf2pt3 [9] for the generation by the monotonic system, while with the polytonic of Type 3 fonts, that scale pretty well), under cer- 5 tain points of view they are not too different from That is, not only the ec fonts, but also the Text Com- the other vector fonts, in the sense that they may panion fonts that are identified with similar names and gen- erated with the same interpolating macros. be properly scaled (to be precise, designed to the 6 The functionality described here relates to what can be proper size) to almost any size. achieved with release 3.7 of the babel package. TUGboat, Volume 19 (1998), No. 4 423

one there are also ‘, ~, <, and >, plus the iota subscript |. Since these characters are to be treated in a special way, they are catcoded as letters; uppercasing changes them (except the diaeresis) to a dummy invisible character so that they disappear. Prefixed accents, breaths, and diaeresis and the postfixed iota subscript interact with the font specifications so as to produce ligatures; those signs that may be prefixed can be specified in any order, that is, >’a and ’>a produce the same result; with monotonic spelling, "’i and ’"i again produce the same result.

The same ligature mechanism controls the use of final as opposed to initial/medial sigma; if the typesetter is used to typing the letter ‘c’ for the final sigma and the letter ‘s’ for the initial/medial one, he can keep doing so, but if he always types ‘s’, the font recognizes the end of the word and uses the proper sigma in the final position; t’onos and t’onoc in the input file produce the same result in the output file: τόνος.

One simply declares the language selection with the traditional babel command

    \selectlanguage{greek}

and for short citations in “Latin” characters within Greek text it is possible to use \textlatin{...}, which behaves exactly as any other font changing command; conversely, a short Greek citation within another language may be inserted by means of \textgreek{...}. The corresponding declarations are \latintext and \greektext.

While in Greek mode, the usual font changing commands, such as \emph or \textbf or \sffamily, perform as they are supposed to, except that they operate on the Greek fonts. In addition to the other font changing commands, the new command \textol{...} allows typesetting with the outline Greek fonts. Remember though that all these commands obey the grouping rules, so that when a group is closed, the font parameters revert to the values they had before entering that group. If you change size, for example, within the Greek environment, when you close the Greek citation you automatically get back to \normalsize or whatever size you had before.

The new commands \greeknumeral{...} and \Greeknumeral{...} allow typesetting a counter value or an explicit number in Greek lowercase or uppercase numerals. With Syropoulos’ additional package athnum.sty it is possible to set numbers with the Athenian numerals, whose glyphs are already contained in the cb fonts.

With reference to numbers, and therefore to mathematics, it may be worth noticing that Syropoulos also wrote the package grmath.sty, which makes it possible to “hellenize” all the log-like operator names such as log, sin, cos, ... This is intended especially for Greek authors who write school books for young people who are not used to the corresponding (standard) Latin names.

Of course \today, while in Greek mode, typesets the date with the Greek names for the months, but keeps Arabic numerals for the day and the year. Another command, \Grtoday, typesets the Greek date using the Greek numerals for the day and the year.

In the LaTeX2ε enumerate environment the numbering is redefined in such a way as to use Greek numerals for the numbered items. More specifically, the LaTeX2ε internal commands that translate a number into a lower or upper case letter are redefined in such a way that the number is converted to a Greek numeral expressed by a suitable combination of the 29 Greek numeral symbols; therefore even page numbering, if requested in alphabetic form, turns to Greek numbering while in Greek mode.

Of course, before using the greek (or the polutonikogreek) babel extension, it is necessary to rebuild the format with the inclusion of the Greek hyphenation patterns. The Greek bundle also includes the suitable hyphenation file, but it does not become effective (as for any other language) until the format file is rebuilt. Although the babel package documentation is explicit on this point, most users fail to notice it, or maybe they assume that format rebuilding is automatic.

The necessary steps are quite simple; after verifying that one has the babel bundle on the hard disk, it is necessary to locate the file language.dat, and edit it so as to append the following lines⁷

    greek grhyph.tex
    =polutonikogreek

then run the TeX initializer according to the instructions that come with the TeX system. With MikTeX it is very simple, since it suffices to give the command

    makefmt latex

from a DOS window; with other systems the procedure is very similar, although it might be necessary to give more instructions and/or move files from one directory/folder to another.

⁷ This might be the right moment for checking that the loaded hyphenation files correspond exactly to the languages one wants to use; it is convenient to check also that the file names correspond exactly to those one has on the hard disk TeX search path; if necessary, the proper files can be fetched from CTAN.

424 TUGboat, Volume 19 (1998), No. 4
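The commands described above can be gathered into a small test file. The following is only a minimal sketch, assuming that the greek option of babel and the cb fonts are installed and that the format has been rebuilt with the Greek hyphenation patterns; the sample words and the number 1998 are illustrative choices, not taken from the bundle documentation.

```latex
\documentclass{article}
% English is the main language here (last option); Greek is switched on explicitly.
\usepackage[greek,english]{babel}
\begin{document}
A short Greek citation within Latin-alphabet text: \textgreek{t'onos}.

\selectlanguage{greek}
% Latin transliteration input: an "s" at the end of a word becomes a final sigma.
t'onos \textlatin{(a short citation in Latin characters)}

% Font changing commands now act on the Greek fonts.
\emph{t'onos} \textbf{t'onos}

% A number in lowercase and in uppercase Greek numerals.
\greeknumeral{1998} \Greeknumeral{1998}

% The date with Arabic numerals, then with Greek numerals.
\today\quad \Grtoday
\end{document}
```

If the hyphenation patterns have not been loaded into the format, a file like this should still compile, but babel is expected to warn that no patterns are available for Greek.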

In order to complete the collection of useful files that complement the basic greek option to babel, it is worth noticing that Syropoulos also wrote a number of other files that can be very useful for Greek typesetters as well as other Hellenists.

A first file, iso8859-7.def [11], extends the collection of distributed encoding files so as to map directly the keystrokes of a standard Greek keyboard to the internal cb font codes; this allows people to enter their LaTeX Greek text using actual Greek characters, which makes the greek option more user friendly.

A second set of files [11] generates either a single cb font driver file (a perl script gendrv) or the complete sets of both the text and slide fonts driver files (cbstdedt.tex and cblstded.tex respectively). At the moment these files are precious; should makeTeXtfm and friends be enhanced as suggested in this paper, their utility would be confined to those systems that do not allow shelling out.

A third file is hellas.bst [12], a bibliography style to be used with BibTeX for creating mixed bibliographies (Greek and non-Greek) in a “consistent” way.

5 Conclusion

The appendix shows the appearance of several Greek fonts in line with the corresponding Latin ones; of course the different shape of the single glyphs makes the change between Latin and Greek very evident, but the use of the same font parameters both for the overall alphabet and for the single characteristics of the strokes (fine, crisp, ... lines; vertical and horizontal upper and lower case strokes, upper case serifs, etc.) guarantees that there is some optical compatibility between the corresponding Latin and Greek alphabets. Nevertheless the sans serif family turned out more difficult than expected, so that a few font parameters had to be modified.

The italic shape for all families and series was completely redesigned, trying to get some inspiration from the elegant italic shape named Olga produced by the Greek Font Society [13]; since Beccari is neither an artist nor a good programmer, the result cannot even be compared to the original Olga font, but if the constraints imposed by the “metaness” are taken into account, the results may be considered acceptable.

Criticism and suggestions, of course, are welcome.

References

[1] Levy S.: “Typesetting Greek”; the file greekhistory.tex is available from the CTAN archives.
[2] Levy S.: “Using Greek Fonts with TeX”, TUGboat, 9(1):20–24.
[3] Haralambous Y. and Thull K.: “Typesetting Modern Greek with 128 Character Codes”, TUGboat, 10(3):354–359.
[4] Haralambous Y.: “Hyphenation patterns for ancient Greek and Latin”, TUGboat, 13(4):457–469.
[5] Mylonas C. and Whitney R.: “Complete Greek with adjunct fonts”, TUGboat, 13(1):39–50.
[6] Dryllerakis K.: “The kd Greek fonts”, available from the CTAN archives.
[7] Knappen J.: The ec fonts released on 1997/01/17 together with the sixth upgrade of LaTeX2ε; the previous (almost definitive) temporary release, called the dc fonts, was described in “Release 1.2 of the dc-fonts: Improvements to the European letters and first release of text companion symbols”, TUGboat, 16(4):381–387.
[8] Beccari C.: “The METAFONT source files for the cb fonts”, available from the CTAN archives in the directory tex-archive/language/greek/cb/mf.
[9] Syropoulos A.: mf2pt3 perl script, available at http://obelix.ee.duth.gr/~apostolo.
[10] Beccari C., Braams J., Syropoulos A.: “The babel bundle for the Greek language”, will be available from the CTAN archives together with the new release 3.7 of babel. A preliminary version may be found in ftp//obelix.ee.duth.gr/pub/TeX.
[11] Syropoulos A.: iso8859-7.def, gendrv, cbstdedt.tex, and cblstded.tex are available in the CTAN archives in the directory tex-archive/language/greek/cb/misc.
[12] Syropoulos A.: hellas.bst is available in the CTAN archives in the directory tex-archive/language/greek/cb/BiBTeX.
[13] Matthiopoulos G.D.: “Oblique or Italics? A Greek Typographical Dilemma”, in Greek letters – From Tablets to Pixels, Makrakis M.S. ed., Oak Knoll Press, New Castle, Delaware, 1996.

⋄ Claudio Beccari
  Politecnico di Torino
  Turin, Italy
  [email protected]

⋄ Apostolos Syropoulos
  Xanthi, Greece
  [email protected]

Appendix

[Font samples: six specimens, each beginning with the Latin text “This is the beginning of” followed by the opening verses of the Gospel of John set in the corresponding Greek font.]

[Further samples, labelled “Date”, “Greek date”, and “Athenian numerals”.]

Hints and Tricks

‘Hey – it works!’

Jeremy Gibbons

Welcome to ‘Hey – it works!’, a column devoted to (La)TeX tips, tricks and techniques. In this issue we have three articles: one from Jeroen Nijhof showing how to get multi-letter ‘initials’ in BibTeX; one from me, showing how to use a text symbol (in this case, a hyphen) as an operator in maths; and one from Christina Thiele on generating ornamental rules. My backlog of articles is running low, so please send them in!

⋄ Jeremy Gibbons
  CMS, Oxford Brookes University
  Oxford OX3 0BP, UK
  [email protected]
  http://www.brookes.ac.uk/~p0071749/

1 Controlling abbreviations in BibTeX

In the TeX newsgroup, comp.text.tex, someone asked how to get BibTeX to abbreviate Yuri to ‘Yu.’, in styles in which the author’s first names are abbreviated (after all, ‘Yu’ is one letter in Russian). Take for example

    @Article{govorukhin,
      author = {Yuri Govorukhin},
      title = {What we sow, that we reap the fruits of},
      ... }

How to get the author abbreviated to ‘Yu. Govorukhin’? Simply putting braces around the ‘Yu’ does not work. Somewhat surprisingly, BibTeX will not abbreviate {{Yu}ri Govorukhin} at all.

But if one defines a macro \Yu in the preamble that expands to ‘Yu’,

    @Preamble{"\newcommand\Yu{Yu}"}

then {{\Yu}ri Govorukhin} will be abbreviated to ‘Yu. Govorukhin’.

But as Oren Patashnik, the author of BibTeX, remarked, this solution has one problem. Namely, the ‘Yu’ will be neglected for sorting purposes, so this way the author would be sorted as if he were ‘ri Govorukhin’. To get both a proper abbreviation and proper sorting, one needs to use a feature of BibTeX (which is discussed in the BibTeX documentation btxdoc.dvi): BibTeX considers a special character, which is everything from a left brace at the top level directly followed by a backslash up to the matching right brace, as one character, and all characters in it will be taken into account for sorting purposes. So one can define a one-parameter macro \oneletter:

    @Preamble{"\newcommand\oneletter[1]{#1}"}

and {{\oneletter{Yu}}ri Govorukhin} will give the desired effect. There are more tricks in the preamble of xampl.bib in the BibTeX distribution.

In our case, though, we can get by without a preamble. The macro in the special character does not have to do anything; it only has to provide a backslash after the opening brace. So we can simply write

    author = {{\relax Yu}ri Govorukhin}

and here is some text.

⋄ Jeroen H. B. Nijhof
  Aston University
  Photonics Research Group
  Birmingham, UK
  [email protected]

2 A small minus sign

In a recent paper (Deriving Tidy Drawings of Trees, Journal of Functional Programming 6(3) p. 535–562, May 1996) I had to make a visible distinction between unary minus (for example, the number ‘minus three’) and binary minus (for example, the subtraction ‘four minus three’). I decided to keep the ordinary minus symbol for subtraction, and to use a hyphen for negation: ‘2 − -3’.

The quick-and-dirty way to achieve this is to use an \hbox:

    \def\minus{\hbox{-}}

This looks fine most of the time (‘2 − -3’). Unfortunately, the hyphen does not change size in superscripts and subscripts; compare ‘x^{y^{2 − -3}}’ set with a full-size hyphen with how it should look (‘x^{y^{2 − -3}}’ with the hyphen at superscript size).

A better way is to use TeX’s \mathchardef:

    \mathchardef\minus="002D

This defines \minus to be an ordinary symbol (the first 0), taken from maths family zero (the second 0), using the character in position 45 (the 2D) of that font, which in the case of Computer Modern Roman happens to be the hyphen character. (There are no guarantees in other fonts, of course!) The LaTeX2ε way of writing this is

    \DeclareMathSymbol{\minus}
      {\mathord}{operators}{"2D}

but this has to go in the preamble of the document.

⋄ Jeremy Gibbons
  Oxford Brookes University
  [email protected]
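To see the small minus sign in context, the declaration can be dropped into a complete test document. This is a minimal sketch, assuming the default Computer Modern fonts (where slot "2D of the operators font is the hyphen); the formulas are illustrative.

```latex
\documentclass{article}
% Preamble-only declaration: an ordinary symbol taken from the
% "operators" symbol font, slot "2D (the hyphen in Computer Modern Roman).
\DeclareMathSymbol{\minus}{\mathord}{operators}{"2D}
\begin{document}
% Binary minus for subtraction, hyphen-like \minus for negation.
$2 - \minus 3$

% Unlike the \hbox version, \minus shrinks in superscripts and subscripts.
$x^{y^{2 - \minus 3}}$ \quad $a_{\minus 1}$
\end{document}
```

Because \minus is declared as a \mathord, TeX puts no binary-operator spacing around it, so it sits close to the number it negates.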

3 Ornamental rules

Something you run across in books on occasion are ‘type ornaments’: those looping trellises and lattices and swirls often used to indicate the top or bottom of a titlepage, for example, or sometimes the top of a chapter titleblock (called ‘headbands’, I believe), or to separate sections within a chapter. What I noticed was that sometimes the ornaments were nothing more than a fairly simple character repeated a few times:

    ÷÷÷÷÷÷÷

or all the way across the page, like this:

    ⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄⋄

So how can I get TeX to do this? Poking around in The TeXbook (pp. 353 and 357), I found the following macros:¹

    \def\m@th{\mathsurround=0pt }
    \def\dotfill{\cleaders\hbox{$\m@th
      \mkern1.5mu . \mkern1.5mu$}\hfill}

(the \m@th is to switch off spacing around mathematical material, which complicates the measurements). You can use this in a table of contents, for example:

    3. Typographic use of leaders ...... 45

So that’s how to get a repeating pattern going. Here is the macro that does all the work:

    \mathsurround=0pt
    \def\motif#1{\cleaders\hbox{#1}\hfil}

It’s now a simple matter to replace the period and \mkern amounts with other values. Here’s a definition using the \diamond, which I wanted to have very close together:

    \def\diamondpattern{$\mkern-.6mu
      \diamond \mkern-.6mu$}
    \motif{\diamondpattern}

But this only defines the repeating motif. To make use of it, I’ve got a macro which takes the motif, and allows you to specify how long the sequence should span: e.g., 5cm or even full page width:

    \def\border#1#2{\leavevmode
      \hbox to #2{\motif{#1}}}

The first argument to \border is a motif to repeat, and the second argument is a dimension; for example,

    \def\divpattern{$\div \mkern12mu$}
    \border{\divpattern}{5cm}

or, to span the full width of the column,

    \border{\divpattern}{\columnwidth}

The macro is designed to be used in horizontal mode, so it can also be used inside LaTeX’s center environment, as in the first example, above.

¹ I’m sure there are other ways to do this; so far, this one works fine for me.

Some tips:

1. First place to raid: cmmi and cmsy have lots of promising items; I’ve already tried \diamond, \vee, \wedge, \dagger and \ddagger, \bowtie, \div, \widetilde, ...

2. Other fonts which may contain some surprises: msam and msbm, and of course, any dingbat font.

3. The value of \mkern must be given in ‘mu’ math units; there are 18mu in 1em (see p. 68 of The TeXbook). While putting \mkern units both before and after the symbol works for some characters, others work better if only used after. Changing the spacing can really alter the flavour of the ornamental line, so experiments are useful.

4. Note that \widetilde needs an argument in order to know how wide it must be: I used \widetilde{\qquad}.

5. It’s also fun to combine symbols: \times\div yields ‘×÷×÷×÷×÷×÷’. Making the pattern start and finish with the same symbol is a little trickier. Because the pattern produced by \cleaders is centered within the area to be filled, it may be padded with a little extra space at either end, which makes positioning the extra symbol difficult:

       ×÷×÷×÷×÷×÷ ×

   A better solution is to use \leaders instead, as these are aligned with the smallest enclosing box (see p. 224 of The TeXbook); then the extra symbol can be placed at the start, with no intervening space. The width of the repeating section is the stated width less the width of the first symbol.

       \def\bordertwo#1#2#3{{%
         \setbox1=\hbox{#1}%
         \dimen0=#3\advance\dimen0 by -\wd1
         \leavevmode\hbox{#1\hbox to \dimen0
           {\leaders\hbox{#2#1}\hfil}}%
       }}

   Then ‘\bordertwo{$\times$}{$\div$}{3cm}’ yields

       ‘×÷×÷×÷×÷×’

   (which is no longer nicely centered, but that’s life).

⋄ Christina Thiele
  Nepean, Ontario, Canada
  [email protected]

428 TUGboat, Volume 19 (1998), No. 4

The Treasure Chest

A package tour from CTAN – soul.sty

When the Treasure Chest is CTAN, there’s so much to choose from. But even worse . . . there are so many packages that keep being added! And how to even find out about them, if you don’t keep up with notices posted to the newsgroups? This column is one way to try to bring some of these treasures to TUGboat readers, with a quick introduction to the package and some examples of what it can do.

This is the first such column; let me know what aspects are most useful and which ones less so, what additional facets should be examined, what other packages cover some of the same issues, and which packages you prefer.

1 Quick tour

Package: soul.sty
  This is version 1.2, dated 11 Jan. 1999. Upon processing, the file changes.tex is generated, and describes the differences (the file is also inside the .dtx file¹).

Explanation of the name: “[it] is only a combination of the two macro names \so (space out) and \ul (underline) – nothing poetic at all...”

Keywords: spacing out, letterspacing, underlining, striking out

Purpose: soul.sty provides hyphenate-able letterspacing, underlining, and some variations on each. All features are based upon a common mechanism for typesetting text syllable-by-syllable, using TeX’s excellent hyphenation algorithm to find the proper hyphenation points. As well, two examples are presented to show how to use the interface provided to address such issues as ‘an-a-lyz-ing syl-la-bles’. Although the package is optimized for LaTeX2ε, it works under plain TeX and LaTeX 2.09, and is compatible with other packages, too.

Author: Melchior Franz
  [email protected]

Compatible with: plain, LaTeX (old and new).
  Note: the documentation describes some restrictions when the soul package is not used with LaTeX2ε.

Location on CTAN: /macros/latex/contrib/supported/soul

Files to fetch: soul.dtx and example.cfg.²

How to install: Put files with your other class and style files on your system. Read the top portion of soul.dtx (or the file soul.txt) for instructions on processing the files (you will need LaTeX2ε). Notice that the soul.sty package is not actually on CTAN; it uses the .dtx method of documentation, a wonderful feature in LaTeX2ε. If you’re unfamiliar with how this works, see footnote 1 for a general overview.

Files generated: soul.ins, soul.dvi (documentation), soul.toc, soul.sty, changes.tex (as well as the usual soul.aux and .log files).

¹ Documented source (.dtx) files are a combination of macros and documentation, an evolution of Frank Mittelbach’s original DocStrip utility. There are usually two steps: run LaTeX over the .dtx file to get the documentation, and run LaTeX over its matching .ins file to generate the style files, which are extracted from the .dtx file (sometimes the .ins file is itself generated by the first step, which means you only have to pick up the one .dtx file, which is even more compact packaging).

² Note: CTAN also has the file soul.txt (description of package + processing instructions), and soul.ins, which can either be fetched, or generated by processing the .dtx file.

2 Documentation

The documentation is so extensive (26 pages long), with explanations, examples of basic use and variations, that little needs to be said here!

The opening pages are a pleasant introduction to the general notions of emphasis, however it is achieved, and the various opinions which exist on the suitability of their use. There is a pragmatism expressed here, offering the user the choice of options, leaving the reasons for such choices to the user. The user portion of the documentation provides extensive examples and explanations for creating the various effects (underlining, overstriking, letterspacing).

Chapter 7 (pp. 14–25) provides a detailed explanation of the macros themselves, along with some additional points and tips, so do glance through it.

One nice addition from the author (in collaboration with Stefan Ulrich) is a sample configuration file, example.cfg, which shows how to select specific spacing values for different fonts automatically, and store them for local use. As well, the local file (call it soul.cfg, and hooks exist to read it in automatically via soul.sty) can be used to store other changes to the package default settings, thus avoiding making changes in either the style file or inserting the customizations into individual source files.

2.1 Table of Contents

1. Introduction

2. Typesetting rules
   (a) Theory . . .
   (b) . . . and Practice
3. Modes and options
   (a) LaTeX2ε mode
   (b) Plain TeX mode
   (c) Command summary
4. Letterspacing
   (a) The macros
   (b) Some examples
   (c) Typesetting Fraktur
   (d) Dirty tricks
5. Underlining
   (a) Settings
   (b) Some examples
   (c) The dvips problem
6. How the package works
   (a) The kernel
   (b) The interface
   (c) Doing it yourself
   (d) Common restrictions
   (e) Known features (aka bugs)
7. The macros

\so{electrical industry}
    Output: electrical industry (in a narrow column: elec- / tri- / cal in- / dus- / try)
    Ordinary text can be typed in as usual.

\so{man\-u\-script}
    Output: manuscript (in a narrow column: man- / u- / script)
    \- works as usual.

\so{le th{\’e}{\^a}tre}
    Output: le théâtre (in a narrow column: le / théâtre)
    Tokens that belong together have to be grouped; text inside groups is not spaced out. Grouped text must not contain hyphenation points.

\so{just an {{\hbox{example}}}}
    Output: just an example (in a narrow column: just / an / example)
    To prevent material with hyphenation points from being spaced out, you have to put it into an \hbox (\mbox) with two pairs of braces around it. However, it’s better to end s p a c i n g  o u t before words not to be broken and restart it afterwards.

\so{inside.} \&\ \so{outside}.
    Output: inside. & outside. (in a narrow column: in- / side. / & / out- / side.)
    Punctuation marks are spaced out if they are put into the group.

\so{{‘‘}\

\so{unbreakable\~ space}
    Output: unbreakable space (in a narrow column: un- / break- / able space)
    The \~-command inhibits line breaks. A space is mandatory here to mark the word boundaries.

\so{1\<3 December {{1995}}}
    Output: 13 December 1995 (in a narrow column: 13 / De- / cem- / ber / 1995)
    Numbers should never be spaced out.

\so{broken\\ line}
    Output: broken / line (in a narrow column: bro- / ken / line)
    \\ works as usual. Additional arguments (e.g., * or vertical space) are not accepted. Mind the space.

\so{\dots and {\hbox{-}}jet}
    Output: ... and -jet (in a narrow column: ...and / -jet)
    \hyphen must not be used for leading hyphens.

\so{pretty awful{\break} test}
    Output: pretty awful / test (in a narrow column: pretty / aw- / ful / test)
    The braces keep TeX from discarding the space.

4 Applications and comments

For my own purposes, I will most likely find uses for the package in all three TeXs: plain TeX (for critical editions), and both old and new LaTeXs (for most everything else). In the past, I’ve had to cobble together very unpresentable macros to deal with overstriking and underlining: both were needed in an article and then a book, to reproduce the creative writing process in Philip Larkin’s notebooks. That is, an application which had nothing to do with typographic emphasis and everything to do with trying to reproduce hand-written notes via typesetting. This package would certainly have made the job easier!

What is perhaps not immediately obvious is that this package provides not just various forms of emphasis but emphasis while retaining hyphenation. Using \underline only works for one line of text, and it blocks the last word from being hyphenated. Devising simple strike-out macros (my case) similarly removes the word(s) inside the macro’s argument from being considered for hyphenation.

This package gets over that hurdle, yielding essentially a two-for-one set of tools which many typesetters find they need at the last minute, as they turn the manuscript page, open the next file, and stare at several lines of underlined or over-struck text.

5 Follow-up

I’d like to invite users to fetch this package and see how it works out for them, and then send word on their application and results.

If you’ve been using other packages (perhaps because they’re old and comfortable friends), give this one a try and then tell us how they compare. In particular, make a note of which TeX you’re using; I myself am hanging on to a number of ‘old’ things because they work; or is it the other way around ... that I’m not using the new LaTeX in all instances because I don’t have suitable replacements or substitutes for the old and trusted friends?!

⋄ Christina Thiele
  15 Wiltshire Circle
  Nepean, Ontario
  K2J 4K9 Canada
  [email protected]
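For readers who want to try the package before reading its documentation, here is a minimal sketch of a LaTeX2ε test file. It assumes soul.sty has already been generated from soul.dtx and installed; only the two macros named above, \so and \ul, are used, and the sample sentences are illustrative.

```latex
\documentclass{article}
\usepackage{soul}
\begin{document}
% Letterspaced text that can still be hyphenated at line breaks.
\so{electrical industry}

% Underlining that may run over several lines.
\ul{This underlined sentence is long enough to be broken across lines,
and its words remain candidates for hyphenation.}

% A discretionary hyphen works as usual inside \so.
\so{man\-u\-script}
\end{document}
```

Setting the text in a narrow column (for instance inside a minipage) makes the retained hyphenation easy to see.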

Helpful hint for finding files on the CTAN archives via ftp

Don’t know where to find the package you want? The following shows how to use the quote site index command to locate packages quickly. Our example is for soul, the package just presented.

    ftp> quote site index soul
    200-index soul
    200-NOTE. This index shows at most 20 lines. for a full list of files,
    200-retrieve /tex-archive/FILES.byname
    200-1998/12/08 |   3484 | macros/latex/contrib/supported/soul/example.cfg
    200-1999/01/12 |  70239 | macros/latex/contrib/supported/soul/soul.dtx
    200-1999/01/11 |    279 | macros/latex/contrib/supported/soul/soul.ins
    200-1999/01/11 |   1437 | macros/latex/contrib/supported/soul/soul.txt
    200 (end of ’index soul’)

Abstracts

Les Cahiers GUTenberg
Contents of Issue 30
October 1998

Hommage à Gérard Blanchard

This issue of the Cahiers was produced in a very short time in order to be ready for MultiTypo 98, the ATypI meeting held in Lyon, France in October, to honor Gérard Blanchard, who had just died in August. It includes his last manuscript: his text as invited speaker to ATypI.

Jacques André and Jean-François Porchez, Éditorial : ATypI & Blanchard; pp. 3–5

The editors present a brief overview of the ATypI (Association Typographique Internationale), from its inception in France in 1957, through the great technological changes from lead to computer, till its return to France for the October 1998 annual meeting in Lyon, a city long associated with print.

In addition to the conference publication, Lettres françaises, all attendees (courtesy of Louis-Jean Printers) received copies of this 30th issue of the Cahiers, which was originally to be a thematic issue on electronic typography but was quickly re-worked to serve as a tribute to Gérard Blanchard, whose interest in typography extended beyond the subject itself, to the many participants involved in different aspects of typography.

In addition to the last Blanchard manuscript, the issue includes tributes from such old friends as John Dreyfus, Fernand Baudin, and Massin. As well, there are two pieces supplementing Blanchard’s Aide au choix de la typo-graphie: a brief discussion of the methodology behind the work, and an index for it.

John Dreyfus, A Tribute to Gérard Blanchard; pp. 6–9

First paragraph from the English text:

    The death of Gérard Blanchard in August robbed us of our chance to hear him speak at this Congress. Though he had taken part in several of our earlier congresses, many of you who are not French may never have met him, and may know very little about him. As I had the happy experience of meeting him in the mid-1950s for several consecutive years at the international typographical meetings held at Lurs-en-Provence, I readily agreed to pay this short tribute to his wide ranging contributions to the graphic and the typographic arts.

The text includes several references to Blanchard’s works, which might interest the TUGboat reader:

1. an early long essay: “The Typography of the French Book 1800–1914”, in Book Typography 1815–1965, ed. Kenneth Day. Chicago: U of Chicago Press, 1966, pp. 37–80. [The next chapter is by M. Vox, “The Half Century 1914–1964”.]
2. a book: La Lettre. La lettre et ses usages sociaux. Éditions du Gymnase typographique, 1975.
3. a large paperback: Pour une sémiologie de la typographie. Andenne: Rémy Magermans, 1979.
4. his last book: Aide au choix de la typo-graphie, cours supérieur. Reillanne: Atelier Perrousseaux, Éditeur, 1998. [This book includes an extensive bibliography of Blanchard’s work.]

Fernand Baudin, Rencontres & confluence [Meeting and merging]; pp. 10–11

A personal account of feelings and memories which came to mind upon hearing of Blanchard’s death, and how it affects the author’s role at the upcoming ATypI meeting (the note was written 6 October, before ATypI).

Massin, Des caves d’Hollenstein à la Sorbonne [From the Hollenstein cellars to the Sorbonne]; pp. 12–13

A brief reminiscence of discordant beginnings to what became a lasting friendship with and respect for Blanchard, with a particular memory of Blanchard’s intensity in after-hours sessions held in the cellars of Hollenstein Printers in Paris.

Gérard Blanchard, Connotation typographique : Pour une poétique de la typo-graphie [Typographic meaning: A case for a poetry of typography]; pp. 14–39

Foreword to the article, edited by Jean-François Porchez and Jacques André:

    This paper contains the text Gérard Blanchard was supposed to deliver as Guest Speaker at the ATypI conference, Lyon, October 1998. He died at the end of August and we edit here his draft, as it was found.

The editors have chosen to simply transcribe Blanchard’s manuscript notes, unchanged and unedited, supplemented with scanned images of select pages, to show his creative process at work. Their subsequent article in this issue discusses this process in more detail.

Jacques André, Noms propres cités dans Aide aux choix de la typo-graphie de Gérard Blanchard [Names cited in Gérard Blanchard’s Aide aux choix de la typo-graphie]; pp. 40–56

Author’s abstract:

    The following pages include an index of all the names (people, authors, works, foundries) quoted in the following book: Gérard Blanchard, Aide aux choix de la typo-graphie, Atelier Perrousseaux éditeur, Reillanne, 1998. ISBN 2-911220-02-1. N.B. Works names are in italic; fonts are already indexed in the book itself; cover pages are referenced as 0. This index is published here with Gérard Blanchard’s and Yves Perrousseaux’s authorizations.

Jacques André and Jean-François Porchez, Classeurs et chemin de fer [File folders and thumbnail layout]; pp. 57–62

Authors’ abstract:

    To prepare his Aide aux choix de la typo-graphie, Gérard Blanchard used small loose-leaf binders together with a thumbnail layout (sketches of how each set of facing pages would look). These techniques allowed him to organize his book’s layout as a hypertext.

The abstract merely hints at the strategies Blanchard used not simply to organize book layout but also (and I would hazard, primarily) to organize its contents: to capture those thoughts and views which usually come to mind in a very non-linear fashion as an idea begins to take shape and then seemingly races off in a dozen different directions, on at least half a dozen levels: references, examples, suitable quotes, illustrations, cross-references, and so on. In short, an analysis of the creative process at work, whose topic just happens to be typography. Rather like us, using TeX to talk about TeX!

[Compiled by Christina Thiele]

Articles from Cahiers issues can be found in the form of PostScript files at the following site: http://www.univ-rennes1.fr/pub/GUTenberg/publications

Late-Breaking News

Production Notes Mimi Burbank Here is your last issue of 1998 — better late than never. Herein are the promised articles held over from the TUG’98 conference held in Torun, Poland. The article by Miroslava Mis´akov´aon page 355 pre- sented quite a challenge. Mirka (her nickname) used DC fonts with Czech encoding and after moving all of the fonts to SCRI, I was still unable to get the font encoding correct (everyone else on the team was either sick or traveling), so in order to finish the is- sue, I distilled the .ps file used for the preprints and created .eps files of the examples rather than typesetting them. Several authors and I corresponded frequently about layout and fonts—I would put up pdf doc- uments in my WWW directory for them to peruse using their browser, and they would then email me comments and we would begin again. I find this a satisfactory way of working out problems — of course, depending upon any problems concerning bandwidth across the ocean. I’m have a vision of some shark cruising around the ocean floor, chewing on the “ca- bles”. Output The final camera copy was prepared at SCRI on IBM rs6000 workstations running AIX v4.2, using the TEXLivesetup (Version 3), which is based on the Web2c TEX implementation version 7.2 by Karl Berry and Olaf Weber. PostScript output, us- ing outline fonts, was produced using Radical Eye Software’s dvips(k) 5.78, and printed on an HP Laser- Jet 4000 TN printer at 1200dpi.

Coming In Future Issues

I think the news everyone is waiting for is, “When will TEX Live 4 be available?” The projected completion date of the CD is “Spring of 1999” and it will reportedly be “bigger and better”. The completion date will affect our decision whether or not to delay shipment of the first issue of TUGboat until we receive delivery of the CD (otherwise, it will be June 1999 before we can ship).

We will also be posting a list of additional English hyphenation exceptions uncovered since 1995; the full article will be posted on the TUG Web site.

⋄ Mimi Burbank [email protected]

Calendar

1999

Jan 29 – Apr 24: Exhibition: Primitive types: The sans serif alphabet from John Soane to Eric Gill, St. Bride Printing Library, London, UK. For information, visit http://www.stbride.org/soanepr.htm.

Feb 15: TUGboat 20 (1), deadline for technical submissions.

Feb 24 – 26: DANTE’99, 20th meeting, “10 years of DANTE e.V.”, Universität Dortmund, Germany. For information, visit http://dante99.cs.uni-dortmund.de/.

Mar 1: TUGboat 20 (1), deadline for reports.

Mar 1 – 5: Seybold Seminars Boston/Publishing, Boston, Massachusetts. For information, visit http://www.seyboldseminars.com/Events/bo99/.

Mar 13 – Apr 17: ABeCeDarium: A traveling juried exhibition of contemporary artists’ alphabet books by members of the Guild of Book Workers. This show will join historical examples from the Newberry collections. Newberry Library, Chicago, Illinois. Sites and dates are listed at http://palimpsest.stanford.edu/byorg/gbw.

Apr 26 – 30: XML Europe’99, Granada, Spain. For information, visit http://www.gca.org/conf/euro99/.

May 1 – 3: BachoTEX’99, 7th annual meeting of the Polish TEX Users’ Group (GUST), “Practical Aspects of Electronic Publishing”, Bachotek, Brodnica Lake District, Poland. For information, visit http://www.gust.org.pl/BachoTeX/.

May 11 – 14: 8th International World Wide Web Conference, Toronto, Canada. For information, visit http://www8.org.

May 10: TUGboat 20 (2), deadline for technical submissions.

May 18 – 20: GUTenberg ’99, “LATEX, a route to the Web”, l’Institut de physique nucléaire de Lyon, France. For information, visit http://www.ens.fr/gut/manif/.

May 24: TUGboat 20 (2), deadline for reports.

Jun 1 – 11: Society for Scholarly Publishing, 21st annual meeting, Boston, Massachusetts. For information, visit http://www.sspnet.org.

Jun 9 – 13: Joint International Conference of the ACH/ALLC in 1999 (Association for Computers and the Humanities, and Association for Literary and Linguistic Computing), University of Virginia, Charlottesville, Virginia, USA. For information, visit http://www.iath.virginia.edu/ach-allc.99.

Jul 5 – 7: CIDE’99: Second Colloque International sur le Document Électronique, Damascus, Syria. For information, visit http://infodoc.unicaen.fr/cide/.

Aug 8 – 13: SIGGRAPH, Los Angeles, California. For information, visit http://www.siggraph.org/s99/.

Aug 15 – 19: TUG’99, the 20th annual meeting of the TEX Users Group, “TEX Online – Untangling the Web and TEX”, University of British Columbia, Vancouver, Canada. Information will be posted to http://www.tug.org/tug99/ as plans develop.

Aug 23: TUGboat 20 (3), deadline for reports.

Sep 20 – 23: EuroTEX’99, the XIth European TEX Conference, “Document Publishing”, Heidelberg, Germany. Tutorials will precede and follow the main conference. For information, visit http://www.dante.de/eurotex99.

Nov 8: TUGboat 20 (4), deadline for technical submissions.

Nov 22: TUGboat 20 (4), deadline for reports.

Dec 6 – 9: XML 99, Philadelphia, Pennsylvania. For information, visit http://www.gca.org/conf/conf1996.htm.

2000

Jul – Aug: TUG 2000, the 21st annual meeting of the TEX Users Group, in the UK.

Status as of 20 December 1998

For additional information on TUG-sponsored events listed above, contact the TUG office (+1 503 223-9994, fax: +1 503 223-3960, e-mail: [email protected]). For events sponsored by other organizations, please use the contact address provided. Additional type-related events and news items are listed in the Sans Serif Web pages, at http://www.quixote.com/serif/sans.

Institutional Members

Academic Press, San Diego, CA
American Mathematical Society, Providence, Rhode Island
CERN, Geneva, Switzerland
College of William & Mary, Department of Computer Science, Williamsburg, Virginia
CSTUG, Praha, Czech Republic
Elsevier Science Publishers B.V., Amsterdam, The Netherlands
Florida State University, Supercomputer Computations Research, Tallahassee, Florida
Hong Kong University of Science and Technology, Department of Computer Science, Hong Kong, China
Institute for Advanced Study, Princeton, New Jersey
Institute for Defense Analyses, Center for Communications Research, Princeton, New Jersey
Iowa State University, Computation Center, Ames, Iowa
Kluwer Academic Publishers, The Netherlands
Los Alamos National Laboratory, University of California, Los Alamos, New Mexico
Marquette University, Department of Mathematics, Statistics and Computer Science, Milwaukee, Wisconsin
Masaryk University, Faculty of Informatics, Brno, Czechoslovakia
Mathematical Reviews, American Mathematical Society, Ann Arbor, Michigan
Max Planck Institut für Mathematik, Bonn, Germany
New York University, Academic Computing Facility, New York, New York
Princeton University, Department of Mathematics, Princeton, New Jersey
Space Telescope Science Institute, Baltimore, Maryland
Springer-Verlag, Heidelberg, Germany
Stanford University, Computer Science Department, Stanford, California
Stockholm University, Department of Mathematics, Stockholm, Sweden
University of California, Irvine, Information & Computer Science, Irvine, California
University of Canterbury, Computer Services Centre, Christchurch, New Zealand
University College, Computer Centre, Cork, Ireland
University of Delaware, Computing and Network Services, Newark, Delaware
Universität Koblenz–Landau, Fachbereich Informatik, Koblenz, Germany
University of Oslo, Institute of Informatics, Blindern, Oslo, Norway
University of Texas at Austin, Austin, Texas
Uppsala University, Computing Science Department, Uppsala, Sweden

1999 TUG Order Form

TEX USERS GROUP
Promoting the use of TEX throughout the world

mailing address: P.O. Box 2311, Portland, OR 97208–2311 USA
shipping address: 1466 NW Naito Parkway, Suite 3141, Portland, OR 97209–2820 USA
Phone: +1 503 223–9994
Fax: +1 503 223–3960
Email: [email protected]
WWW: www.tug.org

President: Mimi Jett
Vice-President: Kristoffer Høgsbro Rose
Treasurer: Donald W. DeLand
Secretary: Arthur Ogawa

Rates for TUG membership, subscription, and products are listed below. Please check the appropriate boxes and mail payment (in US dollars, drawn on a United States bank) along with a copy of this form. If paying by credit card, you may fax the completed form to the number at left.

• 1999 TUGboat includes Volume 20, nos. 1–4.
• 1999 CD-ROMs include TEX Live 4 (1 disk) and Dante's CTAN (3 disk set).
• Multi-year orders: You may use this year's rate to pay for more than one year of membership or of the CD-ROM product.
• Orders received after March 1, 1999: please add $10 to cover the additional expense of shipping back numbers of TUGboat and CD-ROMs.

Item (Rate; Amount)

Annual membership for 1999 (TUGboat and CD-ROMs): $65
Student membership for 1999 (TUGboat and CD-ROMs) (please attach photocopy of 1999 student ID): $35
CD-ROM product for 1999 (CD-ROMs and TUGboat): $65
Subscription for 1999 (TUGboat and CD-ROMs) (Non-voting): $75
Shipping charge if after March 1, 1999: $10

Voluntary donations
General TUG contribution:
Contribution to Bursary Fund*:

Total $

Payment (check one): Payment enclosed / Charge Visa/Mastercard
Account Number:
Exp. date:
Signature:

*The Bursary Fund provides financial assistance to members who otherwise would be unable to attend the TUG Annual Meeting.

Please complete all applicable items; we need this information to mail you products, publications, notices, and (for voting members) official ballots. Note: TUG neither sells its membership list nor provides it to anyone outside of its own membership.

Name:

Department:

Institution:

Address:

Phone: Fax:

Email address:

Position: Affiliation:

TEX Consulting & Production Services

Information about these services can be obtained from:

TEX Users Group
1466 NW Naito Parkway, Suite 3141
Portland, OR 97209-2820, U.S.A.
Phone: +1 503 223-9994
Fax: +1 503 223-3960
Email: [email protected]
URL: http://www.tug.org/consultants.html

North America

Hargreaves, Kathryn
135 Center Hill Road, Plymouth, MA 02360-1364; (508) 224-2367; [email protected]
I write in TEX, LATEX, METAFONT, MetaPost, PostScript, HTML, Perl, Awk, C, C++, Visual C++, Java, JavaScript, and do CGI scripting. I take special care with mathematics. I also copyedit, proofread, write documentation, do spiral binding, scan images, program, hack fonts, and design letterforms, ads, newsletters, journals, proceedings and books. I’m a journeyman typographer and began typesetting and designing in 1979. I coauthored TEX for the Impatient (Addison-Wesley, 1990) and some psychophysics research papers. I have an MFA in Painting/Sculpture/Graphic Arts and an MSc in Computer Science. Among numerous other things, I’m currently doing some digital type and human vision research, and am a webmaster at the Department of Engineering and Applied Sciences, Harvard University. For more information, see: http://www.cs.umb.edu/ kathryn.

Loew, Elizabeth
President, TEXniques, Inc., 362 Commonwealth Avenue, Suite 5E, Boston, MA 02115; (617) 670-1916; Fax: (617) 670-1916
Email: [email protected]
Long-term experience with major publisher in preparing camera-ready copy or electronic disk for printer. Complete book and journal production in the areas of mathematics, physics, engineering, and biology. Services include copyediting, layout, art sizing, preparation of electronic figures; we keyboard from raw manuscript or tweak TEX files.

Ogawa, Arthur
40453 Cherokee Oaks Drive, Three Rivers, CA 93271-9743; (209) 561-4585
Email: [email protected]
Bookbuilding services, including design, copyedit, art, and composition; color is my speciality. Custom TEX macros and LATEX2ε document classes and packages. Instruction, support, and consultation for workgroups and authors. Application development in LATEX, TEX, SGML, PostScript, Java, and C++. Database and corporate publishing. Extensive references.

Outside North America

DocuTEXing: TEX Typesetting Facility
43 Ibn Kotaiba Street, Nasr City, Cairo 11471, Egypt
+20 2 4034178; Fax: +20 2 4020316
Email: [email protected]
DocuTEXing provides high-quality TEX and LATEX typesetting services to authors, editors, and publishers. Our services extend from simple typesetting and technical illustrations to full production of electronic journals. For more information, samples, and references, please visit our web site: http://www.DocuTeXing.com or contact us by e-mail.