TUGBOAT
Volume 16, Number 1 / March 1995
3 Addresses 4 EuroTEX94 Contest Answers / Barbara Beeton General Delivery 5 Opening words / Christina Thiele and Michel Goossens 8 Editorial comments / Barbara Beeton Software & Tools 9MakingMakeTeXPK safer for Unix installations / Michael Jaegermann 12 Hyphenation exception log / Barbara Beeton Philology 18 Configuring TEXorLATEX for typesetting in several languages / Claudio Beccari 30 How to make a foreign language pattern file: Romanian / Claudio Beccari, Radu Oprea and Elena Tulei 42 TEX and Linguistics / Christina Thiele
Font Forum 45 Introducing METAP OST / Alan Hoenig
46 Some METAFONT techniques / Yannis Haralambous 54 The program a2ac —Font handling on the PostScript level / Petr Olsak
60 Problems of the conversion of METAFONT fonts to PostScript Type 1 / Basil Malyshev 69 Partial font embedding utilities for PostScript Type-1 fonts / Basil Malyshev and Michel Goossens 78 Tight setting with TEX / Alan Jeffrey LATEX 80 XΥMTEX for drawing chemical structural formulas / Shinsaku Fujita News & 89 Calendar Announcements 90 Production notes / Mimi Burbank 91 Coming next issue TUG Business 92 Institutional members Forms 93 TUG membership application Advertisements 94 TEX consulting and production services TEXUsersGroup Board of Directors Memberships and Subscriptions Donald Knuth, Grand Wizard of TEX-arcana† TUGboat (ISSN 0896-3207) is published quarterly Christina Thiele, President ∗ ∗ by the TEX Users Group, Flood Building, 870 Michel Goossens , Vice President Market Street, #801; San Francisco, CA 94102, George Greenwade∗, Treasurer U. S. A. Peter Flynn∗, Secretary Barbara Beeton 1995 dues for individual members are as follows: Johannes Braams, Special Director for NTG Ordinary members: $55 with TUGboat; Mimi Burbank $40 without; Jackie Damrau Students: $35 with TUGboat; $20 without. Luzia Dietsche Membership in the T X Users Group is for the E Michael Doob calendar year, and includes all issues of TEXand Michael Ferguson TUG NEWS for the year in which membership Bernard Gaulle, Special Director for GUTenberg begins or is renewed. TUGboat may be included Yannis Haralambous for a supplementary fee. Individual membership is Dag Langmyhr, Special Director for open only to named individuals, and carries with the Nordic countries it such rights and responsibilities as voting in the Nico Poppelier annual election. A membership form is provided on Jon Radel page 93. Sebastian Rahtz TUGboat subscriptions are available to organi- Tom Rokicki zations and others wishing to receive TUGboat in a Chris Rowley, Special Director for UKT XUG name other than that of an individual. Subscription E Raymond Goucher, Founding Executive Director † rates: North America $60 a year; all other countries, Hermann Zapf, Wizard of Fonts † ordinary delivery $60, air mail delivery $80. ∗ Second-class postage paid at San Francisco, member of executive committee †honorary CA, and additional mailing offices. Postmaster: TUGboat Send address changes to ,TEXUsers Addresses Telephone Group, 1850 Union Street, #1637, San Francisco, All correspondence, +1 415 982-8849 CA 94123, U.S.A. payments, etc. T X Users Group Fax Institutional Membership E 1850 Union Street, #1637 +1 415 982-8559 Institutional Membership is a means of showing San Francisco, continuing interest in and support for both T X E CA 94123 USA Electronic Mail and the T X Users Group. For further information, E (Internet) contact the TUG office. Parcel post, delivery services: General correspondence: [email protected] TEX Users Group TUGboat c Copyright 1995, TEXUsersGroup Flood Building Submissions to TUGboat: Permission is granted to make and distribute verbatim 870 Market Street, #801 [email protected] copies of this publication or of individual items from this publication provided the copyright notice and this permission San Francisco, notice are preserved on all copies. CA 94102, USA Permission is granted to copy and distribute modified versions of this publication or of individual items from this publication under the conditions for verbatim copying, TEX is a trademark of the American Mathematical provided that the entire resulting derived work is distributed Society. under the terms of a permission notice identical to this one. Permission is granted to copy and distribute transla- tions of this publication or of individual items from this publication into another language, under the above condi- tions for modified versions, except that this permission notice may be included in translations approved by the TEXUsers Group instead of in the original English. Some individual authors may wish to retain traditional copyright rights to their own articles. Such articles can be identified by the presence of a copyright notice thereon.
Printed in U.S.A. Who were the first compositors? Unfortunately we are not quite sure. Many of them were probably scribes faced with unemployment as a result of the new technology. Alexander Lawson The Compositor as Artist, Craftsman, and Tradesman (1990)
COMMUNICATIONS OF THE TEX USERS GROUP EDITOR BARBARA BEETON
VOLUME 16, NUMBER 1 • MARCH 1995 SAN FRANCISCO • CALIFORNIA U.S.A. 2 TUGboat, Volume 16 (1995), No. 1
TUGboat TUGboat Editorial Board
During 1995, the communications of the TEXUsers Barbara Beeton, Editor Group will be published in four issues. One issue Mimi Burbank, Production Manager (Vol. 16, No. 3) will contain the Proceedings of the Victor Eijkhout, Associate Editor, Macros 1995 TUG Annual Meeting. One issue (Vol. 16, Alan Hoenig, Associate Editor, Fonts No. 2) will be a theme issue, edited by a guest Christina Thiele, Associate Editor, Philology and editor; participation will be by invitation. Linguistics TUGboat is available at a special subscription See page 3 for addresses. rate to all members. Submissions to TUGboat are reviewed by vol- Other TUG Publications unteers and checked by the Editor before publica- tion. However, the authors are still assumed to be TUG publishes the series TEXniques, in which have the experts. Questions regarding content or accu- appeared reference materials and user manuals for racy should therefore be directed to the authors, macro packages and TEX-related software, as well with an information copy to the Editor. as the Proceedings of the 1987 and 1988 Annual Meetings. Other publications on TEXnical subjects Submitting Items for Publication also appear from time to time. Two regular issues will be prepared in 1995. Dead- TUG is interested in considering additional manuscripts for publication. These might include lines for these and other future issues are listed in the Calendar, page 89. manuals, instructional materials, documentation, or works on any other topic that might be useful to Manuscripts should be submitted to a member the T X community in general. Provision can be of the TUGboat Editorial Board. Articles of general E interest, those not covered by any of the editorial made for including macro packages or software in computer-readable form. If you have any such items departments listed, and all items submitted on magnetic media or as camera-ready copy should or know of any that you would like considered for be addressed to the Editor, Barbara Beeton (see publication, send the information to the attention of the Publications Committee in care of the TUG address on p. 3). Contributions in electronic form are encour- office. aged, via electronic mail, on magnetic tape or TUGboat Advertising and Mailing Lists diskette, or made available for the Editor to retrieve by anonymous FTP; contributions in the form of For information about advertising rates, publication camera copy are also accepted. The TUGboat schedules or the purchase of TUG mailing lists, “style files”, for use with either plain TEXor write or call the TUG office. LATEX, are available “on all good archives”. For authors who have no access to a network, they will Trademarks be sent on request; please specify which is preferred. Many trademarked names appear in the pages of For instructions, write or call the TUG office. TUGboat. If there is any question about whether An address has been set up on the AMS com- puter for receipt of contributions sent via electronic a name is or is not a trademark, prudence dictates that it should be treated as if it is. The following mail: [email protected] on the Internet. list of trademarks which appear in this issue may Reviewers not be complete. MS/DOS is a trademark of MicroSoft Corporation
Additional reviewers are needed, to assist in check- METAFONT is a trademark of Addison-Wesley Inc. ing new articles for completeness, accuracy, and PC TEX is a registered trademark of Personal TEX, presentation. Volunteers are invited to submit Inc. their names and interests for consideration; write to PostScript is a trademark of Adobe Systems, Inc. [email protected] or to the Editor, Barbara TEXandAMS-TEX are trademarks of the American Beeton (see address on p. 3). Mathematical Society. Textures is a trademark of Blue Sky Research. Unix is a registered trademark of Unix Systems Laboratories, Inc.
TUGboat, Volume 16 (1995), No. 1 5
group of keen implementors and developers with lit- General Delivery tle administrative involvement to one where most development work is now done outside the board, even outside TUG, and administrative issues have Opening words seemed to consume the board’s time and energy. In fact, most of the faces from when I first joined the Christina Thiele (Outgoing President) and board are no longer there. But this should not sur- Michel Goossens (Incoming President) prise anyone unduly. This first column for 1995 marks the transition be- The TEX community of today is very differ- tween myself and Michel Goossens, TUG’s new pres- ent from that of 7 years ago, much less 16 years ident. I am delighted to share this column with ago when TUG began. The program is mature, the him — and even more so to pass it on to him! users are world-wide and experienced, network ac- cess and network-based resources have increased to Passing the baton the point where almost everything is available elec- While this first issue of 1995 should normally have tronically: help, information, sources, documenta- been in your mailbox last March, thus before the tion—and fellow-users. Finding a useful role in the spring election, the delay in getting it out to you has current electronic and economic climate is very dif- been such that it is more reasonable to acknowledge ficult for all organizations in computer-related ac- “real-time” events. tivities. To my mind, TUG is now swinging away Having a real election for the position of TUG from the administrative focus of recent years to- president was very important to me, since my own wards more of a collaborative and coordinating role term was not the result of an election by TUG mem- in the community at large. This is not to say that bers but a make-do solution to an interim situation the efforts made to provide a solid administrative (although I had been elected as a board member in infrastructure have been for naught — but they only the first elections in 1991). need to be done once, and after that it’s more a mat- In 1991, the board decided that all board mem- ter of introducing refinements and improvements. bers should be elected by the membership, includ- The main focus can thus move elsewhere. ing the president. Malcolm Clark served as interim I’d like to think that this is where I’ve most president for one-and-a-half years, to allow the elec- usefully expended my energies for TUG —intheoc- tion cycle to begin properly. However, no candidates casionally unimaginative and plodding business of stepped forward in the fall of 1992, and so the board documentation, guidelines and general information. was forced to find a president from amongst its own Drafting guidelines for the proceedings, which TUG members; the result was that I took on the job of first began publishing in 1987, was one of my first president. I have therefore viewed my role these past ventures — mainly to provide some sort of reference two-and-a-half years as being more of an adminis- point for myself as editor of the 1988 and 1989 pro- trator (“paper-generating bureaucrat” might be an- ceedings, and of course, to also give authors some other term for it!), focusing on internal infrastruc- help in preparing their submissions. The guidelines ture more than on external leadership issues. have since become a regular component in the pro- This latter role is what TUG now needs to fo- ceedings editor’s arsenal of files; and each editor cus on — indeed, one could argue that this role was has been steadily improving and revising the docu- needed already a year ago. Perhaps. But the cli- ment over the years. Which is what good guidelines mate a year ago was less calm, more agitated. Now, should have to undergo—growth and change—to however, I believe Michel, as new president, has a address new situations. Similarly, the four years much better context in which to be a strong leader I spent working on TUG meetings led to drafting for TUG, and by extension, to better represent TUG conference guidelines along with Peter Flynn, who had experience with the Cork 1990 meeting. Other in the general TEX community. guideline writing duties I’ve shared include those for Looking back presenters, vendors, joint memberships, and elec- Since this is my last column, I would like to take the tions. opportunity to look back over my past seven years Last year saw the introduction of “info-sheets”, on TUG’s board, beginning in 1988 at the Montreal short one- or two-page documents which can be use- annual meeting. I’ve seen the board move from be- ful sources of information: tug-info.tex (general ing an appointed group to an elected body, from a info about TUG)andusergrps.tex (list of all user 6 TUGboat, Volume 16 (1995), No. 1
groups) are two of the first ones.1 They are pre- ment had to be made. As president, I was able to sented later in this issue, for everyone’s informa- initiate discussions on the feasibility of a team ap- tion. The source files can be found on CTAN in tex- proach, and in conjunction with Barbara (TUGboat archive/usergrps/tug and are updated as neces- editor), Mimi Burbank (chair of the Publications sary. In the works are info-sheets on CTAN and on Committee), and Michel Goossens (incoming presi- TEX implementations for various platforms. dent), we were able to quickly find a new production TTN was the biggest project, though, that I route to follow. undertook: and even then, only with the substan- The utter dependence upon one person, Bar- tial and informed contributions from regular colum- bara Beeton, to not only edit but also deal with 3 nists such as Peter Flynn, Peter Schmitt, Jeremy TEXnical production had to change. . TUGboat Gibbons and Robert Becker, was it possible to keep 15,4 was the first result of the new production ap- up a rhythm of four issues per year for three years. proach: a team of people working under Barbara’s That publication is now also evolving, under the new sharp eye, each one bringing a great deal of expe- editorship of Peter Flynn. rience in different aspects of TEXasitnowisused Indeed, most all of what I’ve worked on in TUG and understood by the community. This issue (16,1) has been something that others have had just as big is the first one which joins the team approach with a role in, or have taken further along the road. With a new production site (SCRI), and will, we believe, the right combination of people, who can make any allow for much greater scope and flexibility in the job seem do-able, any problem solvable, and any future. pleasure share-able, there’s really nothing like col- laborative work to make you feel useful, particularly SCRI support critical as volunteer work doesn’t bring much else! The team approach has been greatly aided by the Most of what I’ve learned about what they call generosity of the people at Florida State Univer- ‘people skills’ and all that—it’s been learned by sity’s Supercomputer Computations Research Insti- working within TUG and the TEX community. Per- tute (SCRI), who have allowed us to share half of a haps not always well learned, but certainly it’s been new 4GB disc in order to undertake TUGboat pro- the best exposure to all kinds of issues, situations, duction. Mimi Burbank deserves the credit for hav- and people one could wish for. Not at all what I ex- ing made this possible; and we are deeply grateful pected to learn when I sent in my first membership to SCRI for the additional technical and logistical form in 1986, that’s for sure, or when I attended my support which they have provided. The disc now first meeting in 1987! makes it possible for team members to access all TUGboat production files, lend speedy assistance TUGboat is back! and advice on problems which inevitably arise, and One major infrastructure concern is of course TUG- generally provide a solid support group for TUG- boat production. This issue you are reading is there- boat’s long-standing editor. For more details, see fore also significant in that it marks the second of Barbara’s column elsewhere in this issue. what will be a “five-step program”2 to getting TUG TUG’s new president, Michel Goossens, vowed back into normal contact with its own membership, at the recent annual meeting to see that, between and by extension, serving notice to the entire TEX now and the end of the year, members will see TUG- community that we are alive and well and work- boat issues appearing in very short order, to get us ing like mad to regain our members’ respect and back to the normal schedule. You have received 15,4 renewed membership. It has come as no surprise (the last 1994 issue). This is 16,1. You will receive that our 1995 membership figures are down from 16,2 (guest-edited by Malcolm Clark) and 16,3 (the last year; a big factor has been the non-appearance TUG’95 Proceedings, edited by Robin Fairbairns) of our flagship publication. A major concern has before the end of the calendar year; the December therefore necessitated a major change in production. 3 And that TEXnical aspect has done nothing but become With this in mind, as well as the growing diffi- more complex with each new article — proving that TUGboat culties being experienced by the TUGboat editor to authors are amongst the most devilishly creative group any- devote as much time and energy towards TUGboat one should ever have to deal with! I would hasten to reiterate as in the past, a change in the production environ- a point that Barbara has repeatedly made in the past: that this is not the aim of TUGboat —to be for the TEXnically 1 The idea came from a one-page overview that I picked devilish! There is a desperate need for solid entry-level arti- up at the 1994 annual meeting of the LSA (Linguistic Society cles that will help all users better understand and apply TEX of America) in Boston. for all purposes — not just the generation of fantastic new 2 Five issues — five steps: 15,4; 16,1 to 16,4. code and fonts! TUGboat, Volume 16 (1995), No. 1 7
issue, 16,4, will be out in early 1996. The work is needs of the year 2000. Recent developments such already well in hand; elsewhere in this issue you will as Ω, ǫ−TEX, NTS,LATEX3, TDS, hyperTEX, ASTeR find information on what’s coming in 16,2 and 16,3. have shown that TEX is alive and well, and that We are convinced this is the best solution for the cur- many enthusiastic developers in various parts of the rent and long-term survival of both TUGboat and world are actively working on extensions in function- TUG: without the one, the health and viability of ality to better integrate TEX into the window envi- the other is also drawn into question — not only by ronments that are becoming commodity items in our current and former TUG members, but by the TEX daily life. TUG has to set up efficient communica- community at large. tion channels, and the means to make coordinating all these activities possible. In particular, TUGboat And with this longish message, I now pass the Pres- will carry articles addressing these important issues ident’s Column over to Michel Goossens, TUG’s in- so that everybody can be kept truly informed. coming president. −−∗−−∗−−∗−− All together now Many — I should say most — TEX users in the world Moving forward are not members of TUG or of any other local or na- Christina Thiele, TUG’s outgoing president, has ex- tional TEX Users Group. Yet they all can profit from plained clearly what has happened in the last year TEX’s fantastic typesetting abilities. They are work- or two, stating the facts and putting them in an ob- ing in far-away places, on small personal computers. jective perspective. As noted previously, the prob- We should not forget those writing their thesis in lems with TUGboat have, amongst other things, Russian, research report in Chinese, perhaps a love contributed to a drop of about 20% in the mem- letter in Armenian, or a poem in Swahili. TUG per bership of TUG with respect to the 1994 figures. As se is not the prime aim of the game, it is not organi- I announced in my inaugural speech, I am making zations that make history, it is people. Knuth gave it my first task as new president to get TUGboat TEX to the world, and asked TUG to look after it, back on schedule. In the production team that we to make sure that TEXandMETAFONT can be used have set up, a real spirit of cooperation has devel- to the benefit of the whole of mankind as the only oped, with each of the team members contributing truly generally and freely available text-processing in an area where she or he feels most comfortable system in the world. Therefore we should all con- or has particular expertise. I am confident that this tinue to work together, in trust and good faith. TUG approach can be made to last, and that TUGboat depends on you, TUG needs your active support, and will now arrive on time and at regular intervals on all those hundreds of thousands of TEX users depend our readers’ desks. on all of us. Let us not disappoint them! ⋄ Christina Thiele New horizons 15 Wiltshire Circle But is it not enough to just “carry on”, saying that Nepean, Ontario it’ll soon be “business as usual”. The world of elec- K2J 4K9, Canada tronic publishing does not stand still; on the con- [email protected] trary, it is caught in a whirlwind. If TEX wants to ⋄ Michel Goossens survive, it will have to adapt to this new and chang- CERN, CN Division HTML SGML PDF ing environment. Hypertext, and , CH-1211 Geneva 23 (Portable Document Format), publishing on the Net, Switzerland document re-use, CD-ROM, dialing the global vil- [email protected] lage, surfing the Internet, using multiple master, GX or Truetype fonts — all of these are only some of the buzz-words that we encounter on walls, in maga- zines, on our computer screens, and in the books we open. And what about Windows 95, NT, or other Unixes, can we just ignore them? No, we have to deal with them, adapt to the real world, profit from these developments, borrow the good ideas, use cross-fertilization to take what we need in order to make our tool of excellence — TEX — even bet- ter, and adapt it more ideally to the text processing 8 TUGboat, Volume 16 (1995), No. 1
Editorial Comments TUGboat back on schedule. Mimi Burbank, men- tioned above as our “angel”, will assume the func- Barbara Beeton tion of Production Manager. Christina Thiele, in Apology to our readers addition to production duties, has agreed to take on the job of Associate Editor for Philology and It will not have escaped your notice that this issue Linguistics. Robin Fairbairns, Michel Goossens and is late, very late. I won’t bore you with the details, Sebastian Rahtz complete the team. All these only point out that there have been some changes individuals have considerable experience in editing in circumstances that have made it impossible to and producing TUG proceedings and/or other T X TUG- E spend the amount of time necessary to keep user group publications as well as other T Xand boat E on schedule by myself. However, help has been computer-related talents too numerous to mention. found, and a new production location that, unlike They have come in and “hit the ground running” the AMS computer facilities, is accessible to other with an abundance of energy and good will. I authorized members of the new production team. remain the Editor-in-Chief, responsible for major This facility, at the Supercomputer Computations decisions, overall guidance, and the person to com- Research Institute (SCRI) of Florida State Univer- plain to if something goes wrong. I’m confident sity in Tallahassee, is due to the efforts of Mimi that this team can deliver the goods. Burbank, and we are deeply appreciative of her help and that of the SCRI support staff. MetaPost goes public An all-out effort has been undertaken to return
TUGboat to schedule by the end of the year. This MetaPost, the extended METAFONT program by has already been described by the “old” and “new” John Hobby that creates PostScript output for dia- presidents in their “Opening words”. I’ll introduce grams and similar non-alphabetic shapes, has been the new production team below. placed in the public domain by AT&T. Meta- Post has facilities for including TEX output and Old faces, new faces manipulating the resulting picture. The distribution can be retrieved by anony- As with any volunteer-based endeavour, people who mous ftp from netlib.att.com/netlib/....The have given of their time and energy in one set user’s manual and auxiliary documentation are of circumstances may find that they are unable in .../att/cs/cstr as 162.ps.Z and 164.ps.Z to continue when circumstances change. This has respectively. The source distribution is .../re- TUGboat happened once again on the Editorial search/metapost.tar.Z. This is also mirrored at Board. Jackie Damrau, the first recipient of the CTAN in tex-archive/graphics/metapost. Donald E. Knuth Scholarship, has been Associate The distribution requires the Unix Web2C A Editor of the L TEX column since 1986. Jackie version of T X and Tom Rokicki’s dvips,which A E has been a champion of the L TEX user; her goal contains special support for using downloaded TEX has been to find better ways of explaining “how fonts in included figures. A DOS version of MetaPost to”, as opposed to laying bare the gory details that is promised. are of interest mainly to a developer. Both are necessary, but we should not forget that this is a ⋄ Barbara Beeton users group. We wish Jackie all the best in her American Mathematical Society future endeavours. P.O. Box 6248 The new faces belong to the production team Providence, RI 02940 USA that has been drawn together for the task of putting [email protected] TUGboat, Volume 16 (1995), No. 1 9
the biggest security holes you can think of. Modern Software & Tools UNIX kernels usually simply disable “suid” scripts, but even if they are permitted on your system, avoid using them. Problems with “world writable” direc- Making MakeTeXPK safer for UNIX tories pale in comparison. installations Fortunately, there are safer approaches. As long as you have a compiler and an editor, MakeTeXPK Micha l Jaegermann and relatives (MakeTeXTFM, MakeTEXMF,...)maybe Abstract executed indirectly by a small compiled “wrapper” program and this latter program can be safely made Leaving directories open for anybody in the world to “suid”. write to, in order to ensure that MakeTeXPK will work, is often a security concern. This note describes how to The notes below describe the steps leading to avoid that on UNIX systems without losing functionality. such a modification. Due to assorted variations in T X distributions and installations it is highly un- −−∗−− E likely that they will work for you literally as given— unless you happen to run a very similar system. To If you run T XonUNIX,oranyotherUNIX- E make details easier to follow they will be shown for like system, and you are using a very convenient and one particular distribution (teTeX, version 0.2, for popular setup with automatic bitmap generation, Linux). This is only an example but it should give then you’ll undoubtedly notice that this requires di- an idea how to proceed in other cases. Modifica- rectories with ‘write’ permissions to everybody. In tions should be rather straightforward. There are theory, this is not much different than having your also possible variations in UNIX behaviour. See, for /tmp directory open that way. In practice, though, example, the Solaris note below. especially when your programs are compiled with It goes without saying that you need root ac- the kpathsea library and you have multiple unpro- cess for all (okay, most) of these steps. tected font directories, this may cause substantial • security and administrative headaches. An option Create new user account, say tex, on your sys- to create automatically all bitmaps in one directory tem and include it in some innocuous group, used only for that purpose, and to move them later e.g., tex. by hand after careful examination to final locations, The only purpose of this “user” will be to own com- is not very practical if you serve multiple printers mon TEX files. You may already have some suitable with different characteristics and is always unattrac- group, such as nogroup or nobody,soinsteadyou tive to busy system administrators. may include your new user there. A first, but rather feeble line of defense, if your • Give the tex account /bin/false for a login variant of UNIX supports it, is to set the ‘sticky’ shell and disable password by putting * in the bit on the directories in question like this: chmod at corresponding field. Home directory is not ter- pk+. That way files can be removed or overwritten ribly important. Nobody will ever be logging only by their owner(s). Unfortunately this does not into this account. prevent a “denial of service” attack. For example, The completed entry in the password file will look a perpetrator may fill up directories with garbage something like this: such as empty files with expected names thus pre- tex:*:117:65535:Owner of TeX files: venting later generation of required bitmaps. The /usr/local/tex:/bin/false above is just a prank, without lasting long-term con- • sequences, but repeated often enough it may turn Do touch /usr/spool/mail/tex to create an into a major nuisance. There are other, more in- empty file. This action closes a loophole in some sidious possible threats, which do occur in practice, mail delivery programs. You may find the file especially when an attacker comes over a network already in place, made by some system admin- using compromised legitimate accounts. istration utility. Change owner and group of If you ever even thought of running MakeTeXPK this file to those of root (chown root:root / “suid” (that means, in a mode which gives a pro- usr/spool/mail/tex) and also remove all read gram all privileges of its owner) you should drop the and write permissions on it (chmod 000 /usr/ idea immediately, especially if MakeTeXPK is owned spool/mail/tex). by root. MakeTeXPK is a shell script and for many • Make sure that MakeTeXPK, and similar scripts reasons running shell scripts in “suid” mode is one of you want to execute the same way, are not in 10 TUGboat, Volume 16 (1995), No. 1
your $PATH, or at least not earlier than the in- your program MakeTeXPK. Change its ownership tended location of your “wrapper” program(s). and group to that of user tex by typing chmod Original scripts will not be called directly, how- tex:tex MakeTeXPK. (Depending on your vari- ever, the “wrapper” program will “inherit” their ant of UNIX you may have to use a dot in- names thus presenting the same interface to stead of a colon to separate the user and the users and other programs. group name, or you may have to do that in two Any convenient location will do for scripts, but if steps, using also another command, chgrp,to your system conforms to the TEX Directory Struc- accomplish the above. Use the group to which ture, then a subdirectory of $TEXMFROOT will be a you assigned your tex user. This is only an logical place. For the particular teTeX distribution, example). The program needs “execute” and version 0.2, it is enough to delete links in a di- “set uid” privileges (chmod 4755 MakeTeXPK). rectory /usr/local/bin. Real scripts MakeTeXPK, Also for your other MakeTeX... scripts, provide MakeTeXMF,andMakeTeXTFM reside in /usr/local/ corresponding “call points” with their names tex/scripts-0.2/bin/ and may be left there. via file links (ln MakeTeXPK MakeTeXTFM; ln • Edit all scripts in question to supply absolute MakeTeXPK MakeTeXMF). paths to all executables. Don’t forget to per- Solaris note: Passing ownership privileges to a form this task in other scripts which may be subprocess, as illustrated above, works for Linux and called by our MakeTEX... script (append db in other assorted UNIX systems. Still, I am informed the teTeX distribution). Change the mode used by Ulrik Vieth ([email protected] when creating new files to 444 and to 755 for .de), that on Solaris systems this happens only when directories. thewrapperis“suid”andownedbyroot.There- A proper way to do this is to start a script with a fore a series of shell variable definitions similar to setuid(geteuid()); MF="/usr/local/bin/mf" line from a sample source will not work as intended and replace all later occurrences of mf in the script (either the call will fail or the subprocess will not be
by $MF. That way, if you later move your METAFONT owned by the tex account). One should instead set executables to some other place, script editing will explicitly TEXGROUP and TEXUSER of a type uid_t be limited to one place; similarly for other programs. and include a replacement code like this Absolute locations are required since, for security setgid(TEXGROUP); reasons, we will limit $PATH only to "/bin:/usr/ setuid(TEXUSER); bin". This means that in theory you may leave in the given order — to achieve the same effect. This things like test, echo or rm alone. In the latter may apply as well to other UNIX variants. Caveat case, for “dangerous” commands like rm -f,itis emptor! still a good idea to replace them with definitions • Change ownership of all your font files to tex. similar to RM="/bin/rm -f" to get better control of Actually you may make tex an owner of whole what you are really executing. directory trees in T X system files. Assuming As a side effect this keeps user-aliased or re- E that all you want to assign that way is in a designed versions of these commands from putting tree rooted in texmf, you may accomplish that out unexpected and unwanted text or hanging be- by doing chown -R tex:tex texmf.Ifyour cause of a need for tty input as in the case of the chown does not understand the -R (recursive) common and usually desirable alias of rm to rm -i. flag, then something similar to the following The unplanned appearance of extraneous text on should do: stdout is one of the most common reasons for the MakeTeXPK to fail on its first pass. find texmf -print | xargs -n1 chown tex:tex Depending on your level of mistrust, you may See also chown remarks in the previous item. use a similar approach to echo and test as well, but • Remove “write” permissions for anybody but in the sample files, I have chosen to be more relaxed. owner on all directories in question. A com- Moreover, they may be “built-ins” in your shell. mand like the following one should accomplish • Edit sample source given in the appendix to that task (be careful, you do not want to change adjust it to your system and compile. non-directories): • Install results of the compilation somewhere in find texmf -type d -print | xargs -n1 chmod 755 your $PATH. The directory /usr/local/bin is usually a good place. Go there and name TUGboat, Volume 16 (1995), No. 1 11
You are done. Now, whenever MakeTeXPK is called Appendix – Sample wrapper program code directly from a command line or by some other pro- Sample C code for a wrapper program for teTeX, gram like dvips, your “wrapper” program will be version 0.2, Linux distribution. Adapt with cau- executed instead. It will call, in turn, a “real” script tion! but one already with the id of the owner of your font directories. /*********************************************/ /* */ Concluding remarks /* Executable wrapper for MakeTeX... */ /* programs. Calls its namesake from */ The presented solution is not entirely without prob- /* TOOLS directory. Provide links with */ lems. Due to “out of sync” ownership and permis- /* different names to make it multipurpose */ sions, kpathsea library functions may fail, depend- /* */ /* Michal Jaegermann, Feb 11 1995 */ ing on the exact moment this happened, when trying /* */ to write the missfont.log file in cases when font- /*********************************************/ making was not successful. This can likely be hard to resolve without modifying the library itself. If you #include
if (0 == strcmp(accepted[idx], start)) break; /* this is ours */ idx += 1; } /* * Set pretty bland, but hopefuly secure * environment; we intend to run this * program ’suid’. */ setenv("PATH", "/bin:/usr/bin", 1); setenv("IFS", " ", 1); /* * You may want/need some other calls to * setenv(). For example, if your system * has an environment variable pointing to * shared libraries it should be set here. */
/* * Attach our name at the end of a * directory string. This assumes that * real scripts in TOOLS directory will * be called by their own names (but * indirectly) */ strcpy(tail, start);
/* * Get the privileges of the owner of this * program, then execute the script and * return its results */ setuid(geteuid()); return execv(doer, argv); } 12 TUGboat, Volume 16 (1995), No. 1
the second column are suitable for inclusion in a \hyphenation{...} list. In most instances, inflected forms are not shown for nouns and verbs; note that all forms must be specifiedina\hyphenation{...} list if they occur in your document. Thanks to all who have submitted entries to the list. Since some suggestions demonstrated a lack of familiarity with the rules of the hyphenation algorithm, here is a short reminder of the relevant idiosyncrasies. Hyphens will not be inserted before the number of letters specified by \lefthyphen- min, nor after the number of letters specified by \righthyphenmin. For U.S. English, \lefthy- phenmin=2 and \righthyphenmin=3; thus no word shorter than five letters will be hyphenated. (For the details, see The TEXbook, page 454. For a digression on other views of hyphenation rules, see below under “English hyphenation”.) This partic- ular rule is violated in some of the words listed; however, if a word is hyphenated correctly by TEX except for “missing” hyphens at the beginning or end, it has not been included here. Some other permissible hyphens have been omitted for reasons of style or clarity. While this is at least partly a matter of personal taste, an author should think of the reader when deciding whether or not to permit just one more break-point in some obscure or confusing word. There really are times when a bit of rewriting is preferable. One other warning: Some words can be more than one part of speech, depending on context, and have different hyphenations; for example, ‘analyses’ Hyphenation Exception Log can be either a verb or a plural noun. If such a word appears in this list, hyphens are shown only for the Barbara Beeton portions of the word that would be hyphenated This is the periodic update of the list of words the same regardless of usage. These words are marked with a ‘*’; additional hyphenation points, if that TEX fails to hyphenate properly. The list last appeared in TUGboat 13, no. 4, starting on needed in your document, should be inserted with page 452. Everything listed there is repeated discretionary hyphens. Words added since the last appearance of the here. Owing to the length of the list, it has been + subdivided into two parts: English words, and list are preceded by . names and non-English words that occur in English The reference used to check these hyphenations Webster’s Third New International Dictionary texts. is , This list is specific to the hyphenation patterns Unabridged. that appear in the original hyphen.tex,thatis, English hyphenation the patterns for U.S. English. If such information for other patterns becomes available, consideration It has been pointed out to me that the hyphenation will be given to listing that too. (See below, rules of British English are based on the etymology “Hyphenation for languages other than English”.) of the words being hyphenated as opposed to the In the list below, the first column gives re- “syllabic” principles used in the U.S. Furthermore, sults from TEX’s \showhyphens{...};entriesin in the U.K., it is considered bad style to hyphenate a word after only two letters. In order to make TEX TUGboat, Volume 16 (1995), No. 1 13
defer hyphenation until after three initial letters, at-tributed at-trib-uted set \lefthyphenmin=3. at-tributable at-trib-ut-able Of course, British hyphenation patterns should avoirdupois av-oir-du-pois awo-ken awok-en be used as well. A set of patterns for UK English ban-dleader band-leader has been created by Dominik Wujastyk and Graham bankrupt(cy) bank-rupt(-cy) Toal, using Frank Liang’s PATGEN and based on a ba-ronies bar-onies file of 114925 British-hyphenated words generously base-li-neskip \base-line-skip made available to Dominik by Oxford University bathymetry ba-thym-e-try bathyscaphe bathy-scaphe Press. (This list of words and the hyphenation bea-nies bean-ies break points in the words are copyright to the be-haviour be-hav-iour OUP and may not be redistributed.) The file of be-vies bevies hyphenation patterns may be freely distributed; it + bib-li-og-ra-phystyle \bib-li-og-ra-phy-style is posted on CTAN in the file bid-if-fer-en-tial bi-dif-fer-en-tial + biggest big-gest tex-archive/language/english/ukhyph.tex bil-l-able bill-able and can be retrieved by anonymous FTP. biomath-e-mat-ics bio-math-e-mat-ics biomedicine bio-med-i-cine biorhythms bio-rhythms Hyphenation for languages bitmap bit-map other than English blan-der bland-er blan-d-est bland-est Patterns now exist for many languages other than blin-der blind-er English, including languages using accented alpha- blon-des blondes bets. CTAN holds an extensive collection of patterns blueprint blue-print in subdirectories of bornolog-i-cal bor-no-log-i-cal bo-tulism bot-u-lism tex-archive/language brus-quer brusquer bus-ier busier bus-i-est busiest The List — English words buss-ing bussing academy(ies) acad-e-my(ies) but-ted butted addable add-a-ble buz-zword buzz-word ad-di-ble add-i-ble ca-caphony(ies) ca-caph-o-ny(ies) adrenaline adren-a-line cam-er-a-men cam-era-men + aerospace aero-space cartwheel cart-wheel af-terthought af-ter-thought catar-rhs ca-tarrhs agronomist agron-o-mist catas-trophic cat-a-stroph-ic am-phetamine am-phet-a-mine catas-troph-i-cally cat-a-stroph-i-cally anal-yse(d) an-a-lyse(d) cauliflower cau-li-flow-er anal-y-ses analy-ses * cha-parral chap-ar-ral anomaly(ies) anom-aly(ies) chartreuse char-treuse an-tideriva-tive an-ti-deriv-a-tive cholesteric cho-les-teric anti-nomy(ies) an-tin-o-my(ies) cigarette cig-a-rette antin-u-clear an-ti-nu-clear cin-que-foil cinque-foil antin-u-cleon an-ti-nu-cle-on cognac co-gnac an-tirev-o-lu-tion-ary an-ti-rev-o-lu-tion-ary cog-nacs co-gnacs apotheoses apoth-e-o-ses com-parand com-par-and apotheo-sis apoth-e-o-sis com-para-nds com-par-ands ap-pendix ap-pen-dix comptroller comp-trol-ler archipelago arch-i-pel-ago con-formable con-form-able archety-pal ar-che-typ-al con-formist con-form-ist archetyp-i-cal ar-che-typ-i-cal con-for-mity con-form-ity arc-t-an-gent arc-tan-gent congress con-gress + (better: arc tangent) con-tribute(s,d) con-trib-ute(s,d) assignable as-sign-a-ble cose-cant co-se-cant as-sig-nor as-sign-or cotan-gent co-tan-gent + as-sis-tantship as-sist-ant-ship courses cour-ses asymp-tomatic asymp-to-matic crankshaft crank-shaft asymp-totic as-ymp-tot-ic crocodile croc-o-dile asyn-chronous asyn-chro-nous crosshatch(ed) cross-hatch(ed) atheroscle-ro-sis ath-er-o-scle-ro-sis dachshund dachs-hund at-mo-sphere at-mos-phere database data-base 14 TUGboat, Volume 16 (1995), No. 1
dat-a-p-ath data-path hep-atic he-pat-ic declarable de-clar-able hermaphrodite(ic) her-maph-ro-dite(-ic) defini-tive de-fin-i-tive heroes he-roes delectable de-lec-ta-ble hex-adec-i-mal hexa-dec-i-mal democratism de-moc-ra-tism holon-omy ho-lo-no-my de-mos demos ho-mo-th-etic ho-mo-thetic deriva-tive de-riv-a-tive horseradish horse-rad-ish diffract dif-fract hy-potha-la-mus hy-po-thal-a-mus di-rer direr ide-als ideals di-re-ness dire-ness ideographs ideo-graphs dis-parand dis-par-and id-iosyn-crasy idio-syn-crasy dis-traugh-tly dis-traught-ly ig-niter ig-nit-er dis-tribute(d) dis-trib-ute(d) ig-n-i-tor ig-ni-tor dou-blespace(ing) dou-ble-space(-ing) ig-nores-paces \ignore-spaces dol-lish doll-ish impedances im-ped-ances drif-tage drift-age in-finitely in-fin-ite-ly driver(s) dri-ver(s) in-finites-i-mal in-fin-i-tes-i-mal dromedary(ies) drom-e-dary(ies) in-fras-truc-ture in-fra-struc-ture duopolist du-op-o-list in-ter-dis-ci-plinary in-ter-dis-ci-pli-nary duopoly du-op-oly + in-ter-galac-tic in-ter-ga-lac-tic eas-t-en-ders east-end-ers inu-tile in-utile eco-nomics eco-nom-ics inu-til-ity in-util-i-ty economist econ-o-mist ir-re-vo-ca-ble ir-rev-o-ca-ble elec-trome-chan-i-cal electro-mechan-i-cal itinerary(ies) itin-er-ary(-ies) elec-tromechanoa-cous-tic electro-mechano-acoustic jeremi-ads je-re-mi-ads eli-tist elit-ist keystroke key-stroke en-trepreneur(ial) en-tre-pre-neur(-ial) kil-ning kiln-ing epinephrine ep-i-neph-rine la-ciest lac-i-est equiv-ari-ant equi-vari-ant lamentable lam-en-ta-ble ethy-lene eth-yl-ene land-sca-per land-scap-er ev-ersible ever-si-ble larceny(ist) lar-ce-ny(-ist) ev-ert(s,ed,ing) evert(s,-ed,-ing) + let-terspac-ing let-ter-spac-ing exquisite ex-quis-ite lifes-pan life-span ex-traor-di-nary ex-tra-or-di-nary lightweight light-weight fermions fermi-ons limousines lim-ou-sines flag-el-lum(la) fla-gel-lum(-la) linebacker line-backer flammables flam-ma-bles lines-pac-ing \line-spacing fledgling fledg-ling lithographed lith-o-graphed flowchart flow-chart lithographs lith-o-graphs formidable(y) for-mi-da-ble(y) lobotomy(ize) lo-bot-omy(-ize) forsythia for-syth-ia lo-ges loges forthright forth-right + longest long-est freeloader free-loader macroe-co-nomics macro-eco-nomics friendlier friend-lier malapropism mal-a-prop-ism frivolity fri-vol-ity manuscript man-u-script frivolous friv-o-lous marginal mar-gin-al + galac-tic ga-lac-tic + math-e-mati-cian math-e-ma-ti-cian + galaxy(ies) gal-axy(-ies) mat-tes mattes ga-some-ter gas-om-e-ter med-i-caid med-ic-aid geodesic ge-o-des-ic mediocre medi-ocre geode-tic ge-o-det-ic medi-o-crities medi-oc-ri-ties ge-o-met-ric geo-met-ric me-galith mega-lith geotropism ge-ot-ro-pism metabolic meta-bol-ic gnomon gno-mon metabolism me-tab-o-lism grievance griev-ance met-a-lan-guage meta-lan-guage grievous(ly) griev-ous(-ly) metropo-lis(es) me-trop-o-lis(es) hairstyle hair-style metropoli-tan met-ro-pol-i-tan hairstylist hair-styl-ist mi-croe-co-nomics micro-eco-nomics harbinger har-bin-ger mi-crofiche mi-cro-fiche harlequin har-le-quin mil-lage mill-age hatcheries hatch-eries milliliter mil-li-liter hemoglobin he-mo-glo-bin mimeographed mimeo-graphed hemophilia he-mo-phil-ia mimeographs mimeo-graphs + hemorhe-ol-ogy hemo-rhe-ol-ogy mimi-cries mim-ic-ries TUGboat, Volume 16 (1995), No. 1 15
mi-nis min-is par-al-lelism par-al-lel-ism + min-isym-po-sium(a) mini-sym-po-sium(a) para-m-ag-netism para-mag-net-ism min-uter(est) mi-nut-er(-est) paramedic para-medic mis-chievously mis-chie-vous-ly param-ethy-lanisole para-methyl-anisole mis-ers mi-sers (para-meth-yl-an-is-ole) mis-ogamy mi-sog-a-my parametrize pa-ram-e-trize mod-elling mod-el-ling paramil-i-tary para-mil-i-tary molecule mol-e-cule paramount para-mount monar-chs mon-archs pathogenic path-o-gen-ic mon-eylen-der money-len-der pee-vish(ness) peev-ish(-ness) monochrome mono-chrome pen-tagon pen-ta-gon mo-noen-er-getic mono-en-er-getic petroleum pe-tro-le-um monoid mon-oid phe-nomenon phe-nom-e-non monopole mono-pole philatelist phi-lat-e-list monopoly mo-nop-oly phos-pho-ric phos-phor-ic monos-pline mono-spline pi-cador pic-a-dor monos-trofic mono-strofic pi-ran-has pi-ra-nhas mono-tonies mo-not-o-nies pla-ca-ble placa-ble monotonous mo-not-o-nous plea-sance pleas-ance mo-ro-nism mo-ron-ism poltergeist pol-ter-geist mosquito mos-qui-to polyene poly-ene mu-d-room mud-room polyethy-lene poly-eth-yl-ene mul-ti-faceted mul-ti-fac-eted polygamist(s) po-lyg-a-mist(s) mul-ti-pli-ca-ble mul-ti-plic-able poly-go-niza-tion polyg-on-i-za-tion mul-tiuser multi-user (better polyphonous po-lyph-o-nous with explicit hyphen) polystyrene poly-styrene ne-ofields neo-fields pomegranate pome-gran-ate + neon-azi neo-nazi poroe-las-tic poro-elas-tic newslet-ter news-let-ter + porous por-ous non-ame no-name postam-ble post-am-ble none-mer-gency non-emer-gency postscript post-script nonequiv-ari-ance non-equi-vari-ance pos-tu-ral pos-tur-al noneu-clidean non-euclid-ean pream-ble pre-am-ble non-i-so-mor-phic non-iso-mor-phic preloaded pre-loaded nonpseu-do-com-pact non-pseudo-com-pact prepar-ing pre-par-ing non-s-mooth non-smooth + preprint(s) pre-print(s) nonuni-form(ly) non-uni-form(-ly) pre-pro-ces-sor pre-proces-sor nore-pinephrine nor-ep-i-neph-rine pre-s-plit-ting \pre-split-ting nutcracker nut-crack-er priestesses priest-esses oer-st-eds oer-steds + pret-typrinter pret-ty-prin-ter oligopolist oli-gop-o-list pro-ce-du-ral pro-ce-dur-al oligopoly(ies) oli-gop-oly(ies) pro-cess process* operand(s) op-er-and(s) procu-rance pro-cur-ance orangutan orang-utan pro-ge-nies prog-e-nies or-thodon-tist or-tho-don-tist progeny prog-e-ny or-thok-er-a-tol-ogy or-tho-ker-a-tol-ogy pro-hibitive(ly) pro-hib-i-tive(-ly) or-thoni-tro-toluene ortho-nitro-toluene prosci-utto pro-sciut-to (or-tho-ni-tro-tol-u-ene) protester(s) pro-test-er(s) overview over-view protestor(s) pro-tes-tor(s) ox-i-dic ox-id-ic pro-to-ty-pal pro-to-typ-al + padding pad-ding pseu-dod-if-fer-en-tial pseu-do-dif-fer-en-tial painlessly pain-less-ly pseud-ofi-nite pseu-do-fi-nite pal-mate palmate pseud-ofinitely pseu-do-fi-nite-ly parabola par-a-bola pseud-o-forces pseu-do-forces parabolic par-a-bol-ic pseudonym pseu-do-nym paraboloid pa-rab-o-loid pseu-doword pseu-do-word paradigm par-a-digm psychedelic psy-che-del-ic parachute para-chute psy-chs psychs paradimethyl-ben-zene para-di-methyl-benzene pubescence pu-bes-cence (para-di-meth-yl-ben-zene) + quadding quad-ding paraflu-o-ro-toluene para-fluoro-toluene quadratic(s) qua-drat-ic(s) (para-flu-o-ro-tol-u-ene) quadra-ture quad-ra-ture para-g-ra-pher para-graph-er quadriplegic quad-ri-pleg-ic par-ale-gal para-le-gal quain-ter(est) quaint-er(-est) 16 TUGboat, Volume 16 (1995), No. 1
quasiequiv-a-lence qua-si-equiv-a-lence sovereign sov-er-eign or quasi- spaces spa-ces quasi-hy-ponor-mal qua-si-hy-po-nor-mal specious spe-cious quasir-ad-i-cal qua-si-rad-i-cal spelunker spe-lunk-er quasiresid-ual qua-si-resid-ual spendthrift spend-thrift qua-sis-mooth qua-si-smooth spheroid(al) spher-oid(-al) qua-sis-ta-tion-ary qua-si-sta-tion-ary sph-inges sphin-ges qu-a-sito-pos qua-si-topos spi-cily spic-i-ly qu-a-si-tri-an-gu-lar qua-si-tri-an-gu-lar spinors spin-ors quintessence quin-tes-sence spokeswoman(en) spokes-woman(en) quintessen-tial quin-tes-sen-tial sportscast sports-cast rab-bi-try rab-bit-ry sportively spor-tive-ly ra-dio-g-ra-phy ra-di-og-ra-phy sportswear sports-wear raf-f-ish(ly) raff-ish(-ly) sportswriter sports-writer ramshackle ram-shackle sprightlier spright-lier ravenous rav-en-ous squeamish squea-mish re-ar-range-ment re-arrange-ment stan-dalone stand-alone re-ciproc-i-ties rec-i-proc-i-ties startling(ly) star-tling(-ly) reci-procity rec-i-proc-i-ty statis-tics sta-tis-tics rect-an-gle rec-tan-gle stealthily stealth-ily ree-cho re-echo steeplechase steeple-chase + reprint(s) re-print(s) stochas-tic sto-chas-tic restorable re-stor-able strangeness strange-ness re-tri-bu-tions ret-ri-bu-tions stratagem strat-a-gem retrofit(ted) retro-fit(-ted) stretchier stretch-i-er rhinoceros rhi-noc-er-os stronghold strong-hold righ-teous(ness) right-eous(-ness) + strongest strong-est ringleader ring-leader stupi-der(est) stu-pid-er(est) robot ro-bot summable sum-ma-ble robotics ro-bot-ics su-perego super-ego roundtable round-table su-pere-gos super-egos salesclerk sales-clerk supremacist su-prema-cist salescle-rks sales-clerks surveil-lance sur-veil-lance saleswoman(en) sales-woman(en) swim-m-ingly swim-ming-ly salmonella sal-mo-nel-la symp-tomatic symp-to-matic sarsaparilla sar-sa-par-il-la syn-chromesh syn-chro-mesh sauerkraut sauer-kraut syn-chronous syn-chro-nous sca-to-log-i-cal scat-o-log-i-cal syn-chrotron syn-chro-tron schedul-ing sched-ul-ing talkative talk-a-tive schizophrenic schiz-o-phrenic tapestry(ies) ta-pes-try(ies) schnauzer schnau-zer tarpaulin tar-pau-lin schoolchild(ren) school-child(-ren) tele-g-ra-pher te-leg-ra-pher schoolteacher school-teacher telekinetic tele-ki-net-ic scy-thing scyth-ing teler-obotics tele-ro-bot-ics + sec-re-tariat sec-re-tar-iat + tenure ten-ure semaphore sem-a-phore testbed test-bed semester se-mes-ter + tex-twidth text-width semidef-i-nite semi-def-i-nite tha-la-mus thal-a-mus semi-ho-mo-th-etic semi-ho-mo-thet-ic ther-moe-las-tic ther-mo-elas-tic semir-ing semi-ring times-tamp time-stamp semiskilled semi-skilled toolkit tool-kit seroepi-demi-o-log-i-cal sero-epi-de-mi-o-log-i-cal to-po-graph-i-cal topo-graph-i-cal ser-vomech-a-nism ser-vo-mech-anism to-ques toques setup set-up traitorous trai-tor-ous severely se-vere-ly transceiver trans-ceiver sha-peable shape-able transgress trans-gress shoestring shoe-string transver-sal(s) trans-ver-sal(s) sidestep side-step transvestite trans-ves-tite sideswipe side-swipe traversable tra-vers-a-ble skyscraper sky-scraper traver-sal(s) tra-ver-sal(s) smokestack smoke-stack treacheries treach-eries snorke-l-ing snor-kel-ing troubadour trou-ba-dour solenoid so-le-noid + turkey tur-key so-lute(s) solute(s) turnaround turn-around TUGboat, Volume 16 (1995), No. 1 17
ty-pal typ-al + Haifa Hai-fa unattached un-at-tached Hamil-to-nian Hamil-ton-ian unerringly un-err-ing-ly + Helsinki Hel-sinki un-friendly(ier) un-friend-ly(i-er) Her-mi-tian Her-mit-ian va-guer vaguer Hi-bbs Hibbs vaudeville vaude-ville + Hokkaido Hok-kai-do vi-cars vic-ars Jan-uary Jan-u-ary vil-lai-ness vil-lain-ess Japanese Japan-ese viviparous vi-vip-a-rous Kadomt-sev Kad-om-tsev voiceprint voice-print Kansas Kan-sas vs-pace \vspace Karl-sruhe Karls-ruhe + wadding wad-ding Ko-rteweg Kor-te-weg wallflower wall-flower + Lan-caster Lan-cas-ter wastew-a-ter waste-water Leg-en-dre Le-gendre waveg-uide wave-guide Le-ices-ter Leices-ter + wavelets wave-lets Lip-s-chitz(ian) Lip-schitz(-ian) we-b-like web-like Louisiana Lou-i-si-ana weeknight week-night Manch-ester Man-ches-ter wheelchair wheel-chair Marko-vian Mar-kov-ian whichever which-ever Mas-sachusetts Mass-a-chu-setts whitesided white-sided + Min-neapo-lis Min-ne-ap-o-lis whites-pace white-space Min-nesota Min-ne-sota widespread wide-spread + Moscow Mos-cow wingspread wing-spread + Nachrichten Nach-richten witchcraft witch-craft + Nashville Nash-ville + wordspac-ing word-spac-ing Ni-jmegen Nij-me-gen workhorse work-horse Noethe-rian Noe-ther-ian wraparound wrap-around No-ord-wi-jk-er-hout Noord-wijker-hout wretched(ly) wretch-ed(-ly) Novem-ber No-vem-ber yesteryear yes-ter-year + Palermo Pa-ler-mo + Philadel-phia Phil-a-del-phia Names and non-English words Poincare Poin-care Po-ten-tial-gle-ichung Po-ten-tial-glei-chung used in English text rathskeller raths-kel-ler al-ge-brais-che al-ge-brai-sche Rie-man-nian Rie-mann-ian Al-legheny Al-le-ghe-ny Ry-d-berg Ryd-berg Arkansas Ar-kan-sas schot-tis-che schot-tische + Aus-tralasian Aus-tral-asian Schrodinger Schro-ding-er au-toma-tisierter auto-mati-sier-ter Schwabacher Schwa-ba-cher Be-di-enung Be-die-nung Schwarzschild Schwarz-schild bib-li-ographis-che bib-li-o-gra-phi-sche Septem-ber Sep-tem-ber + Boston Bos-ton Stokess-che Stokes-sche + Brow-n-ian Brown-ian Stuttgart Stutt-gart + Brunswick Bruns-wick Susque-hanna Sus-que-han-na + Bu-dapest Bu-da-pest tech-nis-che tech-ni-sche + Caribbean Car-ib-bean Ten-nessee Ten-nes-see + + Charleston Charles-ton Ukrainian Ukrain-ian + Char-lottesville Char-lottes-ville ve-r-all-ge-mein-erte ver-all-ge-mein-erte + Columbia Co-lum-bia Vere-ini-gung Ver-ein-i-gung Czechoslo-vakia Czecho-slo-va-kia Verteilun-gen Ver-tei-lun-gen Di-jk-stra Dijk-stra Wahrschein-lichkeit-s-the-o-rie dy-namis-che dy-na-mi-sche Wahr-schein-lich-keits-theo-rie En-glish Eng-lish Werthe-rian Wer-ther-ian Eu-le-rian Euler-ian Winch-ester Win-ches-ter + Evanston Evan-ston Yingy-ong Shuxue Jisuan Ying-yong Shu-xue Ji-suan + Febru-ary Feb-ru-ary Zealand Zea-land Festschrift Fest-schrift Zeitschrift Zeit-schrift Florida Flor-i-da Forschungsin-sti-tut For-schungs-in-sti-tut funk-t-sional funk-tsional Gaus-sian Gauss-ian Greif-swald Greifs-wald Grothendieck Grothen-dieck Grundlehren Grund-leh-ren 18 TUGboat, Volume 16 (1995), No. 1
for help from the United States as well, so I presume Philology that across the ocean there are also some people who may benefit from these simple notes. Typesetting text in several languages implies Configuring TEXorLATEX for typesetting in the following problems: (a) correct hyphenation, (b) several languages correct “labels” in titles and captions (“Chapter”, “Capitolo”, “Chapitre”, etc.), (c) special language- Claudio Beccari dependent typesetting rules (French vs. non-French Abstract spacing, quotation marks, etc.). Therefore the points to be covered include: Based on the frequent requests for help that I have received in the past months from users who want 1. choosing the proper fonts and font encodings; to configure TEXorLATEX for typesetting in sev- 2. retrieving or creating hyphenation patterns; eral languages, it seems that this sort of informa- 3. initializing TEXorLATEX, taking into account tion is not adequately covered in the documentation the program’s memory limitations; available to most do-it-yourself users. This tutorial 4. retrieving or creating macros for switching lan- is intended to give basic information on this topic. guages; The related subject of hyphenation patterns will be 5. language-dependent typesetting macros. addressed in a future tutorial. Here I will skip over the problem of typesetting 1 Introduction with alphabets different from the extended Latin one; for Greek (modern and ancient) there is plenty LATEX users are excused for not knowing how to set
of information (T XandMETAFONT files, the lat- up their favorite typesetting program for different E ter for generating the full set of characters and lig- languages; Lamport [5] doesn’t say a word on this atures) in the CTAN directories subject; the new edition invites the reader to consult Goossens et al. [3] for what concerns multilanguage /tex-archive/languages/greek/levy typesetting, and certainly the latter contains several /tex-archive/languages/greek/yannis hints — more than simple hints, since all of chapter 9 The former contains the full set of 256-glyph fonts is dedicated to this problem. and TEX macros for handling them, the latter con- TEX users should be more familiar with this tains also the 128-glyph fonts, TEX macros and hy- problem via the main reference, Knuth [4], which phenation patterns; both contain excellent documen- covers the topic of hyphenation patterns. Appendix tation for using Greek fonts and for setting ancient H stresses that TEX hyphenation capabilities are and modern Greek. generated by the \pattern and the \hyphenation The platforms (or boxes, as TEXies seem to call commands, the former being processable only by the their computers) on which people set their texts are initialization TEX program initex. of widely different types, with different operating J. Braams prepared a complete system of style systems and, in particular, with different file sys- files and macros for use with LATEX, called babel;of- tems. I’ll address myself mainly to the DOS type ficially, it works only with standard document styles, of personal computer users, because apparently this but in practice, it is also valid for other styles and category counts the highest number of do-it-yourself includes adequate means for setting text into type in users; with minor modifications what I’ll say is ap- some twenty languages. The only part that is lack- plicable also to other environments, with the cau- ing from the babel system is the set of hyphenation tion that with VMS one must have system manager patterns for each of those languages, but this was privileges and with UNIX, superuser privileges. In done on purpose, I suppose, because pattern prepa- any case I assume that the do-it-yourself reader who ration, although essential for multilanguage type- is going to use this information in practice is suffi- setting, has almost nothing to do with the style of ciently familiar with his/her computer to be able to multilanguage typesetting. use the available utilities for exploring and manag- This tutorial will attempt to explain to do-it- ing the hard disk(s) and the file system, for creating yourself (LA)TEX users how to configure their sys- command (or batch or script) files, and so on. tems in order to set text into type in different lan- As for myself, I use a VMS mainframe, a UNIX guages at the same time. It is natural that this issue workstation and a DOS personal computer; the lat- should be particularly interesting for non-English- ter is a 486 type and has 8 Mb of RAM,sothat, speaking (LA)TEX users, but I have received requests even though I don’t use Windows, with a suitable TUGboat, Volume 16 (1995), No. 1 19
extended memory handler I can use a “big TEX” im- you get electricity properly hyphenated even though plementation of TEX based on the compilation of a it is emphasized. If you want to typeset some French C-source program; I recommend that any DOS user words within an English text, and you only have the equip him/herself with a configuration of this sort. standard cm fonts, besides inserting a lot of discre- My LATEX is set up to deal with eight languages at tionary breaks \- at every syllable (if you don’t have a time: US-English, Italian, French, Spanish, Por- the French hyphenation patterns), you have the pos- tuguese, Catalan, Romanian, and Latin (modern sibility of inserting \hz in order to convince TEXto spelling). Except for US-English, I prepared all the deal with word fragments; for example, you can type hyphenation patterns for the remaining seven lan- \’e\hz lectricit\’e guages myself. and TEX will try to hyphenate the fragment ‘lec- 2Fonts tricit’, probably finding some correct hyphen points even by using English patterns. Except for English (both US and UK), which does But the real solution lies in using the extended not use diacritics (actually it does when assimilated fonts. There are two flavors: the real dc fonts and foreign words are used), and textual Latin (i.e. ex- the virtual em fonts. The former are complete sets cluding prosodic Latin), almost all the languages of 256 glyphs that conform to the “double Cork” that were examined by Sojka and Seveˇˇ cek [6] use encoding, while the latter are virtual fonts that are diacritics. (LA)T X has no problem in setting suit- E made up with pieces taken from real 128-glyph cm able diacritics over or under any letter, but (LA)T X E fonts. Although em fonts lack some glyphs, com- has real problems in hyphenating words that contain pared with real dc fonts, they sometimes are more the control symbols or control words that are used flexible than dc fonts since by editing the virtual for setting such diacritics. In fact, T X “words” are E property list file it is possible to modify them very not the same as the words of a language. Loosely easily. Moreover the .pk files for the dc fonts in the speaking, for T X a “word” is a sequence of charac- E customary 300, 329, ..., 746dots-per-inchsizesoc- ters of category code 11 (letter) or 12 (other), with cupy approximately 7 Mb of disk space, which might a non-zero \lccode, set in the same font, that is not be available on the smaller platforms or which preceded by a punctuation mark (parenthesis, quo- might be used in a different way. tation marks, etc.) or by space or glue, and ends If you already have the dc .tfm and pixel files with anything different from the above-mentioned on your computer disk, or if you are willing to copy characters. Any command that interrupts this se- them from a CTAN archive, skip the next section; quence interrupts the word, unless it expands to one otherwise, you might find it interesting to create the or more characters with the proper characteristics. full set of em fonts yourself, as explained below. For a precise definition of what TEX thinks a word is, see Appendix H of The TEXbook. 2.1 Creating standard virtual em fonts As examples, the French word \’ecole = ´ecole Examine your file system and find out if you already is not a word at all for T X because the accent com- E have .tfm files whose names start with em and if you mand \’ expands to \accent 19, unless ... Even have files with the same name but extension .vf;if the English word {\it school} = school is not a you do, skip to the next section. word for T X, because there is no space or glue be- E If you don’t, you might be in trouble, but before tween the command \it and the word ‘school’; this giving up examine your file system and find out if is why (LA)T X often reports overfull hboxes when E you have executable files (extension .exe) with the you emphasize some text. following names: tftovp, vftovp, vptovf, pltotf, Unless . . . yes, there is a way around this: ex- tftopl. If you do, they probably came with your tended fonts and a simple little macro, \hz,forin- screen and/or printer driver, unless you are the type serting glue where there should be no extra space: of hacker who copies everything in the hope that it \def\hz{\nobreak\hskip0pt \relax} might become useful at some future date. In fact, \hz inserts an unbreakable glob of glue of This point is crucial; in fact, if you got these zero width, stretch and shrink, but nevertheless it is files with your drivers, you can be almost certain glue, so that after this glob a real TEX word may be- that your drivers handle virtual fonts. Check the gin, a word that TEX can hyphenate properly. Use driver documentation; if your drivers actually han- this little macro and you’ll get rid of many hyphen- dle virtual fonts, keep reading this section; other- ation problems. For example, if you type wise, go to the next subsection. There is no sense \emph{\hz electricity} 20 TUGboat, Volume 16 (1995), No. 1
’0 ’1 ’2 ’3 ’4 ’5 ’6 ’7 fonts. Of course, for other cases you might be tftovp ’20 A˘ A C´ Cˇ Dˇ Eˇ E G˘ obliged to use the command in its full ‘ ‘ "8 glory with all the other options fully spelled out, ’21 L´ L’ L N´ Nˇ O˝ R´ but this simple example is sufficient for giving 1 ’22 Rˇ S´ Sˇ S¸ Tˇ T¸ U˝ U˚ the idea of the whole procedure. "9 2. Now run vptotf in this way ’23 Y¨ Z´ Zˇ Z˙ IJ I˙ d § vptotf emr10 ’24 a˘ a c´ cˇ d’ eˇ e g˘ ‘ ‘ "A obtaining the .tfm file (LA)T X needs for its ´ E ’25 l l’ l n´ nˇ o˝ r´ font selection and the virtual font file (exten- ’26 rˇ s´ sˇ s¸ t’ t¸ u˝ u˚ sion .vf) that the driver needs for using the "B virtual font; move this latter file into the direc- ’27 y¨ z´ zˇ z˙ ij ¡ ¿ tory where the driver(s) expect to find virtual ’30 A` A´ Aˆ A˜ A¨ A˚ Æ C¸ "C files. ’31 E` E´ Eˆ E¨ `I ´I ˆI ¨I 3. The .vpl file is no longer needed, so it can be ’32 N˜ O` O´ Oˆ O˜ O¨ Œ deleted. "D ’33 Ø U` U´ Uˆ U¨ Y´ SS Of course you might automate this simple procedure by writing a command (or script) file that performs ’34 a` a´ aˆ a˜ a¨ a˚ æ c¸ "E all these operations with a minimum of human in- ’35 e` e´ eˆ e¨ ı` ı´ ıˆ ı¨ tervention. ’36 n˜ o` o´ oˆ o˜ o¨ œ If you have .afm files for PostScript fonts, you "F can do similar operations in order to use such outline ’37 ø u` u´ uˆ u¨ y´ ß fonts, but maybe leave that for when you have more "8 "9 "A "B "C "D "E "F experience.
Table 1: Font layout for em fonts with character 2.2 Getting along with ordinary cm fonts codes above 127. The empty positions should If you do not have the programs mentioned in the contain the lower- and uppercase versions of ‘eth’ previous subsection and/or your drivers do not han- and ‘thorn’, the ‘nj’ ligature, and a non-slanting dle virtual fonts or do not handle fonts with more version of the pound sterling sign; the dc fonts than 128 characters, or if you have decided that you have these positions filled up. do not want to use virtual fonts (for example for portability reasons), do not give up! It is still possi- ble to redefine the accent macros so as to make them in creating virtual fonts if your drivers can’t handle a little smarter, i.e. so that they introduce automati- them. cally an \hz command before and/or after the letter If you check in The T Xbook, Appendix F, you E they operate upon, depending on the nature of the may realize that the roman, italic and typewriter following character. fonts have different layouts; therefore, when you You should be able to perform an anonymous create virtual fonts you must give your programs ftp to my site: this kind of information. Just to give an example ftp ftp.polito.it let’s create the virtual font emr10 starting from the Username: anonymous standard real font cmr10: Password: your e-mail address 1. Run tftovp by issuing the following command cd /pub/tex/polito/hyphens tftovp -rm cmr10.tfm emr10.vpl where you can fetch the three files polyglot.tex, after having set things up so that the necessary polyglot.sty and polyglot.doc. You can use the .tfm files are in the default directory; the best 1 Apparently nobody noticed or took care that none of thing to do is to issue the command in the di- the options available for the tftovp command handles the rectory where you have all the .tfm files. caps-and-small-caps fonts; for such fonts, some editing of the ASCII virtual property list file is necessary before going to This action creates a virtual property list file, the next step, but you’ll produce virtual caps-and-small-caps with extension .vpl, that contains all the in- fonts when you have gained a little experience with virtual formation on the size of every character, the fonts. It’s a good idea to use your editor to explore the prop- ligatures, the glyphs that are made by super- erty list files: you get to understand a lot of things about TEX that are not written in any book, or are presented in such a position of glyphs taken from several other real way that the reader can’t really understand them. TUGboat, Volume 16 (1995), No. 1 21
second one as a LATEX option, so that you can pro- dc fonts it is necessary to declare the extended T1 cess the third one as a LATEX document and get encoding by means of the declaration all the information about the use and/or initializa- \renewcommand{\encodingdefault}{T1} tion of your (LA)T X programs with the polyglot E in the preamble of your document. macros. But the above operation is suitable only when In the same directory you will find the “poor you run LAT X2ε as a regular program that has al- man” hyphenation patterns for several languages; E ready been initialized. Its ability to handle hyphen- these are labeled “poor man” because they do not ation patterns in one or more languages derives from consider accented letters and leave the task of hy- its initialization which must be executed according phenating to the “intelligent” accent macros. Such to the procedure described for your particular im- macros do their best, but of course they cannot plementation of T X; in the following sections, an perform as well as a (LA)T X system correctly set E E example is given, but your particular implementa- up with dc or em fonts for handling multiple lan- tion might require some variations. guages. Nevertheless, I did use such a “poor man” Regardless of the implementation, though, you implementation for a while, and I have typeset sev- need to set up or modify a file named hyphen.cfg eral documents in different languages with almost which is intended for loading several hyphenation no human intervention while getting error-free jus- pattern files for the corresponding languages. As tified text with hyphenated words, in particular in far as my experience is concerned, the LAT X2ε base French and in Catalan.2 E package obtainable from CTAN states that this task 3 Using em or dc fonts is reserved for TEXperts and a special documenta- tion file is offered to such experts for the task. Un- 3.1 Extended fonts and T X E fortunately this file does not say much and several If you use TEX, and you want to use em or dc fonts, actions must be invented by the user. Two points you should do the following: are to be followed with special care: 1. copy the file plain.tex to another file, and call 1. The command \newlanguage is defined to be it emplain.tex; an \outer one so that it cannot be used within 2. edit emplain.tex so that it also preloads the the main argument of the command roman, italic, etc., em or dc fonts corresponding \InputIfFileExists{}{}{
setup will never succeed.}% \catcode‘\^^a0=11 % letter \u{a} \@@end \lccode"A0="A0 % lc code } \def\@u@a{^^a0} % ctrl word (The actions to be taken for a nonexistent or % unreachable French pattern file are copied [and \catcode‘\^^80=11 % letter \u{A} slightly edited] from the sample hyphen.cfg file \lccode"80="A0 % lc code \u{a} that is included in the base package.) \def\@u@A{^^80} % ctrl word 2. Before inputting any pattern file that contains % special characters, it is necessary to map the ... accent macros and their arguments to the char- and so on for the remaining special char- codes of the corresponding special characters acters whose codes can be deduced from and to define their lowercase codes. Table 1. In fact, the ability of LATEX2ε to deal with (c) Most important, steps 1 and 2 must be extended codes through sophisticated accent confined within a group together with the macros is applicable only during regular type- pattern file input command, so as to keep setting runs, not during the manipulation and such declarations local to the sole group digestion of hyphenation patterns. For the lat- where patterns are being manipulated and ter, simple and direct macros must be designed digested. such as the following: 3. After closing the group mentioned in the above (a) For the special characters for which stan- item, it is wise to declare the default language, dard commands are available, such as \o the default encoding (so that you do not need for ø and \ss for ß, it is sufficient to pre- to declare it in every preamble of every docu- pare declarations of the following form: ment), and the relevant parameters for the left- \def\ss{^^ff}\lccode"FF="FF most and rightmost word fragments: \def\SS{^^df}\lccode"DF="FF \language=0 \def\ae{^^e6}\lccode"E6="E6 \lefhyphenmin=2 \def\AE{^^c6}\lccode"C6="E6 \righthyphenmin=3 \def\oe{^^f7}\lccode"F7="F7 \def\encodingdefault{T1} \def\OE{^^d7}\lccode"D7="F7 \def\o{^^f8} \lccode"F8="F8 This done, you are ready to run the initializer. \def\O{^^d8} \lccode"D8="F8 A 3.3 Extended fonts and LTEX2.09 \def\i{^^19} \lccode"19="19 If you are still using LATEX2.09oryouwanttohave \def\j{^^1a} \lccode"1A="1A A \def\aa{^^e5}\lccode"E5="E5 aLTEX2.09-compatible version, do the following: \def\AA{^^c5}\lccode"C5="E5 1. copy the file lplain.tex to a new file, say, \def\l{^^aa} \lccode"AA="AA emlplain.tex; \def\L{^^8a} \lccode"8A="AA 2. edit emlplain.tex by replacing the line \input (b) For the other characters — those that carry lfonts with \input emlfonts a diacritical mark — it is better to resort to 3. copy lfonts.tex into emlfonts.tex; intermediate macros, some of which map 4. edit emlfonts.tex by adding a line that loads the accent macro and character to a sin- a dc or an em font for every text font already gle control word, and some for defining the loaded; as shown here (original lines are marked meaning of such control words. The whole with <--): trick is accomplished as follows: first you \font\fivrm =cmr5 % roman <-- define the control word mappings \font\efivrm=dcr5 % ex. roman \def\’#1{\csname @ac@#1\endcsname} ... \def\‘#1{\csname @gr@#1\endcsname} \font\elvrm =cmr10\@halfmag % roman <-- ... \font\eelvrm=dcr10\@halfmag % ex. roman and so on, with the prefixes that are listed ... in Table 2 for the other diacritical marks. \font\frtnrm =cmr10\@magscale2 % rom <-- Then, for each of the approximately one \font\efrtnrm=dcr10\@magscale2 % ex. rom hundred diacriticized characters, you must ... set up declarations such as these: TUGboat, Volume 16 (1995), No. 1 23
5. modify the definitions of the font changing com- But what happens if you decide to use a lan- mands for all point sizes; e.g. for the ten-point guage for which no hyphenation patterns have been size, search for the definition \def\xpt and prepared? This is most unlikely. If you read the change: paper by Sojka and Seveˇˇ cek [6], you will find a ta- \def\prm{\fam\z@\tenrm}% ble where 38 hyphenation pattern files are exam- ined; they deal with 32 different languages that in- to clude both flavors of English (US and UK), modern \def\prm{\fam\z@\etenrm}% and classical Latin and Greek, Esperanto, and most doing the same for all the other font selections. modern European and North and South American When you edit emlfonts.tex, near the end of the languages. Besides Greek, patterns exist also for file, you should comment out the definition of \$, languages that do not use the Latin alphabet (with because with dc or em fonts the dollar sign always or without diacritics), such as Russian, and possibly has the correct shape. for less known regional languages. The procedure seems very complicated but re- But what if you are unlucky—the patterns you ally it amounts to just some time spent in careful are looking for do not exist, or they exist but are repetitive editing: duplicating lines, replacing cm unreachable, or they are too large for the capabilities with dc and adding an e in the proper places. of your tex program . . . In this case you yourself Another point must be kept in mind: the pro- must create a file containing your patterns; you must gram tex has a finite memory for holding font in- carefully read Appendix H of The TEXbook and, of formation. If you have made all the modifications course, you must know the “strange” language you A want to use, or at least have a perfect knowledge explained above, you end up with a LTEX format where 106 fonts have been preloaded. This might of its grammar and hyphenation rules. It’s not too difficult, with or without the use of the program be too much for your “small TEX”, but presents no patgen, but this topic is sort of self-contained, so I problems to the “big TEX” implementations. You also need to edit your newly created files leave it for another tutorial [1]. emplain.tex and/or emlplain.tex and replace So, from now on, we assume that you have all the pattern files you need. You should at this point \input hyphen edit (or create) the file lhyphen.tex to read: with % File lhyphen.tex created on ... \input lhyphen % by ... unless your lplain.tex file is dated 31 March 1992 % It loads the hyphenation patterns or later, in which case it may already have been % for the following languages: done. % 0) US english % 1) italian 4 Hyphenation patterns % 2) french The best way to have the proper hyphenation pat- % 3) ... terns for the languages one chooses to use is to re- % trieve them from the CTAN archives. If you have \input chardefs access to the Internet, just ftp to the nearest CTAN % archive and explore the directory /tex-archive/ \language=0 \chardef\l@english 0 languages. Select the directory corresponding to \input hyphen the chosen language, and hopefully the proper set % of hyphenation patterns is already there. \newlanguage\l@italian If you do not have access to the Internet, you \language\l@italian probably know somebody who does. If you don’t, \input ithyph ask the TUG office for a diskette containing what % you can’t otherwise obtain, and pay what the TUG \newlanguage\l@french office will charge you. If the whole procedure seems \language\l@french too complicated, just consult the advertising pages \input frhyph of any issue of TUGboat and choose the vendor who % can provide you with what you desire; the vendor % and so on prices are higher but generally you get a full set of % diskettes with an install program that saves you % Default values all the burden of retrieving, initializing, etc. \language\l@english 24 TUGboat, Volume 16 (1995), No. 1
\lccode‘’= 0 \def\i{^^19} \lccode"19="19\uccode"19="49 \lefthyphenmin=2 \def\j{^^1a} \lccode"1A="1A\uccode"1A="4A \righthyphenmin=3 \def\aa{^^e5}\lccode"E5="E5\uccode"E5="C5 % \def\AA{^^c5}\lccode"C5="E5\uccode"C5="C5 \endinput \def\l{^^aa} \lccode"AA="AA\uccode"AA="8A If you are following the “poor man” procedure, \def\L{^^8a} \lccode"8A="AA\uccode"8A="8A the chardefs.tex file to be input might read simply \lccode‘’=‘’ like this: The hexadecimal codes appearing in the above def- \lccode‘’=‘’ initions can be found in Table 1, where the octal \catcode’33=11 %\oe and hexadecimal codes for all the other diacriticized \lccode’33=’33 %\oe characters can also be found. The uppercase codes are specified so that the case of such special letters \let\oe=^^1b 3 ... can be properly changed. Besides the above macros chardefs.tex must and similar definitions and code assignments for all also contain the character mappings necessary for the other special ligatures or diacriticized characters the cases when accent macros are used; in the CTAN present in the cm fonts that are necessary in the archives you can find a file, compatible.tex,that languages you are going to use. contains intermediate macros for these mappings. The apostrophe should receive an \lccode dif- They look like this: ferent from 0 because it should not interrupt TEX words of the type quest’anello, l’approbation, s’orga- \def\’#1{{\expandafter nitzen in languages such as Italian, French, or Cata- \ifx\csname @ac@#1\endcsname\relax lan (where vocalic elision is marked with an apos- {\accent19 #1}% \else trophe) — remember the definition of a TEX word: “. . . characters. . . , with a non-zero \lccode,...”. \csname @ac@#1\endcsname This is one of the most frequent requests for help \fi}} I receive from Italian users, who forget to \lccode \def\‘#1{{\expandafter A \ifx\csname @gr@#1\endcsname\relax the apostrophe and complain about (L )TEXnothy- phenating after such a sign. {\accent18 #1}% If you are going to use extended fonts (dc or \else em) you need a chardefs.tex file that defines a set \csname @gr@#1\endcsname of macros for changing such sequences as \u{a} (˘a) \fi}} into the numerical code of the diacriticized letter (in ... this case ’240 or "A0 or 160), while assigning a non- \def\~#1{{\expandafter zero \lccode to the character in question. Such a \ifx\csname @til@#1\endcsname\relax file, in other words, should contain things similar to {\accent’176 #1}% A \else those that have been described for LTEX2ε. Such macros are simple but numerous when you \csname @til@#1\endcsname want to have a complete map to all 102 diacriticized \fi}} letters of the Cork encoding and to the 13 special \def\"#1{{\expandafter characters ß, œ, æ, ø, ˚a,l, ı, and , with their upper- \ifx\csname @um@#1\endcsname\relax case possible counterparts (plus the apostrophe): {\accent’177 #1}% \else % Save current @ catcode and ... \csname @um@#1\endcsname \chardef\atcatcode=\the\catcode‘\@ \fi}} % ... make it a letter. \catcode‘\@=11 The trick is this: any accent macro (for exam- \def\ss{^^ff}\lccode"FF="FF\uccode"FF="DF ple, the one for the acute accent), operating, say, \def\SS{^^df}\lccode"DF="FF\uccode"DF="DF 3 A This is not necessary with LTEX2ε because definitions \def\ae{^^e6}\lccode"E6="E6\uccode"E6="C6 are local to the section where patterns are handled. On the \def\AE{^^c6}\lccode"C6="E6\uccode"C6="C6 other hand, remember to avoid groups while specifying these A \def\oe{^^f7}\lccode"F7="F7\uccode"F7="D7 codes with L TEX2.09, otherwise you lose the possibility of treating words with special characters in the proper way. \def\OE{^^d7}\lccode"D7="F7\uccode"D7="D7 \def\o{^^f8} \lccode"F8="F8\uccode"F8="D8 \def\O{^^d8} \lccode"D8="F8\uccode"D8="D8 TUGboat, Volume 16 (1995), No. 1 25
on the letter ‘a’, maps to the protected4 internal grave @gr@ caron @v@ control word \@ac@a, if this word is defined; other- acute @ac@ breve @u@ wise, it operates as a regular accent macro with cm circumflex @hat@ macron @eq@ fonts. If you are using the extended fonts and the tilde @til@ dotaccent @dot@ set-up I am describing, you might as well simplify dieresis @um@ cedilla @c@ such macros to: hungarumlaut @H@ ogonek @og@ \def\’#1{\csname @ac@#1\endcsname} ring @r@ \def\‘#1{\csname @gr@#1\endcsname} Table 2: Control-word prefixes for accent-macro ... mapping \def\~#1{\csname @til@#1\endcsname} \def\"#1{\csname @um@#1\endcsname} The whole set of prefixes of such control words is rectly to the corresponding extended TEXcodes;for summarized in Table 2. Actually, there are no LATEX this you might add to your chardefs.tex file some macros for the ring accent and the ogonek diacritical definitions of the form: mark;5 if you need to use them, you can define \r \catcode‘\`a=13\def `a{\@gr@a} and \g to map to the proper control words by means \catcode‘\´a=13\def ´a{\@ac@a} of the prefixes shown in Table 2. ... Next you must assign every such control word \catcode‘\~n=13\def ~n{\@til@n} to an extended character and then assign that char- Remember to end the file by restoring the right @ acter the proper \lccode and \uccode, an operation catcode: that is lengthy because of the number of characters, \catcode‘\@=\atcatcode but at least is repetitive; all such assignments are of this type: If you spend the time necessary to create this chardefs.tex file from scratch, to explore the CTAN \catcode‘\^^a0=11 % letter \u{a} archives, and to put together the various parts, you \lccode"A0="A0 % lc code understand why vendors charge you a reasonable \uccode"A0="80 % uc code \u{A} price for selling T Xware that everybody could oth- \def\@u@a{^^a0} E erwise get at no cost! % \catcode‘\^^80=11 % letter \u{A} 5 Initialization \lccode"80="A0 % lc code \u{a} Now you have all the necessary pieces of informa- \uccode"80="80 % uc code tion to create the formats latex.fmt with LAT X2ε, \def\@u@A{^^80} E emplain.fmt with extended font plain T X, and % E emlplain.fmt for extended font LAT X2.09. ... E It’s time to run initex,theTEX initialization The CTAN archives contain a file called extdef.tex program, the only one that can chew and digest the that has definitions similar to the ones above, but hyphenation patterns and store them into memory resorts to two macros: in a special way so that during a regular TEXrun \csubinverse and \charsubdef the hyphenation process proceeds efficiently. that map em/dc-font extended characters to corre- Depending on your software implementation, sponding cm-font accent macros + characters, and this initialization program may be a different .exe vice versa. These macros should not be necessary if (i.e. there are both tex.exe and initex.exe), or you consistently use only extended fonts. the initialization may be an option to the regular On the other hand, if you have a keyboard with procedure, or the executable module may behave national characters (and you pay some attention by differently depending on the name used to invoke editing your files before you send them to colleagues it. Again, depending on the operating system, you who might not have the same keyboard) you could might also include the arguments to the command reduce your keying if you map the input codes di- within double quotes. So this simple initialization 4 ‘Protected’ in the sense that it contains the character task may prove to be more difficult than it should. @, which is not a letter during normal (LA)TEXoperation,so On my DOS personal computer, with the var- that it is impossible to inadvertently redefine it. 5 ious implementations I have used, I had to follow Actually the ring accent is used only on the letters ‘a’ three different approaches; however, command files and ‘u’, so that the standard LATEXmacros\aa and \AA are sufficient for the former, but you still need something for using come in handy for doing the whole job without both- the latter. ering too much about the details. You have to: 26 TUGboat, Volume 16 (1995), No. 1
1. set up the environment variables according to move %1.fmt C:\TEX\ctexfmts the documentation for your particular imple- 5. Finally I reset the program name to ctex.exe mentation of TEX; rename C:\TEX\cinitex.exe ctex.exe 2. set up the search paths, if necessary; 3. invoke the initialization program with the prop- 6. At this point, with some implementations of er syntax and with the proper arguments; tex, it might be possible to run a preloading facility that creates an executable image of the 4. \dump the format; program with the format preloaded; it speeds 5. move the format to the directory where T X E up the preliminary operations of a regular TEX expects to find format files; run a little, but it is not a real necessity; many 6. possibly preload the format so as to create a implementations do not have this facility. specific executable program. That’s all. For running (LA)TEX, another com- Steps 1 and 2 are very much system- and software- mand file sets up the same environment variables dependent, so please check your documentation very (in case they were not set or had been reset), and carefully. Below I show how I would do the job invokes the executable with:7 A tex for LTEX2.09, by means of a C-source derived ctex &emlplain %1 program named ctex when it operates as a regular tex program, and cinitex when it operates as an where, as usual, %1 is replaced by the .tex file name initializer program.6 you want to process. During initialization you might experience a 1. I set up the following environment variables (but number of error messages; try to understand which remember that the names of my directories do is the cause and exit the initializer with x at the not necessarily match yours): program prompt. Some possible causes include: SET TEXFORMATS=C:\TEX\ctexfmts 1. initex did not find some .tex file(s): check SET TEXPOOL=C:\TEX\ctexfmts that all the necessary files are on a TEX search SET TEXFONTS=C:\TEX\fonts path conforming to the TEXINPUTS environment SET TEXINPUTS=.;C:\TEX\inputs variable. With other implementations it might be possi- 2. initex did not find some font files: check for ble to set environment variables that control the spelling errors in the parts you have edited; amount of memory that the program assigns to check that all .tfm files are in the directory(ies) the various operating parts. identified by the TEXFONTS environment vari- 2. I decide if the executable cinitex.exe is al- able. ready in the proper directory; if not, I rename 3. initex complains about the tex.pool file:8 the the executable ctex.exe to that name file is missing or the file in the directory pointed if exist C:\TEX\cinitex.exe goto runinitex at by TEXPOOL does not belong to the particular rename C:\TEX\ctex.exe cinitex.exe TEX implementation you are using; retrieve the 3. Then I produce the format and \dump it in one correct tex.pool file and be sure to put it in step: the correct directory. 4. initex complains about some undefined con- :runinitex trol sequence: check the name of the control C:\TEX\cinitex %1 \dump sequence and correct your spelling in the files In the above command %1 isthenameofthe you have edited. format I want to produce; in our examples it 5. initex complains about duplicate patterns: might be emplain or emlplain. press
having inadvertently duplicated some patterns tions because this is my policy when I prepare is normal, but must be corrected. my patterns. 6. initex complains about memory limitations: There is nothing wrong with hyphenation ex- ifyouareusinga“bigTEX” implementation ceptions; simply they are better suited for a based on a C-source, this should not happen. non-flexive or moderately flexive language, such If it does, either you want to go into the Guin- as English, than for flexive languages such as all ness book of records for the largest number of the Romance ones. In English, for a noun you languages treated at the same time, or you have have two forms, for an adjective one form, for a bad patterns, or your C-source program did not verb four or five forms. In Italian nouns require work properly. If your program comes from a two forms, adjectives up to four, verbs approxi- Pascal source and you have the source code, af- mately 60 and this does not include all the pos- ter having properly checked that you are not sible agglutinations of enclitic pronouns. For going into the Guinness book, and that your French, Spanish, Portuguese, Catalan, Roma- patterns are OK, you could carefully modify the nian, and Latin the situation is similar, so that memory declarations in the source code and re- if a hyphenation exception involves the stem of compile and link the program. I hope you know a verb, you may have to input some 60 entries what you are doing. If you do not have the (atleast)inthe\hyphenation argument! source code, but you have an implementation 2. US-English hyphenation patterns occupy 6075 (like sb39tex) that allows you to exercise some trie memory words and 181 ops; the standard control over the memory usage, check the doc- size of the trie memory (in a regular-sized TEX umentation and proceed accordingly. implementation) is 8000 words; French comes With respect to memory limitations, specifi- second in complexity, requiring from 1122 to 9 cally hyphenation memory limitation, TEXhastwo 1433 trie memory words and 110 ops; the other memory areas, the trie and the ops ones, the former Romance languages require on the average 600 being dedicated to holding the patterns (all the pat- trie memory words and 25 ops. The ops num- terns of all the languages that have been loaded) and bers are simply additive while the trie mem- the latter to holding the “branch” information of the ory words are not, because the cited numbers pattern structured lists. The greater the number of include some undocumented overhead, so that patterns, the greater the trie memory occupied; the the final trie memory occupied is smaller than more complex the pattern structure, the greater the the sum of the single language trie memory oc- ops memory occupied. cupied. But during the initex run you need When you run initex,the.log (or .lis)file the availability of at least the sum of the single documents all the relevant information about the language occupations, so that the program can run; in particular, towards the end you will find a massage the patterns and compress them in the listing that looks like this: proper way. 14 hyphenation exceptions When you have to control the memory occu- Hyphenation trie of length 8172 pation, you must keep these numbers in mind has 407 ops out of 750 in order to understand possible complaints by 23 for language 6 initex. 15 for language 5 3. The ops numbers give you a fairly good idea of 15 for language 4 the complexity of the hyphenation rules for a 25 for language 3 given language as they are translated into TEX 110 for language 2 code. In my experience, German patterns ap- 38 for language 1 pear to be the most complicated, requiring 9980 181 for language 0 trie memory words and 281 ops; this is the set from which you can obtain important information of patterns indicated as DEmax in Sojka and Seveˇˇ cek [6],10 and if you have to use German about the memory occupation dedicated to multi- language issues: 9 Depending on the presence of extended characters; the 1. the 14 hyphenation exceptions are those that numbers refer to my poor-man or extended-character pat- terns, respectively. come with the US-English hyphen.tex file, orig- 10 However, they report 255 ops while my cinitex exe- inated by Liang and Knuth and described in cuted 281; perhaps the files we examined are not exactly the The TEXbook, Appendix H. The other lan- same. guages I loaded do not have hyphenation excep- 28 TUGboat, Volume 16 (1995), No. 1
you’d better have a “big TEX” implementation ... % french text with a large trie memory size. \end{Lang} or you might prefer to define a new environment: 6 Language-dependent macros \newenvironment{French}{\Lang{french}}{} The best thing to do is to retrieve the babel package; so that you can type: this is the finest set of multilanguage macros I know of and I recommend it to everybody. My polyglot \begin{French} macros are also an acceptable “poor man” alterna- ... % french text tive for those who have stuck to cm fonts and relied \end{French} on intelligent accent macros. The macros should be and thereby following standard LATEX markup style. divided into three categories: 6.2 Label selection 1. language selection; What has been described in the previous subsec- 2. label selection; tion is suitable for setting short passages of one lan- 3. style selection. guage within a text composed in another language; If you want to make your own macros, because you for example, French extracts (properly hyphenated are going to use your multilanguage implementation in French) in an English-language essay (with cita- in a restricted or in a special way, you can do the tions in an appropriate English style). following. If you want to write a whole document in an- other language you should be able to change “Chap- 6.1 Language selection ter” into “Chapitre”, “Table” into “Tableau”, and In the lhyphen.tex file all languages were identified so on, with a single command. Fortunately, as of 31 by a literal name mapped to an internal number March 1992, these words are no longer hardwired of the form \l@language; this allows you to set up into the LAT X styles; they are contained in control 11 E simple macros such as these: sequences with self-explanatory names. So, together \def\Lang#1{\expandafter with the language settings, you need a \setcaptions \language\csname l@#1\endcsname macro of the following type: \csname #1settings\endcsname} \def\setcaptions#1{% % \csname #1captions\endcsname} \def\englishsetting{\lccode‘’=0 % \righthyphenmin=3} \def\frenchcaptions{% % \def\refname{R\’ef\’erences} \def\italiansettings{\lccode‘’=‘’ \def\abstractname{R\’esum\’e} \righthyphenmin=2} \def\bibname{Bibliographie} % \def\chaptername{Chapitre} \def\frenchsettings{\lccode‘’=‘’ \def\appendixname{Annexe} \righthyphenmin=3} \def\contentsname{Table des mati\‘eres} % \def\listfigurename{Liste des figures} % and so on \def\listtablename{Liste des tableaux} % \def\indexname{Index} You change hyphenation rules simply by issuing \def\figurename{Figure} commands of the type \def\tablename{Tableau} {\Lang{french} ... % french text \def\partname{Partie} \def\enclname{P.~J.} } % end french \def\ccname{Copie \‘a} without forgetting the group braces so that the ac- \def\headtoname{A} tion is confined to the enclosed group alone. \def\pagename{Page} A In LTEX you can use the above definitions also % as environments: \def\today{\ifnum\day=1\relax \begin{Lang}{french} 1\/$^{\rm er}$\else\number\day\fi 11 \space\ifcase\month\or \lefthyphenmin is supposed to maintain the default value of 2; of course some settings might change this value, janvier\or f\’evrier\or mars\or but if you choose to do so, you should insert an appropriate avril\or mai\or juin\or juillet\or setting for every language. ao\^ut\or septembre\or octobre\or TUGboat, Volume 16 (1995), No. 1 29
novembre\or d\’ecembre\fi are far from optimal; dc fonts and PostScript \space\number\year} fonts have perfect shapes. But in any case suit- } able macros must be set up correctly in order % to insert the proper quotation marks. % and so on for the other languages 3. Ordinal numbers are another point of differ- % ence; in some countries a digit is immediately With a LATEX2.09 document you can start this way: followed by the desinence or the final letter of \documentstyle{book} the corresponding spelled-out ordinal; in other \Lang{french}\setcaptions{french} countries this literal ending is separated by a \begin{document} period; in yet other countries this ending is set ... as an exponent (sometimes underlined). Flex- \end{document} ive languages may have different endings for masculine and feminine, singular and plural. The language selection macros and the caption def- Elsewhere, ordinals are expressed by roman inition commands could also be incorporated into numerals; the lowercase roman letters are cus- hyphen.cfg for LAT X2ε,andintolhyphen.tex for E tomary in English, and this habit has gained LAT X2.09, so that they remain available with any E wide acceptance in many places, while good document type and you need not specify a partic- style in many countries requires the use of up- ular option or package every time you start a new percase roman letters only, possibly from a small document. caps font. 6.3 Style selection For these sorts of style problems, don’t do it If you want your composition style to be correct, all the do-it-yourself way; it’s too difficult (unless you are a book connoisseur, of course). It’s better to the way down to respecting the typographic tradi- babel tions of another country, then you should rely com- rely on the package, which, although it was pletely on the babel macros. created and is maintained by one man, J. Braams, The most apparent differences consist in these benefits from continuing suggestions and construc- points: tive criticism by all members of the TEX community, with the result that the package is constantly being 1. In France (and perhaps in some other countries) upgraded and always becoming better. French spacing is normally used; French spac- ing leaves the same amount of white space af- 7Conclusion ter all punctuation marks, but leaves some thin This long tutorial on multilanguage typesetting has space before the “tall” punctuation marks such tried to focus on some of the problems concerning as the colon and semicolon, the question and setting texts in different languages by means of a exclamation marks, the parentheses,12 the quo- single program, TEXorLATEX. Configuring the pro- tation marks; such spaces cannot split at the gram to handle several languages is not that diffi- end of the line. All these punctuation marks cult, but it does require patience, good knowledge must be made active in text mode so that they of the software, a long week-end. . . But when you’ve introduce the right amount of white space, or, set up the program the proper way, you will enjoy conversely gobble extra white space. it much more than before. 2. The quotation marks are the most variable in different countries; in the US high quotation References marks with the shape of normal or reversed [1] Beccari C., Oprea R., Tulei E., “How to make a commas are used; in France and many other foreign language pattern file: Romanian”, TUG- countries guillemets are used; in Italy we use boat, 16(1):31–42, March 1995. both types. In some countries open quotation [2] Braams J., “babel, a multilingual style-option marks are identical to closed quotation marks system for use with LATEX standard document but are lowered; in other countries they are re- styles”, TUGboat, 12(2):291–301, June 1991. versed, that is left quotation marks are used for Update: TUGboat 14(1):60–62, April 1993. closing a quotation instead of opening it. And [3] Goossens M., Mittelbach F., Samarin A., The so it goes. A LTEX Companion, Addison-Wesley, Reading, All these marks are present in the extended Mass., 1994. fonts, although in the em fonts the guillemets [4] Knuth D.E., The TEXbook, Addison-Wesley, 12 The thin white space goes after the open parenthesis. Reading, Mass., 1990. 30 TUGboat, Volume 16 (1995), No. 1
[5] Lamport L., LATEX A Document Preparation System, Addison-Wesley, Reading, Mass., 1986. Second edition, dealing with LATEX2ε, 1994. [6] Sojka P., Seveˇˇ cek P., “Hyphenation in TEX — Quo Vadis?”, Proceedings of the Eighth European TEX Conference (Sept. 26–30, 1994, Gda´nsk, Poland), 59–68.
⋄ Claudio Beccari Dipartimento di Elettronica Politecnico di Torino Turin, Italy Email: [email protected] 30 TUGboat, Volume 16 (1995), No. 1
indirect proof is given by the existence of fonts for exotic languages, fonts that could not be of any use if nobody set texts in those languages—setting text implies breaking words at the end of the line, so that some sort of hyphenation patterns must be used. When the first author of this article started his interest in hyphenation, he had to make up for patching or creating new patterns for his site be- cause the patterns available for Italian at that time did not comply with national regulations and, to be honest, were quite poor. He ended up with new pat- terns that were published in TUGboat [2] and then made their way into the CTAN archives. Recently, version 4.1, which greatly improves hyphenation of technical terms, was submitted to the archive man- agers. Due to the presence of several foreign visiting professors and undergraduate and graduate students How to make a foreign language pattern in his Polytechnic, Beccari was asked to prepare ver- file: Romanian A sions of (L )TEX suitable for French, Spanish, Cata- Claudio Beccari, Radu Oprea and Elena Tulei lan, and Romanian; for his own pleasure he added Portuguese. He decided to prepare a single multilin- Abstract gual version of the formats so that users would not In this tutorial we show how to make hyphenation have to keep different commands in mind and would patterns for a language when you know the gram- have the possibility of switching back and forth be- matical rules for hyphenation (if they exist). We tween languages. The limitations included the fol- also discuss some points related to typographical hy- lowing: phenation when compared with grammatical syllab- 1. programs had to run on a cluster of VMS main- ification. frames running a Pascal-derived “big TEX”; 1 Introduction 2. a large number of isolated and clustered UNIX workstations running C-derived versions of “big In this tutorial we complete the argument of multi- TEX”, a countless number of DOS personal com- language typesetting that we started in the previous puters, which has the most varied hardware article [3]; there the problem of pattern generation and software configurations, equipped with the had been omitted because this topic is sort of self most unpredictable versions of TEX (from ver- contained and deserves its own discussion. sion 1.5 (!) to 3.1415), as well as screen and The problem of creating patterns suitable for printer drivers; T X’s hyphenation algorithm is dealt with in Ap- E 3. the DOS personal computers were mostly owned pendix H of Knuth [10], where a certain statement by students, whose limited budgets were not may induce T X users to abstain from creating pat- E sufficient to upgrade and standardize their tern tables “since patterns are supposed to be pre- equipment. pared by experts who are well paid for their exper- tise.” Beccari therefore had to adopt a “poor man” In a way this is desirable, so that pattern tables policy so that .tex files edited and tested at home for the same language are not created continuously, could also be run on the mainframes or the UNIX stations. This implied sticking to regular cm fonts leading to multiple, incompatible, different, dubious, 1 erroneous pattern lists and frustrated users unable and creating a set of “intelligent” accent macros to find a reliable pattern set for the language they that could look ahead and do some intelligent dis- want to use. cretionary break insertion. The results were satis- In another sense, pattern creation is essential factory for all the cited languages except Romanian, for the wide-spread dissemination of TEX, which is where TEX runs yielded poor results (many overfull now being used at least for all the languages exam- 1 By “accent macro” we mean every command that places ined by Sojka and Seveˇˇ cek [12], and certainly also a diacritical mark over or under any letter. for many other modern and ancient languages. An TUGboat, Volume 16 (1995), No. 1 31
and underfull hboxes) with line widths shorter than does not serve as a semi-consonant; this is not cer- ca. 100 mm. tainly the case in English-speaking countries, where With the collaboration of the second and third a division such as liq-uid 3 is not only grammatically authors (Oprea and Tulei), natives speakers of Ro- correct but also acceptable to the readers — such a manian, we decided to prepare Romanian hyphen- division, on the other hand, makes Romance lan- ation patterns without being confined to cm fonts, guage speakers shiver unpleasantly! With Romance yet still referring to the extended dc or em font set- language hyphenation patterns, therefore, vocalic up, according to the Cork encoding (see Beccari, groups should remain undivided, except when semi- elsewhere in this issue [3]). consonants are present. The attentive reader may ask himself: “Why So we have two different concepts, hyphenation didn’t they simply retrieve the Romanian patterns and syllabification: the former deals with typogra- that Sojka and Seveˇˇ cek [12] mention had been pre- phy, while the latter deals with grammar. pared by Malyshev or by Samarin and Urvantsev?” A syllable is a word fragment that contains at The answer is simple: the articles mentioned in So- least one vowel or equivalent sound; speakers of the jka and Seveˇˇ cek deal with Russian; the Romanian language should be able to pronounce it as an iso- patterns are only mentioned in passing; Sojka prob- lated utterance; the set of vowels, diphthongs and ably got hold of them, but we did not succeed. More- triphthongs of a language is specific to that lan- over, according to Sojka and Seveˇˇ cek, these Roma- guage. Sometimes we are astonished when we see nian patterns make up a very large set (4121 pat- foreign words that apparently do not contain any terns) that requires a trie size of 4599 words, which vowels; this is due to the way we are used to clas- might be too large for a multilingual (LA)TEXimple- sifying the letters of our alphabet, but there is no mentation where they are supposed to coexist with doubt that in several Slavic languages the letter r six or seven other pattern sets (including the US- plays the rˆole of a vowel (ˇcrn ‘black’, smrt ‘death’, English patterns that one cannot do without). Trst ‘Trieste’). The consonants on either side of such sounds 2 Grammar vs. typography may belong to the left or to the right syllable, de- Preparing hyphenation patterns for TEXisnota pending on various rules and the acceptability of simple grammar exercise; one must know the gram- pronouncing consonantic clusters by native speakers matical rules (if they exist and are not too compli- of a specific language; however, etymological rules cated) of the language for which patterns are being may alter this kind of division, which explains why prepared, and one must also know the typographical syllabification rules are so different in different lan- rules valid for the specific country or for a generic guages. good typographical practice in that language. Unluckily TEX has no notion of vowels and con- It is obvious that typographical hyphenation sonants, so that one cannot rely on the fact that must comply with grammar rules; it is less obvious no (incorrect) breakpoints are executed that isolate that the possible hyphen points must not coincide groups of consonants; in fact, if you ask TEXto with syllable divisions, but are a subset of the lat- show the hyphen points for comparands, for exam- ter. Just to set forth a simple example, in Italian the ple, while English is the default language, you get word idea is syllabified in i-de-a, but typographi- com-para-nds, which is clearly wrong. cally this word has no hyphen points;2 some of (or Another point, valid for every language, is the all) the syllables may show up again when the word existence of syllabification rules; we dare say that is connected with other words, as in dell’idea that is these rules exist for every official language, but what divisible like this: del-l’i-dea. about dialects or regional unofficial languages? It is Syllables and typographic divisions differ in an- quite likely that the latter completely lack a stan- other way; in some when reading their mother lan- dard and recognized set of grammatical rules, and guage, some people feel uncomfortable if they have that the language variety might change from village to read a word split across a hiatus (that, is a pair to village both in spelling and word usage. But if you of vowels that do not form a diphthong); more gen- must write a report on philological research dealing erally, they feel uncomfortable if the word fragment 3 US hyphenation verified in the Webster dictionary. UK that goes to the next line begins with a vowel that hyphenation, as given by TEX by means of the pattern file UKhyphen.tex, is ‘li-quid’. Although the UK hyphenation 2 For the moment, let us skip the question of minimum file is huge (55 860 bytes, 8527 patterns, 10 995 trie memory length of the first and last “typographical” word fragments, words, 224 ops), it hyphenates the words reported in Ap- that T X deals with the values stored into the internal E pendix H of The TEXbook in a funny way. \lefthyphenmin and \righthyphenmin. 32 TUGboat, Volume 16 (1995), No. 1
with such languages, you probably need to set many nored, while most other languages require etymolog- specimens of text, which means that you need pat- ical division — Romanian belongs to this class. terns for these languages also. Another problem with some languages is the But even if we examine only official languages, fact that hyphenated words change spelling com- such rules, in particular those regarding syllabifi- pared to when they are undivided; in German, for cation, might be too complicated or might refer to example, the orthographic sequence ck often gets word stress. In the case of complicated rules, the hyphenated into k-k. following statement holds true: All these problems offer the willing pattern cre- too complicated rules = no rules ator two alternatives: process a language dictionary with patgen or do everything by hand. As for word stress, T X (or any other program that E patgen is a program created by Liang for use deals with words instead of numbers) does not know with T X (documentation available from CTAN). anything about stress, so that it cannot take it into E You have to feed the program with a large set of account for hyphenation purposes. hyphenated words (several thousand words) and it In English (US and UK alike) the rules are quite will produce the hyphenation patterns that T Xre- complicated and refer to stress; the example shown E quires. Producing patterns this way is time con- by Knuth in Appendix H of The T Xbook regard- E suming because you need to create a large file of ing the word record is typical: the stress is different hyphenated words absolutely without error (either whether the word is used as a noun or as a verb, in spelling or hyphenation), but it is the only prac- therefore the syllabification turns out to be rec-ord ticable way when dealing with languages for which or re-cord! There is no simple way to get around the no rules statement applies. And this statement these situations. must be extended to all languages where stress plays Another crucial point is the question of com- an important rˆole in syllabification, and where pre- pound words, and prefixes and suffixes. Compound fixes and suffixes must be divided automatically,ac- words should preferably be divided at word bound- cording to etymology. aries, while prefixes and suffixes may be treated in different ways in different languages. 3 Mute vowels In some languages compound words are very Some languages—French for instance—have mute frequent; in other languages, although compound vowels; that is, vowels that are not pronounced, or words exist, “compound concepts” are expressed are pronounced in an indistinct way. According to without agglutination — the building of long and grammar rules, these vowels may produce a sylla- complicated compound words from many smaller ble, but typographic practice avoids splitting words ones — a feature of Germanic languages. Romance which would leave a mute vowel to start the new languages prefer constructions that make use of con- line, especially if this happens on the last syllable. junctions and prepositions; English is different yet For French hyphenation patterns, M. Ferguson [9] again, often putting together sequences of words explicitly inhibited hyphenation for all mute end- that form a “compound concept” without “gluing” ings. This is another instance where syllabification them to one another. For example, in English you and hyphenation are in contrast: if patterns are gen- have the compound list “manufacturing systems erated via patgen, the syllabified list of words fed to engineering” that in French becomes the preposi- the program must take hyphenation breaks at mute tional phrase “genie des syst`emes de fabrication”. vowels into account. In any case, compound words are typical of the jar- In any event, we want to stress the point that gon found in chemistry, regardless of language, and, the word fragments obtained through the process of no doubt, since chemical names tend to be very hyphenation are less than or equal to the number long, they should be divided on the word bound- of syllables the word contains. There is no point in aries rather than on syllable boundaries. measuring hyphenation algorithm performance by Prefixes and suffixes are generally treated in 4 counting the number of breakpoints it misses (with two different ways in different languages: prefixed a chosen word list), because the point is to measure and suffixed words must be divided according to the number of wrong breakpoints it produces. The their etymology, or they may be treated as regular common words, disregarding the presence of prefixes 4 Unless the algorithm is so poor that it misses most and suffixes. Italian and Portuguese belong to the breakpoints! second class, where prefixes and suffixes may be ig- TUGboat, Volume 16 (1995), No. 1 33
best algorithm should produce no incorrect break- “activity’ or “specialness” is turned off when you points (obviously) but apparently this is not achiev- go into verbatim mode. But it is a shame that all able with any algorithm. the standard input characters obtainable with a US Beccari’s pattern list for Italian ithyph.tex, keyboard have already been used for special TEX version 3.5, was supposed to correctly hyphenate all purposes. Italian words; after additional checking with dictio- In fact we would have preferred that asin- naries created by L. Bianchi [6], a version 4.0 had gle character be used in place of the control se- to be produced, in spite of the fact that hundreds of quence \allowhyphens, so as to speed up keyboard- people at his site had used the patterns and no one ing. Moreover, when using dc fonts, it should be had ever found an incorrect breakpoint. possible to refer to character "17 (called a compound word marker). If such a character could be included 4 Compound words within the hyphenation patterns, and if it could be patgen can also produce patterns for division of easily inserted into the TEX input file, much keybor- compound words, but the pattern list may become ding could be saved. Unfortunately the character extremely complicated, and since compound techni- belongs to the set of the “illegal” ones that (LA)TEX cal words are continuously being created, any pat- refuses to read. tern list becomes obsolete the very moment it is cre- For a while, Beccari (compare [2]) used the un- ated. We are of the opinion that it is too exacting derscore _ as the single-character command, but a request that a computer program provide a com- other TEXies at his site were so used to using the plete, fast, accurate, and reliable algorithm suitable underscore for other purposes (for example in file for every language and every situation. Some type names) that his choice was doomed to failure; he of manual intervention is therefore necessary; below had to delete the definition in order to avoid contin- are some options which have proven useful to us. uous quarreling with his colleagues. (LA)TEX offers the macro \- that allows hyphen- In any case — call it "-, \allowhyphens,or_ — ation in specific points. If you use \- within a word, this macro is suitable for separating prefixes, not for this word can be hyphenated at that and only that compound words, whose division should take prece- position.5 dence over syllable division; see the discussion in So- Another macro with more flexibility is shown jka and Seveˇˇ cek [12]. This question of marking the below: boundaries of compound words is still an open one, A \def\hz{\nobreak\hskip0pt \relax} and we hope that the LTEX3 team finds a suitable \def\allowhyphens{\hz\-\hz} solution. This new definition allows you to insert a discre- 5 Patterns tionary break without inhibiting hyphenation in the According to Appendix H of The T Xbook,apat- rest of the word. The only drawback to such a def- E tern is a sequence of lowercase alphabetic letters inition is that the name is too long. German T X E (and characters of category 12, “other”) separated users get around this problem by making the double 6 by digits. The meaning of the digits di and the let- quote character " active. Its definition is quite com- ters li is as follows: the pattern d1l1d2l2 ···ln−1dn plicated because it has to do a lot of things in dif- implies that if the sequence of letters l1l2 ···ln−1 ap- ferent circumstances with a minimum of keyboard- pears in a word, the hyphenation “weights” between ing: the double quote prefixed to a vowel inserts such letters are d1, d2, ..., dn, where odd digits the umlaut (dieresis), while prefixed to certain con- allow hyphenation, even digits inhibit hyphenation sonants it inserts the appropriate \discretionary and, if two or more (different) patterns specify dif- command (for example, "ck expands to ferent weights between the same letters, the highest \discretionary{k-}{k}{ck} digit prevails. The digit 0, being the smallest, is plus no breaks or zero skips to allow hyphenation in optional and may generally be omitted. the rest of the word). Among other things, the dou- A pattern list is a sequence of different patterns ble quote character can change "- into the “soft” dis- separated by spaces (or single end-of-line marks) as cretionary break obtained via \allowhyphens,de- if they were words of a single paragraph given as fined above. arguments to the \patterns primitive command. Active characters are no problem, provided you 6 The special sign “.” marks the beginning or end of a add them to the lists of special characters whose word. 5 Unless you have put several discretionary \- breaks in the same word. 34 TUGboat, Volume 16 (1995), No. 1
Such a command may be processed only by the ini- such languages are flexive.7 With such languages— tialization version of TEX and is illegal while using and all Romance languages belong to this class — (LA)TEX in its normal operating version. The or- the conjugation of a verb might include from 60 to der of patterns in the pattern list should not have 80 different forms, so that if an exception involves any influence, but the list can be maintained much the stem of a verb, some 60 to 80 different entries more easily if it is alphabetized on letters only (i.e. must be made in the \hyphenation list for that one disregarding digits). verb. When extended characters are used, the pat- As a concluding remark it is worth noting that terns may contain control sequences, but these are the patterns inserted via the \patterns command restricted to those that expand to single characters. are static: once they have been processed by initex, If you use the technique outlined in Beccari [3], you you cannot change them, unless you create a new will have no problems, even if you use the standard format. Hyphenation exceptions introduced via (LA)TEX accent macros; just remember to put a digit the \hyphenation command are dynamic, and you (possibly 0) after the control sequences that map to can add new exceptions at any point in your com- standard characters — i.e. \oe, \ae, \aa, \o, \ss, puscript. Any new exception gets added to the andl. list valid for the current language, and if a word is entered twice (supposedly with different hyphen 6 Hyphenation exception lists points) the last one is the one TEX uses. Hyphen- TEX can do an excellent job hyphenating words in ation exceptions are global and are not limited by a particular language but it is not perfect, because group delimiters. every language uses words borrowed from other lan- guages or created with foreign roots and local end- 7TEX hyphenation mechanism ings. In some instances, especially when patterns TEX’s hyphenation mechanism consists in examin- are generated with patgen, unusual words may have ing each word to find if it appears in the \hyphen- been omitted from the list fed to patgen so that ation argument of the current language; if so, it these words might get hyphenated incorrectly. hyphenates the word accordingly; otherwise, it ex- These exceptions can be addressed by using amines all possible patterns for dividing the word, “hyphenation exception lists”, one for each lan- patterns that must appear in the pattern list for the guage, that consists of space-delimited hyphenated current language. This done, it compares and saves words given as arguments to the \hyphenation the highest digits that the several patterns have pro- primitive. Originally Knuth prepared a list of 14 duced between the same pairs of letters, and inserts exceptions for English, but several years of usage (implicit) discretionary breaks where odd digits ap- have produced a continuously growing list of ex- pear. ceptions that is maintained in the CTAN archives. To tell the whole truth, TEX actually does not On this subject it is interesting to read the words insert such discretionary breaks before a certain by B. Beeton that accompany the 1992 US English number of letters from the beginning of the word exception log [4]. and after another (possibly different) number of let- We are of the opinion that exception lists should ters from the end; with version 3.xx of TEX, such be avoided as much as possible, at least for those numbers, call them λ and ρ respectively, can be set languages with patterns made by hand: if manual with the commands creation was possible, it means that the syllabifica- \lefthyphenmin = λ tion rules were simple enough to consider most, if \righthyphenmin = ρ not all, normal situations. Hyphenation exception lists then become useful only in specific documents With US English, the default values are λ =2 where unusual words are used: — first and/or fam- and ρ = 3, but for UK English, the left limit is λ =3; ily names, foreign toponyms, chemical compound for Italian, where there are no mute vowels, it is cor- names, and the like. The one exception to this seems rect to put λ =2andρ = 2, although, if the line to be English: while the number of hyphenation width is sufficiently large, λ =3andρ = 3 is more rules would seem to argue that the manual approach elegant. These two numbers, therefore, relate to the could not be considered reasonable, the language is 7 ‘Flexive’ describes languages where nouns, adjectives widely used and a manually-prepared hyphenation and verbs have different forms (or spellings, if you wish) to exception log is regularly maintained and updated. show additional elements of meaning: singular or plural, mas- In most other languages hyphenation exception culine or feminine, case (nominative, accusative, etc., as in Latin) for nouns and/or adjectives, and the many forms due lists could become too difficult to create, especially if to verb conjugation. TUGboat, Volume 16 (1995), No. 1 35
typographic style, not to the hyphenation mecha- • v1c1v2 → v1-c1v2 nism, and may be varied according to individual • v1c1c2v2 → v1c1-c2v2 language/national typographical practice and to the • and so on (these rules are just examples particular style one prefers. This implies that pat- and do not refer to any particular lan- terns and hyphenation exceptions lists should not guage) consider these two values and should produce cor- If you succeed in translating all the “spoken” rules rect breakpoints even if they are set at λ =1and in the above form, then you can proceed to con- ρ =1. struct the pattern list. Otherwise, patgen is proba- 8Tools bly the only practical solution: you must start from the very beginning and construct a huge list (several Constructing a pattern list for a language whose hy- thousand words) of perfectly spelled and hyphen- phenation rules are not too complicated is not a dif- ated words taken from a language dictionary, then ficult task; you just have to organize yourself with process this list by means of patgen. But before the following “tools”: starting this heavy task be sure to do everything you 1. a grammar or a very good personal knowledge can in order to find out if someone else has already of the language; 8 done it; public archives are there for that purpose! 2. a dictionary where syllabification is shown; We will now describe the manual procedure. 3. any national or language regulations, if they ex- Something to keep in mind as you translate the spo- ist, that establish rules for typographical hy- ken rules into abstract form: prepare a list of words phenation; that contain examples of application of the rules; 4. if available, an excellent tool would be an ortho- counter examples are also precious elements in this graphic dictionary in computerized form; for list. Both can later be used to easily check the per- Italian we found the one prepared by Luigi formance of your hyphenation algorithm. Bianchi (to be used with AMSpell [6]); 5. a good handbook for typographical typesetting 9 Romanian syllabification “spoken” rules practice. Our colleague Tulei was able to produce a list of Another couple of tools are the \showhyphens macro Romanian syllabification spoken rules in three forms provided by TEX, and a macro set provided by Eijk- which are not completely equivalent so that some hout [8], called \printhyphens. intelligent interpretation must be introduced: The first tool, \showhyphens, outputs TEXhy- 1. First form [1]: phenation on the screen and into the .log file in (a) if one vowel is followed by a single conso- the form of the usual warning message for underfull nant, the latter belongs to the following hboxes. Eijkhout’s macros instruct TEX to set one syllable; word per line with hyphens inserted at the break- (b) two vowels that do not form a diphthong points identified by T X’s hyphenation algorithm. E belong to different syllables; The second approach is more elegant, and allows you to produce a clean hardcopy; its only draw- (c) i and u between other vowels behave as back is that fragile commands tend to break up. semi-consonants and start a new syllable; \showhyphens is much simpler, but its output (from (d) if a vowel is followed by several consonants the .log file) does not contain the extended charac- the first consonant belongs to the left syl- ters; on the other hand, the output may contain lable and the other consonants to the right strange symbols9 or “double caret” sequences, so syllable with the exception of the following that its interpretation requires a little skill by the items; user. (e) if one of the consonants b, c, d, f, g, p, t, Then you should translate the “spoken” syllabi- v is followed by l or r, the pair cannot be fication rules into “abstract” statements of the form: split; if vi and ci are vowels and consonants respec- (f) the groups ct, ct¸,andpt preceded by one tively, hyphenate as follows: or more consonants get split between the 8 first and the second elements of the group; This is not a trivial recommendation; except for English, we do not know of any language for which syllabification is a (g) prefixed words are separated according to standard feature of every dictionary. 9 etymology if the component words main- They are the symbols that the computer has in its inter- tain their integrity. nal tables for driving the screen or the printer in correspon- dence with the internal TEX character “numbers”. 2. Second form [13]: 36 TUGboat, Volume 16 (1995), No. 1
(a) a single consonant between two vowels be- Consonants: b,c,d,f,g,h,j,k,l,m,n,p,r, longs to the second syllable; s, ¸s, t,¸, t v, x, z (b) two consonants between two vowels Vowels: a, ˆa,a, ˘ e, i, ˆi, o, u are separated unless the second is an l or Special letters: q, w, y an r; Table 1: Classification of Romanian letters (c) three or more consonants between two vow- els are divided so as to leave one or two consonants with the second syllable, the French and English use an apostrophe. This is a latter case occurring when the last conso- minor problem because TEX hyphenates words con- nant is l or r; taining the intra-word dash or hyphen only in cor- (d) vowels forming a hiatus are divided; respondence with the hyphen character; in the case (e) prefixed words are divided according to where the cratima is used for marking vocalic elision their etymology. there should be no line break. 3. Third form [5]: 10 Romanian “abstract” rules (a) vowels forming a hiatus are divided; Let us put the Romanian spoken rules into abstract (b) a single consonant between two vowels form. As usual, let ci be the consonants, li the liquid belongs to the second syllable; for the consonants l or r, vi the vowels in general, oi the application of this rule the digraphs ch vowels belonging to the set {a, ˆa,a, ˘ e, o},andx is aspecificletter(x in this example). This yields the and gh are considered a single consonant, following abstract rules: as are digraphs imported from other lan- v c v v c v guages which are not “regular” in Roma- 1 1 2 → 1- 1 2 (1) nian words, such as, for example, sh; if c1 ∈{li} and c2 6∈ {li} then (c) two consonants are divided unless an l or v1c2c1v2 → v1-c2c1v2 (2) an r follows one of the consonants b, c, d, v1c3c2c1v2 → v1c3-c2c1v2 (3) f, g, h, p, t, v; else (d) three consonants are divided after the first v1c2c1v2 → v1c2-c1v2 (4) one if the group ends with l or r; otherwise, if c2c1 = ct or they are divided after the second conso- c2c1 = ct¸ or nant. The “regular” Romanian consonant c2c1 = pt or triplets of the latter group are: lpt, mpt, c2c1 = pt¸ then mpt¸, ncs, nct, nct¸, ndv, rct, rtf, stm ; v1c3c2c1v2 → v1c3c2-c1v2 (5) (e) four or five consecutive consonants are di- else vided after the first one, except in adapted v1c3c2c1v2 → v1c3-c2c1v2 (6) words and neologisms. v1c4c3c2c1v2 → v1c4-c3c2c1v2 (7) For rule 3e the underlying concept seems to be that v1c5c4c3c2c1v2 → v1c5c4-c3c2c1v2 (8) division must take place where the second syllable can be pronounced by a normal Romanian speaker. end if end if In other words, the second syllable must start with if v1 = v2 then the longest set of consonants that can be found at v1v2 → v1-v2 (9) the beginning of other Romanian words. The cor- else if v1 = i or v1 = u then rect division is therefore ang-strom,notan-gstrom o1v1o2 → o1-v1o2 (10) because there is no Romanian word starting with end if gstr, but there are plenty starting with str. Another remark: among the grammars exam- In order to apply the above rules it is necessary to ined, none considers the group ngv that appears in define the sets of vowels and consonants; in Roma- at least one word that is linguistically important: nian we have the situation summarized in Table 1. lingvistic. By analogy with the previous remark, The letters q, w,andy are classified as special be- since no Romanian word starts with gv,wehavede- cause they appear only in words not strictly Ro- cided to adopt the division ng-v. manian, that is imported or adapted from foreign The cratima or liniut¸˘adeunire(intra-word dash languages; in order to hyphenate at least some of or hyphen) is used to connect most compound words, these imported words, such letters will be included but it is also used to mark vocalic elision, much as in the patterns. Furthermore the digraphs ch, gh, TUGboat, Volume 16 (1995), No. 1 37
sh, and the like, that are not listed in Table 1, must b2l c2l d2l f2l g2l h2l k2l p2l t2l v2l be counted as a single consonant. b2r c2r d2r f2r g2r h2r k2r p2r t2r v2r We considered hiati, diphthongs and triph- Next, we allow word division between other conso- thongs only to a limited extent. Romanian is very nants, inhibiting word division before the first one rich in diphthongs and triphthongs, but the same of the pair; as well, we introduce some patterns for couples or triplets of vowels may be indivisible or dealing with the special letters w and y: form a hiatus in different words, or in the conjuga- 2bc 2bd 2bj 2bm 2bn 2bp 2bs b3s2t 2bt tion or declination of the same word they might play 2bt¸2bv a different rˆole and become divisible when they used not to be—and vice versa. Since it is so difficult 2cc 2cd 2ck 2cm 2cn 2cs 2ct 2ct¸2cv 2cz to classify pairs or triplets of vowels as diphthongs 2dg 2dh 2dj 2dk 2dm 2dq 2ds 2dv 2dw or triphthongs, the best thing to do is not to divide 2fn 2fs 2ft them at all, except when rules 9–10 are applica- 2gd 2gm 2gn 2gt 2g3s2 2gv 2gz ble. Furthermore, let us remember the typographic 2jm point of view of avoiding line breaks within a vocalic 2hn group. 2lb 2lc 2ld 2lf 2lg 2lj 2lk 2lm 2ln 2lp 2lq 2lr 2ls 2lt 2lt¸2lv 2lz 11 Romanian patterns 2mb 2mf 2mk 2ml 2mn 2mp 2m3s2 2mt¸ With Romanian, as with other languages where the 2nb 2nc 2nd n3d2v 2nf 2ng 2nj 2nl 2nm rules refer to vowels and consonants, it is necessary 2nn 2nq 2nr 2ns n3s2a. n3s2˘an3s2e to establish if the former or the latter play a more n3s2i n3s2o n3s2cr n3s2f ns3h n3s2pl important rˆole in hyphenation. If we exclude the n3s2pr n3s2t 2n¸sn3¸s2cn3¸s2t2nt 2nt¸ division of vocalic clusters (except when rule 10 ap- 2nv 2nz n3z2dr plies), there is no doubt that consonants are the dis- 2pc 2pn 2ps 2pt 2pt¸ criminating elements, so that patterns may be built 2rb 2rc 2rd 2rf 2rg 2rh 2rj 2rk 2rl 2rm focusing on consonants. 2rn 2rp 2rq 2rr 2rs r3s2t 2r¸s2rt 2rt¸ When we want to apply rules 1–10, we must 2rv 2rx 2rz not implement the patterns by making all the com- 2sb 2sc 2sd 2sf 2sg 2sj 2sk 2sl 2sm 2sn binations of letters implied by such rules. We would 2sp 2sq 2sr 2ss 2st 2sv 2sz otherwise create a set of patterns that would be un- 2¸sn necessarily large, because it would contain combina- 2¸st tions that never occur in the real language. 2tb 2tc 2td 2tf 2tg 2tm 2tn 2tp 2ts 2tt Furthermore we use the smallest weight-digits 2tv 2tw available, remembering that 0 is also implicitly used when we do not write anything. So we start with 2vn omitting all vowels, unless specifically required, and 1w wa2r start producing the following patterns that should 2xc 2xm 2xp 2xt take care af all single intervocalic consonants, and 1y 2yb 2yl 2ym 2yn 2yr 2ys all single consonants at the beginning and ending of 2zb 2zc 2zd 2zf 2zg 2zl 2zm 2zn 2zp 2zr words: 2zs 2zt 2zv 1b 1c 1d 1f 1g 1h 1j 1k 1l 1m 1n 1p 1q 1r Some of the patterns have a non-repetitive look in 1s 1¸s1t1t¸1v1x1z order to take care of rules 6, 7 and 8; for example .b2 .c2 .d2 .f2 .g2 .h2 .j2 .k2 .m2 .p2 n3s2t overrides 2st so that a word like monstru can .s2 .¸s.t2 .t¸.v2 .z2 be hyphenated correctly as mon-stru. With just the 2b. 2c. 2d. 2f. 2g. 2h. 2j. 2k. 2l. “regular” pattern 2ns, the hyphenation would have 2m. 2n. 2p. 2r. 4s. 2¸s.4t. 2v. 2x. been mons-tru. 2z. Finally, we introduce the patterns that involve vowels; in particular, we inhibit hyphenation when Now we consider the l and r rules starting with state- the last syllable (and the whole word altogether) ment 2; in passing, we also add the digraphs ch ends with an i. This is useful because the final i and gh, the digraph sh that appears so often in for- is almost mute in the endings of some masculine eign words, and the group tz that appears in several non-articulated nouns and adjectives, and in such adapted technical words: cases line breaks are not allowed [7]. But since TEX c2h g2h s2h t2z does not know anything about the language, we take 38 TUGboat, Volume 16 (1995), No. 1
a conservative approach and inhibit division before Let us point out that with the help of such tools any ending syllable terminated with an i. one can verify both the validity of the pattern set a1ia a1a a1ia a1ie a1io ˘a1ie˘a1oa^a1ia and if any patterns are missing. For example, if a e1e e1ia i1i i2ii. 2i. o1o o1ua o1u˘a word turns out to have incorrect hyphen points, it is u1u u1ia u1al. u1os. u1ism u1ist u1i¸st possible to find out why: — wrong patterns? miss- 2bi. 2ci. 2di. 2fi. 2gi. 2li. 2mi. ing patterns? wrong weights? — and to proceed with 2ni. 2pi. 2ri. 2si. 2¸si.2ti. 2tri. corrections. This is how we discovered that we had 2t¸i. 2vi. 2zi. originally missed the pattern 2tt, which refers to Eventually we add a few patterns that work with a double consonant not present in “normal” Roma- some prefixes: nian; in fact, without this pattern, wattmetru gets hyphenated as wa-tt-me-tru, which is obviously .ante1 .anti1 wrong since it has a word fragment without vowels. .contra1 Similar situations can be easily spotted by means of .de3s2cri the word list that had been prepared together with .de2z1aco .de2z1am˘a.de2z1apro the patterns, as suggested above. With the help of a .de2z1avan .de2z1infec .de2z1ord dictionary it is possible to spot unusual or doubtful .i2n1ad .i2n1am .i2n1of .i2n1˘asp^ words, include them in the test word list, and find .i2n1ad^ .i2n3s2^ .in3¸s2^ out if they get hyphenated correctly. .ne1a .ne1i^ .nema2i3 .ne3s2ta A few concluding remarks. Romanian is not .re1ac .re1i^ particularly easy nor particularly difficult to hyphen- .su2b1ord .su2b3r ate. If one gives up the possibility of TEXdoingthe .supra1 whole task of separating the prefixes, and is willing .tran2s1 .tran4s3pl .tran4s3f to using "- or similar macros for inserting discre- 12 Tests and conclusion tionary soft breaks in the prefixed words, the pattern file one gets is of very modest size although it pro- We were able to obtain a small pattern file con- 10 duces correct hyphenation in most circumstances. taining 350 patterns and requiring 31 ops. This Such a small file is compatible with the memory lim- pattern set is much simpler than that mentioned in itations of small implementations of TEX the pro- Sojka and Seveˇˇ cek, and the results are simpler too. gram; (LA)TEX can be initialized with several hy- However, although the ratio between the number of phenation patterns at the same time, allowing the patterns cited by these two authors and our set is ap- user to create a multilanguage tool that allows him/ proximately 12, our results are not 12 times poorer. her to typeset texts in several languages without the It might be interesting to test them on a large set need for changing software when passing from one of Romanian words in order to compare the differ- language to another. ences. As for our experience, we used these patterns to typeset a large number of Romanian documents Acknowledgments but never experienced an incorrect line break. We would like to thank the precious cooperation of Using Eijkhout’s \printhyphens macro, we can Mihai Lazarescu and Janetta Mereuta who helped verify what we got: feeding this macro some test 11 very much with the queries and discussions on the words we get the following: Internet and with the creation of a test word list a-do-les-cen-ti-lor in-dus-triei with normal and unusual Romanian words. a-pro-b–and in-e-gal ba-ia jert-f–a References Bu-cu-resti mon-stru [1] A.V., ˆIndreptar ortografic, ortoepic ¸si de con-struc-tor post-de-cem-bris-t–a punctuat¸ie, Bucure¸sti, 1971 (ed. III-a). dum-ne-zeu vre-mea [2] Beccari C., “Computer Aided Hyphenation for drep-tu-lui Wa-shing-to-nu-lui Italian and Modern Latin”, TUGboat 13(1):23– dez-a-van-ta-jul watt-me-tru 33, April 1992. fla-shul [3] Beccari C., “Configuring TEXorLATEXfor 10 One pattern not mentioned here is 2-2, which makes it typesetting in several languages”, TUGboat cratima possible to distinguish between the (referred to in 16(1):18–30, March 1995. section 9) and the hyphen character, by using appropriate category codes. [4] Beeton B., “Hyphenation Exception Log”, 11 For testing purposes, we set λ =1andρ = 1; in normal TUGboat 13(4):452–457, December 1992. Romanian typesetting, it is better to set λ =2andρ =2. TUGboat, Volume 16 (1995), No. 1 39
[5] Beldescu G., Ortografia actuala a limbii that you may read your source file with a minimum romˆane, Bucure¸sti: Editura S¸tiint¸ific˘a¸si En- of extra characters interspersed. ciclopedica, 1985. German and Dutch users, who have similar [6] Bianchi L., italian.zip. [Includes the com- problems, have chosen the double quote "; nothing plete set of tools for checking correct Italian prevents us from doing the same for Romanian, but spelling in ASCII and (LA)TEX compuscripts; the double quote is used so many times explicitly available from ftp.yorku.ca or contacting di- or behind the scenes by TEX, especially for repre- rectly L. Bianchi at [email protected].] senting internal codes in hexadecimal notation, that [7] Breban V., Bojan M., Com¸sulea E., Negomi- we prefer another, more “innocent” character — the neanu D., S¸erban V., Teiu¸sS.,Limba romˆan˘a exclamation mark !. A corect˘a, Bucure¸sti: Editura S¸tiint¸ific˘a, 1973. ALTEX guide has been published recently in [8] Eijkhout V., “The bag of tricks”, TUGboat Romania [11], where the apostrophe sign has been made active and defined in such a way as to save 14(4):424, December 1993. a lot of keystrokes while keying in the source text. [9] Ferguson M., frhyph.tex, frhyph7.tex, We wonder if the authors actually used the macros frhyph8.tex. [Hyphenation pattern files avail- they suggest on page 109 of their guide because the able from the CTAN archives in the directory \newcommand command cannot (ordinarily) operate /tex-archive/languages/french.] on active caracters, but only on control words — [10] Knuth D.E., The TEXbook, Reading Mass.: they probably used a regular \def command. Sev- Addison-Wesley, 1990. 12 eral other TEX users around Romania who an- [11] Pusztai A. and Ardelean Gh., LATEX, Ghid de swered our questions on the Internet revealed that utilizare, Bucure¸sti: Editura Tehnic˘a, 1994. they are using definitions similar to the ones pub- [12] Sojka P., Seveˇˇ cek P., “Hyphenation in TEX lished in the Romanian LATEX guide. But the same — Quo Vadis?”, Proceedings of the Eighth criticism remains: the apostrophe is used internally European TEX Conference (Sept. 26–30, 1994, by the plain and LATEX format files for identifying Gda´nsk, Poland), 59–68. octal numbers, and it becomes really difficult to cre- [13] Zamsa E., Limba romˆan˘a, recapitulari ¸si ate correct definitions so that character “activity” exercit¸ii, Bucure¸sti: Editura S¸tiint¸ific˘a, 1991. does not interfere with octal notation. Therefore we will stick with our choice of the ex- clamation mark, but what we state here can be ap- Appendix plied to any other “innocent” character. If you use Useful shorthand macros babel, then you should add the exclamation mark to Although Table 1 shows that just five diacriticized the special TEX characters and provide an argument characters appear in the Romanian alphabet, the to \extrasromanian such as, for example: language makes frequent use of these characters, so \addto\extrasromanian{% that the ASCII source .tex file is filled with se- \babel@add@special\!} quences such as \u{a}, \^a, \^{\i}, \c{s}, \c{t} \addto\extrasromanian{% (and their uppercase counterparts), to the point \babel@remove@special\!} where the source file is almost unreadable. A sin- \addto\extrasromanian{% gle word may contain several such characters, as for \babel@savevariable{\catcode‘\!}% example ¸ˆtı¸snitur˘a, \c{t}\^{\i}\c{s}nitur\u{a}, \catcode‘\!\active} that contains four of them, so that to key in the In the style file there must be a statement that mem- source file is quite error prone, and finding and cor- orizes the meaning of ! before changing catcode: recting possible errors turns out to be quite difficult. \let\@xcl@@=! If you have a Romanian keyboard, you can eas- ily map the input Romanian characters (with in- and a set of definitions for the active exclamation ternal code higher than 127) to the appropriate se- mark: quences: on the screen the text appears without \begingroup backslashes and braces, but if you have to send your \catcode‘\!=13 file to somebody else on some computer network \gdef!{\protect\p@xcl@} global substitutions have to be made to put back \endgroup the cumbersome sequences. 12 They are too numerous to be cited here, but we take Another approach would be to make some char- this opportunity to thank them warmly. acter active and then define it in a suitable way, so 40 TUGboat, Volume 16 (1995), No. 1
Some macros must be defined in order to test if !a ˘a !A A˘ what follows the exclamation mark is a non-expand- !i ˆı !I ˆI able token (i.e. a character token or a primitive TEX !s ¸s !S S¸ command), or has an expansion: !t t¸ !T T¸ \def\@Macro@Meaning#1->#2?{% ! ! !" ` \def\@Expansion{#2}} !- - !| % \def\TestMacro#1#2#3{% Table 2: Sequences obtained with the active \expandafter exclamation mark and their results. In math mode \@Macro@Meaning\meaning#1->?% the exclamation mark preserves its meaning. \ifx\@Expansion\empty \def\@Action{#3}% % \else \expandafter \def\@Action{#2}% \def\csname @xcl@a\endcsname{\u{a}} \fi \expandafter \@Action} \def\csname @xcl@A\endcsname{\u{A}} More definitions are needed, and the apparently \expandafter tortuous path followed before arriving at the desired \def\csname @xcl@s\endcsname{\c{s}} goal is due to the necessity of making such macros \expandafter robust, so that they do not break apart when they \def\csname @xcl@S\endcsname{\c{S}} appear in moving arguments, such as when writing \expandafter to auxiliary files for cross-reference purposes or for \def\csname @xcl@t\endcsname{\c{t}} preparing indices or tables of contents. \expandafter \def\p@xcl@{\ifmmode \def\csname @xcl@T\endcsname{\c{T}} \@xcl@@ \expandafter \else \def\csname @xcl@i\endcsname{\^{\i}} \expandafter\ExCl \expandafter \fi} \def\csname @xcl@I\endcsname{\^I} % \expandafter \def\ExCl{\futurelet\eXcl\exCl} \def\csname @xcl@"\endcsname{^^12} % \expandafter \def\excL{\Excl{}} \def\csname @xcl@-\endcsname{\char"7F} % \expandafter \def\exCl{\TestMacro\eXcl{\excL}{% \def\csname @xcl@|\endcsname{\hz\-\hz} \ifcat\eXcl a% When you select the Romanian language the \let\eXcL\Excl exclamation mark becomes active and you get the \else shorthand notations summarized in Table 2. The \ifcat\eXcl "% last line of this table requires some additional com- \let\eXcL\Excl ment. \else 1. The sequence !| produces a discretionary break \let\eXcL\excL that does not inhibit hyphenation in the rest of \fi the word, and is useful for marking the bound- \fi ary between a prefix and a word stem. This se- \eXcL}} quence complements the regular TEX sequence Finally the definitions we were aiming at: \- that inserts a discretionary break but in- \def\Excl#1{\expandafter hibits hyphenation in the rest of the word. \ifx\csname @xcl@#1\endcsname \relax 2. The sequence !- produces the short dash that \@xcl@@\space #1% is used for connecting compound words (quite \else rare in Romanian); it inserts the \hyphenchar \csname @xcl@#1\endcsname that TEX treats in a special way for what con- \fi} cerns line breaking. In fact, when this character % appears, TEX breaks the line, if necessary, only \def\hz{\nobreak\hskip\z@} after the short dash. TUGboat, Volume 16 (1995), No. 1 41
3. The frequent cratima, i.e. the short dash to be acela!si timp c!a: !"Numele generice used in place of an elided vowel or for connect- {\em deal, fluviu, insul!a, lac, munte, ing enclitic pronouns (across which line breaks peninsul!a, r\^au, vale} etc., c\^and are forbidden), is obtained by the usual - sign, nu fac parte din denumire, se scriu cu but to avoid having TEX treat it as the hyphen- ini!tial!a mic!a‘‘. char, it is necessary (a) to chose a different en- Note that ˆa still requires the full accent sequence,13 try in the font table for the hyphen character, but it is just three keystrokes, compared with the and (b) to assign the “minus sign” an \lccode numerous keystrokes required by the other special different from 0, so that in text mode it can be characters. Notice the sequence !- that connects treated as a regular letter and it is possible to a compound word (after which TEX will break the enter a special pattern, 2-2 in the pattern list, line if necessary) and the cratima that connects the so that line breaks are forbidden across it. By enclitic pronoun (where hyphenation is forbidden). so doing, the number of ops increases by 1, but And finally, notice the reversed and lowered quota- still remains very reasonable. tion marks and the soft discretionary inserted after The extended fonts actually include two short the prefix pro.14 The corresponding typeset text ap- dashes, one with hexadecimal code "2D,andthe pears as such: other with hexadecimal code "7F.ForRoma- nian, it is therefore necessary to assign a low- Problema desp˘art¸irii cuvintelor ˆın silabe se ercase code to "2D so as to use it as a cratima, pune mai ales ˆın scriere, unde se ivesc difi- and to declare "7F to be the \hyphenchar for cult˘at¸i cˆand este vorba de vocale sau de con- the current extended font. soane succesive. Dar problema desp˘art¸irii ˆın In order to revert to the original situation silabe este strˆans legat˘a¸si de pronunt¸area when a different language is selected, it is neces- corect˘a a unor cuvinte care cont¸in diftongi, sary to restore the \lccodesandthe\hyphen- triftongi sau vocale in hiat, dup˘acumseva char assignments. We did not investigate how vedea mai departe. this is done in babel, but we did it with our im- La categoria de nume geografice teritorial- ˆ plementation, which is less general, compared administrative, Indreptarul ortografic prevede with the facilities offered by babel. c˘a acestea `se scriu cu init¸ial˘amajuscul˘ala toate cuvintele componente“, precizˆandu-se It is worth noting that the exclamation mark fol- ˆın acela¸si timp c˘a: `Numele generice deal, lowed by a character different from one of those fluviu, insul˘a, lac, munte, peninsul˘a, rˆau, vale listed in Table 2 reproduces itself; since this mark is etc., cˆand nu fac parte din denumire, se scriu commonly used at the end of a sentence, and there- cu init¸ial˘amic˘a“. fore is followed by a space or an end-of-line mark, a space is inserted by default by the macro expansion. ⋄ Claudio Beccari A sample (source) text follows, so as to appre- Dipartimento di Elettronica ciate the shorthand notations just introduced. Politecnico di Torino Problema desp!ar!tirii cuvintelor !in Turin, Italy silabe se pune mai ales !in scriere, Email: [email protected] unde se ivesc dificult!a!ti c\^and ⋄ Radu Oprea este vorba de vocale sau de consoane Dipartimento di Elettronica succesive. Dar problema desp!ar!tirii Politecnico di Torino !in silabe este str\^ans legat!a !si de Turin, Italy pro!|nun!tarea corect!a a unor cuvinte ⋄ Elena Tulei care con!tin diftongi, triftongi sau Universitatea Tehnic˘ade vocale in hiat, dup!a cum se va vedea Construct¸ii mai departe. Bucure¸sti, Romˆania
La categoria de nume geografice 13 Up to a couple of years ago, ˆa wasusedonlyintheword teritorial!-administrative, {\em Romˆania and its derivatives. Since the last spelling reform this sign is used more often; in the example we have used the !Indreptarul ortografic} prevede c!a modern spelling. acestea !"se scriu cu ini!tial!a 14 Actually there is no need for this discretionary break — majuscul!a la toate cuvintele it was put there just to show its use. With our patterns we componente‘‘, preciz\^andu-se !in did not succeed in finding a sample text containing something that really requires the use of a soft discretionary break. 42 TUGboat, Volume 16 (1995), No. 1
TEX and Linguistics by the end of its first month over 150 subscribers had joined. There was — and continues to be — interest: Christina Thiele the list currently has some 250 subscribers. Over the past few years a small but persevering Begun late in 1993, the list was originally set group of us have been working on gathering informa- up by George Greenwade at SHSU. ling-tex has tion about the use of TEX for typesetting linguistics now moved to the University of Oslo, where it is material. And for more than just the past few years, maintained by Dag Langmyhr. To subscribe to the many people have been developing macros to deal list, send a message to ling-tex-requests@ifi. with the specialized formatting which often arises uio.no. To post messages, address them to ling- in this field. [email protected]. A Web page has recently been The information-gathering has yielded some started and will be ready for public viewing by the useful results which have been announced and pub- time you read this article. The address is http:// lished in various places; this short piece simply col- www.ifi.uio.no/~dag/ling-tex.html. lects that all into one article. 3Theling-mac.tex inventory There are three main items to present here: the Technical Working Group (TWG), the ling-tex One immediate result of the ling-tex list was in- list, and an inventory of style files (ling-mac.tex). formation about style files various people had devel- oped to deal with such formatting issues as tree dia- 1TheTWGforTEX and Linguistics grams, examples, glosses, and attribute-value matri- Bib 2 This TWG was formed in July of 1994, following the ces, as well as bibliographies via TEXfiles. The previous half-year’s activity on the ling-tex list first draft was circulated in early 1994; it is now (see below). While activity in the TWG has been regularly updated and posted to the ling-tex list. quite minimal, the following people are now part The inventory appears below; most material is for A of the group: Christopher Manning (Stanford Uni- LTEX rather than plain. versity), Rei Fukui (University of Tokyo), myself as TEXandLATEX Macros for Linguistics chair, and Stuart Shieber (Harvard University). The TWG maintains the inventory of macros (see below), The following list is not intended to be exhaustive and is planning a similar inventory of fonts. or complete; it is based on information which has The TWG is officially designated as WG-94-10 come to light as people have posted messages to the (SI-TWG = special interest). Our mandate currently ling-tex mailing list. reads as follows: “The main goal is to study and The material is all public domain, but the usual discuss the requirements for typesetting linguistics requests for citing authorship, not changing the con- in TEX and as a means of identifying, examining, tents without changing the file name, and so on testing, and comparing macros, fonts, style files and apply. These are the results of volunteers efforts, other aids for typesetting linguistics.” The liaison and a desire to share those efforts with others; this to the Technical Council is Yannis Haralambous.1 should always be kept in mind. Constructive criti- cism, helpful suggestions, or offers of revised coding 2Theling-tex list or wording are always welcome. ling-tex The purpose of is to identify material Index of .sty files which is available to typeset linguistics text with 1. avm-doc.tex, avm.sty TEX, and to also stimulate testing, improvements to code and documentation, and so on, all with as 2. cgloss4e.sty much cooperation and assistance from the original 3. chomsky.sty authors as possible. It is no coincidence that this 4. cjl-glosses.tex is very similar to the mandate of the TWG:atthe 5. cm-lingmacros.sty time the Technical Council was inviting people to 6. covingtn.tex covingtn.sty consider starting up special interest working groups, 7. french.sty it wasn’t clear to me how much of an interest there 8. glex.sty 9. gloss.tex, gloss.doc. might be in TEX and linguistics. The list was a way to gauge that interest. Initial response was such that 10. lingmacros.sty 11. lsalike.sty, lsalike.bst 1 For a complete list of Technical Working Groups, ftp 2 the file twg-list.tex from CTAN in tex-archive/usergrps The ling-mac.tex file can be found on CTAN in (.dvi and .ps files are also available. tex-archive/info/ling-mac.tex. Contact the TWG for ad- ditions and corrections. TUGboat, Volume 16 (1995), No. 1 43
12. numquote.doc, numquote.tex, enum.sty 7. French Style Files: 13. pstrees Bernard Gaulle ([email protected]). 14. pstricks French-based style files offering an easy-to-use mul- 15. tree-dvips tilingual scheme to work with other languages (En- 16. voorbeeldom.sty glish and German are currently offered). French patterns are up-to-date and there are a lot of test Details on various .sty files files. This package also offers a way to change your keyboard “on the fly” and to set your default at 1. avm-doc.tex, avm.sty: initex time, i.e. when creating your format. Two Christopher Manning versions are released per year. ([email protected]). ftp.univ-rennes1.fr:pub/GUTenberg/french Macros for attribute-value matrices. Documenta- tion available (but not printed in this collection). 8. glex.sty: csli.stanford.edu:pub/TeXfiles Rob Norris; notes from Chet Creider. A 2. cgloss4e.sty: LTEX macros for numbered glosses. All three lines cjl-glosses. This is a modified version of covingtn.sty by Hap of a gloss are input; by contrast, tex Kolb and Craig Thiersch. For glosses; as with only takes care of the first 2 lines, requiring covingtn.sty, does not require ampersands (&)to the 3rd line, the translation, to be formatted in- glex.sty align sets of glossed items. dependently. On the other hand, works with tabs, while cjl-glosses.tex groups each set “[The f]ollowing borrows from M. Covington’s style of word-1 over gloss-1 within braces. files inspired by Midnight by M. de Groot, adapted to be used with gbt3.sty: examples beginning with [Availability not yet determined.] \ex can contain glosses directly. Default is Linguis- 9. gloss.tex, gloss.doc: tic Inquiry style with all lines in \rm.” Part of the Midnight Macros set by Marcel van der No documentation, but file is heavily commented. Goot ([email protected]). Posted to ling-tex; not currently available on Macros for vertically aligning words in consecutive archives. sentences. Documentation. 3. chomsky.sty: CTAN: tex-archive/macros/macros/generic/ Michael Barr. midnight No documentation; however, file is heavily anno- 10. lingmacros.sty: tated. Some draft documentation by Ch. Thiele Emma Pease, CSLI, Stanford. 4. cjl-glosses.tex: Macros for numbered examples, trees, AVM struc- Maintained by by Ch. Thiele tures. ([email protected]). csli.stanford.edu:pub/TeXfiles Macros for glosses (seems to work in both plain A 11. lsalike.sty, lsalike.bst: TEXandinLTEX). Variants for centred, flush right or left glosses, and others. Some documen- Daniel S. Jurafsky, UC Berkeley. Bib tation; needs testing before it can be put out on “lsalike style file for TEX. It implements a bibli- the archives. ography format which is very close to the LSA style Posted to ling-tex list; not currently available on sheet and resembles the journal Language.Among archives. its advantages are that it does the lovely dashed lines for repeated bib entries that makes Language 5. cm-lingmacros.sty: bibliographies so easy to read, and it also makes Christopher Manning and Avery Andrews citations of the form ‘Chomsky (1965:134)’ very ([email protected]). easy.” Modified version of lingmacros.sty (see below). ftp.icsi.berkeley.edu:pub/ai/jurafsky csli.stanford.edu:pub/TeXfiles 12. numquote.doc, numquote.tex, enum.sty: 6. covingtn.tex, covingtn.sty: Bob Mercer, U. of Western Ontario. A Michael Covington. LTEX macros for automatic numbering of exam- A ples. Documentation. LTEX macros for numbered examples, glosses, phrase structure rules, feature structures, discourse Posted to ling-tex list; not currently available on representation structures, exercises, reference lists, archives. and miscellany. Documentation. CTAN: tex-archive/macros/latex/contrib/ covington 44 TUGboat, Volume 16 (1995), No. 1
13. pstrees: 17. treetex: Avery Andrews; requires tree-dvips (see below). Anne Brueggemann-Klein and Derick Wood. “This package consists of a preprocessor and some Extensive tree-drawing macro set. Documentation macro-definitions, by which linguistics-style trees available. See also A. Brueggeman-Klein and Der- can be specified as convenient indented lists, with ick Wood, “Drawing trees nicely with TEX,” Elec- spacing and line-drawing done automatically.” tronic Publishing 2.2:101–115 (1989). csli.stanford.edu:pub/TeXfiles [Availability unknown.] [Note: pstrees contains files which do not include the prefix “pstrees”; rather, the main files are 18. voorbeeldom.sty: trees.**, which may cause confusion if one is not Werenfried Spit careful – Ch.] ([email protected], [email protected]). LATEX document-style option which defines an enu- 14. pstricks: merate-like environment for typesetting linguistic Timothy Van Zandt ([email protected]). examples. No documentation, but the .sty file has This is an extensive collection of PostScript macros commented examples. that is compatible with most TEX macro packages, CTAN: tex-archive/macros/latex/contrib/ including Plain TEX, LATEX, AMS-TEXandAMS- misc/ LATEX. Included are macros for color, graphics, ro- tation, trees and overlays. “PSTricks puts the icing What’s available where (PostScript) on your cake (TEX)!” Documentation. CTAN: Most TEXware is available via ftp from CTAN CTAN: tex-archive/graphics/pstricks (Comprehensive TEX Archive Network) sites, in the directory /tex-archive. CTAN sites include: 15. qtree: Alexis Dimitriadis United Kingdon ftp.tex.ac.uk ([email protected]). Huntsville, Texas ftp.shsu.edu Tree macros (written by Jeff Siskind) with a front Germany ftp.dante.de end allowing trees to be specified in bracket nota- The CTAN holdings are too numerous to list here. tion. These macros take into account the size of For more information, get the readme files from the the node labels when designing the tree; it is usu- /tex-archive directory. ally only necessary to give the bracketed structure to obtain beautiful trees. The node labels them- CSLI: In addition to CTAN, there has been a long- selves can be arbitrarily complicated. The output standing ftp site at Stanford: is .dvi (not PostScript) directives, thus it can be csli.stanford.edu:pub/TeXfiles previewed with xdvi. Version 2 simplifies the labeling of non-termin UofGeorgia: There is a smaller archive at the Uni- nodes and provides triangular “roofs”. Documen- versity of Georgia which contains files of local in- tation is included in the distribution file. [Note: terest. This is a new release, dating from February 1995.] ai.uga.edu:pub/tex ai.uga.edu:/pub/tex UofTokyo: ThereisasmallarchiveattheUniversity 16. tree-dvips: of Tokyo which contains the latest version of the Emma Pease ([email protected]). phonetic font TSIPA and other related files. CSLI PostScript drawing macros. These macros tooyoo.l.u-tokyo.ac.jp:pub/TeX/tsipa were originally created to draw the lines between nodes in the trees created by the tree macros in ⋄ Christina Thiele lingmacros.sty. They will only work with dvips 15 Wiltshire Circle version 541 or later (by Tomas Rokicki available on Nepean, Ontario labrea.stanford.edu) but can be easily modified K2J 4K9 Canada to be used with earlier versions of dvips and slightly Email: [email protected] less easily modified for other .dvi-to-PostScript convertors. Documentation. [Formerly: tree.tex.] csli.stanford.edu:pub/TeXfiles TUGboat, Volume 16 (1995), No. 1 45
Some Test Data Font Forum 1
0.5
Introducing METAP OST Alan Hoenig y 0 I am pleased that the article by Yannis Haralam- bous which immediately follows these comments is
available. People using METAFONT should find it of −0.5 great interest and significance.
It has long been clear that METAFONT things
are of interest to only a limited subset of TEXor −1 LATEX users. Who, after all, has the time to design 0 π/2 π 3π/22π fonts? The event that I wish now to report should The horizontal coordinate, x significantly alter this perception. As of April of
1995, John Hobby’s METAP OST program has been
Figure 1:AsampleMETAP OST graphic. placed in the public domain, and I’d like to comment
on this turn of events. METAFONT METAP OST is similar to the lan- post-processors can also incorporate them as well).
guage, so METAP OST input files look a lot like Plain TEX authors include lines like
METAFONT files. However, the output is different — \input epsf
METAP OST produces PostScript output rather than ... generic font files, so it is printable on any PostScript \epsfbox{foo.1} device. It’s not really possible to produce fonts with in their document, while LAT X users say something METAP OST.Itsraison d’ˆetre is really toward the E production of high quality graphics for inclusion in like aTEXorLATEXdocument. \usepackage{epsf}
There are differences between METAFONT and ...
METAP OST, and these are necessitated by the dif- \begin{center} ferences in use and environment. PostScript de- \leavevmode\epsfbox{foo.1} scriptions are to be device independent, so all of \end{center}
METAFONT’s pixel-handling constructs have been in their documents. (You may use \noindent in removed. Things like the sharp convention and com- place of \leavevmode if you wish. These are often,
mands like cullit are not part of the METAP OST but not always, needed in LATEX environments.)
language. Certain enhancements have been added METAP OST is available from any CTAN archive
to the METAP OST language in aid of fine graphics. in the path tex-archive/systems. At least two
For example, METAP OST is also able to easily executables already exist, for DOS and for OS/2, include TEXorLATEX text within its graphic output, and (I believe) it is now part of the Unix web2c
so tags on graphs and drawings can now match the kit. To learn how to use METAP OST,consultthe
text font exactly. The METAP OST package includes original METAFONTbook (usablebyvirtueofthe special macro packages for drawing graphs and for many similarities between the two languages), and drawing boxes and ovals. The silly graph in fig- also the two technical reports by John Hobby. These ure 1 (everything above the caption) was produced two reports, numbered 162 and 164, are part of the
entirely by METAP OST—it read the data points, package, but can be obtained by sending email send connected them with a smooth curve, prepared the 162 or send 164 to [email protected] coordinate axes, and integrated the LATEXlabelsand reports are valuable not only for their explications
tags all by itself. of METAP OST but for the alternative perspectives
The output of a successful METAP OST run is a they also provide for METAFONT. sequence of files with names like foo.1, foo.2,and so on. With the epsf macro package they are easily ⋄ Alan Hoenig included in any document that is post-processed by CUNY, 17 Bay Ave., Tom Rokicki’s dvips (although it may be that other Huntington, NY 11743 Email: [email protected] 46 TUGboat, Volume 16 (1995), No. 1
advance the angle of the pen (this stands both for Some METAFONT Techniques defining a new pen with command pickup pen and Yannis Haralambous for defining a simulated pen with command penpos ). Abstract Because it arises when designing the letter ‘K’, we will call this problem the “k problem”. This paper presents a few ideas on how to solve cer- tain geometrical problems arising very often in charac- The next problem is encountered when design- ing an ‘X’, as in the right part of fig. 1. Suppose ter design, not directly solvable by METAFONT’s plain macros. The first part of the paper presents two geomet- you want to draw this letter. Once again it should rical problems: the “k problem”andthe“x problem”, fit inside the box, and should consist of two strokes. their solutions using dichotomy, and a different solution To keep the same notation as in the previous case, using path intersections. The latter was proposed earlier we have only given names to the pen positions con- on the net by the author; although geometrically correct, cerning the upper right part of the letter. In this
it does not work in real-world METAFONT practice: a case the constraints are: z1 is fixed (and not z1l as
nice example of METAFONT code . . . to avoid. in the previous case), as well as y2l and x2r. Here The second part of the paper presents two simple again is the problem which we call the “x problem”: macros for drawing “loose” Bézier curves; in a sense, the Find a stroke z1 z2 with fixed z1,y2l,x2r. opposite of the tension operator. Finally, the third part −− solves a problem stated by Alan Hoenig: how to extract Well understood, in both cases the direction of
text and data from a METAFONT run, without using the stroke must be perpendicular to the angle of the the log file. This is done in a straightforward manner by pen: all strokes must keep the same width. running a Flex-generated preprocessor over the GF file: the Flex code for this utility is given in appendix B. 1.2 The solutions (which work) Let’s start with the “k problem”. In fig. 2, the reader −−∗−− has a closer look at the situation. Point A is fixed, point B must lie on a fixed horizontal line H,and 1 Two geometrical problems, solved by C on a fixed vertical line V . The angle ABC must d iterated calculations stay orthogonal. Also the length BC of the vector
1.1 Description BC is fixed. METAFONT cannot compute the angle Suppose you want to design a character ‘K’, as in the φ directly, so that it fits to these constraints. But ′ left part of fig. 1. The character should fit inside a once we have chosen a point B , METAFONT can ′ ′ ′ ′ box of width w and height h, and should consist of calculate the corresponding C so that AB B C ′ ′ ⊥ three strokes: the vertical stroke z0 z0′ ,andthe and B C = BC. −− two oblique strokes z1 z2 and z1′ z2′ .Only −− −− H BB′ constraint: the point z1l = z1′r should be fixed (for h example, its coordinates can be (0, 2 )). So, here is the problem: C C′ Find a stroke z1 z2 with fixed z1l,y2l,x2r. ϕ −− ϕ′ A 0′ 2l 2l V 2 2 2r 2r 1′ Figure 2: A closer look at the “k problem”. 1 ′ 1 So let’s choose a random point B H,and find the corresponding C′.IfC′ lies on the∈ right of 2 0 ′ V then we know that we should move B′ more to the left, if C′ lies on the left of V then B′ should be The “k problem” The “x problem” moved more to the right. We will modify the posi- tion of B′ by a certain step and start again. This Figure 1: The two problems. procedure will be iterated until we get close enough to V . The step will be halved every time: because 1 This problem is not trivial, because METAFONT of the well-known equality Pi≥1 2i =1we are sure cannot compute pen positions without knowing in that this process of iterations will converge to the TUGboat, Volume 16 (1995), No. 1 47 correct result. One may argue that the result will depends on the implementation and the resolution
always be approximate; this is true in mathematics of our character. This process is called dichotomy t mnw but a useless remark in computer calculations, since (from d qa = in two pieces, and = to cut), all values are approximate anyway. Once we have a and is usually one of the first exercises in most pro- sufficient precision we stop; the sufficient precision gramming languages. The code is shown below.
def solve k problem(suffix $, $$, $$$)(expr pen width, first try)= pair z zero; z zero = z$; numeric x one, y one, x two; y one = y$$; x one = first try; x two = x$$$; numeric theta,n; n := 1; forever: clearxy; z$ = z zero; z$$ =(x one, y one); theta := angle(z$$ z$); pos $$.1(pen width, theta 90); − − if y$ >y$$: z$$.1r else: z$$.1l fi = z$$; x one := x one if x$$.1r > x two: else:+ − fi abs(first try x$)/(2 n); − ∗∗ exitif ((abs(x$$.1r x two) < 0.1) or (n>13)); n := n +1; endfor − pos $$$(pwidth, theta 90); z$$$r = z$$.1r; enddef; −
This procedure expects that you feed it with: C (a) the suffixes of points z1l,z2l and z2r,(b)the B′′′ B B′ width of the pen, (c) a hint on the first choice for B x2l. It usually works fine when your hint is sim- ′′ D′ ply the x-coordinate of z2r. Theoretically it can go C C′ wrong if the x-projection of z1l z1r is bigger than θ the first step of the iteration process.−− But this can ∆′ hardly ever happen. The iteration is stopped either (a) when the dis- A tance of z2r to V is less than a tenth of a pixel, or (b) if n =14, because the next step would produce a denominator 215 = 32,768, which is too big for Figure 3: A closer look to the “x problem”. (usual) METAFONT. Experience shows that with 8 steps one is usually done — again, all depends on the resolution. Let’s consider the second problem now. As the But first let’s consider two geometrical facts reader can see in fig. 3, only the position of point A which we are going to use. differs. Nevertheless, this makes a big difference for
METAFONT: in the previous problem, once we had Fact 1. Let be a circle, centered at O,and chosen B′ we could immediately calculate the posi- P a point outsideC the circle. For every line tion of C′. This is not the case here: all we know is ∆, going through P and intersecting the cir- that if D′ isthemiddleofB′C′,thenAD′ B′C′. cle at points X and X′, the product of lengths ⊥ So we need a different technique already to calcu- PX PX′ is constant. In particular, if PT is late the location of C′ for each step of the iterating a tangent· to the circle going through P ,then 2 process. This will be done again by iteration. PT = PX PX′. · 48 TUGboat, Volume 16 (1995), No. 1
As the reader can see on fig. 4, it follows from Let’s apply now Fact 2 to the orthogonal tri- the previous fact that if ∆′ is the line going through angle AD′B′. We obtain: θ = arccos(AD′/AB′). P and O, and intersecting the circle at Y and Y ′, Once we have the angle and the length of AD′,we then PT = √PY PY′. have point D′, and we are done for this step of the · iterating process. The remainder of the solution is A T similartothatofthe“k problem”: we are moving X′ θ ′ ′ X B around until C is close enough to line V . b
Y O Y META P ′ c Let’s try to implement this solution in -
FONT. Fact 1 can be implemented easily: of course
a B C one should avoid multiplying two lengths (because
Fact 1. Fact 2. of a possible overflow error), but there should be no problem if we take the square root of each length first (for purists: lengths are always positive!). So Figure 4: Two facts from elementary Euclidean 2 ′ ′′ ′′′ geometry. instead of AD = AB AB = we will formulate the equation as AD′ = √· AB′′ √AB′′′. · The second fact is even more trivial: Fact 2 is a little harder to implement. As a matter of fact, the reader may have noticed that Fact 2. Let ABC be an orthogonal trian-
although METAFONT provides exponential and log- gle; the right angle shall be ABC; let’s call d arithmic functions, there are no inverse trigonomet- CAB = θ,anda,b,c the lengths of faces op- d c ric functions. What should be done? Unfortunately, posite to points A, B, C.Thenθ = arccos( b ). ′ METAFONT offers no complex calculus so that for- Let’s return to our problem (see fig. 3). D is 1 xi −xi ′ 1 mulas such as cos(x)= 2 (e + e ) could be ap- on a circle centered at B , of radius 2 BC (half the plied; power series cannot be used either because our C ′ ′ ′ ′ width of the stroke, since B D = D C ). Also we candidates for angles are not necessarily in a neigh- know that AD′ B′C′ AD′ B′C′ AD′ is tan- ⊥ ⇒ ⊥ ⇒ borhood of 0; using an external program to make gent to circle . Let’s draw the line ∆, going through this calculation would be highly unorthodox. Let’s points A and CB′. It will intersect at points B′′ and C 2 use dichotomy once again! B′′′. From Fact 1 we know that AB′′ AB′′′ = AD′ . Hereisthecodeforaarccosd procedure in ′ ·
So we do not yet have D itself, but the length of METAFONT: AD′.
def arccosd (expr ttt)= if ttt > 1: message(”error: arccosd argument > 1!!??”); stop; else: numeric a ; numeric test, nnn, prev; test := 45; nnn := 1; prev := cosd(test); forever: nnn := nnn +1; if cosd(test) < ttt: test := test (90/(2 nnn)) else: − ∗∗ test := test +(90/(2 nnn)) fi; ∗∗ exitif ((abs(test prev) < 0.01) or (nnn > 14)); prev := test; − endfor fia := test; enddef;
The above procedure requires as argument a lem”), one must consider arccosd for only the first number ttt ]0, 1[. It stores the result in the nu- quadrant: solutions will always be in the range ∈ π π meric variable a_. The first try is always 4 .Since 0, 2 . One can easily generalize the code to work we want to solve a specific problem (the “xprob- in different ranges. In particular it would be nice to TUGboat, Volume 16 (1995), No. 1 49
modify the code to allow us getting results in the obtain the intersection of line ∆′ (fig. 3) and circle ′ complex domain for ttt ] , 0[ ]1, [ but the : the pen $$ lies on ∆ and points $$r and $$l are ∈ −∞ ∪ ∞ C
author can hardly see the utility of these for META- at the right distance from point $$. As we shall see
FONT... in the following section this method yields results Letusnowhavealookatthesolutionofthe“x that are accurate enough for our purpose. problem”, as it is shown below. The same idea has been used to explicitly de- A word of explanation concerning the pen posi- fine point $$.1: by taking pen position $, the right tion $$ is perhaps necessary. This is a quick way to edge of the pen is on point D′ of the figure.
def solve x problem(suffix $, $$, $$$)(expr pen width, first try)= pair z zero; z zero = z$; numeric x one, y one, x two; y one = y$$; x one = first try; x two = x$$$; numeric theta,n,phi, tangent length; n := 1; forever: clearxy; z$ = z zero; z$$ =(x one, y one); theta := angle(z$$ z$); − pos $$(pen width, theta); tangent length := sqrt(length(z$$l z$)) sqrt(length(z$$r z$)); − ∗ − arccosd (tangent length/ length(z$$ z$)); phi := theta a ; − − pos $(2tangent length, phi); z$$.1r = z$r +(z$r z$$); z$$.1l = z$$; x one := x one − if x$$.1r > x two: else:+ − fi abs(first try x$)/(2 n); − ∗∗ exitif ((abs(x$$.1r x two) < 0.1) or (n>13)); n := n +1; endfor − z$$$l = z$$.1l; z$$$r = z$$.1r; z$$$ =0.5[z$$$l,z$$$r]; enddef;
The entries for this procedure are the same as A, and upon circle . Take a circle ′, centered at for solve_k_problem. Again the problem is solved in point A and of radiusC AD′. The (right)C intersection the specific case of the right and upper part of letter of circles and ′ is the desired point D′. ‘X’; for the other possible cases, one should either C C change the code (straightforward but tedious), or use symmetry arguments. C B′′′ 1.3 A solution which doesn’t work, and B′ why B D The author must confess that the first time he had ′′ ′ C to solve the “x problem” was during a METAFONT ′ θ tutorial at the Royal Holloway College (UK) [this ∆′ shows how badly the tutorial was prepared. . . mea culpa]. On the spot I could find no solution, yet on C′ my way back across the Channel on the ship I found A the code shown below. I am convinced that people
taking the tunnel nowadays write better METAFONT code! Figure 5: A solution of the “x problem”which Let be a circle, centered at B′ and of radius 1 C ′ doesn’t work in METAFONT. 2 BC, as in fig. 5. The intersection of ∆ and gives ′′ ′′′ C us points B and B . From Fact 1, we obtain the This method is mathematically correct — but if ′ ′ length of AD .Now,D is at a known distance from you try it out you will get extremely bad results. 50 TUGboat, Volume 16 (1995), No. 1
The problem is that when we define a path, META- 1. tension, which allows us to get tense curves;
FONT does not consider it as an abstract curve, but 2. controls, by which we can explicitly determine as a set of pixels. When we ask for the intersection the control points of our Bézier curve. of two paths, we obtain the pixel which is the closest to the (theoretical) intersection of the paths. In the One can use tension quite intuitively: for a solution sketched above, the intersection of the two value of 1, the path remains unchanged; for higher circles is taken as an abstract point, and its coor- values the path gets more and more tense. On the dinates are used for calculations. The result is of other hand, the operator controls gives us absolute course completely deformed. control of the curve — but this is certainly not intu- itive; maybe Leonardo da Vinci was smart enough 2 Loosening Bézier curves to be able to guess the control point coordinates of
Bézier curves are quite beautiful, and METAFONT Joconda’s smile, but the rest of us would probably allows us to obtain them even out of only partial be unable to do it. information: for example, one can ask for “a curve So it happened that the author often needed leaving point A following a vertical direction and ar- “loose” Bézier curves, and was unable to obtain riving at point B following a horizontal direction”. them; unfortunately, tension doesn’t work with val-
There is an infinite number of Bézier curves with ex- ues less than .75. (In fact, the METAFONTbook
actly these features; METAFONT will choose one of does not mention what the lower bound of the ten- them, out of hard-wired criteria. Most of the time, sion parameter is; however, repeated tries by the
METAFONT’s choice is exactly what you need; but it author have shown that the value .75 still works, may also happen that you want to keep some other while .75 ǫ produces an error message.) With the − curve in the same set. There are two operators al- following code, one can get arbitrarily loose Bézier lowing us to do this: curves:
def npush(expr p, coef )= hide(pair firstpt, firstcpt, secondcpt , secondpt ; firstcpt = postcontrol 0 of p; secondcpt = precontrol 1 of p; firstpt = point 0 of p; secondpt = point 1 of p; pair intersectpt, newfirstcpt, newsecondcpt; intersectpt firstpt = whatever (firstcpt firstpt); intersectpt − secondpt = whatever∗ (secondcpt− secondpt ); newfirstcpt −= coef [firstcpt, intersectpt∗ ]; newsecondcpt− = coef [secondcpt , intersectpt]; ) firstpt .. controls newfirstcpt and newsecondcpt .. secondpt enddef;
The macro npush takes two arguments: the the intersection point. For values higher than 1 they path we want to loosen, and a numerical coefficient. continue their travel outside of the Bézier triangle. For value 0 of this coefficient, the path remains In Appendix A the reader can see the effects of unchanged. What happens when we increase this the npush macro applied uniformly to all paths of a value? Let’s consider the intersection point of the circle, with values of the coefficient going from 5 two Bézier tangents (the tangent at curve beginning to 5. − and end). We know that control points always lie As the reader has surely already noticed, this on these two tangents. For values between 0 and 1 macro doesn’t work when the tangents are parallel of the coefficient, the control points travel between (because there is no Bézier triangle in that case). their original positions and the intersection point. A second macro, with a slightly different approach For value 1 both control points are identified with covers all possible cases:
def mpush(expr p, lcoef , rcoef )= hide(pair firstpt, firstcpt, secondcpt , secondpt ; firstcpt = postcontrol 0 of p; secondcpt = precontrol 1 of p; firstpt = point 0 of p; secondpt = point 1 of p; pair newfirstcpt, newsecondcpt; newfirstcpt firstpt = lcoef (firstcpt firstpt); newsecondcpt− secondpt = rcoef∗ (secondcpt− secondpt ); ) firstpt .. controls− newfirstcpt and newsecondcpt∗ −.. secondpt enddef; TUGboat, Volume 16 (1995), No. 1 51
This macro has three arguments: the path ple pattern matching, using regular expressions and which we will modify, and two numerical coefficients, states. The Flex code of GFtoTXT is very short; the corresponding to the transformation at curve begin- reader can find it in Appendix B. To obtain an exe- ning and at curve end. This time we multiply the cutable, (a) run this code through Flex, with the -8 distance of the first control point from curve begin- option — Flex will produce a C file called lex.yy.c ning by the first coefficient, and that of the second (or LEX_YY.C on Messy-DOS systems); (b) compile control point from curve end by the second coeffi- it using your favourite C compiler. The Flex code cient. Hence, in this case, value 1 for both coef- as well as executables can be found on ftp.ens.fr, ficients will leave the path unchanged. For values in pub/tex/yannis/gftotxt. higher than 1 the path will “swell”, while for values How does it work? There is one convention tending to 0 it will become more and more tense. which must be followed: every string which we want
These two macros may not be as reliable as to extract from the METAFONT run must start with
primitive METAFONT operators, but they produce the character #. This precaution is necessary be-
easily predictable results and are suitable for fine- cause METAFONT itself sends several strings to the tuning of character parts. GF file by using internal special commands. These will be ignored by GFtoTXT. Hence, to ob- 3 Getting text and numeric data out of a tain the string “Hello world!” in our output file (let’s font call it output.txt), we will include the command In his extremely interesting paper on communica- special(”#Hello world!”);
tion between TEXand METAFONT [1], Alan Hoenig in the METAFONT code of our file (let’s call it
states that “...METAFONT’s file handling abilities input.mf). GFtoTXT reads from the standard in- are greatly crippled when compared to T X. Other E put flow and writes to the standard output flow, so than font pixel files, font metric files, and log files, we only need to redirect these; here is the necessary
METAFONT cannot write files. . . ” To remedy that command line: situation, we present GFtoTXT, a small utility for reading text (and any other information) out of GF GFtoTXT < input.mf > output.txt
files produced by METAFONT. As a matter of fact, GFtoTXT allows three additional conventions in METAFONT METAFONT has a special command, just like TEX, strings: (a) \n will produce a carriage but until now, no DVI driver was able to use it (as return in the output file, and (b) \x followed by a a possible use, one could very well imagine Post- two-digit (lowercase) hexadecimal number between Script color instructions embedded in the GF files, 00 and ff will produce the corresponding 8-bit taken over by the PK files and translated into real character in the output file, (c) \X followed by a PostScript commands by the DVI driver). four-digit (lowercase) hexadecimal number between GFtoTXT is written in Flex, a UNIX-originated 0000 and ffff will produce the corresponding 16- lexical analyzer, under GNU copyleft. Flex allows bit character in the output file. Convention (a) is the generation of highly reliable C code out of sim- useful for “formatting” the output file, for example
special(”#(CHAR C A“n (HEIGHT R ” & decimal h & ”)“n (WIDTH R ” & decimal w & ”)“n”);
will produce Convention (b) is useful for inserting 8-bit char- acters (or characters in the range "00–"1f)intothe (CHAR C A output file. Finally, convention (c) can be useful to
(HEIGHT R 99.99976) those of us who use unicode encoding: wouldn’t it
S gap (WIDTH R 129.99683) be nice to have METAFONT say to us ? the necessary code would be ♥ ♥
special(”#“X2665“X03a3“X0027“X0020“X1f00“X03b3“X03b1“X03c0“X1ff6“X2665”);
4 Bottom line ever the specific problem was solved. It is not the goal of this paper to provide the reader with pow- The different METAFONT techniques presented in this paper are certainly not programmed in the most erful new tools, but rather to stimulate him/her in elegant way; the author needed them for specific creating his/her own, and to go beyond the plain purposes, and stopped testing and refining when- and cm base macros. In all three examples, the ba- sic idea was very simple: iterate calculations until a 52 TUGboat, Volume 16 (1995), No. 1
sufficiently precise approximation is obtained, mod- SUBSCRIBE METAFONT Your name and institution ify a path by manipulating the control points in the background, read the GF file by something else than References
GFtoPK or GFtoDVI. By writing down and sharing [1] Hoenig, A. When TEXandMETAFONT talk:
such ideas we can make out of METAFONT an even Typesetting on curved paths and other special friendlier and more productive font design tool. To effects, pp. 549–553, TUGboat 12 (4), 1991
discuss METAFONT relative issues, but also font de-
sign in general (and why PostScript and TrueType ⋄ Yannis Haralambous META are less efficient than METAfonts), join the - 187, rue Nationale, 59800 Lille,
FONT e-mail discussion list! The address of the list France. on the Internet is: Email: haralambous@ [email protected] univ-lille1.fr You can subscribe by sending the following subscrip- tion message to [email protected]:
A Transforming a circle through npush
' & $ coef = 5 coef = 4 coef = 3 coef = 2.5 coef = 2
− − − − −
! coef = 1.5 coef = 1.3 coef = 1 coef = 0.8 coef = 0.6
− − − − −
coef = 0.5 coef = 0.4 coef = 0.3 coef = 0.2 coef = 0.1
− − − − −
coef =0 coef =0.1 coef =0.2 coef =0.3 coef =0.4
coef =0.5 coef =0.6 coef =0.8 coef =1 coef =1.3
coef =1.5 coef =2 coef =2.5 coef =3 coef =5 TUGboat, Volume 16 (1995), No. 1 53
B The Flex code of GFtoTXT
%{ #define HEXA(A,B) (yytext[(A)]>=’a’? yytext[(A)]-’a’+10 : \ yytext[(A)]-’0’)*16 + (yytext[(B)]>=’a’? yytext[(B)]-’a’+10 : yytext[(B)]-’0’) long int length_special = 0L; int we_need_it = 0, pre_length = 0; %}
%x READ_SPECIAL %x READ_PRE
%%
([\x00-\x3F\x45\x46\x4A-\xEE\xF4\xF8-\xFF])|([\x40\x47].)|([\x41\x48]..) ;
([\x42\x49]...)|(\x43.{24})|(\x44.....) ;
\xEF.# { BEGIN READ_SPECIAL; length_special = yytext[1] - 1L; we_need_it=1; }
\xEF. { BEGIN READ_SPECIAL; length_special = yytext[1]; we_need_it=0; }
\xF0..# { BEGIN READ_SPECIAL; length_special = (yytext[1] * 256L) + yytext[2] - 1L; we_need_it=1; }
\xF0.. { BEGIN READ_SPECIAL; length_special = (yytext[1] * 256L) + yytext[2]; we_need_it=0; }
\xF1...# { BEGIN READ_SPECIAL; length_special = (((yytext[1] * 256L) + yytext[2]) * 256L) + yytext[3] - 1L; we_need_it=1; }
\xF1... { BEGIN READ_SPECIAL; length_special = (((yytext[1] * 256L) + yytext[2]) * 256L) + yytext[3]; we_need_it=0; }
\xF2....# { BEGIN READ_SPECIAL; length_special = (((((yytext[1] * 256L) + yytext[2]) * 256L) + yytext[3]) * 256L) + yytext[4] - 1L; we_need_it=1; }
\xF2.... { BEGIN READ_SPECIAL; length_special = (((((yytext[1] * 256L) + yytext[2]) * 256L) + yytext[3]) * 256L) + yytext[4]; we_need_it=0; }
(\xF3....)|(\xF5.{17})|(\xF6.{10}) ;
\xF7.. { pre_length = yytext[2]; BEGIN READ_PRE; }
.|\n ;
%%
main() { yylex(); } 54 TUGboat, Volume 16 (1995), No. 1
The Program a2ac — Font Handling on the full description of new composites (accented letters) PostScript Level including kern pairs, etc., to the output afm file. The afm file describes information about the Petr Olˇs´ak letters using the symbolic name for each character Abstract (Aacute is the A with an acute accent, for example). Each character is described through one of two pos- In this article, the a2ac program written by the author sible ways. Either the character name is bound to a is described. The program enables the use of PostScript fonts while typesetting texts in which accented letters definite encoding position and to a PostScript pro- are used. The font doesn’t need to contain the complete cedure for rendering the image of the character, or alphabet of a given language; merely the presence of the the character is described as a so-called “compos- accents themselves (and not all of the accented charac- ite character”. In this case the encoding position ters) is sufficient. The configuration files of the a2ac of the character is set to be −1 and the descrip- program are independent of the PostScript font encod- tion of how to make the character from elements ing as well as of the typesetting system encoding. The is stored in afm. The elements are usually char- program may be used to prepare a font for any typeset- acters described as mentioned previously and only ting system, but only TEXwastested.Thea2ac package their symbolic names (not their encoding positions) CTAN is available on including the source in C and the are used. documentation. It can serve as an alternative to the All desired new composite characters are de- fontinst program by Alan Jeffrey. The program was tested on the operating systems SUN OS, Linux, and fined in the description file. Since only the symbolic DOS. names are used, the description file is totally inde- pendent of the encoding of the PostScript font and −−∗−− of the encoding used by the typesetting system. The program already has nonzero intelligence when it reads the description file. You can declare 1 Introduction and use so-called “variables”. You can write met- The program a2ac (Afm to Afm and add Compos- ric and composite information in the form of simple ites) works with the afm file (Adobe Font Metrics) expressions. You can add new kern pairs using the as its input and creates a new modified afm file as information from already existing similar kern pairs its output. The program reads a so-called “descrip- (with exceptions being possible). tion file” during its work. The changes we want to An example of the description file format is make are defined in this file. The program adds the shown below.
NC Ecaron 2 ; PCCE00;PATcaron 0 Carontop ; NC Eacute 2 ; PCCE00;PATacute Acuteshift Acutetop ; NK (Ecaron,Eacute) : E NC ecaron 2 ; PCCe00;PACcaron 0 0 ; NC eacute 2 ; PCCe00;PACacute acuteshift vshift ; NK (ecaron,eacute) : e RK (P,T,V) ecaron 0 RK (P,T) eacute 0
In this example we define new letters E,ˇ E,´ ˇe negative kerns are usually defined. The same excep- and ´e. First, Ecaron and Eacute are defined as com- tions are specified for the pairs P´eandT´e. posite letters (NC lines), then new kerns are made For more information about the description file for Ecaron and Eacute. The kerns are the same as format see the a2ac-eng.doc file, which is included the kerns for the letter E (NK line). Next, ecaron in the a2ac package. and eacute are defined as composite letters with Ligature information is not defined in the de- the same kern information as the letter e (NC lines scription file. We will show that it essentially doesn’t and NK line). Last, some exceptions for the kerns matter. Ligatures are described in the input afm file are specified. The pairs Pˇe, TˇeandVˇe will have no at the end of lines with the C prefix as is shown in kern, in contrast to the pairs Pe, Te and Ve, where our following example:
C102;WX333;Nf;B200383683;Lifi;Llfl; TUGboat, Volume 16 (1995), No. 1 55
The a2ac program just copies the ligature infor- in the ligtable of the tfm format. New ligatures for mation to the output afm file. This is sufficient for TEX can be defined through the enc files. For in- generating the ligtables which are used by the type- stance, the LIGKERN information below is present in setting systems. For example, the afm2tfm program the file xl2.enc and defines new ligatures for the reads this information and generates the proper data dash characters “--”and“---”.
% LIGKERN hyphen hyphen =: endash ; endash hyphen =: emdash ;
2 Font preparation for TEX is used, then this requirement is satisfied. The out- put afm file then specifies all characters (plain or After a2ac is used, preparation of the font for TEX continues on in the standard way (as for the En- composite) of the Czech and Slovak alphabets and glish language). You can use the afm2tfm program, the new kern information. which reads the converted afm file and the encod- The description file cscorr.tab was used with- out changes for all standard PostScript fonts (32 ing information file enc to match the internal TEX fonts installed in all rips) with very good results. encoding. The result is a TEX virtual font which includes two kinds of information: the information Of course, if you want, you can rearrange the cscorr.tab file for your new font to obtain the best about re-encoding from the internal TEXencoding to the raw PostScript font encoding (as was specified result. in the enc file) and the information about building If the above UNIX script is used, the TEXfont the accented letters from the elements (as was spec- suitable for Czech and Slovak texts is prepared in C ified in the “rebuilt” afm file). the S-font encoding, which is an extension of the Let us show an example of the PostScript font Computer Modern font encoding (see Section “The T X Encoding”). preparation for TEX for the Czech and Slovak lan- E guages. First, we run the a2ac program. The com- The output of the script is c*.tfm (for TEXin- mand line looks like this: put), r*.tfm and c*.vf (for dvips input). The only work we must do is place these files into the appro- a2ac input.afm desc.tab output.afm priate directories (or rearrange the script to make The first parameter is the name of the input this work automatically) and add one line to the afm file, the second one is the name of the descrip- configuration file psfonts.map of the dvips driver. tion file and the third one is the name of the output The line looks like this: afm file. The extensions (.afm and .tab) are obliga- tory — the program has no algorithm for adding the rfont PostScript-Font extensions automatically. If the PostScript font is not installed in the The command to run the a2ac program is usu- PostScript RIP of the output device, we have to store ally part of a script or a batch file. For example, the the pfb (pfa) format of the PostScript font in our UNIX script to convert the font from the afm format computer. Then the line in psfont.map looks like to the tfm format may look like this: this: a2ac $1.afm cscorr.tab c$2.afm afm2tfm c$2.afm -t xl2.enc -v c$2 r$2 rfont PostScript-Font
3TheTEXencoding • Kerning and other aspects of the fonts are language-dependent. The multi-language font The input text encoding for TEX doesn’t need to (T1 encoded, for example) is nonsense from the be the same as the internal TEX encoding. The re-encoding can be performed by the tcp table of typographer’s point of view. In addition, we can’t rely on the results of the font hacking from emTEX, by changes to the xord/xchr vectors in the abroad. tex.ch file (TEX’s WEB source hacking) or by setting active characters. The enc file describes exactly the The Computer Modern text font encoding (and same encoding as the internal TEXencodinginto its conservative extensions) is font-type dependent. which the hyphenation patterns have been read. For example, two different characters are encoded In the Czech Republic and the Slovak Repub- to the same position depending on \rm versus \tt lic, the so-called CS-encoding (the encoding of the font (the ‘fi’ ligature versus downarrow, for exam- CS-fonts) is used very often as the internal TEXen- ple) or on \rm versus \it font (the dollar versus the coding. The CS-encoding is an extension of the Com- sterling). This feature implies some problems. puter Modern text font encoding and follows the Two files, xl2.enc and xt2.enc, are included ISO-8859-2 layout for the Czech and Slovak letters. in the a2ac package. These files define extensions of The CS-fonts are included in the widely available the CS-font encoding. The extensions include some TEX installation package with the name CSTEX. characters which are not contained in the Czech and The advantages of the CS-encoding (in compar- Slovak alphabets (i.e., in the CS-font), but which are ison to the T1 encoding) from our point of view are contained in the Adobe StandardEncoding material. the following: It is recommended to use xl2.enc for \rm and \it-like fonts and xt2.enc for \tt-like fonts. The • The fonts include the letters from the Czech and dollar is in position 36 under any circumstances and Slovak alphabets only. Font generation from position number 132 was given to the sterling. If the METAFONT sources is faster and the pk files you load the PostScript \it-like font then you need are smaller. Exceptions (non-Czech/Slovak let- to know that the sterling is not accessible by the ters in names, for example) are accessible by plain TEX {\it\$} but you need to use a macro plain TEX macros or by the \accent primi- with \char132 code. tive. Position number 32 is not defined in the xl2.enc • The CS-font encoding is the conservative ex- file because the cross accent for the Pol- tension of the Computer Modern text fonts. ish L and l is not included in the Adobe Standard- Therefore, CS-fonts can replace CM text fonts Encoding. TheLand l themselves are in positions in the installation packages without problems 163 and 179, respectively. If we need to use these and with backward compatibility. This feature characters in plainTEX with the PostScript font, the saves space on the disk. No complicated macro redefinition of the macros \l and \L is necessary. redefinitions are necessary. For example, the Note that the three-letter ligatures are not included in the Adobe StandardEncoding. This csplain format, which is included in the CSTEX package, gives the same results as the original doesn’t cause any problems because the three-letter ligatures are not used in Czech and Slovak. If the plain TEX. There are only three differences: (1) The new letters with codes ≥ 128 are intro- font is available in the ExpertEncoding you can do vpl duced by \catcode, \lccode, \uccode;(2)CS- some hacking at the level to obtain these liga- fonts are loaded instead of CM text fonts; and tures. (3) the Czech and Slovak hyphenation patterns If you want to typeset math using the Post- are read in addition to the US English ones. Script font, you need to edit the vpl file [1] to include the uppercase Greek characters taken usually from • The layout of the accented letters in the CS- the Symbol font. Sorry, but it is not obvious how to font is the subset of the ISO-8859-2 standard. typeset math with the PostScript font; some more Therefore, it is possible to use an eight-bit in- hacking must be done—on the plain TEX macro put Czech/Slovak text without re-encoding in levelaswellasonthe\fontdimen font information. the UNIX environment. The power is in sim- The pretty “look” of math is a result of a font de- plicity! The tcp table (emTEX) is used in the signer’s work and a lot of TEX-macro programming. DOS installation because there are many differ- “Its emphasis is on art and technology, as in the un- ent layouts for the Czech/Slovak alphabet and derlying Greek word” (see [2], Chapter 1.). There- there is no fixed standard in DOS (in contrast fore, the pretty math is not accessible from the text to the UNIX-like systems). PostScript font via any automatic process. TUGboat, Volume 16 (1995), No. 1 57
For more information about the CS-font encod- References ing and its history see [3] and appendix F in [4]. [1] Donald Knuth: Virtual fonts, more fun for grand 4 About the names of the accented letters wizards. TUGboat 11(1):13–23, April 1990. [2] Donald Knuth: The T Xbook, vol. A of Com- The names dquoteright and tquoteright should E puters & Typesetting. Addison-Wesley, Read- be used for the characters d’ and t’ in the descrip- ing, MA, 1986. tion file cscorr.tab. However, we use the names ´ dcaron and tcaron instead. The reason is that [3] Petr Olˇs´ak: Uvaha o fontech v CSTEXu [A re- C the uppercase alternatives of the characters d’ and flection about fonts in STEX], TEXbulletin (of t’ are Dandˇ T,ˇ i.e., Dcaron and Tcaron, respec- CSTUG) 3/93 (121–131). tively. The parameter -V (for afm2tfm to make the [4] Petr Olˇs´ak: Typografick´ysyt´em TEX [Typeset- small caps variant of a font) does not work with ting System TEX], CSTUG 1995, 270 pages. the names dquoteright and tquoteright.Weuse Lcaron and lcaron instead of Lquoteright and ⋄ Petr Olˇs´ak lquoteright because the semantic for these accents Department of Mathematics is same as for Dcaron, dcaron, Tcaron and tcaron. Czech Technical University in Prague Czech Republic Email: [email protected]
−−∗−−∗−−∗−−∗−−∗−−∗−−∗−−∗−−∗−−
Note added by production editor psfonts.map. The file pscs.map itself contains one In order to better understand how the above line for each of the fonts considered: and starts like procedure works, I decided to download the rpagd AvantGarde-Demi program from ftp.tex.ac.uk in the directory rpagdo AvantGarde-DemiOblique fonts/utilities/a2ac below the CTAN root. I rpagk AvantGarde-Book also got the .afm files for the PostScript fonts from rpagko AvantGarde-BookOblique the directory fonts/psfonts/adobe. ... I now was all set to compile the a2ac.c pro- as explained in the text of the article. gram, and start to make the CS-encoded files using As a further exercise, I decided it would be the script mkfnt. It essentially runs the program interesting to make this encoding available with afm2tfm that comes with the dvips distribution and LATEX, so I started by defining an encoding L2 in the creates .tfm and .vf files for TEXanddvips accord- file L2enc.def (See Chapter 7 of The LATEXCom- ing to the desired encoding (see the Section “Font panion, by Goossens et al., Addison-Wesley, 1994 Preparation for TEX”). This was done for all “Stan- and the file fntguide.tex “LATEX2ε font selection”, dard” PostScript fonts. Figure 1 shows the font ta- which is distributed electronically and comes with ble for the font NewCenturySchlbk-Roman. the standard LATEX2ε distribution.). This file con- The righthand side of the table shows the char- tains merely: acters specific to the Czech and Slovak languages. \ProvidesFile{L2enc.def}% .dvi To obtain PostScript output from a file one [1995/09/22 CS Encoding] more step has to be taken; namely, one has to tell \DeclareFontEncoding{L2}{}{} the dvips program which file to read when a font- \DeclareFontSubstitution{L2}{ptm}{m}{n} name is encountered. The easiest way of achieving this is to generate a new “config” file, since several Next, one has to make font definition .fd files can be specified with one dvips execution. In this for each of the fonts one wants to use, e.g., for case, I generated a one-line config.cs file contain- the Times-Roman family one would define a file ing L2ptm.fd,whereptm is a “short” name chosen in the framework of the font-selection scheme of Karl Berry p +pscs.map (Version 2), which can represent all possible font names within the ISO-9660 (and DOS)-compatible This states that the contents of the file pscs.map eight characters limit for file names. Below the font has to be concatenated with the contents of previ- definition file (L2pag.fd)fortheL2encodingofthe ously declared map files, in particular the default AvantGarde font family is shown.
58 TUGboat, Volume 16 (1995), No. 1