Distinguishing 8-Bit Characters and Japanese Professional Quality [9], Including Japanese Line Break- Characters in (U)Ptex Ing Rules and Vertical Typesetting
Total Page:16
File Type:pdf, Size:1020Kb
TUGboat, Volume 41 (2020), No. 3 329 Distinguishing 8-bit characters and Japanese professional quality [9], including Japanese line break- characters in (u)pTEX ing rules and vertical typesetting. pTEX and pLATEX were originally developed by Hironori Kitagawa the ASCII Corporation2 [1]. However, pTEX and Abstract pLATEX in TEX Live, which are our concern, are community editions. These are currently maintained pTEX (an extension of TEX for Japanese typesetting) 3 by the Japanese TEX Development Community. For uses a legacy encoding as the internal Japanese en- more detail, please see the English guide for pTEX [3]. coding, while accepting UTF-8 input. This means pTEX itself does not have 휀-TEX features, but that pTEX does code conversion in input and output. there is 휀-pTEX [7], which merges pTEX, 휀-TEX and Also, pT X (and its Unicode extension upT X) dis- E E additional primitives. Anything discussed about tinguishes 8-bit character tokens and Japanese char- pTEX in this paper (besides this paragraph) also acter tokens, while this distinction disappears when applies to 휀-pTEX, so I simply write “pTEX” instead tokens are processed with \string and \meaning, of “pTEX and 휀-pTEX”. Note that the pLATEX format or printed to a file or the terminal. in TEX Live is produced by 휀-pTEX, because recent These facts cause several unnatural behaviors versions of LATEX require 휀-TEX features. with (u)pTEX. For example, pTEX garbles “ſ ” (long s) to “顛” on some occasions. This paper explains these 2.1 Input code conversion by ptexenc unnatural behaviors, and discusses an experiment in improvement by the author. Although pTEX in TEX Live accepts UTF-8 inputs, the internal Japanese character set is limited to 1 Introduction JIS X 0208 (JIS level 1 and 2 kanjis), which is a Since TEX Live 2018, UTF-8 has been the new default legacy character set before Unicode. pTEX uses input encoding in LATEX [8]. However, with pLATEX, Shift_JIS (Windows) or EUC-JP (other) as the in- which is a modified version ofA LTEX for the pTEX ternal encoding of JIS X 0208. engine, the source In pTEX and related programs, the ptexenc library [12] converts an input line to the internal %#!platex encoding. pTEX’s input processor actually reads \documentclass{minimal} the converted result by ptexenc. A valid UTF-8 \begin{document}ſ\end{document} % long s sequence which does not represent a JIS X 0208 char- gives an inconsistent error message [4] (edited to fit acter — such as <C5><BF> (“ſ”) or <C3><9F> (“ß”) — TUGboat’s narrow columns): is converted to ^^-notation, such as ^^ab. On the other hand, an invalid UTF-8 sequence ! Package inputenc Error: Unicode character 顛 (U+C4CF) not set up for use with LaTeX. is converted into <A2><AF> (an undefined code point in EUC-JP) sometimes, in TEX Live 2019 or prior. Here “顛”, “ſ” and U+C4CF are all different characters. In TEX Live 2020, the sequence is always converted The purpose of this paper is to investigate the into ^^-notation. background of this message and propose patches to resolve this issue. This paper is based on a cancelled 2.2 Japanese character tokens talk [6] in T XConf 2019.1 E pT X divides character tokens into two groups: ordi- In this paper, the following are assumed: E nary 8-bit character tokens and Japanese character • All inputs and outputs are encoded in UTF-8. tokens. The former are not different from tokens in • pTEX uses EUC-JP as the internal Japanese en- 8-bit engines, say, TEX82 and pdfTEX.A ^^-notation coding (see Section 2.1). sequence is always treated as an 8-bit character. • Sources are typeset in plain pTEX(ptex), unless A Japanese character token is represented by stated otherwise by %#!. its character code. In other words, although there • The notation <AB> describes a byte 0xab, or a is a \kcatcode primitive, which is the counterpart character token whose code is 0xab. of \catcode, its information is not stored in tokens. Hence, changing \kcatcode by users is not recom- 2 Overview of pTEX mended. pTEX is an engine extension of TEX82 for Japanese typesetting. It can typeset Japanese documents of 2 Currently ASCII DWANGO in DWANGO Co. Ltd. 3 texjp.org/. Several GitHub repositories: 1 TEXConf 2019 (the annual meeting of Japanese TEX users, github.com/texjporg/tex-jp-build ((u)pTEX), texconf2019.tumblr.com) was canceled due to a typhoon. github.com/texjporg/platex (pLATEX). Distinguishing 8-bit characters and Japanese characters in (u)pTEX 330 TUGboat, Volume 41 (2020), No. 3 2.3 An example input This is because all of Now we look at an example. Our input line is \csname u^^c5^^bf\endcsname a<C3><9F><E6><BC><A2><C5><BF><C2><A7> (aß漢ſ§) \csname uſ\endcsname % ſ: <C5><BF> in UTF-8 \csname u顛\endcsname % 顛: <C5><BF> in EUC-JP First, ptexenc converts this line into have the same name u<C5><BF> in pTEX, hence they a^^c3^^9f<B4><C1>^^c5^^bf<A1><F8> are treated as the same control sequence. Applying which is fed to pTEX’s input processor. The final \string to them, we get the same token list character “§” is included in JIS X 0208. \12 u12 顛 From the result above, pTEX produces tokens This explains the error message in the introduc- a <C3> <9F> 漢 <C5> <BF> § 11 12 12 12 12 tion. “顛 (U+C4CF)” in the message is generated where 漢 and § are Japanese character tokens. From from this example, we can see that we cannot write “§” \expandafter\string directly to output this character in a Latin font (use \csname u8:\string<C5>\string<BF>\endcsname commands or ^^c2^^a7). The inputenc package expects that applying \string 3 Stringization in pTEX to the above control sequence produces 3.1 Overview \12 u12 812 :12 <C5>12 <BF>12 Names of multiletter control sequences, which in- A clude control sequences with single Japanese charac- but the result in pLTEX is ter name, such as \あ, are stringized, that is to say, \12 u12 812 :12 顛 they are stored into the string pool. Similarly, some primitives, such as \string, \jobname, \meaning 3.3 \meaning and \the (almost always the case), first stringize The result of their intermediate results into the string pool, and \font\Z=ec-lmr10 \Z % T1 encoding then retokenize these intermediate results. \def\fuga{^^c5^^bf顛ſ}\meaning\fuga Stringization of pTEX has two crucial points. • The origin of a byte is lost in stringization. A differs between plainE T X and plain pTEX: byte sequence, for example <C5><BF>, in the plain TEX macro:->Å£éąŻÅ£ string pool may be the result of stringization of plain pTEX macro:->顛顛顛 a Japanese character “顛”, or that of two 8-bit Now we look at what happened with pTEX. The characters <C5> and <BF>. definition of \fuga is represented by the token list • In retokenization, a byte sequence which repre- sents a Japanese character in the internal encod- <C5>12 <BF>12 顛 <C5>12 <BF>12 ing is always converted to a Japanese character This gives the following string as the intermediate token. For example, <C5><BF> is always con- result of \meaning. verted to a Japanese token 顛. macro:-><C5><BF><C5><BF><C5><BF> These points cause unnatural behavior, namely bytes from 8-bit characters becoming garbled to Japanese Retokenizing this string gives the final result character tokens. We look into several examples. macro:->顛顛顛 3.2 Control sequence name which we have already seen. Let’s begin with the following source: 3.4 A tricky application \font\Z=ec-lmr10 \Z % T1 encoding The behavior described in Section 3.2 has a tricky \expandafter\def\csname uſ\endcsname{AA} application: generating a Japanese character token \expandafter\def\csname u顛\endcsname{BB} from its code number, even in an expansion-only \def\ZZ#1{#1 (\string#1) } context. This can be constructed as follows: \expandafter\ZZ\csname u^^c5^^bf\endcsname% (1) \expandafter\ZZ\csname uſ\endcsname % (2) %#!eptex \expandafter\ZZ\csname u顛\endcsname % (3) \font\Z=ec-lmr10 \Z % T1 encoding \input expl3-generic % for \char_generate:nn With pTEX, (1)–(3) produces the same result \ExplSyntaxOn BB (\u 顛) \cs_generate_variant:Nn \cs_to_str:N { c } Hironori Kitagawa TUGboat, Volume 41 (2020), No. 3 331 \cs_new:Npn \tkchar #1 { Then, pTEX prints this token list. Since <A1>–<FE> \cs_to_str:c { are printable and <9F> is not, the putc2 function \char_generate:nn % upper byte receives the following string, one byte per call. { \int_div_truncate:nn { #1 } { 256 } } { 12 } <C5><BF><C5><BF><C3>^^9f \char_generate:nn % lower byte { \int_mod:nn { #1 } { 256 } } { 12 } Each <C5><BF> is converted to “顛” by putc2, } while the single <C3> remains unchanged. Hence the } final result is “顛顛<C3>^^9f”, as shown. \ExplSyntaxOff \edef\A{\tkchar{`漢}\tkchar{`字}} 4.3 \message \meaning\A % ==> macro:->漢字 \message is similar to \write, but differs in that it stringizes its argument. Now consider an input line This \tkchar will be unnecessary as of TEX Live 2020, since the \Uchar and \Ucharcat primitives \message{^^fe^^f3:ꚲ:} were added into 휀-pTEX at that time. Here ꚲ (<F0><AA><9A><B2> in UTF-8) is a character 4 Output to file or terminal included in JIS X 0213, but not in JIS X 0208. 4.1 Output code conversion The argument of \message is (expanded to) the following token list. As with input, pTEX does a code conversion from the internal Japanese encoding to UTF-8 in outputting <FE>12 <F3>12 :12 <F0>12 <AA>12 <9A>12 <B2>12 :12 to a file or the terminal.