Universal Shaping Engine

Making fonts for the Universal Shaping Engine
John Hudson, Tiro Typeworks Ltd • TYPO Labs, Berlin, 10 May 2016
Version 1.1, 23 May 2016

This paper, based on a presentation delivered at the inaugural TYPO Labs font technology conference in Berlin, concerns making a particular kind of OpenType font to work with a new shaping engine for complex script layout. If you’re not involved in making fonts for complex scripts, I hope you might still find some interest in the conceptual problems and solutions involved, and also in the insights these provide into the architecture and history of OpenType Layout.

Let me begin by defining what we mean by ‘complex script’. These are scripts that require processing beyond a simple display of the default encoded glyph for each character in order to correctly present text in an acceptably readable form. This processing typically involves character string analysis and manipulation, as well as glyph substitution and positioning. There are, of course, instances in which a font for any script may assume complex behaviours (ligation, contextual substitutions, dynamic mark positioning), but an inherently complex script is one for which the plain text encoded character sequence will be unreadable without additional processing.

[Figure: What is a complex script? Sample words in Arabic (تعقيد), Devanagari (मिश्रित), and Kannada (ಸಂಕೀರ್ಣ), with and without shaping.]

Complex scripts tend to fall into one of two broad categories: those, like Arabic, involving joining behaviour that requires knowledge about adjacent characters and substitution of appropriate forms to display connected lettergroups, and those, like the many Brahmi-derived scripts of South and Southeast Asia, in which the orthographic unit is a cluster that may consist of multiple consonant letters plus dependent vowel sign and additional modifier marks. Scripts in the latter category also tend to involve reordering behaviours, in which there is a distinction between the graphical order of signs in the cluster and their phonetically encoded ordering.
In the OpenType model, complex script layout is handled collaboratively by a shaping engine, residing at the operating system or application level, and the layout in a font. This is a somewhat simplified diagram of that collaboration, and I’m not going to discuss it in a step-by-step way. [For more detailed discussion, see my Unicode conference presentation from 2015: http://tiro.com/John/Hudson_IUC39_Beyond_Shaping.pdf]

[Figure: OpenType Layout simplified collaborative model. Participants: Layout services, Shaping engine, Font. Stages, in order: [Bidi algorithm], script itemisation, run segmentation, run analysis, cluster segmentation, [split vowels], [initial reordering], cluster shaping, basic shaping GSUB features, final reordering, standard GSUB features, conditional GSUB features, GPOS features, line breaking, justification.]

Complex script handling was Microsoft’s primary goal in developing a smart font format in the mid-1990s. Microsoft developed Arabic, Hebrew, and Thai shaping engines for TrueType Open, the immediate precursor to OpenType, and in early 1999 shipped the first version of the Unicode Script Processor for Complex Scripts, or Uniscribe, with Internet Explorer 5.01. Subsequent versions of Uniscribe have shipped with all versions of Windows, Office, and Microsoft browsers, often leapfrogging each other in support for additional scripts and languages. Other companies have produced their own OpenType Layout engines for complex scripts, notably the open source HarfBuzz shaper (maintained by Behdad Esfahbod), Adobe’s World Ready Composer, and Apple’s CoreText engine.

The assignment of scripts to processing by a particular engine generally depends on similarities in shaping needs. This leads to predictable groupings, such as the handling of numerous South Asian Brahmi-derived scripts in a common Indic shaping engine, and occasionally to strange bedfellows, such as assignment of the Thaana script of the Maldive Islands to Uniscribe’s Hebrew shaping engine.
The current Windows 10 version of Uniscribe includes nine engines, each of which is responsible for shaping one or more scripts. An engine may also support more than one version of shaping for a given script, mapped to different OpenType script tags, for example the old Windows XP Indic shaping and the new ‘Indic2’ model introduced in Windows Vista. This enables continued support for older fonts while allowing improved implementations to emerge.

Uniscribe shaping engines as of Windows 10 (RS1)

Arabic engine       Arabic, Syriac
Generic engine      Cyrillic, Greek, Latin, etc. (non-complex scripts)
Hangul engine       Hangul, Old Hangul
Hebrew engine       Hebrew, Thaana
Indic engine        Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, etc.
Khmer engine        Khmer
Myanmar engine      Myanmar (Burmese)
Thai/Lao engine     Lao, Thai
Universal engine    Balinese, Batak, Brahmi, Buginese, Buhid, Chakma, Cham, Duployan, Egyptian Hieroglyphs, Grantha, Hanunoo, Javanese, Kaithi, Kayah Li, etc. (45 total)

In case you are unfamiliar with the kinds of things that a shaping engine does with a script, I’ll take a moment to discuss a step-by-step example of typical script-specific shaping for a mock character sequence (not a real word). Layout services will have identified this as Bengali, based on the Unicode script property of the characters involved, and will have passed the run to the appropriate Indic shaping engine. The shaping engine has determined that the font supports the Indic2 shaping model using the <bng2> script tag, so is going to apply that shaping model.

Step-by-step example (Bengali <bng2> shaping)

Characters:       ক ◌ো ল ◌্ ম ◌ু র ◌্ ত ◌্ ক ◌ি
Code points:      0995 09CB 09B2 09CD 09AE 09C1 09B0 09CD 09A4 09CD 0995 09BF
Transliteration:  ka   o    la   [x]  ma   u    ra   [x]  ta   [x]  ka   i

The shaping engine analyses the character run, and segments it into three orthographic units, in this case clusters consisting of one or more consonants with explicit vowel signs.
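The segmentation of the example run can be sketched in a few lines of Python. This is only an illustrative approximation, not the Indic engine's actual cluster grammar: it assumes that every code point in U+0995..U+09B9 is a consonant, and that a consonant starts a new cluster unless it directly follows the vowel killer U+09CD. The names `segment_clusters` and `RUN` are invented for this sketch.

```python
VIRAMA = 0x09CD  # the "vowel killer" sign, shown as [x] in the example

def is_consonant(cp):
    # Simplifying assumption: treat U+0995..U+09B9 as the Bengali consonants.
    return 0x0995 <= cp <= 0x09B9

def segment_clusters(cps):
    """Split a run of code points into clusters.

    A consonant opens a new cluster unless the previous character is the
    vowel killer, which ties it to the preceding consonant.
    """
    clusters = []
    for i, cp in enumerate(cps):
        if not clusters or (is_consonant(cp) and cps[i - 1] != VIRAMA):
            clusters.append([cp])
        else:
            clusters[-1].append(cp)
    return clusters

# The mock Bengali run from the example: ka o la [x] ma u ra [x] ta [x] ka i
RUN = [0x0995, 0x09CB, 0x09B2, 0x09CD, 0x09AE, 0x09C1,
       0x09B0, 0x09CD, 0x09A4, 0x09CD, 0x0995, 0x09BF]

for cluster in segment_clusters(RUN):
    print(" ".join(f"{cp:04X}" for cp in cluster))
```

Run on the example, this yields three clusters: 0995 09CB, then 09B2 09CD 09AE 09C1, then 09B0 09CD 09A4 09CD 0995 09BF.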
The small diagonal mark (U+09CD) is a vowel killer, indicating that the preceding and following consonants are part of the same cluster.

The first step is to split any two-part vowel signs into their constituent elements. This is a buffered character level operation, made possible because both elements are also atomically encoded as characters in Unicode.

After splitting:            0995 09C7 09BE 09B2 09CD 09AE 09C1 09B0 09CD 09A4 09CD 0995 09BF

The second step in this shaping model is initial reordering. In our example, this involves moving the left-side vowel signs in the first and third clusters to the front of their clusters. Again, this is a buffered character level operation: there’s no interaction with the font layout tables up to this stage.

After initial reordering:   09C7 0995 09BE 09B2 09CD 09AE 09C1 09BF 09B0 09CD 09A4 09CD 0995

That interaction begins in the third step: application of basic shaping glyph substitution features. These may include precomposition of letter plus nukta forms, and formation of akhand ligatures (a kind of pseudo-letter). In our example, the shaping engine applies the Reph Forms <rphf> feature to the sequence of cluster-initial Ra plus the vowel killer character in the third cluster, substituting the repha mark glyph found in the substitution lookup.

After <rphf>:               09C7 0995 09BE 09B2 09CD 09AE 09C1 09BF 09B0+09CD 09A4 09CD 0995

If we had other characters that take special forms in particular situations, these would also be substituted during this phase. Some fonts might substitute half forms of other letter plus vowel killer sequences, although I generally don’t do this in Bengali.

The next step in our example is substitution of consonant ligatures in the Conjunct Forms <cjct> feature. Note that in our example, only the conjunct in the second cluster takes a ligature form; this is because the font does not contain a ligature form for the conjunct in the third cluster, which instead will display with an explicit vowel killer sign. This seldom happens in Bengali text, but is the sort of thing that happens when English or other foreign loanwords are transliterated in an Indian script, producing character sequences that don’t occur in the local language.
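The steps described so far (vowel splitting, initial reordering, then <rphf> and <cjct> substitution) can be traced with a small Python sketch. It is a toy model under stated assumptions, not how Uniscribe or any real shaper is implemented: glyphs are represented as tuples of the code points they stand for, `FONT_CONJUNCTS` is a hypothetical stand-in for the font's <cjct> lookup, the consonant range and the set of left-side vowel signs are simplified, and all function names are invented for illustration.

```python
VIRAMA, RA = 0x09CD, 0x09B0
# Two-part vowel signs and their parts (these match Unicode's decompositions).
SPLIT = {0x09CB: (0x09C7, 0x09BE), 0x09CC: (0x09C7, 0x09D7)}
LEFT_SIGNS = {0x09BF, 0x09C7, 0x09C8}  # vowel signs drawn left of the consonant
# Hypothetical font data: the only conjunct ligature this toy font contains.
FONT_CONJUNCTS = {(0x09B2, 0x09CD, 0x09AE)}

def is_consonant(cp):
    return 0x0995 <= cp <= 0x09B9  # simplifying assumption

def split_vowels(cps):
    out = []
    for cp in cps:
        out.extend(SPLIT.get(cp, (cp,)))
    return out

def clusters_of(cps):
    clusters = []
    for i, cp in enumerate(cps):
        if not clusters or (is_consonant(cp) and cps[i - 1] != VIRAMA):
            clusters.append([cp])
        else:
            clusters[-1].append(cp)
    return clusters

def initial_reorder(cluster):
    # Character-level step: left-side vowel signs move to the cluster start.
    left = [cp for cp in cluster if cp in LEFT_SIGNS]
    return left + [cp for cp in cluster if cp not in LEFT_SIGNS]

def shape_cluster(cluster):
    glyphs = [(cp,) for cp in initial_reorder(cluster)]
    # <rphf>: cluster-initial Ra + vowel killer, followed by more of the cluster.
    first = next((i for i, g in enumerate(glyphs) if is_consonant(g[0])), len(glyphs))
    if [g[0] for g in glyphs[first:first + 2]] == [RA, VIRAMA] and len(glyphs) > first + 2:
        glyphs[first:first + 2] = [(RA, VIRAMA)]
    # <cjct>: ligate consonant + vowel killer + consonant only if the font has it.
    i = 0
    while i + 2 < len(glyphs):
        seq = glyphs[i] + glyphs[i + 1] + glyphs[i + 2]
        if seq in FONT_CONJUNCTS:
            glyphs[i:i + 3] = [seq]
        i += 1
    return glyphs

def shape(cps):
    glyphs = []
    for cluster in clusters_of(split_vowels(cps)):
        glyphs.extend(shape_cluster(cluster))
    return " ".join("+".join(f"{cp:04X}" for cp in g) for g in glyphs)

RUN = [0x0995, 0x09CB, 0x09B2, 0x09CD, 0x09AE, 0x09C1,
       0x09B0, 0x09CD, 0x09A4, 0x09CD, 0x0995, 0x09BF]
print(shape(RUN))
```

For the example run this prints 09C7 0995 09BE 09B2+09CD+09AE 09C1 09BF 09B0+09CD 09A4 09CD 0995. The third cluster keeps its explicit vowel killer because the toy font has no ligature for ta + vowel killer + ka.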
After <cjct>:               09C7 0995 09BE 09B2+09CD+09AE 09C1 09BF 09B0+09CD 09A4 09CD 0995

At this stage, basic shaping features are complete, and the next step is final reordering, this time performed at the glyph level, taking output from features such as Reph Forms <rphf> and tracking their position in the glyph string.
