Automatic Non-Standard Hyphenation in OpenOffice.org

László Németh
nemeth (at) openoffice dot org

TUGboat, Volume 27 (2006), No. 2, Proceedings of EuroTeX 2006

Abstract

The hyphenation algorithm of OpenOffice.org 2.0.2 is a generalization of TeX's hyphenation algorithm that allows automatic non-standard hyphenation by competing standard and non-standard hyphenation patterns. With the suggested integration of linguistic tools for compound decomposition and word sense disambiguation, this algorithm would be able to do also more precise non-standard and standard hyphenation for several languages.

Introduction

Standard hyphenation consists of splitting a word and including a hyphen at the end of the first part of the split word (unless the word already contained a hyphen or en dash at the break). While standard hyphenation is widely applicable, several languages also use non-standard hyphenation.

Table 1 shows examples for non-standard hyphenation: character deletions and other changes at hyphenation break points in European writing systems. Some non-standard hyphenation can be handled easily by computer, like the mandatory middle dot deletion from Catalan digraph l·l. But complex analysis is necessary for languages, like German,¹ Hungarian, Swedish and Norwegian to recognize hyphenation points. For instance, the Swedish word form glassko has three different meanings, and can be hyphenated as glas-sko (glass shoe), glass-ko (ice cream cow) and in the non-standard way, glass- sko (ice cream shoe).

¹ German orthography before the spelling reform of 1996.

Table 1: Non-standard hyphenation in European languages

    Language    Example      Hyphenation        Description
    Catalan     paral·lel    paral- lel         digraph l·l represents long (geminated) l
    Dutch       reëel        re- eel            diaeresis and hyphenation sign syllable breaks
                omaatje      oma- tje           vowel lengthening with diminutive -tje
    English     eighteen     eight- teen        suggested pretty hyphenation by D. E. Knuth [6]
    German      Zucker       Zuk- ker           digraphs ck and kk represent long k
                Schiffahrt   Schiff- fahrt      triple consonants at compound word boundary
    Greek       Μαΐου        Μα- ίου            diaeresis and hyphenation sign syllable breaks
    Hungarian   asszonnyal   asz- szony- nyal   simplified double digraphs (long sz and ny phonemes)
    Norwegian   bussjåfør    buss- sjåfør       triple consonants at compound word boundary
    Polish      kong-fu      kong- -fu          repeated hyphen at line begin
    Swedish     tillåta      till- låta         triple consonants at compound word boundary

Such non-standard hyphenation plays an important role in good typesetting. Commercial DTP programs, even word processors, support automatic non-standard hyphenation, often by licensing third party libraries. The most important free alternatives, such as Apache FOP, GNU Troff, KDE KOffice, OpenOffice.org, Scribus, and TeX and its variants, do not support automatic non-standard hyphenation. TeX has a hyphenation primitive, the \discretionary command. There are TeX macros in the Babel package for non-standard hyphenation, for instance, \lgem for Catalan l·l, \ck or "ck for German, ~ssz for Hungarian, \= for Polish, but there is no real automatic non-standard hyphenation in TeX. Omega 2 has promising developments towards implementing sophisticated automatic non-standard hyphenation for German and other languages [4, 5].

The aim of the present project was to implement language-independent automatic non-standard hyphenation in OpenOffice.org. In this article we present our results, introduce old and new hyphenation algorithms, extension of the Hungarian hyphenation patterns and finally show the possibility of integrating compound word decomposition and word sense disambiguation to our algorithm.

Results

TeX's hyphenation is the de facto standard in the free software world, because the hyphenators of the free programs mentioned in the previous section are all based on Liang's hyphenation algorithm from TeX82 [9], and use the TeX hyphenation patterns. Thus, we have developed an extension for OpenOffice.org's ALT Linux LibHnj hyphenator to do automatic non-standard hyphenation. The result is based on a generalization of Liang's original algorithm which also allows easy integration of special linguistic tools to handle compound word decomposition and word sense disambiguation in automatic hyphenation. The Hungarian hyphenation patterns were extended with non-standard hyphenation patterns.

The improved hyphenation library (without integrated linguistic tools) is part of the OpenOffice.org 2.0.2 with the extended Hungarian hyphenation patterns. Developers can download a standalone version of this library with an example executable from the Lingucomponent project home [14].

Liang's hyphenation algorithm

Franklin M. Liang's hyphenation algorithm is based on competing hyphenation patterns. The patterns can give excellent compression for a hyphenation dictionary, and using these patterns the fast hyphenator algorithm can also correctly hyphenate unknown (non-dictionary) words most of the time. Liang's work covers also the machine learning of the hyphenation patterns and exceptions by PatGen pattern generator.

The hyphenation patterns can allow and prohibit hyphenation breaks on multiple levels. Figure 1 shows the pattern matching on the word 'algorithm'. The TeX English hyphenation patterns 4l1g4, lgo3, 1go, 2ith and 4h1m match this word and determine its hyphenation. Only odd numbers mean hyphenation breaks. If two (or more) patterns have numbers in the same place, the highest number wins. The al- go- rith- m hyphenation is bad, but the last one-letter hyphenation is suppressed by TeX, so we end up with the correct al- go- rithm.

    . a l g o r i t h m .
       4l1g4
        l g o3
         1g o
               2i t h
                   4h1m
    ---------------------
       4 1 4 3 2 0 4 1
      a l-g o-r i t h-m

Figure 1: TeX hyphenation of 'algorithm'

One of the most notable features of this pattern-based hyphenation is the human-readable format of the knowledge database, in contrast to an equivalent finite state machine or a similarly good artificial neural network. This format is good for manual checking and corrections.

Missing features

In TeX's automatic hyphenation the most wanted features are non-standard hyphenation, compound word analysis, word sense disambiguation and taboo word filtering [12, 13].²

² Liang's hyphenation algorithm and its compact implementation using packed trie data structure was perfect twenty-five years ago for English and for computers with less than a few MB RAM. Nowadays internationalization (handling multiple languages) is a standard in software industry and free software development. Modern personal computers have much more memory and speed to enable using additional special linguistic tools in hyphenation.

Sojka's non-standard hyphenation extension

In [12] Petr Sojka suggests a non-standard hyphenation extension for Liang's algorithm. His algorithm first searches all hyphenation points of a word using Liang's algorithm, and then matches patterns from a non-standard hyphenation table at valid hyphenation points, replaces the matching pattern with a special character, and rechecks the hyphenation of the new word at this special character with Liang's algorithm. The non-standard hyphenation point will be chosen if the second hyphenation is successful. Using a ck → % (k- k) pattern data from a non-standard hyphenation table, the German word Zucker will be Zu%er after the replacement, and the pattern zu%1er permits non-standard hyphenation with k- k (Zuk- ker).

Problems

It's possible to use the pattern generator on a prepared input dictionary for Sojka's algorithm, but then we lose the human-readable format of hyphenation patterns. The biggest problem is to use competing patterns on multiple levels. That is why instead of using difficult redundant patterns with special hyphenation characters, Sojka suggests global parameters (left and right non-standard hyphenation penalties) to forbid standard hyphenations near the non-standard hyphenation points. But German, Hungarian, Norwegian and Swedish non-standard hyphenation need true competing patterns.

OpenOffice.org's extension

Table 2: Extended hyphenation patterns for Table 1

    Pattern             Example      Hyphenation
    l·1l/l=l            paral·lel    paral- lel
    e1ë/e=e             reëel        re- eel
    a1atje./a=t,1,3     omaatje      oma- tje
    eigh1tee/t=t,5,1    eighteen     eight- teen
    c1k/k=k             Zucker       Zuk- ker
    schif1f/ff=f,5,2    Schiffahrt   Schiff- fahrt
    1ΐ/=ί               Μαΐου        Μα- ίου
    s1sz/sz=sz          asszonnyal   asz- szony- nyal
    n1ny/ny=ny
    bus1s/ss=s,3,2      bussjåfør    buss- sjåfør
    7-/=-               kong-fu      kong- -fu
    .til1lå/ll=l,3,2    tillåta      till- låta

Table 2 shows possible hyphenation patterns for Table 1. The dots in the patterns match the word boundaries. The first dot doesn't affect the character positions in the non-standard hyphenation subranges: .zuc1ker/k=k,3,2. Figure 2 shows the result of applying multiple non-standard pattern matching.

    . a s s z o n n y a l .
        s1s z/sz=sz,1,3
                n1n y/ny=ny,1,3
    -------------------------
       0 1 0 0 0 1 0 0 0/sz=sz,2,3,ny=ny,6,3
      a s-s z o n-n y a l/sz=sz,2,3,ny=ny,6,3

Figure 2: Hyphenation of asszonnyal

Rules

A single subregion must contain exactly one hyphenation point (indicated by an odd number in Liang's syntax). There may also be explicit non-breakable points (indicated by even numbers) in the subregion, and any breakable or non-breakable points out of the subregion.

A standard and a non-standard hyphenation pattern matching the same hyphenation point must not be on the same hyphenation level. For instance, c1 and zuc1ker/k=k,3,2 are invalid, while c1 and zuc3ker/k=k,3,2 are valid extended hyphenation patterns.

Unicode character encoding

Unicode is the basis for internationalization. Thanks to the unambiguous start positions of the multibyte characters, Liang's algorithm works perfectly with the UTF-8 Unicode encoding.
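
To make the pattern competition concrete, the following Python sketch replays the 'algorithm' and asszonnyal examples from Figures 1 and 2. It is only an illustration and not the LibHnj code: the function names `parse` and `hyphenate` are hypothetical. It also makes several assumptions. A pattern without an explicit range is taken to cover the whole pattern. The replacement is attached to every odd number of an extended pattern, which is adequate only for patterns with a single odd number. Finally, TeX's default minimum fragment lengths (2 letters before and 3 after a break) stand in for the suppression of very short fragments.

```python
def parse(pattern):
    """Split e.g. 'zuc1ker/k=k,3,2' into (letters, levels, change).

    levels[k] is the number written before letters[k] (0 if absent);
    change is None for a standard pattern.
    """
    body, _, ext = pattern.partition("/")
    letters, levels = "", [0]
    for ch in body:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    change = None
    if ext:
        repl, *nums = ext.split(",")
        if nums:
            start, length = int(nums[0]), int(nums[1])
        else:
            # Assumption: no range means the whole pattern, dots excluded.
            start, length = 1, len(letters.strip("."))
        change = (repl, start, length)
    return letters, levels, change


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return (levels between the letters, hyphenated word)."""
    text = "." + word.lower() + "."
    # best[i] holds the winning (level, change) for the gap before text[i].
    best = [(0, None)] * (len(text) + 1)
    for pat in patterns:
        letters, levels, change = parse(pat)
        dot = 1 if letters.startswith(".") else 0  # leading dot is not counted
        pos = text.find(letters)
        while pos != -1:
            where = None
            if change:
                repl, start, length = change
                # 1-based start of the replaced subrange within the word.
                where = (repl, pos + dot + start - 1, length)
            for k, level in enumerate(levels):
                if level > best[pos + k][0]:  # the highest number wins
                    best[pos + k] = (level, where if level % 2 else None)
            pos = text.find(letters, pos + 1)
    gaps = best[2:len(word) + 1]  # gap j lies between letters j and j + 1
    out = word
    for j in range(len(gaps), 0, -1):  # right to left keeps indices valid
        level, where = gaps[j - 1]
        if level % 2 == 0 or j < left_min or len(word) - j < right_min:
            continue
        if where is None:
            out = out[:j] + "-" + out[j:]
        else:
            repl, start, length = where
            out = out[:start - 1] + repl.replace("=", "-") + out[start - 1 + length:]
    return [level for level, _ in gaps], out


print(hyphenate("algorithm", ["4l1g4", "lgo3", "1go", "2ith", "4h1m"]))
# ([4, 1, 4, 3, 2, 0, 4, 1], 'al-go-rithm')
print(hyphenate("asszonnyal", ["s1sz/sz=sz", "n1ny/ny=ny"]))
# ([0, 1, 0, 0, 0, 1, 0, 0, 0], 'asz-szony-nyal')
print(hyphenate("Zucker", ["c1", "zuc3ker/k=k,3,2"])[1])
# Zuk-ker
```

In the last call the standard pattern c1 and the non-standard pattern zuc3ker/k=k,3,2 compete for the same point, and the non-standard one wins because it sits on a higher level. This is the situation the Rules section requires to be unambiguous.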
