
Thai Sentence-Breaking for Large-Scale SMT

Glenn Slayden (thai-language.com) [email protected]
Mei-Yuh Hwang (Microsoft Research) [email protected]
Lee Schwartz (Microsoft Research) [email protected]

Abstract

Thai text presents challenges for integration into large-scale multi-language statistical machine translation (SMT) systems, largely stemming from the nominal lack of punctuation and inter-word space. For Thai sentence breaking, we describe a monolingual maximum entropy classifier with features that may be applicable to other languages such as Arabic, Khmer and Lao. We apply this sentence breaker to our large-vocabulary, general-purpose, bidirectional Thai-English SMT system, and achieve BLEU scores of around 0.20, reaching our threshold for releasing it as a free online service.

1 Introduction

NLP research has consolidated around the notion of the sentence as the fundamental unit of translation, a consensus which has fostered the development of powerful statistical and analytical approaches that incorporate an assumption of deterministic sentence delineation. As such systems become more sophisticated, languages for which this assumption is challenged receive increased attention. Thai is one such language, since it uses space neither to distinguish syllables from words or affixes, nor to unambiguously signal sentence boundaries.

Written Thai has no sentence-end punctuation, but a space character is always present between sentences. There is generally no space between words, but a space character may appear within a sentence according to linguistic or prescriptive orthographic motivation (Wathabunditkul 2003). These characteristics disqualify sentence-breaking (SB) methods used for other languages, such as Palmer and Hearst (1997). Thai SB has therefore been regarded as the task of classifying each space that appears in a Thai source text as either sentence-breaking (sb) or non-sentence-breaking (nsb).

Several researchers have investigated Thai SB. Along with a discussion of Thai word breaking (WB), Aroonmanakun (2007) examines the issue. With a human study, he establishes that sentence breaks elicited from Thai informants exhibit varying degrees of consensus. Mittrapiyanuruk and Sornlertlamvanich (2000) define part-of-speech (POS) tags for sb and nsb and train a trigram model over a POS-annotated corpus. At runtime, they use the Viterbi algorithm to select the POS sequence with the highest probability, from which the corresponding space type is read back. Charoenpornsawat and Sornlertlamvanich (2001) apply Winnow, a multiplicative trigger threshold classifier, to the problem. Their model has ten features: the number of words to the left and right, and the left-two and right-two POS tags and words.

We present a monolingual Thai SB based on a maximum entropy (ME) classifier (Ratnaparkhi 1996; Reynar and Ratnaparkhi 1997) which is suitable for sentence-breaking both SMT training data and runtime inputs. Our model uses a four-token window of Thai lemmas, plus categorical features, to describe the proximal environment of the space token under consideration, allowing runtime classification of space tokens with possibly unseen contexts.

As our SB model relies on Thai WB, we review our approach to this problem, plus related preprocessing, in the next section. Section 2 also discusses the complementary operation to WB, namely the re-spacing of Thai text generated as SMT output. Section 3 details our SB model and evaluates its performance. We describe the integration of this work with our large-scale SMT system in Section 4. We draw conclusions in Section 5.

2 Pre- and Post-processing

As will be shown in Section 3, our sentence breaker relies on Thai WB. In turn, with the aim of minimizing WB errors, we perform Unicode character sequence normalization prior to WB. As output byproducts, our WB analysis readily identifies certain types of named entities, which we propagate into our THA-ENG SMT. In this section, we briefly summarize these preliminary processing steps, and we conclude the section with a discussion of Thai text re-spacing.

2.1 Character Sequence Normalization

Thai orthography uses an alphabet of 44 consonants and a number of vowel glyphs and tone marks. The four Thai tone marks and some Thai vowel characters are super- and/or sub-scripted with respect to a base character. For example, the sequence อิ้ consists of three code points: อ ◌ิ ◌้. When two or more of these combining marks are present on the same base character, the ordering of these code points in memory should be consistent, so that orthographically identical entities are recognized as equivalent by computer systems. However, some computer word processors do not enforce the correct sequence or do not properly indicate incorrect sequences to the user visually. This often results in documents with invalid byte sequences.

Correcting these errors is desirable for SMT inputs. In order to normalize Thai input character sequences to a canonical Unicode form, we developed a finite state transducer (FST) which detects and repairs a number of sequencing errors that render Thai text either orthographically invalid or not in a correct Unicode sequence.

For example, a superscripted Thai tone mark should follow a super- or sub-scripted Thai vowel when they both apply to the same consonant. When the input has the tone mark and the vowel glyph swapped, the input can be fully repaired:

    อ า ◌่ น → อ ◌่ า น → อ่าน
    อ ◌้ ◌ิ น → อ ◌ิ ◌้ น → อิ้น

Figure 1. Two unambiguous repairs

Other cases are ambiguous. The occurrence of multiple adjacent vowel glyphs is an error where the intention may not be clear. We retain the first-appearing glyph, unless it is a pre-posed vowel, in which case we retain the last-appearing instance. These two treatments are contrasted in Figure 2. Miscoding (Figure 3) is another variety of input error that is readily repaired.

    จะะา → จะ
    ใเไป → ไป

Figure 2. Two ambiguous repairs

    อ ◌ํ า → อำ
    เ เ อ → แอ

Figure 3. Two common mis-codings

Within the Infoquest Thai newswire corpus, a low-noise corpus, about 0.05% of the lines exhibit at least one of the problems mentioned here. For some chunks of broad-range web-scraped data, we observe rates as high as 4.1%. This measure is expected to under-represent the utility of the filter to WB, since Thai text streams, lacking intra-word spacing and permitting two unwritten vowels, have few re-alignment checkpoints, allowing tokenization state machines to linger in misaligned states.
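The swap repair of Figure 1 amounts to a rewrite rule over combining marks. The following is a minimal Python sketch using a regex substitution over illustrative subsets of the mark classes, rather than the FST described above; the character sets and function name here are ours, not the system's:

    import re

    # Illustrative subsets of the Thai combining marks (the FST covers more classes).
    ABOVE_BELOW_VOWELS = "\u0E31\u0E34\u0E35\u0E36\u0E37\u0E38\u0E39\u0E47"
    TONE_MARKS = "\u0E48\u0E49\u0E4A\u0E4B"  # mai ek, mai tho, mai tri, mai chattawa

    # A tone mark typed *before* a super-/sub-scripted vowel on the same base
    # consonant is swapped into canonical order: vowel first, then tone mark.
    SWAPPED = re.compile("([" + TONE_MARKS + "])([" + ABOVE_BELOW_VOWELS + "])")

    def normalize_thai(text: str) -> str:
        return SWAPPED.sub(lambda m: m.group(2) + m.group(1), text)

    # The second repair of Figure 1: {อ, mai tho, sara i, น} -> {อ, sara i, mai tho, น}
    assert normalize_thai("\u0E2D\u0E49\u0E34\u0E19") == "\u0E2D\u0E34\u0E49\u0E19"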
2.2 Uniscribe Thai Tokenization

Thai text does not normally use the space character to separate words, except in certain specific contexts. Although Unicode offers the Zero-Width Space (ZWSP) as one solution for indicating word breaks in Thai, it is infrequently used. Programmatic tokenization has therefore become a staple of Thai computational linguistics. The problem has been well studied, with precision and recall near 95% (Haruechaiyasak et al. 2008).

In our SMT application, both the sentence breaker and the SMT system itself require Thai WB, and we use the same word breaker for both tasks (although the system design currently prohibits directly passing tokens between these two components). Our method is to apply post-processing heuristics to the output of Uniscribe (Bishop et al. 2003), which is provided as part of the Microsoft® Windows™ operating system interface. Our heuristics fall into two categories: "re-gluing" words that Uniscribe broke too aggressively, and a smaller class of cases of further breaking words that Uniscribe did not break. Re-gluing is achieved by comparing Uniscribe output against a Thai lexicon in which desired breaks within a word are tagged. Underbreaking by Uniscribe is less common and is restricted to a number of common patterns which are repaired explicitly.
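A rough sketch of the re-gluing pass, under the simplifying assumptions that the lexicon is a plain set of surface word forms and that greedy longest-match merging suffices (the production lexicon instead tags desired breaks within words):

    def reglue(tokens, lexicon, max_span=4):
        """Merge adjacent Uniscribe tokens whose concatenation is a known
        Thai word, preferring the longest concatenation; single tokens
        always pass through unchanged."""
        out, i = [], 0
        while i < len(tokens):
            for span in range(min(max_span, len(tokens) - i), 0, -1):
                candidate = "".join(tokens[i:i + span])
                if span == 1 or candidate in lexicon:
                    out.append(candidate)
                    i += span
                    break
        return out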

2.3 Person Entities

In written Thai, certain types of entities employ prescriptive whitespace patterns. By removing these recognized patterns from consideration, SB precision can be improved. Furthermore, because our re-gluing procedure requires a lookup of every syllable proposed by Uniscribe, it is efficient to consider, during WB, additional processing that can be informed by the same lookup. Accordingly, we briefly mention some of the entity types that our WB identifies, focusing on those that incorporate distinctive spacing patterns.

Person names in Thai adhere to a convention for the use of space characters. This helps Thai readers to identify the boundaries of multi-syllable names that they may not have seen before. The following grammar summarizes the prescriptive conventions for names appearing in Thai text:

    <name>  ::= <honorific> <given> <sp0> <surname> <sp1>
    <given> ::= <syl> [<syl>]
    <sp0>   ::= space
    <sp1>   ::= space
    <syl>   ::= <char>+
    <char>  ::= ก | ข | ฃ | ค | ...

Figure 4. Name entity recognition grammar

The re-glue lookup also determines if a syllable matches one of the following predefined special categories: name-introducing honorific (h), Thai or foreign given name (g), token which is likely to form part of a surname (s), or token which aborts the gathering of a name (i.e. is unlikely to form part of a name).

    .../ว่า/นาย/จิ/ระ/นุช/ /วิ/นิจ/จ/กูล/ /ว่า/...
    ว่า  | นาย | จิ    | ระ  | นุช    | (sp) | วิ    | นิจ    | จ     | กูล   | (sp) | ว่า
         | h   | g0    | g1  | g2     | sp0  | s0    | s1     | s2    | s3    | sp1  |
    that | Mr. | <unk> | hit | belove |      | <unk> | stable | <unk> | <unk> |      | said
    ...that Mr. Chiranut Winichotkun said...

Figure 5. Thai person-name entity recognition

Figure 5 shows a Thai name appearing within a text fragment, with Uniscribe-detected token boundaries indicated by slashes. In the third row we have identified the special category, if any, for each token. The fourth line shows the English translation gloss, or <unk> if none. The bottom row is the desired translation output.

Our name recognizer first notes the presence of an honorific {h} นาย, followed by a pattern of tokens {g0-gn}, {s0-sn} and spaces {sp0, sp1} that is compatible with a person given name and surname of sensible length.

Next, we determine which of those tokens in the ranges {g} and {s} following the honorific do not have a gloss translation (i.e., are not found in the lexicon). These tokens are indicated by <unk> in the gloss above. When the number of unknown tokens exceeds a threshold, we hypothesize that these tokens form a name. The lack of lexical morphology in Thai facilitates this method, because token (or syllable) lookup generally equates with the lookup of a stemmed lemma.
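The unknown-token test can be sketched as follows; the honorific list, span length, and threshold are illustrative placeholders rather than the system's actual parameters:

    # Illustrative honorific list; the production categories (h, g, s) and
    # their limits are not reproduced here.
    HONORIFICS = {"นาย"}

    def hypothesize_name(tokens, lexicon, max_len=8, threshold=2):
        """After an honorific, gather up to max_len following tokens and
        count those with no lexicon gloss; enough unknowns suggest that
        the span forms a person name."""
        for i, tok in enumerate(tokens):
            if tok not in HONORIFICS:
                continue
            span = tokens[i + 1 : i + 1 + max_len]
            syllables = [t for t in span if t != " "]
            unknown = sum(1 for t in syllables if t not in lexicon)
            if unknown > threshold:
                yield i, syllables  # honorific index and candidate name syllables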
2.4 Calendar Date Entities

Our WB also identifies Thai calendar dates, as these also exhibit a pattern which incorporates spaces. As a prerequisite to identifying dates, we map the Thai orthographic digits {๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙} to the Arabic digits 0 through 9, respectively. For example, our system interprets the input text ๒๕๔๐ as equivalent to "2540."

    .../ใน/วัน/ที่/ /14/ มีนาคม/ /๒๕๔๐/ /และ/...
    ใน | วัน | ที่    | sp | 14 | มีนาคม | sp | ๒๕๔๐ | sp | และ
    on | day | which |    | 14 | March |    | 2540 |    | and
    ...on March 14th, 1997 and...

Figure 6. Date entity recognition

Figure 6 shows a fragment of Thai text containing a calendar date for which our system will emit a single token. As shown in the example, our system detects and adjusts for the use of Thai Buddhist year dates when necessary. Gathering of the disparate and optional parts of the Thai date is summarized by the grammar in Figure 7.

    <date>         ::= [<date-intro>] [space] <date-body>
    <date-intro>   ::= วันที่ | ที่
    <date-body>    ::= <month-date> [space] <year>
    <month-date>   ::= <day> [space] <month>
    <day>          ::= <digit>+
    <year>         ::= <digit>+
    <month>        ::= <month-name> | <month-abbr>
    <month-name>   ::= มกราคม | กุมภาพันธ์ | มีนาคม | ...
    <month-abbr>   ::= ม.ค. | ก.พ. | มี.ค. | ...
    <digit>        ::= <thai-digit> | <arabic-digit>
    <thai-digit>   ::= ๐ | ๑ | ๒ | ๓ | ๔ | ๕ | ๖ | ๗ | ๘ | ๙
    <arabic-digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Figure 7. Date recognition grammar
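Both the digit mapping and the Buddhist-year adjustment visible in Figure 6 (where ๒๕๔๐ surfaces as 1997) are simple transformations. A sketch, relying on the standard 543-year offset between the Buddhist and Common eras:

    THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

    def to_arabic(text: str) -> str:
        """Map Thai orthographic digits to their Arabic equivalents."""
        return text.translate(THAI_DIGITS)

    def buddhist_to_gregorian(year: int) -> int:
        """Thai Buddhist Era years run 543 ahead of the Common Era."""
        return year - 543

    assert to_arabic("๒๕๔๐") == "2540"
    assert buddhist_to_gregorian(2540) == 1997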

2.5 Thai Text Re-spacing

To conclude this section, we mention an operation complementary to Thai WB, whereby Thai words output by an SMT system must be re-spaced in accordance with Thai prescriptive convention. As will be mentioned in Section 4.2, for each input sentence our English-Thai system has access to an English dependency parse tree, as well as links between this tree and a Thai transfer dependency tree. After using these links to transfer syntactic information to the Thai tree, we are able to apply prescriptive spacing rules (Wathabunditkul 2003) as closely as possible. Human evaluation showed satisfactory results for this process.

3 Maximum Entropy Sentence-Breaking

We now turn to a description of our statistical sentence-breaking model. We train an ME classifier on features which describe the proximal environment of the space token under consideration, and use this model at runtime to classify space tokens with possibly unseen contexts.

3.1 Modeling

Under the ME framework, let B = {sb, nsb} represent the set of possible classes we are interested in predicting for each space token in the input stream. Let C = {linguistic contexts} represent the set of possible contexts that we can observe, which must be encoded by binary features f_j(b, c), 1 ≤ j ≤ k, such as:

    f_j(b, c) = 1 if the previous word is English and b = nsb; 0 otherwise.

This feature helps us learn that the space after an English word is usually not a sentence boundary.

    f_j(b, c) = 1 if the distance to the previous honorific is less than 15 tokens and b = nsb; 0 otherwise.

This feature enables us to learn that spaces which follow an honorific are less likely to mark sentence boundaries. Assume the joint probability p(b, c) is modeled by

    p(b, c) = (1/Z) ∏_{j=1..k} α_j^{f_j(b,c)}

where we have k free parameters {α_j} to estimate, and Z is a normalization factor that makes ∑_{b,c} p(b, c) = 1. The ME learning algorithm finds the solution {α_j} representing the most uncertain commitment,

    max H(p) = −∑_{b,c} p(b, c) log p(b, c),

that satisfies the observed distribution p̂(b, c) of the training data:

    ∑_{b,c} p(b, c) f_j(b, c) = ∑_{b,c} p̂(b, c) f_j(b, c),  1 ≤ j ≤ k.

This is solved via the Generalized Iterative Scaling algorithm (Darroch and Ratcliff 1972). At run-time, a space token is considered an sb if and only if p(sb|c) > 0.5, where

    p(sb|c) = p(sb, c) / (p(sb, c) + p(nsb, c)).
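Once the {α_j} are estimated, the runtime decision reduces to a few lines. A minimal sketch, assuming the binary features are available as predicates over (b, c); the normalizer Z cancels in the conditional and is omitted:

    from math import prod  # Python 3.8+

    def p_unnormalized(b, c, alphas, features):
        """Product of alpha_j over the active features f_j(b, c)."""
        return prod(a for a, f in zip(alphas, features) if f(b, c))

    def is_sentence_break(c, alphas, features):
        """Break iff p(sb|c) > 0.5, per the decision rule above."""
        p_sb = p_unnormalized("sb", c, alphas, features)
        p_nsb = p_unnormalized("nsb", c, alphas, features)
        return p_sb / (p_sb + p_nsb) > 0.5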
3.2 Feature Selection

The core context of our model, {w, x, y, z}, is a window spanning two tokens to the left (positions w and x) and two tokens to the right (positions y and z) of the classification-candidate space token.

    c            token characteristic
    yk           yamok (syllable reduplication) symbol ๆ
    sp           space
    ๐๙           Thai numeric digits
    num          Arabic numeric digits
    ABC          sequence of all-capital ASCII characters
    cnn          single character (derived from hex)
    ckkmmnn      single character (derived from UTF-8 hex)
    ascii        any amount of non-Thai text
    (Thai text)  Thai word (derived from lemma)

Table 1. Categorical and derived feature names

The possible values of each of the window positions {w, x, y, z} are shown in Table 1, where the first match to the token at the designated position is assigned as the feature value for that position. Foreign-text tokens plus any intervening space are merged, so a single "ascii" feature may represent an arbitrary amount of non-Thai script with interior space.
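A sketch of the first-match assignment of Table 1 values, omitting the single-character hex derivations and treating any non-Thai remainder as "ascii":

    def categorize(token: str) -> str:
        """Return the Table 1 feature value for a window-position token;
        the first matching category wins, else the Thai lemma itself."""
        if token == "\u0E46":                          # ๆ yamok
            return "yk"
        if token.isspace():
            return "sp"
        if all("\u0E50" <= ch <= "\u0E59" for ch in token):
            return "\u0E50\u0E59"                      # Thai numeric digits
        if token.isascii():
            if token.isdigit():
                return "num"
            if token.isalpha() and token.isupper():
                return "ABC"                           # all-capital ASCII
            return "ascii"                             # other non-Thai text
        return token                                   # Thai word (lemma)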

Figure 8 shows an example sentence that has been tokenized. Token boundaries are indicated by slashes. Although there are three space tokens in the original input, we extract four contexts. The shaded boxes in the source text, and the shaded line in the figure, indicate the single sb context that is synthesized by wrapping, to be described in Section 3.4.

For each context, in addition to the {w, x, y, z} features, we extract two more features, indicated by {l, r} in Figure 8. They are the number of tokens between the previous space token (wrapping as necessary) and the current one, and the number of tokens between the current space token and the next space token (wrapping as necessary). These features do not distinguish whether the bounding space token is sb or nsb. This is because, processing left-to-right, it is permissible to use a feature such as "number of tokens since the last sb," but not "number of tokens until the next sb," which would be available during training but not at runtime.

    ลักษณะการอ้างอิงแบบ R1C1 ถูกแปลงไปเป็นลักษณะการอ้างอิงแบบ A1
    "R1C1 reference was converted to A1 reference style."
    __/ลักษณะ/การ/อ้าง/อิง/แบบ/ /R1C1/ /ถูก/แปลง/ไป/เป็น/ลักษณะ/การ/อ้าง/อิง/แบบ/ /A1/__

    b   | c=w | c=x | c=y    | c=z  | c=l | c=r
    nsb | อิง  | แบบ | ABC    | sp   | 5   | 1
    nsb | sp  | ABC | ถูก     | แปลง | 1   | 9
    nsb | อิง  | แบบ | ABC    | sp   | 9   | 1
    sb  | sp  | ABC | ลักษณะ  | การ  | 1   | 5

Figure 8. A Thai sentence and the training contexts extracted. Highlighting shows the context for sb.

In addition to the above core features, our model emits certain extra features only if they appear:

• An individual feature for each English punctuation mark, since these are sometimes used in Thai. For example, there is one feature for the sentence-end period (i.e. full stop);
• The current nest depth for paired glyphs with directional variation, such as brackets, braces, and parentheses;
• The current parity value for paired glyphs without directional distinction, such as "straight" quotation marks.

The following example illustrates paired directional glyphs (in this case, parentheses):

    .../ยูนิลิเวอร์/ /(/ประเทศ/__/ไทย/)/ /จำกัด/ /เปิดเผย/ว่า/...
    ...Unilever (Thailand) Ltd. disclosed that...

    b   | c=w | c=x    | c=y | c=z | c=pn
    nsb | (   | ประเทศ  | ไทย | )   | 1

Figure 9. Text fragment illustrating paired directional glyphs and the context for the highlighted space

In Figure 9, the space between ประเทศ "country" and ไทย "Thai" generates an nsb context which includes the features shown, where "pn" is an extra feature which indicates the parenthesis nesting level. This feature helps the model learn that spaces which occur within parentheses are likely to be nsb.

Parity features for the non-directional paired glyphs, which do not nest, are true binary features. Since these features have only two possible values (inside or outside), they are only emitted when their value is "inside," that is, when the space under consideration occurs between such a pair.

3.3 Sentence Breaker Training Corpus

Thai corpora which are marked with sentence breaks are required for training. We assembled a corpus of 361,802 probable sentences. This corpus includes purchased, publicly available, and web-crawled content. In total it contains 911,075 spaces, a figure which includes one inter-sentence space per sentence, generated as described below.

3.4 Out-of-context Sentences

For SB training, paragraphs are first tokenized into words as described in Section 2.2. This process does not introduce new spaces between tokens; only original spaces in the text are classified as sb/nsb and used for the context features described above. To keep this distinction clear, token boundaries are indicated by a slash rather than a space in the examples shown in this paper.

For 91% of our training sentences, the paragraphs from which they originate are inaccessible. In feature extraction for each of these sentences, we wrap the sentence's head around to its tail to obtain its sb context. In other words, for a sentence of tokens t_0 … t_{n-1}, the context of sb (the last space) is given by {w = t_{n-2}, x = t_{n-1}, y = t_0, z = t_1}.

This process was illustrated in Figure 8. Although not an ideal substitute for sentences in context, it ensures that we extract at least one sb context per sentence. The number of nsb contexts extracted per sentence is equal to the number of interior space tokens in the original sentence. Sentence wrapping is not needed when training with sentence-delimited paragraph sources; there, sb and nsb contexts are extracted from the token stream of the entire paragraph, and wrapping is used only to generate one additional sb for the entire paragraph.
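The wrapped extraction can be sketched as below; categorization of the window values and the {l, r} counts are omitted for brevity, and space tokens are assumed to be single " " strings:

    def extract_contexts(tokens):
        """One out-of-context sentence: each interior space yields an nsb
        context; one sb context is synthesized by wrapping the head around
        to the tail, i.e. {w,x,y,z} = {t[n-2], t[n-1], t[0], t[1]}."""
        n = len(tokens)
        contexts = [("nsb", (tokens[i-2], tokens[i-1], tokens[i+1], tokens[i+2]))
                    for i, tok in enumerate(tokens)
                    if tok == " " and 1 < i < n - 2]
        contexts.append(("sb", (tokens[n-2], tokens[n-1], tokens[0], tokens[1])))
        return contexts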

3.5 Sentence Breaker Evaluation

Although evaluation against a single-domain corpus does not measure an important design requirement of our system, namely resilience to broad-domain input texts, we evaluated against the ORCHID corpus (Charoenporn et al. 1997) for the purpose of comparison with the existing literature. Following the methodology of the studies cited below, we use 10-fold ×10% averaged testing against the ORCHID corpus.

Our results are consistent with recent work using the Winnow algorithm, which itself compares favorably with the probabilistic POS trigram approach. Both of these studies use evaluation metrics, attributed to Black and Taylor (1997), which aim to more usefully measure sentence-breaker utility. Accordingly, the following definitions are used in Table 2:

    space-correct = (#correct sb + #correct nsb) / total # of space tokens

    false-break = #sb false positives / total # of space tokens
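Both metrics are simple ratios over aligned per-space labels; a minimal sketch:

    def sb_metrics(gold, predicted):
        """Compute (space-correct, false-break) from parallel lists of
        'sb'/'nsb' labels, one pair per space token."""
        total = len(gold)
        correct = sum(g == p for g, p in zip(gold, predicted))
        false_breaks = sum(p == "sb" and g == "nsb"
                           for g, p in zip(gold, predicted))
        return correct / total, false_breaks / total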

It was generally possible to reconstruct precision and recall figures from these published results, and we present a comprehensive table of results. (Full results for Charoenpornsawat et al. are reconstructed based on remarks in their text, including that "the ratio of the number of [nsb to sb] is about 5:2.") Reconstructed values are marked with a dagger, and the optimal result in each category is marked with an asterisk.

                      Mittrapiyanuruk   Charoenpornsawat   Our
                      et al.            et al.             result
    method            POS trigram       Winnow             MaxEnt
    #sb in reference  10528             1086†              2133
    #space tokens     33141             3801               7227
    nsb-precision     90.27†            91.48†             93.18*
    nsb-recall        87.18†            97.56†*            94.41
    sb-precision      74.35†            92.69†*            86.21
    sb-recall         79.82             77.27              83.50*
    "space-correct"   85.26             89.13              91.19*
    "false-break"     8.75              1.74*              3.94

Table 2. Evaluation of Thai sentence breakers against ORCHID

Finally, we would be remiss in not acknowledging the general hazard of assigning sentence breaks in a language such as Thai, where source text authors may intentionally include or omit spaces in order to create syntactic or semantic ambiguity. We defer to Mittrapiyanuruk and Sornlertlamvanich (2000) and Aroonmanakun (2007) for informed commentary on this topic.

4 SMT System and Integration

The primary application for which we developed the Thai sentence breaker described in this work is the Microsoft® Bing™ general-domain machine translation service. In this section, we provide a brief overview of this large-scale SMT system, focusing on Thai-specific integration issues.

4.1 Overview

Like many multilingual SMT systems, ours is based on hybrid generative/discriminative models. Given a sequence of foreign words f, its best translation is the sequence of target words e that maximizes

    e* = argmax_e p(e|f) = argmax_e p(f|e) p(e) = argmax_e { log p(f|e) + log p(e) }

where the translation model p(f|e) is computed from dozens to hundreds of features. The target language model (LM), p(e), is represented by smoothed n-grams (Chen and Goodman 1996), and sometimes more than one LM is adopted in practice. To achieve the best performance, the log likelihoods evaluated by these features/models are linearly combined. After p(f|e) and p(e) are trained, the combination weights λ_i are tuned on a held-out dataset to optimize an objective function, which we set to be the BLEU score (Papineni et al. 2002):

    {λ_i*} = argmax_{λ_i} BLEU({e*}, {r})

    e* = argmax_e { λ_1 log p(f|e) + λ_2 log p(e) }

where {r} is the set of gold translations for the given input source sentences. To learn λ we use the algorithm described by Och (2003), where the decoder output at any point is approximated using n-best lists, allowing an optimal line search to be employed.
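The linear combination itself is straightforward; the sketch below scores candidates under an arbitrary set of component models, with the decoder's n-best search only caricatured by an explicit candidate list:

    from math import log

    def loglinear_score(f, e, models, lambdas):
        """Weighted sum of log-likelihoods; each model maps (f, e) to a
        probability, and lambdas are the tuned combination weights."""
        return sum(lam * log(m(f, e)) for m, lam in zip(models, lambdas))

    def best_translation(f, candidates, models, lambdas):
        return max(candidates,
                   key=lambda e: loglinear_score(f, e, models, lambdas))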

13 lish-to-Thai translation (ENG-THA) on the Although it is well known that language trans- “treelet” concept suggested in Menezes and lation pairs are not symmetric, we use these Quirk (2008). This approach parses the source same resources to build our THA-ENG transla- language into a dependency tree which includes tion system due to the lack of additional corpora. part-of-speech labels. Our parallel MT corpus consists of approx- Lacking a Thai parser, we use a purely statis- imately 725,000 English-Thai sentence pairs tical phrasal translator after Pharaoh (Koehn from various sources. Additionally we have 9.6 2004) for THA-ENG translation, where we million Thai sentences, which are used to train a adopt the name and date translation described in Thai 4-gram LM for ENG-THA translation, to- Sections 2.3 and 2.4. gether with the Thai sentences in the parallel We also experimented with phrasal ENG- corpus. Trigrams and 4-grams that occur only THA translation. Though we actually achieved a once are pruned, and n-gram backoff weights are slightly better BLEU score than treelet for this re-normalized after pruning, with the surviving translation direction, qualitative human evalua- KN smoothed probabilities intact (Kneser and tion by native speaker informants was mixed. Ney 1995). Similarly, a 4-gram ENG LM is We adopted the treelet ENG-THA in the final trained for THA-ENG translation, on a total of system, for its better re-spacing (Section 2.5). 45.6M English sentences. For both the lambda and test sets, THA LM 4.3 Training, Development and Test Data incurs higher out-of-vocabulary (OOV) rates Naturally, our system relies on parallel text cor- (1.6%) than ENG LM (0.7%), due to its smaller pora to learn the mapping between two languag- training set and thus smaller lexicon. Both trans- es. The parallel corpus contains sentence pairs, lation directions define the maximum corresponding to translations of each other. For phrase/treelet length to be 4 and the maximum Thai, quality corpora are generally not available re-ordering jump to be 4 as well. in sufficient quality for training a general- 4.4 BLEU Scores domain SMT system. For the ENG-THA pair, we resort to Internet crawls as a source of text. To evaluate our end-to-end performance, we We first identify paired documents, break each compute case insensitive 4-gram BLEU scores. document into sentences, and align sentences in Translation outputs are WB first according to the one document against those in its parallel docu- Thai/English tokenizer, before BLEU scores are ment. Bad alignments are discarded. Only sen- computed. The BLEU scores on the test sets are tence pairs with high alignment confidence are shown in Table 4. We are not aware of any pre- kept in our parallel corpus. Our sentence align- viously published BLEU results for either direc- ment algorithm is based on Moore (2002). tion of this language pair. For our ENG-THA translation system, we as- BLEU sembled three resources: a parallel training cor- pus, a development bitext (also called the lamb- THA-ENG 0.233 da set) for training the feature combination ENG-THA 0.194 Table 4. Four-gram case-insensitive BLEU scores. weights {𝜆}, and a test corpus for BLEU and human evaluation. Both the lambda and the test Figures 10 and 11 illustrate sample outputs for sets have single reference translations per sen- the each translation direction, with reference tence. translations.

         BLEU
    THA-ENG  0.233
    ENG-THA  0.194

Table 4. Four-gram case-insensitive BLEU scores

Figures 10 and 11 illustrate sample outputs for each translation direction, with reference translations.

INPUT: ในประเทศไทยมีกล้วยไม้ประมาณ ๑๗๕ ชนิด ถ้าสูญพันธุ์ไปจากประเทศไทย ก็หมายถึงสูญพันธุ์ไปจากโลก
OUTPUT: In Thailand a Orchid approximately 175 type if extinct from Thailand. It means extinct from the world.
REF: In Thailand, there are about 175 species of Orchid. If they disappear from Thailand, they will be gone from the world.

Figure 10. THA-ENG sample translation output

INPUT: In our nation the problems and barriers we face are just problems and barriers of law not selection or development.
OUTPUT: ในประเทศชาติของเรา ปัญหาและอุปสรรคที่เราเผชิญอยู่เพียงปัญหาและอุปสรรคของกฎหมายไม่เลือกหรือพัฒนา
REF: ในประเทศของเราปัญหาและอุปสรรค ก็เป็นปัญหาอุปสรรคทางด้านกฎหมาย แต่ไม่เป็นปัญหาอุปสรรคในการคัดเลือกและพัฒนาพันธุ์

Figure 11. ENG-THA sample translation output

Although the translation quality is far from perfect, SMT is making good progress toward useful applications.

5 Conclusion and Future Work

Our maximum entropy model for Thai sentence-breaking achieves results which are consistent with contemporary work on this task, allowing us to overcome this obstacle to Thai SMT integration. This general approach can be applied to other South-East Asian languages in which space does not deterministically delimit sentence boundaries.

In Arabic writing, commas are often used to separate sentences until the end of a paragraph, when a period is finally used. In this case, the comma character is similar to the space token in Thai in that its usage is ambiguous. We can use the same approach (perhaps with different linguistic features) to identify which commas are sentence-breaking and which are not.

Our overall system incorporates a range of independent solutions to problems in Thai text processing, including character sequence normalization, tokenization, name and date identification, sentence-breaking, and Thai text re-spacing. We successfully integrated each solution into an existing large-scale SMT framework, obtaining sufficient quality to release the Thai-English language pair in a high-volume, general-domain, free public online service.

There remains much room for improvement. We need to find or create true Thai-English directional corpora to train the lambdas and to test our models. The size of our parallel corpus for Thai should increase by at least an order of magnitude, without loss of bitext quality. With a larger corpus, we can consider longer phrase lengths, higher-order n-grams, and longer re-ordering distances.

References

W. Aroonmanakun. 2007. Thoughts on Word and Sentence Segmentation in Thai. In Proceedings of the Seventh International Symposium on Natural Language Processing, Pattaya, Thailand, 85-90.

F. Avery Bishop, David C. Brown, and David M. Meltzer. 2003. Supporting Multilanguage Text Layout and Complex Scripts with Windows 2000. http://www.microsoft.com/typography/developers/uniscribe/intro.htm

A. W. Black and P. Taylor. 1997. Assigning Phrase Breaks from Part-of-Speech Sequences. Computer Speech and Language, 12:99-117.

Thatsanee Charoenporn, Virach Sornlertlamvanich, and Hitoshi Isahara. 1997. Building a Thai Part-of-Speech Tagged Corpus (ORCHID).

Paisarn Charoenpornsawat and Virach Sornlertlamvanich. 2001. Automatic Sentence Break Disambiguation for Thai. In International Conference on Computer Processing of Oriental Languages (ICCPOL), 231-235.

S. F. Chen and J. Goodman. 1996. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 310-318. Morristown, NJ: ACL.

J. N. Darroch and D. Ratcliff. 1972. Generalized Iterative Scaling for Log-Linear Models. The Annals of Mathematical Statistics, 43(5):1470-1480.

Choochart Haruechaiyasak, Sarawoot Kongyoung, and Matthew N. Dailey. 2008. A Comparative Study on Thai Word Segmentation Approaches. In Proceedings of ECTI-CON 2008. Pathumthani, Thailand: ECTI.

Reinhard Kneser and Hermann Ney. 1995. Improved Backing-Off for M-Gram Language Modeling. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1:181-184.

Philipp Koehn. 2004. Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. In Proceedings of the Association for Machine Translation in the Americas (AMTA-2004).

Arul Menezes and Chris Quirk. 2008. Syntactic Models for Structural Word Insertion and Deletion during Translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing.

P. Mittrapiyanuruk and V. Sornlertlamvanich. 2000. The Automatic Thai Sentence Extraction. In Proceedings of the Fourth Symposium on Natural Language Processing, 23-28.

Robert C. Moore. 2002. Fast and Accurate Sentence Alignment of Bilingual Corpora. In Machine Translation: From Research to Real Users (Proceedings, 5th Conference of the Association for Machine Translation in the Americas, Tiburon, California), Springer-Verlag, Heidelberg, Germany, 135-144.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL.

David D. Palmer and Marti A. Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics, 23:241-267.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318. Stroudsburg, PA: ACL.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 133-142.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, 16-19.

Suphawut Wathabunditkul. 2003. Spacing in the Thai Language. http://www.thai-language.com/ref/spacing
