arXiv:2004.11472v1 [cs.CL] 23 Apr 2020

Multiple Segmentations of Thai Sentences for Neural Machine Translation

Alberto Poncelas¹, Wichaya Pidchamook², Chao-Hong Liu³, James Hadley⁴, Andy Way¹

¹ ADAPT Centre, School of Computing, Dublin City University, Ireland
² SALIS, Dublin City University, Ireland
³ Iconic Translation Machines, Ireland
⁴ Trinity Centre for Literary and Cultural Translation, Trinity College Dublin, Ireland
{alberto.poncelas, andy.way}@adaptcentre.ie, [email protected], [email protected], [email protected]

Abstract

Thai is a low-resource language, so it is often the case that data is not available in sufficient quantities to train a Neural Machine Translation (NMT) model which performs to a high level of quality. In addition, the Thai script does not use white spaces to delimit the boundaries between words, which adds more complexity when building sequence-to-sequence models. In this work, we explore how to augment a set of English–Thai parallel data by replicating sentence-pairs with different word segmentation methods on Thai, as training data for NMT model training. Using different merge operations of Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that, by combining these datasets, performance is improved for NMT models trained with a dataset that has been split using a supervised splitting tool.

Keywords: Machine Translation, Word Segmentation, Thai Language

1. Combination of Segmented Texts

In Machine Translation (MT), low-resource languages are especially challenging, as the amount of parallel data available to train models may not be enough to achieve high translation quality. One approach to mitigate this problem is to augment the training data with artificially-generated text [23; 16] or with data from similar languages [10].

State-of-the-art Neural Machine Translation (NMT) models are based on the sequence-to-sequence framework, i.e. the models are built using pairs of sequences (sentences) as training data. Therefore, sentences are modeled as sequences of tokens. As Thai is a scriptio continua language (where whitespaces are not used to indicate the boundaries between words), sentences need to be segmented in order to be converted into sequences of tokens. This preprocessing step, the segmentation of sentences into tokens, is a well-known problem in Natural Language Processing (NLP), and Thai is no exception. How Thai sentences should be split at the linguistic level has been discussed [3], and different approaches, including dictionary-based and machine learning-based algorithms, have been proposed [5].

In this work, we use the problem of segmentation to our advantage. We propose using an unsupervised method of segmentation, such as Byte Pair Encoding (BPE), to obtain different versions of the same training sentence with different segmentations (by using different numbers of merge operations). This leads to different parallel sets (each with a different segmentation on the Thai side) with which to train NMT models. These models can be combined and achieve better translation performance than using one segmentation strategy alone.

As the encoder-decoder framework deals with sequences of tokens, a way to address the Thai language is to split¹ the sentence into words (or tokens). We investigate three splitting strategies: (i) Character-based, using each character as a token. This is the simplest approach, as there is no need to identify the words; (ii) Supervised word-segmentation, using tokenizers which have been trained on supervised data, such as the "Deepcut"² library [6] (trained on the BEST corpus [8]); and (iii) Language-independent split, using a language-independent algorithm for splitting such as BPE [24]. This last method splits the sentences based on statistics of sequences of letters, so the split is not necessarily word-based. BPE is executed as follows: initially, it considers each character as an independent token; then, iteratively, the most frequent adjacent pair of tokens x and y is merged into a single token xy. The process is executed iteratively until a given number of merge operations is completed.

Once the sentences of a parallel corpus have been split, they can be used as training data for an NMT engine. Among the three approaches, the supervised word-segmentation strategy is the one that might perform best, as it encodes information from the manually-segmented sentences. However, by changing the merge-operation parameter of BPE, the training set can be split in different ways. Accordingly, in this work we want to explore whether augmenting the training set with the same sentences (but split using different numbers of merge operations) can be beneficial for training an NMT model. By doing this, we artificially increase the vocabulary of source-side words, which causes the number of translation candidates on the target side to be extended. This has been shown to have a positive impact in other MT approaches [13; 14]. In addition, we want to explore whether this approach can be followed to augment a dataset split with a supervised tool such as Deepcut, and whether it improves the performance of the NMT models that use this split. Note that in this work we are building models in the English-to-Thai direction.

¹ In this work, we use the terms split and segmentation interchangeably to refer to the process of dividing a word or a sentence into sub-units.
² https://github.com/rkcosmos/Deepcut
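As a brief illustration of the BPE procedure described above (each character starts as a token, and the most frequent adjacent pair is repeatedly merged), the following is a minimal toy sketch in Python. It is not the implementation used in the experiments, which rely on standard BPE tooling [24], and unlike standard BPE it does not restrict merges to within words:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Toy BPE: learn `num_merges` merge operations from a list of sentences.

    Each sentence starts as a sequence of characters; at every step the most
    frequent adjacent pair of tokens is merged into a single new token.
    """
    # Start from character-level sequences (the initial state described above).
    sequences = [list(sentence) for sentence in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        (x, y), _freq = pair_counts.most_common(1)[0]
        merges.append((x, y))
        # Replace every occurrence of the pair (x, y) with the merged token xy.
        new_sequences = []
        for seq in sequences:
            merged, i = [], 0
            while i < len(seq):
                if i < len(seq) - 1 and seq[i] == x and seq[i + 1] == y:
                    merged.append(x + y)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_sequences.append(merged)
        sequences = new_sequences
    return merges, sequences

# Example: more merge operations yield longer (closer to word-level) tokens.
merges, segmented = learn_bpe_merges(["the train is here", "the train is late"], 10)
print(merges[:3])
print(segmented[0])
```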
The datasets with different splits are used only on the target side, whereas we keep the same number of merge operations on the source side. The main reason for this is that using several splits on the source side would also mean that the NMT models would be evaluated with different versions (different splits) of the test set, which would obviously affect the results.

2. Data and Model Configuration

The models are built in the English-to-Thai direction using OpenNMT-py [7]. We keep the default settings of OpenNMT-py: a 2-layer LSTM with 500 hidden units and a maximum vocabulary size of 50,000 words for each language.

We use the Asian Language Treebank (ALT) Parallel Corpus [21] for training and evaluation. We split the corpus so that we use 20K sentences for training and 106 sentences for development. We use Tatoeba [26] for evaluating the models (173 sentences).

In order to evaluate the models, we translate the test set and measure the quality of the translation using an automatic evaluation metric, which provides an estimation of how good the translation is by comparing it to a human-translated reference. As the evaluation is carried out on Thai text (which does not mark word boundaries), we use CHRF3 [20], which is a character-level metric, instead of BLEU [12], which is based on the overlap of word n-grams.

The English side is tokenized and truecased (applying the proper case of the words) and we apply BPE with 89,500 merge operations (the default explored in Sennrich et al. [24]). For the Thai side we explore combinations of different approaches of sentence segmentation:

1. Character-level: Split the sentences character-wise, so each character is a token of the sequence (character dataset).

2. Deepcut: Split the sentences using the Deepcut tool (Deepcut dataset).

3. BPE: Use BPE with different merge operations. In this work we explore 1000, 5000, 10000 and 20000 merge operations (BPE 1000, BPE 5000, BPE 10000 and BPE 20000 datasets, respectively; see the sketch below). However, we propose as future work investigating the performance when using BPE with more merge operations.

In total there are six different datasets. Note that they contain the same unique sentences. We train NMT models using these datasets either independently or in combination. The combination is carried out by appending different datasets. In particular, we use the character, Deepcut or BPE 1000 dataset as a base and then cumulatively add the BPE 1000, BPE 5000, BPE 10000 and BPE 20000 datasets.
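To make the Thai-side segmentation approaches above concrete, here is an indicative sketch of how the different versions of one sentence could be produced. The deepcut.tokenize call follows that library's documented usage; the BPE part assumes subword-nmt with pre-learned codes files whose paths (codes.1000, etc.) are purely illustrative, and the paper does not specify exactly how the raw Thai text is fed to BPE, so that step is an assumption:

```python
# Indicative sketch: build the character-level, Deepcut and BPE-segmented
# versions of one Thai sentence. Assumes the `deepcut` and `subword-nmt`
# packages are installed and that BPE codes were learned beforehand, e.g.
#   subword-nmt learn-bpe -s 1000 < train.th > codes.1000
# (file names here are illustrative only).
import codecs

import deepcut                           # supervised Thai word segmenter [6]
from subword_nmt.apply_bpe import BPE    # language-independent BPE splitter [24]

sentence = "รถไฟมาแล้ว"   # "the train is here"

# (1) Character-level: every character becomes a token.
char_tokens = list(sentence)

# (2) Supervised word segmentation with Deepcut.
deepcut_tokens = deepcut.tokenize(sentence)

# (3) BPE with different numbers of merge operations. How exactly the raw,
#     unsegmented Thai text is passed to BPE is not detailed in the paper;
#     here the whole sentence is fed as-is (an assumption).
bpe_tokens = {}
for merges in (1000, 5000, 10000, 20000):
    with codecs.open(f"codes.{merges}", encoding="utf-8") as codes:
        bpe_tokens[merges] = BPE(codes).process_line(sentence).split()

print(char_tokens)
print(deepcut_tokens)
print(bpe_tokens[1000])
```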

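The combination of datasets described above (appending differently segmented copies of the target side, so that each source sentence is duplicated once per split) could be implemented along the following lines. This is only a sketch of the procedure, not the authors' scripts, and the file names are hypothetical:

```python
# Sketch of building the cumulative combinations used in the experiments:
# a base dataset (character, Deepcut or BPE 1000) is concatenated with the
# BPE 1000/5000/10000/20000 versions of the same sentences. The source side
# is simply repeated, so each English sentence appears once per target split.

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def combine(source_path, target_paths, out_prefix):
    """Append several target-side segmentations of the same parallel data."""
    src = read_lines(source_path)
    combined_src, combined_tgt = [], []
    for tgt_path in target_paths:
        tgt = read_lines(tgt_path)
        assert len(tgt) == len(src), "all variants must contain the same sentences"
        combined_src.extend(src)   # duplicate the source for every target split
        combined_tgt.extend(tgt)
    with open(out_prefix + ".en", "w", encoding="utf-8") as f:
        f.write("\n".join(combined_src) + "\n")
    with open(out_prefix + ".th", "w", encoding="utf-8") as f:
        f.write("\n".join(combined_tgt) + "\n")

# e.g. the "character + 1000 + 5000 + 10000 + 20000" configuration
# (hypothetical file names):
combine("train.bpe.en",
        ["train.char.th", "train.bpe1000.th", "train.bpe5000.th",
         "train.bpe10000.th", "train.bpe20000.th"],
        "train.char+bpe")
```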
3. Experimental Results

First, we evaluate the performance of the models when different merge operations are used on the Thai side. The performance of the models, evaluated using CHRF3, is presented in Table 1. In the table, each row shows the results (the evaluation of the translation of the test set) of an NMT model built with the training set split following one of the three chosen approaches. For example, the row Deepcut presents the results when the model is built with the set after being split using the Deepcut tool, and the row BPE 1000 those obtained when the dataset is split with BPE using 1,000 merge operations.

dataset       CHRF3
Deepcut       47.90
character     20.70
BPE 1000      39.88
BPE 5000      45.49
BPE 10000     41.75
BPE 20000     38.77

Table 1: Performance of the NMT model trained with data using different methods for word segmentation on the target side.

As expected, we see in the table that the best results are obtained when using the Deepcut tool (Deepcut row), which splits Thai words grammatically.

Regarding the models in which BPE has been used, we see in Table 1 that the best performance is achieved when 5000 merge operations are used (BPE 5000 row). When too many merge operations are performed, there is a risk of merging characters of different words into a single token, so the score decreases. By contrast, if there are too few merge operations, the resulting text is closer to a character-level split which, as we can see in the character row, leads to the worst results.
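The CHRF3 scores reported in this paper are character n-gram F-scores with β = 3 [20]. As a reference point, the snippet below is a simplified, sentence-level sketch of that computation; the experiments presumably used a standard implementation, and details such as whitespace handling, n-gram order and corpus-level aggregation may differ:

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams; whitespace is removed, as in the usual chrF setup.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified sentence-level chrF: averaged character n-gram precision and
    recall, combined into an F-score weighted towards recall (beta=3 -> CHRF3)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        if hyp and ref:
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Example: compare a hypothesis against a Thai reference.
print(round(chrf("รถไฟนี้อยู่นี่", "รถไฟมาแล้ว"), 2))
```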
3.1. Combination of Datasets with Unsupervised Split

The main focus of this paper is to study the impact on performance when several training sets with different splits are gathered together. We show the results in Table 2. Note that only the training set has been augmented; the development and test sets remain the same for all experiments.

dataset       CHRF3
character     20.70
  + 1000      38.23
  + 5000      45.63
  + 10000     49.93
  + 20000     52.17

BPE 1000      39.88
  + 5000      49.54
  + 10000     52.33
  + 20000     53.45

Table 2: Performance of the NMT model trained with data combining different unsupervised methods for word segmentation on the target side. Numbers in bold indicate a statistically significant improvement at p=0.01.

In the table, each subtable presents the results when the sets are appended. For example, in the first subtable the row character indicates the performance of the model trained with the data split by character (the same as in Table 1). The following row, +1000, shows the score of the model trained with the character-wise split and the split with BPE using 1000 merge operations (so there are two instances of each source sentence). The last row of the subtable (+20000) indicates the performance of the model trained with the five datasets concatenated (five instances of each source sentence).

In the table we observe that combining datasets with different segmentations is beneficial. In each subtable of Table 2 we see that the score goes up as we add more datasets with different splits. For example, the model using data split character-wise achieves a score of only 20.70 (first subtable). When we concatenate the same dataset split using 1000 merge operations, the score increases to 38.23 (an 85% improvement), and the score keeps increasing as we add sentences with different splits. This effect is observed for all three subtables.

The results also show that the lowest scores are achieved in those datasets in which a character-wise split is performed. For example, the performance when using the dataset split with BPE using 1000 operations is 39.88 (BPE 1000 row in Table 2), whereas when it is used in combination with the character-split set (+1000 row in the first subtable of Table 2) the score is 38.23. Another indication that character-wise splitting hurts performance is the comparison between the subtables of Table 2: if we compare the rows +1000, +5000, +10000 and +20000 of each subtable, the lower scores are seen in the first subtable.

Another research question we want to answer is whether using a combination of splits obtained with BPE can outperform the model trained with a supervised tool like Deepcut. We see that in the second subtable of Table 2 all of the combinations of BPE-split datasets (+5000, +10000 and +20000 rows) exceed the 47.90 score achieved by using Deepcut alone.

3.2. Augmenting the Dataset Split with Deepcut

As combining different splits can increase the performance of the model, does it also help when used in combination with Deepcut? In Table 3 we present the results when the dataset split using Deepcut is augmented with datasets split with different merge operations using BPE. We see that initially, adding the set split with BPE using 1,000 merge operations (+1000 row) causes the performance to drop slightly. Nonetheless, the performance increases when more data are added (i.e. the +5000, +10000 and +20000 rows). Moreover, it even outperforms the model trained with the Deepcut-split dataset alone.

dataset       CHRF3
Deepcut       47.90
  + 1000      47.00
  + 5000      51.25
  + 10000     54.94
  + 20000     54.96

Table 3: Performance of the combination of datasets with different splits using Deepcut. Numbers in bold indicate a statistically significant improvement at p=0.01.

3.3. Analysis

In Table 4 we show some examples of the output. For each sentence we present the translations produced by the NMT models trained on Deepcut, BPE 20000 and Deepcut + BPE 20000. We do not include the outputs of the NMT model trained on the character dataset, as all its translations consisted of the same, meaningless, sequence of characters. In the third column of the table we show the translation of the output of the NMT system after it has been post-edited by removing characters (in gray) or replacing them (in blue).

In the first sub-table we present how the sentence "the train is here" (meaning that the train is here because it has arrived) has been translated by the different models. On the one hand, we see in the Deepcut row that the model has not produced an accurate translation. On the other hand, the models trained with data containing sentences with different segmentations (i.e. BPE 20000 and Deepcut + BPE 20000) achieve a more accurate translation. Nevertheless, the output is not a perfect translation, as it indicates where the train is located instead of expressing that it has arrived.

In the following sub-tables we observe a similar effect. The models trained with the Deepcut dataset produced a translation that is either inaccurate or makes no sense. However, the models trained on BPE 20000 or Deepcut + BPE 20000 produce a translation closer to the input after some post-editing (i.e. removing the characters in gray).

4. Related Work

There are several studies aiming to address the problem of splitting words in Thai. One of the first approaches to segmenting is the longest-matching method [19], consisting of identifying the longest sequence of characters that matches a word in the dictionary. Another approach is the maximal-matching method [25], which consists of generating all possible segmentations and retrieving those containing the smallest number of words. Haruechaiyasak and Kongyoung [4] used conditional random fields to classify each character as either word-beginning or intra-word. Nararatwong et al. [11] proposed an improvement to this approach by adding information from POS tags.

The use of several segmentations has also been proposed by Kudo [9], who tries to integrate candidates from different segmentations. This technique has applications in a number of areas such as word alignment [28] or language modeling [22; 1].

The use of multiple instances in the training data, where only one side is modified, has been explored by Poncelas et al. [18], who showed that using multiple instances of the same target sentences with different source sides (generated by different MT engines) leads to better results than using a single instance of each sentence.
5. Conclusions and Future Work

In this work we have explored whether the duplication of training sentences with different splits is useful to build NMT models with improved performance. The experiments show that the combination of different splits on the target side does improve NMT models involving a low-resource language such as Thai.

Dataset             | Output                               | Translation of the output
Source Sentence     | the train is here.                   |
Reference           | รถไฟมาแล้ว                             |
Deepcut             | รถไฟนี้เป็นสายพันธุ์                        | This train is a breed.
BPE 20000           | รถไฟอยู่ที่นี่อยู่ที่นี่ด้วย                       | The train is located here.
Deepcut + BPE 20000 | รถไฟนี้อยู่นี่                              | This train is located here.

Source Sentence     | I have a new bicycle.                |
Reference           | ฉันมีจักรยานใหม่                          |
Deepcut             | ผมได้ทำการทดลองใหม่                      | I made a new experiment.
BPE 20000           | ผมมีรถบรรทุกคนใหม่ๆ                       | I have a new truck.
Deepcut + BPE 20000 | ผมมีรถจักรยานชุดใหม่ขึ้นมาแล้ว                | I have a new set of bicycles.

Source Sentence     | she does not have many friends in Kyoto. |
Reference           | เธอไม่ค่อยมีเพื่อนมากนักที่เกียวโต               |
Deepcut             | เธอยังไม่มีใครได้รับบาดเจ็บ                    | She still doesn't have anyone injured.
BPE 20000           | เธอไม่ได้มีเพื่อนหลายๆในเกียวโตเกียว             | She does not have many friends in Kyoto.
Deepcut + BPE 20000 | เธอไม่มีเพื่อนร่วมหลายคนในเกียวโต               | She does not have many friends in Kyoto.

Table 4: Examples of translations using different splits.

The experiments also reveal that combining the same dataset using different merge operations of BPE not only improves on the model trained on the dataset using a single configuration (regardless of the number of merge operations), but also on the model trained on data that has been split using a tool trained on supervised data, such as Deepcut.

In the future, we plan to conduct more fine-grained experiments to explore which configurations of BPE perform better. For example, would the combination of BPE 10000 and BPE 20000 (those with the highest numbers of operations explored) perform better than the model's original setup? And what would the results be if only BPE 1000 and BPE 20000 (those with the lowest and highest numbers of operations) were combined?

Furthermore, as all the experiments with different splits have been applied on the target side, we plan to investigate NMT models when Thai is on the source side. Similarly, we will also investigate whether these improvements can be achieved with other languages.

Another variation we are interested in exploring is not to replicate all the sentences, but to use data-selection algorithms to find a subset of sentences that may boost the performance of the models trained on that subset [15; 17]. Finally, we would like to investigate the applicability of the method of employing several segmentations to other NLP tasks such as text classification [2; 27].

6. Acknowledgements

This work has been supported by the QuantiQual Project, generously funded by the Irish Research Council's COALESCE scheme (COALESCE/2019/117). This research has also been supported by the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106).

References

[1] Solomon Teferra Abate, Laurent Besacier, and Sopheap Seng. Boosting n-gram coverage for unsegmented languages using multiple text segmentation approach. In 23rd International Conference on Computational Linguistics, pages 1–7, Beijing, China, 2010.

[2] Chanudom Apichai, Karman Chan, and Yoshimi Suzuki. Classifying short text in social media for extracting valuable ideas. In 20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, 2019.

[3] Wirote Aroonmanakun. Thoughts on word and sentence segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, pages 85–90, Pattaya, Thailand, 2007.

[4] Choochart Haruechaiyasak and Sarawoot Kongyoung. TLex: Thai lexeme analyser based on the conditional random fields. In Proceedings of the 8th International Symposium on Natural Language Processing, Bangkok, Thailand, 2009.

[5] Choochart Haruechaiyasak, Sarawoot Kongyoung, and Matthew Dailey. A comparative study on Thai word segmentation approaches. In 2008 5th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, volume 1, pages 125–128, 2008.

[6] Rakpong Kittinaradorn, Korakot Chaovavanich, Titipat Achakulvisut, and Chanwit Kaewkasi. A Thai word tokenization library using deep neural network, 2018.

[7] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics - System Demonstrations, pages 67–72, Vancouver, Canada, 2017.

[8] Krit Kosawat, Monthika Boriboon, Patcharika Chootrakool, Ananlada Chotimongkol, Supon Klaithin, Sarawoot Kongyoung, Kanyanut Kriengket, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, et al. BEST 2009: Thai word segmentation software contest. In 2009 Eighth International Symposium on Natural Language Processing, pages 83–88, Bangkok, Thailand, 2009.

[9] Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia, 2018.

[10] Surafel Melaku Lakew, Aliia Erofeeva, and Marcello Federico. Neural machine translation into language varieties. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 156–164, Brussels, Belgium, 2018.

[11] Rungsiman Nararatwong, Natthawut Kertkeidkachorn, Nagul Cooharojananone, and Hitoshi Okada. Improving Thai word and sentence segmentation using linguistic knowledge. IEICE Transactions on Information and Systems, 101(12):3218–3225, 2018.

[12] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, 2002.

[13] Alberto Poncelas, Andy Way, and Antonio Toral. Extending feature decay algorithms using alignment entropy. In International Workshop on Future and Emerging Trends in Language Technology, pages 170–182, Seville, Spain, 2016.

[14] Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way. Applying n-gram alignment entropy to improve feature decay algorithms. The Prague Bulletin of Mathematical Linguistics, 108(1):245–256, 2017.

[15] Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way. Data selection with feature decay algorithms using an approximated target side. In 15th International Workshop on Spoken Language Translation (IWSLT 2018), pages 173–180, Bruges, Belgium, 2018.

[16] Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. Investigating backtranslation in neural machine translation. In 21st Annual Conference of the European Association for Machine Translation, pages 249–258, Alacant, Spain, 2018.

[17] Alberto Poncelas, Andy Way, and Kepa Sarasola. The ADAPT system description for the IWSLT 2018 Basque to English translation task. In 15th International Workshop on Spoken Language Translation, pages 76–82, Bruges, Belgium, 2018.

[18] Alberto Poncelas, Maja Popovic, Dimitar Shterionov, Gideon Maillette de Buy Wenniger, and Andy Way. Combining SMT and NMT back-translated data for efficient NMT. In Proceedings of Recent Advances in Natural Language Processing (RANLP), pages 922–931, Varna, Bulgaria, 2019.

[19] Yuen Poowarawan. Dictionary-based Thai syllable separation. In Proceedings of the Ninth Electronics Engineering Conference (EECON-86), pages 409–418, Bangkok, Thailand, 1986.

[20] Maja Popovic. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, 2015.

[21] Hammam Riza, Michael Purwoadi, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Sethserey Sam, et al. Introduction of the Asian Language Treebank. In 2016 Conference of the Oriental Chapter of the International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), pages 1–6, 2016.

[22] Sopheap Seng, Laurent Besacier, Brigitte Bigi, and Eric Castelli. Multiple text segmentation for statistical language modeling. In 10th Annual Conference of the International Speech Communication Association (Interspeech), pages 2663–2666, Brighton, United Kingdom, 2009.

[23] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, 2016.

[24] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, 2016.

[25] Virach Sornlertlamvanich. Word segmentation for Thai in machine translation system. Machine Translation, pages 556–561, 1993.

[26] Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 2214–2218, Istanbul, Turkey, 2012.

[27] Attaporn Wangpoonsarp, Kazuya Shimura, and Fumiyo Fukumoto. Acquisition of domain-specific senses and its extrinsic evaluation through text categorization. In 20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, 2019.

[28] Ning Xi, Guangchao Tang, Boyuan Li, and Yinggong Zhao. Word alignment combination over multiple word segmentation. In Proceedings of the ACL 2011 Student Session, pages 1–5, Portland, USA, 2011.