Utterance Segmentation Using Combined Approach Based on Bi-Directional N-Gram and Maximum Entropy

Ding Liu, Chengqing Zong
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China.

Abstract

This paper proposes a new approach to segmentation of utterances into sentences using a new linguistic model based upon Maximum-entropy-weighted Bi-directional N-grams. The usual N-gram algorithm searches for sentence boundaries in a text from left to right only. Thus a candidate sentence boundary in the text is evaluated mainly with respect to its left context, without fully considering its right context. Using this approach, utterances are often divided into incomplete sentences or fragments. In order to make use of both the right and left contexts of candidate sentence boundaries, we propose a new linguistic modeling approach based on Maximum-entropy-weighted Bi-directional N-grams. Experimental results indicate that the new approach significantly outperforms the usual N-gram algorithm for segmenting both Chinese and English utterances.

1 Introduction

Due to the improvement of speech recognition technology, spoken language user interfaces, spoken dialogue systems, and speech translation systems are no longer only laboratory dreams. Roughly speaking, such systems have the structure shown in Figure 1.

[Figure 1. System with speech input: Input speech, Speech recognition, Language analysis and generation, Output (text or speech).]

In these systems, the language analysis module takes the output of speech recognition as its input, representing the current utterance exactly as pronounced, without any punctuation symbols marking the boundaries of sentences. Here is an example: 这边请您坐电梯到 9 楼服务生将在那里等您并将您带到 913 号房间. (this way please please take this elevator to the ninth floor the floor attendant will meet you at your elevator entrance there and show you to room 913.) As the example shows, it will be difficult for a text analysis module to parse the input if the utterance is not segmented. Further, the output utterance from the speech recognizer usually contains wrongly recognized words or noise words. Thus it is crucial to segment the utterance before further language processing. We believe that accurate segmentation can greatly improve the performance of language analysis modules.

Stevenson et al. have demonstrated the difficulties of text segmentation through an experiment in which six people, educated to at least the Bachelor’s degree level, were required to segment into sentences broadcast transcripts from which all punctuation symbols had been removed. The experimental results show that humans do not always agree on the insertion of punctuation symbols, and that their segmentation performance is not very good (Stevenson and Gaizauskas, 2000). Thus it is a great challenge for computers to perform the task automatically.

To solve this problem, many methods have been proposed, which can be roughly classified into two categories. One approach is based on simple acoustic criteria, such as non-speech intervals (e.g. pauses), pitch and energy. We can call this approach acoustic segmentation. The other approach, which can be called linguistic segmentation, is based on linguistic clues, including lexical knowledge, syntactic structure, semantic information etc. Acoustic segmentation cannot always work well, because utterance boundaries do not always correspond to acoustic criteria. For example: 您好<pause>请问<pause>明天的单人间还有吗<pause>或者<pause>标准间也行. (hello <pause> may I ask <pause> are there still single rooms for tomorrow <pause> or <pause> a standard room will also do.) Since the simple acoustic criteria are inadequate, linguistic clues play an indispensable role in utterance segmentation, and many methods relying on them have been proposed.

This paper proposes a new approach to linguistic segmentation using a Maximum-entropy-weighted Bi-directional N-gram-based algorithm (MEBN). To evaluate the performance of MEBN, we conducted experiments in both Chinese and English. All the results show that MEBN outperforms the normal N-gram algorithm. The remainder of this paper will focus on description of our new approach for linguistic segmentation. In Section 2, some related work on utterance segmentation is briefly reviewed, and our motivations are described. Section 3 describes MEBN in detail. The experimental results are presented in Section 4. Finally, Section 5 gives our conclusion.

2 Related Work and Our Motivations

2.1 Related Work

Stolcke et al. (1998, 1996) proposed an approach to detection of sentence boundaries and disfluency locations in speech transcribed by an automatic recognizer, based on a combination of prosodic cues modeled by decision trees and N-gram language models. Their N-gram language model is mainly based on part of speech, and retains some words which are particularly relevant to segmentation. Of course, most part-of-speech taggers require sentence boundaries to be pre-determined; so to require the use of part-of-speech information in utterance segmentation would risk circularity. Cettolo et al.’s (1998) approach to sentence boundary detection is somewhat similar to Stolcke et al.’s. They applied word-based N-gram language models to utterance segmentation, and then combined them with prosodic models. Compared with N-gram language models, their combined models achieved an improvement of 0.5% and 2.3% in precision and recall respectively.

Beeferman et al. (1998) used the CYBERPUNC system to add intra-sentence punctuation (especially commas) to the output of an automatic speech recognition (ASR) system. They claim that, since commas are the most frequently used punctuation symbols, their correct insertion is by far the most helpful addition for making texts legible. CYBERPUNC augmented a standard trigram speech recognition model with lexical information concerning commas, and achieved a precision of 75.6% and a recall of 65.6% when testing on 2,317 sentences from the Wall Street Journal.

Gotoh et al. (1998) applied a simple non-speech interval model to detect sentence boundaries in English broadcast speech transcripts. They compared their results with those of N-gram language models and found theirs far superior. However, broadcast speech transcripts are not really spoken language, but something more like spoken written language. Further, radio broadcasters speak formally, so that their reading pauses match sentence boundaries quite well. It is thus understandable that the simple non-speech interval model outperforms the N-gram language model under these conditions; but segmentation of natural utterances is quite different.

Zong et al. (2003) proposed an approach to utterance segmentation aiming at improving the performance of spoken language translation (SLT) systems. Their method is based on rules which are oriented toward key word detection, template matching, and syntactic analysis. Since this approach is intended to facilitate translation of Chinese-to-English SLT systems, it rewrites long sentences as several simple units. Once again, these results cannot be regarded as general-purpose utterance segmentation. Furuse et al. (1998) similarly propose an input-splitting method for translating spoken language which includes many long or ill-formed expressions. The method splits an input into well-balanced translation units, using a semantic dictionary.

Ramaswamy et al. (1998) applied a maximum entropy approach to the detection of command boundaries in a conversational natural language user interface. They considered as their features words and their distances to potential boundaries. They posited 400 feature functions, and trained their weights using 3000 commands. The system then achieved a precision of 98.2% in a test set of 1900 commands. However, command sentences for conversational natural language user interfaces contain much smaller vocabularies and simpler structures than the sentences of natural spoken language. In any case, this method has been very helpful to us in designing our own approach to utterance segmentation.

There are several additional approaches which are not designed for utterance segmentation but which can nevertheless provide useful ideas. For example, Reynar et al. (1997) proposed an approach to the disambiguation of punctuation marks. They considered only the first word to the left and right of any potential sentence boundary, and claimed that examining wider context was not beneficial. The features they considered included the candidate’s prefix and suffix; the presence of particular characters in the prefix or suffix; whether the candidate was honorific (e.g. Mr., Dr.); and whether the candidate was a corporate designator (e.g. Corp.). The system was tested on the Brown Corpus, and achieved a precision of 98.8%.

[…] but it cannot take into account the distant right context to the candidate. This is the reason that N-gram methods often wrongly divide some long sentences into halves or multiple segments. For example: 小王病了一个星期 (Xiao Wang has been ill for a week). The N-gram method is likely to insert a boundary mark between “了” and “一”, which corresponds to our everyday impression that, if reading from the left and not considering several more words to the right of the current word, we will probably consider “小王病了” (Xiao Wang is ill) as a whole sentence. However, we find that, if we search the sentence boundaries from right to left, such errors can be effectively avoided. In the present example, we will not consider “一个星期” (a week) as a whole sentence, and the search will be continued until the word “小” is encountered. Accordingly, in order to avoid segmentation errors made by the normal N-gram method, we propose a reverse N-gram segmentation method (RN) which does seek sentence boundaries from right to left. Further, we simply integrate the two N-gram methods and propose a bi-directional N-gram method (BN), which takes into account both the left and the right context of a candidate segmentation site.

Since the relative usefulness or signifi-