Text Segmentation by Language Using Minimum Description Length Hiroshi Yamaguchi Kumiko Tanaka-Ishii Graduate School of Faculty and Graduate School of Information Information Science and Technology, Science and Electrical Engineering, University of Tokyo Kyushu University
[email protected] [email protected] Abstract addressed in this paper is rare. The most similar The problem addressed in this paper is to seg- previous work that we know of comes from two ment a given multilingual document into seg- sources and can be summarized as follows. First, ments for each language and then identify the (Teahan, 2000) attempted to segment multilingual language of each segment. The problem was texts by using text segmentation methods used for motivated by an attempt to collect a large non-segmented languages. For this purpose, he used amount of linguistic data for non-major lan- a gold standard of multilingual texts annotated by guages from the web. The problem is formu- lated in terms of obtaining the minimum de- borders and languages. This segmentation approach scription length of a text, and the proposed so- is similar to that of word segmentation for non- lution finds the segments and their languages segmented texts, and he tested it on six different through dynamic programming. Empirical re- European languages. Although the problem set- sults demonstrating the potential of this ap- ting is similar to ours, the formulation and solution proach are presented for experiments using are different, particularly in that our method uses texts taken from the Universal Declaration of only a monolingual gold standard, not a multilin- Human Rights and Wikipedia, covering more than 200 languages.