Text Segmentation by Language Using Minimum Description Length
Hiroshi Yamaguchi
Graduate School of Information Science and Technology, University of Tokyo
[email protected]

Kumiko Tanaka-Ishii
Faculty and Graduate School of Information Science and Electrical Engineering, Kyushu University
[email protected]

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 969–978, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

Abstract

The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

1 Introduction

For the purposes of this paper, a multilingual text means one containing text segments, limited to those longer than a clause, written in different languages. We can often find such texts in linguistic resources collected from the World Wide Web for many non-major languages, which tend to also contain portions of text in a major language. In automatic processing of such multilingual texts, they must first be segmented by language, and the language of each segment must be identified, since many state-of-the-art NLP applications are built by learning a gold standard for one specific language. Moreover, segmentation is useful for other objectives such as collecting linguistic resources for non-major languages and automatically removing portions written in major languages, as noted above. The study reported here was motivated by this objective. The problem addressed in this article is thus to segment a multilingual text by language and identify the language of each segment. In addition, for our objective, the set of target languages consists of not only major languages but also many non-major languages: more than 200 languages in total.

Previous work that directly concerns the problem addressed in this paper is rare. The most similar previous work that we know of comes from two sources and can be summarized as follows. First, (Teahan, 2000) attempted to segment multilingual texts by using text segmentation methods used for non-segmented languages. For this purpose, he used a gold standard of multilingual texts annotated by borders and languages. This segmentation approach is similar to that of word segmentation for non-segmented texts, and he tested it on six different European languages. Although the problem setting is similar to ours, the formulation and solution are different, particularly in that our method uses only a monolingual gold standard, not a multilingual one as in Teahan's study. Second, (Alex, 2005) (Alex et al., 2007) solved the problem of detecting words and phrases in languages other than the principal language of a given text. They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts. They also reported that such processing would raise the performance of German parsers. Here again, the problem setting is similar to ours but not exactly the same, since the embedded text portions were assumed to be words. Moreover, the authors only tested for the specific language pair of English embedded in German texts. In contrast, our work considers more than 200 languages, and the portions of embedded text are larger: up to the paragraph level to accommodate the reality of multilingual texts. The extension of our work to address the foreign word detection problem would be an interesting future work.

From a broader view, the problem addressed in this paper is further related to two genres of previous work. The first genre is text segmentation. Our problem can be situated as a sub-problem from the viewpoint of language change. A more common setting in the NLP context is segmentation into semantically coherent text portions, of which a representative method is text tiling as reported by (Hearst, 1997). There could be other possible bases for text segmentation, and our study, in a way, could lead to generalizing the problem. The second genre is classification, and the specific problem of text classification by language has drawn substantial attention (Grefenstette, 1995) (Kruengkrai et al., 2005) (Kikui, 1996). Current state-of-the-art solutions use machine learning methods for languages with abundant supervision, and the performance is usually high enough for practical use. This article concerns that problem together with segmentation but has another particularity in aiming at classification into a substantial number of categories, i.e., more than 200 languages. This means that the amount of training data has to remain small, so the methods to be adopted must take this point into consideration. Among works on text classification into languages, our proposal is based on previous studies using cross-entropy such as (Teahan, 2000) and (Juola, 1997). We explain these works in further detail in §3.

This article presents one way to formulate the segmentation and identification problem as a combinatorial optimization problem; specifically, to find the set of segments and their languages that minimizes the description length of a given multilingual text. In the following, we describe the problem formulation and a solution to the problem, and then discuss the performance of our method.

2 Problem Formulation

In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification. This entails two sub-assumptions.

First, we assume that for all multilingual text, every text portion is written in one of the given languages; there is no input text of an unknown language without learning data. In other words, we use supervised learning. In line with recent trends in unsupervised segmentation, the problem of finding segments without supervision could be solved through approaches such as Bayesian methods; however, we report our result for the supervised setting since we believe that every segment must be labeled by language to undergo further processing.

Second, we cannot assume a large amount of learning data, since our objective requires us to consider segmentation by both major and non-major languages. For most non-major languages, only a limited amount of corpus data is available.¹

¹ In fact, our first motivation was to collect a certain amount of corpus data for non-major languages from Wikipedia.

This constraint suggests the difficulty of applying certain state-of-the-art machine learning methods requiring a large learning corpus. Hence, our formulation is based on the minimum description length (MDL), which works with relatively small amounts of learning data.

In this article, we use the following terms and notations. A multilingual text to be segmented is denoted as $X = x_1, \ldots, x_{|X|}$, where $x_i$ denotes the $i$-th character of $X$ and $|X|$ denotes the text's length. Text segmentation by language refers here to the process of segmenting $X$ by a set of borders $B = [B_1, \ldots, B_{|B|}]$, where $|B|$ denotes the number of borders, and each $B_i$ indicates the location of a language border as an offset number of characters from the beginning. Note that a pair of square brackets indicates a list. Segmentation in this paper is character-based, i.e., a $B_i$ may refer to a position inside a word. The list of segments obtained from $B$ is denoted as $\mathbf{X} = [X_0, \ldots, X_{|B|}]$, where the concatenation of the segments equals $X$. The language of each segment $X_i$ is denoted as $L_i$, where $L_i \in \mathcal{L}$, the set of languages. Finally, $\mathbf{L} = [L_0, \ldots, L_{|B|}]$ denotes the sequence of languages corresponding to each segment $X_i$. The elements in each adjacent pair in $\mathbf{L}$ must be different.

We formulate the problem of segmenting a multilingual text by language as follows. Given a multilingual text $X$, the segments $\mathbf{X}$ for a list of borders $B$ are obtained with the corresponding languages $\mathbf{L}$. Then, the total description length is obtained by calculating each description length of a segment $X_i$ for the language $L_i$:

$$(\hat{\mathbf{X}}, \hat{\mathbf{L}}) = \arg\min_{\mathbf{X}, \mathbf{L}} \sum_{i=0}^{|B|} dl_{L_i}(X_i). \tag{1}$$

The function $dl_{L_i}(X_i)$ calculates the description length of a text segment $X_i$ through the use of a language model for $L_i$. Note that the actual total description length must also include an additional term, $\log_2 |X|$, giving information on the number of segments (with the maximum to be segmented by each character). Since this term is a common constant for all possible segmentations and the minimization of formula (1) is not affected by this term, we will ignore it.

The model defined by (1) is additive for $X_i$, so the following formula can be applied to search for language $L_i$ given a segment $X_i$:

3 Calculation of Cross-Entropy

The first term of (3), $-\log_2 P_{L_i}(X_i)$, is the cross-entropy of $X_i$ for $L_i$ multiplied by $|X_i|$. Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types based on different methods of universal coding and the language model.
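As an illustration of the objective in formula (1), the following Python sketch computes the total description length of a text for a given list of borders and picks the cheapest language for each segment. It is a minimal sketch under stated assumptions, not the estimator used in this paper: the language model is an add-one-smoothed character bigram model chosen only for brevity, only the term -log2 P(segment) is modelled, the constraint that adjacent segments differ in language is not enforced, and all function names (train_bigram, description_length, split_by_borders, best_language, total_description_length) as well as the toy samples are hypothetical. The demo searches a single border by brute force, whereas the paper's solution uses dynamic programming.

```python
import math
from collections import Counter


def train_bigram(sample):
    """Count character bigrams and their left contexts in a monolingual sample."""
    pairs = Counter(zip(sample, sample[1:]))
    contexts = Counter(sample[:-1])
    return pairs, contexts


def description_length(segment, model, vocab_size):
    """Bits needed for `segment`, i.e. -log2 P(segment), under a smoothed bigram model.

    Assumption: add-one smoothing over `vocab_size` characters; the first
    character is coded uniformly.
    """
    if not segment:
        return 0.0
    pairs, contexts = model
    bits = math.log2(vocab_size)
    for a, b in zip(segment, segment[1:]):
        p = (pairs[(a, b)] + 1) / (contexts[a] + vocab_size)
        bits -= math.log2(p)
    return bits


def split_by_borders(text, borders):
    """Turn character offsets B_1..B_|B| into the segments X_0..X_|B|."""
    cuts = [0, *borders, len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def best_language(segment, models, vocab_size):
    """Return (language, bits) minimizing the description length of one segment."""
    return min(
        ((lang, description_length(segment, m, vocab_size)) for lang, m in models.items()),
        key=lambda item: item[1],
    )


def total_description_length(text, borders, models, vocab_size):
    """Sum of per-segment minima, as in formula (1) for a fixed list of borders.

    Simplification: adjacent segments are not forced to differ in language.
    """
    return sum(
        best_language(seg, models, vocab_size)[1]
        for seg in split_by_borders(text, borders)
    )


if __name__ == "__main__":
    # Toy monolingual samples (hypothetical stand-ins for real training text).
    samples = {
        "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
        "de": "der schnelle braune fuchs springt und die katze sitzt auf der matte",
    }
    text = "the cat sat on the mat der hund und die katze"
    vocab_size = len(set(text).union(*samples.values()))
    models = {lang: train_bigram(s) for lang, s in samples.items()}

    # Brute-force search over a single border, for illustration only.
    best = min(
        range(1, len(text)),
        key=lambda b: total_description_length(text, [b], models, vocab_size),
    )
    for seg in split_by_borders(text, [best]):
        print(repr(seg), best_language(seg, models, vocab_size)[0])
```

Because the objective is a sum over segments, the language of each segment can be chosen independently once the borders are fixed, which is what best_language does here. With samples this small the chosen border may not coincide with the true language change.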