Text Segmentation by Language Using Minimum Description Length

Hiroshi Yamaguchi
Graduate School of Information Science and Technology, University of Tokyo
[email protected]

Kumiko Tanaka-Ishii
Faculty and Graduate School of Information Science and Electrical Engineering, Kyushu University
[email protected]

Abstract

The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

1 Introduction

For the purposes of this paper, a multilingual text means one containing text segments, limited to those longer than a clause, written in different languages. We can often find such texts in linguistic resources collected from the World Wide Web for many non-major languages, which tend to also contain portions of text in a major language. In automatic processing of such multilingual texts, they must first be segmented by language, and the language of each segment must be identified, since many state-of-the-art NLP applications are built by learning a gold standard for one specific language. Moreover, segmentation is useful for other objectives such as collecting linguistic resources for non-major languages and automatically removing portions written in major languages, as noted above. The study reported here was motivated by this objective. The problem addressed in this article is thus to segment a multilingual text by language and identify the language of each segment. In addition, for our objective, the set of target languages consists of not only major languages but also many non-major languages: more than 200 languages in total.

Previous work that directly concerns the problem addressed in this paper is rare. The most similar previous work that we know of comes from two sources and can be summarized as follows. First, (Teahan, 2000) attempted to segment multilingual texts by using text segmentation methods used for non-segmented languages. For this purpose, he used a gold standard of multilingual texts annotated by borders and languages. This segmentation approach is similar to that of word segmentation for non-segmented texts, and he tested it on six different European languages. Although the problem setting is similar to ours, the formulation and solution are different, particularly in that our method uses only a monolingual gold standard, not a multilingual one as in Teahan's study. Second, (Alex, 2005) (Alex et al., 2007) solved the problem of detecting words and phrases in languages other than the principal language of a given text. They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts. They also reported that such processing would raise the performance of German parsers. Here again, the problem setting is similar to ours but not exactly the same, since the embedded text portions were assumed to be words. Moreover, the authors only tested for the specific language pair of English embedded in German texts. In contrast, our work considers more than 200 languages, and the portions of embedded text are larger: up to the paragraph level to accommodate the reality of multilingual texts. The extension of our work to address the foreign word detection problem would be an interesting future work.

From a broader view, the problem addressed in this paper is further related to two genres of previous work. The first genre is text segmentation. Our problem can be situated as a sub-problem from the viewpoint of language change. A more common setting in the NLP context is segmentation into semantically coherent text portions, of which a representative method is text tiling as reported by (Hearst, 1997).


There could be other possible bases for text segmentation, and our study, in a way, could lead to generalizing the problem. The second genre is classification, and the specific problem of text classification by language has drawn substantial attention (Grefenstette, 1995) (Kruengkrai et al., 2005) (Kikui, 1996). Current state-of-the-art solutions use machine learning methods for languages with abundant supervision, and the performance is usually high enough for practical use. This article concerns that problem together with segmentation but has another particularity in aiming at classification into a substantial number of categories, i.e., more than 200 languages. This means that the amount of training data has to remain small, so the methods to be adopted must take this point into consideration. Among works on text classification into languages, our proposal is based on previous studies using cross-entropy such as (Teahan, 2000) and (Juola, 1997). We explain these works in further detail in §3.

This article presents one way to formulate the segmentation and identification problem as a combinatorial optimization problem; specifically, to find the set of segments and their languages that minimizes the description length of a given multilingual text. In the following, we describe the problem formulation and a solution to the problem, and then discuss the performance of our method.

2 Problem Formulation

In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification. This entails two sub-assumptions.

First, we assume that for all multilingual text, every text portion is written in one of the given languages; there is no input text of an unknown language without learning data. In other words, we use supervised learning. In line with recent trends in unsupervised segmentation, the problem of finding segments without supervision could be solved through approaches such as Bayesian methods; however, we report our result for the supervised setting since we believe that every segment must be labeled by language to undergo further processing.

Second, we cannot assume a large amount of learning data, since our objective requires us to consider segmentation by both major and non-major languages. For most non-major languages, only a limited amount of corpus data is available.1 This constraint suggests the difficulty of applying certain state-of-the-art machine learning methods requiring a large learning corpus. Hence, our formulation is based on the minimum description length (MDL), which works with relatively small amounts of learning data.

1 In fact, our first motivation was to collect a certain amount of corpus data for non-major languages from Wikipedia.

In this article, we use the following terms and notations. A multilingual text to be segmented is denoted as X = x_1 ... x_{|X|}, where x_i denotes the i-th character of X and |X| denotes the text's length. Text segmentation by language refers here to the process of segmenting X by a set of borders B = [B_1, ..., B_{|B|}], where |B| denotes the number of borders, and each B_i indicates the location of a language border as an offset number of characters from the beginning. Note that a pair of square brackets indicates a list. Segmentation in this paper is character-based, i.e., a B_i may refer to a position inside a word. The list of segments obtained from B is denoted as X = [X_0, ..., X_{|B|}], where the concatenation of the segments equals X. The language of each segment X_i is denoted as L_i, where L_i ∈ ℒ, the set of languages. Finally, L = [L_0, ..., L_{|B|}] denotes the sequence of languages corresponding to each segment X_i. The elements in each adjacent pair in L must be different.

We formulate the problem of segmenting a multilingual text by language as follows. Given a multilingual text X, the segments X for a list of borders B are obtained with the corresponding languages L. Then, the total description length is obtained by calculating each description length of a segment X_i for the language L_i:

    (\hat{X}, \hat{L}) = \arg\min_{X, L} \sum_{i=0}^{|B|} dl_{L_i}(X_i)    (1)

The function dl_{L_i}(X_i) calculates the description length of a text segment X_i through the use of a language model for L_i.
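To make formula (1) concrete, the following Python sketch scores one candidate segmentation by summing per-segment description lengths. It is only an illustration of the objective, not the authors' implementation: the names (Segmentation, total_description_length) are invented here, and dl is assumed to be supplied by one of the cross-entropy methods introduced in §3.

```python
from typing import Callable, List, Tuple

# A candidate solution: (segment_text, language) pairs whose concatenation
# equals the input text X, with adjacent languages required to differ.
Segmentation = List[Tuple[str, str]]

def total_description_length(candidate: Segmentation,
                              dl: Callable[[str, str], float]) -> float:
    """Objective of formula (1): the sum of dl_Li(Xi) over all segments.

    dl(segment, language) should return the description length, in bits,
    of the segment under the model for that language (Section 3 discusses
    two ways to estimate it: matching statistics and PPM)."""
    assert all(a[1] != b[1] for a, b in zip(candidate, candidate[1:])), \
        "adjacent segments must carry different languages"
    return sum(dl(segment, language) for segment, language in candidate)
```

Formula (1) asks for the candidate minimizing this value; §4 shows how to find it without enumerating all segmentations.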

Note that the actual total description length must also include an additional term, log_2 |X|, giving information on the number of segments (with the maximum to be segmented by each character). Since this term is a common constant for all possible segmentations and the minimization of formula (1) is not affected by this term, we will ignore it.

The model defined by (1) is additive for X_i, so the following formula can be applied to search for the language L_i given a segment X_i:

    \hat{L}_i = \arg\min_{L_i \in \mathcal{L}} dl_{L_i}(X_i)    (2)

under the constraint that L_i ≠ L_{i-1} for i ∈ {1, ..., |B|}. The function dl can be further decomposed as follows to give the description length in an information-theoretic manner:

    dl_{L_i}(X_i) = -\log_2 P_{L_i}(X_i) + \log_2 |X| + \log_2 |\mathcal{L}| + \gamma    (3)

Here, the first term corresponds to the code length of the text chunk X_i given a language model for L_i, which in fact corresponds to the cross-entropy of X_i for L_i multiplied by |X_i|. The remaining terms give the code lengths of the parameters used to describe the length of the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language L_i. This fourth term will differ according to the language model type; moreover, its value can be further minimized through formula (2). Nevertheless, since we use a uniform amount of training data for every language, and since varying γ would prevent us from improving the efficiency of dynamic programming, as explained in §4, in this article we set γ to a constant obtained empirically.

Under this formulation, therefore, when detecting the language of a segment as in formula (2), the terms of formula (3) other than the first term will be constant: what counts is only the first term, similarly to much of the previous work explained in the following section. We thus perform language detection itself by minimizing the cross-entropy rather than the MDL. For segmentation, however, the constant terms function as overhead and also serve to prohibit excessive decomposition.

Next, after briefly introducing methods to calculate the first term of formula (3), we explain the solution to optimize the combinatorial problem of formula (1).

3 Calculation of Cross-Entropy

The first term of (3), -log_2 P_{L_i}(X_i), is the cross-entropy of X_i for L_i multiplied by |X_i|. Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types based on different methods of universal coding and the language model. For example, (Benedetto et al., 2002) and (Cilibrasi and Vitányi, 2005) used the universal coding approach, whereas (Teahan and Harper, 2001) and (Sibun and Reynar, 1996) were based on language modeling using PPM and Kullback-Leibler divergence, respectively.

In this section, we briefly introduce two methods previously studied by (Juola, 1997) and (Teahan, 2000) as representative of the two types, and we further explain a modification that we integrate into the final optimization problem. We tested several other coding methods, but they did not perform as well as these two methods.

3.1 Mean of Matching Statistics

(Farach et al., 1994) proposed a method to estimate the entropy through a simplified version of the LZ algorithm (Ziv and Lempel, 1977), as follows. Given a text X = x_1 x_2 ... x_i x_{i+1} ..., Len_i is defined as the longest match length for the two substrings x_1 x_2 ... x_i and x_{i+1} x_{i+2} .... In this article, we define the longest match for two strings A and B as the shortest prefix of string B that is not a substring of A. Letting the average of Len_i be E[Len], Farach proved that | E[Len] / log_2 i − 1 / H(X) | probabilistically converges to zero as i → ∞, where H(X) indicates the entropy of X. Then, H(X) is estimated as

    \hat{H}(X) = \frac{\log_2 i}{E[\mathrm{Len}]}.

(Juola, 1997) applied this method to estimate the cross-entropy of two given texts. For two strings Y = y_1 y_2 ... y_{|Y|} and X = x_1 x_2 ... x_{|X|}, let Len_i(Y) be the match length starting from x_i of X for Y.2 Based on this formulation, the cross-entropy is approximately estimated as

    \hat{J}_Y(X) = \frac{\log_2 |Y|}{E[\mathrm{Len}_i(Y)]}.

2 This is called a matching statistics value, which explains the subsection title.
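As a hedged illustration of this matching-statistics estimate, the sketch below computes Len_i(Y) by naive substring search and derives the estimate \hat{J}_Y(X); it also applies the cross-entropy-only language detection discussed for formula (2). The function names are illustrative, and the naive search is used only for clarity (a suffix tree over Y would make it efficient).

```python
import math
from typing import Dict, List

def matching_statistics(x: str, y: str) -> List[int]:
    """Len_i(Y) for every position i of X: the length of the shortest prefix
    of x_i x_{i+1} ... that is not a substring of Y."""
    lens = []
    for i in range(len(x)):
        length = 1
        while i + length <= len(x) and x[i:i + length] in y:
            length += 1
        lens.append(length)
    return lens

def cross_entropy_mms(x: str, y: str) -> float:
    """Juola-style estimate J_Y(X) = log2|Y| / E[Len_i(Y)], in bits per character."""
    lens = matching_statistics(x, y)
    return math.log2(len(y)) / (sum(lens) / len(lens))

def identify_language(x: str, corpora: Dict[str, str]) -> str:
    """Language detection in the spirit of formula (2): only the first term of
    formula (3) varies across languages, so pick the training corpus that gives
    the smallest estimated cross-entropy."""
    return min(corpora, key=lambda lang: cross_entropy_mms(x, corpora[lang]))
```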

Since formula (1) of §2 is based on adding the description length, it is important that the whole value be additive to enable efficient optimization (as will be explained in §4). We thus modified Juola's method as follows to make the length additive:

    \hat{J}'_Y(X) = E\!\left[\frac{\log_2 |Y|}{\mathrm{Len}_i(Y)}\right].

Although there is no mathematical guarantee that \hat{J}_Y(X) or \hat{J}'_Y(X) actually converges to the cross-entropy, our empirical tests showed a good estimate for both cases.3 In this article, we use \hat{J}'_Y(X) as a function to obtain the cross-entropy and for multiplication by |X| in formula (3).

3 This modification means that the original \hat{J}_Y(X) is obtained through the harmonic mean, with Len obtained through the arithmetic mean, whereas \hat{J}'_Y(X) is obtained through the arithmetic mean with Len as the harmonic mean.

3.2 PPM

As a representative method for calculating the cross-entropy through statistical language modeling, we adopt prediction by partial matching (PPM), a language-based encoding method devised by (Cleary and Witten, 1984). It has the particular characteristic of using a variable n-gram length, unlike ordinary n-gram models.4 It models the probability of a text X with a learning corpus Y as follows:

    P_Y(X) = P_Y(x_1 \dots x_{|X|}) = \prod_{t=1}^{|X|} P_Y(x_t \mid x_{t-1} \dots x_{\max(1, t-n)}),

where n is a parameter of PPM, denoting the maximum length of the n-grams considered in the model.5 The probability P_Y(X) is estimated by escape probabilities favoring the longer sequences appearing in the learning corpus (Bell et al., 1990). The total code length of X is then estimated as -log_2 P_Y(X). Since this value is additive and gives the total code length of X for language Y, we adopt this value in our approach.

4 In the context of NLP, this is known as Witten-Bell smoothing.
5 In the experiments reported here, n is set to 5 throughout.

4 Segmentation by Dynamic Programming

By applying the above methods, we propose a solution to formula (1) through dynamic programming.

Considering the additive characteristic of the description length formulated previously as formula (1), we denote the minimized description length for a given text X simply as DP(X), which can be decomposed recursively as follows:6

    DP(X) = \min_{t \in \{0, \dots, |X|\},\, L \in \mathcal{L}} \left\{ DP(x_1 \dots x_t) + dl_L(x_{t+1} \dots x_{|X|}) \right\}    (4)

In other words, the computation of DP(X) is decomposed into obtaining the addition of two terms by searching through t ∈ {0, ..., |X|} and L ∈ ℒ. The first term gives the MDL for the first t characters of text X, while the second term, dl_L(x_{t+1} ... x_{|X|}), gives the description length of the remaining characters under the language model for L.

6 This formula can be used directly to generate a set L in which all adjacent elements differ. The formula can also be used to generate segments for which some adjacent languages coincide and then further to generate L through post-processing by concatenating segments of the same language.

We can straightforwardly implement this recursive computation through dynamic programming, by managing a table of size |X| × |ℒ|. To fill a cell of this table, formula (4) suggests referring to t × |ℒ| cells and calculating the description length of the rest of the text for O(|X| − t) cells for each language. Since t ranges up to |X|, the brute-force computational complexity is O(|X|^3 × |ℒ|^2).

The complexity can be greatly reduced, however, when the function dl is additive. First, the description length can be calculated from the previous result, decreasing O(|X| − t) to O(1) (to obtain the code length of an additional character). Second, the referred number of cells t × |ℒ| is in fact U × |ℒ|, with U ≪ |X|: for MMS, U can be proven to be O(log |Y|), where |Y| is the maximum length among the learning corpora; and for PPM, U corresponds to the maximum length of an n-gram. Third, this factor U × |ℒ| can be further decreased to U × 2, since it suffices to possess the results for the two7 best languages in computing the first term of (4). Consequently, the complexity decreases to O(U × |X| × |ℒ|).

7 This number means the two best scores for different languages, which is required to obtain L directly: in addition to the best score, if the language of the best coincides with L in formula (4), then the second best is also needed. If segments are subjected to post-processing, this value can be one.
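Before turning to the experiments, here is a minimal sketch of a character-level code-length model with Witten-Bell-style escape probabilities. It stands in for the PPM model of §3.2 (the paper uses PPM with n = 5); the class name and the simplifications (fixed-order backoff, no exclusions) are assumptions of this sketch, not the authors' implementation.

```python
import math
from collections import defaultdict

class SimpleEscapeModel:
    """Simplified stand-in for PPM: character prediction with Witten-Bell-style
    escape probabilities and backoff to shorter contexts (no exclusions)."""

    def __init__(self, corpus: str, order: int = 5):
        self.order = order
        self.alphabet = set(corpus)
        self.counts = defaultdict(lambda: defaultdict(int))
        for t, ch in enumerate(corpus):
            for k in range(min(self.order - 1, t) + 1):
                self.counts[corpus[t - k:t]][ch] += 1

    def prob(self, context: str, char: str) -> float:
        """Back off from the longest context, paying an escape probability
        distinct/(total+distinct) at every level where char was unseen."""
        escape = 1.0
        for k in range(min(self.order - 1, len(context)), -1, -1):
            seen = self.counts.get(context[len(context) - k:])
            if seen:
                total, distinct = sum(seen.values()), len(seen)
                if char in seen:
                    return escape * seen[char] / (total + distinct)
                escape *= distinct / (total + distinct)
        return escape / (len(self.alphabet) + 1)   # order -1: uniform fallback

    def code_length(self, text: str) -> float:
        """-log2 P_Y(text) in bits; additive over characters, as required."""
        return sum(-math.log2(self.prob(text[max(0, t - self.order + 1):t], ch))
                   for t, ch in enumerate(text))
```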

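Under the same caveats, the next sketch implements the recursion of formula (4) in its brute-force form, using any model object with a code_length method (for example the SimpleEscapeModel above). The per-segment cost follows formula (3), with an arbitrarily chosen default for γ; adjacent segments that come out with the same language are merged afterwards, the post-processing route mentioned in footnote 6. The incremental update and two-best-language bookkeeping that give the O(U × |X| × |ℒ|) bound are omitted here.

```python
import math
from typing import Dict, List, Tuple

def segment_by_language(x: str, models: Dict[str, "SimpleEscapeModel"],
                        gamma: float = 8.0) -> List[Tuple[str, str]]:
    """Brute-force version of the recursion DP(X) of formula (4).

    best[t] is the minimum description length of the first t characters; each
    candidate segment x[s:t] in language L is charged its code length plus the
    log2|X| + log2|L| + gamma overhead of formula (3)."""
    if not x:
        return []
    n = len(x)
    overhead = math.log2(n) + math.log2(len(models)) + gamma
    best = [0.0] + [math.inf] * n
    back: List[Tuple[int, str]] = [(0, "")] * (n + 1)
    for t in range(1, n + 1):
        for s in range(t):
            for lang, model in models.items():
                cost = best[s] + model.code_length(x[s:t]) + overhead
                if cost < best[t]:
                    best[t], back[t] = cost, (s, lang)
    # Backtrack, then merge adjacent segments sharing a language (footnote 6).
    segments: List[Tuple[str, str]] = []
    t = n
    while t > 0:
        s, lang = back[t]
        segments.append((x[s:t], lang))
        t = s
    segments.reverse()
    merged: List[Tuple[str, str]] = []
    for seg, lang in segments:
        if merged and merged[-1][1] == lang:
            merged[-1] = (merged[-1][0] + seg, lang)
        else:
            merged.append((seg, lang))
    return merged
```

With models = {lang: SimpleEscapeModel(text) for lang, text in training_texts.items()}, segment_by_language(x, models) returns the segments of x together with their identified languages.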
5 Experimental Setting

5.1 Monolingual Texts (Training / Test Data)

In this work, monolingual texts were used both for training the cross-entropy computation and as test data for cross-validation: the training data does not contain any test data at all. Monolingual texts were also used to build multilingual texts, as explained in the following subsection.

Texts were collected from the World Wide Web and consisted of two sets. The first data set consisted of texts from the Universal Declaration of Human Rights (UDHR).8 We consider UDHR the most suitable text source for our purpose, since the content of every monolingual text in the declaration is unique. Moreover, each text has the tendency to maximally use its own language and avoid vocabulary from other languages. Therefore, UDHR-derived results can be considered to provide an empirical upper bound on our formulation. The set ℒ consists of 277 languages, and the texts consist of around 10,000 characters on average.

8 http://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx

The second data set was Wikipedia data from Wikipedia Downloads,9 denoted as "Wiki" in the following discussion. We automatically assembled the data through the following steps. First, tags in the Wikipedia texts were removed. Second, short lines were removed since they typically are not sentences. Third, the amount of data was set to 10,000 characters for every language, in correspondence with the size of the UDHR texts. Note that there is a limit to the complete cleansing of data. After these steps, the set contained 222 languages with sufficient data for the experiments.

9 http://download.wikimedia.org/

Many languages adopt writing systems other than the Latin alphabet. The numbers of languages for various representative writing systems are listed in Table 1 for both UDHR and Wiki, while the Appendix at the end of the article lists the actual languages. Note that in this article, a character means a Unicode character throughout, which differs from a character rendered in block form for some writing systems.

Table 1: Number of languages for each writing system

    character kinds    UDHR    Wiki
    Latin               260     158
    Cyrillic             12      20
    Devanagari            0       8
    Arabic                1       6
    Other                 4      30

To evaluate language identification for monolingual texts, as will be reported in §6.1, we conducted five-times cross-validation separately for both data sets. We present the results in terms of the average accuracy A_L, the ratio of the number of texts with a correctly identified language to |ℒ|.

5.2 Multilingual Texts (Test Data)

Multilingual texts were needed only to test the performance of the proposed method. In other words, we trained the model only through monolingual data, as mentioned above. This differs from the most similar previous study (Teahan, 2000), which required multilingual learning data.

The multilingual texts were generated artificially, since multilingual texts taken directly from the web have other issues besides segmentation. First, proper nouns in multilingual texts complicate the final judgment of language and segment borders. In practical application, therefore, texts for segmentation must be preprocessed by named entity recognition, which is beyond the scope of this work. Second, the sizes of text portions in multilingual web texts differ greatly, which would make it difficult to evaluate the overall performance of the proposed method in a uniform manner.

Consequently, we artificially generated two kinds of test sets from a monolingual corpus. The first is a set of multilingual texts, denoted as Test1, such that each text is the conjunction of two portions in different languages. Here, the experiment is focused on segment border detection, which must segment the text into two parts, provided that there are two languages. Test1 includes test data for all language pairs, obtained by five-times cross-validation, giving 25 × |ℒ| × (|ℒ| − 1) multilingual texts. Each portion of text for a single language consists of 100 characters taken from a random location within the test data.

The second kind of test set is a set of multilingual texts, denoted as Test2, each consisting of k segments in different languages. For the experiment, k is not given to the procedure, and the task is to obtain k as well as B and L through recursion.

Test2 was generated through the following steps:

1. Choose k from among 1, ..., 5.
2. Choose k languages randomly from ℒ, where some of the k languages can overlap.
3. Perform five-times cross-validation on the texts of all languages. Choose a text length randomly from {40, 80, 120, 160}, and randomly select this many characters from the test data.
4. Shuffle the k languages and concatenate the text portions in the resultant order.

For this Test2 data set, every plot in the graphs shown in §6.2 was obtained by randomly averaging 1,000 tests.

By default, the possibility of segmentation is considered at every character offset in a text, which provides a lower bound for the proposed method. Although language change within the middle of a word does occur in real multilingual documents, it might seem more realistic to consider language change at word borders. Therefore, in addition to choosing B from {1, ..., |X|}, we also tested our approach under the constraint of choosing borders from bordering locations, which are the locations of spaces. In this case, B is chosen from this subset of {1, ..., |X|}, and, in step 3 above, text portions are generated so as to end at these bordering locations.

Given a multilingual text, we evaluate the outputs B and L through the following scores:

P_B/R_B: Precision/recall of the borders detected (i.e., the correct borders detected, divided by the detected/correct borders).

P_L/R_L: Precision/recall of the languages detected (i.e., the correct languages detected, divided by the detected/correct languages).

Ps and Rs are obtained by changing the parameter γ given in formula (3), which ranges over 1, 2, 4, ..., 256 bits. In addition, we verify the speed, i.e., the average time required for processing a text.

Although there are web pages consisting of texts in more than 2 languages, we rarely see a web page containing 5 languages at the same time. Therefore, Test1 reflects the most important case of 2 languages only, whereas Test2 reflects the case of multiple languages to demonstrate the general potential of the proposed approach.

The experiment reported here might seem like a case of over-specification, since all languages are considered equally likely to appear. Since our motivation has been to eliminate a portion in a major language from the text, there could be a formulation specific to the problem. We consider it trivial, however, to specify such a narrow problem within our formulation, and it will lead to higher performance than that of the reported results, in any case. Therefore, we believe that our general formulation and experiment show the broadest potential of our approach to solving this problem.

Figure 1: Accuracy of language identification for monolingual texts

6 Experimental Results

6.1 Language Identification Performance

We first show the performance of language identification using formula (2), which is used as the component of the text segmentation by language. Figure 1 shows the results for language identification of monolingual texts with the UDHR and Wiki test data. The horizontal axis indicates the size of the input text in characters, the vertical axis indicates the accuracy A_L, and the graph contains four plots10 for MMS and PPM for each set of data.

10 The results for PPM and MMS for UDHR are almost the same, so the graph appears to contain only three plots.

Overall, all plots rise quickly despite the severe conditions of a large number of languages (over 200), a small amount of input data, and a small amount of learning data. The results show that language identification through cross-entropy is promising.

Two further global tendencies can be seen. First, the performance was higher for UDHR than for Wiki. This is natural, since the content of Wikipedia is far broader than that of UDHR. In the case of UDHR, when the test data had a length of 40 characters, the accuracy was over 95% for both the PPM and the MMS methods.

Second, PPM achieved slightly better performance than did MMS. When the test data amounted to 100 characters, PPM achieved language identification with accuracy of about 91.4%. For MMS, the identification accuracy was a little lower, at about 90.9% even with 100 characters of test data.

The amount of learning data seemed sufficient for both cases, with around 8,000 characters. In fact, we conducted tests with larger amounts of learning data and found a faster rise with respect to the input length, but the maximum possible accuracy did not show any significant increase.

Errors resulted from either noise or mistakes due to the language family. The Wikipedia test data was noisy, as mentioned in §5.1. As for language family errors, the test data includes many similar languages that are difficult even for humans to correctly judge. For example, Indonesian and Malay, Picard and Walloon, and Norwegian Bokmål and … are all pairs representative of such confusion.

Overall, the language identification performance seems sufficient to justify its application to our main problem of text segmentation by language.

6.2 Text Segmentation by Language

First, we report the results obtained using the Test1 data set. Figure 2 shows the cumulative distribution obtained for segment border detection. The horizontal axis indicates the relative location by character with respect to the correct border at zero, and the vertical axis indicates the cumulative proportion of texts whose border is detected at that relative point. The figure shows four plots for all combinations of the two data sets and the two methods. Note that segment borders are judged by characters and not by bordering locations, as explained in §5.2.

Figure 2: Cumulative distribution of segment borders

Since the plots rise sharply at the middle of the horizontal axis, the borders were detected at or very near the correct place in many cases.

Next, we examine the results for Test2. Figure 3 shows the two precision/recall graphs for language identification (upper graph) and segment border detection (lower graph), where borders were taken from any character offset. In each graph, the horizontal axis indicates precision and the vertical axis indicates recall. The numbers appearing in each figure are the maximum F-score values for each method and data set combination. As can be seen from these numbers, the language identification performance was high. Since the text portion size was chosen from among the values 40, 80, 120, or 160, the performance is comprehensible from the results shown in §6.1. Note also that PPM performed slightly better than did MMS.

Figure 3: P_L/R_L (language, upper graph) and P_B/R_B (border, lower graph) results, where borders were taken from any character offset

For segment border performance (lower graph), however, the results were limited. The main reason for this is that both MMS and PPM tend to detect a border one character earlier than the correct location, as was seen in Figure 2.

At the same time, much of the test data contains unrealistic borders within a word, since the data was generated by concatenating two text portions with random borders. Therefore, we repeated the experiment with Test2 under the constraint that a segment border could occur only at a bordering location, as explained in §5.2. The results with this constraint were significantly better, as shown in Figure 4. The best result was for UDHR with PPM at 0.94.11 We could also observe how PPM performed better at detecting borders in this case. In actual application, it would be possible to improve performance by relaxing the procedural conditions, such as by decreasing the number of language possibilities.

11 The language identification accuracy slightly increased as well, by 0.002.

Figure 4: P_B/R_B, where borders were limited to spaces

In this experiment for Test2, k ranged from 1 to 5, but the performance was not affected by the size of k. When the F-score was examined with respect to k, it remained almost equal for all k. This shows how each recursion of formula (4) works almost independently, having segmentation and language identification functions that are both robust.

Lastly, we examine the speed of our method. Since |ℒ| is constant throughout the comparison, the time should increase linearly with respect to the input length |X|, with increasing k having no effect. Figure 5 shows the speed for Test2 processing, with the horizontal axis indicating the input length and the vertical axis indicating the processing time. Here, all character offsets were taken into consideration, and the processing was done on a machine with a Xeon 5650 2.66-GHz CPU. The results confirm that the complexity increased linearly with respect to the input length. When the text size became as large as several thousand characters, the processing time became as long as a second. This time could be significantly decreased by introducing constraints on the bordering locations and languages.

Figure 5: Average processing speed for a text

7 Conclusion

This article has presented a method for segmenting a multilingual text into segments, each in a different language. This task could serve for preprocessing of multilingual texts before applying language-specific analysis to each text. Moreover, the proposed method could be used to generate corpora in a variety of languages, since many texts in minor languages tend to contain chunks in a major language.

The segmentation task was modeled as an optimization problem of finding the best segment and language sequences to minimize the description length of a given text. An actual procedure for obtaining an optimal result through dynamic programming was proposed. Furthermore, we showed a way to decrease the computational complexity substantially, with each of our two methods having linear complexity in the input length.

Various empirical results were shown for language identification and segmentation. Overall, when segmenting a text with up to five random portions of different languages, where each portion consisted of 40 to 120 characters, the best F-scores for language identification and segmentation were 0.98 and 0.94, respectively.

For our future work, details of the methods must be worked out. In general, the proposed approach could be further applied to the actual needs of preprocessing and to generating corpora of minor languages.

References

Beatrice Alex, Amit Dubey, and Frank Keller. 2007. Using foreign inclusion detection to improve parsing performance. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 151–160.

Beatrice Alex. 2005. An unsupervised system for identifying English inclusions in German text. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Student Research Workshop, pages 133–138.

T. C. Bell, J. G. Cleary, and I. H. Witten. 1990. Text Compression. Prentice Hall.

Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto. 2002. Language trees and zipping. Physical Review Letters, 88(4).

Rudi Cilibrasi and Paul Vitányi. 2005. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545.

John G. Cleary and Ian H. Witten. 1984. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32:396–402.

Martin Farach, Michiel Noordewier, Serap Savari, Larry Shepp, Abraham J. Wyner, and Jacob Ziv. 1994. On the entropy of DNA: Algorithms and measurements based on memory and rapid convergence. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 48–57.

Gregory Grefenstette. 1995. Comparing two language identification schemes. In Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data, pages 263–268.

Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Patrick Juola. 1997. What can we do with small corpora? Document categorization via cross-entropy. In Proceedings of an Interdisciplinary Workshop on Similarity and Categorization.

Gen-itiro Kikui. 1996. Identifying the coding system and language of on-line documents on the Internet. In Proceedings of the 16th International Conference on Computational Linguistics, pages 652–657.

Canasai Kruengkrai, Prapass Srichaivattana, Virach Sornlertlamvanich, and Hitoshi Isahara. 2005. Language identification based on string kernels. In Proceedings of the 5th International Symposium on Communications and Information Technologies, pages 926–929.

Penelope Sibun and Jeffrey C. Reynar. 1996. Language identification: Examining the issues. In Proceedings of the 5th Symposium on Document Analysis and Information Retrieval, pages 125–135.

William J. Teahan and David J. Harper. 2001. Using compression-based language models for text categorization. In Proceedings of the Workshop on Language Modeling and Information Retrieval, pages 83–88.

William John Teahan. 2000. Text classification and segmentation using minimum cross-entropy. In RIAO, pages 943–961.

Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343.

Appendix

This Appendix lists all the languages contained in our data sets, as summarized in Table 1.

For UDHR

Latin
Achinese, Achuar-Shiwiar, Adangme, Afrikaans, Aguaruna, Aja, Akuapem Akan, Akurio, Amahuaca, Amarakaeri, Ambo-Pasco Quechua, Arabela, Arequipa-La Unión Quechua, Arpitan, Asante Akan, Asháninka, Ashéninka Pajonal, Asturian, Occitan, Quechua, Aymara, Baatonum, Balinese, Bambara, Baoulé, Basque, Bemba, Beti, Bikol, Bini, Bislama, Bokmål Norwegian, Bora, Bosnian, Breton, Buginese, Calderón Highland Quichua, Candoshi-Shapra, Caquinte, Cashibo-Cacataibo, Cashinahua, Catalan, Cebuano, Central Kanuri, Central Mazahua, Central Nahuatl, Chamorro, Chamula Tzotzil, Chayahuita, Chickasaw, Chiga, Chokwe, Chuanqiandian Cluster Miao, Chuukese, Corsican, Quechua, Czech, Dagbani, Danish, Dendi, Ditammari, Dutch, Eastern Maninkakan, Emiliano-Romagnolo, English, Esperanto, Estonian, Ewe, Falam Chin, Fanti, Faroese, Fijian, Filipino, Finnish, Fon, French, Friulian, Ga, Gagauz, Galician, Ganda, Garifuna, Gen, German, Gheg Albanian, Gonja, Guarani, Güilá Zapotec, Haitian Creole, Haitian Creole (popular), Haka Chin, Hani, Hausa, Hawaiian, Hiligaynon, Huamalíes-Dos de Mayo Huánuco Quechua, Huautla Mazatec, Huaylas Ancash Quechua, Hungarian, Ibibio, Icelandic, Ido, Igbo, Iloko, Indonesian, Interlingua, Irish, Italian, Javanese, Jola-Fonyi, K'iche', Kabiyè, Kabuverdianu, Kalaallisut, Kaonde, Kaqchikel, Kasem, Kekchí, Kimbundu, Kinyarwanda, Kituba, Konzo, Kpelle, Krio, Kurdish, Lamnso', Languedocien Occitan, Latin, Latvian, Lingala, Lithuanian, Lozi, Luba-Lulua, Lunda, Luvale, Madurese, Makhuwa, Makonde, Malagasy, Maltese, Mam, Maori, Mapudungun, Margos-Yarowilca-Lauricocha Quechua, Marshallese, Mba, Mende, Metlatónoc Mixtec, Mezquital Otomi, Mi'kmaq, Miahuatlán Zapotec, Minangkabau, Mossi, Mozarabic, Murui Huitoto, Mískito, Ndonga, Nigerian Pidgin, Nomatsiguenga, North Junín Quechua, Northeastern Dinka, Northern Conchucos Ancash Quechua, Northern Qiandong Miao, Northern Sami, Northern Kurdish, Nyamwezi, Nyanja, Nyemba, Nynorsk Norwegian, Nzima, Ojitlán Chinantec, Oromo, Palauan, Pampanga, Papantla Totonac, Pedi, Picard, Pichis Ashéninka, Pijin, Pipil, Pohnpeian, Polish, Portuguese, Pulaar, Purepecha, Páez, Quechua, Rarotongan, Romanian, Romansh, Romany, Rundi, Salinan, Samoan, San Luís Potosí Huastec, Sango, Sardinian, Scots, Scottish Gaelic, Serbian,

Serer, Seselwa Creole French, Sharanahua, Shipibo-Conibo, Shona, Slovak, Somali, Soninke, South Ndebele, Southern Dagaare, Southern Qiandong Miao, Southern Sotho, Spanish, Standard Malay, Sukuma, Sundanese, Susu, Swahili, Swati, Swedish, Sãotomense, Tahitian, Tedim Chin, Tetum, Tidikelt Tamazight, Timne, Tiv, Toba, Tojolabal, Tok Pisin, Tonga (Tonga Islands), Tonga (Zambia), Tsonga, Tswana, Turkish, Tzeltal, Umbundu, Upper Sorbian, Urarina, Uzbek, Veracruz Huastec, Vili, Vlax Romani, Walloon, Waray, Wayuu, Welsh, Western Frisian, Wolof, Xhosa, Yagua, Yanesha', Yao, Yapese, Yoruba, Yucateco, Zhuang, Zulu

Cyrillic
Abkhazian, Belarusian, Bosnian, Bulgarian, Kazakh, Macedonian, Ossetian, Russian, Serbian, Tuvinian, Ukrainian, Yakut

Arabic
Standard Arabic

Other
Japanese, Korean, …, Modern Greek

For Wiki

Latin
Afrikaans, Albanian, Aragonese, Aromanian, Arpitan, Asturian, Aymara, Azerbaijani, Bambara, Banyumasan, Basque, Bavarian, Bislama, Bosnian, Breton, Català, Cebuano, Central Bikol, Chavacano, Cornish, Corsican, Crimean Tatar, Croatian, Czech, Danish, Dimli, Dutch, Dutch Low Saxon, Emiliano-Romagnolo, English, Esperanto, Estonian, Ewe, Extremaduran, Faroese, Fiji Hindi, Finnish, French, Friulian, Galician, German, Gilaki, Gothic, Guarani, Hai//om, Haitian, Hakka Chinese, Hawaiian, Hungarian, Icelandic, Ido, Igbo, Iloko, Indonesian, Interlingua, Interlingue, Irish, Italian, Javanese, Kabyle, Kalaallisut, Kara-Kalpak, Kashmiri, Kashubian, Kongo, Korean, Kurdish, Ladino, Latin, Latvian, Ligurian, Limburgan, Lingala, Lithuanian, Lojban, Lombard, Lower Sorbian, Luxembourgish, Malagasy, Malay, Maltese, Manx, Maori, Mazanderani, Min Dong Chinese, Min Nan Chinese, Nahuatl, Narom, Navajo, Neapolitan, Northern Sami, Norwegian, Norwegian Nynorsk, Novial, Occitan, Pampanga, Pangasinan, Panjabi, Pennsylvania German, Piemontese, Pitcairn-Norfolk, Polish, Portuguese, Pushto, Quechua, Romanian, Romansh, Samoan, Samogitian Lithuanian, Sardinian, Saterfriesisch, Scots, Scottish Gaelic, Serbo-Croatian, Sicilian, Silesian, Slovak, Slovenian, Somali, Spanish, Sranan Tongo, Sundanese, Swahili, Swati, Swedish, Tagalog, Tahitian, Tarantino Sicilian, Tatar, Tetum, Tok Pisin, Tonga (Tonga Islands), Tosk Albanian, Tsonga, Tswana, Turkish, Turkmen, Uighur, Upper Sorbian, Uzbek, Venda, Venetian, Vietnamese, Vlaams, Vlax Romani, Volapük, Võro, Walloon, Waray, Welsh, Western Frisian, Wolof, Yoruba, Zeeuws, Zulu

Cyrillic
Abkhazian, Bashkir, Belarusian, Bulgarian, Chuvash, Erzya, Kazakh, Kirghiz, Macedonian, Moksha, Moldovan, Mongolian, Old Belarusian, Ossetian, Russian, Serbian, Tajik, Udmurt, Ukrainian, Yakut

Arabic
Arabic, Egyptian Arabic, Gilaki, Mazanderani, Persian, Pushto, Uighur, Urdu

Devanagari
Bihari, Hindi, Marathi, Nepali, Newari, Sanskrit

Other
Amharic, Armenian, Assamese, Bengali, Bishnupriya, Burmese, Central Khmer, Chinese, Classical Chinese, Dhivehi, Georgian, Gothic, Gujarati, Hebrew, Japanese, Kannada, Lao, Malayalam, Modern Greek, Official Aramaic, Panjabi, Sinhala, Tamil, Telugu, Thai, Tibetan
