Language Identification

Evan Martin

June 2002

Introduction

All data on a computer is at some level a sequence of bytes, which are just a raw collection of numbers. To communicate text, people agree on a common encoding¹, which maps these numbers to characters.

This is a fine solution, except that with different possible encodings, the same document (the same sequence of bytes) can represent different strings of text. This is not an imaginary problem. For example, internationalization-naive websites, which allow arbitrary input without specifying an encoding, can end up with stored user input in different character sets.
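As a concrete example, here is a small Ruby snippet (using the encoding support of modern Ruby, which postdates this project) showing the same byte decoding to different characters under two encodings discussed below:

# Byte 0xE4 is "a with umlaut" in ISO-8859-1, but the Cyrillic
# letter "de" in Windows-1251.
byte = [0xE4].pack("C")
puts byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")   # prints: ä
puts byte.dup.force_encoding("Windows-1251").encode("UTF-8") # prints: д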

The goal of my project was to determine the character set of a document by examining just the bytes in the document. To achieve this, I used probabilistic analysis, based on bigram models of what text in a “known” encoding “looks like”. I originally intended to just detect encodings, but it appears that this method may be effective for detecting different languages within an encoding.

I present here a basic overview of the differences between character sets, with a focus on the difference between Western European languages and the Russian subset of Cyrillic, and then show how my program distinguishes between these two and can also differentiate between English and German.

¹ I make a number of simplifications in this paper. Here, I conflate the concepts of “encoding” and “character set” and use the terms interchangeably, but the differences can be ignored for my purposes.

Background

The nice thing about standards is that there are so many to choose from.

—Unknown

ASCII

The most commonly-cited encoding standard is called ASCII², which defines characters for the values 0 through 127. This is an American standard, and only represents the characters used in English³. It is often used as a common ground between different encodings: computer languages and protocols are often defined using exclusively ASCII, leaving encoding details to be handled at a higher level.

English and German

ASCII leaves the values 128 through 255 undefined, and this is where the ambiguity sets in. Different languages used this range for their own characters, often using the same numbers to represent different characters. Even within one language, the encoding sometimes varies between operating systems. The International Standards Organization eventually produced a collection of standards. Because most Western European languages use almost the same characters as English, with a few extra characters like á, the ISO created the ISO-8859-1 standard, also known as Latin-1 (see Figure 1).

ISO-8859-1 is the standard that is more or less in use today for most Western European text.

² American Standard Code for Information Interchange
³ And even then, not all characters, depending on whether you spell resume as resumé or even résumé. Though that isn’t a “real” English word, it is a word that we use.

[Figure 1: ISO-8859-1]

[Figure 2: Windows-1252]

Microsoft extended the standard with some extra characters, such as a left double quote (“) and right double quote (”), because ASCII only provides a generic double quote.

This “standard” is known as Windows-1252 or CP-1252. Because of Windows’ dominance of the computer market, this encoding is pretty commonly used across the internet.
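As an illustration (again relying on modern Ruby’s encoding names), the curly-quote bytes are printable only under the Microsoft variant:

# 0x93 and 0x94 are unprintable control codes in ISO-8859-1, but
# left and right double quotation marks in Windows-1252.
quotes = [0x93, 0x94].pack("C*")
puts quotes.force_encoding("Windows-1252").encode("UTF-8") # prints: “”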

ISO-8859-1 (and Windows-1252) don’t cover other European languages. Roman Czyborra’s ISO 8859 Alphabet Soup⁴ has a good discussion of the variety of standards created for languages such as Czech. In particular, languages based on Cyrillic, such as Russian, are not represented at all.

⁴ http://czyborra.com/charsets/iso8859.html

[Figure 3: Windows-1251]

Russian

There are a number of Cyrillic standards (see Cyrillic Charset Soup⁵), and even within Russian there are different encodings, such as KOI8-R (the Russian encoding traditionally used on Unix). But the one in common use for Russian is another Microsoft standard, called Windows-1251 or CP-1251 (not to be confused with Windows-1252, above). These standards leave the original ASCII characters alone but completely redefine the upper 128 characters, so they’re commonly used in places like the web, where English characters are useful (for example, URLs use English characters).

Other Languages

Many other languages use many more than 256 characters (for example, kanji comprises many thousands of characters). Because of this, their encodings are not used interchangeably with ASCII in the same way Windows-1251 can be, and are not considered here.

⁵ http://czyborra.com/charsets/cyrillic.html

Identifying Languages

Models

To distinguish between different character sets, an unknown document can be compared against models derived from documents in known character sets. When considering character sets, a unigram model would probably be sufficient: Russian text is likely primarily Cyrillic and will use more characters above 128, while English will use only ASCII characters, which are below 128. However, within a character set, it seems it should still be possible to tell German from English from French. A word with “sch” in it is more likely German than French, while a word with “oux” is the reverse. Because of this, I used bigram models to represent languages.

Using bigrams has an added advantage. Because I’m only concerned with bytes, the bigram data can be completely represented in a 256 by 256 entry table. This data can also be visualized as a two-dimensional image, where the intensity at (x, y) is used to represent the probability of character x transitioning to character y, which allows a qualitative analysis of the data.

Generating Models

The program maketable.rb takes a set of input documents, counts the transitions within them, normalizes the bigrams, then generates an output “table” file which contains the model.
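The core of that process can be sketched in a few lines of Ruby. This is a simplified reconstruction, not the actual maketable.rb: the serialization format (Marshal), the probability floor, and the script name maketable_sketch.rb are my own assumptions.

# maketable_sketch.rb: build a normalized byte-bigram model
# from the files given on the command line.
counts = Array.new(256) { Array.new(256, 0) }

ARGV.each do |path|
  bytes = File.binread(path).bytes
  # Count every adjacent pair of bytes (a, b).
  bytes.each_cons(2) { |a, b| counts[a][b] += 1 }
end

# Normalize each row into P(b | a), flooring at a small epsilon
# so unseen transitions never get probability zero.
EPSILON = 1e-6
table = counts.map do |row|
  total = row.sum
  row.map { |c| total.zero? ? EPSILON : [c.to_f / total, EPSILON].max }
end

File.open("output.table", "wb") { |f| Marshal.dump(table, f) }

Run as, e.g., ruby maketable_sketch.rb samples/ru/cp1251/*. The epsilon floor matters later: the identifier takes logarithms, and log 0 would be negative infinity.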

Here is a sample run of maketable.rb:

lulu:~/projects/langid% ./maketable.rb samples/ru/cp1251

Loading data: samples/ru/cp1251/astrofox samples/ru/cp1251/a48 samples/ru/cp1251/avva samples/ru/cp1251/avk samples/ru/cp1251/emma_loy samples/ru/cp1251/aztech samples/ru/cp1251/deetan samples/ru/cp1251/dr_momm samples/ru/cp1251/dvor samples/ru/cp1251/dwalin samples/ru/cp1251/makropod samples/ru/cp1251/muchacho samples/ru/cp1251/nach_berlin samples/ru/cp1251/parf samples/ru/cp1251/pendejo samples/ru/cp1251/priest_dimitriy samples/ru/cp1251/qsju samples/ru/cp1251/runa_ samples/ru/cp1251/sema samples/ru/cp1251/sgt samples/ru/cp1251/shaltai_boltai samples/ru/cp1251/sherebon samples/ru/cp1251/tyrex samples/ru/cp1251/urbansheep samples/ru/cp1251/yogiki samples/ru/cp1251/yolka
Normalizing...
Saving table to tables/ru-cp1251...

Then, the programs makeimage.rb and makeps.rb generate .png and .ps images, respectively, from a given table:

lulu:~/projects/langid% ./makeimage.rb tables/ru-cp1251
Loading tables/ru-cp1251...
Generating images/ru-cp1251.png...
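A rough equivalent of the image step, assuming the chunky_png gem (my choice for this sketch; the original scripts may have produced the images differently) and the Marshal table format from the sketch above:

require "chunky_png"

table = File.open(ARGV[0], "rb") { |f| Marshal.load(f) }
max = table.flatten.max

png = ChunkyPNG::Image.new(256, 256, ChunkyPNG::Color::WHITE)
256.times do |a|
  256.times do |b|
    # Row a, column b: darker means a -> b is more probable.
    level = 255 - (255.0 * table[a][b] / max).round
    png[b, a] = ChunkyPNG::Color.grayscale(level)
  end
end

png.save("bigrams.png")

Because most of the probability mass sits in a few cells, a logarithmic gray scale might reveal more structure; the linear scale here is simply the most direct choice.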

Images

Can we see the difference between the different languages? These datasets were trained from real-world content (text from web-based journals), and each was built from about 100 KB of text.

English

[Figure 4: English (Windows-1252)]

Figure 4 is the image generated from English data. A darker spot at row i, column j represents a higher probability that character i will be followed by character j. As you can see, most of the characters fall within the upper-left quadrant (which corresponds to characters under 128 followed by characters under 128), as would be expected of English. Within that quadrant, you can also make out a three-by-three grid of clusters, which correspond to the second (punctuation and numbers), third (capitals), and fourth (lower case) columns in Figure 4. Additionally, the intense section in the upper-left cluster corresponds to the digits; it would make sense that digits have a high probability of being followed by other digits, as they often occur together in text.

German

Figure 5 was generated from German text. It resembles the English model, but you can see there are more characters along the lower and right quadrants of the image, from characters such as ü and ä found in German. I had hoped that there would also be a noticeable change in the probabilities of capital letters (because German capitalizes nouns), but the German writers who wrote the text used in the training data often didn’t use capitals.

[Figure 5: German (Windows-1252)]

Russian

Finally, Figure 6 was generated from Russian text. Again, there are some Western characters, but you can easily see the Cyrillic in the lower-right quadrant of the image.

[Figure 6: Russian (Windows-1251)]

Identification

The final component of my project, identify.rb, compares a given document against the generated tables. To choose which model is the best match for the document, I calculate the log-likelihood of the model generating the document. For each pair of adjacent letters a, b over the document D:

\[ P(\text{model}) = \prod_{a,b \in D} P(a \to b) \]

\[ \log P(\text{model}) = \log \prod_{a,b \in D} P(a \to b) = \sum_{a,b \in D} \log P(a \to b) \]

And after counting bigrams in the document, that reduces to simply:

\[ \sum_{a,b} C(a,b) \log P(a \to b) \]

where C(a, b) is the number of times the bigram (a, b) occurs in the document.

The models can then be compared against each other by choosing the one with the highest log-likelihood.
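The scoring step is short in Ruby. The following sketch (again with my assumed Marshal table format and a hypothetical script name; identify.rb itself may differ) implements the counted-bigram sum above:

# identify_sketch.rb: score a document against model tables and
# print them from best to worst match.
doc_path = ARGV.shift
doc = File.binread(doc_path).bytes

# C(a, b): how often each bigram occurs in the document.
counts = Hash.new(0)
doc.each_cons(2) { |a, b| counts[[a, b]] += 1 }

scores = ARGV.map do |table_path|
  table = File.open(table_path, "rb") { |f| Marshal.load(f) }
  # Log-likelihood: sum of C(a, b) * log P(a -> b).
  score = counts.sum { |(a, b), c| c * Math.log(table[a][b]) }
  [table_path, score]
end

scores.sort_by { |_, score| -score }.each do |path, score|
  puts format("%s (%f)", path, score)
end

Run as, e.g., ruby identify_sketch.rb tests/krylov tables/*.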

Here is a demonstration of the identification process:

lulu:~/projects/langid% ./identify.rb tests/krylov
Loading tables: tables/de-cp1252 tables/ru-cp1251 tables/en-cp1252
Loading file to identify...
Comparing file against tables: tables/de-cp1252 tables/en-cp1252 tables/ru-cp1251
Results, in order of likelihood:
tables/ru-cp1251 (-6300.080235)
tables/en-cp1252 (-16661.186752)
tables/de-cp1252 (-16897.318567)

File tests/krylov best matches tables/ru-cp1251.

The data set krylov is correctly identified as Russian/Windows-1251.

Conclusions

This technique is surprisingly accurate, at least in my limited tests. For example, even if I create a tiny file with simply the contents “Ja, ich kann Deutsch gut sprechen,” it is correctly identified as German. That result is especially notable because the text doesn’t contain any special characters; if I were just identifying documents based on the characters they used, that text could be in any of a variety of Western languages (though the unigram probabilities are probably different).

In the future, I’d like to try other Western languages to see how similar they are (would the program confuse German with other Germanic languages, especially those which share the same spellings for a lot of words?), but for this project (again, the data came from online diary entries⁶) there wasn’t much data available in other languages.

The code itself was written in the programming language Ruby, and is available on my website⁷.

⁶ http://www.livejournal.com
⁷ http://neugierig.org/software/langid
