Language Identification

Evan Martin

June 2002

Introduction

All data on a computer is at some level a sequence of bytes, which are just a raw collection of numbers. To communicate text, people agree on a common encoding¹, which maps these numbers to characters.

This is a fine solution, except that with different possible encodings, the same document (the same sequence of bytes) can represent different strings of text. This is not an imaginary problem. For example, internationalization-naive websites, which allow arbitrary input without specifying an encoding, can end up with stored user input in different character sets.
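As a concrete example, here is a small Ruby snippet (using the encoding support of modern Ruby, which postdates this project) showing the same byte decoding to different characters under two encodings discussed below:

# Byte 0xE4 is "a with umlaut" in ISO-8859-1, but the Cyrillic
# letter "de" in Windows-1251.
byte = [0xE4].pack("C")
puts byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")   # prints: ä
puts byte.dup.force_encoding("Windows-1251").encode("UTF-8") # prints: д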

The goal of my project was to determine the character set of a document by examining just the bytes in the document. To achieve this, I used probabilistic analysis, based on bigram models of what text in a “known” encoding “looks like”. I originally intended to just detect encodings, but it appears that this method may be effective for detecting different languages within an encoding.

I present here a basic overview of the differences between character sets, with a focus on the difference between Western European languages and the Russian subset of Cyrillic, and then show how my program distinguishes between these two and can also differentiate between English and German.

¹ I make a number of simplifications in this paper. Here, I conflate the concepts of “encoding” and “character set” and use the terms interchangeably, but the differences can be ignored for my purposes.

Background

The nice thing about standards is that there are so many to choose from.

—Unknown

ASCII

The most commonly-cited encoding standard is called ASCII², which defines characters for the values 0 through 127. This is an American standard, and only represents the characters used in English³. It is often used as a common ground between different encodings: computer languages and protocols are often defined using exclusively ASCII, leaving encoding details to be handled at a higher level.

English and German

ASCII leaves the values 128 through 255 undefined, and this is where the ambiguity sets in. Different languages used this range for their own characters, often using the same numbers to represent different characters. Even within one language, the encoding sometimes varies between operating systems. The International Standards Organization eventually produced a collection of standards. Because most Western European languages use almost the same characters as English, with a few extra characters like á, the ISO created the ISO-8859-1 standard, also known as Latin-1 (see Figure 1).

ISO-8859-1 is the standard that is more or less in use today for most Western European text.

² American Standard Code for Information Interchange
³ And even then, not all characters, depending on whether you spell resume as resumé or even résumé. Though that isn’t a “real” English word, it is a word that we use.

[Figure 1: ISO-8859-1]

[Figure 2: Windows-1252]

Microsoft extended the standard with some extra characters, such as a left double quote (“) and right double quote (”), because ASCII only provides a generic double quote.

This “standard” is known as Windows-1252 or CP-1252. Because of Windows’ dominance of the computer market, this encoding is pretty commonly used across the internet.
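As an illustration (again relying on modern Ruby’s encoding names), the curly-quote bytes are printable only under the Microsoft variant:

# 0x93 and 0x94 are unprintable control codes in ISO-8859-1, but
# left and right double quotation marks in Windows-1252.
quotes = [0x93, 0x94].pack("C*")
puts quotes.force_encoding("Windows-1252").encode("UTF-8") # prints: “”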

ISO-8859-1 (and Windows-1252) don’t cover other European languages. Roman Czyborra’s ISO 8859 Alphabet Soup⁴ has a good discussion of the variety of standards created for languages such as Czech. In particular, languages based on Cyrillic, such as Russian, are not represented at all.

⁴ http://czyborra.com/charsets/iso8859.html

[Figure 3: Windows-1251]

Russian

There are a number of Cyrillic standards (see Cyrillic Charset Soup⁵), and even within Russian there are different encodings, such as KOI8-R (the Russian encoding traditionally used on Unix). But the one in common use for Russian is another Microsoft standard, called Windows-1251 or CP-1251 (not to be confused with Windows-1252, above). These standards leave the original ASCII characters alone but completely redefine the upper 128 characters, so they’re commonly used in places like the web, where English characters are useful (for example, URLs use English characters).

Other Languages

Many other languages use many more than 256 characters (for example, kanji comprises many thousands of characters). Because of this, their encodings are not used interchangeably with ASCII in the same way Windows-1251 can be, and are not considered here.

⁵ http://czyborra.com/charsets/cyrillic.html

Identifying Languages

Models

To distinguish between different character sets, an unknown document can be compared against models derived from documents in known character sets. When considering character sets, a unigram model would probably be sufficient: Russian text is likely primarily Cyrillic and will use more characters above 128, while English will use only ASCII characters, which are below 128. However, within a character set, it seems it should still be possible to tell German from English from French. A word with “sch” in it is more likely German than French, while a word with “oux” is the reverse. Because of this, I used bigram models to represent languages.

Using bigrams has an added advantage. Because I’m only concerned with bytes, the bigram data can be completely represented in a 256 by 256 entry table. This data can also be visualized as a two-dimensional image, where the intensity at (x, y) is used to represent the probability of character x transitioning to character y, which allows a qualitative analysis of the data.

Generating Models

The program maketable.rb takes a set of input documents, counts the transitions within them, normalizes the bigrams, then generates an output “table” file which contains the model.
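The core of that process can be sketched in a few lines of Ruby. This is a simplified reconstruction, not the actual maketable.rb: the serialization format (Marshal), the probability floor, and the script name maketable_sketch.rb are my own assumptions.

# maketable_sketch.rb: build a normalized byte-bigram model
# from the files given on the command line.
counts = Array.new(256) { Array.new(256, 0) }

ARGV.each do |path|
  bytes = File.binread(path).bytes
  # Count every adjacent pair of bytes (a, b).
  bytes.each_cons(2) { |a, b| counts[a][b] += 1 }
end

# Normalize each row into P(b | a), flooring at a small epsilon
# so unseen transitions never get probability zero.
EPSILON = 1e-6
table = counts.map do |row|
  total = row.sum
  row.map { |c| total.zero? ? EPSILON : [c.to_f / total, EPSILON].max }
end

File.open("output.table", "wb") { |f| Marshal.dump(table, f) }

Run as, e.g., ruby maketable_sketch.rb samples/ru/cp1251/*. The epsilon floor matters later: the identifier takes logarithms, and log 0 would be negative infinity.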

Here is a sample run of maketable.rb:

lulu:~/projects/langid% ./maketable.rb samples/ru/cp1251

Loading data: samples/ru/cp1251/astrofox samples/ru/cp1251/a48 samples/ru/cp1251/avva samples/ru/cp1251/avk samples/ru/cp1251/emma_loy samples/ru/cp1251/aztech samples/ru/cp1251/deetan samples/ru/cp1251/dr_momm samples/ru/cp1251/dvor samples/ru/cp1251/dwalin samples/ru/cp1251/makropod samples/ru/cp1251/muchacho samples/ru/cp1251/nach_berlin samples/ru/cp1251/parf samples/ru/cp1251/pendejo samples/ru/cp1251/priest_dimitriy samples/ru/cp1251/qsju samples/ru/cp1251/runa_ samples/ru/cp1251/sema samples/ru/cp1251/sgt samples/ru/cp1251/shaltai_boltai samples/ru/cp1251/sherebon samples/ru/cp1251/tyrex samples/ru/cp1251/urbansheep samples/ru/cp1251/yogiki samples/ru/cp1251/yolka
Normalizing...
Saving table to tables/ru-cp1251...

Then, the programs makeimage.rb and makeps.rb generate .png and .ps images, respectively, from a given table:

lulu:~/projects/langid% ./makeimage.rb tables/ru-cp1251
Loading tables/ru-cp1251...
Generating images/ru-cp1251.png...
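A rough equivalent of the image step, assuming the chunky_png gem (my choice for this sketch; the original scripts may have produced the images differently) and the Marshal table format from the sketch above:

require "chunky_png"

table = File.open(ARGV[0], "rb") { |f| Marshal.load(f) }
max = table.flatten.max

png = ChunkyPNG::Image.new(256, 256, ChunkyPNG::Color::WHITE)
256.times do |a|
  256.times do |b|
    # Row a, column b: darker means a -> b is more probable.
    level = 255 - (255.0 * table[a][b] / max).round
    png[b, a] = ChunkyPNG::Color.grayscale(level)
  end
end

png.save("bigrams.png")

Because most of the probability mass sits in a few cells, a logarithmic gray scale might reveal more structure; the linear scale here is simply the most direct choice.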

Images

Can we see the difference between the different languages? These datasets were trained from real-world content (text from web-based journals), and each was built from about 100 KB of text.

English

[Figure 4: English (Windows-1252)]

Figure 4 is the image generated from English data. A darker spot at row i, column j represents a higher probability that character i will be followed by character j. As you can see, most of the characters fall within the upper-left quadrant (which corresponds to characters under 128 followed by characters under 128), as would be expected of English. Within that quadrant, you can also make out a three-by-three grid of clusters, which correspond to the second (punctuation and numbers), third (capitals), and fourth (lower case) columns in Figure 4. Additionally, the intense section in the upper-left cluster corresponds to the digits; it would make sense that digits have a high probability of being followed by other digits, as they often occur together in text.

German

Figure 5 was generated from German text. It resembles the English model, but you can see there are more characters along the lower and right quadrants of the image, from characters such as ü and ä found in German. I had hoped that there would also be a noticeable change in the probabilities of capital letters (because German capitalizes nouns), but the German writers who wrote the text used in the training data often didn’t use capitals.

[Figure 5: German (Windows-1252)]

Russian

Finally, Figure 6 was generated from Russian text. Again, there are some Western characters, but you can easily see the Cyrillic in the lower-right quadrant of the image.

[Figure 6: Russian (Windows-1251)]

Identification

The final component of my project, identify.rb, compares a given document against the generated tables. To choose which model is the best match for the document, I calculate the log-likelihood of the model generating the document. For each pair of adjacent letters a, b over the document D:

\[ P(\text{model}) = \prod_{a,b \in D} P(a \to b) \]

\[ \log P(\text{model}) = \log \prod_{a,b \in D} P(a \to b) = \sum_{a,b \in D} \log P(a \to b) \]

And after counting bigrams in the document, that reduces to simply:

\[ \sum_{a,b} C(a,b) \log P(a \to b) \]

where C(a, b) is the number of times the bigram (a, b) occurs in the document.

The models can then be compared against each other by choosing the one with the highest log-likelihood.
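The scoring step is short in Ruby. The following sketch (again with my assumed Marshal table format and a hypothetical script name; identify.rb itself may differ) implements the counted-bigram sum above:

# identify_sketch.rb: score a document against model tables and
# print them from best to worst match.
doc_path = ARGV.shift
doc = File.binread(doc_path).bytes

# C(a, b): how often each bigram occurs in the document.
counts = Hash.new(0)
doc.each_cons(2) { |a, b| counts[[a, b]] += 1 }

scores = ARGV.map do |table_path|
  table = File.open(table_path, "rb") { |f| Marshal.load(f) }
  # Log-likelihood: sum of C(a, b) * log P(a -> b).
  score = counts.sum { |(a, b), c| c * Math.log(table[a][b]) }
  [table_path, score]
end

scores.sort_by { |_, score| -score }.each do |path, score|
  puts format("%s (%f)", path, score)
end

Run as, e.g., ruby identify_sketch.rb tests/krylov tables/*.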

Here is a demonstration of the identification process:

lulu:~/projects/langid% ./identify.rb tests/krylov
Loading tables: tables/de-cp1252 tables/ru-cp1251 tables/en-cp1252
Loading file to identify...
Comparing file against tables: tables/de-cp1252 tables/en-cp1252 tables/ru-cp1251
Results, in order of likelihood:
tables/ru-cp1251 (-6300.080235)
tables/en-cp1252 (-16661.186752)
tables/de-cp1252 (-16897.318567)

File tests/krylov best matches tables/ru-cp1251.

The data set krylov is correctly identified as Russian/Windows-1251.

Conclusions

This technique is surprisingly accurate, at least in my limited tests. For example, even if I create a tiny file with simply the contents “Ja, ich kann Deutsch gut sprechen,” it is correctly identified as German. That result is especially notable because the text doesn’t contain any special characters; if I were just identifying documents based on the characters they used, that text could be in any of a variety of Western languages (though the unigram probabilities are probably different).

In the future, I’d like to try other Western languages to see how similar they are (would the program confuse German with other Germanic languages, especially those which share the same spellings for a lot of words?), but for this project (again, the data came from online diary entries⁶) there wasn’t much data available in other languages.

The code itself was written in the programming language Ruby, and is available on my website⁷.

⁶ http://www.livejournal.com
⁷ http://neugierig.org/software/langid
