
March 26, 2012

The Burgeoning Challenge of Deciphering Arabic Chat

“Arabizi”, an informal dialect of Arabic typed on mobile phones and computer keyboards using the Latin alphabet, has spread widely via text messages and social networks. Analyzing messages written in this dialect is a challenge for analysts in government and industry because of wide variations in spelling, grammar, and diction.

We put the World in the World Wide Web®

ABOUT BASIS TECHNOLOGY

Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our Rosette® linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, financial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of the exponential growth of data storage volumes.

Software vendors, content providers, financial institutions, and government agencies worldwide rely on Basis Technology’s solutions for Unicode compliance, language identification, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major firms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Amazon.com, Bing, […], and Yahoo!.

Company headquarters are in Cambridge, Massachusetts, with branch offices in San Francisco, Washington, London, and Tokyo. For more information, visit www.basistech.com.

© 2012 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-15)

For the last several hundred years, technology has made it easier for dominant languages to drive out the smaller ones. Faster transportation, ubiquitous telephony, and efficient printing are helping the major languages dominate our communications. Studies by the Linguistic Society of America and the National Geographic Society estimate that more than half of the world’s approximately 7,000 languages will be extinct by 2100.[1]

But technology also nurtures creavity, and new forms of wring are appearing in unforeseen places driven by unexpected confluences. One of the newest is , also called Arabizi—a casual version of wrien Arabic that appeared when Arabic speakers began using Western keyboards on mobile phones and computers to spell out their nave language with the .(”they would type mar7aba (translaon: “hello ﻣرﺣﺑﺎ Roman alphabet. Thus, instead of typing

The pracce is a growing challenge for government intelligence agencies because the wring system is proliferang as portable phones, social media, and other digital channels become more common. More and more conversaons flow through the handsets, and an increasing percentage travel as text messages. Western social networks like Facebook or Twier are also growing in popularity, and the users oen choose Arabizi for their messages. In many cases, protests of the Arab Spring were planned, nurtured, and executed through messages passed in these channels, oen as Arabic chat.

The format poses a unique problem for government analysts working with open source intelligence, because it is still evolving and the writers do not follow any standard rules for spelling, grammar, or diction. Speakers from different regions not only use different spellings but also write in their local dialect, code-switch (insert other languages such as English or French), or mix dialectal Arabic with Modern Standard Arabic (MSA). The phonology, morphology, syntax, and lexicon of the dialects that these native speakers of Arabic use differ from those of MSA. (See sidebar on the growth of Arabizi, “The Flourishing Garden of Arabizi.”)

While the collection phase of the intelligence cycle can gather text messages easily, the processing and exploitation phase is slowed or blocked entirely by messages that cannot be easily understood by automated tools trained only on MSA in Arabic script. Humans must read through the messages and choose the important ones.

Automation is essential because there are simply not enough analysts to scan the large volume of text. Arabizi confuses conventional natural language processing tools for Arabic by mixing in non-Arabic letters, regional dialects, and foreign words, all of which are unexpected by tools trained on MSA.

UNDERSTANDING THE ATTRACTION OF ARABIZI

While the early Arabizi users were probably motivated by the need to send Arabic words with a Western keyboard, Arabizi also attracts current users for aesthetic, political, and personal reasons. Convenience is only part of the allure.

[1] See “What is an Endangered Language” by AC Woodbury, Linguistic Society of America, 2006. Also The Last Speakers: The Quest to Save the World’s Most Endangered Languages, K. David Harrison, National Geographic Press, Sept 2010. Also “Languages Die, But Not Their Last Words”, NY Times, Sept 19, 2007.

In one study conducted by the American University in Cairo,[2] 70 Arabic users on Facebook were asked why they chose to write with Roman letters. More than 80% of the respondents said that they used Arabizi, and 40% used it “most of the time.” Those who did not use it largely said it was either out of respect for the […] or as part of an effort to maintain a separate Arabic identity outside of Western influence. Indeed, many agreed it should not be used in a “religious context” even if it is acceptable for casual communication.

About 20% said it made them feel “closer to each other,” a phrase that implies a role as a secret language that is not understood by outsiders of different ages, backgrounds, or education. It is cool.

Just as many English speakers may choose a more elegant-sounding French word, and Japanese companies add English words to the packaging of products, Arabic speakers who use Arabizi display a depth of knowledge and sophistication. The early users were, by necessity, well educated in European languages, so they were able to know instinctively how Arabic sounds mapped to Roman letter combinations. English and French classes are common in schools in the Arabic world and this nurtures the understanding of the Roman alphabet. This user profile makes Arabizi a mark of success that suggests that the writer is well educated and often well traveled.

Today the wring technique also popular among young and agile minds because they are frequently the first to adopt new technology. Just as Western youths have improvised many acronyms and textual shorthand to simplify typing on mobile devices, the younger Arab users are also creang Roman approximaons to meet their needs. Some suggest that they use Arabizi because it is oen easier to type than Standard Arabic, especially when they have no training in using an .

The combinaon of polical acvity and the proliferaon of technology fueled a greater focus on Arabic chat format. While many of the original adopters may have turned to Arabizi out of a praccal need to express Arabic words using Western technology, some suggest that the language is now a popular and sophiscated choice even when the technology supports tradional Arabic script. People are choosing Arabizi when other tools to spell MSA are available. The once casual slang is growing in importance and becoming a significant format of its own.

The form quickly came to the aenon of non-Arabic speakers during the Arab Spring of 2011 when many polical acvists discovered that mobile phones and Facebook were ideal vectors for organizing protests and structuring the evoluon of polical dissent. The wring born of casual chat turned into a tool for revoluon.

THE CHALLENGE OF UNLOCKING ARABIC CHAT

Analyzing text written in Arabizi is a difficult problem because it lacks much of the structure that current technologies rely upon. Many algorithms and data-mining tools depend on a stable, predictable spelling and structure for words, rules that are standardized through dictionaries and schools. Arabizi’s improvisational origins produce something more chaotic. (See sidebar.) The wide range of words and influences can confound algorithms that assume stability and a fixed interpretation. All of this mathematical apparatus assumes that the structure of a language will not change, but Arabizi is transforming as each person chooses the closest approximation of a word that comes to mind.

[2] From “Summary of Arabizi or Romanization: The dilemma of writing Arabic texts” by Randa Muhammed, Mona Farrag, Nariman Elshamly, and Nady Abdel-Ghaffar. Presented at Jīl Jadīd Conference, University of Texas at Austin, February 18-19, 2011.

Many of the earliest approaches to analyzing Arabizi with these tools depended upon people translating the words directly into MSA and then using traditional algorithms on MSA to work with the results. The mechanism was tuned to the MSA output, not to the Arabizi entering the system. Any success that these rote algorithms offered was often limited and short-lived because a direct conversion to Arabic script will often produce sentences that do not follow the rules of MSA. The linguistic shorthand structures may follow the rules of proper MSA in some sentences and ignore them in others. Foreign words that are often mixed into Arabizi break these algorithms immediately.

The brittleness is compounded by the fact that there has been no formal process by any central body to provide guidance for how this language should be used, nor will there be: it is a language of the “digital streets.” There is no government-run committee on the language like the Académie Française, and no one has written a style guide like the AP Style Manual.

There are also no large publications written and edited in Arabizi, so there is no source of a large, curated collection of text, like a newspaper, that people can imitate when structuring their sentences. This lack of common rules can be compounded when people use words from a smaller, local dialect that is not widely understood.

The lack of central steering means that the language evolves differently as each user chooses which […] or grammatical trope to adopt or ignore without any guidance or suggestion. Many adopt new constructions as they write their texts, manufacturing spellings as they need them. If people are influenced, they are influenced by their friends.

Regional Dialects Compound Difficulties

There are many regional differences in pronunciation that multiply the complexities because the words are often transliterated by sound. In Iraqi Arabic, the “q” as in the word “qalb” قلب (i.e., “heart”) is pronounced as a “g”; Palestinian speakers in the West Bank would pronounce “q” as “k”; in Tunisia, it is pronounced as either “q” or “g”; and Egyptians pronounce it as a glottal stop (as in the sound in “uh-oh”). Hence, Egyptians pronounce the first letter of the former Libyan ruler’s name with a glottal stop, whereas some other dialects pronounce it with a “g” sound. Consequently, possible spellings in Arabic chat would be: Gadaffi, 2addafi, Kathafi, Qathafi, Qadhafy, etc. The different spellings reflect the different pronunciations, the orthographic variations that exist in the […] itself, and the orthographic variations of Arabic chat (such as spelling the glottal stop or […] with the digit 2).

On top of that, residents of countries where French is spoken, for instance, often learn different rules for mapping sounds to Roman letters than people who live in countries where English is the dominant second language. Even people who may know little English may still adopt […] when their friends choose to use it.

The decentralized structure does offer new opportunities for understanding and analysis. The spellings and linguistic constructions travel like viruses or memes without any central control, and people from the same social networks often share the same locutions. It is possible to identify the regional origin of speakers through the dialectal words they use and the spellings they select for the words. Thus a simple chat can contain more information than what the words are communicating.
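The idea that spelling choices hint at a writer's origin can be sketched with a small lookup table. This is only an illustration and not a description of any product: the five spellings and their group labels are the ones listed in the sidebar table, while the function name `region_votes` and the sample message are invented for the example. A usable lexicon would need many words per region and a way to handle spellings shared by several dialects.

```python
from collections import Counter

# Spellings of a single word and the group each one suggests, copied from
# the sidebar table. One word is far too little evidence in practice.
SPELLING_TO_REGION = {
    "talateh": "Jordanian",
    "thalatheh": "Bedouin",
    "talata": "Cairene",
    "tlete": "Lebanese",
    "salasa": "Egyptian",
}


def region_votes(message):
    """Count, per region, how many tokens are known regional spellings."""
    tokens = message.lower().split()
    return Counter(
        SPELLING_TO_REGION[token]
        for token in tokens
        if token in SPELLING_TO_REGION
    )


# Invented sample message, used only to exercise the lookup.
votes = region_votes("tlete w tlete")
print(votes.most_common(1))
```

Counting votes rather than returning the first match keeps the sketch honest about uncertainty: a message that mixes spellings yields several candidate regions with their tallies instead of a single confident answer.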

AUTOMATING ANALYSIS

The complexity of the language and the explosion of Arabizi guarantee that automation is essential, and any computerized assistance with understanding the chat messages will be an analytical multiplier. It is impossible to find enough people to read all of the messages, so it is impossible to locate the salient texts without leveraging advanced algorithms.

Computerization also ensures that human resources can be deployed more efficiently. Without effective algorithmic triage, human analysts become mired in routine translations that become harder with the odd or chaotic transliteration. If most of the text snippets are of little interest to anyone but the recipients, automated tools are essential for both searching the corpus and ranking the importance of messages. An automated pre-processor lets the human resources focus on the most important message streams.

An Enterprise-Ready Soluon for Deciphering Arabic Chat Alphabet Currently, there are few tools that can deal effecvely with Arabic chat. Basis Technology’s Rosee® linguiscs plaorm may be the only producon-quality, enterprise tool currently available on the market for converng and analyzing it.

Rosee Chat Translator, one module in the full Rosee pipeline, is capable of disassembling Arabizi because it begins with a stascal approach to transliteraon that breaks the text message into phonemes and then ranks the possible conversions. The most probable mappings are used to convert the text into Arabic script. This stascal approach allows the soware to adapt as the common usage changes.

[Figure: a sample chat message shown in four panels: Romanized Arabic, Standard Arabic, Entity Extraction, and English Translation.]

Romanized Arabic: mar7aban Abu Mas3uud. Wallahi mudda taweela lmma shuftak ya shiekh. Esma3 insha' allah netgabal 3end abu musle7 ghadan wa la tensa el mawaad al matluba lzar3 al shajara fee shaari3 karbala'. 7awali 5:30 nitqabal ma3 ahmad abdallah salih.

English Translation: Greetings, Abu Masood! By God, it has been a long time since I’ve seen you, oh Sheikh. Listen, God willing, we meet at Abu Musleh tomorrow and do not forget the materials required to plant the tree in Karbala Street. Around 5:30 we meet with Ahmed Abdullah Saleh.

Basis Technology built the stascal model for the chat translator from more than 300 million Arabizi messages gathered from throughout the world. The database is updated regularly through an automac algorithm that builds a new stascal model from the latest corpus. New releases include the latest version of the model trained with the most recent collecon of chat messages.

The results also carry metadata about the regional dialect used in the text message, and this can identify the country of origin of the writer. The translations of chat alphabet to Arabic script from Rosette Chat Translator amplify the knowledge of the analyst by suggesting possible sources of the message that may lie outside the core of the analyst’s expertise. The encyclopedic nature of Rosette offers a deep set of options for the analysts to grade, saving them time in identifying the source. This information is kept alongside the translation for analysts to study at all subsequent stages of processing.

Integrang with the Enterprise Rosee Chat Translator delivers its results to other soware packages in an industry-standard format for further analysis by the user’s custom algorithms or other modules from Basis Technology. The tool delivers the converted message along with any hints about the origins that were unlocked during the conversion.

When Rosee Chat Translator is combined with the full Rosee linguiscs plaorm pipeline, it offers a full set of tools for collecng and analyzing a corpus of messages. Rosee Enty Extractor takes the text — converted to Arabic script by the chat translator — and idenfies names and locaons. These enes then can be fed into Rosee Name Indexer (which matches different spellings of the same name) and Rosee Name Translator (which transliterates Arabic names to English), making it simpler for the non-Arabic-speaking analyst to understand them. The name index built by the Rosee pipeline helps analysts work with a large collecon of messages by providing a quick way to idenfy documents containing the same enty. The complexies caused by the different spellings and linguisc structures of Arabizi are dramacally reduced with this standardized index.

The large index built by Rosee is a key part of idenfying relevant messages. Analysts can move faster through incoming documents and find cross-references that can unlock connecons that are obscured by all of the different name spellings. The index can quickly idenfy all other messages with similar enes—names of people, places, and more—even when they use different spellings.

NEW TECHNOLOGY IS DRIVING FUTURE ARABIZI GROWTH

The future of Arabizi will follow changes in technology. Today, mobile handsets are just beginning to dominate the Arab countries. One study from the Egyptian government noted that handset subscriptions jumped from 55 million to 71 million during 2010.[3] In January 2011, the penetration rate of the market was estimated to be 91%. Other Arab countries are experiencing similar explosions of interest as the technology becomes available to all income levels.

Social networks are also growing more popular and the users of these systems often choose Arabizi because it seems a natural choice. Even non-Western websites often have comments written in Arabizi instead of standard Arabic. As Facebook, Twitter, and other Western social media websites continue to grow more popular, Arabizi will follow. Each new user adds complexity to the proliferating dialects, subdialects, and formats because each user has their own preferred set of words and orthography. These choices are imitated by friends, and the social networks amplify the structures as they pass, like trends and fads, from one to another. Just as English speakers using Twitter are building a new dialect, Arabic speakers are also echoing similar structures.

[3] See “ICT Indicators in Brief”, Feb 2011, Arab Republic of Egypt Ministry of Communications and Information Technology (www.mcit.gov.eg). Also “The Adoption of Mobile Phones in Emerging Markets: Global Diffusion and the Rural Challenge” by Kas Kalba, International Journal of Communication 2 (2008), 631-661.

Rosee Chat Translator is the only enterprise-grade, producon-quality soware choice available for working with the increasing stream of Arabizi. Its hybrid collecon of algorithms breaks down the language into phonec components enabling it to effecvely match the Western spellings with Arabic words. This module then feeds the rest of the Rosee pipeline to provide a complete soluon for understanding and cross-correlang the messages.

Automated tools like Rosee are essenal for effecvely managing the interpretaon of the vast collecon of data flowing through the intelligence cycle, especially when most of the informaon is not of interest. Flagging the most salient and potenally important messages ensures that the translators and analysts can focus on the messages with the highest potenal value, saving me and money. Automaon unlocks informaon that would be otherwise lost in a sea of data.

THE FLOURISHING GARDEN OF ARABIZI

We can understand why Arabizi is complex by examining an Arabic chat text message written in the Levantine dialect used in Lebanon:

akid elli 3emel hai theory wa7ad 6el3ello wala wala wala luck

Most of the words like “wala” are Arabic spelled with English versions of the sounds.[4] The numerical digits are used because they look like Arabic letters, and sometimes those letters have no sound that is easily approximated by English spellings. Two English words, “theory” and “luck”, are here because they probably were easier for the writer to include. Perhaps they were a more accurate reflection of what the writer wanted to say, or perhaps they just came to mind before the Arabic versions. In some cases, the character just looked like the Arabic letter. Research shows that the writers have many reasons for why they choose particular combinations, and different people make different choices.

Understanding Arabic chat is not simple because there are many different ways to translate phonemes into a Roman script. Here is an example of five different transliterations of the same word along with the group affiliation of a person who drafted the text message:

Spelling     Region
talateh      Jordanian
thalatheh    Bedouin
talata       Cairene
tlete        Lebanese
salasa       Egyptian

The spellings can depend upon local pronunciation and the writer’s exposure to European languages. This example is far from complete because many words can be spelled in several dozen ways. Studies have identified at least 32 different ways that Western publications spell the name of the former head of Libya, Mu’ammar Qadhafi. His first name alone is commonly represented in at least five different ways in Western literature. Arabizi often includes even more variety in spelling because there are more users who are not following any standards.

[4] Phoneme: the smallest unit of sound in a language.

The Influence of Non-Arabic Languages

The job of transliterating Arabizi is more difficult when English is involved because the language is, like Arabizi, a polyglot tongue with orthography drawn from multiple linguistic traditions. These eight words all have the same vowel phoneme /i/ but are spelled differently: flea, free, niece, perceive, turkey, Phoebe, she, and ski. Arabizi users could choose any of them, and so a polyglot descendant of a polyglot language grows even more complex.

To make matters more complicated, each Arabic-speaking region often pronounces the same word differently. Two French-speaking writers could choose different spellings because the local versions of the word are voiced differently.

The source of the spelling is not always strictly auditory. The digit “7” is often used because it looks like a common Arabic figure, خ. Some imitate the dot above the letter by putting a quote mark before it and some place it after. Some use an asterisk for the dot instead and also place it either before or after. So there are four common transliterations of just this one sound ('7, 7', *7, 7*), and some users will switch between them in the same message.
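Variation of this kind is easy to fold together before any further processing. The sketch below rewrites the four spellings listed above to a single form with a regular expression. The choice of `7'` as the canonical form, the function name `normalize_kh`, and the sample tokens are assumptions made for the example. It handles only the straight ASCII apostrophe and asterisk, so typographic quotes would need to be added to the pattern.

```python
import re

# Matches the four variants: '7  7'  *7  7*
_KH_VARIANTS = re.compile(r"['*]7|7['*]")


def normalize_kh(text, canonical="7'"):
    """Rewrite every dotted-7 variant to one canonical spelling."""
    return _KH_VARIANTS.sub(canonical, text)


# Invented tokens: one name written four ways, plus a word with a plain 7.
for token in ["'7alid", "7'alid", "*7alid", "7*alid", "mar7aba"]:
    print(token, "->", normalize_kh(token))
```

A plain “7” with no mark is left untouched, so the two sounds stay distinct after normalization.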

Understanding can grow more complex when the users add in words from other languages. The mixture of French and Arabic is often called “Franko-Arab”; including English words produces what some call “Arablish.”

This evolution guarantees that there will be many forms of Arabizi and the spellings will change from region to region and social group to social group.

EXPLORE FURTHER

For more information or to request an evaluation, please call us at 617-386-2090 or 800-697-2062, or write to [email protected]. We will be happy to assist you in evaluating the performance of our products on your data.
