March 26, 2012
The Burgeoning Challenge of Deciphering Arabic Chat
“Arabizi”, an informal dialect of Arabic typed on mobile phones and computer keyboards using the Latin alphabet, has spread widely via text messages and social networks. Analyzing messages written in this dialect is a challenge for analysts in government and industry because of wide variations in spelling, grammar, and diction.
We put the World in the World Wide Web®
ABOUT BASIS TECHNOLOGY
Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our Rosette® linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, financial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of exponential growth of data storage volumes.
Software vendors, content providers, financial institutions, and government agencies worldwide rely on Basis Technology’s solutions for Unicode compliance, language identification, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major firms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Amazon.com, Bing, Google, and Yahoo!.
Company headquarters are in Cambridge, Massachusetts, with branch offices in San Francisco, Washington, London, and Tokyo. For more information, visit www.basistech.com.
© 2012 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-15)
For the last several hundred years, technology has made it easier for dominant languages to drive out the smaller ones. Faster transportation, ubiquitous telephony, and efficient printing are helping the major languages dominate our communications. Studies by the Linguistic Society of America and the National Geographic Society estimate that more than half of the world’s approximately 7,000 languages will be extinct by 2100. [1]
But technology also nurtures creativity, and new forms of writing are appearing in unforeseen places driven by unexpected confluences. One of the newest is Arabic chat alphabet, also called Arabizi, a casual version of written Arabic that appeared when Arabic speakers began using Western keyboards on mobile phones and computers to spell out their native language with the Roman alphabet. Thus, instead of typing مرحبا (translation: “hello”), they would type mar7aba.
The practice is a growing challenge for government intelligence agencies because the writing system is proliferating as portable phones, social media, and other digital channels become more common. More and more conversations flow through the handsets, and an increasing percentage travel as text messages. Western social networks like Facebook or Twitter are also growing in popularity, and the users often choose Arabizi for their messages. In many cases, protests of the Arab Spring were planned, nurtured, and executed through messages passed in these channels, often as Arabic chat.
The format poses a unique problem for government analysts working with open source intelligence, because it is still evolving and the writers do not follow any standard rules for spelling, grammar, or diction. Speakers from different regions not only use different spellings but also write in their local dialect and even code-switch (insert other languages such as English or French), or mix dialectal Arabic with Modern Standard Arabic (MSA). The phonology, morphology, syntax, and lexicon of the dialects that these native speakers of Arabic are using are different from those of MSA. (See sidebar on the growth of Arabizi, “The Flourishing Garden of Arabizi.”)
While the collection phase of the intelligence cycle can gather text messages easily, the processing and exploitation phase is slowed or blocked entirely by messages that cannot be easily understood by automated tools trained only on MSA in Arabic script. Humans must read through the messages and choose the important ones.
Automation is essential because there are simply not enough analysts to scan the large volume of text. Arabizi confuses conventional natural language processing tools for Arabic by mixing in non-Arabic letters, regional dialects, and foreign words, all of which are unexpected by tools trained on MSA.
UNDERSTANDING THE ATTRACTION OF ARABIZI
While the early Arabizi users were probably motivated by the need to send Arabic words with a Western keyboard, Arabizi also attracts current users for aesthetic, political, and personal reasons. Convenience is only part of the allure.
[1] See “What is an Endangered Language” by AC Woodbury, Linguistic Society of America, 2006. Also The Last Speakers: The Quest to Save the World’s Most Endangered Languages, K. David Harrison, National Geographic Press, Sept 2010. Also “Languages Die, But Not Their Last Words”, NY Times, Sept 19, 2007.
In one study conducted by the American University in Cairo [2], 70 Arabic-speaking users on Facebook were asked why they chose to write with Roman letters. More than 80% of the respondents said that they used Arabizi, and 40% used it “most of the time.” Those who did not use it largely said it was either out of respect for the Quran or as part of an effort to maintain a separate Arabic identity outside of Western influence. Indeed many agreed it should not be used in a “religious context” even if it is acceptable for casual communication.
About 20% said it made them feel “closer to each other,” a phrase that implies a role as a secret language that is not understood by outsiders of different ages, backgrounds, or education. It is cool.
Just as many English speakers may choose a more elegant-sounding French word, and Japanese companies add English words to the packaging of products, Arabic speakers who use Arabizi display a depth of knowledge and sophistication. The early users were, by necessity, well educated in European languages, so they were able to know instinctively how Arabic sounds mapped to Roman letter combinations. English and French classes are common in schools in the Arab world and this nurtures the understanding of the Roman alphabet. This user profile makes Arabizi a mark of success that suggests that the writer is well educated and often well traveled.
Today the writing technique is also popular among young and agile minds because they are frequently the first to adopt new technology. Just as Western youths have improvised many acronyms and textual shorthand to simplify typing on mobile devices, the younger Arab users are also creating Roman approximations to meet their needs. Some suggest that they use Arabizi because it is often easier to type than Standard Arabic, especially when they have no training in using an Arabic keyboard.
The combination of political activity and the proliferation of technology fueled a greater focus on the Arabic chat format. While many of the original adopters may have turned to Arabizi out of a practical need to express Arabic words using Western technology, some suggest that the language is now a popular and sophisticated choice even when the technology supports traditional Arabic script. People are choosing Arabizi when other tools to spell MSA are available. The once casual slang is growing in importance and becoming a significant format of its own.
The form quickly came to the attention of non-Arabic speakers during the Arab Spring of 2011 when many political activists discovered that mobile phones and Facebook were ideal vectors for organizing protests and structuring the evolution of political dissent. The writing born of casual chat turned into a tool for revolution.
THE CHALLENGE OF UNLOCKING ARABIC CHAT
Analyzing text written in Arabizi is a difficult problem because it lacks much of the structure that current technologies rely upon. Many algorithms and data-mining tools depend on a stable, predictable spelling and structure for words, rules that are standardized through dictionaries and schools. Arabizi’s improvisational origins produce something more chaotic. (See sidebar.) The wide range of words and influences can confound algorithms that assume stability and a fixed interpretation. All of this mathematical apparatus assumes that the structure of a language will not change, but Arabizi is transforming as each person chooses the closest approximation of a word that comes to mind.
[2] From “Summary of Arabizi or Romanization: The dilemma of writing Arabic texts” by Randa Muhammed, Mona Farrag, Nariman Elshamly, and Nady Abdel-Ghaffar. Presented at Jīl Jadīd Conference, University of Texas at Austin, February 18-19, 2011.
Many of the earliest approaches to analyzing Arabizi with these tools depended upon people translating the words directly into MSA and then using traditional algorithms on MSA to work with the results. The mechanism was tuned to the MSA output, not to the Arabizi entering the system. Any success that these rote algorithms offered was often limited and short-lived because a direct conversion to Arabic script will often produce sentences that do not follow the rules of MSA. The linguistic shorthand structures may follow the rules of proper MSA in some sentences and ignore them in others. Foreign words that are often mixed into Arabizi break these algorithms immediately.
The brittleness is compounded by the fact that there has been no formal process by any central body to provide guidance for how this language should be used, nor will there be: it is a language of the “digital streets.” There is no government-run committee on the language like the Académie Française, and no one has written a style guide like the AP Style Manual.
There are also no large publications written and edited in Arabizi, so there is no source of a large, curated collection of text, like a newspaper, that people can imitate when structuring their sentences. This lack of common rules can be compounded when people use words from a smaller, local dialect that is not widely understood.
The lack of central steering means that the language evolves differently as each user chooses which orthography or grammatical trope to adopt or ignore without any guidance or suggestion. Many adopt new constructions as they write their texts, manufacturing spellings as they need them. If people are influenced, they are influenced by their friends.
Regional Dialects Compound Difficulties
There are many regional differences in pronunciation that multiply the complexities because the words are often transliterated using sound. In Iraqi Arabic, the “q” as in the word “qalb” قلب (i.e., “heart”) is pronounced as a “g”; Palestinian speakers in the West Bank would pronounce “q” as “k”; in Tunisia, it is pronounced as either “q” or “g”; and Egyptians pronounce it as a glottal stop (as in the sound in “uh-oh”). Hence, Egyptians pronounce the first letter of the former Libyan ruler’s name with a glottal stop, whereas some other dialects pronounce it with a “g” sound. Consequently, possible spellings in Arabic chat would be: Gadaffi, 2addafi, Kathafi, Qathafi, Qadhafy, etc. The different spellings reflect the different pronunciations, the orthographic variations that exist in the English language itself, and the orthographic variations of Arabic chat (such as spelling the glottal stop or hamza with the digit 2).
On top of that, residents of countries where French is spoken, for instance, often learn different rules for mapping sounds to Roman letters than people who live in countries where English is the dominant second language. Even people who may know little English may still adopt English orthography when their friends choose to use it.
The decentralized structure does offer new opportunities for understanding and analysis. The spellings and linguistic constructions travel like viruses or memes without any central control, and people from the same social networks often share the same locutions. It is possible to identify the regional origin of speakers through the dialectal words they use and the spellings they select for the words. Thus a simple chat can contain more information than what the words are communicating.
AUTOMATING ANALYSIS
The complexity of the language and the explosion of Arabizi guarantee that automation is essential, and any computerized assistance with understanding the chat messages will be an analytical multiplier. It is impossible to find enough people to read all of the messages, so it is impossible to locate the salient texts without leveraging advanced algorithms.
Computerization also ensures that human resources can be deployed more efficiently. Without effective algorithmic triage, human analysts become mired in routine translations that become harder with the odd or chaotic transliteration. If most of the text snippets are of little interest to anyone but the recipients, automated tools are essential for both searching the corpus and ranking the importance of messages. An automated pre-processor lets the human resources focus on the most important message streams.
An Enterprise-Ready Solution for Deciphering Arabic Chat Alphabet
Currently, there are few tools that can deal effectively with Arabic chat. Basis Technology’s Rosette® linguistics platform may be the only production-quality, enterprise tool currently available on the market for converting and analyzing it.
Rosette Chat Translator, one module in the full Rosette pipeline, is capable of disassembling Arabizi because it begins with a statistical approach to transliteration that breaks the text message into phonemes and then ranks the possible conversions. The most probable mappings are used to convert the text into Arabic script. This statistical approach allows the software to adapt as the common usage changes.
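The idea of generating candidate conversions and ranking them can be illustrated with a minimal Python sketch. It is not the product's algorithm, which this paper does not describe in detail. The mapping table, the probabilities, the one-word lexicon, and the names `CANDIDATES`, `LEXICON`, `segment`, and `rank_conversions` are all assumptions invented for illustration; a real system would estimate such values from a large corpus.

```python
from itertools import product

# Toy table: Roman chunk -> [(Arabic letter, probability)].
# Every entry and number here is invented for illustration.
CANDIDATES = {
    "m": [("\u0645", 1.0)],                    # meem
    "r": [("\u0631", 1.0)],                    # ra
    "b": [("\u0628", 1.0)],                    # ba
    "7": [("\u062d", 0.9), ("\u062e", 0.1)],   # ha, or (less often) kha
    "a": [("", 0.6), ("\u0627", 0.4)],         # unwritten short vowel, or alef
}

# Hypothetical lexicon prior; words not listed get a small floor value.
LEXICON = {"\u0645\u0631\u062d\u0628\u0627": 1.0}  # the word for "hello"
UNKNOWN_PRIOR = 0.01


def segment(word):
    """Split a word into known chunks, trying the longest chunk first."""
    longest = max(len(chunk) for chunk in CANDIDATES)
    chunks, i = [], 0
    while i < len(word):
        for size in range(longest, 0, -1):
            piece = word[i:i + size]
            if piece in CANDIDATES:
                chunks.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no mapping for {word[i]!r}")
    return chunks


def rank_conversions(word, top=3):
    """Return the `top` highest-scoring Arabic-script spellings of `word`."""
    options = [CANDIDATES[chunk] for chunk in segment(word.lower())]
    scores = {}
    for combo in product(*options):
        text = "".join(letter for letter, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        scores[text] = scores.get(text, 0.0) + prob
    # Weight each spelling by how plausible it is as a known word.
    ranked = [(t, s * LEXICON.get(t, UNKNOWN_PRIOR)) for t, s in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top]
```

Calling `rank_conversions("mar7aba")` enumerates every spelling the toy table allows and lets the lexicon prior favor the one that matches a known word. Retraining the table and the lexicon on newer messages is how a statistical approach of this kind could follow changes in usage.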
[Figure: sample message shown as Romanized Arabic, Standard Arabic, Entity Extraction, and English Translation]
Romanized Arabic: mar7aban Abu Mas3uud. Wallahi mudda taweela lmma shuftak ya shiekh. Esma3 insha' allah netgabal 3end abu musle7 ghadan wa la tensa el mawaad al matluba lzar3 al shajara fee shaari3 karbala'. 7awali 5:30 nitqabal ma3 ahmad abdallah salih.
English Translation: Greetings, Abu Masood! By God, it has been a long time since I’ve seen you, oh Sheikh. Listen, God willing, we meet at Abu Musleh tomorrow and do not forget the materials required to plant the tree in Karbala Street. Around 5:30 we meet with Ahmed Abdullah Saleh.
Basis Technology built the statistical model for the chat translator from more than 300 million Arabizi messages gathered from throughout the world. The database is updated regularly through an automatic algorithm that builds a new statistical model from the latest corpus. New releases include the latest version of the model trained with the most recent collection of chat messages.
The results also carry metadata about the regional dialect used in the text message and this can identify the country of origin of the writer. The translations of chat alphabet to Arabic script from Rosette Chat Translator amplify the knowledge of the analyst by suggesting possible sources of the message that may lie outside the core of the analyst’s expertise. The encyclopedic nature of Rosette offers a deep set of options for the analysts to grade, saving them time in identifying the source. This information is kept alongside the translation for analysts to study at all subsequent stages of processing.
Integrating with the Enterprise
Rosette Chat Translator delivers its results to other software packages in an industry-standard format for further analysis by the user’s custom algorithms or other modules from Basis Technology. The tool delivers the converted message along with any hints about the origins that were unlocked during the conversion.
When Rosette Chat Translator is combined with the full Rosette linguistics platform pipeline, it offers a full set of tools for collecting and analyzing a corpus of messages. Rosette Entity Extractor takes the text, converted to Arabic script by the chat translator, and identifies names and locations. These entities then can be fed into Rosette Name Indexer (which matches different spellings of the same name) and Rosette Name Translator (which transliterates Arabic names to English), making it simpler for the non-Arabic-speaking analyst to understand them. The name index built by the Rosette pipeline helps analysts work with a large collection of messages by providing a quick way to identify documents containing the same entity. The complexities caused by the different spellings and linguistic structures of Arabizi are dramatically reduced with this standardized index.
The large index built by Rosette is a key part of identifying relevant messages. Analysts can move faster through incoming documents and find cross-references that can unlock connections that are obscured by all of the different name spellings. The index can quickly identify all other messages with similar entities (names of people, places, and more) even when they use different spellings.
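To make the idea of a spelling-tolerant name index concrete, here is a rough Python sketch. It does not reproduce how Rosette Name Indexer works, which this paper does not specify. The folding rules and the names `name_key` and `build_index` are hypothetical, and the rules are deliberately crude: they were chosen only so that the variant spellings of the former Libyan ruler's name cited earlier in this paper collapse to one key, and they would over-merge unrelated names in practice.

```python
import re
from collections import defaultdict


def name_key(name):
    """Fold a Romanized name into a crude consonant skeleton (illustrative only)."""
    s = name.lower()
    s = re.sub(r"dh|th", "d", s)      # treat dh / th / d as one sound class
    s = re.sub(r"[qgk2]", "k", s)     # treat q / g / k / glottal-stop "2" alike
    s = re.sub(r"(.)\1+", r"\1", s)   # collapse doubled letters
    s = re.sub(r"[aeiouy]", "", s)    # drop vowels, which vary the most
    return s


def build_index(messages):
    """Map each name key to the set of message ids that mention it.

    `messages` is a dict of message id -> list of names found in that message.
    """
    index = defaultdict(set)
    for msg_id, names in messages.items():
        for name in names:
            index[name_key(name)].add(msg_id)
    return index
```

With these rules, Gadaffi, 2addafi, Kathafi, Qathafi, and Qadhafy all share a single key, so a lookup on any one spelling retrieves messages that used the others. A production index would need scoring rather than exact key equality to keep false matches under control.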
NEW TECHNOLOGY IS DRIVING FUTURE ARABIZI GROWTH
The future of Arabizi will follow changes in technology. Today, mobile handsets are just beginning to dominate the Arab countries. One study from the Egyptian government noted that handset subscriptions jumped from 55 million to 71 million during 2010 [3]. In January 2011, the penetration rate of the market was estimated to be 91%. Other Arab countries are experiencing similar explosions of interest as the technology becomes available to all income levels.
Social networks are also growing more popular and the users of these systems often choose Arabizi because it seems a natural choice. Even non-Western websites often have comments written in Arabizi instead of standard Arabic. As Facebook, Twitter, and other Western social media websites continue to grow more popular, Arabizi will follow. Each new user adds complexity to the proliferating dialects, subdialects, and formats because each user has their own preferred set of words and orthography. These choices are imitated by friends, and the social networks amplify the structures as they pass, like trends and fads, from one to another. Just as English speakers using Twitter are building a new dialect, Arabic speakers are also echoing similar structures.
[3] See “ICT Indicators in Brief”, Feb 2011, Arab Republic of Egypt Ministry of Communications and Information Technology (www.mcit.gov.eg). Also “The Adoption of Mobile Phones in Emerging Markets: Global Diffusion and the Rural Challenge” by Kas Kalba, International Journal of Communication 2 (2008), 631-661.
Rosette Chat Translator is the only enterprise-grade, production-quality software choice available for working with the increasing stream of Arabizi. Its hybrid collection of algorithms breaks down the language into phonetic components, enabling it to effectively match the Western spellings with Arabic words. This module then feeds the rest of the Rosette pipeline to provide a complete solution for understanding and cross-correlating the messages.
Automated tools like Rosette are essential for effectively managing the interpretation of the vast collection of data flowing through the intelligence cycle, especially when most of the information is not of interest. Flagging the most salient and potentially important messages ensures that the translators and analysts can focus on the messages with the highest potential value, saving time and money. Automation unlocks information that would be otherwise lost in a sea of data.
THE FLOURISHING GARDEN OF ARABIZI
We can understand why Arabizi is complex by examining an Arabic chat text message written in the Levantine dialect used in Lebanon:
akid elli 3emel hai theory wa7ad 6el3ello wala wala wala luck
Most of the words like “wala” are Arabic spelled with English versions of the sounds [4]. The numerical digits are used because they look like Arabic letters, and sometimes because they stand for sounds that are not easily approximated by English spellings. Two English words, “theory” and “luck”, are here because they probably were easier for the writer to include. Perhaps they were a more accurate reflection of what the writer wanted to say, or perhaps they just came to mind before the Arabic versions. In some cases, the character just looked like the Arabic letter. Research shows that the writers have many reasons for why they choose particular combinations, and different people make different choices.
Understanding Arabic chat is not simple because there are many different ways to translate phonemes into a Roman script. Here is an example of five different transliterations of the same word along with the group affiliation of a person who drafted the text message:
Spelling Region
talateh Jordanian
thalatheh Bedouin
talata Cairene
tlete Lebanese
salasa Egyptian
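The table above can be read as a lookup from spelling to likely group, which is the basic intuition behind inferring a writer's origin from word choice. The Python sketch below is illustrative only: the dictionary holds just the five rows of the table, the names `SPELLING_TO_REGION` and `guess_regions` are hypothetical, and a real classifier would weigh many words and spellings statistically rather than rely on exact matches.

```python
from collections import Counter

# The five spellings from the table above, keyed to the listed affiliation.
SPELLING_TO_REGION = {
    "talateh": "Jordanian",
    "thalatheh": "Bedouin",
    "talata": "Cairene",
    "tlete": "Lebanese",
    "salasa": "Egyptian",
}


def guess_regions(message):
    """Count how many tokens in a message point to each group."""
    tokens = message.lower().split()
    return Counter(
        SPELLING_TO_REGION[token] for token in tokens if token in SPELLING_TO_REGION
    )
```

A message that contains none of the listed spellings yields an empty counter, which is the honest answer for a table this small.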
The spellings can depend upon local pronunciation and the writer’s exposure to European languages. This example is far from complete because many words can be spelled in several dozen ways. Studies have identified at least 32 different ways that Western publications spell the name of the former head of Libya, Mu’ammar Qadhafi. His first name alone is commonly represented in at least five different ways in Western literature. Arabizi often includes even more variety in spelling because there are more users who are not following any standards.
[4] The smallest unit of sound in a language.
The Influence of Non-Arabic Languages
The job of transliterating Arabizi is more difficult when English is involved because the language is, like Arabizi, a polyglot tongue with orthography drawn from multiple linguistic traditions. These eight words all have the same vowel phoneme /i/ but are spelled differently: flea, free, niece, perceive, turkey, Phoebe, she, and ski. Arabizi users could choose any of them, and so a polyglot descendant of a polyglot language grows even more complex.
To make matters more complicated, each Arabic-speaking region often pronounces the same word differently. Two French-speaking writers could choose different spellings because the local versions of the word are voiced differently.
The source of the spelling is not always strictly auditory. The digit “7” is often used because it looks like a common Arabic figure. Some imitate the dot above the letter خ by putting a quote mark before it and some place it after. Some use an asterisk for the dot instead and also place it either before or after. So there are four common transliterations of just this one sound ('7, 7', *7, 7*), and some users will change it in the same message.
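One simple way to tame this kind of variation before further processing is to rewrite all four marked forms to a single canonical token. The Python sketch below is an assumption about how such a pre-processing step might look, not a description of any product. The canonical form `7'` and the name `normalize_kha` are arbitrary choices, and the sample tokens in the usage note are illustrative.

```python
import re

# Hypothetical canonical spelling for the dotted-7 sound.
CANONICAL_KHA = "7'"

# Matches '7, *7, 7' and 7* (a quote mark or asterisk on either side of the digit).
_KHA_VARIANTS = re.compile(r"['*]7|7['*]")


def normalize_kha(text):
    """Rewrite every marked-7 variant in `text` to the canonical spelling."""
    return _KHA_VARIANTS.sub(CANONICAL_KHA, text)
```

For example, `normalize_kha("*7alas")` and `normalize_kha("'7alas")` both produce the same string, while a plain `7` with no mark is left untouched. The rule is purely typographic, so an apostrophe that happens to sit next to a 7 for another reason would also be rewritten; a statistical model would be needed to tell those cases apart.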
Understanding can grow more complex when the users add in words from other languages. The mixture of French and Arabic is often called “Franko-Arab”; including English words produces what some call “Arablish.”
This evolution guarantees that there will be many forms of Arabizi and the spellings will change from region to region and social group to social group.
EXPLORE FURTHER
For more information or to request an evaluation, please call us at 617-386-2090 or 800-697-2062, or write to [email protected]. We will be happy to assist you in evaluating the performance of our products on your data.