The Burgeoning Challenge of Deciphering Arabic Chat
March 26, 2012

“Arabizi”, an informal dialect of Arabic typed on mobile phones and computer keyboards using the Latin alphabet, has spread widely via text messages and social networks. Analyzing messages written in this dialect is a challenge for analysts in government and industry because of wide variations in spelling, grammar, and diction.

We put the World in the World Wide Web®

ABOUT BASIS TECHNOLOGY

Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our Rosette® linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, financial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of the exponential growth of data storage volumes.

Software vendors, content providers, financial institutions, and government agencies worldwide rely on Basis Technology’s solutions for Unicode compliance, language identification, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major firms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Amazon.com, Bing, Google, and Yahoo!.

Company headquarters are in Cambridge, Massachusetts, with branch offices in San Francisco, Washington, London, and Tokyo.
For more information, visit www.basistech.com.

© 2012 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-15)

For the last several hundred years, technology has made it easier for dominant languages to drive out the smaller ones. Faster transportation, ubiquitous telephony, and efficient printing are helping the major languages dominate our communications. Studies by the Linguistic Society of America and the National Geographic Society estimate that more than half of the world’s approximately 7,000 languages will be extinct by 2100.¹

But technology also nurtures creativity, and new forms of writing are appearing in unforeseen places, driven by unexpected confluences. One of the newest is the Arabic chat alphabet, also called Arabizi, a casual version of written Arabic that appeared when Arabic speakers began using Western keyboards on mobile phones and computers to spell out their native language with the Roman alphabet. Thus, instead of typing مرحبا, they would type mar7aba (translation: “hello”).

The practice is a growing challenge for government intelligence agencies because the writing system is proliferating as portable phones, social media, and other digital channels become more common. More and more conversations flow through handsets, and an increasing percentage travel as text messages. Western social networks like Facebook and Twitter are also growing in popularity, and their users often choose Arabizi for their messages. In many cases, protests of the Arab Spring were planned, nurtured, and executed through messages passed in these channels, often as Arabic chat.
The format poses a unique problem for government analysts working with open source intelligence, because it is still evolving and the writers do not follow any standard rules for spelling, grammar, or diction. Speakers from different regions not only use different spellings but also write in their local dialect and even code-switch (insert other languages such as English or French), or mix dialectal Arabic with Modern Standard Arabic (MSA). The phonology, morphology, syntax, and lexicon of the dialects that these native speakers of Arabic are using are different from those of MSA. (See sidebar on the growth of Arabizi, “The Flourishing Garden of Arabizi.”)

While the collection phase of the intelligence cycle can gather text messages easily, the processing and exploitation phase is slowed or blocked entirely by messages that cannot be easily understood by automated tools trained only on MSA in Arabic script. Humans must read through the messages and choose the important ones. Automation is essential because there are simply not enough analysts to scan the large volume of text. Arabizi confuses conventional natural language processing tools for Arabic by mixing in non-Arabic letters, regional dialects, and foreign words, all of which are unexpected by tools trained on MSA.

UNDERSTANDING THE ATTRACTION OF ARABIZI

While the early Arabizi users were probably motivated by the need to send Arabic words with a Western keyboard, Arabizi also attracts current users for aesthetic, political, and personal reasons. Convenience is only part of the allure.

¹ See “What is an Endangered Language” by AC Woodbury, Linguistic Society of America, 2006. Also The Last Speakers: The Quest to Save the World’s Most Endangered Languages, K. David Harrison, National Geographic Press, Sept 2010. Also “Languages Die, But Not Their Last Words”, NY Times, Sept 19, 2007.
In one study conducted by the American University in Cairo,² 70 Arabic-speaking Facebook users were asked why they chose to write with Roman letters. More than 80% of the respondents said that they used Arabizi, and 40% used it “most of the time.” Those who did not use it largely said it was either out of respect for the Quran or as part of an effort to maintain a separate Arabic identity outside of Western influence. Indeed, many agreed it should not be used in a “religious context” even if it is acceptable for casual communication. About 20% said it made them feel “closer to each other,” a phrase that implies a role as a secret language that is not understood by outsiders of different ages, backgrounds, or education.

It is cool. Just as many English speakers may choose a more elegant-sounding French word, and Japanese companies add English words to the packaging of products, Arabic speakers who use Arabizi display a depth of knowledge and sophistication. The early users were, by necessity, well educated in European languages, so they knew instinctively how Arabic sounds mapped to Roman letter combinations. English and French classes are common in schools in the Arab world, and this nurtures the understanding of the Roman alphabet. This user profile makes Arabizi a mark of success that suggests that the writer is well educated and often well traveled.

Today the writing technique is also popular among young and agile minds because they are frequently the first to adopt new technology. Just as Western youths have improvised many acronyms and textual shorthand to simplify typing on mobile devices, the younger Arab users are also creating Roman approximations to meet their needs. Some suggest that they use Arabizi because it is often easier to type than Standard Arabic, especially when they have no training in using an Arabic keyboard.
The combination of political activity and the proliferation of technology fueled a greater focus on the Arabic chat format. While many of the original adopters may have turned to Arabizi out of a practical need to express Arabic words using Western technology, some suggest that the language is now a popular and sophisticated choice even when the technology supports traditional Arabic script. People are choosing Arabizi when other tools to spell MSA are available. The once casual slang is growing in importance and becoming a significant format of its own.

The form quickly came to the attention of non-Arabic speakers during the Arab Spring of 2011, when many political activists discovered that mobile phones and Facebook were ideal vectors for organizing protests and structuring the evolution of political dissent. The writing born of casual chat turned into a tool for revolution.

² From “Summary of Arabizi or Romanization: The dilemma of writing Arabic texts” by Randa Muhammed, Mona Farrag, Nariman Elshamly, and Nady Abdel-Ghaffar. Presented at Jīl Jadīd Conference, University of Texas at Austin, February 18-19, 2011.

THE CHALLENGE OF UNLOCKING ARABIC CHAT

Analyzing text written in Arabizi is a difficult problem because it lacks much of the structure that current technologies rely upon. Many algorithms and data-mining tools depend on a stable, predictable spelling and structure for words, rules that are standardized through dictionaries and schools. Arabizi’s improvisational origins produce something more chaotic. (See sidebar.) The wide range of words and influences can confound algorithms that assume stability and a fixed interpretation. All of this mathematical apparatus assumes that the structure of a language will not change, but Arabizi is transforming as each person chooses the closest approximation of a word that comes to mind.
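To make the spelling problem concrete, the following is a minimal Python sketch, not a description of any production system. It contrasts an exact-match dictionary lookup with a lookup on a crude "consonant skeleton" key. Only the spelling mar7aba comes from the text above; the other variant spellings, the toy lexicon, the function names, and the rule that the digit 7 may also be written as a plain "h" are assumptions made for illustration.

```python
import re

# Toy lexicon keyed by a single Arabizi spelling. Only "mar7aba" appears in
# the text; everything else in this sketch is hypothetical.
LEXICON = {"mar7aba": "hello"}

# Assumed convention: the digit 7 stands for the Arabic letter ح, which
# other writers may render as a plain "h".
DIGIT_EQUIVALENTS = str.maketrans({"7": "h"})


def loose_key(token: str) -> str:
    """Reduce a token to a crude consonant skeleton for fuzzy lookup."""
    key = token.lower().translate(DIGIT_EQUIVALENTS)
    key = re.sub(r"[aeiou]", "", key)    # writers disagree on vowels
    key = re.sub(r"(.)\1+", r"\1", key)  # collapse repeated letters
    return key


# Index the lexicon by skeleton instead of by exact spelling.
LOOSE_INDEX = {loose_key(word): gloss for word, gloss in LEXICON.items()}


def exact_lookup(token: str):
    """Lookup that assumes one fixed spelling per word."""
    return LEXICON.get(token)


def loose_lookup(token: str):
    """Lookup that tolerates vowel, digit, and repetition variation."""
    return LOOSE_INDEX.get(loose_key(token))


# Hypothetical ways different writers might type the same greeting.
variants = ["mar7aba", "marhaba", "mr7ba", "Mar7abaa"]
exact_hits = [exact_lookup(v) for v in variants]
loose_hits = [loose_lookup(v) for v in variants]

for variant, exact, loose in zip(variants, exact_hits, loose_hits):
    print(f"{variant:10} exact={exact!r:8} loose={loose!r}")
```

In this sketch the exact lookup recognizes only the one spelling stored in the lexicon, while the skeleton key matches all four variants. A key this coarse would also merge unrelated words, so it illustrates the nature of the problem rather than offering a solution.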
Many of the earliest approaches to analyzing Arabizi with these tools depended upon people translating the words directly into MSA and then using traditional algorithms on MSA to work with the results. The mechanism was tuned to MSA