LREC 2008

Arabic Dialect Processing

Mona Diab Nizar Habash Center for Computational Learning Systems Columbia University

Tutorial Contents

• Introduction • Description of MSA Phenomena • Description of Dialectal Phenomena • Sample Applications • Resources and References

2

1 Introduction • is a Semitic language • Forms of Arabic – Classical Arabic (CA) • Classical Historical texts • Liturgical texts – (MSA) • News media & formal speeches and settings • Only written standard – Dialectal Arabic (DA) • Predominantly spoken vernaculars • No written standards • Dialect vs. Language – Linguistics vs. Politics 3

Introduction

• ~300M people worldwide speak Arabic

• Arabic is the/an official language of 23 countries

• No native speakers of CA nor MSA

• In the Arabic speaking world, MSA and CA are the only Arabic taught in schools

4

2 Introduction • Arabic Diglossia – Diglossia is where two forms of the language exist side by side – MSA is the formal public language • Perceived as “language of the mind” – Dialectal Arabic is the informal private language • Perceived as “language of the heart” • General Arab perception: dialects are a deteriorated form of Classical Arabic • Continuum of dialects 5

Geographical Continuum

ArabicDialects Eastern Western Peninsula Northern Egyptian Maghreb Saharan Yemeni Levantine Cairene Libyan Chadian Hadrami North Sa’idi Judeo Hassaniya Sana’ani Lebanese Sudanese Tunisian TaiziAdeni Syrian Judeo Judeo South Algerian Omani Palestinian Moroccan Dhofari Jordanian Judeo Maronite Saudi Maltese Cypriot Hijazi Iraqi Najdi Southern Gulf Northern Kuwaiti Judeo Shihhi Baharna Other Tajiki Uzbeki Khorasan 6

3 ﻟﻢ ﻳﺸﺘﺮﻧﺰﺍﺭﻃﺎﻭﻟﺔﺟﺪﻳﺪﺓ lam ja ȓȓȓtari nizār Ńawilatan ζadīdatan didn’t buyNizartablenew ار اش ةة niz ār ma ȓtar āȓ Ńarab ēza gid īda ار اش وة niz ār ma ȓtar āȓ Ńawile ζdīde ار اش ةة nizar ma ȓrāȓ mida ζdīda

Nizar not-bought-not table new 7

Social Continuum

• Factors affecting dialect – Lifestyle • Bedouin, urban, rural – Education & Social Class – Religion • Muslim, Christian, Jewish, Druze, etc. – Gender

8

4 Social Continuum • Badawi’s levels – Traditional Arabic – Modern Arabic – Educated Colloquial – Literate Colloquial – Illiterate Colloquial • Polyglossia

9

Why Study Arabic Dialects?

• Almost no native speakers of Arabic sustain continuous spontaneous production of MSA • Ubiquity of Dialect – Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc. – Dialects are increasingly in use in new written media (newsgroups, weblogs, etc.) – Dialects have a direct impact on MSA phonology, syntax, semantics and pragmatics – Dialects lexically permeate MSA speech and text • Substantial Dialect-MSA differences impede direct application of MSA NLP tools 10

5 Why Study Arabic Dialects?

• Degrees of linguistic distance

Syntax Morphology Lexicon Phonology MSA-Dialect ++ +++ ++++ ++++ Inter-Dialect + +++ ++++ ++++ Intra-Dialect 0 0 + +

• Lack of standards for the dialects

• Lack of written resources 11

A Note on Romanization

• Phonological Transcription – IPA • Transliteration – Strict (one-to-one) • Buckwalter Encoding – Loose • Many spelling variants – Qadafi, kadaphi, kaddafy, etc. • This tutorial’s examples are in م – – Transcription (IPA) /sal ām/ – Transliteration (Buckwalter) slAm 12

6 Tutorial Contents

• Introduction • Description of MSA Phenomena – Orthography – Morphology – Syntax • Description of Dialectal Phenomena • Sample Applications • Resources and References

13

Arabic Script

ﺍﳋﹶﻂﹸﺍﻟﻌﺮﺑِﻲ

• Analphabet • Writtenrighttoleft • Letterhaveallographic variants • Optional • Commonligatures • UsedtowritemanylanguagesinadditiontoArabic: Persian,Kurdish,Urdu,Pashto,etc. 14

7 Arabic Script

Alphabet • letterforms

• lettermarks

15

Arabic Script

Alphabet

• letters (form+mark) ب ت ث س ش Distinctive • /ȓ/ /s/ /θ/ /t/ /b/ ا أ إ ئ ؤء Nondistinctive • /Ȥ/ glottalstopaka 16

8 Arabic Script

Nunation Vowel Diacritics بَ بً Zerowidthcharacters • • Usedforshortvowels /ban/ /ba/ بُ بٌ katab/ towrite/ آََ • Nunationisusedfor /bun/ /bu/ nominalindefinite بِ بٍ markerinMSA /bin/ /bi/ kitābun/ abook 17/ آَِبٌ

Arabic Script

Diacritics No Vowel بْ ( Novowelmarker( sukun • /makt ab/ office /b/ ََْ

• Doubleconsonantmarker Double (shadda ) Consonant بّ katta b/ todictate/ آَ /bb/ ب ب ب Combinable • /bbu/ /bbin/ /bban/ 18

9 Arabic Script

Puttingittogether Simple combination

عَرَب  ََب = ب /Arab/ȥarab

غَرْب  َْب = ب /West/ ȑarb Ligatures

سلام   م م /Peace/salām 19

Arabic Script “Arabic ” Numerals • Decimalsystem • Numberswrittenlefttorightinrighttolefttext ﺍﺳﺘﻘﻠﺖﺍﳉﺰﺍﺋﺮﰲﺳﻨﺔ 1962 ﺑﻌﺪ 132 ﻋﺎﻣﺎﻣﻦﺍﻻﺣﺘﻼﻝﺍﻟﻔﺮﻧﺴﻲ.

Algeriaachieveditsindependencein1962after132yearsofFrenchoccupation. • Threesystemsofenumerationsymbolsthatvarybyregion Western Arabic 0 1 2 3 4 5 6 7 8 9 Tunisia, Morocco, etc. ٩ ٨ ٧ ٦ ٥ ٤ ٣ ٢ ١ ٠ Indo-Arabic Middle East Eastern IndoArabic ٩ ٨ ٧ ٣ ٢ ١ ٠ .Iran, Pakistan, etc

20

10 Phonology and Spelling

• Phonological profile of Standard Arabic – 28 Consonants – 3 short vowels, 3 long vowels, 2 diphthongs • Arabic spelling is mostly phonemic … – Letter-sound correspondence ؤإأء ئى ا ب ت ةثجح خ ذد ر زسش ص ض عظط غ ف ق ك ل م ن و ي

ī j ū w h n m l k q f ȑȥ δ t d s ȓ s z r δ d x ħ ȴθ t b ā Ȥ

21

Phonology and Spelling

• Arabic spelling is mostly phonemic … Except for • Medial short vowels can only appear as diacritics • Diacritics are optional in most written text – Except in holy scripture – Present diacritics mark syntactic/semantic distinctions kutib/ to be written/ آُ katab/ to write/ آ • ħubb/ love َ /ħabb/ seed/ ُ • as consonant and long vowel ي ,و ,ا Dual use of • (/j/,/ ī/) ي (/w/,/ ū/) و (/ā /,/‘/) ا – 22

11 Phonology and Spelling

• Arabic spelling is mostly phonemic … Except for (continued) • Morphophonemic characters ( ta marbuta) ة Feminine marker – (♀ kab īra/ (big/ آة (♂ kab īr/ (big/ آ • – Derivation marker •/ȥas a/ (to disobey ) (a stick ) • Hamza variants (6 characters for one phoneme!) (baha’/ + 3MascSing (his glory/ ء ؤ ( ءأإؤئ)–

23

Phonology and Spelling

• Arabic spelling can be ambiguous – optional diacritics and dual use of letter • But how ambiguous? Really? • Classic example ths s wht n rbc txt lks lk wth n vwls this is what an Arabic text looks like with no vowels • Not exactly true – Long vowels are always written ’alef‘ ا Initial vowels are represented by an – – Some final short vowels are represented

ths is wht an Arbc txt lks lik wth no vwls

Will revisit ambiguity in more detail again under morphology discussion 24

12 Tutorial Contents

• Introduction • Description of MSA Phenomena – Orthography – Morphology – Syntax • Description of Dialectal Phenomena • Sample Applications • Resources and References

25

Morphology

• Type – Concatenative: prefix, suffix, circumfix – Templatic: root+pattern • Function – Derivational • Creating new words • Mostly templatic – Inflectional • Modifying features of words – Tense, number, person, mood, aspect

• Mostly concatenative 26

13 Derivational Morphology

• Templatic Morphology ك تب Root • b t k

? ا ?ِ ? مَ ? ?و ? Pattern • ūma i ā آ ب Lexeme • maktūb kātib

Lexeme.Meaning = written writer (Root.Meaning+Pattern.Meaning)*Idiosyncrasy.Random 27

Derivational Morphology Root Meaning ”KTB = notion of “writing ك ت ب•

آ آب /kitāb/ /katab/ book write ب ب /maktaba/ /maktūb/ /maktūb/ library written letter آ /maktab/ /kātib/ office writer 28

14 Root Polysemy LHM-1 LHM-2 LHM-3 “meat” “battle” “soldering” /la ħm/ /mal ħama/ /la ħam/ Meat Fierce battle Weld, solder, la ħħām/ Massacre stick, cling/ م Butcher Epic

29

Derivational Morphology Pattern Meaning

• VerbPatternMeaningishardtodefinesystematically

Pattern PatternMeaning Example Gloss I 1a2a3 Basicsenseofroot ktb  katab write II 1a22a3 Intensification,causation ktb  kattab dictate III 1aA2a3 Interactionwithothers ktb  kaAtab correspondwith IV Aa12a3 Causation jls  Ajlas seat V ta1a22a3 ReflexiveofPatternII Elm  taEal~am learn VI ta1aA2a3 ReflexiveofPatternIII ktb  takaAtab correspond VII Ain1a2a3 PassiveofPatternI ktb  Ainkatab subscribe/enroll VIII Ai1ta2a3 Acquiescence,exaggeration ktb  Aiktatab register IX Ai12a33 Transformation Hmr  AiHmarr Turnred/blush X Aista12a3 Requirement ktb  Aistaktab ask/make_write

30

15 Inflectional Morphology • Derivational Morphology – Lexeme ≈ Root + Pattern • Inflectional Morphology – Word = Lexeme + Features • Features – Part-of-speech • Traditional : Noun, Verb, Particle • Computational : N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others – Noun-specific • Number: singular, dual, plural, collective • Gender: masculine, feminine • Definiteness: definite, indefinite • Case: nominative, accusative, genitive • Possessive clitic 31

Inflectional Morphology

• Features (continued) – Verb-specific • Aspect: perfective, imperfective, imperative • Voice: active, passive • Tense: past, present, future • Mood: indicative, subjunctive, jussive • Subject (Person, Number, Gender) • Object clitic – Others • Single-letter conjunctions • Single-letter prepositions

32

16 Inflectional Morphology Nouns

poss plural noun article prep conj

وت وآ /wakabiyūtinā/ /walilmaktabāt/ و+ل+ ال+ + ات و+ك+ ت+ wa+ka+biyūt+nā wa+li+al+maktaba+āt and+like+houses+our and+for+the+library+plural And like our houses And for the libraries

(  ل + ال.Morphotactics (e.g • • Arabic Broken Plurals (templatic) 33

Inflectional Morphology Verbs

object subj verb tense conj

و ه /faqulnāhā/ /wasanaqūluhā/ و +س ن+ + ل + ه ف+ ل + ه+ fa+qul+na+hā wa+sa+na+qūl+u+hā so+said+we+it and+will+we+say+it So we said it. And we will say it

• Morphotactics

• Subjectconjugation(suffixorcircumfix) 34

17 Inflectional Morphology

• Perfect verb subject conjugation ( suffixes only ) Singular Dual Plural katabnā آ katabtu آ ُ 1 katabtum آ katabtum ā آ katabta آ َ 2 katabtū آ اkatab ā آ kataba آ َ 3 • Imperfect verb subject conjugation ( prefix+suffix ) Singular Dual Plural aktubu ُ naktubu ا آُ 1 taktub ūn ن taktub ān ن taktubu ُ 2 yaktub ūn ن yaktub ān ن yaktubu ُ 3 35 Feminineformandotherverbmoodsnotshown

Morphological Ambiguity • Derivational ambiguity basis/principle/rule, military base, Qa'ida/Qaeda/Qaida : ة – • Inflectional ambiguity – : you write, she writes – Segmentation ambiguity and+grandfather :و+ ; found : و • • Spelling ambiguity – Optional diacritics kātib/ writer , /kātab/ to correspond/ : آ • – Suboptimal spelling ا  إ ,أ :Hamza dropping •  ة :Undotted ta-marbuta • ى  ي :Undotted final ya •

36

18 Morphological Ambiguity

• Multiple sources of ambiguity ﺑﲔ – /bayyana/ Verb he declared/demonstrated

– /bayyanna/ Verb they fem declared/demonstrated – /bayyin/ Adj clear/evident/explicit – /bayna/ Prep between/among – /biyin/ Proper Noun in Yen – /biyn/ Proper Noun Ben

37

Levels of Representation WhichlevelforNLPapplications ?

وــــــــت Naturaltoken • وت Word• ولات SegmentedWord • و + + اتPrefix+Stem+Suffix •

[و+ ل+Lexeme+Features [+Pl+Def • • Root+Pattern+Features [و+ ل+Pl+Def+] + مa3a21aة+ كتب

• etc. 38

19 Tutorial Contents

• Introduction • Description of MSA Phenomena – Orthography – Morphology – Syntax • Description of Dialectal Phenomena • Sample Applications • Resources and References

39

Morphology and Syntax

• Rich morphology crosses into syntax – Pro-drop / Subject conjugation – Verb subcategorization and object clitics

• Verb transitive +subject+object

• Verb intransitive +subject • Verb passive +subject

• Morphological interactions with syntax – Agreement • Full : e.g. Noun-Adjective on number, gender, and definiteness • Partial : e.g. Verb-Subject on gender (in VSO order) – Definiteness • Noun compound formation, copular sentences, etc. • Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc. 40

20 Morphology and Syntax

• Morphological interactions with syntax (continued) – Case • MSA is case marked: nominative, accusative, genitive • Almost-free word order • Case is often marked with optionally written short vowels – This effectively limits the word-order freedom in published text • Agglutination – Attached prepositions create words that cross phrase boundaries li+Almaktabāt ل+ات forthelibraries [PPli [NPAlmaktabāt]] • Some morphological analysis ( minimal segmentation ) is necessary

41

Sentence Structure

Two types of Arabic Sentences • Verbal sentences – [Verb Subject Object] (VSO) آ اودار – Wrote the-boys the-poems The boys wrote the poems • Copular sentences – [Topic Complement] اود اء – the-boys poets The boys are poets

42

21 Sentence Structure • Verbal sentences – Verb agreement with gender only wrote 3MascSing the-boy/the-boys آا\ اود • wrote 3FemSing the-girl/the-girls آ ا\ ات • – Pronominal subjects are conjugated wrote-you MascSing آ ُ • wrote-you MascPl آ • wrote-they MascPl آ ا • – Passive verbs

• Same structure: Verb passive Subject underlyingObject • Agreement with surface subject 43

Sentence Structure

• Verbal sentences – Common structural ambiguity • Third masculine/feminine singular are structurally ambiguous

– Verb 3MascSing Noun Masc Verb subject=he object=Noun Verb subject=Noun • Passive and active forms are often similar in standard orthography kataba/ he wrote/ آ – kutiba/ it was written/ آُ –

44

22 Sentence Structure

• Copular sentences – [Topic Complement] Definite Topic, Indefinite Complement ا • the-boy poet The boy is a poet – [Auxiliary Topic Complement] Auxiliaries ( kāna and her sisters ) • Tense, Negation, Transformation, Persistence was the-boy poet The boy was a poet آن ا ا • is-not the-boy poet The boy is not a poet ا ا • – Inverted order is expected in certain cases • Indefinite topic ȥandi kit ābun/ at-me a-book I have a book/ يآب 45

Sentence Structure

• Copular sentences – Types of complements • Noun/Adjective/Adverb the-boy smart The boy is smart ا ذآ – • Prepositional Phrase the-boy in the-library The boy is in the library ا ا – • Copular-Sentence the-boy [book-his big]] The boy, his book is big] ا آآ – • Verb-Sentence اود آ ا ار –

[the-boys [wrote 3MascPl -they poems]] The boys wrote the poems – Full agreement in this order (SVO) ار آ اود –

[the-poems [wrote 3MascSing -it the boys]] The poems, the boys wrote 46

23 Phrase Structure

• Noun Phrase – Determiner Noun Adjective PostModifier هاااحادمان • this the-writer the-ambitious the-arriving from Japan This ambitious writer from Japan – Noun-Adjective agreement • number, gender, definiteness the-writer fem the-ambitious fem اا – the-writer femPl the-ambitious femPl اتات –

47

Phrase Structure

• Noun Phrase ( ا ) Idafa construction – • Noun1 of Noun2 encoded structurally • Noun1-indefinite Noun2-definite اردن • king Jordan the king of Jordan / Jordan’s king – Noun1 becomes definite • Agrees with definite adjectives – Idafa chains 1 2 n-1 n • N indef N indef …N indef N def اررادارةاآ • son uncle neighbor chief committee management the-company The cousin of the CEO’s neighbor 48

24 Phrase Structure

• Morphological definiteness interacts with syntactic structure writer آ Word 1

definite Indefinite Noun Phrase Noun Compound آ ان اا ن

artist The artist(ic) writer The writer of the artist definite

Copular Sentence Noun Phrase آ ن ا ن The writer is an artist An artist(ic) writer نWord2 indefinite

49

Tutorial Contents

• Introduction • Description of MSA Phenomena • Description of Dialectal Phenomena – Orthography – Lexicon – Morphology – Syntax – Code switching • Sample Applications • Resources and References

50

25 Phonological Variations MSA ؤإأء ئى ا ب ت ةثجح خ ذد ر زسش ص ض عظط غ ف ق ك ل م ن و ي

ī j ū w h n m l k q f ȑȥ δ t d s ȓ s z r δ d x ħ ȴθ t b ā Ȥ LEV ؤإأء ئى ا ب ت ةثجح خ ذد ر زسش ص ض عظط غ ف ق ك ل م ن و ي

ī j ū w h n m l k q f ȑȥ δ t d s ȓ s z r δ d x ħ ȴθ t b ā Ȥ ē ō z

• phonemequalitydifferences 51

Phonological Variations

• Majorvariants MSA Dialects /q/ /q/,/k/,/ Ȥ/,/g/,/ ȴ/ ق /θ/ /θ/,/t/,/s/ ث /δ/ /δ/,/d/,/z/ ذ /ȴ/ /ȴ/,/g/ ج

• Someofmanylimitedvariants • /l/ /n/MSA:/burtuqāl/  LEV:/burt Ȥān/‘orange’ •/ȥ/  /ħ/MSA:/ka ȥk/  EGY:/ka ħk/‘cookie’ • Emphasisadd/delete:MSA:/fustān/  LEV:/fuṣṭān/‘dress’ 52

26 Script Choices

• Arabic script: + continuity with MSA + masks the vocalic and some consonantal difference across dialects - ambiguity • Latin script + precision – lose connections among dialects (within dialects) - politically loaded • Other scripts – Hebrew and Syriac - Different religious/ethnic preferences 53

Arabic Script Orthographic Variants IRQ LEV EGY TUN MOR ج ج چ ج ج /ȴȴȴ/ ڭ ڨ ج چ گ /g/ چ /tȓȓȓ/ پ پ پ پ پ /p/ ڥ ڥ ڤ ڤ ڤ /v/ (ڧ ,ڢ) MOR=( ق,ف)Historicalvariants:MSA • ء (Habash1999) ۆ /ō/ , ى ڧ/Modernproposals:LEV/Ȥ/,/ē • ^ 54

27 Syrian Arabic in Arabic Script

رحإار .. اوآا واةواة ... هآا .. دو واااتو و.. ...و .. وا إذار .. آ لاازمآ وأآ

55 http://www.soriagate.net/showthread.php?t=32678

Latin Script Akl 1961 • SeveralproposalstotheArabic LanguageAcademyinthe1940s • SaidAkl Experiment(1961) • WebArabic(Arabish,Francoarabe) – Nostandard,butcommonconventions

IPA Latin IPA Latin θ/ th/ ث ȤȤȤ/ ‘ 2 Ø/ أإءؤئ ṭ/ t T 6/ ط a/,/t/ a t/ ة ȥ/ ‘ 3 Ø/ ع ħ H h 7 ح ’ȑȑȑ/ g gh 3/ غ x/ kh 7’ x 8/ خ q/ q/ ق δ/ th/ ذ ,y//ay/ y,i,e/ ي ȓ/ sh ch / ش /ī//ē/ ai,ei,… 56

28 in Latin Script

nadeity bsho2 nadeit olteely ta3ala geit laha3atbek 3alli fat wala 7atta haloom 3aleiky adeeni rge3telek adeeni bein edeiky kefaya dmoo3 ba2a mush 3aref ashoof 3eneiky

57 http://www.jencomics.com/artist_a/amr_diab_lyrics/adeeni_rege3tilik_lyrics.html

The Case of Maltese

• An Arabic dialect that is considered a separate language • Standardized Latin-based orthography

58 http://www.languagemuseum.com/m/maltese.htm

29 Hebrew Script

• Example from Tunisian Judeo-Arabic

“The Ballad of Hannah and her Seven Sons “

59 http://www.uwm.edu/~corre/jatexts/

Lack of Orthographic Standards

• Orthographic inconsistency

• Egyptian /mabin Ȥulhalak ȓ/

– mA binquwlhA lak$ – mAbin&ulhalak$ – mA bin}ulhAlak$ – mA binqulhA lak$ –…

60

30 Spelling Inconsistency I

61 http://www.languagemuseum.com/a/arabicnorthlevantinespoken.php

Spelling Inconsistency II

• ya alain lesh el 2aza ti7keh 3anneh kaza w kaza iza bidallak ti7keh hek 2areeban ra7 troo7 3al 3aza

chi3rik 3emilleh na2zeh li2anneh manneh mi2zeh bass law baddik yeha 7arb fikeh il layleh ra7 3azzeh

62 http://www.onelebanon.com/forum/archive/index.php/t8236.html

31 Tutorial Contents

• Introduction • Description of MSA Phenomena • Description of Dialectal Phenomena – Orthography – Lexicon – Morphology – Syntax – Code switching • Sample Applications • Resources and References

63

Lexical Variation

• Arabic Dialects vary widely lexically English Table Cat Of I_want There_is There_isn’t MSA Tāwila qiTTa idafa ‘uridu yūjadu lā yujadu ار Ø و Moroccan mida qeTTa dyāl bγīt kāyn mā kāynš آ آ دل ة Egyptian Tarabēza ‘oTTa bitāς ςāwez fī mafīš وز ع ة Syrian Tāwle bisse tabaς biddi fī mā fi ي و Iraqi mēz bazzūna māl ‘arīd aku māku اآ ار ل و • Arabic orthography allows consolidating some variations 64

32 Lexical Variation

• EGY:reproduce – GLF: give condolences EGY:press iron – GLF:buttocks ى • EGY:kettle - LEV:fridge اد • EGY:prostitute - LEV:woman ا • • EGY/LEV:okay – MOR:not • EGY/LEV:make happy – IRQ:beat up EGY/LEV:health – MOR:hell fire ا • • LEV:start – SUD:end

65

Foreign Borrowings wky okay< أوآ • • mrsy merci (bndwrp pomodoro (italian ورة • (byrA birra (italian ا • • frmt format tlfwn telephone ن • • talfan to phone

66

33 Tutorial Contents

• Introduction • Description of MSA Phenomena • Description of Dialectal Phenomena – Orthography – Lexicon – Morphology – Syntax – Code switching • Sample Applications • Resources and References

67

Morphological Variation

• Nouns – No case marking • Word order implications – Paradigm reduction • Consolidating masculine & feminine plural • Verbs – Paradigm reduction • Loss of dual forms • Consolidating masculine & feminine plural (2 nd ,3 rd person) • Loss of morphological moods – Subjunctive/jussive form dominates in some dialects – Indicative form dominates in others • Other aspects increase in complexity

68

34 Morphological Variation Verb Morphology

object subj verb tense conj

neg IOBJ neg

MSA EGY و آه ش وه /walam taktubūhā lahu/ /wimakatabtuhalūȓ/ /wa+lam taktubū+hā la+hu / /wi+ma+katab+tu+ha+lū+ȓ/ and+not_past write_you+it for+him and+not+wrote+you+it+for_him+not

Andyoudidn’twriteitforhim

69

Morphological Variation

Perfect Imperfect Past Subjunctive Present Present Future habitual progressive آ MSA /kataba/ /jaktuba/ /jaktubu/ /sajaktubu/ آ LEV /katab/ /jiktob/ /bjoktob/ /ȥam bjoktob/ /ħajiktob/ ه آ EGY /katab/ /jiktib/ /bjiktib/ /hajiktib/ رح د آ IRQ /kitab/ /jiktib/ /dajiktib/ /ra ħ jiktib/ آ آ MOR /kteb/ /jekteb/ /kjekteb/ /ȑajekteb/

70

35 Morphological Variation Verb conjugation Perfect Imperfect 1S 2S ♂ 2S ♀ 1S 1P 2S ♀ َ ُ ا ُآ آ ِ آ َ آ ُ MSA /katabtu/ /katabta/ /katabti/ /aktubu/ /naktubu/ /taktub īna/ /taktub ī/ ا آ آ آ LEV /katabt/ /katabti/ /aktob/ /noktob/ /toktobi/ ا آ آ آ IRQ /kitabit/ /kitabti/ /aktib/ /niktib/ /tikitb īn/ ا آ آ MOR /ktebt/ /ktebti/ /nekteb/ /nektebu/ /tektebi/

71

Tutorial Contents

• Introduction • Description of MSA Phenomena • Description of Dialectal Phenomena – Orthography – Lexicon – Morphology – Syntax – Code switching • Sample Applications • Resources and References

72

36 Idafa Construction • Genitive/Possessive Construction • Both MSA and dialects • Noun1 Noun2 اردن • king Jordan the king of Jordan / Jordan’s king • Ta-marbuta allomorphs Idafa No Idafa Waqf MSA +at +a EGY +it +a • Dialects have an additional common construct • Noun1 Noun2 the-king belonging-to Jordan ا اردن :LEV • 73 • differs widely among dialects

Demonstrative Articles

• Forms Word Proclitic Proximal Distal ذ, , او ,ها ,ه هء - MSA ,د دي , دول - EGY هاك , ه , هوك ها , هدي , هول +هـ LEV • Word Order (Example: this man )

Pre-nominal Post-nominal X هاا MSA ااد EGY X ال ها هاال LEV 74

37 Negation of Declarative Verbal Sentences

Pre Circum Post , ,, MSA X X lA, lm, ln, mA ... ش EGY X m$ mA … $ ش ... ش , LEV mA, m$ mA … $ $

75

Sentence Word Order

• Verbal sentences – The boys wrote the poems – MSA • Verb Subject Object (Partial agreement) آ اودار

wrote masc the-boys the-poems • Subject Verb Object (Full agreement) V-S V(S) S-V اود آا ار the-boys wrote mascPl the-poems explicit pro explicit subject dropped subject – LEV, EGY subject • Subject Verb Object MSA 35% 30% 35% اود آ ار

The-boys wrote mascPl the-poems LEV 10% 60% 30% • Less present: Verb Subject Object VerbSubjectdistributionsin theLevantineArabicTreebank آ اودار wrote mascPl the-boys the-poems (Maamourietal,2006) • Full agreement in both orders 76

38 Lexico-syntactic Variation

• ‘want’ (Levantine)

77

Tutorial Contents

• Introduction • Description of MSA Phenomena • Description of Dialectal Phenomena – Orthography – Lexicon – Morphology – Syntax – Code switching • Sample Applications • Resources and References

78

39 Code Switching

MSA MSAandDialectmixinginspeech • phonology,morphologyandsyntax LEV

أاراام دهااااوي وععارضأمأن ةداروأ ناامااوأننرداو إانأو أآ نهااع، ي عإزاتا إزاتاه ام ن م ر ام نا م ر و ا ه اوادأل ر اة ن ولوأهااعر عات ا بودئبا هإ إ ب ر رهنر اانإقار اا اإاءاتالهوه د اا آوآ ااوانأءهاا كار وح اإباآنعدئهم ا ا واااأألاراترا أاو اااعآنادإهااع،أ اعاا أ هاهالإارأوه أوإ إدةاب دااواإه ره هه ااها هااع. 79

AljazeeraTranscripthttp://www.aljazeera.net/programs/op_direction/articles/2004/7/7231.htm

Code Switching with English

• Iraqi Arabic Example – ya ret 3inde hech sichena tit7arrak wa77ad-ha , 7atta ma at3ab min asawwe zala6a yomiyya :D – 3ainee Zainab, tara hathee technology jideeda, they just started selling it !! Lets ask if anybody knows where do they sell them ! :

80 http://www.aliraqi.org/forums/archive/index.php/t16137.html

40 Dialectal Impact on MSA

• Loss of case endings and nunation in read MSA /fī bajt ȴad īd/ instead of /fī bajtin ȴad īdin/ ‘in a new house’ • A shift toward SVO rather than VSO in written MSA

81

Dialectal Impact on MSA

• Structure borrowing • Example: monies and properties of the company

االاآو – •/Ȥamw ālu ȓȓ arikati wamumtalak ātuh ā/ • monies the-company and-properties-its

االوتاآ – •/Ȥamw ālu wamumtalak ātu ȓȓ arikati/ • monies and-properties the-company

82

41 Dialectal Impact on MSA

• Code switching in written MSA • Dialectal lexical and structural uses – Example Newswire Alnahar newspaper (ATB3 v.2)

اانوان ا f>x* ElY xATr AlAxwAn wmn hqhm An yzElw

then was-taken upon self the-brothers and-from right-their to be-angry

‘they were upset, and they had the right to be angry’

83

Tutorial Contents

• Introduction • Description of MSA Phenomena • Description of Dialectal Phenomena • Sample Applications – Automatic speech recognition – Dictionary creation – Morphological analysis – Part-of-speech tagging – Syntactic parsing – Machine translation • Resources and References

84

42 Arabic ASR: State of the Art

• BBN TIDESOnTap: 15.3% WER • BBN CallHome system: 55.8% WER • JHU WS 2002: 53.8% WER • WER on conversational speech noticeably higher than for other languages (eg. 30% WER for English CallHome)

85

JHU WS02 Approach improvementstoArabicASRthrough

developingnovel developingtechniques modelstobetter forusingoutofcorpus exploitavailabledata data

Factoredlanguagemodeling Automatic Integrationof romanization MSAtextdata

86 Slidecourtesyof(Kirchhoffetal.2002)

43 Approach Details

• Factored Language Models – complex morphological structure leads to large number of possible word forms – break up word into separate components – build statistical n-gram models over individual morphological components rather than complete word forms • Automatic Vowelization/Diacritization – try to predict vowelization automatically from data and use result for recognizer training • Integrate data from MSA written sources

87

JHU WS02 Results (WER)

Graphemebasedrecognizer Phonebasedrecognizer

Baseline 59.0 Random 59 65 62.7% Automatic 58 Base romanization 60 line Additional 57.9% 57 55.8 % Callhome data55.1% 55 56 Language True modeling53.8% 55 romanization 50 54.9% Oracle 54 46% 45 53

52 40

88 Slidecourtesyof(Kirchhoffetal.2002)

44 Dialect-MSA Dictionary

• Problem: total lack of Dialect-MSA resources – No Dialect-MSA parallel text – No paper dictionaries for Dialect-MSA

• Dialect-MSA dictionary is required for many NLP applications exploiting MSA resources – e.g., to translate dialect sentences to MSA before parsing them with an MSA parser

89

Levantine-MSA Dictionary

• The Automatic-Bridge dictionary (AB) – English as a bridge language between MSA and LA • The Egyptian-Cognate dictionary (EC) – Levantine-Egyptian cognate words in Columbia University Egyptian- MSA lexicon (2,500 lexeme pairs) • The Human-Checked dictionary (HC) – Human cleanup of the union of AB and EC – Using lexemes speeded up the process of dictionary cleaning • reducing the number of entries to check • minimizing word ambiguity decisions – Morphological analysis and generation are required to map from inflected LA to inflected MSA • The Simple-Modification dictionary (SM) – Minimal modification to LA inflected forms to look more MSA-like ('gnyA< أء ) gnyA ‘rich pl.’) is mapped to< أ ) :Form modification – أب ) b$rb ‘I drink’) is mapped to ب ) :Morphology modification – >$rb) AyDAF) 90 ا ) kmAn ‘also’) is mapped to آن ) :Full translation – (Maamourietal.2006)

45 Dialectal Morphological Analysis

• MAGEAD (Habash and Rambow 2006) – Morphological Analysis and GEneration for Arabic and its Dialects • Levels of Morphological Representation – Lexeme Level

Aizdahar 1 PER:3 GEN:f NUM:sg ASPECT:perf – Morpheme Level [zhr,1tV2V3,iaa] +at – Surface Level • Phonology: /izdaharat/ اِزْدهت • Orthography: Aizdaharat ( ) 91

The Lexeme

• Lexeme is an abstraction of all inflectional variants of a word ... آناآآُُآِب comprises | ﻛِﺘﺎﺏ| – • For us, lexeme is formally a triple – Root or NTWS – Morphological behavior class (MBC) ’house‘ { ت } .verse’ vs‘{ ات}• – Meaning index ’rule‘ { ةا} :| ﻗﺎﻋﺪﺓ 1| • ’military base‘ { ةا} :| ﻗﺎﻋﺪﺓ2| • 92

46 Morphological Behavior Class

• MBC::Verb-I-au ( katab/yaktub ) cnj=wa  wa+ tense=fut  sa+ per=1 + num=sg  ‘+ per=1 + num=pl  n+ mood=indic  +u mood=sub  +a aspect=imper  V12V3 aspect=perf  1V2V3 voice=act  a-u voice=pass  u-a obj=3FS  hA obj=1P  nA 93 …

Morphological Behavior Class

• MBC::Verb-I-au ( katab/yaktub ) cnj=wa  wa+ وََََُُْ +tense=fut  sa per=1 + num=sg  ‘+ per=1 + num=pl  n+ wasanaktubuhA mood=indic  +u mood=sub  +a aspect=imper  V12V3 aspect=perf  1V2V3 voice=act  a-u voice=pass  u-a obj=3FS  hA obj=1P  nA Wewillwriteit 94 … MSA EGY

47 Morphological Behavior Class

• MBC::Verb-I-au ( katab/yaktub ) cnj=wa  wa+ wi+ وََََُُْ +tense=fut  sa+ Ha per=1 + num=sg  ‘+ per=1 + num=pl  n+ n+ wasanaktubuhA mood=indic  +u +0 mood=sub  +a wiHaniktibhA aspect=imper  V12V3 V12V3 وََِِِْْ aspect=perf  1V2V3 voice=act  a-u i-i voice=pass  u-a obj=3FS  hA hA obj=1P  nA Wewillwriteit 95 … MSA EGY

Morphological Behavior Class

• MBC::Verb-I-au ( katab/yaktub ) cnj=wa  wa+ wi+  [CONJ:wa] tense=fut  sa+ Ha+  [PART:FUT] per=1 + num=sg  ‘+ per=1 + num=pl  n+ n+  [SUBJ_PRE_1P] mood=indic  +u +0  [SUBJ_SUF_Ind ] mood=sub  +a aspect=imper  V12V3 V12V3  [PAT:I-IMP] aspect=perf  1V2V3 voice=act  a-u i-i  [VOC:Iau-ACT] voice=pass  u-a obj=3FS  hA hA  [OBJ:3FS] obj=1P  nA 96 … MSA EGY

48 Morphological Behavior Class

• MBC::Verb-I-au ( katab/yaktub ) cnj=wa  [CONJ:wa] tense=fut  [PART:FUT]

per=1 + num=pl  [SUBJ_PRE_1P] mood=indic  [SUBJ_SUF_Ind]

aspect=imper  [PAT:I-IMP]

voice=act  [VOC:Iau-ACT]

obj=3FS  [OBJ:3FS]

97 …

Levantine Evaluation

• Results on Levantine Treebank

98% 100% 94%

80% 60% 60%

40%

20%

0% MSAMSA MSALEV LEVLEV

ContextTokenRecall

98

49 Arabic Dialect POS Tagging

• Duh and Kirchhoff 2005;Duh and Kirchhoff 2006 – Egyptian Arabic and – Minimal supervision • dialectal text • and MSA morphological analyzer – Cross-dialect sharing techniques • Rambow et al. 2005 – Levantine Arabic – LEV-MSA transduction using LEV-MSA lexicon – MSA POS Tagging – Projection of MSA tags unto LEV 99

Arabic Dialect Parsing

• Possible Approaches – Annotate corpora (“Brill Approach”) • Too expensive – Leverage existing MSA resources • Difference MSA/dialect not enormous • Linguistic studies of dialects exist • Too many dialects: even with dialects annotated, still need leveraging for other dialects

100

50 Parsing Arabic Dialects: TheProblem - Dialect - - MSA -

ازمش اهدا Treebank

SmallUAC ? like

Parser ازم ش ا work not men

BigUAC هدا this

101

SentenceTransductionApproach

- Dialect - - MSA -

ازمش اهدا الهاا Translation Lexicon like like Parser ال ا work ازم ش ا work not men not men Big LM ها this هدا this

102 (Rambowetal.2005;Chiangetal.2006)

51 MSA Treebank Transduction

- Dialect - - MSA -

Small LM

Treebank Treebank ازمشاهدا

Tree Transduction Parser

ازم ش ا هدا 103 (Rambowetal.2005;Chiangetal.2006)

Grammar Transduction

- Dialect - - MSA -

Probabilistic TAG

Treebank ازمشا Probabilistic هدا TAG

Parser Tree Transduction

TAG = Tree Adjoining Grammar ازم ش ا هدا

104 (Rambowetal.2005;Chiangetal.2006)

52 Dialect Parsing Results

Absolute/Relative F-1 improvement No Tags Gold Tags Sentence 4.2/9.0% 3.8/9.5% Transduction Treebank 3.5/7.5% 1.9/4.8% Transduction Grammar 6.7/14.4% 6.9/17.3% Transduction

Dialect-MSA dictionary was the biggest contributor to improved parsing accuracy: more than a 10% reduction on F1 labeled constituent error

105 (Rambowetal.2005;Chiangetal.2006)

Arabic Dialect Machine Translation • Problems – Limited resources – Non-standard Orthography – Morphological complexity • Solutions – Rule-based segmentation (Riesa et al. 2006) – Minimally supervised segmentation (Riesa and Yarowsky 2006) – Spelling normalization (Riesa et al. 2006) – Leveraging MSA resources (Riesa et al. 2006, Zollman et al. 2006, Rambow et al. 2005) – Dialect-MSA lexicons (Rambow et al. 2005, Chiang et al. 2006, Maamouri et al. 2006)

106

53 Arabic Dialect Machine Translation

• TransTac: DARPA Program on Trans lation System for Tac tical Use – Iraqi <> English speech-to-speech MT – Phraselator: http://www.phraselator.com/

• MT as a component – JHU Workshop on Parsing Arabic dialect (Rambow et al. 2005, Chiang et al. 2006)

107

Dialect Resources

• Most work on Arabic dialects focuses on Automatic Speech Recognition • Speech/transcript corpora – Egyptian and Levantine Arabic (LDC) – Moroccan and (ELDA) – Gulf Arabic (Appen) – Many other… • Few lexicons/morphology resources – CallHome Egyptian Arabic monolingual lexicon (LDC) – CallHome Egyptian Verb transducer (LDC) • Work on multi-dialectic resources – Linguistic Data Consortium – Columbia University Arabic Dialect Modeling (CADIM) Group • Pan-Arab lexicon and Pan-Arab Morphology • Novel Approaches to Arabic Speech Recognition (JHU summer workshop 2002) (Kirchhoff et al, 2002) • Parsing Arabic Dialects (JHU summer workshop 2005) (Rambow et al, 2005) , (Chiang et al., 2006) 108

54 Resources

Distributors • Linguistic Data Consortium • NEMLAR (Network for Euro-Mediterranean LAnguage Resources) • ELSNET is the European Network of Excellence in Human Language Technologies • ELDA Evaluation and Language resources Distribution Agency

109

Resources

Reports • Mohamed Maamouri and Christopher Cieri. 2002. Resources for Natural Language Processing at the Linguistic Data Consortium . In Proceedings of the International Symposium on Processing of Arabic, pages 125--146, Manouba, Tunisia, April 2002. • Mahtab Nikkhou and Khalid Choukri. Survey on Arabic Language Resources and Tools in the Mediterranean Countries . • Arabic Information Retrieval and Computational Linguistics Resources (thanks to Doug Oard)

110

55 Resources

Monolingual Corpora • Arabic Gigaword • Arabic Newswire Parallel Corpora • United Nations Parallel Corpus • Ummah Parallel Corpus • Arabic News Translation • Multiple-Translation Arabic Treebanks • Arabic Penn Treebank Webpage – Part 1 v 2.0 , Part 2 v 2.0 , Part 3 v 1.0 , 10K-word English Translation • Prague Arabic Dependency Treebank 111

Resources

Morphology • Buckwalter Arabic Morphological Analyzer – Version 1.0 , Version 2.0 • Xerox Arabic Morphology (online) Dialect Resources • CALLHOME Egyptian Arabic Speech and Transcripts • Egyptian Colloquial Arabic Lexicon • Levantine Arabic Resources • http://www.orientel.org/ • http://www.appen.com.au/ • LDC Arabic EARS • CADIM: http://www.ccls.columbia.edu/cadim 112

56 Resources

Dictionaries • Buckwalter Stem Dictionary • H. Anthony Salmone. An Advanced Learner's Arabic- English Dictionary encoded by the Perseus Project, Tufts University (contact: David Smith [email protected] ) • Ajeeb Arabic-English Dictionary (online) • Al-Misbar Dictionary (online) • Ectaco Bilingual Dictionary (online) Online MT systems • Ajeeb's Arabic-English Machine Translation (online) • Al-Misbar English-Arabic Machine Translation (online)

113

Conferences and Workshops with some focus on Arabic

• Parsing Arabic Dialects (JHU summer workshop 2005) • ACL 2005 Workshop on Computational Approaches to Semitic Languages • Arabic Language Resources and Tools Conference 2004 Cairo, Egypt • WORKSHOP Computational Approaches to Arabic Script-based Languages (COLING 2004) • Traitement Automatique du Langage Naturel (TALN ' 04) • NIST MT EVAL (http://www.nist.gov/speech/tests/mt/) • MT Summit IX Workshop on Machine Translation for Semitic Languages in 2003 • Novel Approaches to Arabic Speech Recognition (JHU summer workshop 2002) • LREC 2002 Arabic Language Resources and Evaluation Workshop • ACL 2002 Workshop on Computational Approaches to Semitic Languages • International Symposium on Processing of Arabic 2002, Tunisia • Workshop on ARABIC Language Processing: Status and Prospects (ACL/EACL 2001) • Arabic Translation and Localisation Symposium (ATLAS 1999) • Computational Approaches to Semitic Languages (COLING/ACL 1998)

114

57 References Books Bateson,MaryCatherine.ArabicLanguageHandbook .GeorgetownUniversityPress.2003. Brustad,KristenE.TheSyntaxofSpokenArabic:AComparativeStudyofMoroccan,Egyptian, Syrian,andKuwaitiDialects .GeorgetownUniversityPress.2000. Holes,Clive.ModernArabic:Structures,Functions,andVarieties .GeorgetownUniversityPress.2004.

ConferencePapersandJournalArticles Alexandrescu,A.andK.Kirchhoff.2006.FactoredNeuralLanguageModels .HLT. Aljlayl,M.andO.Frieder.2002.OnArabicSearch:ImprovingtheRetrievalEffectivenessviaaLight StemmingApproach .ACMConferenceonInformationandKnowledgeManagement. AlSughaiyer,I.andI.AlKharashi.2004.ArabicMorphologicalAnalysisTechniques:A ComprehensiveSurvey .JournaloftheAmericanSocietyforInformationScienceandTechnology. Volume55, Issue3. Beesley,K.2001.FiniteStateMorphologicalAnalysisandGenerationofArabicatXeroxResearch: StatusandPlansin2001 .EACLworkshoponArabicLanguageProcessing:StatusandProspects. Bikel,D.2002.DesignofaMultilingual,ParallelprocessingStatisticalParsingEngine .HLT. Buckwalter,T.2002.BuckwalterArabicMorphologicalAnalyzerVersion1.0 .LDCcatalognumber LDC2002L49. CavalliSforza,V.,A.Soudi,andT.Mitamura.2000.ArabicMorphologyGenerationUsinga ConcatenativeStrategy .ANLP. Chiang,D.,M.Diab,N.Habash,O.Rambow,andS.Shareef.2006. ArabicDialectParsing .EACL. Darwish,K.2002.BuildingaShallowMorphologicalAnalyzerinOneDay .ACLworkshopon ComputationalApproachestoSemiticLanguages. 115

References Diab,M.,K.HaciogluandD.Jurafsky.2004.AutomaticTaggingofArabicText:FromrawtexttoBase PhraseChunks .HLTNAACL. Diab,M.,K.HaciogluandD.Jurafsky.“AutomatedMethodsforProcessingArabicText:From TokenizationtoBasePhraseChunking.” BookChapter.InArabicComputationalMorphology: KnowledgebasedandEmpiricalMethods .EditorsA.vandenBoschandA.Soudi. Duh,K.andK.Kirchhoff.2005.POSTaggingofDialectalArabic:AMinimallySupervisedApproach. ACLWorkshoponSemiticLanguages. Duh,K.andK.Kirchhoff.2006.LexiconAcquisitionforDialectalArabicUsingTransductive Learning. EMNLP. Elming,J.andN.Habash. 2007.CombinationofStatisticalWordAlignments BasedonMultiple PreprocessingSchemes. NAACL. Fischer,W.2001.AGrammarofClassicalArabic.YaleLanguageSeries.YaleUniversityPress. TranslatedbyJonathanRodgers. Habash,N.IssuesinPalestinianArabicSpellingStandardization .NACAL27,1999.Baltimore,MD. Habash,N.andO.Rambow.2004.ExtractingaTreeAdjoiningGrammarfromthePennArabicTreebank. TALN. Habash,N.andO.Rambow.2005a.ArabicTokenization,PartofSpeechTagginginandMorphological DisambiguationOneFellSwoop .ACL. Habash,N.andO.Rambow.2006.MAGEAD:AMorphologicalAnalyzerandGeneratorfortheArabic Dialects .ACL. Habash,N.andO.Rambow.2007. ArabicDiacritizationthroughFullMorphologicalTagging .NAACL. Habash,N.,O.RambowandG.Kiraz.2005b.MorphologicalAnalysisandGenerationforArabicDialects . ACL workshop on Computational Approaches to Semitic Languages. Habash,N.2004.LargeScaleLexemeBasedArabicMorphologicalGeneration .TALN. 116 Habash,N.andF.Sadat.2006.ArabicPreprocessingSchemesforStatisticalMachineTranslation . NAACL.

58 References

Habash,N.2006.“ArabicMorphologicalRepresentationsforMachineTranslation.” BookChapter.In ArabicComputationalMorphology:KnowledgebasedandEmpiricalMethods .EditorsA.vanden BoschandA.Soudi. Habash,N,A.SoudiandT.Buckwalter.2006.“OnArabicTransliteration.” BookChapter.InArabic ComputationalMorphology:KnowledgebasedandEmpiricalMethods .EditorsA.vandenBosch andA.Soudi. Habash,N.ArabicanditsDialects .MultilingualMagazine.#81,July/August2006. Habash,N.,C.Mah,S.Imran,R.CalistriYeh,andP.Sheridan.2006.TheDesignandValidationof anArabicWordNetforInformationRetrieval .LREC. Habash,N.,B.DorrandC.Monz.2006.ChallengesinBuildinganArabicGenerationheavyMachine TranslationSystemandExtendingitwithStatisticalComponents .AMTA. Hwa,R.,C.NicholsandK.Sima'an.2006.CorpusVariationsforTranslationLexiconInduction . AMTA. Ittycheriah,A.andS.Roukos.2005.AmaximumentropywordalignerforArabicEnglishmachine translation. HLTandEMNLP. Khoja,S.2001.APT:ArabicPartofSpeechTagger .NAACLStudentResearchWorkshop. Kiraz,G.2001.ComputationalNonlinearMorphologywithEmphasisonSemiticLanguages .Studies inNaturalLanguageProcessing.CambridgeUniversityPress. Kirchhoff,K.,J.Bilmes,S.Das,N.Duta,M.Egan,G.Ji,F.He,J.Henderson,D.Liu,M.Noamany,P. Schone,R.SchwartzandD.Vergyri.2003.NovelApproachestoArabicSpeechRecognition: Reportfromthe2002JohnsHopkinsWorkshop .ICASSP. Kirchhoff,K.andD.Vergyri. 2004 .CrossdialectalacousticdatasharingforArabicspeech recognition . ICASSP . Lee,Y.,K.Papineni,S.Roukos,O.Emam andH.Hassan.2003.LanguageModelBasedArabic WordSegmentation .ACL. 117

References

Lee,Y.2004.MorphologicalAnalysisforStatisticalMachineTranslation .NAACL. Maamouri,M.,A.Bies,T.Buckwalter,M.Diab,N.Habash,O.Rambow,D.Tabessi.2006. DevelopingandUsingaPilotDialectalArabicTreebank .LREC. Maamouri,M.,T.Buckwalter,andC.Cieri.DialectalArabicTelephoneSpeechCorpus:Principles, ToolDesign,andTranscriptionConventions . NEMLAR2004. Maamouri,M.,D.Graff,H.Jin,C.Cieri,andT.Buckwalter.DialectalArabicOrthographybased TranscriptionandCTSLevantineArabicCollection . EARSRT04Workshop. Rambow,O.,D.Chiang,M.Diab,N.Habash,R.Hwa,K.Sima’an,V.Lacey,R.Levy,C.Nichols, andS.Shareef.2005.ParsingArabicDialects.FinalReport ,JHUSummerWorkshop. Riesa,J.andD.Yarowsky.MinimallySupervisedMorphologicalSegmentationwithApplicationsto MachineTranslation .AMTA06. Riesa,J.,B.Mohit,K.KnightandD.Marcu.BuildinganEnglishIraqiArabicMachineTranslation SystemforSpokenUtteranceswithLimitedResources.Interspeech 2006. Rogati,M.,S.McCarley,andY.Yang.2003.UnsupervisedLearningofArabicStemmingUsinga ParallelCorpus .ACL. Sadat,FatihaandNizarHabash.2006.CombinationofPreprocessingSchemesforStatisticalMT . ACL. SmithN.,D.Smith,andR.Tromble.2005.ContextBasedMorphologicalDisambiguationwith RandomFields .HLTEMNLP. Smrž,O.andP.Zemánek.2002.Sherds fromanArabicTreebanking Mosaic .PragueBulletinof MathematicalLinguistics,(78). Snider,N.andM.Diab.2006.AutomaticDiscoveryofVerbClassesinModernStandardArabic . ACL. 118

59 References

Snider,N.andM.Diab.2006.UnsupervisedInductionofArabicVerbClasses .NAACL. Soudi,A.,V.CavalliSforza,andA.Jamari.2001.AComputationalLexemeBasedTreatmentof ArabicMorphology .ACLworkshoponArabicNaturalLanguageProcessing. Vergyri,D.,K.Kirchhoff,K.DuhandA.Stolcke.2004.Morphologybasedlanguagemodeling forArabicspeechrecognition . ICSLP . Vergyri,D.andK.Kirchhoff.2004.AutomaticDiacritizationofArabicforAcousticModelingin SpeechRecognition .COLINGWorkshoponArabicscriptBasedLanguages . Xu J.2002.UNParallelText(ArabicEnglish) ,LDCCatalogNo.:LDC2002E15. Yang,M.andK.Kirchhoff.PhraseBasedBackoff ModelsforMachineTranslationofHighly InflectedLanguages . EACL. Žabokrtský,Z.andO.Smrž.2003.ArabicSyntacticTrees:fromConstituencytoDependency . EACL. Zitouni,I.,J.Sorensen,andR.Sarikaya. 2006.MaximumentropybasedrestorationofArabic diacritics .ICCCandACL. Zitouni,I.,J.Olive,D.Iskra,K.Choukri,O.Emam,O.Gedge,M.Maragoudakis,H.Tropf,A. Moreno,A.Rodriguez,B.Heuft andR.Siemund.2002.OrienTel:SpeechBasedInteractive CommunicationApplicationsfortheMediterraneanandtheMiddleEast .ICSLP. Zollmann,A.,A.Venugopal andS.Vogel.2006.BridgingtheInflectionMorphologyGapfor ArabicStatisticalMachineTranslation .NAACL.

119

References

Conference/Institution/Program Name Abbreviations JHU = Johns Hopkins University ANLP = Applied Natural Language Processing LREC = Language Resources and Evaluation Conference ACL = Association for Computational Linguistics LDC = Linguistic Data Consortium, University of ACM = Association for Computing Machinery Pennsylvania EMNLP = Empirical Methods to Natural Language NAACL = North American ACL Processing TALN =Traitement Automatique du Langage Naturel EACL = European ACL NACAL = North America Conference on Afro-asiatic HLT = Human Language Technology Conference Languages ICSLP = International Conference on Spoken Language EARS = DARPA Program (Efficient, Affordable, Processing Reusable Speech-to-Text) ICASSP =International Conference on Acoustics, Speech and Signal Processing

120

60