DIWAN: A Dialectal Word Annotation Tool for

Faisal Al-Shargi Owen Rambow Universität Leipzig CCLS, Columbia University Leipzig New York, NY Germany USA [email protected] [email protected] leipzig.de

Abstract prehensible to an Arabic speaker from the Levant or the .1 This paper presents DIWAN, an anno- tation interface for Arabic dialectal Within these broad regions, further and consider- texts. While the Arabic differ able geographic distinctions exist – within coun- in many respects from each other and tries, across country borders, and between cities from , they al- and villages. Some examples include Gulf Ara- so have much in common. To facilitate bic, Bahraini Arabic, , Hijazi Ara- annotation and to make it as efficient as bic, , Yemeni , possible, it is therefore not advisable to Yemeni Sanaani Arabic, Yemeni Ta'izzi-Adeni treat each Arabic as a separate Arabic, Dhofari Arabic, , Shihhi language, unrelated to the other vari- Arabic, and the dialects. ants of Arabic. Instead, we make anal- yses from other variants available to Despite this diversity, all Arabic dialects share the annotator, who then can choose to certain properties: much of their phonology, use them or not. templatic morphology augmented by affixes and a large set of clitics, large parts of their syntax, 1. Introduction and important (though unpredictable) parts of the lexicon. Arabic is a Central Semitic language, closely related to , Hebrew, Ugaritic and Phoe- Current natural language processing (NLP) tools nician. It is spoken by 420 million speakers (na- work well with MSA because they were de- tive and non-native) in the Arab World. Arabic signed specifically for the processing of MSA, also is a liturgical language of 1.6 billion Mus- and because of the abundance of MSA resources. lims around the world. Applying the NLP tools designed for MSA di- rectly to DA yields significantly lower perfor- Modern Standard Arabic (MSA) is the official mance (Chiang et al., 2006; Habash and Ram- Arabic language. It is the educational language bow, 2006; Benajiba et al., 2010; Habash et al., and official language used in news and official 2012). This makes it imperative to direct re- communication across the Arabic-speaking search to building resources and tools for DA world. When Arabs communicate spontaneously processing. in informal settings, they use dialectal Arabic (DA). There are divisions of many dialects of the 1 When Arabic speakers of different dialects meet, they tend Arabic language that occur between the spoken to navigate towards a middle Arabic that encapsulates the languages of different regions. Some varieties of shared aspects they are aware of in order to maximize communication. A better and harder test of comprehension Arabic in North Africa, for example, are incom- is to eavesdrop on a conversation in another dialect.

49 Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 49–58, Beijing, China, July 26-31, 2015. c 2014 Association for Computational Linguistics Figure 1: An example of MSA and DA code switching

Arabic dialects lack large amounts of consistent This paper explains the design decisions we have data due to two main factors: the lack of ortho- made in order to meet these goals. DIWAN is graphic standards for the dialects, and the lack of fully implemented for use on Microsoft Win- overall Arabic content on the web (Benajiba et dows and is currently in use for the annotation of al., 2010). While the rise of the internet has in- Palestinian, Yemeni, and . creased the amount of DA being written, some- times Arabic dialects come mixed with the MSA This paper is structured as follows. In Section 2, in various forms of text (see Figure 1, which we review the NLP components we use in DI- shows the code switching in our DIWAN tool). WAN. In Section 3, we describe the workflow Furthermore, language used in social media pos- when using DIWAN. In Section 4, we describe es a challenge for NLP tools in general in any the specific annotation tasks the annotator per- language due to the difference in genre. There- forms. Section 5 gives some technical detail fore, in order to create tools for dialectal Arabic, about the implementation. Section 6 discusses annotated DA corpora are needed in a variety of related work. We conclude in Section 7 with a dialects. discussion of future work.

The goal of our Dialectal Word Annotation tool (DIWAN) is to address these gaps on the re- 2. NLP Resources used in DIWAN source creation level. In designing DIWAN, we have determined several important design goals: In order to make the annotation task easier, DI- WAN uses three main existing NLP resources: 1. We want to exploit the similarity between the MSA morphological analyzer SAMA, the dialects as much as possible to facilitate an- Egyptian morphological analyzer CALIMA- notation, which in general is costly and EGY, and the morphological tagger MADAMI- slow. RA which works for both MSA and Egyptian. 2. We want to use a convention for orthogra- We describe them in turn. phy (which the input text does not neces- sarily follow). The first resource is the Standard Arabic Mor- 3. We want to create data which can be used phological Analyzer, SAMA 3.1 (Graff et al. both for creating morphological analyzers 2009), which is based on the BAMA analyzer (which produce all morphological analyses (Buckwalter 2004). This system uses lexical for a word outside of any context) and mor- databases, divided into prefixes, stems, and suf- phological taggers (which determine the fixes, to assign words all possible MSA analyses. correct morphological analysis -- including A sample output is shown in Figure 2 (in Buck- ﻣﺎﺷﻲ the POS tag -- for a word in context). walter transliteration), for the input word mA$i, which is ambiguous between various in- flected forms of a verb meaning `walk’.

50 mA$y ﻲﺷﺎﻣ Figure 2: SAMA result for search on word

mA$y ﻲﺷﺎﻣ Figure 3: CALIMA-Egyptian result for search on word

The second resource is the Columbia Arabic 3. DIWAN Workflow Language and dialect Morphological Analyzer for Egyptian (CALIMA-EGY) (Habash et al. We designed and built DIWAN as a desktop ap- 2012b). It is an analyzer for Egyptian. A sample plication which can work locally (offline) or output is shown in Figure 3 for the input word online. As an annotation tool, we have designed mA$y. CALIMA returns the MSA readings DIWAN with two types of users: administrators ﻣﺎﺷﻲ shown in Figure 2, and in addition has Egyptian and annotators. The administrator’s responsibil- readings, in particular the interjection `OK’. ity is to create the DIWAN database, specify its settings, and track the annotator’s work. The third resource is MADAMIRA (Pasha et al. 2014). MADAMIRA is a system for morpholog- The administrator has several roles: ical analysis and disambiguation of Arabic that combines some of the best aspects of two previ- 1. She can create, edit and delete tables in ously commonly used systems for Arabic pro- the database. cessing, MADA (Habash and Rambow, 2005; 2. She can create, edit and delete annotator Habash et al., 2009; Habash et al., 2013) and accounts. AMIRA (Diab et al., 2007). MADAMIRA im- 3. She can check the status of the annota- proves upon the two systems with a more stream- tion tasks for each annotator. lined Java implementation that is more robust, 4. She can trace the annotator progress, portable, extensible, and is faster than its ances- work time, errors, etc. tors by more than an order of magnitude. Con- 5. She can generate reports and statistics on trary to SAMA and CALIMA-EGY, which pro- the underlying database (created by the vide all morphological analyses for a word re- annotators). gardless of context, MADAMIRA chooses a sin- 6. She can of course also annotate the data. gle analysis given the context of the word in a -The annotators can only annotate data. The ad ﻣﺎﺷﻲ ﻛﺪه ؟ sentence. For example, in the sentence mA$i kdh? `Is that OK?’, the interjection mean- ministrator assigns tasks to each annotator, and ing will be chosen. the annotations are added to the DIWAN data- base. As the annotator creates annotations,

51 can reuse the resulting lexical entries in the DI- WAN tool as a new resource for himself or for other annotators.

To work with DIWAN, the administrator first prepares the data. We assume that the data is DA written in . There are two ways of preparing the data:

1. The administrator can either simply use DI- WAN itself to identify sentences and words in the corpus. DIWAN extracts sentences and words from the prepared file and builds a DIWAN database. 2. Or the administrator can send the corpus to MADAMIRA. MADAMIRA not only iden- tifies sentences and words, it also performs morphological analysis (using MSA and Egyptian resources) and tagging, making a single analysis available for each input word in context. After getting the resulting data Figure 4: DIWAN setup workflow from MADAMIRA, DIWAN will present the analysis for each word to the annotator as a default annotation option. As in the previ- ous case, DIWAN extracts sentences and words from the prepared file and builds a When online, several annotators can work at DIWAN database (which now includes the once, sharing their work immediately through the MADAMIRA analysis). centralized database.

These two options are shown in Figure 4. 4.2. Choice of Word to Annotate and The annotator makes the dialect annotations by Using the Resources using the DIWAN GUI. We describe this pro- cess in detail in Section 4. The annotator has three options of how to order the words he wants to annotate: by frequency, by 4. Annotation Tasks text order, or by coverage. The frequency-based approach has the advantage that a large number We describe the workflow of the annotator. of tokens can be annotated at once, while the text order provides a natural right-to-left annotation 4.1. Initialization of the Annotation order through the text. Ordering by coverage GUI moves those words to the top of the list which the MADAMIRA system cannot analyze as The annotator starts out by choosing if he wants MSA or Egyptian. This typically (but not al- to work locally, i.e., offline, or connected to the ways, of course) means that the word is specific ktb ﻛﺘﺐ ,database. The offline option is useful when an to the dialect in question (for example internet connection is not reliable. In that case, `wrote (3ms)’ is common to all dialects, while .(m$bwjp `puffy (fs)’ is specific to Yemeni ﻣﺸﺒﻮﺟﺔ the work is uploaded in batch at the end of the session. Therefore, ordering by coverage will move

52 words specific to the dialect to the top. Of as part of the annotation; we will explain them in course, ordering by coverage also misses dialec- Section 4.3. As mentioned, DIWAN retrieves an tal words which look like a word in MSA or proposed analysis in context for the chosen word Egyptian, but mean something else than the token from MADAMIRA and populates all text MSA or Egyptian counterpart (“faux amis”). input boxes and checkboxes automatically with the analysis MADAMIRA finds, which may be In our experience, the frequency-based ordering based on an MSA analysis or an Egyptian analy- is useful at the beginning, when the annotator sis. In some cases, MADAMIRA does not find can quickly annotate high-frequency words an analysis, in which case this is clearly shown. which typically don’t change in form or mean- The annotator now has several choices as to how ing. Often, these words are dialect-specific. to enter the annotation. However, once a sufficient number of high- frequency words have been annotated, the anno- 1. He can accept the MADAMIRA analysis as tator can choose to switch to the text-order view. correct in this dialect as well. He then continues to annotate lower-frequency 2. He can modify the MADAMIRA analysis words in their textual order. The color coding and save the changes. shows him which words are already annotated. 3. He can look at the list of SAMA analyses for The annotator can also hide the annotated words the word (interpreting the word as MSA), by clicking on a button. If an annotation effort is and choose one. This analysis then popu- interested in creating a dialect-specific lexicon lates all input boxes. He can then choose to quickly, the ordering-by-coverage approach may accept this analysis, or modify and save it. be the most appropriate. 4. He can look at the list of CALIMA analyses for the word (interpreting the word as Egyp- Whatever ordering criterion the annotator choos- tian), and choose one. This analysis then es, he sees an ordered list of words, with the al- populates all input boxes. He can then ready annotated words in green and the words to choose to accept this analysis, or modify and be annotated in red. The annotator then clicks on save it. a word in the word panel on the left, and sees in 5. He can do word substitution: if the word a panel at the top of the GUI a scrollable list of does not produce the correct (or any) analy- all occurrences of this word in the corpus (one sis in SAMA or CALIMA, but he knows a per line), shown in context. The annotator word that does and that has the same mor- chooses which instances of the word he wants to phological analysis, then he can enter that annotate (i.e., which instances have the same word, search in SAMA or (more likely) analysis) by clicking a checkbox corresponding CALIMA, and then edit the analysis, but on- to that instance. Typically, he would survey all ly to modify the word itself. For example, occurrences and judge which ones have the same assume the annotator is working on Yemeni yqTb `speeds up ﯾﻘﻄﺐ analysis. He then chooses a representative ex- and the input word is ample, clicks on it, and proceeds to the main an- (3ms)’. This does not produce an analysis in notation panel. When the annotator clicks on the SAMA or CALIMA. So instead, the annota- ,’(yktb `writes (3ms ﯾﻜﺘﺐ word’s checkbox, by default he will get the tor searches for MADAMIRA result in the annotation panel (as- which has exactly the same morphological suming the administrator has chosen to include analysis, and then edits the stem by replacing -and updates the gloss to `has , ﻗﻄﺐ with ﻛﺘﺐ .(the MADMIRA analyses ten, speed up’. The annotator performs the annotation tasks in 6. Finally, he can create an analysis from the main annotation panel. There are several scratch. This would normally be the most input boxes which the annotator needs to fill in time consuming option.

53 Figure 5: The main DIWAN annotation interface

drop-down menus adds morphological infor- mation to morphs. For example, Egyptian busses’ is annotated at the morph level` ﺑﺎﺻﺎت Specific Annotation Tasks .4.3 as bAS/NOUN+At/NSUFF_FEM_PL, since is the suffix for regular feminine plural +ات Each annotation task corresponds to a specific input device. The interface is shown in Figure nouns. However, the form is in fact a mascu- 5. Note that here we describe the annotation task line plural form (and thus a type of broken as if it is performed from scratch (option 6 in plural), so that the annotator would mark .as functionally masculine and plural ﺑﺎﺻﺎت .(Section 4.2 above This task is performed using two drop-down 1. Rewriting the word in the conventionalized menu boxes (one for number, one for gender). orthography (CODA) defined for that dialect Note that none of the existing resources (Habash et al. 2012a). Note that CODA may (MADAMIRA, SAMA, or CALIMA) mark include diacritics or not; in the examples we functional number and gender, so that this show in this paper, it does not. task needs to be performed manually for each 2. Breaking into prefix, stem and suffix. These word in any case. two tasks are performed jointly in three text 6. Marking Arabic variant. The annotator can input boxes. choose to mark whether a word is in fact 3. Adding morph-specific features (in the style MSA rather than dialect (the default assump- of the Linguistic Data Consortium Arabic re- tion). This is useful when code switching oc- sources). This task is performed using drop- curs, and the annotator does not want to add down menus separately for the prefix, stem, an MSA word to the dialectal vocabulary. and suffix. Furthermore, the annotator can choose a spe- 4. Adding the English gloss and MSA equiva- cific region within the dialect. This is useful lent. This task is performed in two dedicated when a dialectal form is not typical for the re- text input boxes. gion that the text is from. For example a Leb- 5. Adding functional morphology. The mor- anese word may occur in a byy wEylty ﺑﯿﻲ وﻋﯿﻠﺘﻲ ﻛﻠﮭﺎ ﺑﺎﻟﻀﻔﮫ pheme-based annotation performed using text, such as

54 klhA bAlDfh `my father and all my family in vlAvh zErAn qAEdyn - ﺛﻼﺛﮫ زﻋﺮان ﻗﺎﻋﺪﯾﻦ ﺑﻤﺰرﻋﮫ ,West Bank’, where all words are Palestinian byy ‘my father’, which is more used bmzrEh ﺑﯿﻲ except in .2 vlAvp diac:vlAvp lex:valAv_1 ﺛﻼﺛﺔ 924 bw:+vlAv/NOUN_NUM+p/NSUFF_FEM_SG When everything is correct, the annotator can msa:valAv_1 gloss:three pos:noun_num gen:f save his annotation directly to the database; as a num:s region:ALL di- result, the color of the analyzed word token or wan_source:MADAMIRA source_mod:no tokens will change from red to green. source_search:vlAvp anno:diwan_approved

zErAn diac:zErAn lex:>azoEar_2 زﻋﺮان In addition, we have added some functionalities 925 bw:+zErAn/NOUN+ msa:>azoEar_2 in order to help the annotators in their annota- gloss:brigands;scoundrels pos:noun gen:m tion, like Google search on a word (which the num:p region:ALL diwan_source:MADAMIRA annotator can use to verify the meaning; often source_mod:yes source_search:zErAn an- image search is useful for this purpose), and no:diwan_approved Google translation for finding the English gloss. qAEdyn diac:qAEdyn ﻗﺎﻋﺪﯾﻦ 926 lex:qAEid_1 bw:+qAEd/ADJ+yn/NSUFF_MASC_PL 5. The Database and Output Files msa:jAls_1 gloss:sitting;seated;lazy;inactive;evaders_(draft In this section, we briefly summarize the data- _dodgers) pos:adj gen:m num:p region:ALL bases and file formats used by DIWAN. Only diwan_source:MADAMIRA source_mod:no the administrator has the ability to directly access source_search:qAEdyn anno:diwan_approved these databases, the annotators can only access it bmzrEh diac:bmzrEp ﺑﻤﺰرﻋﮫ 927 through the DIWAN interface. This ensures the lex:mazoraE_1 integrity of the DIWAN data. The database has bw:b/PART+mzrE/NOUN+p/NSUFF_FEM_S three main tables, the D_sentences table, the G msa:mazoraE_1 gloss:farm;plantation D_madamira table, and the D_result table. The pos:noun gen:f num:s region:ALL di- D_sentences table includes all the words orga- wan_source:EGY source_mod:no nized into sentences from the input. The source_search:bmzrEp anno:diwan_approved D_madamira table contains the result of the Figure 6: Extract of an output file generated by DIWAN MADAMIRA analysis on the input text. The of an annotated text in context D_result table is the table that contains all work by the annotators. 924: the word number in the text vlAvp: the word in Arabic script and , ﺛﻼﺛﺔ The administrator can at any time produce a file Buckwalter transliteration output from DIWAN which reflects the annota- diac:vlAvp: The CODA spelling (which, recall, tion. The file includes the results of MADAMI- may or may not be diacritized) RA if no manual annotation has been done on it. lex:valAv_1: the lexeme bw:+vlAv/NOUN_NUM+p/NSUFF_FEM_SG A sample output is shown in Figure 6. We The Buckwalter part-of-speech and mor- briefly summarize this format: pheme split; this is a the morpheme-based morphological annotation; the plusses indi- cate the boundaries between prefix, stem, and suffix. msa:valAv_1: MSA equivalent 2 gloss:three: English gloss Palestinian Arabic is particularly challenging due the common dialect mixing in different sub-varieties of it re- pos:noun_num: The core part-of-speech tag sulting from the particular situation of Palestinian refugees gen:f: functional gender in different countries.

55 num:s: functional number ation, not annotation in context. Thus, words in region:ALL: applicable dialectal subregion context are not assigned morphological features. diwan_source:MADAMIRA: which resource For our work, it is crucial that we get an annota- did the annotator use in DIWAN tion of morphological features in context so that source_mod:no: did the annotator modify the source? DIWAN can be used to create corpora to train source_search:vlAvp: what keyword did the taggers. Furthermore, COLABA does not use annotator use to search the resource? (In resources from other dialects, as does DIWAN. this case, since the resource is MADAMI- RA, the search keyword is necessarily the 7. Conclusion and Future Work word itself.) anno:diwan_approved: did an annotator work We have presented DIWAN, a tool designed for on this word? the morphological annotation of Arabic dialectal text. It incorporates resources from other dia- 6. Related Work lects (and new resources can be included as they become available) in order to lighten the annota- tor burden. It uses a conventionalized spelling There are two related interfaces that have been for Arabic dialects which is maintained in paral- used for annotating dialectal Arabic that we are lel with the naturally occurring spontaneous or- familiar with. thography. And it generates a file format which

preserves the linear order of the input text, so The annotation tool used at the Linguistic Data that it can be used both for deriving morphologi- Consortium for annotating the Egyptian Tree- cal analyzers, and for training morphological bank (Maamouri et al. 2014) is based on previ- taggers. ous interfaces used at the LDC for treebanking, notably for MSA. The approach towards mor- DIWAN has been used to annotate Levantine phological annotation used at the LDC is a boot- (Palestinian) Arabic (Jarrar et al. 2014). The an- strapping approach, which aims at developing an notators for Levantine quickly became proficient annotated corpus in conjunction with a morpho- with using the tool after annotating about 100 logical analyzer. The morphological analyzer words. The Palestinian corpus includes 45,000 developed in conjunction with the Egyptian Ara- annotated words (tokens). We are currently us- bic Treebank is in fact, the same CALIMA- ing DIWAN to annotate Yemeni (Sana’ai) Ara- Egyptian system we use. In contrast to DIWAN, bic. The Yemeni corpus contains 32,325 words there is no attempt at incorporating resources (tokens), and the annotator for Yemeni is the first from other dialects, which is also due to the fact author of the present paper. Finally, we have that the Egyptian Treebank was a pioneer in the embarked on a small project for Moroccan Ara- area of resources for dialectal Arabic. Further- bic. We have collected 64,171 words of Moroc- more, the LDC interface does not support anno- can for annotation. In separate publications in tation of functional number and gender, and con- the future, we will report on the Yemeni and Mo- centrates on morpheme-based annotation (which roccan annotation efforts. We will also report on DIWAN also supports, following the LDC ap- a general methodology about how to use such proach). resources to create morphological analyzers and

taggers. The COLABA annotation tool (Diab et al.

2010a) is a web application, unlike DIWAN One interesting question is how our tool com- which is a desktop application. As a result, un- pares to other annotation tools. We believe that like DIWAN, the COLABA tool does not sup- the built-in access to morphological analyzers for port offline work. The most important difference other variants is unique, and provides a specific is that COLABA is oriented towards lexicon cre- advantage in annotating Arabic dialects. How-

56 ever, we have not performed experiments to References show this. While such experiments would be very useful, they would also be quite costly, Y. Benajiba and M. Diab. 2010. A web applica- since the same texts would need to be annotated tion for dialectal Arabic text annotation. In Pro- twice by different annotators. ceedings of the LREC Workshop for Language Resources (LRs) and Human Language Technol- ogies (HLT) for : Status, Up- We will continue to improve the DIWAN tool. dates, and Prospects. As more dialects are annotated, we intend to add the created resources to the interface to make T. Buckwalter. 2004. Buckwalter Arabic Mor- them available to users working on new dialects phological Analyzer Version 2.0. Linguistic Data (parallel to the SAMA and CALIMA-Egyptian Consortium, University of Pennsylvania. LDC resources). Catalog No.: LDC2004L02, ISBN 1-58563-324- 0

Currently, DIWAN is available only for Mi- D. Chiang, M. Diab, N. Habash, O. Rambow, crosoft Windows. We are investigating reim- and S. Shareef. 2006. Parsing Arabic Dialect. In plementing it in a platform-independent manner. Proceedings of the European chapter of the As- DIWAN is freely available; for information, sociation of Computational Linguistics (EACL). please consult the following URL: M. Diab, N. Habash, O. Rambow, M. AlTant- http://volta.ccls.columbia.edu/~rambow/diwan/ awy, and Y. Benajiba. 2010a. COLABA: Arabic Dialect Annotation and Processing. In Proceed- home.html ings of the Language Resources (LRs) and Hu- man Language Technologies (HLT) for Semitic Languages at LREC. Acknowledgments M. Diab, K. Hacioglu, and D. Jurafsky. 2007. This paper is based upon work supported by Automated Methods for Processing Arabic Text: DARPA Contract No. HR0011-12-C-0014. Any From Tokenization to Base Phrase Chunking. In van den Bosch, A. and Soudi, A., editors, Arabic opinions, findings and conclusions or recom- Computational Morphology: Knowledge-based mendations expressed in this paper are those of and Empirical Methods. Kluwer/Springer. the authors and do not necessarily reflect the views of DARPA. We thank Nizar Habash for D. Graff, M. Maamouri, B. Bouziri, S. Krouna, many helpful suggestions, including the idea of S. Kulick, and T. Buckwalter. 2009. Standard providing the ability of searching using similar Arabic Morphological Analyzer (SAMA) Ver- words. We thank three anonymous reviewers for sion 3.1. Linguistic Data Consortium LDC2009E73. helpful feedback which has improved the presen- tation of this paper. We thank Mustafa Jarrar for N. Habash and O. Rambow. 2005. Arabic To- very useful discussions and feedback. We also kenization, Part-of-Speech Tagging and Morpho- thank the Levantine annotators, Faeq Al- logical Disambiguation in One Fell Swoop. In rimawi, Diyam Akra, and Linda Alamir, as well Proceedings of the 43rd Annual Meeting of the as Aidan Kaplan who is developing the Moroc- Association for Computational Linguistics (ACL’05), pages 573–580, Ann Arbor, Michi- can corpus, for their feedback which helped im- gan. prove the tool. N. Habash and O. Rambow. 2006. MAGEAD: A Morphological Analyzer and Generator for the Arabic Dialects. In Proceedings of ACL, Sydney, Australia.

57 N. Habash, M. Diab, and O. Rabmow. 2012a. Conventional Orthography for Dialectal Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul.

N. Habash, R. Eskander, and A. Hawwari. 2012b. A Morphological Analyzer for . In Proc. of the Special Interest Group on Computational Morphology and Phonology, Montréal, Canada.

N. Habash, O. Rambow, and R. Roth. 2009. MADA+TOKAN: A toolkit for Arabic tokeniza- tion, diacritization, morphological disambigua- tion, POS tagging, stemming and lemmatization. In Choukri, K. and Maegaard, B., editors, Pro- ceedings of the Second International Conference on Arabic Language Resources and Tools. The MEDAR Consortium, April.

N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. 2013. Morphological Analysis and Disambiguation for Dialectal Arabic. In Pro- ceedings of the 2013 Conference of the North American Chapter of the Association for Compu- tational Linguistics: Human Language Technol- ogies (NAACL-HLT), Atlanta, GA.

M. Jarrar, N. Habash, D. Akra, and N. Zalmout 2014. "Building a Corpus for Palestinian Arabic: a Preliminary Study." In Proceedings of the Workshop on Arabic Natural Language Pro- cessing (ANLP 2014).

M. Maamouri, A. Bies, S. Kulick, M. Ciul, N. Habash and R. Eskander. 2014. Developing a dialectal Egyptian Arabic Treebank: Impact of Morphology and Syntax on Annotation and Tool Development. In Proceedings of the Lan- guage Resources and Evaluation Conference (LREC), Reykjavik, Iceland.

A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M. Pooleery, O. Rambow and R. M. Roth. MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic. In Proc. of LREC, Reykjavik, Iceland, 2014.

58