Parsing Arabic Nominal Sentences with Transducers to Annotate Corpora

ISSN 2007-9737

Nadia Ghezaiel Hammouda1, Kais Haddar2

1 Higher Institute of Computer and Communication Technologies of Hammam Sousse, Miracl Laboratory, Tunisia 2 University of Sfax, Faculty of Sciences of Sfax, Miracl Laboratory, Tunisia

[email protected], [email protected]

Abstract. Studying Arabic nominal sentences is For a successful processing, we need a important to analyze and annotate successfully Arabic rigorous study of the Arabic language and corpora. This type of sentences is frequent in Arabic text especially the nominal sentence to facilitate the and speech. Transducers can be used to realize local identification of disambiguation’s rules. In fact, for grammars and treat several linguistic phenomena. In this the rule-based approach, there are many paper, we propose a parsing approach for Arabic nominal sentences using transducers. To do this, we first theoretical frameworks allowing the formalization, study the typology of the Arabic nominal sentence such as formal grammars and finite state automata indicating its different forms. Then, we develop a set of [16]. Thanks to finite automata and particularly lexical and syntactic rules dealing with this type of transducers, several linguistic phenomena (i.e., sentences and respecting the specificity of the Arabic morphological analysis, the recognition of named language. After that, we present our parsing approach entities) are treated appropriately. Transduction on based on the transducers and on certain principles. In text automata is useful: it can remove paths fact, this approach allows the annotation of Arabic representing morph-syntactic ambiguities. In nominal sentences but also the Arabic verbal ones. addition, lexical disambiguation can be a way to Finally, we present an idea about the implementation and experimentation of our approach in NooJ platform. POS-tagging automatically Arabic corpora. The obtained results are satisfactory. Linguistic platforms like NooJ can facilitate the implementation and the experimentation of Keywords. Arabic nominal sentence, disambiguation transducers allowing the parsing and annotation of process, topic, transducer, NooJ platform. corpora. Indeed, the automatic annotation of Arabic 1 Introduction nominal sentence is not an easy task. In fact, the Arabic language, that is an agglutinated one, is The automatic annotation of corpora has an strongly inflected and derivational language and important place in the field of Natural Language rich in complex structures. Moreover, the Processing (NLP) but it has not yet taken its disambiguation process becomes more difficult chance for the Arabic language. Due to its with the absence of vocalization that is a frequent morphological, syntactic, phonetic and phenomenon in Arabic. To identify all structures of phonological characteristics, Arabic is considered the Arabic nominal sentences, we make a study as one of the difficult languages to analyze in NLP. through a corpus and we have to study verbal So, the study of its linguistic phenomena is a sentences, because a close relationship between necessary task. In what follows, we are interested these two forms exists. From this study, we want to in the study and analysis of the Arabic nominal achieve a system of syntactic rules allowing the sentences that have a close relationship with the annotation of Arabic nominal sentences but also verbal ones. verbal ones.