Diacritization of a Highly Cited Text: A Classical Arabic Book as a Case
This is a repository copy of Diacritization of a Highly Cited Text: A Classical Arabic Book as a Case. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/128591/ Version: Accepted Version Proceedings Paper: Alosaimy, A and Atwell, E orcid.org/0000-0001-9395-3764 (2018) Diacritization of a Highly Cited Text: A Classical Arabic Book as a Case. In: Proceedings of ASAR'2018 Arabic Script Analysis and Recognition. ASAR'2018 Arabic Script Analysis and Recognition, 12-14 Mar 2018, Alan Turing Institute, The British Library, London UK. IEEE , pp. 72-77. © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Reuse Unless indicated otherwise, fulltext items are protected by copyright with all rights reserved. The copyright exception in section 29 of the Copyright, Designs and Patents Act 1988 allows the making of a single copy solely for the purpose of non-commercial research or private study within the limits of fair dealing. The publisher or other rights-holder may allow further reproduction and re-use of this version - refer to the White Rose Research Online record for this item. Where records identify the publisher as the copyright holder, users can verify any specific terms of use on the publisher’s website. Takedown If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing [email protected] including the URL of the record and the reason for the withdrawal request. 
Diacritization of a Highly Cited Text: A Classical Arabic Book as a Case

Abdulrahman Alosaimy, Eric Atwell
School of Computing, University of Leeds, Leeds, UK

Abstract— We present a robust and accurate diacritization method for highly cited texts that automatically “borrows” diacritization from similar contexts. The method has been tested on one book, “Riyad As-Salheen”, for the purpose of morphological annotation of the Sunnah Arabic Corpus. The original source of Riyad is about 48.66% diacritized; after borrowing diacritization, the percentage rises to 76.41% with a low diacritic error rate (DER=0.004), compared to 61.73% (DER=0.214) using the MADAMIRA toolkit and 67.68% (DER=0.006) using the Farasa toolkit. More importantly, this method reduces word ambiguity from 4.83 diacritized forms per word to 1.91.

Keywords—diacritization; Arabic; NLP; Sunnah; Riyad As-Salheen

I. INTRODUCTION

In written Arabic, a large amount of phonological information is missing, such as short vowels, Shaddah, tanween, Maddah, and sometimes hamzah¹ as well. These marks (collectively called diacritics) are not usually written. As a result, ambiguity at the word level is high in Arabic: there is an average of 11.5 diacritizations per word according to [1]. For example, a vowelized form of the word فهم (fhm) can be one of the following (non-comprehensive) list:

1. فَهَمَ (fahama) (v.) to understand
2. فَهَّمَ (fahhama) (v.) to teach
3. فَهُمْ (fa+humo) (conj. + pron.) and they
4. فَهَمَّ (fahamma) (conj. + v.) and (he) intended

¹ In cases where Hamza is considered a diacritic, only the different shapes of Hamza on Alif are considered.

Arabic diacritization is the computational process of recovering the missing diacritics of the orthographic word. This process is known to improve readability (e.g. children's books and educational textbooks), automatic speech recognition (ASR) [2], text to speech (TTS) [3], information retrieval (IR), and morphological annotation [4].

Words can be fully diacritized, where diacritics are specified for all letters, or partially diacritized, where diacritics are specified for only some of the letters. Texts are usually fully diacritized for children's educational purposes, or when great precision of pronunciation is required, e.g. the Quran [5]. On the other hand, diacritics in most texts are written only partly or not at all, for three reasons: to speed up reading [5], to avoid straining the eyes, and to speed up typing by one third (the extra effort required for typing diacritics).

A special type is minimal diacritization, where only as many diacritics are specified as are needed to avoid the word's ambiguity. However, “ambiguity” here is itself ambiguous, and the minimal level depends on the audience (e.g. the reader's level of education) and the target. For morphological annotation in Natural Language Processing (NLP), a minimal diacritization is the minimal partial diacritization that is sufficient to eliminate the other possible diacritizations produced by a lexicon or morphological analyser.

Diacritization is usually done fully, but full diacritization does not necessarily diacritize every letter, because there is no standard definition of the fully diacritized word. Some letters are not diacritized even in lexicons and are, by convention, no-vowel letters (i.e. each has an intuitive vowel that is not written). For example, under some diacritization standards, the letter that precedes a long vowel and the lam letter in the definite AL article are two no-vowel letters. However, deciding whether Waw/Yaa letters are consonants or vowels is ambiguous. Similarly, deciding whether the lam is part of a definite AL article is ambiguous too.

Arabic diacritization has attracted the attention of Arabic NLP researchers, and much work has been done. Previous approaches have focused on improving the quality of automatic diacritization to produce a fully diacritized version of the text, using a rule-based approach [6], statistical approaches such as recurrent networks [7] or an n-gram model [8], or hybrid approaches, which usually perform best [9]–[11]. This work, however, focuses on diacritizing text for the purpose of later manual annotation. That is, the diacritization approach seeks high accuracy but does not need to diacritize the full text. This approach shares some interests with [4], which exploits diacritization to improve morphological annotation. Our methodology is unique in that it exploits partially diacritized texts as a source for diacritization. We borrow partial diacritizations from similar contexts and merge them together, with the aim of lowering the ambiguity level of each word as much as possible.

II. MOTIVATION

This article is motivated by our project of developing a semi-automatically annotated Sunnah Arabic Corpus. Since its text has not been fully diacritized, we needed to adopt a method for diacritizing it. Since the corpus mostly consists of texts that are highly quoted in other diacritized texts, we had the idea of “borrowing” their diacritization.

In many Classical Arabic texts, it is common to diacritize a word at least minimally, that is, to the extent that is enough to remove the ambiguity for readers. However, this borderline is not clear enough, and words that seem clear to the writer might still be ambiguous to a reader. Therefore, we notice different diacritizations of the same word in different positions within the book, or between different versions of the book. These differences are exploited to improve the morphological annotation of our Sunnah Arabic Corpus (SAC) and to reach the minimal diacritization for each word.

III. DATA

We picked one book from our Sunnah Arabic Corpus: Riyāḍu Aṣṣāliḥīn² (aka The Meadows of the Righteous), which is a compilation of 1896 hadith narratives written by Al-Nawawi and published in 1334. The total number of words in Riyad is around 144k (~17k word types), and 48.66% of its letters are diacritized. Riyad was chosen for several reasons:

1. It compiles narrations reported in other Hadith books (e.g. Albukhari), which makes those books a good source for diacritization.
2. Its codex was validated and investigated by several scholars through a scientific palaeographical process; there are at least two digitally available validated versions of the same text.
3. Its narratives have been explained in 6 written books.

The currently available diacritized corpora are either annotated corpora (mainly news) or Tashkeela (religious texts) [12], a corpus of 6.15 million words, which we used as an initial source for diacritization. But since Tashkeela focuses on fully diacritized texts, we added several Hadith books downloaded from the Shamela library. Shamela (http://shamela.ws) is a downloadable library that contains at least 5300 Arabic books in Islamic studies and has become the standard library of Arabic classical books. It has been used to obtain Arabic classical text in building several corpora [12]–[14].

Although having a large collection should not lower the accuracy of our method, we limit the corpus size in our experiments for training time efficiency.
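The share of diacritized letters quoted for Riyad (48.66%) can be read as the fraction of base letters that carry at least one diacritic. The paper does not spell out its counting rule, so the following is only a minimal sketch under assumptions: the function name `diacritized_share` is hypothetical, text is taken to be in Unicode Arabic script, tatweel is ignored, and conventional no-vowel letters are counted as ordinary undiacritized letters.

```python
# Assumed diacritic set: tanween (U+064B..U+064D), short vowels
# (U+064E..U+0650), shaddah (U+0651) and sukun (U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def is_arabic_letter(ch: str) -> bool:
    """True for base letters in U+0621..U+064A, excluding tatweel (U+0640)."""
    return "\u0621" <= ch <= "\u064A" and ch != "\u0640"


def diacritized_share(text: str) -> float:
    """Fraction of Arabic letters followed by at least one diacritic."""
    letters = 0
    marked = 0
    pending = False  # True while the latest letter has no diacritic yet
    for ch in text:
        if is_arabic_letter(ch):
            letters += 1
            pending = True
        elif ch in DIACRITICS:
            if pending:  # count each letter once, even with shaddah + vowel
                marked += 1
                pending = False
        else:
            pending = False  # spaces, punctuation, digits
    return marked / letters if letters else 0.0


if __name__ == "__main__":
    bare = "\u0641\u0647\u0645"          # fhm with no diacritics
    partial = "\u0641\u064E\u0647\u0645"  # fatha on the first letter only
    print(diacritized_share(bare), diacritized_share(partial))
```

A different convention (for example, excluding no-vowel letters from the denominator) would yield a higher percentage for the same text, so figures computed this way are only comparable under the same rule.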
We picked relevant books from Tashkeela and Shamela, i.e. books that have a high likelihood of quoting texts from Riyad. This selection was done manually. It can also be done automatically, as we developed a small companion tool that measures one book's contribution by computing the number of matching n-grams that carry additional diacritization. The final corpus is 7,677,814 words, and 58.31% of its letters are diacritized.

In Arabic examples for the rest of this article, we use Buckwalter transliteration³ instead, as it makes the differences between diacritics easier to examine.

² All experiments in this paper and the data they use are available at: http://github.com/aosaimy/sac

the number of occurrences of that diacritization.
4. Once finished, variants are sorted by the number of occurrences to prevent infrequent diacritization from bubbling up to the surface diacritization in the next step.
5. Centre-word variants are merged recursively: the merge procedure (Algorithm 2) is done letter by letter, and for every letter, only candidate diacritics that do not contradict with