Automatic Romanization of Arabic Bibliographic Records
Total Page:16
File Type:pdf, Size:1020Kb
Automatic Romanization of Arabic Bibliographic Records Fadhl Eryani and Nizar Habash Computational Approaches to Modelling Language (CAMeL) Lab New York University Abu Dhabi, UAE {fadhl.eryani,nizar.habash}@nyu.edu Abstract Description Tag Romanized Entry Arabic Entry ﺗﯿﻤﻮر، اﺣﻤﺪ، ,Author 100a* Taymūr, Ah ̣mad ، ,International library standards require cata- details 100d 1871-1930 ﻣﺆﻟﻒ .loguers to tediously input Romanization of 100e author ﻗﺒﺮ اﻹﻣﺎم اﻟﺴﯿﻮطﻲ وﺗﺤﻘﯿﻖ their catalogue records for the benefit of li- Title 245a* Qabr al-imām al-Suyūṭī ﻣﻮﺿﻌﮫ / brary users without specific language exper- details wa-taḥqīq mawḍiʻihi ﺑﻘﻠﻢ اﻟﻔﻘﯿﺮ إﻟﯿﮫ ﺗﻌﺎﻟﻰ أﺣﻤﺪ ﺗﯿﻤﻮر 245c* bi-qalam al-faqīr ilayhi tise. In this paper, we present the first reported ta‘ālá Aḥmad Taymūr. اﻟﻘﺎھﺮة : results on the task of automatic Romanization Place of 264a* al-Qāhirah اﻟﻤﻄﺒﻌﺔ اﻟﺴﻠﻔﯿﺔ وﻣﻜﺘﺒﺘﮭﺎ، of undiacritized Arabic bibliographic entries. production 264b* al-Maṭbaʻah al-Salafīyah This complex task requires the modeling of wa-Maktabatuhā, ھـ ، ؟ م [.Arabic phonology, morphology, and even se- 264c 1346 [H.], [1927? M ﺻﻔﺤﺔ، ورﻗﺔ ﻟﻮﺣﺎت ﻏﯿﺮ Physical 300a 24 pages, 1 unnumbered ﻣﺮﻗﻤﺔ : mantics. We collected a 2.5M word corpus of description leaf of plates ﻣﺼﻮرات، ﺧﺮاﺋﻂ ؛ parallel Arabic and Romanized bibliographic 300b illustrations, maps ﺳﻢ entries, and benchmarked a number of models 300c اﻟﺴﯿﻮطﻲ، ,that vary in terms of complexity and resource Subject - 600a* Suyūt ̣ı̄ dependence. Our best system reaches 89.3% name 600d 1445-1505. اﻟﻘﺒﻮر exact word Romanization on a blind test set. Subject - 650a Tombs ﻣﺼﺮ We make our data and code publicly available. topic 650z Egypt اﻟﻘﺎھﺮة .650z Cairo 1 Introduction Figure 1: A bibliographic record for Taymur¯ (1927?) Library catalogues comprise a large number of bib- in Romanized and original Arabic forms. liographic records consisting of entries that provide specific descriptions of library holdings. Records for Arabic and other non-Roman-script language of complexity and resource dependence. Our best materials ideally include Romanized entries to help system reaches 89.3% exact word Romanization on a blind test set. We make our data and code researchers without language expertise, e.g., Fig- 1 ure1. There are many Romanization standards publicly available for researchers in Arabic NLP. such as the ISO standards used by French and other 2 Related Work European libraries, and the ALA-LC (American Library Association and Library of Congress) sys- Arabic Language Challenges Arabic poses a tem (Library of Congress, 2017) widely adopted by number of challenges for NLP in general, and the North American and UK affiliated libraries. These task of Romanization in particular. Arabic is mor- Romanizations are applied manually by librarians phologically rich, uses a number of clitics, and is across the world – a tedious error-prone task. written using an Abjad script with optional diacrit- In this paper, we present, to our knowledge, the ics, all leading to a high degree of ambiguity. The first reported results on automatic Romanization of Arabic script does not include features such as cap- undiacritized Arabic bibliographic entries. This is italization which is helpful for NLP in a range of a non-trivial task as it requires modeling of Arabic Roman script languages. There are a number of phonology, morphology and even semantics. We enabling technologies for Arabic that can help, e.g., collect and clean a 2.5M word corpus of parallel MADAMIRA (Pasha et al., 2014), Farasa (Abde- Arabic and Romanized bibliographic entries, and 1https://www.github.com/CAMeL-Lab/ evaluate a number of models that vary in terms Arabic_ALA-LC_Romanization 213 Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 213–218 Kyiv, Ukraine (Virtual), April 19, 2021. Split Bib Records Entries Words York University Abu Dhabi’s Arabic Collections Train 85,952 (80%) 479,726 ∼2M Online (ACO) (12K), amounting to 11.2 million Dev 10,744 (10%) 59,964 ∼250K records in total. Test 10,743 (10%) 59,752 ∼250K Extraction From these collections, we extracted Total 107,439 599,442 ∼2.5M 107,493 records that are specifically tagged with the Arabic language code (MARC 008 “ara”). Table 1: Corpus statistics and data splits. Filtering Within the extracted records we filter out some of the entries using two strategies. First, lali et al., 2016), and CAMeL Tools (Obeid et al., we used a list of 33 safe tags (determined us- 2020). In this paper we use MADAMIRA to pro- ing their definitions and with empirical sampling vide diacritics, morpheme boundaries and English check) to eliminate all entries that include a mix of gloss capitalization information as part of a rule- translations, control information, and dates. The based Romanization technique. star-marked tags in Figure1 are all included, while Machine Transliteration Transliteration refers the rest are filtered out. Second, we eliminated all to the mapping of text from one script to another. entries with mismatched numbers of tokens. This Romanization is specifically transliteration into the check was done after a cleaning step that corrected Roman script (Beesley, 1997). There are many for common errors and inconsistencies in many ways to transliterate and Romanize, varying in entries such as punctuation misalignment and in- 2 terms of detail, consistency, and usefulness. Com- correct separation of the conjunction +ð wa+ ‘and’ monly used name transliterations (Al-Onaizan and clitic. As a result of this filtering, a small number of Knight, 2002) and so-called Arabizi transliteration additional records are eliminated since all their en- (Darwish, 2014) tend to be lossy and inconsistent tries were eliminated. The total number of retained while strict orthographic transliterations such as records is 107,439. The full details on extraction Buckwalter’s (Buckwalter, 2004) tend to be exact and filtering are provided as part of the project’s but not easily readable. The ALA-LC transliter- public github repo (see footnote 1). ation is a relatively easy to read standard that re- Data Splits Finally, we split the remaining col- quires a lot of details on phonology, morphology lection of records into Train, Dev, and Test sets. De- and semantics. There has been a sizable amount tails on the number of records, entries, and words of work on mapping Arabizi to Arabic script using they contain is presented in Table1. We make our a range of techniques from rules to neural mod- data and data splits available (see footnote 1). els Chalabi and Gerges(2012); Darwish(2014); Al-Badrashiny et al.(2014); Guellil et al.(2017); 4 Task Definition and Challenges Younes et al.(2018); Shazal et al.(2020). In this paper we make use of a number of insights and As discussed above, there are numerous ways to techniques from work on Arabizi-to-Arabic script “transliterate” from one script to another. In this sec- transliteration, but apply them in the opposite di- tion we focus on the Romanization of undiacritized rection to map from Arabic script to a complex, Arabic bibliographic entries into the ALA-LC stan- detailed and strict Romanization. We compare dard. Our intention is to highlight the important rule-based and corpus-based techniques including challenges of this task in order to justify the design a Seq2Seq model based on the publicly available choices we make in our approaches. For a detailed code base of Shazal et al.(2020). reference of the ALA-LC Arabic Romanization standard, see (Library of Congress, 2012). 3 Data Collection Phonological Challenges While Romanizing Sources We collected bibliographic records from Arabic consonants is simple, the main challenge three publicly available xml dumps stored in the is in identifying unwritten phonological phenom- machine-readable cataloguing (MARC) standard, ena, e.g., short vowels, under-specified long vow- an international standard for storing and describing els, consonantal gemination, and nunnation, all of bibliographic information. The three data sources which require modeling Arabic diacritization. are the Library of Congress (LC) (10.5M), the Uni- 2Strict orthographic transliteration using the HSB scheme versity of Michigan (UMICH) (680K), and New (Habash et al., 2007). 214 Morphosyntactic Challenges Beyond basic di- in the transliteration. We strip diacritical morpho- acritization modeling, the task requires some mor- logical case endings, but keep other diacritics. We phosyntactic modeling: examples include (a) pro- utilize the CharTrans technique to finalize the Ro- clitics such as the definite article, prepositions and manization starting with the diacritized, hyphen- conjunctions are marked with a hyphen, (b) case ated and capitalization marked words. For words endings are dropped, except before pronominal unknown to the morphological analyzer, we simply enclitics, (c) the silent Alif, appearing in some mas- back off to the CharTrans technique. Model Rules culine plural verbal endings, is ignored, and (d) Morph uses MorphTrans with CharTrans backoff. the Ta-Marbuta ending can be written as h or t de- pending on the morphosyntactic state of the noun. MLE Technique Unlike the previous two tech- For more information on Arabic morphology, see niques, MLE (maximum likelihood estimate) re- (Habash, 2010). lies on the parallel training data we presented in Section3. This simple technique works on white- Semantic Challenges Proper nouns need to be space and punctuation tokenized entries and learns marked with capitalization on their first non-clitic simple one-to-one mapping from Arabic script to alphabetic letter. Since Arabic script does not have Romanization. The most common Romanization “capitalizations”, this effectively requires named- for a particular Arabic script input in the training entity recognition. The Romanization of the word data is used. The outputs are detokenized to allow AlqAhr¯h ‘Cairo’ as al-Qahirah¯ in Figure1 èQëA®Ë@ strict matching alignment with the input. Faced illustrates elements from all challenge types. with OOV (out of vocabulary), we back off to the Special Cases The Arabic ALA-LC guidelines MorphTrans technique (Model MLE Morph) or include a number of special cases, e.g., the word CharTrans Technique (Model MLE Simple).