A Consistent Persian Orthography Based on the Latin Script

SLPO: A Consistent Persian Orthography Based on the Latin Script Reza Salarmehr Amirkabir University [email protected] Abstract Persian is officially written in Arabic script which causes some well-known problems in writing academic materialfor native and non-native speakers. In this paper we propose a standard romanization and orthography scheme for Persian named SLPO that is best fitted for natural language processing, learning purposes and writing academic material (but not limited to) and we claim that it is the most comprehensive and consistent scheme ever presented for this language. Keywords: romanization scheme, Persian, natural language processing, orthography 1 Introduction 2 The Arabic Script Problems In Iran and Afghanistan, Persian is officially written in Arabic The problems of the Arabic script for writing Persian have script that is a right to left script (RTL). In this script short been investigated extensively during the two last centuries [?]. vowels are usually not written, and as a result a lot of word- A complete review of this topic is out of the scope of this pa- forms have more than one pronunciation and meaning (e.g: per. Here, we point to two main enigma regarding using the This fact makes writing and reading Persian hard Arabic script to write Persian (For a comprehensive overview .(ﺳﯿﺮ، ﺑﺮ، ﮐﺮم for non-native (and even native) speakers [?]. As a result most of related literature refer to [?]) . authors prefer to use a romanized form of Persian words that is easier to read and is supported better in computer environ- 2.1 Short Vowel Enigma ments. However, the absence of an accepted romanization stan- Short vowels (/æ/, /e/, /o/) are not generally represented in dard caused authors to use various personal romanization written Persian while changing the short vowel in a word can -rep ‹ﮔﻞ› schemes that in turn caused some problems. In this paper, create a totally unrelated new word. As an example we represent two romanization schemes for two distinct use- resents two unrelated words: ‹gel› (mud) and ‹gol› (ﬂower). cases. We claim that they are the most suitable schemes avail- So correct pronunciation of words which contains a short able to be used in general and particularly in academic cases. vowel is not possible without prior knowledge and if a word- The first scheme is intended to be used for general applica- form represents more that one word, the reader should choose tions but it is not totally reversible, and the second one is in- correct pronunciation according to its context. As a real ex- tended to be used in situations that reversibility is important. ample see ?? figure. The name of a single alley could notbe We have provided a web site that contains the Latin form of pronounced correctly by the mayoralty staffer and is roman- more than 350,000 Persian words and idioms. You can access ized differently on two signboards in the alley. it in http://vajje.com. Ezafe is a one of most important grammatical features of The structure of this paper is as follows: in the next sec- Persian. Ezafe is pronounced as a short vowel, /e/, and similar tion we draw an overview of related works. In section ?? we to other short vowels it is generally not written. If the word point to characteristics of our proposed orthography named preceding Ezafe ends in a vowel, Ezafe will be articulated /je/ Also there is no .‹ی› ,SLPO. In sections ?? and ?? romanization and orthography is- and it will have a written presentation sues are discussed respectively. consensus on how to write the Persian indefinite article when may be ‹ی›+‹ﺧﺎﻧﻪ› the referent ends in /e/. As an example 1 3 Related works In this section, we overview Persian romanization schemes that have been proposed in the past twenty years. Maleki [?] proposed a romanization scheme which contains 29 phonemes (23 consonants and 6 vowels) named Dabire (formerly eFarsi [?]). Dabire uses a single Latin grapheme for Arabic graphemes that have the same phonol- ogy in Persian. The author provides some guidelines for writ- Figure 1: Two signboards in Tehran. The name of the alley, ing compound words, abbreviations and foreign load words. ‹ ›, could not be pronounced correctly and is romanized -He also describes capitalization and punctuation in his ro ﭘﺸﻦ in two different forms in the same alley. manization scheme. Ghayour [?] provides a romanization mapping table named Ironic. Using the letter ‹w› to represent /u:/, his .[?] ‹ﺧﺎﻧﻪ ﺋﯽ› or ‹ﺧﺎﻧﻪ ﯾﯽ› ,‹ﺧﺎﻧﻪ ای› wrote as scheme only includes basic Latin alphabet. In his scheme letters ‹a›, ‹e›, ‹o›, ‹i›, ‹u› and ‹w› respectively represent /æ/, 2.2 ZWNJ Enigma /e/, /ɒ:/, /i:/, /o/ and /u:/. Unfortunately, this caused the Appearance of Arabic letters changes depending on the let- well-known pronunciation of letters ‹o›, ‹u›, ‹j› and ‹w› ters that precede or succeed them in a word and the struc- to be changed. Currently, Iranians generally pronounce these ture of the word that the letter belongs to. For exam- letters /o/, /uː/, /ʤ/ and /v/, respectively. Furthermore it uses a digraph for /ʤ/ and a monograph for /ʒ/ while /ʒ/ has ‹ﺳـ›,‹ـﺲ›,‹س› :has these forms (‹س›) ple, the letter sin .When writing Persian compound words, some the least frequency in Persian .‹ـﺴـ› and times letters in the concatenation point join each other Moslehi [?] introduced a romanization scheme named and sometimes they not (e.g. IPA2 (Pársik). He uses a lot of diacritics to represent aiyn (‹ﮔﻼب› →‹آب›+‹ﮔﻞ› .e.g) In computer ty- and hamza (called mul in IPA2). He also used a digraph for .(‹ﺟﺴﺘﻪ ﮔﺮﯾﺨﺘﻪ› →‹ﮔﺮﯾﺨﺘﻪ›+ZWNJ+‹ﺟﺴﺘﻪ› is a frequent Persian ‹ش› pography a zero width non-joiner character (ZWNJ) is in- /ʃ/ and a monograph for /ʧ/, while has very low frequency in Persian. The letter ‹چ› serted between stems to prevent them from joining each letter and mowz›. The› = ‹ﻣﻮز› .other. Unfortunately, there is no accepted rule for writing ‹w› is also used in some words, e.g derived and compound words (e.g. ‹dunecjoo› is written as scheme also has some rules for writing in informal language .and some other guidelines .[? ,? ,? ,?] (‹داﻧﺶ ﺟﻮ› and ‹داﻧﺸﺠﻮ› Similar to other Semitic languages, Arabic has a noncon- Mahdavi [?] proposed a strict mapping from Arabic script catenative "root-and-pattern" morphology. The pattern can to Latin script that maintains all Arabic diacritics and even The .‹(ڡ› .be used to read words correctly even without presenting short radical and ambiguous forms of letters (e.g vowels. Compounding and derivation are not used in the Ara- scheme contains almost 80 characters. Actually, adding Ara- bic language to form new words. As a result, the above two bic diacritics to Persian texts is simpler to learn and use. Nev- enigmas are not relevant in that language. Indeed the Arabic ertheless his suggestion may be useful in processing the old script only works well for writing the Arabic language. Persian and Arabic manuscripts. There are also some other rudimentary schemes, in- 2.3 Why is a standard romanization scheme nec- cluding: The Tajik alphabet in Latin (obsolete), Paarsi [?], essary? UniPers [?], UN romanization of Persian for Geographical Names [?], Library of Congress/American Library Associa- In addition to the above issues, there are some other diffi- tion romanization of Persian, IJMES transliteration system culties that caused those whose first language is not Persian for Arabic, Persian, and Turkish [?], Encyclopaedia Iranica’s to use the Latin script, instead of the Arabic script during scheme [?]. their learning course or in their publications. Unfortunately, All proposed schemes suffer from at least two of the fol- each author uses her/his personal romanization scheme that lowing deficiencies: (i) they use some new or modified Latin makes reading and indexing of romanized Persian texts diffi- letters that make their use hard for Persian-learners. (ii) they cult. Even in Iran when there is a need to romanize the name ignore phonological rules of Persian and spell some words un- of people and places, there is no unique accepted romaniza- necessarily long (iii) they lack a firm rule for writing saken ayin tion scheme. Consequently different romanized forms may and hamza. (iii) they do not comprehensively describe or- be used by distinct users which make searching the name of thography of all Persian grammatical structures. (iv) they do people and places hard and confusing. 2 not suggest a romanization scheme in situations in which re- 5 General Romanization versibility matters. (v) they ignore the semantic accepts of the writing system and do not provide any clues to shed light on The Latin script of Persian consists of 25 letters. It does not the meaning of words (vi) they do not provide a firm regular use the letter ‹w›. You can see the letters in table ??. There method to spell Persian verbs. are some rules in using this alphabet as follow: In this paper we have proposed a new orthography scheme which addresses all of these issues. Rule 1 Each word contains at least one vowel letter (‹a,e,i,o,u›). 4 SLPO Rule 2 No word begins with more than two consonants. In this section we introduce our suggested romanization See section ?? to see how to pronounce words that begin schema named Standard Latin-based Persian orthography ab- with two consonants. breviated as SLPO. In developing our romanization scheme we used some 5.1 Vowels statistics data such as letter frequency and phoneme frequency. You can see Persian letter frequency in figure ?? and Persian has six vowels, including: /ɒː/, /e/, /o/,/æ/, /iː/ and ??. /uː/. The first three rows of table ?? show how a single word- SLPO is the most comprehensive and consistent Persian form in the Arabic script can be pronounced differently and romanization scheme ever proposed.

A Consistent Persian Orthography Based on the Latin Script

Arabic Alphabet - Wikipedia, the Free Encyclopedia Arabic Alphabet from Wikipedia, the Free Encyclopedia

Diacritic-Based Matching of Arabic Words. ACM Trans

Mpub10110094.Pdf

The Unicode Standard, Version 6.2 Copyright © 1991–2012 Unicode, Inc

Through the Looking-Glass: Roman Letters in Phonics for Arabic As a Part of Multimedia Support

Exploiting Arabic Diacritization for High Quality Automatic Annotation

Diacritics Restoration for Arabic Dialects Salima Harrat, Mourad Abbas, Karima Meftouh, Kamel Smaïli

Middle East-I 9 Modern and Liturgical Scripts

Arabic Alphabet 1 Arabic Alphabet

Diacritization of a Highly Cited Text: a Classical Arabic Book As a Case

Arabic Diacritics 1 Arabic Diacritics

Guidelines and Framework for a Large Scale Arabic Diacritized Corpus