arXiv:2012.05879v1 [cs.CL] 10 Dec 2020
Automatic Standardization of Colloquial Persian

Mohammad Sadegh Rasooli¹, Farzane Bakhtyari²*, Fatemeh Shafiei³*, Mahsa Ravanbakhsh³*, and Chris Callison-Burch¹
¹ Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
² Institute for Humanities and Cultural Studies, Tehran, Iran
³ Institute for Cognitive Science Studies, Tehran, Iran
{rasooli, ccb}@seas.upenn.edu, [email protected], {shafiei_f, ravanbakhsh_m}@iricss.org
* Equal contribution in data annotation.

Abstract

The Iranian Persian language has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that the text is in standard form; this assumption is wrong in many real applications, especially web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for learning a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation dataset consisting of 1912 sentences from a diverse set of domains. In intrinsic evaluation, our model reaches a BLEU score of 62.8, compared with 61.7 for an off-the-shelf rule-based standardization model and 46.4 for the original, unstandardized text. We also show that our model improves English-to-Persian machine translation in scenarios where the training data is colloquial Persian, by 1.4 absolute BLEU points on the development data and 0.8 on the test data.

1 Introduction

There has recently been a great deal of interest in developing natural language processing (NLP) datasets and models for the Persian language (e.g., Bijankhan et al., 2011; Rasooli et al., 2013; Feely et al., 2014; Seraji, 2015; Nourian et al., 2015; Seraji et al., 2016; Mirzaei and Moloodi, 2016; Mirzaei and Safari, 2018; Poostchi et al., 2018; Mirzaei et al., 2020; Taher et al., 2020; Khashabi et al., 2020). Despite impressive achievements, most Persian NLP models assume that the text is written in the standard form. In practice, this assumption does not hold (Solhju, 2019). This phenomenon is somewhat similar to dialectal variation in Arabic, but it is less diverse than what we observe in Arabic dialects. The general upshot is that many natural language processing systems for Persian might easily break when dealing with a mixture of colloquial and standard text. This is an important issue given the surge in social media use and online content.

The focus of this paper, similar to the majority of previous work, is on contemporary Iranian Persian. Persian is an Indo-European language with more than 100 million speakers across the world, especially in Iran, Afghanistan, and Tajikistan (Windfuhr, 2009). Loosely speaking, Iranian Persian is used in two different but similar forms: standard and colloquial (Boyle, 1952; Shamsfard, 2011; Tabibzadeh, 2020). In some sense, this idiosyncratic phenomenon is a type of diglossia, in which a high variety of the language is used in formal writing and a low variety is used in speech (Jeremiás, 1984; Mahmoodi-Bakhtiari, 2018). What we refer to here as colloquial Persian is a version of the spoken language that is mostly used in Tehran (the capital of Iran). However, due to the national academic system and media, most people in Iran understand it and use it on a daily basis (Solhju, 2019).

Colloquial Persian has many broken words, some of which follow their own morphology, as well as occasionally idiosyncratic syntactic order and some new idioms (Tabibzadeh, 2020). It is diverse, ranging from standard syntax with occasional broken words (as used in TV and radio news) to broken words with colloquial syntax. Depending on the formality of the context and personal writing style, a colloquial sentence might mix a few broken words with standard word forms; this is somewhat like code-switching between two varieties of the same language rather than between two distinct languages.

This paper proposes an automatic method for standardizing colloquial Persian text. The core idea behind our work is training a sequence-to-sequence translation model (Vaswani et al., 2017) that translates colloquial Persian to standard Persian. We believe that our approach is more viable than merely applying conversion rules for standardization. On the one hand, writing an extensive set of rules that depend on different semantic contexts is cumbersome, if not impossible. On the other hand, using a sequence-to-sequence model facilitates leveraging pretrained language models such as masked language models (Devlin et al., 2019) trained on huge amounts of monolingual text. Our experiments also support this point. Since there is no available parallel text between standard and colloquial Persian, we propose a random standard-to-colloquial conversion algorithm based on recent linguistic studies on common colloquial word forms in Persian literature (Bakhshizadeh Gashti and Tabibzadeh, 2019; Tabibzadeh, 2020).

Our contribution is three-fold: 1) We propose a novel method for training translation models from colloquial to standard Persian, and show improvements over an off-the-shelf rule-based model. Apart from a paper written in Persian that uses a simple n-gram approach (Armin and Shamsfard, 2011) and released no code or data, we are not aware of any work on this topic; 2) We provide a publicly available, manually annotated evaluation dataset for colloquial Persian with two types of standardization, of which the first only concerns surface word forms and the second makes the text stylistically standard. This data consists of 1929 sentences from different genres including fiction, translated fiction, blog posts, public group chats, and comments on news websites;¹ and 3) We show that our method improves English-to-Persian machine translation trained on colloquial Persian data.

¹ Available for download from https://github.com/rasoolims/Shekasteh

2 Automatic Standardization and Evaluation

There are three core components of our research: 1) Modeling, 2) Training data, 3) Evaluation. This section briefly describes these three components.

2.1 Model

We model colloquial-to-standard text conversion as machine translation. The input to the system is a sequence of colloquial words $\{x_1 \ldots x_n\}$ and the output is a sequence of standard word forms $\{y_1 \ldots y_m\}$, where $n$ and $m$ are not necessarily equal. We train a standard neural machine translation model with attention (Vaswani et al., 2017) on training data in which each colloquial sentence is accompanied by its standard version. Figure 1 shows examples of such training data. Neural machine translation uses sequence-to-sequence models with attention (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), for which the likelihood of the training data $\mathcal{D} = \{(x^{(1)}, y^{(1)}) \ldots (x^{(|\mathcal{D}|)}, y^{(|\mathcal{D}|)})\}$ is maximized by maximizing the log-likelihood of predicting each target word given its previous predicted words and the source sequence:

$$\mathcal{L}(\mathcal{D}) = \sum_{i=1}^{|\mathcal{D}|} \sum_{j=1}^{|y^{(i)}|} \log p\left(y_j^{(i)} \mid y_{k<j}^{(i)}, x^{(i)}; \theta\right)$$

where $\theta$ is a collection of parameters to be learned.

2.2 Training Data Generation

Since there is no parallel data to train a machine translation model, we define a set of rules to break standard forms into colloquial forms. In total, there are more than 300 rules, mostly inspired by Bakhshizadeh Gashti and Tabibzadeh (2019). Table 1 lists the main rules used in this work.² These rules include changing vowels (e.g. a → u), modifying verb forms (e.g. بگویم [beguyæm] → بگم [begæm]), and other types of conversion listed by Bakhshizadeh Gashti and Tabibzadeh (2019).

² The code for these rules is publicly available at https://github.com/rasoolims/PBreak.

Our conversion rules define a function $f(x_i, x_{i+1})$ that, depending on the part-of-speech tag and the form of the next word $x_{i+1}$, generates a new word form $y$ or a sequence of word forms. Since we do not know the extent to which words are broken in real text, we randomly skip some conversions with a probability of 0.1 so that the text looks like a mix of colloquial and standard word forms. One advantage of this approach is that we can easily obtain a large amount of parallel text to train a neural machine translation model. Some real examples are shown in Figure 1. There are definitely many cases for which our […]

[…]ure, the distributions of the development and test datasets are intentionally very different.

3 Experiments

In this section we describe our experiments and results. In addition to intrinsic evaluation on our evaluation datasets, we use machine translation as an extrinsic task in which the training data contains only colloquial text while the test data contains standard text.

3.1 Datasets and Tools

Datasets: We use the Persian Wikipedia dump as standard text, in addition to one million sentences from the Mizan corpus (Kashefi, 2018). We randomly break words and use this data as training data for standardization. Figure 1 shows a few examples of text converted to colloquial form. For machine translation evaluation, we use the TEP parallel data (Pilevar et al., 2011) which is a collection

| Rule name | Standard example | Colloquial example |
|---|---|---|
| "an" suffix | تهران [tehran] | تهرون [tehrun] |
| | نشان [neʃan] | نشون [neʃun] |
| | گمان [gæman] | گمون [gæmun] |
| Verb suffix | بروم [berævæm] | برم [beræm] |
| | برود [berævæd] | بره [bere] |
| | بروند [berævænd] | برن [beræn] |
| Verb form | می‌گویم [miguyæm] | می‌گم [migæm] |
| | ایستادم [ʔistadæm] | وایسادم [vaysadæm] |
| | بخواهم [bexahæm] | بخوام [bexam] |
| | آمدم [ʔamædæm] | اومدم [ʔumædæm] |
| "ha" suffix | گلها [golha] | گلا [gola] |
| | گلهایی [golhai] | گلایی [golai] |
| Common rules | صاحب [saheb] | صاحاب [sahab] |
| | دیگر [digær] | دیگه [dige] |
| | آنجا [ʔanja] | اونجا [ʔunja] |
| Case marker | را [ra] | رو [ro] |
| | تو را [to ra] | تورو [toro] |
| | آن را [ʔan ra] | اونو [ʔuno] |
| Attach pronoun | به تو [be tu] | بهت [behet] |
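The random standard-to-colloquial conversion described in Section 2.2 can be illustrated with a minimal Python sketch. It is not the released PBreak code: the names `WORD_RULES`, `PAIR_RULES`, and `break_sentence` are hypothetical, the rules are a tiny romanized subset of Table 1, and part-of-speech conditions are left out for brevity. Each applicable rule fires unless it is skipped with probability 0.1, and rules that look at the next word stand in for $f(x_i, x_{i+1})$.

```python
import random

# Toy, romanized subset of the Table 1 rules (assumption: the real rules
# operate on Persian script and also consult part-of-speech tags).
WORD_RULES = {
    "tehran": "tehrun",    # "an" suffix
    "berævæm": "beræm",    # verb suffix
    "miguyæm": "migæm",    # verb form
    "golha": "gola",       # "ha" suffix
    "digær": "dige",       # common rule
}
# Rules that depend on the next word: (current, next) -> merged colloquial form.
PAIR_RULES = {
    ("to", "ra"): "toro",   # case marker
    ("be", "tu"): "behet",  # attached pronoun
}

SKIP_PROB = 0.1  # probability of leaving a convertible word in standard form


def break_sentence(words, rng, skip_prob=SKIP_PROB):
    """Return a pseudo-colloquial version of a standard word sequence."""
    out = []
    i = 0
    while i < len(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if (words[i], nxt) in PAIR_RULES:
            if rng.random() < skip_prob:
                out.extend([words[i], nxt])  # skipped: keep both standard words
            else:
                out.append(PAIR_RULES[(words[i], nxt)])
            i += 2
        elif words[i] in WORD_RULES:
            skip = rng.random() < skip_prob
            out.append(words[i] if skip else WORD_RULES[words[i]])
            i += 1
        else:
            out.append(words[i])  # no rule applies
            i += 1
    return out


# Build one artificial (colloquial, standard) pair from a toy word sequence.
standard = ["be", "tu", "miguyæm", "digær"]
colloquial = break_sentence(standard, random.Random(0))
training_pair = (" ".join(colloquial), " ".join(standard))
```

Pairing each broken sentence with its untouched source gives the parallel data on which the translation model is trained. Setting `skip_prob` to 0 converts every matching word, while values closer to 1 leave the text almost entirely standard.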