<<

Automatic Standardization of Colloquial Persian

Mohammad Sadegh Rasooli1 Farzane Bakhtyari2∗ Fatemeh Shafiei3* Mahsa Ravanbakhsh3* and Chris Callison-Burch1 1 Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA 2 Institute for Humanities and Cultural Studies, Tehran, 3 Institute for Cognitive Science Studies, Tehran, Iran {rasooli, ccb}@seas.upenn.edu, [email protected] {shafiei_f, ravanbakhsh_m}@iricss.org

Abstract variations in Arabic, but it is less diverse than what we observe in Arabic dialects. The general upshot The language has two varieties: standard and colloquial. Most natural lan- is that many natural language processing systems guage processing tools for Persian assume that for Persian might easily break when dealing with the text is in standard form: this assumption a mixture of colloquial and standard text. This is is wrong in many real applications especially indeed an important issue given the surge in using web content. This paper describes a simple social media and online contents. and effective standardization approach based The focus of this paper, similar to the major- on sequence-to-sequence translation. We de- ity of previous work, is on contemporary Iranian sign an algorithm for generating artificial par- allel colloquial-to-standard data for learning a Persian. Persian is an Indo-Eurpoean language sequence-to-sequence model. Moreover, we with more than 100 million speakers across the annotate a publicly available evaluation data world especially in Iran, , and Tajik- consisting of 1912 sentences from a diverse set istan (Windfuhr, 2009). Loosely speaking, Iranian of domains. Our intrinsic evaluation shows a Persian is used in two different but similar forms: higher BLEU score of 62.8 versus 61.7 com- standard and colloquial (Boyle, 1952; Shamsfard, pared to an off-the-shelf rule-based standard- 2011; Tabibzadeh, 2020). In some sense, this id- ization model in which the original text has a BLEU score of 46.4. We also show that our iosyncratic phenomenon is a type of diglossia for model improves English-to-Persian machine which there exists a high variety of language used translation in scenarios for which the training in formal writing, and a low variety used in spoken data is from colloquial Persian with 1.4 ab- language (Jeremiás, 1984; Mahmoodi-Bakhtiari, solute BLEU score difference in the develop- 2018). What we refer here as colloquial Persian is ment data, and 0.8 in the test data. a version of the spoken language that is mostly used 1 Introduction in Tehran (captial of Iran). However, due to the na- tional academic system and media, most people in There has recently been a great deal of interest Iran understand and use it in daily basis (Solhju, in developing natural language processing (NLP) 2019). arXiv:2012.05879v1 [cs.CL] 10 Dec 2020 datasets and models for the (e.g. Colloquial Persian has many broken words for (Bijankhan et al., 2011; Rasooli et al., 2013; Feely which some of them invent their own morphology, et al., 2014; Seraji, 2015; Nourian et al., 2015; and sometimes idiosyncratic syntactic order, and Seraji et al., 2016; Mirzaei and Moloodi, 2016; some new idioms (Tabibzadeh, 2020). It is diverse Mirzaei and Safari, 2018; Poostchi et al., 2018; and appears in different shapes from merely stan- Mirzaei et al., 2020; Taher et al., 2020; Khashabi dard syntax with some occasional broken words et al., 2020)). Despite impressive achievements, (as used in TV and radio news) to broken words most of Persian NLP models assume that the text with colloquial syntax. Depending on the formality is written in the standard form. This assumption of context and personal writing style, a colloquial is practically not true (Solhju, 2019). This partic- sentence might have some few broken words and ular phenomenon is somewhat similar to dialectal few other standard word forms: This is somewhat ∗Equal contribution in data annotation. like code-switching between two varieties of the same language instead of two distinct languages. 2.1 Model This paper proposes an automatic method for We model colloquial-to-standard text conversion standardizing colloquial Persian text. The core idea as machine translation. The input to the sys- behind our work is training a sequence-to-sequence tem is a sequence of colloquial words {x1 . . . xn} translation model (Vaswani et al., 2017) that trans- and the output is a sequence of standard word lates colloquial Persian to . We forms {y1 . . . ym} where n and m are not nec- believe that our approach is a more viable tech- essarily equal. We train a standard neural ma- nique than merely applying some conversion rules chine translation model with attention (Vaswani for standardization. On the one hand, writing an ex- et al., 2017) on a training data in which each tensive set of rules depending on different semantic colloquial sentence is accompanied by its stan- contexts is cumbersome, if not impossible. On the dard version. Figure1 shows examples of such other hand, using a sequence-to-sequence model training data. Neural machine translation uses facilitates leveraging pretrained language models sequence-to-sequence models with attention (Cho such as masked language models (Devlin et al., et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2019) trained on huge amount of monolingual texts. 2017) for which the likelihood of training data D = Our experiments also proves this point. Since there {(x(1), y(1)) ... (x(|D|), y(|D|))} is maximized by is no available parallel text between standard and maximizing the log-likelihood of predicting each colloquial Persian, we propose a random standard- target word given its previous predicted words and to-colloquial conversion algorithm based on re- source sequence: cent linguistic studies on common colloquial word forms in (Bakhshizadeh Gashti |D| |y(i)| X X (i) (i) (i) and Tabibzadeh, 2019; Tabibzadeh, 2020). L(D) = log p(yj |yk

1Available for download from https://github. 2The code for this rules is publicly available in https: com/rasoolims/Shekasteh //github.com/rasoolims/PBreak. Example ure, the distributions of the development and test Rule name Standard Colloquial datasets are intentionally very different. à @QîE [tehran] à ðQîE [tehrun] “an” suffix à A‚ [neSan] à ñ‚ [neSun] 3 Experiments [gæman] [gæmun] àAÒà àñÒà In this section we describe our experiments and [berævæm] [beræm] ÐðQK. ÐQK. results. In addition to intrinsic evaluation on our Verb suffix [berævæd] [bere] XðQK. èQK. evaluation datasets, we use machine translation as [berævænd] [beræn] YKðQK. àQK. an extrinsic task for which the training data only [miguyæm] [migæm] Õç' ñÂJ Ó ÕÂJ Ó contains colloquial text while the test data contains  [Pistadæm] [vaysadæm] Verb form ÐXAJ‚ @ ÐXA‚ @ð standard text. ' [bexahæm] ' [bexam] Ñë@ñm . ÐAm . ÐYÓ@ [Pamædæm] ÐYÓð@ [Pumædæm] 3.1 Datasets and Tools [golha] [gola] “ha” suffix AêÊÇ CÇ Datasets: We use the Wikipedia dump of Persian úGAêÊÇ [golhai] úGCÇ [golai] as standard text in addition to one million sentences [saheb] [sahab] I. kA“ H. AgA“ from the Mizan corpus (Kashefi, 2018). We ran- [digær] [dige] Common rules Q KX éÂK X domly break words and use this data as training '  [Panja] ' [Punja] Am. @ Am. ð@ data for standardization. Figure1 shows a few [ra] [ro] @P ðP examples of converted text to colloquial. For ma-  [to ra]  [toro] Case marker @Pñ K ðPñK chine translation evaluation, we use the TEP paral-  [Pan ra] [Puno] @P à@ ñKð@ lel data (Pilevar et al., 2011) which is a collection  [be tu]  [behet] Attach pronoun ñKéK . IîE. of 600K translated movie subtitles in colloquial [be Pu]  [beheS] ð@éK . îE. Persian. Following the splits by Khashabi et al. “Pæst” (is)  [kæm Pæst] [kæme] Iƒ@Õ» éÒ» (2020), we use a small portion of the Mizan par-  [kæm hæstænd] [kæmænd] “hæst” (is) YJ‚ëÕ» YJÒ» allel corpus (Kashefi, 2018) (collection of classic  [kæm hæsti] [kæmi] úæ‚ëÕ» ùÒ» fiction) as our development (1596 sentences) and Table 1: Some of the main conversion rules for convert- test datasets (10000 sentences). ing standard Persian to colloquial with examples. Most Tools and Baselines: We use the Hazm library3 of these rules are inspired from Bakhshizadeh Gashti and Tabibzadeh(2019). to normalize characters and tokenize texts. Hazm is also capable of converting colloquial text to stan- dard text. This is done by manual rules. We use conversion rules create a wrong colloquial word this tool as a strong baseline to compare with. We form. However, since the standard form is the use SacreBLEU (Post, 2018) on the tokenized text ultimate output that we care about, there is less for evaluation. risk of making conversion mistakes. This is very Translation Model: We use a standard similar to back-translation (Sennrich et al., 2016) sequence-to-sequence transformer-based transla- for which the input of the model is a machine tion model (Vaswani et al., 2017) with a six-layer translated text with potential errors. BERT-based (Devlin et al., 2019) encoder-decoder architecture from HuggingFace (Wolf et al., 2.3 Evaluation Data 2019) and Pytorch (Paszke et al., 2019) with a We collect data from different sources including shared SentencePiece (Kudo and Richardson, comments on news websites, discussions on web 2018) vocabulary of size 40K.4 All input and forums, translated fiction, and dialogues in Persian output token embeddings are summed up with fiction. Three native speaker linguists have anno- the language id embedding: we use “” for tated the data in which the development part (917 standard Persian, and “” for broken Persian. sentences) is annotated by the first annotator, and First tokens of every input and output sentence the test data (1012 sentences) is annotated by the are shown by the language ID. We use greedy other two annotators. For the sake of keeping sen- decoding since we find greedy decoding slightly tences coherent, many sentences are actually multi- 3https://github.com/sobhe/hazm ple short sentences. Figure2 shows the counts of 4Code from https://github.com/rasoolims/ sentences in different topics. As seen in the Fig- ImageTranslate/tree/naacl21 Ⅽoⅼⅼoquiaⅼ Stanⅾarⅾ ﻭ ﺯﻣﺎﻧ ﮐﻪ ﺑﺨﻮﺍﻫﻢ ﺍﺯ ﮐﺴ ﺧﺎﺭﺝ ﺍﺯ ﺧﻮﺩﻣﺎﻥ ﺗﻌﺮﯾﻒ ﻭ ﺳﺘﺎﯾﺶ ﮐﻨﻢ ﻭ ﺯﻣﺎﻧ ﮐﻪ ﺑﺨﻮﺍﻡ ﺍﺯ ﮐﺴ ﺧﺎﺭﺝ ﺍﺯ ﺧﻮﺩﻣﻮﻥ ﺗﻌﺮﯾﻒ ﻭ ﺳﺘﺎﯾﺶ ﮐﻨﻢ ﺩﯾﺪﻡ ﮐﻪ ﻓﺎﯾﺪﻩ ﻧﺪﺍﺭﺩ ﺣﺮﻑ ﺑﺰﻧﻢ ﺑﺎﯾﺪ ﺩﺳﺘﻢ ﺭﺍ ﺑﺎﺯ ﮐﻨﻢ ﺩﯾﺪﻡ ﮎ ﻓﺎﯾﺪﻩ ﻧﺪﺍﺭﻩ ﺣﺮﻑ ﺑﺰﻧﻢ ﺑﺎﯾﺪ ﺩﺳﺘﻤﻮ ﺑﺎﺯ ﮐﻨﻢ ﺍﻭ ﻣ ﺩﯾﺪ ﮐﻪ ﭼﻮﻧﻪ ﻣﯿﺸﺎ ﺑﺎ ﺗﺄﻧ ﺑﻪ ﭘﯿﺶ ﻣ ﺭﻭﺩ ﻭ ﮐﻠﻮﺧﻬﺎﯼ ﺑﺰﺭﮒ ﺭﺍ ﺑﺎ ﺧﻮﺩ ﻣ ﮐﺸﺪ ﺍﻭ ﻣ ﺩﯾﺪ ﮐﻪ ﭼﻮﻧﻪ ﻣﯿﺸﺎ ﺑﺎ ﺗﺄﻧ ﺏ ﭘﯿﺶ ﻣ ﺭﻩ ﻭ ﮐﻠﻮﺧﺎﯼ ﺑﺰﺭﮔﻮ ﺑﺎ ﺧﻮﺩ ﻣ ﮐﺸﺪ

Figure 1: Real examples of our random conversions of standard text to colloquial.

800 791 Development Data Test Data Development Test Word Style Word Style 600 Original Data 43.9 38.8 46.4 40.5 400 385 Hazm 56.4 49.0 61.7 52.7

Sentence count Our approach 61.7 53.9 62.8 56.3 180 200 152 100 101 94 52 33 41 0 0 0 0 0 0 0 Table 2: Results of standardizing our Persian colloquial evaluation data via SacreBLEU (Post, 2018). Blog posts Persian fiction Movie subtitles

Product reviews Data Approach Dev. Test News comments Telegram groups Translated fiction Online discussions Original Data No edit 4.4 2.8 Figure 2: Number of sentences from different genres in Hazm 4.2 3.1 Post-edit our evaluation development and test datasets. Our approach 4.9 3.3 Hazm 5.0 3.0 Prepocess Our approach 5.8 3.6 more accurate for standardization than beam search. For machine translation, we use a similar Table 3: Machine translation results trained on pipeline. We pretrain the model on a tuple of TEP (Pilevar et al., 2011) (movie subtitles in colloquial three Wikipedia datasets for Persian, Arabic, and Persian), and evaluated on the Mizan corpus (Kashefi, a sample of 6 million sentences from English 2018) (standard Persian) via SacreBLEU (Post, 2018). using the MASS model (Song et al., 2019) with a shared SentencePiece (Kudo and Richardson, Second, using preprocessed data from our stan- 2018) vocabulary of size 60K. The decoder for dardization model outperforms post-editing with each language has a separate output layer in order standardization. One important note about the low to have language-specific output probabilities. BLEU scores is that the TEP data (Pilevar et al., We use a beam size of 4 for machine translation 2011) consists of very short colloquial movie di- experiments. alogues while the Mizan corpus (Kashefi, 2018) 3.2 Results consists of sentences from classic fictions. This domain mismatch plus the size of the TEP data are Intrinsic evaluation: Table2 shows the BLEU the main causes of this low BLEU. However, out scores of our model versus the rule-based Hazm method still improves the BLEU score by a wide tool. As we see in the Table, our model does a margin. In practice, previous work has used back- much better job in making the text closer to stan- translation improve low-resource translation either dard form. Figure3 shows some examples of our in a non-iterative (Edunov et al., 2018) or iterative conversions versus the rule-based conversion. It is (Hoang et al., 2018) manner. Improvements from worth noting that we observe some very few cases back-translation heavily depend on the quality of in which our model generates an irrelevant word, the initial model. We skip back-translation since it most likely due to the interference of the language is not the focus of our work. model component in sequence-to-sequence models. Solving this issue is an interesting topic to pursue 4 Conclusion in future work. We have described an algorithm for standardizing Extrinsic evaluation: Table3 shows the BLEU Persian text. We show that by creating artificial scores of different models trained of different types training data, we can leverage commonly known of preprocessed and post-processed data. First, patterns of breaking standard forms to colloquial, we observe that standardization helps in all cases. and learn an accurate standardization model. We ﺑﺎﯾﺪ ﺟﺪا ﺑﺸﯿﻢ ﺗﺎ ﻓﻀﺎی ﺑﯿﺸﺘﺮی رو ﺑﺘﻮﻧﯿﻢ ﭼ ﮐﻨﯿﻢ Original text ﺑﺎﯾﺪ ﺟﺪا ﺑﺸﻮﯾﻢ ﺗﺎ ﻓﻀﺎی ﺑﯿﺸﺘﺮی رو ﺑﺘﻮاﻧﯿﻢ ﭼ ﮐﻨﯿﻢ Hazm ﺑﺎﯾﺪ ﺟﺪا ﺷﻮﯾﻢ ﺗﺎ ﻓﻀﺎی ﺑﯿﺸﺘﺮی را ﺑﺘﻮاﻧﯿﻢ ﭼﮐﻨﯿﻢ Our approach ﺑﺎﯾﺪ ﺟﺪا ﺑﺸﻮﯾﻢ ﺗﺎ ﻓﻀﺎی ﺑﯿﺸﺘﺮی را ﺑﺘﻮاﻧﯿﻢ ﭼﮐﻨﯿﻢ (Gold standard (word ﺑﺎﯾﺪ ﺟﺪا ﺑﺸﻮﯾﻢ ﺗﺎ ﺑﺘﻮاﻧﯿﻢ ﻓﻀﺎی ﺑﯿﺸﺘﺮی را ﺑﺮرﺳ ﮐﻨﯿﻢ (Gold standard (style

ﻣﺎ ﺑﺎ ﺗﻮﺟﻪ ﺑﻪ ﻧﯿﺎزﻫﺎ اﯾﻦ ﭼﯿﺰا روﺑﻪ اﺷﺘﺮاک ﻣﯿﺬارﯾﻢ درﺳﺖ ﻣﺜﻞ ﺧﻮﻧﻪ Original Text ﻣﺎ ﺑﺎ ﺗﻮﺟﻪ ﺑﻪ ﻧﯿﺎزﻫﺎ اﯾﻦ ﭼﯿﺰا روﺑﻪ اﺷﺘﺮاکﻣ ﮔﺬارﯾﻢ درﺳﺖ ﻣﺜﻞ ﺧﻮاﻧﺪ Hazm ﻣﺎ ﺑﺎ ﺗﻮﺟﻪ ﺑﻪ ﻧﯿﺎزﻫﺎی اﯾﻦ ﭼﯿﺰﻫﺎ را ﺑﻪ اﺷﺘﺮاکﻣ ﮔﺬارﯾﻢ درﺳﺖ ﻣﺜﻞﺧﺎﻧﻪ Our approach ﻣﺎ ﺑﺎ ﺗﻮﺟﻪ ﺑﻪ ﻧﯿﺎزﻫﺎ اﯾﻦ ﭼﯿﺰﻫﺎ را ﺑﻪ اﺷﺘﺮاکﻣ ﮔﺬارﯾﻢ درﺳﺖ ﻣﺜﻞ ﺧﺎﻧﻪ (Gold standard (word ﻣﺎ درﺳﺖ ﻣﺜﻞ ﺧﺎﻧﻪ ﺑﺎ ﺗﻮﺟﻪ ﺑﻪ ﻧﯿﺎزﻫﺎ اﯾﻦ ﭼﯿﺰﻫﺎ را ﺑﻪ اﺷﺘﺮاکﻣ ﮔﺬارﯾﻢ (Gold standard (style

Figure 3: Examples of standardization outputs on our development data from our model and the baseline Hazm model. Wrong words are highlighted. We see that in the first example, although the output of our approach does not completely match with the surface word form, it is still a correct word (two correct forms of the same inflection). In the second example, we see that Hazm fails to do proper conversion, while our approach just adds an extra “Ezafe” suffix. believe that this work is just a beginning to this line for statistical machine translation. In Proceedings of of research. Future work should investigate more the 2014 Conference on Empirical Methods in Nat- sophisticated methods for standardization. ural Language Processing (EMNLP), pages 1724– 1734, Doha, Qatar. Association for Computational Linguistics. Acknowledgment Jacob Devlin, Ming-Wei Chang, Kenton Lee, and We would like to deeply thank Omid Tabibzadeh Kristina Toutanova. 2019. BERT: Pre-training of for sharing his knowledge with us and constructive deep bidirectional transformers for language under- discussions. standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language References Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Associ- Nadieh Armin and Mehrnoush Shamsfard. 2011. Con- ation for Computational Linguistics. verting Persian colloquium text to formal by n- grams. In Computer Society of Iran. Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua scale. In Proceedings of the 2018 Conference on Bengio. 2015. Neural machine translation by Empirical Methods in Natural Language Processing, jointly learning to align and translate. CoRR, pages 489–500, Brussels, Belgium. Association for abs/1409.0473. Computational Linguistics.

Yousef Bakhshizadeh Gashti and Omid Tabibzadeh. Weston Feely, Mehdi Manshadi, Robert E. Frederking, 2019. Colloquial forms and Persian lexicons. Per- and Lori S. Levin. 2014. The CMU METAL Farsi sian Language and Iranian Dialects, 2(6). NLP approach. In LREC, pages 4052–4055. Mahmood Bijankhan, Javad Sheykhzadegan, Mo- Vu Cong Duy Hoang, Philipp Koehn, Gholamreza hammad Bahrani, and Masood Ghayoomi. 2011. Haffari, and Trevor Cohn. 2018. Iterative back- Lessons from building a Persian written corpus: translation for neural machine translation. In Pro- Peykare. Language resources and evaluation, ceedings of the 2nd Workshop on Neural Machine 45(2):143–164. Translation and Generation, pages 18–24, Mel- John Andrew Boyle. 1952. Notes on the colloquial lan- bourne, Australia. Association for Computational guage of Persia as recorded in certain recent writ- Linguistics. ings. Bulletin of the School of Oriental and African Studies, 14(3):451–462. Éva M Jeremiás. 1984. Diglossia in persian. Acta Linguistica Academiae Scientiarum Hungaricae, Kyunghyun Cho, Bart van Merriënboer, Caglar Gul- 34(3/4):271–287. cehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Omid Kashefi. 2018. Mizan: a large Persian-English phrase representations using RNN encoder–decoder parallel corpus. arXiv preprint arXiv:1801.02107. Daniel Khashabi, Arman Cohan, Siamak Shakeri, Hanieh Poostchi, Ehsan Zare Borzeshi, and Massimo Pedram Hosseini, Pouya Pezeshkpour, Malihe Piccardi. 2018. BiLSTM-CRF for Persian named- Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze entity recognition. ArmanPersoNERCorpus: The Brahman, Sarik Ghazarian, Mozhdeh Gheini, first entity-annotated Persian dataset. In Proceed- Arman Kabiri, Rabeeh Karimi Mahabadi, Omid ings of the Eleventh International Conference on Memarrast, Ahmadreza Mosallanezhad, Erfan Language Resources and Evaluation (LREC 2018). Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Matt Post. 2018. A call for clarity in reporting BLEU Samghabadi, Mahsa Shafaei, Saber Sheybani, scores. In Proceedings of the Third Conference on Ali Tazarv, and Yadollah Yaghoobzadeh. 2020. Machine Translation: Research Papers, pages 186– PARSINLU: A suite of language understanding 191, Brussels, Belgium. Association for Computa- challenges for Persian. arXiv preprint. tional Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: Mohammad Sadegh Rasooli, Manouchehr Kouhestani, A simple and language independent subword tok- and Amirsaeid Moloodi. 2013. Development of a enizer and detokenizer for neural text processing. In Persian syntactic dependency treebank. In Proceed- Proceedings of the 2018 Conference on Empirical ings of the 2013 Conference of the North Ameri- Methods in Natural Language Processing: System can Chapter of the Association for Computational Demonstrations, pages 66–71, Brussels, Belgium. Linguistics: Human Language Technologies, pages Association for Computational Linguistics. 306–314. Rico Sennrich, Barry Haddow, and Alexandra Birch. Behrooz Mahmoodi-Bakhtiari. 2018. Spoken vs. writ- 2016. Improving neural machine translation mod- ten Persian: Is Persian diglossic? In Alireza Ko- els with monolingual data. In Proceedings of the rangy and Corey Miller, editors, Trends in Iranian 54th Annual Meeting of the Association for Compu- and Persian Linguistics , chapter 10, pages 183–212. tational Linguistics (Volume 1: Long Papers), pages De Gruyter Mouton. 86–96, Berlin, Germany. Association for Computa- tional Linguistics. Azadeh Mirzaei and Amirsaeid Moloodi. 2016. Per- sian proposition bank. In Proceedings of the Tenth Mojgan Seraji. 2015. Morphosyntactic corpora and International Conference on Language Resources tools for Persian. Ph.D. thesis, Acta Universitatis and Evaluation (LREC’16), pages 3828–3835. Upsaliensis. Azadeh Mirzaei and Pegah Safari. 2018. Persian dis- Mojgan Seraji, Filip Ginter, and Joakim Nivre. 2016. course treebank and coreference corpus. In Proceed- Universal dependencies for Persian. In Proceedings ings of the Eleventh International Conference on of the Tenth International Conference on Language Language Resources and Evaluation (LREC 2018). Resources and Evaluation (LREC’16), pages 2361– 2365. Azadeh Mirzaei, Fatemeh Sedghi, and Pegah Safari. 2020. Semantic role labeling system for persian Mehrnoush Shamsfard. 2011. Challenges and open language. ACM Transactions on Asian and Low- problems in Persian text processing. In Proceedings Resource Language Information Processing (TAL- of LTC. LIP), 19(3):1–12. Ali Solhju. 2019. How To Write Spoken Short Forms Alireza Nourian, Mohammad Sadegh Rasooli, Mohsen In Persian Dialogues. Markaz, Tehran, Iran. Imany, and Heshaam Faili. 2015. On the importance of Ezafe construction in Persian parsing. In Proceed- Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie- ings of the 53rd Annual Meeting of the Association Yan Liu. 2019. MASS: Masked sequence to se- for Computational Linguistics and the 7th Interna- quence pre-training for language generation. arXiv tional Joint Conference on Natural Language Pro- cs.CL 1905.02450. cessing (Volume 2: Short Papers), pages 877–882. Omid Tabibzadeh. 2020. Orthography of Colloquial Persian: Based on Works of Fiction and Drama Adam Paszke, Sam Gross, Francisco Massa, Adam Spanning a Century (1918-2018). Institute for Hu- Lerer, James Bradbury, Gregory Chanan, Trevor manities and Cultural Studies, Tehran, Iran. Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, Ehsan Taher, Seyed Abbas Hoseini, and Mehrnoush high-performance deep learning library. In Ad- Shamsfard. 2020. Beheshti-NER: Persian named vances in neural information processing systems, entity recognition using BERT. arXiv preprint pages 8026–8037. arXiv:2003.08875. Mohammad Taher Pilevar, Heshaam Faili, and Ab- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob dol Hamid Pilevar. 2011. TEP: Tehran English- Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Persian parallel corpus. In International Conference Kaiser, and Illia Polosukhin. 2017. Attention is all on Intelligent Text Processing and Computational you need. In Advances in neural information pro- Linguistics, pages 68–79. Springer. cessing systems, pages 5998–6008. Gernot Windfuhr. 2009. The . Psy- chology Press. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- icz, et al. 2019. Huggingface’s transformers: State- of-the-art natural language processing. ArXiv, pages arXiv–1910.