A Transcription Scheme for Languages Employing the Arabic Script Motivated by Speech Processing Application


Shadi GANJAVI
*Department of Linguistics, University of Southern California

Panayiotis G. GEORGIOU, Shrikanth NARAYANAN*
Department of Electrical Engineering, Speech Analysis & Interpretation Laboratory (sail.usc.edu)
[georgiou, shri]@sipi.usc.edu

Abstract

This paper offers a transcription system for Persian, the target language in the Transonics project, a speech-to-speech translation system developed as a part of the DARPA Babylon program (The DARPA Babylon Program; Narayanan, 2003). In this paper, we discuss transcription systems needed for automated spoken language processing applications in Persian, which uses the Arabic script for writing. This system can easily be modified for Arabic, Dari, Urdu and any other language that uses the Arabic script. The proposed system has two components. One is a phonemic based transcription of sounds for acoustic modelling in Automatic Speech Recognizers and for Text to Speech synthesizers, using ASCII based symbols rather than International Phonetic Alphabet symbols. The other is a hybrid system that provides a minimally-ambiguous lexical representation that explicitly includes vocalic information; such a representation is needed for language modelling, text to speech synthesis and machine translation.

1 Introduction

Speech-to-speech (S2S) translation systems present many challenges, not only due to the complex nature of the individual technologies involved, but also due to the intricate interaction that these technologies have to achieve. A great challenge for the specific S2S translation system involving Persian and English would arise not only from the linguistic differences between the two languages but also from the limited amount of data available for Persian. The other major hurdle in achieving an S2S system involving these languages is the Persian writing system, which is based on the Arabic script, and hence lacks the explicit inclusion of vowel sounds, resulting in a very large number of one-to-many mappings from transcription to acoustic and semantic representations.

In order to achieve our goal, the system that was designed comprises the following components: (1) a visual and control Graphical User Interface (GUI); (2) an Automatic Speech Recognition (ASR) subsystem, which works using both Finite State Grammars (FSG) and Language Models (LM), producing n-best lists/lattices along with the decoding confidence scores; (3) a Dialog Manager (DM), which receives the output of the speech recognition and machine translation units and subsequently "re-scores" the data according to the history of the conversation; (4) a Machine Translation (MT) unit, which works in two modes: Classifier based MT and a fully Stochastic MT; and finally (5) a unit selection based Text To Speech synthesizer (TTS), which provides the spoken output. A functional block diagram is shown in Figure 1.

Fig 1. Block diagram of the system. Note that the communication server allows interaction between all subsystems and the broadcast of messages. Our vision is that only the doctor will have access to the GUI and the patient will only be given a phone handset.

1.1 The Language Under Investigation: Persian

Persian is an Indo-European language with a writing system based on the Arabic script. Languages that use this script have posed a problem for automated language processing such as speech recognition and translation systems. For instance, the CSLU Labeling Guide (Lander, http://cslu.cse.ogi.edu/corpora/corpPublications.html) offers orthographic and phonetic transcription systems for a wide variety of languages, from German to Spanish with a Latin-based writing system to languages like Mandarin and Cantonese, which use Chinese characters for writing. However, there seems to be no standard transcription system for languages like Arabic, Persian, Dari, Urdu and many others, which use the Arabic script (ibid; Kaye, 1876; Kachru, 1987, among others).

Because Persian and Arabic are different, Persian has modified the writing system and augmented it in order to accommodate the differences. For instance, four letters were added to the original system in order to capture the sounds available in Persian that Arabic does not have. Also, there are a number of homophonic letters in the Persian writing system, i.e., the same sound corresponding to different orthographic representations. This problem is unique to Persian, since in Arabic different orthographic representations represent different sounds. The other problem, which is common to all languages using the Arabic script, is the existence of a large number of homographic words, i.e., orthographic representations that have a similar form but different pronunciations. This problem arises due to the limited vowel representation in this writing system.

Examples of the homophones and homographs are represented in Table 1. The words "six" and "lung" are examples of homographs, where the identical (transliterated Arabic) orthographic representations (Column 3) correspond to different pronunciations, [SeS] and [SoS] respectively (Column 4). The words "hundred" and "dam" are examples of homophones, where the two words have similar pronunciation [sad] (Column 4), despite their different spellings (Column 3).

          Persian   USCPers   USCPron   USCPers+
'six'     شش        SS        SeS       SeS
'lung'    شش        SS        SoS       SoS
'100'     صد        $d        sad       $ad
'dam'     سد        sd        sad       sad

Table 1: Examples of the transcription methods and their limitations. Purely orthographic transcription schemes (such as USCPers) fail to distinctly represent homographs, while purely phonetic ones (such as USCPron) fail to distinctly represent the homophones.

The former is an example of the cases in which there is a many-to-one mapping between orthography and pronunciation, a direct result of the basic characteristic of the Arabic script, viz., little to no representation of the vowels. As is evident from the data presented in this table, there are two major sources of problems for any speech-to-speech machine translation. In other words, to employ a system with a direct 1-1 mapping between Arabic orthography and a Latin based transcription system (what we refer to as USCPers in our paper) would be highly ambiguous and insufficient to capture distinct words as required by our speech-to-speech translation system, thus resulting in ambiguity at the text-to-speech output level, and internal confusion in the language modelling and machine translation units. The latter, on the other hand, is representative of the cases in which the same sequence of sounds would correspond to more than one orthographic representation. Therefore, using a pure phonetic transcription, e.g., USCPron, would be acceptable for the Automatic Speech Recognizer (ASR), but not for the Dialog Manager (DM) or the Machine Translator (MT).

The goal of this paper is twofold: (i) to provide an ASCII based phonemic transcription system similar to the one used in the International Phonetic Alphabet (IPA), in line with Worldbet (Hieronymus, http://cslu.cse.ogi.edu/corpora/corpPublications.html) and (ii) to argue for an ASCII based hybrid transcription scheme, which provides an easy way to transcribe data in languages that use the Arabic script.

We will proceed in Section 2 to provide the USCPron ASCII based phonemic transcription system that is similar to the one used by the International Phonetic Alphabet (IPA), in line with Worldbet (ibid). In Section 3, we will present the USCPers orthographic scheme, which has a one-to-one mapping to the Arabic script. In Section 4 we will present and analyze USCPers+, a hybrid system that keeps the orthographic information, while providing the vowels. Section 5 discusses some further issues regarding the lack of data.

2 Phonetic Labels (USCPron)

One of the requirements of an ASR system is a phonetic transcription scheme to represent the pronunciation patterns for the acoustic models. Persian has a total of 29 sounds in its inventory, six vowels (Section 2.1) and 23 consonants (Section 2.2). The system that we created to capture these sounds is a modified version of the International Phonetic Alphabet (IPA), called USCPron(unciation). In USCPron, just like the IPA, there is a one-to-one correspondence between the sounds and the symbols representing them. However, this system, unlike the IPA, does not require special fonts and makes use of ASCII characters. The advantage that our system has over other systems that use two characters to represent a single sound is that, following the IPA, our system avoids all ambiguities.

2.1 Vowels

Persian has a six-vowel system, high to low and front and back. These vowels are: [i, e, a, u, o, A], as exemplified by the italicized vowels in the following English examples: 'beat', 'bet', 'bat', 'pull', 'poll' and 'pot'.

2.2 Consonants

In addition to the six vowels, there are 23 distinct consonantal sounds in Persian. Voicing is phonemic in Persian, giving rise to a quite symmetric system. These consonants are represented in Table 3 based on the place (bilabial (BL), labio-dental (LD), dental (DE), alveopalatal (AP), velar (VL), uvular (UV) and glottal (GT)) and manner of articulation (stops (ST), fricatives (FR), affricates (AF), liquids (LQ), nasals (NS) and glides (GL)) and their voicing ([-v(oice)] and [+v(oice)]).

          BL    LD    DE    AP    VL    UV    GT
ST [-v]   p           t           k           ?
   [+v]   b           d           g     q
FR [-v]         f     s     S     x           h
   [+v]         v     z     Z
AF [-v]                     C
   [+v]                     J
LQ                    l, r
NS        m           n
GL                          y

Table 3: Consonants

Many of these sounds are similar to English sounds. For instance, the stops [p, b, t, d, k, g] are similar to the italicized letters in the following English words: 'potato', 'ball', 'tree', 'doll', 'key' and 'dog' respectively. The glottal stop [?] can be found in some pronunciations of 'button', and in the sound between the two syllables of 'uh oh'. The uvular stop [q] does not have a correspondent in English. Nor does the velar fricative [x]. But the rest of the fricatives [f, v, s, z, S, Z, h] have a
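
The ambiguity pattern of Table 1 and the one-character-per-sound property of USCPron can be illustrated with a short Python sketch. It is only an illustration built from the four rows of Table 1 and the symbol inventory of Sections 2.1 and 2.2; the names `TABLE_1`, `collisions` and `phones` are hypothetical and are not part of the Transonics system.

```python
from collections import defaultdict

# Rows of Table 1: (English gloss, USCPers, USCPron, USCPers+).
TABLE_1 = [
    ("six", "SS", "SeS", "SeS"),
    ("lung", "SS", "SoS", "SoS"),
    ("hundred", "$d", "sad", "$ad"),
    ("dam", "sd", "sad", "sad"),
]

# Column index of each transcription scheme inside a TABLE_1 row.
SCHEMES = {"USCPers": 1, "USCPron": 2, "USCPers+": 3}

# USCPron inventory as listed in Sections 2.1 and 2.2 (6 vowels, 23 consonants).
VOWELS = set("ieauoA")
CONSONANTS = set("ptk?bdgqfsSxhvzZCJlrmny")
INVENTORY = VOWELS | CONSONANTS


def collisions(scheme):
    """Map each transcription shared by several words to the glosses it covers."""
    groups = defaultdict(list)
    column = SCHEMES[scheme]
    for row in TABLE_1:
        groups[row[column]].append(row[0])
    return {form: glosses for form, glosses in groups.items() if len(glosses) > 1}


def phones(pronunciation):
    """Split a USCPron string into sounds; one ASCII character is one sound."""
    unknown = set(pronunciation) - INVENTORY
    if unknown:
        raise ValueError(f"symbols outside the USCPron inventory: {sorted(unknown)}")
    return list(pronunciation)


if __name__ == "__main__":
    for name in SCHEMES:
        print(name, collisions(name))
    print(len(INVENTORY), phones("SeS"))
```

On these four words, the orthographic scheme merges the homographs 'six' and 'lung', the phonetic scheme merges the homophones 'hundred' and 'dam', and the hybrid scheme keeps all four apart. Because every USCPron symbol is a single ASCII character, splitting a pronunciation into sounds needs no lookahead, which is the property the text contrasts with two-character notations.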