Toward a Handwritten Recognition System Using Canonical Representation for Multi-Script Documents

Yassine El Malki, Youssef Es-Saady, Driss Mammass
IRF-SIC Laboratory, IbnoZohr University, Agadir, Morocco
{y.elmalki; y.essaady; mammass}@uiz.ac.ma

Abstract: We present a multilingual handwriting recognition system based on canonical representation. After preprocessing of the word/character image, the skeleton is transformed into vertical and horizontal strokes only; this is called the canonical representation. The word/character is then represented by a vector of the values of its intersection pixels, where each value depends on the pixel's 8 surrounding neighbors. Finally, an algorithm matches the character's vector against a codebook containing the vector of each character of the language. The vector that represents the character will be used in the classification step, and the system will be applied to databases of multi-script documents containing text in Latin, Arabic, and/or Tifinagh.

Index Terms: Optical Character Recognition; Multilingual; Amazigh; Tifinagh; Handwritten Recognition; Multi-Script Documents

1. Introduction

Libraries around the world hold large numbers of documents. Scanning those documents makes their content accessible everywhere and to everyone. However, searching scanned documents manually, page by page, is a difficult task for anyone looking for specific information. The solution is Optical Character Recognition (OCR), the process of transforming an image of text into machine-readable text (for example ASCII codes) that can easily be searched. This area has attracted much research for many languages, especially for the Latin, Chinese and Arabic scripts [Bozinovic and Srihari 1989], [Bazzi et al 1999], [Zhang et al 2009], [Agrawal et al 2009], [Radwan et al 2013], [Es-saady et al 2014].
The variability of handwriting makes recognition very challenging. Many local languages, such as Amazigh, have been integrated into information systems. Because Amazigh characters need specific processing, many researchers have taken an interest in Amazigh character recognition in recent years, for example [Skounti 2003], [Zenkouar et al 2004], [Es-saady et al 2010], [Es-saady et al 2011], [Es-saady et al 2014].

In North Africa, the local Tifinagh writing system, which is the writing system of the Amazigh language, is widely used. The Tifinagh script is among the oldest scripts in the world; it dates from the 3rd century BC. Like many other scripts, such as the Arabic script, which acquired dots and diacritical marks, it has changed considerably from its original form. It is found on stones and tombs at historical sites in Morocco, Algeria, Tunisia and the Tuareg areas of the Sahara.

The Amazigh alphabet called "Tifinagh-IRCAM", adopted by the Royal Institute of the Amazigh Culture, was officially recognized by the International Organization for Standardization (ISO) as part of the Basic Multilingual Plane [Zenkouar 2014]. Figure 1 shows the Tifinagh repertoire that is recognized and used in Morocco, with the corresponding Latin characters. There are 33 alphabetical phonetic entities, but Unicode encodes only 31 letters, plus a modifier applied to existing letters to form the two phonetic units (gʷ) and (kʷ).

Fig. 1. Tifinagh characters adopted by the Royal Institute of the Amazigh Culture with their correspondents in Latin characters

The Amazigh have had their own writing system since ancient times [Gaci 2011]. However, when writing Amazigh documents, the Amazigh people have also used the alphabet and/or the language of the dominant peoples with whom they interacted.
Three systems have been used to transcribe Amazigh [Skounti et al. 2003]: Tifinagh, the authentic alphabet, used in Libyc inscriptions since ancient times; the Arabic script, after the arrival of the Arabs at the end of the 6th century; and the Latin script, since the 19th century, first by colonial scientists and later by national researchers.

Some documents held in libraries contain several languages in the same document, as shown in Figure 2, which is extracted from a dictionary of the Tuareg language [Foucauld 1951]. This dictionary contains both Amazigh and Latin scripts; the Amazigh words are also written in Latin characters.

Fig. 2. An extract of the Tuareg dictionary [Foucauld 1951]

Many works have aimed to process multilingual documents. [Ambekar et al 2013] presented an OCR for printed English and Devanagari text. They used KNN for classification on 10 samples of each character of the two languages (610 samples: 260 for English and 350 for Devanagari) and reached 95% for English text and 93% for Devanagari text. [G.S Lehal 2013] built an OCR for Gurmukhi and English that first identifies the script as either Gurmukhi or English and then applies the appropriate OCR for that language. The tests were run on 4 sets constructed from 105 page images (76 Gurmukhi pages and 29 English pages), and the recognition rates differed depending on the set used. [Tan 1998] described an automatic method for identifying Chinese, English, Greek, Russian, Malayalam and Persian text. [Das et al 2011] used an OCR system for Telugu, English and Hindi; they extracted 8 features, used a KNN classifier, and obtained 93% accuracy.

To process multilingual documents, we used, with some alterations, the script-free system proposed by [Al Abodi and Li 2014], which can handle different languages in the same text image.
The scheme of the system is presented in Figure 3 below. The paper is organized as follows: Section 2 presents the preprocessing step, Section 3 describes the canonical representation process, and Section 4 presents a conclusion and some future work.

Fig. 3. Scheme of the proposed system

2. Preprocessing

As shown in Figure 3, preprocessing refines the scanned input image in several steps: binarization, noise removal and image resizing. We used the [Otsu 1979] method for binarization. This thresholding step removes the background noise from the picture before character extraction and text recognition. The method performs a statistical analysis of the image histogram and estimates the threshold by maximizing a function of it (the between-class variance of the two resulting pixel classes).

The database images contain gaps between pixels, caused by the writing style or the type of pen used. To fill these gaps, we used the well-known Run Length Smoothing Algorithm (RLSA) in both the vertical and horizontal directions. For instance, with an RLSA threshold of 4, every run of zeros between two ones is set to one if the run is shorter than the threshold:

10010001000001001 becomes 11111111000001111

A skeletonization step is then applied to the image so that every stroke is only one pixel wide. Figure 4 shows an example of an Amazigh character and the result of each step.

Fig. 4. a) original image of Yab, b) binarization, c) RLSA gap filling, d) skeletonization

3. Canonical Representation

3.1 Pixel Value

We represent each pixel by a code obtained by summing the values of its 8 surrounding pixels, in a manner similar to Freeman coding.
The pixel at the top has position 0, and the positions of the other pixels are assigned clockwise: the pixel at the top right has position 1, the pixel on the right has position 2, and so on, ending with the pixel at the top left, which has position 7. The value of a neighboring pixel is 2^pixel_position. Figure 5 shows the coding scheme of the surrounding pixels. Computed this way, there are 256 distinct neighbor configurations: each of the 8 neighboring pixels is either set or not, which gives 2^8 = 256 possible combinations. Figure 6 presents some pixels with their codes.

Fig. 5. a) the values of the surrounding pixels, b) the code of the pixel is 145 = 128 + 1 + 16

Fig. 6. Some pixel codes

3.2 The Canonical Form

Writers have different writing styles. This is a problem for OCR systems because the shape, width and height of the characters change from one writer to another. To reduce this variety of character forms, we used the canonical representation introduced by [Al Abodi and Li 2014]. The main idea is to keep only horizontal and vertical strokes: processing techniques are applied to the pixels of the image and, depending on their values, yield the canonical representation, as the following example illustrates.

For example, in Figure 7 we examine the pixels from the top left to the bottom right of the image. In this case, the first white pixel is located at (4, 12) (column 4, line 12) and has the value 18. This value (18 = 2 + 16) means that the pixel has an upper-right neighbor and a bottom neighbor.
The upper-right neighbor does not belong to the same line or the same column as the processed pixel, so we follow the northeast direction until we find a pixel that has no neighbor above it or to its upper right and no neighbor to its left: the pixel (17, 4).
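
Two of the steps described above are mechanical enough to sketch in code: the RLSA gap filling of Section 2 and the neighbor coding of Section 3.1. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names `rlsa_1d` and `pixel_code` are hypothetical, the image is assumed to be a list of rows of 0/1 values, neighbors falling outside the image are assumed to be background, and RLSA is shown for a single row only (the paper applies it in both directions).

```python
def rlsa_1d(bits, threshold):
    """Fill every run of 0s that lies between two 1s and is shorter than threshold."""
    out = list(bits)
    last_one = None  # index of the most recent 1 seen so far
    for i, b in enumerate(bits):
        if b == 1:
            gap = i - last_one - 1 if last_one is not None else 0
            if 0 < gap < threshold:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out


# Neighbor offsets (d_row, d_col) indexed by position:
# 0 = top, then clockwise up to 7 = top left.
NEIGHBOR_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
                    (1, 0), (1, -1), (0, -1), (-1, -1)]


def pixel_code(image, row, col):
    """Sum 2**position over the set neighbors of image[row][col]."""
    code = 0
    for position, (dr, dc) in enumerate(NEIGHBOR_OFFSETS):
        r, c = row + dr, col + dc
        inside = 0 <= r < len(image) and 0 <= c < len(image[0])
        # Assumption: pixels outside the image count as background (0).
        if inside and image[r][c]:
            code += 2 ** position
    return code


if __name__ == "__main__":
    # The RLSA example from Section 2, with threshold 4.
    row = [int(ch) for ch in "10010001000001001"]
    print("".join(str(b) for b in rlsa_1d(row, 4)))  # 11111111000001111

    # Top-left, top and bottom neighbors set: 128 + 1 + 16.
    fig5_like = [[1, 1, 0],
                 [0, 1, 0],
                 [0, 1, 0]]
    print(pixel_code(fig5_like, 1, 1))  # 145
```

A pixel whose only neighbors are the upper-right and bottom ones gets the code 2 + 16 = 18 under this scheme, which matches the value discussed in the Figure 7 example.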