

Index

D. Doermann, K. Tombre (eds.), Handbook of Document Image Processing and Recognition, DOI 10.1007/978-0-85729-859-1, © Springer-Verlag London 2014

A

Abjad, 430, 891
Abugida, 430, 891
Academic systems, 429, 448
Accidental forgery, 932
Accuracy, 44, 50, 51, 54, 1023, 1024, 1027, 1033
Acid-free paper, 15
Acquisition, 11–60, 984–986, 989–991, 993, 996
Action plan, 919, 921, 926, 929, 930
AdaBoost, 862, 865, 866
Adaptive local connectivity map, 476
Address block(s), 715, 717, 720, 721, 723, 724
Address block location, 720, 723–724, 727
Address database, 715, 720, 729, 730, 744
Address interpretation, 723
Address recognition systems, 709, 710, 720–723, 732, 733, 744
Adjacency
  grammars, 528, 537
  matrices, 541, 544
Affine covariant, 629
AHARONI, 440
Algebraic invariants, 617
Allograph, 303, 304
Alphabet(s), 7–9, 303, 304, 306, 307, 312, 891, 892, 896, 897, 902, 904, 906, 908, 913
Alphabetic class, 429
Alphabetic fields, 719
Ambiguity, 680
Analysis of Invoices, 184
Analysis System to Interpret Areas in Single-sided Letters (ANASTASIL), 181, 182, 207, 208
Analytical word recognition, 725
Anchor points, 717
Angular radial transform (ART) descriptors, 526, 530, 532
Animated, 625, 630, 633, 636, 637
Anisotropic Gaussian, 269–271
ANN. See Artificial neural networks (ANN)
Annotation, 630, 639, 966–968, 976, 983–1002
Application domains, 182, 198
Approximate NN search, 170
APTI, 450
Arabic
  alphabet, 430, 434–436, 449
  extension, 435
  letters, 432, 435, 436
  OCR, 450
  writing styles, 437–438
  writing system, 432
Arabic and Persian signatures, 930
Arabic and Syriac are cursive, 428
Arabic recognition approaches, 446
Aramaic letters, 433
Arc detection, 505, 511, 512
Architectural drawings, 493, 495, 511–513
Area under the curve (AUC), 931
Area Voronoi diagram, 146, 147, 156, 157, 161, 163
Arial, 323
Arrowheads, 493, 512, 513
Artificial intelligence, 332
Artificial neural networks (ANN), 617, 632, 636, 821, 835
Aruspix, 753, 768, 771
Aryan language, 304
Ascenders and descenders, 325, 442–444, 810, 812, 814
ASCII characters, 817
Asian scripts, 460–462, 471, 475, 483
Aspect, 279, 284
Assamese, 301, 304
Assessment
  function, 1029–1032
  methods, 1031
Associative graphs, 527, 536–538
Assyrian script, 439
Attributed grammars, 695
Attributed relational graph (ARG), 904
Attributive symbols, 750
Automatic document processing, 614
Automatic letter sorting machines, 709
Automatic number plate recognition (ANPR), 846

B

Background analysis method, 144–146, 148, 149, 158, 164
Backtracking, 148
Bag-of-features, 871, 873, 874
Bag-of-words, 618, 620, 627, 629
Bangla, 301, 305, 306, 308, 309, 315, 316, 321, 326, 461, 465
Bank check(s), 363
Bank check recognition, 67
Banners, 640
Bar codes, 708, 709, 723
Bar lines, 751, 761, 767, 769
Bar units, 752
Baselines, 260, 263, 265, 266, 275–276, 287, 684, 686–688, 691, 692, 696, 810–815
Baselines extraction, 442, 686, 687, 691, 692
Base-region points, 311
Baum-Welch algorithm, 350
Bayesian combination rules, 739
Bayesian network, 902, 903
BBN, 335, 354, 355
Behavioural characteristics, 918
Belga Logos, 632, 640
Bengali
  alphabet, 304
  script, 303, 304
Bezier curve, 893
Bibliographic
  citation, 209
  metadata, 209
Biblio system, 196, 213
Bidimensional
  grammars, 513
  patterns, 514
Bidirectional long short-term memory (BLSTM), 822, 823
Bi-gram probability, 410, 411, 415
Bi-lingual script, 314, 316
Billboards, 629, 640
Binarization, 44, 49, 54, 189, 262, 275, 335, 336, 338–339, 355, 466, 475–477, 716, 814, 984, 986, 989, 990, 992–994, 1018–1022, 1024, 1033
  algorithms, 716
  of handprinted text, 367
  skew correction, and noise, 755
Binary
  classifier, 738
  images, 716
  mask, 633
Bipartite graph, 540
Black-and-white document, 144, 147
Bleed-through, 49, 95, 97, 99
Blind attacker, 932
Blob noise, 621
Block(s), 752
Block adjacency graphs, 479
Block overlay phenomena, 778
Blur, 46, 48, 53, 54, 57, 59, 573, 844, 853, 856, 857
Blurring, 623, 850, 854, 856–857
Boldface, 322, 325
BongNet, 902, 903, 910
Book
  binding, 52, 57–59
  scanners, 846
Boosting methods, 540, 542, 543
Borda count, 607
Border removal, 102, 103
Born digital documents, 687, 690
Bottom-up strategies, 137, 144, 148, 149, 155, 157, 565, 566, 571, 722, 723, 928, 940, 941
Bounding box, 367, 377, 382
  projections, 761, 764
Boxing systems, 723
Brain stroke risk factors, 941
Branch-and-bound, 628, 632
Brightness, 35, 37, 40, 43, 44
Broadcasting industry, 623
Brodatz textures, 314
Brute-force attacker, 932
B-splines, 125, 126
Burmese, 314, 318
Business Letter, 181, 182, 185, 186, 192, 197, 199, 204–210, 216

C

C4.5 algorithm, 616
Calligraphic
  characters, 852
  interfaces, 951
Calliope, 20
Camera(s), 44–47, 52, 59
Camera-based OCR, 844–850, 860, 861, 872, 874
Camera-captured, 546
Cancellable biometrics, 931
Candidate(s), 274, 280, 281, 284
Candidate staff-line segments, 757
Canonical structure, 462
Caption texts, 489, 848, 854, 855, 862, 866, 872
Carbon paper, 28, 31
Car license plates, 846
Case-based reasoning (CBR), 197, 211
CBIR. See Content-based image retrieval (CBIR)
CCH. See Co-occurrence histograms (CCH)
CDDK. See Class dependent domain knowledge (CDDK)
Census forms, 363
Chain Codes, 120–123, 369, 377, 384, 525, 527, 534
Character
  forms, 363
  recognition, 331–356, 359–385, 427–453, 459–484, 524, 525, 534, 707, 718, 725, 729, 737, 739, 1013, 1025, 1026, 1033
  segmentation, 466, 467, 469, 807, 811, 824, 830, 834
Character error rate (CER), 1026
Character recognition rate (CRR), 1026
Character Shape Coding, 807, 809–811, 817–818, 824, 826–828
Charged couple devices (CCDs), 39
Chart analysis, 998
CHAYIM, 440
Check
  forms, 731, 734–736
  models, 735
  processing, 705–745
  readers, 710
Check 21, 744
Check recognition applications, 708–711, 714–716, 718, 731, 735, 741, 745
Check recognition system, 707, 708, 712, 731, 733, 743, 744
Chemical drawing recognition, 973–974
Chinese, 296, 305, 307–310, 312–316, 318–321, 325
Chinese Academy of Sciences (CASIA), 465
Chinese and Japanese signatures, 930
Chinese handwriting recognition competition, 463, 465
Chinese, Japanese, and Korean (CJK)
  ideographic character, 462
  scripts, 460–463, 465–474
Chrominance, 628
CIDK. See Class independent domain knowledge (CIDK)
City name recognition, 720, 722, 725–727
Class dependent domain knowledge (CDDK), 197, 204
Classes of forgeries, 932
Classification, 463, 470–473, 478, 479, 482–484, 535, 538–543, 545, 760–764, 766–769
Classification units, 478, 479
Classifiers, 279, 281, 282, 284–286, 446–449, 451
Classifiers combination, 383, 689, 692
Class independent domain knowledge (CIDK), 197, 207
Client-entropy measure, 930
Cluttered background, 623
CNN. See Convolutional neural network (CNN)
Coding scheme, 809, 817, 826, 827
Cognitive psychology, 195
Color, 18–20, 28–30, 33, 34, 36–38, 40, 44, 46, 47, 50, 598–600, 603, 605–610, 612–615, 617, 623, 629, 631, 635, 848, 849, 851, 853, 862–864, 868
  clustering, 147, 164, 165, 626, 864, 867, 868
  descriptors, 628
  naming, 604, 611, 638
  printing, 30, 34–36
  spaces, 625–628, 631
Color clustering, dither, 164
Combination rules, 934
Command parameter, 921, 930
Commercial
  language, 301
  OCRs, 452
  systems, 429, 448, 450
Committee of experts, 716, 719
Comparative study, 759, 762
Comparison, 1012–1016, 1018, 1022, 1024, 1029, 1031, 1032
Competitions, 759, 1013, 1022, 1028, 1032–1034
  datasets, 448
Complementary metal-oxide silicon (CMOS), 39, 40
Complete graphic object recognition, 963–964, 976
Complex background, 844, 851, 861
Complexity of a signature, 929
Compression, 46, 48–50
Compression artifacts, 623
Computational complexity, 714, 727
Computer language, 295
Concavity features, 343
Conditional random field (CRF), 870
Condorcet method, 607, 609
Confidence scores, 715, 716, 718
Confidence values, 719, 731, 734, 739–741
Conjunct consonants, 477
Connected components
  analysis, 85, 99, 101, 102, 496, 498
  labeling, 74–76, 100, 102
Connected components based method, 148, 149
Connectionist Temporal Classification (CTC), 821, 822
Consistency checking, 579
Consistency model, 924
Consonant, 462–464, 477, 478
Constraints, 762, 764, 765
Content
  ownership, 623
  stream, 779, 781
Content-based image retrieval (CBIR), 544, 545
Content-based video retrieval, 848, 849
Contest, 1013, 1018, 1032–1033
Context/contextual, 680, 681, 683, 685, 691, 693–695, 862, 864, 873, 919, 924, 931
  information, 627, 638, 690, 696, 752, 761–765, 770
  knowledge, 568, 571, 582
Contours
  extraction, 816
  tracking, 756–760, 767
Contrast, 40, 43, 44, 47–50
Correlation filter, 545
Cost value, 1028
Courtesy amount recognition, 710, 711, 719, 731, 733–741
Creole languages, 295
Critter, 323
Cross-correlation, 100, 102, 114, 118
Cross-related attributes, 194, 195
Cryptographic key generation, 931
Cryptography, 303
Cube search, 906
Cultural habits, 930
Currency sign, 731, 736, 738
Cursive script, 890, 892, 896, 902, 906–908
Curvature, 317, 534, 535
Curvature Scale Space (CSS), 601
Curvilinear, 260, 265, 266, 268
Cyrillic, 305, 314, 316, 318, 319, 321

D

Damaged characters, 74, 95, 97, 99, 101, 127
Data acquisition devices, 922
Data-driven approaches, 214
Data reduction, 893, 895
  techniques, 928
Dataset(s), 448–451, 453, 596, 602, 610, 612, 613, 620, 631, 632, 636, 639–641, 983–1002
Data variability, 583
DAVOS, 182, 209
DCT, 863, 868
3-D document shape, 123, 126
Decision
  making, 719–720, 733, 736, 741
  strategy, 739
Decorative characters, 872–874
Defects, 46, 48, 51, 53, 57
Deformations, 988
Degradation, 68, 260, 282, 984, 986, 988, 991, 994, 998
Degraded document, 812, 817, 824
Delaunay triangulation, 146, 149, 155–158, 213
Delayed strokes, 890, 891, 894, 896, 901, 905
Delta features, 895
Denoising, 953, 954, 975
Descenders,