Combined Script and Page Orientation Estimation Using the Tesseract OCR Engine
Total pages: 16
File type: PDF, Size: 1020 KB
Recommended publications
Old German Translation Tips
OLD GERMAN TRANSLATION TIPS BASIC STEPS 1. Identify the alphabet used (Fraktur, Sütterlin/Suetterlin, or Kurrent) and convert the letters to modern equivalents, and then 2. Translate from German to English. CONVERTING OLD ALPHABETS Helps for Translating That Old German Handwriting http://nancysfamilyhistoryblog.blogspot.com/2011/06/helps-for-translating-that-old-german.html Helps for Translating the Old German Typeface http://nancysfamilyhistoryblog.blogspot.com/2011/10/helps-for-translating-old-german.html Omniglot https://www.omniglot.com/writing/german.htm – Scroll down for script styles. ONLINE TRANSLATORS AND DICTIONARIES Abkuerzungen http://abkuerzungen.de/main.php?language=de – for finding the meaning of abbreviations. Leo https://www.leo.org/german-english Members can also connect with each other via the LEO forums to ask for help. Langenscheidt online dictionary: Lingojam https://lingojam.com/OldGerman Linguee https://www.linguee.com/ Choose German → English from the dropdown menu. This site uses human translators, not machines! It gives you examples of the words used in a real-life context and provides all possible German translations of the word. You will find that anti-virus software treats some translation sites/apps (e.g. Babylon translator) as malware. ‘GENEALOGICAL GERMAN’ Family words in German https://www.omniglot.com/language/kinship/german.htm German Genealogical Word List: https://www.familysearch.org/wiki/en/German_Genealogical_Word_List Understanding German Baptism Records http://www.stemwedegenealogy.com/baptism_sample.htm Understanding German Marriage Records http://www.stemwedegenealogy.com/marriage_sample.htm FACEBOOK HELP Genealogy Translations https://www.facebook.com/groups/genealogytranslation/ – for a wide range of languages, not only German. • Work out as much as you can for yourself first.
ISO/IEC International Standard 10646-1
Final Proposed Draft Amendment (FPDAM) 5 ISO/IEC 10646:2003/Amd.5:2007 (E) Information technology — Universal Multiple-Octet Coded Character Set (UCS) — AMENDMENT 5: Tai Tham, Tai Viet, Avestan, Egyptian Hieroglyphs, CJK Unified Ideographs Extension C, and other characters Page 2, Clause 3 Normative references Page 20, Sub-clause 26.1 Hangul syllable Update the reference to the Unicode Bidirectional Algo- composition method rithm and the Unicode Normalization Forms as follows: Replace the three parenthetical notations in the second Unicode Standard Annex, UAX#9, The Unicode Bidi- sentence of the first paragraph with: rectional Algorithm, Version 5.2.0, [Date TBD]. (syllable-initial or initial consonant) (syllable-peak or medial vowel) Unicode Standard Annex, UAX#15, Unicode Normali- (syllable-final or final consonant) zation Forms, Version 5.2.0, [Date TBD]. Page 2, Clause 4 Terms and definitions Page 20, Clause 27 Source references for CJK Ideographs Replace the Code table entry with: In the list after the second paragraph, insert the follow- Code chart ing entry: Code table Hanzi M sources, A rectangular array showing the representation of coded characters allocated to the octets in a code. In the third paragraph, replace “(G, H, T, J, K, KP, V, and U)” with “(G, H, M, T, J, K, KP, V, and U)”. Remove the „Detailed code table‟ entry. Replace all further occurrences of „code table‟ in the Page 21, Sub-clause 27.1 Source references text of the standard with „code chart‟. for CJK Unified Ideographs Replace the last sentence of the second paragraph Page 13, Clause 17 Structure of the code with the following: tables and lists (now renamed „Structure of the code charts and lists‟) The current full set of CJK Unified Ideographs is represented by the collection 385 CJK UNIFIED Insert the following note after the first paragraph. -
RFC 5892: The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)
Internet Engineering Task Force (IETF) P. Faltstrom, Ed. Request for Comments: 5892 Cisco Category: Standards Track August 2010 ISSN: 2070-1721 The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) Abstract This document specifies rules for deciding whether a code point, considered in isolation or in context, is a candidate for inclusion in an Internationalized Domain Name (IDN). It is part of the specification of Internationalizing Domain Names in Applications 2008 (IDNA2008). Status of This Memo This is an Internet Standards Track document. This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741. Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc5892. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. -
Proposal for a Kannada Script Root Zone Label Generation Ruleset (LGR)
Proposal for a Kannada Script Root Zone Label Generation Ruleset (LGR) LGR Version: 3.0 Date: 2019-03-06 Document version: 2.6 Authors: Neo-Brahmi Generation Panel [NBGP] 1. General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Kannada LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document: proposal-kannada-lgr-06mar19-en.xml Labels for testing can be found in the accompanying text document: kannada-test-labels-06mar19-en.txt 2. Script for which the LGR is Proposed ISO 15924 Code: Knda ISO 15924 N°: 345 ISO 15924 English Name: Kannada Latin transliteration of the native script name: Native name of the script: ಕನ್ನಡ Maximal Starting Repertoire (MSR) version: MSR-4 Some languages using the script and their ISO 639-3 codes: Kannada (kan), Tulu (tcy), Beary, Konkani (kok), Havyaka, Kodava (kfa) 3. Background on Script and Principal Languages Using It 3.1 Kannada language Kannada is one of the scheduled languages of India. It is spoken predominantly by the people of Karnataka State of India. It is one of the major languages among the Dravidian languages. Kannada is also spoken by significant linguistic minorities in the states of Andhra Pradesh, Telangana, Tamil Nadu, Maharashtra, Kerala, Goa and abroad.
A Script Independent Approach for Handwritten Bilingual Kannada and Telugu Digits Recognition
IJMI International Journal of Machine Intelligence ISSN: 0975–2927 & E-ISSN: 0975–9166, Volume 3, Issue 3, 2011, pp-155-159 Available online at http://www.bioinfo.in/contents.php?id=31 A SCRIPT INDEPENDENT APPROACH FOR HANDWRITTEN BILINGUAL KANNADA AND TELUGU DIGITS RECOGNITION DHANDRA B.V.1, GURURAJ MUKARAMBI1, MALLIKARJUN HANGARGE2 1Department of P.G. Studies and Research in Computer Science Gulbarga University, Gulbarga, Karnataka. 2Department of Computer Science, Karnatak Arts, Science and Commerce College Bidar, Karnataka. *Corresponding author. E-mail: [email protected],[email protected],[email protected] Received: September 29, 2011; Accepted: November 03, 2011 Abstract- In this paper, handwritten Kannada and Telugu digits recognition system is proposed based on zone features. The digit image is divided into 64 zones. For each zone, pixel density is computed. The KNN and SVM classifiers are employed to classify the Kannada and Telugu handwritten digits independently and achieved average recognition accuracy of 95.50%, 96.22% and 99.83%, 99.80% respectively. For bilingual digit recognition the KNN and SVM classifiers are used and achieved average recognition accuracy of 96.18%, 97.81% respectively. Keywords- OCR, Zone Features, KNN, SVM 1. Introduction recognition of isolated handwritten Kannada numerals Recent advances in Computer technology, made every and have reported the recognition accuracy of 90.50%. organization to implement the automatic processing Drawback of this procedure is that, it is not free from systems for its activities. For example, automatic thinning. U. Pal et al. [11] have proposed zoning and recognition of vehicle numbers, postal zip codes for directional chain code features and considered a feature sorting the mails, ID numbers, processing of bank vector of length 100 for handwritten Kannada numeral cheques etc. -
Deciphering Old German Documents Using the Online German Script Tutorial Bradley J
Deciphering Old German Documents Using the Online German Script Tutorial Bradley J. York Center for Family History and Genealogy at Brigham Young University Abstract The German Script Tutorial (http://script.byu.edu/german) is a new online resource for learning how to decipher documents written in old German script. The German Script Tutorial provides descriptions and actual examples of each German letter. Animations demonstrate how each letter is written. Tests assess users’ ability to read and write letters, words, and passages. User accounts allow users to save their test scores and keep track of which letters they need to practice. The tutorial also provides basic guidelines on how to find vital (i.e., genealogical) information in old German documents. Introduction One of the greatest difficulties in German genealogical research is deciphering old documents and manuscripts written in old German script. The term old German script refers to the handwriting and typefaces that were widely used in German-speaking countries from medieval times to the end of World War II. Although old German script was based on the Latin alphabet, as were old English, French, Italian, and Spanish scripts, individual letters of old German script may appear to the untrained eye as unique as letters of the Cyrillic or even Arabic alphabets. Since old German script has not been taught in German schools since the 1950’s, even most native Germans are unable to read and write it nowadays. [Figure caption: This is an old German emigration document consisting of printed and handwritten information. Both the handwriting and the printed Fraktur typeface are styles of old German script.] Thus, deciphering old German documents is problematic for
Report of the Mikrusian Alphabet 1 Introduction
Report of the Mikrusian alphabet 1 Introduction The people of Earth speak different languages. There are several hundred of them. Each language is characterized by its own specific set of sounds and by the particular way those sounds are articulated. However, there are people who regard their native languages as the same language, live in distant regions, and pronounce words of the same meaning somewhat differently. The difference may be only that some sounds are articulated in a moderately different way. A typical example is the situation of native German speakers in different regions of Europe. In spite of this, these people communicate with each other without problems in understanding, and they attribute the difference in articulation to «accent», «dialect», «patois», or some other term. Furthermore, there are people who articulate individual sounds slightly differently from other “same-language” inhabitants of the same region. Speech therapists usually classify this as a speech defect. But this does not lead to problems in understanding such people “with a defect”. The reason is the following: each listener identifies all “similar” sounds for himself. Our brain acts analogously when it analyzes the sounds in speech. The brain identifies sounds that belong to a certain range; it considers them equal, or “equivalent”. 1.1 Classification subproblems In spite of this variety of languages, the variety of speech sounds is determined and bounded by the capabilities of the human vocal apparatus. Its capabilities in speech are approximately the same for all people (with rare exceptions due to pathologies). The question arises: is it possible to classify the sounds that a human vocal apparatus can generate in speech? As for any classification problem, one should solve the following subproblems.
The Unicode Standard, Version 4.0--Online Edition
This PDF file is an excerpt from The Unicode Standard, Version 4.0, issued by the Unicode Consortium and published by Addison-Wesley. The material has been modified slightly for this online edition, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (http://www.unicode.org/errata/). For information on more recent versions of the standard, see http://www.unicode.org/standard/versions/enumeratedversions.html. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten.
Calligraphy Specimens Following Writing
Calligraphy Specimens following Writing Manual by ADOLPH ZUNNER[?] printed by JOHANN CHRISTOPH WEIGEL or CHRISTOPH WEIGEL THE ELDER In German and Latin, manuscript on paper Germany (Nuremberg), c. 1713 20 folios on paper, complete [collation i20], unidentified watermark (bisected with center lacking, crest holding 3? bezants with ornate frame, initials M and F at bottom), foliation in modern pencil in upper recto corners, text written in various calligraphic scripts in black ink on recto only, no visible ruling and varied justification, sketched decorative evergreen boughs on f. 1 and calligraphic scrollwork throughout, minor flecking and staining, some original ink blots. CONTEMPORARY BINDING, brown (once red?) brocade paper with elegant mixed floral design and traces of gold embossing, pasted spine, abrasion and discoloration but wholly intact. Dimensions 150 x 190 mm. Calligraphic sample books from the Renaissance, such as this manuscript, are far less common than their printed exemplars; this charming booklet, designed for teaching writing to the young, appears to be one of a kind. This volume in its fine contemporary binding includes texts that display a scribe’s skill in writing different types of scripts. It is partially copied from writing master Adolph Zunner’s 1709 Kunstrichtige Schreib-Art printed in Nuremberg by famous publisher and engraver [Johann] Christoph Weigel. PROVENANCE 1. Written in Germany, in Nuremberg, in 1713 or shortly thereafter, with a title page reading Gründliche Unterweisung zu Fraktur – Canzley – und Current Schrifften der lieben Jugend zum Anfang des Schreibens und sondern Nuzen gestellet durch A. <A. or Z.?> in Nürnberg Zufinden bey Johann Christoph Weigel (A Thorough Instruction in Fraktur, Chancery, and Cursive Scripts, prepared for the especial utility of dear Youth in beginning to write by A. -
Download Eskapade
TYPETOGETHER Eskapade Creating new common ground between a nimble oldstyle serif and an experimental Fraktur DESIGNED BY Alisa Nowak YEAR 2012 ESKAPADE & ESKAPADE FRAKTUR ABOUT The Eskapade font family is the result of Alisa Nowak’s research into Roman and German blackletter forms, mainly Fraktur letters. The idea was to adapt these broken forms into a contemporary family instead of creating a faithful revival of a historical typeface. On one hand, the ten normal Eskapade styles are conceived for continuous text in books and magazines with good legibility in smaller sizes. On the other hand, the six angled Eskapade Fraktur styles capture the reader’s attention in headlines with its mixture of round and straight forms as seen in ‘e’, ‘g’, and ‘o’. Eskapade works exceptionally well for branding, logotypes, and visual identities, for editorials like magazines, fanzines, and posters, and for packaging. Eskapade roman adopts a humanist structure, but is more condensed than other oldstyle serifs. unique script practiced in Germany in the vanishingly short period between 1915 and 1941. The new ornaments are also hybrid Sütterlin forms to fit with the smooth roman styles. Although there are many Fraktur-style typefaces available today, they usually lack italics, and their italics are usually slanted uprights rather than proper italics. This motivated extensive experimentation with the italic Fraktur shapes and resulted in Eskapade Fraktur’s unusual and interesting solutions. In addition to standard capitals, it offers a second set of more decorative capitals with double-stroke lines to intensify creative application and encourage experimental use. The Thin and Black Fraktur styles are meant for
Five Centuries of German Fraktur by Walden Font
Five Centuries of German Fraktur by Walden Font Johannes Gutenberg 1455 German Fraktur represents one of the most interesting families of typefaces in the history of printing. Few types have had such a turbulent history, and even fewer have been alternately praised and despised throughout their history. Only recently has Fraktur been rediscovered for what it is: a beautiful way of putting words into written form. Walden Font is proud to present, for the first time, an edition of 18 classic Fraktur and German Script fonts from five centuries for use on your home computer. This booklet describes the history of each font and provides you with samples for its use. Also included are the standard typesetting instructions for Fraktur ligatures and the special characters of the Gutenberg Bibelschrift. We hope you find the Gutenberg Press to be an entertaining and educational publishing tool. We certainly welcome your comments and suggestions. You will find information on how to contact us at the end of this booklet. Dear friend of Fraktur! With our “Gutenberg Press” we hope to contribute to the revival of Fraktur typefaces, without any political ulterior motive. Unfortunately, the high production costs prevent us from publishing a German version of this user booklet, but you will find the German text on the program diskettes. Please read the “liesmich” file for further information. We also welcome your comments and suggestions. Contact information is given at the end of this booklet. A brief history of Fraktur At the end of the 15th century, most Latin books in Germany were printed in a dark, barely legible gothic type style known as Textura.
A Brief History of the Blackletter Font Style
A Brief History of the Blackletter Font Style By: Linda Tseng ● Blackletter is also called Fraktur, Gothic, or Old English. ● It comes from Western Europe, notably Germany, which used Gothic type until World War II. ● It was in use from the twelfth century to the twentieth century. ● Blackletter hands were called Gothic by the humanist Lorenzo Valla and others in mid-fifteenth-century Italy. ● The term describes the scripts of the Middle Ages, in which the darkness of the characters overpowers the whiteness of the page. ● In the 1500s, blackletter became less popular for printing in many countries, except Germany and the other German-speaking countries. ● The best Textura specimen in the Gutenberg Museum Library’s collection comes from Great Britain. Overview ● Rotunda, the second-oldest blackletter style, never really caught on as a book type in German-speaking lands, although twentieth-century calligraphers, as well as arts and crafts designers, have used it quite well for display purposes. ● Cursiva developed in the 14th century as a simple form of textualis. ● Hybrida (bastarda) is a mixture of textualis and cursiva that developed in the early 15th century. ● Donatus Kalender is the name for the metal type design that Gutenberg used in his earliest surviving printed works, from the early 1450s. Blackletters are difficult to read as body text, so they are better used for headings, logos, posters, and signs. ● Fette Fraktur is used for old-fashioned newspaper and magazine headlines and for beer advertising. ● Wilhelm Klingspor Gotisch adorns many wine labels. ● Linotext and Old English are popular choices for certificates.