Text Encoding

Language and Computers
Prologue: Encoding Language
L245
(Based on Dickinson, Brew, & Meurers (2013))
Indiana University
Spring 2016

Language and Computers

Computers have a variety of applications involving language:
- textual searching
- grammar correction
- automatic translation
- question answering
- plagiarism detection
- ...

Language and Computers – where to start?

- If we want to do anything with language, we need a way to represent language.
- We can interact with the computer in several ways:
  - write or read text
  - speak or listen to speech
- Computer has to have some way to represent
  - text
  - speech

Outline

- Writing systems
  - Alphabetic
  - Syllabic
  - Logographic
  - Systems with unusual realization
  - Relation to language
- Encoding written language
  - ASCII
  - Unicode
- Spoken language
  - Transcription
  - Why speech is hard to represent
  - Articulation
  - Measuring sound
  - Acoustics
- Relating written and spoken language
  - From Speech to Text
  - From Text to Speech
- Language modeling

Writing systems used for human languages

What is writing?

“a system of more or less permanent marks used to represent an utterance in such a way that it can be recovered more or less exactly without the intervention of the utterer.”
(Peter T. Daniels, The World’s Writing Systems)

Different types of writing systems are used:
- Alphabetic
- Syllabic
- Logographic

Much of the information on writing systems and the graphics used are taken from the great site http://www.omniglot.com.

Alphabetic systems

Alphabets (phonemic alphabets)
- represent all sounds, i.e., consonants and vowels
- Examples: Etruscan, Latin, Korean, Cyrillic, Runic, International Phonetic Alphabet

Abjads (consonant alphabets)
- represent consonants only (sometimes plus selected vowels; vowel diacritics generally available)
- Examples: Arabic, Aramaic, Hebrew

Alphabet example: Fraser

An alphabet used to write Lisu, a Tibeto-Burman language spoken by about 657,000 people in Burma, India, Thailand and in the Chinese provinces of Yunnan and Sichuan.
(from: http://www.omniglot.com/writing/fraser.htm)

Abjad example: Phoenician

An abjad used to write Phoenician, created between the 18th and 17th centuries BC; assumed to be the forerunner of the Greek and Hebrew alphabets.
(from: http://www.omniglot.com/writing/phoenician.htm)

A note on the letter-sound correspondence

- Alphabets use letters to encode sounds (consonants, vowels).
- But the correspondence between spelling and pronunciation in many languages is quite complex, i.e., not a simple one-to-one correspondence.
- Example: English
  - same spelling – different sounds: ough: ought, cough, tough, through, though, hiccough
  - silent letters: knee, knight, knife, debt, psychology, mortgage
  - one letter – multiple sounds: exit, use
  - multiple letters – one sound: the, revolution
  - alternate spellings: jail or gaol; but not possible seagh for chef (despite sure, dead, laugh)

More examples for non-transparent letter-sound correspondences

French
(1) a. Versailles → [veRsai]
    b. été, étais, était, étaient → [ete]

Irish
(2) a. samhradh (summer) → [sauruh]
    b. scríobhaim (I write) → [shgri:m]

What is the notation used within the []?

The International Phonetic Alphabet (IPA)

- Several special alphabets for representing sounds have been developed, the best known being the International Phonetic Alphabet (IPA).
- The phonetic symbols are unambiguous:
  - designed so that each speech sound gets its own symbol,
  - eliminating the need for
    - multiple symbols used to represent simple sounds
    - one symbol being used for multiple sounds.
- Interactive example chart: http://web.uvic.ca/ling/resources/ipa/charts/IPAlab/IPAlab.htm

Syllabic systems

Syllabaries
- writing systems with separate symbols for each syllable of a language
- Examples: Cherokee, Ethiopic, Cypriot, Ojibwe, Hiragana (Japanese)
(cf. also: http://www.omniglot.com/writing/syllabaries.htm)

Abugidas (Alphasyllabaries)
- writing systems organized into families
- symbols represent a consonant with a vowel, but the vowel can be changed by adding a diacritic (= a symbol added to the letter).
- Examples: Balinese, Javanese, Tamil, Thai, Tagalog
(cf. also: http://www.omniglot.com/writing/syllabic.htm)

Syllabary example: Cypriot

The Cypriot syllabary or Cypro-Minoan writing is thought to have developed from the Linear A script of Crete, though its exact origins are not known. It was used from about 1500 to 300 BC.
(from: http://www.omniglot.com/writing/cypriot.htm)

Abugida example: Lao

Script developed in the 14th century to write the Lao language, based on an early version of the Thai script, which was developed from the Old
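
The abugida principle described above (a symbol stands for a consonant plus a vowel, and a diacritic changes the vowel) also shows up in how such scripts are stored on a computer. The following is a minimal Python sketch, assuming Tamil as the example script (it is one of the abugidas listed above). The specific code points and the helper names with_vowel and describe are illustrative choices for this sketch, not part of the original slides.

```python
import unicodedata

# Illustrative code points (assumptions for this sketch, not from the slides):
KA = "\u0b95"      # Tamil letter KA: the consonant k with its inherent vowel a
SIGN_I = "\u0bbf"  # Tamil vowel sign I: a dependent mark that changes the vowel
SIGN_U = "\u0bc1"  # Tamil vowel sign U


def with_vowel(consonant: str, vowel_sign: str) -> str:
    """Form a syllable by placing a vowel sign after a consonant letter."""
    return consonant + vowel_sign


def describe(text: str) -> list[str]:
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    # The bare letter is one code point; each changed vowel adds a second one.
    for syllable in (KA, with_vowel(KA, SIGN_I), with_vowel(KA, SIGN_U)):
        print(syllable, len(syllable), describe(syllable))
```

Each syllable with a changed vowel is stored as two code points, the base letter followed by the vowel sign, even though it is displayed as a single written unit.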
