Language and Computers Prologue: Encoding Language

Language and Computers Prologue: Encoding Language Writing systems Alphabetic Syllabic Logographic Systems with unusual realization Language and Computers Relation to language Encoding written language Prologue: Encoding Language ASCII Unicode Spoken language Transcription Based on Dickinson, Brew, & Meurers (2013) Why speech is hard to represent Articulation Measuring sound Acoustics Relating written and spoken language From Speech to Text From Text to Speech Language modeling 1 / 60 Language and Language and Computers – where to start? Computers Prologue: Encoding Language Writing systems Alphabetic Syllabic Logographic I Systems with unusual If we want to do anything with language, we need a way realization to represent language. Relation to language Encoding written language I We can interact with the computer in several ways: ASCII Unicode I write or read text Spoken language I speak or listen to speech Transcription Why speech is hard to represent I Computer has to have some way to represent Articulation Measuring sound I text Acoustics I speech Relating written and spoken language From Speech to Text From Text to Speech Language modeling 2 / 60 Language and Outline Computers Prologue: Encoding Language Writing systems Alphabetic Syllabic Writing systems Logographic Systems with unusual realization Relation to language Encoding written language Encoding written language ASCII Unicode Spoken language Spoken language Transcription Why speech is hard to represent Relating written and spoken language Articulation Measuring sound Acoustics Language modeling Relating written and spoken language From Speech to Text From Text to Speech Language modeling 3 / 60 Language and Writing systems used for human languages Computers Prologue: Encoding Language What is writing? Writing systems Alphabetic Syllabic “a system of more or less permanent marks used Logographic Systems with unusual to represent an utterance in such a way that it can realization be recovered more or less exactly without the Relation to language Encoding written intervention of the utterer.” language ASCII (Peter T. Daniels, The World’s Writing Systems) Unicode Spoken language Transcription Why speech is hard to Different types of writing systems are used: represent Articulation Measuring sound I Alphabetic Acoustics Relating written and I Syllabic spoken language From Speech to Text I Logographic From Text to Speech Language modeling Much of the information on writing systems and the graphics used are taken from the great site http://www.omniglot.com. 4 / 60 Language and Alphabetic systems Computers Prologue: Encoding Language Writing systems Alphabetic Alphabets (phonemic alphabets) Syllabic Logographic Systems with unusual realization I represent all sounds, i.e., consonants and vowels Relation to language Encoding written I Examples: Etruscan, Latin, Korean, Cyrillic, Runic, language ASCII International Phonetic Alphabet Unicode Spoken language Transcription Why speech is hard to Abjads (consonant alphabets) represent Articulation Measuring sound I represent consonants only (sometimes plus selected Acoustics Relating written and vowels; vowel diacritics generally available) spoken language From Speech to Text I Examples: Arabic, Aramaic, Hebrew From Text to Speech Language modeling 5 / 60 Language and Alphabet example: Fraser Computers Prologue: Encoding Language An alphabet used to write Lisu, a Tibeto-Burman language spoken by Writing systems about 657,000 people in Myanmar, India, Thailand and in the Chinese Alphabetic Syllabic provinces of Yunnan and Sichuan. Logographic Systems with unusual realization Relation to language Encoding written language ASCII Unicode Spoken language Transcription Why speech is hard to represent Articulation Measuring sound Acoustics Relating written and spoken language From Speech to Text From Text to Speech Language modeling (from: http://www.omniglot.com/writing/fraser.htm) 6 / 60 Language and Abjad example: Phoenician Computers Prologue: Encoding Language An abjad used to write Phoenician, created between the 18th and 17th Writing systems centuries BC; assumed to be the forerunner of the Greek and Hebrew Alphabetic Syllabic alphabet. Logographic Systems with unusual realization Relation to language Encoding written language ASCII Unicode Spoken language Transcription Why speech is hard to represent Articulation Measuring sound Acoustics Relating written and spoken language From Speech to Text From Text to Speech Language modeling (from: http://www.omniglot.com/writing/phoenician.htm) 7 / 60 Language and A note on the letter-sound correspondence Computers Prologue: Encoding Language I Alphabets use letters to encode sounds (consonants, Writing systems Alphabetic vowels). Syllabic Logographic Systems with unusual I But the correspondence between spelling and realization Relation to language pronunciation in many languages is quite complex, i.e., Encoding written not a simple one-to-one correspondence. language ASCII Unicode I Example: English Spoken language Transcription I same spelling – different sounds: ough: ought, cough, Why speech is hard to represent tough, through, though, hiccough Articulation I silent letters: knee, knight, knife, debt, psychology, Measuring sound Acoustics mortgage Relating written and I one letter – multiple sounds: exit, use spoken language From Speech to Text I multiple letters – one sound: the, revolution From Text to Speech I alternate spellings: jail or gaol; but not possible seagh Language modeling for chef (despite sure, dead, laugh) 8 / 60 Language and More examples for non-transparent letter-sound Computers Prologue: Encoding correspondences Language Writing systems Alphabetic Syllabic Logographic French Systems with unusual realization Relation to language (1) a. Versailles ! [veRsai] Encoding written language b. ete, etais, etait, etaient ! [ete] ASCII Unicode Spoken language Transcription Why speech is hard to Irish represent Articulation Measuring sound (2) a. samhradh (summer) ! [sauruh] Acoustics Relating written and b. scri’obhaim (I write) ! [shgri:m] spoken language From Speech to Text From Text to Speech Language modeling What is the notation used within the []? 9 / 60 Language and The International Phonetic Alphabet (IPA) Computers Prologue: Encoding Language Writing systems Alphabetic I Several special alphabets for representing sounds have Syllabic Logographic been developed, the best known being the International Systems with unusual realization Phonetic Alphabet (IPA). Relation to language Encoding written language I The phonetic symbols are unambiguous: ASCII Unicode I designed so that each speech sound gets its own Spoken language symbol, Transcription I eliminating the need for Why speech is hard to represent I multiple symbols used to represent simple sounds Articulation Measuring sound I one symbol being used for multiple sounds. Acoustics Relating written and I spoken language Interactive example chart: http://web.uvic.ca/ling/ From Speech to Text resources/ipa/charts/IPAlab/IPAlab.htm From Text to Speech Language modeling 10 / 60 Language and Syllabic systems Computers Prologue: Encoding Language Syllabic alphabets (Alphasyllabaries) Writing systems Alphabetic I writing systems with symbols that represent a Syllabic Logographic consonant with a vowel, but the vowel can be changed Systems with unusual realization by adding a diacritic (= a symbol added to the letter). Relation to language Encoding written I Examples: Balinese, Javanese, Tibetan, Tamil, Thai, language ASCII Tagalog Unicode (cf. also: http://www.omniglot.com/writing/syllabic.htm) Spoken language Transcription Why speech is hard to represent Syllabaries Articulation Measuring sound Acoustics I writing systems with separate symbols for each syllable Relating written and spoken language of a language From Speech to Text From Text to Speech I Examples: Cherokee. Ethiopic, Cypriot, Ojibwe, Language modeling Hiragana (Japanese) (cf. also: http://www.omniglot.com/writing/syllabaries.htm#syll) 11 / 60 Language and Syllabary example: Cypriot Computers Prologue: Encoding Language The Cypriot syllabary or Cypro-Minoan writing is thought to have Writing systems developed from the Linear A, or possibly the Linear B script of Crete, Alphabetic Syllabic though its exact origins are not known. It was used from about 800 to 200 Logographic Systems with unusual BC. realization Relation to language Encoding written language ASCII Unicode Spoken language Transcription Why speech is hard to represent Articulation Measuring sound Acoustics Relating written and spoken language From Speech to Text From Text to Speech Language modeling (from: http://www.omniglot.com/writing/cypriot.htm) 12 / 60 Language and Syllabic alphabet example: Lao Computers Prologue: Encoding Language Script developed in the 14th century to write the Lao language, based on Writing systems an early version of the Thai script, which was developed from the Old Alphabetic Syllabic Khmer script, which was itself based on Mon scripts. Logographic Systems with unusual realization Example for vowel diacritics around the letter k: Relation to language Encoding written language ASCII Unicode Spoken language Transcription Why speech is hard to represent Articulation Measuring sound Acoustics Relating written and spoken language From Speech to Text From Text to Speech Language modeling (from: http://www.omniglot.com/writing/lao.htm) 13 / 60 Language and Logographic writing systems Computers Prologue: Encoding Language I Logographs (also called Logograms):

Load more