<<

Digital Speech Processing: Lecture 1

Introduction to Digital Speech Processing

Speech Processing

• Speech is the most natural form of human-human communication.
• Speech is related to language; linguistics is a branch of social science.
• Speech is related to human physiological capability; physiology is a branch of medical science.
• Speech is also related to sound and acoustics, a branch of physical science.
• Therefore, speech is one of the most intriguing signals that humans work with every day.
• Purpose of speech processing:
– To understand speech as a means of communication;
– To represent speech for transmission and reproduction;
– To analyze speech for automatic recognition and extraction of information;
– To discover some physiological characteristics of the talker.


Why Digital Processing of Speech?

• digital processing of speech signals (DPSS) enjoys an extensive theoretical and experimental base developed over the past 75 years
• much research has been done since 1965 on the use of digital processing in speech communication problems
• highly advanced implementation technology (VLSI) exists that is well matched to the computational demands of DPSS
• there are abundant applications that are in widespread use commercially

The Speech Stack

• Speech Applications: coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down
• Speech Algorithms: speech-silence (background), voiced-unvoiced decision, pitch detection, formant estimation
• Speech Representations: temporal, spectral, homomorphic, LPC
• Fundamentals: acoustics, linguistics, pragmatics, speech perception

Speech Applications

• We look first at the top of the speech processing stack, namely applications
– speech coding
– speech synthesis
– speech recognition and understanding
– other speech applications

Encoding: speech xc(t) → A-to-D Converter → x[n] → Analysis/Coding → y[n] → Compression → ŷ[n] → Channel or Medium (data)
Signal at each stage: continuous-time signal xc(t); sampled signal x[n]; transformed representation y[n]; bit sequence ŷ[n]

Decoding: Channel or Medium (data) → ŷ[n] → Decompression → ỹ[n] → Decoding/Synthesis → x̃[n] → D-to-A Converter → x̃c(t) (speech)

Speech Coding

• Speech Coding is the process of transforming a speech signal into a representation for efficient transmission and storage of speech
– narrowband and broadband wired telephony
– cellular communications
– Voice over IP (VoIP) to utilize the Internet as a real-time communications medium
– secure voice for privacy and encryption for national security applications
– extremely narrowband communications channels, e.g., battlefield applications using HF radio
– storage of speech for telephone answering machines, IVR systems, prerecorded messages

Demo of Speech Coding

• Narrowband Speech Coding:
– 64 kbps PCM
– 32 kbps ADPCM
– 16 kbps LDCELP
– 8 kbps CELP
– 4.8 kbps FS1016
– 2.4 kbps LPC10E
• Wideband Speech Coding (Male talker / Female talker):
– 3.2 kHz, uncoded
– 7 kHz, uncoded
– 7 kHz, 64 kbps
– 7 kHz, 32 kbps
– 7 kHz, 16 kbps

Demo of Audio Coding

• CD Original (1.4 Mbps) versus MP3-coded at 128 kbps
– female vocal
– trumpet selection
– orchestra
– baroque
– guitar

Can you determine which is the uncoded and which is the coded audio for each selection?

Audio Coding

• Female vocal – MP3-128 kbps coded, CD original
• Trumpet selection – CD original, MP3-128 kbps coded
• Orchestral selection – MP3-128 kbps coded
• Baroque – CD original, MP3-128 kbps coded
• Guitar – MP3-128 kbps coded, CD original

Additional Audio Selections

Speech Synthesis

• Synthesis of Speech is the process of generating a speech signal using computational means for effective human-machine interactions
– machine reading of text or email messages
– telematics feedback in automobiles
– talking agents for automatic transactions
– automatic agent in customer care call center
– handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
– announcement machines that provide information such as stock quotes, airlines schedules, weather reports, etc.

Speech Synthesis

text → Linguistic Rules → DSP Computer → D-to-A Converter → speech

Speech Synthesis Examples

• Soliloquy from Hamlet:
• Gettysburg Address:
• Third Grade Story:

(audio samples labeled 1964-lrr and 2002-tts)

Pattern Matching Problems

speech → A-to-D Converter → Feature Analysis → Pattern Matching → symbols
(Reference Patterns feed the Pattern Matching block)

• speech recognition
• speaker recognition
• speaker verification
• word spotting
• automatic indexing of speech recordings
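The pattern matching block above can be sketched in a few lines. This is only an illustration: the slide does not say which features or which distance measure are used, so the Euclidean distance, the 3-dimensional feature vectors, and the names `match_pattern` and `REFERENCE_PATTERNS` are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_pattern(features, reference_patterns):
    """Return the label of the stored reference pattern closest to `features`."""
    return min(
        reference_patterns,
        key=lambda label: euclidean(features, reference_patterns[label]),
    )


# Hypothetical reference patterns, invented purely for illustration.
REFERENCE_PATTERNS = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
}

# The output of feature analysis for an unknown utterance (also invented).
symbol = match_pattern([0.8, 0.2, 0.25], REFERENCE_PATTERNS)
print(symbol)
```

The same structure covers the other problems in the list: for speaker recognition the labels would be speaker identities rather than words, and for verification the smallest distance would be compared against a threshold instead of simply being selected.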

Speech Recognition and Understanding

• Recognition and Understanding of Speech is the process of extracting usable linguistic information from a speech signal in support of human-machine communication by voice
– command and control (C&C) applications, e.g., simple commands for spreadsheets, presentation graphics, appliances
– voice dictation to create letters, memos, and other documents
– natural language voice dialogues with machines to enable Help desks, Call Centers
– voice dialing for cellphones and from PDAs and other small devices
– agent services such as calendar entry and update, address list modification and entry, etc.

Speech Recognition Demos

Speech Recognition Demos

Dictation Demo

Other Speech Applications

• Speaker Verification for secure access to premises, information, virtual spaces
• Speaker Recognition for legal and forensic purposes (national security); also for personalized services
• Speech Enhancement for use in noisy environments, to eliminate echo, to align voices with video segments, to change voice qualities, to speed up or slow down prerecorded speech (e.g., talking books, rapid review of material, careful scrutinizing of spoken material, etc.) => potentially to improve intelligibility and naturalness of speech
• Language Translation to convert spoken words in one language to another to facilitate natural language dialogues between people speaking different languages, e.g., tourists, business people

DSP/Speech Enabled Devices

• Internet Audio
• Digital Cameras
• PDAs & Streaming Audio/Video
• Hearing Aids
• Cell Phones

Apple iPod

• stores music in MP3, AAC, MP4, wma, wav, … audio formats
• compression of 11-to-1 for 128 kbps MP3
• can store order of 20,000 songs with 30 GB disk
• can use flash memory to eliminate all moving memory access
• can load songs from iTunes store – more than 1.5 billion downloads
• tens of millions sold

Memory → x[n] → Computer → y[n] → D-to-A → yc(t)

One of the Top DSP Applications

Cellular Phone
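The 11-to-1 compression figure quoted for 128 kbps MP3 can be checked with a short calculation. The sketch below assumes the standard CD audio parameters (44.1 kHz sampling, 16 bits per sample, two channels); these do not appear in the slides, which only give the rounded 1.4 Mbps CD rate.

```python
# Assumed CD audio parameters (not stated in the slides).
CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16
CD_CHANNELS = 2

# MP3 bit rate taken from the slides.
MP3_RATE_BPS = 128_000

cd_rate_bps = CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE * CD_CHANNELS
compression_ratio = cd_rate_bps / MP3_RATE_BPS

print(f"CD rate: {cd_rate_bps / 1e6:.2f} Mbps")
print(f"Compression ratio: {compression_ratio:.1f} to 1")
```

Under these assumptions the uncompressed rate comes out near the 1.4 Mbps stated earlier for the CD original, and the ratio is close to 11.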

Digital Speech Processing

• Need to understand the nature of the speech signal, and how dsp techniques, communication technologies, and information theory methods can be applied to help solve the various application scenarios described above
– most of the course will concern itself with speech signal processing, i.e., converting one type of speech signal representation to another so as to uncover various mathematical or practical properties of the speech signal and do appropriate processing to aid in solving both fundamental and deep problems of interest

Speech Signal Production

Message Source (M) → Linguistic Construction (W) → Articulatory Production (S) → Acoustic Propagation (A) → Electronic Transduction (X) → Speech Waveform

• M: idea encapsulated in a message, M
• W: message, M, realized as a word sequence, W
• S: words realized as a sequence of (phonemic) sounds, S
• A: sounds received at the transducer through acoustic ambient, A
• X: signals converted from acoustic to electric, transmitted, distorted and received as X

Conventional studies of speech science use speech signals recorded in a sound booth with little interference or distortion.

Practical applications require use of realistic or “real world” speech with noise and distortions.

Speech Production/Generation Model

• Message Formulation → desire to communicate an idea, a wish, a request, … => express the message as a sequence of words
Desire to Communicate → Message Formulation → Text String (Discrete Symbols)
Example text strings: “I need some string”, “Please get me some string”, “Where can I buy some string”
• Language Code → need to convert chosen text string to a sequence of sounds in the language that can be understood by others; need to give some form of emphasis, prosody (tune, melody) to the spoken sounds so as to impart non-speech information such as sense of urgency, importance, psychological state of talker, environmental factors (noise, echo)
Text String (Discrete Symbols) → Language Code Generator → Phoneme string with prosody (Discrete Symbols)
The Language Code Generator draws on Pronunciation and Vocabulary (In The Brain)

Speech Production/Generation Model

• Neuro-Muscular Controls → need to direct the neuro-muscular system to move the articulators (tongue, lips, teeth, jaws, velum) so as to produce the desired spoken message in the desired manner
Phoneme String with Prosody → Neuro-Muscular Controls → Articulatory motions (Continuous control)
• Vocal Tract System → need to shape the human vocal tract system and provide the appropriate sound sources to create an acoustic waveform (speech) that is understandable in the environment in which it is spoken
Articulatory Motions → Vocal Tract System → Acoustic Waveform (Speech) (Continuous control)
The Vocal Tract System also receives Source control (lungs, diaphragm, chest muscles)

Speech Perception Model

• The acoustic waveform impinges on the ear (the basilar membrane) and is spectrally analyzed by an equivalent filter bank of the ear
Acoustic Waveform → Basilar Membrane Motion → Spectral Representation (Continuous Control)
• The signal from the basilar membrane is neurally transduced and coded into features that can be decoded by the brain
Spectral Features → Neural Transduction → Sound Features (Distinctive Features) (Continuous/Discrete Control)
• The brain decodes the feature stream into sounds, words and sentences
Sound Features → Language Translation → Phonemes, Words, and Sentences (Discrete Message)
• The brain determines the meaning of the words via a message understanding mechanism
Phonemes, Words and Sentences → Message Understanding → Basic Message (Discrete Message)

The Speech Signal

[Figure: speech waveform with labeled regions: Background Signal, Pitch Period, Unvoiced Signal (noise-like sound)]

The Speech Chain

Production (Discrete Input → Continuous Input):
Message Formulation → Text → Language Code → Phonemes, Prosody → Neuro-Muscular Controls → Articulatory Motions → Vocal Tract System → Acoustic Waveform

Information Rate at each stage:
• Text: 50 bps
• Phonemes, Prosody: 200 bps
• Articulatory Motions: 2000 bps
• Acoustic Waveform: 30-50 kbps

Transmission Channel

Perception (Continuous Output → Discrete Output):
Acoustic Waveform → Basilar Membrane Motion (Spectrum Analysis) → Neural Transduction (Feature Extraction, Coding) → Language Translation (Phonemes, Words, Sentences) → Message Understanding (Semantics)

Speech Sciences

• Linguistics: science of language, including phonetics, phonology, morphology, and syntax
• Phonemes: smallest set of units considered to be the basic set of distinctive sounds of a language (20-60 units for most languages)
• Phonemics: study of phonemes and phonemic systems
• Phonetics: study of speech sounds and their production, transmission, and reception, and their analysis, classification, and transcription
• Phonology: phonetics and phonemics together
• Syntax: the rules by which words are combined into an utterance (the meaning of an utterance is the subject of semantics)

The Speech Circle

Customer voice request
→ ASR (Automatic Speech Recognition) → Words spoken: “I dialed a wrong number”
→ SLU (Spoken Language Understanding) → Meaning: “Billing credit”
→ DM (Dialog Management) → Actions (What’s next?): “Determine correct number”
→ SLG (Spoken Language Generation) → Words: “What number did you want to call?”
→ TTS (Text-to-Speech Synthesis) → Voice reply to customer

(Data sits at the center of the circle and is shared by all stages.)

Information Source: human speaker, lots of variability
→ Measurement or Observation: acoustic waveform/articulatory positions/neural control signals
→ Signal Representation
→ Signal Processing (Purpose of Course)
→ Signal Transformation
→ Extraction and Utilization of Information: human listeners, machines

Information Rate of Speech

• from a Shannon view of information:
– message content/information: 2^6 = 64 symbols (phonemes) in the language; 10 symbols/sec for normal speaking rate => 60 bps is the equivalent information rate for speech (issues of phoneme probabilities, phoneme correlations)
• from a communications point of view:
– speech bandwidth is between 4 kHz (telephone quality) and 8 kHz (wideband hi-fi speech); need to sample speech at between 8 and 16 kHz, and need about 8 (log encoded) bits per sample for high quality encoding => 8000 x 8 = 64,000 bps (telephone) to 16,000 x 8 = 128,000 bps (wideband)

1000-2000 times change in information rate from discrete message symbols to waveform encoding => can we achieve this three orders of magnitude reduction in information rate on real speech waveforms?
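The arithmetic behind these figures is easy to reproduce. The sketch below uses only the numbers given in the slide; it assumes equiprobable, independent phonemes, which is the simplification the slide itself flags under "issues of phoneme probabilities, phoneme correlations". The variable and function names are chosen for this example.

```python
import math

NUM_PHONEMES = 2 ** 6      # symbols in the language (from the slide)
SYMBOLS_PER_SECOND = 10    # normal speaking rate (from the slide)
BITS_PER_SAMPLE = 8        # log-encoded bits per sample (from the slide)

# Shannon view: bits per symbol times symbols per second.
bits_per_symbol = math.log2(NUM_PHONEMES)
message_rate_bps = bits_per_symbol * SYMBOLS_PER_SECOND


def waveform_rate_bps(sample_rate_hz, bits_per_sample=BITS_PER_SAMPLE):
    """Bit rate of sample-by-sample waveform encoding."""
    return sample_rate_hz * bits_per_sample


telephone_bps = waveform_rate_bps(8_000)    # 4 kHz bandwidth
wideband_bps = waveform_rate_bps(16_000)    # 8 kHz bandwidth

print(f"message rate: {message_rate_bps:.0f} bps")
print(f"telephone: {telephone_bps} bps, ratio {telephone_bps / message_rate_bps:.0f}x")
print(f"wideband:  {wideband_bps} bps, ratio {wideband_bps / message_rate_bps:.0f}x")
```

The two ratios fall at roughly one thousand and two thousand, which is where the "1000-2000 times" range comes from.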

Digital Speech Processing

• DSP:
– obtaining discrete representations of speech signal
– theory, design and implementation of numerical procedures (algorithms) for processing the discrete representation in order to achieve a goal (recognizing the signal, modifying the time scale of the signal, removing background noise from the signal, etc.)
• Why DSP
– reliability
– flexibility
– accuracy
– real-time implementations on inexpensive dsp chips
– ability to integrate with multimedia and data
– encryptability/security of the data and the data representations via suitable techniques

Hierarchy of Digital Speech Processing

Representation of Speech Signals
• Waveform Representations: preserve wave shape through sampling and quantization
• Parametric Representations: represent signal as output of a speech production model
– Excitation Parameters: pitch, voiced/unvoiced, noise, transients
– Vocal Tract Parameters: spectral, articulatory
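As one concrete instance of a waveform representation, the sketch below applies logarithmic (mu-law) companding followed by 8-bit uniform quantization, which is the kind of "log encoded" 8-bit sample mentioned earlier in the lecture. It is a minimal illustration, not the bit-exact telephony codec: the value mu = 255 and the helper names are assumptions, and samples are taken to be already normalized to the range [-1, 1].

```python
import math

MU = 255  # assumed companding constant; not stated in the slides


def mu_law_compress(x, mu=MU):
    """Logarithmically compress a sample x in [-1, 1] to the range [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def quantize(y, bits=8):
    """Map y in [-1, 1] to an integer index in [0, 2**bits - 1]."""
    levels = 2 ** bits
    return min(int((y + 1) / 2 * levels), levels - 1)


def dequantize(index, bits=8):
    """Map an index back to the midpoint of its quantization cell."""
    levels = 2 ** bits
    return (index + 0.5) / levels * 2 - 1


def encode(x):
    """Sample in [-1, 1] -> 8-bit code."""
    return quantize(mu_law_compress(x))


def decode(code):
    """8-bit code -> reconstructed sample."""
    return mu_law_expand(dequantize(code))


for sample in (-0.5, 0.01, 0.5):
    print(sample, encode(sample), round(decode(encode(sample)), 4))
```

Because the quantizer works on the compressed value, small-amplitude samples are reconstructed with much finer absolute resolution than large ones, which suits the wide dynamic range of speech.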

Information Rate of Speech

Data Rate (Bits Per Second), axis marks from high to low: 200,000; 60,000; 20,000; 10,000; 500; 75

• Waveform Representations (No Source Coding): LDM, PCM, DPCM, ADM
• Parametric Representations (Source Coding): Analysis-Synthesis Methods; Synthesis from Printed Text

Speech Processing Applications

• Cellphones, VoIP, Vocoder: conserve bandwidth, encryption, secrecy, seamless voice and data
• Messages, IVR, call centers, telematics
• Secure access, forensics
• Dictation, command-and-control, agents, NL voice dialogues, call centers, help desks
• Readings for the blind, speed-up and slow-down of speech rates
• Noise and echo removal, alignment of speech and text

Intelligent Robot?

http://www.youtube.com/watch?v=uvcQCJpZJH8

The Speech Stack

Speak 4 It (AT&T Labs)

Courtesy: Mazin Rahim

What We Will Be Learning

• review some basic dsp concepts
• speech production model: acoustics, articulatory concepts, speech production models
• speech perception model: ear models, auditory signal processing, equivalent acoustic processing models
• time domain processing concepts: speech properties, pitch, voiced-unvoiced, energy, autocorrelation, zero-crossing rates
• short time Fourier analysis methods: digital filter banks, analysis-synthesis systems, vocoders
• homomorphic speech processing: cepstrum, pitch detection, formant estimation, homomorphic vocoder
• linear predictive coding (LPC) methods: autocorrelation method, covariance method, lattice methods, relation to vocal tract models
• speech waveform coding and source models: delta modulation, PCM, mu-law, ADPCM, vector quantization, multipulse coding, CELP coding
• methods for speech synthesis and text-to-speech systems: physical models, formant models, articulatory models, concatenative models
• methods for speech recognition: the hidden Markov model (HMM)
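As a small preview of the time domain processing concepts in this outline, the sketch below computes short-time energy and zero-crossing rate on two synthetic frames. The frames are stand-ins invented for the example (a 100 Hz sinusoid for a voiced-like sound, low-level white noise for an unvoiced-like sound), and the 8 kHz sampling rate and 30 ms frame length are assumptions; real speech and a real voiced-unvoiced decision rule would be more involved.

```python
import math
import random

SAMPLE_RATE_HZ = 8_000  # assumed telephone-band sampling rate
FRAME_LEN = 240         # assumed 30 ms frame at 8 kHz


def short_time_energy(frame):
    """Average squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Voiced-like stand-in: a 100 Hz sinusoid (small phase offset avoids exact zeros).
voiced = [
    math.sin(2 * math.pi * 100 * n / SAMPLE_RATE_HZ + 0.1)
    for n in range(FRAME_LEN)
]

# Unvoiced-like stand-in: low-level white noise from a seeded generator.
rng = random.Random(0)
unvoiced = [0.1 * rng.uniform(-1, 1) for _ in range(FRAME_LEN)]

for name, frame in (("voiced-like", voiced), ("unvoiced-like", unvoiced)):
    print(name, short_time_energy(frame), zero_crossing_rate(frame))
```

The voiced-like frame shows high energy with few zero crossings and the noise-like frame shows the opposite, which is the contrast that simple voiced-unvoiced detectors exploit.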