Introduction to Digital Speech Processing Speech Processing The
Total Page:16
File Type:pdf, Size:1020Kb
Speech Processing • Speech is the most natural form of human-human communications. • Speech is related to language; linguistics is a branch of social science. • Speech is related to human physiological capability; physiology is a Digital Speech ProcessingProcessing—— branch of medical science. • Speech is also related to sound and acoustics, a branch of physical Lecture 1 science. • Therefore, speech is one of the most intriguing signals that humans work with every day. • Purpose of speech processing: Introduction to Digital – To understand speech as a means of communication; – To represent speech for transmission and reproduction; Speech Processing – To analyze speech for automatic recognition and extraction of information – To discover some physiological characteristics of the talker. 1 2 Why Digital Processing of Speech? The Speech Stack Speech Applications — coding, synthesis, • digital processing of speech signals (DPSS) recognition, understanding, verification, enjoys an extensive theoretical and language translation, speed-up/slow-down experimental base developed over the past 75 years • much research has been done since 1965 on Sppgeech Algorithms —speech-silence the use of digital signal processing in speech (background), voiced-unvoiced decision, communication problems pitch detection, formant estimation • highly advanced implementation technology (VLSI) exists that is well matched to the Speech Representations — temporal, computational demands of DPSS spectral, homomorphic, LPC • there are abundant applications that are in widespread use commercially Fundamentals — acoustics, linguistics, pragmatics, speech perception 3 4 Speech Applications Speech Coding Encoding • We look first at the top of the speech processing stack—namely speechA-to-D Analysis/ data Channel Compressionyˆ[n] or Converter Coding Medium applications xc (t) x[n] y[n] yˆ[n] Continuous Sampled Transformed Bit sequence – speech coding time signal signal representation – speech synthesis Decoding Channel – speech recognition and understanding or data speech Medium Decom- Decoding/ D-to-A pression Synthesis Converter – other speech applications y[]n x[]n xyˆc c()t(t) 5 6 1 Speech Coding Demo of Speech Coding • Speech Coding is the process of transforming a speech signal into a representation for efficient • Narrowband Speech Coding: • Wideband Speech Coding: 64 kbps PCM transmission and storage of speech Male talker / Female Talker – narrowband and broadband wired telephony 32 kbps ADPCM – cellular communications 16 kbps LDCELP 3.2 kHz – uncoded 7 kHz – uncoded – Voice over IP (VoIP) to utilize the Internet as a real-time 8C8 kbps CELP 7 kHz – 64 kbps communications medium 4.8 kbps FS1016 7 kHz – 32 kbps – secure voice for privacy and encryption for national 7 kHz – 16 kbps security applications 2.4 kbps LPC10E – extremely narrowband communications channels, e.g., battlefield applications using HF radio – storage of speech for telephone answering machines, IVR systems, prerecorded messages Narrowband Speech Wideband Speech 7 8 Demo of Audio Coding Audio Coding • CD Original (1.4 Mbps) versus MP3-coded at 128 kbps • Female vocal – MP3-128 kbps coded, ¾ female vocal ¾ trumpet selection • CD original ¾ orchestra • Trumpet selection – CD original, MP3-128 ¾ baroque kbps co de d ¾ guitar • Orchestral selection – MP3-128 kbps Can you determine which is the uncoded and which is the coded audio for each selection? coded • Baroque – CD original, MP3-128 kbps coded Audio Coding Additional Audio Selections 9 • Guitar – MP3-128 kbps coded, CD original10 Speech Synthesis Speech Synthesis • Synthesis of Speech is the process of generating a speech signal using computational means for effective human- machine interactions – machine reading of text or email messages text Linguistic DSP D-to-A speech – telematics feedback in automobiles Rules Computer Converter – talking agents for automatic transactions – automatic agent in customer care call center – handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers – announcement machines that provide information such as stock quotes, airlines 11 schedules, weather reports, etc. 12 2 Speech Synthesis Examples Pattern Matching Problems • Soliloquy from Hamlet: speechA-to-D Feature Pattern symbols Converter Analysis Matching • Gettysburg Address: • speech recognition Reference • speaker recognition Patterns • speaker verification • Third Grade Story: • word spotting 19641964--lrrlrr 20022002--ttstts • automatic indexing of speech recordings 13 14 Speech Recognition and Understanding Speech Recognition Demos • Recognition and Understanding of Speech is the process of extracting usable linguistic information from a speech signal in support of human-machine communication by voice – command and control (C&C) applications, e.g., simple commands for spreadsheets, presentation graphics, appliances – voice dictation to create letters, memos, and other documents – natural language voice dialogues with machines to enable Help desks, Call Centers – voice dialing for cellphones and from PDA’s and other small devices – agent services such as calendar entry and update, address list modification and entry, etc. 15 16 Other Speech Applications Speech Recognition Demos • Speaker Verification for secure access to premises, information, virtual spaces • Speaker Recognition for legal and forensic purposes— national security; also for personalized services • Speech Enhancement for use in noisy environments, to eliminate echo, to align voices with video segments, to change voice qualities, to speed-up or slow-down prerecorded speech (e.g., talking books, rapid review of material, careful scrutinizing of spoken material, etc) => potentially to improve intelligibility and naturalness of speech • Language Translation to convert spoken words in one language to another to facilitate natural language dialogues between people speaking different languages, i.e., tourists, business people 17 18 3 DSP/Speech Enabled Devices Apple iPod • stores music in MP3, AAC, MP4, wma, wav, … audio formats • compression of 11-to-1 for 128 kbps MP3 • can store order of 20,000 songs with 30 GB disk • can use flash memory to eliminate all moving memory access Internet Audio Digital Cameras PDAs & Streaming • can load songs from iTunes store – Audio/Video more than 1.5 billion downloads • tens of millions sold x[n] y[n] yc(t) Hearing Aids Memory Computer D-to-A Cell Phones 19 20 One of the Top DSP Applications Digital Speech Processing • Need to understand the nature of the speech signal, and how dsp techniques, communication technologies, and information theory methods can be applied to help solve the various application scenarios described above – most of the course will concern itself with speech signal processing — i.e., converting one type of speech signal representation to another so as to uncover various mathematical or practical properties of the speech signal and do appropriate processing to Cellular Phone aid in solving both fundamental and deep problems of interest 21 22 Speech Signal Production Speech Production/Generation Model Speech • Message Formulation Æ desire to communicate an idea, a wish, a Waveform request, … => express the message as a sequence of words Message Linguistic Articulatory Acoustic Electronic Source Construction Production Propagation Transduction Message I need some string Please get me some string (Discrete Symbols) M W S A X Desire to Formulation Text String Where can I buy some Communicate string Idea Message, M, Words realized Sounds Signals converted • Language Code Æ need to convert chosen text string to a encapsulated realized as a as a sequence received at from acoustic to sequence of sounds in the language that can be understood by in a word of(f (p honem ic ) the eltilectric, message, M sequence, W sounds, S transducer transmitted, others; need to give some form of emphasis, prosody (tune, melody) through distorted and to the spoken sounds so as to impart non-speech information such acoustic received as X as sense of urgency, importance, psychological state of talker, ambient, A environmental factors (noise, echo) Conventional studies of speech science use speech Text String Language Practical applications Phoneme string signals recorded in a sound Code (Discrete Symbols) require use of realistic or with prosody booth with little interference or Generator “real world” speech with distortion noise and distortions Pronunciation (In The Brain) Vocabulary 23 24 4 Speech Production/Generation Model The Speech Signal • Neuro-Muscular Controls Æ need to direct the neuro-muscular system to move the articulators (tongue, lips, teeth, jaws, velum) so as to produce the desired spoken message in the desired manner Neuro- Muscular Articulatory (Continuous control) Phoneme String motions with Prosody Controls • Vocal Tract System Æ need to shape the human vocal tract system and provide the appropriate sound sources to create an acoustic waveform (speech) that is understandable in the environment in Background which it is spoken Pitch Period Signal Vocal Tract Acoustic (Continuous control) Articulatory System Waveform Motions (Speech) Source control (lungs, Unvoiced Signal (noise- diaphragm, chest like sound) muscles) 25 26 Speech Perception Model The Speech Chain • The acoustic waveform impinges on the ear (the basilar membrane) Text Phonemes, Prosody Articulatory Motions and is spectrally analyzed by an equivalent filter bank of the ear Basilar Membrane Spectral (Continuous Control) Message