6. Text-to-

(Most Of these slides come from Dan Juray’s course at Stanford) History of Speech Synthesis

• In 1779, the Danish scientist Christian Kratzenstein builds models of the human vocal tract that can produce the five long vowel sounds. • In 1791 (the creator of the Turk chess playing game) devises the bellows(fuelle, mancha)-operated “automatic- mechanical speech machine”. It added models of the tongue and lips that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's design. • In the 1930s, Bell Labs developed the , a keyboard- operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. It was later refined into the VODER, which was exhibited at the 1939 New York World's Fair.

Tractament Digital de la Parla 2 Von Kempelen:

• Small whistles controlled consonants • Rubber mouth and nose; nose had to be covered with two fingers for non-nasals • Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air

From Traunmüller’s web site Von Kempelen’s speaking machine Bell labs VOCODER machine Homer Dudley 1939 VODER

• Synthesizing speech by electrical means • 1939 World’s Fair Homer Dudley’s VODER

• Manually controlled through complex keyboard • Operator training was a problem

One of the first “talking” computers Closer to a natural vocal tract: Riesz 1937 The UK Speaking Clock

• July 24, 1936 • Photographic storage on 4 glass disks • 2 disks for minutes, 1 for hour, one for seconds. • Other words in sentence distributed across 4 disks, so all 4 used at once. • Voice of “Miss J. Cain” The 1936 UK Speaking Clock

From hp://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm The UK speaking clock ’s OVE synthesizer

• Of the Royal Institute of Technology, • Formant Synthesizer for vowels • F1 and F2 could be controlled

From Traunmüller’s web site Cooper’s

• Haskins Labs for investigating • Works like an inverse of a spectrograph • Light from a lamp goes through a rotating disk then through spectrogram into photovoltaic cells • Thus amount of light that gets transmitted at each frequency band corresponds to amount of acoustic energy at that band Cooper’s Pattern Playback Modern TTS systems

• 1960’s first full TTS: Umeda et al (1968) • 1970’s – Joe Olive 1977 concatenation of linear-prediction diphones – Texas Instruments Speak and Spell, • June 1978 • Paul Breedlove

• 1980’s – 1979 MIT MITalk (Allen, Hunnicut, Klatt) • 1990’s-present – Diphone synthesis – Unit selection synthesis – HMM synthesis Speak and spell demo TTS Demos (Unit-Selection) • Cereproc – Catalan – Spanish • Festival (open source) – English • ATT – English – South American Types of Waveform Synthesis

: – Model movements of articulators and acoustics of vocal tract • Formant Synthesis: – Start with acoustics, create rules/filters to create each formant • : – Use of stored speech to assemble new utterances. • Diphone • Unit Selection • Statistical (HMM) Synthesis – Trains parameters on databases of speech

Text modified from Richard Sproat slides 1st gen. synthesis: Formant Synthesis • Were the most common commercial systems when computers were slow and had little memory. • 1979 MIT MITalk (Allen, Hunnicut, Klatt) • 1983 DECtalk system – “Perfect Paul” (The voice of Stephen Hawking) – “Beautiful Betty” 2nd Generation Synthesis

• Diphone Synthesis – Units are diphones; middle of one phone to middle of next. – Why? Middle of phone is steady state. – Record 1 speaker saying each diphone – ~1400 recordings – Paste them together and modify prosody. 3rd GenerationSynthesis

• All current commercial systems.

• Unit Selection Synthesis – Larger units of variable length – Record one speaker speaking 10 hours or more, • Have multiple copies of each unit – Use search to find best sequence of units

• Hidden Markov Model Synthesis – Train a statistical model on large amounts of data. General TTS architecture

Text analysis: • Text Normalizaon • Part-of-speech tagging Raw text in • Homograph disambiguaon

Phonec analysis • Diconary lookup • Grapheme-to-phoneme(LTS)

Prosodic analysis • Boundary placement • Pitch accent assignment • Duraon computaon Speech out

Waveform synthesis 1. Text Normalization

• Analysis of raw text into pronounceable words • Sample problems: – He stole $100 million from the bank – It's 13 St. Andrews St. – The home page is hp://www.stanford.edu – yes, see you the following tues, that's 11/12/01 • Steps: – Identify tokens in text – Chunk tokens into reasonably sized sections – Map tokens to words – Identify types for words Identify tokens and chunk them • Whitespace can be viewed as separators • Punctuation can be separated from the raw tokens • Festival converts text into – ordered list of tokens – each with features: • its own preceding whitespace • its own succeeding punctuation End-of-utterance detection

• Relatively simple if utterance ends in ?! • But what about ambiguity of “.” • Ambiguous between end-of-utterance and end-of-abbreviation – My place on Winfield St. is around the corner. – I live at 151 Winfield St. – (Not “I live at 151 Winfield St..”) Identify Types of Tokens, and Convert Tokens to Words

• Pronunciation of numbers often depends on type. 3 ways to pronounce 1776: – 1776 date: seventeen seventy six. – 1776 phone number: one seven seven six – 1776 quantifier: one thousand seven hundred (and) seventy six – Also: • 25 day: twenty-fifth Step 2: Classify token into 1 of 20 types • EXPN: abbrev, contractions (adv, N.Y., mph, gov’t) • LSEQ: letter sequence (CIA, D.C., CDs) • ASWD: read as word, e.g. CAT, proper names • MSPL: misspelling • NUM: number (cardinal) (12,45,1/2, 0.6) • NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates II • NTEL: telephone (or part) e.g. 212-555-4523 • NDIG: number as digits e.g. Room 101 • NIDE: identifier, e.g. 747, 386, I5, PC110 • NADDR: number as stresst address, e.g. 5000 Pennsylvania • NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT,URL,etc • SLNT: not spoken (KENT*REALTY) POS Tagging: Definition

• The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

WORDS TAGS the koala put N the V keys P on DET the table Part of speech tagging

• 8 (ish) traditional parts of speech – Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc – This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.) – Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS – We’ll use POS most frequently POS examples

• N noun chair, bandwidth, pacing • V verb study, debate, munch • ADJ adj purple, tall, ridiculous • ADV adverb unfortunately, slowly, • P preposition of, by, to • PRO pronoun I, me, mine • DET determiner the, a, that, those POS Tagging example

WORD tag the DET koala N put V the DET keys N on P the DET table N POS tagging: Choosing a tagset • There are so many parts of speech, potential distinctions we can draw • To do POS tagging, need to choose a standard set of tags to work with • Could pick very coarse tagets – N, V, Adj, Adv. • More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags – PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist Penn TreeBank POS Tag set POS importance

• Part of speech tagging plays important role in TTS • Most algorithms get 96-97% tag accuracy • Not a lot of studies on whether remaining error tends to cause problems in TTS

1/5/07 Phonec analysis

• Deals with the pronunciaon of each word • Finds, for each word, the way to pronounce it • Straighorward method would be a diconary lookup, but: – Unknown words and proper names will be missing – Words in other Phoneme sets Converng from words to phones • Two methods: – Diconary-based – Rule-based (through leer-to-sound -LTS- rules) • All early systems used LTS. The first to use a diconary was MITTalk (10K words) • A good diconary example is CMU diconary with 127K words When dictionaries aren’t sufficient • Unknown words – Seem to be linear with number of words in unseen text – Mostly person, company, product names – But also foreign words, etc. – From a Black et al analysis • Of 39K tokens in part of the Wall Street Journal • 1775 (4.6%) were not in the OALD dictionary:

• So commercial systems have 3-part system: – Big dictionary – Special code for handling names – Machine learned LTS system for other unknown words

1/5/07 Learning LTS rules • Rules for Spanish are straighorward – Grapheme(wrien form) is very similar to the phoneme sequence • In the past most big systems employed many linguists (of PhD students) to manually write LTS rules • Now they are mostly induced automacally from a diconary of the – An important step is the phoneme-grapheme alignment Homograph disambiguaon

• An homograph is a word (or group of words) that share the same wrien form but have different meanings. • In pronunciaon, they might have the same pronunciaon (homonyms) or different (heteronyms). • It is very important to detect and disambiguate them for correct pronunciaon • Examples: – English: wood(madera)/wood(bosque), bear(aguantar)/ bear(barba), read(presente leer)/read(pasado leer) – Spanish: all words are homonyms Next steps aer text analysis

• Aer Text Analysis we now have a string of Phones. We also have some semanc informaon to do Prosodic Analysis, so next steps are: • Prosody – Desired F0 for enre uerance – Duraon for each phone – Stress value for each phone, possibly accent value • Generate Waveforms Prosody

Prosody is in charge of converng words+phones into boundaries, accent, F0 and duraon informaon • Prosodic phrasing – Need to break uerances into phrases – Punctuaon is useful, not sufficient • Accents: – Predicons of accents: which syllables should be accented – Realizaon of F0 contour: given accents/tones, – generate F0 contour • Duraon: – Predicng duraon of each phone Graphic representation of F0

400

350

300

250

200

150

100 F0 (inF0 Hertz) 50 legumes are a good source of VITAMINS time

Slide from Jennifer Venditti Alternave accents

• We might want to give prominence to different parts of a sentence depending on the context. • For example, imagine answering a queson on the previous example Q1: What types of foods are a good source of vitamins?

400

350

300

250

200

150

100

50 LEGUMES are a good source of vitamins

Slide from Jennifer Venditti Q2: Are legumes a source of vitamins?

400

350

300

250

200

150

100

50 Legumes are a GOOD source of vitamins

Slide from Jennifer Venditti Q3: I’ve heard that legumes are healthy, but what are they a good source of ?

400

350

300

250

200

150

100

50 legumes are a good source of VITAMINS

Slide from Jennifer Venditti Duration • Simplest: fixed size for all phones (100 ms) • Next simplest: average duration for that phone (from training data). Samples from SWBD in ms: – aa 118 b 68 – ax 59 d 68 – ay 138 dh 44 – eh 87 f 90 – ih 77 g 66 • Next Next Simplest: add in phrase-final and initial lengthening plus stress • Lots of fancy models of duration prediction: – Using Z-scores and other clever normalizations – Sum-of-products model – New features like word predictability • Words with higher bigram probability are shorter Waveform synthesis generation

• Given: – String of phones – Prosody • Desired F0 for entire utterance • Duration for each phone • Stress value for each phone, possibly accent value • Generate: – Waveforms The hourglass architecture Internal Representation: Input to Waveform Synthesis Waveform synthesis

• concatenave synthesis systems – Diphone synthesis: join diphones to form the waveform. – Unit selecon synthesis: join the longest unit possible. • HMM-based sytnthesis Diphones

• mid-phone is more stable than edge • Need O(phone2) number of units – Some combinaons don’t exist (hopefully) – ATT (Olive et al. 1998) system had 43 phones • 1849 possible diphones • Phonotaccs ([h] only occurs before vowels), don’t need to keep diphones across silence • Only 1172 actual diphones – May include stress, consonant clusters • So could have more – Lots of phonec knowledge in design • relavely small (by today’s standards) – Around 8 megabytes for English (16 KHz 16 bit)

Slide from Richard Sproat Diphones • Mid-phone is more stable than edge: Diphone synthesis

• Training: – Choose units (kinds of diphones) – Record 1 speaker saying 1 example of each diphone – Mark the boundaries of each diphones, • cut each diphone out and create a diphone database • Synthesizing an utterance, – grab relevant sequence of diphones from database – Concatenate the diphones, doing slight signal processing at boundaries – use signal processing to change the prosody (F0, energy, duration) of selected sequence of diphones Diphone labelling • Recorded sentences need to be labelled with phone start-end and middle(stable parts). This can be done automacally with: – ASR forced alignment – Audio-to-synthec audio auto-alignment Diphone auto-alignment

• Given – synthesized prompts – Human speech of same prompts • Do a dynamic time warping alignment of the two – Using Euclidean distance • Works very well 95%+ – Errors are typically large (easy to fix) – Maybe even automatically detected

Slide from Richard Sproat Dynamic Time Warping

Slide from Richard Sproat Pung it all together

Once we have the diphones we want to join, we need to: • Join them together to form the words and sentences • Modify the speaking speed to reflect the target duraon • Modify the pitch to reflect the target intonaon Joining diphones

Given any group of diphones to join together, we can follow a set of techniques: • Dumb: – just join -> we will get many arfacts – at zero crossings • Algorithms that allow joining and modifying the duraon at the same me – SOLA – TD-PSOLA(Time-domain pitch-synchronous overlap- and-add) – Necessary prerequisite in voiced signals: epoch labelling Epoch-labeling

• An example of epoch-labeling useing “SHOW PULSES” in Praat:

It can be easily done using an EGG, or (less accurately) via signal processing) Epoch-labeling: Electroglottograph (EGG) • Also called laryngograph or Lx – Device that straps on speaker’s neck near the larynx – Sends small high frequency current through adam’s apple – Human tissue conducts well; air not as well – Transducer detects how open the glottis is (I.e. amount of air between folds) by measuring Picture from UCLA Phonecs Lab impedence. Duraon/pitch modificaon

If we playback a sound faster than usual (for example by resampling the signal) both pitch and length gets affected. To only alter one of them: • Time stretching: algorithms used to change the length/speed of a signal without altering its pitch • Pitch scaling/shiing: algorithms used to change the pitch without affecng the length/speed. Speech as Short Term signals

Alan Black Duration modification • Duplicate/remove short term signals Pitch Modification

• Move short-term signals closer together/further apart

Slide from Richard Sproat Windowing • To avoid artifacts at the edges we first apply a windowing to short-term signals • y[n] = w[n]s[n] Windowing

• Multiply value of signal at sample number n by the value of a windowing function • y[n] = w[n]s[n] Pitch/duraon modificaon algorithms • OLA (Overlap-add synthesis): Applies a certain overlap to each short term signal (if we desire a pitch modificaon) and add them. If we want to lengthen the signal we just repeat some of the periods. It can create glitches when the short term boundaries are not well defined.

f ! $" "#$ &"#$(%"#  "(!' $ !!!!!!!!!!!!!%"&$! !¦ $" "#  ""!  !' $$! "!#!$! " f )*+!%,-./+%!% "&$!1&/2!3044+5!451-f !6+51!1&!,&!0&7+58,/!3+.+&3+&7!1&!5+918+50&:!4,9715! ! $ "#$ &"#$(%"#  "(! $ !!!!!!!!!!!!!%"&$0 ! ! $ "#  ""!  ! $$! "!#!$! " ; =!3+40&+3!,%!,!5,701!14!!%06+!>!14!7*+!,&' ¦ " ,/2%0%!?0&31?!?"&$!@2!7*+!.079*!.+5013(!!' < " f ! (Original period ' A! ! "!B!$!! )*+!%,-./+%!%0"&$!1&/2!3044+5!451-!6+51!1&!,&!0&7+58,/!3+.+&3+&7!1&!5+)Desired period 918+50&:!4,9715!' ;<=!3+40&+3!,%!,!5,701!14!!%06+!>!14!7*+!,&,/2%0%!?0&31?!?"&$!@2!7*+!.079*!.+5013(!! C&! .5,9709+=! ?+! 9*11%+! ;

! %.+975D-!14!%"&$!%0:&,/(!)*+&!7*+!91&9,7+&,701() ' A!' ! &!.519+%%!9*,&:+%!7*+!.079*!?07*1D7!,44+970&:!"!B!$!! 7*+!415-,&7%F!45+GD+&90+%(!)*+!D%+!14!3044+5+&7!5+918+50&:!4,9715!9,D%+%!%751&:!3+:5,3,701&!14! C&! .5,9709+=! ?+! 9*11%+! ;%2&7*+%06+3!%.++9*=!+(:(!@D660&:!

)*+!91&9+.7D,/!%9*+-+!14!7*+!30.*1&+!%2&7*+%06+5!0%!%*1?&!1&!7*+!40:D5+!#(!)*+5+!,5+! !! "#$%&'(&$)"%'*%+,("#$+-7*5++!@,%09!.*,%+%!14!%2.$/%01+$2%'(%"#$%$34%21"101+&7*+%0%(!C&!7*+!405%7!.*,%+=!07!0%!&$% +9+%%,52!71!95+,7+!,!3,7,@,%+!71!%715+! %,-./+%! 14! *D-,&! /,&:D,:+(! )*+&=! 7*+! .,5709D/,5! %+:-+&7%! 14! 915.D%=! %D9*! ,%! .*1&+-+%=! )*+!91&9+.7D,/!%9*+-+!14!7*+!30.*1&+!%2&7*+%06+5!0%!%*1?&!1&!7*+!40:D5+!#(!)*+5+!,5+! 30.*1&+%=! .+5013%=! +79(! ,5+! ,D71-,709,//2! 5+91:&06+3(! )*0%! %7+.! 0%!31&+!@2!7*+!%.+90,/! 7*5++!@,%09!.*,%+%!14!%2&7*+%0%(!C&!7*+!405%7!.*,%+=!07!0%!&+9+%%,52!71!95+,7+!,!3,7,@,%+!71!%715+! 5+91:&060&:! %147?,5+=! 7*+! .,57! 14! 1D5! %2%7+-(!)0-+!-,5H%! 415! 7*+! .,5709D/,5! %+:-+&7%! 14! %,-./+%! 14! *D-,&! /,&:D,:+(! )*+&=! 7*+! .,5709D/,5! %+:-+&7%! 14! 915.D%=! %D9*! ,%! .*1&+-+%=! 915.D%=!7*+!1D7.D7!14!7*+!%147?,5+=!,5+!%715+3!0&!7*+!3,7,@,%+(!! 30.*1&+%=! .+5013%=! +79(! ,5+! ,D71-,709,//2! 5+91:&06+3(! )*0%! %7+.! 0%!31&+!@2!7*+!%.+90,/! 5+91:&060&:! %147?,5+=! 7*+! .,57! 14! 1D5!)*+!%+91&3!.*,%+!0%!7* %2%7+-(!)0-+!-,5H%!+!%.++9*!%2&7*+%0%(!I%+5 415! 7*+! .,5709D/,5! %+:-!?507+%!7+E7=!*+!?,&7+&7%! 14! %!71!%2&7*+%06+!?07*! 915.D%=!7*+!1D7.D7!14!7*+!%147?,5+3+%05+3!45+GD+&92(!;05%7/2=!7*+!7+=!,5+!%715+3!0&!7*+!3,7,@,%+(!! E7!0%!,&,/2%+3(!)*+&!7*+!3,7,@,%+!0%!%+,59*+3!,&3!,991530&:!71! 7*+! 5D/+%! 14!7*+!/,&:D,:+=!,..51.50,7+! 30.*1&+%! ,5+! -,5H+3(! ;0&,//2=! 7*+2! ,5+! %2&7*+%06+3! )*+!%+91&3!.*,%+!0%!7*+!%.++9*!%2&7*+%0%(!I%+5!?507+%!7+E7=!*+!?,&7%!71!%2&7*+%06+!?07*! 3+%05+3!45+GD+&92(!;05%7/2=!7*+!7+,99153E7!0%!,&,/20&:! 71!%+3(!) 7*+!* )J!+&!7*+!3,7,@, KLM>N!%+!0%!%+,59*+3!,&3!,991530&:!71! ,/:1507*-(!L2&7*+%06+3!%.++9*!0%!%715+3!0&!7*+!3,7,@,%+!415! 7*+! 5D/+%! 14!7*+!/,&:D,:+=!,..54D57*+5!.519+%%0&:!"7*+!7*051.50,7+! 30.*1&+%! ,5+! -,5H+3(!3!.*,%+$=!15!07!9,&!@+!./,2 ;0&,//2=! 7*+2! ,5+! %2&7*+%06+3!+3!18+5!,&!1D7.D7!3+809+(!! ,991530&:! 71! 7*+! )J! KLM>N! ,/:1507*-! (!L2&7*+%06+3!%.++9*!0%!%715+3!0&!7*+!3,7,@,%+!415! 4D57*+5!.519+%%0&:!"7*+!7*053!.*,%+$=!15!07!9,&!@+!./,2+3!18+5!,&!1D7.D7!3+809+(!! ! I%+5! -- 91--,&3! Q1--,&3! ,&,/2%0%=! -- L+,59*0&:!415! I%+5! ---% -- %+:-+&7%! 91--,&3! Q1--,&3! ,&,/2%0%=! L+,59*0&:!415! -% ---% L,---./+%! %+:-+&7%! -- 14!%.++9*! OPI! -% L,-./+%! --3,7,@,%+! -- 14!%.++9*! %2%7+-! L.++9*! - OPI! %2&7*+%0%! 3,7,@,%+! -- @,%+3!1&!! )J!KLM>N! %2%7+-! L.++9*! - %2&7*+%0%! @,%+3!1&!! -- )J!KLM>--N-%! ! -- ---% N91D%709,/! ND71-,709!5+91:&0701&!14! 1D7.D7! ! .*1&+-+%=!30.*1&+%!,&3!.+5013% ! N91D%709,/! ND71-,709!5+91:&0701&!14! 1D7.D7! .*1&+-+%=!30.*1&+%!,&3!.+5013% ! *567%89! *+#,-./0123$,4-5-3+6! 37".4+#-3$8#/4-$"9-:3 ! *567%89! *+#,-./0123$,4-5-3+637".4+#-3$8#/4-$"9-:3

!!

!! Pitch/duraon modificaon algorithms • PSOLA (Pitch synchronous overlap-add synthesis): Through a good pitch/F0 detecon it defines the boundaries of the short term signals to be amplitude peaks containing 2-3 pitch periods. Then it uses OLA to join them. • TD-PSOLA (Time-domain PSOLA): Patented by France Telecom, is an efficient me-domain version of PSOLA (no FFT required, suitable for real-me TTS). Can modify pitch up to 2X or 0.5X • MBR-PSOLA (Mul-band re-synthesis ): tries to solve problems of TD-PSOLA in boundaries between diphones due to phase, pitch or spectral envelope mismatches y normalizing all the database a priory. It later turned into the open project called MBROLA, which focuses on minimizing the database storage. • LP-PSOLA (Linear predictors PSOLA): modificaon of PSOLA where the LP coefficients are stored instead of the signal itself. ! ! ! ! ! !"#$%&'! !"#$%&&'(#)*+,!-)#.*/01%/#12'3-!45#6787("#$*+,!-)#.*/0#12'3-!4#931&#:7;<=7#:7><=7# :7?<#=#

(! )*+),-./*+.%0+1%23423)5/63.%

"#$%&%'(! )*%! +,%%-*(! +./)*%+01%2! 3.! 45! 6789:! ;<=#'0)*>! 0+! ?@0)%! @/2%'+);/2;3<%(! *@>;/! %;'! ,%'-%0&%+! 0)! ;+! +./)*%)0-A! B@')*%'>#'%(! ;!='%;)! %<#/=;)0#/! #C! +./)*%+01%2! +,%%-*! +0=/;0))0/=! >#'%!,%'0#2+!0/!20,*#/%(!-;@+%+! )*%! <#++!#C!+,%%-*!0/C#'>;)0#/!-#/)%/)A!! 4*%!45!6789:!;<=#'0)*>!$;+!0>,<%>%/)%2!0/!4D9!+-'0,)0/=!<;/=@;=%A!:+!;!'%+@<)!#C! 3%<#/=0/=! )#! )*%! ='#@,! #C! 0/)%','%)%2! <;/=@;=%+(! 4D9! 0+! '%<;)0&%<.! +<#$A! E&%/! )*#@=*! )*%! ,'#=';>>%!$;+!#,)0>01%2(!)*%!+./)*%+0+!#C!;!+%/)%/-%!<;+)+!+%&%';,0<%2!<;/=@;=%!%A=A!DGG(!-#>30/%2!$0)*!*;'2$;'%!,%'C#'>;/-%!='#$)*!0/!)*%! C@)@'%(!-#@<2!+*#')%/!)*0+!)0>%!)#!*@/2'%2+!#C!!>0<0+%-#/2+A!4*%!@/2%/0;3<%!;2&;/);=%!#C!4D9! <;/=@;=%!'%>;0/+!)*%!,#++030<0).!#C!,#')0/=!4D9!+-'0,)+!;>#/=!#,%';)0/=!+.+)%>+A!! 4*%!C@/2;>%/);!#C!)*%!EHI!4D9!-#/+#<%!0+!;!,<%/).!#C!%''#'+!0/!+#@'-%!-#2%A! H;/.!C@/-)0#/+!2#%+!/#)!$#'J!,'#,%'<.(!;/2!+#>%!#C!)*%>(!0/!+,%-0;%!#C!)*%>!*;2!)#!3%!/%$<.!0>,<%>%/)%2A!! 5%+,0)%! )*%! C;-)! )*;)! )*%! +,%%-*! -'%;)%2! 3.! -#/-;)%/;)0#/! #C!!20,*#/%+!0+!+@,%'0#'!)#! -#/-;)%/;)0#/!#C!,*#/%>%+(!)*0+!;,,'#;-*!2#%+!/#)!%/;3<%!+./)*%+0+!#C!*0=*!?@;<0).!*@>;/! +,%%-*A!F)!0+!-;@+%2!3.!)*%!C;-)(!)*;)!/%0)*%'!;!*@=%!20,*#/%!2;);3;+%!0+!;3<%!)#!-#&%'!;!='%;)! &;'0%).!#C!!*@>;/!+,%%-*A!4*%!@+%!#C!H@<)0K;/2!EL-0);)0#/!M%+./)*%+0+!#/!2;);3;+%(!>0=*)! +<0=*)<.!0>,'#&%!)*%!?@;<0).!#C!)*%!+,%%-*(!3@)!!;')0-@<;)%!+./)*%+01%'+!'%>;0/!)*%!&0+0#/!C#'! )*%!C@)@'%A!

43!343+)3.% NOP!5@)#0)(!4AQ!:/!F/)'#2@-)0#/!)#!4%L)R)#R+,%%-*!7./)*%+0+(!*)),QSS)-)+AC,>+A;-A3%S+./)*%+0+! NTP! D;++02.(! 7A(! ";''0/=)#/! UAQ! H@<)0R<%&%%/)!7.+)%>(!7,%%-*!D#>>@/0-;)0#/!VV(!TWWT! NVP! 5@)#0)(!4A(!9%0-*(!"AQ!HKM!6789:!447!7./)*%+0+!K;+%2!#/!;/!HKE!M%R7./)*%+0+!#C!)*%! 7%=>%/)+!5;);3;+%(!*)),QSS)-)+AC,>+A;-A3%! NXP!7.'2;<(!:A!;/2!D##/0-!Z#0+%!H#2%!

!! Problems with diphone synthesis • Signal processing leave artifacts, making the speech sound unnatural • Diphone synthesis only captures local effects – But there are many more global effects (syllable structure, stress pattern, word-level effects) Unit Selection Synthesis

• Generalization of the diphone intuition – Larger units • From diphones to sentences – Many many copies of each unit • 10 hours of speech instead of 1500 diphones (a few minutes of speech) – Little or no signal processing applied to each unit • Unlike diphones Why Unit Selection Synthesis

• Natural data solves problems with diphones – Diphone databases are carefully designed but: • Speaker makes errors • Speaker doesn’t speak intended dialect • Require database design to be right – If it’s automatic • Labeled with what the speaker actually said • Coarticulation, schwas, flaps are natural • “There’s no data like more data” – Lots of copies of each unit mean you can choose just the right one for the context – Larger units mean you can capture wider effects Unit Selection Intuition

• Given a big database • For each segment (diphone) that we want to synthesize – Find the unit in the database that is the best to synthesize this target segment • What does “best” mean? – “Target cost”: Closest match to the target description, in terms of • Phonetic context • F0, stress, phrase position – “Join cost”: Best join with neighboring units • Matching formants + other spectral characteristics • Matching energy • Matching F0

n n n n target join C(t1 ,u1 ) = "C (ti,ui ) + "C (ui#1,ui ) i=1 i= 2

! Search of best units using Viterbi algorithm

n n n uˆ1 = argminC(t1 ,u1 ) u1 ,...,un

! HMM synthesis • It is a totally different approach to concatenave synthesis – It is much closer to ASR • Hidden Markov Models (HMM) are trained from labeled data to learn how each phone is pronounced in each condion – It also learns its prosody • Then, given a desired phoneme sequence and prosody paern, it outputs the most probable audio sequence.

Samples ~2007 Sample 2010 NITech