Disambiguating Heteronyms in Speech Synthesis

(11) EP 3 032 532 A1
(12) EUROPEAN PATENT APPLICATION
(43) Date of publication: 15.06.2016 Bulletin 2016/24
(51) Int Cl.: G10L 13/08 (2013.01), G10L 15/22 (2006.01)
(21) Application number: 15196748.6
(22) Date of filing: 27.11.2015
(84) Designated Contracting States: AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States: BA ME
Designated Validation States: MA MD
(30) Priority: 09.12.2014 US 201462089464 P; 12.12.2014 US 201414569517
(71) Applicant: APPLE INC., Cupertino, CA 95014 (US)
(72) Inventors:
• HENTON, Caroline, Cupertino, CA California 95014 (US)
• NAIK, Devang, Cupertino, CA California 95014 (US)
(74) Representative: Barnfather, Karl Jon, Withers & Rogers LLP, 4 More London Riverside, London SE1 2AU (GB)
(54) DISAMBIGUATING HETERONYMS IN SPEECH SYNTHESIS

(57) Systems and processes for disambiguating heteronyms in speech synthesis are provided. In one example process, a speech input containing a heteronym can be received from a user. The speech input can be processed using an automatic speech recognition system to determine a phonemic string corresponding to the heteronym as pronounced by the user in the speech input. A correct pronunciation of the heteronym can be determined based on at least one of the phonemic string or using an n-gram language model of the automatic speech recognition system. A dialogue response to the speech input can be generated where the dialogue response can include the heteronym. The dialogue response can be outputted as a speech output. The heteronym in the dialogue response can be pronounced in the speech output according to the correct pronunciation.

Printed by Jouve, 75001 PARIS (FR)

Description

Cross-Reference to Related Application

[0001] This application claims priority from U.S. Provisional Serial No. 62/089,464, filed on December 9, 2014, entitled "Disambiguating Heteronyms in Speech Synthesis," and U.S. Non-Provisional Serial No. 14/569,517, filed on December 12, 2014, entitled "Disambiguating Heteronyms in Speech Synthesis," which are hereby incorporated by reference in their entirety for all purposes.

Field

[0002] This relates generally to digital assistants and, more specifically, to disambiguating heteronyms in speech synthesis.

Background

[0003] Intelligent automated assistants (or digital assistants) can provide a beneficial interface between human users and electronic devices. Such assistants can allow users to interact with devices or systems using natural language in spoken and/or text forms. For example, a user can provide a speech input to a digital assistant associated with the electronic device. The digital assistant can interpret the user’s intent from the speech input and operationalize the user’s intent into tasks. The tasks can then be performed by executing one or more services of the electronic device and a relevant speech output can be returned to the user in natural language form.

[0004] Occasionally, speech outputs generated by digital assistants can contain heteronyms. A heteronym can be each of two or more words that are spelled identically but have different pronunciations and meanings. For example, a user can provide a speech input to a digital assistant requesting the weather in Nice, France. The digital assistant can return a relevant speech output such as, "Here is the weather in Nice, France." In this example, the speech output contains the heteronym "nice," which can have one pronunciation as a proper noun and a different pronunciation as an adjective. Conventionally, digital assistants can have difficulty disambiguating heteronyms and thus speech outputs containing heteronyms can often be pronounced incorrectly. This can result in a poor user experience in interacting with the digital assistant.

Summary

[0005] Systems and processes for disambiguating heteronyms in speech synthesis are provided. In an example process, a speech input can be received from a user. The speech input can contain a heteronym and one or more additional words. The speech input can be processed using an automatic speech recognition system to determine at least one of a phonemic string corresponding to the heteronym as pronounced by the user in the speech input and a frequency of occurrence of an n-gram with respect to a corpus. The n-gram can include the heteronym and the one or more additional words and the heteronym in the n-gram can be associated with a first pronunciation. A correct pronunciation of the heteronym can be based on at least one of the phonemic string and the frequency of occurrence of the n-gram. A dialogue response to the speech input can be generated where the dialogue response can include the heteronym. The dialogue response can be output as a speech output. The heteronym in the dialogue response can be pronounced in the speech output according to the determined correct pronunciation.

Brief Description of the Drawings

[0006]

FIG. 1 illustrates a system and environment for implementing a digital assistant according to various examples.

FIG. 2 illustrates a user device implementing the client-side portion of a digital assistant according to various examples.

FIG. 3A illustrates a digital assistant system or a server portion thereof according to various examples.

FIG. 3B illustrates the functions of the digital assistant shown in FIG. 3A according to various examples.

FIG. 3C illustrates a portion of an ontology according to various examples.

FIG. 4 illustrates a process for operating a digital assistant according to various examples.

FIG. 5 illustrates a functional block diagram of an electronic device according to various examples.

Detailed Description

[0007] In the following description of examples, reference is made to the accompanying drawings, in which specific examples that can be practiced are shown by way of illustration. It is to be understood that other examples can be used and structural changes can be made without departing from the scope of the various examples.

[0008] Systems and processes for disambiguating heteronyms in speech synthesis are provided. In one example process, a speech input containing a heteronym can be received from a user. Contextual data associated with the speech input can be received. The speech input can be processed using an automatic speech recognition system to determine a text string corresponding to the speech input. Based on the text string, an actionable intent can be determined using a natural language processor. A dialogue response to the speech input can be generated where the dialogue response can include the heteronym. A correct pronunciation of the heteronym can be determined using an n-gram language model of the automatic speech recognition system or based on at least one of the speech input, the actionable intent, and the contextual data. The dialogue response can be output as a speech output and the heteronym can be pronounced in the speech output according to the correct pronunciation.

[0009] By utilizing at least one of the speech input, the n-gram language model, the actionable intent, and the contextual data as a knowledge source for disambiguating heteronyms, the pronunciation of heteronyms in the speech output can be synthesized more accurately, thereby improving user experience. Further, leveraging the automatic speech recognition system and the natural language processor to disambiguate heteronyms can obviate the need to implement additional resources in the speech synthesizer for the same purpose. For example, additional language models need not be implemented in the speech synthesizer to disambiguate heteronyms. This enables digital assistants to operate with greater efficiency and fewer resources.

1. System and Environment

[0010] FIG. 1 illustrates a block diagram of a system 100 according to various examples.

[...] west gate." The user can also request the performance of a task, for example, "Please invite my friends to my girlfriend’s birthday party next week." In response, the digital assistant can acknowledge the request by saying "Yes, right away," and then send a suitable calendar invite on behalf of the user to each of the user’s friends listed in the user’s electronic address book. During performance of a requested task, the digital assistant can sometimes interact with the user in a continuous dialogue involving multiple exchanges of information over an extended period of time. There are numerous other ways of interacting with a digital assistant to request information or performance of various tasks. In addition to providing verbal responses and taking programmed actions, the digital assistant can also provide responses in other visual or audio forms, e.g., as text, alerts, music, videos, animations, etc.

[0012] As shown in FIG. 1, in some examples, a digital assistant can be implemented according to a client-server model. The digital assistant can include a client-side portion 102a, 102b (hereafter "DA client 102") executed on a user device 104a, 104b, and a server-side portion 106 (hereafter "DA server 106") executed on a server system 108. The DA client 102 can communicate with the DA server 106 through one or more networks 110. The DA client 102 can provide client-side functionalities such as user-facing input and output processing and communication with the DA server 106. The DA server 106 can provide server-side functionalities for any number of DA clients 102 each residing on a respective [...]
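
The selection step described in the Summary (choosing a pronunciation from the user's phonemic string and/or the corpus frequency of an n-gram containing the heteronym) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and is not taken from the application: the `CANDIDATES` lexicon, the `NGRAM_COUNTS` table and its counts, the ARPAbet-style phoneme strings, and the `choose_pronunciation` function. The order of preference (user's pronunciation first, n-gram frequency second) is also an assumption; the text only says the determination can be based on at least one of the two.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lexicon: heteronym -> {sense label: phonemic string}.
CANDIDATES: Dict[str, Dict[str, str]] = {
    "nice": {"proper_noun": "N IY S", "adjective": "N AY S"},
}

# Hypothetical corpus counts for n-grams in which the heteronym is tagged
# with one sense (and therefore one pronunciation). Counts are invented.
NGRAM_COUNTS: Dict[Tuple[str, str], int] = {
    ("weather in nice", "proper_noun"): 120,
    ("weather in nice", "adjective"): 3,
    ("nice weather", "adjective"): 900,
}


def choose_pronunciation(
    heteronym: str, ngram: str, user_phonemes: Optional[str] = None
) -> str:
    """Return a phonemic string for `heteronym` as used in `ngram`."""
    candidates = CANDIDATES[heteronym]

    # Assumed policy: if the recognizer's phonemic string for the user's own
    # pronunciation matches a known candidate, reuse it.
    if user_phonemes is not None and user_phonemes in candidates.values():
        return user_phonemes

    # Otherwise pick the sense whose tagged n-gram is most frequent.
    best_sense = max(
        candidates, key=lambda sense: NGRAM_COUNTS.get((ngram, sense), 0)
    )
    return candidates[best_sense]


print(choose_pronunciation("nice", "weather in nice"))
print(choose_pronunciation("nice", "nice weather"))
```

In a real system the counts would come from the language model of the speech recognizer rather than a hand-written table, and matching the user's phonemic string would likely tolerate recognition noise instead of requiring exact equality.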
