(19) European Patent Office

(11) EP 3 032 532 A1

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication: 15.06.2016 Bulletin 2016/24

(51) Int Cl.: G10L 13/08 (2013.01)  G10L 15/22 (2006.01)

(21) Application number: 15196748.6

(22) Date of filing: 27.11.2015

(84) Designated Contracting States: AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States: BA ME
Designated Validation States: MA MD

(71) Applicant: APPLE INC.
Cupertino, CA 95014 (US)

(72) Inventors:
• HENTON, Caroline, Cupertino, CA California 95014 (US)
• NAIK, Devang, Cupertino, CA California 95014 (US)

(30) Priority: 09.12.2014 US 201462089464 P
12.12.2014 US 201414569517

(74) Representative: Barnfather, Karl Jon
Withers & Rogers LLP
4 More London Riverside
London SE1 2AU (GB)

(54) DISAMBIGUATING HETERONYMS IN SPEECH SYNTHESIS

(57) Systems and processes for disambiguating heteronyms in speech synthesis are provided. In one example process, a speech input containing a heteronym can be received from a user. The speech input can be processed using an automatic speech recognition system to determine a phonemic string corresponding to the heteronym as pronounced by the user in the speech input. A correct pronunciation of the heteronym can be determined based on at least one of the phonemic string or using an n-gram language model of the automatic speech recognition system. A dialogue response to the speech input can be generated where the dialogue response can include the heteronym. The dialogue response can be outputted as a speech output. The heteronym in the dialogue response can be pronounced in the speech output according to the correct pronunciation.
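For illustration only, the flow summarized in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the lexicon contents, the phonemic strings, and the helper names (pronunciation_from_input, annotate_response) are hypothetical and are not part of the application.

```python
# Hypothetical heteronym lexicon: phonemic strings and sense labels are
# illustrative placeholders, not values taken from the application.
HETERONYM_LEXICON = {
    "nice": {"n IY s": "proper noun (the city Nice)", "n AY s": "adjective"},
}

def pronunciation_from_input(word, phonemes_as_spoken):
    """Return the user's pronunciation if it matches a known candidate for
    this heteronym, otherwise None."""
    candidates = HETERONYM_LEXICON.get(word.lower(), {})
    return phonemes_as_spoken if phonemes_as_spoken in candidates else None

def annotate_response(response_words, heteronym, phonemes):
    """Attach the chosen phonemic string to the heteronym in the dialogue
    response so the synthesizer pronounces it as the user did."""
    return [
        {"word": w, "phonemes": phonemes} if w.lower() == heteronym else {"word": w}
        for w in response_words
    ]

# The user asked about the weather in Nice and pronounced it "n IY s".
chosen = pronunciation_from_input("Nice", "n IY s")
response = annotate_response(
    ["Here", "is", "the", "weather", "in", "Nice", ",", "France"], "nice", chosen)
print(response)  # the entry for "Nice" carries the phonemic string for synthesis
```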


Description

Cross-Reference to Related Application

[0001] This application claims priority from U.S. Provisional Serial No. 62/089,464, filed on December 9, 2014, entitled "Disambiguating Heteronyms in Speech Synthesis," and U.S. Non-Provisional Serial No. 14/569,517, filed on December 12, 2014, entitled "Disambiguating Heteronyms in Speech Synthesis," which are hereby incorporated by reference in their entirety for all purposes.

Field

[0002] This relates generally to digital assistants and, more specifically, to disambiguating heteronyms in speech synthesis.

Background

[0003] Intelligent automated assistants (or digital assistants) can provide a beneficial interface between human users and electronic devices. Such assistants can allow users to interact with devices or systems using natural language in spoken and/or text forms. For example, a user can provide a speech input to a digital assistant associated with the electronic device. The digital assistant can interpret the user's intent from the speech input and operationalize the user's intent into tasks. The tasks can then be performed by executing one or more services of the electronic device and a relevant speech output can be returned to the user in natural language form.

[0004] Occasionally, speech outputs generated by digital assistants can contain heteronyms. A heteronym can be each of two or more words that are spelled identically but have different pronunciations and meanings. For example, a user can provide a speech input to a digital assistant requesting the weather in Nice, France. The digital assistant can return a relevant speech output such as, "Here is the weather in Nice, France." In this example, the speech output contains the heteronym "nice," which can have one pronunciation as a proper noun and a different pronunciation as an adjective. Conventionally, digital assistants can have difficulty disambiguating heteronyms and thus speech outputs containing heteronyms can often be pronounced incorrectly. This can result in a poor user experience in interacting with the digital assistant.

Summary

[0005] Systems and processes for disambiguating heteronyms in speech synthesis are provided. In an example process, a speech input can be received from a user. The speech input can contain a heteronym and one or more additional words. The speech input can be processed using an automatic speech recognition system to determine at least one of a phonemic string corresponding to the heteronym as pronounced by the user in the speech input and a frequency of occurrence of an n-gram with respect to a corpus. The n-gram can include the heteronym and the one or more additional words, and the heteronym in the n-gram can be associated with a first pronunciation. A correct pronunciation of the heteronym can be determined based on at least one of the phonemic string and the frequency of occurrence of the n-gram. A dialogue response to the speech input can be generated where the dialogue response can include the heteronym. The dialogue response can be output as a speech output. The heteronym in the dialogue response can be pronounced in the speech output according to the determined correct pronunciation.

Brief Description of the Drawings

[0006]

FIG. 1 illustrates a system and environment for implementing a digital assistant according to various examples.
FIG. 2 illustrates a user device implementing the client-side portion of a digital assistant according to various examples.
FIG. 3A illustrates a digital assistant system or a server portion thereof according to various examples.
FIG. 3B illustrates the functions of the digital assistant shown in FIG. 3A according to various examples.
FIG. 3C illustrates a portion of an ontology according to various examples.
FIG. 4 illustrates a process for operating a digital assistant according to various examples.
FIG. 5 illustrates a functional block diagram of an electronic device according to various examples.

Detailed Description

[0007] In the following description of examples, reference is made to the accompanying drawings in which it is shown by way of illustration specific examples that can be practiced. It is to be understood that other examples can be used and structural changes can be made without departing from the scope of the various examples.

[0008] Systems and processes for disambiguating heteronyms in speech synthesis are provided. In one example process, a speech input containing a heteronym can be received from a user. Contextual data associated with the speech input can be received. The speech input can be processed using an automatic speech recognition system to determine a text string corresponding to the

2 3 EP 3 032 532 A1 4 speech input. Based on the text string, an actionable in- west gate." The user can also request the performance tent can be determined using a natural language proc- of a task, for example, "Please invite my friends to my essor. A dialogue response to the speech input can be girlfriend’s birthday party week." In response, the generated where the dialogue response can include the digital assistant can acknowledge the request by saying heteronym. A correct pronunciation of the heteronym can 5 "Yes, right away," and then send a suitable calendar in- be determined using an n-gram language model of the vite on behalf of the user to each of the user’s friends automatic speech recognition system or based on at least listed in the user’s electronic address book. During per- one of the speech input, the actionable intent, and the formance of a requested task, the digital assistant can contextual data. The dialogue response can be output sometimes interact with the user in a continuous dialogue as a speech output and the heteronym can be pro-10 involving multiple exchanges of information over an ex- nounced in the speech output according to the correct tended period of time. There are numerous other ways pronunciation. of interacting with a digital assistant to request informa- [0009] By utilizing at least one of the speech input, the tion or performance of various tasks. In addition to pro- n-gram language model, the actionable intent, and the viding verbal responses and taking programmed actions, contextual data as a knowledge source for disambigua- 15 the digital assistant can also provide responses in other ting heteronyms, the pronunciation of heteronyms in the visual or audio forms, e.g., as text, alerts, music, videos, speech output can be synthesized more accurately, animations, etc. thereby improving user experience. Further, leveraging [0012] As shown in FIG. 1, in some examples, a digital the automatic speech recognition system and the natural assistant can be implemented according to a client-serv- language processorto disambiguate heteronyms can ob- 20 er model. The digital assistant can include a client-side viate the need to implement additional resources in the portion 102a, 102b (hereafter "DA client 102") executed speech for the same purpose. For example, on a user device 104a, 104b, and a server-side portion additional language models need not be implemented in 106 (hereafter "DA server 106") executed on a server the speech synthesizer to disambiguate heteronyms. system 108. The DA client 102 can communicate with This enables digital assistants to operate with greater 25 the DA server 106 through one or more networks 110. efficiency and fewer resources. The DA client 102 can provide client-side functionalities such as user-facing input and output processing and 1. System and Environment communication with the DA-server 106. The DA server 106 can provide server-side functionalities for any [0010] FIG. 1 illustrates a block diagram of a system 30 number of DA-clients 102 each residing on a respective 100 according to various examples. In some examples, user device 104. the system 100 can implement a digital assistant. 
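Paragraphs [0008] and [0009] above mention the n-gram language model of the automatic speech recognition system as one knowledge source for disambiguating a heteronym. The sketch below shows one illustrative way sense-tagged n-gram counts could drive such a decision; the heteronym "lead", the sense tags, the counts, and the phonemic strings are assumptions made for the example, not data from the application.

```python
from collections import Counter

# Illustrative sense-tagged n-gram counts; a real implementation would query
# the ASR system's n-gram language model rather than this toy table.
NGRAM_COUNTS = Counter({
    ("lead", "#metal", "pipe"): 42,   # "lead pipe", /l EH d/ sense
    ("lead", "#verb", "pipe"): 1,     # "lead pipe", /l IY d/ sense
    ("lead", "#verb", "the"): 120,    # "lead the ...", /l IY d/ sense
})

PRONUNCIATIONS = {"#metal": "l EH d", "#verb": "l IY d"}

def disambiguate(heteronym, next_word):
    """Choose the sense whose (heteronym, sense, next word) n-gram occurs most
    often in the corpus and return the associated phonemic string."""
    best_sense = max(PRONUNCIATIONS,
                     key=lambda s: NGRAM_COUNTS[(heteronym, s, next_word)])
    return PRONUNCIATIONS[best_sense]

print(disambiguate("lead", "pipe"))  # the "metal" sense wins -> "l EH d"
print(disambiguate("lead", "the"))   # the "verb" sense wins -> "l IY d"
```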
The [0013] In some examples, the DA server 106 can in- terms "digital assistant," "," "intelligent clude a client-facing I/O interface 112, one or more automated assistant," or "automatic digital assistant," processing modules 114, data and models 116, and an can refer to any information processing system that in- 35 I/O interface to external services 118. The client-facing terprets natural language input in spoken and/or textual I/O interface can facilitate the client-facing input and out- form to infer user intent, and performs actions based on put processing for the digital assistant server 106. The the inferred user intent. For example, to act on an inferred one or more processing modules 114 can utilize the data user intent, the system can perform one or more of the and models 116 to process speech input and determine following: identifying a task flow with steps and parame- 40 the user’s intent based on natural language input. Fur- ters designed to accomplish the inferred user intent, in- ther, the one or more processing modules 114 perform putting specific requirements from the inferred user intent task execution based on inferred user intent. In some into the task flow; executing the task flow by invoking examples, the DA-server 106 can communicate with ex- programs, methods, services, APIs, or the like; and gen- ternal services 120 through the network(s) 110 for task erating output responses to the user in an audible (e.g., 45 completion or information acquisition. The I/O interface speech) and/or visual form. to external services 118 can facilitate such communica- [0011] Specifically, a digital assistant can be capable tions. of accepting a user request at least partially in the form [0014] Examples of the user device 104 can include, of a natural language command, request, statement, nar- but are not limited to, a handheld computer, a personal rative, and/or inquiry. Typically, the user request can50 digital assistant (PDA), a tablet computer, a laptop com- seek either an informational answer or performance of a puter, a desktop computer, a cellular telephone, a smart task by the digital assistant. A satisfactory response to , an enhanced general packet radio service (EG- the user request can be a provision of the requested in- PRS) mobile phone, a media player, a navigation device, formational answer, a performance of the requested task, a game console, a television, a television set-top box, a or a combination of the two. For example, a user can ask 55 remote control, a wearable electronic device, or a com- the digital assistant a question, such as "Where am I right bination of any two or more of these data processing now?" Based on the user’s current location, the digital devices or other data processing devices. More details assistant can answer, "You are in Central Park near the on the user device 104 are provided in reference to an

3 5 EP 3 032 532 A1 6 exemplary user device 104 shown in FIG. 2. era functions, such as taking photographs and recording [0015] Examples of the communication network(s) 110 video clips. Communication functions can be facilitated can include local area networks (LAN) and wide area through one or more wired and/or wireless communica- networks (WAN), e.g., the Internet. The communication tion subsystems 224, which can include various commu- network(s) 110 can be implemented using any known 5 nication ports, radio frequency receivers and transmit- network protocol, including various wired or wireless pro- ters, and/or optical (e.g., infrared) receivers and trans- tocols, such as, for example, Ethernet, Universal Serial mitters. An audio subsystem 226 can be coupled to Bus (USB), FIREWIRE, Global System for Mobile Com- speakers 228 and a microphone 230 to facilitate - munications (GSM), Enhanced Data GSM Environment enabled functions, such as voice recognition, voice rep- (EDGE), code division multiple access (CDMA), time di- 10 lication, digital recording, and telephony functions. The vision multiple access (TDMA), Bluetooth, Wi-Fi, voice microphone 230 can be configured to receive a speech over Internet Protocol (VoIP), Wi-MAX, or any other suit- input from the user. able communication protocol. [0021] In some examples, an I/O subsystem 240 can [0016] The server system 108 can be implemented on also be coupled to the peripherals interface 206. The I/O one or more standalone data processing apparatus or a 15 subsystem 240 can include a touch screen controller 242 distributed network of computers. In some examples, the and/or other input controller(s) 244. The touch-screen server system 108 can also employ various virtual de- controller 242 can be coupled to a touch screen 246. The vices and/or services of third-party service providers touch screen 246 and the touch screen controller 242 (e.g., third-party cloud service providers) to provide the can, for example, detect contact and movement or break underlying computing resources and/or infrastructure re- 20 thereof using any of a plurality of touch sensitivity tech- sources of the server system 108. nologies, such as capacitive, resistive, infrared, surface [0017] Although the digital assistant shown in FIG. 1 acoustic wave technologies, proximity sensor arrays, can include both a client-side portion (e.g., the DA-client and the like. The other input controller(s) 244 can be cou- 102) and a server-side portion (e.g., the DA-server 106), pled to other input/control devices 248, such as one or in some examples, the functions of a digital assistant can 25 more buttons, rocker switches, thumb-wheel, infrared be implemented as a standalone application installed on port, USB port, and/or a pointer device such as a stylus. a user device. In addition, the divisions of functionalities [0022] In some examples, the memory interface 202 between the client and server portions of the digital as- can be coupled to memory 250. The memory 250 can sistant can vary in different implementations. For in- include any electronic, magnetic, optical, electromagnet- stance, in some examples, the DA client can be a thin- 30 ic, infrared, or semiconductor system, apparatus, or de- client that provides only user-facing input and output vice, a portable computer diskette (magnetic), a random processing functions, and delegates all other functional- access memory (RAM) (magnetic), a read-only memory ities of the digital assistant to a backend server. 
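The client-server split described above (DA client 102 handling user-facing input and output, DA server 106 providing the remaining functionality, including the thin-client configuration) can be pictured with a schematic sketch. The class and field names below are illustrative assumptions and do not describe an actual DA-client or DA-server interface.

```python
from dataclasses import dataclass, field

@dataclass
class AssistantRequest:
    audio: bytes                                  # the user's speech input
    context: dict = field(default_factory=dict)   # contextual information

@dataclass
class AssistantResponse:
    text: str              # dialogue response as a text string
    speech: bytes = b""    # synthesized speech output

class ThinDAClient:
    """Client-side portion in the thin-client configuration: user-facing I/O
    only, with all other functionality delegated to the backend server."""
    def __init__(self, server):
        self.server = server

    def handle_utterance(self, audio, context):
        return self.server.process(AssistantRequest(audio, context)).text

class ToyDAServer:
    """Stand-in for the server-side portion (ASR, intent inference, task
    execution, and speech synthesis would happen here)."""
    def process(self, request):
        return AssistantResponse(text="Here is the weather in Nice, France.")

print(ThinDAClient(ToyDAServer()).handle_utterance(b"...", {"location": "Cupertino"}))
```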
(ROM) (magnetic), an erasable programmable read-only memory (EPROM) (magnetic), a portable optical disc 2. User Device 35 such as CD, CD-R, CD-RW, DVD, DVD-R, or DVD-RW, or flash memory, such as compact flash cards, secured [0018] FIG. 2 illustrates a block diagram of a user-de- digital cards, USB memory devices, memory sticks, and vice 104 in accordance with various examples. The user the like. In some examples, a non-transitory computer- device 104 can include a memory interface 202, one or readable storage mediumof thememory 250 can be used more processors 204, and a peripherals interface 206. 40 to store instructions (e.g., for performing the process 400, The various components in the user device 104 can be described below) for use by or in connection with an in- coupled by one or more communication busses or signal struction execution system, apparatus, or device, such lines. The user device 104 can include various sensors, as a computer-based system, processor-containing sys- subsystems, and peripheral devices that are coupled to tem, or other system that can fetch the instructions from the peripherals interface 206. The sensors, subsystems, 45 the instruction execution system, apparatus, or device and peripheral devices can gather information and/or fa- and execute the instructions. In other examples, the in- cilitate various functionalities of the user device 104. structions (e.g., for performing the process 400, de- [0019] For example, a motion sensor 210, a light sen- scribed below) can be stored on a non-transitory com- sor 212, and a proximity sensor 214 can be coupled to puter-readable storage medium (not shown) of the server the peripherals interface 206 to fac ilitateorientation, light, 50 system 108, or can be divided between the non-transitory and proximity sensing functions. One or more other sen- computer-readable storage medium of memory 250 and sors 216, such as a positioning system (e.g., GPS re- the non-transitory computer-readable storage medium ceiver), a temperature sensor, a biometric sensor, a gyro, of server system 110. In the context of this document, a a compass, an accelerometer, and the like, can also be "non-transitory computer readable storage medium" can connected to the peripherals interface 206 to facilitate 55 be any medium that can contain or store the program for related functionalities. use by or in connection with the instruction execution [0020] In some examples, a camera subsystem 220 system, apparatus, or device. and an optical sensor 222 can be utilized to facilitate cam- [0023] In some examples, the memory 250 can store

4 7 EP 3 032 532 A1 8 an 252, a communication module 254, orientation, device location, device temperature, power a user interface module 256, a sensor processing module level, speed, acceleration, motion patterns, cellular sig- 258, a phone module 260, and applications 262. The nals strength, etc. In some examples, information related operating system 252 can include instructions for han- to the state of the digital assistant server 106, dling basic system services and for performing hardware 5 e.g., running processes, installed programs, past and dependent tasks. The communication module 254 can present network activities, background services, error facilitate communicating with one or more additional de- logs, resources usage, etc., and of the user device 104 vices, one or more computers and/or one or more serv- can be provided to the digital assistant server as contex- ers. The user interface module 256 can facilitate graphic tual information associated with a user input. user interface processing and output processing using 10 [0028] In some examples, the DA client module 264 other output channels (e.g., speakers). The sensor can selectively provide information (e.g., user data 266) processing module 258 can facilitate sensor-related stored on the user device 104 in response to requests processing and functions. The phone module 260 can from the digital assistant server. In some examples, the facilitate phone-related processes and functions. The ap- digital assistant client module 264 can also elicit addi- plication module 262 can facilitate various functionalities 15 tional input from the user via a natural language dialogue of user applications, such as electronic-messaging, web or other user interfaces upon request by the digital as- browsing, mediaprocessing, Navigation,imaging, and/or sistant server 106. The digital assistant client module 264 other processes and functions. canpass the additionalinput to thedigital assistant server [0024] As described herein, the memory 250 can also 106 to help the digital assistant server 106 in intent de- store client-side digital assistant instructions (e.g., in a 20 duction and/or fulfillment of the user’s intent expressed digital assistant client module 264) and various user data in the user request. 266 (e.g., user-specific vocabulary data, preference da- [0029] In various examples, the memory 250 can in- ta, and/or other data such as the user’s electronic ad- clude additional instructions or fewer instructions. For ex- dress book, to-do lists, shopping lists, user-specified ample, the DA client module 264 can include any of the namepronunciations, etc.) to provide the client-side func- 25 sub-modules of the digital assistant module 326 de- tionalities of the digital assistant. scribed below in FIG. 3A. Furthermore, various functions [0025] In various examples, the digital assistant client of the user device 104 can be implemented in hardware module 264 can be capable of accepting voice input (e.g., and/or in firmware, including in one or more signal speech input), text input, touch input, and/or gestural in- processing and/or application specific integrated circuits. put through various user interfaces (e.g., the I/O subsys- 30 tem 244) of the user device 104. The digital assistant 3. Digital Assistant System client module 264 can also be capable of providing output in audio (e.g., speech output), visual, and/or tactile forms. [0030] FIG. 
3A illustrates a block diagram of an exam- For example, output can be provided as voice, sound, ple digital assistant system 300 in accordance with var- alerts, text messages, menus, graphics, videos, anima- 35 ious examples. In some examples, the digital assistant tions, vibrations, and/or combinations of two or more of system 300 can be implemented on a standalone com- the above. During operation, the digital assistant client puter system. In some examples, the digital assistant module 264 can communicate with the digital assistant system 300 can be distributed across multiple comput- server 106 using the communication subsystems 224. ers. In some examples, some of the modules and func- [0026] In some examples, the digital assistant client 40 tions of the digital assistant can be divided into a server module 264 can utilize the various sensors, subsystems, portion and a client portion, where the client portion re- and peripheral devices to gather additional information sides on a user device (e.g., the user device 104) and from the surrounding environment of the user device 104 communicates with the server portion (e.g., the server to establish a context associated with a user, the current system 108) through one or more networks, e.g., as user interaction, and/or the current user input. In some 45 shown in FIG. 1. In some examples, the digital assistant examples, the digital assistant client module 264 can pro- system 300 can be an implementation of the server sys- vide the contextual information or a subset thereof with tem 108 (and/or the digital assistant server 106) shown the user input to the digital assistant server to help infer in FIG. 1. It should be noted that the digital assistant the user’s intent. In some examples, the digital assistant system 300 is only one example of a digital assistant can also use the contextual information to determine how 50 system, and that the digital assistant system 300 can to prepare and deliver outputs to the user. Contextual have more or fewer components than shown, may com- information can be referred to as context data. bine two or more components, or may have a different [0027] In some examples, the contextual information configuration or arrangement of the components. The that accompanies the user input can include sensor in- various components shown in FIG. 3A can be implement- formation, e.g., lighting, ambient , ambient temper- 55 ed in hardware, software instructions for execution by ature, images or videos of the surrounding environment, one or more processors, firmware, including one or more etc. In some examples, the contextual information can and/or application specific integrated also include the physical state of the device, e.g., device circuits, or a combination thereof.
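The preceding paragraphs describe the digital assistant client gathering contextual information (sensor readings, device state, application data, user data) and sending it along with the user input to help infer intent. The following sketch shows one way such a payload could be assembled; the keys and the device_state mapping are illustrative assumptions, not a defined interface.

```python
import time

def gather_context(device_state):
    """Assemble contextual information of the kind described above to
    accompany a user input."""
    return {
        "timestamp": time.time(),
        "location": device_state.get("location"),            # e.g., GPS fix
        "ambient_light": device_state.get("ambient_light"),  # sensor information
        "orientation": device_state.get("orientation"),      # physical device state
        "running_apps": device_state.get("running_apps"),    # software state
        "user_profile": {"locale": device_state.get("locale")},
    }

payload = gather_context({
    "location": (37.33, -122.03),
    "ambient_light": 180,
    "orientation": "portrait",
    "running_apps": ["Weather", "Calendar"],
    "locale": "en_US",
})
print(payload)
```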


[0031] The digital assistant system 300 can include low. The one or more processors 304 can execute these memory 302, one or more processors 304, an input/out- programs, modules, and instructions, and reads/writes put (I/O) interface 306, and a network communications from/to the data structures. interface 308. These components can communicate with [0036] The operating system 318 (e.g., Darwin, RTXC, one another over one or more communication buses or 5 LINUX, UNIX, OS X, WINDOWS, or an embedded op- signal lines 310. erating system such as VxWorks) can include various [0032] In some examples, the memory 302 can include software components and/or drivers for controlling and a non-transitory computer readable medium, such as managing general system tasks (e.g., memory manage- high-speed random access memory and/or a non-volatile ment, storage device control, power management, etc.) computer-readable storage medium (e.g., one or more 10 and facilitates communications between various hard- magnetic disk storage devices, flash memory devices, ware, firmware, and software components. or other non-volatile solid-state memory devices). [0037] The communications module 320 can facilitate [0033] In some examples, the I/O interface 306 can communications between the digital assistant system couple input/output devices 316 of the digital assistant 300 with other devices over the network communications system 300, such as displays, keyboards, touch screens, 15 interface 308. For example, the communications module and microphones, to the user interface module 322. The 320 can communicate with the communication module I/O interface 306, in conjunction with the user interface 254 of the device 104 shown in FIG. 2. The communica- module 322, can receive user inputs (e.g., voice input, tions module 320 can also include various components keyboard inputs, touch inputs, etc.) and processes them for handling data received by the wireless circuitry 314 accordingly. In some examples, e.g., when the digital as- 20 and/or wired communications port 312. sistant is implemented on a standalone user device, the [0038] The user interface module 322 can receive digital assistant system 300 can include any of the com- commands and/or inputs from a user via the I/O interface ponents and I/O and communication interfaces de- 306 (e.g., from a keyboard, touch screen, pointing device, scribed with respect to the user device 104 in FIG. 2. In controller, and/or microphone), and generate user inter- some examples, the digital assistant system 300 can rep- 25 face objects on a display. The user interface module 322 resent the server portion of a digital assistant implemen- can also prepare and deliver outputs (e.g., speech, tation, and can interact with the user through a client-side sound, animation, text, icons, vibrations, haptic feed- portion residing on a user device (e.g., the user device back, light, etc.) to the user via the I/O interface 306 (e.g., 104 shown in FIG. 2). through displays, audio channels, speakers, touch-pads, [0034] In some examples, the network communica- 30 etc.). tions interface 308 can include wired communication [0039] The applications 324 can include programs port(s) 312 and/or wireless transmission and reception and/or modules that are configured to be executed by circuitry 314. The wired communication port(s) can re- the one or more processors 304. 
For example, if the dig- ceive and send communication signals via one or more ital assistant system is implemented on a standalone us- wired interfaces, e.g., Ethernet, Universal Serial Bus35 er device, the applications 324 can include user applica- (USB), FIREWIRE, etc. The wireless circuitry 314 can tions, such as games, a calendar application, a naviga- receive and send RF signals and/or optical signals tion application, or an email application. If the digital as- from/to communications networks and other communi- sistant system 300 is implemented on a server farm, the cations devices. The wireless communications can use applications 324 can include resource management ap- any of a plurality of communications standards, proto- 40 plications, diagnostic applications, or scheduling appli- cols, and technologies, such as GSM, EDGE, CDMA, cations, for example. TDMA,Bluetooth, Wi-Fi, VoIP, Wi-MAX, or any other suit- [0040] The memory 302 can also store the digital as- able communication protocol. The network communica- sistant module (or the server portion of a digital assistant) tions interface 308 can enable communication between 326. In some examples, the digital assistant module 326 the digital assistant system 300 with networks, such as 45 can include the following sub-modules, or a subset or the Internet, an intranet, and/or a wireless network, such superset thereof: an input/output processing module as a cellular telephone network, a wireless local area 328, a speech-to-text (STT) processing module 330, a network (LAN), and/or a metropolitan area network natural language processing module 332, a dialogue flow (MAN), and other devices. processing module 334, a task flow processing module [0035] In some examples, memory 302, or the compu- 50 336, a service processing module 338, and a speech ter readable storage media of memory 302, can store synthesis module 340. Each of these modules can have programs, modules, instructions, and data structures in- access to one or more of the following systems or data cluding all or a subset of: an operating system 318, a and models of the digital assistant 326, or a subset or communications module 320, a user interface module superset thereof: ontology 360, vocabulary index 344, 322, one or more applications 324, and a digital assistant 55 user data 348, task flow models 354, service models 356, module 326. In particular, memory 302, or the computer ASR systems, and pronunciation system 342. readable storage media of memory 302, can store in- [0041] In some examples, using the processing mod- structions for performing the process 400, described be- ules, data, and models implemented in the digital assist-

6 11 EP 3 032 532 A1 12 ant module 326, the digital assistant can perform at least intermediate recognitions results (e.g., , pho- some of the following: converting speech input into text; nemic strings, and sub-words), and ultimately, text rec- identifyinga user’s intent expressed ina natural language ognition results (e.g., words, strings, or sequence input received from the user; actively eliciting and obtain- of tokens). In some examples, the speech input can be ing information needed to fully infer the user’s intent (e.g., 5 processed at least partially by a third-party service or on by disambiguating words, games, intentions, etc.); de- the user’s device (e.g., user device 104) to produce the termining the task flow for fulfilling the inferred intent; and recognition result. Once the STT processing module 330 executing the task flow to fulfill the inferred intent. produces recognition results containing a text string (e.g., [0042] In some examples, as shown in FIG. 3B, the I/O words, or sequence of words, or sequence of tokens), processing module 328 can interact with the user through 10 the recognition result can be passed to the natural lan- the I/O devices 316 in FIG. 3A or with a user device (e.g., guage processing module 332 for intent deduction. a user device 104 in FIG. 1) through the network com- [0044] In some examples, the STT processing module munications interface 308 in FIG. 3A to obtain user input 330 can include and/or access a vocabulary of recogniz- (e.g., a speech input) and to provide responses (e.g., as able words via a phonetic alphabet conversion module speech outputs) to the user input. The I/O processing 15 331. Each vocabulary word can be associated with one module 328 can optionally obtain contextual information or more candidate pronunciations of the word represent- associated with the user input from the user device, along ed in a speech recognition phonetic alphabet. In partic- with or shortly after the receipt of the user input. The ular, the vocabulary of recognizable words can include contextual information can include user-specific data, vo- a word that is associated with a plurality of candidate cabulary, and/or preferences relevant to the user input. 20 pronunciations. For example, the vocabulary may in- In some examples, the contextual information also in- clude the word "tomato" that is associated with the can- cludes software and hardware states of the device (e.g., didate pronunciations of and the user device 104 in FIG. 1) at the time the user request . Further, vocabulary words can be asso- is received, and/or information related to the surrounding ciated with custom candidate pronunciations that are environment of the user at the time that the user request 25 based on previous speech inputs from the user. Such was received. In some examples, the I/O processing custom candidate pronunciations can be stored in the module 328 can also send follow-up questions to, and STT processing module 330 and can be associated with receive answers from, the user regarding the user re- a particular user via the user’s profile on the device. In quest. When a user request is received by the I/O some examples, the candidate pronunciations for words processing module 328 and the user request can include 30 can be determined based on the of the word and speech input, the I/O processing module 328 can forward one or more linguistic and/or phonetic rules. 
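Paragraph [0044] above describes a vocabulary in which each recognizable word is associated with one or more candidate pronunciations, including custom candidate pronunciations learned from the user's own speech and stored against the user's profile. A minimal sketch follows; note that the two "tomato" phonemic strings appear in a phonetic font in the published text and did not survive extraction, so the ARPAbet-style strings here are placeholders, and the helper names are assumptions.

```python
LEXICON = {
    "tomato": [
        {"phonemes": "T AH M EY T OW", "source": "canonical"},
        {"phonemes": "T AH M AA T OW", "source": "canonical"},
    ],
}

def add_custom_pronunciation(word, phonemes, user_id):
    """Record a custom candidate pronunciation learned from the user's own
    speech and tied to that user's profile."""
    LEXICON.setdefault(word, []).append(
        {"phonemes": phonemes, "source": f"custom:{user_id}"})

def ranked_candidates(word, user_id=None):
    """Custom pronunciations for this user outrank canonical candidates;
    canonical candidates keep their listed order as a stand-in for commonness."""
    return sorted(LEXICON.get(word, []),
                  key=lambda e: e["source"] != f"custom:{user_id}")

add_custom_pronunciation("tomato", "T OW M AA T OW", user_id="user-1")
print(ranked_candidates("tomato", user_id="user-1")[0])  # the user's variant first
```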
In some ex- the speech input to the STT processing module 330 (or amples, the candidate pronunciations can be manually speech recognizer) for speech-to-text conversions. generated, e.g., based on known canonical pronuncia- [0043] The STT processing module 330 can include tions. one or more ASR systems. The one or more ASR sys- 35 [0045] In some examples, the candidate pronuncia- tems can process the speech input that is received tions can be ranked based on the commonness of the through the I/O processing module 328 to produce a rec- candidate pronunciation. For example, the candidate ognition result. Each ASR system can include a front- end speech pre-processor. The front-end speech pre- pronunciation can be ranked higher than processor can extract representative features from the 40 , because the former is a more commonly speech input. For example, the front-end speech pre- used pronunciation (e.g., among all users, for users in a processor can perform a Fourier transform on the speech particular geographical region, or for any other appropri- input to extract spectral features that characterize the ate subset of users). In some examples, candidate pro- speech input as a sequence of representative multi-di- nunciations can be ranked based on whether the candi- mensional vectors. Further, each ASR system can in- 45 date pronunciation is a custom candidate pronunciation clude one or more speech recognition models (e.g., associated with the user. For example, custom candidate acoustic models and/or language models) and can im- pronunciations can be ranked higher than canonical can- plement one or more speech recognition engines. Exam- didate pronunciations. This can be useful for recognizing ples of speech recognition models can include Hidden proper nouns having a unique pronunciation that devi- Markov Models, Gaussian-Mixture Models, Deep Neural 50 ates from canonical pronunciation. In some examples, Network Models, n-gram language models, and other candidate pronunciations can be associated with one or statistical models. Examples of speech recognition en- more speech characteristics such as geographic origin, gines can include the dynamic time warping based en- nationality, or ethnicity. For example, the candidate pro- gines and weighted finite-state transducers (WFST) nunciation can be associated with the based engines. The one or more speech recognition55 United States while the candidate pronunciation models and the one or more speech recognition engines can be used to process the extracted representative fea- can be associated with Great Britain. Fur- tures of the front-end speech pre-processor to produce ther, the rank of the candidate pronunciation can be

7 13 EP 3 032 532 A1 14 based on one or more characteristics (e.g., geographic can optionally use the contextual information to clarify, origin, nationality, ethnicity, etc.) of the user stored in the supplement, and/or further define the information con- user’s profile on the device. For example, it can be de- tained in the token sequence received from the STT terminedfrom the user’s profilethat theuser is associated processing module 330. The contextual information can with the United States. Based the user being associated 5 include, for example, user preferences, hardware, and/or with the United States, the candidate pronunciation software states of the user device, sensor information (associated with the United States) can collected before, during, or shortly after the user request, be ranked higher than the candidate pronunciation prior interactions (e.g., dialogue) between the digital as- sistant and the user, and the like. As described herein, (associated with Great Britain). In some 10 contextual information can be dynamic, and can change examples, one of the ranked candidate pronunciations with time, location, content of the dialogue, and other can be selected as a predicted pronunciation (e.g., the factors. most likely pronunciation). [0050] In some examples, the natural language [0046] When a speech input is received, the STT processing can be based on, e.g., ontology 360. The on- processing module 330 can be used to determine the 15 tology 360 can be a hierarchical structure containing phonemes corresponding to the speech input (e.g., using many nodes, each node representing either an "action- an acoustic model), and then attempt to determine words able intent" or a "property" relevant to one or more of the that match the phonemes (e.g., using a language model). "actionable intents" or other "properties." As noted For example, if the STT processing module 330 can first above, an "actionable intent" can represent a task that identify the sequence of phonemes cor- 20 the digital assistant is capable of performing, i.e., it is responding to a portion of the speech input, it can then "actionable" or can be acted on. A "property" can repre- determine, based on the vocabulary index 344, that this sent a parameter associated with an actionable intent or sequence corresponds to the word "tomato." a sub-aspect of another property. A linkage between an [0047] In some examples, the STT processing module actionable intent node and a property node in the ontol- 330 can use approximate matching techniques to deter- 25 ogy 360 can define how a parameter represented by the mine words in an utterance. Thus, for example, the STT property node pertains to the task represented by the processing module 330 can determine that the sequence actionable intent node. of phonemes corresponds to the word "to- [0051] In some examples, the ontology 360 can be mato," even if that particular sequence of phonemes is made up of actionable intent nodes and property nodes. not one of the candidate sequence of phonemes for that 30 Within the ontology 360, each actionable intent node can word. be linked to one or more property nodes either directly [0048] The natural language processing module 332 or through one or more intermediate property nodes. 
("natural language processor") of the digital assistant can Similarly, each property node can be linked to one or take the sequence of words or tokens ("token sequence") more actionable intent nodes either directly or through generated by the STT processing module 330, and at- 35 one or more intermediate property nodes. For example, tempt to associate the token sequence with one or more as shown in Figure 3C, the ontology 360 can include a "actionable intents" recognized by the digital assistant. "restaurant reservation" node (i.e., an actionable intent An "actionable intent" can represent a task that can be node). Property nodes "restaurant," "date/time" (for the performed by the digital assistant, and can have an as- reservation), and "party size" can each be directly linked sociated task flow implemented in the task flow models 40 to the actionable intent node (i.e., the "restaurant reser- 354. The associated task flow can be a series of pro- vation" node). grammed actions and steps that the digital assistant [0052] In addition, property nodes "cuisine," "price takes in order to perform the task. The scope of a digital range," "phone number," and "location" can be sub- assistant’s capabilities can be dependent on the number nodes of the property node "restaurant," and can each and variety of task flows that have been implemented 45 be linked to the "restaurant reservation" node (i.e., the and stored in the task flow models 354, or in other words, actionable intent node) through the intermediate property on the number and variety of "actionable intents" that the node "restaurant." For another example, as shown in Fig- digital assistant recognizes. The effectiveness of the dig- ure3C, theontology 360 can alsoinclude a "set reminder" ital assistant, however, can also be dependent on the node (i.e., another actionable intent node). Property assistant’s ability to infer the correct "actionable intent(s)" 50 nodes "date/time" (for setting the reminder) and "subject" from the user request expressed in natural language. (for the reminder) can each be linked to the "set reminder" [0049] In some examples, in addition to the sequence node. Since the property "date/time" can be relevant to of words or tokens obtained from the STT processing both the task of making a restaurant reservation and the module 330, the natural language processing module task of setting a reminder, the property node "date/time" 332 can also receive contextual information associated 55 can be linked to both the "restaurant reservation" node with the user request, e.g., from the I/O processing mod- and the "set reminder" node in the ontology 360. ule 328. The natural language processing module 332 [0053] An actionable intent node, along with its linked concept nodes, can be described as a "domain." In the

8 15 EP 3 032 532 A1 16 present discussion, each domain can be associated with 360 can be associated with a set of words and/or phrases a respective actionable intent, and refers to the group of that are relevant to the property or actionable intent rep- nodes (and the relationships there between) associated resented by the node. The respective set of words and/or with the particular actionable intent. For example, the on- phrases associated with each node can be the so-called tology 360 shown in Figure 3C can include an example 5 "vocabulary" associated with the node. The respective of a restaurant reservation domain 362 and an example set of words and/or phrases associated with each node of a reminder domain 364 within the ontology 360. The can be stored in the vocabulary index 344 in association restaurant reservation domain includes the actionable in- with the property or actionable intent represented by the tent node "restaurant reservation," property nodes "res- node. For example, returning to Figure 3B, the vocabu- taurant," "date/time," and "party size," and sub-property 10 lary associated with the node for the property of "restau- nodes "cuisine," "price range," "phone number," and "lo- rant" can include words such as "food," "drinks," "cui- cation." The reminder domain 364 can include the ac- sine," "hungry," "eat," "pizza," "fast food," "meal," and so tionable intent node "set reminder," and property nodes on. For another example, the vocabulary associated with "subject" and "date/time." In some examples, the ontol- the node for the actionable intent of "initiate a phone call" ogy 360 can be made up of many domains. Each domain 15 can include words and phrases such as "call," "phone," can share one or more property nodes with one or more "dial," "ring," "call this number," "make a call to," and so other domains. For example, the "date/time" property on. The vocabulary index 344 can optionally include node can be associated with many different domains words and phrases in different . In some ex- (e.g., a scheduling domain, a travel reservation domain, amples, the vocabulary associated with the node (e.g., a movie ticket domain, etc.), in addition to the restaurant 20 actionable intent, parameter/property) can include a het- reservation domain 362 and the reminder domain 364. eronym and the heteronym in the vocabulary can be as- [0054] While Figure 3C illustrates two example do- sociated with a particular meaning and pronunciation. mains within the ontology 360, other domains can in- For example, the heteronym in the vocabulary can be clude, for example, "initiate a phone call," "find direc- uniquely identified (e.g., by means of a tag, label, token, tions," "schedule a meeting," "send a message," and25 metadata, and the like) as being associated with a par- "provide an answer to a question," "read a list," "providing ticular meaning and pronunciation. navigation instructions," "provide instructions for a task" [0058] The natural language processing module 332 and so on. A "send a message" domain can be associ- can receive the token sequence (e.g., a text string) from ated with a "send a message" actionable intent node, the STT processing module 330, and determine what and may further include property nodes such as "recipi- 30 nodes are implicated by the words in the token sequence. ent(s)," "message type," and "message body." 
The prop- In some examples, if a word or phrase in the token se- erty node "recipient" can be further defined, for example, quence is found to be associated with one or more nodes by the sub-property nodes such as "recipient name" and in the ontology 360 (via the vocabulary index 344), the "message address." word or phrase can "trigger" or "activate" those nodes. [0055] Insome examples, the ontology360 can include 35 Based on the quantity and/or relative importance of the all the domains (and hence actionable intents) that the activated nodes, the natural language processing mod- digital assistant is capable of understanding and acting ule 332 can select one of the actionable intents as the upon. In some examples, the ontology 360 can be mod- taskthat the user intended thedigital assistant to perform. ified, such as by adding or removing entire domains or In some examples, the domain that has the most "trig- nodes, or by modifying relationships between the nodes 40 gered" nodes can be selected. In some examples, the within the ontology 360. domain having the highest confidence value (e.g., based [0056] In some examples, nodes associated with mul- on the relative importance of its various triggered nodes) tiple related actionable intents can be clustered under a can be selected. In some examples, the domain can be "super domain" in the ontology 360. For example, a "trav- selected based on a combination of the number and the el" super-domain can include a cluster of property nodes 45 importance of the triggered nodes. In some examples, and actionable intent nodes related to travel. The action- additional factors are considered in selecting the node able intent nodes related to travel can include "airline as well, such as whether the digital assistant has previ- reservation," "hotel reservation," "car rental," "get direc- ously correctly interpreted a similar request from a user. tions," "find points of interest," and so on. The actionable [0059] User data 348 can include user-specific infor- intent nodes under the same super domain (e.g., the50 mation, such as user-specific vocabulary, user prefer- "travel" super domain) can have many property nodes in ences, user address, user’s default and secondary lan- common. For example, the actionable intent nodes for guages, user’s contact list, and other short-term or long- "airline reservation," "hotel reservation," "car rental," "get term information for each user. In some examples, the directions," and "find points of interest" can share one or natural language processing module 332 can use the more of the property nodes "start location," "destination," 55 user-specific information to supplement the information "departure date/time," "arrival date/time," and "party contained in the user input to further define the user in- size." tent. For example, for a user request "invite my friends [0057] In some examples, each node in the ontology to my birthday party," the natural language processing

9 17 EP 3 032 532 A1 18 module 332 can be able to access user data 348 to de- order to obtain additional information, and/or disam- termine who the "friends" are and when and where the biguate potentially ambiguous utterances. When such in- "birthday party" would be held, rather than requiring the teractions are necessary, the task flow processing mod- user to provide such information explicitly in his/her re- ule 336 can invoke the dialogue flow processing module quest. 5 334 to engage in a dialogue with the user. In some ex- [0060] In some examples, once the natural language amples, the dialogue flow processing module 334 can processing module 332 identifies an actionable intent (or determine how (and/or when) to ask the user for the ad- domain) based on the user request, the natural language ditional information and receives and processes the user processing module 332 can generate a structured query responses. The questions can be provided to and an- to represent the identified actionable intent. In some ex- 10 swers can be received from the users through the I/O amples, the structured query can include parameters for processing module 328. In some examples, the dialogue one or more nodes within the domain for the actionable flow processing module 334 can present dialogue output intent, and at least some of the parameters are populated to the user via audio and/or visual output, and receives with the specific information and requirements specified input from the user via spoken or physical (e.g., clicking) in the user request. For example, the user may say "Make 15 responses. Continuing with the example above, when me a dinner reservation at a sushi place at 7." In this the task flow processing module 336 invokes the dia- case, the natural language processing module 332 can logue flow processing module 334 to determine the "party be able to correctly identify the actionable intent to be size" and "date" information for the structured query as- "restaurant reservation" based on the user input. Accord- sociated with the domain "restaurant reservation," the ing to the ontology, a structured query for a "restaurant 20 dialogue flow processing module 334 can generate ques- reservation" domain may include parameters such as tions such as "For how many people?" and "On which {Cuisine}, {Time}, {Date}, {Party Size}, and the like. In day?" to pass to the user. Once answers are received some examples, based on the speech input and the text from the user, the dialogue flow processing module 334 derived from the speech input using the STT processing can then populate the structured query with the missing module 330, the natural language processing module 25 information, or pass the information to the task flow 332 can generate a partial structured query for the res- processing module 336 to complete the missing informa- taurant reservation domain, where the partial structured tion from the structured query. query includes the parameters {Cuisine = "Sushi"} and [0063] Once the task flow processing module 336 has {Time = "7pm"}. However, in this example, the user’s ut- completed the structured query for an actionable intent, terance contains insufficient information to complete the 30 the task flow processing module 336 can proceed to per- structured query associated with the domain. Therefore, form the ultimate task associated with the actionable in- other necessary parameters such as {Party Size} and tent. 
Accordingly, the task flow processing module 336 {Date} may not be specified in the structured query based can execute the steps and instructions in the task flow on the information currently available. In some examples, model according to the specific parameters contained in the natural language processing module 332 can popu- 35 the structured query. For example, the task flow model late some parameters of the structured query with re- for the actionable intent of "restaurant reservation" can ceived contextual information. For example, in some ex- include steps and instructions for contacting a restaurant amples, if the user requested a sushi restaurant "near andactually requestinga reservation for a particular party me," the natural language processing module 332 can size at a particular time. For example, using a structured populate a {location} parameter in the structured query 40 query such as: {restaurant reservation, restaurant = ABC with GPS coordinates from the user device 104. Café, date = 3/12/2012, time = 7pm, party size = 5}, the [0061] In some examples, the natural language task flow processing module 336 can perform the steps processing module 332 can pass the structured query of: (1) logging onto a server of the ABC Café or a restau- (including any completed parameters) to the task flow rant reservation system such as OPENTABLE®, (2) en- processing module 336 ("task flow processor"). The task 45 tering the date, time, and party size information in a form flow processing module 336 can be configured to receive on the website, (3) submitting the form, and (4) making the structured query from the natural language process- a calendar entry for the reservation in the user’s calendar. ing module 332, complete the structured query, if neces- [0064] In some examples, the task flow processing sary, and perform the actions required to "complete" the module 336 can employ the assistance of a service user’s ultimate request. In some examples, the various 50 processing module 338 ("service processing module") to procedures necessary to complete these tasks can be complete a task requested in the user input or to provide provided in task flow models 354. In some examples, the an informational answer requested in the user input. For task flow models can include procedures for obtaining example, the service processing module 338 can act on additional information from the user and task flows for behalf of the task flow processing module 336 to make performing actions associated with the actionable intent. 55 a phone call, set a calendar entry, invoke a map search, [0062] As described above, in order to complete a invoke or interact with other user applications installed structured query, the task flow processing module 336 on the user device, and invoke or interact with third-party may need to initiate additional dialogue with the user in services (e.g., a restaurant reservation , a social

networking website, a banking portal, etc.). In some examples, the protocols and application programming interfaces (API) required by each service can be specified by a respective service model among the service models 356. The service processing module 338 can access the appropriate service model for a service and generate requests for the service in accordance with the protocols and APIs required by the service according to the service model.

[0065] For example, if a restaurant has enabled an online reservation service, the restaurant can submit a service model specifying the necessary parameters for making a reservation and the APIs for communicating the values of the necessary parameters to the online reservation service. When requested by the task flow processing module 336, the service processing module 338 can establish a network connection with the online reservation service using the web address stored in the service model, and send the necessary parameters of the reservation (e.g., time, date, party size) to the online reservation interface in a format according to the API of the online reservation service.

[0066] In some examples, the natural language processing module 332, dialogue flow processing module 334, and task flow processing module 336 can be used collectively and iteratively to infer and define the user's intent, obtain information to further clarify and refine the user intent, and finally generate a response (i.e., an output to the user, or the completion of a task) to fulfill the user's intent. The generated response can be a dialogue response to the speech input that at least partially fulfills the user's intent. Further, in some examples, the generated response can be output as a speech output. In these examples, the generated response can be sent to the speech synthesis module 340 (e.g., speech synthesizer) where it can be processed to synthesize the dialogue response in speech form.

[0067] The speech synthesis module 340 can be configured to synthesize speech outputs for presentation to the user. The speech synthesis module 340 synthesizes speech outputs based on text provided by the digital assistant. For example, the generated dialogue response can be in the form of a text string. The speech synthesis module 340 can convert the text string to an audible speech output. The speech synthesis module 340 can use any appropriate speech synthesis technique in order to generate speech outputs from text, including, but not limited to, concatenative synthesis, unit selection synthesis, diphone synthesis, domain-specific synthesis, formant synthesis, articulatory synthesis, hidden Markov model (HMM) based synthesis, and sinewave synthesis. In some examples, the speech synthesis module 340 can be configured to synthesize individual words based on phonemic strings corresponding to the words. For example, a phonemic string can be associated with a word in the generated dialogue response. The phonemic string can be stored in metadata associated with the word. The speech synthesis module 340 can be configured to directly process the phonemic string in the metadata to synthesize the word in speech form.

[0068] Further, speech synthesis module 340 can include a pronunciation system 342 for disambiguating heteronyms. Pronunciation system 342 can thus be configured to determine the correct pronunciation of a heteronym in the speech input or the generated dialogue response. In some examples, pronunciation system 342 can utilize at least one of the speech input, the n-gram language model of the ASR system 331, the natural language processing module 332, and received context (e.g., contextual information) to determine the correct pronunciation of the heteronym. In determining the correct pronunciation of the heteronym, the pronunciation system 342 can obtain a phonemic string corresponding to the correct pronunciation of the heteronym from one or more acoustic models or language models of the ASR system 331. Further, in some examples, the pronunciation system 342 can annotate (e.g., by means of a tag, label, token, or metadata) the heteronym in the generated dialogue response to identify the correct pronunciation associated with the heteronym. In cases where there is conflicting information between the speech input, the ASR system, the natural language processing module 332, and contextual information regarding the correct pronunciation of the heteronym, the pronunciation system 342 can be configured to apply predetermined logic or rules (e.g., voting schemes, combination schemes, weighting schemes, and the like) to determine the correct pronunciation of the heteronym. As described in greater detail below, the pronunciation system 342 can be configured to perform block 418 of process 400.
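As one illustration of the voting and weighting schemes mentioned in [0068] for resolving conflicting evidence about a heteronym's pronunciation, the following sketch assigns each knowledge source a weight and picks the phonemic string with the greatest total weight. The source names, weights, and phonemic strings are assumptions made for the example, not values from the application.

```python
SOURCE_WEIGHTS = {
    "speech_input_phonemes": 0.4,     # how the user actually said the word
    "asr_n_gram_model": 0.3,
    "natural_language_processor": 0.2,
    "contextual_information": 0.1,
}

def resolve_pronunciation(votes):
    """votes maps each knowledge source to the phonemic string it favors;
    return the phonemic string with the greatest total weight."""
    scores = {}
    for source, phonemes in votes.items():
        scores[phonemes] = scores.get(phonemes, 0.0) + SOURCE_WEIGHTS.get(source, 0.0)
    return max(scores, key=scores.get)

# Three sources favor the city pronunciation of "nice"; context favors the adjective.
print(resolve_pronunciation({
    "speech_input_phonemes": "n IY s",
    "asr_n_gram_model": "n IY s",
    "natural_language_processor": "n IY s",
    "contextual_information": "n AY s",
}))  # -> "n IY s"
```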
[0069] In some examples, instead of (or in addition to) using the local speech synthesis module 340, speech synthesis is performed on a remote device (e.g., the server system 108), and the synthesized speech is sent to the user device 104 for output to the user. For example, this occurs in some implementations where outputs for a digital assistant are generated at a server system. And because server systems generally have more processing power or resources than a user device, it may be possible to obtain higher quality speech outputs than would be practical with client-side synthesis.
[0070] Additional details on digital assistants can be found in the U.S. Utility Application No. 12/987,982, entitled "Intelligent Automated Assistant," filed January 10, 2011, and U.S. Utility Application No. 13/251,088, entitled "Generating and Processing Task Items That Represent Tasks to Perform," filed September 30, 2011, the entire disclosures of which are incorporated herein by reference.

4. Process for Operating a Digital Assistant

[0071] FIG. 4 illustrates a process 400 for operating a digital assistant according to various examples. Process 400 can be performed at an electronic device with one or more processors and memory storing one or more

11 21 EP 3 032 532 A1 22 programs for execution by the one or more processors. [0075] At block 404 of process 400, contextual infor- In some examples, the process 400 can be performed mation associated with the speech input can be received. at the user device 104 or the server system 108. In some The contextual information received at block 404 can be examples, the process 400 can be performed by the dig- similar or identical to the contextual information de- ital assistant system 300 (FIG. 3A), which, as noted5 scribed above for inferring the user’s intent or determin- above, may be implemented on a standalone computer ing how to prepare and deliver outputs. In some exam- system (e.g., either the user device 104 or the server ples, the contextual information can accompany the system 108) or distributed across multiple computers speech input. In some examples, the contextual informa- (e.g., the user device 104, the server system 108, and/or tion can include sensor information received from a sen- additional or alternative devices or systems). While the 10 sor of the electronic device (e.g., a motion sensor 210, following discussion describes the process 400 as being a light sensor 212, a proximity sensor 214, a positioning performed by a digital assistant (e.g., the digital assistant system, a temperature sensor, a biometric sensor, a system 300), the process is not limited to performance gyro, a compass of the user device 104). In particular, by any particular device, combination of devices, or im- the contextual information can include the location of the plementation.Moreover, theindividual blocks ofthe proc- 15 user at the time the speech input of block 402 is received. esses may be distributed among the one or more com- In some examples, the contextual information can in- puters, systems, or devices in any appropriate manner. clude information received from an application of the user [0072] At block 402 of process 400, a speech input can device (e.g., contacts, calendar, , , be received from a user. In some examples, the speech messages, maps application, weather application, and input can be received in the course of, or as part of, an 20 the like). In some examples, the contextual information interaction with the digital assistant. The speech input can include information associated with the user, such can be received in the form of sound waves, an audio as the user’s identity, geographical origin, nationality, or file, or a representative audio signal (analog or digital). ethnicity. Such user information can be received from the In some examples, the speech input can be sound waves user’s profile stored on the device. The received contex- that are received by the microphone (e.g., microphone 25 tual information can be used at block 414 to determine 230) of the electronic device (e.g., user device 104). In an actionable intent or at block 418 to determine the cor- other examples,the speechinput can be a representative rect pronunciation of the heteronym, as describe below. audio signal or a recorded audio file that is received by [0076] At block 406 of process 400, the speech input the audio subsystem (e.g., audio subsystem 226), the can be processed using an ASR system. For example, peripheral interface (e.g., peripheral interface 206), or 30 the speech input can be processed using an ASR system the processor (e.g., processor 204) of the electronic de- of the STT processing module 330, as described above. vice. 
In yet other examples, the speech input can be a Block 406 can include one or more of blocks 408 through representative audio signal or a recorded audio file that 412, as described below. Accordingly, the speech input is received by the I/O interface (e.g., I/O interface 306) can be processed using the ASR system to determine at or the processor (e.g., processor 304) of the digital as- 35 least one of a phonemic string corresponding to the het- sistant system. eronym as pronounced in the speech input, a frequency [0073] In some examples, the speech input can include of occurrence of an n-gram with respect to a corpus, and a user request. The user request can be any request, a text string corresponding to the speech input. including a request that indicates a task that the digital [0077] At block 408 of process 400, a phonemic string assistant can perform (e.g., making and/or facilitating 40 corresponding to the heteronym as pronounced by the restaurant reservations, initiating telephone calls or text user in the speech input can be determined. As described messages, etc.), a request for a response (e.g., an an- above,in some examples,the front-end speech pre-proc- swer to a question, such as "How far is Earth from the essor of the ASR system can extract representative fea- sun?"), and the like. tures from the speech input. The representative features [0074] Insome examples, the speech inputcan contain 45 can be processed using an acoustic model of the ASR a heteronym and one or more additional words. One ex- system to produce, for example, a sequence of pho- ample of a speech input containing a heteronym and one nemes corresponding to the speech input as pronounced or more additional words can include, "How’s the weather by the user. The sequence of phonemes can be further in Nice?" where "Nice" is the heteronym. As previously processed using a language model of the ASR system described, the heteronym "Nice" can be a proper noun 50 to map phonemic strings within the sequence of pho- having the pronunciation of /nis/ (indicated using the In- nemes to corresponding words. In particular, a phonemic ternational Phonetic Alphabet) or an adjective having the string within the sequence of phonemes can be mapped pronunciation of . It should be recognized that to the heteronym in the speech input. This phonemic in examples where the speech input is not used to de- string can correspond to the heteronym as pronounced termine the correct pronunciation of the heteronym (e.g., 55 by the user in the speech input. at block 418), the speech input need not include a het- [0078] At block 410 of process 400, a frequency of oc- eronym. currence of an n-gram with respect to a corpus can be determined. The n-gram can include the heteronym and

the one or more additional words. For example, in the speech input, "How's the weather in Nice?" the n-gram can be the word trigram "weather in Nice" where the heteronym is "Nice" (the proper noun associated with the pronunciation /nis/) and the one or more additional words is "weather in." The frequency of occurrence of the n-gram with respect to a corpus can be determined using an n-gram language model of the ASR system that is trained using the corpus. In some examples, the frequency of occurrence of the n-gram can be in the form of raw counts. For example, a particular trigram can occur 25 times within the corpus. Accordingly, the frequency of occurrence of that particular trigram within the corpus can be 25 counts. In other cases, the frequency of occurrence can be a normalized value. For example, the frequency of occurrence can be in the form of a likelihood or probability (e.g., probability distribution). In one such example, the corpus of natural language text can include 25 counts of a particular trigram and 1000 counts of all trigrams. Accordingly, the frequency of occurrence of that trigram within the corpus can be equal to 25/1000.
[0079] The n-gram language model can be configured to distinguish the meaning or pronunciation associated with a heteronym. For example, the n-gram language model can have separate entries to distinguish between the proper noun "Nice" associated with the pronunciation /nis/ and the adjective "nice" associated with the pronunciation /naIs/. In particular, the trigram "weather in Nice" (proper noun "Nice") in the trigram language model can be distinct from the trigram "weather in nice" (adjective "nice") in the n-gram language model. In some examples, the n-gram language model can distinguish between heteronyms using metadata (e.g., tagging, tokenization, and the like).
[0080] Further, in some examples, block 410 can include determining a frequency of occurrence of a second n-gram with respect to the corpus. The second n-gram can include the heteronym and the one or more additional words. Although the n-gram and the second n-gram can include the same sequence of words, the heteronym in the second n-gram can be associated with a different meaning and pronunciation than the n-gram. Returning to the example speech input, "How's the weather in Nice?" the second n-gram can be the word trigram "weather in nice" where the heteronym "nice" is associated with the adjective and with the pronunciation /naIs/. In this example, the frequency of occurrence of the trigram "weather in Nice" (/nis/) is likely greater than the frequency of occurrence of the second trigram "weather in nice" (/naIs/). Thus, it can be determined that the heteronym "Nice" in the speech input is more likely to be associated with the pronunciation /nis/ rather than /naIs/. Accordingly, by comparing the frequency of occurrence of the n-gram and the frequency of occurrence of the second n-gram, the heteronym in the speech input can be disambiguated.
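The frequency comparison described in the preceding paragraphs can be made concrete with a short sketch. The counts, the sense-tagged trigram keys, the margin, and the fallback string below are hypothetical assumptions used only to illustrate the raw-count versus normalized-frequency comparison; they are not the described n-gram language model.

```python
# Hypothetical trigram counts from a corpus; the language model keeps separate
# entries for each sense of the heteronym ("Nice" proper noun vs. "nice" adjective).
TRIGRAM_COUNTS = {
    ("weather", "in", "Nice./nis/"): 25,    # proper noun reading
    ("weather", "in", "nice./naIs/"): 2,    # adjective reading
}
TOTAL_TRIGRAMS = 1000

def frequency(trigram: tuple) -> float:
    """Normalized frequency of occurrence (raw count / total trigram count)."""
    return TRIGRAM_COUNTS.get(trigram, 0) / TOTAL_TRIGRAMS

def disambiguate(first: tuple, second: tuple, margin: float = 0.01) -> str:
    """Pick the sense whose n-gram is more frequent by at least a margin;
    otherwise fall back to the pronunciation heard in the speech input."""
    f1, f2 = frequency(first), frequency(second)
    if f1 >= f2 + margin:
        return first[-1].split(".")[-1]
    if f2 >= f1 + margin:
        return second[-1].split(".")[-1]
    return "use phonemic string from block 408"

print(disambiguate(("weather", "in", "Nice./nis/"),
                   ("weather", "in", "nice./naIs/")))   # -> /nis/
```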
[0081] Although in the above example, the one or more additional words can precede the heteronym in the speech input, it should be recognized that in other examples, the one or more additional words can succeed the heteronym or the heteronym can be positioned between the one or more additional words.
[0082] At block 412 of process 400, a text string corresponding to the speech input can be determined. In particular, one or more language models of the ASR system can be used to determine the text string. In some examples, the n-gram language model used to determine frequency of occurrence of the n-gram at block 410 can be used to determine the text string. In some examples, as described above with reference to block 408, the sequence of phonemes determined using an acoustic model can be processed using the one or more language models of the ASR system to map phonemic strings within the sequence of phonemes to corresponding words. In this way, the text string corresponding to the speech input can be determined. Accordingly, the text string can be a transcription of the speech input and can include the heteronym in the speech input. In some examples, the text string can be in the form of a token sequence.
[0083] At block 414 of process 400, an actionable intent based on the text string can be determined. As described above, in some examples, an actionable intent can represent a task that can be performed by the digital assistant. The actionable intent can be determined by means of natural language processing. For example, the natural language processing module 332, described above, can process the text string of block 412 to determine the actionable intent that is consistent with the text string. In a specific example, the text string can be "How's the weather in Nice?" and the determined actionable intent can be "report weather," which can represent the task of retrieving the current and/or forecasted weather for the location of Nice, France and outputting the retrieved weather on the electronic device to the user.
[0084] As described above, an actionable intent can be associated with one or more parameters. For example, the actionable intent of "report weather" can be associated with the parameters of {location} and {date/time} (e.g., current date/time). In some examples, block 414 can include assigning the heteronym to a parameter of the actionable intent. In an example where the text string is "How's the weather in Nice?" the heteronym "Nice" can be assigned to the parameter of {location} such that {location="Nice, France"}. In some examples, as described in greater detail at block 418, the heteronym can be disambiguated based on the parameter to which it is assigned.
[0085] Further, as described above, an actionable intent, including a parameter of an actionable intent, can be associated with a set of words and/or phrases. The set of words and/or phrases can be a vocabulary list associated with the actionable intent or a parameter of the actionable intent. The text string can be mapped to a particular actionable intent based on the text string and the vocabulary list having one or more common words.


For example, a vocabulary list associated with the ac- erated dialogue response can be directed to reporting tionable intent of "report weather" can include the words the weather to the user. "weather" and "Nice." If the text string is "How’s the [0090] At block 418 of process 400, a correct pronun- weather in Nice?" the actionable intent can be deter- ciation of the heteronym can be determined. In some mined to be "report weather" based on the common5 examples, block 416 can be performed using the speech words "weather" and "Nice" in the text string and the vo- synthesis module (e.g., the speech synthesis module cabulary list associated with the actionable intent of "re- 340) of the digital assistant. In particular, a pronunciation port weather." system (e.g., the pronunciation system 342) of the [0086] In some examples, a vocabulary list associated speech synthesis module can be used to perform block with an actionable intent can include one or more heter- 10 418. onyms and the one or more heteronyms can each be [0091] In some examples, the correct pronunciation of uniquely identified (e.g., by means of tagging, tokenizing, the heteronym can be determined based on the speech and the like) as being associated with a particular mean- input. In particular, the correct pronunciation of the het- ing and/or a particular pronunciation. For example, the eronym can be based on the phonemic string determined vocabulary list associated with the parameter {location} 15 at block 408. For example, the correct pronunciation of of the actionable intent "report weather" can include the the heteronym can be determined to be consistent with heteronym "Nice" (proper noun). In this example, "Nice" the phonemic string determined at block 408 where the in the vocabulary list can be uniquely identified as being correct pronunciation of the heteronym is similar or iden- associated with the proper noun and the pronunciation tical to the pronunciation of the heteronym by the user in /nis/ rather than the adjective and the pronunciation20 the speech input. In these examples, speech synthesis /naIs/. This can be desirable in enabling the use of the of the heteronym can be performed using the phonemic vocabulary list as a knowledge base for disambiguating string, as described in greater detail at block 420. heteronyms. [0092] In some examples, the correct pronunciation of [0087] Further, as described above, the contextual in- the heteronym can be determined using the n-gram lan- formation received at block 404 can be used to determine 25 guage model of the ASR system. In some examples, the the actionable intent. For example, information from the context in the speech input can be utilized to disambigua- sensors and applications of the electronic device can be te the heteronym. For example, the correct pronunciation used to infer missing information in the text string and of the heteronym can be determined based on the fre- thus disambiguate the user’s intent associated with the quency of occurrence of the n-gram determined at block speech input. Further, in some examples, the contextual 30 410. Referring back to the example speech input, "How’s information can be associated with the parameters of a the weather in Nice?" the n-gram can be the trigram particular actionable intent and thus the actionable intent "weather in Nice" where the heteronym "Nice" in the tri- can be determined based on these associations. 
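The vocabulary-list mapping discussed in the preceding paragraphs can be illustrated with a small sketch. The vocabulary entries, sense tags, and overlap scoring below are hypothetical; they only show how a list whose heteronym entries carry a pronunciation tag can both select an actionable intent and later serve as a knowledge base for disambiguation.

```python
# Hypothetical vocabulary lists; each entry may carry a pronunciation tag so a
# heteronym in the list is uniquely identified with one sense (e.g. "Nice" /nis/).
VOCABULARY = {
    "report weather": {"weather": None, "forecast": None, "Nice": "/nis/"},
    "restaurant reservation": {"table": None, "reservation": None, "restaurant": None},
}

def select_intent(text: str) -> tuple[str, dict]:
    """Pick the actionable intent whose vocabulary shares the most words with the text."""
    tokens = set(text.replace("?", "").split())
    scored = {intent: len(tokens & set(vocab)) for intent, vocab in VOCABULARY.items()}
    best = max(scored, key=scored.get)
    return best, VOCABULARY[best]

intent, vocab = select_intent("How's the weather in Nice?")
print(intent)                 # -> report weather
print(vocab.get("Nice"))      # -> /nis/ (pronunciation tag usable when disambiguating)
```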
gram is associated with the proper noun and with the [0088] At block 416 of process 400, a dialogue re- pronunciation /nis/. Based on the frequency of occur- sponse to the speech input can be generated. In some 35 rence of the trigram "weather in Nice" (/nis/), it can be examples, the dialogue response can include the heter- determined that the heteronym "Nice" in the speech input onym in the speech input. Referring back to the example is most likely associated with the proper noun and the speechinput, "How’s the weatherin Nice? "the generated pronunciation /nis/. Accordingly, in this example, the cor- dialogue response can include, "Here’s the weather in rect pronunciation of the heteronym can be determined Nice, France between today and Saturday." In some ex- 40 to be /nis/. amples, the dialogue response can be generated using [0093] In some examples, the correct pronunciation of the dialogue flow processing module 334. As described the heteronym can be determined based on both the fre- above, the dialogue flow processing module 334 can co- quency of occurrence of the n-gram and the frequency operate with the natural language processing module of occurrence of the second n-gram determined at block 332 and task flow processing module 336 to generate 45 410. For example, the correct pronunciation of the het- the dialogue response. eronym can be determined by comparing the frequency [0089] The dialogue response can be any appropriate of occurrence of the n-gram to the frequency of occur- response in natural language form that at least partially rence of the second n-gram. Referring to the example addresses the speech input. In examples where the where the n-gram is the trigram "weather in Nice" (/nis/) speech input contains a user request, the generated di- 50 and the second n-gram is the trigram "weather in nice" alogue response can be one that at least partially fulfills ), the correct pronunciation of the heteronym the user’s request. In some examples, the generated di- "Nice" can be determined to be /nis/ based on the fre- alogue response can be a confirmatory response or a quency of occurrence of the trigram "weather in Nice" request for further information needed to fulfill the user’s being greater than the frequency of occurrence of the request. In some examples, the dialogue response can 55 trigram "weather in nice" by at least a predetermined be generated based on the actionable intent determined amount. In another example, the correct pronunciation at block 414. In a specific example, the actionable intent of the heteronym "Nice" can be determined to be /nis/ can be determined to be "report weather" and the gen-

14 27 EP 3 032 532 A1 28 based on the frequency of occurrence of the trigram cordingly, it can be determined based on the actionable "weather in Nice" (/nis/) being greater than a first prede- intent of "map directions" that the correct pronunciation termined threshold value and the frequency of occur- of the heteronym "Nice" is /nis/. rence of the trigram "weather in nice" ) being [0096] In some examples, the correct pronunciation of less than a second predetermined threshold value. In 5 the heteronym can be determined based on a parameter some examples, the first predetermined threshold value of the actionable intent. As described above, the heter- can be equal to the second predetermined threshold val- onym can be assigned to a parameter of the actionable ue. In other examples, the first predetermined threshold intent. For instance, continuing with the example speech value can be greater than the second predetermined input, "Find me directions to Nice," the actionable intent 10 threshold value. can be "map directions" and the heteronym "Nice" can [0094] In some examples, the word context from the be assigned to the parameter {destination location} of dialogue response can be utilized to disambiguate the the actionable intent. In this example, the parameter heteronym. In these examples, the speech input need {destination location} can be more closely associated not include a heteronym. In one such example, the with the proper noun "Nice" (/nis/) than the adjective speech input can be "How’s the weather in the fifth most 15 "nice" ( ). Accordingly, it can be determined populous city in France?" and the dialogue response can based on the parameter {destination location} that the be "Here’s the weather in Nice, France." The dialogue correct pronunciation of the heteronym "Nice" is /nis/. response can then be processed using the ASR system [0097] As described above, a vocabulary list can be to disambiguate the heteronym in the dialogue response associated with the actionable intent or a parameter of and determine a correct pronunciation of the heteronym. 20 the actionable intent. The heteronym can be included in For instance, in the above example, the dialogue re- the vocabulary list. In some examples, the correct pro- sponse can include the trigram "weather in Nice." Similar nunciation of the heteronym can be determined based to the example described above, the frequency of occur- on the particular pronunciation associated with the het- rence of the trigram "weather in Nice" (/nis/) can be de- eronym in the vocabulary list. Referring back to the ex- termined using the n-gram language model of the ASR 25 ample where the actionable intent is determined to be system. Further, in some examples, the frequency of oc- "report weather," a vocabulary list can be associated with currence of the second trigram "weather in nice"the parameter {location} of the actionable intent "report ) can also be determined using the n-gram lan- weather." In this example, the heteronym "Nice" can be included in the vocabulary list and the pronunciation /nis/ guage model of the ASR system. Based on at least one 30 of the frequency of occurrence of the trigram "weather in can be associated with the heteronym "Nice" in the vo- Nice" (/nis/) and the frequency of occurrence of the sec- cabulary list. 
By using the vocabulary list as a knowledge base to disambiguate the heteronym, it can be deter- ond 3-gram "weather in nice" , the correct pro- mined that the correct pronunciation of the heteronym nunciation of the heteronym "Nice" can be determined. "Nice" is /nis/. In addition, the correct pronunciation of the heteronym 35 [0098] In some examples, the contextual information "Nice" in the dialogue response can be determined by received at block 404 can be used to determine the cor- comparing, in a similar or identical manner as described rect pronunciation of the heteronym. Contextual informa- above, the frequency of occurrence of the trigram "weath- tion can be particularly useful in disambiguating hetero- er in Nice" (/nis/) to the frequency of occurrence of the nyms that are semantically similar or having the same second trigram "weather in nice" /naIs/. 40 parts of speech. In one example, the speech input can [0095] Further, in some examples, the correct pronun- be "How are the Ajax doing?" In this example, Ajax is a ciation of the heteronym can be determined using the heteronym that can refer to the Dutch soccer team (pro- natural language processor (e.g., natural language processing module 332) of the digital assistant. For ex- nounced ) or the mythological Greek hero ample, the correct pronunciations of the heteronym can 45 (pronounced ) - both of which are proper be determined based on the actionable intent determined nouns. In this example, contextual information can be at block 414. In these examples, the dialogue response useful to disambiguate the meaning and pronunciation can include the heteronym while the speech input need of the heteronym "Ajax". For example, the contextual in- not include the heteronym. The correct pronunciation can formation can include the current location of the user be determined based on the relationship between the 50 based on information received from the positioning sys- heteronym and the words, phrases, and parameters as- tem of the electronic device. The contextual information sociated with the actionable intent. In a specific example can indicated that the user is currently located in the city where the speech input is "Find me directions to Nice," of Amsterdam, The Netherlands. Based on this contex- and the actionable intent can be determined to be "map tual information, it can be determined that the heteronym directions." In this example, the task of mapping direc- 55 "Ajax" likely refers to the Dutch football team tions can be more closely related to the location "Nice" (/nis/) rather than to the adjective "nice" ( ). Ac- ( ) rather than the mythological Greek hero


the phonemic string determined at block 408 can corre- ( ) Accordingly, the correct pronunciation of the heteronym "Ajax" with respect to the speech input spond to the pronunciation . In addition, based on the frequency of occurrence of the trigram "weather can be determined to be ) in Nice" (/nis/) being greater than the frequency of occur- [0099] In some examples, the contextual information 5 rence of the trigram "weather in nice" ( ), it can used to determine the correct pronunciation of the het- be determined that the heteronym "Nice" is more likely eronym can include information associated with the user, associated with the pronunciation of /nis/ rather than such as, for example, geographic origin, nationality, or ethnicity. In one such example, the user can be identified . In this example, the correct pronunciation can to be Dutch based on contextual information received 10 be determined to be consistent with the n-gram language from the user’s profile stored on the device. In such an model (e.g., /nis/) if the frequency of occurrence of the example, the correct pronunciation of the heteronym trigram "weather in Nice" (/nis/) is greater than a prede- termined threshold value or if the frequency of occur- "Ajax" can be determined to be rather than rence of the trigram "weather in Nice" (/nis/) is greater . 15 than the frequency of occurrence of the trigram "weather [0100] Further, in some examples, the correct pronun- in nice" ( ) by at least a predetermined amount. ciation of the heteronym can be determined based on a Conversely, if the frequency of occurrence of the trigram custom pronunciation of the heteronym that is associated "weather in Nice" (/nis/) is not above a predetermined with the user. In particular, the custom pronunciation can threshold value or if the frequency of occurrence of the be created based on a previous speech input received 20 trigram "weather in Nice" (/nis/) is not greater than the from the user. This can be desirable for proper nouns frequency of occurrence of the trigram "weather in nice" that may not abide by canonical pronunciations. For ex- ( ) by a predetermined amount, then the correct ample, in a previous speech input received from a par- pronunciation of the heteronym "Nice" can be determined ticular user, the heteronym "Ajax" can be pronounced as to be consistent with the phonemic string of block 408. 25 . This pronunciation can be stored as a cus- [0104] In some examples, upon determining the cor- tom pronunciation of the heteronym "Ajax" and associ- rect pronunciation of the heteronym, the heteronym in ated with the particular user. In subsequent speech in- the dialogue response can be annotated (e.g., with a tag, teractions associated with the particular user, the correct token, label, and the like) with a unique identifier such that the annotation can be used to identify the correct pronunciation of the heteronym "Ajax" can then be de- 30 pronunciation of the heteronym. In some examples, the termined to be based on the custom pronun- dialogue response can be represented by a string of to- ciation of the heteronym "Ajax" that is associated with kens and the token representing the heteronym can be the particular user. uniquely associated with the correct pronunciation of the [0101] It should be recognized that the correct pronun- heteronym. 
Further, in some examples, the correct pro- 35 ciation of the heteronym can be determined based on nunciation of the heteronym can be stored in metadata any combination of the speech input, the n-gram lan- associated with the heteronym in the dialogue response. guage model, the natural language processor, and the The annotation of metadata can then be used at block contextual information. More specifically, the correct pro- 420 to synthesize the heteronym according to the correct nunciation of the heteronym can be based on any com- pronunciation. 40 bination of the phonemic string determined at block 408, [0105] In some examples, a phonemic string corre- the frequency of occurrence of the n-gram, the actionable sponding to the correct pronunciation of the heteronym intent determined at block 414, and the contextual infor- can be obtained from the ASR system based on the an- mation received at block 404. notation identifying the correct pronunciation. For exam- [0102] In some examples, there can be conflicting in- ple, in the dialogue response, "Here’s the weather in 45 formation regarding the correct pronunciation of the het- Nice, France between today and Saturday," the hetero- eronym. For example, the user’s pronunciation of the het- nym "Nice" can be a represented by a token that identifies eronym in the speech input can be inconsistent with the the heteronym as the proper noun rather than the adjec- correct pronunciation of the heteronym determined tive. Based on the token and using one or more acoustic based on the n-gram language model or the actionable models or language models of the ASR system, a pho- 50 intent. In these examples, predetermined logic (e.g., vot- nemic string corresponding to the pronunciation /nis/ can ing schemes, predetermined weighting, rules, etc.) can be obtained. The obtained phonemic string can be stored be applied to integrate the information provided by the in metadata associated with the heteronym in the dia- speech input, the n-gram language model, the actionable logue response. The obtained phonemic string can then intent, and the contextual information. be used at block 420 to synthesize the heteronym ac- 55 [0103] Inthe example wherethe speech input is "How’s cording to the correct pronunciation. the weather in Nice?" the user can mispronounce the [0106] In examples where the correct pronunciation of heteronym "Nice" (the proper noun) as and thus the heteronym is determined to be consistent with the

phonemic string of block 408, the phonemic string of block 408 can be directly stored in metadata associated with the heteronym in the dialogue response. In these examples, the phonemic string at block 420 can be used to synthesize the heteronym according to the correct pronunciation.
[0107] At block 420 of process 400, the dialogue response can be output as a speech output. The heteronym in the dialogue response can be pronounced in the speech output according to the correct pronunciation determined at block 418. In some examples, the dialogue response can be output using a speech synthesizer (e.g., speech synthesis module 340) of the digital assistant. Specifically, the heteronym in the dialogue response can be synthesized by the speech synthesizer according to the determined correct pronunciation.
[0108] In some examples, the heteronym in the dialogue response can be synthesized at a word level. In these examples, the speech synthesizer can be configured with a dictionary that includes multiple entries for each heteronym. For example, the dictionary can include an entry for "Nice" corresponding to the pronunciation /nis/ and another entry for "nice" corresponding to the pronunciation /naIs/. In some examples, the speech synthesizer can synthesize the heteronym at a word level based on its annotation (e.g., tag, label, token, and the like) or associated metadata to output the heteronym in speech form according to the correct pronunciation. For example, if the token representing the heteronym "Nice" identifies the heteronym as being the proper noun having the pronunciation /nis/ rather than the adjective having the pronunciation /naIs/, then the speech synthesizer can synthesize the heteronym based on the token and using the dictionary entry in the speech synthesizer corresponding to the pronunciation /nis/. It should be recognized that the heteronym can be identified by various means such as metadata, tags, labels, and the like and each of these means can be accessed and processed by the speech synthesizer to synthesize the heteronym at a word level using the appropriate dictionary entry.
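Word-level selection from such a dictionary can be sketched as a lookup keyed on the annotation token. The dictionary layout, sense tags, and token format below are hypothetical assumptions for this sketch only; they illustrate how an annotated heteronym selects one of several stored pronunciations.

```python
# Hypothetical pronunciation dictionary with one entry per heteronym sense,
# keyed by (word, sense tag) as produced by the annotation of the dialogue response.
PRONUNCIATION_DICT = {
    ("Nice", "PROPER_NOUN"): "/nis/",
    ("nice", "ADJECTIVE"): "/naIs/",
}

def synthesize_word(word: str, sense_tag: str | None) -> str:
    """Return the phonemic string the synthesizer would render for this token."""
    if sense_tag is not None:
        return PRONUNCIATION_DICT[(word, sense_tag)]
    # Non-heteronyms would fall back to ordinary grapheme-to-phoneme conversion.
    return f"<g2p:{word}>"

# "Nice" in the dialogue response is annotated as the proper noun:
tokens = [("Here's", None), ("the", None), ("weather", None),
          ("in", None), ("Nice", "PROPER_NOUN")]
print(" ".join(synthesize_word(w, tag) for w, tag in tokens))
```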
[0109] In other examples, the heteronym in the dialogue response can be synthesized at a phoneme level. In these examples, the speech synthesizer can be configured to directly process phonemic strings and output speech according to the phonemic strings. As described above, a phonemic string corresponding to the correct pronunciation of the heteronym can be stored in metadata associated with the heteronym in the dialogue response. The phonemic string can be the phonemic string of block 408 or the separately obtained phonemic string described at block 416. In these examples, the phonemic string stored in the metadata can be accessed by the speech synthesizer and directly processed to output the heteronym in speech form. Because the stored phonemic strings are based on the determined correct pronunciation of the heteronym, synthesizing the phonemic string can result in the heteronym being pronounced in the speech output according to the correct pronunciation. Synthesizing the heteronym at a phoneme level can be desirable because it can eliminate the need for the speech synthesizer to have a dictionary with multiple entries for each heteronym, thereby reducing resource requirements.
[0110] Although in the examples above, the heteronym can be synthesized at the word level or the phonemic level, it should be recognized that various synthesis processes can be utilized to synthesize the heteronym such that it is pronounced in the speech output according to the correct pronunciation.
[0111] Although blocks 402 through 420 of process 400 are shown in a particular order in FIG. 4, it should be appreciated that these blocks can be performed in any order. For instance, in some examples, block 418 can be performed prior to block 416. Further, it should be appreciated that in some cases, one or more blocks of process 400 can be optional and additional blocks can also be performed. For instance, in some examples, one or more of blocks 408 through 412 within block 406 can be optional. Thus, processing the speech input using the ASR system at block 406 can include any combination of blocks 408, 410, and 412. In other examples, block 404 can be optional. In yet other examples, block 414 can be optional.

5. Electronic Device

[0112] FIG. 5 shows a functional block diagram of an electronic device 500 configured in accordance with the principles of the various described examples. The functional blocks of the device can be optionally implemented by hardware, software, or a combination of hardware and software to carry out the principles of the various described examples. It is understood by persons of skill in the art that the functional blocks described in FIG. 5 can be optionally combined or separated into sub-blocks to implement the principles of the various described examples. Therefore, the description herein optionally supports any possible combination, separation, or further definition of the functional blocks described herein.
[0113] As shown in FIG. 5, an electronic device 500 can include a touch screen display unit 502 configured to display a user interface and receive input from the user, an audio input unit 504 configured to receive speech input, a sensor unit 506 configured to provide contextual information from the user, a memory unit 508 configured to store contextual information, and a speaker unit 508 configured to output audio. In some examples, audio input unit 504 can be configured to receive a speech input in the form of sound waves from a user and transmit the speech input in the form of a representative signal to processing unit 510. The electronic device 500 can further include a processing unit 510 coupled to the touch screen display unit 502, the audio input unit 504, the sensor unit 506, and the speaker unit 508. In some examples, the processing unit 510 can include a receiving unit 512,

17 33 EP 3 032 532 A1 34 a unit 514, a determining unit 518, a configured to determine (e.g., using speech processing generating unit 520, an outputting unit 522, an assigning unit 514) a frequency of occurrence of a second n-gram unit 524, an obtaining unit 526, and an annotating unit with respect to the corpus. The second n-gram includes 528. the heteronym and the one or more additional words. [0114] The processing unit 510 is configured to receive 5 The heteronym in the second n-gram is associated with (e.g., fromthe audio input unit 504and using the receiving a second pronunciation. The correct pronunciation of the unit 512), from a user, a speech input containing a het- heteronym is determined based on the frequency of oc- eronym and one or more additional words. The process- currence of the n-gram and the frequency of occurrence ing unit 510 is configured to process (e.g., using the of the second n-gram. speech processing unit 514) the speech input using an 10 [0120] In some examples, the frequency of occurrence automatic speech recognition system to determine at of the n-gram is greater than the frequency of occurrence least one of a phonemic string corresponding to the het- of the second n-gram by at least a predetermined eronym as pronounced by the user in the speech input amount, and the correct pronunciation of the heteronym and a frequency of occurrence of an n-gram with respect is determined to be the first pronunciation. to a corpus. The n-gram includes the heteronym and the 15 [0121] In some examples, the frequency of occurrence one or more additional words and the heteronym in the of the first n-gram is greater than a first predetermined n-gram is associated with a first pronunciation. The threshold value. The frequency of occurrence of the sec- processing unit 510 is configured to determine (e.g., us- ond n-gram is less than a second predetermined thresh- ing the determining unit 518) a correct pronunciation of old value. The correct pronunciation of the heteronym is the heteronym based on at least one of the phonemic 20 determined to be the first pronunciation. string and the frequency of occurrence of the n-gram. [0122] In some examples, the phonemic string corre- The processing unit 510 is configured to generate (e.g., sponds to the second pronunciation. The frequency of using the generating unit 520) a dialogue response to occurrence of the n-gram is greater than the frequency the speech input. The dialogue response includes the of occurrence of the second n-gram by at least a prede- heteronym. The processing unit 510 is configured to out- 25 termined amount and the correct pronunciation of the put (e.g., using the speaker unit 508 and the outputting heteronym is determined to be the first pronunciation. unit 522) the dialogue response as a speech output. The [0123] In some examples, the processing unit 510 is heteronym in the dialogue response is pronounced in the configured to obtain (e.g., using the obtaining unit 526) speech output according to the determined correct pro- from the automatic speech recognition system a second nunciation. 30 phonemic string corresponding to the determined correct [0115] In some examples, the processing unit 510 is pronunciation. In some examples, outputting the dia- configured to determine (e.g., using speech processing logue response includes synthesizing the heteronym in unit 514) a text string corresponding to the speech input. the dialogue response using a speech synthesizer. 
The In some examples, the processing unit 510 is configured speech synthesizer uses the second phonemic string to to determine (e.g., using the determining unit 518) an 35 synthesize the heteronym in the speech output according actionable intent based on the text string. The correct to the correct pronunciation. pronunciation of the heteronym is determined based on [0124] In some examples, the processing unit 510 is at least one of the phonemic string, the frequency of oc- configured to annotate (e.g., using the annotating unit currence of the n-gram, and the actionable intent. 528) the heteronym in the dialogue response with a tag [0116] In some examples, the processing unit 510 is 40 to identify the correct pronunciation of the heteronym. In configured to assign (e.g., using the assigning unit 524) some examples, outputting the dialogue response in- the heteronym to a parameter of the actionable intent. cludes synthesizing the heteronym in the dialogue re- The correct pronunciation of the heteronym is determined sponse using a speech synthesizer. The heteronym in based at least in part on the parameter. the dialogue response is synthesized based on the tag. [0117] In some examples, a vocabulary list is associ- 45 [0125] In some examples, the correct pronunciation of ated with the actionable intent. The vocabulary list in- the heteronym is determined based at least in part on cludes the heteronym and the heteronym in the vocab- the contextual information. In some examples, the con- ulary list is associated with a particular pronunciation. textual information includes information associated with The correct pronunciation of the heteronym is determined the user. based on the particular pronunciation associated with the 50 [0126] In some examples, the correct pronunciation of heteronym in the vocabulary list. the heteronym is determined based at least in part on a [0118] In some examples, the processing unit 510 is custom pronunciation of the heteronym that is associated configured to receive (e.g., from the sensor unit 506 or with the user. The custom pronunciation can be based the memory unit 508 and using the receiving unit 512) on a previous speech input received from the user. contextual information associated with the speech input. 55 [0127] In some examples, the processing unit 510 is The actionable intent is determined based at least in part configured to receive (e.g., from the audio input unit 504 on the contextual information. and using the receiving unit 512), from a user, a speech [0119] In some examples, the processing unit 510 is input containing a heteronym and one or more additional

18 35 EP 3 032 532 A1 36 words. The processing unit 510 is configured to process and the outputting unit 522) the dialogue response as a (e.g., using the speech processing unit 514) the speech speech output. The heteronym in the dialogue response input using an automatic speech recognition system to is pronounced in the speech output according to the de- determine a frequency of occurrence of a first n-gram termined correct pronunciation. with respect to a corpus and a frequency of occurrence 5 [0132] In some examples, the processing unit 510 is of a second n-gram with respect to the corpus. The first configured to determine (e.g., using the speech process- n-gram includes the heteronym and the one or more ad- ing unit 514) a frequency of occurrence of a first n-gram ditional words. The heteronym in the first n-gram is as- with respect to a corpus. The first n-gram includes the sociated with a first pronunciation. The second n-gram heteronym and the one or more additional words in the includes the heteronym and the one or more additional 10 dialogue response. The heteronym in the first n-gram is words. The heteronym in the second n-gram is associ- associated with a first pronunciation. In some examples, ated with a second pronunciation. The processing unit the processing unit 510 is configured to determine (e.g., 510 is configured to determine (e.g., using the determin- using the speech processing unit 514) a frequency of ing unit 518) a correct pronunciation of the heteronym occurrence of a second n-gram with respect to the cor- based on the frequency of occurrence of the first n-gram 15 pus. The second n-gram includes the heteronym and the and the frequency of occurrence of the second n-gram. one or more additional words. The heteronym in the sec- The processing unit 510 is configured to generate (e.g., ond n-gram is associated with a second pronunciation. using the generating unit 520) a dialogue response to The correct pronunciation of the heteronym in the dia- the speech input. The dialogue response includes the logue response is determined based on the frequency of heteronym. The processing unit 510 is configured to out- 20 occurrence of the first n-gram and the frequency of oc- put (e.g., using the speaker unit 508 and the outputting currence of the second n-gram. unit 522) the dialogue response as a speech output. The [0133] In some examples, the frequency of occurrence heteronym in the dialogue response is pronounced in the of the first n-gram is greater than the frequency of occur- speech output according to the determined correct pro- rence of the second n-gram by at least a predetermined nunciation. 25 amount and the correct pronunciation of the heteronym [0128] In some examples, the frequency of occurrence is determined to be the first pronunciation. of the first n-gram is greater than the frequency of occur- [0134] In some examples, the frequency of occurrence rence of the second n-gram by at least a predetermined of the first n-gram is greater than a first predetermined amount and the correct pronunciation of the heteronym threshold value, the frequency of occurrence of the sec- is determined to be the first pronunciation. 30 ond n-gram is less than a second predetermined thresh- [0129] In some examples, the frequency of occurrence old value, and the correct pronunciation of the heteronym of the first n-gram is greater than a first predetermined is determined to be the first pronunciation. threshold value. 
The frequency of occurrence of the sec- [0135] In some examples, the one or more additional ond n-gram is less than a second predetermined thresh- words precede the heteronym in the dialogue response. old value. The correct pronunciation of the heteronym is 35 In some examples, the contextual information includes determined to be the first pronunciation. information associated with the user. [0130] In some examples, the one or more additional [0136] In some examples, the correct pronunciation of words precede the heteronym in the speech input. the heteronym is determined based at least in part on a [0131] In some examples, the processing unit 510 is custom pronunciation of the heteronym that is associated configured to receive (e.g., from the audio input unit 504 40 with the user. The custom pronunciation can be based and using the receiving unit 512), from a user, a speech on a previous speech input received from the user. input. The processing unit 510 is configured to process [0137] In some examples, the processing unit 510 is (e.g., using the speech processing unit 514) the speech configured to receive (e.g., from the audio input unit 504 input using an automatic speech recognition system to and using the receiving unit 512), from a user, a speech determinea text string corresponding tothe speech input. 45 input. The processing unit 510 is configured to process The processing unit 510 is configured to determine (e.g., (e.g., using the speech processing unit 514) the speech using the determining unit 518) an actionable intent input using an automatic speech recognition system to based on the text string. The processing unit 510 is con- determinea textstring corresponding to thespeech input. figured to generate (e.g., using the generating unit 520) The processing unit 510 is configured to determine (e.g., a dialogue response to the speech input based on the 50 using the determining unit 518) an actionable intent actionable intent. The dialogue response includes the based on the text string. The processing unit 510 is con- heteronym. The processing unit 510 is configured to de- figured to generate (e.g., using the generating unit 520) termine (e.g., using the determining unit 518) a correct a dialogue response to the speech input based on the pronunciation of the heteronym using an n-gram lan- actionable intent. The dialogue response includes the guage model of the automatic speech recognition system 55 heteronym. The processing unit 510 is configured to de- and based on the heteronym and one or more additional termine (e.g., using the determining unit 518) a correct words in the dialogue response. The processing unit 510 pronunciation of the heteronym based on the actionable is configured to output (e.g., using the speaker unit 508 intent. The processing unit 510 is configured to output


(e.g., using the speaker unit 508 and the outputting unit terest to them. The present disclosure contemplates that 522) the dialogue response as a speech output. The het- in some instances, this gathered data may include per- eronym in the dialogue response is pronounced in the sonal information data that uniquely identifies or can be speech output according to the determined correct pro- usedto contactor locatea specificperson. Such personal nunciation. 5 information data can include demographic data, location- [0138] In some examples, the processing unit 510 is based data, telephone numbers, email addresses, home configured to assign (e.g., using the assigning unit 524) addresses, or any other identifying information. the heteronym to a parameter of the actionable intent. [0145] The present disclosure recognizes that the use The correct pronunciation of the heteronym is determined of such personal information data in connection with the based on the parameter. 10 systems, processes, and devices described above, can [0139] In some examples, a vocabulary list is associ- be used to the benefit of users. For example, the personal ated with the actionable intent, the vocabulary list in- information data can be used to determine the correct cludes the heteronym, the heteronym in the vocabulary pronunciation of a heteronym. Accordingly, use of such list is associated with a particular pronunciation, and the personal information data can enable heteronyms to be correct pronunciation of the heteronym is determined 15 pronounced more accurately in speech outputs of the based on the particular pronunciation associated with the system, processes, and devices described above. heteronym in the vocabulary list. [0146] The present disclosure further contemplates [0140] In some examples, the processing unit 510 is that the entities responsible for the collection, analysis, configured to receive (e.g., from the audio input unit 504 disclosure, transfer, storage, or other use of such per- and using the receiving unit 512), from a user, a speech 20 sonal information data will comply with well-established input containing a heteronym and one or more additional privacy policies and/or privacy practices. In particular, words. The processing unit 510 is configured to process such entities should implement and consistently use pri- (e.g., using the speech processing unit 514) the speech vacy policies and practices that are generally recognized input using an automatic speech recognition system to as meeting or exceeding industry or governmental re- determine a phonemic string corresponding to the het- 25 quirements for maintaining personal information data, eronym as pronounced by the user in the speech input. private and secure. For example, personal information The processing unit 510 is configured to generate (e.g., from users should be collected for legitimate and reason- using the generating unit 520) a dialogue response to able uses of the entity and not shared or sold outside of the speech input where the dialogue response includes those legitimate uses. Further, such collection should oc- the heteronym. The processing unit 510 is configured to 30 cur only after receiving the informed consent of the users. output (e.g., using the speaker unit 508 and the outputting Additionally, such entities would take any needed steps unit 522) the dialogue response as a speech output. 
The for safeguarding and securing access to such personal heteronym in the dialogue response is pronounced in the information data and ensuring that others with access to speech output according to the phonemic string. the personal information data adhere to their privacy pol- [0141] In some examples, the phonemic string is de- 35 icies and procedures. Further, such entities can subject termined using an acoustic model of the automatic themselves to evaluation by third parties to certify their speech recognition system. In some examples, output- adherence to widely accepted privacy policies and prac- ting the dialogue response includes synthesizing the het- tices. eronym in the dialogue response using a speech synthe- [0147] Despite the foregoing, the present disclosure sizer. The dialogue response is synthesized based on 40 also contemplates examples in which users selectively the phonemic string. block the use of, or access to, personal information data. [0142] In some examples, the phonemic string is That is, the present disclosure contemplates that hard- stored in metadata that is associated with the heteronym ware and/or software elements can be provided to pre- in the dialogue response. The metadata is accessed by vent or block access to such personal information data. the speech synthesizer to synthesize the heteronym in 45 For example, in the case of advertisement delivery serv- the dialogue response according to the phonemic string. ices, the systems and devices described above can be [0143] Although examples have been fully described configured to allow users to select to "opt in" or "opt out" with reference to the accompanying drawings, it is to be of participation in the collection of personal information noted that various changes and modifications will be- data during registration for services. In another example, come apparent to those skilled in the art. Such changes 50 users can select not to provide location information for and modifications are to be understood as being included targeted content delivery services. In yet another exam- within the scope of the various examples as defined by ple, users can select to not provide precise location in- the appended claims. formation, but permit the transfer of location zone infor- [0144] In some cases, the systems, processes, and mation. devices described above can include the gathering and 55 [0148] Therefore, although the present disclosure use of data (e.g., contextual information) available from broadly covers use of personal information data to im- various sources to improve the delivery to users of invi- plement one or more various disclosed examples, the tational content or any other content that may be of in- present disclosure also contemplates that the various ex-

Further statements of invention:

[0149]

1. A method for operating an intelligent automated assistant, the method comprising:
   at an electronic device with a processor and memory storing one or more programs for execution by the processor:
      receiving, from a user, a speech input containing a heteronym and one or more additional words;
      processing the speech input using an automatic speech recognition system to determine at least one of:
         a phonemic string corresponding to the heteronym as pronounced by the user in the speech input; and
         a frequency of occurrence of an n-gram with respect to a corpus, wherein the n-gram includes the heteronym and the one or more additional words;
      determining a correct pronunciation of the heteronym based on at least one of the phonemic string and the frequency of occurrence of the n-gram;
      generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
      outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

2. The method of statement 1, wherein processing the speech input using the automatic speech recognition system includes determining a text string corresponding to the speech input, and further comprising:
   determining an actionable intent based on the text string, wherein the correct pronunciation of the heteronym is determined based on at least one of the phonemic string, the frequency of occurrence of the n-gram, and the actionable intent.

3. The method of statement 2, further comprising:
   assigning the heteronym to a parameter of the actionable intent, wherein the correct pronunciation of the heteronym is determined based at least in part on the parameter.

4. The method of statement 2, wherein:
   a vocabulary list is associated with the actionable intent;
   the vocabulary list includes the heteronym;
   the heteronym in the vocabulary list is associated with a particular pronunciation; and
   the correct pronunciation of the heteronym is determined based on the particular pronunciation associated with the heteronym in the vocabulary list.

5. The method of any of statements 2-4, further comprising:
   receiving contextual information associated with the speech input, wherein the actionable intent is determined based at least in part on the contextual information.

6. The method of any of statements 1-5, wherein:
   the heteronym in the n-gram is associated with a first pronunciation;
   processing the speech input using the automatic speech recognition system includes determining a frequency of occurrence of a second n-gram with respect to the corpus;
   the second n-gram includes the heteronym and the one or more additional words;
   the heteronym in the second n-gram is associated with a second pronunciation; and
   the correct pronunciation of the heteronym is determined based on the frequency of occurrence of the n-gram and the frequency of occurrence of the second n-gram.

7. The method of statement 6, wherein the frequency of occurrence of the n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

8. The method of statement 6, wherein the frequency of occurrence of the first n-gram is greater than a first predetermined threshold value, wherein the frequency of occurrence of the second n-gram is less than a second predetermined threshold value, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

9. The method of statement 6, wherein the phonemic string corresponds to the second pronunciation, wherein the frequency of occurrence of the n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

10. The method of any of statements 1-9, further comprising:
   obtaining from the automatic speech recognition system a second phonemic string corresponding to the determined correct pronunciation, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the speech synthesizer uses the second phonemic string to synthesize the heteronym in the speech output according to the correct pronunciation.

11. The method of any of statements 1-10, further comprising:
   annotating the heteronym in the dialogue response with a tag to identify the correct pronunciation of the heteronym, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the heteronym in the dialogue response is synthesized based on the tag.

12. The method of any of statements 1-11, further comprising:
   receiving contextual information associated with the speech input, wherein the correct pronunciation of the heteronym is determined based at least in part on the contextual information.

13. The method of statement 12, wherein the contextual information includes information associated with the user.

14. The method of any of statements 1-13, wherein the correct pronunciation of the heteronym is determined based at least in part on a custom pronunciation of the heteronym that is associated with the user, and wherein the custom pronunciation is based on a previous speech input received from the user.

15. A method for operating an intelligent automated assistant, the method comprising:
   at an electronic device with a processor and memory storing one or more programs for execution by the processor:
      receiving, from a user, a speech input containing a heteronym and one or more additional words;
      processing the speech input using an automatic speech recognition system to determine a frequency of occurrence of a first n-gram with respect to a corpus and a frequency of occurrence of a second n-gram with respect to the corpus, wherein:
         the first n-gram includes the heteronym and the one or more additional words;
         the heteronym in the first n-gram is associated with a first pronunciation;
         the second n-gram includes the heteronym and the one or more additional words; and
         the heteronym in the second n-gram is associated with a second pronunciation;
      determining a correct pronunciation of the heteronym in the speech input based on the frequency of occurrence of the first n-gram and the frequency of occurrence of the second n-gram;
      generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
      outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

16. The method of statement 15, wherein the frequency of occurrence of the first n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

17. The method of statement 15, wherein the frequency of occurrence of the first n-gram is greater than a first predetermined threshold value, wherein the frequency of occurrence of the second n-gram is less than a second predetermined threshold value, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

18. The method of any of statements 15-17, wherein the one or more additional words precede the heteronym in the speech input.

19. The method of any of statements 15-18, further comprising:
   obtaining from the automatic speech recognition system a phonemic string corresponding to the determined correct pronunciation, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the speech synthesizer uses the phonemic string to synthesize the heteronym in the speech output according to the correct pronunciation.

20. The method of any of statements 15-19, further comprising:
   annotating the heteronym in the dialogue response with a tag to identify the correct pronunciation of the heteronym, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the heteronym in the dialogue response is synthesized based on the tag.

21. The method of any of statements 15-20, further comprising:
   receiving contextual information associated with the speech input, wherein the correct pronunciation of the heteronym is determined based at least in part on the contextual information.

22. The method of statement 21, wherein the contextual information includes information associated with the user.

23. The method of any of statements 15-22, wherein the correct pronunciation of the heteronym is determined based at least in part on a custom pronunciation of the heteronym that is associated with the user, and wherein the custom pronunciation is based on a previous speech input received from the user.

24. A method for operating an intelligent automated assistant, the method comprising:
   at an electronic device with a processor and memory storing one or more programs for execution by the processor:
      receiving, from a user, a speech input;
      processing the speech input using an automatic speech recognition system to determine a text string corresponding to the speech input;
      determining an actionable intent based on the text string;
      generating a dialogue response to the speech input based on the actionable intent, wherein the dialogue response includes a heteronym;
      determining a correct pronunciation of the heteronym using an n-gram language model of the automatic speech recognition system and based on the heteronym and one or more additional words in the dialogue response; and
      outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

25. The method of statement 24, further comprising:
   determining a frequency of occurrence of a first n-gram with respect to a corpus, wherein the first n-gram includes the heteronym and the one or more additional words in the dialogue response, and wherein the heteronym in the first n-gram is associated with a first pronunciation; and
   determining a frequency of occurrence of a second n-gram with respect to the corpus, wherein the second n-gram includes the heteronym and the one or more additional words, wherein the heteronym in the second n-gram is associated with a second pronunciation, and wherein the correct pronunciation of the heteronym in the dialogue response is determined based on the frequency of occurrence of the first n-gram and the frequency of occurrence of the second n-gram.

26. The method of statement 25, wherein the frequency of occurrence of the first n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

27. The method of statement 25, wherein the frequency of occurrence of the first n-gram is greater than a first predetermined threshold value, wherein the frequency of occurrence of the second n-gram is less than a second predetermined threshold value, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

28. The method of any of statements 24-27, wherein the one or more additional words precede the heteronym in the dialogue response.

29. The method of any of statements 24-28, further comprising:
   obtaining from the automatic speech recognition system a phonemic string corresponding to the determined correct pronunciation, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the speech synthesizer uses the phonemic string to synthesize the heteronym in the speech output according to the determined correct pronunciation.

30. The method of any of statements 24-28, further comprising:
   annotating the heteronym in the dialogue response with a tag to identify the correct pronunciation of the heteronym, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the heteronym in the dialogue response is synthesized based on the tag.

31. The method of any of statements 24-30, further comprising:
   receiving contextual information associated with the speech input, wherein the correct pronunciation of the heteronym is determined based at least in part on the contextual information.

32. The method of statement 31, wherein the contextual information includes information associated with the user.

33. The method of any of statements 24-32, wherein the correct pronunciation of the heteronym is determined based at least in part on a custom pronunciation of the heteronym that is associated with the user, and wherein the custom pronunciation is based on a previous speech input received from the user.

34. A method for operating an intelligent automated assistant, the method comprising:
   at an electronic device with a processor and memory storing one or more programs for execution by the processor:
      receiving, from a user, a speech input;
      processing the speech input using an automatic speech recognition system to determine a text string corresponding to the speech input;
      determining an actionable intent based on the text string;
      generating a dialogue response to the speech input based on the actionable intent, wherein the dialogue response includes a heteronym;
      determining a correct pronunciation of the heteronym based on the actionable intent; and
      outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

35. The method of statement 34, further comprising:
   assigning the heteronym to a parameter of the actionable intent, wherein the correct pronunciation of the heteronym is determined based on the parameter.

36. The method of statement 34, wherein:
   a vocabulary list is associated with the actionable intent;
   the vocabulary list includes the heteronym;
   the heteronym in the vocabulary list is associated with a particular pronunciation; and
   the correct pronunciation of the heteronym is determined based on the particular pronunciation associated with the heteronym in the vocabulary list.

37. The method of any of statements 34-36, further comprising:
   receiving contextual information associated with the speech input, wherein the correct pronunciation of the heteronym is determined based at least in part on the contextual information.

38. The method of any of statements 34-37, further comprising:
   obtaining from the automatic speech recognition system a phonemic string corresponding to the determined correct pronunciation, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the speech synthesizer uses the phonemic string to synthesize the heteronym in the speech output according to the correct pronunciation.

39. The method of any of statements 34-38, further comprising:
   annotating the heteronym in the dialogue response with a tag to identify the correct pronunciation of the heteronym, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the heteronym in the dialogue response is synthesized based on the tag.

40. A method for operating an intelligent automated assistant, the method comprising:
   at an electronic device with a processor and memory storing one or more programs for execution by the processor:
      receiving, from a user, a speech input containing a heteronym and one or more additional words;
      processing the speech input using an automatic speech recognition system to determine a phonemic string corresponding to the heteronym as pronounced by the user in the speech input;
      generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
      outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the phonemic string.

41. The method of statement 40, wherein the phonemic string is determined using an acoustic model of the automatic speech recognition system.

42. The method of any of statements 40-41, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the dialogue response is synthesized based on the phonemic string.

43. The method of statements 40-42, wherein the phonemic string is stored in metadata that is associated with the heteronym in the dialogue response, and wherein the metadata is accessed by the speech synthesizer to synthesize the heteronym in the dialogue response according to the phonemic string.

44. A computer readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the one or more processors to perform any of the methods of statements 1-43.

45. A system comprising:
   a processor; and
   memory having instructions stored thereon, the instructions, when executed by the processor, cause the processor to perform any of the methods of statements 1-43.

46. An electronic device comprising:
   means for receiving, from a user, a speech input containing a heteronym and one or more additional words;
   means for processing the speech input using an automatic speech recognition system to determine at least one of:
      a phonemic string corresponding to the heteronym as pronounced by the user in the speech input; and
      a frequency of occurrence of an n-gram with respect to a corpus, wherein the n-gram includes the heteronym and the one or more additional words, and wherein the heteronym in the n-gram is associated with a first pronunciation;
   means for determining a correct pronunciation of the heteronym based on at least one of the phonemic string and the frequency of occurrence of the n-gram;
   means for generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
   means for outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

47. An electronic device comprising:
   means for receiving, from a user, a speech input containing a heteronym and one or more additional words;
   means for processing the speech input using an automatic speech recognition system to determine a frequency of occurrence of a first n-gram with respect to a corpus and a frequency of occurrence of a second n-gram with respect to the corpus, wherein:
      the first n-gram includes the heteronym and the one or more additional words;
      the heteronym in the first n-gram is associated with a first pronunciation;
      the second n-gram includes the heteronym and the one or more additional words; and
      the heteronym in the second n-gram is associated with a second pronunciation;
   means for determining a correct pronunciation of the heteronym in the speech input based on the frequency of occurrence of the first n-gram and the frequency of occurrence of the second n-gram;
   means for generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
   means for outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

48. An electronic device comprising:
   means for receiving, from a user, a speech input;
   means for processing the speech input using an automatic speech recognition system to determine a text string corresponding to the speech input;
   means for determining an actionable intent based on the text string;
   means for generating a dialogue response to the speech input based on the actionable intent, wherein the dialogue response includes a heteronym;
   means for determining a correct pronunciation of the heteronym using an n-gram language model of the automatic speech recognition system and based on the heteronym and one or more additional words in the dialogue response; and
   means for outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

49. An electronic device comprising:
   means for receiving, from a user, a speech input;
   means for processing the speech input using an automatic speech recognition system to determine a text string corresponding to the speech input;
   means for determining an actionable intent based on the text string;
   means for generating a dialogue response to the speech input based on the actionable intent, wherein the dialogue response includes a heteronym;
   means for determining a correct pronunciation of the heteronym based on the actionable intent; and
   means for outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

50. An electronic device comprising:
   means for receiving, from a user, a speech input containing a heteronym and one or more additional words;
   means for processing the speech input using an automatic speech recognition system to determine a phonemic string corresponding to the heteronym as pronounced by the user in the speech input;
   means for generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
   means for outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the phonemic string.
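By way of illustration only, the following Python sketch suggests one way the output-side disambiguation recited in statements 24-27 and 48 could work: the words surrounding the heteronym in the dialogue response are looked up in an n-gram model whose entries tag the heteronym with one of its pronunciations, and the more frequent variant is selected. The counts, the pronunciation-tagged keys, and the function name are invented for this example and are not taken from the application.

from collections import Counter

# Hypothetical pronunciation-tagged trigram counts drawn from a corpus.
NGRAM_COUNTS = Counter({
    ("weather", "in", "nice/n iy s"): 9200,   # "Nice" the city
    ("weather", "in", "nice/n ay s"): 15,     # "nice" the adjective
    ("have", "a", "nice/n ay s"): 31000,
})

def pronunciation_from_ngram_model(prev_words, heteronym, pronunciations):
    # Pick the pronunciation whose tagged n-gram is most frequent in the corpus;
    # missing n-grams simply count as zero.
    best_pron, best_count = None, -1
    for pron in pronunciations:
        key = (*prev_words, f"{heteronym.lower()}/{pron}")
        count = NGRAM_COUNTS[key]
        if count > best_count:
            best_pron, best_count = pron, count
    return best_pron

# Dialogue response: "Here is the weather in Nice"
print(pronunciation_from_ngram_model(("weather", "in"), "Nice", ["n iy s", "n ay s"]))
# prints "n iy s"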
Claims

1. A method for operating an intelligent automated assistant, the method comprising:
   at an electronic device with a processor and memory storing one or more programs for execution by the processor:
      receiving, from a user, a speech input containing a heteronym and one or more additional words;
      processing the speech input using an automatic speech recognition system to determine at least one of:
         a phonemic string corresponding to the heteronym as pronounced by the user in the speech input; and
         a frequency of occurrence of an n-gram with respect to a corpus, wherein the n-gram includes the heteronym and the one or more additional words;
      determining a correct pronunciation of the heteronym based on at least one of the phonemic string and the frequency of occurrence of the n-gram;
      generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
      outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

2. The method of claim 1, wherein processing the speech input using the automatic speech recognition system includes determining a text string corresponding to the speech input, and further comprising:
   determining an actionable intent based on the text string, wherein the correct pronunciation of the heteronym is determined based on at least one of the phonemic string, the frequency of occurrence of the n-gram, and the actionable intent.

3. The method of claim 2, further comprising:
   assigning the heteronym to a parameter of the actionable intent, wherein the correct pronunciation of the heteronym is determined based at least in part on the parameter.

4. The method of claim 2, wherein:
   a vocabulary list is associated with the actionable intent;
   the vocabulary list includes the heteronym;
   the heteronym in the vocabulary list is associated with a particular pronunciation; and
   the correct pronunciation of the heteronym is determined based on the particular pronunciation associated with the heteronym in the vocabulary list.

5. The method of any of claims 2-4, further comprising:
   receiving contextual information associated with the speech input, wherein the actionable intent is determined based at least in part on the contextual information.

6. The method of any of claims 1-5, wherein:
   the heteronym in the n-gram is associated with a first pronunciation;
   processing the speech input using the automatic speech recognition system includes determining a frequency of occurrence of a second n-gram with respect to the corpus;
   the second n-gram includes the heteronym and the one or more additional words;
   the heteronym in the second n-gram is associated with a second pronunciation; and
   the correct pronunciation of the heteronym is determined based on the frequency of occurrence of the n-gram and the frequency of occurrence of the second n-gram.

7. The method of claim 6, wherein the frequency of occurrence of the n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

8. The method of claim 6, wherein the frequency of occurrence of the first n-gram is greater than a first predetermined threshold value, wherein the frequency of occurrence of the second n-gram is less than a second predetermined threshold value, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

9. The method of claim 6, wherein the phonemic string corresponds to the second pronunciation, wherein the frequency of occurrence of the n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.

10. The method of any of claims 1-9, further comprising:
   obtaining from the automatic speech recognition system a second phonemic string corresponding to the determined correct pronunciation, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the speech synthesizer uses the second phonemic string to synthesize the heteronym in the speech output according to the correct pronunciation.

11. The method of any of claims 1-10, further comprising:
   annotating the heteronym in the dialogue response with a tag to identify the correct pronunciation of the heteronym, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the heteronym in the dialogue response is synthesized based on the tag.

12. The method of any of claims 1-11, further comprising:
   receiving contextual information associated with the speech input, wherein the correct pronunciation of the heteronym is determined based at least in part on the contextual information.

13. The method of any of claims 1-12, wherein the correct pronunciation of the heteronym is determined based at least in part on a custom pronunciation of the heteronym that is associated with the user, and wherein the custom pronunciation is based on a previous speech input received from the user.

14. A computer readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the one or more processors to perform any of the methods of claims 1-13.

15. A system comprising:
   a processor; and
   memory having instructions stored thereon, the instructions, when executed by the processor, cause the processor to perform any of the methods of claims 1-13.
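By way of illustration only, the following Python sketch restates the frequency comparisons of claims 7-9 as a simple decision rule. The function name, the margin, the thresholds, and the example counts are invented for this sketch and are not taken from the application.

def choose_pronunciation(freq_first, freq_second, first_pron, second_pron,
                         margin=1000, high_threshold=5000, low_threshold=100):
    # Claims 7 and 9: the first n-gram outnumbers the second by at least a
    # predetermined amount, so the first pronunciation is selected even when the
    # phonemic string recognized from the user matched the second pronunciation.
    if freq_first - freq_second >= margin:
        return first_pron
    # Claim 8: the first n-gram exceeds a first threshold while the second n-gram
    # stays below a second threshold.
    if freq_first > high_threshold and freq_second < low_threshold:
        return first_pron
    # Otherwise keep the pronunciation associated with the second n-gram.
    return second_pron

# Invented counts for "weather in Nice" (city, "n iy s") vs "weather in nice" (adjective):
print(choose_pronunciation(9200, 15, "n iy s", "n ay s"))  # prints "n iy s"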

