
Constructing a Low-Cost, Open-Source, VoiceXML gateway

Submitted in fulfilment of the requirements of the degree Masters in Computer Science of Rhodes University

Adam King

March 22, 2007

Abstract

Voice-enabled applications, applications that interact with a user via an audio channel, are used extensively today. Their use is growing as speech-related technologies improve, since speech is one of the most natural methods of interaction. They can provide customer support as IVRs, can be used as an assistive technology, or can become an aural interface to the Internet. Given that the telephone is used extensively throughout the globe, the number of potential users of voice-enabled applications is very high.

VoiceXML is a popular, open, high-level, standard means of creating voice-enabled applications which was designed to bring the benefits of web-based development to voice services. While VoiceXML is an ideal language for creating these applications, VoiceXML gateways, the hardware and software responsible for interpreting VoiceXML applications and interfacing with the PSTN, are still expensive, and so there is a need for a low-cost gateway.

Asterisk, an open-source TDM/VoIP telephony platform, can be used as a low-cost PSTN interface. This thesis investigates adding a VoiceXML service to Asterisk, creating a low-cost prototype VoiceXML gateway which is able to render voice-enabled applications.

Following the Component-Based Software Engineering (CBSE) paradigm, the VoiceXML gateway is divided into a set of components which are sourced from the open-source community and integrated to create the gateway. The browser requires a VoiceXML interpreter (OpenVXI), a Text-To-Speech engine (Festival) and a speech recognition engine (Sphinx 4).

The integration of the components results in a low-cost, open-source VoiceXML gateway. System tests show that the integration of the components was successful, and that the system can handle concurrent calls. A fully compliant version of the gateway can be used in the real world to render voice-enabled applications at a low cost.

Acknowledgements

Most importantly, to my family. Your support, love and sacrifices will never go unnoticed, and have got me where I am today.

To Alf, whose influence and hard work is evident on every page of this thesis. Thanks for all the years of guidance, help and advice.

To Peter, who has provided valuable input and helped shape this project into what it is today. Thank you for the time and effort spent in making my masters studies possible.

To the Computer Science department and associated staff here at Rhodes. It has been a pleasure being part of such a great team.

To my peers, and friends, for all the support and friendship, and for providing a valuable place to bounce ideas off.

Finally, to Grahamstown, Rhodes and the Rat, without which I wouldn't be who I am today.

Contents

1 Introduction 10
1.1 Aim 11
1.2 Methodologies and environment 12
1.2.1 Component-based software engineering 12
Thesis Implications 14
1.2.2 Agile and extreme programming 14
Thesis Implications 17
1.2.3 Environment 17
1.3 Document overview 18

2 Related work 19
2.1 Voice-enabled applications and IVRs 19
2.2 VoiceXML 22
2.2.1 Origins and definitions 22
2.2.2 VoiceXML in detail 26
2.2.2.1 Architectural model 26
2.2.2.2 Implementation platform requirements 28
2.2.2.3 Applications 28
2.2.2.4 Form Interpretation Algorithm (FIA) 29
2.2.2.5 Summary 32
2.2.3 Advantages of VoiceXML 32
2.2.4 Summary 33
2.3 Asterisk 34
2.3.1 What is Asterisk? 34
2.3.2 The PSTN interface 37
2.3.3 Extending Asterisk 38
2.3.4 Summary 39
2.4 Summary 40

3 Component selection and system design 41
3.1 Requirements and restrictions 41
3.2 OpenVXI 43
3.2.1 Speech browser 44
3.2.2 Platform components 45
3.2.3 Summary 46
3.3 Festival 46
3.3.1 Festival interfaces 47
3.3.2 Alternatives 48
3.3.3 Summary 48
3.4 Sphinx 49
3.4.1 Alternatives 50
3.4.2 Summary 50
3.5 System architecture 51
3.6 Summary 52

4 Building the gateway: Phase 1 53
4.1 Creating an Asterisk application 53
4.1.1 Creating the test 54
4.1.2 The application module 54
4.2 Asterisk and OpenVXI 57
4.2.1 Creating the test 57
4.2.2 Adding the VoiceXML interpreter to the voicexml application 58
4.2.3 Running the test 59
4.3 Giving the gateway voice 62
4.3.1 The prompt API 62
4.3.2 SSML parser 64
4.3.3 Creating the prompt queue 66
4.3.4 Playing the prompts 67
4.3.4.1 Audio 68
4.3.4.2 Synthesized speech 70
4.3.5 Running the test 73
4.4 Summary 74

5 Building the gateway: Phase 2 75
5.1 Creating the test 76
5.2 Adding DTMF support 77
5.3 Adding Sphinx 80
5.3.1 Creating Sphinx applications 80
5.3.2 JNI and pipes - a first attempt 81
5.3.3 The client-server approach 85
5.3.4 SphinxFarm 87
5.3.5 Testing the recognize application 89
5.4 Results 92
5.5 Final testing 93
5.6 Summary 96

6 Conclusion 98
6.1 Completed work 98
6.2 Achievements 100
6.3 Future work and extensions 100
6.4 Conclusion 101

References 102

A VoiceXML 2.0 elements 110
B VoiceXML 2.1 elements 112
C Sample zaptel configuration files 113
D Application module skeleton 115
E A Sphinx 4 application 117
F Sphinx configuration 119

List of Figures

2.1 A web browser requesting and displaying a web page 23
2.2 The W3C Speech Interface Framework [90] 24
2.3 A simple 'Press One' VoiceXML application 25
2.4 The VoiceXML architectural model, as described by the VoiceXML 2.0 specification [56] 27
2.5 A VoiceXML gateway requesting and rendering a VoiceXML page 33
2.6 A simple 'Press one' dial plan IVR 35
2.7 A simple 'Press one' IVR system with re-prompting capabilities 36
3.1 The OpenVXI architecture, as described by Eberman et al. [24] 44
3.2 The VoiceXML gateway's architecture 51
4.1 The core of the gateway after Asterisk and OpenVXI are integrated 61
4.2 The new architecture reflecting the integration of Festival 74
5.1 The final gateway architecture 94

List of Tables

2.1 Vcommerce revenue models and their chances of success as described by Duggan [23] 21
2.2 Possible PSTN interfaces in South Africa and their current approximate prices (October 2006) [60] 38
3.1 Comparison of Sphinx 3.3 and Sphinx 4 performances, as shown in [14] 49
5.1 Sample test results 91
A.1 VoiceXML elements in the VoiceXML 2.0 specification 111
B.1 Elements added or enhanced in the VoiceXML 2.1 specification 112

Glossary of terms

API Application Programming Interface.
ASR Automatic Speech Recognition.
ATDM Automated Telephone Disease Management.
BRI Basic Rate Interface.
CBSE Component-Based Software Engineering.
CLI Command Line Interface.
COM Component Object Model.
CORBA Common Object Request Broker Architecture.
COTS Commercial Off The Shelf.
DTMF Dual Tone Multi Frequency.
EMMA Extensible MultiModal Annotation.
FIFO First In First Out.
FXO Foreign Exchange Office.
FXS Foreign Exchange Station.
HMM Hidden Markov Model.
HTML HyperText Markup Language.
IAX Inter-Asterisk eXchange.
IDE Integrated Development Environment.
IPC Inter Process Communication.
ISDN Integrated Services Digital Network.
IVR Interactive Voice Response.
JNI Java Native Interface.
JVM Java Virtual Machine.
LAN Local Area Network.
MGCP Media Gateway Control Protocol.
PBX Private Branch eXchange.
PCM Pulse-Code Modulation.
PDA Personal Digital Assistant.
PML Phone Markup Language.
PRI Primary Rate Interface.
PSTN Public Switched Telephone Network.
SIP Session Initiation Protocol.
SRGS Speech Recognition Grammar Specification.
SSML Speech Synthesis Markup Language.
TDM Time Division Multiplex.
TTS Text To Speech.
UDP User Datagram Protocol.
URL Uniform Resource Locator.
VOFR Voice Over Frame Relay.
VoiceXML Voice eXtensible Markup Language.
VoIP Voice over Internet Protocol.
VSP Voice hosting Service Provider.
WER Word Error Rate.
WSJ Wall Street Journal.
XML eXtensible Markup Language.
XP eXtreme Programming.

Chapter 1

Introduction

Voice-enabled services have been around for a while, and they are becoming increasingly popular as speech-related technologies improve. The potential to reach vast numbers of users is very large. Telephones are a familiar sight in many homes and businesses worldwide, and provide a relatively inexpensive, reliable means of communication. It follows that a voice-enabled application which is able to interface with the PSTN has the potential to reach many users cheaply and reliably. In addition, many people are used to interacting with the telephone, and it is therefore a very familiar means of communication. While full natural language processing is not achievable at present, well designed speech user interfaces can result in satisfied users.

The most common voice-enabled applications are Interactive Voice Response (IVR) systems, which traditionally have been created using low-level, platform-specific languages. Newer examples of voice-enabled applications include multimodal applications and information interfaces, both of which require a bridging of the PSTN and IP networks, and applications to assist the visually impaired or the elderly. VoiceXML (Voice eXtensible Markup Language) has been proposed for creating voice-enabled applications. VoiceXML is an open, high-level, XML-based language aimed at creating audio dialogs. Using VoiceXML brings all the advantages of web-based development to the development of voice-enabled applications.

VoiceXML has sufficient functionality to replace, and extend, traditional IVR systems in a standard, portable manner, as well as to create any other voice-enabled application. These applications, described by VoiceXML, require rendering on a VoiceXML browser, in a relationship very similar to the one that exists between web applications and web browsers. Because of the number of potential PSTN users, these browsers should be capable of interacting with the PSTN. A browser with these capabilities can be referred to as a gateway. These gateways must have specialised hardware to interface with the PSTN, which is very expensive. Consequently, VoiceXML gateway prices can begin at $10 000 USD [37]. Hosting can be considered as a means of reducing costs. Hosted solutions do not require the outright purchase of a gateway; instead lines are hired on gateways provided by entities known as Voice hosting Service Providers (VSPs) [21]. While this is beneficial, hosting costs are still high, because the VSPs need to recoup the high gateway costs.

There is, therefore, a need for a low-cost VoiceXML gateway, and the purpose of this thesis is to investigate the construction of such a VoiceXML gateway. Asterisk, an open-source telephony platform, provides an inexpensive PSTN interface through PC-based hardware, and so gives us the opportunity to create a low-cost gateway. In this thesis we will be investigating the addition of a VoiceXML service to the Asterisk system.

1.1 Aim

Defining a set of goals provides a clear benchmark against which the success of the thesis can be evaluated. As stated, the central thrust of the thesis revolves around the construction of a VoiceXML gateway; more specifically, whether or not it is possible to create a VoiceXML gateway which is both inexpensive and effective.

While it may be possible for an individual to create a browser from scratch, it is perhaps an unreasonable undertaking. We will therefore investigate the feasibility of constructing a usable VoiceXML gateway using the Component-Based Software Engineering methodology (Section 1.2.1). Due to time constraints the implemented gateway will not be fully compliant, but will instead be a proof-of-concept system.

Keeping the cost of the browser as low as possible is a major priority, and so other components/systems used in the creation of the browser need to be at least free and preferably, where possible, open-source.

Therefore the research question of the thesis is:

Is it possible to create a low-cost, open-source VoiceXML gateway using Asterisk as the telephony platform and integrating open-source components to provide the required functionality?

We will answer this question through the actual construction and testing of such a gateway.

1.2 Methodologies and environment

This thesis will subscribe, to a certain degree and where appropriate, to certain methodologies. The task of creating a VoiceXML gateway can be approached as the integration of two different disciplines, namely Software Engineering and Systems Development. The Software Engineering aspect governs how the system is structured - essentially the evolution of the system's architecture - while the actual implementation of the system is a Systems Development problem.

The two methodologies used for the thesis, Component-Based Software Engineering and Extreme Programming, will be discussed here.

1.2.1 Component-based software engineering

The concept of Component-Based Software Engineering (CBSE) is not new; it was originally proposed by McIlroy at a NATO Software Engineering conference in 1968 [57]. He suggested a change in the software industry whereby a new 'sub-industry' dealing in components would enable the mass production of software, similar to the production of hardware. This new industry would create independent components. In McIlroy's proposal these components were routines of a high quality, which could be used in many different contexts. Software systems would then be created by integrating those routines which best implement the required portions of functionality. In addition to diminishing development times, this process would also result in the creation of software with improved reliability and performance, as the components are written, once, by experts, and then distributed across many different applications [57].

The modern definition of CBSE does not appear to have changed dramatically. It can still be thought of as a building-block style of development [66], with the additional restriction of having to fall within the constraints of a clearly defined architecture [13]. There is evidence of this industry being active today; successful examples of companies selling components are componentsource.com, ilog.com and flashline.com [86]. The total value of components sold in 1998 was $440 million USD.

While improving development time, reliability and performance, CBSE also leads to ease of maintenance [18, 13] and improved flexibility [10].

The definition of a component has evolved somewhat from what was originally proposed as a routine. Cox and Song [18] refer to a component as an independent software artifact which has a clearly defined interface to provide a specific functionality. Cai et al. [13] state that a component has three features - it is an independent and replaceable part of a system which provides a specific functionality, forms part of an architecture and communicates with other components via interfaces. Grundy et al. [32] define a component as being an extension of an object. Events are used to advertise the component's behavior, and the component is responsible for making public information about its data, methods and events, allowing interaction from other components. Conversely, Szyperski distinguishes between a component and an object [86]. He claims that a component is an independent, executable piece of software which has no externally visible state and which may be created by a third party. Objects, on the other hand, are not independent and executable, are instantiated and have a unique identity.

While these definitions may differ, there are some common themes. A component is an independent, executable piece of software which provides a defined functionality within a well defined architecture, and is replaceable. Components within a system interact through well defined interfaces.

Components can be sourced by two methods: building components from scratch or obtaining pre-built ones. Building components from scratch forfeits any development time gains, but can still provide flexibility, performance benefits, ease of maintenance and the possibility of structuring development teams. In addition to these benefits, purpose-built components will best fit the system, not requiring the configuration effort third-party components would [86]. Since one of the main motivating factors for using CBSE in this thesis is that its use can significantly reduce development time and simplify the development process, we are interested only in obtaining third-party components. Such components are known as Commercial Off The Shelf (COTS) components. When using COTS components the Software Engineering task becomes one of "application assembly" [10]. This means, however, that one is reliant on the creator of the COTS component for long term support [10]. To avoid such vendor lock-in when creating COTS-based component systems, it is desirable to create an open system, as described by Oberndorf [67]. An open component-based system is one which allows simple interchanging of components. Essentially, it strives for flexibility. This is achieved by using well documented and standard interfaces represented by APIs, protocols or file standards. Open in this context refers to the interfaces being open.

We can now define a component as it will be used for the purposes of this thesis:

Component: An independent, replaceable, executable piece of third-party software with well de­ fined interfaces used to provide a specific functionality within a larger system.

Following a COTS-based approach may complicate the issue of component selection somewhat, as there is no guarantee that the components will run on the same platforms or be written in the same languages. While careful component selection may help, it can also lead to a possible sacrifice in quality if the ideal component is not compatible with the system's chosen platform or language. There are, however, a number of CBSE enablers available, such as the Common Object Request Broker Architecture (CORBA), the Component Object Model (COM) and the subsequent .NET framework, and JavaBeans [13, 66], which facilitate component interaction. In addition to these formal methods, techniques such as simple client-server architectures can be used to integrate components.

CBSE is therefore an excellent methodology to facilitate the creation of a VoiceXML gateway. This piecemeal approach can drastically reduce the time and effort required to develop the gateway. In addition, it will result in improved stability and performance. The creation of an open system means that the components of the gateway will be 'plug and play-able'. This allows other developers with a different set of requirements, such as not having a low-cost prerequisite, to add components of their choice.

Thesis Implications What are the implications of the CBSE approach for our project? Firstly, the VoiceXML gateway requirements have to be determined and analyzed to identify the required components. Next, the components themselves must be sourced. Besides having to be supported, be sustained and have open interfaces, the components also have to adhere to the requirements prescribed by the thesis: they have to be pre-built COTS systems which have open standards, are robust and free, and must be able to perform adequately in a real world environment. Succinctly, a COTS-based, open-source system with open standards.

Once the components have been identified and sourced, a simple architecture can be created displaying the components' interaction. Finally, the components can be integrated according to a development methodology.

1.2.2 Agile and extreme programming

Attention to Systems Development methodologies has been shown to be beneficial to development projects [26]. Still, traditional methods such as the Waterfall method, the Unified Process and the Spiral Model have been criticized, in part, for not being responsive enough to change, for being overly complex and for treating people as predictable entities [53, 2]. Sommerville [81] discusses the need for Agile methodologies. Traditional 'plan-based' software development methodologies emphasized careful and meticulous planning, formal documentation and a tightly controlled development process. A large effort is therefore required before the actual development process can begin. The rigidity and the necessity of documentation were due to the large size of the projects and the fact that development teams were often disparate and geographically separated. The planning process represents a large overhead, which can overwhelm smaller projects. In addition, because of the pre-development effort, traditional methods do not cater for changes in requirements very well. When change occurs it occurs at a high cost, as the specifications and designs have to be altered to reflect the changes.

Agile methods, as described by Sommerville [81], are considered an alternative to these 'plan-driven', heavyweight processes. These Agile methods all subscribe to the Agile Manifesto [40].

Focusing on the actual creation of software as opposed to design and documentation, Agile methods are capable of reducing both the overhead and the cost of change. These methods are all iterative, with quick development cycles aimed at frequently delivering working software to the customer. Abrahamsson [1] states that a methodology can be considered Agile if it is incremental, cooperative, straightforward and adaptive.

While there are many methods subscribing to the Agile process, there are a number of principles common to each. Users are heavily involved in the development process, providing and prioritizing new requirements, and testing functionality. Delivery of software is incremental, with each delivery containing new functionality. Emphasis is placed on the people (and their skills) involved in the development process as opposed to developers following a rigid development process. Agile projects 'embrace change'; in fact, they expect change and proceed accordingly. Finally, these projects must attempt to maintain simplicity, both in the software and in the development process.

The perceived lack of structure and discipline involved with Agile methods leads to the criticism that they promote a 'hacking' style of development [48], as discussed by DeMarco and Boehm [22]. Agile methods are also not suited to large teams and projects [53]. Discipline, however, can be incorporated into a project by infusing some plan-driven elements, balancing both Agile and plan-driven aspects in a single project [8]. Agile processes are not recommended for large and/or critical systems, where a thorough understanding of the system's requirements is required [48, 81].

Additionally, Sommerville warns that not all the Agile principles are feasible for all situations [81]. In particular, customer involvement depends on the customer's ability to spend time with the developers, developers may not get along, prioritizing changes may be difficult, and maintaining simplicity may prove to be difficult.

Fortunately, this thesis does not involve a large team and is not safety critical, and so Agile Methods can be employed.

There are a number of Agile methods, including:

• eXtreme Programming

• Scrum

• Crystal

• Adaptive Software Development

• Dynamic Systems Development Method

• Feature Driven Development

Of these, eXtreme Programming (XP) can be considered the most popular [1], and those elements of XP which are deemed advantageous will be employed by this thesis. In fact, field studies [26] suggest organizations consciously customize methodologies in a process called methodology tailoring. XP itself recognizes that it does not need to be employed verbatim [1, 6].

Kent Beck described XP in 1999 [6], having developed it while working on a project at DaimlerChrysler. The XP life cycle comprises development episodes, which are the "smallest units of scheduling" [6]. Units of functionality, known as user stories, are written on cards. Pairs of programmers implement each user story following a Test Driven Development method [31]. While opinion regarding pair programming is uncertain and divided [34, 63, 94], it cannot be employed for the purposes of this thesis as there is only a single developer.

Once the code passes all the tests, it is integrated into the system and the full suite of tests is run. A refactoring process then takes place, to ensure that the entire system is as simple as it can be and that there is no redundant code. Keeping the design and code as simple as possible gives the system its agility, meaning that the cost of change does not grow exponentially over time, as it does with plan-driven projects. This simplicity also reduces the possibility of introducing errors into the system. The result is that important decisions can be delayed for as long as possible, increasing the chance of them being correct.

The XP development process has four control variables, according to Beck. External forces - including management - and the development team use these variables to control the project. Best practice is to have three variables set by external forces and the fourth set by the developers. The variables are:

Cost Increasing the cost of the project can reduce development time and improve quality and scope, but Beck warns that too much money can be detrimental.

Time More time leads to improved quality and broader scope.

Quality Beck cautions against using quality as a control variable, but sacrifices in quality can reduce costs and time.

Scope Reducing the scope increases quality while reducing time and cost. Scope is the best control variable to use and should be set by the development team.

For the purposes of this thesis, cost, time and quality are all governed by external forces, and so scope becomes the only available control variable. This form of scope-driven development dictates the development process. The most important functionality has to be implemented first, because should there be a forced change in scope then less important functionality is dropped.

This whole development process, according to Beck, comprises four basic activities, has at its core four values and involves a set of principles and practices. The four basic activities are coding, testing, listening and designing. The four core values are communication, simplicity, feedback and courage. XP's fundamental principles include rapid feedback, simplicity and incremental change. The list of practices which facilitate XP is relatively long; however, those that have relevance to this thesis are small releases, simple design, continuous testing, refactoring and continuous integration.

Thesis Implications Employing methodology tailoring, we can extract those aspects of XP which are of benefit, bringing structure and discipline to the development process. While formal customers do not exist to provide user stories and test the system, members of the research group can be used to fill their role. In addition to the research group, the author performs both the role of developer and user. Disassembling the system into the units of functionality required by the CBSE approach provides clearly defined, clearly separated episodes of development. Integrating a new component provides additional functionality and hence represents a user story. Scope-driven development determines the order of integration.

The integration of each component will represent a complete development episode. Tests are created before any implementation is started, and once the tests are passed integration and refactoring take place. Creating the tests can be as simple as creating VoiceXML applications. Unfortunately, however, these tests cannot always be automated. Testing voice recognition, for instance, requires human involvement.

Keeping in line with the XP paradigm, as well as open CBSE practices, the design of the system will be kept as simple as possible, minimizing the cost of change, reducing the potential for errors and emphasizing 'plug and play-ability'.

1.2.3 Environment

Once the question of how to proceed has been answered, the question of the development environment can be addressed. This involves the selection of platforms, applications, IDEs and other similar tools.

Firstly, the question of which platform to use is determined by the selection of Asterisk as the primary telephony platform. Since Asterisk is Linux based, we will be developing specifically for the Linux platform. Initially, therefore, we need a Linux machine, Asterisk and a C compiler, as Asterisk is written in C. The development machine runs Gentoo Linux with the 2.6.11.10 kernel, gcc 3.3.4 and Asterisk version 1.0.6. gcc is an open-source compiler for C and C++, amongst other languages. Java was also used in the development process; in particular, Sun's JDK 1.5 and its compiler, javac, are used.

For the purpose of creating and editing source code, the text-based editor Vim has been used. This provided a number of advantages. Firstly, it allows simple, remote, platform independent editing of the code under development; secondly, Vim provides syntax highlighting for, amongst others, C, C++ and Java, which was found to be a great aid in the development process.

So, armed with a Linux platform, a text editor, some compilers, a version of Asterisk and a methodology, the task of creating a VoiceXML gateway can get underway.

1.3 Document overview

Chapter 2: This chapter presents a review of work related to this project and background information. The concepts of voice-enabled applications, VoiceXML and Asterisk are defined and explained.

Chapter 3: This chapter explores the selection of components and the process undertaken before implementation could commence. The collection of components is important, as correctly selected components can greatly ease the task of implementation.

Chapter 4: The task of implementation is broken into two chapters. This chapter deals with extending Asterisk's functionality, using the components selected in Chapter 3, to include a VoiceXML interpreter and audio output capabilities.

Chapter 5: This chapter covers the second half of the implementation effort - adding audio input functionality to the system.

Chapter 6: In this chapter a summary of the completed work is presented and related to the project's aims. Future work is also discussed.

Chapter 2

Related work

This chapter presents a review of work related to this project in order to establish a clear understanding of all the relevant concepts.

Firstly, we will show the potential of voice-enabled applications and their popularity in telephony, justifying the need for a low-cost VoiceXML gateway.

Secondly, since we are proposing to use VoiceXML to create these applications and are attempting to build a gateway to execute them, a thorough understanding of VoiceXML, a VoiceXML application and how VoiceXML is interpreted is required. Besides clarifying concepts and improving understanding, this process will also allow us to draw up a set of requirements and divide a gateway into a set of components.

Finally, we will describe the Asterisk system. While Asterisk is a component of the system, it has been selected at the outset, and provides the base upon which a VoiceXML service will be built. So before we look into component selection we need to discuss the Asterisk system. In particular, we will describe the use of Asterisk as a PSTN interface and the available methods of extending Asterisk's functionality.

2.1 Voice-enabled applications and IVRs

The notion of using voice as a mode of human-computer interaction has been around for a number of years. The first full English speech synthesis system was developed in Japan in 1968 [77], and the Hidden Markov Model (HMM), used by many speech recognition systems today, was initially proposed in the late 1960s [75]. In 1971 IBM developed its first speech recognition system, which was used by engineers to talk to a computer [35]. Applications which incorporate these speech technologies are called voice-enabled applications:

Voice-enabled Application: Any application which is accessed through an audio channel and uses voice in the form of voice recognition, text synthesis or pre-recorded audio as a form of input and/or output.

Voice-enabled applications are not limited to interaction between a human and a computer, though. Perhaps the most common example of a voice-enabled application is the traditional Interactive Voice Response (IVR) system, which has been described as a system where users interact with a pre-recorded script using DTMF input [17]. Today IVRs are evolving to include features such as voice recognition. In fact, voice recognition has been shown to improve both performance [85] and customer satisfaction [3] in an IVR system.

IVRs have been applied to many different areas, from call centers to the health care industry and data collection. They are able to provide continuous, automatic access to information at a low cost per call for the user [17].

In the corporate environment IVRs are used by companies as an extra service option for users [84] and as a means of creating a business advantage by providing better access to information for customers [3]. IVRs are also used in call centers to reduce costs, by reducing agent-user interaction time. It is estimated that more than 70% of all business interactions are conducted via call centers, and that agents represent about 70% of a call center's costs [11]. Hence, the correct use of an IVR in the call center environment can result in large savings.

The use of IVRs in the health industry has been researched extensively. IVRs can be used to focus on health related behavior, informing and advising patients, aiding diagnosis, monitoring disease or counseling [38]. Using IVRs as reminder services for patients has been shown to be effective [19]. Studies have also shown that Automated Telephone Disease Management (ATDM) can be an effective form of health care [72]. In a survey amongst low income diabetes sufferers where ATDM was used as part of their care, 85% of respondents were satisfied with their calls, and 76% said they would choose to use ATDM in the future [71].

Another important application of IVRs is as an aid in data collection surveys. Computer aided data collection is now the major means of survey research in the USA [87]. Using IVR systems to conduct interviews over the telephone is an efficient, low-cost means of gathering data. Surveys conducted over the phone using an IVR system have been found to be preferred to web-based questionnaires [29], and produce more honest input [58, 87]. While the process of recording audio prompts can be long and expensive, it has been shown that speech synthesis can be used as a replacement without affecting the quality of the answers [58], driving survey costs down even further. It has been found, however, that more people are likely not to complete IVR-based surveys [58, 29, 87].

Revenue model | Description | Viability
Advertising | Revenue is generated from advertisers on voice portals | Low
Fee based | Revenue is generated from monthly or yearly fees charged for a service | High
Premium rate | Revenue is generated from the amount charged for a user to make a call | Medium (suitable for specialist services or small markets)
Referrals | Revenue is generated from companies referred by voice portals | Low
Selling over the phone | Revenue is generated from orders placed over the phone and savings in customer service agent charges | High (dependent on integration with existing ecommerce systems and voice verification systems)
Development & outsourcing | Revenue is generated by developing and hosting an enterprise voice-enabled application for a third party | High
Content provision | Revenue is generated by selling on aggregated or specially prepared content to voice portal providers | Medium (dependent on the continuing success of the voice portal market)

Table 2.1: Vcommerce revenue models and their chances of success, as described by Duggan [23]

While IVRs have been shown to be versatile, caution must be exercised when implementing an IVR system, as poor user interface design can lead to customer dissatisfaction. Poorly designed interfaces can be a hindrance and a source of frustration, causing users to opt for a live agent at the first opportunity. The design of an IVR dialog is therefore critical to the success of any IVR-based system [84, 93, 39].

In addition to enhancing IVRs, the field of voice-enabled applications is expanding as speech related technologies improve [16]. The increasing popularity of mobile devices such as PDAs and smart phones is one of the drivers of this process. The interfaces for these devices are not always adequate or convenient [92, 64]: screens are small and input devices are inadequate. As users become increasingly mobile, they require that their information also becomes mobile, and the most natural and unobtrusive means of interacting with information is by using voice. Multi-modal applications, applications which have both visual and aural interfaces [42], are becoming a popular means of information interaction on these mobile devices. Other examples of voice-enabled applications include audio interfaces to web sites, databases and personal computers.

The concept of a voice enabled web provides the opportunity to generate revenue. While speech technology remains immature and continues to evolve, these revenue opportunities will as well [23]. However, a number of models have already been used, successfully and unsuccessfully, as shown in Table 2.1.

Aside from bringing the web to telephones, voice services are being created to allow telephone users remote access to their personal computers. These services are aimed at providing users unlimited access to personal desktop applications such as e-mail and calendars [70, 78].

A final example of the use of voice-enabled applications is as an assistive technology, for both older adults and those with visual or motor disabilities [39], or when interacting directly with a device is impossible, such as when driving a car [64]. Voice-enabled applications can be used to aid interaction between elderly people and computers and as an efficient reminder service [96]. Similarly, voice-enabled applications provide visually impaired people with a mode of interaction with a computer.

The use of VoiceXML, designed to create audio dialogs, can enhance voice-enabled applications, as it separates the application logic from the interface itself. VoiceXML is becoming one of the major IVR languages [88], is being used as a bridge between the web and the PSTN [80, 62], and to create assistive services [61]. The following section describes, in more detail, the VoiceXML language.

2.2 VoiceXML

This section describes the origins of VoiceXML, looks at the specification, at how VoiceXML applications are created and interpreted, and at the advantages associated with VoiceXML development.

2.2.1 Origins and definitions

VoiceXML is an open, XML-based standard which can be used to describe speech enabled applications [9, 5]. It provides a means of creating audio dialogs utilizing speech synthesis, pre-recorded audio, DTMF recognition and speech recognition [56]. While it has also been described as HTML for the telephone [5, 59], it is important to note that it is not restricted to the telephone, as any audio channel will suffice. The first specification, VoiceXML 1.0, was released in 2000. The primary goal of VoiceXML is to bring the advantages associated with web based development to IVRs [56]. The current version of the standard is 2.0 [56], which was released in March 2004. Due to the popularity of the 2.0 standard, the W3C released VoiceXML 2.1 [68], which defined a set of features (see Appendix B) not initially included in 2.0 but commonly found in commercial 2.0 implementations [68].

Figure 2.1: A web browser requesting and displaying a web page

The VoiceXML architecture is very similar to that of web based applications: a VoiceXML interpreter is analogous to a web browser, although the interaction with the user is aural as opposed to visual [20]. We can therefore begin by briefly looking at a simple web transaction, a web browser requesting and displaying a web page, shown in Figure 2.1.

Web applications reside on a web server. When a user requests to view a page in a browser, the browser is responsible for fetching the page from the web server and rendering it. The user is then able to interact with the web page via the browser. The HTML document contains no platform specific information, and so any HTML compliant browser on any platform is able to process the application. This abstraction brings with it a number of advantages. The HTML application is portable across many platforms. It is the browser's responsibility to ensure that the application is rendered correctly, not the application's. The application developer is only concerned with application logic at a high level, easing the development and maintenance effort. Finally, because HTML is popular and standard, many tools have been created to assist in the development process.

These features of the web-based model were non-existent in the voice application domain. Languages and applications were low level and specific. This led, in 1995, to attempts to ease the task of creating voice-enabled applications. These attempts are at the origin of VoiceXML, as discussed in the VoiceXML specification [56]. In 1995 AT&T designed an XML-based language, named Phone Markup Language (PML), aimed at simplifying their speech recognition applications. The PML idea was picked up by Motorola and Lucent. By 1998, at a W3C-hosted conference, AT&T and Lucent had separate implementations of PML and there existed a number of other voice browser languages created by other organizations. These organizations then decided to work together to create a voice browser language, and so AT&T, Lucent, Motorola and IBM created the VoiceXML Forum. The result was VoiceXML, which has been accepted as a W3C standard.

Figure 2.2: The W3C Speech Interface Framework [90]

Subsequently, the W3C's Voice Browser Working Group defined a number of standards aimed at supporting speech applications. These fall under the Speech Interface Framework [90], which is a set of open standards that can be used to create speech and natural language applications [45]. Speech applications created using the Speech Interface Framework are rendered by a voice browser. A voice browser is hardware and software capable of interpreting voice markup languages so that it can generate voice outputs and interpret voice inputs [90]. Beasley et al. describe a voice browser as an application residing in the network that is able to render VoiceXML applications using speech, telephony and Internet components [5].

The Speech Interface Framework can be seen in Figure 2.2. The rounded blue boxes represent data conforming to one of the standard Speech Interface Framework specifications. The arrows show the flow of inputs and outputs, and the white boxes represent the components of a typical speech application.

One can see that VoiceXML is the standard means of defining the Dialog Manager. It is also apparent that there are four modes of I/O: Automatic Speech Recognition (ASR), DTMF tones, pre-recorded audio, and Text To Speech (TTS) synthesis.

The Speech Recognition Grammar Specification (SRGS) [12] is used to specify the grammars used to determine user input. Specifically, VoiceXML grammars represent the set of words and/or DTMF inputs the application 'understands'.
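By way of illustration, the following is a minimal sketch of an SRGS grammar in its XML form; the rule name and the three accepted words are invented for this example and are not taken from the specification:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         mode="voice" root="drink" xml:lang="en-US">
  <!-- The root rule defines the complete set of utterances this
       grammar 'understands': exactly one of three words -->
  <rule id="drink">
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>milk</item>
    </one-of>
  </rule>
</grammar>

Recognition results are matched against such rules; input falling outside the rule set produces a no-match.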

Speech Synthesis Markup Language (SSML) [91] is used as a standard means of defining input for text synthesizers, ensuring that the input adheres to a pre-defined set of rules, thus allowing good performance.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <var name="greeting"
       expr="&quot;Welcome to Adam King's Voice XML page&quot;"/>

  <form>
    <block>
      <prompt><value expr="greeting"/></prompt>
      <audio src="pressOne.wav"/>
    </block>
    <field name="choice">
      <grammar mode="dtmf" version="1.0" root="digit">
        <rule id="digit"><item>1</item></rule>
      </grammar>
      <filled>
        <goto next="#one"/>
      </filled>
    </field>
  </form>

  <form id="one">
    <block>Congratulations, you pressed one</block>
  </form>

</vxml>

Figure 2.3: A simple 'Press One' VoiceXML application

SSML is also used to provide additional linguistic information, such as emphasis, to synthesizers [36].
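A small hand-written SSML sketch, with invented prompt text, gives a flavour of this markup:

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Please listen carefully, as our options have changed.
  <!-- A pause, followed by linguistic hints for the synthesizer -->
  <break time="500ms"/>
  Press <emphasis level="strong">one</emphasis> to continue, or say
  <prosody rate="slow">operator</prosody> at any time.
</speak>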

A simple VoiceXML 2.0 example is the 'press one' IVR shown in Figure 2.3. This simply synthetically welcomes a user before prompting the user to press the DTMF key corresponding to the digit '1'. If the user presses it, he is congratulated. While the application is very simple, and has no real purpose, it is an example of the "press one for sales, press two for inquiries" structure that is so common in IVRs. Most of the tags are VoiceXML tags, a full list of which can be seen in Appendix A; the remainder are generic XML. The IVR includes examples of speech synthesis, pre-recorded audio and DTMF input. The speech synthesis is represented by an SSML document describing the prompt, which is passed to a TTS engine. A simple grammar is used to determine what action to take should the user press one. It is important to note the difference between recognizing a DTMF tone and matching input against a grammar. Recognizing a DTMF tone simply involves the representation of a DTMF tone as a token, which is then matched against a grammar. Grammars deal exclusively with tokens, and are not responsible for converting audio signals into tokens.

The application begins by synthesizing the text "Welcome to Adam King's Voice XML page", which is expressed as a variable. Once the text has been synthesized, an audio file, pressOne.wav, is played. This file asks the user to press one, and if the user presses one the text "Congratulations, you pressed one" is synthesized, and the application exits. If the user does not press one, the application exits once the audio file is finished playing.

We can see that any VoiceXML browser wanting to render this IVR by implementing the Speech Interface Framework has to be able to interpret the specified markup languages and provide semantic action (playing pre-recorded audio, synthesizing speech, 'understanding' utterances) accordingly.

We can now define a VoiceXML browser for the purposes of this thesis:

VoiceXML Browser: Hardware and software residing in the network that is capable of rendering VoiceXML applications which interact with a user in a standardized manner via synthesized speech, pre-recorded audio, DTMF and speech recognition over an audio channel.

The concept of a VoiceXML gateway can also be defined now, as it is a specialised form of VoiceXML browser:

VoiceXML Gateway: A VoiceXML browser that has at least one of its audio channels interfaced to the PSTN.

Now that the terms VoiceXML, VoiceXML browser and VoiceXML gateway are understood, we can investigate how VoiceXML works.

2.2.2 VoiceXML in detail

Before a VoiceXML gateway can be constructed, a complete understanding of the specification is required. This process defines the requirements of the gateway, which are used to identify units of functionality, and so help to disassemble a gateway into its components. This whole section is an overview of the VoiceXML specification and is based on [56]. The VoiceXML architectural model, the gateway requirements and the application and interpretation processes are all discussed.

2.2.2.1 Architectural model

The architectural model described by the specification can be seen in Figure 2.4. There are four components: the document server, the interpreter, the interpreter context and the implementation platform. The interpreter, the interpreter context and the implementation platform together represent a VoiceXML gateway as it is defined for the purposes of this thesis. The document server can be regarded simply as a web server. The interpreter deals solely with the VoiceXML language itself, essentially VoiceXML elements, while the interpreter context deals with those events which are not part of the VoiceXML language, such as an incoming telephone call. Any action, receiving of input or playing audio files, is the responsibility of the implementation platform.

Figure 2.4: The VoiceXML architectural model, as described by the VoiceXML 2.0 specification [56]

Communication using events is described by the VoiceXML event model, which is explained by Turner [88]. The event model is hierarchical, with four different levels to handle events: the platform level, the application level, the form level and the field level. An application is made up of forms, which in turn consist of fields. Therefore, field event handlers are the most specific and platform event handlers the most general. When events occur, control is transferred to the corresponding event handler.
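The sketch below, an invented example rather than one from the specification, shows handlers at two of these levels: a document-level catch for errors, and field-level noinput and nomatch handlers, which, being the most specific, are consulted first:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-level handler: used when no more specific handler exists -->
  <catch event="error">
    <prompt>Sorry, an error has occurred. Goodbye.</prompt>
    <exit/>
  </catch>
  <form id="main">
    <field name="choice">
      <grammar mode="dtmf" version="1.0" root="digit">
        <rule id="digit">
          <one-of><item>1</item><item>2</item></one-of>
        </rule>
      </grammar>
      <prompt>Press one or two.</prompt>
      <!-- Field-level handlers: the most specific level -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch>That key is not an option. <reprompt/></nomatch>
    </field>
  </form>
</vxml>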

Using the IVR seen in Figure 2.3 as an example, we can see how the architecture's components interact. The gateway receives a request from the user who wishes to listen to and interact with the IVR. This request event is thrown by the implementation platform and is received by the interpreter context, as the request event is not part of the VoiceXML language. The interpreter context then requests the relevant VoiceXML document from the web server. The implementation platform is responsible for fetching the file from the server using HTTP [25]. Once the document has been fetched, it is handled by the interpreter. The interpreter generates the events to synthesize text as well as a request to fetch and play back the audio file, pressOne.wav. It is the responsibility of the implementation platform to honour these requests. Once the audio prompt has been played, the implementation platform passes any input it receives to the interpreter. Should the implementation platform not receive any input within a pre-defined time, it will pass a time-out event to the interpreter context.

2.2.2.2 Implementation platform requirements

As one can see from the previous scenario, the implementation platform is responsible for providing the voice gateway's functionality. The platform's requirements are listed by the specification as:

1. Document acquisition

2. Audio output

3. Audio input

4. Transfer

The implementation platform must retrieve documents requested by either the interpreter or the interpreter context using HTTP.

It is also responsible for providing audio output support. This involves synthesizing and outputting TTS and playing audio files identified by a URI. The platform must at least support raw and WAV formats in both mu-law and A-law encodings, although other formats may also be supported.
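In VoiceXML such output is requested with the audio element. The fragment below is a sketch (the URL is a placeholder of our own); it plays a recorded file, falling back to synthesized speech if the file cannot be fetched:

<prompt>
  <!-- Play the recording if it can be fetched ... -->
  <audio src="http://example.com/prompts/welcome.wav">
    <!-- ... otherwise this fallback content is synthesized -->
    Welcome to the demonstration service.
  </audio>
</prompt>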

With regards to audio input, the platform is responsible for detecting and reporting both spoken and DTMF input, as well as reporting input detection events. Grammar support for inputs must be provided via support for SRGS, including multiple and dynamic grammar matching. N-best recognition, one or more possible recognition results with their associated confidence scores [24], must also be supported. Actual recognition, in other words converting the audio into symbols, is the responsibility of the platform. Additionally, the platform is required to record audio in any of the supported formats.

The final requirement listed by the specification is that the implementation platform must provide the functionality to transfer connections, for example, transferring a call.
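A transfer is requested from VoiceXML with the transfer element. A minimal sketch, with an invented destination number and field name, might look as follows:

<form id="operator">
  <!-- A bridge transfer: the platform connects the caller to the
       destination and reports the outcome in the field variable -->
  <transfer name="result" dest="tel:+27460000000"
            bridge="true" connecttimeout="20s">
    <prompt>Transferring you to an operator.</prompt>
    <filled>
      <if cond="result == 'busy'">
        <prompt>The operator is busy. Please try again later.</prompt>
      </if>
    </filled>
  </transfer>
</form>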

Since we are creating a VoiceXML gateway, the further requirement of PSTN interfacing capabilities can be added.

2.2.2.3 Applications

A VoiceXML application is a set of one or more VoiceXML documents which contain one or more dialogs. Dialogs are either menus, which present the user with a set of options, or forms, which, similar to HTML forms, provide information and can collect values. Users transition between dialogs and can only ever be within a dialog. The first document of a VoiceXML application is known as the root document. When the gateway begins executing the application, the root document is loaded into the interpreter and, unless otherwise stated, execution begins at the first dialog. The root document remains loaded for the duration of the user's interaction with the application, which effectively means that any variables defined in the root document are global, and apply for any child documents. Each dialog can have grammars associated with it, and these grammars are activated when the dialog is entered. An active grammar is one which is actively waiting for input. This includes any grammars associated with the particular input item, including those grammars specified by a link element (see Appendix A), the grammars, and linked grammars, of the form and document, and those grammars and linked grammars of the root document.

The example seen in Figure 2.3 is a simple VoiceXML application consisting of a single document containing a single form dialog, which in turn contains a single subdialog, in this case another form. Subdialogs are used to represent a new interaction within a dialog, and return to the calling dialog. Execution of the application begins at the first, and only, form element in the document, as no specific dialog is specified. The active grammar will, if the user inputs a 1, transfer the user to the subdialog, which, once the prompt is played, will return to the parent dialog. At this point, or if the user did not press 1, the application exits, as no other dialogs are specified for transition.

This IVR is a small example of a machine directed application, which is an application where the only active grammar is the one belonging to the dialog the user is in. The alternative, when the list of active grammars includes those linked to the input item or those active in the root document, is known as a mixed-initiative application.

If, during a mixed-initiative application, input is received during the execution of dialog A which matches an active grammar in dialog B, execution is transferred to dialog B as if the input was received while dialog B was executing. This functionality can be used to add power and flexibility to the application. Help functions of varying scope can be created, for instance, and the navigation of an application can be greatly improved. Traversing many layered menus in a traditional IVR means entering each menu and selecting the next menu, while in a mixed-initiative application one can simply select the exact menu.
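As a sketch of such navigation (the menu names are invented for the example), document-scoped links keep their grammars active in every dialog of the document, so a caller deep in one menu can jump directly to another:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Active everywhere: saying 'billing' jumps straight to that form -->
  <link next="#billing">
    <grammar mode="voice" version="1.0" root="r">
      <rule id="r"><item>billing</item></rule>
    </grammar>
  </link>
  <form id="main">
    <block><prompt>Main menu.</prompt></block>
  </form>
  <form id="billing">
    <block><prompt>Billing menu.</prompt></block>
  </form>
</vxml>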

Now that the architecture has been described and an application has been defined we can look at the actual interpretation of a VoiceXML document.

2.2.2.4 Form Interpretation Algorithm (FIA)

The interpretation process is governed by the Form Interpretation Algorithm (FIA), which is described by the VoiceXML specification [56]. The FIA is responsible for driving input and output between the user and a dialog [20]. Any activity related to the form is controlled by the FIA, including initialising the form, prompting the user and controlling the grammars. Essentially, the FIA is responsible for selecting a form item from the form to interpret, interpreting it while collecting inputs and events, and performing the required action. There are two types of form items, input items and control items. Input items gather input from the user and control items contain a block of VoiceXML code to be executed. There are five input elements:

1. <field>

2. <record>

3. <transfer>

4. <object>

5. <subdialog>

The two control elements are:

1. <block>

2. <initial>

The FIA has four phases: an INITIALIZATION phase, after which it enters a loop which contains the SELECT phase, the COLLECT phase and the PROCESS phase.

The INITIALIZATION phase occurs when a form is entered and is responsible for initialising prompt counters and variables. Each prompt has a counter which counts the number of times the prompt has been played since the form began executing. Any variable that is within scope is initialized. If the expr attribute of a variable is specified, the variable is initialized to that value, otherwise the variable is initialized to undefined. (The var element requires a name attribute; an optional expr attribute specifies the value of the variable. See Appendix A.)
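For example, the greeting variable used in Figure 2.3 would be declared along these lines (a sketch):

<var name="greeting"
     expr="'Welcome to Adam King\'s Voice XML page'"/>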

Once the INITIALIZATION phase is complete, the main loop is entered. If the user has entered the current form because a grammar belonging to the current form has been matched during another form's execution, the loop begins at the PROCESS phase, as there is already input on which to act; otherwise the loop begins at the SELECT phase. The loop repeatedly selects a form item and interprets it, running until the FIA interprets a transfer of control or exit statement, or until there are no more form items left to interpret, at which point the application exits.

The SELECT phase is responsible for selecting the next form item to interpret. This will either be the item specified by a goto element, or the next available item in the form.

When the next form item has been selected, the COLLECT phase is responsible for collecting input and events for the PROCESS phase and for queuing prompts. The COLLECT phase queues only those prompts that are to be played before the next COLLECT phase is entered, and queues them in order of their execution. The execution of a prompt results in audio being played to the channel. The prompt queue will always contain only those prompts which can be expected to be played (barring, say, a hangup), and in the correct order of execution. The PROCESS phase deals with any input collected in the COLLECT phase.

The FIA will execute the next iteration of the loop unless a transfer or exit statement has been interpreted.

Using the example in Figure 2.3 we can observe the FIA at work.

When the document is loaded, the FIA begins with the INITIALIZATION phase. This will reset the prompt counters and initialise the greeting variable to 'Welcome to Adam King's Voice XML page'.

Once INITIALIZATION has been completed, the FIA enters the main loop at the SELECT phase. Since there is no goto element specified from a previous iteration (this is the first iteration) the SELECT phase selects the first available form item, in this case the first block item. Once the form item has been selected, the SELECT phase is complete and the FIA continues with the COLLECT phase.

The COLLECT phase first queues any prompts in the selected form item's scope. In this example it will be the initialized greeting variable, which is to be synthesized, and the audio file pressOne.wav. When the prompts are queued, their prompt counters are also incremented. The COLLECT phase then activates any grammars and waits for input or a timeout. "Activating the grammar" in this example means that the application will match any DTMF 1 input. The final task of the COLLECT phase is to execute the form item. Since the selected form item is a block element, this means executing the block's content, in this case rendering the two prompts.

The FIA will only proceed with the PROCESS phase if the COLLECT phase managed to collect any input. If no input has been received during the COLLECT phase the loop iterates, the SELECT phase will not be able to find another form item and so the application exits. Should there be input, however, the PROCESS phase can process the input. If the input is DTMF 1 the choice element is considered filled and execution transitions to the one form, where the FIA will select the block element, queue and play the prompt and exit. If input has been received but does not match an active grammar a nomatch event is thrown.

2.2.2.5 Summary

We now have a complete insight into VoiceXML documents and the process of rendering them. This allows us to define requirements for the functionality of a gateway, and helps with the task of separating a VoiceXML gateway into its components. The three architectural components of a gateway are the interpreter, the interpreter context and the implementation platform. These interact using standards defined in the Speech Interface Framework to render VoiceXML applications. Creating a gateway requires at least a PSTN interface.

A VoiceXML application consists of one or more VoiceXML documents which consist of one or more dialogs. The interpretation of these dialogs is done by the FIA, which reacts to inputs or events and generates requests for outputs. The task of capturing inputs, generating events and honoring the FIA's requests for output falls on the implementation platform. This, for example, is what actually plays audio on a channel, or receives input from a channel.

We can now define the task of creating a gateway in terms of a number of smaller tasks. Firstly, a VoiceXML interpreter is required. This is an implementation of the FIA. Secondly, an implementation platform which interacts with the interpreter is needed. The platform must be able to fetch documents from a web server, listen for voice and/or DTMF inputs, synthesize speech, play audio files of a certain format on an audio channel and transfer connections. Finally, these inputs and outputs must be rendered on a PSTN interface.

2.2.3 Advantages of VoiceXML

As we have seen, VoiceXML is very similar to HTML, both in terms of appearance (both are 'tagged' languages [59]) and in the fact that both follow the client-server model [41].

The VoiceXML equivalent transaction to the one seen in Figure 2.1 can be seen in Figure 2.5. We can see that the server side of the transaction remains unchanged. Those web servers which are able to serve XML content are able to host the VoiceXML applications, and serve requests to the VoiceXML gateway, which may or may not represent a gateway to the PSTN. A user can use any audio device to interact with the VoiceXML application via an audio channel. This audio channel includes telephony networks such as the PSTN, GSM networks and VoIP protocols.

The similarity in architectures results in similar advantages. VoiceXML applications are portable, high-level applications which have many tools associated with ease of development. Because VoiceXML is a high-level language it is relatively easy to learn, potentially lowering development and maintenance costs.

Figure 2.5: A VoiceXML gateway requesting and rendering a VoiceXML page

VoiceXML provides the functionality to completely replace the traditional development of IVRs, bringing the advantages of the VoiceXML model to the IVR domain. It can replace the low-level, platform-specific, proprietary applications which have traditionally been used to create IVRs. Since it is an open standard, VoiceXML can be expected to work on any compliant platform, meaning that organizations can change platforms without having to change any existing voice applications [4]. Additionally, a VoiceXML application residing on a corporate web server has the same access to corporate databases as the web site, which can be accessed using familiar, mature web technologies [9].

These advantages are also enjoyed when using VoiceXML to create other voice-enabled applications. Ease of development and portability means that creating assistive applications becomes a simple task. The fact that VoiceXML gateways bridge the IP and PSTN networks means that VoiceXML is an ideal language for the creation of audio information interfaces.

2.2.4 Summary

Being an open, high-level, web-based standard means that VoiceXML enjoys many advantages over the traditional means of IVR creation and is therefore an ideal means of replacing IVR systems as well as creating other voice-enabled applications. VoiceXML gateways are entities residing in the network which interpret these applications. While a browser in itself may be inexpensive, a connection to the PSTN is essential, and the resulting combination is currently very expensive.

A VoiceXML gateway can be divided into three parts: the interpreter itself, the interpreter context and the implementation platform, which is responsible for any audio on the channel, amongst other things. The implementation platform requires a text-synthesis system, a speech-recognition engine and the ability to fetch documents from a web server. The interpreter uses the Form Interpretation Algorithm to interpret VoiceXML documents.

The final section of this chapter discusses the Asterisk system, into which a VoiceXML service will be integrated.

2.3 Asterisk

Before we can proceed there are a number of things we need to know about Asterisk. Firstly, we need to understand what it is and what it is capable of; secondly, how it can provide an inexpensive PSTN interface; and finally, how it is possible to extend Asterisk to include extra functionality, specifically how one goes about adding VoiceXML functionality to Asterisk. This section addresses these issues, and provides a little discussion of VoIP, which will also be supported by the VoiceXML gateway.

As stated previously, Asterisk is a pre-selected component of our system, and so it is where we begin our investigation into the creation of a VoiceXML gateway, before moving on to the selection of the other required components in the next chapter.

2.3.1 What is Asterisk?

The Asterisk handbook [82] describes Asterisk and its uses. Asterisk is an open-source suite of telecommunications software which aims to support every possible telephony technology [69, 82]. The applications for Asterisk are quite broad. It can be used as a conference server, a Private Branch eXchange (PBX), a VoIP gateway or an IVR server, for instance. Since we are using it as our telephony platform, of particular interest to this thesis are its PSTN interfaces and its VoIP capabilities. This section will also explore the use of Asterisk as an IVR server, as this represents a possible alternative to VoiceXML.

Once installed on a PC running Linux, Asterisk does not require any other hardware besides a network card. Using Asterisk in this way means, of course, that there is no PSTN interface, but that is easily fixed via inexpensive PC-based PSTN hardware. This means that for a bit more than the price of a desktop PC one gets a VoIP gateway, an IVR server, and, most importantly for our purposes, a gateway to the PSTN.

The VoIP protocols currently supported by Asterisk are Session Initiation Protocol (SIP), Inter-Asterisk eXchange (IAX), Media Gateway Control Protocol (MGCP) and H.323. Voice Over Frame

exten => s,1,setvar($greeting,"Welcome to Adam King's Voice XML page")
exten => s,2,answer
exten => s,3,festival(${greeting})
exten => s,4,background(PressOne)
exten => s,5,hangup

exten => 1,1,festival(Congratulations you pressed one)
exten => 1,2,hangup

Figure 2.6: A simple 'Press one' dial plan IVR

Relay (VOFR) is also supported. The latest Asterisk beta provides some support for the Jingle protocol, which is used by Google Talk. These capabilities, which are provided out of the box, mean that once the VoiceXML gateway has been integrated into Asterisk, it will have support for these VoIP technologies. The gateway will have both PSTN and VoIP capabilities. Since these technologies, in particular SIP, are becoming increasingly common, this is a major advantage.

According to the Asterisk handbook, the most important part of Asterisk is its dial plan. The Asterisk dial plan is responsible for routing all calls through a variety of applications and on to their final destination. The dial plan consists of one or more extension contexts [69]. These contexts can be used to implement a host of features including security, routing, auto-attendant and multilevel menus, amongst others. The extension contexts can be used to create IVRs.

An IVR equivalent to the VoiceXML application shown in Figure 2.3 and created in the Asterisk dial plan can be seen in Figure 2.6. An extension context is made up of one or more steps, referred to as priorities. A priority is always of the form

exten => <extension>,<priority>,<application>[(<args>)]

where exten is the extension the user has dialed (and it is used to match DTMF input), priority represents the order of execution, application is the application to execute at any given priority, and args are the arguments, if any, passed to the application. The functionality of grammars is created

[simple_ivr]
exten => s,1,setvar($greeting,"Welcome to Adam King's Voice XML page")
exten => s,2,answer
exten => s,3,festival(${greeting})
exten => s,4,background(PressOne)
exten => s,5,hangup

exten => 1,1,festival(Congratulations you pressed one)
exten => 1,2,hangup

exten => i,1,goto(reprompt)
exten => t,1,goto(reprompt)

[reprompt]
exten => s,1,setvar($greeting,"Welcome to Adam King's Voice XML page")
exten => s,2,answer
exten => s,3,festival(${greeting})
exten => s,4,background(PressOne)
exten => s,5,hangup

exten => 1,1,festival(Congratulations you pressed one)
exten => 1,2,hangup

Figure 2.7: A simple 'Press one' IVR system with re-prompting capabilities

using the exten field, which allows pattern matching. There is also a special set of extensions. Among them, s represents the start extension, t is the timeout extension and i is the invalid extension. In this example the user enters at the s extension and the greeting variable is initialized, the call is answered and then the application festival is used to synthesize greeting. The festival application uses the Festival TTS to synthesize text. The background application plays the PressOne.wav file while waiting for DTMF input. If 1 is received, the congratulatory prompt is synthesized and the call is hung up, otherwise the call is simply hung up.

While this IVR has the same basic functionality as the one created using VoiceXML, the two applications differ, functionality-wise, in how they handle abnormal events, timeouts and incorrect (out of grammar) input. While VoiceXML has a default behavior, which can be overridden, given a noinput or nomatch event, similar behavior has to be explicitly built into the dial plan. In the event of an input error, VoiceXML re-prompts the user. To create similar handlers in Asterisk, the special extensions t and i are required. A revised dial plan IVR can be seen in Figure 2.7, which shows the simplest way to re-prompt. The lack of a prompt counter results in some extra complexity to re-prompt once, let alone the three times VoiceXML defaults to.

While it is possible to create IVRs in the Asterisk dial plan, it is not as good as VoiceXML for creating large voice-enabled applications, for a number of reasons. While the dial plan notation may be relatively simple to learn, it is not as familiar as the tagged notation of VoiceXML. More importantly, portability of the IVR applications is not guaranteed. While the syntax may remain constant, a ported dial plan may conflict with another Asterisk installation's routing plan, or the applications on the two machines may not be consistent, for instance. The dial plan IVR has to reside on the local machine, and taking an Asterisk dial plan to another platform is out of the question. Speech recognition cannot be assumed: speech recognition formally supported by Asterisk is available, but is not free and requires a license. Finally, all those advantages gained by VoiceXML by mirroring web-based development are lost when using the dial plan.

2.3.2 The PSTN interface

This section describes the process involved in adding a PSTN interface to Asterisk. It notes and describes the different options and their associated prices before briefly describing the installation and configuration process.

The Asterisk telephony interfaces are divided into three categories: Zaptel hardware, non-Zaptel hardware and packet voice. Of these, the Zaptel and non-Zaptel hardware can be used to interface with the PSTN. The difference between Zaptel and non-Zaptel hardware is that Zaptel hardware supports pseudo-TDM switching while non-Zaptel hardware provides proper TDM switching. Pseudo-TDM hardware provides the same functionality, and almost the same performance, as TDM hardware but at a considerably lower cost [54]. The hardware can provide connectivity to the PSTN via a T1 or E1 Primary Rate Interface (PRI), an ISDN Basic Rate Interface (BRI), or a simple Plain Old Telephone System (POTS) interface. Any of these interfaces can be used to create a gateway to the PSTN, depending on requirements.

The first task in creating a PSTN gateway is to purchase the hardware, usually in the form of a PCI card. The different cards fit different requirements, including the number of available channels per card, location and signaling type, and cost. South Africa falls in the E1 area, and the available options are PRI, BRI or POTS interfaces. Table 2.2 shows the capacity and current cost of these interfaces. The FXO card is an analog interface, requiring a normal telephone line, while PRI and BRI are ISDN interfaces. The number of channels is the number of simultaneous voice calls the line can handle. We can see that, excluding the cost of a PC, a simple, double-channel PSTN gateway can be created for as little as R360. (This figure ignores the cost of line rental as it is unavoidable: it will be incurred by any gateway.) So, an open-source, Asterisk-based VoiceXML gateway can be created for the cost of a PC plus the cost of a card. A relatively large 30-channel VoiceXML gateway, for instance, will cost in the region of R10 000, accounting for a generous R8 000 for an entry-level

Device            Number of Voice Channels   Approximate Cost
Single span PRI   30 (E1)                    R4800
Single span BRI   2                          R360
Single port FXO   1                          R1200

Table 2.2: Possible PSTN interfaces in South Africa and their current approximate prices (October 2006) [60]

PC (October 2006), which is significantly lower than the $10 000 USD to $100 000 USD quoted in Chapter 1.

Once the hardware has been purchased, setting it up and configuring Asterisk is a simple task. Interfaces in Asterisk are implemented as channels, and one channel, the zap channel, represents the Zaptel and non-Zaptel hardware. One simply needs to install the hardware, load its drivers and configure the system. Configuration is simple, and involves editing two well documented files, zaptel.conf and zapata.conf. The zaptel.conf file simply defines the spans and groups associated with an ISDN interface. A span usually maps to a physical interface (an ISDN port for example), which can be divided into one or more groups. zapata.conf is used for any Asterisk-specific configuration, such as the signaling type, the context an incoming call is placed into, or the group associated with the interface. Example configuration files for a BRI, single span interface can be seen in Appendix C.
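As an illustrative sketch (the exact span and signalling parameters depend on the card and the line; Appendix C contains the configuration actually used), the two files might contain entries such as:

; zaptel.conf - define the span and its channels
span=1,1,3,ccs,ami
bchan=1-2
dchan=3

; zapata.conf - Asterisk-specific settings for the same channels
signalling=bri_cpe
context=incoming     ; dial plan context for inbound calls
group=1
channel => 1-2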

As we have seen, creating a PSTN gateway using Asterisk is both cost-effective and simple. We can now look at the question of extending Asterisk's functionality.

2.3.3 Extending Asterisk

Because Asterisk aims at including support for any telephony technology, it has been designed to allow simple extension [82]. This section investigates these methods, and selects the most appropriate for our task. Some important restrictions govern the selection process. Firstly, extending Asterisk's functionality must not change any source code. In other words, the functionality must be added as transparently as possible. This will maximize the number of existing Asterisk implementations the VoiceXML service can be added to, and, at the same time, ensure that the integration is as seamless as possible. Secondly, this functionality must be added in such a way as to be available to any channel. This means that the VoiceXML service will be available to any technology Asterisk supports, or any technology which will be supported in the future.

Hitchcock [33] thoroughly explores the issue of extending Asterisk. There are four methods of adding to Asterisk's functionality. These are the Dial Plan, the External Interfaces, the Asterisk Gateway Interface and Application Modules.

The IVR example above is an example of using the Dial Plan to extend Asterisk. This method can be disregarded: while it may be able to mimic a VoiceXML application, it does not provide enough functionality to completely implement a VoiceXML service.

The second option, External Interfaces, consists of interfaces which allow other applications to connect, make queries and issue commands. There are two interfaces provided, the Manager API and the Asterisk Command Line Interface (CLI). The Manager API interacts with other applications via a TCP port. The CLI presents a prompt at which users, or applications, can enter commands. Once again, these two methods do not provide sufficient functionality to implement a fully fledged VoiceXML service.

The Asterisk Gateway Interface (AGI) is a dial plan application, similar to hangup, answer or festival, which were seen in Figures 2.6 and 2.7. This application executes any external executable within the AGI environment. The external application has access to a limited set of commands, and Asterisk and the executable interact using standard in and standard out. There are almost no restrictions on an AGI application as far as implementation is concerned, the only requirements being that it is an executable and that it can communicate with standard in and standard out. While an AGI application does not alter the Asterisk source and is visible to all Asterisk channels, the control of the hosting Asterisk is limited.
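To illustrate the interaction (a sketch of a stand-alone AGI program, not part of the gateway): Asterisk writes its agi_* variables to the program's standard in, one per line, terminated by a blank line, after which the program writes commands and reads one reply per command:

/* A minimal AGI program (sketch). */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];

    /* consume the agi_* variable block Asterisk sends first */
    while (fgets(line, sizeof(line), stdin) && strcmp(line, "\n") != 0)
        ; /* a real application would record these values */

    /* each command is one line on stdout, answered on stdin */
    printf("ANSWER\n");
    fflush(stdout);
    fgets(line, sizeof(line), stdin);          /* e.g. "200 result=0" */

    printf("STREAM FILE demo-thanks \"\"\n");  /* play a bundled sound file */
    fflush(stdout);
    fgets(line, sizeof(line), stdin);

    return 0;
}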

This leaves the Application Module. An Asterisk application is a module which conforms to a naming convention (see Section 4.1.2) and which is compiled against, and linked to, the Asterisk libraries. This option presents by far the most powerful method of extending Asterisk because it is linked to the Asterisk libraries. Applications of this form are used in the dial plan itself, are available to all channels and do not alter the source. This method, therefore, is ideal for our purposes. It has some limitations, however: the applications must be written in C and must conform to the Asterisk C API.

These constraints, fortunately, do not represent a major obstacle, and so the way to add a VoiceXML service to Asterisk is to create an application module. This method conforms to our requirements and provides sufficient functionality to implement a fully fledged VoiceXML gateway. Unlike an AGI application, however, using this method means that the application must be written in C. The process of extending Asterisk using application modules is fully explained in Section 4.1.

2.3.4 Summary

Asterisk is an open-source, multi-protocol software system that is easily configurable and extensible. While it may serve as an IVR server, it does not afford IVR developers the same advantages as VoiceXML does. By adding suitable hardware, Asterisk can become an inexpensive gateway to the PSTN.

Creating a VoiceXML gateway as an application module and then adding relatively inexpensive hardware allows us to create a low-cost VoiceXML gateway with TDM and VoIP interfaces.

2.4 Summary

This chapter has outlined some of the uses of voice-enabled applications and IVRs, showing that they represent a large and important field. We have also seen how VoiceXML works and how it can be used to improve and ease IVR development. While VoiceXML is ideal for creating voice-enabled applications, VoiceXML gateways are expensive. This chapter has shown that it might be possible to create a low-cost VoiceXML gateway using Asterisk.

Now that we have justified the need for a low-cost gateway, and identified a means of creating such a device, we can discuss the actual creation. Following the CBSE paradigm, we first need to identify the various functionality units and obtain COTS systems which can be used as components. These components must have clearly defined interfaces, in keeping with our requirement for an open system, and preferably be open-source, to keep costs to a minimum. Once the components are decided upon, a simple architecture can be defined, and the components can be integrated using XP as a development guide.

The following three chapters detail this process. Chapter 3 describes the process of selecting components and creating a simple architecture, and Chapters 4 and 5 cover the integration process.

Chapter 3

Component selection and system design

In the previous chapter we have seen how CBSE can decrease development time while increasing quality and maintainability, that there is a need for a low-cost VoiceXML gateway and that the telephony platform for such a system is available. We have also gained a deeper understanding of a VoiceXML gateway, its requirements and the interpretation process. We can now continue with the process of creating a VoiceXML gateway on top of the telephony platform.

This chapter separates the VoiceXML gateway architecture into separate units of functionality by examining the gateway requirements. Once these units have been identified, COTS systems can be selected to perform these functions. We will discuss the selected systems here, weighing them up against a list of requirements to determine their suitability.

We will conclude the chapter by showing the architecture of the proposed gateway.

3.1 Requirements and restrictions

The component selection process requires that the different units of functionality have been correctly identified. Fortunately, the architecture of a VoiceXML gateway lends itself very nicely to a component-based approach, as it is very modular in nature. A VoiceXML gateway, as defined in Section 2.2.1, is:

Hardware and software residing in the network that is capable of rendering VoiceXML applications which interact with a user in a standardized manner via synthesized speech, pre-recorded audio, DTMF and speech recognition over any audio channel.


Looking at the architecture of a VoiceXML gateway (Section 2.2.2.1) we can see that the rendering of the VoiceXML document is handled by two separate entities. Interpretation is done by the interpreter and input and output are handled by the implementation platform. Our first requirement, therefore, is an interpreter. The second is the implementation platform which is able to "interact with a user in a standardized manner via synthesized speech, pre-recorded audio, DTMF and speech recognition". The implementation platform, therefore, needs a TTS engine, functionality to output an audio file, a DTMF recognizer, a speech-recognition engine and access to an audio channel. It has already been decided that Asterisk will be used to provide the audio channel, in the form of a PSTN interface. Finally, a gateway needs to be able to fetch documents from a web server, also a responsibility of the implementation platform. The succinct list of requirements, and so the required components, is as follows:

• VoiceXML 2.1 interpreter

• TTS engine

• Speech recognition engine

• Document fetcher

Viewing the components in relation to the Speech Interface Framework we see that the interpreter is essentially an implementation of the FIA, the TTS should be able to synthesize SSML documents, the speech-recognition engine has to handle SRGS grammars and the document fetcher must be able to fetch documents using HTTP. An advantage of both the VoiceXML architecture and the Speech Interface Framework is that the selection of components is very straightforward. In addition to this, the functional units all have clear, well defined boundaries, and so are ideal for the creation of an open system, as described in Section 1.2.1.

The sourcing of the components is governed by a handful of restrictions. These come from many different sources: the thesis requirements themselves, the decision to build a component-based system and the selection of Asterisk as the telephony platform. As stated in Section 1.1, the gateway must be inexpensive and open-source. Section 1.2.1 states that the system is to be COTS-based and open. To make the system open requires clearly defined interfaces with which other components can interact [67]. These interfaces, which can either be APIs or a protocol, lend flexibility to the entire system. Finally, using Asterisk means that the components must be able to run on the Linux platform. In summary, the components must conform to the following requirements:

• Open-source

• Implemented by a third party

• Clearly defined API or protocol

• Well supported (longevity and maintainability)

• Independently executable

• Available on the Linux platform

When choosing open-source components care must be taken to ensure that the projects are well maintained and supported. The level of activity on the mailing lists and the activity of the project itself are indicators of the amount of maintenance and support a project has. These indicators must be kept in mind when choosing the components.

Now that it has been established what exactly is needed, we can discuss the selection of the components.

The following sections discuss the components which have been selected as well as the reasons for their selection. They are: OpenVXI (interpreter and document fetcher), Festival (TTS engine) and Sphinx 4 (speech recognition engine).

3.2 OpenVXI

OpenVXI is an open-source VoiceXML interpreter written in C and C++. It was originally started at Carnegie Mellon University before being taken over by ScanSoft and then finally by Vocalocity. It is essentially a framework for building VoiceXML gateways [24]. While the project itself is open-source, support is sold by Vocalocity, which means that documentation is rare and this can complicate implementation. The mailing lists, however, have been very active [24]. The version of OpenVXI used in this thesis is 3.3, which claims to be completely VoiceXML 2.1 compliant. An initial goal of the project was to make the interpreter as portable as possible. While this has been achieved, and it can run on a host of different platforms including Linux, Windows and Solaris, it has been done at the cost of simplicity. Mechanisms introduced to ensure portability, such as adding extra layers of interaction or specialized data types to pass information across interface boundaries, increase the complexity of the system. The construction of a VoiceXML gateway using OpenVXI is discussed by Eberman et al. [24].

Figure 3.1: The OpenVXI architecture, as described by Eberman et al. [24]

The OpenVXI framework can be seen in Figure 3.1. One can see that the framework is separated into two distinct sections, the Speech Browser and the Platform Components, which is consistent with the VoiceXML architecture: the Speech Browser is equivalent to the interpreter and interpreter context, and the Platform Components to the implementation platform. The OpenVXI framework heavily influenced the selection of CBSE as a software engineering methodology, as it lends itself to a component-based approach.

The Platform Component APIs are just stubs, which must be fleshed out to give the gateway any form of audio capabilities. One can see that the remainder of our components, including Asterisk, integrate very nicely into this architecture.

3.2.1 Speech browser

The Speech Browser component of the OpenVXI architecture requires no further implementation on our part, and is independently executable, but will have no audio functionality until such time as the APIs are implemented. It itself comprises a number of components: the VXI, an XML interface, a JavaScript component, a logging interface and an OS API.

The VXI (VoiceXML Interpreter) is the main control loop, the FIA, and can interpret VoiceXML 2.1. A SAX XML parser (OpenVXI uses Xerces) is used to validate and parse the VoiceXML. ECMAScript scripting functionality is required by the VoiceXML specification to provide some programmatic functionality to VoiceXML documents, and this is provided by OpenVXI's JavaScript interface. OpenVXI uses SpiderMonkey, Mozilla's JavaScript interpreter, for this purpose. The logging and OS interfaces are self-explanatory.

In addition to these components, OpenVXI provides fully implemented Internet and caching functionality, which fetches and caches both VoiceXML documents and any associated files, such as grammars and audio files. This caters for the Document Fetcher item on the list of requirements.

3.2.2 Platform components

The Platform Component section of OpenVXI is where implementation is required. Filling out the stubs gives the gateway audio functionality, including providing an audio channel. The stubs are the rec API, the prompt API and the tel API.

The rec API is responsible for handling any input, including voice recognition, DTMF recognition and recording. It is responsible for converting the input waveforms (speech or DTMF tones) into corresponding tokens and matching the tokens against one or more grammars. The interpreter is responsible for activating grammars, so the rec API requires a DTMF and speech recognizer, as well as recording functionality. The act of matching the tokens returned from the recognizers against the active grammars has been implemented. As per the specification, the speech recognition engine itself must be able to handle dynamic grammars and n-best results with confidence indicators. As we have seen in Section 2.3.1, Asterisk provides DTMF tone recognition, and so, component-wise, all that is required to implement the rec API is a compliant speech recognition engine. Asterisk also provides recording functionality.

The prompt API is responsible for fetching and playing audio as well as synthesizing TTS prompts. Once again Asterisk can be used to provide some functionality, as it is able to play audio files (Section 2.3.1), so an SSML TTS is all that is required for the prompt API.

Finally, the tel interface interacts with the gateway's audio channel. This API has to provide any telephony functionality required by the specification. In our case this functionality (setting up and tearing down calls, for instance) is provided by Asterisk. The tel API is where the gateway functionality of the browser is implemented. It is plain to see how using Asterisk to provide the tel API's functionality results in a VoiceXML gateway.

The fleshing out of these APIs gives the interpreter audio functionality, and results in the creation of the implementation platform. OpenVXI, therefore, is a good candidate for the purposes of this thesis, but it is still worthwhile mentioning publicVoiceXML as an alternative.

publicVoiceXML [74] claimed to be a turn-key solution, providing DTMF recognition and TTS capabilities. This seemed ideal, but unfortunately it did not appear to be true. The project would not compile and seemed to have been abandoned at the time of component selection.

3.2.3 Summary

OpenVXI conforms to our list of requirements. It is open-source, implemented (and maintained) by a third party and independently executable. Component interaction is provided by an API. Since it is written in C and C++, interaction between OpenVXI and Asterisk is simple, and it can easily be integrated into an Asterisk Application Module. It also appears to be well maintained and supported, as suggested by Eberman and by the fact that it supports VoiceXML 2.1. Finally, it runs on the Linux platform.

We can now discuss the two outstanding components required for the gateway, the TTS engine and the speech recognition system.

3.3 Festival

Festival is a Linux-based, open-source framework for creating multilingual TTS systems, written in C++ [7]. Different voices can be created in any language in a relatively simple and quick manner [55]. The advantages of this feature can best be understood if we look at voice-enabled applications from the South African perspective. A synthesized voice is impersonal and alien, which can make a user uneasy. Making the accent of the voice familiar can help this. Additionally, as South Africa has many official languages, supporting different languages means a higher number of potential users. Research into creating Festival voices in indigenous languages is currently underway, for example, at the Speech Technology and Research (STAR) group [83]. For the purposes of this thesis we used the default, free voices which are distributed with the Festival system. These produce clear, if very mechanical-sounding, voices, which suffice for our purposes. While Festival claims to be capable of handling SSML documents, and so meets the Speech Interface Framework specification, we will only be passing text to the Festival engine to simplify matters.

The framework provides six interfaces which can be used to create a full speech-synthesis system, which are described by Black et al. [7]. There are a few restrictions to keep in mind when deciding on an API. Firstly, time delays associated with initialization must be kept to an absolute minimum. Since we are using the synthesized speech for real-time applications, it must be made available immediately, as any time delays will result in a degradation in the usability of the system. Also, the API must provide sufficient functionality to allow concurrent access to the synthesis engine, without any resource-dependent time delays. These two conditions also apply to the speech-recognition engine discussed in the following section.

3.3.1 Festival interfaces

The six interfaces are implemented using either an API or a client-server approach. They present access to the engine in a variety of manners and languages, each of which provides a different combination of control and ease of use. The interfaces are the Scheme API, Shell API, C/C++ API, Java and JSAPI, C-only API and the client-server API. These interfaces are explained by Black et al. [7].

The Scheme API allows full control of the Festival functionality using the Scheme programming language. The command

(SayText "We lcome to Adam King's Voice XML page" ) will create the prompt required for the VoiceXML example discussed in Section 2.2. Commands are either read from files or from the command prompt. This API represents the most fine grained, as one has access to a full programming language. The Scheme API is also the most direct means of controlling Festival, as most other methods use Scheme at some point. Commands are interpreted by Festival in either balch mode or live mode. In batch mode Festival will exit as soon as it is done executing the Scheme commands, while in live mode it continuously reads Scheme commands from the command line.

Using the Scheme API in batch mode does not satisfy the system requirements, as there is an initialization time delay associated with Festival having to restart after every synthesis request. Using the Scheme API in live mode only incurs the initialization penalty the first time it is started, but not on subsequent synthesis requests. Unfortunately, neither method allows concurrent access to the synthesis engine, which means that a new Festival process has to be created for each concurrent call, wasting resources and incurring at least one time penalty for every call. We therefore have to look at one of the other methods.

The Shell API provides simple access to Festival through the shell. Besides being limited, this method does not allow concurrent users and will result in an initialization time penalty for each call. This is because a new Festival instance has to be created for each call.

The C/C++ API allows developers to embed Festival in their own C/C++ applications. An application can link its source to the Festival libraries, which gives the application access to a number of external commands. This method is desirable, as OpenVXI, with which we would be interacting, is written in both C and C++. This method will also allow each process handling a call to have full access to the Festival system. However, it had to be discarded as it still has an initialization penalty.

The three final APIs all make use of the Festival server. The text to be synthesized is sent to the Festival server as Scheme commands and the server replies with the synthesized text. Since the client sends Scheme commands, a lot of the Scheme API functionality is retained. The client-server approach has numerous advantages. Clients do not have to link with other, Festival-specific, code; the server can reside on a separate machine; and, most importantly, a client-server approach allows concurrent synthesis requests and, once the server is up and running, there are no initialization penalties. The C client-server interface is therefore ideal. Additionally, Asterisk uses this approach in implementing the festival application, and Festival provides an extensive example. Therefore a C++ client was created to provide half the functionality of the prompt API, i.e. the acoustic rendering of text messages. The other half, the playing of pre-recorded audio, will be done directly by Asterisk.
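A bare-bones sketch of such a client is shown below. The default server port of 1314 is an assumption based on Festival's documentation, and a real client must additionally parse the server's framed replies to recover the waveform; that parsing is omitted here:

/* Minimal Festival-server client (sketch): send one Scheme command
 * to a running "festival --server" instance. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(1314);            /* assumed default server port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* the server evaluates whatever Scheme it is sent */
    const char *cmd = "(SayText \"Welcome to Adam King's Voice XML page\")\n";
    write(fd, cmd, strlen(cmd));

    /* a full client now reads the framed reply (status and waveform) */
    close(fd);
    return 0;
}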

3.3.2 Alternatives

The Cepstral speech synthesis system also runs on the Linux platform, and it is also multilingual. Comparing the Festival free voices to the demos available for Cepstral, it appears, to this author's untrained ear, that the Cepstral output is of a better quality. Unfortunately, Cepstral is closed source, and using Cepstral is not free. Voices require licensing, and cost $29.99 USD (October 2006) each; and a further license is required for concurrent execution, which starts at $200 USD (October 2006) for four ports [15]. In addition to this, voices cannot be created by a third party, and so obtaining unsupported languages and accents may be problematic. However, this does raise the point that some of the open-source components may not be the most suitable in all scenarios, and should an organization be able to afford a Cepstral voice they may want to use it. This highlights an advantage of creating an open component-based system, as it would be relatively simple to replace Festival with Cepstral. While this may require some knowledge of the system and implementation skills, it is still better than having to pay for another system with expensive hardware. Additionally, the more open the system is, the easier the component swapping should be. In this case it should simply mean changing the prompt API to interface with the Cepstral system.

3.3.3 Summary

The Festival TTS engine conforms to our list of requirements. It is Linux-based and open-source, independently created and executable. Component interaction is done via a client-server architecture. The two additional advantages, the ability to create specific voices and the fact that Asterisk already uses it, make it a very attractive option.

Test       S3.3 WER   S4 WER   S3.3 RT   S4 RT (1)   S4 RT (2)   Vocabulary Size   Language Model
TI46       1.217      0.168    0.14      0.03        0.02        11                isolated digits recognition
TIDIGITS   0.661      0.549    0.16      0.07        0.05        11                continuous digits
AN4        1.300      1.192    0.38      0.25        0.20        79                trigram
RM1        2.746      2.88     0.50      0.50        0.41        1,000             trigram
WSJ5K      7.323      6.97     1.36      1.22        0.96        5,000             trigram
HUB4       18.845     18.756   3.06      4.4         3.95        60,000            trigram

Table 3.1: Comparison of Sphinx 3.3 and Sphinx 4 performance, as shown in [14]

3.4 Sphinx

The final component that is required is a speech recognition engine. This must be able to support SRGS, n-best results, confidence scores and dynamic grammars, as described in Section 2.2.2.2. The telephony environment in which we want to use the engine adds two important constraints. Firstly, speaker independence is required. Secondly, the speech engine must come with models for 8kHz audio, as the PSTN samples audio at 8kHz. (While it may be possible to create our own models, this is time consuming and certainly beyond the scope of building a proof-of-concept prototype.)

Sphinx is an open-source, speaker-independent, continuous-speech, large-vocabulary speech-recognition engine which was created at Carnegie Mellon University [47, 46]. It has been observed, by joining the mailing list, that the Sphinx project has a large and active community. There are three main versions: Sphinx 2, Sphinx 3 and Sphinx 4. Sphinx 3, a successor to Sphinx 2, is written in C++ and runs on the Linux platform. Sphinx 4 is written in Java. Both support n-best recognition lists with confidence scores and the SRGS specification [76, 14]. While it may seem preferable to use the Sphinx 3 system, as it is written in C++, Sphinx 4 outperforms Sphinx 3 both in terms of word error rate and real-time performance in a number of tests run on different acoustic models, as can be seen in Table 3.1. The Word Error Rate (WER) figure shows how many words were incorrectly recognized (as a percentage); the lower the figure, the better the accuracy. RT stands for Real-Time and represents the time taken to decode the utterance. The Sphinx 4 tests were run on both single and dual processor architectures, hence the two real-time results. One can see that Sphinx 4 outperforms Sphinx 3 in most tests, except in the RM1 test, where it has a worse WER, and the HUB4 test, where it has a worse RT figure. One can see that the 5,000 word Wall Street Journal (WSJ) model tested better in Sphinx 4 than in Sphinx 3. An 8kHz version of this model is contained in the Sphinx 4 package and is what will be used for this project. We have therefore chosen the Sphinx 4 system to implement the rec API.

Sphinx 4 has been designed to be as modular and flexible as possible, supporting many kinds of grammars and acoustic models [44]. Creating applications using Sphinx is discussed in the Sphinx 4 help files [43]. Building such an application simply requires a Java front end, a configuration file and a grammar. The configuration file allows us to configure the engine to accept audio from the PSTN. A detailed description of how to create a speech-recognition application using Sphinx 4 will be given in Section 5.3.1.
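For illustration, a hypothetical JSGF grammar of the kind Sphinx accepts (not a grammar from this project), matching a single spoken digit, could look like this:

#JSGF V1.0;

grammar digits;

// a single spoken digit (illustrative)
public <digit> = zero | one | two | three | four |
                 five | six | seven | eight | nine;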

3.4.1 Alternatives

A relatively new alternative is the inclusion of speech recognition functionality directly into the Asterisk system. The LumenVox speech engine has been added to Asterisk and provides out-the-box speech recognition. While this may seem ideal, this service is not free, as licensing fees are required. A single speech-recognition port costs $50 USD while 5 ports cost as much as $1 500 USD (October 2006) [52]. As with the Cepstral TTS engine, these costs conflict directly with the requirements set out in this thesis, and so LumenVox will not be discussed further.

3.4.2 Summary

Sphinx 4 satisfies most of the requirements for a component for the system under construction. It is open-source, independently created and available on the Linux platform. The one requirement which is not immediately obvious is that of a clearly defined interface.

Integrating Sphinx into the system involves three distinct entities: the engine itself, the front end, and the OpenVXI rec interface. We have to ensure that the entities can all interact. The front end simply uses the Sphinx API. An interface between OpenVXI's rec interface and the Sphinx front end results in the integration of Sphinx into the system. This integration process is discussed in full in Chapter 5.

While the integration issue may have been simplified greatly by the use of Sphinx 3 (because different languages would not be an issue), we chose Sphinx 4 for two reasons. Firstly, and most importantly, Sphinx 4 outperforms Sphinx 3 both in terms of word error rate and real-time performance. Secondly, integrating Sphinx 4, which is written in Java, into OpenVXI, which is written in C and C++, is an interesting exercise in Inter-Process Communication (IPC) on a Linux platform. Audio must be sent to Sphinx for recognition and the resulting recognition must be passed back to the rec API.

In addition to meeting the project requirements Sphinx 4 is also flexible enough to conform to all the gateway-related requirements. It is therefore ideal for our purposes. From this point on Sphinx 4 will simply be referred to as Sphinx.

Figure 3.2: The VoiceXML gateway's architecture

3.5 System architecture

As discussed in Section 1.2.2, we wanted to keep the architecture of the gateway as simple as possible. Fortunately, due to the way in which VoiceXML has been designed (Section 2.2.2.1), this is not a complicated task, particularly given that the system has only four components (Asterisk, OpenVXI, Festival and Sphinx). OpenVXI's architecture is clearly influenced by the VoiceXML architecture, resulting in the distinction between the Speech Browser and the Platform Components. Since we are integrating the components into OpenVXI, this will have a large impact on the gateway's architecture. This architecture can be seen in Figure 3.2.

There are a few things to note about the architecture. The first is that despite the requirement of an open system, Asterisk and OpenVXI have become tightly coupled and heavily dependent on each other. Asterisk will be used to provide DTMF recognition and audio playback as well as telephony functionality, which means that it is used to flesh out all three of the OpenVXI interfaces. Architecturally, this means that the core of our system, which was Asterisk, has expanded to include OpenVXI, and removing either of these will result in an entirely new system.

On the other hand, both Festival and Sphinx are loosely coupled, keeping in line with our open system requirements. These systems can easily be replaced, as they are not bound to the system.

The simplicity of the architecture is apparent. The core of the system is the combination of the Asterisk and OpenVXI systems (in other words, the audio channel to the PSTN and the interpreter).

"~ Uut<,\.!\i\, ; . ~- CHAPTER 3. COMPONENT SELECTION AND SYSTEM DESIGN 52

This core provides hooks for the other two components to interact with.

3.6 Summary

This chapter follows directly from the CBSE principles discussed in Section 1.2.1. We have divided the gateway up into units of functionality, after examining the VoiceXML specification. Given Asterisk as a pre-requisite, the remaining units of functionality have been found to be: an interpreter, a text-to-speech (TTS) system, a speech-recognition engine and a document fetcher. Components were selected to provide the required functionality.

The three components selected, in addition to Asterisk which was pre-selected, are: OpenVXI (interpreter and document fetcher), Festival (TTS system) and Sphinx (speech-recognition engine). We have seen how these components have met both functionality and thesis requirements, justifying their inclusion in the system.

The next two chapters discuss the integration process of the selected components.

Chapter 4

Building the gateway: Phase 1

This chapter discusses the first half of the implementation process, that is the integration of Asterisk, OpenVXI and Festival, resulting in a VoiceXML gateway which has only output capabilities.

The entire implementation process subscribes to the XP methodology, which was described in Section 1.2.2. Following the scope-driven development paradigm gives us both the order of implementation and the user stories for each stage in the development process. Firstly, the task of extending Asterisk, as described in Section 2.3.3, must be completed. To this we have to add VoiceXML interpretation capabilities, resulting in a system which can handle PSTN (and other) calls, and fetch and interpret VoiceXML documents, but cannot interact with a user. This interaction is made possible by filling in the OpenVXI APIs to provide input and output capabilities. Prompting capabilities, in terms of scope-driven development, are more important than handling input, and so we will examine the process of implementing OpenVXI's prompt API.

In keeping with the XP paradigm we employed continuous testing. At each step in the development process we discuss the testing process. Fortunately, creating these tests was usually a simple matter of creating a VoiceXML document to test whichever unit of functionality had been implemented. Once the new functionality passed the tests, the code could be refactored and integrated into the main system.

As the various user stories were implemented we could see the system grow, finally culminating in a VoiceXML gateway.

4.1 Creating an Asterisk application

Our first, and most important, task in the development process is to extend Asterisk in a way that does not alter the source code, making it portable across different Asterisk installations, and allowing any channel, the Zap and VoIP channels in particular, to access the new functionality. For this we have chosen to create an Asterisk application module. Asterisk applications of this sort are called from the dial plan. This section simply describes the process of creating and adding an Asterisk application module. The actual functionality of the application is discussed later in the chapter. This process is generic, and anyone wishing to add an application module has to first complete this step. It must be noted that this process has changed slightly in later versions of Asterisk: the naming of some variables has changed.

Before creating the application, and in keeping with XP practices, we need to create the tests. Hence this section will also discuss and motivate the test. Finally, the application has to be named. We have simply decided to call it voicexml.

4.1.1 Creating the test

Creating a test for this is simple, as it involves a single dial plan command. Since we want to be able to pass the application arguments, in particular the VoiceXML URL, the test has to reflect this functionality:

exten => 999,1,voicexml(http://king.rucus.net/test.vxml)

Should a user dial 999, the voicexml application must be executed, and, to verify argument handling, the application must simply output the argument. Should the test pass it means that we have successfully created a new Asterisk dial plan application which is exposed to any channel supported by Asterisk.

Now that we have created the test we can proceed with the application module itself.

4.1.2 The voicexml application module

There are two steps involved in creating an Asterisk application module. The first is to create the application itself and the second is to ensure it gets built and loaded. While this is a relatively straightforward process, it does require some knowledge of the Asterisk system and some Linux fundamentals. We will discuss this process in the rest of this section.

All the Asterisk applications are written in C and follow a simple template. The voicexml skeleton application is shown in Appendix D. These C files all follow a naming convention. The C file must be named 'app_<name>.c', where name corresponds to the name of the application. In our case this will be app_voicexml.c. Each application is then required to provide a number of global variables and functions. The variables all point to strings which are used for the description of the application in places such as the Asterisk CLI. These follow a similar naming convention, with the application name being substituted into the name of the variable to create a unique identifier.

These variables are followed by a number of methods: voicexml_exec, unload_module, load_module, description, usecount and key. Of these, voicexml_exec, unload_module and load_module are of interest. load_module and unload_module are responsible for loading and unloading the application respectively. voicexml_exec is where execution begins when the application is called, and where application-specific functionality is added. In this user story it simply checks for arguments and outputs them, but eventually this is where OpenVXI and Asterisk will be integrated.

Handling the arguments is relatively simple. The voicexml_exec function accepts two arguments, a pointer to the channel structure associated with the call and a void pointer, called data, which points to any arguments passed to the application from the dial plan. Checking for arguments is a simple matter of evaluating the data variable.
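To make the template concrete, the following is a minimal sketch of app_voicexml.c covering this first user story. It follows the Asterisk 1.0-era module conventions (the complete skeleton is given in Appendix D); the macro names and registration calls may differ in later Asterisk versions, and the log messages are illustrative.

/* app_voicexml.c - a minimal sketch of the application module template,
 * based on the Asterisk 1.0-era module API */
#include <asterisk/channel.h>
#include <asterisk/logger.h>
#include <asterisk/module.h>
#include <asterisk/pbx.h>

static char *app = "voicexml";
static char *synopsis = "Interpret a VoiceXML document";
static char *descrip = "voicexml(url): fetch and interpret the VoiceXML document at url\n";

STANDARD_LOCAL_USER;
LOCAL_USER_DECL;

/* execution begins here when the application is called from the dial plan */
static int voicexml_exec(struct ast_channel *chan, void *data)
{
    struct localuser *u;

    if (!data) {   /* no URL was passed from the dial plan */
        ast_log(LOG_WARNING, "voicexml requires an argument (URL)\n");
        return -1;
    }
    LOCAL_USER_ADD(u);
    ast_log(LOG_NOTICE, "voicexml called successfully\n");
    ast_log(LOG_NOTICE, "argument is: %s\n", (char *)data);
    /* ... later: initialise OpenVXI and interpret the document ... */
    LOCAL_USER_REMOVE(u);
    return 0;
}

int load_module(void)
{
    return ast_register_application(app, voicexml_exec, synopsis, descrip);
}

int unload_module(void)
{
    STANDARD_HANGUP_LOCALUSERS;
    return ast_unregister_application(app);
}

char *description(void) { return descrip; }

int usecount(void)
{
    int res;
    STANDARD_USECOUNT(res);
    return res;
}

char *key(void) { return ASTERISK_GPL_KEY; }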

The channel structure passed to voicexml_exec is what is used by the application to manipulate the channel. It has already been initialized at this point, and so the application itself is not concerned with any channel specific issues. This means that any application created in this manner is immediately available to any Asterisk channel, and so is available to any Asterisk supported telephony technology. In our case it means that the voicexml application will be available to the Zap channel, and so the PSTN, creating a gateway. In addition to this, the application will support any channels that are added at a later date. This means that should Asterisk include a new technology, say a new VoIP protocol, it will have immediate access to the voicexml application, with no changes to the application required.

It is important to note that while the application has to conform to the template and naming conventions, it is still simply a C file, and so other functions and global variables can be added as usual. Once the template has been created, it needs to be built, linked and loaded against the Asterisk source. This is a fairly simple process, but it is done in a manner which is specific to Asterisk.

All Asterisk applications can be found in the apps/ directory of the Asterisk source, and so our new application, app_voicexml.c, must be put there. To ensure that the application is built and linked, the apps/Makefile file has to be edited. The make command, which is used to build Asterisk, is discussed by Fowler [27]. Instructions on what to do with specific files are passed to make in the form of a Makefile. A Makefile consists of one or more targets. A target represents the file which is to be built. It depends on one or more prerequisite files, and has an associated action, which is responsible for the actual building. So a simple Makefile entry to build an object file, say my_object_file.o, from a C file, my_c_file.c, would look like this:

my_object_file.o: my_c_file.c
	# action - build the object file, which requires my_c_file.c
	gcc -c my_c_file.c -o my_object_file.o

So, we need to add a target to the Makefile to build app_voicexml.c. This can be done in two ways, explicitly and implicitly. While it will suffice to add our application implicitly at this time, we will discuss both methods here, as the implications of creating explicit targets will become apparent in the following section.

The majority of Asterisk applications are built implicitly. What this means is that they are all built using the same build commands. This generic approach eases the task of creating the Makefile, and will suffice for most cases as most applications do not require any special build arguments. Since our application does not have any abnormal build requirements at the moment (we are not linking to any extra libraries for instance) this method is sufficient for our purposes. The apps/Makefile file contains an APPS variable, which is a list of all the targets which are to be built in this manner. Simply adding app_voicexml.so to this list, as shown below, will ensure that the application module is built correctly, passing the standard arguments to the compiler.
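Assuming the stock layout of the Asterisk 1.0 apps/Makefile, where the implicit targets are listed in the APPS variable, the addition is a one-line change:

APPS+=app_voicexml.so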

The build process creates a shared object, which is dynamically linked to Asterisk, and loaded at run-time.

The install process copies the shared object files from the apps/ directory to Asterisk's main module folder, /usr/lib/asterisk/modules/. These modules are loaded at run-time, and can be configured in the Asterisk module loader configuration file found at /etc/asterisk/modules.conf.

It is plain to see that extending Asterisk in this manner is very simple and powerful. Creating the application is a simple matter of fleshing out the skeleton and ensuring that it gets built by editing the appropriate Makefile. The creation of the module does not require the Asterisk source to be altered, and is done in such a way as to be available to any channel.

Finally, we can run the tests, which show that we have successfully created the application, and that we can pass arguments from the dial plan to the application. The test is simply run by dialing 999 from a telephone connected to the Asterisk system. If we log to the CLI from the voicexml application we see the following output when running the test:

NOTICE[30189]: app_voicexml.c:125 voicexml_exec: voicexml called successfully
NOTICE[30189]: app_voicexml.c:126 voicexml_exec: argument is: http://king.rucus.net/test.vxml

Since this is such a simple process, the task of refactoring and integrating can be ignored. Now that we can add functionality, we can look at integrating OpenVXI into the voicexml application.

4.2 Asterisk and OpenVXI

The integration of Asterisk and OpenVXI is mainly an exercise in linking libraries, as we have to link OpenVXI and the voicexml application. Once that is done, we can add the interpretation functionality to the voicexml application, and pass the channel pointer to the OpenVXI platform. Once this is complete, we will have a VoiceXML gateway with no input or output capabilities. While this does not immediately create anything useful, it does result in the core of our system, and the architecture described in Figure 3.2 begins to take shape.

Linking Asterisk and OpenVXI in this way provides most of the functionality required by the tel API without actually needing to add anything to the API itself. By the time the interpreter is required in the voicexml application, the incoming call has been set up and answered, and so handling the incoming call is not part of the tel API's responsibility. Similarly, once interpretation of the VoiceXML application is complete, the voicexml application can simply exit and Asterisk will be responsible for tearing down the call. The remaining functionality required by the tel API is to provide support for the transfer and disconnect elements. Our proof-of-concept gateway will not support the transfer element, but providing support for the disconnect element will be discussed later.

In addition to this, the linking process will also give the OpenVXI APIs access to the Asterisk DTMF recognition and audio file handling functions, as described in Section 4.3 and Chapter 5.

Once again we can begin the process by designing a simple test.

4.2.1 Creating the test

There are a number of things which have to be tested. Firstly, since we are making use of OpenVXI's document fetching mechanisms, we need to be sure that the VoiceXML documents can be fetched.

Secondly, we need to test that the integration has been successful, and that the voicexml application is able to initialise and execute the interpreter, passing it the URL argument and the channel pointer. A means of testing this was not immediately apparent, as there was no means of input or output. It was decided to simply create a VoiceXML application which plays some prompts and then edit the prompt API to see if the interpreter makes the request to render the prompts. (As said, at the moment our gateway will not be able to render the prompts, but we can study the output to determine if the interpreter makes the requests to the API.) Of course, should this test be passed, we can assume that the document-fetching test also passed.

Once again, creating the tests is a simple task, and involves creating a VoiceXML application and uploading it to a web server. The VoiceXML application that has been created is a simplified version of the example application in Figure 2.3, and we expect a successful test to output the greeting and the pressOne.wav prompt requests.

4.2.2 Adding the VoiceXML interpreter to the voicexml application

To allow the voicexml application to interpret VoiceXML documents the voicexml application and OpenVXI have to link to each other, as both components need access to each other's libraries. voicexml needs to begin the interpretation process and OpenVXI needs to be able to control the Asterisk channel from its APIs.

To do this, we have to ensure that when each component builds it is linked to the correct libraries at the correct places. So we once again edit the respective Makefile, creating explicit rules (as described in the previous section) to ensure that the prerequisite files are linked correctly.

Applications can be linked either dynamically or statically. If an application is statically linked to a set of libraries, those library functions used by the application will be copied and built into the resulting executable. This allows the application greater portability, as the linked libraries are not required on the target machine. The downside of this is that more memory will be used.

When a library is dynamically linked, a reference to the library is inserted into the executable, and the actual linking only takes place at run-time. This means that the size of the application will be smaller, but the libraries are required on the target computer, hindering portability. Linking in this way means that the application and the library are independent, and many different applications can link to the same library, avoiding having the same functions repeated in many different applications [28]. An added, although somewhat risky, advantage is that changes to the library will affect any dynamically linked modules.

In keeping with our open system approach we will use dynamic linking. This means that, while OpenVXI and Asterisk are still tightly coupled, they are still separate entities. This eases the task of maintenance, as upgrading or installing either can be done independently.

Once the components are linked, we can begin to add functionality to the voicexml application, by filling out the voicexml_exec function, taking care to do any required housekeeping. The following pseudo code segment shows the application logic:

voicexml_exec(channel, data)
    /* deal with the parameters */
    if we have a URL
        /* set up and run the interpreter */
        initialise the interpreter, passing URL and channel
        allocate the interpreter resources
        interpret the VoiceXML
        /* once we get here we are done with the VoiceXML
         * application, for whatever reason, and can shut down */
        shut down the interpreter
        clean up memory
    else (there is no URL)
        show an error message and exit

One can see that the linking process is where the majority of the work has been done, and that once that is accomplished we can use the OpenVXI API to run the interpreter. All that remains is to pass the channel pointer through to the APIs, which has been achieved by adding an extra argument to the initialization function. It is passed to the interpretation engine when it is initialized by the voicexml application and then again when the engine initializes the APIs.

It is also apparent that neither the voicexml application itself nor the interpreter concern themselves with any call specific details, maintaining the channel portability amongst Asterisk channels, including the Zap channel, which gives our system an interface to the PSTN.

We can now apply the tests to ensure that the process has been completed successfully.

4.2.3 Running the test

As with the previous test, dialing 999 runs the test, and we see the following output:

NOTICE[30189]: app_voicexml.c:125 voicexml_exec: voicexml called successfully
NOTICE[30189]: app_voicexml.c:126 voicexml_exec: argument is: http://king.rucus.net/test.vxml
Welcome to Adam King's Voice XML page

There are a number of things to note. Firstly, the VoiceXML document has been fetched correctly. Secondly, the document has been processed and the interpreter has attempted to play the prompts, so we can assume that the linking process has been successful.

This output also shows that OpenVXI handles prompts internally as SSML documents, which is compliant with the VoiceXML standard. The interpreter encountered the two prompts in the VoiceXML application and created SSML documents for the prompts before sending them to be queued by the prompt interface. The interpreter queues prompts in the manner discussed in Section 2.2.2.4, and the implementation of this queue is discussed in the following section. Both types of prompts (text and audio files) are represented by SSML documents, which means that it is not possible to send the queued SSML documents directly to the Festival system, as it cannot play back audio files. Therefore the SSML prompts have to be parsed at some stage to differentiate between audio files and text for synthesis. This process is also discussed in the following section.

The resulting architecture can be seen in Figure 4.1. The figure illustrates the tightly coupled relationship between Asterisk and OpenVXI, and how Asterisk is capable of interacting with all three OpenVXI APIs. Now that Asterisk and OpenVXI are linked, we can move on to giving the gateway voice by implementing the prompt API.

Figure 4.1: The core of the gateway after Asterisk and OpenVXI are integrated

4.3 Giving the gateway voice

This section describes the implementation of the prompt API, which will enable the gateway to produce output. This is not quite as simple as integrating Festival into the system. As noted in the previous section, an SSML parser will have to be created and a prompt queue implemented. Only once these mechanisms have been put in place can we look at actually playing the prompts on the channel.

This development episode will result in a gateway with output capabilities, and we can start creating some VoiceXML applications which have some use, mainly information services. The first example that comes to mind is the time server - "At the beep the time will be".

Since the tests created in the previous section used simple prompts to prove that the interpreter was functioning correctly, we do not have to create new tests. Running the tests and evaluating the output requires a user. One has to listen to the call to ensure that the gateway is producing output and that the output is of a satisfactory nature. Listening to the Festival output from the Shell API and comparing it with that of the gateway will give us an indication of the level of success of the integration process. Should the system produce audio of a worse quality than when using the standard Shell API, we can assume that the integration was not completely successful. While a test of this sort would be regarded as subjective, it will only be used as an indicator as to whether the integration process has been successful. Should understandable speech be turned into garbled sound, we can assume that the integration was unsuccessful.

We will begin by describing how the APIs work, and then investigate the contents of the prompt API which determines the implementation requirements. We can then look at the SSML parsing routine, queue implementation and the playing of both types of prompts. Finally, the test will be run.

4.3.1 The prompt API

As stated in Section 3.2, a primary goal of the OpenVXI system is portability, which is reflected in the structure of the APIs. They are more complex than they would be if portability were not required. This section discusses the APIs, using the prompt API as an example. It will also examine those functions of the API which need to be filled.

The Speech Browser component of the system has been implemented so that it is not in any way platform-specific, maintaining the goal of portability. The Platform Components, however, are platform dependent as they have to access hardware, sound cards and telephony interfaces for instance.

These have therefore been implemented as a set of APIs, which are used to implement any platform-specific requirements, and are responsible for asynchronous event handling.

The framework has been designed to allow components to be replaced either at run-time or initialization, as stated by Eberman et al. [24]. Each API is built on an abstract C interface, in this case the VXIpromptInterface. The VXIpromptInterface is essentially a branch table used to transfer control to the API, allowing the run-time replacement of components in a generic, portable manner. The APIs themselves contain another structure, VXIpromptImpl in this case, which has as its base the abstract interface, and includes pointers to any other interfaces the API may require, such as the logging or Internet components and, in particular, the channel pointer. When the resources are created for the API, the abstract interface's function pointers are set to point to the appropriate functions. In other words, each API is represented in the OpenVXI system simply by an abstract C interface containing function pointers. Anyone using the OpenVXI framework must create an interface which resolves these pointers and provides the required functionality. OpenVXI provides an example of how to do this, which we have used and extended. This implementation, represented by the VXIpromptImpl structure, provides the functions the base interface points to. To this structure we have added a pointer to the Asterisk channel representing the call, giving the various functions control of the channel.
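The pattern can be illustrated with a short, simplified C sketch; the structure and function names below are condensed for illustration and are not OpenVXI's exact declarations.

/* Simplified illustration of the OpenVXI API pattern: an abstract
 * interface of function pointers, embedded as the first member of a
 * platform-specific implementation.  Names are illustrative only. */
struct ast_channel;   /* Asterisk's channel structure */

typedef struct prompt_interface {
    /* the branch table: the interpreter calls through these pointers */
    int (*BeginSession)(struct prompt_interface *self);
    int (*Play)(struct prompt_interface *self);
    /* ... Queue, Wait, Stop, EndSession ... */
} prompt_interface;

typedef struct prompt_impl {
    prompt_interface intf;      /* the base interface comes first     */
    struct ast_channel *chan;   /* our addition: the Asterisk channel */
    /* pointers to other interfaces (logging, Internet) would go here */
} prompt_impl;

/* an implementation function recovers the impl structure from the
 * abstract pointer, gaining access to the channel */
static int prompt_play(prompt_interface *self)
{
    prompt_impl *impl = (prompt_impl *)self;
    /* ... render the queued prompts on impl->chan ... */
    (void)impl;
    return 0;   /* e.g. VXIprompt_RESULT_SUCCESS */
}

/* when the API's resources are created, the pointers are resolved */
static void prompt_init(prompt_impl *impl, struct ast_channel *chan)
{
    impl->intf.Play = prompt_play;   /* and the other functions likewise */
    impl->chan = chan;
}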

Communication from the APIs to the interpreter is handled by return values. An extensive list of result codes has been provided by OpenVXI, which are returned to the interpreter. Each interface has a set of codes which are defined in the include file of the API. A result code of zero indicates success. Result codes smaller than zero indicate a severe error, which is likely to be the result of a platform error (such as an out-of-memory error) and may cause the termination of the call. Result codes larger than zero are warnings, and are usually application related errors, such as errors when fetching an audio file or restricted access to the TTS system. VXIprompt_RESULT_SUCCESS is an example of a successful prompt API return code.

The base interface, the VXIpromptInterface, gives us an indication of the functions which have to be provided:

• VXIpromptBeginSession

• VXIpromptEndSession

• VXIpromptPlay

• VXIpromptPlayFiller

• VXIpromptPrefetch

• VXIpromptQueue

• VXIpromptWait

• VXIpromptStop

Of these, the most important functionality is provided by VXIpromptPlay and VXIpromptQueue. We have not implemented the VXIpromptPrefetch and VXIpromptPlayFiller functions, as they do not represent important gateway functionality, and so could be disregarded in the construction of our proof-of-concept system.

VXIpromptBeginSession is used to set up a prompting session for a call, and in this case all that has to be done is to set the channel pointer to the pointer passed through from the voicexml application. VXIpromptEndSession is similarly used at the end of a session to do any housekeeping. In this case we ensure that the prompt queue is empty and the memory has been freed. We also remove any files that were used during the prompting session. VXIpromptStop is used to stop a playing prompt. This behavior is discussed when implementing the barge-in attribute in Section 4.3.4. The final two functions that have been implemented, VXIpromptPlay and VXIpromptQueue, are discussed in depth further on in this chapter. Before that can be done we have to discuss the SSML parser.

4.3.2 SSML parser

This section discusses the implementation of a simple SSML parser. This parser is only used to extract the most important information from the SSML document, and the rest of the information is disregarded. As we have seen in Section 4.2.3, OpenVXI handles prompts as SSML documents internally. While it would be ideal to pass the SSML directly to Festival for synthesis, we have chosen not to for two reasons. Firstly, the interpreter represents both audio and synthesized prompts as SSML documents. Asterisk, which does not support SSML, is used to render the audio prompts and so the audio prompts, at least, have to be represented in some other fashion, requiring the SSML to be parsed. Secondly, it is simpler to pass Festival plain text as opposed to an SSML document. While disregarding SSML specific functionality reduces the VoiceXML functionality somewhat (words will not be emphasized, for instance), it makes the text synthesis process simpler, and this is sufficient for the purposes of this thesis. It must be noted that the gateway still conforms to the standard in that it represents prompts as SSML documents and can handle these documents.

We therefore have to parse the SSML documents, extract the required information, and represent them in some manner appropriate to our system. Since each prompt is queued before it is played, we have two opportunities to parse the SSML - VXIpromptQueue and VXIpromptPlay. VXIpromptQueue takes a number of parameters, including a pointer to the SSML document itself and a list of properties which include some of the VoiceXML tags' attributes. It makes sense to parse before queuing the prompts, as we have access to the attributes.

The list of attributes passed to VXIpromptQueue is passed as a VXIMap. The VXIMap data structure is a container for key/value pairs which are of type VXIchar *. The VXIchar type represents the local char type. All the primitive data types used by OpenVXI are declared as typedefs to allow portability across as many operating systems as possible. VXIMaps are used to pass data across interface boundaries, in this case from the interpreter to the prompt interface.

Before parsing can begin, a data structure must be chosen to represent a prompt. A prompt structure was created for this purpose. The attributes required for the structure can be derived from examining both the prompt and audio tags. The prompt structure contains type, bargeIn, fetchtimeout, fetchinit and payload attributes. The type attribute indicates whether the prompt is an audio file or text and bargeIn is a Boolean variable indicating whether or not user input can stop the prompt while it is playing, and is used for both types of prompts. fetchtimeout and fetchinit are audio attributes which control the fetching of an audio file. Finally, the payload attribute is used to store either the URL of the audio file or the text which is to be synthesized.
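A C sketch of the prompt structure follows; the field types are our guesses from the attributes just described, and the next pointer is the queue link added in Section 4.3.3.

/* Sketch of the prompt structure described above; field types are
 * illustrative.  The next pointer is the queue link of Section 4.3.3. */
enum prompt_type { PROMPT_AUDIO, PROMPT_TEXT };

struct prompt {
    enum prompt_type type;   /* audio file or text to be synthesized   */
    int bargeIn;             /* may user input interrupt this prompt?  */
    int fetchtimeout;        /* audio only: fetch timeout in ms        */
    int fetchinit;           /* audio only: fetch attribute (see the
                              * log output in Section 4.3.5)           */
    char *payload;           /* audio URL, or the text to synthesize   */
    struct prompt *next;     /* next prompt in the FIFO queue          */
};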

These account only for some of the more common attributes. The most notable exclusion is the xml:lang attribute of the prompt tag, which has been ignored by the parser because only one language is available at a time: when the Festival server is started, it is configured to use a single voice, and this configuration has not been exposed to the gateway.

Now that a data structure representing a prompt exists, the task of parsing the SSML can begin. The bargeIn property of a prompt is not part of the SSML, and is instead passed to VXIpromptQueue as a property in the VXIMap. An accessor, VXIMapGetProperty, has been provided to get values from a VXIMap by their name. We can therefore use this to set the bargeIn property of the prompt structure.

SSML documents, as can be seen in Section 4.2.3, consist of an xml tag followed by a speak tag containing a number of attributes, a payload and then finally a closing speak tag [91]. It is assumed, since the XML document has been created by the interpreter, that it is well formed, and we can disregard the opening xml tag. While the speak tag contains the xml:lang attribute, this is ignored and so the speak tag is also ignored. The first part of the SSML document that is of interest to us is the payload. If the payload is not an audio tag it is assumed that it is text to be synthesized, and the payload attribute of the prompt structure is pointed to the text. If the payload is an audio tag it also has to be parsed and the fetchtimeout, fetchinit and src attributes have to be extracted. In the case of an audio tag, the payload attribute of the prompt structure is set to the value of src. Finally we can set the type. The type attribute is important because it is used to easily check which system, Asterisk or Festival, must be used to render the prompt.

Once the prompt has been parsed and its corresponding prompt data structure has been filled, it can be queued for playing.

4.3.3 Creating the prompt queue

The VoiceXML specification [56] requires a queue to hold those prompts which are to be played by the Implementation Platform. This section will briefly discuss the implementation of such a queue.

It is the responsibility of the FIA to queue the prompts, which it does during the COLLECT phase. As seen in Section 2.2.2.4, the FIA will only queue those prompts which will be played and in the correct order. A prompt which is in the queue will be played unless a disconnect element is encountered, in which case the queue is flushed, or if a hangup is detected. This holds true even if a transfer element is encountered - all prompts in the queue are played before a transfer takes place. Because of this, we know that when the interpreter requests a prompt to be queued it can be unconditionally added to the queue, and once the prompt has been played it can be discarded. Should the same prompt have to be played for a second time, it is the responsibility of the FIA to queue the prompt again.

This behavior greatly simplifies the queue data structure - it is simply a First In First Out (FIFO) queue. A FIFO queue can simply be implemented as a singly linked list. To create this list a pointer to the next prompt is added to the prompt structure, and a global variable, currPrompt, is created and initialized to NULL. This always points to the head of the queue, and is only updated once a prompt has finished executing, not as it begins executing, because of the bargein behavior, which will be discussed later. The queue is global because a number of functions need to access the queue, including VXIpromptPlay, VXIpromptQueue and VXIpromptStop.

Adding and removing prompts from the queue is straightforward. Once the SSML has been parsed, a prompt is added to the queue according to the following algorithm:

create a new prompt
initialize the new prompt's next pointer to NULL
if this is the first member of the queue
    point currPrompt to this prompt
else
    add the prompt to the end of the queue

As said above, prompts can be removed from the queue at two stages, once they have been played and when the prompting session shuts down. Once a prompt has finished executing (or has been interrupted by the user) it must be removed from the queue and its memory freed. Removing a prompt from the queue is a simple matter of updating the global currPrompt to point to the next element in the queue. If VXIpromptEndSession is called and there are still elements in the queue it means that there has been an abrupt end to the call; either a disconnect tag was encountered or a hangup was received. In both cases the call has completed and the queue is no longer needed, and so it must be flushed and the memory freed.
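A minimal C sketch of the two queue operations, assuming the prompt structure sketched in Section 4.3.2 and that the payload was copied during parsing, might look like this:

/* Sketch of the FIFO queue operations described above. */
#include <stdlib.h>

static struct prompt *currPrompt = NULL;   /* head of the global queue */

/* called from VXIpromptQueue once the SSML has been parsed */
static void queue_prompt(struct prompt *p)
{
    p->next = NULL;
    if (currPrompt == NULL) {
        currPrompt = p;                    /* first member of the queue */
    } else {
        struct prompt *tail = currPrompt;
        while (tail->next != NULL)         /* walk to the end */
            tail = tail->next;
        tail->next = p;
    }
}

/* called once a prompt has finished executing or been interrupted */
static void dequeue_prompt(void)
{
    struct prompt *done = currPrompt;
    if (done != NULL) {
        currPrompt = done->next;           /* advance the head */
        free(done->payload);
        free(done);
    }
}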

The system can now parse, if in a rather rudimentary way, SSML documents and queue them appropriately. It must now be extended to play prompts of either type.

4.3.4 Playing the prompts

As we have seen, there are two types of prompts, audio prompts and synthesized text. This section will discuss the process of playing each type. Once this process has been completed the gateway will be able to produce output.

The FIA, during the COLLECT phase, is responsible for queuing and playing prompts. In terms of OpenVXI this means that the interpreter sends a message to the prompt API implementation to play the prompts in the queue, and specifically, in our case, VXIpromptPlay is called. Due to the work done by the FIA when queuing the prompts, playing the prompts is as simple as emptying the queue. The prompt at the head of the queue is always the next prompt to be played and the queue can be emptied, barring, say, a hangup. The algorithm to do this is very simple:

while there is something in the queue
    if it is an audio prompt
        fetch the file
        use Asterisk to play the file
    else
        use Festival to synthesize the prompt
        use Asterisk to play the audio

    /* when we get here the prompt has finished
     * executing and the queue member will not
     * be needed again - discard */

    remove the prompt from the queue
    free memory

return the result code

As said earlier, the prompt type attribute is used to distinguish between the two types of prompts. Both types of prompts require some form of acquisition - a file must be fetched or audio must be generated. Once the audio has been obtained it can be rendered on the Asterisk defined channel, which has been passed to the API from the voicexml application.

During the execution of the prompts the interpreter is in the WAITING state [56] (the FIA is in the COLLECT phase), which is when the system waits for input. In other words, input is being received while prompts are being processed. This affects the design of the system, as will be seen in the next section. Currently, however, we are not concerned with input, and so we next investigate the rendering of each type of prompt.

4.3.4.1 Audio

As we have seen, Asterisk can play audio files to the channel, and can therefore be used to execute audio prompts. The version of Asterisk used for this thesis, 1.0.6, comes with support for the GSM, wav and raw formats, amongst others. GSM is the predominant format used by many Asterisk applications but since the wav and raw formats are supported, the gateway complies with the VoiceXML standard. In addition to this, Asterisk provides generic file format support, which is defined in file.h. ast_format_register is a function which can be used to add support to Asterisk for a specific file format. We will use wav files for the tests.

Since audio prefetching is not currently supported, audio is only fetched once an audio prompt reaches the front of the prompt queue. This may cause delays in some scenarios, but we will ignore them in this version of the system. As already stated, playing the file requires two steps - fetching the audio file from a remote server and playing the file on the channel. It was decided that wget would provide the simplest interface for file fetching. wget is a Linux utility for downloading files from the Internet and is described in the Linux Manual Pages [65]. wget can navigate through proxies, provides support for a number of timeout scenarios and can be configured to only attempt to fetch a file if it does not already exist on the host machine. Since the prompt element in the queue has already been parsed, its attributes can be passed straight to wget. If the fetch process gives an error (the fetch times out or the file does not exist, for instance) the execution of the prompt is abandoned and the result code VXIprompt_RESULT_FETCH_ERROR is returned, indicating to the interpreter that the file could not be fetched.

Should the file be retrieved successfully we can proceed with playing it on the channel. To achieve this we need to stream a file to the channel associated with the prompting session, wait for the file to stop streaming (or be interrupted) while accepting input, and finally stop the streaming.

Examining Asterisk's file handling capabilities in file.h we see that a number of file streaming functions are provided. Of interest to us at this point are ast_streamfile, ast_waitstream and ast_stopstream. ast_streamfile streams a file to a channel and takes as its arguments the channel to which the file must be streamed, the file name of the file and a language preference. It returns 0 on success or -1 on failure. ast_waitstream waits for the stream to end while watching the channel; if the stream completes naturally, 0 is returned. Finally, ast_stopstream simply stops a stream on a specified channel.

These functions can be used to stream the audio file to the channel and, because OpenVXI has been linked to the Asterisk libraries, the functions can be called from within the OpenVXI system. We just need to include the header file, file.h. The language argument passed to ast_streamfile is used by Asterisk to help navigate through its sound libraries, and does not apply to the gateway. We can therefore simply pass the channel and the file name to ast_streamfile and let Asterisk handle the rest. If an error occurs during streaming, VXIprompt_RESULT_HW_ERROR is returned to the interpreter. ast_waitstream is used to wait for input while the file is being played. Finally ast_stopstream is called to ensure that the streaming process ends cleanly.
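Putting these calls together, a sketch of the playback step might look as follows. The result constants are those named in the text; the helper name, the empty break-string and the simplified error handling are illustrative, and the OpenVXI header path may vary between installations.

/* Sketch: stream a fetched audio file on the channel using the
 * streaming functions from file.h. */
#include <asterisk/channel.h>
#include <asterisk/file.h>
#include <VXIprompt.h>   /* for the VXIprompt_RESULT_* codes */

static VXIpromptResult play_audio_file(struct ast_channel *chan,
                                       char *filename)
{
    /* the language argument does not apply to the gateway */
    if (ast_streamfile(chan, filename, ""))
        return VXIprompt_RESULT_HW_ERROR;   /* streaming failed */

    /* wait for the stream to end, watching the channel for input */
    ast_waitstream(chan, "");

    ast_stopstream(chan);                   /* end the stream cleanly */
    return VXIprompt_RESULT_SUCCESS;
}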

While the streaming process itself is simple, it justifies a number of design decisions and highlights some of the advantages of the system. Firstly, and perhaps most importantly, it justifies the tight coupling between OpenVXI and Asterisk. Should this relationship not exist, another component would have to be obtained to give the gateway the functionality to play audio files, requiring an interface both to Asterisk and the prompt API, complicating both the implementation process and the system architecture.

Secondly, it provides a concrete example of why the channel is passed right through to the prompt API. While one may assume that since the channel structure represents the call, it should be controlled by the tel API, one can see that this is not the case. The tel API is used primarily as a means for OpenVXI to interface with the underlying telephony platform to set up and tear down calls. This is handled internally by Asterisk before the voicexml application is even called, and so the channel structure now represents the audio channel through which the gateway interacts with the user. Playing audio files on the channel is an example of this interaction.

Thirdly, the benefits of parsing the SSML before queuing the prompt can be seen, as this greatly simplifies the playback process.

Fourthly, this process provides a small example of the use and range of the OpenVXI error codes. Thanks to these codes, creating a robust mechanism to fetch and play audio files was easy. The system can recover from, and discriminate between, fetching and playing errors.

Finally, the streaming process provides an opportunity to discuss an example of the advantage of dynamic linking. Should further file format support be added to Asterisk it will immediately become available to the gateway, without requiring even a re-build of OpenVXI.

The gateway is now able to fetch and play audio files for the user to listen to. The following section describes the process of synthesizing prompts and playing the resulting audio to the user.

4.3.4.2 Synthesized speech

Rendering a prompt which is to be synthesized is, in theory, very similar to rendering an audio prompt, and requires two basic steps: obtaining the prompt, in this case the synthesized text, and then streaming it to the channel. That, however, is where the similarities end. Obtaining the prompt means sending the text to a TTS engine, Festival, and receiving the result. Because we are not simply streaming audio files, the process of playing the prompts on the channel is very different.

The first task is to integrate the Festival system into the prompt API using the client-server approach discussed in Section 3.3.1. The client-server approach has been selected because it allows concurrent processes, in this case prompting sessions, to access the Festival TTS engine simultaneously while removing any initialization-related time penalties. Another advantage of this approach is that both Festival and Asterisk provide extensive examples of clients. In particular, Asterisk provides an example of streaming the Festival output to a channel, and this is discussed later in this section. While it may be possible to simply use the Asterisk festival application to synthesize the text, this is not ideal because it does not give sufficient control of the synthesis process to the prompt API. We did, however, rely heavily on the app_festival.c file for guidance.

Before beginning the implementation, Festival had to be installed and configured. Since we will be following a similar approach to the Asterisk Festival client implementation we can install Festival as per Asterisk's requirements. This is discussed on the voip-info wiki [89].

Once Festival is installed, it must be configured before it can be used by Asterisk. There are three ways of doing this: modifying the Festival configuration file, patching the Festival source code or using an external Perl script. Since patching the source violates our requirement for an open system, this option can be ignored. Using an external Perl script is overly complicated. The simplest and most effective way of setting Festival to run with Asterisk is to change the configuration file. The following Scheme command, taken from the wiki, is placed in the festival.conf file.

(define (tts_textasterisk string mode)
"(tts_textasterisk STRING MODE)
Apply tts to STRING. This function is specifically designed for
use in server mode so a single function call may synthesize the
string. This function name may be added to the server safe functions."
(utt.send.wave.client
  (utt.wave.resample
    (utt.wave.rescale
      (utt.synth
        (eval (list 'Utterance 'Text string)))
      5)
    8000)))

This simply defines the command which will be sent to the server by the Festival client. The full command is of the form (tts_textasterisk string mode) where string is the string to be synthesized. The final portion of the Scheme command ensures that the audio sent back to the client is sampled at 8 kHz, and so is suitable for a telephony environment. (This configuration file can also be used to change the voice, and so the language, of the system.)

With the server correctly configured we can proceed with the implementation. In broad terms, obtaining the synthesized text is a simple matter of connecting to the Festival server, sending it the properly formatted command, and receiving the result.

A TCP socket is used for the connection. By default the Festival server runs on port 1314 and binds to localhost. This behavior can be changed in the configuration file and the server can be run on a remote machine, enabling load balancing, or separate servers can be run on separate interfaces to provide different voices. This is a rudimentary way of adding multilingual support to the gateway. A file descriptor is used as a handle to the socket, and so we can treat the socket as a simple file. (Using Unix file abstraction in this manner provides a simple and powerful way to manage a network connection [73].) If the socket fails to open for whatever reason, VXIprompt_RESULT_TTS_ERROR is returned to the interpreter. If the socket is successfully opened we can send the command, and wait for the result, which is sent by the Festival server as a sampled waveform.
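A sketch of the connection step follows, assuming the default localhost:1314 binding; the helper name is illustrative.

/* Sketch: open a TCP connection to the Festival server.  The returned
 * file descriptor lets the socket be treated as a simple file. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int festival_connect(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;   /* caller maps this to VXIprompt_RESULT_TTS_ERROR */

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(1314);                  /* Festival's default */
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}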

The waveform, sent by the server as a file, is read into a buffer by the client. In this way the audio can be treated as an array of characters. Once the waveform has been received, it can be sent to the channel. To do so the waveform must be separated into frames. A Unix pipe (another example of Unix file abstraction) is used to break the waveform up into chunks which will be used as data for a frame.

A Unix pipe is simply a special file with a read end and a write end. A read process will block when trying to read from an empty pipe, and a write process will block when trying to write to a full pipe [51]. Pipes are used extensively in Unix for Inter-Process Communication (IPC). Following the example in app_festival.c, a pipe is created and the process forks. The child process then writes the waveform (in chunks) to the pipe and the parent process reads from the pipe, creates a frame, and writes the frame to the channel, as can be seen in the sketch and in the algorithm below, the latter taken from the Asterisk festival application:
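The pipe-and-fork pattern itself can be sketched independently of the Asterisk frame handling; the chunk size below is an illustrative 20 ms of 8 kHz, 16-bit audio.

/* Sketch: the child feeds the waveform into the pipe in chunks; the
 * parent reads chunks back out and (in the real system) frames them
 * for the channel, as in the algorithm below. */
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 320   /* illustrative: 20 ms of 8 kHz, 16-bit audio */

static void stream_waveform(const char *wave, int len)
{
    int fds[2];
    char buf[CHUNK];

    if (pipe(fds) < 0)          /* fds[0]: read end, fds[1]: write end */
        return;

    if (fork() == 0) {          /* child: write the waveform to the pipe */
        close(fds[0]);
        int off = 0;
        while (off < len) {
            int n = (len - off > CHUNK) ? CHUNK : len - off;
            write(fds[1], wave + off, n);   /* blocks if the pipe is full */
            off += n;
        }
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);              /* parent: read chunks, build frames */
    while (read(fds[0], buf, CHUNK) > 0) {
        /* ... wrap buf in an Asterisk voice frame and write it to the
         * channel, pacing and checking the channel as described below ... */
    }
    close(fds[0]);
}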

for ever
    /* give the last frame a second to play,
     * observing the channel for activity */
    wait for 1000 ms
    /* check the state of the channel */
    read a frame from the channel
    /* check the type of frame that has been read -
     * most normal frames are classified as voice
     * frames, such as silence; other frame types
     * include video, DTMF, and control */
    if the frame is a voice frame
        read the waveform data from the pipe
        create a frame
        write the frame to the channel
        if this is the last of the waveform
            exit

There are a few things to note here. One would assume that waiting before sending each frame would result in patchy audio. This, however, is not the case. Because ast_write writes the frames to the channel in a non-blocking manner, we have to wait in between sending frames, otherwise the frames overlap and the audio is garbled. Checking what is on the channel before sending also produces cleaner sounding audio, as the frames are sent only when we can be sure that the user will hear them. For example, if a waveform is being played to a channel and a user pushes a key, the DTMF tone the user hears does not overlap the other audio: the waveform is paused, during execution, while the DTMF tone is played. Another reason for reading from the channel is so that we can detect a hangup. If a hangup is detected we can stop sending the waveform to the channel and exit gracefully. The result code VXIprompt_RESULT_FAILURE is returned if either a hangup is received or ast_write fails for some reason. Otherwise VXIprompt_RESULT_SUCCESS is returned.

This implementation episode signals the end of the user story, and so we can move to the testing stage. Before that it is worth noting that some of the aspects touched on in this section, Unix file abstraction and pipes, will be used later in an attempt to create IPC between the Java Sphinx process and the voicexml application, and are revisited in the following chapter.

4.3.5 Running the test

Once again we can verify the results of the test by logging strategically in the code and looking at the CLI output.

NOTICE[30189]: app_voicexml.c:125 voicexml_exec: voicexml called successfully
NOTICE[30189]: app_voicexml.c:126 voicexml_exec: argument is: http://king.rucus.net/test.vxml
saying: Welcome to Adam King's Voice XML page. bargein 1
playing audio file: pressOne.wav fetchtimeout=7000 fetchinit=2 bargein = 1
--22:00:18-- http://king.rucus.net/pressOne.wav
           => '/tmp/app_voicexml/SIP/1000-df29/pressOne.wav'
Resolving king.rucus.net... 146.231.115.12
Connecting to king.rucus.net[146.231.115.12]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18,810 [text/plain]

100%[======>] 18,810  --.--K/s

22:00:19 (15.72 MB/s) - '/tmp/app_voicexml/SIP/1000-df29/pressOne.wav' saved [18810/18810]

Playing '/tmp/app_voicexml/SIP/1000-df29/pressOne' (language 'en')

We can see that the SSML we saw logged in Section 4.2.3 has been parsed, the important information extracted, and the text synthesized. The argument passed from the dial plan to the voicexml application is the URL of the VoiceXML document to be parsed. The two prompts are queued, parsed and played. Firstly "Welcome to Adam King's Voice XML page" is synthesized, and then the audio file is played. The audio file to play is pressOne.wav, which is fetched from the web server and saved to the session's unique work space, in this case the /tmp/app_voicexml/SIP/1000-df29/ directory. The workspace is created using the unique channel name, in this case SIP/1000-df29, which allows us to clean up at the end of a session safely, without running the risk of removing files that have just been downloaded or are in use. Finally, Asterisk streams the file to the channel.

Figure 4.2: The new architecture reflecting the integration of Festival

Comparing the synthesized text output from the gateway with Festival's normal output we do not hear any difference in quality, and so we can assume that the integration was successful.

It can be seen that the test was passed. The resulting architecture can be seen in Figure 4.2, with Festival interfacing with the prompt API.

4.4 Summary

This chapter described the implementation of the gateway's output functionality. This involved integrating Asterisk, OpenVXI and Festival. Asterisk and OpenVXI now form the core of the system.

To integrate Asterisk and OpenVXI an Asterisk application module, voicexml, was created, and the Asterisk and OpenVXI libraries were linked. This gives the application the capability to interpret VoiceXML documents.

An SSML parser was then written to handle prompt requests from the interpreter and a queue was created to hold the parsed prompts. Prompts are either audio files or text, which is to be synthesized. Asterisk was used to play the audio files. The client-server API of the Festival system was used to convert text into a waveform, which is then played on an audio channel. Testing the system showed that both types of prompts could be rendered.

The final element of the gateway - input - will be discussed in the next chapter.

Chapter 5

Building the gateway: Phase 2

This chapter discusses the second half of the implementation phase - adding input functionality to the system, essentially giving the gateway ears. In particular, we will discuss adding DTMF and speech recognition support. Integrating Sphinx into the system requires integrating a Java application with Asterisk, which is written in C. Two different approaches to this will be discussed, one which failed and one which was successful.

The integration step described in this chapter results in the final system, and so the final architecture will be described. A final test involving concurrent calls will then be conducted to test the complete functionality of the gateway.

Before implementation can begin, a test has to be created. Should the test be successful, we can claim to have created a low-cost, open-source, proof-of-concept VoiceXML gateway, as per the aims of this thesis.

5.1 Creating the Test

The test is the simple part of the process: all we need to do is create a simple VoiceXML application which requires both DTMF and speech input. Working from the original example (Figure 2.3) we will add a greeting, using a simple grammar. The VoiceXML required to do this is sketched below:

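This is a minimal sketch, assuming VoiceXML 2.0 with an inline grammar accepting the two phrases described below; the element layout and prompt wording are illustrative rather than the exact test document.

<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="greeting">
      <prompt>Say hello to the application</prompt>
      <grammar mode="voice" root="greet">
        <rule id="greet">
          <one-of>
            <item>hello</item>
            <item>good morning</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt>Hello user</prompt>
      </filled>
    </field>
  </form>
</vxml>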

The VoiceXML code simply creates a greeting. If the user greets the VoiceXML application, the application answers with "Hello user". The grammar accepts either "Good Morning" or "Hello".

The test makes use of all the components introduced in Chapter 3. It does not test dynamic grammars and n-best recognition results because, while required by the VoiceXML specification and provided by Sphinx, they have not been implemented. The ability to change grammars would provide a means of using only the active VoiceXML grammars when listening for input, and so may improve the recognition performance of the system, as smaller grammars result in a lower word error rate. Unfortunately, due to time constraints, support for dynamic grammars has not been added.

5.2 Adding DTMF support

Input is collected during the execution of prompts, and therefore the output functions discussed in the previous section need to be extended to capture DTMF input as well. (The specification [56] states that input is only collected during the playback of prompts.) So, there are two areas which have to be altered, the streaming of the audio file and the streaming of the waveform from Festival. Another form of input which we can detect is a hangup, and that must also be dealt with.

Asterisk includes a DTMF recognizer, so we simply have to read from the channel while outputting both types of prompts and check for any DTMF frames or NULL frames, which indicate a hangup. We also have to implement the bargein attribute.

As the output process already reads from the channel, this is simple and our algorithm only changes slightly.

for ever
    /* wait for the last frame to play */
    wait for 1000 ms
    /* check the state of the channel */
    read a frame from the channel
    if the frame is null
        inform the tel API
        stop the prompt
        exit
    if the frame is a DTMF frame
        if the prompt's bargein property is true
            set the result in the rec API
            stop the prompt
            exit
    if the frame is a voice frame
        read the waveform data from the pipe
        create a frame
        write the frame to the channel
        if this is the last of the waveform
            exit

Adding this behavior presents a number of design questions, which we will discuss here before adding support for DTMF input during audio prompt execution.

The prompt API needs access to both the rec and tel APIs. This is done by altering the VXIpromptImpl structure to include pointers to the other APIs and initializing the pointers when the prompt session begins. What this means is that all events are caught by the prompt API, which is then responsible for informing the other APIs, resulting in a somewhat prompt-centric architecture, and altering the design of the system slightly. Given the way VoiceXML behaves, however, the new design is acceptable. A VoiceXML application begins, from the user's point of view, with a prompt. Any input received before a prompt is played is disregarded, due to the behavior of the FIA. Input (and events) are detected during a prompt's execution, after which the next prompt is played. We can see that the nature of a VoiceXML application is itself very prompt-centric, and this has been reflected in the implementation.

The next design consideration is how to handle input with regards to the bargein attribute. If the prompt's bargein attribute is set to true, a user may interrupt a prompt with an utterance or a DTMF key. This interruption happens at the first indication of any input. It is the VoiceXML developer's responsibility to ensure that multiple inputs, such as a telephone number, are acquired correctly using grammars. The hotword barge-in type can be used to ensure that a prompt will only be interrupted once a grammar is matched, allowing for multiple inputs and barge-in during the execution of a prompt.

The specification describes how to handle input received during the execution of prompts with a bargein property of false [56]. Any DTMF input received in such a situation is not buffered and so can be ignored. This means that prompts which cannot be interrupted are used only to render information and are not used to collect input. This behavior seems to match the behavior of traditional IVRs.

When interrupting the prompt it is important to set the appropriate variable in the rec API to the recognized token before stopping the prompt. The FIA enters the PROCESS phase once the COLLECT phase is finished, and so, once the prompt finishes executing, the interpreter requests the rec API to attempt a recognition (match the input against an active grammar). There are two situations in which the interpreter will make the request: when a prompt finishes executing naturally, and once a prompt has been interrupted. The input is therefore set before the prompt is interrupted, to ensure that the rec interface uses the input that caused the prompt to be interrupted.

The second area which needs to be changed when adding DTMF support is the playback of audio files. Initially ast_streamfile was used to play the file. This does not, however, provide any mechanism for collecting input. While this method would suffice for prompts with their bargein attribute set to false, it does not allow any input collection when bargeIn is set to true. There are two options available to provide this. Firstly, a technique similar to the text synthesis process discussed earlier can be used, where we read from the channel while writing to it and evaluate each frame. The other option is to check Asterisk's functionality to see if a mechanism usable in this case is provided. A function called ast_app_getdata is available for Asterisk's IVR modality. This function plays an audio file to a channel and gets any DTMF input. As soon as the function gets some input the streaming is stopped. This behavior is precisely what is needed and so DTMF support can be added to the audio prompt very simply. If the prompt's bargeIn property is set to false we can ignore any DTMF input and continue as before. If, however, it is set to true, ast_app_getdata is used as opposed to ast_streamfile.
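The choice between the two calls can be sketched as follows; the helper name, buffer size and timeout value are illustrative.

/* Sketch: play an audio prompt, honouring its bargeIn property.
 * ast_app_getdata stops streaming at the first DTMF digit. */
#include <asterisk/app.h>
#include <asterisk/file.h>

static int play_audio_with_dtmf(struct ast_channel *chan, struct prompt *p)
{
    char digits[32] = "";

    if (p->bargeIn) {
        /* stream the file, collecting DTMF; illustrative 2000 ms timeout */
        if (ast_app_getdata(chan, p->payload, digits,
                            sizeof(digits) - 1, 2000) < 0)
            return -1;
        if (digits[0]) {
            /* hand the digits to the rec API before stopping the prompt */
        }
    } else {
        /* non-interruptible prompt: ignore DTMF, just stream the file */
        if (ast_streamfile(chan, p->payload, ""))
            return -1;
        ast_waitstream(chan, "");
        ast_stopstream(chan);
    }
    return 0;
}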

Now that the input has been obtained, it has to be sent to the rec interface, so that it can be evaluated against the active grammars at the request of the interpreter. A function, VXIpromptHandleInput, was created for this purpose. Since we are passing information across interface boundaries, a VXIMap was created. A VXIMap is a table, which in this case reduces to a key/value pair holding the type and corresponding value of input received. Care must be taken when creating the VXIMap, as memory management issues can arise. It is the responsibility of the prompt API to set aside memory for the structure, and the responsibility of the rec API to ensure that it is freed.
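Assuming the VXIvalue accessors (VXIMapCreate, VXIMapSetProperty, VXIStringCreate) are used directly, the hand-over could be sketched like this; the property key and helper name are illustrative.

/* Sketch: package received DTMF input into a VXIMap for the rec API.
 * The prompt API allocates the map; the rec API must later free it. */
#include <VXIvalue.h>

static VXIMap *package_input(const VXIchar *digits)
{
    VXIMap *input = VXIMapCreate();

    if (!input)
        return NULL;

    /* key/value pair: the type of input and its value */
    VXIMapSetProperty(input, L"dtmf",
                      (VXIValue *)VXIStringCreate(digits));
    return input;
}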

Catering for the input on the rec API side is simply a matter of assigning the received VXIMap to a global variable. The rec interface is responsible for managing all the active grammars. When the interpreter requests a match to be made the global map is used.

We now have a system which is able to completely replace traditional IVR platforms in terms of functionality. It can interact with a user using synthesized speech and audio files and receive user input via DTMF tones. The system, as it is at the moment, can be considered a low-cost VoiceXML IVR platform, and could be used to replace many of the IVR systems in use today. The final component of the gateway, speech recognition, can potentially improve user satisfaction. The following section will discuss the integration of Sphinx into the system.

5.3 Adding Sphinx

As discussed in Section 3.4, it was decided to use the Sphinx 4 Java version as opposed to Sphinx 2 or 3, which are written in C++. This section looks at the creation of speech-recognition applications using Sphinx, and then goes on to examine the integration process. The integration process provides a discussion on Inter-Process Communication (IPC) between a Java process and a native process on a Linux platform. The first attempt at this, using the JNI, the Invocation Interface and the Unix file abstraction, failed, but will be discussed here because of the lessons learned. We will then look at a client-server implementation, which worked and resulted in speech-recognition support for both the VoiceXML gateway and the Asterisk system.

5.3.1 Creating Sphinx applications

The creation of Sphinx applications is discussed in the Sphinx help files [43]. Once the Sphinx system has been installed, a speech enabled application can be created by simply creating three files: a Java application, an XML configuration file and a grammar file.

The Java application is surprisingly simple to create. There are three main classes used in Sphinx applications: the Recognizer class, the Result class and the ConfigurationManager class. The application must load the configuration file and configure the different components using the ConfigurationManager, capture input for the Recognizer and process the Result returned by the Recognizer.

The Sphinx engine has been designed to be as configurable and modular as possible. To achieve this the ConfigurationManager class uses the configuration file to configure the different components. The configuration file can be used to configure any components as well as to define properties. The components of interest are the front-end pipeline, the grammar and the acoustic model. The front-end pipeline is used to configure the engine to accept the input format, the grammar component is responsible for loading the grammar, and the acoustic model component defines which acoustic model is to be used. As already stated, we will be using the WSJ 8 kHz acoustic model, and so the acoustic model component must be suitably configured.

Finally, the grammar file used to define the Result is created. It can be in various formats, including SRGS. Excerpts from a Sphinx application and the corresponding configuration file can be seen in Appendices E and F respectively.
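In outline, such an application might look like the following condensed sketch, in the spirit of the excerpt in Appendix E. The class name, the configuration file name and the component name "recognizer" are illustrative assumptions, not fixed parts of the Sphinx API:

import java.net.URL;

import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class SimpleSphinxApp {

    public static void main(String[] args) throws Exception {
        // Load the XML configuration file, which wires up the front-end
        // pipeline, the grammar and the acoustic model.
        URL config = SimpleSphinxApp.class.getResource("simple.config.xml");
        ConfigurationManager cm = new ConfigurationManager(config);

        // Look up and allocate the recognizer defined in the configuration.
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Block until an utterance has been decoded against the grammar.
        Result result = recognizer.recognize();
        if (result != null) {
            System.out.println("RESULT: " + result.getBestFinalResultNoFiller());
        }

        recognizer.deallocate();
    }
}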

Now that we know how to create a Sphinx application, we can concentrate on the integration, passing the audio from the channel to Sphinx.

5.3.2 JNI and pipes - a first attempt

The integration of the Sphinx Java application into the current system requires bidirectional interaction between Asterisk and Sphinx. All the audio has to be passed from Asterisk to the Sphinx process, and the result has to be passed to OpenVXI's rec API. To ensure that the Sphinx engine has the best chance of making the correct recognition, it is desirable to have the audio stream up and running by the time the VoiceXML interpretation starts. Altering the structure of the voicexml application discussed in Section 4.2.2 we get this:

voicexml_exec(channel, data)
    /* deal with the parameters */
    if we have a URL
        /* set up the speech recognition */
        start the sphinx process thread and send the audio

        /* set up and run the interpreter */
        initialise the interpreter, passing URL and channel
        allocate the interpreter resources
        interpret the VoiceXML
        /* once we get here we are done with the VoiceXML
         * application, for whatever reason, and can shut down */
        shut down the interpreter
        clean up memory
    else there is no URL
        show an error message and exit

We can see that the voicexml application needs to have two threads of execution. One is responsible for sending the audio to the Sphinx process and another for interpreting the VoiceXML. Before any audio can be sent, however, the Sphinx Java process has to be started. To start the Java Virtual Machine (JVM) and run the application, and receive the result from the application, we decided to use the Invocation API and the Java Native Interface (JNI). The JNI can be used to allow native processes and Java applications to interact with each other, as described in [30, 49]. The JNI can be used to incorporate native libraries into Java applications. Java developers may find this necessary in real-time applications where Java does not perform adequately, or where existing native libraries already perform the required functions. The Invocation API, on the other hand, can be used by native processes to gain direct control of the JVM. A native process is one which is written in either C or C++. The JNI and Invocation APIs seem ideal for this situation, as they provide the two-way interaction required.

Given that we have a possible means of interaction between the different processes, we can investigate a mechanism for passing the audio from the voicexml application to the Java process. There are a number of issues to keep in mind when designing and implementing such a mechanism. Firstly, audio is read from the channel in Asterisk-specific frames. These need to be converted to a constant audio stream to be sent to Sphinx. Secondly, Asterisk is able to handle many different types of audio formats and codecs. While Sphinx is able to handle different types of audio, the type is set in the configuration file and changing the audio format during execution is not possible. Therefore we have to ensure that the audio being sent to the Sphinx process is in the format that matches the chosen acoustic model. Thirdly, Java does not have very good access to low-level, platform-specific devices, as it is dependent on the JVM. Sending the audio to a Unix abstract device (such as /dev/null) on the Linux platform will not work because the Java application will have difficulty in accessing the device in a generic manner.

It was therefore decided to stream the audio from Asterisk to a file which the Sphinx application would access independently. Because of the amount of housekeeping that would be required to ensure that the two independent processes access the file correctly, it was decided to use a pipe.

The pipe will block a write while there is nothing to read, and vice versa. Due to the Unix file abstraction system, the pipe can simply be treated by both processes as a normal file. The Asterisk function ast_writefile creates a stream for writing in a specified format. One can use this function to write a stream to a file in a consistent format, which can be read on the other side by the Java application, treating the pipe as a simple wav file. This scenario theoretically solved all our problems: ast_writefile creates a stream of audio of a consistent format which is written to a file, and the Java application on the other side can handle the pipe as a normal file. What is left to do is to start the Java process.
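To make the intended arrangement concrete, the Java side of this design can be sketched as follows. This is a minimal sketch of the approach that, as will be seen shortly, ultimately failed; the pipe path is an illustrative assumption:

import java.io.File;

import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class PipeReader {

    public static void main(String[] args) throws Exception {
        // The named pipe is opened exactly as an ordinary wav file would
        // be; the wav header is consumed once, when the stream is opened.
        AudioInputStream audio = AudioSystem.getAudioInputStream(
                new File("/tmp/vxml_audio"));

        byte[] buffer = new byte[320]; // 20 ms of 8 kHz, 16-bit audio
        while (audio.read(buffer) > 0) {
            // hand the raw samples to the recognizer front end
        }
        audio.close();
    }
}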

Starting the Java process requires another thread to run the JVM and a second pipe for communication between the two processes. Since the initialization of the Sphinx process takes a while, the voicexml application needs to wait before proceeding with the VoiceXML interpretation. The initialization penalty cannot be avoided, and this means that there is a pause before the VoiceXML begins executing, which is not desirable. The voicexml application now looks like this:

voicexml_exec(channel, data)
    /* deal with the parameters */
    if we have a URL
        /* set up the speech recognition */
        create a pipe for the audio
        create a pipe for other communication
        create a thread to manage the JVM

        /* to synchronize the processes we can simply
         * read from the communication pipe - this read
         * will block until the Java application writes
         * to the pipe */
        read from the communication pipe
        start an audio thread

        /* set up and run the interpreter */
        initialise the interpreter, passing URL and channel
        allocate the interpreter resources
        interpret the VoiceXML
        /* once we get here we are done with the VoiceXML
         * application, for whatever reason, and can shut down */
        shut down the interpreter
        stop the threads
        remove the pipes
        clean up memory
    else there is no URL
        show an error message and exit

/* JVM thread */
    start the JVM
    start the Sphinx application - passing, as files, both pipes

/* audio thread */
    read from the channel
    write a stream to the pipe in wav format

The Sphinx application simply wrote to the communication pipe once initialized and then read audio from the audio pipe, treating it as a wav file. Finally, the JNI was used to link the Sphinx application with the rec API and set the input. Adding native libraries to a Java application is relatively simple, as described by Schuermann [79]. The native method is declared, without a body, in the Java application and the library is loaded in a static block following the declaration, as follows:

private native void setResult(String result);
static {
    System.loadLibrary("VXIrec");
}

javah, a C header file generator, is then used to create a C header file from the Java application, which is included in the native library. The Sphinx application can now theoretically call setResult to interface with the rec API.
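For concreteness, a hypothetical decoding loop on the Java side might then hand every hypothesis across the JNI boundary as follows (the class name RecBridge and the loop structure are illustrative, not taken from the actual implementation):

import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;

public class RecBridge {

    // Implemented in the native VXIrec library, as declared above.
    private native void setResult(String result);

    static {
        System.loadLibrary("VXIrec");
    }

    // Pass the best hypothesis for each utterance to the rec API.
    void decodeLoop(Recognizer recognizer) {
        Result result;
        while ((result = recognizer.recognize()) != null) {
            setResult(result.getBestFinalResultNoFiller());
        }
    }
}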

Once the solution was implemented and tested it was found that not only did it not work, but it had managed to break the DTMF recognition. While the main symptom of failure was that the ast_writefile call continuously gave an error, there are also a number of issues which would ensure that the solution would not have worked. We will first discuss the ast_writefile issue before discussing the related issues.

Studying the Asterisk file writing process we saw that, after every frame was written, the header of the wav file was updated. Looking at the header structure of a wav file we saw that it included a chunksize field used to indicate the size of the wav file [95]. After each write Asterisk attempts to update this header, and this fails. Since a pipe is a file of size zero, the wav header does not exist, and has already been read by the Sphinx process. In fact, it is not possible to lseek on a pipe [51]. (lseek is used to change the position in the file where writing (or reading) occurs [50].) Since the process of writing to the wav file cannot update the file size header, because the wav file is actually a pipe, it gives an error. The Java program on the other end is blocked by the behavior of the pipe, as it attempts to read while there is no corresponding write. While this could be changed, it involves altering the Asterisk source code to blindly write to the pipe and neglect updating the wav header. Since we are attempting to avoid altering the Asterisk source, the method of sending the audio to the pipe needed to be revised.

Had the streaming worked, or a suitable alternative been found, the solution still would have failed. The JNI call to a native library is not able to call a particular instance of that library, and so the Sphinx application would not interact with the instance of the rec API associated with the call, and the input would have been lost. Similarly, there is no way to handle concurrent calls, and there is no way to associate an executing instance of a library with a JNI call.

Finally, because there were two processes concurrently reading from the channel, on average only half of the DTMF input would be received by the prompt API, as the other half was read by the process streaming to the audio file. This is unacceptable, and since any solution to the Sphinx integration problem requires a separate thread to send all the audio to the Sphinx process, the problem of channel contention had to be solved.

These errors caused a re-consideration of the problem, and a new solution was created based on the client-server architecture.

5.3.3 The client-server approach

Since the first solution was not feasible we had to investigate another approach. It was decided to use a client-server architecture whereby a client, the voicexml application, streams the audio to a Sphinx server. Before we begin discussing this approach it is worth reviewing some XP principles which come into play due to the change. Besides using the test-based and scope-driven development approaches prescribed by XP, a conscious attempt has continuously been made to keep the code as simple as possible. The effect of this is to minimize the cost of change over time as well as to reduce the possibility of errors in the system. Combining the two separate processes that read from the channel into a single one represents a large change to the design of the system, as the DTMF input solution has to be reworked. Fortunately, because the code has been kept as simple as possible, making the alterations does not affect the progress of the project drastically.

The solution to the channel access problem is to read from the channel at a single point only - when sending the audio to the Sphinx application. Any DTMF input received is simply sent to the prompt API, where it is dealt with, as before, by the VXIpromptHandleInput function.

It was decided first to create a simple Asterisk recognition application to see whether the client-server approach was viable, and therefore another application module, app_recognize.c, was created in the same manner as app_voicexml.c. Should the new approach work, it would provide a speech recognition system for Asterisk itself, as well as for the gateway.

Fortunately, the recognize application was very similar to the voicexml application discussed in the previous section, without, of course, the OpenVXI integration. Instead of writing to a pipe, audio had to be written to a socket. (Due to Unix file abstraction, sockets and pipes are essentially equivalent, and can be treated in the same manner.) So, instead of opening a file stream, we now opened a socket and connected to the server.

Once a connection to the server was made, the client had to synchronize with the Sphinx process, as before. The synchronization was necessary to ensure that audio was only sent once the server was ready to accept it and that the VoiceXML application did not begin execution before such a time. Once the server was ready, the client could send the audio.

As with the previous approach, the problem of audio formats had to be addressed. The client had to ensure that the same format was consistently sent to the server, and that the format was one which could be handled by the server. The native audio format used by Asterisk is PCM SLINEAR, which is 8 kHz, signed, big endian audio delivered in frames of 160 samples, and any Asterisk-supported audio can be converted into this PCM SLINEAR format. Since we had already chosen an 8 kHz acoustic model, we simply had to configure the front-end pipeline to accept this audio, as can be seen in Appendix F.

The recognize application behaves as follows:

recognize_exec(channel, data)
    open a socket to the sphinx server
    create a thread to listen for results
    connect to the sphinx server
    wait for the sphinx server to initialize
    for ever
        if the audio is not PCM SLINEAR
            convert the audio
        send the audio to the socket

/* result thread */
    for ever
        listen to the socket
        log any results to the CLI

Now that we have a client that can connect to a server and send correctly formatted audio, we can look at implementing the corresponding server. Since the responsibility of the server is to simply farm out Sphinx processes, it was called SphinxFarm.

5.3.4 SphinxFarm

SphinxFarm is a Java TCP server which farms out Java processes, in our case Sphinx applications, to any incoming connections. As seen previously, the main issue associated with integrating Sphinx and Asterisk is the transport of the audio from the C application to the Java application. The device used in this scenario, a socket, is again easily accessible by a Java application.

The choice of Java as an implementation language was made because we are farming out Java processes. It was also found that, because of the way in which Java handles exceptions, a great deal of robustness could easily be built into the server.
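As an illustration of this point (a minimal sketch, not the actual SphinxFarm source), a per-connection try/catch is enough to stop one misbehaving session from taking the whole server down; the port number is taken from the test logs shown later:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class AcceptLoop {

    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(1315);
        while (true) {
            Socket client = server.accept();
            try {
                handle(client); // run a recognition session for this caller
            } catch (Exception e) {
                // One failed session is logged and discarded; the server
                // keeps accepting new connections.
                System.err.println("session failed: " + e.getMessage());
                client.close();
            }
        }
    }

    private static void handle(Socket client) throws Exception {
        /* hand the socket to a Sphinx worker thread */
    }
}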

TCP has been chosen because it encapsulates the notion of a connection, while UDP is connectionless. We need to ensure that the client is sending the audio to the correct Sphinx process, and that the Sphinx process in return is sending the results to the correct client, and providing this behavior using TCP is simpler than using UDP. If UDP were used we would have had to create some form of connection, either by using another TCP connection to negotiate a separate port for communication between the client and the thread associated with it, or by manipulating the UDP packets to provide some TCP functionality.

The final issue affecting the design of the SphinxFarm application is the issue of time penalties. SphinxFarm must attempt to minimize any Sphinx-related time penalties - in particular the initialization delay. This has been done by pre-initializing the Sphinx processes at start up. The server initializes a number of Sphinx processes before opening a socket and waiting for connections.

The pre-initialization process works as follows. Each Sphinx process is run in a separate thread, and can be in one of three states - INITIALIZED, CONNECTED or WAITING. The INITIALIZED state refers to a thread containing a Sphinx process which has been initialized but has not yet been run. (Specifically, its run method has not been called.) A thread in the CONNECTED state is one which is busy recognizing speech, and is connected to a client. The WAITING state refers to a thread which has already been run and has dealt with at least one client connection, but is currently idle. A WAITING thread, in other words, is one which has been in the CONNECTED state at least once, while an INITIALIZED thread has never been. Because the number of connections is theoretically unbounded, a vector has been used to keep track of the threads. A load argument, passed to the server at start up, is used to determine the number of Sphinx processes to pre-initialize.

Once initialization is complete, the server opens a socket and listens for connections. On an incoming connection the server first attempts to find a thread in the INITIALIZED or WAITING state before creating a new thread, as follows:

for each thread in the vector
    synchronized
        if the thread is initialized and is not connected
            use this thread
/* there are no such threads */
create a new thread

Since both the server process and the client process itself can change the state variables, any access to the variables is done in a synchronized block. Once a suitable thread has been selected, the server passes the socket to the Sphinx process and executes the thread. Executing the thread involves either running it for the first time, in the case of an INITIALIZED thread, or waking a thread, in the case of a WAITING thread. To wake a thread a process simply calls the notify method.
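Putting these pieces together, the state machine and the selection logic might be sketched as follows. This is a minimal sketch under the assumptions described above; the class, field and method names are illustrative, not taken from the actual SphinxFarm source:

import java.net.Socket;
import java.util.Vector;

class SphinxWorker extends Thread {
    static final int INITIALIZED = 0, CONNECTED = 1, WAITING = 2;

    private int state = INITIALIZED;
    private Socket client;

    // Called by the server; returns false if the worker is busy.
    synchronized boolean acquire(Socket socket) {
        if (state == CONNECTED) {
            return false;
        }
        client = socket;
        if (state == INITIALIZED) {
            start();  // first use: run the thread
        } else {
            notify(); // WAITING: wake the suspended thread
        }
        state = CONNECTED;
        return true;
    }

    public void run() {
        while (true) {
            recognize(client);   // decode until the caller hangs up
            synchronized (this) {
                state = WAITING; // stay initialized for the next caller
                try {
                    wait();      // sleep until the server calls notify()
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

    private void recognize(Socket socket) {
        /* attach the socket to the Sphinx front end and decode */
    }
}

class Dispatcher {
    // Hand an incoming connection to an idle worker, creating a new one
    // (and paying the initialization penalty) only if the load is exceeded.
    static void dispatch(Vector<SphinxWorker> pool, Socket socket) {
        for (SphinxWorker w : pool) {
            if (w.acquire(socket)) {
                return;
            }
        }
        SphinxWorker fresh = new SphinxWorker();
        pool.add(fresh);
        fresh.acquire(socket);
    }
}

The synchronized acquire method makes the test of the state variable and the transition to CONNECTED atomic, which is the behavior the pseudocode above requires.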

On the Sphinx side, on initialization the Sphinx front-end must load and configure the components and allocate the resources. This is what takes time, and means that the value of the load argument should be carefully selected. Setting the value too low means that calls will, at first, take an initialization penalty, as not enough processes have been initialized. Setting the value too high will result in useless consumption of system resources. If the load value is never exceeded (the value is greater than the highest number of concurrent calls), there should never be any time penalties, and a client process will have immediate access to a voice recognition service. Initialization was timed on two separate machines, A and B. A had an Intel Pentium 4 CPU 3.00 GHz with 1 GB RAM and B had an Intel Xeon CPU 2.40 GHz with 1 GB RAM. A is a desktop PC which was running many other processes at the time. It was observed that initialization on A took around 1 minute, while on B initialization took around 8 seconds, which illustrates that the initialization time can be substantial, and therefore it is well worth avoiding.

Once the process is initialized, it waits for the server to pass it a socket, at which point the socket's input stream is set as Sphinx's data source, and the process can be activated.
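In Sphinx 4 this wiring can be expressed with a StreamDataSource, a front-end component that accepts an arbitrary InputStream. The component name "streamDataSource" is assumed to be defined in the configuration file, and the method shown is a sketch rather than the project's actual code:

import java.net.Socket;

import edu.cmu.sphinx.frontend.util.StreamDataSource;
import edu.cmu.sphinx.util.props.ConfigurationManager;

class SocketAudio {

    // Attach the caller's socket to the recognizer front end.
    static void attach(ConfigurationManager cm, Socket socket)
            throws Exception {
        StreamDataSource source =
                (StreamDataSource) cm.lookup("streamDataSource");
        source.setInputStream(socket.getInputStream(), "asterisk-audio");
    }
}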

Upon activation, the process must indicate to the client application, using the socket, that it is ready to receive audio. This is done to synchronize with the client process, as in the event of the load value being exceeded the client has to wait for the new process to initialize before proceeding. After confirmation, the recognition process can continue as normal, obtaining results and sending them to the client process over the socket.

When a client disconnects, the connection is closed, the thread's state is updated to WAITING and the thread calls the wait method, which causes the thread to wait until the server calls notify.

The client-server approach has a number of advantages over the wav file approach. Firstly, and most importantly, it works, as will be shown in the following section. Secondly, being able to pre-initialize threads means that we can eliminate the initialization time penalty. Finally, being able to suspend threads and change their input streams means that initialized processes can be assigned to many different clients. The thread is immediately available after a client disconnects.

5.3.5 Testing the recognize application

This test has been designed with a number of issues in mind. Firstly, it must test the integration between the two systems. This involves checking that the audio is sent to Sphinx and that the client is getting recognized tokens back: this can be done by logging the result from the Sphinx application. Testing that the result of the recognition is returned to the Asterisk application is done by simply displaying text read from the socket to the CLI.

Secondly, the transitioning of the thread states must be tested. The following scenarios must be tested for:

• a single call connecting to an INITIALIZED process

• a single call connecting to a WAITING process

• the load value being exceeded

• concurrent calls

Finally, an indication of the word error rate is required. Comparing this error rate with the native error rate of Sphinx in identical conditions helps determine whether the audio transfer mechanism, the socket, has been implemented correctly. In what follows, [n] represents the thread's index in the vector. Since the Asterisk server and the SphinxFarm server are both running on the same machine, all the connections are from 127.0.0.1.

Testing for a single call connecting to an initialized thread and then to a waiting thread:

[0] --> Initializing
Server initialization successful!
Bound to: 127.0.0.1:1315
Pre-loaded clients: 1

[0] --> Connected.
[0] --> Hang up - pausing thread
New connection from /127.0.0.1
[0] --> Connected.

Testing for the load value being exceeded. This test simply did not pre-initialize any processes, and so the first connection (at index 0) results in the load being exceeded. The following output was observed:

Server initialization successful!
Bound to: 127.0.0.1:1315
Pre-loaded clients: 0

New connection from /127.0.0.1
load exceeded - initializing
[0] --> Connected.

Finally, testing with concurrent calls:

[0] --> Initializing
[1] --> Initializing
Server initialization successful!
Bound to: 127.0.0.1:1315
Pre-loaded clients: 2

New connection from /127.0.0.1
[0] --> Connected.
New connection from /127.0.0.1
[1] --> Connected.
[1] --> Hang up - pausing thread
[0] --> Hang up - pausing thread
New connection from /127.0.0.1
[0] --> Connected.

                 | TEST A | TEST B | TEST C
First Utterance  |  100%  |   95%  |   75%
Second Utterance |   90%  |   80%  |   65%
Full Utterance   |   90%  |   80%  |   65%

Table 5.1: Sample test results

As we can see from the tests, the mechanism for minimizing initialization time penalties is working correctly. There are no initialization time penalties when the number of clients is less than the given load value, and when the load is exceeded a new process is created properly. We can now move on to assessing the recognition performance of the system.

Three very simple tests were run using a variety of speakers and grammars. The test subject for test A was a South African male and test B an Australian female. Tests A and B made use of the same grammar, while test C introduced a different grammar, the speaker being the same as in test A. The grammars were as follows:

grammar testAB;
public ( (Hello | Good Morning) (Adam | Bradley | Will | Rita) );

grammar testC;
public ( (call to | move to) (quick fox | Bradley | Adam) );

The speaker's actual utterance was evaluated against the result produced by SphinxFarm. The test results can be seen in Table 5.1, and give some indication of the word error rate that can be expected. Unfortunately, the word error rates of between 10% and 35% that have been observed are very large. Since the value is expected to be around 7% (Table 3.1) this is a disappointing result.

The discrepancy between the different values can be due to a number of things, one of which may be the integration mechanism - incorrectly formatted audio being sent to the Sphinx process, for example. In an attempt to rule out the integration, another test was conducted. Using the same telephone, we recorded a set of utterances, conforming to the grammars above, to file using Asterisk's recording functionality. A Sphinx front-end was created to recognize utterances from a file and the recorded files were passed to the Sphinx system. An example of the front-end output was as follows:

Loading Recognizer...
Decoding /usr/src/sphinx4/src/demo/sphinx/wavfile/test.wav
WAVE (.wav) file, byte length: 62924, data format: PCM_SIGNED
8000.0 Hz, 16 bit, mono, 2 bytes/frame, little-endian, frame length: 31440
RESULT: good morning rita
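A file-based front end of this kind might be sketched as follows; the class name, the configuration file name and the component names are illustrative assumptions:

import java.io.File;

import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

import edu.cmu.sphinx.frontend.util.StreamDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class WavFileDecoder {

    public static void main(String[] args) throws Exception {
        ConfigurationManager cm = new ConfigurationManager(
                WavFileDecoder.class.getResource("wavfile.config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Feed the recorded utterance into the front end instead of the
        // live socket; AudioSystem consumes the wav header for us.
        AudioInputStream audio = AudioSystem.getAudioInputStream(
                new File(args[0]));
        StreamDataSource source =
                (StreamDataSource) cm.lookup("streamDataSource");
        source.setInputStream(audio, args[0]);

        Result result = recognizer.recognize();
        System.out.println("RESULT: " + (result == null
                ? "(none)" : result.getBestFinalResultNoFiller()));

        recognizer.deallocate();
    }
}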

It was observed that saving the utterances as files before decoding them resulted in an error rate of approximately 30%, which falls within the range of the SphinxFarm test results. This result strongly indicates that the problem does not lie in the integration, but somewhere else instead.

While it has not been determined what is causing the poor recognition performance, the fact that the WSJ acoustic model comprises American-accented voices, while our accents are South African, must account for at least some of the discrepancy. Creating our own acoustic model would clarify this problem. This was, however, outside the scope of the thesis.

Finally, since we can assume that the process itself is correct, we can integrate Sphinx into the voicexml application in a similar manner to what has been done in the recognize application. The recognize application also extends the Asterisk system independently of the gateway implementation.

5.4 Results

Running the test described in Section 5.1 we can observe the following logging excerpts from the CLI:

NOTICE[30189]: app_voicexml.c:125 voicexml_exec: voicexml called successfully
NOTICE[30189]: app_voicexml.c:126 voicexml_exec: argument is: http://king.rucus.net/test.vxml
saying: Welcome to Adam King's Voice XML page.
bargein = 1
/* fetch and play pressOne */
NOTICE[11261]: app_voicexml.c:301 asr_stream: DTMF input received 1
setting dtmf input as 1
saying: Congratulations, you pressed one
bargein = 1
/* fetch and play test */

NOTICE[11331]: app_voicexml.c:286 asr_stream: Sphinx recognized: --> hello
setting speech input as hello
saying: Hello user
bargein = 1

Of course, if the user did not provide any input or the input was not correct (which it will be for speech input approximately 10-35% of the time), nomatch or noinput errors were thrown and the VoiceXML default behavior was observed, as can be seen in the following example:

saying: Welcome to Adam King's Voice XML page.
bargein = 1
app_voicexml.c:301 asr_stream: DTMF input received 2
saying: Sorry, I didn't understand you.
bargein = 1
saying: Welcome to Adam King's Voice XML page.
bargein = 1
saying: Welcome to Adam King's Voice XML page.
bargein = 1
saying: Sorry, I didn't hear you.
bargein = 1
saying: Welcome to Adam King's Voice XML page.
bargein = 1
saying: I didn't hear you that time either.
bargein = 1

Looking at the tests we can see that the DTMF collection is reliable and, while perhaps not quite suitable for real-world purposes, speech input can be received and handled correctly.

The final architecture can be seen in Figure 5.1. It can be seen that the architecture is slightly different from the architecture shown in Figure 3.2, as the initial architecture assumed that Sphinx would only be integrated into OpenVXI's rec interface. One can see, however, that this is not the case, and there is interaction between Asterisk and Sphinx. The audio is being passed from the voicexml module to the Sphinx process, which represents the integration of Sphinx and Asterisk as well.

Figure 5.1: The final gateway architecture (Sphinx, Asterisk, OpenVXI and Festival)

5.5 Final testing

After creating a complete gateway we needed to test the system a little more thoroughly. To do this we used tests similar to those created for testing the various steps taken in the integration process, and ran them with concurrent calls. We also executed successive calls, to show that the system was stable. The unique channel name, provided by Asterisk, was added to the logging information to discriminate between the different calls.

Output from the CLI is as follows (the unique identifier representing the first call is SIP/1001-f1f1, and the second call SIP/1000-4282):

app_voicexml.c:121 voicexml_exec: SIP/1001-f1f1 - voicexml called successfully
app_voicexml.c:122 voicexml_exec: SIP/1001-f1f1 - argument is: http://frog.ict.ru.ac.za/helloWorld.xml
SIP/1001-f1f1 - saying: Press or say one, two, three or hash.
bargein = 1
app_voicexml.c:121 voicexml_exec: SIP/1000-4282 - voicexml called successfully
app_voicexml.c:122 voicexml_exec: SIP/1000-4282 - argument is: http://frog.ict.ru.ac.za/helloWorld.xml
SIP/1000-4282 - saying: Press or say one, two, three or hash.
bargein = 1
SIP/1000-4282 - setting dtmf input as 1
SIP/1000-4282 - saying: Congratulations, you pressed one
bargein = 1
SIP/1000-4282 - playing audio file: test.gsm fetchtimeout=7000 fetchinit=2
bargein = 1
http://frog.ict.ru.ac.za/test.gsm => '/tmp/app_voicexml/SIP/1000-4282/test.gsm'
'/tmp/app_voicexml/SIP/1000-4282/test.gsm' saved [18810/18810]
Playing '/tmp/app_voicexml/SIP/1000-4282/test.gsm' (language 'en')

SIP/1001-f1f1 - setting dtmf input as 2
SIP/1001-f1f1 - saying: Congratulations, you pressed one
bargein = 1
SIP/1001-f1f1 - playing audio file: test.gsm fetchtimeout=7000 fetchinit=2
bargein = 1

http://frog.ict.ru.ac.za/test.gsm => '/tmp/app_voicexml/SIP/1001-f1f1/test.gsm'
'/tmp/app_voicexml/SIP/1001-f1f1/test.gsm' saved [18810/18810]
-- Playing '/tmp/app_voicexml/SIP/1001-f1f1/test.gsm' (language 'en')
SIP/1000-4282 - app_voicexml.c:286 asr_stream: Sphinx recognized: --> hello
SIP/1000-4282 - setting speech input as hello
SIP/1000-4282 - saying: Hello user
bargein = 1

We can see two calls accessing the same VoiceXML document. Call A (1001-f1f1) and call B (1000-4282) happen simultaneously. B enters 1 before any input is received from A. B then goes on to play an audio file. During the playing of B's audio prompt A inputs 2, and the audio prompt is played to A. Finally, B says "hello" and the corresponding prompt is played. A does not enter any more input.

As can be seen, the system can handle concurrent calls, accepting input from each, and we can therefore conclude that the expected gateway functionality is finally available.

5.6 Summary

This chapter described extending the system to handle input. In particular we discussed adding DTMF support and integrating Sphinx, which is Java-based.

Adding DTMF functionality involved obtaining the token representing the DTMF tone from Asterisk and setting this value in the rec interface. Since Asterisk provides DTMF tone recognition, obtaining the corresponding token was a simple matter of reading from the channel, during the execution of a prompt which could be interrupted, and collecting any frames containing DTMF tokens. These tokens were passed to the rec interface using a VXIMap, and, if appropriate, the execution of the prompt was interrupted. Any input received during the execution of a prompt which cannot be interrupted is ignored.

Linking in the Sphinx system required integrating a Java application into Asterisk, which is written in C. Initially the Java Native Interface (JNI) and Unix file abstraction were tried. The Sphinx process was started by the Invocation API. A wav file (represented by a Unix pipe) was then used to transfer the audio from the Asterisk application to the Sphinx process. The voicexml application wrote all the audio from the call to the pipe. This approach failed, because the writing process could not update the wav file header, and so we investigated another approach.

The second approach, based on a client-server model, was successful. We created SphinxFarm, a Java server which serves Sphinx processes. The voicexml application was extended to connect to an initialized Sphinx process and send audio over a socket. SphinxFarm is responsible for managing the processes, in such a way as to minimize any initialization time delays, associating a client request with an initialized Sphinx process.

While this approach has been found to be successful, a poor word error rate of between 10% and 35% was observed. This will have to be improved before this solution can be used in a real-world application.

Chapter 6

Conclusion

The purpose of this chapter is to review and summarize the work that has been done in this thesis. The completed work will then be compared to the aims of the project as discussed in Chapter 1, to show that we have achieved our goal. We will also discuss future work and extensions to the project.

6.1 Completed work

The work that has been done can be broadly organized into four distinct categories: methodologies were chosen to aid the construction process, the concepts of voice-enabled applications, VoiceXML and the Asterisk system were explored, components were selected and discussed and, finally, these components were integrated to create the gateway.

Due to the time constraints involved with the project, building the gateway from scratch was unfeasible. We therefore needed to select a software engineering methodology which could reduce development time. As seen in Section 1.2.1, the CBSE approach is designed to do exactly that. A system is built by integrating a set of components. This approach drastically reduces development time while improving the flexibility and maintainability of the system. Next a methodology governing the implementation process was chosen. eXtreme Programming, discussed in Section 1.2.2, was used. Scope-driven development gave us the order of implementation, ensuring that the most important parts of the system were implemented first.

Once the methodologies were selected, voice-enabled applications, the VoiceXML standard and Asterisk were discussed in Chapter 2. We saw that voice-enabled applications are being used in call centers, in the health sector, for example to assist the elderly and impaired, and, in general, as an information interface. These applications can all be created using VoiceXML. The VoiceXML specification, first released in 2000, can be used to create voice-enabled applications. It was found that a VoiceXML gateway consists of a VoiceXML interpreter and an input and output component. Inputs can either be DTMF tones or speech fragments and outputs are either synthesized speech or audio files. We have seen how the VoiceXML interpreter interprets a VoiceXML document by implementing the FIA. We defined a VoiceXML gateway as a VoiceXML browser which is able to interface with the PSTN. Finally, Chapter 2 discussed the Asterisk telephony platform. It was shown how Asterisk has the potential to be used as the starting point to create a low-cost PSTN gateway, and how the Asterisk system can be extended using the Application Module API.

Next, the component selection process was discussed in Chapter 3. The main requirements for a VoiceXML gateway were found to be an interpreter, a TTS system and an ASR system. OpenVXI, Festival and Sphinx 4 were selected to provide these required units of functionality. OpenVXI is a framework for building VoiceXML gateways. It provides a set of APIs which have to be "filled out" to create a functional gateway. The APIs represent the gateway's input, output and telephony capabilities. Festival was selected to provide text synthesis functionality and the Java-based Sphinx 4 was selected as the speech recognition engine. Sphinx 4 was selected over the C++ based Sphinx 3 because of its better performance.

Finally, Chapters 4 and 5 discussed the actual implementation of the system. Chapter 4 discussed the creation of the gateway's prompting functionality. This involved creating an Asterisk application, voicexml, linking the OpenVXI libraries and adding audio file playback and speech synthesis functionality. Asterisk was used to play audio files and the Festival system was integrated into OpenVXI's prompt API to provide text synthesis.

Chapter 5 discussed the second half of the implementation effort, adding input functionality. DTMF support was added using Asterisk's DTMF recognizer. Integrating the Sphinx system with Asterisk was more difficult and required two attempts. In the second, we made use of a client-server structure, where a server, SphinxFarm, served Sphinx processes to individual voicexml processes. This allowed the Java and C processes to interact using a socket, which was also used to pass the audio between the two processes.

Integrating all the components resulted in a proof-of-concept VoiceXML gateway. Revisiting the thesis aims, we can evaluate the system created by the implementation process to gauge whether or not the goals have been met.

6.2 Achievements

In Section 1.1 the question was posed as to whether it was possible to create a low-cost, open-source VoiceXML gateway using the Asterisk telephony platform. To this we added the request to create an open COTS-based system, to decrease the time required for implementation, and to maintain clear boundaries for each component.

This thesis has described the construction of a low-cost, open-source, proof-of-concept VoiceXML gateway. Asterisk can be used to create a PSTN gateway for as little as R360 (in addition to the cost of a PC), and so the main cause of high-cost VoiceXML gateways (the expensive PSTN hardware) has been avoided. Asterisk was then extended by adding open-source COTS systems to create the proof-of-concept gateway.

Tests have shown that the system is capable of rendering prompts represented as SSML documents. It is able to fetch audio files from a remote web server and play them to a user, as well as synthesizing text and outputting the resulting waveform.

Support for both DTMF and speech input has been provided. The speech-recognition component of the system has a high word error rate. Tests have shown, however, that this is not due to an implementation error, but is instead due to some other factor, possibly the lack of a comprehensive South African acoustic model.

We can therefore positively answer the question from Section 1.1: it is possible to create a low-cost, open-source, proof-of-concept VoiceXML gateway.

6.3 Future work and extensions

There are a number of areas where further work can be done.

Firstly, the gateway must be made fully standards compliant. The transfer and record tags have to be supported, and dynamic grammar handling and n-best recognition must be added to the SphinxFarm system. The functionality required for the transfer and record tags is provided by Asterisk, and since Asterisk is already linked to OpenVXI, supporting these tags should not require much effort. Dynamic grammars and n-best recognition are also supported by Sphinx, and they should be integrated into the gateway. Finally, the system must be extended to handle SSML documents. Since Festival can already handle SSML, this simply involves updating the commands sent to the Festival server by the client.

The second area requiring work is the speech-recognition performance. The use of speech recognition in voice-enabled applications can improve customer satisfaction, and providing this feature adequately is important, besides being required for compliance. The cause of the poor recognition rate must be identified and repaired.

Thirdly, the system must be tested for robustness and scalability. A large number of telephone calls must be made to the system, simulating a real world scenario, which will indicate whether or not the system is robust and can scale appropriately.

Finally, being South African, it would be nice to include better language support - specifically for African languages. Adding voices to Festival is a relatively simple task, and to extend the system to be able, at least, to prompt in a multi-lingual fashion, is desirable. This can improve access to information for a larger section of the population.

6.4 Conclusion

VoiceXML can be used to create voice-enabled applications. In addition to this, VoiceXML gateways can be used to bridge both IP and PSTN networks, from the point of view of information distribution and collection.

We have proved, through the building of a proof-of-concept prototype, that Asterisk can be extended to create a low-cost, open-source VoiceXML gateway, which can be used to render these VoiceXML-based voice-enabled applications. Before this system is used in the real world, it must be made fully standards compliant.

References

[1] P. Abrahamsson, O. Salo, J. Ronkainen, and J. Warsta. Agile software development methods - Review and analysis. Technical Report 478, VTT Publications, 2002.

[2] Ali Khan. A Tale of Two Methodologies for Web Development: Heavyweight vs Agile. Masters Thesis, Department of Information Systems, The University of Melbourne, Australia, June 2004.

[3] Jon Anton. The past, present and future of customer access centers. International Journal of Service Industry Management, 11(2):120-130, 2000.

[4] Chris Bajorek. VoiceXML - Taking IVR to the Next Level, 2000. Available at http://www.cconvergence.com/article/CTM20000927S0003 [Last Accessed: 23 November 2006].

[5] Rick Beasley, John O'Reilly, Kenneth Michael Farley, and Leon Henry Squire. Voice Application Development with VoiceXML. Sams, Indianapolis, IN, USA, 2001.

[6] Kent Beck. Extreme programming explained: embrace change. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.

[7] Alan W. Black, Paul A. Taylor, and Richard Caley. The Festival Speech Synthesis System: System documentation. Technical report, Human Communication Research Centre, University of Edinburgh, UK, 2002. Available at http://festvox.org/docs/manual-1.4.3 [Last Accessed: 23 November 2006].

[8] Barry Boehm. Get Ready for Agile Methods, with Care. Computer, 35(1):64-69, 2002.

[9] Kristy Bradnum. VoiceXML: A Field Evaluation. Honour's thesis, Department of Computer Science, Rhodes University, Grahamstown, South Africa, November 2004.

[10] Alan W. Brown and Kurt C. Wallnau. The Current State of CBSE. IEEE Software, 15(5):37-46, 1998.

[11] Lawrence Brown, Noah Gans, Avishai Mandelbaum, Anat Sakov, Haipeng Shen, Sergey Zeltyn, and Linda Zhao. Statistical Analysis of a Telephone Call Center: A Queueing-Science Perspective. Journal of the American Statistical Association, 100:36-50, 2005.

[12] Mike Brown, Dan Burnett, Emily Candell, Jerry Carter, Debbie Dahl, Debajit Ghosh, Andrew Hunt, Stefan Krause, Sol Lerner, Bruce Lucas, Jens Marschner, Scott McGlashan, Yves Normandin, Brad Porter, David Ramsthaler, Luc Van Tichelen, Kuansan Wang, and Laura Werner. Speech Recognition Grammar Specification Version 1.0, 2004. Available at http://www.w3.org/TR/speech-grammar [Last Accessed: 23 November 2006].

[13] Xia Cai, Michael R. Lyu, Kam-Fai Wong, and Roy Ko. Component-based software engineering: technologies, development frameworks, and quality assurance schemes. In APSEC, page 372, 2000.

[14] Carnegie Mellon University, Sun Microsystems and Mitsubishi Electric Research Laboratories. Sphinx-4. Technical report. Available at http://cmusphinx.sourceforge.net/sphinx4 [Last Accessed: 23 November 2006].

[15] Cepstral LLC. Cepstral Text-to-Speech. Available at http://www.cepstral.com/ [Last Accessed: 23 November 2006].

[16] Phillip R. Cohen and S. L. Oviatt. The Role of Voice Input for Human-Machine Communication. Proceedings of the National Academy of Sciences, 92(22):9921-9927, 1995.

[17] Ross Corkrey and Lynne Parkinson. Interactive voice response: Review of studies 1989-2000. Behavior Research Methods, Instruments, & Computers, 34(3):342-353, 2002.

[18] Philip T. Cox and Baoming Song. A Formal Model for Component-Based Software. In HCC '01: Proceedings of the IEEE 2001 Symposia on Human Centric Computing Languages and Environments (HCC '01), page 304, Washington, DC, USA, 2001. IEEE Computer Society.

[19] Albert G. Crawford, Vanja Sikirica, Neil Goldfarb, Richard G. Popiel, Minalkumar Patel, Cheng Wang, Jeffrey B. Chu, and David B. Nash. Interactive Voice Response Reminder Effects on Preventive Service Utilization. American Journal of Medical Quality, 20:329-336, 2005.

[20] Avnish Dass, Vikas Gupta, and Charul Shukla. VoiceXML 2.0 Developers Guide: Building Professional Voice-Enabled Applications with JSP, ASP, & ColdFusion. McGraw-Hill/Osborne, Berkeley, California, USA, 2002.

[21] David L. Thomson and John M. Hibel. The Business of Voice Hosting with VoiceXML. Technical report, Lucent Speech Solutions, 2000. Available at http://members.tripod.com/David_Thomson/papers/Host4.doc [Last Accessed: 23 November 2006].

[22] Tom DeMarco and Barry Boehm. The Agile Methods Fray. Computer, 35(6):90-92, 2002.

[23] Bryan Duggan. Revenue Opportunities in the Voice Enabled Web, 2002. Dublin Institute of Technology, School of Computing Report.

[24] Brian Eberman, Jerry Carter, Darren Meyer, and David Goddeau. Building voiceXML browsers with openVXI. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 713-717, New York, NY, USA, 2002. ACM Press.

[25] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol: HTTP/1.1. Internet Engineering Task Force: RFC 2616, June 1999. Available at http://www.ietf.org/rfc/rfc2616.txt [Last Accessed: 23 November 2006].

[26] Brian Fitzgerald. The use of systems development methodologies in practice: a field study. Information Systems Journal, 7(3):201-212, 1997.

[27] G. Fowler. A case for make. Softw. Pract. Exper., 20(S1):35-46, 1990.

[28] Michael Franz. Dynamic Linking of Software Components. Computer, 30(3):74-81, 1997.

[29] Scott Fricker, Mirta Galesic, Roger Tourangeau, and Ting Yan. An Experimental Comparison of Web and Telephone Surveys. Public Opinion Quarterly, 69(3):370-392, 2005.

[30] Evgeniy Gabrilovich and Lev Finkelstein. JNI-C++ integration made easy. C/C++ Users J., 19(1):10-21, 2001.

[31] Boby George and Laurie Williams. An initial investigation of test driven development in industry. In SAC '03: Proceedings of the 2003 ACM symposium on Applied computing, pages 1135-1139, New York, NY, USA, 2003. ACM Press.

[32] John C. Grundy, Warwick B. Mugridge, and John G. Hosking. Constructing component-based software engineering environments: issues and experiences. Information & Software Technology, 42(2):103-114, 2000.

[33] J. Hitchcock. Decorating Asterisk: Experiments in Service Creation for a Multi-Protocol Telephony Environment Using Open Source Tools. Masters thesis, Department of Computer Science, Rhodes University, Grahamstown, South Africa, March 2006.

[34] Hanna Hulkko and Pekka Abrahamsson. A multiple case study on the impact of pair programming on product quality. In ICSE '05: Proceedings of the 27th international conference on Software engineering, pages 495-504, 2005.

[35] IBM. Research History Highlights. Available at http://www.research.ibm.com/about/paschistory.shtml [Last Accessed: 23 November 2006].

[36] A. Isard. SSML: A Markup Language for Speech Synthesis. PhD thesis, University of Edinburgh, 1995.

[37] J. A. Quiane Ruiz and J. R. Manjarrez Sanchez. Design of a VoiceXML Gateway. In Proceedings of the Fourth Mexican International Conference on Computer Science, 00:49-53, 2003.

[38] Jeffrey P. Migneault, Ramesh Farzanfar, Julie A. Wright, and Robert H. Friedman. How to write health dialog for a talking computer. Journal of Biomedical Informatics, 39:468-481, 2006.

[39] Candace Kamm. User Interfaces for voice applications. pages 422-442, 1994.

[40] Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern, Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, and Dave Thomas. Manifesto for Agile Software Development. Available at http://www.agilemanifesto.org [Last Accessed: 23 November 2006].

[41] Irina Kondratova. Performance and Usability of VoiceXML Applications. In Proceedings of the Eighth World Multi-Conference on Systemics, Cybernetics and Informatics.

[42] Irina Kondratova. Voice and multimodal technology for the mobile worker. Special Issue Mobile Computing in Construction, 9:345-353, 2004.

[43] Sun Microsystems Laboratories, Carnegie Mellon University, and Mitsubishi Electric Research Laboratories. Sphinx-4 Application Programmer's Guide. Technical report. Available at http://cmusphinx.sourceforge.net/sphinx4/doc/ProgrammersGuide [Last Accessed: 23 November 2006].

[44] Paul Lamere, Philip Kwok, Evandro B. Gouvea, Bhiksha Raj, Rita Singh, William Walker, and Peter Wolf. The CMU Sphinx-4 speech recognition system. Technical report. Available at http://research.sun.com/sunlabsday/docs.2004/sphinx4.pdf [Last Accessed: 23 November 2006].

[45] Jim A. Larson. VoiceXML 2.0 and the W3C speech interface framework. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 5-8, 2001.

[46] Kai-Fu Lee, Hsiao-Wuen Hon, and Mei-Yuh Hwang. Recent progress in the SPHINX Speech Recognition system. In HLT '89: Proceedings of the workshop on Speech and Natural Language, pages 125-130, Morristown, NJ, USA, 1989. Association for Computational Linguistics.

[47] Kai-Fu Lee, Hsiao-Wuen Hon, and Mei-Yuh Hwang. An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(1):35-45, January 1990. See also IEEE Transactions on Signal Processing.

[48] Linda Levine. Reflections on software agility and agile methods: challenges, dilemmas and the way ahead. Technical report, 2005.

[49] Sheng Liang. The Java Native Interface Programmer's Guide and Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[50] Linux Programmer's Manual. Lseek - reposition read/write file offset, 2001. Available at http://ibm5.ma.utexas.edu/cgi-bin/man-cgi?lseek+2 [Last Accessed: 23 November 2006].

[51] Linux Programmer's Manual. Pipe - overview of pipes and FIFOs, 2005. Available at http://ibm5.ma.utexas.edu/cgi-bin/man-cgi?pipe+7 [Last Accessed: 23 November 2006].

[52] LumenVox. LumenVox - Innovators of Speech Recognition Technology and Deployment Tools. Available at http://www.lumenvox.com [Last Accessed: 23 November 2006].

[53] M. A. Awad. A Comparison between Agile and Traditional Software Development Methodologies. Honours Thesis, School of Computer Science and Software Engineering, The University of Western Australia, 2005.

[54] Nasser K. Manesh. Asterisk: A Non-Technical Overview, 2004. Available at http://www.millenigence.com/articles/asterisk-non-technical-review.pdf [Last Accessed: 23 November 2006].

[55] Matthew Hood. Creating a Voice for Festival Speech Synthesis System. Honour's thesis, Department of Computer Science, Rhodes University, Grahamstown, South Africa, November 2004.

[56] Scott McGlashan, Daniel C. Burnett, Jerry Carter, Peter Danielsen, Jim Ferrans, Andrew Hunt, Bruce Lucas, Brad Porter, Ken Rehor, and Steph Tryphonas. Voice Extensible Markup Language (VoiceXML) 2.0, 2004. Available at http://www.w3.org/TR/voicexml20 [Last Accessed: 23 November 2006].

[57] M. Douglas McIlroy. Mass-Produced Software Components. In J. M. Buxton, Peter Naur, and Brian Randell, editors, Software Engineering Concepts and Techniques (1968 NATO Conference on Software Engineering), pages 88-98, 1976.

[58] Mick P. Couper, Eleanor Singer, and Roger Tourangeau. Does voice matter? An interactive voice response (IVR) experiment. Journal of Official Statistics, 20(3):1-20, 2004.

[59] Mark Miller. VoiceXML: 10 Projects to Voice Enable Your Web Site. John Wiley & Sons, Inc., New York, NY, USA, 2002.

[60] Miro Distribution. Miro Distribution - POWERFUL | COMMUNICATION | SOLUTIONS. Available at http://www.miro.co.za/ [Last Accessed: 23 November 2006].

[61] Ramkishore Modukuri and Richard Morris. Voice based web services - An assistive technology for visually impaired persons. Technology and Disability, 6:195-200, 2004.

[62] Narayanan Annamalai. An Extensible Transcoder for HTML to VoiceXML Conversion. Masters Thesis, The University of Texas at Dallas, 2002.

[63] Jerzy Nawrocki and Adam Wojciechowski. Experimental Evaluation of Pair Programming. In Proceedings of European Software Control and Metrics, London, UK, April 2001.

[64] Georg Niklfeld, Robert Finan, and Michael Pucher. Component-based multimodal dialog interfaces for mobile knowledge creation. In Proceedings of the workshop on Human Language Technology and Knowledge Management, pages 1-8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.

[65] Hrvoje Niksic. Wget - The non-interactive network downloader, 2006. Linux Programmer's Manual. Available at http://gentoo-wiki.com/MAN_wget [Last Accessed: 23 November 2006].

[66] Jim Q. Ning. Component-Based Software Engineering (CBSE). In SAST '97: Proceedings of the 5th International Symposium on Assessment of Software Tools (SAST '97), page 34, Washington, DC, USA, 1997. IEEE Computer Society.

[67] Patricia A. Oberndorf. Facilitating Component-Based Software Engineering: COTS and Open Systems. In SAST '97: Proceedings of the 5th International Symposium on Assessment of Software Tools (SAST '97), page 143, Washington, DC, USA, 1997. IEEE Computer Society.

[68] Matt Oshry, RJ Auburn, Paolo Baggia, Michael Bodell, David Burke, Daniel C. Burnett, Emily Candell, Jerry Carter, Scott McGlashan, Alex Lee, Brad Porter, and Ken Rehor. Voice Extensible Markup Language (VoiceXML) 2.1, 2006. Available at http://www.w3.org/TR/voicexml21 [Last Accessed: 23 November 2006].

[69] J. Penton and A. Terzoli. Asterisk: A Converged TDM and Packet-based Communications System. In Proceedings of SATNAC 2003 - Next Generation Networks, September 2003.

[70] Manuel A. Perez-Quinones and Jochen Rode. You've Got Mail! Calendar, Weather and More: Customizable Phone Access to Personal Information. Technical report, Computer Science, Virginia Tech, 2004. Available at http://eprints.cs.vt.edu/archive/00000698/01/chi05phonePIM.pdf [Last Accessed: 23 November 2006].

[71] John D. Piette. Satisfaction With Automated Telephone Disease Management Calls and Its Relationship to Their Use. The Diabetes Educator, 26(3):1003-1010, 2000.

[72] John D. Piette, Morris Weinberger, Frederic B. Kraemer, and Stephen J. McPhee. Impact of Automated Calls With Nurse Follow-Up on Diabetes Treatment Outcomes in a Department of Veterans Affairs Health Care System. Diabetes Care, 24:202-208, 2001.

[73] Padmanabhan Pillai and Kang G. Shin. Extending UNIX File Abstraction for General-Purpose Networking, January 2004. Available at http://www.intel-research.net/Publications/Pittsburgh/101220041324_277.pdf [Last Accessed: 23 November 2006].

[74] publicVoiceXML. Official Home of publicVoiceXML. Available at http://publicvoicexml.sourceforge.net [Last Accessed: 23 November 2006].

[75] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. pages 267-296, 1990.

[76] Mosur K. Ravishankar. Sphinx-3 s3.X Decoder (X=6). Technical report. Available at http://cmusphinx.sourceforge.net/sphinx3/s3_description.html [Last Accessed: 23 November 2006].

[77] Sami Lemmetty. Review of Speech Synthesis Technology. Masters Thesis, Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, 1999.

[78] Chris Schmandt. Phoneshell: the telephone as computer terminal. In MULTIMEDIA '93: Proceedings of the first ACM international conference on Multimedia, pages 373-382, New York, NY, USA, 1993. ACM Press.

[79] Udo Schuermann. The "JNI HOW-TO", 2001. Available at http://ringlord.com/publications/jni-howto/ [Last Accessed: 23 November 2006].

[80] Zhiyan Shao, Robert Capra, and Manuel A. Perez-Quinones. Annotations for HTML to VoiceXML Transcoding: Producing Voice Web Pages with Usability in Mind. CoRR, cs.HC/0211037, 2002.

[81] Ian Sommerville. Software Engineering (7th Edition). Pearson Addison Wesley, 2004.

[82] Mark Spencer, Mack Allison, Christopher Rhodes, and The Asterisk Documentation Team. The Asterisk Handbook Version 2. Technical report. Available at http://www.digium.com/handbook-draft.pdf [Last Accessed: 23 November 2006].

[83] STAR. The Speech Technology And Research (STAR) Group. Available at http://www.star.za.net [Last Accessed: 23 November 2006].

[84] F. W. M. Stentiford and P. A. Popay. The Design and Evaluation of Dialogues for Interactive Voice Response Services. BT Technology Journal, 17(1):142-148, 1999.

[85] Bernhard Suhm, Dan McCarthy, and Pat Peterson. Maximizing Benefit while Minimizing Risk in Speech-Enabling Customer Care. Technical report, BBN Technologies. Available at http://www.bbn.com/docs/whitepapers/max-benefits.pdf [Last Accessed: 23 November 2006].

[86] Clemens Szyperski. Component Software: Beyond Object-Oriented Programming. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[87] Roger Tourangeau, Darby Miller Steiger, and David Wilson. Self-administered questions by telephone: Evaluating interactive voice response. Public Opinion Quarterly, 66(2):265-278, 2002.

[88] Kenneth J. Turner. Analysing interactive voice services. Computer Networks: The International Journal of Computer and Telecommunications Networking, 45(5):665-685, 2004.

[89] voip-info.org. Asterisk Festival Installation, 2006. Available at http://www.voip-info.org/wiki/view/Asterisk+festival+installation [Last Accessed: 23 November 2006].

[90] W3C. Introduction and Overview of W3C Speech Interface Framework, 2000. Available at http://www.w3.org/TR/voice-intro [Last Accessed: 23 November 2006].

[91] W3C. Speech Synthesis Markup Language (SSML) Version 1.0, 2004. Available at http://www.w3.org/TR/speech-synthesis [Last Accessed: 23 November 2006].

[92] Merrill Warkentin. The Next Big Thing in eCommerce. Decision Line, 32(1):7-10, 2001.

[93] M.F. Weegels. Users' Conceptions of Voice-Operated Information Services. International Journal of Speech Technology, 3(2):75-82, 2000.

[94] Laurie Williams, Robert R. Kessler, Ward Cunningham, and Ron Jeffries. Strengthening the Case for Pair Programming. IEEE Software, 17(4):19-25, 2000.

[95] Scott Wilson. WAVE PCM soundfile format, 2003. Available at http://ccrma.stanford.edu/courses/422/projects/WaveFormat/ [Last Accessed: 23 November 2006].

[96] Mary Zajicek, Richard Wales, and Andrew Lee. Towards VoiceXML dialogue design for older adults. In Palanque, P., Johnson, P. and O'Neill, E., editors, Design for Society: Proceedings of HCI 2003, 2003.

Appendix A

VoiceXML 2.0 elements

Element      Purpose
<assign>     Assign a variable a value
<menu>       A dialog for choosing amongst alternative destinations
<meta>       Define a metadata item as a name/value pair
<metadata>   Define metadata information using a metadata schema
<noinput>    Catch a noinput event
<nomatch>    Catch a nomatch event
<object>     Interact with a custom extension