Automatic Derivation of Grammar Rules Used for Speech Recognition in Dialogue Systems


Masaryk University
Faculty of Informatics

Automatic Derivation of Grammar Rules Used for Speech Recognition in Dialogue Systems

Master's Thesis

Bc. Klára Kufová

Brno, Spring 2018

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Klára Kufová

Advisor: Mgr. Luděk Bártek, Ph.D.

Acknowledgements

First and foremost, I would like to acknowledge the thesis advisor, Mgr. Luděk Bártek, Ph.D., of the Faculty of Informatics at Masaryk University, for his valuable ideas, constructive advice, and the time dedicated to our consultations. My sincere thanks also go to both my current and former colleagues at Red Hat, especially to Mgr. Milan Navrátil and Ing. Radovan Synek, who helped me tremendously during both the implementation and writing of this thesis. Last but not least, I would like to express my sincere gratitude to my parents, grandparents, my brother, and to my fiancé, who have supported and encouraged me throughout the course of my studies.

Abstract

The thesis deals with the process of building an adaptive dialogue system, which is capable of learning new grammar rules used for automatic speech recognition based on past conversations with real users. The subsequent ability to automatically reduce the unused rules within a grammar is proposed and implemented as well. Such a dialogue system is domain-independent, efficiently extensible, and overcomes the most significant drawbacks of grammar-based language models utilized within speech recognizers. The described theoretical principles are demonstrated on the created conversational agent, which is prepared to be deployed in production as a virtual shopping assistant in an online fashion boutique. Apart from the system's implementation details, the thesis provides a comprehensive overview of the area of dialogue systems in the field of artificial intelligence and natural language processing.

Keywords

dialogue system, grammar rules, speech recognition, Sphinx4, speech synthesis, MaryTTS, language modelling, chat bot, virtual personal assistant, Coco, natural language processing, speech understanding, speech generation, corrective dialogue, grammar expansion, grammar reduction

Contents

Introduction
1 State of the Art
  1.1 Contemporary Dialogue Systems
    1.1.1 Virtual Personal Assistants
    1.1.2 Business and Commerce
    1.1.3 Education and Healthcare
  1.2 Current Fields of Research
    1.2.1 Natural Language Understanding
    1.2.2 Dialogue Management
    1.2.3 Natural Language Generation
2 Building a Dialogue System
  2.1 Introducing Coco
    2.1.1 Problem Domain
    2.1.2 Deployment
  2.2 Voice Dialogue Standards
    2.2.1 VoiceXML
    2.2.2 Aspect Prophecy
    2.2.3 VoxML
  2.3 Input and Output Speech
    2.3.1 Speech Recognition
    2.3.2 Speech Synthesis
  2.4 Speech Understanding and Generation
    2.4.1 Speech Understanding
    2.4.2 Speech Generation
3 Adaptive Dialogue Systems
  3.1 Automatic Derivation of Grammar Rules
    3.1.1 Detecting Out-Of-Grammar Utterances
    3.1.2 Corrective Dialogue
    3.1.3 Grammar Expansion
    3.1.4 Dialogue Continuation
  3.2 Automatic Reduction of Grammar Rules
    3.2.1 Removing Old Rules
    3.2.2 Removing Unused Rules
4 Future Work
Conclusion
Bibliography
A Running Coco
  A.1 Software Distribution
  A.2 Execution Instructions
    A.2.1 Linux-Based Operating Systems
    A.2.2 Microsoft Windows
B Contributing to Coco
  B.1 Developing Coco
  B.2 Building Coco
C Example Dialogue with Coco

List of Figures

1.1 An architecture of a dialogue system
2.1 The architecture of a VoiceXML application
2.2 The architecture of the Sphinx4 speech recognition system
2.3 An example search graph generated by the Linguist module
2.4 The architecture of the MaryTTS speech synthesizer
3.1 The schema of a corrective dialogue

Introduction

"Simplicity is the keynote of all true elegance." – Coco Chanel

The idea of a real conversation with a machine has tempted researchers in the fields of artificial intelligence and natural language processing from the very beginning. Starting in 1950, when the British journal Mind published the influential article Computing Machinery and Intelligence [1] by Alan Mathison Turing, the area of natural language processing emerged almost instantly, and soon afterwards machines were able not only to talk, but also to recognize and understand human speech. The simple, naively operating dialogue systems of the 1960s evolved into the complex, sophisticated conversational agents of the new millennium. With the assistance of a dialogue system, making a restaurant reservation, booking a plane ticket, or recognizing a new song is a matter of seconds, while learning a new language or improving one's mental health may be a matter of days. Both the former and the contemporary dialogue systems, classified by their area of usage, are introduced in chapter 1. The chapter does not only survey the existing conversational agents; it also describes the current prominent research areas. The second half of the first chapter is therefore divided into three sections, which correspond to the fundamental components of a dialogue system: a natural language understanding unit, a dialogue manager, and a natural language generation unit.

The most important part of this thesis is Coco, a personal shopping assistant introduced at the beginning of chapter 2. The Coco dialogue system is named after Gabrielle Bonheur Chanel, the founder of the world-famous Parisian haute couture fashion house, nicknamed Coco. Principles and techniques associated with the implementation of a conversational agent are demonstrated on the created dialogue system. Apart from the characterization of the system's problem domain, deployment, and personas, the chapter also introduces the most influential voice dialogue standards and thoroughly describes the speech recognition and speech synthesis libraries and related methods utilized within the system. The natural language understanding and generation logic implemented in Coco is described as well.

The Coco dialogue system was created to demonstrate the ability of an artificial computer system to learn new grammar rules for speech recognition based on past conversations with real users. The automatic derivation of grammar rules, together with the employed approaches to their automatic reduction, is explained in detail in chapter 3.
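To make the notion of a grammar rule concrete, the following is a minimal sketch of a grammar in the JSGF format, which the Sphinx4 recognizer listed among the keywords accepts; the grammar name, rule names, and phrases are illustrative placeholders, not rules taken from Coco:

    #JSGF V1.0;

    grammar shopping;

    // The public rule is the entry point the recognizer matches against.
    public <request> = <polite> <item>;

    // Alternatives a user may say; deriving new alternatives such as
    // these from past conversations is the subject of chapter 3.
    <polite> = i want | i would like | show me;
    <item>   = a dress | a skirt | a handbag;

Under this view, expanding a grammar amounts to adding new alternatives to such rules, and reducing it amounts to removing alternatives that are old or no longer used.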
The final chapter 4 then briefly lists the system's weak points and desirable enhancements, which may be included in one of the future releases of Coco to provide a more advanced, state-of-the-art dialogue interface.

The aim of the thesis is not only to define and build an adaptive dialogue system, but also to provide a brief yet comprehensive overview of the conversational-agent-specific field of artificial intelligence and natural language processing. And, as stated at the beginning, it is crucial to achieve this goal as simply as possible.

1 State of the Art

At the time of writing the very first sentence of this thesis, researching and implementing computer dialogue systems, which vary in many assorted aspects, had been a significant and fairly appealing part of the natural language processing area of computer science for over fifty years. From ELIZA¹ and PARRY²—simple and intuitive early chat bots that, as a matter of fact, applied few artificial intelligence approaches, although they are considered to be early artificial intelligence programs—the research has moved noticeably forward to complex and sophisticated systems which can address many current issues.

1.1 Contemporary Dialogue Systems

In general, dialogue systems can be classified according to many diverse aspects. One of the most common methods of classifying dialogue systems is by the initiative used: a dialogue interface can have either a system initiative, a user initiative, or a mixed initiative. However, with the current notable progress in the area of multimodal human-computer interaction interfaces, it is becoming more common to categorize dialogue systems based on modality (a dialogue interface can be multimodal, or controlled by written text, by the spoken word, or through a graphical user interface). For the purposes of this thesis, a classification based on areas of usage is discussed, although each of the mentioned dialogue systems may belong to several, occasionally overlapping, categories.

1.1.1 Virtual Personal Assistants

Virtual personal assistants (hereinafter also referred to as VPAs) are possibly the most widely known type of dialogue system, mainly due to their default availability on personal electronic devices. Current virtual personal assistants are built to perform a large number of different tasks, which makes the interaction with the personal device easier and faster, while sparing the users from having to accomplish their goals manually.

1. ELIZA was created between 1964 and 1966 by Joseph Weizenbaum at the Massachusetts Institute of Technology and is believed to be one of the very first dialogue systems ever implemented. The program simulated a conversation with a psychotherapist, and its fundamental logic was based on the detection of critical words in the user's text input [2].
2. PARRY—created by Kenneth Mark Colby as a reaction to ELIZA—was supposed to act as a paranoid patient suffering from schizophrenia. PARRY comprised more advanced approaches than ELIZA and was even subjected to the famous Turing test [3].
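To illustrate the keyword-detection principle that footnote 1 attributes to ELIZA, here is a minimal, self-contained Java sketch; the rules and responses are invented for illustration and are not taken from Weizenbaum's original script:

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * A minimal sketch of ELIZA-style keyword detection: scan the user's
     * input for known critical words and answer with the associated
     * canned response. The rules below are illustrative placeholders.
     */
    public class KeywordResponder {

        // Insertion order matters: the first matching keyword wins.
        private static final Map<String, String> RULES = new LinkedHashMap<>();
        static {
            RULES.put("mother", "Tell me more about your family.");
            RULES.put("always", "Can you think of a specific example?");
            RULES.put("sad", "I am sorry to hear you are feeling sad.");
        }

        public static String respond(String input) {
            String normalized = input.toLowerCase();
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                if (normalized.contains(rule.getKey())) {
                    return rule.getValue();
                }
            }
            // No keyword detected: fall back to a content-free prompt,
            // much as ELIZA did.
            return "Please go on.";
        }

        public static void main(String[] args) {
            // Prints: Tell me more about your family.
            System.out.println(respond("My mother is always busy."));
        }
    }

Even such a trivially simple loop can sustain a surprisingly convincing conversation, which is precisely why ELIZA is remembered as an early artificial intelligence program despite applying few artificial intelligence techniques.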