Evaluation of a system

Pocketsphinx

Rickard Hjulström

Rickard Hjulström, Spring 2015
Bachelor's thesis, 15 hp
Supervisor: Marie Nordström
Examiner: Lars-Erik Janlert
Bachelor's program in computer science, 180 hp

Abstract

Speech recognition is the process of translating an audio signal into text using a computer program. The technique is today widely used in a large variety of areas. Pocketsphinx, an open source speech recognition system, is one of the more promising systems on the market today. It is designed to be efficient and able to translate speech in real time on low-performance platforms. In this thesis Pocketsphinx is measured with respect to word error rate and translation time using data recorded by two Swedish speakers. A proof of concept was made using Pocketsphinx to control a robot by voice. The system was compared to the similar speech recognition system Google Speech with respect to word error rate and translation time. The resulting data from the measurements suggests that Google Speech has considerably better accuracy on long, grammatically correct sentences, while Pocketsphinx is a less demanding and faster system. The word error rate and translation time of Pocketsphinx are more affected by noise than those of Google Speech. The accuracy and recognition speed of Pocketsphinx showed large improvements when the dictionary was stripped down, making it better suited for controlling a robot with a limited number of commands in real time.

Acknowledgements

I want to thank Thomas Hellström and Peter Hohnloser for their support during the project and for giving me a very interesting task. Thanks to Marie Nordström for the guidance during the project and to Magnus Stenman for the help with the implementation and measurement parts. Finally I want to thank my family. Without your support I would not have managed to complete this work.

Contents

1 An introduction to speech recognition
  1.1 Areas where speech recognition is used
  1.2 The purpose of this thesis

2 The functionality of a speech recognition system
  2.1 The structure of an audio signal
  2.2 Speaker-dependent/independent speech recognition
  2.3 Models used in speech recognition
  2.4 Algorithms used in speech recognition
  2.5 The process of recognizing speech

3 Evaluation of the system
  3.1 CMU Sphinx: Pocketsphinx
  3.2 Google Speech
  3.3 Word error rate
  3.4 Data collection
  3.5 Performing measurements
  3.6 Results

4 Practical implementation: Amigobot
  4.1 Robot Operating System
  4.2 Implementation and commands
  4.3 Robotic behaviors
  4.4 Evaluation of Pocketsphinx in robotics
  4.5 Results

5 Discussion and conclusions
  5.1 Results of evaluations
  5.2 Controlling a robot
  5.3 Conclusions
  5.4 Future work

6 Bibliography

Appendices

A Appendix: Data sentences
  A.1 Sentences for testing: Common sentences
  A.2 Sentences for testing: Robot commands

1 An introduction to speech recognition

Interpreting and translating a speech signal into text is the art of speech recognition. Recognizing speech has become one of the most successful areas of signal processing. The uses of a fully working speech recognition system with acceptable levels of error are enormous and many areas could take advantage of it. The definition of speech recognition according to Anusuya and Katti [1]: "Speech Recognition (is also known as Automatic Speech Recognition (ASR), or computer speech recognition) is the process of converting a speech signal to a sequence of words, by means of an algorithm implemented as a computer program." Even though the definition gives a fairly good picture of what speech recognition is about, it lacks a description of what human speech actually is and of the fact that the resulting text from a translation should match the meaning of the spoken words. In this thesis I try to explain what speech recognition is and how it works. The Pocketsphinx speech recognition system is intended to be a lightweight variant with the ability to translate speech in real time. According to its creators it is the first and only system that manages to accomplish a fast and reliable offline translation of speech to text while still maintaining good accuracy. Pocketsphinx is an open source system and can be trained to recognize different languages and dialects. These properties make it useful for a broad variety of applications and thus a very interesting system to study from a commercial perspective [2][3].

1.1 Areas where speech recognition is used

Speech recognition is today widely used in a large number of areas. Corporations use speech recognition in automated phone systems with the purpose of saving money in time and staff. Speech recognition can aid people with physical disabilities. People suffering from hearing impairment can for example use speech recognition to participate in discussions, and those who have difficulties using their hands can take advantage of the technique by speaking the text to be printed instead of using a keyboard. One example is the Dragon Speech Recognition Software [4], a tool for persons with dyslexia, low vision or other physical disabilities, making it possible to type text on a computer using the voice instead of a keyboard. Millions of people today use speech recognition every day with their smartphones. Android devices as well as iPhones have support for speech recognition when for example writing text messages or searching on the internet.

1.2 The purpose of this thesis

The main focus of this work is to describe how a speech recognition system works and to discuss the problems that may impair the results of translations from speech to text. The Pocketsphinx speech recognition system is measured, evaluated and compared to the similar system Google Speech with respect to word error rate and translation time. Google Speech is evaluated in another work by Magnus Stenman [5] using the same parameters. The results from Magnus's measurements are compared to the measurements in this thesis. Pocketsphinx is also evaluated with respect to its usability in robotics, controlling a robot using nothing but the voice. The main purposes of the thesis are to:

• Describe the algorithms and models used by speech recognition systems, as well as the process of speech recognition.

• Compare Pocketsphinx to the similar system Google Speech, taking into account word error rate and translation time.

• By implementing Pocketsphinx in a robot controller software, test its practical usability in robotics, taking into account word error rate and translation time.

In chapter 2 algorithms, models and the process of recognizing speech are described. Chapter 3 covers the collection of data and the evaluations of the systems. Chapter 4 describes the practical implementation of Pocketsphinx in a robot controller software. The thesis concludes with chapter 5, containing the discussion, conclusions and future work.

2 The functionality of a speech recognition system

This chapter focuses on how speech recognition works. The structure of an audio signal is described, followed by a review of the models and algorithms used in speech recognition systems. The chapter concludes with the actual process of translating an audio signal into text.

2.1 The structure of an audio signal

The smallest structural unit of speech is called a phoneme. It is an abstraction of a speech signal that is perceived to have a specific function. Each language is said to have a specific set of phonemes, but most languages have very similar sets. The English language consists of about 44 different phonemes. The idea of phonemes is to lump together sounds that are similar to each other while disregarding characteristics such as speech rate or dialect. Today a very common way of translating speech into text is to first interpret the audio signal, translate it to phonemes and derive sentences from the phonemes using stochastic processes and algorithms [6].

Figure 1: The waveform of the words "hello there"

While a phoneme is an abstraction that lumps together similar sounds in a language, there are physical sounds as well. The physical representation of the most basic sound unit is called a phone. Phones are in general segments of speech and can be either vowels or consonants. Phones differ from each other in that they have different physical or perceptual characteristics. Figure 1 shows a graphical view of the words "hello there" as sound waves. The phones building the sentence could be HH, EH, L, OW, DH, EH, R. Some words however have several pronunciations, which is why there could also be other sequences of phones. When trying to decide which phoneme a spoken sound is, the context is often taken into account. Phones derived using context are called triphones, in which the phones uttered before and after are also considered. Triphones build words, which in turn build sentences [7].
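To make the mapping concrete, the following minimal Python sketch expands a short sentence into phones using a toy phonetic dictionary and then forms context-dependent triphones. The dictionary entries and function names are illustrative only, not taken from an actual CMU dictionary file.

```python
# Minimal sketch: expanding words into phones and context-dependent triphones.
# The dictionary entries below are invented for illustration.
PHONETIC_DICT = {
    "hello": ["HH", "EH", "L", "OW"],
    "there": ["DH", "EH", "R"],
}

def to_phones(sentence):
    """Concatenate the phone sequences of all words in the sentence."""
    phones = []
    for word in sentence.lower().split():
        phones.extend(PHONETIC_DICT[word])
    return phones

def to_triphones(phones):
    """Pair each phone with its left and right context (None at the edges)."""
    padded = [None] + phones + [None]
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]

phones = to_phones("hello there")
print(phones)                # ['HH', 'EH', 'L', 'OW', 'DH', 'EH', 'R']
print(to_triphones(phones))  # (None, 'HH', 'EH'), ('HH', 'EH', 'L'), ...
```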

2.2 Speaker-dependent/independent speech recognition

Speaker-dependent speech recognition systems are highly dependent on the person who is speaking. Many factors can affect the translations, such as gender, age, voice and pitch. For these reasons the speaker-dependent variants are rarely customized to work for a larger group of people. Speaker-independent systems are constructed to be more general in their translations, resulting in systems with broader usability. Speaker-independent systems are however generally much harder to build. Speaker-independent systems do not require any training, which often is necessary in the speaker-dependent variants. Training can greatly affect the accuracy of a speech recognition system. While a speaker-independent system generally works better for a broader mass of people, a speaker-dependent system can be trained to outperform it for a single person [8].

2.3 Models used in speech recognition

In speech recognition systems three models are used during translations: an acoustic model, a phonetic dictionary and a language model [7][9][10].

• Acoustic model - p(phones|speech). The acoustic model is used to translate data from an audio signal to the most probable phones uttered. It describes the sound of words and contains statistical mappings for each phone (or phoneme). Some speech recognition systems can have statistical representations in an acoustic model of a very abstract nature. A phone can for example be derived by using a decision tree or a function.

• Phonetic dictionary - p(words|phones). The phonetic dictionary maps the relationship between words and their phones. The dictionary may contain several variants of some words because of differences in pronunciation. This model can be very large, which greatly affects the decoding time of speech recognition. Some systems may for example contain up to several million different words.

• Language model - p(sentences|words). The language model contains statistical information about which words should follow each other in a sentence. This is needed because some words sound very similar to each other when taken out of their context. The words "two" and "too" mean quite different things but can be very hard to distinguish in speech. Basically the language model is used to determine the probability of sentences and to choose the one most likely to represent the audio signal. Because words can have different meanings depending on context, the language model greatly affects the accuracy of a speech recognition system. A fixed number of previous words is often taken into account when calculating the likelihood of a word. Models using this technique are called N-gram models. By looking at the (N − 1) previous words the context is taken into account [10], as sketched below.
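As a toy illustration of the last point, the following sketch estimates bigram probabilities (N = 2) from a tiny invented corpus and scores two candidate sentences. Real language models are trained on large text corpora and use smoothing for unseen word pairs; the corpus and probabilities here are purely illustrative.

```python
# Minimal sketch of a bigram (N = 2) language model estimated from counts.
from collections import defaultdict

corpus = [
    "how are you doing",
    "what are you doing",
    "how are you",
]

unigram = defaultdict(int)   # counts of context words
bigram = defaultdict(int)    # counts of (previous word, word) pairs
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, word in zip(words, words[1:]):
        unigram[prev] += 1
        bigram[(prev, word)] += 1

def p_bigram(prev, word):
    """p(word | prev) by maximum likelihood; 0.0 if the context was never seen."""
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

def p_sentence(sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sentence("how are you doing"))  # about 0.67 on this toy corpus
print(p_sentence("you are how"))        # 0.0, contains unseen bigrams
```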

2.4 Algorithms used in speech recognition

Speech recognition systems of today usually map which phones (triphones) most likely have been uttered based on an input audio signal. The phones are derived using a number of algorithms that have proven to work very well in processing speech. Generally there are two main approaches (algorithms) used in the acoustic model to map the most likely phones: hidden Markov models and deep neural networks [11]. A hidden Markov model (HMM) is a stochastic signal model and the simplest variant of a dynamic Bayesian network [12]. The name hidden Markov model stems from the ongoing processes not being visible during the calculations. In speech recognition the acoustic model is often a representation of hidden Markov models for each of the phones. Let U be an unknown utterance generated by any of the models a1, a2, ..., an in the acoustic model; then calculating p(U|ai) can be used to select the phone most likely to have been uttered. Because of the ability to train HMMs, they are often adopted to map the relationship between audio signals and phones. An HMM has two sets of parameters: the set of transition probabilities, representing the probability of moving from one state to another, and the set of output probability densities, which are the output parameters. In Figure 2 below, the architecture of an HMM is illustrated. The set of transition probabilities a1,1, ..., a3,4 represents the probability of moving from one state to another (or back to the same state). For each of the states s1, ..., s3 the function p(x|si) calculates the probability density of an acoustic observation x. The possibility for an HMM to iterate the same state several times makes it possible to model phones of varying lengths [13][10][6].

Figure 2: A hidden Markov model (states s1, s2, s3 with self, forward and skip transitions a0,1, ..., a3,4).
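The following sketch illustrates the idea of scoring an utterance against several phone HMMs with the forward algorithm and picking the one with the highest p(U|ai). The two toy models, their probabilities and the discretised observations are invented for illustration; real acoustic models use continuous output densities rather than a two-symbol alphabet.

```python
# Minimal sketch: p(U | a_i) for two toy phone HMMs via the forward algorithm.
def forward(obs, start, trans, emit):
    """Probability of an observation sequence given a discrete-emission HMM."""
    alpha = [start[s] * emit[s][obs[0]] for s in range(len(start))]
    for o in obs[1:]:
        alpha = [
            sum(alpha[sp] * trans[sp][s] for sp in range(len(start))) * emit[s][o]
            for s in range(len(start))
        ]
    return sum(alpha)

# Two toy 2-state phone models over a discretised acoustic alphabet {0, 1}.
phone_models = {
    "EH": {"start": [1.0, 0.0],
           "trans": [[0.6, 0.4], [0.0, 1.0]],
           "emit":  [[0.8, 0.2], [0.3, 0.7]]},
    "OW": {"start": [1.0, 0.0],
           "trans": [[0.5, 0.5], [0.0, 1.0]],
           "emit":  [[0.2, 0.8], [0.6, 0.4]]},
}

utterance = [0, 0, 1]  # a short sequence of discretised acoustic observations
scores = {p: forward(utterance, m["start"], m["trans"], m["emit"])
          for p, m in phone_models.items()}
print(max(scores, key=scores.get), scores)  # the phone with the highest p(U | a_i)
```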

Because of HMMs' ability to solve temporal problems they have become very popular in the area of speech recognition. HMMs are used to find the most probable phones based on an audio signal. Usually each phone is represented by a number of HMMs mapping the relationships to the spoken sentence. Neural networks are a technique today widely used in recognition of text, speech and images. The name neural network stems from the human nervous system, which is the inspiration behind the algorithm. The idea of solving problems using neural networks is to implement intelligent behavior with the ability to learn, just like the human brain. A deep neural network (DNN) is a neural network with more than one hidden layer between the input and output layers. Deep neural networks have only recently started being used in speech recognition, and they have proven to outperform methods using Gaussian mixture models [14][11].

Just like HMM’s, DNN’s are used in speech recognition to map the most likely phones based on an input audio signal. Figure 3 shows the architecture of a neural network containing an input layer, output layer and one hidden layer. When training a neural network the weighted connections between nodes are changed to fit the desired output [11].

Figure 3: A neural network with an input layer (four inputs), one hidden layer and an output layer (single output).
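As a minimal illustration of Figure 3, the sketch below runs a forward pass through a small feed-forward network with one hidden layer. The weights are random placeholders; in a real DNN acoustic model they are learned during training so that the output approximates a distribution over phones.

```python
# Minimal sketch of a forward pass through a small feed-forward network.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input layer (4 features) -> hidden layer
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden layer -> output layer (3 classes)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)         # ReLU activation in the hidden layer
    logits = hidden @ W2 + b2
    return np.exp(logits) / np.exp(logits).sum()  # softmax: a distribution over outputs

x = np.array([0.5, -1.2, 0.3, 0.9])               # a toy feature vector
print(forward(x))                                 # probabilities over the three output classes
```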

In the language model, the Viterbi algorithm is used to produce the templates that best match the uttered phones. The algorithm calculates the probability of all words at the same time and still comes up with the most likely path to the target. Viterbi search is used in speech recognition in order to calculate the probability of the resulting hypotheses (the most likely sentences). However, the Viterbi algorithm compares all possible sentences against each other, which together with a large dictionary creates a tremendous number of potential sentences. Pruning algorithms are used to reduce the number of possible outcomes and speed up the translation time [15][16].
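The sketch below illustrates the idea of Viterbi-style decoding combined with beam pruning over a tiny vocabulary. All probabilities, the vocabulary and the beam width are invented for illustration; a real decoder works over HMM states and a full dictionary.

```python
# Minimal sketch: beam-pruned Viterbi-style search over a toy vocabulary.
import math

states = ["two", "too", "to"]
start_p = {"two": 0.4, "too": 0.3, "to": 0.3}
trans_p = {s: {t: 1.0 / len(states) for t in states} for s in states}  # uniform toy transitions
emit_p = [  # p(acoustic frame | word) for two frames, invented numbers
    {"two": 0.5, "too": 0.4, "to": 0.1},
    {"two": 0.2, "too": 0.3, "to": 0.5},
]

def viterbi_beam(emissions, beam=2):
    # Each hypothesis is (log probability, path); only the `beam` best survive each frame.
    hyps = [(math.log(start_p[s]) + math.log(emissions[0][s]), [s]) for s in states]
    hyps = sorted(hyps, reverse=True)[:beam]
    for frame in emissions[1:]:
        expanded = [
            (lp + math.log(trans_p[path[-1]][s]) + math.log(frame[s]), path + [s])
            for lp, path in hyps
            for s in states
        ]
        hyps = sorted(expanded, reverse=True)[:beam]  # pruning keeps the search tractable
    return hyps

for log_p, path in viterbi_beam(emit_p):
    print(path, round(math.exp(log_p), 4))
```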

2.5 The process of recognizing speech

Using a speech recognition system, a recording of speech is given as input and an output string representing the most probable sentence is expected to match the spoken words. The actual translation is however generally done in four steps: pre-processing, feature extraction, decoding and post-processing, see Figure 4 [17][6].

Figure 4: The processes of speech recognition

1. Pre-processing: In order to translate speech into text, the audio signal is required to have a predefined number of channels and a sampling rate of typically 16 000 or 8 000 Hz (depending on the system). Some speech recognition systems can re-sample the audio signal to the correct format during the pre-processing step. During this step attempts are also made to filter out any parts of the speech signal that do not contain any speaking. This includes silent periods, for example in the beginning or end of an audio file, as well as parts only containing noise.

2. Feature extraction: Because of the huge amounts of data obtained from an audio file, some filtration needs to be done before the actual decoding can start. The data, which is now assumed to be speech, is divided into overlapping frames of about 25 ms (depending on the system). From each of these frames parameters are extracted to represent the audio signal during the next steps. The resulting frames extracted from the data are often referred to as feature vectors.

3. Decoding: During the decoding step the sentences most likely to represent the feature vectors are calculated. Let the feature vectors be Y = y1, y2, ..., yn and let s = s1, s2, ..., sm denote a candidate sequence of words; then the sentence most likely to represent the feature vectors can be calculated using Bayes' rule: s* = argmax_s {p(Y|s) p(s)}. Using hidden Markov models, the most probable phones are calculated. The acoustic model is a requirement for the calculation of phones, since it holds a representation of the HMM states for each phone. Using the Viterbi algorithm (or a similar algorithm, like the forward algorithm), the phones (or triphones, depending on the system) are now used to derive the most likely sequences of words to have been uttered by the speaker. If the dictionary is large, the number of possible sentences is often very large, increasing the translation time. Using a pruning algorithm the number of possible outcomes may be stripped down. Pruning too much may however decrease the accuracy, which is why the choice of algorithm is essential. The result of the Viterbi search is usually a list of the five or ten most likely hypotheses. Figure 5 shows the models being used during the decoding step [10].

4. Post-processing: To improve the accuracy of the translation, the resulting list of top hypotheses from the Viterbi search is evaluated again in the post-processing phase using additional sources of information. Usually higher level language models are used to better take grammar into account when selecting the best sentence [15].

Figure 5: The models used in the decoding phase
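As an illustration of the framing done in the feature extraction step (step 2 above), the sketch below cuts a signal into overlapping 25 ms frames with a 10 ms step. The frame and step lengths are common defaults assumed here, not values prescribed by any particular system.

```python
# Minimal sketch of the framing step in feature extraction.
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, step_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step_len = int(sample_rate * step_ms / 1000)     # samples between frame starts
    frames = []
    for start in range(0, len(signal) - frame_len + 1, step_len):
        frames.append(signal[start:start + frame_len])
    return np.array(frames)

# One second of audio at 16 kHz yields 98 overlapping 400-sample frames.
signal = np.zeros(16000)
print(frame_signal(signal, 16000).shape)  # (98, 400)
```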

3 Evaluation of the system

This chapter concentrates on the evaluation of Pocketsphinx. First the methods of evaluation are described, followed by sections covering the practice and results of the tests. The results are compared to a similar speech recognition system, Google Speech, using the same data to test both systems. The evaluations of the systems cover two aspects: accuracy and translation time. A common way of evaluating the accuracy of a speech recognition system is by measuring its average word error rate (WER) [18][1]. The translation time is the time it takes for the system to receive an audio signal, process it and return the result in the form of a string.

3.1 CMU Sphinx: Pocketsphinx

Pocketsphinx is a speaker-dependent speech recognition system. It is one of a number of speech recognition systems created at Carnegie Mellon University (CMU) [19]. It is the fastest speech recognizer CMU has created and it is designed to work in real time on low-performance platforms. The system uses a probabilistic approach while doing the translations and uses HMMs to translate speech into text. Being speaker-dependent means that the results of a translation can vary greatly depending on the speaker. The system can be trained to adapt to any language or dialect; it does however come with a completely trained model for American English. Pocketsphinx uses a trigram model, which is an N-gram model where N = 3. N-gram models take into account the (N − 1) previous words when deriving a word. This implies that the choice of words is partly based on context [3]. There are a number of complete acoustic models trained for different languages. For the evaluations I have chosen to use the US English model [20], including a dictionary and language model.

3.2 Google Speech

Google Speech is a speaker-independent speech recognition system. Google Speech needs an internet connection to be able to translate speech, since the decoding and computation processes are carried out on Google's own servers. Because of the enormous computing power Google has available, it can take a large number of patterns into account during the translations. For decoding audio signals Google uses deep neural networks with several layers [5].

3.3 Word error rate

Word error rate (WER) is a common way of evaluating the accuracy of a speech recognition system. Calculating WER is a way of measuring the number of errors that occurred during the translation of an audio signal. According to Anusuya and Katti [1], WER is calculated by summing the total number of errors in the hypothesis and dividing it by the total number of words in the correct sentence. An error is an incorrect substitution, deletion or insertion of a word that makes the hypothesis differ from the correct sentence. Equation 3.1 calculates the word error rate.

WER = (substitutions + deletions + insertions) / (total number of words)    (3.1)

The word error rate can be calculated using the Levenshtein distance. It is a measure of the number of changes that must be made in a word sequence for it to be equal to another word sequence. For example, for the two sentences S1 = "how are you doing" and S2 = "what are you doing" to consist of the same words, the word "how" in S1 can be substituted with the word "what". The Levenshtein distance between the two sentences is thus 1 [21]. Equation 3.2 is used to calculate the Levenshtein distance.

Levenshtein distance = substitutions + deletions + insertions (3.2)
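The following Python sketch shows how the word-level Levenshtein distance of equation 3.2 and the WER of equation 3.1 could be computed. It is a straightforward dynamic-programming implementation, not the exact code used for the measurements in this thesis.

```python
# Minimal sketch: word-level Levenshtein distance (eq. 3.2) and WER (eq. 3.1).
def levenshtein(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions between word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference and j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution (or match)
    return dist[-1][-1]

def word_error_rate(reference, hypothesis):
    return levenshtein(reference, hypothesis) / len(reference.split())

print(levenshtein("how are you doing", "what are you doing"))      # 1
print(word_error_rate("how are you doing", "what are you doing"))  # 0.25
```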

3.4 Data collection

The English language consists of 44 phonemes [22]. To test a speech recognition system in a sound way, each phoneme needs to be included in the tests. Common English sentences were chosen to be recorded together with some of the sentences from the text The Rainbow Passage [23], covering all phonemes in the English language. A total of 100 sentences were chosen for the tests, see appendix A.1. When selecting data for recording we chose sentences that differed greatly from each other. According to Hirsch and Pearce [24], a speech recognition system's robustness depends on its ability to handle noise. To test this we artificially added background noise to all of the audio files. The noise added should be a sound likely to occur while using speech recognition. The selection of the noise is very dependent on the situation and on the environment in which the speech recognition system is to be used. I chose to use the sound of a crowd of people.
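The sketch below indicates how background noise could be mixed into a recording, assuming 16-bit mono WAV files with the same sample rate. The file names and the mixing level are placeholders, not the exact procedure used when preparing the test data.

```python
# Minimal sketch of mixing background noise into a recording (assumed 16-bit mono WAV).
import numpy as np
from scipy.io import wavfile

rate, speech = wavfile.read("sentence_01.wav")    # placeholder file names
_, noise = wavfile.read("crowd_noise.wav")

noise = np.resize(noise, speech.shape)            # repeat/cut the noise to the same length
mixed = speech.astype(np.float64) + 0.3 * noise   # 0.3 is an arbitrary noise level
mixed = np.clip(mixed, -32768, 32767).astype(np.int16)

wavfile.write("sentence_01_noisy.wav", rate, mixed)
```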

Table 1 The collected data
            Long   Short   Long w. noise   Short w. noise
Speaker 1   50     50      50              50
Speaker 2   50     50      50              50

Table 1 shows how the test data was distributed before the evaluations. All sentences were to be tested with and without noise. The data was divided into long and short sentences, long sentences covering all word sequences of 7 words or more and short sentences 6 words or less. The sentences were read by two persons so that the tests would not depend on a single speaker. The recordings were carried out in a quiet office environment. All audio files were recorded with the same settings and in the same office so that differences in background noise, internet connection or other factors would not affect the evaluation. The persons recording the audio files were of Swedish origin and non-native English speakers. According to Byrne, Knodt, Khudanpur and Bernstein [25], speech of non-native speakers is a lot harder to recognize than that of native speakers.

3.5 Performing measurements

To test the collected data against the speech recognition systems I created a program in C++. The program iterates over a number of audio files and tests each file on the system of choice. The files were tested by translating the audio signal using one of the speech recognition systems and comparing the returned hypothesis to the correct sentence. By calculating the Levenshtein distance between the word sequences using the Levenshtein algorithm [21], the WER was derived. The time it took to translate each sentence was saved to the output. Both Pocketsphinx and Google Speech were tested using all audio files.
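A simplified outline of such a measurement loop is sketched below in Python (the actual program was written in C++). The transcribe function is a placeholder for a call to the recognizer under test, the file list is illustrative, and levenshtein refers to the word-level distance sketched in section 3.3.

```python
# Minimal sketch of the measurement loop; transcribe() is a placeholder, not a real API.
import time

def transcribe(audio_path):
    """Placeholder for a call to Pocketsphinx or Google Speech."""
    raise NotImplementedError

test_set = [("audio/sentence_01.wav", "please keep this secret")]  # illustrative data

total_words = total_errors = 0
for path, reference in test_set:
    start = time.perf_counter()
    hypothesis = transcribe(path)
    elapsed_ms = (time.perf_counter() - start) * 1000
    errors = levenshtein(reference, hypothesis)   # word-level Levenshtein from section 3.3
    total_words += len(reference.split())
    total_errors += errors
    print(f"{path}: {elapsed_ms:.0f} ms, {errors} errors")

print(f"WER: {total_errors / total_words:.2%}")
```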

3.6 Results

The results of the evaluations are presented in this section in the form of tables. Each table contains data about a set of audio files and how the systems performed during the translations. The parameters presented are the average translation time, the total number of words in each set, the total number of errors and the word error rate. Tables 2 and 3 show the evaluation of all short and long sentences without noise.

Table 2 Short sentences
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    1381.19         462     158      34.19%
Google Speech   2889.45         462     26       5.62%

Table 3 Long sentences
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    2387.18         1024    428      41.79%
Google Speech   4215.75         1024    118      11.52%

Tables 4 and 5 show the evaluation of all short and long sentences with noise.

Table 4 Short sentences with noise
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    2653.33         462     255      55.19%
Google Speech   2962.44         462     42       9.09%

Table 5 Long sentences with noise
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    4177.62         1024    539      52.63%
Google Speech   4495.19         1024    130      12.69%

Tables 6 and 7 show the overall results for each speaker separately, including the sentences with noise.

Table 6 All sentences, only speaker 1
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    2886.865        1486    714      48.04%
Google Speech   3731.84         1486    177      11.91%

Table 7 All sentences, only speaker 2
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    2412.795        1486    666      44.81%
Google Speech   3549.575        1486    139      9.35%

4 Practical implementation: Amigobot

To test Pocketsphinx in a practical way a proof of concept was made. An Amigobot was to be controlled using nothing but the voice. This was implemented using Pocketsphinx for translating spoken commands to text. The robot could for example turn in the direction communicated by the speaker. This chapter describes the implementation of the behaviors and briefly evaluates the use of Pocketsphinx in robotics. An Amigobot is a robot made for learning purposes. It is equipped with 8 sonars, making it possible to recognize walls and other obstacles. The Amigobot has built-in wifi, making it possible to tele-operate the robot and control it from a distance [26].

4.1 Robot Operating System

The Robot Operating System (ROS) is an open source system intended to work as an infrastructure for communication in robotics. Its main functionality is to act as a communication channel between robots or between human and robot. A node can publish information to a topic and all nodes listening (subscribing) to that topic will receive that data [27].

4.2 Implementation and commands

ROS was used for the communication between robot and speaker. Using ROS, the robot's software published information about positions and sonars to topics. The main program, which was created in Python, subscribed to these topics and acted based on the published information. The uttered commands were sent on to the Amigobot by publishing the desired linear and angular speed. Figure 6 shows the nodes and topics used for communication between the robot and the controller software. The uttered commands were interpreted using Pocketsphinx. The robot commands were restricted to contain only 13 different words. For Pocketsphinx it is possible to control the number of words in the phonetic dictionary, and thus probably influence the results of the measurements. To test whether there are any improvements in accuracy and speed with a smaller vocabulary for Pocketsphinx, I chose to restrict the dictionary to contain only the 13 words necessary for the commands to be interpreted. Testing a smaller dictionary is of interest when trying to get a picture of Pocketsphinx's performance and usability.

Figure 6: The nodes and topics used for communication between the Amigobot and controller software
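The sketch below shows how a velocity command could be published over ROS from Python, assuming a geometry_msgs/Twist topic such as /RosAria/cmd_vel; the exact topic name and the speed values depend on the robot driver and are assumptions here.

```python
# Minimal sketch: publishing a velocity command to the robot over ROS.
import rospy
from geometry_msgs.msg import Twist

rospy.init_node("voice_controller")
cmd_pub = rospy.Publisher("/RosAria/cmd_vel", Twist, queue_size=10)  # topic name assumed

def drive(linear, angular):
    """Publish the desired linear (m/s) and angular (rad/s) speed."""
    msg = Twist()
    msg.linear.x = linear
    msg.angular.z = angular
    cmd_pub.publish(msg)

drive(0.2, 0.0)  # e.g. the "robot forward" command: move forward slowly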

Commands were implemented with the purpose of testing the robot in different situations, testing the speech recognition on each utterance. Every command had to start with the keyword "Robot" so that regular talk or noise would not be misinterpreted as a command. Figure 7 shows a table containing all commands implemented in the robot controller program.

Command                         Behavior
Robot forward                   the robot moves forward, stopping at any obstacles
Robot backward                  the robot moves backward, stopping at any obstacles
Robot right/left                the robot turns in the direction of choice
Robot stop/halt                 the robot stops
Robot follow right/left wall    the robot follows a wall, avoiding any obstacles
Robot go home                   the robot moves to the start position
Robot save new home             the robot saves a new start position

Figure 7: The robot commands.
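A minimal sketch of how a recognized hypothesis could be mapped to a command is shown below; only utterances starting with the keyword are accepted. The command-to-speed table and the speed values are illustrative placeholders, not the actual controller code.

```python
# Minimal sketch: mapping a hypothesis to a (linear, angular) speed command.
COMMANDS = {
    "forward":  (0.2, 0.0),
    "backward": (-0.2, 0.0),
    "left":     (0.0, 0.5),
    "right":    (0.0, -0.5),
    "stop":     (0.0, 0.0),
    "halt":     (0.0, 0.0),
}

def handle_hypothesis(text):
    """Return the (linear, angular) speed for a recognized command, or None."""
    words = text.lower().split()
    if not words or words[0] != "robot":
        return None                      # regular talk or noise is ignored
    return COMMANDS.get(" ".join(words[1:]))

print(handle_hypothesis("robot forward"))  # (0.2, 0.0)
print(handle_hypothesis("hello there"))    # None, does not start with the keyword
```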

4.3 Robotic behaviors

A "follow the wall" behavior was implemented so that the robot at command could follow a wall of choice (left or right). The utterance "robot follow right wall" would make the robot follow the right wall, avoiding eventual obstacles. The high level implementation were made using the classical maze solving algorithm "left hand rule". The robot were to keep a moderate distance from the wall, this was accomplished by using sonars to detect distance between the Amigobot and obstacles. If the distance was too great the robot would turn towards the wall and if too small turn away from it. The robot controller program was implemented with the ability to save coordinates during a run. On the command "robot go home" it would follow the path back to the saved start- position. At the command "robot save new home" the robots current position was stored as a new start-position. 15(25)

4.4 Evaluation of Pocketsphinx in robotics

To evaluate the speech recognition systems' practical use in robotics, the word error rate and response time were measured for both systems, using the robot commands (Figure 7) as test data. For the systems to be used effectively when controlling the Amigobot, the accuracy and the time it takes for a command to be interpreted and sent to the robot may be of great importance. The sentences to be recorded for the evaluations were collected from the implemented commands and recorded by two persons with Swedish accents (see appendix A.2). For the translations the American English models [20] were used. Table 8 shows the data collected for the evaluations.

Table 8 The collected data
            Commands   Commands with noise
Speaker 1   11         11
Speaker 2   11         11

When testing the collected data, a program written in C++ was used to iterate over and test each audio file with respect to WER and response time. The WER was calculated using the Levenshtein algorithm [21].

4.5 Results

Tables 9 and 10 present the results of the evaluations. The tables show the average response time, the total number of words, the number of errors and the word error rate for both systems.

Table 9 All sentences
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    145.05          58      3        5.17%
Google Speech   3202.95         58      23       39.65%

Table 10 All sentences with noise
                Avg time (ms)   Words   Errors   WER
Pocketsphinx    166.45          58      5        8.62%
Google Speech   3398.00         58      32       55.17%

Measurements were also made of the time it took for the system to process and send a command to the robot (excluding the time for the translation from speech to text). The results showed that it took less than 10 ms for the system to process and transmit the commands. The great majority of the processing time thus lies in the actual speech translation, which is why I choose to ignore the processing time and focus entirely on the translation time for Pocketsphinx in the discussion.

5 Discussion and conclusions

In this chapter the results from the evaluations are discussed, followed by a section about future work.

5.1 Results of evaluations

Looking at the results in chapter 3, Google Speech had about 10.6% WER on all test data while Pocketsphinx had as much as 46.4%. The risk of getting an error while translating speech with Pocketsphinx is thus about four times higher than with Google Speech. This is a very big difference and implies that, based on the overall test data, Google Speech has better accuracy. Artificially adding background noise to the short sentences increased Pocketsphinx's WER from 34.19% to 55.19% and the average translation time from 1381.19 ms to 2653.33 ms (see tables 2 and 4). Google Speech had a WER of 5.62% on the short sentences and 9.09% on the short sentences with noise. The relative increase in WER was 61.5% for Google Speech, with a total of 16 more errors on the noisy audio files, and 61.4% for Pocketsphinx with 97 more errors. While the relative increase in WER is pretty much the same for both systems, the total number of errors for Pocketsphinx increased drastically compared to Google Speech. This shows that both systems are highly affected by background noise; however, Google Speech seems to handle it better. Looking at the long sentences in tables 3 and 5, Pocketsphinx's WER increased from 41.79% to 52.63% when adding background noise to the audio files. The average translation time increased from 2387.18 ms without noise to 4177.62 ms with noise. For Google Speech the same data resulted in an increase in WER from 11.52% to 12.69%, and the translation time increased from 4215.75 ms to 4495.19 ms. While Google Speech has a longer average translation time both with and without noise, Pocketsphinx seems to be more influenced by the noise, with an increase in translation time of 75%.

5.2 Controlling a robot

Using Pocketsphinx to control an Amigobot worked without any problems. Tables 9 and 10 show that the WER of the translations was significantly lower than in the results from chapter 3. This is probably due to the shortness of the sentences used to control the robot and the limited dictionary of Pocketsphinx. With a WER of 5.17% without noise and 8.62% with noise added, Pocketsphinx has far better accuracy than Google Speech, whose WER was 39.65% and 55.17%. The large degradation in accuracy for Google Speech might be because of the lack of correct grammar in the robot commands.

5.3 Conclusions

While the results strongly suggest that Google Speech has better accuracy overall but is much slower than Pocketsphinx, there are a number of parameters to take into consideration. Pocketsphinx is a speaker-dependent system, which is why the WER probably is affected by the fact that the persons who recorded the audio files are not native English speakers. Parameters like dialect, pitch and voice may affect the results of the translation, and according to Nagórski, Boves and Steeneken [28] the results of speech recognition are also affected by factors like gender and age. Looking at the results from the practical implementation in chapter 4, it is clear that the WER and translation time of Pocketsphinx can be greatly improved just by decreasing the number of words in the dictionary. The main purposes of the thesis were to:

• Describe algorithms, models and the process of speech recognition.

• Compare Pocketsphinx to the similar system Google Speech, taking into account word error rate and translation time.

• By implementing Pocketsphinx in a robot controller software, test its practical usability in robotics, taking into account word error rate and translation time.

Using the standard untrained American English acoustic model and an unmodified dictionary for Pocketsphinx when evaluating the systems, I came to the following conclusions:

• Google Speech has better accuracy than Pocketsphinx when the latter is untrained.

• Google Speech handles noise better than Pocketsphinx, both with respect to accuracy and the effect noise has on the translation time.

• Pocketsphinx is faster than Google Speech.

• Pocketsphinx is better suited for controlling a robot, with respect to accuracy and translation time.

5.4 Future work

While Pocketsphinx was less accurate than Google Speech, Google Speech needs an internet connection to work. This makes it impossible to use in offline applications. The internet connection may be unstable or for other reasons unavailable, which may have a negative impact on translation speed and reliability. There might be ways of improving the results significantly for Pocketsphinx, like scaling down the dictionary or training the system to handle English with other accents than American English. While Google Speech uses the internet to translate words and is much slower than Pocketsphinx, its accuracy is a lot better on long, grammatically correct sentences. Although both systems appear to have some flaws, the area of speech recognition has come a long way. An offline, speaker-independent speech recognition system with fast and reliable translation might not be that far away. Because of the huge impact the size of the phonetic dictionary has on the translation speed, further work could be to find ways of improving the efficiency of the mapping between phones and sentences.

6 Bibliography

[1] M. Anusuya and S. K. Katti, “Speech recognition by machine, a review,” arXiv preprint arXiv:1001.2267, 2010.

[2] C. Sphinx, "Version of decoders." http://cmusphinx.sourceforge.net/wiki/versions, 2012. Accessed 2015-05-11.

[3] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, “Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices,” in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 1, pp. I–I, IEEE, 2006.

[4] Nuance, "Assistive Solutions for Learning and Physical Challenges." http://www.nuance.com/for-individuals/by-solution/accessibility/index.htm. Accessed 2015-05-22.

[5] M. Stenman, Automatic speech recognition: An evaluation of Google Speech. 2007.

[6] R. E. Gruhn, W. Minker, and S. Nakamura, Statistical pronunciation modeling for non-native speech processing. Springer Science & Business Media, 2011.

[7] C. Sphinx, "Basic concepts of speech." http://cmusphinx.sourceforge.net/wiki/tutorialconcepts, 2012. Accessed 2015-05-10.

[8] X. Huang and K.-F. Lee, "On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition," Speech and Audio Processing, IEEE Transactions on, vol. 1, no. 2, pp. 150–157, 1993.

[9] S. Russell and P. Norvig, "Artificial intelligence: a modern approach," vol. 7, pp. 928–935, Pearson Education Limited, 2014.

[10] M. Gales and S. Young, “The application of hidden markov models in speech recogni- tion,” Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, 2008.

[11] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.

[12] F. Ruggeri, R. Kenett, and F. W. Faltin, Encyclopedia of statistics in quality and reliability. John Wiley, 2007.

[13] P. Blunsom, “Hidden markov models,” Lecture notes, August, vol. 15, pp. 18–19, 2004.

[14] P. Zegers, Speech recognition using neural networks. PhD thesis, Citeseer, 1998.

[15] H.-L. Lou, “Implementing the viterbi algorithm,” Signal Processing Magazine, IEEE, vol. 12, no. 5, pp. 42–52, 1995.

[16] W. Liu and H. Weisheng, "Improved viterbi algorithm in continuous speech recognition," in Computer Application and System Modeling (ICCASM), 2010 International Conference on, vol. 7, pp. V7–207, IEEE, 2010.

[17] C. Sphinx, "Adapting the default acoustic model." http://cmusphinx.sourceforge.net/wiki/tutorialadapt, 2012. Accessed 2015-05-10.

[18] A. C. Morris, V. Maier, and P. Green, "From wer and ril to mer and wil: improved evaluation measures for connected speech recognition," in INTERSPEECH, 2004.

[19] CMU, “Carnegie Mellon University.” http://www.cmu.edu/ . Accessed 2015-05-10.

[20] C. Sphinx, "Adapting the default acoustic model." http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/. Accessed 2015-05-22.

[21] K. Beijering, C. Gooskens, and W. Heeringa, "Predicting intelligibility and perceived linguistic distances by means of the levenshtein algorithm," Linguistics in the Netherlands, vol. 15, pp. 13–24, 2008.

[22] M. Yusnita, M. Paulraj, S. Yaacob, S. A. Bakar, A. Saidatul, and A. N. Abdullah, "Phoneme-based or isolated-word modeling speech recognition system? an overview," in Signal Processing and its Applications (CSPA), 2011 IEEE 7th International Colloquium on, pp. 304–309, IEEE, 2011.

[23] B. P. Vinson, Essentials for speech-language pathologists. Cengage Learning, 2001.

[24] H.-G. Hirsch and D. Pearce, "The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in ASR2000 - Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop (ITRW), 2000.

[25] W. Byrne, E. Knodt, S. Khudanpur, and J. Bernstein, "Is automatic speech recognition ready for non-native speech? a data collection effort and initial experiments in modeling conversational hispanic english," Proc. Speech Technology in Language Learning (STiLL), vol. 1, no. 99, p. 8, 1998.

[26] Amigobot, "Amigobot user's guide." http://robots.mobilerobots.com/amigobot/amigofree/AmigoGuide.pdf, 2003. Accessed 2015-05-07.

[27] D. Thomas, "Robot Operating System introduction." http://wiki.ros.org/ROS/Introduction. Accessed 2015-05-11.

[28] A. Nagórski, L. Boves, and H. J. Steeneken, "Optimal selection of speech data for automatic speech recognition systems," in INTERSPEECH, 2002.

Appendices


A Appendix: Data sentences

This appendix contains the sentences recorded for the testing of the speech recognition systems.

A.1 Sentences for testing: Common sentences

When the sunlight strikes raindrops in the atmosphere, they act like a prism and form a rainbow. The rainbow is a division of white light into various beautiful colors. These take the shape of a long round arch, with its path towering above. There is, according to legend, a boiling pot of gold at one end. People look, but no human ever finds it. When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. Throughout the centuries people have explained the rainbow in various ways. Nations have accepted it as a miracle without physical explanation. For certain groups it was taken that there would be no more general floods. The Norsemen considered the rainbow as a bridge over which the Gods passed from earth to their dwelling in the sky. Aristotle thought that the rainbow was caused by reflection of the suns rays by the rain. Please keep this secret. This message doesn’t make sense. I never let anyone else feed my dog. She was washing the dishes. He accepted my present. We appreciate your help. I built this dog house. Many people think that children spend too much time watching TV. We traveled in South America. I am staying at the hotel for the time being. I can’t afford to buy an expensive car. He needs some new clothes. I can’t figure out why you don’t like jazz. I think it’s time for me to move into a smaller home. I’m begging you. The wind blew her hat off. I don’t think it makes him a bad person just because he’s decided he likes to eat horse meat. You look nice with your hair short. I think the time is right to introduce this product.

He broke into a house. Don’t worry about the past. She had a basket full of apples. It was Janet that won first prize. She suggested that I should clean the bathroom. She bought him a car. I will wait here till he comes. Make your airplane reservations early since flights fill up quickly around Christmas. You’ve given me good advice. Keep in touch. I have to do a lot of things. She went to see him the other day. He adopted the orphan. Thousands of dead fish have been found floating in the lake. I just want to let you know that I think you’re the most beautiful woman that I’ve ever seen. Come and see me when you have time. He is nice. I have a lot to do today. She asked him some questions. There’s almost no water in the bucket. We’ll have to camp out if we can’t find a place to stay. Don’t believe what she says. The kettle is boiling. I don’t think I have very good reception. I started feeling this way last Monday. I think it’s unlikely. Don’t climb on this. When I hear that song, I remember my younger days. I didn’t expect you to get here so soon. She beat him to death. It’s about time to start. I’ll buy a new one. I’ll tell you a secret. He can run faster than me. This is the calm before the storm. There is an urgent need for peace talks. He explained in detail what he had seen. Spring has come. I am sure that he is an honest man. He amused us with a funny story. It’s almost time to go to bed. I’m free tonight. I can’t hide the fact from you. For some reason, I’m wide awake. I have to wash my clothes. We ordered pink, but we received blue. You must go. Give me a coffee, please.

He had no choice but to run away. You must study your whole life. It was a terrible day. I expect that Tom will pass the exam. I had to postpone my appointment. She kept her eyes closed. I don’t think that I deserved the punishment I got. Something is wrong with this calculator. You have a very nice car. Education starts at home. It’s up to you to decide. She has a flower in her hand. I know that she has been busy. I’ll stay home. I’ll see you next Wednesday. That’s right. See you tomorrow. Plastic does not burn easily. The books are expensive. I lost my watch. The prisoners were set free. She killed him with a knife.

A.2 Sentences for testing: Robot commands

Robot forward. Robot backward. Robot back. Robot left. Robot right. Robot follow left wall. Robot follow right wall. Robot go home. Robot save new home. Robot stop. Robot halt.