Automatic Speech Recognition Using Limited Vocabulary: a Survey
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.xxxx.DOI

JEAN LOUIS K. E. FENDJI (1) (Member, IEEE), DIANE M. TALA (2), BLAISE O. YENKE (1) (Member, IEEE), and MARCELLIN ATEMKENG (3)

(1) Department of Computer Engineering, University Institute of Technology, University of Ngaoundere, 455 Ngaoundere, Cameroon (e-mail: [email protected], [email protected])
(2) Department of Mathematics and Computer Science, Faculty of Science, University of Ngaoundere, 454 Ngaoundere, Cameroon (e-mail: [email protected])
(3) Department of Mathematics, Rhodes University, Grahamstown 6140, South Africa (e-mail: [email protected])

Corresponding authors: Jean Louis K. E. Fendji (e-mail: lfendji@gmail.com), Marcellin Atemkeng (e-mail: [email protected]).

ABSTRACT Automatic Speech Recognition (ASR) is an active field of research due to its huge number of applications and the proliferation of interfaces and computing devices that can support speech processing. However, the bulk of applications are based on well-resourced languages that overshadow under-resourced ones. Yet ASR represents an undeniable means to promote such languages, especially when designing human-to-human or human-to-machine systems involving illiterate people. One approach to designing an ASR system targeting under-resourced languages is to start with a limited vocabulary. ASR using a limited vocabulary is a subset of the speech recognition problem that focuses on the recognition of a small number of words or sentences. This paper aims to provide a comprehensive view of the mechanisms behind ASR systems as well as techniques, tools, projects, recent contributions, and possible future directions in ASR using a limited vocabulary. This work consequently provides a way forward when designing an ASR system using a limited vocabulary.
Although the emphasis is on limited vocabulary, most of the tools and techniques reported in this survey apply to ASR systems in general.

INDEX TERMS Deep learning, Dataset, Machine Learning, Limited Vocabulary, Speech Recognition.

arXiv:2108.10254v1 [cs.AI] 23 Aug 2021

I. INTRODUCTION
Automatic speech recognition (ASR) is the process and the related technology applied to convert a speech signal into the matching sequence of words or other linguistic entities using algorithms implemented in computing devices [1]. Automatic speech recognition has become an exciting field for many researchers. Nowadays, users prefer to use devices such as computers, smartphones, or any other connected device through speech. Technically, the speech signal can be described by a graphical representation of the frequencies emitted as a function of time [2]. Current speech processing techniques, encompassing speech synthesis, speech processing, and speaker identification or verification, pave the way to create Human to Machine voice interfaces. Automatic speech recognition can be applied in several applications, including voice services [3], program control and data entry [3], avionics [3], and assistance for disabled people [6], [7]. Although ASR can be a plus to ease Human to Machine communication, in some cases it is a must, either to help users with low literacy levels make use of important digital services or to prevent the extinction of under-resourced languages [8]. In fact, the high penetration of communication tools such as smartphones in the developing world [9] and their increasing presence in rural areas [10], [11] provide an unprecedented opportunity to develop voice-based applications that can help to mitigate the low literacy level in those areas. Smartphones offer many advantages over any PC-based interface, such as high mobility and portability, easy recharging of their batteries, and conventional embedded features such as microphones and speakers.

A. MOTIVATION
In regions with low literacy levels, people are used to speaking local languages that are most of the time considered under-resourced languages because of the lack or insufficiency of formal written grammar and vocabulary. Since people do not know how to read or write well-resourced languages (such as English or French), the development of automatic speech recognition systems for under-resourced languages appears as an appealing solution to overcome this limitation. Due to the complexity of the task, a good starting point is to consider a limited vocabulary.

This paper focuses on limited vocabulary in automatic speech recognition to give researchers who wish to work on under-resourced languages an overview of how to develop a speech recognition system for a limited vocabulary. In contrast to limited vocabulary systems, large vocabulary continuous speech recognition (LVCSR) systems are usually trained on thousands of hours of speech and billions of words of text [12]. The development of large vocabulary systems is complex: the larger the vocabulary, the harder it is to handle the learning algorithms, and the larger the number of rules needed to build the dataset. LVCSR systems can be very efficient when they are applied to domains similar to those on which they were trained [13]. However, they are not robust enough to mismatched training and test conditions, since the context may not be well handled. In fact, most of the input can be silence or contain background noise that can be mistaken for speech, which increases the false positive rate [14]. LVCSR systems are not suitable for transfer learning targeting a small or limited vocabulary.

B. POSITION WITH OTHER SURVEYS
Several works have addressed speech recognition using limited vocabulary. Among the existing surveys, the authors of [15] focused on the Portuguese language and its variations. They considered Portuguese an understudied language compared to English, Arabic, and Asian languages. Among Asian languages, Indian languages have received particular attention. Works on the development of ASR systems dealing with Indian languages such as Hindi, Punjabi, and Tamil, to mention just a few, are presented in [16]. Another language-specific survey is provided in [17], where the authors focused on the specificities of the Russian language and on the models applied to develop Russian speech recognition systems in some organizations in Russia and abroad. A broader survey inspecting more than 120 promising works on biometric recognition (including voice) based on deep learning models is provided in [18]. In the latter, the authors presented the strengths and potential of deep learning in different applications. A narrower survey is presented in [19] and highlights the major subjects and improvements made in ASR. A survey focusing specifically on various feature extraction techniques in speech processing is provided in [20]. The work in [21] attempted to provide a comprehensive survey on noise-resistant features as well as similarity measurement, speech enhancement, and speech model compensation in noisy contexts. Due to the increasing penetration rate of mobile devices, the work in [22] investigated different approaches for providing automatic speech recognition technology [...] is found in [8], notwithstanding the fact that many of the issues and approaches presented in the paper apply to speech technology in general. The authors of that survey did not focus on limited vocabulary. Table 1 provides a summary of the recent surveys on ASR.

C. CONTRIBUTIONS
Despite the plethora of surveys, a survey on ASR using limited vocabularies is still missing. Yet ASR using a limited vocabulary is a tremendous opportunity as a starting point for the development of speech recognition systems for under-resourced languages. This survey helps to fill this gap by presenting a summary of works done on ASR for limited vocabulary. For a better understanding, the ASR principle is detailed along with the approach to build ASR systems, to construct datasets, and to evaluate the performance of such systems. Furthermore, closed- and open-source toolkits and frameworks are also presented. Such a study can therefore rapidly and easily assist researchers who want to build speech recognition systems using limited vocabulary. The contributions of this paper are as follows:
• A description of fundamental aspects of ASR.
• A description of tools and processes for creating ASR systems.
• A summary of important contributions in ASR with limited vocabulary.
• Orientations for future works.

The rest of the paper is organised around eleven sections as shown in Figure 1. Section 2 presents the methodology used to conduct this survey. Section 3 provides an understanding of "limited vocabulary". Section 4 describes the principle of speech recognition. Section 5 provides a description of techniques used for automatic speech recognition. Section 6 deals with the management of datasets. Section 7 presents the traditional performance metrics. Section 8 provides an insight into the speech recognition frameworks. Section 9 summarizes works on speech recognition using limited vocabulary. Section 10 discusses possible future directions, followed by a conclusion in Section 11. The list of abbreviations used throughout this work is shown in Table 2.

II. METHODOLOGY
The research methodology is composed of three main steps: the collection of papers, filtering to keep only relevant papers with significant findings, and analysis of the selected papers. The procedures used for collection and filtering are detailed below.