Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review

Wookey Lee 1, Jessica Jiwon Seong 2, Busra Ozlu 3, Bong Sup Shim 3, Azizbek Marakhimov 4 and Suan Lee 5,*

1 Biomedical Science and Engineering & Dept. of Industrial Security Governance & IE, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected]
2 Department of Industrial Security Governance, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected]
3 Biomedical Science and Engineering & Department of Chemical Engineering, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected] (B.O.); [email protected] (B.S.S.)
4 Frontier College, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected]
5 School of Computer Science, Semyung University, Jecheon 27136, Korea
* Correspondence: [email protected]

Abstract: Voice is one of the essential mechanisms for communicating and expressing one's intentions as a human being. Voice inability has several causes, including disease, accident, vocal abuse, medical surgery, ageing, and environmental pollution, and the risk of voice loss continues to increase. Novel approaches to speech recognition and production need to be developed, because voice loss seriously undermines quality of life and sometimes leads to isolation from society. In this review, we survey mouth interface technologies, which are mouth-mounted devices for speech recognition, production, and volitional control, and the corresponding research to develop artificial mouth technologies based on various sensors, including electromyography (EMG), electroencephalography (EEG), electropalatography (EPG), electromagnetic articulography (EMA), permanent magnet articulography (PMA), gyros, images, and 3-axial magnetic sensors, especially with deep learning techniques. In particular, we examine deep learning technologies related to voice recognition, including visual speech recognition and silent speech interfaces, analyze their flow, and systematize them into a taxonomy. Finally, we discuss methods to solve the communication problems of people with speaking disabilities and future research with respect to deep learning components.

Keywords: mouth interface; voice production; artificial larynx; EMG; biosignal; deep learning

Citation: Lee, W.; Seong, J.J.; Ozlu, B.; Shim, B.S.; Marakhimov, A.; Lee, S. Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review. Sensors 2021, 21, 1399. https://doi.org/10.3390/s21041399

Academic Editors: Wai Lok Woo and Biswanath Samanta

Received: 28 December 2020
Accepted: 12 February 2021
Published: 17 February 2021

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Voice is a basic means of communication and social interaction through spoken language. People with voice disorders face serious problems in their daily lives, which may lead to emotional instability and isolation from society. A voice disorder implies that the pitch, intensity, or fluidity of one's voice does not conform to his or her gender, age, body composition, social environment, and geographic location. In other words, the term voice disorder refers to all abnormal conditions in which the expression of the voice is not in its normal range. In the human body, voice is produced by the vibration of the vocal cords as a result of the airflow supplied by the respiratory system; thus, normal voice production depends on the coordination among airflow, laryngeal muscle strength, and the supraglottic resonator cavities, such as the pharyngeal, oral, and nasal cavities [1]. The causes of voice disorders can be categorized as organic, functional, and/or psychogenic. Organic causes may have structural (vocal fold abnormalities, inflammation or trauma to the larynx) and neurologic (recurrent laryngeal nerve paralysis, adductor/abductor spasmodic dysphonia, Parkinson's disease, multiple sclerosis) origins, while functional causes may arise from phonotrauma, muscle tension dysphonia, ventricular phonation, and vocal fatigue. Anxiety and depression are considered psychogenic causes of voice disorder [1].

Although there are several approaches for the treatment of voice disorders, such as PMA using articulatory data captured from the lips and tongue [2] or a contactless silent speech recognition system using an impulse radio ultra-wideband (IR-UWB) radar [3], the communication options are limited for patients whose larynx (voice box) has been surgically removed following throat cancer or trauma [4]. To overcome this problem, an external device called an electrolarynx has been developed as a form of artificial larynx communication, in which an electronic vibration produces a monotonic sound that is then shaped into speech [4]. However, its impractical use and unnatural voice production make the device challenging for patients. Thus, novel concepts must be developed for voice recognition and production technologies, which can also include brain-computer interfaces (BCIs) and silent-speech interfaces (SSIs). An SSI is considered a plausible approach to producing natural-sounding speech by capturing biosignals from the articulators, the neural pathways, or the brain itself in BCIs [5–8]. Recently, various biosignals captured by techniques such as ultrasound, optical imagery, EPG, EEG, and surface electromyography (sEMG) have been investigated for developing silent speech communication systems [8–10].

In this review, as summarized in Figure 1, we report on the recent advances in the field of biosignal-based voice recognition and production with deep learning-based technologies. We first introduce sensor technologies that can acquire voice-related biosignal data, as well as their applications in voice recognition and production. Regarding deep learning, we present not only voice recognition technology through sound or image but also silent speech representation interfaces. We conclude with a summary and perspectives on current and future applications.

Figure 1. How to help people with voice disorders communicate with others: voices are recognized by processing biosignals via various methods and transmitting them to devices suitable for the user; the approach can be used in several fields, such as robotics, medical engineering, and image processing.

2. Biosignal-Based Speech Recognition

Biosignals provide information about the electrical, chemical, mechanical, and biological processes of a living organism [11,12]. In the field of biosignal-based speech recognition, signals from the brain, muscles, and movement are considered potentially useful biosignals that can be measured by different techniques. Traditional acoustic sensors capture the sound pressure waves resulting in acoustic biosignals.
