EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals

Zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften der KIT-Fakultät für Informatik des Karlsruher Instituts für Technologie (KIT)

genehmigte Dissertation

von

Matthias Janke aus Offenburg

Tag der mündlichen Prüfung: 22.04.2016
Erste Gutachterin: Prof. Dr.-Ing. Tanja Schultz
Zweiter Gutachter: Prof. Alan W Black, PhD

Summary

For several years, alternative speech communication techniques have been examined which are based solely on articulatory muscle signals instead of the acoustic speech signal. Since these approaches also work with completely silently articulated speech, several advantages arise: the signal is not corrupted by background noise, bystanders are not disturbed, and people who have lost their voice, e.g. due to an accident or disease, can be assisted.

The general objective of this work is the design, implementation, improvement and evaluation of a system that uses surface electromyographic (EMG) signals and directly synthesizes an audible speech output: EMG-to-speech. The electrical potentials of the articulatory muscles are recorded by small electrodes on the surface of the face and neck. An analysis of these signals allows inferences about the movements of the articulatory apparatus and, in turn, about the spoken speech itself.

One approach for creating an acoustic signal from the EMG signal is the use of techniques from automatic speech recognition. Here, a textual output is produced, which in turn is further processed by a text-to-speech component. However, this approach suffers from challenges in the speech recognition part, such as the restriction to a given vocabulary or recognition errors of the system.

This thesis investigates the possibility of converting the recorded EMG signal directly into a speech signal, without being bound to a limited vocabulary or other limitations of a speech recognition component. Different approaches for the conversion are pursued, and real-time capable systems are implemented, evaluated and compared.

For training a statistical transformation model, the EMG signals and the acoustic speech are captured simultaneously and relevant characteristics are extracted in terms of features. The acoustic speech data is only required as a reference for the training, thus the actual application of the transformation can take place using solely the EMG data. The feature mapping is accomplished by a model that estimates the relationship between muscle activity patterns and speech sound components. From the speech components the final audible voice signal is synthesized. This approach is based on a source-filter model of speech: the fundamental frequency is overlaid with the spectral information (Mel Cepstrum), which reflects the vocal tract, to generate the final speech signal.

To ensure a natural voice output, the usage of the fundamental frequency for prosody generation is of great importance. To bridge the gap between normal speech (with fundamental frequency) and silent speech (no speech signal at all), whispered speech recordings are investigated as an intermediate step. In whispered speech, no fundamental frequency exists; accordingly, prosody has to be generated, which is possible but difficult.

This thesis examines and evaluates the following three mapping methods for feature conversion:

1. Gaussian Mapping: A statistical method that trains a Gaussian mixture model based on the EMG and speech characteristics in order to estimate a mapping from EMG to speech. Originally, this technology stems from the Voice Conversion domain: The voice of a source speaker is transformed so that it sounds like the voice of a target speaker.

2. Unit Selection: The recorded EMG and speech data are segmented into paired units and stored in a database. For the conversion of the EMG signal, the database is searched for the best matching units and the resulting speech segments are concatenated to build the final acoustic output.

3. Deep Neural Networks: Similar to the Gaussian mapping, a statistical relation of the feature vectors is trained, using deep neural networks. These can also contain recurrences to capture additional temporal context.

This thesis examines the proposed approaches for a suitable direct EMG-to-speech transformation, followed by an implementation and optimization of each approach. To analyze the efficiency of these methods, a data corpus was recorded, containing normally spoken as well as silently articulated speech and EMG recordings. While previous EMG and speech recordings were mostly limited to 50 utterances, the recordings performed in the scope of this thesis include sessions with up to 2 hours of parallel EMG and speech data. Based on this data, two of the approaches are applied for the first time to EMG-to-speech transformation. Comparing the obtained results to related EMG-to-speech work shows a relative improvement of 29% with our best performing mapping approach.

Zusammenfassung

For several years, alternative forms of speech communication have been investigated which rely solely on the muscle signals recorded during speaking instead of the acoustic speech signal. Since this approach also works when speech is articulated completely silently, several advantages arise: the signal is not degraded by background noise, bystanders are not disturbed, and people who have lost their voice, for example due to an accident or illness, can be supported. The goal of this work is the development, evaluation and improvement of a system that directly synthesizes an audible speech signal from electromyographic (EMG) signals of the articulatory muscle activity: EMG-to-speech. To this end, the electrical potentials of the articulatory muscles are recorded by surface electrodes attached to the face. An analysis of these signals allows inferences about the movements of the articulatory apparatus and, in turn, about the spoken speech itself. One possible approach for generating acoustic speech from the EMG signal is the use of techniques from automatic speech recognition. Here, a textual output is produced, which is then coupled with a text-based synthesis to make the spoken content audible. However, this procedure is subject to the limitations of the speech recognition system, such as the restricted vocabulary and recognition errors. This thesis addresses the possibility of converting the recorded EMG signal directly into a speech signal, without being bound to a limited vocabulary or other constraints. To this end, different mapping approaches are pursued, systems that are as real-time capable as possible are implemented, and the approaches are subsequently evaluated and compared with each other. To train a statistical transformation model, the EMG signals and the simultaneously produced acoustic speech are first captured and relevant features are extracted. The acoustic speech data is required only as a reference for training, so that the actual application of the transformation can take place exclusively on the EMG data.

The transformation of the features is achieved by building a model that captures the relationship between muscle activity patterns and the sound components of speech. For this purpose, three alternative mapping methods are investigated. From the speech components, the audible speech signal can in turn be synthesized. This is based on a source-filter model: the fundamental frequency is overlaid with the spectral information (Mel Cepstrum), which reflects the vocal tract, to generate the final speech signal.

To ensure a natural speech output, the use of the fundamental frequency for generating prosody is of great importance. Since the characteristics of normal speech (containing a fundamental frequency) and silent speech (no speech signal at all) differ considerably, whispered speech recordings are considered as an intermediate step. In whispered speech no fundamental frequency is present and accordingly prosody generation is necessary, but in contrast to silent speech an acoustic speech signal is available. In this dissertation, the following three mapping methods for feature transformation were developed, implemented, examined in detail and evaluated:

1. Gaussian Mapping: A statistical method that trains a Gaussian mixture model on the EMG and speech features in order to estimate a mapping from EMG to speech. Originally, this technique is used in the field of voice conversion: the voice of a source speaker is transformed so that it sounds like the voice of a target speaker.

2. Unit Selection: The recorded EMG and speech data are segmented into paired units and stored in a database. To convert the EMG signal, the database is searched for the best matching segments and the resulting speech segments are concatenated.

3. Deep Neural Networks: Similar to Gaussian Mapping, a mapping of the feature vectors is trained. Here, a multi-layer neural network is used as the mapping method, which can additionally contain recurrences in order to also capture contextual temporal sequences.

One goal of this dissertation is to examine each of these approaches for its suitability for direct EMG-to-speech transformation and to optimize each of them. In addition, the individual methods are compared with each other both objectively and by means of subjective evaluation. To analyze the performance of these methods, a data corpus was recorded that contains both normally spoken and silently articulated EMG and speech recordings. While previous EMG and speech recordings were mostly limited to 50 utterances, the recordings in this dissertation are extended to up to 2 hours of EMG and speech data. Based on this data, two of the above methods are, to the best of our knowledge, applied to EMG-based transformation for the first time. A comparison of the results of our best system with related EMG-to-speech approaches shows a relative improvement of 29%.

Contents

1 Introduction and Motivation
  1.1 Speech Communication
  1.2 Silent Speech Processing Technologies
  1.3 Contributions of this Thesis
  1.4 Thesis Structure

2 Basic Principles
  2.1 Introduction and Relation to Speech Synthesis
  2.2 Physiological and Anatomical Basics
  2.3 Electromyography
  2.4 Articulatory Muscles
  2.5 Speech Production
    2.5.1 Phonetical Alphabet
    2.5.2 Different Speaking Modes
  2.6 Speech and EMG Signal Encoding and Representation
    2.6.1 Speech Encoding
    2.6.2 Mel Log Spectral Approximation Filter
    2.6.3 EMG Encoding
  2.7 Mathematical/Technical Background
    2.7.1 K-Means algorithm
    2.7.2 Dimensionality Reduction
  2.8 Time Alignment
  2.9 Evaluation Metrics
    2.9.1 Subjective Evaluation
    2.9.2 Objective Evaluation

3 Related Work
  3.1 EMG-based Speech Communication
  3.2 Biosignal-based Speech Interfaces
  3.3 Direct Generation of Speech from Biosignals
    3.3.1 Pitch Generation from EMG
  3.4 Summary

4 Data Corpora and Experimental Setup
  4.1 EMG and Speech Acquisition System
    4.1.1 General Recording Setup
    4.1.2 Single-Electrodes Setup
    4.1.3 Electrode-Array Setup
  4.2 EMG Data Corpora
    4.2.1 EMG-Array-AS-CV Corpus
    4.2.2 EMG-Array-AS-200 Corpus
    4.2.3 EMG-ArraySingle-A-500+ Corpus
  4.3 Whisper Corpus

5 Challenges and Peculiarities of Whispered and EMG-based Speech
  5.1 Relations between speech and EMG
    5.1.1 Delay between EMG and speech signal
  5.2 Relations between different speaking modes
    5.2.1 Whispered Speech
    5.2.2 Silent Speech

6 Mapping using Gaussian Mixture Models
  6.1 Theoretical Background
    6.1.1 General Gaussian Mapping Approach
    6.1.2 Maximum Likelihood Parameter Generation
  6.2 Mapping on Whispered Data
    6.2.1 Speaker-Independent Evaluation
  6.3 Mapping on EMG Data
    6.3.1 EMG-to-MFCC
    6.3.2 EMG-to-F0

7 Mapping using Unit Selection Approach
  7.1 Theoretical Background
  7.2 Cluster-based Unit Selection
    7.2.1 K-Means Codebook Clustering
  7.3 Oracle: Mapping using Identical Source and Target Speech Data
    7.3.1 Unit Evaluation
  7.4 Mapping using Whispered Speech Data
  7.5 Mapping using EMG Data
    7.5.1 Cluster-based EMG Unit Selection

8 Mapping using Deep Neural Networks
  8.1 Theoretical Background
    8.1.1 Feed-Forward Network
    8.1.2 Long Short-Term Memory Network
  8.2 Mapping on Whispered Speech
    8.2.1 Feed Forward DNN
    8.2.2 Long Short-Term Memory Networks
  8.3 Mapping on EMG data
    8.3.1 Feed Forward DNN
    8.3.2 Long Short-Term Memory Networks
    8.3.3 Comparative Summary

9 EMG-to-Speech: Comparative Evaluation
  9.1 Objective Comparison between Mapping Approaches
    9.1.1 Mel Cepstral Distortion Comparison
    9.1.2 Speech Recognizer Evaluation
    9.1.3 Run-Time Evaluation
  9.2 Subjective Quality Comparison: Listening Test
  9.3 Summary

10 Conclusion and Future Work
  10.1 Conclusion
  10.2 Suggestions for Future Work

List of Figures

2.1 General EMG-to-speech structure
2.2 Schematic structure of a skeletal muscle fiber
2.3 Flow process of a single action potential
2.4 Articulatory facial muscles anatomy
2.5 Human throat anatomy with upper human speech production apparatus
2.6 Source-filter model for voiced and unvoiced speech
2.7 International Phonetic Association vowel trapezium
2.8 List of pulmonary consonants
2.9 Exemplary mel filterbank reducing to 12 coefficients
2.10 K-means application example

4.1 Single-electrodes and electrode-array recording setup
4.2 The Graphical User Interface for the data recording
4.3 Electrode positioning of the single-electrodes setup resulting in six EMG channels
4.4 Electrode array positioning and channel numbering
4.5 The investigated three different positionings of the cheek electrode array

5.1 Process flow of spectral subtraction
5.2 Spectral subtraction application on exemplary EMG signals
5.3 Example of audio spectrogram and raw EMG signals
5.4 Example of power spectral densities of whispered/silent/audible EMG from a "good" and "bad" silent speaker
5.5 Exemplary EMG signals from the normally articulated and the silently mouthed utterance "babababa"

6.1 Structure of the proposed EMG-to-speech approach
6.2 Maximum Likelihood Parameter Generation (MLPG) evaluation
6.3 Mel Cepstral Distortions with different numbers of Gaussian mixtures
6.4 EMG-to-MFCC mapping results using Canonical Correlation Analysis (CCA) versus Linear Discriminant Analysis (LDA) preprocessing
6.5 Example of MFCC-based and EMG-based Dynamic Time Warping
6.6 Gaussian Mapping on Silent Speech Data from the EMG-Array-AS-200 corpus
6.7 Voiced/Unvoiced accuracies for EMG-to-F0 Gaussian Mapping
6.8 Correlation Coefficients for EMG-to-F0 Gaussian Mapping
6.9 Mean results from the naturalness listening test on EMG-to-F0 output
6.10 Preference comparison results from the naturalness listening test on EMG-to-F0 output

7.1 Structure of the proposed EMG-to-speech approach
7.2 A basic Unit, consisting of source EMG and target MFCC segments used for the proposed Unit Selection approach
7.3 Example of test sequence splitting in Unit Selection
7.4 Illustration of combination of target and concatenation cost during Unit Selection
7.5 Illustration of the search for the optimal unit sequence
7.6 Creating the output sequence from the chosen audio segments
7.7 Creation of cluster units from basic units
7.8 Unit size and unit shift variation and its influence on the mel cepstral distortion
7.9 EMG-to-speech Unit Selection MCDs with Aud-to-Aud mapping reference
7.10 EMG-to-F0 Unit Selection evaluation results
7.11 Mel Cepstral Distortions for Unit Selection mapping using Unit Selection baseline and different cluster unit counts
7.12 Mel Cepstral Distortions for Unit Selection mapping with and without Unit Clustering
7.13 F0 evaluation for Unit Selection mapping with and without Unit Clustering

8.1 Structure of the proposed EMG-to-speech approach
8.2 Structure of the neural network used to convert input features to target Mel Frequency Cepstral Coefficients on a frame-by-frame basis
8.3 Structure of a long short-term memory network
8.4 Voicedness results of the DNN-based Whisper-to-Speech mapping
8.5 Voicedness results of the LSTM-based Whisper-to-Speech mapping
8.6 Mel Cepstral Distortions for DNN-based mapping using different EMG feature sets
8.7 Mel Cepstral Distortions for EMG-to-MFCC DNN mapping results with and without Independent Component Analysis preprocessing
8.8 EMG-to-F0 DNN mapping results with and without Independent Component Analysis preprocessing
8.9 Mel Cepstral Distortions using feed-forward DNNs with the EMG-Array-AS-200 corpus, including silent EMG data
8.10 Mel Cepstral Distortions for LSTM-based mapping using different EMG feature sets
8.11 Mel Cepstral Distortions for LSTM-based Multitask Learning
8.12 EMG-to-F0 LSTM mapping results
8.13 Mel Cepstral Distortions using LSTMs with the EMG-Array-AS-200 corpus, including silent EMG data

9.1 Mel Cepstral Distortion based comparison of all investigated EMG-to-speech mapping approaches
9.2 Speech recognizer based comparison of all investigated EMG-to-speech mapping approaches
9.3 Listening test score comparison of all investigated EMG-to-speech mapping approaches
9.4 Listening test preference comparison of all investigated EMG-to-speech mapping approaches
9.5 Example spectrograms of DNN-based EMG-to-speech output
9.6 Listening test score comparison of all investigated EMG-to-speech mapping approaches per session

List of Tables

4.1 Details of the recorded EMG-Array-AS-200 data corpus, consisting of normally and silently articulated speech data
4.2 Details of the recorded EMG-ArraySingle-A-500+ data corpus, consisting of normally articulated speech data with at least 500 utterances
4.3 Details of the recorded WA-200 data corpus, consisting of normally articulated and whispered speech data

5.1 Baseline MCDs between whispered and normal speech of the WA-200 corpus

6.1 Evaluation results between whispered and normal speech
6.2 Whispered speech speaker-dependent versus speaker-independent with adaptation results

7.1 Aud-to-Aud oracle experiment (using acoustic data for codebook creation) results
7.2 Unit Selection evaluation results between whispered and normal speech

8.1 Whisper-to-Speech feed-forward DNN mapping results
8.2 Whisper-to-Speech LSTM mapping results
8.3 Number of used channels and source feature dimensionality per session
8.4 Comparative summary of LSTM and DNN experiment results

9.1 Run-time comparison of Unit Selection based EMG-to-speech mapping approaches
9.2 Run-time comparison of DNN, LSTM and Gaussian Mapping EMG-to-speech approaches

Chapter 1

Introduction and Motivation

This chapter serves as a motivation for this dissertation: it introduces the field of Silent Speech Interfaces and motivates our approach that directly transforms myographic signals into a speech output, entitled EMG-to-speech. The contributions are summarized and highlighted, followed by an outline of the structure of this thesis.

1.1 Speech Communication

Speech communication is the most sophisticated form of human interaction. During the last decades it has received much attention as a means of easing human-computer interaction, due to its efficiency, naturalness and richness in expressing information. Today's widely known speech-based systems rely on acoustic data, i.e. the speech is transmitted over air. However, there exist several scenarios where alternative speech signals are desired:

• Adverse noise conditions: High ambient background noise makes the usage of an acoustic speech signal challenging or even impossible.

• Complementing acoustic speech: The addition of a second speech modality can improve a speech processing system.

• Silent speech communication: Confidential information like passwords or PINs should not be overheard by bystanders.

In general, disturbing the environment with one's voice is not always wanted. Meetings, concerts or libraries are just three of many noise-sensitive scenarios. Therefore, a private, confidential and non-obtrusive form of communication is needed.

• Speech rehabilitation: Help for people who suffer from speech disabilities or have even lost their voice. According to the National Institute on Deafness and Other Communication Disorders (NIDCD), approximately 7.5 million people in the United States have trouble using their voice [ND16]. Many of these people could be assisted and could interact using common speech communication technologies.

1.2 Silent Speech Processing Technologies

Some of the above mentioned challenges can be alleviated by Silent Speech Interfaces (SSI) [DSH+10], i.e. systems that do not rely on an acoustic signal as input. The first approaches stem from the 1980s and are either based on visual lip-reading to enhance speech recognition in noisy environments [Pet84], or they rely on facial muscle signals to classify simple vowels [ST85]. Different types of silent speech interfaces have evolved over the last decades (see Chapter 3 for details) and are actively pursued. In general, silent speech technologies can be grouped into three categories:

• Capture of very quiet speech signals, e.g. by bone-conduction or stethoscopic microphones, which require that at least a small sound is produced. These technologies address the same problems as true silent speech interfaces and are therefore described in the related work in Chapter 3.

• Capture of vocal tract or articulator configurations, e.g. by surface electromyography (EMG), or by visual or ultrasound imaging. This allows speech information to be processed even when no sound is produced.

• Direct interpretation of brain signals that are related to speech production.

To generate a speech output, two different approaches exist. One is the use of a speech recognition system to transform speech to text, followed by a text-to-speech synthesis system. The second approach uses a direct feature mapping, which yields several benefits:

• No restriction to a given phone set, vocabulary or even language.

• Opportunity to retain paralinguistic features, like prosody or speaking rate.

• Preservation of a speaker’s voice and other speaker-dependent factors.

• Direct mapping enables faster processing than speech-to-text followed by text-to-speech and thus the ability for near-real-time computation.

1.3 Contributions of this Thesis

In this work we introduce a system that directly converts facial EMG signals into an audible speech output. The capture of simultaneous EMG and audio data is followed by a signal processing step to generate relevant features. The next step contains the mapping from EMG features into acoustic representations; the main focus of this thesis is to investigate this direct transformation from EMG to acoustics. The process is concluded by the vocoding step, the creation of the final speech waveforms from the input acoustic representations.

The main contributions of this thesis are as follows:

• Applying three different state-of-the-art machine learning approaches for direct speech synthesis based on surface electromyographic signals (EMG-to-speech): a statistical transformation based on Gaussian mixture models entitled Gaussian Mapping, a database-driven approach known as Unit Selection and a feature mapping based on Deep Neural Networks.

• Generating the fundamental frequency of speech from electromyographic signals (EMG-to-F0), enabling natural-sounding prosody that is based solely on the articulatory muscle movements.

• Whisper-to-audible speech mapping: To demonstrate the generation of prosodic information, we investigate the transformation of whispered speech into normal audible speech.

• No need for phone-level time alignments (labels): While comparable electromyography-based approaches need phonetic information in the processing chain, which introduces an additional source of errors, this work establishes a label-free approach.

• Near-real-time processing: No complex decoding computations are needed, and thus we enable fast generation of the speech output, which could be used for acoustic feedback.

1.4 Thesis Structure

This thesis is structured into the following chapters. Chapter 2 introduces the basic principles that are necessary to understand this work. Chapter 3 gives an overview of the most important related work, followed by an introduction to the experimental setup and the data corpora used in Chapter 4. The challenges and peculiarities of the recorded and analyzed data are discussed in Chapter 5, while the different EMG-to-speech mapping approaches are presented and evaluated in Chapters 6, 7 and 8. Final comparisons and evaluations are discussed in Chapter 9, while Chapter 10 concludes this thesis.

Chapter 2

Basic Principles

This chapter introduces the basic principles as well as the scientific and technical foundations that are necessary for the understanding of speech and surface electromyography (EMG) in the context of EMG-to-speech transformation. The chapter begins with a brief introduction to speech synthesis that clarifies the basic EMG-to-speech components that will be described. In addition, the physiological and anatomical basics of speech production and EMG are introduced, followed by signal processing details. This is complemented by the mathematical basics that are relevant for this thesis.

2.1 Introduction and Relation to Speech Syn- thesis

The artificial generation of acoustic speech is known as speech synthesis. Thus, we start with a short introduction to the history and development of speech synthesis systems that are relevant for the scope of this thesis. More detailed reviews can be found in e.g. [Kla87, Tay09]. Much research has been done in the field of Text-to-Speech (TTS): the production of audible speech from written text, sometimes considered the counterpart of speech recognition.

There are many applications for speech synthesis systems: screen readers, assistive devices, non-visual human-computer interfaces, etc. But recreating human speech is a desire that was present long before the advent of state-of-the-art computer systems. One of the most prominent early speech-like generation systems [Von91] was built by Wolfgang von Kempelen in 1791 and consisted mainly of three components: a pressure chamber (bellows), a vibrating reed and a leather tube. Although this acoustic-mechanical speech machine was more like an instrument, it could produce phones and even words that sounded like human speech.

At the beginning and middle of the 20th century, researchers started using electrical methods to synthesize speech. Among the first electrical speech synthesis systems, Stewart introduced his work in 1922, entitled "Electrical Analogue of the Vocal Organs" [Ste22]: a buzzer acts as the excitation and two resonant circuits model the vocal tract. Another prominent synthesizer was the VODER (Voice Operating Demonstrator) [DRW39] by Homer Dudley, which was successfully demonstrated at the 1939 World Fair. As computers became widely available to speech synthesis researchers, other approaches were investigated that were not previously possible, e.g. an "analysis-by-synthesis" approach [Bel61] inspired by the theory of speech production. In the 1980s the Klattalk system [Kla82] became popular and the first commercial synthesis systems became available.

In general, a speech synthesis approach has two components:

1. a (typically language-specific) frontend that creates a linguistic intermediate representation (similar to a transcription),

2. a backend/waveform generation that creates the speech output based on the linguistic specification.

In a state-of-the-art text-to-speech system, the frontend starts with text preprocessing, where non-alphabetic characters (especially numbers) are parsed into a suitable representation. These representations are further processed to obtain correct pronunciations. This is followed by the generation of a meaningful prosody, e.g. duration or pitch. All these steps can be interconnected, e.g. when a question mark influences the prosody of a sentence.

Today's speech synthesis approaches can be grouped into two basic categories:

1. Exemplar-based systems: Segments of target (pre-recorded) speech representations are stored in a database (known as a codebook) and are used to concatenate the final output.

2. Model-based (statistical parametric) systems: Parametric, since the built model uses (speech) parameters; statistical refers to the fact that distributions are learned from the training data (e.g. by probability density functions). The model-based approach can be subdivided into two categories:

   (a) Articulatory approach: The human speech production system is modeled directly.

   (b) Signal approach: This tries to mimic the speech signal and its properties itself, using some of the known perceptual speech characteristics, e.g. linear predictive coding (see Section 2.6.1).

A term that is closely related to speech synthesis is voice conversion, also known as voice transformation. This process covers approaches that transform the speech signal of a source speaker into the voice characteristics of a target speaker and was introduced in the mid 1980s [CYW85]. A possible application is its usage in a TTS system. Instead of recording multiple hours of training data (plus the labeling effort) for a synthesis database, one could add a new synthesis voice by applying a voice conversion system, where usually a much smaller amount of training data is required. It is estimated that 50 speaker utterances can be sufficient for building a voice conversion system [MBB+07]. However, the source and target speaker characteristics can differ in numerous ways, e.g. acoustic differences (male versus female speaker) or prosodic differences (different pitches/volumes).

Training a voice conversion system can be done in a text-dependent or text-independent way. While the first case uses the same text data for source and target speaker and thus only needs a time alignment, the text-independent fashion needs a more sophisticated preparation. One example is the segmentation into a feature space that can be further clustered into similar groups and used for transformation [DEP+06].

The separation into a frontend and a backend component can also be applied to voice conversion systems. The frontend covers the extraction of appropriate features, which are used for a transformation model that is estimated during a training phase. Different types of transformation techniques exist, e.g. based on Vector Quantization [ANSK88], neural networks [NMRY95] or Gaussian mixture models [SCM98, KM98, TBT07]. The latter approach will be investigated in Chapter 6. The backend is the element that converts the transformed speech characteristics into the final speech sample output.
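To make the cited GMM-based transformation idea concrete, the following minimal sketch fits a Gaussian mixture model on joint source/target feature vectors and converts a source frame to the conditional expectation E[y|x], the usual minimum mean-square-error mapping rule. All data, dimensionalities and the number of mixture components are made up for illustration; this is a generic voice-conversion-style sketch, not the system evaluated in Chapter 6.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical time-aligned parallel features: source frames X and target frames Y.
n_frames, dim = 2000, 8
X = rng.normal(size=(n_frames, dim))
Y = (X @ rng.normal(size=(dim, dim))) * 0.3 + 0.1 * rng.normal(size=(n_frames, dim))

# Fit a GMM on the joint vectors z = [x; y].
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))

def convert(x, gmm, dx):
    """Map a single source frame x to E[y | x] under the joint GMM."""
    mu_x = gmm.means_[:, :dx]
    mu_y = gmm.means_[:, dx:]
    sig_xx = gmm.covariances_[:, :dx, :dx]
    sig_yx = gmm.covariances_[:, dx:, :dx]

    # Component responsibilities p(m | x) from the marginal model over x.
    log_p = np.array([
        multivariate_normal.logpdf(x, mean=mu_x[m], cov=sig_xx[m])
        for m in range(gmm.n_components)
    ]) + np.log(gmm.weights_)
    w = np.exp(log_p - log_p.max())
    w /= w.sum()

    # Weighted sum of the per-component conditional means.
    y_hat = np.zeros(gmm.means_.shape[1] - dx)
    for m in range(gmm.n_components):
        cond = mu_y[m] + sig_yx[m] @ np.linalg.solve(sig_xx[m], x - mu_x[m])
        y_hat += w[m] * cond
    return y_hat

y_pred = convert(X[0], gmm, dx=dim)
print(y_pred.shape)  # (8,)
```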

Figure 2.1 – General structure of the proposed EMG-to-speech mapping approach investigated in this thesis. Green arrows represent data flow during training, blue arrows during application of the mapping.

Figure 2.1 shows the general framework of the EMG-to-speech approach in this thesis, which is inspired by the described techniques and can likewise be divided into a frontend and a backend component. Our system setup uses facial EMG signals and directly synthesizes an audible speech output. The capture of simultaneously recorded EMG and audio data is followed by the signal processing step that generates the representing features. The next step is the main part of this thesis, the direct transformation from EMG features into acoustic representations. Finally, the vocoding process creates the final speech waveforms from the acoustic representations. After obtaining the final speech output, the question arises how good the acoustic output actually is; multiple ways of evaluation exist. The theoretical underpinnings of the employed techniques and processes are described in the following sections.
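The training-versus-application data flow of Figure 2.1 can be summarized in a short skeleton. Feature extraction and vocoding are reduced to trivial placeholders, and the frame-wise mapping model is a plain linear regression standing in for the actual mapping approaches of Chapters 6 to 8; everything here is an illustrative sketch, not the system implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def emg_features(emg_frames):
    # Placeholder: in reality, time-domain/spectral EMG features per frame.
    return emg_frames

def speech_features(audio_frames):
    # Placeholder: in reality, MFCC-like features plus F0 per frame.
    return audio_frames

def vocode(acoustic_features):
    # Placeholder for the MLSA-style vocoder producing a waveform.
    return acoustic_features.ravel()

# --- Training: parallel EMG and audio recordings (green path in Figure 2.1) ---
emg_train, audio_train = rng.normal(size=(500, 12)), rng.normal(size=(500, 25))
mapping = LinearRegression().fit(emg_features(emg_train), speech_features(audio_train))

# --- Application: EMG only (blue path in Figure 2.1) ---
emg_test = rng.normal(size=(100, 12))
predicted_acoustics = mapping.predict(emg_features(emg_test))
waveform = vocode(predicted_acoustics)
print(waveform.shape)
```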

2.2 Physiological and Anatomical Basics

Human motion is the result of muscle contraction, which in turn is initiated by an electrical stimulus triggered by the brain. Muscles are the executional units in every motoric process. They can be divided into three types: skeletal, cardiac and smooth muscles. The latter two contract only involuntarily and are therefore of no interest to this work. Skeletal muscles are composed of muscle fibers which are surrounded by a plasma membrane called the sarcolemma. The muscle fibers are built from a multitude of filamentary structures, called myofibrils. These consist of two types of filaments arranged in parallel, myosin filaments and actin filaments, repeated in basic functional units called sarcomeres. Figure 2.2 shows the schematic structure of a skeletal muscle fiber.

Figure 2.2 – Schematic depiction of the structure of a skeletal muscle fiber, from [Col13].

The actin filaments slide along the myosin filaments, resulting in a muscle contraction. The innervation process is activated by at least one motor neuron in the spinal cord. A motor neuron and the muscle fibers it innervates are called a motor unit. A single motor neuron is able to innervate a multitude of muscle fibers. The motor neuron is connected to the muscle fiber membrane by a synapse, the so-called motor end-plate. During activation of the end-plate, transmitter substances are released, which eventually lead to an action potential that is propagated along the muscle fiber from its origin and causes muscle contraction.

Figure 2.3 – Flow process of a single action potential, from [Col13].

A motor unit action potential (MUAP) is defined as the spatial and temporal summation of the electrical activity of the muscle fibers belonging to one motor unit.

The excitability of muscle fibers can be described by a model of the semi-permeable membrane that characterizes the electrical properties of the muscle fiber membrane. An ion imbalance between the interior and exterior of the muscle cell generates a resting potential (approximately −70 mV in the uncontracted state). This potential difference is maintained by active physiological processes (ion pumps), resulting in a negative charge of the cell interior. The emergence of the electrical pulse is shown in Figure 2.3. When an external stimulus is applied, it results in a depolarization of the muscle fiber membrane (reaching +30 mV), which is immediately restituted by an active compensatory ion backflow (repolarization). The short period of overshooting is defined as hyperpolarization.

According to [Kra07] a muscle contraction can be summarized in three steps:

• Excitable cells react to a stimulus by changing their electrical membrane properties (ion conductivity).

• Resting potential → action potential.

• The action potential leads to muscle contraction.

2.3 Electromyography

While only the basic EMG fundamentals are described here, we refer to e.g. [BdL85] for further details. In general, electromyography (EMG) is the study of muscle function through the inquiry of the electrical signal emanating from the muscles [BdL85]. These electrical signals refer to the previously described motor unit action potentials (MUAPs), which stem from several motor units in accordance with muscle contraction. A sequence of MUAPs is defined as a motor unit action potential train (MUAPT). The captured EMG signal consists of a summation of several MUAPTs that can be detected in the vicinity of the recording sensor; thus they can originate from different motor units and even different muscle fibers. Exemplary EMG signals are discussed in Chapter 5.

According to [MP13] the intensity of muscle force depends mainly on two factors:

• The number of motor units that are recruited, and

• the rate at which they fire.

This electrical signal is captured by sensors called electrodes, which transform the ionic potentials of the muscles into electric currents that can be further processed by technical devices. In principle, there exist two types of electrodes: 1) needle or fine-wire electrodes that are directly inserted into the muscle, and 2) surface electrodes that non-invasively measure the electrical activity on the surface of the skin. For the purpose of human-computer interaction in non-medical setups, only surface electrodes are suitable, and thus the EMG data reported in this dissertation is captured by surface EMG electrodes. A surface electrode consists of a metal surface (e.g. silver) that is applied to the skin. In this skin-electrode interface, the current on the moderately conductive tissue side is carried by ions, while inside the highly conductive metal the current is carried by electrons. This interface is intrinsically noisy [GB75] and thus highly affected by skin preparation. To improve and stabilize the contact, a conductive gel is applied to the skin-electrode interface. This electrolyte gel contains free chloride ions such that charge can be carried through the electrolyte. The electrolyte can therefore be considered as conductive for ion current as the human tissue. "Dry" electrodes (without gel), capacitive electrodes [KHM05] and ceramic electrodes [GSF+95] have also been investigated, but this thesis only reports on standard wet surface Ag/AgCl electrodes.

The recorded EMG signal always represents the potential difference between two electrodes, which gives two possible ways of derivation:

Monopolar derivation: Both electrodes are placed on the same muscle fiber.

Bipolar derivation: The neuromuscular activation is derived between one electrode placed on an electrically inactive region and one electrode placed on the active muscle fiber.

The latter configuration is less sensitive to artifacts, since disturbances are present at both electrodes and are suppressed by calculating the difference between both sensors. The amplitude of the recorded electromyographic signal is highly dependent on the kind of muscle and varies from the µV to the low mV range [BdL85]. The primary energy in the EMG signal lies between 10 and 200 Hz, depending on the skin properties and the muscle-electrode distance [Hay60]. Its measurement depends on various factors, such as the skin properties – the conductance of human skin varies with type and thickness of tissue, physiological tissue changes and body temperature – or the types of electrodes and amplifiers. Different unwanted noise components can be observed in the recorded EMG signal, regarded as biological artifacts and technical artifacts.

Biological artifacts

Physiological crosstalk: Bioelectric signals that stem from multiple sources inside the human body, e.g. neighboring muscles, can affect the signal and are summarized under the term "crosstalk".

Motion artifacts: During speaking and movement, low-frequency interferences and artifacts can arise, which result from electrode movement.

Technical artifacts

External interferences: Depending on the environment and the electromagnetic influence of other electronic devices, interferences such as power line noise may exist.

Transducer artifacts: Various influences and artifacts can arise from the recording device, like amplification noise or bad electrode contact.
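To make the practical handling of such interferences concrete, the following sketch shows a generic surface-EMG preprocessing step: a band-pass roughly covering the main signal energy and a notch filter against power line hum. The sampling rate, cutoff frequencies, the 50 Hz mains frequency and the synthetic signal are illustrative assumptions, not the preprocessing actually used in this thesis.

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt

fs = 2000.0                                 # assumed EMG sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
rng = np.random.default_rng(1)
# Synthetic "EMG": broadband activity plus 50 Hz power line hum and slow motion drift.
emg = (rng.normal(scale=0.1, size=t.size)
       + 0.5 * np.sin(2 * np.pi * 50 * t)
       + 0.3 * np.sin(2 * np.pi * 1 * t))

# Band-pass around the dominant EMG energy (illustrative 10-200 Hz band).
b_bp, a_bp = butter(4, [10, 200], btype="bandpass", fs=fs)
emg_bp = filtfilt(b_bp, a_bp, emg)

# Notch filter at the 50 Hz mains frequency to suppress power line noise.
b_n, a_n = iirnotch(w0=50.0, Q=30.0, fs=fs)
emg_clean = filtfilt(b_n, a_n, emg_bp)

print(emg_clean.shape)  # (2000,)
```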

2.4 Articulatory Muscles

All sounds of human speech are the result of muscle contractions. While Section 2.5 gives the details of the production process, this section presents a short overview of the facial muscles that are involved in the articulation process, more precisely the muscles that change the shape of the vocal tract. Figure 2.4 shows the articulatory muscles that are relevant for speech production:

Tongue (not shown in Figure 2.4) is a compound of intrinsic and extrinsic muscles. It occupies most of the oral cavity and has a major influence on vowel articulation. In speech production (see Section 2.5), vowels are characterized by the position of the tongue.

Orbicularis oris a circular muscle that contracts to move the lips.

Levator anguli oris/Levator labii superioris pulls the upper lip upwards.

Zygomaticus major and minor retract the corners of the mouth superi- orly and posteriorly, e.g. used for smiling.

Depressor anguli oris pulls the corners of the mouth downwards, e.g. used when frowning.

Depressor labii inferioris lowers the bottom lip.

Risorius retracts the lip corners.

Mentalis protrudes the lower lip and its contraction results in a forward movement of the lower lip.

Buccinator compresses the cheek and pulls back the angle of the mouth.

Platysma a large muscle that lies under the jaw and down the neck to the upper chest. It depresses the lower jaw and helps to lower the lips.

Masseter a powerful muscle that closes the jaw, e.g. used for mastication.

Since surface electrodes capture signals from multiple muscles, we cannot attribute the recorded signals to one particular muscle. In Chapter 4 we introduce our electrode setup, which uses the information on the positioning of the relevant facial muscles.


Figure 2.4 – Lateral head anatomy showing the articulatory facial muscles, adapted from [LJ16a]

2.5 Speech Production

Acoustic speech can be regarded as the result of an air-pressure excitation that is modulated by one or more sources and is emitted through a speaker's mouth or nostrils. The main components of the human speech production system are the lungs, trachea, larynx, throat, nasal cavity and oral cavity. The latter three represent the human vocal tract, which shapes the fundamental excitation. As illustrated in Figure 2.5, the anatomy of the human speech production system can be described as follows, in order of the airstream flow [HAH01]:

Lungs: Source of the exciting air pressure during the speech process.

Larynx: The larynx (or voice box) contains the vocal cords (vocal folds), which are two folds of tissue that can be opened or closed to block the air flow from the lungs. They are open during breathing. During speech they can be held close together, which results in a vibration that is excited by the pulmonary airflow. The oscillation of the vocal cords implies a voiced sound, thus a quasi-periodic air pressure wave is produced. One important factor in this process is the fundamental frequency, i.e. the number of cycles per second at which the vocal cords vibrate. Influenced by the length, size and tension of the vocal cords, the fundamental frequency ranges from about 60 Hz for men up to 300 Hz for women and children. A lack of vibration of the vocal cords (i.e. when they are too tense to vibrate periodically) implies an unvoiced sound. The glottis is defined as the location where the vocal cords come together.

Pharynx: The part of the throat that lies between the nasal/oral cavity and the larynx (8. in Figure 2.5).

Oral cavity: Usually simply called the "mouth".

Nasal cavity: The nasal tract, above the oral cavity (14. in Figure 2.5).

Figure 2.5 – Human throat anatomy (left), from [Wik16], and anatomical diagram of the upper human speech production apparatus (right), from [LJ16b]. 1. Lower lip, 2. Upper lip, 3. Teeth, 4. Alveolar ridge, 5. Hard palate, 6. Velum, 7. Uvula, 8. Pharynx, 9. Tip of the tongue, 10. Front tongue, 11. Middle tongue, 12. Back tongue, 13. Epiglottis, 14. Nasal cavity.

The positions shown in the human vocal tract are used to define the point of contact where an obstruction occurs during speech articulation, known as the place of articulation. Together with the place of articulation, different sounds in human speech can be characterized by the configuration of the articulators, called the manner of articulation. Individual manners are:

Plosive/Stop - an oral occlusive sound, i.e. a sound where the airflow stops completely before resuming again (e.g. the p in "pass").

Nasal - a sound where the oral cavity is blocked and air flows primarily through the nasal cavity (e.g. the n in "nose").

Trill - a sound resulting from the repeated opening and closing of the vocal tract, such as a "rolled r".

Flap/Tap - a momentary closure of the oral cavity (e.g. the t in butter in American English).

Fricative - a sound resulting from turbulent airflow at a place of articulation due to partial obstruction (e.g. the f in "fricative").

Affricate - a stop changing into a fricative (e.g. the j in "jam").

Approximant - a sound where there is very little obstruction (e.g. the y in "yes").

Lateral - an approximant with airflow around the sides of the tongue (e.g. the l in "lateral").

One fundamental distinction of human speech sounds - called phones - is the division into two classes: consonants, which are articulated in the presence of constrictions in the throat or obstructions in the oral cavity, and vowels, which are articulated without significant constrictions and obstructions. However, different placements of the tongue and the oral cavity give each vowel its distinct character by changing the resonances. These major resonances are defined as the first (F1) and second (F2) formants. The relationship of both can be used to characterize different vowels.

The process of human speech generation can be described using a source-filter model, depicted in Figure 2.6. The sound excitation signal coming from the lungs can either be voiced or unvoiced, depending on the vocal cords. This reflects the source of the sound, while the vocal tract modulates this excitation source and thereby assumes the role of a filter. The source excitation signal is often modeled as a periodic impulse train for voiced speech and as white noise for unvoiced speech. Convolution of the excitation signal with the filter H[z] then produces the synthesized speech signal s(n). The filter component can be estimated using techniques like linear predictive coding or cepstral analysis, which will be discussed in Section 2.6. This separation of source and filter brings high flexibility in separating the pitch of the source from the filter component.

Figure 2.6 – Source-filter model for voiced and unvoiced speech.
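The source-filter idea can be illustrated in a few lines of code: a periodic impulse train (voiced) or white noise (unvoiced) is passed through an all-pole filter standing in for the vocal tract. The pitch and filter coefficients below are arbitrary illustrative values, not estimated from real speech.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sampling rate in Hz
f0 = 120                        # illustrative fundamental frequency of the voiced source
n = fs // 2                     # half a second of samples

# Voiced excitation: periodic impulse train at f0; unvoiced excitation: white noise.
voiced = np.zeros(n)
voiced[:: fs // f0] = 1.0
unvoiced = np.random.default_rng(0).normal(scale=0.1, size=n)

# A simple all-pole "vocal tract" filter H(z) = 1 / A(z) with arbitrary stable coefficients.
a = [1.0, -1.3, 0.8]            # denominator A(z); numerator is 1
voiced_out = lfilter([1.0], a, voiced)
unvoiced_out = lfilter([1.0], a, unvoiced)

print(voiced_out.shape, unvoiced_out.shape)
```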

2.5.1 Phonetical Alphabet

The basic unit used to describe speech sounds is defined as a phone. A phoneme is the smallest unit in speech that serves to distinguish one word from another, with the consequence that phonemes are language-dependent. If a single word can be changed into another word by changing one single unit, these units are representations of phonemes (e.g. in the English words pin vs. bin, p and b are phonemes). Different phones that are realizations of the same phoneme are called allophones.

The International Phonetic Association (IPA) [Int99] developed a scheme that groups all pulmonary speech sounds to give a standardized representation of the sounds of oral language. We already introduced a first distinctive separation into vowels and consonants. To further analyze different types of vowels, the vowel trapezium is introduced (shown in Figure 2.7). Both dimensions represent the position of the tongue: the horizontal axis discriminates between front and back tongue vowels (vowel backness), e.g. the tongue moves towards the front of the mouth in "bit", while it moves to the back in "but". The vertical axis discriminates between closed and open vowels (vowel height), related to the vertical position of the tongue, e.g. the tongue is raised to the top of the mouth in "meet", while lowered to the bottom in "bath". Additionally, vowels are discriminated as rounded or unrounded, depending on the shape of the lips.

For the articulatory description of pulmonary consonants, differentiations are made based on: place of articulation on the horizontal axis (e.g. dental or glottal), manner of articulation on the vertical axis (e.g. nasal or plosive), and voicedness (voiced versus unvoiced). Figure 2.8 gives the detailed consonant classification based on the IPA.

Figure 2.7 – International Phonetic Association (IPA) vowel trapezium according to [Int99], describing positions of all cardinal vowels and respective IPA symbols.

Figure 2.8 – List of pulmonary (i.e. obtained by air from the lungs) consonants, grouped by position and manner of articulation, according to [Int99].

This work uses only English speech data and thus focuses on the characteristics of the English language. We therefore omit descriptions of the peculiarities of other languages; however, we assume that the proposed EMG-to-speech approaches also work for most other Indo-European languages.

2.5.2 Different Speaking Modes

This thesis uses characteristics of speech data from three different speaking modes: normal audible speech, whispered speech and silently articulated speech. Audible speech refers to speech at normal conversational volume, where the voiced sounds are produced by modulation of pulmonary air through vibration of the vocal folds. The detailed process of normal speech production is described in Section 2.5.

Whispered speech takes place without major activity of the larynx and is mostly intended for more private communication. It is thus pronounced more quietly than normal speech [Doe42]. The vocal folds are closed, while the rear part of the glottis is open and forms a kind of triangular shape. The exhaled air produces friction noise in the glottis area. Not only does the special vocal fold position distinguish whisper from normal speech, but also the change of pressure conditions in the sub-glottal space. In whispered speech, voiced sounds are altered, while unvoiced sounds remain unvoiced. For example, if words like "fish" or "six" are whispered, only the vowel is whispered, while the consonants remain voiceless. While it is possible to differentiate between different kinds of whisper, within the scope of this thesis it is sufficient to note that whisper is a strong hissing sound which is produced by a turbulent air flow resulting from a highly constricted glottis.

Silent speech refers to the process in which a speaker does not produce audible sounds while normally articulating the intended speech. The pulmonary airstream is suppressed, while the conventional speech production process remains unchanged.

The term “silent speech” can be ambiguous, since sometimes different types of speech are referred to as being silent. Three different categories of silent speech are mentioned in the literature:

Imagined speech - also known as thought speech, inner speech or covert speech. The intended speech is only thought of and no articulatory movement is involved. This field is actively researched in the brain-computer interface community.

Subvocal speech - Subvocalization, i.e. internal speech used without opening the mouth, is usually observed during reading, when the words are read in the mind. Some of the speech articulators - mainly the tongue - move slightly, but the speech process itself cannot be noticed by an external bystander.

Mouthed silent speech - sometimes denoted as visible subvocal speech - is regarded as normally articulated speech without emission of air and without any underlying sound. Thus, the words are only mouthed. This is the category we use in this work and that we regard as "silent speech". In this work we apply mouthed silent speech and normal audible speech with EMG and thus introduce two definitions that are used in the remaining chapters:

Silent EMG: Facial EMG signals that are recorded when mouthed silent speech is produced.

Audible EMG: Facial EMG signals that are recorded when normal speech is produced.

In Section 5.2, speaking mode differences are investigated at the EMG signal level.

2.6 Speech and EMG Signal Encoding and Representation

The signal processing in this work can be summarized in three stages:

1. Conversion of the input signal (sound pressure waves in air, or ionic myographic currents in tissue) into an electrical signal by a standard microphone or EMG electrodes.

2. Sampling of the electrical signal at discrete, regularly spaced time intervals, resulting in an analog value per time sample.

3. Quantization of the signal into discrete values (A/D conversion) that can be further processed by technical devices.

Each step needs to be carefully designed in order to avoid erroneous signal recordings. For example, an inappropriate sampling rate can introduce aliasing effects, or an insufficient quantization resolution can lead to rounding errors. Speech information, no matter whether presented in acoustic or EMG representation, is temporally localized. Thus, it is common to analyze the signal as a sequence of multi-dimensional vectors at regularly spaced intervals called frames. This process is called windowing. Windowing is equivalent to multiplying the signal with a function that is non-zero only for a certain amount of time - the frame length - starting at the beginning of the signal and shifting it forward by a certain amount of time - the frame shift - for each successive frame.
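A minimal framing routine along these lines is sketched below; the 32 ms frame length, 10 ms frame shift and the Hamming window are illustrative choices, not necessarily the exact parameters used in this work.

```python
import numpy as np

def frame_signal(x, fs, frame_len_s=0.032, frame_shift_s=0.010, window=np.hamming):
    """Split a 1-D signal into overlapping, windowed frames."""
    frame_len = int(round(frame_len_s * fs))
    frame_shift = int(round(frame_shift_s * fs))
    n_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    win = window(frame_len)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_shift
        frames[i] = x[start:start + frame_len] * win
    return frames

fs = 16000
signal = np.random.default_rng(0).normal(size=fs)   # one second of dummy "speech"
frames = frame_signal(signal, fs)
print(frames.shape)                                  # (97, 512)
```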

2.6.1 Speech Encoding

The speech signal propagates through air as a longitudinal waveform, a series of high-pressure and low-pressure areas, depending on the sound frequency and intensity. The audible frequency bandwidth that is usually used in speech communication ranges up to 8 kHz, resulting in a speech sampling rate of 16 kHz. One problem in speech processing systems is the choice of a compressed representation that still retains all necessary acoustic information. Illustrating speech sounds in spectrograms gives important insight for speech analysis, since e.g. formants can easily be spotted. One way to encode the speech signal is to code the exact shape of the signal waveform, which is useful when speech and non-speech signals have to be encoded. An example is Pulse Code Modulation (PCM), which quantizes each sample regardless of its predecessors.

Mel Frequency Cepstral Coefficients

Speech representations are often tailored to the human processes of speech production and perception. The most prominent acoustic features, used especially in speech recognition, are Mel Frequency Cepstral Coefficients (MFCCs). The first step in computing MFCCs is the frame-wise transformation into the frequency domain, e.g. by computing the discrete Fourier transform. On the short-time spectra, a triangular filterbank is used for dimensionality reduction, where adjacent frequency components are combined. This is based on the mel scale [SVN37] - mel referring to the word melody - which represents the human auditory perception of pitch and can be expressed as

mel(f) = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right).   (2.1)

It mimics the ability of human perception to distinguish lower frequencies at a finer scale compared to higher frequencies. Figure 2.9 shows an exemplary mel filterbank, reducing all frequency bands to 12 coefficients. Afterwards, each mel spectrum frame is logarithmized and further processed using an inverse Fourier transform (or alternatively a cosine transform). This transforms the features into the cepstral domain. This work uses a variation of MFCCs that is tailored to the Mel Log Spectral Approximation filter [Ima83], which generates the acoustic speech signal from input MFCCs. Details on this implementation will follow in Section 2.6.2.

Figure 2.9 – Exemplary mel filterbank reducing the spectrum to 12 coefficients.
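For illustration, the following Python sketch (assuming NumPy and SciPy) computes textbook-style MFCCs with a triangular mel filterbank and a DCT as the inverse transform; it is not the MLSA-tailored MFCC variant used in this work, and all function names and parameter defaults are illustrative.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with center frequencies equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frames, fs, n_filters=12, n_ceps=12, n_fft=512):
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2                 # short-time power spectra
    mel_spec = spec @ mel_filterbank(n_filters, n_fft, fs).T       # triangular mel filterbank
    log_mel = np.log(np.maximum(mel_spec, 1e-10))                  # logarithmize
    # a DCT serves as the inverse transform into the cepstral domain
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]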

Linear Prediction  The Linear Predictive Coding (LPC) scheme [AS70] is based on the idea that each observed value y[n] can be predicted as a linear combination of the preceding p values plus an error term e[n], called the residual signal. The number p of preceding values defines the order of the linear predictive model; a[i] are the linear prediction coefficients, which can be calculated by minimizing the sum of the squared errors per frame. Mathematically, this definition can be given as:

y[n] = e[n] + \sum_{i=1}^{p} a[i] \cdot y[n-i]   (2.2)

One drawback of Linear Predictive Coding is its sensitivity to small changes in the coefficients, especially when new data occurs that has not been seen in training. A modification is to use Line Spectral Pairs or Line Spectral Frequencies, which transform the original LPC data into a space that is more robust [Ita75a].
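A minimal sketch of the autocorrelation method for estimating the linear prediction coefficients of a single frame is given below (assuming NumPy and SciPy); the helper names and the absence of pre-emphasis or windowing are simplifying assumptions, not the exact procedure of [AS70].

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order):
    # Autocorrelation sequence r[0..order] of the frame.
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Solve the (symmetric Toeplitz) normal equations R a = r[1..order].
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return a

def residual(frame, a):
    # e[n] = y[n] - sum_i a[i] * y[n - i], cf. Equation 2.2
    pred = np.zeros_like(frame)
    for i, ai in enumerate(a, start=1):
        pred[i:] += ai * frame[:-i]
    return frame - pred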

Fundamental Frequency  The fundamental frequency (F0) can be computed using a modification of the auto-correlation method [dCK02]: the framed signal is correlated with itself, and the F0 is extracted as the highest peak in the correlogram within a specified pitch range (set manually, depending on the speaker).
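The following Python sketch illustrates such a correlation-based F0 estimator on a single frame; the default pitch range and the simple peak-strength voicing threshold are illustrative assumptions rather than the exact procedure of [dCK02].

import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Return an F0 estimate in Hz, or 0.0 if the frame is treated as unvoiced."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_min = int(fs / f0_max)                      # shortest admissible pitch period
    lag_max = min(int(fs / f0_min), len(ac) - 1)    # longest admissible pitch period
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    # crude voicing decision: peak must be strong relative to the zero-lag energy
    if ac[lag] < 0.3 * ac[0]:
        return 0.0
    return fs / lag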

2.6.2 Mel Log Spectral Approximation Filter

The vocoding part of our mapping system generates the final acoustic speech signal from the estimated speech features. This work uses the Mel Log Spectral Approximation (MLSA) filter [Ima83]. The implementation uses the fundamental frequency plus a variation of the MFCCs to produce the final acoustic waveform.

Mel Frequency Cepstral Coefficients for MLSA  The mel frequency cepstral coefficient (MFCC) implementation that is used in this work is tailored to the MLSA filter. It uses a bilinear warping of frequencies to approximate the mel frequency scale and to obtain a low-dimensional representation of the speech signal. The result can optionally be optimized by applying an iterative method which minimizes the difference between the generated and the original spectrum. In detail, the MFCC feature extraction algorithm proceeds as follows:
1. Calculation of the power spectrogram of 70 Hz high-pass filtered data using a 10 ms frame shift and a 512-point Blackman window,
2. Transformation of the spectrogram to its cepstrogram by taking the logarithm and applying the inverse Fourier transform,
3. Application of the frequency warping to transform the frequencies to the mel scale and reduce the dimension to 25,
4. If desired, iterative optimization steps are performed.
Steps 1 and 2 are straightforward. In step 3, a bilinear, invertible frequency warp is applied in the cepstral domain to approximate the mel frequency scale; in the z-domain it is defined as

s = \frac{z^{-1} - \alpha}{1 - \alpha \cdot z^{-1}},   (2.3)

i.e. mapping the complex unit circle onto itself. The parameter α is chosen as 0.42 in the MLSA implementation. Finally, step 4 consists of an optimization procedure as described by Fukada et al. [FT92]. This (Newton-Raphson-style) optimization is carried out on every frame independently. The cepstral vector is back-transformed into the frequency domain, then the error of this (low-dimensional) representation compared to the original (high-dimensional) spectrum is computed, and in the cepstral domain the representation is slightly altered to reduce this error. This process is performed iteratively until the iterative improvement falls below a certain threshold, or until the maximum number of iterations is reached. In practice, convergence is usually reached after 4 to 6 iterations. Furthermore, in order to reduce the computation time with respect to real-time applications, the number of iterations can be limited to 1 or 2, as the first step has the largest impact on the final result.
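Assuming the standard phase-response formula for this first-order all-pass element, the following sketch shows how a linear frequency axis is warped for α = 0.42 at a 16 kHz sampling rate; the warped-frequency formula is a textbook result for this all-pass function and is not quoted from the MLSA implementation itself.

import numpy as np

def warped_frequency(omega, alpha=0.42):
    # Phase response of the all-pass element (z^-1 - alpha)/(1 - alpha z^-1):
    # it maps the normalized frequency omega onto a warped (approximately mel) axis.
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

fs = 16000.0
f = np.linspace(0.0, fs / 2.0, 9)
omega = 2.0 * np.pi * f / fs
f_warped = warped_frequency(omega) * fs / (2.0 * np.pi)
for fi, fw in zip(f, f_warped):
    print(f"{fi:7.0f} Hz -> {fw:7.0f} (warped axis)")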

2.6.3 EMG Encoding

For EMG frames, a frame size of 27 ms and a shift of 10 ms was introduced by [WKJ+06] and has been successfully used for the last decade. The 10 ms frame shift is a value that is frequently used in speech processing [HAH01]. Since the length of a phone is considered to be at least 30 ms, a 10 ms frame shift corresponds to at least 3 frames per phone. This enables the distinct modeling of begin, middle and end of a phone. Our own preliminary experiments with 5 ms gave only marginally different results.
We use a set of different features for our surface electromyographic data. These time-domain EMG features are based on [JSW+06] and consist of a set of five different features. In the following description of the features, let x[n] denote the n-th signal value. w[n] is the double application of a nine-point averaging filter:

w[n] = \frac{1}{9} \sum_{k=-4}^{4} v[n+k], \quad \text{where} \quad v[n] = \frac{1}{9} \sum_{k=-4}^{4} x[n+k].   (2.4)

We define a high-frequency signal:

p[n] = x[n] - w[n],   (2.5)

as well as a rectified high-frequency signal:

r[n] = |p[n]|. (2.6)

Additionally, for any given feature f within a frame of size N, \bar{f} denotes the frame-based mean:

\bar{f} = \frac{1}{N} \sum_{n=0}^{N-1} f[n],   (2.7)

P_f denotes the frame-based power:

P_f = \sum_{n=0}^{N-1} f[n]^2   (2.8)

and z_f the frame-based zero-crossing rate (the number of times the signal's sign changes). We now define the TD0 features as a combination of all of the above:

TD0 = [\bar{w}, P_w, P_r, z_p, \bar{r}]   (2.9)

As time context is crucial for EMG signals that cover speech information, TD feature vectors are stacked with adjacent vectors to introduce contextual information of the immediate past and future into the final feature. This stacking can be described as follows: the basic features without any time context are called TD0, the 0 representing a stacking context of zero. Consequently, a TD feature vector that has been stacked with its adjacent features N time steps into the past and future is called TDN. E.g., TD5 is a TD0 vector enhanced with its five respective predecessors and successors. The TD5 feature vectors of a single EMG channel therefore have (1 + 5 · 2) · 5 = 55 dimensions. Following the same concept, TD15 feature vectors of a single EMG channel have (1 + 15 · 2) · 5 = 155 dimensions. All direct mapping experiments are done on these TD features. While they performed best in speech recognition experiments [JSW+06, SW10, WSJS13], we assume that features that capture useful information for speech recognition also contain useful information for a direct synthesis approach. Preliminary experiments with different spectral features could not improve our results.
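The following Python sketch (assuming NumPy) illustrates the TD0 extraction and the context stacking defined above for a single EMG channel; the framing loop and helper names are illustrative and do not reproduce the exact implementation used in this work.

import numpy as np

def nine_point_mean(x):
    # single application of a nine-point averaging filter
    return np.convolve(x, np.ones(9) / 9.0, mode='same')

def td0_features(x, frame_len, frame_shift):
    w = nine_point_mean(nine_point_mean(x))    # double application (Eq. 2.4)
    p = x - w                                   # high-frequency signal (Eq. 2.5)
    r = np.abs(p)                               # rectified high-frequency signal (Eq. 2.6)
    feats = []
    for start in range(0, len(x) - frame_len + 1, frame_shift):
        sl = slice(start, start + frame_len)
        zp = int(np.sum(np.abs(np.diff(np.sign(p[sl]))) > 0))   # zero-crossing count of p
        # TD0 = [w_mean, P_w, P_r, z_p, r_mean], cf. Equation 2.9
        feats.append([w[sl].mean(), np.sum(w[sl] ** 2), np.sum(r[sl] ** 2), zp, r[sl].mean()])
    return np.asarray(feats)

def stack_context(feats, context):
    # TDN: concatenate each TD0 vector with its N predecessors and N successors
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

# e.g. stack_context(td0_features(x, 55, 20), 5) yields 55-dimensional TD5 vectors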

2.7 Mathematical/Technical Background

While the previous sections introduced the details about EMG and speech input and their representation, this section discusses some mathematical principles and algorithms that are used in this thesis.

2.7.1 K-Means algorithm

K-means is an algorithm for grouping a set of N D-dimensional vectors into K clusters, with N > K [Mac05]. The algorithm consists of an initialization step followed by two alternating steps of assignment and update that are repeated until a stopping criterion is met.


Figure 2.10 – Example for K-means: Clustering 2-dimensional data into 3 clusters using the k-means algorithm, terminating once the cluster assignments stop changing.

Figure 2.10 demonstrates an exemplary K-means clustering by partitioning a set of 2-dimensional vectors into three clusters. For this purpose, let X_n be the n-th of N input vectors, d(x, y) a distance metric on D-dimensional vectors (e.g. the Euclidean distance), M_k the k-th of K current cluster centroids, and A_n ∈ {1, ..., K} the index of the cluster to which the n-th vector is currently assigned. The procedure is as follows:
1. The set of cluster centroids is initialized. This can be done by e.g. randomly initializing the centroids or simply using the first K input vectors {X_1, ..., X_K}.
2. Each input vector is assigned to the cluster whose centroid has the smallest distance to it, resulting in

A_n = \arg\min_{k \in \{1, \dots, K\}} d(X_n, M_k)

3. Update the centroids by re-computing the cluster means, i.e.

M_k = \frac{\sum_{n \in \{m \mid A_m = k\}} X_n}{\sum_{n \in \{m \mid A_m = k\}} 1},

and repeat from step 2 until a stopping criterion is met, e.g. a fixed number of iterations or until only a fixed fraction of cluster assignments change.
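A compact NumPy sketch of this procedure, using the first K vectors for initialization and stopping once the assignments no longer change, could look as follows; the function name and defaults are illustrative.

import numpy as np

def kmeans(X, k, max_iter=100):
    M = X[:k].astype(float).copy()           # step 1: first K vectors as initial centroids
    A = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 2: assign each vector to the nearest centroid
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        A_new = d.argmin(axis=1)
        if np.array_equal(A_new, A):          # stop once assignments no longer change
            break
        A = A_new
        # step 3: update each centroid as the mean of its assigned vectors
        for j in range(k):
            if np.any(A == j):
                M[j] = X[A == j].mean(axis=0)
    return M, A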

2.7.2 Dimensionality Reduction

After extracting features for our data, the number of features per feature vector (i.e. the dimensionality) may be very high, particularly with EMG features. The "curse of dimensionality" refers to the problem that a high-dimensional input has to be modeled from a limited amount of training data. This means that we need to use data representations that satisfy some properties:

• Generalization: good performance on unseen data must be given.
• Stability/robustness: the system output is not affected by slight input variations.
• Elimination of highly inter-dependent features (i.e. redundancies).
• Discriminability: extract features that help to distinguish the underlying classes from each other.
• Computational efficiency: reducing the input data to lower dimensions gives a noticeable speedup.

In this section two dimensionality reduction techniques are introduced: Linear Discriminant Analysis (LDA) and Canonical Correlation Analysis (CCA). In general, the reduction could also be done using the unsupervised Principal Component Analysis (PCA), which, however, is sensitive to noise and data scaling and thus usually performs worse than supervised techniques. Therefore, we only present supervised feature reductions.

Linear Discriminant Analysis (LDA)

(Multi-class) Linear Discriminant Analysis (LDA) is a long-known [Fis36] and popular technique that can be used for dimensionality reduction. It calculates a projection matrix which maximizes the separation between known classes in the training data. Since class assignments for all input data are needed, LDA is a supervised method. For k different classes in the training data, the data set of size n is transformed to a (k − 1)-dimensional subspace, maximizing the between-class variance while minimizing the within-class variance.

The between-class scatter matrix B = \frac{1}{k} \sum_{i=1}^{k} (\mu_i - \mu)(\mu_i - \mu)^T is an estimate of the variance between classes, with \mu_i being the mean of all data samples that belong to class i and \mu representing the mean of all class means.

The within-class scatter matrix W = \sum_{i=1}^{n} (x_i - \mu_{c(i)})(x_i - \mu_{c(i)})^T is an estimate of the variance within the given classes, with \mu_{c(i)} being the mean of the class that sample i belongs to. LDA finds a projection matrix whose columns are eigenvectors v, arranged in descending order of discriminability, meaning that the first dimensions give the most discriminating features. To ensure class separability, the margin between the class means is maximized, while the samples within each class should be scattered over a minimal region, resulting in a minimization. Together, the following term is maximized:

\max_{v} \frac{v^T B v}{v^T W v}

The dimensionality of the original input features can then be reduced by discarding the dimensions corresponding to the smallest eigenvalues. After projecting the original feature vectors into the linear subspace, the information about which output component represents which input feature is lost.
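The following sketch (assuming NumPy/SciPy) illustrates an LDA projection by building the scatter matrices defined above and solving the generalized eigenproblem B v = λ W v; in practice the within-class scatter usually needs regularization, which is omitted here, and all names are illustrative.

import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, n_components):
    """X: (n, D) data matrix, y: (n,) integer class labels."""
    classes = np.unique(y)
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    mu = class_means.mean(axis=0)                  # mean of all class means
    D = X.shape[1]
    B = np.zeros((D, D))
    W = np.zeros((D, D))
    for c, mc in zip(classes, class_means):
        Xc = X[y == c]
        B += np.outer(mc - mu, mc - mu)            # between-class scatter
        W += (Xc - mc).T @ (Xc - mc)               # within-class scatter
    B /= len(classes)
    # Generalized symmetric eigenproblem B v = lambda W v (eigh returns ascending order).
    eigvals, eigvecs = eigh(B, W)
    order = np.argsort(eigvals)[::-1]              # most discriminative directions first
    return eigvecs[:, order[:n_components]]

# Usage: Z = X @ lda_projection(X, y, n_components=len(np.unique(y)) - 1)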

Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis (CCA) is a long-known technique that was first published by Harold Hotelling in 1936 [Hot36]. It finds basis vectors for two sets of multidimensional variables such that the projections of these variables onto the basis vectors are maximally correlated. The two sets of variables can also be considered as different views of the same event. Since those views can be any kind of vector representation, there is no restriction to discrete class labels as with LDA.

One set of variables, the source features, is denoted S_x; the other set, the target features, is denoted S_y. Let w_x and w_y be the basis vectors onto which the features are projected. CCA finds the w_x and w_y for which the correlation between the projections is maximized:

\rho = \max_{w_x, w_y} \mathrm{corr}(S_x w_x, S_y w_y) = \max_{w_x, w_y} \frac{\langle S_x w_x, S_y w_y \rangle}{\|S_x w_x\| \, \|S_y w_y\|}.

Following Hardoon et al. [HSST04], this can be expressed using the covariance matrices \Sigma_{xx}, \Sigma_{xy} and \Sigma_{yy}:

\rho = \max_{w_x, w_y} \frac{w_x^T \Sigma_{xy} w_y}{\sqrt{w_x^T \Sigma_{xx} w_x \; w_y^T \Sigma_{yy} w_y}},

which is equivalent to solving the generalized eigenproblem \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} w_x = \lambda^2 \Sigma_{xx} w_x.
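A minimal NumPy/SciPy sketch of this formulation, solving the generalized eigenproblem via the covariance matrices, is shown below; the regularization term and the normalization of the target-side basis vectors are illustrative additions for numerical stability, not part of the formulation above.

import numpy as np
from scipy.linalg import eigh

def cca(Sx, Sy, n_components=1, reg=1e-6):
    """Sx: (n, dx), Sy: (n, dy); returns projection matrices Wx, Wy."""
    Sx = Sx - Sx.mean(axis=0)
    Sy = Sy - Sy.mean(axis=0)
    n = len(Sx)
    Cxx = Sx.T @ Sx / n + reg * np.eye(Sx.shape[1])    # regularized covariance estimates
    Cyy = Sy.T @ Sy / n + reg * np.eye(Sy.shape[1])
    Cxy = Sx.T @ Sy / n
    # Generalized eigenproblem: Cxy Cyy^-1 Cyx wx = lambda^2 Cxx wx
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, eigvecs = eigh(M, Cxx)
    order = np.argsort(eigvals)[::-1]
    Wx = eigvecs[:, order[:n_components]]
    # Corresponding wy is proportional to Cyy^-1 Cyx wx; normalize the projections.
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx)
    Wy /= np.linalg.norm(Sy @ Wy, axis=0, keepdims=True) + 1e-12
    return Wx, Wy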

Independent Component Analysis

Independent Component Analysis (ICA) [Com92] is a technique used for source separation, i.e. obtaining relevant signals from several observed mixtures of data. In general, it is a particular method for Blind Source Separation (BSS), which can be used to decompose redundant signals into source signal components that satisfy an optimization criterion. The term "blind" refers to the fact that the decomposed source signals, as well as the mixture, are unknown. The obtained signals are decorrelated and statistical dependencies are reduced, making the signals as independent of each other as possible. A common application is its use on high-dimensional electroencephalographic data to separate artifacts (e.g. eye movements or cardiac signals) from electrical signals that are related to a certain brain activity [MBJS96]. For further details on ICA we refer to the detailed descriptions in [HO00].

2.8 Time Alignment

Different acoustic renditions of the same utterance result in different durations, e.g. due to different speaking rates. Thus, to compare for example the frames of whispered and normally spoken utterances, both speech signals need to be aligned. One solution to this problem is Dynamic Time Warping (DTW), which is widely used in speech recognition [Ita75b]. This algorithm efficiently "warps" the time axes of two input sequences X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_m to align them to each other. The first step is the construction of an n-by-m distance matrix, where each matrix element (i, j) contains the distance d(x_i, y_j) between two points x_i and y_j of the input sequences. Typically, a simple distance metric is used (e.g. the Euclidean distance). Based on this distance matrix, a warping path w = w_1, w_2, ..., w_l is obtained, which defines the mapping from one input sequence to the other. This warping path is subject to three constraints [KR05]:

• Boundary condition: w1 = (1, 1) and wl = (n, m).

• Continuity of step size: w_{i+1} − w_i ∈ {(0, 1), (1, 0), (1, 1)}, ∀ i ∈ [1 : l − 1]. This simply restricts the steps of the warping path to adjacent elements.

• Monotonicity: with w_i = (n_i, m_i), n_1 ≤ n_2 ≤ ... ≤ n_l and m_1 ≤ m_2 ≤ ... ≤ m_l.

The optimal warping path is then the path with the minimal total distance among all admissible paths, which can be computed using dynamic programming.
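The following Python sketch illustrates the dynamic-programming recursion and the backtrace of the warping path under the three constraints above; it is a straightforward O(n·m) reference implementation, not the optimized alignment code used in this work.

import numpy as np

def dtw(X, Y, dist=lambda a, b: np.linalg.norm(a - b)):
    """X: (n, d), Y: (m, d); returns the total alignment cost and the warping path."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # allowed predecessor steps correspond to (0,1), (1,0), (1,1)
            D[i, j] = dist(X[i - 1], Y[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrace the optimal path from (n, m) to (1, 1).
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i, j))
        steps = {(i - 1, j - 1): D[i - 1, j - 1],
                 (i - 1, j): D[i - 1, j],
                 (i, j - 1): D[i, j - 1]}
        i, j = min(steps, key=steps.get)
    path.append((1, 1))
    return D[n, m], path[::-1]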

2.9 Evaluation Metrics

Finding an appropriate measure for the evaluation of speech quality is a difficult task and subject to ongoing research. The main goal is to achieve an intelligible output, but to establish a satisfying result, a system also needs to produce a natural output. Both aspects depend on multiple influences, including factors like the presence of noise, the kind of noise, breathy or robotic speech, etc. It becomes clear that it is hard to quantify such a quality aspect and that every numeric measurement implies an oversimplification. However, not only the quantification, but also the interpretation of a specific term like "intelligibility" is error-prone, or at least complicates evaluations. Intelligibility covers the amount of information that is understood. So a first evaluation type could be to simply write down the words that were perceived. However, what about prosodic information? Was the perceived sentence uttered as a statement or a question? Was the speaker emphasizing certain words? Or is this aspect only covered by investigating the "naturalness" of the output?

It has also been observed that a concept like "naturalness" may be underspecified [DYK14], given that there is no concrete definition for this term. As a result, different studies give different instructions for naturalness evaluation. [DYK14] refers to the example that e.g. the Blizzard evaluation [PVE+13] instructs the listener to "reflect your opinion of how natural or unnatural the sentence sounded. You should not judge the grammar or content of the sentence, just how it sounds." In contrast, [AEB12] defines naturalness as the rated probability "that a person would have said it this way?" The second instruction takes grammar and content into account, standing in contrast to the first definition.

These examples give a first glance at the problems that are related to subjec- tive evaluations of synthesized speech. Despite these problems it is common to estimate both subjective and objective speech quality measures.

2.9.1 Subjective Evaluation

Subjective, also called psychophysical, listening tests are based on the judgment of human listeners. Though it is inconvenient to run a listening test for every incremental improvement of a synthesis system, subjective evaluations represent a solid way to evaluate an output, since they directly involve the most important factor for the output: the human listener. A commonly used metric is the Mean Opinion Score (MOS), where a number of listeners rate the quality of synthesized utterances on a numerical scale. MOS is also defined in ITU-T Rec. P.10 [IT06] (in the telecommunication/transmission domain) as "the value on a predefined scale that a subject assigns to his opinion of the performance of the telephone transmission system used either for conversation or only for listening to spoken material." The commonly used perceptual MOS scale is:

5 (excellent), 4 (good), 3 (fair), 2 (poor), 1 (bad).

Another way to evaluate an output is to explicitly compare two or more utterances in an A/B preference test. The listener hears the same utterance from two different output methods and chooses the one that sounds better. A variation of this technique is to additionally present (if available) the original speech file, so that the subject can compare and choose which of the outputs is closer to the target speech.

2.9.2 Objective Evaluation

While subjective evaluations are an "expensive" way (in terms of time, effort and reproducibility), a different option is the application of computer-derived objective evaluations. A summary of some objective speech quality measures can be found in [HL08]. We list only those that are commonly used in related work and are thus deployed in this work.

Mel-Cepstral Distortion

Mel cepstral distortion, sometimes also called mel cepstral distance (MCD) [Kub93], is a logarithmic spectral distance measure using the scaled Euclidean distance between MFCC vectors. It is computed between a synthesized utterance (estimated data) and the reference utterance (target data). Mathematically, this can be formulated as the distance between an estimated MFCC vector MFCC_{est} and the reference MFCC_{ref} at the frame level as:

MCD_f = \frac{10}{\ln 10} \sqrt{2 \sum_{k=2}^{25} \left( MFCC_{est}[k] - MFCC_{ref}[k] \right)^2}   (2.10)

In this work, we use 25 MFCCs, resulting in a 25-dimensional MFCC vector. The first coefficient k = 1 can be excluded from the MCD computation, since it represents the power of the speech signal, which is primarily affected by the recording gain rather than by signal distortion. Since some other publications include the first coefficient, MCD numbers are not always directly comparable. It is also worth mentioning that MCD comparability is not always given due to its dependence on factors like the MFCC implementation or the frame size. Since the MCD represents a distance measure, lower numbers imply better results. In Voice Conversion research, reported MCD results range from 4.5 to 6.0. Raghavendra et al. [RVP10] observe that an MCD difference of 0.2 produces a perceptual difference in the quality of synthetic speech. The utterance-level MCD for an utterance with n frames is given by:

MCD = \frac{1}{n} \sum_{f=1}^{n} MCD_f   (2.11)

Since only the similarity of an estimated output to an existing target can be measured, the MCD is not fully reliable for rating the general quality of an output. For example, it does not take prosodic effects into account.
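The frame- and utterance-level MCD of Equations 2.10 and 2.11 can be computed as in the following sketch (assuming NumPy and already time-aligned MFCC matrices); excluding the first coefficient follows the convention described above.

import numpy as np

def mel_cepstral_distortion(mfcc_est, mfcc_ref):
    """Utterance-level MCD for two aligned (n_frames, 25) MFCC matrices."""
    diff = mfcc_est[:, 1:] - mfcc_ref[:, 1:]        # exclude the first (power) coefficient
    mcd_per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return mcd_per_frame.mean()                      # average over all frames (Eq. 2.11)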

F0 Voiced/Unvoiced Error Rate

While the MCD gives a simple and reliable distance measure on MFCCs when compared against a reference signal, we evaluate the estimated F0 excitation with two different metrics.

One way is to account for the recognition of the voicing states, in other words to rate the unvoiced/voiced (U/V) decision error on a frame-basis. Regarding silence frames as unvoiced, each estimated (and reference) frame can be either voiced or unvoiced. Thus, we have 4 possibilities: a voiced frame is estimated correctly as voiced (V → V ), an unvoiced frame is estimated correctly as unvoiced (U → U), a voiced frame is estimated falsely as unvoiced (V → U), an unvoiced frame is estimated falsely as voiced (U → V ).

The V/U decision error uses the number of voiced and unvoiced errors (N_{V→U} and N_{U→V}) and is defined as:

\text{V/U-Error} = \frac{N_{V \to U} + N_{U \to V}}{N} \cdot 100\%,   (2.12)

where N is the total number of frames. Sometimes the V/U accuracy is reported, which is given by:

\text{V/U-Accuracy} = 100\% - \text{V/U-Error}   (2.13)

This results in a simple metric accounting for the voicedness of an utterance, yet it gives only little insight into the generated prosody itself.

F0-Correlation-Coefficient

To evaluate the prosodic contour of the generated F0, we also compute the correlation coefficient between all frames of an estimated and a reference speech signal. Correlation coefficients are calculated on an utterance basis as follows:

\mathrm{Corr}(est, ref) = \frac{\mathrm{Cov}(est, ref)}{\sigma_{est} \sigma_{ref}},   (2.14)

where Cov(est, ref) denotes the covariance between estimated and reference data, and σ_{est} and σ_{ref} denote the standard deviations of the estimated and reference data, respectively. Corr(est, ref) ranges between −1 and 1, where −1, 0, and 1 refer to perfect negative correlation, no correlation, and perfect positive correlation, respectively.
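Both F0 metrics can be computed as in the following sketch (assuming NumPy and frame-aligned F0 tracks in which unvoiced frames are marked with 0); restricting the correlation to frames that are voiced in both tracks is one common convention and an assumption here, not a prescription from the text.

import numpy as np

def f0_metrics(f0_est, f0_ref):
    """Return the V/U error in percent (Eq. 2.12) and the F0 correlation (Eq. 2.14)."""
    voiced_est = f0_est > 0
    voiced_ref = f0_ref > 0
    vu_error = np.mean(voiced_est != voiced_ref) * 100.0       # voicing decision errors
    both = voiced_est & voiced_ref                              # frames voiced in both tracks
    corr = np.corrcoef(f0_est[both], f0_ref[both])[0, 1] if both.sum() > 1 else 0.0
    return vu_error, corr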

Chapter 3

Related Work

This chapter gives a compact overview of relevant work in the field of biosignal-based speech processing systems, with a focus on myographic input signals. Work related to individual mapping approaches will be discussed in the respective chapters.

3.1 EMG-based Speech Communication

EMG-based analysis of speech production and the articulatory muscles is a topic that has been investigated for several decades [TM69, GHSS72, KP77, Hir71]. Initially, instrumentation and recording equipment was far from user-friendly, e.g. only intramuscular wired electrodes were available. First investigations on silent speech based on this equipment were already published in the 1950s [FAE58].
In the following decades, user-friendly surface electrodes became more and more popular. The first publications describing EMG-based speech recognition date back to the mid 1980s, when Sugie and Tsunoda [ST85] performed a first study on the discrimination of five Japanese vowels based on muscular activity in the oral region recorded using three surface electrodes. Nearly at the same time, Morse et al. [MO86] investigated the suitability of myoelectric signals recorded from four channels (three channels in the neck area, one channel at the forehead) to attain speech information. Multiple experiments with different sets of words on two different speakers were conducted, achieving an accuracy of 60 % on a 10-digit English vocabulary. Competitive performance was first reported by Chan et al. [CEHL01], who achieved an average accuracy of 93 % on a 10-digit English vocabulary. A good performance was achieved by Jorgensen et al. [JLA03] even when words were spoken non-audibly, i.e. when no acoustic signal was produced, suggesting this technology could be used to communicate silently. Manabe et al. [MHS03] used a setup where electrodes are attached to the speaker's hand, which is moved to the face during recording.
While the former approaches used words as model units, later research [JSW+06, WKJ+06] successfully demonstrated that phonemes can be used as basic model units for EMG-based speech recognition, paving the way for large vocabulary continuous speech recognition. Recent results include advances in acoustic modeling using a clustering scheme on phonetic features, which represent properties of a given phoneme, such as the place or the manner of articulation. Schultz and Wand [SW10] reported that a recognizer based on such bundled phonetic features outperforms phoneme-based models by more than 30 %.
Today, a number of research groups are driving EMG-based speech processing forward [MSH+08, Lee10, JD10, DCHM12, KYHI14], some of them with a particular focus, e.g. investigating factors specific to the Portuguese language [FTD12] or focusing on disordered speakers [DPH+09]. There also exist multimodal approaches, e.g. combining acoustic and EMG signals [KNG05, SHP07, ZJEH09, DPH+09]. Approaches with a direct transformation of EMG signals to speech will be described in Section 3.3.

3.2 Biosignal-based Speech Interfaces

Systems that are summarized under the keyword "Silent Speech Interface" [DSH+10] define approaches that do not need audible speech signals to process the actual speech. Apart from the already discussed EMG-based processing technologies, other types of input signals can be used:

• Video camera based lip reading approaches [HO92].
• Permanent magnetic articulography (PMA) or electromagnetic articulography (EMA) [FEG+08, GRH+10]: magnets are attached to the articulators, which enables capturing the varying magnetic field with sensors located around the mouth.
• Ultrasound and optical imaging of the tongue and lips [DS04, HAC+07, FCBD+10, HBC+10].
• Non-audible murmur (NAM) microphones [Nak05, NKSC03, HOS+10, TBLT10, HKSS07]: a type of stethoscopic microphone that still records acoustic sound waves, which are conducted via the body and thus are nearly inaudible.
• Glottal activity detection: based on an electroglottograph (EGG) [Rot92, TSB+00] or on vibration sensors [BT05, PH10], which both record activity across the larynx during speech.
• Brain computer interfaces: based on electroencephalographic signals [PWCS09, DSLD10], near-infrared sensors [VPH+93, FMBG08] or implants in the speech-motor cortex [BNCKG10, HHdP+15]. The first two of these approaches are mainly used for imagined speech, due to their high sensitivity to motion artifacts, and are only loosely related to this thesis. However, they should be listed since they reveal insights into the processing for silent speech interfaces.

3.3 Direct Generation of Speech from Biosignals

A direct speech generation approach has advantages compared to the recognition-followed-by-synthesis technique, although this may be linked to a possible degradation of the performance, e.g. by omitting the language model. There is no restriction to a given phone set, vocabulary or even language, and there is the opportunity to retain paralinguistic features like prosody or speaking rate, which is essential for natural speech communication. The speaker's voice and other speaker-dependent factors remain preserved. Additionally, a direct mapping enables faster processing than speech-to-text followed by text-to-speech and thus allows for nearly real-time computation, which can also be used for direct feedback. A couple of research groups have investigated such direct speech synthesis techniques.
Lam et al. [LLM06] introduce a first EMG-based speech synthesis approach. Using a two-layer neural network on 2 channels (a cheek and a chin channel) of audible EMG, the authors report on a frame-by-frame feature mapping that is trained on seven different phonemes. The neural network is used to generate eight different words, 70 % of which can be synthesized correctly. It remains unclear how this performance was evaluated. Tsuji et al. [TBAO08] present a synthesis approach that involves a phone recognition component rather than a direct transformation of signals. A neural network is used for phoneme classification, which is further processed by hidden Markov models, and the final voice generation is obtained using a synthesizer software component. This clearly has the drawback that phone labels are needed for training the system.
Further results on direct EMG-to-speech research are reported by Toth et al. [TWS09], where a GMM-based technique is used without any subword-based recognition steps. Five EMG channels are obtained from single-electrode recordings with 380 utterances (about 48 minutes) of training data from audible EMG and speech. An average MCD of 6.37 with a standard deviation of 2.34 is obtained. The authors additionally perform a speech recognition experiment using the synthesized speech output. Restricting the testing vocabulary to 108 words, 84.3 % of the words were recognized correctly. First results on silent EMG-to-speech generation are also presented, however this reduces the word recognition accuracy to 20.2 %. Due to the lack of an audible reference for the silent EMG data, no MCD numbers are reported for silent EMG data.
Ultrasound images of the tongue are used to directly generate a speaker's vocal tract parameters in work by Denby et al. [DS04]. The resulting speech is stated to be of poor quality, but shows many of the desired speech properties. Hueber et al. [HCD+07] use ultrasound and additional video images to drive a combination of an HMM and a Unit Selection approach to synthesize speech. For this promising approach, no objective evaluation scores are presented. The authors state that the synthesized speech is of good quality for correctly predicted sequences; however, the number of errors is still too high to produce a truly usable output signal. The authors [HBDC11] additionally present a GMM-based approach that converts lip and tongue motions, captured by ultrasound and video images, into audible speech. Using 1,020 training sentences they achieve an MCD of 6.2.
Non-audible murmur based transformation into whispered speech has been investigated by Nakagiri et al. [NTKS06].
Although the recorded signals are closely related to normal speech waves, they are nearly silently articulated and therefore can be regarded as silent speech signals. The authors report on the challenges in NAM-to-F0 conversion and thus propose a whisper output, using NAM signals in addition to Body Transmitted Whisper signals (BTW). Using the combination of both signals, the best MCD is stated as 4.4, and additionally an F0 correlation coefficient of 0.62 is reported. Converting whisper-like signals to normal speech is proposed by Tran et al. [TBLT08]. A Gaussian mixture model approach and a neural network mapping are used to generate F0, resulting in 93.2 % voiced/unvoiced accuracy using neural networks, and 90.8 % using GMMs. A correlation coefficient of 0.499 is reported on the F0 contour. The authors continue their investigations in [TBLT10], reporting an MCD of 5.99, which improves to 5.77 when visual facial information is added.
Electromagnetic articulography (EMA) data is used as a source signal in research by Toda et al. [TBT04]. A GMM-based voice conversion approach is used, a system that is also the basis for the EMG-to-speech research by Toth et al. [TWS09]. The reported average MCD using EMA data is 5.59 with a standard deviation of 2.23, which improves significantly when F0 and power features are added to the input, resulting in 4.59 with a standard deviation of 1.61. This approach is improved in [TBT08], achieving an average MCD of 4.5 with the best speaker-dependent MCD of 4.29.

3.3.1 Pitch Generation from EMG

Generating a natural prosody is one of the most challenging problems in speech synthesis. Thus, we list the related work that was done in this field when, instead of text or speech input, other biosignals are used. Most of this research is intended for people with speaking disabilities who are forced to use an additional prosthetic speaking device: an electrolarynx. This device can be regarded as a baseline and used for comparison with the proposed pitch generation approaches. An electrolarynx generates an excitation sound using the electromechanical vibration of a diaphragm when a specific button is pressed. This gives an intelligible speech output, however it sounds very robotic and unnatural.
Considering electromyographic input, Balata et al. [BSM+14] recently reviewed the existing research regarding the investigation of the electrical activity of the larynx during phonation. 27 comparable publications were selected, which shows the interest in this field of research. One of the first EMG-based investigations was done in 1979, when Shipp et al. [SDM79] used a logistic regression to predict F0, using invasive needle electrodes. [SHRH09, KSZ+09] compare an EMG-controlled electrolarynx to a normal electrolarynx. The spectral envelope of EMG signals from face and neck muscles is used to set the F0 output in a threshold manner, but no detailed comparison to the F0 signal is performed. Similar work is done by DeArmas et al. [DMC14]: a support vector machine is used to predict F0 from two channels of EMG data from the neck muscles, achieving an average voicing state accuracy (voiced/unvoiced accuracy) of 78.05 %. An approach that is based on Gaussian mixture models and competes with the work of DeArmas [DMC14] is presented by Ahmadi et al. [ARH14]. Pitch generation is performed using data from the neck muscles. Using an improved feature vector they report a correlation coefficient of 0.87 and a voiced/unvoiced accuracy of 87 %. Investigations in tonal languages like Cantonese are reported in [YLN15].
Although surface EMG only records a summation of multiple muscles, many of the proposed EMG-to-F0 approaches focus on and are motivated by the detection of laryngeal muscle activity. This contradicts their usage in applications with e.g. laryngectomees, where the voice box has been completely removed.

3.4 Summary

The related work introduced in this chapter shows the wide interest and longstanding research in the field of direct speech generation using biosignal input. Although these approaches gradually improve and the recording devices used become considerably more affordable, many restrictions still exist. Some approaches work only in specific laboratory-like environments (e.g. EMA) or the output is restricted to a limited domain (e.g. [LLM06] train on 8 phonemes only). Denby et al. [DSH+10] compare EMG to other biosignals and endorse its high potential in terms of cost, non-invasiveness, silent usage and other factors. While Toth et al. [TWS09] introduced a first EMG-to-speech approach, this thesis uses multi-channel grid electrodes and advances the results by examining different mapping approaches, paving the way towards an end-user system.

Chapter 4

Data Corpora and Experimental Setup

This chapter gives an overview of the EMG and speech data recordings plus the equipment, software and hardware used. We start with an introduction of the different EMG setups and the general recording setup, followed by details on the corpora used. Additionally, first analyses of the recorded data are presented.

4.1 EMG and Speech Acquisition System

Two different EMG recording setups (see Figure 4.1) were used in the research for this thesis:
A Single-Electrodes Setup using six EMG channels from multiple facial electrode positions.
An Electrode Grid/Array Setup using 35 EMG channels from an electrode grid on the cheek and one electrode array on the chin.
The latter was introduced in the scope of this thesis and involves an easier electrode attachment and new opportunities for signal processing in contrast to the generally used single electrodes. The single-electrodes setup was first presented in 2005 [MHMSW05] and has only slightly changed over the last years. Details will be given in Section 4.1.2. The array-based setup is a novel recording system that was first described in [WSJS13] for EMG-based speech recognition, although the use of electrode arrays for EMG was already introduced in the 1980s [MMS85]. See Section 4.1.3 for further descriptions of the array setup. For both setups an audio signal was synchronously recorded with a standard close-talking microphone. The general process of recording remained the same, independent of the type of setup.

Figure 4.1 – The single-electrodes recording setup (left) and electrode-array recording setup (right).

4.1.1 General Recording Setup

All recorded data was obtained with a standard close-talking headset and an EMG recording system in parallel. Details on the recorded utterances follow in Section 4.2. For audio recording we used an ANDREA NC95 headset with built-in noise reduction. The headset microphone was connected to a Hercules Gamesurround Muse XL Pocket LT3 USB stereo sound adapter, where the audio signal was saved into the first audio channel (out of two). The other channel contains an analogue marker signal used for synchronization of the acoustic signal with the EMG signal, as described below. We recorded two different speaking modes: audible speech and silent speech.
Recordings were done on a laptop with the in-house developed recording software BiosignalsStudio [HPA+10]. The recording process was as follows: the program presented a sentence of the corpus on the screen to the recording participant. For recording this sentence, the subject had to press the RECORD button and hold it while speaking the sentence. The resulting utterance was considered "valid" if the pronunciation of each word in the sentence had been articulated properly in English. If the speakers skipped or added words or letters, stuttered, or produced pauses that were longer than regular speech pauses, the utterance was recorded repeatedly until the pronunciation of the complete sentence was valid. Furthermore, the recording participants were allowed to practice the pronunciation if they felt it to be necessary. We define the amount of data that was recorded between attachment and detachment of the electrodes as one recording session. If the electrodes were removed and attached again on the same day, we consider this a new session. Figure 4.2 shows a screen with the existing recording elements. During recording, only the speaker window (bottom left) is presented to the subject, while the recording assistant inspects the EMG signals in the EMG visualization on a second screen. The remaining windows are used for sentence selection and EMG device management.

Figure 4.2 – The Graphical User Interface for the data recording.

All recordings were performed in the same room with the same equipment and settings and were supervised by a recording assistant. For the recording of audible speech, subjects were told to speak as normally as possible; for the recording of silent speech, subjects were instructed to imagine a meeting situation in which he or she was talking to a person next to him or her without generating any sound. This scenario was intended to prevent hyperarticulation.

4.1.2 Single-Electrodes Setup

For single-electrodes EMG recording, we used a computer-controlled biosignal acquisition system (Varioport, Becker-Meditec, Germany). The recording device can use a Bluetooth connection, as well as a serial port, to communicate with a connected computer system. The built-in A/D converter can be operated at different sampling rates; a rate of 600 Hz was chosen. The single electrodes are standard Ag/AgCl electrodes with a diameter of 4 mm. All EMG signals were sampled at 600 Hz (covering the main spectral part of up to 200 Hz [Hay60]) and filtered with an analog high-pass filter with a built-in cut-off frequency at 60 Hz.
We adopted the electrode positioning from [MHMSW05], which yielded optimal results. Our electrode setting uses six channels and is intended to capture signals from the levator anguli oris (channels 2 and 3), the zygomaticus major (channels 2 and 3), the platysma (channels 4 and 5), the depressor anguli oris (channel 5) and the tongue (channels 1 and 6). Channels 2 and 6 use bipolar derivation, whereas channels 3, 4, and 5 are derived monopolarly, with two reference electrodes placed on the mastoid portion of the temporal bone (see Figure 4.3). Similarly, channel 1 uses monopolar derivation with the reference electrode attached to the nose.

Figure 4.3 – Electrode positioning of the single-electrodes setup resulting in six EMG channels. Reference electrodes for monopolar derivation are placed on the nose (channel 1) and on the mastoid portion of the temporal bone (channels 3, 4 and 5).

4.1.3 Electrode-Array Setup

The data acquisition using electrode arrays is a further step towards practical usage. Compared to the attachment of single electrodes, the setup is much easier and less time consuming. In addition, a larger number of EMG channels is available due to the large number of electrodes.

Figure 4.4 – Left side: Electrode array positioning: the electrode grid is positioned on the cheek, the small array under the chin. Right-hand side: Channel numbering of the 4 × 8 cheek electrode grid, and the 8-electrode chin array.

For array-based EMG recording we use the multi-channel EMG amplifier EMG-USB2, produced and distributed by OT Bioelettronica [Ot2]. The EMG-USB2 amplifier allows recording and processing of up to 256 EMG channels, supporting a selectable gain of 100 – 10,000 V/V, a recording bandwidth of 3 Hz – 4,400 Hz, and sampling rates from 512 to 10,240 Hz. After some initial experiments, we chose an amplification factor of 1,000 V/V, a sampling frequency of 2,048 Hz, a high-pass filter with a cutoff frequency of 3 Hz and a low-pass filter with a cutoff frequency of 900 Hz. These values follow the suggestions in [FC86]. For power line interference reduction and avoidance of electromagnetic interference, we use the amplifier-integrated Driven Right Leg (DRL) noise reduction circuit [WW83]. To activate the DRL noise reduction circuitry, a ground strip is connected to the recording participant at a position with no bioelectrical activity (wrist or ankle). Using the provided cable, the strip is connected to the DRL IN connector. An additional ground strip must be connected at a point with no bioelectrical activity on the subject and to the DRL OUT connector. To ensure that we have no interference from the muscles of the right hand when pushing the recording buttons, we attach all strips to the left wrist. In our recording setup, we always use the DRL noise reduction circuitry.
The electrode arrays and the electrolyte cream were acquired from OT Bioelettronica as well. The cream is applied to the EMG arrays in order to reduce the electrode/skin impedance. We capture signals with two arrays, as illustrated in Figure 4.4: one cheek array consisting of four rows of 8 electrodes with 10 mm inter-electrode distance (IED), and one chin array containing a single row of 8 electrodes with 5 mm IED. The original electrode array consists of 8 rows of 8 electrodes, which is too disturbing to the speaker when the array is attached to the face. We therefore cut off the 2 outer rows of the array, so our modified array contains 4 × 8 electrodes. The numbering of the final electrode channels is depicted in Figure 4.4. We use a bipolar derivation mode, thus the signal differences between two adjacent channels in a row are calculated. We thereby obtain a total of 35 signal channels out of the 4 × 8 cheek electrodes and the 8 chin electrodes [WSJS13].

EMG-Array-positioning

Processing speech by means of EMG signals requires signals at least from the cheek area and the throat, where tongue activity can be captured. Also, we rely on the fact that articulatory activity is rather symmetric on both sides of the face, so that recording one side of the face suffices. It has turned out to be very difficult to capture signals in the close vicinity of the mouth, since neither single electrodes nor electrode arrays can be attached there reliably. Since the single-electrode positions were carefully selected and evaluated in [MHMSW05], the array positioning was based on these prior single-electrode positions. The smaller 8-electrode chin array is used mainly for recording the activity of the tongue muscles and is thus attached to the speaker's neck, where only little variation of the positioning is possible.
We tested several different positions for the cheek electrode grid with respect to both the signal quality and the interference with the subject's speech production. The cheek array should not be attached too close to the speaker's mouth so as not to interfere with the speech production process. Furthermore, we observed that it was difficult to attach the array firmly to the skin, which resulted in movement artifacts. The cheek array was always placed approximately 1 cm away from the corner of the mouth and then rotated around this focal point for further analysis. We started with a nearly vertical position of the array and then rotated it by approximately 20 degrees clockwise for each new experiment. After 40 degrees of rotation the subject was negatively affected by the recording cable connection and we stopped. Thus, we focused on these three recording settings: vertical, 20 degree rotation, 40 degree rotation (illustrated in Figure 4.5). We tested this setup on three speakers. Two of the three clearly preferred the middle position, and since this position resulted in the least interference with the speaking process, we used this position for the final attachment of the electrode grid.

Figure 4.5 – The three different positionings of the cheek electrode array: vertical (left side), 20◦ (middle), 40◦ (right-hand side).

4.2 EMG Data Corpora

Using the two EMG recording setups introduced above, we grouped the recording sessions based on the amount of recorded data. We additionally recorded a simplified data corpus, which consists of CVCVCVCV nonsense words, where C and V represent selected consonants and vowels. This gives us the opportunity to have a closer look at the signal characteristics and enables a more fine-grained investigation than using whole sentences. Thus, we use three different corpora in this thesis, of which only the latter two are used for the EMG-to-speech transformation:
EMG-Array-AS-CV Corpus consisting of audible and silent multichannel array-based nonsense words.
EMG-Array-AS-200 Corpus consisting of 200 audible and silent multichannel array-based EMG utterances.

EMG-ArraySingle-A-500+ Corpus consisting of at least 500 audible utterances from array-based and single-electrodes recordings.

4.2.1 EMG-Array-AS-CV Corpus

The EMG-Array-AS-CV Corpus consists of simplified acoustic and EMG data. The spoken utterances are based on the mngu0 articulatory-phonetic corpus [RHK11]. One male recording subject, who had already participated in multiple EMG recordings, speaks CVCVCVCV nonsense words, with C and V representing a selected consonant and vowel that stay constant during one utterance. Two sessions with audible and silent data from this speaker were recorded on different days. We selected five different vowels (/a/, /e/, /i/, /o/, /u/) in combination with nine different consonants that can be grouped as follows:
Plosives: /b/, /d/, /k/, /t/
Nasals: /m/, /n/
Vibrants: /r/
Approximants: /l/
Fricatives: /s/
This results in 9 · 5 = 45 audible and 45 silent speech utterances, giving a total of 90 utterances per session. The second recording session follows the same setup.

4.2.2 EMG-Array-AS-200 Corpus

The EMG-Array-AS-200 corpus consists of normal and silent read speech of five recorded participants. Three of these subjects recorded multiple sessions on different days to investigate intra-speaker properties. The other subjects recorded one session each. In total, the corpus consists of 12 recording sessions. The participants were four male and one female speakers, with ages ranging from 21 to 31 years.
Each recording session consisted of one en-bloc recording of 200 phonetically balanced English sentences spoken in audible speech mode and one en-bloc recording of the same 200 sentences in silent speech mode, i.e. the sentences are identical for both the audible and the silent speaking mode. The order of the speaking mode recordings was switched between speakers to eliminate any possibility of ordering effects. We used three different sets of sentences, where each speaker recorded the first sentence set in his or her first recording session. Each sentence set consists of 200 phonetically balanced English sentences originating from the broadcast news domain, with 20 sentences being spoken twice; the three sets of sentences were chosen to have similar properties in terms of phoneme coverage, length, etc.
The total set of recorded data was divided into three sets for training, development (parameter tuning), and evaluation (test) of the forthcoming EMG-to-speech mapping. Of each session, we set aside 15 % of the corpus as development set and 15 % as evaluation set. Table 4.1 lists the corpus data sizes for the three subsets broken down by speakers.

Table 4.1 – Details of the recorded EMG-Array-AS-200 data corpus, consisting of normally and silently articulated speech data.

Speaker  Sex   Normal speech [mm:ss]        Silent speech [mm:ss]
               Train    Dev     Eval        Train    Dev     Eval
S1-1     m     7:00     1:28    1:25        6:51     1:25    1:27
S1-2     m     7:34     1:34    1:32        7:30     1:34    1:29
S1-3     m     7:25     1:31    1:31        7:31     1:34    1:32
S1-4     m     7:45     1:37    1:33        7:48     1:38    1:34
S2-1     m     7:08     1:30    1:29        7:40     1:34    1:36
S2-2     m     7:04     1:29    1:26        7:54     1:38    1:36
S3-1     m     7:10     1:30    1:24        5:50     1:15    1:11
S3-2     m     7:33     1:41    1:43        6:37     1:32    1:30
S3-3     m     7:19     1:31    1:42        5:33     1:09    1:18
S3-4     m     7:43     1:37    1:35        5:31     1:11    1:09
S4-1     f     6:59     1:27    1:27        6:40     1:23    1:22
S5-1     m     9:27     2:00    2:00        8:53     1:58    1:51
Total          90:07    18:55   18:47       84:18    17:51   17:35

4.2.3 EMG-ArraySingle-A-500+ Corpus

While the EMG-Array-AS-200 corpus is designed to foster the investigation of multiple sessions and the different speaking modes silent and audible speech, the EMG-ArraySingle-A-500+ corpus aims at providing large amounts of data from few speakers to study high-performance session-dependent systems. We therefore focus on few sessions with a large amount of recorded data.

Since the usage of wet electrodes allows only a limited recording time (due to quality degradation) and since the concentrated, consistent engagement of a speaker also lasts only a limited amount of time, we restrict the speaking mode to normal audible speech. The corpus consists of six sessions from three speakers with different amounts of data. Four sessions comprise around 500 phonetically balanced English utterances that are based on [SW10] and consist of the same utterances as the EMG-Array-AS-200 corpus. Two larger sessions exist, which additionally incorporate utterances from the Arctic [KB04] and TIMIT [GLF+93] corpora, giving a total of 1,103 utterances for the smaller and 1,978 utterances for the bigger of these two large sessions.
As before, the corpus data was divided into three sets for training, development, and evaluation. Of each session, we set aside 10 % for development data. A set of 10 sentences (plus repetitions) was selected for evaluation. Those were identical for all speakers and all sessions and also included in the EMG-Array-AS-200 corpus for comparison reasons. Table 4.2 gives detailed information about the durations of the recorded utterances of the EMG-ArraySingle-A-500+ corpus.

Table 4.2 – Details of the recorded EMG-ArraySingle-A-500+ data corpus, consisting of normally articulated speech data with at least 500 utterances.

Session            Sex   Length [mm:ss]                # of utterances
                         Train     Dev     Eval        Train    Dev    Eval
Spk1-Single        m     24:23     02:47   01:19       450      50     20
Spk1-Array         m     28:01     03:00   00:47       450      50     10
Spk1-Array-Large   m     68:56     07:41   00:48       984      109    10
Spk2-Single        m     24:12     02:42   00:49       447      49     13
Spk2-Array         m     22:14     02:25   01:10       450      50     20
Spk3-Array-Large   f     110:46    11:53   00:46       1,771    196    10
Total                    278:32    30:28   05:39       4,552    504    83

4.3 Whisper Corpus

For investigations of prosodic differences between acoustic data, we additionally recorded a database with normal audible and whispered read speech of 9 persons, entitled the WA-200 Corpus. Each person recorded 200 sentences per speaking mode, i.e. a total of 400 sentences. The corpus consists of 200 phonetically balanced English sentences originating from the broadcast news domain, with 20 sentences being spoken twice. All sentences were recorded using a standard close-talking head-mounted microphone (Andrea Electronics NC-8) and a Samsung Galaxy S2 smartphone microphone in parallel. The phone's noise cancellation setting was applied. The head-mounted microphone was positioned about 3 cm to the right of the test person's mouth. The speakers were instructed to hold the smartphone with their left hand as though they were making a regular phone call, so that there was no interference with the head-mounted microphone. The software used and the general process of recording are the same as in the EMG-based recordings.
While the head-mounted microphone recordings were automatically segmented by the recording software, the smartphone recordings had to be segmented after the recording. This was realized by emitting different sine beep sounds at the start and end of every utterance. These beeps were recorded by the smartphone microphone and later used to semi-automatically segment the recordings by searching for the start and end beeps. We additionally manually rechecked every recording session in order to correct minor segmentation errors.
For the data collection, six male and three female speakers were recorded, with ages ranging from 21 to 33 years.

Table 4.3 – Details of the recorded WA-200 data corpus, consisting of normally articulated and whispered speech data.

Speaker  Sex   Normal speech [mm:ss]        Whispered speech [mm:ss]
               Train    Dev     Eval        Train    Dev     Eval
101      m     08:53    1:49    1:53        08:33    1:44    1:45
102      m     10:23    2:03    2:08        10:40    2:09    2:16
103      m     07:25    1:28    1:33        08:24    1:41    1:46
104      m     09:14    1:50    1:56        09:56    2:01    2:05
105      m     07:46    1:33    1:38        08:42    1:42    1:51
106      m     08:34    1:42    1:50        09:01    1:46    1:56
107      f     08:52    1:48    1:52        08:50    1:47    1:52
108      f     08:17    1:39    1:42        09:17    1:50    1:55
109      f     10:37    2:09    2:14        10:13    2:04    2:09
Total          80:01    16:01   14:46       83:36    16:44   17:35

Each recording session consisted of one en-bloc recording of 200 phonetically balanced English sentences spoken in audible speech and one en-bloc recording of 200 phonetically balanced English sentences spoken in whispered speech. The order of both recording blocks was switched between speakers to eliminate any possibility of ordering effects. In order to assure a consistent pronunciation of words, all recording sessions were supervised by a recording assistant. Finally, the total set of recorded data was divided into three sets for training, development, and evaluation. For both speaking modes, we set aside 15 % of the corpus as development set and 15 % as evaluation set, with identical utterances in both the audible and the whispered part. In order to allow for speaker-dependent as well as speaker-independent experiments, the three sets are shared across speakers, i.e. each speaker appears in the training, development and evaluation set. Table 4.3 lists the corpus data sizes for the three subsets broken down by speakers.

Chapter 5

Challenges and Peculiarities of Whispered and EMG-based Speech

The relation between articulatory muscle contraction and acoustic speech has been investigated for several decades [TM69, GHSS72, KP77, Hir71], and even silent articulation of speech was already investigated in 1957 [FAE58]. It is therefore an important step to take a closer look at the individual peculiarities of the EMG signal and to quantify the relationship between acoustic and EMG signals.

5.1 Relations between speech and EMG

For a first signal analysis we used data from the simplified spoken utterances described in Section 4.2.1. Investigating complex sentences from the broadcast news domain makes a signal analysis more complicated than using simpler sentences consisting of basic consonant and vowel phrases. Additionally, when using complex sentences, obtaining phone labels introduces a source of errors and is thus less reliable. For an improved inspection of the EMG data, we filter out unwanted noise. To this end, we use a spectral subtraction technique [Bol79], an approach that is usually used to suppress acoustic noise in speech recordings.

The first part of the recording (250 ms), which belongs to a pause before the subject starts to speak, is considered to be noise. This noise is assumed to be stationary and can thus be subtracted in the frequency domain: the power spectrum of the 250 ms noise signal is calculated and subtracted to give the estimated clean signal spectrum. The estimated power spectrum is combined with the original phase information, and an application of the inverse Fourier transform gives the final estimated signal.
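The following is a minimal sketch of this cleaning step, assuming a single EMG channel; the STFT frame length, overlap and flooring constant are illustrative choices (the thesis only specifies the 250 ms noise window), and scipy is our choice of library.

```python
# Hedged sketch of spectral subtraction on one EMG channel.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(emg, fs, noise_dur=0.25, floor=1e-3):
    nperseg, noverlap = 256, 192
    f, t, X = stft(emg, fs=fs, nperseg=nperseg, noverlap=noverlap)
    # Frames within the leading 250 ms pause are treated as stationary noise.
    noise_frames = X[:, t <= noise_dur]
    noise_power = (np.abs(noise_frames) ** 2).mean(axis=1, keepdims=True)
    # Subtract the noise power spectrum, floor negative values, keep the original phase.
    clean_power = np.maximum(np.abs(X) ** 2 - noise_power, floor * noise_power)
    X_clean = np.sqrt(clean_power) * np.exp(1j * np.angle(X))
    # Inverse STFT with overlap-add yields the estimated clean signal.
    _, emg_clean = istft(X_clean, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return emg_clean[: len(emg)]
```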

Figure 5.1 – Process flow of spectral subtraction used for EMG signal cleaning (noisy signal → framing/Fourier transformation → subtraction of the estimated noise spectrum, keeping the phase information → inverse Fourier transformation with overlap-add → cleaned signal).

Figure 5.1 shows the process flow of the spectral subtraction, while Figure 5.2 demonstrates an exemplary application to our EMG data. Note that due to the small inter-electrode distance many channels show redundant signals. For simplicity we show only 6 different channels: channels 2 and 5, which belong to the chin electrode array, plus channels 11, 16, 26 and 32 from the electrode grid on the cheek (see Chapter 4.1.3 for details on the EMG-array setup). It is clearly visible how spectral subtraction helps for the visual signal inspection. As a first investigation we compared the simultaneously recorded acoustic signal and the EMG signal. Both signals are synchronized using an additional marker signal to ensure parallel signals. Figure 5.3 gives an example for the utterance “babababa”, showing the spectrogram of the acoustic signal, plus one EMG channel from the chin array (channel 2) and one EMG channel from the cheek array (channel 19).

Figure 5.2 – EMG signals from (top to bottom) channels 2, 5 (chin array) and 11, 16, 26, 32 (cheek electrode grid) of the utterance “babababa”. Raw data before (left) and after (right) applying spectral subtraction.

A couple of observations can be made:

• Both EMG signals clearly differ from each other, especially regarding onset and offset.

• The opening of the jaw is more strongly represented in chin-array channel 2, which can be seen in the onset/offset similarities between channel 2 and the acoustic channel. During the articulation of “a”, the jaw muscles contract, which is reflected in the EMG signal of channel 2.

• The EMG signals precede the acoustic signal, e.g. at the beginning of the utterance: while the recording starts directly with an amplitude rise in EMG channel 19, the first “b” can be acoustically perceived only after approximately 200 ms. In general, the delay between the detected onset of muscle activity and the actual realization of force is known as the electromechanical delay (EMD) [CK79]. This delay and its effect on the acoustics-to-EMG delay will be discussed in Section 5.1.1.

5.1.1 Delay between EMG and speech signal

This section investigates the delay directly on the signal level. As early as 1956 it was reported [FAB56] that the invasively recorded laryngeal EMG signal occurs 0.3 s before the acoustic signal is recorded by the microphone.

Figure 5.3 – Spectrogram of the audio (top) and the raw EMG signals of cheek-array channel 2 (middle) and chin-array channel 19 (bottom) from the utterance “babababa”. The first two occurrences of “b” and “a” are marked by vertical lines.

Later work [JSW+06] empirically estimated the delay between the articulatory EMG signal and the simultaneously recorded speech signal. The authors varied the amount of delay and explored which value achieved the best classification results with an EMG-based speech recognizer; the result was a delay of 50 ms. Scheme et al. [SHP07] also examined this anticipatory EMG effect and stated that “coarticulation effects may even be exaggerated in MES due to anticipatory muscle contractions which occur while prepositioning muscles for articulation.” They also varied the offset and measured the classification accuracy, which peaks at a delay of 100 ms. Chan et al. [CEHL02] investigated the delay using a temporal misalignment between 400 ms and 600 ms for the labels, in steps of 25 ms. On a ten-word vocabulary an LDA and a Hidden Markov Model (HMM) classifier were tested on two subjects. The best EMG-based speech recognition results were achieved with a 500 ms delay. While the temporal misalignment has only a small effect on the HMM classifier, the LDA results become significantly worse when the alignment deviates from the 500 ms delay. The general effect of the temporal delay between the detected onset of muscle activity and the realization of force is known as the electromechanical delay (EMD) [CK79]. It is investigated in a more sophisticated analysis, the DIVA model (Directions Into Velocities of Articulators) [GGT06], which gives explanations for the time it takes for an action potential in a motor cortical cell to affect the length of a muscle. The authors divide this amount of time into two components: (1) the delay between motor cortex activation and activation of a muscle as measured by EMG, and (2) the delay between EMG onset and muscle length change. The first delay is measured according to Meyer et al. [MWR+94] to be 12 ms, while the authors measure a 30 ms delay for the second component using electromyographic observations on the tongue. Thus, a total delay of 42 ms between motor cortical activation and the resulting change in muscle length is used in the model.

Taking a biologically motivated look at this topic, we can estimate the speed of sound at 340 meters per second and the length of the human vocal tract at 0.2 m. This results in a propagation delay of 0.588 ms. Hence, the pure acoustic signal propagation time can be considered a minor influence on the EMG-to-acoustic signal delay, even when the onset of innervation is assumed at its pulmonary origin.

Thus, the main impact can be attributed to the electromechanical delay (EMD) [CK79]. However, determining the EMD is not trivial [CGL+92]: it is highly muscle-dependent and may lie between 30 ms and 100 ms, depending also on whether the contraction is initiated from a resting or a non-resting level [CK79].

Based on these delay observations from the literature, we follow [JSW+06] in using a delay of 50 ms between the EMG and the acoustic signal. This value is in accordance with the described electromechanical delay effect and is consistent with our own signal analyses.

5.2 Relations between different speaking modes

While we already introduced the definitions of the investigated speaking modes in Section 2.5.2, we now investigate their properties using the EMG-Array-AS-CV corpus and the WA-200 corpus containing whispered speech. This especially includes a comparison on the EMG signal level.

5.2.1 Whispered Speech

Comparison to Normal Speech There is a large body of research on acoustic differences between whispered and normal speech (see Section 2.5.2 for details on whispered speech production). According to [ITI05] there is an upward shift of the formant frequencies of vowels in whispered speech compared to normal speech, while voiced consonants have lower energy at frequencies up to 1.5 kHz. Jovičić and Šarić [JŠ08] quantify the power differences between whispered and normal speech: while unvoiced consonants show only minor power differences (max. 3.5 dB), the power of voiced consonants is reduced by 25 dB in whispered speech.

Tran et al. [TMB13] created an audiovisual corpus to analyze whispered speech. They observed that most of the visual differences between the speaking modes appear in lip spreading, while articulation in the rest of the orofacial area is largely preserved during whispered speech.

Quantification of Difference To quantify the acoustic discrepancy between whispered and normally spoken speech, we perform a first evaluation on the recorded Whisper Corpus WA-200 introduced in Section 4.3. We compute the Mel Cepstral Distortion (MCD) between the MFCCs of the original whispered utterances and the corresponding normally spoken utterances, averaged over all utterances of each speaker. The MCD is computed on a frame-by-frame basis. To align the frames of the normally spoken utterances to the corresponding frames of the whispered utterances, Dynamic Time Warping (DTW) alignment (see Section 2.8) is used. The results are presented in Table 5.1.
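A sketch of this evaluation step is given below. The MCD constant and the exclusion of the 0th cepstral coefficient follow the common convention; the thesis defines MCD in Chapter 2, so treat these details, and the use of librosa for DTW, as assumptions.

```python
# Frame-wise MCD between two utterances after DTW alignment (hedged sketch).
import numpy as np
import librosa

def mel_cepstral_distortion(mcep_ref, mcep_test):
    """MCD in dB between two cepstral sequences of shape (frames, dims)."""
    # Align the normally spoken frames to the whispered frames via DTW.
    _, wp = librosa.sequence.dtw(X=mcep_ref.T, Y=mcep_test.T, metric='euclidean')
    ref, test = mcep_ref[wp[:, 0]], mcep_test[wp[:, 1]]
    # Skip the 0th (energy) coefficient, as is common for MCD.
    diff = ref[:, 1:] - test[:, 1:]
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)
    return float(np.mean(const * np.sqrt((diff ** 2).sum(axis=1))))
```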

The high MCD values are not surprising, since the distortion is calculated between normal audible speech and the whispered recordings without applying any mapping strategy. These numbers are meant to serve as baselines for further investigations. The range between speakers is also quite large, varying between 8.17 and 10.69 with the head-mounted microphone and between 6.44 and 8.51 with the smartphone, indicating that the style of whispering may be rather individual.

The MCD values of the smartphone recordings are significantly lower. Listening to individual whispered utterances indicates a better intelligibility of the whispered smartphone recordings. We assume that the noise suppression technology and similar integrated voice amplification in the smartphone help to achieve these good results for whispered speech.

Table 5.1 – Mel Cepstral distortions (MCD) between whispered and normal speech of the WA-200 corpus, per speaker.

Speaker   MCD (Headset)   MCD (Smartphone)
101       8.17            6.44
102       9.23            7.12
103       10.55           8.37
104       10.09           8.46
105       10.39           8.22
106       9.53            8.29
107       10.11           7.75
108       10.69           8.51
109       10.06           6.75
MEAN      9.87            7.77

5.2.2 Silent Speech

First myoelectric investigations on silent and audible speech were published in the 1950s [FAB56, FAE58], examining the delay between EMG and the acoustic speech signal. [Jan10] showed that no significant difference between the durations of normal and silently mouthed utterances exists, concluding that the missing acoustic feedback does not result in variations of the speaking rate. The same observation holds for our EMG-Array-AS-200 corpus, described in Section 4.2.2: analyzing the utterance durations of this corpus, some speakers tend to speak faster in silent mode, others clearly speak slower. This appears to be a speaker-dependent phenomenon, and no significant general speaker-independent difference can be established.

The same work showed significant differences in the amplitude and the spectral properties between silent and audible EMG. These differences were most prominent for speakers exhibiting consistently low accuracies when a silent speech recognition system is used. Figure 5.4 gives an example of the spectral curves of a “good” and a “bad” silent speaker: while a “bad” silent speaker shows high discrepancies between audible and silent EMG in the spectral domain (left), a “good” speaker shows very similar spectral curves (right). Figure 5.5 shows two channels (channel 2 from the cheek array, channel 19 from the chin electrode-grid) of the audible EMG and silent EMG of the utterance “babababa”.

Figure 5.4 – Power spectral densities of whispered/silent/audible EMG from a speaker with constantly high Word Error Rates (left) and from a speaker with low Word Error Rates (right).

Figure 5.5 – EMG signals of channels 2 and 19 from the normally articulated (top) and the silently mouthed (bottom) utterance “babababa”.

A visual comparison of the raw EMG signals reveals no fundamentally different properties. The same holds when a power spectral density comparison is performed for the recordings of the EMG-Array-AS-CV corpus. The similarities between audible and silent EMG can be explained by the fact that the speaker recorded in the EMG-Array-AS-CV corpus is very experienced in speaking silently; following the observations in [Jan10], this justifies the similar signal characteristics. The performance of silent versus audible EMG will be investigated in the following chapters.

Chapter 6
Mapping using Gaussian Mixture Models

This chapter presents a first feature transformation approach to convert EMG features into acoustic speech, based on Gaussian mixture models. This approach was originally introduced for voice conversion [SCM98], i.e. converting the acoustic speech of one speaker into the acoustic speech of another speaker. This thesis applies the approach to EMG-to-speech mapping. Additionally, we introduce an application where whispered input is transformed into normal audible speech. Figure 6.1 gives a high-level overview of the EMG-to-speech conversion system. We already discussed the types and backgrounds of the input EMG and speech signals and introduced the preprocessing steps that build the representative features used for the feature transformation. The first approach for generating MFCCs and F0 from input EMG features is entitled Gaussian Mapping (red box in Figure 6.1). Details and evaluations follow in this chapter.

6.1 Theoretical Background

Gaussian mixtures are based on multi-dimensional Gaussian distributions, i.e. Gaussian probability density functions, which are defined for d dimensions as follows:

Figure 6.1 – Structure of the proposed EMG-to-speech approach.

\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right) \qquad (6.1)

A Gaussian distribution is defined by its d-dimensional mean µ and its d × d-dimensional covariance matrix Σ, where |Σ| is the determinant of Σ. To increase the model accuracy a Gaussian mixture model λ is used, consisting of a weighted sum of M Gaussians. This defines a probability density function p(x) for a vector x as follows:

p(x) = \sum_{m=1}^{M} w_m \, \mathcal{N}(x; \mu_m, \Sigma_m) \qquad (6.2)

µ_m and Σ_m are the mean and covariance matrix of the m-th Gaussian, and w_m represents the associated mixture weight, which is subject to the following constraints:

w_m \ge 0 \quad \text{and} \quad \sum_{m=1}^{M} w_m = 1. \qquad (6.3)

With a sufficient number of mixture components, a Gaussian mixture can approximate any continuous distribution arbitrarily well. The parameters of a GMM λ = {w_m, µ_m, Σ_m} are usually learned by applying the Expectation Maximization (EM) algorithm [DLR77].
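As a small numerical illustration of Equations (6.1)–(6.3), the mixture density can be evaluated as a weighted sum of multivariate Gaussians; scipy is our choice here, and the toy parameters are purely illustrative.

```python
# Evaluating a GMM density p(x) = sum_m w_m * N(x; mu_m, Sigma_m).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Weighted sum of Gaussian pdfs; the weights are assumed to sum to 1 (Eq. 6.3)."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))

# Toy 2-D example with M = 2 mixture components.
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
print(gmm_density(np.array([1.0, 0.5]), weights, means, covs))
```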

6.1.1 General Gaussian Mapping Approach

The mapping approach based on Gaussian mixture models (GMMs) was first used by Stylianou et al. [SCM95] and further modified by Kain [KM98] to learn a joint distribution of source and target features. The approach was later also investigated by Toda et al. [TBT04], who applied it to articulatory-to-acoustic mapping. Its applicability for Silent Speech Interfaces based on Non-Audible Murmur (NAM) has been shown in previous studies [NTSS12]. Thus, we expect that the method is also applicable to EMG-to-speech conversion. The algorithm is frame-based and relatively efficient in terms of computation, so it can be applied continuously in close to real-time [TMB12]. No vocabulary or language constraints exist, making the algorithm flexible for use with data that was not previously seen in training. The mapping consists of a training and a conversion phase. The training stage needs parallel data from source and target features to train the GMMs, while the conversion stage only uses source feature data in the conditional probability density function to generate the estimated features.

GMM Training

Source and target feature vectors at frame t are defined as x_t = [x_t(1), \ldots, x_t(d_x)]^\top and y_t = [y_t(1), \ldots, y_t(d_y)]^\top, respectively, where d_x and d_y denote the dimensionalities of x_t and y_t. A GMM is trained to describe the joint probability density of source and target feature vectors as follows:

P(x_t, y_t \mid \lambda) = \sum_{m=1}^{M} w_m \, \mathcal{N}\!\left( [x_t^\top, y_t^\top]^\top ;\, \mu_m^{(x,y)}, \Sigma_m^{(x,y)} \right), \qquad (6.4)

\mu_m^{(x,y)} = \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix}, \qquad \Sigma_m^{(x,y)} = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix}, \qquad (6.5)

where \mathcal{N}(\cdot\,; \mu, \Sigma) denotes the Gaussian distribution with mean vector µ and covariance matrix Σ, m denotes the mixture component index, and M the total number of mixture components. The parameter set of the GMM is denoted by λ and consists of the weights w_m, the mean vectors µ_m^(x,y) and the full covariance matrices Σ_m^(x,y) of the individual mixture components. µ_m^(x) and µ_m^(y) are the mean vectors of the m-th mixture component for the source and the target features, respectively; Σ_m^(xx) and Σ_m^(yy) are the corresponding covariance matrices, and Σ_m^(xy) and Σ_m^(yx) the cross-covariance matrices. The GMM is trained by first running the K-means algorithm (see Chapter 2.7.1) on the training data and then refining the Gaussians with the Expectation Maximization algorithm [DLR77], an iterative method alternating expectation (E) and maximization (M) steps: in the E-step, the current parameters are used to compute the expected likelihood; in the M-step, the parameters are re-estimated by maximizing this likelihood. This is done either for a fixed number of iterations or until the difference in likelihood between two iterations falls below a given threshold. After training, the GMM describes the joint distribution P of source features x_t and target features y_t.

GMM Conversion based on Minimum Mean-Square Error

The conversion method is based on [SCM98, KM98] and is performed by minimizing the mean-square error

\mathrm{mse} = E\!\left[ \lVert y - \mathcal{F}(x) \rVert^2 \right], \qquad (6.6)

where E[\cdot] denotes the expectation and \mathcal{F} represents the feature mapping function. The conversion itself is defined by:

\hat{y}_t = \sum_{m=1}^{M} P(m \mid x_t, \lambda) \, E_{m,t}^{(y)}, \qquad (6.7)

where \hat{y}_t is the estimated target feature vector at frame t and

P(m \mid x_t, \lambda) = \frac{ w_m \, \mathcal{N}\!\left( x_t; \mu_m^{(x)}, \Sigma_m^{(xx)} \right) }{ \sum_{n=1}^{M} w_n \, \mathcal{N}\!\left( x_t; \mu_n^{(x)}, \Sigma_n^{(xx)} \right) },

E_{m,t}^{(y)} = \mu_m^{(y)} + \Sigma_m^{(yx)} \Sigma_m^{(xx)\,-1} \left( x_t - \mu_m^{(x)} \right).

P(m \mid x_t, \lambda) is the conditional probability of the m-th Gaussian component given the feature vector x_t and can be calculated by applying Bayes' theorem.

The remaining unknown elements are µ_m^(y) and Σ_m^(yx), which can be computed by solving a least-squares problem based on the relations between source and target data; details can be found e.g. in Stylianou et al. [SCM98]. Note that there are variations in the implementation of the covariance matrices, such as using diagonal covariances to reduce the computational load. Since we are interested in the best possible output and since we map from EMG features to acoustic features, we apply full covariance matrices.
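The sketch below illustrates both the joint-density training (Eqs. 6.4–6.5) and the MMSE conversion (Eq. 6.7). sklearn/scipy stand in for the actual implementation; the mixture count, array shapes and the lack of numerical safeguards (e.g. log-domain posteriors) are simplifications for illustration only.

```python
# Joint GMM training and MMSE-based conversion (hedged sketch).
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src, tgt, n_mix=64):
    """Fit a full-covariance GMM on stacked parallel frames z_t = [x_t^T, y_t^T]^T."""
    joint = np.hstack([src, tgt])
    return GaussianMixture(n_components=n_mix, covariance_type='full',
                           init_params='kmeans', max_iter=100).fit(joint)

def convert_mmse(gmm, X_src, d_x):
    """Map source frames X_src (T, d_x) to estimated target frames, cf. Eq. (6.7)."""
    mu_x, mu_y = gmm.means_[:, :d_x], gmm.means_[:, d_x:]
    S_xx = gmm.covariances_[:, :d_x, :d_x]
    S_yx = gmm.covariances_[:, d_x:, :d_x]

    # Posterior P(m | x_t, lambda) from the marginal source-side GMM.
    resp = np.stack([w * multivariate_normal.pdf(X_src, mean=mx, cov=sx)
                     for w, mx, sx in zip(gmm.weights_, mu_x, S_xx)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)

    # Posterior-weighted conditional means E_{m,t}^{(y)}.
    Y_hat = np.zeros((len(X_src), mu_y.shape[1]))
    for m in range(gmm.n_components):
        cond = mu_y[m] + (X_src - mu_x[m]) @ np.linalg.solve(S_xx[m], S_yx[m].T)
        Y_hat += resp[:, m:m + 1] * cond
    return Y_hat
```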

6.1.2 Maximum Likelihood Parameter Generation

There are several modifications of the general Gaussian Mapping approach introduced above. One prominent extension is the consideration of time-contextual information in the target acoustic domain, realized in an implementation known as Maximum Likelihood Parameter Generation (MLPG) [TBT07]. Instead of using only the source and target vectors x_t and y_t, additional context information in the form of delta features ∆y_t = (y_{t+1} − y_{t−1})/2 is included to estimate the target parameter trajectory. The training is only slightly adapted, using X_t = [x_t^\top, ∆x_t^\top]^\top and Y_t = [y_t^\top, ∆y_t^\top]^\top and an EM training criterion that takes the whole feature sequence into account rather than single frames. The mapping itself is performed using maximum likelihood estimation [TBT07]:

\hat{y} = \arg\max_{y} P(Y \mid X, \lambda), \qquad (6.8)

where X and Y are the source and target feature matrices (consisting of the feature vectors together with their delta features) and λ represents the set of GMM parameters.
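A minimal sketch of the delta-feature construction used for these stacked vectors is given below; replicating the edge frames at the utterance boundaries is an assumption, as the thesis does not specify the boundary handling.

```python
# Append delta features Δy_t = (y_{t+1} − y_{t−1}) / 2 to each static frame.
import numpy as np

def add_deltas(feats):
    """feats: (T, d) static features -> (T, 2d) statics plus delta features."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode='edge')
    deltas = (padded[2:] - padded[:-2]) / 2.0        # (y_{t+1} - y_{t-1}) / 2
    return np.hstack([feats, deltas])
```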

6.2 Mapping on Whispered Data

In a first experiment, we train session-dependent whisper-to-normal conversion systems for all speakers in our Whisper WA-200 data corpus (see Chapter 4.3). We trained a Gaussian mapping for the MFCC features, i.e. a mapping which takes MFCCs of whispered speech as source features and outputs MFCC features of audible speech as target features. In the final vocoding step, the generated MFCC features and the generated F0 contour are converted into an audible sound file using the MLSA filter. In a first step, we evaluated the systems with the Mel Cepstral Distortion (MCD) measure. To align the frames of the normally spoken utterances to the corresponding frames of the whispered utterances, DTW alignment was applied to the corresponding MFCCs. Table 6.1 shows the MCDs for this whisper experiment. We calculate the MCD between the aligned MFCCs of the source whispered utterances and the target audible utterances and define this value as the baseline MCD of this experiment, resulting in an average MCD of 7.77 for this data corpus.

Table 6.1 – Evaluation results between whispered and normal speech: baseline Mel Cepstral Distortions (MCD) and Gaussian mapping results (MCDs, Unvoiced/Voiced (U/V) accuracies and F0 correlation coefficients r).

Speaker   Baseline MCD   Gaussian Mapping
                         MCD    U/V Acc.   r
101       6.44           4.79   83.5       0.669
102       7.12           5.16   84.8       0.743
103       8.37           5.03   76.3       0.554
104       8.46           5.21   76.1       0.553
105       8.22           5.25   80.5       0.633
106       8.29           5.71   73.6       0.472
107       7.75           5.35   76.5       0.580
108       8.51           5.66   65.7       0.430
109       6.75           5.02   78.1       0.626
MEAN      7.77           5.24   77.2       0.584

The results in Table 6.1 indicate that large improvements on the MCD criterion can be achieved with Gaussian Mapping for all speakers. A mean MCD improvement of 2.56 is achieved, which is a relative improvement of 32.9 %. Comparing the average MCD of 5.24 to the literature (e.g. Tran et al. [TBLT10] report 5.99 using whisper-like NAM-to-speech mapping), these are decent results, although it should be noted that a direct comparison is not possible due to a multitude of differing properties (e.g. speakers, setup, corpus).

6.2.1 Speaker-Independent Evaluation

To investigate the effect of speaker independence, we train 9 different speaker-independent systems with 16-mixture Gaussian mixture models (GMMs) in a leave-one-out cross-validation manner, which works as follows: for each system, we mark one speaker as the test speaker; training is performed on the training data of the remaining 8 speakers, and testing is performed on the development set of the defined test speaker. In this section we report the results of these speaker-independent systems with respect to the impact of adaptation using the Maximum Likelihood Linear Regression (MLLR) approach. The training data of the defined test speaker is used to adapt the means and covariances of the Gaussian mixture models. The data is additionally normalized using a simple scheme: the maximum absolute value of the amplitude is determined, and the signal is scaled by the inverse of this value, such that the absolute amplitude lies between 0 and 1. MLLR is an adaptation scheme for Gaussian mixture models which is specifically designed to “modify a large number of parameters with only a small amount of adaptation data” [GW96]. The implementation is based on the details given in [GW96]. MLLR consists of two key parts: in the first step, the Gaussian distributions are clustered by similarity. We perform the clustering by computing a regression tree, where each node defines a split of a set of Gaussians; by descending the tree, the set of all Gaussian distributions is split into increasingly fine-grained partitions. The split for each node is computed by applying K-means clustering (see Section 2.7.1 for details). In the second step, the regression tree allows adapting a set of similar Gaussians simultaneously: these Gaussians share their adaptation data, so that adaptation is performed even for a Gaussian that has not received any adaptation data at all. The results for the MLLR mean adaptation can be seen in Table 6.2. Normalizing the audio results in a slight MCD improvement. When we apply MLLR to the normalized audio data, we achieve an average MCD reduction of 0.58 compared to the un-normalized, un-adapted system, which is a relative improvement of 11.1 % compared to the session-dependent whisper-to-speech mapping and an improvement of 40 % compared to the baseline.
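For illustration, the leave-one-out setup and the max-abs normalization might look as follows; the random feature arrays and the omitted GMM training/adaptation calls are placeholders, not the actual WA-200 pipeline.

```python
# Leave-one-out speaker-independent setup with max-abs normalisation (sketch).
import numpy as np

def normalize_audio(wave):
    """Scale by the inverse of the maximum absolute amplitude (peak magnitude 1)."""
    peak = np.max(np.abs(wave))
    return wave / peak if peak > 0 else wave

rng = np.random.default_rng(0)
speakers = ['101', '102', '103', '104', '105', '106', '107', '108', '109']
# speaker -> (whispered features, normal features); placeholders for the real data
feats = {s: (rng.standard_normal((500, 25)), rng.standard_normal((500, 25)))
         for s in speakers}

for test_spk in speakers:                       # one SI system per held-out speaker
    train = [feats[s] for s in speakers if s != test_spk]
    X = np.vstack([x for x, _ in train])        # pooled whispered source features
    Y = np.vstack([y for _, y in train])        # pooled normal target features
    # A joint GMM would be trained on (X, Y) as in the earlier sketch; the held-out
    # speaker's training data is then used for MLLR mean/covariance adaptation
    # before testing on that speaker's development set.
```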

Table 6.2 – Whispered speech: speaker-dependent (SD) versus speaker-independent results with adaptation. Mel Cepstral Distortions (MCD), Unvoiced/Voiced (U/V) accuracies and F0 correlation coefficients (r).

Speaker   SD Gaussian Mapping           + Adaptation
          MCD    U/V-Acc.   r           MCD    U/V-Acc.   r
101       4.79   83.5       0.669       4.11   83.9       0.687
102       5.16   84.8       0.743       4.43   86.0       0.757
103       5.03   76.3       0.554       4.43   77.7       0.585
104       5.21   76.1       0.553       4.67   79.2       0.620
105       5.25   80.5       0.633       4.59   80.9       0.640
106       5.71   73.6       0.472       5.00   73.2       0.533
107       5.35   76.5       0.580       4.99   76.7       0.632
108       5.66   65.7       0.430       5.12   71.4       0.488
109       5.02   78.1       0.626       4.58   78.0       0.640
MEAN      5.24   77.2       0.584       4.66   78.6       0.620

Increasing the amount of training data, even when using data from different speakers, helps to improve the results.

6.3 Mapping on EMG Data

In the training part we use the simultaneously recorded EMG and audio signals of the same utterance. Since these signals have been synchronously recorded, no alignment is needed for the audible EMG data. However, we delay the EMG signal by 50 ms relative to the speech signal in order to compensate for the electromechanical delay. The EMG-TD15 features are reduced to 32 dimensions via an LDA and combined with the MFCC features into a single sequence of feature vectors. Note that we use the same frame shift for both EMG and audio signals, so the EMG and audio feature sequences have an identical number of frames.
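A sketch of this feature preparation is shown below. The 10 ms frame shift, the array shapes and the use of sklearn's LDA are assumptions for illustration; the thesis only specifies the 50 ms delay, the TD15 features and the 32-dimensional LDA output.

```python
# Pair delayed EMG features with MFCC frames and reduce them via a phone-label LDA.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

FRAME_SHIFT_MS = 10
DELAY_FRAMES = 50 // FRAME_SHIFT_MS              # 50 ms electromechanical delay

def prepare_parallel_features(emg_td15, phone_labels, mfcc):
    """emg_td15: (T, d_emg); phone_labels: (T,); mfcc: (T, 25); same frame shift."""
    # EMG frame t is paired with the audio frame 50 ms later.
    emg = emg_td15[: len(emg_td15) - DELAY_FRAMES]
    labels = phone_labels[: len(emg)]
    audio = mfcc[DELAY_FRAMES:]

    # Supervised reduction of the EMG features to 32 dimensions.
    lda = LinearDiscriminantAnalysis(n_components=32).fit(emg, labels)
    return lda.transform(emg), audio             # parallel source/target sequences
```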

6.3.1 EMG-to-MFCC

Maximum Likelihood Parameter Generation (MLPG) Evaluation

The MLPG-based Gaussian Mapping variation introduced in Section 6.1.2 takes temporal dynamics into account. Thus, ∆-information is added, replacing the source frame x_t and the target frame y_t with X_t = [x_t^\top, ∆x_t^\top]^\top and Y_t = [y_t^\top, ∆y_t^\top]^\top.

Figure 6.2 shows the MCDs for the EMG-ArraySingle-A-500+ data sessions and compares MLPG (blue bars) against the baseline without MLPG (red bars). There is only a marginal difference between both approaches, resulting in an average MCD of 5.95 with MLPG versus 5.91 without MLPG. Only Speaker2-Array obtains an improved MCD, while the other five sessions show a degradation. Although other reports [TBT07] state promising improvements using contextual information with MLPG, we cannot gain a performance boost in terms of MCD. Listening to the resulting output also revealed no perceptual difference between the two approaches. We assume that the stacking of adjacent features in the EMG preprocessing already covers enough temporal information, so that MLPG can achieve no further improvement. We therefore perform no further investigations into MLPG and use the previously described frame-based Gaussian mapping approach.

Figure 6.2 – Comparison of the Maximum Likelihood Parameter Generation (MLPG) method on the EMG-ArraySingle-A-500+ corpus (Mel Cepstral Distortion per speaker-session, with and without MLPG).

Gaussian mixtures

In the next experiment we evaluate the number of Gaussian mixture components needed to obtain the best performance in terms of MCD. We vary the number of mixture components between 16 and 128 and train the Gaussian Mapping on the training data. Figure 6.3 shows the Mel Cepstral Distortions for these numbers of mixture components.

Figure 6.3 – Mel Cepstral Distortions with different numbers of Gaussian mixtures, evaluated on the EMG-ArraySingle-A-500+ corpus.

Three out of six sessions achieve their best results with 64 Gaussian mixture components, with only small differences to the results with 128 mixtures. Since there is only minimal improvement when the number of Gaussians is raised from 64 to 128, we did not increase the number of mixtures beyond 128. Speaker1-Array-Large achieves the best result with an MCD of 5.12, while Speaker3-Array-Large obtains an MCD of 6.21, showing a high MCD variation between speakers. When comparing array-based recordings with single-electrode recordings, no clear tendency towards an improvement can be seen: while speaker 1 achieves better MCDs with the array setup (5.6 versus 5.81), speaker 2 performs better with the single-electrode setup (5.43 versus 5.86).

Comparing these results to related work in the literature, we achieve competitive results. Hueber et al. [HBDC11] report an MCD of 6.2 using about 1,000 training utterances with a similar GMM-based mapping from video/ultrasound data to speech. Toth et al. [TWS09] use a GMM-based EMG-to-speech transformation and obtain an MCD of 6.37 with 380 utterances (48 minutes) of training data.

Dimensionality Reduction

Canonical Correlation Analysis Training an LDA to reduce the number of input dimensions requires phonetic alignments, which are obtained by a speech recognition system on the simultaneously recorded acoustic data. This process is a potential source of errors, since the alignments are never perfect, especially when it comes to myographic data. As a consequence, we investigate the application of canonical correlation analysis (CCA), a transformation that can use the MFCCs for supervised training instead of discrete phone labels. Similar to the feature reduction with LDA, we compute a subspace projection from the stacked TD15 features that maximizes the correlation between the TD15 features and the MFCCs (instead of labels) of the corresponding speech recording. Additionally, we stack the MFCC feature vectors with their 5 adjacent MFCCs (left and right) to include contextual information and to increase the dimension of the resulting subspace. Since the subspace dimensionality is determined by the dimensionality of the MFCC target data used for the supervised training, including the stacking results in an 11 · 25 = 275-dimensional subspace. We use only the 32 subspace components that yield the highest correlation; this number is also used for the final LDA feature reduction. Figure 6.4 shows the CCA results and the MCDs obtained with LDA feature reduction, both using Gaussian Mapping with 64 mixtures.
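A hedged sketch of this CCA-based reduction follows; sklearn's CCA stands in for the implementation actually used, and the edge-frame padding in the context stacking is an assumption.

```python
# CCA: correlate stacked TD15 features with context-stacked MFCCs, keep 32 components.
import numpy as np
from sklearn.cross_decomposition import CCA

def stack_context(feats, left=5, right=5):
    """Stack each frame with its neighbours: (T, d) -> (T, (left + 1 + right) * d)."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(left + right + 1)])

def fit_cca_reduction(emg_td15, mfcc, n_components=32):
    """Learn a 32-dimensional EMG subspace maximally correlated with the MFCCs."""
    cca = CCA(n_components=n_components, max_iter=1000)
    cca.fit(emg_td15, stack_context(mfcc))       # supervised by MFCCs, not phone labels
    return cca                                   # cca.transform(emg) -> 32-dim features
```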

Figure 6.4 – EMG-to-MFCC mapping results using Canonical Correlation Analysis (CCA) versus Linear Discriminant Analysis (LDA) preprocessing.

While the average MCD improves from 5.70 to 5.64, three out of six sessions perform better with LDA preprocessing. The best session-dependent result, from Spk1-Array-Large, also suffers a significant (p < 0.05) MCD degradation from 5.18 to 5.33, implying that the use of phone information still helps in producing low MCDs. However, with CCA feature reduction we not only remove the need for phone alignments, we even slightly improve the mean MCD, making this technique a capable replacement for LDA.

Silent Speech Data

So far all experiments on EMG-to-speech transformation described in this section were done on audible EMG data. Mapping on EMG signals that are recorded when the words are spoken non-audibly is a challenging topic. While the main attention of this thesis is dedicated to the feasibility and general improvements on a direct EMG-to-speech approach, we also want to explore real silent speech data.

For a first baseline system we use the Gaussian models that were trained on audible speech and evaluate the mapping on EMG signals from silent test data. Since there is no parallel acoustic data for the silent speech recordings, and thus no acoustic reference, we cannot compute the MCD as before. Instead, we compare the output with the audible data from the same session. This data is textually identical, but may of course differ since it was not recorded simultaneously with the input EMG data. Due to the different signal lengths, we align the synthesized silent data and the reference acoustic data using DTW. This DTW alignment also affects the final MCD, since the alignment is better than a simple frame-to-frame MCD comparison. Thus, it would be unfair to compare the MCDs of the converted silent data to the MCDs obtained in the previous EMG-to-MFCC experiments. We therefore use a slightly modified DTW-MCD for the audible EMG data as well, in which a DTW alignment is applied to the input before the MCD computation.

Although the converted audible EMG data (namely the resulting MFCCs) and the simultaneously recorded acoustic data have the same duration, a DTW with an MCD cost function gives better results than a simple frame-to-frame MCD.

Figure 6.6 shows the Gaussian Mapping DTW-MCDs resulting from three different mappings on audible and silent EMG data with 16 mixtures:

1. the baseline results using audible EMG for training and testing (Aud Train - Aud Test),

2. using audible EMG data for training and silent EMG data for testing (Aud Train - Sil Test),

3. using silent EMG data for training and testing (Sil Train - Sil Test).

The latter is not trivial, since there is no synchronously recorded acoustic data that can be taken as target data. Since the EMG-Array-AS-200 corpus contains audible and silent data per session, we can use the acoustic data and align it to the silent EMG data using DTW. Since a DTW distance metric between acoustic MFCC features and EMG features does not seem reasonable, we use the DTW path that is computed between silent and audible EMG features.

Figure 6.5 – Feature distances and the DTW paths (red line) calculated between acoustic MFCC features stemming from two utterances of the same sentence (left) and the DTW for the corresponding electromyographic power features (right).

To investigate the DTW alignment and to obtain a correct DTW path, we start by looking at the audible speaking mode. We use two different utterances of the same sentence. This has the advantage that we can use the DTW alignments created on the acoustic features and additionally compare them to the DTW path created between the EMG features. The “acoustic DTW” is commonly used in speech processing, so we can use it as a valid target for the novel “electromyographic DTW” when searching for well-performing features, parameters and channel combinations. We reduce the 35 EMG channels to 5 selected channels to decrease redundant information and to exclude artifact-prone channels. Two of these channels are taken from the inside area of the chin array, while the remaining three channels are taken from the middle of the electrode grid; we intentionally used no edge channels of the arrays, since they are prone to artifacts. We then use a simple power feature that is also included in our TD feature set (see Section 2.6.3 for feature details). Figure 6.5 shows the feature distances and DTW paths of the acoustic features of two utterances of the same sentence (left) and the DTW path of the corresponding electromyographic features (right). Although the two distance matrices stem from completely different feature domains (acoustic MFCCs versus EMG power), a similar DTW path is obtained, which suggests that a useful electromyographic alignment can also be computed for aligning silent and audible EMG. We thus use this electromyographic DTW technique to align the source silent EMG data to the target acoustic data, which is necessary for the training in the “Sil Train - Sil Test” experiment.
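A sketch of this “electromyographic DTW” is shown below. The channel indices, frame length/shift and the use of librosa are illustrative assumptions; only the idea of DTW over a per-channel power feature is taken from the text.

```python
# DTW between two EMG recordings based on a simple frame-power feature (sketch).
import numpy as np
import librosa

SELECTED_CHANNELS = [2, 5, 16, 19, 26]           # hypothetical subset of the 35 channels

def frame_power(emg, frame_len=540, frame_shift=160):
    """emg: (samples, channels) -> (frames, channels) mean signal power per frame."""
    n_frames = 1 + (len(emg) - frame_len) // frame_shift
    return np.stack([(emg[i * frame_shift: i * frame_shift + frame_len] ** 2).mean(axis=0)
                     for i in range(n_frames)])

def emg_dtw_path(emg_a, emg_b):
    """DTW path between two EMG recordings, computed on their power features."""
    pow_a = frame_power(emg_a[:, SELECTED_CHANNELS])
    pow_b = frame_power(emg_b[:, SELECTED_CHANNELS])
    _, wp = librosa.sequence.dtw(X=pow_a.T, Y=pow_b.T, metric='euclidean')
    return wp[::-1]                              # frame index pairs from start to end
```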

Figure 6.6 – Gaussian Mapping on silent speech data from the EMG-Array-AS-200 corpus. Results are computed using audible EMG train and silent EMG test data (red bar), silent EMG train and test data (blue bar), and audible EMG train and test data (yellow bar). Final Mel Cepstral Distortions are aligned using DTW.

The results of the three different mappings are depicted in Figure 6.6. It can easily be seen that using silent EMG data on models trained on audible EMG leads to a substantial MCD degradation, from an average MCD of 5.54 to 6.81, compared to the Aud Train - Aud Test scenario. The high discrepancy between silent and audible MCD results can also be observed within a single speaker. While previous work [JWS10] differentiated between “good” and “bad” silent speakers, the Gaussian mapping outputs indicate a more complicated, probably session-dependent, picture: regarding the MCDs of speaker 1, the first session differs by 0.7, while the remaining three sessions differ by at least 1.2. Using silent EMG on models that are also trained on silent EMG considerably reduces the average MCD from 6.81 to 6.35. However, it should be noted that since the audible reference data was not simultaneously recorded, the obtained MCD results are of limited validity.

6.3.2 EMG-to-F0

In previous work by Toth et al. [TWS09] the fundamental frequency (F0) of the simultaneously recorded audible speech was used for the final synthesis of the speech waveforms. For the final application of EMG-to-speech mapping, however, not only the spectral information but also the F0 information has to be estimated directly from the EMG data. We thus use the introduced Gaussian Mapping approach and replace the target MFCCs with F0 features. The data for the GMM training consists of EMG feature vectors as source data and F0 vectors as target data. In the conversion part, segmental feature vectors of the source EMG data are constructed in the same manner as in the training part. Figures 6.7 and 6.8 show the experimental results for different numbers of Gaussian mixtures on the EMG-ArraySingle-A-500+ corpus. We obtain an average correlation coefficient of 0.61, with the best session result of 0.65 on Spk1-Array. The mean voiced/unvoiced accuracy reaches 78.5 %, where Spk1-Array-Large attains an accuracy of 82 %. The results indicate that it is possible to estimate F0 contours from EMG data. Comparing this to the literature, where e.g. [TBLT08] generate F0 from whispered speech input with 93.2 % voiced/unvoiced accuracy and a correlation coefficient of 0.499, our mapping obtains worse voicedness results but a better correlation coefficient. [ARH14] presented EMG-based pitch generation from neck muscles only and reported a correlation coefficient of 0.87 and a voiced/unvoiced accuracy of 87 %.

Figure 6.7 – Voiced/Unvoiced accuracies for EMG-to-F0 Gaussian Mapping with different numbers of mixtures on the EMG-ArraySingle-A-500+ corpus.

Figure 6.8 – Correlation coefficients for EMG-to-F0 Gaussian Mapping with different numbers of mixtures on the EMG-ArraySingle-A-500+ corpus.

While the objective evaluation is a valid technique to investigate the performance of incremental changes in the system, it remains unclear how a final user judges the EMG-to-F0 conversion. Could omitting F0 perhaps also produce natural output? To investigate this hypothesis we conduct the following experiment: we use the MFCCs from the reference audio file and create three different “artificial” excitations:

1. the converted EMG-to-F0 output,

2. white Gaussian noise, resulting in unvoiced speech that sounds whisper-like, entitled 0 F0,

3. a constantly flat F0 contour, resulting in a robotic-like speech output.

We also add the unmodified reference speech recording (reference) and the re-synthesized reference recording (resynth). The latter is obtained by feeding the speech features (MFCCs + F0) extracted from the target audio to the MLSA filter to produce a “re-synthesized” reference. This reference contains the quality reductions introduced by the feature processing and vocoding steps and thus represents the best output we can achieve with our synthesis setup. A listening test is performed to investigate the users' judgment of naturalness: the participant is confronted with the question “How natural does the presented speech recording sound? Please rate between very unnatural and very natural.” The formulation of this question follows the naturalness test of the Blizzard Challenge [PVE+13]. A continuous slider, internally scaled to the interval between 0 and 100, is provided, and the five output variations EMG-to-F0, 0 F0, flat F0, reference and resynth are presented in randomized order. This setup combines the direct comparison of the five variations with the quantification to a final naturalness score. We randomly selected three different utterances from four different speakers, resulting in twelve utterances, i.e. each of the 12 utterances is synthesized in 5 variations. Thus, a total of 60 utterances are played to each participant. In total, 20 participants listened to 60 utterances each and evaluated the naturalness.

Figures 6.9 and 6.10 show the listening test results. While Figure 6.9 gives the mean naturalness scores for each participant, Figure 6.10 shows the direct comparison of the three F0 methods, discarding reference and resynth; thus, Figure 6.10 shows the number of times the respective method was preferred.

Some notable observations can be made from the listeners' judgments:

Figure 6.9 – Mean results from the naturalness listening test. 20 participants rated the naturalness from 0 (very unnatural) to 100 (very natural). The error bars represent 95 % confidence intervals.

Figure 6.10 – Results from the naturalness listening test: preference comparison of the three F0-generating methods (robotic-like flat F0, whisper-like 0 F0 and EMG-to-F0 mapping).

• The largest difference in naturalness is found between the reference (naturalness score 81 out of 100) and the resynthesized (score 41 out of 100) output.

• The EMG-to-F0 generation gives the significantly (p < 0.01) most natural output (naturalness score 28) among the assessed generation methods. 0 F0 obtained the second-best naturalness score with 22. In a direct comparison (Figure 6.10), averaged over all listened utterances, 59 % of the preferences went to the EMG-to-F0 mapping versus 35 % to the 0 F0 approach.

• Comparing the simple F0 generation techniques 0 F0 and flat F0, the unvoiced 0 F0 significantly (p < 0.001) outperforms the flat F0 results (naturalness score 22 versus 7), implying that the whisper-like output is regarded as more natural than the robotic-like F0 generation.

• 7 out of 20 participants prefer the whisper-like 0 F0 output over the EMG-to-F0 output. Four of them even rated it better than the resynthesized target.

• 10 out of 20 participants perceived the different approaches in the same order: reference and resynth clearly give the best naturalness, followed by EMG-to-F0, 0 F0 and flat F0 in descending order. Nonetheless, the variance across participants is quite high.

These results show a significant preference for the EMG-to-F0 mapping over simple artificial F0 generation techniques.

Chapter 7
Mapping using Unit Selection Approach

This chapter presents a second approach (see red box in Figure 7.1) for our proposed conversion system, based on Unit Selection. This corpus-based concatenation technique was introduced in the 1980s [Sag88] and has since become a popular approach for speech synthesis [HB96], known to produce output with high naturalness.

7.1 Theoretical Background

Unit Selection can be regarded as a corpus-based synthesis approach whose key idea is to use a speech corpus as an acoustic inventory from which a best-matching sequence of segments is selected during the final Unit Selection application. Appropriate subword units are selected from this inventory (referred to as the codebook) and concatenated to produce a final natural-sounding speech output. We first build a database consisting of source (e.g. EMG feature) and target (e.g. acoustic feature) segments. We extract segments with a fixed number of frames and refer to this number of frames as the unit width w_u. To create a larger number of segments in the database, we shift the segments by one frame at a time, rather than shifting by the whole unit. Together, the (source) feature segment and the associated speech segment (target) form one unit.

Figure 7.1 – Structure of the proposed EMG-to-speech approach.

In general, the goal of the proposed Unit Selection approach is to generate a natural-sounding waveform that resembles the original utterance. The basic entity of processing is called a “unit”. In the EMG-to-speech system a unit consists of simultaneously recorded EMG and audio segments, as illustrated in Figure 7.2.

Figure 7.2 – A basic Unit, consisting of source EMG and target MFCC segments used for the proposed Unit Selection approach.

Once the codebook is set up, the final application of the conversion can start. During this process the source feature test sequence is also split up into overlapping segments of the same width, as shown in Fig. 7.3. We also vary the shift between two consecutive segments, which we define as the unit shift s_u.

Figure 7.3 – Splitting of the test sequence into overlapping test segments, here with unit width w_u = 11 and unit shift s_u = 3.

The desired units are then chosen from the database by means of two cost functions: the target cost c_t between source EMG segments and the concatenation cost c_c between target acoustic segments [HB96]. The target cost measures the similarity between an input test segment and a candidate database unit. The concatenation cost gives the distance between two adjacent units to ensure that the candidate unit matches the previously selected unit. Figure 7.4 illustrates how the cost functions for selecting one unit are evaluated.

Figure 7.4 – Illustration of the combination of target (top) and concatenation (bottom) cost during Unit Selection.

In a simple metric, the target cost can be calculated as the mean Euclidean distance between the respective source EMG segments s_{test}^(t) and s_{db}^(t):

c_t = \frac{1}{w_u} \sum_{k=1}^{w_u} \sqrt{ \sum_{d=1}^{D_E} \left( s_{\mathrm{test}}^{(t)}(k, d) - s_{\mathrm{db}}^{(t)}(k, d) \right)^2 }, \qquad (7.1)

where D_E denotes the dimensionality of the source EMG features and s^(t)(k, d) denotes the d-th dimension of the k-th frame of the segment at time index t. For EMG-to-speech mapping our choice for the target cost is the cosine distance instead of the Euclidean distance, since it achieved better performance in preliminary experiments. The cosine distance between two d-dimensional feature vectors f^(1) and f^(2) is defined as:

d_{f,\cos} = 1 - \frac{ \sum_{i=1}^{d} f_i^{(1)} \, f_i^{(2)} }{ \sqrt{ \sum_{i=1}^{d} f_i^{(1)\,2} } \, \sqrt{ \sum_{i=1}^{d} f_i^{(2)\,2} } }, \qquad (7.2)

resulting in values in the interval [0, 2] (lower is better). The concatenation cost is based on the mel cepstral distance at the point of concatenation to ensure a proper smoothness of the acoustic output. It can also be interpreted as the scaled mean Euclidean distance between the overlapping frames of the target audio segments of two database units t_{db}^(t) and t_{db}^(t+1):

c_c = \frac{1}{o_u} \sum_{k=1}^{o_u} \sqrt{ \sum_{d=1}^{D_A} \left( t_{\mathrm{db}}^{(t+1)}(k, d) - t_{\mathrm{db}}^{(t)}(k + s_u, d) \right)^2 }, \qquad (7.3)

where D_A denotes the dimensionality of the target audio features and o_u = w_u − s_u is the number of overlapping frames of two units. If two units have a natural transition, meaning they originate from the same utterance with starting points i and i + s_u, they are favored because the overlapping frames are identical, resulting in a concatenation cost of 0. Additionally, a weight for the two cost functions is needed to balance naturalness and distinctiveness: if the relative weight of the concatenation cost is too high, the result is fluent speech that may differ heavily from the target phone sequence; if it is too low, the synthesized speech sounds choppy and unnatural. Since the value ranges are specific to the features, the optimal weights depend on the feature used for Unit Selection. The search for the optimal unit sequence is illustrated in Figure 7.5 and can be implemented as a Viterbi search through a fully connected network [Vit67]. The goal is to minimize the overall cost of the chosen sequence, i.e. the combination of target and concatenation costs. As an alternative search approach, we use a greedy selection of the lowest cost for each test unit. We implemented both search variations, but decided to evaluate the EMG-to-speech mapping with the greedy approach; this technique may give suboptimal results compared to a full Viterbi search, but we can still achieve decent results more quickly. After determining the optimal unit sequence, the overlapping audio segments are smoothed using a weight function as proposed by [WVK+13].
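The sketch below implements the two cost functions and the greedy search described above. Averaging the cosine distance of Eq. (7.2) over the frames of a segment, the cost weight and the codebook layout are illustrative assumptions.

```python
# Unit Selection costs (Eqs. 7.1-7.3, cosine target cost) and a greedy unit search.
import numpy as np

def cosine_distance(a, b):
    """Eq. (7.2), averaged over the frames of two segments (values in [0, 2])."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(1.0 - num / den))

def concat_cost(prev_target, cand_target, shift):
    """Eq. (7.3): mean Euclidean distance over the overlapping target frames."""
    overlap = len(prev_target) - shift
    diff = cand_target[:overlap] - prev_target[shift:]
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def greedy_select(test_segments, cb_source, cb_target, shift, w_concat=1.0):
    """Pick, for each test segment, the codebook unit with the lowest combined cost."""
    chosen, prev = [], None
    for seg in test_segments:
        costs = []
        for src, tgt in zip(cb_source, cb_target):
            c = cosine_distance(seg, src)
            if prev is not None:
                c += w_concat * concat_cost(prev, tgt, shift)
            costs.append(c)
        best = int(np.argmin(costs))
        chosen.append(best)
        prev = cb_target[best]
    return chosen
```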

Figure 7.5 – Illustration of the search for the optimal unit sequence. s_t^(t) are the test source segments; s_n^(t) and t_n^(t) are the database source and target segments, respectively. Dashed lines represent the target cost (sketched exemplarily for only one database unit per time step), solid lines represent the concatenation cost.

We define n as the number of units which share a frame, illustrated by the hatched frames of the chosen segments in Fig. 7.6. Since this number of units n varies depending on the unit shift, the weight w for each unit's affected frame is calculated as follows:

w[i] = \frac{ \exp(-0.2 \cdot a[i]) }{ \hat{w} }, \qquad i = 1, \ldots, n, \qquad (7.4)

with

a[i] = \begin{cases} \left[ \tfrac{n}{2}, \tfrac{n}{2} - 1, \ldots, 1, 1, \ldots, \tfrac{n}{2} - 1, \tfrac{n}{2} \right], & n \text{ even} \\ \left[ \lceil \tfrac{n}{2} \rceil, \lceil \tfrac{n}{2} \rceil - 1, \ldots, 1, \ldots, \lceil \tfrac{n}{2} \rceil - 1, \lceil \tfrac{n}{2} \rceil \right], & n \text{ odd}, \end{cases} \qquad (7.5)

where \hat{w} = \sum_{i=1}^{n} \exp(-0.2 \cdot a[i]) normalizes the weights to sum to 1. Figure 7.6 shows an example of the smoothing process: the hatched frames of the chosen segments are weighted and added up to create one output frame. This process is repeated for each frame.
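A small sketch of these weights is given below; it directly follows Eqs. (7.4)–(7.5), so only the implementation details (numpy, index construction) are ours.

```python
# Smoothing weights for n overlapping units sharing one output frame.
import numpy as np

def overlap_weights(n):
    """Weights per contributing unit, cf. Eqs. (7.4)-(7.5)."""
    half = np.arange(int(np.ceil(n / 2)), 0, -1)             # e.g. n = 5 -> [3, 2, 1]
    a = np.concatenate([half, half[::-1]]) if n % 2 == 0 \
        else np.concatenate([half, half[::-1][1:]])          # middle '1' not repeated
    w = np.exp(-0.2 * a)
    return w / w.sum()                                       # normalised by w_hat

print(overlap_weights(4))   # symmetric weights that sum to 1, largest in the middle
print(overlap_weights(5))
```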

Finally, after obtaining the final acoustic feature sequence, the speech output can be generated by applying the MLSA filter to the resulting MFCC and F0 frames.

Figure 7.6 – Creating the output sequence from the chosen audio segments.

7.2 Cluster-based Unit Selection

Since a well-performing Unit Selection system relies on sufficient coverage of speech data, there are two obvious ways to improve the codebook:

• Adding more units to the codebook.

• Improving the general quality of the codebook.

While the first option adds more confusability and increases the search time, the second option is preferable. Another problem with simply increasing the amount of codebook data lies in the inter-personal and even inter-session differences of the source feature signal, especially of the EMG signal, caused by variations in electrode positioning and skin properties.

7.2.1 K-Means Codebook Clustering

For the optimization of the unit codebook, we propose a clustering approach that uses the k-means algorithm (see Chapter 2.7.1) to achieve prototypical unit representations and to additionally reduce the general size of the unit codebook.

Initialization

The initialization step can be performed in several different ways. Our implementation chooses the first K input vectors X_{1..K} as the initial centroids M_{1..K}.

Assignment

In the assignment step, a cluster assignment is computed for each input vector by finding the centroid to which the current input vector has the minimal distance.

Centroid Re-Computation

After all input vectors are assigned, these new assignments are used to re-compute the centroid of each cluster as the arithmetic mean of all vectors assigned to it.

This process is iterated until a termination condition is met. One possible criterion is to run the algorithm until only a small fraction of the cluster assignments change. In this work, clustering is stopped when the assignments of less than 0.1 % of the vectors change during an iteration. To speed up the experiments, a parallelized GPU implementation of the K-means algorithm, employing the NVidia CUDA toolkit, is used [Giu]. The output of the algorithm is the cluster assignment for every vector.
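The loop below sketches this procedure with the 0.1 % termination criterion; the thesis uses a parallelized GPU/CUDA implementation [Giu], so this plain numpy version is for illustration only.

```python
# K-means clustering of codebook units with an assignment-change stopping criterion.
import numpy as np

def kmeans(X, k, change_thresh=0.001):
    """X: (N, d) unit feature vectors -> (cluster assignment per vector, centroids)."""
    centroids = X[:k].astype(float).copy()            # init: first K input vectors
    assignments = np.full(len(X), -1)
    while True:
        # Assignment step: nearest centroid per vector.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        changed = np.mean(new_assign != assignments)
        assignments = new_assign
        # Re-computation step: centroid = arithmetic mean of assigned vectors.
        for c in range(k):
            members = X[assignments == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
        if changed < change_thresh:                    # < 0.1 % of assignments changed
            return assignments, centroids
```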

As mentioned before, adding more units to the codebook may increase the acoustic diversity and thus improve the synthesis output. However, finding suitable units remains problematic and is even hampered by adding redundant or unnecessary units, which do not help to improve the system. Additionally, including too much data in the codebook slows down the unit search and the synthesis.

The goal of the proposed unit clustering is not only to reduce the size of the codebook, but also to improve the representativeness of each single unit. In this way, we hope to be able to use fewer units to generate a better output.

Two typical reasons for suboptimal Unit Selection results are alleviated by a codebook clustering approach:

• the non-availability of a proper unit in the codebook,

• the selection of an improper unit, although a correct one is available.

Clustering the available codebook units can have an impact on these problems, since the initial basic units are changed into a more prototypical version of the acoustic and electromyographic segments they represent.

Figure 7.7 – Creation of cluster units from basic units. The units are first assigned to a cluster by performing K-Means clustering, then new cluster units are created.

Cluster Unit Creation

The creation of cluster units is depicted in Figure 7.7. First, the k-means algorithm is applied to the set of basic units to assign each unit to a cluster. The resulting cluster assignments are then used to build the final cluster units: a cluster of multiple units is turned into one single cluster unit by computing the arithmetic mean of all source and target features belonging to the units assigned to that cluster. The resulting cluster units form a new codebook and can be used to perform unit selection based on the target cost and concatenation cost as described above.
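As a sketch of the cluster unit creation step, the following function averages all units assigned to a cluster into one prototypical cluster unit. The unit representation (paired source/target feature arrays of equal shape) is an assumption made for illustration.

```python
import numpy as np

def build_cluster_units(source_units, target_units, assignments, num_clusters):
    """Average all source and target segments assigned to a cluster
    into one prototypical cluster unit (the new codebook entry)."""
    codebook = []
    for k in range(num_clusters):
        idx = np.where(assignments == k)[0]
        if len(idx) == 0:
            continue                                  # skip empty clusters
        src = np.mean([source_units[i] for i in idx], axis=0)
        tgt = np.mean([target_units[i] for i in idx], axis=0)
        codebook.append((src, tgt))
    return codebook
```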

7.3 Oracle: Mapping using Identical Source and Target Speech Data

To get the “golden line”, i.e. the best possible result on our recorded data, we conduct a preliminary oracle experiment without the previously described clustering approach. The goal is to evaluate the general feasibility of the proposed Unit Selection approach and to investigate whether the amount of training data is sufficient. Source and target segments consist of identical acoustic features: the source segment of each unit consists of the audible MFCCs and the target segment contains the MFCCs plus the F0 feature.

Since only audible data is used in this approach, we refer to this experiment as Aud-to-Aud mapping. The Unit Selection system is trained as described in Section 7.1 and tested on unseen data that is not included in the codebook, to investigate whether enough units are available and to evaluate the best possible result given the selected parameters. Table 7.1 shows the results in terms of MCD on the MFCC data and additionally the F0 Unvoiced/Voiced-accuracies and the F0 correlation coefficients.

Table 7.1 – Aud-to-Aud oracle experiment (using acoustic data for codebook creation) results on the EMG-ArraySingle-A-500+ corpus: MCDs, Unvoiced/Voiced(U/V)-Accuracies and F0 correlation coefficients (r).

Speaker-Session     MCD    U/V-Acc.   r
Spk1-Single         3.62   88.8       0.779
Spk1-Array          3.59   91.4       0.835
Spk1-Array-Large    3.09   86.5       0.721
Spk2-Single         3.50   86.1       0.698
Spk2-Array          3.84   91.0       0.822
Spk3-Array-Large    3.47   83.3       0.682
MEAN                3.52   87.9       0.756

Best MCD results are obtained with the “large” sessions, i.e. the sessions that contain the largest amount of training data. Although Spk3-Array-Large contains the most speech data, Spk1-Array-Large gives a lower MCD, suggesting that the latter speaker/session speaks more consistently. The observation that a high amount of codebook data does not necessarily give the best results also holds for the F0 evaluation: Spk1-Array and Spk2-Array perform best on U/V-accuracy and F0 correlation coefficients, although Spk1-Array-Large and Spk3-Array-Large use a considerably higher amount of training data.

7.3.1 Unit Evaluation

Figure 7.8 depicts the unit size and unit shift evaluation of one exemplary speaker. Reducing the unit shift su results in a reduction of the MCD. This is (at least partly) based on the effect that a lower unit shift results in an increased overlap, which means that a higher number of codebook units contributes to the final mapped feature sequence. Thus, bad units can be compensated by a higher number of other units. The unit width wu has a minor influence, especially for small unit shifts. There is a tendency of higher unit widths resulting in lower MCDs. Thus, we perform our Unit Selection experiments with a unit size of wu = 15 and a unit shift of su = 2. Using a frame shift of 10 ms and a frame size of 27 ms, this results in a unit width of 167 ms.

Figure 7.8 – Unit size and unit shift variation and its influence on the mel cepstral distortion.

7.4 Mapping using Whispered Speech Data

To the best of our knowledge, no (frame-based) Unit Selection approach for the conversion from whispered speech into normal speech has been described in the literature. The closest related work is proposed by [TBLT08], using Gaussian mixture models and a neural network mapping to generate F0 from whispered speech input. This results in 93.2 % voiced/unvoiced accuracy using neural networks, and 90.8 % using GMMs. A correlation coefficient of 0.499 is reported on the F0 contour. The authors continue their investigations in [TBLT10], reporting an MCD of 5.99. We use session-dependent whisper-based conversion systems for each of the nine speakers of our Whisper data corpus, introduced in Section 4.3. Two different evaluations are performed: an MFCC-to-MFCC conversion, which uses whispered MFCC features to estimate MFCCs of audible speech, plus an MFCC-to-F0 transformation, which uses whispered MFCCs to compute an F0 contour that is part of the target segment inside a single unit. All this is done in a frame-by-frame fashion, without considering any additional contextual information. We evaluate the basic Unit Selection approach on our Whisper-Corpus WA-200. To address misaligned whispered and audible training utterances, the MFCC features of the normally spoken utterances are aligned to the corresponding MFCC features of the whispered utterances using dynamic time warping (DTW).

Table 7.2 – Evaluation results between whispered and normal speech, baseline Mel Cepstral Distortions (MCD) and Unit Selection results: MCDs, Unvoiced/Voiced(U/V)-Accuracies and F0 correlation coefficients (r).

Speaker   Baseline MCD   Unit Selection MCD   U/V Acc.   r
101       6.44           4.93                 85.04      0.664
102       7.12           5.19                 86.13      0.710
103       8.37           5.37                 78.87      0.566
104       8.46           6.02                 77.55      0.541
105       8.22           5.41                 84.40      0.673
106       8.29           6.29                 76.27      0.426
107       7.75           5.85                 78.88      0.564
108       8.51           6.52                 74.15      0.436
109       6.75           5.23                 82.79      0.605
MEAN      7.77           5.64                 80.45      0.576

We calculate the MCD between the aligned MFCCs of the source whispered utterances and the target audible utterances and take this value as the baseline MCD of this experiment, which results in an average MCD of 7.77. The Unit Selection mapping reduces the average distortion to 5.64, representing an MCD reduction of 2.13. An average Unvoiced/Voiced accuracy of 80.45 % with an F0 correlation coefficient of 0.576 can be achieved. Thus, Unit Selection can be regarded as a valid approach to generate both F0 and normal MFCCs from whispered input.
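For reference, a baseline MCD between two DTW-aligned MFCC sequences could be computed as sketched below, assuming the commonly used MCD definition (10/ln 10 · sqrt(2 Σ_d (c_d − ĉ_d)²) per frame, excluding the energy coefficient). The exact convention used in this thesis is the one defined in the earlier chapters, so this sketch is only indicative.

```python
import numpy as np

def mel_cepstral_distortion(ref_mfcc, conv_mfcc):
    """Average frame-wise MCD in dB between two aligned MFCC sequences
    (frames x coefficients); coefficient 0 (energy) is excluded."""
    ref = np.asarray(ref_mfcc)[:, 1:]
    conv = np.asarray(conv_mfcc)[:, 1:]
    diff = ref - conv
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()
```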

7.5 Mapping using EMG Data

Since Unit Selection requires a certain amount of speech data for the codebook, we concentrate the evaluation of this approach on the EMG-ArraySingle-A-500+ corpus only. The EMG-Array-AS-200 corpus contains only 200 utterances per session with 140 used for training – approximately 7 minutes of training data – which gives only unsatisfactory results.

Figure 7.9 – EMG-to-speech Unit Selection MCDs of the EMG-ArraySingle-A-500+ corpus with Aud-to-Aud mapping reference.

The results of the basic Unit Selection system (without Unit Clustering) are depicted in Figure 7.9, in addition to the golden line of the acoustic-data-based approach Aud-to-Aud. An average MCD of 6.14 is obtained, showing high inter-session dependencies. While Spk1-Array-Large results in a considerably low MCD of 5.44, Spk3-Array-Large performs worst with an MCD of 6.82, although this session contains the most training data. Since Spk3-Array-Large performed around average in the previous Aud-to-Aud experiment, this implies that the session may contain EMG recording problems or articulation inconsistencies. Figure 7.10 depicts the F0 evaluation for the Unit Selection mapping. An average Voiced/Unvoiced-accuracy of 76.2 % is obtained, with an F0 correlation coefficient of 0.505.

Figure 7.10 – EMG-to-F0 Unit Selection evaluation of the EMG-ArraySingle-A-500+ corpus: Unvoiced/Voiced-accuracy (blue bars) and F0 correlation coefficients (red bars). Error bars denote standard deviation.

7.5.1 Cluster-based EMG Unit Selection

To investigate the quality of the codebook for unit clustering, we evaluate the effect of drastically reducing the number of units in the codebook. The experiment is performed on a single session from the EMG-ArraySingle-A-500+ corpus in a cross-evaluation setup: for every training utterance, a unit codebook is created from the training set with that utterance held out. Conversion is then performed for the held-out utterance and the list of units used to create the output is saved. This process is repeated with acoustic features as both source and target features, as in the previous Aud-to-Aud evaluation. This configuration yields the best possible audio sequence that can be created using the available units from the codebook. When we reduce the codebook to units that were used at least once, both in the Aud-to-Aud conversion (leaving only units that provide good output) and the EMG-to-Aud conversion, the size of this new codebook is 17,836 units, down from 159,987. This near tenfold reduction leads to a much quicker conversion and has only a minor effect on the mel cepstral distortion: the reduced codebook achieves an average MCD of 5.9 on the development set, compared to 5.85 with the original codebook. This result shows that codebook redundancy can be reduced without considerably degrading the mel cepstral distortion. It also shows that reducing the codebook size without a loss in MCD is feasible and does not require any particularly complicated methods: simply restricting the codebook to units that are useful is already sufficient. We use the proposed unit clustering to reduce the codebook to a defined number of cluster units, which is varied between 500 and 13,000 cluster units. Figure 7.11 depicts the results of this cluster evaluation for Spk1-Array.

Figure 7.11 – Mel Cepstral Distortions for Unit Selection mapping using the Unit Selection baseline and different cluster unit counts for the proposed Unit Clustering approach. The error bars show the standard deviation.

The clustering approach clearly helps to reduce the MCD, while the concrete number of cluster units has only a minor influence. While there is a notable MCD reduction from the 5.99 baseline, the cluster count variations lie between 5.13 and 5.29, a difference that is only slightly perceivable when the output is synthesized.

Figure 7.12 – Mel Cepstral Distortions on the EMG-ArraySingle-A-500+ corpus for Unit Selection mapping using the Unit Selection baseline and the proposed Clustering approach. The error bars show the standard deviation.

Figure 7.13 – EMG-to-F0 Unit Selection mapping results, basic Unit Selection (without Clustering) versus Unit Clustering: Voiced/Unvoiced Accuracy and correlation coefficient r between output and target F0 contour. The error bars show the standard deviation.

Finally, we apply the Unit Clustering to the full EMG-ArraySingle-A-500+ corpus (see Figure 7.12), which results in an average MCD of 5.36 using 6,000 cluster units compared to 6.14 for the baseline Unit Selection. While all sessions obtain a significant (p < 0.001) MCD reduction, the best result is achieved by Spk1-Array-Large with an MCD of 4.86, while the highest MCD reduction is observed for Spk3-Array-Large: the baseline MCD of 6.82 is reduced to 5.82. Comparisons to the other proposed mapping approaches are investigated in Chapter 9.

Chapter 8 Mapping using Deep Neural Networks

This chapter introduces the third approach (red box in Figure 8.1) for a direct conversion system: artificial neural networks. Although they have been known for decades, neural networks have gained much popularity in recent years. Generation of speech using neural networks has been published e.g. by [KP04, AGO15], using a multilayer perceptron on electromagnetic articulography, electropalatography and laryngography data [KP04], or a deep architecture for articulatory inversion (generating articulatory configurations from speech data) [UMRR12]. Two different artificial neural network types are investigated in this chapter: a classic feed-forward network and a recurrent neural network.

8.1 Theoretical Background

Artificial neural networks (ANNs) are computational models that are inspired by biological networks, consisting of interconnected simple processing units called neurons that exchange information. The connections between neurons are adapted by different weights. This enables the ANN to be tuned and to learn different characteristics. ANNs can be specified by different parameters, e.g. the type of interconnection between the neurons (recurrent or feed-forward), the type of activation function that passes the input to the output (some logistic function or a rectified linear function) and the type of learning used to update the different weights. We evaluate the direct feature mapping approach on two different kinds of neural networks:

1. Classical feed-forward networks

2. Recurrent Neural Networks (RNNs), i.e. Long Short-Term Memory (LSTM) networks [HS97]

Figure 8.1 – Structure of the proposed EMG-to-speech approach.

8.1.1 Feed-Forward Network

The simplest kind of ANN is a feed-forward type of architecture that consists of three consecutive layer types:

1. Input layer: n-dimensional input data is directly associated to n inputs.

2. Hidden layer(s): the central component of the neural network.

3. Output layer: m-dimensional output is obtained from m output neurons in this layer.

To model more complicated representations, multiple hidden layers are introduced, which is known as Deep Neural Networks. In theory, every layer adds more detail towards the to-be-learned output, building up different levels of abstraction. For a visual classification task, the first hidden layer would learn to recognize edges, while the second layer would learn more complex structures like circles, etc. [PMB13]. The activation function of the last layer acts as an output function, e.g. a soft-max function that is used for classification problems, or a linear output function that may be applied for regression tasks, like the generation of speech from electromyographic features.

In general, neural networks can model complex, non-linear functions, but depend on many parameters like the number/type of hidden layers and the number of neurons per layer. Given these topological parameters, the ANN needs to be trained such that a target function represented by a set of observations is learned in an optimal sense. The most prominent way of training neural networks is the backpropagation of errors, in conjunction with an optimization like stochastic gradient descent (SGD). The weights of the network are updated based on the error ε between the network's output and the training data. This delta error is propagated backwards from the output to the input, resulting in the weight update

∆w = −α · ∂ε/∂w,

where α is known as the learning rate and the minus sign ensures a step in the direction of a minimum. The choice of α has a strong influence on the convergence behavior of the training.

Figure 8.2 shows the architecture of the employed five-layer neural network which we use for mapping the input features to the acoustic space of audible speech. The topology is chosen on an empirical basis from prior experiments and confirmed by related work that uses similar architectures [BHG+15, AGO15]. The neural network is trained to map an n-dimensional input feature vector to the target features of audible speech, i.e. if G(x_t) denotes the mapping of x_t, then the error of the mapping is given by ε = Σ_t ||y_t − G(x_t)||². G(x_t) is defined as

G(x_t) = g̃(w^(4), b^(4), g(w^(3), b^(3), g(w^(2), b^(2), g(w^(1), b^(1), x_t))))

where g̃(w, b, x) = w · x + b and g(w, b, x) = ReL(g̃(w, b, x)).

Figure 8.2 – Structure of the neural network used to convert input features to target Mel Frequency Cepstral Coefficients on a frame-by-frame basis (n input units, three hidden layers, 25 output units).

Here, w^(n) and b^(n) represent the weight and bias matrices of the hidden and output layers and ReL denotes the rectified linear activation function ReL(x) = max(0, x). Our own preliminary experiments gave good results using this setup, and other researchers also state that substituting logistic units with ReL units results in improved accuracy and fast convergence [ZRM+13]. For the feed-forward neural networks, we ran our first experiments using the Computational Network Toolkit [YES+14], but later switched to the brainstorm implementation [GS15].
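The nested mapping G(x_t) can be written down directly; the following NumPy sketch implements the forward pass with ReL hidden activations and a linear output layer. Layer shapes and the list-of-(weight, bias) representation are illustrative assumptions, not the toolkit implementation used in the thesis.

```python
import numpy as np

def relu(x):
    """ReL(x) = max(0, x)."""
    return np.maximum(0.0, x)

def forward(x, layers):
    """layers = [(w1, b1), (w2, b2), (w3, b3), (w4, b4)];
    the first three use ReL activations, the last one is linear (regression)."""
    h = x
    for w, b in layers[:-1]:
        h = relu(w @ h + b)              # g(w, b, x) = ReL(w x + b)
    w_out, b_out = layers[-1]
    return w_out @ h + b_out             # linear output layer g-tilde
```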

8.1.2 Long Short-Term Memory Network

A Long Short-Term Memory (LSTM) network is a specific type of recurrent neural network (RNN), introduced in 1997 [HS97], and enables a long-range temporal context by using memory cell units that store information over a longer period of time. LSTMs are state of the art for a number of problems, e.g. recognition of speech [GMH13, SSB14] or handwriting [DKN14]. The use of RNNs is motivated by their ability to cope with temporal dependencies directly in the network model, instead of stacking feature frames at the preprocessing level. This is realized by introducing recurrences into the network model, which allow the activation patterns from previous time steps to be used as an input to the current one.

The main element of an LSTM is a memory cell that can maintain information over time. This is accompanied by non-linear gating units that regulate the data flow into and out of the memory cell. The schematic structure can be seen in Figure 8.3. Since its introduction in 1997, the LSTM architecture has experienced some modifications (e.g. the first version did not have forget gates). The implementation that is used in this work matches the bidirectional LSTM type described in [GS05]. For a description of the historic development of LSTMs, we refer to [GSK+15]. There are four units that are connected to the input: a block input, an input gate, an output gate and a forget gate. The latter three multiplicative gates determine whether the connected values are propagated into or out of the inner memory cell. The input gate can stop the gate input values, the forget gate resets the cell memory and the output gate triggers the output values. To enable the recurrence, the block output is connected back to the block input and to all three gates. The self-connected units in the middle represent the Constant Error Carousel (CEC). When no new input or errors are presented, the CEC's inner values remain constant and are stored by the self-connection. There also exist peephole connections (not shown in Figure 8.3), which directly link the memory cell to the three gates and help to learn precise timings.

Figure 8.3 – Structure of a long short-term memory network.

The training of LSTMs is performed using backpropagation through time [Wer90], where the training sequences are processed in temporal order. The bidirectional LSTMs that are used in this work also use the future context. In the training stage all layers are duplicated, so that one network processes the data in forward order, while the duplicate processes the data in backward order. It has been shown in speech recognition applications [GS05, WZW+13] that this bidirectional approach outperforms classical LSTMs. The usage of LSTMs for text-to-speech synthesis has previously been investigated in [FQXS14]. However, it should be noted that the bidirectional characteristic limits the online capability of a feature mapping, due to its usage of future time context.

We optimized most of the LSTM parameters session-dependently and in accordance with the underlying application. The momentum was fixed at 0.9, as was the number of two hidden layers. LSTM research [GSK+15] and our own preliminary experiments showed that these factors have only a minor influence on the system performance. To determine the number of memory blocks per layer, we varied the numbers from the literature using our own input data. The speech feature enhancement in [WZW+13] used three hidden layers, consisting of 78, 128 and 78 memory blocks. We use this as orientation and vary different combinations of hidden layer numbers (from 1 to 4) with memory blocks per layer (60, 80 and 100). We obtained best results using two hidden layers, consisting of 100 and 80 memory blocks. To avoid overfitting, we stopped training after 20 epochs without improvement of the validation sum of squared errors. Input and output data was normalized to zero mean and unit variance. For the evaluation of Long Short-Term Memory (LSTM) networks, we used the CUda RecurREnt Neural Network Toolkit (currennt) implementation [WB15]. All LSTM experiments were done using an Intel Core i7 950 CPU with an Nvidia GeForce GTX 970 graphics device.
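As an illustration only (the thesis uses the currennt toolkit), a bidirectional LSTM regression network with two hidden layers of 100 and 80 memory blocks could be set up in PyTorch roughly as follows; the layer sizes follow the text, everything else is an assumption.

```python
import torch.nn as nn

class BidirectionalLSTMMapper(nn.Module):
    """Two stacked bidirectional LSTM layers (100 and 80 memory blocks)
    followed by a linear regression layer to the acoustic features."""
    def __init__(self, input_dim, output_dim=25):
        super().__init__()
        self.lstm1 = nn.LSTM(input_dim, 100, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * 100, 80, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 80, output_dim)

    def forward(self, x):                 # x: (batch, frames, input_dim)
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(h)
        return self.out(h)                # (batch, frames, output_dim)
```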

8.2 Mapping on Whispered Speech

Applying DNNs to whispered speech data is not a novel approach, but it can give us insights into feature transformations that are hard to investigate on a complicated signal like the EMG signal. Since both source and target features are speech features, we have the same preprocessing steps for source and target features and can directly investigate the general feasibility of the feature transformation. Research on whispered Mandarin speech recognition using DNNs was recently presented [LML+15], reporting significant improvements over a baseline GMM approach. The direct generation of speech from whispered input, with a strong focus on F0 generation, was proposed by Li et al. [LMDL14]. They use multiple restricted Boltzmann machines and compare their results to a GMM-based approach (see Section 6 for details on this Gaussian mapping). A voiced/unvoiced accuracy of 90.53 % is reported, with an F0 correlation coefficient of 0.61. We train session-dependent whisper-based conversion systems for each of the nine speakers of our Whisper data corpus, introduced in Section 4.3. Two different DNNs are trained: an MFCC-to-MFCC conversion, which uses whispered MFCC features to estimate MFCCs of audible speech, plus an MFCC-to-F0 transformation, which uses whispered MFCCs to compute an F0 contour. All this is done in a frame-by-frame fashion, without considering any additional contextual information. We use DTW (Dynamic Time Warping) to align the MFCC features of the normally spoken utterances to the corresponding MFCC features of the whispered utterances.
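A minimal DTW sketch for computing such an alignment between whispered and audible MFCC frames is shown below; this plain dynamic-programming version with Euclidean frame distances is only illustrative, since the thesis does not prescribe a particular DTW implementation.

```python
import numpy as np

def dtw_path(whisper_mfcc, audible_mfcc):
    """Return index pairs (i, j) aligning whispered frames i to audible frames j."""
    A, B = np.asarray(whisper_mfcc), np.asarray(audible_mfcc)
    n, m = len(A), len(B)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```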

8.2.1 Feed Forward DNN

We train three hidden layers for 100 epochs, using a decreasing learning rate. Dropout [SHK+14], i.e. randomly dropping units from the neural network during training, is used to suppress overfitting. Table 8.1 gives the evaluation results of the proposed DNN system, using the Mel Cepstral Distortion (MCD) measure on the output and target MFCC features plus the voiced/unvoiced accuracy and correlation coefficient r between output and target F0 contour. The baseline MCD refers to the distance between unaltered whispered and target normal-speech MFCCs. The DNN mapping gives a significant improvement from an average MCD of 7.77 to 5.46. The F0 results also look encouraging, resulting in an average voicedness accuracy of 79 % and a correlation coefficient of 0.65. Comparing this to the literature [TBLT08] (93.2 % voiced/unvoiced accuracy using neural networks, correlation coefficient of 0.499), we get worse voicedness results, but a better correlation coefficient. Detailed results of the voicedness evaluation are depicted in Figure 8.4. Most of the errors are unvoiced frames that are falsely mapped as voiced frames (17 % U → V rate), compared to only a 4 % V → U rate.

Figure 8.4 – Voicedness results of the DNN-based Whisper-to-Speech mapping. X → Y represents X = reference, Y = output, e.g. U → V are unvoiced frames recognized as voiced.

Table 8.1 – Whisper-to-Speech feed-forward DNN mapping results, per speaker: Mel Cepstral Distortions (MCD) between MFCCs, Unvoiced/Voiced (U/V) Accuracy and correlation coefficient r between output and target F0 contour.

Speaker   Baseline MCD   DNN MCD   U/V Acc.   r
101       6.44           4.71      84.14      0.705
102       7.12           5.01      85.90      0.772
103       8.37           5.32      78.98      0.625
104       8.46           5.68      78.69      0.643
105       8.22           5.34      82.78      0.688
106       8.29           6.06      76.28      0.535
107       7.75           5.65      76.92      0.658
108       8.51           6.29      67.93      0.555
109       6.75           5.09      79.39      0.689
MEAN      7.77           5.46      79.00      0.652

8.2.2 Long Short-Term Memory Networks

As already explained in Section 8.1.2, we train two hidden LSTM layers based on related work [GSK+15]. Input and output features are z-normalized.

Table 8.2 – Whisper-to-Speech LSTM mapping results, per speaker: Mel cepstral distortions (MCD) between MFCCs, Unvoiced/Voiced (U/V) Accuracy and correlation coefficient r between output and target F0 contour.

Speaker   Baseline MCD   LSTM MCD   U/V Acc.   r
101       6.44           4.52       86.85      0.754
102       7.12           4.77       87.37      0.798
103       8.37           5.03       81.68      0.679
104       8.46           5.46       79.63      0.656
105       8.22           5.09       83.55      0.734
106       8.29           5.72       78.30      0.580
107       7.75           5.44       75.54      0.714
108       8.51           5.99       66.95      0.598
109       6.75           4.85       79.73      0.729
MEAN      7.77           5.21       79.96      0.693

Table 8.2 gives the LSTM evaluation results, using the Mel Cepstral Distortion (MCD) measure on the output and target MFCC features plus the voiced/unvoiced accuracy and correlation coefficient r between output and target F0 contour. Compared to the feed-forward DNN-based whisper mapping, we get an improvement from an average MCD of 5.46 to 5.21. The F0 results also show an accuracy gain, with the average voicedness accuracy rising from 79.0 % to 79.96 % and the correlation coefficient from 0.65 to 0.69. Detailed results of the voicedness evaluation are depicted in Figure 8.5.

Figure 8.5 – Voicedness results of the LSTM-based Whisper-to-Speech mapping. X → Y represents X = reference, Y = output, e.g. U → V are unvoiced frames recognized as voiced.

8.3 Mapping on EMG data

Tsuji et al. [TBAO08] introduced an EMG-based neural network approach for phone classification, and parts of this thesis have also been published introducing a direct speech feature regression [DJS15]. Kello and Paul [KP04] presented a DNN-based mapping approach from articulatory positions (electromagnetic articulography (EMA), electropalatograph (EPG) and laryngograph data) to acoustic outputs, obtaining a word identification rate as high as 84 % in a listening test. A similar articulatory-to-acoustic mapping approach based on Deep Neural Networks (DNN) was presented by [BHG+14] and further investigated in [BHG+15]. They trained on electromagnetic articulography (EMA) data which was recorded simultaneously with the articulated speech sounds. Objective and subjective evaluations on a phone basis instead of sentences reached a phone recognition accuracy of around 70 %.

We investigate the direct EMG-to-speech mapping based on the proposed artificial neural network types: feed-forward DNNs and Long Short-Term Memory networks. We estimate the neural network parameters during training using corresponding EMG and speech data. In the final conversion stage, EMG signals from the development set are converted to acoustic speech features, i.e. MFCCs and fundamental frequency.

To avoid bias towards numerically larger EMG or audio features, the signal is normalized to zero mean and unit variance. Since some sessions contain channels with noise and artifacts, and since we ran into memory issues with the two sessions with a high amount of data, we reduced the number of input channels for these sessions by manually discarding noisy and artifact-prone channels. Table 8.3 shows the selected number of channels and the source feature dimensionality for the evaluated sessions.

Session          Used Channels   TD0 Dim.   TD15 Dim.
Spk1-S           All 6           30         930
Spk1-A           All 35          175        5425
Spk1-A-Large     18 of 35        90         2790
Spk2-S           All 6           30         930
Spk2-A           All 35          175        5425
Spk3-A-Large     15 of 35        75         2325

Table 8.3 – Number of used channels and source feature dimensionality per session.

8.3.1 Feed Forward DNN

We use a five-layer feed-forward neural network with the DNN setup described in Section 8.1.1, perform experiments with different combinations of epochs, mini-batch sizes and learning rates per session, and report the best results.

As depicted in Figure 8.2, we use the EMG feature vector as input, followed by three hidden layers g of different sizes and a final regression layer g̃ having as many nodes as the number of acoustic output parameters. This results in an “hourglass” configuration with a c2 = 512 node bottleneck in the center, surrounded by a c1 = 2500 node computation layer between input and bottleneck and a c3 = 1024 node computation layer between bottleneck and output. The DNN structure is chosen on an empirical basis from prior experiments [JWH+14, DJS15], and our own experience showed only marginal differences using other topologies.
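For concreteness, the hourglass topology (2500–512–1024 hidden nodes, linear output) could be expressed as in the following PyTorch sketch; the toolkits actually used were CNTK and brainstorm, so this module is an assumption made only to show the layer structure.

```python
import torch.nn as nn

def hourglass_dnn(input_dim, output_dim=25):
    """Five-layer feed-forward mapping network: ReLU hidden layers of
    2500, 512 (bottleneck) and 1024 nodes, linear regression output."""
    return nn.Sequential(
        nn.Linear(input_dim, 2500), nn.ReLU(),
        nn.Linear(2500, 512), nn.ReLU(),
        nn.Linear(512, 1024), nn.ReLU(),
        nn.Linear(1024, output_dim),          # linear output for MFCC regression
    )
```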

Once the training process has converged, we get a set of weight and bias matrices which represent the mapping function from source EMG features to target acoustic speech features. These matrices can be used in the conversion stage to transform an EMG feature vector to a feature vector of the audible speech.

Figure 8.6 depicts the results for the EMG-ArraySingle-A-500+ corpus. All DNN parameters were optimized independently per session on an additional development set that consisted of 5 % of the training set. This 5 % development set was separated from the training set and used for the error function during training. The neural network parameters with the lowest error on this 5 % validation set were retained.

Figure 8.6 – Mel Cepstral Distortions on the EMG-ArraySingle-A-500+ corpus for DNN-based mapping using different EMG feature sets: TD0 vs TD15 vs TD15-LDA vs TD15-CCA-32 vs TD15-CCA-52 features. The error bars show the standard deviation.

While comparable mapping approaches use a feature reduction technique to make the transformation of high-dimensional EMG data feasible, neural networks can handle high-dimensional data per se. Whereas e.g. an LDA computation needs phone label information, the use of direct features needs no label alignment and is thus even less error-prone. Given these advantages, using high-dimensional TD15 data gives the best MCD results compared to Canonical Correlation Analysis (CCA) and Linear Discriminant Analysis (LDA) feature reduction, resulting in an average MCD of 5.29 using TD15, with the best session giving an MCD as low as 4.69. However, it should be mentioned that not all sessions achieve their best MCD results with TD15 features. Although the differences are not significant, Spk1-Array (p = 0.22) and Spk2-Array (p = 0.1) achieve their best results with the CCA-preprocessed TD15 feature. Spk1-Single (p = 0.02), Spk1-Array-Large (p < 0.001) and Spk3-Array-Large (p = 0.04) perform significantly best with the TD15 feature set.

A considerable MCD reduction can be achieved by using TD15 feature stacking. While unstacked TD0 features obtain an average MCD of 6.82, using adjacent feature information in TD15 reduces the MCD to 5.29. This implies that DNNs have difficulties in modeling time-context information by themselves; Section 8.3.2 will investigate how LSTMs handle this information.

Independent Component Analysis

To exploit the nature of the array-based recordings and to eliminate redundancies, we use Independent Component Analysis (ICA) (see Chapter 2.7.2 for details) for a decomposition of the raw EMG signal into statistically independent single components, which are used for the further preprocessing steps. Since the multi-channel array electrodes are positioned in close proximity to each other, adjacent electrodes record similar signals, which justifies the ICA application. We employ the logistic infomax ICA algorithm [BS95], as implemented in the Matlab EEGLAB toolbox [DM04]. We separate the EMG channels into two sets: one set representing the chin array, one set for the cheek array. The ICA transformation is then computed for both array sets independently per session, using only the training data sets. We interpret the obtained components as new input signals, identify which of these components are artifacts, and discard these noise components. This is followed by the standard TD15 feature computation.
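Purely as an illustration, a comparable per-array decomposition could look like the sketch below using scikit-learn's FastICA (a different ICA algorithm than the infomax variant used in the thesis, named here only as a stand-in); the component indices to keep are assumed to come from the manual artifact inspection described above.

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_components(chin_channels, cheek_channels, keep_chin, keep_cheek):
    """Decompose each array set separately and keep only the component
    indices that were not marked as artifacts (keep_* are assumed inputs)."""
    comps = []
    for channels, keep in ((chin_channels, keep_chin), (cheek_channels, keep_cheek)):
        ica = FastICA(random_state=0)
        sources = ica.fit_transform(np.asarray(channels).T)   # (samples, components)
        comps.append(sources[:, keep])
    return np.concatenate(comps, axis=1)       # new multi-channel input signal
```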

Figure 8.7 shows the MCD results on the EMG-ArraySingle-A-500+ corpus.

Figure 8.7 – Mel Cepstral Distortions on the EMG-ArraySingle-A-500+ corpus for EMG-to-MFCC DNN mapping results with Independent Component Analysis preprocessing (ICA-TD15) and without ICA (TD15). The error bars show the standard deviation.

An average MCD of 5.17 marks the best MFCC output obtained so far, with the best session (Spk1-Array-Large) achieving an MCD of 4.51. Since ICA utilizes the multi-channel array setup, the MCDs on the single-electrodes setup with only six channels show a small performance degradation. The improvement on Spk2-Array is not significant (p > 0.05), while the others are (Spk1-Array p < 0.01, Spk1-Array-Large and Spk3-Array-Large p < 0.001). In general, this experiment shows that the separation of the multi-channel array input into independent components helps the neural network to improve the mapping results.

EMG-to-F0

An additional neural network is trained using the best performing TD15 feature-set with target F0 features instead of MFCCs. Results are shown in Figure 8.8.

Figure 8.8 – EMG-to-F0 DNN mapping results, without ICA (blue/yellow bars) and with ICA preprocessing (red/green bars): Voiced/Unvoiced Accuracy and correlation coefficient r between output and target F0 contour. The error bars show the standard deviation.

Adding the proposed ICA preprocessing generally gives a small U/V-accuracy improvement from 79.9 % to 80.96 %, with the correlation coefficient raised from 0.657 to 0.68. The results per session are additionally depicted in Figure 8.8. Since ICA performs well on high-dimensional input data, we see only slight differences on the single-electrodes data. Comparing these results to the Whisper-to-F0 mapping, where we achieved 79 % V/U-accuracy with a correlation coefficient of 0.65, the EMG-to-F0 output is encouraging.

Silent Data

After obtaining and improving our results on the EMG-ArraySingle-A-500+ corpus, we investigate the application of neural networks to silent EMG data. For this, we use the best-performing TD15 setup. As already discussed in Chapter 6, we lack matching target data in the silent EMG domain. Thus, we evaluate three different experimental setups:

• The baseline EMG-to-speech mapping using audible EMG for train and test data (Aud Train - Aud Test).

• Using audible EMG data for training the neural network, and silent EMG data for testing (Aud Train - Sil Test).

• Using silent EMG data for training and testing (Sil Train - Sil Test).

The latter is difficult since we need to use the target acoustic data in training and testing, which does not exactly match the EMG signals. To compensate for the temporal differences, we apply dynamic time warping (DTW) to align the silent EMG signal with the acoustic data. We still face the problem that the silent EMG signal and the acoustic signal are not simultaneously recorded and may thus include inconsistencies that are not fully alleviated by DTW. Figure 8.9 depicts the Mel Cepstral Distortions for the silent and audible EMG data of the EMG-Array-AS-200 corpus.

While training on silent EMG data improved the results in the Gaussian mapping approach (see Section 6.3.1), only three sessions (S1-2, S2-4, S4-1) benefit from the usage of silent EMG data in DNN-based training. When a feature reduction technique (i.e. LDA) is applied, a similar result is obtained: although the Sil Train - Sil Test MCDs are further reduced to an average of 6.81, the Aud Train - Sil Test setup still performs better with a mean MCD of 6.6.

Figure 8.9 – DTW-aligned Mel Cepstral Distortions using feed-forward DNNs for audible EMG training with silent EMG testing (blue bars), silent EMG training with silent EMG testing (red bars) and audible EMG training with audible EMG testing (yellow bars), evaluated on the EMG-Array-AS-200 corpus. Error bars denote standard deviation.

8.3.2 Long Short-Term Memory Networks

Context Feature Evaluation

One of the promising benefits of LSTMs is the intrinsic modeling of time-contextual information, thus including the context information from the feature domain directly in the mapping/modeling part. An interesting investigation is to look at the performance of an LSTM when different context-based features are presented. Therefore, we train LSTM networks with different kinds of EMG feature sets:

1. Context-independent TD0 features,

2. high-dimensional context-dependent TD15 features (15 preceding and succeeding frames),

3. TD15+LDA features: TD15 features that are post-processed with an LDA to reduce the feature vector to 32 dimensions,

4. TD15+CCA (32) features: TD15 features that are post-processed with a CCA to reduce the feature vector to 32 dimensions,

5. TD15+CCA (52) features: TD15 features that are post-processed with a CCA to reduce the feature vector to 52 dimensions.

Details on the dimension reduction techniques can be found in Section 2.7. The LDA is optimized to maximize the discriminability of the phone classes, while the CCA is trained to optimize the correlation to stacked MFCC features.
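A sketch of the context stacking that turns per-frame TD0 features into TD15-style features (15 preceding and 15 succeeding frames stacked onto the current frame) is given below; padding edge frames by repetition is an assumption made for illustration.

```python
import numpy as np

def stack_context(features, context=15):
    """Stack `context` preceding and succeeding frames onto every frame:
    (frames, dim) -> (frames, (2*context + 1) * dim)."""
    frames = np.asarray(features)
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    stacked = [padded[i:i + len(frames)] for i in range(2 * context + 1)]
    return np.concatenate(stacked, axis=1)
```

For a 30-dimensional TD0 vector this yields 31 · 30 = 930 dimensions, matching the TD15 dimensionality listed in Table 8.3 for the six-channel sessions.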

Figure 8.10 shows the MCD results for all six sessions of the EMG-ArraySingle-A-500+ corpus. The results show that for all of the examined sessions, the high-dimensional TD15 setup performs worst, while TD0 features without context information are notably better. This implies that the LSTM network is indeed capable of learning the temporal context by itself. The results can be further improved when stacking and feature reduction are used, achieving an average MCD of 5.43 with LDA preprocessing and 5.51 using CCA feature reduction, which can be improved to 5.38 when a 52-dimensional post-CCA feature is retained. Thus, using CCA shows two advantages compared to LDA: it does not require phone alignments for training, and it results on average in a better performance in terms of MCD.

Figure 8.10 – Mel Cepstral Distortions on the EMG-ArraySingle-A-500+ corpus for LSTM-based mapping from different EMG feature sets: TD0 vs TD15 vs TD15-LDA vs TD15-CCA-32 vs TD15-CCA-52 features. The error bars show the standard deviation.

Multitask Learning

Multitask Learning [Car97] is a machine learning approach where a secondary related task is added to a primary main task with the intention of improving the model by learning commonalities of both tasks. Both tasks are learned jointly, and in theory better generalization is obtained by adding an additional information source. In the context of neural-network-based EMG-to-speech mapping, a secondary target feature is presented during training, e.g. an LSTM is trained to predict two different features: MFCCs plus phone class labels. Since these primary and secondary tasks are related, the hypothesis is that the network weights will be shared among tasks, which leads to a better representation of the feature relationship and thus an improved performance of the mapping.

Once the multitask neural network parameters are learned, the final feature mapping of the primary task can be obtained as before: simply mapping EMG features to MFCCs. Thus, there is no need to include phone labels in the vocoder component; they are only required during training. To apply the multitask learning approach, we use our previously best LSTM setup and change the output layer to 71 dimensions. The first 25 dimensions represent the MFCC output, while the remaining 46 dimensions correspond to the 45 English phone classes plus a silence class. For each frame, the feature vector entry of the current phone is set to 1, while the others are set to 0. A second variation is the usage of “subphones”, which further divides each phone p into three position-dependent phone classes: p-begin, p-middle, p-end, representing the begin, middle and end of a single phone. Thus, we use a 161-dimensional vector instead of the previous 71-dimensional feature. In a third experimental setup we use a simple delta feature for the secondary task target, i.e. the derivative of the MFCC feature sequence.
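The multitask target vectors can be assembled as sketched below: 25 MFCCs concatenated with a one-hot phone vector (45 phones plus silence), or MFCCs concatenated with a simple frame-to-frame delta. The frame-level phone label array is an assumed input and the helper names are illustrative.

```python
import numpy as np

def multitask_targets_phones(mfcc, phone_ids, num_phones=46):
    """Concatenate 25-dim MFCC targets with a one-hot phone vector
    (45 phones + silence) -> 71-dim multitask target per frame."""
    one_hot = np.zeros((len(phone_ids), num_phones))
    one_hot[np.arange(len(phone_ids)), phone_ids] = 1.0
    return np.concatenate([mfcc, one_hot], axis=1)

def multitask_targets_delta(mfcc):
    """Concatenate MFCCs with a simple frame-to-frame delta as secondary task."""
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])
    return np.concatenate([mfcc, delta], axis=1)
```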

Figure 8.11 shows the evaluation results for the EMG-ArraySingle-A-500+ corpus.

Multitask learning with additional phone class information gives no notable improvement over the TD15+CCA-to-MFCC baseline system. While two sessions show a slight MCD improvement, four sessions perform even worse, possibly due to the increased number of training dimensions or due to inaccuracies in the phone alignment. Using delta features for the secondary task gives a slight and consistent (however not significant, with p ≈ 0.3) improvement on all six sessions and thus obtains the best result of the LSTM-based mapping approach, a mean MCD of 5.35.

Figure 8.11 – Mel Cepstral Distortions on the EMG-ArraySingle-A-500+ corpus for LSTM-based mapping using Multi Task Learning. The error bars show the standard deviation.

EMG-to-F0

For a final evaluation of audible EMG-to-speech mapping, we estimate F0 information directly from EMG data of the EMG-ArraySingle-A-500+ corpus. We thus use the introduced LSTM approach and replace the target MFCCs with F0 features. Figure 8.12 shows the evaluation results for the EMG-ArraySingle-A-500+ corpus.

Figure 8.12 – EMG-to-F0 LSTM mapping results: Voiced/Unvoiced Accuracy and correlation coefficient r between output and target F0 contour. The error bars show the standard deviation.

Silent Data

As previously evaluated with DNNs in Section 8.3.1, we investigate the application of LSTMs to silent EMG data. Due to the lack of simultaneously recorded target data in the silent EMG domain, we use DTW alignment and evaluate three different experimental setups:

• The baseline results using audible EMG for train and test data (Aud Train - Aud Test).

• Using audible EMG data for training the LSTM, and silent EMG data for testing (Aud Train - Sil Test).


• Using silent EMG data for training and testing (Sil Train - Sil Test).

Figure 8.13 shows the Mel Cepstral Distortions for the three setups on the EMG-Array-AS-200 corpus. It can be observed that the MCD for silent test data is generally notably higher (average MCD of 7.14) than for audible test data (average MCD of 5.79), revealing the differences between silent EMG and audible EMG. Using the DTW-aligned silent EMG data for LSTM training and testing improves the MCD to 6.65.

8.3.3 Comparative Summary

Table 8.4 summarizes the evaluation results obtained in this chapter to provide a final comparison of DNN and LSTM results.

Table 8.4 – Comparative summary of LSTM and DNN experiment results.

        Whisper-to-Speech           EMG-to-Speech
        MCD    U/V    r             MCD    U/V    r       Sil. MCD
LSTM    5.21   80.0   0.693         5.35   80.8   0.691   6.65
DNN     5.46   79.0   0.652         5.17   81.0   0.680   6.60

Figure 8.13 – DTW-aligned Mel Cepstral Distortions using Long Short-Term Memory networks for audible EMG training with silent EMG testing (blue bars), silent EMG training with silent EMG testing (red bars) and audible EMG training with audible EMG testing (yellow bars), evaluated on the EMG-Array-AS-200 corpus. Error bars denote standard deviation.

Comparing both approaches, the LSTM network performs better on whispered speech input data, while the DNN achieves the best MCD results using EMG data. Since the DNN achieves its best results on raw TD15 features, it effectively couples feature reduction and feature mapping within the network and thereby outperforms the LSTM, which is not capable of dealing with high-dimensional feature input. This may also explain why the whisper-to-speech LSTM outperforms the DNN when 25-dimensional input MFCCs are used and thus no feature reduction is necessary. In fact, the best session-dependent MCD result was obtained with the feed-forward network, resulting in an MCD of 4.51. While the DNN can handle high-dimensional input with remarkably good results, the LSTM-based mapping performs best with low-dimensional data that must be obtained by a feature reduction technique. Although the current mapping setup works on a per-utterance basis, it is advisable to prefer the DNN approach over the LSTM setup, especially considering that the bidirectional LSTM has a clear disadvantage when it comes to real-time applications, an aspect (among others) that will be further investigated in the comparative evaluations in Chapter 9.

Chapter 9 EMG-to-Speech: Comparative Evaluation

While the last chapters introduced three different techniques for the EMG-to-speech mapping, this chapter gives a comparison of the proposed approaches. Objective evaluations of the generated acoustic output as well as quantitative run-time analyses are presented. This is concluded by a subjective qualitative evaluation in the form of a listening test conducted with ten participants, who rated the output quality.

9.1 Objective Comparison between Mapping Approaches

In the following sections we will perform two different objective output quality evaluations on the proposed EMG-to-speech mapping approaches: mapping performance based on mel cepstral distortions (MCD), and the performance of the synthesized output on an automatic speech recognition system. This is concluded by a third objective evaluation: a quantitative comparison of the conversion times of the proposed EMG-to-speech approaches.

9.1.1 Mel Cepstral Distortion Comparison

In the previous chapters, we optimized and evaluated three different approaches for the direct transformation of EMG features to speech: Gaussian mapping, neural network based techniques (LSTMs and feed-forward DNNs) and Unit Selection. Figure 9.1 shows the mel cepstral distortions on the evaluation set of the EMG-ArraySingle-A-500+ corpus.

Figure 9.1 – MCD Comparison on the evaluation set of all investigated EMG-to-speech mapping approaches: Deep Neural Networks (DNN), Long Short Term Memory Networks (LSTM), Unit Selection (UnitSel) and Gaussian Mapping (GausMap). The error bars display standard deviation.

While there is little difference between LSTM and Unit Selection (mean MCD of 5.46 vs 5.42), DNNs give significantly (p < 0.01) the best results with an average MCD of 5.21. Gaussian Mapping obtains the highest mean MCD with 5.69. The best session-dependent result is achieved on the Spk1-Array-Large session with an MCD of 4.56. Although Spk3-Array-Large contains the largest amount of training data (111 minutes versus e.g. 28 minutes for Spk1-Array), its MCD is above the average of 5.21, indicating that the speaker encountered inconsistencies while recording the data. When array-based recordings are compared to single-electrode recordings, no superiority in terms of MCD can be stated: while speaker 1 achieves better results on EMG-array data, speaker 2 obtains the lowest MCDs with the single-electrodes recording.

9.1.2 Speech Recognizer Evaluation

To interpret the EMG-to-speech synthesis results, we additionally use an automatic speech recognition (ASR) system [SW10] to decode the synthesized evaluation sentences. The goal of this evaluation approach is to objectively assess the general intelligibility of the converted speech output rather than its spectral similarity to a reference utterance. The acoustic model of the recognizer consists of Gaussian mixture models, with the number of Gaussians determined by a merge-and-split algorithm [UNGH00]. The recognizer is based on three-state left-to-right fully continuous Hidden Markov Models using bundled phonetic features (BDPFs) for training and decoding. Phonetic features represent properties of a given phoneme, such as the place or the manner of articulation, which can improve an acoustic speech recognizer [KFS02]. Since a phonetic feature is generally shared by multiple phonemes, combining training data can exploit these phonemes to train a phonetic feature model more reliably than a single phone model. Thus, a BDPF-based recognition system performs well on small data sets, such as those in the scope of this thesis, compared to general speech domains where many hours of training data exist. Following the work from [SW10] we use a set of nine phonetic features: Voiced, Consonant, Vowel, Alveolar, Unround, Fricative, Unvoiced, Front, Plosive. These phonetic features can additionally be grouped into bundles, e.g. “voiced fricative”, giving the term bundled phonetic features.

We train the acoustic speech recognizer on the synthesized training set that was generated from the EMG input and test on the synthesized evaluation set. This is done for each of the proposed EMG-to-speech mapping techniques: Gaussian Mapping, DNN-based mapping, LSTM-based mapping and cluster-based Unit Selection. We use the array-based EMG data from Speaker 1 and Speaker 2 (contained in our EMG-ArraySingle-A-500+ corpus), which was also used in related work [WSJS13] and gives us the option of a comparison to the EMG-based speech recognition results reported by Wand et al. [WSJS13]. For decoding, we use the evaluation data corpus together with a trigram language model estimated on broadcast news. To ensure comparability to [WSJS13], we restrict the decoding vocabulary to three different sizes: 108, 905 and 2,111 words, including word variants.

We use the Word Error Rate (WER) to measure the performance. The hypothesis and reference text are aligned and the numbers of insertions, deletions and substitutions (n_i, n_d, n_s) are counted. Thus, for n words in the reference, the WER can be calculated as:

WER = (n_i + n_d + n_s) / n · 100 %.
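The WER computation corresponds to a standard edit-distance alignment between the reference and the hypothesis word sequences; the following sketch computes the combined count n_i + n_d + n_s via dynamic programming (function name and interface are illustrative).

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """WER in percent: (insertions + deletions + substitutions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# e.g. word_error_rate("the cat sat", "the cat sat down") yields 33.3 (one insertion)
```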

The final ASR results are given in Figure 9.2. In general, the DNN-based mapping obtains the best word error rates, confirming the good results that were previously reported in the MCD evaluation. The Gaussian Mapping output achieves the second-best results in the ASR evaluation, e.g. with a WER down to 3.5 % (Spk2) using a small vocabulary of 108 words, which also marks the best session-dependent ASR result. Since the Gaussian Mapping uses GMMs that are trained on MFCCs, while the acoustic model GMMs of the recognition system are trained on phone classes, we do not expect a systematic bias of the ASR system. The listening test in Section 9.2 will additionally investigate the output quality of the mapping approaches. Increasing the vocabulary size raises the WER notably. However, this WER growth affects Spk2 more strongly than Spk1: while extending the dictionary from 108 words to 2,111 words approximately doubles the WER on Spk1, Spk2 shows at least a threefold WER increase. This indicates that the output from Spk2 is more confusable. While a small vocabulary can compensate for this effect, a large vocabulary reduces the discriminability.

Figure 9.2 – Word error rates of a speech recognition system trained and tested on two different speakers with Deep Neural Network (DNN) versus Gaussian Mapping versus Long Short Term Memory (LSTM) versus Unit Selection (with Clustering) output. Three different decoding vocabulary sizes (including variants) were used: 108, 905 and 2,111 words.

Comparing the obtained word error rates to related work that uses the same setup and a similar corpus for an EMG-based ASR system [WSJS13], our results are encouraging. Wand et al. [WSJS13] achieve an average word error rate of 10.9 % on a vocabulary of 108 words with 160 training utterances, while we report a mean WER of 7.3 % using our Gaussian Mapping approach. Toth et al. [TWS09] already observed that "WER results for EMG-to-speech are actually better than results from training the ASR system directly on the EMG data". Our larger amount of training data may further contribute to these improved numbers, so a direct comparison is complicated; nevertheless, the synthesized output clearly performs very well when fed to an ASR system.

9.1.3 Run-Time Evaluation

Since one of the motivations for the direct transformation from EMG to speech is its fast processing time, we evaluate the real-time capability of the proposed conversion approaches. To this end, we measure the conversion time of the validation set of the EMG-ArraySingle-A-500+ corpus using the proposed mapping techniques. A few points should be kept in mind for this quantitative run-time comparison:

• Only the conversion time for mapping EMG features to MFCC features is stated. The additional model-loading, synthesis and file-I/O time is assumed to be constant across the different mapping approaches and is thus omitted from the comparison.

• Mapping is done on complete utterances (although in an online situation this frame-to-frame mapping could in principle be run incrementally – except for LSTMs, which need complete utterances).

Since we want to evaluate the best performing parameter setup per mapping approach, the input EMG feature set depends on the mapping approach. Thus, we use two different input EMG feature sets: 32-dimensional TD15+LDA features and the high-dimensional TD15 features. The TD15 dimensionality depends on the session, ranging from 930 dimensions for the single-electrode recordings to 5,425 dimensions for the 35-channel array recordings. The Gaussian Mapping is set up with 64 Gaussian mixtures. The number of Gaussians is linearly related to the conversion time, i.e. using 32 Gaussians approximately halves the conversion time.

The results of the quantitative run-time evaluation for the corpus-based approaches (cluster-based unit selection versus baseline unit selection) can be found in Table 9.1, while Table 9.2 presents the conversion times of the model-based approaches: feed-forward neural network (DNN) versus Long Short Term Memory (LSTM) network versus Gaussian Mapping (GMM). All times were obtained on an Intel Core i7-2700 CPU running at 3.5 GHz. The numbers in brackets state the real-time factors: the ratio of conversion time to the duration of the utterances, i.e. the sum of the development set (see Chapter 4.2.3). Thus, a real-time factor of 1 means that 1 second of input requires 1 second of pure conversion time.

Table 9.1 – Computation time and real-time factor on the development set for the baseline unit selection system as well as cluster-based unit selection using 6,000 cluster units.

Session        Baseline [hh:mm:ss] (RT)    Clustering [hh:mm:ss] (RT)
S1-Single       1:02:39 (22.5)              0:05:02 (1.8)
S2-Single       1:03:05 (23.4)              0:05:57 (2.2)
S1-Array        1:23:11 (27.7)              0:05:29 (1.8)
S2-Array        0:46:24 (19.2)              0:04:22 (1.8)
S1-Arr-Lrg     10:27:06 (81.6)              0:27:35 (2.3)
S3-Arr-Lrg     27:02:22 (136.5)             0:17:34 (2.3)

The Unit Selection conversion times depend heavily on the amount of training/codebook data, since the unit search is calculated on the whole codebook. Thus, the "Large" sessions exhibit considerably high conversion times. The proposed baseline Unit Selection is therefore of limited use for a real-time setup, and even with the proposed unit clustering approach, which considerably reduces the computation time, the real-time factor stays above 1.
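To make the real-time factor concrete, here is a minimal timing sketch; `convert_fn` stands for any of the EMG-to-MFCC mapping functions and is a placeholder, not an interface from the thesis software.

```python
import time

def real_time_factor(convert_fn, emg_features, audio_duration_s):
    """Run one conversion and return (wall-clock seconds, real-time factor).
    An RT factor below 1 means the mapping is faster than the speech it produces."""
    start = time.perf_counter()
    convert_fn(emg_features)          # EMG features -> MFCC features
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / audio_duration_s
```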

Compared to the Unit Selection results, the model-based neural network and Gaussian Mapping approaches are independent of the amount of training data, since they only need to load the previously trained models. However, they are heavily influenced by the input dimensionality and the length of the recordings (see Chapter 4.2.3). The neural network conversion is faster than all other compared approaches. Even with a high-dimensional feature input (S1-Array and S2-Array with an input feature dimensionality of 5,425), the mapping is performed below 0.1 times real-time, making this method the preferred choice for an online EMG-to-speech system. Using the reduced 32-dimensional feature input, Gaussian Mapping still achieves 0.25 times real-time, which does not exclude it from a possible real-time application.

Table 9.2 – Computation time and real-time factor on the development set for DNN, LSTM and Gaussian Mapping (GMM) approaches using high-dimensional TD15 features and 32-dimensional LDA-processed features. Times in [sec] (RT-Factor).

Session       DNN (TD15)     LSTM (LDA)    LSTM (TD15)     GMM (LDA)
S1-Single      2.9 (0.02)     5.6 (0.03)    26.6 (0.16)     42.7 (0.26)
S2-Single      2.9 (0.02)     5.4 (0.03)    23.4 (0.14)     41.4 (0.26)
S1-Array      12.3 (0.07)     6.1 (0.03)   161.4 (0.90)     45.7 (0.25)
S2-Array      10.2 (0.07)     4.7 (0.03)   144.9 (1.00)     35.5 (0.24)
S1-Arr-Lrg    14.6 (0.03)    15.5 (0.03)   139.9 (0.30)    118.7 (0.26)
S3-Arr-Lrg    16.2 (0.02)    23.4 (0.03)   155.4 (0.22)    181.5 (0.25)

9.2 Subjective Quality Comparison: Listening Test

The MCD score is a perceptually motivated distance measure and thus correlates with the human perception of speech quality [Kub93]. However, this fairly simple Euclidean distance measure has some drawbacks: for example, it does not judge naturalness or prosodic information. For this reason, we conduct a set of subjective listening tests, in which participants were asked to listen to the outputs of the proposed EMG-to-speech systems and compare them to each other.

Listening Test Setup

All listening tests were performed as a combination of a comparison test and an opinion-score test with an additionally given reference. The participating subject listens to five synthesized output variations obtained from the same sentence using the proposed EMG-to-speech mapping approaches: mapping using a Deep Neural Network (DNN), a Long Short Term Memory (LSTM) network, Unit Selection and Gaussian Mapping. Additionally, a re-synthesized reference recording (resynth) of the utterance was presented. This resynth variant is obtained by passing the speech features (MFCCs + F0) extracted from the target audio through the MLSA filter to produce a "re-synthesized" reference. This reference contains the quality degradations from the feature processing and vocoding steps and thus represents the best output we can achieve with our synthesis setup.

The best performing parameters for each mapping approach were used (see details in the previous three chapters) to generate MFCC and F0 features from the input EMG signals. The participant has to score the speech quality of an utterance on a continuous slider, which is internally scaled to the interval [0, 100]. The listening test was implemented as a browser-based system [KZ14]. Subjects can always go back to previous sentences to change their ratings. The following instructions were presented before the listening test could be started:

• Please listen to the audio files using quality headphones.
• Please rate the quality of the given utterance, ranging from bad to excellent on a continuous scale using the slider.

Additionally, we instructed the participants to arrange the sliders of the presented variations of one utterance so as to directly compare the five different variations with each other. We randomly selected ten different evaluation-set utterances from the three different speakers of the EMG-ArraySingle-A-500+ corpus, i.e. each of the ten utterances is synthesized in 5 variations. Thus, a total of 50 utterances are played to each listening test participant. For speakers with multiple sessions in the corpus, we discarded the session with the worst MCD to reduce the number of utterances the participant has to listen to. We also wanted to include both array-based and single-electrode sessions in the listening test files. Thus, utterances from Spk1-Array-Large, Spk1-Array, Spk2-Single and Spk3-Array-Large were used.

In total, 10 German listeners (aged between 19 and 30 years) participated in the test. The participants had no known hearing impairments and no particular speech synthesis expertise. Figures 9.3 and 9.4 depict the results. While Figure 9.3 shows the mean opinion scores of each participant, Figure 9.4 outlines the direct comparison of the four mapping approaches, discarding the resynth reference. Thus, Figure 9.4 shows the number of times the respective method was preferred. The resynthesized speech from the acoustic target data obtains a mean opinion score (MOS) of 61 (max. 100), implying a considerable quality loss due to the acoustic feature processing.


Figure 9.3 – Results from the evaluation listening test comparing the proposed EMG-to-speech mapping approaches: Deep Neural Networks (DNN), Gaussian Mapping (GausMap), Long Short Term Memory networks (LSTM) and Unit Selection (UnitSel). An additional reference utterance (Resynth) is given. 10 participants rated the speech quality from 0 (bad) to 100 (excellent). Error bars represent standard deviation.


Figure 9.4 – Results from the evaluation listening test: preference comparison of the proposed EMG-to-speech mapping approaches.

Investigating the EMG-to-speech preferences, most of the participants perceived the different approaches in the same way: the best output quality is obtained with DNNs, followed by Gaussian Mapping, which in turn is followed by LSTMs. Unit Selection achieves the worst opinion scores and was never preferred for a single utterance. On average, the participants preferred the DNN-based output in more than 50 % of the cases. A comparison of these results with the ASR evaluation from the previous section reveals a similar outcome: best performance for the DNN output, while the Unit Selection results are ranked last.
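The aggregation of the listening-test ratings into mean opinion scores and preference counts can be sketched as follows; the data layout (per-system rating lists and per-utterance score dictionaries) is assumed for illustration and does not reflect the actual test software.

```python
import numpy as np

def mean_opinion_scores(ratings):
    """ratings: dict mapping system name -> list of 0-100 slider scores,
    pooled over participants and utterances. Returns (MOS, std) per system."""
    return {name: (float(np.mean(vals)), float(np.std(vals)))
            for name, vals in ratings.items()}

def preference_share(per_utterance_scores):
    """per_utterance_scores: list of dicts {system: score} for each rated
    utterance (reference excluded). Returns the fraction of utterances in
    which each system received the highest score."""
    wins = {}
    for scores in per_utterance_scores:
        best = max(scores, key=scores.get)
        wins[best] = wins.get(best, 0) + 1
    total = len(per_utterance_scores)
    return {name: count / total for name, count in wins.items()}
```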

However, there are noticeable variations between the listeners' judgments. For example, the resynth reference output receives opinion scores between 36 and 81, and the DNN-based EMG-to-speech mapping obtains opinion scores in the range between 9 and 40. This implies that the acceptance of a distorted speech output is to a large degree user-dependent.

Since there is still a considerable MOS gap between the reference and the EMG-to-speech output, we investigate possible errors in the synthesized signal. Figure 9.5 shows the spectrograms of the DNN-based EMG-to-speech output and of the resynthesized reference signal for an exemplary utterance taken from Spk1-Array-Large. Both spectrograms show similar outputs, although two potential areas for improvement of the EMG-to-speech output can be spotted: 1) there are minor artifacts (marked by a black star) that are probably erroneously generated by single frames, and 2) some detailed spectral contours are blurred (e.g. the marked area in the black box). These kinds of details appear not to be properly modeled by our proposed EMG-to-speech approach.

To additionally explore the variations between the four investigated sessions, we separate the listening test results into the scores obtained per session. The left part of Figure 9.6 depicts the opinion scores per session. Similar to the previous evaluations, Spk1-Array-Large obtains the best results with a mean opinion score of 28 (out of 100), while Spk3-Array-Large tends towards lower opinion scores. We additionally compare these findings with the objective Mel Cepstral Distortion evaluation from Section 9.1.1. For this purpose, the right part of Figure 9.6 displays a scatter plot of the four sessions and the four EMG-to-speech approaches, comparing MCDs with MOS. While lower MCDs imply better performance, mean opinion scores behave the other way around. Thus, a high MOS should be associated with a low MCD.

The scatter plot partly confirms this hypothesis. For all Unit Selection as well as all DNN-based outputs, an increased MOS implies a reduced MCD. However, the worst MCD of 6.17 achieves a "mid-range" MOS of 14. Thus, we can state that an objective measure like MCD is adequate for a first quality indication, but being close to (or far from) a reference does not necessarily mean that the output is perceived as good (or bad).
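One way to quantify the relation between the objective and the subjective measure is a simple correlation coefficient over the (MCD, MOS) pairs of the scatter plot; the sketch below uses NumPy's Pearson correlation, and the example values are made up for illustration only.

```python
import numpy as np

def mcd_mos_correlation(mcd_values, mos_values):
    """Pearson correlation between per-system/per-session MCDs and MOS.
    A strongly negative coefficient would support MCD as a proxy for
    perceived quality."""
    return float(np.corrcoef(mcd_values, mos_values)[0, 1])

# Purely illustrative numbers, not the values from Figure 9.6:
print(mcd_mos_correlation([4.6, 5.1, 5.5, 6.2], [30, 22, 16, 12]))
```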

Figure 9.5 – Example spectrograms of DNN-based EMG-to-speech output (top) and resynthesized reference (bottom) of the utterance “He is trying to cut some of the benefits.”


Figure 9.6 – Left-hand side: results from the evaluation listening test per session. Right-hand side: comparison of opinion scores from the listening test with Mel Cepstral Distortion.

9.3 Summary

In this chapter we directly compared the investigated EMG-to-speech approaches against each other. The feed-forward DNN mapping achieves the best results in multiple evaluations: the best mapping performance in terms of MCD (best session with 4.56), the best speech recognizer performance and the best opinion scores in the listening test. Additionally, DNNs are computationally very fast for EMG-to-speech mapping and thus show a clear superiority over the other investigated approaches.

Chapter 10

Conclusion and Future Work

This chapter concludes the thesis with a short summary of the results and the main aspects that were tackled. It closes with a list of open research questions and suggestions for potential future work.

10.1 Conclusion

This thesis presented a direct transformation of surface electromyographic signals of the facial muscles into acoustic speech. While many silent speech approaches today use a speech recognition component, the proposed direct speech generation has advantages over a recognition-followed-by-synthesis technique: there is no restriction to a given phone set, vocabulary or even language, and there is scope to retain paralinguistic information, like prosody or speaking rate, which is crucial for natural speech communication. The speaker's voice and other speaker-dependent factors are preserved. Additionally, a direct mapping enables faster processing than speech-to-text followed by text-to-speech and thus offers the option of nearly real-time computation.

The main aspects that were tackled in this thesis are as follows:

• Data collection of simultaneous facial electromyographic and acoustic speech signals. While previous EMG/speech recordings were often

limited to 50 utterances, the recordings performed in the scope of this thesis use sessions with up to 2 hours of parallel EMG and speech data.

• Application and comparison of three different state-of-the-art machine learning approaches for directly transforming the EMG signal into speech: a statistical transformation based on Gaussian mixture models entitled Gaussian Mapping, a database-driven approach known as Unit Selection, and a feature mapping based on Deep Neural Networks. Several sets of parameters for these methods were evaluated on a development data set, achieving MCD scores down to 4.51. A final round of comparative evaluations, objective as well as subjective, was performed and revealed that the Deep Neural Network approach performs best in several categories. Using a speech recognizer on the synthesized output yields a word error rate (WER) as low as 19.2 % with a vocabulary size of 2,111 words, which improves to 3.5 % WER when the vocabulary is reduced to 108 words.

• Transformation of EMG signals recorded during silent articulation into acoustic speech output. While many silent speech researchers work on data generated during normal articulation, we introduce experiments on silent speech, even for the training of the mapping system, where no parallel acoustic data exists. To this end, we established a dynamic time warping approach that works on silent EMG data and enables an average MCD reduction from 6.81 to 6.35.

• Abolition of the need for label transcriptions (phone-level time alignments). The adoption of Canonical Correlation Analysis performs a feature dimensionality reduction using target acoustic features instead of label information, which would otherwise require a recognizer component that is prone to decoding errors (a brief sketch of this idea is given at the end of this section).

• A realization of whisper-to-audible speech mapping with a focus on prosody preservation was achieved, resulting in an F0 generation that performs with a mean voiced/unvoiced accuracy of 80 % and a correlation coefficient of 0.69.

• The feasibility of real-time processing was investigated and evaluated, with a real-time factor lower than 0.1 using the proposed feed-forward Deep Neural Network approach.

The performance of EMG-based and other biosignal-based speech interfaces has slowly, but steadily, been improved over the last years. While the approach presented in 2009 [TWS09] reported a mel cepstral distortion of 6.37, the results of this thesis raise the best performance in terms of MCD to 4.51. To put these MCD numbers into perspective, we list the most relevant related approaches that report MCD scores. Hueber et al. [HBDC11] achieve an MCD of 6.2 for transforming lip and tongue motions, captured by ultrasound and video images, into audible speech. The average MCD reported by Toda et al. [TBT04] using Electromagnetic Articulography (EMA) data is 5.59. The same approach is improved by Toda et al. [TBT08], achieving an average MCD of 4.5 with a best speaker-dependent MCD of 4.29. While EMA recordings require specific laboratory-like environments, the EMG-to-speech mapping is clearly more user-friendly.
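As a rough illustration of the CCA-based dimensionality reduction mentioned above (replacing label-based LDA with acoustic targets), the following sketch uses scikit-learn's CCA; it is a stand-in under the assumption of frame-parallel EMG and MFCC matrices, not the implementation used in this thesis.

```python
from sklearn.cross_decomposition import CCA

def cca_reduce(emg_feats, acoustic_feats, n_components=32):
    """Project high-dimensional EMG features onto directions that are
    maximally correlated with the frame-parallel acoustic (MFCC) targets.

    emg_feats:      (frames, emg_dims) training EMG feature matrix
    acoustic_feats: (frames, mfcc_dims) parallel acoustic features
    Returns the fitted CCA model and the reduced EMG features."""
    cca = CCA(n_components=n_components, max_iter=1000)
    cca.fit(emg_feats, acoustic_feats)
    reduced = cca.transform(emg_feats)   # only the EMG view is needed at test time
    return cca, reduced
```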

10.2 Suggestions for Future Work

Although the obtained results underline the ongoing progress in EMG-based speech processing, several avenues of research remain. Among them are:

• Further improvements of the EMG recording hardware. Although the current array- and grid-based electrode technology is more user-friendly than multiple single electrodes, the attachment is still cumbersome. Additionally, the recording device is bulky and immobile, which keeps EMG-based speech processing from general usage. Recently introduced epidermal electronics [KLM+11] and small-sized recording devices are promising steps towards user acceptance and applicability. Wei et al. [WRH+14] have already introduced such a device for use with EMG, which is a next step towards a handy facial EMG device that would further push forward EMG-to-speech end-user research and acceptance.

• Further investigations of acoustic feedback in real-time/online usage. Since we showed the theoretical applicability of near-real-time usage, a logical next step is the investigation of acoustic feedback to the silently articulating speaker. Direct acoustic feedback enables the user to learn, improve and gain finer control over his or her silent articulation, leading to a system that can be realized in a user-in-the-loop scenario: the user adapts according to the acoustic feedback, and additionally the system can directly adapt to the user's utterances.

• Further investigations of speaking-mode and speaker differences. The previously mentioned acoustic feedback can help to compensate for speaking-mode differences and may reveal speaker-dependent peculiarities. Additionally, phone-level acoustic information can be

obtained directly from the generated speech output rather than from the error-prone phonetic alignments that a recognition system would involve.

Yet, no EMG-based speech processing system has been used in an end-user or clinical setting, although one of the most promising applications is its use by speech-disabled persons. We hope that the findings presented in this thesis contribute towards achieving this.

Bibliography

[AEB12] Jordi Adell, David Escudero, and Antonio Bonafonte. Production of filled pauses in concatenative speech synthesis based on the underlying fluent sentence. Speech Communication, 54(3):459–476, 2012.

[AGO15] Sandesh Aryal and Ricardo Gutierrez-Osuna. Data driven articulatory synthesis with deep neural networks. Computer Speech & Language, pages 1–14, 2015.

[ANSK88] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara. Voice Conversion through Vector Quantization. In Proc. ICASSP, pages 655–658, 1988.

[ARH14] Farzaneh Ahmadi, Matheus Araújo Ribeiro, and Mark Halaki. Surface Electromyography of Neck Strap Muscles for Estimating the Intended Pitch of a Bionic Voice Source. In Proc. Biomedical Circuits and Systems Conference (BioCAS), pages 37–40, 2014.

[AS70] Bishnu S. Atal and Manfred R. Schroeder. Adaptive Predictive Coding of Speech Signals. Bell System Technical Journal, 49(8):1973–1986, 1970.

[BdL85] John V Basmajian and Carlo J de Luca. Muscles Alive. Baltimore, MD, 5th edition, 1985.

[Bel61] C. G. Bell. Reduction of Speech Spectra by Analysis-by-Synthesis Techniques. The Journal of the Acoustical Society of America, 33(12):1725, 1961.

[BHG+14] Florent Bocquelet, Thomas Hueber, Laurent Girin, Pierre Badin, and Blaise Yvert. Robust Articulatory Speech Synthesis using Deep Neural Networks for BCI Applications. Proc. Interspeech, pages 2288–2292, 2014.

[BHG+15] Florent Bocquelet, Thomas Hueber, Laurent Girin, Christophe Savariaux, and Blaise Yvert. Real-time Control of a DNN- based Articulatory Synthesizer for Silent Speech Conversion: a pilot study. In Proc. Interspeech, pages 2405–2409, 2015. [BNCKG10] Jonathan S Brumberg, Alfonso Nieto-Castanon, Philip R Kennedy, and Frank H Guenther. Brain-computer Interfaces for Speech Communication. Speech Communication, 52:367– 379, 2010. [Bol79] Steven F Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2):113–120, 1979. [BS95] Anthony J Bell and Terrence I Sejnowski. An Information- Maximization Approach to Blind Separation and Blind De- convolution. Neural Computation, 7:1129–1159, 1995. [BSM+14] Patricia Balata, Hilton Silva, Kyvia Moraes, Leandro Per- nambuco, and S´ılviaMoraes. Use of surface electromyogra- phy in phonation studies: an integrative review. International Archives of Otorhinolaryngology, 17(03):329–339, 2014. [BT05] Jeffrey C Bos and David W Tack. Speech Input Hardware In- vestigation for Future Dismounted Soldier Computer Systems. DRCD Toronto CR 2005-064, 2005. [Car97] Rich Caruana. Multitask Learning. PhD thesis, School of Com- puter Science, Carnegie Mellon University, 1997. [CEHL01] Adrian D C Chan, Kevin Englehart, Bernard Hudgins, and Dennis Lovely. Myoelectric Signals to Augment Speech Recog- nition. Medical and Biological Engineering and Computing, 39:500–506, 2001. [CEHL02] Adrian D C Chan, Kevin Englehart, Bernard S Hudgins, and Dennis Lovely. Hidden Markov Model Classification of Myolec- tric Signals in Speech. Engineering in Medicine and Biology Magazine, IEEE, 21(9):143–146, 2002. [CGL+92] Daniel M Corcos, Gerald L Gottlieb, Mark L Latash, Gil L Almeida, and Gyan C Agarwal. Electromechanical delay: An experimental artifact. Journal of Electromyography and Kine- siology: Official Journal of the International Society of Elec- trophysiological Kinesiology, 2(2):59–68, 1992. Bibliography 137

[CK79] P. R. Cavanagh and Paavo V. Komi. Electromechanical de- lay in human skeletal muscle under concentric and eccentric contractions. European Journal of Applied Physiology and Oc- cupational Physiology, 42(3):159–163, 1979. [Col13] OpenStax College. Anatomy & Physiology. Rice University, 2013. [Com92] Pierre Comon. Independent Component Analysis. Higher- Order Statistics, pages 29–38, 1992. [CYW85] D Childers, B Yegnanarayana, and K Wu. Voice Conversion: Factors Responsible for Quality. In Proc. ICASSP, pages 748– 751, 1985. [DCHM12] Yunbin Deng, Glen Colby, James T Heaton, and Geoffrey S Meltzner. Signal processing advances for the MUTE sEMG- based silent speech recognition system. Military Communica- tions Conference - MILCOM, pages 1–6, 2012. [dCK02] Alain de Cheveign´eand Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002. [DEP+06] Helenca Duxans, Daniel Erro, Javier P´erez,Ferran Diego, An- tonio Bonafonte, and Asuncion Moreno. Voice Conversion of Non-Aligned Data using Unit Selection. In TC-STAR Work- shop on Speech-to-Speech Translation, pages 237–242, 2006. [DJS15] Lorenz Diener, Matthias Janke, and Tanja Schultz. Direct Conversion from Facial Myoelectric Signals to Speech using Deep Neural Networks. In International Joint Conference on Neural Networks (IJCNN), pages 1–7, 2015. [DKN14] Patrick Doetsch, Michal Kozielski, and Hermann Ney. Fast and robust training of recurrent neural networks for offline handwriting recognition. In Proc. International Conference on Frontiers in Handwriting Recognition, pages 279–284, 2014. [DLR77] Arthur P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Jour- nal of the royal statistical society. Series B (methodological), (1):1–38, 1977. [DM04] Arnaud Delorme and Scott Makeig. EEGLAB: An Open Source Toolbox for Analysis of Single-Trial EEG Dynamics 138 Bibliography

including Independent Component Analysis. Journal of Neu- roscience Methods, 134(1):9–21, 2004. [DMC14] Winston De Armas, Khondaker a. Mamun, and Tom Chau. Vocal frequency estimation and voicing state prediction with surface EMG pattern recognition. Speech Communication, 63- 64:15–26, 2014. [Doe42] E Doehne. Bedeutet Fl¨usternStimmruhe oder Stimmscho- nung? Technical report, 1942. [DPH+09] Yunbin Deng, Rupal Patel, J.T. Heaton, Glen Colby, L.D. Gilmore, Joao Cabrera, S.H. Roy, C.J.D. Luca, and G.S. Meltzner. Disordered Speech Recognition Using Acoustic and sEMG Signals. In Proc. Interspeech, pages 644–647, 2009. [DRW39] , R R Riesz, and S S A Watkins. A synthetic speaker. Journal of the Franklin Institute, 227(6):739–764, 1939. [DS04] Bruce Denby and Maureen Stone. Speech Synthesis from Real Time Ultrasound Images of the Tongue. In Proc. ICASSP, pages 685–688, 2004. [DSH+10] Bruce Denby, Tanja Schultz, Kiyoshi Honda, Thomas Hueber, and James Gilbert. Silent Speech Interfaces. Speech Commu- nication, 52(4):270–287, 2010. [DSLD10] Siyi Deng, Ramesh Srinivasan, Tom Lappas, and Michael D’Zmura. EEG classification of imagined syllable rhythm us- ing Hilbert spectrum methods. Journal of Neural Engineering, 7(4):046006, 2010. [DYK14] Rasmus Dall, Junichi Yamagishi, and Simon King. Rating Naturalness in Speech Synthesis: The Effect of Style and Ex- pectation. In Proceedings of the 7th International Conference on Speech Prosody (SP2014), pages 1012–1016, 2014. [FAB56] Knud Faaborg-Andersen and Fritz Buchthal. Action poten- tials from internal laryngeal muscles during phonation. Nature, 178:1168–1169, 1956. [FAE58] Knud Faaborg-Andersen and A Edfeldt. Electromyography of Intrinsic and Extrinsic Laryngeal Muscles During Silent Speech: Correlation with Reading Activity: Preliminary Re- port. Acta oto-laryngologica, 49(1):478–482, 1958. Bibliography 139

[FC86] Alan J Fridlund and John T Cacioppo. Guidelines for Human Electromyographic Research. Psychophysiology, 23:567–589, 1986.

[FCBD+10] Victoria-M. Florescu, Lise Crevier-Buchman, Bruce Denby, Thomas Hueber, Antonia Colazo-Simon, Claire Pillot-Loiseau, Pierre Roussel, Cédric Gendrot, and Sophie Quattrocchi. Silent vs Vocalized Articulation for a Portable Ultrasound-Based Silent Speech Interface. In Proc. Interspeech, pages 450–453, 2010.

[FEG+08] Michael J Fagan, Stephen R Ell, James M Gilbert, E. Sarrazin, and P. M. Chapman. Development of a (silent) speech recognition system for patients following . Medical Engineering and Physics, 30(4):419–425, 2008.

[Fis36] RA Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179–188, 1936.

[FMBG08] Elia Formisano, Federico De Martino, Milene Bonte, and Rainer Goebel. Who Is Saying What? Brain-Based Decoding of Human Voice and Speech. Science, 322(5903):970–973, 2008.

[FQXS14] Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong. TTS synthesis with bidirectional LSTM based Recurrent Neural Networks. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, (September):1964–1968, 2014.

[FT92] Toshiaki Fukada and Keiichi Tokuda. An adaptive algorithm for mel-cepstral analysis of speech. Proc. ICASSP, pages 137– 140, 1992.

[FTD12] Joao Freitas, Antonio Teixeira, and Miguel Sales Dias. Towards a Silent Speech Interface for Portuguese. In Proc. Biosignals, pages 91–100, 2012.

[GB75] Leslie Alexander Geddes and Lee Edward Baker. Principles of Applied Biomedical Instrumentation. John Wiley & Sons, 1975.

[GGT06] Frank H Guenther, Satrajit S Ghosh, and Jason A Tourville. Neural Modeling and Imaging of the Cortical Interactions un- 140 Bibliography

derlying Syllable Production. Brain and Language, 96:280– 301, 2006. [GHSS72] Thomas Gay, Hajime Hirose, Marshall Strome, and Masayuki Sawashima. Electromyography of the Intrinsic Laryngeal Mus- cles during Phonation. Annals of Otology, Rhinology and Laryngology, 81(3):401–409, 1972. [Giu] Serban Giuroiu. Cuda K-Means Clustering. https://github.com/serban/kmeans. [GLF+93] John S Garofolo, Lori F Lamel, William M Fisher, Jonathon G Fiscus, and David S Pallett. DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93:27403, 1993. [GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. Proc. ICASSP, (6):6645–6649, 2013. [GRH+10] James M Gilbert, Sergey I Rybchenko, Robin Hofe, Stephen R Ell, Michael J Fagan, Roger K Moore, and Phil D Green. Iso- lated Word Recognition of Silent Speech using Magnetic Im- plants and Sensors. Medical Engineering and Physics, 32:1189– 1197, 2010. [GS05] Alex Graves and J¨urgenSchmidhuber. Framewise phoneme classification with bidirectional LSTM networks. Proceedings of the International Joint Conference on Neural Networks, 4:2047–2052, 2005. [GS15] Klaus Greff and Rupesh Srivastava. Brainstorm. https://github.com/IDSIA/brainstorm, 2015. [GSF+95] Ch Gondran, E. Siebert, P. Fabry, E. Novakov, and P. Y. Gumery. Non-polarisable dry electrode based on NASICON ceramic. Medical & Biological Engineering & Computing, 33(3):452–457, 1995. [GSK+15] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutnik, Bas Ste- unebrink, and J¨urgenSchmidhuber. LSTM: A Search . arXiv preprint arXiv:1503.04069, 2015. [GW96] Mark JF Gales and Philip C Woodland. Mean and variance adaptation within the MLLR framework. Computer Speech & Language, 10(4):249–264, 1996. Bibliography 141

[HAC+07] Thomas Hueber, Guido Aversano, Gerard Chollet, Bruce Denby, Gerard Dreyfus, Yacine Oussar, Philippe Roussel, and Maureen Stone. Eigentongue feature extraction for an ultrasound-based silent speech interface. In Proc. ICASSP, pages 1245–1248, 2007.

[HAH01] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken language processing: a guide to theory, algorithm, and system development. Prentice Hall, 2001.

[Hay60] Keith J. Hayes. Wave Analyses of Tissue Noise and Muscle Action Potentials. Journal of Applied Physiology, 15:749–752, 1960.

[HB96] Andrew J Hunt and Alan W Black. Unit Selection in a Con- catenative Speech Synthesis System Using a Large Speech Database. In Proc. ICASSP, pages 373–376, 1996.

[HBC+10] Thomas Hueber, Elie-Laurent Benaroya, G´erard Chollet, Bruce Denby, G´erardDreyfus, and Maureen Stone. Devel- opment of a Silent Speech Interface Driven by Ultrasound and Optical Images of the Tongue and Lips. Speech Communica- tion, 52:288–300, 2010.

[HBDC11] Thomas Hueber, Elie-Laurent Benaroya, Bruce Denby, and Gérard Chollet. Statistical Mapping Between Articulatory and Acoustic Data for an Ultrasound-Based Silent Speech Interface. In Proc. Interspeech, 2011.

[HCD+07] Thomas Hueber, G´erard Chollet, Bruce Denby, Maureen Stone, and Leila Zouari. Ouisper: Corpus Based Synthesis Driven by Articulatory Data. In Proceedings of 16th Interna- tional Congress of Phonetic Sciences, pages 2193–2196, 2007.

[HHdP+15] Christian Herff, Dominic Heger, Adriana de Pesters, Dominic Telaar, Peter Brunner, Gerwin Schalk, and Tanja Schultz. Brain-to-Text: Decoding Spoken Phrases from Phone Repre- sentations in the Brain. Frontiers in Neuroscience, 9(June):1– 11, 2015.

[Hir71] Hajime Hirose. Electromyography of the Articulatory Muscles: Current Instrumentation and Techniques. Haskins Laborato- ries Status Report on Speech Research, SR-25/26:73 – 86, 1971. 142 Bibliography

[HKSS07] Panikos Heracleous, Tomomi Kaino, Hiroshi Saruwatari, and Kiyohiro Shikano. Unvoiced Speech Recognition Using Tissue- Conductive Acoustic Sensor. EURASIP Journal on Advances in Signal Processing, 2007:1–11, 2007.

[HL08] Yi Hu and Philipos C. Loizou. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):229–238, 2008.

[HO92] T. Hasegawa and K. Ohtani. Oral image to voice converter-image input microphone. In Singapore ICCS/ISITA’92.’Communications on the Move, pages 617–620, 1992.

[HO00] Aapo Hyv¨arinenand Erkki Oja. Independent Component An- alysis: Algorithms and Applications. Neural Networks, 13:411– 430, 2000.

[HOS+10] Tatsuya Hirahara, Makoto Otani, Shota Shimizu, Tomoki Toda, Keigo Nakamura, Yoshitaka Nakajima, and Kiyohiro Shikano. Silent-speech Enhancement using Body-conducted Vocal-tract Resonance Signals. Speech Communication, 52:301–313, 2010.

[Hot36] Harold Hotelling. Relations Between Two Sets of Variates. Biometrika, 28(3/4):321–377, 1936.

[HPA+10] Dominic Heger, Felix Putze, Christoph Amma, Michael Wand, Igor Plotkin, Thomas Wielatt, and Tanja Schultz. Biosig- nalsStudio: A flexible framework for biosignal capturing and processing. In KI 2010: Advances in Artificial Intelligence, pages 33–39, 2010.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1–32, 1997.

[HSST04] David Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical Correlation Analysis: An Overview with Applica- tion to Learning Methods. Technical report, 2004.

[Ima83] Satoshi Imai. Cepstral analysis synthesis on the mel frequency scale. Acoustics, Speech, and Signal Processing, IEEE, pages 93–96, 1983. Bibliography 143

[Int99] International Phonetic Association. Handbook of the Inter- national Phonetic Association. Cambridge University Press, 1999.

[IT06] ITU-T. Itu-T Recommendation P.10. Technical report, 2006.

[Ita75a] Fumitada Itakura. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America, 57(1):35, 1975.

[Ita75b] Fumitada Itakura. Minimum prediction residual principle ap- plied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67 – 72, 1975.

[ITI05] Taisuke Ito, Kazuya Takeda, and Fumitada Itakura. Analysis and recognition of whispered speech. Speech Communication, 45(2):139–152, 2005.

[Jan10] Matthias Janke. Spektrale Methoden zur EMG-basierten Erkennung lautloser Sprache. PhD thesis, Karlsruhe Institute of Technology, 2010.

[JD10] Charles Jorgensen and Sorin Dusan. Speech Interfaces based upon Surface Electromyography. Speech Communication, 52(4):354–366, 2010.

[JLA03] Charles Jorgensen, D.D. Lee, and S. Agabont. Sub auditory speech recognition based on EMG signals. In Proceedings of the International Joint Conference on Neural Networks, 2003, volume 4, pages 3128–3133, Portland, Oregon, 2003.

[JŠ08] Slobodan T. Jovičić and Zoran Šarić. Acoustic Analysis of Consonants in Whispered Speech. Journal of Voice, 22(3):263–274, 2008.

[JSW+06] Szu-Chen Jou, Tanja Schultz, Matthias Walliczek, Florian Kraft, and Alex Waibel. Towards Continuous Speech Recog- nition using Surface Electromyography. In Proc. Interspeech, pages 573–576, 2006.

[JWH+14] Matthias Janke, Michael Wand, Till Heistermann, Kishore Prahallad, and Tanja Schultz. Fundamental frequency generation for whisper-to-audible speech conversion. In Proc. ICASSP, pages 2579–2583, 2014.

[JWS10] Matthias Janke, Michael Wand, and Tanja Schultz. A Spectral Mapping Method for EMG-based Recognition of Silent Speech. Proc. B-INTERFACE, pages 2686–2689, 2010.

[KB04] John Kominek and Alan W Black. The CMU Arctic speech databases. Fifth ISCA Workshop on Speech Synthesis, pages 223–224, 2004.

[KFS02] Katrin Kirchhoff, Gernot A Fink, and Gerhard Sagerer. Combining Acoustic and Articulatory Feature Information for Robust Speech Recognition. Speech Communication, 37(3-4):303–319, 2002.

[KHM05] Anna Karilainen, Stefan Hansen, and J¨orgM¨uller. Dry and Capacitive Electrodes for Long-Term ECG-Monitoring. In 8th Annual Workshop on Semiconductor Advances, volume November, pages 155–161, 2005.

[Kla82] Dennis H Klatt. The Klattalk text-to-speech conversion sys- tem. In Proc. ICASSP, pages 1589–1592, 1982.

[Kla87] Dennis H Klatt. Review of text-to-speech conversion for English. The Journal of the Acoustical Society of America, 82(3):737–793, 1987.

[KLM+11] Dae-Hyeong Kim, Nanshu Lu, Rui Ma, Yun-Soung Kim, Rak-Hwan Kim, Shuodao Wang, Jian Wu, Sang Min Won, Hu Tao, Ahmad Islam, Ki Jun Yu, Tae-il Kim, Raeed Chowdhury, Ming Ying, Lizhi Xu, Ming Li, Hyun-Joong Chung, Hohyun Keum, Martin McCormick, Ping Liu, Yong-wei Zhang, Fiorenzo G Omenetto, Yonggang Huang, Todd Coleman, and John A Rogers. Epidermal electronics. Science, 333:838–843, August 2011.

[KM98] Alexander Kain and Michael W Macon. Spectral Voice Con- version for Text-to-speech Synthesis. In Proc. ICASSP, pages 285–288, 1998.

[KNG05] Raghunandan S. Kumaran, Karthik Narayanan, and John N. Gowdy. Myoelectric Signals for Multimodal Speech Recogni- tion. In Proc. Interspeech, pages 1189–1192, 2005.

[KP77] Diane Kewley-Port. EMG signal processing for speech re- search. Haskins Laboratories Status Report on Speech Re- search, SR-50:123–146, 1977. Bibliography 145

[KP04] Christopher T Kello and David C Plaut. A neural network model of the articulatory-acoustic forward mapping trained on recordings of articulatory parameters. The Journal of the Acoustical Society of America, 116(4 Pt 1):2354–2364, 2004. [KR05] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact in- dexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005. [Kra07] R¨udigerKramme. Medizintechnik. Springer-Verlag, 2007. [KSZ+09] Heather L. Kubert, Cara E. Stepp, Steven M. Zeitels, John E. Gooey, Michael J. Walsh, S. R. Prakash, Robert E. Hillman, and James T. Heaton. Electromyographic control of a hands- free electrolarynx using neck strap muscles. Journal of Com- munication Disorders, 42(3):211–225, 2009. [Kub93] Robert F Kubichek. Mel-cepstral Distance Measure for Objec- tive Speech Quality Assessment. In IEEE Pacific Rim Confer- ence on Communications, Computers and Signal Processing, pages 125–128, 1993. [KYHI14] Takatomi Kubo, Masaki Yoshida, Takumu Hattori, and Kazushi Ikeda. Towards excluding redundancy in electrode grid for automatic speech recognition based on surface EMG. Neurocomputing, 2014. [KZ14] Sebastian Kraft and Udo Z¨olzer. BeaqleJS : HTML5 and JavaScript based Framework for the Subjective Evaluation of Audio Quality. In Linux Audio Conference (LAC-2014), 2014. [Lee10] Ki-Seung Lee. Prediction of acoustic feature parameters using myoelectric signals. IEEE transactions on bio-medical engi- neering, 57(7):1587–1595, 2010. [LJ16a] Patrick J. Lynch and C. Carl Jaffe. Lateral Head Anatomy. https://en.wikipedia.org/wiki/File:Lateral head anatomy.jpg, 2016. [LJ16b] Patrick J. Lynch and C. Carl Jaffe. Upper Speech Production Apparatus. https://commons.wikimedia.org/wiki/File:Head lateral mouth anatomy.jpg, 2016. [LLM06] Yuet-Ming Lam, Philip Heng-Wai Leong, and Man-Wai Mak. Frame-Based SEMG-to-Speech Conversion. In IEEE Interna- 146 Bibliography

tional Midwest Symposium on Circuits and Systems (MWS- CAS), pages 240–244, 2006. [LMDL14] Jingjie Li, Ian Vince McLoughlin, Li-Rong Dai, and Zhen- hua Ling. Whisper-to-speech conversion using restricted Boltz- mann machine arrays. Electronics Letters, 50(24):1781–1782, 2014. [LML+15] Jingjie Li, Ian Vince McLoughlin, Cong Liu, Shaofei Xue, and Si Wei. Multi-task deep neural network acoustic models with model adaptation using discriminative speaker identity for whisper recognition. In Proc. ICASSP, pages 4969–4973, 2015. [Mac05] David J C MacKay. Information Theory, Inference, and Learn- ing Algorithms. 2005. [MBB+07] Larbi Mesbahi, Vincent Barreaud, Olivier Boeffard, De Ker- ampont, and F-Lannion Cedex. GMM-Based Speech Trans- formation Systems under Data Reduction. In ISCA Workshop on Speech Synthesis, pages 119–124, 2007. [MBJS96] Scott Makeig, Anthony J Bell, Tzyy-Ping Jung, and Terrence J Sejnowski. Independent Component Analysis of Electroen- cephalographic Data. In Advances in Neural Information Pro- cessing Systems 8, pages 145–151, 1996. [MHMSW05] Lena Maier-Hein, Florian Metze, Tanja Schultz, and Alex Waibel. Session Independent Non-Audible Speech Recognition Using Surface Electromyography. In IEEE Workshop on Auto- matic Speech Recognition and Understanding, pages 331–336, San Juan, Puerto Rico, 2005. [MHS03] Hiroyuki Manabe, Akira Hiraiwa, and Toshiaki Sugimura. Un- voiced speech recognition using EMG-mime speech recogni- tion. In CHI’03 extended abstracts on human factors in com- puting systems, pages 794–795, 2003. [MMS85] Tadashi Masuda, Hisao Miyano, and Tsugutake Sadoyama. A Surface Electrode Array for Detecting Action Potential Trains of Single Motor Units. Electroencephalography and Clinical Neurophysiology, 60:435–443, 1985. [MO86] Michael S. Morse and Edward M. O’Brien. Research summary of a scheme to ascertain the availability of speech informa- Bibliography 147

tion in the myoelectric signals of neck and head muscles us- ing surface electrodes. Computers in Biology and Medicine, 16(6):399–410, 1986. [MP13] Roberto Merletti and Philip Parker. Electromyography: Physi- ology, Engineering, and Noninvasive Applications. John Wiley and Sons, Inc., 2013. [MSH+08] Geoffrey S Meltzner, Jason Sroka, James T Heaton, L Donald Gilmore, Glen Colby, Serge Roy, Nancy Chen, and Carlo J De Luca. Speech Recognition for Vocalized and Subvocal Modes of Production Using Surface EMG Signals from the Neck and Face. In Proc. Interspeech, 2008. [MWR+94] B. U. Meyer, K. Werhahn, J. C. Rothwell, S. Roericht, and C. Fauth. Functional Organisation of Corticonuclear Pathways to Motoneurones of Lower Facial Muscles in Man. Experimen- tal Brain Research, 101(3):465–472, 1994. [Nak05] Yoshitaka Nakajima. Development and Evaluation of Soft Sil- icone NAM Microphone. Technical report, IEICE, 2005. [ND16] National Institute on Deafness (NIDCD) and Other Commu- nication Disorders. Statistics on Voice, Speech, and Language, 2016. [NKSC03] Yoshitaka Nakajima, Hideki Kashioka, Kiyohiro Shikano, and Nick Campbell. Non-Audible Murmur Recognition. In Proc. Eurospeech, 2003. [NMRY95] M. Narendranath, Hema A Murthy, S. Rajendran, and B. Yeg- nanarayana. Transformation of Formants for Voice Conver- sion using Artificial Neural Networks. Speech Communication, 16(2):207–216, 1995. [NTKS06] Mikihiro Nakagiri, Tomoki Toda, Hideki Kashioka, and Kiy- ohiro Shikano. Improving Body Transmitted Unvoiced Speech with Statistical Voice Conversion. In Proc. Interspeech, pages 2270–2273, 2006. [NTSS12] Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiy- ohiro Shikano. Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Communication, 54(1):134–146, 2012. [Ot2] OT Bioelettronica. http://www.otbioelettronica.it. 148 Bibliography

[Pet84] Eric David Petajan. Automatic Lipreading to Enhance Speech Recognition. In Proceedings of the IEEE Communication So- ciety Global Telecommunications Conference, 1984.

[PH10] Sanjay A Patil and John H L Hansen. The Physiological Mi- crophone (PMIC): A Competitive Alternative for Speaker As- sessment in Stress Detection and Speaker Verification. Speech Communication, 52:327–340, 2010.

[PMB13] Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the Number of Response Regions of Deep Feed Forward Networks with Piece-wise Linear Activations. pages 1–17, 2013.

[PVE+13] Kishore Prahallad, Anandaswarup Vadapalli, Naresh Elluru, Gautam Mantena, Bhargav Pulugundla, Peri Bhaskararao, Hema A Murthy, Simon King, Vasilis Karaiskos, and Alan W Black. The Blizzard Challenge 2013 - Indian Language Tasks. Blizzard Challenge Workshop, (1), 2013.

[PWCS09] Anne Porbadnigk, Marek Wester, Jan-P. Calliess, and Tanja Schultz. EEG-based Speech Recognition - Impact of Temporal Effects. In Proc. Biosignals, 2009.

[RHK11] Korin Richmond, Phil Hoole, and Simon King. Announc- ing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech, pages 1505– 1508, 2011.

[Rot92] Martin Rothenberg. A multichannel electroglottograph. Jour- nal of Voice, 6(1):36–43, 1992.

[RVP10] E.V. Raghavendra, P Vijayaditya, and Kishore Prahallad. Speech Synthesis using Artificial Neural Networks. Communi- cations (NCC), 2010 National Conference on, pages 5–9, 2010.

[Sag88] Yoshinori Sagisaka. Speech synthesis by rule using an optimal selection of non-uniform synthesis units. Proc. ICASSP, pages 679 – 682, 1988.

[SCM95] Yannis Stylianou, Olivier Cappe, and Eric Moulines. Statisti- cal Methods for Voice Quality Transformation. In Proc. Fourth European Conference on Speech Communication and Technol- ogy, pages 447–450, 1995. Bibliography 149

[SCM98] Yannis Stylianou, Olivier Cappe, and Eric Moulines. Continu- ous probabilistic transform for voice conversion. IEEE Trans- actions on Speech and Audio Processing, 6(2):131–142, 1998.

[SDM79] Thomas Shipp, E. Thomas Doherty, and Philip Morrissey. Pre- dicting vocal frequency from selected physiologic measures. The Journal of the Acoustical Society of America 66, 3:678– 684, 1979.

[SHK+14] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout : A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014.

[SHP07] Erik J. Scheme, Bernard Hudgins, and Phillip A. Parker. Myo- electric signal classification for phoneme-based speech recogni- tion. IEEE transactions on bio-medical engineering, 54(4):694– 9, 2007.

[SHRH09] Cara E Stepp, James T Heaton, Rebecca G Rolland, and Robert E Hillman. Neck and Face Surface Electromyography for Prosthetic Voice Control After Total Laryngectomy. IEEE Transactions on Neural Systems and Rehabilitation Engineer- ing, 17:146–155, 2009.

[SSB14] Hasim Sak, Andrew Senior, and Franc Beaufays. Long Short- Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. In Proc. Interspeech, pages 338–342, 2014.

[ST85] Noboru Sugie and Koichi Tsunoda. A Speech Em- ploying a Speech Synthesizer – Vowel Discrimination from Pe- rioral Muscle Activities and Vowel Production. IEEE Trans- actions on Biomedical Engineering, 32(7):485–490, 1985.

[Ste22] John Q Steward. Electrical Analogue of the Vocal Organs. Nature, (110):311–312, 1922.

[SVN37] Stanley Smith Stevens, John Volkmann, and Edwin B New- man. A Scale for the Measurement of the Psychological Magni- tude Pitch. The Journal of the Acoustical Society of America, 8(3):185–190, 1937. 150 Bibliography

[SW10] Tanja Schultz and Michael Wand. Modeling Coarticulation in EMG-based Continuous Speech Recognition. Speech Commu- nication, 52(4):341–353, 2010. [Tay09] Paul Taylor. Text-to-Speech Synthesis. Cambridge University Press, Cambridge, 2009. [TBAO08] Toshio Tsuji, Nan Bu, Jun Arita, and Makoto Ohga. A speech synthesizer using facial EMG signals. International Journal of Computational Intelligence and Applications, 7(1):1–15, 2008. [TBLT08] Viet Anh Tran, G´erard Bailly, H´el`ene Loevenbruck, and Tomoki Toda. Predicting F0 and voicing from NAM-captured whispered speech. Proceedings of the 4th International Con- ference on Speech Prosody, pages 107–110, 2008. [TBLT10] Viet Anh Tran, Gerard Bailly, H´el`ene Loevenbruck, and Tomoki Toda. Improvement to a NAM-captured Whisper-to- Speech System. Speech Communication, 52:314–326, 2010. [TBT04] Tomoki Toda, Alan W Black, and Keiichi Tokuda. Mapping from Articulatory Movements to Vocal Tract Spectrum with Gaussian Mixture Model for Articulatory Speech Synthesis. In Proc. of the 5th ISCA Speech Synthesis Workshop, 2004. [TBT07] Tomoki Toda, Alan W Black, and Keiichi Tokuda. Voice con- version based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech and Language Processing, 15(8):2222–2235, 2007. [TBT08] Tomoki Toda, Alan W Black, and Keiichi Tokuda. Statistical mapping between articulatory movements and acoustic spec- trum using a Gaussian mixture model. Speech Communication, 2008. [TM69] Mark Tatham and Katherine Morton. Some electromyography data towards a model of speech production. Language and speech, 12:39–53, 1969. [TMB12] Tomoki Toda, Takashi Muramatsu, and Hideki Banno. Imple- mentation of Computationally Efficient Real-Time Voice Con- version. Proc. Interspeech, pages 94–97, 2012. [TMB13] Tam Tran, Soroosh Mariooryad, and Carlos Busso. Audiovi- sual Corpus To Analyze Whisper Speech. In Proc. ICASSP, pages 8101–8105, 2013. Bibliography 151

[TSB+00] Ingo R. Titze, Brad H. Story, Gregory C. Burnett, John F. Holzrichter, Lawrence C. Ng, and Wayne a. Lea. Compari- son between electroglottography and electromagnetic glottog- raphy. The Journal of the Acoustical Society of America, 107(1):581, 2000.

[TWS09] Arthur Toth, Michael Wand, and Tanja Schultz. Synthesizing Speech from Electromyography using Voice Transformation Techniques. In Proc. Interspeech, pages 652–655, 2009.

[UMRR12] Benigno Uria, Iain Murray, Steve Renals, and Korin Rich- mond. Deep Architectures for Articulatory Inversion. In Proc. Interspeech, pages 867–870, 2012.

[UNGH00] Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton. Split and merge EM algorithm for improving Gaussian mixture density estimates. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, (1):133–140, 2000.

[Vit67] Andrew J Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans- actions on Information Theory, 13(2):260–269, 1967.

[Von91] Wolfgang Von Kempelen. Mechanismus der menschlichen Sprache. 1791.

[VPH+93] A Villringer, J Planck, C Hock, L Schleinkofer, and U Dirnagl. Near infrared spectroscopy (NIRS): A new tool to study hemo- dynamic changes during activation of brain function in human adults. Neuroscience Letters, 154:101–104, 1993.

[WB15] Felix Weninger and Johannes Bergmann. Introducing CUR- RENNT - the Munich Open-Source CUDA RecurREnt Neu- ral Network Toolkit. Journal of Machine Learning Research, 16:547–551, 2015.

[Wer90] Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[Wik16] Wikipedia. Throat anatomy diagram. https://en.wikipedia.org/wiki/File:Throat anatomy diagram.svg, 2016. 152 Bibliography

[WKJ+06] Matthias Walliczek, Florian Kraft, Szu-Chen Jou, Tanja Schultz, and Alex Waibel. Sub-Word Unit Based Non-Audible Speech Recognition Using Surface Electromyography. In Proc. Interspeech, pages 1487–1490, 2006.

[WRH+14] Pinghung Wei, Milan Raj, Yung-yu Hsu, Briana Morey, Paolo Depetrillo, Bryan Mcgrane, Monica Lin, Bryan Keen, Cole Papakyrikos, Jared Lowe, and Roozbeh Ghaffari. A Stretchable and Flexible System for Skin-Mounted Measurement of Motion Tracking and Physiological Signals. In 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5772–5775, 2014.

[WSJS13] Michael Wand, Christopher Schulte, Matthias Janke, and Tanja Schultz. Array-based Electromyographic Silent Speech Interface. In Proc. Biosignals, 2013.

[WVK+13] Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng, and Haizhou Li. Exemplar-Based Unit Selection for Voice Conversion Utilizing Temporal Information. In Proc. Interspeech, pages 950–954, 2013.

[WW83] Bruce B Winter and John G Webster. Driven-Right-Leg Cir- cuit design. IEEE Transactions on Bio-Medical Engineering, 30(1):62–66, 1983.

[WZW+13] Martin W¨ollmer, Zixing Zhang, Felix Weninger, Bj¨orn Schuller, and Gerhard Rigoll. Feature Enhancement by Bidi- rectional LSTM Networks for Conversational Speech Recog- nition in Highly Non-Stationary Noise. Proc. ICASSP, pages 6822–6826, 2013.

[YES+14] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhi- heng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, Jasha Droppo, Geoffrey Zweig, Chris Rossbach, Jon Currey, Jie Gao, Avner May, Baolin Peng, Andreas Stolcke, and Malcolm Slaney. An Introduction to Computational Networks and the Computational Network Toolkit. Technical Report MSR-TR-2014-112, 2014.

[YLN15] Shing Yu, Tan Lee, and Manwa L. Ng. Surface Electromyo- graphic Activity of Extrinsic Laryngeal Muscles in Cantonese Tone Production. Journal of Signal Processing Systems, 2015. Bibliography 153

[ZJEH09] Quan Zhou, Ning Jiang, Kevin Englehart, and Bernard Hud- gins. Improved Phoneme-based Myoelectric Speech Recog- nition. IEEE Transactions on Biomedical Engineering, 56:8, 2009. [ZRM+13] Matthew D. Zeiler, Marc’Aurelio Ranzato, Rajat Monga, Min Mao, Kun Yang, Quoc Viet Le, Patrick Nguyen, Andrew Se- nior, V. Vanhoucke, J. Dean, and Geoffrey E. Hinton. On Rec- tified Linear Units for Speech Processing. In Proc. ICASSP, pages 3517–3521, 2013.