École des Hautes Études en Sciences Sociales: Codage efficace de la parole à court terme

Total pages: 16

File type: PDF, size: 1020 KB

École des Hautes Études en Sciences Sociales
École doctorale de l'EHESS
Centre d'Analyse et de Mathématique sociales

Doctoral thesis
Discipline: Cognitive science (computational neuroscience track)

FRANÇOIS DELOCHE
Codage efficace de la parole à court terme
Short time-scale efficient coding of speech

Thesis supervisor: Jean-Pierre Nadal
Defense date: 22 October 2019

Reviewers (rapporteurs):
Frédéric Bimbot, IRISA Rennes
Bruno Torrésani, Université Aix-Marseille

Examiners:
Gilles Chardon, CentraleSupélec
Judit Gervain, Université Paris-Descartes
Shihab Shamma, ENS Paris

Acknowledgements

First of all, I extend my most sincere thanks to my thesis supervisor, Jean-Pierre Nadal. From the start of my M2 master's internship (Mathématiques, Vision, Apprentissage), he steered me toward this subject, which fascinated me throughout these three and a half years, and still does! I am deeply grateful to him for that. I thank him for always making himself available to answer my questions and for all the help he gave me over these years. I greatly appreciated the freedom he granted me in choosing research directions, without which the thesis could have taken a very different path. He always welcomed my various ideas and proposals with interest and kindness, while making sure I remained well supervised. I also wish to include in these thanks Laurent Bonnasse-Gahot who, although his name does not appear on the cover of this thesis, has been like a second advisor to me. I warmly thank him for his invaluable advice. The many discussions I had the pleasure of having with him contributed greatly to the smooth progress of the thesis and meant a great deal to me personally.
I also wish to thank the members of the ANR SpeechCode project: thanks to Judit Gervain for indirectly leading me to work on this captivating subject, for agreeing to sit on the thesis jury, and for the help she gave me on several occasions, together with Ramon Guevarra Erra. Thanks to Christian Lorenzi for his support and his most valuable advice; I was thrilled by the fascinating discussions about hearing that I had with him. I warmly thank Bruno Torrésani and Frédéric Bimbot for agreeing to review this thesis, and Shihab Shamma and Gilles Chardon for doing me the honor of sitting on the jury. I extend broader thanks to all the researchers I interacted with over these three years. I would like to mention in particular Monika Dörfler, whom I met at the Peyresq summer school (2018), who drew the connection between my work and flexible Gabor wavelets, and thereby introduced me to a whole literature I had been largely unaware of. Finally, thanks to Gabriel Peyré and Emmanuel Dupoux for serving on my thesis advisory committee. I had the good fortune to be hosted at the Centre d'Analyse et de Mathématique sociales (CAMS) at the EHESS, which was an ideal place for me to carry this thesis through to completion. These excellent working conditions were made possible by the presence and work of the members of CAMS, in particular Sandrine Nadal, Nathalie Brusseaux, Thomas Tailpied and Francesca Aceto. Friendly greetings to all the students and postdocs I spent time with at CAMS (or at the ENS), who considerably softened the harshness of the thesis through the moments spent together (in approximate order of appearance): Kévin B., Quentin F., Romain D., Samuel N., Alessandro Z., Andrea T., Elisa A., Benedetta F., Antoine P., Noemi M., José M., Charles L., Beniada S., Gabrielle N., Federico B., Julien B., Imke M., Nicolas (Songshen) Y.
On a more personal note, I thank my family and long-standing friends, starting with my parents, for their support over many years. It is with this longer time scale in mind that I wish to pay tribute to the teachers who helped me discover and learn so much during my studies and who inspired this interest in science; the thesis marks the end of a long education that in truth took far more than three years! Thanks to my flatmates (Lawrence & Dhafer) for putting up with me daily. Finally, I could not end this acknowledgements section without mentioning the IVL supporters (Loup, Constance & all the others :pandayay: :love:).

Résumé (French abstract, translated)

Data analyses have shown that the frequency selectivity of the cochlea is adapted to the statistical structure of speech. This result is consistent with the efficient coding hypothesis, according to which sensory processing codes optimally for natural stimuli. However, the speech signal has a rich structure even on short time scales, owing to the diversity of the acoustic factors involved in speech generation. This structural complexity motivates the idea that a nonlinear representation of speech could yield a more efficient coding scheme than the linear model. The first step in the search for efficient strategies is to describe the statistical structure of speech at a fine level. In this thesis, I explore the statistical structure at the phonetic level using a parametric method. We seek the sparsest decomposition of the signal among a family of dictionaries of Gabor filters whose frequency selectivity follows different power laws in the high-frequency range 1-8 kHz. The use of these dictionaries as sparse time-frequency representations is justified mathematically and empirically.
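The dictionary family just described can be sketched in a few lines. The following is a minimal illustration, not the thesis code: the function names, the sampling rate, the reference bandwidth of 100 Hz at 1 kHz, and the filter counts are all assumptions made for the example. The exponent `alpha` interpolates between a constant-bandwidth (STFT-like, alpha = 0) and a constant-Q (wavelet-like, alpha = 1) tiling of the 1-8 kHz range.

```python
import numpy as np

def gabor_filter(fc, bandwidth, fs=16000, n=512):
    """Complex Gabor filter: Gaussian envelope modulated at center frequency fc (Hz).
    A wider bandwidth gives a shorter time envelope, and vice versa."""
    t = (np.arange(n) - n // 2) / fs
    sigma = 1.0 / (2 * np.pi * bandwidth)  # time spread, inverse to bandwidth
    return np.exp(-t**2 / (2 * sigma**2)) * np.exp(2j * np.pi * fc * t)

def gabor_dictionary(alpha, fs=16000, n=512, fmin=1000.0, fmax=8000.0, n_filters=32):
    """Dictionary whose frequency selectivity follows a power law of center
    frequency: bandwidth ∝ fc**alpha.  alpha=0 is STFT-like (constant
    bandwidth); alpha=1 is wavelet-like (constant Q).  The 100 Hz reference
    bandwidth at fmin is an arbitrary choice for this sketch."""
    fcs = np.geomspace(fmin, fmax, n_filters)
    return np.stack([gabor_filter(fc, 100.0 * (fc / fmin) ** alpha, fs, n)
                     for fc in fcs])
```

For alpha = 1, the highest-frequency filters have the widest bandwidth and therefore the shortest time support, which is the trade-off the power-law parameterization is designed to explore.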
A formal link with previous work based on Independent Component Analysis (ICA) is presented. The power laws of the sparse representations offer a rich interpretation of the statistical structure of speech and can be related to key acoustic factors inferred from the analysis of real and synthetic data. The results further show that an efficient coding strategy, reflecting the nonlinear behavior of the cochlea, is to reduce frequency selectivity as the sound intensity level increases.

Keywords (French abstract): efficient coding hypothesis, sparse coding, Gabor dictionaries, acoustic phonetics, auditory coding, speech statistics, time-frequency analysis, Independent Component Analysis.

Abstract

Cochlear frequency selectivity is known to reflect the overall statistical structure of speech, in line with the hypothesis that low-level sensory processing provides efficient codes for the information contained in natural stimuli. Speech signals, however, possess a complex structure even on short time scales, as a result of the diversity of acoustic factors involved in the generation of speech. This rich structure means that advanced coding schemes based on a nonlinear representation of speech sounds could provide more efficient codes. The first step in finding efficient strategies is to describe the statistical structure of speech at a fine level: at the level of phonemes, or even finer, at the level of acoustic events. In this thesis, I use a parametric approach to explore the fine-grained statistical structure of speech. The goal of this method is to find the sparsest representation of speech sounds among a family of dictionaries of Gabor filters whose frequency selectivity follows different power laws in the high-frequency range 1-8 kHz.
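The selection of the sparsest dictionary can be illustrated with a toy procedure. This is a sketch under stated assumptions, not the thesis method: the kurtosis-style sparsity measure, the function names, and the use of plain convolution for the analysis step are all choices made for the example.

```python
import numpy as np

def sparsity_index(coeffs):
    """Kurtosis-style sparsity measure on decomposition coefficients:
    higher means more heavy-tailed, i.e. energy concentrated in fewer
    coefficients.  (One of several possible sparsity measures.)"""
    a = np.abs(coeffs).ravel()
    a = a / (a.std() + 1e-12)
    return float(np.mean(a**4))

def best_exponent(signal, dictionaries):
    """Return the power-law exponent whose dictionary yields the sparsest
    analysis coefficients.  `dictionaries` maps exponent -> array of
    real filters with shape (n_filters, n_taps)."""
    scores = {}
    for alpha, filters in dictionaries.items():
        # Analysis step: filter the signal with every atom of the dictionary.
        coeffs = np.stack([np.convolve(signal, f, mode="same") for f in filters])
        scores[alpha] = sparsity_index(coeffs)
    return max(scores, key=scores.get)
```

In this sketch, the hypothetical exponent grid would be swept (e.g. alpha in {0, 0.25, ..., 1}) over each phoneme class, and the winning exponent read off per class; the thesis itself fits the exponent with a more principled procedure.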
I motivate the use of Gabor filters in the search for sparse time-frequency representations of speech signals, and I show that the dictionary method has a formal link with previous work based on Independent Component Analysis (ICA). The acoustic factors that affect the power law associated with the sparsest decomposition can be inferred from analyses of synthetic and real data. The results suggest that an efficient speech coding strategy is to reduce frequency selectivity with sound intensity level, reflecting the nonlinear behavior of the cochlea.

Keywords: Independent Component Analysis, efficient coding hypothesis, sparse coding, Gabor dictionaries, acoustic phonetics, auditory coding, speech statistics, time-frequency analysis.

Table of contents

Codage efficace de la parole à court terme  2
Introduction  2
  Contexte : le projet SpeechCode  2
  Approche  4
    Neurosciences computationnelles  4
    Interdisciplinarité  7
  Bases théoriques  10
    Hypothèse du codage efficace  10
    La notion de structure statistique  12
    Analyse temps-fréquence  15
  Travaux antérieurs  18
  Objectifs et structure de la thèse  21
1 Hypothèse du codage efficace  24
  1.1 Critères d'efficacité  24
  1.2 Algorithmes et méthodes associées  30
  1.3 Analyse temps-fréquence et décompositions parcimonieuses  33
  1.4 Limites  33
2 Structure statistique de la parole  36
  2.1 Méthodes  36
  2.2 Résultats  39
    2.2.1 Données synthétiques  39
    2.2.2 Données réelles  42
3 Représentations non linéaires et parcimonieuses de la parole  45
  3.1 Filtres dépendant du niveau d'intensité  45
  3.2 Limites du modèle et recherches futures  48
Short-time scale efficient coding of speech  51
Introduction  51
  Context: the SpeechCode project  51
  Approach of this work  53
    Computational neuroscience  53
    Interdisciplinarity  56
  Theoretical background  59
    The efficient coding hypothesis  59
    The notion of statistical structure  60
    Time-frequency analysis  64
  Previous work  67
  Objectives and structure of the thesis  69
1 The efficient coding hypothesis  72
  1.1 Coding efficiency  72
    1.1.1 Redundancy reduction  74
    1.1.2 Information maximization  75
    1.1.3 Minimum entropy codes  76
  1.2 Algorithms and related methods  81
    1.2.1 Independent Component Analysis  81
    1.2.2 Sparse coding methods  83
    1.2.3 Dealing with overcompleteness  85
  1.3 Evidence and limits