“Nettalk En Español”

Total Pages: 16

File Type: pdf, Size: 1020 KB

Universidad Autónoma Metropolitana Iztapalapa
Ciencias Básicas e Ingeniería
Licenciatura en Computación

99321701. Molina Villegas Alejandro
201320439. García Arias Néstor Hugo
202212079. Nuñez Reyna José Ismael

"NETtalk en español", March 2006

Contents

INTRODUCTION
CHAPTER I. Neural Networks Applied to Phoneme Classification
1. Artificial Neural Networks
1.1 The Turing machine and the physiology of the computable
1.2 Elements of an Artificial Neural Network
1.3 Classification of ANNs
1.4 Networks as pattern recognizers
2. Connectionism and natural language processing
2.1 Phonetic expressions and articulatory characterization
2.2 Units of representation
2.3 The International Phonetic Alphabet (IPA) and SAMPA
2.4 TTS systems
3. NETtalk
3.1 Representation and structure of the NETtalk network
3.2 Methodology
3.3 Implementation of the NETtalk neural network in Spanish
3.4 Description of the program code
CHAPTER II. Decision Trees Applied to Phoneme Classification
1. What are decision trees?
2. Basic learning algorithm for decision trees
2.1 Entropy
3. Pruning rules
3.1 Pruning by error estimation
3.2 Cost-complexity pruning
3.3 Pessimistic pruning
4. Overfitting to the training data
5. Implementation of a decision tree for NETtalk
5.1 General description of the problem
5.2 Description of the program
CHAPTER III. Recurrent Networks
1. The Elman recurrent neural network
2. Building a simple recurrent network with SNNS
2.1 A proposed architecture
2.2 Structure of the recurrent network
2.3 Weight initialization with the JE_Weights function
2.4 Learning function
2.5 The JE_Order weight update function
2.6 Weight initialization function
3. SNNS pattern file
4. SNNS training results file
CHAPTER IV. Assigning Values to the Intonation Parameters
1. Explanation of the neural network implementation for assigning the duration and pitch parameters
2. Explanation of the decision tree implementation for assigning the duration and pitch parameters
2.1 Description of the program
CHAPTER V. Results
1. Results of the multilayer perceptron implementation for phoneme classification
2. Results of the multilayer perceptron implementation for assigning values to the intonation parameters
3. Results of the decision tree implementation for phoneme classification
4. Results of the decision tree implementation for assigning intonation parameters
5. Results of the recurrent network implementation for phoneme classification
CONCLUSIONS
ANNEXES
A. Stuttgart-Java NNS
B. MBROLA
C. CRUISE
D. The Stuttgart Neural Network Simulator SNNS
REFERENCES

Introduction:

This work describes in detail the implementation of several versions of NETtalk applied to Spanish. In a first approach, the main objective was to reproduce the experiment carried out in 1986 by Terrence Sejnowski and Charles Rosenberg and described in "NETtalk: A Parallel Network that Learns to Read Aloud" [1]. For this we used a multilayer perceptron architecture whose inputs represent windows that include the context of a phoneme and whose outputs represent a phonemic class. This perceptron, together with several auxiliary modules, is implemented as a Java project, so that by following the processing steps correctly it is possible to start from an arbitrary text and end with its reading by the MBROLA synthesizer. The second approach uses decision trees for phoneme classification. These trees are trained and pruned by the CRUISE program and then transformed into a Java class representing the tree, which optimally classifies any text previously processed for that purpose. As in the previous case, it is possible to start from an arbitrary text and end with it being read aloud; the idea of windows is used here as well. Subsequently,
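To make the window idea concrete, the sketch below shows how a NETtalk-style input layer can one-hot encode a seven-letter window around each character of a Spanish text. The class name, alphabet, and window size are illustrative assumptions, not the thesis's actual Java code; Sejnowski and Rosenberg's original network likewise used a seven-letter window with one group of input units per alphabet symbol.

```java
// Minimal sketch of NETtalk-style windowing (illustrative only; names and
// alphabet are assumptions, not taken from the thesis's project).
public class WindowEncoder {
    // Reduced alphabet; a character's position in this string is its one-hot index.
    private static final String ALPHABET = "abcdefghijklmnñopqrstuvwxyzáéíóú _";
    private static final int WINDOW = 7;           // 3 letters of context per side

    /** One-hot encoding of the window centred on position pos of text. */
    public static double[] encode(String text, int pos) {
        double[] input = new double[WINDOW * ALPHABET.length()];
        for (int i = 0; i < WINDOW; i++) {
            int p = pos - WINDOW / 2 + i;
            char c = (p < 0 || p >= text.length()) ? ' ' : text.charAt(p);
            int k = ALPHABET.indexOf(c);
            if (k >= 0) input[i * ALPHABET.length() + k] = 1.0;
        }
        return input;                              // fed to the perceptron's input layer
    }

    public static void main(String[] args) {
        double[] v = encode("gato", 1);            // window centred on the 'a'
        System.out.println(v.length + " inputs");  // 7 * 34 = 238
    }
}
```

The network then sees one such vector per character, and its output layer encodes the phonemic class of the centre character; sliding the window along the text yields the phoneme sequence.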
Recommended publications
• The Development of Accented English Synthetic Voices, by Promise Tshepiso Malatji
Dissertation submitted in fulfilment of the requirements for the degree of Master of Science in Computer Science, Faculty of Science and Agriculture (School of Mathematical and Computer Sciences), University of Limpopo, 2019. Supervisor: Mr MJD Manamela; co-supervisor: Dr TI Modipa. The excerpt consists of the dissertation's front matter: a dedication to the author's grandparents, the declaration of originality, and acknowledgements to family, both supervisors, the Telkom Centre of Excellence for Speech Technology, colleagues in the Department of Computer Science (Messrs V.R. Baloyi and L.M. Kola), Mr T.J. Sefara, and the six Computer Science undergraduate students who participated in data collection.
• Commercial Tools in Speech Synthesis Technology
D. Nagaraju (Audisankara College of Engineering and Technology, Gudur, India), R. J. Ramasree (Rastriya Sanskrit VidyaPeet, Tirupati, India), K. Kishore, K. Vamsi Krishna, R. Sujana (Audisankara College of Engineering and Technology, Gudur, India). International Journal of Research in Engineering, Science and Management, Volume 2, Issue 12, December 2019, p. 320. www.ijresm.com, ISSN (Online): 2581-5792.
Abstract: This study paper proposes a new emotional speech synthesis system for Telugu (ESST). The main objective of this paper is to map the situation of today's speech synthesis technology and to focus on potential methods for the future. The literature in this area usually concentrates on a single method, a single synthesizer, or a very limited range of the technology. In this paper the whole speech synthesis area, with as many methods, techniques, applications, and products as possible, is under investigation. Unfortunately, this means that in some cases very detailed information cannot be given here, but may be found in the given references.
Body (excerpt): ...phonetic and prosodic information. These two phases are usually called high- and low-level synthesis. The input text might be, for example, data from a word processor, standard ASCII from e-mail, a mobile text message, or scanned text from a newspaper. The character string is then preprocessed and analyzed into a phonetic representation, which is usually a string of phonemes with some additional information for correct intonation, duration, and stress. Speech sound is finally generated with the low-level synthesizer from this information.
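The high-/low-level split described in this excerpt can be pictured as two components joined by a phoneme-level interface. A minimal sketch with hypothetical names (not taken from the paper):

```java
import java.util.List;

/** One annotated phoneme: SAMPA symbol, duration in ms, pitch targets in Hz. */
record Phone(String sampa, int durationMs, double[] pitchHz) {}

/** High-level synthesis: analyze text into phonemes plus prosodic information. */
interface HighLevelSynthesizer {
    List<Phone> analyze(String text);
}

/** Low-level synthesis: generate a waveform from the annotated phonemes. */
interface LowLevelSynthesizer {
    byte[] synthesize(List<Phone> phones);
}
```

A concrete system would enrich Phone with stress marks and richer prosody, but the division of labour between the two phases stays the same.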
• Estudios de I+D+I
ESTUDIOS DE I+D+I, No. 51: "Proyecto SIRAU. Servicio de gestión de información remota para las actividades de la vida diaria adaptable a usuario" (SIRAU project: a user-adaptable remote information management service for activities of daily living). Author: Andreu Catalá Mallofré, Universidad Politécnica de Cataluña, 2006. To cite this document: CATALÁ MALLOFRÉ, Andreu (2006 call), Madrid, Estudios de I+D+I, no. 51 [published 03/05/2010]. <http://www.imsersomayores.csic.es/documentos/documentos/imserso-estudiosidi-51.pdf>. An initiative of IMSERSO and CSIC, © 2003 Portal Mayores, http://www.imsersomayores.csic.es
Abstract (translated from Spanish): This project belongs to one of the research lines of the Centro de Estudios Tecnológicos para Personas con Dependencia (CETDP – UPC) of the Universidad Politécnica de Cataluña, which develops technological solutions to improve the quality of life of people with disabilities. It aims to exploit the great advance represented by the new radio-frequency identification (RFID) technologies by applying them as a support system for people with deficits of various kinds. Although initially conceived for people with visual impairment, its use extends easily to people with comprehension or memory problems, or any type of cognitive deficit. The idea is to make everyday objects electronically recognizable, so that a system can present the associated information through a verbal channel. The system consists of a portable terminal equipped with a short-range transmitter: when the user brings the terminal close to an object, or vice versa, the terminal identifies it and offers complementary information by means of a spoken message.
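The interaction loop the abstract describes (read a tag, look up the object, speak the associated information) fits in a few lines. A minimal sketch with hypothetical names, not SIRAU's actual design:

```java
import java.util.Map;
import java.util.function.Consumer;

// Sketch of a SIRAU-style loop: an RFID tag id is read by a short-range
// reader, looked up, and its description handed to a TTS engine.
public class ObjectAnnouncer {
    private final Map<String, String> descriptions; // tag id -> spoken description
    private final Consumer<String> tts;             // speaks a message aloud

    public ObjectAnnouncer(Map<String, String> descriptions, Consumer<String> tts) {
        this.descriptions = descriptions;
        this.tts = tts;
    }

    /** Called whenever the reader detects a nearby tag. */
    public void onTagRead(String tagId) {
        String info = descriptions.getOrDefault(tagId, "unknown object");
        tts.accept(info); // present the associated information verbally
    }
}
```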
• Un synthétiseur de la voix chantée basé sur MBROLA pour le mandarin (A singing-voice synthesizer for Mandarin based on MBROLA), by Liu Ning
Liu Ning, CICM (Centre de recherche en Informatique et Création Musicale), Université Paris VIII, MSH Paris Nord, [email protected]. In: Actes des Journées d'Informatique Musicale (JIM 2012), Mons, Belgium, 9-11 May 2012. HAL Id: hal-03041805, https://hal.archives-ouvertes.fr/hal-03041805, submitted on 5 Dec 2020.
Abstract (translated from French): In this article we present the project of developing a singing-voice synthesizer for Mandarin Chinese based on MBROLA. Our objective is to develop a synthesizer that can work in real time and that is capable of ... as a function of the melody and the lyrics taken from a MIDI file [7]. Nevertheless, singing-voice synthesis for the Chinese language still has technical limits; for example, a system driven by reading a MIDI file is a deferred-time (offline) system.
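MBROLA's input format makes the melody-plus-lyrics idea concrete: each line of a .pho file names a phoneme, gives its duration in milliseconds, and optionally lists (position-within-phoneme %, pitch in Hz) pairs, so a sung note is essentially a long vowel with explicit pitch targets. A small fragment for illustration (phonemes and values invented, not taken from the paper):

```
; phoneme  duration-ms  (position% pitchHz)...
_   100
n   80   0 220
i   400  0 220  50 247  100 220   ; sung syllable with a small pitch bump
h   60   0 196
a   500  0 196  100 196           ; held note, flat pitch
_   100
```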
• eSpeak: Speech Synthesis
Software Requirements Specification for eSpeak: Speech Synthesis, Version 1.48.15, prepared by Dimitrios Koufounakis, January 10, 2018. Based on the SRS template copyright © 2002 by Karl E. Wiegers (permission is granted to use, modify, and distribute the document). The excerpt covers the table of contents and revision history, and the openings of section 1, Introduction (Purpose; Document Conventions; Intended Audience and Reading Suggestions; Project Scope; References), and section 2, Overall Description.
• Fully Generated Scripted Dialogue for Embodied Agents
Kees van Deemter (University of Aberdeen, Scotland, UK; corresponding author, email [email protected]), Brigitte Krenn (Austrian Research Center for Artificial Intelligence (OEFAI), University of Vienna, Austria), Paul Piwek (Centre for Research in Computing, The Open University, UK), Martin Klesen (German Research Center for Artificial Intelligence (DFKI), Saarbruecken, Germany), Marc Schröder (DFKI, Saarbruecken, Germany), and Stefan Baumann (University of Koeln, Germany).
Abstract: This paper presents the NECA approach to the generation of dialogues between Embodied Conversational Agents (ECAs). This approach consists of the automated construction of an abstract script for an entire dialogue (cast in terms of dialogue acts), which is incrementally enhanced by a series of modules and finally "performed" by means of text, speech and body language by a cast of ECAs. The approach makes it possible to automatically produce a large variety of highly expressive dialogues, some of whose essential properties are under the control of a user. The paper discusses the advantages and disadvantages of NECA's approach to Fully Generated Scripted Dialogue (FGSD), and explains the main techniques used in the two demonstrators that were built. The paper can be read as a survey of issues and techniques in the construction of ECAs, focussing on the generation of behaviour (i.e., on information presentation) rather than on interpretation.
Keywords: Embodied Conversational Agents, Fully Generated Scripted Dialogue, Multimodal Interfaces, Emotion Modelling, Affective Reasoning, Natural Language Generation, Speech Synthesis, Body Language.
1. Introduction (excerpt): A number of scientific disciplines have started, in the last decade or so, to join forces to build Embodied Conversational Agents (ECAs): software agents with a human-like synthetic voice and ...
• Design and Implementation of Text to Speech Conversion for Visually Impaired People
Itunuoluwa Isewon (corresponding author), Jelili Oyelade, Olufunke Oladipupo; Department of Computer and Information Sciences, Covenant University, PMB 1023, Ota, Nigeria. International Journal of Applied Information Systems (IJAIS), ISSN 2249-0868, Foundation of Computer Science FCS, New York, USA, Volume 7, No. 2, April 2014, www.ijais.org.
Abstract: A text-to-speech synthesizer is an application that converts text into spoken words, by analyzing and processing the text using Natural Language Processing (NLP) and then using Digital Signal Processing (DSP) technology to convert the processed text into a synthesized speech representation. Here, we developed a useful text-to-speech synthesizer in the form of a simple application that converts inputted text into synthesized speech, reads it out to the user, and can save the result as an .mp3 file. The development of a text-to-speech synthesizer will be of great help to people with visual impairment and will make going through large volumes of text easier.
Keywords: Text-to-speech synthesis, Natural Language Processing, Digital Signal Processing.
(Figure 1, referenced in the excerpt, shows a simple but general functional diagram of a TTS system [2].)
2. Overview of Speech Synthesis (excerpt): Speech synthesis can be described as the artificial production of human speech [3]. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware.
• Feasibility Study on a Text-To-Speech Synthesizer for Embedded Systems
2006:113 CIV, Master's thesis by Linnea Hammarstedt, Luleå University of Technology, MSc Programmes in Engineering, Electrical Engineering, Department of Computer Science and Electrical Engineering, Division of Signal Processing. ISSN: 1402-1617; ISRN: LTU-EX--06/113--SE.
Preface: This is a master degree project commissioned by and performed at Teleca Systems GmbH in Nürnberg at the department of Speech Technology. Teleca is an IT services company focused on developing and integrating advanced software and information technology solutions. Today Teleca possesses a speech recognition system including a grapheme-to-phoneme module, i.e., an algorithm converting text into phonetic notation. Their future objective is to develop a Text-To-Speech system including this module. The purpose of this work, from Teleca's point of view, is to investigate a possible solution for converting phonetic notation into speech suitable for an embedded implementation platform. I would like to thank Dr. Andreas Kiessling at Teleca for his support and patient discussions during this work, and Dr. Stefan Dobler, the head of department, for giving me the possibility to experience this interesting field of speech technology. Finally, I wish to thank all the other personnel of the department for their consistently practical support.
Abstract: A system converting textual information into speech is usually denoted a TTS (Text-To-Speech) system. The design of such a system varies depending on its purpose and platform requirements. In this thesis a TTS synthesizer designed for an embedded system operating on an arbitrary vocabulary has been evaluated and partially implemented in Matlab, constituting a base for further development.
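A grapheme-to-phoneme module of the kind mentioned in the preface can be as simple as an ordered list of rewrite rules applied longest-match-first. A toy sketch for a few Spanish spellings, using SAMPA symbols (illustrative only, unrelated to Teleca's actual module):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy grapheme-to-phoneme converter: rewrite rules tried in order at each
// position. Real modules also handle stress, syllabification, and context.
public class ToyG2P {
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ch", "tS");   // SAMPA affricate
        RULES.put("qu", "k");
        RULES.put("ll", "L");    // palatal lateral
        RULES.put("ñ", "J");     // palatal nasal
        RULES.put("v", "b");     // Spanish b/v merger
    }

    public static String toPhonemes(String word) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < word.length()) {
            String match = null;
            for (String g : RULES.keySet())
                if (word.startsWith(g, i)) { match = g; break; }
            if (match != null) { out.append(RULES.get(match)); i += match.length(); }
            else { out.append(word.charAt(i)); i++; }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPhonemes("chiquillo")); // prints: tSikiLo
    }
}
```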
• A Tool to Support Speech and Non-Speech Audio Feedback Generation in Audio Interfaces
Lisa J. Stifelman, Speech Research Group, MIT Media Laboratory, 20 Ames Street, Cambridge, MA 02139. Tel: 1-617-253-8026. E-mail: lisa@media.mit.edu
Abstract: Development of new auditory interfaces requires the integration of text-to-speech synthesis, digitized audio, and non-speech audio output. This paper describes a tool for specifying speech and non-speech audio feedback and its use in the development of a speech interface, Conversational VoiceNotes. Auditory feedback is specified as a context-free grammar, where the basic elements in the grammar can be either words or non-speech sounds. The feedback specification method described here provides the ability to vary the feedback based on the current state of the system, and is flexible enough to allow different feedback for different input modalities (e.g., speech, mouse, buttons). The declarative specification is easily modifiable, supporting an iterative design process.
Introduction (excerpt): ... is also needed for specifying speech and non-speech audio feedback (i.e., auditory icons [9] or earcons [4]). Natural language generation research has tended to focus on producing coherent multisentential text [14] and detailed multisentential explanations and descriptions [22, 23], rather than the kind of terse interactive dialogue needed for today's speech systems. In addition, sophisticated language generation tools are not generally accessible to interface designers and developers. The goal of the work described here was to simplify the feedback generation component of developing audio user interfaces and allow rapid iteration of designs.
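The abstract's central idea, auditory feedback as a context-free grammar whose terminals are either words or sound files, can be sketched briefly. The grammar, class, and rule names below are invented for illustration and are not taken from the paper:

```java
import java.util.List;
import java.util.Map;

// Sketch of feedback-as-grammar: terminals are spoken words or non-speech
// sounds; nonterminals expand by productions that could vary with system state.
public class FeedbackGrammar {
    // Hypothetical rules: DELETE_FEEDBACK -> CONFIRM_SOUND "deleted" NOTE_NAME
    private final Map<String, List<List<String>>> rules = Map.of(
        "DELETE_FEEDBACK", List.of(List.of("<confirm.wav>", "deleted", "NOTE_NAME")),
        "NOTE_NAME",       List.of(List.of("note"), List.of("voice", "note"))
    );

    /** Expand a symbol: nonterminals recurse, terminals are emitted as-is. */
    public void expand(String symbol, StringBuilder out) {
        List<List<String>> productions = rules.get(symbol);
        if (productions == null) {               // terminal: a word or a sound file
            out.append(symbol).append(' ');
            return;
        }
        for (String s : productions.get(0))      // first alternative, for simplicity
            expand(s, out);
    }
}
```

A real implementation would pick among alternatives according to the current system state and input modality, which is exactly the variability the abstract describes.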
• Voice Synthesizer Application for Android
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts ordinary language text into speech; other systems render symbolic linguistic representations, such as phonetic transcriptions, into speech. Synthesized speech can be created by concatenating fragments of recorded speech that are stored in a database. Systems differ in the size of the stored speech units: a system that stores phones or diphones provides the greatest output range but may lack clarity, while for specific domains the storage of whole words or sentences allows high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other characteristics of the human voice to create a fully synthetic voice output. The quality of a speech synthesizer is judged by its similarity to the human voice and by how clearly it can be understood. A clear text-to-speech program allows people with visual impairments or reading disabilities to listen to written words on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s. (The excerpt ends with captions for an overview of a typical TTS system and for an automatic announcement system whose synthetic voice announces an arriving train in Sweden.)
• Assisting the Speech Impaired People Using Text-to-Speech Synthesis
Ledisi G. Kabari and Ledum F. Atu, Department of Computer Science, Rivers State Polytechnic, Bori, Nigeria. International Journal of Emerging Engineering Research and Technology, Volume 3, Issue 8, August 2015, pp. 214-224. ISSN 2349-4395 (Print), ISSN 2349-4409 (Online).
Abstract: Many people have some sort of disability that impairs their ability to communicate; thus, for all their intelligence, they are not able to participate in conferences or meeting proceedings, which in a way inhibits their contributions to the development of the nation. Work on alternative and augmentative communication (AAC) devices attempts to address this need. This paper designs a system that can be used to read out an input to its user. The system was developed using Visual Basic 6.0 for the user interface, and was interfaced with the Lernout & Hauspie TruVoice Text-to-Speech Engine (American English) and Microsoft Agent. The system allows users to input the text to be read out, and also to open text documents in Rich Text Format (.rtf) or Text File format (.txt) saved on disk, for reading.
Keywords: Text-To-Speech (TTS), Digital Signal Processing (DSP), Natural Language Processing (NLP), Alternative and Augmentative Communication (AAC), Rich Text Format (RTF), Text File Format (TFF).
Introduction (excerpt): The era has come, and it is now here, where the minds and reasoning of all are needed for the development of a nation. Thus people who have some sort of disability which impairs their ability to communicate cannot be left behind.
• Towards Expressive Speech Synthesis in English on a Robotic Platform
Sigrid Roehling, Bruce MacDonald, Catherine Watson, Department of Electrical and Computer Engineering, University of Auckland, New Zealand. s.roehling, b.macdonald, [email protected]
Abstract: Affect influences speech, not only in the words we choose but in the way we say them. This paper reviews the research on vocal correlates in the expression of affect and examines the ability of currently available major text-to-speech (TTS) systems to synthesize expressive speech for an emotional robot guide. Speech features discussed include pitch, duration, loudness, spectral structure, and voice quality. TTS systems are examined as to their ability to control the features needed for synthesizing expressive speech: pitch, duration, loudness, and voice quality. The OpenMARY system is recommended since it provides the greatest amount of control over speech production as well as the ability to work with a sophisticated intonation model. OpenMARY is being actively developed, is supported on our current Linux platform, and provides timing information for talking heads such as our current robot face.
1. Introduction (excerpt): Affect influences speech, not only in the words we choose but in the way we say them. These vocal nonverbal cues are important in human speech, as they communicate information about the speaker's state or attitude more efficiently than the verbal content (Eide, Aaron, Bakis, Hamza, Picheny, and Pitrelli 2004). ... Unless explicitly stated otherwise, the research is concerned with the English language.
2.1. Pitch (excerpt): Pitch contour seems to be one of the clearest indica...