Commercial Tools in Speech Synthesis Technology
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Research in Engineering, Science and Management 320 Volume-2, Issue-12, December-2019 www.ijresm.com | ISSN (Online): 2581-5792 Commercial Tools in Speech Synthesis Technology D. Nagaraju1, R. J. Ramasree2, K. Kishore3, K. Vamsi Krishna4, R. Sujana5 1Associate Professor, Dept. of Computer Science, Audisankara College of Engg. and Technology, Gudur, India 2Professor, Dept. of Computer Science, Rastriya Sanskrit VidyaPeet, Tirupati, India 3,4,5UG Student, Dept. of Computer Science, Audisankara College of Engg. and Technology, Gudur, India Abstract: This is a study paper planned to a new system phonetic and prosodic information. These two phases are emotional speech system for Telugu (ESST). The main objective of usually called as high- and low-level synthesis. The input text this paper is to map the situation of today's speech synthesis might be for example data from a word processor, standard technology and to focus on potential methods for the future. ASCII from e-mail, a mobile text-message, or scanned text Usually literature and articles in the area are focused on a single method or single synthesizer or the very limited range of the from a newspaper. The character string is then preprocessed and technology. In this paper the whole speech synthesis area with as analyzed into phonetic representation which is usually a string many methods, techniques, applications, and products as possible of phonemes with some additional information for correct is under investigation. Unfortunately, this leads to a situation intonation, duration, and stress. Speech sound is finally where in some cases very detailed information may not be given generated with the low-level synthesizer by the information here, but may be found in given references. The objective of the from high-level one. paper is to develop high quality audiovisual speech synthesis with a well synchronized talking head, primarily in Finnish. Other The simplest way to produce synthetic speech is to play long aspects, such as naturalness, personality, platform independence, prerecorded samples of natural speech, such as single words or and quality assessment are also under investigation. Most sentences. This concatenation method provides high quality and synthesizers today are so called standalones and they do not work naturalness, but has a limited vocabulary and usually only one platform independently and usually do not share common parts, voice. The method is very suitable for some announcing and thus we cannot just put together the best parts of present systems information systems. However, it is quite clear that we cannot to make a state-of-the-art synthesizer. Hence, with good modularity characteristics we may achieve a synthesis system create a database of all words and common names in the world. which is easier to develop and improve. It is maybe even inappropriate to call this speech synthesis because it contains only recordings. Thus, for unrestricted Keywords: ESST, Speech Synthesis, High Quality, naturalness. speech synthesis (text-to-speech) we have to use shorter pieces of speech signal, such as syllables, phonemes, diphones or even 1. Introduction shorter segments. Speech is the primary means of communication between people. Speech synthesis, automatic generation of speech 2. Speech Production waveforms, has been under development for several decades Human speech is produced by vocal organs. The main energy (Santen et al. 1997, Kleijn et al. 1998). Recent progress in source is the lungs with the diaphragm. When speaking, the air speech synthesis has produced synthesizers with very high flow is forced through the glottis between the vocal cords and intelligibility but the sound quality and naturalness still remain the larynx to the three main cavities of the vocal tract, the a major problem. However, the quality of present products has pharynx and the oral and nasal cavities. From the oral and nasal reached an adequate level for several applications, such as cavities, the air flow exits through the nose and mouth, multimedia and telecommunications. With some audiovisual respectively. The V-shaped opening between the vocal cords, information or facial animation (talking head) it is possible to called the glottis, is the most important sound source in the increase speech intelligibility considerably (Beskow et. al. vocal system. The vocal cords may act in several different ways 1997). Some methods for audiovisual speech have been during speech. The most important function is to modulate the recently introduced by for example Santen et al. (1997), Breen air flow by rapidly opening and closing, causing buzzing sound et al. (1996), Beskow (1996), and Le Goff et al. (1996). from which vowels and voiced consonants are produced. The The text-to-speech (TTS) synthesis procedure consists of two fundamental frequency of vibration depends on the mass and main phases. The first one is text analysis, where the input text tension and is about 110 Hz, 200 Hz, and 300 Hz with men, is transcribed into a phonetic or some other linguistic women, and children, respectively. With stop consonants the representation, and the second one is the generation of speech vocal cords may act suddenly from a completely closed position waveforms, where the acoustic output is produced from this International Journal of Research in Engineering, Science and Management 321 Volume-2, Issue-12, December-2019 www.ijresm.com | ISSN (Online): 2581-5792 in which they cut the air flow completely, to totally open The latest full commercial version, Infovox 230, is available position producing a light cough or a glottal stop. for American and British English, Danish, Finnish, French, German, Icelandic, Italian, Norwegian, Spanish, Swedish, and A. Speech Synthesis METHODS Dutch (Telia 1997). The system is based on formant synthesis Synthesized speech can be produced by several different and the speech is intelligible but seems to have a bit of Swedish methods. All of these have some benefits and deficiencies that accent. The system has five different built-in voices, including are discussed in this and previous chapters. The methods are male, female, and child. The user can also create and store usually classified into three groups: individual voices. Aspiration and intonation features are also 1. Articulatory synthesis, which attempts to model the adjustable. Individual articulation lexicons can be constructed human speech production system directly. for each language. For words which do not follow the 2. Formant synthesis, which models the pole frequencies pronunciation rules, such as foreign names, the system has a of speech signal or transfer function of vocal tract specific pronunciation lexicon where the user can store them. based on source-filter-model. The speech rate can be varied up to 400 words per minute. The 3. Concatenative synthesis, which uses different length text may be synthesized also word by word or letter by letter. prerecorded samples derived from natural speech Also DTMF tones can be generated for telephony applications. B. Speech Synthesis Tools The system is available as a half-length PC board, RS 232 Here we are introducing some of the commercial products, connected stand-alone desktop unit, OEM board, or software developing tools, and ongoing speech synthesis projects for Macintosh and Windows environments (3.1, 95, NT) and available today. It is clear that it is not possible to present all requires only 486DX33MHz with 8 Mb of memory. systems and products out there, but at least the most known D. DEC Talk products are presented. Some of the text in this chapter is based Digital Equipment Corporation (DEC) has also long on information collected from Internet, fortunately, mostly traditions with speech synthesizers. The DECtalk system is from the manufacturers and developer’s official homepages. originally descended from MITalk and Klattalk. The present However, some criticism should be bear in mind when reading system is available for American English, German and Spanish the "this is the best synthesis system ever" descriptions from and offers nine different voice personalities, four male, four these WWW-sites. First commercial speech synthesis systems female and one child. The present system has probably one of were mostly hardware based and the developing process was the best designed text preprocessing and pronunciation very time-consuming and expensive. Since computers have controls. The system is capable to say most proper names, e- become more and more powerful, most synthesizers today are mail and URL addresses and supports a customized software based systems. Software based systems are easy to pronunciation dictionary. It has also punctuation control for configure and update, and usually they are also much less pauses, pitch, and stress and the voice control commands may expensive than the hardware systems. However, a standalone be inserted in a text file for use by DECtalk software hardware device may still be the best solution when a portable applications. The speaking rate is adjustable between 75 to 650 system is needed. words per minute (Hallahan 1996). Also the generation of The speech synthesis process can be divided in high-level single tones and DTMF signals for telephony applications is and low-level synthesis. A low-level synthesizer is the actual supported. device which generates the output sound from information DECtalk software is currently available for Windows 95/NT provided by high-level device in some format, for example in environments and for Alpha systems running Windows NT or phonetic representation. A high-level synthesizer is responsible DIGITAL UNIX. A software version for Windows requires at for generating the input data to the low-level device including least Intel 486-based computer with 50 MHz processor and 8 correct text-preprocessing, pronunciation, and prosodic Mb of memory. The software provides also an application information. Most synthesizers contain both, high and low level programming interface (API) that is fully integrated with system, but due to specific problems with methods, they are computer's audio subsystem.