A Brief Review of Speech Synthesis
Computer Science
Computer Networks

Piotr Leszczyński
Book No. s4207

Remote voice Web browser for people with sight impairment
Zdalna głosowa przeglądarka WWW dla osób niewidomych

Engineering Thesis
Written under the advice of Ph.D. Eng. Przemysław Skurowski

Bytom, September 2009

Contents

1 Introduction
2 A brief review of speech synthesis
2.1 Human speech synthesis
2.2 Text-To-Speech systems overview
2.2.2 Concatenation Speech Systems
2.2.3 Articulator Speech Systems
2.2.4 History
3 Application modeling and implementation
3.1 Application concept
3.2 Functional requirements
3.3 Non-Functional requirements
3.4 Feasibility analysis
3.5 Technical limitations
3.5.1 Accessibility
3.5.2 Speech synthesis
3.5.3 Interpretation of Web pages
4 Technology
4.1 Java
4.2 Java Web Start
4.3 FreeTTS speech system
4.4 Netbeans framework
5 Design
5.1 Communication
5.2 Client
5.2.1 Modular design
5.2.2 Portability
5.2.3 Model-View-Controller architecture pattern
5.2.4 Class model
5.2.5 Sequence diagram
5.2.6 Dependency model
5.2.7 Interface
5.2.8 Compatibility
5.3 Server
5.3.1 Class model
5.3.2 Compatibility
5.4 Development challenges
6 Testing
6.1 Live web testing
6.1.1 Method
6.1.2 Results
6.1.3 Feedback
6.2 Synthetic testing
6.2.1 Method
6.2.2 The results
7 Conclusions
8 Summary in Polish
9 Bibliography
A Installation and use
A.1 Installation
A.2 Use
B Client compatibility list
C Server compatibility list

1 Introduction

Human beings possess five senses; according to Aristotelian psychology these are sight, hearing, smell, taste and touch [1]. When interfacing with computers we rely mainly on sight and hearing. There are projects that introduce smell into the equation, but those have not gone mainstream yet. Although the main burden of communication lies on sight, hearing is used to augment multimedia and to enhance system and application communication. In most cases it would be impossible even to enter the operating system without the sense of sight, not to mention doing anything else. That is why the situation of people with sight impairment is so difficult when it comes to interfacing with computers.
Doing so often requires a very expensive set of software, with top-of-the-line products costing between $500 and $1300 [2]. Such a set combines a screen reader, whose task is to identify and interpret the output of the computer screen (a task performed by the computer monitor and the human brain in the case of people without sight impairment), with a text-to-speech or Braille output device. Text-to-speech is the preferred and most natural method of presenting the interpreted text.

I have decided to take on the problem of making computers in public places such as schools, public administration offices, airports and libraries accessible to people with sight impairment, specifically with regard to World Wide Web access. There are major obstacles to adapting computers for use by sight-impaired persons. The first is the high cost of the software, which in places like schools, universities and libraries with hundreds of computers could rise to astronomical levels; many schools cannot afford it when they cannot even afford all the computers they need.

The next issue is that the software needs to be installed, configured and maintained, which adds to the already high costs. Those were the issues I wanted to either solve or alleviate. The idea behind this project was to provide a Web-based application that would identify and interpret a WWW page and present the output to the user with text-to-speech technology: an application that needs no installation or configuration, is easy for persons with sight impairment to run and handle, and could be made available on any computer connected to the Internet regardless of its architecture or operating system.

I would like to present the end result of that idea in this document, together with a working application on the attached disk. I hope you find it interesting and useful.
2 A brief review of speech synthesis

This chapter introduces the basic information needed to understand how speech synthesis works on the human and mechanical levels.

2.1 Human speech synthesis

Modern researchers believe that humans possessed speech abilities as early as 300,000 years ago, after the evolution of the Neanderthals [13]. Yet human speech production has been a subject of documented research for only about a century, with some of the biggest breakthroughs happening in the last 20 years.

Two major organs are responsible for human speech production: the lungs and the larynx, which contains the vocal cords and the glottis. Humans create sound when the air pumped by the lungs passes over the vocal cords and is set into vibration. Changing sound into speech is a much more complicated process, though; it involves creating phonation in the glottis and modifying it into different vowels and consonants. The prepared speech is then modified by a complex movement of the lips, tongue and soft palate, whose purpose is to filter out some frequencies and resonate others [14].

Fig. 2.1: Human vocal tract ¹

2.2 Text-To-Speech systems overview

The simplest description of a Text-To-Speech system is an application that can reproduce a speech sequence from supplied text. There are generally three kinds of speech synthesis methods: concatenation, articulator and formant. Despite their differences, they all follow the model of human speech production.

Fig. 2.2: Human speech model ²

¹ Human vocal tract, picture taken from the May-June 2008 issue of Duke Magazine.

2.2.2 Concatenation Speech Systems

Concatenation is the most popular and widely used method of speech synthesis. There are two methods of concatenation; the first one concatenates single words or parts of sentences from a database to create the speech.
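
The first method, joining stored words or phrases taken from a database, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis application or any particular product: the database contents, the clip file names and the function name select_units are hypothetical, the longest-phrase-first lookup is just one plausible selection strategy, and Python is used purely for brevity. The sketch returns the clips to play in order and does no audio processing.

```python
from typing import Dict, List

# Hypothetical unit database: each known word or phrase maps to the name of a
# pre-recorded clip. The entries and file names are illustrative only.
UNIT_DB: Dict[str, str] = {
    "the train to": "the_train_to.wav",
    "departs from platform": "departs_from_platform.wav",
    "warsaw": "warsaw.wav",
    "bytom": "bytom.wav",
    "one": "one.wav",
    "two": "two.wav",
}


def select_units(text: str, unit_db: Dict[str, str]) -> List[str]:
    """Return the clips to play in order, preferring the longest stored phrase."""
    words = text.lower().split()
    clips: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in unit_db:
                clips.append(unit_db[phrase])
                i = j
                break
        else:
            # No stored unit covers this word: the vocabulary is closed.
            raise KeyError(f"no recorded unit for {words[i]!r}")
    return clips
```

Calling select_units("The train to Bytom departs from platform two", UNIT_DB) yields the four clips for "the train to", "bytom", "departs from platform" and "two", in that order. Any word that has no recording, such as a city missing from the database, raises an error, which is the defining limitation of systems built this way.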
Systems of this first kind are called “Voice Response Systems”, and their usability is limited to situations where a rich vocabulary is not needed and sentences follow a strictly predefined structure. For example, automated telephone systems or train