OCR Based Speech Synthesis System Using LabVIEW
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Engineering in Electronic Instrumentation and Control Engineering
Submitted By
Rajiv Kumar Yadav Roll No-80751017
Under the esteemed guidance of
Sunil Kumar Singla, Sr. Lecturer, EIED
DEPARTMENT OF ELECTRICAL AND INSTRUMENTATION ENGINEERING
THAPAR UNIVERSITY
PATIALA – 147004
July 2009

DEDICATED TO MY PARENTS

ACKNOWLEDGEMENT
The real spirit of achieving a goal lies in the way of excellence and austere discipline. I would never have succeeded in completing my task without the cooperation, encouragement and help provided to me by various people.

First of all, I render my gratitude to the ALMIGHTY, who bestowed self-confidence, ability and strength in me to complete this work. Without His grace this would never have become today's reality.

With a deep sense of gratitude I express my sincere thanks to my esteemed and worthy supervisor, Mr. Sunil Kumar Singla of the Department of Electrical and Instrumentation Engineering, for his valuable guidance in carrying out this work under his effective supervision, encouragement, enlightenment and cooperation. Most of the novel ideas and solutions found in this thesis are the result of our numerous stimulating discussions. His feedback and editorial comments were also invaluable for the writing of this thesis.

I would be failing in my duties if I did not express my deep sense of gratitude towards Dr. Smarajit Ghosh, Professor and Head of the Electrical and Instrumentation Engineering Department, who has been a constant source of inspiration for me throughout this work. I am grateful to Dr. R.K. Sharma, Dean of Academic Affairs, for his constant encouragement, which was of great importance in the completion of the thesis. I extend my thanks to Dr. K.K. Raina, Deputy Director, and Dr. Abhijit Mukherjee, Director, Thapar University, for their valuable support that made me a consistent performer. I am also thankful to all the staff members of the Department for their full cooperation and help.

My greatest thanks are to all who wished me success, especially my parents, my batch mates Manish Sharma, Ankit Sharma and Vivek Chaudhary, and my juniors Rahul Dev Nigam, Yatendra Rawal and Ankit Sharma, whose support and care sustained me.
Place: TU, Patiala
Rajiv Kumar Yadav
Roll No. 80751017
ABSTRACT
The ability to extract knowledge just by listening to sounds is a distinctive human property. Although text can be a medium of communication, a speech signal is a more effective means of communication than text. In this thesis an OCR Based Speech Synthesis System implemented in LabVIEW 7.1 is discussed. Although a lot of work has been done in the fields of OCR and speech synthesis individually, this is the first OCR based speech synthesis system built using LabVIEW. The thesis contains two parts: optical character recognition and text to speech conversion. The OCR software is developed with the IMAQ Vision for LabVIEW software development tool, and it uses a commercial digital scanner as the image acquisition device. IMAQ Vision OCR software is a PC-based character recognition tool for use with IMAQ Vision for LabVIEW. IMAQ Vision OCR is designed for high-speed and reliable reading performance, even with poor image quality resulting from varying lighting conditions and print quality. For speech synthesis in LabVIEW, the ActiveX sub-palette of the Communication palette and its functions are used to exchange data between applications. ActiveX/COM refers to the process of controlling one program from another via ActiveX. As in networking, one program acts as the client and the other as the server; LabVIEW supports ActiveX automation both as the client and as the server. Both programs, client and server, exist independently of each other but are able to share information. The client communicates with the ActiveX objects that the server exposes to allow the sharing of information. The automation client can access the object's properties and methods; properties are attributes of an object. This thesis aims to study OCR and speech synthesis technology and to develop a cost-effective, user-friendly OCR Based Speech Synthesis System using the Laboratory Virtual Instrument Engineering Workbench (LabVIEW) graphical programming language.
LIST OF CONTENTS
Certificate i
Acknowledgement ii
Abstract iii
List of Contents iv-vii
List of Figures viii-x
List of Tables xi

Chapter 1: Introduction 1-16
1.1 Introduction 1
1.1.1 OCR 1
1.1.2 Automatic Identification 1
1.1.2.1 Speech Recognition 2
1.1.2.2 Radio Frequency 2
1.1.2.3 Vision Systems 2
1.1.2.4 Magnetic Stripe 2
1.1.2.5 Barcode Recognition 2
1.1.2.6 Magnetic Ink Character Recognition 3
1.1.2.7 Optical Mark Recognition 4
1.1.2.8 Optical Character Recognition 4
1.1.3 Speech Synthesis 4
1.1.3.1 Phonetics and Theory of Speech Production 5
1.1.3.1.1 Representation and Analysis of Speech Signals 5
1.1.3.1.2 Speech Production 7
1.1.3.1.3 Phonetics 10
1.1.3.1.3.1 English Articulatory Phonetics 12
1.1.3.1.3.2 Finnish Articulatory Phonetics 14
1.2 Problem Formulation 16

Chapter 2: Literature Survey 17-48
2.1 Introduction 17
2.2 History of OCR 17
2.2.1 The Start of OCR 17
2.2.2 First Generation OCR 18
2.2.3 Second Generation OCR 18
2.2.4 Third Generation OCR 19
2.2.5 OCR Today 20
2.3 Components of an OCR System 20
2.3.1 Image Scanning 21
2.3.2 Binarization 22
2.3.3 Segmentation Process 22
2.3.3.1 Line Segmentation 22
2.3.3.2 Word Segmentation 23
2.3.3.3 Character Segmentation 23
2.3.4 Feature Extraction 24
2.3.4.1 Template Matching and Correlation Techniques 25
2.3.4.2 Feature Based Techniques 25
2.3.4.3 Distribution of Points 26
2.3.4.3.1 Zoning 26
2.3.4.3.2 Moments 26
2.3.4.3.3 Crossings and Distances 26
2.3.4.3.4 N-tuples 27
2.3.4.3.5 Characteristic Loci 27
2.3.4.4 Transformations and Series Expansions 27
2.3.4.5 Structural Analysis 28
2.3.5 Recognition 29
2.4 Text to Speech Conversion System 29
2.4.1 Natural Language Processing Module 31
2.4.1.1 Text/Linguistic Analysis 31
2.4.1.2 Letter to Sound 32
2.4.1.3 Prosody Generation 33
2.4.2 Digital Signal Processing Module 36
2.5 Methods, Techniques and Algorithms 36
2.5.1 Articulatory Synthesis 37
2.5.2 Formant Synthesis 38
2.5.3 Concatenative Synthesis 41
2.5.3.1 PSOLA Methods 44
2.5.3.2 Sinusoidal Models 45
2.6 Applications of OCR Based Synthetic Speech Systems 47
2.6.1 Applications for the Blind 47
2.6.2 Educational Applications 47
2.6.3 Applications for the Deafened and Vocally Handicapped 48

Chapter 3: OCR Based Speech Synthesis System 49-68
3.1 Introduction 49
3.2 Hardware Requirements 49
3.2.1 Scanner 49
3.2.2 P.C. 50
3.2.3 Speaker 50
3.3 Software Platform 50
3.3.1 LabVIEW 50
3.3.2 Virtual Instrumentation 51
3.3.3 LabVIEW Program Structure 51
3.4 Software Implementation 53
3.4.1 Optical Character Recognition 53
3.4.1.1 Image Acquisition 55
3.4.1.2 Image Pre-processing (Binarization) 55
3.4.1.3 Image Segmentation 55
3.4.1.3.1 Line Detection and Segmentation 56
3.4.1.3.2 Word Segmentation 57
3.4.1.3.3 Character Segmentation 58
3.4.1.4 Template Matching 59
3.4.1.4.1 Correlation 59
3.4.1.5 Recognition 60
3.4.2 Text to Speech Synthesis 61
3.4.2.1 Text to Speech Conversion 61
3.4.2.1.1 Overview of ActiveX 63
3.4.2.1.2 ActiveX Automation 63
3.4.2.1.3 ActiveX Automation with LabVIEW 63
3.4.2.1.4 LabVIEW as an Automation Client 63
3.4.2.1.5 Automation Open 64
3.4.2.1.6 Invoke Node 64
3.4.2.1.7 Property Node 65
3.4.2.1.8 Close Reference 65
3.4.2.1.9 Rate Volume VI 66
3.4.2.1.10 Status VI 66
3.4.2.2 Play Speech Wave File Player 67
3.4.2.2.1 Snd Read Wave File 67
3.4.2.2.2 Snd Play Wave File 67

Chapter 4: Results and Discussion 70-73
4.1 Introduction 70
4.2 Optical Character Recognition 71
4.3 Speech Synthesis 73

Chapter 5: Conclusion and Future Scope 74-75
5.1 Conclusion 74
5.2 Future Scope 74

References 76-80
LIST OF FIGURES
S.No. Figure Number Figure Name Page No.
1 Figure 1.1 1-D barcode (UPC) 3
2 Figure 1.2 2-D barcode (the Gettysburg Address) 3
3 Figure 1.3 The College Board SAT uses OMR technology 4
4 Figure 1.4 The time- and frequency-domain presentation of vowels /a/, /i/, and /u/ 6
5 Figure 1.5 Cepstral analysis 6
6 Figure 1.6 Hierarchical levels of fundamental frequency (Sagisaka 1990) 7
7 Figure 1.7 The human vocal organs 8
8 Figure 1.8 Examples of two- and three-tube models for the vocal tract 10
9 Figure 1.9 The classification of the main vowels in English 13
10 Figure 1.10 Classification of English consonants 14
11 Figure 1.11 Classification of Finnish vowels 14
12 Figure 1.12 Classification of Finnish consonants 15
13 Figure 2.1 OCR-A 19
14 Figure 2.2 OCR-B 19
15 Figure 2.3 Components of an OCR system 20
16 Figure 2.4 Zoning 26
17 Figure 2.5 Elliptical Fourier descriptors 28
18 Figure 2.6 Strokes extracted from the capital letters F, H and N 29
19 Figure 2.7 General TTS synthesizer 30
20 Figure 2.8 A simple NLP module 31
21 Figure 2.9 Prosodic dependencies 33
22 Figure 2.10 Basic structure of a cascade formant synthesizer 39
23 Figure 2.11 Basic structure of a parallel formant synthesizer 40
24 Figure 2.12 PARCAS model 41
25 Figure 2.13 Pitch modification of a voiced speech segment 45
26 Figure 2.14 Sinusoidal analysis/synthesis system (Macon 1996) 46
27 Figure 3.1 Block diagram of OCR system 53
28 Figure 3.2 Flow chart of OCR system 54
29 Figure 3.3 Image configuration 55
30 Figure 3.4 Block diagram of the line detection VI 56
31 Figure 3.5 Block diagram of the word segmentation sub-VI 57
32 Figure 3.6 Block diagram of Find VI 59
33 Figure 3.7 Block diagram of correlation VI 60
34 Figure 3.8 Flowchart for the text to speech wave file conversion 62
35 Figure 3.9 Programming flow of ActiveX used in LabVIEW 64
36 Figure 3.10 Block diagram of text to speech synthesis 66
37 Figure 3.11 Block diagram of wave file player 68
38 Figure 3.12 Flow chart of wave file player VI 69
39 Figure 4.1 Front panel of OCR based speech synthesis system 70
40 Figure 4.2 Read image 71
41 Figure 4.3 Image after thresholding and inverting 71
42 Figure 4.4 Image window: segmented line 72
43 Figure 4.5 Image window: segmented word 72
44 Figure 4.6 Image window: segmented character image 72
45 Figure 4.7 Final result of OCR 73
46 Figure 4.8 Output of wave file player 73
LIST OF TABLES
S.No. Table No. Table Name Page No.
1 Table 1.1 Examples of different phonetic notations 12
2 Table 2.1 Evaluation of feature extraction techniques 25
Chapter 1
Introduction
1.1 Introduction
Machine replication of human functions, like reading, is an ancient dream. However, over the last five decades, machine reading has grown from a dream to reality. Character recognition, or optical character recognition (OCR), is the process of converting scanned images of machine-printed or handwritten text (numerals, letters, and symbols) into a computer-processable format (such as ASCII text). Optical character recognition has become one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. Many commercial systems for performing OCR exist for a variety of applications. Speech is probably the most efficient medium for communication between humans. A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any text aloud, whether it was directly introduced into the computer by an operator or scanned and submitted to an Optical Character Recognition (OCR) system [1].

1.1.1 OCR
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text [2]. Optical character recognition belongs to the family of techniques performing automatic identification. These different techniques are discussed below, defining OCR's position among them.
1.1.2 Automatic Identification
The traditional way of entering data into a computer is through the keyboard. However, this is neither always the best nor the most efficient solution. In many cases automatic identification may be an alternative. Various technologies for automatic identification exist, and they cover needs for different areas of application [3]. Below, a brief overview of the different technologies and their applications is given.
1.1.2.1 Speech Recognition
In systems for speech recognition, spoken inputs from a predefined library of words are recognized. Such systems should be speaker-independent and may be used, for instance, for reservations or ordering of goods by telephone. Another kind of such system is used to recognize the speaker, rather than the words, for identification.
1.1.2.2 Radio Frequency
This kind of identification is used for instance in connection with toll roads for identification of cars. Special equipment on the car emits the information. The identification is efficient, but special equipment is needed both to send and to read the information. The information is also inaccessible to humans [4].
1.1.2.3 Vision Systems
By the use of a TV camera, objects may be identified by their shape or size. This approach may, for instance, be used in automatic machines for the recycling of bottles. The type of bottle must be recognized, as the amount reimbursed for a bottle depends on its type.
1.1.2.4 Magnetic Stripe
Information contained in a magnetic stripe is widely used on credit cards and similar tokens. Quite a large amount of information can be stored on the magnetic stripe, but specially designed readers are required and the information cannot be read by humans.
1.1.2.5 Barcode Recognition
A barcode is a machine-readable representation of information. Barcodes can be read by optical scanners called barcode readers, or scanned from an image using software. A 2-D barcode is similar to a linear, one-dimensional barcode, but has more data representation capability. Figures of a 1-D barcode and a 2-D barcode are given below [5].
Figure 1.1 1-D barcode (UPC).
Figure 1.2 2-D barcode (the Gettysburg Address).
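As a concrete illustration of the information a 1-D barcode carries, the last digit of a UPC-A code is a check digit computed from the other eleven. The sketch below implements the standard UPC-A check-digit rule in Python; it is an illustrative aside, not part of the thesis's system:

```python
def upc_check_digit(digits11):
    """Check digit for the first 11 digits of a UPC-A barcode.

    Digits in odd positions (1st, 3rd, ..., 11th) are weighted by 3,
    digits in even positions by 1; the check digit brings the weighted
    sum up to the next multiple of 10.
    """
    odd = sum(digits11[0::2])    # positions 1, 3, ..., 11
    even = sum(digits11[1::2])   # positions 2, 4, ..., 10
    return (10 - (3 * odd + even) % 10) % 10

# The full UPC 0-36000-29145-2 ends in its check digit, 2
print(upc_check_digit([0, 3, 6, 0, 0, 0, 2, 9, 1, 4, 5]))  # 2
```

A scanner verifies this digit after decoding the bars, which is one reason barcode reading is so reliable compared with unconstrained character recognition.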
1.1.2.6 Magnetic Ink Character Recognition (MICR)
Printing in magnetic ink is mainly used within bank applications. The characters are written in ink that contains finely ground magnetic material and they are written in stylized fonts which are specifically designed for the application. Before the characters are read, the ink is exposed to a magnetic field. This process accentuates each character and helps simplify the detection. The characters are read by interpreting the waveform obtained when scanning the characters horizontally. Each character is designed to have its own specific waveform. Although designed for machine reading, the characters are still readable to humans. However, the reading is dependent on the characters being printed with magnetic ink [6].
1.1.2.7 Optical Mark Recognition (OMR)
OMR technology detects the existence of a mark, not its shape. OMR forms usually contain small ovals, referred to as 'bubbles,' or check boxes that the respondent fills in. OMR cannot recognize alphabetic or numeric characters. OMR is the fastest and most accurate of the data collection technologies. It is also relatively user-friendly. The accuracy of OMR is a result of precise measurement of the darkness of a mark, and the sophisticated mark discrimination algorithms for determining whether what is detected is an erasure or a mark [7].
Figure 1.3 The College Board SAT uses OMR technology.
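The mark-versus-erasure discrimination described above can be sketched as a simple rule on measured mark darkness. The function and threshold values below are hypothetical illustrations, not taken from any actual OMR system:

```python
# Hypothetical sketch of OMR mark discrimination by darkness measurement.
# Threshold values are assumed for illustration only.

def classify_bubble(darkness, mark_threshold=0.6, erasure_threshold=0.3):
    """Classify one bubble from its mean ink coverage, a value in [0, 1]."""
    if darkness >= mark_threshold:
        return "mark"        # clearly filled in
    if darkness >= erasure_threshold:
        return "erasure"     # partially dark: likely an erased mark
    return "blank"           # untouched bubble

# One respondent's row of answer bubbles
readings = [0.05, 0.45, 0.92, 0.08]
print([classify_bubble(d) for d in readings])  # ['blank', 'erasure', 'mark', 'blank']
```

Real OMR engines refine this with per-sheet calibration and relative comparison between bubbles in the same row, but the core idea is this darkness measurement.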
1.1.2.8 Optical Character Recognition
Optical character recognition is needed when the information should be readable both to humans and to a machine and alternative inputs cannot be predefined. In comparison with the other techniques for automatic identification, optical character recognition is unique in that it does not require control of the process that produces the information.
1.1.3 Speech Synthesis
Speech synthesis is the artificial production of human speech. Synthesizing is the process of generating speech waveforms by machine, based on the phonetic transcription of the message. Recent progress in speech synthesis has produced synthesizers with very high intelligibility, but sound quality and naturalness still remain a major problem.
1.1.3.1 Phonetics and Theory of Speech Production
Speech processing and language technology involve many special concepts and terms. To understand how different speech synthesis and analysis methods work, one must have some knowledge of speech production, articulatory phonetics, and related terminology. The basic theories related to these topics are described below.
1.1.3.1.1 Representation and Analysis of Speech Signals
Continuous speech is a set of complicated audio signals, which makes producing it artificially difficult. Speech signals are usually considered as voiced or unvoiced, though in some cases they are something between these two. Voiced sounds consist of the fundamental frequency (F0) and its harmonic components produced by the vocal cords (vocal folds). The vocal tract modifies this excitation signal, causing formant (pole) and sometimes antiformant (zero) frequencies [8]. Each formant frequency also has an amplitude and bandwidth, and it may sometimes be difficult to define some of these parameters correctly. The fundamental frequency and formant frequencies are probably the most important concepts in speech synthesis, and in speech processing in general. With purely unvoiced sounds there is no fundamental frequency in the excitation signal and therefore no harmonic structure either, and the excitation can be considered as white noise. The airflow is forced through a vocal tract constriction, which can occur at several places between the glottis and the mouth. Some sounds are produced with a complete stoppage of airflow followed by a sudden release, producing an impulsive turbulent excitation often followed by a more protracted turbulent excitation [9]. Unvoiced sounds are also usually quieter and less steady than voiced ones. Speech signals of the three vowels (/a/, /i/, /u/) are presented in the time and frequency domains in Figure 1.4. The fundamental frequency is about 100 Hz in all cases, and the formant frequencies F1, F2, and F3 of vowel /a/ are approximately 600 Hz, 1000 Hz, and 2500 Hz, respectively. For vowel /i/ the first three formants are 200 Hz, 2300 Hz, and 3000 Hz, and for /u/ they are 300 Hz, 600 Hz, and 2300 Hz. The harmonic structure of the excitation is also easy to perceive from the frequency-domain presentation.
Figure 1.4 The time- and frequency-domain presentation of vowels /a/, /i/, and /u/.
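The formant values quoted above for /a/ (600, 1000 and 2500 Hz, with F0 around 100 Hz) are enough to sketch a crude vowel-like signal: an impulse train at F0 passed through a cascade of two-pole resonators, one per formant. This Python sketch is purely illustrative; the bandwidth values are assumptions, and it is not the synthesis method used in this thesis:

```python
import numpy as np

fs = 16000                                        # sample rate (Hz)
f0 = 100.0                                        # fundamental frequency (Hz)
formants = [(600, 80), (1000, 90), (2500, 120)]   # (Hz, assumed bandwidth Hz) for /a/

# Glottal excitation approximated by an impulse train at F0
n = int(0.25 * fs)
x = np.zeros(n)
x[::int(fs / f0)] = 1.0

# Cascade of two-pole resonators, one per formant
y = x.copy()
for f, bw in formants:
    r = np.exp(-np.pi * bw / fs)                  # pole radius from bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * f / fs)  # filter coefficients
    a2 = r * r
    out = np.zeros_like(y)
    for i in range(n):
        out[i] = y[i] - a1 * out[i - 1] - a2 * out[i - 2]
    y = out

# Energy concentrates near the formants: the harmonic at 600 Hz comes out
# much stronger than one between formants, e.g. at 1800 Hz.
spec = np.abs(np.fft.rfft(y))
```

Played back, such a signal has a buzzy but recognizably /a/-like timbre; real formant synthesizers (discussed in Chapter 2) add a proper glottal pulse shape, amplitude control and radiation modeling.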
For determining the fundamental frequency, or pitch, of speech, a method called cepstral analysis may be used [9]. The cepstrum is obtained by first windowing the signal and taking its Discrete Fourier Transform (DFT), then taking the logarithm of the power spectrum, and finally transforming it back to the time domain by the Inverse Discrete Fourier Transform (IDFT). The procedure is shown in Figure 1.5.
[Block diagram: Speech → Window → DFT → Log → IDFT → Cepstrum]
Figure 1.5 Cepstral analysis.
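The window → DFT → log → IDFT chain can be sketched in a few lines. The Python sketch below runs cepstral pitch estimation on a synthetic 100 Hz harmonic signal standing in for a voiced frame; the test signal and search range are assumptions for illustration, not the thesis's implementation:

```python
import numpy as np

fs = 8000                                    # sample rate (Hz)
f0 = 100.0                                   # true pitch of the test frame
t = np.arange(int(0.064 * fs)) / fs          # one 64 ms analysis frame

# Synthetic voiced frame: F0 plus decaying harmonics (stand-in for speech)
frame = sum((1.0 / k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 20))

# Cepstral analysis: window -> DFT -> log power spectrum -> IDFT
windowed = frame * np.hanning(len(frame))
log_power = np.log(np.abs(np.fft.fft(windowed)) ** 2 + 1e-12)
cepstrum = np.fft.ifft(log_power).real

# The pitch period shows up as a peak at quefrency fs/F0 samples;
# search a plausible 60-400 Hz pitch range
qmin, qmax = int(fs / 400), int(fs / 60)
peak = qmin + int(np.argmax(cepstrum[qmin:qmax]))
print(round(fs / peak))                      # estimated pitch in Hz, close to 100
```

The peak location divides the sample rate to give the pitch estimate, which is why the cepstral domain is often called "quefrency": periodicity in the log spectrum maps back to a time-like axis.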
The fundamental frequency, or intonation contour, over the sentence is important for correct prosody and natural-sounding speech. The different contours are usually analysed from natural speech in specific situations and with specific speaker characteristics, and then applied as rules to generate the synthetic speech. The fundamental frequency contour can be viewed as the composite set of hierarchical patterns shown in Figure 1.6. The overall contour is generated by the superposition of these patterns [10].
Figure 1.6 Hierarchical levels of fundamental frequency.
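The superposition idea can be illustrated with a toy contour: a slowly declining phrase-level component plus short accent-level bumps. All shapes and numbers in this Python sketch are invented for illustration and are not taken from the model behind Figure 1.6:

```python
import numpy as np

t = np.linspace(0.0, 2.0, 400)         # a 2-second utterance

# Phrase-level pattern: slow declination of F0 over the whole sentence
phrase = 120.0 - 15.0 * t

# Word/accent-level patterns: short Gaussian bumps at accented syllables
accent_centres = (0.4, 1.1, 1.6)
accents = sum(25.0 * np.exp(-((t - c) ** 2) / (2 * 0.05 ** 2))
              for c in accent_centres)

# The overall contour is the superposition of the hierarchical patterns
f0_contour = phrase + accents
```

Rule-based synthesizers generate each level separately (sentence, phrase, accent) and sum them, which is exactly the structure the superposition makes explicit.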
1.1.3.1.2 Speech Production
Human speech is produced by the vocal organs presented in Figure 1.7. The main energy source is the lungs with the diaphragm. When speaking, the air flow is forced through the glottis between the vocal cords and the larynx to the three main cavities of the vocal tract: the pharynx and the oral and nasal cavities. From the oral and nasal cavities the air flow exits through the mouth and nose, respectively. The V-shaped opening between the vocal cords, called the glottis, is the most important sound source in the vocal system. The vocal cords may act in several different ways during speech. Their most important function is to modulate the air flow by rapidly opening and closing, causing a buzzing sound from which vowels and voiced consonants are produced. The fundamental frequency of vibration depends on the mass and tension of the cords and is about 110 Hz, 200 Hz, and 300 Hz for men, women, and children, respectively. With stop consonants the vocal cords may act suddenly, moving from a completely closed position, in which they cut the air flow completely, to a totally open position, producing a light cough or a glottal stop. On the other hand, with unvoiced consonants such as /s/ or /f/ they may be completely open. An intermediate position may also occur with, for example, phonemes like /h/.
Figure 1.7 The human vocal organs. (1) Nasal cavity, (2) Hard palate, (3) Alveolar ridge, (4) Soft palate (velum), (5) Tip of the tongue (apex), (6) Dorsum, (7) Uvula, (8) Radix, (9) Pharynx, (10) Epiglottis, (11) False vocal cords, (12) Vocal cords, (13) Larynx, (14) Esophagus, and (15) Trachea.
The pharynx connects the larynx to the oral cavity. It has almost fixed dimensions, but its length may be changed slightly by raising or lowering the larynx at one end and the soft palate at the other end. The soft palate also isolates or connects the route from the nasal cavity to the pharynx. At the bottom of the pharynx are the epiglottis and false vocal cords to prevent food reaching the larynx and to isolate the esophagus acoustically from the vocal tract. The epiglottis, the false vocal cords and the vocal cords are closed during swallowing and open during normal breathing.
The oral cavity is one of the most important parts of the vocal tract. Its size, shape and acoustics can be varied by the movements of the palate, the tongue, the lips, the cheeks and the teeth. The tongue in particular is very flexible: the tip and the edges can be moved independently, and the entire tongue can move forward, backward, up and down. The lips control the size and shape of the mouth opening through which the speech sound is radiated. Unlike the oral cavity, the nasal cavity has fixed dimensions and shape. Its length is about 12 cm and its volume about 60 cm3. The air stream to the nasal cavity is controlled by the soft palate. From a technical point of view, the vocal system may be considered as a single acoustic tube between the glottis and the mouth. The glottally excited vocal tract may then be approximated as a straight pipe, closed at the vocal cords, where the acoustical impedance Zg = ∞, and open at the mouth (Zm = 0) [11, 12]. In this case the volume-velocity transfer function of the vocal tract is

V(ω) = U_m(ω) / U_g(ω) = 1 / cos(ωl/c)   (1.1)

where l is the length of the tube, ω is the radian frequency and c is the sound velocity. The denominator is zero at the frequencies where ωl/c = (2k − 1)π/2 (k = 1, 2, 3, …), where