Structural Analysis of Hindi Phonetics and a Method for Extraction of Phonetically Rich Sentences from a Very Large Hindi Text Corpus
Total Page:16
File Type:pdf, Size:1020Kb
2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Technique (O-COCOSDA) 26-28 October 2016, Bali, Indonesia Structural Analysis of Hindi Phonetics and A Method for Extraction of Phonetically Rich Sentences from a Very Large Hindi Text Corpus Shrikant Malviya∗, Rohit Mishray and Uma Shanker Tiwaryz Department of Information Technology Indian Institute of Information Technology, Allahabad, India 211012 ∗Email: [email protected] yEmail: [email protected] zEmail: [email protected] Abstract—Automatic speech recognition (ASR) and Text to which contains all the phonemes in most of the possible speech (TTS) are two prominent area of research in human contexts, when recorded, would produce a phonetically rich computer interaction nowadays. A set of phonetically rich and balanced speech corpus [3], [4]. sentences is in a matter of importance in order to develop these two interactive modules of HCI. Essentially, the set of Compare to some studies which are based on employing phonetically rich sentences has to cover all possible phone units words, syllables and monophones, most of the current research distributed uniformly. Selecting such a set from a big corpus in the development of Automatic Speech Recognition (ASR) with maintaining phonetic characteristic based similarity is still and Text to Speech (TTS) systems widely focused on how the a challenging problem. The major objective of this paper is to devise a criteria in order to select a set of sentences encompassing contextual phone units e.g. triphones and diphones could be all phonetic aspects of a corpus with size as minimum as possible. used for improving the robustness of the systems [5], [6]. First, this paper presents a statistical analysis of Hindi phonetics The field of constructing a phonetically-rich sentence cor- by observing the structural characteristics. Further a two stage algorithm is proposed to extract phonetically rich sentences with pus is relevant to various applications i.e. ASR and speech a high variety of triphones from the EMILLE Hindi corpus. The synthesis and many more. For instance, to estimate a robust algorithm consists of a distance measuring criteria to select a accoustic model, a phonetically rich speech dataset is a basic sentence in order to improve the triphone distribution. Moreover, requirement [7]. Phonologists have found useful to have such a a special preprocessing method is proposed to score each triphone specific corpora in order to develop a sytem to analyze speech in terms of inverse probability in order to fasten the algorithm. The results show that the approach efficiently build uniformly production and variability [8]. In speech therapy, to find com- distributed phonetically-rich corpus with optimum number of municative disorders of patients, phonetically-rich sentences sentences. are often utilized in assessment of patient’s speech production Keywords—Phonetically-Rich Sentences; Statistical Analysis; under surveillance in various phonetic/phonological contexts Phonemes; Triphone; Hindi Speech Recognition; Grapheme- [9]. Phoneme; Hindi Phonology; Phone Like Units(PLU); A novel approach based on the triphone distribution has I. I been formulated and evaluated in this paper. The problem When a set of sentences would be called as phonetically- could be elaborated formally as: suppose a corpus C is given rich set? The answer to the question depends on two statistical which consists of s sentences, find a subset K such that properties of phonetic distribution. First one is to know about subset K containing sk sentences having uniformly distributed the characteristic distribution of phonemes which decides the triphones. In first appearance the problem looks simple, but phonetic richnes of a sentence. On the other hand second prop- due to inherent high computational time complexity, the task erty talks about the phonetic resemblance between extracted is considered to be approached diligently. The problem is a sentences and language in study. Evidently, this field of study non-polynomial type as it generates a set of instances based is significantly related to automatic speech recognition (ASR) on the combinations of triphones and should be considered as and speech synthesis(TTS) [1]. an intractable problem [10]. The task of extracting phonetically rich and balanced sen- A two phase algorithm has been proposed to extract set of tences involves, analyzing a large corpus and performing the phonetically-rich triphone sentences from EMILLE corpora in procedure to extract sentences using sentence extraction crite- order to build a phonetically-rich corpus for Hindi. Hindi cor- ria, based on various stochastic methodologies [2]. Addition- pus from the EMILLE has been selected for the experiments ally, a corpus has to be phonetically balanced through made up and evaluation of the proposed method [11]. Furthermore, a of sentences having phonetic units as per its distribution in the greedy distance based criteria is employed in the algorithm for natural spoken language. Now based on these set of sentences the selection of sentences through a finite number of iterations. 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Technique (O-COCOSDA) 26-28 October 2016, Bali, Indonesia II. C S H L On the basis of place of origin, consonants are of catego- rized in the following four categories : A. Phonetic Structure 1) Regular Consonants (25) After Mandarin, Spanish and English, Hindi is the most 2) Semivowels Consonants (4) natively spoken language in the world, almost spoken by 3) Sibilant Consonants (3) 260 million people according to Ethnologue, 2014 [12]. It 4) Fricative Consonants (1) is recognized as an official language of India in Devanagari script. Hindi is a direct descendant of Sanskrit through Pakrit Consonants are being articulated and amalgamated with (One of the Ancient Indo-Aryan language) and Apabhransha vowels by stopping the air moving out of the mouth. Based (Corruption or corrupted speech). Additionally, it is consid- on this fact of obstruction, the consonants are divided into ered to be influenced by Dravidian language (Telugu, Tamil, five groups (वग [vargas]). These vargas (groups) are ordered Kannada and Malayalam spoken mostly in southern India), according to where the tongue is located in the mouth. As Turkish, Persian, Arabic, Portuguese and English. we move forward in the list of consonants groups, the tongue In Hindi, there are total 33 consonants and 13 vowels used position is also move forward inside the mouth during the for speaking as well as writing. All the vowels have two articulation. Table II clearly distinguishes this categorization forms of writing. First one is standalone form, being used of regular consonants. Each group (vargas) has five consonants as it is shown in the table I. For example in the Hindi word ordered in a specific way according to how they are realized during the articulation. The members of each group are ordered आम (‘mango’) contains vowel आ ‘/a:/’ is being used in the same way as defined. The other one is mātrā form, which in the sequence of voiceless unaspirated, voiceless aspirated, worked as a consonants modifier. For example in the word voiced unaspirated, voiced aspirated and nasal. Thus a sort of 5x5 matrix has been formed here to represent all the regular दान (‘donation’) [/d a: n/], vowel आ ‘/a:/’ is attached with consonants table II. the consonant द /‘d’/. Table I and III show a complete list of Hindi vowels and consonants with their phonetic details. For the convenience of representation PLU, phone like unit has Table III A G O H C W E P been defined for each letter. P [13] Hindi IPA PLU Table I Example A G O H V W E P P Orthography Symbol Symbol [13] क /k/ K कर ‘Tax’ h ख /k / KH खेल ‘Sport’ Hindi IPA PLU ग /g/ G गाय ‘Cow’ Example ɦ Orthography Symbol Symbol घ /g / GH घर ‘House’ अ /ə/ A अयापक ‘Teacher’ ङ /ŋ/ WN पक ‘Mud’ आ /a:/ AA आम ‘Mango’ च /tʃ/ CH चमकार ‘Miracle’ h इ /i/ I इ ‘Perfume’ छ /tʃ / CHH छाया ‘Shadow’ ई /i:/ II ईवर ‘God’ ज /dʒ/ J जल ‘Water’ ɦ उ /u/ U उर ‘Answer’ झ /dʒ / JH झरना ‘Waterfall’ ऊ /u:/ UU ऊमा ‘Heat’ ञ /ɲ/ ZN कचे ‘Marbles’ ए /e:/ E एक ‘One’ ट /ʈ/ TT टहलना ‘Amble’ h ऐ /ᴂ:/ EE ऐनक ‘Specs’ ठ /ʈ / TTH ठक ‘Fine’ ओ /o:/ O ओस ‘Dew’ ड /ɖ/ DD डाक ‘Post’ ɦ औ /ᴔ:/ OO औषध ‘Medicine’ ढ /ɖ / DDH ढोल ‘Drum’ अं /aŋ/ MN अंगूर ‘Grape’ ण /ɳ/ AN गणना ‘Calculation’ अः /əh/ AH अतः ‘So’ त /t/ T तरबजू ‘Watermelon’ h ऋ /r̥ / RI ऋष ‘Sage’ थ /t / TH थाना ‘Station’ द /d/ D दपण ‘Mirror’ ɦ ध /d / DH धातु ‘Metal’ न /n/ N नमक ‘Salt’ Table II प /p/ P परोपकार ‘Charity’ h R C C A T P फ /p / PH फल ‘Fruit’ A ब /b/ B बरसात ‘Rain’ ɦ भ /b / BH भित ‘Devotion’ Group Tongue Group Members म /m/ M मदद ‘Help’ Name Position VL* VLA# V¶ VA§ N† य /j/ Y यं ‘Instrument’ क वग Velar क ख ग घ ङ र /ɾ/ R रा ‘Defence’ च वग Palatal च छ ज झ ञ ल /l/ L लगन ‘Diligence’ ट वग Retroflex ट ठ ड ढ ण व /ʋ/ V वातु ‘Architectural’ त वग Dental त थ द ध न स /s/ S समुदाय ‘Community’ प वग Labial प फ ब भ म ष /ʂ/ SHH धनुष ‘Bow’ *VL = VOICE LESS श /ʃ/ SH शित ‘Power’ #VLA = VOICE LESS ASPIRATED ह /ɦ/ H हल ‘solution’ h ¶V = VOICED VOICELESS ASPIRATED ɦ §VA = VOICED ASPIRATED VOICED ASPIRATED †N = NASAL 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Technique (O-COCOSDA) 26-28 October 2016, Bali, Indonesia Figure 1.