Phonetic Dictionary for Natural Language Processing: Kannada

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by ePrints@Bangalore University Mallamma V. Reddy et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 7( Version 3), July 2014, pp.01-04 RESEARCH ARTICLE OPEN ACCESS Phonetic Dictionary for Natural Language Processing: Kannada Mallamma V. Reddy*, Hanumanthappa M. **, Jyothi N.M***, Rashmi S. *(Department of Computer Science, Rani Channamma University, Vidyasangam, Belgaum-591156, India) ** (Department of Computer Science, Bangalore University, Jnanabharathi Campus, Bangalore-561156, India) ***(Department of Master of Computer Applications, Bapuji Institute of Engineering and Technology, Davangere-577004, India) ABSTRACT India has 22 officially recognized languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Clearly, India owns the language diversity problem. In the age of Internet, the multiplicity of languages makes it even more necessary to have sophisticated Systems for Natural Language Process. In this paper we are developing the phonetic dictionary for natural language processing particularly for Kannada. Phonetics is the scientific study of speech sounds. Acoustic phonetics studies the physical properties of sounds and provides a language to distinguish one sound from another in quality and quantity. Kannada language is one of the major Dravidian languages of India. The language uses forty nine phonemic letters, divided into three groups: Swaragalu (thirteen letters); Yogavaahakagalu (two letters); and Vyanjanagalu (thirty-four letters), similar to the vowels and consonants of English, respectively. Keywords - Information Retrieval, Natural Language processing (NLP), Phonetics I. INTRODUCTION Direction of writing: left to right in horizontal Kannada ( ) or Canarese, the official lines ಕꃍನಡ The phonetic can be defined as: language of the southern Indian state of Karnataka. pho·net·i·cal. of or pertaining to speech Kannada is a Dravidian language [1] spoken by about sounds, their production, or their 44 million people in the Indian states of Karnataka, transcription in written symbols. Andhra Pradesh, Tamil Nadu and Maharashtra. The Corresponding to pronunciation: phonetic Kannada alphabet (ಕꃍನಡ 좿ꢿ) developed from the transcription. Kadamba and Cālukya scripts, descendents of Agreeing with pronunciation: phonetic Brahmi [2] which were used between the 5th and 7th spelling. centuries AD. These scripts developed into the Old Concerning or involving the discrimination Kannada script, which by about 1500 had morphed of nondistinctive elements of a language. In into the Kannada and Telugu scripts. Under the English, certain phonological features, as influence of Christian missionary organizations, length and aspiration, are phonetic but not Kannada and Telugu scripts were standardized at the phonemic. beginning of the 19th century. Kannada has 44 speech sounds. Among them 35 are consonants and 9 A phonetic dictionary is a dictionary that allows you are vowels. The vowels are further classified into to locate words by the "way they sound", i.e. a short vowels, long vowels and diphthongs [3]. dictionary that matches common or phonetic Notable features of Kannada language are: misspellings with the correct spelling of a word. Such Type of writing system: alphasyllabary in which a dictionary uses pronunciation respelling to aid the all consonants have an inherent vowel. Other search for or recognition of a word. vowels are indicated with diacritics, which can appear above, below, before or after the II. HEADINGS consonants. The Phonetics was studied as early as 500 BC in When they appear the beginning of a syllable, the Indian subcontinent, with Pāṇ ini's account of the vowels are written as independent letters. place and manner of articulation of consonants in his When consonants appear together without 5th century BC treatise on Sanskrit. The major Indic intervening vowels, the second consonant is alphabets today order their consonants according to written as a special conjunct symbol, usually Pāṇ ini's classification. The Phoenicians are credited below the first. as the first to create a phonetic writing system, from which all major modern phonetic alphabets are now derived. Modern phonetics [4] begins with attempts www.ijera.com 1 | P a g e Mallamma V. Reddy et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 7( Version 3), July 2014, pp.01-04 to introduce systems of precise notation for speech done detailed evaluation and comparison test on sounds. TIMIT (acoustic-phonetic continuous speech corpus) Stephen Hawking‘s Speech synthesiser: Stephen which aids the other researchers to accomplish the Hawking, an English theoretical physicist, collation task. cosmologist, author and Director of Research at the English language is homophonic in nature. The Centre for Theoretical Cosmology within the same words have multiple spelling orders. For University of Cambridge uses speech synthesiser example, ‗Soumya‘ and ‗Sowmya‘. Hence the author, when he communicates. Using this he is able to proposed a algorithm [8] based on English phonetic convert the text into speech. A program called EZ spelling correction, has shown the mechanisms to keys has been written by world plus Inc. whatever he solve common spelling errors such as missing letters, types on the keyboard the cursor will automatically extra letters, disordered letters, as well as phonetic scan each word row and column wise. Then the spelling errors in the perspective of from the same system produces the respective sounds. There is also pronunciation, similar pronunciation. Algorithms word prediction option. such as phonetic spelling correction, phonetic NSA [5] has proposed a method of using the spelling regulations, edit-distance and habit-distance Soundex and Asoundex approach. Here the author is put forward carefully by the authors, thereby the has carefully addressed the issues of homophones. overall exposition about spelling correction system is ‗New‘ and ‗Knew‘; ‗no‘ and ‗know‘; to, two, too; vigilantly proposed. meat, meet: are some of the examples of homophones. The work is mainly concentrated on III. METHODOLOGY generating the ―name code‖. Code generated by the Kannada script has forty-nine characters in its program is then compared with the names in the test alphasyllabary and is phonemic. The Kannada data. It has been shown for ―Malay‖ language character set is almost identical to that of other Indian (Language spoken by Malaysian people) languages. The number of written symbols, however, The phonetic segmentation [6] and considerably is far more than the 49 characters in the deals with articulation phonetics. The work is alphasyllabary, because different characters can be concentrated on cross language phonetic combined to form compound characters segmentation using Hidden Markov Models (ottaksharas). Each written symbol in the Kannada (HMMs). It also provides extensive models that are script corresponds with one syllable, as opposed to applicable across languages. The efficiency was one phoneme in languages like English. The Kannada tested on Appen Spanish speech corpus and the writing system is an abugida, with consonants efficiency of about 61.5% was achieved. appearing with an inherent vowel. The speech recognition system uses the The characters are classified into three categories knowledge obtained from phonotactics, phonology, [3] swaras (vowels), vyanjanas (consonants) and and acoustic phonetics to apprehend admissible Yogavaahakas (part vowel, part consonants– two phonetic information. Hence the system can make letters: anusvara and visarga). broad classifications and also detailed phonetic The name given for a pure, true letter is akshara, distinctions. Espy-Wilson et al has proposed a paper akkara or varna. Each letter has its own form (ākāra) [7] which discusses a framework for developing a and sound (shabda); providing the visible and audible phonetically based recognition system. The representations, respectively. Kannada is written recognition task is the class of sounds known as the from left to right.Kannada alphabet (aksharamale or semivowels. Though the results obtained from the varnamale) now consists of 49 letters. study were incomplete, it stimulated the bedrock for Each sound has its own distinct letter, and further studies. therefore every word is pronounced exactly as it is A phonetic segmentation method based on spelt; so the ear is a sufficient guide. After the exact speech analysis under Microcanonical Multiscale sounds of the letters have been once gained, every Formalism (MMF). The MMF [6] depends on word can be pronounced with perfect accuracy. The computation of local geometrical parameters, accent falls on the first syllable. Each consonant singularity exponents (SE). The SE conveys the sound has two distinct pronunciations. The sound phoneme boundaries. The author has proposed the 2- with normal pronunciation (deergha) is generally steps technique, which exploits the SE to upgrade the used in the varnamala or aksharamala: segmentation accuracy. The first step concentrates on 1. short/brief one ( , also known as hrasva, ), detecting the boundaries of original signals and a 响 ಹ್ಯಸ್ವ low-pass filtered version and the union of all the without any vowel. detected boundaries are considered as the candidates. 2. long/normal

Phonetic Dictionary for Natural Language Processing: Kannada

75 Characters Maximum

OSU WPL # 27 (1983) 140- 164. the Elimination of Ergative Patterns Of

Recognition of Online Handwritten Gurmukhi Strokes Using Support Vector Machine a Thesis

Sanskrit Alphabet

Tugboat, Volume 15 (1994), No. 4 447 Indica, an Indic Preprocessor

5892 Cisco Category: Standards Track August 2010 ISSN: 2070-1721

Proposal for a Kannada Script Root Zone Label Generation Ruleset (LGR)

A Script Independent Approach for Handwritten Bilingual Kannada and Telugu Digits Recognition

The Taittirtyaprtiakhya As on Antjsvara

15178-Devanagari-Spacing-Anusvara

Inflectional Morphology Analyzer for Sanskrit

Proposal for a Gujarati Script Root Zone Label Generation Ruleset (LGR)