Identification of Marathi and Sanskrit Compound and Non-Compound Word using Genetic Algorithm
International Journal of Advances in Electronics and Computer Science, ISSN(p): 2394-2835, Volume-5, Issue-4, Apr.-2018
http://iraj.in

IDENTIFICATION OF MARATHI AND SANSKRIT COMPOUND AND NON-COMPOUND WORD USING GENETIC ALGORITHM

1SONAL P. PATIL, 2K. N. JARIWALA
1Ph.D. Research Scholar, 2Assistant Professor
Computer Engineering Department, S.V.N.I.T, Surat, India
E-mail: [email protected], [email protected]

Abstract- Text-based language recognition is the task of automatically recognizing a language from a given text document. It is harder to distinguish languages within the same language family than languages from different families. In this paper, the performance of statistical measures is investigated for a text-based language identification system, with emphasis on five languages used in India that are written in Devanagari script: Marathi, Hindi, Sanskrit, Bhojpuri and Nepali. The proposed system uses n-grams as features for classification. Language identification is a key pre-processing step in several tasks of Natural Language Processing (NLP). In a multilingual society like India there is wide scope for automatic language identification, since it would be a fundamental step in bridging the digital divide between the Indian masses and the world.

Index Terms- Devanagari Script, Multilingual Computing, Wiener filter, Curvelet transform, Genetic algorithm

I. INTRODUCTION

Text-based language identification, or recognition, is the task of automatically recognizing a language from a given text document. It is not easy to distinguish languages within the same language family. In this paper, the performance of statistical measures is investigated for a text-based language identification system, with emphasis on five languages used in India that are written in Devanagari script: Marathi, Hindi, Sanskrit, Bhojpuri and Nepali. The proposed system uses n-grams as features for classification. Language recognition is a significant pre-processing step in many tasks of Natural Language Processing (NLP). In a multilingual society like India there is wide scope for automatic language identification, since it would be a vital step in bridging the digital divide between the Indian masses and the world.

OCR (Optical Character Recognition) is an active field of research in Pattern Recognition. OCR methodologies can be classified based on two criteria: the data acquisition process, which can be on-line or off-line, and the type of text, which can be printed or hand-written [1]. Devanagari is the most popular Indian script. In the case of Indian languages, however, research work is very limited due to the complex structure of the languages [2].

1.1 Devanagari Character Set

Devanagari is an Indian alphasyllabic script used to write several languages such as Sanskrit, Hindi, Marathi, Bhojpuri, Nepali, Konkani, Sindhi, Marwari, Pali, Maithili and many other languages spoken in various parts of India. The word Devanagari is a combination of two words: deva, which means God, and nagari. Most Indo-Aryan languages are written in Devanagari script. Devanagari is the heart of the writing system. An alphasyllabary is a writing system based primarily on consonants, in which vowels are nevertheless obligatory [8]. The Unicode Standard describes three blocks for Devanagari: Devanagari (U+0900–U+097F), Devanagari Extended (U+A8E0–U+A8FF) and Vedic Extensions (U+1CD0–U+1CFF). Non-assigned code points are indicated by grey areas. The main Devanagari block spans 0900 to 097F.

1.2 Word Identification Architecture

Optical word identification involves many steps to fully recognize the input and produce machine-encoded text. These phases are: Pre-processing, Segmentation, Feature extraction and Classification. The architecture of these phases is shown in Figure 1, and the phases are briefly described below.

Pre-processing

The pre-processing phase normally includes techniques for binarization, noise removal, skew detection, slant correction, normalization, contour making and skeletonization, which make it easier to extract relevant features from the character image and to recognize it efficiently.

Figure 1: Steps to recognize a specific language [3].

All possible n-grams (unigram, bigram and trigram), at both character level and word level, were extracted during the training stage. The main advantages of n-gram models and algorithms are their relative simplicity and the capability to scale up by simply increasing n. The model can store additional context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently. The n-gram model approximates the probability of a word sequence, which the chain rule expands as:

$$P(X_1 \ldots X_n) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1^{n-1}) = \prod_{k=1}^{n} P(X_k \mid X_1^{k-1})$$

The Language Profile Generator and the Classifier are the two main components of the LID system. For language identification, the former calculates the n-gram profile of the text to be identified and then compares it to language-specific n-gram profiles. For every language, it generates all possible n-grams for the text and saves them into the corresponding language file. In the classification step, a test sample is given, its likelihood is calculated under all the models, and the language that gives the best likelihood is selected.

Feature extraction is used to extract the relevant features on which character recognition is based. Features are first computed and extracted, and then the most relevant features are selected to construct the feature vector that is eventually used for recognition.

Classification

Each pattern, represented by its feature vector, is assigned to one of the predefined classes using classifiers. Classifiers are first trained on a training set of pattern samples to prepare a model, which is later used to recognize the test samples. The training data should contain a wide variety of samples so that all possible samples can be recognized during testing.

II. LITERATURE SURVEY

the features from an image, a thinning process is applied in the pre-processing stage. Thinning is a significant pre-processing step in OCR. The purpose of thinning is to delete redundant information while retaining the characteristic features of the image. Freeman chain code, a representation technique that is useful in image processing, shape analysis and pattern recognition, is used with a heuristic approach for feature extraction [1].

U. Pal, Wakabayashi and Kimura also presented a comparative study of Devanagari handwritten character recognition using different features and classifiers [4]. They used four sets of features based on curvature and gradient information obtained from binary as well as gray-scale images and compared results using 10 different classifiers, concluding that the best results, 74.74% and 75.17% for features extracted from binary and gray images respectively, were obtained with the Mirror Image Learning (MIL) classifier.

Sarbajit Pal et al. [5] have described a projection-based statistical approach for handwritten character recognition. They proposed four-sided projections of characters, and the projections were smoothed by polygon approximation.

Nikita Gaur and Dayashankar Singh et al. [6] have described a gradient feature extraction approach for recognition of Sanskrit words. They used the Sobel operator for edge detection. The Sobel operator is used in image processing, particularly within edge detection algorithms.

Brijmohan Singh, Ankush Mittal, M.A. Ansari, Debashis Ghosh et al. [7] have described a holistic system for offline handwritten Devanagari word recognition. In this paper, they proposed a scheme based on a Curvelet feature extractor with SVM and k-NN classifiers for the recognition of handwritten Devanagari words.

Prachi Patil, Saniya Ansari et al. [9] have proposed Online Handwritten Devnagari Word Recognition using an HMM-based technique. Feature extraction from the input image is done using Android technology. Using those features, the HMM recognizes the word. They tested the proposed system on different word images and obtained 95.70% recognition accuracy.

M. N. Sandhya Arora, D. Bhattacharjee et al. [13] proposed recognition of non-compound handwritten Devnagari characters using a combination of MLP and minimum edit distance. They used two well-known and established pattern recognition techniques, one using neural networks and the other using minimum edit distance, with characters represented using shadow features and chain code histograms. The method was evaluated on a database of 7154 samples. The overall recognition rate was found to be 90.74%.

A number of works on LID in Indian languages are notable, and these works help us understand the challenges and methods of Indian language identification. Kavi Narayana Murthy formulated language identification as a machine learning problem: a supervised classification task in which features extracted from a training corpus are then used for classification. The paper by Shibamouli Lahiri, in which n-gram and Word Network features are used for Native Language Identification, recognizes a writer's native language from his or her writing in a second language using n-gram features and WordNet [9]. Another method for LID of Indian languages, proposed by Pinky Roy as "Language Identification using Gaussian Mixture Model Tokenization", aims at identifying the language of a spoken utterance. It uses a Gaussian mixture model as the basis for phone tokenization and uses n-gram