International Journal of Advances in Electronics and Computer Science, ISSN(p): 2394-2835, Volume-5, Issue-4, Apr.-2018, http://iraj.in

IDENTIFICATION OF MARATHI AND SANSKRIT COMPOUND AND NON-COMPOUND WORD USING GENETIC ALGORITHM

1SONAL P. PATIL, 2K. N. JARIWALA

1Ph.D. Research Scholar, 2Assistant Professor Computer Engineering Department, S.V.N.I.T, Surat, India E-mail: [email protected], [email protected]

Abstract- Text-based language recognition is the task of automatically recognizing the language of a given text document. It is harder to distinguish languages within the same language family than languages from different families. In this paper, the performance of statistical measures is investigated for text-based language identification, with emphasis on five languages used in India that share the Devanagari script: Marathi, Hindi, Sanskrit, Bhojpuri and Nepali. The proposed system uses n-grams as features for classification. Language identification is a key pre-processing step in several Natural Language Processing (NLP) tasks. In a multilingual society like India there is wide scope for automatic language identification, since it would be a fundamental step in bridging the digital divide between the Indian masses and the world.

Index Terms- Devanagari Script, Multilingual Computing, Wiener filter, Curvelet transform, Genetic algorithm

I. INTRODUCTION

Text-based language identification, or recognition, is the task of automatically recognizing the language of a given text document. It is not easy to distinguish languages within the same language family. In this paper, the performance of statistical measures is investigated for text-based language identification, with emphasis on five languages used in India that are written in the Devanagari script: Marathi, Hindi, Sanskrit, Bhojpuri and Nepali. The proposed system uses n-grams as features for classification. Language recognition is a significant pre-processing step in many Natural Language Processing (NLP) tasks. In a multilingual society like India there is wide scope for automatic language identification, since it would be a vital step in bridging the digital divide between the Indian masses and the world.

OCR (Optical Character Recognition) is an active field of research in pattern recognition. OCR methodologies can be classified on two criteria: the data acquisition process, which can be on-line or off-line, and the type of text, which can be printed or handwritten [1]. Devanagari is the most popular Indian script. For Indian languages, however, research work is very limited because of the complex structure of the languages [2].

1.1 Devanagari Character Set

Devanagari is an Indian alphasyllabary script used to write several languages, including Sanskrit, Hindi, Marathi, Bhojpuri, Nepali, Konkani, Sindhi, Marwari, Pali, Maithili and many other languages spoken in various parts of India. The word Devanagari is a combination of two words: deva, which means God, and nagari. Most Indo-Aryan languages are written in the Devanagari script. An alphasyllabary is a writing system that is primarily based on consonants but in which vowel notation is nevertheless required [8]. The Unicode Standard describes three blocks for Devanagari: Devanagari (U+0900–U+097F), Devanagari Extended (U+A8E0–U+A8FF) and Vedic Extensions (U+1CD0–U+1CFF). In the code charts, unassigned code points are indicated by grey areas. The main Devanagari block covers the range U+0900 to U+097F.

1.2 Word Identification Architecture

Optical word identification involves several steps to fully recognize the input and produce machine-encoded text. These phases are pre-processing, segmentation, feature extraction and classification. The architecture of these phases is shown in Figure 1, and the phases are briefly described below.

Pre-processing

The pre-processing phase normally includes techniques such as binarization, noise removal, skew detection, slant correction, normalization, contour making and skeletonization, which make it easier to extract relevant features from the character image and to recognize it efficiently.

Figure 1: Steps to recognize a specific language [3].

All possible n-grams (unigrams, bigrams and trigrams), at both character level and word level, were extracted during the training stage. The main advantages of n-gram models and algorithms are relative simplicity


and the capability to scale up by simply increasing n. Larger models store additional context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently. The probability of a word sequence, which the n-gram model approximates when predicting the next word, is given by the chain rule:

$$P(X_1, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1^{n-1}) = \prod_{k=1}^{n} P(X_k \mid X_1^{k-1})$$

where X_1^{k-1} denotes the sequence X_1, ..., X_{k-1}. An N-gram model approximates each factor by conditioning only on the preceding N-1 words.

Language Profile Generator and Classifier are the two main components of the LID system. For language identification, the former computes the n-gram profile of the text to be identified and then compares it with the language-specific n-gram profiles. For every language, it generates all possible n-grams of the training text and saves them into the corresponding language file. In the classification step, a test sample is given, its likelihood is computed under each language model, and the language that gives the best likelihood is selected.

Feature extraction is used to extract the features on which the recognition of characters is based. Features are first computed and extracted, and then the most relevant features are selected to construct the feature vector that is eventually used for recognition.

Classification

Each pattern, represented by its feature vector, is assigned to one of a set of predefined classes by a classifier. Classifiers are first trained on a training set of pattern samples to build a model, which is later used to recognize test samples. The training data should contain a wide variety of samples so that all possible samples can be recognized during testing.

II. LITERATURE SURVEY

A number of notable works exist on LID for Indian languages, and they help us understand the challenges and methods of Indian language identification. Kavi Narayana Murthy formulated language identification as a machine learning problem: a supervised classification task in which features extracted from a training corpus are then used for classification. Shibamouli Lahiri used n-gram and word network features for native language identification, recognizing a writer's native language from his or her writing in a second language using n-gram features and WordNet [9]. Another method for LID of Indian languages, proposed by Pinky Roy as "Language Identification using Gaussian Mixture Model Tokenization", aims at identifying the language of a spoken utterance. It uses a Gaussian mixture model as the basis for phone tokenization and uses n-grams for identification [10].

Namita Dwivedi described recognition of Sanskrit words using Prewitt's operator for extracting features from an image. A thinning process is applied during pre-processing; thinning is a significant pre-processing step in OCR. Its purpose is to delete redundant information while retaining the characteristic features of the image. The Freeman chain code, a representation technique useful in image processing, shape analysis and pattern recognition, is used with a heuristic approach for feature extraction [1].

U. Pal, Wakabayashi and Kimura presented a comparative study of Devanagari handwritten character recognition using different features and classifiers [4]. They used four sets of features based on curvature and gradient information obtained from binary as well as gray-scale images and compared the results of 10 different classifiers. They concluded that the best results, 74.74% and 75.17% for features extracted from binary and gray-scale images respectively, were obtained with the Mirror Image Learning (MIL) classifier.

Sarbajit Pal et al. [5] described a projection-based statistical approach for handwritten character recognition. They proposed four-sided projections of characters, which were smoothed by polygon approximation.

Nikita Gaur and Dayashankar Singh et al. [6] described a gradient feature extraction approach for recognition of Sanskrit words. They used the Sobel operator, which is widely used in image processing, for edge detection.

Brijmohan Singh, Ankush Mittal, M.A. Ansari, Debashis Ghosh et al. [7] described a holistic system for offline handwritten Devanagari word recognition. They proposed a scheme based on a Curvelet feature extractor with SVM and k-NN classifiers for the recognition of handwritten Devanagari words.

Prachi Patil, Saniya Ansari et al. [9] proposed online handwritten Devnagari word recognition using an HMM-based technique. Feature extraction from the input image is done using Android technology, and the HMM recognizes the word from these features. They tested the proposed system on different word images and obtained 95.70% recognition accuracy.

M. N. Sandhya Arora, D. Bhattacharjee et al. [13] proposed recognition of non-compound handwritten Devnagari characters using a combination of MLP and minimum edit distance. They used two well-known and established pattern recognition techniques, one using neural networks and the other using minimum edit distance, with characters represented by shadow features and chain code histograms. The method was evaluated on a database of 7154 samples, and the overall recognition rate was found to be 90.74%.

III. STEPS FOR IDENTIFICATION OF MARATHI AND SANSKRIT WORD

Current research on language identification has been restricted entirely to machine learning approaches. A


set of training data is given in machine learning approaches, and the machine learns a general rule or builds a model for performing the intended task. A machine learning system is expected to be general, and it is understood that training is based only on the inherent properties of the data, which are expressed through a set of features.

The basic steps of the handwritten Devnagari Marathi and Sanskrit word recognition system are pre-processing, feature extraction and recognition.

3.1 Training Set

Once the training set is created, the proposed system is applied to random data for classification and identification of unidentified content in digital online text. This test data was taken arbitrarily from the Internet, with sentences in any of the five languages, and the results were noted. The test data was then filtered and passed as input to the testing profile generator code to generate character-based as well as word-based unigram, bigram and trigram text data. The similarity measures between languages are then calculated.

3.2 Image pre-processing:

The raw data is subjected to a number of preliminary processing steps to make it usable in the descriptive stages of character analysis. Pre-processing aims to produce data on which the character recognition system can operate accurately.

3.2.1 Smoothing

Smoothing operations are used to blur the image and reduce noise. Blurring is used in pre-processing steps such as the removal of small details from an image. The main objective of the filters is to improve the quality of the image by enhancing the interpretability of the information it contains for human viewers. In this system the Wiener filter is used for smoothing.

Gaussian filter:

Noise in digital images can be described as random fluctuations in brightness and color. It corrupts the information in an image such that the intensity at each pixel is a combination of the true signal and the noise. Gaussian noise, or amplifier noise, impairs the image with a linear addition of white noise, meaning that it is independent of the image itself and usually evenly distributed over the frequency domain. As the name suggests, the intensity of the noise at each pixel follows a Gaussian (normal) distribution.

Wiener Filter:

Inverse filtering is a restoration technique for deconvolution: when the image is blurred by a known low-pass filter, it is possible to recover the image by inverse filtering or generalized inverse filtering. However, inverse filtering is very sensitive to additive noise. The approach of reducing one degradation at a time allows us to develop a restoration algorithm for each type of degradation and simply combine them.

3.3 Feature Extraction Technique

Any given image can be decomposed into several features. The feature extraction technique must retrieve the features of characters accurately. The extracted features are organized in a database, which is the input for the recognition phase of the classifier. Feature extraction is very important in a recognition system because its output is used by the classifier to classify the data.

3.3.1 Curvelet Transform

A feature extraction scheme based on the digital Curvelet transform has been used in [14]. In this work, the words are extracted from the sample images using conventional methods. A usual feature of handwritten text is the orientation of the text written by the writer. Each sample is cropped to its edges and resized to a standard width and height suitable for the digital Curvelet transform. The features extracted from the input image using the Curvelet transform are contrast, correlation, homogeneity and energy.
1. Contrast: the difference in luminance or color that makes an object (or its representation in an image or display) distinguishable.
2. Correlation: the process of establishing a relationship or connection between two or more things.
3. Homogeneity: the quality or state of being homogeneous.
4. Energy: a function that captures the desired solution; gradient descent is performed to compute its lowest value, resulting in a solution for the image segmentation.

3.3.2 Shape Moment Feature

The main problem in a character recognition system is the large variation in shapes within a class of characters. This variation depends on font styles, document noise, photometric effects, document skew and poor image quality. The large variation in shapes makes it difficult to determine a convenient number of features prior to model building. Various shape-based and boundary-based features are taken from each individual character. Moment-based features such as total kurtosis, skewness, moment, percentile and quartile are calculated from the character images.

3.4 Optimization

Genetic algorithms are a very good means of optimization in such problems. They optimize the desired property by generating hybrid solutions from the currently existing solutions. All this is done by the genetic operators, which are defined and applied over the problem [15].

Fitness Function :- In genetic algorithms, the fitness function is used to test the goodness of a solution. This function, when applied on any of the


solution from the solution pool, tells the level of goodness.

3.5 Classification

The decision-making part of a recognition system is the classification stage, which uses the features extracted in the previous stage. There are various methods for classification, such as K Nearest Neighbor (KNN), Artificial Neural Network (ANN) and Support Vector Machine (SVM). These methods have been successfully applied to offline Devnagari Marathi and Sanskrit word recognition, and the results of neural network classification are better than those of the other classification methods applied to handwritten Devnagari words.

Artificial Neural Network

Artificial neural networks are one of the popular techniques used for classification because of their learning and generalization abilities. A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. The optimisation technique used for training this architecture was the Scaled Conjugate Gradient (SCG) method. SCG was used because it gave better results and has been found to solve the optimization problems encountered when training an MLP network more efficiently than the gradient descent and conjugate gradient methods [17].

IV. RESULT

The experimental results and the accuracy of the test results are discussed below.

4.1 Experimental Results

In order to train the classifiers, a set of training Marathi and Sanskrit words is required, together with their respective features extracted in the feature extraction step. There are two basic phases of pattern classification: training and testing. In the training phase, data is repeatedly presented to the classifier while the weights are updated to obtain a desired response. In the testing phase, the trained system is applied to input data that it has never seen, to check the performance of the classification.

Table 1 demonstrates the percentage accuracy for five sample word images from the Marathi word dataset with the n-gram classifier, and Figure 2 shows the percentage accuracy for five sample word images from the Sanskrit word dataset. The accuracy for both Marathi and Sanskrit words is different each time the neural network is trained.

Figure 2: Accuracy with ngram classifier

CONCLUSION

Although the languages share a similar script, the meaning of words formed from aksharas changes. The position of characters (aksharas) or words changes from one language to another. The end case marker (words or characters) also changes from one language to another. Bigrams that are common in one language may not be common in another language. Similarly, the occurrence of a trigram in one language may not be the same as in another language. As the data set for testing increases, the accuracy of the result also increases. The language identification task is hindered by lexical similarity between languages: the greater the lexical similarity between languages, the lower the accuracy of LID. The problems faced in LID of Indian languages are:
1. Several words are derived from a single root word in Indian languages. These words have the root form in common across languages.
2. It is not possible to collect every word of any Indian language to form a dictionary, or to search such a dictionary.
3. The effectiveness of the language identifier depends on the size of the text; to get a good result the input should be longer than 5 words.
4. Long-range dependencies are not clearly represented.
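The observation that bigram and trigram distributions differ between languages can be illustrated with a minimal sketch of a character n-gram profile classifier. This is not the system evaluated in the paper: the function names, the add-one smoothing and the two toy training sentences are assumptions chosen only for illustration, and a real profile would be built from a much larger corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n_max=3):
    """Return all character n-grams of length 1..n_max found in text."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


def build_profile(text, n_max=3):
    """Count n-gram occurrences in a training text (the language profile)."""
    return Counter(char_ngrams(text, n_max))


def log_likelihood(sample, profile, n_max=3):
    """Score a sample against one profile using add-one smoothing (an assumption)."""
    total = sum(profile.values())
    vocab = len(profile) + 1  # +1 reserves probability mass for unseen n-grams
    return sum(math.log((profile[g] + 1) / (total + vocab))
               for g in char_ngrams(sample, n_max))


def identify(sample, profiles, n_max=3):
    """Return the language whose profile gives the highest score."""
    return max(profiles,
               key=lambda lang: log_likelihood(sample, profiles[lang], n_max))


# Hypothetical toy training sentences, not the paper's corpus.
profiles = {
    "hindi": build_profile("मैं घर जा रहा हूँ और वह स्कूल जा रही है"),
    "marathi": build_profile("मी घरी जात आहे आणि ती शाळेत जात आहे"),
}

print(identify("तो शाळेत जात आहे", profiles))
```

With profiles this small the decision rests on a handful of shared n-grams, which is consistent with the point above that short inputs and lexically similar languages make identification less reliable.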

Table 1: Accuracy Result Table for Sanskrit Word Sample

REFERENCES

[1] Aditya Bhargava and Grzegorz Kondrak, “Language identification of names with SVMs,” IEEE ICECCN, 2013, pages 693–696, 2010.
[2] Combrinck, H., & Botha, “A Short Review Automatic language identification: Resisting Complexity,” International Journal of Computer Applications, vol. 4, December 2010.


[3] S. Kubatur and M. Sid-Ahmed, “A Neural Network Approach to Online Devanagari Handwritten Character Recognition,” IEEE, pp. 209–214, December 2012.
[4] U. Pal and F. Kimura, “Comparative Study of Devanagari Handwritten Character Recognition using Different Feature and Classifiers,” 10th International Conference on Document Analysis and Recognition, 2009.
[5] S. Pal and J. Mitra, “A Projection Based Statistical Approach for Handwritten Character Recognition,” Proceedings of International Conference on Computational Intelligence and Multimedia Applications, vol. 2, pp. 404–408, 2007.
[6] N. Gaur and D. Singh, “Sanskrit Word Recognition using Gradient Feature Extraction,” VSRD-IJCSIT, vol. 2, pp. 167–174, 2012.
[7] B. Singh and A. Mittal, “Handwritten Devanagari Word Recognition: A Curvelet Transform Based Approach,” International Journal on Computer Science and Engineering (IJCSE), vol. 3, April 2011.
[8] V. Saraf and D. Rao, “Handwritten Devanagari Character Recognition using Gradient Features,” International Journal of Soft Computing and Engineering (IJSCE), vol. 2, April 2013.
[9] P. Patil and S. Ansari, “Online Handwritten Devnagari Word Recognition using Hmm Based Technique,” International Journal of Computer Applications, vol. 17, June 2014.
[10] G. Kumar and P. Kumar Bhatia, “Neural Network Based Approach for Recognition of Text Images,” International Journal of Computer Applications, vol. 14, January 2013.
[11] S. Shelke and S. Apte, “A Multistage Handwritten Marathi Compound Character Recognition Scheme using Neural Networks and Wavelet Features,” International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 1, March 2011.
[12] S. Arora and D. Bhattacharjee, “Performance Comparison of SVM and ANN for Handwritten Devnagari Character Recognition,” IJCSI International Journal of Computer Science, vol. 7, May 2010.
[13] S. Arora and D. Bhattacharjee, “Recognition of Non-compound Handwritten Devnagari Characters using A Combination of MLP and Minimum Edit Distance,” IJCSS.
[14] S. Pannirselvam and S. Ponmani, “Preprocessing of Handwritten Documents using Various Filters – A Survey,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, July 2013.
[15] R. Kala and H. Vazirani, “Offline Handwriting Recognition using Genetic Algorithm,” IJCSI International Journal of Computer Science Issues, vol. 1, March 2010.
[16] R. Dineshkumar and J. Suganthi, “Sanskrit Character Recognition System using Neural Network,” Indian Journal of Science and Technology, vol. 8, pp. 65–69, January 2015.
[17] M. Abdella and T. Marwala, “The Use of Genetic Algorithms and Neural Networks to Approximate Missing Data in Database,” Computing and Informatics, vol. 24, pp. 577–589, 2005.


