<<

International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1783 ISSN 2229-5518

Grammar Checker for Hindi and Other Indian Languages

Anjani Kumar Ray, Vijay Kumar Kaul Center for Information and Language Engineering Mahatma Gandhi Antarrashtriya Hindi Vishwavidyalaya, Wardha (India)

Abstract: Grammar checking is one of the sentence is grammatically well-formed. In most widely used techniques within absence of the potential syntactic natural language processing (NLP) approach, incorrect or not-so-well applications. Grammar checkers check the grammatically formed sentences are grammatical structure of sentences based analyzed or produced. The preset paper is on morphological and syntactic an exploratory attempt to devise the hybrid processing. These two steps are important models to identify the grammatical parts of any natural language processing relations and connections of the to systems. Morphological processing is the phrases to sentences to the extent of step where both lexical words (parts-of- discourse. Language Industry demands speech) and non- tokens (punctuation such economic programmes doing justice marks, made-up words, acronyms, etc.) are and meeting the expectations of language analyzed into IJSER their components. In engineering. syntactic processing, linear sequences of words are transformed into structures that Keywords: Grammar Checking, Language show grammatical relationships among the Engineering, Syntax Processing, POS words in the sentence (Rich and Knight Tagging, Chunking, morphological 1991) and between two or more sentences Analysis joined together to make a or complex sentence. There are three main Introduction: Grammar Checker is an approaches/models which are widely used NLP application that helps the user to for grammar checking in a language; write correct sentence in the concerned language. It compiles the give text in such probability-based models, rule-based a manner that reports the error found after models and hybrid models. In rule-based analyzing the text at the grammatical grammar checking, each sentence is parameters of the language. Grammar checker enables you to correct the most completely parsed to check whether the unnoticeable mistakes/errors. Grammar

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1784 ISSN 2229-5518

Checker helps the user to write better – Probability Based Models, Hindi with efficiently equipped features that corrects the texts. Grammar Checker – Rule Based Models corrects grammatical mistake based on the – Hybrid Models. context of complete sentences with unmatched accuracy. For the utilization of Methodology: The methodology is used in Grammar checking software the user has designing the grammar checker is to enter the text that is intended to be deductive. System can extract the corrected for grammar check and information related to grammar rules: punctuation mistakes the user has to just morphological, syntactic and semantic click to check error related to Grammar evaluation of each word in a given and punctuation and it will report the error sentence. The information cannot get in found in the word or phrase with a wavy one go but the information is extracted in underline. To get the possible solution to each phase of the system and the error then you have to click right hand on information is accumulated and stored it will show the possible solution for with each individual word. The grammar particular error. Most of mistake which are checker has been built on the basis of committed by the user be it related to syntactic rules the Hindi language Gender, number and/or related to observes. System converts the document agreement between phrases of sentence. into token of paragraph. The paragraph is Most of Grammar checkers have the spell further divided into sentences and each checker embedded in it. sentence is divided into token of word and non-word sequences. Using these State of Art: Grammar checking is one of techniques we are in a position to know the most widely used techniques within about each token which has its original natural languageIJSER processing (NLP) belonging to any given paragraph and applications. Grammar checker checkes sentence. the grammatical structure of sentence based on morphological and syntactic Tokenization Phase: In the process of processing. These two steps are important tokenization the text input divided in long parts of any natural language processing string of characters, into subunits, called systems. Morphological processing is the tokens[18] . Here the token means an step where both lexical words (parts-of- individual appearance of a word in certain speech) and non-word tokens (punctuation position in a given text. For instance one marks, made-up words, acronyms, etc.) are can consider (लड़कⴂ ləɽəkoň ladakon) as an analyzed into their components. In instance of word (लड़का ləɽəka ladaka). syntactic processing, linear sequences of words are transformed into structures that In a running text the token are generally show grammatical relationships among the separated by white space but token words words of a given sentence. may contain following characters as punctuation markers and other as part of Approaches/models and other frameworks, the word for e.g. widely used for grammar checking in a language are:

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1785 ISSN 2229-5518

well. This type of ambiguity cannot be resolved by one sentence.

लड़का लड़का’ लड़का,’ लड़का” लड़का. लड़का). Name Entity Recognition (NER): NER ləɽəka ləɽəka’ ləɽəka,’ ləɽəka” plays an important role in identifying the ləɽəka. ləɽəka). token word which represents one identity ladakaa ladakaa’ ladakaa,’ ladakaa” i.e. NER and it behaves like a Proper ladakaa. ladakaa). Noun. Thus NER identification process will affect overall performance of the system. For NER identification we discuss some rules which helps in identifying the Abbreviation Identification: In NER and combine group of token which is Tokenization process, we have to be very part of any NER then system can be treat carefull about using of dot (.) in defining as a one single unit or single token. sentence boundaries and it is also used in writting abbreviations. There is a need for Some Rules are as follows … writing the rules to identify abbreviations. Identification of Name of Person

We can predict some abbreviations such as In Hindi language, common title that is the abbreviations are written with dot (.) used in writing the names of people such and space such as अ. ग. प. is used as as abbreviations for, असम गण परिषद (əsəm ɡəɳ pərɪʂəd̪ asama gan parishad). In this, a श्री श्रीमान श्रीमति,डॉ. इंति. मोहििमा, मैडम, sequence of letterIJSER-dot-letter-dot pattern is मैम, etc. appearing. Abbreviations are written with ʃri ʃriman ʃrimətɪ,ɖɒ. ɪnʤɪ. moɦətərəma, dot (.) and without space between ̪ ̪ mɛɖəm, mɛm alphabets such as, अ.ग.प. is used as abbreviations for असम गण परिषद (əsəm When the system encounters such title ɡəɳ pərɪʂəd̪ asama gan parishad) word it checks the next word which may be the part of name of a Person written in abbreviations are written without space fashion and without dot (.) such as अगप Is used as abbreviations for असम गण परिषद (əsəm [Title ] [First Name] [Middle Name] [Sur Name] [Honorific Marker word] ɡəɳ pərɪʂəd̪ asama gan parishad). Identification of Date String Some abbreviation are written as a group of characters without space and without Date string is being identified by using dot (.) and such word also belongs to regular expression and is consider the best lexicon of Hindi language such as आप i.e. way to do it. There are several patterns for used as abbreviations for आम आदमी पार्टी writing the date string such as (aama aadamee paartee) but as we know that आप is a very prompt Hindi pronoun as

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1786 ISSN 2229-5518

01/02/2017, 1/2/2017, 01/02/17, 1//2/17, system Morphological Analysis is being 01-Feb-2017, Feb 01, 2017 etc. used for analyzing the word and its lexical form. Morphological Analysis of word During the identification process of data will give some information about its and time string, we have to write several Gender and Number and in case the word regular expressions to identify all types of is a verb it will give information about date and time string. Tense, Aspect and Modality (TAM).

1. Identification of Numeric Value / There are various approaches used in Currency designing the Morphological Analysis i.e. There are several words in a document like Corpus Based Approach, 200 셁, 셁 200, 셁 200/- these word will Approach, Paradigm Approach and Finite State Transducer (FST) Based approaches. represent for some currency value. Its POS In the proposed System Paradigm category will be “NCD” i.e. numerical Approach for Morphological Analysis is cardinal. used, as Hindi is a highly inflectional On other hand, some words like 9 वा, 9 वे, language. Hence, the Paradigm based 9 वी. These words contain two tokens. The approach is most efficient approach for morph analyzer for Hindi and probably for first token is a numerical value and the other Indian languages since a root word second token which is attached to previous can generate various words by adding the token gives some additional information suffix, prefix, or circumfix. related to Gender, Number etc. such as when token “वी” is added to a numerical This type of Morphological Analysis is value the word become a feminine gender done in two Phases and likewise when token “वे” is added to IJSERNominal Word Morphology any numerical token then word formed by these tokens having depict the plural Verb Identification using Morphological Number. Analysis

Therefore when we analyze the tokens we POS Tagging : Part-of-speech tagging is a have to extract some information about process of assigning the part-of-speech or gender, Number and sometimes about the other lexical class marker to each word in Person. This information is store with the corpus. Tags for natural language are word. This information will help in much more ambiguous. Part of speech deducting the problem or error in the tagger plays an important role in the Phrase that contains these types of tokens. , natural language parsing and information retrieval[13]. In Morphological Analysis : A tagging algorithm, string words are taken Morphological Analyzer is a tool which as input and algorithm select appropriate takes a word as input and produces tag from tagset for the word and assign it linguistic information (lexical form) of the to the word. The Part of Speech can word, such as its class, tense, etc. which is broadly be divided in two super categories required by all Natural Language i.e. closed class and open class. Members Processing Systems. In the proposed

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1787 ISSN 2229-5518

of Closed class type of any language may to following suffix table. Word is broken be fixed. Close classes of language may into root word and suffix. The system will differ from language to language. Some of check whether the root word belongs to close classes in Hindi are as follows: diction of root and suffix. If it matches to the suffix table then we can say that the

given word is a verb and system will Postp को,ने,के ,का,की,द्वािा,से,मᴂ,पि,अनु셂प,अनुसाि, ositio तिलाफ,िरिए,ििफ,िहि,िौि,दौिान,दौलि,ब accumulate its function along with the n s तनबि,बाद word. These information will help to check ko,ne,ke,ka,ki,d̪ wara,se,men,pər,ənʊr grammar associated with noun phrase up,ənʊsar,kʰɪlapʰ,ʤərɪe,t̪ərəpʰ,t̪əɦət̪,t̪ which behaves like a subject in a sentence. ɔr,d̪ ɔran, bəd̪ ɔlət̪,bənɪsbət̪,bad̪ ko,ne,ke,kaa,kee,dvaaraa,se,men,para Table Verb Suffix ,anuroopa,anusaara,khilaapha,jarie,ta rapha,tahata,taura,dauraan,badaulata, Suffix Function banisbata,baad Deter 煍यादा,ितनक,िमाम,थोड़ा, ंंंुगी/ंंंूगी/ंुंंगी Tense – Future mine अति,अथाह,अधूिी,अतधक,अतधकातधक /ंूंंगी/ंेंंगी/ंूं Number – Plural r s/ अतधकांश,अपयााप्त,अपाि,अिसे,अिसा, अ핍वल Gender – ʤjad̪ a,t̪ənɪk,t̪əmam,t̪ʰoɽa, गी/ंुं गी/यीं/ए गी/ Quan ət̪ɪ,ət̪ʰaɦ,əd̪ ʱuri,əd̪ ʱɪk,əd̪ ʱɪkad̪ ʱɪk, Feminine tifier əd̪ ʱɪkanʃ, əpərjapt̪, əpar, ərəse,ərəsa, गी/एगी/गीं/ℂ/एगीं/ s əwwəl jyaadaa,tanika,tamaama,thodaa, यᴂगी/ंंगीं/ं गीं/ऊ ati,athaaha,adhree,adhika,adhikaadhi गी ka jin/ẽɡi/ɡi/eɡi/ɡin/in/eɡi adhikaansha,aparyaapta,apaara,arase, n/jenɡi/-----/̃----/ũɡi arasaa, avvala Conj अथवा,एवं,औि,िथा,या,व िीं/ंींं Tense – Past uncti ət̪ʰəwa,enw,ɔr,t̪ət̪ʰa,ja,w t̪iň/in Number – Plural ons athava, evm,aur,tIJSERatha,yaa,vaa Gender – Auxi िहा/िहे है/हℂ िही है/हℂ िहा था िही थी/थीं िहे Feminine liary थे/थᴂ Verb rəɦa/rəɦe ɦɛ/ɦɛn rəɦi ɦɛ/ɦɛn rəɦa वािीं/वाℂ/वायीं/वा Double Causative s t̪ʰa rəɦi t̪ʰi/t̪ʰin rəɦe t̪ʰe Verb, Number – नी/ंािीं/ंाइ/ंाई raha/rahe hei/hein rahi hai/hein raha Plural tha rahi thi/thi rahe the/then wat̪in/wain/wajin/wani /aatin/aai/aaie Gender – Parti अलावा,केवल,िक,बिाय,भि,भी,ही Feminine cles əlawa,kewəl,t̪ək,bəʤaj,bʱər,bʱi,ɦi alava, kewal, tak, bajaay, bhar, bhi,hi Num एक, दो, हिाि, लाि, किोड़ इए/इएगा/तंएगा/ Number – Plural erals ek, d̪ o, ɦəʤar, lakʰ, kəroɖ Honorific Marker ek, do, hjaar, laakh, karor ंं/एं/यᴂ/ंेंं/ए/ तिए ɪe/ɪeɡa/---/---/ en/jeň/-- : Yes / ẽ/ʤɪe Automatically assigning tag to each word is not trivial.

Verb Identification: To identify whether the given word is verb or not is according

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1788 ISSN 2229-5518

Table Verb Suffix Table Verb Suffix Suffix Function Suffix Function ंंंुगा/ंंंूगा/ंुंंगा Tense – Future Gender – /ंूंंगा/ंेंंगा/ंूं Number – Plural Feminine Gender – गा/ंुं गा/ऊ गा ंेगा/एगा/येगा/गा Tense – Future Mescaline Number – /ंोगे/ए गे/एगा/ंेंं ega/jaga/ga Singular गे/ं गे/एंगᴂ/एंगे/ऊ /यᴂ Gender – गा/उंगा/ ंूं/ ंंगे/गे Mescaline ũɡa/enɡen/enɡe/ िा Tense – Present ũ/jenɡa/ʊnɡa/ɡe t̪a Number – िे/ंा/या/ंे/या/ए/ Number – Singular ना/ला/ंो/िना/यो/ Singular Gender – य//ंंंू/ंूंं Mescaline Tense – Past ंेगी/ओगी/एगी//ये Tense – Future ओ/ए Number – गी/गा/ये/तंए/ओगी Number – /ंाए//ंाये/आ Singular o/e/aye/aye/aa Singular /एंगी/ए गी Gender – Gender – oɡi/enɡi/ẽɡi Feminine Mescaline वाना/वािा/वाया/ Double Causative Verb िी Tense – Present Number – /ंािी/यी//ली/ंीIJSER/ Number – वािे/ंाया/ंािे/ये/ Singular Singular या ंाना वाना वा इ/ई/यी / / / Gender – Gender – t̪i/aati/yi/li/i/i/ee/ या/वाने/वा/लवाया/ Mescaline Feminine yi वाए/वाए/ वाएं/वायᴂ नी/ंानी/वानी Tense – VCT wana/wat̪a/waja/ wate/wana/waja/ /ंायी Number – ̪ Singular wane/wa/ləwaja/ वायी/वाई/वािी/यी Gender – wae/waẽ/waen/w /लायी/लाई Feminine ajeň ni/ani/wani/ayi/w कि/के Non Finite Verb ayi/waii/wati/yii/ /ंाकि/ने/ंाने laayi/laae kar/ke/aakar/ne/a लवाई/लवायी Tense – VDC ane ləwai/ləwaji Number – lavaai/lavaayi Singular

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1789 ISSN 2229-5518

Phrasing or : Shallow parsing (also chunking, "light parsing") is After POS Tagging an analysis of a sentence which identifies सिोि[N] घि[N] िा[VMAIN] िहा [VINF] है [VAUX] the constituents (noun groups, verbs, verb groups, etc.), but does not specify their Shallow Parsing internal structure, nor their role in the main [सिोि N]_NP [घि N]_NP { [िा VMAIN] sentence. There are various approaches [िहा VINF] [है VAUX]}_VP which are used in constructing the Shallow Morphological Analyzer for each Parser or Creating Phrases is parsing by elementary packet chunks, Machine Learning approach, Memory Based Shallow Parsing, HMM Elementary Packet 1 Based Shallow Parsing, etc.

These approaches have been used extensively in construction of shallow NP – सिोि parsers for English and other Latin Gender : Fem/Mes languages. The approach that is chosen and used for this system is Context Free Number : Singular Grammar based. This approach uses Person : Third bottom up and left to right parsing for construction of Phrases in a given sentence according to the rules that is described Elementary Packet 2 using the CFG Grammar formalism. The approach fails when there is ambiguity in the sentence. Removing ambiguities is not NP – घि within the currentIJSER scope of the system. After Shallow Parsing, we can obtain Gender : Mescaline Adjective Phrase, Noun Phrase etc. Number : Singular

Functioning of Grammar Checker Person : Third System: The user can input a sentence or a group of sentences for checking grammar related errors. System converts it into Elementary Packet 3 tokens of the word with the information about its position in paragraph, sentence and gender, number, person, suffix etc. VP िा िहा है After that POS Tagging has been done. In this phase morphological clue gives Gender : Mescaline information about POS Category. For Number : Singular example:

Person : Third सिोि घि िा िहा है।

səroj ɡʱər ʤa rəɦa ɦɛ। saroj ghara jaa rahaa hai।

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1790 ISSN 2229-5518

There are two methods for finding possible Sentence Level: Error finding method errors in sentence on following level which we have adopted takes the whole sentence as a unit is called Sentence Level  Phrase level Error Checking Method. The method is  Sentence Level required some agreement checking mechanism to find agreements between the Phrase level: Error finding method in Elementary Packets. Therefore we get phrases of sentence is by comparing connected phrases of sentences. This will elementary packets i.e. Phrases of help in comparing Phrase on the basic of sentences. All associated Elementary grammatical features and system will Packets have one to one relation in its ensure whether these features are identical. grammatical feature like Gender, Number If the features are identical in both the and Person etc. elementary packets sentences is correct otherwise it is an incorrect sentence. In the above examples, we can see the differences in Elementary Packet 1 and Conclusion: Design and development of Elementary Packet 3 on gender feature. Automatic Grammar Checking facility for Now program will give two possibilities of Hindi and other Indian languages is the above sentences. possible. The rule based approaches is

appropriate for it, because the system will extract information related to grammar at Possibility I each stage of the system. This will help in Suppose that In Elementary Packet 1 is deciding the correctness of a sentence. correct the program will change the Proposed system has capability for Gender Number Person (GNP) of checking in the error in sentence on the Elementary PacketIJSER 3 according to GNP basis of Phrases and it associations. But it features of Elementary Packet 1 thus the does not handle the anaphoric relation output sentence will be: between the sentences as yet. The system is also capable to find error in various type सीिा घि िा िही है। sentences but it fails to identify the error in sit̪a ɡʱər ʤa rəɦi ɦɛ। some cases. There is every possibility of improving the system so that it covers all seetaa ghara jaa rahee hai। types of the sentences.

Possibility II References: Suppose that Elementary Packet 3 is 1. ambati, B. R. (2010). On the role of correct the program will change the morphosyntactic features in Hindi Gender Number Person (GNP) of dependency parsing. Proceedings Elementary Packet 1 according to GNP of the NAACL HLT 2010 First features of Elementary Packet 3 thus the Workshop on Statistical Parsing of output sentence will be: Morphologically-Rich Languages. सिोि घि िा िहा है। Association for Computational Linguistics.

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1791 ISSN 2229-5518

2. Ambati, B. R., Gadde, P., & Jindal, 7. bharti, A., Chaitanya, V., & K. (2009). Experiments in Indian Sangal, R. (2000). Natural Language Dependency Parsing. Language Processing : A Paninian Proceedings of ICON09 NLP Tools Perspective. New Delhi: Prentice Contest: Indian Language Hall of India Private Limited. Dependency Parsing. Hyderabad. 8. Bhattacharya, P. (2012). Natural 3. Ambati, B. R., Husain, S., Jain, S., language Processing: A Perspective Sharma, D. M., & Sangal, R. (n.d.). from Computation in Presence of Two methods to incorporate local Ambiguity. CSI journal of morphosyntactic features in Hindi , 1. dependency parsing,. Proceedings

of the NAACL HLT 2010 First 9. Elaine Rich, K. K. (2010). Workshop on Statistical Parsing of Artificial Intelligence . Nodia, MorphologicallyRich Languages. India: Tata McGraw-Hill. Association for Computational 10. Gadde, P. (2010). Improving data Linguistics. driven dependency parsing using 4. Bharati, A. (2009). Simple parser clausal information, Human for Indian languages in a Language Technologies . The 2010 dependency framework. Annual Conference of the North Proceedings of the Third Linguistic American Chapter of the Annotation Workshop. Association Association for Computational for Computational Linguistics. Linguistics. Association for Computational Linguistics. 5. Bharati, A., & Sangal, R. (2009). Parsing IJSER Free Word Order 11. Hopcroft, J. E., & Ullman, J. D. Languages in the Paninian (1996). Introduction to Automata Framework. proceeding ACL ’93 Theory, Languages and Proceeding of the 31st annual Computation. New Delhi: Narosa meeting on Association for Publishing House. Computational Linguistic. 12. Jain, S. (2013). Exploring Semantic Association for Computational Information in Hindi WordNet for Linguistic. Hindi Dependency Parsing. The 6. Bharati, A., Husain, S., Ambati, B. sixth international joint conference R., Jain, S., Sharma, D. M., & on natural language processing. Sangal, R. (2008). Two semantic IJCNLP2013. features make all the difference in 13. Jurafsky, D., & Martin, J. H. parsing accuracy. In Proceedings (2000). Speech and Language of the 6th International Conference Processing : Introduction to on Natural Language Processing. Natural Language Processing, Pune: ICON-08. Computational Lingustics, Speech Recognition. Pearson.

IJSER © 2020 http://www.ijser.org International Journal of Scientific & Engineering Research Volume 11, Issue 6, June-2020 1792 ISSN 2229-5518

14. Prashanath, M. (2009). 24. तविय कुमाि म쥍होत्रा. (2002 ). कंप्युर्टि के Bidirectional Dependency Parser भातषक अनुप्रयोग . नई दद쥍ली : वाणी प्रकाशन. for Hindi, Telugu and Bangla. 25. ससंह, ड. स. (2009). सहंदी का वाकया配मक Proceedings of ICON09 NLP Tools व्याकिण. नई दद쥍ली, भािि : प्रभाि प्रकाशन. Contest: Indian Language Dependency Parsing,Hyderabad, 26. ससंह, स. (2000). अंग्रेिी-सहंदी अनुवाद India,. व्याकिण . नई दद쥍ली : प्रभाि प्रकाशन.

15. Rahman, M., Das, S., & Sharma, U. (2009). Parsing of part-of- speech tagged Assamese Texts. IJCSI International Journal of Computer Science Issues, 6(1).

16. Rich, E., & Knight, K. (1991). Artificial Intelligence. McGraw- Hill.

17. Sells, P. (1985). Lectures on Contemporary Syntactic Theories: An Introduction to Government- binding Theory, Generalized Phrase Structure Grammar, and Lexical-functional Grammar. Center for the Study of Language and Information, Standford University.IJSER

18. van Halteren, H. (1999). Syntactic Wordclass Tagging. Springer.

19. कन्हैया ससंह. (2013 ). सहंदी भाषा सातह配य औि नागिी तलतप . वािाणसी: तवश्वतवद्यालय प्रकाशन.

20. कामिाप्रसाद गु셁. (2010). संतिप्त तहन्दी व्याकिण. नई दद쥍ली: भाििीय ज्ञानपीठ.

21. तिवािी, ड. भ. (1999 ). तहन्दी भाषा दक श녍द संिचना . नई दद쥍ली : सातह配य सहकाि.

22. पं. कामिाप्रसाद गु셁. (2013). सहंदी व्याकिण. नई दद쥍ली: िितशला प्रकाशन.

23. पांडेय, म. क. (2008, िनविी-माचा). अनुक्रतमक व्याकिण : एक संगणकीय पि. (प. ि. गोतपनाथन, Ed.) बबचन(18), 112-123.

IJSER © 2020 http://www.ijser.org