Proceedings of the 4th Conference of Mathematical Society of Moldova CMSM4’2017, June 25-July 2, 2017, Chisinau, Republic of Moldova

Romanian Spelling and Grammar Checking Systems

Veronica Iamandi

Abstract

This paper describes a spelling and grammar checking system developed to detect various grammatical errors in Romanian sentences. The system uses a full-form lexicon for morphological analysis and applies rule-based approaches to part-of-speech (POS) tagging and chunking. It can detect and suggest corrections for a number of grammatical errors resulting from lack of agreement and incorrect word order.

Keywords: JSON data, JavaScript, POS, spelling system, grammar checking system.

1 Introduction

The Romanian spelling and grammar checking system is a program that corrects a sentence at the word level and at the syntactic level. The system rearranges the necessary data according to the sentence structure, using semi-structured data. It cannot correct 100% of the errors in the user's input. The program is divided into two systems: the Correct Spelling System and the Correct Grammar System. The Correct Spelling System checks whether each word of the sentence exists in the dictionary. If a word does not exist in the dictionary, the system finds similar words in the dictionary and offers them to the user. The Correct Grammar System checks the form of each component of the sentence against the base form of the sentence and, if a form does not fit, corrects it according to the sentence.

© 2017 by Veronica Iamandi

2 Check and correct spelling system

The check and correct spelling system is written in JavaScript and uses JSON data. The user inputs a sentence, and the system checks each word against a Romanian dictionary. If a word does not exist in the JSON data, the processing engine finds similar words and recommends them to the user.

Figure 1. Architecture of correct spelling system.

The first step is to build a Romanian dictionary as JSON data. For example, the words can be stored in an array: {"words":["pisică","câine","tigru","păsări","pește","crocodil"]}. The dictionary is built from the source [1], which contains more than 100,000 Romanian words.

Figure 2. The dictionary formatted in JSON.
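A dictionary stored in this shape can be parsed once and then queried word by word. The following is a minimal sketch, not the system's actual code: the helper name `isInDictionary`, the inline JSON string, and the case and Unicode normalization are assumptions made for illustration.

```javascript
// Dictionary in the shape shown above; a real one would be loaded from a file.
const dictionaryJson =
  '{"words":["pisică","câine","tigru","păsări","pește","crocodil"]}';

// Parse once and keep the words in a Set so that each lookup is a single hash check.
// Normalizing to NFC avoids mismatches between precomposed and combining diacritics.
const dictionary = new Set(
  JSON.parse(dictionaryJson).words.map((w) => w.normalize("NFC").toLowerCase())
);

// Hypothetical helper: true if the word is present in the dictionary.
function isInDictionary(word) {
  return dictionary.has(word.normalize("NFC").toLowerCase());
}

console.log(isInDictionary("Pisică")); // true
console.log(isInDictionary("pisca")); // false
```

A Set is used here instead of scanning the array because a dictionary of more than 100,000 words would otherwise be searched linearly for every input word.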

At the second step, the system verifies whether the word exists in the dictionary. If the word exists in the JSON data, the message "cuvîntul acesta este în dicționar" ("this word is in the dictionary") appears; otherwise, the message "cuvîntul acesta nu este în dicționar" ("this word is not in the dictionary") appears.

At the third step, if the word does not exist, the system finds the corrected word with the help of the processing engine. The processing engine uses edit distance [2]. Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are by counting the minimum number of operations required to transform one string into the other. It finds applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. The processing engine should work as in the following examples.

- calculatoi → calculator. The symbol "i" is replaced with "r".
- calclator → calculator. The symbol "u" is inserted between the symbols "c" and "l".
- ccalculator → calculator. The extra symbol "c" is deleted from the word.

Figure 3 shows the result calculated by Edit Distance.
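The three examples above (one substitution, one insertion, one deletion) each correspond to an edit distance of 1. The sketch below assumes the Levenshtein variant of edit distance, which counts exactly those three operations; the paper does not say which variant the processing engine uses. The helper `suggest`, its `maxDistance` parameter, and the small word list are hypothetical.

```javascript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions that turn `a` into `b`.
function editDistance(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  // prev[j] = distance between the first i-1 characters of s and the first j of t.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // delete s[i-1]
        curr[j - 1] + 1, // insert t[j-1]
        prev[j - 1] + cost // substitute, or keep a matching character
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Hypothetical helper: dictionary words within maxDistance of `word`, closest first.
function suggest(word, words, maxDistance = 1) {
  return words
    .map((w) => ({ word: w, distance: editDistance(word, w) }))
    .filter((c) => c.distance <= maxDistance)
    .sort((x, y) => x.distance - y.distance)
    .map((c) => c.word);
}

const words = ["calculator", "calcul", "calendar"];
console.log(editDistance("calculatoi", "calculator")); // 1
console.log(editDistance("calclator", "calculator")); // 1
console.log(editDistance("ccalculator", "calculator")); // 1
console.log(suggest("calclator", words)); // [ 'calculator' ]
```

Comparing the input against every dictionary word is simple but slow for a large dictionary; it is shown here only because it makes the ranking by distance explicit.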

Figure 3. Results calculated by Edit Distance for the word "calculator".

Approximately 600 new words were generated by the Edit Distance function.
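One plausible way to obtain several hundred strings from a ten-letter word is to enumerate every string that is one edit away from it. The sketch below does this under stated assumptions: the alphabet (26 basic Latin letters plus ă, â, î, ș, ț) and the function name `editsAtDistanceOne` are illustrative, and the paper does not confirm that its count was produced this way.

```javascript
// Assumed alphabet: 26 basic Latin letters plus the five Romanian diacritic letters.
const ALPHABET = Array.from("abcdefghijklmnopqrstuvwxyzăâîșț");

// All distinct strings exactly one insertion, deletion, or substitution away from `word`.
function editsAtDistanceOne(word) {
  const chars = Array.from(word);
  const results = new Set();
  for (let i = 0; i <= chars.length; i++) {
    const head = chars.slice(0, i).join("");
    const tail = chars.slice(i).join(""); // from position i onward
    const rest = chars.slice(i + 1).join(""); // skipping the character at i
    if (i < chars.length) {
      results.add(head + rest); // delete chars[i]
    }
    for (const letter of ALPHABET) {
      results.add(head + letter + tail); // insert letter at position i
      if (i < chars.length) {
        results.add(head + letter + rest); // substitute chars[i] with letter
      }
    }
  }
  results.delete(word); // replacing a letter with itself is not an edit
  return results;
}

const candidates = editsAtDistanceOne("calculator");
console.log(candidates.size);
console.log(candidates.has("calclator")); // true
console.log(candidates.has("ccalculator")); // true
```

With this alphabet the set contains several hundred strings, the same order of magnitude as the figure reported above; only those that also appear in the dictionary would be offered to the user.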

3 Check and correct grammar system

The check and correct grammar system is written in JavaScript and uses JSON data. The user inputs a sentence; the system checks each word of the sentence and associates it with its part of speech (POS) using a dictionary of Romanian POS codes. If the sentence is not correct according to Romanian grammar, the program rearranges the words at the syntactic level.

Figure 4. Architecture of correct grammar system.

The first step splits the input sentence into words and puts them into an array. The next step is to build a Romanian POS dictionary as JSON data. The MULTEXT-East [3] resources are a multilingual dataset for language engineering research and development. The dataset covers Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The MULTEXT-East project adapted existing tools and standards to those languages. The database with POS codes can be obtained from the site nlptools.info.uaic.ro [4]. It contains approximately 1.1 million words with their POS codes. For example, the word "calculator" has the code "Ncmsrn":

"N" - Noun, "c" - Common, "m" - Gender Masculine, "s" - Number Singular, "r" - Case Direct, "n" - Definiteness No.
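Because each position in the code carries one attribute, a code can be decoded by position. The sketch below covers only the noun code broken down above. The attribute names `category`, `type`, and `definiteness` are assumptions (the last one follows the MULTEXT-East noun layout), the function name `decodeNounTag` is hypothetical, and only the values named in the text are listed.

```javascript
// Attribute positions for a noun code, following the breakdown above.
// Only the values mentioned in the text are listed; any other value is returned unchanged.
const NOUN_ATTRIBUTES = [
  { name: "category", values: { N: "Noun" } },
  { name: "type", values: { c: "Common" } },
  { name: "gender", values: { m: "Masculine" } },
  { name: "number", values: { s: "Singular" } },
  { name: "case", values: { r: "Direct" } },
  { name: "definiteness", values: { n: "No" } }, // assumed reading of the last position
];

// Hypothetical helper: map each character of the code to a named attribute.
function decodeNounTag(tag) {
  const decoded = {};
  Array.from(tag).forEach((code, i) => {
    const attribute = NOUN_ATTRIBUTES[i];
    if (attribute) {
      decoded[attribute.name] = attribute.values[code] ?? code;
    }
  });
  return decoded;
}

console.log(decodeNounTag("Ncmsrn"));
```

A full decoder would need one such table per category (verbs, adjectives, and so on), since the meaning of each position depends on the first letter of the code.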

Therefore, the JSON form will be like the following:

{ "WORD": [ "POS Code", "OriginalWord", "InfinitiveWord" ] }

Here is an example of JSON data with POS code type.

{"allword":[
  {"calculator":["Ncmsrn","calculator"]},
  {"merge":["Vmip3s","merge"]},
  {"repede":["Afpfsrn","repede"]},
  {"studiat":["Afpmson","studiat","studia"]}
]}

Figure 5. POS of word in JSON.

arrSen = ["calculator", "merge", "repede"];
posSen = ["Ncmsrn", "Vmip3s", "Afpfsrn"];
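The two arrays above can be produced by splitting the sentence and looking each word up in the POS dictionary. The following sketch assumes whitespace tokenization and lowercasing, neither of which is specified in the paper; `posIndex` and `tagSentence` are hypothetical names, and the data is a small excerpt in the JSON shape used by the system.

```javascript
// POS dictionary in the JSON shape used by the system (a tiny excerpt for illustration).
const posData = {
  allword: [
    { calculator: ["Ncmsrn", "calculator"] },
    { merge: ["Vmip3s", "merge"] },
    { repede: ["Afpfsrn", "repede"] },
  ],
};

// Flatten the array of single-key objects into one Map: word -> [POS code, original word, ...].
const posIndex = new Map(
  posData.allword.flatMap((entry) => Object.entries(entry))
);

// Hypothetical helper: split a sentence on whitespace and attach a POS code to each word.
// Words that are missing from the dictionary get null. Punctuation is not handled.
function tagSentence(sentence) {
  const arrSen = sentence.trim().toLowerCase().split(/\s+/);
  const posSen = arrSen.map((word) => posIndex.get(word)?.[0] ?? null);
  return { arrSen, posSen };
}

const { arrSen, posSen } = tagSentence("Calculator merge repede");
console.log(arrSen); // [ 'calculator', 'merge', 'repede' ]
console.log(posSen); // [ 'Ncmsrn', 'Vmip3s', 'Afpfsrn' ]
```

Keeping `arrSen` and `posSen` as parallel arrays means the grammar rules can inspect the code at index i and, if needed, replace the word at the same index.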

Once a POS code has been associated with each word, the sentence is corrected according to the rules of Romanian grammar. For example, suppose the following sentence is entered in the text box: "Calculatorul merg repede". The sentence is incorrect because the subject "Calculatorul" is third person singular, while "merg" is a first-person form of the verb. Therefore, the system must find in the array the form of the same verb in the third person.

{"merg":["Vmsp1s","merge"]}
"merg" is the parent key, and "Vmsp1s" and "merge" are its child values.

{"merge":["Vmip3s","merge"]}
"merge" is the parent key, and "Vmip3s" and "merge" are its child values.

So the system can find the correct verb through the child values: the original word of "merg", namely "merge", and the POS code of the third-person form, "Vmip3s".
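The lookup described here can be sketched as two searches over the same JSON data: first find the entry for the incorrect form to get its original word, then find an entry with the same original word whose code marks the required person and number. The function name `findVerbForm` is hypothetical, and the assumption that person and number sit at the fifth and sixth positions of a verb code is read off the two codes "Vmsp1s" and "Vmip3s".

```javascript
// Entries taken from the examples in the text.
const posData = {
  allword: [
    { calculator: ["Ncmsrn", "calculator"] },
    { merg: ["Vmsp1s", "merge"] },
    { merge: ["Vmip3s", "merge"] },
    { repede: ["Afpfsrn", "repede"] },
  ],
};

// Hypothetical helper: find a verb form that shares the original word of `word`
// and carries the requested person and number in its POS code.
function findVerbForm(word, person, number) {
  const entries = posData.allword.flatMap((entry) => Object.entries(entry));
  const current = entries.find(([form]) => form === word);
  if (!current) return null;
  const originalWord = current[1][1];
  const match = entries.find(
    ([, [code, original]]) =>
      original === originalWord &&
      code.startsWith("V") && // verbs only
      code[4] === person && // assumed position of the person attribute
      code[5] === number // assumed position of the number attribute
  );
  return match ? match[0] : null;
}

console.log(findVerbForm("merg", "3", "s")); // merge
```

With this result, "Calculatorul merg repede" would be rewritten as "Calculatorul merge repede".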

4 Conclusion

The Romanian Spelling System and Grammar Checking System can check whether input words exist in the dictionary. In addition, they can transform incorrect forms of words into correct ones. These systems find where incorrect forms occur, including verbal forms, articles, nouns, pronouns, and some particles. However, there are also things that this program cannot do. In particular, the computer cannot understand the meaning of some words.

References

[1] http://corpora.heliohost.org/statistics.html#RomanianCorpus
[2] https://en.wikipedia.org/wiki/Edit_distance
[3] http://nl.ijs.si/ME/
[4] http://nlptools.info.uaic.ro/

Iamandi Veronica
Institute of Mathematics and Computer Science
Email: [email protected]