On the Optical Character Recognition and Machine Translation Technology in Arabic: Problems and Solutions

On the Optical Character Recognition and Machine Translation Technology in Arabic: Problems and Solutions Prof. Oleg Redkin, Dr. Olga Bernikova Faculty of Asian and African Studies, St. Petersburg State University, St. Petersburg, Russia Abstract - The report addresses the basic problems of the linguistic material dates back to early 60’s 1. However just in Arabic language formalization based on analysis of linguistic 1981 John McCarthy generalized the theory to deal with the errors in software products. Reviewing the principles of conjugational system of Arabic, on the basis of an modern information systems operation the authors come to autosegmental account of vowel and consonant slots on a the conclusion that the existing methods of the Arabic central timing tier (A CV skeleton analysis of Arabic root- formalization allow to note a shift towards the technological and-pattern morphology) 2. aspects of the linguistic processing of facts, however, the The first two-level morphological analysis was carried quality of applied linguistic components still remains poor. out in 1983 by K. Koskenniemi. At that time he presented a Possibilities for the application of traditional recognition new linguistic, computationally implemented model for algorithms for Arabic are still uncertain in spite of a morphological analysis and synthesis. It was general in the significant number of theoretical and practical results in the sense that the same language independent algorithm and the field of computational linguistics . There are several problems same computer program can operate on a wide range of which are due to be solved in relation to the processing of the languages, including highly inflected ones such as Finnish, Arabic text. These issues may be divided into those related to Russian or Sanskrit. The model was based on a lexicon that Optical Recognition of the Written Text (OCR), Word defines the word roots, inflectional morphemes and certain Processing (WP) and building of the content of the nonphonological alternation patterns, and on a set of parallel dictionaries, Machine Translation (MT). rules that define phonologically oriented phenomena. The rules are implemented as parallel finite state automata, and the Keywords: NLP, Arabic, OCR, machine translation. same description can be run both in the producing and in the 1 Introduction analyzing direction 3. His system was based on the prefix- suffix method of word formation and a simple sequence of Modern tendencies of developing in the information- morphemes. The changes of the internal structure of the root, oriented society can be defined by the deep interaction of the as well as the conversion of weak roots were not considered at international industrial, technical, economic, sociological, that time. scientific and informational aspects. Globalization in every In 1996, Xerox Research Center has implemented a aspect of everyday life is primarily performed through system using a combination of automated algorithms for root language, which is the main tool of communication. structures and derivational patterns. This technology is proved Expansion of cross-cultural interaction on different levels to be very effective for the linguistic data processing, and now calls for the necessity to eliminate the so-called “language it is used in NooJ - a linguistic environment, processing the barrier”; this can be achieved through the application of the texts of a few million units in real time. It includes tools for machine translation systems and improvement of the compiling, testing and supports the bulk of lexical resources, multilingual search engine software. as well as morphological and syntactic grammars 4. However, Recently, the progress in development of quantity and our analysis of the NooJ operation shows that the proposed quality of the software applications for processing of description of the principles of the Arabic language linguistic material in Arabic became evident. However the formalization is not comprehensive and, as a result, does not available linguistic software products still abound in errors provide the sufficient accuracy of data processing. In and obvious mistakes. This largely happens due to the intention to shift the focus towards the creation of the 1Bar-Hillel, Y., “The Present Status of Automatic Translations technological solutions for the universal processing of of Languages”; Advances in Computers, Vol.1, 91-163, 2006. linguistic material, while the quality of the linguistic 2McCarthy John J. “A prosodic theory of nonconcatenative background tends to be left in a poor condition. morphology”; Linguistic Inquiry, Vol. 12, 373-418, 1981. 3Koskenniemi, Kimmo, “Two-level Morphology: A General 2 Historical background Computational Model for Word-Form Recognition and Production, Publications”, University of Helsinki, Department The commencement of the language formalization of General Linguistics, 1983. P.160. process, aimed at establishing of the automatic parsing of 4 http://www.nooj4nlp.net/pages/nooj.html. particular, insufficient quantity of verbal derivational Besides that, Slim Mesfar notes that “by inflectional algorithms was noted along with the lack of the complex description it was composed the set of all possible solutions for the root definition based on various patterns of transformations which allow us to obtain, from a lexical entry, broken plurals, distinctions in the final hamza writing when all inflected forms. These inflexional descriptions include the joining enclitic, etc. Thus, the paper “Standard Arabic mood (indicative, subjunctive, jussive or imperative), the Formalization and Linguistic Platform for its Analysis" by voice (active or passive), the gender (masculine or feminine), Slim Mesfar, describes NooJ as «system that uses finite state the number (singular, plural or dual) and the person (first, technology to parse vowelled texts, as well as partially and not second or third). On average, there are 122 inflected forms per vowelled ones. It is based on large-coverage morphological lexical entry” 6. According to our research there can be up to grammars covering all grammatical rules» 5. The author 400 forms and, taking into consideration the possibility of highlights that for Arabic, as a highly inflectional and enclitics joining for the transitive verbs, up to 780 models. derivational language, they had to define three new operators Our findings in this area are based on experience in the such as the <T> operator that checks if the last consonant development of the morphological analyzer which has the :T - Teh marbouta) to replace it with a following characteristics) ”ة “ within a noun is a .t –Teh maftouha) in some inflexional or derivational - More than 60000 rules for verbs) ”ت “ descriptions: - More than 7500 for names. - The rules do cover all kinds of roots. - Lexical content includes 20000 units. - All distinctions in hamza and tashdeed writing are taking into consideration. 3 Problems The recent developments in the field of the Information and Communication Technologies open new horizons and perspectives for scholars engaged in the field of linguistic analysis and linguistic data processing. It covers a wide range Figure 1. Special “operators” for Arabic in accordance of areas and allows to create new programs based on the with “Standard Arabic Formalization and Linguistic Platform Arabic script such as data bases, automatic translation and for its Analysis” by Slim Mesfar. automatic Arabic text recognition, search engines in the Internet, teaching tools, and, finally, Arabic-Arabic and Noting that the system contains all grammatical rules the Arabic-multilingual dictionaries. author misses a number of grammatical features of the Arabic Along with the advantages brought by the common language. For example, there are no “operators” for broken development in the field of Arabic software there is a number plurals as there is no distinction in types of plural patterns, of difficulties. One of the most important of them is the depending on the proper part of speech: proceeding of the Arabic text. In order to cope with these challenges we have to develop new principles and ways, which aim to solve the problem of the Arabic language text formalization. There are several problems which are due to be solved in relation to the processing of the Arabic text. These issues may be divided into those related to Optical Recognition of the Written Text (OCR), Word Processing (WP) and building of the content of the dictionaries, Machine Translation (MT). Unlike Latin or Cyrillic scripts, the Arabic one brings along many peculiarities, which cause many difficulties the researches confront with. Such difficulties are represented by a wide variety of Arabic fonts (the major are kuufii, deewaanii, ruq‘a, naskh, thulth, etc.) as well as by elements of individual letters, the Figure 2. Distinction in plural patterns, depending on various forms of writing of several letters depending on their the part of speech and the example of broken plural pattern . 5 Mesfar S. “Standard Arabic formalization and linguistic 6 Mesfar S. “Standard Arabic formalization and linguistic platform for its analysis”; Proceedings of the Conference The platform for its analysis”; Proceedings of the Conference The Challenge of Arabic for NLP/MT, London, England, 2006. Challenge of Arabic for NLP/MT, London, England, 2006. P.84. P. 86. position in the word, and ligatures, symbols for germination As for the paradigm of the verbal forms, in comparison and vowel signs (diacritics),

On the Optical Character Recognition and Machine Translation Technology in Arabic: Problems and Solutions

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support