Language Modeling with NMT Query Translation for Amharic-Arabic Cross-Language Information Retrieval

Ibrahim Gashaw
Mangalore University
Mangalagangotri, Mangalore-574199

H L Shashirekha
Mangalore University
Mangalagangotri, Mangalore-574199

D M Sharma, P Bhattacharyya and R Sangal. Proc. of the 16th Intl. Conference on Natural Language Processing, pages 56–64, Hyderabad, India, December 2019. ©2019 NLP Association of India (NLPAI)

Abstract

This paper describes our first experiment on Neural Machine Translation (NMT) based query translation for the Amharic-Arabic Cross-Language Information Retrieval (CLIR) task, which retrieves relevant documents from Amharic and Arabic text collections in response to a query expressed in the Amharic language. We used a pre-trained NMT model to map a query in the source language into an equivalent query in the target language. The relevant documents are then retrieved using a Language Modeling (LM) based retrieval algorithm. Experiments are conducted on four conventional IR models, namely Uni-gram and Bi-gram LM, Probabilistic model, and Vector Space Model (VSM). The results obtained illustrate that the proposed Uni-gram LM outperforms all other models for both Amharic and Arabic language document collections.

1 Introduction

Information Retrieval (IR) is the activity of retrieving relevant documents for information seekers from a collection of information resources such as text, images, videos, scanned documents, audio, and music. These resources can be structured, indexed, and navigated through Language Technology (LT), which includes computational methods that are specialized for analyzing, producing, modifying, and translating text and speech (Madankar et al., 2016). The increasing necessity for retrieval of multilingual documents in response to a query in any language opens up a new branch of IR called Cross-Language Information Retrieval (CLIR). Its goal is to accept the query in one language, transform it into a searchable format, and provide an interface that allows a user to search and retrieve information in different languages as per their information need (Sourabh, 2013).

The Amharic language is the official language of Ethiopia, spoken by 26.9% of Ethiopia's population as a mother tongue and by many people in Israel, Egypt, and Sweden. Arabic is a natural language spoken by 250 million people in 21 countries as a first language and serving as a second language in some Islamic countries. Ethiopia is one of the nations in which more than 33.3% of the population follow Islam, and they use the Arabic language to teach religion and for communication purposes. Arabic and Amharic belong to the Semitic family of languages, in which words are formed by modifying the root itself internally and not simply by the concatenation of affixes to word roots (Shashirekha and Gashaw, 2016).

Nowadays, it is widely used to solve CLIR problems for many language pairs. However, much of the research in this area has focused on European languages despite these languages being very rich in resources. So this study aims to develop an NMT query translation based Amharic-Arabic CLIR system.

An essential part of CLIR is mapping between query and document collections by translating either the queries into the target document language or the source documents into the target document language. We follow the first approach and translate the query words using a pre-trained NMT model. For the purpose of this translation, we have constructed a small parallel text corpus by modifying the existing monolingual Arabic text corpora and their equivalent Amharic translations available on Tanzil (Tiedemann, 2012), as Amharic-Arabic parallel text corpora are not available for the MT task.

The rest of the paper is organized as follows. CLIR approaches are discussed in Section 2. Related works are reviewed in Section 3. The proposed CLIR approach based on LM is described in Section 4. Resources and configurations of experiments for evaluating the system and the results are detailed in Section 5, followed by a conclusion in Section 6.

2 CLIR Approaches

In CLIR, the query and the document collection need to be mapped into a common representation to enable users to search and retrieve relevant documents across language boundaries (Tune, 2015). Based on the resources used to map the query and the documents in different languages, CLIR approaches can be categorized as Dictionary-based approach, Latent Semantic Indexing (LSI), Machine Translation (MT) approach, and Probabilistic-based approach (Raju et al., 2014).

2.1 Dictionary-based approaches

Dictionary-based approaches use either automatically constructed bilingual Machine Readable Dictionaries (MRD), bilingual word lists, or other lexicon resources to translate the query terms to their target language equivalents. This approach offers a relatively cheap and easily applicable solution for large-scale document collections. Due to Out of Vocabulary (OOV) words, some words in a query may not be translated. Further, linguistic phenomena such as polysemy and homonymy may introduce ambiguity in the translation of words (Shashirekha and Gashaw, 2016).

2.2 LSI approach

In the LSI approach, the documents of the source language are represented in the language-independent LSI space. Similarly, a user query can be treated as a pseudo-document and represented as a vector in the same LSI space. Even though the performance of the LSI model is on par with the traditional vector space model, the cost of computing the Singular Value Decomposition (SVD) of very large collections is high, and it makes a difference between different meanings of ambiguous terms according to their contexts of utilization (Nie, 2010).

2.3 Machine Translation approach

MT is a process of obtaining a target language text for a given source language text by using automatic techniques. MT can be used to translate the query, the document, or both into the same language, and the retrieval process can then be treated like that of a conventional IR system. However, MT systems require time and resources to develop and are still not widely or readily available for many language pairs (Madankar et al., 2016).

2.4 Probabilistic-based approaches

Probabilistic-based approaches include corpus-based methods, which translate queries, and language modeling, which avoids translation of queries.

2.4.1 Corpus-based methods

Corpus-based approaches use multilingual corpora, which can be parallel corpora or comparable corpora. In this approach, queries are translated on the basis of multilingual terms extracted from parallel or comparable document collections. While parallel corpora contain translation-equivalent texts, that is, direct translations of the same documents in different languages, comparable corpora contain texts on the same subject which are neither aligned nor direct translations of each other but composed in their respective languages independently (Tesfaye, 2010). Such corpora are available for only a few languages and are expensive to construct.

2.4.2 Language modeling approaches

A language model is a probability distribution over all possible sentences or other linguistic units in a language. While the classification of LM is not exhaustive, and a specific language model may belong to several types, LM can be categorized as uniform, finite state, grammar-based, n-gram, and Neural Language Model (NLM) (or continuous space LM), which might be feed-forward or recurrent (SWLG, 1997). Uniform LM uses the same probability for all words of the vocabulary of the sentences if the number of sentences is limited. In finite-state LM, the set of legal word sequences is represented as a finite state network (or regular grammar) whose edges stand for the words that are assigned probabilities. Grammar-based LM is based on variants of stochastic context-free grammars or other phrase structure grammars.

Data scarcity is a significant problem in building language models, as most possible word sequences will not be observed in training. One solution to this problem is continuous representations, or embeddings, of words to make their predictions, which help to alleviate the curse of dimensionality in LM. The main advantage of LM is to estimate the distribution of various natural language phenomena for language technologies such as speech, machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, etc. (Kim et al., 2016). Over-fitting (fitting random error or noise instead of the underlying relationship, indicated by a test error that is larger than the training error) is the main limitation in current LM for small size datasets (Jozefowicz et al., 2016).

in the dictionary. The lack of electronic resources such as morphological analyzers and large MRD have forced A. Argaw (2005) to spend considerable time to develop those resources themselves.

Solving the problem of word sense disambiguation will enhance the effectiveness of CLIR systems. Andres Duque et al. (2015) studied how to choose the best dictionary for Cross-Lingual Word Sense Disambiguation (CLWSD); the study focused only on English-Spanish cross-lingual disambiguation, and the disambiguation task is dependent on the coverage of the dictionary and the corpus size. Query suggestion that exploits query logs and document collections by mapping the input query in French to queries in English in the query log of a search engine, by W. Gao et al. (2007), showed a strong correspondence between the French input queries and the English queries in the log, but other language pairs may be more loosely correlated, for example, English and Amharic.
M. Al-shuaili and M. Garvalho (2016) proposed a technique to map characters automatically from different languages into English, without human intervention or prior knowledge of the language.
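
As a rough illustration of how an n-gram language model of the kind described in Section 2.4.2 can rank documents, the following Python sketch scores documents by unigram query likelihood. It is not the authors' implementation: the Jelinek-Mercer smoothing, the weight `lam`, the function names, and the toy English tokens are all assumptions made for this example. Mixing in collection statistics is one simple way to soften the data scarcity problem, since a query term missing from a document would otherwise receive zero probability.

```python
import math
from collections import Counter


def unigram_query_likelihood(query_tokens, doc_tokens,
                             collection_counts, collection_len, lam=0.5):
    """Return log P(query | document) under a smoothed unigram LM.

    Assumption for this sketch: Jelinek-Mercer smoothing, i.e.
    P(t | d) = lam * tf(t, d) / |d| + (1 - lam) * cf(t) / |C|.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: the query is impossible.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query_tokens, docs, lam=0.5):
    """Rank documents (dict: id -> token list) by query likelihood."""
    collection_counts = Counter(t for tokens in docs.values() for t in tokens)
    collection_len = sum(collection_counts.values())
    scored = [
        (doc_id, unigram_query_likelihood(query_tokens, tokens,
                                          collection_counts,
                                          collection_len, lam))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection; in a CLIR setting the query would first be
# translated into the language of the documents.
docs = {
    "d1": ["cross", "language", "retrieval"],
    "d2": ["neural", "machine", "translation"],
}
ranking = rank(["language", "retrieval"], docs)
```

Here `ranking` lists `d1` ahead of `d2`, because both query terms occur in `d1` while `d2` is scored only through the collection-level probabilities. A bi-gram variant would condition each term on its predecessor and would need more aggressive smoothing.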