Improving Wordnets for Under-Resourced Languages Using Machine Translation

Bharathi Raja Chakravarthi, Mihael Arcan, John P. McCrae
Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland

Abstract

Wordnets are extensively used in natural language processing, but the current approaches for manually building a wordnet from scratch involve large research groups for a long period of time, which are typically not available for under-resourced languages. Even if wordnet-like resources are available for under-resourced languages, they are often not easily accessible, which can alter the results of applications using these resources. Our proposed method presents an expand approach for improving and generating wordnets with the help of machine translation. We apply our methods to improve and extend wordnets for the Dravidian languages, i.e., Tamil, Telugu and Kannada, which are severely under-resourced languages. We report evaluation results of the generated wordnet senses in terms of precision for these languages. In addition, we carried out a manual evaluation of the translations for the Tamil language, where we demonstrate that our approach can aid in improving wordnet resources for under-resourced Dravidian languages.

1 Introduction

As computational activities and the Internet create a wider multilingual and global community, under-resourced languages acquire political as well as economic interest to develop Natural Language Processing (NLP) systems for these languages. In general, creating NLP systems requires an extensive amount of resources and manual effort; however, under-resourced languages lack both.

Wordnets are lexical resources which provide a hierarchical structure based on synsets (a set of one or more synonyms) and semantic features of individual words. Wordnets can be constructed by either the merge or the expand approach (Vossen, 1997). Princeton WordNet (Miller, 1995; Fellbaum, 2010) was manually created at Princeton University and covers the vocabulary of the English language only. Then, based on the Princeton WordNet, wordnets for several languages were created. As an example, EuroWordNet (Vossen, 1997) is a multilingual lexical database for several European languages, structured in the same way as Princeton's WordNet. The MultiWordNet (Pianta et al., 2002) is strictly aligned with Princeton WordNet and allows access to senses in Italian, Spanish, Portuguese, Hebrew, Romanian and Latin. Many others have followed for different languages. The IndoWordNet (Bhattacharyya, 2010) was compiled for eighteen out of the twenty-two official languages of India and made available for public use. It is based on the expand approach like EuroWordNet, but from the Hindi wordnet, which is then linked to English. On the Global WordNet Association website (http://globalwordnet.org/), a comprehensive list of wordnets available for different languages can be found, including IndoWordNet and EuroWordNet.

This paper describes the effort towards generating and improving wordnets for the under-resourced Dravidian languages. Since studies (Federico et al., 2012; Läubli et al., 2013; Green et al., 2013) have shown significant productivity gains when human translators post-edit machine translation output rather than translating text from scratch, we use the available parallel corpora from multiple sources, like OPUS (http://opus.lingfil.uu.se/), to create a machine translation system to translate the wordnet senses in the Princeton WordNet into the mentioned under-resourced languages. Translation tools such as Google Translate (http://translate.google.com/), or open source SMT systems such as Moses (Koehn et al., 2007) trained on generic data, are the most common solutions, but they often result in unsatisfactory translations of domain-specific expressions. Therefore, we follow the idea of Arcan et al. (2016b), where the authors automatically identify relevant sentences in English containing the WordNet senses and translate them within the context, which showed translation quality improvement of the targeted entries. The effectiveness of our approach is evaluated by comparing the generated translations with the IndoWordNet entries, both automatically and manually. This paper reports our first outcomes in improving wordnets for under-resourced Dravidian languages such as Tamil (ISO 639-2: tam), Telugu (ISO 639-2: tel) and Kannada (ISO 639-2: kan).

2 Related work

Scannell (2007) describes the start of the creation of a resource for the Irish language using the Web as a resource for NLP approaches. Since 2000, the author and his collaborators have developed many resources, such as monolingual corpora, bilingual corpora and parsers, for many under-resourced languages, but they did not cover all languages in the world. A six-level typology of languages was proposed by Alegria et al. (2011). According to the authors, except for the top ten languages in the world, all the other languages are under-resourced languages. The third and fourth level languages are the languages which have some resources on the internet. This six-level typology is a relative definition of an under-resourced language, but it can still be useful for our study of under-resourced languages.

IndoWordNet covers official Indian languages from three major families: Indo-Aryan, Dravidian and Sino-Tibetan languages. In general, Indian languages are rich in morphology and each of the three language families has a different morphological structure. It was compiled for eighteen out of the twenty-two official languages and made publicly available (http://www.cfilt.iitb.ac.in/indowordnet/index.jsp). Similarly to EuroWordNet it is based on the expand approach, but the central language is Hindi, which is then linked to English. The IndoWordNet entries are updated frequently. For the Tamil language, Rajendran et al. (2002) proposed a design template for the Tamil wordnet. In their further work (Rajendran et al., 2010), they emphasize the need for an independent wordnet for the Dravidian languages, based on EuroWordNet. This is due to the observation that the morphology and lexical concepts of these languages are different compared to other Indian languages. The authors have combined the Tamil wordnet and wordnets in other Dravidian languages to form the IndoWordNet.

Mohanty et al. (2017) built SentiWordNet for the Odia language, which is one of the official languages of India. Being an under-resourced language, Odia lacks a proper machine translation system to translate the vocabulary of the available resource from English into Odia. The authors have created SentiWordNet for Odia using resources of other Indian languages and the IndoWordNet. Although the IndoWordNet structure does not map directly to SentiWordNet, the synsets are matched instead. The authors used these for translation from source lexicon to target lexicon. Aliabadi et al. (2014) have created a wordnet for the Kurdish language, one of the under-resourced languages in the western Iranian language family. They have created Kurdish translations for the "core" wordnet synsets (Vossen, 1997), which is a set of 5,000 essential concepts. They used a dictionary to translate its literals (words) and adopted an indirect evaluation alternative in which they look at the effectiveness of using KurdNet for rewriting Information Retrieval queries. Similarly, the work by Horváth et al. (2016) focuses on the semi-automatic construction of a wordnet for the Mansi language, which is spoken by the Mansi people in Russia, an endangered under-resourced language with a low number of native speakers. The authors have used the Hungarian wordnet as a starting point. With the help of a Hungarian-Mansi dictionary, which was used to create possible translations between the languages, the Mansi wordnet was continuously expanded.

Previous work required considerable manual effort to create wordnet-like resources, which was funded by public research for a long period of time. However, IndoWordNet is incomplete and biased towards Hindi, because the authors created a Hindi-Tamil bilingual dictionary rather than a wordnet. As explained in Rajendran et al. (2010), the morphology and lexical concepts of Dravidian languages are different from Hindi, which illustrates that the IndoWordNet may not be the most suitable resource to represent the wordnet for the targeted Dravidian languages.

To evaluate and improve the wordnets for the targeted Dravidian languages, we follow the approach of Arcan et al. (2016b), which uses the existing translations of wordnets in other languages to identify contextual information for wordnet senses from a large set of generic parallel corpora. We use this contextual information to improve the translation quality of WordNet senses. We show that our approach can help overcome drawbacks of simple translations of words without context.

3 Background

Our specific aim of this work is to generate and improve wordnets for under-resourced languages. For our task we chose the expand approach and automatically translated the Princeton WordNet entries within a disambiguated context to obtain en[...]

[...] script is used to write other under-resourced languages like Tulu, Konkani and Sankethi. In the Kannada language, the derivation of words is either by combining two distinct words or by affixes. Unlike Tamil, Kannada and Telugu inherit some of their affixes from Sanskrit.

3.2 Machine Translation

Statistical Machine Translation (SMT) systems assume that we have a set of example translations $(S^{(k)}, T^{(k)})$ for $k = 1, \ldots, n$, where $S^{(k)}$ is the $k$th source sentence and $T^{(k)}$ is the $k$th target sentence, which is the translation of $S^{(k)}$ in the corpus. SMT systems try to maximize the conditional probability $p(t \mid s)$ of target sentence $t$ given a source sentence $s$ by maximizing separately a language model $p(t)$ and the inverse translation model $p(s \mid t)$.
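
The decomposition into a language model and an inverse translation model follows from Bayes' rule: because p(s) is constant for a given source sentence, choosing the t that maximizes p(t | s) is the same as choosing the t that maximizes p(t) * p(s | t). The selection step can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `noisy_channel_best`, the candidate strings and all probability values are hypothetical toy inputs, not taken from the paper or from any SMT toolkit.

```python
import math


def noisy_channel_best(source, candidates, lm_prob, tm_prob):
    """Return the candidate t that maximizes log p(t) + log p(source | t).

    lm_prob maps a target candidate t to p(t)            (language model).
    tm_prob maps (source, t) to p(source | t)            (inverse translation model).
    Both tables are hypothetical toy inputs for illustration.
    """
    def score(t):
        # Sum of logs equals the log of the product p(t) * p(source | t).
        return math.log(lm_prob[t]) + math.log(tm_prob[(source, t)])

    return max(candidates, key=score)


# Made-up probabilities for two candidate translations of one source string.
lm_prob = {"river bank": 0.02, "money bank": 0.05}
tm_prob = {("src", "river bank"): 0.6, ("src", "money bank"): 0.1}

best = noisy_channel_best("src", list(lm_prob), lm_prob, tm_prob)
print(best)
```

Scores are summed in log space rather than multiplied, a common choice that avoids numeric underflow when many small probabilities are combined.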