ON CREATION OF DOGRI LANGUAGE CORPUS


JOURNAL OF CRITICAL REVIEWS, ISSN 2394-5125, VOL 7, ISSUE 09, 2020

Sonam Gandotra¹, Bhavna Arora²
¹Ph.D, Department of Computer Science & IT, Central University of Jammu
²Assistant Professor, Department of Computer Science & IT, Central University of Jammu

Abstract

The prerequisite for any Natural Language Processing (NLP) task is a corpus, defined as a large collection of structured text. Dogri is one of the official languages of India but is under-resourced in terms of the computational resources needed for NLP tasks. This paper proposes a methodology for constructing a standard corpus that can be used for language processing tasks such as stemming, part-of-speech tagging and information retrieval. Digitized Dogri text is scarce because few online resources contain it; the only available online source is the Dogri newspaper "Jammu Prabhat". The text is therefore extracted from the portable document format (PDF) files of that newspaper, which are first converted to images. An open-source tool, Tesseract, is used to extract the text from the images. The methodology used for creating the Dogri corpus is discussed in detail, along with the challenges faced during the research and the results obtained.

Index Terms: Corpus, Tesseract-OCR, Dogri Language, Language Resource.

I. INTRODUCTION

Language is one of the fundamental traits of human behavior, enabling us to express ourselves and communicate with people across the world. It can be in either spoken or written form [1]. With the advancement of technology, researchers are trying to build tools that can understand natural language and make technology more accessible to people. Natural Language Processing (NLP) has been an active area of research with the objective of making computers able to process, understand, summarize, translate and draw inferences from human natural text and language. Various linguistic resources are required for carrying out diverse NLP tasks such as machine translation, automatic text summarization, sentiment analysis, named entity recognition and parsing. These linguistic resources mainly deal with the processing of text in written form, i.e. performing processes like tokenization, stemming and parsing. Owing to progress in research, a wide number of languages are being digitized and efficient tools have been developed for them [2]–[5]. Work on Indian languages, however, is still at a progressive phase because of the variability of the Indian languages and their complex structure. The linguistic resources required for developing NLP tools are also limited. Researchers are constantly working on providing the basic tools required for processing text in digital form so that better results can be achieved [6]–[8].

In India, there are 22 official languages as defined in the Eighth Schedule of the Indian Constitution [9], and Dogri is one of them. Dogri belongs to the class of Indo-Aryan languages and is spoken by about 5 million people in India and Pakistan, with its dominance mostly in the state of Jammu & Kashmir and parts of Himachal Pradesh and northern Punjab [10]. Dogri was originally written in the Dogri script [11], which resembles the Takri script, the script of the royals of the Himachal region, but the language is now written in the Devanagari script. Dogri is still in a developing state, as it has not been widely explored in computational linguistics and natural language processing. This has resulted in a scarcity of the online resources required for processing its text.

Dogri is the native language of the people of J&K, and in particular of the people of the Jammu region. The digitization of this language for the creation of a corpus was taken up because the researcher hails from this region. In this era of digitalization, it is important to take every small step that helps smaller languages keep pace with the global forum. The creation of a Dogri corpus will enable more researchers to develop NLP tools for the Dogri language, which will further advance research on the language.

A corpus is defined as a large collection of data, which can be in either written or spoken form. It is machine-readable and is used for linguistic research. A standard corpus is required for developing the tools needed to process a language. Researchers use various benchmark corpora for NLP tasks in different domains, such as the Brown Corpus [12], DUC (Document Understanding Conferences) [13], TAC (Text Analysis Conference) [14], the CNN dataset and the Reuters dataset. These corpora contain tokens and human-generated abstractive as well as extractive summaries required for evaluating a proposed summarization technique. They can be used for both single- and multi-document summarization. These datasets are mainly developed for English, Chinese and other European languages. Indian languages are less explored than foreign languages because the resources required for processing their text are lacking. This scenario is changing rapidly through the efforts of the research community, as various linguistic resources are being developed that are on par with those developed for foreign languages. A few examples of resources for text are Parallel Corpora, Ontology & Word-Net, Online Vishavkosh and Text Corpora [15], [16]; for speech, examples are Speech Corpora and a semi-automatic annotation tool for Speech Corpora [17].

This paper aims at developing a corpus for the Dogri language. The created corpus contains news items covering the domains of sports, education, politics and entertainment from the only Dogri newspaper, "Jammu Prabhat" [18]. The newspaper was chosen in order to cover the maximum number of words from all domains. The data is not available in digital form, and the only available resource is images of the text. In this work, therefore, an open-source tool, Tesseract [19], is used for extracting the text and building a corpus from the available resources.

The paper begins with an introduction to the linguistic resources required for performing NLP tasks. A brief introduction to the Dogri language, along with the status of ongoing NLP research for Dogri, is presented in Section I. The background work previously carried out for under-resourced languages and the different techniques used by researchers to extract text from images are discussed in Section II. Section III describes the proposed methodology for creating the Dogri corpus and the need for its development. Section IV gives a detailed explanation of the methodology for corpus creation. The challenges faced during creation of the corpus are presented in Section V. Section VI discusses the statistics obtained during extraction of text from the newspaper and book PDFs. Finally, Section VII concludes the paper and states the open challenges for further research in this area.

II. BACKGROUND

Identifying a corpus, or obtaining one in digitized form, is no longer a challenge for common languages with widespread usage and implementation. Large numbers of resources are freely available for them, and researchers are now concentrating on improving the techniques that can be applied to enhance the quality of research. Under-resourced languages are deprived of basic resources such as corpora, annotators, part-of-speech taggers, stemmers and named-entity lists. These resources provide a base for developing efficient tools and applications that boost NLP research. Over the years, researchers have developed tools and techniques for extracting text from images, and they are also working on developing the basic resources for under-resourced languages. A few of these tools and basic resources are discussed below.

Kumar [20] used Tesseract-OCR for recognizing Gujarati characters. An already available trained Gujarati script was used for training the OCR. The results were checked using different font sizes, font styles, etc., and the mean confidence came out to be 86%. Researchers have also proposed various neural network techniques for identifying text in images. Mathew [21] identified word-level script with an end-to-end Recurrent Neural Network (RNN) based architecture that is able to detect and recognize the script and text of 12 Indian languages. Two RNNs were used, each containing 3 levels of 50 Long Short-Term Memory (LSTM) nodes; one RNN is used for script identification and the other for recognition. Smitha [22] described the whole process of extracting text from document images. The process is executed with a combination of Tesseract-OCR and imagemagick, and the results are compared with those of OCRopus, which is also an open-source OCR developed for the English language. The results of the Tesseract-OCR and imagemagick combination are more promising than those of OCRopus.

Sankaran [23] proposed a recognition system for the Devanagari script using an RNN known as Bidirectional Long Short-Term Memory (BLSTM). The approach eliminates the need for word-to-character segmentation, reducing the word error rate by 20% and the character error rate by 9% compared with the best OCR available for the Devanagari script. In [24], Chakraborty developed a system for visually impaired persons in which the text from images is extracted and then converted to a 6-dot cell Braille format.

[...] first created the corpus from local newspaper and then