Experimental Evaluation of Deep Learning Models for Marathi Text Classification

Atharva Kulkarni¹, Meet Mandhane¹, Manali Likhitkar¹, Gayatri Kshirsagar¹, Jayashree Jagdale¹ and Raviraj Joshi²
¹Pune Institute of Computer Technology, Pune
²Indian Institute of Technology Madras, Chennai
{k.atharva4899,meetmandhanemnm,manalil1806,gayatrimohan7}@gmail.com
{jayashree.jagdale,ravirajoshi}@gmail.com

arXiv:2101.04899v2 [cs.CL] 14 Jan 2021

Abstract

The Marathi language is one of the prominent languages used in India. It is predominantly spoken by the people of Maharashtra. Over the past decade, the usage of the language on online platforms has increased tremendously. However, research on Natural Language Processing (NLP) approaches for Marathi text has not received much attention. Marathi is a morphologically rich language and uses a variant of the Devanagari script in the written form. This work aims to provide a comprehensive overview of available resources and models for Marathi text classification. We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets and present a comparative analysis. The pre-trained Marathi FastText word embeddings by Facebook and IndicNLP are used in conjunction with word-based models. We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets. We hope our paper aids focused research and experiments in the area of Marathi NLP.

1 Introduction

The Marathi language is spoken by more than 83 million people in India. In terms of the number of speakers, it ranks third in India after Hindi and Bengali. It is native to the state of Maharashtra and is also spoken in Goa and some regions of western India. Although Maharashtra leads in education and the economy, NLP research in Marathi has not received much attention. In contrast, research in Hindi has been much more significant (Arora, 2013; Akhtar et al., 2016; Joshi et al., 2016, 2019; Patra et al., 2015), followed by Bengali (Patra et al., 2018; Al-Amin et al., 2017; Pal et al., 2015; Sarkar and Bhowmick, 2017). Recently, to enable cross-lingual NLP, the focus has shifted to building single multilingual models instead of models for individual languages (Pires et al., 2019; Conneau and Lample, 2019).

English is the most widely used language globally, so the majority of NLP research is done in English. There is large scope for research in other regional languages in India. These regional languages are low-resource and morphologically rich, which is the main reason research on them has been limited (Patil and Patil, 2017).

Morphologically rich languages have greater complexity in their grammar as well as sentence structure. Grammatical relations like subject, predicate, object, etc., are indicated by changes to the words. Also, the structure of a sentence in Marathi can change without affecting its meaning. For example, "To khelayla yenar nahi", "Khelayla nahi yenar to", "To nahi yenar khelayla". The absence of large annotated datasets is a major issue, because new methods cannot be properly tested and documented, which hampers research. In this work, we are concerned with Marathi text classification. We evaluate different deep learning models for Marathi text and provide a comprehensive review of publicly available models and datasets.

Text classification is the process of categorizing text into different classes, grouped according to content (Kowsari et al., 2019). It has been used in a variety of applications, from optimizing searches in search engines to analyzing customer needs in businesses. With the increase in Marathi textual content on online platforms, it becomes important to build text processing systems for the Marathi language. The text classification module requires the application of various pre-processing techniques to the text before running the classification model. These steps include tokenization, stop word removal, and stemming words to their root form. Tokenization is a way of separating a piece of text into smaller units called tokens. The tokens can be words, characters, or subwords. Tokens and punctuation that do not contribute to classification are removed using stop word removal techniques. Stemming is the process of reducing a word to its root form. These root forms then represent the sentence or document to which they belong and are passed on to classifiers. In this work, we are not concerned with stemming and stop word removal. More recent techniques based on sub-words and neural networks implicitly mitigate problems associated with morphological variations and stop words to a large extent (Joulin et al., 2016).

This paper explores and summarizes the efficiency of Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), and Transformer based approaches on two datasets. The two datasets contrast in terms of length of records, grammatical complexity of sentences, and the vocabulary itself. We also evaluate the effect of using pre-trained FastText word embeddings and explicit sub-word embeddings on the aforementioned architectures. Finally, we evaluate pre-trained language models such as Universal Language Model Fine-tuning (ULMFiT) and multilingual variations of the Bidirectional Encoder Representations from Transformers (BERT), namely mBERT and IndicBERT, for the task of Marathi text classification (Howard and Ruder, 2018; Devlin et al., 2018). We use the publicly available ULMFiT and BERT based models and re-run the experiments on the classification datasets. The evaluation of other models and the effect of using pre-trained word embeddings on them is specific to this work and not covered in previous literature. The main contributions of this work are

• We provide an overview of publicly available classification datasets, monolingual corpora, and deep learning models useful for the Marathi language. We emphasize that Marathi is truly a very low resource language and even lacks a simple sentiment classification dataset.

• We evaluate the effectiveness of publicly available FastText word embeddings and sub-word tokenizers for Marathi text classification. We present a comparative analysis of CNN, LSTM, and Transformer based models. The analysis shows that simple CNN and LSTM based models along with FastText embeddings perform as well as currently available pre-trained multilingual BERT based models.

2 Related Work

There has been very little research done on Marathi NLP. Recently, Kakwani et al. (2020) introduced NLP resources for 11 major Indian languages, with Marathi as one of the languages. They collected sentence-level monolingual corpora for all the languages from web sources. The monolingual corpus contains 8.8 billion tokens across multiple languages. This corpus was used to pre-train word embeddings and multi-lingual language models. The pre-trained models, termed IndicBERT, are based on the compact ALBERT model. FastText embeddings were trained because FastText is better at handling morphological variants. The pre-trained models and embeddings were evaluated on text classification and generation tasks. In most of their experiments, IndicBERT models outperformed XLM-R and mBERT models. They have also created other datasets for various NLP tasks like Article Genre Classification, Headline Prediction, etc. This work is referred to as IndicNLP throughout the paper.

An NLP library for Indian languages, iNLTK, is presented in (Arora, 2020). It consists of pre-trained models, support for word embeddings, textual similarity, and other important components of NLP for 13 Indian languages including Marathi. They evaluate pre-trained models like ULMFiT and TransformerXL. The ULMFiT model and other pre-trained models are shown to perform well on small datasets as compared to raw models. Work is being done to expand the iNLTK support to other languages like Telugu, Maithili, and some code mixed languages.

Previously, Bolaj and Govilkar (2016) presented supervised learning methods and ontology-based Marathi text classification. Marathi text documents were mapped to output labels like festivals, sports, tourism, literature, etc. The steps proposed for predicting the labels were preprocessing, feature extraction, and finally applying supervised learning methods. Methods based on the Label Induction Clustering Algorithm (LINGO) to categorize Marathi documents were explored in (Vispute and Potey, 2013; Patil and Bogiri, 2015). A small custom dataset containing 100-200 documents was used for classification in the respective works.

3 Data Resources

3.1 Datasets

This section summarizes publicly available classification datasets used for experimentation.

IndicNLP News Article Dataset: This dataset consists of news articles in Marathi categorized into 3 classes viz. sports, entertainment, and lifestyle (Kakwani et al., 2020). The dataset contains 4779 records with predefined splits of the train, test, and validation sets. They contain 3823, 479, and 477 records respectively. The average length of a record is approximately 250 words.

3.3 Monolingual Corpora

Although we have not explicitly used a monolingual corpus in this work, we list the publicly available Marathi monolingual corpora for the sake of completeness. These individual corpora were used to pre-train FastText word embeddings, ULMFiT, and BERT based models by the respective authors.

Wikipedia text corpus: The Marathi Wikipedia article monolingual dataset consists of 85k cleaned articles. This is a small corpus which consists of comparatively fewer tokens.

CC-100 Monolingual Dataset: The dataset is a huge collection of crawled websites for 100+ languages (Wenzek et al., 2019). It was created by processing January-December 2018 Commoncrawl snapshots. However, for Marathi as well as most other Indian languages, the dataset consists of just about 50 million tokens each.

OSCAR: Open Super-large Crawled ALMAnaCH coRpus (OSCAR) is obtained by filtering and language classifying the Common Crawl corpus (Suárez et al., 2019). After de-duplicating all words, the size of the Marathi corpus comes to 82 million tokens.
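The corpus sizes quoted above are token counts taken after some form of de-duplication. As a rough illustration of how such a figure could be measured, the following is a minimal sketch and not the pipeline used by the OSCAR or CC-100 authors. It assumes exact line-level de-duplication and whitespace tokenization, neither of which is specified in this paper, and the function name `corpus_size` is hypothetical. The sample sentences are the transliterated Marathi examples from the introduction.

```python
from typing import Iterable, Tuple


def corpus_size(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (unique_line_count, token_count) for a corpus.

    Assumptions (not taken from the paper): duplicates are exact matches
    after whitespace normalisation, and tokens are whitespace-separated.
    """
    seen = set()
    tokens = 0
    for line in lines:
        text = " ".join(line.split())  # collapse runs of whitespace
        if not text or text in seen:
            continue  # skip empty lines and exact duplicates
        seen.add(text)
        tokens += len(text.split())
    return len(seen), tokens


sample = [
    "To khelayla yenar nahi",
    "To  khelayla yenar nahi ",  # duplicate once whitespace is normalised
    "Khelayla nahi yenar to",    # same words, different order: kept
    "",
]
unique_lines, token_count = corpus_size(sample)
print(unique_lines, token_count)  # prints: 2 8
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a large corpus does not need to be loaded into memory at once. Note that the set of seen lines still grows with the number of unique lines.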