Selection of XML Tag Set for Myanmar National Corpus

The 6th Workshop on Asian Languae Resources, 2008 Selection of XML tag set for Myanmar National Corpus Wunna Ko Ko Thin Zar Phyo AWZAR Co. Myanmar Unicode and NLP Research Mayangone Township, Yangon, Center Myanmar Myanmar Info-Tech, Hlaing Campus, [email protected] Yangon, Myanmar [email protected] Myanmar Language Commission and a number Abstract of scholars had been collected a number of corpora for their specific uses [Htay et al., 2006]. But there In this paper, the authors mainly describe is no national corpus collection, both in digital and about the selections of XML tag set for non-digital format, until now. Myanmar National Corpus (MNC). MNC Since there are a number of languages used in will be a sentence level annotated corpus. Myanmar, the national level corpus to be built will The validity of XML tag set has been include all languages and scripts used in Myanmar. tested by manually tagging the sample data. It has been named as Myanmar National Corpus or MNC, in short form. Keywords: Corpus, XML, Myanmar, During the discussion for the selection of Myanmar Languages format for the corpus, XML (eXtensible Markup 1 Introduction Language), a subset of SGML (Standard Generalized Markup Language), format has been Myanmar (formerly known as Burma) is one of the chosen since XML format can be a long usable and South-East Asian countries. There are 135 ethnic possible to keep the original format of the text groups living in Myanmar. These ethnic groups [Burnard. 1996]. The range of software available speak more than one language and use different for XML is increasing day by day. Certainly more scripts to present their respective languages. There and more NLP related tools and resources are are a total of 109 languages spoken by the people produced in it. This in turn makes the necessity of living in Myanmar [Ethnologue, 2005]. selection of XML tag set to start building of MNC. There are seven major languages, according to MNC will include not only written text but also the speaking population in Myanmar. They are spoken texts. The part of written text will include Kachin, Kayin/Karen, Chin, Mon, Burmese, regional and national newspapers and periodicals, Rakhine and Shan [Ko Ko & Mikami, 2005]. journals and interests, academic books, fictions, Among them, Burmese is the official language and memoranda, essays, etc. The part of spoken text spoken by about 69% of the population as their will include scripted formal and informal mother tongue [Ministry of Immigration and conversations, movies, etc. Population, 1995]. During the selection of XML tag sets, the Corpus is a large and structured set of texts. sample for all the data which will be included in They are used to do statistical analysis, checking building of MNC, has been learnt. occurrences or validating linguistic rules on a specific universe.1 2 Myanmar National Corpus In Myanmar, there are a plenty of text for most Myanmar is a country of using 109 different of the languages, especially Burmese and major languages and a number of different scripts languages, since stone inscription. [Ethnologue, 2005]. In order to do language processing for these languages and scripts, it 1 http://en.wikipedia.org/wiki/Text_corpus becomes a necessity to build a corpus with 33 The 6th Workshop on Asian Languae Resources, 2008 languages and scripts used in Myanmar; at least great benefit about XML is that the document itself with major languages and scripts, which will describes the structure of data. 2 include almost all areas of documents. Three characteristics of XML distinguish from Among the different scripts used in Myanmar, other markup languages:3 the popular scripts include Burmese script (a • its emphasis on descriptive rather than Brahmi based script), Latin scripts. Building of procedural markup; MNC will be helpful for development of Natural Language Processing (NLP) tools (such as • its notion of documents as instances of a grammar rules, spelling checking, etc) and also for document type and linguistic research on these languages and scripts. Moreover, since Burmese script is written without • its independence of any hardware or necessarily pausing between words with spaces, software system. the corpus to be built is hoped to be useful for Since MNC is to be built in XML based format, developing tools for automatic word segmentation. the selection process for tag set of XML become an important process. The XML tagged corpus data 2.1 XML based corpus should also keep the original format of the data. XML is universal format for structured documents In order to select XML tag set for MNC, the and data, and can provide highly standardized sample data for the corpus has to be collected. The representation frameworks for NLP (Jin-Dong format of the sample corpus data has been studied KIM et al. 2001); especially, the ones with for the selection of the XML tag set in appropriate annotated corpus based approaches, by providing with the data format. them with the knowledge representation frameworks for morphological, syntactic, 2.2 Structure of a data file at MNC semantics and/or pragmatics information structure. The structure of a data file at MNC will include Important features are: two main parts: information of the corpus file and • XML is extensible and it does not consist the corpus data. of a fixed set of tags. The first part, the header part of a corpus file, describes the information of a corpus file. The • XML documents must be well-formed information of the corpus file includes the header according to a defined syntax. which will provide sensible use of the corpus • XML document can be formally validated information in machine readable form. In this part, against a schema of some kind. the information such as language usage and the description of the corpus file will be included. • XML is more interested in the meaning of The second part, the document part, of a corpus data than its presentation. file will include the source description of the The XML documents must have exactly one corpus data and the corpus data, the written or top-level element or root element. All other spoken part of the text, itself. The information of elements must be nested within it. Elements must the corpus data such as bibliographic information, be properly nested [Young, 2001]. That is, if an authorship, and publisher information will be element starts within another element, it must also included in this section. Moreover, the corpus data end within that same element. itself will also be included in this section. Each element must have both a start-tag and an The hierarchically structure of a corpus file at end-tag. The element type name in a start-tag must MNC will be as shown in figure 1. exactly match the name in the corresponding end- tag and element name are case sensitive. Moreover, the advantages of XML for NLP includes ontology extraction into XML based structured languages using XML Schema. The 2 http://www.tei-c.org/P5/Guidelines/index.html 3 http://www.w3.org/TR/xml/ 34 The 6th Workshop on Asian Languae Resources, 2008 Myanmar National Corpus Data File Header Document Language File Source Written or Usage Description Description Spoken Texts Figure 1. Hierarchically structure of a data file at MNC 6 3 Selection of necessary XML tag set for the text encoding and Interchange . TEI encoding scheme consists of a number of rules After studying original formats and features of with which the document has to adhere in order texts, to be used in corpus, and the structure of to be accepted as a TEI document. the corpus file has been determined, the This header part contains language usage of selection procedure for XML tag set has been the data file <langUsage> and the file started. description <fileDesc> which includes machine British National Corpus (BNC)4, American readable information of the data file. 5 National Corpus (ANC) had been referenced -<mnc> for selection of XML tag set. -<teiHeader> The selection of XML tag set is based on the +<langUsage></langUsage> nature of the structure of a data file. The main +<fileDesc></fileDesc> tag for the data file will be named as <mnc> </teiHeader> which is the abbreviation of Myanmar National +<myaDoc></myaDoc> Corpus. </mnc> A data file contains two main parts, the Figure 3. Element and Child tags of MNC header part and the document part. -<mnc> The language usage part contains such +<teiHeader></teiHeader> information as language name <langName>, +<myaDoc></myaDoc> script information <script>, International </mnc> Organization for Standardization (ISO) code Figure 2. Root and element tags of MNC number <ISO>, encoding information <encodingDesc> and version of encoding 3.1 Header Part <version>. The XML tag for the header part of the corpus -<mnc> data file is named as <teiHeader>. Text -<teiHeader> Encoding Initiative (TEI) published guidelines <langUsage> <langName> </langName> <script> </script> <ISO></ISO> 4 Lou Burnard. 2000. Reference Guide for the British <encodingDesc> </encodingDesc> National Corpus (World Edition). Oxford University Computing Services, Oxford. 5 Nancy Ide and Keith Suderman. 2003. The American National Corpus, first Release. Vassar College, 6 TEI Consortium. 2001, 2002 and 2004 Text Encoding Poughkeepsie, USA Initiative. In The XML Version of the TEI Guidelines. 35 The 6th Workshop on Asian Languae Resources, 2008 <version> </version> bibliographic information, such as title, name of </langUsage> author, publisher, etc., of the original data. +<fileDesc></fileDesc> <mnc> </teiHeader> +<teiHeader></teiHeader> +<myaDoc></myaDoc> -<myaDoc> </mnc> -<sourceDesc> Figure 4. 2nd level Child tags in language -<bibl> Usage part of MNC <title></title> <author></author> The file description part contains such <editor/></editor> information as title information of the corpus file -<imprint> <titleStmt>, edition information <editionStmt> <publisher></publisher> and publication information about the corpus file <pubPlace></pubPlace> <publicationStmt>.

Selection of XML Tag Set for Myanmar National Corpus

Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress

Representing Myanmar in Unicode Details and Examples Version 3

Character Properties 4

Globalization Support Oracle Unicode Database Support

The Unicode Standard, Version 3.0, Issued by the Unicode Consor- Tium and Published by Addison-Wesley

Some Properties of Burmese Script H1

Overview and Rationale

Myanmar Range: 1000–109F

Myanmar Extended-B Range: A9E0–A9FF

Section 13.4

Rule-Based Pāḷi Romanization System for Myanmar Language

Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress