Resources for Philippine Languages: Collection, Annotation, and Modeling
Total Page:16
File Type:pdf, Size:1020Kb
PACLIC 30 Proceedings Resources for Philippine Languages: Collection, Annotation, and Modeling Nathaniel Ocoa, Leif Romeritch Syliongkaa, Tod Allmanb, Rachel Edita Roxasa aNational University 551 M.F. Jhocson St., Sampaloc, Manila, PH 1008 bGraduate Institute of Applied Linguistics 7500 W. Camp Wisdom Rd., Dallas, TX 75236 {nathanoco,lairusi,todallman,rachel_roxas2001}@yahoo.com The paper’s structure is as follows: section 2 Abstract discusses initiatives in the country and the various language resources we collected; section In this paper, we present our collective 3 discusses annotation and documentation effort to gather, annotate, and model efforts; section 4 discusses language modeling; various language resources for use in and we conclude our work in section 5. different research projects. This includes those that are available online such as 2 Collection tweets, Wikipedia articles, game chat, online radio, and religious text. The Research works in language studies in the different applications, issues and Philippines – particularly in language directions are also discussed in the paper. documentation and in corpus building – often Future works include developing a involve one or a combination of the following: language web service. A subset of the “(1) residing in the place where the language is resources will be made temporarily spoken, (2) working with a native speaker, or (3) available online at: using printed or published material” (Dita and http://bit.ly/1MpcFoT. Roxas, 2011). Among these, working with resources available is the most feasible option given ordinary circumstances. Following this 1 Introduction consideration, the Philippines as a developing country is making its way towards a digital age, The Philippines is a country in Southeast Asia which highlights – as Jenkins (1998) would put it composed of 7,107 islands and 187 listed – a “technological culture of computers”. individual languages. Among these, 41 are listed Organizations and educational institutions are as institutional, 73 are developing, 45 are making resources available in the Internet. vigorous, 13 are in trouble, 11 are dying, and 4 In the Philippines, documenting languages and are already extinct1. These numbers highlight making the resources public had been realized that there is a pressing need for a databank on even before the turn of the millennium. For our Philippine languages. As highlighted in collection initiatives, we derive inspiration from literature (Dita et al., 2009; Oco and Roxas, previous works. One of the projects is IsaWika 2012), even those with high number of native (Roxas and Borra, 2000). It is an English- speakers have limited available corpora. Filipino machine translator developed in 1999 Towards addressing this scenario, we describe in that translates simple declarative sentences using this paper the collection, annotation, and the augmented transition network. Years after, modeling of various language resources. the development of the Philippine component of the International Corpus of English or ICE-PHI (Bautista, 2004) started. It contains one million 1 Ethnologue Philippine language status profile for the words of written and spoken Philippine English. Philippines: http://www.ethnologue.com/country/PH 30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30) Seoul, Republic of Korea, October 28-30, 2016 433 Source Language Number of collect these online. A tweet is a short 140- Articles character message delivered in Twitter. The Inquirer English 547 program uses Twitter 4J3 Java library and stores Manila Bulletin English 1,333 the following information in a database: Tweet Pang-Masa Filipino 576 ID; the tweet; date and time the tweet was sent; Pilipino Star Ngayon Filipino 1,013 and geolocation where the tweet was sent. Rappler English 779 The collection started last February 17, 2013 The Philippine Star English 1,011 and a total of approximately 50 million tweets Total 5,259 have been collected as of this writing. As tweets Table 1. Number of news articles collected are considered an online chronicle of events and a repository of human opinion, they have been The written component contains non-printed used to study Filipino voting behavior (Pablo et texts such as non-professional writing and al., 2014) and analyze disasters such as the correspondence; and printed texts such as typhoon Haiyan (Soriano et al., 2016). In academic writing, reportage, instructional tandem with classification techniques such as writing, persuasive and creative writing. On the sentiment analysis, tweets can be used in other hand, the transcribed texts contain prediction and to help policy makers make dialogues and monologues. One feature of the informed decisions. It should be noted that we corpus is manual annotation – there are markup only collected those that are publicly available symbols to indicate the part-of-speech, the and those whose location is set, following ethical foreign and indigenous words, and paralinguistic standards. devices. 2.2 News Articles The advent of ICE-PHI inspired the development of PALITO (Dita et al., 2009). An Aside from tweets, we also collected news online corpus developed for purposes of pursuing articles as they also represent actual language various linguistic agendas, PALITO is a usage. Also a continuation of a previous project repository for religious and literary texts written (Oco et al., 2014b), we are collecting from four in eight Philippine languages – Bikol, Cebuano, news agencies (shown in Table 1). Calibre, an 4 Hiligaynon, Ilocano, Kapampangan, open source e-book management system was Pangasinense, Tagalog, and Waray – and has a used. As of this writing, a total of 5,259 articles total word count of two million words. Aside have been collected, which are in .txt, .epub, and from written texts, another component is the .pdf formats. Filipino sign language (FSL) videos. The signs News articles provide a clear usage of the were illustrated in the form of actions and the language. In a culturomics study (Ilao et al., videos cover the alphabet, number system, basic 2011), cultural trends were studied using articles terms and expressions, and discourse. The 118 collected from more than ten Philippine tabloids FSL videos have a combined file size of 224 and a computational model for language million bytes. development was built. The size of both ICE-PHI and PALITO 2.3 Game Chat highlights one limitation if collection was done manually – the low number of resources. In our An untapped source of valuable information is initiative, we address this through automatic massively multiplayer online role-playing games means. Current statistics put the number of (or MMORPG). They provide a venue for Internet users to 44% 2 , which makes social people to interact virtually. One of the popular media and other online forms as viable sources MMORPG in the Philippines is Ragnarok. Also of data, and one popular form is Twitter, a social a continuation of a previous project (Oco et al., networking site. 2014b), we are collecting chat logs from this game using OpenKore 5 – a client and bot 2.1 Tweets program made specifically for Ragnarok. Chats For the current work, we continued the automatic are classified into four: (1) A private chat [PM], collection of tweets started in a previous project which is a message sent to the character and can (Oco et al., 2014b). A program was used to 3 http://twitter4j.org/en/index.html 2 https://telehealth.ph/2015/03/26/internet-social-media-and- 4 http://calibre-ebook.com/ mobile-use-of-filipinos-in-2015/ 5 http://www.openkore.com/index.php/Main_Page 434 PACLIC 30 Proceedings only be read by the recipient of the message; (2) ISO Code Word Count a public chat [C], which is a message sent by bik 466,096 nearby characters and can be read by other cbk 283,798 nearby players; and (3) a shout [GM] and [S], ilo 842,373 which are game-wide message and can be read pag 127,492 by everyone. pam 628,948 The chat logs can be used as training data in tgl 4,955,547 text normalization (Nocon et al., 2014b) and in Total 7,304,254 cyber bullying detection (Cheng and Ng, 2016). Table 3. Summary of Wikipedia articles The current log has 1,166,352 lines. 2.4 Wikipedia ISO Code Corpus Size (Words) Inspired by a previous study that collected ceb 1,069,713 English and Tagalog Wikipedia articles (Oco et hil 291,370 al., 2014b), we also collected Wikipedia articles ilo 1,003,392 in other Philippine languages because of its tgl 1,104,035 popularity and availability. Embodiment of Total 3,468,510 majority of the society, it is a multilingual Table 4. Corpus size of Bible editions Internet encyclopedia that is collaboratively edited by volunteers. Entire Wikipedias are 2.5 Bible Editions publicly available through XML dumps and editions in Philippine languages exist. As a form A number of religious organizations have put of cleaning, we used an XML to text converter6 efforts to translate the Bible and made them to extract entries from XML dumps of the available online in an effort to promote biblical following languages: Bikol, Chavacano teachings. One of these organizations is the Zamboangeño, Ilokano, Kapampangan, Jehovah’s Witness, which has more than 100,000 Pangasinense, and Tagalog. Table 2 shows a active members in the Philippines. It provides ePub versions of several Bible editions in their sample text in the XML dump and its converted 7 version, while Table 3 shows the word count of website . To collect these, we used Calibre to the converted text. A total of 7,304,254 words convert the files to text files. The size of each were collected. There is also an ongoing work to edition is detailed in Table 4. The collection has collect articles from the Cebuano, Hiligaynon, a total size of 3,468,510 words. Bible editions and Waray Wikipedias. Earlier projects utilized have been used as training data in translation the Tagalog Wikipedia for purposes of code- studies, language identification (Oco et al., switching point detection (Oco and Roxas, 2013a; Oco et al., 2014a; Octaviano et al., 2015), 2012).