I/ETS: Indonesian-English Machine Translation System using Collaborative P2P Corpus

Hammam Riza, Budiono, Adiansya Prasetya and Henky Mulyadi

Science and Technology Network Information Center (IPTEKnet) Agency for the Assessment and Application of Technology (BPPT), [email protected]

Abstract. This paper presents preliminary results in developing a bidirectional Indonesian-English machine translation system using open source software and a Creative Commons corpus. We describe our method, starting with the corpus collection process, followed by corpus processing and the software system for translation. The corpus is developed through a collaborative P2P development framework, a collective intelligence approach to building an Indonesian-English parallel text. We further describe the components of the translation system, which combines symbolic and statistical techniques in a hybrid design.

1. Introduction

In the era of globalization, communication across languages has become much more important. People have long hoped that natural language processing and speech processing, which are part of ICT (Information and Communication Technology), can help smooth communication among people who speak different languages. For the Indonesian language in particular, however, there has been little research in the past.

Because no large corpus is available and such a corpus is crucial, the first phase of this project is to build a large bilingual Indonesian-English corpus. We use a collective intelligence approach to build this corpus, which in turn is used to build the modules of the hybrid symbolic-statistical Machine Translation (MT) system.

2. System Components

Building a statistical machine translation system involves two main components, the parallel corpus and the statistical translation engine, and both are crucial. The symbolic modules form an additional supporting component. In the following sections, we describe each component in detail.

2.1 Collaborative P2P Corpus

To obtain a reliable corpus, one of the main components in MT development, we build a database of bilingual sentences from various sources using a collective intelligence approach. We call this a collaborative P2P (peer-to-peer) approach to corpus building. The corpus is built by an online community of news agencies, newspaper journalists, reporters, bloggers and media publishers who cooperate and contribute either documents in Indonesian or collections of bilingual Indonesian-English sentence pairs (see Figure 1).

We classify the material into several domains: National, International, Sport, Economy, and Science & Technology. The choice of documents (genre) is affected by the intended coverage of the linguistic resources; we further categorize each document by style as News or Article.

The next step is collaborative translation and cross-evaluation of the contributed documents, especially with respect to the adequacy and fluency of the translations. We also observe the intellectual property rights (IPR) attached to the collection. We seek Creative Commons materials from all document sources, which enables us to develop an open source framework for both the data and the MT application. In the end, we have the following set of corpora:

- Monolingual Indonesian corpus (Bahasa Indonesia)
- Monolingual English corpus
- Sentence-aligned parallel corpus (Indonesian and English)

The targeted parallel corpus will contain 1 million sentence pairs (approximately 20 million words). The current collection holds about 282,000 sentences, whose sources by genre and domain are given in Table 1.
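The contribution and cross-evaluation workflow described above can be sketched as a small record type. This is only an illustration under stated assumptions: the paper does not specify a rating scale or an acceptance rule, so the 1-5 scale, the `min_reviews` and `threshold` parameters, and the `SentencePair` class itself are hypothetical. The domain and genre labels follow Table 1.

```python
from dataclasses import dataclass, field
from statistics import mean

# Domain and genre labels as they appear in Table 1.
DOMAINS = {"National", "International", "Sport", "Economy", "Science & Technology"}
GENRES = {"News", "Article"}


@dataclass
class SentencePair:
    """One contributed Indonesian-English sentence pair (hypothetical schema)."""
    indonesian: str
    english: str
    domain: str
    genre: str
    # Each peer review is an (adequacy, fluency) pair on an assumed 1-5 scale.
    reviews: list = field(default_factory=list)

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        if self.genre not in GENRES:
            raise ValueError(f"unknown genre: {self.genre}")

    def add_review(self, adequacy, fluency):
        """Record one peer's adequacy and fluency judgement."""
        if not (1 <= adequacy <= 5 and 1 <= fluency <= 5):
            raise ValueError("scores must be between 1 and 5")
        self.reviews.append((adequacy, fluency))

    def accepted(self, min_reviews=2, threshold=4.0):
        """Accept the pair once enough peers rate it highly (assumed rule)."""
        if len(self.reviews) < min_reviews:
            return False
        adequacy = mean(a for a, _ in self.reviews)
        fluency = mean(f for _, f in self.reviews)
        return adequacy >= threshold and fluency >= threshold


pair = SentencePair("Saya membaca buku itu.", "I read that book.",
                    domain="National", genre="News")
pair.add_review(adequacy=5, fluency=4)
pair.add_review(adequacy=4, fluency=4)
print(pair.accepted())
```

Requiring at least two independent reviews before acceptance is one simple way to model cross-evaluation; a real deployment might weight reviewers or track disagreement instead.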

Figure 1. P2P Collaborative Schema

Table 1. Corpus Sources: Domain and Genre

Domain               | Genre: News               | Genre: Article
Economy              | id.wikipedia.org          | id.wikipedia.org
                     | detikfinance.com          | www.ekonomirakyat.org
                     | mediaindonesia.com        | web.bisnis.com/artikel/
                     | economy.okezone.com       | beritasore.com
                     | kompas.com/bisniskeuangan | beritasore.com
Sport                | id.wikipedia.org          | id.wikipedia.org
                     | detiksport.com            | arsip.info/olahraga
                     | mediaindonesia.com        | www.pbdjarum.com
                     | sports.okezone.com        | www.koni.or.id
                     | kompas.com/olahraga       | arsip.info/olahraga
National News        | lampungpost.com           | beritasore.com
                     | tempo.co.id/nasional      | www.transparansi.or.id
                     | liputan6.com              | www.jangkar.org
                     | news.okezone.com          | arsip.info/hiburan
                     | kompas.com/nasional       | www.legalitas.org
International News   | lampungpost.com           | www.link.or.id
                     | tempo.co.id/internasional | www.kompas.com
                     | internasional.okezone.com | artrenewal.org/articles/articles.asp
                     | kompas.com/internasional  | www.belajargratis.web.id
                     | liputan6.com/luarnegri    | www.tempo.co.id
Science & Technology | techno.okezone.com        | www.dikti.org
                     | tekno.kompas.com          | www.tempo.co.id
                     | mediaindonesia.com        | arsip.info/teknologi
                     | techno.okezone.com        | www.dikti.org
                     | tekno.kompas.com          | www.tempo.co.id

Table 2. Parallel Corpus Collection Statistics (Current Status)

Corpus name        | Sources             | Sum of sentences | Sum of English words | Sum of Indonesian words
ANTARA I           | Antara              |          157,490 |            3,604,181 |               3,323,362
BTEC               | BPPT                |          106,698 |              736,615 |                 685,175
ANTARA II          | Antara News Agency  |           14,000 |              277,200 |                 267,400
Science Technology | Internet & eBook    |            4,308 |               51,201 |                  43,519
Total              |                     |          282,496 |            4,669,197 |               4,319,456
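As a quick consistency check on the statistics in Table 2, the following sketch recomputes the totals, the average sentence length per collection, and the progress toward the 1 million sentence-pair target stated in Section 2.1. The figures are copied from the table; the function names and the `TARGET_PAIRS` constant are introduced here only for illustration.

```python
# (sentences, English words, Indonesian words) per collection, from Table 2.
CORPORA = {
    "ANTARA I": (157_490, 3_604_181, 3_323_362),
    "BTEC": (106_698, 736_615, 685_175),
    "ANTARA II": (14_000, 277_200, 267_400),
    "Science Technology": (4_308, 51_201, 43_519),
}
TARGET_PAIRS = 1_000_000  # target size stated in Section 2.1


def totals(corpora):
    """Sum sentences, English words, and Indonesian words over all collections."""
    sentences = sum(row[0] for row in corpora.values())
    english = sum(row[1] for row in corpora.values())
    indonesian = sum(row[2] for row in corpora.values())
    return sentences, english, indonesian


def words_per_sentence(corpora):
    """Average (English, Indonesian) sentence length for each collection."""
    return {name: (en / n, ind / n) for name, (n, en, ind) in corpora.items()}


sentences, english, indonesian = totals(CORPORA)
print(sentences, english, indonesian)
print(f"{sentences / TARGET_PAIRS:.1%} of the target collected")
for name, (en_len, id_len) in words_per_sentence(CORPORA).items():
    print(f"{name}: {en_len:.1f} EN / {id_len:.1f} ID words per sentence")
```

The recomputed sums agree with the Total row of Table 2. The per-sentence averages also show that the BTEC collection consists of much shorter sentences than the news collections.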

2.2 Statistical Machine Translation

To test the feasibility of statistical MT for Indonesian, we built a prototype bidirectional Indonesian-English MT system. For that purpose, we used the Indonesian-English parallel corpus created above. From it we selected a collection of training and test sentences drawn from different genres and domains, totaling 250,000 parallel sentences. We use SRILM to build the n-gram language model and the translation model. We subsequently use PHARAOH (Koehn 2004) as a beam search decoder. Evaluating the preliminary system with 240,000 training sentences and 10,000 test sentences yields a BLEU score of 0.649 using 2 reference translations.
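For readers unfamiliar with how a BLEU score over multiple references is obtained, the following is a minimal sketch of corpus-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is not the evaluation script used for the reported score, which the paper does not name; the function names and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references_list, max_n=4):
    """Corpus-level BLEU.

    hypotheses: list of token lists, one per test sentence.
    references_list: for each test sentence, a list of reference token lists.
    """
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references_list):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis (ties: shorter).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy example with 2 references per sentence, as in the reported evaluation.
hyps = ["the government announced a new economic policy today".split()]
refs = [[
    "the government announced a new economic policy today".split(),
    "today the government announced its new economic policy".split(),
]]
print(corpus_bleu(hyps, refs))
```

Using several references lets a hypothesis earn credit for n-grams that appear in any of them, which is one reason scores with 2 references are not directly comparable to single-reference scores.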

Figure 2. I/ETS Symbolic-Statistical Machine Translation Framework (components shown: Source Language, Indonesian Monolingual Corpus, English Monolingual Corpus, Indonesian-English Parallel Corpus, Morphological Analyzer, POS-Tagger, Syntactic Parser, Phrase Reordering, Statistical Toolkit, Pharaoh SMT, Generation System, Target Language)

2.3 Symbolic Language Components

Our goal is to use a statistical method, or a combination of statistical and symbolic methods, in our MT experiments. The current system is purely statistical, but ultimately I/ETS will include the following symbolic modules:

Morphological analyzer and syntactic parser

The derivational morphology of Bahasa Indonesia, particularly its verbal morphology, is quite complex, so processing Indonesian words requires careful analysis. The existing literature on Indonesian morphology (Alwi et al. 2003) is surveyed and compared with the empirical evidence found in the corpus. The morphological analyzer will be built using both an n-gram or HMM model (Cutting 1992) and a symbolic rule-based system (Brill 1992). Completing the morphological analyzer will enable the development of a hybrid phrase parser (Ramshaw 1994). This shallow parser will identify the noun phrases, verb phrases, adjective phrases, etc. in a sentence (Chen 1994). Furthermore, a robust syntactic parser can be built on the output of the shallow parser to determine the syntactic structure of a sentence, i.e., which parts of the sentence are the subject, predicate, and object.
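To make the rule-based side of the planned morphological analyzer concrete, the sketch below strips common Indonesian verbal affixes and restores root-initial consonants dropped by nasal assimilation of the meN- prefix (for example, menulis from tulis). It is an illustration only, not the I/ETS analyzer: the tiny `ROOTS` lexicon, the affix tables, and the `analyze` function are hypothetical, and a full system would cover many more affixes and use the statistical model to choose among competing analyses.

```python
# Toy root lexicon (hypothetical; a real analyzer would use a full dictionary).
ROOTS = {"baca", "tulis", "jalan", "makan", "sapu", "pukul", "kirim"}

# Each prefix is paired with the root-initial consonants that may have been
# dropped by nasal assimilation ("" means nothing needs to be restored).
PREFIXES = [
    ("meny", ["s"]),      # menyapu  <- sapu
    ("meng", ["", "k"]),  # mengirim <- kirim
    ("mem", ["", "p"]),   # membaca  <- baca, memukul <- pukul
    ("men", ["", "t"]),   # menulis  <- tulis
    ("me", [""]),
    ("ber", [""]),
    ("ter", [""]),
    ("di", [""]),
]
SUFFIXES = ["kan", "an", "i", ""]


def analyze(word, roots=ROOTS):
    """Return every (prefix, root, suffix) analysis licensed by the lexicon."""
    analyses = []
    if word in roots:
        analyses.append(("", word, ""))
    # The empty prefix lets suffix-only forms such as "makanan" be analyzed.
    for prefix, restored in [("", [""])] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)] if suffix else rest
            for initial in restored:
                root = initial + stem
                if root in roots and (prefix or suffix):
                    analyses.append((prefix, root, suffix))
    return analyses


for word in ["membaca", "menulis", "menyapu", "dikirimkan", "makanan"]:
    print(word, analyze(word))
```

Returning all candidate analyses rather than a single one leaves room for a statistical component, such as an HMM tagger, to disambiguate in context.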

Phrase reordering system

This module will transform phrase structure from Indonesian to English, especially to enable correct translation of noun phrases (NP) and adjective phrases (AdjP). The word order of an Indonesian NP is rearranged to match the word order of the corresponding English NP. The module will preprocess the input sentence or file prior to translation, so that words are reordered and aligned to achieve better translation of NPs and AdjPs.
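A minimal sketch of such a reordering step for simple noun phrases is given below. In Indonesian, adjectives and demonstratives follow the head noun (buku baru itu, literally "book new that"), whereas English places them before it ("that new book"). The sketch assumes POS-tagged input; the tag names, the `MODIFIER_TAGS` set, and the function name are hypothetical, and noun-noun compounds and adjective phrases are deliberately not handled.

```python
# Tags treated as post-nominal modifiers (assumed tagset, for illustration).
MODIFIER_TAGS = {"ADJ", "DET"}


def reorder_noun_phrases(tagged):
    """Move post-nominal modifiers in front of their head noun.

    tagged: list of (word, tag) pairs in Indonesian word order.
    Each run of a NOUN followed by modifiers is reversed, so that
    NOUN ADJ DET becomes DET ADJ NOUN; all other tokens stay in place.
    """
    output = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] == "NOUN":
            j = i + 1
            while j < len(tagged) and tagged[j][1] in MODIFIER_TAGS:
                j += 1
            output.extend(reversed(tagged[i:j]))
            i = j
        else:
            output.append(tagged[i])
            i += 1
    return output


# "saya membeli buku baru itu" = "I bought that new book"
sentence = [("saya", "PRON"), ("membeli", "VERB"),
            ("buku", "NOUN"), ("baru", "ADJ"), ("itu", "DET")]
print(" ".join(word for word, _ in reorder_noun_phrases(sentence)))
```

After reordering, the Indonesian words appear in English NP order, which leaves the statistical decoder with a more monotone alignment to learn.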

Generation system

This module will produce target sentences (Indonesian or English) from an intermediate representation created by the statistical MT.

3. Discussion and Future Work

New media is often associated with the promotion and enhancement of collective intelligence. The ability of new media to easily store and retrieve information, predominantly through databases and the Internet, allows that information to be shared without difficulty. Thus, through interaction with new media, knowledge passes easily between sources, resulting in a form of collective intelligence. The use of interactive new media, particularly the Internet, promotes online interaction and this distribution of knowledge between users. In our work, the collaborators share their knowledge of standard Indonesian news and report writing and their ability to translate the source Indonesian sentences into English.

Our work on the collaborative P2P corpus is driven by the study of collective intelligence, namely the “learner generated context”, in which a group of users collaboratively marshal available resources to create an ecology that meets their needs, often (but not only) in relation to the co-creation and co-design of a particular learning space that allows learners to create their own context (Skrbina 2001). In this sense, learner generated contexts represent an ad hoc community that facilitates the coordination of collective action in a network of trust. Our corpus building

The best example of a learner generated context is perhaps found on the Internet, where a group of collaborating users pools knowledge to create a shared intelligence space. As the Internet has developed, so has the concept of collective intelligence as a shared public forum. The global accessibility and availability of the Internet has allowed more people than ever to contribute their ideas and to access these collaborative intelligence spaces. According to Don Tapscott and Anthony D. Williams (2008), collective intelligence is mass collaboration, and four principles need to exist for it to happen. We adopt these principles and relate our work to each of them as follows:

Openness - In the early days of communications technology, people and companies were reluctant to share ideas and intellectual property or to encourage self-motivation, because these resources provided an edge over competitors. Now people and companies tend to loosen their hold on these resources because they reap more benefits by doing so. By allowing journalists and reporters to share their ideas in articles and reports, we gain significant content for developing an open corpus and obtain scrutiny through collaboration.

Peering - This is a form of horizontal organization with the capacity to create information technology products. When contributing a document to the corpus, users are free to modify and develop it provided that they make it available to others. As quoted, “Peering succeeds because it leverages self-organization – a style of production that works more effectively than hierarchical management for certain tasks.”

Sharing - This principle has been the subject of much debate, with the question being “Should there be no laws against distribution of intellectual property?” Research has shown that more and more publishers have started to share some of their intellectual property while maintaining some degree of control over the rest, such as critical information. This is because publishers have realized that by restricting all of their intellectual property, they shut out all possible opportunities.

Acting Globally - The emergence of communication technology has prompted the rise of global companies. As the influence of the Internet is widespread, a global community of writers and corpus producers has no geographical boundaries. Its members also have global connections, allowing them to gain access to ideas and technology.

Our future work on corpus building will focus on these four principles, using a unifying framework that is now being developed at IPTEKnet-BPPT.

References

Alwi, Hasan et al. (2003). Tata Bahasa Baku Bahasa Indonesia. Balai Pustaka.
Brill, E. (1992). A Simple Rule-Based Part of Speech Tagger, Proceedings of the Third Conference on Applied Computational Linguistics, ACL.
Chen, K., H. Chen (1994). Extracting NP from Large Scale Texts: A Hybrid Approach and Its Automatic Evaluation, 32nd ACL Annual Meeting, Las Cruces.
Cutting, D. (1992). A Practical Part-of-Speech Tagger, Proceedings of the Third Conference on Applied Natural Language Processing, ACL.
Koehn, P. (2004). Pharaoh: A beam-search decoder for phrase-based statistical machine translation models. Retrieved from http://www.isi.edu/licensed-sw/pharaoh/
Ramshaw, L., M. Marcus (1994). Exploring the Statistical Derivation of Transformational Rule Sequence for Part of Speech Tagging, Proceedings of ACL-94 Workshop: The Balancing Act, Las Cruces, New Mexico, USA.
Riza, H. (2001). BIAS-II: Bahasa Indonesia Analyser System Using Stochastic-Symbolic Techniques, International Conference on Multimedia Annotation (MMA), , Japan.
Riza, H. (2003). MT Research and Development in Indonesia, Linguistics Symposium of Atmajaya Language Center, , Indonesia.
Skrbina, D. (2001). Participation, Organization, and Mind: Toward a Participatory Worldview, Doctoral Thesis, Centre for Action Research in Professional Practice, School of Management, University of Bath, England.
Tapscott, D., & Williams, A. D. (2008). Wikinomics: How Mass Collaboration Changes Everything, USA: Penguin Group.