I/ETS: Indonesian-English Machine Translation System Using Collaborative P2P Corpus
Hammam Riza, Budiono, Adiansya Prasetya and Henky Mulyadi
Science and Technology Network Information Center (IPTEKnet)
Agency for the Assessment and Application of Technology (BPPT), Indonesia

Abstract. This paper presents preliminary results in developing a bidirectional Indonesian-English machine translation system using open source software and a Creative Commons corpus. We describe our method, starting with the corpus collection process, followed by corpus processing and the software system for translation. The corpus is developed through a collaborative P2P development framework, a collective intelligence approach to building a parallel Indonesian-English text. We further describe the components of the translation system, which combines symbolic and statistical techniques in a hybrid approach.

1. Introduction

In the era of globalization, communication across languages becomes much more important. People have been hoping that natural language processing and speech processing, which are part of ICT (Information and Communication Technology), can help smooth communication among people who speak different languages. For the Indonesian language in particular, however, there has been little research in the past. Because no large corpus is available and such a corpus is of crucial importance, the first phase of this project is to build a large bilingual Indonesian-English corpus. We use a collective intelligence approach to build this corpus, which in turn is used to build modules for the hybrid symbolic-statistical Machine Translation (MT) system.

2. System Components

Building a statistical machine translation system involves two main components, both of which are crucial. An additional supporting component consists of the symbolic modules. In the following sections, we describe each component in detail.
2.1 Collaborative P2P Corpus

To have a reliable corpus as one of the main components in the development of MT, we build a database of bilingual sentences from various sources using a collective intelligence approach. We call this a collaborative P2P (peer-to-peer) approach to corpus building. The corpus is built by an online community of news agencies, newspaper journalists, reporters, bloggers and media publishers who cooperate and contribute documents, either in Indonesian or as collections of bilingual Indonesian-English sentence pairs (see Figure 1). We classify this work into several domains: National, International, Sport, Economy, and Science and Technology. The choice of documents (genre) is affected by the intended coverage of the linguistic resources, which we further categorize into News or Article style documents.

The next step is collaborative translation and cross-evaluation of the contributed documents, especially with respect to the adequacy and fluency of the translations. We also observe the intellectual property rights (IPR) of the collection. We seek Creative Commons materials from all document sources, which enables us to develop an open source framework for both the data and the MT application.

In the end, we have the following set of corpora:

- Monolingual Indonesian corpus (Bahasa Indonesia)
- Monolingual English corpus
- Sentence-aligned parallel corpus (Indonesian and English)

The targeted parallel corpus will contain 1 million sentence pairs (approximately 20 million words). The current collection of about 282,000 sentences has the genres and domains given in Table 1.

Figure 1. P2P Collaborative Schema

Table 1. Corpus Sources: Domain and Genre

Domain                  Genre: News                    Genre: Article
Economy                 id.wikipedia.org               id.wikipedia.org
                        detikfinance.com               www.ekonomirakyat.org
                        mediaindonesia.com             web.bisnis.com/artikel/
                        economy.okezone.com            beritasore.com
                        kompas.com/bisniskeuangan      beritasore.com
Sport                   id.wikipedia.org               id.wikipedia.org
                        detiksport.com                 arsip.info/olahraga
                        mediaindonesia.com             www.pbdjarum.com
                        sports.okezone.com             www.koni.or.id
                        kompas.com/olahraga            arsip.info/olahraga
National News           lampungpost.com                beritasore.com
                        tempo.co.id/nasional           www.transparansi.or.id
                        liputan6.com                   www.jangkar.org
                        news.okezone.com               arsip.info/hiburan
                        kompas.com/nasional            www.legalitas.org
International News      lampungpost.com                www.link.or.id
                        tempo.co.id/internasional      www.kompas.com
                        internasional.okezone.com      artrenewal.org/articles/articles.asp
                        kompas.com/internasional       www.belajargratis.web.id
                        liputan6.com/luarnegri         www.tempo.co.id
Science & Technology    techno.okezone.com             www.dikti.org
                        tekno.kompas.com               www.tempo.co.id
                        mediaindonesia.com             arsip.info/teknologi
                        techno.okezone.com             www.dikti.org
                        tekno.kompas.com               www.tempo.co.id

Table 2. Parallel Corpus Collection Statistics (Current Status)

Corpus name           Sources                Sentences    English words    Indonesian words
ANTARA I              Antara News Agency       157,490        3,604,181           3,323,362
BTEC                  BPPT                     106,698          736,615             685,175
ANTARA II             Antara News Agency        14,000          277,200             267,400
Science Technology    Internet & eBook           4,308           51,201              43,519
Total                                          282,496        4,669,197           4,319,456

2.2 Statistical Machine Translation

To test the feasibility of statistical MT for Indonesian, we build a prototype bidirectional Indonesian-English MT system. For that purpose, we use the Indonesian-English parallel corpus created above. From this corpus we select a collection of training and test sentences drawn from different genres and domains, totaling 250,000 parallel sentences. We use SRILM to build the n-gram language model and the translation model.
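To make the notion of an n-gram language model concrete, the following is a minimal sketch of a bigram model with add-one smoothing. It is an illustration only and does not reflect how SRILM is implemented or how it was configured in this work; the toy sentences, the function names `train_bigram` and `log_prob`, and the choice of smoothing are all assumptions made for the example.

```python
from collections import Counter
import math


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        # Sentence boundary markers let the model score first and last words.
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Return the add-one smoothed log10 probability of a sentence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log10(numerator / denominator)
    return total


# Hypothetical toy corpus (not taken from the I/ETS data).
corpus = [
    "saya membaca buku baru",
    "dia membaca buku merah",
    "saya membeli rumah baru",
]
unigrams, bigrams = train_bigram(corpus)

fluent = log_prob("saya membaca buku merah", unigrams, bigrams)
scrambled = log_prob("buku saya merah membaca", unigrams, bigrams)
print(fluent > scrambled)
```

A word sequence whose bigrams were seen in training receives a higher score than a scrambled ordering of the same words, which is the property a decoder relies on when it uses the language model to prefer fluent output.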
We subsequently use PHARAOH (Koehn 2004) as a beam search decoder. Evaluating the preliminary system with 240,000 training sentences and 10,000 test sentences yields a BLEU score of 0.649 using 2 reference translations.

Figure 2. I/ETS Symbolic-Statistical Machine Translation Framework (components shown: source language, monolingual corpora, morphological analyzer, POS tagger, syntactic parser, phrase reordering, Pharaoh SMT, generation system, Indonesian-English parallel corpus, target language)

2.3 Symbolic Language Components

Our goal is to use a statistical method, or a combination of statistical and symbolic methods, in our MT experiment. The current system is purely statistical, but ultimately I/ETS will consist of the following symbolic modules:

Morphological analyzer and syntactic parser

The derivational morphology of Bahasa Indonesia, particularly verbal morphology, is quite complex, so processing Indonesian words requires careful analysis. The existing literature on Indonesian morphology (Alwi et al. 2003) is surveyed and compared with the empirical evidence found in the corpus. The morphological analyzer will be built using both n-gram or HMM models (Cutting 1992) and a symbolic rule-based system (Brill 1992). Completing the morphological analysis will enable the development of a hybrid phrase parser (Ramshaw 1994). This parser will identify noun phrases, verb phrases, adjective phrases, etc. in a sentence (Chen 1994). Furthermore, a robust syntactic parser can be built using the output of the shallow parser to determine the syntactic structure of a sentence, i.e., which parts of a sentence are the subject, predicate and object.

Phrase Reordering System

This module will transform phrase structure from Indonesian to English, especially to enable correct translation of noun phrases (NP) and adjective phrases (AdjP).
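As a rough illustration of the kind of transformation described for this module, the sketch below moves an adjective in front of the noun it follows, turning Indonesian noun-adjective order into English adjective-noun order before translation. It assumes the input has already been POS-tagged; the tag names (`N`, `ADJ`), the function name `reorder_np` and the single swap rule are hypothetical simplifications, not the design of the planned I/ETS module, which would work on parsed phrases.

```python
def reorder_np(tagged):
    """Swap each noun that is directly followed by an adjective.

    `tagged` is a list of (word, tag) pairs; the result has the same
    pairs with every <N, ADJ> sequence rewritten as <ADJ, N>.
    """
    out = []
    i = 0
    while i < len(tagged):
        is_noun_adj = (
            i + 1 < len(tagged)
            and tagged[i][1] == "N"
            and tagged[i + 1][1] == "ADJ"
        )
        if is_noun_adj:
            # Emit the adjective first, then the noun, and skip both.
            out.extend([tagged[i + 1], tagged[i]])
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    return out


# Hypothetical tagged input: "saya membeli rumah besar" (I bought a big house).
sentence = [("saya", "PRON"), ("membeli", "V"), ("rumah", "N"), ("besar", "ADJ")]
reordered = reorder_np(sentence)
words = [word for word, _ in reordered]
print(" ".join(words))
```

After reordering, the Indonesian words appear in the order of their English counterparts, so a phrase-based decoder has less long-distance movement to model. A rule this simple ignores multi-word adjective phrases and nouns modified by other nouns, which is one reason a parser-based approach is preferable.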
The word order of an Indonesian NP, <N,Adj>, is reordered to match the word order of an English NP, <AdjP,N>. This module will prepare the input sentences or files prior to the translation process, so that words are reordered and aligned to achieve better translation of NPs and AdjPs.

Generation system

This module will produce target sentences (Indonesian or English) based on an intermediate representation created by the statistical MT.

3. Discussion and Future Work

New media is often associated with the promotion and enhancement of collective intelligence. The ability of new media to easily store and retrieve information, predominantly through databases and the Internet, allows information to be shared without difficulty. Thus, through interaction with new media, knowledge passes easily between sources, resulting in a form of collective intelligence. The use of interactive new media, particularly the Internet, promotes online interaction and this distribution of knowledge between users. In our work, the collaborators share their knowledge of standard Indonesian news and report writing and their ability to translate the source Indonesian sentences into English.

Our work on the collaborative P2P corpus is driven by the study of collective intelligence, namely the “learner-generated context”, in which a group of users collaboratively marshal available resources to create an ecology that meets their needs, often (but not only) in relation to the co-creation and co-design of a particular learning space that allows learners to create their own context (Skrbina 2001). In this sense, a learner-generated context represents an ad hoc community that facilitates the coordination of collective action in a network of trust. Our corpus building

The best example of a learner-generated context is perhaps found on the Internet, where a group of collaborating users pool knowledge to produce a shared intelligence space.
As the Internet has developed, so has the concept of collective intelligence as a shared public forum. The global accessibility and availability of the Internet have allowed more people than ever to contribute their ideas and to access these collaborative intelligence spaces. According