Apertium Sh-Mk, Implementation of a New Language Pair

Hrvoje Peradin [email protected], krvoje on IRC: #apertium

Why is it you are interested in machine translation?

It’s a perfect combination of computer science and linguistics. I am fascinated by languages, both natural and artificial. With MT, it fascinates me to see a natural-language message carried across a language barrier by computation alone.

Why is it that you are interested in the Apertium project?

Apertium was simply my best choice in GSoC. Since I am aiming to write my graduate thesis on natural language processing, and my university currently doesn’t offer a course on that subject, this is a unique chance to work seriously and first-hand with the tools and methods of that area. I would also be happy to contribute to the open source community and, at the same time, to improve a little the status of my mother tongue in the world of machine translation.

Which of the published tasks are you interested in?

Contributing to an orphaned language pair. I intend to target apertium-sh-mk.

Why should Google and Apertium sponsor it?

The language I intend to target encompasses three standard languages, which together are spoken by over 15 million people. The morphological dictionary developed in this project will be a solid foundation for future work on expanding the vocabulary and on other language pairs.

How and whom it will benefit in society?

Hopefully, it will help improve machine translation of the Bosnian, Croatian and Serbian standard languages, and in the future help build new translation systems.

What do you plan to do?

I plan to reimplement the sh half of the apertium-sh-mk language pair, together with the bilingual dictionary and the transfer rules towards Macedonian. There is some previous work on the Serbo-Croatian (sh) part, including some handy methods for handling the differences between the three standards. However, some linguistic paradigms are missing, and documentation of the implementation as a whole is quite scarce. To be more productive, I will reimplement the entire dictionary from scratch, using the old code as a rough guideline and recycling some parts of it.

The Macedonian morphological dictionary is already efficiently implemented, and I plan to rely on it while implementing the transfer rules. Some of the analytic paradigms of Macedonian are difficult to translate into their synthetic counterparts in Serbo-Croatian, and that task would exceed the time frame at my disposal, so I won’t be implementing the other direction (Macedonian to Serbo-Croatian).

Community bonding period

• installation of the Apertium environment, and reading of the materials

• a thorough study of previous work on the sh-mk pair, mostly by going through the code, and experimenting with apertium-viewer and the command line

• preparing a preliminary morphological dictionary to start writing from scratch (expanding on the tutorial, and experimenting)

Work plan

• Weeks 1-3: Implementation of the morphology paradigms, along with some words they cover

• Deliverable #1: A complete morphology, with some differences between the local standard languages built in.

• Weeks 4-6: Continue with the input of lemmata. Writing the transfer rules.

• Weeks 7-8: Part-of-speech (POS) tagger training using manually tagged corpora, or some other method if available.

• Deliverable #2: Transfer rules + POS tagger completed

• Weeks 9-12: Testing with testvoc, and other types of proofing. Brushing up parts if necessary. Writing the documentation.

• Deliverable #3: morphology + bilingual dictionary + POS tagger + transfer rules + vocabulary + documentation
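To illustrate what is meant above by a paradigm and by vocabulary testing, here is a minimal Python sketch. It is not Apertium's dictionary format or the testvoc tool: the data layout, the tag strings, the paradigm name and the function names are all assumptions made up for this illustration. A paradigm is modelled as a list of suffixes with their tags, each entry as a stem attached to a paradigm, and the check only reports lemmata that have no entry in a toy bilingual dictionary, which is a small part of what real vocabulary testing covers.

```python
# Hypothetical, simplified model of a morphological paradigm:
# a list of (suffix, tags) pairs that attach to a stem.
# This is NOT Apertium's dictionary format; all names are illustrative.
PARADIGMS = {
    "žen/a__n": [
        ("a", "n.f.sg.nom"),
        ("e", "n.f.sg.gen"),
        ("i", "n.f.sg.dat"),
        ("u", "n.f.sg.acc"),
    ],
}

# Illustrative dictionary entries: (stem, lemma, paradigm name).
ENTRIES = [
    ("žen", "žena", "žen/a__n"),
    ("kuć", "kuća", "žen/a__n"),
    ("škol", "škola", "žen/a__n"),
]

# Toy bilingual dictionary (sh lemma -> mk lemma).
# "škola" is left out on purpose so the check has something to report.
BIDIX = {
    "žena": "жена",
    "kuća": "куќа",
}


def expand(entries, paradigms):
    """Return every (surface form, lemma, tags) triple the entries generate."""
    forms = []
    for stem, lemma, paradigm_name in entries:
        for suffix, tags in paradigms[paradigm_name]:
            forms.append((stem + suffix, lemma, tags))
    return forms


def untranslated_lemmas(entries, bidix):
    """Return the lemmata that have no entry in the bilingual dictionary."""
    return sorted({lemma for _, lemma, _ in entries if lemma not in bidix})
```

Calling `expand(ENTRIES, PARADIGMS)` lists every surface form with its lemma and tags, and `untranslated_lemmas(ENTRIES, BIDIX)` lists the lemmata still missing a translation. Keeping the suffixes in one shared paradigm means that adding a new word of the same type is a single line in `ENTRIES`.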

Non-GSoC activities

• Classes until 1 July, but I will be available 30 hours per week

• Final exams in June, ending on 1 July

• GF Summer school, 15-26 August

Bio

I am an undergraduate student of Computer Science at the Faculty of Science, University of Zagreb. During my courses I have worked with C/C++, C#, Java, JavaScript, PHP + CSS + HTML, XML, SQL... I also have a basic knowledge of Perl and Ruby, and recently I have started learning some functional programming, particularly Haskell and GF.

Regarding the technologies used in machine translation, I’ve been enrolled in courses on finite state machines and context-free grammars (implementing a parser using yacc + flex). Right now I’m attending a course on machine learning, which I believe would be a great complement to working with Apertium. Currently I’m also constructing a grammar of the Croatian language in GF, as part of the assignment for the GF summer school. I have never been involved in any open source projects. However, I would in any case like to do some work on Apertium, even if I am not selected for GSoC.