apertium sh-mk, implementation of a new language pair

Hrvoje Peradin [email protected], krvoje on IRC: #apertium

Why is it you are interested in machine translation?

It’s a perfect combination of computer science and linguistics. I am fascinated by languages, both natural and artificial. With MT, it fascinates me to see a natural-language message transferred across a language barrier purely by a process of computation.

Why is it that you are interested in the Apertium project?

Apertium was simply my best choice on GSoC. Since I am aiming for my graduate thesis to be about natural language processing, and my university currently doesn’t offer a course on that particular subject, this is a unique chance to work seriously and first-hand with the tools and methods of that area. Also, I would be quite happy to contribute to the open source community, and at the same time improve a little bit the status of my mother tongue in the world of machine translation.

Which of the published tasks are you interested in?

Contributing to an orphaned language pair. I intend to target apertium-sh-mk.

Why should Google and Apertium sponsor it?

The language I intend to target encompasses three standard languages, which are altogether spoken by over 15 million people. The morphological dictionary developed in this project will be a solid foundation for future work on expanding the vocabulary, and for other language pairings.

How and whom will it benefit in society?

Hopefully, it will help improve machine translation for the Bosnian, Croatian and Serbian standard languages, and in the future help build new translation systems.

What do you plan to do?

I plan to reimplement the sh half of the apertium-sh-mk language pair, the bilingual dictionary and the transfer rules towards Macedonian. There is some previous work done on the SC part, including some handy methods for handling the differences between the three standards. However, some linguistic paradigms are missing, and the documentation on the entire implementation is quite scarce. To be more productive, I will reimplement the entire dictionary from scratch, using the old code as a rough guideline, and recycling some parts of it.

The Macedonian morphological dictionary is already efficiently implemented, and I plan to rely on it while implementing the transfer rules. Some of the analytic paradigms of Macedonian are difficult to translate into their synthetic counterparts in Serbo-Croatian, and such a task would exceed the time frame I have at my disposal, so I won’t be implementing the other direction (Macedonian to Serbo-Croatian).

Community bonding period

• installation of the Apertium environment, and reading of the materials

• a thorough study of previous work on the sh-mk pair, mostly by going through the code, and experimenting with apertium-viewer and the command line

• preparing a preliminary morphological dictionary to start writing from scratch (expanding on the tutorial, and experimenting)

Work plan

• Weeks 1-3: Implementation of the morphology paradigms, along with some words they cover

• Deliverable #1 : A complete morphology, with some differences in local standard languages built in.

• Weeks 4-6: Continue with the input of lemmata. Writing the transfer rules.

• Weeks 7-8: Part-of-speech tagger training using manually tagged corpora, or some other method if available.

• Deliverable #2 : Transfer rules + POS tagger completed

• Weeks 9-12: Testing with testvoc, and other types of proofing. Brushing up parts if necessary. Writing the documentation.

• Deliverable #3 : morphology + bilingual dictionary + POS tagger + transfer rules + vocabulary + documentation

Non-GSoC activities

• Classes till the 1st of July, but available 30 h per week

• Final exams in June, they end 1st of July

• GF Summer school 15-26 August

Bio

I am an undergraduate student of Computer Science at the Faculty of Science, University of Zagreb. During my courses I have worked with C/C++, C#, Java, JavaScript, PHP + CSS + HTML, XML, SQL... I also have a basic knowledge of Perl and Ruby, and recently I have started learning some functional programming, particularly Haskell and GF.

Regarding the technologies used in machine translation, I have taken courses covering finite state machines and context-free grammars (implementation of a parser using yacc+flex). Right now I’m attending a course on machine learning, which I believe would be a great complement to working with Apertium. Currently I’m also constructing a grammar of the Croatian language in GF, as a part of the assignment for the GF summer school. I have never been involved in any open source projects. However, I would in any case like to do some work on Apertium, even if I am not selected for GSoC.
