
THE FINNO-UGRIC LANGUAGES AND THE INTERNET PROJECT

Tommi Jauhiainen, Heidi Jauhiainen and Krister Lindén
Department of Modern Languages, Faculty of Arts, University of Helsinki

INTRODUCTION

The "Finno-Ugric Languages and The Internet" project started at the beginning of 2013 as part of the Kone Foundation Language Programme [1] and the international CLARIN cooperation. The main goal of the project is to build a prototype of a system that will crawl the internet and gather texts written in small Uralic languages. The largest Uralic languages, Hungarian, Finnish, and Estonian, are outside the scope of the project. The gathered texts will be collected into sentence and word corpora for each language, and the links to the associated web pages into link collections. The corpora will act as a source for linguists, and the link collections will hopefully spread the knowledge of the existence of relevant pages to interested parties.

LANGUAGE IDENTIFICATION

We are using a language identifier, developed within the project [2], which uses relative frequencies of n-grams of characters together with tokens and token-based backoff. The language identifier recognizes 285 languages from all around the world, including 34 Uralic languages. The amount of training material differs considerably between languages, ranging from 19,000 characters in Ume Sami to over 400 million characters in the Hungarian material.

The average identification accuracy for Uralic languages is generally slightly lower than for all languages. This is due to some of the languages being very close varieties of each other, especially within the [...].

For the test length of 20 characters, the overall average identification recall is 93.9%, whereas the average is 90.5% for the [...]. Almost all languages attain 100.0% recall at 150 characters.

[Figure: Confusion matrix of Finnic languages.]

CRAWLING THE NATIONAL DOMAINS

In order to crawl for pages written in small Uralic languages, we use Heritrix [3], a web archiving system developed by the Internet Archive. The version we are currently using downloads all text files as well as PDF files it finds within the domain in question.

[Figure: A diagram showing how Uralic web pages are processed during a crawl.]

We have chosen to start collecting the material by crawling the national domains most likely to contain material written in small Uralic languages: .ee, .fi, .no, .ru, and .se.

[Table: Statistics for the crawls of the five national domains.]

We will be doing new crawls for all of them, as most of the crawls ended before the domains were really exhausted. It is actually far from trivial to define when we have exhausted a national domain. There are many sites that dynamically generate an infinite number of web pages and even sub-domains, which makes each of the national domains infinite in size if we are counting the number of pages or sub-domains.

SENTENCE CORPORA

When we are creating sentence corpora, one of the greatest problems we have at the moment is that many of the downloaded pages are multilingual. We are currently making a survey of the methods for language identification in multilingual documents, and in the future we will incorporate a multilingual detection method in the system. We did a separate language identification for all the lines of all the files containing small Uralic languages in order to see which ones are, indeed, written in the language indicated by the identification of the file as a whole.

[Table: The number of lines, words, and characters in small Uralic languages after identifying the language of each individual line.]

THE LINK COLLECTION

The link collection that is available at http://suki.ling.helsinki.fi/sites has been curated by hand from the pages of the .fi crawl. It contains links to 266 sites from which text was found in 19 of the 31 small Uralic languages searched. The links have not been verified by experts or native speakers. We are planning to incorporate a simple crowd-sourcing platform to be able to get feedback from those who are more familiar with the languages. Our goal is to make the creation of the link collection as automated as possible, avoiding manual link curation.

FUTURE WORK

Language identification methods will be further developed in order to improve the robustness of the language identifier we use. We will, furthermore, try to increase the speed of the crawler in order to crawl more widely and more often. The most important national domains in regard to the Uralic language speakers will be re-crawled with more depth and more frequency. We also intend to look into crawling the .com and .org domains. We would also like to extract text from binary files other than PDFs.

REFERENCES

[1] Kone Foundation. The language programme 2012-2016. http://www.koneensaatio.fi/en, 2012.
[2] Tommi Jauhiainen and Krister Lindén. Identifying the language of digital text. In review, submitted 08/14, 2015.
[3] Gordon Mohr, Michael Stack, Igor Ranitovic, Dan Avery, and Michele Kimpton. Introduction to Heritrix. In 4th International Web Archiving Workshop, Bath, 2004.

CONTACT INFORMATION

[email protected]
http://suki.ling.helsinki.fi
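
ILLUSTRATIVE SKETCH: LINE-BY-LINE IDENTIFICATION

The idea of identifying the language of each line separately, using relative frequencies of character n-grams, can be sketched roughly as follows. This is a simplified illustration only and not the project's identifier [2], which also uses tokens and token-based backoff. The function names, the trigram length, the floor value for unseen n-grams, and the toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of the lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build relative n-gram frequencies per language from {language: text}."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(char_ngrams(text, n))
        total = sum(counts.values())
        models[lang] = {gram: c / total for gram, c in counts.items()}
    return models


def identify(line, models, n=3, floor=1e-6):
    """Return the language whose model gives the line the highest log-score."""
    grams = char_ngrams(line, n)
    best_lang, best_score = None, -math.inf
    for lang, freqs in models.items():
        # Unseen n-grams receive a small floor value (an assumed, crude penalty).
        score = sum(math.log(freqs.get(g, floor)) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


def identify_lines(document, models, n=3):
    """Label every non-empty line of a document separately."""
    return [(identify(line, models, n), line)
            for line in document.splitlines() if line.strip()]


# Toy training data (hypothetical; real training material is far larger).
TOY_SAMPLES = {
    "fi": "tämä on suomenkielinen lause ja se on kirjoitettu suomeksi",
    "en": "this is an english sentence and it is written in english",
}
models = train(TOY_SAMPLES)
labelled = identify_lines("it is written\n\nse on kirjoitettu", models)
```

With per-line labels of this kind, lines whose label disagrees with the label of the whole file can be set aside before building a sentence corpus. Closely related languages would need far more training data and a stronger model than this sketch provides.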