An Overview of the Crúbadán Project
Kevin Scannell, Saint Louis University, 11 August 2014

Background
● I was born in the US, but I speak Irish
● Trained as a mathematician
● Began developing tools for Irish in the 1990s
● No corpora existed for doing statistical NLP
● So I built my own with a crawler
● Crúbadán = "crawling thing", from crúb = paw

An Crúbadán: History
● First attempt at crawling the Irish web, January 1999
● 50M words of Welsh for a historical dictionary, 2004
● ~150 minority languages, 2004–2007
● ~450 languages for WAC3, 2007
● Unfunded through 2011
● Search for "all" languages, started c. 2011

How many languages on the web?
● Two-year project ends in December
● Phase one: aggressively seek out new languages
● Phase two: produce free, usable resources
● Current total: 1,950
● At least 150 more queued for training
● 2,500? 3,000?

Languages vs. time
● [chart: number of languages found over time; graphic not reproduced in this text]

Goals
● Interested in revitalization, first and foremost
● Building useful software for communities
● BLARKs (Basic Language Resource Kits): word lists, morphological analyzers
● Open data for under-resourced languages
● Linguistic typology
● Linguistic diversity of the web

Spelling and grammar checkers
● Corpus-based Irish spellchecker, 2000
● Grammar checker, 2003
● 28 new spellcheckers since 2004
● Collaborations with native speakers
● All under open source licenses

hunspell
● Standard open source spellchecking engine
● The default in Firefox and LibreOffice
● Fast and powerful
● Good language support: ~150 languages
● Can be as simple as a word list
● But also supports complex morphology

Morphology in hunspell
● Finite-state transducers (Xerox, HFST, …)
● Very fast + bidirectional
● Cover most morphology of human languages
● hunspell instead uses "twofold affix stripping"
● Morphological analysis only
● Not as powerful, theoretically
● BUT: simple formalism, user support FTW!

Powerful enough
● Hungarian
● Northern Sámi
● Basque
● Lingala, Kinyarwanda, Swahili, Chichewa
● …?

Language ID
● Both a component and an application of Crúbadán
● Character n-grams + word models
● 3-gram data set distributed with NLTK
● Used by Indigenous Tweets and Indigenous Blogs

Predictive text
● T9 input
● Adaptxt
● Firefox OS

accentuate.us
● Diacritic restoration service
● Input (diacritics stripped): Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u
● Output (restored): Ẹnì kọ̀ọ̀kan ló ní ẹ̀tọ́ sí òmì nira láti ní ìmọ̀ràn tí ó wù ú
● End-user clients for Firefox, LibreOffice
● Perl, Python, Haskell libraries
● Joint work with Michael Schade

Lexicography
● Geiriadur Prifysgol Cymru
● Foclóir Nua Béarla-Gaeilge
● Foclóir na Nua-Ghaeilge
● SketchEngine

NLP Research
● N-gram language models for MT
● Computational morphology
● Parsing
● OCR (e.g. Irish seanchló, the old Gaelic typeface)
● Speech recognition/synthesis

N-Gram Models
● Generative language model
● Probability of each word conditioned on the previous n−1 words
● Used in "noisy channel" problems
● Machine translation
● Speech recognition
● Spelling and grammar correction
● (a toy trigram model is sketched after the MT examples below)

Irish standardization
● Irish underwent a spelling reform in the 1940s
● Hard to use pre-standard texts for NLP
● Treated this as a statistical MT problem
● Only an n-gram model of the target side is needed
● Translation itself is handled via spelling rules
● Plus hand-curated pre→post mappings
● (a decoder sketch follows the examples below)

Example
● Pre: Ní rabh 'sa dearbhughadh sin acht a chuid uchtaighe, eisean, a h-Aodh féin ag teacht na h-arraicis.
● Post: Ní raibh sa dearbhú sin ach a chuid uchtaí, eisean, a hAodh féin ag teacht ina haraicis.

Scottish Gaelic to Irish MT
● Very similar setup
● Scottish Gaelic spelling resembles pre-standard Irish
● Many of the same spelling rules
● Exactly the same n-gram model for the target

Example
● Scottish Gaelic: Bha tàladh air choireigin na nàdur a bha a' tarraing a h-uile duine thuice.
● Irish: Bhí mealladh éigin ina nádúr a tharraing gach uile dhuine chuici.
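The n-gram models referenced above are standard generative language models: the probability of each word is conditioned on the previous n−1 words. A minimal sketch of a trigram model with add-one smoothing follows; all names here are illustrative and not from the Crúbadán codebase.

```python
from collections import defaultdict

class TrigramModel:
    """Toy trigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.trigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.vocab = set()
        for sent in sentences:
            words = ["<s>", "<s>"] + sent + ["</s>"]
            self.vocab.update(words)
            for i in range(2, len(words)):
                self.trigrams[(words[i - 2], words[i - 1], words[i])] += 1
                self.bigrams[(words[i - 2], words[i - 1])] += 1

    def prob(self, w1, w2, w3):
        # P(w3 | w1, w2) with add-one smoothing
        return ((self.trigrams[(w1, w2, w3)] + 1) /
                (self.bigrams[(w1, w2)] + len(self.vocab)))

    def score(self, sent):
        # Product of conditional probabilities over the sentence
        # (log-space would be used in practice to avoid underflow).
        words = ["<s>", "<s>"] + sent + ["</s>"]
        p = 1.0
        for i in range(2, len(words)):
            p *= self.prob(words[i - 2], words[i - 1], words[i])
        return p
```

A model trained on standard Irish text, e.g. `lm = TrigramModel(corpus_sentences)`, can then score candidate outputs with `lm.score("Ní raibh sa dearbhú sin".split())`.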
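The standardizer (and the Scottish Gaelic translator) can then be framed as a noisy-channel decoder: spelling rules propose candidate modern forms, and the target-side n-gram model picks the most probable one in context. Below is a greedy sketch assuming a `TrigramModel` like the one above; the rule table is a tiny hypothetical stand-in for the real hand-curated rules and mappings, and the actual system's decoding may differ.

```python
import re

# Hypothetical pre-standard -> standard spelling rules; the real rule
# set and the hand-curated pre/post word mappings are much larger.
RULES = [
    (r"ughadh\b", "ú"),     # dearbhughadh -> dearbhú
    (r"\bh-", "h"),         # h-Aodh -> hAodh
    (r"\bacht\b", "ach"),   # acht -> ach
]

def candidates(word):
    """All forms reachable from `word` by applying each rule or not."""
    forms = {word}
    for pattern, repl in RULES:
        forms |= {re.sub(pattern, repl, f) for f in forms}
    return forms

def standardize(sentence, lm):
    """Greedy left-to-right decoding: for each word, keep the candidate
    that the target-language trigram model scores highest in context."""
    out = ["<s>", "<s>"]
    for word in sentence.split():
        best = max(candidates(word),
                   key=lambda c: lm.prob(out[-2], out[-1], c))
        out.append(best)
    return " ".join(out[2:])
```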
Linguistic research
● Comparative phonology
● Syntax
● Psycholinguistics
● Selectional preferences
● …

Orthotree
● http://indigenoustweets.blogspot.com/2011/12/
● https://github.com/kscanne/orthotree

An Crúbadán: Design principles
● Orthographies, not languages
● Labelled by BCP-47 codes
● en, chr, sr-Latn, de-AT, fr-x-nor, el-Latn-x-chat
● Real, running texts (vs. word lists, GILT)
● Get "everything" for small languages
● Large samples for English, French, etc.

Three modules
● Traditional web crawler
● Twitter crawler
● Blog tracker

Phase 1: Finding new languages
● Lots of web searching!
● Special code monitors Wikipedia, JW.org, UN sites, bible.is
● Typing/OCR of scanned or offline texts
● Special thanks to Ed Jahn of George Mason
● D. Joosten, J. Berlage, N. Lewchenko
● NSF grant 1159174

Phase 2: Building useful resources
● Separating orthographies/dialects
● Stripping boilerplate
● Converting to UTF-8 text + normalizing
● Sentence segmentation and tokenization
● Avoiding copyright issues
● Discoverability (OLAC)

UTF-8 Normalization
● Fonts (Sámi, Mongolian, dozens of others)
● Lookalikes (az: ә/ә, bua: γ/ү, ro: ş/ș)
● Shortcuts (haw, mi, etc.: äëïöü for āēīōū)
● Encoding issues (tn, nso: ß/š from Latin-2)
● Apostrophe hell: 'ˈ´ʹʻʼʽʾˊˋ˴ʹ΄՚՝᾽᾿´῾‘’‛′‵ꞌ '`

Tokenization
● Default tokenizer (letters in the default script)
● Many exceptions: Greek letters in coo/hur/kab, etc.
● Word-internal punctuation (ca: l•l, l·l)
● Initial/final apostrophes or lookalikes
● (a combined normalization + tokenization sketch appears at the end of this outline)

Twitter crawler
● Twitter's REST API
● Search for "unique" words
● Crawl the follow graph
● Language ID is particularly challenging
● Uses word and character models (a character-trigram sketch appears at the end)

Blog tracker
● Blogger platform only (for now)
● Works hand-in-hand with the traditional crawler
● Registers all blogs with an in-language post
● Tracks all past and future posts
● http://indigenousblogs.com/

Call to action
● > 100 collaborators: speakers, linguists
● Help sort dialects and orthographies
● Tokenization and normalization
● Finding new material for training
● Help create new online material
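Below is a minimal sketch of the per-orthography cleanup described in the Phase 2, UTF-8 Normalization, and Tokenization slides: Unicode normalization, folding of lookalike codepoints and apostrophe variants, then a default letters-of-the-script tokenizer. The mapping table and the regex are tiny illustrative stand-ins; the real per-language tables and exception lists are far larger.

```python
import re
import unicodedata

# Illustrative lookalike/apostrophe folding table; real per-language
# tables (Azerbaijani schwa, Buryat gamma, etc.) are far larger.
LOOKALIKES = {
    "\u015F": "\u0219",   # ş (s-cedilla) -> ș (s-comma) for Romanian
    "\u02BC": "\u2019",   # modifier letter apostrophe -> right single quote
    "\u0060": "\u2019",   # grave accent used as an apostrophe
    "'":      "\u2019",   # ASCII apostrophe
}

def normalize(text):
    """NFC-normalize, then fold known lookalike codepoints."""
    text = unicodedata.normalize("NFC", text)
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)

# Default tokenizer: runs of letters (plus combining accents, which
# survive NFC in e.g. Yoruba), allowing word-internal apostrophes and
# middle dots (Catalan l·l); per-language exceptions would override this.
LETTER = r"[^\W\d_][\u0300-\u036f]*"
TOKEN = re.compile(f"(?:{LETTER})+(?:[\u2019·•](?:{LETTER})+)*")

def tokenize(text):
    return TOKEN.findall(normalize(text))
```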
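Finally, language identification of the kind used by the crawler and Twitter modules can be sketched with character trigram profiles: score a text against each language's trigram frequencies and pick the best match. This toy version uses characters only; per the slides, the real system combines word-level and character-level models, and the class and parameter names here are assumptions.

```python
import math
from collections import Counter

def char_trigrams(text):
    text = f"  {text} "   # pad so word edges form trigrams
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

class LangID:
    """Toy character-trigram language identifier."""

    def __init__(self, samples):
        # samples: dict mapping a BCP-47 code to sample training text
        self.profiles = {}
        for code, text in samples.items():
            counts = char_trigrams(text)
            total = sum(counts.values())
            self.profiles[code] = {g: c / total for g, c in counts.items()}

    def identify(self, text, floor=1e-7):
        # Sum of log relative frequencies, with a floor for unseen trigrams
        def score(profile):
            return sum(math.log(profile.get(g, floor)) * c
                       for g, c in char_trigrams(text).items())
        return max(self.profiles, key=lambda code: score(self.profiles[code]))
```

For example, `LangID({"ga": irish_sample, "gd": gaelic_sample}).identify("Ní raibh sa dearbhú sin")` would return `"ga"` given reasonable training samples.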