An Overview of the Crúbadán Project
Kevin Scannell Saint Louis University 11 August 2014
Background
● I was born in the US, but I speak Irish
● Trained as a mathematician
● Began developing tools for Irish in the 90s
● No corpora for doing statistical NLP
● So I built my own with a crawler
● Crúbadán = crawling thing, from crúb=paw
An Crúbadán: History
● First attempt at crawling Irish web, Jan 1999
● 50M words of Welsh for historical dict., 2004
● ~150 minority languages, 2004-2007
● ~450 languages for WAC3, 2007
● Unfunded through 2011
● Search for “all” languages, started c. 2011
How many languages on the web?
● Two year project ends in December
● Phase one: aggressively seek out new langs
● Phase two: produce free+usable resources
● Current total: 1950
● At least 150 more queued for training
● 2500? 3000?
Languages vs. time
Goals
● Interested in revitalization, first and foremost
● Building useful software for communities
● Blark: word lists, morph. analyzers
● Open data for under-resourced languages
● Linguistic typology
● Linguistic diversity of the web
Spelling and grammar checkers
● Corpus-based Irish spellchecker, 2000
● Grammar checker, 2003
● 28 new spellcheckers since 2004
● Collaborations with native speakers
● All under open source licenses
● Standard open source spellchecking engine
● The default in Firefox and LibreOffice
● Fast and powerful
● Good language support: ~150 languages
● Can be as simple as a word list
● But also supports complex morphology
Morphology in hunspell
● Finite-state transducers (Xerox, HFST, …)
● Very fast + bidirectional
● Cover most morphology of human langs
● Hunspell uses “two-fold affix stripping”
● Morphological analysis only
● Not as powerful, theoretically
● BUT: simple formalism, user support FTW!
Powerful enough
● Hungarian
● Northern Sámi
● Basque
● Lingala, Kinyarwanda, Swahili, Chichewa
● …?
Language ID
● Component and an application of Crúbadán
● Character n-grams + word models
● NLTK 3-gram data set
● Indigenous Tweets and Blogs
Predictive text
● T9 input
● Adaptxt
● Firefox OS
accentuate.us
● Diacritic restoration service
● Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u
● Ẹnì kọ̀ọ̀kan ló ní ẹ̀tọ́ sí òmì nira láti ní ìmọ̀ràn tí ó wù ú
● End-user clients for Firefox, LibreOffice
● Perl, Python, Haskell libraries
● Joint work with Michael Schade
Lexicography
● Geiriadur Prifysgol Cymru
● Foclóir Nua Béarla-Gaeilge
● Foclóir na Nua-Ghaeilge
● SketchEngine
NLP Research
● N-gram language models for MT
● Computational morphology
● Parsing
● OCR (e.g. Irish seanchló)
● Speech recognition/synthesis
N-Gram Models
● Generative language model
● Prob of word conditioned on previous n-1
● Used in “noisy channel” problems
● Speech recognition
● Spelling and grammar correction
Irish standardization
● Irish underwent a spelling reform in the 40's
● Hard to use pre-standard texts for NLP
● Treated this as a statistical MT problem
● Just need an n-gram model for target
● Translation handled via spelling rules
● Plus hand-curated pre-post mappings
Example
● Pre: Ní rabh 'sa dearbhughadh sin acht a chuid uchtaighe, eisean, a h-Aodh féin ag teacht na h-arraicis.
● Post: Ní raibh sa dearbhú sin ach a chuid uchtaí, eisean, a hAodh féin ag teacht ina haraicis.
Scottish Gaelic to Irish MT
● Very similar setup
● SG spelling resembles pre-standard Irish
● Many of the same spelling rules
● Exactly the same n-gram model for target
Example
● Sc: Bha tàladh air choireigin na nàdur a bha a' tarraing a h-uile duine thuice.
● Ir: Bhí mealladh éigin ina nádúr a tharraing gach uile dhuine chuici.
Linguistic research
● Comparative phonology
● Syntax
● Psycholinguistics
● Selectional preferences
● …
Orthotree
● http://indigenoustweets.blogspot.com/2011/12/
● https://github.com/kscanne/orthotree
An Crúbadán: Design principles
● Orthographies, not languages
● Labelled by BCP-47 codes
● en, chr, sr-Latn, de-AT, fr-x-nor, el-Latn-x-chat
● Real, running texts (vs. word lists, GILT)
● Get “everything” for small languages
● Large samples for English, French, etc.
Three modules
● Traditional web crawler
● Twitter crawler
● Blog tracker
Phase 1: Finding new languages
● Lots of web searching!
● Special code monitors WP, JW, UN, bible.is
● Typing/OCR of scanned or offline texts
● Special thanks to Ed Jahn of George Mason
● D. Joosten, J. Berlage, N. Lewchenko
● NSF grant 1159174
Phase 2: Building useful resources
● Separating orthographies/dialects
● Clean boilerplate
● Convert to UTF-8 text + normalize
● Sentence segment and tokenize
● Avoid copyright issues
● Discoverability (OLAC)
UTF-8 Normalization
● Fonts (Sámi, Mongolian, dozens of others)
● Lookalikes (az: ә/ә, bua: γ/ү, ro: ş/ș)
● Shortcuts (haw, mi, etc. äëïöü for āēīōū)
● Encoding issues (tn, nso: ß/š from Latin-2)
● Apostrophe hell: 'ˈ´ʹʻʼʽʾˊˋ˴ʹ΄՚՝᾽᾿´῾‘’‛′‵ꞌ '`
Tokenization
● Default tokenizer (letters in default script)
● Many exceptions: Greek in coo/hur/kab, etc.
● Word internal punctuation (ca: l•l, l·l)
● Initial/final apostrophes or lookalikes
Twitter crawler
● Twitter's REST API
● Search “unique” words
● Crawl follow graph
● Language ID particularly challenging
● Uses word and character models
Blog tracker
● Blogger platform only (for now)
● Works hand-in-hand with traditional crawler
● Registers all blogs with an in-language post
● Tracks all past and future posts
● http://indigenousblogs.com/
Call to action
● > 100 collaborators: speakers, linguists
● Help sort dialects, orthographies
● Tokenization and normalization
● Finding new material for training
● Help create new online material