An Overview of the Crúbadán Project

Kevin Scannell
Saint Louis University
11 August 2014

Background

● I was born in the US, but I speak Irish

● Trained as a mathematician

● Began developing tools for Irish in the 90s

● No corpora for doing statistical NLP

● So I built my own with a crawler

● Crúbadán = crawling thing, from crúb=paw

An Crúbadán: History

● First attempt at crawling Irish web, Jan 1999

● 50M words of Welsh for a historical dictionary, 2004

● ~150 minority languages, 2004-2007

● ~450 languages for WAC3, 2007

● Unfunded through 2011

● Search for “all” languages, started c. 2011

How many languages on the web?

● Two-year project ends in December

● Phase one: aggressively seek out new langs

● Phase two: produce free+usable resources

● Current total: 1,950 languages

● At least 150 more queued for training

● 2500? 3000?

Languages vs. time

[chart: number of languages crawled, plotted over time]

Goals

● Interested in revitalization, first and foremost

● Building useful software for communities

● BLARK (Basic Language Resource Kit): word lists, morphological analyzers

● Open data for under-resourced languages

● Linguistic typology

● Linguistic diversity of the web

Spelling and grammar checkers

● Corpus-based Irish spellchecker, 2000

● Grammar checker for Irish (An Gramadóir), 2003

● 28 new spellcheckers since 2004

● Collaborations with native speakers

● All under open source licenses

Hunspell

● Standard open source spellchecking engine

● The default in Firefox and LibreOffice

● Fast and powerful

● Good language support: ~150 languages

● Can be as simple as a word list

● But also supports complex morphology

Morphology in Hunspell

● Finite-state transducers (Xerox, HFST, …)

● Very fast + bidirectional

● Cover most morphology of human langs

● Hunspell uses “two-fold affix stripping” (sketch below)

● Morphological analysis only

● Not as powerful, theoretically

● BUT: simple formalism, user support FTW!
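A rough illustration of the formalism, as a hypothetical affix file (flags A and B are invented for this example, not taken from any real Crúbadán lexicon). A suffix entry may carry a continuation flag after a slash, so a second suffix can be stripped after the first; that is the “two-fold” part.

    # sample .aff file
    SET UTF-8

    SFX A Y 1
    SFX A 0 able/B .   # -able; /B lets a class-B suffix attach afterwards

    SFX B Y 1
    SFX B 0 s .        # -s

    # sample .dic file (word count, then stems with their flags)
    1
    drink/A            # accepts drink, drinkable, drinkables

Analysis runs the same machinery in reverse: strip -s, then -able, and check that the remaining stem carries the right flags.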

Powerful enough

● Hungarian

● Northern Sámi

● Basque

● Lingala, Kinyarwanda, Swahili, Chichewa

● …?

Language ID

● Both a component and an application of Crúbadán

● Character n-grams + word models (sketch below)

● Source of the NLTK 3-gram data set

● Indigenous Tweets and Blogs
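A minimal character-trigram identifier, sketched in Python (toy profiles and plain count overlap stand in for the real trained models, which also mix in word statistics):

    from collections import Counter

    def trigrams(text):
        """Character 3-grams, with spaces marking word boundaries."""
        t = " " + " ".join(text.lower().split()) + " "
        return Counter(t[i:i+3] for i in range(len(t) - 2))

    def score(text, profile):
        """Overlap between the text's trigram counts and a language profile."""
        return sum(n * profile.get(g, 0) for g, n in trigrams(text).items())

    def identify(text, profiles):
        """Pick the language whose profile best matches the text."""
        return max(profiles, key=lambda lang: score(text, profiles[lang]))

    # toy profiles built from tiny samples; real ones come from crawled corpora
    profiles = {
        "ga": trigrams("tá an teanga seo go maith agus níl aon rud eile ann"),
        "en": trigrams("this language is fine and there is nothing else here"),
    }
    print(identify("níl a fhios agam", profiles))   # expected: ga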

Predictive text input

● Adaptxt

● Firefox OS

accentuate.us

● Diacritic restoration service (sketch below)

● Input (Yoruba, diacritics stripped): Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u

● Restored: Ẹnì kọ̀ọ̀kan ló ní ẹ̀tọ́ sí òmì nira láti ní ìmọ̀ràn tí ó wù ú

● End-user clients for Firefox, LibreOffice

● Perl, Python, Haskell libraries

● Joint work with Michael Schade
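The core idea, as a toy word-level Python sketch (the deployed service uses trained statistical models; strip_marks/train/restore are invented names): strip combining marks from a corpus, then map each bare form back to its most frequent fully-marked form.

    import unicodedata
    from collections import Counter, defaultdict

    def strip_marks(word):
        """Remove combining diacritics: ẹ̀tọ́ -> eto."""
        decomposed = unicodedata.normalize("NFD", word)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    def train(corpus_words):
        """Map each stripped form to its most common marked form."""
        counts = defaultdict(Counter)
        for w in corpus_words:
            counts[strip_marks(w)][w] += 1
        return {bare: c.most_common(1)[0][0] for bare, c in counts.items()}

    def restore(text, table):
        return " ".join(table.get(strip_marks(w), w) for w in text.split())

    table = train("ẹ̀tọ́ ẹ̀tọ́ ìmọ̀ràn láti".split())
    print(restore("eto lati ni imoran", table))   # -> ẹ̀tọ́ láti ni ìmọ̀ràn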

Lexicography

● Geiriadur Prifysgol Cymru

● Foclóir Nua Béarla-Gaeilge

● Foclóir na Nua-Ghaeilge

● SketchEngine

NLP Research

● N-gram language models for MT

● Computational morphology

● OCR (e.g. Irish seanchló, the old Gaelic typeface)

● Speech recognition/synthesis

N-Gram Models

● Generative language model

● Probability of a word conditioned on the previous n-1 words (formula below)

● Used in “noisy channel” problems

● Speech recognition

● Spelling and grammar correction
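In symbols (the standard definitions, nothing Crúbadán-specific):

    P(w_1,\dots,w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1},\dots,w_{i-1})

and the noisy-channel decoder picks the target string t for an observed source s by

    \hat{t} = \arg\max_t P(t \mid s) = \arg\max_t P(s \mid t)\,P(t)

where P(t) is the n-gram language model and P(s | t) models the channel (acoustics, typos, spelling rules, …).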

Irish standardization

● Irish underwent a spelling reform in the 1940s

● Hard to use pre-standard texts for NLP

● Treated this as a statistical MT problem

● Just need an n-gram model for target

● Translation handled via spelling rules

● Plus hand-curated pre/post mappings (sketch below)
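A skeletal Python version of that pipeline (the rules, mappings, and lm_score function are invented placeholders; the real rule set and language model are far richer, and the real system scores whole sentences rather than deciding greedily per word):

    import re

    # a few illustrative pre-standard -> standard spelling rules
    RULES = [
        (r"ughadh$", "ú"),        # dearbhughadh -> dearbhú
        (r"^acht$", "ach"),
        (r"^rabh$", "raibh"),
    ]

    # hand-curated pre/post mappings applied first
    MAPPINGS = {"'sa": "sa"}

    def candidates(word):
        """All spellings reachable from a pre-standard word."""
        forms = {MAPPINGS.get(word, word)}
        for pattern, repl in RULES:
            forms |= {re.sub(pattern, repl, f) for f in forms}
        return forms

    def standardize(sentence, lm_score):
        """Per word, keep the candidate the target n-gram model prefers."""
        out = []
        for w in sentence.split():
            out.append(max(candidates(w), key=lambda c: lm_score(out, c)))
        return " ".join(out)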

Example

● Pre: Ní rabh 'sa dearbhughadh sin acht a chuid uchtaighe, eisean, a h-Aodh féin ag teacht na h-arraicis.

● Post: Ní raibh sa dearbhú sin ach a chuid uchtaí, eisean, a hAodh féin ag teacht ina haraicis.

Scottish Gaelic to Irish MT

● Very similar setup

● SG spelling resembles pre-standard Irish

● Many of the same spelling rules

● Exactly the same n-gram model for target

Example

● Sc: Bha tàladh air choireigin na nàdur a bha a' tarraing a h-uile duine thuice.

● Ir: Bhí mealladh éigin ina nádúr a tharraing gach uile dhuine chuici.

Linguistic research

● Comparative phonology

● Syntax

● Psycholinguistics

● Selectional preferences

● …

Orthotree

● http://indigenoustweets.blogspot.com/2011/12/

● https://github.com/kscanne/orthotree

An Crúbadán: Design principles

● Orthographies, not languages

● Labelled by BCP-47 codes (parsing sketch below)

● en, chr, sr-Latn, de-AT, fr-x-nor, el-Latn-x-chat

● Real, running texts (vs. word lists, GILT)

● Get “everything” for small languages

● Large samples for English, French, etc.
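A rough Python illustration of how such tags decompose (simplified: real BCP-47 also allows extended-language and variant subtags):

    def parse_tag(tag):
        """Split a BCP-47 tag into language / script / region / private parts."""
        parts = tag.split("-")
        info = {"language": parts[0]}
        rest = parts[1:]
        if "x" in rest:                       # private-use subtags follow 'x'
            i = rest.index("x")
            info["private"] = "-".join(rest[i+1:])
            rest = rest[:i]
        for p in rest:
            if len(p) == 4 and p.isalpha():
                info["script"] = p            # e.g. Latn
            elif (len(p) == 2 and p.isalpha()) or (len(p) == 3 and p.isdigit()):
                info["region"] = p            # e.g. AT
        return info

    for tag in ["en", "chr", "sr-Latn", "de-AT", "fr-x-nor", "el-Latn-x-chat"]:
        print(tag, "->", parse_tag(tag))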

Three modules

● Traditional web crawler

● Twitter crawler

● Blog tracker

Phase 1: Finding new languages

● Lots of web searching!

● Special code monitors Wikipedia, jw.org, UN sites, bible.is

● Typing/OCR of scanned or offline texts

● Special thanks to Ed Jahn of George Mason University

● D. Joosten, J. Berlage, N. Lewchenko

● NSF grant 1159174

Phase 2: Building useful resources

● Separating orthographies/dialects

● Strip boilerplate

● Convert to UTF-8 text + normalize

● Sentence segment and tokenize

● Avoid copyright issues

● Discoverability (OLAC)

UTF-8 Normalization

● Legacy font encodings (Sámi, Mongolian, dozens of others)

● Lookalikes (az: ә/ә, bua: γ/ү, ro: ş/ș; sketch below)

● Shortcuts (haw, mi, etc. äëïöü for āēīōū)

● Encoding issues (tn, nso: ß/š from Latin-2)

● Apostrophe hell: 'ˈ´ʹʻʼʽʾˊˋ˴ʹ΄՚՝᾽᾿´῾‘’‛′‵ꞌ '`
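A Python sketch of the repair step (the tables below are tiny invented subsets, and the mapping directions are assumptions; in practice each orthography gets its own table): fix known lookalikes first, then apply Unicode canonical composition.

    import unicodedata

    LOOKALIKES = {
        "az": {"\u04d9": "\u0259"},   # Cyrillic schwa -> Latin schwa (Latin-script az)
        "ro": {"\u015f": "\u0219"},   # s-cedilla -> s-comma
    }

    def clean(text, lang):
        for bad, good in LOOKALIKES.get(lang, {}).items():
            text = text.replace(bad, good)
        # NFC so e.g. "a" + combining macron becomes the single code point ā
        return unicodedata.normalize("NFC", text)

    print(clean("\u015fi", "ro"))     # -> și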

Tokenization

● Default tokenizer (letters in default script; sketch below)

● Many exceptions: Greek in coo/hur/kab, etc.

● Word-internal punctuation (ca: l•l, l·l)

● Initial/final apostrophes or lookalikes
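A Python sketch of such a tokenizer (make_tokenizer is a hypothetical helper; the real code handles per-orthography scripts and apostrophe lookalikes):

    import re

    def make_tokenizer(internal=""):
        """Runs of letters, optionally allowing word-internal punctuation."""
        let = r"[^\W\d_]"                       # any Unicode letter
        mid = "[" + re.escape(internal) + "]?" if internal else ""
        return re.compile(let + "(?:" + mid + let + ")*").findall

    tok_default = make_tokenizer()
    tok_ca = make_tokenizer(internal="·•'")     # Catalan l·l, plus apostrophes
    print(tok_default("l·lusió"))               # -> ['l', 'lusió']
    print(tok_ca("La il·lusió d'ahir"))         # -> ['La', 'il·lusió', "d'ahir"]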

Twitter crawler

● Twitter's REST API

● Search for words “unique” to the target language

● Crawl the follow graph (sketch below)

● Language ID particularly challenging

● Uses word and character models
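The follow-graph crawl, as a breadth-first Python sketch (get_tweets, get_friends, and looks_like_target are hypothetical wrappers around the REST API and the language identifier):

    from collections import deque

    def crawl(seeds, get_tweets, get_friends, looks_like_target):
        """BFS over the follow graph, expanding only through in-language users."""
        seen, found = set(seeds), []
        queue = deque(seeds)
        while queue:
            user = queue.popleft()
            tweets = get_tweets(user)
            hits = sum(looks_like_target(t) for t in tweets)
            if tweets and hits >= 0.2 * len(tweets):   # threshold is illustrative
                found.append(user)
                for friend in get_friends(user):
                    if friend not in seen:
                        seen.add(friend)
                        queue.append(friend)
        return found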

Blog tracker

● Blogger platform only (for now)

● Works hand-in-hand with traditional crawler

● Registers all blogs with an in-language post

● Tracks all past and future posts

● http://indigenousblogs.com/

Call to action

● > 100 collaborators: speakers, linguists

● Help sort dialects, orthographies

● Tokenization and normalization

● Finding new material for training

● Help create new online material