How to link productivity and quality

Andrzej Zydroń CTO XTM Intl

Translating Europe, Warszawa 2016

Language is difficult

Language is organic

Language is diverse

Language is human

30 billion cells, 100 trillion synapses

UG

Morphology Spectrum

From primitive morphology to extremely rich morphology

Information Technology Evolution

• 1945-: Mainframe
• 1975-: Mini
• 1985-: Workstation/PC
• 2000-: Laptop
• 2010-: Tablet
• 2006-: Cloud

The Cloud

A Connected World

Unimaginable Scales

Algorithmic advances

Turing/von Neumann architecture

John von Neumann (1903-1957) and Alan Turing (1912-1954)

von Neumann architecture

von Neumann architecture limitations

The right tool for the job

The Human Brain

30 billion cells, 100 trillion synapses

Von Neumann architecture does not scale

DARPA Cat brain project

Putting Things Into Perspective

Innovation in Translation Technology

• Standards
  ✴ Unicode
  ✴ XML
  ✴ L10N Interoperability (TMX, XLIFF, TBX, TIPP etc.)
  ✴ Quality (QT Launchpad, TAUS DQF)
• Internet
  ✴ Resources (sharing, accessing)
  ✴ Communication
• Automation
  ✴ Translation Management Systems (TMS)
  ✴ Automated file processing
  ✴ Collaborative workflows
  ✴ Connected real-time resource sharing
• Advanced algorithmic technology
  ✴ Web Services
  ✴ Voice Recognition
  ✴ NLP
  ✴ SMT, NMT
  ✴ POS analysers
  ✴ Stemmers
  ✴ Terminology extraction (monolingual, bilingual)
  ✴ High quality dictionary based bilingual text alignment
  ✴ Linked data lexicons

Why Standards?

Why have Standards?

ISO Standard

Standards = Efficiency

Standards = Lower Costs

Standards = Safe to Implement

Standards = Greater Interoperability

Standards: Unforeseen Benefits

Standards: Misuse

Standards: Abuse

Standards: Sabotage

L10N Standards

• Encoding
  – Unicode
    • 16- and 32-bit encoding
    • TR 29 - Word Boundaries
  – ISO 639, ISO 3166
  – IETF BCP 47 - Locale
• Descriptive
  – W3C ITS - Internationalization Tag Set

• Exchange Standards
  – TMX - LISA OSCAR
    • Translation Memory Exchange
  – TBX, TBX Link, TBX Basic - LISA OSCAR
    • Terminology Exchange
  – SRX - LISA OSCAR
    • Segmentation rules exchange
  – GMX - LISA OSCAR
    • Metrics Exchange (Volume, Complexity, Quality)
  – XLIFF - OASIS
    • XML Localization Interchange File Format

• Interoperability
  – Translation Web Services - OASIS
  – Interoperability Now - XLIFF:doc, TIPP
  – Linport - Language Interoperability Portfolio
• Reuse
  – DITA - OASIS - Darwin Information Typing Architecture
    • Topic level document granularity (Reference, Concept, Task)
  – xml:tm - LISA OSCAR
    • Sentence granularity

• Architectural
  • OAXAL - OASIS
    ✴ Open Architecture for XML Authoring and Localization
    ✴ Brings all of the L10N standards together in one architectural framework
    ✴ OASIS Reference Architecture Standard

• Quality Measurement
  – MQM - QT Launchpad
  – TAUS DQF

Core L10N Standards 2016

• W3C ITS Document Rules
• GALA SRX
• ETSI LIS xml:tm
• ETSI LIS TMX
• ETSI LIS TBX
• ETSI LIS GMX-V
• OASIS XLIFF
• W3C/OASIS DITA (XHTML, DocBook, or any XML Vocabulary)
• Linport Interoperability: TIPP XLIFF:doc
• OASIS OAXAL
• Unicode
• QT Launchpad
• TAUS DQF

OAXAL 2.0

• Open Architecture for XML Authoring and Localization (OAXAL)
  – http://wiki.oasis-open.org/oaxal/FrontPage

Putting Things Into Perspective

Process Automation: Translation Management Systems

TMS: Raises Quality

• Process automation
• Significantly reduced costs
• Reduced turnaround times
• Eliminate repetitive administrative tasks
• All data is immediately available
  ✴ Terminology
  ✴ TM
• More secure
• JIT and ‘never ending projects’
• Built-in quality assessment

Improving Quality

• Standards for MQM
  ✴ QT Launchpad
  ✴ TAUS DQF
• Integration of MQM with workflow
• Process automation
• Interactively shared data - consistency
  ✴ Terminology
  ✴ TM
• Terminology extraction

QT Launchpad

TAUS DQF

Normalised with QT Launchpad

• Content Profiling and Knowledge Base
• DQF Tools
• Quality Dashboard
• API

Translation Tool Improvements

• Predictive Typing
• Voice input
• Fuzzy matching completion
• Concordance
• Automated alignment
• Terminology extraction
• Automatic terminology insertion
• QA tools
  ✴ Spelling
  ✴ Grammar
  ✴ Omissions etc.

Automated alignment

MT Development

– Rule based 1950+
– Statistical Word Based 2000
– Statistical Phrase Based 2008
– Statistical Hybrid Word/Phrase + Grammar 2012
– Statistical Deep Learning: 2015
  • Neural Network
  • Powerful Language Models
  • Dictionary, disambiguation (BabelNet)

Statistical MT

Neural MT

SMT: No problem can be solved from the same consciousness that they have arisen

NMT: Problems can never be solved with the same way of thinking that caused them.

Neural MT

NMT predicts a target word based on the context associated with source and previously generated target words.
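One common way to build such a context is dot-product attention: each source word is scored against the current decoder state, the scores are normalised into weights, and the context is the weighted average of the source word vectors. The following is a minimal sketch in Python, not the implementation of any particular NMT system. The function names, the two-dimensional vectors and the decoder state are hypothetical values chosen for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, source_vectors):
    """Return (weights, context) for a single decoding step.

    decoder_state: vector summarising the target words generated so far.
    source_vectors: one vector per source word.
    """
    # Score each source word by its dot product with the decoder state.
    scores = [sum(d * s for d, s in zip(decoder_state, vec))
              for vec in source_vectors]
    weights = softmax(scores)
    # The context is the weighted average of the source word vectors.
    dims = len(decoder_state)
    context = [sum(w * vec[i] for w, vec in zip(weights, source_vectors))
               for i in range(dims)]
    return weights, context


# Hypothetical toy input: three source words, 2-dimensional vectors.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(state, source)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

In this toy input the first and third source words align with the decoder state, so they receive equal and larger weights than the second word. A real system learns the vectors and the scoring function from training data.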

An attention mechanism is used to analyse the context for every source word.

NMT Assessment

Joss Moorkens (DCU, ASLING TC38):

• Improved translation quality for morphologically rich languages
• Fluency is improved, word order errors are fewer
• Fewer segments require editing
• Fewer morphological errors
• No clear improvement for omission or mistranslation
• Mistakes can be harder to spot
• NMT for production: no great improvement in post-editing throughput

Limits of MT technology

The limitations of computational linguistics:

• Syntax
• Morphology
• Grammar
• Language model size
• Training set quality and size: data dilution
• Domain similarity
• Homographs, Polysemy
• OOV: Out of Vocabulary words
• Word and phrase alignment

SMT

Limitations:

– More data != better performance
– Diminishing returns

John Searle’s Chinese Room

The Ultimate MT Limitation

In Order to Translate you need to UNDERSTAND

How can we assess the potential productivity savings by using MT?

Current MT providers answer: Well it depends…

Theoretical limits of MT

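[Figure: theoretical limits of MT. Axis labels: Quality and Language Similarity, with scales from 0 to 1. Language pairs in order of decreasing closeness: en-US > en-GB, en-US > fr-FR, en-US > de-DE, en-US > ru-RU. Other labels: HT, SMT, Morphology, Delta, Language Closeness.]

Training Set Size Factor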

Where Size is the actual training data size and Size' is an empirical value which makes TSSF equal 0.5.

Estimating Percentage Reduction in Translator Effort (PRTE)

PRTE = (LC x TSSF x DMS) x 100%

PRTE Calculation Examples
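The PRTE formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides. The function name `prte` is hypothetical, and the check that each factor lies between 0 and 1 is an assumption based on the values used in this deck. The factor values are the ones used in the worked examples in this section.

```python
def prte(lc: float, tssf: float, dms: float) -> float:
    """Percentage Reduction in Translator Effort: (LC x TSSF x DMS) x 100."""
    # Assumption: every factor is a fraction between 0 and 1.
    for name, value in (("LC", lc), ("TSSF", tssf), ("DMS", dms)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return lc * tssf * dms * 100.0


# (LC, TSSF, DMS) per language pair, taken from the worked examples.
examples = {
    "en-US > en-GB": (1.0, 1.0, 1.0),
    "en-US > fr-FR": (0.8, 0.75, 1.0),
    "en-US > ja-JP": (0.2, 1.0, 1.0),
}

for pair, factors in examples.items():
    # Format to whole percent to hide floating point rounding noise.
    print(f"{pair}: PRTE = {prte(*factors):.0f}%")
```

Because the three factors are multiplied, a low value in any one of them caps the result, however good the other two are.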

• Translating from en-US to en-GB we can assume an LC value of 1. If we have an ideal reference TSSF of 1 and an ideal DMS of 1, we arrive at a PRTE of: 1 x 1 x 1 x 100 = 100%

• Translating from en-US to fr-FR we can assume an LC value of 0.8. If we have a slightly less than ideal TSSF of 0.75 but an ideal DMS of 1, we arrive at a PRTE of: 0.8 x 0.75 x 1 x 100 = 60%

• Translating from en-US to ja-JP we can assume an LC value of 0.2. If we have an ideal TSSF value of 1 and an ideal DMS of 1, we arrive at a PRTE of: 0.2 x 1 x 1 x 100 = 20%

Question and Answer session

Better Translation Technology

Contact Details

XTM International www.xtm-intl.com

Register for future Webinar sessions www.xtm-intl.com/demos

Contact [email protected] +44 (0) 1753 480 479