How to link productivity and quality
Andrzej Zydroń CTO XTM Intl
Translating Europe, Warszawa 2016
Language is difficult
Language is organic
Language is diverse
Language is human
30 billion cells, 100 trillion synapses
UG
Morphology Spectrum: from primitive morphology to extremely rich
Information Technology Evolution
Mainframe (1945-), Mini (1975-), Workstation/PC (1985-), Laptop (2000-), Tablet (2010-), Cloud (2006-)
The Cloud
A Connected World
Unimaginable Scales
Algorithmic advances
Turing/von Neumann architecture
John von Neumann (1903-1957), Alan Turing (1912-1954)
von Neumann architecture
von Neumann architecture limitations
The right tool for the job
The Human Brain
30 billion cells, 100 trillion synapses
von Neumann architecture does not scale
DARPA Cat brain project
Putting Things Into Perspective
Innovation in Translation Technology
• Standards
  ✴ Unicode
  ✴ XML
  ✴ L10N Interoperability (TMX, XLIFF, TBX, TIPP etc.)
  ✴ Quality: QT Launchpad, TAUS DQF
• Internet
  ✴ Resources (sharing, accessing)
  ✴ Communication
• Automation
  ✴ Translation Management Systems (TMS)
  ✴ Automated file processing
  ✴ Collaborative workflows
  ✴ Connected real-time resource sharing
• Advanced algorithmic technology
  ✴ Web Services
  ✴ Voice Recognition
  ✴ NLP
  ✴ SMT, NMT
  ✴ POS analysers
  ✴ Stemmers
  ✴ Terminology extraction (monolingual, bilingual)
  ✴ High-quality dictionary-based bilingual text alignment
  ✴ Linked data lexicons
Why Standards?
Why have Standards?
ISO Standard
Standards = Efficiency
Standards = Lower Costs
Standards = Safe to Implement
Standards = Greater Interoperability
Standards: Unforeseen Benefits
Standards: Misuse
Standards: Abuse
Standards: Sabotage
L10N Standards
• Encoding
  – Unicode
    • 16 and 32 bit encoding
    • TR 29 - Word Boundaries
  – ISO 639, ISO 3166
  – IETF BCP 47
  – Locale
• Descriptive
  – W3C ITS - Internationalization Tag Set
• Exchange Standards
  – TMX - LISA OSCAR
    • Translation Memory Exchange
  – TBX, TBX Link, TBX Basic - LISA OSCAR
    • Terminology Exchange
  – SRX - LISA OSCAR
    • Segmentation rules exchange
  – GMX - LISA OSCAR
    • Metrics Exchange (Volume, Complexity, Quality)
  – XLIFF - OASIS
    • XML Localization Interchange File Format
• Interoperability
  – Translation Web Services - OASIS
  – Interoperability Now - XLIFF:doc, TIPP
  – Linport - Language Interoperability Portfolio
• Reuse
  – DITA - OASIS - Darwin Information Typing Architecture
    • Topic-level document granularity (Reference, Concept, Task)
  – xml:tm - LISA OSCAR
    • Sentence granularity
• Architectural
  – OAXAL - OASIS
    ✴ Open Architecture for XML Authoring and Localization
    ✴ Brings all of the L10N standards together in one architectural framework
    ✴ OASIS Reference Architecture Standard
• Quality Measurement
  – MQM - QT Launchpad
  – TAUS DQF
Core L10N Standards 2016
• W3C ITS Document Rules
• GALA SRX
• ETSI LIS xml:tm
• ETSI LIS TMX
• ETSI LIS TBX
• ETSI LIS GMX-V
• OASIS XLIFF
• W3C/OASIS DITA (XHTML, DocBook, or any XML Vocabulary)
• Linport Interoperability: TIPP, XLIFF:doc
• OASIS OAXAL
• Unicode
• QT Launchpad
• TAUS DQF
OAXAL 2.0
• Open Architecture for XML Authoring and Localization (OAXAL)
  – http://wiki.oasis-open.org/oaxal/FrontPage
Putting Things Into Perspective
Process Automation: Translation Management Systems
TMS: Raises Quality
• Process automation
• Significantly reduced costs
• Reduced turnaround times
• Eliminate repetitive administrative tasks
• All data is immediately available
  ✴ Terminology
  ✴ TM
• More secure
• JIT and ‘never ending projects’
• Built-in quality assessment
Improving Quality
• Standards for MQM
  ✴ QT Launchpad
  ✴ TAUS DQF
• Integration of MQM with workflow
• Process automation
• Interactively shared data - consistency
  ✴ Terminology
  ✴ TM
• Terminology extraction
QT Launchpad
TAUS DQF
Normalised with QT Launchpad
• Content Profiling and Knowledge Base
• DQF Tools
• Quality Dashboard
• API
Translation Tool Improvements
• Predictive Typing
• Voice input
• Fuzzy matching completion
• Concordance
• Automated alignment
• Terminology extraction
• Automatic terminology insertion
• QA tools
  ✴ Spelling
  ✴ Grammar
  ✴ Omissions etc.
Automated alignment
MT Development
– Rule based: 1950+
– Statistical Word Based: 2000
– Statistical Phrase Based: 2008
– Statistical Hybrid Word/Phrase + Grammar: 2012
– Statistical Deep Learning: 2015
  • Neural Network
  • Powerful Language Models
  • Dictionary, disambiguation (BabelNet)
Statistical MT
Neural MT
SMT: No problem can be solved from the same consciousness that they have arisen
NMT: Problems can never be solved with the same way of thinking that caused them.
Neural MT
NMT predicts a target word based on the context associated with the source and the previously generated target words.
An attention mechanism is used to analyse the context for every source word.
NMT Assessment
Joss Moorkens (DCU, ASLING TC38):
• Improved translation quality for morphologically rich languages
• Fluency is improved, word order errors are fewer
• Fewer segments require editing
• Fewer morphological errors
• No clear improvement for omission or mistranslation
• Mistakes can be harder to spot
• NMT for production: no great improvement in post-editing throughput
Limits of MT technology
The limitations of computational linguistics:
• Syntax
• Morphology
• Grammar
• Language model size
• Training set quality and size: data dilution
• Domain similarity
• Homographs, Polysemy
• OOV: Out of Vocabulary words
• Word and phrase alignment
SMT
Limitations:
– More data != better performance
– Diminishing returns
John Searle’s Chinese Room
The Ultimate MT Limitation
In Order to Translate you need to UNDERSTAND
How can we assess the potential productivity savings from using MT?
Current MT providers' answer: Well, it depends…
Theoretical limits of MT
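[Figure: Theoretical limits of MT. Axis label: Quality (0 to 1). Series labels: HT, SMT. Annotations: Morphology; Delta = Language Closeness. Language pairs shown: en-US > en-GB, en-US > fr-FR, en-US > de-DE, en-US > ru-RU.]
Language Similarity
Training Set Size Factor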
Where Size is the actual training data size and Size' is an empirical value that makes TSSF equal 0.5.
Estimating Percentage Reduction in Translator Effort (PRTE)
PRTE = (LC x TSSF x DMS) x 100%
PRTE Calculation Examples
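The PRTE formula can be sketched in a few lines of Python. This is only an illustration: the function name `prte` and the check that each factor lies between 0 and 1 are assumptions made here, not part of the slides, and the input triples are the LC, TSSF and DMS values quoted in the worked examples on this slide.

```python
def prte(lc: float, tssf: float, dms: float) -> float:
    """Percentage Reduction in Translator Effort: (LC x TSSF x DMS) x 100."""
    # Assumption: every factor is a score between 0 and 1, with 1 being ideal.
    for name, value in (("lc", lc), ("tssf", tssf), ("dms", dms)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return lc * tssf * dms * 100.0


# (LC, TSSF, DMS) triples taken from the worked examples on the slide.
examples = {
    "en-US > en-GB": (1.0, 1.0, 1.0),
    "en-US > fr-FR": (0.8, 0.75, 1.0),
    "en-US > ja-JP": (0.2, 1.0, 1.0),
}

for pair, (lc, tssf, dms) in examples.items():
    # Format to whole percentages to hide floating-point rounding noise.
    print(f"{pair}: PRTE = {prte(lc, tssf, dms):.0f}%")
```

Because the three factors are multiplied, a low score on any one of them caps the estimate regardless of how good the other two are.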
• Translating from en-US to en-GB we can assume an LC value of 1. If we have an ideal reference TSSF of 1 and an ideal DMS of 1, we arrive at a PRTE of: 1 x 1 x 1 x 100 = 100%
• Translating from en-US to fr-FR we can assume an LC value of 0.8. If we have a slightly less than ideal TSSF of 0.75 but an ideal DMS of 1, we arrive at a PRTE of: 0.8 x 0.75 x 1 x 100 = 60%
• Translating from en-US to ja-JP we can assume an LC value of 0.2. If we have an ideal TSSF value of 1 and an ideal DMS of 1, we arrive at a PRTE value of: 0.2 x 1 x 1 x 100 = 20%
Question and Answer session
Better Translation Technology
Contact Details
XTM International www.xtm-intl.com
Register for future Webinar sessions www.xtm-intl.com/demos
Contact [email protected] +44 (0) 1753 480 479