How to Link Productivity and Quality Andrzej Zydron
Andrzej Zydroń, CTO, XTM Intl
Translating Europe, Warszawa 2016

Language is difficult
Language is organic
Language is diverse
Language is human
30 billion cells, 100 trillion synapses

UG

Morphology Spectrum
From primitive morphology to extremely rich

Information Technology Evolution
• 1945-: Mainframe
• 1975-: Mini
• 1985-: Workstation/PC
• 2000-: Laptop
• 2010-: Tablet
• 2006-: Cloud

Information Technology Evolution
• The Cloud
• A Connected World
• Unimaginable Scales
• Algorithmic advances

Turing/von Neumann architecture
John von Neumann (1903-1957), Alan Turing (1912-1954)

von Neumann architecture

von Neumann architecture limitations

The right tool for the job

The Human Brain
30 billion cells, 100 trillion synapses
Von Neumann architecture does not scale
DARPA Cat brain project

Putting Things Into Perspective

Innovation in Translation Technology
• Standards
  ✴ Unicode
  ✴ XML
  ✴ L10N Interoperability (TMX, XLIFF, TBX, TIPP etc.)
  ✴ Quality: QT Launchpad, TAUS DQF
• Internet
  ✴ Resources (sharing, accessing)
  ✴ Communication
• Automation
  ✴ Translation Management Systems (TMS)
  ✴ Automated file processing
  ✴ Collaborative workflows
  ✴ Connected real time resource sharing
• Advanced algorithmic technology
  ✴ Web Services
  ✴ Voice Recognition
  ✴ NLP
  ✴ SMT, NMT
  ✴ POS analysers
  ✴ Stemmers
  ✴ Terminology extraction (monolingual, bilingual)
  ✴ High quality dictionary based bilingual text alignment
  ✴ Linked data lexicons

Why Standards?
Why have Standards?
ISO Standard
Standards = Efficiency
Standards = Lower Costs
Standards = Safe to Implement
Standards = Greater Interoperability
Standards: Unforeseen Benefits
Standards: Misuse
Standards: Abuse
Standards: Sabotage

L10N Standards
• Encoding
  – Unicode
    • 16 and 32 bit encoding
    • TR 29 - Word Boundaries
  – ISO 639, ISO 3166
  – IETF BCP 47 – Locale
• Descriptive
  – W3C ITS - Internationalization and Localization Tag Set

L10N Standards
• Exchange Standards
  – TMX – LISA OSCAR
    • Translation Memory Exchange
  – TBX, TBX Link, TBX Basic - LISA OSCAR
    • Terminology Exchange
  – SRX - LISA OSCAR
    • Segmentation rules exchange
  – GMX – LISA OSCAR
    • Metrics Exchange (Volume, Complexity, Quality)
  – XLIFF - OASIS
    • XML Localization Interchange File Format

L10N Standards
• Interoperability
  – Translation Web Services - OASIS
  – Interoperability Now - XLIFF:doc, TIPP
  – Linport - Language Interoperability Portfolio
• Reuse
  – DITA – OASIS - Darwin Information Typing Architecture
    • Topic level document granularity (Reference, Concept, Task)
  – xml:tm – LISA OSCAR
    • Sentence granularity

L10N Standards
• Architectural
• OAXAL - OASIS
  ✴ Open Architecture for XML Authoring and Localization
  ✴ Brings all of the L10N standards together in one architectural framework
  ✴ OASIS Reference Architecture Standard

L10N Standards
• Quality Measurement
  – MQM - QT Launchpad
  – TAUS DQF

Core L10N Standards 2016
• W3C ITS Document Rules
• GALA SRX
• ETSI LIS xml:tm
• ETSI LIS TMX
• ETSI LIS TBX
• ETSI LIS GMX-V
• OASIS XLIFF
• W3C/OASIS DITA (XHTML, DocBook, or any XML Vocabulary)
• Linport Interoperability: TIPP XLIFF:doc
• OASIS OAXAL
• Unicode
• QT Launchpad
• TAUS DQF

OAXAL 2.0
• Open Architecture for XML Authoring and Localization (OAXAL)
  – http://wiki.oasis-open.org/oaxal/FrontPage

Putting Things Into Perspective

Process Automation: Translation Management Systems

TMS: Raises Quality
• Process automation
• Significantly reduced costs
• Reduced turnaround times
• Eliminate repetitive administrative tasks
• All data is immediately available
  ✴ Terminology
  ✴ TM
• More secure
• JIT and ‘never ending projects’
• Built-in quality assessment

Improving Quality
• Standards for MQM
  ✴ QT Launchpad
  ✴ TAUS DQF
• Integration of MQM with workflow
• Process automation
• Interactively shared data - consistency
  ✴ Terminology
  ✴ TM
• Terminology extraction

QT Launchpad

TAUS DQF
• Normalised with QT Launchpad
• Content Profiling and Knowledge Base
• DQF Tools
• Quality Dashboard
• API

Translation Tool Improvements
• Predictive Typing
• Voice input
• Fuzzy matching completion
• Concordance
• Automated alignment
• Terminology extraction
• Automatic terminology insertion
• QA tools
  ✴ Spelling
  ✴ Grammar
  ✴ Omissions etc.

Automated alignment

MT Development
– Rule based 1950+
– Statistical Word Based 2000
– Statistical Phrase Based 2008
– Statistical Hybrid Word/Phrase + Grammar 2012
– Statistical Deep Learning: 2015
  • Neural Network
  • Powerful Language Models
  • Dictionary, disambiguation (BabelNet)

Statistical MT and Neural MT
SMT: No problem can be solved from the same consciousness that they have arisen
NMT: Problems can never be solved with the same way of thinking that caused them.
Neural MT
NMT predicts a target word based on the context associated with the source and the previously generated target words.
An attention mechanism is used to analyse the context for every source word.

NMT Assessment
Joss Moorkens (DCU, ASLING TC38):
• Improved translation quality for morphologically rich languages
• Fluency is improved, word order errors are fewer
• Fewer segments require editing
• Fewer morphological errors
• No clear improvement for omission or mistranslation
• Mistakes can be harder to spot
• NMT for production: no great improvement in post-editing throughput

Limits of MT technology
The limitations of computational linguistics:
• Syntax
• Morphology
• Grammar
• Language model size
• Training set quality and size: data dilution
• Domain similarity
• Homographs, Polysemy
• OOV: Out of Vocabulary words
• Word and phrase alignment
SMT Limitations:
– More data != better performance
– Diminishing returns

John Searle’s Chinese Room

The Ultimate MT Limitation
In Order to Translate you need to UNDERSTAND

How can we assess the potential productivity savings by using MT?
Current MT providers' answer: Well, it depends…

Theoretical limits of MT
[Figure: quality axis from 0 to 1, with HT at 1. Labels: en-US > en-GB, en-US > fr-FR, en-US > de-DE, en-US > ru-RU, Morphology, Language Closeness, SMT Delta.]

Language Similarity

Training Set Size Factor (TSSF)
[Formula: TSSF expressed in terms of Size and Size']
where Size is the actual training data size and Size' is an empirical value which makes TSSF equal 0.5.

Estimating Percentage Reduction in Translator Effort (PRTE)
PRTE = (LC x TSSF x DMS) x 100%

PRTE Calculation Examples
• Translating from en-US to en-GB, we can assume an LC value of 1. If we have an ideal reference TSSF of 1 and an ideal DMS of 1, we arrive at a PRTE of: 1 x 1 x 1 x 100 = 100%
• Translating from en-US to fr-FR, we can assume an LC value of 0.8. If we have a slightly less than ideal TSSF of 0.75 but an ideal DMS of 1, we arrive at a PRTE of: 0.8 x 0.75 x 1 x 100 = 60%
• Translating from en-US to ja-JP, we can assume an LC value of 0.2. If we have an ideal TSSF value of 1 and an ideal DMS of 1, we arrive at a PRTE of: 0.2 x 1 x 1 x 100 = 20%

Question and Answer session

Better Translation Technology

Contact Details
XTM International
www.xtm-intl.com
Register for future Webinar sessions: www.xtm-intl.com/demos
Contact: [email protected]
+44 (0) 1753 480 479
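
The PRTE formula and the three calculation examples from the slides can be sketched in Python. This is only an illustration under stated assumptions: the function name `prte` is hypothetical, LC is read as the Language Closeness factor and TSSF as the Training Set Size Factor, DMS is used exactly as given in the formula, and the check that each factor lies between 0 and 1 is an assumption inferred from the example values rather than something the slides state.

```python
def prte(lc: float, tssf: float, dms: float) -> float:
    """Return PRTE = (LC x TSSF x DMS) x 100, as a percentage.

    Assumption: each factor is a fraction between 0 and 1, as in the
    worked examples; the slides do not state this range explicitly.
    """
    for name, value in (("lc", lc), ("tssf", tssf), ("dms", dms)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return lc * tssf * dms * 100.0


# The three language pairs and factor values used in the slide examples,
# given as (LC, TSSF, DMS).
examples = {
    "en-US > en-GB": (1.0, 1.0, 1.0),
    "en-US > fr-FR": (0.8, 0.75, 1.0),
    "en-US > ja-JP": (0.2, 1.0, 1.0),
}

for pair, factors in examples.items():
    print(f"{pair}: PRTE = {prte(*factors):.0f}%")
```

Because the three factors are multiplied, the smallest one bounds the result: with an LC of 0.2, even ideal training data and an ideal DMS cannot lift the estimate above 20%.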