Segmentation Rules SRX – Segmentation Rules Exchange

A Guide to Open Standards and Open Source
A Conceptual Case Study
Angelika Zerfass
David Filip, Ph.D.
© 2009 Moravia IT a.s. and Angelika Zerfass

Agenda
1. Polling Questions
2. Definitions
------------------------
3. Architecture considerations
4. Strategy
5. Open Standards
6. Talking Legalese
7. Open Tools
8. Usage Cases

1. Polling Questions
• (A) What is your level of experience with localization industry open standards (such as the XML-based TMX, TBX, SRX, and XLIFF standards)?
  * I know these standards well and see them regularly in the work done at my organization.
  * I have a basic understanding of localization industry open standards.
  * I'm new to localization industry open standards and want to learn more.

1. Polling Questions
• (B) How familiar are you with open source applications used in the localization industry (such as OmegaT, Okapi Framework, Sun XLIFF Translation Editor)?
  * I'm familiar with these tools and use them (or tools like them) regularly.
  * I have a basic understanding of these applications, but don't really use them.
  * I'm new to the idea of open source tools for the localization industry and want to learn more.

1. Polling Questions
• (C) Which are important to you?
  * Learning about the differences between open standards and open source.
  * Learning about the actual use of open standards and commonly used tools.
  * Learning about licensing and patent issues.
  * Learning about the open Translation Management Systems in use or development today.

Definitions
The magic quadrant
                      Closed Source                  Open Source
Open Standards        Q1 Open-Closed: Good           Q2 Open-Open: Good
Proprietary ways      Q3 Proprietary-Closed: Bad     Q4 Proprietary-Open: Wild

2. Definitions
• TMS, GMS
  ETMS – Enterprise TMS, "from cradle to the grave"
  Computer Aided L10N Project Management System (CALPMS)
• Open Standards, XLIFF, TMX, TBX, SRX etc.
• OSS, Open Source, Free Software vs. Freeware
• Open Source (Copy-left) Licensing vs. Permissive Licensing

3. Architecture

4. Strategy
• Win Translator, Win LSP of any size, Win Enterprise
• Change: exponential growth of content; changing balance between published and user generated content
• New business needs: need for continuous translation, community translation, shared language data, massive online collaboration, translation automation
• Enabler: TinyTM, OmegaT, Open ACS, OKAPI framework, etc.

What is an open standard?
World Wide Web Consortium's definition
• Transparency (design/due process is public, and all technical discussions, meeting minutes, are archived and referencable in decision making)
• Relevance (new standardization is started upon due analysis of the market needs, including requirements phase, e.g. accessibility, multi-linguism)
• Openness (anyone can participate: industry, individual, public, government bodies, academia, on a worldwide scale)
• Impartiality and consensus (neutral org leading it, with equal weight for each participant)
• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)
• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)
Wikipedia, 2009

Goal of open standards
• Interoperability of tools
• Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)

Success of open standards
• Depends on the commercial usability
• TMX – widespread, XLIFF – coming on strong, SRX – not widely used, TBX – slow, others – in the making (TBX Basic, GMX…)

5. Open Standards
• Why Open Standards in Open Source?
• Implementing open standards seems an obvious success scenario for OSS development
• XLIFF and TMX are open standards co-developed by our clients
• Minimalist open standards implementation ensures desired functionality and is also legally safe
• LISA OSCAR TMX 1.4b, 1.5?, 2.0?
• OASIS XLIFF 1.1, 1.2, 1.2.1?, 2.0?

Open Standards
OAXAL (figure © Andrzej Zydron, OASIS OAXAL TC)

TMX
Translation Memory Exchange
• From the TMX specification:
• …The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…

What is TMX
• It is an XML representation of translation memory data
• Header
• Body

<header
  creationtool="Déjà Vu"         Déjà Vu, Transit, Trados, MemoQ
  creationtoolversion="4"        Version / build number of the tool
  datatype="PlainText"           HTML, SGML, RTF, Interleaf, Java…
  segtype="sentence"             Basic segmentation
  adminlang="en-us"              Default language for elements like <note>
  srclang="en-us"                Source text language
  o-tmf="DVMDB" >                Original translation memory format (DVMDB – Déjà Vu database…)

What is TMX
• Body

<body>
  <tu creationdate="20030915T153704Z" creationid="USER">
    <tuv lang="EN-US">
      <seg>This is the first sentence.</seg>
    </tuv>
    <tuv lang="DE-DE">
      <seg>Dies ist der erste Satz</seg>
    </tuv>
  </tu>
</body>

tu = translation unit, tuv lang = translation unit variant (language), seg = segment

What is TMX
• Depending on the tool that created the TMX file, it can be bilingual or multilingual.
• Importing a multilingual TMX file into a bilingual project will only import the relevant languages

Levels of TMX
• Level 1:
  • Plain text only (sufficient for data coming from software localization tools)
• Level 2:
  • Text plus formatting (data coming from translation memory tools used for translation of documentation)
To move formatting and text from one tool to the other, both tools need to be Level 2 compliant!
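The header/body structure described above can be read with a few lines of Python. This is only a minimal sketch, not how any particular TM tool imports TMX: the function name `read_units`, the `<tmx>` wrapper element around the slide's fragments, and the self-closing `<header/>` are assumptions made so that the sample is a complete document. The slide writes the language attribute as `lang`; TMX 1.4 files use `xml:lang`, so the sketch accepts either. Passing `wanted_langs` mimics importing only the relevant languages of a multilingual file, and joining the segment's text nodes yields the plain text of a segment, roughly what a Level 1 exchange carries.

```python
import xml.etree.ElementTree as ET

# Sample built from the header and body shown on the slides.
# Assumption: the <tmx> root element and the self-closing header are added here
# so that the fragment parses as one document.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="Déjà Vu" creationtoolversion="4" datatype="PlainText"
          segtype="sentence" adminlang="en-us" srclang="en-us" o-tmf="DVMDB"/>
  <body>
    <tu creationdate="20030915T153704Z" creationid="USER">
      <tuv lang="EN-US"><seg>This is the first sentence.</seg></tuv>
      <tuv lang="DE-DE"><seg>Dies ist der erste Satz</seg></tuv>
    </tu>
  </body>
</tmx>"""

# Fully qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_units(tmx_text, wanted_langs=None):
    """Return one {language: plain text} dict per translation unit (hypothetical helper)."""
    root = ET.fromstring(tmx_text)
    wanted = {lang.lower() for lang in wanted_langs} if wanted_langs else None
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            # Accept both spellings of the language attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            if wanted is not None and lang not in wanted:
                continue  # skip languages the importing project does not use
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() drops inline elements and keeps only the text.
                variants[lang] = "".join(seg.itertext())
        if variants:
            units.append(variants)
    return units


header = ET.fromstring(TMX_SAMPLE).find("header")
print(header.get("creationtool"), header.get("srclang"))
print(read_units(TMX_SAMPLE))
print(read_units(TMX_SAMPLE, wanted_langs=["en-us"]))
```

Language codes are lower-cased before comparison because the slides themselves mix `en-us` in the header with `EN-US` on the variants.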
Level 1
• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text.
• Original
  • This sentence has some formatting.
• In TMX
  • This sentence has some formatting.

Level 2
• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file.
• Different tools use different ways of encoding that information (placeholders or actual formatting information).

Level 2
MemoQ – Word DOC with formatting
<seg> This is the <bpt i='1' type='bold'>{}</bpt>first<ept i='1'>{}</ept> sentence; this is <bpt i='2' type='ulined'>{}</bpt>another<ept i='2'>{}</ept> sentence. </seg>
Trados 2007 8.2 / 8.3 – Word DOC with formatting
<seg> This is the <bpt i="1"><cf bold="on"></bpt>first<ept i="1"></cf></ept> sentence; this is <bpt i="2"><cf underlinestyle="single"></bpt>another<ept i="2"></cf></ept> sentence. </seg>
Trados 2009 – Word DOC with formatting
<seg> This is the <bpt i="1" type="Bold" />first<ept i="1" /> sentence; this is<bpt i="2" type="Underline" />another<ept i="2" /> sentence. </seg>

Level 2
MemoQ – HTML file with link
<seg> Text with a link to <bpt i='1'><a href="http://www.samplehtml.com/page1.htm"></bpt>another page<ept i='1'></a></ept>. </seg>
Trados 2007 8.2 / 8.3 – HTML file with link
<seg> Text with a link to <bpt i="1" type="link"><a href = "http://www.samplehtml.com/page1.htm"></bpt>another page<ept i="1"></a></ept>. </seg>
Trados 2009 – HTML file with link
<seg> Text with a link to <bpt i="1" type="19" x="1" />another page<ept i="1" />. </seg>
OmegaT – HTML file with link
OmegaT internal format:
<seg>Text with a link to <a0>another page</a0>.</seg>
TMX Level 2 format:
<seg>Text with a link to <bpt i='0' x='0'><a0></bpt>another page<ept i='0'></a0></ept>.</seg>

Level 2
MemoQ – InDesign
<seg> InDesign Text with <bpt i='1' type='bold'>{}</bpt>formatting in bold<ept i='1'>{}</ept>. </seg>
Trados 2007 8.2 / 8.3 – InDesign
<seg> InDesign text with <bpt i="1"><cf ptfs="c_Bold"></bpt>formatting in bold<ept i="1"></cf></ept>. </seg>
Trados 2009 – InDesign
<seg> InDesign text with <bpt i="1" type="pt16" x="1" />formatting in bold<ept i="1" />. </seg>

Implications of different tags for formatting
• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment, but not in the source
• The result of the exchange would then be the same as with TMX Level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX?
• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction

Reusing TMX data
• Although translation memory tools share the same basic idea (storing source-target language pairs and recycling translations), they have implemented it in different ways.
• Exchange with TMX works, but one issue can still lower the match rates: the segmentation rules

SRX – Segmentation Rules Exchange
• From the SRX specification
• …The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors...
• …is intended to enhance the TMX standard…

Why SRX?
• Tool A
  • Semicolon is end of segment
  • This is a sentence; this is another sentence.
  • TM system sees two separate segments
• Tool B
  • Semicolon is NOT end of segment
  • This is a sentence; this is another sentence.
  • TM system sees one segment
• No match
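The Tool A / Tool B scenario can be reproduced in a few lines of Python. This is a deliberately simplified sketch: the `segment` function and the two break-character sets are hypothetical stand-ins for each tool's segmentation rules, not SRX itself and not the behaviour of any real product. It splits after a break character that is followed by whitespace, then checks whether Tool B's segment can be found as an exact match in a memory filled by Tool A.

```python
import re


def segment(text, break_chars):
    """Split text after any of break_chars when whitespace follows (hypothetical rule)."""
    pattern = r"(?<=[" + re.escape(break_chars) + r"])\s+"
    return [part for part in re.split(pattern, text) if part]


TEXT = "This is a sentence; this is another sentence."

# Tool A treats the semicolon as an end-of-segment marker, Tool B does not.
tool_a_segments = segment(TEXT, ".!?;")
tool_b_segments = segment(TEXT, ".!?")

# A translation memory filled by Tool A stores Tool A's segments as source text.
tm_from_tool_a = set(tool_a_segments)

# Tool B looks up its own segments in that memory and finds no exact match.
exact_matches = [seg for seg in tool_b_segments if seg in tm_from_tool_a]

print(tool_a_segments)  # two segments
print(tool_b_segments)  # one segment
print(exact_matches)    # empty list: no match
```

If both tools applied the same exchanged rule set, the two segment lists would be identical and the lookup would succeed, which is the gap SRX is meant to close.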
