A Guide to Open Standards and Open Source A Conceptual Case Study

Angelika Zerfass
David Filip, Ph.D.

© 2009 Moravia IT a.s. and Angelika Zerfass

Agenda

1. Polling Questions
2. Definitions
3. Architecture considerations
4. Strategy
5. Open Standards
6. Talking Legalese
7. Open Tools
8. Use Cases

1. Polling Questions

• (A) What is your level of experience with localization industry open standards (such as the XML-based TMX, TBX, SRX, and XLIFF standards)?
  * I know these standards well and see them regularly in the work done at my organization.
  * I have a basic understanding of localization industry open standards.
  * I'm new to localization industry open standards and want to learn more.

• (B) How familiar are you with open source applications used in the localization industry (such as OmegaT, Okapi Framework, Sun XLIFF Translation Editor)?
  * I'm familiar with these tools and use them (or tools like them) regularly.
  * I have a basic understanding of these applications, but don't really use them.
  * I'm new to the idea of open source tools for the localization industry and want to learn more.

• (C) Which are important to you?
  * Learning about the differences between open standards and open source.
  * Learning about actual use of open standards and commonly used tools.
  * Learning about licensing and patent issues.
  * Learning about the open Translation Management Systems in use or development today.

Definitions

The magic quadrant

 | Closed Source | Open Source
Open Standards | Q1: Open-Closed (Good) | Q2: Open-Open (Good)
Proprietary ways | Q3: Proprietary-Closed (Bad) | Q4: Proprietary-Open (Wild)

2. Definitions

• TMS, GMS → ETMS – Enterprise TMS, “from cradle to the grave” → Computer Aided L10N Project Management System (CALPMS)
• Open Standards: XLIFF, TMX, TBX, SRX etc.
• OSS, Open Source, Free Software vs. Freeware
• Open Source (Copy-left) Licensing vs. Permissive Licensing

3. Architecture

4. Strategy

Win → Translator
Win → LSP of any size
Win → Enterprise

New business needs
• Exponential growth of content
• Changing balance between published and user generated content
• Need for Continuous Translation
• Community Translation
• Shared language data
• Massive online collaboration
• Translation automation

Change enabler
• TinyTM
• OmegaT
• Open ACS
• OKAPI framework
• Etc.

What is an Open Standard?

World Wide Web Consortium's definition
• Transparency (design/due process is public, and all technical discussions, meeting minutes, are archived and referencable in decision making)
• Relevance (new standardization is started upon due analysis of the market needs, including requirements phase, e.g. accessibility, multi-linguism)
• Openness (anyone can participate: industry, individual, public, government bodies, academia, on a worldwide scale)
• Impartiality and consensus (neutral org leading it, with equal weight for each participant)
• Availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Web technologies)
• Support (multiple implementations, ongoing process for testing, errata, revision, permanent access)

Wikipedia, 2009

Goal of open standards

• Interoperability of tools
• Vendors can concentrate on innovation in fields other than their proprietary formats
• Standardization of processes (translation of just one file format like XLIFF instead of DOC, HTML, InDesign, FM…)

Success of open standards

• Depends on the commercial usability
• TMX – widespread, XLIFF – coming on strong, SRX – not widely used, TBX – slow, others – in the making (TBX Basic, GMX…)

5. Open Standards

• Why Open Standards in Open Source?
• Implementing open standards seems an obvious success scenario for OSS development
• XLIFF and TMX are open standards co-developed by our clients
• Minimalist open standards implementation ensures desired functionality and is also legally safe
• LISA OSCAR TMX 1.4b, 1.5?, 2.0?
• OASIS XLIFF 1.1, 1.2, 1.2.1?, 2.0?

Open Standards

OAXAL
© Andrzej Zydron, OASIS OAXAL TC

TMX – Translation Memory Exchange

• From the TMX specification:
  * …The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process…

What is TMX

• It is an XML representation of translation memory data
  * Header
  * Body

• Header (excerpt)
  * srclang="en-us": source text language
  * o-tmf="DVMDB": original translation memory format (DVMDB – Déjà Vu database…)

What is TMX

• Body
  * Source segment: This is the first sentence.
  * Target segment: Dies ist der erste Satz

tu = translation unit, tuv lang = translation unit variant (language), seg = segment

What is TMX
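The header/body structure described on these slides can be sketched in a few lines of Python. This is only an illustration, not output from any of the tools mentioned: the sample file is hand-written, the target language code `de-de` and all header attributes other than `srclang` and `o-tmf` are placeholders, and the helper name `read_tmx` is hypothetical. TMX 1.4 writes the language on each `tuv` as `xml:lang`; the sketch also accepts a plain `lang` attribute.

```python
import xml.etree.ElementTree as ET

# Hand-written TMX sample. Only srclang and o-tmf come from the slide;
# every other header value is a placeholder for illustration.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="DVMDB" adminlang="en-us"
          srclang="en-us" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-us"><seg>This is the first sentence.</seg></tuv>
      <tuv xml:lang="de-de"><seg>Dies ist der erste Satz</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(text):
    """Return (header attributes, list of {language: segment text})."""
    root = ET.fromstring(text)
    header = dict(root.find("header").attrib)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            # itertext() also collects text nested inside inline elements
            variants[lang] = "".join(tuv.find("seg").itertext())
        units.append(variants)
    return header, units


header, units = read_tmx(TMX_SAMPLE)
print(header["srclang"], header["o-tmf"])
for unit in units:
    print(unit)
```

Because each translation unit is read into a dictionary keyed by language, the same loop handles bilingual and multilingual files.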

• Depending on the tool that created the TMX file, it can be bilingual or multilingual.
• Importing a multilingual TMX file into a bilingual project imports only the relevant languages.

Levels of TMX

• Level 1:
  * Plain text only (sufficient for data coming from software localization tools)
• Level 2:
  * Text plus formatting (data coming from translation memory tools used for translation of documentation)

To move formatting and text from one tool to another, both tools need to be Level 2 compliant!

Level 1

• Formatting that is applied to the source and target text of a translation unit is not exported to the TMX file, only pure text.

• Original
  * This sentence has some formatting.
• In TMX
  * This sentence has some formatting.

Level 2

• Formatting that is applied to the source and target text of a translation unit is exported to the TMX file.
• Different tools use different ways of encoding that information (placeholders or actual formatting information).

Level 2

MemoQ – Word DOC with formatting

This is the {}first{} sentence; this is {}another{} sentence.

Trados 2007 8.2 / 8.3 – Word DOC with formatting

This is the <cf bold="on">first</cf> sentence; this is <cf underlinestyle="single">another</cf> sentence.

Trados 2009 – Word DOC with formatting

This is the first sentence; this is another sentence.

Level 2

MemoQ – HTML file with link

Text with a link to <a href="http://www.samplehtml.com/page1.htm">another page</a>.

Trados 2007 8.2 / 8.3 – HTML file with link

Text with a link to <a href = "http://www.samplehtml.com/page1.htm">another page</a>.

Trados 2009 – HTML file with link

Text with a link to another page.

OmegaT – HTML file with link

OmegaT internal format: Text with a link to <a0>another page</a0>.

TMX Level 2 format: Text with a link to <a0>another page</a0>.

Level 2

MemoQ – InDesign

InDesign text with {}formatting in bold{}.

Trados 2007 8.2 / 8.3 – InDesign

InDesign text with <cf ptfs="c_Bold">formatting in bold</cf>.

Trados 2009 – InDesign

InDesign text with formatting in bold.

Implications of different tags for formatting
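As a rough illustration of text-only reuse, the sketch below strips the two inline notations shown on the Word DOC slides (the `{}` placeholders and the `<cf …>` tags) and checks that the remaining plain text is identical. The regular expression and the helper name `to_plain_text` are assumptions made for this example; real TMX files wrap inline codes in dedicated TMX elements, and no tool is claimed to work this way.

```python
import re

# Matches the two inline notations from the slides:
#   {}                   empty placeholder pair
#   <cf ...> and </cf>   character-formatting tags
INLINE_MARKUP = re.compile(r"\{\}|</?cf\b[^>]*>")


def to_plain_text(segment):
    """Drop inline formatting markers and keep only the translatable text."""
    return INLINE_MARKUP.sub("", segment)


placeholder_style = "This is the {}first{} sentence; this is {}another{} sentence."
tag_style = (
    'This is the <cf bold="on">first</cf> sentence; '
    'this is <cf underlinestyle="single">another</cf> sentence.'
)

plain_a = to_plain_text(placeholder_style)
plain_b = to_plain_text(tag_style)
print(plain_a)
print(plain_a == plain_b)
```

Once the markers are removed, both segments carry the same text, which is all that a receiving tool can reuse when it cannot interpret the other tool's formatting codes.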

• Tools that use placeholder tags do not include the actual formatting information in the TMX file
• Other tools might only be able to re-use the text, especially if the formatting is only applied in the target segment, but not in the source
• The result of the exchange would then be the same as with TMX level 1 (text only)
• TMX files which carry the actual formatting information will yield better matches in other tools that can read this information

Where do you use TMX?

• Transferring data between different translation memory tools
• Checking tools / QA tools
• TM maintenance tools
• Basis for bilingual term extraction

Reusing TMX data

• Although Translation Memory Tools have the same basic idea (storing source-target language pairs and recycling translations), this has been realized in different ways.
• Exchange with TMX works, but there is an issue that can lower the match rates nonetheless… the segmentation rules

SRX – Segmentation Rules Exchange

• From the SRX specification
  * …The purpose of the SRX format is to provide a standard method to describe segmentation rules that are being exchanged among tools and/or translation vendors...
  * …is intended to enhance the TMX standard…

Why SRX?

• Tool A
  * Semicolon is end of segment
  * This is a sentence; this is another sentence.
  * TM system sees two separate segments

• Tool B
  * Semicolon is NOT end of segment
  * This is a sentence; this is another sentence.
  * TM system sees one segment
  * No match from the TMX data!
  * Match rate is around 50%, below the usual threshold setting of around 70%

Segmentation rules
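The Tool A / Tool B mismatch can be reproduced with a small sketch. The splitting logic, the helper names `segment` and `best_match`, and the character-based similarity from Python's `difflib` are all assumptions made for illustration; they are not the segmentation engine or the fuzzy-match algorithm of any real tool, so the score will not equal a tool's match rate.

```python
import re
from difflib import SequenceMatcher

TEXT = "This is a sentence; this is another sentence."


def segment(text, break_chars):
    """Split after any of break_chars when followed by whitespace."""
    pattern = r"(?<=[" + re.escape(break_chars) + r"])\s+"
    return [s for s in re.split(pattern, text) if s]


def best_match(query, memory):
    """Highest character-level similarity between query and stored segments."""
    return max(SequenceMatcher(None, query, m).ratio() for m in memory)


# Tool A treats the semicolon as an end of segment, Tool B does not.
tool_a_memory = segment(TEXT, ".!?;")
tool_b_query = segment(TEXT, ".!?")

print(tool_a_memory)  # two segments stored by Tool A
print(tool_b_query)   # one segment looked up by Tool B

score = best_match(tool_b_query[0], tool_a_memory)
print("exact match available:", tool_b_query[0] in tool_a_memory)
print("best fuzzy score: %.2f" % score)
```

The translation exists in the exchanged memory, but because the two tools cut the text differently, Tool B finds no exact match and can only fall back on a partial one.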

• Rules that the tool applies to the text to translate to split it up into segments
  * paragraph
  * sentence
  * phrase
  * incomplete sentences in bulleted lists
  * single words (headings, “Note”, “Attention”)

Segmentation rules

• End of segment rules (common to the default settings of all tools)
  * Dot at the end of a sentence (not after known abbreviations)
  * Question mark, exclamation mark
  * Paragraph mark
  * Colon
• End of segment rules (different for different tools)
  * Semicolon
  * Tab character
  * Sub segments (index entries, footnotes, graphics)

Comparison of default rules

Rule | Workbench | Transit | DV | SDLX | Across
Colon | end | end | end | no end | no end
Semicolon | no end | end | end | no end | no end
Tab | end | no end | no end | no end | no end
Soft return | no end | no end | end in Word, no end in PPT | end in Word, no end in PPT | no end

What can SRX do and what not?

• It can only show the segmentation rule settings at the time of export.
• It cannot show any changes that have been applied in the segmentation rules during the use of the TM.
• Sometimes the rules from system 1 cannot be re-created in system 2; in that case the rule will be ignored.

TBX – TermBase Exchange

• From the working draft of the TBX specification
  * TBX is an open XML-based standard format for terminological data
  * TBX is designed to support the analysis, representation, dissemination, and exchange of information from human-oriented terminological databases (termbases)
  * TBX is built on the basis of ISO 12620 (data categories) and ISO 12200 (MARTIF, core structure)

TBX

Sample TBX entry, annotated:
• Global information in entry head
• Language ID
  * Administrative data of this language
  * Term in English
  * Information on term level
• Language ID
  * Term in French
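The entry layout annotated above (entry-level information, then one language section per language, each holding its terms) can be sketched as follows. The element names follow the TBX core structure as commonly published (`termEntry`, `langSet`, `tig`, `term`), but the entry itself, its terms, and the helper name `read_entry` are invented for this example and are not taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical entry; all values are illustrative only.
TBX_ENTRY = """<termEntry id="c1">
  <descrip type="subjectField">localization</descrip>
  <langSet xml:lang="en">
    <admin type="source">example glossary</admin>
    <tig>
      <term>translation memory</term>
      <termNote type="partOfSpeech">noun</termNote>
    </tig>
  </langSet>
  <langSet xml:lang="fr">
    <tig>
      <term>mémoire de traduction</term>
    </tig>
  </langSet>
</termEntry>"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_entry(xml_text):
    """Return (entry id, {language: [terms]}) for one termEntry element."""
    entry = ET.fromstring(xml_text)
    terms = {}
    for lang_set in entry.findall("langSet"):
        lang = lang_set.get(XML_LANG)
        terms[lang] = [t.text for t in lang_set.findall("tig/term")]
    return entry.get("id"), terms


entry_id, terms = read_entry(TBX_ENTRY)
print(entry_id, terms)
```

The entry is concept-oriented: one `termEntry` groups the terms of all languages for the same concept, which is what distinguishes it from the segment pairs in a TMX file.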

Where could you use TBX?

• Exchange terminological data
  * From term base to source language checking tools
  * Between term bases of translation memory tools
  * From term base to terminology checking tools
  * From term base to terminology extraction tools (as stop word lists)
  * From term base to dictionary of a machine translation tool

• For indexing / keywords in document management systems, content management systems, knowledge management systems

• Publishing terminological data on the Intranet / Internet

• Optimization of search engines / text mining by searching for synonyms automatically

XLIFF – XML Localization Interchange File Format

• The XLIFF format aims to:
  * separate localizable text from formatting (deal with one file format in translation instead of different processes to extract, filter, convert text from different file formats)
  * enable multiple tools to work on the source text
  * store information that is helpful in supporting a localization process (like metadata on versions of source and target segments)

• An XLIFF file is bilingual and can be the container for a number of individual files

• Each file element in an XLIFF file contains a header (project data, such as contact information, project phases, pointers to reference material, and information on the skeleton file) and a body section with the actual text segments.

XLIFF
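A minimal sketch of the file / header / body layout, assuming XLIFF 1.2 and its namespace `urn:oasis:names:tc:xliff:document:1.2`. The file name, skeleton reference, language codes and the helper name `read_xliff` are placeholders chosen for this example; the segment texts are the sample sentences used on the XLIFF slide.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hand-written sample; file names and language codes are placeholders.
XLIFF_SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="sample.html" source-language="en-us"
        target-language="de-de" datatype="html">
    <header>
      <skl><external-file href="sample.html.skl"/></skl>
    </header>
    <body>
      <trans-unit id="1">
        <source>This is a sentence.</source>
        <target>Dies ist ein Satz.</target>
        <alt-trans>
          <target>Das ist ein Satz.</target>
        </alt-trans>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def read_xliff(text):
    """Return a list of (original file name, translation units) per file element."""
    root = ET.fromstring(text)
    files = []
    for file_el in root.findall("x:file", NS):
        units = []
        for tu in file_el.findall("x:body/x:trans-unit", NS):
            units.append({
                "id": tu.get("id"),
                "source": tu.findtext("x:source", namespaces=NS),
                "target": tu.findtext("x:target", namespaces=NS),
                # alternative translations are kept next to the main target
                "alternatives": [t.text for t in tu.findall("x:alt-trans/x:target", NS)],
            })
        files.append((file_el.get("original"), units))
    return files


files = read_xliff(XLIFF_SAMPLE)
for original, units in files:
    print(original, units)
```

Since the outer loop runs over every `file` element, one XLIFF document can act as the container for several source files.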

• XLIFF can carry several translations
  * This is a sentence. / Dies ist ein Satz.
  * This is a sentence. / Das ist ein Satz.
  * This is a short sentence. / Dies ist ein kurzer Satz.
• Additional fields can contain context, tool, history…
• Information on the name of a button and its coordinates (for visual representation of the button in a localization tool)
  * Cancel

Where is XLIFF useful?

• Where experience with XML exists
• Projects contain many different file formats
  * All formats are converted to XLIFF for translation
• Different tools need to be used during localization
• Different translations (alt-trans) or languages needed as reference

Any idea why XLIFF should NOT be the cure for everything?

• Instead of developing parsers for different file formats (to read in the file into a translation tool), developers now need to create parsers to convert those file formats to XLIFF
• Some file formats already can be dealt with (Office, HTML, XML…) – why should a new parser be created for those?
• XLIFF has its variants, like all XML-based files – how can you make sure that each tool can process any of the possible XLIFF flavours?

The magic quadrant again just to remember the distinction

 | Closed Source | Open Source
Open Standards | Q1: Open-Closed (Good) | Q2: Open-Open (Good)
Proprietary ways | Q3: Proprietary-Closed (Bad) | Q4: Proprietary-Open (Wild)

6. Talking Legalese in OSS

OSS (Open Source Software)
• Copyleft and Permissive Licensing
  * GPL (General Public License by FSF)
  * BSD (Berkeley Software Distribution)
  * Apache License 2.0
• MS open source licensing
  * Ms-PL (Public License)
  * Ms-RL (Reciprocal License)

Derived/Derivative works vs. dynamic linkage
• Paradigmatic case is app running on OS

Contributions

Distribution

6. Talking Legalese in standards

RF (Royalty Free) vs. RAND (Reasonable and Non-Discriminatory)

Standards can be licensed under different terms
• RF
• RAND – there is no general consensus to date on whether RAND is Open or Proprietary
• Proprietary

By open standard we mean just RF Standards

The opposite of an Open Standard is any proprietary way of doing things, be it standardized or not

7. Open Tools

• Infrastructure baseline
  * PostgreSQL, MySQL, Open ACS, Petri Nets, Alfresco, jBoss
• TM Server
  * TinyTM
• Workflow capabilities through any CALPMS
  * Open: ]po[, .LRN; Closed source: AIT Projetex, LTC Worx, Plunet Business Manager etc.
• CAT Clients
  * OmegaT, Metatexis, TinyTM Word macro, Multilizer and whoever wants to implement the protocol
• Filters
  * OKAPI framework, mainstream XLIFF generators such as Sun or Adobe

8. Use Cases - odt + OmegaT

• Level 2 TMX and OpenOffice.org (*.odt) support
• Has TinyTM connector in alpha
• Has plaintext glossary support

8. Use Cases - TinyTM

• TinyTM protocol is licensed under the “LGPL V2.1 or higher”.
• Rest of the TinyTM code is licensed under the “GPL V2.0 or higher”.
• Translation clients can be licensed under any license.
• TinyTM documentation is licensed under the Creative Commons Attribution License.

8. Use Cases - TinyTM - Status

• Alpha functionality
  * Java TMX importer
  * Mark up parsing logic
• Protocol freeze in public discussion on Sourceforge
• Protocol features
  * Conservative extension with many important enhancements

8. Use Cases - TinyTM – new protocol features

• Industry and domain taxonomy
  * open to accommodate future TDA taxonomy research
• Inline markup handling
  1. stripped plain version
  2. normalized version with formatting placeholders {i}
  3. TMX version with full TMX markup
• Advanced leverage functions
  * hash based context storage
  * hash and trigram enhanced fuzzy search

9. Discussion

Thanks for your attention!

10. References

Open Standards (selective)
• XLIFF 1.2, http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
• TMX 1.4b, http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
• SRX 2.0, http://www.lisa.org/fileadmin/standards/srx20.html

Open Tools (selective)
• OKAPI Framework, http://okapi.sourceforge.net/
• OmegaT, http://sourceforge.net/projects/omegat/
• OmegaT+, http://omegatplus.sourceforge.net/
• Open ACS, http://openacs.org/
• PostgreSQL, http://www.postgresql.org/
• TinyTM, http://tinytm.sourceforge.net/

Tools continued
• XML to XLIFF to XML, https://sourceforge.net/projects/xliffroundtrip/
• TMX Compliance Kit (test for TMX certification), http://www.lisa.org/Translation-Memory-e.34.0.html
• TBX Checker, http://www.lisa.org/TBX-Resources.650.0.html

Standards continued
• Unicode Standard Annex #29, http://www.unicode.org/reports/tr29/
• W3C ITS, http://www.w3.org/TR/its/