Technology
xml:tm — a new approach to translating XML
Andrzej Zydro´n
XML has become one of the defining tech- vidual sections only need to be authored and translated once, nologies that is helping to reshape the face of and may be reused many times over in different publications. A core concept of DITA is that of reuse at a given level both computing and publishing. It is helping of granularity. Actual publications are achieved through the to drive down costs and dramatically increase means of a map that pulls together all of the required constitu- Xinteroperability between diverse computer ent components. DITA represents an intelligent approach to the systems. process of publishing technical documentation. At the core of DITA is the concept of the topic. A topic is a unit of informa- From a localization point of view XML offers many advan- tion that describes a single task, concept or reference item. tages: a well-defined, rigorous syntax backed up by a rich tool DITA uses an object-orientated approach to the concept of top- set that allows documents to be validated and proven; a well- ics encompassing the standard object-oriented characteristics defined character encoding system that includes support for of polymorphism, encapsulation and message passing. Unicode; and the separation of form and content, which allows The main features of DITA are: both multi-target publishing (PDF, Postscript, WAP, HTML, ■■Topic centric level of granularity XHTML, online help) from one source. ■■Substantial reuse of existing assets Companies that have adopted XML-based publishing have ■■Specialization at the topic and domain level seen significant cost savings compared with proprietary sys- ■■Meta data property based processing tems. The localization industry has also enthusiastically used ■■Leveraging existing popular element names and attri- XML as the basis of exchange standards such as the ETSI butes from XHTML LIS (previously LISA OSCAR) standards: Translation Memory The basic message behind DITA is reuse: write once, trans- eXchange (TMX), TermBase Exchange (TBX) and Segmentation late once, reuse many times. Rules eXchange (SRX), as well as Global Information Manage- ment Metrics eXchange - Volume (GMX-V). OASIS has also xml:tm contributed in this field with XML Localization Interchange xml:tm is a radical approach to the problem of translating File Format (XLIFF) and Translation Web Services (TransWS). XML documents. In essence, it takes the DITA concept of reuse In addition, there is the W3C Internationalization Tag Set (ITS). and implements it at the sentence level. It does this by leverag- Another significant development affecting XML and local- ing the power of XML to embed additional information within ization has been the OASIS Darwin Information Technology the XML document itself. xml:tm has additional benefits which Architecture (DITA) standard. DITA provides a comprehensive emanate from its use. The main way it does this is through architecture for the authoring, production and delivery of tech- nical documentation. DITA was originally developed within IBM and then donated to OASIS. The essence of DITA is the Andrzej Zydro´n has been working in IT for 35 concept of topic-based publication, construction and develop- years, 22 of which have been in localization- ment that allows for the modular reuse of specific sections. Each related systems at Xerox, Ford and Oxford section is authored independently and then each publication is University Press. He is now the CTO of XTM constructed from the section modules. This means that indi- International.
www.multilingual.com September 2014 MultiLingual | 49 Technology
the use of the XML namespace syntax. Originally developed ■■As previously mentioned, xml:tm is a perfect match for as a standard under the auspices of LISA OSCAR, xml:tm is DITA, taking the DITA reuse principle down to sentence level. now an ETSI LIS standard. The xml:tm standard is a perfect ■■xml:tm mandates the use of SRX for text segmentation of companion to DITA — the two fit together hand in glove in paragraphs into text units. terms of interoperability and localization. ■■xml:tm mandates the use of Unicode Standard Annex #29 xml:tm was designed from the outset to integrate closely for tokenization of text into words. with and leverage the potential of other relevant XML based ■■xml:tm mandates the use of XLIFF for the actual transla- localization industry standards. tion process. xml:tm is designed to facilitate the automated cre- ation of XLIFF files from xml:tm enabled documents, and after translation to easily create the target versions of the documents. ■■xml:tm mandates the use of GMX-V for all metrics con- cerning authoring and translation.
50 | MultiLingual September 2014 [email protected] Technology
52 | MultiLingual September 2014 [email protected] Technology
provide fuzzy matches of similar previously translated text. In practice fuzzy matching source source extract extracted tm is of little use to translators except for text text text process instances where the text units are fairly long and the differences between the origi- nal and current sentence are very small. Additionally, in technical documents target prepared you can often find a large number of text Customer text merge text units that are made up solely of numeric, alphanumeric, punctuation or measurement items. With xml:tm these can be identified during authoring and flagged as nontrans- latable, thus reducing the word counts. For target target numeric and measurement only text units QA translated translate text text text it is also possible to automatically convert the decimal and thousands designators as required by the target language.
Figure 6: Traditional translation scenario. Process automation You can use xml:tm to create an inte- Matching translation memories, but in a consistent grated and totally automated translation The matching described in the previ- and automatic fashion. environment. The presence of xml:tm ous section is called “exact” matching. For in-document fuzzy match- allows for the automation of what would Because xml:tm memories are embedded ing, during the maintenance of author otherwise be labor-intensive processes. within an XML document, they have memory a note can be made of text units The previously translated target version all the contextual information that is that have only changed slightly. If a of the document serves as the basis for required to precisely identify text units corresponding translation exists for the the exact matching of unchanged text. In that have not changed from the previ- previous version of the source text unit, addition, xml:tm allows for the identifica- ous revision of the document. Unlike then the previous source and target ver- tion of text that does not require transla- leveraged matches, perfect matches do sions can be offered to the translator as tion (such as text units comprising solely not require translator intervention, thus a type of close fuzzy match. punctuation, or numeric or alphanumeric reducing translation costs. The text units contained in the lever- only text) as well as providing for in- xml:tm provides much more focused aged memory database can also be used to document leveraged and fuzzy matching. types of matching than traditional translation memory systems. The first is exact matching, where author memory provides exact details of any changes to a document. Where text units have not been changed for a previously trans- lated document we can say that we have an exact match. The concept of exact matching is an important one. With traditional translation memory systems Jahrestagung a translator still has to proof each match, as there is no way to ascertain and the appropriateness of the match. Proof- ing has to be paid for — and is typically conference about 60% of the standard translation Stuttgart, Germany cost. With exact matching there is no need to proofread, thereby saving on the ICS International Congress Center cost of translation. November 11–13, 2014 xml:tm can also be used to find in- document leveraged matches, which will be more appropriate to a given document The world‘s largest than normal translation memory lever- trade fair in the aged matches. Additionally, when an area of technical xml:tm document is translated, the trans- communication lation process provides perfectly aligned source and target language text units.
conferences.tekom.de MultiLingual These can be used to create traditional www.multilingual.com September 2014 MultiLingual | 53 Technology
based on open standards. This not only leveraged substantially reduces the cost of preparing matching the document for translation, but is also source extract extracted tm XLIFF much more efficient and cost effective. A text text process file traditional translation scenario can be seen exact in Figure 6, whereas in the xml:tm trans- matching lation scenario, all processing takes place CMS or LSP workflow system within the customer's environment, as seen in Figure 7. web Because xml:tm mandates the use of Internet QA translate browser XLIFF as the exchange format for trans- lation, the XLIFF format can be used to create dynamic web pages for translation. A translator can access these pages via a target CMS or LSP workflow system text browser and undertake the whole of the merge translation process over the internet. This has many potential benefits. The problems of running filters and the delays inherent Figure 7: xml:tm translation scenario. in sending data out for translation such as In essence, xml:tm has already pre- in the traditional leveraged and fuzzy inadvertent corruption of character encod- prepared a document for translation search manner. ing, document syntax or simple human and provided all of the facilities to The presence of xml:tm can be used to work flow problems can be totally avoided. produce much more focused matching. totally automate the extraction and match- Using XML technology it is now possible to After exhausting all of the in-document ing process. This means that the customer is both reduce and control the cost of transla- matching possibilities any unmatched in control of all of the translation memory tion as well as reduce the time it takes for xml:tm text units can be searched for matching and word count processes, all translation and improve reliability. M LavaConTHE Conference ON CONTENT STRATEGY AND USER EXPERIENCE
A gathering place for content strategists, user experience designers, documentation managers October 12–15, 2 014 and other content professionals LavaCon.org Portland, Oregon
54 | MultiLingual September 2014 [email protected] Translation Localization Internationalization Language Technology
Global Web Business Region Focuses Industry Resources
MultiLingual — Your information source for language and business.
Subscribe now and see these benefits:
• Eight issues a year plus an annual editorial index/resource directory • An online searchable archive of issues beginning January 2006 • Insightful articles on the topics above as well as managing content, standards, language preservation and much more.
Two ways to subscribe:
1. Digital only. Receive an email when each issue is available. Have access to all the digital issues since 2006. Use the indices or search to find topics, companies and people across issues.
2. Print + Digital. Receive a printed copy of the magazine at your address. Keep the archives on your bookshelf plus take advantage of all the digital features outlined in #1.
SUBSCRIBE NOW! www.multilingual.com/subscribe