10 Things You Should Know About Automatic Terminology Extraction
By Uwe Muegge

The ATA Chronicle, September 2012 (Volume XLI, Number 9), a publication of the American Translators Association

It is probably safe to say that many, if not most, commercial translation and localization projects today are carried out without a comprehensive, project-specific, up-to-date glossary in place. I suspect that one of the primary reasons for this inefficient state of affairs is that many participants involved in these projects are unfamiliar with the tools and processes that enable linguists to create monolingual and multilingual glossaries quickly and efficiently. Below are 10 insights for linguists wishing to give automatic terminology extraction a(nother) try. (Please see the “Links Related to Tools” box at the end of this article for information on all of the tools mentioned here.)

1. The two biggest issues with terminology extraction tools are noise and silence.
Many commercial terminology extraction tools use a language-independent approach to terminology extraction, which has the benefit of giving linguists a single tool for extracting terminology in many different languages. The drawback of this approach is that the percentage of “noise” (i.e., invalid term candidates) and “silence” (i.e., missing legitimate term candidates) is typically higher than in linguistic extraction tools that use language-specific term formation patterns. As a result, many linguists who use these popular extraction products are disappointed by the amount of clean-up work that some of these fairly expensive products can require.

2. For short texts, manual extraction may be your best option.
To the best of my knowledge, there is no automatic terminology extraction system, at least for the English language, that creates term lists reliably without requiring substantial human intervention either prior to extraction (e.g., set-up, importing lists, creating rules, etc.) or after extraction (primarily manual or semi-automatic clean-up). For this reason, short texts are typically not well suited for automatic terminology extraction. (What qualifies as “short” differs from tool to tool, but 1,000 words serves as a general guideline.) This rule holds particularly true when the person performing the term extraction will subsequently translate the source text. It is generally a good idea to read the text to be translated in its entirety before translating it, which creates a perfect opportunity for manual terminology extraction.

3. Rule-based MT systems are a great choice for low-cost automatic terminology extraction.
Rule-based machine translation (MT) systems are among my favorite translation tools. Unlike statistical MT systems, rule-based MT products do not require any linguistic training on bilingual data to be useful, but rely on built-in grammar rules for the analysis of the source and generation of the target. More than 10 years ago, at the translation quality conference TQ2000 in Leipzig, I presented a paper on how to use rule-based MT systems to perform automatic terminology extraction (see note 1). One would expect that after so many years, it would now be common knowledge that the “Unknown Word” feature of rule-based MT systems is highly suitable for automatic terminology extraction. But, unfortunately, it just is not so. So let me tell you again: If you are a freelance translator or small translation agency, the most powerful, customizable, and cost-effective terminology extraction solution you can buy is a rule-based MT system. My two recommendations for this category are Systran Business Translator (available for 15 languages; Price: US$299) and PROMT Professional (available for 5 languages; Price: US$265). Both of these translation tools are very mature. They also offer a built-in [...] and very large general and subject-specific dictionaries that make these products a great investment for any professional translator working in a covered language combination.

4. Some free translation memory systems offer excellent built-in automatic terminology extraction.
Similis is an often overlooked, yet extremely capable, free translation memory system. Since Similis, much like a rule-based MT system, uses language-specific analysis technology, the quality of the term extraction lists that this translation memory product generates puts it in a class of its own among translation memory systems. One particularly useful feature of Similis is its ability to extract highly accurate bilingual glossaries from translation memory (TMX) files. If you work with English or one of the half-dozen other supported languages, this might be the terminology extraction tool for which you have been looking.

Another translation memory solution that is available at no cost to freelance translators and students is Across Personal Edition, which includes crossTerm, a full-featured terminology management module complete with a statistical terminology extraction function. Unlike Similis, the Across tools support a wide range of languages and language combinations.

5. Use a concordance tool for simple terminology extraction.
Stand-alone concordance tools have been used as research tools in [...] for a long time. A concordancer is a type of software application that allows users to extract and display in context all occurrences of specific words or phrases in a body of text. While concordance software is typically used to study collocations, perform frequency analyses, and the like, linguists can use, and have been using, concordancers for terminology extraction. One of the best concordancers for terminology extraction is AntConc. This tool is highly customizable. For example, it allows users to define the word length of terms. It also supports multiple platforms (Windows, Mac, and Linux), and it is free.

6. Free online tools provide powerful terminology extraction, and there is nothing to install.
If you are still not convinced that automatic terminology extraction is for you, let me introduce you to a set of tools where all you have to do to create a term list is to specify a source text and then press a button or two. There is no software to install, no manual to read, and, of course, no price to pay. With web-based terminology extraction services like TerMine and FiveFilters Term Extraction, automatic terminology extraction really is child’s play. Do not let the simple interface of these sites fool you. Both of these online tools produce professional-quality extraction lists that include nouns and, in the case of TerMine, even scored rankings of term candidates.

7. Are you using free MT? Start post-editing with a glossary.
As using free MT services is becoming more and more popular among professional translators, so is the desire to control terminology in the final output that is delivered to clients. Google Translator Toolkit is a free, full-featured online translation memory system that allows users to post-edit translations generated by Google Translate, Google’s proprietary MT system. Since Google Translate is a statistical MT system that has been, and continues to be, trained on a wide variety of documents, the same source term might get translated in multiple ways even within the same document, not to mention across documents. While it is currently not possible to submit user glossaries to Google’s MT engine, it is possible to upload glossaries to the Translator Toolkit. Using one of the tools mentioned in this article to extract terminology and build a bilingual glossary before translating or post-editing in Google Translator Toolkit may be the best thing linguists can do to improve the efficiency of an already very efficient process.

8. Clean up your terminology extraction list to identify the most important term types.
In my professional experience, term lists generated by automatic terminology extraction tools are never perfect. Even the best term extraction systems introduce “noise.” For example, in the TerMine term list shown in Figure 1, I would argue that at least three of the 10 term candidates in this list require editing.

Figure 1: A sample term list generated by TerMine, a free online terminology extraction service.

While most illegitimate term candidates are easy to identify (e.g., misspelled, truncated, or incorrectly hyphenated words), many linguists have a hard time answering the following question: Which term candidates should users of terminology extraction systems actually develop into multilingual glossaries? There is no simple solution to this problem, as each project has its own limiting factors, available time typically being the most important one.

My recommendation for commercial translation projects is always to include the following types of terms in a project glossary, even if the term occurs only once in a source document. Mandatory term types include:

• Client business names;
• Product names; and
• Trademarks.

Yes, I know, this piece of advice runs counter to what many other terminology experts say; namely, if a term occurs only once in a text, there is no risk of inconsistency, and therefore single terms should not be included in glossaries. To that I say: There are terms that are so important that if a linguist gets them wrong, even just once, it would be a huge embarrassment for all parties involved.

Including every term of the above-mentioned types is particularly important when working with MT, as MT systems are notorious for “making up” their own terminology in the target language. The following term types, by contrast, should be included in glossaries based on the frequency of their occurrence in the source text (many terminology extraction tools provide frequency information):

• Feature names;
• Function names;
• Domain-specific terms; and
• Generic terms.

9. Use the recommended data categories when integrating an extraction list into a terminology management system.
Once the extraction list has been cleaned up, the next logical step is to develop a multilingual glossary that will add value not only to the translation process but ideally to the entire translation cycle. The most valuable glossaries are those that provide information that goes beyond simple word pairs of “source term” and “target term.” Here is the minimum data model I recommend for commercial projects:

• Client and/or business unit and/or project name;
• Source term;
• Part of speech (e.g., noun, proper noun, compound noun, verb, adjective, other);
• Context (e.g., a sample sentence in which the source term occurs); and
• Target term.

The big question at this stage is: What software platform do we use for developing and managing terminology after extraction? This is an important question, as many, if not most, linguists do not have a proper terminology management system in place. It may be tempting to use Microsoft Word or Excel tables to manage terminology (after all, these are programs that most linguists own and know how to use), but word processors and spreadsheet applications are not good choices for managing terminology data. The systems I recommend are TermWiki (if you are willing to share terminology) and TermWiki Pro (if you need to keep your terminology data private). Full disclosure: I have contributed, and continue to contribute, to the development of TermWiki, which is already changing the way thousands of users around the globe manage linguistic assets. Here are some of the benefits of using either version of TermWiki:

• Completely web-based (no software to install).
• Platform-independent (runs on Windows, Mac, Linux, Android, iOS, etc.).
• Wiki user interface (intuitively familiar, easy to use).
• Powerful collaboration features (automatic workflow management, etc.).
• No-cost/low-cost solution (TermWiki is free, TermWiki Pro is US$9.95/user/month).

10. A small investment in automatic terminology extraction can yield a big return in efficiency and client satisfaction.
Being able to extract terminology quickly and efficiently is a wonderful thing. With automatic terminology extraction as part of a comprehensive terminology management effort, you are able to:

• Create comprehensive multilingual glossaries before translation.
• Have the client authorize project-specific, multilingual glossaries before translation.
• Have translation memory systems automatically suggest authorized translations for every term during translation.
• Have all members of a translation team use the same terminology during translation.
• Eliminate (terminology) review and corrections after translation.

With so many powerful terminology extraction tools to choose from, as long as the source language is a major language, there really is no excuse for not extracting terminology and creating a glossary as part of every translation project.

Notes

1. The material presented here was inspired by a two-part series I wrote for T for Translation, the blog of CSOFT International, blog.csoftintl.com. If you read German, an expanded version of this article is available at: http://owl.li/ciGr9.

Links Related to Tools

• Systran Business Translator: http://owl.li/ciGG5
• PROMT Professional: http://owl.li/ciGQI
• Similis Free Download: http://owl.li/ciHfQ
• Similis Terminology Extraction How-To Information: http://owl.li/ciHxp
• Across Personal Edition: http://owl.li/ciHPf
• crossTerm: http://owl.li/ciHZ5
• AntConc: http://owl.li/ciIvf
• TerMine: http://owl.li/ciIH9
• FiveFilters Term Extraction: http://owl.li/ciIQT
• Google Translator Toolkit Registration Page: http://owl.li/ciJbr
• TermWiki: http://owl.li/ciJr6
• TermWiki Pro: http://owl.li/ciJyV
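Illustration for item 5: readers who want to see what a concordancer does under the hood may find the following minimal Python sketch useful. It lists every occurrence of a phrase together with a fixed amount of surrounding context. It does not show how AntConc works internally; the function name `concordance`, the `width` parameter, and the sample text are all assumptions made for this illustration.

```python
import re


def concordance(text, phrase, width=30):
    """Return each occurrence of `phrase` with `width` characters of context per side."""
    text = " ".join(text.split())  # collapse line breaks so each hit fits on one line
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the matched phrase lines up in a column.
        hits.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return hits


# Hypothetical sample text, for illustration only.
sample = (
    "A translation memory stores previously translated segments. "
    "Terminology can be extracted from a translation memory file, "
    "and every Translation Memory system handles glossaries differently."
)

for line in concordance(sample, "translation memory"):
    print(line)
```

Reading the hits in a column like this makes it easier to judge whether a candidate is used consistently enough to deserve a glossary entry. The context lines can also be reused as sample sentences.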

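Illustration for item 8: the selection rule described there (always keep client business names, product names, and trademarks; keep the other term types based on frequency) might be sketched in Python as follows. The article does not name a frequency cut-off, so `min_frequency=3` is purely an assumed value. The type labels, the function name `select_for_glossary`, and the sample data are hypothetical as well.

```python
# Term types that item 8 says belong in the glossary even if they occur only once.
MANDATORY_TYPES = {"client business name", "product name", "trademark"}


def select_for_glossary(candidates, min_frequency=3):
    """Pick terms from (term, term_type, frequency) tuples.

    Mandatory types are always kept; all other types are kept only when
    they occur at least `min_frequency` times (an assumed cut-off).
    """
    return [
        term
        for term, term_type, frequency in candidates
        if term_type in MANDATORY_TYPES or frequency >= min_frequency
    ]


# Hypothetical cleaned-up extraction list: (term, type, occurrences in source).
extraction_list = [
    ("Acme Corp", "client business name", 1),
    ("PowerWidget", "product name", 1),
    ("auto-save", "feature name", 7),
    ("export dialog", "function name", 2),
    ("torque", "domain-specific term", 4),
    ("settings", "generic term", 1),
]

print(select_for_glossary(extraction_list))
# prints ['Acme Corp', 'PowerWidget', 'auto-save', 'torque']
```

In practice, the cut-off would depend on the time available for the project, which the article names as the most important limiting factor.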

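Illustration for item 9: the minimum data model listed there can be written down as a simple record type. The Python sketch below is one possible rendering, not a format prescribed by TermWiki or any other tool. The field names, the `to_csv` helper, and the sample entry are assumptions. The CSV output is meant only as an interchange format for moving entries between tools, not as a substitute for a terminology management system.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class GlossaryEntry:
    """One glossary record with the five minimum data categories from item 9."""
    client: str          # client and/or business unit and/or project name
    source_term: str
    part_of_speech: str  # e.g., noun, proper noun, compound noun, verb, adjective, other
    context: str         # e.g., a sample sentence in which the source term occurs
    target_term: str


def to_csv(entries):
    """Serialize entries to CSV text with one column per data category."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(GlossaryEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


# Hypothetical English-to-German entry, for illustration only.
entries = [
    GlossaryEntry(
        client="Acme Corp",
        source_term="translation memory",
        part_of_speech="compound noun",
        context="Upload the translation memory before you start.",
        target_term="Übersetzungsspeicher",
    )
]

print(to_csv(entries))
```

Keeping the context sentence next to each term pair is what separates this model from a bare two-column word list.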