Nine Terminology Extraction Tools: Are They Useful for Translators?
Hernani Costa, Anna Zaretskaya, Gloria Corpas Pastor, Miriam Seghiri
University of Malaga
Malaga, Spain

Abstract

Terminology extraction tools have become an indispensable resource in education, research and business. Today, users can find a great variety of terminology extraction tools of all kinds, and they all offer different features. Apart from many other areas, these tools are especially helpful in the professional translation setting. We do not know, however, if the existing tools have all the necessary features for this kind of work. In search of an answer, we survey nine selected tools available on the market and find out whether they provide translators' favourite features.

1 Terminology extraction tools and their areas of application

The purpose of terminology extraction tools (TET) is to help users build terminological resources in a (semi-)automatic way. The need for such resources comes mostly from the growing needs in information management and translation, which make it more and more necessary to have some automated assistance when performing terminology-related tasks. Companies, freelancers and professionals in various linguistic fields can resort to these tools to build, for example, glossaries, thesauri and terminological dictionaries that they use directly in their work. Moreover, TE is embedded in a number of natural language processing and linguistic research tasks, such as automatic indexing, machine translation, information extraction, creation of ontologies and knowledge bases, and corpus analysis. Although they have such a broad range of applications, these tools are often designed for one specific purpose, which consequently makes their usage challenging when employed in a different setting.

One of the most important areas where terminology extraction is extremely helpful is the translation industry. Today, more and more language service providers (LSP) as well as freelance translators and interpreters understand the benefits of automating terminology tasks. It not only allows them to quickly identify the domain of the documents they are dealing with, but also to easily find words and phrases that need special attention. While translating terminological units, in many cases it is necessary to consider the domain and look up the term equivalents in special resources like terminology databases. In addition, it helps maintain terminological consistency throughout the project between all the parties involved: the translator, the LSP and the client.

Apart from saving time, another significant advantage of using TET instead of manual terminology search is the possibility to specify different search criteria, which makes it possible to adapt the search query to a particular task. This allows users to see all kinds of information they need about the term, and also to narrow the search and filter the results depending on what they are looking for. As an example, many state-of-the-art TET make it possible to see linguistic and statistical information about the term and the context where it appears, to specify the number of words in the term, and offer many other useful features. Unfortunately, not every TET offers a full set of desirable features and settings, which sometimes makes it challenging to find the perfect tool for the task at hand.

Apart from the functionalities they offer, TET also differ as to the environment they work in. For instance, standalone installable tools require an installation process and work as independent computer programs. There also exist web-based tools, which work within a browser. And finally, there is reusable software that facilitates the development of larger applications, known as frameworks.

Considering the existing variety, it is not clear how a professional translator is to proceed when choosing a TET suitable for the job. As we will see further, there are some TET that are specifically created for translators. But do they have all the necessary characteristics for translators? And, furthermore, what exactly are these characteristics?

2 Standalone Terminology Extraction Tools

Standalone software is probably the most popular type of software today, and TET are no exception. Standalone TET are tools that can be installed on the computer and operate independently of any other device or system.

SDL MultiTerm Extract is one such application. It is a component of SDL MultiTerm, a commercial terminology management tool that provides one solution to store, extract and manage multilingual terminology. MultiTerm exists as a standalone application, and can also be integrated in SDL Trados Studio. It is one of the few tools that were designed specifically to be used by translators and is probably the most well-known TET in the translation industry. This TE system locates potential monolingual and bilingual terminology in documents and translation memories using a statistics-based method. The user can validate the extracted candidate terms by looking at a monolingual or bilingual concordance. A big advantage of this tool is its support for any language, including languages that require Unicode. In addition, it offers a number of functionalities that are useful in different translation scenarios, such as the ability to compile a dictionary from parallel texts; flexible filtering that ensures that only the most frequent candidate terms are extracted; the possibility to store an unlimited number of terms in any language; and the import and export of glossaries from and to different technology environments. In addition, its integration with SDL MultiTerm gives access to many convenient term-management functions, such as manually adding a variety of metadata to the terms (synonyms, context, definitions, illustrations, part-of-speech tags, URLs, etc.) and searching not only the indexed terms but also their descriptive fields.

Simple Extractor, as its name implies, offers significantly fewer functionalities than the previous tool. It is a commercial TET developed by DAIL Software S.L. for Mac OS, Linux and Windows platforms. This clean and easy-to-use standalone Java application was designed to automatically extract the most frequent words and multi-word terms from English, Portuguese, Spanish, French and Russian documents. Simple Extractor not only extracts a list of terms (from unigrams up to seven-grams), but also lets the user specify the minimum and maximum number of occurrences of a term. Moreover, Simple Extractor offers an option to load stopword lists and an advanced search functionality that makes it possible to search through the extracted list of terms, to explore all the contexts in which a specific term appears, to edit the term text, to filter the extracted terms according to the number of words that form them, and to sort the displayed output by any of its fields (frequency, term and context in alphabetical order). Finally, Simple Extractor can print out or export to a file (.pdf, .doc, .csv or .txt) all the extracted terms, as well as their frequencies and corresponding contexts.

TermSuite is an open-source and platform-independent TET written in Java and distributed under the Apache License 2.0. It was developed within the scope of the TTC (Terminology Extraction, Translation Tools and Comparable Corpora) project, whose purpose was to design a tool capable of extracting bilingual terminology from comparable corpora in seven languages: English, French, German, Spanish, Latvian, Chinese and Russian. TermSuite's architecture is composed of three modules applied in sequence: the Spotter, the Indexer and the Aligner. The Spotter module is responsible for preprocessing the input monolingual corpus, i.e., it performs tokenization, part-of-speech tagging, stemming and lemmatization. Then, the Indexer module uses both a statistical and a linguistic approach to extract monolingual terminology from a monolingual corpus processed by the Spotter. Finally, the Aligner computes the translation of a source terminology into a target language. The source and target terms required are those already computed by the Indexer module, which means that the previous two steps should be repeated for the target language. The user can choose from several alignment options, such as the selection of the maximum number of translation candidates for a given source term and the use of similarity measures to compare the contexts of the term in the source and the target languages, amongst other advanced settings. Once all the parameters are set, it is possible to view and explore all the translation candidates ranked according to their similarity score within the tool or use the output XML file for other purposes.

3 Web-Based Terminology Extraction Tools

Although standalone TET are still predominant in today's TE application market, future web-based TE technologies will certainly evolve by migrating all standalone features to a web-

whether to extract only single words (keywords) or multi-word terminological units (terms). In the output, the user can see the keywords or terms, links to the five most relevant Wikipedia articles for each of them, the term's score, its frequency in the searched corpus, and its frequency in the reference corpus. There are a variety of search options that can be tuned. For instance, the user can choose a different reference corpus, decide whether to search for words or lemmas, and accentuate low or high-frequency keywords according to their preferences. The output can be downloaded as a TBX or CSV file. In order to perform multilingual term extraction, the user needs to upload a TMX file with a parallel corpus aligned on the sentence or paragraph level. The terminology is first extracted within each language, resulting in lists of candidate terms. In the second step, the system searches for the pairs of candidates that co-occur most often in the parallel documents. The resulting list of candidate pairs (terms in two languages) is then presented to the user. Results can be saved in a TBX or TXT file, which is especially convenient for computer-assisted translation tool users.

Translated s.r.l.