D3.7: REVISED SMT AND NER COMPONENTS INTEGRATED INTO PLATFORM

Ankit Srivastava, Jinhua Du, Andy Way, Alfredo Maldonado, Dave Lewis

Distribution: Public

Federated Active Linguistic data CuratiON (FALCON)

FP7-ICT-2013-SME-DCA

Project no: 610879

Document Information

Deliverable number: D3.7

Deliverable title: Revised SMT and NER components integrated into Platform

Dissemination level: RE

Contractual date of delivery: 28th Feb 2015

Actual date of delivery: 23rd March 2015

Author(s): Ankit Srivastava, Jinhua Du, Andy Way, Alfredo Maldonado, Dave Lewis

Participants: DCU, TCD

Internal Reviewer: TCD

Workpackage: WP3

Task Responsible: DCU

Workpackage Leader: XTM

Revision History

Revision | Date | Author | Organization | Description
1 | 31/07/2014 | Ankit Srivastava, Sandipan Dandapat, Andy Way | DCU, TCD | Delivery of first iteration of SMT and NER components
2 | 15/01/2015 | A. Maldonado, A. Way | DCU, TCD | Redesigned text analysis components, moving from NER component to Automatic Term Extraction in combination with additional term data from BabelNet
3 | 13/02/2015 | D. Lewis, A. Zydron, A. Way, A. Maldonado, A. Srivastava, J. Du, M. Granstrom | DCU, TCD, Interverbum, XTM | Agreed interactions between MT and XTM, between TA and TermWeb, and interactions with L3Data
4 | 15/02/2015 | A. Srivastava | DCU | Revised SMT web services components from D3.3
5 | 21/02/2015 | A. Srivastava | DCU | Deleted NER sections to be filled in by TCD and replaced by automatic term extraction component
6 | 15/03/2015 | A. Maldonado | TCD | Updated automatic term extraction component
7 | 18/03/2015 | D. Lewis | TCD | Updated Integration (Section 5) tables
8 | 20/03/2015 | A. Srivastava | DCU | Updated MT integration and examples


CONTENTS

Document Information
Revision History
Contents
1. Executive Summary
2. Introduction
2.1. Integration Points
3. Revised SMT Components
4. Automatic Term Extraction
4.1. Background: traditional automatic terminology extraction
4.2. Our solution: trainable automatic term extraction based on anomaly detection techniques
5. Web Service Integration
5.1. Automated Term Extraction and its Validation
5.2. Train Initial Project MT Engine and Generate Reference
5.3. Machine Translate Segments on Request
5.4. Analyse Progress and Retrain MT Engine
6. Conclusions and Next Steps
7. References


1. EXECUTIVE SUMMARY

This deliverable (D3.7) is a report on Task 3.2 “SMT and NER Integration” from WP3 “Platform Development.” This document presents the revised interface designs for both the Statistical Machine Translation (SMT) and Named Entity Recognition (NER) components developed as web services for integration into the L3Data Federation Platform. The major changes from the initial version of this integration (reported in D3.2) are:

• Specialised the text analytics function in the project from one based on Named Entity Recognition (NER) to one based on Automatic Term Extraction (ATE), as specified in the revision to the D2.1 Requirements Specification.
• Introduced a mechanism for iterative improvement of Automatic Term Extraction through active curation of validations of suggested terms.
• Aligned the ATE and statistical machine translation (SMT) components with the revised model and interface for the L3Data Platform to be released in D3.6.
• Introduced integration between ATE and the TermWeb terminology management tool.
• Provided detailed interaction sequence specifications for the interactions with the ATE and SMT components.

2. INTRODUCTION

The deliverable specifies the implementation of language technologies used in FALCON, specifically Statistical Machine Translation (SMT) and Automatic Term Extraction (ATE). These language technologies aim to

• Process and produce open-format language resources and interoperable metadata according to the L3Data schema and architecture.
• Actively curate and reuse language resources in the form of L3Data to iteratively improve the performance of the language technology components.
• Integrate with the commercial localisation tool chain used in the FALCON Showcase system to demonstrate active curation of language resources within a localisation workflow, in order to improve the performance of language technology in a specific customer domain.

2.1. Integration Points

Within the FALCON Showcase System architecture, the Text Analytics (TA) component corresponds to the Automatic Term Extraction component presented here and the Machine Translation (MT) component corresponds to the Statistical Machine Translation component presented here.

This document captures the interactions with these TA and MT components occurring across the following architecture reference points as defined in deliverable D2.1:

• MT-L3D: This is the interface whereby the MT component logs the source text to be translated, logs the initial translation of a customer project, and logs the retraining iterations of the MT engine and the translations generated by those iterations.
• TrM-TA: This is the interface whereby the translation manager initiates and monitors the process of term identification and the validation of suggested terms and term translations.
• TeM-TA: This is the interface whereby terminology candidates are checked against existing terms available to the project, where candidate terms and their translations are submitted for validation, and from where the outcomes of term validation are retrieved.


• TA-L3D: this is the interface where term candidates and positive and negative validation outcomes are logged.
• L3D-PD: this is the interface where public data sources are queried for definitions and translations of candidate terms.

Figure 1: FALCON Showcase Architecture and its Integration Reference Points

As specified in D2.2 [1] “Initial L3Data Schema and Architecture,” in order to support the use of L3Data in localisation workflows, the transfer of data will be performed based on existing open standards, namely HTML (Hypertext Markup Language), XML (Extensible Markup Language), XLIFF (XML Localisation Interchange File Format), and Term Base eXchange (TBX). This enables interoperability of translated content, terminology and related metadata with commercial tools such as XTM Cloud [2], TermWeb [3] and EasyLing [4], which are used in the FALCON project, but also with the wider range of tools used in the localisation industry that make use of these standards.

For machine translation, the Moses SMT system [5] with DCU’s operational and training extensions is implemented (described in Section 3). Named Entity Recognition is replaced with Automatic Term Extraction (described in Section 4).

[1] D2.2 and all other FALCON deliverables are available at http://falcon-project.eu/deliverables/
[2] http://xtm-intl.com/enterprises/
[3] http://www.interverbumtech.com/ProductsServices/TermWeb.aspx
[4] http://www.easyling.com/
[5] Moses SMT is available at http://www.statmt.org/moses

Section 5 provides a step-by-step guide to the interactions of the MT and TA component web services with the other components.

3. REVISED SMT COMPONENTS

Statistical Machine Translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual (parallel) text corpora. Moses is an SMT package licensed under the GNU Lesser General Public License (LGPL). It is an open-source software system, primarily written in C++, which allows training of translation models for any language pair.

The SMT components implemented in the FALCON project are as follows:

• SMT Translation
  o Translation of segments (input/output format: sentences) [translate_main_seg]
  o Translation of documents (input/output format: XLIFF) [translate_main_xliff]
  o Translation of documents (input/output format: HTML) [translate_main_html]
• SMT Training
  o Training of a translation model given an input of parallel corpora [train_mtmodel]
  o Retraining of a translation model given an input of post-edited translations [retrain_mtmodel]
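Since the deliverable names the five services but does not specify their URLs or payload schemas, the request shape below is purely illustrative. It sketches how a client might serialise a call to one of the services; every field name and language code here is an assumption:

```python
import json

# Service names come from the deliverable; the JSON payload shape,
# field names and default language codes are assumptions.
SERVICES = {
    "translate_main_seg": "sentences",
    "translate_main_xliff": "XLIFF",
    "translate_main_html": "HTML",
    "train_mtmodel": "parallel corpora",
    "retrain_mtmodel": "post-edited translations",
}

def build_request(service, content, source="en", target="fr"):
    """Serialise a hypothetical request for one of the five SMT services."""
    if service not in SERVICES:
        raise ValueError("unknown SMT service: " + service)
    return json.dumps({
        "service": service,
        "source_lang": source,
        "target_lang": target,
        "input_format": SERVICES[service],
        "content": content,
    })
```

A client would POST such a payload to the corresponding endpoint; response handling is omitted since the response schema is likewise not given in this document.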

Thus five web services (three for translation and two for training) are implemented as part of the SMT module. The current functionality of the DCU SMT system (based on the DCU SMT system implemented as part of the EU FP7 LT-Web [6] project) is represented by the following workflow:

[6] EU FP7 Project LT-Web: http://cordis.europa.eu/fp7/ict/language-technologies/project-multilingualweb-lt_en.html

Figure 2. Workflow of SMT Translation Module

1. Input source-language text in any one of the formats: plaintext segments | HTML | XLIFF.
2. SEGMENTER (Pre-Process Module): parses the ITS 2.0 tagged input document and generates:
   a. Metadata wrapper information {2a}
   b. Segments to be translated (source language) {2b}
3. DECODER (MT Module): translates the segments {3a} with the help of the SMT models {3b} (Translation Model (TM), Language Model (LM), Reordering Model (RM), and feature weights).
4. Translated segments are generated (target language), together with additional information such as MT confidence scores, provenance, etc.
5. DESEGMENTER (Post-Process Module): takes as input the metadata wrapper information {5a} (from 2a) and the translated segments {5b} (from 4), and merges and concatenates them into one document.
6. Output target-language text in the same format as the input: plaintext segments | HTML | XLIFF.
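The segment/decode/desegment split above can be sketched as follows. This is a toy illustration rather than the DCU implementation: the real SEGMENTER and DESEGMENTER handle ITS 2.0, HTML and XLIFF markup, and `decode()` below merely stands in for the Moses decoder:

```python
import re

TAG = re.compile(r"<[^>]+>")

def segmenter(document):
    """Steps 2a/2b: split markup (metadata wrapper) from translatable text."""
    wrapper = TAG.findall(document)              # metadata wrapper info {2a}
    segments = [TAG.sub("", document).strip()]   # segments to translate {2b}
    return wrapper, segments

def decode(segment):
    """Step 3: placeholder for the Moses decoder."""
    return segment.upper()

def desegmenter(wrapper, translated):
    """Step 5: merge the wrapper info back around the translated text."""
    return "".join(wrapper[:1]) + translated[0] + "".join(wrapper[1:])
```

For a single-segment document such as `<p>Hello world</p>`, the round trip reproduces the surrounding markup around the (placeholder) translation, which is exactly the property the DESEGMENTER must guarantee for real formats.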

The SMT Engine (online interface at http://srv-cngl.computing.dcu.ie/mlwlt/) takes as input segments or a document of segments in the source language (annotated with term decoration indicating status and translation), parses it to extract the text to be translated, feeds the plain text to the Moses decoder for translation, merges the MT metadata with the translated content, and generates the translated segment or document of segments in the target language.

Thus the DCU SMT system (Figure 2) is currently capable of translating a document with term annotation with the help of pre-processing and post-processing wrapper scripts.

Figure 3. Workflow of SMT Training Module

An added functionality of the DCU SMT system is its capability to train or retrain SMT models. The system takes as input parallel content (corresponding data files in the source and target languages) and retrains the MT components with the help of the wrapper scripts used by the SMT decoder. This training data is accompanied by metadata and is structured according to the L3Data schema.

The workflow (illustrated in Figure 3) is as follows:

1. Input corpora (sentence-aligned parallel content) in one of the formats: plaintext | HTML | XML | XLIFF:
   a. Content in the source language
   b. Content in the target language
2. SEGMENTER (Pre-Process Module): parses any annotations on the input corpora (both source and target) and extracts:
   a. Segments in the source language {2a}
   b. Corresponding segments in the target language {2b}
3. RETRAINER (Main Module): processes the bilingual segments {3a} to generate new training data to augment the pre-existing SMT models {3b} (Translation Model (TM), Language Model (LM), Reordering Model (RM), and feature weights).


4. Retrained SMT models (Translation Model (TM), Language Model (LM), Reordering Model (RM), and feature weights) are produced and replace the old versions.

Note that this is incremental retraining based on post-edited MT outputs. Incremental retraining allows translation managers to leverage content metadata (captured in L3Data) to prioritise segment post-editing in order to introduce the benefits of retraining earlier in the progress of the project.

In order to develop and test the SMT components trained with L3Data, the relevant metadata categories are available in JSON format (based on the W3C metadata specification) included in the revised L3Data deliverable (D3.2). Note that, in contrast to the initial SMT web services (D3.3), the use of ITS 2.0 for term annotations has been dropped. Instead, the XML tags as received from XTM are used, which in turn are provided from the decoration of terms supplied by TermWeb. This is mainly because we receive segments rather than a whole document to be translated. Information such as the MT confidence score and other related provenance data from both translation and training is captured as L3Data and written to the L3Data component.

As an example of term decoration, consider the following segment for translation from English into French. Validated terms are annotated with a ‘translation’ attribute, which provides the translation of the term to be forced during the SMT decoding step:

INPUT:

From the canyons of Arizona, to the Khmer temples deep in the jungle; from the tropical beaches of Queensland to the glaciers of Antarctica; or from the wild savanna of Africa to mysterious castles in the forests of Bohemia, our offer takes you in some of the most amazing places on our planet.

OUTPUT:

Des canyons de l'Arizona, aux temples khmers profondes dans la jungle, des plages tropicales du Queensland pour les glaciers de L'Antarctique, ou de la savane sauvage de l'Afrique de châteaux mystérieux dans les forêts de Bohême, notre offre vous emmène dans certaines les endroits les plus étonnants de la planète.

Note that the MT Confidence score (e.g. 0.546) generated by the SMT system is not returned in response to the request for segment translation from XTM; instead, it is logged as L3Data in an associated CSV file. The above example illustrates that translations for named entities like “Arizona” and “Africa” are conveyed via the term extraction and terminology translation and validation components.
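To make the term-decoration mechanism concrete, the sketch below pulls forced translations out of a decorated segment before handing plain text to the decoder. The element name `<term>` is an assumption: the deliverable states only that the XML tags arrive from XTM (originating in TermWeb) and carry a ‘translation’ attribute, not the exact tag syntax:

```python
import re

# Hypothetical decoration syntax: <term translation="...">source term</term>.
DECORATION = re.compile(r'<term translation="([^"]*)">([^<]*)</term>')

def extract_forced_translations(segment):
    """Return (source_term, forced_translation) pairs plus the plain
    source text to be passed to the decoder."""
    pairs = [(m.group(2), m.group(1)) for m in DECORATION.finditer(segment)]
    plain = DECORATION.sub(lambda m: m.group(2), segment)
    return pairs, plain
```

For named entities such as “Arizona” the forced translation is often identical to the source term, while “Africa” into French would carry “Afrique”.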

4. AUTOMATIC TERM EXTRACTION

The perceived quality and accuracy of translated specialised or technical content depends heavily on the consistency and accuracy of domain-specific terminology (Champagne 2004). Ensuring terminology consistency through effective terminology management strategies can therefore have a positive impact on the overall quality of the final translated text. In particular, terminology consistency reduces misunderstandings and confusion in the usage or consumption of the translated content (Manning & Schütze 1999, p.186; Champagne 2004; Dunne 2007) and facilitates the internal communication of the company or organisation that owns the content, as well as the external communication between that company or organisation and its customers (Gómez Palou Allard 2012, pp.38–40). Within the translation workflow, effectively managed terminology can boost translators’ productivity by reducing the amount of time they spend on terminology research tasks such as searches on translation memories, dictionaries, glossaries and the web, and discussions with content owners and/or colleagues (Champagne 2004; LISA 2005; Childress 2007; Dunne 2007; Karamanis et al. 2011). Finally, one of the goals of the FALCON project is to show that terminology can be used in machine translation (MT) processes in order to ensure that the preferred terminology is used in the output produced by MT.

Terminology management starts with identifying source-language terms that are domain-specific, technical, specialised or that are of particular commercial or marketing interest to the content owner (such as trademarks, features and product names) before translation takes place. The most cost-effective way of identifying this terminology is by automatically extracting it from existing content, specifically content that will be translated as part of the translation project. The typical output of an automatic terminology extraction tool is a list of term candidates ranked by some statistical criterion that seeks to measure their termhood, i.e. the degree to which a term candidate is a valid term in the domain or project at hand. This term candidate list is then validated by a terminologist, i.e. each term candidate is manually classified as a valid term or a non-valid term. Valid terms are then translated and captured in a terminology database (termbase) for future reference.

However, in fast-paced and on-going content-creation projects that require translated content to ship at the same time or at nearly the same time as source-language content, not all source-language content will be finalised before translation starts (Karsch 2006). This requires terminology extraction to be performed repeatedly as new source-language content is created in order to identify the latest terms. Traditional automatic terminology extraction methods are of limited use in this iterative scenario as they normally lack the capability of distinguishing term candidates that were extracted and validated (accepted or rejected) by a terminologist in previous runs of the extraction tool and thus the amount of manual validation work increases as the project progresses. We propose instead a novel, machine-learning-based automatic terminology tool that is capable of learning the validation decisions made by the terminologist so that 1) accepted and rejected terms from previous term extraction runs do not show up in subsequent runs, and more importantly 2) the term extraction tool learns to generalise on the characteristics of the terms the terminologist is interested in and is able to better rank term candidates in subsequent runs, thus progressively reducing the amount of manual work.

This trainable automatic terminology extraction approach is based on anomaly detection algorithms commonly used to detect fraudulent activities in e-commerce, faults in servers, spam e-mail, etc. Based on the assumption that the majority of term candidates extracted by traditional terminology extraction tools are non-valid terms, we model the extraction of a valid term as an anomaly (an exception or a “spam” message) and rank these anomalies (valid terms) towards the top of the term candidate list. This requires a statistical model to be built in which non-valid terms are treated as typical members of a population of linguistic units and valid terms as anomalous members of that population that need to be detected. We use the manual validation step already present in the traditional terminology extraction process as feedback to the statistical model for retraining and fine-tuning. We expect that the more term candidates are validated, the better the statistical model will be able to generalise and produce better term candidate rankings.

4.1. Background: traditional automatic terminology extraction

A term is a lexical unit (such as a singleton, but more often a multiword unit) that designates a concept that is either specialised, specific to a domain, or otherwise considered to be important or valuable by a community of experts and/or by an organisation. In content localisation and translation, terms that are frequent, prominent, innovative or that have some value from a functional, technical, proprietary or marketing perspective need to be identified, translated and captured in a termbase accessible to translators, thus facilitating terminology consistency across the content that is being translated.

In order for terminology to be managed, it must first be identified. One way of doing so is via terminology extraction. Terminology extraction is the process in which terms are identified in and extracted from a volume of text or corpus. Automatic terminology extraction is the full or partial automation of this process through the usage of computational tools (Ananiadou 1994). Traditionally, automatic terminology extraction consists of three basic steps (Nakagawa 2000; Pazienza et al. 2005):

1. Linguistic extraction of term candidates
2. Statistical scoring and ranking of extracted term candidates
3. Manual validation (filtering) of the top-ranked terms

These three steps are graphically detailed in Figure 4.

Figure 4 Traditional automatic term recognition - items with a human icon indicate manual steps

The first step, linguistic extraction of term candidates, is conducted with the aid of a tool capable of performing part-of-speech (POS) tagging on text. Given a set of POS patterns typical of terms, the term candidate extraction tool extracts those text segments that satisfy one of the given POS patterns. It is usually the end user (usually a terminologist or user experienced in the project domain acting as a terminologist) who supplies the tool with the necessary list of typical, hand-crafted POS patterns based on his/her experience in dealing with the terminology of the domain in question. Some typical POS patterns with examples from the banking/financial domain are given in Table 1.

Table 1 Typical term part-of-speech (POS) patterns with example terms

Pattern | Example | Pattern | Example
noun | expense | adj. + noun | financial statement
noun + noun | cash flow | adj. + noun + noun | negative cash flow
noun + noun + noun | cash flow statement | verb | reinvest
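A minimal sketch of step 1, assuming the input has already been POS-tagged and using the patterns of Table 1, is a sliding-window match over (word, tag) pairs:

```python
# Patterns from Table 1, written as tuples of coarse POS tags.
PATTERNS = [
    ("noun",),
    ("noun", "noun"),
    ("noun", "noun", "noun"),
    ("adj", "noun"),
    ("adj", "noun", "noun"),
    ("verb",),
]

def extract_candidates(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs. Returns the set of
    text spans whose POS sequence matches one of the patterns."""
    candidates = set()
    for pattern in PATTERNS:
        k = len(pattern)
        for i in range(len(tagged_tokens) - k + 1):
            window = tagged_tokens[i:i + k]
            if tuple(pos for _, pos in window) == pattern:
                candidates.add(" ".join(word for word, _ in window))
    return candidates
```

On a tagged phrase like “negative/adj cash/noun flow/noun” this yields both genuine candidates (“negative cash flow”, “cash flow”) and spurious ones, which is precisely why step 2 ranks them statistically.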

The second step seeks to assign some statistical score to each of the term candidates extracted in the first step. The statistical score is interpreted to be a measure of the degree to which the term candidate designates a domain-specific concept or the degree to which it is an important/key term for the product or content being translated in functional, technical or marketing terms. Term candidates are then sorted (ranked) in descending


order according to this statistic and presented to the user (terminologist) for evaluation.

In the third step, the user (terminologist) examines the ranked list of term candidates and selects those that he/she believes to be valid terms based on his/her own experience in the field or domain in question. The user can choose to examine only those term candidates with a score above some pre-defined threshold, or to process only the first N term candidates in the list, depending on the purpose and the time and budget constraints of the terminology extraction task. As can be seen in Figure 4, depending on the quality of the candidate list, the terminologist/user can decide to tune the hand-crafted syntactic patterns supplied to the first step and/or change the statistical ranking function and/or the statistical score threshold, and run the extraction process or ranking process again.

After term validation, further steps to process the valid terms are performed by a terminologist as part of the general terminology management process. These steps include checking whether the valid terms already exist in a termbase and adding only those terms that are new to the termbase; determining whether there are any synonyms and, if so, which synonym should be the preferred term, which one should be an accepted term and which one should be a superseded or obsolete term; determining whether concepts should be split or merged; etc.

This three-step process has several shortcomings:

• Reliance on hand-crafted syntactic patterns for initial extraction in step 1.
  o It assumes that the user is acquainted with the typical syntactic structure of terms in the application domain. This assumption requires the user to be experienced in linguistics or terminology theory as well as the subject matter of the application domain, which can be a very specific combination of skills.
  o Tools that do not let the user specify these syntactic patterns assume that their pre-programmed list will apply to all domains, an assumption that is difficult to justify.
  o The main reason behind the usage of template syntactic patterns (versus using n-grams, for example) is to significantly reduce the number of term candidates detected and reported, and therefore also reduce the number of false positives detected and reported. However, this also has the potential to reduce the number of true positives, i.e. it can potentially reduce recall.
  o Users are forced to try out several alternative syntactic pattern lists by trial and error without much feedback other than the list of extracted candidate terms. This can be a laborious and error-prone manual process.
• The use of a single statistical ranking feature in step 2.
  o Numerous statistical ranking algorithms for automatic terminology extraction have been proposed.
  o Each of them is capable of potentially giving a different ranking of the term candidates, forcing the user to examine each ranked list and select the most appropriate one based on trial and error or previous experience.
  o Each ranking algorithm has its pros and cons. The user must be familiar with these pros and cons in order to use them correctly.
• The results of human evaluation from step 3 are not fed back into the terminology extraction tool.
  o The tool does not use information about the decisions made during the human evaluation step to help improve the automatic term identification process or the statistical ranking.
  o It can be very time-consuming and tiresome for ongoing projects that require term extraction to be performed periodically on new content for the same project or domain, as the tool will tend to report more or less the same terms ranked in more or less the same order as before, requiring the same effort to reject false positives from the validated term set.
  o Whilst it has been suggested to keep lists of accepted and rejected candidate terms reported in initial runs of the terminology extraction tool so that they can be automatically excluded from the reports of subsequent runs (Warburton 2014), the maintenance of these lists can quickly become cumbersome, as they add even more manual steps to the whole terminology extraction and management process, such as periodic review and clean-up of lists, merging of lists, etc.

The solution we propose aims to address most of these issues.

4.2. Our solution: trainable automatic term extraction based on anomaly detection techniques

In the solution that we propose, term candidates are extracted and automatically classified as valid terms or non-valid terms. The classification mechanism employed is an anomaly or outlier detection algorithm based on a statistical model. In this algorithm, the items being tested for anomalies are the extracted term candidates. Since the majority of term candidates are expected to be non-valid terms, we interpret anomalous items (outliers) to be valid terms. As human validators correctly discriminate between valid terms and non-valid terms, the model is gradually adjusted and learns to better classify term candidates. The expectation is that after a few human-mediated iterations, the classifier achieves an acceptable degree of precision and recall. The proposed terminology extraction process consists of three steps that roughly correspond to the three steps of the traditional process, but with more automation at each step:

1. Candidate term extraction
2. Statistical model training and ranking of term candidates
3. Manual validation (filtering) of the top-ranked terms and feedback loop for retraining

We expect that terminologists, translators and users experienced with the traditional terminology extraction process will find this proposed process familiar. And given that it requires fewer manual steps, we expect that they will find it far less tedious. This reduced reliance on manual steps also welcomes expert users familiar with the terminology of the domain, project or product who are, however, inexperienced with terminology extraction, terminology theory and linguistics in general. This process is depicted graphically in Figure 5 and detailed in the following paragraphs.

Figure 5 Trainable automatic term extraction

In the first step, term candidates and their features are extracted from a corpus. Instead of extracting candidates using user-supplied, hand-crafted syntactic patterns or traditional n-grams, we extract candidates based on what we call contiguous dependency n-grams. A contiguous dependency n-gram is a fragment of text of n words in which each of the n words holds a syntactic dependency relationship with at least one of the other words in the fragment [7]. The effect of considering these n-grams, in which words must have a syntactic dependency relation, is that it eliminates arbitrary n-gram combinations that cannot form complete, meaningful linguistic units such as terms. As a consequence, the number of term candidates is greatly reduced in comparison to traditional n-grams, whilst at the same time avoiding the bias imposed by the usage of hand-crafted syntactic patterns.

As an example, Figure 6 shows the extraction of dependency n-grams from the sentence “Companies with positive cash flow can reinvest the cash” [8]. A dependency n-gram extractor would only output the n-grams that are not crossed out in the term table shown in the figure. The crossed-out n-grams cannot form complete, meaningful terms because they do not form complete sub-graphs of the dependency graph shown in the figure; so they are safely discarded by the dependency n-gram extractor. Not all of the n-grams extracted by the dependency n-gram extractor will be valid terms, however. Some of these, for example, start or end with a stopword (with, the, can). These can be removed automatically in a filtering step after running the extractor. The candidates shown in bold are more likely to be accepted as terms by a terminologist.
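The connectivity test behind contiguous dependency n-grams can be sketched as follows: a contiguous span of words qualifies only if its words form a connected subgraph of the sentence's dependency graph. The toy dependency edges used in the usage example are illustrative assumptions, not the output of any particular parser:

```python
def connected(span, edges):
    """True if the word positions in `span` form a connected subgraph
    of the (undirected) dependency graph given by `edges`."""
    span = set(span)
    seen = {min(span)}
    frontier = [min(span)]
    while frontier:
        node = frontier.pop()
        for a, b in edges:
            for nxt, cur in ((a, b), (b, a)):
                if cur == node and nxt in span and nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen == span

def dependency_ngrams(words, edges, max_n=3):
    """Enumerate contiguous spans up to max_n words, keeping only those
    whose words are linked by dependency relations."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            if connected(range(i, i + n), edges):
                out.append(" ".join(words[i:i + n]))
    return out
```

With a toy parse where both “positive” and “cash” attach to “flow”, the extractor keeps “cash flow” and “positive cash flow” but discards “positive cash”, since those two words share no dependency link, mirroring the crossed-out cells in Figure 6.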

For each of the term candidates, a numeric vector of statistical features is also computed. These features are a combination of traditional terminology features [9], like those used in the traditional extraction method, as well as numeric features that model the topical properties of each term candidate in order to assess its domain-specific qualities. The actual features are as follows:

• Frequency – the frequency of the term candidate in the corpus.
• IDF (inverse document frequency) – a measure of the distribution of the term candidate across the corpus. A “document” here is redefined as a translation segment.
• Log-likelihood ratio of association (Dunning 1993; McInnes 2004) – for multiword terms only: this is a statistical measure of how strongly the members of a term candidate are collocated.
• C-Value (Frantzi et al. 2000) – a transformation of the frequency feature that makes it sensitive to nested terms in multiword terms.
• TermEx (Sclano & Velardi 2007) – a statistical feature that seeks to measure the domain-specific relevance and lexical cohesion of a term candidate.
• Weirdness (Ahmad et al. 1999) – a ratio that measures the difference in the frequency distribution of term candidates in the corpus from which they are extracted against their frequency distribution in a general-language corpus.
• Topical features from Latent Semantic Analysis (LSA) (Deerwester et al. 1990; Landauer & Dumais 1997; Utsumi 2013) – a mathematical representation of the semantics of individual term candidates based on the topical structure of the corpus from which they are extracted.
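The two simplest features can be computed directly; as the list notes, each translation segment plays the role of a “document” for IDF. This sketch omits the remaining features (log-likelihood ratio, C-Value, TermEx, Weirdness, LSA), which require reference corpora or matrix decompositions:

```python
import math

def term_features(candidate, segments):
    """Frequency and IDF for one candidate over a list of segments,
    using naive whitespace tokenisation for illustration."""
    cand = candidate.lower().split()
    k = len(cand)

    def count_in(tokens):
        # Occurrences of the candidate as a contiguous token sequence.
        return sum(tokens[i:i + k] == cand
                   for i in range(len(tokens) - k + 1))

    tokenised = [seg.lower().split() for seg in segments]
    frequency = sum(count_in(toks) for toks in tokenised)
    doc_freq = sum(1 for toks in tokenised if count_in(toks))
    idf = math.log(len(segments) / doc_freq) if doc_freq else 0.0
    return {"frequency": frequency, "idf": idf}
```

A candidate appearing in many segments gets a low IDF (widely dispersed, less discriminative), while a candidate concentrated in few segments scores higher.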

In the second step of the proposed terminology extraction process, a statistical model combining the features of each term candidate extracted in the first step is computed. As previously mentioned, the model assumes that valid terms are “anomalous” or interesting and therefore must be detected, whereas non-valid terms are “normal” or uninteresting and can be ignored. The model thus assigns each term candidate a score based on the probability of its “normality”/uninterestingness. The higher the score, the more “normal”/uninteresting the term candidate; conversely, the lower the score, the more “anomalous”/interesting. This score is used to produce a ranked list of term candidates, with those towards the top of the list interpreted as valid terms and those towards the bottom as non-valid terms.

As an optional input, the model can receive the terminology validation decisions (training data) from previous runs of the process and/or from a previous batch of terms in the current run, in order to fine-tune the statistical model. Note that in the very first run no training data will be available. However, since we expect the majority of the term candidates to be non-valid terms, the model will still be slightly biased towards ranking non-valid terms as “normal”/uninteresting and valid terms as “anomalous”/interesting. This ranking will improve after a few validation steps are performed.

7 Contiguous dependency n-grams are similar to the syntactic dependency n-grams proposed by Sidorov et al. (2013), except that we consider n-grams having different word order and equal dependency structure to be distinct. We plan to experiment with non-contiguous syntactic dependency n-grams in future work.
8 Taken from The Houston Chronicle at http://smallbusiness.chron.com/negative-cash-flow-mean-companys-financial-performance-bad-60010.html
9 The open-source JATE extraction tool (Zhang et al. 2008) is used to compute some of these traditional statistical features.
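The “normality” ranking idea in the second step can be sketched as follows. This is a minimal illustration, not the project’s actual model: it scores each candidate by the mean absolute z-score of its feature vector, so candidates far from the bulk of the (mostly non-valid) pool score low on normality and are listed first. The candidate names and feature values are invented.

```python
import statistics

# Hypothetical feature vectors: (frequency, weirdness) per candidate.
candidates = {
    "cash flow":       (9.0, 8.5),
    "working capital": (7.0, 9.0),
    "the company":     (3.0, 1.0),
    "was long":        (2.5, 0.8),
    "a day":           (3.2, 1.1),
}

dims = len(next(iter(candidates.values())))
means = [statistics.mean(v[d] for v in candidates.values()) for d in range(dims)]
stdevs = [statistics.stdev(v[d] for v in candidates.values()) for d in range(dims)]

def normality(vec):
    # Inverse of the mean absolute z-score: large distances from the bulk
    # of candidates give a low "normality"/uninterestingness score.
    z = sum(abs(vec[d] - means[d]) / stdevs[d] for d in range(dims)) / dims
    return 1.0 / (1.0 + z)

# Most "anomalous" (likely valid terms) come first in the ranked list.
ranked = sorted(candidates, key=lambda c: normality(candidates[c]))
```

Because most candidates are assumed non-valid, the bulk of the distribution models “normal” candidates even before any validation feedback is available, matching the bias described above.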

Figure 6 Contiguous dependency n-gram extraction example – Terms in bold are far more likely to be valid terms, subject to the needs of a project or domain. Crossed-out candidates cannot be complete terms but would be detected by a standard n-gram extractor. A contiguous dependency n-gram extractor would detect all non-crossed-out candidates (bold and non-bold). Many non-bold candidates can also be automatically discarded as they begin or end with a stop-word (the, with, can).

The third step is the only manual step in this process and requires the user (terminologist) to manually inspect the ranked list of term candidates produced by the model and validate each term candidate. Validation consists of indicating to the system whether a term candidate is a valid term or a non-valid term. Normally the terminologist will review the list in order. After a batch of term candidates (say 100 candidates) has been validated, the validation decisions (i.e. whether they are valid or non-valid terms) are fed back to the statistical model for re-computation and re-ranking. After a few iterations, when the terminologist is satisfied that the statistical model is able to rank term candidates satisfactorily for the project or domain at hand, the terminologist can export the extracted term candidates, along with the validation decisions, to the following steps of the terminology management process via the system’s APIs.
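The batch-validation feedback loop described above can be sketched as follows. The model, function names and scores are illustrative assumptions, not the FALCON implementation: after each batch, the candidates the terminologist rejected are used to re-estimate the “normal” (non-valid) profile, and the remaining candidates are re-ranked by their distance from it.

```python
import statistics

def rerank(candidates, nonvalid_feedback):
    """candidates: {name: score}; nonvalid_feedback: names already
    rejected by the terminologist in earlier batches."""
    # Re-estimate the "normal" profile from rejected candidates when
    # feedback exists; otherwise fall back to the whole candidate pool.
    pool = [candidates[c] for c in nonvalid_feedback] or list(candidates.values())
    centre = statistics.mean(pool)
    # Candidates far from the non-valid centre are anomalous: list them first.
    pending = [c for c in candidates if c not in nonvalid_feedback]
    return sorted(pending, key=lambda c: -abs(candidates[c] - centre))

scores = {"cash flow": 8.0, "working capital": 7.5, "the company": 1.2,
          "a day": 0.9, "was long": 1.0}

first_pass = rerank(scores, set())
# The terminologist rejects "the company" in the first batch; the next
# ranking is computed against the rejected profile and excludes it.
second_pass = rerank(scores, {"the company"})
```

Each call to `rerank` corresponds to one re-computation step after a validated batch is fed back to the model.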

5. WEB SERVICE INTEGRATION

This section details the interactions within the FALCON Showcase System that involve the MT and TA components as they execute the functions described above. These are presented as message sequence diagrams and accompanying tables for tracking the testing status of each operation. The sequence diagrams follow the process flow defined for the system in version 2 of the FALCON Requirements Specification (D2.1), and message IDs are indexed where appropriate to the process flow message IDs from that specification.


The overall FALCON architecture assumes that each of the components can be operated by a separate service provider, that they operate in a federated manner using well-defined web services to interact, and that they use standards-based data formats to exchange information. The components integrated have been developed separately, over different time periods and with different interoperability requirements, and therefore they adopt different approaches to implementing the web services used for integration.

The full architecture will be documented in Deliverable D2.3, ‘Revised L3Data Schema and Architecture’. This deliverable restricts itself to documenting the components that interact with the TA and MT components.

The components that the TA and MT components interact with are:
• XTM Cloud (XTM): provides the workflow and computer-assisted translation features for translation management. XTM Cloud is typically deployed as a system integration hub within a localisation tool chain, mediating between content management systems, the tools used by LSP subcontractors and external machine translation components. It allows the state of a translation project to be accessed as a Translation Interoperability Protocol Package (TIPP), which contains an XLIFF file and potentially other files such as TBX. It supports both WSDL and RESTful interfaces for interacting with other components, including those documented at http://xtm-intl.com/api as well as others developed specifically for the integration undertaken for the first time in FALCON, which will be documented in D3.8.
• TermWeb: provides terminology management functionality. It provides an XML-RPC API for remote access to and manipulation of its term base, which is structured according to the data model used in the Term Base eXchange (TBX) standard. This interface is documented at http://www.termweb.org/docs/api/ and its use in FALCON will be documented in D3.9.
• L3Data Server (L3D-Svr): a data server component that holds tabular data to be exchanged between federated components in the FALCON showcase system. It uses open tabular data in the well-supported comma-separated value (CSV) format, with meta-data supported by alignment with open data vocabularies from the W3C, captured in the CSV meta-data JSON format being developed by the CSV on the Web Working Group at the W3C. This forms the shared, provenance-based record of resources generated, manipulated and reused by components in this federated workflow, which constitutes a novel linguistic linked data knowledge base for localisation termed L3Data. Access to L3Data is offered via a simple REST interface through which the CSV and CSV meta-data files can be straightforwardly uploaded, replaced and downloaded.
• L3Data Manager (L3D-Mgr): monitors the project workflow, manages the generation of L3Data in the project and offers visual analytics of L3Data to support decision making by localisation project managers. This component therefore prototypes functionality that could readily be integrated into a translation management system such as XTM Cloud in the future.
• BabelNet: an aggregation of public multilingual lexical resources that can be accessed through a query API: http://babelnet.org/guide
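The L3Data exchange format described above pairs each CSV table with a companion metadata file. The sketch below assembles such a pair; it loosely follows the W3C “CSV on the Web” (CSVW) vocabulary, and the URL, column names and property choices are hypothetical illustrations mirroring this document’s examples, not the project’s actual schema.

```python
import csv
import io
import json

# Toy source segments: (segment ID from XTM Cloud XLIFF, segment text).
rows = [("seg0001", "Negative cash flow reduces working capital."),
        ("seg0002", "Cash flow forecasts guide decisions.")]

# Serialise the CSV body that would be PUT to .../source.csv.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["segment_id", "segment_text"])
writer.writerows(rows)
csv_body = buf.getvalue()

# Companion CSVW-style metadata that would be PUT to .../source.csvm.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "https://ex.l3d.org/cus1/prj1/source.csv",
    "dc:language": "en-GB",
    "tableSchema": {
        "columns": [
            {"name": "segment_id", "datatype": "string"},
            {"name": "segment_text", "datatype": "string"},
        ]
    },
}
csvm_body = json.dumps(metadata, indent=2)
```

Both bodies would then be uploaded with simple HTTP PUT requests to the L3D Svr, as in the interaction tables below.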

The service specifications in this document, and therefore the sequence charts detailed below, focus on the core functionality of the TA and MT components. They therefore make some initial assumptions about the security and access control involved in integrating with other components. These will be documented more fully in the revised L3Data Schema and Architecture document D2.3, but broadly entail the following:
• Private services invoked by the TA and MT components, namely TermWeb, the L3D-Mgr call-backs, the L3D-Svr and XTM, are secured via prior authentication resulting in the sharing of a session identifier and associated key.
• Invocation of the TA and MT services by components with a user-driven feature will be secured via user authentication and access controls appropriate to the role and affiliation of the individual.


The sequence diagrams capture abstract messages between components, which are indexed to the accompanying tables that detail in turn how each abstract message is implemented. For simplicity, abstract messages are classified as follows:
• SIGNAL: a message signalling a component to initiate an activity as part of a workflow
• CREATE: a message to create specific data on a component
• READ: a message to read specific data from a component
• UPDATE: a message to update specific data on a component
• DELETE: a message to delete specific data on a component
• QUERY: a message to retrieve data items from a component that match provided criteria
• TRANSLATE: a request to translate provided content

Responses to a message are given as a sub-index of the original request and the message label is given the suffix ‘-RESP’. The indexes are annotated with references to process flow interactions defined in D2.1.
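The abstract message classes above map naturally onto concrete operations in the examples that follow. The mapping below is our reading of this document’s examples (SIGNALs as POSTs to signal endpoints, CREATE/UPDATE as PUTs of CSV/CSVM resources, TRANSLATE as WSDL operations), not a normative part of the specification.

```python
# Illustrative binding of abstract message classes to concrete operations,
# as used in the interaction tables of this section (assumed, not normative).
HTTP_BINDING = {
    "SIGNAL":    ("POST",   "signal endpoint with query parameters"),
    "CREATE":    ("PUT",    "upload a new CSV/CSVM resource"),
    "READ":      ("GET",    "fetch an existing resource"),
    "UPDATE":    ("PUT",    "replace an existing CSV/CSVM resource"),
    "DELETE":    ("DELETE", "remove a resource"),
    "QUERY":     ("GET",    "fetch resources matching query parameters"),
    "TRANSLATE": ("POST",   "WSDL/SOAP operation on the MT service"),
}

def response_label(message_label):
    # Responses take the original message label with the suffix '-RESP'.
    return message_label + "-RESP"
```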

The implementation mapping for messages indexed from the sequence diagrams in the following subsections follows these conventions in relation to the interfaces that implement them:
• Interactions with the TermWeb API use a short description of the payload and a reference to the XML-RPC API call invoked. For brevity, the details of how the API is used to construct the payload are not included.
• Interactions with the Machine Translation component and the XTM Cloud component refer to simple interfaces exposed in WSDL.
• The remaining interfaces are RESTful, and interactions are documented using simple examples of the HTTP request, query parameters and payload. The examples assume the interactions are being conducted for a customer with ID “cust1”, for a project named “proj1”, and that the language pair used is English to French.


5.1. Automated Term Extraction and its Validation

Figure 7: Message sequence diagram for interaction with the TA component during automatic term extraction and validation

1 SIGNAL automatically-extract-terms (D2.1 ref: TrM-TA.B1)
Implementation: L3D Mgr signals the Text Analysis component to initiate Automatic Term Extraction, providing a pointer to the project source and indicating the ATE engine to use:
HTTP-POST-Request: ex.ta.com/signal
?signal=initiate-ATE
&engine-ref=engine0001
&src=ex.falcon.org/cus1/prj1/source.csv
&callback=ex.l3d.org/mgr/signal


2 READ project source
Implementation: The TA component retrieves the meta-data (as a .csvm file based on the CSV on the Web meta-data format10) and then the CSV file of the project source and the segmented source text from the L3D Svr:

HTTP-GET-Request: ex.l3d.org/cus1/prj1/source.csvm
HTTP-GET-Response: meta-data for the table of source segments, including data catalogue, language and provenance information

HTTP-GET-Request: ex.l3d.org/cus1/prj1/source.csv
HTTP-GET-Response: CSV table with columns:
• Segment ID (taken from XTM Cloud XLIFF)
• Segment text

The TA component runs the automatic term extraction engine over the source and selects the best results as candidate terms for the project.

3 QUERY termbase match for suggested terms
Implementation: The candidate terms are then compared against existing terms available in TermWeb for the project using the TermWeb API, calling the method getTermEntries of the interface XmlRpcTermWebSession11.

4 QUERY public lexicon match for definitions and translations of suggested terms (D2.1 ref: TA-L3D.B1)
Implementation: The TA component runs each source segment through the Babelfy API12 and records any matches with terms identified in segments by ATE. The TA component then uses the BabelNet API13 to retrieve information on the BabelNet entries returned by Babelfy that match the suggested terms, harvesting, where present, the definition of the terms and the translation into the target languages.

5 CREATE suggested project terms (D2.1 ref: TeM-TA.B2)
Implementation: The TA component logs the suggested terms resulting from ATE to the L3D Svr as a table of annotations and its meta-data:

HTTP-PUT-Request: ex.l3d.org/cus1/prj1/suggested-src-terms.csvm
HTTP-PUT-Request: ex.l3d.org/cus1/prj1/suggested-src-terms.csv

6 CREATE suggested term definitions (D2.1 ref: TeM-TA.B2)
Implementation: The TA component logs the suggested term definitions harvested using the Babelfy and BabelNet APIs. This is logged as annotations of the suggested terms and meta-data of those annotations:

HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/suggested-definitions.csvm
HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/suggested-definitions.csv
CSV table with columns:
• Suggested term definition annotation identifier
• Reference to suggested source term
• Text of suggested term definition from BabelNet
• URL referencing the BabelNet concept entry selected to provide this definition
• Score in the interval 0 to 1 indicating the confidence that the suggested definition is correct for this term

7 CREATE suggested target terms (D2.1 ref: TeM-TA.B2)
Implementation: The TA component logs the suggested term translations of the suggested source terms into the target language as harvested using the Babelfy and BabelNet APIs. This is logged as a table of annotations of the suggested terms and meta-data of that table:

HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/suggested-fr-FR-terms.csvm
HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/suggested-fr-FR-terms.csv
CSV table with columns:
• Suggested term translation annotation identifier
• Reference to suggested source term
• Text of suggested term translation from BabelNet
• URL referencing the BabelNet lexical entry selected to provide this translation
• Score in the interval 0 to 1 indicating the confidence that the suggested translation is correct for this term

8 UPDATE project terms, definitions and translations
Implementation: The suggested terms, their suggested definitions and their suggested translations are submitted to TermWeb using the ‘create’ method of the ‘XmlRpcTermWebSession’ interface for each item. Definitions are mapped to concepts, and source and target terms to term entries. This is used to create a dictionary for the project and then to move terms from existing dictionaries to this dictionary, based on matches to suggested terms, before adding the remaining suggested terms with their suggested definitions and target-language translations. The TBX categories ‘usageStatus’ and ‘processStatus’ are set to indicate that these terms are unvalidated.

1.1 SIGNAL-RESP automatically-extract-terms
Implementation: Once automatic term extraction is complete, aligned with suggestions for definitions and translations from BabelNet, and the results are submitted to TermWeb, the TA component signals the L3D Mgr:

HTTP-POST-Request: ex.l3d.org/mgr/signal
?signal=ATE-complete

The L3D Mgr advances the workflow in XTM to the suggested term validation activity.

9 READ suggested-term-validation-results
Implementation: Once the suggested term validation activity is underway, the project manager can opt to analyse its progress by signalling the TA component to collate term validations from TermWeb against the suggestions sourced from ATE and BabelNet:

HTTP-POST-Request: ex.ta.com/signal
?signal=collate-suggested-term-validation
&suggested-terms=https://ex.l3d.org/cus1/prj1/suggested-src-terms.csv

10 QUERY validation of terms (D2.1 ref: TeM-TA.B3)
Implementation: The TA component retrieves the validation status of the suggested terms, their suggested definitions and their suggested translations from TermWeb, using the ‘create’ method of the ‘XmlRpcTermWebSession’ interface for each suggested item. The changes to ‘usageStatus’ and ‘processStatus’ are used to ascertain whether individual source term, definition and target term suggestions have been validated as source term entries, concept entries and target term entries respectively.

10 http://www.w3.org/TR/tabular-metadata/
11 http://www.termweb.org/docs/api/org/termweb/api/XmlRpcTermWebSession.html
12 http://babelfy.org/download.jsp
13 http://babelnet.org/guide#html


11 CREATE/UPDATE source term validation results (D2.1 ref: TA-L3D.B3)
Note: as steps 9-10 can be repeated, steps 11-13 are initially CREATE and subsequently UPDATE type interactions.
Implementation: The TA component logs the state of validation of suggested source terms. This is logged as a table of terms from TermWeb and a table of annotations of the suggested terms by these validated terms, plus meta-data for those logs:

HTTP-PUT-Request: ex.l3d.org/cus1/prj1/validated-en-GB-terms.csvm
HTTP-PUT-Request: ex.l3d.org/cus1/prj1/validated-en-GB-terms.csv
CSV table with columns:
• Validated term identifier from TermWeb
• Term text
• Term usage status
• Term processing status

HTTP-PUT-Request: ex.l3d.org/cus1/prj1/suggested-en-GB-term-validation-anno.csvm
HTTP-PUT-Request: ex.l3d.org/cus1/prj1/suggested-en-GB-term-validation-anno.csv
CSV table with columns:
• Suggested term validation annotation identifier
• Reference to suggested source term
• Reference to validated source term
• Status of validation: one of ‘validated’, ‘rejected’ or ‘pending’

12 CREATE/UPDATE term definition validation results (D2.1 ref: TA-L3D.B3)
Implementation: The TA component logs the state of validation of suggested term definitions as project term concepts. This is logged as a table of annotations of the suggested terms and meta-data for that log:

HTTP-PUT-Request: ex.l3d.org/cus1/prj1/validated-concepts.csvm
HTTP-PUT-Request: ex.l3d.org/cus1/prj1/validated-concepts.csv
CSV table with columns:
• Validated concept identifier from TermWeb
• Definition text
• Concept usage status
• Concept processing status

HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/suggested-definition-validation-anno.csvm
HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/suggested-definition-validation-anno.csv
CSV table with columns:
• Suggested concept validation annotation identifier
• Reference to suggested definition
• Reference to validated concept
• Status of validation: one of ‘validated’, ‘rejected’ or ‘pending’

13 CREATE/UPDATE fr-FR term validation results (D2.1 ref: TA-L3D.B3)
Implementation: The TA component logs the state of validation of suggested term translations as target-language terms. This is logged as a table of terms from TermWeb and a table of annotations of the suggested terms by these validated terms, plus meta-data for those logs:

HTTP-PUT-Request: ex.l3d.org/cus1/prj1/validated-fr-FR-terms.csvm
HTTP-PUT-Request: ex.l3d.org/cus1/prj1/validated-fr-FR-terms.csv
CSV table with columns:
• Validated term identifier from TermWeb
• Term text

HTTP-PUT-Request: ex.l3d.org/cus1/prj1/suggested-fr-FR-term-validation-anno.csvm
HTTP-PUT-Request: ex.l3d.org/cus1/prj1/suggested-fr-FR-term-validation-anno.csv
CSV table with columns:
• Suggested term translation validation annotation identifier
• Reference to suggested term translation
• Reference to validated target term
• Status of validation: one of ‘validated’, ‘rejected’ or ‘pending’

9.1 READ-RESP suggested-term-validation-results
Implementation: The TA component returns references to the above validation logs of the suggested terms. The L3D Mgr can then access those data and present progress analytics visualisations to the project manager. The project manager may then opt to complete the validation phase or allow it to continue.

HTTP-POST-Request: ex.l3d.org/mgr/signal
?signal=suggested-term-validation-collation-ready
&validated-suggested-terms=
”https://ex.l3d.org/cus1/prj1/suggested-src-terms.csv,
https://ex.l3d.org/cus1/prj1/suggested-definitions.csv,
https://ex.l3d.org/cus1/prj1/suggested-fr-FR-terms.csv,
https://ex.l3d.org/cus1/prj1/validated-en-GB-terms.csv,
https://ex.l3d.org/cus1/prj1/validated-concepts.csv,
https://ex.l3d.org/cus1/prj1/validated-fr-FR-terms.csv,
https://ex.l3d.org/cus1/prj1/suggested-en-GB-term-validation-anno.csv,
https://ex.l3d.org/cus1/prj1/suggested-fr-FR-term-validation-anno.csv”
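The SIGNAL interactions in this flow are plain HTTP POSTs carrying query parameters. A minimal sketch of assembling such a request URL, using the placeholder hosts and parameter names from the examples in this section (the helper function itself is our own illustration):

```python
from urllib.parse import urlencode

def build_signal_url(endpoint, signal, **params):
    # Assemble a SIGNAL message URL: the 'signal' parameter names the
    # activity to initiate; remaining parameters point at project resources.
    query = urlencode({"signal": signal, **params})
    return f"{endpoint}?{query}"

# Step 1 of the sequence above: L3D Mgr signals the TA component to
# initiate ATE (hosts are the document's placeholder examples).
url = build_signal_url(
    "https://ex.ta.com/signal",
    "initiate-ATE",
    **{"engine-ref": "engine0001",
       "src": "https://ex.falcon.org/cus1/prj1/source.csv",
       "callback": "https://ex.l3d.org/mgr/signal"},
)
```

`urlencode` percent-escapes the embedded resource URLs, which the flattened examples in the tables leave implicit.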


5.2. Train Initial Project MT Engine and Generate Reference Translation

Figure 8: Message sequence diagram for interaction with MT component during initial SMT engine training and generation of reference translation

1 SIGNAL train-initial-MT-engine (D2.1 ref: TrM-L3D.C1)
Implementation: The L3D Mgr detects that the project workflow has been initiated and signals the MT service to generate an initial MT engine for this customer’s project (assuming the ID and credentials for this have been set up in advance). This invokes the trainMT method14 with parameters:
• Required language pair
• Reference to source training data
• Reference to aligned target training data

2 GET training-data (D2.1 ref: L3D-PD.C1 & L3D-MT.C1)
Implementation: The MT component retrieves the training data from the L3D Svr. In this iteration, no annotation-based training data selection (e.g. on term occurrence) is involved, so all data matching the language pair is returned:

HTTP-GET-Request: ex.l3d.org/pd?type=parallel&language=en-GB-t-fr-FR&domain=medical

3 CREATE mt-training-data-log
Implementation: On completion of the MT training process, the MT engine logs the training process outcome to trace problematic segments:

HTTP-PUT-Request: ex.l3d.org/cus1/prj1/enGB-t-frFR-m0v0-training-log-anno.csvm
HTTP-PUT-Request: ex.l3d.org/cus1/prj1/enGB-t-frFR-m0v0-training-log-anno.csv
CSV file with columns:
• ID of annotation of training data
• Reference to annotated training data element
• Reason for discarding from training: one of ‘length’, ‘tags’, ‘encoding’

1.1 SIGNAL-RESP train-initial-MT-engine
Implementation: On completion of the MT engine training and logging process, the MT component signals the readiness and ID of the engine to the L3D Mgr component, which uses this to drive forward the XTM workflow:

HTTP-POST-Request: ex.l3d.org/mgr/signal
?signal=MT-training-complete
&mtengine=enGB-t-frFR-m0v0

14 http://srv-cngl.computing.dcu.ie/mlwlt/typed/services/mlwlt.train_mt?wsdl

4 TRANSLATE project-source (D2.1 ref: MT-L3D.D1)
Implementation: The L3D Mgr then requests an initial translation of the project source in order to capture the confidence of the engine in those translations. This is invoked via the translate main service15.

6 CREATE source-translation-log
Implementation: The MT service logs the results of the translation to the L3D Svr. This is logged as a table of annotations of the source segments and meta-data for those logs:

HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/src-enGB-t-frFR-m0v0-init.csvm
HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/src-enGB-t-frFR-m0v0-init.csv
CSV file with columns:
• ID of annotation of source segment
• Reference to annotated source segment
• Translated text
• Confidence score from the MT engine in the accuracy of its translation of the annotated content; its value is a rational number in the interval 0 to 1 (inclusive)
• Time to translate segment: the time in milliseconds taken by the MT engine to generate the translated text from the annotated source text
• A space-separated list of words that are in the source but not present in the MT engine and are therefore left untranslated

5.1 TRANSLATE-RESP source-translate-ref
Implementation: On receiving the machine translation and confidence scores from the MT engine, the L3D Mgr combines these with term frequency scores to allow the project manager to generate the optimal order in which to guide the post-editing of these translations.

5.3. Machine Translate Segments on Request

Figure 9: Message sequence diagram for interaction with the MT component during translation of segments

1 TRANSLATE source-segments (D2.1 ref: TrM-MT.E1)
Implementation: Invoke service operation http://srv-cngl.computing.dcu.ie/mlwlt/typed/services/falcon.translate_main_seg?wsdl

1.1 TRANSLATE-RESP ack
Implementation: Unique identifier (acknowledgement code)

1.2 TRANSLATE-RESP target-segments
Implementation: Translated segment with unique identifier code

15 http://srv-cngl.computing.dcu.ie/mlwlt/typed/services/falcon.translate_main_html?wsdl

5.4. Analyse Postediting Progress and Retrain MT Engine

Figure 10: Message sequence diagram for interaction with the MT component during post-editing progress analysis and MT engine retraining

1 READ tipp-xliff (D2.1 ref: TrM-L3D.E2)
Implementation: L3D Mgr retrieves the current state of the project bi-text via an XLIFF file accessed through the XTM TIPP read API16.

2 CREATE/UPDATE mt-postedit-log
Implementation: L3D Mgr generates a log of post-editing outcomes to date. This is logged as a table of annotations of the source segments and meta-data for those logs. This integrates previously logged post-edits.

HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/enGB-t-frFR-m0v0-postedit.csvm
HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/enGB-t-frFR-m0v0-postedit.csv
CSV file with columns:
• ID of annotation of source segment
• Reference to annotated source segment
• Machine-translated text subject to post-editing
• Post-edited text
• Time to post-edit segment

3 SIGNAL generate-mt-log (D2.1 ref: MT-L3D.E1)
Implementation: L3D Mgr signals the MT component to provide a log of translations performed by the signalled MT engine. This is logged as a table of annotations of the source segments and meta-data for those logs. This integrates previously logged translations by this engine.

16 http://xtm-intl.com/api?doc=files#generateProjectFile_api

4 CREATE/UPDATE mt-log
Implementation: The MT service logs the results of the translation to the L3D Svr. This is logged as a table of annotations of the source segments and meta-data for those logs:

HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/src-enGB-t-frFR-m0v0-anno.csvm
HTTP-PUT-Request: https://ex.l3d.org/cus1/prj1/src-enGB-t-frFR-m0v0-anno.csv
CSV file with columns:
• ID of annotation of source segment
• Reference to annotated source segment
• Decorated source segment containing annotation of terms with term translation and term translation status
• Translated text
• Confidence score from the MT engine in the accuracy of its translation of the annotated content; its value is a rational number in the interval 0 to 1 (inclusive)
• Time to translate segment: the time in milliseconds taken by the MT engine to generate the translated text from the annotated source text
• A space-separated list of words that are in the source but not present in the MT engine and are therefore left untranslated
• A space-separated list of target-language term IDs that were decorated in the source when submitted for translation and were force-decoded in the MT output

3.1 SIGNAL-RESP mt-log-ref
Implementation: The MT engine returns the URL reference of the MT log.

7 SIGNAL retrain-mt (D2.1 ref: MT-L3D.E2)
Implementation: Invoke service operation http://srv-cngl.computing.dcu.ie/mlwlt/typed/services/falcon.train_mtmodel?wsdl

8 GET mt-postedit-logs
Implementation: The MT component retrieves the MT post-edit logs from this iteration of the MT engine in order to retrain it:

HTTP-GET-Request: https://ex.l3d.org/cus1/prj1/enGB-t-frFR-m0v0-postedit.csvm
HTTP-GET-Request: https://ex.l3d.org/cus1/prj1/enGB-t-frFR-m0v0-postedit.csv

7.1 SIGNAL-RESP ack
Implementation: Unique identifier code

7.2 SIGNAL-RESP mt-engine-ref
Implementation: Location reference to the retrained MT engine
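A consumer of the mt-postedit-log retrieved in step 8 might analyse post-editing effort per segment before retraining. The sketch below parses such a log; the column names follow the CSV layout described in step 2 of this subsection, while the values and analysis are invented for illustration.

```python
import csv
import io

# Toy mt-postedit-log body (columns as in the step 2 CSV layout; the
# rows and timings are made up for illustration).
log_csv = """annotation_id,source_segment_ref,mt_text,postedited_text,postedit_ms
a1,seg0001,flux de trésorerie négatif,flux de trésorerie négatif,1200
a2,seg0002,prévisions de flux,prévisions de flux de trésorerie,5400
"""

rows = list(csv.DictReader(io.StringIO(log_csv)))

# Segments whose MT output needed no change post-edit quickly; slow
# segments point at weaknesses worth addressing at retraining time.
slowest = max(rows, key=lambda r: int(r["postedit_ms"]))
unchanged = [r["source_segment_ref"] for r in rows
             if r["mt_text"] == r["postedited_text"]]
```

In the showcase flow, this kind of analysis is what lets the L3D Mgr present post-editing progress and guide the decision to trigger the retrain-mt signal.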

6. CONCLUSIONS AND NEXT STEPS

In this document, the SMT and ATE web service implementations have been described. The pre-process and post-process wrapper modifications of off-the-shelf SMT software for enabling L3Data input and output are outlined, and the internal processes of the ATE component are described.


The interfaces and sequence diagrams used to integrate with XTM, TermWeb, L3D Mgr and L3D Svr are detailed, and the API calls are elaborated.

The full integration testing will be performed as these further components are completed (to be reported in D3.6, D3.8 and D3.9). The integrated system will be evaluated and further enhancements made as necessary. The final integrated systems specification will be integrated across these deliverables and provided at the end of the project.

7. REFERENCES

Ahmad, K., Gillam, L. & Tostevin, L., 1999. University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In Proceedings of the Eighth Text Retrieval Conference.

Ananiadou, S., 1994. A methodology for automatic term recognition. In Proceedings of the 15th conference on Computational linguistics, Volume 2. Kyoto: Association for Computing Machinery, pp. 1034–1038. Available at: http://portal.acm.org/citation.cfm?doid=991250.991317.

Champagne, G., 2004. The Economic Value of Terminology: An exploratory study. Report submitted to the Translation Bureau of Canada, Montréal.

Childress, M.D., 2007. Terminology work saves more than it costs. MultiLingual, pp.43–46.

Deerwester, S. et al., 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), pp.391–407.

Dunne, K.J., 2007. Terminology: ignore it at your peril. MultiLingual, pp.32–38.

Dunning, T., 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1), pp.61–74.

Frantzi, K., Ananiadou, S. & Mima, H., 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), pp.115–130.

Gómez Palou Allard, M., 2012. Managing Terminology for Translation Using Translation Environment Tools: Towards a Definition of Best Practices. University of Ottawa.

Karamanis, N., Luz, S. & Doherty, G., 2011. Translation practice in the workplace: contextual analysis and implications for machine translation. Machine Translation, 25(1), pp.35–52.

Karsch, B.I., 2006. Terminology workflow in the localization process. In K. J. Dunne, ed. Perspectives on Localization. Amsterdam: John Benjamins, pp. 173–191.

Landauer, T.K. & Dumais, S.T., 1997. A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2), pp.211–240.


LISA, 2005. LISA Terminology Management Survey: Terminology Management Practices and Trends.

Manning, C.D. & Schütze, H., 1999. Foundations of Statistical Natural Language Processing, Cambridge, MA: MIT Press.

McInnes, B.T., 2004. Extending the log likelihood measure to improve collocation identification. University of Minnesota.

Nakagawa, H., 2000. Automatic term recognition based on statistics of nouns. Terminology, 6(2), pp.195–210.

Pazienza, M.T., Pennacchiotti, M. & Zanzotto, F.M., 2005. Terminology extraction: an analysis of linguistic and statistical approaches. Knowledge Mining, 185, pp.255–279.

Sclano, F. & Velardi, P., 2007. TermExtractor: a Web Application to Learn the Common Terminology of Interest Groups and Research Communities. In Proceedings of the 9th Conference on Terminology and Artificial Intelligence (TIA 2007). pp. 8–9.

Sidorov, G. et al., 2013. Syntactic Dependency-Based N-grams as Classification Features. Advances in Computational Intelligence. Lecture Notes in Computer Science, 7630, pp.1–11.

Utsumi, A., 2013. A semantic space approach to the computational semantics of noun compounds. Natural Language Engineering, (January), pp.1–50. Available at: http://www.journals.cambridge.org/abstract_S135132491200037X [Accessed January 22, 2013].

Warburton, K., 2014. Narrowing the gap between termbases and corpora in commercial environments. City University of Hong Kong.

Zhang, Z. et al., 2008. A comparative evaluation of term recognition algorithms. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, pp. 2108–2113.
