Leveraging SOA Principles in Development of Modular & Scalable NLP Pipelines

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computational Linguistics by Research

by

Nehal Jagdish Wani 201125005 [email protected]

International Institute of Information Technology
(Deemed to be University)
Hyderabad - 500 032, INDIA
May 2019

Copyright © Nehal Jagdish Wani, 2019
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Leveraging SOA Principles in Development of Modular & Scalable NLP Pipelines” by Nehal Jagdish Wani, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Adviser: Prof. Dipti Mishra Sharma
International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Leveraging SOA Principles in Development of Modular & Scalable NLP Pipelines” by Nehal Jagdish Wani, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Co-Adviser: Prof. Suresh Purini

To Curiosity and Perseverance

Acknowledgments

First of all, I would like to thank my adviser, Dr. Dipti Misra Sharma, who has been very supportive and understanding throughout my stay at IIIT-H. I am grateful to her for guiding me through this eventful journey. Had she not been present during my interviews or during the initial orientation session, I would probably not have been part of the IIIT-H family. She always saw right through me.

Thank you, Dr. Suresh Purini, for being a friend and a mentor. He is a lively professor in whom one can easily confide. Our discussions have helped me and my work grow a lot and become what it is today. Thank you for being there at odd times, encouraging my thoughts and helping shape them into better ideas.

I have been extremely fortunate to have interacted, learned and studied along with two of my closest seniors, Sanidhya Kashyap and Jaspal Dhillon, with whom I spent more time than with my batchmates. They have been constant pillars of inspiration. They always supported and motivated me, and I owe them a great deal for helping me, shaping my mind, answering my stupid queries, and making me realize that there is a lot more to computer science than what is taught in courses. They helped me realize my potential. It is very difficult for me to imagine how I would have survived in college without their assistance. SP Mohanty deserves a special mention here. He made me realize how ideas are spun into reality and how NLP is not restricted to only linguists. I have always envied him ;-)

The implementation of the ideas proposed in this thesis was initially done by porting numerous modules from the Sampark MT system developed during the "Indian Language to Indian Language Machine Translation" (ILMT) consortium project funded by the TDIL program of the Department of Electronics and Information Technology (DeitY), Govt. of India; so much of this work would not have been possible without support from the whole consortium. A big shout out to Rashid Ahmed and Avinash for helping me understand the integration of various components in this system. I would also like to thank the Bhat brothers (Riaz and Irshad) for helping me understand the word 'computational' in the context of NLP. They, along with Maaz, made me enjoy my stay at LTRC. Raveesh Motlani always believed in me, helped me sift through ideas and also shaped me as a person. He never gave up on me. He was always there to support me morally.

I had various interesting conversations with Ankush, Mayank, Anubhav, Anhad, Venky, Arnav, Tushant, Ayush, Kapil, Deepesh and Somay. They made me ponder over various things in life. I am honored to have been a part of the batch UG2k11 and to have been acquainted with them.


Last but not least, I am thankful to my family for their constant support and motivation to stay focused; without them I could not have gathered the courage and taken the opportunity to pursue what I wanted to.

Abstract

Installation, deployment and maintenance of an NLP system can be a daunting task, depending on the number and complexity of the components involved. One such system is a hybrid Machine Translation system, composed of several modules which define the transformation of a given word, phrase or sentence from one language to another. The end users of such a system can be developers themselves, who want to improve it. To achieve this, the system as a whole needs to adopt an architecture which lets the users control and change the order of components in the pipeline, intervene in the execution midway to debug one or more components, tweak inputs/outputs without having to rewrite the components, easily replicate the system on their local box without having to worry about the hassle of compiling everything from source, quickly replace a component with a higher or lower version, and quickly see the impact on the final result; in short, make development iterative and fast. The system should also expose an interface for those users who want to build something on top of it, without having to worry about the internal details.

The ideas proposed in this thesis try to cater to the needs of a broad category of users, in an attempt to keep their work-flow as simple as possible. We propose an architecture in which we show how to identify and transform a monolithic system into small, individual components (each being a linguistic unit), identify bottlenecks from an operating system's point of view, identify scalable components and finally provide an easy mechanism to interact with the system. To achieve this, we apply the proposed architecture to an existing system (the Sampark MT System) and walk through its transformation. Towards the end, we show the creation of a web client which demonstrates how easy it becomes to interact with the modified system. We also show that the proposition can be applied to any pipeline-based system by thinking of it as a disconnected, directed acyclic graph. We also show how the modified system can be deployed on the cloud easily and how individual components can be scaled up or down as per need.

To be able to plan the overall architecture and produce guidelines for enabling large-scale collaborative development, a structured systems analysis and design, from the points of view of both a computational linguist and a systems engineer, is required. This thesis provides the foundation in that direction by enhancing an existing system, reducing the overall runtime of its components by more than 85%, improving the test-dev-deploy cycle for computational linguists and discussing a generalized architecture on top of which further complex systems can be built for specific purposes.

Contents

Chapter Page

Abstract ...... viii

1 Introduction ...... 1
  1.1 Workflows in Language Processing ...... 2
    1.1.1 Patterns in workflows ...... 2
  1.2 Problem Statement ...... 3
  1.3 Related Work ...... 5
  1.4 Summary of Contributions ...... 6
  1.5 Organization of the Thesis ...... 7

2 The Sampark MT System: A Case Study ...... 8
  2.1 Brief Introduction to ILMT Modules ...... 8
  2.2 Functioning & Performance Analysis of Modules ...... 9
    2.2.1 Indic Converter ...... 11
    2.2.2 Morph Analyzer ...... 12
    2.2.3 POS Tagger ...... 13
    2.2.4 Transfer Grammar ...... 14
    2.2.5 Lexical Transfer ...... 15
    2.2.6 Word Generator ...... 16
  2.3 Common Traits & Peculiarities ...... 17
  2.4 Revamping Modules: Moving Towards Services ...... 18
    2.4.1 Reducing File I/O ...... 19
    2.4.2 (Re)using memory, Efficiently ...... 20
    2.4.3 Asynchronous I/O & Daemonization ...... 21
  2.5 Results of Transformation ...... 22

3 Service Oriented Architecture ...... 27
  3.1 Services: A Short Introduction ...... 27
  3.2 Adopting SOA to ILMT ...... 28
    3.2.1 The RESTful API ...... 29
    3.2.2 Anuvaad Pranaali ...... 31
    3.2.3 ResumeMT ...... 31
      3.2.3.1 Detecting Error Propagation and Rectification ...... 32
      3.2.3.2 Demonstration ...... 35
    3.2.4 A Graph based Approach for Querying ...... 35
  3.3 Use Cases ...... 37
    3.3.1 ILParser ...... 37
    3.3.2 Kathaa: A Visual Programming Framework for Humans ...... 38

4 Deployment and Packaging ...... 41
  4.1 Monolithic Application ...... 41
    4.1.1 Breakdown ...... 41
  4.2 Microservices based Application ...... 43
    4.2.1 Implementation ...... 43
    4.2.2 Architecture Benefits ...... 44
  4.3 Becoming Cloud Native ...... 47
    4.3.1 Towards Containerization ...... 47
  4.4 Demonstration ...... 49

5 Conclusions and Future Directions ...... 50
  5.1 Future Work ...... 50

Related Publications ...... 52

Bibliography ...... 53

List of Figures

Figure Page

1.1 Types of workflows ...... 3

2.1 Sample Telugu input text ...... 9
2.2 Time taken by modules in the TEL-HIN pipeline ...... 9
2.3 Tracing system calls made by modules in the TEL-HIN pipeline ...... 10
2.4 Example for UTF-8 to WX Conversion ...... 11
2.5 HIN-PAN: Old v/s Transformed Runtime Comparison ...... 23
2.6 HIN-URD: Old v/s Transformed Runtime Comparison ...... 24
2.7 PAN-HIN: Old v/s Transformed Runtime Comparison ...... 25
2.8 URD-HIN: Old v/s Transformed Runtime Comparison ...... 26

3.1 A snapshot of the Client-Dashboard ...... 32
3.2 A snapshot of the ResumeMT feature ...... 33
3.3 A sample workflow ...... 35
3.4 A networked workflow in Kathaa ...... 39
3.5 A sequential workflow in Kathaa ...... 40

4.1 An overview of AP Mono ...... 42
4.2 An overview of AP Micro ...... 46

List of Tables

Table Page

2.1 Time spent in syscalls, sorted by time in the TEL-HIN pipeline ...... 10
2.2 Sample Output by the Morph Analyzer module ...... 13
2.3 Transfer Grammar rules (HIN-TEL) on paper ...... 14
2.4 DSFRecord - Modified ...... 15
2.5 Word Generator - Sample Conversions ...... 16

3.1 SOAP, RPC and REST ...... 29

Chapter 1

Introduction

Natural Language Processing systems are being embedded heavily into various applications used in our day-to-day lives. They have had an intriguing impact in the past decade. This has been possible because of a broad category of language researchers who have spent a huge amount of time studying languages, coming up with theories, conceptualizing ideas, writing code and publishing their research. However, the systems that are developed as a result of this research can be inherently very complex. Sometimes, these systems also work under the assumption that their end users will have a certain level of pre-requisite knowledge, depending on whether they want to test, modify, or improve them. So, when young padawans arrive in this field and wish to try out something that has already been done, they either have to contact the original authors, or sift through a lot of code and understand how it is all tied together, or find out whether any of their seniors might have worked on it, or give up and re-implement it; this results in a lot of time and effort being spent needlessly. It also inhibits students from other trans-disciplinary fields from working together and reusing existing work.

The work in this thesis tries to bridge the gap between the original developer and the future developer/end user by proposing an architecture which helps in iterative development, identifying the essential components and then exposing these components as black boxes so that something else can be built on top of them. We show how the proposed ideas have been applied to a subset of the language pairs of the ILMT system and also make use of them to lay a foundation for developing creative applications that utilize the underlying functionality. One such example is Kathaa, a visual programming framework for Natural Languages.

In this chapter, we start with a brief introduction to workflows, then define the problem addressed by this thesis and state our contributions. Later, we talk about micro-services, a key enabler for development and operations in the industry these days.

Our work is also pertinent to cloud paradigms such as Software-as-a-Service (SaaS).

1.1 Workflows in Language Processing

In any area of language processing, be it natural or programming languages, a given process can comprise multiple tasks. For example, when the GNU Compiler Collection1 is invoked to compile a certain piece of code, the process of compilation comprises a series of steps. First, it has to preprocess the source code (replace macros, operate on include headers, remove comments, etc), then it has to lexically analyze the code by converting it into a stream of tokens. The next step is to follow a well-defined grammar to construct a parse tree (syntax errors are detected at this stage); after that, the tree is semantically analyzed to verify whether the parsed content is meaningful, and then it reaches the phase of intermediate code generation. The next steps are target architecture and platform dependent, where the code is further optimized and then the assembler and linker are invoked to write the final machine code. Now, let's draw an analogy for a similar pipeline, but in the context of NLP. For example, in machine translation, the initial steps comprise cleaning up the input (source language text) and performing tokenization. Then, various source language specific tasks are performed, like part of speech tagging, morphological analysis and computing Vibhaktis, and similarly, we end up with a meaningful tree which can further be parsed. Analogous to the intermediate code generation phase, the corresponding task in natural language translation is grammar transfer. At this point, the pipeline pivots to perform tasks specific to the target language and finally generates text in the target language with the help of transliteration and word generation. These tasks, steps or phases, through which input transforms from initiation to completion, form a workflow.

1.1.1 Patterns in workflows

Workflows allow users to easily express multi-step computational tasks. An instance of a workflow can be described as a directed acyclic graph (DAG), where the nodes are tasks and the edges denote the dependencies (a minimal code sketch of this view is given below). Workflows can be simple or complex. Workflow patterns can be observed in the process of developing process-oriented applications; they refer to recurrent and generic business-process constructs. Broadly speaking, workflow patterns can be classified into three categories:

1. Independent or pooled: All tasks which are part of the process pipeline (or groups of smaller subtasks) do not have any interdependency. They can be executed independently. In the analogy above, if the task at hand is only to process the input text or clean up corpora, then it falls in this category. Here, running the pre-processor (which may comprise smaller tasks) does not depend on any other task, only requires the input, and hence can be massively parallelized.

2. Sequential: Let's say the end goal of a workflow is coreference resolution (determining identical entities). This cannot be done if the input text is not normalized (with ambiguities resolved and abbreviations cleaned up). Next, ambiguities cannot be resolved without identifying the named entities, which in turn cannot be done without finding matching entities and assigning them ontology classes, which further requires the input text to be pre-processed (tokenized, sentences split, etc). In this instance, all tasks that comprise the workflow are sequential in nature, which means task i+1 has to wait for task i to finish. The workflow completes only when tasks 1 through n have finished.

1https://gcc.gnu.org/

3. Networked or Interdependent: In the pipeline of compiling source code, the linker cannot be invoked until all the object files are ready, and the object files are not ready until the pre-processor has processed the source files. In this case, the pre-processing task can be done in parallel, depending on the number of source files, but the linking step has to wait. Workflows which can be partially pooled and partially sequential fall under this category.

A more granular and detail-oriented collection of workflow patterns has been compiled by Wil van der Aalst et al.2 in their book [13], Workflow Patterns: The Definitive Guide, to describe control-flow dependencies.
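The DAG view mentioned above can be made concrete with a small sketch. The task names and the run() helper below are purely illustrative assumptions (they are not part of any existing workflow engine or of the Sampark code base); the sketch only shows how a networked workflow reduces to a dependency mapping that is executed in topological order.

# A workflow as a directed acyclic graph: each task maps to the list of tasks
# it depends on. The names below are hypothetical NLP steps.
workflow = {
    "tokenize":      [],                                # pooled: no dependencies
    "pos_tag":       ["tokenize"],                      # sequential: waits for tokenize
    "morph_analyse": ["tokenize"],                      # may run in parallel with pos_tag
    "chunk":         ["pos_tag", "morph_analyse"],      # networked: waits for both
}

def run(workflow, tasks, text):
    """Execute tasks in dependency order; finished outputs feed dependent tasks."""
    done = {}
    pending = dict(workflow)
    while pending:
        # Pick every task whose dependencies have all finished.
        ready = [t for t, deps in pending.items() if all(d in done for d in deps)]
        if not ready:
            raise ValueError("cyclic dependency: a workflow must be acyclic")
        for t in ready:
            deps = pending.pop(t)
            inputs = [done[d] for d in deps] or [text]
            done[t] = tasks[t](*inputs)
    return done

A purely sequential workflow is the special case in which each task depends only on its predecessor, and a pooled workflow is the case in which every dependency list is empty.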

Figure 1.1 Types of workflows

1.2 Problem Statement

A machine translation system, or any complex NLP system, can in most cases be represented as a scientific workflow. It is quite usual to have multiple researchers working on the system at the same time, but belonging to different domains and having varying levels of technical expertise. So, in general, there is a divide between the people who devise the core logic (linguists), those who implement it in the form of code (computational linguists), those who integrate the various modules into a control flow and publish it as a workflow (computer engineers), and the end users, who do not necessarily need to know about the internals of the system, but only the input/output format. Such users could be anyone, ranging from a non-computer geek to researchers in the fields of Artificial Intelligence or Machine Learning, or data scientists.

2http://www.headlessbrick.org/mediawiki2/images/4/48/WorkflowPatterns-van_der-Aalst-2003.pdf

To elaborate on a certain subset of the problem statement, let's assume that CL1 and CL2 are two computational linguists who work closely with two linguists, L1 and L2, and E1 is an engineer working on the logic to glue all the components together. For the sake of this example, let's say that CL1 and CL2 are working on modules M1 and M2, respectively. In the final pipeline, P1, M2 needs input from M1.

Assuming that the first version of P1 has been deployed, the following scenarios may arise:

1. L1 wants to test the output that M1 might generate for a certain test case (which could be an edge case). At this point in time, L1 shouldn't have to learn how to write a program to query the system. Instead, there should be a user-friendly web view which lets L1 run the input and see the output of each intermediate step without having to write a single line of code or get assistance from CL1.

2. L2 would like to pause the execution of tasks in P1 at Mi, manually edit the output, and resume execution.

3. CL2 wants to test the implementation of M2 for certain inputs (outputs of M1). CL2 should be able to query M2 directly, without having M1 generate those outputs again, and should be able to do so without assistance from E1.

4. CL2 found a bug in M2 and wants to test a newer version of M2 without having to change P1 or wait for E1 to create P2. There should be a well-documented API which lets CL2 query the outputs up to M1, send them to the revised M2, and then send the output of the revised M2 to the rest of the pipeline, without assistance from E1.

5. CL3 enters the arena and wants to reuse parts of P1 for Px, which might potentially become P3 at some point in the future. CL3 should be able to do this without needing assistance from E1.

6. M2 requires certain pre-processing to be done before it can process input from M1. CL2 and E1 should make sure that M2 does not repeat that pre-processing for every input.

7. CL4 comes up with an idea to modify the order of tasks in P1 and should be able to do so without requiring assistance from E1.

8. If CLi is no longer available and the software stack on which Mi depends has become old or deprecated, Mi should still be deployable by E1.

It is often the case that Pi is a collaboration between a consortium of institutions I. This also makes it essential to have a clear demarcation between the modules Mi, such that the ownership is clearly defined and, if the CLi from institution x ∈ I decides to move on to something else, a CLj (j ≠ i) from institution x can easily take up ownership of Mi.

The goal of this thesis is to provide an architecture that caters to the requirements demanded by the scenarios described above by the four different categories of people. We wish to make the development of NLP pipelines easier. The idea is to give more power and freedom to the end user. We wish to make the underlying system easy to debug, allow dynamic sequencing of components in the workflow, provide strict isolation between modules, define clear ownership, allow parts of the workflow to be reusable, make the system flexible enough to allow processing of multiple inputs concurrently and provide a well-defined interface to the end users. The architecture also needs to be cloud ready, allow reproducibility of the modules and make them easily deployable.

1.3 Related Work

In this thesis, some ideas for the SOA architecture are built on top of AnnoMarket [16], which adopts the Software-as-a-Service (SaaS) model to reduce the complexity of deployment, maintenance, customization, and sharing of text processing resources and services. It also enables end users to deploy their custom components/applications and receive revenue via the AnnoMarket marketplace. LetsMT! [18] is a Statistical Machine Translation (SMT) platform which provides a user interface and integrates with already existing services like Moses3 to save end users from the trouble of setting things up themselves. However, it provides a stateful service, whereas our contributions focus on defining a stateless service. NLPCurator [21] is an NLP management system with a user interface. It allows end users to configure pipelines over components by specifying input dependencies for components in a configuration file. It also allows multiple NLP components to be distributed across multiple machines. Similar to AnnoMarket, NLPCurator also relies on Amazon AWS services. Each service provided by them sets up a "cluster" on the cloud which is composed of a fixed number of components. NLP@Desktop [14] employs a service oriented architecture and proposes a framework for creating desktop based NLP applications, which integrate well with email clients, word processors, browsers, etc., and makes use of the DBus channel for seamless integration. P. Kumar et al. [8] focus on automating and decreasing the time taken for deployment of an NLP MT system on the cloud. They also discuss means to facilitate auto-scaling of computational resources for varying load conditions. P. Gupta et al. [6] discuss scaling of the Sampark MT System by making efficient use of memcached for caching input texts, without modifying the underlying components. In contrast to this, we discuss enhancing the performance of the curated workflows by making improvements, from a systems engineering point of view, to the components, sub-components and their integration. Paolo et al. [4] conducted surveys and case studies about architectural transitions from a monolithic to a microservices based architecture. The challenges incurred in the process are explained, and they show how the latter saves the day by handling complexity and size.

3http://www.statmt.org/moses/

TectoMT is yet another modular NLP framework, designed and developed by Martin et al. [12]. It enables fast and efficient development of NLP applications by exploiting a wide range of software modules already integrated in TectoMT, such as tools for sentence segmentation, tokenization, morphological analysis, POS tagging, etc. Each NLP task corresponds to a single processing unit, a 'block', which has to be written as a class in the Perl programming language. All workflows are file based and are always represented as strictly linear scenarios. This system follows an approach to NLP tasks similar to that of the legacy system on which we perform a case study in Chapter 2 of this thesis. The main focus of TectoMT is on building new applications using existing blocks, whereas one of the themes touched upon in this thesis is being able to test and work on existing modules or blocks as well. We also discuss non-linear workflows in Section 3.2.4. Erhard et al. [7] created WebLicht, a web based service oriented environment for the integration and use of language resources and tools. Their system makes use of an XML based data exchange format, using which they create RESTful web services that leverage wrappers written in Perl. In their tutorial4, they also discuss the use of daemons, which we make extensive use of, as discussed later in Section 2.4.3. Our approach to services is quite similar, but differs in the aspect that the techniques we use are not tied to a specific format.

1.4 Summary of Contributions

This thesis proposes taking advantage of a novel service oriented architecture to address the problem statement. In the process, we make the following contributions:

• We analyze a legacy Machine Translation system and port the internals for four language pairs to the proposed architecture. This involves identifying bottlenecks, making changes to the internals to conform to the new model and adding optimizations to the underlying components.

• We show the different ways in which the transformed system can be queried and how it can be of use to various researchers. A Disconnected, Directed Acyclic Graph based query system is discussed.

• A web-based client, Anuvaad Pranaali, is developed, which acts as the proof of concept for the proposed backend. This is followed by ResumeMT, a mechanism which lets end users modify workflow outputs midway and resume from there, without having to write a single line of code. We also show how this backend can be used by systems like Kathaa.

• We share two deployment strategies, both service oriented, but one monolithic in nature and the other based on microservices. A concise comparison is performed. We also show how these make the workflow easily deployable and scalable in the cloud, and we discuss two mechanisms for making the modules easily reproducible.

4https://weblicht.sfs.uni-tuebingen.de/TCF-Webservices-in-Perl.pdf

1.5 Organization of the Thesis

The structure of the thesis is as follows:

• Chapter 2 explores the Sampark MT System [2]; we walk through the various areas in which the system can be improved. Further, we identify certain bottlenecks, discuss various techniques that can be used to resolve them and show performance metrics for certain transformations.

• Chapter 3 introduces the service oriented architecture that is used to transform the system discussed in Chapter 2. It presents the different ways in which the system can be queried by the end users.

• Chapter 4 presents two deployment strategies, contains an experimental setup and compares the two. It also discusses packaging and distribution mechanisms.

• Chapter 5 marks the end of this thesis with the conclusions and scope for future work.

Chapter 2

The Sampark MT System: A Case Study

The Sampark1 system [8] is a multi-part machine translation system developed through combined efforts under the umbrella of the consortium project "Indian Language to Indian Language Machine Translation" (ILMT), funded by the TDIL program of the Dept. of IT, Govt. of India. It uses the Computational Paninian Grammar (CPG) approach for analyzing language, and it uses both traditional rule-based and dictionary-based algorithms with statistical machine learning. It is based on the analyze-transfer-generate paradigm: first, analysis of the source language is done, then a transfer of vocabulary and structure to the target language is carried out, and finally the target language output is generated. The major modules together form a hybrid system that combines rule-based approaches with statistical methods, in which the software uses rules generated through 'training' on text tagged by human language experts. All of the modules operate on a common data representation called the Shakti Standard Format (SSF)2. The textual SSF output of a module is not only for human consumption; it is also used by the subsequent module in the data stream as its input.

2.1 Brief Introduction to ILMT Modules

At the time of writing this thesis, the ILMT system has 9 Indian language pairs, out of which 5 have been made available on the Sampark MT Dashboard. An end user can input a set of sentences and get the translated output. There is also an option to see the intermediary steps involved in the process, which can be accessed once processing is complete. For any given language pair, the system is primarily composed of several modules, certain common utilities (such as SSF tree parsing APIs, format converters, etc), a specification file, and resources such as dictionaries and trained models. The modules are always broken down into three categories: Source Language Analyzer Modules (SL), Source to Target Language Transfer Modules (SL-TL) and Target Generation Modules (TL). The control flow of the pipeline is determined by the specification file, which declares a directed acyclic graph where each node in the graph is a wrapper script for a particular combination of module and module version.

1http://www.cfilt.iitb.ac.in/~ilmt/download/overall_flow_ilmt_system_0.3.pdf
2http://sampark.org.in/sampark/web/ssf-guide-4oct07.pdf

The spec file also defines the interdependencies between nodes. For any given input, a utility called dashboard acts as the orchestrator: it takes in the input, parses the specification file and invokes the modules in a serial fashion.

2.2 Functioning & Performance Analysis of Modules

In this section, we provide insights into the linguistic work performed by the common modules and subsequently perform an analysis of different runs of a particular pipeline on a given input, in an attempt to identify the bottlenecks.

Figure 2.1 Sample Telugu input text

Figure 2.2 Time taken by modules in the TEL-HIN pipeline

%time    seconds     µsecs/call   calls    syscall
65.15    6.467371    14533        445      wait4
34.27    3.401699    4138         822      futex
 0.52    0.051258    17           3022     write
 0.03    0.003327    0            93807    read
 0.01    0.001049    2            509      munmap
 0.01    0.000912    0            5601     open
 0.01    0.000511    0            16324    stat
 0.00    0.000121    0            5099     close
 0.00    0.000120    0            6052     mmap
 0.00    0.000119    0            8241     lseek
 0.00    0.000111    3            32       unlinkat

Table 2.1 Time spent in syscalls, sorted by time in the TEL-HIN pipeline

For the input shown in Figure 2.1, we translated the sentence using Sampark 100 times to try and understand the time taken by each module. As we can see in Figure 2.2, the whole translation took an average of 4 seconds, with a standard deviation of 0.03 seconds. The most time-consuming module was TransferGrammar-v2.4.7, followed by LexicalTransfer-v3.2.3 and then WordGen, all of which took more than 0.5 seconds to process their inputs. We ran the same test again, but this time we actively logged all the system calls being made by all the modules, as can be seen in Figure 2.3.

Figure 2.3 Tracing system calls made by modules in the TEL-HIN pipeline

By using the strace utility, we traced all the system calls made in the execution of the TEL-HIN pipeline; this is summarized in Table 2.1. System calls have an inherent overhead: a system call is a means of tapping into the kernel, a controlled gateway towards obtaining some service or resource. To put it in an over-simplified manner, this involves invoking an interrupt, wherein the processor has to switch from user space to kernel space and then back, which also involves saving registers onto the kernel stack, validating the arguments, performing the actual task and then restoring the registers. We noticed that about 34% of the whole time was being consumed in dealing with futex syscalls. That essentially means that about 34% of the time is spent trying to get a lock on a file, or waiting for the lock to be released. The number of calls to stat is also considerably large. This gave us a good hint that the system was spending a lot of time dealing with a lot of files. Another noticeable point was that for each translation, around 260 processes were being spawned, which is surprisingly large and could be a major deterrent when it comes to scalability of the whole system. We managed to reconstruct the whole process tree3, along with the time consumed by each process, using valgrind, which gives a more granular view of the time spent inside modules. We also generated a call graph4 to get more insight into which library calls were widely used. On running the same analysis on the rest of the Indian language pairs, we saw similar trends. In the following subsections, we will examine a subset of these modules.

3http://ltrc.github.io/ILMT-API/sampark_profiling.html
4https://raw.githubusercontent.com/ltrc/ILMT-API/gh-pages/sampark_callgraph.png

2.2.1 Indic Converter

Figure 2.4 Example for UTF-8 to WX Conversion

The converter is used for transforming input text from UTF to WX notation and vice versa. Different programs accept input in different formats, and these programs are part of the modules (individually or collectively). Some accept UTF-8 bytes, while others work only on ASCII characters (bytes). However, there is no one-to-one mapping of characters from UTF-8 to WX; to overcome that, an intermediate representation made especially for Indian languages, known as ISCII, is used. The code points are mapped to corresponding decimal code points in ISCII and, from there, they are first normalized and then converted (transliterated) to WX notation. Figure 2.4 shows an example of such a transformation. Because of the flexibility to accept input in different formats, converter-indic can be used more than once in the translation pipeline for a given language pair. These mappings are represented using hash map data structures, and the normalization rules are based on replacement rules and regular expression engines. Computationally, there is a cost to creating the hash maps in memory (for the first time) if the program is started from scratch every time it is called. Similarly, before a regular expression can be used, its engine has to be computed first. If the system is modified such that these mappings are pre-computed and the regex engines are pre-compiled, then the look-ups become significantly faster.
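The following minimal sketch illustrates this idea; it is not the actual converter-indic code, and the mapping entries are placeholders rather than the real ISCII/WX tables. The hash maps and the normalization regexes are built once, at module start-up, so that every subsequent call only pays for table look-ups and pre-compiled substitutions.

import re

# Built once, when the module/daemon starts (placeholder entries only).
UTF_TO_ISCII = {u"\u0905": 164}       # DEVANAGARI LETTER A -> ISCII 0xA4
ISCII_TO_WX = {164: u"a"}             # ISCII 0xA4 -> WX 'a'
NORMALIZE = [
    (re.compile(u"\u200d"), u""),     # pre-compiled replacement rules
    (re.compile(r"\s+"), u" "),
]

def utf_to_wx(text):
    # Per-call work is now only dictionary look-ups and regex substitutions.
    for pattern, replacement in NORMALIZE:
        text = pattern.sub(replacement, text)
    return u"".join(ISCII_TO_WX.get(UTF_TO_ISCII.get(ch), ch) for ch in text)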

2.2.2 Morph Analyzer

The primary focus of this module is to perform morphological analysis of a given input, which it does by compiling and analyzing words belonging to a natural language. For a given input, it identifies linguistic entities like gender, number, person, case, suffixes, prefixes, causality, tense/aspect/modality and much more. These morphological features are then put together in a feature structure array and, after combining them for each word in the input, converted into the SSF format [1]. For this module, the input strictly follows the format: (address, a.k.a. position of the token in the sentence), (the token itself) and (category), and the output has these three columns along with a fourth one, called the feature structure array, which is basically a '|'-separated list of abbreviated features (af), or simply put, a list of attribute-value pairs. Regardless of the category in which the input token falls, the output is required to have the following mandatory features: root, lexical category (n/v/adj/adv/pn/nst/avy/psp/num/punc/unkn), gender (m/f/n/mf/fn/mn/any), number (sg/pl/dual/any), person (1/2/3/any), case (direct/oblique), vibhakti (if noun: case marker; if verb: TAM) and suffix. If any linguistic entity can't be computed at this stage, it is left blank. The module has two phases. In the compilation phase, it reads paradigm and lexicon input data files and stores the data in processed form, as gdbm files. These data files are compiled using GNU dbm5, which stores key/data pairs in data files. In the second phase, also known as analysis mode, the morph analyzer makes use of these pre-compiled databases to process the tokens of each input and generate their morphology. For a given word, first the suffix or prefix is identified and its validity is checked. If the suffix is valid, the stem is checked for validity by converting it to a root word (addition and deletion operations are performed). If the root word is also valid, then the final feature structure array is generated. The SSF API for this module is compiled along with the generated ELF binary, thereby reducing the need for API wrappers. The file-backed databases generated in the first phase are used in the second phase by memory-mapping them with the mmap() system call, which the GDBM library uses internally by default. This is hugely beneficial for three reasons: (a) when multiple processes access the same file in a read-only fashion, they share the same physical memory pages, saving a lot of memory; (b) random access can be performed on huge files; and (c) portions of the file can be accessed via memory operations (memcpy, pointer arithmetic). In each case one needn't worry about file buffering. Regardless of this design, in the Sampark system the analyzer was being spawned for each input, and hence could not take full advantage of these benefits. Also, it could process only a single input at a given time.
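The open-once, query-many pattern that the module misses out on can be sketched as follows. The real morph analyzer is a C binary, so this is only an illustration in Python, with hypothetical database paths, of how the pre-compiled GDBM databases would be opened a single time and then reused for every request.

import dbm.gnu   # Python 3 interface to GNU dbm

# Open the pre-compiled databases once, at start-up. GDBM mmap()s the files
# internally, so multiple reader processes share the same physical pages.
paradigms = dbm.gnu.open("paradigms.gdbm", "r")   # hypothetical paths
lexicon = dbm.gnu.open("lexicon.gdbm", "r")

def analyse(token):
    # Return the stored record for a token, if the lexicon knows about it.
    key = token.encode("utf-8")
    if key in lexicon:
        return lexicon[key].decode("utf-8")
    return None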

5https://www.gnu.org.ua/software/gdbm/

Input: ADDR, TKN, CAT          Output: ADDR, TKN, CAT, fs

ADDR   TKN       CAT   fs
1      bahuwa          | | | |
2      acCI
3      aMgrejI         | | |
4      jAnawe
5      We

Table 2.2 Sample Output by the Morph Analyzer module

2.2.3 POS Tagger

Part of speech (POS) tagging is the process of assigning a part of speech to each word in the sentence. Identification of the parts of speech, such as nouns, verbs, adjectives and adverbs, for each word of the sentence helps in analyzing the role of each constituent in a sentence. There are a number of approaches, such as rule-based, statistics-based, transformation-based, etc., which are used for POS tagging. In most of the language pairs in the Sampark System, a statistical approach has been adopted. This module uses the CRF++ Tagger6 to create a model by training on golden data (100-400k annotated input points) created by human language experts. CRF++ uses a combination of forward Viterbi and backward A* search. In most of the language pairs in ILMT, a unigram based model is used; when the source language is Hindi, the feature functions for each word are created by looking at two tokens before, two tokens after, and the 13 entities for a token which determine the length, the suffix and prefix tree, and the feature structure array obtained from the Morph Analyzer. In certain language pairs, a series of post-processing steps is also performed, which further smoothens the output and modifies outputs which are known to be linguistically incorrect and follow a pattern. When this module is run, it invokes the crf_test command line utility to test the model against the input, loading the model into main memory for each input, which can consume up to 10-15% of the execution time, depending on the size of the model. Similar to the GNU dbm library, this implementation also relies on mmap() to load the model into main memory. However, there is no guarantee that on each invocation the pages corresponding to the model file will reside at the same address in physical memory.

6http://taku910.github.io/crfpp/

Sample Rules

Change case marker from 'se' to 'ni' and root of the verb phrase.
R29092015: NP~1((%PRP((%VM))
           ⇓
           NP~1((%PRP((%VM))

Delete genitive when the following word is binA.
R100201:   NP NEGP((binA%NEG))
           ⇓
           NP NEGP((binA%NEG))

Table 2.3 Transfer Grammar rules (HIN-TEL) on paper

Another module, the chunker, has a similar implementation, wherein it uses a trained model to compute its output. Although its main focus is on segregating noun groups and other phrasal structures, it plays a crucial role in the shallow parsing phase of the source language.
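One way around reloading the model for every input, assuming the SWIG Python bindings that ship with CRF++ are installed (the Sampark modules themselves shell out to crf_test instead), is to load the trained model once in a long-running process and reuse it; the model path below is hypothetical.

import CRFPP

# Load the trained model once, when the daemon starts; the 10-15% per-input
# model-loading cost is then paid a single time.
tagger = CRFPP.Tagger("-m pos_model")   # hypothetical model path

def tag(conll_lines):
    # conll_lines: one CoNLL-style line per token (token plus features).
    tagger.clear()
    for line in conll_lines:
        tagger.add(line)
    tagger.parse()
    return [tagger.y2(i) for i in range(tagger.size())]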

2.2.4 Transfer Grammar

The Transfer Grammar module is a major component of the MT system. It is based on the analyze-transfer-generate approach and provides structural transference, which deals with transferring the sentence structures of one Indian language (IL-1) to another Indian language (IL-2). This method is entirely rule based, which implies that the module uses a set of rules which give the corresponding structures of IL-1 and IL-2. The transfer grammar module is used after shallow/full parsing of the input sentence has already been done, and it involves querying a set of transfer rules. These rules map the structural differences between a language pair. Each rule has an LHS and an RHS. They are also stated slightly differently in the PSG (Phrase Structure Grammar) and DG (Dependency Grammar) formalisms: the former is a purely syntactic approach which uses a set of phrase structure rules, is constituency based and leaves the order of elements in a sentence implicit, whereas the latter captures the semantic relations of the elements of a sentence. The operations performed by these rules can be categorized into insertion, deletion and reordering. Table 2.3 shows two such rules. These rules are represented in the form of in-memory, multi-level, nested dictionaries. For a rule file of size 395921 bytes, the data structure can take up to 9739953 bytes.

ID            999999
CAT           VERB
CONCEPT
EXAMPLE
SYNSET-LANG   W1/LANG1, W2/LANG3, W3/LANG2

Table 2.4 DSFRecord - Modified

For a single execution of this module, using the benchmark module in Perl7, we found that the major part of the time (~40%) was being spent on parsing the rules and formulating the in-memory data structures representing them, with the rest (~60%) spent matching the rules against a given input.

2.2.5 Lexical Transfer

The Lexical Transfer module formulates the process of finding the correct word in the target language, given a particular word in the source language. It is a two stage process: in the first stage the source word is disambiguated and its sense is identified, and in the second stage an appropriate target word is selected from the sense identified in the first stage. A source word may have multiple possible translations in the target language. If the word is not disambiguated by the WSD Engine, then the most frequent sense of that word is selected. This word would then be the (closest) correct translation of the source word in the target language. For any given input, this module requires the use of several dictionaries. For a given pipeline, as many as five dictionaries were being used, in a serial fashion. A word is first searched in the session dictionary (level-1); if not found, the next search is performed in the user dictionary (level-2). If the word is still not found, then the Proper Noun (NNP) dictionary (level-3) is looked up. If the word is unknown up to the 3rd level, the session dictionary (level-4) is searched. Finally, if the word is still not found, it is queried for in the bilingual dictionary (level-5). Each of these dictionaries is supposed to have three types of mappings:

• .idx: For each POS tag, a mapping from each known word of this part of speech to a list of synset ids to which the word belongs, e.g. हकुम 04176 04182 04185 022358 043835

• data.: For each POS tag, a reverse mapping from synset ids belonging to this POS tag to the corresponding synset members, e.g. 10009:~:NOUN:~:लौटने की क्रिया का भाव:~:"श्याम अपने गांव लौट गया":~:लौट

• -crossLinks.idx: For each POS tag, a sorted list of all words of all synsets belonging to this POS tag, mapped to a cross-linked word in the target language, e.g. 00004021 TEL- अंदर పల

Words in these dictionaries were populated by reading a language pair specific input file consisting of entries conforming to an extended version of the Double Stream Flag (DSF) Record format.

7https://perldoc.perl.org/Benchmark.html

Input                Output
abbAyi     VBN       abbAyilwotegAni
puswakaM   NN        puswakamulUnu
we         VBN       weVccinAdu
AmeV       PRON      AmeV
.          SYM       .

Table 2.5 Word Generator - Sample Conversions

In Table 2.4, W1, W2 and W3 are words in the synset and W1 is cross-linked with the first word of LANG (the pivot) of synset 999999. These input files are filled in by language experts, and a static tool is used to update the dictionary mappings, which is an off-line task. For a given run of this module, about 90% of the execution time was consumed in dictionary lookups for the given inputs, whereas most of it should have gone to the WSD engine, since that constitutes the core logic. A small optimization was already included, wherein for a given lexeme the dictionary lookup functions inside the components {CrossLinks,Synset,WordIndex}RandomAccess performed a binary search on the files on disk to bring the runtime complexity down from O(n) to O(log(n)). Although beneficial for systems with low memory, this was a major reason behind the huge number of lseek() system calls in Table 2.1. For a system that should be able to analyze multiple inputs concurrently, this was sub-optimal. This warranted the need for a more efficient representation of the data and a better way to process and re-use static data.
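A sketch of the in-memory alternative that was adopted later: each dictionary level is loaded into a hash map once, at start-up, and a word is resolved by walking the levels in the order described above. The file names and the index format below are simplified placeholders, not the actual dictionary files.

def load_index(path):
    # Load a simplified '.idx'-style file into a dict: word -> list of synset ids.
    index = {}
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            parts = line.split()
            if parts:
                index[parts[0]] = parts[1:]
    return index

# Loaded once, in lookup order (level-1 to level-5); paths are placeholders.
LEVELS = [load_index(p) for p in ("session.idx", "user.idx", "nnp.idx",
                                  "session2.idx", "bilingual.idx")]

def lookup(word):
    # O(1) hash look-ups replace the per-query binary search (lseek()) on disk.
    for level in LEVELS:
        if word in level:
            return level[word]
    return None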

2.2.6 Word Generator

The task of this module is to generate an inflected word form, given the root and its feature structure array containing morphological information about the word (as discussed in Section 2.2.2). The generator inflects the root word according to the morphology of the language and outputs the target language word form. The words thus generated are later concatenated to form the complete target language sentence. In general, the word generator for a target language requires resources to find out which paradigm class the given word belongs to, find the inflected word which would be morphologically suitable in the given context, and then apply the add/delete rules to the inflected word. Depending on the complexity of the morphology of words in the target language, it may also require a lot of post-processing steps based on rules made by language experts (later transformed into regular expressions), which is very common for agglutinative, Dravidian languages (e.g. Telugu). This is another resource heavy module. For example, when the target language is Punjabi, for each execution it expects data in the form of files for nouns, pronouns, adjectives, postpositions, cardinals, ordinals, verbs, VAUX, and vocative participles (basically, the major parts of speech). Overall, these amounted to 76 thousand lines of input. Apart from this, the module also used a lucene8 based search index, which did use an in-memory heap for querying inflections. Again, this index was being re-loaded into memory for every input. The initializers for the reader objects for each type of input cumulatively took up more than 85% of the execution time. Similarly, when the target language is Telugu, the module requires more data sets for complex verb forms, suffix mappings and adjectives, which were stored in the form of key-value pairs (hash tables), but were being pre-filled for each input instead of just once.

2.3 Common Traits & Peculiarities

So far, we have seen some prominent modules and the way they are used and how they consume inputs. Here, we’ll summarize the common traits seen across all modules in most of the language pairs that are part of the system. This will also consider the integration steps used for gluing all the modules.

• Focus on keeping the memory footprint low: After going through the source code of the modules, it is quite evident that the authors wished to keep the runtime memory footprint of each program as low as possible. Given that some of the modules were developed over a span of 10 years, this is not surprising; however, the cost of memory today is 6-7 times lower in comparison.

• Focus on batch processing: The system was designed in such a fashion, that it seemed to prefer batch processing. So, any type of pre-processing, if present, would be of benefit, if multiple sentences were clubbed together and the module was invoked only once, for that batch input.

• Choice of programming language: Since these modules were developed by different groups scattered all across the country, the variety of languages seems obvious; the reasons were also discussed in Chapter 1. To enumerate, the set of programming languages consisted of C, Java, Python and Perl.

• Use of subcomponents: Some modules didn't just perform the core functionality, but also did additional tasks. For instance, in the URD-HIN pipeline, the LexicalTransfer 'module' also had a word identifier and a word substitution subcomponent. Similarly, in the HIN-PAN pipeline, again in the lexicaltransfer 'module', there were two other subcomponents: TAMUtility and InfinitiveGenerator.

• Use of common/assisting utilities: In all modules, the one thing that ties them together is the input/output format. This is handled using the SSF API. For the modules written in Perl, the component which provided the API was re-used everywhere. There was also use of intermediate scripts like ssf2connl.pl, bio2ssf.pl or tnt2ssf.pl, which were required because not all components can directly understand the SSF API; the existence of a transformation mechanism was prudent (e.g. the CRF based POS Tagger only understands the CoNLL format).

8https://lucene.apache.org/core/

• Everything used files for input/output: For each module, the input mechanism was a file and so was the output. This was also true for the subcomponents and helper utilities.

• Variant revisions: Even though modules like the morph analyzer, lexical transfer and word generator are supposed to be language agnostic, different versions/forks of the same module were being used in different language pairs. This usually happens when changes required for a particular language pair need to be merged back, but CLi is not available to do it, which results in the creation of forks. Instead of a single source of truth being owned by CLi, these ended up being maintained by Ei.

• Not independently ‘query-able’: The Sampark MT system did not provide any interface to query the individual modules. An end user had to either input raw sentences and gather the complete output, or gain access to the module, set it up themselves and then run their inputs.

• Tight coupling with system software: The ILMT modules are known to work well with Perl v5.18.2 but break when Perl is upgraded to v5.26.2. A major reason behind this could be the use of language features which were later deprecated. A classic example of this is the use of relative imports: if x is relative in require x, it now has to be explicitly marked as relative, require ./x. There is a hard requirement on libgdbm as well; although it has bumped up its soname9 in more modern operating systems, it mostly works, because very few symbols were removed10. Another classic example is the new optimizations11 in newer versions of gcc, which can break old code relying on undefined behaviour. This begs the need for a packaging and distribution mechanism which allows one to freeze software along with its dependencies.

2.4 Revamping Modules: Moving Towards Services

The grammar and syntax of a language, including its sentences and their structure, are the rules which define that language. The grammar of a language is how one can check or parse a sentence to see whether it is valid in that language or not. Parsing a sentence involves the same mental steps as diagramming a sentence or building a parse tree. One of the reasons why sentence parsing is a complex task is that there are various steps involved in the pipeline and there are interdependencies between them. If these steps were exposed as services, it would be easier to modify the outputs of interim modules and allow those changes to percolate down the pipeline efficiently: execution latches on from that step, and the user need not run the translation from the beginning. This also makes the overall system scalable when processing large texts. Earlier, one would need to modify interim modules and re-run the whole pipeline from the start. This section walks through some of the limitations that the Sampark system faced and discusses ideas for improvements (which were later applied). It also forms a basis for the need for services.

9Indicator for Binary API compatibility
10https://abi-laboratory.pro/?view=timeline&l=gdbm
11https://stackoverflow.com/q/36893251/1005215

18 2.4.1 Reducing File I/O

In one of the sections above, it was shown that all of the input/output was in the form of regular files. It did not really need to be that way. Let's take the converter-indic module as an example. It provides the functionality to convert text from UTF to WX and vice versa. Let's denote the module as Mi, which requires four inputs, Ij, 0 ≤ j < 4: I0 denotes the path to the input file, I1 denotes the path to the output file, I2 ∈ {utf, wx} and I3 ∈ {hin, tel, tam, ben, pan, mar}. This can easily be encapsulated in a function, f(ifh, ofh, args), where ifh denotes the input file handle, ofh denotes the output file handle, args.fmt is the input format and args.lng is the input language. In this context, a 'file handle' is typically a kernel handle for a file, i.e. a file descriptor in the Unix land. This is useful as long as the function can be imported, in the form of a library, into some other code. But that holds true only if the other code also runs in the same language or has bindings for it. However, everything in Linux is a file, even network connections (TCP sockets, UNIX domain sockets, etc). So, one can even expose this function over a network connection. In fact, this was generalized further so that it could be adopted in other modules too, in C, Java and Perl. This revises the small module to behave like a multi-threaded daemon which can serve requests concurrently. Also, by doing so, the pre-requisite knowledge required to query this system decreases, as one just needs to know how to create a TCP connection and pass a stream of bytes over it, which is (a) language agnostic and (b) operating system agnostic. We ended up following a similar approach for the morph analyzer, wherein the code was tweaked to create a multi-client TCP server using fork()12. All the child processes of the fork()-ed process share the same set of pages, and each one gets its own private copy only when it wants to modify a page. Since the GNU dbm file handles are shared and the pages pointed to by them are supposed to be read-only, we also save memory. With a runtime-configurable upper limit on the maximum number of clients, the morph analyzer module attains the power to serve multiple requests simultaneously, auto-scaling on demand. A lot of sub-components in the entire execution of a single pipeline required invoking the Perl and Python language interpreters numerous times. These two languages are not strictly interpreted languages; rather, they compile the source code into an intermediate bytecode which is then interpreted. This initial compilation, along with loading the Perl/Python interpreter into memory, takes a moment or two. However, because the interpreters were being instantiated for every sub-component, it came at the cost of wasting CPU cycles doing the same thing over and over again. This begets the need for an architecture where the language interpreters are invoked only once and all the relevant procedures are loaded into memory, so that when they are actually needed, the overhead is reduced down to just a simple jmp instruction to a certain address in main memory. Another noticeable fact about the modules was that logging was enabled by default, which created a major overhead. When it was turned off, the time consumed by the Lexical Transfer module in the TEL-HIN pipeline dropped by about 35%, saving a lot of calls to the write() syscall.

12https://linux.die.net/man/2/fork

Listing 2.1 Sample Multi-Threaded TCP Server in Python

import socket
import StringIO
import threading

_MAX_BUFFER_SIZE_ = 65536

def processInput(ifp, ofp, args, coreLogic):
    # Run the module's core logic line by line, writing results to the output handle.
    for line in ifp:
        ofp.write(coreLogic(line, args))

class ClientThread(threading.Thread):

    def __init__(self, ip, port, clientsocket, args, coreLogic):
        threading.Thread.__init__(self)
        self.ip = ip
        self.port = port
        self.args = args
        self.coreLogic = coreLogic
        self.csocket = clientsocket

    def run(self):
        # Treat the bytes received over the socket as an in-memory input file.
        data = self.csocket.recv(_MAX_BUFFER_SIZE_)
        fakeInputFile = StringIO.StringIO(data)
        fakeOutputFile = StringIO.StringIO()
        processInput(fakeInputFile, fakeOutputFile, self.args, self.coreLogic)
        fakeInputFile.close()
        self.csocket.send(fakeOutputFile.getvalue())
        fakeOutputFile.close()
        self.csocket.close()

if __name__ == '__main__':
    # process arguments
    # perform pre-fetching
    # perform pre-processing (build hash maps, compile regexes, load models, ...)
    args = parseArguments()   # placeholder for module-specific argument parsing
    coreLogic = f(args)       # placeholder: f() returns the module's line-processing function
    if args.isDaemon:
        host = "0.0.0.0"        # listen on all interfaces
        port = args.daemonPort  # port number
        tcpsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        tcpsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        tcpsock.bind((host, port))
        tcpsock.listen(4)
        while True:
            (clientsock, (ip, port)) = tcpsock.accept()
            newthread = ClientThread(ip, port, clientsock, args, coreLogic)
            newthread.start()
    else:
        processInput(args.infile, args.outfile, args, coreLogic)
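Once a module runs as such a daemon, querying it requires nothing more than opening a TCP connection and sending the input bytes, irrespective of the caller's programming language or operating system. A minimal client sketch follows; the host, port and sample input are illustrative values, not fixed values used by the system.

import socket

def query_module(text, host="localhost", port=5001):
    # Send the raw input bytes and read back the module's output.
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(text.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)      # tell the server we are done sending
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8")
    finally:
        sock.close()

print(query_module("isa vAkya kA anuvAxa karo"))   # sample WX-notation input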

2.4.2 (Re)using Memory, Efficiently

The majority of the modules which required inputs in the form of dictionaries, synsets, or trained models can be categorized into two groups: (a) ones which loaded them entirely into memory, and (b) ones which lseek()13-ed through them based on offsets. For both of these categories, we could load the resources only once in a pre-fetch step, and then pre-process them. After that, one could easily expose the core functionality of the program over a configurable network port, on which clients could connect and obtain the corresponding output. This was done for the Word Generator, TAMUtility and Lexical Transfer modules, which covers most of category (a). For category (b), there was a trade-off. The bilingual dictionaries and synset databases were considerably large, though not that large by today's standards. The trade-off here is that more memory would be

13https://linux.die.net/man/2/lseek

required to keep the entire resource in RAM (which we did, using HashMaps14). However, on experimenting with the Transfer Grammar module of the HIN-PAN pipeline, it was found that with the old setup the time taken to process one input was 1 second, whereas the transformed, completely in-memory version could easily process more than 300 inputs per second, concurrently. The trade-off was worth it. For the remaining modules in category (a) (e.g., POSTagger) for which the source is either not available, or available but would need to be modified and maintained, one can rely on the operating system, hoping that the pre-trained models are always available in the page cache, or move the model to tmpfs15. To cut down on compute time, another caching technique (at the function level) is memoization, combined with a carefully chosen cache eviction policy (LRU, TLRU, FIFO, etc.) to keep a check on memory usage. This requires in-depth analysis of time-consuming functions and proper benchmarking.

In the legacy system, a certain set of common utilities like SSFAPI or helper utilities like add_sentence_tag, remove_sentence_tag, root_to_infinity, etc. were being used repeatedly. However, each module loaded its own copy of these on invocation. So, for instance, if N people queried the system at the same time, and assuming that these requests are handled in parallel, there is a chance that there would be N invocations of the same module at the same time, each having the common utility loaded into its memory. At this point, the file system cache could save us from disk I/O, or we could use a system where these common utilities are loaded once and shared among all modules.
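As an illustration of the memoization idea (the resource and function names below are placeholders, not part of the Sampark code base), a size-bounded LRU cache around an expensive lookup might look as follows:

from functools import lru_cache

SYNSETS = {("kAma", "hin"): ["work", "task"]}    # stand-in for a large on-disk resource

@lru_cache(maxsize=65536)                        # LRU eviction keeps memory usage bounded
def lookup_synset(word, language):
    # The first call pays the full lookup cost; repeated calls for the
    # same (word, language) pair are answered from the in-memory cache.
    return tuple(SYNSETS.get((word, language), ()))

print(lookup_synset("kAma", "hin"))
print(lookup_synset.cache_info())                # hits/misses/size statistics for benchmarking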

2.4.3 Asynchronous I/O & Daemonization

For all the language pairs examined in the Sampark System, although a user could submit multiple sentences as one input, all of the modules performed their logic at the sentence level; no contextual information was analyzed across sentences. So, in theory, one could instead submit multiple inputs, each containing a single sentence, so that they could be processed concurrently. Unfortunately, that would lead to the entire pipeline being spawned from start to finish for each input, with every module being invoked from scratch. This would put a lot of pressure on the server hosting the system and could lead to performance degradation. This calls for processes continuously running in the background which have already gone through the phase of pre-fetching and pre-processing resources and are simply waiting for inputs: in technical terms, daemons. In multi-tasking operating systems, daemonization is a very common technique for running processes perpetually. For example, syslogd is the daemon which implements the system logging facility, sshd is the daemon which keeps listening for incoming SSH connections, and crond is yet another daemon which is used to schedule tasks at pre-defined times. A simpler, alternative term for a daemon is a service.

14https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html 15man7.org/linux/man-pages/man5/tmpfs.5.html

However, just having such a setup is not enough. Inputs can be of varying sizes (in length). It can happen that a smaller input (Is) reaches module Mi just after a larger input (Il); while Mi is processing Il, it should be capable of processing Is concurrently rather than blocking it, and if it finishes processing Is before Il, it should let Is move forward. This model is achievable with a system design which allows asynchronous input/output processing. Since it is desirable to move away from the file-based I/O mechanism between modules, the need for an orchestrator can be seen, one which is also capable of coordinating the control flow for a given input asynchronously.
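A minimal sketch of such asynchronous handling is given below, using Python's asyncio purely as an illustration (the real system uses the orchestrator described later, and the process function is a placeholder): a short input arriving after a long one is served concurrently instead of waiting behind it.

import asyncio

def process(data):
    # Stand-in for a module's core logic; CPU-bound work runs in an
    # executor thread so that it does not block the event loop.
    return data.upper()

async def handle_client(reader, writer):
    # Every connection gets its own coroutine, so inputs of different
    # sizes are processed concurrently and finish in their own time.
    data = await reader.read(-1)
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, process, data)
    writer.write(result)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 9002)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())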

2.5 Results of Transformation

By addressing most of the performance concerns highlighted in the previous sections, we were able to achieve a significant reduction in the time taken to process a single sentence across multiple language pairs. Results for sequential execution of four different translation pipelines are highlighted in Figure 2.5, Figure 2.6, Figure 2.7 and Figure 2.8. Results were computed on a 4-core Intel Xeon (Sandy Bridge) machine with 8 GB of main memory and a magnetic hard disk, running Ubuntu 14.04. 100 sentences of different word lengths were used for each language pair and the figures display the amortized results over 100 runs of each. The lexicaltransfer module saved an ample amount of time because of the reduction in file I/O. Pre-fetching resources and representing them in space-optimized, in-memory data structures impacted the performance of the wordgenerator, transliteration and transfergrammar modules. Components involved in shallow parsing saw a huge win due to the reduction in interpreter invocation costs. The total number of processes spawned came down to 75 for all four pipelines combined, which is a ~90% reduction (per pipeline) as compared to the previous state (~260). This helps in reducing context switches, thereby using the CPU in a more efficient manner.

Figure 2.5 HIN-PAN: Old v/s Transformed Runtime Comparison (per-module time in seconds, Sampark vs. Anuvaad Pranaali; total time legacy: 3.73053 s, total time new: 0.348433 s, gain per sentence: ~90.65%)

Figure 2.6 HIN-URD: Old v/s Transformed Runtime Comparison (per-module time in seconds, Sampark vs. Anuvaad Pranaali; total time legacy: 2.98647 s, total time new: 0.333518 s, gain per sentence: ~88.83%)

Figure 2.7 PAN-HIN: Old v/s Transformed Runtime Comparison (per-module time in seconds, Sampark vs. Anuvaad Pranaali; total time legacy: 4.05499 s, total time new: 0.413419 s, gain per sentence: ~89.80%)

Figure 2.8 URD-HIN: Old v/s Transformed Runtime Comparison (per-module time in seconds, Sampark vs. Anuvaad Pranaali; total time legacy: 5.60513 s, total time new: 0.653292 s, gain per sentence: ~88.34%)

Chapter 3

Service Oriented Architecture

In this chapter, we define the term SOA and show its relation to the workflows in ILMT systems. Then we describe the process of identifying services. Later, the design of a simple, yet powerful API is discussed. This is followed by an introduction to our web-based client, where we discuss its features. Finally, we throw light on some use cases which make use of the system created by us.

3.1 Services: A Short Introduction

Well-written software is governed by a set of guiding principles with respect to design and implementation. There are various styles for designing software, and one widely used style is SOA (Service-Oriented Architecture). Simply put, SOA is essentially a collection of services, and these services communicate with each other through a communication protocol. The information exchange (data) between these services could take place over a network, a pipe, a Unix socket, shared memory, message queues, or even semaphores. In order for an SOA to be effective, the services it is composed of have to be well-defined (follow a strict contract), self-contained (bundle all of their dependencies), self-sustained (independent of the context or state of other services), opaque (be a black box to their consumers), and may be composed of other underlying services, a.k.a. sub-components, that are not necessarily part of their public interface. From a linguistic point of view, a service is a single unit of programmable software. Properties of a service:

1. Services exchange data The only way for one service to communicate with another is via data. Each service can receive data and send data. In SOA, a service which coordinates the exchange of data across different services and optionally interacts with the client, directly or indirectly, is called an orchestrator. In essence, each service has input ports and output ports.

2. Services have a behavior The behavior defines the core logic or task which is meant to be executed when some data is received. The core logic can be a combination of several small pieces of business logic which collectively define the behavior. Each of these pieces can be made of one or more operations. When triggered, an operation has to reach an end state, which means an operation must always express a finite computation in time.

3. Services declare interfaces Interfaces are used as a contract for interacting with a service. They define the number of inputs and outputs, the format of each input and output, and the corresponding task that the service promises to execute on the caller's behalf.

4. Services execute sessions A session is a running instance of a service. These sessions are independently executed and their data is independently scoped; however, they may use the same resources. Data in one session should not affect the data in another session. Multiple sessions may run in parallel.

3.2 Adopting SOA to ILMT

In order to adopt an SOA for the ILMT systems, a series of pre-requisite steps was required; these steps can be extended to any NLP system and generalized to any workflow-based system.

Identifying easily distinguishable service boundaries [11]: Each pipeline in the ILMT System is composed of various components. As shown in the profiling data1, a single execution of the pipeline churns through a plethora of processes. The primary goal of this step is to associate these processes to coherent groups which, when combined together, can logically express a single unit of functionality. For instance, for the task of POS tagging, the sub-tasks performed are: (i) reducing SSF to a simpler form, (ii) converting the form in (i) to CoNLL format, (iii) actually invoking the POS tagger on the trained model, (iv) converting CoNLL back to SSF and (v) performing repair operations, if any. These are united to form a single module, which is later exposed as a service (a small sketch of this grouping is shown below). A different example in the same context is that of the Word Generator. In most of the language pairs, this was a single unit in itself, except in the HIN-URD and URD-HIN pipelines. There, an extra step is involved wherein the output has to be converted back from WX to UTF. Since the responsibility of converting textual formats lies with the converter-indic unit, the Word Generator module was restricted to the core logic and did not subsume the other module within itself.

Use of a well-defined contract: Services in an SOA need to function independently of development technologies and platforms. To achieve this, services inter-operate based on a formal definition, or a contract. This enables end users to query the system in a language-agnostic manner. The contract should be formulated using a proper standard, one which is popular in the community and widely supported. It should also clearly define what input is expected from the user and what output the user should expect. For our use case, we implemented this contract in the form of a web-based Application Programming Interface (API). Web services which implement an SOA tend to use one of Remote Procedure Calls (RPC), the Simple Object Access Protocol (SOAP), or Representational State Transfer (REST).
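Returning to the service-boundary example above, the grouping can be sketched as a single service-level function. All helper names below are illustrative placeholders standing in for the real tools, not the actual Sampark entry points.

# Hypothetical sub-task implementations; each stands in for the real tool.
def ssf_to_simple(ssf):        return ssf.strip()
def simple_to_conll(simple):   return simple.split()
def run_tagger(tokens, model): return [(tok, model.get(tok, "NN")) for tok in tokens]
def conll_to_ssf(tagged):      return "\n".join("{}\t{}".format(t, p) for t, p in tagged)
def repair(ssf):               return ssf

def pos_tag_service(ssf_input, model):
    # Single service boundary wrapping the five internal sub-tasks (i)-(v).
    simple = ssf_to_simple(ssf_input)     # (i)   reduce SSF to a simpler form
    conll = simple_to_conll(simple)       # (ii)  convert to CoNLL format
    tagged = run_tagger(conll, model)     # (iii) invoke the tagger on the trained model
    ssf_out = conll_to_ssf(tagged)        # (iv)  convert CoNLL back to SSF
    return repair(ssf_out)                # (v)   repair operations, if any

print(pos_tag_service("acCe kA Pala", {"kA": "PSP", "Pala": "NN"}))

Consumers of the service only ever see the SSF-in/SSF-out boundary; the intermediate conversions remain internal details.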

1http://ltrc.github.io/ILMT-API/sampark_profiling.html

Client-side requirements. SOAP: requires a SOAP toolkit at the client's end; not strongly supported by all languages. RPC: the client language requires support for performing RPCs. REST: no library support needed; typically used over HTTP.

Interface exposure. SOAP: exposes operations/method calls. RPC: requires users to know the procedure names. REST: returns data without exposing methods.

Payload. SOAP: packages of data are large and require XML. RPC: can return any data type, but is tightly coupled to the RPC flavour (JSON-RPC, XML-RPC). REST: supports any content type (XML and JSON are used primarily).

HTTP usage. SOAP: all calls are sent through POST. RPC: typically utilizes just GET/POST. REST: typically uses explicit HTTP action verbs (CRUD)2.

State. SOAP: can be stateless or stateful. RPC: stateless. REST: stateless.

Standards. SOAP: governed by WSDL (Web Service Description Language). RPC: governed by IDL (Interface Description Language). REST: no standard in terms of integration and implementation.

Table 3.1 SOAP, RPC and REST

Mechanism to communicate with inner daemons: Towards the end of the previous chapter, we showed how modules could be converted into continuous processes running in the background. A daemon, however, could be a constituent of a larger service, expecting input in a specific format that depends on the core functionality exposed by that daemon. Since services should only expose the bare minimum interface required to query them, they need to encapsulate such internal details. This is done by adding a generic piece of code in between, which acts as middleware. Behind the scenes, a service could also be using multiple daemons. The middleware knows the sequence in which these daemons are to be queried and the format of their input/output. This middleware can also be enhanced to perform additional functionality, for instance to subscribe to an event or reply to it, store intermediary results in a transient database, etc.
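A minimal sketch of such a middleware is shown below; the daemon hostnames, ports and their ordering are assumptions for illustration and not the actual configuration used by the system.

import socket

DAEMON_SEQUENCE = [("localhost", 9001), ("localhost", 9002)]   # assumed inner daemons, in query order

def query_daemon(host, port, payload):
    # Send the payload to one inner daemon over TCP and read the full reply.
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload.encode("utf-8"))
        conn.shutdown(socket.SHUT_WR)          # signal end of input to the daemon
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")

def middleware(payload):
    # The middleware knows the daemon order and I/O formats; the service's
    # public interface only ever sees the initial input and the final output.
    for host, port in DAEMON_SEQUENCE:
        payload = query_daemon(host, port, payload)
    return payload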

3.2.1 The RESTful API

In his book 'Undisturbed REST: A Guide to Designing the Perfect API' [15], M. Stowe states, “... building an API is the easy part. What is far more challenging is to put together a design that will stand the test of time, while also meeting your developers' needs.” This subsection discusses how we tried to design the API keeping future use cases in mind. REpresentational State Transfer (REST) is an architectural style inspired by the web. This architecture provides many implementation options [19], including HTTP, which uses verbs to easily state and formulate services as resources. We expose a simple, yet powerful API to end users where, whatever the translation task, queries are represented as HTTP requests. Below, we describe the various types of queries that can be made to the system.

1. Query to fetch the language pairs supported by the new system is a simple HTTP GET request to:

http://${endpoint}/langpairs

2. Query to fetch the number of modules in the pipeline for a given language pair is an HTTP GET request to:

http://${endpoint}/${src_lang}/${tgt_lang}
$src_lang = ISO-639 code of the source language
$tgt_lang = ISO-639 code of the target language

3. Query to fetch the list of modules in the pipeline for a given language pair, in the order of execution, is an HTTP GET request to:

http://${endpoint}/${src_lang}/${tgt_lang}/modules

4. Query to run a specific part of the pipeline for a given language pair is an HTTP POST request to:

http://${endpoint}/${src_lang}/${tgt_lang}/${start}/${end}
$start = module number where execution has to start
$end = module number where execution has to end

For example, to get the output up to running the Shallow Parser in our Hindi-Urdu pipeline, the POST request is structured as http://$endpoint/hin/urd/1/10. The actual input is passed as the data parameter of the POST request. This also lets the end user query a single module in the pipeline by structuring the request as http://$endpoint/hin/urd/$x/$x, where $x is the number of the module whose output one wants to inspect. This is a very powerful feature, and it forms the basis of ResumeMT, which we discuss in an upcoming section.

5. Query to run the whole translation unit for a given language pair is a simple HTTP GET request to:

http://${endpoint}/${src_lang}/${tgt_lang}/translate?input=""

This is useful for users who don’t need to concern themselves with what modules a translation pipeline is composed of and in what order they are executed.

All responses by the server3 are in JSON format.
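A sketch of how a client might exercise these endpoints with the requests library is shown below. The deployment URL is the one mentioned in the footnote (reachable only within the IIIT-H campus), and the module range and the name of the POST data field follow the description above; treat them as illustrative values.

import requests

ENDPOINT = "http://api.ilmt.iiit.ac.in"      # assumed reachable deployment

# 1. List the supported language pairs.
pairs = requests.get("{}/langpairs".format(ENDPOINT)).json()

# 3. List the modules of the hin -> pan pipeline, in execution order.
modules = requests.get("{}/hin/pan/modules".format(ENDPOINT)).json()

# 4. Run only modules 1..10 on one sentence; the input travels in the
#    'data' field of the POST body.
partial = requests.post("{}/hin/pan/1/10".format(ENDPOINT),
                        data={"data": "acCe kAma kA Pala acCA howA hE"}).json()

# 5. Run the full pipeline in one shot.
full = requests.get("{}/hin/pan/translate".format(ENDPOINT),
                    params={"input": "acCe kAma kA Pala acCA howA hE"}).json()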

3http://api.ilmt.iiit.ac.in (accessible within IIIT-H campus)

It is quite natural to question the decision to settle on REST, given that other approaches for formulating the API could have been adopted as well. Initially we had started out with RPC, but the tight coupling with the knowledge required about function calls was a major demotivating factor. We then turned towards SOAP. At first, the use of a specific information interchange format, XML, which is quite similar to the SSF format, was quite tempting; however, a proper implementation of SOAP requires heavy documentation and client-side libraries. The examination of these three types of APIs has been documented in Table 3.1. The REST architecture focuses primarily on two things, resources and interfaces, and we wanted both of these to be first-class citizens when implementing the SOA.

Using HTTP as the choice for implementing REST was mainly to support API flexibility and simplicity. This is understood better with the help of an example. In the shallow parsing phase of the HIN-PAN pipeline for the source language, the Vibhakti Computation service can take an optional boolean parameter, 'keep', which, if true, results in the retention of the PSP and Auxiliary markers in the output. To facilitate this behavior, all we had to do was to allow the service to take additional parameters and process them only if they were supposed to be honoured, or return an error otherwise. The simplicity of the endpoints was designed bearing in mind the use cases of Computational Linguists (CLi) and Integration Engineers (Ei).

3.2.2 Anuvaad Pranaali

A simple, browser-based client4 was built for querying the pipelines exposed by our system as services. Once the end user enters an input, the client auto-detects the input language and modifies the target language selection automatically. After selecting the appropriate target language and hitting the translate button, the input text is sent to the tokenizer to identify sentence boundaries. From there, JavaScript callbacks asynchronously process each sentence in parallel. As and when each sentence is translated, the web client fills the output table and also makes sure that the ordering of the input sentences is maintained. The client in this context is the end user, who could be anyone, language researcher or not! The software running inside the browser is simply a client written in JavaScript, which has no relation to any of the ILMT modules. This confirms that the service exposed by the backend is completely agnostic about the language being used to query it. This web client essentially provides a dashboard for translating languages and hence it was named Anuvaad Pranaali, which in Hindi literally means Translation System.

3.2.3 ResumeMT

In our web client, another key feature is exposed, hidden behind a button which can be toggled: Debug Mode. If turned on, the debug mode enables viewing intermediate results of the pipeline for each sentence in the input. It even allows direct editing of target translations using jQuery IME

4http://api.ilmt.iiit.ac.in

Figure 3.1 A snapshot of the Client-Dashboard

(useful for human language experts and linguists), and direct modification of the intermediate pipeline outputs. Once the output of a particular intermediary step has been modified, the pipeline can be resumed from that point. The web client forks the translation mid-way and queries our SOA-based API accordingly. We named this feature ResumeMT. This open-source5 client can be used for any language pair and is not necessarily limited to Indic languages. The proposed API6 has also been integrated with Kathaa [9]7, in a fashion where the Kathaa backend acts as a REST aggregator for all services and each node is processed independently.

Of the four categories of users described in Section 1.2, the linguists (Li) can make use of the functionality exposed by ResumeMT to test potential updates to modules without having to worry about their integration.

3.2.3.1 Detecting Error Propagation and Rectification

In a workflow-based system such as ILMT, at each step of processing the computation depends on a given set of input variables. Consequently, based on the importance of each variable, a change in its value may or may not impact the current output, or the output of a later step. If the output of a particular module is incorrect for the given context, it gets fed as wrong input to the modules further on in the pipeline, which can, in turn, result in an incorrect translation. In a statistical system, this effect is known as propagation of uncertainty, or simply put, for our use case, propagation of error. Performing calculations based on the uncertainty brought in by a wrong input, and measuring the propagated error, requires a strict mathematical representation of each step, with each variable and its weight clearly pre-defined, which is out of scope for this thesis. However, the improvements made to the system make it easier for a (computational) linguist to empirically identify these variables and rectify the modules to minimize the chances of error. Since we are dealing with natural languages, a computational linguist is bound to face linguistic ambiguities at some point or the other. Be it lexical, morphological or syntactic, these ambiguities can impact the output of a given module and, depending on the level of propagation, the end result of the translation. A classic example can be seen in Hindi, with adjective-noun ambiguity: an adjective may function as a head noun if the noun is dropped, and bears the same inflection as the nominal head, as can be seen in the inputs for (i) and (ii) (translations for the HIN-PAN pipeline).

5https://github.com/ltrc/anuvaad-pranaali 6https://github.com/ltrc/ILMT-API 7https://www.youtube.com/watch?v=woK5x0NmrUA

Figure 3.2 A snapshot of the ResumeMT feature

i. अच्छे काम का फल अच्छा होता है → ਚੰਗੇ ਕਾਮ ਦਾ ਫਲ ਚੰਗਾ ਹੁੰਦਾ ਹੈ
   acCe kAma kA Pala acCA howA hE → cazge kAma xA Pala cazgA huzxA hEM

ii. अच्छे का फल अच्छा होता है → ਚੰਗੇ 0 ਫਲ ਚੰਗਾ ਹੁੰਦਾ ਹੈ
    acCe kA Pala acCA howA hE → cazge 0 Pala cazgA huzxA hEM

If the case marker or postposition immediately follows the adjective, the adjective should be treated as a nominal head. Since the occurrence of 'acCe' (or any other adjective) as an adjective is more likely than its occurrence as a noun in any learning corpus, this can result in incomplete learning and, consequently, in incorrect tagging. NG identification rules could help in resolving such ambiguity by using the featural information of the NG constituents.

As is evident, the translation for (i) is correct, but one cannot say the same for (ii). In (i), the word acCe is marked as JJ (adjective), and the same has been done in (ii), which is incorrect. The error gets propagated as follows:

Input: अच्छे का फल अच्छा होता है
→ PoS: acCe(JJ) kA(PSP) Pala(NN)
→ Chunker: ((NP acCe(JJ) kA(PSP) Pala(NN)))
→ ComputeHead: head=Pala
→ ComputeVibhakti: associates kA with Pala
→ Simpleparser: is not able to find any dependency relation in the first chunk
→ VibhaktiSplitter: is confused about the incorrect relation and adds an unknown auxiliary verb in place of kA
→ Output has a missing token; the generated translation is absurd.

Using Anuvaad Pranaali, a linguist can simply turn on the debug mode to identify the fault point, fork the flow in between and pivot the results:

Input: अच्छे का फल अच्छा होता है
→ PoS: acCe(JJ) kA(PSP) Pala(NN)
→ Intervene and replace JJ with NN: acCe(NN) kA(PSP) Pala(NN)
→ Chunker: ((NP acCe(NN) kA(PSP))) ((NP Pala(NN)))
→ ComputeHead: (head=acCe), (head=Pala)
→ ComputeVibhakti: associates kA with acCe
→ Simpleparser: marks a dependency relation of r6 between the 1st and 2nd chunks, and r4 between the 2nd chunk and the chunk associated with 'howA hE'
→ VibhaktiSplitter: understands the association and adds the postposition ਦਾ
→ Output (meaningful): ਚੰਗੇ ਦਾ ਫਲ ਚੰਗਾ ਹੁੰਦਾ ਹੈ

This exercise helped in identifying a problem with the PoS tagger for a corner case. Now, it can be rectified in multiple ways. One way would be to add more training data which covers more scenarios like these, or add post processing steps, as we saw was done in the HIN-TEL pipeline. This behavior was seen in postagger-hin v2.0 (which was being used in the HIN-PAN pipeline) and rectified in v2.2 (which is now being used by the HIN-URD pipeline) by using the former method. Similar techniques can be applied to improve modules where output can be one of multiple possibilities, like PickOneMorph, ComputeHead, Chunker, etc.

Compared to the previous system, if a linguist wanted to detect this problem and try out a fix, they would have to sit down with a computational linguist to gather the intermediate data, analyze it, and then ask them to make a change in their local setup (hoping that the system could be replicated locally) and test it out. In the modern system, the linguist can just tweak the output via their browser, figure out the issue and report it. In case a computational linguist would like to perform further experiments, they would not need to replicate the setup locally, but could simply call the already running services via the provided APIs and divert the pipeline in the middle to call their own modules. This can also act as a practical, interactive teaching tool for teaching various aspects and components of machine translation to students, instead of relying on books with repetitive examples.

3.2.3.2 Demonstration

A demonstration of the ResumeMT feature is available at: https://youtu.be/z9ugKY9UArI?t=2m40s

3.2.4 A Graph based Approach for Querying

Figure 3.3 A sample workflow

In Chapter 1, various types of workflows were discussed. The ILMT pipelines are a special form of workflow where the control flow is usually pre-defined. However, as a computational linguist, one may want to tweak the control flow without disturbing the setup. This may be required for testing a newer version of a particular service, or for analyzing the effects on results if an intermediary step is removed, replaced or added. With the API and its usage proposed so far, it is possible to achieve this by modifying the client's call stack: break one API call into at least three API calls. For example, if one wants to temporarily replace the Head Computation implementation in the HIN-PAN pipeline with the implementation from the HIN-TEL pipeline, one would need to use the following sequence:

http://${endpoint}/hin/pan/1/$x
http://${endpoint}/hin/tel/$y/$y
http://${endpoint}/hin/pan/$x+1/$z
$x = module number of the Head Computation module in HIN-PAN - 1
$y = module number of the Head Computation module in HIN-TEL
$z = module number of the last module in HIN-PAN

Although this is much simpler than the legacy system, we propose yet another method to query the system, one which lets the user define the sequence rather than partially overriding it for each language pair. In this setup, all steps of any given workflow are represented using a disconnected directed acyclic graph. For example, Figure 3.3 shows that we have five nodes in total, with three input ports. On receiving the input, the dispatcher service creates a job with this graph and subscribes to a channel on a shared key-value store (e.g. Redis). One request is sent to node A with inputs 1 and 2, and the other request is sent to node E with input 3, asynchronously. After processing the two inputs, node A generates one output and sends it to B and C. B and C process the input in parallel. As soon as one of the outputs from B or C reaches D, it starts maintaining a local (in-memory) database so that it can proceed further once it has been provided all the inputs that it requires. Output from D goes to another instance of B. Since that is one of the leaf nodes, it publishes its result to the same channel to which the dispatcher service had subscribed. By this point, E might have already computed its result (concurrently) and published its output. The dispatcher service waits till the combined output from the channel contains the same number of nodes as present in the graph and finally gives back the complete processed output to the user. The representation of such a workflow is given below, where each node in the graph (except the input nodes) is depicted in the format ${service-name}_${identifier}. Since node B is queried twice, and we want to avoid cycles, it has two identifiers.

{ "edges": { "input1": [ "serviceA_1" ], "input2": [ "serviceA_1" ], "input3": [ "serviceE_1" ], "serviceA_1": [ "serviceB_1", "serviceC_1" ], "serviceB_1": [ "serviceD_1" ],

36 "serviceC_1": [ "serviceD_1" ], "serviceD_1": [ "serviceB_2" ] }, "data": { "input1": "I am Anuvaad", "input2": "You are Pranaali", "input3": "He is Kathaa" } }

The REST URI for the dispatcher service looks like http://$endpoint/translate/graph, which takes the input in an HTTP POST request. A sample setup with the source code for this example has been made available in the ddag-sample8 repository. A requirement for a similar diamond dependency was seen in the HIN-URD pipeline, where A is the Chunker module, B is the Multi Word Expression (MWE) module and C is the Named Entity Recognition (NER) module. Simply put, the workflow for HIN-URD follows a networked pattern, as discussed in Section 1.1.1. In the previous API, where a default control flow is pre-defined, the workflow was transformed into a sequential pattern by topologically sorting all the nodes and breaking cycles, if any.
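A client could submit the workflow above to the dispatcher in a single call, sketched below. Whether the payload is sent as a JSON body (as here) or as a form field is an assumption for illustration; the graph itself mirrors the representation shown earlier.

import requests

ENDPOINT = "http://api.ilmt.iiit.ac.in"      # assumed reachable deployment

workflow = {
    "edges": {
        "input1": ["serviceA_1"],
        "input2": ["serviceA_1"],
        "input3": ["serviceE_1"],
        "serviceA_1": ["serviceB_1", "serviceC_1"],
        "serviceB_1": ["serviceD_1"],
        "serviceC_1": ["serviceD_1"],
        "serviceD_1": ["serviceB_2"],
    },
    "data": {
        "input1": "I am Anuvaad",
        "input2": "You are Pranaali",
        "input3": "He is Kathaa",
    },
}

# The dispatcher collects the outputs of all nodes and replies once every
# vertex in the graph has been processed.
response = requests.post("{}/translate/graph".format(ENDPOINT), json=workflow)
print(response.json())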

This application programming interface also makes it easier for power users Ui who want to use these workflows partially or fully in their own applications.

3.3 Use Cases

In this section, we describe two applications which used the APIs provided by our system to perform independent research.

3.3.1 ILParser

While defining the problem statement in Chapter 1, a scenario was discussed where certain components of an existing system could be re-used for another purpose. As shown in the previous section, the SOA-based system was designed keeping this use case in mind. Riyaz Bhat's ilparser is an excellent example of this use case. ILParser9 is an arc-eager dependency parser [3] for Indian languages using beam-search decoding. In order to create the dependency tree, this parser requires a certain set of linguistic units to be present in the input, along with the input being in CoNLL format. These linguistic units are:

8https://github.com/nehaljwani/ddag-sample 9https://bitbucket.org/riyazbhat/ilparser/

chunkId, chunkHead, lemma, category, gender, number, person, case, vibhakti, TAM10, sentence type, voice type and word root. This information can only be obtained by running a shallow parser on the raw input, functionality which wasn't part of the dependency parser. However, the author wanted to provide an abstraction layer over it and allow end users to throw in raw input. To achieve this, ILParser made use of two key features of our system:

1. Running a pipeline partially: The shallow parsing of a given input for most of the Indic languages, in a major way, comprises POS Tagging, Chunking (Intra/Inter), Morphological Analysis, Vibhakti Computation and Head Computation. Out of the 22 steps in the HIN-PAN pipeline, the first 10 steps covered this functionality, which was queried by sending an HTTP request to: http://$endpoint/hin/pan/1/9

2. Ability to query independent modules: The dependency parser also required the PSP and Auxiliary markers to be present, so it queried the Vibhakti Computation service independently by sending an HTTP request to http://$endpoint/hin/pan/10/10 with the input from the previous step and an additional boolean parameter.

To use the dependency parser, the end user still required a minimal environment to install and set up the dependency parser along with its dependencies. To simplify this further, we created a service on top of it and exposed it as another web API. Now, one can just send an HTTP request to http://$endpoint/parse with input parameters specifying the input language and data. This setup is available at https://github.com/ltrc/ilparser-api/.
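The resulting wrapper can then be queried in a single call, sketched below; the parameter names 'language' and 'data' are assumptions chosen for illustration.

import requests

ENDPOINT = "http://api.ilmt.iiit.ac.in"      # assumed reachable deployment

response = requests.post("{}/parse".format(ENDPOINT),
                         data={"language": "hin",
                               "data": "acCe kAma kA Pala acCA howA hE"})
print(response.json())   # dependency analysis produced by ILParser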

3.3.2 Kathaa: A Visual Programming Framework for Humans

Kathaa is a framework which allows anyone to play with NLP components in a visual manner. One of the main reasons behind the development of Kathaa was to bridge the gap between NLP researchers, developers and novice users. It allows the end user to create, modify and fork workflows visually, with the help of simple mouse clicks and drag-and-drop. The architecture of Kathaa is primarily composed of the following components:

• Kathaa Modules: The most basic units of computation in the framework. Modules have input and output ports, like registers, for the inflow and outflow of data.

• Kathaa Module Packaging: Kathaa modules can be created on the fly, or can be picked up from a library of pre-packaged modules and, optionally, modified.

• Kathaa Graph: This component builds on top of our graph-based approach for querying services and lets the end users create and modify such graphs visually.

10Tense, Aspect, Modality marker

Figure 3.4 A networked workflow in Kathaa

• Kathaa Orchestrator: This component is designed on the same principles as our dispatcher service, but goes one step further and also implements a job queue to process inputs in configurable batch sizes.

• Kathaa Visual Interface: To interact with Kathaa, one simply needs to open the advanced, yet intuitive web interface. It comprises an editor and also lets one write code to modify the behavior of any service without actually modifying the service itself.

Figure 3.4 shows a visual representation of a complex workflow. Figure 3.5 shows a sequential workflow adopted for the HIN-PAN pipeline. To demonstrate11 the power of Kathaa, the authors used the services created by us and exposed them as Kathaa modules. In order for the control flow of a given pipeline to be fully controlled by the Kathaa Graph, each service exposed by our system is queried independently and all HTTP requests end up following the pattern http://$endpoint/$src/$tgt/$x/$x. Since Kathaa is completely agnostic of the implementation of services, and only respects its own Module format, we created wrappers for our services in the Kathaa Module Packaging format and published them. The modules for the HIN-URD pipeline and the HIN-PAN pipeline have been made available at https://github.com/kathaa.

11https://www.youtube.com/watch?v=woK5x0NmrUA

Figure 3.5 A sequential workflow in Kathaa

Chapter 4

Deployment and Packaging

So far, we have seen an architecture based on services and walked through some of its use cases. In this chapter, we discuss ways to deploy these services as applications and methods of packaging them.

4.1 Monolithic Application

When an application follows a monolithic style of software design, it is written as a cohesive unit of code whose components are designed to work together, sharing the same memory space and resources. It bundles together all the functionalities. Monolithic approaches use well-factored, independent modules within a single application. While performing the analysis discussed in Chapter 2, it was noticed that more than 50% of the code base was written in one language (Perl), and it was desirable to reduce the interpreter invocation count to one. But we also wanted to make use of the transformed modules (which were daemonized). To achieve this, we wrote a semi-monolithic application (AP-Mono) using the Mojolicious Framework1. We refer to this as semi-monolithic because some components of the monolith need not be part of the giant service: they could be running independently, with the application holding wrappers around these components to send and receive data, and only the wrappers being part of the monolith.

4.1.1 Breakdown

Inside this monolithic deployment, each language pair registers itself as a translation service with the dispatcher, along with a pre-defined sequence of its execution. When the monolith is launched, all submodules of all language pairs are initialized at the same place, one by one, to complete all pre-processing and pre-fetching tasks. The architecture of the scalable monolithic system is composed of four layers, L1 through L4 (see Figure 4.1).

• L1 is the exposed API or a browser-based client which sends requests for translation.

1http://mojolicious.org/

Figure 4.1 An overview of AP Mono

• L2 consists of a load balancer that manages scalability of L3

• L3 is a collection of pre-forking API servers hosting the backend for the API.

• At L4, we have multi-threaded daemons which might not have been written in the same language as that of the API server.

In this setup, the ability to scale the entire application is limited to L3, by making copies of the entire monolith. This results in a huge amount of resources being wasted, since modules for which multiple instances are not required get duplicated anyway. For example, after pre-processing, the response time of the InfinitiveGenerator submodule is very low, and scaling it is not really necessary. Although the source code2 for this setup is open source, most of the git submodules are private due to certain copyright constraints. A similar setup has been created for the il-parser-api3, which is completely open source. Monolithic services tend to get entangled as the application evolves, making it difficult to isolate services for purposes such as independent scaling or code maintainability. Moreover, these services are tightly coupled to a code base [20] and in most cases are not amenable to reuse. Further, it may not be possible to build new workflows using existing modules developed by different sources, due to software dependency conflicts and incompatible interfaces between them.

2https://github.com/ltrc/ILMT-API 3https://github.com/ltrc/ilparser-api

4.2 Microservices based Application

Microservices [17] is a specialized technique for implementing SOA. Within this architectural style, services have a clear demarcation among themselves; they are fine-grained. These services communicate over lightweight APIs. They are independently deployable and easy to replace. They are organized around capabilities and each service performs a single function. The architecture doesn't enforce any restrictions on the language in which a service has to be implemented. Thus, this style of design naturally enforces modularity in the application adopting it. Microservices are inspired by the Unix philosophy of do one thing, and do it well.

4.2.1 Implementation

To embrace this style for our system, the requirement was to make sure that each individual service exposes a queryable interface and still conforms to the API design proposed in the previous chapter. Since HTTP was chosen as our way of implementing REST, and it was required to remove the coupling between distinguishable components, each service in the pipeline now needed to have its own REST wrapper, which we implemented using Hypnotoad4. Also, the auto-register functionality of a pipeline with a pre-defined sequence was dropped in favour of the query system where each workflow is represented as a DDAG, as discussed earlier. In the transformed system AP-Micro, the input graph G(V,E) is sent to the dispatcher service5 (see Algorithm 1). Here, each node in the graph corresponds to one microservice. Each disconnected component in the graph is treated as a unique job, and this service makes a one-time subscription to a distinct channel on a global key-value storage service6 running separately, with the jobID as the key, and registers a callback which returns the processed value back to the end user. Thereafter, it kicks off the execution of all vertices v in the graph with deg−(v) = 0 (nodes with no incoming edges) by sending an asynchronous HTTP request to http://$service_endpt/pipeline. Each request sent to a microservice v consists of the tuple (input, moduleID, jobID, subgraph) (see Algorithm 2). The subgraph G′(V′,E′) in this request denotes the unprocessed portion of the original graph. The REST wrapper stores the request in a transient database which is shared across all instances of the microservice and sends back an HTTP 202 response to the requester. When any instance of this microservice notices that, for a particular jobID, deg−(v) in G(V,E) equals deg−(v) in the union ∪i G′i(V′i,E′i) of the subgraphs received so far, it knows that all inputs up to this node have been processed and proceeds to compute its core functionality. Once the result is available, it sends an HTTP request to all the vertices reachable via its outgoing edges (its out-neighbours in G(V,E)). If deg+(v) = 0, it publishes the final output to the channel corresponding to the jobID in the shared key-value store, which marks the end of that job.
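The in-degree comparison can be illustrated with a small sketch; the helper names below are for illustration only and do not correspond to the production wrapper. A node starts its core logic only when the union of the subgraphs received so far accounts for all of its incoming edges.

def in_degree(node, edges):
    # Number of edges pointing at `node` in an adjacency-list graph.
    return sum(node in targets for targets in edges.values())

def ready_to_process(node, full_graph_edges, received_subgraphs):
    # Union of the edge sets carried by all requests received for this jobID.
    union_edges = {}
    for sub in received_subgraphs:
        for src, targets in sub.items():
            union_edges.setdefault(src, set()).update(targets)
    return in_degree(node, full_graph_edges) == in_degree(node, union_edges)

# Node D needs inputs from both B and C; after only B's request it must wait.
edges = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}}
print(ready_to_process("D", edges, [{"B": {"D"}}]))                # False: C's output pending
print(ready_to_process("D", edges, [{"B": {"D"}}, {"C": {"D"}}]))  # True: all inputs arrived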

4https://mojolicious.org/perldoc/Mojo/Server/Hypnotoad 5https://goo.gl/NJTMNL 6https://redis.io/

Algorithm 1 Function to start execution of a Graph
 1: procedure dispatch(G(V,E))
 2:     jobID ← random()
 3:     kvStoreCon ← newKVStore()
 4:     kvStoreCon.subscribe(jobID)
 5:     procedure kvStoreCon.callBack(result)
 6:         if |result.keys()| == |V| then    ▷ Return value only after all vertices have been processed
 7:             return result
 8:     SV ← getVertices(G, indegree = 0)
 9:
10:     for n in neighbours(G, SV) do
11:         endpoint ← n.getEndPoint()
12:         modID ← n.getModID()
13:         input ← n.getInputData()
14:         httpHandle ← newHTTPHandle(endpoint)
15:         httpHandle.post(input, jobID, modID, G(V,E).subtract(n))
16:
17:     procedure httpHandle.callback(status)
18:         if status.code ∉ {200, 202} then
19:             this.log(ERROR, status)
20:         else
21:             this.log(status)

4.2.2 Architecture Benefits

We now summarize the benefits of our microservices based implementation of SOA:

• Distributed: Since communication between microservices happens over a network, they need not be hosted on the same physical machine. Deployments can be strategically located to minimize round-trip time (for example, based on user demography), or spread over heterogeneous hardware (for example, microservices requiring GPUs versus commodity machines).

• Scalable: Rather than adding machines containing entire copies of the monolithic system, we need only scale the respective microservice. For example, if a text processing microservice is currently a bottleneck, we can increase the number of instances of that microservice and place their endpoints behind the load balancer.

• Easily Deployable: Rather than system-wide restarts for each modification to a pipeline or update to a microservice, we use a rolling update model. Each affected microservice is, in turn, removed, updated, and added back to the load balancer. With dynamic auto-discovery of microservices using, for example, Consul7, these microservices can register/de-register themselves and the load balancer is updated automatically.

7https://www.consul.io/

Algorithm 2 Function to process a node
 1: procedure process(input, jobID, modID, G(V,E))
 2:     myDB ← getDBHandle()    ▷ The database can be completely in-mem
 3:     kvStoreCon ← newKVStore()
 4:     HTTPCon ← this.getHTTPCon()
 5:     thisVertex ← getVertex(modID, G(V,E))
 6:     myDB.insert(jobID, modID, input, G(V,E))
 7:     subGraphs ← myDB.query(jobID, modID)
 8:     G′(V′,E′) ← getUnion(subGraphs)
 9:     newVertex ← getVertex(modID, G′(V′,E′))
10:
11:     if thisVertex.getInDegree() == newVertex.getInDegree() then
12:         inputs = newVertex.getInputs()
13:         result = processInput(inputs)
14:         setResult(G′(V′,E′), newVertex, result)
15:
16:         if newVertex.getOutDegree() ≠ 0 then
17:             for n in neighbours(G(V,E), newVertex) do
18:                 endpoint ← n.getEndPoint()
19:                 modID ← n.getModID()
20:                 input ← n.getInputData()
21:                 httpHandle ← newHTTPHandle(endpoint)
22:                 httpHandle.post(input, jobID, modID, G(V,E).subtract(n))
23:
24:             procedure httpHandle.callback(status)
25:                 if status.code ∉ {200, 202} then
26:                     this.log(ERROR, status)
27:                 else
28:                     this.log(status)
29:         else
30:             kvStoreCon.publish(jobID, G′(V′,E′))
31:     else
32:         return HTTPCon.reply(202, 'Waiting for more inputs ...')

Figure 4.2 An overview of AP Micro

• Resilient: Points of failure in the pipeline are individual microservices, not the entire monolithic application, and therefore the remaining microservices stay isolated from the failure. The faulty service can be rolled back to a healthy version without affecting the rest of the system.

• Composable: It is easy to compose new pipelines or modify existing pipelines by adding or removing microservices, or simply by adjusting the graph in the REST calls. This also allows researchers to identify the precise order of components that suits translation best for a given language pair.

• Easy Debugging: The proposed setup allows researchers to pause/play the workflow and resume it after manual intervention. This helps a lot in debugging faulty modules and also helps determine the impact of each component in the workflow.

• Owned at a Microservice Level: With microservices not necessarily coupled to the same codebase, ownership of pipeline components can be individual or be separated along other dimensions (e.g., language, hardware). In industrial terms, this helps keep balance in team size and productivity.

Since the proposed architecture makes use of a jobID-based publisher-subscriber model, it can easily be enhanced to include services for training statistical NLP systems.

4.3 Becoming Cloud Native

Cloud-native is a dynamic term, like cloud was at one point and like serverless is becoming. In the past two decades, software technologies have evolved, influencing software practices. Applications require a new way of thinking about design, architecture, and capacity planning. To be agile and scalable, services must be sufficiently small, but to avoid complexity, they need to be sufficiently big. Being cloud-native is an approach to building, running and deploying software that takes advantage of the cloud computing model. It is not really a question of where the software is deployed; it can be a private cloud or a public one. A cloud-native application is typically loosely coupled, which enables selective scaling, independent updates, and tolerance to partial failure. Some of the prominent native features provided by any cloud provider can be accessed via layers, including the topmost virtual platform/OS, the underlying resources (such as storage and data), and then the cloud-native services, such as provisioning and tenant management. These provide the capabilities an application needs. Being truly cloud-native requires a proper understanding and usage of these layers and their underlying resources. As mentioned by Cornelia et al. [5], the top two considerations in designing cloud-native applications are (1) they are distributed and (2) they operate in an environment that is constantly changing. Here, distributed doesn't necessarily mean distributed just for scalability, but also for availability. As is the case with most applications, the environment of our system also includes the software stack our application depends on, which keeps changing at a very fast pace.

4.3.1 Towards Containerization

A very common problem faced by students in any research field is reproducibility of the work done by peers in the same or a related field. If the work done by some researcher results in the creation of software, it is often the case that it is demoed in presentations and published, but not made publicly available. Sometimes these services are hosted only inside some internal network or require a license to play around with. And in the most unlucky scenario, one has to re-implement the software by carefully reading and analyzing the published work. A second major obstacle is platform dependency. As of 2017, the three most popular operating systems are Windows and MacOS, followed by Linux. And in the case of Linux, there are countless distributions, a subset of which are very popular for application development or day-to-day usage. It is not unusual to find that a particular piece of software works only on one platform. This restricts the audience to the subset of folks who are familiar with that platform and also have access to it. Another predicament prevalent in academia and the software industry in general is the dependency on the software stack. Towards the end of Chapter 2, we discussed this in relation to the legacy system. To summarize, if a work is published in year X and someone in year X + Y wants to try out that work, they need to make sure that the exact dependency range of the software stack is available before attempting to explore the work. In some cases, the Linux distribution gets updated, the version of Java changes and old methods are deprecated, certain libraries suffer from security vulnerabilities and are taken down by the maintainers, and so on.

Most of the pain points listed above can be eased with the use of containers. Containerization is a technology which can be used to sandbox a piece of software together with everything needed to run it: code, runtime, system tools, system libraries, settings. The only thing that it cannot isolate is the kernel used at runtime, but there are ongoing efforts8 to provide that as well. Containers are primarily composed of cgroups (to control and account for resource usage like CPU and memory), namespaces (isolation of resources like PIDs, hostnames, storage, network, IPC) and capabilities (fine-grained control over permissions and access checks). There are various container runtimes, with Docker9 being the most popular and widely used.

• Distribution: Sharing pre-built container images is very easy. One just needs to push them to a publicly available docker registry, like https://hub.docker.com

• Versioning: Docker images allow tagging container images with versions. This makes it easier to manage multiple versions of one application.

• Constrained dependencies: First-class support for layered file systems like OverlayFS allows users to build containers on top of existing ones. So, if my application requires Ruby 2.4, then I can containerize my application over a publicly available image of Ruby 2.4 and publish it.

• Cross Platform: The docker container runtime is also available for MacOS and Windows, which allows users to run Linux containers on these Operating Systems as well.

We packaged our services as docker containers so that anyone wanting to use them as a black box could do so straight away. This approach was also applied to ILParser and the setup for that is documented and available at:

1. Monolithic: https://github.com/ltrc/ilparser-api

2. Microservices based: https://github.com/ltrc/ilparser-api-distributed

Apart from the low-level dependencies with which our system is known to work, we package pre-compiled binaries and the REST wrapper, so that the end users do not have to go through the ordeal of compiling everything from source. These containers can also be easily re-used as Modules in Kathaa by using the Kathaa Module Definition API. With the advent of containers, it also becomes easy to dynamically provision the microservices. They enable easy:

8https://github.com/google/gvisor 9https://www.docker.com/

48 • addition of a Microservice: Dynamic insertion of a new microservice is as simple as creating the new docker container, and running it inside the existing overlay network with a pre-defined hostname. For simplicity, the name of the service itself is used as the hostname for the docker container. To make this microservice part of a workflow, just create an edge with it as a vertex in the graph to be queried.

• removal of a Microservice: Dynamic removal of a microservice is as easy as stopping the associated docker container in the overlay network or replacing it with another one.

Another motivation behind using the microservices-based style is to supplement an optimized option for Kathaa [10]. In the current implementation of Kathaa10, an acyclic graph (see Figure 3.4) is processed in a fashion where the Kathaa backend acts as a REST aggregator for all services and each node is processed independently. With the help of the proposed architecture, Kathaa11 can save up to ~2 × (number of nodes) network calls by sending the graph to be processed directly to the microservices-based API endpoint.

4.4 Demonstration

By applying the improvements discussed in Chapter 2 and using our SOA-based architecture, we were able to gain a significant speed increase. A demo of this enhancement is available at https://www.youtube.com/watch?v=XQD-155hDuA

10https://github.com/kathaa 11https://www.youtube.com/watch?v=woK5x0NmrUA

Chapter 5

Conclusions and Future Directions

In this thesis, we discussed three major contributions. First, we discussed the problems faced by four categories of end users of any NLP system. Then we analyzed an MT system for performance, highlighting the various bottlenecks, and provided means of mitigating them. Finally, we proposed a service-oriented architecture which abstracts away the intricacies of the complex system and provides an easy and intuitive way to query it. This was done by adopting two styles of architecture: monolithic and microservices-based. We also showed how two different projects ended up making use of two key features provided by our system, which was possible because of the service-based architecture. We also discussed ways to pre-package the application so that it can be re-used in the future by other researchers.

5.1 Future Work

Our work opens up a whole new set of interesting problems, from both a systems and an NLP perspective. We list a few such potential problems below.

1. In this thesis, all services were given equal thought and priority. However, in MT systems, that is not always the case. There should be a mechanism to assign weights to services, and allow end users to perform error analysis and regression testing on the components by measuring success and failure rates. The graph based approach can be extended to achieve this.

2. The system can be improved to provide diffs between single units of computation, to find out which modules are expendable for what kind of input. For example, in the PAN-HIN pipeline, is it useful to bifurcate GuessMorph and PickOneMorph into two separate entities? Can they be subsumed within one service to avoid overhead?

3. In all of the components referred to in Section 2.2, input/output data was always represented in the SSF format. For viewing and analyzing it with a human eye, that is justified, but one can also explore the use of other intermediate data representation techniques like msgpack or protocol buffers to pack data efficiently into smaller payloads and decrease the time taken to serialize/de-serialize data, thereby saving computation time.

4. The architecture proposed in this thesis could further be leveraged for the creation of a system for resource creation. At present, human language experts can query our system to get intermediate outputs and then add missing tags or correct wrong ones. Later, they should be able to feed this information back to a system which can automatically generate models to be used by the components behind this service. Think of a CRF tagger which trains on golden data nightly and publishes models that are later picked up by our POS Tagging service on restart.

5. Another layer could be added on top of our system to throttle inputs and allow users to specify batch sizes. Users can then subscribe to an event-based system which is triggered when their workflows are complete.

6. Building on top of our proposed SOA, a mechanism should be adopted for designing APIs without the need to write a REST aggregator or a dispatcher service. Openly available mechanisms like Swagger Tools1 can be used to make generating APIs easier.

7. For the purposes of packaging our application, we used one popular technology: Linux containers. One limitation with this is that one cannot specify constraints between services. For instance, if SimpleParser v1.x works best with MorphAnalyzer v2.x and breaks with >= v3.0, there should be a way to specify that, so that one does not end up with incompatible services being queried in a workflow. To accomplish this, one can take a look at Conda2, a platform-agnostic, language-agnostic package and environment manager which comes with a robust SAT solver for specifying dependency constraints.

8. At present, the system works at the granularity of one sentence. It should be enhanced to support scales ranging from a single linguistic unit (a morpheme) to larger linguistic units like discourse.

9. A standard practice for service daemons is to accept UNIX signals like SIGHUP to reload static data. Such an implementation can allow, for example, the morph analyzer to load a new paradigm table without restarting itself, thereby achieving zero downtime (a minimal sketch follows this list).

10. This thesis proposes the building blocks for easing the development of workflows. It is quite possible to write a Domain Specific Language (DSL) on top of it, which can be used to deploy workflows temporarily and tear them down as and when needed, based on the inputs. Kubernetes3, a container orchestration system, could be used to realize such a dynamic system.
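As a concrete illustration of item 3 above, the sketch below compares a plain JSON encoding of a single token analysis with a msgpack encoding. The token structure is an illustrative stand-in, not the actual SSF schema; only the standard json module and the msgpack package are assumed.

    # Sketch only: the token dictionary is a stand-in for a real SSF node.
    import json
    import msgpack  # pip install msgpack

    token = {"word": "ghar", "pos": "NN",
             "features": {"gender": "m", "number": "sg"}}

    as_json = json.dumps(token).encode("utf-8")
    as_msgpack = msgpack.packb(token, use_bin_type=True)

    print(len(as_json), len(as_msgpack))  # the binary encoding is typically smaller
    assert msgpack.unpackb(as_msgpack, raw=False) == token  # lossless round trip

The same comparison could be repeated with protocol buffers, at the cost of maintaining an explicit schema for every intermediate representation.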
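For item 9, the following is a minimal sketch of a SIGHUP-driven reload on a UNIX-like system. The load_paradigm_table() helper and the paradigms.txt path are hypothetical, and the final loop merely stands in for the service's request-handling loop.

    # Sketch only: reload static data on SIGHUP instead of restarting the daemon.
    import signal
    import time

    PARADIGM_TABLE = {}

    def load_paradigm_table(path="paradigms.txt"):
        # A real morph analyzer would parse the paradigm table file here.
        global PARADIGM_TABLE
        PARADIGM_TABLE = {"loaded_at": time.time(), "path": path}

    def handle_sighup(signum, frame):
        # Re-read static data without tearing the process down.
        load_paradigm_table()

    signal.signal(signal.SIGHUP, handle_sighup)
    load_paradigm_table()

    while True:  # stand-in for the service's request loop
        time.sleep(1)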

1 https://swagger.io/
2 https://conda.io/
3 https://kubernetes.io/

Related Publications

1. Title: Anuvaad Pranaali: A RESTful API for Machine Translation
Authors: Nehal J Wani, Sharada Prasanna Mohanty, Suresh Purini, Dipti Misra Sharma
Published: The 14th International Conference on Service Oriented Computing (ICSOC 2016), Banff, Canada

2. Title: Kathaa: A Visual Programming Framework for NLP Applications
Authors: Sharada Prasanna Mohanty, Nehal J Wani, Manish Srivastava, Dipti Misra Sharma
Published: Demonstrations, NAACL-2016, San Diego, USA

3. Title: NLP Systems as Edge-Labeled Directed Acyclic MultiGraphs
Authors: Sharada Prasanna Mohanty, Nehal J Wani, Manish Srivastava, Dipti Misra Sharma
Published: Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI3nOIAF2), COLING-2016, Osaka, Japan

Bibliography

[1] Shakti standard format guide. http://verbs.colorado.edu/hindiurdu/guidelines_docs/ssf-guide.pdf, 2007. Accessed: 2016-02-10.

[2] Sampark: Machine translation among Indian languages. http://sampark.iiit.ac.in/sampark/web/index./content, 2016. Accessed: 2016-02-10.

[3] R. A. Bhat. Exploiting Linguistic Knowledge to Address Representation and Sparsity Issues in Dependency Parsing of Indian Languages. PhD thesis, International Institute of Information Technology, Hyderabad, India, 2017.

[4] P. D. Francesco, P. Lago, and I. Malavolta. Migrating towards microservice architectures: An industrial survey. In IEEE International Conference on Software Architecture, ICSA 2018, Seattle, WA, USA, April 30 - May 4, 2018, pages 29–39, 2018. doi: 10.1109/ICSA.2018.00012. URL https://doi.org/10.1109/ICSA.2018.00012.

[5] D. Gannon, R. S. Barga, N. Sundaresan, S. Goasguen, N. Gustaffson, C. Davis, B. Subramanian, and D. Kohn. An asynchronous panel discussion: What are cloud-native applications? IEEE Cloud Computing, 4(5):50–54, 2017. doi: 10.1109/MCC.2017.4250941. URL https://doi.org/10.1109/MCC.2017.4250941.

[6] P. Gupta, R. Ahmad, M. Shrivastava, P. Kumar, and M. K. Sinha. Improve performance of machine translation service using memcached. In 2017 17th International Conference on Computational Science and Its Applications (ICCSA), pages 1–8, July 2017. doi: 10.1109/ICCSA.2017.7999650.

[7] E. W. Hinrichs, M. Hinrichs, and T. Zastrow. WebLicht: Web-based LRT services for German. In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, System Demonstrations, pages 25–29, 2010. URL http://www.aclweb.org/anthology/P10-4005.

[8] P. Kumar, R. Ahmad, B. D. Chaudhary, and R. Sangal. Machine translation system as virtual appliance: For scalable service deployment on cloud. In Seventh IEEE International Symposium on Service-Oriented System Engineering, SOSE 2013, San Francisco, CA, USA, March 25-28, 2013, pages 304–308, 2013. doi: 10.1109/SOSE.2013.69. URL http://dx.doi.org/10.1109/SOSE.2013.69.

[9] S. P. Mohanty, N. J. Wani, M. Shrivastava, and D. M. Sharma. Kathaa: A visual programming framework for NLP applications. In HLT-NAACL Demos, pages 92–96. The Association for Computational Linguistics, 2016.

[10] S. P. Mohanty, N. J. Wani, M. Srivastava, and D. M. Sharma. Kathaa: A visual programming framework for NLP applications. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, June 12 - June 17, 2016.

[11] S. Newman. Building Microservices. O’Reilly Media, 1 edition, 2 2015. ISBN 9781491950357. URL http://amazon.com/o/ASIN/1491950358/.

[12] M. Popel and Z. Žabokrtský. TectoMT: Modular NLP framework. In Advances in Natural Language Processing, 7th International Conference on NLP, IceTAL 2010, Reykjavik, Iceland, August 16-18, 2010, pages 293–304, 2010. doi: 10.1007/978-3-642-14770-8_33. URL https://doi.org/10.1007/978-3-642-14770-8_33.

[13] N. Russell, W. M. P. van der Aalst, and A. H. M. ter Hofstede. Workflow Patterns: The Definitive Guide. MIT Press, 2016. ISBN 9780262029827.

[14] I. S., S. N.K., H. P.J., G. R., N. Varghese, N. Sreekanth, and S. N. Pal. NLP@Desktop: A service oriented architecture for integrating NLP services in desktop clients. SIGSOFT Softw. Eng. Notes, 38(4):1–4, July 2013. ISSN 0163-5948. doi: 10.1145/2492248.2492265. URL http://doi.acm.org/10.1145/2492248.2492265.

[15] M. Stowe. Undisturbed REST: A guide to designing the perfect API. MuleSoft, San Francisco, CA, 2015. ISBN 1329115945.

[16] V. Tablan, K. Bontcheva, I. Roberts, H. Cunningham, and M. Dimitrov. AnnoMarket: An open cloud platform for NLP. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 19–24, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P13-4004.

[17] J. Thones. Microservices. IEEE Software, 32(1):116, 2015. doi: 10.1109/MS.2015.11. URL http://dx.doi.org/10.1109/MS.2015.11.

[18] A. Vasiļjevs, R. Skadiņš, and J. Tiedemann. LetsMT!: Cloud-based platform for do-it-yourself machine translation. In Proceedings of the ACL 2012 System Demonstrations, pages 43–48, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P12-3008.

[19] J. Webber, S. Parastatidis, and I. Robinson. REST in Practice: Hypermedia and Systems Architecture. O’Reilly Media, 1 edition, 9 2010. ISBN 9780596805821. URL http://amazon.com/o/ASIN/0596805829/.

[20] D. Woods. Enterprise Services Architecture. O’Reilly Media, 2003. URL https://books.google.co.in/books?isbn=0596005512. ISBN 10: 0596005512.

[21] H. Wu, Z. Fei, A. Dai, M. Sammons, D. Roth, and S. Mayhew. IllinoisCloudNLP: Text analytics services in the cloud. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 14–21, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). ISBN 978-2-9517408-8-4. URL http://www.lrec-conf.org/proceedings/lrec2014/pdf/632_Paper.pdf. ACL Anthology Identifier: L14-1504.
