Leveraging SOA Principles in Development of Modular & Scalable NLP Pipelines

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computational Linguistics by Research

by

Nehal Jagdish Wani
201125005
[email protected]

International Institute of Information Technology (Deemed to be University)
Hyderabad - 500 032, INDIA
May 2019

Copyright © Nehal Jagdish Wani, 2019
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Leveraging SOA Principles in Development of Modular & Scalable NLP Pipelines” by Nehal Jagdish Wani, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date
Adviser: Prof. Dipti Misra Sharma

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Leveraging SOA Principles in Development of Modular & Scalable NLP Pipelines” by Nehal Jagdish Wani, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date
Co-Adviser: Prof. Suresh Purini

To Curiosity and Perseverance

Acknowledgments

First of all, I would like to thank my adviser, Dr. Dipti Misra Sharma, who has been very supportive and understanding throughout my stay at IIIT-H. I am grateful to her for guiding me in this eventful journey. Had she not been present during my interviews or during the initial orientation session, I would probably not have been part of the IIIT-H family. She always saw right through me.

Thank you, Dr. Suresh Purini, for being a friend and a mentor. He is a lively professor in whom one can easily confide. Our discussions have helped me and my work grow a lot and become what it is today. Thank you for being there at odd times, encouraging my thoughts and helping shape them into better ideas.
I have been extremely fortunate to have interacted, learned and studied along with two of my closest seniors, Sanidhya Kashyap and Jaspal Dhillon, with whom I spent more time than with my batchmates. They have been constant pillars of inspiration. They always supported and motivated me, and I owe them a great deal for helping me, shaping my mind, answering my stupid queries, and making me realize that there is a lot more to computer science than what is taught in courses. They helped me realize my potential. It is very difficult for me to imagine how I would have survived in college without their assistance.

SP Mohanty deserves a special mention here. He made me realize how ideas are spun into reality and how NLP is not restricted to only linguists. I have always envied him ;-)

The implementation of the ideas proposed in this thesis was initially done by porting numerous modules from the Sampark MT system developed during the “Indian Language to Indian Language Machine Translation” (ILMT) consortium project funded by the TDIL program of the Department of Electronics and Information Technology (DeitY), Govt. of India; much of this work would not have been possible without support from the whole consortium. A big shout out to Rashid Ahmed and Avinash for helping me understand the integration of various components in this system.

I would also like to thank the Bhat brothers (Riaz and Irshad) for helping me understand the word ‘computational’ in the context of NLP. They, along with Maaz, made me enjoy my stay at LTRC. Raveesh Motlani always believed in me, helped me sift through ideas and also shaped me as a person. He never gave up on me. He was always there to support me morally.

I had various interesting conversations with Ankush, Mayank, Anubhav, Anhad, Venky, Arnav, Tushant, Ayush, Kapil, Deepesh and Somay. They made me ponder over various things in life. I am honored to have been a part of the batch UG2k11 and to have been acquainted with them.
Last but not least, I am thankful to my family for their constant support and motivation to stay focused; without them I could not have gathered the courage to take the opportunity to pursue what I wanted.

Abstract

Installation, deployment and maintenance of an NLP system can be a daunting task, depending on the number and complexity of the components involved. One such system is a hybrid Machine Translation system, composed of several modules which define the transformation of a given word, phrase or sentence from one language to another. The end users of such a system can be developers themselves who want to improve it. To achieve this, the system as a whole needs to adopt an architecture which lets users control and change the order of components in the pipeline; intervene in the middle of execution to debug one or more components; tweak inputs and outputs without having to rewrite the components; easily replicate the system on their local machine without the hassle of compiling everything from source; and quickly replace a component with a higher or lower version and see the impact on the final result. In short, the architecture should make development iterative and fast. The system should also expose an interface for those users who want to build something on top of it without having to worry about the internal details.

The ideas proposed in this thesis try to cater to the needs of a broad category of users, in an attempt to keep their workflow as simple as possible. We propose an architecture in which we show how to identify and transform a monolithic system into small, individual components (each being a linguistic unit), identify bottlenecks from an operating system’s point of view, identify scalable components and finally provide an easy mechanism to interact with the system.
To achieve this, we apply the proposed architecture to an existing system (the Sampark MT System) and walk through its transformation. Towards the end, we present a web client which shows how easy it becomes to interact with the modified system. We also show that our proposition can be applied to any pipeline-based system by thinking of it as a disconnected, directed acyclic graph. We also show how the modified system can be deployed on the cloud easily and how individual components can be scaled up or down as needed.

To be able to plan the overall architecture and produce guidelines for enabling large-scale collaborative development, a structured systems analysis and design, from the point of view of both a computational linguist and a systems engineer, is required. This thesis provides the foundation in that direction by enhancing an existing system, reducing the overall runtime of its components by more than 85%, improving the test-dev-deploy cycle for computational linguists and discussing a generalized architecture on top of which further complex systems can be built for specific purposes.

Contents

Abstract
1 Introduction
    1.1 Workflows in Language Processing
        1.1.1 Patterns in workflows
    1.2 Problem Statement
    1.3 Related Work
    1.4 Summary of Contributions
    1.5 Organization of the Thesis
2 The Sampark MT System: A Case Study
    2.1 Brief Introduction to ILMT Modules
    2.2 Functioning & Performance Analysis of Modules
        2.2.1 Indic Converter
        2.2.2 Morph Analyzer
        2.2.3 POS Tagger
        2.2.4 Transfer Grammar
        2.2.5 Lexical Transfer
        2.2.6 Word Generator
    2.3 Common Traits & Peculiarities
    2.4 Revamping Modules: Moving Towards Services
        2.4.1 Reducing File I/O
        2.4.2 (Re)using memory, Efficiently
        2.4.3 Asynchronous I/O & Daemonization
    2.5 Results of Transformation
3 Service Oriented Architecture
    3.1 Services: A Short Introduction
    3.2 Adopting SOA to ILMT
        3.2.1 The RESTful API
        3.2.2 Anuvaad Pranaali
        3.2.3 ResumeMT
            3.2.3.1 Detecting Error Propagation and Rectification
            3.2.3.2 Demonstration
        3.2.4 A Graph based Approach for Querying
    3.3 Use Cases
        3.3.1 ILParser
        3.3.2 Kathaa: A Visual Programming Framework for Humans
4 Deployment and Packaging
    4.1 Monolithic Application
        4.1.1 Breakdown
    4.2 Microservices based Application
        4.2.1 Implementation
        4.2.2 Architecture Benefits
    4.3 Becoming Cloud Native
