A SURVEY ON A MACHINE LEARNING BASED APPROACH FOR A LEGAL DOCUMENT SIMPLIFIER AND READER

1SIMRAN H. MHATRE, 2DEVIKA JALGAONKAR, 3SUSHRUT MADHAVI, 4KAILAS K. DEVADKAR

1,2,3Bachelors of Information Technology, Sardar Patel Institute of Technology, Mumbai, India. 4Professor, Department of Information Technology, Sardar Patel Institute of Technology, Mumbai, India.

Abstract - Reading and understanding legal documents is difficult because of their highly sophisticated language. Many people therefore depend on others to translate and explain legal documents, and are cheated in the process. There is a pressing need for a reliable, simplified verbal translation of legal documents. We thus propose an implementation of a Legal Document Simplifier and Reader for the layman. It helps the layman better understand the clauses in a legal document and reads the simplified document aloud for those who are unable to read it. This will eliminate dependence on others for understanding a legal document and prevent the deception of unassuming, unaware civilians. Translation of the document from one language to another is also included for the user's convenience. The framework for the proposed implementation includes a camera module to capture an image of the legal document, an image processing module to process the image, an OCR (Optical Character Recognition) module to recognize the text, an NLP (Natural Language Processing) module to simplify difficult terms, a language translation module, and a TTS (Text-to-Speech) synthesis module to read out the simplified document in the user's preferred language.

Keywords - Legal Document Reader, NLP, OCR, Text Simplification, Text To Speech

I. INTRODUCTION

In today's world, understanding legal documents is of extreme importance. They are notoriously difficult to understand even when one is familiar with the language the document is written in. Being able to read the words is not enough; one needs to know the meaning behind the legal jargon. According to the 2011 Indian Census, India's literacy rate is 74%, and the Indian state with the lowest literacy rate is Bihar (63.82%) [1]. As one goes up the education hierarchy, the number of people at each level becomes increasingly sparse. News reports of rural dwellers being deceived through legal documents are constantly on the rise.

Even today, legal document interpretation is heavily dependent on humans. Even large organizations with well-educated staff require assistance in understanding legal documents. Many entities provide legal document interpretation, but one has to pay to avail of their services. India's official poverty line threshold has been set at about ₹ 32 per day in urban areas and ₹ 26 per day in rural areas since 2007 [2]. In India, 270,000,000 people are poor, i.e. 1 in 5 Indians is poor [3]. These people cannot afford to pay for legal document interpretation services. The only option available to them is to rely on someone they trust to explain legal documents. Cunning people take advantage of this very situation, and many people are cheated because they lack the resources to properly understand a legal document.

We thus propose a Legal Document Simplifier and Reader that will be easily available to all and will prevent the deception of unassuming civilians. This project proposes an application that will capture an image of the legal document, extract the text from the image using OCR, translate it into English with high accuracy (an accuracy threshold can be set after experimentation and research), simplify the text using text simplification algorithms and a crafted dictionary of legal terms, translate the simplified document into the user's preferred language with high accuracy, pass the translated text through a text-to-speech conversion module, and read out the simplified document to the user. The Legal Document Simplifier and Reader can benefit many users, such as low-literacy users, blind users, and users not familiar with the language the legal document is written in. It makes users aware of the content of the legal document, enabling them to take correct decisions regarding legal matters.

II. RELATED WORK

A method for reading medical documents using OCR (Optical Character Recognition) was proposed in [4]. For the core recognition process, a Tesseract OCR engine was used, which takes the preprocessed image as input and returns a string of editable characters found in the image. A dictionary-trained model is then used to recognize each word. It was found that this method was more accurate with certain fonts than with others, and that the accuracy also varied greatly with the distance between the camera and the document when the image was taken.

An extended implementation of OCR combined it with a text-to-speech module to help the visually impaired with reading [5]. A Raspberry Pi camera module was used to capture an image for pre-processing. Characters in the preprocessed image are distinguished, recognized and converted to a readable text format based on their degree of correlation with the model used for OCR. Once the text has been extracted from the image, the character strings are passed to a text-to-speech module such as Pico TTS to produce the audio output.

An implementation of a comprehensive search engine for understanding the meaning of legal terms in their specific context was proposed in [6]. The search engine's functionality is described in terms of the entered term, which is the question, and the system's output, which is the answer. To give the simplest output possible, the system uses a synonym resource, LawNet for legal definitions, and official data such as the results of law cases, reports and the constitution of the country. The question is first processed using NLP (Natural Language Processing) to filter relevant keywords and entities, which are used to classify the question into a particular type. Based on the type of question, an algorithm discerns the kind of information that is needed and formulates a query accordingly, to be used for generating an answer. The query is compared with existing information in the resources to understand the context of the question, based on previous similar occurrences of keywords and entities. Matches identified in this comparison are indexed and ranked according to multiple parameters, such as the number of keywords and entities and the length of the matching keyword sequence from the question, while irrelevant or out-of-context information is filtered out. Finally, the system analyzes the results of indexing and generates a simple, readable answer to be presented as output.

III. NECESSITY OF THE SYSTEM

Legal documents are written in a manner that is difficult for the layman to read or understand. In India, the literacy rate is 71%, i.e. 29% of the population still cannot read or write. Understanding a legal document is not just about literacy, but also about education. The terms used in legal documents are very sophisticated, making them especially difficult for people with the bare minimum of education to understand. Fraud cases in which innocent people sign papers without knowing what is written on them are rampant, especially in rural areas and among the elderly. A steelmaking company took possession of over 200 acres of farmers' land in a poverty-stricken district in India [7]. Mr. A, of a village in India, had no idea that the ownership of all his 7.5 acres of land had been transferred to Mr. B, to whom he had sold just one small plot of his land. Mr. A claims that Mr. B made him sign papers and that he was unaware that they were the sale deed [8]. Ms. X and Ms. Y bought two plots of land in a village, for which a power of attorney was prepared by Mr. Z [9]. These are just a few examples of fraud cases caused by lack of knowledge about the legal documents being dealt with. Many such cases go unreported.

Table 1: Education statistics of India (per 1000 distribution of persons) [10, Tab. 3.10]

The main reason for this is the lack of higher education among the Indian population, as seen in Table 1. Only 8.5% of males and 6.2% of females have completed their higher secondary education (rural + urban). Higher secondary education is not enough to understand legal documents, and only a handful of the total population has completed graduation or post-graduation. Efforts are being made to raise the level of education of the population, but that will take time. This gap can be bridged by a legal document simplifier and reader that helps people understand legal documents in simpler terms, in a language of their preference, and helps avoid deception of the layman.

IV. PROPOSED METHODOLOGY

Our solution proposes a system that gives the user a proper understanding of the document in a language they are familiar with. The methodology is as follows:
1. Scanning the document using OCR
2. Translation of the document into English if it is not already in English
3. Simplification of the text obtained
4. Translation of the simplified text into the user's preferred language
5. Conversion of the simplified text into speech

1) Optical Character Recognition
Optical character recognition is a technology that converts images containing text into formats with editable text, allowing scanned documents, books, screenshots and photos with text to be processed into editable TXT, DOC or PDF files [11]. Each character in the document is recognized individually, yielding a well-formatted editable document rather than the plain JPEG image produced by simple scanning. It has been observed that OCR results are far more accurate if the text is a sharp, dark font on light paper [12]. The OCR system first removes artifacts in the image, such as dust and graphics. It then aligns the text properly and converts any shades of grey in the font to black for easy recognition of the words. Simple OCR systems compare each letter pixel by pixel against a known database of commonly used fonts, whereas more complex OCR systems take into account the shapes and curves of each character. Some OCR systems also use a dictionary to avoid producing invalid words. Legal documents are generally printed, so the text is sharp and the font is usually clear, which reduces the chance of the OCR system identifying a letter incorrectly. For accuracy and simplification, a dictionary containing keywords and key phrases commonly present in legal documents will be included.

2) Machine Translation
Machine translation is a technique in which computer software translates text from one language to another with the help of advanced grammatical, syntactic and semantic analysis techniques [13]. Vast advancements in neural networks, artificial intelligence and machine learning have extended their application to automatic translation. The first approach to machine translation was rule based, relying on a set of rules predefined by linguistic experts. The sentence is parsed, and words are identified and analysed to convert them into the target language based on these rules [13]. Statistical systems have no rules; they learn to translate by analysing a large amount of data for each language pair [14]. The newest approach is based on machine learning and learns to translate through one large neural network [14].

The editable text obtained from the OCR module is fed to the translation module. This module is used twice in this application. The first use converts the legal document from the language it is written in to English (this step is skipped if the document is already in English). The second use comes after the simplified text of the document has been obtained, when it is converted into the user's language of preference. The accuracy of translation must be high in both cases to ensure that the meaning of the document is intact in the end. The neural network based approach has proven to generate accurate results and can be used for the translation module in this application. After experimentation, an accuracy threshold can be set to ensure that the core meaning of the document is not lost in translation.

3) Text Simplification using Natural Language Processing
NLP is a way for computers to analyze, comprehend and derive meaning from human language [15]. Machine learning algorithms generally form the basis of NLP. Automatic text simplification is the process of transforming a text into another text which, ideally conveying the same message, will be easier to read and understand by a broader audience. The process usually involves the replacement of difficult or unknown phrases with simpler equivalents and the transformation of long and syntactically complex sentences into shorter and less complex ones [16]. Two complementary strategies can be followed to achieve the overall goal: lexical simplification, which substitutes complex words with simple words while retaining the meaning and essence of the sentence, and syntactic simplification, which replaces complex sentences with understandable alternative sentences. A simple flow of NLP is shown in Figure 1.

Figure 1: Text simplification using NLP

The scanned legal document, converted into a computer-recognizable and editable text document, is further simplified by Natural Language Processing (NLP). Simplification can be carried out in the following steps. A sentence from the document (the OCR output, translated into English where necessary) is taken. Phrases and words that are difficult for a layman to understand are identified in the sentence. The difficult words and phrases are expressed in an understandable way using alternatives. Alternatives are selected and ranked on the basis of their relevance to the sentence using part-of-speech tagging [17], and a simplified sentence is constructed. All this is done without changing or compromising the meaning of the original sentence. Great care is taken to keep the meaning of the legal document intact in the simplified version. The alternatives can be generated from a carefully crafted dictionary containing synonyms for words very commonly used in legal documents.

4) Text to Speech
Text-to-speech synthesis (TTS) converts the simplified and, if required, translated text obtained from the previous modules into speech, which reads out the simplified document in the preferred language. Speech synthesis is the process of artificially producing human speech using a speech synthesizer [18]. A speech synthesizer can be implemented in either software or hardware. A text-to-speech (TTS) system converts normal text into speech that resembles actual human speech as closely as possible [19]. A text-to-speech module has a front-end and a back-end [20]. The front-end performs text normalization, also called preprocessing or tokenization: raw text containing symbols, abbreviations and numbers is converted into its equivalent spelled-out words. Text-to-phoneme, or grapheme-to-phoneme, conversion is performed next: phonetic transcriptions are assigned to each word, and the text is divided and marked into prosodic units such as sentences, phrases and clauses. The output of the front-end is the symbolic linguistic representation, which consists of prosody information and phonetic transcriptions. The back-end (synthesizer) converts the symbolic linguistic representation into sound. This process is called synthesis. It may include computation of the target prosody (phoneme durations, pitch contour) [21], which is then imposed on the final speech output.

The simplified document obtained from the text simplification module is first translated into the user's preferred language by machine translation. The translated simplified document is given as input to the text-to-speech module, where it is converted into speech in the user's preferred language. The speech output should closely resemble the pronunciation and dialect of the user's preferred language to enhance the user's understanding of the document. This is essential for users who cannot read.

5) Overall Working
Figure 2 shows the complete working of the system. Initially, the user selects an image of a legal document (in .jpg, .jpeg or .pdf format) to be parsed, the language the legal document is in, and the language they would like the document to be translated to. The image, captured by the user's camera or fetched from the gallery, is fed to the optical character recognition system. The OCR module recognizes the characters and converts them into an editable text document (such as .txt or .docx). The accuracy of character recognition in the OCR module is extremely important for the later stages to produce an accurate output overall. The editable text document obtained from the OCR module is fed to the translation module, where it is converted from the language it is written in to English. This document is then passed through the text simplification module. Text simplification is done in only one language (English) to reduce the complexity of the process. A dictionary made specifically for legal terms will be used to make the simplification more accurate. In text simplification, the content is simplified as much as possible without altering the meaning and context of the text. Accuracy parameters should be set to ensure this. The simplified text is given as input to the translation module, which translates the simplified document into the preferred language selected by the user. A text-to-speech module is also included, which converts the translated document into speech as output. This is especially necessary for people who cannot read. The user can also view the simplified and translated document without it being converted to speech.

Figure 2: Overall working of the system

V. CHALLENGES

The system comprises multiple modules working in conjunction, each of which has its own set of challenges that must be overcome to ensure optimum functioning. A dataset containing as many types of legal documents and phrases as possible is crucial for accurate performance. This data needs to be made available by trusted legal sources. Improving the accuracy of the OCR module is crucial for the later stages. The meaning and context of the simplified text obtained from NLP must be as close to the original as possible, with little to no variation. This can prove to be a great challenge. It can be addressed by using an NLP algorithm robust enough to handle exceedingly complex and varied inputs, and by training it on a considerably large dataset that covers as many varieties of legal documents and legal phrases as possible. It may also happen that, even though the meaning is retained in the simplification, the translation proves insufficient for the text to be understood by the user. This can be addressed by a translation module that is accurate and extensive enough to support the major Indian languages.

CONCLUSION AND FUTURE WORK

In this paper we have proposed a system that simplifies legal document jargon and converts the simplified text into speech. It helps the layman and people who have difficulty reading by presenting a simplified version of the legal document. We hope this will prevent, or at least reduce, cases of fraudulent misinterpretation and legal fraud. To this end, we have suggested technologies that work in tandem: scanning a document with OCR, translating it with one of the proposed APIs, simplifying it using NLP, and finally converting the text to speech. We have not covered the precise implementation details of these technologies and have only suggested the tools that may be used, according to their effectiveness and application. Future work in this domain could be a more in-depth examination of the NLP algorithm and of the kind of data needed to improve the accuracy and efficiency of the application. The overall effectiveness of the system may also be computed using a metric that evaluates the performance of the system against real-world standards. Furthermore, we have limited the scope of our proposal to legal documents in India, since every country has different laws and legal terms may vary from nation to nation. The system could be extended to support more countries and eventually be made self-adaptive to changes in legal terms.

REFERENCES

[1] Ministry of Home Affairs, Government of India, State of Literacy, 2011. [Online]. Available: http://censusindia.gov.in/2011-prov-results/data_files/india/Final_PPT_2011_chapter6.pdf. [Accessed: May 8, 2018].
[2] "Not poor if you earn Rs 32 a day: Planning Commission", India Today, September 21, 2011. [Online]. Available: https://www.indiatoday.in/india/north/story/planning-commission-bpl-earn-rs-25-a-day-india-141619-2011-09-21. [Accessed: May 8, 2018].
[3] "India's Poverty Profile", World Bank, May 27, 2016. [Online]. Available: http://www.worldbank.org/en/news/infographic/2016/05/27/india-s-poverty-profile. [Accessed: May 8, 2018].
[4] A. Kongtaln, S. Minsakorn, L. Yodchaloemkul, S. Boontarak, S. Phongsuphap, "Medical Document Reader on Android Smartphone", In Proc. 2014 Third ICT International Student Project Conference (ICT-ISPC2014), 2014.
[5] S. Sonth and J. Kallimani, "OCR Based Facilitator for the Visually Challenged", In Proc. 2017 International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), 2017.
[6] N. Dharamsiri, B. Gunathilake, U. Pathirana, S. Senevirathne, A. Nugaliyadde and S. Thellijagoda, "Simplifying Law statements using Natural Language Processing", In Proc. Second Asia Pacific Conference on Contemporary Research (APCCR, Malaysia, 2016), 2016.
[7] "Company accused of cheating villagers", The Hindu, para. 5, September 28, 2016. [Online]. Available: http://www.thehindu.com/todays-paper/tp-national/tp-otherstates/Company-accused-of-cheating-villagers/article14762016.ece. [Accessed: May 8, 2018].
[8] R. Kumar, "Illegal transfers: Adivasis in Chhattisgarh plan to criminally prosecute firms that hold their land", Scroll.in, para. 2-3, July 17, 2017. [Online]. Available: https://scroll.in/article/843427/illegal-transfers-adivasis-in-chhattisgarh-plan-to-criminally-prosecute-firms-that-hold-their-land. [Accessed: May 8, 2018].
[9] "Spate of land cheating cases", The Hindu, March 22, 2012. [Online]. Available: http://www.thehindu.com/todays-paper/tp-national/tp-tamilnadu/spate-of-land-cheating-cases/article3049944.ece. [Accessed: May 8, 2018].
[10] Ministry of Statistics and Programme Implementation, Literacy and Education. [Online]. Available: http://www.mospi.gov.in/sites/default/files/reports_and_publication/statistical_publication/social_statistics/Chapter_3.pdf. [Accessed: May 8, 2018].
[11] "Optical Character Recognition (OCR) – How it works", Nicomsoft OCR, para. 1, February 5, 2012. [Online]. Available: https://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/. [Accessed: April 28, 2018].
[12] "OCR Document", Haven OnDemand, para. 4, January 8, 2016. [Online]. Available: https://dev.havenondemand.com/apis/ocrdocument#overview. [Accessed: April 28, 2018].
[13] "Machine Translation", Andovar, June 9, 2017. [Online]. Available: https://www.andovar.com/machine-translation/. [Accessed: April 29, 2018].
[14] "Increase productivity and translate faster", SDL Trados Studio, October 28, 2017. [Online]. Available: https://www.sdltrados.com/solutions/machine-translation/. [Accessed: April 29, 2018].
[15] "Introduction to Natural Language Processing (NLP)", Algorithmia, August 11, 2016. [Online]. Available: https://blog.algorithmia.com/introduction-natural-language-processing-nlp/. [Accessed: April 28, 2018].
[16] H. Saggion, Automatic Text Simplification (Synthesis Lectures on Human Language Technologies). San Rafael, CA: Morgan & Claypool Publishers, 2017. [Online]. Available: Amazon e-book.
[17] B. P. Nunes, R. Kawase, P. Siehndel, M. A. Casanova and S. Dietze, "As simple as it gets - A sentence simplifier for the different learning levels and contexts", In Proc. 2013 IEEE 13th International Conference on Advanced Learning Technologies, 2013.
[18] Suendermann, D., Höge, H., and Black, A., 2010. Challenges in Speech Synthesis. Chen, F., Jokinen, K. (eds.), Speech Technology, Springer Science + Business Media LLC.
[19] Allen, J., Hunnicutt, M. S., Klatt, D., 1987. From Text to Speech: The MITalk system. Cambridge University Press.
[20] Van Santen, J.P.H., Sproat, R. W., Olive, J.P., and Hirschberg, J., 1997. Progress in Speech Synthesis. Springer.
[21] Van Santen, J.P.H., 1994. Assignment of segmental duration in text-to-speech synthesis. Computer Speech & Language, Volume 8, Issue 2, Pages 95–128.
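
As a supplementary illustration of the OCR clean-up that Section IV describes (removing artifacts such as dust and converting shades of grey in the font to black), the following is a minimal sketch and not the implementation used by any particular OCR engine. The threshold of 128, the rule that a dark pixel with no dark neighbour counts as dust, and the function names `binarize` and `remove_specks` are all assumptions made for the example; the image is a plain nested list so that the sketch needs no external library.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels: 0 = black, 255 = white


def binarize(image: Image, threshold: int = 128) -> Image:
    """Map every pixel darker than `threshold` to black (0) and the rest to white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in image]


def remove_specks(image: Image) -> Image:
    """Whiten black pixels that have no black pixel among their 8 neighbours.

    Assumption for this sketch: an isolated dark pixel is dust, not part of a letter.
    Expects a non-empty, already binarized image.
    """
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 0:
                continue
            has_black_neighbour = any(
                image[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                cleaned[y][x] = 255
    return cleaned


# Toy 4x5 "scan": a small grey stroke in the top-left and one lone dark pixel.
page = [
    [255, 255, 255, 255, 255],
    [255,  90,  60, 255, 255],
    [255,  70, 255, 255, 255],
    [255, 255, 255, 255,  40],
]
clean = remove_specks(binarize(page))
```

In this toy input the three connected grey pixels become black and are kept, while the lone dark pixel in the bottom-right corner is whitened. A production system would more likely rely on an image-processing library and an adaptive threshold.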
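
The dictionary-based lexical simplification outlined in Section IV (identify difficult words and phrases, then substitute simpler alternatives from a crafted dictionary) can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary entries are hypothetical examples rather than a vetted legal glossary, the function name `simplify_sentence` is invented for the sketch, and the ranking of alternatives by part-of-speech tagging mentioned in the paper is deliberately left out.

```python
import re
from typing import Dict

# Hypothetical entries (keys in lower case); a real dictionary would come
# from trusted legal sources.
LEGAL_DICTIONARY: Dict[str, str] = {
    "hereinafter": "from now on",
    "lessee": "tenant",
    "lessor": "landlord",
    "null and void": "invalid",
    "in lieu of": "instead of",
}


def simplify_sentence(sentence: str, dictionary: Dict[str, str]) -> str:
    """Replace every dictionary phrase found in `sentence` with its simpler equivalent."""
    # Longer phrases first, so a multi-word entry wins over any shorter entry it contains.
    phrases = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        flags=re.IGNORECASE,
    )
    # Matching is case-insensitive; the replacement is inserted as stored in the dictionary.
    return pattern.sub(lambda m: dictionary[m.group(0).lower()], sentence)


original = "The lessee shall pay rent to the lessor, failing which this deed is null and void."
simplified = simplify_sentence(original, LEGAL_DICTIONARY)
```

Here `simplified` reads "The tenant shall pay rent to the landlord, failing which this deed is invalid." Plain substitution of this kind cannot tell when a term is used in a different sense, which is why the paper's proposal to select and rank alternatives by their relevance to the sentence matters.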
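
The text normalization performed by the TTS front-end in Section IV (converting symbols, abbreviations and numbers into spelled-out words) might look like the sketch below. The abbreviation and symbol tables are small hypothetical samples, the helper names are invented for the example, and only English and whole numbers are handled; decimals, dates and currency amounts would need extra rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical sample tables for the sketch.
ABBREVIATIONS = {"Mr.": "Mister", "Ms.": "Miss", "Rs.": "Rupees"}
SYMBOLS = {"%": " percent", "&": " and "}


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def spell_number(match: "re.Match[str]") -> str:
    """Spell short numbers as words; read longer digit runs digit by digit."""
    digits = match.group()
    if len(digits) <= 3:
        return number_to_words(int(digits))
    return " ".join(ONES[int(d)] for d in digits)


def normalize(text: str) -> str:
    """Expand abbreviations, symbols and numbers, then collapse extra whitespace."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    text = re.sub(r"\d+", spell_number, text)
    return re.sub(r"\s+", " ", text).strip()


spoken = normalize("Mr. A owns 7 acres & pays 12% interest.")
```

With these tables, `spoken` is "Mister A owns seven acres and pays twelve percent interest." The grapheme-to-phoneme and prosody steps that follow normalization depend on the language and are outside this sketch.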
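
The overall working described in Section IV, in which the translation module is used twice around a simplification step that runs only in English, can be summarized as the orchestration sketch below. Everything here is illustrative: the `Pipeline` class, its field names and the two-letter language codes are assumptions, and the four modules are stand-in functions that only record the order of calls rather than real OCR, translation, simplification or speech engines.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Pipeline:
    """Wires the four modules together; each module is injected as a callable."""
    ocr: Callable[[bytes], str]
    translate: Callable[[str, str, str], str]  # (text, source_lang, target_lang)
    simplify: Callable[[str], str]             # operates on English text only
    speak: Callable[[str, str], bytes]         # (text, lang) -> audio data

    def run(self, image: bytes, doc_lang: str, user_lang: str,
            want_audio: bool = True) -> Tuple[str, Optional[bytes]]:
        text = self.ocr(image)
        if doc_lang != "en":                   # first use of translation
            text = self.translate(text, doc_lang, "en")
        simple = self.simplify(text)
        if user_lang != "en":                  # second use of translation
            simple = self.translate(simple, "en", user_lang)
        audio = self.speak(simple, user_lang) if want_audio else None
        return simple, audio


# Stand-in modules that log the call order instead of doing real work.
calls: List[str] = []


def fake_ocr(image: bytes) -> str:
    calls.append("ocr")
    return image.decode("utf-8")


def fake_translate(text: str, source: str, target: str) -> str:
    calls.append(f"translate:{source}->{target}")
    return f"[{target}] {text}"


def fake_simplify(text: str) -> str:
    calls.append("simplify")
    return text.lower()


def fake_speak(text: str, lang: str) -> bytes:
    calls.append(f"speak:{lang}")
    return text.encode("utf-8")


pipeline = Pipeline(fake_ocr, fake_translate, fake_simplify, fake_speak)
result_text, result_audio = pipeline.run(b"DEED OF SALE", doc_lang="mr", user_lang="hi")
```

Injecting the modules as callables keeps the control flow testable without committing to any particular OCR, translation or TTS service. Passing `want_audio=False` corresponds to the case where the user only views the simplified, translated document.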



Proceedings of 73rd IRF International Conference, 27th May, 2018, Pune, India 18