A Survey on a Machine Learning Based Approach for a Legal Document Simplifier and Reader

1SIMRAN H. MHATRE, 2DEVIKA JALGAONKAR, 3SUSHRUT MADHAVI, 4KAILAS K. DEVADKAR
1,2,3Bachelors of Information Technology, Sardar Patel Institute of Technology, Mumbai, India.
4Professor, Department of Information Technology, Sardar Patel Institute of Technology, Mumbai, India.
Proceedings of 73rd IRF International Conference, 27th May, 2018, Pune, India

Abstract - Reading and understanding legal documents is difficult because of their highly sophisticated language. As a result, many people depend on others to translate and explain legal documents, and some are cheated in the process. There is a pressing need for a reliable, simplified, spoken rendering of a legal document. We thus propose an implementation of a Legal Document Simplifier and Reader for the layman. It is meant to help the layman better understand the clauses of a legal document and to read the simplified document aloud to users who are unable to read it. This will eliminate dependence on others for understanding a legal document and prevent the deception of unassuming, unaware civilians. Translation of the document from one language to another is also included to make the system more convenient for the user. The framework for the proposed implementation includes a camera module to capture an image of the legal document, an image processing module to process the image, an OCR (Optical Character Recognition) module to recognize the text, an NLP (Natural Language Processing) module to simplify difficult terms, a language translation module, and a TTS (Text-to-Speech) synthesis module to read out the simplified document in the user's preferred language.

Keywords - Legal Document Reader, Machine Translation, NLP, OCR, Text Simplification, Text To Speech

I. INTRODUCTION

In today's world, understanding legal documents is of extreme importance. They are notoriously difficult to understand even when one is familiar with the language the document is written in. Being able to read the words is not enough; one needs to know the meaning behind the legal jargon. According to the 2011 Indian Census, India's literacy rate is 74%, and the Indian state with the lowest literacy rate is Bihar (63.82%) [1]. As one goes up the education hierarchy, the number of people at each level becomes increasingly sparse. News reports of rural dwellers being deceived through legal documents are constantly on the rise.

Even today, legal document interpretation is heavily dependent on humans. Even large organizations with well-educated staff require assistance in understanding legal documents. Many entities provide legal document interpretation, but one has to pay to avail of their services. India's official poverty line threshold has been set at about ₹32 per day in urban areas and ₹26 per day in rural areas since 2007 [2]. In India, 270,000,000 people are poor, i.e. 1 in 5 Indians is poor [3]. These people cannot afford to pay for legal document interpretation services. The only option available to them is to rely on someone trustworthy to explain legal documents. Cunning people take advantage of this very situation, and many people are cheated because they lack the resources to properly understand a legal document.

We thus propose a Legal Document Simplifier and Reader that will be easily available to all and will prevent the deception of unassuming civilians. This project proposes an application that will capture an image of the legal document, extract the text from the image using OCR, translate it into English with high accuracy (an accuracy threshold can be set after experimentation and research), simplify the text using text simplification algorithms and a crafted dictionary of legal terms, translate the simplified document into the user's preferred language with high accuracy, pass the translated text through a text-to-speech conversion module, and read out the simplified document to the user. The Legal Document Simplifier and Reader can benefit many users, such as low-literacy users, blind users, and users not familiar with the language the legal document is written in. It makes users aware of the content of the legal document, enabling them to take correct decisions regarding legal matters.

II. RELATED WORK

A method for reading medical documents using OCR (Optical Character Recognition) was proposed in [4]. For the core recognition process, a Tesseract OCR engine was used, which takes the preprocessed image as input and returns a string of editable characters found in the image. A dictionary-trained model is then used to recognize each word. It was found that this method was more accurate with certain fonts than others, and that the accuracy also varied greatly with the distance between the camera and the document when the image was taken.

An extended implementation of OCR was given by combining it with a text-to-speech module to help the visually impaired with reading [5]. A Raspberry Pi camera module was used to capture an image for preprocessing. Characters in the preprocessed image are distinguished, recognized, and converted to a readable text format based on their degree of correlation with the model used for OCR. Once the text has been extracted from the image, the character strings are passed to a text-to-speech module such as Pico TTS to give an audio output.

An implementation of a comprehensive search engine for understanding the meaning of legal terms in their specific context was proposed in [6]. The search engine's functionalities are explored with regard to the entered term, which is the question, and the system's output, which is the answer. To give the simplest output possible, the system uses resources such as WordNet for synonyms and LawNet for legal definitions, as well as official data such as results of law cases, reports, and the constitution of the country. The question is first processed using NLP (Natural Language Processing) to filter relevant keywords and entities, which are used to classify the question into a particular type. Based on the type of question, an algorithm discerns the kind of information that is needed and formulates a query accordingly, to be used for generating an answer. The query is compared with existing information in the resources to understand the context of the question based on previous similar occurrences of keywords and entities. Matches identified in this comparison are indexed and ranked according to multiple parameters, such as the number of keywords and entities and the length of the matching keyword sequence from the question, while irrelevant or out-of-context information is filtered out. Finally, the system analyzes the results of indexing and generates a simple, readable answer to be presented as output.

III. NECESSITY OF THE SYSTEM

Legal documents are written in a manner that is difficult for the layman to read or understand. If we consider India, the literacy rate is 71%, i.e. 29% of the population still cannot read or write. Understanding a legal document is not just about literacy, but also about education. The terms used in legal documents are very sophisticated, making them especially difficult for people with bare minimum education to understand. Fraud cases in which innocent people sign papers without knowing what is written on them are rampant, especially in rural areas and among the elderly. A steelmaking company took possession of over 200 acres of farmers' land in a poverty-stricken district in India [7]. Mr. A of a village in India had no idea that the ownership of all his 7.5 acres of land had been transferred to Mr. B, to whom he had sold just one small plot of his land. Mr. A claims that Mr. B made him sign papers and that he was unaware that it was the sales deed [8]. ... about the legal documents being dealt with. Many such cases go unreported.

Table 1: Education statistics of India (per 1000 distribution of persons) [10, Tab. 3.10]

The main reason for this is the lack of higher education among the Indian population, as seen in Table 1. Only 8.5% of males and 6.2% of females have completed their higher secondary education (rural + urban). Higher secondary education is not enough to understand legal documents. Only a handful of the total population has completed graduation and post-graduation. Efforts are being made to raise the education level of the population, but that will take time. This gap can be bridged by a legal document simplifier and reader that helps people understand legal documents in simpler terms, in a language of their preference, and helps avoid deception of the layman.

IV. PROPOSED METHODOLOGY

Our solution proposes a system that gives the user a proper understanding of the document in a language they are familiar with. The methodology is as follows:
1. Scanning the document using OCR
2. Translation of the document into English if it is not already in English
3. Simplification of the text obtained
4. Translation of the simplified text into a user-friendly language
5. Conversion of the simplified text into speech

1. Optical Character Recognition
Optical character recognition is a technology that converts images containing text into formats with editable text, allowing scanned documents, books, screenshots, and photos with text to be processed into editable TXT, DOC, or PDF files [11]. Each character in the document is scanned individually to obtain a well formatted editable document instead of a messy jpeg
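
The five-step methodology of Section IV can be pictured as a chain of interchangeable stages. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the names Pipeline, simplify, LEGAL_GLOSSARY and build_demo_pipeline are hypothetical, the glossary entries are illustrative and not taken from the paper's crafted dictionary, and the OCR, translation and text-to-speech stages are stand-in stubs rather than real engines. Only the dictionary-based simplification step does real work here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# Toy glossary of legal terms mapped to plainer wording.
# ASSUMPTION: these entries are illustrative only.
LEGAL_GLOSSARY: Dict[str, str] = {
    "hereinafter": "from now on",
    "lessee": "tenant",
    "lessor": "landlord",
    "indemnify": "compensate",
}


def simplify(text: str, glossary: Dict[str, str] = LEGAL_GLOSSARY) -> str:
    """Step 3: replace known legal terms (whole words) with plainer wording."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        plain = glossary.get(word.lower())
        if plain is None:
            return word  # not a known legal term, keep it unchanged
        # Keep an initial capital if the original word had one.
        return plain.capitalize() if word[0].isupper() else plain

    return re.sub(r"[A-Za-z]+", swap, text)


@dataclass
class Pipeline:
    """Steps 1 to 5 as pluggable callables (hypothetical structure)."""

    ocr: Callable[[bytes], str]               # step 1: image -> text
    to_english: Callable[[str], str]          # step 2: source language -> English
    simplify: Callable[[str], str]            # step 3: legal -> plain English
    to_user_language: Callable[[str], str]    # step 4: English -> user's language
    speak: Callable[[str], bytes]             # step 5: text -> audio bytes

    def run(self, image: bytes) -> bytes:
        text = self.ocr(image)
        english = self.to_english(text)
        simple = self.simplify(english)
        localized = self.to_user_language(simple)
        return self.speak(localized)


def build_demo_pipeline() -> Pipeline:
    """Wire the pipeline with stubs so the sketch runs without external engines."""
    return Pipeline(
        ocr=lambda image: image.decode("utf-8"),     # stub: pretend bytes are text
        to_english=lambda text: text,                # stub: input already English
        simplify=simplify,
        to_user_language=lambda text: text,          # stub: user prefers English
        speak=lambda text: text.encode("utf-8"),     # stub: "audio" is just bytes
    )


if __name__ == "__main__":
    demo = build_demo_pipeline()
    print(demo.run(b"The Lessee shall indemnify the lessor."))
```

Keeping each stage behind a plain callable means a real OCR engine, translation service or speech synthesizer could be swapped in for a stub without changing the order of the steps.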
