Automatic Learning Assistant in Telugu

Meghana Bommadi, Shreya Terupally, Radhika Mamidi
Language Technologies Research Centre, International Institute of Information Technology
[email protected], [email protected], [email protected]

Abstract

This paper presents a learning assistant that tests one's knowledge and gives feedback that helps a person learn at a faster pace. A learning assistant (based on automated question generation) has extensive uses in education, information websites, self-assessment, FAQs, testing ML agents, research, etc. Multiple researchers and companies have worked on virtual assistance, but majorly in English. We built our learning assistant to help with teaching in the mother tongue, which is the most efficient way of learning (Roshni, 2020; Nishanthi, 2020). Our system is built primarily on Question Generation in Telugu.

Many experiments have been conducted on Question Generation in English in multiple ways. We have built the first hybrid machine learning and rule-based solution in Telugu, which proves efficient for short stories or short passages in children's books. Our work covers the fundamental question forms with question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using Part of Speech (POS) tags and Universal Dependency (UD) tags along with linguistic information from the relevant context surrounding the word. Our system is primarily built on question generation in Telugu, and it is also capable of evaluating the user's answers to the generated questions.

1 Introduction

Research on Virtual Assistants is widespread, since they are being widely used in recent times for numerous tasks. These assistants are built using large datasets and high-end Natural Language Understanding (NLU) and Natural Language Generation (NLG) tools. NLU and NLG are used in interactive NLP applications such as AI-based dialogue systems/voice assistants like Siri, Google Assistant, Alexa, and similar personal assistants. Research is still going on to make these assistants work in major Indian languages as well.

An automated learning assistant like our system is useful not only for the learning process of humans but also for machines in the process of testing ML systems (Hidenobu Kunichika, 2004). Research has been done on Question Answer generating systems in English (Maria Chinkina, 2017), concentrating on basic Wh-questions with a rule-based approach (Payal Khullar), question-template-based approaches (Hafedh Hussein, 2014), etc. For a low-resourced language like Telugu, a completely AI-based solution can be non-viable: there are hardly any datasets available for the system to produce significant accuracy. A completely rule-based system, on the other hand, might leave out principal parts of the abstract, and there is a chance that not all questions can be captured inclusively by completely handwritten rules. Hence, we introduce a mixed rule-based and AI-based solution to this problem.

Our system works in the following three crucial steps:

1. Summarization
2. Question Generation
3. Answer Evaluation

We implemented summarization using two techniques, viz. Word Frequency (see 4.1) and TextRank (see 4.2), which are explained further in section 4.


We attempted to produce questions concentrating on the critical points of a text that are generally asked in assessment tests. Questions posed to an individual challenge their knowledge and understanding of specific topics, so we formed questions on each sentence in as many ways as possible. We based this model on children's stories, so the questions we aim to produce are simpler and more objective.

Based on the observation of the chosen data and an analysis of all the possible cases, we developed a set of rules for each part of speech that can be formed into a question word in Telugu. We maximized the possible number of questions in each sentence with all the keywords. We built rules for question generation based on POS tags, UD tags and the information surrounding the word, which is comparable with Vibhaktis (case markers) in Telugu grammar.

The question generation is manually evaluated, and the detailed error analysis is given in section 8.1. Our Learning Assistant evaluates answers using string matching and keyword matching for Telugu answers, and a pre-trained sentence transformer model using XLM-R (Nils Reimers, 2019) for answers in other languages.

2 Related Work

Previously, Holy Lovenia, Felix Limanta et al. [2018] (Holy Lovenia, 2018) experimented on Q&A pair generation in English, where they succeeded in forming What, Who, and Where questions. Rami Reddy et al. [2006] (Rami Reddy Nandi Reddy, 2006) worked on a dialogue-based Question Answering system in Telugu for railway inquiries, which majorly concentrated on Answer Generation for a given query. Similar work was done by (Hoojung Chung) on a practical question answering system in a restricted domain. Shudipta Sharma et al. [2018] (Shudipta Sharma) worked on automatic Q&A pair generation for English and Bengali texts using NLP tasks like verb decomposition and subject-auxiliary inversion for question tags.

3 Dataset

We have used a Telugu stories dataset taken from the website "kathalu wordpress" (https://kathalu.wordpress.com/). This dataset was chosen because of the variety in the themes of the stories, the wide vocabulary, and sentences of varying lengths.

1. Number of stories: 21
2. Average number of sentences: 56
3. Average number of words: 281
4. Genre of the stories: Moral Stories for Children

For testing, we used stories by Prof. N. Lakshmi Aiyar:

1. Number of stories: 5
2. Average number of sentences: 190
3. Average number of words: 1060
4. Genre of the stories: Realistic Fiction

4 Summarization

Since Telugu is a low resource language, we used statistical and unsupervised methods for this task. Summarization also ensures the portability of our system to other similar low resource languages.

For summarization, we did basic data preprocessing (spaces, special characters, etc.) in addition to root-word extraction using Siva Reddy's POS tagger (http://sivareddy.in/downloads).

We used two types of existing summarization techniques:

1. Word Frequency-based summarization
2. TextRank-based summarization

4.1 Word Frequency-based Summarization

WFBS (Word Frequency-based Summarization) scores sentences using the word frequencies in the passage (Ani Nenkova; Mr. Shubham Bhosale). This process is based on the idea that the keywords or main words will appear frequently in the text, and words with lower frequency have a high probability of being less related to the story.

All the sentences that carry crucial information are produced successfully by this method, because in children's stories the keywords are used repeatedly and consequently have the highest frequency.

We used a dynamic ratio (a ratio that can be changed or chosen by the user as input) for getting the desired amount of summary (a short summary or a longer summary; for example, given a ratio of k%, the system outputs the k% of sentences containing the most frequent words from the dictionary). This ratio, when dynamically changed, performed better than a fixed ratio of word selection.

The steps followed in WFBS are:

1. Sentences are extracted from the input file.
2. The file is preprocessed and the words are tokenized.
3. Stop words are removed.
4. The frequency of each word is calculated and stored in dictionaries.
5. The sentences with the least frequent words are removed.
6. The ratio of words that occur in highest-to-lowest frequency order is calculated.
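The snippet below is a minimal Python sketch of this frequency-based selection. It is an illustration only: it uses a whitespace tokenizer, a placeholder stop-word list, and a naive sentence splitter instead of the POS-tagger-based root-word extraction used in our pipeline, and all function names are ours.

```python
import re
from collections import Counter

# Placeholder stop-word list; the real system removes Telugu stop words
# and works on lemmas (root words) produced by the POS tagger.
STOP_WORDS = {"మరియు", "ఒక", "ఆ", "ఈ"}

def split_sentences(text):
    # Naive split on sentence-final punctuation, for illustration only.
    return [s.strip() for s in re.split(r"[.।?!]", text) if s.strip()]

def word_frequency_summary(text, ratio=0.3):
    """Return roughly `ratio` of the sentences, scored by word frequency."""
    sentences = split_sentences(text)
    words = [w for s in sentences for w in s.split() if w not in STOP_WORDS]
    freq = Counter(words)

    # Score each sentence by the total frequency of its (non-stop) words.
    scores = [
        sum(freq[w] for w in s.split() if w not in STOP_WORDS)
        for s in sentences
    ]

    # Keep the top `ratio` of sentences, preserving the original order.
    k = max(1, int(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```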

4.2 TextRank-based Summarization

TextRank is a graph-based ranking model (Joshi, 2018; Liang, 2019) that prioritizes each element based on the values in the graph. This process is done in the following steps:

1. A graph is constructed using each sentence as a node, using the frequency of words.
2. The similarity between two nodes is marked as the edge weight between them.
3. Each sentence is ranked based on its similarity with the whole text.
4. The PageRank algorithm is run until convergence.
5. The sentences with the top N rankings are given as the output summarized text.

The TextRank algorithm is a graph-based method that updates the sentence score WS iteratively using the following equation (1):

WS(V_i) = (1 - d) + d * \sum_{V_j \in In(V_i)} \frac{w_{ij}}{\sum_{V_k \in Out(V_j)} w_{jk}} * WS(V_j)    (1)

where d is the damping factor (0.85) and w_{ij} is the similarity measure between the i-th and j-th sentences.

This method has the advantage of ranking the sentences using the similarity between two sentences instead of high-frequency words.

We used two kinds of similarity measures for the TextRank-based summarization:

1. Common words: A measure of similarity based on the number of common words in two sentences after removing stop words. We used root-word extraction of the common words for better results, since Telugu is a fusional and agglutinative language and repeats words with a different suffix each time.

2. Best Match 25 (BM25): A measure of the similarity between two passages based on term frequencies in the passage (Federico Barrios, 2016).

The results observed by this method capture crucial information of the story, but lower readability and fluency were observed. Between the similarity measures, BM25 has shown slightly better results, since the BM25 algorithm ranks sentences based on the importance of particular words (inverse document frequency, IDF) instead of just raw word frequency.
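A compact Python sketch of this iteration is given below. It uses the common-word similarity described above and the update in equation (1); the tokenization, stop-word handling, the fixed iteration count (instead of a convergence check), and the function names are simplifying assumptions rather than the exact implementation.

```python
def common_word_similarity(s1, s2, stop_words=frozenset()):
    """Similarity = number of shared (non-stop) words between two sentences."""
    w1 = {w for w in s1.split() if w not in stop_words}
    w2 = {w for w in s2.split() if w not in stop_words}
    return float(len(w1 & w2))

def textrank(sentences, d=0.85, iterations=50, top_n=3):
    """Rank sentences with the iterative update of equation (1)."""
    n = len(sentences)
    # w[i][j]: edge weight between sentence i and sentence j.
    w = [[common_word_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]          # denominator of equation (1)
    ws = [1.0] * n                             # initial sentence scores

    for _ in range(iterations):
        ws = [
            (1 - d) + d * sum(
                (w[j][i] / out_sum[j]) * ws[j]
                for j in range(n) if j != i and out_sum[j] > 0
            )
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: ws[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]
```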

5 Answer Phrase Selection

Candidate answers are words/phrases that depict some vital information in a sentence. Adjectives, adverbs, and the subject of a sentence are some examples of such candidates.

The answer selection module utilizes two main NLP components - POS tagging (Part of Speech tagging) and UD parsing (Universal Dependency parsing) - along with language-specific rules to determine the answer words in an input sentence.

5.1 POS Tagging

We followed the state-of-the-art method by Siva Reddy et al. (2011) (Siva Reddy, 2011), "Cross-Language POS Taggers", an implementation of a TnT-based Telugu POS tagger (https://bitbucket.org/sivareddyg/telugu-part-of-speech-tagger/src/master/), to parse our data. The tagger learns morphological analysis and POS tags at the same time, and outputs the lemma (root word), POS tag, suffix, gender, number and case marker for each word. The model was pre-trained on a Telugu corpus containing approximately 3.5 million tokens and had an evaluation accuracy of 90.73% for the main POS tag.
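For the rules that follow, it helps to picture the tagger's per-token output as a small record. The sketch below is our own illustrative container; the field names and the example values are assumptions, not the tagger's actual output format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenAnalysis:
    """One word of a sentence with the analysis described in section 5.1."""
    form: str                 # surface form as written in the story
    lemma: str                # root word
    pos: str                  # main POS tag, e.g. "NN", "JJ", "V"
    suffix: Optional[str]     # case marker / suffix, e.g. "ki", "lO", "gA"
    gender: Optional[str]
    number: Optional[str]

# Hypothetical analysis of a single token, shown in transliteration.
token = TokenAnalysis(form="rAmuDiki", lemma="rAmuDu", pos="NN",
                      suffix="ki", gender="m", number="sg")
```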

5.2 UD Tagging

A Bi-LSTM model using Keras is structured and trained on the Telugu UD tags dataset UD_Telugu-MTG (https://github.com/UniversalDependencies/UD_Telugu-MTG) (Bogdani, 2018). The Bi-LSTM model outputs the UD tags for each word in a sentence. We considered the subject, which is marked "subj" by the UD tagger, as the selected answer phrase for a sentence, on the condition that the root and the punctuation are marked correctly.

This model gave 85% accurate results, including the PAD (padding) tags, which might not be an adequate result; but given these conditions, and given that the "subj" tag is labeled only scarcely in a sentence, the results have been considered acceptable.
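A minimal Keras sketch of such a sequence tagger is shown below. The vocabulary size, tag count, sequence length, and other hyperparameters are placeholders, not the values used for UD_Telugu-MTG.

```python
from tensorflow.keras import layers, models

# Placeholder sizes; the real values depend on the UD_Telugu-MTG preprocessing.
VOCAB_SIZE, N_UD_TAGS, MAX_LEN = 20000, 18, 60

def build_ud_tagger():
    """Bi-LSTM that predicts one UD tag per input token (padded sequences)."""
    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(N_UD_TAGS, activation="softmax")),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# build_ud_tagger().fit(...) would then be trained on word-index sequences
# and per-token UD tag indices derived from the treebank.
```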
5.3 Rules

The outputs of the POS tagging and UD parsing modules are used as the crucial markers in our language-specific rules. In addition to conditions based on word surroundings, these tags select one or more answer phrases in each sentence. We classify the rules into different categories, typically based on their usage and interrogative forms.

1. Quantifiers, Adjectives, Adverbs: Words with the QC, RB, and JJ POS tags, respectively. For words with JJ tags, the word and the corresponding determiners (if present) are selected as the answer candidate.

2. Possession based: Words with PRP and NN tags that have the suffixes "簿", "뱊క呍", "吿" and "呁" ("Ti", "yokka", "ki" and "ku"). The suffix "簿" ("Ti") is used for words like "అతꀿ" ("atani" - his), "퐾ళ챍" ("vAlla" - theirs), "కం簿" ("kanTi" - eyes'), "퐿頾뱍쁁鱍 ల" ("vidyArthula" - students').

3. Time-Place based: Noun words with a "졊" ("lO") suffix, along with other words present in a custom list of time-related words; "렾쀿ꁍం屍", "ఇయ쁍" ("morning", "year") come under this category.

4. Direct and Reported Speech: The word "అꀿ" ("ani") is generally used to denote direct speech in Telugu. Phrases before the word "అꀿ", along with phrases in quotation marks, are chosen as answer phrases.

5. Verbs: Telugu generally follows the SOV (Subject Object Verb) structure. If the last word of a sentence has a "V" POS tag, then we select the verb and the adjacent adverbs as an answer candidate.

6. Subject: We use the UD tags to determine the subject of a sentence. As an additional check, we only select the candidate subjects in those sentences whose last word is tagged as the root verb and whose subject is a noun.
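The sketch below shows how two of these rules (possession and verb) can be expressed over per-token POS/suffix information. It is a simplified illustration in transliteration; the predicate names and the token representation are ours, and the real rules also consult UD tags and the surrounding context.

```python
# Each token: (form, pos, suffix) in transliteration, e.g. ("rAmuDiki", "NN", "ki").
POSSESSION_SUFFIXES = {"Ti", "yokka", "ki", "ku"}

def possession_candidates(tokens):
    """Rule 2: pronouns/nouns carrying a possessive or dative suffix."""
    return [form for form, pos, suffix in tokens
            if pos in {"PRP", "NN"} and suffix in POSSESSION_SUFFIXES]

def verb_candidate(tokens):
    """Rule 5: if the sentence-final word is a verb (SOV order),
    pick it together with an immediately preceding adverb."""
    if not tokens or not tokens[-1][1].startswith("V"):
        return None
    phrase = [tokens[-1][0]]
    if len(tokens) > 1 and tokens[-2][1] == "RB":
        phrase.insert(0, tokens[-2][0])
    return " ".join(phrase)

# Toy transliterated sentence with assumed tags ("Sita gave a book to Ram").
sentence = [("sIta", "NN", None), ("rAmuDiki", "NN", "ki"),
            ("pustakaM", "NN", None), ("iccinadi", "V", None)]
print(possession_candidates(sentence))   # ['rAmuDiki']
print(verb_candidate(sentence))          # 'iccinadi'
```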

6 Question Formation

Questions are formed according to the phrases chosen previously, and the question words are substituted using further conditions where required.

1. Quantifiers, Adjectives, Adverbs: Words marked with the JJ POS tag are replaced with "ఎ籁వం簿" ("eTuvanti" - what kind of). RB-tagged words that are followed by verbs and carry a "尾" ("gA") suffix are replaced by "ఎ젾" ("elA" - how), and QC-tagged words that are not articles ("ఒక", "oka" - one/once) are chosen and changed based on the following word. If the quantifier is followed by "�తం", "మం頿", "వర呁" ("shAtam", "maMdi", "varaku"), then the word is replaced with "ఎంత" ("eMta" - how much); if the quantifier has a suffix, it is added to the question word. For example: "1700呁" becomes "ఎంత呁" ("eMtaku"), and the rest of the quantifiers, like "ఐ顁 ꠿桁桍క졁" (meaning five sparrows), are replaced with "ఎꀿꁍ" ("enni" - how many) ("ఎꀿꁍ ꠿桁桍క졁" - how many sparrows - in this case).

2. Possession based: The nouns and pronouns that satisfied the rules are replaced with "ఎవ쀿" ("evari" - whose), and the dative cases are replaced with "ఎవ쀿吿" ("evariki" - to whom). This could be an exception for non-human nouns and pronouns; in the children's stories, most of the nouns are personified, so there were fewer errors than we presumed.

For example: A sentence with a phrase like "쀾롁萿 ఇ졁졍 ..." (Ram's house...) would form a question like "ఎవ쀿 ఇ졁졍 .." (whose house..).

3. Time-Place based: We made a list of words that are used to convey time. If the lemma of the word matches a word in this dictionary, then we mark it as "time" and replace it with "ఎపుప్葁" ("eppuDu" - when); otherwise it is marked as a place and replaced with "ఎక呍డ" ("ekkaDa" - where).

For example: A sentence with the phrase "쁇ꡁ వ遍葁" (he will come tomorrow) will form the question "ఎపుప్葁遍 వ葁?" (when will he come?).

4. Direct and Reported Speech: The whole speech phrase, or the phrase that is quoted, is replaced with "ఏమꀿ" ("Emani") in the sentence.

For example: A phrase in quotes in a sentence like 顁쁋뱍ధꁁ葁 "ఏమం簿퐿 ఏమం簿퐿..!" అꀿ అꀾꁍ葁. (Duryodhana said, "What did you say..!") would form a question like 顁쁋뱍ధꁁ葁 ఏమꀿ అꀾꁍ葁? (What did Duryodhana say?).

5. Verbs: The verb is replaced with "ఏ렿 桇遍" ("Emi cEstU" - doing what) plus the appropriate suffix, which is chosen from the information lost in the lemmatized word.

Additionally, the verb tags were used to form polar questions. The interrogative form of a sentence in Telugu can be constructed by adding intonation to the verb, so we add the vowel "ఆ" ("A") at the end of the verb to make a yes/no question. The answer phrase to this question is "అ푁ꁁ" ("avunu" - yes), followed by the original phrase.

For example: A sentence with a verbal phrase like "త 푆챁遂 ఉం頿" (Sita is going) will form the question "త ఏ렿 桇遍 ఉం頿?" (What is Sita doing?).

6. Subject: Based on the suffix of the verb, the subject is replaced with "ఏ頿", "ఏ퐿" ("Edi", "Evi") or "顇ꀿ", "푇簿吿" ("dEni", "vETiki") (meaning what, which, respectively), or with "ఎవ쁁" ("evaru" - who) if the subject has a gender and is marked as human in the POS tags; the root suffix is changed accordingly for "ఎవ쁁" ("evaru" - who (honorific)).

For example: "గంగ అక呍萿 ꁁం栿 푆찿졍 ꡋ밿ం頿." (Ganga left from that place) forms the question "ఎవ쁁 అక呍萿 ꁁం栿 푆찿졍 ꡋ밾쁁?" (Who left from there?).
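The substitution itself can be pictured as a small lookup from the selected answer phrase's tag and marker to a question word, followed by a string replacement. The sketch below is written in transliteration and covers only a few of the rules above; the table contents and function names are illustrative assumptions.

```python
# Question word chosen from the POS tag (and marker) of the answer phrase,
# in transliteration: JJ -> eTuvaMTi, RB+gA -> elA, time -> eppuDu, place -> ekkaDa.
QUESTION_WORDS = {
    ("JJ", None): "eTuvaMTi",
    ("RB", "gA"): "elA",
    ("NN", "time"): "eppuDu",
    ("NN", "place"): "ekkaDa",
}

def form_question(sentence, answer_phrase, pos, marker=None):
    """Replace the selected answer phrase with its question word."""
    q_word = QUESTION_WORDS.get((pos, marker))
    if q_word is None:
        return None
    return sentence.replace(answer_phrase, q_word) + "?"

# Toy example (transliterated): "rEpu vastADu" -> "eppuDu vastADu?"
print(form_question("rEpu vastADu", "rEpu", "NN", "time"))
```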
7 Answer Evaluation

The user's answer to a generated question is evaluated in one of two ways, depending on the form of the input:

1. Telugu Answer Evaluation
2. Multilingual Answer Evaluation

7.1 Telugu Answer Evaluation

A string input in Telugu is taken from the user, and string matching is done for the whole sentence against the answer phrase stored during Question and Answer Pair Generation. The answer can be either in sentence form or in a phrasal form that contains the keywords on which the question was formed.

7.2 Multilingual Answer Evaluation

7.2.1 Sentence Transformers

Similar to word embeddings, where the learned representations of the same words are similar, sentence embeddings (Nikhil, 2017) map the semantic information of sentences into vectors. Multilingual sentence embeddings deal with sentences in multiple languages, which are mapped into a closer vector space if they have similar meanings.

Sentence Transformers are multilingual sentence embeddings (Ivana Kvapilíková, 2020; Mikel Artetxe, 2019) built using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch (Reimers, 2021; Horev, 2018; Ferreira, 2020). This framework provides an easy way of computing dense vector representations of sentences in multiple languages. They are called sentence transformers since the models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa, etc.

We use a pre-trained sentence transformer (Nils Reimers, 2019) based cross-lingual sentence embedding system which can take a sentence in a language and create an embedding in a multilingual space. The answer phrases and sentences are stored in a dictionary. The answers in a different language are taken as input, projected into the multilingual space, and the similarity is checked using cosine similarity with the stored answer phrase in Telugu.
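A minimal sketch of this two-branch evaluation is shown below, using the sentence-transformers library. The particular pre-trained checkpoint, the similarity threshold, and the language check are assumptions for illustration; the paper's system uses an XLM-R-based pre-trained sentence transformer (Nils Reimers, 2019).

```python
from sentence_transformers import SentenceTransformer, util

# Assumed multilingual checkpoint; any XLM-R-based sentence transformer works similarly.
model = SentenceTransformer("stsb-xlm-r-multilingual")

def is_telugu(text):
    # Telugu Unicode block: U+0C00 - U+0C7F.
    return any("\u0c00" <= ch <= "\u0c7f" for ch in text)

def evaluate_answer(user_answer, stored_telugu_answer, threshold=0.6):
    """Keyword/string match for Telugu input, cosine similarity otherwise."""
    if is_telugu(user_answer):
        gold_words = set(stored_telugu_answer.split())
        given_words = set(user_answer.split())
        return len(gold_words & given_words) > 0   # keyword overlap
    # Multilingual branch: embed both answers and compare in the shared space.
    emb = model.encode([user_answer, stored_telugu_answer], convert_to_tensor=True)
    score = util.pytorch_cos_sim(emb[0], emb[1]).item()
    return score >= threshold
```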

In the final system, we used syntax matching to mark the user's answer if the input is in Telugu, and used sentence transformers if the input is in any other language.

8 Results

We obtained results that resemble commonly used questions, covering nine POS and UD tags. The questions generated by this system are successful and are most similar to the academic questions we see in textbooks. We did manual error analysis for the question and answer pairs generated. In most cases, the system produced legible results that resemble human-made questions, but there were errors in a few complex sentences. Out of the 916 questions formed, only 34 were either completely erroneous or illegible. The rest were both grammatically correct and significant for the context of the story. The system successfully obtained all possible questions for each simple sentence, not requiring further linguistic analysis.

Table 1 lists the number of times each question word occurred and the number of times it appeared wrong in the experiment with five stories. Table 2 in the appendix shows sample questions and answers generated by the system for children's stories.

Question word            Occurrences   Errors
ఎ젾 (elA)                 64            2
ఎꀿꁍ (enni)                76            5
ఎంత呁 (eMtaku)             4             0
ఎంత (eMta)                3             0
ఎవ쀿 (evari)               187           0
ఎవ쀿吿 (evariki)            1             0
ఏ렿 (Emi)                  69            3
顇ꀿ (dEni)                 45            10
ఎవ쁁 (evaru)               20            0
ఎపుప్葁 (eppuDu)            7             0
ఎక呍డ (ekkaDa)             21            5
ఏ렿 桇遍 (Emi cEstU)        148           2
ఏమꀿ (Emani)               10            0
ఆ (A)                     148           0
ఎ籁వం簿 (eTuvaMTi)          103           6
푇簿吿 (vETiki)              10            1

Table 1: Question types, with the number of occurrences and errors per question word.

8.1 Question Generation Error Analysis

The question generation of the system was manually annotated by two human evaluators with a Computational Linguistics background. The guidelines given to the evaluators were:

• Questions with grammatical mistakes are marked as errors.
• Questions with semantic errors are marked as errors.
• Questions that are highly irrelevant to the story are marked as errors.

Errors are equally influenced by the word tags, the context of the word, and the word's position in a sentence. We analysed each and every way the errors occurred and could occur.

Errors in "elA" ('how') questions are often caused by spaces between the words and suffixes in the dataset we chose.

"enni" (quantifier-based) questions are built from diverse quantifiers (for example: time, age, number of people - these quantifiers are often written as sandhi with the word, which causes the POS tagger to give ambiguous tags) and from the numerous ways of writing quantifiers in Telugu. A few quantifier question-word errors occurred due to wrong POS tagging of cross-coded words (words that are actually in English but written in Telugu script). In Telugu, two numbers are used together to represent non-specific quantities between the two numbers (x y means from x to y); for example, "reMDu (two) mUDu (three) nimishAlu (minutes)" means two to three minutes. This kind of representation makes the system assume there are two quantifiers, and the sentence becomes eligible for two questions based on them.

"dEni" (subject-based) questions have errors because of ambiguous suffixes and inaccuracies in UD tagging. The lack of human identification in the system made human subjects also replaceable with "dEnini" instead of "evarini". Another error was due to subjects that were nominal (names) with end syllables similar to common suffixes (which are included as word context in the rule formation).
These names were split and formed incorrect question words. For example, the name "Shalini" was converted to an interrogative form as "dEnini". The rest of the errors are due to wrong POS tags, cross-coded words, and initials/abbreviations.

"Emi" ('what') question forms also have similar POS tag and cross-coding issues. A few of these errors occurred due to punctuation marks inside a sentence, breaking it up into multiple sentences.

"eTuvaMTi" ('what-kind-of') question forms run into issues where there is personification. General questions based on adjectives for humans are based on a person's subtle qualities; however, in a few cases the adjective that was chosen is inapt to be formed into a question (less similar to a human-made question). The question that was formed was still grammatically correct for both human and non-human subjects; nevertheless, it is more suitable and precise for a non-human noun. For example: "ఎ젾ం簿 �젿ꀿ" (what kind of Shalini) versus "ప쀿చయమౖెన �젿ꀿ" (the Shalini that I know).

"ekkaDa" ('where') based question forms show errors when an abstract word is used as a place, for example "in thoughts", "in that age". Certain quantitative words in Telugu can be appended with -lO to convey meanings like "in youth", "in hundreds"; they tend to pass the rules in question generation. Our list of time-related words is not exhaustive, so a few time-related words are also tagged under "ekkaDa" (place) because of the same suffix.

Most of the tags are error free except for a few ambiguous errors, since the rules either select answer phrases precisely or do not consider them. Some examples of the questions produced by the system are listed in Table 2 in the appendix. The results can be improved, making the question formation more precise, by increasing the number of rules after observing further data.

Anaphora resolution is a limitation of this system; thus, most of the inappropriate phrasing in the answer section was caused by this.

For example:

Q: ఎవ쀿 చ顁వం逾 籀졋 , ద쀾灍 尾 ... 尿ం頿?
Q: Whose studies got completed in the city luxuriously?
A: ꁀ చ顁వం逾 籀졋 , ద쀾灍 尾 ... 尿ం頿 .
A: Your studies got completed in the city luxuriously.

In this case the question is aptly formed, but the answer is slightly ill-formed.

There were a few errors due to the POS tagger we used: it marked wrong POS tags for cross-coded text.

For example:

Q: ꁀలం 呁렾వ遍, ఎꀿꁍ?
Q: Neelam Kumawath, how many?
A: ꁀలం 呁렾వ遍 , ఐ .
A: Neelam Kumawath, I.

The error in this question and answer pair is the 'I': the initial "ఐ" (in "Neelam Kumavat, I") is marked as a number.

9 Conclusions

We have built a mixed rule-based and AI-based question and answer generating system with 96.28% accuracy. We used two methods for summarization and two similarity measures. We constructed observation-based rules for the dataset in a particular domain. There is a chance of varying results if we test this system on data in a different domain, but it gives accuracies above 95% for any data in the chosen domain.

We also tested question generation in the news article domain, which gave grammatically correct questions. The error rate may increase if we use complex words and phrases that need tags beyond the proposed set of rules.

We plan to extend our work to include:

1. Anaphora Resolution
2. Extending to other domains
3. Covering more types of questions
4. Improving the UD tagging model

For testing the meticulousness of the user, as a future task, we wish to use:

1. Questions on minor details
2. NE (Named Entities) and CN (Common Nouns)

References

Ani Nenkova and Lucy Vanderwende. The impact of frequency on summarization.

Bogdani. 2018. Parts of speech tagging using LSTM (Long Short Term Memory) neural networks.

Federico Barrios, Federico López, Luis Argerich, and Rosita Wachenchauzer. 2016. Variations of the similarity function of TextRank for automated summarization.

Diogo Ferreira. 2020. What are sentence embeddings and why are they useful?

Hafedh Hussein, Mohammed Elmogy, and Shawkat Guirguis. 2014. Automatic English question generation system based on template driven scheme.

Hidenobu Kunichika, Tomoki Katayama, Tsukasa Hirashima, and Akira Takeuchi. 2004. Automated question generation methods for intelligent English learning systems and its evaluation.

Holy Lovenia, Felix Limanta, and Agus Gunawan. 2018. Automatic question-answer pairs generation from text.

Hoojung Chung, Young-In Song, Kyoung-Soo Han, Do-Sang Yoon, Joo-Young Lee, and Hae-Chang Rim. A practical QA system in restricted domains.

Rani Horev. 2018. BERT explained: State of the art language model for NLP.

Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Ondřej Bojar. 2020. Unsupervised multilingual sentence embeddings for parallel corpus mining.

Prateek Joshi. 2018. An introduction to text summarization using the TextRank algorithm (with Python implementation).

Xu Liang. 2019. Understand TextRank for keyword extraction by Python: A scratch implementation by Python and spaCy to help you understand PageRank and TextRank for keyword extraction.

Maria Chinkina and Detmar Meurers. 2017. Question generation for language learning: From ensuring texts are read to supporting learning.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond.

Mr. Shubham Bhosale, Ms. Diksha Joshi, Ms. Vrushali Bhise, and Rushali A. Deshmukh. Automatic text summarization based on frequency count for Marathi e-newspaper.

Nishant Nikhil. 2017. Sentence embedding: A literature review - Towards Data Science.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks.

Rajathurai Nishanthi. 2020. Understanding of the importance of mother tongue learning.

Payal Khullar, Konigari Rachna, Mukul Hase, and Manish Shrivastava. Automatic question generation using relative pronouns and adverbs.

Rami Reddy Nandi Reddy and Sivaji Bandyopadhyay. 2006. Dialogue based question answering system in Telugu.

Nils Reimers. 2021. SentenceTransformers documentation - Sentence-BERT: Sentence embeddings using Siamese BERT-networks.

Roshni. 2020. 8 reasons why the NEP's move to teaching in mother tongue could transform teaching and learning in India.

Shudipta Sharma, Muhammad Kamal Hossen, Md. Sajjatul Islam, Md. Shahnur Azad Chowdhury, and Md. Jiabul Hoque. Automatic question and answer generation from Bengali and English texts.

Siva Reddy and Serge Sharoff. 2011. Cross language POS taggers (and other tools) for Indian languages: An experiment with using Telugu resources.

10 Appendix

Q: ఎ籁వం簿 롋ట 運 వంగడం కష籍 ం尾 푁ం頿?
A: అంత ꡆద額 롋ట 運 వంగడం కష籍 ం尾 푁ం頿

Q: 框పుప్졁籍 , బట졁 , 尾灁졁 , ప챁챍 , 尿ꁆꁍ졁 బ瀾쁁졋 ఎ젾 告ꀿ , ఊ챋챍 ఇం簿ం簿吿 푆찿졍 అ롁롍呁ꁇ 퐾葁?
A: 框పుప్졁籍 , బట졁 , 尾灁졁 , ప챁챍 , 尿ꁆꁍ졁 బ瀾쁁졋 చవక尾 告ꀿ , ఊ챋챍 ఇం簿ం簿吿 푆찿졍 అ롁롍呁ꁇ 퐾葁

Q: 렾న졍 ꁀꁍ 롋ట క簿籍 , 尾萿ద 례ద 푇 , బ瀾쁁 ꁁం栿 ఊ챋챍 , ఊ챋졍 ꁁం栿 逿쀿尿 ఎవ쀿 ఇం簿吿 逿ꡇꡍ퐾葁?
A: 렾న졍 ꁀꁍ 롋ట క簿籍 , 尾萿ద 례ద 푇 , బ瀾쁁 ꁁం栿 ఊ챋챍 , ఊ챋졍 ꁁం栿 逿쀿尿 అతꀿ ఇం簿吿 逿ꡇꡍ퐾葁

Q: అ렾యక ꠿桁క ఎక呍డ吿, ఎం顁呁 అꀿ అడగ呁ం萾, ఆ 吾呁లꁁ 屁萿葍 尾 న렿롍 ఏ렿 桇ం頿?
A: అ렾యక ꠿桁క ఎక呍డ吿, ఎం顁呁 అꀿ అడగ呁ం萾, ఆ 吾呁లꁁ 屁萿尾 న렿롍 퐾簿運 葍 푆찿챍ం頿.

Q: ꠿桁క 렾ట నమ롍졇顁 క頾 , 頾ꀿ వౖెꡁ అసహ뱍ం尾 桂 మ쁋 ఎꀿꁍ 顆బ끍졁 푇쁁?
A: ꠿桁క 렾ట నమ롍졇顁 క頾 , 頾ꀿ వౖెꡁ అసహ뱍ం尾 桂 మ쁋 쁆ం葁 顆బ끍졁 푇쁁

Q: ఆ 吾呁ల運 ꠿桁క吿 ꁍహం అ밿뱍ం頾?
A: అ푁ꁁ, ఆ 吾呁ల運 ꠿桁క吿 ꁍహం అ밿뱍ం頿.

Q: ఒ吾ꁊకపుప్葁 ఎక呍డ ఒక అ렾యకꡁ ꠿桁క 푁ం葇頿?
A: ఒ吾ꁊకపుప్葁 ఒక ఊ쀿졋 ఒక అ렾యకꡁ ꠿桁క 푁ం葇頿.

Q: ఏమꀿ ꠿桁క ꠾쁍 鱇య ప萿ం頿?
A: 뀾끋뱍! 뀾끋뱍! ꀾ తꡇꡍ례 졇顁, ꁇꁁ అ렾య呁쀾젿ꀿ, ꁇꁇ례 桇య졇顁, నꁁꁍ వ頿졇యం萿! అꀿ ꠿桁క ꠾쁍 鱇య ప萿ం頿.

Table 2: Sample questions and answers generated by the system.

Q: What kind of sack was hard to carry?
A: That much of a heavy sack was hard to carry.

Q: In the market how was he buying sandals, clothes, bangles, fruits, utensils - and sold them in the village?
A: In the market he was buying sandals, clothes, bangles, fruits, utensils for cheap rates and sold them in the village.

Q: Packing all the things, putting them on the donkey, from market to village, from village to whose house was he taking them?
A: Packing all the things, putting them on the donkey, from market to village, from village to his own house he was taking them.

Q: How did the innocent sparrow believe the crows without even asking why and where?
A: The innocent sparrow believed the crows blindly without even asking why and where.

Q: Instead of believing the sparrow, looking at it with disgust how many times did they beat it?
A: Instead of believing the sparrow, looking at it with disgust they beat it 2 times.

Q: Did the sparrow make friends with the crows?
A: Yes, the sparrow made friends with the crows.

Q: Once upon a time where was the innocent sparrow living?
A: Once upon a time the innocent sparrow was living in a village.

Q: What did the sparrow say pleadingly?
A: The sparrow said pleadingly, "No! no! I didn't do any mistake, I'm innocent, I did nothing, please leave me."

Table 4: Translations of the questions and answers in Table 2.

List of words related to time:
'అపుప్葁', '쁋灁', '吾లం', 'యం吾లం', 'ఉదయం', 'మ鰾뱍హꁍం', '쀾逿쁍', 'పగ졁', 'ꁆల', '퐾రం', 'సంవతరం', '쀾뱍సమయం',遍 '�둋దయం', '頿నం', 'సమయం', 'వర렾నం'遍, 'ꡂర푍ం', 'భ퐿ష뱍遁遍', 'మ퐾రం', 'మంగళ퐾రం', '끁ధ퐾రం', '屁쁁퐾రం', '�క쁍 퐾రం', 'శꀿ퐾రం', 'ఆ頿퐾రం', '렾సం'

Translations: then, day, time period, evening, morning, afternoon, night, morning (synonym), month, week, year, sunset, sunrise, day (synonym), time, present, past, future, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, month (synonym).

Table 3: This set comprises the time-related words that have a high chance of being used in a storybook.
