
School of Sanskrit and Indic Studies, J.N.U., New Delhi

Computational Linguistics, AI and Sanskrit

Girish Nath Jha
Professor of Computational Linguistics, School of Sanskrit and Indic Studies, JNU
Concurrent Faculty, Center for Linguistics, SLL&CS
Concurrent Faculty, Special Center of E-Learning (SCEL)
Associated Faculty, Atal Bihari Vajpayee School of Management and Entrepreneurship
JNU, New Delhi - 110067

2/28/2021 Intro to coling, 2020-21 1


Computational Linguistics (CL) /Natural Language Processing (NLP)


- Interdisciplinary field: linguistics, computer science, AI, psychology, philosophy, logic, cognitive science, ... wherever language applies
- Develops formal and computable models of natural languages
- Models human languages using rule-based or statistical methods from a computational perspective
- Applies computational approaches and methods to relevant linguistic questions


Its relation with Intelligent computing


Background (adapted from Dafydd Gibbon (2013))

40s  encryption, decryption, neural automata, neural networks, neuro-linguistics 50s  Machine Translation, dictionaries, text utilities (concordances) 60s  Theoretical informatics, complexity, natural language parsing, speech 70s  psycholinguistic interpretations of parsers/ generators 80s-90s  logic, inference, unification, NLIs, bi/multi-modal interfaces 2000-2010  Web, resources, big data Future  ??? 2/28/2021 Intro to coling, 2020-21 5 School of Sanskrit and Indic Studies, J.N.U., New Delhi

Its relation with AI


- CL/NLP
- Inference engines
- Expert Systems
- Intelligent Tutoring Systems
- Vision Machines
- Robotics
- Today AI is synonymous with Machine Learning
- AI can only happen if machines understand natural language texts

Its relation with HCI – Human Computer Interaction

Human Computer Interaction (HCI) and NLP

- Conventional HCI
- Intelligent HCI (HCII): a human interacts with the machine through human (read: intelligent) means of communication
- One of the objectives of NLP is to make this happen (when the means of communication is language)


Its relation with Linguistics


- Phonetics: स्वनविज्ञान (svanavijñāna)
  - Articulatory Phonetics: उच्चारण-स्वनविज्ञान (uccāraṇa-svanavijñāna)
  - Acoustic Phonetics: भौतिक-स्वनविज्ञान (bhautika-svanavijñāna)
  - Auditory Phonetics: श्रवणात्मक-स्वनविज्ञान (śravaṇātmaka-svanavijñāna)
- Phonology: स्वनिमविज्ञान (svanimavijñāna)
- Morphology: रूपविज्ञान (rūpavijñāna)
- Syntax: वाक्यविज्ञान (vākyavijñāna)
- Semantics: अर्थविज्ञान (arthavijñāna)


Its relation with Sanskrit


- Panini
- Sanskrit
- Linguistic tradition
- Sanskrit and AI (Rick Briggs)
- Sanskrit as foundation of COLING (Nicholas Ostler)


Major Areas of R&D under CL (but not limited to)


- I/O mechanisms
- POS and other Taggers
- Morphological Analyzers
- Parsing
- Automatic Speech Recognition
- Speaker Recognition
- Speech Synthesis
- Machine Translation
- Cross Lingual Information Access
- Automatic Localization
- Text Summarizers
- Named Entity Recognition

How are these done?

- Text
- Speech
- Image/video


Methods and algorithms in CL/NLP/AI

- Rule-based
- Statistical / ML-based
- Hybrid methods
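To make the three families of methods concrete, here is a minimal sketch of a part-of-speech tagger that combines them. The training sentences, the tag names and the suffix rules are all hypothetical and chosen only for the demonstration; a real system would be trained on an annotated corpus and would use a far richer rule set.

```python
from collections import Counter, defaultdict

# Toy, hand-made training data (hypothetical, for illustration only).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
]


def train_unigram(sentences):
    """Statistical part: remember the most frequent tag seen for each word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def rule_tag(word):
    """Rule-based part: crude English suffix heuristics (demo assumptions)."""
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"  # default guess for unknown words


def hybrid_tag(words, model):
    """Hybrid: use the learned tag for seen words, else fall back to the rules."""
    return [(w, model.get(w.lower()) or rule_tag(w.lower())) for w in words]


MODEL = train_unigram(TRAIN)
print(hybrid_tag(["the", "dog", "jumped", "quickly"], MODEL))
```

In this sketch "the" and "dog" are tagged from the learned counts, while "jumped" and "quickly" were never seen in training and are tagged by the suffix rules.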


CL/NLP – integrated platforms

- OpenNLP (Java-based platform)
- NLTK (Python-based)
- ILCIANN for Indian languages


Why is Panini important in CL/NLP/AI?


Panini’s Grammar – systemic view

- Phonetic Component (अक्षरसमाम्नाय)
  - Phonemes: 14 Shivasutras
  - Pratyahara (dynamic sound classes)
- Rulebase (सूत्रपाठ)
  - About 4000 grammar rules
- Lexica
  - Verbs database (धातुपाठ)
  - Nominals database (गणपाठ)
- Lists
  - Affixes
  - Rule-specific entries
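The idea of pratyaharas as dynamic sound classes can be sketched in a few lines of code. The listing below encodes the 14 Shivasutras in roman transliteration (IAST) and expands a pratyahara from its first sound and its closing marker. The data layout, the upper-case convention for markers and the function name `expand_pratyahara` are assumptions of this sketch, not part of Panini's text or of any existing tool.

```python
# The 14 Shivasutras as (sounds, marker) pairs in IAST.
# Markers are written in upper case purely as a convention of this sketch.
SHIVA_SUTRAS = [
    (["a", "i", "u"], "Ṇ"),
    (["ṛ", "ḷ"], "K"),
    (["e", "o"], "Ṅ"),
    (["ai", "au"], "C"),
    (["h", "y", "v", "r"], "Ṭ"),
    (["l"], "Ṇ"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "M"),
    (["jh", "bh"], "Ñ"),
    (["gh", "ḍh", "dh"], "Ṣ"),
    (["j", "b", "g", "ḍ", "d"], "Ś"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "V"),
    (["k", "p"], "Y"),
    (["ś", "ṣ", "s"], "R"),
    (["h"], "L"),
]


def expand_pratyahara(start, marker):
    """Return the sounds from the first occurrence of `start` up to the first
    sutra, at or after it, that ends in `marker`. Markers are never included."""
    result = []
    collecting = False
    for sounds, sutra_marker in SHIVA_SUTRAS:
        for sound in sounds:
            if not collecting and sound == start:
                collecting = True
            if collecting:
                result.append(sound)
        if collecting and sutra_marker == marker:
            return result
    raise ValueError(f"no pratyahara for {start!r} + {marker!r}")


print(expand_pratyahara("i", "K"))  # the class "ik"
print(expand_pratyahara("a", "C"))  # the class "ac" (all vowels)
print(expand_pratyahara("y", "Ṇ"))  # the class "yaṇ" (semivowels)
```

The marker Ṇ occurs twice in the sutras; this sketch always stops at the first matching marker, so the wider reading of "aṇ" that the tradition allows in some rules is not modelled.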


Panini’s System

- More formal
- Largely unambiguous procedures
- Easier programming
- Structure similar to a program
  - Variable instantiation
  - Vriddhi (evaluation / expansion)
- Phrase structure (PS) rules and replacement procedures may have been influenced by Panini
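One way to read the comparison with a program is that a technical term such as vriddhi works like a named variable bound to a set of sounds (rule 1.1.1 defines it as ā, ai, au), which later rules then apply as a substitution. The sketch below illustrates that reading in roman transliteration (IAST). The table `VRDDHI_OF` and the helper `vrddhi_first_vowel` are hypothetical names introduced here, and the code ignores the conditions under which the grammar actually triggers vriddhi.

```python
# The term "vriddhi" as a named set of sounds (Ashtadhyayi 1.1.1).
VRDDHI = {"ā", "ai", "au"}

# Which vriddhi sound replaces which vowel; ṛ additionally takes a following r.
VRDDHI_OF = {
    "a": "ā", "ā": "ā",
    "i": "ai", "ī": "ai", "e": "ai", "ai": "ai",
    "u": "au", "ū": "au", "o": "au", "au": "au",
    "ṛ": "ār", "ṝ": "ār",
}

# Try two-letter vowels ("ai", "au") before single letters.
VOWELS = sorted(VRDDHI_OF, key=len, reverse=True)


def vrddhi_first_vowel(stem):
    """Replace the first vowel of `stem` with its vriddhi counterpart."""
    for i in range(len(stem)):
        for vowel in VOWELS:
            if stem.startswith(vowel, i):
                return stem[:i] + VRDDHI_OF[vowel] + stem[i + len(vowel):]
    return stem  # no vowel found


print(vrddhi_first_vowel("śiva"))    # first step towards the derivative "śaiva"
print(vrddhi_first_vowel("buddha"))  # first step towards "bauddha"
```

Binding the term once and reusing it wherever a rule mentions vriddhi mirrors how a program defines a constant and then refers to it by name.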


How do we compute Panini rules?

- Understand Panini’s rules for a particular (programmable) derivation task (a Vidhi rule together with its defining/environment rules)
- Generate pseudocode (Panini sutra example for computing Sandhi rules)
- Convert the pseudocode to actual code in some language (Java example)
- Compile and run
- Get the desired results (with or without an interface)
- Evaluate and debug
- Execution of Sandhi code
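The steps above can be illustrated with one well-known vowel Sandhi rule, iko yaṇ aci (6.1.77): a final i, u, ṛ or ḷ (short or long) becomes y, v, r or l before a dissimilar vowel. The slides use Java for the actual implementation; the sketch below uses Python, works on roman transliteration (IAST), and is only an illustration under those assumptions. The function name `yan_sandhi` and the lookup tables are hypothetical, and exceptions to the rule are not handled.

```python
# Substitutes prescribed by the rule: ik vowel -> corresponding semivowel (yaṇ).
IK_TO_YAN = {"i": "y", "ī": "y", "u": "v", "ū": "v", "ṛ": "r", "ṝ": "r", "ḷ": "l"}

# Groups of "like" vowels; between these a different rule (lengthening) applies.
SAVARNA = [{"i", "ī"}, {"u", "ū"}, {"ṛ", "ṝ", "ḷ"}]

# First letters of all vowels; "ai" and "au" also begin with "a".
VOWEL_INITIALS = {"a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "o"}


def yan_sandhi(first, second):
    """Join two words by iko yaṇ aci; return None when the rule does not apply."""
    if not first or not second:
        return None
    if first.endswith(("ai", "au")):  # diphthongs are not ik vowels
        return None
    last, nxt = first[-1], second[0]
    if last not in IK_TO_YAN or nxt not in VOWEL_INITIALS:
        return None  # environment of the rule is not met
    if any(last in group and nxt in group for group in SAVARNA):
        return None  # like vowels: left to the lengthening rule
    return first[:-1] + IK_TO_YAN[last] + second


print(yan_sandhi("iti", "api"))    # environment met
print(yan_sandhi("su", "āgatam"))  # environment met
print(yan_sandhi("kavi", "īśa"))   # like vowels, rule does not apply
```

Returning None when the environment is not met keeps this rule independent of the others, so a fuller system could try each Sandhi rule in turn and use the first one that applies.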


Requirements of Diverse and Multilingual societies like India


- Technology viewed as a basic need for multilingual societies
- E-governance, e-education, and all other fields where language is critical
- Competition between language groups
- Role of govt. and industry critical for development of tech


India’s language situation


- More diverse and multilingual than any other country in the world
- More than 1600 languages
- Documented literature
  - Sanskrit: more than 6000 years
  - Pali/Prakrit: 2500 years
  - Apabhramsha: 1500 years
  - Modern Indian languages: 1000 years
- Oral tradition
- Sanskrit has been a dominant influence on all other Indian languages in every area of creativity
- Besides Sanskrit, Persian in the Middle Ages and English in the colonial and post-colonial periods have had significant influence
- Hindi-Urdu is another major language to reckon with, and it has been influencing other languages, even Sanskrit and English


Indian Language Families and % Speakers

- Indo-Aryan: 76.87%
- Dravidian: 20.82%
- Austro-Asiatic: 1.11%
- Tibeto-Burman: 1%
- Andamanese*: 0%


India’s Scheduled Languages


Major agencies responsible


- Govt.
  - MEITY: to develop these tools/technologies
  - MHRD: to deliver these on cheaper tablets
- Private
  - Business
  - Altruism
  - Language groups


Indian academia – current focus


- IIT Chennai: speech (Prof Hemamurthy, Prof Umesh)
- IIT Delhi: OCR (Prof Shantanu Chaudhury)
- IISc Bangalore: OLHWR (Prof A G Ramakrishnan)
- JNU: LT resources, tools (Prof Girish Nath Jha)
- CDACs (Dr Hemant Darbari), IIIT Hyderabad (Prof Rajiv Sangal), IIITM (Prof Elizabeth Shirley), some major universities: MT, resource creation
- IIT Bombay (Prof Pushpak Bhattacharya): MT, wordnet

Industry working in diverse areas


- Microsoft: search engine, MT and all related tools
- Google: search engine, MT and all related tools
- Swiftkey: input mechanism
- Amazon AI
- Samsung
- Adobe: document processing
- Nuance: input mechanism
- ezDI: medical data processing


TDIL - Technology Development for Indian Languages, MCIT, Govt. of India (set up in 1991)


TDIL – Mission mode projects

In the consortium mode, 26 premier Institutes and R&D organizations are working on LTR projects


TDIL - Major projects in recent past

- English to Indian Languages Machine Translation (MT) System (E-ILMT): CDAC, Pune
- English to Indian Languages Machine Translation (MT) System with Angla-Bharti Technology (E-ILMT-ABT): IIT Kanpur
- Indian Language to Indian Language Machine Translation System (ILMT): IIIT Hyderabad
- Sanskrit-Hindi Machine Translation (SHMT): University of Hyderabad, JNU…


Major TDIL projects in recent past

- Document Analysis & Recognition System for Indian Languages (DARSIL): IIT Delhi
- On-Line Handwriting Recognition (OLHR): I.I.Sc, Bangalore
- Cross Lingual Information Access (CLIA): IIT Bombay
- Speech Corpora & Technologies (SCT): IIT Chennai
- Indian Language Corpora Initiative (ILCI): JNU, New Delhi


Does our progress depend on CL/NLP?

YES……. HOW?


Most critical areas of development

- Governance
- Education
- Health
- Disaster Management
- Languages, Cultures, Knowledge Traditions


Sanskrit – an extraordinary language?


- Most ancient and most modern?
- Natural language with the oldest history and a precision almost matching that of an artificial language
- Standardized
- Not a street language (language of the intellectual tradition)
- Does not have a linguistic community like other languages in India, yet is spoken in all the states
- Panini’s grammar: the only complete grammar for any human language


- More than 6000 years of continuous intellectual activity
- Millions of manuscripts (unedited, unread? and unresearched)
- Condition of manuscript repositories/libraries in India
  - We are losing several hundred per week (Dominik Wujastyk)
- Besides handwritten manuscripts, we have printed texts (pre-digital), texts converted to digital, and born-digital texts
- Complexity of the text, nature of historical languages, multiple scripts, mixed scripts
- Language detection: is it really Sanskrit? (Van Pelt Library at UPenn)
- Technical nature of the subject matter in the text: subject experts are needed
- Most funded language in the world?
  - 19 universities exclusively for Sanskrit
  - 150 plus departments in universities/colleges


Computational Linguistics tasks for Sanskrit


1) Creating urgently needed tools

- I/O Tools
  - Smart keyboards
  - OCR/Handwriting (manuscript) recognition
  - Speech recognition
  - Text to Speech
  - Multimodal recognition
- Manuscript/text digitization, preservation and research
- Text readers and simplifiers
- Text summarizers
- E-learning applications
- Machine Translation (text to text/speech to speech)


2) Create digital resources

- Digital manuscript library
  - Handwritten texts parallel with their electronic texts
  - These will be indexed and cross-linked with relevant interpretative shastras
  - Translation facility
- Digital library of heritage texts
- Digital library of multimodal texts (speech and videos)
  - Performances of natakas
  - Vedic ritual performances
  - Vedic path traditions
  - Cultural rites and rituals
- Digital mapping of yatras, religious tourism
- Digital resources of Ayurveda texts, practices and traditions

3) Expert systems and education technology

- Expert system for Yoga
- Expert system for Ayurveda
- Expert system for Vedic traditions and Vedangas
- Multimedia presentations
  - Popular epics (Ramayana, Mahabharata, including diverse variations in India and abroad)
  - Other geetikavyas, mahakavyas and natakas
  - Puranas
  - Katha sahitya
- Educational technology resources and teaching/learning methods


4) Explore the shastras for AI

- In today’s times of data-driven technology, machines need to be trained on web data
- We need “programs which can translate natural language texts into collections of sentences in mathematical logical language” (John McCarthy)
- McCarthy also believes that the knowledge “needed for people to speak to and understand each other is mostly not encoded in language. If we knew what it is, both AI and linguistics would be helped”
- The “mathematical logical language” he is referring to can be compared with the language and methods of Navya-Nyaya; Mimamsa interpretation techniques can also be helpful
- A combination of व्याकरण-न्याय-मीमांसा (Vyakarana-Nyaya-Mimamsa) with AI/Computational Linguistics would yield positive results


Work done at School of Sanskrit and Indic Studies Jawaharlal Nehru University


Projects


Sanskrit-Hindi Machine Translation (SHMT)


Shallow Parser Tools for Indian Languages (SPTIL)


Indian Languages Corpora Initiative (ILCI)


Consultancies


Predictive mobile keyboard for Kashmiri

Nuance Technologies, 2016


Predictive mobile keyboards for lesser used languages (like Sanskrit, Santhali, Manipuri, Maithili, Sindhi)

Swiftkey, 2015


Online Handwriting Recognition for Hindi (ink samples, language and usage model, dictionaries)

Funded by Microsoft, USA, 2006


Multimodal data in 8 languages (Indian English, Hindi, Urdu, Tamil, Bangla, Punjabi, Pushto, Dari)

Funded by LDC, University of Pennsylvania, 2011


Microsoft Translator Hub

- English-Urdu MT system released in Feb 2013, http://bing.com/translator (done in collaboration with us at JNU)
- More language pairs in progress: English-Gujarati, English-Sindhi, Sanskrit-English, Sanskrit-Hindi

Our ‘Monster’ Tools

Crawler Sanitizer Lexicographer


Work done by our research students

- As part of M.Phil and Ph.D research by my students, a large number of tools have been developed, mainly for Sanskrit
- Their tools and dissertations are online on our server


We showcase our developments on international platforms

WILDRE: Workshop on Indian Language Data Resource & Evaluation (partially sponsored by Microsoft Research India - MSRI)

- WILDRE5: Marseille, France, 16 May 2020 (now online on 24 May)
- WILDRE4: Miyazaki, Japan (2018)
- WILDRE3: Portoroz, Slovenia (2016)
- WILDRE2: Reykjavik (2014)
- WILDRE1: Istanbul (2012)

SOIL-TECH 2019 (JNU, 15-17 Feb)

 Sanskrit and Other Indian Languages (SOIL) Technology, JNU

Demo

- http://sanskrit.jnu.ac.in
- https://www.youtube.com/channel/UCmDadGPT098c48s4cBpWeTg

Some Readings

- Language and Mind, Noam Chomsky
- The Oxford Handbook of Computational Linguistics, Ruslan Mitkov (ed.), OUP
- Speech and Language Processing, Jurafsky and Martin, 2000, Prentice Hall, NJ
- Sanskrit Computational Linguistics, Girish N Jha (ed.), Springer
- Conceptual Structures: Information Processing in Mind and Machine, J.F. Sowa, Addison-Wesley Publishing Company, Reading

धन्यवाद! (Thank you!)


[email protected], 91-11-26741308