School of Sanskrit and Indic Studies, J.N.U., New Delhi
Computational Linguistics, AI and Sanskrit
Girish Nath Jha
Professor of Computational Linguistics, School of Sanskrit and Indic Studies, JNU
Concurrent Faculty, Center for Linguistics, SLL&CS
Concurrent Faculty, Special Center of E-Learning (SCEL)
Associated Faculty, Atal Bihari Vajpayee School of Management and Entrepreneurship
JNU, New Delhi - 110067
2/28/2021 Intro to coling, 2020-21 1
Computational Linguistics (CL) /Natural Language Processing (NLP)
Interdisciplinary field: linguistics, computer science, AI, psychology, philosophy, logic, cognitive science, ... wherever language applies
Develops formal and computable models of natural languages
Models human languages using rule-based or statistical methods from a computational perspective
Applies computational approaches and methods to relevant linguistic questions
Its relation with Intelligent computing
Background (adapted from Dafydd Gibbon (2013))
1940s: encryption, decryption, neural automata, neural networks, neuro-linguistics
1950s: Machine Translation, dictionaries, text utilities (concordances)
1960s: theoretical informatics, complexity, natural language parsing, speech
1970s: psycholinguistic interpretations of parsers/generators
1980s-90s: logic, inference, unification, NLIs, bi/multi-modal interfaces
2000-2010: Web, resources, big data
Future: ???
Its relation with AI
CL/NLP
Inference engines
Expert Systems
Intelligent Tutoring Systems
Vision Machines
Robotics
Today AI is synonymous with Machine Learning
(AI can only happen if machines understand natural language texts)
Its relation with HCI – Human Computer Interaction
Human Computer Interaction (HCI) and NLP
Conventional HCI
Intelligent HCI (HCII) - the human interacts with the machine through human (read: intelligent) means of communication
One of the objectives of NLP is to make this happen (if the means of communication is language)
Its relation with Linguistics
Phonetics – स्वनविज्ञान (svanavijñāna)
  Articulatory Phonetics (उच्चारण-स्वनविज्ञान uccāraṇa-svanavijñāna)
  Acoustic Phonetics (भौतिक-स्वनविज्ञान bhautika-svanavijñāna)
  Auditory Phonetics (श्रवणात्मक-स्वनविज्ञान śravaṇātmaka-svanavijñāna)
Phonology – स्वनिमविज्ञान (svanimavijñāna)
Morphology – रूपविज्ञान (rūpavijñāna)
Syntax – वाक्यविज्ञान (vākyavijñāna)
Semantics – अर्थविज्ञान (arthavijñāna)
Its relation with Sanskrit
Panini
Sanskrit linguistic tradition
Sanskrit and AI
  Rick Briggs
Sanskrit as foundation of COLING
  Nicholas Ostler
Major Areas of R&D under CL (but not limited to)
I/O mechanisms
POS and other taggers
Morphological analyzers
Parsing
Automatic Speech Recognition
Speaker Recognition
Speech Synthesis
Machine Translation
Cross Lingual Information Access
Automatic Localization
Text Summarizers
Named Entity Recognition
How are these done?
Text
Speech
Image/video
Methods and algorithms in CL/NLP/AI
Rule-based
Statistical
ML-based
Hybrid methods
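To make the contrast between the first two families concrete, here is a minimal sketch of part-of-speech tagging done both ways. Everything in it is an illustrative assumption and not part of the original slides: the suffix rules, the toy tagged corpus, the tag names, and the function names `rule_based_tag`, `train_unigram_tagger` and `statistical_tag`. A hybrid method would, for example, fall back on the rules when the statistical model has not seen a word.

```python
from collections import Counter, defaultdict

# Rule-based: hand-written suffix rules (hypothetical, English-only toy rules).
SUFFIX_RULES = [("ing", "VERB"), ("ed", "VERB"), ("ly", "ADV")]


def rule_based_tag(word, default="NOUN"):
    """Return the tag of the first matching suffix rule, else the default tag."""
    for suffix, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            return tag
    return default


def train_unigram_tagger(tagged_corpus):
    """Statistical: learn the most frequent tag of each word from tagged data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def statistical_tag(word, model, default="NOUN"):
    """Look the word up in the learned model; unseen words get the default tag."""
    return model.get(word, default)


# Toy training data, invented for this sketch.
TOY_CORPUS = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("the", "DET"), ("run", "NOUN"), ("run", "VERB"), ("run", "VERB"),
]

if __name__ == "__main__":
    model = train_unigram_tagger(TOY_CORPUS)
    for w in ["running", "quickly", "run", "the"]:
        print(w, rule_based_tag(w), statistical_tag(w, model))
```

The rule-based tagger needs no data but only knows what its author wrote down; the statistical tagger knows nothing about suffixes but adapts to whatever the tagged data shows.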
CL/NLP – integrated platforms
OpenNLP (Java-based platform)
NLTK – Python-based
ILCIANN for Indian languages
Why is Panini important in CL/NLP/AI?
Panini’s Grammar – systemic view
Phonetic Component (अक्षरसमाम्नाय)
  Phonemes – 14 Shivasutras
  Pratyahara (dynamic sound classes)
Rulebase (सूत्रपाठ)
  About 4000 grammar rules
Lexica
  Verbs database (धातुपाठ)
  Nominals database (गणपाठ)
Lists
  Affixes
  Rule-specific entries
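The idea of a pratyahara as a "dynamic sound class" can be sketched in a few lines of Python. The 14 Shivasutras are stored as ordered groups of sounds, each closed by a marker letter; a pratyahara names a start sound and a marker, and expands to every sound from the start up to that marker. The IAST transliteration, the data layout and the function name `expand_pratyahara` are assumptions made for this sketch, and the marker Ṇ, which occurs twice, is resolved simply by taking its first occurrence at or after the start sound.

```python
# The 14 Shivasutras as (sounds, marker) pairs, in IAST transliteration.
# The inherent "a" of the consonants is omitted.
SHIVA_SUTRAS = [
    (["a", "i", "u"], "Ṇ"),
    (["ṛ", "ḷ"], "K"),
    (["e", "o"], "Ṅ"),
    (["ai", "au"], "C"),
    (["h", "y", "v", "r"], "Ṭ"),
    (["l"], "Ṇ"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "M"),
    (["jh", "bh"], "Ñ"),
    (["gh", "ḍh", "dh"], "Ṣ"),
    (["j", "b", "g", "ḍ", "d"], "Ś"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "V"),
    (["k", "p"], "Y"),
    (["ś", "ṣ", "s"], "R"),
    (["h"], "L"),
]


def expand_pratyahara(start, marker):
    """Expand a pratyahara into the list of sounds it denotes.

    Collects sounds from the first occurrence of `start` up to the end of
    the first sutra, at or after that point, whose marker equals `marker`.
    """
    result = []
    collecting = False
    for sounds, sutra_marker in SHIVA_SUTRAS:
        for sound in sounds:
            if not collecting and sound == start:
                collecting = True
            if collecting:
                result.append(sound)
        if collecting and sutra_marker == marker:
            return result
    raise ValueError(f"no pratyahara {start!r} + {marker!r}")


if __name__ == "__main__":
    print("iK:", expand_pratyahara("i", "K"))   # the simple vowels i, u, ṛ, ḷ
    print("aC:", expand_pratyahara("a", "C"))   # all vowels
    print("yaṆ:", expand_pratyahara("y", "Ṇ"))  # the semivowels
```

The classes are "dynamic" in the sense that nothing beyond the ordered list is stored: each class is computed on demand from two symbols.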
Panini’s System
More formal
Largely unambiguous procedures
Easier programming
Structure similar to a program
  Variable Instantiation
  Vriddhi (evaluation / expansion)
PS rules and replacement procedures may have been influenced by Panini
How do we compute Panini rules?
Understand Panini’s rules for a particular (programmable) derivation task (a Vidhi rule with its other defining/environment rules)
Generate pseudocode (Panini sutra example for computing Sandhi rules)
Convert the pseudocode to actual code in some language (Java example)
Compile and run (with/without interface)
Get desired results
Evaluate and debug
Execution of Sandhi code
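The steps above can be sketched for one Sandhi rule. The examples referred to on the slide are not reproduced in this text, so the following is an independent illustration in Python rather than the Java code mentioned: it takes the sutra iko yaṇ aci (Ashtadhyayi 6.1.77), by which i, u, ṛ, ḷ are replaced by y, v, r, l before a vowel. The choice of sutra, the function name `yan_sandhi`, the IAST input convention, and the simplifications (similar vowels and final ai/au are left to other rules) are all assumptions of this sketch.

```python
import unicodedata


def _nfc(text):
    """Normalize to NFC so each IAST letter is a single code point."""
    return unicodedata.normalize("NFC", text)


# ik vowel -> corresponding yaṇ semivowel (long vowels behave like short ones).
IK_TO_YAN = {_nfc(k): v for k, v in {
    "i": "y", "ī": "y", "u": "v", "ū": "v", "ṛ": "r", "ṝ": "r", "ḷ": "l",
}.items()}

# First letters with which a vowel-initial word can begin (ai, au start with a).
VOWEL_STARTS = set(_nfc("aāiīuūṛṝḷeo"))

# Pairs treated as similar here; between them a different rule applies.
SIMILAR_GROUPS = [set(_nfc("aā")), set(_nfc("iī")), set(_nfc("uū")), set(_nfc("ṛṝ"))]


def are_similar(a, b):
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def yan_sandhi(first, second):
    """Apply iko yaṇ aci at the junction of two words, or return None."""
    first, second = _nfc(first), _nfc(second)
    if not first or not second:
        return None
    # A final ai/au is a diphthong, not an ik vowel; leave it to other rules.
    if first.endswith(("ai", "au")):
        return None
    last, nxt = first[-1], second[0]
    if last in IK_TO_YAN and nxt in VOWEL_STARTS and not are_similar(last, nxt):
        return first[:-1] + IK_TO_YAN[last] + second
    return None


if __name__ == "__main__":
    for pair in [("iti", "api"), ("su", "āgatam"), ("pitṛ", "ājñā"), ("iti", "iva")]:
        print(pair, "->", yan_sandhi(*pair))
```

Returning None when the rule does not apply leaves room for a driver that tries other sandhi rules in turn, which is where the "evaluate and debug" step would come in.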
Requirements of Diverse and Multilingual societies like India
Technology viewed as a basic need for multilingual societies
E-governance, e-education, and all other fields where language is critical
Competition between language groups
Role of government and industry is critical for the development of technology
India’s language situation
More diverse than any other country in the world
More than 1600 languages
Documented literature
  Sanskrit >> more than 6000 yrs
  Pali/Prakrit >> 2500 yrs
  Apabhramsha >> 1500 yrs
  Modern Indian languages >> 1000 yrs
Oral tradition
Sanskrit has been a dominant influence on all other Indian languages in every area of creativity
Besides Sanskrit, Persian in the middle ages and English in the colonial and post-colonial periods have had significant influence
Hindi-Urdu is another major language to reckon with, which has been impacting other languages, even Sanskrit and English
Indian Language Families and % Speakers
Indo-Aryan - 76.87%
Dravidian - 20.82%
Austro-Asiatic - 1.11%
Tibeto-Burman - 1%
Andamanese* - 0%
India’s Scheduled Languages
Major agencies responsible
Govt.
  MEITY to develop these tools/technologies
  MHRD to deliver these on cheaper tablets
Private
  Business
  Altruism
  Language groups
Indian academia – current focus
IIT Chennai: speech (Prof Hemamurthy, Prof Umesh)
IIT Delhi: OCR (Prof Shantanu Chaudhury)
IISc Bangalore: OLHWR (Prof A G Ramakrishnan)
JNU: LT resources, tools (Prof Girish Nath Jha)
CDACs (Dr Hemant Darbari), IIIT Hyderabad (Prof Rajiv Sangal), IIITM (Prof Elizabeth Shirley), some major universities: MT, resource creation
IIT Bombay (Prof Pushpak Bhattacharya): MT, wordnet
Industry working in diverse areas
Microsoft – search engine, MT and all related tools
Google – search engine, MT and all related tools
Swiftkey – input mechanism
Amazon AI
Samsung
Adobe – document processing
Nuance – input mechanism
ezDI – medical data processing
TDIL - Technology Development for Indian Languages, MCIT, Govt. of India (set up in 1991)
TDIL – Mission mode projects
In the consortium mode, 26 premier Institutes and R&D organizations are working on LTR projects
TDIL - Major projects in recent past
English to Indian Languages Machine Translation (MT) System (E-ILMT) CDAC, Pune
English to Indian Languages Machine Translation (MT) System with Angla-Bharti Technology (E-ILMT-ABT) IIT Kanpur
Indian Language to Indian Language Machine Translation System: (ILMT) IIIT Hyderabad
Sanskrit-Hindi Machine Translation (SHMT) University of Hyderabad, JNU…
Major TDIL projects in recent past
Document Analysis & Recognition System for Indian Languages (DARSIL) IIT Delhi
On-Line Handwriting Recognition (OLHR) I.I.Sc, Bangalore
Cross Lingual Information Access (CLIA) IIT, Bombay
Speech Corpora & Technologies (SCT) IIT Chennai
Indian Language Corpora Initiative (ILCI) JNU, New Delhi
Does our progress depend on CL/NLP?
YES……. HOW?
Most critical areas of development
Governance
Education
Health
Disaster Management
Languages, Cultures, Knowledge Traditions
Sanskrit – an extraordinary language?
Most ancient and most modern?
Natural language with the oldest history and a precision almost matching that of an artificial language
Standardized
Not a street language (language of intellectual tradition)
Does not have a linguistic community like other languages in India, yet is spoken in all the states
Panini’s grammar – the only complete grammar for any human language
More than 6000 years of continuous intellectual activity
Millions of manuscripts (unedited, unread? and unresearched)
Condition of manuscript repositories/libraries in India
  We are losing several hundred per week (Dominique Wujastyk)
Besides handwritten manuscripts, we have printed texts (pre-digital), texts converted to digital, and born-digital texts
Complexity of the text, nature of historical languages, multiple scripts, mixed scripts
Language detection – is it really Sanskrit? (Van Pelt library in UPenn)
Technical nature of the subject matter in the text – need subject experts
Most funded language in the world?
  19 universities exclusively for Sanskrit
  150 plus departments in universities/colleges
Computational Linguistics tasks for Sanskrit
1) Creating urgently needed tools
I/O tools
  Smart keyboards
  OCR/handwriting (manuscript) recognition
  Speech recognition
  Text to speech
  Multimodal recognition
Manuscript/text digitization, preservation and research
Text readers and simplifiers
Text summarizers
E-learning applications
Machine Translation (text to text/speech to speech)
2) Create digital resources
Digital manuscript library
  Handwritten texts parallel with their electronic texts
  These will be indexed and cross-linked with relevant interpretative shastras
  Translation facility
Digital library of heritage texts
Digital library of multimodal texts (speech and videos)
  Performances of natakas
  Vedic ritual performances
  Vedic path traditions
  Cultural rites and rituals
Digital mapping of yatras, religious tourism
Digital resources of Ayurveda texts, practices and traditions
3) Expert systems and education technology
Expert system for Yoga
Expert system for Ayurveda
Expert system for Vedic traditions and Vedangas
Multimedia presentations
  Popular epics (Ramayana, Mahabharata, including diverse variations in India and abroad)
  Other geetikavyas, mahakavyas and natakas
  Puranas
  Katha sahitya
Educational technology resources and teaching/learning methods
4) Explore the shastras for AI
In today’s times of data-driven technology, machines need to be trained on web data
We need “programs which can translate natural language texts into collections of sentences in mathematical logical language” (John McCarthy)
McCarthy also believes that the knowledge “needed for people to speak to and understand each other is mostly not encoded in language. If we knew what it is, both AI and linguistics would be helped”
The “mathematical logical language” he is referring to can be compared with the language and methods of Navya-Nyaya; Mimamsa interpretation techniques can also be helpful
A combination of व्याकरण-न्याय-मीमांसा (Vyakarana-Nyaya-Mimamsa) with AI/Computational Linguistics would yield positive results
Work done at School of Sanskrit and Indic Studies Jawaharlal Nehru University
Projects
Sanskrit-Hindi Machine Translation (SHMT)
Shallow Parser Tools for Indian Languages (SPTIL)
Indian Languages Corpora Initiative (ILCI)
Consultancies
Predictive mobile keyboard for Kashmiri
  Nuance Technologies, 2016
Predictive mobile keyboards for lesser used languages (like Sanskrit, Santhali, Manipuri, Maithili, Sindhi)
  Swiftkey, 2015
Online Handwriting Recognition for Hindi (ink samples, language and usage model, dictionaries)
  Funded by Microsoft, USA, 2006
Multimodal data in 8 languages (Indian English, Hindi, Urdu, Tamil, Bangla, Punjabi, Pushto, Dari)
  Funded by LDC, University of Pennsylvania, 2011
Microsoft Translator Hub
  English-Urdu MT system released in Feb 2013, http://bing.com/translator (done in collaboration with us at JNU)
  More language pairs in progress – English-Gujarati, English-Sindhi, Sanskrit-English, Sanskrit-Hindi
Our ‘Monster’ Tools
Crawler
Sanitizer
Lexicographer
Work done by our research students
As part of M.Phil. and Ph.D. research by my students, a large number of tools have been developed, mainly for Sanskrit
Their tools and dissertations are online on our server
We showcase our developments on international platforms
WILDRE: Workshop on Indian Language Data Resource & Evaluation (partially sponsored by Microsoft Research India - MSRI)
WILDRE5 – Marseille, France, 16 May 2020 (now online on 24 May)
WILDRE4 – Miyazaki, Japan (2018)
WILDRE3 – Portoroz, Slovenia (2016)
WILDRE2 – Reykjavik (2014)
WILDRE1 – Istanbul (2012)
SOIL-TECH 2019 (JNU, 15-17 Feb)
Sanskrit and Other Indian Languages (SOIL) Technology, JNU
Demo
http://sanskrit.jnu.ac.in
https://www.youtube.com/channel/UCmDadGPT098c48s4cBpWeTg
Some Readings
Language and Mind, Noam Chomsky
The Oxford Handbook of Computational Linguistics, Ruslan Mitkov (ed.), OUP
Speech and Language Processing, Jurafsky and Martin, 2000, Prentice Hall, NJ
Sanskrit Computational Linguistics, Girish N Jha (ed.), Springer
Conceptual Structures: Information Processing in Mind and Machine, J. F. Sowa, Addison-Wesley Publishing Company, Reading
धन्यवाद! (Thank you!)
Phone: 91-11-26741308