Open Voice Data in Indian Languages Preface
Total Page:16
File Type:pdf, Size:1020Kb
Implemented by A Study on Open Voice Data in Indian Languages Preface Published by: Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH Registered offices Bonn and Eschborn A Study on Open Voice Data in Indian Languages A2/18, Safdarjung Enclave T: +91 11 4949 5353 F: +91 4949 539 E: [email protected] I: www.giz.de/india Person responsible Mr. Gaurav Sharma E: [email protected] FAIR Forward – Artificial Intelligence for All Author: BizAugmentor Global Services Pvt Ltd www.bizaugmentor.com e language diversity and lack of technological Economic Cooperation and Development (BMZ), Design and Layout support for spoken languages in India make universal implemented in India by Deutsche Gesellschaft für Caps & Shells Creatives Pvt Ltd access to information and services an ongoing Internationale Zusammenarbeit (GIZ) Gmbh. is challenge. AI-based voice recognition has great initiative is also active in partner countries: Ghana, Photo credits All the images that have been used in the report are openly accessible and made to use from ‘Google’ with permission for potential. It makes technology more inclusive and Rwanda, South Africa and Uganda. reusability and modification. enables millions of people to access services they cannot use yet – be it in agriculture, education, is research aims at nding the current state of On behalf of the health, or others. For this vision to be realised, a (open) voice datasets in Indian languages, including German Federal Ministry for Economic Cooperation and Development (BMZ) central obstacle needs to be addressed: e lack of free information about their volume, quality, mode of GIZ is responsible for the content of this publication and open voice data in India to develop and train collection, and availability. e present research also New Delhi, December 2020 natural language processing models. explores the challenges for the creation and maintenance of open voice datasets in India. e Disclaimer: The data in the publication has been collected, analysed and compiled with due care; and has been prepared in good faith e present research work is conducted as part of the report makes practice-oriented recommendations based on information available at the date of publication without any independent verification. However, Deutsche Gesellschaft global project 'FAIR Forward – Articial Intelligence for future sustainable voice data collection based on für Internationale Zusammenarbeit (GIZ) GmbH does not guarantee the accuracy, reliability, completeness or currency of the for All', initiative of the German Federal Ministry for both extensive desk research and expert interviews. information in this publication. GIZ shall not be held liable for any loss, damage, cost or expense incurred or arising by reason of any person using or relying on information in this publication. Preface Published by: Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH Registered offices Bonn and Eschborn A Study on Open Voice Data in Indian Languages A2/18, Safdarjung Enclave T: +91 11 4949 5353 F: +91 4949 539 E: [email protected] I: www.giz.de/india Person responsible Mr. Gaurav Sharma E: [email protected] FAIR Forward – Artificial Intelligence for All Author: BizAugmentor Global Services Pvt Ltd www.bizaugmentor.com e language diversity and lack of technological Economic Cooperation and Development (BMZ), Design and Layout support for spoken languages in India make universal implemented in India by Deutsche Gesellschaft für Caps & Shells Creatives Pvt Ltd access to information and services an ongoing Internationale Zusammenarbeit (GIZ) Gmbh. is challenge. AI-based voice recognition has great initiative is also active in partner countries: Ghana, Photo credits All the images that have been used in the report are openly accessible and made to use from ‘Google’ with permission for potential. It makes technology more inclusive and Rwanda, South Africa and Uganda. reusability and modification. enables millions of people to access services they cannot use yet – be it in agriculture, education, is research aims at nding the current state of On behalf of the health, or others. For this vision to be realised, a (open) voice datasets in Indian languages, including German Federal Ministry for Economic Cooperation and Development (BMZ) central obstacle needs to be addressed: e lack of free information about their volume, quality, mode of GIZ is responsible for the content of this publication and open voice data in India to develop and train collection, and availability. e present research also New Delhi, December 2020 natural language processing models. explores the challenges for the creation and maintenance of open voice datasets in India. e Disclaimer: The data in the publication has been collected, analysed and compiled with due care; and has been prepared in good faith e present research work is conducted as part of the report makes practice-oriented recommendations based on information available at the date of publication without any independent verification. However, Deutsche Gesellschaft global project 'FAIR Forward – Articial Intelligence for future sustainable voice data collection based on für Internationale Zusammenarbeit (GIZ) GmbH does not guarantee the accuracy, reliability, completeness or currency of the for All', initiative of the German Federal Ministry for both extensive desk research and expert interviews. information in this publication. GIZ shall not be held liable for any loss, damage, cost or expense incurred or arising by reason of any person using or relying on information in this publication. Contents Executive Summary 08 1. Languages in India 10 2. AI Industry in India: Focal Points 14 2.1 AI-based Speech Technologies 15 2.1.1 Automatic Speech Recognition (ASR) 16 2.1.2 Text-to-Speech (TTS) 16 2.2. Speech System for Indian Languages 17 3. Research: Objectives & Approach 18 3.1 Research Objectives 18 3.2 Research Approach 19 Phase 1: Online Secondary Research 19 Phase 2: Qualitative Interviews 20 4. Understanding Voice Data 22 4.1 Availability of Data 24 4.1.1 Data access/procurement procedures 24 4.1.2 Dataset Licensing 26 4.2 Data collection mechanisms 27 4.2.1 Speech types 27 4.2.2 Data collection process 28 4.2.3 Recording environment 31 4.3 Quality of Voice data 31 4.3.1 Quality based on recording device used 32 4.3.2 Quality based on the recording environment 32 4.3.3 Quality based on the method of transcription 32 4.3.4 Quality based on the variety of Speakers 32 4.3.5 Quality based on domain coverage 33 4.4 Volume of Voice Data 33 4.4.1 Volume based on Hours of Speech data 34 4.4.2 Volume based on Number of Languages covered 35 5. Observations 38 5.1 State of open-source voice data in India 38 5.2 Working with Indian languages 39 5.3 Methods of Speech data collection 39 5.4 Indian voice technology community 39 5.5 Intent to Open-source 40 Recommendations 42 Appendices & References 44 Appendix 1: List of contributors 45 Appendix 2: Abbreviations 46 Appendix 3: List of Datasets 47 Appendix 4: Dataset Details & Source Links 52 References 55 Contents Executive Summary 08 1. Languages in India 10 2. AI Industry in India: Focal Points 14 2.1 AI-based Speech Technologies 15 2.1.1 Automatic Speech Recognition (ASR) 16 2.1.2 Text-to-Speech (TTS) 16 2.2. Speech System for Indian Languages 17 3. Research: Objectives & Approach 18 3.1 Research Objectives 18 3.2 Research Approach 19 Phase 1: Online Secondary Research 19 Phase 2: Qualitative Interviews 20 4. Understanding Voice Data 22 4.1 Availability of Data 24 4.1.1 Data access/procurement procedures 24 4.1.2 Dataset Licensing 26 4.2 Data collection mechanisms 27 4.2.1 Speech types 27 4.2.2 Data collection process 28 4.2.3 Recording environment 31 4.3 Quality of Voice data 31 4.3.1 Quality based on recording device used 32 4.3.2 Quality based on the recording environment 32 4.3.3 Quality based on the method of transcription 32 4.3.4 Quality based on the variety of Speakers 32 4.3.5 Quality based on domain coverage 33 4.4 Volume of Voice Data 33 4.4.1 Volume based on Hours of Speech data 34 4.4.2 Volume based on Number of Languages covered 35 5. Observations 38 5.1 State of open-source voice data in India 38 5.2 Working with Indian languages 39 5.3 Methods of Speech data collection 39 5.4 Indian voice technology community 39 5.5 Intent to Open-source 40 Recommendations 42 Appendices & References 44 Appendix 1: List of contributors 45 Appendix 2: Abbreviations 46 Appendix 3: List of Datasets 47 Appendix 4: Dataset Details & Source Links 52 References 55 LIST OF TABLES Table 1: Families of Indian languages along with number of speakers (Census of India, 2011) Table 2: Prominent data hosting platforms based on volume of data, number of languages and number of speakers Table 3: Voice data collection mechanisms with their pros and cons LIST OF FIGURES Figure 1: Top 11 Indian languages based on number of speakers (Census of India, 2011) Figure 2: States and union territories of India by the most commonly spoken rst language (50th Report of e Commissioner for Linguistics Minorities in India). Figure 3: Startup market share (State of Articial Intelligence in India - 2020)[8] Figure 4: Smart Speakers (Left: Google Nest Mini, Right: Amazon Echo) Figure 5: Number of Indian speech datasets listed during secondary research Figure 6: e Research Process Flowchart Figure 7: Top 10 Indian languages based on their data volume (in hours) for ASR (Appendix 3) Figure 8: Top 10 Indian languages based on their data volume (in hours) for TTS (Appendix