QASR: QCRI Aljazeera Speech Resource--A Large Scale Annotated Arabic Speech Corpus
Hamdy Mubarak,¹ Amir Hussein,² Shammur Absar Chowdhury,¹ and Ahmed Ali¹
¹ Qatar Computing Research Institute, HBKU, Doha, Qatar
² Kanari AI, California, USA
[email protected], https://arabicspeech.org/
arXiv:2106.13000v1 [cs.CL] 24 Jun 2021

Abstract

We introduce the largest transcribed Arabic speech corpus, QASR¹, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, and speaker information, among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to the QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcripts. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.

¹ QASR (قصر) in Arabic means "Palace". The acronym stands for: QCRI Aljazeera Speech Resource.

1 Introduction

Research on Automatic Speech Recognition (ASR) has attracted a lot of attention in recent years (Chiu et al., 2018; Watanabe et al., 2018). Such success has brought remarkable improvements in reaching human-level performance (Xiong et al., 2016; Saon et al., 2017; Hussein et al., 2021). This has been achieved by the development of large spoken corpora with supervised (Panayotov et al., 2015; Ardila et al., 2019), semi-supervised (Bell et al., 2015; Ali et al., 2016), and more recently unsupervised (Valk and Alumäe, 2020; Wang et al., 2021) transcription. This work makes it possible either to reduce Word Error Rate (WER) considerably or to extract metadata from speech: dialect identification (Shon et al., 2020); speaker identification (Shon et al., 2019); and code-switching (Chowdhury et al., 2020b, 2021).

Natural Language Processing (NLP), on the other hand, values large amounts of textual information for designing experiments. NLP research for Arabic has achieved a milestone in the last few years in morphological disambiguation, Named Entity Recognition (NER) and diacritization (Pasha et al., 2014; Abdelali et al., 2016; Mubarak et al., 2019). The NLP stack for Modern Standard Arabic (MSA) has reached very high performance in many tasks. With the rise of Dialectal Arabic (DA) content online, more resources and models have been built to study DA textual dialect identification (Abdul-Mageed et al., 2020; Samih et al., 2017).

Our objective is to release the first Arabic speech and NLP corpus to study spoken MSA and DA. This is to enable empirical evaluation of learning more than the word sequence from the speech. In our view, existing speech and NLP corpora are missing the link between the two different modalities. Speech poses unique challenges such as disfluency (Pravin and Palanivelan, 2021), overlapping speech (Tripathi et al., 2020; Chowdhury et al., 2019), hesitation (Wottawa et al., 2020; Chowdhury et al., 2017), and code-switching (Du et al., 2021; Chowdhury et al., 2021). These challenges are often overlooked when it comes to NLP tasks, since they are not present in typical text data.

In this paper, we create and release² the largest corpus for transcribed Arabic speech. It comprises 2,000 hours of speech data with lightly supervised transcriptions. Our contributions are: (i) aligning the transcription with the corresponding audio segments, including punctuation, for building ASR systems; (ii) providing semi-supervised speaker identification and speaker linking per audio segment; (iii) releasing baseline results for acoustic and linguistic Arabic dialect identification and punctuation restoration; (iv) adding a new layer of annotation to the publicly available MGB-2 testset, for evaluating NER on speech transcription; (v) sharing code-switching data between Arabic and foreign languages for speech and text; and finally, (vi) releasing more than 130M words for Language Model (LM) training.

We believe that providing the research community with access to multi-dialectal speech data along with the corresponding NLP features will foster open research in several areas, such as the joint analysis of speech and NLP processing. Here, we build models and share the baseline results for all of the aforementioned tasks.

² Data can be obtained from: https://arabicspeech.org/qasr

1.1 Related work

The CallHome task within the NIST benchmark evaluations framework (Pallett, 2003) released one of the first transcribed Arabic dialect datasets. Over the years, NIST evaluations provided more dialectal data, mainly in Egyptian and Levantine dialects, as part of the language recognition evaluation campaign. Programs such as GALE and TRANSTAC (Olive et al., 2011) released more than 251 hours of Arabic data, including the first spoken Iraqi dialect data, among others. These datasets exposed the research community to the challenges of spoken dialectal Arabic and motivated the design of competitions to handle dialect identification and dialectal ASR, among others (see Ali et al. (2021) for details).

The following datasets are released from the Multi-Genre Broadcast (MGB) challenge: (i) MGB-2 (Ali et al., 2016): this dataset is the first milestone towards designing the first large scale continuous speech recognition for the Arabic language. The corpus contains a total of 1,200 hours of speech with lightly supervised transcriptions and is collected from the Aljazeera Arabic news channel, spanning many years. (ii) MGB-3 (Ali et al., 2017): focuses only on Egyptian Arabic broadcast data and comprises 16 hours. (iii) MGB-5 (Ali et al., 2019): consists of 13 hours of Moroccan Arabic speech data. In addition, the CommonVoice³ Arabic dataset, from the CommonVoice project, provides 49 hours of Modern Standard Arabic (MSA) speech data.⁴

Unlike MGB-2, the QASR dataset is the largest multi-dialectal corpus with linguistically motivated segmentation. The dataset includes multi-layer information that aids both the speech and NLP research communities. QASR is the first speech corpus to provide resources for benchmarking NER and punctuation restoration systems. For a close comparison between MGB-2 and QASR, see Table 1.

³ https://commonvoice.mozilla.org/en/datasets
⁴ As reported in June 2021.

Table 1: Comparison between MGB-2 and QASR.

                      MGB-2                                     QASR
Hours                 1,200                                     2,000
Dialects              MSA, GLF, LEV, NOR, EGY                   MSA, GLF, LEV, NOR, EGY
Segmentation          Influenced by silence and segment length  Linguistically and acoustically motivated
Transcription         Lightly supervised                        Lightly supervised
Punctuation           –                                         ✓
Code-Switching        –                                         ✓
Possible Turn-Ending  –                                         ✓
Speaker Names         ✓                                         ✓ (+ normalised names)
Speaker Gender        –                                         ✓ (2000 speakers), covers ≈82% data
Speaker Country       –                                         Manually annotated in testset
NER                   –                                         Manually annotated in testset

2 Corpus Creation

2.1 Data Collection

We obtained the Aljazeera Arabic news channel's archive (henceforth AJ), spanning 11 years from 2004 until 2015. It contains more than 4,000 episodes from 19 different programs. These programs cover different domains like politics, society, economy, sports, science, etc. For each episode, we have the following: (i) audio sampled at 16 kHz; (ii) manual transcription, where the textual transcriptions contained no timing information. The quality of the transcription varied significantly; the most challenging were conversational programs in which overlapping speech and dialectal usage were more frequent; and finally (iii) some metadata.

Table 2: Description of the updated MGB-2 testset

Item       Description
Hours      10
Episodes   17. Average episode duration = 34 min
Segments   8,014
Words      69,644. Unique words = 15,754
Speakers   111
Males      87 (78%)
Females    13 (11%)
Variety    MSA: 78%, Dialectal Arabic: 22%
Countries  Top 5 countries (based on dialectal segments): EG: 18%, SY: 11%, PS: 11%, DZ: 8%, SD: 7%
Genre      Top 5 topics: Politics: 69%, Society: 9%, Economy: 8%, Culture/Art: 4%, Health: 3%

Table 3: QASR Corpus Statistics

Item      Count   Notes
Hours     2,041
Episodes  3,545   Average episode duration = 32 min.
Segments  1.6M    Average segment duration = 4 sec; 84% of segments are [2-6] sec. Average segment len = 9 words; 80% of segments have [5-11] words.
Words     14.3M   Unique words = 360K
Speakers  27,977  Unique speakers = 11,092
Males     1,171   1.2M segments (69%)
Females   68      99K segments (6%)

... Barack Obama/President of USA, Barck Obama (typo), etc.). The list of guest speakers and episode topics is not comprehensive, with many spelling mistakes in the majority of metadata field names and attributes. To overcome these challenges, we applied several iterations of automatic parsing and extraction followed by manual verification and standardization.

For better evaluation of the QASR corpus, we reused the publicly available MGB-2 (Ali et al., 2016) testset, as it has been manually revised and comes from the same channel, thus making this testset ideal to evaluate the QASR corpus. It is worth noting that we ensure that the MGB-2 dev/test sets are not included