Linking Language Resources and NLP papers


Gil Francopoulo, LIMSI, CNRS, Université Paris-Saclay + Tagmatica (France)
Joseph Mariani, LIMSI, CNRS, Université Paris-Saclay (France)
Patrick Paroubek, LIMSI, CNRS, Université Paris-Saclay (France)

Abstract

The Language Resources and Evaluation Map (LRE Map) is an accessible database on Language Resources, based on records collected during the submission of several major Speech and Natural Language Processing (NLP) conferences, including the Language Resources and Evaluation Conferences (LREC). NLP4NLP is a very large corpus of scientific papers in the field of Speech and Natural Language Processing, covering a large number of conferences and journals in that field. In this article, we establish the link between these two elements in order to study the mentions of LRE Map resource names within the NLP4NLP corpus.

Keywords: Resource Citation, Named Entity Detection, Informetrics, Scientometrics, Text Mining, LRE Map

1. Introduction

Our work is based on the hypothesis that names, in this case language resource names, correlate with the study, use and improvement of the referred objects, in this case language resources. We believe that automatic (and objective) detection of such names is a step towards improving the reliability of language resources, as argued in [Branco 2013].

We already have an idea of how resources are used in the recent venues of conferences such as Coling and LREC, as the LRE Map is built from the resources declared by the authors of these conferences [Calzolari et al 2012]. But what about the other conferences and the other years? This is the subject of the present study.

2. Situation with respect to other studies

The approach is to apply NLP tools to texts about NLP itself, taking advantage of the fact that we have a good knowledge of the domain ourselves. Our work follows the various studies presented and initiated in the workshop "Rediscovering 50 Years of Discoveries in Natural Language Processing", held on the occasion of ACL's 50th anniversary in 2012 [Radev et al 2013], where a group of researchers studied the content of the corpus recorded in the ACL Anthology [Bird et al 2008]. Various studies based on the same corpus followed, for instance [Bordea et al 2014] on trend analysis, and resulted in systems such as Saffron (http://saffron.deri.ie) or the Michigan Univ. web site (http://clair.eecs.umich.edu/aan/index.php). Other studies were conducted by ourselves specifically on speech-related archives [Mariani et al 2013] and on the LREC archives [Mariani et al 2014a], but there the target was to detect the terminology used within the articles; the focus was not to detect resource names. More closely related to the current workshop topic is the study conducted by the Linguistic Data Consortium (LDC) team, whose goal was, and still is, to build a language resource (LR) database documenting the use of the LDC resources [Ahtaridis et al 2012]. At the time of publication (i.e. 2012), the LDC team had found 8,000 references, and the problems encountered were documented in [Mariani et al 2014b].

3. Our approach

The general principle is to confront the names of the LRE Map with the newly collected NLP4NLP corpus. The process is as follows (a minimal sketch is given after this list):

• consider the archives of (most of) the NLP field,
• take a named entity detector which is able to work with a given list of proper names,
• use the LRE Map as that list of proper names,
• run the application and study the results.
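To make the last two steps concrete, here is a minimal sketch in Java, assuming the corpus has already been converted to plain text and the cleaned LRE Map headwords have been exported to a text file. The file names and the one-hit-per-document counting policy are illustrative assumptions; the actual system relies on TagParser's entity linking (see section 5) rather than raw string matching.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    // Hypothetical sketch: count in how many documents each LRE Map name
    // appears, by raw string matching over extracted plain-text articles.
    public class MentionCounter {
        public static void main(String[] args) throws IOException {
            List<String> names = Files.readAllLines(Paths.get("lremap-names.txt"));
            Map<String, Integer> docCounts = new HashMap<>();
            try (DirectoryStream<Path> docs =
                     Files.newDirectoryStream(Paths.get("corpus-txt"), "*.txt")) {
                for (Path doc : docs) {
                    String text = new String(Files.readAllBytes(doc));
                    for (String name : names) {
                        if (text.contains(name)) {
                            docCounts.merge(name, 1, Integer::sum); // one hit per document
                        }
                    }
                }
            }
            docCounts.forEach((name, n) -> System.out.println(name + "\t" + n));
        }
    }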
4. Archives of a large part of the NLP field

The corpus is a large collection from our own research field, i.e. NLP, covering both the written and speech sub-domains and extended to a limited number of corpora for which Information Retrieval and NLP activities intersect. This corpus was collected at IMMI-CNRS and LIMSI-CNRS (France) and is named NLP4NLP (see www.nlp4nlp.org). It currently contains 65,003 documents coming from various conferences and journals with either public or restricted access. This is a large part of the existing published articles in our field, apart from workshop proceedings and published books. Despite the fact that they often reflect innovative trends, we did not include workshops, as they may be based on various reviewing processes and as access to their content may sometimes be difficult. The time period spans from 1965 to 2015. Broadly speaking, and aside from the small corpora, one third comes from the ACL Anthology (http://aclweb.org/anthology), one third from the ISCA Archive (www.isca-speech.org/iscaweb/index.php/archive/online-archive) and one third from IEEE (https://www.ieee.org/index.html).

The corpus follows the organization of the ACL Anthology, with two parts in parallel. For each document, on one side the metadata is recorded, with the author names and the title; on the other side, the PDF document is recorded on disk in its original form. Each document is labeled with a unique identifier; for instance, "lrec2000_1" is reified on the hard disk as two files: "lrec2000_1.bib" and "lrec2000_1.pdf". When recorded as an image, the PDF content is extracted by means of Tesseract OCR. The automatic test leading to the call (or not) of the OCR is implemented by means of some PDFBox API calls; for all the other documents, other PDFBox API calls are applied in order to extract the textual content (a sketch of such a test is given below). See [Francopoulo et al 2015] for more details about the extraction process, as well as the solutions to some tricky problems such as the management of joint conferences.

The majority (90%) of the documents come from conferences, the rest from journals. The overall number of words is 270M. Initially, the texts are in four languages: English, French, German and Russian. The number of texts in German and Russian is less than 0.5%; they are detected automatically and are ignored. The texts in French are a little more numerous (3%), so they are kept with the same status as the English ones. This is not a problem because our tool is able to process both English and French. The number of different authors is 48,894. The detail is presented in table 1.
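The text-versus-image decision could be implemented along the following lines. PDDocument.load and PDFTextStripper.getText are standard PDFBox 2.x calls; the characters-per-page threshold and the assumption that image-only pages have been rendered to a PNG beforehand are illustrative choices, not the authors' actual criteria.

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Hypothetical sketch: use the PDFBox text layer when it looks genuine,
    // otherwise fall back to the Tesseract command-line OCR tool.
    public class TextExtractor {
        static final int MIN_CHARS_PER_PAGE = 100; // assumed heuristic threshold

        public static String extract(File pdf) throws IOException, InterruptedException {
            try (PDDocument doc = PDDocument.load(pdf)) {
                String text = new PDFTextStripper().getText(doc);
                if (text.length() >= MIN_CHARS_PER_PAGE * doc.getNumberOfPages()) {
                    return text; // a real text layer: keep the PDFBox output
                }
            }
            // Image-only PDF: assumes the pages were rendered to <file>.png beforehand.
            Process ocr = new ProcessBuilder(
                "tesseract", pdf.getPath() + ".png", pdf.getPath() + ".ocr").start();
            ocr.waitFor();
            return new String(java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get(pdf.getPath() + ".ocr.txt")));
        }
    }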
5. Named Entity Detection

The aim is to detect a given list of names of resources, provided that the detection is robust enough to recognize and link as the same entry typographic variants such as "British National Corpus" vs "British National corpus", as well as more elaborate aliases like "BNC". Said in other terms, the aim is not only to recognize given raw character strings but also to link names together, a process often labeled "entity linking" in the literature [Guo et al 2011][Moro et al 2014]. We use the industrial Java-based parser TagParser [Francopoulo 2007] which, after a deep robust parsing for English and French, performs a named entity detection and then an entity linking processing. The system is hybrid, combining a statistical chunker, a large language-specific lexicon, and a multilingual knowledge base with a hand-written set of rules for the final selection of the named entities and their entity linking.

6. The LRE Map

The LRE Map is a freely accessible large database on resources dedicated to Natural Language Processing (NLP). The original feature of the LRE Map is that the records are collected during the submission of different major NLP conferences. These records were collected directly from the authors, and an entry may be associated with some alternate names. The number of entries was originally 4,396. Each entry has been defined with a headword like "British National Corpus", and some of them are associated with alternate names like "BNC". We further cleaned the data by regrouping the duplicate entries, by omitting the version number which was associated with the resource name in some entries, and by ignoring the entries which were not labeled with a proper name but only through a textual definition, as well as those which had no name at all. Once cleaned, the number of entries is now 1,301, all of them with a distinct proper name. All the LRE Map entries are classified according to a very detailed set of resource types. We reduced the number of types to 5 broad categories: NLPCorpus, NLPGrammar, NLPLexicon, NLPSpecification and NLPTool, with the convention that when a resource is both a specification and a tool, the "specification" type is retained. An example is ROUGE, which is both a set of metrics and a software package implementing those metrics, for which we chose the "specification" type.

7. Connection of LRE Map with TagParser

TagParser is natively associated with a large multilingual knowledge base made from Wikidata and Wikipedia, whose name is Global Atlas [Francopoulo et al 2013]. Of course, at the beginning, this knowledge base did not contain all the names of the LRE Map: only 30 resource names were known, such as "Wikipedia" or "WordNet". During the preparation of the experiment, a data fusion was applied between the two lists to incorporate the LRE Map into the knowledge base.

8. Running session and post-processing

The named entity detection is applied to the whole corpus on a middle-range machine, i.e. one Xeon E3-1270V2 with 32Gb of memory. A post-processing step retains only the linked entities of the types NLPCorpus, NLPGrammar, NLPLexicon, NLPSpecification and NLPTool (a sketch of this filter is given below). The results are then gathered to compute a readable synthesis as an HTML file, which is too big to be presented here; the interested reader may consult the file "lremap.html" on www.nlp4nlp.org. Let's add that the whole computation takes 95 minutes.
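For illustration only, the post-processing filter might look like the sketch below: the five type names come from the paper, while the LinkedEntity record and the aggregation into per-resource counts are assumptions about the shape of the detector's output, not TagParser's real API.

    import java.util.*;
    import java.util.stream.Collectors;

    // Hypothetical sketch of the post-processing: keep only linked entities of
    // the five broad LRE Map types and count mentions per resource headword,
    // as input for an HTML synthesis such as "lremap.html".
    public class PostProcessor {
        static final Set<String> KEPT_TYPES = Set.of(
            "NLPCorpus", "NLPGrammar", "NLPLexicon", "NLPSpecification", "NLPTool");

        // Assumed output record of the entity linking stage (not TagParser's API).
        record LinkedEntity(String headword, String type, String docId) {}

        public static Map<String, Long> synthesize(List<LinkedEntity> entities) {
            return entities.stream()
                .filter(e -> KEPT_TYPES.contains(e.type()))
                .collect(Collectors.groupingBy(LinkedEntity::headword,
                                               TreeMap::new,
                                               Collectors.counting()));
        }
    }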
Recommended publications
  • Talk Bank: a Multimodal Database of Communicative Interaction
    Talk Bank: A Multimodal Database of Communicative Interaction. 1. Overview. The ongoing growth in computer power and connectivity has led to dramatic changes in the methodology of science and engineering. By stimulating fundamental theoretical discoveries in the analysis of semistructured data, we can extend these methodological advances to the social and behavioral sciences. Specifically, we propose the construction of a major new tool for the social sciences, called TalkBank. The goal of TalkBank is the creation of a distributed, web-based data archiving system for transcribed video and audio data on communicative interactions. We will develop an XML-based annotation framework called Codon to serve as the formal specification for data in TalkBank. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. The TalkBank project will establish a framework that will facilitate the development of a distributed system of allied databases based on a common set of computational tools. Instead of attempting to impose a single uniform standard for coding and annotation, we will promote annotational pluralism within the framework of the abstraction layer provided by Codon. This representation will use labeled acyclic digraphs to support translation between the various annotation systems required for specific sub-disciplines. There will be no attempt to promote any single annotation scheme over others. Instead, by promoting comparison and translation between schemes, we will allow individual users to select the custom annotation scheme most appropriate for their purposes.
  • MASC: the Manually Annotated Sub-Corpus of American English
    MASC: The Manually Annotated Sub-Corpus of American English. Nancy Ide (Vassar College, Poughkeepsie, New York, USA), Collin Baker (International Computer Science Institute, Berkeley, California, USA), Christiane Fellbaum (Princeton University, Princeton, New Jersey, USA), Charles Fillmore (International Computer Science Institute, Berkeley, California, USA), Rebecca Passonneau (Columbia University, New York, New York, USA). Abstract: To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing
  • Multilingual Speech Recognition Systems and Their Effective Learning (Multilingvální systémy rozpoznávání řeči a jejich efektivní učení)
    Multilingual Speech Recognition Systems and Their Effective Learning. Dissertation. Study programme: P2612 Electrotechnics and Informatics; study branch: 2612V045 Technical Cybernetics. Author: Ing. Radek Šafařík. Supervisor: prof. Ing. Jan Nouza, CSc. Liberec, 2020. Abstract: The dissertation thesis deals with the creation of automatic speech recognition (ASR) systems and with the effective adaptation of an already existing system to a new language. Today's ASR systems have a modular structure where the individual modules can be considered language dependent or independent. The main goal of this thesis is the research and development of methods that automate the development of the language-dependent modules and make it as effective as possible, using freely available data from the internet, machine learning methods and similarities between languages.
  • Here ACL Makes a Return to Asia!
    Handbook Production: Jung-jae Kim, Nanyang Technological University, Singapore. Message from the General Chair: Welcome to Jeju Island — where ACL makes a return to Asia! As General Chair, I am indeed honored to pen the first words of the ACL 2012 proceedings. In the past year, research in computational linguistics has continued to thrive across Asia and all over the world. On this occasion, I share with you the excitement of our community as we gather again at our annual meeting. On behalf of the organizing team, it is my great pleasure to welcome you to Jeju Island and ACL 2012. In 2012, ACL turns 50. I feel privileged to chair the conference that marks such an important milestone for our community. We have prepared special programs to commemorate the 50th anniversary, including 'Rediscovering 50 Years of Discovery', a main conference workshop chaired by Rafael Banchs with a program on 'the People, the Contents, and the Anthology', which recollects some of the great moments in ACL history, and 'ACL 50th Anniversary Lectures' by Mark Johnson, Aravind K. Joshi and a Lifetime Achievement Award Recipient. A large number of people have worked hard to bring this annual meeting to fruition. It has been an unforgettable experience for everyone involved. My deepest thanks go to the authors, reviewers, volunteers, participants, and all members and chairs of the organizing committees. It is your participation that makes a difference. Program Chairs, Chin-Yew Lin and Miles Osborne, deserve our gratitude for putting an immense amount of work to ensure that each of the 940 submissions was taken care of.
  • Multimedia Corpora (Media Encoding and Annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek)
    Multimedia Corpora (Media encoding and annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek). Draft submitted to CLARIN WG 5.7 as input to CLARIN deliverable D5.C-3 "Interoperability and Standards" [http://www.clarin.eu/system/files/clarin-deliverable-D5C3_v1_5-finaldraft.pdf]. Table of Contents:
    1 General distinctions / terminology
    1.1 Different types of multimedia corpora: spoken language vs. speech vs. phonetic vs. multimodal corpora vs. sign language corpora
    1.2 Media encoding vs. media annotation
    1.3 Data models/file formats vs. transcription systems/conventions
    1.4 Transcription vs. annotation / coding vs. metadata
    2 Media encoding
    2.1 Audio encoding
    2.2
  • Background and Context for CLASP
    Background and Context for CLASP. Nancy Ide, Vassar College.
    The Situation: standards efforts have been on-going for over 20 years, with interest and activity mainly in Europe in the 90's and early 2000's. Text Encoding Initiative (TEI), 1987: still ongoing, used mainly by the humanities. EAGLES/ISLE: developed standards for morpho-syntax, syntax, sub-categorization, etc. (links on the CLASP wiki). Corpus Encoding Standard (now XCES, http://www.xces.org).
    Main Aspects: harmonization of formats for linguistic data and annotations; harmonization of descriptors in linguistic annotation. These two are often mixed, but need to be dealt with separately (see the CLASP wiki).
    Formats, the past 20 years: 1987 TEI (myriad of formats); 1994 MULTEXT, CES; ~1996 XML; 2000 ISO TC37 SC4; 2001 LAF model introduced, now LAF/GrAF, ISO standards (myriad of formats).
    Actually, things are better now: XML use; moves toward common models, especially in Europe; the US community seeing the need for interoperability; emergence of common processing platforms (GATE, UIMA) with underlying common models.
    Resources, 1990 to present: WordNet gains ground as a "standard" LR; Penn Treebank, Wall Street Journal Corpus; World Wide Web; British National Corpus; EuroWordNet; XML; Comlex; FrameNet; American National Corpus; Semantic Web; Global WordNet; more FrameNets; SUMO; VerbNet; PropBank, NomBank; MASC.
    NLP software: 1994 MULTEXT > LT tools, LT XML; 1995 GATE (Sheffield); 1996-1998 Alembic Workbench, ATLAS (NIST) (what happened to this?); 2003 Callisto; 200? UIMA. Now: GATE
  • Informatics 1: Data & Analysis
    Informatics 1: Data & Analysis. Lecture 12: Corpora. Ian Stark, School of Informatics, The University of Edinburgh. Friday 27 February 2015, Semester 2 Week 6. http://www.inf.ed.ac.uk/teaching/courses/inf1/da
    Student Survey Final Day: ESES, the Edinburgh Student Experience Survey (http://www.ed.ac.uk/students/surveys). Please log on to MyEd before 1 March to complete the survey. Help guide what we do at the University of Edinburgh, improving your future experience here and that of the students to follow.
    Lecture Plan. XML: we start with technologies for modelling and querying semistructured data (semistructured data: trees and XML; schemas for structuring XML; navigating and querying XML with XPath). Corpora: one particular kind of semistructured data is large bodies of written or spoken text, each one a corpus, plural corpora (what they are and how to build them; applications: corpus analysis and data extraction).
    Homework. Tutorial Exercises: Tutorial 5 exercises went online earlier this week. In these you use the xmllint command-line tool to check XML validity and run your own XPath queries. Reading: T. McEnery and A. Wilson, Corpus Linguistics, second edition, Edinburgh University Press, 2001; Chapter 2: What is a corpus and what is in it? (§2.2.2 optional). Photocopied handout, also available from the ITO.
    Remote Access to DICE: much coursework can be done on your own machines, but sometimes it's important to be able to connect to and use DICE systems.
  • The Workshop Programme
    The Workshop Programme
    Monday, May 26
    14:30-15:00 Opening (Nancy Ide and Adam Meyers)
    15:00-15:30 SIGANN Shared Corpus Working Group Report (Adam Meyers)
    15:30-16:00 Discussion: SIGANN Shared Corpus Task
    16:00-16:30 Coffee break
    16:30-17:00 Towards Best Practices for Linguistic Annotation (Nancy Ide, Sameer Pradhan, Keith Suderman)
    17:00-18:00 Discussion: Annotation Best Practices
    18:00-19:00 Open Discussion and SIGANN Planning
    Tuesday, May 27
    09:00-09:10 Opening (Nancy Ide and Adam Meyers)
    09:10-09:50 From structure to interpretation: A double-layered annotation for event factuality (Roser Saurí and James Pustejovsky)
    09:50-10:30 An Extensible Compositional Semantics for Temporal Annotation (Harry Bunt, Chwhynny Overbeeke)
    10:30-11:00 Coffee break
    11:00-11:40 Using Treebank, Dictionaries and GLARF to Improve NomBank Annotation (Adam Meyers)
    11:40-12:20 A Dictionary-based Model for Morpho-Syntactic Annotation (Cvetana Krstev, Svetla Koeva, Duško Vitas)
    12:20-12:40 Multiple Purpose Annotation using SLAT - Segment and Link-based Annotation Tool (DEMO) (Masaki Noguchi, Kenta Miyoshi, Takenobu Tokunaga, Ryu Iida, Mamoru Komachi, Kentaro Inui)
    12:40-14:30 Lunch
    14:30-15:10 Using inheritance and coreness sets to improve a verb lexicon harvested from FrameNet (Mark McConville and Myroslava O. Dzikovska)
    15:10-15:50 An Entailment-based Approach to Semantic Role Annotation (Voula Gotsoulia)
    16:00-16:30 Coffee break
    16:30-16:50 A French Corpus Annotated for Multiword Expressions with Adverbial Function (Eric Laporte, Takuya Nakamura, Stavroula Voyatzi)
  • The Expanding Horizons of Corpus Analysis
    The expanding horizons of corpus analysis Brian MacWhinney Carnegie Mellon University Abstract By including a focus on multimedia interactions linked to transcripts, corpus linguistics can vastly expand its horizons. This expansion will rely on two continuing developments. First, we need to develop easily used methods for each of the ten analytic methods we have examined, including lexical analyses, QDA (qualitative data analysis), automatic tagging, language profiles, group comparisons, change scores, error analysis, feedback studies, conversation analysis, and modeling. Second, we need to work together to construct a unified database for language studies and related sciences. This database must be grounded on the principles of open access, data-sharing, interoperability, and integrated structure. It must provide powerful tools for searching, multilinguality, and multimedia analysis. If we can build this infrastructure, we will be able to explore more deeply the key questions underlying the structure and functioning of language, as it emerges from the intermeshing of processes operative on eight major timeframes. 1. Introduction Corpus linguistics has benefitted greatly from continuing advances in computer and Internet hardware and software. These advances have made it possible to develop facilities such as BNCweb (bncweb.lancs.ac.uk), LDC (Linguistic Data Consortium) online, the American National Corpus (americannationalcorpus.org), TalkBank (talkbank.org), and CHILDES (childes.psy.cmu.edu). In earlier periods, these corpora were limited to written and transcribed materials. However, most newer corpora now include transcripts linked to either audio or video recordings. The development of this newer corpus methodology is facilitated by technology which makes it easy to produce high-quality video recordings of face-to-face interactions.
  • The 2Nd Workshop on Arabic Corpora and Processing Tools 2016 Theme: Social Media Workshop Programme
    The 2nd Workshop on Arabic Corpora and Processing Tools 2016. Theme: Social Media. Workshop Programme. Date: 24 May 2016.
    09:00-09:20 Welcome and Introduction by Workshop Chairs
    09:20-10:30 Session 1 (keynote speech): Nizar Habash, Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions
    10:30-11:00 Coffee break
    10:30-13:00 Session 2:
    Soumia Bougrine, Hadda Cherroun, Djelloul Ziadi, Abdallah Lakhdari and Aicha Chorana, Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
    Maha Alamri and William John Teahan, Towards a New Arabic Corpus of Dyslexic Texts
    Ossama Obeid, Houda Bouamor, Wajdi Zaghouani, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer, MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
    Wajdi Zaghouani and Dana Awad, Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
    Muhammad Abdul-Mageed, Hassan Alhuzali, Dua'a Abu-Elhij'a and Mona Diab, DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
    Nora Al-Twairesh, Mawaheb Al-Tuwaijri, Afnan Al-Moammar and Sarah Al-Humoud, Arabic Spam Detection in Twitter
    Editors: Hend Al-Khalifa (King Saud University, KSA); Abdulmohsen Al-Thubaity (King Abdul Aziz City for Science and Technology, KSA); Walid Magdy (Qatar Computing Research Institute, Qatar); Kareem Darwish (Qatar Computing Research Institute, Qatar).
    Organizing Committee: Hend Al-Khalifa (King Saud University, KSA); Abdulmohsen Al-Thubaity (King Abdul Aziz City for Science and Technology, KSA); Walid Magdy (Qatar Computing Research Institute,
  • Exploring Phone Recognition in Pre-Verbal and Dysarthric Speech
    Exploring Phone Recognition in Pre-verbal and Dysarthric Speech. Syed Sameer Arshad. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, University of Washington, 2019. Committee: Gina-Anne Levow, Gašper Beguš. Program Authorized to Offer Degree: Department of Linguistics. ©Copyright 2019 Syed Sameer Arshad.
    Abstract (Chair of the Supervisory Committee: Dr. Gina-Anne Levow, Department of Linguistics): In this study, we perform phone recognition on speech utterances made by two groups of people: adults who have speech articulation disorders and young children learning to speak. We explore how these utterances compare against those of adult English speakers who don't have speech disorders, training and testing several HMM-based phone recognizers across various datasets. Experiments were carried out via the HTK Toolkit with the use of data from three publicly available datasets: the TIMIT corpus, the TalkBank CHILDES database and the Torgo corpus. Several discoveries were made towards identifying best practices for phone recognition on the two subject groups, involving the use of optimized Vocal Tract Length Normalization (VTLN) configurations, phone-set reconfiguration criteria, specific configurations of extracted MFCC speech data and specific arrangements of HMM states and Gaussian mixture models.
    Preface: The work in this thesis is inspired by my life experiences in raising my nephew, Syed Taabish Ahmad. He was born in May 2000 and was diagnosed with non-verbal autism as well as apraxia-of-speech. His speech articulation has been severely impacted as a result, leading his speech production to be sequences of babbles.
  • Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon
    Jira: a Kurdish Speech Recognition System. Designing and Building Speech Corpus and Pronunciation Lexicon. Hadi Veisi (University of Tehran, Faculty of New Sciences and Technologies), Hawre Hosseini (Ryerson University, Electrical and Computer Engineering), Mohammad MohammadAmini (Avignon University, Laboratoire Informatique d'Avignon (LIA), mohammad.mohammadamini@univ-avignon.fr), Wirya Fathy (University of Tehran, Faculty of New Sciences and Technologies), Aso Mahmudi (University of Tehran, Faculty of New Sciences and Technologies).
    Abstract: In this paper, we introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira. The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries, but due to the lack of speech and text resources, there is no speech recognition system for this language. To fill this gap, we introduce the first speech corpus and pronunciation lexicon for the Kurdish language. Regarding the speech corpus, we designed a sentence collection in which the ratio of di-phones resembles the real data of the Central Kurdish language. The designed sentences were uttered by 576 speakers in a controlled environment with noise-free microphones (called AsoSoft Speech-Office) and in the Telegram social network environment using mobile phones (denoted as AsoSoft Speech-Crowdsourcing), resulting in 43.68 hours of speech. Besides, a test set including 11 different document topics is designed and recorded in two corresponding speech conditions (i.e., Office and Crowdsourcing). Furthermore, a 60K pronunciation lexicon is prepared in this research, in which we faced several challenges and proposed solutions for them.