Phonbank and Data Sharing: Recent Developments in European Portuguese
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 6562–6570 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC PhonBank and Data Sharing: Recent Developments in European Portuguese + Ana Margarida Ramalho*, M. João Freitas*, Yvan Rose + *University of Lisbon / CLUL; Memorial University + *Lisboa, Portugal; St. John’s, NL Canada [email protected], [email protected], [email protected] Abstract This paper presents the recently published RAMALHO-EP and PHONODIS corpora. Both include European Portuguese production data from Portuguese children with typical (RAMALHO-EP) and protracted (PHONODIS) phonological development. The data in the two corpora were collected using the phonological assessment tool CLCP-EP, developed in the context of the Crosslinguistic Child Phonology Project, coordinated by Barbara Bernhardt and Joe Stemberger (University of British Columbia (UBC), Canada). Both corpora are part of the PhonBank Project (Brian MacWhinney (Carnegie Mellon, USA) and Yvan Rose (Memorial University of Newfoundland, Canada), which is the child phonology component of TalkBank, coordinated by Brian MacWhinney. The data at PhonBank is edited in Phon format, a language tool designed and built by Yvan Rose and Greg Hedlund (Memorial University of Newfoundland) and widely used by researchers working in the field of phonological acquisition. RAMALHO-EP contains production data from 87 typically developing children, aged 2;11 to 6;04, all monolinguals. PHONODIS includes production data from 22 children diagnosed with different types of speech and language disorders, all EP monolinguals, aged 3;2 to 11,05. Both corpora are open access language resources and contribute to enlarge the amount of production data on the acquisition of European Portuguese available in PhonBank. Keywords: Phonology, Phonetics, Acquisition, Phon, PhonBank, Data sharing, European Portuguese platform1 has been offering compelling solutions to 1. Introduction these two challenges, through a combination of technical innovations in the areas of phonological and Big Data research enables computer-assisted, broad- phonetic data representation and annotation, assembled based generalizations over rich datasets, which cannot within a software package called Phon2 and the building be obtained through traditional methods of observation. of a public (web-accessible) database documenting However, these generalizations often come with phonological development and speech disorders across partially or completely untested assumptions about the a number of different languages and speaker irrelevance of potential noise in the data, the nature of populations. A key feature of PhonBank is that it relies which may influence results in ways which remain on the commitment of the users of Phon toward data elusive to the researcher. In light of this, and also given sharing in order to construct the database component of the fact that some types of Big Data corpora are not this project: As scholars of phonological development easily obtainable, due to the nature of the data, the need and speech disorders benefit from the functionality for precisely-analyzable corpora remains across many available through Phon for their research, their areas of research. publication works and, in particular, the contribution of This is the case for the study of phonological their corpus data to the PhonBank database enable its development and speech disorders, two disciplines construction in ways that make Big Data research which require access to phonetically-transcribed sets of possible in the long term. data documenting different populations of speakers (e.g. In this paper, we present an overview of this project and typically-developing vs. disordered) across different highlight how it offered a framework for corpus languages and, in the context of multilingual building and analysis based on a series of projects populations, across different combinations of languages. recently developed based on European Portuguese (EP) Current systems for the automatic transcription of speech data. We begin by situating PhonBank within speech data are not reliable enough to be used as a basis the larger context of corpus-based research on language for corpus building at the moment, especially given that development. We then highlight the main components child, disordered, or accented speech are typically of Phon for corpus building and analysis. This provides transcribed very poorly through these systems. This the technical basis for our discussion of how these implies that corpus building in these areas must rely on resources have been used for research on European human transcribers, whose work can be assisted through Portuguese, which we highlight through some of the computer-assisted methods of data measurement (e.g. key findings emanating from this research. Finally, we acoustic analysis) or data annotation (see below). This highlight how these results have now been incorporated requirement in turn poses a major roadblock for the within PhonBank in ways which contribute to cross- development of Big Data corpora for research. linguistic, Big Data research. The PhonBank project, which supports researchers and students through software programs and a data sharing 1 https://phonbank.talkbank.org 2 https://www.phon.ca 6562 2. CHILDES, TalkBank, PhonBank study of phonology and phonological development, for example in the area of segmental and syllabic patterns PhonBank comes from a rich tradition of corpus-based of behaviours (e.g. phones and phonological features; research in language acquisition, which itself has its syllable onsets vs. codas, stress, tone, and so on). In roots all the way to Charles Darwin’s (1877) description addition to this, at the time of PhonBank inception, in of gestural development by his son. This work provided 2006, recent progress in phonetic science was making a model application of Darwin’s descriptive techniques acoustic analyses of child speech increasing compelling for work on natural selection and evolution to the study for our understanding of phonological development. In of human development (MacWhinney, 2000). Limited order to address this need, Praat libraries for the by the means of research available to individuals until acoustic analysis speech5 were subsequently integrated the democratization of computers in the early 1980s, within Phon. We provide more detail on the functions corpus-based research on language acquisition remained supported by Phon in the next section. central to virtually all studies of language acquisition since Darwin, however in ways that were limiting both 3. Phon the level of annotation obtained and the analysis of these annotations. Phon, an open-source (free) software program, was first 3 released in 2006 (Rose et al., 2006). Among other The Child Language Exchange System (CHILDES , co- functions, Phon supports the following: founded by Brian MacWhinney and Catherine Snow in 1983, opened the door for the building of the first large- • Media (audio/video) time-alignment to data scale database documenting child language. To this day, transcripts CHILDES remains the primary source of data and • Orthographic and phonetic transcription research tools for the study of child language • Segmental, tonal and stress-level annotation of the worldwide, through its combination of publicly- phonetic transcriptions accessible datasets documenting child language across • Segment-by-segment alignment between target different languages and learning situations, and (model) and actual (speaker-produced) forms computer programs for the analysis of these datasets, • Data query and reporting in reference to segmental assembled within the CLAN software package. and prosodic aspects of the transcribed forms CHILDES has since provided a model (and inspiration) • Acoustic analysis for the creation of similar databases, for the study of These general functions are supplemented by a series of other topics relevant to human language and additional features within Phon which help the creation communication (e.g. AphasiaBank, BilingBank, and maintenance of structured databases, including FluencyBank) which together form the TalkBank 4 format checks and the ability to import and export data consortium . In addition to the rich documentation it from and to third-party applications. provides to scholars of language, language acquisition, pathologies, etc., one of the most central assets of the After the research has diarized the corpus into speaker- TalkBank databases is their adherence to relatively specific utterances (which can range from isolated word strict data formats, in particular the CHAT format forms to full utterances, each associated with a given supported by CLAN and its corresponding. This cross- speaker) and completed the orthographic transcription corpus consistency greatly facilitates the large-scale of the corpus (a task that remains essentially manual in study of the different datasets. the absence, still today, of reliable and/or easily accessible speech recognition systems), Phon can PhonBank, came into existence first as a subset of the automatically generate broad phonetic transcriptions CHILDES database; through successive rounds of (following the standards of the International Phonetic funding from the National Institutes of Health, Association6 using either built-in dictionaries IPA- PhonBank subsequently grew into becoming an transcribed forms (current