Multimedia Corpora (Media Encoding and Annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek)

Multimedia Corpora (Media encoding and annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek) Draft submitted to CLARIN WG 5.7. as input to CLARIN deliverable D5.C3 “Interoperability and Standards” [http://www.clarin.eu/system/files/clarindeliverableD5C3_v1_5finaldraft.pdf] Table of Contents 1 General distinctions / terminology................................................................................................................................... 1 1.1 Different types of multimedia corpora: spoken language vs. speech vs. phonetic vs. multimodal corpora vs. sign language corpora......................................................................................................................................................... 1 1.2 Media encoding vs. Media annotation................................................................................................................... 3 1.3 Data models/file formats vs. Transcription systems/conventions.......................................................................... 3 1.4 Transcription vs. Annotation / Coding vs. Metadata ............................................................................................. 3 2 Media encoding ............................................................................................................................................................... 5 2.1 Audio encoding ..................................................................................................................................................... 5 2.2 Video encoding ..................................................................................................................................................... 5 3 Media annotation ............................................................................................................................................................. 6 3.1 Tools and tool formats........................................................................................................................................... 6 3.1.1 ANVIL (Annotation of Video and Language Data).................................................................................... 6 3.1.2 CLAN (Computerized Language Analysis)/CHAT (Codes for the Human Analysis of Transcripts) / Talkbank XML............................................................................................................................................ 7 3.1.3 ELAN (EUDICO Linguistic Annotator) / EAF (ELAN Annotation Format)............................................. 8 3.1.4 EXMARaLDA (Extensible Markup Language for Discourse Annotation) ................................................ 9 3.1.5 FOLKER (FOLK Editor).......................................................................................................................... 10 3.1.6 Praat / TextGrid ........................................................................................................................................ 11 3.1.7 Transcriber................................................................................................................................................ 12 3.1.8 Other tools ................................................................................................................................................ 13 3.2 Generic formats and frameworks ........................................................................................................................ 14 3.2.1 TEI transcriptions of speech ..................................................................................................................... 14 3.2.2 Annotation Graphs / Atlas Interchange Format / Multimodal Exchange Format...................................... 15 3.2.3 NXT (NITE XML Toolkit)....................................................................................................................... 15 3.3 Other formats ...................................................................................................................................................... 15 3.4 Interoperability of tools and formats ................................................................................................................... 16 3.5 Transcription conventions / Transcription systems ............................................................................................. 17 3.5.1 Systems for phonetic transcription............................................................................................................ 17 3.5.2 Systems for orthographic transcription..................................................................................................... 17 3.5.3 Systems for sign language transcription ................................................................................................... 18 3.5.4 Commonly used combinations of formats and conventions...................................................................... 18 4 Summary / Recommendations ....................................................................................................................................... 20 4.1 Media Encoding .................................................................................................................................................. 20 4.2 Media Annotation................................................................................................................................................ 20 5 References ..................................................................................................................................................................... 22 1 General distinctions / terminology By a Multimedia Corpus we understand a systematic collection of language resources involving data in more than one medium. Typically, a multimedia corpus consists of a set of digital audio and/or video data and a corresponding set of textual data (the transcriptions and/or annotations of the audio or video data). Before we start discussing tools, models, formats and standards used for multimedia corpora in the following sections, we will devote a few paragraphs to clarifying our terminology. 1.1 Different types of multimedia corpora: spoken language vs. speech vs. phonetic vs. multimodal corpora vs. sign language corpora Multimedia corpora are used in a variety of very different domains including, among others, speech technology, several subfields of linguistics (e.g. phonetics, sociolinguistics, conversation analysis), media sciences and sociology. Consequently, there is no consistent terminology, let alone a unique taxonomy to classify and characterise different types of multimedia corpora. Instead of attempting to define such a terminology here, we will characterise different types of corpora by describing some of their prototypical exponents. The categories and their characterising features are meant neither to be mutually exclusive (in 1 fact, many existing corpora are a mixture of the types), nor to necessarily cover the whole spectrum of multimedia corpora. A Spoken Language (or: Spontaneous Speech) Corpus is a corpus constructed with the principal aim of investigating language as used in spontaneous spoken everyday interaction. A typical spoken language corpus contains recordings of authentic (as opposed to experimentally controlled or read) dialogues or multilogues (as opposed to monologues). It is transcribed orthographically, typically in a modified orthography which takes into account characteristic features of spontaneous speech (like elisions, assimilations or dialectal features), and it takes care to note carefully certain semi‐lexical (or paraverbal) phenomena like filled pauses (hesitation markers) or laughing. A spoken language corpus may contain information about behaviour in other modalities (like mimics and gesture) in addition to the transcribed speech, but its main concern lies with linguistic behaviour. In that sense, corpora used in conversation or discourse analysis (e.g. the Santa Barbara Corpus of Spoken American English or the FOLK corpus) are prototypes of spoken language corpora. Meeting corpora like the ISL Meeting Corpus or the AMI Meeting Corpus can be regarded as another group of typical spoken language corpora. Less prototypical, but equally relevant members of this class are child language corpora (e.g. corpora in the CHILDES database), corpora of narrative (or: ‘semi‐structured’) interviews (e.g. the SLX Corpus of Classic Sociolinguistic Interviews by William Labov), or corpora of task‐oriented communication (like Map Tasks, e.g. the HCRC Map Task Corpus). Often, spoken language corpora are restricted to a specific interaction type (e.g. doctor‐ patient communication, classroom discourse or ad‐hoc interpreting) or speaker type (e.g. adolescents or a certain social group). A Speech (Technology) Corpus is a corpus constructed with the principal aim of building, training or evaluating speech technology applications like speech recognizers, dialogue systems or text‐to‐speech‐ systems. A speech corpus typically contains recordings of read or prompted speech – by one speaker in a studio for text‐to‐speech systems, or by more than hundreds of persons for speech recognition systems. The latter are often recorded in an office setting or recorded over mobile and/or fixed telephones. They are usually balanced over age and gender. Speech corpora are usually transcribed in standard orthography (possibly with labels for various acoustic noises and disturbances as well as truncated

Multimedia Corpora (Media Encoding and Annotation) (Thomas Schmidt, Kjell Elenius, Paul Trilsbeek)

Talk Bank: a Multimodal Database of Communicative Interaction

Child Language

Multilingvální Systémy Rozpoznávání Řeči a Jejich Efektivní Učení

Segmentability Differences Between Child-Directed and Adult-Directed Speech: a Systematic Test with an Ecologically Valid Corpus

Here ACL Makes a Return to Asia!

Conference Abstracts

Gold Standard Annotations for Preposition and Verb Sense With

A Massively Parallel Corpus: the Bible in 100 Languages

The Relationship Between Transitivity and Caused Events in the Acquisition of Emotion Verbs

The Workshop Programme

Distributional Properties of Verbs in Syntactic Patterns Liam Considine

The Field of Phonetics Has Experienced Two