Brian MacWhinney, TalkBank, CMU

1 TalkBank

              CHILDES    TalkBank   Aphasia   PhonBank   HomeBank
Years         33         15         8         6          1
Words (mil)   59         47         1.5       0.7        2.0
Media         2.8 TB     1.1 TB     0.4 TB    0.6 TB     2 TB
Languages     34         18         6         13         3
Publications  7500       220        140       76         8
Active users  2500       634        520       154        42
Web hits      4,577,61   1,329,92   434,170   92,110     278,347

Focus Areas

Children:        CHILDES, PhonBank, Narrative, Bilingual
Clinical:        Aphasia, TBI, Dementia, Fluency
Adult:           CABank, Tutoring, Medical, ClassBank
Multilingualism: BilingBank, SLABank, Online Tutors, CapVid

3 TalkBank Principles

❖ Standard format — CHAT, variable levels of detail

❖ Transcripts linked to media

❖ Multilingual (43 languages), multiscript

❖ Open and free access

❖ Analytic programs — CLAN, tutorials, MOR grammars

❖ TalkBank is a CLARIN-B Center, Core Trust Seal

❖ Metadata: OLAC, CMDI, VLO

❖ Interoperable with other resources: R, ELAN, Praat, SALT

A Tour of the Websites

❖ All reachable from https://talkbank.org

❖ Also https://talkbank.org/screencasts

4 Major Methods

1. Corpus Analysis - CLAN, R, ShinyServer, etc.

2. Profiling - EVAL, KIDEVAL, FluCalc

3. Microanalysis - CA, gesture

4. Web-based Tutors, Experiments - eCALL

6 Method #1: Corpus Analysis

❖ FREQ - Frequency analysis

❖ wild cards

❖ word files (morality words, LIWC, medical)

❖ KWAL - Key word and line

❖ matches highlighted

❖ COMBO - Regular expression matching

❖ Hits can be triple-clicked to go back to transcript and play
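The idea behind a FREQ-style count with wild cards can be sketched in a few lines of Python. This is only an illustration, not CLAN's implementation: the function name `freq`, the shell-style wildcard matching, and the sample lines are all assumptions made for the example.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase


def freq(chat_lines, pattern="*", speaker=None):
    """Count word forms on main tiers, optionally filtered by a wildcard."""
    counts = Counter()
    for line in chat_lines:
        if not line.startswith("*"):
            continue  # skip headers (@...) and dependent tiers (%...)
        tier, _, text = line.partition(":")
        if speaker and tier != "*" + speaker:
            continue
        # Letters only, with an optional internal apostrophe (e.g. what's).
        for token in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower()):
            if fnmatchcase(token, pattern):
                counts[token] += 1
    return counts


# Hypothetical transcript lines, made up for this sketch.
sample = [
    "@Begin",
    "*CHI:\tthe dog is running .",
    "*MOT:\tis the dog jumping ?",
    "%mor:\tcop|be&3S det|the n|dog part|jump-PRESP ?",
    "*CHI:\tjumping and running !",
]
print(freq(sample, pattern="*ing"))
```

Passing `speaker="CHI"` restricts the count to one speaker's main tier, which mirrors the usual practice of analyzing the child and the caregiver separately.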

7 MOR, POST, GRASP

❖ 41 languages, but only 11 have MOR/POST

❖ Cantonese, Danish, Dutch, English, French, Italian, Hebrew, Japanese, German, Mandarin, Spanish

❖ GRASP for English, German, Hebrew, Spanish, Mandarin

8 MOR

❖ More declarative than FST

❖ Part-of-speech tuned to spoken language

❖ Easy to use once there is a grammar

❖ Hard to build the grammar (A-rules, C-rules)

❖ 98% accuracy for English

❖ POSTMORTEM rules (as for German declension)

9 Bilingual MOR

❖ *CHL: +" [- spa] [/] yo no la desmentí porque. [+ break]

❖ *CHL: what's my word against hers &ladadada .

❖ *CHL: +" [- spa] todos estamos con un calor and@s working@s .

❖ All words are tagged implicitly; can be made explicit.

❖ Coding system makes code-switching junctures evident.

❖ Run English MOR, excluding [- spa], then Spanish MOR including [- spa]
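A rough Python sketch of how the implicit tagging could be made explicit, using two of the utterances above. It assumes English is the default language, that `[- spa]` switches the whole utterance to Spanish, and that `@s` flips a single word to the other language. The function name `tag_languages` is hypothetical, and the sketch does not call MOR itself.

```python
import re


def tag_languages(utterance, default="eng", other="spa"):
    """Return (word, language) pairs for one main-tier utterance."""
    text = utterance.split(":", 1)[1]
    precode = re.search(r"\[- (\w+)\]", text)
    base = precode.group(1) if precode else default
    # A word marked @s belongs to whichever language the utterance is not in.
    switched = other if base == default else default
    text = re.sub(r"\[[^\]]*\]", " ", text)  # drop bracketed codes
    pairs = []
    for token in text.split():
        if token.startswith(("+", "&")):
            continue  # skip linkers such as +" and fragments such as &ladadada
        word = token.rstrip(".?!,")
        if not word:
            continue  # bare punctuation
        if word.endswith("@s"):
            pairs.append((word[:-2], switched))
        else:
            pairs.append((word, base))
    return pairs


mixed = '*CHL: +" [- spa] todos estamos con un calor and@s working@s .'
plain = "*CHL: what's my word against hers &ladadada ."
print(tag_languages(mixed))
print(tag_languages(plain))
```

With explicit tags in hand, the words labeled `eng` would go to the English grammar and the words labeled `spa` to the Spanish one, which is the two-pass procedure described in the last bullet.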

10 Dependency Graphs

Web service runs by triple-clicking on %gra line
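For readers who want to work with the %gra tier directly, here is a minimal Python sketch. It assumes each item on the tier has the form index|head|RELATION, with head 0 marking the root; the function names and the example utterance are made up for illustration.

```python
def parse_gra(gra_line):
    """Parse a %gra tier into (index, head, relation) triples."""
    _, _, body = gra_line.partition(":")
    edges = []
    for item in body.split():
        index, head, relation = item.split("|")
        edges.append((int(index), int(head), relation))
    return edges


def to_arcs(words, edges):
    """Pair each dependent word with its relation and its head word."""
    arcs = []
    for index, head, relation in edges:
        head_word = "ROOT" if head == 0 else words[head - 1]
        arcs.append((words[index - 1], relation, head_word))
    return arcs


# Hypothetical utterance and %gra tier.
words = ["I", "want", "cookies", "."]
gra = "%gra:\t1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT"
for dependent, relation, head in to_arcs(words, parse_gra(gra)):
    print(f"{dependent} --{relation}--> {head}")
```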

11 Using TalkBank data

❖ Standard statistical tests in Excel and R

❖ R routines - LuCiD Shiny server, childesr, rbrul

❖ Collostructional analysis

❖ CHILDES corpora inside SketchEngine

12 Method #2: Profiling

❖ EVAL and KIDEVAL

❖ Depends on MOR and GRASP

❖ Crucial for Clinicians

13 EVAL

❖ MLU, TTR

❖ Verbs/Utt

❖ % errors

❖ % N, V, Aux, Adv, Conj, Pro

❖ % PAST, PASTP, PL

❖ Retracing, repetition
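Two of these measures, MLU and TTR, are simple enough to sketch in Python. This is a simplified illustration that counts words rather than morphemes and does not depend on MOR output, so the numbers will not match EVAL's. The function name and the sample utterances are assumptions.

```python
def mlu_and_ttr(utterances):
    """Return (mean length of utterance in words, type-token ratio).

    utterances: a list of utterances, each a list of word tokens.
    """
    tokens = [word.lower() for utt in utterances for word in utt]
    if not utterances or not tokens:
        raise ValueError("need at least one non-empty utterance")
    mlu = len(tokens) / len(utterances)   # tokens per utterance
    ttr = len(set(tokens)) / len(tokens)  # distinct word forms per token
    return mlu, ttr


# Hypothetical sample, made up for this sketch.
sample = [["I", "want", "the", "ball"], ["the", "ball", "fell"], ["want", "more"]]
mlu, ttr = mlu_and_ttr(sample)
print(f"MLU (words) = {mlu:.2f}, TTR = {ttr:.2f}")
```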

14 Sample Output

Comparing adler01a to 91 Broca PWA on all parts of protocol

15 Analysis

❖ [* p] phonological p:w, p:n, p:m

❖ [* s] semantic s:r, s:ur, s:uk, s:per

❖ [* n] neologism n:k, n:uk, n:k:s, n:uk:s

❖ [* d] dysfluency

❖ [* m] morphology m:a:0es etc.

❖ [* f] formal lexical

❖ [+ gram] [+ jar] [+ es] [+ per] [+ cir]
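Once errors are coded this way, tallying them is straightforward. The following Python sketch counts `[* ...]` codes on main tiers, both by full code and by top-level type. The regular expression, the function name, and the sample lines are assumptions made for illustration, and the sketch ignores the utterance-level `[+ ...]` codes.

```python
import re
from collections import Counter

# Matches word-level error codes such as [* p:w] or [* m:a:0es].
ERROR_CODE = re.compile(r"\[\*\s*([a-z0-9:]+)\]")


def tally_errors(chat_lines):
    """Count error codes by top-level type and by full code."""
    by_type, by_code = Counter(), Counter()
    for line in chat_lines:
        if not line.startswith("*"):
            continue  # only main tiers carry these codes
        for code in ERROR_CODE.findall(line):
            by_code[code] += 1
            by_type[code.split(":")[0]] += 1
    return by_type, by_code


# Hypothetical coded utterances, made up for this sketch.
sample = [
    "*PAR:\tthe tat [* p:n] sat on the chair [* s:r] .",
    "*PAR:\tflorp [* n:uk] went home .",
    "%mor:\tdet|the n|cat .",
    "*PAR:\tthe hat [* p:w] ran .",
]
by_type, by_code = tally_errors(sample)
print(dict(by_type))
print(dict(by_code))
```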

16 Method #3: Microanalysis

❖ Process frames show their effects in specific moments in time and space — on video.

❖ Consolidation is revealed across times.

❖ Microanalysis (CA) looks for practices, devices to see how they are conditioned

17 A sample moment: Transcript linked to video


18 You flip up that little temporal lobe

❖ Goal Stack: Med School, PBL, differential diagnosis, amnesic dysnomic aphasic, anterior cerebral circulation

❖ Where is the hippocampus?

❖ Lot more medial, pointing to diagram

❖ Finding right section

❖ Dealing with interaction

❖ Linking to CMaps

19 CA Coding

20 CHAT2ELAN


21 CHAT2PRAAT - sociophonetics

❖ Highlight utterance bullet

❖ Send to sound analyzer

❖ Extracts audio from video

❖ In Praat, draw a picture

22 Time Series and R

Alberto and Jorge — I no go.

23 Method #4: Web-based Tutors

1. E-CALL Tutors

2. Learning from Corpora

3. Language Learning in the Wild

24 PinyinTutor http://sla.talkbank.org/pinyin/#

Features:

• Immediate Corrective Feedback

• Initial-final-tone separation

• Target-Your attempt comparison

• Recycling/Scheduling

• Linkage to textbook / or not

• Instructor report

• Data logged to server etc.

• Pages with rules of Pinyin

• Playable sound chart

• Multiple speakers
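The initial-final-tone separation and the target-versus-attempt comparison can be illustrated with a short Python sketch. It is not PinyinTutor's code. It assumes tones are written as a trailing digit (for example `zhong1`), treats `y` and `w` as initials for simplicity, and uses made-up function names.

```python
# Two-letter initials come first so that "zh" wins over "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a numbered-tone Pinyin syllable into (initial, final, tone)."""
    s = syllable.lower()
    tone = 5  # assumption: 5 stands for the neutral tone when no digit is given
    if s and s[-1].isdigit():
        tone = int(s[-1])
        s = s[:-1]
    initial = next((i for i in INITIALS if s.startswith(i)), "")
    return initial, s[len(initial):], tone


def compare(target, attempt):
    """Report which of the three parts the learner got right."""
    parts = ("initial", "final", "tone")
    return {
        name: t == a
        for name, t, a in zip(parts, split_syllable(target), split_syllable(attempt))
    }


print(split_syllable("zhong1"))
print(compare("zhong1", "zong4"))
```

Scoring the three parts separately is what makes targeted corrective feedback possible: a learner who types `zong4` for `zhong1` has the final right but needs feedback on both the initial and the tone.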

25 Virtual Reality Tutor

Spanish Prepositions and Relative Clause Processing

Take the milk to the left of the plants and put it next to the box.

Recoge la leche que está a la izquierda de las plantas y ponla cerca de la caja.

26 Online Individual Difference Measures

27 Aligned Parallel Corpora http://sla.talkbank.org/latin/originalAlignmentDemo.html

28 Captioned Video

29 German and English through Wikipedia

30 Interesting Issues

• A Federation linked to CLARIN?

• Uniform Format, Data Types

• Open Access, IRB

• Metadata - Access

• Federated Content Search

• Sustainability

31 Federations

• CLARIN as an example

• Core Trust Seal

• Can the Americas support one?

• Suggestion: Recruit Linguistic departments

• Joining CLARIN vs. parallel to CLARIN

• What would be the benefits?

• Who could support this?

32 Uniform Format

• User only has to learn one set of programs

• Analyses can use multiple corpora, various ages, language backgrounds, etc.

• Clear definition of codes, verbal features, categories

• What data types are we covering?

33 Open Access, Data-Sharing, IRB

• TalkBank data are fully open and shareable

• Why aren’t we all sharing data?

• Why haven’t we adopted (at least) interoperable formats?

• Perhaps there is a credit assignment problem

• But we have: web pages, DOI, coauthorship, shared grants, citations in articles etc.

34 Metadata, Access

• My sense is that the emphasis on metadata has occluded the need to standardize formats

• Metadata is certainly important

• But what good is metadata if you can’t actually access the materials?

35 Federated Content Search

• This seems to be a CLARIN idea; it does make sense

• But is it just access through metadata, or can you really search and analyze multiple corpora across multiple sites?

• Can we standardize search engines: CQL, ANNIS, MTAS, SketchEngine?

36 Sustainability

• Federation, standardization, and the Cloud can help

• We seem to have lots of contacts with enterprise. How can this be leveraged?

• Motivated groups and projects are crucial

• Generational transition
