CABank Database Guide

This guide documents the CABank corpora in the TalkBank database. TalkBank is an international system for the exchange of data on spoken language interactions. The majority of the corpora in TalkBank have either audio or video media linked to transcripts. All transcripts are formatted in the CHAT system and can be automatically converted to XML using the CHAT2XML converter. The corpora described in this guide are:

CallFriend
CallHome
CMU
DISCLAB
GulfWar
Jefferson
WaterGate
NB
JOC
MOVIN
Sakura
SamtaleBank
SBCSAE
SCoSE

CallFriend

Malcah Yaeger-Dror
Cognitive Science
University of Arizona

These corpora were contributed to TalkBank by the Linguistic Data Consortium. Thanks to Mark Liberman, Steven Bird, and Chris Cieri for sharing these audio data. Transcriptions in CA-CHAT were produced by Malcah Yaeger-Dror, working with four students: Alan Beaudrie, Sarah Beaudrie, Tania Granadillo,

File     Sex  age  ed  state    calling to     calling
en_4504  M    14   10  NY       Peekskill      914292upb
en_4708  M    28   22  Toronto  Toronto        416635gbe
en_4745  M    37   16  FL       Key West       305966cmp
en_4823  M    39   18  FL       Key West       305583apj
en_4874  M    49   20  IL       Chicago        312539eka
en_4919  F    43   15  NY       Peekskill      914356xmt
en_5051  M    59   15  USA      Aspen, CO      970498yoo
en_5615  F    31   17  NC       Charlotte      704948xpc
en_5984  M    18   13  PA       Erie, PA       814862xcc
en_6015  F    19   13  MI       Detroit        313764qfu
en_6058  M    21   15  CO       CO Springs     719564mns
en_6062  F    22   14  NJ       Atlantic City  609883ogd
en_6084  M    30   17  NYC      NYC            212420mnh
en_6092  M    18   13  NY       Ithaca         607436qie
en_6093  MxF  18   12  MI       Detroit        313764eji
en_6094  M    19   14  MS       Kansas City    816543udj
en_6102  F    43   19  FL       Orlando        407451ucv
en_6110  F    43   19  FL       Orlando        407451ucv
en_6126  MxF  34   17  AZ       Phoenix        602395rbo
en_6157  MxF  24   16  PA       Harrisburg     717228nfw
en_6172  F    19   13  MA       Boston         617352olf
en_6193  M    18   12  IL       Urbana, IL     217355mmf
en_6200  MxF  43   16  NY       UpstateNY      716625hgu
en_6202  MxF  18   12  CA       Sta Barbara    805872sld
en_6205  F    34   16  GA       Atlanta        404378shv
en_6255  FxM  27   16  MN       Rochester, MN  507534xdy
en_6372  MxF  37   17  Toronto  Toronto        416638hjf
en_6379  FxM  25   20  VA       Beltway, VA    703790mhk
en_6384  MxF  22   16  PA       Allentown      610328yhr
en_6401  MxF  23   21  MA       Worcester, MA  508475nlj
en_6402  FxM  17   12  NJ       Elizabeth, NJ  908238tcj
en_6428  FxM  18   14  NY       Peekskill?     914737tei
en_6451  MxF  51   15  MD       Beltway, MD    301253oas
en_6476  M    21   15  MI       Detroit        313662sgw
en_6503  FxM  18   12  CA       San José       408471ulc
en_6507  MxF  21   16  CA       Fremont, CA    510623gon
en_6508  F    27   21  NC?      Durham, NC     919387tos
en_6511  mix  45   12  Toronto  Toronto        416789uea
en_6557  M    53   20  Toronto  Toronto        416651dob
en_6649  MxF  34   16  PA       Pittsburgh     412661xkc
en_6865  MxF  24   0   LA       New Orleans    504861spm

File     sex  age    ed  #1Dialect     Comments
ja_0617  F    ?      ?   kansai        baby cry
ja_0921  FxM  ?      ?   Tokyo?/st     F sometimes uses English at the beginning
ja_1367  F    ?      16  USAst         F2 has children; married to an American.
ja_1605  F    ?      ?   standard      F2 has kansai accent
ja_1612  FxM  ?      ?   standard      F2 has accent
ja_1684  F    22     16  standard      New York; university. They like dancing.
ja_1722  F    19     16  Yamanashi     F2 is 22 years old.
ja_1758  F    ?      16  standard      F1; TX. F2 lived in Canada 3yrs, now US; 26yrs old
ja_1733  FxM  19     14  standard      F has kansai accent
ja_1841  FxM  ?      16  standard      in US; M grad of Phila. University.
ja_2167  F    ?      16  USA           Living in US for business; F2 has two children.
ja_4044  FxM  20     15  Tokyo         F lived in New York and Seattle before.
ja_4164  M    30     16  Saitama       M1 leaves for Japan soon. M2 is married.
ja_4222  M    28     16  USA/Tohoku    M1 is working.
ja_4261  M    23     16  Tokyo/Kansai  Both of them work.
ja_4549  M    20     12  SuwaCity      M2 studying for finals.
ja_4573  M    31     18  Hiroshima     M2 is M1's cousin. M1 in Boston; M2 in San Diego.
ja_4608  M    25     19  Tokyo         M2 is a graduate student in USA.
ja_4725  M    23     15  Tokyo         spraying cockroaches in background.
ja_4905  MxF  21     14  Numazu
ja_6149  FxM  23     18  Tokyo         F1 is a student in UAA.
ja_6166  M    21     14  Yamanashi     They seem to live in Okurahama in USA.
ja_6167  M    22     14  Tokyo
ja_6186  MxF  21     14  Tokyo         F is washing dishes so there is water sound.
ja_6221  M    30     19  Kyoto
ja_6228  M    29     16  Oita
ja_6264  FxM  26     16  Kyoto
ja_6277  M    18     11  Ena
ja_6281  M
ja_6354  MxF  26     23  Tokyo
ja_6414  F    32     16  Osaka
ja_6416  MxF  27     16  Kyoto
ja_6422  F    54     `6  USA/Miyazaki  F1 lives in Idaho. F2 lives in Illinois.
ja_6434  M    36     14  Yokohama
ja_6463  MxF  17     10  Shizuoka
ja_6465  M    34     18  IbarakiPref
ja_6484  MxF  17     10  Shizuoka
ja_6490  MxF  30     16  Osaka
ja_6525  M    28     18  Tokyo
ja_6587  FxM  44     16  Yamagata
ja_6616  FxM  23     15  Tokyo
ja_6630  FxM  33     17  Osaka
ja_6632  M    20     15  Tokyo
ja_6666  F    26     16  USA/Osaka     They are friends at diff US universities.
ja_6688  F    38     18  Sapporo       fr. from work. F2 has strong dialect.
ja_6698  F    27     14  USA/Miyazaki  F2 has heavy dialect. Both work in California.
ja_6700  F    42     16  Sapporo       F2 has children.
ja_6707  F    57     12  CA/Hokkaido   Both (F1 & F2) have Hokkaido accents.
ja_6716  FxM  22     15  Tokyo
ja_6717  F    35/34  18  NY/CA/Gifu    Met in SF. F1 in New York. F2 is a teacher.
ja_6738  F    34     17  Nagasaki
ja_6739  F    53     16  Chiba         F1 h/w and F2 works at a lingual center.
ja_6742  F    31     16  Ichinomiya
ja_6759  M    53     12  Tokyo

File     Sex    age  ed  country
sp_4019  F      24   18  Peru
sp_4053  F      29   16  Colombia
sp_4057  F      32   18  Venezuela
sp_4089  F      23   19  Spain
sp_4095
sp_4096  F      24   17  Venezuela
sp_4100  M x F  23   16  Colombia
sp_4106  M
sp_4116  F      23   16  Dominican_
sp_4148  M      23       Ecuador
sp_4352  M      56   18  Peru
sp_4358  F X N  24   17  Argentina
sp_4400  F      26   12  Colombia
sp_4414  M      28   22  Colombia
sp_4422  M      41   12  Ecuador
sp_4427  F      24   17  Venezuela
sp_4435  F      51   16  Colombia
sp_4450  M x F  27   12  Colombia
sp_4462  F      17   12  Venezuela
sp_4463  M      24   20  Colombia
sp_4466  M x F  25   17  Peru
sp_4468  M      42   12  Nicaragua
sp_4492  M x F  24   19  Puerto_Rico
sp_4500  M      33   19  Colombia
sp_4524  F      26   17  Colombia
sp_5034  F      26   10  Colombia
sp_5052  M x F  23   17  Colombia
sp_5055  F x M  21   16  Peru
sp_5070  F      25   13  Dominican_
sp_5084  F      38   12  El_Salvador
sp_5112  M x F  34   20  Nicaragua
sp_5175  F      27   16  Colombia
sp_5258  F x M  24   11  Colombia
sp_5316  F x M  27   22  Colombia
sp_5340  F      20   12  Colombia
sp_5354  F      34   14  El_Salvador
sp_5361  F      23   16  Venezuela
sp_5367  M      34   6   Peru
sp_5418  F      22   16  Canada
sp_5502  F      18   11  Honduras
sp_5558  F      57   12  Colombia
sp_5582  M      23   9   Mexico
sp_5589  M      45   18  Peru
sp_5607  M      30   20  Colombia
sp_5638  M      19   14  Chile
sp_5641  M x F  27   16  Peru
sp_5650  F      32   18  Mexico
sp_5685  F x M  26   15  Colombia
sp_5704  M x F  24   17  Canada

Glossary of Spanish terms

ñángara: leftist, or someone who does not care for his appearance.
berraquera: Colombian slang for 'good'.
bicho: stuff, like vaina.
bolazo: an informal word that means boredom.
boludeces: a swearword that means unimportant things.
bravo: to be angry.
burda: Venezuelan slang for 'a lot'.
cabarullí:
cachar: from the English word 'catch'; lo cacharon = they caught him.
cagando: person is very very cold.
carota blanca: white bean.
catire: blonde.
catzo
cauchos: tires.
cazaba: understood.
chabado: without luck.
chabienda: a group of friends.
chama: girl.
chamo: guy.
che: interjection used in informal conversations to call the other speaker's attention.
chevere: neat, cool, wonderful.
chiches: bells and whistles.
chimbo: Venezuelan slang for 'bad'.
chismear: to gossip, usually among girls/women.
chismes: gossip.
choto: "un choto" is a slang expression meaning "nothing".
chusma (Mex?): derogatory way to refer to lower class people.
coño: shit, cunt.
cochambroso: to think bad of someone without knowing.
comodin: from comodo, to be comfortable.
culecos: nervous indecisiveness.
culo: Venezuelan male slang for girl.
cutre
dar paja: Venezuelan slang for 'feel bad'.
de bolas: interjection, 'of course'.
de pinga: Venezuelan slang for 'good'.
de vaina: almost didn't make it, by very little.
duracos:
embustes: lies.
empatados: to be dating.
güiro: Venezuelan slang for 'head'.
guacho: lucky person.
hinchar las pelotas: bother somebody.
jangiador: leader of a group.
jeva: Venezuelan slang for girl.
kilombo: slang word for "mess".
koala: fanny pack.
mamadera: a lot of drinking.
mamar gallo: to joke.
mameico: very easy.
manita, mana (Mex): shortened from the word "hermana".
marangos: stupid or monkeys.
mate: traditional drink in Argentina made with "yerba".
me cago en la hostia: I don't really care.
me cago en la visagra: I don't really care.
mijita: pal mi hijita
mojón: lie.
morochas: twins.
murieseando: adaptation of "muriéndose".
pacotilla: worthless.
pajecito: the little boy that carries a symbolic dowry or the wedding rings.
pana: Venezuelan slang, term of address for a good friend.
pancho: comfortable, lazy.
para la gravedad de la vaina: something very grave.
parar bolas: to pay attention.
parla: the gift of good talking.
pegastes el chicle: captivate the interest of someone of the opposite sex.
pelada: girl or teenager.
puntaje: The correct word is 'puntuación'. Point average.
que show?: what's new? what is going on?
que vaina: (depending on intonation) sorry,
rasca: to be drunk.
ser del otro lado: to be homosexual.
tocazo: a lot.
tomar el pelo: to joke.
tuquís: interjection used jokingly.
vacilar: to pull someone's leg, to joke.
vaina: in Venezuela it can mean "stuff".
verga: interjection, meaning depends on intonation.
verguero: Venezuelan slang for 'lots of stuff'.
vido: The correct conjugation is 'visto'. Past tense of 'to see'.
viste: a very common informal expression equivalent to "you know".
zampó (zampear): to eat.

CallHome

1. Summary abstract

The CallHome English corpus of telephone speech was collected and transcribed by the Linguistic Data Consortium primarily in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense.

This release of the CallHome English corpus consists of 120 unscripted telephone conversations between native speakers of English. The CD-ROM distribution contains the speech data only, along with essential documentation files and software for handling the compressed speech data. The transcripts and other text data and documentation are distributed separately (typically via electronic transmission from the LDC's ftp/web server), and will be subject to periodic updates. The transcripts cover a contiguous 5 or 10 minute segment (see section 2 below) taken from a recorded conversation lasting up to 30 minutes. All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends overseas. All calls originated in North America; 90 of the 120 calls were placed to various locations overseas, while the remaining 30 were placed within North America. The distribution of call destinations can be found in the file "spkrinfo.tbl". The transcripts are timestamped by speaker turn for alignment with the speech signal, and are provided in standard orthography.

2. Data acquisition

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements), and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call.

Although the goal of the call collection effort was to have unique speakers in all calls, a handful of repeat speakers are included in the corpus. Specific information on this can be found in the file "spkrinfo.doc". In all, 200 calls were transcribed. Of these, 80 have been designated as training calls, 20 as development test calls, and 100 as evaluation test calls. For each of the training and development test calls, a contiguous 10-minute region was selected for transcription; for the evaluation test calls, a 5-minute region was transcribed. For the present publication, only 20 of the evaluation test calls are being released; the remaining 80 test calls are being held in reserve for future LVCSR benchmark tests.

3. Data verification

After a successful call was completed, a human audit of each telephone call was conducted to verify that the proper language was spoken, to check the quality of the recording, and to select and describe the region to be transcribed. The description of the transcribed region provides information about channel quality, number of speakers, their gender, and other attributes. The information from this audit may be found in the file "callinfo.tbl", and its contents are described in greater detail in "callinfo.doc".

4. Speaker demographics

Information on speaker demographics can be found in the file spkrinfo.tbl, whose contents are described in the file spkrinfo.doc.

5. Data transcription - General

All CallHome telephone conversations were transcribed using the general conventions described below. The finite set of "non-lexemes" (hesitation sounds) used in the transcripts is provided in section 6 below.

The transcription was carried out on Sun 4 workstations using the emacs text editor, which was linked to a visual and auditory display of the soundwave from the telephone recording in an xwaves window. A program written at the LDC linked the xwaves signal to the emacs buffer so that a highlighted region of the soundwave could be brought into the emacs buffer as a timestamp via a simple keystroke. Similarly, the transcribers could listen to any timemarked turn in the transcript and view the aligned soundwave as well. Thus, the transcribers had a visual as well as an auditory signal for what they were transcribing. Both the visual and the auditory signal were broken into two separate channels that could be reviewed separately or together.

The transcribers were given the transcription conventions below as guidelines for transcribing the telephone conversations.

CALLHOME TRANSCRIPTION CONVENTIONS - General

What to transcribe: 10 contiguous minutes (600 seconds) from the recorded telephone conversations. This should not include the beginning of the conversation, where the speakers are getting permission for the recording.

Definition of turns: Separate turns are defined by the following criteria:

(1) speaker change, e.g.

A: Well I was thinking about that

B: I know I talked to &Jan about it yesterday

(2) within one speaker's stretch of talk, a long turn should be broken up in terms of what makes grammatical/semantic sense, e.g.

A: And I told her %um I didn't I wasn't setting you up to be a spiritual director or anything {laugh} but I did say to her that if she were to talk if she felt that she wanted to talk about her prayer experience in Spanish

A: that you would probably be able to certainly to understand her but to empathize a little bit with what she was experiencing

(3) If there is an extra-long pause within a single speaker's turn, break the turn up into two turns, e.g.

B: When we were fishing out on &Lake &Travis last August I thought I saw, %uh [[long pause]]

B: %uh, &George &Martin, but I wasn't sure it was him.

Timestamps: Each speaker turn is marked with a unique timestamp (in seconds). The timestamps mark the beginning and end time of each turn relative to the beginning of the recording. Each timestamp is precise to the 100th of a second, and is in the format: beginning time [space] ending time, followed by the turn. Some samples:

27.98 28.72 A: You know so

137.49 139.47 A: yeah {breath} (( )) [distortion]

284.54 286.79 B: %ah &Lydia &Van &Damme.
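The turn format described above (beginning time, a space, ending time, then the speaker label and the turn) can be read with a short regular expression. The following is a minimal Python sketch under stated assumptions, not the LDC's own tooling: the names Turn and parse_turn are hypothetical, and the sketch assumes one turn per line with a speaker label made of letters or digits followed by a colon.

```python
import re
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str   # speaker label, e.g. "A" or "B"
    text: str      # the transcribed turn, markup left untouched


# beginning time, ending time (both to the 100th of a second),
# speaker label, colon, then the turn text
TURN_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) (\w+): ?(.*)$")


def parse_turn(line: str) -> Optional[Turn]:
    """Return a Turn for a well-formed line, or None if it does not match."""
    m = TURN_RE.match(line.strip())
    if m is None:
        return None
    start, end, speaker, text = m.groups()
    return Turn(float(start), float(end), speaker, text)


turn = parse_turn("284.54 286.79 B: %ah &Lydia &Van &Damme.")
print(turn.speaker, turn.end - turn.start)
```

Returning None for lines that do not match keeps the parser usable on files that interleave turns with blank lines or header material.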

Special Conventions:

Acronyms: Acronyms pronounced like a word are written in all caps with no spaces, e.g. AIDS, NARAL.

Acronyms pronounced like the individual letters are written in all caps with spaces between the letters: C I A, H I V, C E O.

Numbers: Write all numbers out; do not use digits: twenty-two, nineteen-ninety-five.

Interjections: Use the most standard spelling (as given on the lexicon list, if it's there); don't try to represent lengthening by repeating letters (like 'ooooh'). Examples: uh-huh, mhm, uh-oh, okay, jeez.

Punctuation: Transcribers are free to add any punctuation that they feel is helpful to someone reading the transcript.

Special symbols:

Noises, conversational phenomena, foreign words, etc. are marked with special symbols. In the table below, "text" represents any word or descriptive phrase.

{text} sound made by the talker

{laugh} {cough} {sneeze} {breath}

[text] sound not made by the talker (background or channel)

[distortion] [background noise] [buzz]

[/text] end of continuous or intermittent sound not made by the talker (beginning marked with previous [text])

[[text]] comment; most often used to describe unusual characteristics of immediately preceding or following speech (as opposed to separate noise event)

[[previous word lengthened]] [[speaker is singing]]

((text)) unintelligible; text is best guess at transcription

((coffee klatch))

(( )) unintelligible; can't even guess text

(( ))

speech in another language

? indicates unrecognized language; (( )) indicates untranscribable speech

-text text- partial word

-tion absolu-

#text# simultaneous speech on the same channel (simultaneous speech on different channels is not explicitly marked, but is identifiable as such by reference to time marks)

//text// aside (talker addressing someone in background)

//quit it, I'm talking to your sister!//

+text+ mispronounced word (spell it in usual orthography)

+probably+

**text** idiosyncratic word, not in common use

**poodle-ish**

%text This symbol flags non-lexemes, which are general hesitation sounds. See the section on non-lexemes below to see a complete list for each language.

%mm %uh

&text used to mark proper names and place names

&Mary &Jones &Arizona &Harper's &Fiat &Joe's &Grill

text -- marks end of interrupted turn, and -- text marks continuation of same turn after interruption, e.g.

A: I saw &Joe yesterday coming out of --

B: You saw &Joe?!

A: -- the music store on &Seventeenth and &Chestnut.
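For analyses that only need the words, the special symbols listed above can be removed or unwrapped with a handful of regular expressions. The sketch below is illustrative only: plain_text is a hypothetical helper, not part of any LDC or TalkBank tool. It assumes the symbols are used exactly as described in this section, takes the turn text without the timestamp and speaker label, and keeps non-lexemes and proper names as ordinary words with their % and & flags removed. Partial-word hyphens are left in place.

```python
import re


def plain_text(turn: str) -> str:
    """Reduce the text of a CallHome-style turn to plain words."""
    s = turn
    s = re.sub(r"\[\[.*?\]\]", " ", s)        # [[comment]]
    s = re.sub(r"\[.*?\]", " ", s)            # [noise] and [/noise]
    s = re.sub(r"\{.*?\}", " ", s)            # {sound made by the talker}
    s = re.sub(r"\(\(\s*\)\)", " ", s)        # (( )) with no guess at the text
    s = re.sub(r"\(\((.*?)\)\)", r"\1", s)    # ((best guess)) -> best guess
    s = re.sub(r"//(.*?)//", r"\1", s)        # //aside//
    s = re.sub(r"\*\*(.*?)\*\*", r"\1", s)    # **idiosyncratic word**
    s = re.sub(r"#(.*?)#", r"\1", s)          # #simultaneous speech#
    s = re.sub(r"\+(.*?)\+", r"\1", s)        # +mispronounced word+
    s = re.sub(r"(?<!\S)[%&](?=\w)", "", s)   # %uh -> uh, &Mary -> Mary
    s = s.replace("--", " ")                  # interrupted-turn marker
    return " ".join(s.split())                # collapse leftover whitespace


print(plain_text("I saw &Joe yesterday coming out of --"))
```

Comments are removed before single-bracket noises so that the inner brackets of a [[comment]] are not mistaken for a noise marker.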

6. Data transcription - Non-lexemes

For LVCSR purposes, some of the speech sounds uttered by the conversational participants were deemed to be "non-lexemes" or periodic sound sequences that are not listed as words in the pronunciation dictionary. The "non-lexemes" are distinct from the set of interjections such as "okay" and "jeez" which are considered as words in the lexicon. The "non-lexemes" can loosely be considered as hesitation sounds that a speaker makes while speaking. While the spelling of these sounds is somewhat arbitrary, the transcribers were given a finite list from which to choose in order to maintain orthographic consistency.

Below are the token counts of the non-lexemes occurring in the 80 training and 20 devtest transcripts.

1530 %uh
1470 %um
 310 %eh
 309 %mm
 209 %hm
 194 %ah
 166 %huh
  15 %ha
   3 %er
   2 %oof
   2 %hee
   2 %ach
   1 %eee
   1 %ew
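A frequency list of this kind can be rebuilt from the transcripts by counting the tokens flagged with %. The following Python sketch shows one way to do it; count_non_lexemes is a hypothetical name, and the sketch assumes that non-lexemes are written in lower case and that % appears only at the start of a token, as in the conventions above. The sample turns are taken from the examples in this guide, not from the released transcripts.

```python
import re
from collections import Counter

# a % at the start of a token, followed by lower-case letters
NONLEX_RE = re.compile(r"(?<!\S)%[a-z]+")


def count_non_lexemes(turns):
    """Count %-flagged non-lexemes over an iterable of turn texts."""
    counts = Counter()
    for text in turns:
        counts.update(NONLEX_RE.findall(text))
    return counts


sample = [
    "And I told her %um I didn't",
    "I thought I saw, %uh [[long pause]]",
    "%uh, &George &Martin, but I wasn't sure it was him.",
]
for token, n in count_non_lexemes(sample).most_common():
    print(n, token)
```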

7. Quality control (QC) procedures

The transcripts were created iteratively. The first step was to transcribe and timestamp the appropriate portion of each conversation. Once this was completed, formatting and spelling were checked and corrected. A second pass was then made over all of the transcripts, in which both content and formatting were checked once more. Throughout this process, small improvements were constantly made and re-checked for accuracy. In most instances, a third (or even fourth) pass was made over the transcript to verify its accuracy.

Spelling:

As the telephone conversations were being transcribed, the words found in the transcripts were being compiled for inclusion in pronunciation dictionaries also being prepared by the LDC. As the lexicon workers compiled lists of words, they checked (among other things) for spelling errors. The lists of spelling/typo errors found in the transcripts were compiled, and a program was run over the transcripts to replace a misspelled word with its correct spelling. Thus, work on the pronunciation dictionaries of the respective languages helped to double-check the proper spelling of all words in the transcripts.
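The replacement step described above can be pictured as a whole-word substitution driven by a list of known misspellings. The LDC program itself is not documented here, so the following Python sketch is only an assumption about how such a pass might work; apply_corrections and the two sample entries are hypothetical.

```python
import re


def apply_corrections(text, corrections):
    """Replace whole words found in `corrections` (misspelling -> correct spelling)."""
    if not corrections:
        return text
    # longest entries first, so a longer misspelling is tried before a shorter one
    words = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)


# hypothetical entries, not taken from the LDC lists
corrections = {"recieve": "receive", "definately": "definitely"}
print(apply_corrections("I definately did recieve it", corrections))
```

Matching on word boundaries keeps the substitution from touching longer words that merely contain a listed misspelling.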

Syntax:

To check the well-formedness of the bracketing, a program was written which goes over the transcripts and notes any apparent irregularities. This program was later adapted for on-line use by the transcribers to be used while creating the transcripts. A final syntax check was run over all transcripts before the final release.
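A bracketing check of this kind can be sketched with a stack of open delimiters. The code below is a minimal Python illustration, not the LDC program: bracket_errors is a hypothetical name, and the sketch covers only the paired symbols {text}, [text], [[text]], and ((text)) from the conventions above, checking one line at a time.

```python
import re

# double delimiters are listed first so that "[[" is not read as two "["
TOKEN_RE = re.compile(r"\[\[|\]\]|\(\(|\)\)|[{}\[\]]")
PAIRS = {"]]": "[[", "))": "((", "}": "{", "]": "["}


def bracket_errors(line):
    """Return a list of messages describing unbalanced delimiters in `line`."""
    errors, stack = [], []
    for m in TOKEN_RE.finditer(line):
        tok = m.group()
        if tok in PAIRS:                                  # closing delimiter
            if stack and stack[-1][0] == PAIRS[tok]:
                stack.pop()
            else:
                errors.append(f"unexpected {tok!r} at column {m.start()}")
        else:                                             # opening delimiter
            stack.append((tok, m.start()))
    errors.extend(f"unclosed {tok!r} at column {pos}" for tok, pos in stack)
    return errors


print(bracket_errors("I thought I saw, %uh [[long pause]"))
```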

Timestamps:

To check the well-formedness of timestamps, a program was developed that checked for (1) overlapping timestamps, (2) start times that are greater than end times, (3) turns that are missing timestamps, (4) the proper formatting of a blank line before each timestamp, (5) proper number of digits in each timestamp, and (6) the proper marking of the speaker id. This procedure was folded into the syntax checking procedure to be used on-line by the transcribers.
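Several of these checks can be expressed in a few lines of Python. The sketch below is an illustration under stated assumptions rather than the LDC procedure: check_timestamps is a hypothetical name, speaker ids are assumed to be a capital letter optionally followed by digits, and overlap is tested only against the previous turn of the same speaker, since speech on different channels may legitimately overlap. The blank-line check (4) is not covered.

```python
import re

# two times with exactly two decimal digits, then a speaker id and a colon
TURN_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) ([A-Z]\d*): ")


def check_timestamps(lines):
    """Return (line_number, message) pairs for suspicious turns."""
    problems = []
    last_end = {}                          # speaker -> latest end time seen so far
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue                       # blank separator line
        m = TURN_RE.match(line)
        if m is None:
            problems.append((n, "missing or malformed timestamp/speaker id"))
            continue
        start, end, speaker = float(m.group(1)), float(m.group(2)), m.group(3)
        if start > end:
            problems.append((n, "start time is greater than end time"))
        if start < last_end.get(speaker, 0.0):
            problems.append((n, "overlaps previous turn of the same speaker"))
        last_end[speaker] = max(end, last_end.get(speaker, 0.0))
    return problems


lines = [
    "27.98 28.72 A: You know so",
    "",
    "28.10 27.50 B: oops",
    "28.50 29.00 A: again",
    "A: no stamp",
]
for number, message in check_timestamps(lines):
    print(number, message)
```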

Content:

To check that the properly spelled and formatted transcription actually matched the spoken signal, a second human pass was made over all of the transcripts. In many instances, three or more passes were made as well.

English  Sex  Age  Ed  Place
0638     F    20   15  PA
6067     F    18   12  NJ
4838     F    18   13  NY
6079     F    19   13  NY
6100     F    19   15  LA
4092     F    22   16  PA
5788     F    22   16  TX
6479     F    22   17  OH
5352     F    22   17  PA
6107     F    23   16  NY
5273     F    23   16  OH
4432     F    23   17  NY
4886     F    24   13  PA
4624     F    24   16  MI
5931     F    24   18  PA
4490     F    24   18  SC
4844     F    25   13  OH
5777     F    25   16  MI
4887     F    25   18  DC
4365     F    25   21  WI
5573     F    26   18  MA
4913     F    26   18  NY
4660     F    27   16  MO
4157     F    28   16  MO
4926     F    30   12  NY
4077     F    30   16  WA
4861     F    30   18  IN
4628     F    30   19  WY
4576     F    31   18  IL
6467     F    31   18  OH
6625     F    32   18  FL
4145     F    32   21  CA
4245     F    32   21  NE
4248     F    32   21  PA
6348     F    33   12  IL
5388     F    33   13  NE
4595     F    33   18  NY
4610     F    33   21  NY
4315     F    34   16  MA
4564     F    34   16  VA
4927     F    34   18  NY
6071     F    35   18  FL
5254     F    35   18  NY
6047     F    36   16  WI
4571     F    36   18  NE
4431     F    36   20  IL
4065     F    36   23  MD
4459     F    37   16  CA
4325     F    38   13  VT
4580     F    38   18  NJ
5907     F    38   20  ID
5046     F    39   22  CA
4104     F    40   16  CA
5700     F    40   16  IA
4234     F    42   17  NY
5866     F    43   12  AR
5888     F    44   16  ID
4544     F    45   18  MA
4822     F    46   18  CA
5278     F    46   18  NY
4310     F    47   18  NH
4665     F    47   18  PA
4666     F    48   16  PA
4335     F    48   18  IA
5551     F    49   16  WI
4623     F    52   13  NY
6161     F    53   18  TX
6456     F    54   20  OH
6033     F    54   20  WI
4941     F    56   16  NY
4112     F    56   19  CA
5736     F    57   16  CA
6274     F    57   16  WI
4705     F    57   22  OR
5495     F    61   20  WA
5242     F    63   12  IL
6314     F    63   14  IL
6252     F    65   17  WI
5712     F    65   20  MI
5648     F    66   18  CT
6045     F    66   18  WI
4556     F    67   14  NJ
5532     F    67   16  IL
6447     F    71   20  WI
5208     F    74   16  WI
6408     F    77   12  MN
4673     F    80   17  MI
4569     F    80   17  MI
4677     M    13   7   WV
4521     M    19   13  NY
6825     M    19   14  NY
6265     M    20   15  NY
6521     M    21   15  OH
4967     M    26   7   UT
4093     M    27   16  WA
4801     M    27   17  WA
4721     M    27   18  FL
6861     M    27   19  UT
5166     M    28   18  MD
5872     M    29   15  CA
4485     M    29   21  MA
4686     M    30   18  FL
4074     M    30   18  TX
4371     M    30   20  NE
5373     M    31   25  Canada
6313     M    32   18  NY
4415     M    33   17  NC
4612     M    37   16  VA
4829     M    38   18  CT
6179     M    43   16  NY
4247     M    43   17  varied
5713     M    43   18  CT
6785     M    46   16  AR
4792     M    48   17  NY
4807     M    54   17  NJ
4184     M    54   20  NY
4702     M    55   13  KA
6298     M    74   12  WI
4808     M    76   14  MA
4629     M    8    3   IL

German   Sex  Age  Ed  Place
4002     F    -    -   -
4024     M    31   24  Pfullingen
4028     M    25   18  Berlin
4073     M    26   18  Kassel
4076     F    37   17  Krefeld
4111     M    40   23  Bensberg
4123     M    27   19  Hagen
4287     M    32   24  Buende
4308     F    37   12  Hadmersleben
4384     F    32   14  Mainz
4458     M    30   20  Bad V slac
4552     M    31   21  Freiburg
4553     M    27   20  Gengenbach
4630     F    49   16  Voerde
4684     M    24   18  Bads-Alzunge
4711     F    51   22  Nuremburg
4755     F    57   12  Bremen
4764     F    32   21  Hamburg
4765     F    46   16  Bielefeld
4777     F    56   12  Berlin
4828     M    25   16  Cologne
4857     M    34   20  Augsburg
4866     M    25   17  Beckum
4868     M    54   20  Stutgart
4896     F    23   13  Leverkusen
4921     M    26   15  Hildesheim
4940     F    27   16  Hamburg
4951     M    23   15  Stuttgart
4957     M    23   16  UNK
4965     F    29   13  Bad-Neuheim
5016     M    34   21  Guendburg
5088     F    26   14  Ommernheim
5097     F    16   10  Munich
5123     M    19   13  Berlin
5143     F    29   16  Frankfurt
5159     F    76   16  Bernburg
5161     F    27   16  Goetting
5168     M    24   16  Kahl
5206     F    74   13  Breslau
5207     F    52   12  Neunkirchen
5223     F    68   12  Stuttgart
5224     F    62   14  Berlin
5248     F    60   16  Berlin
5298     F    40   16  Frankfurt
5351     F    60   18  Berlin
5421     F    69   16  Munich
5452     F    67   12  Leipzig
5493     F    47   13  Heidelberg
5518     M    66   12  Germany
5519     M    24   15  Friederchshsen
5566     F    54   17  Frankfurt
5569     F    27   17  Salzburg
5577     F    26   18  Munich
5596     F    47   16  Zweibruecken
5626     F    65   12  Hanover
5661     M    25   19  Berlin
5681     M    59   12  Berlin
5699     F    68   16  Berlin
5776     F    69   20  Cologne
5778     F    21   12  South_A VA
5832     F    68   12  Liga
5900     F    23   16  Osnabrueck
5909     F    69   16  Stuttgart
5944     F    40   20  Switzerland
5945     F    28   20  Frankenberg
6069     F    29   21  Cleveland
6140     F    31   20  Insbrook
6144     M    60   16  Berlin
6162     M    24   18  Waldshut
6197     M    31   22  Mainz
6199     F    31   20  Frankfurt
6219     F    19   13  Wiesental
6247     M    54   14  Buchel
6248     F    41   18  Giessen
6250     M    60   14  Chemnitz
6251     F    60   12  Palastinate
6297     M    26   19  Bibirbach
6311     F    69   18  Saaz
6312     M    25   17  Berlin
6333     F    54   12  Hamburg
6349     M    54   16  Braunschweig
6350     M    37   10  Allersberg
6352     F    24   16  Coesfeld
6373     M    17   12  Munich
6386     M    25   17  Tuebingen
6388     M    24   18  Berlin
6446     F    38   16  Gottingen
6477     F    56   16  Berlin
6506     M    75   15  Karlsruhe
6517     M    28   18  Oldenburg
6518     M    26   12  Dessau
6545     F    41   18  Hamm
6623     M    57   16  Stuttgart
6639     M    26   17  Cologne
6659     M    49   17  Heidenheim
6691     M    31   23  Munich
6692     F    65   14  Stuttgart
6719     F    25   17  Berlin
6838     M    26   18  Hannover
6888     M    23   15  Stuttgart

Japanese Sex Age Ed Place
ja_0856
ja_0924 38 16
ja_0930
ja_1012 31 16
ja_1032
ja_1041
ja_1048 41
ja_1057 41 18
ja_1099
ja_1109
ja_1123
ja_1201
ja_1237 37 21
ja_1263 37 21
ja_1277 33
ja_1288 43 20
ja_1290 16
ja_1328 29 14
ja_1369 25 12
ja_1370 26 12
ja_1418
ja_1425 16
ja_1428 30 16
ja_1461 34 14
ja_1509 30 16
ja_1538 33 14
ja_1541
ja_1542 35 18
ja_1557 16 10
ja_1593 26 14
ja_1604 28 19
ja_1607
ja_1608 45 12
ja_1615 13
ja_1628 40 14
ja_1642 12
ja_1667 44 16
ja_1710 31 16
ja_1713 24 17
ja_1725 12
ja_1731 63 12
ja_1738 30 16
ja_1741 27 15
ja_1749 18
ja_1889 16
ja_1899 22 12
ja_1925 19 14
ja_1928 13
ja_1999 31 16
ja_2004 58 12
ja_2041
ja_2085 40 17
ja_2096 36 17
ja_2111 19 12
ja_2134 18 13
ja_2157 13
ja_2180 36 16
ja_2188 18
ja_2199
ja_2204 25 12
ja_2206 22 15
ja_2207 31 12
ja_2208 33 16
ja_2209 82 8
ja_2210 28 14
ja_2212 61 10
ja_2215 45 12
ja_2217 65 15
ja_2218 47 16
ja_2219 28 18
ja_2220 43 22
ja_2222
ja_2224 29 20
ja_2225 54 12
ja_2231 31 20
ja_2234 26 16
ja_2235 50 12
ja_2237 29 16
ja_2239 29 14
ja_2243 18 12
ja_0743
ja_0922
ja_0988
ja_1003
ja_1069
ja_1622
ja_1629 54 14
ja_1670 28 21
ja_1688 21 12
ja_1690 19 11
ja_1967 16
ja_2035 30 14
ja_2214
ja_2238 29 23
ja_3002
ja_3004
ja_3005
ja_3008
ja_4061
ja_4275
ja_0696
ja_0862
ja_0986 32 16
ja_1005
ja_1072 34 16
ja_1586 35 18
ja_1674 54 16
ja_1832 19 13
ja_1867 30 16
ja_1966 27 16
ja_2053 14
ja_2074 46 16
ja_2196 28 19
ja_2216 25 18
ja_2223 36 16
ja_2236 13
ja_2242 48 14
ja_3001
ja_3006
ja_3007

Mandarin Sex Age Ed
ma_0003 F 40 13
ma_0010 27 15
ma_0022 F 1
ma_0027 M 14
ma_0028 0 19
ma_0029 M 20
ma_0030 M 29 16
ma_0035 0 15
ma_0104 F 26 16
ma_0106 M 24 17
ma_0110 F 29 16
ma_0111 F 32 20
ma_0117 30 16
ma_0131 0 18
ma_0626 M
ma_0637 0 15
ma_0651 31 10
ma_0653 21 15
ma_0667 M
ma_0669 M 27 20
ma_0671 F 40 20
ma_0674 M 31 10
ma_0679 M 32 18
ma_0682 M 23 18
ma_0691 M 31 11
ma_0695 M 28 10
ma_0698 M 25 10
ma_0703 F 30 14
ma_0704 M
ma_0711 F 27 10
ma_0716 M 31 12
ma_0717 F 15
ma_0718 M 27 20
ma_0719 M 18
ma_0721 F 24 15
ma_0727 F 25 15
ma_0735 F 26 20
ma_0738 F 47 17
ma_0742 M 25 13
ma_0748 M 31 15
ma_0750 F 23 16
ma_0751 F 20 14
ma_0752 M 24 16
ma_0754 M 27 18
ma_0755 M 42
ma_0756 M 30 17
ma_0758 M 92
ma_0760 M 27 17
ma_0761 F 29 20
ma_0763 M 25 18
ma_0764 M 20
ma_0766 M 20
ma_0768 M 26 19
ma_0769 F 36 15
ma_0771 F 30 18
ma_0773 M 28 20
ma_0774 M 26 14
ma_0779 F 25 15
ma_0782 F 28 22
ma_0783 M
ma_0785 F 27 20
ma_0786 M 29 16
ma_0790 M 25 16
ma_0796 30 15
ma_0799 M 32 16
ma_0806 M 40 18
ma_0807 M 49 22
ma_0814 F 26 18
ma_0815
ma_0817 M 30 20
ma_0821 M 25 16
ma_0823 M 31 8
ma_0827 F 30 16
ma_0828 F 30 20
ma_0829 41 19
ma_0840 M 29 15
ma_0844 F 24 10
ma_0845 M
ma_0846 F 24 15
ma_0848 M 26 10
ma_0851 M 35 20
ma_0859 M 26 18
ma_0860 M 20
ma_0861 M 27 17
ma_0871 35 16
ma_0876 M
ma_0880 M 13
ma_0881 F 30 25
ma_0882 M
ma_0888 M 16
ma_0894 M 36 25
ma_0900 F 26 18
ma_0906
ma_0913 M
ma_0915
ma_0916 M 32 18
ma_0920 M
ma_0925 M 16
ma_0932
ma_0952 F 23 12
ma_0958 M 23 17
ma_0963 F 33 15
ma_0975 F 34 16
ma_0976 M 23 16
ma_0977 M
ma_1006 F 31 20
ma_1008 F
ma_1014 M 38 20
ma_1022 25 12
ma_1067 M 24 6
ma_1077 M 30 20
ma_1279 F
ma_1280 M
ma_1281 M
ma_1283 M 16
ma_1293 F 26 14
ma_1303 F 12
ma_1307 F 15
ma_1346 M 16
ma_1352 M
ma_1357 F
ma_1359 F
ma_1376 F
ma_1393 F
ma_1396 M
ma_1430 M
ma_1525 F
ma_1539 M
ma_1582 M 26 19
ma_1597 M 15
ma_1603 F
ma_1671 F
ma_1700 M
ma_1711 F
ma_1728 F
ma_1737 M

Spanish Sex Age Ed
sp_0053 30 16
sp_0054 56 22
sp_0082 39 14
sp_0084 32 12
sp_0088 37 15
sp_0616
sp_0681 15
sp_0687
sp_0699 29 17
sp_0707
sp_0737
sp_0776
sp_0857 20 17
sp_0912 56 22
sp_0934
sp_0937 25 19
sp_0943
sp_0970
sp_1015 30 10
sp_1031
sp_1046
sp_1059
sp_1074 22
sp_1084 32 20
sp_1100
sp_1142 29 19
sp_1143 34 20
sp_1148 34 16
sp_1156 32 10
sp_1157 21
sp_1163 35 19
sp_1186 19
sp_1212 22 16
sp_1219 37 16
sp_1295 27 15
sp_1343 31 20
sp_1345 29 14
sp_1362
sp_1427
sp_1435 25 16
sp_1438 21 16
sp_1553
sp_1577
sp_1578
sp_1587
sp_1592
sp_1594
sp_1596 30 21
sp_1643 40 16
sp_1644
sp_1648
sp_1651
sp_1654 39 16
sp_1673
sp_1720 28 20
sp_1747 18
sp_1748
sp_1784 20 2
sp_1785
sp_1789
sp_1807 26 17
sp_1813 19 13
sp_1814 19 13
sp_1827
sp_1829 20 15
sp_1847 37 12
sp_1850 74 12
sp_1858 20 12
sp_1904
sp_1923 23 18
sp_1926 14
sp_1931
sp_1933 58 14
sp_1934
sp_1940
sp_1953 37 16
sp_1954 20 15
sp_1955 26 17
sp_1963 19 13
sp_2003 28 14
sp_2010
sp_2023 28 14
sp_2024 25 20
sp_2036 21 14
sp_2046 20 14
sp_2049
sp_2061
sp_2067
sp_2069 20 14
sp_2077
sp_2078
sp_2079 24 18
sp_2082 20 14
sp_2083 19 13
sp_2086
sp_2114 28 18
sp_2155 48 14
sp_2158
sp_2164
sp_2168
sp_2173 40 10
sp_2174
sp_2175
sp_2179 20 14

CMU

Brian MacWhinney
Department of Psychology
Carnegie Mellon University
Pittsburgh, PA 15213

These are short conversations recorded by the students in Brian MacWhinney's class in Language and Thought in the Fall term of 2001. This assignment was worth 25% of the grade. The goal was to learn to record and transcribe spontaneous interactions using CLAN.

The transcripts are combined in one zip file, called cmu.zip. The audio and documentation can be downloaded separately. The materials include:

Dimitrios discussing the experience of coming to America in Greek with his mother.
Elizabeth planning the evening's activities with her family.
An anonymous student discussing past events with his friends from Singapore.
Marina discussing travel with two Swiss friends.
Yuki discussing her friend's dog.
Discussions in Marianne's immigrant Russian family on translation, the word "flaky", and the idea of a 4-minute mile.

This is a parallel set from the class of the Spring of 2003.

Anna's recording of a discussion of dominance relations in a sorority: transcript and audio.
Amy discussing graduate schools with her friends, with some code-mixing.
Beverly's workgroup devising a ball net.
Courtney's discussion of her childhood with her mother.
David's recording of a session of a CMU comedy group.
Jai's art project group.
Jing's discussion in Chinese with her statistics tutor.
Kerry's recording of a discussion between a couple planning a move to Minneapolis after graduation.
Kirsten's recording of a conversation between two friends.
Matthew's recording of a computerized route description task.
Michael's transcripts from discussions of CMU life between friends.
Monica's recording of friends in a cafe.
Ryan's recording of his buddies discussing Spring Break and football season.
Vanessa's recording of a discussion of a friend's dating life.

DISCLAB

Susan Ervin-Tripp
Department of Psychology
University of California
Berkeley, CA

The DISCLAB transcripts were collected by Susan Ervin-Tripp and John Gumperz in the 1980s from a variety of conversations at Berkeley. Permissions to use these data vary from conversation to conversation. Conversations with possible restrictions are CON03 (ambiguous), DIN17 (for linguistic and ethnographic research), KIDS01/02 (ambiguous), QUAKE (okay if anonymous), RAZAS (okay if not degrading), RPG01 (only by Psych department).

Lampert: In file names for this corpus, the T and L refer to schools. T is Sta. Teresa, a middle-class Oakland parochial school, and L is Longfellow, a working-class school in Alameda with ethnic diversity. T2 is second grade. M4 means fourth male group. The larger L school had more classes, so its files carry more identifiers (grade, room number, and gender group), as in L2.12.F3; these still have to be figured out.
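The identifier scheme described above can be sketched as a small parser. This is only an illustration under stated assumptions: the dot-separated layout is taken from the single example L2.12.F3, the reading of F as a female group is inferred by analogy with M for male, and the function name parse_identifier is hypothetical. Identifiers that do not follow this layout are returned as None rather than guessed at.

```python
import re

# School letters as given in the text.
SCHOOLS = {"T": "Sta. Teresa", "L": "Longfellow"}

# The text states that M marks a male group; F as female is an assumption.
GENDERS = {"M": "male", "F": "female"}

# Assumed layout: school letter + grade, room number, gender letter + group number.
IDENTIFIER = re.compile(r"^([TL])(\d+)\.(\d+)\.([MF])(\d+)$")


def parse_identifier(identifier):
    """Split an identifier such as 'L2.12.F3' into its parts, or return None."""
    match = IDENTIFIER.match(identifier)
    if match is None:
        return None
    school, grade, room, gender, group = match.groups()
    return {
        "school": SCHOOLS[school],
        "grade": int(grade),
        "room": int(room),
        "gender": GENDERS[gender],
        "group": int(group),
    }


print(parse_identifier("L2.12.F3"))
```

Shorter forms such as T2 or M4 are not covered by this pattern, since the text does not say how they combine in a full file name.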

Escalera and Sprott worked with the younger children.

GulfWar

The GulfWar corpus is a set of 16 transcripts of calls to radio station WQED in Berkeley, California during the Gulf War of 1987, contributed by Johannes Wagner.

Jefferson

This segment of the Conversation Database is dedicated to the memory of Gail Jefferson.

WaterGate

GailGate is a collection of 22 transcripts of telephone conversations between President Richard Nixon and members of his top staff and their lawyers during April 1973, as the prosecution of the Watergate break-in and its subsequent cover-up was being conducted. All of the conversations are telephone calls with the exception of 3-21ndh.cha, which is a conversation recorded in the Oval Office. The National Archives provided the audiotapes; Gail Jefferson produced the transcripts in MS-Word format. These originals are available in PDF format in the media folder, which also holds the MP3 and WAV audio and the original WAV files from the National Archives. Johannes Wagner, Lone Laursen, and Brian MacWhinney then reformatted the transcripts to CHAT heritage format and linked the files to the audio. In addition, Gail Jefferson had created four transcripts (3-21ndh, 4-19ekalm, 4-25nh, and 72-colhunt) on a typewriter. Johannes Wagner and Lone Laursen computerized these directly into CHAT heritage format, so no PDF files are available for these four.

Conversation      Tape/Segment   Participants                        Time

Before April 13: -
72-colhunt.cha    -              Colson Hunt                         Nov 13, 1972
3-21ndh.cha       -              Oval Office: Nixon, Dean, Haldeman  March 21, 10-11 am
4-12nc.cha        -              Nixon Colson                        April 12, 7-8 pm

April 13-15, 1973: (253)
4-13nehig.cha     (1) 38-1       Nix Ehrl Higby                      April 13, 9-10 am
4-13nh.cha        (2) 38-9       Nix Hald                            April 13, 5pm
4-13ne1.cha       (3) 38-12      Nix Ehrl                            April 13, 6pm
4-13neh.cha       (4) 38-14      Nix Hald Ehrl                       April 13, 6pm
4-13ne2.cha       (5) 38-15      Nix Ehrl                            Apr 13, 7 p.m.
4-14eklein.cha    (6) 38-31      Ehrlichman Kleindienst              April 14, 5pm
4-15nz.cha        (7) 38-39      Nix Ziegler                         April 15, 1 a.m.

April 15-16: (254)
4-15psilb.cha     (1) 38-48      Petersen and Silbert                April 15, 4pm (a)
4-15np1.cha       (2) 38-52      Nixon Petersen                      April 15, 8 pm (a)
4-15hhig.cha      (3) 38-53      Haldeman Higby                      April 15, 8pm
4-15np2.cha       (4) 38-55      Nixon Petersen                      April 15, 8 pm (b)
4-15np3.cha       (5) 38-58      Nixon Petersen                      April 15, 9pm
4-15egray1.cha    (6) 38-60      Ehrlichman Gray                     April 15, 10 pm (a)
4-15egray2.cha    (7) 38-62      Ehrlichman Gray                     April 15 10 pm (b)
4-15np4.cha       (8) 38-63,64   Nixon Petersen                      April 15, 11pm
4-16np.cha        (9) 38-82      Nixon Petersen                      April 16, 8 pm

April 17-18: (255)
4-17nd.cha        (1) 38-84      Nixon Dean                          April 17, 9 am
4-17ne.cha        (2) 38-86      Nixon Ehrlichman                    April 17, 2 pm
4-17etim.cha      (3) 38-88      Ehrlichman Timmons                  April 17, 3-4 pm
4-17nz.cha        (4) 38-90      Nixon Ziegler                       April 17, 6 pm
4-17nkiss.cha     (5) 38-92      Nixon Kissinger                     April 17, 11pm-12
4-18nh.cha        (6) 38-95      Nixon Haldeman                      April 18, 12 am

After April 18
4-19ekalm.cha     -              Ehrlichman Kalmbach                 April 19
4-25nh.cha        -              Nixon Haldeman                      April 25

NB

NB (aka Newport Beach) is a collection of phone calls collected in the early years of Conversation Analysis. These transcripts have been the focus of much of the seminal work done in CA. NB includes 25 files from more than 30 files in the original data corpus. All files have been typewriter-transcribed by Gail Jefferson. Several files have been re-transcribed over the years. Gail Jefferson produced four transcriptions (2countryclub, 3assistance, 4matter and 5directions) electronically in autumn 2007.

In 2003 and 2004, Kresten Nyman and Johannes Wagner retyped all of the files into the computer from the typewritten originals. Johannes Wagner is responsible for any errors in the electronic transcripts. Before making the data available, Lone Laursen, Brian MacWhinney and Johannes Wagner anonymized names and addresses. In the sound files, names and other personal information were replaced by silences. In the transcripts, personal names, place names and addresses were replaced by pseudonyms, which are syllabically equivalent to the original. During this process, we retained the various pseudonyms that had already been chosen by Gail Jefferson, while adding many new ones. The data in the NB corpus are numbered to indicate the succession of the calls. However, we do not know how or when the original recordings were made.

Files in directory

Transcript             Earlier Name             Number
1golf.cha              golf                     I:1:R
2countryclub.cha       ---                      ---
3assistance.cha        ---                      ---
4matter.cha            ---                      ---
5directions.cha        ---                      ---
6fungus.cha            fishing/fungus picture   I:6:R
7assassination1.cha    assassination i          II:1:R
8assassination2.cha    assassination ii         II:2:R
9palmsprings.cha       palm springs             II:3:R
10blinddate.cha        blind date               II:4:R
11goldbridge.cha       gold bridge              II:5:R
12invitation.cha       invitation to the beach  III:1
13hightide.cha         high tide                III:2:R
15fishing.cha          fishing                  III:4:R
16dreary.cha           dreary                   IV:1:R
17tacos.cha            tacos                    IV:2:R
18clothing.cha         clothing                 IV:3:R
19paper.cha            paper                    IV:5:R2
20marysinvitation.cha  m’s invitation           IV:9:R2
21swimnude.cha         swim in the nude         IV:10:R
22thanksgiving.cha     happy thanksgiving       IV:11:R2
23marines.cha          black marines            IV:12:R2
24meatless.cha         meatless                 IV:13:R
25powertools.cha       power tools              V:R

JOC

Curtis LeBaron Organizational Leadership Brigham Young University Marriott School of Management Provo UT 84602 [email protected]

These six transcripts linked to video provide the content for articles published in a special issue of the Journal of Communication.

MOVIN

Johannes Wagner Language and Communication Odense University Campusvej 55 Odense, Denmark [email protected]

The MOVIN project involves collaboration among various researchers in the fields of Discourse Analysis and Conversation Analysis with a focus on political dialog. The database includes a small number of sample files from these languages:

• American English: A video recording of a story told by an American professor to four Danish listeners. The story is about a doctor who fixes a shoulder dislocation during a waterskiing accident.
• Australian English
• Danish: Danish reality show clips.
• Estonian
• Finnish
• French: A divorce conciliation proceeding from the CLAPI corpus.
• German: A TV discussion of scandals in the building industry.
• Italian: A TV interview with Bettino Craxi.
• Norwegian
• Swedish

Sakura

Miyata, Susanne [email protected] Banno, Kyoko Konishi, Saya Matsui, Ayumi Matsumoto, Shiori Oogi, Rie Takahashi, Akane Muraki, Kyoko Aichi Shukutoku University Nagoya, Japan

This corpus of 18 conversations is the product of six graduation theses on gender differences in students' group talk. Each conversation lasted between 12 and 35 minutes (avg. 25 minutes), resulting in an overall time of 7 hours and 30 minutes. 31 students (19 female, 12 male) participated in the study (Table 1). The participants gathered in groups of 4 students, either of the same or the opposite sex (6 conversations with a group of 4 female students, 6 with 4 male students, and 6 conversations with 2 male and 2 female students), according to age (first and third year students) and affiliation (two academic departments). In addition, the participants of each conversation came from the same small-sized class and were well acquainted.
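The duration and group figures in this paragraph can be checked against the per-file values listed in Table 2. The following is a minimal sketch of that check; the durations and group types are copied from Table 2, while the variable and function names are made up for the illustration.

```python
from collections import Counter

# (file, group type, minutes, seconds), copied from Table 2.
CONVERSATIONS = [
    ("sakura01", "MF", 26, 0),
    ("sakura02", "MF", 35, 30),
    ("sakura03", "MF", 11, 45),
    ("sakura04", "MF", 26, 25),
    ("sakura05", "MF", 27, 20),
    ("sakura06", "MF", 26, 0),
    ("sakura07", "FF", 25, 0),
    ("sakura08", "FF", 28, 25),
    ("sakura09", "FF", 27, 0),
    ("sakura10", "FF", 25, 15),
    ("sakura11", "FF", 25, 25),
    ("sakura12", "FF", 23, 55),
    ("sakura13", "MM", 21, 20),
    ("sakura14", "MM", 30, 0),
    ("sakura15", "MM", 25, 50),
    ("sakura16", "MM", 23, 30),
    ("sakura17", "MM", 26, 45),
    ("sakura18", "MM", 16, 50),
]


def total_seconds(rows):
    """Sum all durations in seconds."""
    return sum(minutes * 60 + seconds for _, _, minutes, seconds in rows)


def format_hms(seconds):
    """Format a number of seconds as hours, minutes, and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


total = total_seconds(CONVERSATIONS)
mean_minutes = total / len(CONVERSATIONS) / 60
groups = Counter(group for _, group, _, _ in CONVERSATIONS)

print(format_hms(total))       # 7h 32m 15s
print(round(mean_minutes, 1))  # 25.1
print(dict(groups))            # {'MF': 6, 'FF': 6, 'MM': 6}
```

The exact total is slightly above the rounded 7 hours and 30 minutes given in the text, and the mean agrees with the stated average of 25 minutes.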

When they were recruited, the participants were informed that their conversations might be video-recorded and transcribed for use in possible publication. Permission was requested once more after the transcription in cases where either private information had been disclosed or a misunderstanding concerning the nature and degree of the publication of the conversations became apparent during the conversation.

The recordings took place in a small conference room at the university between or after lectures. The participants were given a card with a conversation topic to start with, but were free to vary (topic 1 "What do you expect from an opposite sex friend?" [isee ni motomeru koto]; topic 2 "Are you a dog lover or a cat lover?" [inuha ka nekoha ka]; topic 3 "About part-time work" [arubaito ni tsuite]). The investigator was not present during the recording. The combination of participants, the topic, and the duration of the 18 conversations are given in Table 2.

The participants produced 15,449 utterances overall (female: 8,027 utterances, male: 7,422 utterances). All utterances were linked to video and transcribed in regular Japanese orthography and Latin script (Wakachi2002), and provided with morphological tags (JMOR04.1). Proper names were replaced by pseudonyms.

Table 1 List of Participants, sex, age, and the number of their appearances

ID   Age  Sex     #
A1F  19   female  3
B1F  19   female  5
C1F  19   female  2
D1F  19   female  3
E1F  19   female  2
F1F  19   female  1
G1F  19   female  1
H1F  19   female  1
A3F  21   female  2
B3F  21   female  2
C3F  21   female  2
D3F  21   female  1
E3F  21   female  2
F3F  21   female  1
G3F  21   female  2
H3F  21   female  1
I3F  21   female  1
K3F  21   female  1
L3F  21   female  3
A1M  19   male    3
B1M  19   male    3
C1M  19   male    5
D1M  19   male    3
E1M  21   male    4
G3M  21   male    5
H3M  21   male    4
I3M  21   male    4
J3M  21   male    2
K3M  22   male    1
L3M  21   male    1
M3M  21   male    1

Table 2 Specifications of the 18 conversations

File      Participants     Sex  Topic  Duration
sakura01  G3M H3M K3F L3F  MF   1      26'00"
sakura02  A1F B1F C1M B1M  MF   1      35'30"
sakura03  H3F I3F J3M K3M  MF   2      11'45"
sakura04  H1F B1F C1M E1M  MF   2      26'25"
sakura05  I3M G3M L3F E3F  MF   3      27'20"
sakura06  G1F F1F E1M D1M  MF   3      26'00"
sakura07  D3F F3F L3F E3F  FF   1      25'00"
sakura08  E1F B1F C1F D1F  FF   1      28'25"
sakura09  E1F B1F A1F D1F  FF   2      27'00"
sakura10  A3F B3F C3F G3F  FF   2      25'15"
sakura11  A1F B1F C1F D1F  FF   3      25'25"
sakura12  A3F B3F C3F G3F  FF   3      23'55"
sakura13  G3M H3M I3M J3M  MM   1      21'20"
sakura14  B1M A1M E1M C1M  MM   1      30'00"
sakura15  I3M H3M G3M M3M  MM   2      25'50"
sakura16  E1M A1M C1M D1M  MM   2      23'30"
sakura17  L3M G3M I3M H3M  MM   3      26'45"
sakura18  A1M B1M C1M D1M  MM   3      16'50"

SamtaleBank

Johannes Wagner Southern Denmark University

SamtaleBank is the Danish spoken language component of the DK/CLARIN project directed by Bente Maegaard. Participants in the spoken language component include Johannes Wagner, Lone Laursen, Patrizia Paggio, Frans Gregersen, and Peter Henrichsen. The current contents of the corpus include:

Password: Materials on the use of Business English by Danes and second language learners of Danish.

Radio: Danish call-in radio programs.

Sam2: Videotaped conversations between two participants.

Sam3: Videotaped conversations between three participants.

Telefon: Telephone conversations.

Filename   Length  Transcriber         Lines  Comments
Arun504A1  22:50   Kristian Mortensen  601    Talk in the lunchroom: "little shit" workshop conversation. No prosody marking in the transcription.
Arun504A2  5:29    Kristian Mortensen  312    Talk in the stockroom: "workshop conversation". No prosody marking in the transcription.
Arun509A3  9:50    Sofie Emmertsen     745    Right after work(?). Meets a female friend (NNS) who has just been given a permanent job at Føtex. Quite possibly interesting.
Arun513A1  12      Susan Linke         321    Records during work in the shop and stockroom. Not much interaction.
Arun513A3  25:40   Sofie Emmertsen     1289   In the lunchroom and shop. Talk with customers (including the parents of an acquaintance) and colleagues.

Filename  Length  Lines  Situation
MUL534H1  0:53    41     Mulenga greets her husband at the door when he comes home from work.
MUL536H1  6:07    279    Emma, Mulenga, and her husband, Jørgen, eat breakfast (or is it dinner?). They sit at the coffee table because there are a lot of things on the dining table. Mulenga tells which books she has to read. Emma is 9, is in third grade, and has just started learning English at school.
MUL536H3  16:12   667    Breakfast: Mulenga, Emma, and Jørgen. They talk about M going out to eat with some friends in the evening. Emma talks about the film Madagascar, which she saw the evening before.
MUL542H1  31:13   1426   Mulenga, her husband, and Emma eat dinner. They sit at the coffee table because there are a lot of things on the dining table. Emma talks about her birthday at her mother's. They sit at the coffee table and eat because the dining table is full of things that are to be painted. Faster Finn is a man Jørgen works for now and then.
MUL548H1  13:12   498    Mulenga talks with her husband about what to give his parents for Christmas.
MUL551H   9:12    394    Mulenga is with Emma in the living room of their home. Emma stays with Mulenga and her husband every other weekend and occasionally at other times. Mulenga asks how Emma's dancing is going.
MUL605H1  34:34   1075   Emma, Jørgen, and Mulenga at home.
MUL605H3  22:57   1067   Mulenga, Jørgen, and Emma eat dinner. Together with whom?
MUL605H4  16:34   606    Mulenga, Jørgen, and Emma. In the background is the sound of a television and splashing water.
MUL605H5  39:57   1516   Mulenga, Jørgen, and Emma. Mulenga is clipping Ronia's nails, Jørgen is nearby, and the television is on in the background. Talk about how much candy Emma may eat.
MUL605H6  3:54    115    Mulenga and Emma. Throughout the recording the sound is rather distant. The television is on in the background.
MUL613H1  8:26    275    Mulenga talks with Emma.
MUL613H2  11:52   313    Mulenga talks with Emma.
MUL613H3  3:42    65     Mulenga talks with Emma.
MUL613H4  9:58    300    Homework. Mulenga talks with Emma. It sounds as if Mulenga is reading aloud. She asks Emma about word meanings.
MUL613H5  5:54    70     Jørgen is also at home. He, Mulenga, and Emma talk. Only 1.65 minutes transcribed.
MUL534F1  3:35    -      Mulenga is at a party with her foreign friends: Charles (Ghana) and Nun (Thailand; in Denmark for 9 months). There is music in the background and they have had a little to drink. They practice Danish together. They look at Nun's family pictures. Nun explains who is in the pictures.
MUL534H2  1:48    -      Mulenga and her husband practice using the recorder.
MUL536H2  14:00   -      Emma and Mulenga look at her old homework together and make faces. M reads aloud from Askepot (Cinderella). Emma uses signs and body language to explain what the words mean. They look at the pictures together.
MUL536H4  14:02   -      Breakfast: Mulenga, Emma, and Jørgen.
MUL537T   17:34   -      Module test 3.1. Mulenga is examined together with Eva. Eva starts by talking about ”Palle Alene i Verden”. Then Mulenga talks about ”Et år i Paris”.
MUL542H2  22:08   -      Mulenga reads aloud from her books to Emma.
MUL605H2  35:58   -      Mulenga, Emma, Jørgen, and ? in a car.
MUL620T   44:46   -      (May 15, 2006. Kirsten is there.) Mulenga takes module test 3.4 with Fie from China. Their teacher, Jacob, examines them.
MUL622H   53:24   -      Mulenga discusses Emma and her homework with her husband.

SBCSAE

John DuBois Linguistics University of California, Santa Barbara [email protected]

Robert Englebretson Linguistics Rice University Houston, TX [email protected]

The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected by the University of California, Santa Barbara Center for the Study of Discourse, Director John W. Du Bois (UCSB), Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB).

Each speech file is accompanied by a transcript in which phrases are time stamped with respect to the audio recording. Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair is provided to list the beginning and ending times of the filtered regions. Four of the .flt files are empty because no information needed to be filtered out of the corresponding audio files. The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of the regions over a 1,000-sample region, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform. The audio data consist of 16 wave-format speech files, recorded in two-channel PCM at 22,050 Hz.
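The masking procedure described above can be illustrated with a short sketch. It is not the code used to prepare the corpus. The text gives only the filter type (FIR low-pass), the 400 Hz cut-off, the 1,000-sample fade, and the 22,050 Hz sampling rate; the tap count, the Hamming window, the linear shape of the fade, and the placement of the fade inside the region boundaries are all assumptions made here, as are the function names.

```python
import math

SAMPLE_RATE = 22050   # Hz, from the text
CUTOFF_HZ = 400       # Hz, from the text
FADE_SAMPLES = 1000   # from the text

# 1,000 samples at 22,050 Hz last about 45.35 ms, the "roughly 45 milliseconds" above.
fade_ms = 1000 * FADE_SAMPLES / SAMPLE_RATE


def lowpass_taps(num_taps=101):
    """Hamming-windowed sinc low-pass taps (tap count and window are assumptions)."""
    fc = CUTOFF_HZ / SAMPLE_RATE          # cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]       # normalise to unity gain at 0 Hz


def fir_filter(signal, taps):
    """Convolve with the taps, centred so the output lines up with the input."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += tap * signal[j]
        out.append(acc)
    return out


def mix_weight(i, start, end, fade=FADE_SAMPLES):
    """Share of the filtered signal at sample i: 0 outside [start, end), ramping to 1 inside."""
    if i < start or i >= end:
        return 0.0
    return min(1.0, (i - start) / fade, (end - 1 - i) / fade)


def mask_region(signal, start, end):
    """Cross-fade from the original into the low-passed signal over [start, end)."""
    filtered = fir_filter(signal, lowpass_taps())
    out = []
    for i, (dry, wet) in enumerate(zip(signal, filtered)):
        w = mix_weight(i, start, end)
        out.append((1 - w) * dry + w * wet)
    return out
```

In this sketch the start and end arguments are sample indices; the times listed in a .flt file would first have to be converted to samples, and the layout of those files is not described here.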

The TalkBank version of the corpus was constructed by Nii Martey of the Linguistic Data Consortium with help from Jack DuBois for Part 1 and from Robert Englebretson, now at Rice University, for Parts 2, 3, and 4. In the case of a phone number that was not adequately disguised by the low-pass filter, the signal was set to zero, except for the 45-millisecond boundary regions, which fade into and out of zero.

01 Actual Blacksmithing
02 Lambada
03 Conceptual Pesticides
04 Raging Bureaucracy
05 A Book About Death
06 Cuz
07 A Tree's Life
08 Tell the Jury That
09 Zero Equals Zero
10 Letter of Concerns
11 This Retirement Bit
12 American Democracy is Dying
13 Appease the Monster
14 Bank Products

Age City Orig
0001 LENORE f 30 Los Angeles CA CA BA 16 student white
0002 DORIS f 50 Montana MT MT HS 12 horse ranc white
0003 LYNNE f 19 Montana MT HS 12 student/ho white
0004 HAROLD
0005 JAMIE f 30 Walnut Cre CA CA college 16 dancer/da white
0006 MILES m CA black
0007 PETE m 36 San Leandr CA CA 18 grad student white
0008 ROY m 34 CA designer white
0009 MARILYN f 33 CA writer white
0010 CAROLYN f 19 Santa Fe NM CO HS 12 student white
0011 KATHY f 31 Boston/Santa Fe A/NM CA grad student white
0012 SHARON f 24 New Mexico NM TX college teacher white
0013 SHANE m 23 Corp Christi TX TX grad med student chicano
0014 PAM f 43 Massachusetts MA NM housewife white
0015 WARREN m 34 Wenham MA IL DVM 23 veterinarian white
0016 DARRYL m 33 San Francisco CA CA BA 16 comm./comp white
0017 PAMELA f 38 Southern California CA CA BA 16 actress/fi white
0018 ALINA f 34 Los Angeles CA CA BA 16 housewife white
0019 ALICE f 28 Pryor MT MT 4 years 16 student Crow Indian
0020 MARY f 27 Pryor MT MT college 3 cook fire Crow Indian
0021 RICKIE San Francisco CA CA HS 12 clerk black
0022 JUNE f 21 Laguna Beach CA CA A MA 17 grad student white
0023 REBECCA f 31 Saratoga A CA A J 22 attorney white
0024 ARNOLD m Saginaw MI CA HS 12 S Army white
0025 KATHY f 17 Mobile AL AL HS 10 student white
0026 NATHAN m 19 Mobile AL AL HS 12 student white
0027 BRAD m 45 MA 18 director o white
0028 PHIL m 30 NM BA 16 designer hispanic
0029 DORIS f 83 Indianapolis IN AZ MA 18 teacher white
0030 ANGELA f 90 middle Wes MO AZ MS 18 teacher J white
0031 SAM f 72 Arcadia IN AZ Nursing 15 retired white
0032 BEV f 20 So California CA CA HS 15 student white
0033 MONTOYO m 51 CA PhD political latino/chicano
0034 MARIA f 26 Nicaragua CA HS 15 dispatcher hispanic
0035 GILBERT m 22 So California CA CA HS student hispanic
0036 CAROLYN f 18 So California CA CA HS 12 student white
0037 LAURA f 23 San Jose CA CA HS student japanese/
0038 FRANK m 24 So California CA CA BA 16 business o white
0039 RAMON m 19 MoreValley CA CA HS 12 student hispanic
0040 RUBEN m 27 So California CA CA 5 yrs 17 teacher hispanic
0042 KENDRA f 25 midwest IN IN BA 16 administrator white
0043 KEN m 51 midwest IN IN Phd M 23 director o white
0044 MARCI f 50 midwest IN IN MA 19 counselor white
0045 WENDY f 26 midwest IN IN BS 16 missionary white
0046 KEVIN m 26 midwest IN IN S Cr 16 missionary white
0047 JIM m 41 metro St.L. IL IL certified 16 banking white
0048 FRED m 47 Chrisman IL IL masters 18 loan officer white
0049 JOE m 45 Dupo IL IL 17 banking white
0050 KURT m 70 Millstad IL IL 12 retired-co white
0051 VIVIAN f 55 Shenandoah A IL HS 13 banking white

SCoSE – Saarbrücken Corpus of Spoken English

Neal Norrick Linguistics Saarland University Saarbrücken, Germany [email protected]

Lynne

Participants: Helen and her daughters: Annie, in her early thirties; Lynne, grad-student home from college; Jennifer, under-grad younger sister arrives later; their niece/cousin Jean also in her early thirties

They are gathered before a late-afternoon Thanksgiving dinner in the living room of the house where Helen and Annie live. Both go into the adjacent kitchen from time to time.

Jason

Comments: Three under-grads sharing an apartment. One voice often louder than others; Frequent comments on recorder and recording process; Maybe cut A3 after first six minutes and edit other files, as you see fit

Steve

Grad-student George has invited three under-grads to talk about experiences they've had which could provide the basis for writing assignments.

Shelley

Pre-thanksgiving dinner with grandparents, mother, father, younger sister.

Yiddish

Zelda Newman [email protected]

Every Thursday night at the Millenary synagogue in Manhattan, a group of young men and women meets, starting at 10 PM and continuing into the wee hours of the morning, to hear music, occasionally listen to a lecture, eat, drink, and mix socially. Because the hot dish known in Yiddish as the “(shabes) chulent” is served there, the group meeting is called “Chulent”. While the gathering is open to all, and some non-Jews do wander in, the great majority of the attendees are men and women who have been brought up in the Ultra-Orthodox (specifically, Hassidic) world. While not all of them were brought up speaking Yiddish, many were. The group composition changes from week to week. Some are regulars; others are not.

I was contacted by one of the regular attendees and asked to speak on a topic relating to Yiddish. I spoke about the research I did together with an Israeli colleague on an Old Yiddish poem. Then I informally met some of the attendees. Once I was familiar with the attendees, it was no problem getting them to agree to be informants. I made it clear that whatever they told me about their individual/personal issues was between us. In my research, I discuss their language only. In some cases, they told me which Hassidic group they belonged to; in other cases, they didn’t. Three of the nine informants were brought up in the Satmar community, and one is from the Tseylemer Hassidic community, a close relative to the Satmar.

The room in which the attendees meet is crowded and noisy. Recording there is virtually impossible. In one case only, I went to a back room with an informant, where we sat down and the informant spoke. For the most part, it was too noisy there to get a good recording. Fortunately, in the summer of 2009, Mayor Bloomberg’s administration allowed the placing of small tables and chairs outdoors, along Broadway between 34th Street and 38th Street. There we sat, my informants and I, between midnight and 2 AM, in the dark that was illuminated solely by street lamps. The recordings include very little conversational interaction between the informant and me. I said to the young men: “Tell me a story, any story you want”, and they launched into a narrative. Some gave me a ready-made anecdote and spoke without hesitation; some retold family stories which had known content but no pre-determined form; some spoke of things that happened to them; still others simply made up a story as they went along.