Audrey Le for the NIST Gang

RT-04F Workshop November 7-10, 2004 Palisades, NY STT Task Test Sets Scoring Procedures STT Participants Evaluation Results Multiple Applications Metadata Extraction (MDE)

Speech-to-Text (STT) Words + Metadata Human-human Rich transcript speech Readable transcript UL

Broadcasts Conversations 20x

10x Metadata Extraction (MDE) 1x Words, Rich times, & Transcript Speech-to-Text confidences (STT)

Human-human Get the speech words right! English English Arabic Chinese Broadcast News Broadcast News Broadcast News Test Set Test Set Test Set

English Arabic Chinese Conversational Conversational Conversational Telephone Telephone Telephone Speech Test Set Speech Test Set Speech Test Set English BN Arabic BN Chinese BN Test Set Test Set Test Set

English CTS Arabic CTS Chinese CTS Test Set Test Set Test Set English EARS 2003 Collection 12 shows: 30 minute excerpts Consist of network news and local news shows Broadcast from the US for a US audience Cover international news, domestic and local US news English BN Arabic BN Chinese BN Test Set Test Set Test Set

English CTS Arabic CTS Chinese CTS Test Set Test Set Test Set Arabic EARS 2003 Collection 3 shows: 20 minute excerpts Dubai TV, Al-Jazeera (2 shows, different dates) Target Middle East and world-wide audience Cover international and Middle East related news Main dialect - Modern Standard Arabic Format similar to US network news English BN Arabic BN Chinese BN Test Set Test Set Test Set

English CTS Arabic CTS Chinese CTS Test Set Test Set Test Set Mandarin EARS 2004 Collection 3 shows: 20 minute excerpts China Central Television (CCTV-4) Financial and Economic Reports Broadcasts from China for overseas Chinese Covers international and Chinese financial and economic news

Radio Free Asia (RFA) Asia-Pacific Headline News and Asia-Pacific Reports Broadcasts from the US for Chinese who live in China Covers news about China and China-related international news

New Tang Dynasty TV (NTDTV) Hourly News Broadcasts from the US for a Chinese audience Covers international news Main dialect - mainland China standard Mandarin Format similar to US network news English BN Arabic BN Chinese BN Test Set Test Set Test Set English Fisher Collection English CTS Arabic CTS Chinese CTS Test Set Test Set Test Set 36 conversations: 72 speakers – 5 minute excerpts Equal number of male and female speakers All speakers are native English speakers living in the US Speakers came from four US dialect regions North, South, Midland, West Numbers of standard, cordless, and cellular phones used are ~ equally distributed All conversations were recorded in the US Topics from current events and social issues English BN Arabic BN Chinese BN Test Set Test Set Test Set

English CTS Arabic CTS Chinese CTS Test Set Test Set Test Set Levantine Dialect Fisher Collection 12 conversations: 24 speakers – 5 minute excerpts ~ Equal number of male and female speakers All speakers are native Arabic speakers with majority living in the Middle East and a few in the US Majority of the speakers was raised in Jordan Majority of the phones used was landlines All conversations were recorded in the US Topics from current events and social issues English BN Arabic BN Chinese BN Test Set Test Set Test Set

English CTS Arabic CTS Chinese CTS Test Set Test Set Test Set Hong Kong Univ. of Science & Tech. Collection 12 conversations: 24 speakers – 5 minute excerpts ~ Equal number of male and female speakers All speakers are native Mandarin speakers and live in China All speakers are in their twenties All phones used were landlines All conversations were recorded in China Topics from current events and social issues Reference data SCLITE

GLM Error Filtering Alignment Processing Calculation

System output

WER = (# Deletions + # Insertions + # Substitutions) # Reference Words Transcription of colloquial Arabic is a challenge Modern Standard Arabic (MSA) orthographic conventions do not easily apply to dialectal Arabic speech New transcription conventions and methods were devised by LDC with community input New Arabic text normalization steps Word-initial alif+hamza normalization Rule: All word-initial alif+hamza characters are translated to bare alif The word-initial “alif” can be written omitting a “hamza” or with any of three vocalic variants Neither form changes the meaning and transcribers rarely agree Unresolved Issue: normalization was not applied to inflected forms Tanween character removal Rule: All tanween characters removed The three vowel diacritics, fathah tanween, kasrah tanween, and dhammah tanween are inconsistently transcribed Omitting or including the vowel does not affect the word’s meaning Pause filler characters List of pause filler characters expanded to 5 as an allowance for CTS training data Sometimes they occur as non-pause filler characters in broadcast news No lexical normalizations Character Error Rate (CER) No “word” analog in the native orthography Ordinary words Scored as is without case Speaker-attributed noises (e.g., cough, laugh) Stripped out Non-vocal noises (e.g., door slam) Stripped out Word fragments Counted as optionally deletable Counted as correct if substring of system output matches Uncertain or foreign words Counted as optionally deletable Pause-fillers Counted as optionally deletable Contractions Expanded to the most likely correct form Ordinary words Scored as is without case Speaker-attributed noises (e.g., cough, laugh) Stripped out Non-vocal noises (e.g., door slam) Stripped out Word fragments Stripped out Uncertain or foreign words (as marked by the system) Stripped out Pause-fillers Stripped out if identified as pause-fillers Transformed to a generic pause-fillers if not identified as such Allowed equivalent system output to be scored as correct Contained equivalence rules to transform data to a canonical form Rules to expand contractions (only applied to the system output) she’s => she is, she has, she was Rules to split hyphenated words processing-speed task => processing speed task Rules to split compounds with ambiguous representations servicemen => service men halftime => would not be split Rules to map backchannels and hesitations uh-huh => %BCACK Rules to allow legitimate alternative spellings Mr. => Mister REF: they want to give you (e-) give them all the things you never got ** %h

HYP: *** going to give you give them all *** day you never got to %bc

ERR: D S D S I

Summary: two deletions, two substitutions, one insertion BBN BBN+LIMSI Cambridge Univ. IBM IBM+SRI ISL LIMSI SRI SRI + Univ. of Washington Super EARS (Collaborative effort involving BBN+CU+LIMSI+SRI) Primary systems ordered by increasing WER 100 10x 1x No significant difference

) 15.3 14.6

% 11.6 ( 10.7

R 10 E W

1 ears cu bbn+limsi bbn sri best rt03 RT-04 dev set is different from RT-03 eval set

Comparable subset of RT-04 with RT-03 Better sampling of broadcast news shows LDC selections News From CNN (anchor: ) PBS News Hour (anchor: Gwen Ifill) CNN Headline News (anchors: Steven Fraser, Sophia Choi) CNBC The News (anchor: Tom Costello) ABC6 WPVI Action News (12/17) CSPAN Book TV: Texas Book Festival

NIST selections CNN Daybreak (anchor: ) CNBC The News (anchor: Brian Williams) ABC World News Tonight (anchor: Peter Jennings) PBS BBC News (anchor: Alistair Yates) ABC6 WPVI Action News (12/03) WB17 News WPHL RT-03 Like Subset of Characteristics RT-03 Shows RT-04 Shows ABC World News Tonight Same show, different dates ABC World News Tonight

CNBC The News Typical network news with NBC Nightly News (anchor: Tom Costello) both having male speakers (anchor: Tom Brokaw) speaking most of the time PBS News Hour Similar types of news content VOA News Now and speaking style

CNN Daybreak Similar anchor styles PRI The World

CNBC The News Both anchored by Brian MSNBC News (anchor: Brian Williams) Williams (anchor: Brian Williams) CNN Headline News Similar types of news CNN Headline News contents, different dates Hypothesis: A robust system should have little variation in WER for different sources Observation: Dramatic sensitivity to test sets / subsets Implication: Current technology is not adequately robust? Shows ordered by RT-03 Like Subset WER Primary English BN 10X Systems

RT-04F (RT-03 Like Subset) RT-04F (Others) 25 bbn+limsi cu 20 ears sri More difficult )

% 15 (

R E

W 10 RT-03 like Wider spectrum 5 Easy for all shows are of BN shows is systems easier more difficult

0 ABC World CNBC - PBS News CNN - Carol CNBC - CNN WPHL CNN - Wolf WPVI CSPAN WPVI PBS BBC News Tom Hour Costello Brian Headline WB17 News Blitzer Action Texas Book Action News Tonight Costello Williams News News Festival News Comparison with RT-03 English BN Shows Shows ordered by RT-03 Like Subset WER Primary English BN 10X Systems

RT-04F (RT-03 Like Subset) RT-03 Test Set 25 bbn+limsi bbn Similar but not 20 limsi cu identical trend ears ) 15 sri % (

R E

W 10

Easier than 5 other Brian Williams show 0 ABC World CNBC - PBS News CNN - CNBC - CNN ABC World NBC VOA New s PRI The MSNBC - CNN New s Tom Hour Carol Brian Headline New s Nightly Now World Brian Headline Tonight Costello Costello Williams New s Tonight New s Williams New s Primary systems ordered by increasing WER 100 10x full set 10x rt-03 like subset 1x full set 1x rt-03 like subset

There is improvement! ) 15.3 14.6

% 11.6 12.1 ( 10.7 9.1

R 10 E W

1 ears cu bbn+limsi bbn sri best rt03 Primary systems ordered by increasing WER

Arabic BN Chinese BN 100 100 10x ul 10x 30.8 26.3 ) 22.4 19.1 18.5 ) 17 % ( %

(

R 10 10 R E E W C

1 1 limsi bbn best rt03 bbn isl best rt03

All differences at a given speed are statistically significant Primary systems ordered by increasing WER 100 ul 20x 10x 1x

No significant 29 difference 22 19 20.4

) 14.9 15.2 % (

R 10 E W

1 ibm +sri ibm bbn+lim si cu bbn sri best rt03 Speakers ordered by ascending absolute WER difference Primary English CTS 20X Systems 50 First maximum, difficult for all systems bbn+lim si 45 cu ibm 40 ibm +sri 35 sri Different

) 30 system %

( outputs

25 R E

W 20

15

10

5 First minimum, easy for all systems 0 a a a a a a b a a b a a b b b b a a a a b b a a b b a a b b a a a a b a ______19 47 59 46 99 18 45 98 51 59 63 05 05 47 25 46 49 79 38 41 53 63 06 29 38 18 85 77 99 19 53 16 05 19 31 06 3 3 2 2 9 9 0 6 6 2 8 4 9 3 8 2 8 3 1 8 1 8 1 4 1 9 5 1 1 3 1 3 9 4 2 8 9 0 9 4 8 7 2 8 4 9 0 3 1 0 2 4 4 9 5 9 5 0 1 4 5 7 1 8 9 9 5 7 1 8 5 3 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 10 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 ______h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs fs Systems don’t do well on rapid and disfluent speech Primary systems ordered by increasing WER

Arabic CTS Chinese CTS 100 100 ul 20x ul 20x 10x 44.1 37.5 42.7 29.3 29.7 ) ) % % ( (

R 10 10 R E E C W

1 1 bbn sri+uw best rt03 bbn cu sri+uw best rt03

The 3 20X systems are Unfair comparison, not different significantly different dialects BN and CTS at Different Speeds 100 ) % ( 10 R

E Eng BN 1x W Eng BN 10x

Eng BN >10x

Eng CTS 1x

Eng CTS 10x 1 RT-02 RT-03 RT-04 RT-04 RT-03 Like Subset Phil Woodland The LDC Our Arabic tutors Mohamed Maamouri John Makhoul The rest of the NIST Gang!!!