Spoken Content Retrieval Beyond Pipeline Integration of Automatic Speech Recognition and Information Retrieval

David N. Racca

Bachelor’s in Computer Science

A dissertation submitted in fulfilment of the requirements for the award of

Doctor of Philosophy (Ph.D.)

to the

Dublin City University School of Computing

Supervisor: Prof. Gareth J.F. Jones

July 2018

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D. is entirely my own work, and that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed:

(Candidate) ID No.:

Date:

Contents

List of Tables vi

List of Figures viii

Abstract xi

Acknowledgements xii

1 Introduction 1 1.1 Overview of spoken content retrieval (SCR) ...... 2 1.1.1 Information access and retrieval from spoken content ...... 2 1.1.2 SCR system overview ...... 3 1.1.3 Open problems in SCR ...... 9 1.2 Research questions ...... 13 1.3 Thesis structure ...... 14

2 Review of Fundamental Technologies in SCR 17 2.1 Information retrieval (IR) ...... 18 2.1.1 Text pre-processing and indexing ...... 18 2.1.2 Frameworks for ranked retrieval ...... 19 2.1.3 Evaluation of ranked retrieval ...... 24 2.2 Automatic speech recognition (ASR) ...... 27 2.2.1 Overview ...... 27 2.2.2 Speech units, signal processing, and feature extraction ...... 29 2.2.3 Language and acoustic modelling ...... 31 2.2.4 Decoding, output representation, and evaluation ...... 34 2.3 Content structuring ...... 36 2.3.1 Automatic segmentation of text documents ...... 37 2.3.2 Segmentation of spoken content ...... 41 2.4 The application of content structuring methods to text retrieval ...... 43 2.4.1 Document retrieval ...... 43 2.4.2 Passage retrieval ...... 45 2.4.3 XML retrieval ...... 47 2.5 Summary ...... 49

3 Review of SCR Research 51 3.1 Experiments with formal speech ...... 51 3.1.1 Early work: voice mail and private collections ...... 51 3.1.2 Broadcast news: the TREC-SDR campaigns ...... 53 3.2 Experiments with conversational spontaneous speech ...... 56

3.2.1 Interviews: the CLEF-SR campaigns ...... 57 3.2.2 Broadcast TV: the MediaEval campaigns ...... 58 3.2.3 Lecture recordings: the NTCIR campaigns ...... 60 3.2.4 Final remarks on content structuring and ASR errors ...... 61 3.3 SCR beyond lexical matching: exploiting acoustic features and prosody . . 63 3.3.1 Speech Prosody ...... 63 3.3.2 Prosody and informativeness ...... 66 3.3.3 Prosody and ASR errors ...... 68 3.3.4 Previous attempts to use prosody in SCR ...... 69 3.4 Summary ...... 71

4 Materials and Test Collections 73 4.1 The BBC collection of TV content ...... 74 4.1.1 Overview ...... 74 4.1.2 Speech collection and transcripts ...... 75 4.1.3 Topics ...... 80 4.1.4 Relevance assessments ...... 84 4.2 The SDPWS collections of academic presentations ...... 87 4.2.1 Overview ...... 88 4.2.2 Speech collection and transcripts ...... 89 4.2.3 Topics ...... 96 4.2.4 Relevance assessments ...... 100 4.3 Summary ...... 103

5 Prosodic-based Term Weighting 104 5.1 Prominence score computation ...... 105 5.1.1 Extraction of low-level descriptors ...... 105 5.1.2 Speaker-based standardisation, time-alignment, and word durations 107 5.1.3 Combining low-level descriptors into prominence scores ...... 110 5.2 Prominence score integration ...... 112 5.2.1 General integration approach ...... 112 5.2.2 GH’s integration approach ...... 113 5.2.3 CWL’s integration approach ...... 114 5.2.4 A rough interpretation of GH and CWL under the PRF ...... 116 5.3 Experiments with heuristic retrieval functions ...... 118 5.3.1 Tasks and test collections ...... 118 5.3.2 Comparison between GH, CWL, and Okapi BM25 ...... 122 5.3.3 Comparison between acoustic and randomised scores ...... 128 5.3.4 Comparison between acoustic scores and other weighting schemes . . 131 5.3.5 Experiments with feature combinations ...... 137 5.3.6 Summary of experiments with heuristic functions ...... 148 5.4 Experiments with statistical methods ...... 149 5.4.1 Correlation and regression analysis ...... 150 5.4.2 Acoustic-based classification of significant terms ...... 158 5.4.3 Learning-to-rank with acoustic features ...... 164 5.4.4 Summary of experiments with statistical methods ...... 168 5.5 Summary ...... 171

6 Robust SCR through Passage Contextualisation 173

6.1 Motivation ...... 174 6.2 Contextualisation techniques ...... 177 6.2.1 Document score interpolation (DSI) ...... 177 6.2.2 Positional models (PMs) ...... 178 6.3 Experiments with contextualisation techniques ...... 184 6.3.1 Task and test collections ...... 184 6.3.2 Maximising retrieval effectiveness via QF and exponential IDF . . . 186 6.3.3 Contextualisation experiments ...... 193 6.3.4 Confidence adaptive contextualisation ...... 198 6.4 Summary ...... 202

7 Content Structuring and Evaluation in SCR 205 7.1 Evaluation of unstructured content retrieval ...... 206 7.1.1 Overview and the pool bias problem ...... 206 7.1.2 Representation and visualisation of search results ...... 208 7.1.3 Browsing dimensions and user satisfaction ...... 209 7.1.4 Browsing and navigation of multimedia content ...... 210 7.2 Evaluation measures for unstructured content retrieval ...... 212 7.2.1 One-sided measures based on temporal distance ...... 213 7.2.2 Two-sided measures based on text or temporal units ...... 216 7.2.3 Browsing and interaction oriented measures ...... 220 7.3 A new user-centric evaluation framework for SCR ...... 225 7.3.1 Horizontal browsing model ...... 225 7.3.2 Vertical browsing model ...... 235 7.3.3 Summary ...... 240 7.4 Cross-evaluation of content structuring methods for SCR ...... 240 7.4.1 Task, collections, and evaluation measures ...... 241 7.4.2 Comparison of content structuring methods ...... 244 7.5 Summary ...... 253

8 Conclusions 257 8.1 Summary of main contributions ...... 257 8.2 Research questions revisited ...... 261 8.3 Future work ...... 264

Appendices 269

A List of publications 269

B Index Similarity Metrics 271

C LambdaMART 273

D Coordinate Ascent Optimisation 275 D.1 Line Search ...... 275 D.2 Promising Directions ...... 275

E Results of experiments with binary classifiers 277

Bibliography 278

List of Tables

4.1 Duration statistics of videos in the BBC collection after removing duplicates. 75 4.2 Length statistics of BBC collections...... 79 4.3 Recognition accuracy of BBC transcripts...... 80 4.4 Examples of transcripts for the show Daily Politics...... 80 4.5 Examples of transcripts for the show Top Gear...... 81 4.6 Examples of transcripts for the show Oliver Twist...... 81 4.7 Example of SH13 known-item topics for the BBC1 collection...... 82 4.8 Example of SH14 ad-hoc topics for the BBC2 collection...... 83 4.9 Length statistics of SH13, SH14, and SAVA queries...... 84 4.10 Example of SAVA ad-hoc topics for the BBC2 collection...... 85 4.11 Duration statistics of presentation recordings from the SDPWS collection. . 89 4.12 Manual and ASR transcripts for the SDPWS collections...... 89 4.13 Annotations of spontaneous speech phenomena in the SDPWS collection. . 90 4.14 ASR models used to transcribe the SDPWS collection...... 93 4.15 Length statistics of processed transcripts from the SDPWS2 collection. . . . 96 4.16 Recognition accuracy of presentation transcripts for the SDPWS2 collection. 96 4.17 Examples of SD2 topics for the SDWPS1 collection...... 97 4.18 Examples of SQD1 topics for the SDWPS2 collection...... 98 4.19 Examples of SQD2 topics for the SDWPS2 collection...... 99 4.20 Length statistics for processed queries from the SD2, SQD1, and SQD2 sets. 100 4.21 Recognition error rates for the spoken queries of the SQD1 and SQD2 sets. 100 4.22 Availability of relevance assessments for NTCIR topics...... 103

5.1 Retrieval tasks, collections, topics, and transcript types in which the GH and CWL functions were evaluated...... 118 5.2 Length statistics of segmented transcripts from the BBC1, BBC2, and SDPWS2 collections...... 119 5.3 Comparison between Okapi BM25 with TREC’s recommended parameter settings and alternative settings...... 122 5.4 Summary of the prominence score derivations and integration approaches explored in the experiments with prominence scores...... 123 5.5 General statistics of occurrence level prominence scores...... 125 5.6 Mean values of occurrence-level features in the BBC2 collection for different TV genres...... 125 5.7 Comparison of retrieval effectiveness between the best instantiation of GH and Okapi BM25...... 127 5.8 Comparison of retrieval effectiveness between the best instantiation of CWL and Okapi BM25...... 128 5.9 Comparison of retrieval effectiveness between the GH, CM, BIM, and TFO functions, when δ = 0 and the best derivation for ps(i) is used in GH. . . . 132

5.10 Relative deterioration in retrieval effectiveness for the SDR and SPR tasks when using the simpler weighting schemes CM, BIM, and TFO, instead of BM25...... 133 5.11 Number of test collections on which the GH retrieval function is significantly more effective than CM...... 134 5.12 Comparison between document-level and passage-level aggregated features in the SPR task...... 136 5.13 Comparison of retrieval effectiveness between the GH with document-level aggregates, and other weighting schemes in the SPR task...... 137 5.14 Comparison of retrieval effectiveness between the CWL and TFO functions when ctf(i) = 1 and the best derivation for ps(i) is used in CWL...... 138 5.15 Retrieval effectiveness of the GH function (with δ = 0) when using the best subset of features for computing ps∨(i) versus using the single-best feature. 144 5.16 Comparison between the GH function and Okapi BM25 when using the best value of δ and feature subset for computing ps∨(i)...... 146 5.17 Spearman’s ρs rank-order correlation coefficients of term features against BM25, BIM, and TFO scores...... 153 5.18 Spearman’s ρs rank-order correlation coefficients of occurrence features against BM25, BIM, and TFO scores...... 155 5.19 Spearman’s ρs between IR scores predicted by linear regression models. . . 156 5.20 Statistics of datasets generated for occurrence classification experiments with SDR data from the BBC1, BBC2, and SDPWS2 collections...... 159 5.21 Statistics of datasets generated for term classification experiments with SDR data from the BBC1, BBC2, and SDPWS2 collections...... 159 5.22 Cross-validation BAC (%) of logistic regression models trained with occurrence-level acoustic features...... 161 5.23 Cross-validation BAC (%) of logistic regression models trained with term-level acoustic features...... 162 5.24 Feature template used in learning-to-rank experiments...... 165 5.25 Statistics of datasets generated for learning-to-rank experiments with SDR data from the BBC1, BBC2, and SDPWS2 collections...... 166 5.26 Retrieval effectiveness of LambdaMART models on test queries when BIM is used as base ranker...... 169 5.27 Retrieval effectiveness of LambdaMART models on test queries when TFO is used as base ranker...... 169 5.28 Retrieval effectiveness of LambdaMART models on test queries when Okapi BM25 is used as base ranker...... 170

6.1 Transcripts from the SDPWS2 collection used in the contextualisation experiments...... 186 6.2 Length statistics of segmented transcripts from the SDPWS2 collection. . . 186 6.3 Recognition accuracy of passages as measured by WER and index similarity metrics for the SDPWS2 collection...... 186 6.4 Recognition accuracy of query terms in passages from the SDPWS2 collection for different combinations of query and document transcripts...... 187 6.5 Passage retrieval effectiveness of Okapi BM25 with QF and EIDF disabled, with QF enabled, with EIDF enabled, and with both QF and EIDF enabled. 192 6.6 Passage retrieval effectiveness of Okapi BM25 when EIDF or BEIDF are used as collection frequency weight...... 193

6.7 Retrieval effectiveness (MAP) of contextualisation models for training queries (SQD1)...... 195 6.8 Retrieval effectiveness (MAP) of contextualisation models for test queries (SQD2)...... 196 6.9 Retrieval effectiveness (MAP) of contextualisation models with adaptive and non-adaptive context parameters...... 202

7.1 Persistence probabilities for tolerant and intolerant users ...... 244 7.2 Collection statistics for passages produced by static structuring methods for the SDPWS2 collection...... 249 7.3 Length statistics of passages retrieved by BM25, PMs, or HMM-based methods for the SDPWS2 queries for each structuring technique...... 250 7.4 Retrieval effectiveness calculated by traditional evaluation measures and NPNG for various structuring methods over the SDPWS2 queries...... 254 7.5 Retrieval effectiveness calculated by traditional evaluation measures and NPNG for re-ranked lists of passages based on document scores...... 255

E.1 Retrieval results of SCR experiments with a modified BM25 function that incorporates the predictions of a binary classifier trained with acoustic features...... 277

List of Figures

1.1 Block diagram showing the architecture and components of a conventional SCR system...... 4

2.1 Underlying user model proposed in rank-biased precision ...... 27 2.2 Main components and simplified architecture of a standard ASR system. . . 28 2.3 Graphical representation of a hidden Markov model used to model an individual phone...... 33 2.4 An example of a recognition lattice...... 35 2.5 An example of a word confusion network...... 35 2.6 TextTiling applied to a sequence of pseudo-sentences...... 38 2.7 Example of a dotplot...... 40

4.1 Distribution of video durations in the BBC collection...... 75 4.2 The search interface of the AXES system...... 82 4.3 The boundary refinement interface of the system used by SH13 organisers in the topic-generation study...... 83 4.4 Example of video segments judged as relevant (green) and non-relevant (red) by three assessors and their unions into single relevant segments. . . . 87 4.5 Web interface for collecting relevance assessments in SH14 and SAVA. . . . 87 4.6 Length distribution of relevant segments in the BBC collections...... 88

5.1 Energy, loudness and F0 contours extracted with openSMILE for an utterance from the BBC2 collection...... 108 5.2 The sigm normalisation function...... 111 5.3 Visualisation of the multi-level aggregation approach used to calculate prominence scores of terms...... 124 5.4 Distribution of occurrence level prominence scores (features) for words in: . 126 5.5 Effectiveness of GH when using acoustic scores and random scores...... 130 5.6 An example of how prominence scores are calculated in the inner and outer combinations approaches...... 140 5.7 Distribution of MAP scores of feature combination approaches in the SDR task...... 142 5.8 Distribution of MAP scores of feature combination approaches in the SPR task...... 143 5.9 Effectiveness of Okapi BM25 (red horizontal lines) and GH for increasing values of δ (blue lines) when the best combination of features is used in GH to calculate prominence scores...... 147 5.10 Distribution of true BM25 scores in the BBC1 collection for term occurrences. 157 5.11 Distribution of true BM25 scores in the SDPWS2 collection for term occurrences...... 157

6.1 Locations where query terms appear in a manual and ASR transcript for a spoken query and document from the SDPWS2 collection...... 176 6.2 Kernel density functions proposed in previous research to calculate pseudo-frequency counts in PMs...... 180 6.3 Example of how pseudo-frequency counts are calculated when the Gaussian kernel is used...... 182 6.4 Example of pseudo-frequency counts computed based on the maximum kernel values that can be obtained for a passage, for every term occurrence. . . 183 6.5 Exponential collection frequency weight (cfw(i)^d) for different values of the exponent d and N = 2329...... 191 6.6 Bayesian Exponential IDF (BEIDF(i)) for different values of the parameter γ and N = 2329...... 192 6.7 MAP scores on SQD2 obtained with the PM method...... 197 6.8 MAP scores on SQD2 queries obtained with the DSI method for six representative conditions and λ ∈ [0, 1]...... 198 6.9 Distribution of confidence scores associated to terms from the SDPWS2 transcripts...... 201

7.1 Reward or penalty distance function for gAP used in different evaluation campaigns. For Figures 7.1a and 7.1b, distance (dist) is measured in seconds, while for Figure 7.1c it is in number of characters...... 214 7.2 FSA for the T2I user model proposed by De Vries et al. (2004)...... 216 7.3 Proposed model of horizontal browsing...... 227 7.4 Probability density functions of geometric and truncated ([0, 30]) geometric distributions for various values of p...... 228 7.5 Probability of finding the onset of some relevant content r (PF(r)) if starting from an entry point located at 0 for a hypothetical user...... 231 7.6 Complete model of browsing behaviour...... 236 7.7 A tolerant and intolerant group of users for NPNG...... 243 7.8 5-state HMM structure proposed by Jiang and Zhai (2006) for passage retrieval...... 248

Abstract

Spoken Content Retrieval Beyond Pipeline Integration of Automatic Speech Recognition and Information Retrieval

David N. Racca

The dramatic increase in the creation of multimedia content is leading to the development of large archives in which a substantial amount of the information is in spoken form. Efficient access to this information requires effective spoken content retrieval (SCR) methods. Traditionally, SCR systems have focused on a pipeline integration of two fundamental technologies: transcription using automatic speech recognition (ASR) and search supported using text-based information retrieval (IR). Existing SCR approaches estimate the relevance of a spoken retrieval item based on the lexical overlap between a user’s query and the textual transcriptions of the items. However, the speech signal contains other potentially valuable non-lexical information that remains largely unexploited by SCR approaches. In particular, acoustic correlates of speech prosody, which have been shown to be useful for identifying salient words and determining topic changes, have not been exploited by existing SCR approaches.

In addition, the temporal nature of multimedia content means that accessing content is a user-intensive, time-consuming process. In order to minimise user effort in locating relevant content, SCR systems could suggest playback points in retrieved content indicating the locations where the system believes relevant information may be found. This typically requires adopting a segmentation mechanism for splitting documents into smaller “elements” to be ranked and from which suitable playback points could be selected. Existing segmentation approaches do not generalise well to every possible information need or provide robustness to ASR errors.

This thesis extends SCR beyond the standard ASR and IR pipeline approach by: (i) exploring the utilisation of prosodic information as complementary evidence of topical relevance to enhance current SCR approaches; (ii) determining elements of content that, when retrieved, minimise user search effort and provide increased robustness to ASR errors; and (iii) developing enhanced evaluation measures that could better capture the factors that affect user satisfaction in SCR.

Acknowledgements

First and foremost, I would like to express my most sincere gratitude to my wife, Flor, for joining me in this crazy journey even when this meant leaving so many things behind in the motherland. Thank you for believing in me, for your unconditional support, motivational talks, afternoon mates, and for always reminding me what the important things in life truly are. I would have never been able to finish my thesis without all your help and love. Secondly, I would like to thank my parents, Claudio and Silvia, my brother and sister, Ivan and July, and my grandparents, Roberto, Mirtha, Isabel and Pancho for taking care of me all times despite being so many kilometres away. Not to have you guys around has certainly been one of the most difficult challenges I have had to face in my years in Ireland. Thank you so much for all the effort you have made in supporting my education and professional career. Studying in Ireland has been an amazing and culturally enriching experience. During the past few years, I have been lucky enough to meet some wonderful people who in one way or the other have made my PhD journey a delightful and enjoyable experience. I am specially thankful to my fellow lab and university colleagues: Dasha Bogdanova, Maria Eskevich, Piyush Arora, Iacer Calixto, Sheila Castillo, Chris Hockamp, Lorraine Goeuriot, Teresa Lynn, Eva Vanmassenhove, Dimitar Shterionov, Keith Curtis, Peyman Passban, Antonio Toral, Lamia Tounsi, Ali Hosseinzadeh, Ahmad Khwileh, Maria Alecu, Anna Kostekidou, Federico Fancellu and many other colleagues with whom I have had unforgettable times in Dublin. I am also extremely grateful to my flatmates from VA101, Suzanne Mc Mahon for making us feel at home at all times and teaching us everything we know about Ireland, and Nicolo Fantoni for making sure my caffeine levels were up to standards. I will always be grateful to the Doyle family for all their support, and for making us feel as members of their family. A huge sense of gratitude goes to my supervisor, Gareth J.F. Jones. Thank you Gareth for your insightful and sharp advice, and for introducing me to the wider research community. I would also like to thank my examiners, Alan Smeaton and Lori Lamel, for their interest in my work and invaluable suggestions for its improvement. Last but not least, I am thankful to Science Foundation Ireland for the grant No: 12/CE/I2267 Centre for Next Generation Localisation II (CNGL II) and ADAPT that supported this research. Special thanks to Dublin City University, NTCIR, and SIGIR for providing student travelling grants that allowed me to present and discuss my work with the wider research community.

Chapter 1

Introduction

The past few decades have seen an explosion in the amount of multimedia content that is being created and stored in digital format. This accumulation of data has been facilitated by advances in new technologies which have provided individuals with relatively low-cost devices that are able to produce and process high-quality audiovisual material. Almost every person on the planet now has access to powerful recording devices that could fit in a pocket. In combination with advances in mobile networks, this is causing a true revolution in the amount of video and audio that people generate and consume. Instant communication on social media platforms, which had mostly been driven by text, is now more frequently being driven by the sharing of images, voice messages, and video.

In addition to personal users, there is a need to process the increasing volume of multimedia content produced in the enterprise and corporate sector. It is common nowadays to hear “this call may be recorded for quality assurance purposes” every time one tries to contact a bank, TV, or internet service provider. Apart from call centres, media professionals involved in the broadcast of radio and TV are interested in tools for the editing, clustering, and automatic transcription of audio and video. Universities and individuals around the world offer online courses based on video lectures and are interested in providing users with tools for browsing and searching through such collections. TV-on-demand services are greatly enhancing the experience of users by implementing, for instance, automatic categorisation of movies, content-based search, and personalised show recommendations. Many companies are now using advanced telecommunication systems with recording capabilities that enable them to store business meetings and oral presentations for later consumption.

In all these contexts, the large amounts of audiovisual content available exceed the capability of users to manually handle, manage, and access the information contained within it. Therefore, it is imperative to develop computational techniques to permit automatic, efficient, and effective access to the relevant information contained within large collections of multimedia recordings. Frequently, much of the information of interest contained in an audiovisual recording is principally encountered within its audio stream, that is, within

the spoken content or speech, as opposed to its visual stream which, although important, may only provide non-critical information. Examples of this type of content may include documentaries, interviews, meetings, lectures, and broadcast news, where most of the information is conveyed through speech.

This thesis investigates several aspects relating to the automatic retrieval of relevant information from within collections of multimedia recordings, where most of the information of interest is in spoken form. More specifically, this thesis deals with aspects associated with the use of speech information that go beyond which words are spoken to how they are spoken, the challenge of recovering from potential errors in the automatic recognition of spoken words, and that of estimating user satisfaction and measuring the quality of a list of search results.

1.1 Overview of spoken content retrieval (SCR)

This section introduces the basic concepts related to SCR. It provides a brief description of the fundamental technologies needed to develop practical SCR systems, summarises previous research carried out in the area, and highlights current challenges in the field.

1.1.1 Information access and retrieval from spoken content

Spoken content retrieval (SCR) is concerned with the development of automatic methods to facilitate the search for information in a collection of speech recordings that satisfies an information request from the user (Chelba et al., 2008; Larson and Jones, 2012a; Lee et al., 2015). The reason why a user turns to an SCR system in the search for information is the so-called information need (Larson and Jones, 2012b), a term borrowed from the field of information retrieval (IR) (Manning et al., 2008) to refer to the deficit of information which the user is seeking to satisfy by using a search tool.

Commonly, the audio content within a collection is organised as a set of individual audio files, each containing the audio stream of a single recording instance. Less frequently, the spoken collection is just a long continuous stream of unsegmented audio without any given file structure. Even when the collection is organised into individual files, the information contained in each file may well cover multiple topics that users may be interested in. In a meeting retrieval system, for instance, users may be interested in finding the particular location within a meeting where a decision was made or where a particular item from the agenda was discussed. In broadcast news retrieval, interests may vary from finding all recordings covering the same news story to finding all instances where a particular person is mentioned. When retrieving content from lectures or academic presentations, searchers may be interested in finding a lecture they missed, one where a new topic was presented, or the exact moment when the lecturer introduces a new topic.

In order to satisfy the specific information needs that users may have, an SCR system must then provide users with pointers to where the content requested is exactly located

in the collection. These pointers can be as simple as a path in a file system indicating which audio file contains the relevant information, or as advanced as a playback tool with embedded audio that, once clicked, commences playback of the audio stream from the exact point in time where the relevant information is located. This set or list of playback pointers is also referred to as “jump-in” or “listen-in” points. The user is said to be satisfied with the pointers produced by an SCR system if the information being sought can be found effectively within the audio streams by following the playback pointers within a reasonable amount of time.

Reducing the time that users need to spend listening to audio material is critically important for maximising user satisfaction in SCR applications. In fact, time is one of the main reasons why search systems are useful: if time were not a concern, then users could just find the required information by manually assessing every document in the collection. This approach would obviously be inefficient for users, who would likely have to spend most of their time assessing irrelevant content. One of the goals in IR is thus to reduce the auditioning of irrelevant content, with the ultimate goal of reducing the time and effort required by users to locate the relevant information.

Because information in audio format is less easily accessible than in text format, auditioning time plays a major role in SCR applications. As opposed to the consumption of text content, the consumption of speech requires sequential processing and thus additional time and effort on the part of the user. By contrast, textual content is immediately accessible in the sense that the information contained need not be processed in sequence. In addition, text content usually contains explicit structural information (headings, paragraphs, sections) that can facilitate its navigation, permitting almost immediate random access to individual pieces of information. Although structural information may also be present in spoken content, for instance in the form of speaker turns, it remains tacit in the audio stream and is therefore not immediately available to the SCR system. Advanced playback interfaces that permit the increase of playback speed or random seeks may help users reduce auditioning time, yet these cannot provide users with immediate access to speech content, which still needs to be listened to by users.

The consideration of aspects related to the access of information in audio and text media establishes a clear difference as to how user satisfaction or the effectiveness of a retrieval system should be measured. While in the text domain “retrieval effectiveness” is frequently quantified by the amount of relevant material that is returned to the user, relative to the amount of irrelevant material, in the speech domain “retrieval effectiveness” must additionally take into consideration factors related to the temporal characteristics of speech media, such as the amount of time users waste in listening to non-relevant material.

1.1.2 SCR system overview

Stated naively, SCR is the application of automatic speech recognition (ASR) and textual information retrieval (IR) to collections of speech recordings (Larson and Jones, 2012a).



Figure 1.1: Block diagram showing the architecture and components of a conventional SCR system.

In the so-called “cascading” approach to SCR (Lee et al., 2015), the following processing steps are involved: (i) an ASR system is used to convert speech into text, particularly to obtain a text transcript of every spoken document in the collection; (ii) an IR engine is then used to create an index of the text collection and to rank transcripts in order of estimated relevance to the user query; and (iii) playback pointers corresponding to the top ranked transcripts are then generated and retrieved as search results to the user for further consumption. Figure 1.1 depicts the architecture and main components of a standard SCR system. The dashed lines in the diagram divide components into two large groups: those that are used at indexing time to construct a timed search index (top group), and those that are used at retrieval time (bottom group) to generate the search results. The timed index file created in the indexing process is a series of data structures containing information about the occurrence of individual spoken words across documents, along with their time of incidence within the audio streams. Since ASR systems are incapable of recognising words that are not in the recognition vocabulary, the timed index may include additional information to help search for out-of-vocabulary terms. This includes word proxies (Chen et al., 2013), lattices or N-best list, or subword units such as morphemes or phonemes. While indexing components do not generally have major limitations in terms of processing time, retrieval components are designed so that search results can be produced as quickly as possible (less than a second in practice) to avoid wasting the time of the user. Beyond their classification into indexing and retrieval time, the components of an SCR system can be additionally grouped by their functionality: IR components, which deal with the indexing, processing, and searching of textual data; ASR components, which perform speech-to-text conversions; and content structuring components, which provide the means for detecting relevant regions within large spoken documents and determining the location of playback pointers.
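To make the division of labour in Figure 1.1 concrete, the sketch below outlines the cascading pipeline in Python. It is an illustrative sketch only: the collection and transcribe objects are hypothetical placeholders standing in for an audio archive and an ASR engine, not the interface of any particular system.

```python
# Illustrative sketch of the cascading SCR pipeline in Figure 1.1.
# `transcribe` is a hypothetical placeholder for an ASR engine and is
# assumed to return (word, start_time_in_seconds) pairs.

def build_timed_index(collection, transcribe):
    """Indexing time: map each term to the documents and time offsets
    at which it was recognised."""
    index = {}                                   # term -> list of (doc_id, time)
    for doc_id, audio in collection.items():
        for word, start_time in transcribe(audio):
            index.setdefault(word, []).append((doc_id, start_time))
    return index

def search(query_terms, index, k=10):
    """Retrieval time: score documents by how many query-term occurrences
    they contain, and return playback pointers (document, earliest match)."""
    scores, pointers = {}, {}
    for term in query_terms:
        for doc_id, time in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
            pointers[doc_id] = min(pointers.get(doc_id, time), time)
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [(doc_id, pointers[doc_id]) for doc_id in ranked]
```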

4 Information retrieval

Information retrieval (IR) (Manning et al., 2008) deals with the problem of finding content that is relevant to a user’s information need within a collection of items. A user typically expresses their information need as a text query, which is most commonly formulated as a sequence of keywords or as a description in natural language. The query is then provided as an input to an IR system which searches for items in the collection containing one or more words from the query and presents them back to the user as a list of items ranked by their estimated likelihood of relevance.

In order to provide quick search response times, text indexing techniques (Zobel and Moffat, 2006; Manning et al., 2008) are used to construct a search index. The index is pre-populated with pointers and statistics about the occurrence of words in the documents so that documents that match the query can be efficiently identified at retrieval time. During this process, a lexicon containing the list of unique words found in the documents is created along with an inverted index, which stores information about the number of times a particular word occurs in a given document.

Prior to the construction of a search index, the text contained in the documents needs to be processed. This process usually consists of tokenisation, removal of punctuation symbols and stop words, and stemming (Manning et al., 2008). In the context of IR, tokenisation involves the identification of linguistic units to be used as indexing terms of documents. Normally, only semantically meaningful units of a language, such as phrases, words, morphemes, or phonemes, are used as indexing terms. Stop word removal is useful in IR because it reduces the size of the index without significantly harming retrieval performance. This is because terms that occur frequently in the collection are less useful in distinguishing relevant from non-relevant documents. Finally, stemming is used to cluster semantically similar terms with different suffixes into a single equivalence class. When a query is provided to the IR system, the text of the query is processed in a similar manner to the documents in order to maximise the overlap between them.

In IR jargon, “matching” refers to the process of scoring every document in the collection against the query. These scores are estimated based on the number of terms shared between the query and each document, so that they either reflect the probability of relevance of the documents, or their degree of semantic similarity with respect to the query. The total order induced over the collection of documents by these relevance scores can then be used to suggest the order in which documents should be inspected by the user. Two popular ranking models are the vector space model (VSM) (Salton, 1979) and the probabilistic relevance model (Spärck Jones et al., 2000). Under the scope of these general models, several ranking functions have been proposed (Salton and Buckley, 1988; Zobel and Moffat, 1998; Robertson et al., 1994), most of which calculate a relevance score as a linear combination of weights associated with each term in the query matching the document. The main principle governing how weights are assigned to terms in a document states

that higher weight values should be given to terms that are representative of the content of the document and that can discriminate this document from others. Term weights are typically defined based on three fundamental statistics: (i) the number of times a term occurs in the document under consideration; (ii) the length of this document; and (iii) the number of documents in which a term appears across the whole collection. Several schemes have been proposed in the past that define functions for deriving effective term weights from frequency information (Zobel and Moffat, 1998). Usually, terms with high within-document frequencies relative to the length of the document and with low document frequencies relative to the size of the collection are given larger weight values. Because relevance scores are calculated as the sum of term weights, those documents containing higher-weighted terms in the current query are thus likely to appear at higher ranks in the list of results for this query.

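As a concrete illustration of these principles, the following sketch scores a document against a query as a sum of term weights built from the three statistics just described (within-document frequency, document length, and collection-wide document frequency). It is a minimal TF-IDF-style example for illustration only, not the exact VSM, probabilistic, or BM25 formulations discussed later in this thesis.

```python
import math
from collections import Counter

def toy_relevance_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len):
    """Toy relevance score: sum of term weights over query terms that occur
    in the document. Weights grow with within-document frequency, shrink
    with document length, and grow with collection-level rarity."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0 or doc_freq.get(term, 0) == 0:
            continue
        # within-document frequency, normalised by document length
        norm_tf = tf[term] / (tf[term] + doc_len / avg_doc_len)
        # inverse document frequency: rarer terms receive larger weights
        idf = math.log(num_docs / doc_freq[term])
        score += norm_tf * idf
    return score
```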
Content structuring (segmentation)

In order to reduce the amount of time that users need to spend auditioning audio material, an SCR system should ideally indicate the most likely starting time of the relevant part in the audio file and also potentially the time span of material that contains likely relevant information. In practice, this is normally achieved by splitting documents into a set of sub-documents, referred to as either passages or segments. The resulting sub-documents can then be treated as documents from an IR perspective, and be indexed and later ranked according to their relevance score against the query.

The process of splitting documents into passages for retrieval has long been the focus of research in the IR community and is known as passage retrieval (Callan, 1994; Kaszkiel and Zobel, 1997, 2001). A generalisation of passage retrieval is XML retrieval, where the passages to be ranked are organised in a hierarchical fashion into multiple levels of content granularity (Fuhr et al., 2002; Fuhr and Lalmas, 2007). The most effective passage retrieval and XML retrieval techniques exploit document-level as well as passage-level evidence at different granularity levels, for improved ranking of relevant passages or documents (Kaszkiel and Zobel, 2001; Ogilvie and Callan, 2005; Arvola et al., 2011). Techniques that seek to improve the ranking of relevant passages given the context from their container documents and that of their adjacent passages are known as contextualisation techniques (Kekäläinen et al., 2009; Arvola et al., 2011).

Research in SCR has applied passage retrieval techniques for finding listen-in or jump-in, and listen-out or jump-out time points close to the beginning and, respectively, the end of relevant fragments in collections of spontaneous and conversational speech (Oard et al., 2006; Larson et al., 2011; Eskevich et al., 2013a; Akiba et al., 2011). In these approaches, the spoken collection is first segmented into short passages, which are then ranked by relevance to the query. The playback pointers to be shown to the user are then given by the time offsets of the ranked passages relative to the start of the documents in which they occur. Most approaches adopted for segmenting spoken collections into individual

passages are based on windowing (Stanfill and Waltz, 1992; Callan, 1994; Kaszkiel and Zobel, 1997, 2001) or automatic text segmentation methods (Hearst and Plaunt, 1993; Choi, 2000; Malioutov and Barzilay, 2006). Windowing consists of generating passages by moving a fixed-length window across the text document. The window is positioned at the beginning of the document and moved towards the end in steps given by a fixed-length unit. A new passage containing the words that fall within the sliding window is generated at each step until the end of the document is reached. Additional improvements in retrieval performance can often be obtained by setting the step length to be smaller than the length of the window so that the resulting passages overlap (Stanfill and Waltz, 1992; Callan, 1994; Kaszkiel and Zobel, 1997). The length units are usually defined in terms of time or in number of words (Quinn and Smeaton, 1999).

Text segmentation algorithms seek to divide a text or speech document into semantically coherent units by exploiting features that are informative of topic shifts. These include methods based on lexical cohesion (Hearst, 1997; Reynar, 1998; Choi, 2000; Malioutov and Barzilay, 2006), and others that exploit multimodal features in either a supervised or unsupervised fashion (Reynar, 1998; Shriberg et al., 2000; Tür et al., 2001; Galuščáková and Pecina, 2014b).

When overlapping passages are indexed, the matching component may return pointers to passages that overlap in the result list. In the process of doing this, it may assign different ranks to passages that are adjacent in the original speech recordings. Depending on the application domain, users may be dissatisfied if presented with a list of similar playback pointers since these may be perceived as duplicate results. Two general segment consolidation strategies have been developed to deal with these issues: filtering (Wartena, 2012) and recombination (Abberley et al., 1999b; Johnson et al., 2000). Filtering consists of removing passages from the list of results that overlap or that are close to another passage ranked higher in the result list. In this strategy, only the result with the highest rank is kept. Recombination consists of merging passages that overlap or that are close to another passage ranked higher in the result list. In this case, the combined passage is normally assigned the rank of the highest-scoring merged passage.

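The sketch below illustrates the two mechanisms described above: generating overlapping fixed-length passages with a sliding window, and the filtering consolidation strategy that keeps only the highest-ranked result among overlapping passages. The window and step sizes are arbitrary illustrative values.

```python
def sliding_window_passages(words, window=100, step=50):
    """Generate overlapping fixed-length passages (measured in words).
    Choosing step < window makes consecutive passages overlap."""
    passages = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        passages.append((start, words[start:start + window]))
    return passages

def filter_overlapping(ranked_passages, window=100):
    """'Filtering' consolidation: keep only the highest-ranked passage
    among passages that overlap in the same source document.
    `ranked_passages` is a list of (doc_id, start, score) sorted by
    descending score."""
    kept = []
    for doc_id, start, score in ranked_passages:
        if all(d != doc_id or abs(start - s) >= window for d, s, _ in kept):
            kept.append((doc_id, start, score))
    return kept
```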
Automatic speech recognition (ASR)

Automatic speech recognition is concerned with the identification of words spoken in continuous speech, possibly by multiple speakers, across highly variable acoustic conditions (Levinson et al., 1983; Rabiner, 1989). Early ASR systems were only capable of recognising words from a small vocabulary, spoken in isolation, by a single speaker, in controlled recording environments. Subsequent improvements in ASR technology during the 1980s and 1990s gave rise to large vocabulary continuous speech recognition (LVCSR) systems capable of transcribing speech produced by multiple speakers and considering a much larger number of words (60,000 or more) (Gauvain et al., 1999; Rousseau et al.,

2011). The most effective speech recognition systems are based on statistical models that are able to handle the high complexity of the speech signal as well as the high variability that exists in spoken language. A popular statistical framework for ASR systems models the mapping between phonemes underlying spoken words and acoustic input from the speaker via hidden Markov models (HMMs) (Levinson et al., 1983), and the space of possible word sequences in a language via statistical language models (LMs) (Katz, 1987). The recognition process then consists of searching for the sequence of words that best explains the acoustic patterns observed and that has the highest language model probabilities. To make this inference practical, the number of possible words that can be recognised is fixed in advance, limited by the vocabulary of the language model.

ASR systems can produce predictions in multiple formats. A lattice is a graph that represents multiple hypotheses made by the recogniser, where nodes are points in time and arcs represent hypothesised words along with their confidence scores. The 1-best hypothesis is the sequence of words corresponding to the path in which the ASR system has greatest confidence. Typically, SCR systems only consider the 1-best hypothesis from the ASR in the indexing process, although advanced matching techniques (James and Young, 1994) may consider recognition units from less likely hypotheses in an attempt to match words from the query that may be missing from the 1-best hypothesis or the LM vocabulary.

Despite recent improvements in ASR technology (Hinton et al., 2012), transcription errors are still a common issue in modern ASR systems, especially in domains where speech is informal, unstructured, spontaneous, and conversational. The quality of ASR systems is frequently measured by estimating the word error rate (WER) of an ASR hypothesis, by counting the number of word deletions, substitutions, and insertions with respect to the perfect transcription of the utterance. State-of-the-art ASR systems can produce transcripts with WERs of 9%-11% for broadcast news (Bell et al., 2015; Wu et al., 2016), 10%-40% for multi-genre TV broadcast (Bell et al., 2015), 5%-40% for general spontaneous conversational speech (Lileikyte et al., 2015; Xiong et al., 2016; Chiu et al., 2017; Enarvi et al., 2017), and 45%-50% for YouTube videos (Hinton et al., 2012). Recognition rates can vary greatly depending on the domain, genre, spontaneity, language, and audio quality of the speech material as well as the amount of training data and computing resources available. With sufficient training and computing resources, ASR technology can attain WERs as low as 5% for relatively clean telephone conversations in American English (Xiong et al., 2016). By contrast, in more challenging conditions, practical ASR systems can transcribe conversational speech with WERs as high as 20%-40% (Lileikyte et al., 2015; Chiu et al., 2017; Enarvi et al., 2017). As SCR systems principally rely on finding occurrences of query terms in ASR transcripts, ASR errors represent one of the main challenges in achieving effective retrieval.

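The WER computation described above corresponds to a standard word-level edit distance, as sketched below. The example transcripts in the final comment are invented purely for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edit operations to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented example: one substitution ("sat" -> "sit") and one deletion ("the")
# word_error_rate("the cat sat on the mat", "the cat sit on mat")  -> 2/6
```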
8 1.1.3 Open problems in SCR

In order to motivate the research questions addressed in this thesis, this section describes in detail some of the limitations present in existing approaches to SCR, as well as aspects of SCR that have not been explored in full by previous research.

The problem of handling ASR errors in the speech transcripts

Due to the inherent difficulty of the speech recognition task, ASR systems produce erroneous transcriptions of the spoken material. This results in incorrect words being inserted and correct words being substituted or deleted in the predicted text. These errors complicate the task of the IR engine, which principally relies on finding overlapping terms between the query and the documents to identify relevant documents. While human transcripts are free from ASR errors, they are costly to produce, whereas transcripts produced by an ASR system are relatively inexpensive to obtain. They are also free from misspellings and, most importantly, contain word time information which is necessary in SCR for determining the potential location of candidate relevant regions within long spoken documents. Practical SCR thus requires the indexing of ASR transcripts, which in turn necessitates retrieval techniques that can handle recognition errors effectively.

Research in SCR has mainly focused on understanding how ASR errors affect the performance of IR models and on developing techniques to make retrieval robust to these errors. These aspects were extensively explored in the context of the Text REtrieval Conference on spoken document retrieval (TREC SDR) benchmarks (Voorhees and Harman, 2005), which evaluated the effectiveness of SCR systems over a collection of broadcast news speech recordings. Several techniques were then proposed to deal with ASR errors, notably including: the exploitation of multiple hypotheses produced by the ASR (Crestani et al., 1997; Siegler et al., 1997; Tsuge et al., 2011); the indexing of phonetic units instead of words (James and Young, 1994; Smeaton et al., 1997; Chelba et al., 2008); and the application of pseudo-relevance feedback (PRF) to expand the query and document contents with terms extracted from external error-free corpora (Johnson et al., 1999b; Singhal and Pereira, 1999; Woodland et al., 2000).

The techniques developed at TREC SDR were found so effective at reducing the impact of ASR errors on IR performance that SCR was considered a “solved problem” (Garofolo et al., 2000). However, later analysis suggested that broadcast news speech does not present major difficulties for SCR since this type of speech content is normally planned, formal, redundant, and clearly delivered (Allan, 2001). For collections containing recordings of spontaneous or conversational speech, such as interviews, business meetings, or telephone conversations, it was later discovered that ASR errors can significantly decrease the effectiveness of SCR systems (White et al., 2005; Eskevich et al., 2012c; Akiba et al., 2011). The increased difficulty was attributed to the characteristics of casual speech, where information tends to be conveyed by a less diverse set of content-bearing words,

and structural cues, such as topic shifts, are less clearly delivered. In addition to the spontaneity levels of the speech content, it has also been pointed out that ASR errors may have a major impact on SCR effectiveness when the units of text to be ranked are short passages (60-100 words) extracted from an ASR transcript (Allan, 2001). The main reason is that short passages may not contain enough occurrences of query terms for the matching process to be able to recover from query terms being misrecognised by the ASR. Although considering longer excerpts of text, containing a greater number of terms, may seem like a reasonable solution, previous research has shown that the use of long passages can be detrimental to SCR performance when the granularity of the passages differs from that of the relevant content (Wartena, 2012; Eskevich et al., 2014).

As evidenced in previous research, expansion of passages with related terms extracted from in-domain parallel corpora can offer increased robustness to ASR errors (Johnson et al., 1999b; Singhal and Pereira, 1999; Woodland et al., 2000). Nonetheless, these techniques require the availability of an external text corpus with a domain similar to the target collection, which may be difficult to obtain. Contextualisation techniques (Kekäläinen et al., 2009; Arvola et al., 2011) can offer an alternative solution to passage expansion that does not require external corpora. In these techniques, the relevance score of a passage is computed based on the terms contained in the passage plus those contained in the remainder of the document. Although contextualisation techniques have been shown to be effective in textual passage and XML retrieval tasks (Carmel et al., 2013; Arvola et al., 2011), only a limited amount of work has explored their effectiveness in SCR (Nanjo et al., 2014; Shiang et al., 2014), while none of these has properly evaluated the capabilities of these techniques for reducing the impact of ASR errors.

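A minimal sketch of the contextualisation idea follows, assuming the simplest possible form in which a passage score is linearly interpolated with the score of its containing document. The document score interpolation (DSI) and positional models investigated in Chapter 6 are more elaborate than this illustration.

```python
def contextualised_score(passage_score, document_score, lam=0.7):
    """Linear interpolation of a passage score with the score of its
    containing document: a passage whose query terms were misrecognised
    (low passage score) can still be ranked well if the surrounding
    document matches the query. `lam` controls how much weight is given
    to the passage itself; the value here is an arbitrary illustration."""
    return lam * passage_score + (1.0 - lam) * document_score

# Example: the ASR transcript of a short passage lost a query term,
# but the rest of the document still matches the query strongly.
# contextualised_score(passage_score=0.1, document_score=0.9)  -> 0.34
```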
The challenge of exploiting speech beyond lexical information

Current approaches for retrieval and content structuring for SCR have mainly sought to exploit the lexical representation of the spoken content that results from the ASR process, omitting other valuable information that is encoded in the speech signal. However, beyond its lexical representation, speech is known to encode richer information about what is said through the way words are pronounced. This is known as prosody and includes the pitch, duration, and loudness of speech.

Variations in pitch, duration and loudness have frequently been associated with various aspects of spoken communication. They are used for marking emphasis or focus on particular words, indicating the intentions or speech acts of an utterance, expressing emotions and attitudes, and facilitating the understanding of ambiguous syntactic expressions (Wagner and Watson, 2010; Hirschberg, 2002). Furthermore, prosody is believed to encode the information status of words and how this status changes over time. There is evidence that words considered “new”, “important”, “focused”, “not given”, “unpredictable”, “inaccessible”, or “informative” in a discourse are more likely to be emphasised acoustically than others (Prince, 1981; Hirschberg and Grosz, 1992; Silipo and Crestani, 2000; Hirschberg,

2002; Wagner and Watson, 2010; Ward and Richart-Ruiz, 2013; Röhr, 2013). Acoustic emphasis given to a particular word in speech is known as acoustic/prosodic prominence. Thus, while ASR transcripts are generally noisy and content originating in spoken form is likely to be more informally structured, spoken content has significant amounts of expressive information available that might also potentially be exploited in the segmentation and retrieval processes.

Although a considerable amount of research has been done in the exploration of the utility of prosodic information in various speech processing tasks, such as speech summarisation (Chen and Withgott, 1992; Koumpis and Renals, 2005) and segmentation (Hirschberg and Grosz, 1992; Shriberg et al., 2000), little research has been done to explore its direct utility in SCR. It was suggested that prosody can be used to improve the ranking of relevant content in SCR (Silipo and Crestani, 2000). This is because acoustically prominent words also tend to be those that are most descriptive of the content being conveyed, according to the term weights produced by a ranking function (Crestani, 2001).

Building upon Silipo and Crestani’s findings, other researchers have attempted to exploit prominence information in SCR and topic tracking tasks by combining lexical information about words with acoustic features for the calculation of enhanced term weights (Chen et al., 2001; Guinaudeau and Hirschberg, 2011). Their approach consists of implementing an alternative term weighting scheme which increases the lexical weight of terms whenever their individual occurrences are found to be emphasised in the speech content. Researchers obtained mixed results with this technique. While prominence information was found useful for improving the retrieval of speech fragments discussing similar topics in a French corpus (Guinaudeau and Hirschberg, 2011), preliminary SCR experiments conducted over broadcast news speech in Mandarin Chinese showed no benefits from using these enhanced term weights in an SCR task.

Considering the ambivalence of these findings, it is thus unclear if prosodic prominence information could be effectively used to improve existing term weighting techniques for SCR. Furthermore, limitations of the speech collections available at the time put a restriction on the set of SCR experiments that researchers could conduct and limited their ability to address this problem in more detail. Recently, researchers have collected and released new test collections for SCR research that contain a large number of speech documents transcribed with improved ASR systems, numerous examples of search queries, as well as high-quality relevance assessments at various levels of spoken content granularity. It is therefore worthwhile to revisit the problem of exploiting prosodic information over these new datasets to seek definitive answers and extend previous analysis to new languages, genres, and SCR tasks.

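The general idea behind such prominence-enhanced term weights can be sketched as follows. This is an illustrative formulation assuming per-occurrence prominence scores in [0, 1] and a single boosting parameter; it is not the exact scheme of Chen et al. (2001) or Guinaudeau and Hirschberg (2011), nor the weighting functions developed in Chapter 5.

```python
def prominence_weighted_tf(occurrence_prominences, alpha=1.0):
    """Sketch of prominence-based term weighting: each occurrence of a
    term contributes more than a plain count when its acoustic prominence
    score is high. `occurrence_prominences` is a list of prominence
    scores in [0, 1] for one term in one document; `alpha` controls how
    strongly prominence boosts the lexical weight."""
    return sum(1.0 + alpha * p for p in occurrence_prominences)

# Two occurrences, one strongly emphasised (0.9) and one not (0.1):
# prominence_weighted_tf([0.9, 0.1])  -> 3.0  (instead of a raw tf of 2)
```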
The problem of structuring content and of measuring user satisfaction

Content structuring remains an open issue in SCR despite having been the focus of a number of existing studies (Eskevich et al., 2012b; Wartena, 2012; Eskevich et al., 2013c,

2014; Galuščáková and Pecina, 2014b). While the segmentation problem is generally unambiguous for formal and planned speech in which the information is explicitly presented in a structured manner, this is not the case in domains where speech is conversational and spontaneous. In such cases, segmentation into unambiguous semantically meaningful units may not be possible. For instance, in broadcast news, information is normally presented as a sequence of distinct news stories where boundaries between stories are easily recognisable. By contrast, the structure of a business meeting or a lecture may be less obvious and therefore harder to recognise automatically.

Popular approaches to automatically segment spoken material for SCR purposes fall into two broad categories. The first seeks to identify topic boundaries based on the lexical and acoustic properties of the transcribed spoken material (Hearst, 1997; Shriberg et al., 2000; Malioutov and Barzilay, 2006). The second, based on sliding windows, disregards topic structure and seeks to divide speech into arbitrary passages of similar length (Stanfill and Waltz, 1992; Kaszkiel and Zobel, 1997, 2001). Surprisingly, the latter approach has proven considerably more effective in work to date (Tiedemann and Mur, 2008; Wartena, 2012; Eskevich et al., 2012b; Galuščáková and Pecina, 2014b). The reason is that arbitrary passages are less affected by ASR errors; they can alleviate the difficulties associated with estimating relevance scores for passages that vary in length; and they can adapt better to different information requests. However, careful investigation of window-based approaches reveals them often to be sub-optimal, to provide poor playback listen-in points with consequent poor user experience, and to negatively affect the effectiveness of retrieval models compared to an optimal segmentation (Kaszkiel and Zobel, 2001; Eskevich et al., 2012b; Wartena, 2012).

Much of the difficulty in determining if there is a single superior content structuring approach for SCR has been associated with the problem of evaluating the output of SCR systems. Early evaluation measures proposed for estimating the quality of SCR results were based on adaptations of standard measures originally developed for the evaluation of document (Harman, 1993), passage (Allan, 2004), or XML retrieval (Kamps et al., 2007) tasks, which estimate the proportion of relevant content retrieved at top ranks relative to the amount of irrelevant material. Although these measures may be appropriate in the context of text retrieval, they do not account for the temporal aspects involved in the auditioning of spoken content, namely, the time a user must invest in listening to the audio snippets retrieved. Improved adaptations of evaluation measures have been proposed by a number of researchers (Liu and Oard, 2006; Galuščáková et al., 2012; Eskevich et al., 2012c; Aly et al., 2013a) to take account of a number of dimensions that are believed to affect user satisfaction in SCR. The main aspects considered are: the amount of relevant content retrieved measured in time units, its ranking, and additional time constraints such as the distance between the time pointers returned by the system and the beginning of the relevant content. Despite these improvements, most of these measures tend to assign

disproportionate importance to the relevance, ranking, and time dimensions, and can thus only offer a partial solution to the evaluation problem. Novel measures for IR evaluation proposed recently (Moffat and Zobel, 2008; Chapelle et al., 2009; Smucker and Clarke, 2012) attempt to model the behaviour of users when assessing a ranked list of results, but have not been fully explored in the context of SCR.

1.2 Research questions

Considering the open problems in SCR discussed in Section 1.1.3, as well as the previous research carried out in the area, this thesis investigates existing techniques and proposes novel ones for SCR along three directions: (i) the utilisation of non-lexical acoustic information for the detection of informative keywords; (ii) the adoption of contextualisation techniques for increasing SCR robustness to ASR errors; and (iii) the development of novel evaluation measures that could permit a fair comparison of different content structuring methods in SCR. With regard to the challenge of exploiting non-lexical information, this work advances the investigations of Crestani (2001), Chen et al. (2001), and Guinaudeau and Hirschberg (2011), and studies the utility of acoustic/prosodic prominence features for improving existing SCR indexing techniques and term weighting schemes. In particular, the focus is on determining whether acoustic features derived at the word level can be effectively used to estimate important mentions of indexing terms, and whether this acoustic evidence can be further combined with lexical features to improve SCR effectiveness. These objectives can be summarised in the following research question:

RQ-1: Can information about which prosodic units are made prominent in speech be combined with lexical information to derive improved term weighting schemes and retrieval functions that could enhance SCR effectiveness?

This thesis seeks to answer RQ-1 empirically by conducting SCR experiments with retrieval functions that combine prosodic prominence and lexical information about terms to calculate relevance scores. With respect to the challenge of handling ASR errors in the speech transcripts, this thesis investigates if contextualisation techniques can make the ranking process more robust to ASR errors. In this regard, the task under investigation is passage retrieval, in which the units to be retrieved are relatively short in length and may not contain sufficient terms to compensate for speech recognition or segmentation errors. This objective can be stated more formally as:

RQ-2: Can contextualisation techniques increase the robustness of standard text retrieval approaches to ASR errors when the retrieval units are made from short fragments of speech transcripts?

In order to answer RQ-2, the effects on retrieval effectiveness produced by different contextualisation techniques are analysed under various conditions of speech recognition errors in the transcripts. Lastly, in relation to the problems of content structuring and evaluation in unstructured collections, this thesis first provides a critical overview of existing evaluation measures for SCR, and then investigates alternative measures that could provide more appropriate estimates of user satisfaction in the context of SCR. These alternative evaluation measures are then used to carry out an unbiased comparison of different content structuring techniques applied to SCR, with the goal of determining which technique is most effective in terms of maximising user satisfaction. This set of goals can be summarised in the following research questions:

RQ-3-A: Can existing evaluation measures for SCR estimate levels of user satisfaction appropriately?

RQ-3-B: Can enhanced evaluation measures be developed to address the shortcomings of existing evaluation measures for SCR?

RQ-3-C: Which content structuring techniques are most effective in SCR in terms of maximising user satisfaction?

Answers to these research questions are first sought by reviewing previous research in IR and SCR evaluation, emphasising work that has focused on aspects related to the modelling of user browsing behaviour when scanning a ranked list of search results. Based on this analysis, a novel framework for SCR evaluation is developed and finally used to study the effectiveness of different content structuring approaches.

1.3 Thesis structure

This thesis begins by describing the principal technologies underlying modern SCR systems: information retrieval for text collections (IR), automatic speech recognition (ASR), and content structuring applied to IR tasks. It then continues with an in-depth overview of previous research conducted in SCR, emphasising previous studies that focused on the interactions between ASR errors and IR techniques, the comparison of content structuring methods, and the exploitation of acoustic/prosodic information. This is followed by a description of the collections and software used for the experimental work in this thesis. The development of techniques and experimental work carried out in this thesis are then presented, followed by the conclusions and suggestions for future work. The remainder of this thesis is structured into the following chapters:

Chapter 2 overviews the fundamental technologies needed for SCR. It starts by describing basic concepts and existing techniques used for creating indexes and retrieving relevant content within large collections of text documents, including those used

in the experimental work of this thesis. This is followed by a description of the fundamental aspects related to ASR technology, including a high-level overview of the individual components required for an operational ASR system. Most content structuring approaches adopted in SCR are based upon research on the application of automatic text segmentation techniques to text retrieval tasks. Chapter 2 thus examines these techniques in detail.

Chapter 3 provides a critical review of previous and current research in SCR. The earliest experimental studies in SCR focused on relatively small collections of voice mail and broadcast news, and then shifted to more challenging conversational speech content such as interviews, general TV broadcasts, and lectures. Much of the previous research in the area has been driven by evaluation campaigns and research benchmarks, and has focused on the challenges of handling ASR errors and structuring content. Although not part of mainstream research, previous studies have investigated the potential benefits of using acoustic/prosodic information to improve SCR effectiveness. All of these studies are discussed in Chapter 3.

Chapter 4 describes the speech collections, queries, and software used in the experimental work of this thesis. Ideally, test collections for SCR research must: (i) be large enough to account for a varied number of interesting topics to search for; (ii) have available queries with associated relevance judgements, preferably carried out at sub-document granularity levels; and (iii) be transcribed automatically by at least one ASR system. Due to the lack of availability and the high costs associated with the creation of these data sets, the experimental work in this thesis is based on spoken collections that, despite not meeting all requirements outlined above, are still useful for the goals set in this thesis. In particular, the BBC collection is a relatively large (3000 hours) dataset containing English recordings of general TV content (talk shows, documentaries, series, etc.) with 100 queries and low-quality fine-grained relevance assessments. The Spoken Document Processing Workshop (SDPWS) collection is a small (30 hours) data set of lecture recordings in Japanese that contains 230 queries, high-quality fine-grained relevance assessments, and a large number of transcripts produced by ASR systems of different quality.

Chapter 5 describes a series of experiments that seek to determine whether acoustic/prosodic information can be used to improve current lexical-based term indexing techniques. This chapter first describes the approach adopted for feature extraction and the subsequent word-alignment of features against speech transcripts. It then elaborates on the derivation of heuristic-based prominence scores for individual indexing terms, and on their integration into a ranking function for speech content. Two groups of experiments are then described. The first group investigates if prominence scores can provide a meaningful increase in retrieval effectiveness when integrated via the heuristic-based approach. The second group uses machine learning techniques to

study the relationship between prominent and informative terms, as well as the value that prominence information might have for improving content ranking in SCR.

Chapter 6 investigates the benefits of using contextualisation techniques for improving the ranking of speech passages in adverse conditions of ASR errors. This chapter begins by motivating the adoption of these techniques in SCR. Existing contextualisation techniques are then described and their ability to improve retrieval effectiveness evaluated under different conditions of ASR errors in the speech transcripts.

Chapter 7 introduces a novel user-centric framework for the evaluation of spoken passage retrieval. Evaluation measures under this framework are then used to carry out a large-scale comparison of existing content structuring approaches.

Chapter 8 describes the conclusions of this thesis, provides concrete answers to the research questions stated in Section 1.2, and suggests directions for future work.

Appendix A provides a list of all publications derived from this dissertation.

Appendix B describes a series of index similarity metrics used to measure the quality of a search index built from ASR transcripts.

Appendix C provides a detailed description of LambdaMART, a learning-to-rank method based on regression trees that was used in the experiments presented in Chapter 5.

Appendix D describes the general optimisation method used to tune the parameters of retrieval models in the experiments presented in Chapters 5 and 6.

Appendix E describes the results of retrieval experiments with an SCR method that exploits prosodic/acoustic features by leveraging the output of a binary classifier.

Chapter 2

Review of Fundamental Technologies in SCR

Information retrieval (IR) is the study and development of automatic indexing and ranking techniques that permit searching for relevant information within a collection of information sources. These techniques seek to solve the problem of "content overload", in which searching for a particular piece of information by browsing becomes impractical as the size of the collection grows over time. Content overload is more severe in spoken collections, since the browsing of speech material is more time consuming than the browsing of text. For this reason, IR is a fundamental technology to enable practical SCR systems for collections of more than trivial size.

Applying automatic text indexing and ranking techniques to collections of speech recordings requires the ability to recognise and quantify important keywords or indexing terms that are spoken in the audio streams. For this purpose, current SCR applications make use of Large Vocabulary Continuous Speech Recognition (LVCSR) technology, or Automatic Speech Recognition (ASR) for short. When the spoken documents to be indexed discuss more than one topic, or when they are too long to be auditioned within a reasonable amount of time, it is convenient to segment documents into shorter units that can be individually indexed and retrieved by the SCR system. Decisions involving how best to divide a spoken document into topically homogeneous retrieval units, with the objective of maximising retrieval effectiveness while minimising user auditioning time, lie in the realm of content structuring technologies for SCR.

This chapter presents a review of these three technologies that are fundamental for SCR applications. Section 2.1 describes automatic text indexing and retrieval. Section 2.2 reviews fundamental concepts of ASR technology. Finally, Section 2.3 examines content structuring and topic segmentation techniques, while their applications to text retrieval tasks are reviewed in Section 2.4.

2.1 Information retrieval (IR)

In a broad sense, IR deals with the problem of finding documents that are relevant to an information need provided by the user in the form of a query. When documents and queries are given in natural language, and when the goal is to produce a ranking of the most relevant documents to a query, this task is commonly known as ranked retrieval. To solve this task efficiently, a standard IR system first constructs a search index of the document collection, which permits fast access to term statistics at querying time. These aspects related to indexing and document representation are described in Section 2.1.1. When a query is issued by the user, a retrieval model is then used to produce a ranking of matching documents. In this regard, Section 2.1.2 describes some important models for ranked retrieval, including the one used in the experiments described in this thesis. Finally, Section 2.1.3 reviews some of the evaluation measures introduced in previous research which seek to measure the quality of the document rankings produced by a model.

2.1.1 Text pre-processing and indexing

Scaling the application of ranked retrieval to collections of hundreds of millions of documents is only possible in practice through the construction of efficient search indices. In a general sense, a search index is a data structure that stores information about the documents that comprise the collection to be searched. The most important property stored about a document is the number of times a particular indexing feature "points to" or "appears in" the document. In this context, an indexing feature refers to some quantifiable property of the document that may also be present in other documents in the collection. When dealing with documents in natural language, the most commonly used indexing feature is the word. A query issued to the IR system can then be characterised in terms of the set of indexing features that should preferably be present in the highest ranked documents returned by the system, or that should influence how such a ranking is constructed.

The first step towards the construction of a search index is to identify and extract indexing features from the documents that comprise the collection to be searched. This step usually requires processing the text with a tokeniser or text segmenter, designed to divide a document string into a sequence of tokens. Each token identified by a tokeniser roughly corresponds to a particular word from the language the text is written in. For text in most European languages, in which words are separated by spaces, tokenisation is a fairly simple task and can be done with a carefully designed set of regular expressions to handle the uses of apostrophes, hyphens, and punctuation symbols. However, for scriptio continua languages like Thai, Japanese, or Chinese, where word boundaries are not explicitly marked, tokenisation is a less trivial task. In these difficult cases, it is common to perform tokenisation by using statistical sequential models (Zhang et al., 2003; Kudo et al., 2004; Shao et al., 2017). The tokenisation process may additionally

involve the standardisation of numbers, proper names, and other special words or symbols considered important for retrieval. As an additional pre-processing step, it is common practice to discard tokens that have little or no value for retrieval. These tokens generally correspond to punctuation symbols and stop words. Stop words are generally function words that are frequently used in the language and consequently less useful in distinguishing relevant from non-relevant documents. Finally, stemming and lemmatisation are linguistic processing techniques used to map semantically related tokens that differ in their surface form into a single equivalence class. Lemmatisation consists of mapping a token to its base form or dictionary entry form (e.g. "walking" to "walk"), while stemming can be seen as a cheap alternative to lemmatisation and consists of removing or replacing the endings of tokens in order to reduce their inflectional variations (e.g. "walking" to "walkin"). A popular and effective stemming algorithm for English is Porter's algorithm (Porter, 1980). For each document, the stemming or lemmatisation process produces the final sequence of modified tokens that will be included in the search index. These resulting tokens are called "indexing terms" or just "terms", and the set of all terms in the collection is known as the index vocabulary or lexicon.

Several indexing algorithms have been designed for the construction of search indices (Zobel and Moffat, 2006). The main objective of indexing is to build data structures that can later be used at querying time to score millions of documents efficiently. Two important data structures generated are the lexicon and the inverted index. The lexicon is a mapping of terms to term IDs, possibly with additional information about each term such as its document frequency and a pointer to its location in the inverted index. The inverted index holds a list of postings for each term in the lexicon. Each posting consists of a (d, tf) pair, where tf indicates the number of times the term appears in document d. In order to support phrase queries and proximity search, postings are commonly augmented with the positions at which the terms appear within the documents. Also, when indexing speech transcripts, postings can be extended with acoustic features that may be available for each term, such as confidence scores or word time-stamps.
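To make the indexing pipeline described above concrete, the following Python sketch tokenises a hypothetical two-document collection, removes stop words, applies a crude suffix-stripping stand-in for a real stemmer such as Porter's algorithm, and builds a positional inverted index and lexicon. It is a minimal illustration under these assumptions, not the indexing software used in the experiments of this thesis.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # illustrative stop list

def tokenise(text):
    """Lower-case and split on non-alphanumeric characters (English-style tokenisation)."""
    return [t for t in re.split(r"[^a-z0-9']+", text.lower()) if t]

def stem(token):
    """Crude suffix stripping used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """Return (lexicon, inverted_index); the inverted index stores positional postings
    of the form {term: {doc_id: [positions]}}."""
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        terms = [stem(t) for t in tokenise(text) if t not in STOP_WORDS]
        for pos, term in enumerate(terms):
            inverted[term][doc_id].append(pos)
    # The lexicon maps each term to its document frequency (a real system would also
    # store a pointer to the term's postings list on disk).
    lexicon = {term: {"df": len(postings)} for term, postings in inverted.items()}
    return lexicon, inverted

if __name__ == "__main__":
    docs = {"d1": "retrieval of spoken content",
            "d2": "spoken content retrieval and ranking"}
    lexicon, inverted = build_index(docs)
    print(lexicon["retrieval"], dict(inverted["retrieval"]))
```

For speech transcripts, the postings built this way could additionally carry per-occurrence word time-stamps or confidence scores, as noted above.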

2.1.2 Frameworks for ranked retrieval

A framework for ranked retrieval consists of a set of ideas, methods, and principles that specify how a set of documents may be ranked in order of relevance to a query. Most standard IR frameworks stipulate that this ranking can be constructed via a function, designed to calculate a numeric score for each document that reflects its degree of relevance with respect to the query. In the IR literature, this score is commonly referred to as a retrieval status value (RSV) or ranking score (S). To implement such a function, most standard frameworks adopt a “bag of words” representation for queries and documents in which these elements are represented by a set of indexing features (terms) taken from

a fixed vocabulary. In a "bag of words" representation, the order in which terms appear in an element is completely ignored, as is the fact that some terms may condition the presence or absence of others within or across elements in the collection. Two major frameworks that have been used extensively in SCR research are the vector space model (VSM) (Salton, 1979) and the probabilistic model (Spärck Jones et al., 2000) for ranked retrieval.

The Vector Space Model (VSM)

The vector space model (VSM) (Salton et al., 1975) is one of the oldest and most widely adopted models in IR. In this model, queries and documents are represented as vectors in which every component is associated with a particular term in the vocabulary. More particularly, the vector of a document (query) is constructed so that its $i$-th component contains a score or weight that reflects the extent to which its associated term is considered representative of the topic of the document (query). For a collection $C$ with $M$ distinct terms indexed by $i: 1 \le i \le M$, a VSM represents a document by a vector $\vec{d} = \langle d_1, \ldots, d_M \rangle \in \mathbb{R}^M$, where $d_i$ is the weight associated with the $i$-th term in the document. Similarly, a query is represented by a vector $\vec{q} = \langle q_1, \ldots, q_M \rangle \in \mathbb{R}^M$, with $q_i$ denoting the query vector's $i$-th component. In the application of the VSM to IR, it is common to assign positive weights to terms that are present in a document (query) and zero weights to terms that are absent. The underlying assumption in a VSM is that elements that are semantically similar will lie in similar regions of the vector space. Based on this assumption, the relevance of a document $d$ with respect to a query $q$ can be computed as the distance between their vector representations in $\mathbb{R}^M$. When the cosine similarity is used as a measure of distance, the ranking score of $d$ for $q$ is calculated as shown in Equation 2.1.

$$S_{VSM}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\|\vec{q}\|\, \|\vec{d}\|} = \frac{\sum_{i=1}^{M} q_i\, d_i}{\sqrt{\sum_{i=1}^{M} q_i^2}\; \sqrt{\sum_{i=1}^{M} d_i^2}} \qquad (2.1)$$

The set of functions that establishes how term weights are calculated is known as a weighting scheme. Weighting schemes are defined in such a way that terms that are more informative of the topic of a document obtain higher weights for that document. For retrieval purposes, a term is considered informative for a document if it represents the topic of the document and if it is effective in discriminating this topic from others that may also be present in the collection. The weighting scheme generally adopted in a VSM involves the product of two factors: the within-document term frequency $tf_d(i)$, based on the number of times that the $i$-th term occurs in the document $d$; and the inverse document frequency $idf(i)$, based on the number of documents in $C$ that contain the $i$-th term. The product of $tf_d(i)$ and $idf(i)$ is commonly known as the term-frequency inverse document frequency (TF-IDF) score.

Query terms are assigned weights similarly, as the product of a query frequency factor $qf(i)$ and a query inverse document frequency factor $qidf(i)$. Most weighting schemes also incorporate normalisation factors that scale term frequencies depending on the total number of terms contained in $d$ or $q$ respectively. When the ranking score is defined as in Equation 2.1, the Euclidean norms of $\vec{q}$ and $\vec{d}$ in the denominator act as length normalisation factors. The effectiveness of the VSM depends heavily on the selection of a good weighting scheme. A wide range of possible weighting schemes were explored by Salton and Buckley (1988), while Zobel and Moffat (1998) later presented an even more complete survey of existing schemes. When the cosine similarity is used to measure the distance between vectors, a simple and popular weighting scheme is formed by combining $tf_d(i)$ and $idf(i)$, and $qf(i)$ and $qidf(i)$, as shown in Equation 2.2.

$$tf_d(i) = 1 + \log tf_i, \qquad idf(i) = \log \frac{N}{n_i}, \qquad qf(i) = qf_i, \qquad qidf(i) = 1, \qquad (2.2)$$

where $tf_i$ and $qf_i$ are the number of times that the $i$-th term occurs in $d$ and $q$ respectively, $N$ denotes the total number of documents in $C$, and $n_i$ the number of documents in $C$ containing the $i$-th term.
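To make Equations 2.1 and 2.2 concrete, the following Python sketch scores a hypothetical two-document toy collection against a query using the tf-idf weighting and cosine ranking defined above. The collection and query are illustrative only; this is not the retrieval software used in this thesis.

```python
import math
from collections import Counter

def tfidf_vector(term_counts, df, n_docs):
    """Sparse TF-IDF document vector: (1 + log tf) * log(N / n_i), as in Equation 2.2."""
    return {t: (1.0 + math.log(tf)) * math.log(n_docs / df[t])
            for t, tf in term_counts.items() if df.get(t, 0) > 0}

def cosine(q_vec, d_vec):
    """Cosine similarity between two sparse vectors (Equation 2.1)."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

# Hypothetical term-frequency index and document frequencies.
docs = {"d1": Counter({"spoken": 2, "retrieval": 1}),
        "d2": Counter({"retrieval": 3, "ranking": 1})}
df = Counter(t for counts in docs.values() for t in counts)
n_docs = len(docs)

# Query weights: raw query frequency with qidf = 1, as in Equation 2.2.
query = Counter({"spoken": 1, "retrieval": 1})
q_vec = {t: float(qf) for t, qf in query.items()}

d_vecs = {d: tfidf_vector(c, df, n_docs) for d, c in docs.items()}
ranking = sorted(d_vecs, key=lambda d: cosine(q_vec, d_vecs[d]), reverse=True)
print(ranking)
```

Note that "retrieval" occurs in both toy documents, so its idf is zero and it contributes nothing to the scores; this illustrates how the idf factor discounts terms that do not discriminate between documents.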

The Binary Independence Model (BIM)

The Binary Independence Model (BIM) (Spärck Jones et al., 2000) is an important model based on the Probability Ranking Principle (PRP) (Robertson, 1977), which states that optimal retrieval effectiveness can be obtained if documents are ranked in decreasing order of their probability of relevance, based on whatever evidence is available about the information need and document collection. In this model, every document is assumed to be either relevant ($rel$) or non-relevant ($\overline{rel}$) to the query. A document is then represented by a vector of binary random variables $\vec{d} = \langle d_1, \ldots, d_M \rangle$, where each component variable $d_i$ is 1 if the $i$-th term is present in the document and 0 otherwise. Considering a similar representation for a query $\vec{q} = \langle q_1, \ldots, q_M \rangle$, documents can then be ranked according to their odds of being relevant to $q$, as shown in Equation 2.3.

$$S_{PRP}(q, d) = \frac{P(rel \mid \vec{d}, \vec{q})}{P(\overline{rel} \mid \vec{d}, \vec{q})} = \frac{P(rel \mid \vec{q})\, P(\vec{d} \mid rel, \vec{q})}{P(\overline{rel} \mid \vec{q})\, P(\vec{d} \mid \overline{rel}, \vec{q})} \overset{rank}{=} \log \frac{P(\vec{d} \mid rel, \vec{q})}{P(\vec{d} \mid \overline{rel}, \vec{q})} \qquad (2.3)$$

The last equation is obtained by applying Bayes' rule twice, removing the components that only depend on $\vec{q}$, and applying a log transformation, which does not alter the final ranking of documents. Under the assumptions that terms occur independently and that the probabilities are

not affected by terms not present in the query, the ranking function of the BIM can be obtained from Equation 2.3 as shown in Equations 2.4 to 2.6,

$$S_{BIM}(q, d) = \sum_{i \in q,d} \log \frac{P(d_i = 1 \mid rel)\, P(d_i = 0 \mid \overline{rel})}{P(d_i = 1 \mid \overline{rel})\, P(d_i = 0 \mid rel)} \qquad (2.4)$$
$$\approx \sum_{i \in q,d} \log \frac{(r_i + 0.5)(N - n_i - R + r_i + 0.5)}{(R - r_i + 0.5)(n_i - r_i + 0.5)} \qquad (2.5)$$
$$= \sum_{i \in q,d} w_{RSJ}(i) \qquad (2.6)$$

where the resulting weight $w_{RSJ}(i)$ is known as the Robertson/Spärck Jones (RSJ) weight. In the equations above, the expression $i \in q, d$ denotes the set $\{i : q_i = d_i = 1\}$, so that all summations are restricted to terms occurring in both $q$ and $d$. In addition, $R$ denotes the number of documents in $C$ that are relevant to $q$, and $r_i$ the number of relevant documents containing the term $i$, while $N$ and $n_i$ are defined as in the description of the VSM.

Because in practice the exact values of $R$ and $r_i$ are unknown, an approximation of the RSJ weight for a term-document pair can be obtained by assuming that $R, r_i \approx 0$. Replacing $R$ and $r_i$ by 0 in Equation 2.5 results in the BIM ranking function, shown in Equation 2.7,

$$S_{BIM}(q, d) \approx \sum_{i \in q,d} \log \frac{N - n_i + 0.5}{n_i + 0.5} = \sum_{i \in q,d} cfw(i) \qquad (2.7)$$

defined as the summation of collection frequency weights $cfw(i)$ across the terms occurring in both the query $q$ and the document $d$.

The 2-Poisson model and Okapi BM25

A popular and effective ranking function within the probabilistic approach is Okapi BM25 (Robertson et al., 1994; Spärck Jones et al., 2000). This function originates as an approximation of the 2-Poisson model, originally proposed by Harter (1975) and subsequently developed by Robertson et al. (1980), Robertson and Walker (1994), and Robertson et al. (1994). The 2-Poisson model extends the BIM to consider term frequencies within the documents and the query, thus making a distinction between documents containing a single occurrence of a query term and those containing multiple occurrences.

In the 2-Poisson model, the random variables $d_i$ and $q_i$ are re-defined so that they can take any non-negative integer value $tf_i \in \mathbb{N}_0$, corresponding to the event of observing $tf_i$ occurrences of the term $i$ in a document or query respectively. Next, all documents containing the term $i$ are assumed to belong to one of two classes: an "elite" class of documents ($E_i$), which refers to those that are about the topic denoted by the term; and a "non-elite" class ($\overline{E}_i$), in which the term is merely used in passing and whose content is not strictly about the "topic" induced by the term. The distribution of a term's frequency across documents is then modelled as a mixture of two Poisson distributions, each one considering the

possibility that the document may belong to the term's elite or non-elite class. More specifically, $d_i$ is assumed to be distributed under two Poisson distributions: $\mathcal{P}(\lambda_{E_i})$ under the elite set, and $\mathcal{P}(\lambda_{\overline{E}_i})$ under the non-elite set. Besides these distributional assumptions, further assumptions are made about the associations between term frequencies, eliteness, and relevance. By assuming that term frequencies are related to the documents' relevance through eliteness, this set of assumptions expresses that

$$P(d_i = tf_i \mid rel) = P(d_i = tf_i \mid E_i)\, P(E_i \mid rel) + P(d_i = tf_i \mid \overline{E}_i)\, P(\overline{E}_i \mid rel)$$
$$= \lambda_{E_i}^{tf_i} \frac{e^{-\lambda_{E_i}}}{tf_i!}\, P(E_i \mid rel) + \lambda_{\overline{E}_i}^{tf_i} \frac{e^{-\lambda_{\overline{E}_i}}}{tf_i!}\, P(\overline{E}_i \mid rel).$$

Under the assumptions that a term occurs more frequently in its elite than in its non-elite documents ($\lambda_{E_i} > \lambda_{\overline{E}_i}$), and that the relevance of a document only depends on its eliteness condition, the probability of a document $d$ being relevant to a query $q$ in this extended model can be approximated by Equation 2.8.

$$S_{BM}(q, d) = \sum_{i \in q,d} \log \frac{P(d_i = tf_i \mid rel)\, P(d_i = 0 \mid \overline{rel})}{P(d_i = tf_i \mid \overline{rel})\, P(d_i = 0 \mid rel)} \approx \sum_{i \in q,d} \frac{tf_i}{k_1 + tf_i}\, cfw(i) \qquad (2.8)$$

Further developments of the previous approximation lead to the well-known Okapi BM25 weighting function (Robertson et al., 1994), shown in Equation 2.9, that accounts for the issues of length normalisation and incorporates evidence from within-query term frequencies.

$$S_{BM25}(q, d) = \sum_{i \in q,d} \frac{(k_1 + 1)\, tf_i}{tf_i + k_1 \left(1 - b + b\, \frac{docl}{avel}\right)} \cdot \frac{(k_3 + 1)\, qf_i}{k_3 + qf_i} \cdot cfw(i) \qquad (2.9)$$

In Equation 2.9, $docl$ denotes the length of $d$, equal to $\sum_i tf_i$; $avel$ denotes the documents' average length in the collection; $0 \le b \le 1$ controls the impact of length normalisation; and $k_1, k_3 \ge 0$ control the rate of increase of the term frequency and query frequency factors respectively.

Considered as an isolated function, the within-document term frequency factor in

Equation 2.9 is a monotonically increasing function of $tf_i$ that approaches an asymptotic maximum of $k_1 + 1$ as $tf_i \to \infty$. The $k_1$ parameter influences how fast this function approaches its asymptote with every increase of $tf_i$. Large values of $k_1$ signify a slower convergence rate w.r.t. $tf_i$, while small values of $k_1$ result in faster convergence. In the BM25 formulation, the length normalisation factor was originally conceived around the scope and verbosity hypotheses. Under the verbosity hypothesis, authors decide to create relatively longer documents because they have the tendency to be verbose and repetitive. In such circumstances, using a large value for $b$ to heavily normalise term frequencies based on document length is appropriate. Alternatively, under the scope hypothesis, documents are relatively long because they cover multiple topics or multiple facets

of the same topic. In this latter case, using a small value for $b$ seems more appropriate.
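The following Python sketch implements the BM25 scoring of Equation 2.9, together with the collection frequency weight of Equation 2.7, over a hypothetical in-memory index. The parameter values (k1 = 1.2, k3 = 8, b = 0.75) are common defaults assumed here, not values prescribed by this thesis.

```python
import math
from collections import Counter

def cfw(n_i, n_docs):
    """Collection frequency weight from Equation 2.7."""
    return math.log((n_docs - n_i + 0.5) / (n_i + 0.5))

def bm25_score(query_tf, doc_tf, doc_len, avg_len, n_docs, df, k1=1.2, k3=8.0, b=0.75):
    """Okapi BM25 ranking score (Equation 2.9) for one query-document pair."""
    score = 0.0
    for term, qf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        tf_factor = (k1 + 1) * tf / (tf + k1 * (1 - b + b * doc_len / avg_len))
        qf_factor = (k3 + 1) * qf / (k3 + qf)
        score += tf_factor * qf_factor * cfw(df[term], n_docs)
    return score

# Hypothetical toy index: term frequencies per document.
docs = {"d1": Counter({"spoken": 2, "retrieval": 1, "content": 1}),
        "d2": Counter({"ranking": 3, "speech": 1, "content": 1}),
        "d3": Counter({"broadcast": 2, "news": 2})}
df = Counter(t for counts in docs.values() for t in counts)
avg_len = sum(sum(c.values()) for c in docs.values()) / len(docs)
query = Counter({"spoken": 1, "retrieval": 1})

for d, counts in docs.items():
    print(d, round(bm25_score(query, counts, sum(counts.values()), avg_len, len(docs), df), 3))
```

Only documents containing at least one query term receive a non-zero score, and the saturation of the tf factor limits the contribution of repeated occurrences, as discussed above.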

2.1.3 Evaluation of ranked retrieval

This section describes the general evaluation framework that is adopted in IR research to measure and compare the effectiveness of retrieval systems. The initial ideas related to the formal evaluation of IR systems were pioneered by Cleverdon, in the context of the Cranfield experiments carried out in the early sixties (Cleverdon, 1962; Cleverdon et al., 1966). In order to enable rigorous, repeatable, and meaningful evaluation of ranked retrieval, the Cranfield methodology proposes to construct a test collection consisting of: a set of documents; a set of queries or topics; a set of relevance judgements, indicating which documents are relevant to each query; and a numeric measure for estimating the quality of a ranked list of documents for a query.

Early IR research focused on small document collections that made exhaustive relevance assessments possible. For instance, the test collection used in the Cranfield experiments contained 1398 abstracts of scientific articles and a relevance judgement for every query-abstract pair. Since the beginning of the Text REtrieval Conference (TREC, http://trec.nist.gov), the size of the document collections used in IR research has grown by several orders of magnitude: from about half a million documents in the collections used at the TREC ad-hoc tracks, to about 1 billion documents in the more recent ClueWeb12 collection (http://lemurproject.org/clueweb12/) used at the TREC Web track (Collins-Thompson et al., 2015).

Conventionally, the set of queries used for evaluating a retrieval system is generated by potential users of the system or by a group of hired annotators who are preferably knowledgeable of the contents of the document collection. Because the formulated queries are sometimes ambiguous underspecifications of an information need, query creators are commonly asked to provide a more detailed description of their search needs. In TREC parlance, a topic consists of the query text, a query ID, and a narrative field that describes it more fully. The number of topics varies across test collections. Traditionally, TREC collections have contained on the order of 50 topics, which is the minimum number of queries needed for absolute differences in mean average precision (MAP) of 5% to be significant across systems (Voorhees and Buckley, 2002). In combination with significance testing, this number can be reduced by half and still be useful for determining significant differences among IR systems (Zobel, 1998).

Relevance judgements for query-document pairs are obtained through manual assessments. The procedure for assessing a pair consists of verifying the extent to which the document is relevant to the information need associated with the query. To facilitate this task, assessors are provided with the narrative description of the information need. Relevance is conventionally given as a binary value (the document is either "relevant" or "not relevant") or on a multi-graded scale of values.


Because of the vast size of the document collections currently used in IR research, obtaining relevance judgements for every query-document pair is prohibitive if not impossible. To circumvent this issue, strategies for "pooling" small subsets of documents from the collection to be later assessed for relevance were proposed (Spärck Jones and van Rijsbergen, 1975). In its most basic form, the pooling procedure consists of producing (for each query) multiple ranked lists of documents by using independent IR systems. The union of the top-ranked N results (N = 100 in most TREC collections) from each ranked list is then calculated to form the pool of documents, which are finally assessed for relevance against the query. Normally, unjudged results that do not form part of the pool for a query are considered non-relevant by most standard evaluation measures. Since it is unlikely for a set of pooled documents to contain all documents that are relevant to a query, concerns have been raised by researchers about whether existing test collections could be used to evaluate IR systems that did not necessarily participate in the creation of the pool. Fortunately, Zobel (1998) has shown that results based on a limited set of pooled documents can still provide a reliable account of the relative performance that may exist between different IR systems, even for those that did not originally contribute to the pool.
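As a brief illustration of the pooling procedure just described, the sketch below forms the union of the top-N results from a set of hypothetical system runs; the runs and pool depth are illustrative only.

```python
def build_pool(ranked_lists, depth=100):
    """Union of the top-`depth` documents from each system's ranked list (basic pooling)."""
    pool = set()
    for run in ranked_lists:
        pool.update(run[:depth])
    return pool

# Two hypothetical ranked lists produced by independent systems for the same query.
runs = [["d3", "d1", "d7"], ["d1", "d2", "d9"]]
print(sorted(build_pool(runs, depth=2)))   # ['d1', 'd2', 'd3']
```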

Evaluation measures for ranked retrieval

A popular evaluation measure used to quantify the quality of a ranked list of results when relevance judgements are binary is average precision (AP) (Harman, 1993). AP is based on Precision at rank k (P (k)), which measures the proportion of documents retrieved until rank k that are relevant to the query. Formally, for a ranked list of results produced for a query, P (k) is defined as shown in Equation 2.10.

$$P(k) = \frac{1}{k} \sum_{i=1}^{k} r_i \quad \text{where} \quad r_i = \begin{cases} 1 & \text{if the } i\text{-th ranked result is relevant} \\ 0 & \text{otherwise} \end{cases} \qquad (2.10)$$

AP is then defined by taking the average across the points in the ranked list at which a relevant document is found, as shown in Equation 2.11,

$$AP = \frac{1}{R} \sum_{k} P(k)\, r_k \qquad (2.11)$$

where $R$ denotes the total number of documents that are known to be relevant to the query. In order to evaluate the performance of a retrieval system across a set of queries, AP is calculated for every query and the resulting scores are averaged. The resulting average is referred to as the mean average precision (MAP). Effectiveness measures such as AP and precision can only be used with binary relevance judgements. However, multiple degrees of relevance need to be considered if the focus of the evaluation is on the ability of an IR system to retrieve highly relevant documents on top of

less relevant ones. Various effectiveness measures have been proposed to consider graded relevance judgements. One of these is a simple adaptation of precision at $k$ known as generalised precision ($gP(k)$) (Kekäläinen and Järvelin, 2002), which considers continuous relevance scores $r_k \in [0, 1]$. $gP(k)$ is then calculated as $P(k)$ by using these continuous relevance scores. Summing $gP(k)$ across all ranks $k$ at which $r_k > 0$ and then dividing by $R$ results in the generalised average precision ($gAP$) measure. An effectiveness measure more widely used for graded relevance judgements is discounted cumulative gain (DCG) (Järvelin and Kekäläinen, 2002). The DCG at rank $n$ is shown in Equation 2.12,

$$DCG(n) = \sum_{k=1}^{n} \frac{2^{r_k} - 1}{\log(k + 1)}, \qquad (2.12)$$

where $r_k$ is an integer value representing the discrete grade of relevance of the document retrieved at rank $k$. In Equation 2.12, the numerator represents the gain associated with the document ranked at position $k$, while the denominator determines the discounting factor associated with rank $k$. To make DCG values comparable across different queries, it is common to use the normalised version of DCG (nDCG), which divides Equation 2.12 by the maximum DCG value obtainable for the query, equal to that obtained with an ideal ranking of documents.

Moffat and Zobel (2008) propose an alternative effectiveness measure called rank-biased precision (RBP), based on a probabilistic model of user behaviour. Figure 2.1 shows the states and transitions of this model. The user commences by viewing the document ranked at position 1 and then continues scanning the rest of the documents. At every position in the ranking, the user can decide to view the next document, with probability $p$, or to stop the search, with probability $1 - p$. Rank-biased precision can then be written as shown in Equation 2.13,

$$RBP = (1 - p) \sum_{k} r_k\, p^{k-1} \qquad (2.13)$$

where $p$ is the persistence probability and $r_k$ is defined as in Equation 2.10. It has been shown that for $p = 0.7$ the geometric discounting factor from RBP can closely approximate the probability that a user would click on a document at a certain position in a web search results page (Chapelle et al., 2009). While MAP, nDCG, and RBP apply a discounting function that only depends on the rank at which a document is located in the result list, the expected reciprocal rank (ERR) measure proposed by Chapelle et al. (2009) discounts according to the relevance of the documents located at previous ranks. In their development of ERR, Chapelle et al. (2009) propose a "cascade" model of user browsing behaviour which accounts for the fact that a user would be less interested in examining a fairly relevant document if it is ranked below

Figure 2.1: Underlying user model proposed in rank-biased precision. Taken from (Moffat and Zobel, 2008).

a highly relevant one. ERR is then written as shown in Equation 2.14,

$$ERR = \sum_{k} \frac{1}{k} \prod_{i=1}^{k-1} (1 - R_i)\, R_k, \qquad (2.14)$$

where $R_k$ is the probability that the user is satisfied at rank $k$. The model induced by ERR assumes that the user continues viewing documents from the ranked list of results until finding a relevant document, at which point the user stops the search.
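To make the behaviour of these measures concrete, the following Python sketch computes AP, DCG, RBP, and ERR for a single ranked list following Equations 2.10 to 2.14. Two details are assumptions of this sketch rather than choices fixed by the text above: the logarithm base in DCG (natural log is used here), and the mapping from relevance grades to satisfaction probabilities in ERR, taken as $R_k = (2^{r_k} - 1)/2^{r_{max}}$ following Chapelle et al. (2009).

```python
import math

def average_precision(rels, num_relevant):
    """AP over binary relevance labels (Equation 2.11); num_relevant is R, the total known relevant."""
    hits, total = 0, 0.0
    for k, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / k          # P(k) at each rank where a relevant document appears
    return total / num_relevant if num_relevant else 0.0

def dcg(grades, n):
    """Discounted cumulative gain at rank n (Equation 2.12); log base is an assumption here."""
    return sum((2 ** g - 1) / math.log(k + 1) for k, g in enumerate(grades[:n], start=1))

def rbp(rels, p=0.7):
    """Rank-biased precision (Equation 2.13) with persistence probability p."""
    return (1 - p) * sum(r * p ** (k - 1) for k, r in enumerate(rels, start=1))

def err(grades, g_max=3):
    """Expected reciprocal rank (Equation 2.14) with the assumed grade-to-probability mapping."""
    score, not_satisfied = 0.0, 1.0
    for k, g in enumerate(grades, start=1):
        r_k = (2 ** g - 1) / (2 ** g_max)
        score += not_satisfied * r_k / k
        not_satisfied *= (1 - r_k)
    return score

ranked_binary = [1, 0, 1, 0, 0]     # binary relevance of the top-5 results
ranked_graded = [3, 0, 2, 0, 1]     # graded relevance of the same results
print(average_precision(ranked_binary, num_relevant=3),
      dcg(ranked_graded, 5), rbp(ranked_binary), err(ranked_graded))
```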

2.2 Automatic speech recognition (ASR)

In order to estimate the grade of relevance of a spoken document with respect to a natural language query using text retrieval techniques, an SCR system needs to quantify the amount of term overlap that exists between the query and the spoken words. A prerequisite for this is thus the ability to recognise the words spoken in recorded speech. The technology concerned with the problem of identifying all words spoken in a speech utterance is automatic speech recognition (ASR). This section overviews the fundamentals of ASR technology. In particular, the section focuses on a specific type of ASR technology, which deals with the recognition of continuous speech as opposed to isolated words, unknown speakers as opposed to speech produced by a single known speaker, and open large vocabularies containing 60,000 distinct words or more. Systems that fall under this category are said to perform large vocabulary continuous speech recognition (LVCSR), and have become the standard ASR technology used in SCR applications.

2.2.1 Overview

The ASR problem is traditionally stated as that of finding the most likely sequence of words

$\hat{W} = \hat{W}_1 \hat{W}_2 \ldots \hat{W}_N$ spoken in some observed utterance $O$. More formally, this probabilistic specification of the problem can be written as shown in Equation 2.15,

$$\hat{W} = \underset{W \in \mathcal{L}}{\arg\max}\; P(W \mid O) \qquad (2.15)$$

Figure 2.2: Main components and simplified architecture of a standard ASR system. [The figure shows the speech signal passing through a feature extraction frontend into a recognition backend, in which the acoustic model (HMMs), the language model, and a pronunciation lexicon feed a decoder that outputs the recognised words.]

that is, the problem of finding the word sequence $\hat{W}$ from among all sequences $W$ in a language $\mathcal{L}$ that maximises the probability of $W$ given the acoustic observation $O$. By applying Bayes' rule, the probability in Equation 2.15 can be broken down as shown in Equation 2.16,

$$\hat{W} = \underset{W \in \mathcal{L}}{\arg\max}\; \frac{P(O \mid W)\, P(W)}{P(O)} = \underset{W \in \mathcal{L}}{\arg\max}\; P(O \mid W)\, P(W), \qquad (2.16)$$

where the prior $P(O)$ of the acoustic observation can be neglected because it is the same for every $W$. The rightmost expression in Equation 2.16 suggests that the ASR problem can be disentangled into three sub-tasks: (i) the task of calculating $P(O \mid W)$ given some acoustic observation $O$ and word sequence $W$, known as acoustic modelling; (ii) the task of computing $P(W)$, termed language modelling; and (iii) the task of decoding the word sequence $\hat{W}$ that maximises the product of the acoustic and language model probabilities. Figure 2.2 shows how these components fit together in the architecture of a typical ASR system. The acoustic model (AM), language model (LM), and decoder components are charged with producing hypothesised word sequences given an acoustic observation. Together, these components comprise the "backend" of the ASR system. In addition to the backend components, an ASR system implements several "frontend" components whose main goal is to transform a speech waveform into a sequence of feature vectors

$O = O_1 O_2 \ldots O_T$ upon which recognition is based. These vectors represent how the signal energy varies across its time and frequency dimensions. The following sections describe the individual components of an ASR system in more detail.

2.2.2 Speech units, signal processing, and feature extraction

Speech sounds are fluctuations of air pressure produced by vibrations of the vocal folds, which are excited by an uninterrupted flow of air coming from the lungs. The soundwave produced by these vibrations resonates in the vocal tract and is modified by the position and shape of different articulators, including the lips, jaws, tongue, and nose. Soundwaves are commonly visualised by plotting the change of air pressure over time. The amount of change in air pressure compared to that observed in normal conditions (atmospheric pressure) is the signal's amplitude. Another important characteristic of a speech signal is its frequency, corresponding to the number of times the signal repeats itself per second. Frequency is measured in Hertz (Hz), or cycles per second. Speech can be digitally recorded by taking voltage samples from a microphone at regular time intervals. The frequency at which such samples are taken determines the maximum signal frequency that can be faithfully represented. This is determined by the Nyquist-Shannon sampling theorem, which indicates that a reliable representation of a signal can be obtained by sampling at twice the signal's maximum frequency. Because human speech lies in frequency bands below 8 kHz, speech recorded at 16 kHz is of sufficient quality for ASR purposes.

The individual speech sounds produced in a specific language can be categorised into a set of sub-word units called phonemes. Phonemes are the basic building blocks of speech. Words are then formed by composing phonemes, which together dictate how each word is pronounced in a particular language or dialect. While phonemes are used to distinguish between words with the same written form but different meaning, phones correspond to physical realisations of phonemes as instantiated in a specific speech signal and do not necessarily dictate the meaning of words. Phones and phonemes are commonly represented by symbols from the International Phonetic Alphabet (IPA) (Association, 1999). These sound units can then be seen as intermediate representations between acoustic patterns observed in the speech signal and words from a specific language. Thus, a requirement for solving the speech recognition problem is to find a function that can recognise the individual phonemes being spoken given acoustic patterns observed in the speech signal.

The frontend components of an ASR system are mainly concerned with the preprocessing of speech data prior to recognition. This process commonly involves the application of signal processing and feature extraction techniques over the input speech, with the goal of producing a set of descriptors that can effectively capture the characteristics of the individual phonemes produced by speakers at various points in time. The feature extraction process can be divided into two parts. The first is concerned with slicing the input signal

into frames and extracting spectral features for each frame. The second process applies various transformations to these initial features in order to enhance their predictive power.

Spectral features

Spectral features refer to descriptors calculated from the spectrum of the speech signal, which contains information about the signal's amplitude at different frequencies at one particular point in time. The spectrum of a discrete-time signal can be obtained by calculating its discrete-time Fourier transform (DTFT), which separates the signal into its frequency components. The fast Fourier transform (FFT) is an efficient algorithm that calculates the DTFT. While the spectrum contains information for a single point in time, a spectrogram provides a visual representation of the spectrum as it varies through time. Figure 5.1 shows a spectrogram for an utterance extracted from a broadcast TV recording. The y-axis represents frequency, while frequency components with high amplitudes (peaks) are represented by darker colours. The reason why spectral features are useful for ASR is that phones can be well characterised by the trajectories of energy peaks and other patterns found in the spectrum. Most notably, vowels can be identified by analysing the location and trajectory of the strongest frequency components in the spectrum, called formants, which roughly correspond to the different resonances of the human vocal tract.

Prior to feature extraction, the individual samples of the speech signal are normally passed through a pre-emphasis filter, which dampens low-frequency components in favour of high-frequency ones. The samples are then sliced into a sequence of equally long overlapping frames of 20-30 ms. The step size or separation between frames is frequently set to 10 ms to allow for overlapping frames that can capture sudden changes in the signal. Several types of spectral features and extraction algorithms have been proposed that transform each frame of samples into a feature vector. A classical approach is to use linear-predictive coding (LPC) to characterise the frequency and intensity of a set of formants by regression coefficients (Atal and Hanauer, 1971). The LPC coefficients can then be used to obtain linear-predictive cepstral coefficients (LPCCs) from the signal's cepstrum (Huang et al., 2001). Cepstral features, like LPCCs, tend to be more useful for ASR since the cepstrum representation can better discriminate between components related to the excitation of the signal (glottis) and its filters (vocal tract). Another type of cepstral feature widely used in speech recognition is the set of Mel frequency cepstral coefficients (MFCCs) (Davis and Mermelstein, 1980). MFCCs are obtained by first warping the spectrum with a series of triangular bandpass filters, then applying a log transformation to the filters' output, and finally taking the first 10-12 coefficients from a discrete cosine transformation (DCT). The set of triangular filters used in the MFCC calculation is known as the Mel-scale filter bank and is designed to approximate the non-linear sensitivity of the human ear to different frequency bands.
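As an illustration of the framing and MFCC extraction pipeline described above, the sketch below uses the librosa toolkit (an assumed choice; any comparable signal-processing library would do, and the audio file path is hypothetical) to compute 13 MFCCs with 25 ms windows and a 10 ms step.

```python
import librosa

# Load audio at 16 kHz, which is sufficient for speech (see Section 2.2.2).
audio, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file path

# 25 ms analysis windows with a 10 ms step, 13 cepstral coefficients per frame.
frame_len = int(0.025 * sr)   # 400 samples
hop_len = int(0.010 * sr)     # 160 samples
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop_len)

print(mfcc.shape)   # (13, number_of_frames)
```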

30 Feature transformations

The feature processing stage produces a feature vector for each frame in a spoken utterance. ASR systems apply additional transformations to these acoustic vectors to facilitate phone classification. A common approach is to augment the feature vectors with delta and delta-delta coefficients. These are the first and second derivatives of each coefficient with respect to time, normally calculated as differences between successive frames. It is also common to normalise each vector component based on its mean and standard deviation values by considering all frames available from a single speaker or audio file. This standardisation procedure seeks to cancel out variations across speakers and channels. More sophisticated methods exist for speaker adaptation. For instance, vocal tract length normalisation (VTLN) attempts to balance out differences in the vocal tract shape of male and female speakers (Lee and Rose, 1996), while speaker adaptive training (SAT) and maximum likelihood linear transformations (MLLT) (Gales, 1998) permit speaker-dependent transforms to be learnt and applied iteratively during training.
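A minimal numpy sketch of the delta/delta-delta augmentation and per-speaker standardisation described above; the simple frame-difference approximation of the derivatives used here is an assumption, since some toolkits use a regression formula over a wider window instead.

```python
import numpy as np

def add_deltas(features):
    """Append first- and second-order frame differences to a (frames x dims) feature matrix."""
    delta = np.diff(features, axis=0, prepend=features[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([features, delta, delta2])

def standardise(features):
    """Zero-mean, unit-variance normalisation per coefficient over all frames of one speaker/file."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

frames = np.random.randn(100, 13)            # e.g. 100 frames of 13 MFCCs
augmented = standardise(add_deltas(frames))
print(augmented.shape)                        # (100, 39)
```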

Prosodic features

In addition to the sets of spectral features described previously, feature vectors may be augmented with acoustic correlates of prosody. The logarithm of the signal energy is one of the conventional acoustic features used in ASR. This acoustic correlate of loudness is useful because it helps to distinguish between voiced and unvoiced sounds, and thus facilitates the distinction between vowels and consonants. Besides acoustic correlates of loudness, pitch and duration features have also been found useful for various ASR-related tasks. Kim and Woodland (2001) demonstrated that these features can be used to recover punctuation symbols in speech transcripts and even provide increased ASR accuracy. Similarly, Liu et al. (2006) describe an ASR system that exploits pitch, duration, and energy cues to improve the detection of sentence boundaries and the prediction of filler words and speech disfluencies. Other research suggesting that prosodic cues can be directly used to reduce speech recognition errors includes Chen et al. (2006), Jeon et al. (2011), and Chen et al. (2012).
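For illustration, the following sketch extracts frame-level pitch and energy contours of the kind referred to above, using librosa's pYIN pitch tracker and RMS energy (both assumed toolkit choices; the file path and frequency range are hypothetical, and this is not the feature extraction setup used in this thesis).

```python
import librosa
import numpy as np

# Hypothetical recording; 16 kHz is sufficient for speech (Section 2.2.2).
audio, sr = librosa.load("utterance.wav", sr=16000)

hop = int(0.010 * sr)   # 10 ms step, matching typical ASR frame rates

# Fundamental frequency (pitch) contour via pYIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(audio, fmin=60, fmax=400, sr=sr, hop_length=hop)

# Root-mean-square energy per frame as a simple loudness correlate.
energy = librosa.feature.rms(y=audio, hop_length=hop)[0]

# Log-energy and per-file z-scored pitch, which could later be pooled over word time spans.
log_energy = np.log(energy + 1e-10)
f0_z = (f0 - np.nanmean(f0)) / (np.nanstd(f0) + 1e-8)
print(len(f0), len(energy))
```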

2.2.3 Language and acoustic modelling

For each utterance, the frontend components produce a sequence of observations O =

$O_1 \ldots O_T$, each describing the spectral characteristics of a particular frame. This sequence is subsequently received by the backend components of the ASR system, which search for the most likely sequence of words $\hat{W} = \hat{W}_1 \ldots \hat{W}_N$ that may explain these observations. Two important components used for this purpose are the language model (LM) and the acoustic model (AM).

31 The language model (LM)

Within the space of all word sequences that could possibly be generated by randomly appending individual words from a language, only a small subset of them will be grammatical, and only a small proportion of these will be meaningful and frequently used in spoken language. ASR systems can then take advantage of the fact that some word combinations are more frequent than others to limit the search space of possible word sequences in the search for the optimal $\hat{W}$.

A language model (LM) assigns a probability to a sequence of words $P(W_1 \ldots W_N)$. A good LM assigns higher probabilities to sequences that are widely used in a language, and low ones to sequences that are less frequently used. The most common type of LMs used in ASR are the so-called n-gram language models, in which the probability of the occurrence of the next word in a sequence is based on the $n - 1$ words that occur before it. That is,

$$P(W_1 \ldots W_N) = \prod_{i=1}^{N} P(W_i \mid W_{i-n+1} \ldots W_{i-1})$$

These conditional probabilities are usually estimated by counting the number of occurrences of n-grams in a large corpus of text. In practice, estimating n-gram probabilities based on observed counts has the issue that a large number of valid n-grams in a language may not appear at all in the training data and would thus be assigned a probability of 0. This is a potential problem for ASR since, given Equation 2.16, sequences with 0 LM probability would never be recognised even if they obtain a high acoustic probability. Several smoothing techniques have been proposed to tackle this problem, most of which introduce adjustments to the occurrence counts of rare n-grams so that they acquire non-zero probabilities. The simplest technique is additive smoothing (Chen and Goodman, 1996), which adds a fixed pseudo-count to non-occurring n-grams (a brief illustrative sketch is given at the end of this subsection). More sophisticated techniques include Jelinek-Mercer (Jelinek and Mercer, 1980), Katz (Katz, 1987), and Kneser-Ney (Kneser and Ney, 1995) smoothing, which use a weighted linear interpolation of decreasingly lower order models (back-off) to improve the estimation of rare n-grams. Modern ASR systems use a combination of n-gram LMs and neural language models (NLMs), also known as continuous-space language models (Mulder et al., 2015). NLMs are based on artificial neural networks, more specifically on deep neural networks (DNNs), including feed-forward neural networks (FNNs) (Bengio et al., 2003) and recurrent neural networks (RNNs) (Mikolov et al., 2010), trained using the back-propagation algorithm to predict the identity of the n-th word in a sentence given its preceding words. Much of the success of these approaches lies in their use of continuous vector representations of words, commonly known as word embeddings (Mikolov et al., 2013). In this respect, neural network approaches have the capability to map words from the vocabulary onto a latent vector space so that words used in similar contexts are clustered. This helps to alleviate the data sparsity problem as the predictions of the model are implicitly based

32 Figure 2.3: Graphical representation of a hidden Markov model used to model an individual phone.


on such word clusters. In addition, RNNs can in theory handle arbitrary context lengths; thus, unlike n-gram models, they need not be designed for a fixed number of preceding words.
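The sketch referenced above illustrates a bigram language model with additive smoothing; the training sentences are hypothetical, and vocabulary handling, back-off, and higher-order contexts are deliberately omitted.

```python
from collections import Counter

class BigramLM:
    """Bigram language model with additive smoothing over a closed vocabulary."""
    def __init__(self, sentences, alpha=1.0):
        self.alpha = alpha
        self.unigrams, self.bigrams = Counter(), Counter()
        self.vocab = set()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            self.vocab.update(tokens)
            self.unigrams.update(tokens[:-1])                 # context counts
            self.bigrams.update(zip(tokens[:-1], tokens[1:]))

    def prob(self, word, prev):
        """P(word | prev) with pseudo-counts so unseen bigrams keep non-zero probability."""
        v = len(self.vocab)
        return (self.bigrams[(prev, word)] + self.alpha) / (self.unigrams[prev] + self.alpha * v)

    def sentence_prob(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens[:-1], tokens[1:]):
            p *= self.prob(word, prev)
        return p

lm = BigramLM(["the cat sat", "the dog sat", "a cat ran"])
# A plausible word order receives a higher probability than an implausible one.
print(lm.sentence_prob("the cat ran"), lm.sentence_prob("ran cat the"))
```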

The acoustic model (AM)

The main goal of acoustic modelling is to estimate $P(O \mid W)$ for a given sequence of acoustic observations $O$ and words $W$. The traditional approach to calculating these probabilities makes use of hidden Markov models (HMMs) (Levinson et al., 1983; Rabiner, 1989). An HMM models a process that produces sequences of symbols probabilistically. An HMM has a set of hidden states $Q = \{q_1, \ldots, q_N\}$, special starting and ending states, and transition probabilities $a_{ij}$ between each pair of states. Some of the states in an HMM are regarded as emitting states, from which the model can produce an observed value. Each emitting state $q_i$ defines a probability distribution $b_i(O_t)$ over some set of possible observation values $O_t$. Figure 2.3 shows a left-to-right HMM with five states. The generation process begins at the left-most state of the diagram. At each step, the model decides to transition to the state on its right with some probability or to remain in its current state. While visiting an emitting state, the model produces an observation based on a probability distribution, depicted in the figure as an arrow pointing to a density function. This particular HMM structure, with three emitting states and left-to-right transitions, is typically used in ASR systems to model individual phones. The states in the HMM represent some intermediate step in a phone's pronunciation, while the self-transitions (loops) model duration variations. In order to consider variations produced by preceding and following phones, context-dependent systems represent a single phone with three HMM states concatenated in sequence. Further, it is common to append the HMM structures of various phones together into sets of triphones to account for acoustic variability. This appending process can be used to produce word-level HMMs based on a pronunciation dictionary or lexicon, which contains transcriptions of word strings into phonemes. Traditional acoustic modelling techniques use Gaussian mixture models (GMMs) to model emission probabilities over continuous acoustic vectors $O \in \mathbb{R}^N$. Under this approach, the output probability distribution for a state $b_i(O)$ based on a GMM with $K$

components is given by

$$b_i(O) = \sum_{k=1}^{K} \phi_k\, \mathcal{N}(O, \mu_{ik}, \sigma_{ik}),$$

where $\mathcal{N}(O, \mu_{ik}, \sigma_{ik})$ is the probability density function of the $k$-th multivariate Gaussian component in the mixture. More recently, DNNs have been used instead of GMMs for estimating emission probabilities (Hinton et al., 2012; Deng et al., 2014; Yu and Deng, 2014). The superiority of DNNs over GMMs for phone recognition can be attributed to their ability to discover useful features from more primitive spectral descriptors than MFCCs, their high robustness to small noise perturbations in the inputs, and their effective exploitation of contextual input features. The use of GMMs or DNNs to model emission probabilities in combination with HMMs for acoustic modelling is referred to as the hybrid GMM-HMM or DNN-HMM framework.

In these frameworks, the likelihood of a sequence of acoustic observations O_1 \ldots O_T given an HMM \mathcal{M} is given by Equation 2.17,

P(O \mid \mathcal{M}) = \sum_{S} \prod_{t=1}^{T} b_{s(t)}(O_t) \, a_{s(t)\,s(t+1)}   (2.17)

where the summation ranges over all possible sequences of states S = s(1) \ldots s(T) in the model. The process of recognising the most likely sequence of phones spoken in a given utterance O then consists of finding the HMM \hat{\mathcal{M}} that maximises the likelihood in Equation 2.17, corresponding to the model that best explains the acoustic observations.
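
Evaluating Equation 2.17 by enumerating every state sequence is exponential in T; in practice the likelihood is computed with the forward recursion. The sketch below is a minimal illustration of this computation for a toy two-state left-to-right HMM with Gaussian emission densities; the transition matrix, emission parameters, and function names are illustrative assumptions, and the final transition into the ending state is omitted for brevity.

import math
import numpy as np

def gaussian_pdf(x, mu, sigma=1.0):
    """Univariate Gaussian density used as a toy emission distribution b_i(O_t)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def forward_likelihood(pi, A, emission, observations):
    """P(O | M): sum over all state paths of prod_t b_{s(t)}(O_t) * a_{s(t)s(t+1)},
    computed efficiently with the forward recursion."""
    N = len(pi)
    alpha = np.array([pi[i] * emission(i, observations[0]) for i in range(N)])
    for obs in observations[1:]:
        # alpha_t(j) = b_j(O_t) * sum_i alpha_{t-1}(i) * a_ij
        alpha = np.array([emission(j, obs) * np.dot(alpha, A[:, j]) for j in range(N)])
    return float(alpha.sum())

# Toy two-state left-to-right HMM: state 0 emits values around 0.0, state 1 around 1.0.
pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])
emission = lambda i, o: gaussian_pdf(o, mu=[0.0, 1.0][i])
print(forward_likelihood(pi, A, emission, [0.1, 0.8, 1.2]))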

2.2.4 Decoding, output representation, and evaluation

Decoding refers to the process of finding the most likely sequence of words W that maximises the product of the acoustic likelihood P(O \mid W) and the language model probability P(W). The traditional decoding algorithm used for this purpose is the Viterbi algorithm (Viterbi, 1967), which uses dynamic programming to efficiently infer the most likely state sequence across all word HMMs given the observations. In practice, implementations of this algorithm apply some form of pruning to reduce the size of the search space by discarding state paths with low probabilities. A common pruning technique is beam search, in which only the top K scoring paths (hypotheses) are kept while advancing the search from one time step to the next. The optimal word sequence found by Viterbi is usually termed the 1-best hypothesis. For many applications, however, it is more convenient to consider more than one recognition hypothesis. Several decoding algorithms have been developed for this purpose, most of which extend Viterbi to generate the top N-best recognition hypotheses besides the 1-best (Schwartz and Austin, 1991; Soong and Huang, 1991). Considering alternative hypotheses permits the application of increasingly complex models to iteratively refine the ASR output in a process called multi-pass decoding. For instance, it is common to perform

Figure 2.4: An example of a recognition lattice (without confidence scores) taken from (Larson and Jones, 2012b).

Figure 2.5: An example of a word confusion network (without confidence scores) taken from (Larson and Jones, 2012b).

a first-pass decoding with a bigram LM to obtain a list of N-best hypotheses and then use a trigram LM to re-score these hypotheses in a second pass. An alternative representation of the most likely recognition hypotheses is a lattice. An example is shown in Figure 2.4. A lattice is a weighted directed acyclic graph that encodes alternative recognition hypotheses. Each complete path through a lattice represents an alternative hypothesis weighted by its recognition score. The nodes in a lattice represent points in time and the arcs represent hypothesised words or other recognition units such as phones or HMM states. Arcs are also labelled with a score that represents the confidence of the ASR system in the recognition of a particular word. Word confusion networks (WCNs) provide yet another compact representation of recognition hypotheses (Mangu et al., 1999). An example of a WCN is shown in Figure 2.5. A WCN is a conflated version of a word lattice in which exact time information is discarded in favour of providing more direct information about the relative position of each word in the recognised sentence along with its set of competing words. Transcription errors produced by an ASR system at the word level can be classified into in-vocabulary and out-of-vocabulary (OOV) errors. The first type occurs for words that, despite being included in the LM of the ASR system, are not recognised correctly. The second type occurs when the words to be recognised are not included in the LM of the ASR system and thus have 0 probability of being recognised. Word error rate (WER) is the main evaluation measure used to estimate the quality of an ASR 1-best hypothesis against a reference

(perfect) transcription. WER is calculated by first aligning the hypothesis (hyp) with the perfect transcript of the utterance (ref), and then counting the number of insertion (I), substitution (S), and deletion (D) errors in hyp relative to ref. The alignment between hypothesis and reference transcripts is done so that the number of errors is minimal. WER is then defined as the ratio between the sum of these errors and the total number of words in the reference transcription, as shown in Equation 2.18.

WER = 100 \cdot \frac{I + S + D}{|\mathit{ref}|}   (2.18)

Speech recognition accuracy is known to vary greatly across tasks and, in particular, across domains, genres, languages, and speech types. Recognition accuracy can also vary depending on the amount of data available for training as well as the amount of computational resources available for training and decoding. Typical average WER values reported in the literature for various tasks are: read speech (3-5%), broadcast news (9-11%) (Bell et al., 2015; Wu et al., 2016), multi-genre TV broadcasts (10-40%) (Bell et al., 2015), conversational telephone speech (5-40%) (Lileikyte et al., 2015; Xiong et al., 2016; Chiu et al., 2017; Enarvi et al., 2017), lectures/public talks (17-30%) (Rousseau et al., 2012; Akiba et al., 2016), and YouTube videos (45-50%) (Hinton et al., 2012). Although recent advances in DNN-based modelling have significantly reduced error rates across a wide range of tasks (Hinton et al., 2012), speech recognition remains difficult in situations where there is insufficient training data for a particular task or language, or when there are substantial differences between the data used for training and evaluation.
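
As an illustration of Equation 2.18, the following sketch computes WER by aligning hypothesis and reference with a standard Levenshtein (edit distance) dynamic programme, which yields the minimal number of insertions, deletions, and substitutions; the whitespace tokenisation and the example strings are illustrative only.

def word_error_rate(ref, hyp):
    """WER = 100 * (I + S + D) / |ref|, via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: minimum number of edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                          # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                          # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[len(r)][len(h)] / len(r)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # one substitution and one deletion: ~33.3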

2.3 Content structuring

In the context of this thesis, content structuring is concerned with the problem of identifying coherent units of information in text or spoken documents which could represent a good target for retrieval. The overall objective of a content structuring method is to find structural components within multi-topical unstructured documents such that each component is aligned with a single concept, idea, or topic which could potentially satisfy a single information need of the user. The main motivation behind structuring the content of a search collection is to enable the retrieval of smaller units, and with that, to reduce the amount of effort a user has to invest in order to consume the information of interest. By retrieving smaller, more focused units of relevant content, or by pointing out to the user where such relevant content begins within the original document, the hope is that the user will save valuable time that they would otherwise have to spend skimming or navigating the document to locate the relevant information. Because textual content can be skimmed and browsed more easily than speech, content structuring techniques offer even greater potential benefits for SCR applications, where the

time a user needs to audition a search result is not negligible. Most content structuring methods used in SCR can be classified into two broad categories: automatic text or topic segmentation methods, originally designed for the processing of text documents; and spoken document segmentation methods, which besides text transcripts can make use of other structural cues that are prominent in speech. This section provides a detailed overview of existing content structuring methods. Emphasis is given to methods that have been used in passage retrieval and SCR research. This includes a large number of approaches originally designed for the automatic segmentation of text material.

2.3.1 Automatic segmentation of text documents

Most automatic segmentation methods proposed in the literature attempt to measure the degree of lexical cohesion that exists between adjacent elements in a piece of text. The standard method for measuring lexical cohesion consists of quantifying the amount of term overlap that exists between two or more contiguous elements. The higher the lexical overlap between the elements, the higher their assumed degree of cohesion. These estimates of lexical cohesion are then used to decide whether contiguous elements in a document should be treated as separate segments or merged into a single one so as to maximise the intra-segment cohesion. Among the various segmentation methods proposed in the literature, the remainder of this section describes those that have been influential in subsequent work and have been widely used in IR and SCR research.

Sliding windows

The simplest approach to text segmentation is to divide a document into arbitrary passages of equal length. A common way to do this consists of sliding a window of length L, measured in words, over the text document, one word at a time, and extracting a segment or passage every time the window has been shifted by S steps. The L parameter determines the length of the segments to extract, while S determines the amount of overlap between adjacent segments. Under these definitions, setting S = L results in non-overlapping passages, whilst setting S = L/2 produces 50% overlap between consecutive passages.
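
The windowing strategy just described can be sketched as follows, assuming the document has already been tokenised into a list of words; the parameter defaults, function name, and example text are illustrative.

def sliding_window_passages(words, L=100, S=50):
    """Slide a window of L words over the document, shifting it S words at a time.
    S = L yields non-overlapping passages; S = L // 2 yields 50% overlap."""
    passages, start = [], 0
    while start < len(words):
        passages.append((start, words[start:start + L]))
        if start + L >= len(words):
            break
        start += S
    return passages

doc = ("spoken content retrieval combines automatic speech recognition "
       "with information retrieval techniques to search large audio archives").split()
for start, passage in sliding_window_passages(doc, L=8, S=4):
    print(start, " ".join(passage))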

TextTiling (TT)

TextTiling performs segmentation of text documents by identifying strong changes in vocabulary usage between adjacent fragments of text (Hearst, 1993, 1994, 1997). The algorithm can be broken down into three processing steps. The first step consists of tokenising the input text, followed by lowercasing, removal of stop words, and stemming. The second step consists of computing a similarity score, called the lexical score, between

Figure 2.6: The TextTiling algorithm applied to blocks of k = 2 pseudo-sentences of size w. A depth score is calculated at the boundaries of each pseudo-sentence based on the similarity between adjacent blocks (red and blue dashed boxes).


adjacent pairs of text blocks. The last step of the algorithm selects the most promising segment boundaries in the document by identifying pairs of blocks with minimal lexical scores. In the lexical score computation step, blocks are formed by grouping k adjacent pseudo-sentences, which are in turn formed by sequences of w consecutive terms. A sliding window is then passed over the pseudo-sentences of the document to create the blocks, and a lexical score is computed at each pseudo-sentence boundary. Hearst proposed two methods for calculating the lexical score at a boundary: block similarity and vocabulary introduction. In block similarity scoring, blocks are represented by vectors of term frequencies and the lexical score between two blocks is calculated as the cosine similarity between their vectors. In the vocabulary introduction method, the lexical score between a pair of adjacent blocks is given by the number of terms contained in the blocks that are seen for the first time in the text. The intuition is that blocks that introduce new vocabulary are more likely to signal the beginning of a new topic. The last step of the algorithm is to identify block pairs with low lexical scores, corresponding to the most likely topical boundaries. This problem can be seen as that of identifying the deepest valleys in the lexical score contour. Instead of simply selecting the valleys with the lowest absolute scores, a "depth" score for each valley is calculated as the sum of the differences between the lexical score of the valley and those of its left and right peaks. Boundaries are then ranked by their depth scores and the deepest ones returned as output. Figure 2.6 shows how TextTiling is applied to a sequence of pseudo-sentences. The number of boundaries to return can be automatically determined by selecting valleys whose depth scores surpass a specified threshold. Instead of using arbitrary thresholds, Hearst proposed calculating per-document thresholds based on the mean (µ) and standard deviation (σ) of the depth scores.
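
A simplified sketch of the block-similarity variant of TextTiling is shown below: term-frequency vectors are built for the k pseudo-sentences on each side of a candidate boundary, lexical scores are their cosine similarities, and depth scores are obtained by climbing to the nearest peak on each side of a valley. The data structures and parameter defaults are illustrative simplifications of Hearst's algorithm.

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (dicts)."""
    num = sum(tf * b[t] for t, tf in a.items() if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttiling_depth_scores(pseudo_sentences, k=2):
    """Lexical score at each gap = cosine similarity of the k pseudo-sentences on
    either side; depth score = rise to the nearest peak on the left plus the rise
    to the nearest peak on the right."""
    gaps = list(range(k, len(pseudo_sentences) - k + 1))
    lexical = []
    for g in gaps:
        left = Counter(w for ps in pseudo_sentences[g - k:g] for w in ps)
        right = Counter(w for ps in pseudo_sentences[g:g + k] for w in ps)
        lexical.append(cosine(left, right))
    depths = []
    for i, score in enumerate(lexical):
        l = i
        while l > 0 and lexical[l - 1] >= lexical[l]:
            l -= 1                      # climb to the nearest peak on the left
        r = i
        while r < len(lexical) - 1 and lexical[r + 1] >= lexical[r]:
            r += 1                      # climb to the nearest peak on the right
        depths.append((lexical[l] - score) + (lexical[r] - score))
    return gaps, depths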

Feature-based approaches

A number of approaches to text segmentation make use of statistical models that learn how best to combine a set of features to predict where topic boundaries occur, given some training examples of boundaries in text data (Beeferman et al., 1997; Reynar, 1998). The basic approach consists of extracting features that may be indicative of the presence of topic boundaries, such as cue phrases, lexical cohesion scores like those computed by the TextTiling algorithm, lexical chains based on named entities or word synonyms extracted from a thesaurus, the location of the previously predicted boundary, etc. A machine learning model is then trained to learn associations between these features and the presence of topic boundaries, given examples of true and false boundaries. This model can later be used to predict the probability that a topical break exists at a particular location within a given document. The technique described by Beeferman et al. (1997) trains a log-linear model with a set of lexical and visual features for segmenting TV broadcast news video. Two important features used in their approach are given by a long-range language model, trained on selected words from the previous N sentences, and a short-range tri-gram language model, which only conditions its predictions on the previous two words in a sentence. At points in a document where a new topic is introduced, the change in vocabulary causes the predictions of the short-range LM to be better than those of the long-range LM. Hypothetical topic boundaries can therefore be predicted by comparing the relative performance of the two LMs.

DotPlot and C99

The segmentation approach proposed by Reynar (1998) calculates lexical similarity scores between every pair of text blocks in a document. These values are then visually depicted in a dotplot, a 2D matrix showing the similarity score of each pair of blocks (i, j), where higher similarity scores are denoted by a brighter colour. Figure 2.7 shows an example of a dotplot. Regions of high cohesion are visible in the dotplot as small bright squares along the diagonal. Thus, the segmentation problem can be framed as one of detecting this type of pattern in a dotplot. This can in turn be seen as an optimisation problem, where the goal is to find a set of "splits" along the diagonal that maximise the intra-segment similarity or the inter-segment dissimilarity. Choi (2000) proposed another influential segmentation algorithm based on lexical cohesion, called C99. In this work, Choi highlighted the fact that absolute cosine similarities between block pairs are often unreliable for short blocks of text, and that only relative similarity differences can be considered meaningful. Based on this observation, C99 transforms a cosine similarity matrix by converting each of its values (i, j) into an integer which specifies the position at which (i, j) would rank if compared to its K closest neighbours. After the rank-similarity matrix is obtained, C99 performs divisive clustering. At each iteration, the algorithm selects the split which maximises a global intra-density criterion,

Figure 2.7: Example of a dotplot extracted from (Choi, 2000) showing the pairwise similarity matrix between blocks of text.

calculated for all of the segments in a segmentation plus the new segments that result from applying the split step. For a segment, this criterion is calculated as the ratio between the sum of ranks of the segment and its area in the rank-similarity matrix. The iterative clustering procedure continues until the global intra-density measure stabilises.
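
The rank transformation at the core of C99 can be sketched as follows: each value of the similarity matrix is replaced by the proportion of its neighbours, within a local mask, that have a strictly lower similarity. The mask size and names below are illustrative assumptions, and the subsequent divisive-clustering step is omitted.

import numpy as np

def rank_transform(sim, mask=5):
    """Replace each cell of a cosine-similarity matrix by the fraction of its
    neighbours (within a mask x mask window) whose similarity is strictly lower."""
    n = sim.shape[0]
    half = mask // 2
    ranks = np.zeros_like(sim, dtype=float)
    for i in range(n):
        for j in range(n):
            i0, i1 = max(0, i - half), min(n, i + half + 1)
            j0, j1 = max(0, j - half), min(n, j + half + 1)
            window = sim[i0:i1, j0:j1]
            neighbours = window.size - 1
            ranks[i, j] = np.sum(window < sim[i, j]) / neighbours if neighbours else 0.0
    return ranks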

Utiyama and Isahara (UI)

The probabilistic approach proposed by Utiyama and Isahara (2001) attempts to find the segmentation that attains maximum probability given a text document. More formally, given the sequence of words W = w_1, \ldots, w_n comprising the document, the goal is to find the most likely segmentation S = S_1, \ldots, S_m that satisfies

\arg\max_S P(S \mid W) = \arg\max_S P(W \mid S)\, P(S).

In this equation, the likelihood P(W \mid S) is approximated by \prod_i \prod_j P(w^i_j \mid S_i), where S_i is a segment containing a subsequence of n_i consecutive words from W, i.e. S_i = w^i_1, \ldots, w^i_{n_i}. The individual probabilities of a word being generated by a segment, P(w^i_j \mid S_i), can be obtained by estimating a language model for each individual segment S_i. In the absence of any prior information about S, the authors suggest defining P(S) as being proportional to n^{-m}, where n is the length of the document and m is the number of segments in S. Under the above set-up, the likelihood P(W \mid S) is highest for segmentations in which a large number of terms of the same type are grouped into the same segment and, in general, it increases as the document is divided into more segments. This objective is counter-balanced by the prior, which, being proportional to n^{-m}, is maximised when the document is divided into a small number of segments. Given these competing optimisation targets, Utiyama and Isahara cast the problem as that of finding the optimal path in a directed weighted graph, where the nodes in this graph represent possible splitting points

between words, and the edges represent individual segments covering all words between the connected nodes.
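
A minimal sketch of this formulation is given below: dynamic programming over candidate boundaries finds the minimum-cost path, where the cost of a segment combines its negative log-likelihood under an add-one-smoothed unigram model with the log n penalty implied by the n^{-m} prior. The smoothing choice, sentence-level granularity, and function names are illustrative simplifications of the published method.

import math

def ui_segment(sentences, vocab_size):
    """Find segment boundaries by dynamic programming over sentence positions.
    The cost of making sentences[i:j] a segment is its negative log-likelihood
    under an add-one-smoothed unigram model estimated from the segment itself,
    plus log(n) for the extra segment (the prior P(S) is proportional to n^-m)."""
    words = [w for s in sentences for w in s]
    n = len(words)

    def segment_cost(i, j):
        seg = [w for s in sentences[i:j] for w in s]
        counts = {}
        for w in seg:
            counts[w] = counts.get(w, 0) + 1
        log_lik = sum(math.log((counts[w] + 1) / (len(seg) + vocab_size)) for w in seg)
        return -log_lik + math.log(n)

    best = [0.0] + [float("inf")] * len(sentences)
    back = [0] * (len(sentences) + 1)
    for j in range(1, len(sentences) + 1):
        for i in range(j):
            cost = best[i] + segment_cost(i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i
    boundaries, j = [], len(sentences)
    while j > 0:                       # follow back-pointers to recover the path
        boundaries.append(back[j])
        j = back[j]
    return sorted(boundaries)[1:]      # segment start positions (excluding 0)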

Minimum Cut (MC)

Malioutov and Barzilay (2006) developed the Minimum Cut model for the segmentation of spoken lectures. A text document is represented as an undirected weighted graph, where the nodes in the graph correspond to atomic text blocks and the edges represent the similarity between pairs of blocks. The segmentation problem is then cast as a graph-partitioning problem, where the objective is to find a partition of the document graph which minimises the normalised-cut criterion, an objective function previously found useful for image segmentation tasks (Shi and Malik, 2000). This optimisation objective seeks to capture the within-partition similarity of a candidate partition as well as the dissimilarity across different partitions. The authors evaluated the Minimum Cut algorithm on a collection of ASR transcripts and observed that their method tended to perform more robustly than others in the presence of ASR errors. Despite this advantage, a major drawback of Minimum Cut is that the number of partitions is not determined automatically by the algorithm and instead needs to be provided in advance.

Bayesian segmentation (BayesSeg)

In follow-up work, Eisenstein and Barzilay (2008) developed a more general Bayesian framework for the definition of text segmentation algorithms and demonstrated that Utiyama and Isahara's method is a particular case of this framework. Besides using language models for estimating the probabilities of a segmentation based on word counts, Eisenstein and Barzilay (2008) proposed using an additional language model to account for cue phrases, which they found useful for the segmentation of transcribed meetings and a medical textbook. This new algorithm, called BayesSeg, was shown to outperform UI and Minimum Cut in terms of segmentation quality for both transcribed speech and written documents. Despite this increased performance, BayesSeg assumes that the number of topics, and therefore the number of desired segments, is given in advance.

2.3.2 Segmentation of spoken content

The most common approach to finding topic boundaries in speech consists of running an automatic text segmentation algorithm over the transcripts generated by an ASR system. Compared to the segmentation of text documents, performing topic segmentation over noisy speech transcripts is arguably a more difficult task. In addition to transcription errors and the lack of punctuation symbols, spoken language tends to be more informally structured than written language. Spoken language tends to show smoother topic

transitions and a general reduction in the usage of content-bearing words, which make the segmentation task more difficult. In spite of this increased difficulty, spoken content contains additional information that can potentially be helpful for the segmentation process. Prosodic features derived from the speech signal that capture variations in pitch, loudness, and speech rate, as well as the duration of pauses between words and at utterance endings, have been shown to correlate well with the occurrence of topical boundaries (Hirschberg and Grosz, 1992; Hirschberg, 2002; Wagner and Watson, 2010). Previous work on topical segmentation of speech has successfully incorporated many of these prosodic features into segmentation approaches. This work typically relies on supervised machine learning techniques to combine prosodic/acoustic features with lexical cues. The seminal work on exploiting prosodic information for topic segmentation is that of Shriberg et al. (2000) and Tür et al. (2001). In this work, prosodic features were extracted around each word boundary in the ASR transcripts with a window that included the preceding and following words around each boundary. This set of features included pause durations, phone durations, and various hand-crafted pitch and voice quality descriptors. A decision tree classifier was then trained on this set of features to estimate the probability of a topic break occurring at an inter-word boundary. The probabilities estimated by this classification tree were then combined under an HMM framework with a topic segmenter, independently trained to predict topic assignments from lexical information. Shriberg et al. experimented with this model on sentence and topic segmentation tasks and found it to perform substantially better than a model that did not make use of prosodic information. Subsequent work on prosody-based speech segmentation includes that of Kolář et al. (2006), who observed that, besides pitch and pause features, energy features can also be beneficial for speech segmentation. Malioutov et al. (2007) devised an unsupervised approach that detects putative topic boundaries in a spoken document without requiring any lexical information. This approach attempts to approximate the lexical cohesion score that a pair of utterances would attain based on their acoustic similarity. These similarity scores are then used along with Minimum Cut to segment the spoken document. In addition to prosodic information, which is always present in spoken language, there are other domain-specific cues which can be used to enhance the quality of topic segmentation algorithms. In the broadcast news domain, for example, topics generally correspond to news stories. Frequently, news stories are interleaved with commercials, which can be automatically detected by looking for significant changes in energy levels. Other important features for the identification of story boundaries in broadcast news are cue words and phrases such as "Good morning" and "reporting from ...". Additionally, if the content is known to be produced by multiple speakers, speaker turns are another feature that is often indicative of topic shifts. Since most ASR systems perform a fine-grained segmentation of the input speech based on voice activity recognition and speaker diarisation methods,

the outputs of these components are frequently available and can be used as additional features. Finally, if video information is also available, segmentation algorithms can make use of visual structuring cues to guide the identification of topic boundaries. Common visual features include the beginning of shots, produced by a single camera, and scenes, corresponding to groups of visually similar shots.

2.4 The application of content structuring methods to text retrieval

Knowing the topical structure of a document can be beneficial for a number of text retrieval tasks, including document, passage, and XML retrieval. This section describes prior work that has made use of segmentation methods to improve text retrieval techniques. A more detailed review of previous studies that have applied segmentation methods to SCR tasks is given later in Chapter 3.

2.4.1 Document retrieval

Early attempts to exploit sub-document structure in IR focused principally on improving the effectiveness of full-document retrieval techniques. The standard technique adopted for doing this comprises three basic steps: (i) segment each document in the collection into short passages; (ii) calculate a relevance score for each passage against the query; (iii) rank the documents based on a combination of passage scores. Researchers have explored different segmentation algorithms and strategies to produce the document scores from various combinations of passage scores. An early application of sliding windows is mentioned in (Stanfill and Waltz, 1992) as part of a description of the now extinct CMDRS retrieval system. In this system, documents were split into non-overlapping contiguous passages of 30 words each. When a query was issued, the system would score each passage of each document in the collection and rank the documents based on the score of their highest scoring passage. Stanfill and Waltz motivated this passage-level approach by stating that: (a) it facilitated the retrieval of very long documents; (b) it provided a better normalisation mechanism for collections that contained examples of both extremely long and short documents. Standard IR models would usually assign low scores to long documents that contain a relatively small relevant part. By scoring passages, retrieval models can be made more sensitive to short sections containing a high density of query terms and thus improve the retrievability of long documents. In order to avoid splitting a high scoring document region in half, Stanfill and Waltz applied passage "blurring", combining adjacent passages into a longer overlapping passage. During this time, several researchers highlighted the benefits of considering passage-level evidence for improving full-document retrieval (Hearst and Plaunt, 1993; Salton et al., 1993; Callan, 1994). Notably, Hearst and Plaunt (1993) experimented with the

TextTiling algorithm for dividing documents into multi-paragraph passages. For retrieval, documents were ranked based on the sum of the scores of the top 200 passages retrieved in an initial retrieval pass. Their experiments showed that passages generated by TextTiling were not more effective than those formed from paragraphs. In both cases, passage-based retrieval provided better document rankings than performing full-document retrieval alone. Work by Callan (1994) evaluated document scoring techniques based on passages generated from paragraphs and from fixed-length overlapping windows. He evaluated three retrieval approaches that varied depending on which source of evidence was used for estimating the relevance score of a document: (i) evidence from the document only; (ii) evidence from its best scoring passage; (iii) evidence from both the document and its best passage. Conclusions from this work indicated that paragraphs perform poorly compared to sliding windows, mainly because the former do not always align well with the boundaries of relevant sections. The overlapping windows approach instead provides an extra degree of flexibility and adapts better to relevant regions with arbitrary starting points. The author also observed that exploiting document and passage level information in combination (iii) resulted in improved search effectiveness compared to using document or passage evidence alone. Subsequent research investigated optimal strategies for the combination of multiple sources of evidence from independent searches (Bartell et al., 1994; Fox and Shaw, 1993; Belkin et al., 1995). Bartell et al. (1994) proposed learning optimal weights for a linear combination of relevance scores by using a gradient-based optimisation approach. Fox and Shaw (1993) and Belkin et al. (1995) investigated the performance of simple aggregations of relevance scores from multiple ranked lists, consisting of adding the different scores (CombSUM), dividing or multiplying the sum of scores by the number of ranked lists in which a document appears (CombANZ and CombMNZ), and taking the maximum or minimum values across rankings (CombMAX and CombMIN). Among these strategies, experimental results showed that the summation of scores (CombSUM) was the most effective at combining evidence from multiple rankings (Belkin et al., 1995). The seminal work on using content structuring techniques for improving document retrieval is that of Kaszkiel and Zobel (1997, 2001), who performed an in-depth comparison of the segmentation methods proposed up to that time for scoring documents. The effectiveness of a segmentation method was based on its ability to produce passages that could serve to rank documents effectively, where documents were ranked according to their highest scoring passages. The set of methods compared included: discourse segments such as those from a book's paragraphs, sections, and pages; segments produced by TextTiling; fixed-length non-overlapping windows; and fixed-length and variable-length arbitrary passages. The latter two types correspond, respectively, to passages of fixed or arbitrary length that can start at any word position within a document. The experiments conducted by Kaszkiel and Zobel indicated that variable-length arbitrary passages performed best among all passage types considered, although only by a

small fraction over fixed-length arbitrary passages. Both variable-length and fixed-length arbitrary passages were shown to outperform other types of non-overlapping pre-defined passages and to enhance the quality of the document rankings overall. Depending on the length chosen, the performance of fixed-length arbitrary passages varied widely across test collections, indicating that there is no single optimal passage length that could "fit" every possible query and collection. In fact, an oracle approach that selected the best passage length per query was shown to perform significantly better than the rest of the approaches under study. This led the authors to conclude that, despite providing better results than a fixed-length approach, their variable-length strategy failed to find the optimal passage for every query. Another important result from this work is the observation that applying length normalisation mechanisms to adjust the relevance scores can considerably improve document retrieval effectiveness when passages vary greatly in length. Subsequent research in this area focused on exploiting sub-document structure for ranking passages, instead of documents, and for improving the ranking of semi-structured documents specified in the eXtensible Mark-up Language (XML).
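
The document-scoring and score-fusion strategies discussed above can be illustrated with a short sketch: documents are ranked by their highest-scoring passage, and relevance scores from multiple ranked lists are merged with CombSUM- or CombMNZ-style aggregation. The data structures and function names are illustrative assumptions rather than descriptions of any specific system.

from collections import defaultdict

def rank_documents_by_best_passage(passage_scores):
    """passage_scores: iterable of (doc_id, passage_id, score) triples.
    Ranks each document by the score of its highest-scoring passage."""
    best = defaultdict(float)
    for doc_id, _, score in passage_scores:
        best[doc_id] = max(best[doc_id], score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

def comb_fusion(rankings, strategy="CombSUM"):
    """rankings: list of dicts mapping doc_id -> relevance score, one per run.
    CombSUM adds the scores; CombMNZ multiplies the sum by the number of runs
    in which the document appears."""
    summed, hits = defaultdict(float), defaultdict(int)
    for run in rankings:
        for doc_id, score in run.items():
            summed[doc_id] += score
            hits[doc_id] += 1
    fused = {d: summed[d] * hits[d] if strategy == "CombMNZ" else summed[d] for d in summed}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)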

2.4.2 Passage retrieval

Passage retrieval refers to the task of finding the portions of documents that are relevant to a query. Due to the high costs associated with the collection of relevance judgements for arbitrary text fragments, rigorous evaluation of passage retrieval techniques did not commence before shared tasks and benchmarking initiatives, such as those organised by the Text REtrieval Conference (TREC), provided test collections with passage-level relevance judgements. In particular, much research in passage retrieval was done in the context of the TREC question answering (QA), high-accuracy retrieval from documents (HARD), and spoken document retrieval (SDR) tracks (Voorhees, 2001; Allan, 2003; Voorhees and Harman, 2005). The TREC HARD track posed a passage retrieval task, where systems were evaluated in terms of their ability to rank passages with relevant content at high ranks. Most existing approaches to passage retrieval have been based on techniques previously shown to be effective for full-document retrieval using sub-document structure. The most effective approaches usually rely on fixed-length overlapping sliding windows to define the passages to be retrieved, and apply additional post-processing techniques to either improve the quality of the initial ranking of passages or adjust the passage boundaries. The approach described in (Huang et al., 2004) first ranked non-overlapping passages and then combined adjacent highly scoring passages from the same document into a single passage. After merging, the scores of the passages were also updated by summing the scores of the merged passages with that of their document. Combining document and passage level evidence has generally been found to improve passage retrieval effectiveness (Huang et al., 2004; Abdul-Jaleel et al., 2004).

While most approaches proposed for document retrieval perform fixed-length segmentation at indexing time, some researchers explored the idea of forming retrieval units of variable length dynamically, at querying time, to take advantage of the extra information from the query. One such approach was implemented in the MultiText system (Clarke et al., 2000a,b), which detected passages dynamically by identifying the shortest word sequences in a document containing all, or a subset, of the terms from the query. Documents were then scored based on the length and number of distinct sub-sequences found in the documents. Another query-dependent approach was proposed by (Mittendorf and Schäuble, 1994) for document and passage retrieval. In this approach, documents are assumed to be produced by two HMMs: one that emits words that are relevant to the query; and another one which generates words that are unrelated to the query. Documents can then be ranked by their odds of being generated by the "relevant" HMM relative to the "background" HMM. For passage retrieval, Mittendorf and Schäuble (1994) considered a sequential model resulting from the concatenation of a relevant HMM in the middle of two background HMMs. Relevant passages can then be identified by detecting fragments that are likely to be generated by the relevant state and whose neighbouring words have a high probability of being generated by the background states of the HMM. Jiang and Zhai (2004, 2006) built upon Mittendorf and Schäuble's work and experimented with improved HMM structures and with language models to estimate word-emission probabilities. Their experiments on the TREC HARD tracks showed that an HMM-based approach was effective at refining the boundaries of an initial list of pre-segmented passages and found that these adjusted boundaries correlated better with those determined by the true relevant passages. Yet another technique for constructing variable-length passages at retrieval time was investigated by Abdul-Jaleel et al. (2004), based on the locality-similarity approach previously proposed by de Kretser and Moffat (1999). In the locality-based approach, individual occurrences of query terms appearing in a document are scored according to their query and inverse document frequencies. Each term in a document is then assumed to affect the scores of its neighbouring terms falling within a pre-defined region of influence. Passages can then be determined by identifying high scoring regions of influence containing a high density of query terms. In the work of Abdul-Jaleel et al., every region of influence was treated as a possible passage to be retrieved for a query. Despite its ability to find variable-length passages dynamically, this technique did not perform better than using fixed-length overlapping passages with a standard retrieval model at TREC HARD (Abdul-Jaleel et al., 2004). Beyond the TREC HARD campaigns, Tiedemann and Mur (2008) compared the utility of various types of segmentation approaches for question answering, including TextTiling, fixed-length overlapping windows, and a segmentation method based on co-reference chains over named entities. The conclusions from this work suggest that passages based on

windowing approaches provide the highest QA effectiveness. The authors emphasised that the gains that could be achieved by using semantically motivated passages of variable length are outweighed by those of using passages of uniform length, on which standard IR models perform better. In a study similar to that of Kaszkiel and Zobel (2001), Lamprier et al. (2008) revisited this issue and showed that semantically motivated passages can be as effective as fixed-length arbitrary passages if appropriate length normalisation is applied to control for length variations. Besides the development of new passage retrieval techniques, a substantial amount of research effort has been devoted to the development of novel measures for evaluating passage retrieval effectiveness (Allan, 2001, 2004; Wade and Allan, 2005) based on relevance assessments collected for arbitrary sections of a document. Compared to the evaluation of document retrieval, evaluation of unsegmented retrieval poses several additional challenges. First, the passages retrieved for a document may not perfectly align with those that have been marked as relevant in the ground truth, but may instead overlap with them only partially, in which case it is not clear whether a retrieved passage should be considered relevant or not. A second fundamental problem is how to deal with redundant results in the ranked list, which may arise if the system under evaluation returns overlapping passages from the same document. Many aspects related to passage retrieval evaluation were later revisited in the context of XML retrieval and SCR, and are covered more extensively in Chapter 7 of this thesis.

2.4.3 XML retrieval

XML retrieval (Luk et al., 2002; Fuhr et al., 2002) refers to the task of finding relevant information from within collections of semi-structured text documents specified in the eXtensible Mark-up Language (XML). An XML document specifies a set of nodes and elements organised into a tree-like hierarchical structure. The internal elements of the tree specify structural information, while the external elements (leaves) contain the textual content. For instance, a book might be specified in XML format by a root <book> element containing one or more <chapter> elements, which in turn may contain multiple <section>

elements. At the deepest level of a book's schema, a section could contain several <paragraph> elements, each containing the actual text content of a specific paragraph. In an XML document, each internal element can be seen as a passage representing the contents of its children, so that different levels in an XML tree correspond to different levels of content granularity. The goal of XML retrieval is then to retrieve the most appropriate elements from within a collection of XML documents in order of relevance to a query. Appropriateness in this context refers to the granularity of the retrieved content. The ideal element to be retrieved for a query is the most specific element that contains just enough information to satisfy the information need of the user, without including any additional irrelevant information. The query in this case may be free text or may optionally

impose structural constraints over the type and granularity of the content being sought by the user. Much of the research in XML retrieval has been driven by shared tasks organised by the INitiative for the Evaluation of XML retrieval (INEX) (Fuhr et al., 2002; Fuhr and Lalmas, 2007). Researchers experimented with several approaches during these benchmarks, most of which attempted to extend standard IR models to consider the structural information of the documents. Carmel et al. (2003) presented an extension of the vector space model (VSM) that represents XML elements and structured queries as vectors of (tf, path) pairs, where tf refers to the frequency of a term in an element located at a given path within the element hierarchy. A modified cosine similarity function, which multiplies vector components that have similar paths, is then used to score elements against a query. The approach proposed by Gövert et al. (2002) propagates the weights assigned to terms in the leaf elements onto their parent elements. To prevent elements at higher levels in the document tree from always obtaining greater scores than their children, the propagation procedure down-weights the transferred weights at each level in the hierarchy by some pre-defined factor. Similar approaches were later proposed by other researchers, who achieved similar propagation effects with more principled techniques, most of which were based on language models (Kamps et al., 2004; Ogilvie and Callan, 2004, 2005). In particular, Ogilvie and Callan (2004, 2005) estimate a language model for each element in a document tree, which they then use to calculate the relevance score of an element as the probability that its language model generates the query. The language model of an element is estimated via linear interpolation of language models obtained from the text of the element itself, its children, its parent, its document, and the document collection. In general, techniques that exploited evidence from multiple levels of content granularity performed best at the INEX benchmarks. This result is in line with previous findings in document and passage retrieval research, which showed that using evidence from both documents and passages improved retrieval effectiveness for both tasks. Kekäläinen et al. (2009) and Arvola et al. (2011) re-branded this set of techniques as "contextualisation" approaches, to emphasise the fact that elements can be ranked more effectively when considered within the context of their container (parent) or neighbouring (sibling) elements. Arvola et al. (2011) analysed the effects of different contextualisation approaches on retrieving relevant elements at three predefined levels of content granularity: paragraphs, subsections, and sections. The results of their experiments demonstrated that vertical contextualisation (parents/children) as well as horizontal contextualisation (siblings) can improve the retrieval of elements at any of these granularity levels.
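
The downward propagation of term weights through an XML tree, in the spirit of Gövert et al. (2002), can be sketched as follows; the decay factor, the element representation, and the toy book structure are illustrative assumptions rather than details of the published method.

class XmlElement:
    def __init__(self, name, own_score=0.0, children=None):
        self.name = name
        self.own_score = own_score          # score derived from the element's own text, if any
        self.children = children or []

def propagate_scores(element, decay=0.5):
    """Return the element's combined score: its own score plus the scores of its
    children, down-weighted by `decay` at each level of the hierarchy."""
    child_total = sum(propagate_scores(c, decay) for c in element.children)
    return element.own_score + decay * child_total

# Toy book: two sections whose paragraphs carry leaf-level scores.
book = XmlElement("book", children=[
    XmlElement("section", children=[XmlElement("paragraph", 0.9), XmlElement("paragraph", 0.1)]),
    XmlElement("section", children=[XmlElement("paragraph", 0.4)]),
])
print(propagate_scores(book))   # 0.5 * (0.5 * 1.0) + 0.5 * (0.5 * 0.4) = 0.35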

2.5 Summary

This chapter reviewed fundamental technologies in SCR. IR provides methods for indexing and searching relevant material within large collections of text documents. ASR provides the tools needed to convert speech to text. Content structuring provides approaches to derive from long multi-topical documents small text fragments that are more appropriate as retrieval units. Two major frameworks for ranked retrieval were presented in detail. The vector space model (VSM) represents text as vectors in a vector space, where components correspond to individual terms and weights are derived from the product of within-document and inverse document term frequencies. The VSM ranks documents based on their cosine similarity to the query in the vector space. In the probabilistic framework for IR, documents are ranked based on their probability of being relevant to the query. Various models have been developed to estimate these probabilities. The binary independence model (BIM) considers the presence and absence of terms and assumes term independence. The state-of-the-art Okapi BM25 model extends this to consider within-document and within-query term frequencies and applies a length normalisation mechanism. Several measures have been developed for evaluating the quality of a ranked list of search results, including average precision (AP), normalised discounted cumulative gain (nDCG), rank-biased precision (RBP), and expected reciprocal rank (ERR). A recent trend in IR research is to interpret (and develop) evaluation measures as models of user behaviour. The most effective speech recognition systems use statistical models to identify speech sounds in the speech signal and to represent valid word combinations in a language. Speech is processed by a sequence of components. The front-end component converts speech into a sequence of spectral feature vectors. The acoustic model (AM) component treats these vectors as the observations of a hidden Markov model (HMM), used to estimate probabilities of phone sequences, while the language model estimates probabilities of word sequences. Lastly, a decoding algorithm searches the vast space of possible word sequences for the most likely words spoken, and represents the output as an N-best list of hypotheses, a lattice, or a confusion network. Several content structuring methods were reviewed for automatically segmenting a piece of text into topically homogeneous units. Windowing approaches divide the text into arbitrary fragments. Other techniques attempt to detect topic shifts in the text by finding boundaries that maximise intra-segment cohesion and minimise inter-segment similarity. TextTiling (TT) uses a VSM to compute similarity scores between adjacent passages and then finds local minima along the document. C99 improves upon TT by considering relative ranks instead of raw cosine scores and by optimising a global cost function. Utiyama and Isahara (UI) proposed a probabilistic approach to segmentation which was later given a Bayesian formulation in the BayesSeg algorithm. The Minimum Cut algorithm casts the problem as one of finding an optimal partition of a graph that

minimises the normalised-cut criterion. Content structuring methods have also been applied to speech transcripts, either in isolation or in combination with acoustic/prosodic features that are indicative of topic shifts in spoken content. Lexical and acoustic features are then used to train machine learning models to predict the locations of possible topic boundaries. This chapter also reviewed previous research on exploiting sub-document structure for improving the retrieval quality of documents, passages, and XML elements. Although methods based on lexical cohesion are able to produce more topically cohesive segments, arbitrary overlapping passages have commonly been found more effective when used as evidence of relevance in the ranking of documents and passages. The main reason attributed to this effect is that standard IR techniques are less effective at scoring elements of highly variable length, even when length normalisation mechanisms are applied. Further research in XML retrieval suggests that elements can be ranked more effectively when considered within the context of their container document and related elements.

Chapter 3

Review of SCR Research

Chapter 2 described the three fundamental technologies needed to enable SCR: automatic speech recognition (ASR), to convert speech to text; text indexing and retrieval, to provide efficient ranking of relevant text documents; and content structuring, to fragment long documents into short topically-coherent excerpts. This chapter focuses on how these three technologies have been combined in past and recent research seeking to maximise the effectiveness of SCR systems. The chapter begins with Sections 3.1 and 3.2, which review the literature on past SCR research. Throughout these sections, particular emphasis is given to research that has explored methods for handling ASR errors and for structuring spoken content to enable immediate access to relevant material. Section 3.3 discusses previous research that has attempted to exploit acoustic/prosodic information in SCR and related speech retrieval applications. This work tries to move beyond lexical-based retrieval techniques to incorporate additional acoustic/prosodic information from the speech signal. This non-lexical information has mainly been used for increasing the quality of content structuring techniques, and for identifying acoustically-emphasised keywords in speech.

3.1 Experiments with formal speech

This section reviews initial research on SCR, conducted between 1990 and 2001. This preliminary work focused on collections of formal speech, mainly voice mail and radio and TV broadcast news.

3.1.1 Early work: voice mail and private collections

In the early nineties, the widespread use of the hidden Markov model (HMM) framework (Levinson et al., 1983) for speech recognition allowed the creation of ASR systems capable of recognising words in continuous speech from a relatively small fixed vocabulary. This led to the appearance of the first commercial applications for filtering and classifying speech messages based on word-spotting techniques.

Word or keyword spotting is the task of determining whether a word from a given vocabulary is present or absent in a speech sample. Some of the earliest research in spoken document indexing was based on word-spotting techniques (Rose, 1991; Wilcox et al., 1992). The basic approach consisted of classifying speech messages into a set of topic categories based on keywords spotted in the speech stream. Identified topics could then be used for re-routing telephone calls or as indexing terms for the subsequent retrieval and organisation of voice messages. Early SCR systems based on word-spotting techniques could only process queries that resembled one of the topic categories initially provided to the word spotters at indexing time, that is, at the time when the spoken material had to be recognised. This restricted the number of possible queries that a system could handle at retrieval time, or forced systems to re-index the entire collection every time a new topic category or term was provided in a search request. Subsequent research in SCR focused on removing this practical limitation. Glavitsch and Schäuble (1992) presented the first prototype of a modern SCR system based on large-vocabulary speaker-independent continuous speech recognition (LVCSR). This method used sub-words as indexing terms and performed retrieval by matching the sub-words in the query against those recognised in the spoken documents. More importantly, this method set the basis for designing SCR systems which, more in line with conventional text retrieval techniques, permitted efficient retrieval for queries whose vocabulary did not need to be provided in advance of the indexing process. Significant contributions to the field were made in the context of the Voice Mail Retrieval (VMR) project led by researchers at Cambridge University (Jones et al., 1997). In this work the idea of vocabulary-independent word-spotting was proposed, later known as Phone Lattice Spotting (PLS) (James and Young, 1994). In PLS, a lattice of hypothesised phone transitions is generated after a first recognition pass over the speech data. The presence of arbitrary query terms can then be determined at query time by searching for the terms' phonetic transcriptions in the pre-computed phone lattices. Subsequent research at Cambridge University investigated hybrid approaches which combined word-level LVCSR with PLS to account for out-of-vocabulary (OOV) terms in the query (James, 1995, 1996; Brown et al., 1996; Jones et al., 1996). Despite recent advances in ASR technology, techniques such as PLS may still be useful for SCR, especially for low-resource languages for which there may be insufficient data to train a complete ASR system. The Informedia Project was another important research initiative at the time, focused on developing content-based retrieval techniques to support search in video collections (Wactlar et al., 1996, 1999). The system produced in the context of this project was one of the first to provide large-scale multimedia retrieval and browsing capabilities by exploiting LVCSR technology for speech indexing and visual analysis for content segmentation. During this period, the effectiveness of SCR techniques was evaluated over small

collections of privately owned speech material, typically consisting of no more than a couple of hours of radio news or voice mail messages. Most of the speech content used for evaluation was characterised as being formal, read or scripted, produced by a relatively small number of speakers, in quiet and controlled recording conditions, and recorded with good-quality devices. Standard document retrieval measures like precision and mean average precision (MAP) were generally used to evaluate the effectiveness of the SCR methods. Cross-comparisons of performance across research labs were rare during this period. It was not until the late nineties, with the first Text REtrieval Conference (TREC) spoken document retrieval (SDR) campaigns (Garofolo et al., 2000), that the research focus began shifting towards cross-lab evaluations over larger spoken collections and more challenging types of speech data.

3.1.2 Broadcast news: the TREC-SDR campaigns

The Text REtrieval Conference (TREC) is a series of workshops and shared tasks that provide a common framework for the evaluation and comparison of large-scale text retrieval experiments1. Every year, TREC organises several tasks or tracks that pose particularly interesting research problems to the IR community. Organising teams are in charge of designing the task and providing the document collection and queries to the participants. Participating teams must develop IR systems that address the task and submit their retrieval results (runs) for quality estimation. The first TREC track that focused on SCR was held in 1997 (Voorhees et al., 1997; Voorhees and Harman, 2005) as part of the TREC-6 workshop. TREC-6 SDR was a known-item retrieval task. As opposed to an ad-hoc retrieval task, where multiple documents from the collection can be relevant to a query, in a known-item task there is only one known relevant document per query. In TREC-6 SDR, systems were evaluated over a collection of 50 hours of broadcast news speech. The collection was manually pre-segmented into 1,451 news stories, each corresponding to the presentation of a single news event. The task consisted of ranking the news stories given a text query so that the single known relevant story for that query was ranked at the top. Precision-based effectiveness measures like mean reciprocal rank (MRR) were used for estimating the quality of the retrieval runs. An important conclusion drawn from TREC-6 SDR was that standard text retrieval techniques are robust to relatively high word error rates (35-40% WER) in the document transcripts, but that there is a significant decrease in retrieval effectiveness when more erroneous transcripts (above 50% WER) are used (Garofolo et al., 2000). A wide range of techniques were proposed at TREC-6 SDR to cope with the ASR errors present in the document transcripts, most notably the use of word confidence scores from the ASR system to calculate expected term frequencies, the exploitation of N-best recognition hypotheses from a single or multiple ASR systems, and phonetic-based matching (Crestani et al., 1997; Siegler et al., 1997; Smeaton et al., 1997). Experiments with these techniques indicated

1 http://trec.nist.gov/

that using word confidence scores did not provide gains in retrieval effectiveness, while considering additional terms from the N-best lists or from multiple ASR outputs demonstrated potential for recovering from deletion and substitution errors. The TREC-7 SDR track ran a year after TREC-6. In contrast to TREC-6, TREC-7 SDR posed an ad-hoc retrieval task over a larger document collection comprising 87 hours of broadcast news speech divided into 2,866 news stories (Garofolo et al., 1998). One of the main focuses of study in TREC-7 SDR was the correlation between ASR errors and retrieval effectiveness. Analysis of the results submitted at this track showed that there is a negative, albeit gentle, linear correlation between ASR errors and retrieval effectiveness. In fact, WER was found to negatively correlate with retrieval effectiveness, but an even stronger negative correlation was found when WER was restricted to named entities (Garofolo et al., 1998), which were commonly present in the queries. This observation indicated that the misrecognition of highly informative terms, like proper names, had a major impact on retrieval effectiveness. A large number of groups investigated document and query expansion techniques based on pseudo-relevance feedback (PRF) to cope with ASR errors in the broadcast transcripts. Expansion techniques were generally found effective at reducing the performance gap between retrieval from automatic and perfect transcripts, especially when expansion terms were extracted from external in-domain resources (Singhal et al., 1999; Singhal and Pereira, 1999). In its simplest form, this document expansion technique consisted of augmenting the document transcripts with topically-related terms extracted from an external collection that is free from transcription errors and that contains similar topics. To find topically-related terms for a spoken document, a text query is first constructed by selecting terms from the document transcript. This query is then used to rank the documents of the external corpus in order of relevance. Terms are selected from the top K ranked documents in the external corpus and used to expand the contents of the target document transcript. This expansion approach was found to be effective at reducing the number of term mismatches between the query and the document transcripts of the TREC-7 SDR collection, since it helped to recover important terms that might have been deleted or substituted during the ASR process. Besides document expansion, researchers experimented with similar strategies to augment the query text with terms extracted from the collection of noisy transcripts or from parallel corpora (Abberley et al., 1998). Owing to their demonstrated effectiveness, expansion techniques were regularly applied by research groups in subsequent editions of the TREC SDR tracks (Gauvain et al., 1999; Johnson et al., 2000; Renals and Abberley, 2000). In the TREC-8 and TREC-9 SDR tracks, the Topic Detection and Tracking corpus (TDT-2) of broadcast news (Cieri et al., 1999) was used as the target collection for retrieval. This corpus contains 557 hours of speech content from 21,754 TV and radio broadcast news stories, transcribed with WERs ranging from 20-30%. TREC-9 introduced a new ad-hoc retrieval condition that required systems to retrieve relevant news stories with no prior

In the TREC-8 and TREC-9 SDR tracks, the Topic Detection and Tracking corpus (TDT-2) of broadcast news (Cieri et al., 1999) was used as the target collection for retrieval. This corpus contains 557 hours of speech content from 21,754 TV and radio broadcast news stories, transcribed with WERs ranging from 20% to 30%. TREC-9 introduced a new ad-hoc retrieval condition that required systems to retrieve relevant news stories with no prior knowledge of the exact location of story boundaries. In this case, systems were required to retrieve jump-in or playback time-points falling within the boundaries of a relevant story. This criterion was used in precision-based effectiveness measures to determine whether a given search result should be treated as relevant in the measure calculation. In order to avoid rewarding systems for returning near-duplicates in the result lists, which would occur when multiple results pointed to the same relevant story, the evaluation procedure discarded any returned jump-in points falling within the same relevant story as some other better-ranked result.

Participating teams in the unknown-boundary condition at TREC-8 and TREC-9 SDR experimented with different content structuring techniques for dividing a spoken document into multiple retrieval units, as well as segment consolidation techniques for removing near-duplicate results from the ranked lists. A simple and commonly adopted structuring approach based on the work of Hearst and Plaunt (1993), previously applied to SCR by Brown et al. (1995), consists of slicing a document transcript into a sequence of overlapping windows (Smeaton et al., 1997; Abberley et al., 1999b; Johnson et al., 2000; Renals and Abberley, 2000). The extracted windows were then ranked in order of relevance to the search query and their starting time-points retrieved as the jump-in points to be inspected for relevance. Various segment consolidation strategies were proposed to avoid returning near-duplicates at top positions in the rankings. Johnson et al. (2000) proposed to filter out lower-ranked near-duplicate segments that overlapped with a higher-ranked one in the results list, while Abberley et al. (1999b) and Renals and Abberley (2000) adopted a recombination strategy, whereby segments overlapping in time were merged into a single one if their ranks lay within some fixed distance r. This recombination process was carried out for a number of iterations, shrinking the value of r each time, until no more segments could be merged.

Abberley et al. also experimented with alternative functions to re-estimate the relevance score of merged results, including taking the maximum score across all overlapping segments, re-calculating the relevance status value for the combined segment, and using an average score that corrected for the segments' length and amount of overlap. Among these, taking the maximum score resulted in increased retrieval effectiveness over the rest. In related work, Abberley et al. (1999a) compared time-based versus word-count-based sliding windows for content segmentation and concluded that the two approaches provide similar levels of retrieval effectiveness. Additionally, Abberley et al. (1999a) and Quinn and Smeaton (1999) explored the effects of varying the size and amount of overlap between adjacent windows and concluded that short windows of 30 seconds with a 33-50% overlap performed best on the TREC-SDR tasks. Gauvain et al. (2000) analysed the length distribution of relevant news stories and hypothesised that a two-level windowing approach that simultaneously targets short and long stories could benefit retrieval. Experiments with this multi-level approach showed minor gains in retrieval effectiveness over a single-level windowing strategy.
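The sliding-window segmentation and near-duplicate filtering described above can be sketched as follows. This is an illustrative simplification, assuming time-based windows and a greedy overlap filter; it is not the exact procedure used by any of the cited systems, and the window and step lengths are placeholder values.

    def sliding_windows(doc_id, timed_words, win_len=30.0, step=15.0):
        """timed_words: list of (term, start_time_in_seconds) pairs."""
        windows = []
        if not timed_words:
            return windows
        last_time = timed_words[-1][1]
        t = 0.0
        while t <= last_time:
            terms = [w for w, s in timed_words if t <= s < t + win_len]
            windows.append({"doc": doc_id, "start": t, "end": t + win_len,
                            "terms": terms})
            t += step
        return windows

    def filter_near_duplicates(ranked_windows):
        """Keep a window only if no better-ranked window from the same document
        overlaps it in time (ranked_windows: sorted best score first)."""
        kept = []
        for win in ranked_windows:
            overlaps = any(k["doc"] == win["doc"]
                           and k["start"] < win["end"]
                           and win["start"] < k["end"]
                           for k in kept)
            if not overlaps:
                kept.append(win)
        return kept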

In addition to windowing approaches, Johnson et al. (1999b) experimented with the TextTiling segmentation algorithm (Hearst, 1997), but found this approach less effective in practice than using fixed-length overlapping windows.

Evaluation results of the submitted runs at TREC-8 and TREC-9 SDR provided further evidence that text retrieval techniques are fairly robust to high WER conditions. Later analysis showed that this effect could be attributed to the characteristics of broadcast news speech, in which topically important keywords are mentioned multiple times during the coverage of a news story. Results also showed that the unknown-boundary condition significantly increased the difficulty of the retrieval task: retrieval effectiveness was on average 20% lower when story boundaries were unknown to the retrieval systems. This period ended with the publication of Garofolo et al. (2000), which stated that SDR was a solved problem and that research should shift focus towards more challenging tasks such as question answering, spoken queries, video retrieval, or the exploitation of paralinguistic information to improve the navigation of spoken documents. This hasty conclusion was primarily driven by the following facts: LVCSR systems could produce 1-best transcripts with relatively low WER (circa 20%); text retrieval techniques were robust to relatively high WER conditions; and speech recognition errors could to some extent be alleviated using expansion techniques, which were even seen to produce performance comparable to that obtained with perfect transcripts (Johnson et al., 1999b; Singhal and Pereira, 1999; Woodland et al., 2000). Although these are valid conclusions, the nature of broadcast news speech facilitated the recognition and retrieval of relevant content, and masked other issues that were encountered in later experiments with less formal speech material.

3.2 Experiments with conversational spontaneous speech

Allan (2001) argued that there was still room for research in SCR, since TREC SDR had mainly focused on long documents and long queries for which standard retrieval techniques were not dramatically affected by speech recognition errors. Instead, it was proposed that research should focus on short or spoken queries, or on tasks like question answering, where the boundaries of the ideal passages containing an answer are unknown and passages may not contain enough terms to compensate well for ASR errors. Furthermore, Allan pointed out the importance of using non-linguistic information and of moving beyond scripted speech to less formal, spontaneous conversational speech.

Spontaneous speech does not present the same characteristics as the speech found in TV and radio broadcasts (Ward, 1989). Spontaneous speech usually contains disfluencies, such as filled pauses, repetitions, repairs, and false starts. Other important differences include the presence of ungrammatical constructions or ill-formed sentences and the frequent use of ellipsis and interjections. Vocabulary usage also differs significantly depending on the level of spontaneity.

While formal speech usually contains more content-bearing words that may describe the central topic of a conversation more precisely, casual speech contains more words that provide an implicit and inexact description of the main topic (Larson and Jones, 2012a). Furthermore, topical boundaries are less clearly specified in spontaneous conversational discourse, where rhetorical topics are common. Further complications are present in speech that contains multi-party dialogues, utterances from non-native speakers, or background noise or music, or that is recorded in poor acoustic conditions.

Subsequent research on SCR turned to more challenging speech collections containing a higher degree of spontaneity than broadcast news, focusing on retrieval from collections of interviews, lectures, meetings, academic talks, and TV content.

3.2.1 Interviews: the CLEF-SR campaigns

The Conference and Labs of the Evaluation Forum (CLEF)2, formerly the Cross-Language Evaluation Forum, organised a cross-language speech retrieval (CL-SR) task from 2005 to 2007 over a collection of spontaneous conversational speech, consisting of interviews in English and Czech with survivors of the Holocaust (White et al., 2005; Oard et al., 2006; Pecina et al., 2007a).

In CL-SR 2005 (White et al., 2005), topically coherent segments of speech were manually labelled by subject matter experts for each interview and the task was designed as a known-boundary retrieval task. The document collection comprised 589 hours of speech divided into 8,104 topically homogeneous segments. The collection was automatically transcribed with a WER of approximately 38%, showing evidence that ASR of interview speech was more difficult than the recognition of broadcast news. Besides ASR transcripts, metadata about the interviews, including summaries, lists of keywords, and mentions of important people, was manually annotated for each interview and made available to the task participants.

Overall, the evaluation results of CL-SR 2005 showed that retrieval of interview segments was substantially more difficult than the retrieval of news stories in the previous TREC SDR tasks (Wang and Oard, 2005). Researchers hypothesised that this was due to important keywords and named entities being misrecognised in the ASR transcripts or not even spoken by the participants in the interviews. Since informative, topically related words were frequently not present in the transcripts, systems had to rely on the manually generated metadata to maximise retrieval performance. Even when exploiting metadata, the performance of the SCR systems was considerably lower than in the previous experiments with broadcast news speech.

CL-SR 2006 and 2007 included an unknown-boundary condition in which systems were required to produce a ranked list of starting-point suggestions instead of manually predefined segments (Oard et al., 2006). For these editions of the CL-SR task, a collection of interviews in Czech was used.

2http://www.clef-initiative.eu/

As a baseline collection of segments, the document transcripts were automatically segmented into overlapping windows of 3 minutes in length with 2 minutes of overlap. To measure the quality of a ranked list of entry points returned by a system, an adaptation of generalised average precision (gAP) was used (Liu and Oard, 2006). Recall from Section 2.1.3 that gAP is an extension of average precision (AP) to graded relevance assessments. In the CL-SR tasks, the relevance grade of a retrieved entry point is a continuous value that depends on the temporal distance between this point and an ideal jump-in point indicating the beginning of a relevant speech fragment.

Evaluation results across participating teams at CL-SR 2006 and 2007 indicated that stemming is important for SCR in Czech (Pecina et al., 2007a; Levow, 2007) and, more importantly, that segmentation granularity affects retrieval performance (Ircing and Müller, 2007). Regarding the latter observation, windowing approaches that generated segments whose length corresponded better to the real length of the relevant content were found to be the most effective. In the known-boundary condition, researchers obtained improved results with techniques that combined evidence from multiple ASR transcripts and manually generated metadata, using techniques such as field weighting and XML retrieval (Oard et al., 2006; Jones et al., 2006; Hiemstra et al., 2006).
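The distance-based relevance grading used in this evaluation can be illustrated roughly as follows; the linear decay and the five-minute cut-off are assumptions made purely for the sake of the example, not the exact penalty function defined by Liu and Oard (2006).

    def entry_point_grade(returned_start, ideal_jump_in, max_distance=300.0):
        """Continuous relevance grade in [0, 1] for a returned entry point,
        decreasing with its temporal distance (in seconds) from the ideal
        jump-in point of the relevant speech fragment."""
        distance = abs(returned_start - ideal_jump_in)
        return max(0.0, 1.0 - distance / max_distance)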

3.2.2 Broadcast TV: the MediaEval campaigns

The benchmarking initiative for Multimedia Evaluation (MediaEval)3 has organised a yearly task devoted to SCR since 2010. The Rich Speech Retrieval (RSR) task at MediaEval 2011 was a known-item task that required systems to retrieve jump-in points within relevant portions of semi-professional TV shows (Larson et al., 2011; Schmiedeke et al., 2013). Participating teams experimented with different automatic segmentation methods, including windowing approaches (Wartena, 2012), segments generated by the speaker diarisation module of an ASR system (Wartena, 2012; Alink and Cornacchia, 2011; Schmidt et al., 2011; Aly et al., 2011), and the C99 (Choi, 2000), TextTiling (Hearst, 1997), and MinCut (Malioutov and Barzilay, 2006) algorithms (Eskevich and Jones, 2011a; Wartena, 2012).

The work described by Wartena (2012) compared different segmentation strategies and showed that a windowing approach with filtering of lower-ranked near-duplicates provides improved SCR performance relative to a segmentation generated by the MinCut algorithm. This work also highlighted the fact that SCR effectiveness decreases significantly when windowing segmentation is used with a window length that differs considerably from the average length of the relevant material, and suggested that topically motivated segments, such as those produced by MinCut, may require less parameter tuning than windowing approaches.

The approach described by Aly et al. (2011) performed retrieval of speech segments by considering evidence from the full contents of the document in which the segment occurs. In this work, the relevance score of a segment was obtained by linearly combining the segment's relevance score with that of its containing document. This work also proposed the selection of alternative jump-in points in the vicinity of the returned segments as a segment consolidation strategy.

3http://www.multimediaeval.org/

Similarly, Alink and Cornacchia (2011) and Schmidt et al. (2011) proposed a two-stage cascaded approach. In the first stage, the top-N documents that best match the query were ranked. In the second stage, a ranking was produced for the segments contained in the documents retrieved in the first stage. Overall, the highest SCR effectiveness in the RSR task was obtained by using windowing-based segmentation instead of text segmentation algorithms (Wartena, 2012), and by exploiting user-generated metadata (Eskevich and Jones, 2011a).

Subsequent analysis of the RSR results showed that, independently of the IR model used and irrespective of the WER of the transcripts, SCR systems were able to return relevant content at high ranks as long as they implemented a segmentation strategy that fully captured the topic of the relevant content in a single segment without including too much irrelevant material (Eskevich et al., 2012b). Thus, an ideal segmentation strategy for SCR should neither undersegment nor oversegment the relevant content or, in other words, should maximise the within-segment precision and recall of the returned segment with respect to the relevant material. This observation motivated the development of alternative evaluation measures for retrieval of unsegmented speech content based on temporal precision: mean average segment precision (MASP) (Eskevich et al., 2012c). In this family of measures, precision is estimated as the proportion of relevant content that is captured by a retrieved segment, measured in units of time, relative to the temporal length of the segment.

In subsequent years, the RSR task was renamed the Search and Hyperlinking (S&H) task. The S&H 2012 task was a known-item task that evaluated systems over an extended subset of the RSR collection with 2,125 hours of semi-professional TV content (Eskevich et al., 2012a). The best evaluation results were obtained with a combination of query expansion, expansion of segments with metadata, windowing-based segmentation, and filtering of lower-ranked overlapping results (Galuščáková and Pecina, 2012; Eskevich et al., 2013c). Subsequent iterations of the S&H task in 2013, 2014, and 2015 posed an ad-hoc retrieval task over a large document collection of circa 2,700 hours of TV broadcast from the British Broadcasting Corporation (BBC) (Eskevich et al., 2013a, 2014, 2015).

Approaches that determine the best granularity level of the segments to be retrieved at query time were proposed by Preston et al. (2013) and Schouten et al. (2013). In particular, Preston et al. (2013) evaluated a kernel density function along the timeline of a video to represent local variations of retrieval scores throughout time. The process of estimating a density function for a video consisted of clustering query terms based on their temporal distance in the video transcript. A hierarchical agglomerative clustering algorithm was used for this purpose, with clusters being merged until the time distance between the middle points of two clusters surpassed a threshold. A Gaussian function was then estimated at the centre of each of the resulting clusters, with amplitude given by the retrieval score of the cluster against the query, and width equal to 30% of the cluster's duration. The final density contour for a video was calculated by summing the Gaussians corresponding to all clusters in the video.

Finally, a ranked list of segments was constructed based on the regions delimited by the clusters and their scores. Schouten et al. (2013) proposed a similar approach in which individual density functions were estimated for each query term and then summed. Potential segment boundaries were then detected by locating valleys in the density function that are above a certain threshold.

An appealing characteristic of the approaches proposed by Preston et al. and Schouten et al. is their ability to construct variable-length segments dynamically, based on the contents of the query. Despite this theoretical advantage, these dynamic content structuring approaches performed poorly at the S&H 2013 task compared to simple windowing methods. Similarly, the approach described by Galuščáková and Pecina (2014b), which used decision tree classifiers to predict putative segment boundaries from features such as cue phrases, length of pauses, speaker diarisation boundaries, and TextTiling boundaries, did not provide clear gains in retrieval effectiveness over windowing approaches. Interestingly, Sahuguet et al. (2013) showed that segmentation based on visual scenes can perform as well as segmentation based on fixed-length windows, suggesting that a multimodal approach could perform better than one that relies solely on speech or linguistic features. The best retrieval performance at S&H 2013 was obtained by Eskevich and Jones (2013b) using fixed-length overlapping windows and adjusting the jump-in points of the retrieved segments. In the latter step, alternative jump-in points were chosen by selecting nearby speaker diarisation boundaries or pauses longer than 500 milliseconds.
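As an illustration of this family of density-based methods, the sketch below follows the general recipe attributed to Preston et al. above, but with a simplified gap-based grouping of query-term occurrence times in place of full hierarchical agglomerative clustering; the threshold, scores, and timeline are placeholder assumptions.

    import math

    def group_matches(match_times, gap_threshold=30.0):
        """Group query-term occurrence times (seconds) into clusters,
        starting a new cluster whenever the gap exceeds the threshold."""
        clusters, current = [], []
        for t in sorted(match_times):
            if current and t - current[-1] > gap_threshold:
                clusters.append(current)
                current = []
            current.append(t)
        if current:
            clusters.append(current)
        return clusters

    def density_contour(timeline, clusters, cluster_scores):
        """One Gaussian per cluster: centre = cluster midpoint, amplitude =
        the cluster's retrieval score, width = 30% of the cluster's duration;
        the final contour is the sum of all Gaussians."""
        contour = [0.0] * len(timeline)
        for cluster, score in zip(clusters, cluster_scores):
            centre = (cluster[0] + cluster[-1]) / 2.0
            sigma = max(1.0, 0.3 * (cluster[-1] - cluster[0]))
            for i, t in enumerate(timeline):
                contour[i] += score * math.exp(-((t - centre) ** 2) / (2.0 * sigma ** 2))
        return contour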

3.2.3 Lecture recordings: the NTCIR campaigns

The NII Testbeds and Community for Information access Research Project (NTCIR)4 organised the "IR for Spoken Documents" (SpokenDoc) task in 2011 (Akiba et al., 2011) and 2013 (Akiba et al., 2013a), offering SDR and SCR ad-hoc tasks over a collection of lecture recordings in Japanese. In the SpokenDoc-1 task (Akiba et al., 2011), the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000) was used as the document collection. This corpus contains 612 hours of speech recordings of academic presentations. In the SpokenDoc-2 task (Akiba et al., 2013a), a smaller corpus containing 27 hours was used and the task switched to passage retrieval instead of full SDR.

Most participating teams at the SpokenDoc benchmarks focused on techniques to reduce the impact of ASR errors on retrieval effectiveness. Notably, the work by Tsuge et al. (2011) showed that using more than one hypothesis from the ASR is sometimes beneficial for SDR, while Kaneko et al. (2011) and Akiba et al. (2013b) showed that performing matching at the subword level can help in overcoming OOV errors in the query. In the passage retrieval task, researchers compared the performance of windowing approaches (Nanjo et al., 2011) and lexical cohesion algorithms (Eskevich and Jones, 2011b, 2013a). The evaluation results from these experiments provided further evidence that windowing segmentation is more effective for SCR than semantically motivated segments obtained by automatic segmentation algorithms.

4http://research.nii.ac.jp/ntcir

Moreover, the method described by Akiba et al. (2013b), which redefined the boundaries of retrieved passages by searching for boundaries that maximise the relevance score between the query and the passage, underperformed a simpler windowing approach based on fixed-length segments.

Besides the CLEF and NTCIR campaigns, which evaluated SCR techniques over Japanese and Czech speech, other research has focused on porting traditional SCR approaches to languages other than English, including Mandarin (Chen et al., 2001), French (Guinaudeau and Hirschberg, 2011), and Spanish (Varona Fernández et al., 2011). In this regard, researchers have frequently highlighted the importance of applying language-specific text processing methods, principally tokenisation and stemming, for an effective application of IR techniques to ASR transcripts (Pecina et al., 2007b; Nanjo et al., 2014).

3.2.4 Final remarks on content structuring and ASR errors

Previous research in SCR suggests that windowing approaches can produce retrieval units that provide increased retrieval effectiveness over segmentation algorithms based on lexical cohesion. Using fixed-length overlapping segments as the basic unit of retrieval is beneficial for a number of reasons. First, IR models are known to perform better when the items to be ranked are of similar length (Singhal et al., 1996). This is because the ranking of equal-length documents removes the need to apply length normalisation mechanisms in IR, and with that the need to adjust the relevance scores of documents relative to their length. Second, the overlap introduced between consecutive windows can avoid splitting a topically consistent piece of relevant information into disjoint segments, thus reducing the chances of separating query terms that may appear in close proximity in the document. This increases the probability of capturing in a single segment term phrases that may appear in the query, and term proximity information in general, which has long been considered a useful indicator of document relevance (Büttcher et al., 2006). Third, windowing approaches are not affected by recognition errors, whereas ASR errors may have a direct impact on the quality of lexical cohesion segmentation methods. The fact that ASR errors tend to occur rather randomly across a transcript may disrupt non-random chains of related terms appearing in adjacent sentences and may in turn encourage segmentation algorithms to produce spurious breaks. In this regard, there is empirical evidence that the retrieval effectiveness associated with lexical cohesion segmentation methods degrades when the segments used for retrieval are produced from ASR output instead of perfect transcriptions (Eskevich et al., 2015).

Despite the advantages mentioned above, standard windowing approaches present some major drawbacks. Increasing the length of the extracted passages or the amount of overlap increases the possibility of fully capturing the contents of a relevant excerpt (high within-segment recall), at the expense of losing topical "focus" by increasing the amount of irrelevant content added to the passage (low within-segment precision).

A system that considers excessively long passages could also make the task of identifying an appropriate jump-in point more difficult, since there is a high risk that the beginning of a passage will be too far away from the putative relevant section to be of any use to the user. Variations in the ranking of relevant segments caused by improper segmentation were studied in (Eskevich, 2014; Eskevich et al., 2015), which concluded that retrieval techniques fail to retrieve relevant content at top ranks when such content is not fully contained within the boundaries of a segment (low within-segment recall) or when there is too much irrelevant content included in the containing segment (low within-segment precision).

Another important issue with window-based approaches is that it is unclear what the best values for the window and step lengths should be for a given set of queries and collection. Previous research has shown that the optimal values for these parameters may depend on the retrieval task and the underlying structure of the spoken material. Even if the optimal configurations are known for a collection, it is not guaranteed that the resulting window-based segmentation will perform well for all queries. In general terms, it is reasonable to think that no single static segmentation could possibly satisfy every information need that a user may have when interacting with an SCR system. The content sought by a user may well be spread over a wide array of lengths and granularity levels across the collection and be covered at different levels of detail across different documents. Query-independent structuring approaches for SCR, in which retrieval units are defined at indexing time, prior to retrieval, are thus less likely to generalise well across a diverse set of search requests. This observation also applies to content structuring approaches based on topic segmentation algorithms, despite the fact that these have been shown to produce segments that align better with the length of relevant sections (Wartena, 2012). In contrast, query-dependent or dynamic structuring approaches, such as segment recombination/merging strategies or clustering based on query term density estimation, seek to determine or refine segments at retrieval time and are thus capable of adapting search results to relevant regions of variable length.

The majority of segmentation approaches explored in SCR research produce a flat, linear, sequential structure of segments, where each segment is assumed to represent a single "topic" that may become the target of a search request. Flat structures make the assumption that topics do not have sub-topics and that, consequently, topics have a similar level of information specificity. A more realistic approach would consider a hierarchical structure of topics in which levels in the hierarchy represent different levels of topic specificity. Segments representing more specific topics could be arranged at lower levels in the hierarchy than those representing broader, more general, topics. Different levels in this hierarchy of segments could then be targeted for retrieval depending on the user's query and the characteristics of the content. Recent research in this direction includes the work on hierarchical topic segmentation carried out by Simon et al. (2015b).

Experiments conducted by Simon et al. at the Search & Hyperlinking (S&H) tasks showed that, by targeting elements located at different granularity levels in the hierarchy, an IR model could retrieve segments that were highly diverse in terms of both length and content specificity compared to other segmentation approaches. However, the effectiveness of their approach could not be properly determined at the S&H task, as the evaluation procedure tended to favour segments that were considerably longer than those returned by the hierarchical approach.

Besides an overall degradation in the segmentation quality of methods based on lexical cohesion, recognition errors have been found to affect the ranking of speech transcripts in interesting ways. In particular, Shou et al. (2003) and Sanderson and Shou (2007) observed that relevant transcripts with low WER tend to be ranked higher by standard IR models than relevant transcripts with high WER. In other words, the higher the number of ASR errors in a transcript, the lower its rank in the results list is likely to be. The main reason for this effect is that the presence of ASR errors can reduce the frequency and diversity of query-related terms in the documents, diluting and obscuring regions that would otherwise contain a high density and large number of distinct query-term occurrences (Sanderson and Shou, 2007). Later, Eskevich et al. (2015) revisited this problem and found that the effects observed by Sanderson and Shou also apply to non-relevant documents. Hence, independently of the relevance status of a document, highly noisy documents are generally ranked lower than less errorful documents.

3.3 SCR beyond lexical matching: exploiting acoustic features and prosody

Previous research in SCR has mostly focused on reducing the impact of ASR errors on retrieval performance and on reducing user auditioning effort by structuring the speech content into smaller audio excerpts to enable passage retrieval. With the exception of some content structuring approaches reviewed in Section 2.3, most SCR methods proposed in the past rely only on the lexical information recognised by the ASR system, neglecting other sources of information that are also present in the speech signal. An important, potentially useful, source of information for SCR is the prosody of the speech, which characterises variations in the way words are spoken. This section reviews past research on the use of prosodic information for SCR and other related speech processing tasks.

3.3.1 Speech Prosody

Prosodic information has been shown to be useful in various speech processing tasks, including SCR tasks (Chen et al., 2001; Guinaudeau and Hirschberg, 2011; Ward et al., 2015). This section provides general background about prosody and presents two important aspects of it, prominence and phrasing. These two facets of prosodic information could potentially be used to improve retrieval effectiveness in SCR.

63 Pitch, duration, and loudness

In linguistics, prosody is defined as the "suprasegmental" characteristics of speech (Lehiste, 1970). These are features that cannot be characterised as discrete speech units (segments), such as vowels, consonants, or syllables, but that rather occur simultaneously with them, spanning multiple units and describing their intonational and rhythmical properties. Prosody is more informally defined as the variation of pitch, duration, and loudness of the speech units across time.

The acoustic correlates of these features, which can be extracted automatically from the speech signal, are respectively the fundamental frequency (F0), duration, and signal amplitude. The fundamental frequency refers to the value of the lowest frequency component of a speech waveform, mostly influenced by the vibrations of the vocal folds. Duration is the relative length of a speech sound. Loudness is the perceived volume of a speech sound and is mostly correlated with descriptors of signal amplitude, such as energy and intensity. Apart from pitch, duration, and loudness, aspects related to voice quality, such as creaky, breathy, whispery, or lax speech, are also considered to be prosodic, with their main acoustic correlates being jitter, shimmer, and harmonic-to-noise ratio.

Prosodic features are not considered absolute characteristics of a single speech unit; rather, they describe relative differences. For instance, duration can vary depending on whether the speaker is speaking faster than usual at a particular moment in time. Prosody is used for a wide range of purposes in human-to-human communication, including disambiguation of syntactic structures, marking of contrastive emphasis or focus, indication of the speech act of an utterance, and expression of the speaker's emotions and attitudes (Wagner and Watson, 2010; Hirschberg, 2002). Two aspects of speech prosody widely studied in linguistics are prosodic prominence and prosodic phrasing.

Prosodic prominence

A speech unit (phoneme, syllable, word, etc.) is prosodically prominent or stressed when it stands out from neighbouring units by differences in pitch, duration, or loudness (Terken and Hermes, 2000). Unlike lexical stress, whose main purpose is to help listeners distinguish between the identity of different words with equal pronunciations, prosodic stress is mostly concerned with how different stress levels are assigned to the different words in an utterance, to make particular words more prominent than others.

Speakers can make a word or syllable prominent in order to perform different communicative functions in spoken language (Hirschberg, 2002; Wagner and Watson, 2010). Among these, prosodic prominence is used to convey the information structure of the discourse. This covers aspects such as focus, emphasis, contrast, givenness, and topicality (Krifka, 2008). For example, prosodic prominence can be used to alter the meaning of utterances. Consider, for instance, the utterance "I didn't use your laptop yesterday".

By stressing "I", the speaker may want to emphasise that the person who used the laptop was somebody else, while stressing "yesterday" would indicate that the speaker did use the laptop but on a different day.

Besides intent-related clarifications, prosodic prominence may also be used in a conversation to highlight words that introduce new or previously not given information (Prince, 1981; Hirschberg and Grosz, 1992; Röhr, 2013). The general trend is that words carrying new information are more likely to be accented, while words that present old or redundant information relative to the topic being discussed are more likely to be de-accented. Additionally, there is evidence that more frequent or predictable words, as well as function words and subsequent repetitions of content words, have shorter, de-emphasised pronunciations (Bell et al., 2009; Röhr, 2013). Previous research has pointed out that there are exceptions to all these trends (Hirschberg, 2002; Wagner and Watson, 2010), principally because information structure can also be conveyed by other means, for instance by the grammar and the position of words in a sentence (Terken and Hirschberg, 1994). Despite this, the idea that prosodic information may help to signal informative words is appealing for tasks such as SCR, where commonly only lexical information is used to identify words that are descriptive of the topic of a document.

Prosodic phrasing

Prosodic grouping or phrasing refers to the strength with which speech units are separated and to how these units are structurally organised in speech. The degree of disjuncture between speech units characterises a distinctive boundary type. A common list of boundary types in increasing order of strength is: phoneme, syllable, foot, phonological word, intermediate phrase, intonational phrase, and utterance (Selkirk, 1984). Speech units can then be arranged in a phonological hierarchy according to their prosodic boundary types, in a similar way to how words, clauses, and sentences can be arranged in a syntactic hierarchy. The acoustic features found to correlate with the strength of prosodic boundaries are excursions in F0 around the boundary, lengthening of the last syllable preceding the boundary, the presence and length of pauses, and intensity (Wagner and Watson, 2010).

In human-to-human communication, prosodic phrasing is used to disambiguate semantically ambiguous utterances in read and spontaneous speech (Lehiste, 1973; Cutler et al., 1997; Hirschberg, 2002). For instance, the utterance "When Roger leaves the house it is dark" can convey different meanings depending on where prosodic boundaries are placed. Making a pause between "leaves" and "the" would imply that the house is dark, whilst a pause between "house" and "it" would indicate that Roger left the house at night. More importantly for SCR, prosodic phrasing has been shown to be helpful in practice for identifying topic structure and sentence boundaries in spoken content (Shriberg et al., 2000; Kolář et al., 2006; Malioutov et al., 2007). Prosodic boundaries could therefore be used by an SCR system as an additional source of evidence for content structuring and to identify potential playback entry points to return to the user.

65 3.3.2 Prosody and informativeness

The relationship between the prominence of spoken words and their level of "informativeness", that is, the extent to which words are significant and descriptive of the information conveyed in speech, has been studied in previous research. This section describes previous work that has explored the correlation between prominent and informative words in the context of SCR applications, where the identification of informative words plays a major role.

Prominence and BM25 weights

Silipo and Crestani (2000) and Crestani (2001) investigated the relationship between prosodic prominence and BM25 weights of terms in the OGI Stories Corpus of telephone conversations (Muthusamy et al., 1992). In this study, the authors utilised 144 telephone calls, each containing roughly 60 seconds of spontaneous speech produced by speakers of American English. Two trained linguists labelled every spoken syllable in the recordings as containing a primary prominence stress, an intermediate stress, or no stress; these events were given numeric values of 1, 0.5, and 0, respectively. A stress score was then defined for a word mention as the sum of the stress scores of the word's syllables. Next, the overall stress score of a word in a call was defined as the average stress across all the occurrences of the word in the call. Averaged stress scores of words were then compared against weights computed by the Okapi BM25 function, with collection frequency estimates calculated from word occurrences across the entire corpus.

From this comparison, Silipo and Crestani found that, in general, words with high (low) BM25 weights also tend to have high (low) stress scores. The converse, that is, that highly stressed words are associated with high BM25 weights, could not be reliably determined by the authors due to the coarse granularity of the stress annotations used in the study. Despite this, Silipo and Crestani's work suggests that prosodic features may have the potential to identify acoustic "keywords" in the spoken content, and that this information could potentially be exploited in SCR to create a more effective index of spoken documents.

In the context of IR, a term is considered important or informative in a document if the term is significantly associated with the topic of the document, and if it is effective in discriminating this topic from others. If it is true that the most prominent spoken terms are those that best describe the topic in discourse, an SCR system could exploit this fact to generate better estimates of the weight that a term is given for a document. In other words, terms that are made prominent could be considered more representative of the topic of the document, and hence given increased weights in the relevance scoring process.
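The stress-score computation used in this study can be paraphrased in a few lines; the syllable label names and the representation of a call as a list of mentions are illustrative assumptions rather than the original annotation format.

    from collections import defaultdict

    STRESS_VALUE = {"primary": 1.0, "intermediate": 0.5, "none": 0.0}

    def mention_stress(syllable_labels):
        """Stress score of one word mention: sum of its syllable stress values."""
        return sum(STRESS_VALUE[label] for label in syllable_labels)

    def average_word_stress(call):
        """call: list of (word, syllable_labels) mentions in one telephone call.
        Returns the average stress score of each word across its mentions,
        which was then compared against the word's Okapi BM25 weight."""
        totals, counts = defaultdict(float), defaultdict(int)
        for word, syllables in call:
            totals[word] += mention_stress(syllables)
            counts[word] += 1
        return {w: totals[w] / counts[w] for w in totals}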

66 Prominence and word importance

In a more recent study, Ward and Richart-Ruiz (2013) investigated the correlation between manual annotations of importance and prosodic features in a subset of 100 minutes from the Switchboard corpus of telephone conversations (Godfrey et al., 1992). In this study, annotators were asked to select and label short speech intervals according to a 5-point scale of importance. Each interval was labelled as being highly important (5), typically important (4), less important (3, 2, and 1), or silence (0). Next, loudness, F0, F0 range, and speaking rate were extracted from the speech signal considering windows of various widths, resulting in a sequence of 78-dimensional feature vectors along the timeline of a speech recording. Principal component analysis (PCA) was then performed to map all feature vectors into a reduced-dimensional space, with the purpose of gaining further insight into the association between dialogue events and prosodic information (Ward and Vega, 2012). A linear regression model trained with this data was then used to predict levels of importance for unseen speech data. Predictions from this regression model were found to correlate well (Pearson's ρ = 0.83) with manual annotations of importance. These results provide further evidence of the apparent relationship between informative content and prosody. In particular, this study suggests that the relative importance of words in a speech stream may be predicted using a linear combination of prosodic features extracted from the speech signal.
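A minimal sketch of the modelling step (dimensionality reduction followed by linear regression) is shown below using scikit-learn, under the assumption that it is available; the random matrices stand in for the 78-dimensional prosodic feature vectors and the importance annotations, neither of which is reproduced here, and the number of principal components is a placeholder.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 78))       # stand-in for prosodic feature vectors
    y = rng.uniform(0.0, 5.0, size=500)  # stand-in for 0-5 importance annotations

    # Project the features onto principal components, then fit a linear
    # regression from the reduced space to the importance scale.
    model = make_pipeline(PCA(n_components=10), LinearRegression())
    model.fit(X, y)
    predicted_importance = model.predict(X)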

Prosody in speech summarisation

Speech summarisation is concerned with producing self-contained abstracts of spoken documents (Furui, 2007). Ideal summaries are those which retain only the important information conveyed in the original document without including redundant material. Standard approaches to speech summarisation use ASR technology to convert speech into text. The resulting speech transcript is then segmented into a collection of sentences, from which important sentences are selected to be included in the document summary. Previous work on automatic speech summarisation has made use of prosodic information for both the identification of sentence boundaries and the selection of important sentences. Chen and Withgott (1992) trained an HMM on hand-labelled data to detect emphasised speech regions based on pitch and energy features. Emphatic speech regions predicted by the model were then extracted to generate speech summaries.

In the work described in (Koumpis and Renals, 2005), important words were extracted from short voice mails by using a binary classifier trained to predict whether an individual word was worth including in the summary. The classifier was trained on a set of lexical and prosodic features for each word, including ASR confidence scores, duration, length of pauses before/after the word, energy, and F0-derived features. An analysis of the discriminating power of each individual feature indicated that lexical features were more useful than prosodic features at detecting important words, yet prosodic features were found useful when combined with lexical descriptors.

Maskey and Hirschberg (2005) used sentence classifiers instead of word classifiers to determine which sentences from the document should be included in a summary, and compared the predictive power of lexical, prosodic, structural, and discourse (givenness) features. An important conclusion drawn from this work is that a model trained with prosodic plus structural features can perform comparably to a model trained with lexical features alone, indicating that "the importance of what is said correlates with how it is said" (Maskey and Hirschberg, 2005). Again, the most effective model was obtained when the classifiers were trained on a combination of lexical and prosodic features.

In similar work, Xie et al. (2009) investigated the utility of prosodic features for automatic speech summarisation of meeting recordings. The summarisation approach adopted in this case consisted of classifying individual sentences, each represented by a vector of lexical and speaker-normalised acoustic features. Xie et al. concluded that models trained with prosodic features outperformed models trained with lexical features only. Additionally, normalisation of acoustic features based on speaker, topic, and local context proved to be more effective than using raw, unnormalised acoustic features. In a more recent study, Jauhar et al. (2013) reported that a random-walk-based approach can produce better summaries of academic meetings when using prosodic features than when using only lexical information.

Overall, besides their demonstrated effectiveness in topic segmentation, prosodic features have been found valuable for detecting important words or sentences in spoken content. These findings are consistent with previous studies of the relationship between prominent and informative words, suggesting that speakers tend to characterise content-bearing keywords with particular acoustic patterns.

3.3.3 Prosody and ASR errors

Goldwater et al. (2010) conducted a major empirical study that investigated what kinds of features characterise spoken words that are hard for an ASR system to recognise. In this study, two state-of-the-art LVCSR systems were used to transcribe a corpus of conversational speech. Prosodic, lexical, and disfluency features were then extracted for each individual word in the corpus with the goal of analysing their influence on WER. WER was calculated for a particular group of words of interest by counting the number of times that each word in the group contributed to an ASR error. Statistical analysis performed with this data indicated that words pronounced with atypical prosody are more likely to be misrecognised. The analysis showed that words pronounced with extreme intensity and pitch values, that is, extremely high or low intensity and pitch values, are associated with high error rates. Interestingly, words with a wide intensity range were associated with low WER, while those with a large pitch range were predictive of high WER. Furthermore, words with lower than average duration or with extremely long duration were correlated with high WER.

Other prosodic factors that were analysed included speech rate and jitter; both were found to be correlated with high WER at extremely high or low values of the features. Among the various lexical factors analysed in the study, it was found that content words and frequent words are easier to recognise than words that are rare and less predictable. In a follow-up study, Stoyanchev et al. (2012) trained binary classifiers with lexical and prosodic features to detect misrecognised words in transcripts of dialogue speech. They showed that detection accuracy can be significantly improved if prosodic features are used in combination with lexical features and confidence scores calculated by the ASR.

The research findings described above suggest that prosodic information can be used to predict regions of speech that are likely to be misrecognised by the ASR. This information could potentially be exploited in SCR to decide when it is worth using alternative hypotheses from the ASR or when it is worth selecting alternative word hypotheses from the ASR lattice. This idea was partially explored by Stoyanchev et al. (2012) in a spoken term detection (STD) task, which consists of identifying where and when a given term is mentioned in a spoken document. Most STD approaches try to exploit the multiple hypotheses generated by the ASR by traversing the ASR lattices. In this study, duration, pitch, and intensity features were found to provide considerable improvements in performance on the STD task.

3.3.4 Previous attempts to use prosody in SCR

Following on from Silipo and Crestani's (2000) findings, Chen et al. (2001) and Guinaudeau and Hirschberg (2011) experimented with various methods that sought to exploit the prosodic prominence of spoken terms to improve retrieval effectiveness in different speech retrieval tasks. These methods follow a similar approach, which can be summarised as follows. First, every term mention within a document is assigned a prominence/acoustic score. This score is generated from a combination of prosodic features extracted from the speech signal, and is assumed to reflect the degree of relative salience of the term mention in the context where it is pronounced. Among the features used for this purpose were signal magnitude and other correlates of intensity (Chen et al., 2001; Guinaudeau and Hirschberg, 2011), pitch (Guinaudeau and Hirschberg, 2011), duration (Chen et al., 2001), and ASR confidence scores (Chen et al., 2001). Prominence scores were computed in an unsupervised ad-hoc fashion, by making assumptions about how features should characterise prominent terms, such as "terms pronounced louder and with an expanded pitch range are prominent" (Guinaudeau and Hirschberg, 2011) or "terms pronounced louder, longer, and clearer are prominent" (Chen et al., 2001). Alternatively, the scores can be learnt from speech corpora annotated with levels of prominence by using supervised learning techniques (Guinaudeau and Hirschberg, 2011; Mishra et al., 2012; Christodoulides and Avanzi, 2014). In addition, the acoustic scores from multiple mentions of a term in a document may optionally be combined into a single score representing how prominent the term is for the document. To achieve this, Guinaudeau and Hirschberg averaged the scores across all term occurrences in a document or, alternatively, retained their maximum value.

Second, the combined or individual prominence scores were incorporated into the computation of the terms' TF-IDF weights within a vector space model (VSM) for text retrieval. Different term weighting schemes were proposed. Chen et al. (2001) computed term weights with a standard IDF factor and a modified term frequency factor in which the term counts in a document were replaced by the sum of the term's acoustic scores. Instead, Guinaudeau and Hirschberg (2011) calculated the prosodic-based weights as the weighted sum of a standard TF-IDF weight and the acoustic score associated with the term. In both variations of the weighting schemes, the objective was to increase the weight of terms that are prosodically prominent in the spoken document. Terms that are given increased weight in a document contribute more significantly to the document's relevance score, thereby promoting the final rank of this document in the result list.

Chen et al. (2001) carried out SDR experiments with prosodically-enhanced weights over a Mandarin Chinese subset of broadcast news speech recordings from the Topic Detection and Tracking corpus (TDT-2 and TDT-3). In these experiments, newswire articles in Mandarin Chinese were used as queries, so the task was more akin to a query-by-example task, where queries are much longer than in conventional SDR tasks. Retrieval experiments using the prosodically-enhanced weighting scheme showed small, although not statistically significant, improvements over purely lexical-based weights.

In the experiments conducted by Guinaudeau and Hirschberg (2011), prosodically-motivated weights were evaluated in the context of a topic tracking task over a collection of broadcast news recordings in French. In this context, topic tracking refers to the task of finding links between speech segments that describe similar information, where the similarity between segments is usually computed with text-based retrieval models. Because Guinaudeau and Hirschberg used speech segments as queries, their experiments with prosodic-based weights were also akin to a query-by-example task. A comparison of two methods for computing prominence scores was made. Scores computed in an unsupervised ad-hoc fashion provided larger improvements than scores predicted by a supervised model trained with annotations of prosodic prominence. However, both types of acoustic scores provided significant improvements in the topic tracking task when combined with lexical-based TF-IDF weights.

In addition to the above studies, Ward et al. (2015) and Galuščáková and Pecina (2014a) tried to improve retrieval effectiveness in query-by-example tasks by finding spoken segments whose prosody is similar to that of a spoken query. These approaches differ in purpose from the prosodically-enhanced weighting schemes proposed by Chen et al. and by Guinaudeau and Hirschberg. While Chen et al.'s and Guinaudeau and Hirschberg's methods seek to use prosodic information to enhance the retrieval of spoken content that is topically related to the query, approaches based on prosodic similarity seek to retrieve speech content in which speakers show similar emotions, attitudes, or intents.
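The two families of weighting schemes can be contrasted with the rough sketch below. The normalisations, smoothing, and score-combination details of the cited work are omitted, and the document representation (a list of (term, acoustic_score) mentions) and the parameter alpha are assumptions introduced for illustration only.

    import math

    def idf(term, docs):
        """Smoothed inverse document frequency over a collection of documents."""
        n = sum(1 for d in docs if any(t == term for t, _ in d["mentions"]))
        return math.log((len(docs) + 1) / (n + 1))

    def chen_style_weight(term, doc, docs):
        """Chen et al. (2001)-style weight: the raw term count is replaced by
        the sum of the acoustic scores of the term's mentions in the document."""
        acoustic_tf = sum(score for t, score in doc["mentions"] if t == term)
        return acoustic_tf * idf(term, docs)

    def gh_style_weight(term, doc, docs, alpha=0.5):
        """Guinaudeau and Hirschberg (2011)-style weight: a weighted sum of a
        standard TF-IDF weight and the term's combined prominence score
        (here the maximum over mentions; the mean is an alternative)."""
        scores = [score for t, score in doc["mentions"] if t == term]
        tf = len(scores)
        prominence = max(scores) if scores else 0.0
        return alpha * tf * idf(term, docs) + (1.0 - alpha) * prominence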

3.4 Summary

This chapter reviewed previous research in SCR. Emphasis was given to techniques that attempt to tackle the main challenges of the field: the presence of ASR errors in the documents, the structuring of speech content to reduce access time to relevant information, and the exploitation of acoustic/prosodic information to improve SCR components.

Early work in the field explored the use of word spotting techniques to search for query keywords in small collections of voice mails and private material. Large vocabulary continuous speech recognition (LVCSR) systems enabled the indexing of speech collections with a large number of indexing terms, which removed the limitations of previous approaches based on word spotting and permitted the retrieval of speech content for an unrestricted set of query terms.

The TREC SDR benchmarks provided researchers with a common framework for evaluating SCR techniques on large collections of radio and TV broadcast news. Research within TREC SDR studied the robustness of existing text retrieval techniques when applied to retrieving relevant news stories from noisy speech transcripts. The main conclusion drawn from the TREC SDR campaigns was that query and document expansion techniques, particularly when selecting expansion terms from an external source of text data, helped reduce the negative effects that ASR errors have on retrieval effectiveness and could significantly reduce the performance gap between using perfect and noisy transcripts. In the last editions of these benchmarks, researchers experimented with various content structuring techniques, including overlapping sliding windows and Hearst's TextTiling algorithm, and observed that the former is more effective for SCR.

Subsequent research in SCR investigated the retrieval of short excerpts of spoken content from conversational spontaneous speech collections, which were shown to be substantially more difficult than broadcast news speech for both recognition and retrieval. Research in this period was driven by various evaluation campaigns that benchmarked SCR systems over different speech types, genres, and domains: interviews in the CLEF CL-SR campaigns; internet and broadcast TV content in the MediaEval campaigns; and academic lectures in the NTCIR initiatives. Researchers evaluated several SCR techniques during this period. Notable work includes the comparison between windowing and automatic text segmentation for content structuring, post-retrieval adjustment of jump-in points, multi-field representations of documents to exploit additional document metadata, passage re-scoring based on document-level relevance scores, and relevance density estimation based on query term proximity.

A frequent observation within previous work in SCR is that simple windowing approaches, in combination with passage recombination or filtering to remove near-duplicates, provide increased retrieval effectiveness compared to passages defined via text segmentation methods such as C99, TextTiling, or MinCut.

Other, more sophisticated segmentation approaches that do not use a pre-segmented collection and instead perform segmentation at query time, or that seek to predict putative topic boundaries using machine learning techniques, have generally performed poorly compared to windowing approaches.

In addition to content structuring and expansion techniques, a fair amount of research has explored the use of acoustic/prosodic information to improve the overall quality of SCR systems. Besides observed improvements in topic segmentation quality, prosodic information has been used in the past to identify emphatic words in spoken documents and to detect words that are likely to be misrecognised by ASR systems. In relation to the former, much of the work done has studied the relationship between prosodically emphasised words and their level of importance or informativeness in spoken documents. This work has mainly been motivated by research in linguistics and speech prosody, which suggests that important words that are new, focused, and/or unpredictable are more likely to be accented than others. Research on speech summarisation, together with additional studies of the relationship between acoustic stress and degrees of word importance, suggests that prosodic information may encode meaningful cues that could potentially be used to characterise words by distinct degrees of informativeness. Based on these observations, researchers have proposed and tested alternative term weighting schemes for SCR that increase the score of words that are made prominent in speech. These experiments showed mixed results in two query-by-example search tasks over broadcast news speech in French and Chinese, and it is therefore unclear whether prosodically-motivated term weights could be effectively used in SCR to enhance the quality of search results in other speech genres and languages.

As in most of the empirical research reviewed in this chapter, test collections consisting of spoken documents, queries, and relevance assessments were used for the experimental work of this thesis. The next chapter describes these test collections in more detail.

Chapter 4

Materials and Test Collections

Advances in SCR research would not have been possible without extensive and rigorous experimentation. Most of the research presented in Chapter 3 reflects this and shows that SCR research, and IR research in general, is mainly empirical in nature. An essential component of empirical research is datasets in which certain aspects of interest are observed and quantified for the purpose of validating one or more research hypotheses. In IR research, the use of a "test collection" has become standard practice for validating the effectiveness of new methods and making comparisons against established techniques. A test collection is often described as comprising three key elements: a set of documents, a set of queries representing information needs, and relevance data for pairs of queries and documents.

This chapter presents the datasets and test collections used in the experimental work of this PhD. Each test collection is characterised in terms of its constituent documents, queries, and relevance assessments. Additionally, the description of each test collection includes an overview of the various manual and automatic transcripts available for each speech collection. Two speech collections were used in this PhD: the BBC collection of TV programmes (Eskevich et al., 2013b, 2014) and the Spoken Document Processing Workshop (SDPWS) collection of academic presentations (Akiba et al., 2008).

The BBC collection contains audiovisual material from TV shows broadcast by the British Broadcasting Corporation (BBC). Most of the material in this collection contains professional multi-party speech produced by native speakers of English. Although most of its speech content is scripted, the high diversity of the BBC material, which includes movies, TV series, documentaries, broadcast news, talk shows, and sports events, makes it a challenging test collection for SCR. The SDPWS collection contains academic presentations produced by native speakers of Japanese. The monologues in this collection cover a range of scientific topics, including subjects in computer science and speech technology. Because of the nature of its spoken content, the material from the SDPWS collection can be considered more spontaneous and somewhat more homogeneous in terms of domain and genre compared to the BBC material.

The remainder of this chapter describes the BBC and SDPWS collections in more detail, in Sections 4.1 and 4.2 respectively.

4.1 The BBC collection of TV content

The BBC collection contains recordings of TV programmes in English broadcast for the UK audience in mid-2008. Much of this material was originally compiled to support research on video access and retrieval applications as part of the “Access to Audiovisual Archives” (AXES) project1. Although the complete dataset is not publicly available, various subsets of the collection were distributed to the participants of the MediaEval Search and Hyperlinking (SH) and Search and Anchoring in Video Archives (SAVA) tasks (Eskevich et al., 2013a, 2014, 2015).

4.1.1 Overview

The BBC collection consists of 5,843 recordings of TV shows, comprising a total of 4,322 hours of audiovisual material. For each programme recording, separate streams of video and audio are available, plus additional data in text format, including titles, descriptions, synopses, cast, subtitles, and various automatic transcripts produced by different ASR systems. Shots from commercials and breaks were removed from the recordings by the providers of the BBC data and are thus not present in any of the videos from the dataset. The programmes in the collection include multiple episodes of 872 shows broadcast on the channels BBC One, BBC Two, BBC Three, and BBC Four between April and July 2008. The collection thus comprises shows from a wide variety of formats and genres. The genres include news, series, soap operas, talk shows, reality shows, game shows, interviews, documentaries, sport events, comedy, cookery, cartoons, and films. The speech material is therefore highly diverse in terms of style, register, domain, background noise, and number of speakers who produced it. The 5,843 recordings are split into two collections covering two time periods of TV broadcast across the four channels of the BBC. The first of these (BBC1) contains 2,323 files and was used as training and testing data in the Search & Hyperlinking 2013 (SH13) task. Among these, there are 463 duplicate files corresponding to a subset of the TV shows that were re-broadcast by the BBC in the time period when the recordings were collected. In the Search & Hyperlinking 2014 (SH14) and Search and Anchoring in AudioVisual Archives (SAVA) tasks, the BBC1 collection was cleaned of duplicates and used as training data, while the remaining 3,520 videos (BBC2) were used as testing data. Table 4.1 provides general statistics about the recordings from the BBC1 and BBC2 collections. As the table shows, the programmes are not only diverse in terms of genre and domain, but also vary substantially in terms of duration.

1http://www.axes-project.eu/

Dataset   Documents   Total      Avg.     S.D.     Min       Max
BBC1      1860        1335 hrs   43 min   39 min   44 secs   6 hrs 23 min
BBC2      3520        2648 hrs   45 min   43 min   3 min     10 hrs 35 min

Table 4.1: Duration statistics of videos in the BBC collection after removing duplicates.


Figure 4.1: Distribution of video durations in the BBC collection.

Programmes presenting news and weather highlights can last 3 minutes or less, while shows covering the results of an election or certain sports events can easily extend up to 6 or 10 hours. Figure 4.1 shows the distribution of programmes by duration in the dataset. As is standard in TV broadcasts, most programmes are 30 or 60 minutes long. The remainder of this section describes the characteristics of the BBC1 and BBC2 collections in detail. Since the main focus of this thesis is the study of SCR techniques in collections where most of the information is encountered within the spoken stream, the following description only covers aspects that are relevant to the processing of the speech material and text transcripts for ASR and SCR purposes. Thus, despite the visual nature of the BBC content, the scope of this thesis is restricted to the exploitation of the spoken content.

4.1.2 Speech collection and transcripts

This section describes the characteristics of the speech recordings in the BBC collections and the set of available transcripts, which comprises both manual and automatic transcripts of the BBC shows.

Manual transcripts

Subtitles generated by the BBC for the hearing impaired are available for every programme. These contain utterance transcriptions in ASCII and their timestamps, as well as other metadata such as indications of music and sounds played. Speaker identities are also indicated in the subtitles by different RGB colour codes, normally used when the captions are displayed in a TV broadcast. Although the subtitles contain manually curated

text, they are not a verbatim transcription of the spoken material. This is because during the creation of closed-captions, long phrases are sometimes shortened or rephrased by the transcribers to minimise read-time2. Also, utterance timestamps were set to abide by display constraints governing read-time during broadcast, and as such they can only be considered approximations of the true times when an utterance is produced.

ASR transcripts

The spoken content from the BBC1 and BBC2 collections was automatically transcribed by different ASR providers. In order to prepare the audio material for speech recognition, the audio track of each video file in the collections was first extracted by the dataset creators with the ffmpeg3 tool. The audio was originally encoded in Vorbis with a sample rate of 48 kHz and 16 bits of precision in stereo format. This was later uncompressed into WAV format, down-sampled to 16 kHz, and reduced to a single channel to comply with the specifications of the ASR providers. The audio was subsequently processed by three providers: LIMSI-CNRS/Vocapia4 (Gauvain et al., 2002), LIUM (Rousseau et al., 2011), and NST-Sheffield (Lanchantin et al., 2013). Some of the research questions explored in this thesis require the analysis of acoustic information at the level of individual words, which in turn requires word-level timestamps to be available in the output of the ASR systems. Because the LIUM transcripts did not always contain word-level time information, the experiments presented in this thesis with the BBC collections were restricted to the LIMSI and NST transcripts only. What follows is a brief description of the main characteristics of the LIMSI-CNRS/Vocapia and NST-Sheffield recognition systems.
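Before turning to the individual systems, the audio preparation step described above can be illustrated with a minimal sketch. The exact invocation used by the dataset creators is not documented here, so the file names and the wrapper function below are assumptions, not their tooling:

    import subprocess

    def extract_audio_for_asr(video_path, wav_path):
        """Extract the audio track of a video and convert it to 16 kHz mono WAV."""
        subprocess.run(
            ["ffmpeg", "-i", video_path,  # input video file
             "-vn",                       # drop the video stream
             "-ac", "1",                  # mix down to a single channel
             "-ar", "16000",              # resample to 16 kHz
             wav_path],
            check=True)

    extract_audio_for_asr("programme.mp4", "programme.wav")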

The LIMSI-CNRS/Vocapia ASR system

The LIMSI-CNRS/Vocapia system is an enhanced version of the LIMSI-CNRS broadcast news transcription system (Gauvain et al., 2002), which has been under constant development since the late 1990s. The specific version of this system used to transcribe the BBC speech collections at the SH13, SH14, and SAVA tasks corresponds to the Voxvrbs trans system (version eng-usa 4.0), with models updated with support from the Quaero programme (Gauvain, 2010). The modelling techniques used in this system are described in (Lamel et al., 2011). To cope with the problem of acoustic variability, audio files are first partitioned into homogeneous segments with speech produced by a single speaker. This partition step permits the system to perform adaptive recognition of speech fragments, plus the identification of speaker turns, identities, and gender.

2For examples see http://bbc.github.io/subtitle-guidelines/
3https://www.ffmpeg.org/
4http://www.vocapia.com

Audio segmentation is performed by an agglomerative clustering algorithm that iteratively classifies and merges audio segments based on a set of Gaussian mixture models (GMMs) (Gauvain et al., 1998). For front-end processing, speech frames are represented by a vector consisting of 39 cepstral features derived from 12 linear prediction cepstral coefficients (LPCC) and a log-energy estimate plus their first and second derivatives. This feature vector is subsequently extended with 39 additional acoustic features learnt via a bottleneck feed-forward neural network. The back-end component consists of continuous-density HMMs with Gaussian mixture models for acoustic modelling. For language modelling, a 4-gram back-off language model is used, whose probabilities are further interpolated with those estimated by a neural language model trained on a large amount of broadcast news transcriptions and news articles. Recognition is performed in multiple decoding passes in which recognition lattices are re-scored based on the interpolated n-gram and neural LMs. With the specifications reported in (Gauvain et al., 2002), previous versions of the LIMSI-CNRS/Vocapia system were able to transcribe English broadcast news speech with 13-20% WER. The more recent version of this system described in (Lamel et al., 2011) is reported to transcribe French broadcast conversational speech with 19% WER.
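As an illustration of the language model interpolation used in these rescoring passes, the probability of a word in context can be formed as a weighted mixture of the n-gram and neural LM estimates. The sketch below is a generic formulation with placeholder models and an arbitrary interpolation weight, not the LIMSI implementation:

    class UniformLM:
        """Toy stand-in for a real language model: every word is equally likely."""
        def __init__(self, vocab_size):
            self.vocab_size = vocab_size

        def prob(self, word, history):
            return 1.0 / self.vocab_size

    def interpolated_lm_prob(word, history, ngram_lm, neural_lm, lam=0.7):
        """Linear interpolation of an n-gram LM and a neural LM; both are assumed
        to expose a prob(word, history) method returning P(word | history)."""
        return lam * ngram_lm.prob(word, history) + (1.0 - lam) * neural_lm.prob(word, history)

    print(interpolated_lm_prob("news", ("broadcast",), UniformLM(10000), UniformLM(50000)))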

The NST-Sheffield ASR system

In the context of the Natural Speech Technology (NST) project, researchers from various universities across the UK and the BBC R&D department collaborated in the development of a new ASR system for transcribing spoken material from the BBC archive (Lanchantin et al., 2013). For the purpose of transcribing the BBC1 and BBC2 speech collections used at the SH14 and SAVA tasks, the organisers of the Search & Hyperlinking tasks used a version of the NST system trained with a different subset of BBC material that does not overlap with the contents of the BBC1 and BBC2 collections from Table 4.1. In the absence of perfect transcripts, the system was trained on subtitles. The NST system has two distinctive characteristics. First, as subtitles have imperfect word timestamps, the authors used a lightly supervised approach to obtain more accurate word timing information. In this procedure, the output of a first decoding pass produced with a generic acoustic model and a domain-specific language model is used to identify a subset of partially well-aligned utterances, which are then used to retrain the parameters of the acoustic model. The method used to identify candidate utterances with high recognition probability consists of selecting the utterances whose 1-best hypotheses have low WER against the subtitles. This process is repeated for a number of iterations until the number of correctly recognised utterances converges to a fixed value. Second, deep neural networks (DNNs) pre-trained with out-of-domain data for phone classification were used to enhance the feature representation of the acoustic observations. The latter approach, called Multi-level Adaptive Networks (MLAN), showed increased recognition accuracy in cross-domain cross-genre experiments on broadcast TV material from the BBC (Lanchantin et al., 2013).
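The iterative selection of well-aligned utterances can be sketched as follows. This is a schematic reconstruction under simplifying assumptions: decode and retrain are placeholder methods of a hypothetical acoustic model object, and convergence is tested by simply comparing the number of selected utterances between iterations, so it should not be read as the exact NST procedure:

    def word_error_rate(hypothesis, reference):
        """Word-level edit distance between two strings, normalised by reference length."""
        hyp, ref = hypothesis.split(), reference.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[-1][-1] / max(len(ref), 1)

    def lightly_supervised_training(utterances, subtitles, acoustic_model,
                                    wer_threshold=0.2, max_iterations=5):
        """Iteratively retrain an acoustic model on utterances whose 1-best
        hypothesis agrees well with the (imperfect) subtitle text."""
        previous_count = -1
        for _ in range(max_iterations):
            hypotheses = {u: acoustic_model.decode(u) for u in utterances}
            # Keep utterances whose 1-best hypothesis has low WER against the subtitles.
            selected = [u for u in utterances
                        if word_error_rate(hypotheses[u], subtitles[u]) <= wer_threshold]
            if len(selected) == previous_count:   # the selected set has stopped growing
                break
            previous_count = len(selected)
            acoustic_model = acoustic_model.retrain(selected, subtitles)
        return acoustic_model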

Unsurprisingly, experiments with these techniques reported in (Lanchantin et al., 2013) showed that the overall recognition accuracy of the system is highly dependent on the characteristics of the spoken content. For speech recorded in studio (controlled) conditions, the NST system can obtain WERs as low as 9.8%, whereas for non-studio recorded speech, such as parliamentary proceedings, the WER may increase to 20-23%. At the other extreme, transcriptions of TV drama series proved to be the most difficult, with WERs as high as 50%.

Processing and indexing of English transcripts

In all experiments reported in this thesis with the BBC collections, English transcripts are pre-processed prior to being indexed. First, a list of recognised words is extracted from the 1-best recognition hypothesis of each speech segment. Second, the resulting text is processed and indexed using the Terrier IR platform5 v4.0 (Ounis et al., 2007). Terrier provides different text processing modules that can be combined to define custom processing pipelines. In the experiments reported in this thesis, Terrier is configured to process English text as follows:

• Text is tokenised (class UTFTokeniser) and the resulting tokens are lower-cased (property lowercase=true).

• Tokens present in Terrier's default stop-word list are discarded (class Stopwords).

• Tokens are subsequently stemmed using the Porter algorithm (Porter, 1980) (class PorterStemmer).

• Stems appearing in more than half of the documents in the collection are discarded (property ignore.low.idf.terms=true).

• UTF-8 support is enabled (string.use_utf=true and trec.encoding=utf-8).

Under this set-up, no special standardisation is applied to numbers, dates, URLs, and other special tokens that may appear in the transcripts; a sketch of a properties file corresponding to this configuration is given below. Terrier is then used to create a separate search index for each type of transcript from the BBC1 and BBC2 collections. Tables 4.2a and 4.2b present document length statistics, measured in number of term occurrences, calculated from each of these indices. It should be noted that the number of indexed documents is lower than the number of TV episodes in cases where the transcript files from a particular ASR provider were not available in the dataset. In order to be consistent with the experimental setup used in the SH13, SH14, and SAVA tasks, two separate indices were generated for the BBC1 and BBC2 collections. Also, duplicate transcripts from re-broadcasts were not included in these indices.
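The following is a minimal sketch of how these options might be written out as a Terrier properties file. The property names not stated explicitly in the text (tokeniser and termpipelines) are assumptions based on Terrier's usual configuration style, so this is an illustration rather than the exact configuration used in the experiments:

    # Hypothetical helper that writes the Terrier options listed above to a
    # properties file; "tokeniser" and "termpipelines" are assumed names.
    terrier_properties = {
        "tokeniser": "UTFTokeniser",
        "lowercase": "true",
        "termpipelines": "Stopwords,PorterStemmer",
        "ignore.low.idf.terms": "true",
        "string.use_utf": "true",
        "trec.encoding": "utf-8",
    }

    with open("terrier.properties", "w", encoding="utf-8") as f:
        for key, value in terrier_properties.items():
            f.write(f"{key}={value}\n")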

5http://terrier.org

Table 4.2: Length statistics of BBC collections measured in number of term occurrences per transcript. In the tables, “S.D.” stands for standard deviation.

(a) BBC1

Transcript   Documents   Avg. len.   S.D. len.   Max. len.   Min. len.
SUB          1,856       3,118       3,921       46,952      28
LIMSI        1,860       3,161       3,355       36,312      29

(b) BBC2

Transcript   Documents   Avg. len.   S.D. len.   Max. len.   Min. len.
SUB          3,517       3,327       4,098       33,967      5
LIMSI        3,520       3,383       3,548       39,538      50
NST          3,520       2,460       2,584       21,527      18

Quality of ASR transcripts for the BBC collection

Since SCR effectiveness is affected by recognition errors in the automatic transcripts, it is useful to report recognition rates for the transcribed material that is used for experimentation. In the case of the BBC1 and BBC2 collections, recognition rates cannot be estimated directly as no precise reference transcripts of the TV shows exist. As a point of reference, one can refer to the WER figures published by Lanchantin et al. (2013), briefly summarised in the description of the NST system, as well as those reported at the recent Multi-genre Broadcast (MGB) Challenge (Bell et al., 2015). In the MGB challenge, participating ASR systems were evaluated over TV shows from the BBC archive that aired between April and May 2008, thus covering the same time period as the episodes contained in the BBC1 and BBC2 collections used at the SH13, SH14, and SAVA tasks. Variations of the LIMSI and NST systems were evaluated at the MGB challenge; Table 4.3c provides a partial view of these results. The best participating systems at this challenge obtained WERs that varied between 10-50% depending on the genre of the shows being transcribed, with an average WER of 23-28%. Talk shows such as “Daily Politics” could be transcribed with WERs as low as 10.4%, whereas drama series were found the most challenging, with WERs as high as 50.1%. Cross-document term count differences between the entries of the subtitle and ASR indices may also provide some indication of the quality of the ASR transcripts. Tables 4.3a and 4.3b show various index similarity metrics, specifically unique term error rate (UTER), term error rate (TER) (Johnson et al., 1999a), binary index accuracy (BIA), and ranked index accuracy (RIA) (van der Werff and Heeren, 2007), calculated for each automatic transcript index against the subtitle index. In contrast to WER, these index similarity measures disregard word ordering and have been shown to better reflect the potential impact that ASR errors may have on the retrieval effectiveness of SCR applications (van der Werff and Heeren, 2007). Appendix B includes definitions for all these metrics.
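To give a concrete sense of what such order-free measures compute, the sketch below implements one common formulation of a term error rate over bag-of-words term counts, together with a unique-term variant over the sets of distinct terms. The exact definitions used in this thesis are those given in Appendix B, which may differ in detail from this illustration:

    from collections import Counter

    def term_error_rate(reference_terms, hypothesis_terms):
        """Order-free analogue of WER: summed absolute differences in per-term
        counts, normalised by the number of reference term occurrences."""
        ref, hyp = Counter(reference_terms), Counter(hypothesis_terms)
        diff = sum(abs(ref[t] - hyp[t]) for t in set(ref) | set(hyp))
        return diff / max(sum(ref.values()), 1)

    def unique_term_error_rate(reference_terms, hypothesis_terms):
        """The same idea computed over the presence or absence of distinct terms."""
        ref, hyp = set(reference_terms), set(hypothesis_terms)
        return len(ref ^ hyp) / max(len(ref), 1)

    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sat on a mat".split()
    print(term_error_rate(reference, hypothesis))          # 2 differing occurrences out of 6
    print(unique_term_error_rate(reference, hypothesis))   # 1 differing term over 5 unique reference terms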

Table 4.3: Recognition accuracy of BBC transcripts as measured by various index similarity metrics.

(a) BBC1

Transcript   Vocabulary   UTER   TER    BIA    RIA
SUB          59,033       0.00   0.00   1.00   1.00
LIMSI        36,464       0.31   0.93   0.36   0.45

(b) BBC2

Transcript   Vocabulary   UTER   TER    BIA    RIA
SUB          78,953       0.00   0.00   1.00   1.00
LIMSI        40,198       0.30   1.00   0.36   0.44
NST          33,450       0.35   0.92   0.35   0.45

(c) WER(a) reported at the MGB challenge (Bell et al., 2015) for selected shows.

Show             LIMSI   NST
Daily Politics   11.8%   13.6%
Top Gear         26.3%   27.2%
Oliver Twist     50.1%   49.4%
Overall WER(b)   27.5%   28.8%

(a) Figures are for similar systems to those used to transcribe the BBC collections and may not represent the real accuracy of these systems on the BBC1 and BBC2 collections.
(b) Averaged over the full list of shows shown in (Bell et al., 2015).

Table 4.4: Examples of transcripts for the show Daily Politics.

SUB:   Morning folks welcome to the Daily Politics. What should what can, the world do about Zimbabwe?
LIMSI: My morning thoughts folks. Welcome to the Daily Politics politics what should what can the world do about Zimbabwe.
NST:   EVOLVES WELCOME TO THE DAILY POLITICS WHAT SHOULD WHAT AND THE WORLD DO ABOUT SINBAD WAY

As a point of comparison for the figures in Tables 4.3a and 4.3b, van der Werff and Heeren (2007) report an average BIA and RIA of 0.51 and 0.70 respectively for ASR transcripts used at the TREC SDR track (Garofolo et al., 2000). Compared with these numbers, the BIA and RIA values shown in Tables 4.3a and 4.3b for the BBC transcripts are substantially lower, suggesting that the BBC collection may be more challenging for SCR than the collection used at the TREC SDR track. The figures from these tables also indicate that transcription quality is similar for the BBC1 and BBC2 datasets and that the NST and LIMSI systems attain similar BIA and RIA. It must be pointed out that, unlike the NST models, the models used by the LIMSI system were not trained on BBC material. Tables 4.4, 4.5, and 4.6 show examples of transcripts for three shows from the BBC2 collection.

4.1.3 Topics

Topics for the BBC1 and BBC2 collections were collected in different user studies carried out by the SH and SAVA organisers (Aly et al., 2013b; Eskevich et al., 2014, 2015).

Table 4.5: Examples of transcripts for the show Top Gear.

SUB:   We’ve got a gong for Best Factual Programme, which is astonishing when you think we haven’t actually put a fact in the show for the past five years.
LIMSI: we, we’ve got a gong. {fw} For the best factual program programme which is astonishing. When you think we haven’t actually put got a fact in this over the last 5 years.
NST:   WE ARE WE’VE GOT TO GO ALONG FOR THE BEST FACTUAL PROGRAMME WHICH IS ASTONISHING WHEN YOU THINK WE HAVEN’T ACTUALLY GOT A FACTOR IN THIS OVER LAST FIVE YEARS

Table 4.6: Examples of transcripts for the show Oliver Twist.

SUB:   Oh, Mr Fagin, let’s not muddy the waters with reasons and motives motives. Very good, sir. The issue is clear. He must hang. It’s easy enough to get a pauper child hung.
LIMSI: Mr. Fagan, let’s not muddy the waters as reasons and notice favorites. The issue is clear in this time is sees enough to get a pulpit child Charles hunt Hun Hunt.
NST:   MR FECKLESS NOT MUDDY THE WATERS AS REASONS AND MOTIVES SHE WAS CLEAR IN THE SAND SUZIE ENOUGH TO GET A PAUPER CHILD HUM

A total of 50 known-item topics were collected for the BBC1 collection, while two groups of ad-hoc topics, 36 in SH14 and 30 in SAVA, were collected for the BBC2 collection. As mentioned previously, known-item topics are those that target a single relevant item from the collection and are generally thought of as being formulated by someone who partially recalls the content for which they are searching. By contrast, ad-hoc topics represent an information need for which there could be more than one item deemed relevant in the collection. The remainder of this section provides details about these topic sets.

SH13 topics for the BBC1 collection

The known-item topics for the BBC1 dataset were generated in a user study described in (Aly et al., 2013b). The study involved 30 participants aged 16-30 from London, UK, selected as a typical group of “home users” who frequently use search engines and watch TV over the Internet. Participants were set within a home-user search scenario in which they were asked to search for BBC material that would be entertaining or interesting for them. The study first required participants to use the AXES video search system (McGuinness et al., 2013) to browse the videos and to become familiar with the contents of the collection. Figure 4.2 shows a screen capture of the system used in the study. Participants were then asked to select a segment of video from the collection that they considered interesting and to generate a text query that could be used to re-find the segment when using the search system again. Users had to provide specific starting and ending times for each selected segment by using the UI shown in Figure 4.3.

Figure 4.2: The search interface of the AXES system used by SH13 organisers in the topic-generation study.

Table 4.7: Example of SH13 known-item topics for the BBC1 collection.

SH13-10   Query: new rules for qualified drivers statistics of injuries and casualties on the roads amongst young drivers
          Visual cues: statistics
SH13-18   Query: What does a ball look like when it hits the wall during Squash
          Visual cues: ball hitting a wall in slow motion
SH13-19   Query: how much gas do cows produce
          Visual cues: cows at a farm
SH13-31   Query: rhinos and lions in kenya
          Visual cues: fields in africa lions wildlife
SH13-38   Query: little britain comedy sketch prime minister moustache
          Visual cues: prime minister moustache

Furthermore, as the AXES system could perform search based on visual concepts, users were also asked to provide an additional list of visual cues or keywords in order to complement their original queries. 50 topics were collected in total and used for testing in the SH13 task (Eskevich et al., 2013a). Table 4.7 shows some examples of these topics. In contrast to the topics that are commonly used in TRECVid6 tasks, those from the SH and SAVA tasks were designed to be multimodal, in the sense that they often target information that could be present in multiple modalities within the content to be searched, including the spoken, visual, or textual (metadata) modalities. As can be seen from the examples, some topics are informational (SH13-10 and SH13-19), while others are more visually oriented (SH13-18 and SH13-31). Also, it is common for users to include names of celebrities and programme titles (SH13-38) in their queries.

SH14 topics for the BBC2 collection

The ad-hoc queries used at the SH14 task were gathered in a user study (Eskevich et al., 2014) similar to that carried out for the SH13 topics.

6https://trecvid.nist.gov/

Figure 4.3: The boundary refinement interface of the system used by SH13 organisers in the topic-generation study.

Table 4.8: Example of SH14 ad-hoc topics for the BBC2 collection.

Topic     Query                      Description (I am looking for video clips ...)
SH14-8    usain bolt                 ... about athletics, for example Usain Bolt.
SH14-11   history of the bbc         ... about the history of the British Broadcast Company (BBC).
SH14-18   polar bears                ... with polar bears in the wild.
SH14-29   buckingham palace crowds   ... that show crowds at Buckingham Palace.
SH14-35   world cup goals            ... that show goals from the FIFA World Cup.

In this study, 28 participants were recruited, with a profile similar to that of the participants recruited for the SH13 queries, and were asked to generate a set of ad-hoc topics. For this, participants were first instructed to think of information needs for content they would find interesting and to state them in natural language. Next, they were requested to formulate short keyword-based textual queries for their information needs, similar to those they would use in YouTube. Finally, they had to enter their queries into the AXES system and find segments of video relevant to their information needs. To ensure the generation of ad-hoc queries, users had to identify at least two relevant video segments per information need within the BBC2 collection. If a query did not trigger enough relevant results, users were instructed to re-formulate it and commence a new search. The organisers selected 36 topics to form the test set used at the SH14 task. Examples of these topics are shown in Table 4.8. In contrast to the SH13 queries, the topics from SH14 are more ambiguous, contain fewer terms, and target broader subjects.

Table 4.9: Length statistics of SH13, SH14, and SAVA queries.

Topics   Type         Number   Avg. len.   S.D. len.   Max len.   Min len.
SH13     Known-item   50       6.2         3.3         19         1
SH14     Ad-hoc       36       2.5         0.7         4          2
SAVA     Ad-hoc       30       5.0         2.5         11         1

The SAVA topics for the BBC2 collection

The SAVA topic set contains 30 ad-hoc topics in total (Eskevich et al., 2015). Two different user groups took part in the topic-generation study. The first group consisted of media professionals, including journalists, archivists, and researchers, who claimed to frequently use retrieval systems to find reusable video material. The second group was representative of a general audience, similar to the group recruited in the generation of the SH topics. Participants were asked to explore the BBC2 collection by using the AXES system and to generate search topics, in a setup similar to that used for the collection of the SH14 topics. Some examples of SAVA topics are shown in Table 4.10.

Query processing and query length statistics

In all IR experiments reported in this thesis with the BBC collections, English queries were preprocessed in the same way as the English document transcripts, that is, by using Terrier's text processing pipeline as described in Section 4.1.2, including tokenisation, lowercasing, stopword removal, and stemming. Table 4.9 shows query term statistics for the SH13, SH14, and SAVA topic sets after queries are processed. In terms of length, queries from the SH13 and SAVA sets contain a larger number of content-bearing terms than queries from the SH14 set.
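For illustration, the query-side processing just described can be sketched as follows. The stop-word set below is a toy stand-in for Terrier's list, and NLTK's Porter stemmer stands in for Terrier's implementation, so outputs may differ slightly from the indices used in the experiments:

    from nltk.stem.porter import PorterStemmer

    STOP_WORDS = {"the", "of", "for", "a", "an", "and", "in", "to", "on"}  # toy stand-in

    def process_query(query):
        """Tokenise, lower-case, remove stop words, and Porter-stem a query."""
        stemmer = PorterStemmer()
        tokens = [t.lower() for t in query.split()]
        return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

    print(process_query("new rules for qualified drivers"))
    # e.g. -> ['new', 'rule', 'qualifi', 'driver']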

4.1.4 Relevance assessments

Relevance judgements for the SH13 topics were determined based on the segments of video content that users had selected when generating their known-item topics (Eskevich et al., 2013a). In particular, each segment selected to produce a topic was automatically judged to be relevant to that topic. In addition, query creators were also asked to re-adjust the boundaries of their segments in order to remove from them any content not considered relevant to their information needs. Figure 4.6a shows the number of relevant segments segregated by length. As can be seen in the figure, about 60% of all segments marked as relevant to some SH13 topic are 2 minutes long or less, while the longest segments are approximately 10 minutes long. Segment-level relevance assessments for the SH14 and SAVA topics were obtained by the benchmark organisers through a pooling procedure (Eskevich et al., 2014, 2015). The pools were formed with the top 10 video segments from each ranked list of results submitted by the different participating teams in the search task of the benchmarks.

Table 4.10: Example of SAVA ad-hoc topics for the BBC2 collection.

SAVA-13   Query: jaipur terrorist explosion
          Description: I am looking for information concerning the Jaipur bombing. Relevant clips contain information about Jaipur, the casualties and the terrorists.
SAVA-15   Query: murder police crime scene statistics london
          Visual cues: police, crime scene statistics
          Description: I am looking for clips about murders in London. Relevant clips contain news items about murders, crime scenes and statistics about murder rates in different parts of London.
SAVA-20   Query: hill walking public footpaths
          Visual cues: public footpaths, hill, walking
          Description: I am looking for the countryside. Relevant clips contain hill walking and landscapes.
SAVA-24   Query: squirrels
          Visual cues: squirrels
          Description: I am looking for films that contain squirrels.
SAVA-41   Query: burmese rangoon cyclone
          Visual cues: cyclone
          Description: I’m looking for information about the cyclone that hit Burmah. Related links should provide visual and/or spoken information about the disaster.

For these tasks, it was common for participating systems to return overlapping segments in the search results for a query, that is, video clips whose time spans overlap with those of another clip within the same video. In order to reduce the amount of annotation effort required in assessing these segments, the task organisers pre-processed each ranked list returned by the participants before generating the pools. This procedure consisted of removing the overlapping material from a segment in a ranked list if it was found to overlap with some other segment located at a higher position in the same ranked list. Further, in order to comply with BBC regulations on the use of its material in crowdsourcing experiments, the ending times of the segments were adjusted to restrict their maximum length to 120 seconds. Binary relevance judgements were then produced for each segment-topic pair by assessors recruited from Amazon Mechanical Turk7 (Larson et al., 2012). Assessors were presented with a simple user interface showing the description of the topic and with a playback tool to reproduce the video segment to be judged. Figure 4.5 shows a screen capture of this web interface. Also, in order to comply with legal requirements from the BBC, the playback tool had restricted access to the contents of a video. In this respect, each crowd-worker was only permitted to view the contents of the segment to be judged. Besides providing a relevance judgement, assessors had to describe the reasons underlying why each judgement was made, as well as to provide a list of meaningful keywords spoken in the clip.

7http://www.mturk.com

Circa 10,000 segment-topic pairs were assessed against the SH14 topics in the crowdsourcing study, while 2,300 were assessed against the SAVA topics. In both cases, about 35% of all segments were judged as relevant. Because participating systems produced different results for a given query, it was still common for a pool of results to contain overlapping segments. Moreover, since these overlaps were not completely removed from the pools of results by the organisers, a substantial amount of video content in the SH14 and SAVA studies was judged by more than one assessor. In particular, about 48% and 12% of all segments judged in the SH14 and SAVA studies respectively were judged by two or more assessors. In this respect, it is useful to measure the inter-annotator agreement among crowd-workers in order to determine the reliability of the assessments produced as part of the SH14 and SAVA studies. A way to achieve this is to calculate Krippendorff's α coefficient (Krippendorff, 2011)8. Based on a Krippendorff's α of 0.41 for the SH14 assessments and 0.43 for the SAVA assessments, the agreement among multiple assessors when judging the relevance of a segment can be considered in the range of “fair” to “moderate” if compared to other values of α reported in the IR literature (Schaer, 2012; Schaer et al., 2016; Verma et al., 2016). This means that while assessors agreed moderately when judging a video segment to be relevant to a query, it was also frequent for them to disagree and to provide inconsistent judgements. Nevertheless, the degree of agreement between assessors in the SH14 and SAVA studies is comparable with that from other relevance assessment studies and can therefore be considered reliable for experimentation. To remove segment overlap and resolve possible label inconsistencies found in the ground truth, the benchmark organisers decided to take the union of overlapping segments judged in the relevance assessment study. A union was considered to be relevant if it contained at least one passage judged as relevant by an assessor. Figure 4.4 shows an example of this process, where a common segment was judged by three different assessors as both relevant (green) and non-relevant (red). In the example, a new relevant segment is formed from the union of two relevant segments: one produced by Judge 1 and another one by Judge 3. The irrelevant segment identified by Judge 2 is effectively ignored in the calculation of the union; a minimal sketch of this merging step is given at the end of this section. Figures 4.6b and 4.6c show the length distribution of relevant segments resulting from the union process described above. Compared to the length distribution of the SH13 ground truth shown in Figure 4.6a, those of SH14 and SAVA are skewed towards segments that are less than 150 seconds long. In the latter case, the extent and boundaries of the relevant segments were largely determined by the retrieval systems that contributed to the pools of results and the length restrictions imposed in the crowdsourcing experiments. In addition, assessors were not asked to correct the boundaries of the segments judged relevant if these were found to contain some leading or trailing irrelevant material.

8Note that using other measures of inter-annotator agreement, such as Cohen’s or Fleiss’ Kappa, would not be appropriate in this case since not all assessors were asked to judge all of the segments from the pools.


Figure 4.4: Example of video segments judged as relevant (green) and non-relevant (red) by three assessors and their unions into single relevant segments.

Figure 4.5: Web interface for collecting relevance assessments in SH14 and SAVA.

Consequently, while the boundaries of the segments from the SH13 ground truth can be considered more reliable, those from the SH14 and SAVA ground truths should be considered sub-optimal in the sense that they may not reflect the best possible boundaries an annotator would have selected for these segments.
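The union step described above can be made concrete with the following sketch, which merges overlapping judged segments within a video and marks the union as relevant if any of its constituent judgements is positive. It is an illustrative reconstruction of the procedure, not the organisers' implementation:

    def merge_judged_segments(judgements):
        """Merge overlapping (start, end, relevant) judgements into unions.

        `judgements` is a list of (start_sec, end_sec, is_relevant) tuples for
        one video; a union is relevant if any overlapping judgement is relevant.
        """
        merged = []
        for start, end, relevant in sorted(judgements):
            if merged and start <= merged[-1][1]:            # overlaps the previous union
                prev_start, prev_end, prev_rel = merged[-1]
                merged[-1] = (prev_start, max(prev_end, end), prev_rel or relevant)
            else:
                merged.append((start, end, relevant))
        return merged

    # Example mirroring Figure 4.4: two relevant and one non-relevant judgement.
    print(merge_judged_segments([(10, 40, True), (30, 70, False), (60, 90, True)]))
    # -> [(10, 90, True)]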

4.2 The SDPWS collections of academic presentations

The SDPWS collection contains oral presentations recorded at different editions of the Spoken Document Processing Workshop (SDPWS), an annual scientific meeting organised by the speech processing community in Japan. This collection has been used in several SCR benchmarks, namely the NTCIR-10 SpokenDoc-2 (SD2), the NTCIR-11 SpokenQuery&Doc (SQD1), and the NTCIR-12 SpokenQuery&Doc-2 (SQD2) tasks (Akiba et al., 2013a, 2014, 2016). The remainder of this section describes the details of the speech recordings, transcripts, topics, and relevance assessments that were distributed to the participants of these tasks and used in the experiments described in this thesis.


(a) SH13 topics. (b) SH14 topics. (c) SAVA topics.

Figure 4.6: Length distribution of relevant segments in the BBC collections.

4.2.1 Overview

The SDPWS collection contains 114 academic presentations, comprising about 30 hours of speech material. The NTCIR task organisers provided the audio file of each presentation, plus the time boundaries of utterances produced by a voice-activity detection (VAD) tool, and hand-labelled timestamps of slide transitions. In addition, manual and automatic transcripts of the presentations are provided, plus the acoustic and language models that were used to perform speech recognition. The presentation recordings from the SDPWS collection contain semi-scripted spoken material produced by a single speaker in front of a live audience. In contrast to TV speech, speech produced in these academic talks presents less acoustic variability, as the majority of the presentations were recorded in relatively similar acoustic conditions. Also, since speakers often deviate from their planned presentations, their speech can be regarded as more spontaneous than that of broadcast news and other scripted material. This is evidenced by the high frequency with which hesitations, false starts, and other spontaneous speech phenomena occur in the talks. In terms of domain, most talks in the SDPWS collection are about speech processing, information retrieval, machine learning, and related topics. The recordings thus contain occurrences of highly technical terms, with a widespread use of acronyms, scholarly terms, and foreign words in English, as commonly seen in computer science talks. Two versions of the SDPWS collection were distributed to the participants of the NTCIR tasks. The data distributed in the SD2 task (SDPWS1) (Akiba et al., 2013a) contains 106 presentations corresponding to talks recorded at the first six editions of the SDPWS. In the subsequent SQD1 (Akiba et al., 2014) and SQD2 (Akiba et al., 2016) tasks, this collection was extended with 10 additional talks recorded at the seventh edition of the workshop, along with slide change annotations for a subset of 98 presentations. This smaller subset of 98 presentations (SDPWS2) was used in the SQD1 and SQD2 tasks as the target collection for the search task. Table 4.11 summarises this distinction and provides duration statistics for each collection.

Table 4.11: Duration statistics of presentation recordings from the SDPWS collection.

Dataset   Documents   Total           Avg.     S.D.    Min      Max
SDPWS1    104         28 hrs 38 min   16 min   2 min   11 min   21 min
SDPWS2    98          26 hrs 46 min

Table 4.12: Manual and ASR transcripts provided by the NTCIR task organisers for the SDPWS collections.

SpokenQuery&Doc ID     Short ID       SDPWS1   SDPWS2
MANUAL                 MAN            X        X
K-REF-WORD-MATCH       K-MATCH                 X
REF-WORD-MATCH         MATCH          X        X
REF-WORD-UNMATCHLM     UNMATCH-LM     X        X
REF-WORD-UNMATCHAMLM   UNMATCH-AMLM            X

Since pauses between utterances were removed from the audio files before calculating the duration estimates, the figures from Table 4.11 do not reflect the actual duration of the SDPWS recordings, which could be up to 20-30% longer if periods of silence were to be included.

4.2.2 Speech collection and transcripts

As part of the data preparation, the organisers of the NTCIR tasks segmented the audio recordings into small speech fragments at pauses longer than 200 ms, based on the output of a voice-activity detector (VAD). This process resulted in a list of sequential spoken fragments for each presentation, termed inter-pausal units (IPUs), which can be considered approximations of utterances; a minimal sketch of this grouping step is given below. The speech collections were then distributed as sequences of IPUs associated with a particular presentation ID. IPUs were released in WAV format with a sampling rate of 16 kHz and 16 bits of precision, recorded in a single audio channel. After audio fragmentation, the data creators produced orthographic and automatic transcripts for each IPU. Details of the alternative transcripts available are summarised in Table 4.12. The IDs shown in the first column of the table correspond to those reported in (Akiba et al., 2016). The third and fourth columns indicate the transcript types available for the SDPWS1 (Akiba et al., 2013a) and SDPWS2 (Akiba et al., 2014) collections respectively. Since the full range of ASR transcripts is only available for SDPWS2, the experiments conducted in this thesis are carried out with this version of the collection. The rest of this section describes the transcripts, ASR systems, and models that the NTCIR organisers used to automatically transcribe the presentation speech from the SDPWS2 collection.
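As an illustration of the IPU construction just described, the sketch below groups VAD speech intervals into IPUs whenever the pause between them does not exceed 200 ms. The interval representation and threshold handling are assumptions for the purpose of illustration, not the organisers' exact tooling:

    def group_into_ipus(speech_intervals, max_pause=0.2):
        """Merge VAD speech intervals into inter-pausal units (IPUs).

        `speech_intervals` is a time-ordered list of (start_sec, end_sec) tuples;
        intervals separated by a pause of at most `max_pause` seconds are merged
        into a single IPU.
        """
        ipus = []
        for start, end in speech_intervals:
            if ipus and start - ipus[-1][1] <= max_pause:
                ipus[-1] = (ipus[-1][0], end)      # short pause: extend the current IPU
            else:
                ipus.append((start, end))          # long pause: start a new IPU
        return ipus

    print(group_into_ipus([(0.0, 1.2), (1.3, 2.0), (2.5, 3.1)]))
    # -> [(0.0, 2.0), (2.5, 3.1)]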

Manual transcripts

The orthographic transcripts of the SDPWS2 talks were produced by hired transcribers, who further annotated the transcripts with the markers shown in Table 4.13.

Table 4.13: Annotations of spontaneous speech phenomena in manual transcripts of the SDPWS collection.

Marker        Description
<H>           Non-lexical lengthening of vowel.
<Q>           Non-lexical lengthening of consonant.
<FV>          Vowel with unrecognisable phonemic status.
<息>          Breathing noise.
(笑) / <笑>   Laughter with/without speech.
(泣) / <泣>   Cry with/without speech.
(咳) / <咳>   Cough with/without speech.
(あくび)      Yawn with speech.
<雑音>        Noisy speech.
(L)           Whispery speech.
(D), (D2)     Word-fragment, unfluent speech.
(W)           Reduced, truncated, or incorrect pronunciation.
(?)           Uncertainty in the transcription.
(F)           Filled-pause.
(M)           Meta-linguistic expression.
(O)           Archaic Japanese.
(A)           Use of Latin scripts in transcription.
(K)           Use of Katakana scripts in transcription.
(s)           Slide transition.

These markers indicate the presence of various spontaneous speech phenomena, such as noise, whispery speech, hesitations, and filled pauses. Transcribers created orthographic transcripts with a mix of Kanji (Chinese logographs), Kana (Japanese syllabaries), and Romaji (Latin scripts) by following a strict set of guidelines designed to reduce the common phenomenon of variation among transcribers of written Japanese. Besides allowing a high degree of freedom in word formation, written Japanese places no characters between words to delimit word boundaries. This is the case for the manual transcripts of the SDPWS collection, in which IPUs were transcribed in Yokogaki style with words placed contiguously from left to right, one after the other. In order to generate a term index for a collection of Japanese documents, it is common practice to first split the text into morphemes by using a morphological analyser. The particular tokenisation process applied to the SDPWS2 transcripts is explained in later sections of this chapter. The slide transition annotations (s) were included in the manual transcripts of the SDPWS2 collection to indicate the times when slide transitions were made by the presenters. In addition, transcribers were required to identify groups of consecutive slides that were used to present a single topic or idea in a presentation. These slide groups thus define a list of topically homogeneous segments within a presentation and were used in the SQD1 and SQD2 tasks as pre-defined passages to be retrieved in response to a query. The set of slide-group passages provides a ground-truth segmentation of the SDPWS2 collection and is the main focus of the experiments presented in Chapters 6 and 7 of this thesis.

90 ASR models and transcripts

The organisers of the NTCIR benchmarks produced various automatic transcripts of the speech material from the SDPWS collection. These were generated with the Julius9 (Lee and Kawahara, 2009) and Kaldi10 (Povey et al., 2011) ASR toolkits under different training conditions of language and acoustic models.

The Julius ASR system

The front-end of the Julius-based recogniser was configured to extract 38 cepstral features for every 10 ms of speech data (Akiba et al., 2014). This feature vector included 12 MFCCs plus their first and second derivatives, as well as the first and second derivatives of the signal energy. The back-end of this system followed a standard GMM-HMM framework, with tri-phone context-dependent HMM states and 32 Gaussian mixture components used for modelling the observation probabilities of each state. Spoken utterances were transcribed by the Julius system in two passes. In the first pass, a 1-best transcription was obtained with a left-to-right (forward) bi-gram language model and a frame-synchronous beam search procedure, while in the second pass a right-to-left (backward) tri-gram language model was used along with a stack-decoding search algorithm. In the second pass, the output from the first pass was used to construct a “word trellis index”, which enables the search space of possible recognition hypotheses to be narrowed and the computation time of the second decoding pass to be reduced (Lee et al., 1998). The Julius ASR system was configured to output the 10-best recognition hypotheses, plus word lattices and confusion networks, for each processed IPU. In addition, the system was configured to produce confidence scores for each word in the transcription hypothesis. These confidence estimates are calculated with the algorithm proposed by Lee et al. (2004), which approximates word posterior probabilities based on the likelihoods of the partial sentence hypotheses that are generated by the stack-decoding search algorithm.

The Kaldi ASR system

The Kaldi-based recogniser is based on a recipe distributed with the Kaldi toolkit11, originally created for building ASR models with data from the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000). The front-end of this system extracts 12 MFCCs, their deltas and delta-deltas, and performs cepstral mean and variance normalisation (CMVN) per IPU. Subsequently, a number of dimensionality reduction and transformation operations are applied to these feature vectors, including linear discriminant analysis (LDA) (Haeb-Umbach and Ney, 1992), maximum-likelihood linear transform (MLLT), and feature-space MLLT (fMLLT) (Gales, 1998).

9http://julius.osdn.jp
10http://kaldi-asr.org
11https://github.com/kaldi-asr/kaldi/tree/master/egs/csj

The resulting 40-dimensional feature vectors per frame are used to pre-train a deep belief network (DBN) with the contrastive divergence algorithm (Hinton et al., 2006). The weights estimated from this unsupervised training procedure are then used to initialise the weights of a DNN, which is later trained to classify speech frames into HMM states with stochastic gradient descent (SGD) and the cross-entropy objective function (Hinton et al., 2012). Finally, to better model the dependencies that exist between acoustic frames, this DNN is fine-tuned using sequence-discriminative training (Veselý et al., 2013) with the state-level minimum Bayes risk (sMBR) criterion. The benchmark organisers used this DNN-HMM based acoustic model along with a tri-gram language model to transcribe the speech content from the SDPWS collection. The same models obtain WERs as low as 9% when transcribing a held-out set of academic presentations from the CSJ. This system also produced word-level confidence scores based on word posterior probabilities calculated from the recognition lattice.
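As a small illustration of one step in this front-end, per-IPU cepstral mean and variance normalisation can be sketched as below. This is a generic formulation; Kaldi's own implementation differs in detail:

    import numpy as np

    def cmvn(features):
        """Cepstral mean and variance normalisation over one IPU.

        `features` is a (num_frames, num_coefficients) array of MFCCs; each
        coefficient is shifted to zero mean and scaled to unit variance."""
        mean = features.mean(axis=0)
        std = features.std(axis=0) + 1e-8   # guard against zero variance
        return (features - mean) / std

    # Toy example: 100 frames of 12 MFCCs for a single IPU.
    normalised = cmvn(np.random.randn(100, 12) * 3.0 + 1.0)
    print(normalised.mean(axis=0).round(6), normalised.std(axis=0).round(6))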

ASR transcripts

Two groups of ASR transcripts were generated by the benchmark organisers using the systems described above: one containing recognised word units and another containing recognised sub-word (syllable) units. In all of the experiments described in this thesis with the SDPWS collection, only the word-level transcripts were used, hence only these are described in the following. Table 4.14 summarises the main characteristics of the acoustic and language models that the benchmark organisers used to transcribe the SDPWS collection. They obtained the K-MATCH and MATCH transcripts with acoustic and language models which they had previously trained with transcribed speech from the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000; Maekawa, 2003; Maekawa et al., 2004). The CSJ is one of the largest collections of academic presentation speech available in the Japanese language. It contains 606 hours of academic talks, derived from two types of presentation speech: academic presentations (APS) and simulated public presentations (SPS). Much of the material covered by the APS is about topics in computational linguistics, phonetics, and speech processing, and thus provides an adequate set of training data for recognising the highly technical speech content of the SDPWS collection. The benchmark organisers obtained the UNMATCH-LM transcripts by using the same acoustic model they used to generate the MATCH transcripts, but with a language model estimated from 75 months of newspaper articles from the Continuous Speech Recognition Consortium (CSRC) corpus (Lee et al., 2002). Lastly, the organisers produced a third set of transcripts, UNMATCH-AMLM, with the acoustic and language models that are distributed with the Julius dictation kit v4.3.1. The latter models were originally trained with text from the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al., 2014) and speech data from the Japanese Newspaper Article Sentences (JNAS) (Itou et al., 1999) corpus.

Table 4.14: Details of the ASR models used to automatically transcribe the SDPWS collection.

(a) Language models

ID             System   Dataset   Vocab.   In-genre?
K-MATCH        Kaldi    CSJ       29,186   Yes
MATCH          Julius   CSJ       29,186   Yes
UNMATCH-LM     Julius   CSRC      21,322   No
UNMATCH-AMLM   Julius   BCCWJ     64,274   No

(b) Acoustic models

ID             System   Dataset           Size      In-genre?
K-MATCH        Kaldi    CSJ (APS)         240 hrs   Yes
MATCH          Julius   CSJ (APS + SPS)   606 hrs   Yes
UNMATCH-LM     Julius   CSJ (APS + SPS)   606 hrs   Yes
UNMATCH-AMLM   Julius   JNAS              86 hrs    No

The K-MATCH and MATCH transcripts differ in the underlying ASR technology and the amount of speech data used to train the acoustic models. While the K-MATCH transcripts were produced with the Kaldi toolkit using a DNN-HMM framework, those from the MATCH group were generated with the Julius toolkit using a more conventional GMM-HMM framework. Also, the acoustic models generated with Kaldi were trained with a subset of approximately 240 hours of academic presentation speech from the CSJ, whereas those generated with Julius were trained with all recordings from the CSJ (606 hours). The MATCH, UNMATCH-LM, and UNMATCH-AMLM transcripts were all produced with Julius, but they differ in the speech and text data used to train the recognition models, with the same framework being applied in this case for both training and decoding of the spoken material.

Processing and indexing of Japanese transcripts

This section describes how we processed and indexed the orthographic and automatic Japanese transcripts from the SDPWS and CSJ collections for the experimental work presented in this thesis. In order to obtain tokens that can be used as indexing features of documents, we processed the orthographic transcripts of the SDPWS collection with the morphological analyser MeCab12 v0.996. For this purpose, the Ipadic dictionary v2.7.0 (Asahara and Matsumoto, 2003) was used along with the MeCab analyser. Besides text tokenisation, MeCab also provides the base (root) form, pronunciation form, and POS tag of each identified word. When constructing text indices from SDPWS transcripts, we used the base forms of words produced by MeCab as indexing terms. As an additional pre-processing step for manual transcripts, we handled the annotation labels from Table 4.13 as follows.

12http://mecab.sourceforge.net/

Annotations that do not relate to any content-bearing words, that is, those with label codes H, Q, FV, 息, L, F, M, O, s, and 雑音, were discarded from the manual transcripts. Words marked with special pronunciations, specifically those marked with 笑, 泣, 咳, あくび, D, D2, ?, or K, were retained, while their label codes were removed from the text prior to performing morphological parsing, since the presence of annotation labels affects MeCab's output. Annotations with label codes A and W mark the usage of alphabetic characters (borrowed words) and incorrect pronunciations respectively. A annotations, e.g. (A ティーエフアイディーエフ;tf-idf), specify two forms for a word, its alphabetic form and its pronunciation form in Katakana characters, while W annotations, e.g. (W エーキュー;要求), specify a mispronounced word along with its correct pronunciation. In the construction of the IR indices from manual transcripts, we kept the alphabetic forms of words and the correct pronunciations as indexing terms while processing the text with MeCab.

While knowing the identity of the words spoken is necessary for the creation of a term index, knowing the exact times when words are spoken is required for performing acoustic/prosodic analysis. The orthographic transcripts from the SDPWS collection do not include the start and end times of individual words, as these were not originally added by the NTCIR transcribers. Nonetheless, such timestamps can be recovered through forced alignment, by constraining an ASR system, with a given pre-trained acoustic model, to recognise a given sequence of words. In order to obtain the starting time and duration of each word in the manual transcripts, we first translated its pronunciation form in Katakana characters into its phonemic representation by means of a pronunciation dictionary provided by the NTCIR task organisers. For instance, we used the sequence “m a z u i” as the phonemic representation for the Katakana sequence “マズイ” corresponding to the written form “まずい”. Furthermore, the alternative pronunciation forms found in the A or W labels were used for words annotated with these types of special markers. We then performed forced alignment by using the Julius Segmentation Kit13 v1.0. In this step, word timestamps were obtained for an IPU by feeding this tool with the phonemic translations and the WAV file of the IPU, plus the acoustic model MATCH (see Table 4.14) provided by the task organisers.

In all the experiments conducted with automatic transcripts from the SDPWS collection, we only processed and indexed the text from the 1-best recognition hypothesis of each IPU. The language models used by the NTCIR task organisers to obtain these transcripts (Table 4.14a) had originally been estimated from text tokenised with the ChaSen14 analyser v2.4.4 and the UniDic dictionary v1.3.9. Further, instead of defining words by their surface text form only, the words in these language models had been defined as the concatenation of their surface and base forms and POS tags, as produced by ChaSen.

13http://sourceforge.jp/projects/julius/downloads/32570/julius4-segmentation-kit-v1.0.tar.gz
14http://chasen-legacy.sourceforge.jp

Thus, since the ASR systems were constrained to produce text from these language models, the 1-best hypothesis already contained suitable tokens that could be used as indexing terms for SCR purposes. Although the tokens produced by ChaSen could be used as indexing terms for ASR transcripts, other researchers have shown these to be less effective than tokens recognised by MeCab (Nanjo et al., 2014). For this reason, we re-tokenised the text from the 1-best ASR hypothesis with MeCab in a two-step process: first, we generated a new string by concatenating the surface form of every token present in a 1-best hypothesis, without inserting spaces between the extracted terms; second, we processed this string with MeCab to obtain a new, possibly different, tokenisation than ChaSen's. Whenever MeCab produced a different tokenisation than ChaSen's, the timestamp of each word included in the original output of the 1-best hypothesis needed to be re-estimated. To do this, we performed forced alignment against the new tokenisation produced by MeCab, following the same process described above for the manual transcripts.

In addition to using MeCab's tokenisation, previous research has demonstrated that lemmas of nouns and verbs are more effective indexing features in Japanese SCR than character or phone n-grams (Shigeyasu et al., 2009). Therefore, we removed from the manual and ASR transcripts all tokens not tagged as verbs or nouns by MeCab before constructing retrieval indices. Additionally, we removed words contained in a stop-word list of 44 frequent prepositions and determiners, in order to discard from the indices some function words that MeCab repeatedly misclassified as nouns or verbs. By filtering text this way, the length of each presentation transcript was reduced to about 50% of its original length. As the last processing step prior to indexing, we converted half-width characters into their full-width Unicode equivalents. This step was necessary because it was common for transcribers of the SDPWS presentations to use the 8-bit (half-width) and 16-bit (full-width) variants of Latin characters interchangeably when transcribing Romaji words in Japanese. Mapping characters to a consistent character set avoids the problem of missing trivial matches between terms in the query and the documents.

After processing the text, we used Terrier to generate an index for each transcription type. In this case, Terrier was configured in the same way as for the indexing of English transcripts, as described in Section 4.1.2, with the difference that stemming was disabled and the English stop-word list was replaced by the list of 44 common Japanese words introduced previously. Table 4.15 presents term statistics obtained from the inverted indices of all available transcript types for the SDPWS collection, while Table 4.16 reports word recognition rates and index similarity metrics computed against the reference index. Among all ASR transcripts, the K-MATCH and MATCH transcripts present the best recognition quality overall, with the K-MATCH transcripts being substantially more accurate despite being produced with acoustic models trained on less speech data.

Table 4.15: Length statistics in number of terms of processed transcripts from the SDPWS2 collection.

Transcript     Avg. len.   S.D. len.   Max. len.   Min. len.
MAN            1,769.17    276.15      2,424       895
K-MATCH        1,763.09    274.01      2,385       923
MATCH          1,752.15    262.81      2,360       983
UNMATCH-LM     1,922.76    285.21      2,537       1,094
UNMATCH-AMLM   1,594.50    247.70      2,176       842

Table 4.16: Recognition accuracy of presentation transcripts as measured by various index similarity metrics for the SDPWS2 collection.

Transcript       #Terms    WER     UTER   TER    BIA    RIA
MAN               6,230     0%     0      0      1.00   1.00
K-MATCH           6,350    22.0%   0.27   0.48   0.49   0.56
MATCH             6,131    43.7%   0.41   0.82   0.28   0.37
UNMATCH-LM       11,219    67.5%   0.51   1.57   0.10   0.19
UNMATCH-AMLM     14,190    70.5%   0.54   1.46   0.10   0.20

4.2.3 Topics

During the four cycles of the NTCIR SCR benchmarks, different topic sets were collected and released to task participants for each cycle. The first two editions of the benchmark, SD1 and SD2, focused on search queries stated as written text, while the subsequent SQD1 and SQD2 cycles focused on queries stated in spoken form. While the former sets represent a conventional search scenario in which users type their search requests on a keyboard, the latter introduced a novel scheme in which users communicate their queries through a voice-enabled (spoken) interface. This change is in line with the increased general research interest that conversational interfaces have received in recent years. The remainder of this section describes these topic sets in greater detail.

The SD2 topics for the SDPWS1 collection

The SD2 topic set (Akiba et al., 2013a) contains 120 ad-hoc written queries targeting the SDPWS1 collection. These topics were formulated by six volunteers who took part in a query generation study organised by NTCIR researchers. Each volunteer was asked to create 20 queries based on the content of the articles from the proceedings of the 1st-6th editions of the SDPWS workshop and the orthographic transcripts of their corresponding audio presentations. In particular, query creators were asked to generate some topics based on the content of the articles and some others based solely on the content of the transcripts. In total, 80 topics were produced from the workshop proceedings and 40 from the presentation transcripts. Additionally, participants were encouraged to produce topics whose relevant information may be encountered within passages of varying length in a presentation, thus encouraging topic creators to think about information needs targeting different levels of content granularity.

Table 4.17 shows five sample topics from the SD2 set. The table shows the original query text in Japanese and its corresponding English translation obtained with Google Translate. The great majority of queries from this set are purely informational, with most queries stated as Wh-questions, as if they were to be input into a question-answering system.

Table 4.17: Examples of SD2 topics for the SDPWS1 collection.

Topic     Query                                                        English translation
SD2-003   サフィックスアレイとはどんなものか。                          What is a suffix array?
SD2-008   決定木を利用している研究について知りたい。                    I would like to know about research using decision trees.
SD2-014   音声の韻律情報とはどんなものか。                              What is prosodic information of speech?
SD2-074   集合知とはどのようなものでどう利用されているのか知りたい      I would like to know about collective intelligence and how it is used
SD2-092   重要な文を自動的に求める方法が知りたい                        I want to know how to automatically obtain important sentences

The SQD1 and SQD2 topics for the SDPWS2 collection

The SQD1 and SQD2 topic sets contain 37 and 80 ad-hoc spoken queries respectively. These were recorded by NTCIR researchers in two independent user studies following a similar methodology. In these studies, each volunteer was asked to select an article from the 1st-7th editions of the SDPWS proceedings and to formulate a query based on the content of one of the article's paragraphs. The recording session occurred at a later stage, in which volunteers were not shown the content of the article they had selected, to prevent them from uttering verbatim sentences found in the article. Volunteers were given unlimited time to formulate their queries and were not interrupted during the recording sessions. As a result, participants tended to produce extremely verbose queries. At the end of a recording session, participants were asked to listen to and transcribe the spoken queries they had produced. These manual transcripts were made available by the dataset creators, as well as ASR transcripts of the spoken queries that were later generated with the same set of ASR systems described in Section 4.2.2.

Tables 4.18 and 4.19 show examples of spoken queries from the SQD1 and SQD2 sets respectively. The queries from these sets tend to be extremely long, resembling full paragraphs rather than the more typical keyword-based queries. The large majority of these spoken queries provide an in-depth description of the information needs, and may even include participants' interests and motivations.

Table 4.18: Examples of SQD1 topics for the SDPWS2 collection.

Topic: SQD1-05
Query: (F えーと)(F ま)最近その(D に)音声認識が<息>結構(F ま)いろんなとこで使われるようになってきて<H>(F ま)だいぶ精度もいいような気がしているんですけどやっぱりまだ上手く認識されないっていうことが結構あって<H>例えばなんか漢字<H>が間違っていたりとか<H>(F えー)(F ま)それはよくあるんですけど後は(F ま)全然違う単語に(F ま)認識されてしまったりとかして<H>(F ま)なんでこんなふうになってしまうのかっていうのがちょっとよくわからないんですけど(F ま)そのように何か誤認識が起きてしまうような(F ま)原因は何であるかというのが知りたいです
English translation: Recently there is quite a lot of speech recognition comes to be used in many places It seems that accuracy seems to be good, but I feel that it is still quite good to not recognise, for example, something like Kanji There are times when it is wrong or something it is common but afterwards it is not recognised at all whether it will be recognised as a different word why it will become like this but why It seems that something misleading seems to happen So I want to know what the cause is

Topic: SQD1-08
Query: 論文の中で(F えっと)アライメントについて説明しているところがあると思うんですけど(F えっと)論文の中だと(F えっと)統計的機械翻訳説明のところで(F と)そのアライメントのアライメントについて説明がされているんですけど(F んー)そこの説明のところの<息>スライド探して欲しいです(F えっと)その論文の中の例だと(F えっとー)(D と)確か私は本を借りますっていう例文を使って多分説明していたと思うんですけどそこのところの<息>(F えっとー)スライドでの説明が聞きたいです
English translation: I think there is a place to explain the alignment in the paper, but in the paper I will explain the alignment of the alignment in statistical machine translation explanation, but in the explanation there I'd like you to find a slide of the paper I certainly believe that I was explaining it probably by using example sentences like borrowing a book as an example in that paper but I would like to hear the explanation on the slide there is

Topic: SQD1-13
Query: ドキュメント中の(D き)(F えー)(D きょ)(F えー)強調発話の検出に(D つか)(F えー)<息>(D と)ドキュメント中の強調発話の検出は(F え)どのようなアルゴリズムを使って実装しましたか
English translation: On detection of emphasised speech in a document, what kind of algorithms are used to implement detection of emphasised utterance in the document

Query processing and query length statistics

We processed the written queries and transcripts of the spoken queries in the same way as the presentation transcripts of the SDPWS collection, as described in Section 4.2.2. Recall that this includes the removal of tokens that are not nouns or verbs from each query. Table 4.20 summarises term statistics of the processed queries. The figures in the table reflect a striking difference between the length of queries stated in written and spoken form. On average, the transcribed spoken queries contain 4 times more terms than the written queries.

Table 4.19: Examples of SQD2 topics for the SDPWS2 collection.

Topic: SQD2-40
Query: (F えー)学会講演音声<H>をリアルタイムで(F えー)字幕を表示するという研究について(F えー)知りたいことがあります(F えー)その先行研究と致しまして(F えー)文を入力の一単位としまして(F えー)その字幕に(F えー)改行を挿入するという研究が(F えー)まず始めにありました(F えー)その研究では(F えー)文を(F えー)入力の一単位として区切っているので(F えーっと)リアルタイムでの表示には(F え)遅延が発生(D してましめ)してしまうという問題がありました(F え)それに対して本研究では<H>(F えー)音節を入力の一単位としまして(F えー)<H>(F えー)改行の挿入を行うことで(F え)遅延時間を(F えー)短くするように工夫を行った研究がなされています(F えー)それに際しまして(F えー)先行研究文単位での挿入を行う研究と今回の(F えー)文節単位で挿入を行う研究についての結果に関しまして(F えー)確かに文節単位で(F えー)改行の挿入を判断する改行の挿入を行う研究では(F えー)遅延時間が短くなりました(F え)しかしながら(F えー)文節単位で改行の挿入を行うとその改行の挿入が正しいか正しくないかの再現度と精度が(F えー)低下してしまうという問題も(F え)発生しているそうですその(F えー)文単位での(F えー)改行挿入と文節単位での改行挿入を比較しとき(F えー)精度と(F え)再現度はどの程度低下したか(F えー)教えてください
English translation: Society Lecture There is something you want to know about the research that displays subtitles in real time. As a preceding study I will start with a research that inserts a line break in that subtitle as a unit of input In the research, since sentences are delimited as one unit of input, there is a problem that delay occurs in display in real time. In contrast to this, in this research, a syllable is divided into units We are doing research to make the delay time shorter by inserting line breaks In that case we are conducting research inserting in preceding research sentence and research inserting in this phrase unit As regards the result of certainly judging the insertion of a new line in units of clauses The delay time is short in the study inserting line breaks However, if you insert a line break in units of clauses, the insertion of newlines is correct or not There seems to be a problem that reproducibility and precision deteriorate is also occurring It tells to what extent the precision and the reproducibility degrade when comparing line break insertion in sentence unit and line feed insertion in unit of clause

Topic: SQD2-70
Query: (F っと)大学の講義などでは(F えっと)専門用語の出現が多くまたその専門用語はいくつかの単語を合わせた複合語であることが多い<H>と思います<息>そこで講義の音声認識の形態素解析におけるキーワード抽出(D2を)<息>について<息>実験を行った論文<H>があったと思うんですがそのちゅー<息>そのキーワード抽出方法や結果について教えてください
English translation: In university lectures and others, there are many occurrences of technical terms, and I think that the technical term is often a compound word that combines several words. So I think about experimenting on keyword extraction in morphological analysis of speech recognition of lecture I think that there was a paper that I did. After a while, I will tell you about the keyword extraction method and result

Topic: SQD2-78
Query: (F えー)サブワードを用いた<H>音声文書の検索<H>において<H>(F えー)その検索精度を向上させるために何か有用なものはありますか
English translation: Is there anything useful for improving search accuracy in searching spoken documents using subwords?

Table 4.20: Length statistics in number of terms for processed queries from the SD2, SQD1, and SQD2 sets.

Topics   Number   Transcript      Ave. len.   S.D. len.   Max len.   Min len.
SD2      120      MAN                  6.77        2.62         16         1
SQD1      37      MAN                 24.13       11.61         55         8
                  MATCH               29.64       15.14         65         9
                  UNMATCH-LM          39.10       23.34        108        10
                  UNMATCH-AMLM        32.89       19.64         93         7
SQD2      80      MAN                 30.77       12.06         67        13
                  K-MATCH             30.32       13.67         75         5
                  MATCH               36.85       16.65         89         4
                  UNMATCH-LM          50.81       28.37        161         8
                  UNMATCH-AMLM        42.41       21.09        122         5

Table 4.21: Recognition error rates for the transcripts of the spoken queries from the SQD1 and SQD2 topic sets.

Topics   Transcript      #Terms   WER     UTER   TER    BIA
SQD1     MAN                373    0%     0      0      1.00
         MATCH              490   51.6%   0.31   0.81   0.42
         UNMATCH-LM         871   77.7%   0.44   1.25   0.23
         UNMATCH-AMLM       754   69.1%   0.46   1.10   0.25
SQD2     MAN                714    0%     0      0      1.0
         K-MATCH            766   33.7%   0.24   0.45   0.59
         MATCH              961   49.2%   0.26   0.68   0.47
         UNMATCH-LM       1,763   75.4%   0.38   1.12   0.26
         UNMATCH-AMLM     1,615   66.1%   0.39   0.99   0.29

Table 4.21 reports ASR error rates for the transcripts of the spoken queries. The values show that these transcripts present somewhat greater error rates than the presentation transcripts, indicating that the spoken queries are more difficult to recognise accurately. As in the spoken presentation case, the most accurate transcripts are those obtained with the K-MATCH and MATCH models, while the UNMATCH-LM and UNMATCH-AMLM transcripts are the noisiest. In contrast to the presentation case, however, the UNMATCH-LM transcripts contain a higher number of errors than the UNMATCH-AMLM transcripts for the spoken queries, due to the increased number of insertion errors present in the former.

4.2.4 Relevance assessments

Relevance judgements for the SD2, SQD1, and SQD2 topics were collected by the benchmark organisers from pools of results submitted during the corresponding NTCIR cycle (Akiba et al., 2013a, 2014, 2016). The methodology used by the human assessors to generate the relevance judgements differed slightly between cycles and depended on the specifics of the search tasks for which the assessments were gathered.

The first cycles of this benchmark (Akiba et al., 2011, 2013a) posed an SDR task over the collection of academic presentations, also called "lectures" (LEC) by the task organisers. In these initial cycles, the information units to be ranked were provided in advance to the retrieval systems, whose only goal was to produce a ranking of the pre-defined lectures in order of relevance to a query.

Subsequent cycles of the NTCIR benchmarks (Akiba et al., 2014, 2016) evaluated the ability of an SCR system at the task of ranking variable-length passages from the academic presentations. Two variations of passage retrieval tasks were evaluated in these cycles: a slide-group-segment (SGS) task, which required systems to rank a collection of pre-defined segments in order of relevance to a query; and a passage (PAS) task, which required systems to rank arbitrarily sized passages in response to a query.

In the SGS task, systems were only allowed to retrieve results from a pre-defined collection of spoken passages termed "slide-group" segments. A slide-group segment is a span of contiguous utterances (IPUs) produced during the presentation of a group of slides. A sequence of slides forms a slide-group when it is used in a presentation to support the description of a single topic or idea. Thus, a slide-group segment represents a topically homogeneous unit of information. The set of all slide-group segments comprised the collection of units to be ranked in the SGS task.

The passage (PAS) retrieval task did not impose hard constraints on the size and boundaries of the passages to be returned in response to a query. The goal of the PAS task was to study whether SCR technology was capable of determining the exact location and extent of the relevant content within a presentation. In this task, participating systems were allowed to return passages formed by any number of consecutive utterances (IPUs) from a presentation. Each utterance in this task was considered an atomic and indivisible retrieval unit: systems could group adjacently occurring utterances together but were not allowed to split them into shorter units.

Each search task at the NTCIR benchmarks imposed a different constraint on the type of units to be searched: lectures in the case of LEC, slide-group segments in SGS, and arbitrary passages (constrained to whole utterances) in PAS. In all cases, relevance assessments were generally carried out at the passage level, either by assessing the relevance of an individual slide-group segment or of a span of utterances. Relevance assessments at the presentation level were then derived from those conducted at the more granular levels. For the LEC task, a presentation was deemed relevant for a query if it contained at least one relevant passage for that query.

Relevance judgments for the SD2 topics were generated from the pool of submissions to the PAS task gathered at this edition of the benchmark. In this case, a fine-grained assessment procedure was conducted. This procedure involved the assessment of the IPUs surrounding the top 20 arbitrary variable-length passages from each ranked list of submitted results. In these assessment studies, annotators were instructed to consider not only the contents of a passage when determining its relevance status with respect to a query, but also its surrounding context. Thus, a passage was only deemed relevant to a query if there was enough contextual evidence around it to support this fact. For example, for the query "how can we evaluate the performance of information retrieval?", an isolated mention of the term "F-measure" in a presentation would not be considered relevant if it were not stated previously that the F-measure is in fact a measure of IR effectiveness, or if this could not be properly inferred from context.

In the SGS task of the SQD1 and SQD2 benchmarks, slide-group passages were used as retrieval units. For every query in the SQD1 and SQD2 sets, pools of slide-group segments were formed from the top 20 results submitted by each system that participated in the SQD1 and SQD2 cycles. The relevance assessments for a given query in the SGS task were performed at the slide-group level, that is, by assessing the relevance of each slide-group segment included in the pooled results against the pertinent query. Assessors were required to determine the relevance of a slide-group segment based on evidence from: (i) the contents of the query for which the search result was produced; (ii) the contents of the article's paragraph from the SDPWS proceedings that motivated the creation of the query during the topic generation study; and (iii) the information conveyed by the presenter, both in spoken and visual form, while presenting the slides associated with the slide-group segment to be assessed. All presentations containing one or more slide-group segments deemed relevant to a query were given special treatment in the assessment study: these were assessed exhaustively by an assessor, who determined the relevance of every slide-group segment occurring in the presentation.

For the PAS task at the SQD1 benchmark, the boundaries of the passages to be retrieved were unknown to task participants. Because the same set of queries was used for both the SGS and PAS tasks, the relevance assessments for the PAS task were initially based on those performed for the slide-group segments. A second stage of fine-grained assessment was then conducted that looked at the relevance of each individual IPU within the boundaries of a slide-group segment. In addition, these assessments were extended to cover the IPUs occurring before and after every slide-group segment deemed relevant, in order to precisely determine the true extent of the relevant content within the presentation. As a result of this more granular assessment process, relevant slide-group segments can be characterised by their constituent and surrounding relevant IPUs. Note that the boundaries of a relevant slide-group segment may not necessarily align with the start and end of its relevant IPUs. Occasionally, the relevant IPUs associated with a relevant slide-group segment extend beyond the boundaries of the segment, reaching the relevant IPUs of the following slide-group segment. Conversely, the relevant IPUs associated with a relevant slide-group segment may represent only a small fraction of all IPUs included in that segment.

In all relevance assessments conducted at the NTCIR SCR benchmarks, three relevance levels were annotated: full (R), partial (P), and no relevance (I). However, details of IPUs assessed as non-relevant were not provided to task participants and are thus not available for the SDPWS2 queries. Moreover, relevance assessments were not conducted at every level of granularity for some of the topic sets. In particular, IPU-level assessments were carried out for the SD2 and SQD1 topics but not for the SQD2 topics, while slide-group-level assessments are only available for the SQD1 and SQD2 topics. Table 4.22 shows a summary of the ground truth and assessments made available for each topic set.

Table 4.22: Availability of relevance assessments (ground truth) for NTCIR topics according to two levels of assessment granularity: slide-group-segments (SGS) and arbitrary passages (PAS).

Topics / Task   PAS   SGS
SD2              X
SQD1             X     X
SQD2                   X

4.3 Summary

This chapter described two spoken collections in detail: the BBC collection of English TV broadcasts, and the SDPWS collections of Japanese academic presentations. These contain all of the elements required for SCR experimentation and thus allow the effectiveness of different retrieval approaches to be studied and compared over the same sets of spoken documents, transcripts, queries, and relevance assessments.

The BBC collection contains the recordings of 5,843 TV shows split into two subsets: BBC1 (1,860) and BBC2 (3,520). These collections were used at the different cycles of the MediaEval SH tasks. Both subtitles and ASR transcripts of the audio material from the TV shows are available. Three topic sets were gathered by the task organisers at the SH13, SH14, and SAVA tasks. The SH13 set contains 50 known-item topics, while the SH14 and SAVA sets contain 36 and 30 ad-hoc topics respectively. Relevance assessments for these topics were collected via crowd-sourcing experiments with video clips submitted by the participants at the SH13, SH14, and SAVA tasks.

The SDPWS collection used for the different cycles of the NTCIR SpokenDoc benchmarks contains 114 academic presentations, from which a subset of 98 presentations (SDPWS2) was used in the experiments described in this thesis. Three sets of ad-hoc topics are also available: SD2, SQD1, and SQD2. While the SD2 set contains 120 typed queries, the SQD1 and SQD2 sets contain 37 and 80 spoken queries respectively. Manual transcripts as well as ASR transcripts produced by the Julius and Kaldi recognition systems with models of varying quality are available for both the spoken presentations and the spoken queries. Relevance assessments for the SD2, SQD1, and SQD2 topics were collected by the task organisers for a lecture (LEC) retrieval task, a slide-group-segment (SGS) retrieval task, and a passage (PAS) retrieval task, and are available at various levels of passage granularity.

The subsequent chapters of this thesis present the experimental work carried out in this PhD with the BBC and SDPWS collections towards seeking answers to the research questions described in Section 1.2.

Chapter 5

Prosodic-based Term Weighting

This chapter describes a series of SCR experiments and analyses conducted with the BBC and SDPWS collections to study the potential utility of prosodic information for improving lexical-based SCR methods. Within the broad range of possible applications of prosodic information in SCR, this investigation focuses on one particular aspect: the use of acoustic prominence as complementary information to term distribution statistics for estimating the weights of topically significant terms occurring in spoken documents and passages.

If it is true that the most prominent words are those that best describe the topic in discourse, an SCR system could potentially exploit this fact to generate better estimates of the importance, or weight, that a term is given in a particular portion of speech. In other words, terms that are made prominent or emphasised by a speaker could be considered more representative of the topic of the content, and hence given increased emphasis in the SCR process. These acoustically enhanced term weights could then be used to rank spoken documents more effectively in order of relevance to the user's query.

In previous work, Silipo and Crestani (2000) studied the extent to which a word's grade of prominence relates to its grade of informativeness by observing the correlation between acoustic scores, derived from manual annotations of syllable stress, and BM25 weights. Guinaudeau and Hirschberg (2011) and Chen et al. (2001) took a step forward in this line of research and attempted to combine a word's prominence score, automatically derived from the speech signal, with a standard TF-IDF score to obtain an enhanced vector representation for documents that could improve their retrieval. These more recent studies reported mixed results; while the acoustically enhanced term weights were shown to be helpful for topic tracking (Guinaudeau and Hirschberg, 2011), they were not useful in an SDR setting (Chen et al., 2001).

The experiments described in this chapter continue where the previous investigations by Silipo and Crestani, Guinaudeau and Hirschberg, and Chen et al. left off, extending them in several ways. First, a series of retrieval experiments are presented that explore whether techniques similar to those proposed by Guinaudeau and Hirschberg (2011) and Chen et al. (2001), hereafter GH and CWL respectively, can be effective in terms of improving upon a lexical-based SCR system.

Second, to better understand the relationship between acoustic prominence, informative words, and relevant content, an analysis similar to that carried out by Silipo and Crestani (2000) was conducted with speech data from the BBC and SDPWS collections. Finally, the utility of acoustic information was further investigated by training a state-of-the-art learning-to-rank approach to re-rank documents based on term-acoustic information.

This chapter is organised as follows. Section 5.1 describes the acoustic features explored and how these were combined into a prominence score that captures the extent to which a spoken word "stands out" in context. Section 5.2 describes the GH and CWL approaches for integrating prominence scores into existing ranking models, while Section 5.3 presents retrieval experiments conducted with these in spoken document and passage retrieval settings. Data analysis and learning-to-rank experiments are described next in Section 5.4. Finally, Section 5.5 summarises our findings.

5.1 Prominence score computation

The first step towards studying the utility of acoustic prominence for improving existing term weighting schemes is to obtain a representative set of features that reflect the grade of salience of each word spoken in a test collection. Recall from Section 3.3.1 that the grade of salience with which a spoken word is perceived by listeners is mainly influenced by three acoustic correlates of speech prosody: duration, fundamental frequency (F0), and loudness. This investigation follows previous research (Silipo and Crestani, 2000; Crestani, 2001; Chen et al., 2001; Guinaudeau and Hirschberg, 2011) and explores ways to define and combine this set of acoustic correlates into a numeric score, called a "prominence score", that can be used to study the usefulness of prosodic information for term weight calculation.

This section describes in detail how acoustic correlates of duration, F0, and loudness were extracted from a speech signal, and how prominence scores were then calculated from them for each spoken word in the BBC and SDPWS collections.

5.1.1 Extraction of low-level descriptors

For each audio file in the speech collections, contours of loudness and F0 were obtained using the Open-Source Media Interpretation by Large-space feature Extraction (OpenSMILE)^1 toolkit, version 2.0 (Eyben et al., 2013). This toolkit provides implementations of standard signal-processing algorithms, including procedures for extracting loudness and F0 contours from speech waveforms. In addition, this release of OpenSMILE includes configuration files that define many of the feature-extraction workflows used at the different editions of the Interspeech Computational Paralinguistics ChallengE (ComParE) (Schuller et al., 2017).

In order to extract loudness and F0 correlates with OpenSMILE, we defined a simpler configuration file based on one of the existing configurations provided with the software.

1http://opensmile.sourceforge.net

What follows is a description of the features extracted with this custom configuration.

While speech is a time-varying signal, speech sounds produced by humans can be considered to remain stationary for short periods of time of about 10-30 ms. Since most signal processing methods assume that the signal under analysis is time-invariant, short-time analysis is commonly performed to analyse the characteristics of speech waveforms. To capture the time dynamics of the F0 and loudness correlates, short-time analysis is performed with OpenSMILE by grouping samples into overlapping frames of 50 ms length with 40 ms of overlap or, equivalently, a 10 ms time-shift.

A value of loudness was then calculated for each frame with the OpenSMILE component cIntensity. This component obtains an approximation of the loudness as perceived by a human listener based on a simplified sub-band auditory model (Kießling, 1997). Specifically, this correlate of loudness is calculated as shown in Equation 5.1,

E_l = \left( \frac{I}{I_0} \right)^{0.3}    (5.1)

where I is the signal intensity and I_0 is the reference intensity, defined as I_0 = 10^{-6} (Kießling, 1997, pp. 156-157). For a time-discrete signal x(n) with n = 0, ..., N-1 representing the speech samples of a frame, OpenSMILE calculates the intensity (I) by first applying a Hamming window function (Young et al., 2002) to x(n) and then computing its normalised energy (E_n) using Equation 5.2.

E_n = \frac{1}{N} \sum_{n=0}^{N-1} x^2(n)    (5.2)

In addition to the approximation of perceived loudness described here, the root-mean-square (RMS) energy (E_rms) was extracted for each frame by means of the cEnergy component, as shown in Equation 5.3.

E_{rms} = \sqrt{ \frac{1}{N} \sum_{n=0}^{N-1} x^2(n) }    (5.3)

The purpose of extracting this second correlate of signal magnitude was to experiment with the same descriptor of signal energy as used by Guinaudeau and Hirschberg (2011) and Chen et al. (2001).
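For illustration only, the frame-level quantities in Equations 5.1-5.3 can be reproduced with a few lines of NumPy; the thesis itself used the OpenSMILE components named above. The sketch below assumes a mono signal already loaded as a float array and the same 50 ms frames with a 10 ms shift.

    import numpy as np

    I0 = 1e-6  # reference intensity from Equation 5.1

    def frame_descriptors(x, sr=16000, frame_ms=50, shift_ms=10):
        """Per-frame loudness (E_l) and RMS energy (E_rms) as in Equations 5.1-5.3."""
        frame_len = int(sr * frame_ms / 1000)
        shift = int(sr * shift_ms / 1000)
        window = np.hamming(frame_len)
        e_l, e_rms = [], []
        for start in range(0, len(x) - frame_len + 1, shift):
            frame = x[start:start + frame_len]
            # Normalised energy of the Hamming-windowed frame (Equation 5.2).
            e_n = np.mean((frame * window) ** 2)
            # Loudness approximation (Equation 5.1), taking I = E_n.
            e_l.append((e_n / I0) ** 0.3)
            # RMS energy of the raw frame (Equation 5.3).
            e_rms.append(np.sqrt(np.mean(frame ** 2)))
        return np.array(e_l), np.array(e_rms)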

In order to obtain a value of the fundamental frequency (F0) for a frame, the signal was first passed through a Gaussian filter (Eyben, 2016) by means of the OpenSMILE component cWindower. Subsequently, the Fast Fourier Transform (FFT) was applied through the cTransformFFT and cFFTmagphase components to obtain magnitude and phase values for each frequency band. Finally, the cPitchACF component was used to produce a value of F0 and a probability of voicing (pv) for the frame. This pitch detection algorithm (Eyben, 2016) uses an auto-correlation method to compute pv, while it estimates F0 by locating prominent peaks in the signal's cepstrum (Bogert et al., 1963). As the F0 values produced by the algorithm can be inaccurate for unvoiced regions of speech, frames with pv below 0.55 were considered voiceless and assigned an F0 of 0.

The process described above produced a single value of loudness (E_l), RMS energy (E_rms), and F0 for each frame in a waveform. The series of values corresponding to all frames from an utterance can then be arranged sequentially to form contours E_l(n), E_rms(n), and F0(n). In order to eliminate possible errors which may occur due to noise perturbations during estimation of the acoustic descriptors, these contours were smoothed using the cContourSmoother component of OpenSMILE with a moving-average window of size 3. Background music is yet another factor which may introduce noise into the extracted features. In the experiments conducted in this thesis, no attempt was made to adjust features for speech regions containing background music. Note that this effect could only have affected features extracted for the BBC recordings; since the SDPWS recordings do not contain background music, features for this dataset were free from such errors.
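As a hedged sketch of the two post-processing steps just described (contour smoothing and the voicing threshold), the following helpers mirror the behaviour of the OpenSMILE components without reproducing them exactly; argument names are assumptions of this sketch.

    import numpy as np

    def smooth_contour(contour, window=3):
        """Moving-average smoothing of a feature contour (window of size 3, as used above)."""
        kernel = np.ones(window) / window
        return np.convolve(np.asarray(contour, dtype=float), kernel, mode="same")

    def zero_unvoiced_f0(f0, voicing_prob, threshold=0.55):
        """Set F0 to 0 for frames whose probability of voicing falls below the threshold."""
        f0 = np.asarray(f0, dtype=float).copy()
        f0[np.asarray(voicing_prob) < threshold] = 0.0
        return f0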

Figure 5.1 depicts smoothed E_l(n), E_rms(n), and F0(n) contours for an utterance from the BBC collection. The top of the figure shows the utterance's waveform, ASR transcript, and spectrogram. Prominent words in this utterance are those that have increased loudness and pitch values relative to other words. For the utterance in Figure 5.1, the words MORTGAGE, GOVERNMENT, and SO stand out in terms of their E_rms(n) values. In terms of F0(n), the most prominent words are MORTGAGE, SHOULD, and SO, while for E_l(n) these are MORTGAGE, CRISIS, and SO.

5.1.2 Speaker-based standardisation, time-alignment, and word durations

The spoken material from the BBC and SDPWS collections was produced by a large number of speakers, in different acoustic conditions, and using different recording devices. On top of this, it is well known that the characteristics of speech can differ greatly among speakers, due to differences in accent, style, gender, social class, age group, and so on. It is therefore paramount to standardise the acoustic features appropriately in order to control for these types of variation in the data and to enable fair comparison of feature values across speakers. For this purpose, the loudness, energy, and F0 contours were standardised based on statistics computed from the regions of speech believed to be produced by a single speaker. In this process, each value from a contour C(n) produced by speaker s was replaced by its standard score (Z-score) as shown in Equation 5.4,

C_Z(n) = \frac{C(n) - \mu_s}{\sigma_s}    (5.4)

where \mu_s and \sigma_s are the mean and standard deviation calculated from all values in the contour C(n) believed to be produced by speaker s.
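A minimal sketch of this speaker-based standardisation is given below; it assumes that a mapping from speaker IDs to the frame indices attributed to each speaker (e.g. from the ASR diarisation output) is available, and both argument names are assumptions of the sketch rather than part of the original implementation.

    import numpy as np

    def speaker_standardise(contour, speaker_frames):
        """Replace each contour value by its speaker-dependent Z-score (Equation 5.4)."""
        z = np.asarray(contour, dtype=float).copy()
        for spk, idx in speaker_frames.items():
            values = z[idx]
            mu, sigma = values.mean(), values.std()
            # Guard against degenerate segments with zero variance.
            z[idx] = (values - mu) / sigma if sigma > 0 else 0.0
        return z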

107 0.6 0.4 0.2 0.0 0.2 − 0.4 − MORTGAGE CRISIS SHOULD THEGOVERNMENT STEPPED IN AND IF SO HOW 0.6 − 8000 7000 6000 5000 4000 3000 2000 1000 0 Erms 0.12 0.10 0.08 0.06 0.04 0.02 0.00 El 2.0

1.5 108 1.0

0.5

0.0 F0 400 350 300 250 200 150 100 50 0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 time

Figure 5.1: Acoustic features extracted with OpenSMILE for an utterance from the BBC collection. The figure shows, from top to bottom: the waveform, spectrogram, RMS energy (E_rms(n)), loudness (E_l(n)), and fundamental frequency (F0(n)) contours. The corresponding ASR transcript (NST system) is shown below the waveform along with vertical lines marking word boundaries.

Speaker-based standardisation is only possible when information about "who" spoke "when" is available. In the BBC collection, this is the case for the LIMSI and NST transcripts: every speech segment is associated with a particular speaker ID, originally produced during the clustering and speaker segmentation processes implemented by these ASR systems. Since the segmentations produced by these systems differ, standardisation was applied to each transcript type separately, based on the speaker diarisation information available in each case. For the transcripts from the SDPWS collection, detailed speaker information is neither available nor necessary, since this dataset only contains monologues produced in relatively uniform acoustic conditions. In this case, the feature contours were speaker-standardised by assuming one speaker per presentation.

Besides information about individual speakers, the majority of the ASR transcripts in the BBC and SDPWS collections contain predicted word timestamps which indicate the starting time and duration of each recognised word. This timing information was used to align words against the feature contours, as shown in the example of Figure 5.1. In this manner, every spoken word was associated with a sequence of values from each feature contour, corresponding to the time when the word is most likely to have been uttered in a speech file.

Besides aligning values of energy, loudness, and fundamental frequency to words, the duration of each word was also extracted from the ASR transcripts, as this is also considered an important feature of acoustic prominence. In order to control for variations in speaking rate, the duration estimates were speaker-standardised, as was done with the feature contours. Furthermore, extreme durations were frequently assigned by the ASR systems to special words such as long numeric expressions or URLs, and to symbols representing filled pauses (e.g. "eh", "mm", "emm"), which would sometimes expand into non-speech regions in the ASR output. To avoid such outliers affecting the estimation of duration statistics, only words assigned durations shorter than 2 seconds were kept in the transcripts.

In the case of the subtitles and manual transcripts from the BBC and SDPWS collections, word timestamps were not initially available. In the experiments reported in this thesis, forced alignment was applied to the manual transcripts of the SDPWS collection to obtain timestamp information, as described in Section 4.2.2. Forced alignment was only performed over the SDPWS collection since this process requires accurate transcriptions and an acoustic model, neither of which are available for the BBC material^2. After obtaining word timestamps for the manual transcripts of the SDPWS collection, the contours and duration features were aligned to each word in the transcripts, and words assigned durations greater than 2 seconds were discarded in order to avoid outliers.

2Although we could have used some of the pre-trained ASR models for English that are available online, a large number of subtitle files from the BBC collection have erroneous segment-level time-stamps (in some cases captions are more than 10 seconds off w.r.t. the audio track).

5.1.3 Combining low-level descriptors into prominence scores

The previous section described how speaker-standardised duration (D) and contours of loudness (El(n)), energy (Erms(n)), and F0 (F0(n)) were assigned to individual words in the BBC and SDPWS collections. This section explains how prominence scores were computed from the previously described set of word-level descriptors for these collections.

The standardised contours E_l(n), E_rms(n), and F0(n) contain multiple data points for every occurrence of a word in the speech transcripts. In the Guinaudeau and Hirschberg (2011) (GH) and Chen et al. (2001) (CWL) methods, these contours are aggregated in order to produce a single value of loudness, energy, and F0 for a word occurrence. For this purpose, GH and CWL experimented with different functions to compute an aggregate score for a contour C(n), namely: its maximum (C^∨), minimum (C^∧), mean (C^µ), and standard deviation (C^σ). There are therefore 12 possible values that can be derived from these feature aggregations applied to the E_l(n), E_rms(n), and F0(n) contours.

Ultimately, a prominence score for a word should reflect how noticeable or salient that word is, given its acoustic realisation, relative to other words spoken elsewhere. To define such a score, it is necessary to have a reference point or value that can serve as a basis for comparing the absolute magnitudes of the word's acoustic descriptors. For instance, the feature values of a word could be compared against those from another word spoken in the same utterance or elsewhere by the same speaker. As a result of the speaker-based standardisation process described in the previous section, any value from a standardised contour C_Z(n) will reflect the difference of its original value (C(n)) with respect to the speaker's mean (\mu_s), measured in numbers of speaker-dependent standard deviations (\sigma_s).

Thus, based on this fact, any data point in C_Z(n), or any value derived from it along the boundaries of a spoken word, can in principle be used as the prominence score for that word. Instead of using a speaker-based point of reference, prominence scores may be defined relative to other values, such as those calculated from all words appearing in the utterance containing the word or in the associated document transcript. In the general case, features can be standardised or normalised by considering feature statistics calculated over a restricted set of words W.

In the original implementation of the CWL method, the feature values assigned to a word w ∈ W were normalised in the range [0, 1] using a sigmoid function as shown in Equation 5.5,

sigm(f_w) = \frac{2}{1 + \exp\left(-\alpha \, (f_w - W_f^{\vee})\right)}, \quad W_f^{\vee} = \max_{w \in W} f_w    (5.5)

where \alpha \geq 0 controls the function's slope, f_w denotes a feature value assigned to w, and W is the set of words from which the maximum feature value (W_f^{\vee}) is calculated. Figure 5.2 shows the shape of the sigm function for different values of \alpha when W_f^{\vee} = 10. The function saturates at 1.0 when given the maximum value of f_w as input and asymptotically decreases towards 0 for decreasingly lower inputs. In the original implementation of CWL, W was set to all words found in a document transcript.

110 Figure 5.2: The sigm normalisation function (Equation 5.5) for different values of α.


Alternatively, the feature values associated with a word can be range-normalised by using Equation 5.6,

range(f_w) = \frac{f_w - W_f^{\wedge}}{W_f^{\vee} - W_f^{\wedge}}, \quad W_f^{\wedge} = \min_{w \in W} f_w    (5.6)

which maps feature values linearly onto the [0, 1] interval. When deciding on a normalisation approach to be applied to prominence scores, it is important to consider the implications of choosing a particular set of words (W) to be used in the calculation of the normalisation statistics. Ideally, W should contain sufficient words to allow good estimates of the true values of the extremes, means, and standard deviations to be obtained for each feature, so that normalised values can be reliably compared across different sets and documents. In all experiments reported in this thesis with the GH and CWL methods, features were first standardised per speaker, as explained in the previous section, and subsequently normalised to the range [0, 1] using the sigmoid or range functions based on the maximum and minimum values obtained across all words in the collection. While speaker-based standardisation is required to control for speaker-dependent variations, this normalisation process was applied only for the purpose of mapping Z-scores into the more convenient range of values [0, 1].

At this stage in the derivation process, a prominence score for every word occurrence can be defined by using any of the normalised word features in isolation or in combination. For instance, in the original implementation of the CWL method, the energy and duration features sigm(E_rms^µ) and sigm(D) were combined with a geometric mean to form the final prominence score for a word occurrence (Chen et al., 2001). In the original GH method, the energy features range(E_rms^µ) and range(E_rms^∨) were multiplied by the pitch-derived features range(F0^µ) and range(F0^∨) respectively to obtain combined scores (Guinaudeau and Hirschberg, 2011). None of these authors provided a clear justification as to why these functions might produce prominence scores that are effective for the underlying retrieval task on which they were evaluated.

Nonetheless, it should be mentioned that, as opposed to an arithmetic mean, a geometric mean assigns equal importance to all features involved in its calculation, irrespective of their differences in scale: increasing any one of the features by x% always produces the same relative increase in the value of the mean, regardless of which feature is changed. Note that this property also applies to the product-based combination adopted in the GH method, since the geometric mean is the root of the product of the features.
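The two normalisation functions and the example combinations above are simple enough to sketch directly; the feature values in the snippet below are placeholders, not measurements from the collections, and the value of alpha is illustrative rather than a tuned setting.

    import numpy as np

    def sigm(f, f_max, alpha=0.25):
        """Sigmoid normalisation of Equation 5.5: 1.0 at the maximum value, towards 0 below it."""
        return 2.0 / (1.0 + np.exp(-alpha * (np.asarray(f, dtype=float) - f_max)))

    def range_norm(f, f_min, f_max):
        """Range normalisation of Equation 5.6: maps values linearly onto [0, 1]."""
        return (np.asarray(f, dtype=float) - f_min) / (f_max - f_min)

    # Illustrative per-occurrence prominence scores from already-normalised features.
    energy_mu, duration, f0_mu = 0.8, 0.3, 0.6
    ps_cwl = (energy_mu * duration) ** 0.5   # CWL-style: geometric mean of mean energy and duration
    ps_gh = energy_mu * f0_mu                # GH-style: product of mean energy and mean F0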

5.2 Prominence score integration

Once a prominence score has been calculated for every word occurrence in the collection, the next step is to incorporate it into a retrieval model. This section describes the retrieval models and the different strategies explored in this thesis for integrating prominence scores into the computation of relevance scores for documents and passages. As with our prominence scores, these models and integration approaches are inspired by those originally proposed in the GH and CWL methods.

5.2.1 General integration approach

In the GH and CWL methods, prominence scores were used within a vector-space model (VSM) for IR. Recall from Section 2.1.2 that a VSM represents documents and queries by vectors in a vector space, where the significance of a term for a given document (query) is expressed as the magnitude or weight assigned to that term's dimension in the vector representation of the document (query), and is estimated by the product between the term's within-document frequency (TF) and inverse document frequency (IDF).

The general approach adopted in the GH and CWL methods to exploit prominence information is based on the assumption that significant terms in a spoken document, i.e. those that best characterise the topic of the document, are those whose occurrences are prominent to a greater extent. Thus, the basic integration approach in the GH and CWL methods simply increases the weights of terms in a document's vector representation that are deemed highly prominent in the document. In the original implementations of GH and CWL, this was achieved by combining prominence scores with TF-IDF scores.

Note that within-document and collection term frequencies are properties of a term, i.e., properties that can be attributed to a stem, lemma, or other type of indexing feature uniquely assigned to a document. By contrast, prominence scores are attributes of each individual occurrence of a term and, as such, there may be multiple such scores associated with a given term-document pair. Therefore, any attempt to combine prominence with TF-IDF scores must first decide how to aggregate the multiple scores of a term-document pair into a single value for use in the computation of a query-document matching score. The GH and CWL approaches differ in this aspect, as explained in the following sections.

5.2.2 GH's integration approach

In the GH approach, all prominence scores from separate occurrences of a term in a document are aggregated into a single score by computing their mean or by retaining their maximum value (Guinaudeau and Hirschberg, 2011). Formally, the final prominence score for a term i in a document can be given by Equation 5.7 or Equation 5.8

ps_{\mu}(i) = \frac{1}{tf_i} \sum_k ps(k, i)    (5.7)

ps_{\vee}(i) = \max_k ps(k, i)    (5.8)

where ps(k, i) is the prominence score associated with the kth occurrence of term i in the document, and the sum and max in the equations range over all occurrences of this term in the document. These two definitions emphasise different interpretations of what the prominence score of a term should capture for a given document. While ps_µ(i) emphasises that a term's overall prominence score should be high whenever the term is spoken prominently several times in the document, ps_∨(i) supports the interpretation that the overall score should be high whenever any of its occurrences is spoken prominently. Because it is unlikely that speakers will emphasise every single occurrence of the same term they utter, the maximum aggregation seems a priori the more appropriate approach. Furthermore, the prosody with which a term is mentioned will tend to vary across the document depending on factors such as the syntactic role that the word plays in its utterance, whether it introduces "new" or previously "given" information, or whether it corresponds to the first or a subsequent mention of the term in the document (Hirschberg, 2002).

Given a definition of ps(i), Guinaudeau and Hirschberg (2011) calculate the weight w_GH(i) of term i for a document using the function in Equation 5.9,

w_{GH}(i) = \frac{\theta_{ir} \, w(i) + \theta_{ps} \, ps(i)}{\theta_{ir} + \theta_{ps}}    (5.9)

where ps(i) is either ps_µ(i) or ps_∨(i), \theta_{ir} and \theta_{ps} are tuning parameters, and w(i) is a TF-IDF score. The function assigns increased weights to terms that are deemed highly significant not only on the basis of their TF-IDF score but also of their prominence score. Thus, terms that are both highly representative of the document and whose occurrences are acoustically prominent in the document will be assigned greater weight values.

Note that as the parameters θir and θps in Equation 5.9 are the same for every term and document in the collection, they can be removed from the denominator, and the resulting sum can be replaced with a linear combination as shown in Equation 5.10,

w_{GH}(i) \overset{rank}{=} \delta \, w(i) + (1 - \delta) \, ps(i)    (5.10)

where 0 \leq \delta \leq 1 determines the relative importance given to the TF-IDF and prominence scores. This alternative is preferred over Equation 5.9 since it has only one free parameter.

In the original implementation of GH, the TF-IDF score w(i) from Equation 5.9 was calculated based on a term-weighting scheme described by Lecorvé et al. (2008). This weighting scheme is defined as shown in Equation 5.11.

w_{LE}(i) = \frac{tf_i / docl}{\max_{i \in d} (tf_i / docl)} \log \frac{N}{n_i} = \frac{tf_i}{\max_{i \in d} tf_i} \log \frac{N}{n_i}    (5.11)

In the topic tracking experiments conducted by Guinaudeau and Hirschberg (2011), the collection size N and document frequencies n_i from Equation 5.11 were calculated from a corpus of news articles, different from the document collection on which the term weights were later used for their topic tracking experiments.

In all experiments with the GH approach reported in this thesis, the TF-IDF based weights w(i) from Equation 5.10 are computed using the Okapi BM25 function (Equation 2.9). Adopting BM25 weighting also implies that the final relevance score assigned to a document d for a query q is calculated following the probabilistic approach, that is, by using the scoring function shown in Equation 5.12,

S_{GH}(q, d) = \sum_{i \in q, d} w_{GH}(i)    (5.12)

instead of the cosine distance measure as implemented in the original VSM-based approach (Section 2.1.2). Even though the Okapi BM25 and VSM weighting functions are similar, in the sense that they both compute a sum of weights for terms shared by the query and the document, there are several reasons for preferring Okapi BM25 weighting over the VSM. First, Okapi BM25 has been shown to perform better than the VSM in ad-hoc retrieval tasks (Robertson et al., 1994; Buckley et al., 1994). Second, while the weighting scheme proposed by the VSM is mostly based on heuristics, the Okapi BM25 function emerged as an approximation of a well-founded theoretical model. As such, the concepts underlying the Okapi BM25 model provide a more useful framework for the interpretation of retrieval functions.
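The GH scoring scheme of Equations 5.7, 5.8, 5.10, and 5.12 amounts to a per-term linear interpolation, as the minimal sketch below illustrates. It assumes an externally supplied bm25_weight function and a mapping from terms to their per-occurrence prominence scores; both names, and the value of delta, are assumptions of this sketch rather than the tuned configuration used in the experiments.

    def aggregate_prominence(scores, use_max=True):
        """ps(i): aggregate per-occurrence scores with the max (Eq. 5.8) or the mean (Eq. 5.7)."""
        return max(scores) if use_max else sum(scores) / len(scores)

    def gh_score(query_terms, doc_prominence, bm25_weight, delta=0.8):
        """S_GH(q, d) of Equation 5.12 using the linear combination of Equation 5.10."""
        score = 0.0
        for i in set(query_terms):
            if i in doc_prominence:
                ps_i = aggregate_prominence(doc_prominence[i])
                score += delta * bm25_weight(i) + (1.0 - delta) * ps_i
        return score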

5.2.3 CWL’s integration approach

In the GH approach, the prominence scores from the occurrences of a term in a document are first aggregated into a single score, ps(i), and then combined with the term's TF-IDF weight via Equation 5.10. Note that this equation combines the ps(i) and TF-IDF scores externally, by treating TF-IDF weights and prominence scores as independent sources of evidence that contribute to the value of the term's weight.

As opposed to combining ps(i) and TF-IDF weights externally, in the CWL method the prominence scores associated with each occurrence of a term in a document are integrated within the calculation of the TF-IDF weights. More specifically, in the CWL method the occurrence-level prominence scores of a term are summed to produce an alternative estimate of the number of times that the term appears in the document. This is shown more formally in Equation 5.13.

ps_{\Sigma'}(i) = \sum_k ps(k, i)    (5.13)

In the original implementation of CWL, the occurrence scores ps(k, i) from Equation 5.13 are normalised between 0 and 1 by using the sigm normalisation function (Equation 5.5). The summation from Equation 5.13 is subsequently used to compute the TF component of a term's TF-IDF weight. In the CWL method, this is done with the function shown in Equation 5.14.

w_{CWL}(i) = (1 + \log ps_{\Sigma'}(i)) \log \frac{N}{n_i}    (5.14)

An obvious issue with Equation 5.14 is that it can output negative values when ps_{\Sigma'}(i) < 1, which is likely to occur for very infrequent terms. In particular, for a term that appears only once in the document to be scored, the summation ps_{\Sigma'}(i) from Equation 5.14 reduces to ps(1, i), which is likely to be less than 1 if the prominence score has been normalised between 0 and 1. In the experiments described in this thesis, negative term weights are avoided in the relevance score calculation by using the alternative quantity ps_{\Sigma}(i) shown in Equation 5.15.

ps_{\Sigma}(i) = \begin{cases} 0 & \text{if } tf_i = 0 \\ 1 + ps_{\Sigma'}(i) & \text{otherwise} \end{cases}    (5.15)

Instead of computing term-document weights as in Equation 5.14, the experiments conducted in this thesis are carried out using the Okapi BM25 function (Equation 2.9). The resulting adaptation of BM25 with integrated prominence scores is then shown in Equation 5.16,

w_{CWL}(i) = \frac{(k_1 + 1) \, ps_{\Sigma}(i)}{ps_{\Sigma}(i) + k_1 \left(1 - b + b \, \frac{docl}{avel}\right)} \cdot \frac{(k_3 + 1) \, qf_i}{k_3 + qf_i} \cdot cfw(i)    (5.16)

where the variables k_1, b, k_3, qf_i, docl, and avel take the same values as in the original BM25 formulation (Equation 2.9).

The overall effect of using the quantity ps_Σ(i) instead of the original frequency counts of the term (tf_i) in Equation 5.16 is to produce term weights that are sensitive to the prominence scores associated with this term in the document. Thus, a term will acquire a high weight value if its associated sum of prominence scores is high. Recall from Section 2.1.2 that in Equation 2.9 docl is the length of the document to be scored, equal to the total number of term occurrences in the document, while avel is the average document length in the collection.

In our adaptation of the CWL approach, the values of docl and avel are still estimated based on the original term counts from each document rather than on a sum of prominence scores. The reason why docl and avel can still be calculated from the original term counts is that the ratio docl/avel remains approximately the same in either case^3.

Note also that if sigm is used to normalise the prominence scores ps(k, i) in Equation 5.15, then the α parameter in sigm can be altered to increase or reduce the emphasis given to extreme prominence scores. As can be seen from the plots in Figure 5.2, sigm becomes approximately constant when α approaches zero, and so ps_{Σ'}(i) ≈ tf_i. Therefore, as α approaches zero, the weights computed by the CWL function (Equation 5.16) approximate those computed by the original BM25 function.

In the experiments reported in this thesis, the final ranking of documents for a query q under the CWL approach is computed with the ranking function shown in Equation 5.17.

S_{CWL}(q, d) = \sum_{i \in q, d} w_{CWL}(i)    (5.17)

Based on this definition, the function SCWL (q, d) will produce greater scores for documents containing a high number of terms with a relatively high summation of prominence scores across occurrences.
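The CWL adaptation can likewise be written down compactly; the sketch below assumes pre-computed sums of normalised occurrence-level prominence scores, a collection frequency weight cfw per term (e.g. the BM25 IDF component), and common default parameter values rather than the settings tuned in the thesis.

    def cwl_term_weight(ps_sum, qf, cfw, docl, avel, k1=1.2, b=0.75, k3=8.0):
        """CWL-adapted BM25 term weight (Equations 5.15 and 5.16)."""
        ps = 1.0 + ps_sum  # Equation 5.15 for a term that occurs at least once
        tf_part = ((k1 + 1.0) * ps) / (ps + k1 * (1.0 - b + b * docl / avel))
        qf_part = ((k3 + 1.0) * qf) / (k3 + qf)
        return tf_part * qf_part * cfw

    def cwl_score(query_tf, doc_ps_sums, cfw, docl, avel):
        """S_CWL(q, d) of Equation 5.17: a sum over the terms shared by query and document.

        query_tf maps query terms to their query frequencies qf_i, doc_ps_sums maps
        document terms to ps_Sigma'(i), and cfw maps terms to collection frequency
        weights; all three names are assumptions of this sketch.
        """
        return sum(
            cwl_term_weight(doc_ps_sums[i], qf, cfw[i], docl, avel)
            for i, qf in query_tf.items() if i in doc_ps_sums
        )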

5.2.4 A rough interpretation of GH and CWL under the PRF

The approaches presented in Sections 5.2.2 and 5.2.3 seek to incorporate additional occurrence features into the Okapi BM25 retrieval function. While in GH the prominence scores are externally combined with BM25 scores in a linear fashion, in CWL the sum of the prominence scores of a term is used in the internal calculation of its BM25 score.

At first glance, these integration approaches may seem ad hoc. In fact, no theoretical justifications exist in the literature as to why the scoring functions described in Sections 5.2.2 and 5.2.3 are appropriate for integrating prominence information into a retrieval function, beyond perhaps the intuitive interpretations underlying the application of these integration approaches in the context of a VSM. Under the VSM interpretation, the greater the incidence weight of a term in the representation of a document, the more this document is considered to be about the topic induced by that term. Thus, in the context of a VSM, increasing the incidence weight of a term in proportion to its inferred grade of prominence seems at least intuitively reasonable.

An alternative interpretation of the GH method can be given from the viewpoint of the probabilistic relevance framework (PRF), previously described in Section 2.1.2. Consider the following representation for a document d, \vec{d} = \langle (d_1, f_1), \ldots, (d_M, f_M) \rangle, where each d_i is a discrete random variable representing the frequency of term i in d and each f_i is a continuous random variable representing a feature value associated with term i in d.

3 In fact, in the BBC1 collection, the Pearson’s r correlation between ratios docl/avel based on tfi and psΣ (i) is greater than .999 for documents and .987 for passages.

Assume, further, that the variables d_i and f_i are conditionally independent given relevant and non-relevant documents. Starting from the probabilistic ranking principle in Equation 2.3, one can obtain the approximation shown in Equation 5.18.

\frac{P(rel \mid \vec{d}, \vec{q})}{P(\overline{rel} \mid \vec{d}, \vec{q})}
    \overset{rank}{=} \sum_{i \in q} \log \frac{P(d_i = tf_i, f_i < x_i \mid rel)}{P(d_i = tf_i, f_i < x_i \mid \overline{rel})}
    = \sum_{i \in q} \log \frac{P(d_i = tf_i \mid rel) \, P(f_i < x_i \mid rel)}{P(d_i = tf_i \mid \overline{rel}) \, P(f_i < x_i \mid \overline{rel})}
    = \sum_{i \in q} \left[ \log \frac{P(d_i = tf_i \mid rel)}{P(d_i = tf_i \mid \overline{rel})} + \log \frac{P(f_i < x_i \mid rel)}{P(f_i < x_i \mid \overline{rel})} \right]
    \approx \sum_{i \in q, d} w_{BM25}(i) + \sum_{i \in q} ps(i)    (5.18)

Thus, if the additional term-level features to be incorporated are assumed to be independent of the frequencies with which terms occur in relevant and non-relevant documents, the form of the resulting retrieval function closely resembles that of the GH function (Equation 5.12). Under this interpretation, it can be said that the GH function makes a strong assumption about the prominence score of a term, namely that this score is independent of the number of times that the term appears in a document. This is a potential limitation of the approach, since previous research suggests that the prosody of a word is affected by its frequency and predictability (Hirschberg, 2002; Wagner and Watson, 2010).

In the CWL approach, the term frequency counts tfi are replaced by the quantity psΣ(i), defined in Equation 5.15. By re-defining the random variables di to represent the quantity psΣ(i) in the document representation, the first steps in the derivation of the 2-Poisson model can be applied to this representation to obtain a result equivalent to that presented in Equation 2.8. However, in this case the term incidence variables di are continuous rather than discrete and cannot strictly be assumed Poisson. A possible alternative is to use the Gamma-based approximation \(E_i(x) = \lambda_{E_i}^{x}\, e^{-\lambda_{E_i}}\, \Gamma(x+1)^{-1}\) (Ilienko, 2013), and its analogue for \(\bar{E}_i(x)\), in Equation 2.8, and maintain similar Poisson distributional assumptions. Although providing a formal derivation of the CWL formula is beyond the scope of this thesis, it can be argued that a similar approximation to Equation 2.8, in which tfi is replaced by psΣ(i), would also be appropriate under these conditions. By including factors for query term frequencies and document length normalisation, the resulting approximation would match the form of the CWL function (Equation 5.16). An aspect of the CWL approach that is worth noting based on this re-interpretation is that the model does not make explicit use of term frequency counts. This is by design, since the term frequency counts have been removed from the document representation and replaced by a summation of prominence scores. Although not explicitly modelled, term frequencies are still considered, since the quantity psΣ(i) is a sum over the "tfi" occurrences of the term. Also, the sum psΣ(i) is likely to be correlated with tfi in practice. Despite this, using distorted term frequencies may negatively impact the ranking effectiveness of the CWL function, since the frequency with which a term appears in a document is normally a useful feature for distinguishing between relevant and non-relevant documents.

Table 5.1: Retrieval tasks, collections, topics, and transcript types in which the GH and CWL functions were evaluated.

                                              Transcript types
Task      Collection  Topics                  Queries   Documents
SDR, SPR  BBC1        SH13                    MAN       LIMSI
SDR, SPR  BBC2        SH14, SAVA              MAN       LIMSI, NST
SDR, SPR  SDPWS2      SD2, SQD1, SQD2         MAN       MAN

5.3 Experiments with heuristic retrieval functions

This section presents a series of retrieval experiments conducted over the BBC and SDPWS collections with the GH and CWL ranking functions. These experiments aim to assess whether the GH and CWL functions can benefit from utilising prominence scores in addition to the standard lexical-based estimates of TF and IDF. The section begins by defining the retrieval tasks, then describes the test collections and evaluation measures used, and continues by presenting the experiments conducted.

5.3.1 Tasks and test collections

The effectiveness of the GH and CWL functions was studied in two SCR tasks that differ in terms of the type of unit to be retrieved:

• A spoken document retrieval (SDR) task, which focuses on the ranking of programme IDs in the case of the BBC1 or BBC2 collections, or presentation IDs in the case of the SDPWS2 collection.

• A spoken passage retrieval (SPR) task, which consists of generating a ranking of pre-defined non-overlapping passages extracted from the documents of the BBC1, BBC2, or SDPWS2 collections.

These experiments used the SH13, SH14, and SAVA topic sets for the BBC1 and BBC2 collections, and the SD2, SQD1, and SQD2 topic sets for the SDPWS2 collection. For the SQD1 and SQD2 topic sets, which contain spoken queries, experiments were conducted with the manual (MAN) transcripts of the speech queries only, whereas for the spoken documents, experiments were carried out with the transcript types shown in Table 5.1. The table also summarises all tasks and test collections in which the GH and CWL functions were evaluated. In total, for each task, the retrieval functions were evaluated across eight different test conditions.

Table 5.2: Length statistics of segmented transcripts from the BBC1, BBC2, and SDPWS2 collections.

Collection  Transcript  Segmentation   Passages  Avg. len.  S.D. len.  Max. len.
BBC1        LIMSI       Fixed-length   52,957    102.4      42.8       227
BBC2        LIMSI       Fixed-length   104,500   105.1      42.6       335
BBC2        NST         Fixed-length   105,188   81.6       34.6       191
SDPWS2      MAN         Slide-groups   2,328     74.8       67.6       757

The main motivation for preferring manual over automatic transcripts in the experiments with the SDPWS collection is to isolate the potential effects that the use of prominence information may introduce in the SCR process from external factors caused by ASR errors. As described in Section 3.3, previous research has shown that prominent words with extreme prosodic realisations are more likely to be misrecognised by the ASR (Goldwater et al., 2010). Exploring the quality of SCR methods over error-free transcripts enables us to control for ASR error effects and, additionally, to assess how these acoustically-enhanced retrieval methods would perform under ideal conditions. Since reference transcripts with precise word time-stamps are not available for the BBC data, experiments are carried out with ASR transcripts only. Because a large number of BBC recordings are completely out-of-sync (more than 10 seconds of shift) with respect to the ASR transcripts and the audio track, forced and flexible alignment techniques would be difficult to apply to obtain word timestamps for the BBC subtitles. For this reason, experiments with prosodic-based techniques on the BBC data were only conducted over ASR transcripts. The pre-defined passages used in the SPR task were obtained with different segmentation strategies for the BBC and SDPWS collections respectively. The transcripts from the BBC1 and BBC2 collections were split into non-overlapping segments of 90 seconds in length via a conventional sliding-window approach. In this case, the SPR task was re-stated as that of producing a ranking of pre-defined triplets (id, start, end), where id is a programme ID, and start and end indicate the passage's starting and ending times within the TV programme. The transcripts from the SDPWS2 collection were split based on their associated slide-group segments (SGS), described in more detail in Section 4.2.2. Thus, the SPR task with the SDPWS collection consisted of producing an ordering of pre-defined slide-group segment IDs. Table 5.2 provides general statistics about each segmented collection, while statistics about the unsegmented (document) collections were previously summarised in Tables 4.2a, 4.2b, and 4.15. In comparison with their unsegmented counterparts, the segmented collections contain about 20-30 times more retrieval units (passages).
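The fixed-length segmentation of the BBC transcripts can be sketched as follows. The triplet representation mirrors the (id, start, end) retrieval units described above; the function name, the word-tuple layout, and the default 90-second window are illustrative assumptions rather than the exact tooling used in the experiments.

```python
def fixed_length_passages(programme_id, words, window=90.0):
    """Split a time-aligned transcript into non-overlapping fixed-length passages.

    words: list of (token, start_time, end_time) tuples for one programme.
    Returns a list of ((programme_id, win_start, win_end), tokens) pairs, one per
    window, mirroring the (id, start, end) units used in the SPR task.
    """
    passages = []
    if not words:
        return passages
    end_of_audio = max(end for _, _, end in words)
    win_start = 0.0
    while win_start < end_of_audio:
        win_end = win_start + window
        # A word is assigned to the window in which it starts.
        tokens = [tok for tok, start, _ in words if win_start <= start < win_end]
        if tokens:
            passages.append(((programme_id, win_start, win_end), tokens))
        win_start = win_end
    return passages
```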

Evaluation measures

In the SDR task, the quality of a ranking of documents produced for a query was measured in terms of mean average precision (MAP). MAP was also used for evaluation of passage retrieval effectiveness in the SPR task with the SDPWS2 collection. Recall, however, that

the organisers of the NTCIR SD2 task did not originally produce relevance assessments for slide-group segments but rather for arbitrary spans of consecutive IPUs. For the purpose of conducting slide-group retrieval experiments with the SD2 topics, the relevance judgements for slide-group segments for a particular topic were inferred based on the relevance status of the IPUs they contain. In particular, a slide-group segment was deemed relevant to a topic whenever one or more of the IPUs falling within the boundaries of that segment were marked as relevant to the topic in the relevance assessments. Recall from Section 4.1.4 that the relevance assessments for the SH13, SH14, and SAVA topics were originally produced for passages submitted by different SCR systems. Because these systems may have segmented the transcripts of the BBC collections differently, there may not be a 1:1 correspondence between the segments (id, start, end) produced by the windowing segmentation approach described previously and the segments included in the relevance assessments for these topics. Consequently, standard MAP cannot be used to evaluate SPR rankings produced for the SH13, SH14, and SAVA topics. In order to measure SPR effectiveness for the SH13, SH14, and SAVA topics, a simple extension of AP called "overlap AP" (oAP) (Aly et al., 2013a) was used, in which retrieved segments are deemed relevant if they overlap with any region marked as such in the relevance assessments. Note, however, that there can be multiple segments overlapping with a single relevant region in a ranked list of results. In the original formulation of oAP (Aly et al., 2013a), multiple results overlapping a relevant passage found in the rank list are counted multiple times. By contrast, in the experiments in this thesis, an alternative version of this measure is used where only the top-ranked passage overlapping a relevant one is considered relevant, to avoid accounting for duplicate relevant results in the calculation of oAP.

Formally, for a ranking s1, . . . , sN of triplets produced for a query and a set r1, . . . , rR of triplets known to be relevant to that query, oAP can be calculated as

\[
\mathrm{oAP} = \frac{1}{R} \sum_{k=1}^{N} oP[k] \cdot o_k, \qquad oP[k] = \frac{1}{k} \sum_{i=1}^{k} o_i,
\]

where

\[
o_k =
\begin{cases}
1 & \text{if } \exists\, j \le R : \mathrm{over}(s_k, r_j) \,\wedge\, \forall\, i < k : \neg\, \mathrm{over}(s_i, r_j) \\
0 & \text{otherwise,}
\end{cases}
\]

and

\[
\mathrm{over}(s, r) \;\equiv\; id(s) = id(r) \;\wedge\; \Big( \big( start(s) \le start(r) \le end(s) \big) \;\vee\; \big( start(r) \le start(s) \le end(r) \big) \Big).
\]

Given the above definition, overlap MAP (oMAP) is defined as the average of oAP across a set of queries.
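The oAP variant defined above translates directly into code. The sketch below follows the overlap predicate and the single-counting rule from the definition; the helper names and the triplet representation are assumptions for illustration, and oMAP would simply be the mean of this value over all queries.

```python
def overlaps(s, r):
    # over(s, r): same programme ID and the two time spans intersect.
    s_id, s_start, s_end = s
    r_id, r_start, r_end = r
    return s_id == r_id and (s_start <= r_start <= s_end or r_start <= s_start <= r_end)

def oap(ranking, relevant):
    """Overlap average precision with duplicate relevant matches suppressed.

    ranking:  list of (id, start, end) triplets in ranked order.
    relevant: list of (id, start, end) triplets judged relevant.
    A retrieved passage counts as relevant (o_k = 1) only if it is the
    highest-ranked passage overlapping some relevant region."""
    R = len(relevant)
    if R == 0:
        return 0.0
    o = []
    for k, s in enumerate(ranking):
        hit = any(overlaps(s, r) and
                  not any(overlaps(prev, r) for prev in ranking[:k])
                  for r in relevant)
        o.append(1.0 if hit else 0.0)
    score = 0.0
    for k in range(len(ranking)):
        precision_at_k = sum(o[:k + 1]) / (k + 1)
        score += precision_at_k * o[k]
    return score / R
```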

Baseline results and parameter estimation

To determine the potential utility of the GH and CWL retrieval methods, their effectiveness was compared against that achieved by the standard text-based Okapi BM25 function (Equation 2.9), which does not utilise prominence scores in the estimation of term weights. Before any experiment can be run with the BM25-based ranking functions, the parameters b, k1, and k3 need to be set to specific values. Based on extensive experimentation in the context of the TREC evaluation campaigns, it is generally recommended to set b = 0.75 and k1 = 1.2, while k3 can be set to zero if queries are known to be short or to a positive value otherwise (Robertson et al., 2009). Nonetheless, adjusting these parameters appropriately for a particular task and test collection can often provide increased retrieval effectiveness in comparison to using the recommended settings, particularly if the latter have been estimated for tasks and collections that are different to the ones being tested (Chowdhury et al., 2002). For this reason, experiments were first carried out with the text-based BM25 function to determine well-performing parameter settings for each task and test collection. Existing approaches to optimising multiple parameters of retrieval models can be classified into two broad categories: those that try to maximise retrieval effectiveness metrics that are defined over the ranks of the relevant documents (Taylor et al., 2006), such as MAP or NDCG, and those that try to optimise alternative objective functions, commonly designed to correlate well with rank-dependent metrics and to permit, at the same time, the application of gradient-descent methods (Burges et al., 2005). For the tuning of the BM25 parameters, a general optimisation method that seeks to maximise MAP directly on a given set of queries was implemented for the experiments of this thesis. This method can be considered a more efficient alternative to exhaustive search since it selectively explores regions of the search space that seem more likely to contain a global or local optimum. The particular optimisation method implemented belongs to the family of unconstrained line-search optimisation methods (Luenberger and Ye, 1984), and has already been used to optimise BM25 parameters in previous research (Taylor et al., 2006). The details of this algorithm are presented in Appendix D. Tables 5.3a and 5.3b show the retrieval effectiveness obtained with alternative and recommended parameter settings for the SDR and SPR tasks respectively when the k3 parameter is set to zero. In these tables, bold figures and * symbols respectively mark significant (p < 0.05) and highly significant (p < 0.01) differences based on paired t-tests. These results show that the effectiveness of BM25 varies widely across test conditions and that BM25 performs better when using the alternative settings. In addition, while optimal parameters remain consistent across different transcripts (LIMSI and NST), they vary more widely across topic sets.

Table 5.3: Comparison between Okapi BM25 with TREC's recommended parameter settings (b = .75 and k1 = 1.2) and alternative settings (best).

(a) SDR task.

                    BM25 (best)              BM25
Topics  Transcript  b     k1     MAP         MAP
SH13    LIMSI       .47   3.15   .546        .517
SH14    LIMSI       .20   6.40   .418        .380
SH14    NST         .26   4.85   .465        .427
SAVA    LIMSI       .30   5.65   .386*       .335
SAVA    NST         .26   10.0   .383        .338
SD2     MAN         .66   3.10   .719        .711
SQD1    MAN         .50   4.42   .718        .640
SQD2    MAN         .69   6.16   .668*       .587

(b) SPR task.

                    BM25 (best)              BM25
Topics  Transcript  b     k1     (o)MAP      (o)MAP
SH13    LIMSI       .80   1.11   .316        .305
SH14    LIMSI       .63   1.04   .337        .328
SH14    NST         .65   0.98   .330*       .322
SAVA    LIMSI       .57   0.74   .304        .292
SAVA    NST         .59   0.75   .242        .237
SD2     MAN         .10   0.73   .451        .423
SQD1    MAN         .33   3.28   .241        .210
SQD2    MAN         .75   5.65   .258*       .227

The largest differences are observed between written (SD2) and spoken queries (SQD1 and SQD2), and may be due to differences in length. For the SQD1 and SQD2 queries, which contain a greater number of terms than the SD2 queries, within-document term frequencies may become increasingly useful for retrieval as they may help distinguish which terms in the query are more discriminative of relevance. Since in most cases significant improvements can be obtained with the alternative settings shown in the tables, these were used in all experiments reported hereafter with BM25-derived functions, including those conducted with the GH and CWL functions.
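The exact line-search procedure used to obtain the alternative settings above is given in Appendix D. As a rough illustration of the general idea only, the following coordinate-wise sketch alternately optimises b and k1 against MAP on a set of training queries; the evaluate_map callback, the golden-section sub-routine, the search ranges, and the stopping rule are all assumptions, not the algorithm from the appendix, and the one-dimensional search presumes the objective is reasonably well behaved around its optimum.

```python
def line_search_1d(f, lo, hi, iters=20):
    # Golden-section search for a (locally) maximal value of f on [lo, hi].
    phi = 0.6180339887498949
    a, b = lo, hi
    for _ in range(iters):
        x1 = b - phi * (b - a)
        x2 = a + phi * (b - a)
        if f(x1) >= f(x2):
            b = x2
        else:
            a = x1
    return (a + b) / 2.0

def tune_bm25(evaluate_map, b_range=(0.0, 1.0), k1_range=(0.0, 10.0), sweeps=3):
    """Coordinate-wise tuning of (b, k1) to maximise MAP on training queries.

    evaluate_map(b, k1) -> MAP obtained by running BM25 retrieval with those
    parameter values; this callback and the ranges are illustrative."""
    b, k1 = 0.75, 1.2  # start from the commonly recommended TREC settings
    for _ in range(sweeps):
        b = line_search_1d(lambda x: evaluate_map(x, k1), *b_range)
        k1 = line_search_1d(lambda x: evaluate_map(b, x), *k1_range)
    return b, k1
```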

5.3.2 Comparison between GH, CWL, and Okapi BM25

This section presents the results of experiments that compare the effectiveness of the GH and CWL methods with Okapi BM25.

Prominence scores considered

Table 5.4 summarises the possible variations of prominence scores that were explored with the GH and CWL integration approaches. A particular prominence score is derived by applying any of the functions shown in each individual cell of the table in a left-to-right fashion. In the table, the symbol ◦ denotes function composition. For instance, the score that results from taking the maximum F0 score across all occurrences of a term in a document, when each occurrence score is defined as the range-normalised minimum value of its speaker-standardised F0(n) contour, is written as ∨ ◦ range ◦ ∧ ◦ F0 or, in short, ∨ ◦ range(F0∧). In what follows, a "base" feature will refer to any feature that can be derived from the family of feature contours Erms, El, and F0, or from D.

Table 5.4: Summary of the prominence score derivations and integration approaches explored in the experiments with prominence scores. For each integration method, GH and CWL, ps(i) specifies different alternatives for how occurrence-level scores were aggregated into a term-level score. Similarly, ps(k, i) indicates how feature contours were aggregated into an occurrence-level score.

Integration  ps(i)   ps(k, i)                      CZ(n)
GH           ∨, µ    range ◦ {D, ∨, ∧, µ, σ}       Erms, El, F0
CWL          Σ       sigm ◦ {D, ∨, ∧, µ, σ}        Erms, El, F0

Figure 5.3 illustrates how the aggregation process of prominence scores was carried out at the contour, occurrence, and term levels. The example from the figure is for a hypothetical document containing three occurrences of a term ti and two occurrences of a second term ti+1, appearing in different positions within the document. At the contour level, the individual occurrences of ti and ti+1 have associated feature vectors Erms(n),

El(n), and F0(n), represented in the diagram by arrays of blue, red, and green boxes respectively. At the occurrence level, each of these contour vectors is mapped onto a single prominence score ps(k, ti) for k = 1, 2, 3 and ps(k, ti+1) for k = 1, 2 by applying an aggregation function (∨, ∧, µ, or σ). In the diagram, the aggregation process is depicted by dashed lines running across an array of values. At the term level, prominence scores for individual occurrences are grouped by feature type and term and then aggregated via Σ, µ, or ∨. For instance, the three scores derived from F0 (green boxes) for term ti are first gathered into a single array of three values and then mapped onto a single F0 score for term ti for the document. The aggregation stage at the top of the figure depicts how document-level scores can be obtained from term-level scores. The latter were not used in the experiments reported in this section but in those reported later in Section 5.4.

While term-level scores derived from different base features could be additionally combined to form more complex features, evaluating every possible feature combination via the GH or CWL integrations would not be practical, nor would it facilitate the analysis of the performance of individual features. For this reason, the experiments presented in this section evaluate the effectiveness of prominence scores derived from a single base feature.

Tables 5.5a and 5.5b show general statistics about the occurrence-level features obtained for the BBC and SDPWS collections respectively, while Figures 5.4a and 5.4b show how feature values are distributed in these collections. Most features follow a normal distribution with distinctive spread depending on the nature of the feature as well as how it was computed. Features calculated with a max (∨) for each occurrence have the greatest spread (increased variance), while those calculated with a mean (µ) and min (∧) have

Figure 5.3: Visualisation of the multi-level aggregation approach used to calculate prominence scores of terms. The example shows three occurrences of term ti and two of ti+1 in a document d. Dashed lines represent an aggregation function being applied to an array of data points, while continuous lines represent a "copy" operation.

smaller variance. In contrast to the features calculated for the SDPWS collection (Japanese), those calculated for the BBC collection (English) tend to have greater means and ranges (difference between max and min), which suggests that these features vary more widely in the English broadcast TV data than in the Japanese monologues. Table 5.6 shows how the average value of each feature varies across TV show genres in the BBC2 collection. Sports, Quiz, and News shows are among those in which speakers generally speak louder than average (high Erms^∨ and El^∨), whilst speakers tend to use lower volumes in Soap opera, Drama, and Children shows. The figures also indicate that News content is characterised by high variations in prosody, as Erms^σ, El^σ, and F0^σ are greater for this genre. Speech encountered in children's and comedy shows is characterised by words with shorter duration than average.
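Returning to the aggregation process illustrated in Figure 5.3 and summarised in Table 5.4, the contour-to-occurrence-to-term computation can be written compactly as below. The dictionary of aggregators, the list-based data layout, and the function names are assumptions made for illustration; only the two-stage aggregation itself reflects the procedure described in the text.

```python
import statistics

AGGREGATORS = {
    "max": max,
    "min": min,
    "mean": statistics.mean,
    "std": statistics.pstdev,
}

def occurrence_score(contour, agg="max"):
    # Contour level: collapse the per-frame feature contour C_Z(n) of one spoken
    # occurrence of a term into a single occurrence-level score ps(k, i).
    return AGGREGATORS[agg](contour)

def term_score(contours, contour_agg="max", occurrence_agg="max"):
    # Term level: aggregate the occurrence-level scores of all occurrences of the
    # term in the document into a single ps(i), e.g. its most prominent mention.
    occ_scores = [occurrence_score(c, contour_agg) for c in contours]
    return AGGREGATORS[occurrence_agg](occ_scores)

# Example: three occurrences of a term, each with a short speaker-standardised
# F0 contour; ps(i) here is the maximum of the per-occurrence maxima.
f0_contours = [[-0.2, 0.4, 1.1], [0.0, 0.3], [-0.5, 0.9, 0.6, 0.2]]
print(term_score(f0_contours, "max", "max"))  # -> 1.1
```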

Results of experiments with the GH function

For the experiments with the GH integration approach, there were a total of 26 possible single-feature derivations of ps(i) to be tested over the 8 test collections in the SDR and SPR tasks, and therefore 416 possible conditions to be evaluated. Tables 5.7a and 5.7b show the best results obtained with the GH integration approach for the SDR and SPR tasks respectively.

Table 5.5: General statistics (mean, standard deviation, max, min, and 25, 50, 75 percentiles) of occurrence-level prominence scores (features) for words in:

(a) the BBC2 collection (11 million words)

Feature   Mean    Std    Min    25%    50%    75%    Max
El^µ      0.02    0.50   -2.38  -0.33  -0.02  0.32   5.12
El^σ      0.83    0.27   0.01   0.65   0.82   1.00   3.48
El^∨      1.79    0.94   -2.10  1.15   1.78   2.41   10.07
El^∧      -1.24   0.36   -3.67  -1.47  -1.29  -1.07  2.94
Erms^µ    0.04    0.61   -3.21  -0.40  -0.02  0.41   7.60
Erms^σ    0.73    0.31   0.00   0.50   0.72   0.95   4.30
Erms^∨    1.21    0.90   -2.99  0.60   1.23   1.81   10.99
Erms^∧    -1.08   0.51   -5.56  -1.39  -1.19  -0.89  4.53
F0^µ      0.02    0.59   -3.06  -0.40  -0.02  0.39   6.21
F0^σ      0.74    0.32   0.00   0.56   0.77   0.94   3.84
F0^∨      1.03    0.70   -3.06  0.63   0.95   1.36   10.98
F0^∧      -1.06   0.58   -8.64  -1.38  -1.21  -0.98  5.31
D         0.60    1.01   -2.78  -0.12  0.44   1.14   15.29

(b) the SDPWS collection (200 thousand words)

Feature   Mean    Std    Min    25%    50%    75%    Max
El^µ      0.00    0.51   -1.49  -0.36  -0.05  0.30   3.75
El^σ      0.76    0.29   0.00   0.56   0.76   0.95   3.77
El^∨      1.62    1.02   -1.42  0.94   1.60   2.26   12.62
El^∧      -1.07   0.37   -1.78  -1.30  -1.17  -0.96  3.21
Erms^µ    -0.02   0.61   -1.49  -0.45  -0.09  0.33   5.81
Erms^σ    0.63    0.35   0.00   0.38   0.60   0.84   9.22
Erms^∨    1.04    1.07   -1.40  0.30   0.95   1.66   30.42
Erms^∧    -0.96   0.43   -1.60  -1.22  -1.09  -0.88  4.11
F0^µ      -0.01   0.59   -2.08  -0.40  -0.02  0.36   4.40
F0^σ      0.67    0.36   0.00   0.49   0.70   0.87   3.02
F0^∨      0.97    0.95   -2.08  0.48   0.83   1.30   7.13
F0^∧      -1.03   0.60   -2.08  -1.38  -1.15  -0.98  4.28
D         0.23    1.02   -1.58  -0.45  0.09   0.65   10.01

Table 5.6: Mean values of occurrence-level features in the BBC2 collection for different TV genres.

Genre        El^µ  El^σ  El^∨  El^∧   Erms^µ  Erms^σ  Erms^∨  Erms^∧   F0^µ  F0^σ  F0^∨  F0^∧   D
Chat         0.02  0.82  1.75  -1.21  0.03    0.72    1.15    -1.07    0.01  0.68  0.93  -1.01  0.61
Children     0.02  0.84  1.74  -1.29  0.03    0.75    1.16    -1.16    0.02  0.72  0.98  -1.10  0.52
Comedy       0.02  0.82  1.71  -1.20  0.04    0.72    1.18    -1.04    0.03  0.69  0.95  -1.00  0.53
Documentary  0.03  0.83  1.83  -1.25  0.04    0.73    1.23    -1.09    0.01  0.73  1.08  -1.05  0.62
Drama        0.02  0.80  1.71  -1.22  0.04    0.67    1.10    -1.01    0.01  0.68  1.03  -0.88  0.56
Music        0.02  0.82  1.79  -1.23  0.04    0.71    1.18    -1.06    0.00  0.73  1.08  -1.01  0.61
News         0.02  0.86  1.88  -1.25  0.04    0.78    1.33    -1.12    0.02  0.78  1.04  -1.16  0.64
Quiz         0.03  0.86  1.90  -1.23  0.06    0.78    1.32    -1.11    0.03  0.76  1.04  -1.10  0.58
Reality      0.01  0.80  1.69  -1.19  0.03    0.69    1.12    -1.03    0.01  0.70  0.98  -0.99  0.57
Soap opera   0.03  0.77  1.62  -1.13  0.04    0.63    1.03    -0.90    0.02  0.68  0.99  -0.88  0.53
Sports       0.03  0.85  1.82  -1.25  0.06    0.75    1.24    -1.08    0.04  0.77  1.04  -1.04  0.60

Figure 5.4: Distribution of occurrence-level prominence scores (features) for words in: (a) the BBC2 collection; (b) the SDPWS collection. Each panel shows one box plot per feature (D and the µ, σ, ∨, and ∧ aggregates of El, Erms, and F0).

Table 5.7: Comparison of retrieval effectiveness between the best instantiation of GH and Okapi BM25.

(a) SDR task.

                    GH (best)                                BM25
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  δ      MAP       MAP
SH13    LIMSI       µ      σ         El     0.12   .572      .546
SH14    LIMSI       ∨      σ         F0     0.22   .423      .418
SH14    NST         ∨      ∧         El     0.31   .469      .465
SAVA    LIMSI       ∨      σ         F0     0.09   .391      .386
SAVA    NST         ∨      µ         El     0.40   .384      .383
SD2     MAN         ∨      ∧         Erms   0.26   .722      .719
SQD1    MAN         ∨      D         –      0.21   .724      .718
SQD2    MAN         ∨      σ         Erms   0.07   .687      .668

(b) SPR task.

                    GH (best)                                BM25
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  δ      (o)MAP    (o)MAP
SH13    LIMSI       ∨      σ         Erms   0.11   .330      .316
SH14    LIMSI       ∨      µ         Erms   0.69   .337      .337
SH14    NST         ∨      σ         El     0.75   .330      .330
SAVA    LIMSI       ∨      ∨         F0     1.00   .304      .304
SAVA    NST         ∨      σ         F0     0.76   .242      .242
SD2     MAN         ∨      D         –      0.69   .451      .451
SQD1    MAN         ∨      ∨         F0     1.00   .241      .241
SQD2    MAN         ∨      D         –      0.88   .258      .258

The results shown for each test condition are for the best performing prominence score found among the 26 possible derivations of scores from base features. Also, these results are for optimised values of the δ parameter, which are likewise depicted in the tables. Recall that the δ parameter in the GH function (Equation 5.10) controls the amount of influence that prominence scores have on the final weight of a term. Values of δ close to 0 signify a major contribution from prominence scores and a minor contribution from lexical scores. Based on the MAP scores from Table 5.7a, it can be seen that the GH method provided, in the best-case scenario, only minor, mostly non-significant improvements in document retrieval effectiveness over the BM25 baseline. In the SPR task, however, differences in MAP were generally minuscule and in no case significant, meaning that the GH retrieval function could not outperform the BM25 baseline, even when using the best possible combination of features and δ values. In addition, the fact that the best values for δ are generally greater in the SPR results than in the SDR results suggests that prominence scores are potentially more effective when used in the latter task, which involves the ranking of larger retrieval units in which all occurrences of a term in a document are considered when computing its aggregated prominence score ps(i).

Results of experiments with the CWL function

In the case of the CWL integration method, there were 13 possible variations of ps(i) and therefore 208 experimental conditions to be evaluated.

Table 5.8: Comparison of retrieval effectiveness between the best instantiation of CWL and Okapi BM25.

(a) SDR task.

                    CWL (best)                                        BM25
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  k1     α      MAP         MAP
SH13    LIMSI       Σ      σ         El     0.30   1.00   .562        .546
SH14    LIMSI       Σ      σ         F0     6.34   0.05   .419        .418
SH14    NST         Σ      µ         F0     4.88   0.07   .465        .465
SAVA    LIMSI       Σ      ∨         Erms   1.05   0.18   .392        .386
SAVA    NST         Σ      ∧         F0     9.50   0.05   .385        .383
SD2     MAN         Σ      σ         El     0.35   1.00   .721        .719
SQD1    MAN         Σ      σ         F0     0.78   1.00   .738        .718
SQD2    MAN         Σ      µ         Erms   0.04   0.95   .686        .668

(b) SPR task.

                    CWL (best)                                        BM25
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  k1     α      (o)MAP      (o)MAP
SH13    LIMSI       Σ      σ         Erms   0.08   1.00   .336        .315
SH14    LIMSI       Σ      ∨         F0     1.04   0.00   .337        .337
SH14    NST         Σ      ∨         F0     0.98   0.00   .330        .330
SAVA    LIMSI       Σ      ∨         F0     0.74   0.00   .304        .304
SAVA    NST         Σ      σ         El     0.40   0.45   .243        .242
SD2     MAN         Σ      µ         Erms   0.22   0.33   .458        .451
SQD1    MAN         Σ      ∨         El     0.75   0.21   .256        .241
SQD2    MAN         Σ      σ         F0     2.55   0.48   .263        .258

Tables 5.8a and 5.8b show the results obtained by the instantiation of CWL that achieved the highest MAP score, considering all possible ps(i) instantiations and values for the k1 and α parameters. The results obtained in the SDR task suggest once again that the use of prominence scores can only provide minor improvements in retrieval effectiveness over the BM25 baseline. Compared to GH, the CWL approach obtained slight improvements over BM25 in the SPR experiments with the Japanese collections (last three rows in Table 5.8b). However, these improvements may be attributed to chance, as the observed differences are no longer statistically significant at 95% confidence according to a t-test if the experiments are repeated with minor variations of the α parameter in the order of 0.01. In particular, for α = 0.32 and α = 0.34 and the SD2 topic set, the CWL function achieves MAP scores of 0.4559 and 0.4557 respectively, with p-values of 0.07 and 0.14 based on paired t-tests. The fact that the best value of α was zero in most of the SPR runs with the BBC collection indicates that the alternative prominence-based estimates of term frequency provide no benefit over the original frequency estimates in these evaluation conditions.

5.3.3 Comparison between acoustic and randomised scores

The results presented so far cast doubt upon the utility of prominence scores as defined previously, based on simple aggregations of a basic set of acoustic features. In order to evaluate whether these scores are meaningful, experiments were conducted to compare the

effectiveness of the GH retrieval function with prominence scores defined: (i) randomly; or (ii) based on any of the acoustic scores from Table 5.4. Recall that the δ parameter in the GH function (Equation 5.10) controls the extent to which the score ps(i) affects the overall weight estimation of a term. To make a fair comparison between random and acoustic scores, it is important to ensure that they both produce the same degree of impact on a term's overall weight when used in Equation 5.10. Thus, besides using the same value for δ, a fair comparison also requires random scores to be similar to the acoustically-motivated ones in terms of scale and distribution. Note further that the scores from Table 5.4 may be distributed differently across acoustic features, despite these having been normalised to values between 0 and 1, and that these distributions could possibly vary across languages and word classes. To account for these factors, the random scores used in the following experiments were generated, for a particular instance of ps(i), collection C, and query terms Q from a topic set, based on a random permutation of the acoustic scores ps(i) that are assigned to any term i ∈ Q appearing in any document d ∈ C. That is, the random score rs(i) assigned to term i in document d was uniformly sampled from the set {ps(k) : k ∈ Q ∩ d′ ∧ d′ ∈ C}. A permutation experiment evaluated the GH function with 1000 random permutations of acoustic scores for a fixed δ. The resulting distribution of MAP scores was then used to calculate a p-value equal to the proportion of MAP scores from the distribution that were greater than the MAP score obtained with the original (non-random) assignment of the acoustic scores. Figure 5.5 shows the results of the permutation experiments for four representative conditions of tasks, test collections, and ps(i) derivations. The plots showcase how MAP scores vary as a function of δ, with green lines showing the effectiveness of GH when using an acoustically-motivated score, and each box plot showing the distribution of MAP scores obtained from using the random permutations. Orange circles and red triangles in the plots mark points at which the estimated p-values are less than 0.05 and 0.01 respectively. Two important observations can be made from these results. First, as expected, the effectiveness of GH degrades with decreasing δ as the influence of the randomised scores increases in the estimation of the term weights. Note, however, that this degradation is not evident until δ is small enough, since the weights produced by the BM25 function are on a larger scale than those derived from the acoustic features, which range between 0 and 1. Second, although retrieval effectiveness also decays when the non-randomised acoustic scores are used (green lines), these still provide substantially better results than when using the randomised scores, especially for very small values of δ. The case when δ = 0 deserves a special mention since it corresponds to the instantiation of GH that assigns term weights solely based on prominence scores and which, consequently, ranks documents (passages) according to the sum of their query terms' prominence scores. The plots in Figure 5.5 show that the non-random assignment of acoustic scores performs significantly better than a random assignment when δ = 0. This last observation is important since it suggests that prominence scores derived from acoustic features may be able to capture, to some extent, information about terms that is useful for ranking spoken documents (passages) in order of relevance to a query, similar to the kind of information that is captured by the TF-IDF estimates used in the Okapi BM25 ranking function.
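A permutation experiment of the kind described above can be sketched as follows. The evaluate_map callback stands in for the full GH retrieval run (with a fixed δ) and MAP evaluation, and the dictionary-based score assignment is an assumption made for illustration; only the shuffle-and-re-evaluate logic mirrors the procedure in the text.

```python
import random

def permutation_p_value(acoustic_scores, evaluate_map, n_perm=1000, seed=0):
    """Estimate how often randomly permuted scores beat the acoustic ones.

    acoustic_scores: dict mapping (term, document) -> prominence score ps(i);
    evaluate_map(scores) -> MAP of the GH function with that assignment of scores.
    Returns the proportion of permutations whose MAP exceeds the observed MAP."""
    rng = random.Random(seed)
    observed = evaluate_map(acoustic_scores)
    keys = list(acoustic_scores.keys())
    values = list(acoustic_scores.values())
    better = 0
    for _ in range(n_perm):
        rng.shuffle(values)                      # random permutation of the scores
        permuted = dict(zip(keys, values))
        if evaluate_map(permuted) > observed:    # strictly greater than observed
            better += 1
    return better / n_perm
```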

Figure 5.5: Effectiveness of GH with acoustic scores (green lines) and random scores (box plots) for the experimental conditions shown in rows 2 (a) and 6 (b) of Table 5.7a and 3 (c) and 8 (d) of Table 5.7b. Each panel plots MAP against δ: (a) SDR, SH14, LIMSI, ps(i) = ∨ ◦ F0^σ; (b) SDR, SD2, MAN, ps(i) = ∨ ◦ Erms^∧; (c) SPR, SH14, NST, ps(i) = ∨ ◦ El^σ; (d) SPR, SQD2, MAN, ps(i) = ∨ ◦ D.

5.3.4 Comparison between acoustic scores and other weighting schemes

In the experiment from Section 5.3.3, terms that matched the query were randomly assigned acoustic-derived prominence scores and their effectiveness was compared against non-randomised prominence scores with the GH retrieval function. This comparison focused on small values of δ, since these best demonstrate the potential impact that prominence scores can have on the final ranking of documents and passages. In particular, when δ is zero, the GH ranking function (Equation 5.10) becomes \(\sum_{i \in q, d} ps(i)\) and produces relevance scores for documents (passages) exclusively based on a sum of prominence scores, without making use of the terms' TF-IDF scores. Compared to using randomised scores, a more effective yet trivial weighting scheme consists of assigning each term a unit weight, i.e. w(i) = 1 whenever tfi > 0 or w(i) = 0 otherwise. The document scoring function that results from adopting this scheme is known as coordinate matching (CM), and ranks documents (passages) according to the number of query terms they contain, thus essentially considering all terms equally important in the ranking process. If it is true that the acoustic-based prominence scores can provide useful information about the relative importance that terms should be given in the scoring process, then they should be, at the very least, more effective than unit weights. To test this hypothesis, experiments were conducted that compare the effectiveness achieved by using acoustic scores in the GH function when δ = 0 against that achieved when using CM weighting. In between CM and Okapi BM25, two intermediate weighting schemes are also worth considering in this analysis: the binary independence model (BIM), in which terms are only differentiated by their IDF scores (Equation 2.7), and a "TF-only" (TFO) model which differentiates terms across documents by considering their within-document frequencies but not their document frequencies (Equation 2.9 with ctf(i) = 1). Tables 5.9a and 5.9b show the results obtained with the GH, CM, BIM, and TFO retrieval functions in the SDR and SPR tasks respectively. Similarly to the results reported earlier with the GH method, the results shown are for the acoustic features that performed best in each test collection, with the difference that δ was set to zero in the GH function. In these tables, bold fonts and the * symbol respectively mark significant (p < 0.05) and highly significant (p < 0.01) differences with respect to the MAP scores obtained by GH. As can be seen from the results for the SDR task shown in Table 5.9a, the rankings based on prominence scores (GH) are consistently better than those achieved using CM.

Table 5.9: Comparison of retrieval effectiveness between the GH, CM, BIM, and TFO functions, when δ = 0 and the best derivation for ps(i) is used in GH.

(a) SDR task.

                    GH (best when δ = 0)                CM      BIM     TFO
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  MAP         MAP     MAP     MAP
SH13    LIMSI       ∨      ∨         El     .358        .219*   .239*   .496
SH14    LIMSI       ∨      ∨         F0     .254        .145*   .158*   .403*
SH14    NST         ∨      ∨         F0     .248        .143*   .153*   .439*
SAVA    LIMSI       ∨      σ         El     .216        .157*   .183    .362*
SAVA    NST         ∨      ∨         El     .216        .133*   .154    .330*
SD2     MAN         ∨      D         –      .644        .512*   .621    .651
SQD1    MAN         ∨      D         –      .574        .426*   .494    .664
SQD2    MAN         ∨      σ         El     .579        .435*   .475*   .675*

(b) SPR task.

                    GH (best when δ = 0)                CM      BIM     TFO
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  MAP         MAP     MAP     MAP
SH13    LIMSI       ∨      ∧         F0     .210        .184    .242    .249
SH14    LIMSI       ∨      ∨         F0     .201        .169*   .205    .292*
SH14    NST         ∨      ∧         F0     .183        .163    .191    .289*
SAVA    LIMSI       ∨      ∧         Erms   .159        .133    .196    .244*
SAVA    NST         ∨      ∧         Erms   .139        .123    .172*   .203*
SD2     MAN         ∨      µ         F0     .274        .254    .415*   .285
SQD1    MAN         µ      µ         El     .145        .093    .158    .118
SQD2    MAN         ∨      D         –      .117        .101    .157*   .142

Furthermore, the acoustic-based weights are frequently more effective in the SDR task than those estimated with the BIM. Note, however, that the BBC1, BBC2, and SDPWS2 collections only contain 1860, 3520, and 98 documents respectively, which makes them relatively small compared to most traditional test collections used in IR research. In these circumstances, any document-frequency-derived score is likely to be poorly estimated, which may explain why the BIM performed similarly to CM in the SDR task.
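For reference, the simple weighting schemes compared in Tables 5.9a and 5.9b can be written out directly. The sketch below is illustrative only: the dictionary-based inputs and function names are assumptions, the TFO form simply drops the collection-frequency factor from the BM25-style weight (ctf(i) = 1) as described in the text, and the last function is the δ = 0 instantiation of GH.

```python
def score_cm(query_terms, doc_tf):
    # Coordinate matching: count how many query terms occur in the document.
    return sum(1.0 for t in query_terms if doc_tf.get(t, 0) > 0)

def score_bim(query_terms, doc_tf, idf):
    # Binary independence model: matching terms contribute only their IDF weight.
    return sum(idf[t] for t in query_terms if doc_tf.get(t, 0) > 0)

def score_tfo(query_terms, doc_tf, docl, avel, k1=1.2, b=0.75):
    # "TF-only": BM25-style saturation of the within-document frequency,
    # with the collection-frequency (IDF) factor fixed to 1.
    norm = k1 * (1.0 - b + b * docl / avel)
    total = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf > 0:
            total += tf * (k1 + 1.0) / (tf + norm)
    return total

def score_gh_delta0(query_terms, prominence):
    # GH with delta = 0: rank purely by the sum of query-term prominence scores.
    return sum(prominence[t] for t in query_terms if t in prominence)
```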

SH13 LIMSI F0 .210 .184 .242 .249 ∨ ∧ SH14 LIMSI F0 .201 .169* .205 .292* ∨ ∨ SH14 NST F0 .183 .163 .191 .289* ∨ ∧ SAVA LIMSI Erms .159 .133 .196 .244* SAVA NST ∨ ∧ E .139 .123 .172* .203* ∨ ∧ rms SD2 MAN µ F0 .274 .254 .415* .285 ∨ SQD1 MAN µ El .145 .093 .158 .118 SQD2 MAN µ D∨ - .117 .101 .157* .142 than those estimated with the BIM. Note, however, that the BB1, BBC2, and SDPWS2 collections only contain 1860, 3520, and 98 documents respectively, which makes them relatively small compared to most traditional test collections used in IR research. In these circumstances, any document-frequency derived score is likely to be poorly estimated, which may explain why the BIM performed similarly to CM in the SDR task. The results obtained for the SPR experiments shown in Table 5.9b, indicate that the prominence scores were less effective in this task than in the SDR task, as the effectiveness of the GH method was in general closer to that achieved by CM and lower than that obtained by BIM. These differences may be explained by the following hypotheses:

(i) The weights produced by CM are relatively more effective at ranking passages than full documents, given that there will likely be fewer candidate passages in a collection than documents containing all (or most) terms from the query.

(ii) The weights produced by BIM are more effective in the SPR task than in the SDR task because document frequency estimates will be more reliable when calculated from a larger collection containing significantly more retrieval units.

(iii) The weights based on prominence scores (GH) are more effective in the SDR task than in the SPR task, since some of the acoustic features used for this purpose are more meaningful when aggregated over all spoken occurrences of the same term found in a document than when aggregated over a limited number of such occurrences appearing in a passage.

Table 5.10: Relative deterioration in retrieval effectiveness for the SDR and SPR tasks when using the simpler weighting schemes CM, BIM, and TFO, instead of BM25.

                    CM                   BIM                  TFO
Topics  Transcript  SDR   SPR   diff     SDR   SPR   diff     SDR   SPR   diff
SH13    LIMSI       62%   44%   18%      58%   27%   31%      13%   24%   -11%
SH14    LIMSI       66%   50%   16%      63%   39%   24%      5%    13%   -9%
SH14    NST         69%   51%   19%      67%   42%   25%      6%    13%   -6%
SAVA    LIMSI       60%   56%   4%       53%   35%   18%      7%    20%   -12%
SAVA    NST         65%   49%   16%      60%   29%   31%      14%   16%   -2%
SD2     MAN         29%   44%   -15%     14%   8%    6%       10%   37%   -27%
SQD1    MAN         41%   61%   -20%     32%   34%   -3%      8%    51%   -43%
SQD2    MAN         37%   61%   -24%     31%   39%   -9%      2%    45%   -43%


With respect to (i) and (ii), Table 5.10 shows the relative decrease in MAP when CM, BIM, and TFO are used instead of BM25 in the SDR and SPR tasks. For instance, in the SH13-LIMSI condition (row 1 in the table), CM underperforms BM25 by .546 − .217 = .321 points absolute, which corresponds to a 62% loss in MAP relative to the .546 figure obtained by BM25. For the experiments with the BBC collections (rows 1-5 in the table), the MAP values indicate that the performance gap between CM and BM25 is 14% greater on average for the SDR task than for the SPR task, whereas between BIM and BM25 the gap is on average 25% greater for the SDR task than for the SPR task. While these results seem to suggest that claims (i) and (ii) hold, the differences in the last three rows of the table (rows 6-8) show a different trend and indicate that (i) and (ii) are not always true. More importantly, if the MAP scores of CM and BIM shown in Table 5.9a were to be adjusted (increased) to account for the observed cross-task differences from Table 5.10, the MAP values of GH would still be higher than those of CM, but lower than those of BIM for rows 1-6, and substantially higher than both for rows 7-8. Therefore, while the observed differences between GH and BIM are probably due to (ii), it is unlikely that the differences between GH and CM in the SDR task can be attributed only to (i). In order to validate (iii), the results obtained with the GH function with δ = 0 were grouped by feature configuration and then compared against those obtained with CM for every test collection. Tables 5.11a and 5.11b depict the number of test collections on which GH obtained significant improvements (p < 0.05) over CM for every derivation of ps(i). For instance, when prominence scores were derived as ps(i) = ∨ and ps(k, i) = ∨ over El, the GH ranking function obtained significantly higher MAP scores in all test collections for the SDR task (8/8), while it did so in 6 test collections (6/8) when the scores were derived as ps(i) = ∨ and ps(k, i) = ∧ over El. The following observations can be made based on the results from Tables 5.11a and 5.11b:

(I) The weights based on prominence scores (acoustic features) are consistently more effective than the use of uniform weights (CM) for the SDR task.

Table 5.11: Number of test collections on which the GH retrieval function (when δ = 0) is significantly more effective than CM (p < 0.05) for prominence scores derived from:

(a) Loudness (El), energy (Erms), and fundamental frequency (F0).

                   SDR                  SPR
ps(i)  ps(k, i)    El   Erms   F0       El   Erms   F0
∨      ∨           8    7      8        1    0      1
∨      µ           8    7      8        0    0      3
∨      σ           8    7      6        0    0      0
∨      ∧           6    4      5        0    0      1
µ      ∨           1    1      0        1    0      0
µ      µ           2    1      2        0    0      0
µ      σ           1    1      0        1    0      0
µ      ∧           0    0      0        0    0      0

(b) Duration (D).

ps(i)  SDR   SPR
∨      8     1
µ      0     1

(II) Acoustic-derived weights were consistently more effective when defined as the maximum prominence score (ps(i) = ∨) among all occurrences of a term in a document, and less effective when defined as the mean (ps(i) = µ) over all occurrence scores.

(III) In the SDR task, effective weights can be derived from every “base” feature: El,

Erms, F0, and D. This is true irrespective of which aggregation function is applied in the calculation of an occurrence's prominence score (ps(k, i)), although for occurrence-level scores the maximum (∨), mean (µ), and standard deviation (σ) are frequently more effective than the minimum (∧).

Overall, the previous observations suggest that terms that are significant or informative from an IR perspective tend to be those that are spoken prominently in a particular mention within the entire document, rather than those spoken prominently on average. Thus, the maximum (∨) prominence score across all mentions of a term in a document, or equivalently, the score assigned to the term's most prominent mention, seems to be a better descriptor of a term's level of significance. Furthermore, the acoustic-derived weights are not as useful for ranking passages as they are for ranking documents, meaning that a term's significance level may not necessarily be signalled in mentions that occur within a relevant passage but in those located elsewhere within the container document. Finally, from observation (III) it follows that effective term weights can be derived from multiple sources of acoustic information (duration, pitch, and loudness), and that a combination of features may provide additional improvements in retrieval effectiveness.

Document-level versus passage-level aggregations

The fact that acoustically-derived term weights are more effective in the SDR task suggests that acoustic features should be aggregated at the level of documents rather than passages. It remains a question though whether utilising these document-level aggregates can result in improved effectiveness in the SPR task.

To test this hypothesis, the effectiveness of the GH function (when δ = 0) with these two aggregation approaches was compared in the SPR task. Tables 5.12a and 5.12b summarise the results of such comparisons. In particular, the tables report, for each feature, the number of test conditions on which using the feature resulted in significant improvements when aggregated at the level of documents instead of passages (doc > pas) and vice versa (doc < pas). In most cases, there were no significant differences between using document- and passage-level aggregates (p ≥ .05, doc = pas). However, for the conditions in which such differences were significant, the document-level aggregates resulted in increased effectiveness more often than their passage-level counterparts. This is particularly evident for ps(k, i) = σ and for D (the last 4 rows in the tables), and for ps(i) = ∨ with ps(k, i) = ∨ (the first 3 rows in Table 5.12a). Overall, the figures suggest that document-level aggregates generally provide more effective term weights for the SPR task than passage-level aggregates. This is also evidenced by the fact that document-level aggregated features help close the performance gap between the GH, BIM and TFO functions in the SPR task. The latter effect can be seen by comparing the results from Tables 5.13 and 5.9b.

CWL versus simple weighting schemes

The previous experiments shed light on the meaningfulness of prominence scores that are aggregated via maximum or mean scores across a term's occurrences. In the case of the CWL integration approach, occurrence-level scores are summed instead. This results in scores that differ from those obtained via max or mean aggregations. In particular, since the summation is applied across the term's occurrences, its resulting value will be correlated with the term's within-document frequency. Thus, even though the within-document frequency of a term is not explicitly used in the calculation of the term's CWL weight (Equation 5.16), the weight produced by this function will still capture much of the same information that the original term-frequency count can capture about the importance of this term. If the summation of prominence scores for a term does truly provide stronger evidence of its significance compared to that obtained from within-document term frequencies, then the former should provide greater retrieval effectiveness than the latter. In order to establish a more direct comparison between the summation of prominence scores and the original within-document term frequencies, the CWL retrieval function was compared against TFO by, in this case, setting the IDF factor in CWL (Equation 5.16) to 1 (ctf(i) = 1). Tables 5.14a and 5.14b depict the results of this experiment. As can be seen from the results, the CWL weights perform similarly to TFO's in the large majority of the test conditions. This means that using a sum of prominence scores as an estimate of a term's within-document frequency, as implemented by the CWL function, performs at best as well as using the original term frequency values.

Table 5.12: Comparison between document-level and passage-level aggregated features in the SPR task. Columns 3-5 of each table show the number of test collections (evaluation conditions) on which the GH function: (i) is equally effective when using document-level and passage-level features (doc = pas, p ≥ .05); (ii) obtains significantly higher MAP when using document-level instead of passage-level features (doc > pas, p < .05); (iii) obtains significantly lower MAP when using document-level instead of passage-level features (doc < pas, p < .05).

(a) ps(i) = ∨

ps(k, i)  CZ(n)   doc = pas   doc > pas   doc < pas
∨         El      7           1           0
∨         Erms    6           2           0
∨         F0      7           1           0
∧         El      8           0           0
∧         Erms    7           1           0
∧         F0      8           0           0
µ         El      8           0           0
µ         Erms    8           0           0
µ         F0      8           0           0
σ         El      7           1           0
σ         Erms    4           4           0
σ         F0      4           4           0
D         –       5           2           1
Total (104)       87          16          1

(b) ps(i) = µ

ps(k, i)  CZ(n)   doc = pas   doc > pas   doc < pas
∨         El      5           2           1
∨         Erms    6           2           0
∨         F0      4           2           2
∧         El      5           1           2
∧         Erms    3           2           3
∧         F0      5           1           2
µ         El      4           2           2
µ         Erms    4           2           2
µ         F0      5           1           2
σ         El      6           2           0
σ         Erms    6           2           0
σ         F0      4           4           0
D         –       6           2           0
Total (104)       63          25          16

Table 5.13: Comparison of retrieval effectiveness between the GH with document-level aggregates, CM, BIM, and TFO functions in the SPR task. Results for GH were obtained with δ = 0, document-level feature aggregations, and with the derivation of ps(i) that provides the highest MAP in each test condition.

                    GH (best when δ = 0)                CM      BIM     TFO
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  MAP         MAP     MAP     MAP
SH13    LIMSI       ∨      µ         Erms   .200        .184    .242    .249
SH14    LIMSI       ∨      ∨         F0     .206        .169*   .205    .292*
SH14    NST         ∨      ∨         Erms   .194        .163    .191    .289*
SAVA    LIMSI       ∨      ∧         Erms   .165        .133    .196    .244*
SAVA    NST         ∨      ∧         Erms   .133        .123    .172*   .203*
SD2     MAN         µ      D         –      .304        .254*   .415*   .285
SQD1    MAN         µ      D         –      .146        .093*   .158    .118
SQD2    MAN         ∨      D         –      .129        .101    .157    .142

The only exceptions occur in experiments with the SD2 and SQD1 topics. However, this was only the case for the first and second instantiations of ps(i) while, for the remaining 11 instantiations, the MAP values were not significantly different from those obtained by TFO.

5.3.5 Experiments with feature combinations

The experiments described in Sections 5.3.2, 5.3.3, and 5.3.4 with the GH and CWL approaches explored the potential effectiveness of using prominence scores derived from a single "base" feature of loudness (El), energy (Erms), fundamental frequency (F0), or duration (D). Because prominent words are likely to be realised by a combination of such features, rather than by any of them in isolation, it is worth investigating whether prominence scores defined through feature combinations may result in term weights that are more effective at distinguishing significant terms from non-significant ones. In fact, the results from Table 5.11a suggest that weights that are more effective than uniform weights can be derived independently from different base features. Thus, a prominence score based on a combination of acoustic features, either derived from multiple base features or by applying different aggregation functions over the same base feature, may produce improved term weights. This section reports on experiments carried out with prominence scores defined through such feature combinations.

Inner and outer combinations

Recall from Section 5.3.2 that a term's prominence score was derived in a simple multi-stage aggregation process. First, the feature contour CZ(n) associated with the kth occurrence of the term was aggregated into an occurrence score \(ps(k, i) = \bigoplus_n C_Z(n)\) via an aggregation function \(\bigoplus_n\) applied across element indices n = 1, 2, .... Second, these scores were aggregated across occurrences to obtain a term score \(ps(i) = \bigoplus_k ps(k, i)\). This process was illustrated in Figure 5.3, while the possible aggregation functions explored were summarised in Table 5.4.

Table 5.14: Comparison of retrieval effectiveness between the CWL and TFO functions when ctf(i) = 1 and the best derivation for ps(i) is used in CWL.

(a) SDR task.

                    CWL (best when ctf(i) = 1)               TFO
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  MAP              MAP
SH13    LIMSI       Σ      σ         F0     .514             .496
SH14    LIMSI       Σ      µ         F0     .404             .403
SH14    NST         Σ      ∨         F0     .439             .439
SAVA    LIMSI       Σ      ∨         F0     .362             .362
SAVA    NST         Σ      D         –      .327             .330*
SD2     MAN         Σ      σ         El     .664             .651
SQD1    MAN         Σ      D         –      .679             .664
SQD2    MAN         Σ      ∨         El     .683             .675

(b) SPR task.

                    CWL (best when ctf(i) = 1)               TFO
Topics  Transcript  ps(i)  ps(k, i)  CZ(n)  MAP              MAP
SH13    LIMSI       Σ      ∨         F0     .254             .249
SH14    LIMSI       Σ      ∨         F0     .292             .292
SH14    NST         Σ      ∨         F0     .289             .289
SAVA    LIMSI       Σ      ∨         F0     .244             .244
SAVA    NST         Σ      σ         F0     .205             .203
SD2     MAN         Σ      ∨         El     .290             .285
SQD1    MAN         Σ      D         –      .130             .118
SQD2    MAN         Σ      σ         F0     .145             .142

Given a set of occurrence-level features F = {f1, f2, ...}, each assigned to every occurrence of a term in a document, let fh(i, k) denote the value that feature fh ∈ F acquires for occurrence k of term i. A prominence score can then be obtained for term i based on a combination of its associated fh(i, k) values. In order to obtain a single prominence score for term i, the values fh(i, k) need to be aggregated along the k (occurrences) and h (features) dimensions. Different scores may result depending on which dimension is aggregated first. In the experiments from this section, two combination approaches are explored:

• an "inner" combination (IC), in which features are first combined within occurrences (along h) and then across occurrences (along k), that is,

\[
ps_{IC}(i) = \bigoplus_k \Big( \bigoplus_h \big[ f_h(i, k) \big] \Big);
\]

• and an "outer" combination (OC), in which features from different occurrences are grouped by feature type and then combined within groups, as follows:

\[
ps_{OC}(i) = \bigoplus_h \Big( \bigoplus_k \big[ f_h(i, k) \big] \Big).
\]

The experiments from this section study the impact of using \(\bigoplus_k = \vee\) (max) for aggregating the occurrence features, use the "base" features F0^∨, Erms^µ, Erms^∨, etc., as the set F, and combine features from F with an arithmetic mean, \(\bigoplus_h = \frac{1}{|F|}\sum_h\). The reason for not considering other aggregation functions is that prominence scores derived by max-aggregates performed best in the experiments with the GH function described in Section 5.3.4.

Figures 5.6a and 5.6b show an example of how prominence scores are calculated in the IC and OC approaches. In IC, the final prominence score of term ti, represented by the grey-shaded square in the figure, is the maximum value among 3 occurrence scores, where each occurrence score is calculated as the average of the features derived from D, El, Erms, and F0 for that occurrence. In OC, the prominence score of term ti is the average of term-level features derived from D, El, Erms, and F0, where each term-level feature is the maximum for a specific feature among the three occurrences of term ti.

Under the set-up described above, the prominence score of a term under the IC approach is given by the features of the term's single most prominent occurrence, where the grade of prominence of an occurrence is in this case estimated as the average value of its features. By contrast, the combined prominence score of a term according to the OC approach may be determined by the feature values associated with multiple, possibly different, occurrences of the term, each of which may be deemed salient with respect to a specific feature type in isolation. For instance, if loudness and duration were to be combined with the IC approach, a term prominence score would be formed using the average of the loudness and duration values associated with the occurrence of this term that is deemed both the loudest and the longest in the document. However, in the OC approach, the same term would acquire a score formed by the average of the loudness value of its loudest occurrence and the duration value of its longest occurrence.
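Using max over occurrences (⊕_k = ∨) and an arithmetic mean over features (⊕_h), the two combination orders can be sketched as below. The data layout (a list of per-occurrence feature dictionaries) and the made-up example values are assumptions for illustration only.

```python
def inner_combination(occurrences):
    """IC: average the features within each occurrence, then take the max over
    occurrences, so the single 'most prominent' mention determines the score."""
    per_occurrence = [sum(feats.values()) / len(feats) for feats in occurrences]
    return max(per_occurrence)

def outer_combination(occurrences):
    """OC: take the max of each feature across occurrences (possibly from different
    mentions), then average those per-feature maxima."""
    feature_names = occurrences[0].keys()
    per_feature = [max(occ[f] for occ in occurrences) for f in feature_names]
    return sum(per_feature) / len(per_feature)

# Example: three occurrences of a term, each described by max-aggregated loudness,
# energy, F0, and a duration feature (values are illustrative).
occs = [
    {"El_max": 1.2, "Erms_max": 0.8, "F0_max": 0.4, "D": 0.9},
    {"El_max": 0.5, "Erms_max": 1.5, "F0_max": 1.1, "D": 0.2},
    {"El_max": 0.9, "Erms_max": 0.7, "F0_max": 0.6, "D": 1.4},
]
print(inner_combination(occs))   # 0.9: the best single mention on average
print(outer_combination(occs))   # 1.3: a blend of the per-feature best mentions
```

The example illustrates why OC can exceed IC: OC is free to pick the loudest, longest, and highest-pitched mentions separately, whereas IC is tied to one mention.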

Comparison between inner and outer combinations

Since it is not clear which combination approach would provide the most effective term weights, retrieval experiments were conducted to compare the effectiveness of the GH retrieval function (Equation 5.10) when using prominence scores defined by the IC and OC approaches. Based on the list of available features displayed in Table 5.4, there are 13 distinct occurrence features ps(k, i) to be considered as possible candidates to be combined, and therefore 2^13 = 8192 possible ways of grouping these into different feature subsets. Since the total number of possible subsets is not prohibitively large, experiments were conducted with all possible feature subsets. A retrieval experiment consisted of using the GH function with δ = 0 for retrieval, with prominence scores calculated either by IC or OC for a given feature subset. This resulted in 8192 MAP scores for each of the IC and OC approaches. The box plots shown in Figures 5.7 and 5.8 compare the distribution of MAP scores obtained for the IC and OC approaches across all test conditions over the SH13, SH14, SAVA, SD2, SQD, and SQD2 topic sets. In particular, Figure 5.7 shows results for the SDR task, while Figure 5.8 does so for the SPR task.

Figure 5.6: An example of how prominence scores are calculated in the inner and outer combination approaches for two terms ti and ti+1 with 3 and 2 occurrences respectively.

(a) Inner combination (IC)

(b) Outer combination (OC).

In each sub-figure, the left and right box plots show respectively the distribution of MAP scores obtained for the IC or OC approach, while the horizontal dashed line depicts the MAP score attained by GH when using prominence scores derived from the single best-performing "base" feature. Two important observations can be made from these results. First, prominence scores produced by outer combination (OC) performed generally better than those produced by inner combination (IC). This was the case for all test conditions evaluated in the SDR task, although the differences between OC and IC were larger in the experiments with the BBC (SH13, SH14, and SAVA) collection than with the SDPWS (SD2, SQD, and SQD2) collection. The plots for the SPR task show a similar trend, with OC outperforming IC in the majority of the test conditions. Overall, these results seem to suggest that multiple acoustic features may not concurrently signal the same spoken occurrence of a term as significant. Instead, the importance status of a term may be signalled in several of its occurrences across the document by means of a diverse set of acoustic features. The second observation arises from comparing runs that used multiple features (box plots) against those that used a single feature (dashed lines). In most of the test conditions, the majority of MAP scores obtained through feature combinations are above the best single-feature line in the plots, meaning that more effective term weights can be derived from multiple acoustic features. Tables 5.15a and 5.15b present a more detailed comparison between prominence scores derived from multiple features, via the OC approach, and those derived from a single feature, in the SDR and SPR tasks. In particular, the tables report the MAP scores of the best-performing subset of features (the highest extreme points in the box plots) against those obtained with a single feature (dashed lines in the box plots). A tick symbol in a cell indicates whether a particular feature was present in the best-performing feature subset found for each test condition. Among the individual features considered, duration (D) and minimum F0 (∧) were generally present in the best subset of features across the majority of the test collections. Apart from these, no other feature was frequently included in the best feature subset. A possible cause for this may be the existence of highly correlated features, which could result in several equally performing feature subsets. The results from Tables 5.15a and 5.15b show that it is possible to obtain more effective term weights if multiple acoustic features are used for prominence score calculations. Furthermore, weights derived through feature combinations were more frequently effective in the SDR than in the SPR task. The latter provides supporting evidence for the observation that the acoustic features explored in this thesis tend to be more useful for retrieval purposes when aggregated from longer excerpts of spoken content containing a higher number of query term occurrences (documents) than from short excerpts (passages).

Comparison with Okapi BM25

The results in Table 5.15 show that spoken documents can be ranked more effectively if terms are weighted based on prominence scores calculated from combinations of multiple acoustic features.

Figure 5.7: Distribution of MAP scores obtained with feature combination approaches in the SDR task. The left and right box plots in each plot show the results for the inner combination (IC) and outer combination (OC) approaches, while the dashed lines show the performance of using the single-best "base" feature.

Figure 5.8: Distribution of MAP scores obtained with feature combination approaches in the SPR task. The left and right box plots in each plot show the results for the inner combination (IC) and outer combination (OC) approaches, while the dashed lines show the effectiveness of using the single-best "base" feature.

Table 5.15: Retrieval effectiveness of the GH function (with δ = 0) when using the best subset of features for computing ps∨(i) versus using the single-best feature. In the original table, tick marks indicate which of the 13 occurrence features form the best subset in each test condition.

(a) SDR task.

Topics   Trans.   GH MAP (best feature subset, δ = 0)   GH MAP (single-best feature)
SH13     LIMSI    .412                                   .358
SH14     LIMSI    .335*                                  .254
SH14     NST      .319*                                  .248
SAVA     LIMSI    .265                                   .216
SAVA     NST      .237                                   .216
SD2      MAN      .680*                                  .644
SQD1     MAN      .616                                   .574
SQD2     MAN      .615                                   .579

(b) SPR task.

Topics   Trans.   GH MAP (best feature subset, δ = 0)   GH MAP (single-best feature)
SH13     LIMSI    .229                                   .200
SH14     LIMSI    .229                                   .206
SH14     NST      .220*                                  .194
SAVA     LIMSI    .169                                   .165
SAVA     NST      .146                                   .133
SD2      MAN      .291                                   .274
SQD1     MAN      .119                                   .114
SQD2     MAN      .139                                   .129

Despite this, it remains unclear whether these improved scores can complement lexical-based term weights to improve upon the standard BM25 function. To investigate this, the experiments from Section 5.3.2 were repeated, this time using the best combination of acoustic features found in each evaluation condition to derive the prominence scores in the GH retrieval function. Here, the best feature combination refers to the most effective feature subset found in the SDR and SPR tasks, and corresponds to the experiments whose results are reported in Tables 5.15a and 5.15b. Tables 5.16a and 5.16b compare the retrieval effectiveness of the Okapi BM25 and GH retrieval functions when the best combination of features for δ = 0 is used in GH to compute the prominence scores, and the best value of δ is used in each evaluation condition. The results in these tables indicate that utilising the improved prominence scores in combination with lexical BM25 scores does not provide any additional benefit to retrieval, as the quality of the rankings produced by the GH function is at most as high as (and not significantly different from) that of the rankings produced by Okapi BM25. The plots in Figure 5.9 provide an alternative view of the results and show how the MAP values of GH vary for increasing values of δ. The plots make evident that integrating prominence scores into lexical-based term weights is generally detrimental to retrieval performance. Although the prominence scores derived from multiple acoustic features demonstrated increased effectiveness over single-feature scores (Table 5.15), utilising these improved weights in a BM25 setting does not necessarily result in an enhanced retrieval model.

A note on parameter optimisation and over-fitting

The majority of the experiments described in Section 5.3 involved finding optimal values for various retrieval function parameters, and searching for optimal subsets of features to be combined to form the prominence scores. In this context, optimal features or parameters refer to those model configurations with which the model or retrieval function achieves maximum MAP when evaluated on a particular query set. These optimal parameter or feature configurations were tested on the exact same query sets from which they were obtained. As a consequence, there is a potential risk of having over-fitted the parameters or features selected in many of the experiments presented in this section. Despite this, the conclusions and observations drawn from these experiments are believed to hold. This claim is supported by the following arguments:

• The same parameter values for b, k1, and k3, which had initially been optimised with BM25 for each evaluation condition, were also used in the acoustically-enhanced BM25 functions. Therefore, the acoustically enhanced functions have always been evaluated with respect to their ability to improve upon a well-tuned and possibly over-fitted BM25 function.

Table 5.16: Comparison between the GH function and Okapi BM25 when using the best value of δ and feature subset for computing ps∨(i). In GH, the best feature subset in each test condition is the one that maximises MAP when features are combined via OC and δ = 0.

(a) SDR task.

Topics   Trans.   δ      GH MAP   BM25 MAP
SH13     LIMSI    0.87   .548     .546
SH14     LIMSI    0.08   .425     .418
SH14     NST      0.36   .467     .464
SAVA     LIMSI    0.48   .387     .386
SAVA     NST      0.33   .383     .383
SD2      MAN      0.71   .719     .719
SQD1     MAN      0.60   .719     .718
SQD2     MAN      0.12   .680     .669

(b) SPR task.

Topics   Trans.   δ      GH MAP   BM25 MAP
SH13     LIMSI    0.76   .316     .315
SH14     LIMSI    0.23   .339     .337
SH14     NST      0.22   .332     .330
SAVA     LIMSI    1.00   .304     .304
SAVA     NST      0.97   .242     .242
SD2      MAN      0.55   .452     .451
SQD1     MAN      1.00   .241     .241
SQD2     MAN      0.86   .258     .258

Figure 5.9: Effectiveness of Okapi BM25 (red horizontal lines) and GH for increasing values of δ (blue lines) when the best combination of features is used in GH to calculate prominence scores, for (a) the SDR task and (b) the SPR task.

• Only a very small number of extra parameters were optimised for the acoustically enhanced functions. In the GH function, this corresponded to the δ parameter, set to control the contribution of prominence scores to the term weights. In the CWL function, this was the case for the k1 and α parameters, where the latter was set to adjust the importance given to highly prominent term occurrences.

• In the experiments involving searches over feature subsets, results were reported for all possible feature subsets and depicted with box plots to illustrate the distribution of MAP scores obtained. Furthermore, conclusions were drawn by observing differences across such score distributions rather than from the extreme values of such distributions.

• Despite having used over-fitted parameter values and best-performing feature subsets, the acoustically enhanced functions did not significantly outperform the BM25 baseline.

This last observation is particularly important since it emphasises the fact that the acoustically enhanced retrieval models could not demonstrate a substantial increase in retrieval effectiveness even when using over-fitted parameters and features.

5.3.6 Summary of experiments with heuristic functions

This section described a series of retrieval experiments with two ranking functions inspired by the work of Guinaudeau and Hirschberg (2011) and Chen et al. (2001), which combine a term's BM25 weight with a prominence score extracted from multiple mentions of the term in a spoken document or passage. A variety of methods to derive suitable prominence scores for terms were explored, based on simple aggregations of a small set of speaker-standardised low-level descriptors of pitch, loudness, energy, and duration. Two alternative approaches were then described in detail: GH, which combines prominence and BM25 scores externally via linear interpolation; and CWL, which updates the within-document term-frequency estimates to reflect the accumulated prominence scores associated with a term's occurrences in a document or passage. The results of retrieval experiments conducted with these methods on a diverse set of test collections, topics, and relevance assessments show that none of the proposed acoustically-enhanced functions provides consistent significant improvements in retrieval effectiveness over a standard lexical-based BM25 function. Further experimentation with the GH function provided some insight into the type of information that prosodic prominence may encode about terms, as results indicated that documents can be ranked more effectively in order of relevance to a query if the relative importance assigned to their terms is based upon prominence scores rather than on randomised or uniform weights. Furthermore, the effectiveness of term weights based on prominence information varied depending on the retrieval task, integration method, and acoustic features used. In this respect, weights given by the estimated prominence of the most prominent occurrence of

a term in a document (ps∨(i)) demonstrated some utility in the SDR task, while those based on other aggregation functions, such as the mean implemented in GH (psµ(i)) or the sum implemented in CWL (psΣ(i)), did not provide clear benefits in retrieval effectiveness over uniform weights. While in the SDR task useful weights could be derived from almost any acoustic source, this did not occur in the SPR task, where the retrieval units are shorter and contain fewer query-term occurrences across which the acoustic features can be aggregated. In this regard, acoustic features aggregated across complete documents tend to perform better in general than those aggregated within passages, for both the SDR and SPR tasks. Additional experiments investigated the value of prominence scores defined through averages of multiple acoustic features. Two combination approaches were explored in this case: an inner combination (IC), which combined features at the occurrence level, and an outer combination (OC), which combined features at the term level. Experiments with these two approaches showed that OC outperformed IC in general, and that both approaches resulted in prominence scores that provide increased retrieval effectiveness compared to using a single acoustic feature. Further comparisons between GH and Okapi BM25 indicated that the former could not outperform the latter in either the SDR or the SPR task, even when using the enhanced prominence scores derived from feature combinations. Overall, the experimental results collected suggest that words spoken with relatively extreme values of pitch, loudness, and duration do not provide additional complementary information about the topical significance of a word beyond what can be inferred from its TF-IDF estimates. Although estimates of acoustic prominence can to some extent capture information about terms that is useful for ranking documents in order of relevance to a query, such benefits fade away when these estimates are used within a well-tuned retrieval function.

5.4 Experiments with statistical methods

The retrieval experiments presented in Section 5.3 evaluated two simple approaches that attempt to exploit aggregated acoustic descriptors of the speech prosody of words to improve the ranking of relevant spoken documents and passages. In this line of work, the approaches adopted for computing the prominence scores of terms from acoustic features, as well as for incorporating these into a ranking function, were intuitively reasonable albeit ad-hoc in nature. On the one hand, the prominence scores examined were hand-designed and computed by calculating aggregated statistics over low-level feature contours. On the other hand, the integration approaches explored were based on heuristics and, beyond the analysis presented in Section 5.2.4, lacked a solid theoretical justification. The prominence scores examined in Section 5.3 were based on aggregations of low-level contours, subsequently aggregated by term and document classes, and finally averaged across different families of features into a unique score for a term-document pair. Although

simple, this method based on linear combinations of multiple aggregated values presents some obvious limitations. First, much of the information conveyed by the acoustic features is inherently lost in each aggregation step if only averages, standard deviations, and extreme values are retained. Second, using a conventional arithmetic mean for combining different sources of acoustic information effectively assigns equal importance to every feature considered, and is unlikely to take full advantage of any useful inter-dependencies that may exist between features. In order to cope with some of these limitations, as well as to gain further insight into any relationship that may exist between informative terms and their acoustic realisation, this section reports a statistical analysis of the acoustic data and describes experiments carried out with machine learning techniques. First, an analysis of the correlation between term-level acoustic features and BM25 weights is presented. Next, experiments with statistical classifiers are described, in which binary classifiers were trained with acoustic features to distinguish between terms occurring in relevant and non-relevant spoken documents. Finally, retrieval experiments are presented using a learning-to-rank approach trained with document-aggregated acoustic features to improve upon an initial Okapi BM25 ranking.

5.4.1 Correlation and regression analysis

Recall the study conducted by Silipo and Crestani (2000), which investigated the correlation between manually assigned acoustic scores of words and their BM25 weights in the 2-hour OGI corpus of telephone conversations. In a series of histograms, the authors observed that a high proportion of words that were given high (low) average acoustic scores were likely to have high (low) BM25 scores. The goal of this section is to extend Silipo and Crestani's analysis to the BBC and SDPWS datasets, which are substantially larger and more varied than the OGI corpus. Without loss of generality, the following analysis was conducted only with speech data and LIMSI transcripts from the BBC1 collection, and with speech data and manual transcripts from the SDPWS2 collection. Two analyses were carried out for this purpose: (i) a correlation analysis of acoustic features and BM25 scores based on Spearman's rank-order correlation coefficients; and (ii) a regression analysis in which a linear model is fitted with the acoustic descriptors of a term to predict its BM25 score. Finally, a set of histograms with similar characteristics to those reported in (Silipo and Crestani, 2000; Crestani, 2001) were plotted based on the predictions made by these regression models. The experimental analysis described in this section has some important differences with respect to the study conducted by Silipo and Crestani:

1. The acoustic scores utilised by Silipo and Crestani (2000) were derived from manual annotations of acoustic stress produced by trained linguists, while all results reported in this thesis were obtained with the automatically extracted acoustic features described in Section 5.1.

2. While all words including stopwords were considered in the analysis of Silipo and Crestani, the analysis presented here was restricted to indexing terms only, i.e. those terms included in the search indices of the BBC and SDPWS collections. Consequently, the following analysis excludes stopwords, parts-of-speech other than verbs and nouns in the case of the SDPWS collection, and recognised words with unreliable time-stamps, as per the descriptions in Sections 4.1.2, 4.2.2 and 5.1.2.

3. Lastly, Silipo and Crestani (2000) calculated term frequency and document frequency statistics by treating every short story in the OGI collection as a single “document”. Contrary to this, the analysis reported hereafter calculates frequency statistics based on the full contents of an episode in the case of the BBC1 collection and a lecture in the case of the SDPWS2 collection, both of which are substantially longer than the stories within the OGI corpus.

Correlation analysis

Two statistics commonly used to estimate the strength of the relationship between two variables are Pearson's product-moment correlation (ρ) and Spearman's rank-order correlation (ρs). Since one of the goals of this thesis is to determine whether the prosody of words can be used effectively to improve the quality of term weighting schemes in SCR systems, the interest is in understanding the impact that certain features may have if used in the generation of document rankings. In this context, Spearman's correlation coefficient seems more appropriate, since it measures monotonic, order-preserving associations between the variables under study. Spearman's correlation coefficients were calculated for features extracted for each term-document pair in the BBC1 and SDPWS2 collections. Recall that the former contains 1860 spoken documents, 34,849 terms, and 1,945,746 unique term-document pairs, while the latter has 98 documents, 6,223 terms, and 38,891 unique term-document pairs. For every term-document pair, a set of acoustic features was calculated as described in Section 5.3.2, by aggregating each occurrence-associated value (El∨, El∧, Elµ, ...) across all occurrences of the term in the document. BM25, BIM, and TFO scores were subsequently computed for every term-document pair based on the Okapi BM25 ranking function. Finally, a correlation coefficient was calculated between these scores and every acoustic feature. Besides the set of aggregates max (∨), mean (µ), and sum (Σ) used in the experiments of Section 5.3 for calculating term-level prominence scores, the following analysis also considers min-aggregates (∧). Additionally, no subsequent range or sigmoid normalisation was applied to the occurrence features in this analysis, in contrast to the normalisation functions applied in the experiments with heuristic retrieval functions described in Section 5.3.
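As a sketch of how the coefficients in Table 5.17 can be computed, the snippet below aggregates one occurrence-level feature per term-document pair and correlates the result with the corresponding BM25 scores; the arrays are hypothetical placeholders for the real data.

    import numpy as np
    from scipy.stats import spearmanr

    # One entry per term-document pair: the values of one occurrence feature
    # (e.g. El max) at each occurrence of the term in the document (placeholder data).
    occ_values = [np.array([0.4, 1.2]), np.array([0.9]), np.array([0.1, 0.5, 0.3])]
    bm25 = np.array([3.1, 7.4, 1.2])   # BM25 score of each term-document pair

    # Term-level aggregation functions ps(i) applied over the occurrences of each pair.
    aggregates = {'sum': np.sum, 'max': np.max, 'min': np.min, 'mean': np.mean}

    for name, fn in aggregates.items():
        scores = np.array([fn(v) for v in occ_values])
        rho, _ = spearmanr(scores, bm25)       # Spearman's rank-order correlation
        print(f'{name:4s}: rho_s = {rho:+.3f}')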

Table 5.17 shows the Spearman's correlation coefficients of term-level acoustic features and BM25 weights. Various observations can be made based on these results:

(i) Averaging features across occurrences (ps(i) = µ) results in scores that are substantially less strongly correlated with IR scores compared to those based on other aggregation functions.

(ii) Taking the maximum value across occurrences (ps(i) = ∨) generally results in scores that are positively correlated with TFO and negatively correlated with BIM scores. Consequently, taking the minimum value (ps(i) = ∧) provides scores whose correlations with TFO and BIM run in the opposite directions to those obtained via max-aggregation.

(iii) The addition of occurrence features (ps(i) = Σ) generally provides term scores that are highly correlated with IR scores. The directions of such associations depend on how contour features are aggregated at the level of individual term occurrences. The direction is positive w.r.t. BM25 and TFO, and negative w.r.t. BIM, when ps(k, i) ∈ {∨, σ}. For ps(k, i) = ∧, the acoustic scores are strongly correlated with the IR scores but in the opposite directions to those obtained with ps(k, i) = ∨.

(iv) Duration-related features, ps(k, i) = D, exhibit distinctive correlation patterns compared to the rest of the acoustic features. In particular, terms which occur rarely in the collection (high BIM) tend to be lengthened on average (high average duration, ps(i) = µ).

(v) With the exception of the groups ps(i) = µ and ps(k, i) = D, the correlation coefficients calculated against BIM present similar magnitudes and directions across collections. Despite these similarities, the acoustic features are generally more strongly correlated with IR scores in the Japanese (SDPWS2) collection than in the English (BBC1) collection. Furthermore, since BM25 scores are calculated as the product of BIM and TFO scores, the cross-collection differences that can be observed in Table 5.17 between the coefficients calculated for BM25 can be explained by those of BM25 against BIM and TFO. In particular, while ρs(BM25, BIM) = 0.86 and ρs(BM25, TFO) = −0.04 for the BBC1 data, the same coefficients for the SDPWS2 data are ρs(BM25, BIM) = 0.36 and ρs(BM25, TFO) = 0.63.

The correlation patterns observed in Table 5.17 make evident that some of the statistics used for aggregating occurrence-level features are strongly affected by the sample size over which such aggregates are calculated. By definition, the sum ps(i) = Σ ranges across the within-document occurrences of a term. Therefore, the scores obtained via such summation are inherently correlated with the within-document frequency of the term in that document and, consequently, positively correlated with TFO and inversely correlated with BIM. The ps(i) ∈ {∨, ∧} aggregation functions suffer from a similar bias.

152 Table 5.17: Spearman’s ρs rank-order correlation coefficients of term features against BM25, BIM, and TFO scores. The coefficient values are coloured based on the reference scale

1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 − − − −

                              (a) BBC1                      (b) SDPWS2
ps(i)  ps(k, i)  Cz(n)     BM25     BIM      TFO         BM25     BIM      TFO
Σ      ∨         F0        -0.128   -0.292    0.480       0.525   -0.316    0.744
Σ      ∨         Erms      -0.109   -0.259    0.449       0.487   -0.223    0.636
Σ      ∨         El        -0.125   -0.298    0.508       0.569   -0.338    0.797
Σ      ∧         F0         0.117    0.295   -0.512      -0.531    0.330   -0.758
Σ      ∧         Erms       0.119    0.307   -0.530      -0.555    0.355   -0.797
Σ      ∧         El         0.134    0.337   -0.572      -0.587    0.384   -0.853
Σ      µ         F0        -0.049   -0.028   -0.005       0.068    0.039    0.049
Σ      µ         Erms      -0.036   -0.019   -0.003       0.032    0.081   -0.020
Σ      µ         El        -0.032   -0.017   -0.004       0.044    0.044    0.014
Σ      σ         F0        -0.135   -0.312    0.520       0.568   -0.365    0.818
Σ      σ         Erms      -0.138   -0.320    0.532       0.580   -0.351    0.818
Σ      σ         El        -0.150   -0.342    0.561       0.593   -0.392    0.863
Σ      D         -          0.119    0.008    0.228       0.276    0.184    0.130
∨      ∨         F0        -0.084   -0.176    0.282       0.355   -0.185    0.481
∨      ∨         Erms      -0.068   -0.151    0.261       0.349   -0.125    0.427
∨      ∨         El        -0.063   -0.152    0.273       0.364   -0.140    0.452
∨      ∧         F0        -0.141   -0.211    0.238       0.126   -0.216    0.281
∨      ∧         Erms      -0.161   -0.225    0.251       0.199   -0.280    0.406
∨      ∧         El        -0.170   -0.236    0.262       0.241   -0.310    0.463
∨      µ         F0        -0.135   -0.207    0.267       0.321   -0.230    0.483
∨      µ         Erms      -0.122   -0.194    0.259       0.294   -0.196    0.428
∨      µ         El        -0.123   -0.199    0.266       0.296   -0.211    0.438
∨      σ         F0        -0.075   -0.155    0.251       0.336   -0.164    0.444
∨      σ         Erms      -0.073   -0.157    0.259       0.338   -0.114    0.407
∨      σ         El        -0.085   -0.172    0.275       0.351   -0.157    0.452
∨      D         -          0.096    0.010    0.190       0.357    0.065    0.282
∧      ∨         F0         0.133    0.228   -0.302      -0.235    0.328   -0.466
∧      ∨         Erms       0.147    0.239   -0.292      -0.195    0.341   -0.440
∧      ∨         El         0.153    0.244   -0.296      -0.215    0.352   -0.468
∧      ∧         F0         0.075    0.137   -0.213      -0.092    0.007   -0.087
∧      ∧         Erms       0.045    0.139   -0.256      -0.223    0.092   -0.272
∧      ∧         El         0.038    0.139   -0.273      -0.315    0.138   -0.404
∧      µ         F0         0.083    0.173   -0.258      -0.237    0.271   -0.423
∧      µ         Erms       0.096    0.187   -0.270      -0.216    0.275   -0.408
∧      µ         El         0.094    0.183   -0.266      -0.226    0.277   -0.421
∧      σ         F0         0.146    0.235   -0.292      -0.219    0.314   -0.443
∧      σ         Erms       0.147    0.232   -0.281      -0.191    0.348   -0.443
∧      σ         El         0.142    0.233   -0.294      -0.225    0.347   -0.475
∧      D         -          0.269    0.327   -0.263      -0.068    0.418   -0.384
µ      ∨         F0         0.023    0.016    0.011       0.114    0.077    0.062
µ      ∨         Erms       0.050    0.055   -0.020       0.098    0.138   -0.004
µ      ∨         El         0.056    0.056   -0.014       0.088    0.152   -0.025
µ      ∧         F0        -0.072   -0.089    0.068       0.031   -0.136    0.134
µ      ∧         Erms      -0.091   -0.085    0.039       0.030   -0.159    0.154
µ      ∧         El        -0.097   -0.085    0.028       0.022   -0.169    0.146
µ      µ         F0        -0.042   -0.035    0.022       0.051    0.030    0.035
µ      µ         Erms      -0.029   -0.023    0.018       0.064    0.036    0.039
µ      µ         El        -0.028   -0.025    0.019       0.061    0.022    0.043
µ      σ         F0         0.054    0.064   -0.044       0.049    0.127   -0.043
µ      σ         Erms       0.042    0.040   -0.006       0.084    0.160   -0.035
µ      σ         El         0.031    0.030   -0.002       0.072    0.145   -0.035
µ      D         -          0.204    0.179   -0.024       0.146    0.292   -0.082

Recall that the ps(i) ∈ {∨, ∧} functions select, respectively, the maximum and minimum values across all within-document occurrences of a term. These functions can be seen as two distinct sampling processes: one that prefers selecting high over low values from a sample of occurrence-level features of a term-document pair, and another that prefers selecting low over high values from the same sample. As the chances of encountering an extreme value in a sample increase with the size of the sample, features that are max- or min-aggregated in this way will tend to be positively or, conversely, negatively correlated with TFO scores. This may explain why most of the features aggregated via the max and min functions present the strongest correlation with TFO in the results from Table 5.17. Contrary to the aggregation functions ps(i) ∈ {Σ, ∨, ∧}, which are biased towards within-document frequency counts, calculating the average ps(i) = µ over occurrence-level features controls for the size of each data sample and is therefore not affected by the within-document frequency counts of the target terms. For ps(i) = µ, the Spearman's ρs values shown in Table 5.17 indicate that the acoustic features under study are generally poorly correlated with IR scores when averaged across occurrences. This is consistent with the results obtained in the experiments with the heuristic retrieval functions from Section 5.3, in which the scores resulting from mean aggregates (ps(i) = µ) were frequently less effective for document and passage ranking than those derived from maximums and summations.

A similar analysis can be made of the correlation between occurrence-level features and IR scores. Because occurrence-level features are not aggregated across within-document occurrences, this type of analysis avoids calculating correlation coefficients over features that are biased towards term frequency estimates. Table 5.18 depicts the results of such an experiment. The values in the table indicate that features extracted from the SDPWS2 collection are weakly correlated with IR scores, whereas those extracted from the BBC1 collection show practically no correlation. An exception to the latter finding is the duration feature (D), which presents mild associations with BM25 and BIM scores in both collections. In particular, the coefficients calculated for SDPWS2 show that the max (∨), mean (µ), and standard deviation (σ) of the contour features F0, Erms, and El are positively correlated with BIM scores, while the min (∧) of such features is negatively correlated. Therefore, occurrences with high values for these acoustic features tend to be associated with terms with low document frequency, i.e. with terms that occur rarely in the collection. This observation is consistent with the hypothesis about predictability and prominence of words, which states that unpredictable words are more likely to be accented, that is, to be spoken with more extreme acoustic values than words more commonly mentioned in the discourse.
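The frequency bias of the Σ, ∨, and ∧ aggregates described above can be illustrated with a small simulation, shown below under the simplifying assumption that occurrence-level feature values are non-negative, bounded, and drawn independently of how often a term occurs.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)

    # Simulate 5000 term-document pairs: a random within-document frequency and,
    # for each occurrence, a feature value drawn independently of that frequency.
    tf = rng.integers(1, 30, size=5000)
    samples = [rng.uniform(0.0, 1.0, size=n) for n in tf]

    for name, fn in [('sum', np.sum), ('max', np.max), ('min', np.min), ('mean', np.mean)]:
        agg = np.array([fn(s) for s in samples])
        rho, _ = spearmanr(agg, tf)
        print(f'{name:4s} vs within-document frequency: rho_s = {rho:+.2f}')

Even though the simulated feature carries no information at all, its sum and maximum correlate positively with the number of occurrences and its minimum negatively, while the mean remains uncorrelated, mirroring the behaviour of the Σ, ∨, ∧, and µ aggregates in Table 5.17.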

154 Table 5.18: Spearman’s ρs rank-order correlation coefficients of occurrence features against BM25, BIM, and TFO scores. The coefficient values are coloured based on the reference scale

1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 − − − −

                       (a) BBC1                      (b) SDPWS2
ps(k, i)  Cz(n)     BM25     BIM      TFO         BM25     BIM      TFO
∨         F0         0.024    0.025   -0.009       0.101    0.253   -0.140
∨         Erms       0.039    0.032    0.004       0.127    0.323   -0.183
∨         El         0.046    0.037    0.007       0.117    0.307   -0.179
∧         F0        -0.048   -0.039    0.000      -0.054   -0.102    0.033
∧         Erms      -0.075   -0.052   -0.021      -0.078   -0.163    0.062
∧         El        -0.078   -0.057   -0.015      -0.079   -0.197    0.095
µ         F0        -0.048   -0.034   -0.002       0.055    0.163   -0.107
µ         Erms      -0.031   -0.021   -0.004       0.065    0.199   -0.133
µ         El        -0.029   -0.021   -0.002       0.045    0.139   -0.100
σ         F0         0.049    0.043   -0.003       0.099    0.239   -0.120
σ         Erms       0.041    0.028    0.017       0.142    0.350   -0.188
σ         El         0.026    0.019    0.010       0.122    0.313   -0.174
D         -          0.211    0.166    0.018       0.176    0.477   -0.269

Regression analysis

Linear regression analyses were conducted to study the extent to which variations in IR scores can be explained by linear combinations of acoustic features. In the first analysis, a linear regression model was fitted with the term-level features to predict each of the BM25, BIM, and TFO scores. A similar analysis was then carried out considering only the occurrence-level features as the variables used by the regression model. In regression analysis, the multiple correlation between the combination of independent variables and the dependent variable is commonly estimated via the coefficient of determination (R2), calculated as the square of the Pearson's product-moment correlation between the model's predictions and the true values of the dependent variable. In order to facilitate comparison with the correlation coefficients reported in Tables 5.17 and 5.18, Table 5.19 reports, instead of R2, Spearman's ρs coefficients between the IR scores predicted by the linear regression models and their true values. As expected, the figures from Table 5.19 are consistent with the observations made from the correlation coefficients for individual features, namely: (i) that features are more highly correlated with IR scores when aggregated across occurrences, possibly due to the bias introduced in the term-level aggregation process; and (ii) that features extracted from the BBC1 collection tend to be less correlated with IR scores than those extracted from the SDPWS2 collection. With regard to (i), the correlation coefficients from models trained with unbiased features (ps(i) = µ in Table 5.17, Tables 5.18, 5.19c and 5.19d) still suggest that there may be a meaningful association between the IR score of a term and a linear combination of its acoustic features. With respect to (ii), there may be multiple reasons why the coefficients show weaker correlations in the BBC1 collection.
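A sketch of how the ρs values in Table 5.19 can be computed is given below; the feature matrix and score vector are placeholders standing in for the per-term-document acoustic features and their BM25 scores.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.random((1000, 13))         # term-level acoustic features (placeholder)
    bm25 = rng.gamma(2.0, 2.0, 1000)   # corresponding BM25 scores (placeholder)

    # Fit a linear model to predict the IR score from the acoustic features and
    # report the rank correlation between the predictions and the true scores.
    model = LinearRegression().fit(X, bm25)
    rho, _ = spearmanr(model.predict(X), bm25)
    print(f'rho_s(predicted, true BM25) = {rho:.3f}')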

155 Table 5.19: Spearman’s ρs between IR scores predicted by linear regression models trained with acoustic features against the true BM25, BIM, or TFO scores.

(a) BBC1, term-level features              (b) SDPWS2, term-level features
ps(i)   BM25    BIM     TFO                ps(i)   BM25    BIM     TFO
Σ       0.273   0.379   0.573              Σ       0.589   0.454   0.850
∨       0.269   0.349   0.440              ∨       0.505   0.386   0.693
∧       0.279   0.400   0.495              ∧       0.444   0.479   0.711
µ       0.214   0.190   0.059              µ       0.140   0.299   0.112
All     0.317   0.444   0.572              All     0.609   0.513   0.871

(c) BBC1, occurrence-level features        (d) SDPWS2, occurrence-level features
Cz(n)   BM25    BIM     TFO                Cz(n)   BM25    BIM     TFO
F0      0.090   0.073   0.007              F0      0.115   0.295   0.156
El      0.149   0.113   0.021              El      0.152   0.396   0.219
Erms    0.164   0.129   0.024              Erms    0.160   0.396   0.212
D       0.211   0.166   0.018              D       0.176   0.477   0.269
All     0.221   0.175   0.036              All     0.188   0.485   0.262

Since only automatic transcripts were available for the BBC1 documents, the errors introduced by the ASR may have disrupted the grouping of occurrences by term IDs and thus added noise to the correlation estimates. Besides ASR errors, the high diversity of the spoken material in the BBC1 collection, much of which includes multi-party conversations, background music, and speech recorded outdoors, may have introduced extra noise into the estimation of the acoustic features.

Correlation histograms

Silipo and Crestani (2000) presented a series of histograms as evidence of the observed correlation between BM25 and human-annotated stress scores in the OGI corpus. Since these manual annotations were grouped into three discrete categories of stressed words (low, medium, and high), the histograms in (Silipo and Crestani, 2000) showed how BM25 scores were distributed across these three classes. This section attempts to reproduce these histograms based on the acoustic data from the BBC1 and SDPWS2 collections. Because no manual annotations of syllable stress are available for the BBC1 and SDPWS2 data, the prediction scores given by a linear regression model trained with automatically extracted acoustic features were used as a substitute for the human-generated stress scores used in Silipo and Crestani's study. For this purpose, the documents from a spoken collection were first split into a training set and a test set, with 60% of the documents assigned for training and 40% for testing. A linear regression model was then fitted with the occurrence-level acoustic features from the training set to predict their associated BM25 scores. This model was then applied to the test set to generate a predicted BM25 score for each term occurrence in the test documents.
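A sketch of this procedure is given below. It is only illustrative: the feature matrix and BM25 scores are placeholders and, for brevity, the split is made over occurrences directly rather than over documents as in the experiments described here.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    X = rng.random((5000, 13))         # occurrence-level acoustic features (placeholder)
    bm25 = rng.gamma(2.0, 2.0, 5000)   # BM25 score of the term each occurrence belongs to

    # 60/40 split; the regression model is fitted on the training portion only.
    X_tr, X_te, y_tr, y_te = train_test_split(X, bm25, train_size=0.6, random_state=0)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

    # Split the test occurrences by whether their predicted BM25 score falls below or
    # above the mean prediction, and compare the distributions of their true BM25 scores.
    low, high = y_te[pred < pred.mean()], y_te[pred >= pred.mean()]
    plt.hist([low, high], bins=20, density=True, label=['predicted low', 'predicted high'])
    plt.xlabel('True BM25')
    plt.legend()
    plt.show()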

Figure 5.10: Distribution of true BM25 scores in the BBC1 collection for term occurrences with: (a) predicted BM25 < 3.28; (b) predicted BM25 > 3.28. (The y-axis of each histogram shows the proportion of occurrences.)

Figure 5.11: Distribution of true BM25 scores in the SDPWS2 collection for term occurrences with: (a) predicted BM25 < 11.62; (b) predicted BM25 > 11.62. (The y-axis of each histogram shows the proportion of occurrences.)

Figures 5.10 and 5.11 depict histograms that reflect the distribution of true BM25 scores in the test documents of the BBC1 and SDPWS2 collections respectively. Two histograms are depicted in each figure. The proportions shown in the left histograms (a) were calculated by considering only term occurrences whose predicted BM25 scores were less than the mean of all predicted BM25 scores, in an attempt to approximate the class of "low" stressed words. Similarly, the right histograms (b) were calculated over occurrences with predicted BM25 scores greater than the mean. If there were a strong correlation between predicted and true BM25 scores, as was observed in Silipo and Crestani's study, the left and right histograms would show a steeply decreasing and increasing sequence of bars respectively. But, since the degree of correlation between true and predicted BM25 scores is low in the BBC1 and SDPWS2 collections, the proportion of term occurrences with high true BM25 scores does not change substantially between the left and right histograms.

5.4.2 Acoustic-based classification of significant terms

This section describes experiments that seek to determine whether a meaningful difference exists in the acoustic realisation of words when spoken in "relevant" versus "non-relevant" contexts. For this investigation, a statistical classifier was trained to distinguish between query terms appearing in relevant and non-relevant documents. The hypothesis is that this classifier will be able to learn the relationship, if any, between the prominence and the importance of significant terms, or otherwise show that no such relationship holds in practice.

Generation of datasets for classification experiments

A dataset consisting of examples of terms occurring in relevant and non-relevant documents was generated for a given set of queries, documents, and relevance assessments. For this purpose, every occurrence of each query term in the collection was labelled as belonging to either the relevant, non-relevant, or unknown class.

The labelling process was performed as follows. Consider C = {d1, . . . , dN} as the collection of documents, ti the i-th term of C, and tkij its k-th occurrence in dj. Suppose also that Q is the set of queries q1, . . . , qL and R1, . . . , RL their respective sets of relevance assessments, such that Rl contains every document that is known to be relevant to ql. An occurrence tkij of a term ti is labelled as relevant if and only if there is a query ql ∈ Q such that ti ∈ ql and dj ∈ Rl. Every other query term occurrence tkij not labelled as relevant is then deemed: non-relevant, if a set of documents known to be non-relevant to ql is available, there is a query ql ∈ Q with ti ∈ ql, and dj belongs to that set; or unknown, if the relevance status of the document containing the query term occurrence is not available in the relevance assessments. Note that under this labelling scheme, all occurrences of query terms appearing in the same relevant document are labelled as relevant. Additionally, note that a term may be present in multiple queries and that the relevance assessments associated with such queries may be inconsistent for a particular document. For instance, an occurrence tkij may be deemed relevant w.r.t. ql and non-relevant or unknown w.r.t. some other query ql+1. When inconsistencies were encountered, the relevant class was always given preference over the unknown or non-relevant class, and the affected training instances were labelled as relevant.
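A compact sketch of this labelling rule is shown below; queries, qrels_rel, and qrels_nonrel are assumed, hypothetical data structures mapping query IDs to their terms and to the documents judged relevant or non-relevant.

    def label_occurrence(term, doc, queries, qrels_rel, qrels_nonrel):
        """Label one occurrence of a query term in a document.

        queries      -- dict: query id -> set of query terms (assumed)
        qrels_rel    -- dict: query id -> set of documents judged relevant (assumed)
        qrels_nonrel -- dict: query id -> set of documents judged non-relevant (assumed)
        """
        matching = [q for q, terms in queries.items() if term in terms]
        # The relevant class takes precedence whenever assessments are inconsistent
        # across the queries that contain the term.
        if any(doc in qrels_rel.get(q, set()) for q in matching):
            return 'relevant'
        if any(doc in qrels_nonrel.get(q, set()) for q in matching):
            return 'non-relevant'
        return 'unknown'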

Based on the labelling procedure described above, a dataset of training pairs {(~xkij, ykij)} was generated for a test collection, where ~xkij is a feature vector for the query term occurrence tkij and ykij is its associated relevance label. The vectors ~xkij were populated with the occurrence-level acoustic features. These included the duration D, and the contour features F0, Erms, and El aggregated via ∨, ∧, µ, and σ, for a total of 13 features per vector. Table 5.20 presents statistics about the datasets generated based on the SDR data from the BBC1, BBC2, and SDPWS2 collections. In addition, it is informative to study the predictive power of term-level features, resulting from aggregations of occurrence features.

Table 5.20: Statistics of datasets generated for occurrence classification experiments with SDR data from the BBC1, BBC2, and SDPWS2 collections. "Rel." stands for "Relevant".

                                               Training instances
Collection   Query set   Queries   Total      Rel.      Non-rel.   Unk.       Inconsist.
BBC1         SH13        50        156,848    1,581     0          154,985    282
BBC1         SH14        28        106,950    12,255    11,953     81,014     1,728
BBC2         SAVA        30        153,660    10,458    6,810      134,183    2,209
SDPWS2       SD2         110       19,290     6,056     0          10,967     2,267
SDPWS2       SQD1        35        19,344     3,636     0          13,740     1,968
SDPWS2       SQD2        80        42,104     9,756     0          25,086     7,262

Table 5.21: Statistics of datasets generated for term classification experiments with SDR data from the BBC1, BBC2, and SDPWS2 collections.

                                               Training instances
Collection   Query set   Queries   Total     Rel.     Non-rel.   Unk.      Inconsist.
BBC1         SH13        50        58,655    169      0          58,463    23
BBC1         SH14        28        42,445    1,398    2,657      38,170    220
BBC2         SAVA        30        59,789    1,109    1,821      56,739    120
SDPWS2       SD2         110       3,500     680      0          2,590     230
SDPWS2       SQD1        35        4,588     463      0          3,897     228
SDPWS2       SQD2        80        9,976     1,199    0          7,925     852

For this purpose, a dataset of term-document training pairs {(~xij, yij)} was constructed for each test collection, where ~xij is a feature vector populated with term-level features for term ti in document dj, and yij is a relevance label associated with this term-document pair. Table 5.21 shows statistics about the datasets generated with term-level features.

Experiments and results

The predictive power of the acoustic features was investigated in two classification tasks: an occurrence classification task, which consisted of classifying an occurrence-document pair tkij of a query term, given input ~xkij, into its relevance class for that document ykij; and a term classification task, which consisted of classifying a term-document pair tij, given input ~xij, into its relevance class yij. The datasets listed in Tables 5.20 and 5.21 were used for the occurrence and term classification tasks respectively. For each dataset, a 10-fold cross-validation experiment was carried out. Each cross-validation experiment required the training instances of a dataset to be randomly shuffled, grouped by query ID, and split into 10 equal-sized folds, ensuring that all instances associated with the same query ID were kept in the same fold. Subsequently, a classifier was trained on 9 folds and its performance measured on the remaining fold. This process was repeated 10 times for every possible combination of training and testing folds, and the resulting performance scores were averaged. Logistic regression classifiers were used in all classification experiments presented in this section. In particular, the scikit-learn toolkit (Pedregosa et al., 2011) was used to train and evaluate the logistic models.

To cope with the high class imbalance that exists in the datasets, different penalisation weights were set for the minority (relevant) and majority (non-relevant or unknown) classes. Specifically, the cost weight of each class was set to M/(2C), with M being the total number of instances and C the number of instances of that class in the training set. For similar reasons, the generalisation power of the classifiers was measured with balanced accuracy (BAC) (Brodersen et al., 2010) instead of conventional accuracy. Note that trivial classifiers, such as those that output the most common label in the dataset, obtain a BAC score of 50%. The statistical significance of the classification results was determined via permutation tests. Each test consisted of training a model to predict a random permutation of the original assignment of class labels. Statistical significance was then determined by calculating the proportion of accuracy scores that were larger than the score obtained by training the model on the original labels. The results presented in this section were validated with 1000 random permutations of label assignments. To better understand the predictive power of the acoustic features in each classification task, it is also useful to consider the accuracy of logistic models trained with scores produced by BIM, TFO, and BM25 as input features. Because the scores produced by the BIM, TFO, and BM25 functions are known to be useful for ranking relevant documents more highly than non-relevant ones, they should in principle be useful features for classifying term-document pairs into a relevant and a non-relevant class, and can thus provide a point of reference for measuring the usefulness of the acoustic features. While the BIM, TFO, and BM25 functions calculate a score for a term-document pair rather than for each term occurrence in a document, the scores produced by these functions can be directly extrapolated to individual term occurrences and hence used to populate the input vectors ~xkij.

Tables 5.22 and 5.23 present the results of the cross-validation experiments for the occurrence and term classification tasks respectively. Tables 5.22a and 5.23a show results for the task of classifying between relevant and unknown instances only, where non-relevant instances were considered as "unknown" when available, while Tables 5.22b and 5.23b show results for the task of classifying relevant from non-relevant instances. The first three columns in each table show accuracy scores obtained by training models with BIM, TFO, and BM25 features in isolation, while the IR column shows the accuracy obtained when training models with all BM25-based features. Columns 6 to 9 show results obtained for features derived from F0, Erms, El, and D respectively, while the "PROS" column shows the results for models trained with all acoustic features. Finally, the last column in each table depicts the accuracy obtained with models trained with the IR and PROS features in combination. As expected, the results from Table 5.22 demonstrate that IR features are substantially more effective than acoustic features at identifying occurrences of query terms in relevant documents. More importantly, the figures also indicate that models trained with the complete set of acoustic features (PROS) are significantly better than chance (50%), albeit without providing any additional benefit on top of the IR features (IR+PROS).
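The cross-validation set-up just described can be sketched with scikit-learn as follows. The arrays are placeholders; note that, for two classes, class_weight='balanced' reproduces exactly the M/(2C) weighting described above, and GroupKFold keeps all instances of a query in the same fold.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(3)
    X = rng.random((2000, 13))             # occurrence-level acoustic features (placeholder)
    y = rng.integers(0, 2, 2000)           # 1 = relevant, 0 = unknown/non-relevant (placeholder)
    query_ids = rng.integers(0, 50, 2000)  # query from which each instance was generated (placeholder)

    clf = LogisticRegression(max_iter=1000, class_weight='balanced')

    # 10-fold cross-validation grouped by query ID, scored with balanced accuracy (BAC);
    # a trivial classifier scores 0.5 under this metric.
    bac = cross_val_score(clf, X, y, groups=query_ids,
                          cv=GroupKFold(n_splits=10), scoring='balanced_accuracy')
    print(f'mean BAC = {bac.mean():.3f}')

For the significance testing, scikit-learn's permutation_test_score offers an analogous label-permutation test to the one described above.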

Table 5.22: Cross-validation BAC (%) of logistic regression models trained with occurrence-level acoustic features, BIM, TFO, or BM25 scores, for classifying among:

(a) relevant, and unknown or non-relevant classes;

Dataset   BIM      TFO      BM25     IR       F0       Erms     El       D        PROS     IR+PROS
SH13      62.69*   73.32*   69.07*   77.10*   50.64    52.28    54.10*   47.44    52.73    76.71*
SH14      64.58*   75.32*   71.84*   77.75*   46.67    51.92*   49.94*   52.66*   50.23*   77.79*
SAVA      62.74*   71.66*   71.71*   77.72*   49.16    51.33*   52.69*   49.04    51.88*   77.49*
SD2       61.86*   62.92*   62.63*   65.79*   52.46*   51.46*   49.33    53.14    53.70*   64.75*
SQD1      61.18*   63.74*   63.69*   64.88*   55.40*   53.72*   53.50*   55.75*   54.42*   64.82*
SQD2      52.98*   68.27*   62.24*   69.25*   52.47*   55.07*   55.16*   54.21*   55.21*   68.51*
Average   61.00    69.21    66.86    72.08    51.13    52.63    52.45    52.04    53.03    71.68

(b) relevant and non-relevant classes.

Dataset   BIM      TFO      BM25     IR       F0      Erms    El      D       PROS    IR+PROS
SH14      59.37*   61.52*   61.69*   59.85*   47.64   48.46   47.49   46.79   47.52   59.56*
SAVA      49.91*   64.91*   60.44*   67.42*   46.58   50.20   49.74   50.01   48.24   67.75*
Average   54.64    63.22    61.06    63.63    47.11   49.33   48.62   48.40   47.88   63.65

Among the group of IR features (columns 2-5), the results show similar trends to those observed in earlier experiments, specifically that weights produced by BIM are less effective than those produced by TFO, and that a combination of the two (IR) produces models that are more accurate at detecting relevant occurrences. Within the group of acoustic features (columns 6-10), the large majority resulted in models that performed significantly better than chance (50%). Models trained with all acoustic features in combination achieved the highest accuracy values on average. Tables 5.22b and 5.23b report accuracy scores for models trained with non-relevant targets instead of unknown instances. These experiments could only be performed for the SH14 and SAVA datasets, since these are the only collections for which examples of non-relevant documents are available. As can be observed from the results, the accuracy of the logistic models decreases significantly when they are trained to differentiate between relevant and non-relevant instances. The reason for this can be traced back to the way documents were pooled by the Search&Hyperlinking task organisers during the generation of the relevance assessments for the SH14 and SAVA queries. Since only the top documents ranked by a group of IR systems were included in the pools to be assessed, only the highest-scoring documents for a query were assessed as "relevant" or "non-relevant" by a human judge. Because all these documents contain a similarly high number of terms matching the query, the task of distinguishing between relevant and non-relevant documents becomes more difficult in this case. Despite this increase in task difficulty, the results from Table 5.22b indicate that BM25-derived features can still perform better than chance. Yet, models trained with acoustic features could not perform better than a trivial classifier, meaning that occurrences of terms spoken in these two categories of high-scoring documents are indistinguishable based on the acoustic features considered.

Table 5.23: Cross-validation BAC (%) of logistic regression models trained with term-level acoustic features, BIM, TFO, or BM25 scores, for classifying among:

(a) relevant, and unknown or non-relevant classes;

Dataset   BIM      TFO      BM25     IR       F0       Erms     El       D        PROS     IR+PROS
SH13      65.58*   72.45*   73.76*   80.90*   73.03*   72.56*   71.23*   68.88*   70.44*   78.41*
SH14      60.98*   71.57*   68.24*   74.43*   69.39*   70.71*   71.00*   68.71*   71.64*   75.62*
SAVA      64.84*   67.98*   70.73*   74.60*   64.94*   66.17*   65.95*   64.71*   66.31*   74.75*
SD2       61.64*   61.67*   66.33*   67.19*   62.35*   60.46*   60.39*   60.38*   61.61*   66.37*
SQD1      53.06    63.40*   61.35*   64.55*   63.23*   61.70*   62.83*   63.97*   61.21*   61.17*
SQD2      52.12    62.02*   60.35*   63.04*   61.70*   62.38*   62.13*   60.61*   61.68*   62.26*
Average   59.70    66.51    66.79    70.79    65.77    65.66    65.59    64.55    65.48    69.76

(b) relevant and non-relevant classes.

Dataset   BIM      TFO      BM25     IR       F0       Erms     El       D        PROS     IR+PROS
SH14      55.61*   61.96*   60.34*   63.33*   59.68*   59.29*   59.58*   58.24*   59.38*   62.39*
SAVA      53.12    62.87*   60.50*   63.72*   59.72*   60.77*   60.42*   59.16*   58.56*   62.19*
Average   54.37    62.42    60.42    63.53    59.70    60.03    60.00    58.70    58.97    62.29

The accuracy values for the term classification tasks shown in Table 5.23 indicate once more that the predictive power of the acoustic features increases when they are aggregated across term occurrences. This is consistent with the observations made in the analysis of Section 5.4.1, where term-level features presented strong correlations with term frequency estimates, and a similar effect is observed in the results of the classification experiments in Table 5.23: models trained with acoustic features achieved BAC scores similar to those of models trained with TFO scores. Despite these high correlations, the term-level acoustic features do not give additional improvements in BAC over models trained with BM25-derived features.

On the utilisation of occurrence-level predictions in BM25

In an additional study, we attempted to adapt a BM25-based ranking function to produce term weights that are sensitive to the predictions made by a classifier trained with acoustic features4. Similarly to the classification experiments described in the previous section, we trained a binary classifier with inputs ~xkij for each query term occurrence tkij to predict relevant and non-relevant target classes. For this study, we used an expanded set of 294 acoustic features per occurrence as the input vector ~xkij, instead of the 13 features used in the experiments described in the previous section. Our expanded feature set included the features proposed by Rosenberg (2012) for the task of pitch accent detection, as well as those proposed by Mishra et al. (2012) for the task of word prominence detection. Specifically, this feature set was composed of: aggregations (max, mean, standard deviation, and Z-score of maximum) of raw and speaker-normalised F0, log F0, and intensity contours; aggregations of their delta, spectral tilt, and spectral band contours; voicing ratio, centre of gravity, and area under the F0 and intensity contours; and the location and amplitude of peaks and valleys. All these features were extracted for a target occurrence as well as for a window of 8 context words around the target, using the AuToBI toolkit v1.5.1 (Rosenberg, 2010).

4This study was originally reported in (Racca and Jones, 2015b)

Instead of logistic regression models, the work in (Racca and Jones, 2015b) used radial-basis support vector machines (SVMs) (Cortes and Vapnik, 1995), with a combination of grid search and cross-validation to find suitable values for the C and γ parameters. The trained models were then used to predict labels for all term occurrences in the collection, including occurrences of terms that did not appear in any of the training queries. These predictions were next used in a modified BM25 function to boost the weight of terms whose occurrences in a document were predicted as relevant by the model. The approach adopted to incorporate the model's predictions into BM25 is similar to the CWL approach, previously described in Section 5.2.3, in which the raw term frequency of the term to be scored is replaced by a summation of occurrence-level scores. The alternative summation we used is shown in Equation 5.19,

FΣ(i, j) = Σ_k α^(ŷkij)                                                    (5.19)

where ŷkij = f(~xkij) ∈ {−1, 1} is the classifier's prediction for occurrence tkij and α is some positive constant. Essentially, the function accumulates an amount equal to α for every occurrence predicted as relevant and, conversely, an amount equal to α^(−1) for every occurrence predicted as non-relevant.
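In code, the summation of Equation 5.19 can be sketched as follows; the value of α below is arbitrary, and predictions stands for the ±1 classifier outputs for the occurrences of term i in document j.

    def f_sigma(predictions, alpha=2.0):
        """Equation 5.19: accumulate alpha for each occurrence predicted relevant (+1)
        and alpha**-1 for each occurrence predicted non-relevant (-1)."""
        return sum(alpha ** y_hat for y_hat in predictions)

    # Three occurrences of a term in a document, two of them predicted relevant:
    # the value 2 + 0.5 + 2 = 4.5 replaces the raw term frequency tf = 3 in BM25.
    print(f_sigma([+1, -1, +1]))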

The effectiveness of a BM25 function that uses FΣ(i, j) instead of the raw term frequency counts was evaluated in the SPR task. A cross-training experiment was carried out for this purpose, with the SDPWS test collection and the SD2 and SQD1 query sets used for generating either training or testing data. In addition, we conducted experiments with manual as well as with ASR transcripts. The final results of these experiments, included in Appendix E, did not allow any clear conclusions to be drawn as to whether the proposed approach was effective in improving over a standard BM25 baseline in the SPR task. In fact, the retrieval effectiveness of the modified BM25 function was only found to be statistically significantly better than Okapi BM25, based on a paired t-test, when SD2 queries were used to train the SVM model and Okapi BM25 was set to sub-optimal values for the b and k1 parameters. When better values for b and k1 were used in BM25, no significant differences were observed between the baseline and the acoustically enhanced version of BM25. It is important to note that the majority of features included in our extended feature set are based on additional transformations (functionals) applied to the basic F0 and intensity contours. Even though this extended feature set may provide increased performance in pitch accent detection, classification, and prominence detection tasks, as demonstrated by Mishra et al. (2012) and Rosenberg (2012), it did not perform significantly better than

the subset of features considered in the other classification experiments described in this thesis. An important possible limitation of the approach proposed in this study is that the classification task posed to the statistical classifiers, that is, classifying each individual term occurrence into a relevant or non-relevant class, is ill-defined. One reason for this is that the terms in a query are not all equally important for distinguishing relevant from non-relevant content, yet the labelling procedure adopted for mapping document-level relevance assessments onto occurrence-level classes makes no distinction among the terms in a query. A second reason is that the occurrence-level relevance classes may not be entirely valid at the level of occurrences, nor reflective of the underlying ranking task to be solved, which requires a classification to be made at the level of documents instead. Since the relevance status of a document is unlikely to be determined by the presence or absence of a single query term, training a classifier to predict the relevance status of a document based entirely on a single occurrence is unlikely to be valuable for the task of ranking documents (or passages) in order of relevance to the query. In the next section, experiments are described with a learning-to-rank approach which seeks to solve the document ranking problem directly. Instead of training the model to optimise an ill-defined learning objective at the level of term occurrences, the statistical model is trained to optimise the quality of a ranking of documents based on acoustic features extracted from the query terms matching those documents.

5.4.3 Learning-to-rank with acoustic features

In the experiments with feature combinations described in Section 5.3.5, the prominence score for a term-document pair was formed by averaging the available features associated with occurrences of the term in the document. These combined scores were then incorporated into variants of the Okapi BM25 function as described in Section 5.2, and then used for scoring spoken documents or passages for a given query. Learning-to-rank approaches present a potentially more effective alternative for combining and incorporating new information into existing ranking functions. Because learning-to-rank approaches rely on supervised learning techniques to "learn" a ranking function from examples of query-document pairs and relevance assessments, they can facilitate the integration and combination of non-standard features in the construction of new retrieval models. A commonly cited example in this context is that of PageRank (Page et al., 1999), which can easily be incorporated via learning-to-rank approaches, yet has proven difficult to integrate effectively into theoretical IR frameworks (Craswell et al., 2005). Besides facilitating the inclusion of new features, the underlying learning algorithms used in learning-to-rank approaches are capable of exploiting inter-feature relationships or even discovering new feature transformations that are useful for the underlying ranking task. The experiments described in this section seek to determine whether a state-of-the-art learning-to-rank approach is capable of exploiting the document-level acoustic information of terms to improve the quality of a given ranking of spoken documents.

Table 5.24: Feature template used in learning-to-rank experiments. The total number of features extracted per query-document pair is 208: 64 for each of F0, Erms, and El, plus 16 based on D.

Document d     Term ps(i)     Occurrence ps(k, i)   Contour Cz(n)
Σ, ∨, ∧, µ     Σ, ∨, ∧, µ     ∨, ∧, µ, σ            D, F0, Erms, El

LambdaMART

Among the learning-to-rank approaches proposed in the literature, LambdaMART (Burges, 2010; Burges et al., 2011) was chosen for this set of experiments. Because LambdaMART models can be trained to iteratively improve upon a given baseline ranking function, they provide a useful tool to assess whether a ranking produced by a standard lexical-based BM25 function can be improved by considering additional prosodic/acoustic features. In addition, this learning-to-rank approach was the winner of the Yahoo! Learning to Rank Challenge (Chapelle and Chang, 2011) and has been demonstrated to perform on par with other state-of-the-art approaches in different tasks and collections (Tax et al., 2015). Appendix C provides a detailed description of LambdaMART models.

Experimental set-up

For training a LambdaMART model with data from a specific test collection, a set of M training pairs {(x_j, y_j)}_{j=1..M} needs to be produced for every query. For each query q_l in a test collection, let D_l be the set of documents that contain one or more terms from that query. A training instance (x_j, y_j) was created for each d_j ∈ D_l, where x_j is a feature vector with values derived from acoustic features of the query terms present in d_j, and y_j is the relevance score of d_j, equal to 1 if d_j is relevant to q_l, as stated in the relevance assessments, and 0 otherwise.

The features that comprise the document vector x_j were generated by aggregating term-level acoustic features at the document level, as illustrated in Figure 5.3. Table 5.24 summarises the different features extracted for every query-document pair in a test collection. The total number of acoustic features generated per query-document pair is 208. As was done in the classification experiments described in Section 5.4.2, results are reported for LambdaMART models trained with each of the individual feature groups F0, Erms,

El, or D, as well as with the complete set of acoustic features (PROS).
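As an illustration of the hierarchical aggregation implied by Table 5.24, the following minimal sketch (not the thesis implementation; all function and variable names are mine, and the exact functional definitions used in the experiments may differ) shows how frame-level contour values for one descriptor such as F0 could be summarised per occurrence, then per term, and finally per document, yielding 4 × 4 × 4 = 64 features for each contour descriptor.

```python
import numpy as np

# Hypothetical sketch of the three-level aggregation for one contour descriptor.
OCC_FN  = [np.max, np.min, np.mean, np.std]   # over the contour frames of one occurrence
TERM_FN = [np.sum, np.max, np.min, np.mean]   # over the occurrences of one query term
DOC_FN  = [np.sum, np.max, np.min, np.mean]   # over the query terms matched in the document

def contour_features(contours_by_term):
    """contours_by_term: {query term -> list of frame-value arrays, one per occurrence}."""
    feats = []
    for f_occ in OCC_FN:
        for f_term in TERM_FN:
            # one value per matched query term
            term_vals = [f_term([f_occ(c) for c in contours])
                         for contours in contours_by_term.values()]
            for f_doc in DOC_FN:
                feats.append(float(f_doc(term_vals)))
    return feats  # 4 x 4 x 4 = 64 values per contour descriptor

example = {"topic": [np.array([0.1, 0.4, 0.2])],
           "model": [np.array([0.5, 0.3]), np.array([0.2])]}
assert len(contour_features(example)) == 64
```

For the duration descriptor D, which has a single value per occurrence rather than a contour, only the two outer aggregation levels would apply, giving the remaining 16 features per query-document pair.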

Table 5.25: Statistics of datasets generated for learning-to-rank experiments with SDR data from the BBC1, BBC2, and SDPWS2 collections.

Collection   Query set   Queries   Training instances (Total)   Training instances (Relevant)
BBC1         SH13        50        44,073                       50
BBC2         SH14        28        37,323                       778
BBC2         SAVA        30        44,303                       492
SDPWS2       SD2         110       4,686                        440
SDPWS2       SQD1        35        2,784                        114
SDPWS2       SQD2        80        6,953                        300

Table 5.25 presents statistics of the learning-to-rank datasets generated based on the LIMSI and MAN transcripts of the BBC1, BBC2, and SDPWS2 collections. The generated datasets can be arranged into two groups of three: a BBC group (English) with the query set splits SH13, SH14, and SAVA; and a SDPWS2 group (Japanese) with the query set splits SD2, SQD1, and SQD2. Learning-to-rank experiments were carried out by cross-validating LambdaMART models across the splits in a group, where each split served as either training, validation, or test data. Validation data was mainly used to avoid overfitting by stopping the training algorithm if no improvements were seen on the validation queries after 50 consecutive iterations of LambdaMART. All models were optimised on and evaluated using mean average precision (MAP). An open source Python implementation of LambdaMART, RankPy5, was used to train all models. In addition to early stopping, various hyperparameters in LambdaMART can be adjusted to help overcome overfitting. These were set to the values recommended in the RankPy package, as follows:

• Shrinkage: 0.1
• Maximum number of leaf nodes: 5
• Minimum number of instances per leaf: 50
• Minimum number of instances required to split a node: 2

In addition to the input vectors x_j, a LambdaMART model can be provided with a base ranker, F0(x_j), used to generate initial rankings of documents for every query. When a base ranker is provided, LambdaMART tries to improve upon this baseline by augmenting the ensemble with new trees trained on the residuals of the trees already in the ensemble. This set-up matches well with the overall objective of the present study, which seeks to ascertain whether acoustic features can be used to improve a well-tuned BM25 ranking function. For this reason, LambdaMART models were initialised with the rankings produced by the variations of Okapi BM25 considered in the previous experiments, specifically, the BIM, TFO, and the full version of BM25. Note that, as shown in Table 5.3a, the best-performing values of the BM25 parameters b and k1 differ slightly across the query sets in

the BBC and SDPWS collections. In this respect, a realistic scenario is adopted in which only training queries are assumed to be available at training time, and consequently the baseline BM25 scores were produced with the b and k1 values that performed best in each respective training set.

5 https://bitbucket.org/tunystom/rankpy
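The following sketch is only a rough stand-in for the set-up described above: the thesis used RankPy, which is no longer actively maintained, so LightGBM's LGBMRanker is used here for illustration. Its parameter names map only approximately onto the settings listed above (shrinkage maps to learning_rate, maximum leaf nodes to num_leaves, minimum instances per leaf to min_child_samples), its lambdarank objective optimises NDCG internally rather than MAP, and the synthetic data and variable names are hypothetical.

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in data: per query-document feature vectors (208 acoustic
# features), binary relevance labels, and the number of candidates per query.
rng = np.random.default_rng(0)
n_feat = 208
X_train = rng.normal(size=(600, n_feat)); y_train = rng.integers(0, 2, 600)
train_group = [60] * 10                       # 10 training queries, 60 candidates each
X_dev = rng.normal(size=(200, n_feat)); y_dev = rng.integers(0, 2, 200)
dev_group = [50] * 4                          # 4 validation queries
X_test = rng.normal(size=(100, n_feat))

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    learning_rate=0.1,        # "shrinkage"
    num_leaves=5,             # maximum number of leaf nodes per tree
    min_child_samples=50,     # minimum number of instances per leaf
    n_estimators=1000,
)

ranker.fit(
    X_train, y_train, group=train_group,
    eval_set=[(X_dev, y_dev)], eval_group=[dev_group], eval_at=[1000],
    # stop if no improvement is seen on the validation queries for 50 rounds
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
test_scores = ranker.predict(X_test)
```

The base-ranker initialisation used in the experiments (starting from BIM, TFO, or BM25 rankings) could be approximated by passing the baseline scores through the fit() init_score and eval_init_score arguments where the library version supports them; this detail is omitted above.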

Experimental results

Tables 5.26, 5.27, and 5.28 summarise the results obtained by LambdaMART models for every data split. In particular, the three tables depict respectively results obtained for models using the BIM, TFO, and full BM25 functions as base rankers, trained with different feature subsets (+F0, +Erms, +El, and +D) and with the complete set of acoustic features (+PROS). The columns BIM, TFO, and BM25 show the effectiveness achieved by such retrieval functions on the test queries. Bold values and * symbols mark statistically significant (p < 0.05) and highly significant (p < 0.01) differences respectively between

LambdaMART and the BIM, TFO, and BM25 baselines. Because the b and k1 parameters used in each case were tuned on the training queries, the MAP scores of the baselines TFO and BM25 are lower than those reported in Section 5.3 for these ranking models.

The results from Table 5.26 show that LambdaMART can effectively improve upon BIM rankings when trained with the acoustic features considered. This is consistent with the results reported earlier for the experiments with the GH function and supports the hypothesis that the acoustic features under study can effectively capture within-document term frequency information when aggregated across occurrences. The latter observation is not surprising considering that most of the term-level acoustic features are biased towards term frequency estimates, as demonstrated in the experiments described in Sections 5.4.1 and 5.4.2. For models that used TFO as the base ranker (Table 5.27), the acoustic features did not provide consistent benefits in retrieval effectiveness. Similar observations can be made about models that used BM25 to produce the initial rankings of documents (Table 5.28), with small but nevertheless significant improvements over the baseline in a few of the test conditions. Overall, the average effectiveness of LambdaMART models trained with acoustic features tends to be slightly greater than the BM25 baseline, especially in experiments with the SDPWS2 collection. Despite this, significant improvements were the exception rather than the rule and were sometimes inconsistent across data splits and feature groups.

Some of the acoustic features under study, in particular those aggregated via a max (∨), min (∧), or summation (Σ) across term occurrences, are biased towards within-document term frequency estimates. Because these types of features will always be strongly correlated with term frequency estimates, irrespective of the feature's original values, there remains the question of whether the observed improvements over the BIM, TFO, and BM25 baselines are truly due to the acoustic information of terms or to some random effect caused by using an extended set of aggregated features that are well correlated with

term frequency estimates. If it is true that speakers tend to "highlight" significant words by using extreme acoustic values when they speak, then they will tend to do so consistently. In other words, their assignment of acoustic values to words will not be arbitrary. In order to determine whether the observed improvements in MAP were a consequence of the particular way that acoustic values were assigned by speakers to term occurrences in the BBC and SDPWS2 collections, or whether they were merely due to the bias introduced by the feature aggregation process, the learning-to-rank experiments were repeated with features derived from random permutations of the original acoustic-occurrence assignments. For this purpose, the feature vectors associated with each term occurrence in a collection were permuted randomly so that, at the end of this process, each term occurrence was randomly assigned to the feature vector of another term occurrence in the corpus. The resulting randomised features were then aggregated across occurrences and subsequently across terms and documents to form the datasets used to train, validate, and test the LambdaMART models. This experiment was repeated with 100 distinct random permutations of the vectors in a training set. For each of these, a LambdaMART model was trained and tested on a test set. The resulting 100 MAP scores for each test set were finally used to estimate a p-value. Results with grey-coloured cells in Tables 5.26, 5.27, and 5.28 highlight cases in which the MAP score obtained with the original acoustic features can be deemed significant (p < 0.05), based on the fact that such scores are higher than 95% of all MAP scores obtained with random permutations.

Among all MAP scores of LambdaMART models that were deemed significantly higher than BIM (Table 5.26) based on a t-test, only a small fraction were also deemed meaningful according to the permutation test. This suggests that much of the improvement obtained over the BIM was probably due to the fact that aggregated features are correlated with term frequencies, rather than to the actual information that is encoded in the acoustic features. Thus, the way features were transformed provided a noisy approximation of term frequency information, which is known to be useful in combination with IDF information such as that modelled by the BIM. Despite this, there were still cases in which models trained with acoustic features provided significant improvements over both models trained with random permutations and the BIM, TFO, and BM25 baselines. This suggests that acoustic features may, under special circumstances, provide complementary information to term distribution statistics that is useful for ranking relevant spoken documents. However, these meaningful improvements were only observed for a few of the test conditions considered, and therefore do not generalise across test collections.
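The following schematic sketch outlines the permutation procedure just described. It is not the experimental code: the helpers build_dataset, train_lambdamart, and mean_average_precision are hypothetical placeholders standing in for the corresponding steps of the experiment, and only the structure of the randomisation and p-value estimation is shown.

```python
import numpy as np

def permutation_p_value(occurrence_vectors, train_queries, test_queries,
                        observed_map, n_perm=100, seed=0):
    """Estimate how often randomly re-assigned acoustic features do at least
    as well as the genuine ones (one-sided permutation p-value)."""
    rng = np.random.default_rng(seed)
    permuted_maps = []
    for _ in range(n_perm):
        # Randomly re-assign every term occurrence to the feature vector of
        # another occurrence, destroying any genuine acoustic/word association
        # while preserving the marginal distribution of feature values.
        shuffled = rng.permutation(len(occurrence_vectors))
        perm_vectors = [occurrence_vectors[j] for j in shuffled]
        train_data = build_dataset(perm_vectors, train_queries)   # placeholder
        test_data = build_dataset(perm_vectors, test_queries)     # placeholder
        model = train_lambdamart(train_data)                       # placeholder
        permuted_maps.append(mean_average_precision(model, test_data))
    return float(np.mean(np.array(permuted_maps) >= observed_map))
```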

5.4.4 Summary of experiments with statistical methods

Section 5.4 described additional experiments that sought to gain a better understanding of the relationship between the prosodic realisation of terms, their lexical-based weights as measured by variations of the BM25 function, and the relevance status of the content in which such terms are spoken.

Table 5.26: Retrieval effectiveness of LambdaMART models on test queries when BIM is used as base ranker.

(a) BBC

Train   Dev    Test   BIM     +F0     +Erms   +El     +D      +PROS
SH14    SAVA   SH13   0.236   0.362*  0.398*  0.377*  0.394*  0.428*
SAVA    SH14   SH13   0.239   0.389*  0.408*  0.403*  0.344   0.398*
SH13    SAVA   SH14   0.197   0.338*  0.362*  0.356*  0.308*  0.363*
SAVA    SH13   SH14   0.209   0.381*  0.369*  0.369*  0.332*  0.370*
SH13    SH14   SAVA   0.194   0.301*  0.300*  0.334*  0.254*  0.316*
SH14    SH13   SAVA   0.184   0.329*  0.308*  0.337*  0.322*  0.323*
Average               0.210   0.350   0.357   0.363   0.326   0.366

(b) SDPWS2

Train   Dev    Test   BIM     +F0     +Erms   +El     +D      +PROS
SQD1    SQD2   SD2    0.615   0.697*  0.731*  0.738*  0.694*  0.705*
SQD2    SQD1   SD2    0.637   0.720*  0.735*  0.735*  0.702*  0.727*
SD2     SQD2   SQD1   0.511   0.573   0.607   0.563   0.577   0.595
SQD2    SD2    SQD1   0.511   0.589   0.590   0.609   0.591   0.594
SD2     SQD1   SQD2   0.474   0.559*  0.565*  0.551*  0.560*  0.549*
SQD1    SD2    SQD2   0.473   0.592*  0.625*  0.637*  0.597*  0.584*
Average               0.537   0.622   0.642   0.639   0.620   0.626

Table 5.27: Retrieval effectiveness of LambdaMART models on test queries when TFO is used as base ranker.

(a) BBC

Train   Dev    Test   TFO     +F0     +Erms   +El     +D      +PROS
SH14    SAVA   SH13   0.437   0.457   0.455   0.451   0.458   0.461
SAVA    SH14   SH13   0.459   0.482   0.459   0.459   0.475   0.469
SH13    SAVA   SH14   0.428   0.429   0.438*  0.433   0.437*  0.434
SAVA    SH13   SH14   0.409   0.392   0.382   0.405   0.401   0.391
SH13    SH14   SAVA   0.358   0.362   0.363   0.363   0.359   0.363
SH14    SH13   SAVA   0.360   0.367   0.361   0.367   0.363   0.367
Average               0.409   0.415   0.410   0.413   0.415   0.414

(b) SDPWS2

Train   Dev    Test   TFO     +F0     +Erms   +El     +D      +PROS
SQD1    SQD2   SD2    0.654   0.667   0.675*  0.683   0.675   0.693*
SQD2    SQD1   SD2    0.653   0.671   0.665   0.667   0.671   0.671
SD2     SQD2   SQD1   0.662   0.673   0.658   0.665   0.670   0.687
SQD2    SD2    SQD1   0.667   0.674   0.650   0.650   0.669   0.668
SD2     SQD1   SQD2   0.653   0.654   0.655   0.667   0.664   0.650
SQD1    SD2    SQD2   0.670   0.678   0.675   0.680   0.677   0.670
Average               0.660   0.670   0.663   0.669   0.671   0.673

Table 5.28: Retrieval effectiveness of LambdaMART models on test queries when Okapi BM25 is used as base ranker.

(a) BBC

Train   Dev    Test   BM25    +F0     +Erms   +El     +D      +PROS
SH14    SAVA   SH13   0.476   0.486   0.441   0.483   0.476   0.455
SAVA    SH14   SH13   0.479   0.482   0.474   0.489   0.454   0.469
SH13    SAVA   SH14   0.407   0.413   0.410   0.416*  0.413   0.404
SAVA    SH13   SH14   0.416   0.394   0.403   0.421   0.413   0.418
SH13    SH14   SAVA   0.371   0.366   0.375   0.375   0.373   0.375
SH14    SH13   SAVA   0.378   0.385   0.385   0.384   0.384   0.383
Average               0.421   0.421   0.415   0.428   0.419   0.417

(b) SDPWS2

Train   Dev    Test   BM25    +F0     +Erms   +El     +D      +PROS
SQD1    SQD2   SD2    0.713   0.725   0.732   0.730   0.735   0.730
SQD2    SQD1   SD2    0.697   0.719   0.712   0.718   0.718   0.715
SD2     SQD2   SQD1   0.706   0.685   0.694   0.708   0.691   0.702
SQD2    SD2    SQD1   0.698   0.691   0.685   0.704   0.669   0.691
SD2     SQD1   SQD2   0.643   0.645   0.646   0.645   0.655   0.645
SQD1    SD2    SQD2   0.664   0.672   0.661   0.674   0.686   0.679
Average               0.687   0.689   0.688   0.696   0.692   0.694

Such relationships were indirectly studied through a series of data analysis and machine learning experiments.

First, the correlation between acoustic features of terms and their BM25 scores was investigated. This work extends that of Silipo and Crestani (2000) with the speech data from the BBC and SDPWS collections. The correlation analysis performed over term-level features indicated that when these are aggregated via max (∨), min (∧), and summation (Σ), the resulting scores tend to be strongly correlated with within-document term frequencies, irrespective of the feature values that term occurrences acquire. By contrast, when term-level features are averaged across occurrences, they are weakly correlated with BM25, BIM and TFO weights. The latter trend was also observed when estimating the correlation of occurrence-level features against BM25 weights, which was significantly weaker than that observed for term-level features. Considering multiple acoustic features and combining these linearly generally produces scores that are more strongly correlated with BM25, BIM and TFO weights, as demonstrated by the linear regression experiments.

Second, logistic regression classifiers were used to determine whether speakers assign special acoustics to words spoken in contexts for which the word is topically relevant. In this study, a word was considered topically relevant if it appears in a document which is deemed relevant to a query containing the word. The results of this experiment demonstrated that a linear classifier could, to a minor extent, identify differences between words pronounced in relevant and non-relevant contexts when trained with acoustic features in isolation. However, term weights produced by variations of the BM25 function were significantly more effective at distinguishing relevant from non-relevant occurrences than

acoustic features. Using BM25 weights in combination with acoustically derived weights to train the logistic models did not result in increased classification performance over that achieved with the BM25 weights alone. Furthermore, the fact that models trained with TFO weights were similarly accurate to models trained with term-level acoustic features, but less accurate than models trained with occurrence-level acoustic features, supports the observation that term-level features are indeed affected by the aggregation procedure and biased towards term frequency estimates.

Third, LambdaMART models were trained with features at the level of documents to ascertain whether such models are capable of exploiting the acoustic information of words in an SDR task more effectively. The models were trained to improve upon a given initial ranking of documents, produced by the BIM, TFO or BM25 retrieval functions. LambdaMART models were shown to provide increased retrieval effectiveness over a ranking of documents produced by BIM, but they did not show clear and consistent improvements over rankings produced by TFO and BM25 across the majority of the test conditions.

5.5 Summary

This chapter investigated the value of prosodic information as a complement to lexical information to produce enhanced term weights for SCR. For this purpose, a set of speaker-normalised acoustic correlates of pitch, loudness and duration was extracted for each index term from the BBC and SDPWS collections. A diverse set of experiments was then carried out with heuristic retrieval functions that sought to incorporate these acoustic features into the calculation of term weights, to determine the importance that individual terms matching a query should be given in the retrieval process. Results from these experiments demonstrated that the acoustic information of individual words could provide benefits in retrieval effectiveness on top of a lexical-based retrieval function only when the term weights produced by the latter are of low quality, uniform, or otherwise poorly estimated.

The relationship between lexical and acoustic term weights was next studied through the analysis of the correlation between acoustic features and lexical weights. This analysis indicated that scores derived from the set of acoustic features are weakly correlated with lexical weights produced by the BM25 retrieval function. Further experiments were then conducted to investigate the relationship between terms considered topically relevant and their prosodic information, by proxy of the relevance assessments of documents and queries available in the BBC and SDPWS collections. These experiments showed that while there was some value in using acoustic information for identifying topically relevant terms, the acoustic features did not provide any benefits over lexical features, nor did they combine effectively with them. Additional experiments with a learning-to-rank approach also indicated that acoustic features may be of value only for improving over low-quality rankings produced by sub-optimal lexical-based functions.

Overall, the experimental work described in this chapter suggests that, to a minor extent, the speech prosody of words can capture useful information about the significance of words pronounced in a spoken document. However, compared to evidence that can be derived from word distribution statistics, the evidence that prosodic information can provide about the importance of words is not sufficiently strong to be useful for SCR. Besides providing a rather weak and noisy signal about word importance, the prosodic correlates considered in this investigation do not seem to provide complementary information to word distribution statistics that could be used to improve the ranking of spoken documents or passages.

Chapter 6

Robust SCR through Passage Contextualisation

When documents are long and multi-topical, and in applications that seek to minimise audio playback time, spoken passage retrieval is normally preferred over full spoken document retrieval. Despite the recent progress that has been made in the quality of speech recognition systems, ASR errors still pose a significant challenge for SCR applications, especially in cases where the information units to be retrieved are short in length. Since short retrieval passages contain fewer term repetitions, recognition errors may have a much larger impact on the ability of a system to retrieve these elements effectively.

In passage and XML retrieval, contextualisation techniques (Kekäläinen et al., 2009; Arvola et al., 2011; Carmel et al., 2013) seek to improve the rank of a relevant element by considering information from its surrounding elements and its container document. Recent research has demonstrated that some of these techniques are also particularly effective in SCR applications (Nanjo et al., 2014; Shiang et al., 2014). However, no previous research has explicitly studied their potential to provide robustness to speech recognition errors.

This chapter evaluates existing contextualisation techniques, including a recently proposed technique based on positional language models (PLM) (Lv and Zhai, 2009), on the task of retrieving relevant spoken passages in response to a spoken query. The benefits of these techniques are studied when queries and documents are transcribed with increasingly higher error rates, in order to simulate increasingly difficult retrieval conditions.

This chapter is structured as follows. Section 6.1 provides an extended introduction and motivates the use of contextualisation techniques for SCR. Section 6.2 presents the various contextualisation techniques considered in this investigation. Experiments with these techniques are next presented in Section 6.3. Lastly, Section 6.4 summarises research findings.

6.1 Motivation

Although the quality of ASR systems has improved significantly over the past few years, ASR errors still pose a challenge to traditional text retrieval techniques. This is particularly the case in domains where speech is informal, conversational, or spontaneous (Larson and Jones, 2012b), or when the elements to be retrieved by the SCR system are short in length or lack sufficient contextual information and verbosity to be retrieved effectively (Allan, 2001). Context and verbosity are desirable properties of a retrievable element because they can increase its chances of matching one or more query terms, even when many of its terms are misrecognised by the ASR system. In general, the more repetitions of the important terms used to convey the topic, and the more exhaustively this topic is covered by the terms in the element to be retrieved, the more robust its matching process will be against ASR errors.

The domain and level of spontaneity of the speech content may well affect the diversity of the vocabulary used to convey information as well as the amount of word repetition (verbosity). These characteristics of speech may in turn make the task of finding relevant information more or less difficult for an SCR system. For instance, in broadcast news, presenters frequently read written reports whose content and word usage have been carefully selected to facilitate the understanding of the material while maximising communication effectiveness. By contrast, in less formal speech, topics tend to be conveyed somewhat more vaguely, by using a more limited vocabulary and making use of fewer content-bearing words and/or synonyms.

Related to the increased difficulty of retrieving elements with poor or non-descriptive vocabulary, there is also the problem of structuring long multi-topic documents into suitable retrieval elements. Existing content structuring approaches based on structural cues or text segmentation techniques produce a static, fixed set of segments which may not always align well with topic boundaries or with the elements that will best satisfy individual user information needs. In addition, the length of the segments produced by such segmentation approaches has a potential effect on the robustness of an SCR system to ASR errors. While an SCR system may still be able to retrieve at top ranks a long element in which a few of the query words have been misrecognised, such a system will arguably have more difficulty retrieving a shorter version of this element at similar ranks, as in the latter case the impact of ASR errors will be greater relative to the length of the element.

The extent to which an SCR system is robust to ASR errors is thus likely to depend on the length, verbosity, vocabulary diversity, and boundary quality of the elements that are considered as retrievable units by the SCR system. To see why all these factors are important, consider the spoken document example from Figure 6.1. The plots from the figure depict the locations of terms from a query appearing in the manual and ASR transcripts of a spoken lecture from the SDPWS2 collection. Every coloured vertical line in the

plots indicates positions where a particular query term occurs within the document, with each colour representing the occurrences of a different query term and height proportional to the term's inverse document frequency. Vertical dashed lines indicate the boundaries of retrieval elements, defined at positions where slide transitions were made in the lecture (see Section 4.2.4), numbered from 01 to 29 in this document, with elements 15-26 being relevant to the query.

The plots from Figure 6.1 clearly depict the effects that deletion and substitution errors can produce on a speech transcript and the potential impact these may have on retrieval effectiveness. The example also shows a case of sub-optimal segmentation, where the relevant section has been fragmented into several smaller retrieval elements. As the example shows, ASR errors can substantially reduce the number of query terms appearing in the transcript, particularly within regions that are relevant to the query. Thus, regions that would otherwise contain a high number of query term occurrences and be assigned high relevance scores by the SCR system can be "diluted" or "weakened" by the effects of recognition errors and thus be assigned less prominent relevance scores instead. Intuitively, these negative effects are expected to worsen as the amount of ASR errors in the transcripts increases.

As the example exposes, an SCR system that uses a text retrieval method to rank spoken elements based on the amount of term overlap between each element and the query is more likely to suffer from the impact of ASR errors if it considers each retrieval element independently in the scoring and ranking processes. However, because the content corresponding to the entire relevant section in this case was conveyed by using a high number and diverse range of terms related to the query, an SCR system may still be able to return all relevant elements from this document at high ranks if it considers their surrounding context (neighbouring elements) when computing each element's relevance score. Thus, while ASR errors can cause query terms to disappear from the transcript and "dilute" regions with a high density of query terms, considering term occurrences from neighbouring elements in the relevance scoring process may help in recovering the original density information of the relevant elements.

Besides being potentially useful against ASR errors and inaccurate segmentation of the spoken material, retrieval techniques that consider the context of retrieval elements may also be able to capture explicit dependencies among related elements. Traditional IR models assume that the relevance of a document is independent of the relevance of other documents in the collection. Although this assumption may seem reasonable in document retrieval applications, it certainly seems less justifiable in the case of passage retrieval, where many of the elements to be ranked may in fact belong to a single document. Elements that belong to the same document are more likely to be about similar topics and, therefore, more likely to condition the probability of relevance of other elements that also occur in that document. In lectures or academic presentations, for example, it is normal for a presenter to provide an introduction at the beginning of the talk which, even though

Figure 6.1: Locations where query terms appear in a manual and ASR transcript for a spoken query and document from the SDPWS2 collection. Vertical coloured lines mark the occurrences of a specific query term, with height given by each term's inverse document frequency. Occurrences of the same term share the same colour. Dashed vertical lines mark the boundaries of individual retrieval elements (slide-group segments) in the document transcripts.

(a) Manual transcripts (MAN)


(b) ASR transcripts (MATCH)


it may occur some minutes before the full presentation of a particular topic, may still be of importance for this topic and possibly contain some useful terms which may not be mentioned later in the presentation. In such circumstances, it seems logical to consider a longer informational unit around the target element, in the hope of facilitating its retrieval at top ranks.

The context of an element, that is, the information that is present in its container document, has been shown to be valuable for improving element-retrieval effectiveness (Kekäläinen et al., 2009; Arvola et al., 2011). The process of taking context into account when computing the relevance score of an element is known as contextualisation (Kekäläinen et al., 2009). Various contextualisation techniques have proven effective not only in text retrieval tasks such as XML retrieval (Arvola et al., 2011) and passage retrieval (Carmel et al., 2013; Keikha et al., 2014), but also in SCR tasks (Nanjo et al., 2014; Shiang et al., 2014). Despite this, no previous studies have investigated the extent to which contextualisation techniques can provide increased robustness to ASR errors in the context of passage retrieval, when retrieval elements are pre-defined short excerpts of potentially errorful transcripts.

The remainder of this chapter studies the impact of incorporating context into the task of retrieving short spoken passages given a spoken query from a collection of long spoken documents. The research question under study is RQ-2, stated in Section 1.2 as whether contextualisation techniques can provide increased robustness to ASR errors when relevance scores are calculated via text retrieval methods.

6.2 Contextualisation techniques

Contextualisation techniques for element retrieval seek to rank an element based on its content and the contents of its neighbouring elements in a document. In these techniques, elements are scored depending not only on the query terms occurring within the element itself but also on those occurring at other positions within the document. Two simple and widely adopted contextualisation approaches consist of interpolating the scores of an element with those of its document to consider global context (Nanjo et al., 2014), or with those from a fixed number of surrounding elements to consider local context (Shiang et al., 2014). In contrast, techniques based on positional models (PMs) (Carmel et al., 2013; Keikha et al., 2014) allow consideration of longer spans of context, ignoring element boundaries. This section describes these contextualisation techniques in more detail.

6.2.1 Document score interpolation (DSI)

A simple approach to contextualising a retrieval element with information from its document is to combine the element's relevance score, calculated by a ranking function, with the score of its source document (Bartell et al., 1994; Fox and Shaw, 1993; Callan, 1994; Belkin et al., 1995; Huang et al., 2004; Abdul-Jaleel et al., 2004; Nanjo et al., 2014).

Firstly, elements and documents are scored independently by a ranking function to form two separate ranked lists of results. Secondly, the elements retrieved initially are re-ranked according to the relevance scores of their documents. By making element score computation sensitive to document scores, low-scoring elements may acquire increased relevance scores if contained within a high-scoring document. In the remainder of this thesis, this method is referred to as document score interpolation (DSI).

As mentioned in Section 2.4, methods for the effective combination of relevance scores produced by different ranking functions or "experts" have been investigated in previous research (Bartell et al., 1994; Fox and Shaw, 1993; Belkin et al., 1995). Among the methods that have been proposed for score combination, the experiments of this thesis adopt a simple weighted linear combination of scores, or CombSUM (Fox and Shaw, 1993; Belkin et al., 1995), for combining the retrieval scores of documents and elements or passages. This combination approach has been shown to perform well in text retrieval tasks as well as in image retrieval tasks (Chatzichristofis and Arampatzis, 2010). Based on the CombSUM method, the relevance score of a passage p within document d for a query q is given by Equation 6.1.

S_{\text{DSI}}(q, p) = \lambda\, S_{\text{BM25}}(q, d) + (1 - \lambda)\, S_{\text{BM25}}(q, p) \qquad (6.1)

where the interpolation parameter λ adjusts the influence of the document score over the combined score. Intuitively, the λ parameter controls the amount of contribution from the passage's context to the final relevance score of the passage. Note that in this contextualisation technique, all passages in document d receive the same contribution from d. If d obtains a high score for the query, then the score of its passages will be dominated by the score of d. In Equation 6.1, the document scores S_BM25(q, d) are calculated based on frequency statistics estimated from the collection of documents, whereas the passage scores S_BM25(q, p) are based on statistics estimated from the collection of passages only. As is normally recommended in the application of this technique, in the experiments presented in this thesis with the DSI method, document and passage scores are range-normalised between 0 and 1 before being combined via Equation 6.1.
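The following minimal sketch illustrates the DSI combination of Equation 6.1 with per-query min-max normalisation; it is not the thesis implementation, and the dictionary-based inputs and names are illustrative only.

```python
# Hedged sketch of DSI (Equation 6.1): normalise raw BM25 scores per query,
# then interpolate each passage score with its container document's score.
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def dsi_scores(passage_scores, document_scores, doc_of, lam=0.5):
    """passage_scores / document_scores: raw BM25 scores for one query;
    doc_of: maps each passage id to its container document id."""
    p_norm, d_norm = minmax(passage_scores), minmax(document_scores)
    return {p: lam * d_norm.get(doc_of[p], 0.0) + (1.0 - lam) * s
            for p, s in p_norm.items()}

# Example: a weak passage inside a strong document is pulled up the ranking.
ranked = dsi_scores({"d1.p1": 2.0, "d1.p2": 9.0, "d2.p1": 5.0},
                    {"d1": 12.0, "d2": 3.0},
                    {"d1.p1": "d1", "d1.p2": "d1", "d2.p1": "d2"},
                    lam=0.4)
```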

6.2.2 Positional models (PMs)

Positional models (PMs) seek to improve IR effectiveness by exploiting information about the positions where query terms occur in a document. A representative example of this type of model are positional language models (PLMs) (Lv and Zhai, 2009), which were introduced in IR as a mechanism to integrate evidence from term proximity features and passages into the language modelling framework for IR (Ponte and Croft, 1998). A PLM estimates the probability P(i | c, d) that term i is generated at position c in document d. Thus, for every document d in the collection, and for every position c within every document, a PLM estimates a probability distribution over all terms centred at position

c in d.

In a PLM, the estimation of P(i | c, d) is based on the so-called pseudo-frequency of term i, calculated by considering the distance between c and all occurrences of i in d. When calculating these pseudo-frequencies, the intuition is that the more distant an occurrence of the term is from position c, the less influence this term is expected to have around this position in the document, and so the less representative the term is expected to be of the topic being discussed around position c. Conversely, if a term i occurs at some position l, then the influence of this term is said to "propagate" to distant positions within the document. The extent to which the occurrence at position l propagates to other positions gradually decays with their distance from l.

In practice, the pseudo-frequency of term i is calculated by means of a kernel decay function that determines the extent to which an occurrence propagates to distant positions in the document. Depending on the decay rate and shape of such a kernel function, any occurrence of i in d can possibly affect the pseudo-frequency of term i at every other position in d. Conversely, the pseudo-frequency of term i at position c can possibly be influenced by all occurrences of i in d, as long as they are close enough to c. The kernel function in PMs is commonly parametrised by a propagation parameter σ which adjusts the influence that a term occurrence has over distant positions.

Several kernel density functions have been proposed in the past for PMs. Figure 6.2 shows plots for a representative set of these kernels. Among these, the Gaussian kernel shown in Equation 6.2

K(l, c) = \exp\left(-\frac{(l - c)^2}{2\sigma^2}\right) \qquad (6.2)

has been frequently shown effective in document and passage retrieval tasks with positional models (Lv and Zhai, 2009; Carmel et al., 2013). An exception to this is the work of Keikha et al. (2014), which showed that the skewed Gaussian kernel shown in Equation 6.3

K(l, c) = \exp\left(-\frac{(l - c)^2}{2\sigma^2}\right)\left(1 + \operatorname{erf}\left(\frac{\alpha\,(l - c)}{\sqrt{2}}\right)\right) \qquad (6.3)

with a positive skewness parameter α > 0, can often outperform a Gaussian kernel in the task of finding answers to non-factoid questions within text articles. Keikha et al. (2014) attributed the superior performance of the positively skewed Gaussian kernel to its asymmetric shape, which has the effect of giving higher propagation values to positions located after the occurrence of a term than to those located before it. In the task of document retrieval, Song et al. (2011) showed that the Reverse kernel, shown in Figure 6.2, can provide superior retrieval performance compared to the Gaussian kernel when used in a PM to capture query term proximity heuristics.

In recent work, PMs were proposed as a contextualisation technique for passage retrieval (Carmel et al., 2013). In this work, a standard TF-IDF approach was used to compute the relevance score of a passage p within document d, where the frequency of

Figure 6.2: Kernel density functions proposed in previous research to calculate pseudo-frequency counts in PMs.

[Panels: Gaussian, Skewed Gaussian, Triangle, Cosine, Circle, Trapezoid, Reverse, and Square kernels, each plotted as K(l, 500) against position l from 0 to 1000, for σ = 50, 150, and 300.]

term i in p is given by its pseudo-frequency estimate, calculated as shown in Equation 6.4,

ptf_i = \sum_{c \in pos(i, d)} \; \sum_{l = p_1}^{p_n} K(l, c) \qquad (6.4)

where pos(i, d) denotes the set of positions where term i occurs in d, p_1, ..., p_n are the spanning positions of p in d, that is, all positions within the passage boundaries, and K is a kernel density function. Based on Equation 6.4, the pseudo-frequency of term i for a passage p is calculated as a sum of discrete integrals over the kernel function for the range of positions across which p spans in d. Figure 6.3 gives a graphical description of how ptf_i is calculated according to Equation 6.4 when the Gaussian kernel is used. The final effect is that the value of an individual occurrence gets propagated onto nearby passages, even when these passages may not strictly "contain" an occurrence of the term.

If Equation 6.4 is used directly to compute ptf_i, then longer passages may unfairly obtain greater pseudo-frequency values. To prevent longer passages from receiving unmerited pseudo-frequency counts, Carmel et al. (2013) propose to apply the inner summation in Equation 6.4 across a fixed number of positions, independent of the length of the passages to be scored. To minimise the number of kernel calculations, in the experiments reported in this chapter with PMs, the inner summation is only applied at the position l within the passage that maximises K(l, c) for every c ∈ pos(i, d). In other words, the kernel function is only evaluated at the positions within a passage where the kernel gives its maximum value for every term occurrence. For a term i occurring at position c, this corresponds to evaluating K(l, c) at l = p_1 or l = p_n if c < p_1 or c > p_n respectively, or at l = c otherwise.

Given this modification of Equation 6.4, ptf_i will increment by 1 each time the term i appears within the boundaries of p, and by K(p_1, c) < 1 or K(p_n, c) < 1 every time the term appears before or after the passage respectively. Figure 6.4 shows the pseudo-frequency values that result from this approximation of Equation 6.4, for the same example as in Figure 6.3. The final effect obtained is similar to that obtained from using Equation 6.4: term counts in a passage get propagated to neighbouring passages. Note that if the spread parameter σ of the kernel is 0, then passages only receive counts from the occurrences they contain, whereas if σ = ∞, then all passages obtain the same value of ptf_i, equal to the frequency of the term in the document.

Although PMs were originally proposed within the language modelling framework for IR (Lv and Zhai, 2009), the idea of using a position-dependent term count that gets propagated to distant positions in a document is general enough to be applied within other IR frameworks. In this respect, Carmel et al. (2013) used pseudo-frequency counts in a standard TF-IDF framework, while Song et al. (2011) did so within the probabilistic relevance framework (PRF). Similarly to the work of Song et al. (2011), the PM used in the experiments reported in this thesis is based on an adaptation of the probabilistic approach to pseudo-frequency counts. This adaptation is based on a similar idea to that used for the integration of prominence scores described in Section 5.2.4 for the CWL method.
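As an illustration of the max-kernel approximation just described, the following sketch computes a passage's pseudo-frequency for one term using the Gaussian kernel of Equation 6.2. It is only a sketch under the assumptions stated in its comments; the implementation used inside Terrier for the experiments may differ in detail.

```python
import math

def gaussian_kernel(l, c, sigma):
    # Equation 6.2; with sigma = 0 only the exact occurrence position counts.
    return math.exp(-((l - c) ** 2) / (2.0 * sigma ** 2)) if sigma > 0 else float(l == c)

def pseudo_frequency(occ_positions, p_start, p_end, sigma, kernel=gaussian_kernel):
    """occ_positions: word positions of one term in the document;
    [p_start, p_end]: spanning positions of the passage being scored."""
    ptf = 0.0
    for c in occ_positions:
        # Evaluate the kernel only at the in-passage position that maximises it:
        # l = c when the occurrence is inside the passage, otherwise the nearer
        # passage boundary (p_start or p_end).
        l = min(max(c, p_start), p_end)
        ptf += kernel(l, c, sigma)
    return ptf

# The three occurrences inside the passage count fully; the distant occurrence
# at position 117 only contributes a small propagated amount.
print(pseudo_frequency([24, 27, 58, 117], p_start=10, p_end=70, sigma=25))
```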

Figure 6.3: Example of how pseudo-frequency counts are calculated when the Gaussian kernel is used. In this example the pseudo-frequency of a term is calculated for the passages 01, 02, 03, and 04 in a document that has 4 occurrences of the term.


(a) The Gaussian densities of each occurrence determine how far the value of an occurrence is propagated across the document. In this example, the occurrences are located at positions 24, 27, 58, and 117, marked in the plot with a vertical red line at the centre of each Gaussian.


(b) The outer summation in Equation 6.4 can be interpreted as an operation that adds the individual Gaussian densities from all occurrences of the term into a single density contour.

(c) The inner summation in Equation 6.4 calculates the area under the density contour across all positions in each passage to obtain its final pseudo-frequency for the term.

Figure 6.4: Example of pseudo-frequency counts computed based on the maximum kernel values that can be obtained for a passage, for every term occurrence. The pseudo-frequency for passage 02 is equal to 3 because the first three occurrences of the term are fully contained within this passage, while the occurrence at position 117 is not close enough to provide any additional contribution to this passage. Passage 04 obtains a ptf_i close to 1 as the occurrence at position 117 appears just before the start of this passage.


By changing the document representation to consider pseudo-frequency counts instead of term frequencies for each term, while maintaining Poisson distributional assumptions for these pseudo-counts, the resulting model can be approximated by a BM25-like function, where the term frequencies tfi are replaced by the pseudo-frequency counts ptfi. The resulting retrieval function is then given by Equation 6.5.

S_{\text{PM}}(q, p) = \sum_{i \in q, p} \frac{(k_1 + 1)\, ptf_i}{ptf_i + k_1 \left(1 - b + b\, \frac{docl}{avel}\right)} \cdot \frac{(k_3 + 1)\, qf_i}{k_3 + qf_i} \cdot cfw(i) \qquad (6.5)

Note that, since ptf_i = tf_i when σ = 0, the original BM25 formulation for scoring passages can be recovered from S_PM(q, p) by setting σ = 0. Also, when σ = ∞, the pseudo-frequency ptf_i of the term equals the number of occurrences of i in the document d, and Equation 6.5 produces the same score as S_BM25(q, d). Recall that, according to the original formulation of BM25, docl in Equation 6.5 is the length of the passage to be scored, and avel the average length across all passages in the collection. Ideally, these values should be updated in Equation 6.5 to reflect how the length of the original passage changes based on the pseudo-frequency estimates (Lv and Zhai, 2009). In the implementation of S_PM used in the experiments of this chapter, docl and avel were based on the original term frequency counts instead. Although this modification facilitated the implementation of the model within the Terrier framework, it presents some limitations. First, note that passage lengths given by pseudo-frequencies will generally be proportional to the original length of the passage as long as the spread parameter σ is set to small values. Contrary to this, for high values of σ, short passages will receive contributions from almost every term occurrence in the document, and thus their pseudo-frequency length will be substantially greater than their original length. In addition, since passages located at either end of a document can only receive propagated

counts from occurrences after or before such end points, these types of passages will tend to acquire lower pseudo-frequency counts than a passage located in the middle of the document. Despite these known limitations, using estimates of docl and avel based on pseudo-frequency counts would require us to re-estimate docl and avel every time a distinct value of σ is used, which would make any attempt to find optimal values of σ impractical.

While the PM technique seeks to contextualise the contents of a passage by propagating terms occurring close to it, the DSI technique described in Section 6.2.1 does so by extending the passage with contributions from all terms in the document. In this respect, the PM and DSI techniques can be interpreted as contextualising a passage with local and global context respectively. While PM puts more emphasis on local context, DSI makes no distinction between distant and local context. Because it may be beneficial to contextualise a passage with different levels of context granularity (Ogilvie and Callan, 2005; Arvola et al., 2011), the experiments presented in the next section additionally studied the performance of a technique that combines passage scores obtained with S_PM and document scores obtained with S_BM25. This scoring function is shown in Equation 6.6,

S_{\text{DSI-PM}}(q, p) = \lambda\, S_{\text{BM25}}(q, d) + (1 - \lambda)\, S_{\text{PM}}(q, p), \qquad (6.6)

where p is a passage contained in document d and q is the query. Equivalently, this technique can be given in terms of an interpolation of scores produced by a PM with σ = ∞ for capturing global context, and a second PM with a small value of σ for capturing locally focused context.
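The following compact sketch shows how the pseudo-frequencies could be plugged into the BM25-like function of Equation 6.5 and how the DSI-PM interpolation of Equation 6.6 combines the two sources of context. It is illustrative only: the parameter defaults are common BM25 settings rather than the tuned values used in the experiments, and docl/avel are assumed to come from the original term counts, as described above.

```python
def bm25_pm_term(ptf, qf, cfw, docl, avel, k1=1.2, b=0.75, k3=8.0):
    # One term's contribution in Equation 6.5, with ptf replacing tf.
    tf_part = ((k1 + 1.0) * ptf) / (ptf + k1 * (1.0 - b + b * docl / avel))
    qf_part = ((k3 + 1.0) * qf) / (k3 + qf)
    return tf_part * qf_part * cfw

def s_pm(query_terms, docl, avel, **kw):
    """query_terms: iterable of (ptf, qf, cfw) triples for the query terms
    matching the passage (names are illustrative)."""
    return sum(bm25_pm_term(ptf, qf, cfw, docl, avel, **kw)
               for ptf, qf, cfw in query_terms)

def s_dsi_pm(s_bm25_doc, s_pm_passage, lam=0.5):
    # Equation 6.6: global (document-level) and local (PM) evidence combined.
    return lam * s_bm25_doc + (1.0 - lam) * s_pm_passage

# Example usage with two matching terms in a passage of length 40 words:
score = s_dsi_pm(s_bm25_doc=7.3,
                 s_pm_passage=s_pm([(2.4, 1, 5.1), (0.8, 2, 3.2)], docl=40, avel=75))
```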

6.3 Experiments with contextualisation techniques

This section describes a series of retrieval experiments that seek to determine whether the contextualisation techniques presented in Section 6.2 can provide increased retrieval robustness against ASR errors.

6.3.1 Task and test collections

The potential benefits of using contextualisation techniques were investigated in a spoken passage retrieval (SPR) task, where the elements to be retrieved contain significantly fewer occurrences of query terms and retrieval methods are thus more sensitive to recognition errors in the transcripts. Also, to see whether such techniques are helpful for alleviating the impact of ASR errors, their retrieval effectiveness was studied on transcripts with varying levels of recognition errors.

Recall from Chapter 4 that the BBC as well as the SDPWS speech collections were transcribed by different ASR systems. Compared to the various transcripts available for the SDPWS collection, those available for the BBC collection present similar levels of recognition accuracy. This is evidenced by the measures reported in Tables 4.3 and

6.3 for these document collections. Since the transcripts from the SDPWS2 collection were purposely generated by using acoustic and language models of decreasing quality, they provide a more diverse range of possible error levels in the transcripts. Table 6.1 summarises the list of transcripts available for the SDPWS2 collection along with the short IDs used in the rest of this section to refer to each transcript type. Additionally, Table 6.2 reports passage length statistics for every transcript type, while Table 6.3 reports speech recognition accuracy.

Among the topic sets available for the SDPWS2 collection, the SQD1 and SQD2 sets contain spoken queries, whose transcripts are also available at different levels of transcription quality. This permits us to simulate additional retrieval conditions of increasing difficulty, by evaluating retrieval performance over increasingly noisier combinations of query and document transcripts. Table 6.4 lists the different combinations of query and passage transcripts used for evaluation, sorted by their combined ranked-index accuracy (RIA). Although other query-passage combinations could also have been considered, in particular those involving low-quality query transcripts and high-quality passage transcripts (e.g. A0 for queries and M for passages), we limit our investigation to the combinations from Table 4.21 as these already capture a wide range of transcription quality levels. Also, recall that Kaldi models were not released by the NTCIR task organisers, but only the A0 transcripts of the SQD2 queries. For this reason, combinations involving A0 transcripts for the SQD1 queries were not considered in the experiments of this chapter. As opposed to the WERs for the SQD1 and SQD2 query sets reported in Table 4.21, the recognition accuracy measures reported in Table 6.4 were calculated against passage transcripts by restricting terms to only those occurring in the queries from a query set.

More specifically, if acc(p_r, p_h) denotes a measure of recognition quality that compares the set p_r of term counts in the reference passage against the set p_h of term counts in the hypothesised passage, then the value of this measure restricted to the terms from a reference query q_r and a hypothesised version of this query q_h is acc(p_r ∩ q_r, p_h ∩ q_h). For a query q, this metric can be calculated for every passage in the collection and the results averaged. Similarly, the same can be done for each query in a set of queries and these results can then be averaged across all queries in the set. The figures from Table 6.4 show these query-set averages for every accuracy measure. These figures give a rough idea of how many differences (or similarities in the case of BIA and RIA) exist between the set of matching terms obtained by using perfect transcripts for both query and passages, and that obtained by using noisy transcripts.

The SPR task considered with the SDPWS2 transcripts consists of ranking slide-group segments (SGS), and corresponds to the same task described in Section 5.3.1. Retrieval effectiveness in this task is evaluated with MAP. Part of the retrieval experiments reported in this chapter were conducted as part of our participation in the NTCIR-12 SpokenQuery&Doc-2 (SQD2) task, whose official results are available in Akiba et al. (2016). Since the spoken queries from the SQD2 task present similar characteristics to those used in the SQD1 task, part of the experimental work presented in this chapter focuses on maximising passage retrieval effectiveness in the SDPWS2 collection for this type of query.
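The query-restricted averaging described above can be sketched as follows. This is only a schematic illustration: the generic measure acc (standing in for BIA, RIA, TER, or UTER, which are not defined here) and all helper names are placeholders.

```python
from collections import Counter

def restricted(counts, query_terms):
    # keep only the counts of terms that appear in the given query version
    return Counter({t: n for t, n in counts.items() if t in query_terms})

def query_set_average(acc, queries, passages):
    """queries: list of (reference query terms, hypothesised query terms);
    passages: list of (reference term counts, hypothesised term counts)
    for every passage in the collection."""
    per_query = []
    for q_ref, q_hyp in queries:
        vals = [acc(restricted(p_ref, q_ref), restricted(p_hyp, q_hyp))
                for p_ref, p_hyp in passages]
        per_query.append(sum(vals) / len(vals))   # average over passages
    return sum(per_query) / len(per_query)        # average over queries
```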

Table 6.1: Transcripts from the SDPWS2 collection used in the contextualisation experiments.

SpokenQuery&Doc ID   Short ID
MAN                  M
K-MATCH              A0
MATCH                A1
UNMATCH-LM           A2
UNMATCH-AMLM         A3

Table 6.2: Length statistics of segmented transcripts from the SDPWS2 collection.

Transcript   Passages   Avg. len.   S.D. len.   Max. len.
M            2,328      74.8        67.6        757
A0           2,330      74.2        67.4        760
A1           2,334      73.6        67.6        736
A2           2,335      80.7        74.8        806
A3           2,330      67.1        62.6        680

6.3.2 Maximising retrieval effectiveness via QF and exponential IDF

Besides the application of contextualisation techniques, additional methods were explored with the goal of maximising passage retrieval effectiveness for the SDPWS2 collection with the SQD1 and SQD2 queries.

Retrieval from small document collections with long queries

As described in Section 4.2.3, the spoken queries from the SQD1 and SQD2 sets were created following a set of guidelines that encouraged speakers to produce long queries containing a large number of spoken terms. This is clear from the query length statistics presented in Table 4.20, which show that written queries from the SD2 set contain 6.77 terms on average, while those from the SQD1 and SQD2 sets contain, respectively, 24.13 and 30.77 terms on average in their manual transcripts. In addition to long queries, with 98 presentation transcripts and 2,329 slide-group passages, the size of the SDPWS2 collection is several orders of magnitude smaller than most standard test document collections used in IR research.

Table 6.3: Recognition accuracy of passages as measured by WER and index similarity metrics for the SDPWS2 collection.

Transcript   #Terms   WER     UTER   TER    BIA    RIA
M            6,230    0%      0      0      1.00   1.00
A0           6,350    22.0%   0.19   0.39   0.65   0.72
A1           6,131    43.7%   0.34   0.70   0.43   0.53
A2           11,219   67.5%   0.49   1.22   0.20   0.30
A3           14,190   70.5%   0.57   1.20   0.17   0.28

Table 6.4: Recognition accuracy of query terms in passages from the SDPWS2 collection for different combinations of query and document transcripts.

(a) SQD1 queries.

Transcripts          Measures
Query    Passage     UTER   TER    BIA    RIA
M        M           0.00   0.00   1.00   1.00
M        A0          0.09   0.20   0.83   0.86
M        A1          0.18   0.45   0.65   0.70
M        A2          0.32   0.59   0.53   0.59
A1       A1          0.27   0.94   0.43   0.50
M        A3          0.46   0.78   0.39   0.48
A1       A2          0.41   0.95   0.35   0.42
A2       A2          0.47   0.97   0.30   0.37
A1       A3          0.52   1.10   0.26   0.35
A2       A3          0.58   1.05   0.24   0.31
A3       A3          0.60   1.10   0.20   0.28

(b) SQD2 queries.

Transcripts          Measures
Query    Passage     UTER   TER    BIA    RIA
M        M           0.00   0.00   1.00   1.00
M        A0          0.10   0.20   0.84   0.87
M        A1          0.18   0.43   0.68   0.74
A0       A0          0.19   0.51   0.63   0.72
A0       A1          0.26   0.62   0.55   0.64
M        A2          0.32   0.59   0.53   0.59
A1       A1          0.26   0.83   0.48   0.57
A0       A2          0.43   0.71   0.45   0.53
M        A3          0.47   0.79   0.41   0.51
A1       A2          0.41   0.87   0.39   0.48
A0       A3          0.56   0.88   0.32   0.42
A2       A2          0.45   1.00   0.33   0.40
A1       A3          0.54   1.06   0.28   0.39
A2       A3          0.58   1.09   0.25   0.34
A3       A3          0.59   1.16   0.23   0.33

A third distinctive characteristic of the academic talks from the SDPWS2 collection is that they are highly homogeneous in terms of the range of topics and domains discussed. As described in Section 4.2.2, most of the talks from the SDPWS2 collection are highly technical and contain a significant amount of domain-specific vocabulary.

Retrieving content from documents with the characteristics of the SDPWS2 collection may pose additional challenges to conventional retrieval methods. Because of the size of the collection, a large proportion of all possible terms in the Japanese language will be missing or underrepresented. Thus, terms that are frequently used and that would normally occur in a large proportion of documents in a larger collection may occur in a significantly smaller proportion of documents in the SDPWS2 collection. Under these circumstances, IDF scores may not provide a reliable estimate of the relative importance that terms should be given when calculating their contribution to document relevance scores. In particular, IDF estimates for underrepresented terms would be unusually high, and therefore closer in magnitude to the IDF scores given to the least frequent terms in the collection. For instance, the term "家" ("home", "family", "household"), which is one of the top 200 most frequent words in Japanese1, appears in 4 passages in the SDPWS2 collection, while the terms "ディリクレ" ("Dirichlet") and "コンピュータ" ("computer") do so in 2 and 21 passages respectively. Considering that the total number of passages in the SDPWS2 collection is N = 2,329, the IDF scores for "home", "Dirichlet", and "computer" are 9.01, 9.86, and 6.74 respectively2. Even though the terms for "Dirichlet" and "computer" would arguably be more useful for retrieval if used in a query, the term for "home", which a priori seems to be less useful for content retrieval, acquires an overrated IDF score which puts it at a similar level of importance to other terms with likely higher power to discriminate between relevant and irrelevant content in the SDPWS2 collection.

The potential difficulties that unreliable IDF scores may pose to retrieval in the SDPWS2 collection are accentuated when considering the characteristics of the SQD1 and SQD2 queries. Because these queries are extremely verbose, they tend to contain a high number of low content-bearing terms, such as the term for "home" from the previous example. These low-quality terms may not only increase the number of spurious document matches in the retrieval process, but also have a major impact on the overall relevance score assigned to documents. Specifically, if a query contains a large number of low content-bearing terms with underestimated document frequency, these terms may dominate the summation of term scores for a document and thus diminish the contributions to relevance scores from more topically informative terms.

Improving term weight estimations in small collections

One technique that has been used in the past to improve the estimation of IDF scores in small spoken collections is to use an external document collection to either re-calculate the document frequencies or expand the contents of the documents from the original collection with topically related terms (Johnson et al., 2000, 1999a; Singhal et al., 1999).

¹ http://corpus.leeds.ac.uk/frqc/internet-jp.num
² Based on the collection frequency weight (cfw) from Equation 2.7 and a base-2 logarithm.

Although expansion techniques were successfully applied to collections of broadcast news in the TREC SDR tasks, they are only effective if the external collection used is representative of the collection that is the target of retrieval. As opposed to collections of broadcast news, for which parallel corpora exist and are easily available, the SDPWS2 content is highly specific to a particular technical domain for which it is difficult to find appropriate external data to use for the implementation of expansion techniques or for the re-estimation of IDF weights.

In the absence of an appropriate external collection, three alternative approaches were adopted in this thesis to ameliorate the effects of using poorly estimated document frequency statistics and verbose queries in the experiments conducted with the SDPWS2 collection. The first technique was described in Section 4.2.2 and consists of removing low content-bearing terms from the queries and documents. By removing stop words and only keeping terms identified as verbs and nouns in the transcripts, the number of matches involving unimportant terms can be reduced dramatically. The second and third techniques consist of exploiting term frequencies in the query (QF) and raising the value of IDF to the power of some positive number d ≥ 1. The following sections motivate and describe these two techniques.

Within-query term frequency (QF)

When queries are long, it is often beneficial to exploit term frequencies in the query, which may provide useful information about which terms should be given increased weights during document score computation. In the Okapi BM25 model (Equation 2.9), within-query term frequencies are accounted for by the factor (k3 + 1) qfi / (k3 + qfi), where qfi denotes the count of term i in the query. This query-frequency factor (QF) is parametrised by the constant k3, which controls the rate at which the factor increases with every unit increment of qfi, as well as the point at which it reaches its asymptotic maximum. The general assumption underlying the use of QF is that terms that occur more frequently in the query are more representative of the user’s underlying information need. For retrieval from the SDPWS2 collection, the hope is that QF estimates may help signal important terms in the query and increase their overall contribution to relevance scores relative to less important terms. Following the earlier example, if the term “Dirichlet” appears more frequently in a long query than low-quality terms like “home”, then enabling the QF factor in BM25 would increase the difference between the weights assigned to these terms.
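As a small illustration of this effect, the sketch below evaluates the QF factor for two query terms; the value of k3 and the query frequencies are illustrative choices rather than settings used in the experiments.

def qf_factor(qf_i, k3=8.0):
    # within-query term frequency factor of Okapi BM25: (k3 + 1) qf_i / (k3 + qf_i)
    return (k3 + 1) * qf_i / (k3 + qf_i)

# a low content-bearing term occurring once versus a topical term repeated four times
for term, qf_i in [("home", 1), ("Dirichlet", 4)]:
    print(term, round(qf_factor(qf_i), 2))

# home -> 1.0, Dirichlet -> 3.0: with k3 = 8, enabling QF triples the relative
# contribution of the repeated topical term.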

Exponential inverse document frequency (EIDF)

A small collection of documents can be interpreted as a sample of documents taken from a larger population of documents. If the sample is too small, then the document frequencies of terms calculated for the sample will not be representative of the document frequencies calculated from the population. Terms with high document frequencies in the population may be underrepresented in the sample and obtain low document frequencies and consequently high IDF scores.

If the sample were to be augmented with documents from the population one document at a time, the document frequencies of terms in the extended sample would slowly converge to those seen in the population. The effect would likely be that the document frequencies of highly frequent terms in the population would increase at a faster rate than those of rare terms, gradually reflecting the larger differences in document frequencies that exist between these two term groups. In other words, the difference in IDF scores between terms with high and low population frequencies would gradually increase, establishing a bigger separation between these two groups.

If population frequencies are unknown, a similar separation effect can be achieved between the IDF scores of low and high frequency terms by raising the standard IDF values to the power of some constant d ≥ 1. To achieve this effect in the experiments reported in this thesis, a fourth parameter d ≥ 1 was included in the Okapi BM25 function (Equation 2.9) as the exponent of the collection frequency weight, cfw(i)^d. The d parameter can then be adjusted to control the shape and slope of the cfw(i)^d function and thus increase the relative difference between the weights assigned to frequent and rare terms. Figure 6.5 shows this effect for different values of the exponent d. By using this alternative function with d = 3, the IDF scores assigned to the terms “home”, “Dirichlet”, and “computer” in the SDPWS2 passage collection would be 732.3, 959.4, and 307.1 respectively.

Although the proposed modification of the IDF scores from Equation 6.5 may seem unconventional, a similar modification has previously been proposed by Zhai (2001) and implemented as the TF-IDF model in the Lemur Toolkit³. Besides this work, in work parallel to ours, Murata et al. (2014, 2016) proposed a different alternative function for computing IDF scores that also seeks to degrade the scores of underrepresented terms in the collection in favour of terms that are more discriminative for retrieval. That work was conducted in the context of the instance search task at TRECVID (Over et al., 2014), which poses the task of finding instances of persons, objects, or places within videos given an example image. Murata et al. (2014) observed that the standard formula for cfw(i) in the Okapi BM25 function performed poorly in this task when “key-points”, that is, pixel-derived features extracted from the query and documents, are treated as the “terms” upon which the matching process is performed. The authors attributed the poor performance of the original cfw(i) weights to common background pixels occurring in the query and video images, which tended to dominate the final BM25 scores of the videos over “foreground” key-points that provided stronger evidence of relevance. The alternative cfw(i) function proposed by Murata et al. (2016) is given in Equation 6.7,

³ http://www.lemurproject.org

Figure 6.5: Exponential collection frequency weight (cfw(i)^d) for different values of the exponent d and N = 2329 (size of the SDPWS2 collection). For ease of comparison, all values are normalised between 0 and 1 although the true value scales may vary significantly for different values of d.

where γ acts as a tuning constant.

    BEIDF(i) = log [ e^(−ni/γ) (N − ni + e^(ni/γ) − e^(−ni/γ) + 1) / ((e^(ni/γ) − e^(−ni/γ) + 1)(ni + e^(−ni/γ))) ]    (6.7)

In practice, to avoid considering negative values in the document scores, Murata et al. modify Equation 6.7 to output 0 whenever the argument of the log is less than 1. Figure 6.6 plots BEIDF(i) for various values of γ. For large values of γ, the function produces similar weights to those produced by the standard cfw(i) function, whereas for small values of γ the function marks a sharp boundary between terms with low and high document frequency.
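To make the difference between the two weighting schemes concrete, the sketch below computes cfw(i)^d and BEIDF(i) for the three example terms used earlier in this section. It assumes the same RSJ-style cfw as before and the clipping-to-zero behaviour described above; the values of d and γ are illustrative.

import math

N = 2329  # passages in the SDPWS2 collection

def cfw(n_i):
    # standard collection frequency weight (assumed RSJ form, base-2 logarithm)
    return math.log2((N - n_i + 0.5) / (n_i + 0.5))

def eidf(n_i, d=3.0):
    # exponential IDF: the standard cfw raised to the power d
    return cfw(n_i) ** d

def beidf(n_i, gamma=100.0):
    # Bayesian exponential IDF (Equation 6.7), clipped to 0 when the
    # argument of the logarithm falls below 1
    e_pos, e_neg = math.exp(n_i / gamma), math.exp(-n_i / gamma)
    arg = (e_neg * (N - n_i + e_pos - e_neg + 1)) / ((e_pos - e_neg + 1) * (n_i + e_neg))
    return math.log(arg) if arg > 1 else 0.0

for term, n_i in [("home", 4), ("Dirichlet", 2), ("computer", 21)]:
    print(f"{term:>9}: cfw^3 = {eidf(n_i):6.1f}   BEIDF = {beidf(n_i):5.2f}")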

Experiments with QF and exponential IDF

Passage retrieval experiments were carried out to explore whether within-query frequencies and exponential IDF can help to alleviate the problems associated with verbose queries and small collections. For this purpose, the effectiveness of the BM25 function with and without QF and exponential IDF (EIDF) was measured for various combinations of test collections and transcripts. Table 6.5 presents MAP scores obtained with the standard BM25 function with: (i) QF and EIDF disabled; (ii) only QF enabled; (iii) only EIDF enabled (cfw(i)^d); and (iv) both QF and EIDF enabled.

In each evaluation condition, the BM25 parameters b, k1, k3, and d were optimised for each topic set and passage collection by using the coordinate ascent algorithm described in Appendix D.

Figure 6.6: Bayesian Exponential IDF (BEIDF(i)) for different values of the parameter γ and N = 2329 (size of the SDPWS2 collection). For ease of comparison, all values produced by the function are normalised between 0 and 1.


The results in Table 6.5 show that increased retrieval effectiveness can be obtained when exploiting within-query term frequencies (QF) and using exponential collection frequency weights (EIDF) in the standard BM25 function. Nonetheless, these techniques seem to provide substantial improvements only for the last 4 experimental conditions in Table 6.5, where retrieval is performed with long spoken queries over a small collection of passages (SDPWS2). For experiments conducted with the BBC collection, which is larger than the SDPWS2 collection, and with the SD2 queries, which are shorter than the queries from the SQD1 and SQD2 sets, enabling QF and EIDF in the BM25 function does not provide any improvement in retrieval effectiveness. In conditions where these techniques are effective, applying both of them in combination yields better results than using either in isolation. Overall, the results indicate that it is beneficial to use QF and EIDF in BM25 when performing retrieval with long queries from small collections.

Table 6.5: Passage retrieval effectiveness of Okapi BM25 with QF and EIDF disabled (k3 = 0, d = 1), with QF enabled (k3 > 0), with EIDF enabled (d > 1), and with both QF and EIDF enabled. For these results, BM25 parameters are optimised on test queries.

                              Transcript models
Collection   Topics   Queries     Documents    BM25    +QF     +EIDF   +ALL
BBC          SH13     MAN         LIMSI        .315    .315    .315    .315
BBC          SH14     MAN         LIMSI        .337    .337    .337    .337
BBC          SH14     MAN         NST          .330    .330    .330    .330
BBC          SAVA     MAN         LIMSI        .304    .304    .304    .304
BBC          SAVA     MAN         NST          .242    .242    .246    .246
SDPWS2       SD2      MAN         MAN          .450    .450    .456    .457
SDPWS2       SQD1     MAN         MAN          .241    .257    .254    .291
SDPWS2       SQD1     MATCH       MATCH        .168    .209    .176    .210
SDPWS2       SQD2     MAN         MAN          .258    .290    .268    .303*
SDPWS2       SQD2     K-MATCH     K-MATCH      .236    .255    .245    .270*

Table 6.6: Passage retrieval effectiveness of Okapi BM25 when cfw(i)^d (EIDF) or Equation 6.7 (BEIDF) are used as the collection frequency weight. Results are for the best performing values of the parameters d and γ in each condition.

                   Transcript models
Topics    Queries      Documents    EIDF    BEIDF
SQD1      MAN          MAN          .254    .240
SQD1      MATCH        MATCH        .176    .170
SQD2      MAN          MAN          .268    .266
SQD2      K-MATCH      K-MATCH      .245    .247

In the previous section, two variations of EIDF were presented: cfw(i)^d and the BEIDF from Equation 6.7. Table 6.6 compares the retrieval effectiveness achieved by these two variations of EIDF on the SQD1 and SQD2 topics and the SDPWS2 collection. As can be seen from the results, the two variations of EIDF perform similarly and obtain comparable MAP scores in these test conditions. The fact that both formulations of EIDF are equally effective in this set-up suggests that the simpler cfw(i)^d formula is able to replicate the effects produced by the more theoretically sound BEIDF function. Thus, despite its ad-hoc nature, the cfw(i)^d formula seems to provide a good practical approximation of BEIDF.

6.3.3 Contextualisation experiments

To study the potential for context to improve passage ranking for retrieval in noisy conditions, the effectiveness of the contextualisation techniques presented in Section 6.2 was evaluated on various combinations of query and document transcripts of the SDPWS2 collection. More specifically, experiments were conducted to measure the effectiveness of the document score interpolation (DSI) technique presented in Section 6.2.1, the positional variation of BM25 (PM) described in Section 6.2.2, and a combination of the two (DSI-PM). The DSI technique contextualises a passage with the contents of its container document, while the PM technique does so by putting heavier emphasis on local rather than global context. The technique that combines DSI with PM makes use of both global and local context when calculating the relevance score of a passage. In all experiments conducted with these contextualisation models, BM25 weights were calculated by enabling the QF factor (k3 > 0) as well as using exponential cfw weights (d > 1), since these modifications result in improved retrieval quality as demonstrated in Section 6.3.2.
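The exact scoring functions are given by Equations 6.1 (DSI) and 6.5 (PM) earlier in the chapter and are not reproduced here. The sketch below only illustrates one plausible reading of the two ideas, under the assumption that DSI linearly interpolates passage and document scores with weight λ and that PM accumulates pseudo-frequencies by propagating term occurrences through a Gaussian kernel of width σ; all names and the precise functional forms are illustrative rather than the thesis's implementation.

import math

def dsi_score(passage_score, document_score, lam):
    # assumed DSI-style interpolation: lam = 0 uses passage evidence only,
    # lam = 1 uses document evidence only
    return (1.0 - lam) * passage_score + lam * document_score

def pseudo_frequency(term_positions, passage_centre, sigma):
    # assumed PM-style propagation: every occurrence of a query term in the
    # document contributes according to its distance from the passage centre
    return sum(math.exp(-((pos - passage_centre) ** 2) / (2.0 * sigma ** 2))
               for pos in term_positions)

# occurrences of a query term at document offsets 40, 300 and 310; a passage
# centred at offset 320 mostly picks up the two nearby occurrences
print(round(pseudo_frequency([40, 300, 310], passage_centre=320, sigma=75), 3))
print(dsi_score(passage_score=2.1, document_score=5.4, lam=0.4))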

To understand the behaviour of these contextualisation techniques in increasingly noisy conditions, experiments were conducted with different combinations of query and document transcripts. Each combination of transcripts imposes a different evaluation condition with a varying level of noise, each of which may require adjusting the model parameters differently in order to achieve optimal performance. Furthermore, in order to study the relative importance assigned to contextual evidence in increasingly noisy conditions, it is informative to find values of the contextualisation parameters σ and λ that provide the best performance in each noise condition. Consequently, parameters were optimised for each model by seeking to maximise retrieval effectiveness in each noise condition. To obtain an unbiased estimate of the relative performance of these techniques, parameters were first optimised on the SQD1 queries (training data), and retrieval models using these optimal parameters were then evaluated on the SQD2 queries (test data).

In the PM technique, 5 parameters were optimised: b, which adjusts the degree of length normalisation; k1 and k3, which control the rate of increase of the TF factor as the (pseudo) frequency of a term increases in the passage and query respectively; the newly incorporated parameter d, which controls the rate of decrease of the IDF factor as the collection frequency of a term increases; and σ, which widens the scope of occurrences of query terms so that they can influence the scores of more distant passages. The DSI technique produces two independent rankings, one for documents and another for passages. Since the optimal BM25 parameters may differ for document and passage rankings, 9 parameters were optimised for the DSI technique: b, k1, k3, and d for each of the two BM25 functions that produce the document and passage rankings, plus λ, which controls the influence of document evidence in the passage scores. Lastly, for the DSI-PM technique, 10 parameters were optimised: the 9 parameters corresponding to DSI plus the σ used in the PM function to create the initial scores of the passages.

Table 6.7 reports MAP scores obtained with a baseline BM25, which does not contextualise passages, and with the PM, DSI, and DSI-PM contextualisation models for the queries used as training data (SQD1). In this table, MAP scores in bold are statistically significantly greater than those obtained with the BM25 baseline based on a paired t-test (p < 0.05). These results demonstrate that it is possible to obtain substantial and consistent improvements in passage retrieval effectiveness by using contextualisation techniques for this set of spoken queries. Furthermore, the relative improvements over the BM25 baseline tend to be greater for noisier combinations of query and document transcripts. In some cases, the level of retrieval performance obtained with high-quality transcripts can be matched with low-quality transcripts by using contextualisation models. For instance, BM25 obtains a MAP of .241 for M queries and A0 transcripts, while DSI-PM can reach .258 MAP for the substantially noisier A1-A3 combination.

Table 6.8 reports MAP scores obtained by the BM25, DSI, PM, and DSI-PM models on the test queries (SQD2). In this case, the parameters used in each model and evaluation condition were those found optimal for the SQD1 queries, thus the figures from Table 6.8 should be considered better indicators of the generalisation power of the contextualisation models than those presented in Table 6.7. For transcript combinations of the form A0-X, that is, all those that involve using A0 transcripts for the queries and some other transcript type X for the documents, the parameters used were those obtained for SQD1 for the combinations M-X.

Table 6.7: Retrieval effectiveness (MAP) of contextualisation models for training queries (SQD1). These results are for the best performing parameter settings found for the same set of queries (SQD1) in each evaluation condition. Percentages next to MAP scores show relative improvements with respect to the BM25 baseline.

        Transcripts                        Models
RIA     Query   Doc.    BM25    PM             DSI            DSI-PM
100%    M       M       .291    .312  +7%      .353  +21%     .366  +26%
86%     M       A0      .241    .315  +29%     .311  +29%     .320  +33%
70%     M       A1      .218    .279  +28%     .253  +16%     .285  +31%
59%     M       A2      .093    .179  +92%     .176  +89%     .194  +109%
50%     A1      A1      .219    .298  +36%     .303  +38%     .290  +32%
48%     M       A3      .154    .261  +69%     .206  +34%     .275  +79%
42%     A1      A2      .097    .160  +65%     .147  +52%     .170  +75%
37%     A2      A2      .112    .141  +26%     .184  +64%     .201  +79%
35%     A1      A3      .125    .248  +98%     .162  +30%     .258  +106%
31%     A2      A3      .098    .195  +99%     .143  +46%     .192  +96%
28%     A3      A3      .101    .186  +84%     .165  +63%     .202  +100%

MAP values in bold and those marked with *, ⋄, and † indicate statistically significant differences with respect to BM25, DSI, PM, and DSI-PM respectively, based on a MaxT permutation test that corrects for multiple hypothesis testing (Boytsov et al., 2013). In this case, MaxT tests were performed to compare every pair of runs from a single evaluation condition (row in Table 6.8). For the MaxT tests, the number of permutations used was B = 100,000 and the level of significance was set to α = 0.05.

Overall, the results from Table 6.8 indicate that using global (DSI) and local (PM) context, either in isolation or in combination (DSI-PM), provides significant gains in retrieval effectiveness across most evaluation conditions. Moreover, the DSI-PM method, which makes use of both local and global context to expand the passage representation, tends to obtain higher MAP scores on average than using the DSI or PM methods alone. Similarly to the observations made from the results of the training queries (SQD1) in Table 6.7, the relative gains of using context in highly noisy conditions (RIA < 60%) are greater on average than in less noisy conditions.
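For reference, a MaxT permutation test of the kind referred to above can be sketched as follows. This is only an illustration of the general procedure, assuming the mean per-topic AP difference as the test statistic and a within-topic shuffle of run labels as the permutation scheme; the exact statistic and scheme used by Boytsov et al. (2013) and in these experiments may differ, and with B = 100,000 permutations a vectorised implementation would be preferable to this plain loop.

import numpy as np

def maxt_permutation_test(ap, n_perm=10000, seed=0):
    # ap: (n_topics, n_runs) matrix of per-topic average precision scores,
    # one column per run (e.g. BM25, PM, DSI, DSI-PM)
    rng = np.random.default_rng(seed)
    n_topics, n_runs = ap.shape
    pairs = [(i, j) for i in range(n_runs) for j in range(i + 1, n_runs)]
    obs = np.array([abs(ap[:, i].mean() - ap[:, j].mean()) for i, j in pairs])
    exceed = np.zeros(len(pairs))
    for _ in range(n_perm):
        # shuffle the run labels independently within every topic
        perm = np.array([row[rng.permutation(n_runs)] for row in ap])
        stats = [abs(perm[:, i].mean() - perm[:, j].mean()) for i, j in pairs]
        exceed += max(stats) >= obs          # one max statistic adjusts all pairs
    p_adj = (exceed + 1) / (n_perm + 1)      # adjusted p-value per pair
    pmat = np.ones((n_runs, n_runs))
    for k, (i, j) in enumerate(pairs):
        pmat[i, j] = pmat[j, i] = p_adj[k]
    return pmat

# per_topic_ap.tsv is a hypothetical file with one row per topic and one
# column per run; differences with adjusted p < 0.05 would be marked in bold.
# ap = np.loadtxt("per_topic_ap.tsv")
# print(maxt_permutation_test(ap) < 0.05)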

Effects of varying the contextualisation parameters

The results from Tables 6.7 and 6.8 indicate that the retrieval effectiveness of the contextualisation methods degrades at a lower rate than that of a standard passage retrieval approach when the queries and document transcripts contain higher amounts of ASR errors. These results demonstrate that these techniques can make retrieval methods more robust to ASR errors when the units to be retrieved are short in length and their retrieval is therefore more likely to be negatively affected by transcription errors. Recall from the descriptions of the DSI and PM models that the λ and σ parameters control the emphasis that is given respectively to the global and local context that surrounds a passage in the calculation of its relevance score.

Table 6.8: Retrieval effectiveness (MAP) of contextualisation models for test queries (SQD2). These results are for the best performing parameter settings found for the training queries (SQD1) in each evaluation condition. Percentages next to MAP scores show relative improvements with respect to the BM25 baseline.

        Transcripts                        Models
RIA     Query   Doc.    BM25    PM              DSI             DSI-PM
100%    M       M       .272    .291   +6%      .305   +12%     .314   +15%
87%     M       A0      .261    .294   +12%     .274   +5%      .299   +14%
74%     M       A1      .228    .267   +16%     .272   +19%     .278   +21%
72%     A0      A0      .261    .292   +11%     .254   -3%      .293   +12%
64%     A0      A1      .236    .267   +13%     .271   +14%     .274   +15%
59%     M       A2      .085    .178   +108%    .160   +87%     .174   +103%
57%     A1      A1      .186    .239   +28%     .249   +34%     .277   +49%
53%     A0      A2      .091    .177   +94%     .166   +82%     .172   +88%
51%     M       A3      .119    .209   +75%     .214   +80%     .192   +60%
48%     A1      A2      .097    .124   +27%     .126   +29%     .139   +43%
42%     A0      A3      .111    .173   +55%     .191   +72%     .167   +50%
40%     A2      A2      .117    .096   -21%     .112   -4%      .155   +33%
39%     A1      A3      .048    .144*  +200%    .142   +196%    .134†  +178%
34%     A2      A3      .066    .100   +52%     .122   +86%     .126   +91%
33%     A3      A3      .095    .145   +52%     .146   +53%     .168   +76%

A question that remains to be answered is thus whether these models can benefit from using larger amounts of context in increasingly noisy conditions. In other words, is it effective to increase the emphasis given to context in the score of a passage when the queries and documents contain a higher number of transcription errors? This section seeks to answer this question by studying the effects of varying the contextualisation parameters λ and σ in the DSI and PM scoring functions.

Figure 6.7 shows how MAP scores vary for increasing values of σ in six representative evaluation conditions. Each of the six lines in the plot was generated based on the optimal parameter settings for the SQD2 queries, by evaluating the PM function for σ = 0, ..., 800 while keeping b, k1, k3, and d fixed. Recall that larger values of σ increase the width of the Gaussian kernels and thus magnify the influence that individual term occurrences have over distant passages. For perfect or quasi-perfect transcripts (M-M and A0-A0), the model achieves maximum performance for σ = 76 and σ = 111 respectively, while for moderately noisy transcripts (A1-A1 and M-A3) it does so for σ = 296 and σ = 341 respectively. Finally, in extremely noisy conditions (A2-A2 and A2-A3), the maximum points are located at σ = 530 and σ = 682 respectively. These observations provide supporting evidence for the claim that longer spans of context become increasingly useful for retrieval as ASR errors increase in the transcripts.

Figure 6.8 shows the effects of changing the interpolation parameter λ from 0 to 1 in the DSI model for the same set of transcript combinations plotted in Figure 6.7. Recall that for smaller values of λ, the DSI model places more emphasis on passage scores than on document scores, whereas for λ ≈ 1 more emphasis is put on document scores than on passage scores.

Figure 6.7: MAP scores on SQD2 obtained with the PM method for six representative evaluation conditions and σ ∈ [0, 800]. Plots for the other transcript combinations evaluated follow similar trends.


Figure 6.8: MAP scores on SQD2 queries obtained with the DSI method for six representative conditions and λ ∈ [0, 1].


Based on these plots, it can be seen that higher passage retrieval effectiveness can be obtained when document scores are used in combination with passage scores. The curves in the plot appear to demonstrate two different trends. On the one hand, the MAP curves associated with high-quality transcripts tend to have peaks at lower values of the [0, 1] range, specifically at λ = 0.46, λ = 0.40, and λ = 0.38 for M-M, A0-A0, and A1-A1 respectively. On the other hand, curves associated with low-quality transcripts tend to have maxima at higher values of the [0, 1] interval, corresponding to λ = 0.77, λ = 0.55, and λ = 0.90 for M-A3, A2-A2, and A2-A3 respectively. Thus, document scores (global context) become increasingly beneficial for passage retrieval as the amount of mismatch caused by ASR errors increases in the transcripts.

6.3.4 Confidence adaptive contextualisation

The experiments presented in Section 6.3.3 show that a relevant passage can be ranked more effectively if evidence from additional occurrences of query terms around the passage is considered in the calculation of the passage’s relevance score. These results also show that, as the amount of speech recognition errors in the transcripts increases, greater importance can be given to context contributions in order to increase the robustness of these retrieval functions to ASR errors. In highly noisy conditions, term statistics become less reliable if calculated within short passages in the transcripts, so considering scores or term counts based on an expanded version of the passages or the full document normally results in enhanced passage scores and rankings.

Along with recognition hypotheses of the words spoken in an utterance, the majority of ASR systems can also produce confidence scores for words or word sequences. Recall from Section 2.2.4 that a confidence score is a numeric estimate, typically in the [0, 1] range, of the level of certainty the ASR system has about a particular recognised word or sequence. Given the observation that context becomes increasingly important in transcripts with increasingly higher error rates, it seems reasonable to consider the possibility of adjusting the contextualisation parameters of the DSI and PM functions according to the levels of speech recognition confidence found in the transcripts. The passage scoring process could then be modified to rely more strongly on context when scoring passages that have been recognised with low confidence, while reducing the contribution of context for passages that have been recognised with high confidence. The key intuition is that context becomes less useful for passages with reliable transcriptions, since these would normally provide accurate term frequency statistics from which their relevance score can be determined with an effective level of accuracy. In contrast, context has higher potential to improve the ranking of relevant passages with low-confidence speech recognition, since these are likely to provide less reliable count statistics which would translate into unreliable passage scores.

Adapting contextualisation parameters to confidence estimates

Given confidence scores for each term in a document transcript, a rough confidence estimate for a passage, c(p) ∈ [0, 1], can be obtained by averaging the confidence scores of the terms contained in the passage. The complement of the passage’s confidence score, u(p) = 1 − c(p), can therefore serve as an indicator of the uncertainty with which the contents of the passage were transcribed. In order to increase the incidence of context in the calculation of the relevance score of low-confidence passages, the contextualisation parameters in the DSI and PM methods can be increased proportionally to u(p).

For the DSI function shown in Equation 6.1, the interpolation parameter λ ∈ [0, 1] controls the incidence that the document context has over the final score of the passage. Instead of using the same value of λ for all passages, a passage-dependent value λp = (u(p) + λ)/2, equal to the average of the original λ and the uncertainty score of the passage, can be used as the interpolation parameter. Under this alternative set-up, the DSI function will tend to put more emphasis on document scores when scoring passages with high transcription uncertainty. In the case of the PM retrieval function (Equation 6.5), the σ > 0 parameter controls the width of the Gaussian kernel, and therefore determines the extent to which a term can propagate to distant positions and influence the scores of neighbouring passages. Given a maximum kernel width σ∨, a possible definition for a passage-dependent σp based on the passage’s transcription uncertainty is σp = u(p)·σ∨. The final effect of using these passage-specific values is then to increase the extent to which a term occurring in the document can influence the score of a passage that has high transcription uncertainty (low confidence).

In this case, σ∨ determines the maximum value that σp can acquire, which is assigned only to passages with extreme uncertainty levels (u(p) = 1).
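A minimal sketch of these two adaptations, following the formulas just given (λp = (u(p) + λ)/2 and σp = u(p)·σ∨), is shown below; function and variable names are illustrative, and in the actual system the confidences would come from the ASR word posteriors described in the next subsection.

def passage_confidence(word_confidences):
    # c(p): mean of the word-level ASR confidence scores within the passage
    return sum(word_confidences) / len(word_confidences)

def adaptive_lambda(word_confidences, base_lambda):
    # lambda_p = (u(p) + lambda) / 2, with u(p) = 1 - c(p)
    u_p = 1.0 - passage_confidence(word_confidences)
    return (u_p + base_lambda) / 2.0

def adaptive_sigma(word_confidences, sigma_max):
    # sigma_p = u(p) * sigma_max, reaching sigma_max only when u(p) = 1
    u_p = 1.0 - passage_confidence(word_confidences)
    return u_p * sigma_max

# a passage recognised with low confidence leans more heavily on its context
noisy = [0.35, 0.42, 0.51, 0.30]
clean = [0.97, 0.99, 0.95, 0.98]
print(adaptive_lambda(noisy, base_lambda=0.4), adaptive_lambda(clean, base_lambda=0.4))
print(adaptive_sigma(noisy, sigma_max=600), adaptive_sigma(clean, sigma_max=600))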

Experiments with adaptive contextualisation techniques

To investigate the potential benefits of adapting the contextualisation incidence parameters based on confidence scores, SPR experiments were carried out with the SQD1 and SQD2 queries over a selection of document transcript combinations from the SDPWS2 collection. Recall from the description of the SDPWS2 transcripts in Section 4.2.2 that each character sequence from a 1-best hypothesis was re-tokenised by using the morphological analyser MeCab. Since the LMs used for recognition were generated by using the analyser ChaSen, the re-tokenisation process with MeCab frequently produced a different word sequence from the one present in the ASR’s 1-best hypothesis. This new word sequence normally included tokens that were not present in the vocabulary of the LM used to decode the ASR hypotheses. Although time stamps could be obtained for MeCab’s tokens by running forced alignment, obtaining posterior-based confidence scores for these alternative tokens is not trivial. For this reason, the experiments reported in this section were carried out with the original tokens from the ASR transcripts (ChaSen’s), for which reliable confidence scores based on word posterior probabilities exist. Because of these tokenisation differences, the MAP scores of the experiments reported in this section may differ from those reported in Section 6.3.3.

Since the focus of these experiments is on the utilisation of confidence scores, it is important to ensure that the confidence estimates produced by the Kaldi and Julius ASR systems provide meaningful information. Figure 6.9 shows the distribution of confidence scores extracted from the terms of the transcripts K-MATCH (A0), MATCH (A1), UNMATCH-LM (A2), and UNMATCH-AMLM (A3). Recall that only the first of these (K-MATCH) was produced by Kaldi, while the remaining ones were generated by Julius. The plots show that the vast majority of the confidence scores produced by Kaldi are equal to or near 1.0. In contrast, those produced by Julius are more evenly distributed in the [0, 1] interval and thus seem to be more informative overall. Furthermore, there is a clear distinction in the distribution of confidence scores between high (MATCH) and low (UNMATCH) quality transcripts. As expected, confidence scores tend to be greater in high-quality transcripts. Considering these characteristics of the available transcripts, the experiments in this section were limited to transcripts produced by the Julius system.

Similarly to the experiments described in Section 6.3.3, in the experiments with adaptive context the SQD1 queries were used as training data for optimising the parameters of the DSI and PM functions. This included the parameters of both versions of each ranking function: those which used fixed context incidence parameters λ and σ, and those which adapted the initial values of these parameters by using the passages’ uncertainty scores. The best parameter configurations were finally used to evaluate the models on the SQD2 (test) queries.

Figure 6.9: Distribution of confidence scores associated with terms from the SDPWS2 transcripts.


Table 6.9 shows the MAP scores obtained by the adaptive (PM+U and DSI+U) and non-adaptive (PM and DSI) contextualisation models on the SQD2 (test) queries. Values in bold mark statistically significant differences (p < 0.05) between PM and PM+U, or DSI and DSI+U, based on paired t-tests. The MAP differences between adaptive and non-adaptive models were in general not statistically significant, meaning that there is no clear indication that the proposed adaptation approaches can improve the effectiveness of the PM and DSI methods. A possible explanation for this is that small variations in the context parameters may not dramatically affect the final rankings of passages. This can be partially seen in the plots from Figures 6.7 and 6.8, where large differences in MAP can only be achieved by large variations of the contextualisation parameters. However, the adaptation approaches produced highly variable incidence parameters in practice across different passages, so the above explanation may not provide a complete answer. Additional experimentation with alternative incidence parameters designed to be more sensitive to small variations in uncertainty scores, including term- and occurrence-specific σ values in PM, plus uncertainty scores based on query terms only, produced detrimental results in retrieval effectiveness.

Despite significant differences not being found between adaptive and non-adaptive methods, the MAP scores of DSI+U tended to be consistently higher than those of DSI for most transcript combinations, while the MAP values of PM+U tended to be generally lower than those of PM. A possible reason for this effect is that the adaptive version of PM may suffer from improper frequency normalisation when using different propagation values for different passages. In this case, a passage with high transcription uncertainty would obtain a higher propagation value σp and, with this, greater pseudo-frequency estimates from query terms. Thus, increasing the value of σp for a passage produces the effect of enlarging the passage. In the design of retrieval functions, it is usually beneficial to apply length normalisation to avoid overestimating the relevance scores of long documents, which are more likely to contain more term occurrences independently of their relevance status with respect to the query.

Table 6.9: Retrieval effectiveness measured in MAP of contextualisation models that adapt context parameters according to uncertainty scores (PM+U and DSI+U) compared to that obtained with non-adaptive models (PM and DSI). These results are for the SQD2 queries with the best performing parameter settings found for the SQD1 queries, and for transcripts containing the original tokens produced by the ChaSen analyser.

        Transcripts                  Models
RIA     Query   Doc.    BM25    PM      PM+U    DSI     DSI+U
74%     M       A1      .190    .239    .235    .229    .231
59%     M       A2      .080    .166    .162    .132    .124
57%     A1      A1      .124    .177    .172    .147    .152
51%     M       A3      .107    .192    .154    .179    .184
48%     A1      A2      .076    .073    .071    .070    .072
40%     A2      A2      .095    .122    .103    .120    .126
39%     A1      A3      .077    .132    .147    .091    .092
34%     A2      A3      .059    .089    .082    .097    .099
33%     A3      A3      .095    .149    .144    .134    .143

For similar reasons, the proposed adaptation of PM may benefit from a more advanced normalisation technique that considers the σp assigned to a passage, in addition to its length, to dampen the weight of the term frequencies contributing to that passage’s relevance score.

6.4 Summary

This chapter presented an initial investigation of some of the benefits that contextualisation techniques can provide to an SCR system in the task of ranking a pre-defined collection of spoken passages in order of relevance to a query. Exploiting contextual information for ranking the passages by considering the information from their container documents can be beneficial in SPR for a number of reasons.

202 document transcripts. Three contextualisation techniques were evaluated and compared against a well-tuned non-contextualised retrieval model: a document score interpolation (DSI), which considers global context, a positional model (PM), which emphasises local context, and their combination (DSI-PM). Results of retrieval experiments with transcripts of varying quality validate previous findings that highlight the importance of using context in element-retrieval and SCR tasks, and indicate that a combination of local and global context performs best for SCR. Further analysis revealed that considering greater extents of local and global context can improve SCR effectiveness as ASR errors increase in the transcripts. This last observa- tion motivated further experiments with techniques that can adapt the contextualisation incidence parameters based on the level of transcription uncertainty given by the ASR sys- tem. The results from these experiments showed that adaptive techniques did not provide any significant improvements in retrieval effectiveness over non-adaptive contextualisa- tion techniques. Although not significant, minor differences between using adaptive and non-adaptive techniques were still observed, which motivates further investigation in this direction. More complex adaptation techniques than those explored in this work could be developed to appropriately account for length normalisation issues when passages with different uncertainty levels are contextualised with disproportionate amounts of contexts. Additionally, higher-quality uncertainty scores could be estimated based on more advanced methods for confidence score calibration, such as those described by Yu et al. (2011). In addition to investigating the value of contextualisation techniques for robust SCR, this chapter studied some of the challenges that verbose queries and small collections pose to existing retrieval methods. In such circumstances, inverse document frequencies are poorly estimated and cannot provide an accurate account of the true discrimination power that a term has for selecting documents. The overall effect is that low content- bearing terms get assigned similarly high IDF scores than high-quality terms. These low-quality terms tend to dominate the relevance score of the documents because the number of distinct terms with these characteristics tend to be greater in verbose queries. Two methods were explored to mitigate the issues associated with verbose queries and poorly estimated document frequencies: (i) exploiting within-query term frequencies (QF); and (ii) using exponential inverse document frequencies (IDF). Experiments showed that using these two techniques in a BM25 function provide increased retrieval effectiveness when the collection of documents is small and the queries are extremely verbose (SDPWS2 collection with SQD queries). Contrary to this, for larger spoken collections (BBC) or shorter queries (SD2), using QF and exponential IDF in BM25 does not produce improved effectiveness. In this chapter, contextualisation techniques were shown to provide enhanced robust- ness to ASR errors in a passage retrieval task. While contextualising passages by con- sidering longer excerpts of content resulted in increased retrieval robustness against ASR errors, the ability of these techniques to tackle segmentation errors in the pre-defined pas-

The next chapter presents a large-scale evaluation of several content structuring methods, including methods based on contextualisation, for the case when pre-defined passages are not immediately available and need to be determined by the SCR system.

Chapter 7

Content Structuring and Evaluation in SCR

SCR from collections of long multitopical documents requires effective content structuring strategies to be applied if the amount of irrelevant material presented to users is to be minimised and their efficiency maximised. As discussed throughout Chapters 2 and 3, several content segmentation strategies have been proposed in the past with the objective of achieving this goal. The basic strategy adopted has consisted of dividing spoken documents into smaller retrieval units by using one of the text segmentation algorithms described in Section 2.3, and then presenting these segments to the user as a ranking of audio snippets whose playback commences at the beginning of the retrieved speech segment.

While a significant amount of research has focused on developing new content structuring strategies, a comparable amount of effort has been devoted to the development of new evaluation methodologies and measures to appropriately quantify the “quality” of a ranked list of search results when retrieval units are not known in advance and are instead expected to be defined by the search system. However, the difficulties associated with the evaluation of retrieval methods in these conditions, plus additional considerations that are relevant for the estimation of user satisfaction in SCR, have resulted in the creation of evaluation measures with undesirable properties. These measures tend to favour some segmentation and retrieval strategies more than others, and to overlook aspects that are important for user satisfaction.

This chapter examines the limitations of current SCR evaluation measures when used to compare different content structuring strategies, and then describes a novel evaluation framework that seeks to quantify user satisfaction more accurately. The proposed measure is then used to evaluate a large number of content structuring methods in an SCR task in order to determine which of these is most effective and to better understand their advantages and disadvantages. The remainder of this chapter is structured as follows. Section 7.1 overviews the evaluation problem and discusses aspects that should be accounted for when designing an evaluation measure for SCR.

Section 7.2 reviews existing evaluation measures proposed for SCR and related retrieval tasks, while Section 7.3 describes our proposed evaluation framework. Section 7.4 describes experiments that compare structuring methods, while Section 7.5 summarises our findings.

7.1 Evaluation of unstructured content retrieval

This section discusses different aspects of unstructured retrieval tasks, user behaviour, and user satisfaction, which are important to consider when designing evaluation measures for SCR.

7.1.1 Overview and the pool bias problem

Traditionally, IR systems have been evaluated in terms of their ability to distinguish between relevant and non-relevant documents or, as in the case of ranked retrieval, in terms of the proportion of relevant documents that are ranked above non-relevant ones. A ranking in which all of the relevant documents are placed at the top of the list is considered the most effective and satisfactory way to present the search results to the user. Any variation of this ranking that interleaves relevant with non-relevant documents is considered sub-optimal and, consequently, less useful to the user.

The traditional approach to evaluating IR systems relies on the assumption that most users will be satisfied if presented with a set of predefined “documents” containing the information of interest. This implicitly assumes that such a set of documents exists or that it can be constructed from the content available in advance of the indexing process. In other words, the assumption is that the collection is, or can be, structured somehow into a set of ideal documents for retrieval. A document is therefore considered an indivisible or atomic retrieval target from the system’s perspective, and systems are hence evaluated in terms of their ability to retrieve these basic units of information in order of relevance. In these circumstances, the standard “pooling” methodology, in which a sample of potentially relevant documents is generated from multiple ranking algorithms and then submitted for manual assessment, is appropriate and can be applied without major difficulties. Once a sample of relevant documents is available for a query, any of the evaluation measures that were presented in Section 2.1.3, such as MAP, nDCG, or ERR, can be used to quantify the ranking effectiveness of a system.

Compared to traditional document retrieval tasks, measuring effectiveness in unstructured content retrieval presents additional difficulties. If the “natural” documents in the collection are not suitable as retrieval units, either because they are long, multi-topical, or cumbersome to navigate through, and if in addition it is not clear how to best divide them into smaller sub-units suitable for retrieval, then the assumption of the existence of a document representing an ideal retrieval unit becomes less reasonable, as do the evaluation methods and effectiveness measures which depend upon this concept. This motivated the development of alternative pooling strategies and effectiveness measures which are more appropriate for the evaluation of unstructured content retrieval tasks in which there is no predefined retrieval unit.

Because there is no predefined set of document units to be retrieved from an unstructured collection, retrieval systems are left with no other option than to produce location pointers, indicating the starting and, optionally, ending offsets within the collection where the relevant information may be found by the user. Most pooling strategies for gathering relevance data assume that the outputs of different retrieval systems are samples taken from the same predefined set of documents. The union of these samples is then calculated, to remove any possible duplicates, and each document is manually judged for relevance by a human assessor, independently from other documents. In the case of unstructured collections, this procedure cannot be applied without modification. Besides the non-trivial problem of identifying near-duplicate results among ranked lists of pointers, there is also a significant increase in the difficulty of assessing the relevance of content that is taken out of context, as this may not contain sufficient evidence for the assessor to provide a reliable judgement of relevance.

Researchers have adopted a wide array of alternative methods for collecting relevance assessments in these circumstances. For instance, in the cross-language speech retrieval (CL-SR) task, assessors were asked to manually find regions containing information relevant to a search topic by issuing a related query to a retrieval system, and then using this to guide their decision making when judging pointers from a pool of search results (Oard et al., 2006). Relevance assessments for the Search and Hyperlinking 2013 (SH13) task were collected in a similar fashion. As described in Section 4.1.4, annotators were asked to determine the boundaries of a section relevant to a query with the help of an SCR system. The ground truth data for the NTCIR SpokenDoc SD2 and SQD1 tasks, described in Section 4.2.4, was collected through a pooling procedure, but assessors were asked to refine the pointers from the pool of results in order to determine the true extent of the relevant content within a document.

While in all these examples annotators were explicitly requested to revise the pointers produced by the retrieval systems, and were given access to the full contents of the documents pointed to by these results, the relevance assessment studies carried out at the SH14 and SAVA tasks employed an annotation tool which prevented assessors from further refining the pointers from the pool and restricted their access to the full contents of the documents. Restrictions like these can introduce different kinds of biases in the resulting relevance judgements, especially if the systems used for generating the results in the pool adopt similar content structuring strategies and ranking algorithms. Consider for instance Figure 4.6, which shows the distribution of lengths of passages included in the pools for the relevance assessments of the SH13, SH14, and SAVA topics. Most passages in the SAVA pool are 120 seconds in length, while those from the SH13 and SH14 pools vary more widely.

Biases like these in the ground truth can potentially propagate to the figures produced by evaluation measures, which will tend to favour SCR approaches that are similar to those used to produce the ground truth.

7.1.2 Representation and visualisation of search results

Retrieval from collections of long multi-topical unstructured documents requires methods that are able to determine the exact locations where the relevant content can be found. The results produced in this case may take one or more forms, depending on the way they will be presented to the user. One possibility is to present results as a ranked list of document-offset pairs, indicating the ID of the document and the offset at which relevant information may be encountered within this document. An extension of this form consists of presenting both starting and ending offsets for each document, to indicate the span within the document that the relevant information may cover. The first result type, in which only starting points are suggested, can be referred to as a “one-sided” result. Conversely, the second form, in which both extremes of a suggested region or passage are specified, can be referred to as a “two-sided” result.

In one-sided content retrieval, search systems are designed to produce a ranked list of best entry points suggesting where a user should begin inspecting a document. Users can then be advised to inspect the results in the proposed order, starting their search from the offsets returned for each document. One-sided evaluation measures are then those specifically designed to evaluate the quality of a ranked list of best entry points. Within this scheme, an effective retrieval system is considered to be one that assigns top ranks to pointers that are close to the onset of some span known to contain the information of interest.

Besides starting point recommendations, users may also benefit from ending point suggestions included in two-sided results. This is because ending points can highlight regions within a document that are not worth inspecting, since they may be substantially less likely to contain any relevant information as predicted by the ranking function. By considering this extra information, users can make more informed decisions about when to stop searching when seeking relevant content in a document. Two-sided evaluation measures are then those that quantify the usefulness of a ranked list of passages, specifying both starting and ending offsets, for finding information relevant to a query. This category of measures seeks to reward systems that rank passages containing relevant information more highly than others.

Both one-sided and two-sided results can be presented to the user as a flat list of items ordered by estimates of relevance or, alternatively, as a ranked list of items grouped by document ID. In the first case, items associated with one document may interleave with others associated with a different document in the ranked list. This can cause discomfort for some users, who may prefer inspecting all interesting regions of a document first before moving on to the next one. A search system could avoid this by grouping results by document.

In this case, groups could be presented in order, ranked by the scores of their associated documents or, alternatively, by some combination of their highest scoring pointers or passages. Within each document, the highest scoring items could then be presented as a ranked list, or their relevance scores could be shown graphically as a density function superimposed on the document’s timeline. Given these variants of result representation, visualisation, and presentation layouts, it is not completely clear under which variant retrieval systems would be most effectively evaluated. Ultimately, the way results are to be presented to the user, as well as the expectations about how users will interact with them, should guide the design of evaluation measures and relevance assessment studies. In the absence of information about how results will be presented in a retrieval application, retrieval results can simply be seen as an ordered list of suggested locations which, if inspected by the user in the specified order, will satisfy the information need of the user while minimising the user’s effort.

Related to this dilemma is the question of whether retrieval systems should be seen as tools that facilitate the location of the relevant information or tools that can additionally facilitate its consumption. In other words, should systems only provide pointers to where the relevant content is located, or should they also include hints or transform the content somehow so that users can make better use of the relevant information for their final goal?

7.1.3 Browsing dimensions and user satisfaction

The main goal of a retrieval system is to maximise user satisfaction. In standard IR tasks, this is assumed to occur when the user can effectively find content that satisfies his/her information need by reviewing the search results without having to inspect any piece of irrelevant information. This objective may vary slightly depending on whether the user is interested in finding all the relevant material (recall-oriented) or whether their need is satisfied by any of the relevant documents available (precision-oriented). Furthermore, when graded relevance assessments are considered, increased levels of user satisfaction are assumed to be achieved when highly relevant documents are ranked above less relevant ones.

Under these considerations, there are a number of dimensions along which a retrieval system could improve in order to increase the satisfaction of users. First, a system could reduce the amount of non-relevant material the user needs to audition by placing relevant content at top ranks. Second, the system could increase the amount and quality of relevant material which can be effectively accessed by the user from inspecting the search results, by ranking all pieces of highly relevant content at top ranks. While the first aspect relates to user effort, the second relates to gain, that is, the amount of benefit a user can obtain from navigating through the list of search results.

In retrieval tasks like passage retrieval, XML retrieval, or SCR, where one of the main goals is to take the user to the exact locations where the relevant material is positioned within long documents, the “effort” dimension plays a critical role in the estimation of user satisfaction. In this case, users will be maximally satisfied if provided with an ordered list of document offsets which would permit them to detect every piece of relevant information available without having to inspect any non-relevant material.

There are three ways in which a retrieval system could reduce the amount of non-relevant material the user will be exposed to in these tasks. The first consists of reducing the number of results pointing to documents that do not lead to any relevant material or that point to redundant material, that is, content already seen or processed by the user. The second consists of producing document offsets at locations that facilitate the detection of any relevant content a document may contain, that is, locations that are close enough to the onsets of the relevant regions. The third consists of ranking these high-quality entry points above lower-quality ones.

All of the above aspects focus on minimising effort along two dimensions of content browsing. The first browsing dimension, “vertical”, represents navigation across the different pointer surrogates in the ranking of results, corresponding to the entries returned by the system in the search results page. The action of moving along the items in this list has some non-negligible effort associated with it. The second browsing dimension, “horizontal”, relates to the process of navigating within a specific document, starting from one of the entry points suggested by the system, in search of relevant content. This dimension presupposes an additional effort or cost on the part of the user, on top of that associated with vertical browsing.

In the design of an evaluation measure, the effort associated with vertical and horizontal browsing may depend on the peculiarities of the retrieval task and the characteristics of the content. For instance, most evaluation measures designed for document retrieval consider the effort derived from horizontal browsing to be negligible, and only take into account vertical browsing effort. In tasks where retrieval results consist of pointers into text documents, horizontal browsing acquires increased importance, and evaluation measures therefore try to account for this type of user effort. SCR perhaps presents the most extreme case, where vertical effort is arguably less substantial than horizontal browsing effort, since the cost associated with listening to audio material is higher than the cost of scrolling down through a ranked list of text snippets.

7.1.4 Browsing and navigation of multimedia content

Modern audio playback tools provide various controls which can significantly speed up the browsing of speech material compared to real-time listening. Standard video cassette recorder (VCR) based controls include normal playback; backward and forward seek operations, which permit the user to jump back and forth in the audio track in steps of 5, 10, or 60 seconds; speed-up controls, which permit them to increase the playback speed by up to 2x; and random access through an interactive seeker-bar, which can be used to jump to any arbitrary time-point in the audio track. Little research has been done in the past to study how users may interact with VCR-like playback tools when faced with the task of searching for information within speech content.

An exception to this is the study described by Crockford and Agius (2006). This study involved 200 participants who were asked to find optimal entry points within a collection of 12 video clips where relevant information could be found about a particular topic. A video browsing tool was developed for this purpose, which included some of the typical VCR-like controls described above, and permitted all user interactions with the player to be recorded. One of the main findings from this study was that users tend to perform the search task faster over time as they become familiar with the contents of the collection. Another important finding was that participants employed a common set of browsing and search strategies for auditioning a video. Straight viewing or linear playback was used in 20% of cases, while random seek strategies were used less frequently. The most frequently adopted strategy (46% of cases) consisted of increasing the playback speed of the content. This strategy was also found to be the most effective at reducing auditioning time, providing an average time reduction of 24% compared to straight viewing. The analysis by the authors also suggests that users prefer reviewing multimedia content in a linear and sequential fashion, in the direction that the media would naturally follow, as opposed to browsing backwards in time. A more recent user study described by Cobârzan and Schoeffmann (2014) investigated how users interact with modern web video players when searching for excerpts of video content. The participants in this study were given two types of known-item retrieval tasks to perform manually. Both tasks required them to use the playback, pause, and seeker-bar controls of a typical web video player to find a specific scene in a video within three minutes. In the first task, participants were shown the target scene and asked to re-find it after some time. In the second task, they were shown a text description of the scene along with some relevant keyframes to enable faster visual identification. Although the experimental setup, content, and topics used in this study were heavily geared towards visual information, the results shed light on the navigation strategies that users tend to use for finding content and are thus relevant to the SCR case. In 60-70% of occasions, users employed a stepped linear-search strategy at the beginning of their search, consisting of switching between normal playback and forward seeks. Between 20% and 10% of users preferred commencing by seeking 30 and 60 seconds forward respectively, and only 1-3% preferred to jump to a random location. Users also considered using straight playback or forward seeking in 80% of cases, in contrast to the 20% of users who considered seeking backwards in time. As demonstrated by these studies, in the context of SCR, VCR-like controls permit users to reduce the time they need to invest in scanning a search result while seeking relevant information. Most users make frequent use of the controls provided for implementing their browsing strategy. There is a clear preference for forward seeking strategies over backward seeking. These are often optimised by switching between normal playback, fixed seeks, and increased playback speed. Because of the impact these controls may have on reducing horizontal browsing effort, they should in principle be considered in the

design of evaluation measures for SCR. Besides playback controls, other visual aids can be integrated into SCR systems in order to reduce horizontal and vertical browsing effort. A popular technique is to use thumbnails to annotate the seeker-bar with different types of metadata, which are shown when the user selects a particular point in time in the bar. In video retrieval systems, thumbnails typically consist of keyframes, specially selected from the video contents so that they facilitate the identification of relevant material. In SCR applications, the seeker-bar can be annotated with a partial view of the ASR transcripts, or with a set of keywords extracted from them and selected based on the user’s query. In passage retrieval from text documents, highly scoring regions of text can be highlighted so that users can find these more quickly. These visual aids can substantially reduce horizontal browsing time, and thus should also ideally be considered in the design of evaluation measures for SCR.

7.2 Evaluation measures for unstructured content retrieval

Several variants of one-sided and two-sided effectiveness measures have been proposed in the past for evaluating the quality of a flat ranked list of location pointers or passages within unstructured documents. This section reviews a representative set of these measures and identifies some of their limitations.

The gain-discount framework

To facilitate the comparison across different families of measures, these are analysed within the gain-discount framework of Zhang et al. (2010), Carterette (2011), and Smucker and Clarke (2012), described in Section 2.1.3. Recall that this framework decomposes effectiveness measures into gain ($g_k$) and discount ($d_k$) factors. For convenience, the general formula is presented again in Equation 7.1.

$$\frac{1}{\mathcal{N}} \sum_{k=1}^{\infty} g_k\, d_k \qquad (7.1)$$

The discount factor ($d_k$) is a monotonically non-increasing function of the ranks, which decreases every time the user inspects a new element at rank $k$ to reflect the decrease of the user’s interest in reviewing documents at lower ranks; the gain factor ($g_k$) represents the added benefit associated with assessing the element ranked at position $k$; and $\mathcal{N}$ is a normalisation factor. Note that the inverse discount $d_k^{-1}$ can be interpreted as a measure of user effort, and that Equation 7.1 can alternatively be written as the ratio of gain to effort (Jiang and Allan, 2016). Additionally, if $\sum_k d_k = 1$, the discounting function induces a probability distribution over ranks, where each $d_k$ corresponds to the user’s continuation probability at rank $k$. Under these interpretations, the discount and gain factors can be explicitly related to the gain and effort dimensions of user satisfaction described in Section 7.1.3.
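To make this decomposition concrete, the following minimal sketch evaluates a ranking from pre-computed gain and discount sequences following Equation 7.1; the function and argument names are illustrative and not taken from any existing toolkit.

```python
from typing import Sequence

def gain_discount_score(gains: Sequence[float],
                        discounts: Sequence[float],
                        normaliser: float = 1.0) -> float:
    """Generic gain-discount measure: (1 / N) * sum_k g_k * d_k (Equation 7.1)."""
    assert len(gains) == len(discounts)
    return sum(g * d for g, d in zip(gains, discounts)) / normaliser
```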

7.2.1 One-sided measures based on temporal distance

One-sided evaluation measures focus on the spatial/temporal distance that may exist between the entry points returned by the retrieval system and the location of the relevant information. The simplifying assumption is that this distance is representative of the amount of horizontal browsing effort that users need to invest in assessing a search result. Two important one-sided evaluation measures used in the past for quantifying SCR effectiveness are generalised average precision (gAP) and tolerance to irrelevance (T2I).

Generalised average precision (gAP)

Generalised average precision (gAP) was originally proposed as an extension of AP to graded relevance assessments (Kekäläinen and Järvelin, 2002). This measure was later adapted for SCR (Liu and Oard, 2006) in the context of the Cross-lingual Speech Retrieval (CL-SR) task and for text passage retrieval in the INEX Ad-hoc Best in Context task (Kamps et al., 2007). In gAP, the gain derived by the user from visiting an entry point retrieved at rank $k$ depends on the entry point’s distance with respect to the beginning of some relevant segment. Only retrieved pointers occurring within a certain minimal distance from the relevant material can result in non-zero gain, while those that are too far away result in a gain of 0. It is also assumed that users cannot derive any gain from finding relevant sections which they have already found at previous ranks. If an entry point falls in between two regions of relevant material, it is assumed that the user will only reach the section closest to the point and will therefore not consume the second section, which may be visited by the user at subsequent ranks. The original definition of gAP given by Liu and Oard (2006) can be instantiated under the gain-discount framework as shown in Equation 7.2,

$$\mathcal{N} = 1, \quad g_k = \frac{1}{k} \sum_{i=1}^{k} r_i, \quad r_k = \max\!\left(1 - \frac{dist_k}{10G},\ 0\right), \quad d_k = \frac{u_k}{R}, \quad u_k = \begin{cases} 1 & \text{if } r_k > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (7.2)$$
where $dist_k \geq 0$ is the distance between the jump-in point retrieved at rank $k$ and the nearest relevant onset point from the ground truth, and $R$ denotes the total number of relevant items in the collection. If the entry point retrieved appears in a document that has no relevant content, then $dist_k$ is considered to be $10G$ and no extra gain is awarded.

Similarly, $dist_k$ is $10G$ if there is another point retrieved at some previous rank $i < k$ whose closest relevant segment in the ground truth is the same as that for $k$.

Figure 7.1a shows a plot of the distance-based function $r_k$ from Equation 7.2, known as the reward or penalty function. Based on simulations with this metric, Liu and Oard (2006) proposed a penalty function which reduces the credit of an entry point by 0.1 absolute for every $G = 15$ (granularity) seconds of distance shift. The $G$ parameter thus controls the slope and “width” of the reward function. Larger values of $G$ correspond to wider reward windows and the assumption that users will still derive gain from more distant entry points.

Figure 7.1: Reward or penalty distance function for gAP used in different evaluation campaigns. For Figures 7.1a and 7.1b, distance (dist) is measured in seconds, while for Figure 7.1c it is in number of characters. (a) Liu and Oard (2006). (b) Galuščáková et al. (2012). (c) Kamps et al. (2007).

Similarly to AP, gAP can be interpreted in terms of a user model which progresses down the ranked list of pointers from top to bottom. First note that $\sum_{k=1}^{\infty} u_k = R$ if all possible jump-in points are returned in the ranked list, so that $d_k$ defines a probability distribution over the rankings produced by the retrieval system. If these are seen as stopping probabilities, then gAP can be interpreted as the expected value of precision, weighted by some distance function $r_k$, if users are equally likely to stop their search at ranks with entry points that are sufficiently close to the start of some unseen relevant segment.
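As an illustration of Equation 7.2, the sketch below computes gAP for a ranked list of entry points against a ground truth of relevant onset times. The function and argument names are hypothetical; each entry point is matched to the nearest relevant onset in its document that has not already been credited at an earlier rank, following the rules described above.

```python
def gap_score(entry_points, relevant_onsets, G=15.0):
    """Sketch of gAP (Equation 7.2).

    entry_points: list of (doc_id, time_in_seconds) pairs in ranked order.
    relevant_onsets: dict mapping doc_id to a list of relevant onset times.
    G: granularity in seconds; rewards vanish beyond 10*G from an onset.
    """
    R = sum(len(v) for v in relevant_onsets.values())   # recall base
    credited = set()      # (doc_id, onset) pairs already found at earlier ranks
    rewards, gap = [], 0.0
    for k, (doc, t) in enumerate(entry_points, start=1):
        onsets = [o for o in relevant_onsets.get(doc, []) if (doc, o) not in credited]
        dist = min((abs(o - t) for o in onsets), default=10 * G)
        r_k = max(1.0 - dist / (10 * G), 0.0)            # triangular reward
        rewards.append(r_k)
        if r_k > 0:                                      # u_k = 1: accumulate precision
            nearest = min(onsets, key=lambda o: abs(o - t))
            credited.add((doc, nearest))
            gap += (sum(rewards) / k) / R
    return gap
```

With the triangular reward above, an entry point 9G seconds away from the nearest onset still earns a reward of 0.1, matching the 0.1-per-G-seconds penalty proposed by Liu and Oard (2006).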

Alternative reward functions and limitations of gAP

Galuščáková et al. (2012) conducted a user study to determine the extent to which the triangular function proposed by Liu and Oard (2006) reflects the real tolerance of users in the context of an SCR task. Participants in this study were asked to judge the quality of arbitrary playback points, randomly generated before and after the beginning of relevant passages. Analysis of user interactions indicated that users generally spent 25% less time in identifying a relevant passage when given a point in time before the start of the passage than one after. Users tended to give up their search when given an entry point 3 to 5 minutes or further away from the onset of a relevant region, and tended not to invest significantly different amounts of effort when commencing within 1 minute of the relevant material. Based on these observations, Galuščáková et al. proposed an alternative reward function that prefers points located before rather than after the true start of the relevant content, and that equally rewards points appearing within a reasonable distance (1 minute) of the relevant start point. This improved reward function is shown in Figure 7.1b. Yet another reward function for gAP, shown in Figure 7.1c, was proposed by Kamps et al. (2007) in the context of the INEX Ad-hoc Best in Context task.

A fundamental problem with gAP is that the measure can be practically insensitive to changes in the rankings, while being extremely sensitive to changes in the quality of the entry points. To illustrate this, consider the reward function from Equation 7.2 and two rankings, A and B, generated in response to a query for which there is only one relevant passage, i.e., $R = 1$. Suppose that both rankings contain only one entry point that is within the $10G$ tolerance window from the start of the relevant passage. More specifically, suppose that A’s point is ranked at position $k_A$ and is $9G$ away from the relevant passage, while B’s is ranked at position $k_B$ and is aligned perfectly with the beginning of the relevant passage. By substituting these values into Equation 7.2 for rankings A and B and equating both instances of the equation, it can be shown that the gAP score for ranking

A will be equal to that for ranking B if $10 k_A = k_B$. Thus, system A must rank its entry point at least 10 times better than system B in order to obtain the same gAP score. If $G$ is set to a small value, say, 10 seconds, the measure would assign equal rewards to an entry point ranked at position 5 that is within 90 seconds of the relevant content (A), and a perfect point ranked at position 50 (B). Whether users will prefer ranking A over B in the above example will ultimately depend on the user’s ability to identify irrelevant results quickly and the amount of time that they are willing to invest in this process. If the user takes on average 10 seconds to realise that a jump-in point does not lead to any relevant content, then she would be expected to reach the relevant content after 140 seconds if using ranking A and after 495 seconds if using ranking B. Clearly, gAP overvalues entry point accuracy over ranking effectiveness. This is a consequence of the formulation of the measure, which underestimates the amount of horizontal effort that the user must invest in inspecting irrelevant results.

Tolerance to irrelevance (T2I)

Tolerance to Irrelevance (T2I) is a general model of user behaviour designed to measure the effort that a user must invest in scanning a ranked list of best entry points (De Vries et al., 2004). The model assumes that users will be willing to invest no more than $t_{NR}$ seconds of their time in reviewing irrelevant content per entry point assessed. For this reason, the threshold $t_{NR}$ is referred to as the user’s tolerance to irrelevance. Figure 7.2 provides a graphical representation of this basic model as a finite state automaton. When browsing the $k$-th result, the user stays in the non-relevant state (NR) until: (a) encountering some relevant region, in which case the user moves onto the relevant state (R); or (b) the user’s tolerance expires, in which case the user abandons horizontal browsing and proceeds to inspect the next result in the ranked list. Under the assumptions of this abstract model, different effectiveness measures have been derived. One of these was proposed in Aly et al. (2013a) for use in the S&H tasks (Eskevich et al., 2014, 2015). Under the gain-discount framework, this measure can be instantiated by taking $g_k$, $d_k$, and $u_k$ from Equation 7.2, and redefining $r_k$ as 1 if $dist_k \leq t_{NR}$ and 0 otherwise. Note that this measure is equivalent to gAP if using a squared reward function with width given by $t_{NR}$.


Figure 7.2: FSA for the T2I user model proposed by De Vries et al. (2004).

Unlike decaying reward functions, the squared function used in Aly et al. (2013a) implements a strict reward policy that is not realistic in terms of user satisfaction, as per the 2012 study by Galuščáková and Pecina. In addition, this simple variation of T2I does not remove the imbalance problem of gAP. If poor ranking quality is under-penalised and horizontal effort is not appropriately accounted for, a system may obtain an increased gAP or T2I score simply by returning multiple entry point candidates within a hypothesised highly relevant region, as this would maximise the probability of “hitting” a relevant target by chance, without facing the risk of being heavily penalised by the measure.
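The contrast between the decaying gAP reward and the squared T2I-style reward can be written down directly; the default values of G and of the tolerance threshold below are illustrative assumptions, not calibrated values.

```python
def triangular_reward(dist, G=15.0):
    """gAP-style reward (Liu and Oard, 2006): decays by 0.1 per G seconds,
    vanishing beyond 10*G seconds from the nearest relevant onset."""
    return max(1.0 - dist / (10.0 * G), 0.0)

def squared_reward(dist, t_nr=60.0):
    """T2I-style reward as used in Aly et al. (2013a): full credit within the
    user's tolerance t_NR, none beyond it (t_nr = 60 s is an assumed value)."""
    return 1.0 if dist <= t_nr else 0.0
```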

7.2.2 Two-sided measures based on text or temporal units

Besides measuring retrieval effectiveness by estimating the quality of starting offsets, several retrieval measures have been developed to also account for the ending offsets suggested by the retrieval systems. This section reviews these types of measures, emphasising those that have been used in SCR research.

Measures based on overlap over pre-defined units

Most two-sided evaluation measures designed for unstructured retrieval are simple extensions of average precision (AP). One such extension, called overlap AP (oAP) (Aly et al., 2013a), was presented in Section 5.3.1. In oAP, retrieved results that overlap with some region deemed relevant in the ground truth are considered relevant. Precision at $k$ is then calculated as the proportion of the top $k$ results that overlap any relevant region. These precision values are then averaged across ranks at which results are deemed relevant, and then divided by the number of relevant regions from the ground truth ($R$) to obtain oAP. An obvious problem with the oAP measure is that it does not appropriately account for horizontal browsing effort. Instead, users are assumed to be equally satisfied if presented with a perfect result that captures the exact extents of a relevant region, as well as with another one that minimally overlaps with some relevant material but whose starting point is too distant to be considered useful by users. For this reason, retrieval systems can easily

maximise oAP by returning offsets covering entire documents. Thus, while gAP-based measures tend to favour retrieval systems that produce a large number of entry points around a putative relevant region, oAP will tend to favour rankings containing a lower number of results that cover long spans of content across documents. Another set of evaluation measures for unstructured retrieval presupposes that the document collection can be divided into a set of minimal indivisible units, such as sentences or utterances, which can be used later to guide the collection of relevance assessments and the estimation of retrieval effectiveness. In particular, the set of measures used at the NTCIR SpokenDoc tasks (Akiba et al., 2011, 2013a, 2014, 2016) assume that relevance assessments and retrieval results are passages made of multiple units taken from a common set of speech utterances or inter-pausal units (IPUs). A measure of the quality of a given ranked list of IPU-passages for a query can then be calculated based on the number of IPUs in the retrieved passages that are relevant and the ranks at which these are returned. In particular, point-wise AP (pwAP) calculates AP by treating a retrieved passage as relevant if its middle IPU is relevant. Unlike oAP, this measure cannot be gamed by a system that returns long passages. Yet, pwAP will tend to favour methods that return a single IPU per passage, as this will increase the chances of maximising the number of relevant elements retrieved, relative to the total number of relevant IPUs available in the ground truth. Another effectiveness measure based on a common granular segmentation is termed utterance AP (uAP). In uAP, the ranking of IPU passages is converted into a ranking of IPUs, where the first IPUs are those contained in the best-ranked passage, the second IPUs are those from the second best-ranked passage, and so on until exhausting all passages from the original ranking. uAP is then obtained by calculating AP over the transformed ranking, treating individual IPUs as retrieved documents. The ranking transformation applied in uAP can be thought of as an operation that “flattens” the horizontal browsing dimension of the search results into a single vertical (ranking) dimension. An appealing property of this approach is that horizontal browsing effort can then implicitly be accounted for by considering the ranks at which IPUs are found. For uAP in particular, this is automatically handled by the discounting factor in AP, $k^{-1}$, which decays asymptotically with increasing rank positions. Under this scheme, the user is assumed to traverse the retrieved passages one by one, examining all IPUs of one passage before moving on to the next one. In this process, the user derives gain when encountering relevant IPUs and invests a constant amount of effort per IPU. Despite its advantages, a limitation of uAP is that it assumes that users will never browse content beyond the boundaries of a passage, even in the hypothetical case where a region of relevant IPUs occurs immediately after or before the boundaries of the retrieved passage. In addition, since the discounting factor is applied to within-result positions instead of ranking positions, the measure does not properly represent the conditions in which users will be shown the search results. In particular, the discounting factor applied

to a “flattened” ranking will decrease at a higher rate with respect to ranking positions, making the measure overly sensitive to top results in the rankings. Lastly, fractional AP (fAP) calculates an AP-like estimate based on the fraction of IPUs within a passage that are relevant (within-passage precision) and the proportion of IPUs from the relevant passage that are retrieved by the passage (within-passage recall). Within-passage precision $wP$ and recall $wR$ for a passage $k$ are thus calculated as shown in Equation 7.3,

$$wP_k = \max_{r \in \mathcal{R}} \frac{|k \cap r|}{|k|} \qquad wR_k = \max_{r \in \mathcal{R}} \frac{|k \cap r|}{|r|} \qquad (7.3)$$
where $\mathcal{R}$ denotes the set of all relevant IPUs and $k$ is the set of all IPUs contained by the passage. Under these definitions, fAP can be instantiated as shown in Equation 7.4,

$$\mathcal{N} = R, \qquad g_k = wR_k \sum_{i=1}^{k} wP_i, \qquad d_k = k^{-1} \qquad (7.4)$$
where $R$ denotes the number of relevant passages from the ground truth. Similar to the other measures presented in this section, fAP does not consider the fact that users may decide to continue inspecting material beyond the boundaries of a passage and is thus prone to overpenalise near-misses. In comparison to uAP, fAP does not flatten the vertical and horizontal browsing dimensions and thus avoids issues with regard to improper discounting.
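The following is a minimal sketch of fAP following Equations 7.3 and 7.4, in which ranked passages and ground-truth relevant passages are represented as sets of IPU identifiers; the function and argument names are hypothetical.

```python
def fap_score(ranked_passages, relevant_passages):
    """Sketch of fractional AP (Equations 7.3 and 7.4).

    ranked_passages: list of sets of IPU identifiers, in ranked order.
    relevant_passages: list of sets of IPU identifiers from the ground truth.
    """
    R = len(relevant_passages)
    wP, fap = [], 0.0
    for k, passage in enumerate(ranked_passages, start=1):
        # Within-passage precision/recall against the best-matching relevant passage
        wP_k = max((len(passage & rel) / len(passage) for rel in relevant_passages),
                   default=0.0)
        wR_k = max((len(passage & rel) / len(rel) for rel in relevant_passages),
                   default=0.0)
        wP.append(wP_k)
        g_k = wR_k * sum(wP)   # gain at rank k
        d_k = 1.0 / k          # discount at rank k
        fap += g_k * d_k
    return fap / R if R else 0.0
```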

General character and time based measures

The fractional AP (fAP) measure described in the previous section can be generalised to the case where IPUs are not available. A way to achieve this is to consider units other than utterances or sentences to use as a common segmentation of the content. In the case of text documents, individual characters within a document may serve as minimal retrieval units, while for speech collections short 1-second fragments of speech could be used. A number of evaluation measures have been developed along these lines, which attempt to compute precision and recall estimates based on these more granular retrieval units for any arbitrary spans of content (Wade and Allan, 2005; Kamps et al., 2007; Eskevich et al., 2012c). Various evaluation measures based on character precision and recall were proposed for text passage retrieval in the context of the TREC HARD tasks (Allan, 2004). A highly influential measure proposed by Wade and Allan (2005) is termed character average precision (cAP). Variations of this measure have also been used at the INEX Ad-hoc Focused task to evaluate passage retrieval effectiveness (Kamps et al., 2007). One such variation is character precision at a passage-rank cut-off $k$, which can be instantiated as

shown in Equation 7.5,

$$\mathcal{N} = 1, \qquad g_k = \sum_{i=1}^{k} r_i, \qquad d_k = \left(\sum_{i=1}^{k} l_i\right)^{-1} \qquad (7.5)$$
where $r_k$ is the length of the relevant text contained in the passage retrieved at rank $k$ and $l_k$ is the length of that passage. In all cases, the length of a passage is measured by the number of characters it contains. In this particular form, the discount function $d_k$ can be thought of as representing the user’s interest in continuing to read as the amount of read text increases. Alternatively, if the inverse of the discount factor is considered, then character precision measures effort in terms of the amount of text that the user would need to read. If character precision is computed for a fixed $k$, then the measure would tend to favour methods which return a large number of short passages containing some relevant content, instead of a single passage containing them all. In fact, as shown by Wade and Allan (2005), for a ranked list of passages, the value of the character precision at $k$ could be artificially increased by splitting each passage in half, and forming a new ranking of passages based on these passage halves. Character average precision (cAP) can be obtained by calculating character precision at ranks at which there is a change in gain. This can be instantiated as shown in Equation 7.6, where $R$ denotes the total number of passages known to be relevant to the query, and $r_k$ and $l_k$ are defined as in Equation 7.5.

$$\mathcal{N} = R, \qquad g_k = \sum_{i=1}^{k} r_i u_i, \qquad d_k = \left(\sum_{i=1}^{k} l_i\right)^{-1}, \qquad u_k = \begin{cases} 1 & \text{if } r_k > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (7.6)$$

Instead of averaging at each passage in the ranking, an alternative is to average the character precision values at each character retrieved. This can be achieved by applying a transformation similar to that of uAP to the ranking of text passages. This transformation converts the original ranking into a flat ranking of individual characters over which AP can be calculated. Instead of considering $R$ as the total number of known relevant passages for the query, a more appropriate recall base in this case may be the total number of characters known to be relevant, that is, the sum of the lengths of relevant passages from the ground truth. Note that under the probabilistic interpretation of AP, if cAP is averaged over passage rankings, it can be interpreted as the expected average precision if the user decides to stop their search at any relevant passage with uniform probability. Instead, if the average is performed over character rankings, a similar interpretation would apply to individual character positions within the returned passages. This would effectively consider the possibility of a user stopping the search in the middle of a relevant passage, which seems less intuitively reasonable. Eskevich et al. (2012c) adapted cAP to consider temporal minimal units instead of character-level units. This family of effectiveness measures was mainly developed for the

evaluation of SCR systems, where horizontal browsing effort would be more appropriately quantified as auditioning time instead of read characters. Segment precision (sP) at rank $k$ can be defined as in Equation 7.5, where $r_k$ and $l_k$ denote, respectively, the number of seconds of speech content in the $k$-th passage that are relevant and the total number of seconds of content included in passage $k$. Average segment precision (sAP) is then calculated over the ranking of passages, by averaging sP at every position in the ranking where a retrieved passage containing some relevant content is found. Thus, sAP can be instantiated under the gain-discount framework as shown in Equation 7.6, by taking $R$ as the number of passages returned that contain some relevant material. As with cAP and uAP, an alternative way to calculate sAP is to calculate standard AP over a flattened version of the original ranking of passages, by considering $R$ in this case as the total number of seconds known to be relevant to the query. In this respect, Eskevich et al. (2012c) argue that averaging over passage rankings is preferable since this is more akin to the vertical browsing process that users will engage with. However, averaging over passage rankings without properly accounting for the amount of relevant material returned at each rank may in general favour retrieval methods that return shorter passages, as these are more likely to maximise sP while only covering a minimal part of a relevant region. A common limitation of two-sided evaluation measures is that they neglect the position within the relevant region at which a returned passage may occur. As long as the levels of within-passage precision and recall remain the same, two-sided measures will assign the same amount of reward to a passage overlapping with the beginning of a relevant region as to one overlapping with its ending. It is reasonable to think that users will prefer the first type of passage over the latter, as passages pointing more closely to the onsets of the relevant material are likely to require less browsing effort from the user. To better account for horizontal browsing effort, Eskevich et al. (2012c) proposed a two-sided measure based on sAP which penalises a system for returning entry points that are too distant from the onset of the relevant material. This measure, termed average segment distance-weighted precision (dwsAP), augments sAP with a distance-based reward function similar to the one used in gAP. Because dwsAP implements the same mechanism as gAP for penalising entry points, it tends to underestimate ranking effectiveness in favour of entry-point accuracy, thus favouring retrieval methods which return a large number of short passages around a potentially relevant region.

7.2.3 Browsing and interaction oriented measures

The majority of the evaluation measures proposed for evaluating retrieval effectiveness in SCR, passage, and XML retrieval tasks consist of relatively simple extensions of standard AP, or are designed to transform the original ranking of results into an alternative representation on which traditional IR evaluation measures can be applied. A recent trend in IR research is to move away from these system-oriented measures towards user-centric frameworks, based on models of browsing behaviour, which can better account for

additional factors that affect user satisfaction. In this paradigm, the design of evaluation measures is grounded in hypotheses and observations about how users may interact and navigate through a list of search results. Much of this effort has been devoted to providing new interpretations of traditional IR measures under the view of different models of user behaviour (Zhang et al., 2010; Moffat and Zobel, 2008; Carterette, 2011; Smucker and Clarke, 2012; Jiang and Allan, 2016). At the same time, novel effectiveness measures have been proposed from this perspective on the evaluation of search systems. This section reviews some of these measures.

Trail-texts and the U-measure

Sakai and Dou (2013) developed an evaluation framework that can be used to obtain estimates of user satisfaction across a large range of IR tasks, where the results to be evaluated can be comprised of any arbitrary piece of text, from full documents to multi-document summaries or aggregated infoboxes. The basic idea is to construct a “trail-text” based on all the text that has been read by the user during a search session. A trail-text thus contains the text accessed by the user while interacting with the retrieval system, that is, the concatenation of all text read, in the order in which it was inspected by the user. Ideally, trail-texts should be obtained by recording the actions of one or more users while interacting with the retrieval system under evaluation, either through an eye-tracking device or by analysing click logs. The advantage of considering multiple trail-texts instead of one is that this permits estimating how the performance of a system may vary across users. In the absence of real user data, Sakai and Dou (2013) suggest carrying out user simulations to obtain a set of simulated trail-texts. These simulations can be based on either deterministic or probabilistic models of user behaviour (Carterette et al., 2011). In the work of Sakai and Dou (2013), the authors generated trail-texts deterministically, by concatenating the text of all search results in a result list in their ranking-induced order. In the process of generating a trail-text from a ranked list of search results, Sakai and Dou applied different rules to the individual results from the list depending on whether or not they could lead the user to some relevant information. In particular, results pointing to relevant content were appended to the trail-text preceded by the text of their associated snippets, while those not pointing to any relevant information were discarded and only their snippets appended to the trail-text. Note that, in practice, this procedure for creating trail-texts is similar to the ranking transformation described in Section 7.2.2 for two-sided measures, whereby the ranking of elements is flattened into a linear vertical ranking of minimal information units. The difference in this case is purely conceptual: trail-texts can be seen as the path that a user decides to follow while interacting with the retrieval system, rather than the stream of information that the search system presents to the user. Different evaluation measures can then be defined over the contents of a trail-text in order to measure various aspects of user satisfaction. For instance, the U-measure proposed by Sakai and Dou (2013) is defined under the gain-discount framework. In this

measure, gain and decay are functions of character positions within a trail-text. The gain factor, $g_k$, for the $k$-th position within the trail-text is considered as 0 if such a position belongs to a passage that is not deemed relevant. Otherwise $g_k$ acquires a fixed value $v_l$ that depends on the length ($l$) of the relevant passage associated with the $k$-th position.

The discounting factor $d_k$ adopted in the U-measure consists of the linear decay function shown in Equation 7.7,
$$d_k = \max\!\left(1 - \frac{k}{L},\ 0\right) \qquad (7.7)$$
where $L$ represents the maximum number of characters of text the user is willing to read in a search session. Note that this function decays linearly as the user navigates the individual characters of the trail-text.
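A minimal sketch of the U-measure over a trail-text is shown below. For simplicity, it assigns a constant gain to every relevant character position rather than the length-dependent value $v_l$, and the default reading budget $L$ is an assumed value.

```python
def u_measure(trail_relevance, gain_value=1.0, L=10000):
    """Sketch of the U-measure with the linear decay of Equation 7.7.

    trail_relevance: per-character flags for the trail-text, in reading order
    (True if the character belongs to a relevant passage).
    gain_value: fixed gain assigned to relevant positions (a simplification of v_l).
    L: maximum number of characters the user is willing to read.
    """
    score = 0.0
    for k, is_relevant in enumerate(trail_relevance, start=1):
        d_k = max(1.0 - k / L, 0.0)                 # linear decay over positions
        g_k = gain_value if is_relevant else 0.0
        score += g_k * d_k
    return score
```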

Time-traces and improved SCR measures

The counterpart of a trail-text in a temporal medium, such as a video or speech recording, can be termed a “time-trace” (Clarke and Smucker, 2014). In the context of an SCR application, a time-trace can be constructed by appending the fragments of speech content the user has listened to while navigating through a ranked list of SCR results. Similarly to the text retrieval scenario, a time-trace can be generated either by recording real users interacting with an SCR system, or by running user simulations. The majority of the evaluation measures presented in Sections 7.2.1 and 7.2.2 can, in fact, be interpreted as assuming one particular deterministic type of user behaviour, which results in a specific form of time-trace. For instance, gAP can be thought of as a time-trace creation process that appends the non-relevant “gap” existing between a retrieved entry point and the beginning of a relevant region to the time-trace, followed by the relevant content that comes next. The measures oAP, uAP, fAP, and sAP can be thought of as constructing a time-trace by flattening the passages in the ranked list, in a procedure similar to that adopted in the U-measure. Once a user behaviour model is assumed and a time-trace created based on it, these effectiveness measures frequently compute AP at specific positions within the time-trace. Within this user-oriented paradigm, existing evaluation measures can be adapted based on different expectations about how users may produce a time-trace from a given ranked list of SCR results. One possible modification that can be made to the majority of the two-sided measures listed above is to model the fact that, once found, users will be willing to continue listening to the relevant material, as well as to any extra adjacent relevant span occurring within a certain range from the end of this relevant material. Thus, instead of considering that users will only inspect content that lies within the boundaries of a retrieved passage, effectiveness measures could also be designed to account for the possible actions that users may take while seeking relevant material, given a starting and ending offset as a suggested region for inspection. A possible deterministic model of user behaviour that can capture these characteristics

may be defined as follows. A user selects a search result and starts listening to the audio material from the entry point suggested by the SCR system. If the user finds any relevant information before reaching the ending point proposed by the system, then the user is assumed to continue listening to the audio until the end of such relevant material, thus deriving gain from every second of relevant speech found. In cases where the entry point is already contained within a relevant region, the user is assumed to seek backwards in time until reaching the beginning of such relevant material, and subsequently start deriving gain from that point moving forward. If the region suggested by the system does not lead the user to any relevant content, then the user is assumed to invest an amount of time proportional to the length of the complete audio passage suggested, without acquiring any additional gain. As an evaluation measure inspired by sAP, we proposed average interpolated segment precision (AiSP) (Racca and Jones, 2015a), based on this deterministic, albeit more realistic, user model of browsing behaviour. This measure was put into practice at the MediaEval SAVA (Eskevich et al., 2015) and in the TRECVid Hyperlinking (Awad et al., 2016, 2017) benchmarks.

Time-biased gain (TBG) and time well spent (TWS)

Time-biased gain (TBG) and time well spent (TWS) are a family of effectiveness measures proposed in (Smucker and Clarke, 2012; Clarke and Smucker, 2014), which seek to account for the time a user needs to invest in identifying, assessing, and extracting relevant material from a ranked list of search results produced by an IR system. The key observation made by the authors is that most traditional evaluation measures for IR are unrealistic, in the sense that they assume users will traverse a ranked list from top to bottom, spending an equal amount of time per result, when in reality the time required to derive gain from each document varies according to several factors. The factors identified that may influence the time a user spends on a retrieved document include the length of the document, the quality of its surrogate, its relevance status, or whether the document is a near-duplicate of some other document already seen by the user. The time-biased gain (TBG) framework attempts to account for these additional aspects in the estimation of retrieval effectiveness. Instead of incorporating these factors as an additional set of rules to shape the value of a traditional effectiveness measure, Smucker and Clarke (2012) proposed to measure their impact in terms of the extra time they would add to the information seeking process carried out by the user. TBG is defined in terms of the gain-discount framework. The cumulative gain derived by the user after having invested an amount of time equal to $t$ in inspecting a number of search results is denoted by $G(t)$. In a similar fashion, the discount factor, $D(t)$, is defined in terms of time and denotes the probability that the user will continue seeking information after having invested $t$ units of time. In situations where gain cannot be determined as a continuous function of time, for instance in document retrieval tasks with

binary judgements, time-biased gain is defined as shown in Equation 7.8,

$$\frac{1}{\mathcal{N}} \sum_{k=1}^{\infty} g_k\, D(T(k)) \qquad (7.8)$$
where $k$ ranges across document ranks as opposed to time, $g_k$ is the gain associated with the $k$-th result in the ranking, and $T(k)$ is the expected time at which the user will reach the $k$-th document. In order to determine an appropriate value for $T(k)$, Smucker and Clarke (2012) conducted a user study to gather data from real user interactions with a document retrieval system. Three variables were then used to approximate $T(k)$: (i) the time a user spends reviewing a text surrogate; (ii) the time spent assessing a document of length $l_i$; and (iii) the probability that the user will click on a document given its relevance status ($c_i$). Analysis of the interaction data resulted in the values shown in Equation 7.9 for the calibration of $T(k)$ as a function of the rank $k$, measured in seconds.

$$T(k) = 4.4 + \sum_{i=1}^{k-1} (0.018\, l_i + 7.8)\, c_i \qquad (7.9)$$

As for the remaining components of TBG in Equation 7.8, Smucker and Clarke set the normalisation factor to $\mathcal{N} = 1$, and the discount factor to an exponential decay function of time, proportional to $D(T(k)) = \exp(-T(k))$. The selection of an exponential decay function for modelling user continuation probabilities has been supported by multiple independent studies based on the analysis of query logs from commercial search engines (Moffat and Zobel, 2008; Smucker and Clarke, 2012). In follow-up work, Clarke and Smucker (2014) developed the time well spent (TWS) measure. Similar to TBG, TWS calculates gain and decay with respect to the time the user has to invest in looking for relevant information in a ranked list of results. For modelling decay, the authors used a log-normal probability distribution, with parameters estimated based on query log data. As opposed to TBG, the gain function $G(t)$ of TWS is parametrised by time instead of rank positions. Gain is then based on the time well spent by the user, calculated as the ratio between the time the user spent on relevant content and the total time invested by the user in the search session. In other words, gain is defined as a benefit-effort ratio, where benefit is measured as the time spent consuming relevant material and effort as the total time spent interacting with both relevant and irrelevant content. In practice, computing gain in TWS for a ranked list of results requires the adoption of a certain model of user behaviour that can be used to generate a time-trace of consumed material. The cumulative gain $G(t)$ can then be calculated for a time-trace up to time $t$, by summing the number of seconds associated with relevant material in the trace up to $t$ and dividing by $t$. In the absence of real user interaction data, a time-trace may be generated

under the assumption of a deterministic model of user behaviour, such as those described for the U-measure and the AiSP measure, or based on stochastic or probabilistic user simulations (Carterette et al., 2011; Clarke and Smucker, 2014). In their development of TWS, Clarke and Smucker propose using stochastic simulations. In general terms, a simulated user can be described by parameters $\theta$ drawn from a distribution over a parameter space $\Theta$. One such parameter in $\theta$, denoted by $t_{max}$, may correspond to the maximum amount of time a user is willing to invest in the search.

Given a sample value of $t_{max}$, a user interaction of this length can then be simulated by generating a trace of the user’s activities while interacting with the ranked list of results under evaluation. This procedure can be repeated $m$ times for $m$ different user models sampled from $\Theta$ to generate $m$ time-traces upon which TWS can be calculated. Clarke and Smucker (2014) adopt this procedure to obtain a distribution of TWS values for a ranked list of results, which they then use to measure “effect sizes” and user variance in TWS. A clear advantage over considering only one user model is that multiple models permit measuring most of the impact that different user behaviours and browsing strategies may produce over the output of an effectiveness measure. Instead, evaluation measures based on a single model of browsing behaviour are exposed to different types of biases which may be introduced by the underlying assumptions made by each model.
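A sketch of TBG as defined by Equations 7.8 and 7.9 is given below. The per-rank gains, document lengths, and click probabilities are assumed to be available as parallel lists, and the exponential half-life used for the decay is an assumed calibration value rather than one taken from the text.

```python
import math

def time_to_rank(k, lengths, clicks):
    """Expected time T(k), in seconds, to reach rank k (Equation 7.9).

    lengths: document lengths l_i in characters; clicks: click probabilities c_i
    determined by each document's relevance status.
    """
    return 4.4 + sum((0.018 * lengths[i] + 7.8) * clicks[i] for i in range(k - 1))

def tbg_score(gains, lengths, clicks, half_life=224.0):
    """Sketch of time-biased gain (Equation 7.8) with exponential decay
    D(t) = exp(-t * ln 2 / h); the half-life h is an assumed parameter."""
    return sum(g * math.exp(-time_to_rank(k, lengths, clicks) * math.log(2.0) / half_life)
               for k, g in enumerate(gains, start=1))
```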

7.3 A new user-centric evaluation framework for SCR

This section describes the development of a novel user-centric framework for evaluating SCR results in unstructured collections that tackles many of the limitations of current evaluation measures. Unlike the majority of measures used in the past, which can only capture a single dimension of browsing effort, the framework described in this section can account for both vertical and horizontal effort in an effective and balanced manner, without overestimating the importance that one dimension has over the other. In addition, the proposed framework is based on an explicit probabilistic model of horizontal browsing and navigation of speech, which permits consideration of the effects of VCR-like playback controls and of different presentation layouts more easily within the estimation of retrieval effectiveness. This section begins by describing the horizontal browsing model, and then moves on to augmenting the model to capture vertical browsing behaviour.

7.3.1 Horizontal browsing model

When users select an item from the ranked list produced by an SCR system, they are provided with a playback interface with which they can start playing the audio track from the point in time suggested by the system. At this stage, the main objective of users is to determine whether any relevant material can be found by navigating the audio from this starting point. It is assumed that users will only derive gain from assessing a search result if the listen-in point suggested by the system leads them to some relevant content,

and if they are capable of locating such material within the document within a reasonable amount of time. If the SCR system also produces a suggested ending point, then users may exploit this information if it is found to be useful. However, in order to keep the development of this browsing model simple, we assume one-sided search results from now on and only assess the quality of the entry points retrieved by a hypothetical SCR system. In order to accelerate the horizontal searching process, users can avail of the more advanced playback controls offered in most VCR-like interfaces. As discussed in Section 7.1.4, the most common browsing strategy adopted by users consists of a combination of straight playback and forward-seek operations. Alternatively, users may begin their search backwards in time, by moving the seeker bar a few seconds back and commencing playback from that point on. If this initial exploration does not lead users to any relevant material, they may continue their search by exploring locations they had not examined before around the recommended entry point, either in the forward or backward direction. Users are assumed to continue with this information seeking process until: (i) they encounter some relevant material; or (ii) they decide to abandon the search because the material in this document is not worth their time. Note that a user may reach state (ii) even if some relevant material could have been reached had they decided to invest additional time in the search. Most of these requisites can be accommodated within a simple probabilistic model that disentangles the information seeking process just described into a sequence of states. Figure 7.3 shows the proposed model as a probabilistic finite state automaton with states

$\mathcal{Q} = \{S, F, B, CB, CF, E\}$ and state transitions given by probabilities $p_f$, $p_{sf}$, $p_{sb}$, $p_{cf}$, and $p_{cb}$. The state $S$ denotes the starting state and represents the point at which a user must choose between a forward ($F$) or a backward ($B$) seeking strategy to commence the search.

The user then moves on to the forward state $F$ with probability $p_f$, or on to the backward state $B$ with probability $1 - p_f$. While in state $F$, the user is assumed to have started the playback of the audio at normal speed in order to begin listening to the material. The user is then assumed to listen to the next $i$ time units of audio with probability $p_{sf}^{i}$. This is represented in Figure 7.3 by the self-transition in the $F$ state with the $p_{sf}$ label. After listening to $i$ time units, the user may decide to stop seeking forward with probability

$1 - p_{sf}$, and move on to state $CB$. This state stands for “backward continuation” and represents the possibility that a user may want to continue searching for relevant content by exploring the content located before the entry point, in the backwards direction, after failing to find any relevant material in the forward direction. From the $CB$ state, the user can then move on to the backward state $B$ with probability $p_{cb}$ or abandon the search and move on to state $E$ with complementary probability. States $B$ and $CF$ are analogous to states $F$ and $CB$ for when the user commences seeking relevant information in the backward direction and optionally continues searching in the forward direction. Under this model of browsing behaviour, an individual user can be characterised by a vector of probabilities $u = (p_f, p_{sf}, p_{sb}, p_{cf}, p_{cb})$. A different selection of any of these

Figure 7.3: Proposed model of horizontal browsing. The states $S$ and $E$ are the starting and ending states respectively, while $F$, $B$, $CB$, and $CF$ correspond to forward browsing, backward browsing, backward continuation, and forward continuation respectively.


parameters would correspond to a different user model. For instance, $p_f = 1.0$, $p_{sf} = 0.99$, and $p_{cb} = 0.0$ results in a user who would never search backwards. Ideally, these probabilities could be learned from analysing real user interactions or by mining search logs. In the absence of this information, one can opt for user simulations by synthesising a number of independent users, as done by Carterette et al. (2011) and Clarke and Smucker (2014).
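The user parameter vector can be represented directly in code; the sketch below defines a simulated user and draws a small population of hypothetical users, with sampling ranges that are purely illustrative assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class BrowsingUser:
    """A simulated user of the horizontal browsing model, u = (p_f, p_sf, p_sb, p_cf, p_cb)."""
    p_f: float   # probability of starting with a forward-seeking strategy
    p_sf: float  # forward persistence (continuation) probability
    p_sb: float  # backward persistence (continuation) probability
    p_cf: float  # probability of continuing forwards after failing backwards
    p_cb: float  # probability of continuing backwards after failing forwards

def sample_users(m, seed=0):
    """Draw m hypothetical users; the ranges below are illustrative assumptions."""
    rng = random.Random(seed)
    return [BrowsingUser(p_f=rng.uniform(0.5, 1.0),
                         p_sf=rng.uniform(0.90, 0.99),
                         p_sb=rng.uniform(0.90, 0.99),
                         p_cf=rng.uniform(0.0, 1.0),
                         p_cb=rng.uniform(0.0, 1.0))
            for _ in range(m)]
```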

Forward and backward browsing

The forward and backward seeking states $F$ and $B$ represent the “core” components of this user model. Note that each of these states induces an independent searching process, where the number of time units the user consumes before stopping, $I$, is a random variable assumed to follow a geometric distribution with continuation or persistence probability given by $p_{sf}$ and $p_{sb}$ for states $F$ and $B$ respectively. The choice of a geometric distribution to model horizontal browsing is inspired by the model underlying rank-biased precision (RBP) (Moffat and Zobel, 2008), described in Section 2.1.3. Recall that RBP adopts a discount factor equal to the probability density function (PDF) of a geometric distribution, $d_k = p^{k-1}(1 - p)$, where $p$ acts as the persistence or continuation probability of the user at a particular rank position $k > 0$. The user model underlying states $F$ and $B$ is similar to that of RBP, presented in Figure 2.1, although adapted to model horizontal browsing in our proposed framework. An important difference between RBP and the states $F$ and $B$ in our user model is that RBP assumes that there is an infinite number of documents in the ranking which the user could explore. Thus, in RBP, the probability mass of the geometric distribution spreads over an infinite number of ranking positions, starting from rank $k = 1$ up to infinity. Although rankings are always finite in practice, this assumption is still reasonable for

web search tasks, where document collections are extremely large. Because documents are of finite length, the assumption made in RBP is less reasonable for the modelling of horizontal browsing. This is because users will likely have a limited amount of content to inspect within an individual document. Furthermore, users may be given an entry point that is located at the far right (left) end of a document, so that there is no extra content for them to inspect after (before) such a point. In these circumstances users can save some time by stopping their search early, and this should therefore be taken into account in our user model. To properly account for these factors in the $F$ and $B$ states, the browsing process is modelled with a truncated geometric distribution which spreads the total probability mass across a closed interval $[a, b]$. The PDF of a truncated geometric distribution for the interval $[a, b]$ with support $i \geq 0$ can be written as shown in Equation 7.10.

Figure 7.4: Probability density functions of geometric and truncated ($[0, 30]$) geometric distributions for various values of $p$. (a) Geometric distribution. (b) Truncated geometric distribution.

$$P(I = i) = Tr(i; p, a, b) = \begin{cases} \dfrac{p^{i}(1 - p)}{p^{a} - p^{b+1}} & \text{if } a \leq i \leq b \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (7.10)$$

Figures 7.4a and 7.4b show respectively the PDFs of a geometric distribution and its truncated counterpart for various values of the continuation probability $p$. Truncating the distribution at $b = 30$ is more appropriate if the entry point from which the user begins inspecting the document lies within 30 units of the end of such a document.
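Equation 7.10 translates directly into code; the following sketch returns the probability mass at a given offset and can be checked to sum to one over the interval $[a, b]$.

```python
def truncated_geometric_pdf(i, p, a, b):
    """PDF of a geometric distribution truncated to the interval [a, b] (Equation 7.10)."""
    if a <= i <= b:
        return (p ** i) * (1.0 - p) / (p ** a - p ** (b + 1))
    return 0.0

# Sanity check: the mass over [0, 30] sums to 1 for any 0 < p < 1.
assert abs(sum(truncated_geometric_pdf(i, 0.95, 0, 30) for i in range(0, 31)) - 1.0) < 1e-9
```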

Deriving gain from horizontal browsing

Under the proposed model of horizontal browsing shown in Figure 7.3, one can calculate probabilities for different events of interest. One such event is that of a user finding the onset of some relevant region within a document, given an entry point suggested by an SCR system. In this respect, it is assumed that a user is satisfied upon encountering the starting point of a relevant region, rather than just its middle or ending parts. Our

evaluation framework then defines the gain a user may obtain by inspecting an entry point based on the probability that they will actually find some relevant information from the examination of this entry point. Thus, instead of considering gain as something that can be either acquired or not acquired by the user, our framework measures potential gain probabilistically. Users are then more or less likely to find relevant content depending on their willingness to spend time browsing irrelevant material and the quality of the entry point suggested by the system. Let the entry point under evaluation be located at a central position 0 and the beginning of some relevant region, if any, located at some distance $r \geq 0$ from the entry point. An additional constraint for $r$ is that it must be a valid offset within the document, that is, it cannot go beyond the boundaries of the document. This can be specified as $r \leq n$, with $n$ being the ending offset of the document relative to the entry point. For instance, Figure 7.4b shows an example where the entry point is located within $n = 30$ time units of the end of the document. According to our horizontal browsing model, the probability of a user finding $r$ while being in state $F$, by starting from the entry point suggested by the system, can be calculated based on the number of time units $I_F = i$ the user must listen to before reaching position $r$. Since $I_F$ is geometrically distributed with probability $p = p_{sf}$, the probability that the user will reach position $r$ can be calculated as shown in Equation 7.11.

$$
\begin{aligned}
PF_F(r) &= P(I_F \geq r) \\
&= \sum_{i=r}^{n} P(I_F = i) \\
&= \sum_{i=r}^{n} Tr(i; p_{sf}, 0, n) \\
&= \frac{1 - p_{sf}}{1 - p_{sf}^{\,n+1}} \sum_{i=r}^{n} p_{sf}^{\,i} \\
&= \frac{(1 - p_{sf})\,(p_{sf}^{\,r} - p_{sf}^{\,n+1})}{(1 - p_{sf}^{\,n+1})\,(1 - p_{sf})} \\
&= \frac{p_{sf}^{\,r} - p_{sf}^{\,n+1}}{1 - p_{sf}^{\,n+1}}
\end{aligned} \qquad (7.11)
$$
Thus, the probability of a user finding $r$ from state $F$ while browsing forwards in time, $PF_F(r)$, is the sum of the probabilities corresponding to all possible events in which a user listens to at least the first $r$ time units of speech material before transitioning onto state

$CB$. For example, for the case of Figure 7.4b, $PF_F(20)$ gives 0.087, 0.194, and 0.269 for $p_{sf}$ equal to 0.90, 0.95, and 0.975 respectively. Thus, users with a higher persistence probability will have greater chances of reaching position $r$ in the document and of successfully locating the onset of the relevant material. Under this definition of $PF_F(r)$, the amount $1 - PF_F(r)$ denotes the probability of the user not finding $r$ while seeking forward. Additionally, for

relevant onsets preceding the entry point, $r \leq 0$, it is assumed that $PF_F(r) = 0$. The analogue of $PF_F(r)$ with $r \geq 0$ for when users commence their search by browsing backwards, from state $B$, is denoted as $PF_B(r')$ with $r' \leq 0$, that is, with the relevant onset point having a negative offset relative to the entry point. This probability can be calculated as shown in Equation 7.11, by taking $r = -r'$ and replacing $p_{sf}$ by $p_{sb}$, and $n$ by the number of positions that exist between the entry point and the beginning of the document. If the document does not contain any relevant information, $r$ is taken to be infinite so that $PF_F(r) = PF_B(r') = 0$. Similarly to the forward browsing case, the value $1 - PF_B(r')$ denotes the probability of the user not finding the relevant onset $r' \leq 0$ while seeking backwards. Also, if the relevant onset $r'$ is located after the entry point so that $r' > 0$, then $PF_B(r') = 0$. So far, the probabilities of finding relevant content when the user is in states $F$ or $B$ have been calculated in isolation from the rest of our horizontal browsing model. In order to calculate the probability of a user finding $r$ by considering the complete model from Figure 7.3, the remaining states and transition probabilities need to be taken into account. Note that, according to our model, a user may miss some relevant information contained in the document during the forward (backward) seek but may still be able to find such a region if they move on to state $CB$ ($CF$) and then continue seeking in the opposite direction. The probability of a user finding a relevant onset $r$ located after the entry point ($r \geq 0$) under the full model, $PF_a(r)$, can be calculated by summing the probabilities associated with all paths in the graph which contain at least $r$ transitions from state $F$ to itself. It can be shown that such a probability can be calculated as shown in Equation 7.12.

P_{Fa}(r) = p_f \, P(\text{finding } r \text{ from } F) + (1 - p_f) \, P(\text{finding } r \text{ from } B)
          = p_f \, P_{FF}(r) + (1 - p_f) \, p_{cf} \, P_{FF}(r)
          = P_{FF}(r) \, [\, p_f + (1 - p_f) \, p_{cf} \,]        (7.12)

Analogously, the probability of finding a relevant onset r ≤ 0 appearing before the entry point under the full model is given by P_{Fb}(r) = P_{FB}(r) \, [(1 - p_f) + p_f \, p_{cb}].

The probability of a user finding any relevant onset r from the start state S of the model can then be expressed in terms of PFa and PFb as shown in Equation 7.13.

P_F(r) = \begin{cases} P_{Fa}(r) & \text{if } r \geq 0 \\ P_{Fb}(r) & \text{otherwise} \end{cases}        (7.13)

With the above definitions, the gain a user obtains from assessing an individual retrieval result in our evaluation framework is given by PF(r), that is, the probability that the user will locate some relevant material by following the entry point suggested by the SCR system. Defined this way, gain has a direct interpretation: a gain equal to 0.60 for a

Figure 7.5: Probability of finding the onset of some relevant content r (PF(r)) when starting from an entry point located at 0, for a hypothetical user.

particular user, entry point, and document containing some relevant material signifies that this user has a 60% chance of finding the onset of that relevant content. Note that under this framework, users will always derive some gain from an entry point if the document pointed to by this entry point contains at least one relevant region. Such gain will be high if the entry point is likely to help the user locate the relevant information and low otherwise. Figure 7.5 shows a plot of PF(r) for a hypothetical user with pf = 0.8, psf = 0.975, psb = 0.96, pcf = 0.8, and pcb = 0.5. This user would attain a gain close to 1.0 if the entry point returned by the system is within a couple of time units of a relevant region. Such a user would still be able to find the onset of a hypothetical relevant region located before the entry point, but with lower probability than if it were located after.
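To make the horizontal-browsing gain concrete, the following is a minimal Python sketch of Equations 7.11–7.13. It is illustrative rather than the implementation used in the experiments; the function names and the user-parameter dictionary are assumptions made here, and the example reproduces the PFF(20) values quoted above for Figure 7.4b.

```python
# Sketch of the horizontal browsing gain (Equations 7.11-7.13). Illustrative only.

def p_ff(r, n, p):
    """Probability of reaching offset r while seeking in one direction, where the
    document extends n time units in that direction (Eq. 7.11 with parameter p)."""
    if r < 0 or r > n:
        return 0.0
    return (p ** r - p ** (n + 1)) / (1.0 - p ** (n + 1))

def p_find(r, a, b, u):
    """Probability of finding a relevant onset at offset r within document boundaries
    [a, b] (a <= 0 <= b) relative to the entry point (Eqs. 7.12 and 7.13).
    u holds the user parameters pf, psf, psb, pcf, pcb."""
    if r >= 0:  # onset after the entry point: reached by forward seeking
        return p_ff(r, b, u['psf']) * (u['pf'] + (1.0 - u['pf']) * u['pcf'])
    # onset before the entry point: reached by backward seeking
    return p_ff(-r, -a, u['psb']) * ((1.0 - u['pf']) + u['pf'] * u['pcb'])

if __name__ == '__main__':
    # Figure 7.4b example: P_FF(20) with n = 30 for psf in {0.90, 0.95, 0.975}
    print([round(p_ff(20, 30, p), 3) for p in (0.90, 0.95, 0.975)])  # [0.087, 0.194, 0.269]
```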

Estimating horizontal browsing effort

The previous section discussed how gain can be calculated for a particular entry point and document according to our user model of horizontal browsing. Under this definition of gain, highly persistent users, who are willing to spend large amounts of time browsing the contents of a document, are more likely to find relevant content than less patient users. In fact, a user with persistence probabilities equal to 1.0 will always be able to find any relevant content that a document may contain, independently of how far this content lies from the entry point where the user starts browsing; such maximally persistent users locate a relevant region with probability 1.0. Within our probabilistic model, the difference between persistent and impatient users lies in the amount of effort they are willing to invest in inspecting a search result. Extremely persistent users will always be able to find every single instance of relevant information at the cost of increased effort, while less patient users will make more effective use of their time by investing the least amount of effort possible to accrue the maximum amount of

gain they can. In order to account for both effort and gain, the following describes how horizontal effort can be estimated within our probabilistic model. As with gain, effort is estimated in a probabilistic manner, as the amount of effort a user is expected to invest while inspecting a search result. As in the TBG framework, our model represents effort based on the number of time units of content a user is expected to examine during horizontal browsing. Let I be a random variable representing the number of time units a user listens to before finishing the search. Expected effort can be calculated as the expected value that I will take under the constraints imposed by our model and the characteristics of the user. For a given document with boundaries [a, b], where a ≤ 0 and b ≥ 0 are given as relative offsets to some entry point produced by an SCR system, the expected effort can be calculated as the expectation E[I] shown in Equation 7.14,

E[I] = \sum_{i=0}^{b-a} i \, P(I = i)        (7.14)

where I = i denotes the event that a user consumes exactly i time units throughout the entire search process. Note that the summation in Equation 7.14 runs from i = 0 up to the maximum number of time units a user could consume in a document (b − a). Conveniently, the event I = i can be expressed in terms of the event I_F, representing a user who consumes exactly i units when starting from state F, and I_B for the analogous case when the user starts from state B. Based on these two events, the expectation E[I] bounded to [a, b] can be rewritten as in Equation 7.15.

E[I] = p_f \, E[I_F] + (1 - p_f) \, E[I_B]        (7.15)
     = p_f \sum_{i=0}^{b-a} i \, P(I_F = i) + (1 - p_f) \sum_{i=0}^{b-a} i \, P(I_B = i)

From within state F, users can consume I_F = i time units by following two possible paths in the state transition system from Figure 7.3. For a hypothetical document with infinite boundaries, users can consume j ≤ i units browsing forward and then move on to the backward browsing state and consume i − j units backwards. Analogously, if starting from state B, users can consume j units backwards first, and then the remaining i − j units forward. In both cases, the total units consumed are j + (i − j) = i. The probability P(I_F = i) for an unbounded document can then be obtained by summing the probabilities of all possible paths that connect states F with E, in which the total sum of units consumed at states F or B is exactly equal to I_F = i. The terms in this summation

can be re-arranged into Equation 7.16,

P(I_F = i) = p_{sf}^{i} (1 - p_{sf}) (1 - p_{cb})
           + \left[ \sum_{j=0}^{i} p_{sf}^{j} \, p_{sb}^{i-j} \right] (1 - p_{sf}) (1 - p_{sb}) \, p_{cb}        (7.16)

which expresses P(I_F = i) as two main terms, one corresponding to paths that transition from CB to E (no backward continuation), and another term for paths from CB to E that pass through state B. While the left term in the equation captures cases where a user consumes i units by always browsing forward, the right term does this for cases where a user consumes j units forward and i − j units backwards.

For a bounded document, users can only consume content up to the boundaries of the document. In particular, for a document with boundaries [a, b], with a ≤ 0 and b ≥ 0 expressed as relative positions with respect to an entry point located at the origin 0, users can consume at most −a and b units in the backward and forward directions respectively.

Despite being reasonably simple, Equation 7.16 does not properly reflect P(I_F = i) for a bounded document, since many of the terms considered in the summation of Equation 7.16 are for values of i and i − j which may be greater than b and −a respectively. Although adapting Equation 7.16 to comply with the bounded condition would result in a complicated expression, when P(I_F = i) is considered in the context of the expectation E[I_F], terms can be re-arranged so that the summation in the right term of Equation 7.16 for the bounded condition can be expressed with the doubly bounded sum shown in Equation 7.17.

\sum_{i=0}^{b-a} \sum_{j=0}^{i} p_{sf}^{j} \, p_{sb}^{i-j} - \{\text{invalid terms}\} = \left[ \sum_{j=0}^{b} p_{sf}^{j} \right] \left[ \sum_{l=0}^{-a} p_{sb}^{l} \right]        (7.17)
= \frac{(1 - p_{sf}^{b+1}) (1 - p_{sb}^{-a+1})}{(1 - p_{sf}) (1 - p_{sb})}

Applying a similar term re-arrangement procedure over this summation when it is multiplied

by i gives Equation 7.18,

\sum_{i=0}^{b-a} \sum_{j=0}^{i} i \, p_{sf}^{j} \, p_{sb}^{i-j} - \{\text{invalid terms}\}        (7.18)
= \sum_{j=0}^{b} p_{sf}^{j} \sum_{l=0}^{-a} (l + j) \, p_{sb}^{l}
= \sum_{j=0}^{b} p_{sf}^{j} \left[ \sum_{l=0}^{-a} l \, p_{sb}^{l} + j \sum_{l=0}^{-a} p_{sb}^{l} \right]
= \left[ \sum_{j=0}^{b} p_{sf}^{j} \right] \left[ \sum_{l=0}^{-a} l \, p_{sb}^{l} \right] + \left[ \sum_{j=0}^{b} j \, p_{sf}^{j} \right] \left[ \sum_{l=0}^{-a} p_{sb}^{l} \right]
= \mathrm{SUM}

where each of the individual geometric series appearing in the last step of Equation 7.18 can be reduced according to Equations 7.19 and 7.20.

\sum_{i=0}^{n} p^{i} = \frac{1 - p^{n+1}}{1 - p}        (7.19)

\sum_{i=0}^{n} i \, p^{i} = \frac{p (1 - p^{n+1}) - (n+1) \, p^{n+1} (1 - p)}{(1 - p)^{2}}        (7.20)

The bounded summation from Equation 7.18, as well as Equations 7.19 and 7.20, can then be used to obtain an expression for E[I_F] for the bounded case. This is shown in Equation 7.21,

E[I_F] = \sum_{i=0}^{b-a} i \, P(I_F = i)        (7.21)
       = \left[ \sum_{i=0}^{b} i \, p_{sf}^{i} \right] \frac{(1 - p_{sf}) (1 - p_{cb})}{1 - p_{sf}^{b+1}} + \mathrm{SUM} \, \frac{(1 - p_{sf}) (1 - p_{sb}) \, p_{cb}}{(1 - p_{sf}^{b+1}) (1 - p_{sb}^{1-a})}
       = T_1 + T_2 \, p_{cb}

where

T_1 = \frac{p_{sf} (1 - p_{sf}^{b+1}) - (b+1) \, p_{sf}^{b+1} (1 - p_{sf})}{(1 - p_{sf}) (1 - p_{sf}^{b+1})}

T_2 = \frac{p_{sb} (1 - p_{sb}^{1-a}) - (1-a) \, p_{sb}^{1-a} (1 - p_{sb})}{(1 - p_{sb}) (1 - p_{sb}^{1-a})}

The expectation E[I_B], which covers the case when users start searching in the backward direction, can be calculated in a similar manner to E[I_F], by replacing I_F, p_{sf}, b, and a in Equation 7.21 with I_B, p_{sb}, −a, and −b, respectively. Finally, the expected effort for the complete transition system, E[I], is calculated as shown in Equation 7.15 based on E[I_F] and E[I_B].

The preceding derivation of E[I] represents the expected amount of effort a user would invest in browsing horizontally in the forward and backward directions, until reaching the boundaries of the document. In reality, users are likely to stop browsing before reaching the end of a document if they come across a relevant region. In order to account for this condition in the estimation of E[I], the boundaries [a, b] are defined so that if the document contains a relevant region at position r, then b = r if r ≥ 0 and a = r otherwise. Given these definitions, the effort a user with parameters pf = 0.8, psf = 0.975, psb = 0.96, pcf = 0.8, and pcb = 0.5 is expected to invest in auditioning an entry point when starting from position 0 within a document with boundaries [−1000, 10] is 18.95 time units. For a user with zero probability of searching backwards (pf = 1 and pcb = 0) the effort reduces to 4.74, while for a user with zero probability of searching forward (pf = 0 and pcf = 0) the effort increases to 24. For an extremely persistent user who is willing to search in both the backward and forward directions with equal probability (pf = 0.5, pcb = pcf = 1) and with persistence psf = psb = 0.9975, the expected effort for inspecting the entry point within the interval [−1000, 10] rises to 315 time units.
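As a sanity check on this derivation, the following sketch evaluates the closed forms of Equations 7.15 and 7.21 and reproduces (up to rounding) the [−1000, 10] effort figure quoted above. The helper names are assumptions made for this illustration, not the thesis code.

```python
# Sketch of the expected horizontal effort E[I] (Equations 7.15 and 7.21). Illustrative only.

def trunc_geom_mean(p, n):
    """Closed form of T_1/T_2 in Eq. 7.21: the sum of i*p^i over 0..n normalised by the
    truncated geometric series (using Eqs. 7.19 and 7.20)."""
    num = p * (1.0 - p ** (n + 1)) - (n + 1) * p ** (n + 1) * (1.0 - p)
    return num / ((1.0 - p) * (1.0 - p ** (n + 1)))

def expected_effort(a, b, u):
    """E[I] for a document with boundaries [a, b] (a <= 0 <= b) around the entry point."""
    t1 = trunc_geom_mean(u['psf'], b)    # forward component, bounded by b
    t2 = trunc_geom_mean(u['psb'], -a)   # backward component, bounded by -a
    e_if = t1 + t2 * u['pcb']            # start in state F (Eq. 7.21)
    e_ib = t2 + t1 * u['pcf']            # start in state B (the symmetric case)
    return u['pf'] * e_if + (1.0 - u['pf']) * e_ib   # Eq. 7.15

if __name__ == '__main__':
    u = dict(pf=0.8, psf=0.975, psb=0.96, pcf=0.8, pcb=0.5)
    print(round(expected_effort(-1000, 10, u), 2))   # ~18.96, i.e. the ~18.95 reported above
```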

7.3.2 Vertical browsing model

The previous section described our probabilistic model of horizontal browsing behaviour. For a hypothetical user that begins browsing a document by following an entry point suggested by an SCR system, the horizontal model can be used to produce an estimate of the gain the user would acquire by inspecting this result and the effort associated with this action. In particular, gain is calculated as the probability that the user finds any relevant region, while effort is calculated as the time the user is expected to invest in browsing content. This model provides all the necessary components to calculate horizontal gain and effort for a single search result only. In order to incorporate factors associated with vertical browsing, this section describes how our horizontal model can be augmented to consider the gain and effort associated with the process of inspecting multiple SCR results,

Figure 7.6: Complete model of browsing behaviour. The entire horizontal model acts as a starting state which users can visit to audit the next search result in the ranking. E is the ending state, visited by users when finishing their vertical search.


arranged as an ordered list of entry points. Our complete model augmented with vertical browsing components is illustrated in Figure 7.6. As can be seen from the state diagram, our model follows the underlying model of rank-biased precision (RBP) (Moffat and Zobel, 2008), previously described in

Section 2.1.3. Initially, users begin by inspecting the first element in the list, e1. Here, the action of “inspecting” an entry point consists of performing horizontal browsing within a document. After examining e1, users can opt to move on to examining the next result in the list, e2, with continuation probability pc, or to finish their search with the complementary probability. Under these assumptions, our evaluation procedure can be framed within the gain-discount framework, with gain derived from the probabilities computed by the horizontal browsing model and discount given by the continuation probability pc as well as horizontal effort. The following sections give the details of how gain and effort are calculated in our extended model.

Deriving gain from vertical browsing

Let e1, e2, . . . , eK be a ranked list of K search results produced for a query by an SCR system. Each of these results represents an entry point suggestion for horizontal browsing, expressed as an interval [a, b], with a ≤ 0 indicating the number of time units from the entry point relative to the beginning of the document, and b ≥ 0 its analogue with respect to the ending of the document. Let r1, r2, . . . , rR be the list of starting points of regions in the collection of documents that are known to be relevant for this query.

Consider first a ranking with a single search result e1. Such an entry point may be located within a document that does not contain any relevant region. If this is the case, the probability of the user finding a relevant starting point according to our horizontal model will be zero, and so will be the gain g1 assigned to rank 1. By contrast, the document

associated with e1 may contain a subset of relevant regions which could effectively be found by the user. Within our evaluation framework, the assumption is that users will stop browsing horizontally and leave the document once they find a relevant region. Among the multiple regions that may be reachable by a user from e1, there will be one with the maximum probability of being found. In these circumstances, we assume that g1 will be given by the maximum probability obtainable from considering all relevant regions that an entry point can lead to. Let PF(ri, e1) be the probability that a user finds the relevant region ri by following e1 according to our horizontal browsing model. The gain at rank 1 can then be expressed as shown in Equation 7.22.

g_1 = P_F(r_{max}, e_1), \quad \text{where } r_{max} = \arg\max_{r} P_F(r, e_1)        (7.22)

When the ranking to be evaluated contains multiple entry points e1, e2, . . . , eK, users may be able to locate some relevant region r by following any of the entries that point to the document containing r. Some of these entries may in fact point to similar locations within the document and, depending on how close they are to each other, be treated as “near duplicates” by many of the evaluation measures described in Section 7.2. The strategy adopted by these measures for handling duplicate results consists of treating the top-ranked result, ek, as relevant and all its lower-ranked duplicates, ek+j, j > 0, as non-relevant. Note that this is based on the implicit assumption that users will be able to locate r from ek independently of the quality of this entry point and its proximity to r.

If an SCR system produces a perfect entry point at rank ek+1, these measures will not reward this system appropriately, and would instead measure user gain and effort based on the potentially less optimal entry point ek. Instead of assuming that users are always able to find r from the best-ranked near duplicate, our model considers that the user can find r with some non-zero probability, which may be lower than 1. Near duplicates are then taken care of by adopting a “cascade” model, similar to that used in the expected reciprocal rank (ERR) measure (Chapelle et al., 2009). Recall from the description of ERR in Section 2.1.3 that the main idea of the cascade model is to assume that users would be less interested in examining documents further down in the ranking after having found a highly relevant document at previous ranks. However, within the context of our evaluation framework, the assumption made is that a user will be less interested in r when inspecting some ranked result, and thus derive less gain from it, if it is highly likely that the user will find r at previous ranks. Thus, if the user has low chances of finding r from result ek, but high chances of finding it from ek+1, then they are likely to derive more gain in finding r from ek+1 than from ek. On the contrary, if r can be located from ek with high probability, it is more likely that more gain will be derived from this result than from a result ek+1 with lower associated probability. Based on this adaptation of the cascade model, the probability that a user locates r from ek, after having examined all previous search results e1, . . . , ek−1 in the ranking can

be calculated as shown in Equation 7.23.

P_{F@k}(r) = \left[ \prod_{i=1}^{k-1} \left( 1 - P_F(r, e_i) \right) \right] P_F(r, e_k)        (7.23)

By defining PF@k(r) this way, the simplifying assumption made is that the probability that a user finds r from the result ek is independent of that of the user finding r from any other search result in the ranked list.

The gain at rank k for the general case of a ranked list of search results e1, . . . , eK and multiple possible relevant regions r1, . . . , rR can thus be computed as shown in Equation 7.24.

g_k = P_{F@k}(r_{max}), \quad \text{where } r_{max} = \arg\max_{r} P_F(r, e_k)        (7.24)

By taking rmax at rank k as the maximiser of PF(r, ek), our model assumes that, among all relevant regions contained in the document associated with ek, rmax is the one the user is most likely to encounter. After finding rmax, the user is assumed to abandon horizontal search and turn their attention back to the ranked list.
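The cascade computation of Equations 7.23 and 7.24 can be written compactly as below. This is an illustrative fragment, not the thesis code; it assumes a pre-computed matrix pf_matrix in which entry [k][j] holds PF(r_j, e_k) from the horizontal model.

```python
# Sketch of the cascade-style gains g_k (Equations 7.23 and 7.24). Illustrative only.

def gains(pf_matrix):
    """pf_matrix[k][j] = PF(r_j, e_k); returns the list of gains g_k per rank."""
    if not pf_matrix or not pf_matrix[0]:
        return [0.0] * len(pf_matrix)   # no relevant onsets: zero gain at every rank
    n_rel = len(pf_matrix[0])
    not_found = [1.0] * n_rel           # running product of (1 - PF(r_j, e_i)) over earlier ranks
    g = []
    for row in pf_matrix:
        j_max = max(range(n_rel), key=lambda j: row[j])   # r_max of Eq. 7.24
        g.append(not_found[j_max] * row[j_max])           # PF@k(r_max), Eq. 7.23
        for j in range(n_rel):
            not_found[j] *= (1.0 - row[j])                # cascade discount for later ranks
    return g

# Two entry points leading to the same relevant onset: the better second result still
# earns gain, discounted by the chance that the onset was already found at rank 1.
print(gains([[0.2], [0.9]]))   # [0.2, 0.72]
```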

Combining vertical and horizontal effort

In order for our evaluation measure to be completely defined under the gain-discount framework, concrete definitions for the discount factor dk and the normalisation \mathcal{N} must be given. The discount factor within our model is calculated as in the rank-biased precision (RBP) measure, based on the depth of the ranks that users must reach in their quest for relevant information. First, note that vertical effort can be seen as an increasing function of k, eff(k), that increases every time the user moves from a rank k on to k + 1. Conversely, the inverse of the effort function, eff(k)^{-1}, is a function that decreases with k. The inverse of effort can alternatively be interpreted as the willingness of a user to continue up to a certain rank k. This interpretation is adopted by the model underlying RBP, in which the inverse of the effort acts as the factor dk that discounts the gain users may acquire at each rank k. By adopting a similar interpretation for our vertical browsing model, the discount associated with vertical browsing for a rank 1 ≤ k ≤ K can be expressed as shown in Equation 7.25.

\frac{p_c^{k-1} (1 - p_c)}{1 - p_c^{K+1}}        (7.25)

Although Equation 7.25 could be used directly to complete our definition of dk, such a discount would only account for vertical browsing effort, neglecting any horizontal effort that may be associated with the process of examining the result at rank k. Instead, the final discount factor associated with rank k adopted in our evaluation framework combines

both types of effort. This is formally expressed in Equation 7.26,

d_k = \frac{p_c^{k-1} (1 - p_c)}{(E[I_k] + 1) (1 - p_c^{K+1})}        (7.26)

where E[I_k] is the expected horizontal browsing effort calculated as shown in Equation 7.15 for the entry point and document boundaries associated with result ek. This estimate applies a geometric discount to model the attenuation of user interest as a function of k, and an additional factor (E[I_k] + 1)^{-1} which further applies a discount proportional to the extra cost users are expected to pay while browsing within the k-th result.

Given our definitions of gk and dk, the final evaluation measure induced by our user models, termed “no pain no gain” (NPNG), is given in Equation 7.27,

\mathrm{NPNG} = \frac{1}{\mathcal{N}} \sum_{k=1}^{K} g_k \, d_k = \frac{1}{\mathcal{N}} \sum_{k=1}^{K} P_{F@k}(r_{max}) \, \frac{p_c^{k-1} (1 - p_c)}{(E[I_k] + 1) (1 - p_c^{K+1})}        (7.27)

The last element that still needs to be defined to complete our evaluation measure is the normalisation factor \mathcal{N} > 0. In most traditional evaluation measures, this constant is used to map the summation of discounted gain into a value in [0, 1], so that the value produced by the metric can be properly compared and averaged across different queries. Usually, the major difference between queries that evaluation measures attempt to normalise for is the number of items that are known to be relevant in the ground truth for each query. For a query with R relevant items, our definition of gk establishes that the maximum gain a user can accumulate is \sum_k g_k = R. Such a condition may occur if the ranking under consideration contains one entry point per relevant item and if such entries are perfectly aligned with the onsets of the relevant items. Such a ranking would also incur the minimum cumulative effort a user could possibly invest in examining the results suggested by the retrieval system. Note that, according to our horizontal browsing model, some users may begin their search by exploring regions that precede the entry points returned by the system. Therefore, under the assumptions made by our model, some users may still have to invest non-zero effort even when given a perfect retrieval output. To account for these factors in the NPNG measure, \mathcal{N} can be defined as the maximum NPNG that a particular user can obtain from a perfect ranking. For a query with relevant items r1, . . . , rR, the normalisation factor of NPNG can be calculated as shown in Equation 7.28,

\mathcal{N} = \sum_{j=1}^{R} \frac{p_c^{j-1} (1 - p_c)}{(E_{min}[I_j] + 1) (1 - p_c^{K+1})}        (7.28)

where E_{min}[I_j] is the minimum effort that the user is expected to invest in the process of finding the relevant onset rj when given the best entry point possible, that is, e = rj.

E_{min}[I_j] can be calculated by using Equation 7.15, setting the document boundaries to a = −rj and b to the number of positions between rj and the end of the document. Because the quantity E_{min}[I_j] is greater for longer documents, which have higher values of −a and b, the offsets r1, . . . , rR need to be iterated in ascending order in the summation of Equation 7.28. Such an order ensures that the smallest geometric discounts, dominated by the quantity p_c^{j-1}, are applied against the greatest values of E_{min}[I_j] and that the summation of Equation 7.28 is maximised.
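Putting the pieces together, the NPNG value of a ranked list can be assembled as sketched below, given the gains of Equations 7.23–7.24, the per-result horizontal efforts E[I_k] of Equation 7.15, and the per-onset minimum efforts used by Equation 7.28. The function is an illustration, not the evaluation code used in the experiments.

```python
# Sketch of the NPNG measure (Equations 7.26-7.28). Illustrative only.

def npng(g, effort, min_effort, pc):
    """g[k]: gain at rank k; effort[k]: E[I_k]; min_effort[j]: E_min[I_j] for the j-th
    relevant onset, iterated in ascending onset order; pc: vertical persistence."""
    K = len(g)
    def geo(k):                                    # geometric part of d_k (Eq. 7.25)
        return pc ** k * (1.0 - pc) / (1.0 - pc ** (K + 1))
    gained = sum(g[k] * geo(k) / (effort[k] + 1.0) for k in range(K))             # Eq. 7.27 sum
    norm = sum(geo(j) / (min_effort[j] + 1.0) for j in range(len(min_effort)))    # Eq. 7.28
    return gained / norm if norm > 0.0 else 0.0

# A single result with gain 1.0 that already incurs the minimum possible effort scores 1.0.
print(npng(g=[1.0], effort=[5.0], min_effort=[5.0], pc=0.95))   # 1.0
```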

7.3.3 Summary

This section described the development of a novel evaluation framework for SCR that attempts to model the behaviour of users when searching for relevant information in a ranked list of entry points to documents. Within this framework, a measure called NPNG was instantiated that adopts a simple browsing model, where users are assumed to invest vertical effort while scanning the ranked list of results from top to bottom, and to pay some extra effort when carrying out horizontal browsing within each search result by moving in the backward and/or forward directions.

The NPNG measure addresses many of the limitations of existing evaluation measures for SCR. Unlike gAP, in which vertical effort is underestimated and outweighed by horizontal effort, NPNG can consider a wide range of user models, capturing the behaviour of users who may be unwilling to examine results below certain positions in the ranking and instead prefer investing their time and effort in horizontal exploration. Additionally, NPNG properly accounts for the effort that users may incur in examining entry points that do not lead to any relevant material, an effort which is undervalued in the case of gAP. In contrast to the majority of two-sided evaluation measures, NPNG does not evaluate the quality of the retrieved end points. In this way, it avoids some of the biases that overlap-based two-sided measures have with respect to passage length. Unlike two-sided measures, NPNG implements a flexible and more realistic user model which properly accounts for cases in which users may still be able to find relevant material even when the retrieved passage does not overlap with the relevant content.

7.4 Cross-evaluation of content structuring methods for SCR

This section presents a large-scale comparison of different content structuring approaches in the context of an SCR task for collections of unstructured documents. Content structuring methods are evaluated in terms of their ability to provide increased SCR effectiveness, measured in terms of NPNG for a diverse range of users with different browsing habits. Within this experimental setup, a comparison of evaluation measures for SCR is carried out in order to gain further insight into the advantages that NPNG may offer over other evaluation measures.

7.4.1 Task, collections, and evaluation measures

Structuring approaches were evaluated in the context of an SCR task, where the goal was to produce a ranked list of document offsets pointing to locations where relevant content may be found by users. A particular SCR method was then represented by a content structuring approach and a ranking method. The former was used to determine entry points or passage candidates, whereas the latter was used to rank these candidates in order of relevance to a given query. For the sake of comparing our NPNG measure against other one-sided as well as two-sided measures, the SCR methods under evaluation were designed to return both starting and ending offsets (passages) for every search result included in a ranking. Also, SCR methods were designed to represent passage boundaries with time offsets, measured in seconds, relative to the beginning of the speech document to which a passage belongs.

Since the focus is on comparing structuring approaches in an unstructured retrieval setup, the test collection used for this purpose must contain relevance assessments indicating where relevant regions occur within the spoken documents. In addition, the ground truth must be free from any “boundary” bias that may have been introduced by the dataset creators when collecting relevance assessments via pooling. As discussed in Section 7.1.1, the majority of relevant passages available for the SH14 and SAVA query sets of the BBC2 collection are either 60, 90, or 120 seconds in length, with boundaries determined automatically by the SCR systems that contributed to the pool. Because there is a potential risk of favouring segmentation approaches that produce passages with similar boundaries to these, the experiments described in this section were not conducted with the SH14 and SAVA topics. The SH13 query set was also discarded because of the small number of topics and relevant passages it contains. Despite being collected via a pooling procedure, the relevant passages available for the SD2, SQD1, and SQD2 topics of the SDPWS2 collection were manually corrected by human assessors and thus represent a less biased test collection that may be more suitable for the evaluation of content structuring methods. Among the different types of transcripts available for the SDPWS2 collection, the experiments reported in this section were conducted with the ASR transcripts produced by the Julius ASR system described in Section 4.2.2, under the MATCH condition of acoustic and language models. Text transcripts were processed as in the experiments of Section 6.3, by applying the MeCab tokeniser and keeping only the nouns and verbs that do not appear in a standard stop word list for Japanese. All content segmentation methods considered were applied to these processed versions of the transcripts.

Evaluation measures

In order to better understand how different structuring methods may affect retrieval effectiveness, as well as to set a basis for comparing our NPNG measure, the different SCR

approaches were evaluated by using a wide array of evaluation measures. The subsequent experiments report retrieval effectiveness calculated by using the following measures:

Within-segment precision (SegP), recall (SegR), and F1 (SegF1) These two-sided measures were calculated based on the amount of overlap between the top 100 retrieved passages and the relevant passages, without applying any rank-based discount. Approaches that return long and short passages will tend to obtain higher values of SegR and SegP respectively, while those that produce passages similar to those in the ground truth will acquire high values of SegF1.

Generalised average precision (gAP) The generalised average precision measure was described in Section 7.2.1 and shown in Equation 7.2. Our implementation of gAP used G = 10 seconds as the granularity parameter and the standard triangular reward function from Figure 7.1a.

Overlap average precision (oAP) The overlap average precision metric was described in Sections 5.3.1 and 7.2.2. As discussed previously, oAP tends to favour approaches that return long passages since the measure completely neglects horizontal user effort.

Utterance average precision (uAP) The uAP measure was proposed by Akiba et al. (2011), and was described in Section 7.2.2. Our implementation of uAP first maps passage boundaries onto their corresponding slide units from the documents of the SDPWS2 collection. Given a time offset t, this mapping selects the ID of the slide unit whose span covers position t.

Average segment precision (sAP) and distance weighted precision (dwsAP) These are the measures proposed by Eskevich et al. (2012c) that calculate gain by quantifying within-passage precision and apply an inverse rank-based discount. The distance weighted counterpart, dwsAP, applies the same penalty as gAP to results whose entry points are far away from the onsets of a relevant region. For dwsAP, the same reward function as for gAP was used, with G = 10. Our implementation of these measures is the one described in (Eskevich et al., 2012c), in which the normalisation factor \mathcal{N} = R is the number of passages in the ranked list that contain some relevant material. Because the inverse of this normalisation factor will be higher for lower values of R, the sAP and dwsAP measures will tend to favour SCR methods that return a small number of search results.

Average interpolated segment precision (AiSP) This is the enhanced measure we proposed in (Racca and Jones, 2015a), described in Section 7.2.3 and inspired by sAP, that assumes a more realistic user model of browsing behaviour based on time-traces. AiSP considers that users browse through a time-trace created by flattening the original ranking

Figure 7.7: A tolerant (orange) and intolerant (blue) group of users for NPNG. The left plot shows the probability of reaching a certain rank. The right plot shows the probability of consuming a certain amount of seconds of speech material.

of passages into a ranking of individual time units (in seconds). The model underlying AiSP assumes that users can browse back and forth looking for relevant material, but never beyond the boundaries of the retrieved passage. Gain is defined as the total time the user spends listening to relevant material divided by the total time spent. Discount is applied within the offsets in the time-trace (time positions), and average precision is calculated at fixed recall points.

No-pain no-gain (NPNG) In an ideal scenario, the parameters of NPNG would be estimated from an analysis of the search logs of an SCR system. In the absence of such data, the SCR methods under study were evaluated for four user models in NPNG, each representing a particular profile of “persistence” with respect to vertical and horizontal browsing. Two persistence profiles were considered: one representing tolerant users willing to invest large amounts of time in either vertical or horizontal search, and a second one representing impatient users who would prefer to spend less time searching at the risk of not finding much relevant material. Figure 7.7 illustrates these two user profiles for vertical and horizontal browsing. The left plot shows the probability of reaching a certain rank, while the right plot shows the probability of listening to speech up to or for more than a certain amount of time. There are thus four possible user combinations: v↑h↑ (high vertical and high horizontal patience), v↓h↓ (low vertical and low horizontal patience), and the remaining combinations v↑h↓ and v↓h↑, which interleave the low and high persistence profiles. Table 7.1 shows the specific persistence probabilities used to instantiate each of these user models in NPNG.

For the evaluation of a particular SCR approach over a set of queries, each query was evaluated independently with an evaluation measure, and these results were then averaged using an arithmetic mean as in MAP. All measures, with the exception of NPNG, implement the traditional strategy for handling near duplicates in the ranked list, whereby users can derive gain at most once from a relevant region in the ground truth. Subsequent results pointing to or overlapping with a relevant region that has already been consumed at previous ranks are treated as non-relevant results in our implementations. Only the top 300 highest-ranked results produced by each SCR method for a query were evaluated.

Table 7.1: Persistence probabilities for the tolerant and intolerant user models

Model    pc    pf    psf    psb    pcb   pcf
v↑h↑    .95   .80   .995   .985   .90   .70
v↓h↓    .70   .80   .955   .940   .50   .20
v↑h↓    .95   .80   .955   .940   .50   .20
v↓h↑    .70   .80   .995   .985   .90   .70
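For reference, the four user models of Table 7.1 can be written as parameter dictionaries in the form expected by the horizontal-model sketches given earlier (the key names are the same illustrative ones used there).

```python
# The four NPNG user models of Table 7.1 as parameter dictionaries. Illustrative only.
USER_MODELS = {
    'v_up_h_up':     dict(pc=0.95, pf=0.80, psf=0.995, psb=0.985, pcb=0.90, pcf=0.70),
    'v_down_h_down': dict(pc=0.70, pf=0.80, psf=0.955, psb=0.940, pcb=0.50, pcf=0.20),
    'v_up_h_down':   dict(pc=0.95, pf=0.80, psf=0.955, psb=0.940, pcb=0.50, pcf=0.20),
    'v_down_h_up':   dict(pc=0.70, pf=0.80, psf=0.995, psb=0.985, pcb=0.90, pcf=0.70),
}
```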

7.4.2 Comparison of content structuring methods

This section describes the experimental setup and the results of our comparison experiments. To allow for a fair comparison of different structuring techniques, the same retrieval function was used for scoring passages produced by these structuring methods. Exceptions to this were the positional and HMM based methods described later, which by design need special retrieval configurations. The retrieval function chosen for the remainder of the methods was the BM25 function from Equation 2.9, extended with the exponential

IDF factor as described in Section 6.3.2. BM25 parameters were set to b = 0.42, k1 = 2, k3 = 31, and d = 1.4, which result from averaging the optimal parameters obtained with BM25 for the query sets SD2, SQD1, and SQD2 in the experiments reported in Section 6.3.2. Because different passage collections would result from applying different structuring methods to a collection of documents, the collection statistics needed for

BM25 scoring, particularly the N, nt for term t, docl, and avedl values, were calculated for each structuring method separately based on the collection of passages produced by each method.

Structuring methods

The following content structuring approaches were evaluated in terms of NPNG and the rest of the measures listed in Section 7.4.1. Detailed descriptions of the majority of these methods were provided in Sections 2.3 and 2.4.

Full-document (DOC) A trivial structuring approach is not to perform any structuring at all. This method segments a document into a single long passage that covers the entire contents of the document.

Inter-pausal units (IPU) This method divides a spoken document into its constituent utterances, denoted in the SDPWS2 collection as inter-pausal units (IPUs). While IPUs may contain too few terms to provide optimal SCR effectiveness, they may still be useful

to detect short regions of content containing query phrases, which may represent good entry points to return as search results for a query.

Slide-group units (SLIDE) This method segments a speech recording into its slide-group passages. Recall from Section 4.2.2 that slide groups were created manually by human assessors who clustered together slides describing a homogeneous topic in an SDPWS2 recording. Thus, SLIDE units represent, a priori, a possible “ideal” collection of retrieval units for the SDPWS2 collection.

Oracle units (ORACLE) For a given query, this approach divides a document into the passages that are known to be relevant for this query. Non-relevant content, that is, content excerpts which are not covered by any relevant region, is fragmented into SLIDE units by default. Note that this approach is query dependent, as it produces a different passage segmentation for different queries.

Sliding windows (WIN) This is the most traditional structuring method for SCR and passage retrieval tasks, whereby documents are fragmented into arbitrary passages of fixed or variable length by moving a window across the contents of the document and extracting a passage every time the window is right-shifted a certain number of steps. Several variations of this approach were considered, using different values for the length and step parameters, as well as two passage consolidation strategies, filtering and merging, for eliminating near-duplicate results in the ranked lists. The filtering strategy discards the passage at rank k if it overlaps with any of the passages ranked at j < k. The merging strategy combines adjacent or overlapping passages into a single passage, which is then given the rank of its highest scoring passage. In what follows, windowing approaches are denoted by WIN-L-X, with L denoting the length of the window used and X being either F or M for the filtering or merging strategies respectively.
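A minimal sketch of this family of methods (the fixed-length window generator and the rank-time filtering consolidation) is given below; the function names and the example document length are hypothetical.

```python
# Sketch of WIN-L-X passage generation and the filtering strategy. Illustrative only.

def sliding_windows(doc_length, win_len, step):
    """Yield (start, end) passages of win_len seconds, right-shifted by step seconds."""
    start = 0.0
    while start < doc_length:
        yield (start, min(start + win_len, doc_length))
        start += step

def filter_near_duplicates(ranked):
    """ranked: (doc_id, start, end) tuples sorted by descending score. A passage is kept
    only if it does not overlap a higher-ranked passage from the same document."""
    kept = []
    for doc_id, start, end in ranked:
        if not any(d == doc_id and start < e and end > s for d, s, e in kept):
            kept.append((doc_id, start, end))
    return kept

# e.g. a 60-second window with a 30-second step over a 200-second document
print(list(sliding_windows(200, 60, 30)))
```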

Multi-windows (MWIN) This is the method proposed by Kaszkiel and Zobel (1997) for producing arbitrary variable-length passages by sliding windows of different lengths over the documents. The final collection of passages is then comprised of the union of all passages produced by each independent sliding-window procedure. For the experiments with MWIN, the windows chosen to produce passages had lengths ranging from 30 up to 480 seconds in increments of 30. Because many passages in the union overlap, the filtering strategy described above for WIN was applied to the ranked lists produced by the MWIN method to remove near-duplicate results.

TextTiling (TT), C99, Utiyama and Isahara (UI), MinCut (MC), and BayesSeg (BS) These correspond to the text segmentation algorithms developed respectively by Hearst (1997), Choi (2000), Utiyama and Isahara (2001), Malioutov and Barzilay (2006), and Eisenstein and Barzilay (2008), described previously in Section 2.3. Each of these algorithms

was used to produce a collection of passages by processing every ASR transcript of the SDPWS2 collection. Since some of these methods assume the input to be arranged as a list of sentences, a sentence was constructed from the words appearing in each IPU of a transcript. The implementation of TT used is the one included in the NLTK toolkit v3.2.2 (Bird, 2006)¹. For C99, the original implementation of the algorithm by Choi was used. For UI, MC, and BS, the implementations released by Eisenstein and Barzilay² were used. Default parameters were used for the C99, UI, MC, and BS algorithms. Since MC requires the number of desired segments per document, this was set to the number of segments produced by the BS algorithm for each document.

¹ http://www.nltk.org/
² https://github.com/jacobeisenstein/bayes-seg

TT variations (TT-k, TT-k-ED and TT-k-EC) Recall that TT forms blocks of text by grouping k adjacent pseudo-sentences, which are in turn formed by sequences of w consecutive terms. The TT algorithm was configured to produce passages of different sizes by varying the k argument of the NLTK implementation, which specifies the number of pseudo-sentences considered by TT in each text block. In the results reported in this section, these alternative configurations are denoted by TT-k. In all cases, the number of words per pseudo-sentence was set to w = 20. Additionally, in order to gain further insight into the differences that may exist between windowing and lexical cohesion approaches, experiments were conducted with a modified version of TT that produces overlapping passages of similar length. For this, the passages produced by TT-k were extended so that they acquired a fixed length value l. Two alternatives for computing l were explored: in TT-k-ED, all passages from document d were extended to have a length equal to that of the longest passage in d; in TT-k-EC, all passages from the collection were extended to the length of the longest passage in that collection.

Positional models (PM, PS-PM, and NS-PM) Whereas all approaches listed above produce a static segmentation of the document collection, the PM and skewed-PM techniques infer retrieval units dynamically, depending on the contents of the query. For a given query, a positional model was used to calculate a relevance score for every position within a document, based on the pseudo-frequency counts that every query term appearing in the document propagates to this position. As described in Section 6.2.2, every occurrence of a term t contributes to the pseudo-frequency of t at some position p according to the distance that exists between each term occurrence and p. The pseudo-frequency values of all terms at every position p were then multiplied by their respective inverse document frequencies and finally summed in order to obtain a BM25 score for every position within the document. These relevance scores thus form a density contour over document positions, with peaks and valleys found at locations where there is a high, respectively low, density of highly discriminative query terms. A similar technique to that implemented by TextTiling was

then used to find the locations of prominent peaks. The locations of prominent peaks for each document in the collection were next ranked based on their associated relevance scores and returned as a list of one-sided results for the query. Two implementations of this technique were considered for modelling term propagation, one that uses the symmetric Gaussian kernel (PM) from Equation 6.2 and another that uses the skewed Gaussian kernel from Equation 6.3 with either positive (PS-PM) or negative (NS-PM) skewness. In both cases, the spread parameter σ was set to 100 seconds, and the skewness parameter α to plus or minus 0.07. To allow for the application of two-sided evaluation measures, the results produced by these methods were converted into passages of 50 seconds in length. Because positional models calculate term frequencies in a substantially different manner than standard BM25 applied to passages, term weights in the positional models were computed with BM25 parameters set to b = 0.02, k1 = 4.16, k3 = 83.3, and d = 1.37, which performed significantly better than the default values for this approach in the experiments from Chapter 6.
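As a rough illustration of the positional-model idea (not the exact formulation of Section 6.2.2, which uses the kernels of Equations 6.2 and 6.3 and BM25-style weighting), a symmetric-kernel scorer can be sketched as follows; the idf argument and the per-second resolution are assumptions.

```python
# Rough sketch of positional-model (PM) scoring with a symmetric Gaussian kernel.
# Illustrative only; the thesis formulation in Section 6.2.2 may differ in detail.
import math

def positional_scores(doc_length, occurrences, idf, sigma=100.0):
    """occurrences: {term: [offsets in seconds]}; returns one relevance score per second."""
    scores = [0.0] * int(doc_length)
    for term, offs in occurrences.items():
        w = idf.get(term, 0.0)
        for p in range(int(doc_length)):
            # pseudo-frequency of the term at position p, propagated from each occurrence
            pseudo_tf = sum(math.exp(-((p - o) ** 2) / (2.0 * sigma ** 2)) for o in offs)
            scores[p] += w * pseudo_tf
    return scores

# Prominent peaks of this density contour would then be ranked by score and
# returned as entry points for the query.
```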

Cover units (COVER) This method is inspired by the work of Clarke et al. (2000a), and performs a query-dependent segmentation of a document based on “cover sets”. A cover set is defined as the shortest passage that captures a certain number of unique query terms. In our implementation of this method, a new passage is generated for every pair of query-term occurrences appearing in a document. The boundaries of a cover passage are defined by the positions of these occurrences within the document. Applied naively, this process would generate n² passages for a document containing n occurrences of query terms. In order to keep the number of passages low, our implementation generates cover passages by considering only the top 3 query terms with the highest inverse document frequency. Overlapping passages in the final ranking of cover passages were removed with the same filtering strategy used in the sliding window methods.
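Under the description above, cover-passage generation can be sketched as follows; the data structures (the idf and occurrences dictionaries) are hypothetical placeholders.

```python
# Sketch of COVER passage generation from pairs of query-term occurrences. Illustrative only.
from itertools import combinations

def cover_passages(query_terms, idf, occurrences):
    """occurrences: {term: [offsets within the document]}; returns candidate (start, end) spans."""
    # keep only the three query terms with the highest inverse document frequency
    top_terms = sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True)[:3]
    positions = sorted(p for t in top_terms for p in occurrences.get(t, []))
    # every pair of retained occurrences defines the boundaries of one cover passage
    return [(s, e) for s, e in combinations(positions, 2) if e > s]
```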

Divisive clustering (DIV) This method adopts a top-down recursive approach to progressively divide a document into smaller units for a particular query. At each step, the algorithm selects a position at which to split an item into a left and a right part. The criterion for selecting a splitting position p seeks to maximise the relevance score of either of the individual parts that would result if the item were split at p. Once a maximal splitting position is found, the scores of its associated left and right parts are compared against that of the item. If a part scores higher than the item, then the algorithm is applied recursively to this part. Otherwise, the algorithm stops and returns the history of breaking positions found so far, from which passages are then generated.
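The recursion just described can be rendered as the following sketch; score() is a stand-in for the BM25 scoring of a candidate span and candidate_positions for the permitted split offsets (e.g. IPU boundaries), both assumptions made for the illustration.

```python
# Sketch of the divisive clustering (DIV) splitting procedure. Illustrative only.

def divisive_split(span, score, candidate_positions, breaks=None):
    """span: (start, end); returns the accumulated list of breaking positions."""
    if breaks is None:
        breaks = []
    inside = [p for p in candidate_positions if span[0] < p < span[1]]
    if not inside:
        return breaks
    # choose the split maximising the score of the better of the two resulting parts
    best_p = max(inside, key=lambda p: max(score((span[0], p)), score((p, span[1]))))
    parts = [(span[0], best_p), (best_p, span[1])]
    improving = [part for part in parts if score(part) > score(span)]
    if not improving:
        return breaks          # neither part beats its parent: stop recursing
    breaks.append(best_p)
    for part in improving:     # recurse only into parts that outscore the parent
        divisive_split(part, score, candidate_positions, breaks)
    return breaks
```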

Hidden Markov Model (HMM) This method replicates the HMM approach used by Jiang and Zhai (2006) at TREC HARD 2004. The observations of the HMM correspond to words, while the set of hidden states corresponds to either: a background state, assumed

Figure 7.8: 5-state HMM structure proposed by Jiang and Zhai (2006) for passage retrieval. The states B1, B2, and B3 are background states, while R and E are the relevant and ending states respectively.


to generate non-query words; and a relevant state, assumed to emit query-related words. Emission probabilities of background states over non-query words are given by a maximum-likelihood language model estimated from the entire collection of documents. The output probabilities at the relevant state are instead given by a relevance language model (RLM), estimated by interpolating maximum-likelihood query and collection LMs with Jelinek-Mercer smoothing and a smoothing factor equal to 0.5. A segmentation for a document can be inferred from the most likely sequence of states in the HMM that generates the document. In particular, contiguous spans of words generated by the relevant state can then be extracted as hypothetical relevant passages to be retrieved. Our implementation adopts the 5-state HMM structure used by Jiang and Zhai, with three background states, a relevant state in between background states, and an ending state to mark the ending of the document. This HMM structure is shown in Figure 7.8. The transition probabilities in this HMM were set to those reported in Jiang and Zhai (2006), which the authors obtained by training their HMM with TREC HARD data.

The structuring methods considered may be divided into two broad categories: a category of static segmentation approaches, which produce a collection of query-independent segments, and a group of dynamic or flexible approaches, which define segments based on the contents of the query. Note that since some of the dynamic approaches considered do not produce passages at indexing time, it is unclear how the collection statistics needed for BM25 scoring should be calculated for these methods. The approach taken in this study was to re-use the collection statistics derived from SLIDE units to calculate term weights for the dynamic structuring methods under analysis.

Table 7.2 provides statistics about the passages that are produced by the static structuring methods. IPUs and DOC units represent the shortest and longest static units considered, while the rest of the segmentation methods produced passages with lengths varying between 15 and 400 seconds. When used with default parameters, most lexical cohesion approaches produced passages with similar characteristics, with the exception of UI, which tended to produce longer passages.

Table 7.2: Collection statistics for passages produced by static structuring methods for the SDPWS2 collection. Each column specifies: the total number of passages in the collection (N), their average length in seconds (avel), their standard length deviation (stdl), and the amount of overlap between adjacent passages (over).

Method         N        avel      stdl    over
IPU            37,757   2.5       2.0     0%
SLIDE          2,334    47.7      46.3    0%
DOC            98       1,202.6   145.8   0%
WIN-15-5       23,650   14.9      1.0     66%
WIN-30-15      7,921    29.6      2.9     49%
WIN-60-30      3,986    58.4      7.9     49%
WIN-90-45      2,669    86.7      13.6    49%
WIN-120-60     2,019    113.9     21.4    48%
WIN-500-250    511      413.7     140.9   46%
MWIN           27,402   130.4     118.7   49%
TT-5           4,592    25.7      10.6    0%
TT-10          2,908    40.6      20.6    0%
TT-15          2,005    58.9      27.7    0%
TT-20          1,578    74.8      36.6    0%
TT-5-ED        4,592    66.9      22.5    59%
TT-10-ED       2,908    93.2      24.6    55%
TT-15-ED       2,005    119.9     26.9    50%
TT-20-ED       1,578    150.3     36.1    49%
TT-5-EC        4,592    143.9     2.1     82%
TT-10-EC       2,908    233.8     5.5     82%
TT-15-EC       2,005    233.8     4.5     74%
TT-20-EC       1,578    337.6     9.0     77%
MC             1,708    69.1      79.2    0%
C99            1,495    78.9      67.2    0%
BS             1,708    105.2     162.4   0%
UI             599      197.0     102.6   0%

Table 7.2 also shows how extending TT passages to a fixed length (TT-k-ED and TT-k-EC) increases the average passage length and overlap, and reduces length variability among passages.

Table 7.3 presents length statistics for passages returned at rank 300 or lower by the retrieval methods under study. For structuring methods that produce overlapping passages and that adopt a filtering or merging post-processing strategy, retrieval methods return 6 or fewer passages per document in the top 300 results. The reason for this is that passage rankings are generated by considering only the passages from the top 50 full documents ranked by BM25 for the SDPWS2 collection. The number of passages per document decreases from 6 to about 3 and 1.7 for methods that produce longer passages or passages with high overlap. Among dynamic structuring methods, the COVER and DIV methods tend to produce longer passages relative to windowing methods. Compared to the standard PM method, positional models that use a skewed Gaussian kernel (NS-PM and PS-PM) tend to produce more peaky density contours, which results in more results being retrieved per document. Also, the fact that the avel figures from Table 7.3 are greater than those from Table 7.2 evidences that BM25 tends to assign increased scores to passages that are longer than the average in each passage collection.

Results of comparison experiments

Table 7.4 presents effectiveness values of all structuring methods under study, as calculated by the traditional evaluation measures considered and by NPNG for the four user models described previously. Each effectiveness value was obtained by averaging the values produced by a measure across the 220 queries that comprise the SD2, SQD1, and SQD2 topic sets. The superscript of an effectiveness value indicates its relative rank with respect to others from the same column, and therefore shows how structuring methods would rank relative to other methods if sorted by a specific evaluation measure. The last column in the table shows the average rank obtained by each structuring method across all evaluation measures. The following observations can be drawn from the results shown in Table 7.4.

Table 7.3: Length statistics of passages retrieved by the BM25, PM, or HMM based methods for the SDPWS2 queries for each structuring technique. The columns indicate: the average depth of the ranked lists; the average number of returned passages per document and query (aveN); the average length of returned passages in seconds (avel); and their standard length deviation (stdl).

Method           depth   aveN   avel     stdl
IPU              300.0   6.3    3.8      1.3
SLIDE            300.0   6.0    80.9     38.2
DOC              46.7    1.0    1210.7   0.0
ORACLE           300.0   6.0    81.1     38.7
WIN-60-30-F      160.8   3.4    59.4     0.8
WIN-90-45-F      156.4   3.2    88.6     1.8
WIN-120-60-F     153.5   3.1    117.3    3.6
WIN-500-250-F    126.2   2.6    416.8    107.7
WIN-15-5-M       122.3   3.1    21.8     3.2
WIN-30-15-M      147.7   3.3    44.2     8.4
WIN-60-30-M      131.0   2.8    96.6     20.8
WIN-90-45-M      120.9   2.5    155.1    34.5
MWIN-F           18.0    1.3    229.3    15.3
TT-5             300.0   6.1    28.6     9.0
TT-10            300.0   6.0    48.9     16.4
TT-15            300.0   6.0    67.8     22.0
TT-20            300.0   6.0    86.1     30.6
TT-5-ED-F        123.1   2.7    66.1     0.0
TT-10-ED-F       128.7   2.7    93.5     0.1
TT-15-ED-F       134.3   2.8    119.8    0.1
TT-20-ED-F       132.8   2.7    152.5    0.1
TT-5-EC-F        62.6    1.8    143.9    0.1
TT-10-EC-F       60.2    1.6    233.9    0.1
TT-15-EC-F       77.7    1.8    233.9    0.1
TT-20-EC-F       70.1    1.6    337.7    0.2
C99              300.0   6.0    123.3    62.3
UI               293.8   6.0    205.3    85.6
MC               300.0   6.0    138.2    66.8
BS               300.0   6.0    247.0    150.3
PM               165.1   3.3    50.0     0.0
NS-PM            280.4   5.6    50.0     0.0
PS-PM            279.6   5.6    50.0     0.0
COVER            10.7    1.2    231.6    14.0
DIV              299.0   6.0    133.3    95.3
HMM              113.7   2.3    60.1     0.0

O1: As expected, the ORACLE method attains the highest effectiveness values overall, across the majority of the evaluation measures considered. Thus, despite ORACLE passages being highly variable in terms of length, the problems associated with length normalisation in BM25 may be outweighed by the benefits of using passages that capture the exact boundaries of the relevant regions.

O2: Returning full documents instead of passages (DOC) as search results decreases performance, according to effectiveness measures that penalise horizontal user effort, such as gAP, uAP, sAP, dwsAP, AiSP, and NPNG.

O3: Among lexical-cohesion based methods, TT performed generally more effectively than C99, UI, MC, and BS across most evaluation measures. Extending TT passages so that they have the length of the longest passage in a document prior to retrieval (TT-k-ED) seems to be beneficial, as long as the resulting passages do not become excessively long.

O4: Within sliding window approaches, it is not entirely clear whether any of the configurations evaluated performs better than the rest, as the most effective configuration varies between measures. A trend that does arise from these results, though, is that methods producing long windows, and generally long passages, tend to acquire lower effectiveness scores.

O5: Among positional models, using a skewed Gaussian kernel with negative skewness (NS-PM) performed generally better than using the symmetric (PM) and positively skewed (PS-PM) kernels. Using a negatively skewed kernel produces the effect of propagating relevance scores back in time. Because the relevance scores of future positions within a document get propagated to positions in the past, past positions then become better estimators of the relevance of subsequent content and, consequently, better entry point candidates.

O6: Within the group of non-positional dynamic structuring methods, divisive clustering (DIV) performed effectively in terms of vertical ranking quality, as indicated by the measures oAP, uAP, and NPNG with v↓h↑, while not producing accurate entry points, as shown by gAP and NPNG with h↓. The methods COVER and HMM performed poorly overall.

O7: As shown by gAP and NPNG with h↓, adopting a segmentation strategy that produces a large number of relatively short overlapping passages, combined with post filtering or merging as in WIN-60-30-F and WIN-15-5-M, results in more accurate entry points that can reduce horizontal effort dramatically relative to other methods. However, considering small ranking units appears to be detrimental for ranking quality, as methods that produce longer units tend to perform better for measures that value vertical effort more highly than horizontal effort.

O8: After ORACLE, it is unclear which segmentation method performs best overall, as there is no single method that performs substantially better than the rest across all or the majority of the evaluation measures. Rather, different segmentation methods qualify as “second best” under the scope of different evaluation measures. Whether a segmentation method acquires a high effectiveness score under a measure seems to depend strongly on the assumptions made about which features characterise an “ideal” ranked list of results and on the assumed model of user behaviour. For instance, because SLIDE units tend to align well with the beginnings of relevant regions in the SDPWS2 collection, the SLIDE method ranks in second place for the uAP measure, as well as for measures that heavily favour accurate entry points, such as gAP and NPNG with h↓. The fact that the COVER method is the second best under the sAP and dwsAP measures most likely arises from the small number of results returned by this method for each query, which can minimise the normalisation factor of sAP and dwsAP. Returning short IPUs performs best in terms of AiSP, possibly because AiSP calculates horizontal effort as the sum of the lengths of the passages

in the ranked list, which is minimal for IPUs when compared to other retrieval units. While WIN-60-30-F ranked as the second-best method according to NPNG for patient users (v↑h↑), less patient users (v↓h↑) would most likely derive more gain from the MWIN-F, DIV, and TT-5-ED-F methods.

Observation O8 makes it clear that no single evaluation measure provides sufficient information to properly determine the relative effectiveness of one SCR method compared to others. The conclusions drawn from using a single evaluation measure will most likely be biased towards the particular type of SCR methods favoured by the chosen measure. Using a group of evaluation measures instead provides useful information about the different strengths and weaknesses of each individual method. However, methods tend to be ranked inconsistently by the different measures, which makes it hard to identify an overall winner. Although computing the average rank may seem to provide a useful indicator of the overall average performance of an SCR approach, it is not obvious whether all measures should be treated equally by such an average, nor how the effectiveness scores from multiple measures could be combined directly. A single evaluation measure represents a single specific type of user, with a particular type of behaviour. At the same time, different structuring approaches induce SCR systems that produce search results with different characteristics. In the same way that a specific user may find more benefit from using one specific SCR system over another, a specific SCR system may better satisfy one type of user over another. In this regard, the results from Table 7.4 suggest that:

1. SCR methods based on short passages produced by sliding windows are the most effective for highly patient users (v↑h↑), and may even perform better for these users than methods based on manually generated passages (SLIDE);

2. Methods that consider longer ranking units, such as those produced by MWIN-F and DIV, are most effective for patient users who only inspect the top results in the ranked lists (v↓h↑), as longer units help to improve ranking quality overall;

3. Impatient users who are not willing to invest significant amounts of time in browsing speech content are best satisfied with SCR methods that produce highly accurate entry points close to the beginning of relevant regions, such as manual units (SLIDE) and short units (WIN-15-5-M, TT-5).

Recall from the experiments presented in Chapter 6 that the ranking of SLIDE units can be improved if units are re-ranked based on the relevance scores of their documents. Such a re-ranking strategy could potentially improve the quality of the rankings produced by every SCR method in general, and may bring the performance of short units closer to that obtained with long units for users of type v↓h↑. In order to determine the extent of these improvements, the ranked lists induced from the structuring methods were re-ranked

based on the document score interpolation (DSI) technique described in Section 6.2.1, and then re-evaluated. Table 7.5 shows the resulting effectiveness scores. For measures that are able to capture ranking quality, the DSI technique provided increased effectiveness scores for the majority of SCR methods. These improvements were particularly notable for SCR methods that return short passages, such as IPU, WIN-15-5-M, WIN-30-15-M, and WIN-60-30-F. In particular, by adopting these modifications, methods that rely on short units can be seen to be as effective as those that use longer units for patient users who perform shallow vertical searches (v↓h↑).

Thus, the content structuring method that seems most effective across all types of users considered is one that considers relatively short overlapping passages and uses both document-level and passage-level relevance scores to perform the ranking. While short units make it possible to locate focused regions with a high density of query terms that are useful as entry point suggestions, the re-ranking of these units based on document relevance scores improves the ranking of potentially relevant regions at top positions in the ranked list and increases robustness to ASR errors.
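To make the re-ranking step concrete, the sketch below interpolates each passage's own relevance score with the score of its containing document. The function name, the linear mixture, and the weight lam are illustrative assumptions made here for exposition; the exact DSI formulation used in the experiments is the one defined in Section 6.2.1.

```python
# Illustrative sketch of document-score-based re-ranking of passages.
# Assumption: DSI is approximated as a linear mixture of the passage's own
# relevance score and the score of its containing document; the formulation
# actually used in the experiments is the one defined in Section 6.2.1.

def dsi_rerank(passages, doc_scores, lam=0.5):
    """passages: list of (passage_id, doc_id, passage_score) tuples.
    doc_scores: dict mapping doc_id -> document-level relevance score.
    lam: interpolation weight given to the document score (assumed value)."""
    reranked = []
    for pid, doc_id, p_score in passages:
        d_score = doc_scores.get(doc_id, 0.0)
        combined = lam * d_score + (1.0 - lam) * p_score
        reranked.append((pid, doc_id, combined))
    # Sort passages by the interpolated score, highest first.
    reranked.sort(key=lambda x: x[2], reverse=True)
    return reranked

# Example: two short passages from a strongly matching document can move
# above an isolated passage from a weakly matching document.
ranked = dsi_rerank(
    passages=[("p1", "d1", 2.1), ("p2", "d2", 2.0), ("p3", "d1", 1.4)],
    doc_scores={"d1": 3.0, "d2": 0.5},
)
print(ranked)
```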

7.5 Summary

This chapter presented a detailed investigation of evaluation measures in the context of an unstructured SCR task and conducted a wide-ranging comparison of different structuring approaches for SCR.

Designing evaluation measures for SCR tasks where documents lack a clear structure requires the consideration of multiple factors which do not need to be accounted for when evaluating straightforward document retrieval systems. Under the scope of traditional pooling strategies, the collection of relevance assessments requires manual post-processing of the results from the pools to adjust the boundaries of relevant passages and reduce the amount of bias in the resulting ground truth. Special considerations regarding the way results will be presented to users in the search results page must also be accounted for, for example the type of browsing and audio playback interface that will be used, as well as the kind of information per result (thumbnails, starting and ending points, density contour, etc.) that will be shown to users. Under a standard presentation layout, with results arranged as a ranked list of audio pointers, SCR systems must seek to minimise the effort users are required to invest in inspecting the ranked list of results (vertical browsing) and navigating within the contents of each document (horizontal browsing), while maximising the amount of relevant material presented.

Several evaluation measures for SCR and other unstructured retrieval tasks have been developed in the past, most of which can be seen as a normalised accumulation of gain over ranks discounted by some decreasing function of the ranks.

Table 7.4: Retrieval effectiveness calculated by traditional evaluation measures and NPNG for various structuring methods over the SDPWS2 queries. [The table lists, for each of the 35 structuring methods, SegF, SegP, SegR, gAP, oAP, uAP, sAP, dwsAP, AiSP, the NPNG scores under the four user models (v↑h↑, v↓h↓, v↑h↓, v↓h↑), a per-measure rank for each score, and an overall rank; the per-method values are omitted here.]

Table 7.5: Retrieval effectiveness calculated by traditional evaluation measures and NPNG for re-ranked lists of passages based on document scores. [Same layout as Table 7.4, computed after DSI re-ranking; values omitted here.]

The majority of these measures assume a particular deterministic model of user behaviour in which the factors that are important for user satisfaction are either not considered or are otherwise given disproportionate importance by the metric. Recent evaluation measures proposed for IR tasks, such as the U-measure and time-biased gain, take a more user-centric approach to the calculation of effectiveness scores.

Inspired by these user-centric evaluation measures, Section 7.3 presented the development of a novel evaluation measure for SCR that models user behaviour more explicitly than traditional measures. Our new measure, termed no-pain no-gain (NPNG), was derived from a probabilistic finite-state transition system, with nodes representing the different possible states a user may be in while interacting with a list of SCR results. A major advantage of NPNG over existing measures is that it can model a wide range of user browsing behaviours and is thus not biased towards a particular type of user.

Section 7.4 presented the results of SCR experiments that compared a large number of content structuring methods in an unstructured SCR task. The effectiveness scores obtained for these methods under a large set of traditional evaluation measures showed that a single measure cannot be used to appropriately determine the relative differences that may exist between structuring approaches. Instantiating NPNG with a set of specific and interpretable user models facilitated the analysis of the various SCR methods considered and permitted us to identify some of their strengths and weaknesses. The most effective structuring method for SCR, according to the user models considered, consists of re-ranking relatively short overlapping passages based on the relevance scores of their documents.

Chapter 8

Conclusions

This chapter summarises the main contributions of this thesis to the state-of-the-art in SCR. Concrete answers to the research questions previously stated in Chapter 1 are provided in Section 8.2, while potential directions for future work are discussed in Section 8.3.

8.1 Summary of main contributions

The following summarises the main contributions of the investigation and experimental work presented in this thesis across Chapters 5, 6, and 7.

Chapter 5

This chapter described a series of experiments conducted to determine whether prosodic/acoustic information can aid in the identification of informative terms occurring in spoken documents and passages, and whether this additional evidence could be used in combination with lexical information to provide better estimates of the relative importance that terms should be given in the SCR ranking process. The focus was on analysing whether prosodic/acoustic-derived features extracted at the term level can be effectively combined with standard term frequency statistics, such as TF and IDF estimates, to improve existing ranking functions for SCR. For this purpose, acoustic descriptors of pitch

(F0), energy (Erms), loudness (El), and duration (D) were extracted from the speech collections, standardised based on speaker information, and aligned against every term occurrence in the speech transcripts. Prominence scores for individual terms were then derived from this set of acoustic descriptors by aggregating individual acoustic features at the occurrence, passage, or document level. Experiments were then conducted to study the usefulness of these acoustically motivated scores for term weighting in a document and a passage retrieval task.

Initial experiments presented in Section 5.3 evaluated the document and passage retrieval effectiveness of various heuristic-based retrieval functions that sought to combine

prominence scores with lexical scores derived from term and document frequency estimates. Some drawbacks of these approaches were pointed out within the framework of probabilistic relevance for IR. Comparisons of these heuristic approaches against a strong lexical-based Okapi BM25 baseline indicated that the acoustically modified term weights did not provide any significant gains over this baseline. This was also the case when prominence scores were calculated from combinations of energy, pitch, and duration features, which had previously proven to be more useful than scores calculated from a single acoustic descriptor.

Despite these negative results, the experiments with heuristic functions demonstrated that prominence information does encode, albeit to a small degree, useful information about the relative importance of terms in speech. In particular, these experiments showed that passages and documents can be ranked more effectively in order of relevance when the ranking is based on prominence scores alone than when it is based on sub-optimal lexical-based weights. Thus, while information about prominent terms can potentially be exploited to quantify the significance of terms in speech content, such information is rather weak when compared to the information that can be inferred from term frequency statistics.

The analysis and experiments presented in Section 5.4 provided further insights into the relationship that exists between prominent terms and informative terms as predicted by frequency statistics. In particular, correlation and regression analyses were conducted to study how term weights derived from acoustic and lexical information may be related. The results of these analyses showed that acoustically emphasised terms tend to be those that occur rarely in the collections, thus supporting the observation of previous research that “new” or “unpredictable” terms are more likely to be made prominent in speech (Prince, 1981; Hirschberg and Grosz, 1992; Bell et al., 2009; Röhr, 2013).

Further experiments with binary classifiers demonstrated that acoustic/prosodic information encodes meaningful information about terms, which can even be used to distinguish between term occurrences appearing in relevant and non-relevant contexts. Yet, in this sense, acoustic information proved to be significantly less effective than lexical information, and did not provide any additional complementary information beyond lexical information. Additional experiments with a learning-to-rank approach also showed that prominence information can only improve upon term weights based on frequency statistics when the latter are poorly estimated.

The experiments described in this thesis continued previous investigations conducted by Silipo and Crestani (2000), Chen et al. (2001), and Guinaudeau and Hirschberg (2011) on the utilisation of prosodic prominence information for speech retrieval applications. The work presented in this thesis significantly expands this previous work in several ways. First, it re-validates the analysis made by Silipo and Crestani (2000) over a larger number of test collections which are orders of magnitude larger than the collection they used and contain a significantly larger number of queries and high-quality relevance assessments. Besides this, the analysis by Silipo and Crestani was based on human annotations

of stress levels, while ours was based on automatically extracted features, thus representing a more practical scenario.

The results reported in this thesis contrast with those reported by Silipo and Crestani, who found substantially stronger correlations between prominent and informative words. These differences are most likely attributable to: (i) possible differences between manual and automatic estimations of acoustic prominence scores; (ii) the nature of the data under analysis (telephone versus TV and lectures), as well as the length of the documents considered (sentences versus full documents); and (iii) the consideration of stop words and the lack of word length normalisation in the estimation of scores. With respect to (i), the signal processing algorithms used in this work for feature extraction, as well as the techniques adopted for feature aggregation, are far from perfect and should by no means be considered a reliable replacement for manual annotations of prosodic prominence. Further aspects related to this issue are discussed in Section 8.3. With respect to (iii), the fact that Silipo and Crestani included stop words in their analysis, together with the fact that they calculated prominence scores as a sum over the syllables of a word, may have inflated the correlation levels observed by the authors. This is because stop words, which are associated with low IR scores, tend to be pronounced significantly less prominently than content words, which are frequently associated with high IR scores. Also, words with more than one syllable are generally associated with more specific concepts, which are in turn less frequently used in spoken language and thus associated with high inverse document frequencies.

The contradictions in the observed levels of correlation between the experiments presented in this thesis and those conducted by Silipo and Crestani suggest that prominence information may be less useful at distinguishing fine differences between important and unimportant content-bearing words. While Silipo and Crestani found strong correlations between prominence and IR scores when including stop words in their analysis, our experiments show this correlation to weaken significantly when stop words and other non-content-bearing words are excluded from the analysis.

This thesis also re-validated previous experiments conducted by Chen et al. (2001) and Guinaudeau and Hirschberg (2011), who reported mixed results in topic tracking and SCR tasks on French and Mandarin Chinese spoken collections of broadcast news. In this respect, our results are consistent with those reported by Chen et al., who concluded that acoustically motivated weights can only provide non-significant improvements in retrieval effectiveness. Based on the research findings of this thesis, it is likely that the improvements observed by Guinaudeau and Hirschberg (2011) from using automatic prosodic prominence scores are due to an improper estimation of lexical-based term weights, possibly caused by differences between the external corpus they used to estimate the document frequency statistics and the one to which the retrieval models were finally applied.

Chapter 6

This chapter studied whether contextualisation techniques, whereby a passage is ranked based on the query terms it contains and also on those appearing elsewhere in its document, can help alleviate the detrimental effects that ASR errors can have on standard text retrieval methods when used to rank short pre-defined spoken passages. Experiments were conducted to quantify the ranking effectiveness of various contextualisation techniques for SCR under different conditions of speech recognition errors in the transcripts of spoken queries and documents.

An important finding from these experiments is that the use of contextual evidence can substantially increase spoken passage retrieval effectiveness when transcripts contain a large, albeit realistic, number of speech recognition errors. These improvements are so substantial that standard text-based retrieval methods with contextualisation can sometimes achieve levels of retrieval quality for extremely noisy ASR transcripts that are similar to those obtained for high-quality transcripts when context is not used. Although previous research has demonstrated that expansion techniques based on pseudo-relevance feedback can also provide an effective solution to tackle the presence of ASR errors, that technique is only effective when applied to parallel corpora, which may be difficult to obtain for some application domains and spoken collections. Instead, this thesis demonstrates that contextualisation techniques can provide a complementary solution to standard expansion techniques, one that can be applied without the need for external data.

Another important finding is that the use of context becomes increasingly beneficial for SCR as the amount of ASR errors in the speech transcripts increases. Further experiments with techniques that automatically adapt the amount of context considered for a passage depending on its transcription quality provided no substantial benefits in retrieval effectiveness over non-adaptive techniques, but they point to useful directions for future research.

This chapter also explored problems associated with the estimation of reliable term weights from relatively small speech collections, as well as the additional complications that this may bring to standard text-based retrieval methods when queries are verbose and spoken spontaneously. Solutions were proposed to mitigate this issue, namely the use of within-query term frequency (QF) and exponential inverse document frequency (EIDF), which were shown to be effective for retrieving short spoken passages in these particular adverse conditions.

Chapter 7

This chapter presented a novel framework for designing evaluation measures for SCR, based on a finite-state probabilistic automaton which explicitly models the browsing processes that users carry out when interacting with an SCR system. NPNG is a particular

evaluation measure defined under this framework which estimates the accumulated gain discounted by estimates of vertical and horizontal browsing effort. Under the NPNG measure and other traditional evaluation measures for SCR, this chapter then presented the results of a large study that evaluated the majority of content structuring strategies proposed in past SCR research and tried to determine whether there is a single “best” strategy.

A major contribution of the work presented in this chapter is the detailed analysis of current evaluation measures for SCR under the view of the gain-discount and gain-effort frameworks. This analysis permitted us to identify the limitations and biases of these measures, which informed the subsequent analysis of the performance of structuring methods and inspired the development of the NPNG framework. The evaluation framework upon which the NPNG measure is based constitutes the second main contribution of the work described in Chapter 7.

The comparison experiments conducted helped us to understand several of the strengths and weaknesses of the different content structuring techniques that have been proposed in past SCR research, and confirmed the various biases that are present in traditional evaluation measures. These experiments also confirmed that a single evaluation measure is not sufficient for determining the relative performance of SCR systems, and that several measures capturing different user preferences and behaviours are instead needed.

8.2 Research questions revisited

This section returns to the research questions stated in Section 1.2, to discuss some of the concrete answers that this PhD dissertation has helped to provide.

RQ-1: Can information about which prosodic units are made prominent in speech be combined with lexical information to derive improved term weighting schemes and retrieval functions that could enhance SCR effectiveness? The experiments and analysis conducted in Chapter 5 suggest that, even when differences between important and non-important terms can be effectively distinguished by looking at different prosodic indicators in the speech signal, the information conveyed by these descriptors does not seem to provide any additional knowledge about terms to complement lexical information and help to improve the quality of classical lexical-based term weighting functions. Simple collection statistics based on the distribution of terms within and across documents appear to be sufficient to capture much of the information that the additional prosodic indicators explored in this thesis are able to express, at least when the latter are extracted automatically using standard signal processing algorithms. When lexical-based weights are absent or poorly estimated, a combination of lexical and acoustically derived weights may provide improved retrieval effectiveness. Given these findings, it is difficult to envisage how prosodic information at the word level could best be integrated into existing lexical-based retrieval models to improve their retrieval capabilities.

RQ-2: Can contextualisation techniques increase the robustness of standard text retrieval approaches to ASR errors when the retrieval units are made from short fragments of speech transcripts? The experiments described in Chapter 6 demonstrate that contextualisation techniques can indeed provide increased ranking effectiveness in SCR when pre-defined spoken passages are transcribed with extremely high word error rates (WERs). This is evidenced by the fact that, when passages are contextualised in the relevance scoring process, retrieval effectiveness decreases more slowly as the number of ASR errors in the speech transcripts increases. Standard text retrieval approaches that do not use contextualisation can suffer greatly from the deletion and substitution of query terms if retrieval passages are short. In these circumstances, relevant regions containing a high density of query terms relative to others become harder to distinguish from non-relevant regions as the volume of query term occurrences in the former reduces dramatically. Contextualisation techniques are particularly effective in these conditions, since considering a passage within the context of its document can help to increase the volume of query term occurrences seen in the region to levels that make it distinguishable from non-relevant sections. Considering context at different granularity levels, for instance local context around a passage and global context from its document, tends to provide the greatest robustness to ASR errors. Another observed trend is that standard text retrieval methods tend to perform more robustly against ASR errors when relying on increasing amounts of context, or equivalently, when longer pseudo-passages are considered.

RQ-3-A: Can existing evaluation measures for SCR estimate levels of user satisfaction appropriately? The critical review of existing evaluation measures for SCR presented in Chapter 7 highlighted the various limitations that these measures have and showed that they are often unable to appropriately capture all aspects that users may consider important about the quality of a ranked list of SCR results. Factors such as the time a user may take to process the search results, as well as the effects that advanced visualisation interfaces or VCR controls may have on this process, are currently beyond the scope of existing evaluation measures. The dimensions of user satisfaction that current measures do try to account for are the amount of effort that users need to invest in vertical and horizontal browsing and the amount of relevant content that users can access by inspecting the search results. The numeric estimates for these dimensions that existing measures calculate are often based on a set of simple deterministic rules and assumptions about user behaviour, which result in scores that are unlikely to correlate well with true levels of user satisfaction for all types of users. Further, these estimates are commonly combined disproportionately across dimensions, so that some dimensions tend to be over-emphasised by the measure.

Consequently, the user satisfaction estimates that these measures calculate can only be considered appropriate for a very specific type of user.

RQ-3-B: Can enhanced evaluation measures be developed to address the shortcomings of existing evaluation measures for SCR? The evaluation framework described in Chapter 7 provides a more explicit model of how users may process a ranked list of SCR results. Unlike existing evaluation measures for SCR, which each represent a specific type of user, the proposed evaluation framework is general enough to allow for the instantiation of different user models, each of which represents a particular type of user, with a specific type of browsing behaviour and a set of preferences as to which factors may be more or less important for their satisfaction. The increased flexibility of this framework permits us to analyse the effectiveness of SCR systems on a per-user basis, and is thus more useful for identifying the limitations of SCR methods. In addition, the proposed framework can be instantiated with parameters learnt from real user interactions. Lastly, the proposed framework can be adapted to represent VCR-like controls more explicitly, as well as specific browsing strategies that may be implemented by users when interacting with a playback tool.

RQ-3-C: Which content structuring techniques are most effective in SCR in terms of maximising user satisfaction? The results of the experiments described in Section 7.4 suggest that there is no single practical structuring technique that would provide maximal user satisfaction for every possible user. On the one hand, structuring techniques that induce long passages, relative to the length of the regions known to contain relevant material, can provide improved ranking effectiveness by returning fewer search results and positioning passages with relevant content above irrelevant ones, thus reducing the amount of vertical browsing effort that users need to invest. On the other hand, techniques that induce short overlapping passages with arbitrary starting points are able to detect the locations of query phrases in a document, which can serve as useful indicators of where the relevant content may appear in the document. Returning these locations as search results can therefore reduce the amount of horizontal browsing effort that users need to invest. By contextualising passages based on the relevance scores of their container documents, structuring techniques that are effective at identifying accurate entry points can benefit from the improvements in ranking quality and stability seen from using long ranking units. This contextualised structuring technique based on fixed-length overlapping windows can be even more effective than a technique based on manually defined units for patient users. Yet, there is still much to be gained from improving the automatic detection of entry points.

8.3 Future work

This section describes potential directions for future work based on the research described in this thesis.

Improved extraction and integration of prosodic prominence information

In order to obtain a prominence score associated with a single term, which may have multiple word occurrences in a single speech file, the techniques used in this thesis calculated multi-scale aggregations over low-level descriptors of energy, loudness, pitch, and duration estimates of the words, at different levels of content granularity. Aggregations were first applied within words and then across the sets of words associated with the same indexing term to obtain term-level descriptors of prosodic prominence. The aggregation functions involved in this process consisted of simple descriptive statistics, such as averages and extremes. This multi-stage aggregation process is illustrated by the diagrams in Figures 5.3, 5.6a, and 5.6b.

Given the high variability and complexity of the speech signal, a significant limitation of the multi-stage aggregation approach used in this thesis is that it is not immune to estimation errors that may be present in the low-level descriptors, especially when selecting extreme values (maximums and minimums) from within low-level contours. Considering the hierarchical multi-level nature of the feature extraction process, where features at higher levels are “pooled” from those at lower levels, there is scope for the application of deep neural networks (DNNs) to this problem. In particular, convolutional neural networks (CNNs) (LeCun et al., 1995) seem well suited to these needs, as this type of network architecture is frequently designed to process variable-length sequential data in a hierarchical fashion. A popular architecture consists of several convolutional layers followed by a max or average “pooling” operation.

CNNs have been demonstrated to be capable of learning effective high-quality representations from highly complex sequential data, such as images, video, and audio, which are useful for a number of machine learning tasks, such as image and video classification (Krizhevsky et al., 2012; Karpathy et al., 2014), emotion detection (Ghayoumi and Bansal, 2016), and speech recognition (Abdel-Hamid et al., 2014). In relation to the extraction of prosodic prominence information, CNNs could be used to learn useful feature representations for spoken words given a set of low-level prosodic contours. These extracted features could then be fed into a learning-to-rank model to determine their usefulness. Besides the LambdaMART models explored in this thesis, other learning-to-rank models based on neural networks would better suit this integration. In particular, the recently proposed family of deep structured semantic models (DSSMs) (Huang et al., 2013; Shen et al., 2014b,a), based on several neural network architectures, has already been successfully applied to document retrieval tasks.
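As a rough illustration of this direction, the sketch below shows a small convolutional model that pools a word-level prominence embedding from variable-length low-level prosodic contours. The architecture, layer sizes, and the choice of three input descriptors are assumptions made purely for illustration, not a design evaluated in this thesis.

```python
# Illustrative sketch (assumed architecture, not a tested design): a small 1D CNN
# that maps a variable-length sequence of low-level prosodic descriptors for one
# spoken word (e.g. frame-wise F0, RMS energy, loudness) to a fixed-size
# prominence embedding, which could then be fed into a learning-to-rank model.
import torch
import torch.nn as nn

class ProminenceCNN(nn.Module):
    def __init__(self, n_descriptors=3, n_filters=32, embedding_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_descriptors, n_filters, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Max pooling over time handles the variable word length.
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.proj = nn.Linear(n_filters, embedding_dim)

    def forward(self, contours):
        # contours: (batch, n_descriptors, n_frames), one word per example.
        h = self.conv(contours)
        h = self.pool(h).squeeze(-1)   # (batch, n_filters)
        return self.proj(h)            # (batch, embedding_dim)

# Example: a batch of 4 words, each with 3 descriptor contours of 120 frames.
model = ProminenceCNN()
word_embeddings = model(torch.randn(4, 3, 120))
print(word_embeddings.shape)  # torch.Size([4, 16])
```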

Dealing with ASR errors in the speech transcripts

The experimental work presented in this thesis demonstrates that, by considering longer versions of short spoken passages expanded with the contents of their documents in the retrieval process, text retrieval methods can be made more robust to ASR errors. In addition to considering the terms appearing in the 1-best hypothesis of the ASR, contextualisation could be based on additional terms appearing in the N-best lists of each utterance, or in other lattice-based output representations produced by the ASR. Exploiting the term information from N-best lists has also been shown to be useful for reducing the effects of ASR errors in the past (Siegler et al., 1997; Tsuge et al., 2011), and could be integrated well with the contextualisation techniques described in this thesis. In a positional-based technique, for instance, terms from the N-best hypotheses could be treated as appearing in the outskirts of the passage, with distances given by their recognition probabilities. Terms appearing in a hypothesis further down the N-best ranking could be positioned further away from the passage to reduce their contribution to the relevance scores relative to more likely correct recognition hypotheses from the passage itself and its surroundings. An effective integration approach would therefore need to determine the right balance between 1-best context extracted from neighbouring passages and N-best context extracted from the contextualised passage. While 1-best context is less likely to contain ASR errors, it may be less topically related to the contextualised passage than this passage’s N-best context, so finding the right trade-off is critical.

Besides contextualising with multiple ASR hypotheses, there is also scope to apply contextualisation techniques in combination with more standard expansion techniques based on pseudo-relevance feedback (PRF). Previous research in SCR has repeatedly shown that PRF-based expansion using external corpora can also be effective at reducing the effects produced by ASR errors (Singhal et al., 1999; Abberley et al., 1998; Gauvain et al., 1999; Johnson et al., 2000; Renals and Abberley, 2000). In this respect, an interesting research direction is to explore the extent to which contextualisation and PRF-based expansion techniques can complement each other to provide levels of robustness that would not be possible by using either technique in isolation.
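Returning to the positional idea above, the following sketch assigns terms from lower-ranked hypotheses pseudo-positions outside the passage boundaries so that a distance-decaying kernel attenuates their contribution. The offset scheme, the Gaussian kernel, and all parameter values are hypothetical choices made only for illustration; no such technique was evaluated in this thesis.

```python
# Illustrative sketch (assumed scheme, not evaluated in this thesis): terms drawn
# from the N-best ASR hypotheses of a passage are assigned pseudo-positions in the
# "outskirts" of the passage, so that a distance-decaying kernel reduces their
# contribution relative to terms in the 1-best transcript.
import math

def nbest_term_weights(passage_len, nbest, base_offset=10.0, sigma=25.0):
    """passage_len: number of 1-best tokens in the passage.
    nbest: list of hypotheses, each a list of terms; index 0 is the 1-best.
    Returns a dict term -> accumulated positional weight."""
    weights = {}
    for rank, hypothesis in enumerate(nbest):
        # 1-best terms sit inside the passage; lower-ranked hypotheses are
        # pushed further away the deeper they sit in the N-best list.
        distance = 0.0 if rank == 0 else passage_len + base_offset * rank
        kernel = math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
        for term in hypothesis:
            weights[term] = weights.get(term, 0.0) + kernel
    return weights

print(nbest_term_weights(
    passage_len=12,
    nbest=[["speech", "retrieval"], ["speech", "retrieve"], ["each", "retrieval"]],
))
```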

Improved evaluation of SCR techniques

Our evaluation model could be extended to represent more realistic user models. In particular, instead of just assuming a single forward or backward horizontal browsing mode of “straight” playback, additional states could be added to represent other browsing strategies which users may adopt when using VCR-like controls. Besides modelling the fact that users may be less interested in relevant results that have already been seen at previous ranks, the NPNG model could be generalised to account for the fact that users will be less interested in assessing material they have already seen. If users are presented with a search result that points to a document they have previously browsed, it is likely

that they will either skip this item or spend substantially less time on it. Such an event could be further conditioned on whether the user has previously encountered some relevant material in the document, as this may also affect their browsing behaviour.

Advancing the state-of-the-art in SCR evaluation necessitates the design and implementation of user studies to provide a better understanding of how users interact with SCR systems in practice. Such studies would provide invaluable information which would serve both for refining the structure and states of the NPNG model and for estimating the parameters of the model based on data from real users.

In the experiments with structuring approaches presented in Chapter 7, NPNG was instantiated with four user models. Alternatively, the measure could be instantiated with a population of user models, as proposed by Carterette et al. (2011) and Clarke and Smucker (2014). The population of NPNG measures instantiated with each of these users could then be used to obtain a distribution of effectiveness values for a query for each of the SCR methods under evaluation. Finally, such distributions could be analysed to study how effectiveness might vary across users (effect sizes). On occasion, an SCR method with lower average effectiveness and variance may be preferable to one with greater average effectiveness but greater variance, since this may indicate that the former method performs more consistently across users.

Improved jump-in time point detection

The experiments described in Chapter 7 evaluated SCR techniques which sought to determine the location within a spoken document at which users should begin playback of the speech content when seeking relevant information. In this regard, the basic approach implemented by these techniques consisted of segmenting the speech content into short passages, and then identifying passages associated with high relevance scores with respect to the query. While such a technique was shown, in the experiments described in Chapter 7, to provide improved entry point accuracy compared to using longer or dynamically constructed passages, future research should investigate alternative methods for jump-in point detection that could provide further reductions in horizontal browsing effort.

A possible direction for further research is to investigate whether prosodic/acoustic information could be useful for the task of entry point detection. In this respect, previous research in automatic topic segmentation has demonstrated that prosodic information can provide useful cues about the location of topic shifts in speech. Thus, a supervised approach whereby a machine learning model is trained with examples of entry points pointing to content that is relevant to some query may be worth exploring in future research. In addition to exploiting prosodic structure, prosodic prominence information at the word or sentence level could potentially be used to bias entry points towards the locations of query terms that are detected to provide “novel” or “new” information relative to the content being discussed in the speech recording.

Creation of test collections for SCR research

This thesis sought answers to the research questions proposed by carrying out an empirical investigation of SCR methods over two test collections: the BBC collection of broadcast TV content, and the SDPWS collection of lecture recordings. Despite the findings these experiments led to, the BBC and SDPWS collections present several limitations.

The BBC collection is an audiovisual collection where information may be conveyed through both the visual and spoken modalities. Even if the critical information may most frequently be conveyed through speech, the visual content should not be disregarded by retrieval methods if the objective is to improve the relevance of search results. The importance of visual information is noticeable in cases in which query creators selected keywords with the aim of identifying visual concepts that were relevant to their information needs. The fact that some of these topics may implicitly rely on visual information suggests that the BBC topics may not be the most appropriate for the investigation of prosody-enhanced SCR methods, which will likely not have any effect on the ranking for visually driven topics.

Since video was not recorded during the creation of the SDPWS collection, this collection is a speech-only collection and as such may present a more suitable test-bed for the evaluation of the prosody-enhanced methods studied in this thesis. Despite this, the size of the SDPWS collection is orders of magnitude smaller than that of most test collections used in IR research. Using small collections for IR research can have a negative effect on the reliability of the experiments conducted and the generalisability of their results.

For the reasons explained above, future research in SCR must invest in the creation of large collections of speech recordings that would permit a more direct study of the problems investigated in this thesis. Such a test collection should ideally have the following characteristics: (i) it should contain a relatively large number of spoken documents; (ii) if it contains audiovisual documents, the information of importance should be within the audio track; (iii) it should contain a large number of topics targeting the spoken information contained in the documents; and (iv) it should contain examples of relevant passages with precise time boundaries to allow for the investigation of content structuring and jump-in point detection approaches.

Appendices

Appendix A

List of publications

The following list contains the publications derived from or related to this PhD:

1. Racca, D. N. and Jones, G. J. F. (2016a). DCU at the NTCIR-12 SpokenQuery&Doc-2 task. In Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-12), pages 180–185, Tokyo, Japan

2. Racca, D. N. and Jones, G. J. F. (2016b). On the effectiveness of contextualisation techniques in spoken query spoken content retrieval. In Proceedings of the 39th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 2016), pages 933–936, Pisa, Italy

3. Racca, D. N. and Jones, G. J. F. (2015a). Evaluating Search and Hyperlinking: An example of the design, test, refine cycle for metric development. In Proceedings of the MediaEval 2015 Multimedia Benchmark Workshop, Wurzen, Germany

4. Racca, D. N. and Jones, G. J. F. (2015b). Incorporating prosodic prominence evidence into term weights for spoken content retrieval. In Proceedings of the 16th International Conference on Spoken Language Processing (INTERSPEECH 2015), pages 1378–1382, Dresden, Germany

5. Racca, D. N., Eskevich, M., and Jones, G. J. F. (2014). DCU search runs at MediaEval 2014 Search and Hyperlinking. In Working Notes Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Catalunya, Spain

6. Racca, D. N. and Jones, G. J. F. (2014). DCU at the NTCIR-11 SpokenQuery&Doc task. In Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-11), pages 376–383, Tokyo, Japan

7. Eskevich, M., Aly, R., Racca, D. N., Chen, S., and Jones, G. J. F. (2015). SAVA at MediaEval 2015: Search and anchoring in video archives. In Working Notes Proceedings of the MediaEval 2015 Multimedia Benchmark Workshop, Wurzen, Germany

8. Eskevich, M., Aly, R., Racca, D. N., Ordelman, R., Chen, S., and Jones, G. J. F. (2014). The Search and Hyperlinking task at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain

9. Chen, S., Curtis, K., Racca, D. N., Zhou, L., Jones, G. J. F., and O’Connor, N. E. (2015). DCU ADAPT@ TRECVID 2015: Video hyperlinking task. In Proceedings of TRECVID 2015, Gaithersburg, MD, USA

Appendix B

Index Similarity Metrics

This appendix describes the various index similarity metrics used in this thesis, specifically unique term error rate (UTER), term error rate (TER) (Johnson et al., 1999a), binary index accuracy (BIA), and ranked index accuracy (RIA) (van der Werff and Heeren, 2007). These metrics are calculated for each automatic transcript index against the reference index. The automatic (hypothesis) index is the index that results from indexing the ASR transcripts of a speech collection, while the reference index is the one that results from indexing the reference transcripts.

TER (Johnson et al., 1999a) is calculated as the sum of the absolute term frequency differences between the reference and hypothesised documents, divided by the length of the reference document, as shown in Equation B.1,

\[ \mathrm{TER} = \frac{\sum_i \lvert \mathit{ref}_i - \mathit{hyp}_i \rvert}{\sum_i \mathit{ref}_i} \qquad \text{(B.1)} \]

where ref_i and hyp_i denote the frequency counts of term i in the reference and hypothesis transcripts respectively. Thus, a TER of 0 indicates that the ASR index is an exact representation of the reference index, whereas a TER of 1 indicates that there are as many recognition errors as term occurrences contained in the reference document. As opposed to TER, UTER disregards term counts and puts more weight on presence and absence errors, which are arguably more problematic for SCR applications. UTER can be calculated as shown in Equation B.1, where ref_i and hyp_i are instead binary values indicating the presence (1) or absence (0) of term i in the reference or hypothesis documents respectively.

An issue with measures such as TER and UTER is that they can acquire values higher than 1 if the hypothesis contains a large number of insertion errors. BIA (van der Werff and Heeren, 2007) solves this problem by calculating the product of the fraction of unique terms from the reference found in the hypothesis document (recall) and the fraction of unique terms from the hypothesis found in the reference document (precision).

BIA can then be calculated as shown in Equation B.2,

\[ \mathrm{BIA} = \frac{\lvert \mathit{ref} \cap \mathit{hyp} \rvert}{\lvert \mathit{ref} \rvert} \cdot \frac{\lvert \mathit{ref} \cap \mathit{hyp} \rvert}{\lvert \mathit{hyp} \rvert} \qquad \text{(B.2)} \]

where ref and hyp denote the sets of terms contained in the reference and hypothesis transcripts. Finally, RIA extends BIA to consider term and document frequencies, and is calculated as the cosine similarity between the normalised TF-IDF vector representations of the reference and hypothesised documents, as shown in Equation B.3.

\[ \mathrm{RIA} = \frac{\mathit{ref} \cdot \mathit{hyp}}{\lvert \mathit{ref} \rvert\,\lvert \mathit{hyp} \rvert} \qquad \text{(B.3)} \]

By contrast to the other measures, RIA considers the relative importance of terms as assigned by a retrieval system, effectively down-weighting the contribution of highly frequent terms that are commonly less useful for retrieval applications.
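For reference, the sketch below computes the four metrics from per-document term-frequency dictionaries following Equations B.1 to B.3. The IDF values used for RIA are supplied by the caller; the uniform IDF in the example is a simplification made purely for illustration.

```python
# Sketch of the index similarity metrics defined above, computed per document
# from term-frequency dictionaries (Equations B.1 to B.3).
import math
from collections import Counter

def ter(ref, hyp):
    terms = set(ref) | set(hyp)
    diff = sum(abs(ref.get(t, 0) - hyp.get(t, 0)) for t in terms)
    return diff / sum(ref.values())                      # Equation B.1

def uter(ref, hyp):
    r, h = set(ref), set(hyp)
    return len(r ^ h) / len(r)                           # B.1 with binary counts

def bia(ref, hyp):
    r, h = set(ref), set(hyp)
    inter = len(r & h)
    return (inter / len(r)) * (inter / len(h))           # Equation B.2

def ria(ref, hyp, idf):
    # Cosine similarity of TF-IDF vectors built from caller-supplied IDF values.
    terms = set(ref) | set(hyp)
    r = {t: ref.get(t, 0) * idf.get(t, 0.0) for t in terms}
    h = {t: hyp.get(t, 0) * idf.get(t, 0.0) for t in terms}
    dot = sum(r[t] * h[t] for t in terms)
    norm = math.sqrt(sum(v * v for v in r.values())) * math.sqrt(sum(v * v for v in h.values()))
    return dot / norm if norm else 0.0                   # Equation B.3

ref = Counter("spoken content retrieval of spoken content".split())
hyp = Counter("spoken concept retrieval of spoken content".split())
idf = {t: 1.0 for t in set(ref) | set(hyp)}              # uniform IDF, example only
print(ter(ref, hyp), uter(ref, hyp), bia(ref, hyp), ria(ref, hyp, idf))
```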

Appendix C

LambdaMART

LambdaMART is based on LambdaRank, which is in turn based on RankNet (Burges et al., 2011). All of these models see the ranking problem as a classification task, where the goal is to determine the order in which a pair of documents (d_i, d_j) should be ranked for a given query. The cost function that these methods seek to optimise during training is designed to capture the magnitude of the errors present in a given ranking of documents, while remaining differentiable so that stochastic gradient descent optimisation can be used to adjust the model’s parameters.

If s_i = f(d_i) and s_j = f(d_j) denote respectively the scores produced by a ranking function f(x) for documents d_i and d_j with s_i > s_j, then the pairwise error in RankNet is calculated as shown in Equation C.1,

\[ E = \frac{1}{2}(1 - S_{ij})\,\sigma(s_i - s_j) + \log\left(1 + e^{-\sigma(s_i - s_j)}\right) \qquad \text{(C.1)} \]

where σ is a scaling constant, and S_ij is either 1, -1, or 0 if d_i is deemed, respectively, more relevant than d_j, less relevant than d_j, or equally relevant to d_j. RankNet tries to minimise the sum of pairwise errors that are present in a ranked list of results. Hence, the cost or error associated with a relevant document ranked lower than non-relevant documents increases with the depth at which the relevant document is ranked. In the original implementation of RankNet, the underlying machine learning model used was a neural network, whose weights w_k were updated via gradient descent optimisation. The gradient of a pairwise error with respect to the model’s weights, ∂E/∂w_k, can be expressed as the difference in the gradients of the documents’ scores multiplied by a scalar which reflects the magnitude of the pairwise error, as shown in Equation C.2.

\[ \frac{\partial E}{\partial w_k} = \sigma\left(\frac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma(s_i - s_j)}}\right)\left(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}\right) = \lambda_{ij}\left(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}\right) \qquad \text{(C.2)} \]

While minimising the number and magnitude of pairwise errors is likely to produce a model that can improve the quality of a given ranked list, the cost function used in RankNet still gives increased importance to relevant documents located at the lower-ends

of the ranking. This goes against most traditional IR evaluation measures, such as MAP or NDCG, which give more importance to documents ranked at the top of the ranked list. LambdaRank tackles this problem by modifying Equation C.2 to be more sensitive to the amount of change in the IR measure under consideration that results from swapping the ranks of two erroneously ordered documents. Specifically, LambdaRank redefines λ_ij in Equation C.2 to the form shown in Equation C.3,

\[ \lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}\,\lvert \Delta \mathrm{MAP} \rvert \qquad \text{(C.3)} \]

where |ΔMAP| denotes the difference in MAP (or any other IR measure) that results from swapping the ranks of d_i and d_j in the ranked list for a query. Finally, for a given document d_k, the summation of the pairwise lambdas involving d_k in a ranked list is calculated as shown in Equation C.4.

\[ \lambda_k = \sum_{j} \lambda_{kj} - \sum_{i} \lambda_{ik} \qquad \text{(C.4)} \]

LambdaMART combines ideas from both LambdaRank and Gradient Boosted Regression Trees (GBRT) (Friedman, 2001), also known as Multiple Additive Regression Trees (MART). The latter are based on the more general Gradient Boosting (Friedman, 2001) framework for training an ensemble F(\vec{x}_i) of base-learner models in an iterative fashion, so that they minimise an arbitrary differentiable loss function L(y_i, F(\vec{x}_i)) given training data \{(\vec{x}_i, y_i)\}_{i=1}^{M}. Given an initial base-learner model f_0, the gradient boosting algorithm augments the ensemble at iteration N with a new base-learner model f_N that seeks to correct the mistakes made by the rest of the models in the ensemble. The output produced by the ensemble for input \vec{x}_i at iteration N, F_N(\vec{x}_i), is the weighted average of the outputs of its base models, as shown in Equation C.5.

\[ F_N(\vec{x}_i) = \sum_{n=0}^{N} \alpha_n f_n(\vec{x}_i) \qquad \text{(C.5)} \]

The next base-learner of the ensemble, f_{N+1}, is then constructed so that the overall error of the ensemble decreases when f_{N+1} is included. This corresponds to finding a base-learner that is maximally correlated with the negative of the gradient of the ensemble’s error. In the case of MART, where regression trees are used as base-learners, the next regression tree f_{N+1} is trained such that its predictions are strongly correlated with the amounts ŷ_i for input vectors \vec{x}_i, defined as shown in Equation C.6.

\[ \hat{y}_i = -\frac{\partial L(y_i, F_N(\vec{x}_i))}{\partial F_N(\vec{x}_i)} \qquad \text{(C.6)} \]

LambdaMART is a MART model in which the regression trees in the ensemble are trained with the lambda values, λ_i, as targets. In other words, a LambdaMART model is a MART model in which ŷ_i = λ_i, and λ_i is defined as in Equation C.4.
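A minimal sketch of the lambda computation in Equations C.3 and C.4 is given below. The swap delta is computed for (unnormalised) DCG rather than MAP purely to keep the example short, and the gain and discount definitions are assumptions made only for illustration.

```python
# Sketch of the lambda computation in Equations C.3 and C.4. The swap delta is
# computed here for (unnormalised) DCG rather than MAP purely for simplicity;
# sigma and the gain/discount choices are illustrative assumptions.
import math

def lambdas(scores, labels, sigma=1.0):
    """scores: current model scores, one per document, in ranked-list order.
    labels: graded relevance labels aligned with scores.
    Returns lambda_k for every document (Equation C.4)."""
    n = len(scores)
    gain = [2 ** l - 1 for l in labels]
    disc = [1.0 / math.log2(r + 2) for r in range(n)]   # rank r is 0-based here
    lam = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue                                 # keep pairs where i is more relevant
            # |Delta metric| when the ranks of documents i and j are swapped.
            delta = abs((gain[i] - gain[j]) * (disc[i] - disc[j]))
            # Equation C.3 with S_ij = 1 (i more relevant than j).
            lam_ij = -sigma / (1.0 + math.exp(sigma * (scores[i] - scores[j]))) * delta
            lam[i] += lam_ij                             # Equation C.4
            lam[j] -= lam_ij
    return lam

print(lambdas(scores=[0.2, 1.3, 0.4], labels=[2, 0, 1]))
```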

Appendix D

Coordinate Ascent Optimisation

D.1 Line Search

Let p = ⟨p_1, p_2, . . . , p_n⟩ be the vector of parameters that we want to optimise and θ = ⟨θ_1, θ_2, . . . , θ_n⟩ an initial parameter configuration. For a particular parameter p_i that can accept values in some interval α = [x, y], a line search is performed by evaluating the objective function at M distinct values of p_i while the values of the rest of the parameters are kept fixed. The M values are sampled equidistantly in α and initially centred around θ_i. At each subsequent iteration of the algorithm, the size of the search interval α is reduced by a factor 0 < r < 1, and the value of p_i that best maximises the objective function so far is chosen as the centre of the following M samples that are taken from

α. This procedure is repeated for p_i until: (i) the size of α becomes smaller than some ε; (ii) a maximum number maxit of iterations have been performed; or (iii) the optimal value of p_i remains the same after minit iterations. In our implementation of line search, we set M = 20, maxit = 30 and minit = 5. Additionally, we set ε = 0.01 and r = 0.8 for parameters that can take values in ℝ, while for those that can only take values in ℕ we set ε = 1 and reduce the size of α by 1 at every iteration. In order to reduce the size of the search space for parameters in ℝ, we truncate their values to two decimal positions.

D.2 Promising Directions

A line search can be performed for every parameter in p to obtain an optimal configuration of values θ*. The vector from θ to θ* suggests a “promising” direction in the multidimensional parameter space, so we further perform an additional line search in this direction by modifying the values of all the parameters linearly from θ_i to θ_i*. By doing this, we hope to explore interesting regions of the parameter space which may lead us to find even better parameter configurations. The process of performing n one-dimensional line searches plus one final multi-dimensional line search in the promising direction is commonly referred to as an epoch. In our implementation, we perform up to a maximum of 10 epochs and stop

275 searching when the process results in the same parameter configuration in two consecutive epochs.
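A sketch of one such epoch, building on the line_search sketch above, might look as follows; the bounds list and the granularity of the final multi-dimensional search are assumptions introduced for illustration, not details taken from our implementation.

```python
def coordinate_ascent_epoch(objective, theta, bounds, steps=20):
    """One epoch: n one-dimensional line searches followed by a line search
    along the 'promising' direction from the starting configuration theta
    to the configuration found by the one-dimensional searches."""
    start = list(theta)
    current = list(theta)
    best_score = objective(current)
    # One line search per parameter, keeping the remaining parameters fixed
    for i in range(len(current)):
        low, high = bounds[i]
        current, best_score = line_search(objective, current, i, low, high)
    # Additional search along the direction from start to current,
    # moving all parameters linearly between the two configurations
    for k in range(steps + 1):
        t = k / steps
        candidate = [s + t * (c - s) for s, c in zip(start, current)]
        score = objective(candidate)
        if score > best_score:
            best_score, current = score, candidate
    return current, best_score
```

Epochs would then be repeated, up to 10 in our setting, until two consecutive epochs return the same parameter configuration.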

Appendix E

Results of experiments with binary classifiers

Table E.1 presents the results of the SCR experiments described in Section 5.4.2, and originally reported in Racca and Jones (2015b), with a modified BM25 function that incorporates the predictions of a binary classifier trained to distinguish between “relevant” and “non-relevant” occurrences of query terms given acoustic features. The table reports mean average precision (MAP) for SCR results produced by a standard BM25 function (BM25) and by the modified BM25 function that incorporates the classifier’s predictions (PROS). The results are for the manual (MAN) and MATCH document transcripts of the SDPWS2 collection and query sets SD2 and SQD1. MAP figures in bold in the table show statistically significant differences based on a paired t-test (p < 0.05).

Table E.1: Retrieval results of SCR experiments with a modified BM25 function that incorporates the predictions of a binary classifier trained with acoustic features to classify between “relevant” and “non-relevant” occurrences of query terms. These results were originally reported in Racca and Jones (2015b).

Transcript   Train   Test   b, k1 set to best in   PROS MAP   BM25 MAP
MAN          SD2     SQD1   train                  .200       .156
MAN          SD2     SQD1   test                   .234       .192
MAN          SQD1    SD2    train                  .305       .428
MAN          SQD1    SD2    test                   .442       .445
MATCH        SD2     SQD1   train                  .111       .109
MATCH        SD2     SQD1   test                   .134       .129
MATCH        SQD1    SD2    train                  .248       .242
MATCH        SQD1    SD2    test                   .275       .266

Bibliography

Abberley, D., Kirby, D., Renals, S., and Robinson, T. (1999a). The THISL broadcast news retrieval system. In Proceedings of the ESCA ETRW Workshop on Accessing Information in Spoken Audio, pages 14–19, Cambridge, UK.

Abberley, D., Renals, S., and Cook, G. (1998). Retrieval of broadcast news documents with the THISL system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1998), pages 3781–3784, Seattle, Washington, USA.

Abberley, D., Renals, S., Ellis, D., and Robinson, T. (1999b). The THISL SDR system at TREC-8. In NIST Special Publication 500-246: Proceedings of the 8th Text REtrieval Conference (TREC-8), pages 699–706, Gaithersburg, Maryland.

Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., and Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545.

Abdul-Jaleel, N., Allan, J., Croft, W. B., Diaz, F., Larkey, L., Li, X., Smucker, M. D., and Wade, C. (2004). UMass at TREC 2004: Novelty and HARD. In NIST Special Pub- lication: SP 500-26. Proceedings of the 13th Text Retrieval Conference (TREC 2004), Gaithersburg, Maryland.

Akiba, T., Aikawa, K., Itoh, Y., Kawahara, T., Nanjo, H., Nishizaki, H., Yasuda, N., Yamashita, Y., and Itou, K. (2008). Test collections for spoken document retrieval from lecture audio data. In Proceedings of the 6th International Conference on Language Resources and Evaluation, (LREC 2008), pages 1572–1577, Marrakech, (Morocco).

Akiba, T., Nishizaki, H., Aikawa, K., Hu, X., Itoh, Y., Kawahara, T., Nakagawa, S., and Nanjo, H. (2013a). Overview of the NTCIR-10 SpokenDoc-2 task. In Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR- 10), pages 573–587, Tokyo, Japan.

Akiba, T., Nishizaki, H., Aikawa, K., Kawahara, T., and Tomoko, M. (2011). Overview of the IR for spoken documents task in NTCIR-9 workshop. In Proceedings of the 9th

NTCIR Workshop Meeting on Evaluation of Information Access Technologies, Information Retrieval, Question Answering, and Cross-Lingual Information Access (NTCIR-9), pages 223–235, Tokyo, Japan.

Akiba, T., Nishizaki, H., Nanjo, H., and Jones, G. J. F. (2014). Overview of the NTCIR-11 SpokenQuery&Doc task. In Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-11), pages 350–364, Tokyo, Japan.

Akiba, T., Nishizaki, H., Nanjo, H., and Jones, G. J. F. (2016). Overview of the NTCIR-12 SpokenQuery&Doc-2 task. In Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-12), pages 167–179, Tokyo, Japan.

Akiba, T., Takigami, T., Ohno, T., and Kase, K. (2013b). DTW-distance-ordered spoken term detection and STD-based spoken content retrieval: Experiments at NTCIR-10 SpokenDoc-2. In Proceedings of the 10th NTCIR Conference on Evaluation of Inform- ation Access Technologies (NTCIR-10), pages 618–625, Tokyo, Japan.

Alink, W. and Cornacchia, R. (2011). Out-of-the-box strategy for Rich Speech Retrieval MediaEval 2011. In Working Notes Proceedings of the MediaEval 2011 Multimedia Benchmark Workshop, Pisa, Italy.

Allan, J. (2001). Perspectives on information retrieval and speech. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2001), Workshop on Information Retrieval Tech- niques for Speech Applications, pages 1–10, New Orleans, LA, USA.

Allan, J. (2003). HARD track overview in TREC 2003: High accuracy retrieval from documents. In NIST Special Publication: 500-255 Proceedings of the 12th Text REtrieval Conference (TREC-2003), pages 24–37, Gaithersburg, Maryland.

Allan, J. (2004). HARD track overview in TREC 2004: High accuracy retrieval from documents. In NIST Special Publication: SP 500-26. Proceedings of the 13th Text Retrieval Conference (TREC 2004), Gaithersburg, Maryland.

Aly, R., Eskevich, M., Ordelman, R., and Jones, G. J. F. (2013a). Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical report, University of Twente.

Aly, R., Ordelman, R. J., Eskevich, M., Jones, G. J. F., and Chen, S. (2013b). Linking inside a video collection: what and how to measure? In Proceedings of the 22nd Inter- national Conference on World Wide Web Companion, pages 457–460, Rio de Janeiro, Brazil.

Aly, R., Verschoor, T., and Ordelman, R. (2011). UTwente does Rich Speech Retrieval at MediaEval 2011. In Working Notes Proceedings of the MediaEval 2011 Multimedia Benchmark Workshop, Pisa, Italy.

Arvola, P., Kekäläinen, J., and Junkkari, M. (2011). Contextualization models for XML retrieval. Information Processing & Management, 47(5):762–776.

Asahara, M. and Matsumoto, Y. (2003). IPADIC version 2.7.0 user’s manual. Computational Linguistics Laboratory. Graduate School of Information Science. Nara Institute of Science and Technology.

Association, I. P. (1999). Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press.

Atal, B. S. and Hanauer, S. L. (1971). Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America, 50(2B):637–655.

Awad, G., Butt, A., Fiscus, J., Joy, D., Delgado, A., Michel, M., Smeaton, A. F., Graham, Y., Kraaij, W., Quénot, G., et al. (2017). TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking. In Proceedings of the 2017 TRECVID Workshop.

Awad, G., Fiscus, J., Michel, M., Joy, D., Kraaij, W., Smeaton, A. F., Quénot, G., Eskevich, M., Aly, R., and Ordelman, R. (2016). TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking. In Proceedings of the 2016 TRECVID Workshop.

Bartell, B. T., Cottrell, G. W., and Belew, R. K. (1994). Automatic combination of multiple ranked retrieval systems. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1994), pages 173–181, Dublin, Ireland.

Beeferman, D., Berger, A., and Lafferty, J. (1997). Text segmentation using exponential models. In Second Conference on Empirical Methods in Natural Language Processing, pages 35–46, Providence, Rhode Island.

Belkin, N. J., Kantor, P., Fox, E. A., and Shaw, J. A. (1995). Combining the evidence of multiple query representations for information retrieval. Information Processing & Management, 31(3):431–448.

Bell, A., Brenier, J. M., Gregory, M., Girand, C., and Jurafsky, D. (2009). Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language, 60(1):92–111.

Bell, P., Gales, M., Hain, T., Kilgour, J., Lanchantin, P., Liu, X., McParland, A., Renals, S., Saz, O., Wester, M., et al. (2015). The MGB Challenge: Evaluating multi-genre broadcast media recognition. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2015), pages 687–693, Scottsdale, AZ, USA.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Bird, S. (2006). NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pages 69–72, Sydney, Australia.

Bogert, B. P., Healy, M. J., and Tukey, J. W. (1963). The quefrency analysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. In Proceedings of the symposium on time series analysis, volume 15, pages 209–243.

Boytsov, L., Belova, A., and Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR experiments. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2013), pages 403– 412, Dublin, Ireland.

Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. (2010). The bal- anced accuracy and its posterior distribution. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR 2010), pages 3121–3124, Istanbul, Turkey.

Brown, M. G., Foote, J. T., Jones, G. J. F., Spärck Jones, K., and Young, S. J. (1995). Automatic content-based retrieval of broadcast news. In Proceedings of the 3rd ACM International Conference on Multimedia, pages 35–43, Dallas, Texas, USA.

Brown, M. G., Foote, J. T., Jones, G. J. F., Spärck Jones, K., and Young, S. J. (1996). Open-vocabulary speech indexing for voice and video mail retrieval. In Proceedings of the 4th ACM International Conference on Multimedia, pages 307–316, Boston, Massachusetts, USA.

Buckley, C., Salton, G., Allan, J., and Singhal, A. (1994). Automatic query expansion using SMART: TREC-3. In NIST Special Publication: 500-226 Overview of the 3rd Text REtrieval Conference (TREC-3), pages 69–80, Gaithersburg, Maryland.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hul- lender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96, Bonn, Germany.

Burges, C., Svore, K., Bennett, P., Pastusiak, A., and Wu, Q. (2011). Learning to rank using an ensemble of lambda-gradient models. In Proceedings of the Yahoo! Learning to Rank Challenge, pages 25–35, Yahoo! Labs. Sunnyvale, CA, USA.

Burges, C. J. (2010). From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research.

Büttcher, S., Clarke, C. L. A., and Lushman, B. (2006). Term proximity scoring for ad-hoc retrieval on very large text collections. In Proceedings of the 29th Annual International

ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2006), pages 621–622, Seattle, WA, USA.

Callan, J. P. (1994). Passage-level evidence in document retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1994), pages 302–310, Dublin, Ireland.

Carmel, D., Maarek, Y. S., Mandelbrod, M., Mass, Y., and Soffer, A. (2003). Searching XML documents via XML fragments. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2003), pages 151–158, Toronto, ON, Canada.

Carmel, D., Shtok, A., and Kurland, O. (2013). Position-based contextualization for pas- sage retrieval. In Proceedings of the 22Nd ACM International Conference on Information & Knowledge Management, pages 1241–1244, San Francisco, CA, USA.

Carterette, B. (2011). System effectiveness, user models, and user utility: A conceptual framework for investigation. In Proceedings of the 34th International ACM SIGIR con- ference on Research and Development in Information Retrieval (ACM SIGIR 2011), pages 903–912, Beijing, China.

Carterette, B., Kanoulas, E., and Yilmaz, E. (2011). Simulating simple user behavior for system effectiveness evaluation. In Proceedings of the 20th ACM International Confer- ence on Information and Knowledge Management, pages 611–620, Glasgow, Scotland, UK.

Chapelle, O. and Chang, Y. (2011). Yahoo! learning to rank challenge overview. In Proceedings of the Yahoo! Learning to Rank Challenge, pages 1–24, Yahoo! Labs. Sunnyvale, CA, USA.

Chapelle, O., Metlzer, D., Zhang, Y., and Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 621–630, Hong Kong, China.

Chatzichristofis, S. A. and Arampatzis, A. (2010). Late fusion of compact composite descriptors for retrieval from heterogeneous image databases. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2010), pages 825–826, Geneva, Switzerland.

Chelba, C., Hazen, T. J., and Saraclar, M. (2008). Retrieval and browsing of spoken content. IEEE Signal Processing Magazine, 25(3).

Chen, B., Wang, H.-M., and Lee, L.-S. (2001). Improved spoken document retrieval by ex- ploring extra acoustic and linguistic cues. In Proceedings of the 7th European Conference on Speech Communication and Technology, pages 299–302, Aalborg, Denmark.

Chen, F. R. and Withgott, M. (1992). The use of emphasis to automatically summarize a spoken discourse. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1992), pages 229–232, San Francisco, CA, USA.

Chen, G., Yilmaz, O., Trmal, J., Povey, D., and Khudanpur, S. (2013). Using proxies for oov keywords in the keyword search task. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2013), pages 416–421, Olomouc, Czech Republic.

Chen, K., Hasegawa-Johnson, M., Cohen, A., Borys, S., Kim, S.-S., Cole, J., and Choi, J.-Y. (2006). Prosody dependent speech recognition on radio news corpus of American English. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):232–245.

Chen, S., Curtis, K., Racca, D. N., Zhou, L., Jones, G. J. F., and O’Connor, N. E. (2015). DCU ADAPT@ TRECVID 2015: Video hyperlinking task. In Proceedings of TRECVID 2015, Gaithersburg, MD, USA.

Chen, S. F. and Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Com- putational Linguistics, pages 310–318, Santa Cruz, CA, USA.

Chen, S.-H., Yang, J.-H., Chiang, C.-Y., Liu, M.-C., and Wang, Y.-R. (2012). A new prosody-assisted Mandarin ASR system. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1669–1684.

Chiu, C., Tripathi, A., Chou, K., Co, C., Jaitly, N., Jaunzeikare, D., Kannan, A., Nguyen, P., Sak, H., Sankar, A., Tansuwan, J., Wan, N., Wu, Y., and Zhang, X. (2017). Speech recognition for medical conversations. arXiv preprint arXiv:1711.07274.

Choi, F. Y. (2000). Advances in domain independent linear text segmentation. In Proceed- ings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33, Seattle, Washington.

Chowdhury, A., McCabe, M. C., Grossman, D., and Frieder, O. (2002). Document nor- malization revisited. In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 2002), pages 381– 382, Tampere, Finland.

Christodoulides, G. and Avanzi, M. (2014). An evaluation of machine learning methods for prominence detection in French. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), pages 116–119, Singapore.

Cieri, C., Graff, D., Liberman, M., Martey, N., Strassel, S., et al. (1999). The TDT-2 text and speech corpus. In Proceedings of the DARPA Broadcast News Workshop, pages 57–60, Herndon, Virginia.

Clarke, C. L., Cormack, G. V., Kisman, D. I., and Lynam, T. R. (2000a). Question answering by passage selection (MultiText experiments for TREC-9). In NIST Special Publication 500-249: Proceedings of the 9th Text REtrieval Conference (TREC-9), pages 673–683, Gaithersburg, Maryland.

Clarke, C. L., Cormack, G. V., and Tudhope, E. A. (2000b). Relevance ranking for one to three term queries. Information Processing & Management, 36(2):291–311.

Clarke, C. L. and Smucker, M. D. (2014). Time well spent. In Proceedings of the 5th Information Interaction in Context Symposium, pages 205–214, Regensburg, Germany.

Cleverdon, C. (1962). Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems, volume 1. College of Aeronautics.

Cleverdon, C. W., Mills, J., and Keen, E. M. (1966). Factors determining the performance of indexing systems. ASLIB Cranfield Research Project, 1.

Cobârzan, C. and Schoeffmann, K. (2014). How do users search with basic HTML5 video players? In Proceedings of the 20th Anniversary International Conference on Multimedia Modeling (MMM 2014), pages 109–120, Dublin, Ireland.

Collins-Thompson, K., Macdonald, C., Bennett, P., Diaz, F., and Voorhees, E. M. (2015). TREC 2014 web track overview. In NIST Special Publication: SP 500-308. Proceedings of the 23th Text REtrieval Conference (TREC 2014), Gaithersburg, Maryland.

Cortes, C. and Vapnik, V. (1995). Support vector machine. Machine Learning, 20(3):273– 297.

Craswell, N., Robertson, S., Zaragoza, H., and Taylor, M. (2005). Relevance weighting for query independent evidence. In Proceedings of the 28th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 2005), pages 416– 423, Salvador, Brazil.

Crestani, F. (2001). Towards the use of prosodic information for spoken document retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2001), pages 420–421, New Orleans, USA.

Crestani, F., Sanderson, M., Theophylactou, M., and Lalmas, M. (1997). Short queries, natural language and spoken documents retrieval: Experiments at Glasgow University. In NIST Special Publication: 500-240. Proceedings of the 6th Text REtrieval Conference (TREC-6), pages 667–686, Gaithersburg, Maryland.

Crockford, C. and Agius, H. (2006). An empirical investigation into user navigation of digital video using the VCR-like control set. International Journal of Human-Computer Studies, 64(4):340–355.

Cutler, A., Dahan, D., and Van Donselaar, W. (1997). Prosody in the comprehension of spoken language: A literature review. Language and Speech, 40(2):141–201.

Davis, S. and Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366.

de Kretser, O. and Moffat, A. (1999). Effective document presentation with a locality-based similarity heuristic. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1999), pages 113–120, Berkeley, California, USA.

De Vries, A. P., Kazai, G., and Lalmas, M. (2004). Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit. In Proceed- ings of the 7th International Conference on Computer-Assisted Information Retrieval (Recherche d’Information et ses Applications) - (RIAO 2004), pages 463–473, Avignon, France.

Deng, L., Yu, D., et al. (2014). Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387.

Eisenstein, J. and Barzilay, R. (2008). Bayesian unsupervised topic segmentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 334–343, Waikiki, Honolulu, Hawaii.

Enarvi, S., Smit, P., Virpioja, S., and Kurimo, M. (2017). Automatic speech recogni- tion with very large conversational Finnish and Estonian vocabularies. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(11):2085–2097.

Eskevich, M. (2014). Towards effective retrieval of spontaneous conversational spoken content. PhD thesis, Dublin City University.

Eskevich, M., Aly, R., Ordelman, R., Chen, S., and Jones, G. J. F. (2013a). The Search and Hyperlinking Task at MediaEval 2013. In Working Notes Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Catalunya, Spain.

Eskevich, M., Aly, R., Ordelman, R., Chen, S., and Jones, G. J. F. (2013b). The Search and Hyperlinking task at MediaEval 2013. In Working Notes Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain.

Eskevich, M., Aly, R., Racca, D. N., Chen, S., and Jones, G. J. F. (2015). SAVA at MediaEval 2015: Search and anchoring in video archives. In Working Notes Proceedings of the MediaEval 2015 Multimedia Benchmark Workshop, Wurzen, Germany.

Eskevich, M., Aly, R., Racca, D. N., Ordelman, R., Chen, S., and Jones, G. J. F. (2014). The Search and Hyperlinking task at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain.

Eskevich, M. and Jones, G. J. F. (2011a). DCU at MediaEval 2011: Rich Speech Retrieval (RSR). In Working Notes Proceedings of the MediaEval 2011 Multimedia Benchmark Workshop, Pisa, Italy.

Eskevich, M. and Jones, G. J. F. (2011b). DCU at the NTCIR-9 SpokenDoc passage retrieval task. In Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies, Information Retrieval, Question Answering, and Cross-Lingual Information Access (NTCIR-9), pages 257–260, Tokyo, Japan.

Eskevich, M. and Jones, G. J. F. (2013a). DCU at NTCIR-10 SpokenDoc2 passage retrieval task. In Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-10), pages 640–607, Tokyo, Japan.

Eskevich, M. and Jones, G. J. F. (2013b). Time-based segmentation and use of jump-in points in DCU search runs at the Search and Hyperlinking Task at MediaEval 2013. In Working Notes Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Catalunya, Spain.

Eskevich, M., Jones, G. J. F., Aly, R., Ordelman, R. J., Chen, S., Nadeem, D., Guinaudeau, C., Gravier, G., Sébillot, P., de Nies, T., Debevere, P., Van de Walle, R., Galuščáková, P., Pecina, P., and Larson, M. (2013c). Multimedia information seeking through search and hyperlinking. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, pages 287–294, Dallas, Texas, USA.

Eskevich, M., Jones, G. J. F., Chen, S., Aly, R., Ordelman, R., and Larson, M. (2012a). Search and Hyperlinking task at MediaEval 2012. In Working Notes Proceedings of the MediaEval 2012 Multimedia Benchmark Workshop, Pisa, Italy.

Eskevich, M., Jones, G. J. F., Wartena, C., Larson, M., Aly, R., Verschoor, T., and Ordel- man, R. (2012b). Comparing retrieval effectiveness of alternative content segmentation methods for internet video search. In Proceedings of the 10th International Workshop on Content-Based Multimedia Indexing (CBMI 2012), pages 223–228, Annecy, France.

Eskevich, M., Magdy, W., and Jones, G. J. F. (2012c). New metrics for meaningful evaluation of informally structured speech retrieval. In Proceedings of the 34th European Conference on Information Retrieval Research (ECIR 2012), pages 170–181, Barcelona, Spain.

Eyben, F. (2016). Real-time speech and music classification by large audio feature space extraction. Springer.

Eyben, F., Weninger, F., Gross, F., and Schuller, B. (2013). Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia (MM 2013), pages 835–838, Barcelona, Spain.

Fox, E. A. and Shaw, J. A. (1993). Combination of multiple searches. In NIST Special Publication: 500-215 The Second Text REtrieval Conference (TREC-2), pages 243–252, Gaithersburg, Maryland.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Fuhr, N., Gövert, N., Kazai, G., and Lalmas, M. (2002). INEX: Initiative for the evaluation of XML retrieval. In Proceedings of the ACM SIGIR Workshop on XML and Information Retrieval, pages 1–9, Tampere, Finland.

Fuhr, N. and Lalmas, M. (2007). Advances in XML retrieval: The INEX initiative. In Proceedings of the 2006 International Workshop on Research Issues in Digital Libraries, Kolkata, India.

Furui, S. (2007). Recent advances in automatic speech summarization. In Proceedings of the (RIAO 2007) Conference on Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pages 90–101, Pittsburgh, Pennsylvania, USA.

Gales, M. J. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75–98.

Galuščáková, P., Pecina, P., and Hajič, J. (2012). Penalty functions for evaluation measures of unsegmented speech retrieval. In Proceedings of the 3rd International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF 2012): Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, pages 100–111, Rome, Italy.

Galuščáková, P. and Pecina, P. (2012). CUNI at MediaEval 2012 Search and Hyperlinking task. In Working Notes Proceedings of the MediaEval 2011 Multimedia Benchmark Workshop, Pisa, Italy.

Galuščáková, P. and Pecina, P. (2014a). CUNI at MediaEval 2014 Search and Hyperlinking task: Search task experiments. In Working Notes Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain.

Galuščáková, P. and Pecina, P. (2014b). Experiments with segmentation strategies for passage retrieval in audio-visual documents. In Proceedings of the 4th ACM International Conference on Multimedia Retrieval (ICMR 2014), pages 217–224, Glasgow, Scotland.

Garofolo, J. S., Auzanne, C. G. P., and Voorhees, E. M. (2000). The TREC spoken document retrieval track: A success story. In Proceedings of the RIAO Conference on Content-Based Multimedia Information Access, pages 1–20, Paris, France.

Garofolo, J. S., Voorhees, E. M., Auzanne, C. G., Stanford, V. M., and Lund, B. A. (1998). 1998 TREC-7 spoken document retrieval track overview and results. In NIST Special Publication: 500-242. Proceedings of the 7th Text REtrieval Conference (TREC- 7), pages 79–90, Gaithersburg, Maryland.

Gauvain, J.-L. (2010). The Quaero program: Multilingual and multimedia technologies. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT) 2010, Paris, France.

Gauvain, J.-L., de Kercadio, Y., Lamel, L., and Adda, G. (1999). The LIMSI SDR system for TREC-8. In NIST Special Publication 500-246: Proceedings of the 8th Text REtrieval Conference (TREC-8), pages 475–482, Gaithersburg, Maryland.

Gauvain, J.-L., Lamel, L., and Adda, G. (1998). Partitioning and transcription of broad- cast news. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP 1998), pages 1335–1338, Sydney, Australia.

Gauvain, J.-L., Lamel, L., and Adda, G. (2002). The LIMSI broadcast news transcription system. Speech Communication, 37(1):89–108.

Gauvain, J.-L., Lamel, L., Barras, C., Adda, G., and de Kercadio, Y. (2000). The LIMSI SDR system for TREC-9. In NIST Special Publication 500-249: Proceedings of the 9th Text REtrieval Conference (TREC-9), pages 335–360, Gaithersburg, Maryland.

Ghayoumi, M. and Bansal, A. K. (2016). Emotion in robots using convolutional neural networks. In Proceedings of the 8th International Conference on Social Robotics (ICSR 2016), pages 285–295, Kansas City, USA.

Glavitsch, U. and Schäuble, P. (1992). A system for retrieving speech documents. In Proceedings of the 15th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 1992), pages 168–176, Copenhagen, Denmark.

Godfrey, J. J., Holliman, E. C., and McDaniel, J. (1992). Switchboard: Telephone speech corpus for research and development. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1992), pages 517–520, San Francisco, CA, USA.

Goldwater, S., Jurafsky, D., and Manning, C. D. (2010). Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates. Speech Communication, 52(3):181–200.

Gövert, N., Abolhassani, M., Fuhr, N., and Großjohann, K. (2002). Content-oriented XML retrieval with HyRex. In Proceedings of the First Workshop of the INitiative for the Evaluation of XML Retrieval (INEX), pages 26–32, Schloss Dagstuhl, Germany.

Guinaudeau, C. and Hirschberg, J. (2011). Accounting for prosodic information to im- prove ASR-based topic tracking for TV broadcast news. In Proceedings of the 12th In- ternational Conference on Spoken Language Processing (INTERSPEECH 2011), pages 1401–1404, Florence, Italy.

Haeb-Umbach, R. and Ney, H. (1992). Linear discriminant analysis for improved large vocabulary continuous speech recognition. In Proceedings of the 1992 IEEE Interna- tional Conference on Acoustics, Speech, and Signal Processing (ICASSP 1992), pages 13–16, San Francisco, CA, USA.

Harman, D. (1993). Overview of the second text retrieval conference TREC-2. In NIST Special Publication: 500-215 The Second Text REtrieval Conference (TREC-2), pages 1–20, Gaithersburg, Maryland.

Harter, S. P. (1975). A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing. Journal of the Association for Information Science and Technology, 26(5):280–289.

Hearst, M. A. (1993). TextTiling: A quantitative approach to discourse segmentation. Technical report, University of California, Berkeley.

Hearst, M. A. (1994). Context and structure in automated full-text information access. PhD thesis, University of California, Berkeley.

Hearst, M. A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Hearst, M. A. and Plaunt, C. (1993). Subtopic structuring for full-length document access. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1993), pages 59–68, New York, NY, USA.

Hiemstra, D., Ordelman, R., Aly, R., van der Werff, L., and de Jong, F. (2006). Speech retrieval experiments using XML information retrieval. In Working Notes of the 7th Workshop of the Cross-Language Evaluation Forum (CLEF 2006): Evaluation of Mul- tilingual and Multi-modal Information Retrieval, pages 770–777, Alicante, Spain.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Hirschberg, J. (2002). Communication and prosody: Functional aspects of prosody. Speech Communication, 36(1):31–43.

Hirschberg, J. and Grosz, B. (1992). Intonational features of local and global discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 441–446, Harriman, New York.

Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338, San Francisco, CA, USA.

Huang, X., Acero, A., and Hon, H.-W. (2001). Spoken language processing: A guide to theory, algorithm, and system development. Prentice Hall PTR. Foreword by R. Reddy.

Huang, X., Huang, Y. R., Zhong, M., and Wen, M. (2004). York University at TREC 2004: HARD and genomics tracks. In NIST Special Publication: SP 500-26. Proceedings of the 13th Text Retrieval Conference (TREC 2004), Gaithersburg, Maryland.

Ilienko, A. (2013). Continuous counterparts of Poisson and binomial distributions and their properties. arXiv preprint arXiv:1303.5990.

Ircing, P. and Müller, L. (2007). Attempts to search Czech spontaneous spoken interviews - the University of West Bohemia at CLEF 2007 CL-SR track. In Working Notes of the 8th Workshop of the Cross-Language Evaluation Forum (CLEF 2007): Advances in Multilingual and Multimodal Information Retrieval, Budapest, Hungary.

Itou, K., Takeda, K., Takezawa, T., Matsuoka, T., Shikano, K., Kobayashi, T., Itahashi, S., and Yamamoto, M. (1999). JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research. The Journal of the Acoustical Society of Japan, 20(3):199–206.

James, D. A. (1995). The application of classical information retrieval techniques to spoken documents. PhD thesis, University of Cambridge.

James, D. A. (1996). A system for unrestricted topic retrieval from radio news broadcasts. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1996), pages 279–282, Atlanta, Georgia.

James, D. A. and Young, S. J. (1994). A fast lattice-based approach to vocabulary independent wordspotting. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1994), pages 377–380, Adelaide, SA, Australia.

Järvelin, K. and Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446.

Jauhar, S. K., Chen, Y.-N., and Metze, F. (2013). Prosody-based unsupervised speech summarization with two-layer mutually reinforced random walk. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), pages 648–654, Nagoya, Japan.

Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the International Workshop on Pattern Recognition in Practice, pages 383–393, Amsterdam, Netherlands.

Jeon, J. H., Wang, W., and Liu, Y. (2011). N-best rescoring based on pitch-accent pat- terns. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 732–741, Portland, Ore- gon.

Jiang, J. and Allan, J. (2016). Adaptive effort for search evaluation metrics. In Proceedings of the 38th Annual European Conference on Information Retrieval (ECIR 2016), pages 187–199, Padua, Italy.

Jiang, J. and Zhai, C. (2004). UIUC in HARD 2004–passage retrieval using HMMs. In NIST Special Publication: SP 500-26. Proceedings of the 13th Text Retrieval Conference (TREC 2004), Gaithersburg, Maryland.

Jiang, J. and Zhai, C. (2006). Extraction of coherent relevant passages using hidden Markov models. ACM Transactions on Information Systems (TOIS), 24(3):295–319.

Johnson, S. E., Jourlin, P., Moore, G., Spärck Jones, K., and Woodland, P. C. (1999a). The Cambridge University spoken document retrieval system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1999), pages 49–52, Phoenix, AZ, USA.

Johnson, S. E., Jourlin, P., Spärck Jones, K., and Woodland, P. C. (2000). Spoken document retrieval for TREC-9 at Cambridge University. In NIST Special Publication 500-249: Proceedings of the 9th Text REtrieval Conference (TREC-9), pages 117–126, Gaithersburg, Maryland.

Johnson, S. E., Woodland, P. C., Spärck Jones, K., and Jourlin, P. (1999b). Spoken document retrieval for TREC-8 at Cambridge University. In NIST Special Publication

500-246: Proceedings of the 8th Text REtrieval Conference (TREC-8), pages 197–206, Gaithersburg, Maryland.

Jones, G. J. F., Foote, J., Spärck Jones, K., and Young, S. (1997). The video mail retrieval project: experiences in retrieving spoken documents. In Intelligent Multimedia Information Retrieval, pages 191–214. MIT Press.

Jones, G. J. F., Foote, J. T., Spärck Jones, K., and Young, S. J. (1996). Retrieving spoken documents by combining multiple index sources. In Proceedings of the 19th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 1996), pages 30–38, Zurich, Switzerland.

Jones, G. J. F., Zhang, K., and Lam-Adesina, A. M. (2006). Dublin City University at CLEF 2006: Cross-language speech retrieval (CL-SR) experiments. In Proceedings of the 7th Workshop of the Cross-Language Evaluation Forum (CLEF 2006): Evaluation of Multilingual and Multi-modal Information Retrieval, pages 794–802, Alicante, Spain.

Kamps, J., De Rijke, M., and Sigurbjörnsson, B. (2004). Length normalization in XML retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2004), pages 80–87, Sheffield, United Kingdom.

Kamps, J., Pehcevski, J., Kazai, G., Lalmas, M., and Robertson, S. (2007). INEX 2007 evaluation measures. In Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, (INEX 2007): Focused Access to XML Documents, pages 24–33, Dagstuhl Castle, Germany.

Kaneko, T., Takigami, T., and Akiba, T. (2011). STD based on Hough transform and SDR using STD results: Experiments at NTCIR-9 SpokenDoc. In Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies, Informa- tion Retrieval, Question Answering, and Cross-Lingual Information Access (NTCIR-9), pages 264–270, Tokyo, Japan.

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, Columbus, Ohio.

Kaszkiel, M. and Zobel, J. (1997). Passage retrieval revisited. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1997), pages 178–185, Philadelphia, PA, USA.

Kaszkiel, M. and Zobel, J. (2001). Effective ranking with arbitrary passages. Journal of the Association for Information Science and Technology, 52(4):344–364.

Katz, S. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401.

Keikha, M., Park, J. H., Croft, W. B., and Sanderson, M. (2014). Retrieving passages and finding answers. In Proceedings of the 2014 Australasian Document Computing Symposium, pages 81–84, Melbourne, Australia.

Kekäläinen, J., Arvola, P., and Junkkari, M. (2009). Contextualization, pages 474–478. Springer US.

Kekäläinen, J. and Järvelin, K. (2002). Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology, 53(13):1120–1129.

Kießling, A. (1997). Extraktion und Klassifikation prosodischer Merkmale in der automat- ischen Sprachverarbeitung. Shaker.

Kim, J.-H. and Woodland, P. C. (2001). The use of prosody in a combined system for punctuation generation and speech recognition. In Proceedings of the 7th European Conference on Speech Communication and Technology, Eurospeech 2001, Scandinavia, pages 2757–2760, Aalborg, Denmark.

Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1995), pages 181–184, Detroit, Michigan, USA.

Kolář, J., Shriberg, E., and Liu, Y. (2006). Using prosody for automatic sentence segmentation of multi-party meetings. In Proceedings of the 9th International Conference on Text, Speech and Dialogue (TSD 2006), pages 629–636, Brno, Czech Republic.

Koumpis, K. and Renals, S. (2005). Automatic summarization of voicemail messages using lexical and prosodic features. ACM Transactions on Speech and Language Processing (TSLP), 2(1).

Krifka, M. (2008). Basic notions of information structure. Acta Linguistica Hungarica, 55(3-4):243–276.

Krippendorff, K. (2011). Computing Krippendorff’s Alpha-Reliability. Departmental pa- pers (ASC), pages 1–10.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1106–1114.

Kudo, T., Yamamoto, K., and Matsumoto, Y. (2004). Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 230–237, Barcelona, Spain.

Lamel, L., Courcinous, S., Despres, J., Gauvain, J.-L., Josse, Y., Kilgour, K., Kraft, F., Le, V. B., Ney, H., Nußbaum-Thom, M., Oparin, I., Schlippe, T., Schlüter, R., Schultz, T., Fraga da Silva, T., Stüker, S., Sundermeyer, M., Vieru, B., Thang Vu, N., Waibel, A., and Woehrling, C. (2011). Speech recognition for machine translation in Quaero. In International Workshop on Spoken Language Translation (IWSLT 2011), San Francisco, CA, USA.

Lamprier, S., Amghar, T., Levrat, B., and Saubion, F. (2008). Thematic segment retrieval revisited. In Proceedings of the 13th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2008), pages 157–166, Varna, Bul- garia.

Lanchantin, P., Bell, P. J., Gales, M. J., Hain, T., Liu, X., Long, Y., Quinnell, J., Renals, S., Saz, O., Seigel, M. S., et al. (2013). Automatic transcription of multi-genre media archives. In First Workshop on Speech, Language and Audio in Multimedia, volume 1012, pages 26–31, Marseille, France. Cambridge University Press.

Larson, M., Eskevich, M., Ordelman, R., Kofler, C., Schmiedeke, S., and Jones, G. J. F. (2011). Overview of MediaEval 2011 Rich Speech Retrieval Task and genre tagging task. In Working Notes Proceedings of the MediaEval 2011 Multimedia Benchmark Workshop, Pisa, Italy.

Larson, M. and Jones, G. J. F. (2012a). Spoken content retrieval: A survey of techniques and technologies. Foundations and Trends in Information Retrieval, 5(4–5):235–422.

Larson, M. and Jones, G. J. F. (2012b). Spoken content retrieval: A survey of techniques and technologies. Foundations and Trends in Information Retrieval, 5(4–5):235–422.

Larson, M., Soleymani, M., Eskevich, M., Serdyukov, P., Ordelman, R., and Jones, G. J. F. (2012). The community and the crowd: Developing large-scale data collections for multimedia benchmarking. IEEE Multimedia, Special Issue ”Large-Scale Multimedia Data Collections”, 19(3):15–23.

Lecorvé, G., Gravier, G., and Sébillot, P. (2008). An unsupervised web-based topic language model adaptation method. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), pages 5081–5084, Las Vegas, NV, USA.

LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):255–258.

Lee, A. and Kawahara, T. (2009). Recent development of open-source speech recognition engine Julius. In Proceedings of the 4th Annual Conference on Asia-Pacific Signal and Information Processing Association (APSIPA ASC 2009), 2009 Annual Summit and Conference, pages 131–137, Sapporo, Japan.

Lee, A., Kawahara, T., and Doshita, S. (1998). An efficient two-pass search algorithm using word trellis index. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP 1998), pages 1831–1834, Sydney, Australia.

Lee, A., Kawahara, T., Takeda, K., Mimura, M., Yamada, A., Ito, A., Itou, K., and Shikano, K. (2002). Continuous Speech Recognition Consortium: an open repository for CSR tools and models. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1438–1441, Las Palmas, Canary Islands, Spain.

Lee, A., Shikano, K., and Kawahara, T. (2004). Real-time word confidence scoring using local posterior probabilities on tree trellis search. In Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Quebec, Canada.

Lee, L. and Rose, R. C. (1996). Speaker normalization using efficient frequency warping procedures. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1996), pages 353–356, Atlanta, GA, USA.

Lee, L.-s., Glass, J., Lee, H.-y., and Chan, C.-a. (2015). Spoken content retrieval—beyond cascading speech recognition with text retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(9):1389–1420.

Lehiste, I. (1973). Phonetic disambiguation of syntactic ambiguity. Journal of the Acous- tical Society of America, 53(1):380–380.

Lehiste, I. B. (1970). Suprasegmentals. MIT Press Publication.

Levinson, S., Rabiner, L., and Sondhi, M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62(4):1035–1074.

Levow, G. (2007). University of Chicago at the CLEF 2007 cross-language speech re- trieval track. In Working Notes of the 8th Workshop of the Cross-Language Evaluation Forum (CLEF 2007): Advances in Multilingual and Multimodal Information Retrieval, Budapest, Hungary.

Lileikyte, R., Lamel, L., and Gauvain, J.-L. (2015). Conversational telephone speech recognition for Lithuanian. In Proceedings of the 3rd International Conference on Statistical Language and Speech Processing (SLSP 2015), pages 164–172, Budapest, Hungary. Springer International Publishing.

Liu, B. and Oard, D. W. (2006). One-sided measures for evaluating ranked retrieval effectiveness with spontaneous conversational speech. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2006), pages 673–674, Seattle, WA, USA.

Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., and Harper, M. (2006). Enriching speech recognition with automatic detection of sentence boundaries and dis- fluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526– 1540.

Luenberger, D. G. and Ye, Y. (1984). Linear and nonlinear programming, volume 2. Springer.

Luk, R. W., Leong, H. V., Dillon, T. S., Chan, A. T., Croft, W. B., and Allan, J. (2002). A survey in indexing and searching XML documents. Journal of the Association for Information Science and Technology, 53(6):415–437.

Lv, Y. and Zhai, C. (2009). Positional language models for information retrieval. In Proceedings of the 32nd Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 2009), pages 299–306, Boston, MA, USA.

Maekawa, K. (2003). Corpus of Spontaneous Japanese: Its design and evaluation. In Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan.

Maekawa, K., Kikuchi, H., and Tsukahara, W. (2004). Corpus of spontaneous Japan- ese: design, annotation and XML representation. In Proceedings of the International Symposium on Large-Scale Knowledge Resources, pages 19–24, Tokyo, Japan.

Maekawa, K., Koiso, H., Furui, S., and Isahara, H. (2000). Spontaneous speech corpus of Japanese. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), pages 947–952, Athens, Greece.

Maekawa, K., Yamazaki, M., Ogiso, T., Maruyama, T., Ogura, H., Kashino, W., Koiso, H., Yamaguchi, M., Tanaka, M., and Den, Y. (2014). Balanced corpus of contemporary written Japanese. Language Resources and Evaluation, 48(2):345–371.

Malioutov, I. and Barzilay, R. (2006). Minimum cut model for spoken lecture segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING ACL 2006), pages 25–32, Sydney, Australia.

Malioutov, I., Park, A., Barzilay, R., and Glass, J. (2007). Making sense of sound: Un- supervised topic segmentation over acoustic input. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 504–511, Prague, Czech Republic.

Mangu, L., Brill, E., and Stolcke, A. (1999). Finding consensus among words: lattice-based word error minimization. In Proceedings of the 6th European Conference on Speech Communication and Technology, (Eurospeech 1999), Budapest, Hungary.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Maskey, S. and Hirschberg, J. (2005). Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Proceedings of the 9th European Confer- ence on Speech Communication and Technology, (INTERSPEECH 2005 - Eurospeech), pages 621–624, Lisbon, Portugal.

McGuinness, K., O’Connor, N. E., Aly, R., De Jong, F., Chatfield, K., Parkhi, O. M., Arandjelovic, R., Zisserman, A., Douze, M., and Schmid, C. (2013). The AXES PRO video search system. In Proceedings of the 3rd ACM Conference on International Con- ference on Multimedia Retrieval, pages 307–308, Dallas, TX, USA.

Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), pages 1045–1048, Makuhari, Chiba, Japan.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Mishra, T., Sridhar, V. K. R., and Conkie, A. (2012). Word prominence detection using robust yet simple prosodic features. In Proceedings of the 13th Annual Conference of the International Speech Communication Association (INTERSPEECH 2012), pages 1864–1867, Portland, OR, USA.

Mittendorf, E. and Schäuble, P. (1994). Document and passage retrieval based on hidden Markov models. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1994), pages 318–327, Dublin, Ireland.

Moffat, A. and Zobel, J. (2008). Rank-biased precision for measurement of retrieval ef- fectiveness. ACM Transactions on Information Systems (TOIS), 27(1):2.

Mulder, W. D., Bethard, S., and Moens, M.-F. (2015). A survey on the application of re- current neural networks to statistical language modeling. Computer Speech & Language, 30(1):61–98.

Murata, M., Nagano, H., Hiramatsu, K., Kashino, K., and Satoh, S. (2016). Bayesian expo- nential inverse document frequency and region-of-interest effect for enhancing instance search accuracy. IEICE Transactions on Information and Systems, 99(9):2320–2331.

Murata, M., Nagano, H., Mukai, R., Kashino, K., and Satoh, S. (2014). BM25 with exponential IDF for instance search. IEEE Transactions on Multimedia, 16(6):1690–1699.

Muthusamy, Y. K., Cole, R. A., Oshika, B. T., Consortium, L. D., et al. (1992). The OGI multi-language telephone speech corpus. In The Second International Conference on Spoken Language Processing, (ICSLP 1992), pages 895–898, Banff, Alberta, Canada.

Nanjo, H., Noritake, K., and Yoshimi, T. (2011). Spoken document retrieval experiments for SpokenDoc at Ryukoku University (RYSDT). In Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies, Information Re- trieval, Question Answering, and Cross-Lingual Information Access (NTCIR-9), pages 275–280, Tokyo, Japan.

Nanjo, H., Yoshimi, T., Maeda, S., and Nishio, T. (2014). Spoken document retrieval experiments for SpokenQuery&Doc at Ryukoku University (RYSDT). In Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR- 11), pages 365–370, Tokyo, Japan.

Oard, D. W., Wang, J., Jones, G. J. F., White, R. W., Pecina, P., Soergel, D., Huang, X., and Shafran, I. (2006). Overview of the CLEF-2006 cross-language speech retrieval track. In Proceedings of the 7th Workshop of the Cross-Language Evaluation Forum (CLEF 2006): Evaluation of Multilingual and Multi-modal Information Retrieval, pages 744–758, Alicante, Spain.

Ogilvie, P. and Callan, J. (2004). Hierarchical language models for XML component retrieval. In Proceedings of the 3rd International Workshop of the Initiative for the Evaluation of XML Retrieval on Advances in XML Information Retrieval (INEX 2004), pages 224–237, Dagstuhl Castle, Germany.

Ogilvie, P. and Callan, J. (2005). Parameter estimation for a simple hierarchical gener- ative model for XML retrieval. In Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval on Advances in XML Information Retrieval and Evaluation (INEX 2005), pages 211–224, Dagstuhl Castle, Germany.

Ounis, I., Lioma, C., Macdonald, C., and Plachouras, V. (2007). Research directions in Terrier: A search engine for advanced retrieval on the web. CEPIS Upgrade Journal, 8(1):49–56.

Over, P., Fiscus, J., Sanders, G., Joy, D., Michel, M., Awad, G., Smeaton, A., Kraaij, W., and Quénot, G. (2014). TRECVID 2014: An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2014 Workshop, Gaithersburg, USA.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford InfoLab.

Pecina, P., Hoffmannová, P., Jones, G. J. F., Zhang, Y., and Oard, D. W. (2007a). Overview of the CLEF-2007 cross-language speech retrieval track. In Proceedings of the 8th Workshop of the Cross-Language Evaluation Forum (CLEF 2007): Advances in Multilingual and Multimodal Information Retrieval, pages 674–686, Budapest, Hungary.

Pecina, P., Hoffmannová, P., Jones, G. J. F., Zhang, Y., and Oard, D. W. (2007b). Overview of the CLEF 2007 cross-language speech retrieval track. In Proceedings of the 8th Workshop of the Cross-Language Evaluation Forum (CLEF 2007): Advances in Multilingual and Multimodal Information Retrieval, pages 674–686, Budapest, Hungary.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Ponte, J. M. and Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 1998), pages 275–281, Melbourne, Australia.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.

Preston, J., Hare, J., Samangooei, S., Davies, J., Jain, N., Dupplaw, D., and Lewis, P. H. (2013). A unified, modular and multimodal approach to search and hyperlinking video. In Working Notes Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Catalunya, Spain.

Prince, E. F. (1981). Toward a taxonomy of given-new information. Radical pragmatics, 14:223–255.

Quinn, G. and Smeaton, A. (1999). Optimal parameters for segmenting a stream of audio into speech documents. In Proceedings of the ESCA ETRW Workshop on Accessing Information in Spoken Audio, pages 19–20, Cambridge, UK.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the 1989 IEEE International Conference on Robotics and Automation, pages 257–286.

299 Racca, D. N., Eskevich, M., and Jones, G. J. F. (2014). DCU search runs at MediaEval 2014 Search and Hyperlinking. In Working Notes Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Catalunya, Spain.

Racca, D. N. and Jones, G. J. F. (2014). DCU at the NTCIR-11 SpokenQuery&Doc task. In Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-11), pages 376–383, Tokyo, Japan.

Racca, D. N. and Jones, G. J. F. (2015a). Evaluating Search and Hyperlinking: An example of the design, test, refine cycle for metric development. In Proceedings of the MediaEval 2015 Multimedia Benchmark Workshop, Wurzen, Germany.

Racca, D. N. and Jones, G. J. F. (2015b). Incorporating prosodic prominence evidence into term weights for spoken content retrieval. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), pages 1378–1382, Dresden, Germany.

Racca, D. N. and Jones, G. J. F. (2016a). DCU at the NTCIR-12 SpokenQuery&Doc-2 task. In Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-12), pages 180–185, Tokyo, Japan.

Racca, D. N. and Jones, G. J. F. (2016b). On the effectiveness of contextualisation techniques in spoken query spoken content retrieval. In Proceedings of the 39th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 2016), pages 933–936, Pisa, Italy.

Renals, S. and Abberley, D. (2000). The THISL SDR system at TREC-9. In NIST Special Publication 500-249: Proceedings of the 9th Text REtrieval Conference (TREC-9), pages 627–634, Gaithersburg, Maryland.

Reynar, J. C. (1998). Topic segmentation: Algorithms and applications. PhD thesis, University of Pennsylvania.

Robertson, S. and Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Robertson, S. E. (1977). The probability ranking principle in IR. Journal of Documentation, 33(4):294–304.

Robertson, S. E., Van Rijsbergen, C. J., and Porter, M. F. (1980). Probabilistic models of indexing and searching. In Proceedings of the 3rd Annual ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1980), pages 35–56, Cambridge, England.

Robertson, S. E. and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 1994), pages 232–241, Dublin, Ireland.

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. (1994). Okapi at TREC-3. In NIST Special Publication 500-226: Overview of the 3rd Text REtrieval Conference (TREC-3), pages 109–126, Gaithersburg, Maryland.

Röhr, C. T. (2013). Information status and prosody: Production and perception in German. Interdisciplinary Studies on Information Structure, 17:119.

Rose, R. C. (1991). Techniques for information retrieval from speech messages. The Lincoln Laboratory Journal, 4(1):45–60.

Rosenberg, A. (2010). AuToBI - a tool for automatic ToBI annotation. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), pages 146–149, Makuhari, Japan.

Rosenberg, A. (2012). Modeling intensity contours and the interaction between pitch and intensity to improve automatic prosodic event detection and classification. In Proceedings of the Spoken Language Technology Workshop (SLT 2012), pages 376–381, Miami, FL, USA.

Rousseau, A., Bougares, F., Deléglise, P., Schwenk, H., and Estève, Y. (2011). LIUM's systems for the IWSLT 2011 speech translation tasks. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT 2011), pages 79–85, San Francisco, CA, USA.

Rousseau, A., Deléglise, P., and Estève, Y. (2012). TED-LIUM: An automatic speech recognition dedicated corpus. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 125–129, Istanbul, Turkey.

Sahuguet, M., Huet, B., Červenková, B., Apostolidis, E., Mezaris, V., Stein, D., Eickeler, S., Garcia, J. R., Troncy, R., and Pikora, L. (2013). LinkedTV at MediaEval 2013 Search and Hyperlinking Task. In Working Notes Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain.

Sakai, T. and Dou, Z. (2013). Summaries, ranked retrieval and sessions: A unified framework for information access evaluation. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2013), pages 473–482, Dublin, Ireland.

Salton, G. (1979). Mathematics and information retrieval. Journal of Documentation, 35(1):1–29.

Salton, G., Allan, J., and Buckley, C. (1993). Approaches to passage retrieval in full text information systems. In Proceedings of the 16th Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 1993), pages 49–58, Pittsburgh, PA, USA.

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Salton, G., Wong, A., and Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.

Sanderson, M. and Shou, X. M. (2007). Search of spoken documents retrieves well recognized transcripts. In Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), pages 505–516, Rome, Italy.

Schaer, P. (2012). Better than their reputation? On the reliability of relevance assessments with students. In Proceedings of the 3rd International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF 2012): Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, pages 124–135, Rome, Italy.

Schaer, P., Mayr, P., Sünkler, S., and Lewandowski, D. (2016). How relevant is the long tail? A relevance assessment study on Million Short. In Proceedings of the 7th International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2016), pages 227–233, Évora, Portugal.

Schmidt, K., Korner, T., Heinich, S., and Wilhelm, T. (2011). A two-step approach to video retrieval based on ASR transcriptions. In Working Notes Proceedings of the MediaEval 2011 Multimedia Benchmark Workshop, Pisa, Italy.

Schmiedeke, S., Xu, P., Ferrané, I., Eskevich, M., Kofler, C., Larson, M. A., Estève, Y., Lamel, L., Jones, G. J. F., and Sikora, T. (2013). Blip10000: A social video dataset containing SPUG content for tagging and retrieval. In Proceedings of the 4th ACM Multimedia Systems Conference (MMSys 2013), pages 96–101, Oslo, Norway.

Schouten, K., Aly, R., and Ordelman, R. (2013). Searching and hyperlinking using word importance segment boundaries in MediaEval 2013. In Working Notes Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Catalunya, Spain.

Schuller, B., Steidl, S., Batliner, A., Bergelson, E., Krajewski, J., Janott, C., Amatuni, A., Casillas, M., Seidl, A., Soderstrom, M., et al. (2017). The INTERSPEECH 2017 computational paralinguistics challenge: Addressee, cold & snoring. In Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), pages 3442–3446, Stockholm, Sweden.

Schwartz, R. and Austin, S. (1991). A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses. In Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1991), pages 701–704, Toronto, Canada.

Selkirk, E. O. (1984). Phonology and syntax: The relationship between sound and structure. MIT Press.

Shao, Y., Hardmeier, C., Tiedemann, J., and Nivre, J. (2017). Character-based joint segmentation and POS tagging for Chinese using bidirectional RNN-CRF. arXiv preprint arXiv:1704.01314.

Shen, Y., He, X., Gao, J., Deng, L., and Mesnil, G. (2014a). A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014), pages 101–110, Shanghai, China.

Shen, Y., He, X., Gao, J., Deng, L., and Mesnil, G. (2014b). Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web (WWW 2014), pages 373–374, Seoul, Republic of Korea.

Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.

Shiang, S.-R., Chou, P.-W., and Yu, L.-C. (2014). Spoken term detection and spoken content retrieval: Evaluations on NTCIR-11 SpokenQuery&Doc task. In Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies (NTCIR-11), pages 371–375, Tokyo, Japan.

Shigeyasu, K., Nanjo, H., and Yoshimi, T. (2009). A study of indexing units for Japanese spoken document retrieval. In Proceedings of the 10th Western Pacific Acoustics Conference (WESPAC 2009), Beijing, China.

Shou, M. X., Sanderson, M., and Tuffs, N. (2003). The relationship of word error rate to document ranking. In Proceedings of the AAAI Spring Symposium Intelligent Multimedia Knowledge Management Workshop, Technical Report SS-03-08, pages 28–33, Palo Alto, California.

Shriberg, E., Stolcke, A., Hakkani-Tür, D., and Tür, G. (2000). Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication, 32(1):127–154.

Siegler, M., Witbrock, M. J., Slattery, S. T., Seymore, K., Jones, R. E., and Hauptmann, A. G. (1997). Experiments in spoken document retrieval at CMU. In NIST Special Publication 500-240: Proceedings of the 6th Text REtrieval Conference (TREC-6), pages 291–302, Gaithersburg, Maryland.

Silipo, R. and Crestani, F. (2000). Prosodic stress and topic detection in spoken sentences. In Proceedings of the 7th International Symposium on String Processing and Information Retrieval (SPIRE 2000), pages 243–252, A Coruña, Spain.

Simon, A.-R., Gravier, G., and Sébillot, P. (2015a). IRISA at MediaEval 2015: Search and anchoring in video archives task. In Working Notes Proceedings of the MediaEval 2015 Multimedia Benchmark Workshop, Wurzen, Germany.

Simon, A.-R., Sébillot, P., and Gravier, G. (2015b). Hierarchical topic structuring: From dense segmentation to topically focused fragments via burst analysis. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2015), pages 588–595, Hissar, Bulgaria.

Singhal, A., Buckley, C., and Mitra, M. (1996). Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1996), pages 21–29, Zurich, Switzerland.

Singhal, A., Choi, J., Hindle, D., Hirschberg, J., Pereira, F., and Whittaker, S. (1999). AT&T at TREC-7 SDR track. In NIST Special Publication 500-242: Proceedings of the 7th Text REtrieval Conference (TREC-7), pages 227–230, Gaithersburg, Maryland.

Singhal, A. and Pereira, F. (1999). Document expansion for speech retrieval. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1999), pages 34–41, Berkeley, California, USA.

Smeaton, A. F., Kelledy, F., and Quinn, G. (1997). Ad hoc retrieval using thresholds, WSTs for French mono-lingual retrieval, document-at-a-glance for high precision and triphone windows for spoken documents. In NIST Special Publication 500-240: Proceedings of the 6th Text REtrieval Conference (TREC-6), pages 461–475, Gaithersburg, Maryland.

Smucker, M. D. and Clarke, C. L. (2012). Time-based calibration of effectiveness measures. In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2012), pages 95–104, Portland, OR, USA.

Song, R., Yu, L., Wen, J.-R., and Hon, H.-W. (2011). A proximity probabilistic model for information retrieval. Technical report, Microsoft Research.

Soong, F. K. and Huang, E.-F. (1991). A tree-trellis based fast search for finding the n-best sentence hypotheses in continuous speech recognition. In Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1991), pages 705–708, Toronto, Canada.

Spärck Jones, K. and van Rijsbergen, C. J. (1975). Report on the need for and provision of an “ideal” information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge.

Spärck Jones, K., Walker, S., and Robertson, S. E. (2000). A probabilistic model of information retrieval: Development and comparative experiments. Part 1 and 2. Information Processing & Management, 36(6):779–840.

Stanfill, C. and Waltz, D. L. (1992). Statistical methods, artificial intelligence, and information retrieval. Text-based intelligent systems: Current research and practice in information extraction and retrieval, pages 215–225.

Stoyanchev, S., Salletmayr, P., Yang, J., and Hirschberg, J. (2012). Localized detection of speech recognition errors. In Proceedings of the Spoken Language Technology Workshop (SLT 2012), IEEE, pages 25–30, Miami, USA.

Tax, N., Bockting, S., and Hiemstra, D. (2015). A cross-benchmark comparison of 87 learning to rank methods. Information Processing & Management, 51(6):757–772.

Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., and Burges, C. (2006). Optimisation methods for ranking functions with multiple parameters. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM 2006), pages 585–593, Arlington, Virginia, USA.

Terken, J. and Hermes, D. (2000). The perception of prosodic prominence. Prosody: Theory and Experiment, pages 89–127.

Terken, J. and Hirschberg, J. (1994). Deaccentuation of words representing ‘given’ information: Effects of persistence of grammatical function and surface position. Language and Speech, 37(2):125–145.

Tiedemann, J. and Mur, J. (2008). Simple is best: Experiments with different document segmentation strategies for passage retrieval. In Proceedings of the COLING 2nd Workshop on Information Retrieval for Question Answering, pages 17–25, Manchester, UK.

Tsuge, S., Ohashi, H., Kitaoka, N., Takeda, K., and Kita, K. (2011). Spoken document retrieval method combining query expansion with continuous syllable recognition for NTCIR-SpokenDoc. In Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies, Information Retrieval, Question Answering, and Cross-Lingual Information Access (NTCIR-9), pages 249–256, Tokyo, Japan.

Tür, G., Hakkani-Tür, D., Stolcke, A., and Shriberg, E. (2001). Integrating prosodic and lexical cues for automatic topic segmentation. Computational Linguistics, 27(1):31–57.

Utiyama, M. and Isahara, H. (2001). A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 499–506, Toulouse, France.

van der Werff, L. and Heeren, W. (2007). Evaluating ASR output for information retrieval. In Proceedings of the ACM SIGIR Workshop on Searching Spontaneous Conversational Speech, pages 13–20, Amsterdam, Netherlands.

Varona Fernández, A., Nieto Nieto, S., Rodríguez Fuentes, L. J., Peñagarikano Badiola, M., Bordel García, G., and Díez Sánchez, M. (2011). A spoken document retrieval system for TV broadcast news in Spanish and Basque. Procesamiento del Lenguaje Natural, 47:75–83.

Verma, M., Yilmaz, E., and Craswell, N. (2016). On obtaining effort based judgements for information retrieval. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining, pages 277–286, San Francisco, CA, USA.

Veselý, K., Ghoshal, A., Burget, L., and Povey, D. (2013). Sequence-discriminative training of deep neural networks. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH 2013), pages 2345–2349, Lyon, France.

Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

Voorhees, E., Garofolo, J., and Spärck Jones, K. (1997). The TREC-6 spoken document retrieval track overview and results. In NIST Special Publication 500-240: Proceedings of the 6th Text REtrieval Conference (TREC-6), pages 83–92, Gaithersburg, Maryland.

Voorhees, E. M. (2001). The TREC question answering track. Natural Language Engineering, 7(4):361–378.

Voorhees, E. M. and Buckley, C. (2002). The effect of topic set size on retrieval experiment error. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 2002), pages 316–323, Tampere, Finland.

Voorhees, E. M. and Harman, D. K. (2005). TREC: Experiment and Evaluation in Information Retrieval. MIT Press, Cambridge, MA.

Wactlar, H. D., Christel, M. G., Gong, Y., and Hauptmann, A. G. (1999). Lessons learned from building a terabyte digital video library. Computer, 32(2):66–73.

Wactlar, H. D., Kanade, T., Smith, M. A., and Stevens, S. M. (1996). Intelligent access to digital video: Informedia project. Computer, 29(5):46–52.

Wade, C. and Allan, J. (2005). Passage retrieval and evaluation. Technical report, Center for Intelligent Information Retrieval, University of Massachusetts Amherst.

Wagner, M. and Watson, D. G. (2010). Experimental and theoretical advances in prosody: A review. Language and Cognitive Processes, 25(7-9):905–945.

Wang, J. and Oard, D. W. (2005). CLEF-2005 CL-SR at Maryland: Document and query expansion using side collections and thesauri. In Proceedings of the 6th Workshop of the Cross-Language Evaluation Forum (CLEF 2005): Accessing Multilingual Information Repositories, pages 800–809, Vienna, Austria.

Ward, N. G. and Richart-Ruiz, K. A. (2013). Patterns of importance variation in spoken dialog. In Proceedings of the 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2013), pages 107–111, Metz, France.

Ward, N. G. and Vega, A. (2012). A bottom-up exploration of the dimensions of dialog state in spoken interaction. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2012), pages 198–206, Seoul, South Korea.

Ward, N. G., Werner, S. D., Garcia, F., and Sanchis, E. (2015). A prosody-based vector-space model of dialog activity for information retrieval. Speech Communication, 68:85–96.

Ward, W. (1989). Understanding spontaneous speech. In Proceedings of the Workshop on Speech and Natural Language, pages 137–141, Cape Cod, Massachusetts, USA.

Wartena, C. (2012). Comparing segmentation strategies for efficient video passage retrieval. In Proceedings of the 10th International Workshop on Content-Based Multimedia Indexing (CBMI 2012), pages 24–29, Annecy, France.

White, R. W., Oard, D. W., Jones, G. J. F., Soergel, D., and Huang, X. (2005). Overview of the CLEF-2005 cross-language speech retrieval track. In Proceedings of the 6th Workshop of the Cross-Language Evaluation Forum (CLEF 2005): Accessing Multilingual Information Repositories, pages 744–759, Vienna, Austria.

Wilcox, L., Smith, I., and Bush, M. (1992). Wordspotting for voice editing and audio indexing. In Proceedings of the Conference on Human Factors in Computing Systems (SIGCHI 1992), pages 655–656, Monterey, California, USA.

Woodland, P. C., Johnson, S. E., Jourlin, P., and Spärck Jones, K. (2000). Effects of out of vocabulary words in spoken document retrieval. In Proceedings of the 23rd Annual International Conference on Research and Development in Information Retrieval (ACM SIGIR 2000), pages 372–374, Athens, Greece.

Wu, C., Karanasou, P., Gales, M. J., and Sim, K. C. (2016). Stimulated deep neural network for speech recognition. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), pages 400–404, San Francisco, USA.

Xie, S., Hakkani-Tür, D., Favre, B., and Liu, Y. (2009). Integrating prosodic features in extractive meeting summarization. In Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), pages 387–391, Merano, Italy.

Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., and Zweig, G. (2016). Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256.

Young, S., Evermann, G., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2002). The HTK Book (for HTK Version 3.2). Cambridge University Engineering Department.

Yu, D. and Deng, L. (2014). Automatic speech recognition: A deep learning approach. Springer.

Yu, D., Li, J., and Deng, L. (2011). Calibration of confidence measures in speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 19(8):2461–2473.

Zhai, C. (2001). Notes on the Lemur TFIDF model. Technical report, Carnegie Mellon University.

Zhang, H.-P., Liu, Q., Cheng, X.-Q., Zhang, H., and Yu, H.-K. (2003). Chinese lexical analysis using hierarchical hidden Markov model. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 63–70, Sapporo, Japan.

Zhang, Y., Park, L. A., and Moffat, A. (2010). Click-based evidence for decaying weight distributions in search effectiveness metrics. Information Retrieval, 13(1):46–69.

Zobel, J. (1998). How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR 1998), pages 307–314, Melbourne, Australia.

Zobel, J. and Moffat, A. (1998). Exploring the similarity space. In Proceedings of the ACM SIGIR Forum, volume 32, pages 18–34, New York, NY, USA.

Zobel, J. and Moffat, A. (2006). Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2):6.
