Using Term Clouds to Represent Segment-Level Semantic Content of Podcasts
Total Page:16
File Type:pdf, Size:1020Kb
UvA-DARE (Digital Academic Repository) Using term clouds to represent segment-level semantic content of podcasts Fuller, M.; Tsagkias, E.; Newman, E.; Besser, J.; Larson, M.; Jones, G.J.F.; de Rijke, M. Publication date 2008 Published in Proceedings of the ACM SIGIR Workshop 'Searching Spontaneous Conversational Speech' Link to publication Citation for published version (APA): Fuller, M., Tsagkias, E., Newman, E., Besser, J., Larson, M., Jones, G. J. F., & de Rijke, M. (2008). Using term clouds to represent segment-level semantic content of podcasts. In J. Köhler, M. Larson, F. M. G. de Jong, W. Kraaij, & R. J. F. Ordelman (Eds.), Proceedings of the ACM SIGIR Workshop 'Searching Spontaneous Conversational Speech' (pp. 12-19). Centre for Telematics and Information Technology (CTIT). http://ilps.science.uva.nl/SSCS2008/Proceedings/sscs08_proceedings.pdf General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. Download date:26 Sep 2021 UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl) Using Term Clouds to Represent Segment-Level Semantic Content of Podcasts Marguerite Fuller Manos Tsagkias Centre for Digital Video ISLA, University of Amsterdam Processing [email protected] Dublin City University, Ireland [email protected] Eamonn Newman Jana Besser Martha Larson Centre for Digital Video ISLA, University of Amsterdam ISLA, University of Amsterdam Processing [email protected] [email protected] Dublin City University, Ireland [email protected] Gareth J.F. Jones Maarten de Rijke Centre for Digital Video ISLA, University of Amsterdam Processing [email protected] Dublin City University, Ireland [email protected] ABSTRACT Keywords Spoken audio, like any time-continuous medium, is notoriously dif- Speech browsing, term clouds, TextTiling ficult to browse or skim without support of an interface provid- ing semantically annotated jump points to signal the user where to listen in. Creation of time-aligned metadata by human anno- 1. INTRODUCTION tators is prohibitively expensive, motivating the investigation of Podcasting, the practice of publishing audio files online using representations of segment-level semantic content based on tran- a syndication feed, enjoys growing popularity. Users subscribe scripts generated by automatic speech recognition (ASR). This pa- to podcasts or download individual episodes for listening. The per examines the feasibility of using term clouds to provide users podcasts available on the internet, known collectively as the po- with a structured representation of the semantic content of podcast dosphere, frequently contain unplanned conversational speech, in- episodes. Podcast episodes are visualized as a series of sub-episode cluding interviews, chitchat and user generated commentaries. The segments, each represented by a term cloud derived from a tran- podosphere provides fertile ground for developing approaches to script generated by automatic speech recognition (ASR). Quality of confront the difficulty presented by spontaneous speech content segment-level term clouds is measured quantitatively and their util- to automatic speech recognition (ASR) and automatic structuring ity is investigated using a small-scale user study based on human algorithms. These include heterogeneous speaking styles, incom- labeled segment boundaries. Since the segment-level clouds gener- plete or interrupted sentences, lack of clear topical structure and un- ated from ASR-transcripts prove useful, we examine an adaptation predictable vocabularies. The growth rate of the podosphere makes of text tiling techniques to speech in order to be able to generate providing access to the information contained in podcasts a signif- segments as part of a completely automated indexing and structur- icant and pressing speech search challenge. ing system for browsing of spoken audio. Results demonstrate that Parallel to speech search on other domains, this challenge de- the segments generated are comparable with human selected seg- composes into two primary components. First, a user with an in- ment boundaries. formation need must be able to search the podosphere in order to locate podcasts or podcast episodes that satisfy that need. The task Categories and Subject Descriptors of podcast retrieval can be addressed with approaches that exploit podcast level metadata and ASR-transcripts, and the development H.3.1 [Information Storage and Retrieval]: Content Analysis of such techniques appears to be receiving substantial attention. and Indexing—Indexing Methods; H.3.3 [Information Storage and Previous research demonstrates that speech retrieval is remarkably Retrieval]: Information Search and Retrieval tolerant to high levels of transcription errors [5, 9, 17, 18, 23] and a growing number of podcast search engines on the internet, General Terms such as Podscope1, Everyzing2 and Pluggd3 exploit a combination Algorithms, Experimentation of speech-based and metadata features. Second, given a podcast 1http://www.podscope.com/ Copyright is held by the author/owner(s). 2http://search.everyzing.com/ SSCS’08 SIGIR 2008 workshop, July 24, Singapore 3 . http://www.pluggd.tv/audio/ 12 episode, the user must be able to easily preview that episode to composition of term clouds. Key word redundancy compensates confirm its relevance or else search within the episode to find spe- for ASR error for retrieval [9], so our hypothesis is that similar ef- cific sections that are relevant. The temporal nature of spoken data fects will extend to tag cloud creation. The objective is to have means that auditioning files to locate relevant information is often term clouds of sufficient quality so that users can reliably use them very time consuming and inefficient. Spoken audio differs in this to identify relevant sections of a speech file, and that they should way from text, which presents a user with an instantaneous impres- find the overall quality of the term clouds acceptable. sion accommodating depictions of importance or relevance in the This paper examines the feasibility of using term clouds to pro- form of highlighting or font changes such as size, boldness or col- vide users with a structured representation of the semantic content oration. of podcast episodes. Podcast episodes are visualized as a series of This aspect of podcast search is relatively less well studied, moti- sub-episode segments, each represented by a term cloud derived vating us to pursue research on this issue. This paper devotes itself from a transcript generated by ASR. Results demonstrate that the to the issue of representation of podcast episode content, investi- segments generated are comparable with human selected segment gating a method for creating articulated podcast episode surrogates boundaries. that can be used by an interface to represent the semantic content of This remainder of the paper is structured as follows. First, we a podcast episode at the level of thematic segments. Previous work provide a brief review of previous work in the areas of spoken con- suggests that episode level surrogates are robust to speech recogni- tent surrogates and in content segmentation. Then, we describe our tion error [22] and this paper takes up the open question of whether method for generation of segment level term clouds and present this robustness is maintained at the segment level, where signifi- results of evaluations of several methods for deriving these clouds cantly fewer running transcript words are available to generate a from transcripts. Finally, we take a further step towards full au- given surrogate. tomation of the process of automatic generation of topical segment It is important to note that the issue of highly articulated seman- level representations of podcast episodes by exploring the applica- tic surrogates for podcast episodes is a fundamental one. Although tion of the TextTiling algorithm to ASR-transcripts. We finish with improved ASR-transcripts would facilitate the generation of sur- concluding remarks and an outlook on future work. rogates representing the semantic content of podcast episode seg- ments, ASR-transcripts, even if they were completely error-free, are inherently unsuitable to serve as surrogates for speech content. 2. BACKGROUND Speaker prosody is an important component of spoken audio, mod- Issues of presenting speech documents to users and giving users ulating meaning and signaling structure. However, prosody fails a means by which to skim or browse within spoken content has at- to be represented in conventional ASR-transcripts. Unlike a news- tracted attention since research first began on speech search. Early paper report, a blog or an online product review, spoken content examples include a system for interactive skimming [1] and an in- was not generated