Document Similarity Amid Automatically Detected Terms∗

Hardik Joshi, Gujarat University, Ahmedabad, India ([email protected])
Jerome White, New York University, Abu Dhabi, UAE ([email protected])

∗ An initial version of this task appeared in 2013 under the title "Question Answering for the Spoken Web."

ABSTRACT

This is the second edition of the task formerly known as Question Answering for the Spoken Web (QASW). It is an information retrieval evaluation in which the goal was to match spoken Gujarati "questions" to spoken Gujarati responses. This paper gives an overview of the task, covering its design and the development of the test collection, along with differences from previous years.

1. INTRODUCTION

Document Similarity Amid Automatically Detected Terms is an information retrieval evaluation in which the goal was to match "questions" spoken in Gujarati to responses spoken in Gujarati. The design of the task was motivated by a speech retrieval interaction paradigm first proposed by Oard [4]. In this paradigm, a searcher, using speech for both queries and responses, speaks extensively about what they seek to find until interrupted by the system with a single potential answer. This task follows a stream of similar efforts, most notably the Question Answering for the Spoken Web (QASW) task from FIRE 2013 and an attempted task in MediaEval from the same year.

2. QUESTIONS AND RESPONSES

The source of the questions and the collection of possible answers (which we call "responses") was the IBM Spoken Web Gujarati collection [6]. This collection was based on a spoken bulletin board system for Gujarati farmers. A farmer could call the system and record their question by going through a set of prompts. Other farmers would call the system to record answers to those questions. There was also a small group of system administrators who would periodically call in to leave announcements that they expected would be of interest to the broader farming community. The system was completely automated; no human intervention or call center was involved. This collection of recorded speech, consisting of questions and responses (answers and announcements), provided the basis for the test collection. There were initially a total of 3,557 spoken documents in the corpus. From system logs, these documents were divided into a set of queries and a set of responses. Although there was a mapping between queries and answers, because of the freedom given to farmers this mapping was not always "correct." That is, it was not necessarily the case that a caller specifying that they would answer a question, and thus creating a question-to-answer mapping in the call log, would actually answer that question. This also meant, however, that in some cases responses applied to more than one query, as the same topics might be asked about more than once.

The 151 longest queries were initially divided into a training set of 50 questions and an evaluation set of 101 questions. Training questions were those for which the largest number of answers were known beforehand (mappings between questions and known answers were available to the organizers from data collected by the operational system). Once the transcripts became available, two evaluation questions were removed for which the resulting transcripts were far shorter than would be expected based on the file length. This resulted in a total of 50 training questions and 99 evaluation questions. Of the 50 training questions, results from the 2013 QASW task revealed that only 17 were actual queries. Further, of these 17, 10 had more than one relevant document. These 17 queries, along with their relevance judgements, were made available to participants as the training set.

The set of response files did not change from the QASW task: in that case, very short response files were removed, along with files that did not seem to be in line with their corresponding transcript. (The reader is referred to the initial QASW summary document as to why transcripts, in general, were not available for this task.) After removal, the final test collection contained 2,999 responses.

3. SPEECH PROCESSING

Recent term discovery systems [5, 2] automatically identify repeating words and phrases in large collections of audio, providing an alternative means of extracting lexical features for retrieval tasks. Such discovery is performed without the assistance of supervised speech tools by instead resorting to a search for repeated trajectories in a suitable acoustic feature space (for example, MFCCs or PLP) followed by a graph clustering procedure. Due to their sometimes ambiguous content, the discovered units are referred to as pseudoterms, and we can represent each question and response as a set of pseudoterm offsets and durations. A complete specification of the term discovery system used for this work can be found in the literature [1, 3].

Briefly, the system functions by constructing a sparse (thresholded) distance matrix across the frames of the entire corpus. It then searches for approximately diagonal line structures in that matrix, as such structures are indicative that a word or phrase has been repeated. Once the sparse distance matrix has been constructed, it remains to search for runs of nearby frames, which make up extracted terms. A threshold δ dictates the frame distance that is considered an acceptable match, and thus the number of extracted regions. These regions are then clustered based on whether they overlap in a given dimension. Regions that happen to overlap are clustered; these clusters are known as pseudoterms.

The choice of δ has a strong influence on the number of pseudoterms that are produced. Lower thresholds imply higher fidelity matches that yield purer pseudoterm clusters with, on average, lower collection frequencies. The data made available for this task used, specifically, δ = 0.06, yielding 406,366 unique pseudoterms.
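To make the procedure above concrete, the following is a minimal, illustrative sketch of dot-plot style term discovery, assuming cosine distance over acoustic feature frames. The dense distance matrix, the run-finding loop, and the greedy one-dimensional merge are simplifications for exposition; the actual system [1, 3] keeps the matrix sparse and uses randomized search and graph clustering. All names and parameters here are assumptions of the sketch.

```python
# Illustrative sketch of dot-plot term discovery: threshold a frame-level
# self-similarity matrix, find approximately diagonal runs of matching
# frames, then merge overlapping runs into pseudoterm clusters.
import numpy as np

def diagonal_runs(features, delta=0.06, min_len=10):
    """Find diagonal runs of near-matching frames.

    features: (n_frames, n_dims) acoustic features (e.g., MFCC or PLP).
    delta:    distance threshold; lower values admit only closer matches.
    min_len:  minimum run length, in frames, to count as a repeated term.
    """
    # Cosine distance between every pair of frames. Dense here for clarity;
    # at corpus scale this matrix must be kept sparse.
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    match = (1.0 - unit @ unit.T) < delta

    runs = []
    n = len(features)
    # Each diagonal corresponds to one lag between a term and its repeat;
    # very small lags are skipped to avoid trivial self-matches.
    for lag in range(min_len, n):
        diag = np.append(np.diagonal(match, lag), False)  # False ends any run
        start = None
        for i, hit in enumerate(diag):
            if hit and start is None:
                start = i
            elif not hit and start is not None:
                if i - start >= min_len:  # long enough: a repeated word/phrase
                    runs.append(((start, i), (start + lag, i + lag)))
                start = None
    return runs

def merge_overlaps(runs):
    """Greedily merge time-overlapping intervals into pseudoterm clusters
    (a one-dimensional stand-in for the real system's graph clustering)."""
    intervals = sorted(iv for pair in runs for iv in pair)
    clusters = []
    for s, e in intervals:
        if clusters and s <= clusters[-1][1]:
            clusters[-1][1] = max(clusters[-1][1], e)
        else:
            clusters.append([s, e])
    return clusters
```

Under this view, lowering δ admits fewer frame pairs into the matrix, which shortens and purifies the recovered runs; that is the trade-off behind the choice of δ = 0.06 for the released data.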
4. EVALUATION DESIGN

Participating research teams were provided with the full set of pseudoterms extracted from the Gujarati collection. The principal task of a participating research team was similar to that of previous years: rank all responses to each full query such that, to the extent possible, all correct answers were ranked ahead of all incorrect answers. Each participating system was asked to rank all responses for all training questions. Systems were evaluated on their ability to satisfy that goal using mean average precision (MAP).
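To make the scoring concrete, here is a minimal sketch of mean average precision over ranked response lists, assuming rankings and relevance judgements keyed by query id. The bag-of-pseudoterms cosine ranking is a hypothetical baseline included only for illustration, not the system submitted to the task.

```python
# Sketch of the evaluation: mean average precision (MAP) over ranked
# response lists, plus a toy cosine ranking over pseudoterm counts.
from collections import Counter
from math import sqrt

def average_precision(ranking, relevant):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, qrels):
    """MAP over all judged queries; rankings and qrels keyed by query id."""
    return sum(average_precision(rankings[q], qrels[q]) for q in qrels) / len(qrels)

def rank_responses(query_terms, responses):
    """Hypothetical baseline: rank responses by cosine similarity of
    bag-of-pseudoterm counts, since only pseudoterms are visible to systems."""
    q = Counter(query_terms)
    q_norm = sqrt(sum(v * v for v in q.values())) or 1.0
    def score(doc_id):
        r = Counter(responses[doc_id])
        r_norm = sqrt(sum(v * v for v in r.values())) or 1.0
        return sum(q[t] * r[t] for t in q) / (q_norm * r_norm)
    return sorted(responses, key=score, reverse=True)
```

A run of depth 1000, as submitted to this task, would simply truncate each ranking to its first 1000 responses before scoring.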
5. RESEARCH TEAMS

In FIRE 2014 the task was proposed as "Document Similarity Amid Automatically Detected Terms." Three teams registered to participate; however, only one team (HGH), from DA-IICT, Gandhinagar, submitted runs within the time frame. The participating team submitted two runs, each of depth 1000.

6. EVALUATION AND RESULTS

Evaluation was done by pooling the top 10 documents. Relevance judgements were carried out manually by listening to each audio file. A summary of the results is shown in Table 1.

Particulars       Run-1    Run-2
Num. of Queries   99       99
MAP Score         0.1600   0.1600

Table 1: MAP scores of the HGH team.

Both runs submitted by the HGH team were identical and generated the same results. The documents used in the task contained pseudoterms; participating teams had no access to the audio itself. The task may therefore have appeared as a black box to participating teams: they were asked to retrieve the matching audio by looking only at the pseudoterms, while the judgements were made by listening to the audio files.

Although we received only a single submission, the experiments may give better MAP values by pooling the results at a depth of 100. We will make the result analysis more comprehensive in upcoming editions. We wish to scale the task to more languages in the future, and hope that more teams participate enthusiastically in FIRE 2015.

7. ACKNOWLEDGMENTS

We are grateful to Nitendra Rajput for providing the spoken questions and responses and for early discussions about evaluation design. We also thank Doug Oard for his guidance throughout the design and execution of this task. Finally, we wish to thank Aren Jansen for donating his time and expertise in creating the pseudoterm cluster set.

8. REFERENCES

[1] M. Dredze, A. Jansen, G. Coppersmith, and K. Church. NLP on spoken documents without ASR. In Proc. EMNLP, pages 460–470. Association for Computational Linguistics, 2010.
[2] A. Jansen, K. Church, and H. Hermansky. Towards spoken term discovery at scale with zero resources. In Proc. INTERSPEECH, pages 1676–1679, 2010.
[3] A. Jansen and B. Van Durme. Efficient spoken term discovery using randomized algorithms. In Proc. ASRU, 2011.
[4] D. W. Oard. Query by babbling: A research agenda. In Proceedings of the First Workshop on Information and Knowledge Management for Developing Regions, IKM4DR '12, pages 17–22, 2012.
[5] A. Park and J. R. Glass. Unsupervised pattern discovery in speech. IEEE T-ASLP, 16(1):186–197, 2008.
[6] N. Patel, D. Chittamuru, A. Jain, P. Dave, and T. S. Parikh. Avaaj Otalo: A field study of an interactive voice forum for small farmers in rural India. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 733–742. ACM, 2010.

Playing with distances: Document Similarity Amid Automatically Detected Terms @FIRE 2014

Harsh Thakkar (DA-IICT, Gandhinagar), Ganesh Iyer (DA-IICT, Gandhinagar), Honey Patel (Gujarat University, Ahmedabad), Kesha Shah (DA-IICT, Gandhinagar), Gujarat, India
[email protected], [email protected], [email protected], [email protected]

Abstract.