Using Ontology-Based Approaches to Representing Speech Transcripts for Automated Speech Scoring
Total Page:16
File Type:pdf, Size:1020Kb
Syracuse University SURFACE School of Information Studies - Dissertations School of Information Studies (iSchool) 8-2013 Using Ontology-Based Approaches to Representing Speech Transcripts for Automated Speech Scoring Miao Chen Follow this and additional works at: https://surface.syr.edu/it_etd Part of the Library and Information Science Commons Recommended Citation Chen, Miao, "Using Ontology-Based Approaches to Representing Speech Transcripts for Automated Speech Scoring" (2013). School of Information Studies - Dissertations. 87. https://surface.syr.edu/it_etd/87 This Dissertation is brought to you for free and open access by the School of Information Studies (iSchool) at SURFACE. It has been accepted for inclusion in School of Information Studies - Dissertations by an authorized administrator of SURFACE. For more information, please contact [email protected]. ABSTRACT Text representation is a process of transforming text into some formats that computer systems can use for subsequent information-related tasks such as text classification. Representing text faces two main challenges: meaningfulness of representation and unknown terms. Research has shown evidence that these challenges can be resolved by using the rich semantics in ontologies. This study aims to address these challenges by using ontology-based representation and unknown term reasoning approaches in the context of content scoring of speech, which is a less explored area compared to some common ones such as categorizing text corpus (e.g. 20 newsgroups and Reuters). From the perspective of language assessment, the increasing amount of language learners taking second language tests makes automatic scoring an attractive alternative to human scoring for delivering rapid and objective scores of written and spoken test responses. This study focuses on the speaking section of second language tests and investigates ontology-based approaches to speech scoring. Most previous automated speech scoring systems for spontaneous responses of test takers assess speech by primarily using acoustic features such as fluency and pronunciation, while text features are less involved and exploited. As content is an integral part of speech, the study is motivated by the lack of rich text features in speech scoring and is designed to examine the effects of different text features on scoring performance. A central question to the study is how speech transcript content can be represented in an appropriate means for speech scoring. Previously used approaches from essay and speech scoring systems include bag-of-words and latent semantic analysis representations, which are adopted as baselines in this study; the experimental approaches are ontology-based, which can help improving meaningfulness of representation units and estimating importance of unknown terms. Two general domain ontologies, WordNet and Wikipedia, are used respectively for ontology-based representations. In addition to comparison between representation approaches, the author analyzes which parameter option leads to the best performance within a particular representation. The experimental results show that on average, ontology-based representations slightly enhances speech scoring performance on all measurements when combined with the bag-of-words representation; reasoning of unknown terms can increase performance on one measurement (cos.w4) but decrease others. Due to the small data size, the significance test (t-test) shows that the enhancement of ontology-based representations is inconclusive. The contributions of the study include: 1) it examines the effects of different representation approaches on speech scoring tasks; 2) it enhances the understanding of the mechanisms of representation approaches and their parameter options via in- depth analysis; 3) the representation methodology and framework can be applied to other tasks such as automatic essay scoring. USING ONTOLOGY-BASED APPROACHES TO REPRESENTING SPEECH TRANSCRIPTS FOR AUTOMATED SPEECH SCORING by Miao Chen B.S., Peking University, 2005 Dissertation Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Science and Technology. Syracuse University August 2013 Copyright © Miao Chen August 2013 All Rights Reserved ACKNOWLEDGEMENTS I owe many thanks to many people that helped me enormously through my dissertation: my advisor Prof. Jian Qin, who led me throughout the once tough time of my doctoral life, taught me to be a researcher, and put through a great deal of time on my thesis; my committee members professors Bei Yu and Howard Turtle, who provided me invaluable input not only to my dissertation but also my research in general; Dr. Klaus Zechner, to whom my special thanks go to, my mentor at Educational Testing Service when interned there, kindly provided data set from ETS, and discussed every detail of my thesis. I started my PhD work from 2005 with the goal of doing some research in information retrieval, while ending up in conducting research in natural language processing and ontology. I am pleased that the current work is not very far from my goal 8 years ago, and I really enjoy research in text analytics. During my thesis writing time, I received help from many people, in life and in research. Prof.Beth Plale, whom I worked for at Indiana University, generously allowed me enough time to finish up the thesis. Prof. Nancy McCracken has helped me in research for quite several years and I really appreciate her suggestions on my research. My PhD journey has been a long one, but it is a good journey and I learned many things. I especially thank my husband Xiaozhong Liu, without whose strong support and love this dissertation would have been impossible. I am grateful to my parents, who are always in upright spirit, and my parents in law, who helped take care of my daughter during the busiest dissertation writing time. Lastly, this is also dedicated to my daughter Tiana Liu, whose smile is the best gift when waking up in the morning. v CONTENT TABLE CHAPTER 1. INTRODUCTION ........................................................................... 1 1.1 REPRESENTATION IN INFORMATION SCIENCE: A GENERAL PERSPECTIVE ................................ 1 1.2 TEXT REPRESENTATION: OVERVIEW AND CHALLENGES .......................................................... 7 1.3 ONTOLOGY-BASED REPRESENTATION: A COMPLEMENT TO THE CHALLENGES ....................... 10 1.4 THE TEST BED: CONTENT SCORING OF SPEECH ................................................................... 12 1.5 RESEARCH FRAMEWORK AND DESIGN.................................................................................. 15 1.5.1 Conceptual Framework ............................................................................................... 15 1.5.2 Research Design ........................................................................................................ 16 1.6 RESEARCH QUESTIONS ....................................................................................................... 19 1.7 CONTRIBUTIONS AND IMPLICATIONS ..................................................................................... 21 CHAPTER 2 LITERATURE REVIEW ................................................................ 22 2.1 DOCUMENT REPRESENTATION IN GENERAL .......................................................................... 22 2.1.1 Bag of Words .............................................................................................................. 23 2.1.2 Latent Semantic Analysis ........................................................................................... 25 2.1.3 Other Representation Approaches ............................................................................. 30 2.1.4 Local and Global Representations .............................................................................. 32 2.1.5 Dimensionality Reduction ........................................................................................... 33 2.1.6 Document Representation in Essay Scoring .............................................................. 34 2.2 SECOND LANGUAGE ASSESSMENT AND AUTOMATED SCORING ............................................. 35 2.2.1 Theoretical Aspects .................................................................................................... 35 2.2.1.1 Overview ............................................................................................................................. 35 2.2.1.2 Speaking Proficiency ........................................................................................................... 38 2.2.2 Automated Scoring for Second Language Assessment ............................................. 41 2.2.2.1 Automated Essay Scoring ................................................................................................... 42 2.2.2.2 Automated Speech Scoring ................................................................................................ 45 2.3 ONTOLOGY AND ITS USE IN TEXT PROCESSING .................................................................... 48 2.3.1 Definitions of Ontology ................................................................................................ 48 2.3.2 Use in Text Processing ............................................................................................... 55 2.4 SUMMARY ........................................................................................................................... 61 CHAPTER 3. METHODOLOGY .......................................................................