Classifying Sentences as Speech Acts in Message Board Posts
In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-2011).

Ashequl Qadir and Ellen Riloff
School of Computing
University of Utah
Salt Lake City, UT 84112
{asheq,riloff}@cs.utah.edu

Abstract

This research studies the text genre of message board forums, which contain a mixture of expository sentences that present factual information and conversational sentences that include communicative acts between the writer and readers. Our goal is to create sentence classifiers that can identify whether a sentence contains a speech act, and can recognize sentences containing four different speech act classes: Commissives, Directives, Expressives, and Representatives. We conduct experiments using a wide variety of features, including lexical and syntactic features, speech act word lists from external resources, and domain-specific semantic class features. We evaluate our results on a collection of message board posts in the domain of veterinary medicine.

1 Introduction

In the 1990’s, the natural language processing community shifted much of its attention to corpus-based learning techniques. Since then, most of the text corpora that have been annotated and studied are collections of expository text (e.g., news articles, scientific literature, etc.). The intent of expository text is to present or explain information to the reader. In recent years, there has been a growing interest in text genres that originate from Web sources, such as weblogs and social media sites (e.g., tweets). These text genres offer new challenges for NLP, such as the need to handle informal and loosely grammatical text, but they also pose new opportunities to study discourse and pragmatic phenomena that are fundamentally different in these genres.

Message boards are common on the WWW as a forum where people ask questions and post comments to members of a community. They are typically devoted to a specific topic or domain, such as finance, genealogy, or Alzheimer’s disease. Some message boards offer the opportunity to pose questions to domain experts, while other communities are open to anyone who has an interest in the topic.

From a natural language processing perspective, message board posts are an interesting hybrid text genre because they consist of both expository text and conversational text. Most obviously, the conversations appear as a thread, where different people respond to each other’s questions in a sequence of posts. Studying the conversational threads, however, is not the focus of this paper. Our research addresses the issue of conversational pragmatics within individual message board posts.

Most message board posts contain both expository sentences as well as speech acts. The person posting a message (the “writer”) often engages in speech acts with the readers. The writer may explicitly greet the readers (“Hi everyone!”), request help from the readers (“Anyone have a suggestion?”), or commit to a future action (“I promise I will report back soon.”). But most posts contain factual information as well, such as general knowledge or personal history describing a situation, experience, or predicament.

Our research goals are twofold: (1) to distinguish between expository sentences and speech act sentences in message board posts, and (2) to classify speech act sentences into four types: Commissives, Directives, Expressives, and Representatives, following Searle’s original taxonomy (Searle, 1976). Speech act classification could be useful for many applications. Information extraction systems could benefit from filtering speech act sentences (e.g., promises and questions) so that facts are only extracted from the expository text. Identifying Directive sentences could be used to summarize the questions being asked in a forum over a period of time. Representative sentences could be extracted to highlight the conclusions and beliefs of domain experts in response to a question.

In this paper, we present sentence classifiers that can identify speech act sentences and classify them as Commissive, Directive, Expressive, and Representative. First, we explain how each speech act class is manifested in message board posts, which can be different from how they occur in spoken dialogue. Second, we train classifiers to identify speech act sentences using a variety of lexical, syntactic, and semantic features. Finally, we evaluate our system on a collection of message board posts in the domain of veterinary medicine.

2 Related Work

There has been relatively little work on applying speech act theory to written text genres, and most of the previous work has focused on email classification. Cohen et al. (2004) introduced the notion of “email speech acts” defined as specific verb-noun pairs following a pre-designed ontology. They approached the problem as a document classification task. Goldstein and Sabin (2006) adopted this notion of email acts (Cohen et al., 2004) but focused on verb lexicons to classify them. Carvalho and Cohen (2005) presented a classification scheme using a dependency network, capturing the sequential correlations with the context emails using transition probabilities from or to a target email. Carvalho and Cohen (2006) later employed N-gram sequence features to determine which N-grams are meaningfully related to different email speech acts with a goal towards improving their earlier email classification based on the writer’s intention.

Lampert et al. (2006) performed speech act classification in email messages following a verbal response modes (VRM) speech act taxonomy. They also provided a comparison of VRM taxonomy with Searle’s taxonomy (Searle, 1976) of speech act classes. They evaluated several machine learning algorithms using syntactic, morphological, and lexical features. Mildinhall and Noyes (2008) presented a stochastic speech act model based on verbal response modes (VRM) to classify email intentions.

Some research has considered speech act classes in other means of online conversations. Twitchell and Jr. (2004) and Twitchell et al. (2004) employed speech act profiling by plotting potential dialogue categories in a radar graph to classify conversations in instant messages and chat rooms. Nastri et al. (2006) performed an empirical analysis of speech acts in the away messages of instant messenger services to achieve a better understanding of the communication goals of such services. Ravi and Kim (2007) employed speech act profiling in online threaded discussions to determine message roles and to identify threads with questions, answers, and unanswered questions. They designed their own speech act categories based on their analysis of student interactions in discussion threads.

The work most closely related to ours is the research of Jeong et al. (2009) on semi-supervised speech act recognition in both emails and forums. Like our work, their research also classifies individual sentences, as opposed to entire documents. However, they trained their classifier on spoken telephone (SWBD-DAMSL corpus) and meeting (MRDA corpus) conversations and mapped the labelled dialog act classes of these corpora to 12 dialog act classes that they found suitable for email and forum text genres. These dialog act classes (addressed as speech acts by them) are somewhat different from Searle’s original speech act classes. They also used substantially different types of features than we do, focusing primarily on syntactic subtree structures.

3 Classifying Speech Acts in Message Board Posts

3.1 Speech Act Class Definitions

Searle’s (Searle, 1976) early research on speech acts was seminal work in natural language processing that opened up a new way of thinking about conversational dialogue and communication. Our goal was to try and use Searle’s original speech act definitions and categories as the basis for our work to the greatest extent possible, allowing for some interpretation as warranted by the WWW message board text genre.

For the purposes of defining and evaluating our work, we created detailed annotation guidelines for four of Searle’s speech act classes that commonly occur in message board posts: Commissives, Directives, Expressives, and Representatives. We omitted the fifth of Searle’s original speech act classes, Declarations, because we virtually never saw declarative speech acts in our data set.1 The data set used in our study is a collection of message board posts in the domain of veterinary medicine. We designed our definitions and guidelines to reflect language use in the text genre of message board posts, trying to be as domain-independent as possible so that these definitions should also apply to message board texts representing other topics. However, we give examples from the veterinary domain to illustrate how these speech act classes are manifested in our data set.

[...] the speaker expects the listener to do something as a response. For example, the speaker may ask a question, make a request, or issue an invitation. Directive speech acts are common in message board posts, especially in the initial post of each thread when the writer explicitly requests help or advice regarding a specific topic. Many Directive sentences are posed as questions, so they are easy to identify by the presence of a question mark. However, the language in message board forums is informal and often ungrammatical, so many Directives are posed as a question but do not end in a question mark (e.g., “What do you think.”). Furthermore, many Directive speech acts are not stated as a question but as a request for assistance. For example, a doctor may write “I need your opinion on what drug to give this patient.” Finally, some sentences that end in question marks are rhetorical in nature and do not represent a Directive speech act, such as “Can you believe that?”.

Expressives: An Expressive speech act occurs in conversation when a speaker expresses his or her