
International Conference on Semantic Computing

Lexical and Discourse Analysis of Online Chat Dialog

Eric N. Forsyth and Craig H. Martell
Department of Computer Science, Naval Postgraduate School
[email protected], [email protected]

Abstract

One of the ultimate goals of natural language processing (NLP) systems is understanding the meaning of what is being transmitted, irrespective of the medium (e.g., written versus spoken) or the form (e.g., static documents versus dynamic dialogues). Although much work has been done in traditional language domains such as speech and static written text, little has yet been done in the newer communication domains enabled by the Internet, e.g., online chat and instant messaging. This is in part due to the fact that there are no annotated chat corpora available to the broader research community. The purpose of this research is to build a chat corpus, tagged with lexical (token part-of-speech labels), syntactic (post parse tree), and discourse (post classification) information. Such a corpus can then be used to develop more complex, statistical-based NLP applications that perform tasks such as author profiling, entity identification, and social network analysis.

1. Introduction

In 2006, Jane Lin [1] collected 475,000+ posts made by 3200+ users from five different age-oriented chat rooms at an Internet chat site. The chat rooms were not limited to a specific topic, i.e. were open to discussion of any topic. Lin's goal was to automatically determine the age and gender of the poster based on their chat "style". The features she captured were surface details of the post, namely, average number of words per post, vocabulary breadth, use of emoticons, and punctuation usage. Lin relied on the user's profile information to establish the "truth" of each user's age and gender.

The data Lin captured has enormous potential, and as such has formed the foundation of an ongoing research effort at the Naval Postgraduate School's Autonomous Systems Laboratory. Specifically, the goals related to this effort include the following: 1) preserve the online chat dialog in an XML-based corpus to aid in future accessibility to the data; 2) annotate the chat corpus with lexical, syntactic, and discourse information; and 3) use this annotated corpus to develop, train and test higher-level NLP applications.

There are numerous NLP applications that could benefit from an annotated chat corpus. For example, law enforcement and intelligence analysts could use author profiling and entity identification applications to help detect predatory or terrorist activities on the Internet. On the other side of the spectrum, legitimate chat use could be enhanced by applications that automatically identify and group the multiple threads of conversation that often occur within chat.

2. Building the Corpus

The Python programming language was the primary tool we used to build the corpus. Within Python, we used Lundh's ElementTree module [2] to create, edit, store, and retrieve the XML documents that comprised the corpus. We also used Schemenauer's back-propagation neural network Python class [3] for our automated post classification effort. In addition, Loper and Bird's Natural Language Toolkit Lite (NLTK-Lite) Python modules [4] formed the basis for our automated lexical analysis. Finally, we used an XML parser for subsequent corpus editing and validation.

One of the challenging aspects we faced in developing the corpus was sanitizing it to protect user privacy. If the corpus is to be made available to the larger research community, this must be accomplished. It was straightforward to replace the user's screen name in both the session logs as well as the user profile with a mask, for example, "killerBlonde51" with "10-19-30sUser112."
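To make the XML-preservation goal and this straightforward masking step concrete, the following is a minimal sketch using ElementTree; the element and attribute names, the session file name, and the mask table are illustrative assumptions on our part, not the corpus's actual schema.

```python
import re
import xml.etree.ElementTree as ET  # Lundh's ElementTree, bundled with Python

# Hypothetical mapping from real screen names to masks; the paper's masks take
# the form "<age-group>User<number>", e.g. "10-19-30sUser112".
mask_table = {"killerBlonde51": "10-19-30sUser112"}

def mask_screen_names(text):
    """Replace exact occurrences of known screen names with their masks."""
    for name, mask in mask_table.items():
        text = re.sub(re.escape(name), mask, text)
    return text

def build_post_element(screen_name, text):
    """Store one chat post as an XML element (illustrative schema)."""
    post = ET.Element("post", attrib={"user": mask_table.get(screen_name, screen_name)})
    post.text = mask_screen_names(text)
    return post

session = ET.Element("session")
session.append(build_post_element("killerBlonde51", "hiya killerBlonde51 hug"))
ET.ElementTree(session).write("session.xml", encoding="utf-8")
```

This only covers exact occurrences of a screen name, which is the straightforward case described above.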


However, more often than not, users were referred to by variations of their screen names in other users' posts. For example, other users would refer to "killerBlonde51" as "killer", "Blondie", "kb51", etc. Although regular expressions can assist in the masking task, ultimately 100% masking requires hand verifying that the appropriate masks have been applied in every post. To date, complete masking has been accomplished on 3,507 (~700 posts/chat room) of the 475,000+ posts.

It should be noted that although masking is essential to ensure privacy, it results in a loss of information. For example, the way in which users are referred to often conveys additional information, such as familiarity and emotion; this information is lost in the masking process. In addition, it was observed that a user's screen name would become a topic of conversation independent from the original user; again, the origin of this conversation thread is lost in the masking process.

3. Discourse Analysis: Post Classification

A great deal of research has been performed regarding discourse analysis of spoken language. Stolcke, et al [5] developed over 40 tags associated with different dialog acts used in conversational speech. Certainly, a fundamental reason why online chat is similar to spoken conversational speech is that a conversation is taking place. In addition, fillers like "you know" and "really" as well as interjections like "hey" and "awww" occur both in speech and online chat. However, with chat, multiple topics are being discussed by multiple people simultaneously, and people don't always "wait their turn" when posting. Finally, the stops and restarts associated with spoken dialog do not seem to occur in chat.

Obviously, chat is also very similar to written text. However, chat participants often spell words phonetically, e.g. "dontcha" for "don't you". In addition, they make extensive use of emoticons and abbreviations, e.g. ":-)" and "LOL" (Laughing Out Loud). Finally, due to the nature of the medium, words are frequently misspelled.

Recognizing these distinctions, Wu, et al [6] used subsets of previous dialog act tags along with chat-specific tags to automatically classify 3,129 chat posts over Internet Relay Chat channels into 1 of 15 categories using Transformation-Based Error-Driven learning. As an initial annotation attempt for our online chat corpus, we classified the 3,507 user-sanitized posts mentioned earlier using Wu's 15 post categories, and investigated two different machine learning algorithms to automatically classify the posts. Wu's classification categories as well as an example of each taken from our corpus are shown below.

Table 1. Post classification examples

Classification     Example
Accept             yeah it does, they all do
Bye                night ya'all.
Clarify            i meant to write the may.....
Continuer          and thought I'd share
Emotion            lol
Emphasis           Ok I'm gonna put it up ONE MORE TIME 10-19-30sUser37
Greet              hiya 10-19-40sUser43 hug
No Answer          no I had a roomate who did though
Other              0
Reject             u r not on meds
Statement          Yay...democrats have taken the house!
System             JOIN 11-08-20sUser70
Wh-Question        why do you feel that way?
Yes Answer         why yes I do 10-19-40sUser24, lol
Yes/No Question    cant we all just get along

These examples highlight the complexity of the task at hand. First, we should note that posts were classified into only one of the 15 categories. At times, more than one category might apply. In addition, the "Wh-Question" example does not start with a "wh" token, while the "Yes Answer" does start with a "wh" token. Also, notice that the "Yes/No Question" does not include a question mark. Finally, the "Statement" example contains a token that conveys an emotion ("yay"). Taken together, these examples highlight the fact that more than just simple matching is required to classify these posts accurately.

The initial post classification task was assisted by simple regular expression matching, followed by hand correction of each post. Of these posts, various, randomly-selected subsets were used for training (3007 posts total) and testing (500 posts total).
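To give a flavor of what that first regular-expression pass might look like, here is a minimal sketch; the specific patterns, their priority order, and the fallback to "Statement" are our illustrative assumptions, not the exact rules the authors used.

```python
import re

# Illustrative first-pass rules only; every post was subsequently hand-corrected.
RULES = [
    ("System",          re.compile(r"^(JOIN|PART)\b")),
    ("Greet",           re.compile(r"^(hi|hiya|hello|hey)\b", re.I)),
    ("Bye",             re.compile(r"^(bye|night|later|cya)\b", re.I)),
    ("Emotion",         re.compile(r"^(lol|haha|:-?\)|;-?\))", re.I)),
    ("Wh-Question",     re.compile(r"^(who|what|when|where|why|how)\b", re.I)),
    ("Yes/No Question", re.compile(r"^(is|are|do|does|can|cant|will)\b", re.I)),
    ("Yes Answer",      re.compile(r"^(yes|yeah|yep)\b", re.I)),
    ("No Answer",       re.compile(r"^(no|nope|nah)\b", re.I)),
]

def first_pass_label(post_text):
    """Return a rough dialog-act guess for one post (to be hand-corrected)."""
    for label, pattern in RULES:
        if pattern.search(post_text):
            return label
    return "Statement"  # default to the most frequent class

print(first_pass_label("hiya 10-19-40sUser43 hug"))    # Greet
print(first_pass_label("cant we all just get along"))  # Yes/No Question
```

As the examples above warned, rules like these misfire (the "Yes Answer" post beginning with "why" would be caught by the wh-rule), which is why hand correction of every post was still required.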


The overall frequencies of the post classes in our sanitized corpus are shown below. Note that the highest occurring category of posts was "Statement", with more than double the next highest classification category.

Table 2. Post classification frequencies

Class              Count    Percent
Statement          1210     34.50%
System              597     17.02%
Greet               470     13.40%
Emotion             404     11.52%
Wh-Question         187      5.33%
Yes/No Question     183      5.22%
Continuer           122      3.48%
Accept               86      2.45%
Reject               75      2.14%
Bye                  55      1.57%
Yes Answer           41      1.17%
No Answer            33      0.94%
Emphasis             17      0.48%
Other                15      0.43%
Clarify              12      0.34%

The machine learning algorithms we used require a set of features on which to base their automated classification. The definition of the set of features used is shown below, with a brief discussion following.

1. Number of posts ago the poster last posted (normalized by max session length).
2. Number of posts ago that a post led with a yes/no question or included a "?" pattern (normalized by max session length).
3. Number of posts in the future that contain a yes or no pattern (normalized by max session length).
4. Number of posts ago that a post led with a greet pattern (normalized by max session length).
5. Number of posts in the future that led with a greet pattern (normalized by max session length).
6. Number of posts ago that a post led with a bye pattern (normalized by max session length).
7. Number of posts in the future that led with a bye pattern (normalized by max session length).
8. Number of posts ago that a post was a JOIN (normalized by max session length).
9. Number of posts in the future that a post is a PART (normalized by max session length).
10. Total number of users currently logged on (normalized by max users in the session).
11. Total number of tokens in post (normalized by max length post in train/test set).
12. First token in post contains hello or variants.
13. First token in post contains goodbye or variants.
14. First token in post contains wh-question start such as who, what, where, etc.
15. First token in post contains yes/no-question start such as is, are, does, etc.
16. First token in post contains conjunction start such as and, but, or, etc.
17. Number of tokens in the post containing one or more "?" (normalized by maximum number of "?" found in a single post in train/test set).
18. Number of tokens in the post containing one or more "!" (normalized by max number of "!" found in a single post in train/test set).
19. Number of tokens in the post containing yes or variants (normalized by max number of yes variants found in a single post in train/test set).
20. Number of tokens in the post containing no or variants (normalized by max number of no variants found in a single post in train/test set).
21. Number of tokens in the post containing emotion variants such as lol, ;-), etc. (normalized by max number of emotions found in a single post in train/test set).
22. Number of token(s) in the post in all caps, e.g. JOIN (normalized by max number of tokens in caps found in a single post in train/test set).

Features 1-9 of a post are based on the posts surrounding it, specifically, the distance to posts with particular features, with the rationale that surrounding posts should give a hint to the nature of the post itself. For example, "Continuer" posts should be more likely to follow fairly closely to when the user last posted, and "Yes/No Answers" should follow fairly closely to posts with yes/no question characteristics. Feature 10 (current number of users logged on) was selected because it might help normalize the distances associated with Features 1 through 9 (with the rationale that more users currently logged on might increase those distances). Feature 11 is based on the post itself, with the rationale that the number of tokens will give a good initial hint at what the post is, e.g., longer posts being perhaps "Statements", and shorter posts being perhaps "Emotions" or "Yes/No Answers". Finally, Features 12-22 are also based on the post itself, but are looking for specific patterns which should give a clue on the nature of the post. For example, "Greet" posts should contain a token like "hello", while "Yes/No Questions" and "Wh-Questions" might contain "?" as a token.
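As a concrete illustration of how a few of these features might be computed for a single post, consider the sketch below; the tokenization, the normalization constants, and the helper name are our own simplifications of the scheme described above, not the authors' code.

```python
def post_features(tokens, max_post_len, max_question_tokens):
    """Compute a few of the 22 features (11, 14, 17, 22) for one tokenized post."""
    wh_starts = {"who", "what", "when", "where", "why", "how"}
    return {
        # Feature 11: post length, normalized by the longest post in the train/test set.
        "f11_num_tokens": len(tokens) / max_post_len,
        # Feature 14: does the first token look like a wh-question start?
        "f14_wh_start": 1.0 if tokens and tokens[0].lower() in wh_starts else 0.0,
        # Feature 17: tokens containing "?", normalized by the max seen in any post.
        "f17_question_tokens": sum("?" in t for t in tokens) / max_question_tokens,
        # Feature 22: all-caps tokens such as JOIN (normalization omitted here).
        "f22_allcaps_tokens": sum(1 for t in tokens if t.isalpha() and t.isupper()),
    }

print(post_features("why do you feel that way ?".split(),
                    max_post_len=50, max_question_tokens=5))
```

The distance-based features (1-9) would additionally require the surrounding session context, which is omitted from this sketch.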


3.1. Post classification learning algorithm #1: Back-propagation neural network

The initial machine learning method we investigated to classify posts was a back-propagation neural network. Specifically, it employed the following sigmoid activation function

f(x) = \arctan(x)

In addition, it consisted of input nodes, output nodes, and a single hidden layer of nodes, as well as learning rate and momentum factors. So, for our model, we had 22 input nodes (the number of features), 15 output nodes (the number of post classes), 14 hidden nodes, a learning rate of 0.05, and no momentum. We did not perform a global optimization on the hidden layer, learning rate, and momentum parameters. Instead, we varied them around set values and selected the configuration that reduced the error the most after twenty iterations on each configuration.

Precision, recall, and f-scores for each of the classes for one instance of a training/test set are shown below. Note that after training, we selected the output vector with the highest firing rate as the post classification of the test data fed into the neural net.

Table 3. Example neural net results

Class         Test Freq   Prec    Recall   FScore
Accept            16      0.417   0.313    0.357
Bye                2      0.667   1.000    0.800
Clarify            5      undef   0.000    undef
Continuer         15      undef   0.000    undef
Emotion           64      0.873   0.750    0.807
Emphasis           3      undef   0.000    undef
Greet             66      0.935   0.879    0.906
nAnswer            4      undef   0.000    undef
Other              3      undef   0.000    undef
Reject            12      0.500   0.250    0.333
Statement        164      0.670   0.915    0.773
System            78      0.975   1.000    0.987
whQuestion        32      0.909   0.625    0.741
yAnswer            8      undef   0.000    undef
ynQuestion        28      0.667   0.857    0.750

Performance of this neural net was comparable to the results obtained by Wu with Transformation-Based Error-Driven learning. As with Wu, the neural net does not appear to be able to make a reasonable classification unless a class appears in greater than 3% of the postings. Most of the misclassifications occur in the "Statement" class. We believe the reason for this is the fact that the "Statement" class is the maximum likelihood estimate (MLE) for the labeled data set. In other words, given no other information, the most likely label for a particular post is the Statement class based on the overall frequency of Statements in the data set. In particular, the frequency of Statements is twice that of the next highest category.

3.2. Post classification learning algorithm #2: Naïve Bayes

In addition to the neural network approach, we investigated using the Naïve Bayes machine-learning algorithm to classify posts. By Bayes Rule

P(C_i \mid f_1 \wedge f_2 \wedge \dots \wedge f_n) = \frac{P(f_1 \wedge f_2 \wedge \dots \wedge f_n \mid C_i)\, P(C_i)}{P(f_1 \wedge f_2 \wedge \dots \wedge f_n)}

But by assuming independence among the variables we classify a post according to

C = \arg\max_i \left[ P(f_1 \mid C_i)\, P(f_2 \mid C_i) \dots P(f_n \mid C_i)\, P(C_i) \right]

As with the neural network, we used the same 22 features as input to the algorithm. To estimate the actual probability distribution represented by our training data, we used "add-one", or Laplace, smoothing (see Mitchell's discussion of the m-estimate for a fuller account [7]). Precision, recall, and f-scores for each of the classes for one instance of a training/test set using the Naïve Bayes approach are shown below.

Table 4. Example Naïve Bayes results

Class         Test Freq   Prec    Recall   FScore
Accept            13      0.250   0.154    0.190
Bye                6      0.500   0.167    0.250
Clarify            1      undef   0.000    undef
Continuer         13      0.500   0.077    0.133
Emotion           63      0.846   0.524    0.647
Emphasis           4      undef   0.000    undef
Greet             76      0.849   0.816    0.832
nAnswer            5      undef   0.000    undef
Other              4      undef   0.000    undef
Reject             9      0.000   0.000    undef
Statement        170      0.552   0.871    0.676
System            79      0.987   0.987    0.987
whQuestion        25      0.762   0.640    0.696
yAnswer            7      undef   0.000    undef
ynQuestion        25      0.429   0.120    0.188

As can be seen, Naïve Bayes as implemented appears to perform less well than the 22 feature neural network model shown earlier.
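To make the classification rule above concrete, here is a minimal Naïve Bayes sketch with add-one smoothing. It assumes the 22 feature values are discretized into bins before counting (the paper does not state how its implementation handled the continuous, normalized features), so the binning scheme and all names here are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def discretize(value, bins=10):
    """Map a normalized feature value in [0, 1] to one of `bins` discrete levels."""
    return min(int(value * bins), bins - 1)

def train_naive_bayes(posts, labels, bins=10):
    """posts: list of 22-element feature vectors; labels: their post classes."""
    class_counts = Counter(labels)
    # feature_counts[c][j][v] = number of training posts of class c whose
    # j-th feature discretizes to value v.
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for feats, c in zip(posts, labels):
        for j, x in enumerate(feats):
            feature_counts[c][j][discretize(x, bins)] += 1
    return class_counts, feature_counts, bins

def classify(feats, model):
    class_counts, feature_counts, bins = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        # log P(C_i) plus the sum of log P(f_j | C_i), with add-one (Laplace) smoothing.
        score = math.log(n_c / total)
        for j, x in enumerate(feats):
            v = discretize(x, bins)
            score += math.log((feature_counts[c][j][v] + 1) / (n_c + bins))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working in log space simply avoids numerical underflow; the arg max is unchanged.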


To formally compare the performance between the two learning approaches, we randomly selected 30 train/test sets for each model, and calculated the mean and standard deviation of their f-scores. Due to time constraints, we limited the number of iterations for the neural network models to 100 for each of the 30 samples. We then performed a hypothesis test on two populations to see if there is a significant difference in the performance between the models. For 95% confidence, we reject the null hypothesis that the means are equal if |z| > 1.96. The results are shown below.

Table 5. Learning algorithm FScore comparison

                 NN Vector            Bayes Vector
Class            Mean     Std Dev     Mean     Std Dev     z
Accept           undef    undef       undef    undef       undef
Bye              0.761    0.140       undef    undef       undef
Clarify          undef    undef       undef    undef       undef
Continuer        undef    undef       undef    undef       undef
Emotion          0.802    0.042       0.615    0.061       13.950
Emphasis         undef    undef       undef    undef       undef
Greet            0.890    0.022       0.831    0.026       9.612
nAnswer          undef    undef       undef    undef       undef
Other            undef    undef       undef    undef       undef
Reject           undef    undef       undef    undef       undef
Statement        0.786    0.019       0.681    0.024       18.757
System           0.972    0.020       0.976    0.014       0.959
whQuestion       0.791    0.040       0.576    0.078       13.439
yAnswer          undef    undef       undef    undef       undef
ynQuestion       0.690    0.068       0.360    0.092       15.805

4. Lexical Analysis: Part of Speech Tagging

As dialog act classification forms the basis of discourse analysis, part-of-speech (POS) tagging is a fundamental form of lexical analysis, and is a critical input to higher order NLP tasks such as parsing. As such, we want to build highly accurate POS taggers to automatically annotate our online chat corpus. The ultimate accuracy of POS taggers for a particular domain depends on two aspects: 1) the algorithm used to make the tagging decision; and 2) if statistically-based, the data used to train the tagger.

The basic tagging algorithm we implemented involved training a bigram tagger, backing off to a unigram tagger, backing off to the maximum likelihood estimate tag; we'll subsequently refer to this as our bigram backoff tagger. Working backwards, the maximum likelihood estimate tag is the most common tag within the training set.

t = \arg\max_{t_i} \left[ \mathrm{count}(t_i) \right]

A unigram tagger assigns the most common POS tag to a word based on its occurrence in the training data.

t = \arg\max_{t_i} \left[ P(t_i \mid w_i) \right]

Finally, a bigram tagger assigns the most common POS tag to a word not only based on the current word, but also the previous word as well as the previous word's POS tag.

t = \arg\max_{t_i} \left[ P(t_i \mid w_i \wedge w_{i-1} \wedge t_{i-1}) \right]

Thus, our tagging approach works as follows: The tagger will first attempt to use bigram information from the training set. If no such bigram information exists, it will then back off to unigram information from the training set. If no such unigram information exists, it will finally back off to the MLE tag for the training set (a minimal code sketch of this backoff chain is given below).

Several POS-tagged corpora in many languages are available to NLP researchers. The corpora we used to train various versions of our taggers are contained within the Linguistic Data Consortium's Penn Treebank-3 distribution [8]. The first corpus, referred to as Wall Street Journal (WSJ), contains over one million POS-tagged words collected in 1989 from the Dow Jones News Service. The second, referred to as Switchboard, was originally collected in 1990 and contains about 2,400 transcribed, POS-tagged, two-sided telephone conversations among 543 speakers from all areas of the United States. Finally, the third, referred to as Brown, consists of over one million POS-tagged words collected from 15 genres of written text originally published in 1961. All corpora were tagged with the Penn Treebank tag set.

In addition to the aforementioned Penn Treebank corpora, 1,391 POS-tagged posts from our chat corpus were used to train/test various versions of our taggers. The posts (a subset of our 3,507 user-sanitized posts) were initially tagged with a bigram/regular expression tagger trained on Switchboard and Brown and then hand-corrected. In the end, the 1,391 posts provided a total of 6,078 POS-tagged words (tokens). Although the posts were tagged using the Penn Treebank tag set and associated tagging guidelines [9], we had to make several decisions during the process that were unique to the chat domain.

The first class of decisions regarded the tagging of abbreviations such as "LOL" and emoticons such as ":-)" frequently encountered in chat. Since these expressions conveyed emotion, they were treated as individual tokens and tagged as interjections ("UH").
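The backoff chain described above maps naturally onto chained taggers. The sketch below uses the tagger classes of the current NLTK release for illustration; the paper's implementation was built on the older NLTK-Lite modules [4], whose interface differed, so treat this as an approximation rather than the authors' code.

```python
from collections import Counter
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

def build_bigram_backoff_tagger(train_sents):
    """train_sents: list of sentences, each a list of (word, tag) pairs."""
    # MLE tag: the single most common tag in the training data.
    tag_counts = Counter(tag for sent in train_sents for _word, tag in sent)
    mle_tag = tag_counts.most_common(1)[0][0]

    # Chain: bigram -> unigram -> MLE, as described in the text.
    mle_tagger = DefaultTagger(mle_tag)
    unigram = UnigramTagger(train_sents, backoff=mle_tagger)
    bigram = BigramTagger(train_sents, backoff=unigram)
    return bigram

# Hypothetical usage with a hand-tagged chat post:
# tagger = build_bigram_backoff_tagger(train_sents)
# print(tagger.tag(["hiya", "10-19-40sUser43", "hug"]))
```

Note that NLTK's BigramTagger conditions on the previous tag and the current word, a slight simplification of the context given in the formula above.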


The second class involved words that, although they would be considered misspelled by traditional written English standards, were so frequently encountered within the chat domain that they were treated as correctly spelled words and tagged according to the closest corresponding word class. As an example, the token "wont" (when referring to "won't"), if treated as a misspelling, would be tagged as "^MD^RB", with the "^" referring to a misspelling and "MD" and "RB" referring to "modal" and "adverb", respectively. However, since it was so frequently encountered in the chat domain, we tagged it as "MD".

The final class of decisions involved words that were just plain misspelled; in that case, they were tagged with the misspelled version of the tag. As an example, "intersting" (when referring to "interesting") was tagged as "^JJ", a misspelled adjective.

However, before determining what the most accurate bigram backoff tagger for the chat domain was, we first needed a baseline comparison. To do this, we trained and tested a bigram tagger for each of the other domains, using the same amount of data as we had for the chat domain. Since we had 1,391 tagged chat posts, one might be inclined to select training/test sets consisting of 1,391 sentences from the other domains. However, the unit of concern is at the token-, and not sentence-level. Therefore, this would be inappropriate, since Treebank corpora sentences were much longer than chat posts. Since the 1,391 tagged chat posts contained 6,078 tokens, we randomly selected contiguous sections of the Wall Street Journal and Switchboard corpora, each containing at least 6,078 tokens (plus the tokens necessary to complete the last sentence) to serve as source data for those domains. From those selections, we created 30 different training/test sets by randomly removing ~14.4% of the sentence-level units from each domain to serve as test data, with the remainder serving as training data. Summary statistics for the corpora selections as well as their associated bigram backoff tagger performance are shown in the table below.

Table 6. Corpora tokens and types example

                           Chat     WSJ      Switch
Sentence-Level Units       1391     106      412
Tokens                     6078     6107     6079
Types                      1477     1891     921
Tokens/Type                4.115    3.230    6.600
Bigram Accuracy (mean)     0.737    0.722    0.802
Bigram Accuracy (std dev)  0.014    0.013    0.015

Again, the purpose of this initial analysis was to determine, when given an equivalent amount of data to train and test from, how the bigram backoff tagger trained and tested on chat compares to similar taggers for the WSJ and Switchboard domains. Clearly, the performance of the chat domain tagger is on par with the other domains. However, notice the trend that as the Tokens/Type figure increases, the accuracy of the tagger also increases. For the WSJ and Switchboard domains, this particular training/test selection is typical when compared to the means and standard deviations of 30 contiguous samples taken from each domain; see Table 7 below. This makes sense from a qualitative standpoint, since as the number of tokens for a particular type increases, the more data there is available for a statistical tagger to base a tagging decision on. This, however, is not the only measure of a domain's linguistic variety at the lexical level. Certainly, looking at only the types of lemmas is something that could be taken into account when considering lexical variety. Also, the greater the number of POS tags for a particular type, the more difficult it will be for a tagger with a limited context such as the bigram tagger to make the correct tagging decision given a limited amount of data. That being said, it is interesting to note that, based on the tokens/type figure alone, chat is significantly more varied lexically than transcribed speech, being much closer to the WSJ written text domain. More importantly, though, this snapshot, although based on a specific test/training size, provides a level of confidence that state-of-the-art statistical taggers employed on chat should reach similar accuracy rates given similar amounts of training data.


Table 7. Corpora tokens/type (~6078 tokens per sample, 30 contiguous samples)

                    WSJ      Switch
Mean Tokens/Type    3.221    6.614
Std Dev             0.180    0.308

The question here, of course, is exactly what sort of non-chat data should we use to train our chat tagger on. The following table provides the mean tagging accuracy and associated standard deviations for 30 test sets (200 posts/test set) for five different bigram backoff taggers trained on the following corpora: 1) WSJ; 2) Switchboard; 3) Brown; 4) all three Treebank corpora; and 5) the remaining 1,191 POS-tagged chat posts.

Table 8. Bigram backoff tagger accuracy based on training corpus

                 WSJ      Brown    Switch   Treebank   Chat
Mean Accuracy    0.574    0.583    0.621    0.658      0.737
Std Dev          0.019    0.022    0.017    0.017      0.014

Clearly, the bigram backoff taggers trained on chat perform significantly better than the other taggers, even though the Treebank-based taggers were trained on millions of words (compared to thousands of words for the chat taggers). This is not surprising, since chat has a vocabulary quite different from the other domains, to include the extensive use of emoticons and abbreviations which appear nowhere in the Treebank domains. It is interesting to note how taggers trained on Switchboard perform significantly better than those trained on other Treebank domains. This is due in part to the fact that Switchboard contains several interjections used extensively in chat that are simply not found in the other domains, to include "yeah", "uh-uh", "hmm", "Hi", etc.

Given that training on chat seems to be the best single data source for building a chat POS tagger, can we still take advantage of the vast amount of POS data collected from other domains? To explore this, we modified our chat bigram backoff tagger in the following way. Instead of backing off from chat bigram to chat unigram to finally the chat MLE tag, after not encountering chat unigram information, back off instead to a bigram tagger trained on another domain, followed by the other domain's unigram tagger, and finally to the chat MLE tag. Below is the mean tagger accuracy for this approach, with the secondary bigram backoff taggers trained on individual Treebank domains as well as all three Treebank domains.

Table 9. Combined bigram backoff tagger performance

                 Chat to WSJ   Chat to Brown   Chat to Switch   Chat to Treebank
Mean Accuracy    0.851         0.858           0.855            0.871
Std Dev          0.012         0.012           0.010            0.012

As can be seen, all represent significant improvements in tagger accuracy over the bigram backoff tagger based solely on chat training information. It is interesting to note that the apparent advantage of the Switchboard data disappears when the tagger is first trained on chat. This is because the additional interjection vocabulary is already contained within the chat data itself, and thus its presence in Switchboard adds nothing to overall tagger performance. In the end, with 87.1% accuracy, the chat-to-Treebank bigram backoff tagger is significantly the best of the entire set of taggers investigated. We believe that a slight modification to this relatively simple tagger can still yield accuracy dividends. For example, before making the final back off to the chat MLE, we could incorporate a regular expression tagger based on the morphology of words, e.g. tagging all sanitized users as proper nouns (NNP, since we know the format for the user sanitization scheme), tagging all words ending in "ing" as gerund verbs (VBG), and all words ending in "ed" as past tense verbs (VBD), etc.
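A sketch of that proposed addition is shown below, again using the current NLTK tagger classes for illustration; the specific patterns (apart from the sanitized screen-name format, which follows the examples in the text) and the exact position of the regular-expression step are our own guesses at what such a morphology-based fallback might look like.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, RegexpTagger

# Morphology-based patterns, tried in order; unmatched words fall through to the backoff.
morphology_patterns = [
    (r"^\d+-\d+-\d+sUser\d+$", "NNP"),  # sanitized screen names, e.g. 10-19-30sUser112
    (r".+ing$", "VBG"),                 # gerund verbs
    (r".+ed$", "VBD"),                  # past tense verbs
]

def build_combined_tagger(chat_train, treebank_train, chat_mle_tag):
    """Chat bigram -> chat unigram -> Treebank bigram -> Treebank unigram ->
    regexp morphology -> chat MLE, one possible arrangement of the proposed chain."""
    mle = DefaultTagger(chat_mle_tag)
    regexp = RegexpTagger(morphology_patterns, backoff=mle)
    tb_uni = UnigramTagger(treebank_train, backoff=regexp)
    tb_bi = BigramTagger(treebank_train, backoff=tb_uni)
    chat_uni = UnigramTagger(chat_train, backoff=tb_bi)
    chat_bi = BigramTagger(chat_train, backoff=chat_uni)
    return chat_bi
```

Such coarse suffix rules obviously over-generate (e.g. "thing" is not a gerund), which is why they would sit near the very end of the chain, just before the MLE default.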


5. Future Work

Our initial efforts in preserving and annotating the online chat corpus appear promising. As such, we have a number of future efforts planned to continue improving automated lexical and discourse annotation performance. With regards to POS tagging, we must first complete the hand tagging of the full 3,507 user-sanitized posts (2,116 remaining). With our current bigram backoff tagger approaching 90%, this should be accomplished relatively quickly. In conjunction with this, we need to investigate more sophisticated POS taggers, to include Hidden Markov Model and Brill's Transformation-Based Learning tagging [10] approaches. It is our belief that the additional chat training data and more sophisticated tagging algorithms, when combined with the Treebank data, should yield tagging accuracy performance above the 90% range. We also will revisit our decision to tag both emoticons and chat abbreviations as "UH", since much of the "UH" tag's usage in the Switchboard corpus is reserved for speech disfluencies (and thus may have a different distribution than in our chat corpus). We will accomplish this by adding one or more tags to cover emoticon and chat abbreviation usage, and compare subsequent tagger performance with the original "single tag for all interjections" approach.

Improved POS data can then be used in modifying the feature set for the post classification discourse analysis, which currently does not include any POS tag features. Finally, more sophisticated smoothing approaches should improve the performance of the Naïve Bayes-based post classification.

6. References

[1] J. Lin, Automatic Author Profiling of Online Chat Logs, M.S. Thesis, Naval Postgraduate School, Monterey, 2007.

[2] F. Lundh, Python ElementTree Module, http://effbot.org/zone/element-index.htm, 2007.

[3] N. Schemenauer, Python Back-Propagation Neural Network Class, http://arctrix.com/nas/python/bpnn.py, 2007.

[4] E. Loper and S. Bird, "NLTK: The Natural Language Toolkit", Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Association for Computational Linguistics, Somerset, NJ, 2005, pp. 62-69.

[5] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, M. Meteer, and C. Van Ess-Dykema, "Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech", Computational Linguistics, 2000.

[6] T. Wu, F.M. Khan, T.A. Fisher, L.A. Shuler, and W.M. Pottenger, "Posting Act Tagging using Transformation-Based Learning", Proceedings of the Workshop on Foundations of Data Mining and Discovery, IEEE International Conference on Data Mining (ICDM'02), December 2002.

[7] T.M. Mitchell, Machine Learning, McGraw Hill, Singapore, 1997.

[8] M.P. Marcus, B. Santorini, M. Marcinkiewicz, and A. Taylor, Treebank-3, Linguistic Data Consortium, Philadelphia, 1999.

[9] B. Santorini, "Part of Speech Tagging Guidelines for the Penn Treebank Project", Treebank-3, Linguistic Data Consortium, Philadelphia, 1999.

[10] E. Brill, "A Simple Rule-based Part of Speech Tagger", Proceedings of the Third Conference on Applied Natural Language Processing, 1992, pp. 152-155.

[11] R. Hwa, "Supervised Grammar Induction using Training Data with Limited Constituent Information", Proceedings of the 37th Conference of the Association for Computational Linguistics, 1999, pp. 73-79.

