Detection of Child Exploiting Chats from a Mixed Chat Dataset As a Text Classification Task
Total Page:16
File Type:pdf, Size:1020Kb
Detection of child exploiting chats from a mixed chat dataset as a text classification task Md Waliur RahmanMiah John Yearwood Sid Kulkarni School of Science, Information School of Science, Information School of Science, Information Technology and Engineering Technology and Engineering Technology and Engineering University of Ballarat University of Ballarat University of Ballarat [email protected] [email protected] [email protected] medium increased the chance of some heinous acts Abstract which one may not commit in the real world. O’Connell (2003) informs that the Internet affords Detection of child exploitation in Internet greater opportunity for adults with a sexual interest chatting is an important issue for the protec- in children to gain access to children. Communica- tion of children from prospective online pae- tion between victim and predator can take place dophiles. This paper investigates the whilst both are in their respective real world homes effectiveness of text classifiers to identify but sharing a private virtual space. Young (2005) Child Exploitation (CE) in chatting. As the chatting occurs among two or more users by profiles this kind of virtual opportunist as ‘situ- typing texts, the text of chat-messages can be ational sex offenders’ along with the ‘classical sex used as the data to be analysed by text classi- offenders’. Both these types of offenders are taking fiers. Therefore the problem of identification the advantages of the Internet to solicit and exploit of CE chats can be framed as the problem of children. This kind of solicitation or grooming by text classification by categorizing the chat- the use of an online medium for the purpose of logs into predefined CE types. Along with exploiting a child may refer to the problem of three traditional text categorizing techniques a ‘online child exploitation’. new approach has been made to accomplish Currently there is no such system that can the task. Psychometric and categorical infor- automatically identify the elements of child exploi- mation by LIWC (Linguistic Inquiry and Word Count) has been used and improvement tation in chat text. It is very difficult for parents or of performance in some classifier has been the members of Law and Enforcement Agency found. For the experiments of current research (LEA) to watch over the children all the time to the chat logs are collected from various web- protect them from online paedophiles loitering sites open to public. Classification-via- over the vast space of the Internet. An online Regression, J-48-Decision-Tree and Naïve- automatic CE detection system can be useful. Re- Bayes classifiers are used. Comparison of the garding offline, most of the chatting programs have performance of the classifiers is shown in the the options of storing the chat-texts in log-archives. result. According to Krone (2005) and pjfi.org chat-logs can be used as evidence to proof a paedophile at- tempting to exploit children. Therefore after an 1 Introduction online child exploitation occur; a LEA member can The online chatting has become a popular tool for retrieve those offline archived chat logs from the personal as well as group communication. It is hard drive of the accused to produce as evidence in cheap, convenient, virtual and private in nature. In the court of law. However manual identification of an online chatting one can hide ones personal in- the evidence is a tedious and time consuming formation behind the monitor. This makes it a work, as one may have to read hundreds or thou- source of fun in one hand but possess threat on the sands of pages of chat-texts from different chat- other hand. The privacy and virtual nature of this logs. Thus it is prone to error due to exhaustion. Moreover manual process may lead to a biased Md. Waliur Rahman Miah, John Yearwood and Sid Kulkarni. 2011. Detection of child exploiting chats from a mixed chat dataset as a text classification task. In Proceedings of Australasian Language Technology Association Workshop, pages 157−165 decision. Therefore a research to develop such an 2 Related work automatic system will have a significant contribu- tion in both the online and offline situation for the In the recent years, IT research community has protection of children from exploitation. paid good attention to the chat-text analysis and This paper introduces the results of the prelimi- chat-mining. Different applications evolved in this nary experiments of an ongoing research aims to area though are not perfect in all situations. Litera- develop a novel methodology that can automati- ture review suggests that most of the existing tech- cally identify the child exploitation in chats niques have good performance only for its specific through the analysis of the contents of the chat- context. The context of the current research is par- logs using data-mining and machine learning tech- ticularly unique; it focuses on detecting CE chats. niques. For the experiments the chat logs are col- In addition, it uses archived chat logs instead of lected from various websites open to public. Three single chat posts used by others. Therefore the ex- classifiers, named Classification-via-Regression, J- isting works does not match with the current re- 48-Decision-Tree and Naïve-Bayes classifiers are search problem. As any technique that corresponds used from the WEKA data mining tool. Along with to the same context is not found, related works on term based feature set a new kind of features chat massages is discussed in this section. named psychometric and word categorical infor- Following subsections provide a short descrip- mation has been used. The LIWC (Linguistic In- tion of the analysis of unique properties of chat quiry and Word Count) is used to get this messages, psychological aspect of child exploita- information of the chat-terms. The result and per- tion and a brief overview of the related existing formances of the classifiers are compared in the work on chat text. experiment and result section. The contributions of this paper are many fold. 2.1 Analysis of Chat messages First, in the information and language technology The texts in the chat possess some unique charac- field currently it is difficult to find a good number teristics that distinguish them from other literary of researches focusing on the issue of detection of formal texts (Rosa and Ellen 2009; Kucukyilmaz et internet child exploitation. This paper emphasises al. 2008). Chat-users suppose to type spontane- on this issue and examines the technical aspects of ously and instantly. Therefore the individual post is chat messages that can be used to find a solution. very brief, as short as a word. Frequently it is con- Second, the experiments in the current paper use fined within a couple of words. Generally the chats archived chat logs instead of single chat posts. do not follow any grammar rules. Therefore the Single chat posts contain only a few terms, on the chat-text is grammatically informal and unstruc- other hand a log of chats contain a good number of tured. This made them more difficult to process by terms which provides more facility for a machine traditional sentence parsers. Chat-users are though learning system to learn the prediction function of typing texts, but are actually trying to talk with a class. Third, this research uses psychometric in- each other through it. So the text is typed very formation for the first time to detect CE chats. No quickly, frequently unedited, errors and abbrevia- other research has been found that is doing the tions are more common. For example, “ASL” is a same. This psychometric information seems common chat abbreviation for Age, Sex and Loca- enriching the feature set that improves the perfor- tion asked at the introduction stage. “P911” is a mance of some classifiers. chatting code used by teenagers. It stands for “Par- The remainder of this paper is organized as fol- ent Alert!”(teenchatdecoder.com). These kinds of lows. Section 2 reviews the related works. Section previously unseen abbreviations and erroneous 3 describes the methodology followed in this re- texts are difficult to be handled by any currently search. The experimental results are analysed in available text processing techniques. section 4 while section 5 presents conclusion and Chatting is a purely textual communication me- future work. dium. So for transferring emotional feelings like happiness, sadness and angers, emoticons (emotion + icon = emoticon; a chat jargon) are widely used. These are different sequences of punctuation marks 158 that display graphical representation of different a psychological progression. A perpetrator tends to emotional feelings. For example, ‘‘:-)” means follow the model of luring communication theory, “happy” and ‘‘:-(” represents “sad”. Another way proposed by Olson et.al. (2007). According to this of emotion transfer is by emphasizing a word with model a perpetrator builds up a deceptive psycho- repeating some specific characters. For example, logical trust. This indicates that the terms used in “soryyyyyyyyyyyy”. This kind of deliberate mis- the process of exploitation are categorically and spelling is also frequent in chat text. The emoti- psychologically different than the terms used in cons and intentional misspelled words may contain general chatting. Therefore analysing the psycho- valuable contextual information in a chat text. For logical and categorical information of the chat example, in the grooming phase the perpetrator terms would be helpful to learn the psychological may reconstruct relation by an emphasized “sory- pattern of the exploitation. To find out the cate- yyyyyyyy” when the child felt threatening by any gorical and psychological properties of terms obtrusive language. Another example may be the LIWC (Linguistic Inquiry and Word Count) has emoticon for “hug (>:d<)” and “kiss (:-*)” for a been used in this current research.