
A Framework for Mining Instant Messaging Services

John Resig, Ankur Teredesai
Data Mining Research Group
Department of Computer Science
Rochester Institute of Technology
{jer5513,amt}@cs.rit.edu

Abstract

The development of a framework for analyzing large-scale mass-communication media such as instant messaging (popularly known as IM) has gone largely unexplored up until this point. This paper explores various data mining issues and how they relate to Instant Messaging and current Counter-Terrorism efforts. Specific topics include user pattern analysis, anomaly detection, textual topic detection on size-limited messages, and largely generic social network analysis in this context. Several interesting questions are posed, and the framework currently being developed explores some of the possible solutions.

1 Introduction

The medium of Instant Messaging on the Internet is a well-established means by which users can quickly and effectively communicate with one another. Although it has long been utilized by the public as a quick form of free communication, data mining tasks have not been attempted over Instant Messaging. Additionally, on a corporate or government level, people are just beginning to take notice of the potential that IM provides in terms of the type of information that can be collected from these networks.

Many large software or internet based corporations have started Instant Messaging networks of their own, generally open to the public after registration, including Time Warner, Yahoo, and Microsoft. Currently, some of the most popular Instant Messaging networks are run by some of the aforementioned companies:

• AOL Instant Messenger
• Yahoo! Instant Messenger
• MSN Instant Messenger
• Various IRC Networks

Interestingly enough, even with all the various networks being developed by corporations for profit, their physical structures (client-server architecture) and communication protocols (information packets) are very similar to one another.

Most Instant Messaging networks follow a strict Client-Server model in which a server (or a cluster of servers) is maintained by a service provider who controls traffic coming to and from the server. Users who wish to utilize a certain network generally register themselves with the service provider, then download a provider-approved client for use on their network. Using this client, users can connect to the central server in order to be able to send and receive messages and collect account information. A friend is generally another registered user (the term friend is server-specific, but exists on almost all messaging networks). The concept is that a user may maintain a Buddy List under which a listing of their immediate friends may exist. Using this, the server then sends a client updates based upon the statuses of their friends.

Figure 1: Existing Instant Messaging Network

Once the connection process has completed, the server performs all future communication in the form of Update Packets. An update packet is sent from the server to a client whenever an action occurs that is associated with that client, for example when a friend performs a status change or when a message is being sent to the user's client. An unfortunate consequence of the server maintaining such buddy lists is that it can impose restrictions upon the maximum number of friends which a user is allowed to maintain (this number is generally around 200). Since a client does not communicate directly with any other connected client, but only with the server, the server is in charge of disseminating any potentially useful information from one client to another. One such piece of critical information is a user's status. Table 1 describes a list of possible statuses that a client can be in. Status is an attribute generally associated with a user's client and often indicates how a user responds to an Instant Message. Whenever a user's status changes, an update packet is relayed by the central server to everyone who has the user on their buddy list.

Online   The user's client is connected to the central server and the user is active (currently typing or moving the mouse on their computer).
Offline  The user's client is not connected to the messaging server at this time.
Idle     The user's client is connected to the central server, but the user is not active. Additionally, how long a user has been idle can be determined from their status.
Away     The user is logged on but away from the station. Sometimes users specify a text message that can be viewed by anyone who wishes to get more information about where they are or why they are away (e.g. "Out to lunch.", "Watching TV."). In fact a user can be either idle or active while an away message is explicitly up.

Table 1: Possible user statuses. As shown above an IM client can be in one of the above statuses at a given time.

Another important aspect of communication flow within an Instant Messaging network is the traffic of messages between users. The amount of information revealed concerning instant messages is generally limited to the information which is directly related to a user. Such information paths include chat rooms (a group discussion area where multiple people can communicate with one another simultaneously) and private Instant Messages (messages sent directly from one user to another).

2 Data Collection

Among the various information resources provided by Instant Messaging networks, there are a number of valuable resources available to the average user. The data generated in turn is very useful for data mining to analyze user behavior. However, in order to utilize the flow of information offered by these networks, a data collection framework will have to be established. This paper proposes one such framework which has been developed. Information distributed by the Instant Messaging networks can be broken down into two simple groups: user status-change and communication-flow (Instant Messages, Chat Rooms).

Figure 2: The Proposed IM Mining Framework

The first item collected, user status change, can be achieved relatively simply, as the current structure of Instant Messaging networks supports the collection process. One interesting feature, previously discussed, of Instant Messaging networks is that of 'Buddy Lists' - lists of friends of a user. The direct benefit of this feature is the fact that whenever a buddy (a member of a user's buddy list) performs a status change, the client is immediately notified of it by the server. Utilizing this feature set, one could set up a client of their own, with an arbitrary buddy list, and begin collecting information about their buddies' resulting actions. This is significant due to the fact that most Instant Messaging networks don't require that someone actually be a friend of another user in order to watch their status changes.

Using this standard model, it is relatively simple to set up a tracking client whose only job is to collect pertinent information about users that are on its buddy list - aptly named, in this framework, Tracking Client. In order to maintain a tracking client, a Tracking Server is constructed which manages the actions of its associated tracking clients. The Tracking Server marshals communication between an arbitrary number of tracking clients and the database server. Whenever a new Tracking Client spawns and connects to the Tracking Server, the server attempts to determine which Instant Messaging users need to be tracked, from a list of potential users. Due to the restrictions imposed by the various Instant Messaging networks as to the size of a user's buddy list, this distributed Tracking Client structure is required in order to be able to track the maximum number of people at any given time. An advantage to this distributed network is that no single client is depended upon for all tracking efforts or network bandwidth usage. Each Tracking Client within the network watches a given number of other clients in order to verify that they are, in fact, still connected to the network - if not, then a communication is sent to the tracking server and another client is spawned to cover the users not being tracked by its disabled peer. As information packets come in from the server to each tracking client, the client attempts to determine if the packet should be re-transmitted to the server for storage in the central database.

Another tracking effort that is currently being explored is that of monitoring inter-user communication. One resource offered by most Instant Messaging networks (and exclusively by others, see IRC) is that of a public chat room. A tracking client has the ability to connect to one of these rooms as a spectator, simply to view the flow of conversation. Just as the server sends data packets concerning a user's status change, it will also send packets detailing messages being publicly sent from one user to another within this chat room setting.

...mer of 2003. 207 participants were tracked and 55061 unique data packets were received and stored. Two separate tracking clients were used to collect results, both of which aggregated their data to a single database for later retrieval.

Figure 3 shows the probability that a given user was in a certain state over the course of 10 weeks. It can be quickly surmised that most users maintain a fairly steady record from week to week. Additionally, due to the polar differences that some of the users seem to exhibit, it becomes apparent that the concept of User Profiles is an important step to determining a user's common course of action (more information can be found in Section 5.1 User Pattern Analysis).

Timestamp  User ID  Status
70242      68       Online
70303      118      Online
70325      65       Offline
70447      68       Idle
70453      16       Idle
70725      98       Offline
70743      89       Away