Automatic Author Profiling of Online Chat Logs
Total Page:16
File Type:pdf, Size:1020Kb
Calhoun: The NPS Institutional Archive Theses and Dissertations Thesis Collection 2007-03 Automatic author profiling of online chat logs Lin, Jane. Monterey, California. Naval Postgraduate School http://hdl.handle.net/10945/3559 NAVAL POSTGRADUATE SCHOOL MONTEREY, CALIFORNIA THESIS AUTOMATIC AUTHOR PROFILING OF ONLINE CHAT LOGS by Jane Lin March 2007 Thesis Advisor: Craig H. Martell Second Reader: Kevin M. Squire Approved for public release; distribution is unlimited THIS PAGE INTENTIONALLY LEFT BLANK REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704- 0188 Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. 1. AGENCY USE ONLY (Leave blank) 2. REPORT DATE 3. REPORT TYPE AND DATES COVERED March 2007 Master’s Thesis 4. TITLE AND SUBTITLE Automatic Author Profiling of 5. FUNDING NUMBERS Online Chat Logs 6. AUTHOR(S) Jane Lin 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION Naval Postgraduate School REPORT NUMBER Monterey, CA 93943-5000 9. SPONSORING /MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSORING/MONITORING N/A AGENCY REPORT NUMBER 11. SUPPLEMENTARY NOTES The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. 12a. DISTRIBUTION / AVAILABILITY STATEMENT 12b. DISTRIBUTION CODE Approved for public release; distribution is unlimited 13. ABSTRACT (maximum 200 words) Now that the Internet has become easily accessible and more affordable, a larger number of people spend more time in front of a computer. Some spend so much time on the Internet that they develop virtual friendships and relationships – people with whom they have regular contact via a computer screen and the Internet. While most of the dialogue exchanged online is not harmful or illegal, there are those with dishonest intentions lurking online. These people can be breaking the law by seducing a minor virtually or even going as far as meeting a minor in person. Terrorists can also use the Internet to facilitate communication and plan attacks. Since e-mail is one of the original means of communication on the Internet, methods for determining the author of an email have already been studied. So far, however, no significant experimentation with online chat logs exists. The first of part of this study is comprised of generating an unbiased, random, and broad corpus of online chat logs. Having a general corpus with a wide-range of topics allows the results of this research to be applied in the most general case. Because developing a complete solution to the authorship attribution problem for chat logs is difficult, we limit our scope to predicting gender and age. The ultimate goal of this work, then, is to facilitate the jobs of law enforcers in tracking down criminals who attempt to use the Internet as a hiding place. 14. SUBJECT TERMS 15. NUMBER OF Authorship Analysis, Authorship Profiling, Authorship Attribution, PAGES Natural Language Processing, Naïve Bayes, Artificial Intelligence, 289 Forensics, Information Retrieval, Data Mining 16. PRICE CODE 17. SECURITY 18. SECURITY 19. SECURITY 20. LIMITATION OF CLASSIFICATION OF CLASSIFICATION OF THIS CLASSIFICATION OF ABSTRACT REPORT PAGE ABSTRACT Unclassified Unclassified Unclassified UL NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89) Prescribed by ANSI Std. 239-18 i THIS PAGE INTENTIONALLY LEFT BLANK ii Approved for public release; distribution is unlimited AUTOMATIC AUTHOR PROFILING OF ONLINE CHAT LOGS Jane Lin Civilian, Department of Defense B.S. Cum Laude, University of Maryland, College Park, 2004 Submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN COMPUTER SCIENCE from the NAVAL POSTGRADUATE SCHOOL March 2007 Author: Jane Lin Approved by: Craig H. Martell, PhD Thesis Advisor Kevin M. Squire, PhD Second Reader Peter J. Denning, PhD Chairman, Department of computer Science iii THIS PAGE INTENTIONALLY LEFT BLANK iv ABSTRACT Now that the Internet has become easily accessible and more affordable, a larger number of people spend more time in front of a computer. Some spend so much time on the Internet that they develop virtual friendships and relationships – people with whom they have regular contact via a computer screen and the Internet. While most of the dialogue exchanged online is not harmful or illegal, there are those with dishonest intentions lurking online. These people can be breaking the law by seducing a minor virtually or even going as far as meeting a minor in person. Terrorists can also use the Internet to facilitate communication and plan attacks. Since e-mail is one of the original means of communication on the Internet, methods for determining the author of an email have already been studied. So far, however, no significant experimentation with online chat logs exists. The first of part of this study is comprised of generating an unbiased, random, and broad corpus of online chat logs. Having a general corpus with a wide-range of topics allows the results of this research to be applied in the most general case. Because developing a complete solution to the authorship attribution problem for chat logs is difficult, we limit our scope to predicting gender and age. The ultimate goal of this work, then, is to facilitate the jobs of law enforcers in tracking down criminals who attempt to use the Internet as a hiding place. v THIS PAGE INTENTIONALLY LEFT BLANK vi TABLE OF CONTENTS I. INTRODUCTION ............................................1 A. THE EARLY YEARS ....................................1 B. CHAT TODAY .........................................1 C. ROLE OF CHAT .......................................2 D. IMPORTANCE OF THE STUDY OF CHAT BEHAVIOUR ..........4 1. Motivation ....................................4 2. Purpose .......................................7 E. ORGANIZATION OF THESIS .............................8 F. CHAPTER SUMMARY ....................................9 II. BACKGROUND .............................................11 A. AUTHORSHIP ATTRIBUTION ............................11 B. SOCIOLOINGUISTICS .................................11 C. MACHINE LEARNING TECHNIQUES .......................12 1. Naïve Bayes ..................................13 2. Support Vector Machines (SVM) ................13 D. CHAPTER SUMMARY ...................................13 III. GENERATION OF CORPUS ..................................15 A. SOURCE OF DATA ....................................15 1. Overview .....................................15 2. Chosen Host ..................................15 B. DATA DESCRIPTION ..................................16 C. STATISTICAL ANALYSIS OF DATA ......................17 1. Document Counts, Sentence and Document Lengths ......................................17 2. Vocabulary Variety ...........................25 3. Emoticons ....................................25 4. Punctuations .................................28 D. CHAPTER SUMMARY ...................................30 IV. TESTING AND ANALYSIS ...................................31 A. MACHINE LEARNING AND CLASSIFICATION TECHNIQUES ....31 1. Classification Tool ..........................31 2. Classification Method ........................32 3. Measures of Classification Performance .......32 B. FEATURES AND FEATURE VECTORS ......................36 C. EXPERIMENT SETUP ..................................37 1. Creating the Joint Probability Distribution Tables from the Training Data ................37 2. Labeling the Test Data .......................39 a. Alternate Step 2: Finding the Best Label for Individual Features with Naïve Bayes .............................41 vii b. More Alternate Step 2’s: Finding the Best Label for Feature Vectors with Naïve Bayes without the influence of the Prior ...............................42 D. RESULTS AND ANALYSIS ..............................42 1. Including the Influence of the Prior .........43 2. Excluding the Influence of the Prior .........46 E. CHAPTER SUMMARY ...................................50 V. CONCLUSIONS ..............................................53 A. SUMMARY ...........................................53 B. FUTURE WORK .......................................54 C. CHAPTER SUMMARY AND CONCLUDING REMARKS ............56 APPENDIX A: TABLES FOR RAW COUNTS, FILE SIZES, SENTENCE LENGTHS, AND DOCUMENT LENGTHS ..........................57 A. RAW COUNTS FOR THE TRAINING SET ...................57 B. RAW COUNTS FOR THE TESTING SET ....................60 APPENDIX B: FIGURES OF SENTENCE LENGTHS AND DOCUMENT LENGTHS ................................................61 A. MEASURES OF AGE GROUP VS. SENTENCE LENGTH .........61 B. MEASURES OF INDIVIDUAL DOCUMENT LENGTHS ...........69 APPENDIX C: TABLES FOR VOCABULARY VARIETY ...................71 A. TYPE COUNTS .......................................71 B. AVERAGE VOCABULARY VARIETY ........................72 APPENDIX D: TABLES AND FIGURES FOR