A Novel Defense Mechanism Against Web Crawler Intrusion Alireza Aghamohammadi
Eastern Michigan University
DigitalCommons@EMU

Master's Theses and Doctoral Dissertations
Master's Theses, and Doctoral Dissertations, and Graduate Capstone Projects

11-5-2013

A novel defense mechanism against web crawler intrusion
Alireza Aghamohammadi

Follow this and additional works at: http://commons.emich.edu/theses
Part of the Digital Communications and Networking Commons, and the Information Security Commons

Recommended Citation
Aghamohammadi, Alireza, "A novel defense mechanism against web crawler intrusion" (2013). Master's Theses and Doctoral Dissertations. 544.
http://commons.emich.edu/theses/544

This Open Access Dissertation is brought to you for free and open access by the Master's Theses, and Doctoral Dissertations, and Graduate Capstone Projects at DigitalCommons@EMU. It has been accepted for inclusion in Master's Theses and Doctoral Dissertations by an authorized administrator of DigitalCommons@EMU. For more information, please contact [email protected].

A Novel Defense Mechanism against Web Crawler Intrusion
by
Alireza Aghamohammadi

Dissertation
Submitted to the College of Technology
Eastern Michigan University
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY IN TECHNOLOGY
Area of Concentration: Information Assurance

Dissertation Committee:
Dr. Ali Eydgahi, Ph.D., (Chair)
Dr. Daniel Fields, Ph.D.
Dr. Huei Lee, Ph.D.
Dr. Alphonso Bellamy, Ph.D.

November 5, 2013
Ypsilanti, Michigan

Acknowledgments

This research would not have been possible without the assistance of the dedicated and outstanding committee members Dr. Ali Eydgahi (Chair), Dr. Daniel Fields, Dr. Huei Lee, and Dr. Alphonso Bellamy. The committee members' contributions and guidance were the most important factor in the successful completion of this study. I am especially grateful for Dr. Huei Lee's assistance and for making a computer lab available for this study.
In addition, I would like to acknowledge and thank my wonderful wife, parents, and sisters for their continuous support and understanding along the way. Their support has contributed to my success in so many ways, and I hope these few words will at least provide a glimpse of how much I appreciate their kindness through this Ph.D. journey and in my life.

Abstract

Web robots, also known as crawlers or spiders, are used by search engines, hackers, and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method uses a five-factor identification process to detect unwanted crawlers. This study provides the pretest and posttest results along with a systematic evaluation of web pages that use the proposed identification technique versus web pages that do not. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The outputs of the logistic regression analysis of the treatment and control groups confirm the novel five-factor identification process as an effective mechanism to prevent unwanted web crawlers. The study concludes that the proposed five-identifier process is an effective technique for this purpose.

Table of Contents

Acknowledgments
Abstract
Table of Contents
Chapter 1. Introduction
    Introduction
    Statement of the Problem
    Nature and Significance of the Problem
    Purpose and Objective(s) of the Study
    Research Questions and Hypotheses
    Definition of Terms
    Assumptions
    Limitations
    Summary
Chapter 2. Background and Review of Literature
    Introduction
    Background
    Review of Literature and Related Works
    Summary
Chapter 3. Methods
    Introduction
    Research Design
    Measurements
    Research Setting
    Population, Sample, and Subjects
    Human Subjects Approval
    Data Collection
    Data Analysis
    Analysis Tools
    Validation
    Personnel
    Budget
    Timeline
    Summary
Chapter 4. Results
    Introduction
    Web Crawler's Return Rate
    Demographic Characteristics of the Sample
    Unwanted Web Crawlers Results, Group 1 (pretest-posttest control group)
    Unwanted Web Crawlers Results, Group 2 (pretest-posttest treatment group)
    Wanted Web Crawlers Results, Group 1 (pretest-posttest control group)
    Wanted Web Crawlers Results, Group 2 (pretest-posttest treatment group)
    Classification for Web Crawlers' Results
    Data Reliability
    Research Questions/Hypotheses Results
    Summary
Chapter 5. Conclusion(s) and Discussion
    Introduction
    Conclusion(s)/Discussion
    Recommendations
    Summary
References
APPENDIX A: Binary Logistic Regression Results - Unwanted Web Crawlers
APPENDIX