Automated data extraction; what you see might not be what you get

Detecting web-bot detection

Master Thesis by

G. Vlot

In partial fulfilment for the degree of

Master of Science in Software Engineering

at the Open University, Faculty of Management, Science and Technology, Master Software Engineering

Student number: 851708682
Course: AF-SE IM9906
Date: July 5th, 2018
Version: 1.0
Chairman and supervisor: Dr. Ir. Hugo Jonker, [email protected], Open University
Second supervisor: Dr. Greg Alpár, [email protected], Open University


Table of contents

List of tables
Code snippet index
Illustration index
Diagram index
Abstract
1 Introduction
2 Background
2.1 Web-bot detection
2.1.1 Web-bot detection on different implementation levels
2.1.2 Page deviations caused by web-bot detection
2.1.3 Web-bot detection by commercial companies
2.2 Types of web bots
2.3 Data extraction tactics
2.4 Studies based on automated data extraction
3 Related work
4 Methodology
5 Analysis web-bot detection implementation
5.1 Manual observation: browser specific web-bot detection
5.1.1 Configuration
5.1.2 Analysis
5.2 Web-bot detection based on browser properties
5.2.1 The observation of browser properties
5.2.2 Web-bot detection based on client server communication
5.3 Validation of the client side web-bot detection
5.4 Summary
6 Browser family fingerprint classification
6.1 Browser families
6.2 Browser based web bots
6.3 Browser family classification
7 Determining the web-bot fingerprint
7.1 Relevant browser properties
7.2 Approach: obtaining deviating browser properties
7.2.1 Design improvements
7.2.2 Limitation
7.3 Deviating browser properties


7.4 Web-bot fingerprint surface
8 The adoption of web-bot detection on the internet
8.1 Design and implementation web-bot detection scanner
8.1.1 Application
8.1.2 Detection patterns and web-bot detection score calculation
8.1.3 Validation
8.2 Evaluation
8.3 Improvements
8.4 Limitations and risk
9 Observing deviations
9.1 Approach
9.2 Different types of deviations
10 Conclusion, discussion and future work
10.1 Discussion
10.2 Future work
Appendices
A. Taxonomy of automatic data extraction methods
A.1 Management of the data extraction process
A.2 Approach
A.3 Operation mode
A.4 Data extraction strategy
A.5 Extractable content
A.6 Supported technology
B. Details manual observation web-bot detection on StubHub.com
C. Deviating browser properties
C.1 Blink + V8
C.2 Gecko + Spidermonkey
C.3 Trident + JScript
C.4 EdgeHTML + Chakra
D. Code contributions
D.1 Analysis of a client side web-bot detection implementation
D.2 Determining the web-bot fingerprint surface
D.3 Measuring the adoption of web-bot detection on the internet
E. Observed web page deviations
E.1 Blocked content
E.2 Deviation in the content of the web page
Bibliography


Scientific references
Books
Commercial publications
Glossary

List of tables

Table 2.1: Web-bot detection per OSI-layer
Table 2.2: Different types of web bots based upon data extraction strategy
Table 5.1: Manual observation web-bot detection StubHub.com
Table 5.2: Results execution web-bot fingerprinting function
Table 5.3: General web-bot detection
Table 5.4: Description HTTP request first visit StubHub.com
Table 5.5: HTTP requests and responses run 2 Manual
Table 5.6: HTTP requests and responses run 2 Selenium + WebDriver
Table 6.1: Layout engines in relation to their usage in web browsers*
Table 6.2: JavaScript engines in relation to their usage in web browsers*
Table 6.3: Browser study coverage and their layout and JavaScript engine commonalities
Table 7.1: Compare configurations between web-bot and standalone browser instances
Table 7.2: Differences in fingerprinting attributes: compared to
Table 7.3: Used browser configurations for visiting browser properties extraction website
Table 7.4: Overview fingerprinting results per browser family category
Table 7.5: Browser properties deviations Blink + V8
Table 7.6: Fingerprinting properties: Trident + JScript
Table 7.7: Browser properties deviations: Gecko + Spidermonkey
Table 7.8: Browser properties deviations: EdgeHTML + Chakra
Table 7.9: Detectable automated browser properties
Table 7.10: Risk classification of getting detected by number of deviations
Table 7.11: Relevance and score of deviating browser properties
Table 8.1: The properties of a detection pattern
Table 8.2: Sites containing web-bot fingerprinting surface property script inclusions
Table 8.3: Scripts details containing web-bot fingerprinting properties
Table 8.4: Web-bot detection level coverage per detection pattern (web-bot detection sites)
Table 8.5: Web-bot detection level coverage per detection pattern (all script inclusions)
Table 8.6: Script occurrences per detector pattern
Table 9.1: Web-bot detection sites used for the observation of different types of deviations
Table 9.2: Observed web page deviation types
Table 10.1: Web-bot detection level coverage (web-bot detection sites)
Table 10.2: Risk of getting detected per web-bot browser instance
Table 10.3: Web-bot detection adoption rates per rank interval (worst case scenario)

Code snippet index

Code Snippet 5-1: Web-bot keyword obfuscated in hexadecimal format (async.js)
Code Snippet 5-2: Replacement Web-bot keywords array
Code Snippet 5-3: Selenium specific web-bot detection function (async.js)
Code Snippet 5-4: Literals used for detection of automated software (zwxsutztwbeffxbyzcquv.js)
Code Snippet 5-5: Device and gesture handlers toward web-bot detection (async.js)


Code Snippet 5-6: Browser fingerprinting functions (zwxsutztwbeffxbyzcquv.js)
Code Snippet 5-7: Distil specific code observing post response for revalidation or blocking of the client (zwxsutztwbeffxbyzcquv.js)
Code Snippet 5-8: $cdc detection literal in ChromeDriver source
Code Snippet 8-1: Detection pattern: detector
Code Snippet 8-2: Detection pattern: Web-bot fingerprinting surface
Code Snippet 8-3: Detection pattern: Browser fingerprint pattern
Code Snippet 8-4: Detection pattern: Manually found literal pattern
Code Snippet 8-5: Heavily obfuscated web-bot detection pattern

Illustration index

Figure 1-1: Data extraction; web-bot versus human being
Figure 2-1: Possible server responses based on web-bot detection
Figure 2-2: Web-bot employment tactics [Sch12]
Figure 4-1: Methodology as a framework
Figure 7-1: Distribution browser property deviations over headless web-bot driven browsers
Figure 7-2: Distribution of browser property deviations over headful web-bot driven browsers
Figure 7-3: Deviation browser properties contains browser fingerprinting properties and web-bot fingerprinting properties
Figure 8-1: Web-bot detection scanner within the OpenWPM architecture as command
Figure 8-2: Application workflow
Figure 8-3: Implementation detection in one file (left) or distributed over multiple files (right)
Figure 8-4: Score domains and score distribution across sites
Figure 10-1: Adoption of fingerprinting scripts on top sites [EN16]

Diagram index

Diagram 5-1: HTTP requests first visit StubHub.com
Diagram 7-1: Persisting of browser properties


Abstract

The internet contains a wealth of data that is used by scientists as a foundation for their studies in different research fields. Scientists make use of web bots to automatically extract data from web pages on the internet, as manual data extraction is infeasible. This implies that a web bot has become a crucial factor in the execution of a study, and it is therefore important that the data extracted by a web bot is in line with the data presented to a human being who visits the web page with a regular browser. Unfortunately, there are indications of limitations with respect to the employment of web bots which can lead to deviations in the extracted data compared to the data of a web page visited by a human being. Such a deviation seems to be caused by web-bot detection: the detection of a web bot by a web site. If studies do not take into account the deviations in the extracted data between a visit by a web bot and a visit by a human being, the results of the study can be misleading or might even have to be discarded. To the best of our knowledge, little to no attention has been paid to the risks involved in the employment of web bots for data extraction in studies. For that reason, we have to doubt the realism of the results described in papers.

This document describes research into the adoption of web-bot detection on the internet and relates the observations to the impact on web-bot based studies. This research serves as the first step in investigating the realism of the results of web-bot based studies. To measure web-bot detection on the internet we propose a methodology to find means to detect web-bot detection, measure the adoption of web-bot detection on the internet and observe the effect of web-bot detection on web pages on the internet. By conducting a manual observation, we proved the existence of web-bot detection and found means to measure its adoption on the internet. We found that web-bot detection takes place on the internet through a client side approach that observes the properties exposed by browsers. By making use of a browser family fingerprint classification we established a web-bot fingerprint surface. The web-bot fingerprint surface contains specific web-bot browser properties that can be used to detect a web bot.

The adoption of web-bot detection has been measured by making use of a web-bot detection scanner based upon the web-bot fingerprint embodied in detection patterns. The web-bot detection scanner has been employed in the wild by making use of the Alexa top 1 million list. The scanner investigated 391,517 sites and found web-bot detection code inclusions on 12.82% of them. All the sites with web-bot detection script inclusions have been classified by the level of web-bot detection. Results indicate that on 91.73% of the sites containing such inclusions web-bot detection takes place, and on 8.25% browser fingerprinting takes place that can be used to detect a web bot. Three different types of deviations have been observed on the homepages of 20 sites with web-bot detection inclusions. For each site, a screenshot of the homepage taken during a visit by a web-bot driven browser instance has been compared to one taken during a visit by a human being with a regular browser. The following deviation types have been distinguished: blocked response, blocked web page content and deviation in the content of the web page.

With respect to web-bot based studies, a deviation in the content of the web page is the most harmful deviation type because it is hard to detect. We argue that web-bot detection is a real threat for web-bot based studies, although this research is not able to prove the existence of web-bot detection for all sites with web-bot detection script inclusions. The adoption rates indicate a high coverage of web sites that are able to detect a web bot. Manual observation of different types of deviations indicates that the results of web-bot based studies can be thwarted by web-bot detection. Future work that combines web-bot detection with deviation detection is necessary to prove the existence of web-bot detection for all the observed sites with web-bot detection script inclusions.


1 Introduction

Scientists make use of web bots in their studies to automatically extract data from web pages. Do these web-bot based studies get the same results as a human being who visits a web site using a regular browser?

A web bot, in the context of the project described here, is an automated agent that uses specific software (for instance PhantomJS1 or Selenium WebDriver2) to extract data from web pages. Web bots are employed in different research fields, for instance privacy and social media. To give an example: in the field of privacy, the work of Nikiforakis et al. provides insights into the ecosystem of web-based device fingerprinting [NKJ+13]. In their work they used a web bot to extract the content of web pages in order to observe code inclusions related to fingerprinting. An example in the field of social media is the work of Dai et al., who researched the characterization of LinkedIn3 profiles by making use of a web bot [DNV15]. The web bot in the work of Dai et al. was used to extract profiles from the LinkedIn web site to provide insights into hidden relationships between the profiles.

Despite the fact that different web-bot based studies serve different purposes, they have one thing in common: they all heavily depend on data that has been extracted by making use of a web bot. A web bot has become a crucial factor in the execution of a study. It is therefore extremely important that the data extracted by a web bot is reliable and in line with the data presented to a human being when visiting a web page with a regular browser.

Unfortunately, there are some indications of limitations with respect to the employment of web bots, as described in the work of Englehardt et al. Englehardt et al. measured the adoption of online tracking by making use of web bots for the extraction of tracking data from web pages [EN16]. During their study, Englehardt et al. observed that advertisements are not served to a web bot that makes use of PhantomJS. Englehardt et al. describe:

[…] Upon further examination, one major reason for this is that many sites don’t serve ads to PhantomJS.

Limitations in the employment of a web bot have also been observed during an initial case study on price discrimination. During this case study a web bot was employed for extracting flight information from web sites of airline companies. Particular airline companies blocked a visit by a web bot to their sites by presenting a CAPTCHA4, in contrast to a page visit by a human being. A CAPTCHA is a test that replaces the main contents of the web page and requires interaction from a user to determine whether or not the user is a human being. A CAPTCHA is hard to solve for a web bot as it is designed for human interaction.

The aforementioned limitations cause deviations in the data extracted by a web bot (which represents the content of a web page) with respect to the data of the same web page visited by a human being. Deviations seem to occur because the web bot can be detected by a web site. In other words: when a web site is able to detect that it is being visited by a web bot (web-bot detection), it is able to return data that is specific to a visitor class: web bot or human being.

1 http://phantomjs.org/
2 https://www.seleniumhq.org/projects/webdriver/
3 https://www.linkedin.com/
4 https://en.wikipedia.org/wiki/CAPTCHA


Web-bot detection can have a tremendous impact on the outcome of a study. Consider the work of Zhou et al. [YZW+12], where Android markets are explored for malware by making use of a web bot. Imagine that Google is able to detect the aforementioned web bot that is being used to extract the data. Google might provide different data (different apps) depending on the visitor class and hide apps from suspicious providers in case of a web bot. What does this mean for the results of the study? This is illustrated in Figure 1-1, which describes web-bot detection based upon a web page visit by a web bot on the left side and a web page visit by a human being on the right side. Web site A represents an Android store and the documents represent the available apps. The blue document represents the data with only the apps from benign software providers, while the green document represents all the apps available in the Android store (which also includes the apps from suspicious software providers).

Another example is a (fictional) study that researches the competences of students from different universities by making use of a web bot. The web bot extracts student profiles from the university web sites, including their course degrees. For one particular university, the data gathered by the web bot contains only students with degrees above the national average. In contrast, a manual visit of the university web site lists all the students of the university, including the students with degrees below the national average. Will the results of this study be realistic? Figure 1-1 can also be used to illustrate this situation. In this case Web site A represents the university web site, the blue document represents the data with only the students with a degree above the national average, and the green document represents all the students of the university.

Figure 1-1: Data extraction; web-bot versus human being

The aforementioned examples illustrate that if a web bot can be detected by a web site, deviations in the extracted data might occur which can negatively influence the results of a study. If studies do not take into account the deviations in the extracted data between a visit by a web bot and a visit by a human being, the results of the study can be misleading or might even have to be discarded.

To the best of our knowledge, little to no attention has been paid to the risks involved in the employment of web bots for data extraction in studies. For that reason, we have to doubt the realism of the results obtained by such studies, leading to the question: are the results described in research papers of web-bot based studies realistic?


To be able to answer the aforementioned question it is necessary to have more knowledge about the current state of web-bot detection on the internet. After acquiring this knowledge, web-bot detection can be related to past studies to express the realism of the results described in papers. In other words, the above-mentioned question contains too many research topics to answer within one research project. It needs prior work, in which a study towards web-bot detection is the first step and the main focus of the work described here, which leads to the main research question:

To what extent is web-bot detection adopted on the internet and how does it relate to studies that are using web bots for automated data extraction?

This main research question can be decomposed into the following sub research questions:

RQ1: How to prove and measure the existence of web-bot detection?
There are indications that web-bot detection takes place on the internet, but there is no well-founded proof of its existence. The answer to this question provides us with the means to measure web-bot detection on the internet (RQ2).

RQ2: To what extent is web-bot detection adopted by web sites on the internet?
The answer to this question will provide us with adoption rates that express the relevance of web-bot detection on the internet. The adoption rates can be used to express the relevance of web-bot detection in relation to web-bot based studies.

RQ3: Can we distinguish different types of deviations based upon web-bot detection?
By answering this question, we gain more insight into the seriousness of deviations caused by web-bot detection in relation to web-bot based studies. It expresses the relation between web-bot detection and its impact on studies. Depending on the deviation type, the level of impact differs. For instance, when a deviation concerns blocking behaviour, this could easily be noticed during the study and the impact could be mitigated. On the other hand, subtle deviations could slip through.

RQ4: How likely is it for web-bot based studies to get detected by web-bot detection?
This research question provides insight into the risk of being detected in relation to the data extraction techniques and approaches used. By answering this question, we express the risk of being detected, which indicates (together with RQ2) the relevance of web-bot detection on the internet in the context of web-bot based studies.

The proposed study answers the aforementioned questions by providing the following main contributions:

- We prove that web-bot detection takes place by analysing a client side web-bot detection implementation.
- We substantiate that web-bot detection can be measured based upon browser properties by examining the inner workings of a web-bot detection mechanism.
- We developed a classification method, utilizing a web site, that can be used to distinguish web-bot driven browsers from human driven browsers based on the properties exposed by a browser.
- We developed a web-bot detection scanner that is able to measure the adoption of web-bot detection on the internet based on web-bot fingerprinting patterns.
- We present adoption rates that express the coverage of web-bot detection on the internet.
- We prove that web-bot detection can lead to different types of deviations by means of a manual observation.
- We argue that web-bot detection is a real threat for the results of studies that employ web bots to extract data from web pages on the internet.


Thesis overview

This thesis is organized as follows. Chapter 2 provides the necessary background information on web-bot detection, different types of web bots and data extraction tactics, and provides examples of web-bot based studies. Chapter 3 describes related work to emphasize that the research field related to browser based web-bot detection is untouched and that past web-bot based studies lack attention to web-bot detection. The methodology used during the research is described and explained in chapter 4. Chapter 5 describes the first step taken in the research, which is an analysis of a client side web-bot detection implementation. The analysis proves the existence of web-bot detection and determines its usage for measuring web-bot detection on the internet. Chapter 6 describes a browser family fingerprint classification that is used to detect web-bot detection automatically, based upon specific browser properties gathered by making use of the browser family classification. Chapter 7 describes the approach to obtain web-bot browser properties and to determine the web-bot fingerprint surface. Chapter 8 describes the approach and results of measuring the adoption of web-bot detection in the wild by making use of a web-bot detection scanner. Chapter 9 describes the observation of different types of web page deviations caused by web-bot detection. Chapter 10 concludes this thesis by describing our conclusions based upon the answers to the research questions, a discussion and an outlook on future work.


2 Background

This chapter describes background information concerning:
- The different ways and commercial implementations of web-bot detection (cf. paragraph 2.1).
- The existence of different types of web bots (cf. paragraph 2.2).
- The existence of different types of data extraction tactics (cf. paragraph 2.3).
- The variability of studies that employ automated data extraction (cf. paragraph 2.4).

The information serves as context for the proposed study, to understand the choices we made with respect to the methodology (cf. chapter 4) and to set the domain of the study.

2.1 Web-bot detection

In the context of this study, web-bot detection refers to the detection of automated software used for the purpose of extracting data from the content of a web page. A web bot is not only used for benign purposes, for instance gaining knowledge in the context of science, but also for malicious reasons, for example stealing copyrighted content / intellectual property or executing a DDoS attack5. Web bots are popular for criminal activities because they are easy to scale and mimic human behaviour, which makes them harder to track.

Web site owners protect themselves from web bots by employing web-bot detection: observing the visitor, looking for signs of the behaviour of a web bot. In addition, commercial cybersecurity companies such as WhiteOps6 and Distil Networks7 offer web-bot detection as a service to their customers (cf. paragraph 2.1.3).

2.1.1 Web-bot detection on different implementation levels

A web bot can be detected on different levels of the OSI-model8, which standardizes layered processing of communication between computers. Table 2.1 describes the relevant layers of the OSI-model on which a web bot can be detected.

Network (Layer 3)
A web bot can be detected on the Network layer if excessive network traffic is observed from repeating identical IP addresses (or a pool of IP addresses).

Transport (Layer 4)
On the Transport layer, web bots can also be detected based upon TCP traffic on the network, for instance traffic generated at a fixed click rate (e.g. sending x requests per second). An excess of incoming TCP packets makes the web bot stand out above all other network interaction.

Behavioural patterns in the traffic, such as a page visit every Monday at four in the afternoon, could make a visitor suspicious. This classification has also been described in the Distil whitepaper [Dis16], which concludes:

“many bots nowadays are mimicking human behaviour; bad bots begin to work from 9 am till 5 pm” [Dis16].

5 https://en.wikipedia.org/wiki/Denial-of-service_attack
6 https://www.whiteops.com/
7 https://www.distilnetworks.com/
8 https://en.wikipedia.org/wiki/OSI_model


Not only the timeslot can be of importance, but also the repeating sequence of user interactions within a browser, for instance a click on the same button in the top left corner followed by the same link. Behavioural patterns can be tracked by monitoring network traffic within the Transport layer.

Application (Layer 7)
The HTTP9 protocol in the Application layer, which is the protocol for the communication between web client (browser) and web server, can contain information that could reveal the presence of a web bot.

Concerning the information specified in the HTTP header, we can distinguish two categories of information:
1. Web-bot HTTP characteristics
2. Web-bot detection context

The first category exposes web-bot specific communication properties, such as the value of the userAgent10 string in the HTTP header, or missing or additional HTTP headers.

User (Layer 8)
On layer 8, additional web-bot specific information can be collected client side by making use of client scripting. Typically, those client-side properties are communicated to the server via the HTTP protocol (layer 7).

The second category, the context for the detection of the web bot, can contain the characteristics of a web bot or behavioural information:

• Web-bot characteristics: the properties of the tooling (library/framework) that has been used for automated data extraction.
• Behavioural information: information that reveals a web bot by its client-side behaviour. An example of behavioural information is the filled-in values of invisible fields sent to the web server via the HTTP protocol. Typically, those fields are added to a web page to detect a web bot (a honeypot11). A human being is not able to fill in those fields as they are invisible; if a web server notices that an invisible field value has been filled in, it knows that it is dealing with a web bot.

Table 2.1: Web-bot detection per OSI-layer
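To make the layer 7/8 signals in Table 2.1 concrete, the sketch below combines a honeypot field with client-side property collection. The field name bot_trap, the endpoint /visit-info and the specific properties collected are illustrative assumptions, not taken from an actual detection product.

```javascript
// Sketch: client-side web-bot signals (illustrative names, not a real product).
// Assumes the page contains an invisible honeypot field rendered by the server:
//   <input type="text" name="bot_trap" style="display:none" autocomplete="off">
const form = document.querySelector('form');
if (form) {
  form.addEventListener('submit', () => {
    const info = {
      // Honeypot: a human never fills this in, a naive form-filling bot might.
      honeypotFilled: form.querySelector('input[name="bot_trap"]').value !== '',
      // Layer-8 context: browser properties collected via client scripting.
      userAgent: navigator.userAgent,
      webdriver: navigator.webdriver === true, // exposed by WebDriver-automated browsers
      pluginCount: navigator.plugins.length
    };
    // The collected context travels to the server as ordinary HTTP traffic (layer 7).
    navigator.sendBeacon('/visit-info', JSON.stringify(info));
  });
}
```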

9 https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
10 https://www.w3schools.com/jsref/prop_nav_useragent.asp
11 https://en.wikipedia.org/wiki/Honeypot_(computing)


2.1.2 Page deviations caused by web-bot detection

The detection of a web bot can take place both on the server and on the client side, in contrast to the determination of the server's response back to the client, which (obviously) takes place server side. The server is always in charge of what will be sent back to the client (browser); this can be influenced by the detection of a web bot, which in turn can lead to a deviation on a web page. A deviation is an observable difference on a web page caused by the employment of a web bot, in comparison to the same web page visited by a human being.

A deviation on a web page can exist in many forms. At the highest level we can distinguish between a deviation in the visual representation and a structural deviation.

• Visual representation
The visual representation represents the experience of an end user of a web page in a web browser. It includes both graphical and audio output (technically an audio-visual representation).
• Structural representation
The structural representation represents the DOM tree parsed from an HTML file. The DOM tree represents the structure of the web page in the form of a document; the tags defined in the HTML file are structured in the DOM tree as elements. Inline specifications such as style and scripting are defined but not applied.
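As an illustration of the structural representation described above, the following is a minimal sketch of how a DOM tree could be serialised to tag names only, so that two page visits (web bot versus human being) can be compared structurally while ignoring visual styling; the function name tagTree is an illustrative assumption.

```javascript
// Sketch: serialise the structural representation (tag names only) of a page.
// Comparing this tree between a web-bot visit and a human visit reveals structural
// deviations, independent of how the page is rendered visually.
function tagTree(node) {
  return {
    tag: node.tagName,
    children: Array.from(node.children).map(tagTree)
  };
}

const structure = JSON.stringify(tagTree(document.documentElement));
console.log(structure); // e.g. {"tag":"HTML","children":[{"tag":"HEAD",...},{"tag":"BODY",...}]}
```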

Figure 2-1 describes different server responses that can lead to a deviation on a web page in the visual and / or structural representation.

Figure 2-1: Possible server responses based on web-bot detection

At the top of Figure 2-1 a web page visit by a human being is displayed, and at the bottom a web page visit by a web bot is displayed.

The latter illustrates that when a web bot has been detected, the server host (the owner of the web site or the owner of a library included on the web site) can choose to block the client or to provide an alternative response, both leading to deviations on a web page.


1. Block
In this case the server host chooses to send no response to the client at all. As a result, the contents of the web page may not change, in contrast to a visit to the web page by a human being.
2. Alternative
a. Blocked web page content
The content of the web page has been blocked; browsing the web page (or extracting data from it) is not possible. Blocking can be embodied on a site by replacing the main content with a CAPTCHA. Blocking of the content of a web page is usually obvious to the end user (human being or web bot) of the web page.
b. Deviation in web page content
This is a response that a human being would not experience when using a regular browser. It is also the most dangerous and hardest deviation to grasp: imagine, for instance, a situation where only one word on a page has been changed. With respect to studies, this response could harm the outcome the most because it is hard to detect.

If the server host chooses neither to block the response nor to provide an alternative response, the result returned by the web server is the same as for a visit by a human being with a regular browser (a web page with the same look and feel).
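A minimal server-side sketch of the outcomes in Figure 2-1, written with Node.js and Express purely for illustration; the looksLikeWebBot check, file names and routes are assumptions and not the implementation of any site discussed in this thesis.

```javascript
// Sketch (Node.js + Express): possible server responses once a visitor has been classified.
const path = require('path');
const express = require('express');
const app = express();

// Placeholder classifier: a real site would combine network, HTTP and client-side signals.
function looksLikeWebBot(req) {
  return /headless|phantomjs/i.test(req.get('User-Agent') || '');
}

app.get('/', (req, res) => {
  if (!looksLikeWebBot(req)) {
    // Regular visitor: the same page a human being sees.
    return res.sendFile(path.join(__dirname, 'regular-page.html'));
  }
  // Detected web bot: the host can block, serve a CAPTCHA, or serve deviating content.
  // req.socket.destroy();                                        // 1. block (no response)
  // return res.sendFile(path.join(__dirname, 'captcha.html'));   // 2a. blocked web page content
  return res.sendFile(path.join(__dirname, 'alternative.html'));  // 2b. deviation in content
});

app.listen(3000);
```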

2.1.3 Web-bot detection by commercial companies

This paragraph describes two white papers originating from commercial cybersecurity companies regarding the adoption of web-bot detection. Both whitepapers illustrate the motivation for the need for web-bot detection (from the perspective of the companies described) and the actuality of web-bot detection nowadays.

The detection of “Methbot”
The WhiteOps company released a whitepaper in relation to the existence of a web bot dubbed “Methbot” [Whi16]. Methbot attacks the digital advertising market by generating fraudulent revenues. The work of White Ops describes that it can identify Methbot based upon a unique signature. The paper reveals very few technical details about Methbot or its detection, although it describes that Methbot tries to appear as a human being by:
- Stuffing crafted cookies into fake web sessions
- Faking cursor movements and clicks
- Forging social network logins

The bot runs under Node.js12 and uses several open source libraries. White Ops has exposed IP addresses and falsified domain and URL lists to defeat this web bot.

Classification of detected behaviour
The work of Distil Networks discusses the rise of so-called Advanced Persistent Bots (APBs) [Dis16]. An advanced persistent bot is a bot that owns several advanced capabilities, including mimicking human behaviour, loading JavaScript and external resources, cookie support, browser automation, and spoofing IP addresses and user agents. Distil Networks analysed an increase in seemingly human traffic and mentioned bad bots in the disguise of an APB as one explanation. The work also presents several metric details, such as the most used user agent, top origination countries, top originators and the like. The work concludes with the observation that many bots nowadays are mimicking human behaviour: bad bots begin to work from 9 am till 5 pm, and bots rotate user agents and IP addresses. The study shows that bots are being detected together with a lot of detail about them. It also illustrates that mimicking human behaviour is a hard thing to do for a web bot.

12 https://nodejs.org


Both studies reveal that the phenomenon 'web-bot detection' exists and occurs in practice, which serves as motivation for the study proposed. The data extracted by making use of web bots can contain deviations caused by web-bot detection. Unfortunately, no proof or detailed technical information is available that can be used for measuring its presence on the internet, which has influenced the methodology of our study (cf. chapter 4).

2.2 Types of web bots

We can distinguish different types of web bots based upon their data extraction strategy (cf. appendix A), as described in Table 2.2. With data extraction strategy we mean the strategy of targeting a website for data extraction (i.e. browser or HTTP request).

Type | Example | Strategy | Description*
HTTP based | WGET | HTTP requests | This web bot deals with HTTP requests on a lower level. It basically sends and receives HTTP requests. No dynamic content can be executed.
Headless browser | PhantomJS | Browser | Stripped-down browser without a graphical interface.
Headful browser | Selenium + WebDriver | Browser | (Full-fledged) automated scriptable browser.

Table 2.2: Different types of web bots based upon data extraction strategy

* See appendix A: Taxonomy of automatic data extraction methods for a more detailed description.

In the context of the study, the HTTP based web bot is irrelevant. The reason is that the server host does not need to have any web-bot detection mechanism in place to detect this type of web bot: for a server host it is sufficient to implement dynamic content to detect an HTTP based web bot. When the dynamic code does not get executed, the server host knows that the client (human being or web bot) is not making use of a (regular) browser. Detecting this detection mechanism is useless, as it would label every web site that makes use of scripting as performing web-bot detection, which would cover over 90 per cent of the internet13.
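The paragraph above notes that HTTP based bots betray themselves simply by not executing dynamic content. A minimal sketch of that idea, assuming an illustrative cookie name and /js-check endpoint: the page sets a marker via JavaScript, and a client whose requests never carry that marker most likely was not a (regular) browser.

```javascript
// Sketch: distinguishing HTTP-based clients from browsers by requiring JavaScript execution.
// Cookie name and endpoint are illustrative assumptions.
document.cookie = 'js_executed=1; path=/';
fetch('/js-check', { method: 'POST', credentials: 'include' })
  .catch(() => { /* ignore network errors in this sketch */ });
// Server side: a visitor whose subsequent requests never carry js_executed=1 (and that never
// hits /js-check) did not run the page's scripts, i.e. it is an HTTP-based client such as WGET.
```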

Headless and headful browsers
The most relevant and popular web bot in the light of the proposed study is the browser type. A browser based web bot is basically a scriptable browser, often supported by a framework that offers the necessary tools for extracting data from web pages. A headless browser is a specifically developed program that mimics an actual browser without displaying anything. A headful browser is a full-fledged browser automated by means of a driver that is able to send instructions to the browser.

There can be an important difference between a headless and headful browser as also described in the work of Englehardt et al. referring to PhantomJS (a headless browser) in comparison to a headful browser:

[…] PhantomJS loads about 30% fewer HTML files, and about 50% fewer resources with plain text and stream content types. [EN16]

13 https://w3techs.com/technologies/details/cp-javascript/all/all


In general, a headless browser is more efficient with system resources, which makes it more scalable. In contrast, a headful browser is able to load more web site resources (see the citation of Englehardt et al. above) because it is a full-fledged browser instance. A browser-based web bot comes with the main advantage that it mimics the behaviour of a human being who is using a browser. Mimicking a human being is an advantage concerning data extraction because:

1. It extracts the same data as a regular browser user will see. For a scientist it is important that the extracted data reflects reality.
2. It makes it harder for the target website to distinguish a normal user from a web bot.
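To make the browser based type concrete, below is a small data-extraction sketch using the selenium-webdriver package for Node.js; the target URL and CSS selector are placeholders, and headless operation is toggled with a Chrome option, purely as an illustration of the web-bot type discussed above.

```javascript
// Sketch: a browser-based web bot (headful or headless) built on Selenium WebDriver.
// URL and selector are placeholders; error handling is kept minimal for brevity.
const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function extractPrices(url) {
  const options = new chrome.Options();
  // options.addArguments('--headless=new'); // uncomment to run as a headless instance
  const driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();
  try {
    await driver.get(url);
    const elements = await driver.findElements(By.css('.price'));
    return Promise.all(elements.map(e => e.getText()));
  } finally {
    await driver.quit();
  }
}

extractPrices('https://example.com').then(console.log).catch(console.error);
```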

2.3 Data extraction tactics

Studies use different data extraction tactics (cf. paragraph 2.4). By a tactic we refer to the relation between web bot and web site as described in the work of Schrenk [Sch12]. Schrenk categorizes four tactics (referred to by Schrenk as environment scenarios), as illustrated in Figure 2-2:

• One-to-many (A): one web bot that runs many times to gather information from multiple web pages.
• One-to-one (B): one web bot that gathers information from a single target and repeats this process many times over an extended period of time.
• Many-to-many (C): multiple web bots gathering information from multiple web pages.
• Many-to-one (D): multiple web bots gathering information from a single web page.

Figure 2-2: Web-bot employment tactics [Sch12]

In the context of the study the data extraction tactic is important, as it can influence the detection of a web bot. A web bot employed based on a one-to-one tactic will be more detectable because it targets the same web page using the same web bot. Based upon traffic or a unique signature, this web bot is more likely to be detected than, for instance, one using a one-to-many tactic that targets a specific web page less frequently.


2.4 Studies based on automated data extraction

This paragraph describes studies that employ automated data extraction for gathering data that serves as foundation for their research. Each study has been categorized in one of the following categories: privacy, security, social media or networking. Per research category we give insight into the techniques and approaches applied.

Privacy
The work of Nikiforakis et al. describes research into the ecosystem of web-based device fingerprinting [NKJ+13]. In their work, Nikiforakis et al. crawled popular websites looking for code inclusions originating from three commercial companies for fingerprinting purposes. The web bot used crawled up to 20 pages for each of the Alexa top 10,000 sites. Nikiforakis et al. investigated the extracted data, such as browser plugins (Flash) and web pages, for script inclusions (JavaScript / ActionScript), bypassing of HTTP proxies and iframes developed for fingerprinting purposes. The results are categorized based on features like Flash enabled, screen resolution and the like. Their work also shows the adoption of device fingerprinting categorized by the domain categorization of a popular anti-virus vendor.

The work of Englehardt et al. presents a measurement study of online tracking by crawling 1 million websites (visiting only the homepages) by making use of the framework OpenWPM [EN16]. OpenWPM is an open-source web privacy measurement framework which implements two categories of web bots for extracting data from the internet: (1) lightweight browsers (like PhantomJS) and (2) full-fledged browsers (cf. appendix A), like Chrome and Firefox. By making use of consumer browsers, all techniques like HTML5 storage options and Flash are supported. There are similarities with the work of Nikiforakis et al. in that it discusses fingerprinting technologies and techniques and their adoption in the wild by analysing the data. The work of Englehardt et al. is different in that it discusses the categorisation between first and third party page visits in the context of tracking.

The work of Acar et al. sheds light on current device fingerprinting practices by making use of FPDetective [AJN+13]. FPDetective is a framework for identifying and analysing web-based device fingerprinting. The framework uses two lightweight automated browsers for extracting web pages and browser plugin objects from the top million Alexa websites. For the detection of Flash-based fingerprinting, all browser traffic is directed through an intercepting proxy. Acar et al. analysed extracted scripts and decompiled Flash source code for fingerprinting behaviour.

A current study by Krumnow14 investigates price discrimination on the internet. Krumnow makes use of a data extraction framework that uses a headless browser (PhantomJS15) for extracting data from web pages. Krumnow targets specific airline sites, extracting price information regarding specific flights.

Security
The work of Rieck et al. presents Cujo, a system for automatic detection and prevention of drive-by-download attacks [RKD10]. Cujo inspects web pages and blocks delivery of malicious JavaScript code. Rieck et al. used two different data sets, containing web site URLs, to study the detection and runtime performance in detail:

• Data sets of benign websites: the Alexa top 200,000 most visited web pages, and Surfing, which comprises 20,283 URLs of web pages visited during usual web surfing.
• Attack data sets that have been adopted from related work.

The URLs in the datasets are used for extracting data from websites. The HTML and XML documents of the extracted data are inspected for occurrences of JavaScript.

14 Benjamin Krumnow, Price Discrimination Project, http://www.open.ou.nl/hjo/supervision.htm
15 http://phantomjs.org

All embedded code blocks, such as script blocks and event handlers like onmouseover, are extracted from the documents and merged for further static and dynamic analysis.

The work of Rydstedt et al. surveyed the frame busting practices of the Alexa top 500 websites [RBB+10]. Rydstedt et al. used the Java-based emulator HTMLUnit16 for extracting the JavaScript code used in the web site contents. The JavaScript code is analysed for locating and extracting frame busting code. HTMLUnit is used as a headless emulator for debugging JavaScript, in order to dynamically frame pages and break at the actual script used for frame busting.

The work of Zhou et al. presents a study concerning the detection of malicious applications on popular Android markets [YZW+12]. In their research they use a crawler built into an Android app, 'DroidRanger', to explore the apps published in 5 different Android markets. The automatically extracted data includes Android packages that are analysed for permission behaviour and dynamic loading.

Social media
The work of Dai et al. presents a study towards the characterization of LinkedIn17 profiles. By employing a web bot for extracting data from specific LinkedIn profiles, academic backgrounds are studied and a classification of education levels has been constructed [DNV15]. Dai et al. scraped specific data instead of crawling, to get the maximum number of public profiles in a minimum amount of time. The Scrapy18 framework was used to extract data from specific LinkedIn profiles, which were analysed by using regular expressions corresponding to XPath expressions for discovering hidden relationships in HTML files.

The work of Catanese et al. describes the collection and analysis of massive data describing the connections between participants of a social network [CMF+11]. They define and evaluate approaches to social network data collection in practice against the Facebook19 website. Catanese et al. used two different web bots to extract friend lists. The web bots extract a friend-list graph by sending HTTP requests based on a queue that contains a list of users to be visited.

Networking
The work of Ager et al. studies the DNS deployment of commercial Internet Service Providers (ISPs) and compares the results to widely used third-party DNS resolvers (GoogleDNS and OpenDNS) [AMS+10]. For their study they are interested in the DNS performance perceived by end users and therefore make efforts to collect data directly from users' computers. They wrote code that performs DNS queries for more than 10,000 hosts relying on different DNS resolvers. Their study crawled the top of the Alexa 1,000,000 sites to assess bias between response times. The content of the extracted data is analysed for resolving host names. Ager et al. use Wget20 for downloading the content of these hosts.

Relational
All of the aforementioned studies extract data from the internet by using a web bot. Only in the work of Englehardt et al. is a deviation based on the employment of a web bot mentioned, and although it is mentioned, no possible causes or consequences are described. This paragraph illustrates the topicality of automated data extraction performed by web bots, and that there is little to no attention for web-bot detection in relation to the approach taken during a study.

On top of that, these studies also illustrate that various techniques and tactics for data extraction are available.

16 http://htmlunit.sourceforge.net
17 http://linkedin.com
18 http://scrapy.org
19 http://facebook.com
20 https://www.gnu.org/software/wget/


• Extraction techniques
For instance, a browser based approach or an HTTP request approach.
• Extraction tactics
The differences in applied tactics are strongly related to the goal of the particular study. The work of Krumnow21 extracts data from specific sites and repeats this process many times (one-to-one environment [Sch12]), identifying something unknown (black box). The other described studies extract data in the wild, targeting information from a variety of web sites (one-to-many or even many-to-many [Sch12]), while knowing what to look for.

This knowledge (data extraction techniques and tactics) is valuable with respect to:
• Expressing the relation between web-bot detection and studies applying different techniques.
• Choosing the right technique for the proposed methodology (cf. chapter 4).

21 Benjamin Krumnow, Price Discrimination Project, http://www.open.ou.nl/hjo/supervision.htm


3 Related work

This chapter describes work related to the study proposed, which can be divided into the categories bot detection and fingerprinting.

Bot detection
Related work with respect to the proposed study can be divided into bot detection and web-bot detection. Bot detection is related to the identification of bots on the internet based on traffic analysis. Web-bot detection concerns the detection of a bot on the web, related to the HTTP protocol, based on server logs.

Bot detection is a rich research field concerning the detection of web bots / botnets based on traffic analysis. For instance, the work of Hsu et al. is focused on the detection of web bots based upon the fast-flux network architecture22 [HHC10]. Hsu et al. propose a scheme that is used to detect a web bot based upon delays in IP traffic and the origin of DNS records. Similarly, Liu et al. propose 'BotTracer', which is able to detect bot malware covering the following phases [LCY+08]:
- Automatic startup of the bot without any user action.
- Communication of client commands to the server.
- Attack performed by the web bot.

BotTracer is able to detect malware activities of a web bot by monitoring machine processes and traffic transferred from client to server.

With regard to web-bot detection based on server logs: Tan et al. propose a method that uses navigational patterns in click-stream data to determine whether it is due to a robot [TK02]. Tan et al. observe different navigational patterns in web server logs. The work of Yu et al. describes research into search bot traffic from search engine query logs [YXK10]. Yu et al. present SBotMiner, which is able to identify stealthy search bot traffic. SBotMiner gathers groups of users who share at least one identical query and click. Furthermore, SBotMiner examines the aggregated properties of their search activities.

Although the aforementioned studies are related in the sense that bot detection takes place, their approach and goal differ from those of the proposed study. In general, we came across bot detection studies that focus on detecting malicious activities by analysing network activities. In contrast, the focus of our study is on the quality of data extracted by web bots, which can be employed for malicious purposes but also for benign purposes such as acquiring scientific knowledge. On top of that, our study differs from related work because it researches a web-bot browser approach that has been executed entirely from a client perspective. We believe that a browser based web-bot detection approach is more in line with the evolution of web techniques that can easily mimic the behaviour of a human being using a browser, which makes it hard to detect a web bot based on network activities (traffic and / or server logging). To the best of our knowledge this is the first study towards measuring the adoption of web-bot detection on the internet using a browser based approach.

Fingerprinting
The work of Torres et al., in which the prototype tool FP-Block is presented, is related to the proposed study through its use of browser fingerprinting [TJM14]. The fingerprint of a browser is a set of characteristics exposed by a browser that can be used to uniquely identify an instance of a web browser.

22 https://en.wikipedia.org/wiki/Fast_flux


FP-Block counters cross-domain fingerprint-based tracking in such a way that it does not impact the embedded context of a web page. Torres et al. used the concept of a web identity, which is a set of browser fingerprinting characteristics, to counter the tracking and still maintain the functionality provided by the embedded plugins. The work of Torres et al. established a fingerprint surface from which the web identity could be generated. The fingerprint surface represents the distinguishing browser characteristics that can be used to identify a web browser instance. The establishing of the fingerprint surface and the use of it are both closely related to the proposed study, where a fingerprint surface is created, also based upon browser characteristics, to identify web-bot browser instances. The work of Torres et al. has directly been used to determine relevant browser properties (cf. chapter 7).

The work of Mudialba et al. is also related to our study as it examines the behaviour of a web page depending on the ability to track based on fingerprinting [MNM17]. Mudialba et al. combined two tools, Ad Fisher [SIM17] and FP-Block [TJM14]. They found statistically significant evidence that fingerprinting occurs on certain web services. Ad Fisher is a tool that investigates the segmentation of the ads displayed by Google on web pages of third parties. Mudialba et al. found that there is indeed a difference between the ads provided to browser instances when tracking is blocked by modifying the fingerprint and when it is not blocked. Their study is directly related to the proposed study because it investigates the effect of the browser fingerprint on the content of a webpage. In our study we examine indirectly the effect of a browser fingerprint on the content of a web page, in combination with the presence of web-bot detection on the web site.


4 Methodology

In this chapter we propose a methodology, illustrated as a framework in Figure 4-1, to investigate the adoption of web-bot detection on the internet and its relation to web-bot based studies.

Figure 4-1: Methodology as a framework

Before we dive into the details of the separate phases of the methodology, we want to stress that we used a browser based approach across all of the phases. The first reason for this choice is that HTTP based web bots are not relevant in the light of the study proposed (cf. paragraph 2.2). Secondly, browser based automated data extraction techniques are the techniques most used by scientists (cf. paragraph 2.4). The reason for this popularity is the capability of using out-of-the-box browser functionalities such as navigating on a webpage and accessing the DOM, as well as mimicking the behaviour of a human being. By mimicking browser behaviour, studies tend to extract data that is similar to the data that a human being would see in his or her browser. Overall, by taking both irrelevant and popular web-bot techniques into consideration through a browser based approach, we aim to detect a high coverage of web-bot detection on the internet.

1. Initial case study
We started off our study by conducting an initial case study to gain knowledge about automated data extraction and to gain insight into the disadvantages of the employment of web bots in relation to web-bot detection.


The case study is based on a price discrimination project by Krumnow23 where data has been extracted from specific sites to get a better understanding of price discrimination on the web. The price discrimination project employs a framework that uses PhantomJS, a lightweight headless browser, for extracting content from websites. During the case study, problems were experienced related to the extraction of data from particular sites using a web bot:

• Particular sites, like http://iberia.com, will block requests from a particular IP address based on detection of traffic, for an unknown amount of time.
• Particular sites, like http://iberia.com, will block any request from a cloud based web server by default (for instance from Amazon EC224).
• For particular sites, like Transavia (https://www.transavia.com), the exact criteria / algorithm that leads to data deviation is hard to determine.

Additional insights gained during the research showed that the web bot was not able to extract data from specific sites in particular situations for a specific amount of time. Also, the location of the web bot, the target site and web behaviours like interaction could lead to data deviations in the form of blocking of a particular page.

Hands-on experience taught us that the problems seem to be very specific to a data extraction approach (cf. appendix A) and even temporary: data extraction for a particular site could result in blocked pages at one moment in time, while the blocking could have vanished an hour later. We concluded that the impediments in relation to automatic data extraction are a real concern and that web-bot detection is hard to prove. It became obvious that additional research is necessary to prove web-bot detection and to measure its adoption on the internet. Subsequently, the knowledge can be used to relate web-bot detection to studies based on automated data extraction.

2. Analysis of a client side web-bot detection implementation
By conducting the initial case study, we became convinced of the existence of web-bot detection, but we did not have the means to measure web-bot detection on the internet. To find these necessary means, we decided to analyse an existing web-bot detection implementation to get a better understanding of the mechanism used, which we then used as a foundation to find other web-bot detection implementations. On top of that, we were able to prove the existence of web-bot detection based upon its implementation.

We chose to analyse, with a web bot, a web site that has a strong web-bot detection smell based on our observations acquired during the initial case study. Our goal was to find artefacts of a client side detection implementation, as this gave us the opportunity to prove web-bot detection, in contrast to a server side detection implementation:

• A client-side approach will be used as we do not possess the code that runs on the server in the context of a web site. Code that resides on the server can be considered a black box. A client-side approach is well suited because a web bot relates to a web site as a client; in other words, web-bot characteristics are always transferred from client to server, as illustrated in Figure 1-1 and Figure 2-1.
• Proving the existence of web-bot detection based on deviations in the contents of web pages caused by the employment of a web bot is also hard to do: a deviation can be caused by traffic, proxy caching or even A/B testing25.

3. Browser family fingerprint classification

23 Benjamin Krumnow, Price Discrimination Project, http://www.open.ou.nl/hjo/supervision.htm
24 https://aws.amazon.com/ec2/
25 https://www.optimizely.com/optimization-glossary/ab-testing/


The analysis of a web-bot detection implementation taught us that web-bot detection takes place based on specific properties of a web bot, which we will refer to as the web-bot fingerprint. In the light of the study we want to detect all possible kinds of web bots. This implies that we have to collect all available web-bot specific properties, because a web bot could get detected based upon those properties. The only way to gather all specific web-bot properties is to distinguish the properties of a web-bot driven browser from the properties of a human driven browser (which we will refer to as a standalone browser). We have created a browser family classification that we used for the comparison between web-bot and standalone browser properties, to prevent comparing apples with oranges, for instance comparing the properties of a standalone browser from one browser family to the properties of a Google Chrome web-bot driven browser.

4. Determining the web-bot fingerprint surface
In this phase the browser family fingerprint classification has been used to identify web-bot specific fingerprints that can be used to distinguish a web-bot driven browser from a standalone browser, which we refer to as the web-bot fingerprint surface. The web-bot fingerprint surface has been used for measuring the adoption of web-bot detection on the internet (phase five). In line with the choice made for a client side analysis in phase two, we chose to determine the web-bot fingerprint surface based on the properties of a browser. The first step in this phase concerns the gathering of the browser properties of different standalone and web-bot driven browsers. Subsequently, a delta has been made between the gathered standalone and web-bot driven browser properties by making use of the browser family fingerprint classification, which results in a web-bot fingerprint surface. The web-bot fingerprint surface is a union of the browser fingerprint and web-bot fingerprint after validation of its usability to measure web-bot detection. In the context of this study, browser properties are all properties exposed by a browser instance through the global DOM objects window, navigator and document. Browser properties contain browser characteristics such as the installed plugins, fonts or userAgent, but also generic properties of the DOM object itself (e.g. window.closed or navigator.onLine).

5. Measuring the adoption of web-bot detection on the internet
In this phase the web-bot fingerprint surface acquired during the previous phase is employed to detect web-bot detection on the internet. The web-bot fingerprint surface has been implemented, embodied in detection patterns, in a scanner that is able to extract script content from a web page and examine that content for signs of web-bot detection. The first step in this phase, verification, strengthens the accuracy of the implementation of the scanner by verifying its working on sites containing a strong web-bot detection smell. The validation step involves the employment of the scanner in the wild to measure the adoption of web-bot detection on the internet. The last step in this phase examines the data gathered by the scanner to measure the coverage of web-bot detection on the internet.

6. The impact of web-bot detection on deviations on web pages
In this phase the effect of web-bot detection on the web sites gathered during phase five is analysed. We selected 20 web sites based upon their web-bot detection score and the web-bot detection patterns used. The main goal is to gain more insight into the different types of deviations that can occur as a result of web-bot detection. These insights can be used to express the impact of web-bot detection on web-bot based studies.


5 Analysis web-bot detection implementation

This chapter describes the analysis of a web-bot detection implementation on StubHub.com (https://www.stubhub.com). The site StubHub.com has been chosen as it shows signs of web-bot detection, as pointed out by an active discussion on the internet26. We found the site simply by searching the internet, triggered by the limitations of web bots observed during the initial case study (cf. chapter 4).

The main objective during the analysis was to find artefacts that can be used for the detection of web-bot detection. To fulfil this objective, we first needed to prove the existence of web-bot detection and determine on which implementation level the web-bot detection takes place (cf. paragraph 5.1). After confirmation of the existence of web-bot detection and determination of the implementation level, we investigated the web-bot detection mechanism to determine whether the mechanism can be used as a foundation for detecting web-bot detection (cf. paragraph 5.2). This chapter concludes with a validation of the client side detection method (cf. paragraph 5.3).

5.1 Manual observation: browser specific web-bot detection
This paragraph describes the identification of web-bot detection and the layer on which the web-bot detection has been implemented (cf. paragraph 2.1.1) by making use of a manual observation. We observed web-bot detection by comparing the homepage of StubHub.com between two kinds of visits: a visit by a standalone browser and a visit by a web-bot driven browser.

5.1.1 Configuration
For the manual observation we used three different machine configurations for visiting the homepage of StubHub.com:

1. [Manual]: No automated software installed (Windows 10). Google Chrome and Firefox installed.
2. [Selenium + WebDriver]: Selenium with Chrome WebDriver (Ubuntu). Google Chrome and Firefox installed.
3. [Selenium + Plugin]: Selenium IDE Firefox plugin. Firefox and Google Chrome installed.

We used Selenium WebDriver27 and Selenium IDE28 in the role of a web bot, as these are popular techniques that are widely adopted for automated data extraction.

We took into account browser configurations concerning traffic and caching, as these can be used to determine the implementation layer. Determination of the implementation layer is important because it determines whether the detection implementation can be used for detecting web-bot detection (cf. chapter 4).

Traffic
A web server may employ web-bot anti-patterns if it monitors heavy traffic from a particular client, for instance multiple HTTP requests within one hour. A web server is able to trace back the requests based upon the IP address of a client.

26 https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with- chromedriver 27 https://www.seleniumhq.org/projects/webdriver/ 28 https://www.seleniumhq.org/projects/ide/


Caching
When a client visits a web page, much information about the relation between the client and the web page will be cached. We distinguish two types of cache:

• Server cache: information stored on the server in relation to the identification of the client. Here we can think of IP addresses in combination with other identification attributes, such as a browser fingerprint.

• Client cache: information stored on the client, typically in cookies.

Concerning server side detection mechanisms it is important to distinguish between a 'fresh' web page and a cached web page. Unfortunately, this distinction is not straightforward, as it is hard to determine when a web page is 'fresh'. A web page can be considered fresh if it has never been visited before from a machine with a particular IP address. Also, a web page visited from a different IP address than the one that visited the page before can be considered fresh if the IP variable is not taken into account. In other words, it depends on the target site and its caching mechanisms. When the web page has not been visited from a particular IP address before, we refer to it as 'IP fresh'. In case the client-side cookies are flushed we refer to it as 'Cookie fresh'.

Besides the aforementioned configurations, the following client configuration variables have been taken into account:
- Selenium employment with and without a user profile
- Browsing in private and normal mode (Chrome: Incognito, Firefox: Private window)

5.1.2 Analysis
For each configuration the homepage of StubHub.com has been targeted for a maximum of fifteen page visits within a time window of two days. Manual observation taught us that web-bot detection on StubHub.com is embodied by a web-bot anti-pattern (a Captcha). We chose not to resolve the Captcha as it could mask the effect of the blocking with respect to the other approaches. Besides, resolving it could reset the suspicion of our approaches and configurations, which could make it tough to reproduce the blocking behaviour. The results of the blocking situation are described in Table 5.1.

Machine configuration | Standalone Google Chrome | Selenium + Chrome WebDriver | Selenium IDE | Max. page visits | Remark
Selenium + WebDriver | blocked | blocked | | 2 | Blocking behaviour only occurs after the Selenium Chrome WebDriver has been detected.
Selenium + Plugin | | | blocked | 2 |
Table 5.1: Manual observation web-bot detection StubHub.com

The rows in Table 5.1 represent the web page visits per machine configuration. The columns represent the different browser instances used to visit StubHub.com. Table 5.1 shows that blocking occurs for the configurations Selenium + WebDriver and Selenium IDE (details are described in appendix B). On both machines web-bot artefacts (automated extraction software) are installed, which indicates that StubHub is able to detect them.


Browser specific
A remarkable situation is the blocked standalone Google Chrome browser in the first column in relation to the Selenium + WebDriver machine configuration. The reason why this standalone web browser got blocked is most likely that it utilizes the same browser that has also been used by the Selenium + Chrome WebDriver browser instance. That is likely the reason why it only got blocked after a visit of the Selenium + Chrome WebDriver browser instance.

Independent of browser and network properties
Browser and IP configurations are not described in Table 5.1 as they turned out to be irrelevant with respect to web-bot detection. For instance, it did not matter whether we used a 'Cookie fresh' configuration or a private browser mode (cf. appendix B). The results illustrate that the blocking occurred in a very configuration specific manner. For instance, if StubHub.com blocked the page by means of a Captcha for a specific configuration, it did not block the other machine configuration utilizing the same IP. This indicates that blocking does not take place at the IP address level.

Based upon the results gained during the manual observations we can deduce that web-bot detection takes place on StubHub.com. Web-bot detection is machine specific in relation to the installed web-bot artefacts (Selenium WebDriver / IDE) and browser specific, as a standalone browser can also be blocked after detection of a web-bot browser instance. Network specifics like IP address and caching mechanisms are irrelevant, which implies that the characteristics for web-bot detection are determined client side on layer 8 of the OSI model (cf. 2.1.1).

5.2 Web-bot detection based on browser properties
The manual web-bot detection observation described in paragraph 5.1 revealed that web-bot detection takes place on StubHub.com based on a client-side, browser based web-bot detection implementation. This paragraph describes an investigation into the inner workings of the web-bot detection implementation on StubHub.com, to see whether the implementation can be used as a foundation for detecting web-bot detection. First of all, we investigated the source code for web-bot detection inclusions and found observations of specific browser properties. Secondly, we analysed the network traffic to find out whether the observed browser properties were communicated to the server and triggered web-bot detection.

5.2.1 The observation of browser properties
We investigated the source of the home page of StubHub.com for code inclusions related to web-bot detection using the following approach:

1. Visited https://StubHub.com by making use of a Selenium driven web browser.
2. Downloaded the page source.
3. Extracted inline script content from the page source.
4. Extracted URLs linking to external script files and downloaded the files. With respect to the external scripts included by the main page we differentiate between first-party scripts and third-party scripts. First-party scripts originate from the domain of the website (StubHub.com) itself and serve the main purpose of the website; the latter originate from different domains or serve a different goal (advertising, social media, user statistics etcetera). We have taken into account the first-party scripts and the first level of third-party scripts.


For each script artefact mentioned in steps three and four we have:

5. De-obfuscated the source by making use of jsbeautifier29 and by parsing code from URI encoded format to its string representation. The title de-obfuscation does not really fit the work of JsBeautifier: JsBeautifier is able to reformat the source to make it more readable, unpack scripts packed by specific packers (Dean Edwards packer30) and de-obfuscate scripts processed by javascriptobfuscator31. If the source is obfuscated by a professional company it will not use javascriptobfuscator, as it is too familiar to the public. A professional company would prefer to keep the obfuscation algorithm to itself, which decreases the chance of effectiveness of javascriptobfuscator.
6. Observed the parsed source for specific keywords. The keywords that we used can be distinguished into the following categories:
- Related to cyber security companies: during the manual observation (cf. paragraph 5.1) we ran into some URLs pointing to Distil Networks32.
- Web-bot anti-patterns: keywords like Captcha, honeypot and the like.
- Specific web-bot properties: sources on the internet33 describe that web bots can be detected by certain keywords. This category represents a collection of keywords mentioned on the web, like driver, selenium, phantom and the like.
- Global DOM objects: the keywords in this category concern browser properties that can uniquely identify a browser instance. Examination of their occurrence in the source code is important, as they can reveal a certain behaviour and a type of browser (web bot or standalone browser). The following global DOM object categories are being observed:
  - Navigator
  - Browser
  - Graphics
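To make these steps concrete, the sketch below shows how inline script content and external script URLs could be extracted from a downloaded page source and scanned for such keywords. It is a minimal illustration only: the target URL, the regular expressions and the keyword list are examples chosen here, not the exact patterns or tooling used in this study.

// Minimal sketch (Node.js): extract script content from a page source and
// scan it for web-bot detection related keywords. The keyword list below is
// an illustrative example, not the exact list used in this study.
const https = require('https');

const KEYWORDS = ['selenium', 'webdriver', 'phantom', '$cdc_', 'captcha'];

function fetch(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

async function scan(url) {
  const html = await fetch(url);
  // Step 3: inline script content.
  const inline = [...html.matchAll(/<script[^>]*>([\s\S]*?)<\/script>/gi)].map(m => m[1]);
  // Step 4: URLs of external script files (to be downloaded and examined too).
  const external = [...html.matchAll(/<script[^>]*src=["']([^"']+)["']/gi)].map(m => m[1]);
  // Step 6: observe the (inline) sources for specific keywords.
  for (const src of inline) {
    const hits = KEYWORDS.filter(k => src.toLowerCase().includes(k));
    if (hits.length) console.log('Inline script matches:', hits);
  }
  console.log('External scripts to download and examine:', external);
}

scan('https://example.com'); // example URL, not the site analysed in this chapter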

During the source analysis we took into account the possibility that the content of the investigated web page had been tampered with in relation to the Selenium web bot. For this reason, we compared the responses originating from the different configurations and can confirm that we could not identify web-bot detection related differences across the three configurations.

By conducting the web page source analysis, we found specific code inclusions related to:
• Web-bot detection
• Browser fingerprinting
Both categories of code inclusions can be found in two different files, respectively async.js and zwxsutztwbeffxbyzcquv.js (both can be found in one of the GIT repositories34 of the project).

Web-bot detection in async.js
In async.js web-bot literals are included in an array, obfuscated in a hexadecimal format, as illustrated in Code Snippet 5-1.

29 Jsbeautifier.org
30 http://dean.edwards.name/packer/
31 http://javascriptobfuscator.com/
32 https://www.distilnetworks.com/
33 For instance: https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver
34 https://github.com/GabryVlot/BotDetectionScanner/tree/master/detection/examples/distil


var _ac = ["\x72\x75\x6e\x46\x6f\x6e\x74\x73", "\x70\x69\x78\x65\x6c\x44\x65\x70\x74\x68", "\x2d\x31\x2c\x32\x2c\x2d\x39\x34\x2c\x2d\x31\x32\x32\x2c", …]

Code Snippet 5-1: Web-bot keyword obfuscated in hexadecimal format (async.js)

After converting the hexadecimal representations to strings, the following web-bot specific literals were found: $cdc_, phantom, callPhantom, selenium, webdriver and driver. The keywords are used by their position in the array (for instance, selenium equals _ac[436]), which made it hard to track down their usage. For that reason, we created a small application that replaced the position references into the array with the corresponding literals in the source, as illustrated in Code Snippet 5-2.

window[_ac[327]][_ac[635]][_ac[35]](_ac[436]) ===
window["document"]["documentElement"]["getAttribute"]("selenium")

Code Snippet 5-2: Replacement of web-bot keyword array references
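A minimal sketch of how such a replacement step could work is shown below. The example array is illustrative (the real array in async.js holds hundreds of entries), and the small application used in this study is available in the project repository34.

// Sketch: decode hexadecimal escape literals and replace array references.
// The example array below is illustrative only.
const _ac = ["\x72\x75\x6e\x46\x6f\x6e\x74\x73", "\x73\x65\x6c\x65\x6e\x69\x75\x6d"];

// JavaScript string literals with \xNN escapes are plain strings at runtime,
// so printing them already reveals the hidden keywords.
console.log(_ac); // [ 'runFonts', 'selenium' ]

// Replace references like _ac[1] in a source text with the decoded literal,
// which makes the de-obfuscated code readable.
function inlineArrayReferences(source, arr) {
  return source.replace(/_ac\[(\d+)\]/g, (match, index) => JSON.stringify(arr[Number(index)]));
}

const obfuscated = 'window["document"]["documentElement"]["getAttribute"](_ac[1])';
console.log(inlineArrayReferences(obfuscated, _ac));
// window["document"]["documentElement"]["getAttribute"]("selenium")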

The literals originating from the array are used in three functions that are called when collecting fingerprint information. Concerning web-bot detection, the function called sed() (which might stand for Selenium driver) illustrated in Code Snippet 5-3 is the most relevant function.

sed: function() {
    var t;
    t = window["$cdc_asdjflasutopfhvcZLmcfl_"] || document["$cdc_asdjflasutopfhvcZLmcfl_"] ? "1" : "0";
    var e;
    e = null != window["document"]["documentElement"]["getAttribute"]("webdriver") ? "1" : "0";
    var c;
    c = void 0 !== navigator["webdriver"] && navigator["webdriver"] ? "1" : "0";
    var n;
    n = void 0 !== window["webdriver"] ? "1" : "0";
    var a;
    a = void 0 !== window["XPathResult"] || void 0 !== document["XPathResult"] ? "1" : "0";
    var o;
    o = null != window["document"]["documentElement"]["getAttribute"]("driver") ? "1" : "0";
    var f;
    return f = null != window["document"]["documentElement"]["getAttribute"]("selenium") ? "1" : "0", [t, e, c, n, a, o, f]["join"](",")
}

Code Snippet 5-3: Selenium specific web-bot detection function (async.js)

Code Snippet 5-3 shows that the implementation on StubHub.com investigates the properties of the global DOM objects window, document and navigator, exposed by the DOM API implemented in the browser. The values that are used for validation turn out to be specific to web-bot browser instances (cf. chapter 7).

We executed the function described in Code Snippet 5-3 in the context of the different configurations described in paragraph 5.1. Table 5.2 describes the return value of the function per configuration.


Machine configuration | Google Chrome | Firefox | Selenium + Chrome WebDriver | Selenium IDE
Manual | 0,0,0,0,1,0,0 | 0,0,0,0,1,0,0 | |
Selenium + WebDriver | 0,0,0,0,1,0,0 | 0,0,0,0,1,0,0 | 1,0,0,0,1,0,0 |
Selenium + Plugin | 0,0,0,0,1,0,0 | 0,0,0,0,1,0,0 | |
Table 5.2: Results execution web-bot fingerprinting function

The rows in Table 5.2 represent the execution of the function per possible browser instance per configuration. The values in the cells illustrate the return value, made up of the characters "0" and "1" that are assigned per web-bot condition within the sed() function. The results indicate that only the Selenium WebDriver driven instance scores a "1" for the first condition. The first condition is a check on a specific web-bot property, $cdc_asdjflasutopfhvcZLmcfl_, which we substantiate in chapter 7.

zwxsutztwbeffxbyzcquv.js
After de-obfuscating the code in zwxsutztwbeffxbyzcquv.js, which involved a high concentration of hexadecimally obfuscated string parts, we found several literals referring to web bots, as illustrated in Code Snippet 5-4.

var driverArray = ["__driver_evaluate", "__webdriver_evaluate", "__selenium_evaluate", "__fxdriver_evaluate", "__driver_unwrapped", "__webdriver_unwrapped", "__selenium_unwrapped", "__fxdriver_unwrapped", "__webdriver_script_function", "__webdriver_script_func", "__webdriver_script_fn"];
var webBotArray = ["_Selenium_IDE_Recorder", "_phantom", "_selenium", "callPhantom", "callSelenium", "__nightmare"];

if (arrayInstance["match"](/\$[a-z]dc_/) && window["document"][arrayInstance]["cache_"])

if (!webBotIntegerRepresentation && window["external"] && window["external"].toString() && (window["external"].toString()["indexOf"]("Sequentum") != -1))

if ((!webBotIntegerRepresentation) && window["document"]["documentElement"]["getAttribute"]("selenium"))

if ((!webBotIntegerRepresentation) && window["document"]["documentElement"]["getAttribute"]("webdriver"))

if ((!webBotIntegerRepresentation) && window["document"]["documentElement"]["getAttribute"]("driver"))

Code Snippet 5-4: Literals used for detection of automated software (zwxsutztwbeffxbyzcquv.js)

The literals described in Code Snippet 5-4 are used in a function that is invoked on loading of the web page and after an interval of 400 ms. This function also investigates the properties of the global DOM objects for the presence of a web-bot specific literal.
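The invocation pattern could look roughly like the sketch below; the function name is a placeholder of our own, since the original code is obfuscated, and only two of the checks from Code Snippet 5-4 are repeated here.

// Sketch of the observed invocation pattern (names are placeholders): run the
// web-bot checks once the page has loaded and repeat them on a timer.
function checkForWebBotLiterals() {
  var suspicious =
    window["document"]["documentElement"]["getAttribute"]("webdriver") != null ||
    window["document"]["documentElement"]["getAttribute"]("selenium") != null;
  if (suspicious) {
    // In the real implementation the result is folded into the fingerprint
    // payload that is sent to the server (cf. paragraph 5.2.2).
  }
}

window.addEventListener("load", checkForWebBotLiterals);
setInterval(checkForWebBotLiterals, 400); // re-check every 400 ms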

We executed the aforementioned function in the context of the three configurations described in paragraph 5.1. Table 5.3 describes the results of the investigation of literals across the used configurations.


System configuration | Google Chrome | Firefox | Selenium Chrome WebDriver | Selenium IDE
Manual | | | |
Selenium + WebDriver | | | "$cdc_asdjflasutopfhvcZLmcfl_" |
Selenium + Plugin | | | | "_Selenium_IDE_Recorder"
Table 5.3: General web-bot detection

The results in Table 5.3 show that the implementation is able to detect a browser instance driven by Selenium WebDriver as well as the Selenium IDE plugin in Firefox. Both values, $cdc_asdjflasutopfhvcZLmcfl_ and _Selenium_IDE_Recorder, are web-bot specific properties (cf. chapter 7).

Fingerprinting
In general, we observed that browser fingerprinting and web-bot detection are closely related to each other. For instance, on each gesture on the client (by human or bot), browser fingerprinting or web-bot detection takes place through event handlers, as illustrated in Code Snippet 5-5.

0 == cf["doadma_en"] && window["addEventListener"] && (window["addEventListener"]("deviceorientation", cf["cdoa"], !0), window["addEventListener"]("devicemotion", cf["cdma"], !0), cf["doadma_en"] = 1), cf["doa_throttle"] = 0, cf["dma_throttle"] = 0

document["addEventListener"] ? (document["addEventListener"]("touchmove", cf["htm"], !0),
    document["addEventListener"]("touchstart", cf["hts"], !0),
    document["addEventListener"]("touchend", cf["hte"], !0),
    document["addEventListener"]("touchcancel", cf["htc"], !0),
    document["addEventListener"]("mousemove", cf["hmm"], !0),
    document["addEventListener"]("click", cf["hc"], !0),
    document["addEventListener"]("mousedown", cf["hmd"], !0),
    document["addEventListener"]("mouseup", cf["hmu"], !0),
    document["addEventListener"]("pointerdown", cf["hpd"], !0),
    document["addEventListener"]("pointerup", cf["hpu"], !0),
    document["addEventListener"]("keydown", cf["hkd"], !0),
    document["addEventListener"]("keyup", cf["hku"], !0),
    document["addEventListener"]("keypress", cf["hkp"], !0))
  : document["attachEvent"] && (document["attachEvent"]("touchmove", cf["htm"]),
    document["attachEvent"]("touchstart", cf["hts"]),
    document["attachEvent"]("touchend", cf["hte"]),
    document["attachEvent"]("touchcancel", cf["htc"]),
    document["attachEvent"]("onmousemove", cf["hmm"]),
    document["attachEvent"]("onclick", cf["hc"]),
    document["attachEvent"]("onmousedown", cf["hmd"]),
    document["attachEvent"]("onmouseup", cf["hmu"]),
    document["attachEvent"]("onpointerdown", cf["hpd"]),
    document["attachEvent"]("onpointerup", cf["hpu"]),
    document["attachEvent"]("onkeydown", cf["hkd"]),
    document["attachEvent"]("onkeyup", cf["hku"]),
    document["attachEvent"]("onkeypress", cf["hkp"]))

Code Snippet 5-5: Device and gesture handlers toward web-bot detection (async.js)

Code Snippet 5-5 shows the registration of handler functions, declared in an array, for these device and gesture events. The handlers indirectly invoke the web-bot detection functions described in Code Snippet 5-3 and Code Snippet 5-4, plus functions specific to browser fingerprinting described in Code Snippet 5-6.

Although the construction of a browser fingerprint is not the main focus of the project described in this document, we observed it in conjunction with web-bot detection. The results described in Table 5.1 illustrate that a human driven 'standalone' browser also shows signs of blocking after detection of the web bot. This phenomenon has probably been caused by the function described in Code Snippet 5-6.

t = this.userAgentKey(t), t = this.languageKey(t), t = this.screenKey(t), t = this.timezoneKey(t),
t = this.indexedDbKey(t), t = this.addBehaviorKey(t), t = this.openDatabaseKey(t), t = this.cpuClassKey(t),
t = this.platformKey(t), t = this.doNotTrackKey(t), t = this.pluginsKey(t), t = this.canvasKey(t),
t = this.webglKey(t), t = this.touchSupportKey(t), t = this.videoKey(t), t = this.audioKey(t),
t = this.vendorKey(t), t = this.productKey(t), t = this.productSubKey(t), t = this.browserKey(t),
this.keys = t, this.parallel([this.fontsKey], e)

Code Snippet 5-6: Browser fingerprinting functions (zwxsutztwbeffxbyzcquv.js)


5.2.2 Web-bot detection based on client server communication
Paragraph 5.2.1 described that the client-side code on StubHub.com is able to detect a web bot by observing specific browser properties. This paragraph describes the data flow running from client to server and from server to client, to explain how the observation of browser properties results in a deviation on the web page in the form of a Captcha.

Before we dive into the details of the communication between client and server, we first need to take a look at a Distil Networks specific code inclusion to get a better understanding of the implemented web-bot detection mechanism.

Distil Networks specific code inclusion
In the source of StubHub.com we found a specific Distil Networks code inclusion that is responsible for displaying a Captcha when suspicious behaviour has been detected; it is described in Code Snippet 5-7. Although the Distil Networks specific code inclusion in itself is not remarkable (StubHub.com makes use of the services of Distil Networks), it indicates that the server decides whether the client should be blocked (based on web-bot detection):

At one point in the code, StubHub.com makes a POST request with the header "X-Distil-" to the stubhub.com domain at an interval of 27 seconds. The response to this POST request can contain information to (re)validate the cookie on the server (within the if and else if statements) or to display a Captcha in case of suspicious client behaviour (else statement). The remainder of this paragraph describes the contents of a specific request in more depth.

l("DistilPostResponse");
try {
    var e = r.getResponseHeader("X-UID")
} catch (t) {}
if (document.getElementById("distilIdentificationBlock")) {
    var n = encodeURIComponent(document.location.pathname + document.location.search),
        a = "/distil_identify_cookie.?httpReferrer=" + n;
    e && (a = a + "&uid=" + e), document.location.replace(a)
} else if (document.getElementById("distil_ident_block")) {
    var i = "d_ref=" + document.location.pathname.replace(/&/, "%26");
    i += "&qs=" + document.location.search + document.location.hash, e && (i = "uid=" + e + "&" + i), document.location.replace("/distil_identify_cookie.html?" + i)
} else (document.getElementById("distil_ident_block_POST") || document.getElementById("distilIdentificationBlockPOST")) && window.location.reload()

Code Snippet 5-7: Distil specific code observing post response for revalidation or blocking of the client (zwxsutztwbeffxbyzcquv.js)

Network analysis
We used a proxy to monitor all traffic running from client to server. Although we used a proxy that is able to monitor different protocols, it turned out that the most interesting traffic concerned HTTP traffic. For that reason, we only discuss HTTP requests and responses in this paragraph. Network traffic has been analysed by visiting StubHub with the configurations Manual and Selenium + WebDriver, in order to describe the difference between a web-bot driven browser instance and a standalone browser instance. During the observation, StubHub.com has been visited using the Chrome browser for a maximum of three requests. For the sake of clarity, the HTTP requests/responses are narrowed down to the requests related to the source files mentioned in paragraph 5.2.1. Concerning the configuration Selenium + WebDriver, the observation took place after resolving the Captcha, from the same machine as used for gathering the results displayed in Table 5.1.


Diagram 5-1 illustrates the flow of HTTP requests of the first visit regarding both configurations in a sequence diagram between web server (StubHub.com) and browser. Table 5.4 gives a description of the noteworthy requests.

Diagram 5-1: HTTP requests first visit StubHub.com

Request | Description
1 | Requests the main content of StubHub.com. The main content of the Manual configuration points to an external file 'zwxsutztwbeffxbyzcquv.js', in contrast to the configuration Selenium + WebDriver, which points to an external file with the name 'zwxsutztwbeffxbyzcquvxhr.js'. In both cases the main content contains a div element with the id distilIdentificationBlock, as illustrated in Code Snippet 5-7.
2 | Zwxsutztwbeffxbyzcquvxhr.js does not contain the function that is able to detect web bots, in contrast to zwxsutztwbeffxbyzcquv.js. The XHR version applies to configuration Selenium + WebDriver, which therefore lacks the function for detecting automated software and feels for that reason a bit contradictory. We could not find an explanation for this behaviour in the client-side sources, but we suspect that the earlier detection during the manual observation (cf. paragraph 5.2.1) made the web-bot detection function superfluous.
4 | Body of the request contains fingerprint information including the web-bot information from async.js, as described in paragraph 5.2.1. The gathered browser fingerprint information, together with the web-bot indication, is sent (obfuscated) to the server.
5 | Body of the request contains fingerprint information from zwxsutztwbeffxbyzcquv.js, this time without the web-bot information. Also, the body contains a literal ("proof") from a Distil specific piece of code.
6 | Cookie identification.
7 | Body of the response contains the full main page content.
Table 5.4: Description of the HTTP requests of the first visit to StubHub.com


To summarize the findings based on the HTTP requests of the first run: the server host is aware of a lot of details of the client:
- Browser fingerprint
- Web-bot fingerprint
- Cookie settings
The server has all the means to place a web-bot specific deviation on the web page.

Additional general remarks:
- By observing the HTTP traffic, it is likely that the web-bot detection code in zwxsutztwbeffxbyzcquv.js is not being executed on loading of the main page.
- The last GET request towards the root URL does not contain signs of blocking, while the client could already have been marked as suspicious. One reason for this behaviour might be the time between the POST requests and the last GET request in the table. This suspicion is strengthened by the fact that after a click on a link on the page, the page is blocked by a Captcha.

The second run shows clear differences between the requests from both configurations. Table 5.5 and Table 5.6 illustrate the requests and responses of the second page visit with respectively the Manual and the Selenium + WebDriver configuration.

URL | Type | Request remarks | Response remarks
* | GET | Cookie included | Response contains the source of the main page.
*/zwxsutztwbeffxbyzcquv.js | GET | Cookie included |
*/zwxsutztwbeffxbyzcquv.js?PID= | POST | Cookie included. Body of the request contains fingerprint information including the web-bot information from zwxsutztwbeffxbyzcquv.js, as described in paragraph 5.2.1. |
Table 5.5: HTTP requests and responses run 2 Manual

* https://www.stubhub.com/

URL | Type | Request remarks | Response remarks
* | GET | Cookie included | Response contains a body with the Captcha page containing a link to the external script file distil_r_captcha_util.js.
*/distil_r_captcha_util.js | GET | Cookie included | Response body contains Captcha specific code containing an XmlHttpRequest (XHR).
*/zwxsutztwbeffxbyzcquv.js | GET | Cookie included |
*/zwxsutztwbeffxbyzcquv.js?PID= | GET | Cookie included. Body of the request contains fingerprint information including the web-bot information from zwxsutztwbeffxbyzcquv.js, as described in paragraph 5.2.1. Also, the body contains a literal ("proof") from a Distil specific piece of code. |
Table 5.6: HTTP requests and responses run 2 Selenium + WebDriver

* https://www.stubhub.com/

The most remarkable aspect of the second run is that the Selenium driven browser instance is blocked as a consequence of the response content of the first GET request to the root URL. Despite the usage of a cookie in the request, the response still contains a body with the Captcha, even after first clearing the browser cookies. This indicates that the browser instance has been marked as suspicious at the server. The third run does not show any significant differences compared to the second run regarding web-bot detection.

5.3 Validation of the client side web-bot detection
As described in paragraph 2.1.1, web-bot detection can be implemented on different implementation levels and therefore different web-bot detection mechanisms can be in place. To determine whether the web-bot detection mechanism implemented on StubHub.com can be used as a foundation for detecting web-bot detection, we must find out if it is relevant. In other words: is the client side web-bot detection implementation as described in paragraph 5.2 the cause of the deviation experienced on StubHub.com? To prove its relevance and answer the aforementioned question positively, we used a countermeasure to exclude web-bot detection on other implementation levels.

The countermeasure is based on the configuration Selenium + WebDriver and concerns the detection of the Selenium WebDriver. The reason for the focus on the configuration Selenium + WebDriver is that Selenium WebDriver is a very popular and widely used framework for automatic data extraction, in contrast to Selenium IDE, which has in fact entered the 'end of life' software cycle35. The implementation of the countermeasure is based upon a hint mentioned in the Selenium detection discussion about StubHub.com36.

In Code Snippet 5-3 and Code Snippet 5-4 we described that the Selenium WebDriver browser instance can be detected by the literal $cdc_asdjflasutopfhvcZLmcfl_. After research guided by the aforementioned post37, it turned out that this literal originates from ChromeDriver38. ChromeDriver is a piece of software that makes it possible to automatically drive a Chromium / Google Chrome browser. ChromeDriver is part of the Chromium project and lives in the Chromium repository. Chromium is the open source project that Google Chrome is based on. Because it is an open source project, we had access to the code of ChromeDriver. After checking out the Chromium project on our local machine (using the instructions described on the Chromium project site39), we started searching for the specific $cdc literal.

35 https://seleniumhq.wordpress.com/2017/08/09/firefox-55-and-selenium-ide/
36 https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver
37 https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver
38
39 https://chromium.googlesource.com/chromium/src/+/master/docs/linux_build_instructions.md


Eventually we found the literal in the function illustrated in Code Snippet 5-8, residing in the file test/chromedriver/js/call_function.js within the source of the Chromium project.

function getPageCache(opt_doc, opt_w3c) {
    var doc = opt_doc || document;
    var w3c = opt_w3c || false;
    var key = '$cdc_asdjflasutopfhvcZLmcfl_';
    if (w3c) {
        if (!(key in doc)) doc[key] = new CacheWithUUID();
        return doc[key];
    } else {
        if (!(key in doc)) doc[key] = new Cache();
        return doc[key];
    }
}

Code Snippet 5-8: $cdc detection literal in ChromeDriver source

We changed the value of the key to a random string with a length of 30 characters that is generated only once, when the function is invoked for the first time. The usage of a random key ensures that the key can no longer be used for detecting the automatically driven Chrome browser instance. Only its length is detectable, which is not a very strong property to check for. The reason why we generate it only once is that we wanted to avoid breaking the functionality of ChromeDriver when the value of the key is referenced within the browser.
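A minimal sketch of what such a modification could look like is shown below. It is an assumption-laden illustration of the idea, not the exact patch applied to the Chromium sources; in particular, the key-generation details are our own, and Cache and CacheWithUUID come from the surrounding ChromeDriver file shown in Code Snippet 5-8.

// Sketch: replace the fixed '$cdc_...' key with a random 30-character key
// that is generated once and reused for all later invocations.
var randomKey = null;

function getPageCache(opt_doc, opt_w3c) {
  var doc = opt_doc || document;
  var w3c = opt_w3c || false;
  if (!randomKey) {
    // Generate the key only once, so later look-ups keep finding the cache.
    var chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    randomKey = '';
    for (var i = 0; i < 30; i++) {
      randomKey += chars.charAt(Math.floor(Math.random() * chars.length));
    }
  }
  var key = randomKey;
  if (w3c) {
    if (!(key in doc)) doc[key] = new CacheWithUUID();
    return doc[key];
  } else {
    if (!(key in doc)) doc[key] = new Cache();
    return doc[key];
  }
}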

After making the modifications in the source file, we compiled ChromeDriver with the aforementioned changes. Subsequently we used this custom ChromeDriver in the configuration Selenium + WebDriver and tested the web-bot detection methods described in Code Snippet 5-3 and Code Snippet 5-4 in the browser console. It turned out that the bot detection functions in respectively async.js and zwxsutztwbeffxbyzcquv.js were not able to detect the $cdc literal. We also visited StubHub.com 30 times without being blocked, which validated that StubHub.com indeed makes use of client side web-bot detection based on specific web-bot properties.

5.4 Summary
This chapter described that web-bot detection takes place client side on StubHub.com. By observing the behaviour of the web site, we concluded that the web-bot detection is machine and browser specific and independent of network properties or caching. By analysing the inner workings we found out that browser fingerprinting takes place together with web-bot detection. Both are based on browser properties, and the gathered information is sent (obfuscated) to the server host, triggered by user gestures and time intervals. The code responsible for the web-bot detection itself is also obfuscated and focuses on browser properties that are web-bot specific (the web-bot fingerprint).

We can conclude that the observed web-bot detection implementation can be used as a foundation for detecting web-bot detection, because it takes place client side and uses specific browser properties, which makes web-bot detection observable.


6 Browser family fingerprint classification

Chapter 5 described a specific web-bot detection implementation based on browser properties that can be used as a foundation for the detection of web-bot detection. The described implementation uses specific web-bot browser properties for the identification of a web bot.

In the light of the proposed study we want to detect as many web-bot detection implementations as possible. This implies that we needed to identify all possible web-bot specific browser properties that can be used to detect web-bot detection implementations on the internet.

We achieved this goal by answering the following question: What are the specific web-bot browser properties of a web-bot driven browser instance?

To answer this question, we need to find browser bot identification properties that can be used to distinguish a web bot from a standalone browser. By comparing the properties of a standalone browser with those of a web-bot driven browser we can find these browser bot identification properties.

Different browser types, for instance Firefox and Google Chrome, have different browser properties. The same applies to web-bot driven browser instances that are based upon a specific browser type. To avoid comparing apples to oranges, we first classified browser families, as described in paragraph 6.1. Paragraph 6.2 introduces web-bot frameworks and describes their relation to browser families. This chapter concludes with paragraph 6.3, which gives an overview of the browser coverage used in this study in relation to the browser family classification.

6.1 Browser families
There are a lot of different browser types that are being used to extract data from the content of a web page. The intention of the browser family classification described in this chapter is not to give a complete overview of all browsers and browser engines used for data extraction; instead, it focuses on the most used browsers40 that can also be used as a web-bot driven browser instance.

The browser family classification is based upon commonalities grounded in the inner workings of a browser, namely its layout engine41 (or web browser engine) and JavaScript engine42. The layout engine is responsible for rendering web pages, whereas the JavaScript engine is responsible for the execution of JavaScript code within a web page. In the context of this study we refer to the combination of the used layout and JavaScript engine as the browser family fingerprint, which can uniquely identify a browser family.

Table 6.1 and Table 6.2 describe, respectively, layout and JavaScript engines in relation to popular web browsers.

Layout engine | Browser | Browser version | Development | Remarks
WebKit | Google Chrome | <= 26 and mobile version for the iOS system | Apple, Adobe Systems, Google, KDE and others |
WebKit | Safari | | Apple |
WebKit | Firefox for iOS | | |
Blink | Google Chrome | >= 27 | Google | Fork of WebKit's WebCore component
Blink | Chromium | | The Chromium Project | Open source project started by Google. Chromium shares the majority of code and features with Google Chrome.
Blink | Opera | >= 15 | Opera Software |
Gecko | Firefox | | Mozilla foundation |
Trident | Internet Explorer | | Microsoft |
EdgeHTML | Edge | | Microsoft | Fork of Trident
 | Opera | <= 14 | Opera Software |
Table 6.1: Layout engines in relation to their usage in web browsers*.

40 https://www.w3counter.com/globalstats.php
41 https://en.wikipedia.org/wiki/Web_browser_engine
42 https://en.wikipedia.org/wiki/JavaScript_engine

* Source: Wikipedia (for a more extensive list, see the list of web browsers in combination with their layout engine usage on Wikipedia43).

JavaScript engine | Browser | Development | Remarks
JavaScriptCore | Safari | Apple | Component of WebKit
V8 | Google Chrome, Chromium, Opera | The Chromium Project |
SpiderMonkey | Firefox | Mozilla foundation |
JScript | Internet Explorer | Microsoft |
Chakra | Edge | Microsoft | Chakra is a fork of JScript
Table 6.2: JavaScript engines in relation to their usage in web browsers*.

* Source: Wikipedia44

The layout and JavaScript engines described in Table 6.1 and Table 6.2 illustrate that browser categories can be distinguished based upon their inner workings. For example, if a web site is able to identify the Trident layout engine, then the chance that it is dealing with the Internet Explorer browser is high. In relation to web-bot detection those characteristics are important for the identification of particular web bots, as described in paragraph 6.2.

6.2 Browser based web bots
This paragraph describes frameworks that can be employed as web-bot driven browser instances to extract data from a web page. The purpose of this paragraph is to give an introduction to popular web-bot frameworks and (if applicable) to describe the differences with respect to a standalone browser.

43 https://en.wikipedia.org/wiki/List_of_web_browsers 44 https://en.wikipedia.org/wiki/JavaScript_engine


The content of this paragraph is limited to the most used frameworks in studies, which are: PhantomJS45, CasperJS46, SlimerJS47, Selenium WebDriver48, Headless Chrome49 and NightmareJS50.

PhantomJS
PhantomJS is a headless browser (cf. appendix A) used for automating web page interaction and is based on WebKit. PhantomJS provides a JavaScript API enabling automated navigation, screenshots, user behaviour and assertions. PhantomJS is a scriptable browser, which means that the browser can be accessed programmatically. PhantomJS is not a full-fledged browser with hundreds of engineers working on it; PhantomJS has its limitations, as described in the work of Englehardt et al. [EN16].

CasperJS
CasperJS is a suite of libraries on top of PhantomJS that extends its capabilities.

SlimerJS
SlimerJS is similar to PhantomJS, except that it runs on top of Gecko instead of WebKit. SlimerJS is not truly headless until version 56 of Firefox, because before that Gecko cannot render web content without a graphical window.

Selenium WebDriver
Selenium WebDriver is a prominent component of Selenium51, which is a portable software-testing framework for web applications. Selenium WebDriver drives a browser by accepting commands and sending them to a browser. This technique is implemented through a browser-specific driver, which sends commands to a browser and retrieves results. The definition of WebDriver as given by the World Wide Web Consortium (W3C)52:

“WebDriver is a remote control interface that enables introspection and control of user agents. It provides a platform- and language-neutral wire protocol as a way for out-of-process programs to remotely instruct the behavior of web browsers.”

There are different WebDriver implementations available for different browsers, like Chrome (Google Chrome / Chromium), Firefox, Internet Explorer, Edge, PhantomJS, Safari and the like.

By using a WebDriver implementation, Selenium WebDriver is able to drive a full-fledged browser.
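As an illustration of this scriptable approach, the sketch below drives Google Chrome through the Node bindings of Selenium WebDriver; the target URL is an arbitrary example and not one of the sites studied.

// Minimal sketch using the selenium-webdriver Node bindings and ChromeDriver.
const { Builder } = require('selenium-webdriver');

(async () => {
  // Builds a Google Chrome instance that is driven through ChromeDriver.
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://example.com'); // example URL
    console.log(await driver.getTitle());
  } finally {
    await driver.quit();
  }
})();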

Headless Chrome
Chrome / Chromium can programmatically be run and driven in a headless environment by making use of the chrome-remote-interface53 (or the Puppeteer API54). The chrome-remote-interface is a Node library to control headless (or full) Google Chrome or Chromium. By making use of the chrome-remote-interface the situation is similar to PhantomJS, although it uses a full-fledged browser.
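For comparison, a minimal Puppeteer-based sketch is shown below; it launches a headless Chromium instance and extracts the rendered page content. The URL is again an arbitrary example.

// Minimal sketch using the Puppeteer API to drive headless Chromium.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com'); // example URL
  const html = await page.content();      // rendered page source
  console.log(html.length);
  await browser.close();
})();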

NightmareJS
NightmareJS is a high-level browser automation library running on Electron55. Electron is a framework for creating native applications based on web technologies.

45 http://phantomjs.org/ 46 http://casperjs.org/ 47 https://slimerjs.org/ 48 http://www.seleniumhq.org/projects/webdriver/ 49 https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md 50 http://www.nightmarejs.org/ 51 https://en.wikipedia.org/wiki/Selenium_(software) 52 https://www.w3.org/TR/webdriver/ 53 https://github.com/cyrus-and/chrome-remote-interface 54 https://developers.google.com/web/tools/puppeteer/ 55 https://electronjs.org/


NightmareJS exposes simple methods that mimic user actions by making use of an API that drives the Chromium web browser on the front-end. By making use of the provided API, scripts are able to drive Chromium programmatically.
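A minimal sketch of this chained, promise-style API is shown below; the URL is an arbitrary example.

// Minimal sketch using NightmareJS (Electron based) to mimic a user visit.
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
  .goto('https://example.com')              // example URL
  .evaluate(() => document.title)           // runs inside the page context
  .end()
  .then(title => console.log(title))
  .catch(err => console.error(err));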

6.3 Browser family classification
This paragraph describes the browser family classification, based upon the browser family fingerprint, that is being used to identify specific web-bot browser properties as will be described in chapter 7.

Given the restricted time of the study, we were not able to take all of the browsers described in Table 6.1 and Table 6.2 into consideration. Due to their high coverage in use and the available support for web-bot driven instances, we chose the following browsers: Google Chrome, Chromium, Firefox, Internet Explorer and Edge, where the version number of Google Chrome and Chromium is higher than 26.

Table 6.3 describes the browser family classification, based upon the browser family fingerprint, of the browsers that are covered in this study.

Browser family | Browser family fingerprint (layout engine / JavaScript engine) | Standalone browser | Browser based web bot
Blink + V8 | Blink / V8 | Google Chrome / Chromium | Selenium + WebDriver, Headless Chrome, PhantomJS* (CasperJS, SlimerJS), NightmareJS
Gecko + SpiderMonkey | Gecko / SpiderMonkey | Firefox | Selenium + WebDriver
Trident + JScript | Trident / JScript | Internet Explorer | Selenium + WebDriver
EdgeHTML + Chakra | EdgeHTML / Chakra | Edge | Selenium + WebDriver
Table 6.3: Browser study coverage and their layout and JavaScript engine commonalities

* PhantomJS is based on WebKit, which shares the WebCore component with Blink based layout engines.

Each row in Table 6.3 represents a browser family, based upon the browser family fingerprint deduced from the combination of layout and JavaScript engine described in the second column. The columns 'Standalone browser' and 'Browser based web bot' respectively contain the browsers that can be driven manually and the browsers that can be driven by a web bot belonging to the applicable browser family.

Each browser family will be used to compare the browser properties of a standalone and a web-bot driven browser instance with each other, in order to gather specific web-bot browser properties (cf. chapter 7).


7 Determining the web-bot fingerprint surface

This chapter describes how we determined the web-bot fingerprint surface by making use of the browser family classification described in paragraph 6.3. The web-bot fingerprint surface consists of specific browser properties that can be used to detect a web bot.

Paragraph 7.1 describes the relevant browser properties that will be used to identify web-bot specific browser properties. Paragraph 7.2 describes the approach by which we obtained the web-bot specific browser properties described in paragraph 7.3. Paragraph 7.4 analyses the obtained web-bot specific browser properties and classifies only the relevant properties, which leads to the web-bot fingerprint surface.

7.1 Relevant browser properties
In this study we solely focus on client side web-bot detection based on the properties exposed by a browser, for the following reasons:

- Web-bot properties can be used to prove the existence of web-bot detection because they are detectable client side, which makes them more reliable than network traffic analysis or server log analysis.
- It is more likely that server hosts employ web-bot detection mechanisms based on web-bot properties client side rather than server side.
- Server hosts can only determine whether the client is a web bot based on the information that comes from the client. The client is also a black box for the server, which means that relevant information regarding web-bot detection must come from the client. This implies that web-bot detection takes place on the client, likely based on the properties of a web bot.
- State of the art automation software is very well capable of mimicking the behaviour of a human being, which makes the detection of a web bot purely based on network activity nearly impossible.
- Code that runs on the server is a black box and can therefore not be used to observe web-bot detection.

This implies that communication characteristics (communication via the HTTP protocol between browser and web server) will not be taken into consideration, as server hosts cannot measure them on the client.

The intention is to identify browser properties that are likely being used for web-bot detection by web sites. Chapter 5 described a specific web-bot detection implementation based on specific web-bot browser properties. Unfortunately, we do not have knowledge of all specific web-bot properties exposed by a web browser, so we need to observe a wide range of browser properties. Properties that fall under the following classification will be examined for web-bot detection relevance by the approach described in paragraph 7.2.

• Browser
Properties in this category reveal browser characteristics such as installed plugins, user agent, language, cookie settings, used fonts, details about the layout and JavaScript engine and the like.

• Screen
Screen properties describe characteristics about the visualization of a rendered web page, like screen width, height and colour depth.


Remark that the screen width and height (its resolution) can easily be configured by a web bot. For that reason we do not take the value of those characteristics into consideration on their own, but compare those values in relation to other properties like the maximum allowed width and height.

• Context
Context properties tell something about the context of the running browser instance. User profiles, user data and attributes related to the control of the browser (web-bot versus human driven) belong to this category.

7.2 Approach: obtaining deviating browser properties
This paragraph describes the approach for obtaining web-bot specific browser properties by making use of the browser family classification described in paragraph 6.3.

Web-bot browser properties are obtained by comparing the properties of a web-bot driven browser instance with a standalone browser of the same browser family. Table 7.1 describes the compare configurations based on browser families that have been used to compare the browser properties of a web-bot driven and standalone browser instance.

Browser family | Standalone browser | Compared web-bot driven browser instances
Blink + V8 | Google Chrome / Chromium | PhantomJS, Selenium WebDriver (Google Chrome), Selenium WebDriver (Chromium), NightmareJS, Headless Chrome
Gecko + SpiderMonkey | Firefox | Selenium WebDriver (Firefox)
Trident + JScript | Internet Explorer | Selenium WebDriver (Internet Explorer)
EdgeHTML + Chakra | Edge | Selenium WebDriver (Edge)
Table 7.1: Compare configurations between web-bot and standalone browser instances

The comparisons described in Table 7.1 are based on the inner workings of the browser (cf. chapter 5). This means, for instance, that all browsers based upon the Blink layout engine and the V8 JavaScript engine are compared against each other (cf. Table 7.1). By comparing web-bot and standalone browser instances, for instance PhantomJS and a user driven Chrome browser, web-bot specific fingerprinting properties can be revealed. The same applies to comparing the different web-bot driven browsers by Selenium. Similarities between the fingerprint attributes of the web-bot driven browsers filter out spurious attributes and strengthen the true positives.
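A minimal sketch of such a per-family comparison is shown below: it computes which property names are present in a web-bot driven browser instance but absent in the standalone baseline of the same family, and vice versa. The two property objects are illustrative stand-ins for the gathered data.

// Sketch: delta between the property names of a standalone browser and a
// web-bot driven browser of the same family. The example objects stand in
// for the properties gathered from the real browser instances.
const standaloneProperties = { userAgent: '…', plugins: '…', languages: '…' };
const webBotProperties = { userAgent: '…', plugins: '…', languages: '…', webdriver: true };

function delta(standalone, webBot) {
  const standaloneKeys = new Set(Object.keys(standalone));
  const webBotKeys = new Set(Object.keys(webBot));
  return {
    // Candidate web-bot fingerprint attributes: only present in the web bot.
    onlyInWebBot: [...webBotKeys].filter(k => !standaloneKeys.has(k)),
    // Attributes the web bot lacks compared to the standalone browser.
    missingInWebBot: [...standaloneKeys].filter(k => !webBotKeys.has(k)),
  };
}

console.log(delta(standaloneProperties, webBotProperties));
// { onlyInWebBot: [ 'webdriver' ], missingInWebBot: [] }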

For each comparison one particular standalone browser per browser family has been taken as the baseline. We have to remark that we have chosen Google Chrome over Chromium for the Blink / V8 browser family for the following two reasons:
1. Google Chrome contains a richer set of browser attributes, as Chromium serves as a base for Google Chrome.
2. The differences in browser attributes between the two browsers are negligible, as illustrated in Table 7.2. Note: in the case of Chromium the Flash player could not be installed.


User Agent
- Google Chrome: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
- Chromium: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36

Plugins
- Google Chrome: Chrome PDF Plugin::Portable Document Format::application/x-google-chrome-pdf~pdf; Chrome PDF Viewer::::application/pdf~pdf; Native Client::::application/x-nacl~application/x-pnacl~; Shockwave Flash::Shockwave Flash 29.0 r0::application/x-shockwave-flash~swf application/futuresplash~spl; Widevine Content Decryption Module::Enables Widevine licenses for playback of HTML audio/video content. (version: 1.4.9.1070)::application/x-ppapi-widevine-cdm~
- Chromium: Chromium PDF Plugin::Portable Document Format::application/x-google-chrome-pdf~pdf; Chromium PDF Viewer::::application/pdf~pdf; Widevine Content Decryption Module::Enables Widevine licenses for playback of HTML audio/video content. (version: undefined)::application/x-ppapi-widevine-cdm~

Table 7.2: Differences in fingerprinting attributes: Google Chrome compared to Chromium

Browser properties
The browser properties themselves are collected by making use of active fingerprinting, as described in the work of Torres et al. [TJM14]. Active fingerprinting involves gathering the browser properties of a specific browser instance by making use of client-side scripting. Typically, JavaScript is used to obtain browser properties related to the employment of browsers by targeting the following global DOM objects exposed by a browser:

• window56
The window object represents an open window in a browser. A window object can contain other nested objects. The window object contains attributes that fall into the screen classification mentioned in paragraph 7.1.
• navigator57
The navigator object contains information about the browser.
• document58
The document object represents an HTML document, which can contain other nested documents. A web page is represented by at least one document object. By accessing the document object, attributes regarding a rendered page can be obtained.
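A minimal sketch of such client-side property gathering is shown below; the selection of properties is an illustrative example rather than the exact set collected in this study.

// Sketch: active fingerprinting by reading properties of the global DOM
// objects window, navigator and document from within a web page.
function collectBrowserProperties() {
  return {
    // Names of the root properties of the window and document objects.
    windowKeys: Object.getOwnPropertyNames(window),
    documentKeys: Object.getOwnPropertyNames(document),
    // A few navigator attributes that reveal browser characteristics.
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    plugins: Array.from(navigator.plugins).map(p => p.name),
    // Screen related attributes exposed through the window object.
    screen: { width: window.screen.width, height: window.screen.height, colorDepth: window.screen.colorDepth },
  };
}

console.log(collectBrowserProperties());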

We used the fingerprinting library Fingerprintjs259 for obtaining browser fingerprinting properties, extended with the following custom browser properties:

• Names of the root properties of the window and document object. Code Snippet 5-3 and Code Snippet 5-4 illustrated that additional properties on the global window and document browser objects can reveal the employment of a web bot.

56 https://www.w3schools.com/jsref/obj_window.asp 57 https://www.w3schools.com/jsref/obj_navigator.asp 58 https://www.w3schools.com/jsref/dom_obj_document.asp 59 https://github.com/Valve/fingerprintjs2


• The attributes of the navigator object. The global navigator browser object can contain information about the driver of the browser (cf. Code Snippet 5-3). Not only the existence of a particular property is important, but also its content.
• Lack of the bind JavaScript engine feature. Particular web bots make use of older browser JavaScript engines. By checking for the bind JavaScript feature, older web bots (for instance PhantomJS 1.x and older60) can be detected.
• Stack traces. Stack traces thrown by the JavaScript engine can show signs of the employment of a web bot.
• Missing image. In the case of headless browsers (often used in the employment of a web bot) an image will be loaded with a width and height equal to zero61.
• Web security (the ability to execute cross-site scripting). Web bots are notorious for executing cross-site scripting, which requires the web security browser attribute to be set to false.
• Suppression of popups62. Especially headless browsers have a built-in mechanism to automatically suppress popups (JavaScript alerts). By detecting this behaviour, a web bot can be distinguished from a standalone browser.
A minimal sketch of how some of these custom properties can be collected client side is given below.
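The following JavaScript sketch illustrates how a few of these custom properties could be gathered in the extraction page; the function name and the shape of the result object are illustrative assumptions and do not correspond one-to-one to the Fingerprintjs2 extension used in this study.

// Minimal sketch: collect a few web-bot revealing browser properties client side.
// The helper name and result structure are illustrative assumptions.
function collectCustomProperties() {
  var result = {};

  // Root property names of the global window and document objects
  // (web bots may add keys such as `_phantom` or `$cdc_...`).
  result.windowKeys = Object.getOwnPropertyNames(window);
  result.documentKeys = Object.getOwnPropertyNames(document);

  // All navigator attributes, including their (stringified) values,
  // e.g. `navigator.webdriver` for Selenium driven browsers.
  result.navigatorAttributes = {};
  for (var key in navigator) {
    try {
      result.navigatorAttributes[key] = String(navigator[key]);
    } catch (e) {
      result.navigatorAttributes[key] = 'unreadable';
    }
  }

  // Lack of the bind feature points to an older JavaScript engine,
  // as shipped with PhantomJS 1.x and older.
  result.hasBind = typeof Function.prototype.bind === 'function';

  // Stack traces can carry engine or web-bot specific frames.
  try {
    null[0];
  } catch (err) {
    result.stackTrace = err.stack;
  }

  return result;
}

In the actual extraction page these values are serialized and sent to the server together with the Fingerprintjs2 output (cf. Diagram 7-1).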

Configuration
We used a web page63, inspired by the work of Eckersley [ECK10], to extract and persist the browser properties of web-bot driven and standalone browser instances. Eckersley used a web site to investigate the unique identification of a browser based upon its fingerprint. Determining the web-bot fingerprint surface is strongly related, because it also uses a web page that targets the properties of a web browser. A dedicated web site to extract and persist browser properties offered us two advantages:

1. We were able to mimic the behaviour of a server host applying active fingerprinting, without involving an actual server host.
2. There was no risk that the properties of the browser had been tampered with.

We visited the test website by making use of 11 different browser configurations as described in Table 7.3.

Configuration                Version          OS
PhantomJS                    2.1.0            Ubuntu 16.04
Google Chrome                64.0.3282.140    Ubuntu 16.04
Chromium                     64.0.3282.140    Ubuntu 16.04
Selenium + Google Chrome     4.0.0            Ubuntu 16.04
Selenium + Chromium          4.0.0            Ubuntu 16.04
Selenium + Mozilla Firefox   4.0.0            Ubuntu 16.04
Mozilla Firefox              58.0.2           Ubuntu 16.04
Edge                         41.16.299.15.0   Windows 10

60 https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/ 61 https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html 62 https://www.slideshare.net/SergeyShekyan/shekyan-zhang-owasp 63 https://github.com/GabryVlot/BrowserBasedBotFP


Internet Explorer                      11.0.50           Windows 10
Nightmare                              2.10              Ubuntu 16.04
Automated Google Chrome and Chromium   Puppeteer 1.1.0   Ubuntu 16.04

Table 7.3: Browser configurations used for visiting the browser property extraction web site.

Obtaining browser bot identification properties
After the web page has loaded, the properties of the browser are captured by means of active fingerprinting and sent back to the server, where they are persisted. An overview of all the involved activities is given by the sequence diagram in Diagram 7-1.

Diagram 7-1: Persisting of browser properties

After persisting all relevant browser properties mentioned in section Browser properties, we created a delta by making use of the comparison configurations described in Table 7.1. For creating the delta, we used a script that loops through the properties of each browser instance; the resulting web-bot specific browser properties are described in paragraph 7.3.
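The delta script itself is part of the code contributions (cf. appendix D.2); the following Node.js sketch only illustrates the idea of comparing two persisted property sets. The file names, the flat JSON structure and the simple notion of 'deviation' are assumptions for illustration.

// Node.js sketch: compare the persisted properties of a standalone and a
// web-bot driven browser instance. File names and structure are illustrative.
const fs = require('fs');

const standalone = JSON.parse(fs.readFileSync('google-chrome.json', 'utf8'));
const webBot = JSON.parse(fs.readFileSync('selenium-chrome-headless.json', 'utf8'));

const delta = {};
const names = new Set([...Object.keys(standalone), ...Object.keys(webBot)]);

for (const name of names) {
  if (!(name in webBot)) {
    // Property could not be determined for the web-bot instance ("x" in the tables).
    delta[name] = { status: 'not determinable' };
  } else if (JSON.stringify(standalone[name]) !== JSON.stringify(webBot[name])) {
    // Property value deviates from the standalone baseline.
    delta[name] = { status: 'deviation', standalone: standalone[name], webBot: webBot[name] };
  }
}

console.log(JSON.stringify(delta, null, 2));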

7.2.1 Design improvements
With respect to the browser property extraction web page, we strived for a modular and re-usable application by abstracting the gathering of the browser properties away from the application (the web page). The gathering of the browser properties resides in a separate module containing a dedicated function for each type of browser property (cf. section Browser properties). Unfortunately, there is still a tightly coupled construction between the main application and the gathering of the fingerprints, embodied by the main application invoking the methods of the browser property module. This leaves room for improvement, which due to time limitations could not be carried out. Concerning the extendibility towards (possible) future relevant browser properties, the application would benefit from a plugin architecture, in which a specific browser property extractor serves as a plugin conforming to a specified interface that can be loaded dynamically by the main application. Such an architecture makes the application easily extendable, maintainable and re-usable by enforcing a loosely coupled implementation between property extraction and the main application.


7.2.2 Limitation
The approach for obtaining web-bot specific browser properties described in this paragraph is based upon specific data extraction software versions (cf. section Configuration). Using specific software versions is a limitation, as we might not be able to detect web-bot browser properties of older or newer software versions. Consider a hypothetical situation where Selenium version 1 exposes a Selenium-specific property A and Selenium version 3 exposes a Selenium-specific property C. If we compare a standalone browser with a web-bot driven browser instance based on Selenium version 2, we will not be able to detect the web-bot specific properties A and C. Although we did not take different versions of data extraction software into account, it is good to realize that there might be more web-bot specific browser properties. Using a wider range of data extraction software versions would be a valuable first step for future work.

7.3 Deviating browser properties
This paragraph describes the results of the comparison between the browser properties of standalone and web-bot driven browsers, obtained using the approach described in paragraph 7.2. We describe the results of each comparison per browser family in a separate table, as listed in Table 7.4.

Browser family         Table
Blink + V8             Table 7.5
Gecko + SpiderMonkey   Table 7.7
Trident + JScript      Table 7.6
EdgeHTML + Chakra      Table 7.8

Table 7.4: Overview of fingerprinting results per browser family category.

Each of the tables listed in Table 7.4 lists the browser properties that deviate after comparing the web-bot driven browser instances with the standalone browser of the corresponding browser family. We excluded properties that cannot be detected client side (for instance HTTP headers). Properties whose value differs from that of the standalone browser are reported as deviations; properties that could not be determined for a particular web-bot instance are indicated as such. Details about the specific differences are described in appendix C.


Standalone browser: Google Chrome. Web-bot driven browser instances compared: PhantomJS, Selenium + Google Chrome (headless and headful), Selenium + Chromium (headless and headful), NightmareJS (headless/headful), automated Google Chrome (headless and headful) and automated Chromium (headless and headful).

Deviating browser properties: User Agent*, Color depth, Hardware concurrency, Resolution, Available resolution, Canvas, WebGL (vendor), Has lied languages, Has lied OS, Touch support, Fonts, Flash fonts, Plugins*, Window keys, Stack trace, Web security, Popup suppression, Document keys, Navigator attributes, MimeTypes, Languages and WebSocket. Several of these (among others Hardware concurrency, WebGL (vendor), Fonts, Flash fonts, Plugins, MimeTypes and Languages) could not be determined for one or more of the web-bot instances; the per-instance details are given in appendix C.1.

Table 7.5: Browser property deviations: Blink + V8

* Differences in Chromium-specific browser properties as described in Table 7.2 are not taken into consideration.

Table 7.6: Browser property deviations: Trident + JScript
Standalone browser: Internet Explorer; web-bot driven instance: Selenium + Internet Explorer.
Deviating browser properties: CPU class*, Navigator platform*, Window keys and Document keys.

Table 7.7: Browser property deviations: Gecko + SpiderMonkey
Standalone browser: Mozilla Firefox; web-bot driven instance: Selenium + Firefox (headless and headful).
Deviating browser properties: User Agent, Resolution, Available resolution, WebGL (vendor), Plugins and MimeTypes. WebGL (vendor), Plugins and MimeTypes could not be determined for one or both of the Selenium + Firefox instances (cf. appendix C.2).

Table 7.8: Browser property deviations: EdgeHTML + Chakra
Standalone browser: Edge; web-bot driven instance: Selenium + Edge.
Deviating browser properties: Popup suppression and Navigator attributes.

* This deviation is a consequence of the different CPU architecture of the web driver (cf. appendix C).

48

7. Determining the web-bot fingerprint surface

The results in Tables 7.5 – 7.8 show that the deviating browser properties are very diverse. A major reason for this diversity is the comparison of a headless browser against a standalone headful browser. A headless browser does not always contain browser properties related to the user interface (e.g. plugins), simply because it does not need them. On top of that, most web-bot driven browsers contain deviating browser properties in the following two categories:

1. Web-bot fingerprint. Browser properties that are strongly related to, and only exist when, a browser instance is driven by a web bot. The $cdc_asdjflasutopfhvcZLmcfl_ property on the global DOM document object is an example of a browser property in this category.
2. Browser fingerprint. Browser properties that can be used to uniquely identify a browser instance. The plugins property of the global DOM navigator object is an example: it can be used to uniquely identify a browser, but also to identify a web-bot driven browser instance (cf. Table 7.5 and Table 7.7).

In general, we can conclude that every web-bot browser instance contains browser properties that differ from those of the standalone browsers. The most remarkable deviating browser properties are described below; appendix C gives a detailed overview of the deviating browser properties.

Deviating browsers in comparison to the used standalone browsers
The browser properties of the PhantomJS browser instance deviate the most from the standalone Google Chrome browser properties. We must emphasize that PhantomJS is a completely different browser than Google Chrome, although both browsers share the same layout and JavaScript engine; for instance, PhantomJS cannot be driven manually. Similar concerns hold for the NightmareJS browser instance: although the NightmareJS browser can be driven manually, it is a completely different browser than Google Chrome. For the deviating browser properties of these web-bot browser instances this means that not all of their deviating properties should be taken into consideration for the detection of web bots. We only took web-bot specific properties into account, such as the userAgent and the keys of the global DOM window object. The userAgent reveals the origin of both web-bot browser instances through the literal "PhantomJS" for PhantomJS and "Electron" for NightmareJS. The window keys contain specific properties such as '_phantom' and '__nightmare' for PhantomJS and NightmareJS respectively. On top of that, fundamental browser properties that could not be determined (such as the plugins property) can also be used to distinguish PhantomJS or NightmareJS from a standalone browser.

Headless browsers
For all headless browsers the supported MimeTypes and installed plugins could not be determined, which makes a headless browser detectable based upon those properties. On top of that, the Google Chrome based headless browser instances can be identified by the 'HeadlessChrome' literal in the userAgent property.
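A client-side check along these lines, as a site operator might implement it, could look as follows; the individual tests follow the observations above, while the function name and the way the indications are combined are illustrative.

// Sketch of a client-side headless check. The combination of tests is illustrative.
function headlessIndications() {
  var reasons = [];

  if (/HeadlessChrome/.test(navigator.userAgent)) {
    reasons.push('HeadlessChrome literal in the user agent');
  }
  if (navigator.plugins.length === 0) {
    reasons.push('no installed plugins');
  }
  if (navigator.mimeTypes.length === 0) {
    reasons.push('no supported MIME types');
  }
  return reasons;
}

var indications = headlessIndications();
if (indications.length > 0) {
  console.log('Possible headless web bot: ' + indications.join(', '));
}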

Selenium driven and automated Chromium browsers
The Selenium driven and automated Chromium browsers can be detected by properties added to the global DOM objects, as illustrated in Table 7.9.


Global DOM object   Added key(s)
navigator           webdriver
document            $cdc_asdjflasutopfhvcZLmcfl_, __webdriver_script_fn

Table 7.9: Detectable automated browser properties
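The following JavaScript sketch shows the kind of check a web-bot detector could run against these added keys; it is not code taken from an observed detection script.

// Sketch: detect a Selenium driven or automated Chrome/Chromium browser via the
// added DOM keys of Table 7.9.
function detectAutomationKeys() {
  // navigator.webdriver is exposed by web-bot driven browsers.
  if (navigator.webdriver) {
    return true;
  }
  // ChromeDriver adds document keys starting with "$cdc_" as well as
  // "__webdriver_script_fn".
  return Object.getOwnPropertyNames(document).some(function (key) {
    return key.indexOf('$cdc_') === 0 || key === '__webdriver_script_fn';
  });
}

if (detectAutomationKeys()) {
  console.log('Selenium driven or automated browser detected');
}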

Stealth web-bot driven browser instances
The headful web-bot driven browser instances have the fewest deviating browser properties. This makes sense, because headful browser instances are also used by human beings browsing the internet. The automated Google Chrome, together with the Selenium driven Firefox and Edge browser instances, are the stealthiest browser instances observed, but they are still detectable based on global DOM object keys and/or installed plugins.

Figure 7-1 and Figure 7-2 illustrate the distribution of the browser property deviations for respectively headless and headful web-bot browser instances.


Figure 7-1: Distribution of browser property deviations over headless web-bot driven browsers


Figure 7-2: Distribution of browser property deviations over headful web-bot driven browsers


Based upon the number of deviating browser properties we can indicate how likely it is for a web-bot browser instance to get detected by a web-bot detection implementation as described in Table 7.10.

Number of deviating browser properties   Risk of getting detected
0 – 3                                     Low
4 – 8                                     Medium
> 8                                       High

Table 7.10: Risk classification of getting detected by the number of deviations.

7.4 Web-bot fingerprint surface
This paragraph describes the transformation from deviating web-bot browser properties into a web-bot fingerprint surface that only contains the browser properties that can be used for the detection of web-bot detection.

We want to use the deviating web-bot browser properties described in paragraph 7.3 as parameters for detecting web-bot detection on the internet (cf. chapter 8). Unfortunately, not every deviating web-bot browser property is equally suitable for detecting web-bot detection. To be able to detect web-bot detection based upon these browser properties, we need to filter out unusable browser properties and evaluate the relevance of the remaining ones.

As described in paragraph 7.3, the deviating browser properties can be classified into two categories: browser fingerprinting properties and web-bot specific fingerprinting properties, as illustrated in Figure 7-3.

Figure 7-3: Deviating browser properties comprise browser fingerprinting properties and web-bot fingerprinting properties

In general, web-bot specific properties are stronger in the context of web-bot detection than browser fingerprinting properties: a single web-bot property is specific enough to detect a web bot, in contrast to the browser fingerprinting properties. A browser property such as the installed plugins can be used to detect a web bot, but it can also raise a false positive, as plugins might simply not be installed in a standalone browser. For that reason, we decided that browser fingerprinting properties can only be used to detect a web bot in combination with other browser fingerprinting properties (or web-bot properties). We embodied this construction by assigning scores to deviating browser properties and using a threshold on the score to determine whether a website implements web-bot detection (cf. paragraph 8.1.2). Table 7.11 describes the scores and relevance of the observed deviating browser properties (scores are relative to an empirically found threshold of 12 (cf. chapter 8)).


• User Agent (suitable for detection: Yes; score: 12; category: 1, 2). The values PhantomJS, HeadlessChrome and Electron of the userAgent browser property can be used to detect a web-bot driven browser instance. The WOW64 literal could also indicate a web bot emulating a Windows OS.
• Color depth (suitable: Yes; score: 0.1; category: 2). Only the PhantomJS instance has a colour depth that differs from all other instances. We believe the PhantomJS instance could be detected based upon this setting, although we cannot argue that it is web-bot specific, which is reflected in the score.
• Hardware concurrency (suitable: Yes; score: 8; category: 2). Only for PhantomJS does the hardware concurrency attribute differ from its standalone counterpart, embodied by the absence of the property. We believe the absence of this attribute is a strong indication of a headless web-bot browser instance.
• (Available) resolution (suitable: No; category: 2). The Selenium driven and the automated Google Chrome and Chromium headless browser instances contain different resolution attribute values. The resolution of the web browser is just a configuration setting, which we set in our case to 800 by 600; we could have set it to, for instance, 1920 by 975, which would not make this browser property deviate from a standalone browser instance. This makes the property unsuitable for the detection of web-bot detection.
• Canvas (suitable: Yes; score: 1.2; category: 2). The results show that the canvas property is very specific to a browser instance; especially the canvas attribute can be used to detect headless browser instances. Unfortunately, the properties of the global DOM canvas object can be system / hardware specific, which makes it hard to determine whether a value originates from a web-bot driven instance. We believe this property can be used to detect web-bot detection, although it is not strong enough to detect a web bot on its own.
• WebGL (vendor) (suitable: Yes; score: 8; category: 2). Missing WebGL information indicates the presence of a headless web-bot driven browser instance.
• Has lied languages (suitable: Yes; score: 1; category: 2). This browser property can be used for the detection of headless web-bot browser instances. The property indicates whether the language of the browser UI corresponds with the sequence of available browser languages. Because this property heavily relates to the available languages, we chose to give it a relatively low score in contrast to the languages browser property.
• Has lied OS (suitable: No; category: 2). Only for the PhantomJS browser instance is this attribute different. The cause of this deviation is that the FingerprintJS library compares the User Agent with the available plugins. The PhantomJS browser instance does not contain any plugins, and FingerprintJS considers a browser instance without plugins to be one that runs on Windows. This assumption is not correct and therefore we do not take this property into consideration; instead we focus on the individual User Agent and Plugins attributes.
• Touch support (suitable: Yes; score: 0.1; category: 2). This browser property concerns a PhantomJS specific implementation and is therefore not a strong browser property to use for web-bot detection.
• Fonts (suitable: Yes; score: 0.2; category: 2). Absence of installed fonts within a browser may indicate a PhantomJS browser instance. Observing missing fonts might lead to a false positive, as fonts are system specific and therefore not well suited for comparison.
• Flash fonts (suitable: Yes; score: 0.4; category: 2). The Flash fonts attribute depends on the configured / installed Flash Player browser plugin. Modern browsers no longer support the Flash Player plugin due to security reasons; we encountered problems while trying to install it in our system configurations and managed to install it only in Google Chrome and Internet Explorer. Using the Flash fonts attribute to detect web-bot detection will therefore not be fruitful. We believe this attribute is still valuable for detecting older browsers driven by web bots, as they will lack the fonts exposed through the Flash plugin.
• Plugins (suitable: Yes; score: 8; category: 2). Absence of plugins and deviating browser plugins can strongly indicate a web-bot browser instance.
• Stack trace (suitable: Yes; score: 0.2; category: 2). Observing the stack trace may indicate a PhantomJS web browser. Detecting web-bot detection based on this browser property is hard, because there are many ways to implement the detection of a web bot based on it.
• Web security (suitable: Yes; score: 0.1; category: 2). Only for the PhantomJS browser instance does the web security attribute differ from its standalone versions. Web security is a configuration property of the browser in which the level of security can be configured to protect networked data and computer systems64. Although we do not believe that this setting can detect web-bot detection on its own, it can help indicate web-bot detection, because web security could be disabled in the case of malicious activities employed by a web bot.
• Popup suppression (suitable: Yes; score: 1; category: 2). Can be used to detect web-bot driven browsers, although it is hard to detect web-bot detection based on it, as it can be implemented in many different ways.
• Window keys (suitable: Yes; score: 12; category: 1, 2). Contains web-bot specific keys (cf. paragraph 5.2.1).
• Document keys (suitable: Yes; score: 12; category: 1, 2). Contains web-bot specific keys (cf. paragraph 5.2.1).
• Navigator attributes (suitable: Yes; score: 12; category: 1, 2). Contains web-bot specific keys (cf. paragraph 5.2.1).
• MimeTypes (suitable: Yes; score: 8; category: 2). Can be used to detect different web-bot browser instances.
• Languages (suitable: Yes; score: 8; category: 2). Can be used to detect different web-bot browser instances.
• WebSocket (suitable: Yes; score: 0.2; category: 2). We have seen that for the PhantomJS and NightmareJS web-bot browser instances the structure and sequence of the WebSocket property differ, which implies that the WebSocket browser property can be used to detect a web bot. We assigned a relatively low score because the detection mechanism can be implemented in many ways, which makes detecting it not straightforward.

Table 7.11: Relevance and score of deviating browser properties

* Category 1 = web-bot fingerprinting properties; category 2 = browser fingerprinting properties.

64 https://en.wikipedia.org/wiki/Browser_security


After the classification of the deviating browser properties we can conclude that not all deviating browser properties can be used for web-bot detection, and that we can distinguish strong and weaker browser properties with respect to the detection of web-bot detection. Strong deviating browser properties are the web-bot fingerprinting properties, which can be used to detect a web bot based upon a single browser property. The weaker deviating browser properties are the browser fingerprinting properties, which can only be used for the detection of web-bot detection in combination with multiple browser properties.

The deviating properties described in Table 7.11 represent the web-bot fingerprint surface, which serves as the set of parameters for measuring the adoption of web-bot detection on the internet, as described in chapter 8.


8 The adoption of web-bot detection on the internet
This chapter describes the measurement of the adoption of web-bot detection on the internet. Paragraph 8.1 describes the details of the web-bot detection scanner65 that has been developed for measuring the adoption of web-bot detection, making use of the web-bot fingerprint described in chapter 7. Paragraph 8.2 evaluates the employment of the web-bot detection scanner on the Alexa top 1 million.

8.1 Design and implementation web-bot detection scanner
The web-bot detection scanner has been developed as a plugin for the OpenWPM framework (cf. appendix C). OpenWPM is a web privacy tool that is used in many studies and is described in the work of Englehardt et al. [EN16]. The motivations for using the OpenWPM framework as the foundation for the web-bot detection scanner are:

• Fast start-up time. The OpenWPM framework offers sufficient tooling, such as web site navigation, capturing screenshots, HTTP traffic analysis and database persistence, to cover the needs of the project described in this document. This made a fast start of measuring web-bot detection possible.
• Robustness and performance. The OpenWPM framework provides failover and recovery mechanisms, which made crawling the large number of web sites of the Alexa top 1 million feasible. On top of that, the OpenWPM framework provides the means to execute multiple full-fledged browsers in parallel, which dramatically improved the performance and with that the ability to cover a large part of the sites on the Alexa top 1 million list.

The web-bot detection scanner has been developed as a plugin, embodied as a command in the OpenWPM framework. A command is a piece of business logic that is executed in isolation from other commands. Each browser instance receives a collection of commands that are executed sequentially. Examples of built-in OpenWPM commands are:
• get: visits a specified URL.
• save_screenshot: takes a screenshot of the current page and saves it to a specified data directory.

By making use of the command architecture of the OpenWPM framework, the web-bot detection scanner is easily interchangeable with other commands. Figure 8-1 illustrates the position of the web-bot detection scanner, wrapped in a command with the name 'detect_webbot_detection', within the OpenWPM framework.

65 https://github.com/GabryVlot/BotDetectionScanner


Figure 8-1: The web-bot detection scanner within the OpenWPM architecture as a command

The commands rectangle illustrates the commands that are executed by the browsers; in our case these are the get command for opening a web page and the detect_webbot_detection command representing the web-bot detection scanner. The commands are injected into the TaskManager, which monitors and instantiates browser managers that are responsible for starting a full-fledged browser. The TaskManager iterates over the commands while injecting the Selenium driver instance that represents a browser instance.

With respect to the design of the web-bot detection scanner itself, we have strived for a modular design by separating out the dynamic concepts of the application. The dynamic concepts mainly involve the detection patterns (cf. paragraph 8.1.2) and the configuration settings (e.g. the database and the location of the data folder). The configuration settings are stored in a separate file, which can easily be edited without touching the inner workings of the application. The detection patterns are a crucial part of the application and are designed for easy extendibility and maintainability by making use of the object-oriented class concept in combination with an interface. This makes it easy to:
- modify existing detection patterns;
- add new detection pattern types.

8.1.1 Application flow
The main application flow of the web-bot detection scanner can be separated into four steps, as illustrated in Figure 8-2.


Figure 8-2: Application workflow

Step 1 – Visit web site and obtain script inclusions
In the first step the web-bot detection scanner visits the homepage of a given site URL originating from the Alexa top 1 million. For each homepage, internal script inclusions are extracted from the homepage content and first level external script inclusions are downloaded by their reference from the concerning web server. If decompressing and/or decoding of the data is necessary, this also takes place in this step.

Step 2 – Script pre-processing
Subsequently all obtained scripts are pre-processed in step 2. The pre-processing involves:
• Removing comments. Comments in scripts will never be executed and are therefore not relevant for the detection of a web bot. We removed the comments to eliminate possible false positives and to improve the performance of the web-bot detection (cf. paragraph 8.1.2).
• De-obfuscating hexadecimal strings. The manual observation described in chapter 5 taught us that code towards web-bot detection can be obfuscated. For that reason, we built in a mechanism that replaces hexadecimal escape sequences by UTF-8 characters, which makes literals related to web-bot detection observable (a sketch of this replacement follows the step overview).

Step 3 – Script content analysis
After pre-processing, the contents of the scripts are analysed for signs of web-bot detection. The analysis process makes use of four different types of detection patterns to detect web-bot detection in the contents of a script; paragraph 8.1.2 describes the detection patterns in more detail.

Step 4 – Calculation of the script detection score and data persistence
After analysing the contents of a script for web-bot detection inclusions, the web-bot detection score of the script is calculated based upon the found detection patterns (cf. paragraph 8.1.2). If the web-bot detection score exceeds a certain threshold, which has been determined based on an empirical observation, the script is persisted on file and the detection patterns are stored in the database.
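The de-obfuscation of step 2 boils down to replacing hexadecimal escape sequences by their characters. The scanner performs this step in its own code base; the JavaScript sketch below only illustrates the transformation, and the example input is made up.

// Sketch of the hexadecimal de-obfuscation of step 2: replace \xNN escape
// sequences by their character so that literals such as "_phantom" become visible.
function deobfuscateHex(scriptContent) {
  return scriptContent.replace(/\\x([0-9a-fA-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Example: prints window["_phantom"]
console.log(deobfuscateHex('window["\\x5f\\x70\\x68\\x61\\x6e\\x74\\x6f\\x6d"]'));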


8.1.2 Detection patterns and web-bot detection score calculation
Detection patterns are used by the web-bot detection scanner to find web-bot detection inclusions in scripts and to calculate the detection score. Detection patterns are classified into the following types:

1. Detector patterns. A detector pattern is a manually found, recognizable bot detection source code pattern that can be traced back to a specific web-bot detector. A web-bot detector is a stakeholder that is interested in the detection of a web bot and is able to detect one; cyber security companies such as Distil Networks are an example. If a detector pattern is found in a code inclusion on a web site, the analysis process stops, as we then know for sure that web-bot detection takes place on this particular site.
2. Web-bot fingerprinting patterns. The web-bot fingerprinting patterns are based on the web-bot fingerprint surface gathered while determining the web-bot fingerprint surface. This type of detection pattern includes the web-bot browser fingerprinting properties classified in chapter 7. In addition, specific web-bot fingerprinting patterns have been added that we found during the analysis of the web-bot detection implementation described in chapter 5; due to the limitation that we used specific software versions, we did not find these patterns while determining the web-bot fingerprint surface (cf. paragraph 7.4).
3. Browser fingerprint patterns. The browser fingerprint patterns are based on the browser fingerprinting properties gathered while determining the web-bot fingerprint surface (cf. chapter 7) that can be used to detect a web-bot instance.
4. Manually found literal patterns. Manually found literal patterns are common code inclusions found across different web sites that are related to web-bot detection based upon the 'web-bot fingerprinting' and/or 'browser fingerprint' patterns. The difference between detector patterns and manually found literal patterns is that the latter cannot be traced back to a detector. Manually found literal patterns are only used to register the existence of common code inclusions between sites that have web-bot detection implemented. For that reason, the manually found literal patterns are only relevant when the calculated web-bot detection score of a script exceeds the empirically found threshold of 12 (cf. section Application flow, step 4).

Each of the aforementioned detection pattern types consists of five properties, which are described in Table 8.1.

Value: indicates the level of relevance with respect to web-bot detection; the higher the value, the bigger the chance that web-bot detection has been implemented on a web site. The value scale runs from 0.1 to 12, or contains the value 500; the value 500 represents web-bot detection by a detector pattern (cf. paragraph 8.1.1).
Name: the name of the detection pattern, used for categorization of detection patterns with respect to the detection value.
Patterns: contains literals, embodied in different structures, that can identify web-bot detection within a script. The patterns are specific per type of detection pattern.
Prerequisites: contains prerequisites that are taken into account when calculating the detection score. If the prerequisites are met, this positively influences the detection score.


Determinative: a Boolean property (True or False). If the determinative property is set to True and the concerning detection pattern is found in a script, the detection process stops, as we presume that the detection pattern on its own is strong enough to determine web-bot detection on a web site.

Table 8.1: The properties of a detection pattern.

Detection pattern per type
The contents of a detection pattern are specific to a detection type. In this section we give an example of a detection pattern instance per type, to give an impression of the usage of a detection pattern instance within the web-bot detection scanner.

Detector pattern
Code Snippet 8-1 illustrates an instance of the detector detection pattern. The illustrated detection pattern is able to detect a specific commercial web-bot detection code inclusion of the company PerimeterX66. The pattern is determinative, as indicated by the last True parameter, which means that the web-bot detection scanner stops scanning a specific script when a detector detection pattern has been found (cf. Table 8.1). The most interesting part is the patterns parameter, which contains a nested array of literals. The literals callPhantom and window._Selenium_IDE_Recorder are web-bot specific (cf. chapter 7), while the other literals are implementation specific. The literals are used in a regular expression that is used to search the contents of a script file. The nested array enforces an AND condition between the literals; in other words, the contents of a script must contain the literal 'perimeterx' AND 'window._Selenium_IDE_Recorder' AND 'PX239' AND 'callPhantom'.

score = 500.0
perimeterX = DetectionPatternFactory.createDetectionPattern(score, 'PerimeterX', [
    ['perimeterx', 'window._Selenium_IDE_Recorder', 'PX239', 'callPhantom']
], None, True)

Code Snippet 8-1: Detection pattern: detector

Web-bot fingerprinting surface pattern
Code Snippet 8-2 illustrates an instance of the web-bot fingerprinting surface detection pattern. The illustrated pattern is able to detect a specific web-bot characteristic: the webdriver property on the global DOM navigator object. The relevance of the pattern instance is estimated at a value of 12 (the first parameter) and the pattern is not determinative, which means that the web-bot detection scanner does not stop after finding a web-bot detection inclusion towards the webdriver property within a script. The patterns parameter contains a tuple with an array of literals and an OR condition; this structure enforces an OR condition between the literals. The last parameter specifies the prerequisite of the pattern instance, which defines that the global DOM navigator object must also be present in the script that is being processed.

navigator_webdriver = DetectionPatternFactory.createDetectionPattern(12.0, 'Webdriver', [
    (["'webdriver'", '"webdriver"', '\.webdriver(?![a-zA-z- ])'], 'OR')
], ['BrowserFingerprints_navigator'])

Code Snippet 8-2: Detection pattern: web-bot fingerprinting surface

66 https://www.perimeterx.com/


Browser fingerprint pattern
Code Snippet 8-3 illustrates an instance of the browser fingerprint detection pattern. This particular instance detects the presence of code inclusions towards the available plugins on the global DOM navigator object, whose existence is a prerequisite for the pattern. The patterns parameter contains a combination of AND and OR conditions between the literals, which makes this particular instance look for plugins.length OR window.ActiveXObject OR (plugins.length == 0 OR plugins.length === 0 OR plugins == undefined), etcetera. This construction prevents the score of 1.5 from being added for every plugins occurrence within a code inclusion.

plugins = DetectionPatternFactory.createDetectionPattern(1.5, 'Plugins', [
    'plugins.length', 'window.ActiveXObject',
    ([['plugins.length == 0'], ['plugins.length === 0'], ['plugins == undefined'],
      ['plugins === undefined']], 'OR'),
    'x-pnacl', 'Shockwave Flash', 'ShockwaveFlash.ShockwaveFlash'
], ['BrowserFingerprints_navigator'])

Code Snippet 8-3: Detection pattern: browser fingerprint pattern

Manually found literal pattern
Code Snippet 8-4 illustrates an instance of the manually found literal detection pattern. Manually found literal patterns do not have a positive score assigned to them, because they are only used to detect familiar code inclusions across script files. The pattern instance illustrated in Code Snippet 8-4 is able to detect the presence of the FingerprintJS library, which we also used to obtain the web-bot fingerprint surface (cf. chapter 7).

fingerprintjs = DetectionPatternFactory.createDetectionPattern(0.0, 'FingerprintJS',
    ['webglKey', 'hasLiedResolutionKey', 'hasLiedBrowserKey', 'hasLiedOsKey'])

Code Snippet 8-4: Detection pattern: manually found literal pattern

Web-bot detection score calculation
Web-bot detection can be implemented in one script or distributed across multiple scripts in the context of a web page, as illustrated in Figure 8-3.


Figure 8-3: Detection implemented in one file (left) or distributed over multiple files (right)

The left situation in Figure 8-3 represents web-bot detection implemented in one script. The right situation represents web-bot detection distributed over two scripts, where detection is only possible when the scripts are used together. With respect to the calculation of the detection score we have chosen to calculate the score per script, for the following two reasons:

1. It limits the number of false positives. Apparently distributed detection implementations are not taken into consideration if the scripts are not used in conjunction. For instance, if detection script 1 and detection script 2, displayed on the right side of Figure 8-3, are never used in conjunction, marking them would be a false positive.
2. The approach is in line with third-party scripts. The third-party scripts towards web-bot detection that we observed during this study (e.g. scripts originating from FingerprintJS, Distil Networks and PerimeterX) all implement their detection mechanisms in one script file. We believe that the main reason for this is the ease of deployment when providing those scripts to their customers (the web site hosts).

Distributed web-bot detection can still be taken into account despite the decision to calculate the score per script. Whether a distributed script is marked as a web-bot detection script depends on the implementation of the separate scripts themselves. To explain this in more detail, we need to take a closer look at the calculation of the score.

As described, the calculation of the detection score depends on the detection patterns found in a script and is calculated based on the following algorithm:

FS = collection of found web-bot fingerprinting patterns
BF = collection of found browser fingerprint patterns
ES = external scripts
IS = inline scripts
DP = {FS, BF}
S = {ES, IS}
d = found detector pattern

for each s ∈ S:
    if d:
        return detectionScore = 500
    for each p ∈ DP:
        score = p.value * p.occurrences
        if p ∈ FS and not prerequisites(p):
            score = score / 2
        detectionScore += score
return detectionScore

First the collection of found detection patterns is checked for the presence of a detector pattern. If a detector pattern is present, the detection score is 500. If no detector pattern is present, the application checks for web-bot fingerprinting and browser fingerprint patterns. The web-bot fingerprinting patterns have a higher relevance value than the browser fingerprint patterns, because it is most likely that they are used for web-bot detection; browser fingerprint patterns only indicate that the script can be used to detect a web bot. For both pattern types the detection score is calculated by multiplying the number of occurrences of a pattern by its value. For the web-bot fingerprinting patterns the prerequisites play a role in the calculation of the score: if a prerequisite is not satisfied, for instance if the global DOM navigator object (which is a prerequisite for the userAgent pattern) has not been found in the script, as in the distributed web-bot detection example of Figure 8-3, the score for that pattern is divided by two. This rule has been introduced to eliminate false positives on the one hand and to support distributed web-bot detection on the other hand:

• False positive mitigation. We observed some false positives where web-bot fingerprinting patterns (e.g. "PhantomJS") were found by the scanner but no related browser property (e.g. navigator.userAgent) was present. We chose to divide the score by two, which leads to a score just below the empirically found threshold, so that a script is not marked as web-bot detection solely based upon such a pattern.
• Distributed web-bot detection. If two or more web-bot fingerprinting patterns are found without passing the prerequisites, the web-bot detection score will still be higher than the threshold value, given the chosen pattern relevance values. This means that a distributed web-bot detection implementation is still observed by the web-bot detection scanner. This approach supports distributed web-bot detection implementations, as we have seen on DollsKill.com (https://www.dollskill.com/), where all the web-bot fingerprinting patterns are implemented in one script file and the web-bot detection that uses those patterns in another.

8.1.3 Validation
The validation of the correctness of the web-bot detection scanner was an iterative process intertwined with the development of the scanner. In the first place we took the downloaded files obtained during the analysis phase (cf. chapter 5) as input for the scanner. An advantage of using downloaded files was the development speed, but even more importantly, we did not target one particular web site extensively, which might have jeopardized the quality of the data returned by the web site host: there was a risk involved that we could get blocked.

After developing the scanner using downloaded files, we started to target the customer web sites of cyber security companies, which we found simply by reading promotional material on the sites of those companies. This approach did not only validate the scanner but also provided us with additional detector and manually found literal patterns.

8.2 Evaluation
We employed the web-bot detection scanner to analyse the script inclusions on the web sites of the Alexa top 1 million list. Two separate systems were used, running respectively eight and seven browsers in parallel, within a time window of fourteen days. The first system analysed web sites starting from the top of the list, whereas the second system started from the bottom of the list. We visited in total 431.340 sites; for 391.527 of these the script inclusions could be successfully analysed, and 50.205 of those sites contained web-bot fingerprint surface properties, as described in Table 8.2. Future work is necessary to determine why it was not possible to analyse the script inclusions of all visited sites.

Visited sites: 431.340
Analysed sites: 391.527
Sites with web-bot fingerprint surface properties: 50.205 (12,82 %)
Highest score (excluding detector patterns): 288.4
Lowest score: 12

Table 8.2: Sites containing web-bot fingerprint surface property script inclusions

For the measurement of web-bot detection on a web site we only took the score of the script inclusion with the highest score. The web site with the highest web-bot detection score (detector patterns not included, as their score is fixed to a value of 500) had a score of 288.4. The lowest score within the gathered results is, by construction, the value of the determined threshold, which is 12; a web site with a web-bot detection score of 12 contains at least one web-bot fingerprint pattern. Obviously, a web site can contain multiple scripts that have web-bot detection implemented, in other words multiple scripts with a web-bot detection value of 12 or higher. We believe that we must look at all scripts with a web-bot detection value of 12 or higher to get a better understanding of the script coverage with respect to the adoption of web-bot detection. For that reason, we take a closer look at the details of the analysed scripts that are key to the adoption of web-bot detection on the internet. The details of the scripts that contain web-bot fingerprint properties are described in Table 8.3.

Total script inclusions: 60.103
Duplicates: 45.125 (75,07 %)
Unique scripts: 14.978 (24,92 %)

Table 8.3: Details of the scripts containing web-bot fingerprint properties

In general, we found a lot of duplicated script inclusions. The main reason is the distribution of generic modules / libraries such as WebFont.js (a module for loading web fonts), FingerprintJS (cf. paragraph 7.2) and OneSignal (a standardised module for the implementation of push notifications).

The coverage per web-bot detection level, for the web-bot detection sites (the web sites that include scripts containing web-bot detection code inclusions) and for all script inclusions (including the duplicates), is described in Table 8.4 and Table 8.5 respectively.


We can classify the level of web-bot detection in a script by using the detection patterns:

• Detector patterns / web-bot fingerprinting patterns: web-bot detection takes place.
• Browser fingerprint patterns: browser fingerprinting takes place that can also identify a web bot.

Sites with web-bot fingerprint surface properties: 50.205
Detector patterns: 1.467 (2.92 %)
Web-bot fingerprinting patterns: 44.592 (88.81 %)
Browser fingerprinting patterns: 4.146 (8.25 %)

Table 8.4: Web-bot detection level coverage per detection pattern (web-bot detection sites)

Total script inclusions: 60.103
Detector patterns: 1.490 (0,74 %)
Web-bot fingerprinting patterns: 53.633 (89,83 %)
Browser fingerprinting patterns: 58.612 (97,52 %); without the web-bot fingerprinting patterns: 4.980 (8.28 %)

Table 8.5: Web-bot detection level coverage per detection pattern (all script inclusions)

By far most scripts contained web-bot fingerprinting patterns, which almost always also included browser fingerprints, as can be concluded from the browser fingerprinting column in Table 8.5. The browser fingerprint category is represented in the majority of scripts; the reason for this is its co-existence with the web-bot fingerprinting patterns. We observed 4.146 sites and 4.980 script inclusions with only browser fingerprint patterns, which are the weakest indicators of web-bot detection.

The reason for the relatively low percentage of detector patterns is that we only used eight detector patterns to detect web-bot detectors, as described in Table 8.6. Possible future work on analysing the web-bot detection scripts could further improve the coverage of this pattern type.

Detector pattern   Occurrence
Distil Networks    1.114 (74.77 %)
PerimeterX         229 (15.31 %)
DataDome           53 (3.6 %)
AdScore            41 (2.7 %)
PerfDrive          53 (3.6 %)

Table 8.6: Script occurrences per detector pattern

Web-bot detection score domains
The web-bot detection score gives an indication of how likely it is that a script detects a web bot: the higher the score, the more likely it is that a script can detect a web bot. We found that each detection pattern type has its own score domain. For instance, the domain for the detector pattern is the fixed score of 500. The score domains for the web-bot fingerprinting pattern and the browser fingerprint pattern are 12 - 288.4 and 12 - 24.5 respectively. Figure 8-4 illustrates the score domains and the score distribution across the observed sites.


Figure 8-4: Score domains and score distribution across sites

Figure 8-4 illustrates that the density of web-bot detection values across sites is highest in the range between 12 and 30. In general, the web-bot detection scripts in this region detect one or two detectable web bots. The density within the browser fingerprint domain can be explained by the heavy distribution of external modules. The scripts discussed in chapter 5 can be detected by the web-bot detection scanner based on their detector pattern (Distil Networks), which is assigned the value 500. To illustrate that the found scripts do not concern edge cases, we also plotted these scripts in the graph with the web-bot detection score based upon their web-bot fingerprint surface properties.

8.3 Improvements
The current implementation of the web-bot detection scanner leaves room for improvement concerning antipatterns and honeypot detection.

• Antipatterns. Antipatterns can improve the detection of web-bot detection by specifying patterns that seem to be web-bot detection specific but are not. By using antipatterns, false positives could be eliminated and even the speed of the web-bot detection scanner could be improved, by discarding scripts that contain an antipattern.
• Honeypot detection. Honeypots67 (for instance invisible form elements) can be used to detect a web bot. Currently the web-bot detection scanner does not have any patterns in place for detecting honeypots, which means that web-bot detection based purely on honeypot elements will slip through our detection mechanism. An improvement, embodied by extending the web-bot detection patterns, would be to include honeypot detection patterns that are able to detect elements that never become visible in the contents of a particular web page (a sketch of such a honeypot follows this list).
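To illustrate what such a honeypot can look like, the following sketch adds an invisible form field that a human visitor never fills in, whereas a naive form-filling web bot might; the field name, styling and the reporting endpoint are illustrative.

// Sketch of a honeypot form field as a site operator might implement it,
// assuming the page contains a form. Field name and endpoint are illustrative.
var form = document.querySelector('form');
var honeypot = document.createElement('input');
honeypot.type = 'text';
honeypot.name = 'contact_website';   // innocuous-looking field name
honeypot.style.display = 'none';     // invisible to human visitors
form.appendChild(honeypot);

form.addEventListener('submit', function () {
  if (honeypot.value !== '') {
    // Only a web bot that fills every field ends up here.
    navigator.sendBeacon('/bot-report', 'honeypot field was filled');
  }
});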

67 https://en.wikipedia.org/wiki/Honeypot_(computing)


8.4 Limitations and risk
We believe that the results described in paragraph 8.2 are just the tip of the iceberg, given the following limitations of the web-bot detection scanner:

• Homepage visits. The scanner only visits the homepages of the sites in the Alexa top 1 million list. We chose this approach as it was the most pragmatic one given the relatively short amount of time. We believe that most non-login web sites that implement web-bot detection employ it on their homepage. For web sites that require a login, it makes sense to have web-bot detection on the deeper-level web pages in the 'secured zone'.
• First level script analysis. Currently only first level scripts are analysed for web-bot detection inclusions. This means that when a non-web-bot detection script includes another script that does contain web-bot detection, this is not detected by the scanner. Although recursive script analysis is easy to implement, we chose a one-level detection implementation from the perspective of performance and the available data extraction time.
• Static code analysis. The static code analysis implemented by the web-bot detection scanner (cf. paragraph 8.1) is not able to detect heavily obfuscated code inclusions. Code Snippet 8-5 illustrates a web-bot detection pattern that cannot be detected by the web-bot detection scanner: the literal '_callPhantom' (which is a web-bot fingerprint surface pattern) is constructed dynamically at run time. Observing the illustrated web-bot detection pattern is not impossible, but difficult, as it would require backtracking the variables / constants used and interpreting them in the right way, which would indeed require an interpreter. We decided not to support such a complex mechanism, as we consider the chance that it exists to be very small: heavily obfuscated code as illustrated is very hard to maintain for a web-bot detector.

const obfuscatedArray = ['_', 'c', 'a', 'l', 'l', 'P', 'h', 'a', 'n', 't', 'o', 'm'];

if (window[obfuscatedArray.join('')])
    console.log('It is a Bot!!!!');

Code Snippet 8-5: Heavily obfuscated web-bot detection pattern


9 Observing deviations
In this chapter we describe the observation of deviations on web sites that implement web-bot detection. The sites that we used originate from the measurement of the adoption of web-bot detection described in chapter 8.

Web-bot detection can lead to deviations on a web page; the two are therefore closely related. For studies that use a web bot, deviations on a web page can harm the outcome of the study (cf. chapter 1), or might not be relevant in the context of a particular study. For instance, when a study of LinkedIn profiles is not able to extract advertisements from a web page, this will probably not be relevant, contrary to studies of malvertisement68. Malvertisement is the distribution of malware via vulnerable and corrupted advertisements.

To get more insight into the consequences of web-bot detection for the employment of web bots in studies, we manually observed web sites for different types of deviations, as described in paragraph 9.1. Paragraph 9.2 describes the results of the manual observation.

9.1 Approach
We used 20 web sites that contain web-bot detection mechanisms, as observed during the adoption measurement phase (cf. chapter 8). From all the web-bot detection sites we chose the web sites with the highest web-bot detection scores, while striving for detection pattern and score diversity:

• Detection pattern diversity. With respect to the detector patterns, we avoided the use of duplicate detector patterns.
• Detection score diversity. With respect to the detection score, we avoided the use of duplicate scores.

By applying these diversity criteria, we strived to find different types of deviations. The web sites used, together with their scores, are described in Table 9.1.

Site                                Score    Detection pattern
http://aetnabetterhealth.com        500      Detector pattern (Distil Networks)
http://allergyandair.com            500      Detector pattern (PerimeterX)
http://yify-films.com               500      Detector pattern (AdScore)
http://diabetesincontrol.com        500      Detector pattern (Distil CDN)
http://eiendomspriser.no            500      Detector pattern (PerfDrive)
http://lequotidiendupharmacien.fr   500      Detector pattern (DataDome)
http://proredirector.com            288.4    Web-bot fingerprinting / browser fingerprinting pattern
http://mmobeast.com                 260.29   Web-bot fingerprinting / browser fingerprinting pattern
http://diapasonmag.fr               224.6    Web-bot fingerprinting / browser fingerprinting pattern
http://moutech.com                  183      Web-bot fingerprinting / browser fingerprinting pattern
http://zip.net                      136.1    Web-bot fingerprinting / browser fingerprinting pattern
http://cine974.com                  120.5    Web-bot fingerprinting / browser fingerprinting pattern
http://wtftattoos.com               119.5    Web-bot fingerprinting / browser fingerprinting pattern
http://amdelplata.com               103.5    Web-bot fingerprinting / browser fingerprinting pattern
http://kiyu.tw                      102.1    Web-bot fingerprinting / browser fingerprinting pattern
http://anpfiff.info                 92.7     Web-bot fingerprinting / browser fingerprinting pattern
http://321kochen.tv                 91.7     Web-bot fingerprinting / browser fingerprinting pattern
http://affairesdegars.com           90.9     Web-bot fingerprinting / browser fingerprinting pattern
http://bellechic.com                89.7     Web-bot fingerprinting / browser fingerprinting pattern

68 https://en.wikipedia.org/wiki/Malvertising


http://pc-4you.ru | 86.7 | Web-bot fingerprinting / browser fingerprinting pattern
Table 9.1: Web-bot detection sites used for the observation of different types of deviations

We observed deviations by manually comparing screenshots of the homepages of the sites listed in Table 9.1 between a visit with a web-bot driven and a standalone browser instance. We used PhantomJS as web-bot driven browser instance and Google Chrome as standalone browser instance to visit the web sites, each from a different machine and IP address. The fact that each machine has its own IP address is especially important, because otherwise the standalone browser could be confronted with web-bot detection deviations caused by the web-bot browser instance. We chose PhantomJS because it is the most detectable web-bot browser instance, which increased the chance of finding different types of deviations. We visited each web site five times to ensure that there were enough occasions for the server host to detect the web bot and respond with a deviation.
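A minimal PhantomJS script along the following lines can capture such a homepage screenshot; the URL, timeout and output filename are illustrative and not the exact script used in this research:

// Sketch: capture a homepage screenshot with a PhantomJS (web-bot driven) browser instance.
var page = require('webpage').create();

// Use a desktop-sized viewport so the rendering resembles a standalone browser window.
page.viewportSize = { width: 1366, height: 768 };

page.open('http://example.com', function (status) {
    if (status !== 'success') {
        console.log('Page could not be loaded');
        phantom.exit(1);
        return;
    }
    // Give client side scripts (including possible web-bot detection code) time to run.
    window.setTimeout(function () {
        page.render('homepage_webbot.png');
        phantom.exit();
    }, 3000);
});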

9.2 Different types of deviations

By making use of screenshots we observed deviations in the visual representation (cf. paragraph 2.1.2). We found deviations on 17 sites, which can be classified into the types blocked response, blocked content and deviation in the content of the web page; they are described in Table 9.2. The first column of Table 9.2 describes the deviation type, followed by a column that aggregates the occurrences of combinations of deviations. The last column lists the sites per deviation combination. Details of the deviation types blocked content and deviation in the content of the web page can be found in appendix E.

Deviation type | Deviation(s) | Site(s)
Blocked response | Main page not loaded. | http://amdelplata.com
Blocked content | Blocked by a CAPTCHA on the first visit. | http://aetnabetterhealth.com, http://lequotidiendupharmacien.fr, http://allergyandair.com
Blocked content | Blocked by a message. | http://eiendomspriser.no
Deviation in the content of the web page | Not all advertisements could be loaded. | http://diabetesincontrol.com, http://zip.net, http://wtftattoos.com, http://affairesdegars.com, http://anpfiff.info, http://pc-4you.ru
Deviation in the content of the web page | Different page layout; not all advertisements could be loaded. | http://proredirector.com
Deviation in the content of the web page | No videos were loaded; not all advertisements could be loaded. | http://diapasonmag.fr, http://321kochen.tv
Deviation in the content of the web page | Not all videos could be loaded. | http://cine974.com
Deviation in the content of the web page | Login dialog not displayed. | http://kiyu.tw
Deviation in the content of the web page | Difference in page layout. | http://bellechic.com


Table 9.2: Observed web page deviation types

Below we describe the details of the deviations observed with the web-bot browser instance for each of the categories in Table 9.2.

Blocked response
On the site http://amdelplata.com we observed that the main page could not be loaded after the display of a 'splash screen' (a web page containing only a logo). This behaviour can be identified as a blocked response, where the response (the main page) is not sent back to the client as a consequence of web-bot detection (cf. paragraph 2.1.2).

Blocked content
On four web pages we observed blocked content; on three of them by means of a CAPTCHA when visiting the web page for the first time. On the web site http://eiendomspriser.no we observed blocking in the form of a full-page message in Norwegian stating that our IP address was temporarily blocked for this site.

Deviation in the content of the web page
During our observation we found four types of deviations:
• Videos could not be loaded.
• Difference in page layout.
• Advertisements could not be loaded.
• Login dialog not displayed.

We believe that the deviations videos could not be loaded and difference in page layout are not caused by web-bot detection but simply by the use of a headless browser instance. As described in chapter 7, a headless browser such as PhantomJS does not have plugins installed; for this reason, videos that require a particular plugin cannot be loaded. We also attribute the difference in page layout to the use of a headless browser, whose HTML rendering differs slightly from that of the Google Chrome browser: we observed that all the information is available on the web pages, only the alignment is a bit off. We do believe that the lack of advertisements on the different web pages has been caused by web-bot detection. First, this observation is in line with the observation in the work of Englehardt et al. (cf. chapter 1). Second, it is plausible that certain content delivery networks (CDNs) or advertisers do not serve advertisements to web bots, simply because advertisements are not relevant for web bots from the perspective of an advertiser. In fact, serving advertisements to web bots increases the risk that fraudulent revenues are generated [Whi16]. We also believe that the absence of the login dialog on http://kiyu.tw has been caused by web-bot detection: in line with hiding advertisements from web bots, it makes sense that a web site also hides access to authorised content.

Overall, we state that the deviations listed in Table 9.2 are most likely caused by web-bot detection because:
• We observed web-bot detection client side code (cf. chapter 8).
• We believe that the observed deviations cannot have been caused by excessive network traffic: the relevant web pages were visited at most three times before the deviation appeared.

We conclude that we can observe different deviation types based upon web-bot detection, of which the type deviation in the content of the web page is the most harmful in relation to a study, because such a deviation is harder to observe than blocked content. We claim that a malvertisement study 'in the wild' that makes use of a PhantomJS web-bot browser instance will lead to unreliable results if the results are not corrected for web-bot detection.


10 Conclusion, discussion and future work

In this document we described research into the adoption of web-bot detection on the internet and its relation to web-bot based studies. We proposed a six-phase methodology that provided us with the means to measure web-bot detection on the internet and to observe web page deviations caused by web-bot detection.

By conducting the individual phases of the proposed methodology, we answered the following research questions:

RQ1: How to prove and measure the existence of web-bot detection?
A manual observation, supplemented by an investigation of the inner workings of a specific web site with a strong web-bot detection smell, proved the existence of web-bot detection. We observed deviations when visiting the web page with a web-bot driven browser instance, in contrast to a visit by a human being using a regular browser. We substantiated the presence of a web-bot detection implementation by an investigation of the inner workings, which revealed a client side implementation of a web-bot detection mechanism. We validated the web-bot detection mechanism by constructing a countermeasure that bypassed it; after implementing the countermeasure no deviations were observed. Investigation of the inner workings revealed that the web-bot detection mechanism is able to detect web bots client side based upon the properties exposed by a browser, which makes web-bot detection measurable.
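A minimal sketch of the kind of client side check such a mechanism can perform is shown below. The property names are taken from the web-bot fingerprint surface discussed in chapter 7; the exact checks of the investigated implementation are not reproduced here.

// Simplified client side web-bot check based on properties exposed by the browser.
// A real implementation combines many more signals before deciding (cf. chapter 7).
function looksLikeWebBot() {
    return Boolean(
        window._phantom ||                        // PhantomJS
        window.callPhantom ||                     // PhantomJS
        window.__nightmare ||                     // NightmareJS
        navigator.webdriver ||                    // Selenium / automated Chromium-based browsers
        document.$cdc_asdjflasutopfhvcZLmcfl_     // ChromeDriver document key
    );
}

if (looksLikeWebBot()) {
    // A detector could now block the response or serve a deviating page (cf. paragraph 2.1.2).
    console.log('Web bot detected');
}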

RQ2: To what extent is web-bot detection adopted by web sites on the internet?
We developed a web-bot detection scanner that we used to measure the adoption of web-bot detection. The web-bot detection scanner visited 431.340 sites of the Alexa top 1-million list, of which 391.527 sites could be used for analysing the script inclusions. Future work is necessary to determine why it was not possible to analyse the script inclusions of all visited sites. The web sites were analysed for web-bot detection code inclusions in scripts by making use of a so-called web-bot fingerprint surface. The web-bot fingerprint surface consists of browser properties that can be used to detect a web bot, and has been implemented in the web-bot detection scanner as detection patterns. Detection patterns were used to assign a web-bot detection score per script inclusion and to classify a script into one or more web-bot detection levels that indicate the ability to detect a web bot (cf. paragraph 8.2). Both the detection patterns Detector pattern and Web-bot fingerprinting pattern indicate that web-bot detection takes place. The detection pattern Browser fingerprinting pattern indicates that browser fingerprinting takes place, which is also able to detect a web bot. The results of the employment of the web-bot detection scanner point out that 12.82% of the investigated sites are able to detect a web-bot browser instance. The coverage per detection pattern is described in Table 10.1.

Visited sites | Analysed sites | Detector pattern | Web-bot fingerprinting pattern | Browser fingerprinting pattern
431.340 | 391.527 (100%) | 1.467 (2.92%) | 44.592 (88.81%) | 4.146 (8.25%)
Table 10.1: Web-bot detection level coverage (web-bot detection sites)
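The scoring and classification idea can be summarised with the following sketch. It is written in JavaScript for consistency with the other snippets in this thesis, whereas the actual scanner is implemented in Python (cf. appendix D.3); the patterns and weights shown are illustrative and not the scanner's real configuration.

// Illustrative scoring: every matched detection pattern contributes a weight,
// and the web-bot detection level follows from the categories that matched.
const patterns = [
    { category: 'Detector pattern',       regex: /distil|perimeterx|datadome/i,        weight: 500 },
    { category: 'Web-bot fingerprinting', regex: /_callPhantom|__nightmare|webdriver/, weight: 50 },
    { category: 'Browser fingerprinting', regex: /navigator\.plugins|toDataURL/,       weight: 10 }
];

function scoreScript(scriptSource) {
    let score = 0;
    const levels = [];
    for (const p of patterns) {
        if (p.regex.test(scriptSource)) {
            score += p.weight;
            levels.push(p.category);
        }
    }
    return { score: score, levels: levels };
}

console.log(scoreScript('if (navigator.webdriver) { /* ... */ }'));
// → { score: 50, levels: ['Web-bot fingerprinting'] }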

An adoption rate of over 12% is substantial, especially considering that we only scratched the surface. Regarding the measurement of the adoption of web-bot detection on the internet, we only took the homepages and first-level scripts of the web sites in the Alexa 1-million list into account. Furthermore, we are aware that there exist other ways to detect a web bot that we did not take into account, such as honeypot elements.


RQ3: Can we distinguish different types of deviations based upon web-bot detection?
We observed three different types of deviations by manually comparing screenshots of web sites visited with a (PhantomJS) web-bot driven and with a standalone browser instance. A deviation is an observable difference on a web page caused by the employment of a web bot, in comparison to the web page visited by a human being. For the comparison we selected 20 sites that contained web-bot detection inclusions, based on their web-bot detection score and the diversity of the implemented detection patterns (cf. RQ2). In 16 of the 20 sites we found deviations that can be traced back to web-bot detection, classified into one of the following types:

• Blocked response: the web site decides to send no response to the client.
• Blocked content: the content of the web page has been blocked, for instance by a CAPTCHA.
• Deviation in the content of the web page: the content contains a deviation in the visual representation of the web page.

We observed deviations on web-bot detection sites by making use of PhantomJS, which in general resulted in deviations in the interactive content of a web page. Based upon the aforementioned facts, we claim that the results of a study of malvertisement in the wild that uses PhantomJS for acquiring the data cannot be realistic. We believe that the observed deviations caused by web-bot detection are just the tip of the iceberg, simply because more subtle deviations, which are hard to detect, can easily be implemented by server hosts.

RQ4: How likely is it for web-bot based studies to get detected by web-bot detection?
The likelihood of getting detected depends on the data extraction technique used. In the context of the research described, the data extraction technique equals the web-bot framework used to automate a browser instance. During our research we investigated nine web-bot framework / browser combinations to find deviating browser properties of web-bot browser instances with respect to a standalone (human driven) browser instance. The risk of getting detected per web-bot framework / browser combination is described in Table 10.2, established by making use of the web-bot detection level described in paragraph 7.3.

Web-bot driven browser instance | Headless | Headful
PhantomJS | High | /
Selenium + Google Chrome | High | Medium
Selenium + Google Chromium | High | Medium
NightmareJS | High | High
Automated Google Chrome | High | Low
Automated Chromium | High | Medium
Selenium + Firefox | Medium | Low
Selenium + Internet Explorer | / | Medium
Selenium + Edge | / | Low
Table 10.2: Risk of getting detected per web-bot browser instance.

In general, we found that each web-bot driven browser instance contains at least two deviating browser properties that can be used to detect it. Headless browsers contain the most deviating browser properties. A browser instance driven by PhantomJS contains by far the most deviating browser properties (21), which makes it the most detectable browser instance.


The browser instances Edge or Firefox driven by Selenium, and an automated Chromium browser instance, are the stealthiest, all with two deviating browser properties. PhantomJS is in our opinion too detectable for targeting a large number of unfamiliar sites on the internet. Therefore, we would advise against the use of PhantomJS for data extraction in the wild. PhantomJS could still be of added value for specific web-bot based studies. Besides the data extraction technique, the data extraction tactic used also influences the likelihood of getting detected by a web-bot detection implementation. During our research we visited web sites 'in the wild' by making use of one web bot; in this case a web site is visited at most once. During a manual observation of a web-bot detection implementation (cf. chapter 5) we have seen that one visit might not be enough to trigger web-bot detection. If we target a web site many times, it is more likely that web-bot detection will be triggered (and probably also rate limiting, which is beyond the scope of our research), as there will be more interactions between client and server, which increases the chance of getting detected.

We argue that web-bot detection is a real threat to web-bot based studies. A web-bot detection rate of 12.82% is a large number, especially in relation to the results of some of the web-bot based studies described in paragraph 2.4. The work of Nikiforakis et al. examined 10.000 web sites of the Alexa top 1-million list and found 40 sites (0.4%) utilizing fingerprinting code from three commercial providers [NKJ+13]. Web-bot detection could thwart the results of the work of Nikiforakis et al. by applying deviations to a large number of the web pages of the sites with fingerprinting code. In ratio with the measured web-bot detection adoption rate, 33 web pages of the 10.000 sites could contain web-bot detection (worst-case scenario). Even without the relatively uncertain browser fingerprinting pattern script inclusions (cf. Table 10.1) the adoption rate comes down to 11.76%, which can still have a great influence on the results of the study. Nikiforakis et al. did not mention the web-bot techniques used in their paper. Given the popularity of the web-bot techniques used during our research, it is very likely that they employed a web bot that is detectable by web-bot detection (cf. Table 10.1). Nikiforakis et al. did not mention the risk of getting detected by web-bot detection in their paper, and we therefore assume that web-bot detection has not been taken into consideration, which makes it a real threat to their study. Similar reasoning applies to the work of Englehardt et al., who applied PhantomJS and Selenium + Firefox web-bot driven browser instances to find fingerprinting script inclusions [EN16]. Both web-bot browser instances are detectable by web-bot detection, with a risk level of respectively High and Medium / Low. Englehardt et al. also did not mention the possible effect of web-bot detection on the results of their study. The results described in their paper, as illustrated in Figure 10-1, indicate that web-bot detection can have a big influence on the results of the study. Table 10.3 illustrates the coverage of possible web-bot detection sites per rank interval, which can be related to the sites containing fingerprinting scripts in Figure 10-1.

Figure 10-1: Adoption of fingerprinting scripts on top sites [EN16]

#Sites relative to the rank interval illustrated in Figure 10-1 | Web-bot detection adoption rate
1.000 | 0.03%
10.000 | 0.33%
100.000 | 3.27%
1.000.000 | 32.74%
Table 10.3: Web-bot detection adoption rates per rank interval (worst-case scenario).


10.1 Discussion

We described that it is plausible that the results of web-bot based studies are being thwarted by web-bot detection. Web-bot detection is triggered by web-bot specific properties exposed by the browser. Scientists need web bots to elicit knowledge from the wealth of information available on the internet, as manual extraction is infeasible. What can be done to make the relationship between web-bot detecting web sites and scientists workable? This question can be approached from the perspective of a web site, of a software manufacturer / open source community, and of an end user (a scientist).

Web site
In our opinion, web-bot detection should not take place indiscriminately based on the properties of a browser, which merely indicate the use of an automated software product. Web-bot detection should focus on wrong usage of a web site, for instance excessive traffic generation or fraudulent login attempts. Unfortunately, this is web site specific and therefore out of the hands of the end user.

Software manufacturer / open source community
Software manufacturers and the open source community can contribute to a better employability of web bots by scientists through changes in the automated software tools. Web-bot detection takes place on the properties exposed by a browser, which can be divided into deviating browser properties and web-bot browser properties. If a web-bot driven browser instance does not want to get detected, it needs to make sure that it does not stand out in comparison to a mainstream standalone browser instance. This is the responsibility of the software manufacturer / open source community behind the automated software. For PhantomJS this is quite challenging, as it contains 21 deviating browser properties. If the open source community decides to let PhantomJS stay under the web-bot detection 'radar', it will need to change the value of the userAgent, change the error stack trace so that it is in line with the stack trace of a standalone browser instance, change the color depth, etcetera. Concerning other automated software, such as the Chromium web browser, there is less work involved. The Chromium based browser contains a property webdriver which indicates that the browser is being driven by a web bot; removing this property would make the web-bot driven Chromium browser less detectable. Although technical modifications are possible to make a web-bot driven browser instance stealthier, we do not believe software manufacturers are willing to cooperate, simply because it involves too high a risk for their product. Can you imagine what would happen to the Chromium product if a DDoS attack could only be performed with an automated version of Chromium? There might also be an ethical aspect involved concerning the relation between the software manufacturer / open source community and web sites, namely that an automated browser should always be recognisable as such.

In general, we observed the most deviating browser properties in headless browser instances. The major reason for this is that headless browser instances do not contain display related properties such as plugins, fonts, MimeTypes and languages. One can argue that the gap between headless and headful browser instances will be bridged by the rich feature set provided by HTML5. For plugins this might be the case, as fewer and fewer plugins (e.g. the Flash plugin) are required nowadays to experience an interactive web site. We do not believe that the evolution of the HTML language will overcome the deviations of the other mentioned properties. Bridging the gap between headless and headful browser instances is another responsibility of the software manufacturer / open source community, as headless browsers will stay popular because of their performance and scalability in large scale research studies.


End user
As long as web-bot detection takes place on the properties exposed by a browser and the employment of a web bot still leads to deviating browser properties, only the end user can make the web bot undetectable. We would opt for taking a web-bot browser instance with the least amount of deviating browser properties and spoofing the deviating properties. We believe that a Selenium driven Edge browser instance is the least detectable web-bot browser instance, as it concerns a relatively new browser with only two deviating browser properties.
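As an example of such spoofing, the deviating webdriver navigator attribute (cf. appendix C) can be masked by injecting a script before any page script runs. This is a sketch of the general idea only; it does not remove the other deviating properties and is not a guaranteed way to stay undetected.

// Sketch: mask the 'webdriver' navigator attribute that Selenium driven and
// automated Chromium based browser instances expose (cf. Tables C-7 and C-15).
// This must be injected before the page's own (detection) scripts execute.
Object.defineProperty(navigator, 'webdriver', {
    get: function () { return undefined; },
    configurable: true
});

// Other deviating properties, such as the user agent of a headless instance,
// would have to be spoofed through the automation framework itself.
console.log(navigator.webdriver); // undefined, as in a standalone browser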

10.2 Future work

The research area of web-bot detection turned out to be a rich research area in which we foresee the following fruitful future work:
- The research described only scratched the surface of web-bot detection on the internet. Future research into other web-bot detection implementations is necessary to provide a more complete measurement of the adoption of web-bot detection on the internet. For now, we think of the following two additional web-bot detection mechanisms:
  o Honeypots
  o Google reCAPTCHA69 (Google reCAPTCHA seems not to be browser instance specific)
- In future work the accuracy of the web-bot detection scanner can be optimized by:
  o Analysing the observed web-bot detection sites for familiar detector patterns. The detector detection pattern can be extended with these patterns, which makes the scanner more accurate as it is able to detect proven web-bot detection implementations. A broader coverage of detector patterns also helps to get an overview of notorious web-bot detection detectors (companies) on the internet.
  o Introducing anti-patterns. Anti-patterns can improve the detection of web-bot detection by specifying patterns that seem to be web-bot detection specific but are not. By using anti-patterns, false positives could be eliminated and even the speed of the web-bot detection scanner could be improved by discarding scripts containing an anti-pattern.
- Integrating the web-bot detection scanner in data extraction frameworks (e.g. OpenWPM). Integration can help scientists mitigate the results of web-bot detection sites. We can imagine that it is very fruitful to know whether web-bot detection takes place when extracting data from a particular site. Feedback provided by the web-bot detection scanner can be used to mitigate the results or even to avoid web-bot detection.
- Web-bot detection and the determination of deviations are closely related. Unfortunately, the described study only indicates that web-bot detection can take place; to decide whether it actually takes place, we also need to observe deviations on a web page. Determining deviations is hard, but we believe it becomes more feasible if web-bot detection is integrated in a future project on the detection of deviations. Web-bot detection indicates when a deviation can occur, which could trigger deviation detection that checks the structural and visual representations for deviations.
- Impact analysis of web-bot detection by conducting a repeatable study. This research described that the results of a study can be thwarted by web-bot detection. A study must be repeated to prove the relationship between web-bot detection and the quality of the results of a web-bot based study.

69 https://www.google.com/recaptcha/intro/v3beta.html


Appendices

A. Taxonomy of automatic data extraction methods

A taxonomy of automatic data extraction methods can be constructed at different levels. Figure A 1 illustrates an overview of automatic data extraction levels considered from a top-down approach: management of the data extraction process, approach, operation mode, data extraction strategy, extractable content and supported technology.

Figure A 1: Overview of automatic data extraction levels

The next paragraphs give an overview of a possible taxonomy at each level, taking into account the pros and cons.

A.1 Management of the data extraction process

At the highest level we can divide automatic data extraction by the way the data extraction process is managed. Basically, this leaves us with two types of data extraction: data extraction can be offered as a service by a third party company, or it can be performed under own supervision.

There are commercial companies, like Prompt Cloud70, that offer data extraction from the web as a service. For a certain amount of money, the data extraction process can be outsourced to this external party. A big advantage is that no knowledge of data extraction techniques is needed and that it costs hardly any extra company resources in terms of time. In most cases you are dealing with data extraction specialists. There are also disadvantages: what happens to the data that is paid for? Maybe it will also be sold to other companies. Besides, there is little to no control over the configuration of the data extraction process. The big question is: how custom-made is the data extraction process in relation to the needs of the customer?

70 https://www.promptcloud.com/scraping-service/


When data extraction is performed under own control, it is clear what is going to happen with the extracted data. Also, in most cases the data extraction method can be tailored to the needs of the data request. On the other hand, data extraction is an expertise and it takes effort and company resources to gain and maintain the knowledge.

Method: As a service
Pros:
● No (deep) data extraction knowledge required.
● No need for extra company resources.
Cons:
● Uncertain what will happen to 'your' data.
● Uncertain what will happen with fingerprinting of the data extraction behaviour.
● No detailed control over the configuration.

Method: Self controlled
Pros:
● Extracted data is self controlled.
● Self-control over exposing fingerprinting behaviour.
● Full control concerning what to extract (configuration).
Cons:
● Need for data extraction knowledge.
● Need for data extraction company resources.

Table A 1: Pros and cons concerning process management

A.2 Approach

The work of Glez-Pena et al. describes that an automated data extraction implementation can be divided into four flavours: making use of libraries, a framework, a standard 'out of the box' tool, or an extendable standard tool [GLL+14].

Using a library (like Wget or libcurl) for automatic data extraction has the great advantage of writing a tool in a familiar language and language structure. In general a library provides a solid base for creating a tool, but libraries are not that versatile: to cover a wide range of functionality, libraries with different capabilities have to be combined.

A framework provides the building blocks for making a data extraction tool. In this perspective a tool can be seen as a specialisation of a framework. By making use of a framework (like Scrapy, http://scrapy.org), a tool can be extended very easily by writing additional code. In that way it is possible to tailor the data extraction implementation in detail to the desired needs. Another big advantage is that a framework is in most cases driven by an underlying community. If the community is large, it can provide already developed functionality; the power of this of course depends on the size and activity of the community. Another advantage is that it is (partly) free of charge (open source driven). Disadvantages are usability (no out of the box wizards) and the need for specific knowledge. For instance, to implement a data extraction solution with Scrapy, programming knowledge (Python) is necessary.

Most out of the box tools are in general configurable but not extendable. This means that the selected tool must exactly suit the needs of the data extraction purpose; when facing limitations, there is no cure available. On the other hand, in most cases a lot of data extraction knowledge is implemented in these tools, and only configuration knowledge is required to start data extraction.


There are a lot of existing tools that can be used 'out of the box' and provide the ability to extend their functionality, for example by writing scripts. A big advantage is the start-up time: it is not necessary to write a bunch of code to get going. When you are facing limitations, you might be able to solve them by writing extensions. By then you probably need some programming skills, but you are able to make the tool suit the needs of your business. The disadvantage is that, compared to a solution written on top of a framework, it tends to have a more restrictive character.

Approach: Libraries
Pros:
● The ability to create an automatic data extraction tool in a familiar programming language.
Cons:
● Several libraries need to be integrated (i.e. one for web access and one for parsing).

Approach: Framework
Pros:
● More integrated solution compared to libraries.
● Extendable to fit the required needs.
● Re-use of blocks/functionality already developed by the community.
● Knowledge sharing through the community.
● (In most cases) free of charge.
Cons:
● Lack of 'out of the box' functionality.
● Specific (programming) knowledge required.

Approach: Standard tooling
Pros:
● No profound knowledge required.
● Start-up time, easy to get going.
Cons:
● Not extendable, only configurable.
● Not very versatile; usually it concerns a specialism.
● Commercial distribution.

Approach: Extendable standard tooling
Pros:
● Start-up time, easy to get going.
● Provides the opportunity to make adjustments for tailoring it to the specific demands.
Cons:
● Restrictive character compared to the flexibility of a data extraction framework.
● Commercial distribution.

Table A 2: Pros and cons concerning different approaches

A.3 Operation mode

The location of the execution of data extraction can be distinguished into on-premise (server farm included), cloud, or a hybrid situation between these two.

In an on-premise mode the data extraction software runs standalone; no interference of other processes is involved. The advantage of this situation is basically that what you see is what you get: the data extraction user knows what happens with the data. The disadvantage is the amount of firepower available; the software only has local resources at its disposal.

When data extraction is performed in the cloud we are talking about providing data extraction as a service. In this case this does not concern outsourcing of the data extraction process (as mentioned in paragraph A.1) but basically running a data extraction tool remotely (for example Import.io, https://www.import.io). Running the data extraction process as a service has the main advantage that worrying about system resources and maintaining them is not necessary. This configuration will probably also contribute to scalability. The big disadvantage is the unknown origin of the extracted data, and the big question: what happens to 'my' data?

In a hybrid situation we are talking about a locally running process in combination with interaction over the internet; basically a distributed situation, in which the local data extraction software is able to spin up remote, cloud based data extraction tasks (e.g. https://nutch.apache.org).

Mode: On-premise
Pros:
● Transparent, tactile. It is easier to see what happens to the extracted data.
Cons:
● Limited resources.

Mode: Cloud
Pros:
● No burden of managing resources.
● Scalability.
● Availability.
Cons:
● Not able to guarantee what will happen with the extracted data.

Mode: Hybrid
Pros: see "Cloud"
Cons: see "Cloud"

Table A 3: Pros and cons concerning different operation modes

A.4 data extraction strategy

In relation to web data extraction different strategies are possible, each with their own pros and cons. This paragraph discusses the following strategies:
• Browser
• HTTP request

Browser
A popular data extraction strategy is to use a testing / scripting framework for crawling and extracting data from web pages. In fact, we are dealing with programmable browsers. Browser based data extraction comes in two flavours: a headless (like http://phantomjs.org/) and a head-full (like http://www.seleniumhq.org/projects/webdriver/) browser implementation. A headless implementation does not spin up a browser window but acts directly on the APIs of the underlying web page renderer. A head-full implementation does spin up a browser window. A big advantage of a headless implementation is that it can run in a continuous integration environment. Some say the performance is better than that of its head-full counterpart. A disadvantage is clearly the lack of transparency, because you do not see what is going on.

Basically, this strategy mimics the behaviour of an end user with all its advantages. This strategy executes dynamic content like JavaScript, which can then also be extracted. Another advantage is that it is hard to determine whether a robot or a real human being accesses the web page (genuine cookie and HTTP header behaviour). Of course, there are also disadvantages: compared to non-browser strategies, browser based data extraction is very heavy on resources. Also, it is said that data extraction can easily be detected by analytics tools (i.e. Google Analytics).

HTTP request
Another approach is to make use of HTTP requests / responses. This approach deals with HTTP requests on a lower level, basically mimicking (some of) the functionality of a browser. There are plenty of tools that use this strategy. The main advantage is the performance, contrary to the browser strategy. There are also concerns, for instance how realistic the results are that are extracted by making use of the HTTP approach; in other words, can these results also be obtained by an end user (with a browser)? Besides, it is harder to extract data from dynamic content: it first needs to be executed by the data extraction tool (if this is even possible). When you choose to create your own data extraction tool based on HTTP requests, there might be a steep learning curve. Bear in mind that it is not only about HTTP requests but also about site rendering, executing site JavaScript, etcetera.
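A minimal sketch of the HTTP request strategy, using only Node.js core modules (the URL and the extraction pattern are illustrative):

// Fetch the raw HTML of a page and extract data from it with a regular expression.
// Dynamic (JavaScript generated) content is not executed, which illustrates the
// main limitation of this strategy discussed above.
const https = require('https');

https.get('https://example.com/', function (res) {
    let body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        const titleMatch = body.match(/<title>(.*?)<\/title>/i);
        console.log(titleMatch ? titleMatch[1] : 'No title found');
    });
}).on('error', function (err) {
    console.error(err.message);
});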

Strategy: Browser - Headless
Pros:
● Mimics end user behaviour and is therefore realistic concerning the real world.
● Out of the box execution of dynamic content.
● Behaviour is hard to distinguish from that of a real user.
● Runs in a continuous integration environment.
● Better performance compared to a head-full browser.
Cons:
● Compared to a non-browser approach it needs a lot of resources.
● Heavy data extraction can be more easily detected (pages visited).
● Execution is not transparent. This makes development and debugging a bit harder compared to a head-full browser.

Strategy: Browser - Headful
Pros:
● Mimics end user behaviour.
● Out of the box execution of dynamic content.
● Behaviour is hard to distinguish from that of a real user.
● Execution is transparent, which makes development and bug finding easier compared to a headless browser.
Cons:
● Does not run in a continuous integration environment.
● Worse performance compared to a headless browser.

Strategy: HTTP requests
Pros:
● Performance.
Cons:
● Realism of the extracted data.
● Data extraction for dynamic content needs additional effort.
● When developing a data extraction tool based on this strategy, the following concerns need to be taken into consideration: the need for a good understanding of HTTP requests / responses in combination with other browser specific implementation techniques, and additional boilerplate code to mimic browser behaviour.

Table A 4: Pros and cons concerning different data extraction strategies


A.5 Extractable content

When describing a taxonomy of automatic data extraction methods, probably the most important aspect is the support for different types of content. There are automatic data extraction tools with support for static content only, as well as tools that support both static and dynamic content. Static content can be plain text, XHTML or HTML. Nowadays the most popular type of dynamic content is JavaScript, but the content of FLASH and IFRAME elements can also be fruitful to extract. The combination of product type and chosen strategy is mainly responsible for the supported types of content. This paragraph does not give an overview of the pros and cons of the specific extractable content types, basically because they depend on the intention of the scraping.

A.6 Supported technology

The supported technology determines the employability of an automatic data extraction tool. This paragraph does not give an exhaustive list of technologies. Two mainstream technologies that must be taken into consideration are HTTPS and CAPTCHAs. Support for HTTPS gives the tool the ability to extract data from secured (HTTPS) sites, which of course increases the scope enormously. An automatic extraction tool aims at gathering as much data as possible, but it could be blocked by CAPTCHAs if the tool does not provide support for handling them.


B. Details manual observation web-bot detection on StubHub.com

Figure B 2: Details manual observation web-bot detection on StubHub.com

Normal working Blocked web page content

*IP address has not been used for visiting StubHub.com for more than a week. **New VirtualBox image; on the first visit no cookies were present. [CONFIG1]: No automated software installed (Windows 10). Google Chrome and Firefox installed. [CONFIG2]: Selenium with Chrome WebDriver (Linux Ubuntu). Google Chrome and Firefox installed. [CONFIG3]: Selenium IDE Firefox plugin. Firefox and Google Chrome installed.


C. Deviating browser properties

In this appendix we describe the deviating browser properties that result from a comparison between standalone browser instances and web-bot driven browser instances (cf. paragraph 7.3). We describe the deviating browser properties per browser family classification (cf. chapter 6).

C.1 Blink + V8

This paragraph describes the deviating browser properties of web-bot driven browser instances compared against a standalone Google Chrome browser instance.

PhantomJS

userAgent
The userAgent string of the PhantomJS browser instance (illustrated in Code Snippet C- 1) differs from the Google Chrome implementation in three aspects: it cannot determine the Unix graphical system (X Window71), indicated as Unknown; the userAgent contains the literal PhantomJS/2.1.1 (which represents the browser and version); and the WebKit version is slightly older than that of the Google Chrome browser illustrated in Code Snippet C- 2.

Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1
Code Snippet C- 1: userAgent PhantomJS

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
Code Snippet C- 2: userAgent Google Chrome

Canvas
The rendering of the canvas DOM element differs in many ways from the canvas DOM element rendered by the Google Chrome browser instance.

Windows keys Missing Added origin webSecurity customElements missingBind external stackTrace onanimationend autoClosePopup onanimationiteration offscreenBuffering onanimationstart event isSecureContext webkitIndexedDB oncancel ontouchmove onclose ontouchstart oncuechange webkitNotifications ontoggle ontouchend onwheel console onauxclick ontouchcancel ongotpointercapture _phantom onlostpointercapture callPhantom onpointerdown getMatchedCSSRules onpointermove showModalDialog onpointerup webkitConvertPointFrom onpointercancel NodeToPage onpointerover webkitConvertPointFrom onpointerout PageToNode onpointerenter

71 https://www.computerhope.com/jargon/x/xwin.htm


onpointerleave webkitCancelRequestAni onafterprint mationFrame onbeforeprint onlanguagechange onmessageerror onrejectionhandled onunhandledrejection requestIdleCallback cancelIdleCallback createImageBitmap onappinstalled onbeforeinstallprompt caches ondevicemotion ondeviceorientation ondeviceorientationabsolute webkitStorageInfo fetch visualViewport speechSynthesis webkitRequestFileSystem webkitResolveLocalFileSystemURL chrome attr TEMPORARY PERSISTENT Table C- 1: Deviating window keys (PhantomJS browser instance)

Stack trace
The PhantomJS stack trace stands out from the stack traces of the other browsers, which are identical:

1. stackTraces@http://10.0.2.2:8080/index.html:124:21
2. global code@http://10.0.2.2:8080/index.html:7:39

Code Snippet C- 3: PhantomJS error stack trace

1. "TypeError: Cannot read property '0' of null
2.     at stackTraces (http://10.0.2.2:8080/index.html:124:21)
3.     at http://10.0.2.2:8080/index.html:7:28"
Code Snippet C- 4: Google Chrome error stack trace

Document keys Missing Added origin releaseCapture xmlEncoding mozSetImageElement xmlVersion mozCancelFullScreen xmlStandalone enableStyleSheetsForSet selectedStylesheetSet caretPositionFromPoint preferredStylesheetSet onbeforescriptexecute webkitVisibilityState onafterscriptexecute webkitHidden mozFullScreen onbeforecopy mozFullScreenEnabled onbeforecut mozFullScreenElement onbeforepaste selectedStyleSheetSet onsearch lastStyleSheetSet oncancel preferredStyleSheetSet oncuechange styleSheetSets


onmousewheel ondragexit webkitIsFullScreen onloadend webkitCurrentFullScreenElement onshow webkitFullscreenEnabled onmozfullscreenchange webkitFullscreenElement onmozfullscreenerror onwebkitfullscreenchange onanimationcancel onwebkitfullscreenerror onanimationend registerElement onanimationiteration caretRangeFromPoint onanimationstart webkitCancelFullScreen ontransitioncancel webkitExitFullscreen ontransitionend ontransitionrun ontransitionstart onwebkitanimationend onwebkitanimationiteration onwebkitanimationstart onwebkittransitionend Table C- 2: Deviating document keys (PhantomJS browser instance)

Navigator attributes Missing Added Modified maxTouchPoints vendor hardwareConcurrency appVersion doNotTrack userAgent geolocation mediaDevices connection webkitTemporaryStorage webkitPersistentStorage serviceWorker budget permissions presentation Table C- 3: Deviating navigator attributes (PhantomJS)

Websocket

readyState:0,onclose:null,bufferedAmount:0,extensions:,onerror:null,onopen:null,url:ws:// localhost:8080,binaryType:blob,protocol:,URL:ws://localhost:8080,onmessage:null,CLOSED:3, CLOSING:2,CONNECTING:0,send:function send() { [native code] },dispatchEvent:function dispatchEvent() { [native code] },addEventListener:function addEventListener() { [native code] },OPEN:1,close:function close() { [native code] },removeEventListener:function removeEventListener() { [native code] }

Code Snippet C- 5:Deviating WebSocket structure ( PhantomJS browser instance)

CONNECTING:0,OPEN:1,CLOSING:2,CLOSED:3,url:ws://localhost:8080/,readyState:0,bufferedAmou nt:0,onopen:null,onerror:null,onclose:null,extensions:,protocol:,onmessage:null,binaryTyp e:blob,close:function close() { [native code] },send:function send() { [native code] },addEventListener:function addEventListener() { [native code]


},removeEventListener:function removeEventListener() { [native code] },dispatchEvent:function dispatchEvent() { [native code] }

Code Snippet C- 6: Structure of the WebSocket object (Google Chrome browser instance)

Single browser property deviations
Browser property | Standalone browser instance value | Web-bot driven browser instance value
Color depth | 32 | 24
Resolution | 1920x975 | 1855x951
Available resolution | 1855x951 | 1024x768
Has lied OS* | False | True
Touch support** | 0, False, False | 0, True, True
Web security | True | False
Popup suppression | False | True
Table C- 4: Single browser property deviations (PhantomJS browser instance)

* The FingerprintJS library compares the user agent with the available plugins. The PhantomJS browser instance does not contain any plugins, and FingerprintJS considers a browser instance with no plugins as a browser instance that runs on Windows. This assumption is not correct and therefore we will not take this property into consideration; instead we focus on the individual User Agent and Plugins attributes. ** The first parameter represents the maxTouchPoints property of the global DOM navigator object. The second and third parameters represent respectively the ability to create a TouchEvent on the global DOM document object and the presence of the ontouchstart property on the global DOM window object.
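The touch support triple described above can be reproduced with the following sketch, based on the FingerprintJS approach of reading the three indicators from the navigator, document and window objects:

// Sketch of the three touch support indicators used by the fingerprinting library:
// maxTouchPoints, TouchEvent creation and the presence of ontouchstart.
function touchSupport() {
    var maxTouchPoints = navigator.maxTouchPoints || 0;
    var touchEventSupported = false;
    try {
        document.createEvent('TouchEvent');
        touchEventSupported = true;
    } catch (e) {
        // TouchEvent cannot be created in this browser instance.
    }
    var touchStartPresent = 'ontouchstart' in window;
    return [maxTouchPoints, touchEventSupported, touchStartPresent];
}

// A standalone desktop Chrome instance typically yields [0, false, false],
// whereas the PhantomJS instance in Table C-4 yields [0, true, true].
console.log(touchSupport());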

Selenium driven browser instances

User agent
When using Selenium to drive Google Chrome in headless mode, the user agent contains the literal 'HeadlessChrome', contrary to the user agent of Google Chrome described in Code Snippet C- 2.

Canvas
The rendering of the canvas DOM element differs in many ways from the canvas DOM element rendered by the Google Chrome browser instance.

Window keys
Browser instance | Missing window keys
Selenium Chrome (Headless) | chrome, attr
Selenium Chrome (Headful) | attr
Selenium Chromium (Headless) | chrome
Table C- 5: Deviating window keys (Selenium driven browser instances)

Document keys
Browser instance | Added document keys
Selenium Chrome (HL and HF), Selenium Chromium | $cdc_asdjflasutopfhvcZLmcfl_
Table C- 6: Deviating document keys (Selenium driven browser instances)

Navigator attributes
Browser instance | Added | Modified
Selenium Chrome (HL), Selenium Chromium (HL) | webdriver | appVersion, userAgent
Selenium Chrome (HF), Selenium Chromium (HF) | webdriver | -
Table C- 7: Deviating navigator attributes (Selenium driven browser instances)


MimeTypes

application/pdf, application/x-google-chrome-pdf, application/x-ppapi-widevine-cdm

Code Snippet C- 7: Deviating MimeTypes (Chromium headful browser instance)

application/futuresplash, application/pdf, application/x-google-chrome-pdf, application/x-nacl, application/x-pnacl, application/x-ppapi-widevine-cdm, application/x-shockwave-flash

Code Snippet C- 8: MimeTypes Google Chrome browser instance

Single browser property deviations
Browser property | Standalone browser instance value | Web-bot driven browser instance value
Resolution | 1920x975 | 640x480
Available resolution | 1855x951 | 640x480
Has lied languages* | False | True
Popup suppression | False | True
Table C- 8: Single browser property deviations (Selenium driven browser instances)

* The property indicates if the language of the browser UI corresponds with sequences of the available browser languages.

NightmareJS

User agent
The user agent of the NightmareJS browser instance contains the literal 'Electron'.

Canvas
The rendering of the canvas DOM element differs in many ways from the canvas DOM element rendered by the Google Chrome browser instance.

Plugins

Chromium PDF Viewer::Portable Document Format::application/x-google-chrome-pdf~pdf

Code Snippet C- 9: Detectable plugins NightmareJS browser instance

Chrome PDF Plugin::Portable Document Format::application/x-google-chrome-pdf~pdf, Chrome PDF Viewer::::application/pdf~pdf, Native Client::::application/x-nacl~,application/x-pnacl~, Shockwave Flash::Shockwave Flash 29.0 r0::application/x-shockwave-flash~swf,application/futuresplash~spl, Widevine Content Decryption Module::Enables Widevine licenses for playback of HTML audio/video content. (version: 1.4.9.1070)::application/x-ppapi-widevine-cdm~

Code Snippet C- 10: Detectable plugins Google Chrome browser instance

Window keys
Missing: onafterprint, onbeforeprint, onmessageerror, onappinstalled, onbeforeinstallprompt, visualViewport, chrome, attr
Added: getMatchedCSSRules, console, __nightmare, onshow
Table C- 9: Deviating window keys (NightmareJS browser instance)

Document keys
Missing: onvisibilitychange
Added: onshow
Table C- 10: Deviating document keys (NightmareJS browser instance)

Navigator keys
Missing: connection, budget
Added: credentials
Modified: appVersion, userAgent
Table C- 11: Deviating navigator keys (NightmareJS browser instance)

MimeTypes

application/x-google-chrome-pdf

Table C- 12: Deviating MimeTypes (NightmareJS browser instance)

WebSocket

close:function close() { [native code] },send:function send() { [native code] },url:ws://localhost:8080/,readyState:0,bufferedAmount:0,onopen:null,onerror:null,onclose :null,extensions:,protocol:,onmessage:null,binaryType:blob,CONNECTING:0,OPEN:1,CLOSING:2, CLOSED:3,addEventListener:function addEventListener() { [native code] },removeEventListener:function removeEventListener() { [native code] },dispatchEvent:function dispatchEvent() { [native code] }

Code Snippet C- 11: Deviating WebSocket structure (NightmareJS browser instance)

Single browser property deviations
Browser property | Standalone browser instance value | Web-bot driven browser instance value
Has lied languages* | False | True
Popup suppression | False | True
Table C- 13: Single browser property deviations (NightmareJS browser instance)

* The property indicates if the language of the browser UI corresponds with sequences of the available browser languages.

Automated Google Chrome / Chromium

User agent
The userAgent string of the headless automated Google Chrome browser instance contains the literal 'HeadlessChrome'.

Canvas
The rendering of the canvas DOM element differs in many ways from the canvas DOM element rendered by the Google Chrome browser instance.


Window keys
Browser instance | Missing window keys
Automated Chrome (HL) | chrome, attr
Automated Chrome (HF) | attr
Automated Chromium (HL) | chrome
Table C- 14: Deviating window keys (Automated Google Chrome / Chromium browser instance)

Navigator attributes
Browser instance | Added | Modified
Automated Chrome (HL), Automated Chromium (HL) | webdriver | appVersion, userAgent
Automated Chrome (HF), Automated Chromium (HF) | webdriver | -
Table C- 15: Deviating navigator attributes (Automated Google Chrome / Chromium browser instance)

MimeTypes

application/pdf, application/x-google-chrome-pdf, application/x-ppapi-widevine-cdm

Code Snippet C- 12: Deviating MimeTypes (Headful Chromium browser instance)

application/futuresplash, application/pdf, application/x-google-chrome-pdf, application/x-nacl, application/x-pnacl, application/x-ppapi-widevine-cdm, application/x-shockwave-flash

Code Snippet C- 13: MimeTypes (standalone Google Chrome / Chromium browser instance)

Single browser property deviations
Browser property | Standalone browser instance value | Web-bot driven browser instance value
Resolution | 1920x975 | 800x600
Available resolution | 1920x975 | 800x600
Has lied languages* | False | True
Popup suppression | False | True
Table C- 16: Single browser property deviations (Automated Google Chrome / Chromium browser instance)

* The property indicates if the language of the browser UI corresponds with sequences of the available browser languages.

C.2 Gecko + Spidermonkey

This paragraph describes the deviating browser properties of web-bot driven browser instances compared against a standalone Mozilla Firefox browser instance.

Single browser property deviations
Browser property | Standalone browser instance value | Web-bot driven browser instance value
Resolution | 1920x975 | 1366x768
Available resolution | 1920x975 | 1366x768
Table C- 17: Single browser property deviations (Selenium + Firefox browser instance)


C.3 Trident + JScript

This paragraph describes the deviating browser properties of web-bot driven browser instances compared against a standalone Internet Explorer browser instance.

User agent
The userAgent property of the Selenium driven browser instance is slightly different from that of the standalone browser instance, as illustrated in Code Snippet C- 14 and Code Snippet C- 15. The WOW64 value stands for Windows On Windows, which represents an emulator that runs on a Windows 64-bit operating system72.

Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; rv:11.0) like Gecko

Code Snippet C- 14: Deviating userAgent (Selenium + Internet Explorer browser instance)

Mozilla/5.0 (Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; rv:11.0) like Gecko

Code Snippet C- 15: userAgent property (standalone Internet Explorer browser instance)

CPU class / Navigator platform
The CPU class differs from the standalone Internet Explorer browser as it indicates x86 instead of x64. The reason for the difference is the 32-bit browser WebDriver build that is used to improve the speed of the Selenium installation; we encountered some problems with the 64-bit version in combination with entering text in the input box of our test page. The same argument holds for the difference in the navigator_platform value.

Window keys
Missing: __BROWSERTOOLS_CONSOLE_SAFEFUNC, __BROWSERTOOLS_CONSOLE_BREAKMODE_FUNC, __BROWSERTOOLS_CONSOLE, __BROWSERTOOLS_EMULATIONTOOLS_ADDED, __BROWSERTOOLS_DOMEXPLORER_ADDED, __BROWSERTOOLS_MEMORYANALYZER_ADDED, $0, $1, $2, $3, $4, __BROWSERTOOLS_NETWORK_TOOL_ADDED, __BROWSERTOOLS_DEBUGGER
Code Snippet C- 16: Deviating window keys (Selenium + Internet Explorer browser instance)

72 http://www.thundercloud.net/infoave/new/what-is-wow64/


Document keys
Missing: __IE_DEVTOOLBAR_CONSOLE_EVAL_ERROR, __IE_DEVTOOLBAR_CONSOLE_EVAL_ERRORCODE
Added: __webdriver_script_fn
Code Snippet C- 17: Deviating document keys (Selenium + Internet Explorer browser instance)

C.4 EdgeHTML + Chakra

This paragraph describes the deviating browser properties of web-bot driven browser instances compared against a standalone Edge browser instance.

Single browser property deviations
Browser property | Standalone browser instance value | Web-bot driven browser instance value
Popup suppression | False | True
Code Snippet C- 18: Single browser property deviations (Selenium + Edge browser instance)

Navigator attributes
Browser instance | Added navigator keys
Selenium Edge | webdriver
Code Snippet C- 19: Deviating navigator attributes (Selenium + Edge browser instance)


D. Code contributions

Below all code contributions are listed, related to the corresponding research phases described in chapter 4.

D.1 Analysis of a client side web-bot detection implementation

Application: Chromium project. URL: https://chromium.googlesource.com/chromium Code file:

File | Lines of code (LOC) | Description
/src/chrome/test/chromedriver/js | 9 | Replaces the static value of the DOM global document object key '$cdc_asdjflasutopfhvcZLmcfl_' with a randomly generated constant value. Its purpose is to avoid the web-bot detection implemented on StubHub.com.
D 1: Modified code file used during the phase: analysis of a client side web-bot detection implementation

D.2 Determining the web-bot fingerprint surface

Application: Web site dedicated to extract browser properties by making use of passive and active fingerprinting. URL: https://github.com/GabryVlot/BrowserBasedBotFP Code files:

Files | Lines of code (LOC) | Description
/settings. | 1 | Contains the root path of the application, used as a folder reference for persisting the browser properties.
/tests/diffFP_test.js | 10 | Contains a unit test to test the delta comparison function between the properties of two different browsers.
/test/test_data.js | 659 | Contains the test data for the unit test in /tests/diffFP_test.js.
/server/fingerprint.js | 129 | Responsible for extracting FP properties from the HTTP body originating from the client and persisting them into the DB.
/server/server.js | 61 | Contains the Express web server configuration and initialisation, including routes.
/public//index.css | 2 | CSS markup for the homepage of the web site used to extract the browser properties.
/public/js/fingerprint.js | 141 | Extracts the browser fingerprint from global DOM objects.
/public/js/utils.s | 81 | Contains general functions to aid the abstraction of the browser fingerprint.
/public/swf/swfobject.js | 6 | SWF object to include on the web site to test Flash specific browser properties.
/public/FontList.swf | * | SWF object to include on the web site to test Flash specific font browser properties.

92

Appendix D: Code contributions

/db/db-utils.js 156 Responsible for the creation of the DB structure and making the DB connection client/index.html 65 Home page. analyse/diffFP.js 220 Compares FP results based upon their IDs and outputs the differences. D 2: Modified code files used during the phase: Determining the web-bot fingerprint surface D.3 Measuring the adoption of web-bot detection on the internet
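The delta comparison performed by analyse/diffFP.js essentially classifies the keys of two recorded property sets as missing, added or modified. A minimal sketch of that idea is shown below; it is written in Python purely for illustration (the actual implementation is the JavaScript file listed above) and the example values are hypothetical.

# Minimal sketch of a delta comparison between the browser properties of a
# standalone and a web-bot driven browser instance (illustrative; the thesis
# implementation is analyse/diffFP.js).
def diff_fingerprints(standalone: dict, web_bot: dict) -> dict:
    missing = sorted(k for k in standalone if k not in web_bot)
    added = sorted(k for k in web_bot if k not in standalone)
    modified = sorted(
        k for k in standalone
        if k in web_bot and standalone[k] != web_bot[k]
    )
    return {"missing": missing, "added": added, "modified": modified}

# Hypothetical example mirroring the Selenium + Edge deviation of Code Snippet C-19.
standalone_props = {"userAgent": "Mozilla/5.0 ...", "platform": "Win32"}
web_bot_props = {"userAgent": "Mozilla/5.0 ...", "platform": "Win32", "webdriver": True}
print(diff_fingerprints(standalone_props, web_bot_props))
# {'missing': [], 'added': ['webdriver'], 'modified': []}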

D.3 Measuring the adoption of web-bot detection on the internet

Application: Web-bot detection scanner. Measures the adoption of web-bot detection on the internet by making use of detection patterns based upon a web-bot fingerprint surface.
URL: https://github.com/GabryVlot/BrowserBasedBotFP
Code files:

Files | Lines of code (LOC) | Description
/scan.py | 65 | Initiates browsers and browser commands. Responsible for looping over the Alexa top 1 million by making use of multiple browsers running in parallel.
/scanLocal.py | 55 | File dedicated to 'offline' testing of a JavaScript file without using a browser, while still using the full web-bot detection mechanism.
/detection/Script.py | 63 | Entity: represents a web page script.
/detection/Scanner.py | 124 | Extracts inline and external script code from page content and analyses the data for bot detection inclusions.
/detection/RegisteredDetectionTopic.py | 17 | Entity: represents a detection topic, e.g. BotDetectionProperties_selenium, that aggregates the found patterns in this category as well as the total detection score.
/detection/PatternChecker.py | 78 | Analyses script content by making use of regular expressions.
/detection/FileManager.py | 171 | Responsible for all file operations (downloading, de-obfuscating and writing to disk).
/detection/DetectionPatternFactory.py | 15 | Responsible for the creation of a DetectionPattern instance.
/detection/DetectionPattern.py | 16 | Embodies a detection pattern that will be used to identify web-bot detection scripts / sites.
/detection/BotDetectionValueMangaer.py | 39 | Responsible for calculating the bot detection value.
/detection/validation/*.csv | * | Contains 25 files with comma separated values that can be used for validation of the web-bot detection scanner. All csv files contain web site URLs of cyber security companies that show signs of web-bot detection.
/detecting/tests/decompress.csv | * | File containing comma separated values of web site URLs used for testing decompression of HTTP body content.
/detection/examples | * | Contains 25 folders of identified web-bot detection implementations per library / detector.
/detection/detectionPatterns/ManuallyFoundLiterarls.py | 67 | Contains detection patterns that repeat across different sites.
/detection/detectionPatterns/DetectorPatterns.py | 29 | Contains detection patterns based on found web-bot detection script inclusions.
/detection/detectionPatterns/BrowserFingerprints.py | 65 | Contains detection patterns based on web-bot specific browser properties (characteristics).
/detection/detectionPatterns/BotFingerprintingSurface.py | 49 | Contains detection patterns based on web-bot specific browser properties.
/detection/detectionPatterns/db/DB.py | 98 | Responsible for persisting scripts and detection patterns.
/detection/configuration/config.json | 7 | Contains the configuration of excludeFiles, like for instance jquery.
/detection/alexa/top-1m.csv | * | Contains the top 1 million web sites that we used for measuring the adoption of web-bot detection on the internet.

D 3: Measuring the adoption of web-bot detection on the internet
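The scanner's pattern checking boils down to matching regular expressions for web-bot specific literals against downloaded script content and accumulating a score per script inclusion. The sketch below illustrates that idea; the patterns, weights and the detection_score function are illustrative examples and do not reproduce the actual configuration of PatternChecker.py or BotDetectionValueMangaer.py.

import re

# Simplified sketch of regex-based web-bot detection pattern checking
# (illustrative; patterns and weights below are example values, not the
# thesis configuration).
DETECTION_PATTERNS = {
    r"\$cdc_[a-zA-Z0-9]+": 3,        # ChromeDriver document key
    r"navigator\.webdriver": 2,      # WebDriver flag on the navigator object
    r"__webdriver_script_fn": 3,     # Selenium-specific document key
    r"callPhantom|_phantom": 2,      # PhantomJS globals
}

def detection_score(script_content: str) -> int:
    """Return the accumulated detection score for one script inclusion."""
    score = 0
    for pattern, weight in DETECTION_PATTERNS.items():
        if re.search(pattern, script_content):
            score += weight
    return score

example = "if (navigator.webdriver || window.callPhantom) { blockVisitor(); }"
print(detection_score(example))  # 4 with the example weights above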



E. Observed web page deviations

In this appendix we give an overview of observed deviations, per deviation type, when visiting web-bot detection sites with a web-bot browser instance (cf. chapter 9). Where applicable, the left side of an illustration represents the page visit by a standalone browser and the right side represents the web page visit by a web-bot browser instance.

E.1 Blocked content

E 1: Blocked content by Captcha on http://diabetesincontrol.com

E 2: Blocked content by message on http://eiendomspriser.no

E.2 Deviation in the content of the web page

E 3: Video cannot be loaded on http://321kochen.tv



E 4: Deviation in web page layout on http://bellechic.com

E 5: Advertisements are not loaded on http://affairesdegars.com

E 6: Login dialog is not visible on http://kiyu.tw




Glossary

Browser characteristics: Characteristics of the browser, like for instance installed plugins.

Browser fingerprint: Browser property that can be used to uniquely identify a browser instance.

Browser family fingerprint: Combination of layout engine and JavaScript engine that can uniquely identify a browser family.

Browser properties: All properties exposed by a browser instance through the global DOM objects window, navigator and document.

ChromeDriver: A standalone server which implements WebDriver's wire protocol for Chromium.

Compare configurations: Configurations used to compare the browser properties of a web-bot driven and a standalone browser instance.

De-obfuscation: Parsing of an obfuscated string into a readable UTF-8 format.

Detection pattern: A pattern that is used to calculate the detection score and to find web-bot detection script inclusions.

Deviation: An observable difference between a web page visited by a web bot and the same web page visited by a human being.

Global DOM object: Global window, navigator or document object exposed by the DOM API.

Standalone browser: Browser driven by a human being.

Server host: Owner of a web site or owner of a library included on the web site.

Visitor class: Class used to indicate a visitor (web bot or human).

Web-bot: Automated agent that uses specific software to extract data from web pages.

Web-bot based studies: Studies that employ a web bot to automatically extract data from web pages.

Web-bot detection: Detection of a web bot by a web site.

Web-bot detection level: Indicates the level of web-bot detection per script inclusion.

Web-bot fingerprinting surface: Specific browser properties that can be used to detect a web bot.

Web-bot detection sites: Web sites that contain scripts with web-bot detection code inclusions.

Web-bot detector: Stakeholder that is interested in the detection of web bots and is able to detect them.

Web-bot fingerprint: Browser properties that are strongly related and only exist when a browser instance is driven by a web bot.

WebDriver: An open source tool for automated testing of web apps across many browsers.
