Automated Data Extraction; What You See Might Not Be What You Get

Total Page:16

File Type:pdf, Size:1020Kb

Automated Data Extraction; What You See Might Not Be What You Get Automated data extraction; what you see might not be what you get Detecting web-bot detection Master Thesis by G. Vlot In partial fulfilment for the degree of Master of Science in Software Engineering at the Open University, faculty of Management, Science and Technology Master Software Engineering Student number: 851708682 Course: AF-SE IM9906 Date: July 5th, 2018 Version: 1.0 Chairman and supervisor: Dr. Ir. Hugo Jonker [email protected] Open University Second supervisor: Dr. Greg Alpár [email protected] Open University 1 Table of contents List of tables ........................................................................................................................................... 4 Code snippet index ................................................................................................................................. 4 Illustration index .................................................................................................................................... 5 Diagram index ........................................................................................................................................ 5 Abstract .................................................................................................................................................. 6 1 Introduction .................................................................................................................................... 7 2 Background .................................................................................................................................. 11 2.1 Web-bot detection ................................................................................................................ 11 2.1.1 Web-bot detection on different implementation levels ................................................. 11 2.1.2 Page deviations caused by web-bot detection ............................................................... 13 2.1.3 Web-bot detection by commercial companies .............................................................. 14 2.2 Types of web bots ................................................................................................................. 15 2.3 Data extraction tactics .......................................................................................................... 16 2.4 Studies based on automated data extraction.......................................................................... 17 3 Related work ................................................................................................................................ 20 4 Methodology ................................................................................................................................ 22 5 Analysis web-bot detection implementation ................................................................................. 25 5.1 Manual observation: browser specific web-bot detection ..................................................... 25 5.1.1 Configuration ................................................................................................................ 25 5.1.2 Analysis ........................................................................................................................ 26 5.2 Web-bot detection based on browser properties ................................................................... 27 5.2.1 The observation of browser properties .......................................................................... 27 5.2.2 Web-bot detection based on client server communication ............................................ 32 5.3 Validation of the client side web-bot detection ..................................................................... 35 5.4 Summary .............................................................................................................................. 36 6 Browser family fingerprint classification ..................................................................................... 37 6.1 Browser families ................................................................................................................... 37 6.2 Browser based web bots ....................................................................................................... 38 6.3 Browser family classification ............................................................................................... 40 7 Determining the web-bot fingerprint surface ................................................................................ 41 7.1 Relevant browser properties ................................................................................................. 41 7.2 Approach: obtaining deviating browser properties ............................................................... 42 7.2.1 Design improvements ................................................................................................... 45 7.2.2 Limitation ..................................................................................................................... 46 7.3 Deviating browser properties ................................................................................................ 46 2 7.4 Web-bot fingerprint surface .................................................................................................. 51 8 The adoption of web-bot detection on the internet ....................................................................... 56 8.1 Design and implementation web-bot detection scanner ........................................................ 56 8.1.1 Application flow ........................................................................................................... 57 8.1.2 Detection patterns and web-bot detection score calculation .......................................... 59 8.1.3 Validation ..................................................................................................................... 63 8.2 Evaluation............................................................................................................................. 64 8.3 Improvements ....................................................................................................................... 66 8.4 Limitations and risk .............................................................................................................. 67 9 Observing deviations .................................................................................................................... 68 9.1 Approach .............................................................................................................................. 68 9.2 Different types of deviations ................................................................................................ 69 10 Conclusion, discussion and future work ................................................................................... 71 10.1 Discussion ............................................................................................................................ 74 10.2 Future work .......................................................................................................................... 75 Appendices ........................................................................................................................................... 76 A. Taxonomy of automatic data extraction methods ..................................................................... 76 A.1 Management of the data extraction process ...................................................................... 76 A.2 Approach .......................................................................................................................... 77 A.3 Operation mode ................................................................................................................ 78 A.4 data extraction strategy ..................................................................................................... 79 A.5 Extractable content ........................................................................................................... 81 A.6 Supported technology ....................................................................................................... 81 B. Details manual observation web-bot detection on StubHub.com .............................................. 82 C. Deviating browser properties .................................................................................................... 83 C.1 Blink + V8 ........................................................................................................................ 83 C.2 Gecko + Spidermonkey .................................................................................................... 89 C.3 Trident + JScript ............................................................................................................... 90 C.4 EdgeHTML + Chakra ....................................................................................................... 91 D. Code contributions ................................................................................................................... 92 D.1 Analysis of a client side web-bot detection implementation ............................................. 92 D.2 Determining the web-bot fingerprint surface .................................................................... 92 D.3 Measuring the adoption of web-bot detection on the internet ..........................................
Recommended publications
  • Automated Testing Clinic Follow-Up: Capybara-Webkit Vs. Poltergeist/Phantomjs | Engineering in Focus
    Automated Testing Clinic follow-up: capybara-webkit vs. polter... https://behindthefandoor.wordpress.com/2014/03/02/automated-... Engineering in Focus the Fandor engineering blog Automated Testing Clinic follow-up: capybara-webkit vs. poltergeist/PhantomJS with 2 comments In my presentation at the February Automated Testing SF meetup I (Dave Schweisguth) noted some problems with Fandor’s testing setup and that we were working to fix them. Here’s an update on our progress. The root cause of several of our problems was that some of the almost 100 @javascript scenarios in our Cucumber test suite weren’t running reliably. They failed occasionally regardless of environment, they failed more on slower CPUs (e.g. MacBook Pros only a couple of years old), when they failed they sometimes hung forever, and when we killed them they left behind webkit-server processes (we were using the capybara-webkit driver) which, if not cleaned up, would poison subsequent runs. Although we’ve gotten pretty good at fixing flaky Cucumber scenarios, we’d been stumped on this little handful. We gave up, tagged them @non_ci and excluded them from our build. But they were important scenarios, so we had to run them manually before deploying. (We weren’t going to just not run them: some of those scenarios tested our subscription process, and we would be fools to deploy a build that for all we knew wouldn’t allow new users to subscribe to Fandor!) That made our release process slower and more error-prone. It occurred to me that I could patch the patch and change our deployment process to require that the @non_ci scenarios had been run (by adding a git tag when those scenarios were run and checking for it when deploying), but before I could put that in to play a new problem appeared.
    [Show full text]
  • In This Issue Monthly Meeting
    Monthly Meeting January 28, 2004 The Apple Store Westfarms Mall Panther demo, hands-on G5 trials, great deals, etc. NEWSLETTER OF CONNECTICUT MACINTOSH CONNECTION, INC.JANUARY, 2004 Danger! iPod Could instead of making her surface was clear. Inspection of the wait to Christmas for it. car revealed the side walls on both Be Hazardous To After all, if I didn’t, I passenger tires were torn, and one rim Your Health! would have to burn 25 was badly chewed up. She had Mouse Tales CDs so she could lis- obviously tangled both right wheels By Don Dickey, president ten to the new book! with the curb, but why? Answer: iPod There was a single distraction. Whenever a good deal condition to my gift, appears, I often call Joe Arcuri however. Before shelling out $640 for a new and ask him to “talk me out of chrome plated alloy rim and half that it” if he can. He sometimes does the same with me. The iPod I ordered for a pair of new tires, I realized just Simultaneous failures arrived a couple of how lucky we were. This was a lesson led us to both purchase Umax days before Joe’s, so she walked away from. Had it clones and scanners, Wallstreet one morning I met him and his daugh- happened on Interstate 91 at 65 miles PowerBook G3s, Toshiba M4 digital ter Savannah for breakfast and per hour, things could have been cameras, PowerBook G4s, PowerBoy brought along the iPod to show him. much more tragic, to say the least.
    [Show full text]
  • Between Enforcement and Regulation
    Katharina Voss Between Enforcement and Regulation A Study of the System of Case Resolution Mechanisms Used by the Between Enforcement and Regulation Between European Commission in the Enforcement of Articles 101 and 102 TFEU Katharina Voss ISBN 978-91-7797-570-0 Department of Law Doctoral Thesis in European Law at Stockholm University, Sweden 2019 Between Enforcement and Regulation A Study of the System of Case Resolution Mechanisms Used by the European Commission in the Enforcement of Articles 101 and 102 TFEU Katharina Voss Academic dissertation for the Degree of Doctor of Laws in European Law at Stockholm University to be publicly defended on Friday 12 April 2019 at 10.00 in Nordenskiöldsalen, Geovetenskapens hus, Svante Arrhenius väg 12. Abstract This thesis examines the current design of the system of case resolution mechanisms used by the European Commission (the Commission) where an infringement of Articles 101 and 102 TFEU is suspected and advances some proposals regarding this design. Infringements of Articles 101 and 102 TFEU cause considerable damage to the EU economy and ultimately, to consumers. Despite intensified enforcement of Articles 101 and 102 TFEU and ever-growing fines imposed for such infringements, the Commission continues to discover new infringements, which indicates a widespread non-compliance with EU competition rules. This raises the question of whether the enforcement currently carried out by the Commission is suitable for achieving compliance with Articles 101 and 102 TFEU. The thesis is divided into four main parts: First, the objectives pursued by the system of case resolution mechanisms used by the Commission are identified.
    [Show full text]
  • The Fourth Paradigm
    ABOUT THE FOURTH PARADIGM This book presents the first broad look at the rapidly emerging field of data- THE FOUR intensive science, with the goal of influencing the worldwide scientific and com- puting research communities and inspiring the next generation of scientists. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud- computing technologies. This collection of essays expands on the vision of pio- T neering computer scientist Jim Gray for a new, fourth paradigm of discovery based H PARADIGM on data-intensive science and offers insights into how it can be fully realized. “The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.” —Bill GaTES “I often tell people working in eScience that they aren’t in this field because they are visionaries or super-intelligent—it’s because they care about science The and they are alive now. It is about technology changing the world, and science taking advantage of it, to do more and do better.” —RhyS FRANCIS, AUSTRALIAN eRESEARCH INFRASTRUCTURE COUNCIL F OURTH “One of the greatest challenges for 21st-century science is how we respond to this new era of data-intensive
    [Show full text]
  • Web Browser a C-Class Article from Wikipedia, the Free Encyclopedia
    Web browser A C-class article from Wikipedia, the free encyclopedia A web browser or Internet browser is a software application for retrieving, presenting, and traversing information resources on the World Wide Web. An information resource is identified by a Uniform Resource Identifier (URI) and may be a web page, image, video, or other piece of content.[1] Hyperlinks present in resources enable users to easily navigate their browsers to related resources. Although browsers are primarily intended to access the World Wide Web, they can also be used to access information provided by Web servers in private networks or files in file systems. Some browsers can also be used to save information resources to file systems. Contents 1 History 2 Function 3 Features 3.1 User interface 3.2 Privacy and security 3.3 Standards support 4 See also 5 References 6 External links History Main article: History of the web browser The history of the Web browser dates back in to the late 1980s, when a variety of technologies laid the foundation for the first Web browser, WorldWideWeb, by Tim Berners-Lee in 1991. That browser brought together a variety of existing and new software and hardware technologies. Ted Nelson and Douglas Engelbart developed the concept of hypertext long before Berners-Lee and CERN. It became the core of the World Wide Web. Berners-Lee does acknowledge Engelbart's contribution. The introduction of the NCSA Mosaic Web browser in 1993 – one of the first graphical Web browsers – led to an explosion in Web use. Marc Andreessen, the leader of the Mosaic team at NCSA, soon started his own company, named Netscape, and released the Mosaic-influenced Netscape Navigator in 1994, which quickly became the world's most popular browser, accounting for 90% of all Web use at its peak (see usage share of web browsers).
    [Show full text]
  • Automated Testing of Your Corporate Website from Multiple Countries with Selenium Contents
    presents Automated Testing of Your Corporate Website from Multiple Countries with Selenium Contents 1. Summary 2. Introduction 3. The Challenges 4. Components of a Solution 5. Steps 6. Working Demo 7. Conclusion 8. Questions & Answers Summary Because of the complexities involved in testing large corporate websites and ecommerce stores from multiple countries, test automation is a must for every web and ecommerce team. Selenium is the most popular, straightforward, and reliable test automation framework with the largest developer community on the market. This white paper details how Selenium can be integrated with a worldwide proxy network to verify website availability, performance, and correctness on a continuous basis. Introduction Modern enterprise web development teams face a number of challenges when they must support access to their website from multiple countries. These challenges include verifying availability, verifying performance, and verifying content correctness on a daily basis. Website content is presented in different languages, website visitors use different browsers and operating systems, and ecommerce carts must comprehend different products and currencies. Because of these complexities involved, instituting automated tests via a test automation framework is the only feasible method of verifying all of these aspects in a repeatable and regular fashion. Why automate tests? Every company tests its products before releasing them to their customers. This process usually involves hiring quality assurance engineers and assigning them to test the product manually before any release. Manual testing is a long process that requires time, attention, and resources in order to validate the products’ quality. The more complex the product is, the more important, complex, and time- consuming the quality assurance process is, and therefore the higher the demand for significant resources.
    [Show full text]
  • OECD‘S Directorate for Science Technology and Industry
    THE ECONOMIC AND SOCIAL ROLE OF INTERNET INTERMEDIARIES APRIL 2010 2 FOREWORD FOREWORD This report is Part I of the larger project on Internet intermediaries. It develops a common definition and understanding of what Internet intermediaries are, of their economic function and economic models, of recent market developments, and discusses the economic and social uses that these actors satisfy. The overall goal of the horizontal report of the Committee for Information, Computer and Communications Policy (ICCP) is to obtain a comprehensive view of Internet intermediaries, their economic and social function, development and prospects, benefits and costs, and responsibilities. It corresponds to the item on 'Forging Partnerships for Advancing Policy Objectives for the Internet Economy' in the Committee‘s work programme. This report was prepared by Ms. Karine Perset of the OECD‘s Directorate for Science Technology and Industry. It was declassified by the ICCP Committee at its 59th Session in March 2010. It was originally issued under the code DSTI/ICCP(2009)9/FINAL. Issued under the responsibility of the Secretary-General of the OECD. The opinions expressed and arguments employed herein do not necessarily reflect the official views of the OECD member countries. ORGANISATION FOR ECONOMIC CO-OPERATION AND DEVELOPMENT The OECD is a unique forum where the governments of 30 democracies work together to address the economic, social and environmental challenges of globalisation. The OECD is also at the forefront of efforts to understand and to help governments respond to new developments and concerns, such as corporate governance, the information economy and the challenges of an ageing population.
    [Show full text]
  • Casperjs Documentation Release 1.1.0-DEV
    CasperJs Documentation Release 1.1.0-DEV Nicolas Perriault Sep 13, 2018 Contents 1 Installation 3 1.1 Prerequisites...............................................3 1.2 Installing from Homebrew (OSX)....................................4 1.3 Installing from npm...........................................4 1.4 Installing from git............................................4 1.5 Installing from an archive........................................5 1.6 CasperJS on Windows..........................................5 1.7 Known Bugs & Limitations.......................................6 2 Quickstart 7 2.1 A minimal scraping script........................................7 2.2 Now let’s scrape Google!........................................8 2.3 CoffeeScript version...........................................9 2.4 A minimal testing script......................................... 10 3 Using the command line 11 3.1 casperjs native options.......................................... 12 3.2 Raw parameter values.......................................... 13 4 Selectors 15 4.1 CSS3................................................... 15 4.2 XPath................................................... 16 5 Testing 17 5.1 Unit testing................................................ 17 5.2 Browser tests............................................... 18 5.3 Setting Casper options in the test environment............................. 19 5.4 Advanced techniques........................................... 20 5.5 Test command args and options....................................
    [Show full text]
  • Protecting the Crown: a Century of Resource Management in Glacier National Park
    Protecting the Crown A Century of Resource Management in Glacier National Park Rocky Mountains Cooperative Ecosystem Studies Unit (RM-CESU) RM-CESU Cooperative Agreement H2380040001 (WASO) RM-CESU Task Agreement J1434080053 Theodore Catton, Principal Investigator University of Montana Department of History Missoula, Montana 59812 Diane Krahe, Researcher University of Montana Department of History Missoula, Montana 59812 Deirdre K. Shaw NPS Key Official and Curator Glacier National Park West Glacier, Montana 59936 June 2011 Table of Contents List of Maps and Photographs v Introduction: Protecting the Crown 1 Chapter 1: A Homeland and a Frontier 5 Chapter 2: A Reservoir of Nature 23 Chapter 3: A Complete Sanctuary 57 Chapter 4: A Vignette of Primitive America 103 Chapter 5: A Sustainable Ecosystem 179 Conclusion: Preserving Different Natures 245 Bibliography 249 Index 261 List of Maps and Photographs MAPS Glacier National Park 22 Threats to Glacier National Park 168 PHOTOGRAPHS Cover - hikers going to Grinnell Glacier, 1930s, HPC 001581 Introduction – Three buses on Going-to-the-Sun Road, 1937, GNPA 11829 1 1.1 Two Cultural Legacies – McDonald family, GNPA 64 5 1.2 Indian Use and Occupancy – unidentified couple by lake, GNPA 24 7 1.3 Scientific Exploration – George B. Grinnell, Web 12 1.4 New Forms of Resource Use – group with stringer of fish, GNPA 551 14 2.1 A Foundation in Law – ranger at check station, GNPA 2874 23 2.2 An Emphasis on Law Enforcement – two park employees on hotel porch, 1915 HPC 001037 25 2.3 Stocking the Park – men with dead mountain lions, GNPA 9199 31 2.4 Balancing Preservation and Use – road-building contractors, 1924, GNPA 304 40 2.5 Forest Protection – Half Moon Fire, 1929, GNPA 11818 45 2.6 Properties on Lake McDonald – cabin in Apgar, Web 54 3.1 A Background of Construction – gas shovel, GTSR, 1937, GNPA 11647 57 3.2 Wildlife Studies in the 1930s – George M.
    [Show full text]
  • Open Dissertationdraft3.Pdf
    The Pennsylvania State University The Graduate School College of Information Sciences and Technology SUPPORTING THE USER EXPERIENCE IN FREE/LIBRE/OPEN SOURCE SOFTWARE DEVELOPMENT A Dissertation in Information Sciences and Technology by Paula M. Bach © 2009 Paula M. Bach Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy August 2009 The dissertation of Paula M. Bach was reviewed and approved* by the following: John M. Carroll Professor of Information Sciences and Technology Dissertation Advisor Chair of Committee Mary Beth Rosson Professor of Information Sciences and Technology Andrea Tapia Assistant Professor of Information Sciences and Technology Allison Carr-Chellman Professor of Instructional Systems Steven Haynes Professor of Practice of Information Sciences and Technology John Yen Professor of Information Sciences and Technology Associate Dean for Research and Graduate Programs *Signatures are on file in the Graduate School iii ABSTRACT With the increasing number and awareness of free/libre/open source software (FLOSS) projects, Internet users can download a FLOSS tool that meets just about any need. The user experience of projects, however, varies greatly and identifying FLOSS projects that offer a positive user experience (UX) is challenging. FLOSS projects center on software developer activities with little attention to user-centered design activities that could increase the user experience on the project. The purpose of this dissertation is to understand open source software ecology in order to bring support for user experience design activities on FLOSS projects. CodePlex, an open source project hosting website, serves as the open source software ecology. The research consists of two phases, a descriptive science phase and a design science phase.
    [Show full text]
  • Advanced CSS
    www.allitebooks.com AdvancED CSS Joseph R. Lewis and Meitar Moscovitz www.allitebooks.com AdvancED CSS Copyright © 2009 by Joseph R. Lewis and Meitar Moscovitz All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher. ISBN-13 (pbk): 978-1-4302-1932-3 ISBN-13 (electronic): 978-1-4302-1933-0 Printed and bound in the United States of America 9 8 7 6 5 4 3 2 1 Trademarked names may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, we use the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. Distributed to the book trade worldwide by Springer-Verlag New York, Inc., 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax 201-348-4505, e-mail kn`ano)ju<olnejcan)o^i*_om, or visit sss*olnejcankjheja*_ki. For information on translations, please contact Apress directly at 2855 Telegraph Avenue, Suite 600, Berkeley, CA 94705. Phone 510-549-5930, fax 510-549-5939, e-mail ejbk<]lnaoo*_ki, or visit sss*]lnaoo*_ki. Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at dppl6++sss*]lnaoo*_ki+ejbk+^qhgo]hao.
    [Show full text]
  • Giant List of Web Browsers
    Giant List of Web Browsers The majority of the world uses a default or big tech browsers but there are many alternatives out there which may be a better choice. Take a look through our list & see if there is something you like the look of. All links open in new windows. Caveat emptor old friend & happy surfing. 1. 32bit https://www.electrasoft.com/32bw.htm 2. 360 Security https://browser.360.cn/se/en.html 3. Avant http://www.avantbrowser.com 4. Avast/SafeZone https://www.avast.com/en-us/secure-browser 5. Basilisk https://www.basilisk-browser.org 6. Bento https://bentobrowser.com 7. Bitty http://www.bitty.com 8. Blisk https://blisk.io 9. Brave https://brave.com 10. BriskBard https://www.briskbard.com 11. Chrome https://www.google.com/chrome 12. Chromium https://www.chromium.org/Home 13. Citrio http://citrio.com 14. Cliqz https://cliqz.com 15. C?c C?c https://coccoc.com 16. Comodo IceDragon https://www.comodo.com/home/browsers-toolbars/icedragon-browser.php 17. Comodo Dragon https://www.comodo.com/home/browsers-toolbars/browser.php 18. Coowon http://coowon.com 19. Crusta https://sourceforge.net/projects/crustabrowser 20. Dillo https://www.dillo.org 21. Dolphin http://dolphin.com 22. Dooble https://textbrowser.github.io/dooble 23. Edge https://www.microsoft.com/en-us/windows/microsoft-edge 24. ELinks http://elinks.or.cz 25. Epic https://www.epicbrowser.com 26. Epiphany https://projects-old.gnome.org/epiphany 27. Falkon https://www.falkon.org 28. Firefox https://www.mozilla.org/en-US/firefox/new 29.
    [Show full text]