The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

David Zeber (Mozilla), Sarah Bird (Mozilla), Camila Oliveira (Mozilla), Walter Rudametkin (Univ. Lille / Inria), Ilana Segall (Mozilla), Fredrik Wollsén (Mozilla), Martin Lopatka (Mozilla)

To cite this version: David Zeber, Sarah Bird, Camila Oliveira, Walter Rudametkin, Ilana Segall, et al. The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing. The Web Conference, Apr 2020, Taipei, Taiwan. 10.1145/3366423.3380104. hal-02456195

HAL Id: hal-02456195
https://hal.inria.fr/hal-02456195
Submitted on 27 Jan 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

ABSTRACT

Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don't require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify baseline variation of simultaneous crawls, then isolate the effects of time, cloud IP address vs. residential, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users on loading pages from the same domains.

CCS CONCEPTS

• Information systems → Online advertising; Web mining; Traffic analysis; Data extraction and integration; Browsers.

KEYWORDS

Web Crawling, Tracking, Online Privacy, Browser Fingerprinting, World Wide Web

ACM Reference Format:
David Zeber, Sarah Bird, Camila Oliveira, Walter Rudametkin, Ilana Segall, Fredrik Wollsén, and Martin Lopatka. 2020. The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3366423.3380104

WWW '20, April 20–24, 2020, Taipei, Taiwan. © 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan, https://doi.org/10.1145/3366423.3380104.

1 INTRODUCTION

The nature, structure, and influence of the Web have been subject to an overwhelming body of research. However, its rapid rate of evolution and sheer magnitude have outpaced even the most advanced tools used in its study [25, 44]. The expansive scale of the modern Web has necessitated increasingly sophisticated strategies for its traversal [6, 10]. Currently, computationally viable crawls are generally based on representative sampling of the portion of the Web most seen by human traffic [23, 47], as indicated by top-site lists such as Alexa [4] or Tranco [27].
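In practice, such a list is simply a ranked file of domains from which a crawl's set of pages to visit is drawn. The following is a minimal Python sketch (illustrative only, not code from this paper) of seeding a crawl from a local copy of such a list; the filename top-1m.csv and the rank,domain row format are assumptions based on how the downloadable Alexa and Tranco lists are commonly distributed.

```python
import csv

def top_site_urls(path="top-1m.csv", n=1000):
    """Yield URLs for the n highest-ranked domains from a rank,domain CSV.

    Assumes rows are already sorted by rank, as in the published list files.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            rank, domain = int(row[0]), row[1]
            if rank > n:
                break
            # The lists contain bare domains; a crawler must choose a scheme.
            yield f"https://{domain}"

if __name__ == "__main__":
    for url in top_site_urls(n=5):
        print(url)
```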
Web crawls are used in a wide variety of research including fundamental research on the nature of the Web [32], training language models [21], social network analysis [33], and medical research [60]. In particular, they are a core tool in privacy research, having been used in numerous studies (see Section 2.2) describing and quantifying online tracking. However, little is known about the representativeness of such crawls compared to the experience of human users browsing the Web, nor about how much they depend on the environment from which they are run. In order to contextualize the information obtained from Web crawls, this study evaluates the efficacy, reliability, and generalisability of the use of Web crawlers [25] as a surrogate for user interaction with the Web in the context of online tracking. The main contributions of this paper are:

• A systematic exploration of the sources of variation between repeated Web crawls under controlled conditions
• Quantification of the variation attributable to the dynamic nature of the Web versus that resulting from client-specific environment and contextual factors
• An assessment of the degree to which website ranking lists, often used to define crawl strategies, are representative of Web traffic measured over a set of real Web browsing clients
• A large-scale comparison between Web crawls and the experience of human browser users over a dataset of 30 million site visits across 50,000 users.

These results present a multifaceted investigation towards answering the question: when a crawler visits a specific page, is it served a comparable Web experience to that of a human user performing an analogous navigation?

2 BACKGROUND AND MOTIVATIONS

Seminal research into the graph-theoretical characteristics of the Web simplified its representation by assuming static relationships (edges) between static pages (nodes). Increasingly sophisticated Web crawlers [25] were deployed to traverse hyperlinkage structures, providing insights into its geometric characteristics [9]. Subsequent research has emphasized the importance of studying Internet topology [18, 31, 54]. Models accounting for the dynamic nature of the Web, and the morphological consequences thereof [31], have encouraged emphasis on higher-traffic websites in the context of an increasingly nonuniform and ever-growing Web. Applied research into specific characteristics of the Web [5], including the study of the tracking ecosystem that underlies online advertising [17, 50], has leveraged Web crawlers to study an immense and diverse ecosystem of Web pages.

Indeed, much of our understanding of how the Web works, and the privacy risks associated with it, results from research based on large-scale crawls. Yet, most crawls are performed only once, providing a snapshot view of the Web's inner workings. Moreover, crawls are most often performed using dedicated frameworks or …

…on the objectives of the crawl and are an active area of discussion [7, 48]. These are, however, outside the scope of this paper.

There are a multitude of crawler implementations that use different technologies with different tradeoffs. Simple crawlers are fast but do not run JavaScript, a central component of the modern Web environment. Other crawlers require more resources but are better at simulating a user's browser. The simplest forms of crawlers are scripts that use curl or wget to get a webpage's contents. More complete crawlers rely on frameworks like Scrapy [51], a Python-based framework for crawling and scraping data. Selenium [11] goes a step further by providing plugins to automate popular browsers, including Firefox, Chrome, Opera and Safari. Its popularity has led to the W3C WebDriver Standard [12]. In fact, Firefox and Chrome now provide headless modes that allow them to be run fully automated from the command line, without the need for a graphical interface. Libraries such as Puppeteer [20] provide convenient APIs to control browsers. Finally, OpenWPM [17] is a privacy measurement framework built on Firefox that uses Selenium for automation and provides hooks for data collection to facilitate large-scale studies.
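As a concrete point of reference (a minimal sketch, not this paper's instrumentation), the snippet below drives headless Firefox through Selenium to load a single page; it assumes the selenium package and a geckodriver binary are installed, and example.com stands in for any crawl target. Note that each launch starts from a fresh temporary profile, so repeated visits of this kind are stateless by default — the tradeoff discussed next.

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def visit(url):
    """Load a page in headless Firefox and report basic facts about the visit."""
    opts = Options()
    opts.add_argument("-headless")            # no graphical interface needed
    driver = webdriver.Firefox(options=opts)  # fresh temporary profile each launch
    try:
        driver.get(url)  # unlike curl/wget fetching, this executes the page's JavaScript
        return {
            "final_url": driver.current_url,
            "title": driver.title,
            "cookies_set": len(driver.get_cookies()),
        }
    finally:
        driver.quit()

print(visit("https://example.com"))
```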
There are important considerations when choosing a crawler technology. Simple crawlers do not run JavaScript and are quite limited in modern dynamic environments. A stateful crawler, i.e., one that supports cookies, caches, or other persistent data, may be desirable to better emulate users, but the results of the crawl may then depend on the accumulated state, e.g., the order of pages visited. Finally, a distributed crawl may exploit the use of coordinated crawlers over multiple boxes with distinct IP addresses.

2.2 Web measurement studies

Crawls are a standard way of collecting data for Web measurements. Many crawl-based studies focus on …