A Measurement Study of Web Redirections in the Internet

A Measurement Study of Web Redirections in the Internet Krishna Bhargrava Douglas Brewer Kang Li Vangapandu Deparment of Computer Deparment of Computer Deparment of Computer Science Science Science University of Georgia University of Georgia University of Georgia 415 Boyd GSRC 415 Boyd GSRC 415 Boyd GSRC Athens, GA Athens, GA Athens, GA [email protected] [email protected] [email protected] ABSTRACT legitimate (non-spam) URLs also employ redirection for rea- The use of URL redirections has been recently studied to sons such as load balancing, link tracking, and bookmark filter spam as email and web spammers use redirection to preservation. To build accurate and effective spam detec- camouflage their web pages. However, many web sites also tion based on URL redirections, we need to understand the employ redirection for legitimate reasons such as logging, use of URL redirections for both spam and legitimate rea- localization, and load-balancing. While a majority of the sons. studies on URL redirection focused on spam redirection we Our study measures redirections in both legitimate URLs provide a holistic view of the use of URL redirections in and spam email URLs. We observe that redirection is com- the Internet. We performed a redirection study on vari- mon in both spam and legitimate URLs. Spam URLs used ous sets of URLs that includes known legitimate and spam redirection only slightly more than legitimate URLs at 43.63% websites. We observed that URL redirections are widely and 40.97% respectively. This means that while spam URLs used in today's Internet with more than 40% of legitimate will use redirection, that fact that redirection occurs cannot URLs redirecting for various reasons. We also observed that be a means for determining whether an email should be con- server side redirection is prominent in both legitimate and sidered spam or not. spam redirection. Differing from legitimate URL redirec- To further our study, we broke down types of redirection, JavaScript redirection is detected more often in spam tions and examined whether the redirections were internal websites. Furthermore, a very high percentage of spam redi- or external, whether the redirected domains were owned by rections lead to an external domain. Apart from providing the same organization. We found three types of redirec- a quantitative view of URL redirections, we also provide a tions used most often server-side, Javascript, and meta. A further classification of legitimate URL redirection based on clear difference between spam and legitimate URLs emerges the reason of redirection. We expect that our measurement when we categorize redirections into these types. We find results and classifications to provide a better understanding that spam URLs use Javascript redirection much more of- of the usage of URL redirection, which could help improving ten than do legitimate URLs 30% compared to 10%. Other the spam filtering and other applications that rely on URLs types of redirection where used more often by legitimate as the web identifiers. URLs, but they were not disparate enough with the usage by spam URLs to be of use. Breaking down redirections into external or internal yields a result where external redirects 1. INTRODUCTION are used more often by spam URLs and internal more often The point of most email spam is to sell something to re- by legitimate URLs. cipient or steal personal information. This means that the The paper is organized as follows. Section 2 briefly de- spam email must contain some way for the recipient to pur- scribes the previous work on redirection. In section 3, the chase what is being peddled or trick the user into providing types of redirections, reasons for using redirection, how redi- their information. To this end, it is very common to see rections are detected, and classification of redirection is pre- spam emails contain a URL that directs the user to a web- sented. Section 4 describes the experiments conducted while site[?]. In spam filters, it is common to have blacklist filters section 5 provides a detailed analysis of the results observed. for URLs, but redirections make it hard for these filters to function correctly[?][?][?][?]. This means it is common to 2. RELATED WORK see spam messages that obscure their website URLs with Most of the previous work in the field of web redirections redirection[?]. focused on spam redirections. In one of the early works on Since redirection is a widely used technique, much re- Web Spam classification, Gyongyi and H. Garcia- Molina search has focused on using URL redirections as an impor- [?] describe redirection as a spam hiding technique used tant factor in detecting spam. These previous works study by spammers to create doorway pages. Wu and Davison URL redirections in spam URLs exclusively. However, many [?] conducted a preliminary study that contributes with a quantitative analysis of the presence of cloaking in the In- ternet. They look at redirections as one of the techniques to CEAS 2009 - Sixth Conference on Email and Anti-Spam July 16-17, 2009, perform cloaking. Our study differs from these by consider Mountain View, California USA not only spam datasets but also a few common categories of legitimate web sites. rection would be triggered at the client browser to this target JavaScript redirections have been shown to be used by URL. The redirection happens after a browser finishes pars- spammers as a way to dupe users into viewing spam. Benczur ing a HTML page, then the META refresh action is triggered et al. [?] discovered numerous doorway pages which rely on to load content from the target URL. JavaScript redirection. Chellapilla and Maykov [?] look at JavaScript redirection explicitly with a focus on the tech- 3.2 Reasons for Using Redirection niques employed by the spammers. The Microsoft Strider As mentioned earlier, there are several reasons for URL team in their work on systematic discovery of spammers redirection { both legitimate and illegitimate. In this section emphasized URL redirection as a common spam technique. we look at some of the most common reasons for employing They developed a tool, Strider URL tracer, which can be redirection. The order of appearance is based on the popu- used to detect all the domains that a current web page con- larity among the classification results with most commonly nects to. With the aid of the Strider, Wang et al. [?] stud- observed reason listed first. ied URL redirections in the context that there are content Virtual hosting (and DNS aliasing) | Virtual Hosting providers which redirect the user to malicious sites. Niu et and DNS aliasing are where more than one sites (or just al. [?] conducted a study on forum spamming with context- domain names) are mapped to a single IP address. This based analysis using the Strider to identify doorway pages. is commonly used by websites to register most commonly To our knowledge, most of the work mentioned above misspelled domains and redirect requests to these domains studied redirection in the context of spam. Our approach is to the original server. Such organizations register the same different from the above work as we look at general web domain name with different top-level domains. For example, redirections on the whole. Our study involves detection requests to gooogle.com or google.net all redirect the user of URL redirection, classification of detected redirections to http://www.google.com. across multiple dimensions. Load balancing | Most of the top websites host content on several servers. These servers either host specialized con- 3. OVERVIEW OF REDIRECTION tent or mirror each other. In cases of high volume of web A URL is said to be redirected, if a client requests a re- traffic, requests are redirected to one of these servers; the source located at a specific URL, but the client's final des- criteria of which depends on the website. For example, pop- tination at the end of the request is a different URL. This ular websites like search engines host different mirror servers section describes the types of redirection techniques as well and requests are redirected based on the nature of resource as the common reasons for redirections. requested. Link tracking (via indirections) | Many websites use redi- 3.1 Types of Redirection rection for statistical and logging reasons. For example, Based on the implementation techniques, URL redirec- websites log the advertisement clicks before it actually takes tions are classified into 1) Server-Side redirections, the user to the advertised webpage. For this to happen, ad- 2) JavaScript redirections, and 3) META redirections. vertisement clicks are taken to the originating website where Server-side redirections | Server-side redirection occurs the information is logged and then the request is redirected when a client requests a resource and the server issues a di- to the advertised webpage. rective in the form of HTTP status codes which makes the Resisting web spam (via indirections) | Many websites client request through a different URL. HTTP reply status rewrite external links in their web pages by introducing a codes of type 3xx as well as some 4xx with a location field in level of indirection through a server that is not indexed by the header imply that the client has to redirect to a differ- search engines. For example, all user links posted at MyS- ent URL. For example, request for http://www.google.net pace are disguised as a link with indirection from the domain returns a status code 302 redirecting the request to name msplinks.com, not myspace.com. A web page linked http://www.google.com. from the former domain is much less valuable than a link Redirections can also be performed by using publicly avail- from the latter one. able services redirection services (such as tinyurl,shorturl).

Load more