An Empirical Study of Factors Affecting the Rate of Spam
Rodrigo Sanches Miani¹, Danielle Oliveira¹, Kil Jin Brandini Park², Bruno Bogaz Zarpelão³

¹Faculdade de Computação (FACOM), Universidade Federal de Uberlândia (UFU) – Uberlândia, MG – Brasil
²Faculdade de Engenharia Elétrica (FEELT), Universidade Federal de Uberlândia (UFU) – Uberlândia, MG – Brasil
³Departamento de Computação (DC), Universidade Estadual de Londrina (UEL) – Londrina, PR – Brasil

fmiani,[email protected], [email protected], [email protected]

Abstract. Several factors may influence the amount of spam received by email users, from the user's profile information, such as age or nationality, to the way the email account is exposed on the Web. We propose a replication study of an experiment conducted more than a decade ago to understand the changes in the dynamics of the business of spam. To that end, using real email addresses created and managed only for the experiment, we simulate four different behavior profiles: i) interaction on social networks, ii) purchases on e-commerce sites, iii) interaction on forums and message boards and iv) use of file sharing tools, and analyze the amount of spam received on those accounts. The results indicate that linking an email account to a social network is the most significant influence on the spam rate.

1. Introduction

The increasing use of the Internet and the resulting popularization of email have had a significant impact on people's lives. A problem directly linked to the popularization of email is the receipt of unsolicited electronic messages, also known as spam. The amount of spam generated is a direct consequence of the low financial cost of sending emails, especially when compared to regular correspondence [Cerf 2005].

Hann et al. [Hann et al. 2006] proposed a study to understand the dynamics of spam and to determine whether spam follows some targeting pattern or is randomly distributed.
The authors conducted an experiment to evaluate whether some factors, such as email providers and interests declared in certain products or services, could determine the rate of spam. A secondary goal in [Hann et al. 2006] was to investigate the influence of other factors (age and geographic location of the account owner) on the delivery of unsolicited messages. To that end, 288 email accounts with distinct owner profiles were created on the Hotmail, Lycos, Excite and Yahoo providers. Of these, 192 were exposed on Yahoo Geocities, a website hosting provider currently available only in Japan. A web page was constructed for each of these accounts. The definition of profiles included characteristics such as declared interests (e.g., "computers and technology"), age, gender and geographic location. Over a 33-week period, spam sent to the email accounts was monitored and analyzed according to each feature. The conclusions gathered at the end of the experiment were:

• Spam is not random but targeted at consumer segments that are relatively more likely to make online purchases: those who declare an interest in specific products or services, adults, and US residents;
• The most significant finding was that the greatest influence on the spam rate was the identity of the email service provider, since accounts on Hotmail received significantly more spam.

The study proposed by Hann et al. [Hann et al. 2006] is particularly influential due to its design. The authors conducted a field study by establishing real email accounts and also proposed a way for other people to interact with them using Geocities. The common approach discussed in related studies consists of analyzing a snapshot of a spam dataset provided by some university or ISP [Clayton 2008], [Garg and Niliadeh 2013] and [Almaatouq et al. 2016] or using survey data [Siponen and Stucke 2006].

The aim of this study is to replicate many aspects of the experiment conducted by [Hann et al.
2006], using an updated database that reflects the behavior of a Web user with different characteristics, such as using social networks, buying at online shopping sites, sharing files and posting on online forums. This type of replication study is known as triangulation, in which the study goal is the same but the design differs. At the end, we compare the results obtained by [Hann et al. 2006] with those achieved in this study.

Since the prior study was performed more than ten years ago and Internet behavior is constantly changing, we believe that a new, updated investigation is necessary to provide a better understanding of other factors influencing the distribution of spam. Our goal is to verify which behavior profiles (online shopping, online social networks, file hosting and online forums) affect the rate of spam, to examine how email providers treat unsolicited messages, and to establish a dataset of email accounts that could be used in future work.

The rest of this paper is organized as follows: Section 2 outlines the main strategy adopted in the methodology. Section 3 presents the results obtained and their analysis. Section 4 describes related work. Finally, Section 5 presents the conclusion and future work.

2. Methodology

This paper aims to replicate, extend and update the findings of the work proposed by Hann et al. [Hann et al. 2006]. The main difference in this study is that the email accounts are exposed to certain classes of online services, called "exposure groups." The proposal consists of the following steps:

1. Build the user profile: the first step consists of creating profiles for the emails of fictitious persons. Each profile consists of a name, age, gender, nationality, service provider and the exposure group with which the account is associated.
2. Create email accounts: create fictitious accounts in three free email providers.
Each account will be created respecting the users' profiles defined in the previous step.
3. Association: with all the email accounts created, the next step was to associate them with their respective exposure groups: online shopping, online social networking, file hosting, and discussion forums.
4. Data collection: the email accounts were monitored over a period of four months. They were checked periodically to facilitate the analysis of the amount of spam received at the end of the review period.
5. Analysis: gathered spam was analyzed according to previously defined characteristics (age, gender, nationality, service provider and exposure groups). A comparison of the results obtained here with those found in [Hann et al. 2006] is then performed.

Table 1. Summary of the characteristics considered in this study.

Age: 18, 35 and 60
Gender: Male and Female
Residence: Brazil and United States of America
Mail service provider: GMX/Mail, India and Zoho
Exposure group: i) Online Social Network (Facebook); ii) Online Shopping (Amazon and Submarino); iii) File Hosting (4Shared); and iv) Discussion Forum (Reddit)

The first step was to define the user profiles. The characteristics used to determine each profile were the same as in [Hann et al. 2006]. For the age attribute, the possible choices are 18, 35 and 60 years. We chose those ages because they represent the main groups: youth, adults, and seniors, respectively. For nationality, there are two possibilities, United States or Brazil. For gender, the choices are male and female. To explore other factors that influence the receipt of spam, we considered the email service provider and the exposure of email addresses to particular Web services. We classified every service into exposure groups. Table 1 summarizes the characteristics of the email accounts created in this experiment.

The exposure groups were proposed based on a typical Web user's behavior.
Our idea was to answer the following question: "What kinds of activities could a Web user perform using an email account?" Based on studies conducted by [Whang et al. 2003], we created the following exposure groups:

• Online Social Networks (OSN) - Online social networking sites are very popular, usually run by individual corporations (e.g., Google and Yahoo!), and are accessible via the Web. Participating users join a network, publish their profile and any content, and create links to any other users with whom they associate [Mislove et al. 2007]. In this work, we associated email accounts with a Facebook account;
• Online Shopping (OS) - This is a shopping activity performed by a consumer via a computer-based interface, where the consumer's computer is connected to, and can interact with, a retailer's digital storefront [Häubl and Trifts 2000]. For Brazilian residents, email accounts were registered at submarino.com, while US residents were associated with amazon.com;
• Discussion Forum (DF) - A discussion forum is a web application that provides a virtual environment supporting discussion and debate among peers without temporal or geographical barriers [Cheng et al. 2011]. Reddit accounts were created and linked to some of the email accounts;
• File Hosting (FH) - File hosting services are used daily by thousands of people as a way of storing and sharing files. 4Shared accounts were created and associated with some of the email accounts.

We would like to verify the incidence of spam in the following cases: i) in every isolated group, ii) when an email account is associated with all groups at the same time, iii) when an email account is associated with OSN and OS, and iv) when an email account is associated with exposure groups DF and FH. The division between OSN/OS and DF/FH is related to the e-commerce potential of both groups. Thus, the total number of cases to analyze is seven: the four isolated groups plus the three combinations.
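
The seven exposure cases described above can be enumerated in a few lines. The following Python snippet is only an illustrative sketch, not the tooling used in the study: the function names `exposure_cases` and `profiles` are hypothetical, and the attribute values are taken from Table 1. It assumes one account per combination of age, gender, nationality and exposure case; the mail service provider is deliberately not modelled, since its assignment is not described in this section.

```python
from itertools import product

# Exposure groups as abbreviated in the text.
EXPOSURE_GROUPS = ("OSN", "OS", "DF", "FH")

# Profile attributes taken from Table 1.
AGES = (18, 35, 60)
GENDERS = ("male", "female")
NATIONALITIES = ("Brazil", "United States")


def exposure_cases():
    """Return the seven cases: four isolated groups plus three combinations."""
    isolated = [frozenset([group]) for group in EXPOSURE_GROUPS]
    combined = [
        frozenset(EXPOSURE_GROUPS),   # all four groups at the same time
        frozenset({"OSN", "OS"}),     # higher e-commerce potential
        frozenset({"DF", "FH"}),      # lower e-commerce potential
    ]
    return isolated + combined


def profiles():
    """Cross product of attributes and exposure cases (an assumption, see lead-in)."""
    return [
        {"age": age, "gender": gender, "nationality": nationality,
         "exposure": tuple(sorted(case))}
        for age, gender, nationality, case
        in product(AGES, GENDERS, NATIONALITIES, exposure_cases())
    ]


if __name__ == "__main__":
    for case in exposure_cases():
        print(sorted(case))
    print(len(profiles()), "profiles under the cross-product assumption")
```

Under this plain cross-product assumption the enumeration yields 3 × 2 × 2 × 7 profiles; that figure is a consequence of the sketch and should not be read as the number of accounts actually created in the experiment.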
We created an email account for each combination of gender, nationality, age and exposure group. Based on the same method as proposed by Hann et al. [Hann et al.