Recognition of phishing attacks utilizing anomalies in phishing websites

Sunil Chaudhary

University of Tampere
Department of Computer Sciences
Computer Science / Software Development
M.Sc. thesis
Supervisor: Eleni Berki
November 2012


University of Tampere
Department of Computer Sciences
Computer Science / Software Development
Sunil Chaudhary: Recognition of phishing attacks utilizing anomalies in phishing websites
M.Sc. Thesis, 78 pages, 15 index and appendix pages
November 2012

The fight against phishing has resulted in several phishing prevention techniques. However, they are only partially able to address the phishing problem. A large number of Internet users are still tricked into disclosing their personal information to fake websites every day. This might be because existing phishing prevention techniques are either not foolproof or unable to deal with emerging changes in phishing. The main purpose of this thesis is to identify anomalies that can be found in the Uniform Resource Locators (URLs) and source codes of phishing websites and to determine an efficient way to employ those anomalies for phishing detection. In order to do that, I performed a meta-analysis of several existing phishing prevention techniques, specifically heuristic methods. I then selected forty-one anomalies which can be found in the URLs and source codes of phishing websites and which are also mentioned or utilized by past studies. This is followed by the verification of those anomalies using an experiment conducted on twenty online phishing websites. The study revealed that some anomalies which were once significant for phishing detection are no longer present in present-day phishing websites, while several anomalies are also widely present in legitimate websites. Such ambiguous anomalies need further analysis to determine their significance in phishing detection. Moreover, it was also found that several heuristic methods use an insufficient set of anomalies, which introduces inaccuracy in their results. Finally, in order to design an efficient heuristic method employing anomalies found in the URLs and source codes of phishing websites, it is suggested to give due priority to anomalies that are difficult for phishers to bypass, found only in phishing websites, seriously harmful, independent of other anomalies, and not time-consuming to evaluate.

Key words and terms: phishing, phishing prevention, URL, DOM objects, whitelist, blacklist, heuristics, meta-analysis, software quality.

Acknowledgement

I would like to express my sincere thanks and deep appreciation to my professor and supervisor Eleni Berki for her guidance and valuable comments. I am equally thankful to Marko Helenius (Tampere University of Technology) for his constructive feedback. I would also like to thank Linfeng Li for sharing his experiences of phishing research and suggesting various useful materials that I used for my thesis. My sincere thanks also go to my English teachers, Robert Hollingsworth and Julie Rajala, who helped me become familiar with the rules of academic writing. I would also like to thank my professors Jyrki Nummenmaa and Zheying Zhang as well as all the attendees of the seminar course entitled “Master’s Thesis Seminar in Software Development” for their suggestions and feedback. Last but not least, I am thankful to my professor Mikko Ruohonen, who provided me with a summer traineeship and ample freedom to complete a large part of my thesis during the traineeship period.

Sunil Chaudhary 2nd December 2012, Tampere


Contents

1. Introduction ...... 1
   1.1. The phishing epidemic ...... 1
   1.2. Research questions ...... 5
   1.3. Anomalies in phishing websites are suitable for phishing detection ...... 6
   1.4. Thesis contribution ...... 7
   1.5. Thesis outline ...... 8
2. Review of phishing prevention methods ...... 8
   2.1. Meaning of phishing prevention methods ...... 8
   2.2. Important factors for effective phishing prevention methods ...... 9
      2.2.1. Phishers’ behavior and phishing techniques ...... 10
      2.2.2. Internet users’ behavior and decision making process ...... 12
   2.3. Objectives of existing phishing prevention methods ...... 14
      2.3.1. Reasons behind Internet users’ tendency to fall for phishing ...... 15
      2.3.2. Design techniques to educate and raise awareness about phishing ...... 16
      2.3.3. Design effective UI and warnings to alert about phishing ...... 18
      2.3.4. Development of countermeasures to automatically detect phishing ...... 20
      2.3.5. Evaluate the effectiveness of existing phishing prevention methods ...... 22
      2.3.6. The need to invent proactive strategies for phishing prevention ...... 24
   2.4. Classification of phishing prevention techniques ...... 28
   2.5. Phishing prevention applications ...... 30
3. Analysis of strength and limitations of technical phishing prevention methods ...... 34
   3.1. List based methods ...... 34
      3.1.1. Whitelist method ...... 34
      3.1.2. Blacklist method ...... 36
   3.2. Heuristic methods ...... 40
      3.2.1. Use of visual similarity measures in phishing detection ...... 40
      3.2.2. Use of search engine in phishing detection ...... 46
      3.2.3. Use of anomalies in phishing websites for phishing detection ...... 50
4. Investigating anomalies in phishing websites ...... 55
   4.1. Anomalies found in the URLs of phishing websites ...... 56
   4.2. Anomalies found in the source codes of phishing websites ...... 62
   4.3. Verification of anomalies using online phishing websites ...... 66
   4.4. Discussion on findings ...... 70
5. Conclusions ...... 75
6. Limitations and future development work ...... 78
References ...... 79
Appendix ...... 86

1. Introduction

1.1. The phishing epidemic

Online services are an integral part of modern society. They make information readily accessible from any place through the Internet. This feature is equally utilized by both service providers and users. Service providers are able to penetrate and cover large markets easily at a low operational cost, whilst users are able to choose from a wide range of services and use them regardless of time and location. Unfortunately, these services have not escaped the attention of cybercriminals. One of the major drawbacks of using such services is the risk of phishing. Phishing is a fraudulent activity carried out using electronic communication to acquire personal information for malicious purposes. This information can include bank or financial institution credentials, social security numbers, credit card details, and online shopping account information, with which phishers usually defraud their victims. Phishers employ a number of techniques, such as social engineering schemes and technical subterfuge [APWG, 2012], in order to allure potential victims and make them divulge their account details and other sensitive information.

(i) Social engineering scheme. In general, phishers use emails masquerading as being from a legitimate and trustworthy source, such as a bank, an auction site, or an online commerce site [APWG, 2012], and redirect victims to an authentic-looking counterfeit website to deceive recipients into disclosing sensitive information. Many other mediums, such as snail mail, phone calls, and instant messengers, are also used to reach potential victims and lure them into disclosing their confidential information. However, fake emails and phony websites are an easy and economically viable means of targeting a large number of potential victims at a time, which might be a reason they are so widely used to conduct phishing. The fake emails and phony websites used by phishers have evolved to become technically deceptive and hard to detect by casual inspection. Phishing emails often create a sense of urgency to motivate Internet users to take prompt action, such as asking potential victims to update, validate, or confirm their account information for different reasons, for example, to receive an award or to help the bank in its procedures, warning that otherwise their account will be suspended or misused. Similarly, phishers have also been found misusing current situations and happenings; for example, phishing attacks which emerged after the Haiti earthquake purported to be from relief organizations or from the victims themselves asking for donations, and there have been FIFA World Cup-themed phishing attempts. Phishing websites often use the original website layout, logo, trademark, and even a similar domain name to make them look similar to the genuine websites. Furthermore, mirroring original websites to generate fake websites makes it harder to differentiate them even for people with adequate knowledge about phishing. It has also been reported that some phishing websites claim to sell products, such as software, games, and sex pills, at high discounts, and then steal the bank information when Internet users enter it into their websites to buy the products.

(ii) Technical subterfuge. Phishers plant malicious software (crimeware) onto the Personal Computers (PCs) of potential victims to steal their credentials directly [APWG, 2012]. Many hackers have been involved with phishing and use advanced hacking techniques. Some of the mechanisms used in technical subterfuge are:
• Session hijacking is used, often by corrupting the local navigation infrastructure, to misdirect potential victims to a fake website or to an authentic website through proxies controlled by the phishers. Techniques such as pharming, cross-site scripting attacks, cross-site request forgery, domain name typos, and man-in-the-middle attacks are implemented to carry out session hijacking [Milletary, 2006].
• An uncontrolled flood of spam emails is sent with malware in the attachment or with a link which, on clicking, surreptitiously installs specialized malware on the Internet users’ computers. Such malware is designed to monitor and intercept the victims’ keystrokes and mouse clicks. Sometimes malware is designed even to capture screenshots of the webpages visited by the victims and ultimately post the captured information to the phishers. More advanced malware designed to capture network packets and protocol information, and password harvesters that look for username and password information on the victims’ computers, have also been found to be employed by phishers.

There has been a rapid increase in phishing attacks from the first half of 2008 to the first half of 2012, as shown in Figure 1.

Figure 1: Phishing websites detected from 2008 to 2012

Many factors are responsible for the growth of phishing. One of the major factors is the unawareness of Internet users that their personal information is actively targeted by criminals; as a consequence, they neglect to take precautionary measures while performing online transactions. Likewise, many online service providers lack organizational policies and procedures for contacting customers [Dhamija et al., 2006], although presently many big organizations that are phishing prone do seem to have acted to improve the situation. Moreover, phishing is very lucrative, offering a high return against little risk. The exponential growth in the use of the Internet and online services has resulted in a rapid increase in potential victims, encouraging many new criminals and inspiring them to use new, sophisticated techniques to deceive Internet users more effectively. Additionally, the technical resources required for phishing are easily accessible, which has enabled even criminals with little technical knowledge to conduct phishing successfully. Many do-it-yourself phishing kits are available online and can be downloaded for free. These kits also contain software that enables phishers to easily reach large numbers of potential victims. There are various websites available online that offer guidance on designing and conducting phishing.

Phishing is a leading cause of identity theft on the Internet and causes billions of dollars of damage worldwide every year. It has an adverse impact on the economy through direct and indirect losses experienced by businesses and customers.
• The direct loss is the financial damage incurred when phishers withdraw money from their victims’ accounts.
• The indirect losses are the adverse impact on customers’ confidence in online commerce and services, the diminished reputation of victimized organizations, and the resources spent to combat phishing.
Moreover, the convenience of e-commerce seems to be embraced by both cybercriminals and users on an equal basis. Financial services are the industry sector most targeted by phishing, as shown in Figure 2.

Figure 2: Targeted industry sectors by phishing [APWG, 2012]

With the prevalence of phishing attacks and the increasing vulnerability of users’ confidential and personal information, it is increasingly important to provide Internet users with an effective and reliable phishing prevention method. There is no silver bullet to eliminate the problem of phishing. Prevention depends partially on well designed technology and equally on the browsing habits of Internet users. Well designed technology includes techniques that can efficiently tackle successful phishing techniques and a usable design that takes into consideration what humans can and cannot do well [Dhamija et al., 2006]. Li et al. [2007] emphasize improving the quality of system design and the need for well-defined security requirements to protect system users from phishing. Good browsing habits mean that Internet users are familiar with phishing and are able to detect it. This includes trust towards the anti-phishing software which Internet users have installed in their systems and their reaction to warnings from that anti-phishing software. However, an empirical study has shown that many Internet users neglect warnings from anti-phishing systems [Dhamija et al., 2006]. Many Internet users do not understand phishing attacks or do not understand the sophistication of phishing [Wu et al., 2006b].

There are several promising solutions provided by security experts and researchers against phishing. These solutions build awareness of potential phishing attempts, and develop and promote suitable technology that helps to protect Internet users against phishing. They implement prevention, detection, and response measures. They are available in a variety of forms: integrated with popular anti-virus systems, e.g., the anti-phishing tool in Norton; as an embedded feature of renowned web browsers, e.g., the Google Safe Browsing toolbar [Google Safe Browsing] used in the Mozilla Firefox browser; and as separate tools and add-ons that can be used on server and client machines, e.g., the eBay toolbar [eBay Toolbar’s Account Guard]. They employ different techniques, such as blacklists, e.g., the Netcraft Anti-phishing toolbar [Netcraft]; whitelists, e.g., SmartScreen Filter [MSDN IEBlog]; content based detection, e.g., CANTINA [Zhang et al., 2007a]; analysis of webpage source code or URL, e.g., CANTINA+ [Xiang et al., 2011]; comparison of the visual similarity of the whole webpage or its layout or logo, e.g., the online tool called “SiteWatcher Anti-phishing Tech” [Liu et al., 2006]; analysis of data submitted by users online, e.g., SpoofGuard [Chou et al., 2004]; and use of a reputable search engine, e.g., CANTINA [Zhang et al., 2007a]. There has been good progress in identifying countermeasures; however, there has also been an increase in attack diversity and technical sophistication to circumvent both detection and users’ suspicions. This means that as countermeasures are implemented to thwart one method of stealing information, criminals search for new vulnerabilities to exploit. This also means they always have additional opportunities available to them.

1.2. Research questions

The most common and straightforward technique for committing phishing attacks is to deploy a webpage that mimics the look and feel of a target organization’s website. There are several heuristic methods which employ anomalies in the URLs and source codes in order to identify phishing websites. Many anti-phishing tools in use, such as SpoofGuard [Chou et al., 2004], the Netcraft Anti-phishing toolbar [Netcraft], the CallingID toolbar [CallingID], the eBay toolbar [eBay Toolbar’s Account Guard], and SmartScreen Filter [MSDN IEBlog], also implement heuristic methods for phishing detection. However, there are several anti-phishing tools, such as the Cloudmark Anti-fraud toolbar [Cloudmark] and the EarthLink toolbar [EarthLink], that still rely on manual verification and blacklists for phishing detection [Zhang et al., 2007b]. Ironically, even anti-phishing tools such as the eBay Toolbar and SmartScreen Filter that use heuristic methods do not use them as the first line of detection [MSDN IEBlog, eBay Toolbar’s Account Guard], since heuristic methods introduce higher inaccuracy in the results compared to list based methods. Therefore, further research is required to improve the results of heuristic methods. In this thesis, I have worked on answering the following two questions:
(i) What are the most common anomalies found in the URLs and source codes of phishing websites?
(ii) How could these anomalies be deployed in order to recognize phishing attempts?
I believe that, in order to enhance the accuracy of results in heuristic methods, the two crucial factors are the selection of suitable anomalies and the design of a suitable method to employ them. Some of the related studies that use anomalies in the source codes and URLs of phishing websites for phishing detection include Pan and Ding [2006], who looked for anomalies in the webpage and cookies for phishing detection; Gastellier-Prevost et al. [2011], who evaluated the URL and webpage source code; Garera et al. [2007], who analyzed the features of the URL for discrepancies; and Alkhozae and Batarfi [2011], who looked for abnormalities in the webpage with respect to the W3C standard. However, Garera et al. [2007] excluded the source code of the website despite the fact that important anomalies can also be found in the source code of phishing websites, whilst all the other studies seem to neglect some vital factors during the selection, calibration, and deployment of the anomalies, as evidenced by the high inaccuracy in their results. In addition, these studies were performed some years ago, but the trend in phishing is very dynamic. There is a high chance that anomalies that were important during their studies may no longer be valid. Many other related studies are analyzed in Chapter 3.

1.3. Anomalies in phishing websites are suitable for phishing detection

Although phishing sites are cheap and easy to build [Pan and Ding, 2006], these cheaply made websites are often poorly designed and coded, and do not properly meet recognized standards, for example, the recommendations from the World Wide Web Consortium (W3C) [Alkhozae and Batarfi, 2011] and the Google guidelines [Garera et al., 2007]. Their quality score in the Google crawl database was found to be either very low or missing altogether [Gastellier-Prevost et al., 2011]. Moreover, phishing websites have a very short lifetime: on average, a phishing website domain remains online for 3 days, 31 minutes and 8 seconds [McGrath and Gupta, 2008]. Given this short duration, phishers naturally prefer not to concentrate on website design and quality improvement, but rather to work on more beneficial activities, such as pushing more emails and websites to potential victims, infecting users’ PCs with malign software to use them as proxies, and designing a distributed architecture that includes registering many domains from various registrars in order to direct traffic to one of their domains when any of their domains is removed. In addition, phishing websites often imitate genuine websites and claim false identities, which is not possible unless some anomalies are introduced. Therefore, these anomalies can be utilized to detect phishing. The other benefits of using such anomalies found in the URLs and DOM objects of websites for a phishing detection method are:
(i) It is not dependent on any specific phishing strategy and is equally valid for all kinds of phishing websites.
(ii) It does not depend on any external factors, such as databases, and
(iii) It does not require any changes in user browsing habits.
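To illustrate how anomalies of this kind could be checked programmatically, the following Python sketch tests a handful of URL anomalies of the type discussed later in Chapter 4 (an IP address used as the hostname, an '@' symbol in the URL, an unusually large number of dots in the hostname, and suspicious keywords). The specific checks and thresholds are illustrative assumptions of my own, not the calibrated rules of any existing tool or of the method proposed in this thesis.

import re
from urllib.parse import urlparse

def url_anomalies(url):
    # Return a list of simple URL anomalies often associated with phishing.
    # The checks and thresholds are illustrative; a real heuristic method
    # would calibrate them against labelled phishing and legitimate URLs.
    anomalies = []
    host = urlparse(url).hostname or ""

    # Anomaly: a raw IP address is used instead of a domain name.
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        anomalies.append("IP address used as hostname")

    # Anomaly: an '@' symbol, which makes browsers ignore everything before it.
    if "@" in url:
        anomalies.append("'@' symbol present in URL")

    # Anomaly: unusually many subdomain levels (the threshold is an assumption).
    if host.count(".") > 4:
        anomalies.append("excessive number of dots in hostname")

    # Anomaly: keywords commonly embedded in phishing URLs to mimic a brand.
    if any(word in url.lower() for word in ("login", "verify", "update", "secure")):
        anomalies.append("suspicious keyword in URL")

    return anomalies

# Example: a URL exhibiting several of the above anomalies.
print(url_anomalies("http://192.168.10.5/paypal.com/secure/login.html"))

A heuristic method would combine many such checks, weight them, and compare the aggregate score against a threshold, in the spirit of the methods analyzed in Chapters 3 and 4.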

1.4. Thesis contribution

In order to determine the anomalies that are found in the URLs and source codes of phishing websites, I performed a meta-analysis of several past studies related to phishing prevention, specifically heuristic methods. Then, I selected forty-one anomalies that can be found in the URLs and source codes of phishing websites. After that, I performed an experiment on twenty online phishing websites to verify those anomalies and determine their significance in phishing detection. Finally, I suggest ways by which anomalies in the URLs and source codes of phishing websites can be effectively utilized during phishing detection. In general, the thesis makes the following contributions:
(i) A systematic classification of phishing prevention techniques and applications.
(ii) A meta-analysis of phishing prevention methods.
(iii) A set of forty-one anomalies that can be found in the URLs and source codes of phishing websites.
(iv) Results from an experiment conducted on twenty online phishing websites in order to verify the significance of the anomalies in phishing detection.
(v) Guidelines to help in the deployment of anomalies in phishing prevention methods.

1.5. Thesis outline

The thesis proceeds as follows: Chapter 2 reviews phishing prevention techniques and also includes a systematic classification of phishing prevention techniques and applications. Chapter 3 contains the meta-analysis of list based methods and heuristic methods, along with various related studies on them with their main contents, specialities, and limitations. Chapter 4 lists the anomalies found in the URLs and source codes of phishing websites which can be employed for phishing detection, explains the setup of the experiment conducted on twenty online phishing websites, presents the results obtained from the experiment, and discusses the findings. Chapter 5 presents conclusions, and the last chapter, Chapter 6, covers the limitations of this research and some future research and development work.

2. Review of phishing prevention methods

2.1. Meaning of phishing prevention methods

Phishing utilizes the union of technology and social engineering. Social engineering is about the exploitation of human vulnerabilities [Odaro and Sanders, 2011]. There are various limitations which arise from human behaviour and the decision making process (e.g., greed and fear affect decisions) and from social norms (e.g., ethics and legality) which, unfortunately, so far do not have an exact technical solution valid for all scenarios. Overcoming those limitations requires Internet users’ intelligence to make security critical decisions correctly. However, phishers use social engineering and technology in a strategic manner to distract their potential victims [Jakobsson, 2005]. Therefore, phishing prevention techniques target both components (i.e., technology and social engineering) related to phishing. Precisely, a phishing prevention technique is any technical or non-technical solution designed either to stop sensitive information from leaking to a counterfeit website or to make leaked data useless [Cao et al., 2008].

In order to address the problem of phishing, the American Bankers Association [2005] recommends developing a comprehensive set of procedures that perform:
(i) Detection. Detection means keeping a vigilant eye on phishing and discovering any new phishing activity before it can victimize Internet users. It also includes solutions that extract information about the phishing website.
(ii) Prevention. Prevention means helping to reduce the frequency of phishing attempts that Internet users receive, or educating Internet users so that they are less likely to respond to phishing attempts.
(iii) Response. Response means focusing on the precautions and actions which have to be taken after the detection of phishing. It is also related to the flow of information about the culprit website and the process of removing phishing websites.
Even though this is recommended for the banking sector, it is valid for curbing all other kinds of phishing as well. The three procedures are shown in Figure 3.

Figure 3: Phishing prevention procedures [American Bankers Association, 2005]

2.2. Important factors for effective phishing prevention methods

In order to prevent phishing attacks, it is vital to comprehend phishers’ behaviour and phishing techniques along with Internet users’ behaviour and their decision making process. An analysis of phishers’ behaviours and techniques provides insight into the technical and social engineering techniques applied in phishing. Likewise, studying Internet users’ behaviour and their decision making process sheds light on the aspects that Internet users are good at and on their vulnerabilities. Detecting a phishing attempt is of little use when Internet users either fail to notice or ignore the warnings from a phishing prevention system. Therefore, Internet users’ response limitations should be respected. These should further be facilitated with suitable usability.

2.2.1. Phishers’ behavior and phishing techniques

In general, attacks are of three kinds:
(i) Physical attacks. These target physical infrastructure and networks to cause physical outages, such as breaking the power or data transmission cables.
(ii) Syntactic attacks. These target vulnerabilities and loopholes in software, such as problems in cryptographic algorithms and protocols.
(iii) Semantic attacks. These target people’s behaviour and the way they interact with computers and the web, such as the use of social engineering to manipulate Internet users and steal their information.
This means that phishing includes both syntactic and semantic attacks [Downs et al., 2006]. This also implies that a phishing prevention system should protect Internet users from both syntactic and semantic attacks. According to Singh [2007], the schemes used by phishers can roughly be classified into the following four kinds:
(i) Dragnet method. It uses spammed emails, websites, or pop-up windows bearing falsified corporate identification.
(ii) Rod-and-reel method. It targets specific prospective victims with whom initial contact has already been made, and sends false information to prompt their disclosure of personal or financial data.
(iii) Lobsterpot method. It consists of creating a forged website that imitates a legitimate website so that victims mistake the spoofed website for the legitimate one and provide their personal data.
(iv) Gillnet phishing. It uses malicious code which infects the user’s system or changes the settings of the user’s system. Consequently, the Internet user is directed to a phishing website when trying to visit a legitimate website, or the malicious code records the keystrokes containing the user’s personal information and transmits those data to the phishers.
In all these techniques, the phishing schemes seem to rely on three basic elements:
• Phishing solicitations often use familiar corporate trademarks and trade names, as well as recognized security agency names and logos. This can be seen in Figure 4; it is a phishing website for “Paypal” that also uses the “Verisign” logo.

Figure 4: A phishing website for Paypal

• The solicitations routinely contain warnings, or information about an award, a lottery, or other similar messages, intended to cause the recipients immediate concern or worry about access to an existing financial account. An example of a phishing email informing about a grant can be seen in Figure 5.

Figure 5: A phishing email

• The solicitations rely on two facts pertaining to the authentication of the emails:
1. Online consumers often lack the tools and technical knowledge to authenticate messages, especially from financial institutions and e-commerce companies; and
2. Most of the available tools and techniques are inadequate for robust authentication or can easily be spoofed [Wu et al., 2006b].
In fact, these are the elements against which the existing anti-phishing techniques work and against which future research on phishing prevention techniques will work. There are several heuristic methods that use logo comparison, look for the misuse of security agency logos, and check other properties to detect phishing. Heuristic methods are discussed in later chapters. Then, there are various spam and phishing email filters in use to protect against phishing attacks. Some e-commerce organizations have their own toolbars designed for their customers, e.g., eBay’s toolbar, which can alert its clients about phishing targeted at eBay [eBay Toolbar’s Account Guard].

2.2.2. Internet users’ behavior and decision making process

Human behaviour makes the decision making process a very complex procedure. The outcome is probabilistic and is affected by various factors, such as beliefs, preferences, past experiences, the situation the person is subjected to, current states, and others. Further studies can:
• Improve the understanding of the factors that make Internet users fall for phishing, and
• Guide security experts in designing countermeasures which can effectively protect Internet users from phishing.
There has been little work done on Internet users’ behaviour and decision making process in the context of phishing. There is, however, work related to human behaviour and decision making processes in other research contexts. Only a few security scientists have contributed to the study of human behaviour and decision making with respect to phishing. The experiment by Dhamija et al. [2006] on why people fall for phishing is an example of such work. This study focused on finding the limitations of existing phishing prevention techniques. Their study revealed that Internet users have their own preferred characteristics for identifying phishing, and their decision making process is affected by various factors, such as their past experiences with phishing and the situation they are subjected to (i.e., the reaction of a person desperately looking to buy a FIFA World Cup ticket towards FIFA World Cup-themed phishing will be different from that of a person who has not thought about watching the FIFA World Cup). For instance, in this experiment participants were explicitly asked to identify phishing, which affected their decisions, and participants were found to be misguided by the attractive and luring sentences of an email or website.

Moreover, the subjected situation was a key factor; in the experiment there was no penalty for a wrong decision, which affected participants’ decisions. A classical example of the impact of belief on decision making during phishing detection is mentioned in the experimental case study performed on a bank’s employees by Aburrous et al. [2010]. They found that some Internet users strongly believe that they are capable of detecting all kinds of phishing attacks and avoid using anti-phishing tools, which, unintentionally, exposes them to a higher risk of phishing attacks. One of the in-depth studies of Internet users’ behaviour while interacting with phishing was done by Dong et al. [2008]. Their research focused on Internet users’ behaviour during interaction with phishing websites and their decision making process. They also designed a model called the “user-phishing interaction model” after a cognitive walkthrough on four hundred phishing websites, identifying the activities, information used, and assumptions/executions that Internet users make during their interaction with a phishing webpage. A diagrammatic representation of the information Internet users may use when encountering phishing attacks is shown in Figure 6.

Figure 6: The overview of User-Phishing Interaction [Dong et al., 2008]
• External information. This is the information that users perceive from the user interface (including phishing emails/communication), as well as from other sources (such as expert advice).
• Knowledge and context. This is the information that the user perceives from his environment, social networks, past experience, things happening around him, etc.

• Expectation and previous perception. After each action, Internet users have some expectations. This is the information retrieved from these expectations and also from their understanding of the system.
In their Decision Making Model, Dong et al. [2008] mention the following two kinds of decision that users make when interacting with phishing activities and content:
• Deciding on a series of actions to take. This is done consciously and affects the decision whether to proceed or not.
• Deciding whether to proceed or not. This is usually done subconsciously.
Both decision making processes are further divided into the following three steps:
• Construction of the perception of the situation
• Generation of possible actions to respond
• Generation of assessment criteria and choosing an action.
A diagrammatic representation of their Decision Making Model is shown in Figure 7.

Figure 7: Decision Making Model [Dong et al., 2008]

2.3. Objectives of existing phishing prevention methods

Several phishing prevention methods have resulted from different studies conducted on protection against phishing. These phishing prevention methods are primarily motivated to look for:

(i) Reasons behind Internet users’ tendency to fall for phishing
(ii) Design of techniques to educate and raise awareness about phishing
(iii) Design of effective User Interfaces (UI) and warnings to alert about phishing
(iv) Development of countermeasures to automatically detect phishing
(v) Evaluation of the effectiveness of existing phishing prevention methods, and
(vi) Invention of proactive strategies for phishing prevention.
References and examples from all these research studies are given below.

2.3.1. Reasons behind Internet users’ tendency to fall for phishing

It is not uncommon for novice Internet users to be victimized by phishing; but, shockingly, it has been found that even those with adequate knowledge about phishing are tricked by phishers [Odaro and Sanders, 2011]. A study conducted by Aburrous et al. [2010] on a bank’s employees found that even employees from the Information Technology (IT) department, who are chiefly responsible for always remaining alert about phishing, were tricked. Likewise, in a study by Dhamija et al. [2006], ninety percent of the participants were tricked by good phishing websites. There are a number of such studies that have examined the reasons behind Internet users’ tendency to fall for phishing.

The empirical study by Friedman et al. [2002] on users’ conceptions of web security revealed that many Internet users are unable to differentiate between secure and insecure website connections. The meaning of security varies from one Internet user to another, and many look to components in the UI that can easily be copied from the original website as cues for a secure connection. Likewise, the study by Dhamija et al. [2006] found that many Internet users are unable to differentiate between legitimate and spoofed websites. Many Internet users use the content of the website as a cue for authenticity. There are a number of Internet users who use the padlock icon, animated graphics, pictures, and design touches, such as logos, favicons, etc., to differentiate between genuine and fake websites. Most surprisingly, many Internet users do not hesitate to reveal their personal information to spoofed websites despite warnings from the phishing prevention tools installed in their systems. Dhamija et al. [2006] also blamed the ineffectiveness of existing solutions designed for phishing prevention as a reason behind Internet users falling for phishing. These solutions are mostly technical and usually neglect some crucial non-technical aspects in their design.

Similarly, the study by Downs et al. [2006] on Internet users’ mental models when reading email and browsing the web, and on their vulnerability to manipulation, revealed that merely having knowledge and experience of phishing is an ineffectual strategy for phishing prevention, especially in the case of new phishing methods. One of the reasons mentioned is that current awareness techniques do not effectively explain possible vulnerabilities or strategies to identify phishing emails. Another reason could be that applying certain knowledge too rigidly can lead users to suspect real email and web-based actions [Odaro and Sanders, 2011], which is unlikely to work for many who conduct business via the web. The study by Wu et al. [2006b] also found that many Internet users use website appearance and content to differentiate between fraudulent and legitimate websites. Moreover, security is rarely the primary goal of Internet users. They also indicate that sloppy web practices aid in confusing Internet users and expose them to risk. For example, a single web form may be used to submit both sensitive and insensitive information, some legitimate websites use Internet Protocol (IP) address URLs, and some legitimate websites have a login page without Secure Socket Layer (SSL) or use SSL for a very short time which is unnoticeable for Internet users. Moreover, Ma [2006] and Wu et al. [2006b] mention that lack of an alternative is a factor behind Internet users falling for phishing. Almost all phishing prevention approaches detect probable phishing, but they rarely provide an alternative way to proceed, forcing Internet users to take the risk. Human behaviour also plays a role in making Internet users fall into the phishing trap.

2.3.2. Design techniques to educate and raise awareness about phishing

Phishing is largely dependent on the human factor, so educating Internet users and raising awareness about phishing is one of the potential countermeasures. Not all phishing attempts are hard to recognize. The majority of phishing attacks contain visible distinguishing factors which can help Internet users identify them; however, the majority of Internet users are found to be either not aware of or not clear about these factors. Their inability to distinguish legitimate websites from phishing websites is exploited by phishing attacks. Surveys and studies undertaken by Friedman et al. [2002], Dhamija et al. [2006], Karakasiliotis et al. [2007], Jagatic et al. [2007], Herzberg and Jbara [2004], and Odaro and Sanders [2011] have revealed that Internet users lack proper knowledge about phishing. Their skill in identifying phishing attacks is not adequate and they usually misclassify phishing websites as legitimate websites and vice versa. Undoubtedly, not all phishing attacks can be detected manually. Yet, manual detection performed by Internet users can make a big change in reducing the number of people falling for phishing. Wu et al. [2006b] found a significant improvement in Internet users’ ability to detect phishing attacks after reading a tutorial about phishing sent by email. Various kinds of materials are available to educate Internet users about phishing and make them aware of techniques to detect it manually. Many online training materials are published by various government and non-government organizations, businesses, security organizations, universities, etc. Most of the organizations that work on the prevention of phishing (e.g., APWG, antivirus companies, universities working on phishing) or are targeted by phishing attacks (e.g., banks, e-business companies, finance companies) have included information about phishing, and instructions to follow when encountering such a scenario, on their official websites. An example of such information included in the website of Nordea Bank, Finland is shown in Figure 8.

Figure 8: ‘About phishing’ page in the Nordea Bank, Finland website

Many other online materials are also available. “Anti-Phish Phil”, an interactive game, and “PhishGuru”, an interactive training system, were designed by the CyLab Usable Privacy and Security (CUPS) Laboratory at Carnegie Mellon University to educate Internet users about phishing websites. The experiment by Sheng et al. [2007] on the role of games in educating Internet users about phishing showed that a game is more effective than other means, such as reading text or reading online tutorial material. A screenshot of the “Anti-Phish Phil” game is shown in Figure 9.

Figure 9: A screenshot of the educational game called “Anti-Phish Phil”

Similarly, “Phish or Not Phish”, an online quiz developed by VeriSign, is available for free. It displays two similar-looking website snapshots and asks users to identify which snapshot is from a phishing website. After each answer, it displays the reasons that identify one of the snapshots as a phishing website. A screenshot of “Phish or Not Phish” is shown in Figure 10.

Figure 10: A screenshot of the online quiz called “Phish or Not Phish”

2.3.3. Design effective UI and warnings to alert about phishing

Dhamija et al. [2006] have mentioned that phishing cannot be solved solely by a traditional cryptographic-based security framework; rather, it equally needs the inclusion of usability and user experience. Several studies have indicated that a bad or ineffective user interface is one of the prominent factors behind the weak performance of anti-phishing software. Wu et al. [2006a] pointed out the placement of warning indicators in the peripheral area of many phishing prevention solutions as one example of very poor design. Further, they mention that such warning indicators send a very weak signal in comparison to the much larger, centrally displayed spoofed web pages. The study by Zhang et al. [2007b] also revealed the poor usability performance of existing phishing prevention tools. Some examples of poor design in phishing prevention tools are:
• Use of red and green colour indicators, which is a poor choice for users with red/green colour blindness unless some other noticeable cues are included along with them,
• Use of pop-up dialog boxes to warn, when popular browsers (e.g., Internet Explorer (IE), Google Chrome, Mozilla Firefox) have an option to block such boxes and, besides that, most Internet users dismiss such boxes without reading them. An option to disable pop-up dialogs in IE 9 is shown in Figure 11.
Some examples of anti-phishing toolbars which use poor ways to notify about phishing attacks are eBay’s Account Guard and SpoofGuard. eBay’s Account Guard shows a green icon to indicate that the webpage belongs to eBay or PayPal, a grey icon for unidentified websites, and a red icon to indicate a potential phishing website [eBay Toolbar’s Account Guard]. SpoofGuard displays traffic light colours (red: above the threshold value; yellow: probably hostile; green: a low score, probably safe) to indicate a website’s chance of being a phish [Chou et al., 2004].

Figure 11: Highlight of the pop-up blocker in Internet Explorer 9

Currently, significant improvement can be seen in the usability of some phishing prevention tools. Popular browsers use active warnings that are displayed over the full page. Such warnings cannot go unnoticed by Internet users. Internet Explorer uses both active and passive warnings; when it confirms that a website is a phishing website, it uses an active warning, whilst for a suspected webpage a passive warning is used. An active warning displayed by the Google Chrome browser is shown in Figure 12.

Figure 12: Active warning message displayed in Google Chrome

Similarly, many other researchers have designed user friendly interfaces. Dynamic Security Skins [Dhamija and Tygar, 2005] uses a random photographic image in the background of the password window as a cue to differentiate between legitimate and fake websites. Each Internet user is assigned a unique image and is recommended to enter the password only after his personal image is loaded. SpoofStick displays the website’s real domain and exposes websites that obscure their domain name [SpoofStick]. TrustBar makes SSL more visible by displaying the logos of the website and its certificate authority [Herzberg and Gbara, 2004]. The Netcraft toolbar enforces the display of browser navigational controls (toolbar and address bar) in all windows, to defend against pop-up windows which attempt to hide the navigational controls. In addition, it clearly displays the site’s hosting location, including the country, which helps in evaluating fraudulent URLs [Netcraft].

2.3.4. Development of countermeasures to automatically detect phishing

The human ability to detect phishing is limited and varies among Internet users. Moreover, manual methods of identifying phishing can be deceived. Therefore, several software tools have been developed in order to identify phishing websites. These software tools can be phishing email filters, such as Phishing Identification by Leading on Features of Email Received (PILFER) [Fette et al., 2006], which uses a machine learning based approach to examine a set of ten features in a suspected email. PILFER also uses a Support Vector Machine (SVM) classifier for the reference implementation. Another approach to spam filtering is greylisting, which blocks spam at the mail server level based on the behaviour of the sending server rather than the content of the message. A mail server that employs greylisting deliberately rejects mails from unknown or suspect sources with a temporary error until a configured period of time has passed. It relies on the fact that many spam sources, i.e., the Simple Mail Transfer Protocol (SMTP) servers used by spammers, do not maintain queues for retrying message transmission. When a sender has proven itself able to properly retry delivery, the sender is added to the whitelist so that mail from that sender is no longer impeded. However, the problem with phishing email filters is that they fail to stop phishing attacks that use other mediums, such as IRC, instant messengers, and advertisements [IBM Systems, 2007]. Moreover, such phishing email filters are unable to stop all malicious emails.

There are several software tools, mostly in the form of browser toolbars, for detecting phishing attacks that use other mediums, including email. Some anti-phishing tools are: phishing prevention tools integrated in popular anti-virus software, such as Norton antivirus and AVG antivirus; tools built into popular browsers, such as Internet Explorer, Mozilla Firefox, and Google Chrome; and independent applications or web browser add-ons, such as FraudEliminator [Fraud Eliminator], the Netcraft toolbar [Netcraft], the eBay toolbar [eBay Toolbar’s Account Guard], the EarthLink toolbar [EarthLink], the GeoTrust TrustWatch toolbar [Geo Trust], SpoofGuard [Chou et al., 2004], the CallingID Toolbar [CallingID], the Cloudmark Anti-Fraud toolbar [Cloudmark], Google Safe Browsing [Google Safe Browsing], SpoofStick [SpoofStick], and TrustBar [Herzberg and Gbara, 2004]. These tools employ either heuristic methods or list based methods, or both, for phishing detection. Heuristic methods check characteristics of a website and decide whether it is phishing or not, whilst list based methods maintain a list of either genuine websites (whitelist) or phishing websites (blacklist) and check whether the website is in the list in order to decide whether it is phishing or not. Each technique has its own pros and cons. This thesis is also about heuristic methods for phishing detection, so the details of heuristic methods and list based methods used for the automatic detection of phishing are covered in the later chapters.

Then, there is DNSSEC Validator [DNSSEC Validator], an add-on made for the Mozilla Firefox browser that detects DNS spoofing. The DNSSEC Validator compares the DNS records of the domain name used in the page address with the IP addresses from which Firefox downloaded the page in order to detect DNS spoofing. A screenshot of DNSSEC Validator is shown in Figure 13.

Figure 13: A screenshot of the browser add-on “DNSSEC Validator” [DNSSEC Validator]
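The greylisting behaviour described above can be sketched in a few lines of Python. This is a simplified illustration under assumed parameter values (the delay and the in-memory state are my own placeholders), not the implementation of any particular mail server; a real server would key its state on the (sender IP, envelope sender, envelope recipient) triplet and store it persistently.

import time

GREYLIST_DELAY = 300      # assumed: seconds a new sender must wait before retrying
first_seen = {}           # sender triplet -> time of the first delivery attempt
whitelisted = set()       # senders that retried correctly and are now accepted

def greylist_decision(triplet, now=None):
    # Return 'accept' or 'defer' (a temporary SMTP error) for a delivery attempt.
    now = time.time() if now is None else now
    if triplet in whitelisted:
        return "accept"
    if triplet not in first_seen:
        first_seen[triplet] = now
        return "defer"                    # legitimate servers will queue and retry
    if now - first_seen[triplet] >= GREYLIST_DELAY:
        whitelisted.add(triplet)          # the sender proved it maintains a retry queue
        return "accept"
    return "defer"

Spamming tools that send in a single burst and never retry never reach the accept branch, whilst well-behaved mail transfer agents are delayed only once.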

2.3.5. Evaluate the effectiveness of existing phishing prevention methods

Despite wide media coverage of phishing and numerous phishing prevention techniques, phishing remains effective. This raises serious concern about the efficiency of the methods used for phishing prevention. Many studies have been conducted in order to examine the efficiency of the existing phishing prevention methods. These studies expose the reliability of phishing prevention methods and at the same time point out their deficiencies, which can be helpful for improving existing phishing prevention methods as well as forthcoming methods. The study by Wu et al. [2006b] on the effectiveness of security toolbars revealed that existing security toolbars are a big failure in mitigating phishing attacks. They pointed out several factors that can be responsible for the ineffectiveness of phishing prevention methods, such as a very small alert display, located at the periphery and going unnoticed in comparison to the content display; security not being the primary goal of Internet users; and distrust towards such toolbars due to their false positives. In another similar study, Zhang et al. [2007b] observed the tool performance, testing methodology, and user interface of eleven selected phishing prevention tools (i.e., the CallingID toolbar, Cloudmark Anti-Fraud toolbar, EarthLink toolbar, eBay toolbar, Firefox 2, GeoTrust TrustWatch toolbar, Microsoft Phishing Filter in Windows Internet Explorer 7, Netcraft anti-phishing toolbar, Netscape browser, and SpoofGuard) and revealed that these tools are underperforming and all of them are incapable of protecting Internet users from phishing attacks that use sophisticated techniques. Their performance varies with the source of the phishing URLs used. Further, many of the tools failed even for very simple exploits. They suggest that no single phishing prevention method can ensure high performance; multiple methods supporting each other, used together in an anti-phishing tool, can provide better results.

Some studies have only evaluated the effectiveness of the anti-phishing toolbars built into web browsers. For instance, Ludl et al. [2007] analyzed the effectiveness of the blacklists maintained by Microsoft and Google. The blacklist maintained by Microsoft is used in Internet Explorer, whilst the blacklist maintained by Google is used by Google Chrome and Mozilla Firefox. As Google Chrome, Mozilla Firefox, and Internet Explorer are the most widely used web browsers, their built-in anti-phishing toolbars are also the most widely used. The study focused on three crucial factors: the coverage and quality of the blacklists, and the list update time. It indicated that blacklist based phishing prevention is satisfactorily effective, especially Google’s; however, the inability of blacklist based phishing prevention to detect new phishing attacks can be handled to a large extent using heuristic techniques, in the way the IE browser uses a heuristic technique to complement the list based technique. Likewise, Bian et al. [2009] evaluated the effectiveness of three external online resources (the Google PageRank system, Yahoo! Inlink data, and the Yahoo! Directory service). Their findings suggested that such online resources can increase the efficiency of detection when used in conjunction with existing countermeasures. Similarly, Egelman et al. [2008] studied the effectiveness of Internet browser warnings and found that most Internet users heed active warnings (79% in their experiment), whilst a passive warning was no different from not displaying any warning. They further found that the active warning in Mozilla Firefox was more helpful than the IE active warning. Sheng et al. [2009] also performed an empirical analysis of the effectiveness of phishing blacklists and found that phishing blacklists are a poor choice for fighting zero-hour phishing websites. Li and Helenius [2007] performed a heuristic usability evaluation of five selected anti-phishing client-side applications (i.e., the Google toolbar, Netcraft toolbar, SpoofGuard, Phishing Filter in IE, and the anti-phishing IEPlug). They suggested the following three points for an effective usability design of anti-phishing client-side applications:
(i) The toolbar’s status should be visible to Internet users and the anti-phishing client-side application should have an intuitive interface.
(ii) Warnings should help Internet users to take the correct decision. The warning for a suspicious webpage should not be as strong as the warning for a detected phishing webpage.
(iii) The anti-phishing client-side application should be aided by a suitable help system.

2.3.6. The need to invent proactive strategies for phishing prevention

Most investigations of phishing are aimed at finding new reactive techniques. A reactive technique is often effective against the types of phishing which existed when the technique was designed, but it abruptly fails to detect a phishing attack that employs a new technique. Current trends in research are chiefly targeted towards defending against attacks from phishers or taking down phishing websites, while scammers are continuously devising new attacks. In fact, no adequate effort seems to be applied to reach the root of the problem. There is a need for more research that can:
• Strengthen the weak points in legitimate systems and make them tedious to misuse,
• Develop strategies to retaliate against and circumvent the phishers, and
• Track the phishers and bring them under law enforcement.
Law enforcement could be difficult in those countries that do not have provisions for such cases; however, a study by the APWG [2012] showed that the countries that host most of the phishing websites are developed countries, with the USA topping the list, hosting on average about fifty percent of the phishing websites. The other countries hosting most of the phishing websites in the first quarter of 2012 are shown in Figure 14.

Figure 14: Top countries hosting phishing websites [APWG, 2012]

Most of the countries in the list have stern laws against cyber crime, so tracking such wicked people and punishing them by law can discourage many phishing aspirants, or at least those who are non-techie and still conduct phishing. Similarly, making phishing activity more sophisticated to conduct can strongly affect the naïve players in phishing. There are some proactive strategies that are directed towards reducing phishing. One way is to use web crawlers, similar to those used by search engines, to search for phishing websites and pass this information to the appropriate Internet Service Provider (ISP) to bring down the websites. However, there are some limitations to this technique. Many countries do not have legal provisions to remove such websites. Moreover, such detection can consume time, which can be enough for scammers to fulfill their illegal desires. Another similar technique is to flood the phishers’ database with false information, also called poisoning (it is not a Denial of Service (DOS) attack). This can make it difficult for phishers to differentiate between valid and false data and sometimes can even make the database useless. This technique, too, has limitations. It requires tracking the spoofed websites without any false negatives. The time taken to track the fraudulent website can be sufficient for phishers to victimize many Internet users. Moreover, any false negative result can cause serious consequences and lawsuits. Another proactive technique is to keep watch on downloads of a corporation’s logo. Many phishers use an authentic logo in their websites to give a more real look to their fake websites. However, this technique, too, has some limitations. Firstly, the corporation’s logo is also used by the respective corporation’s partners and some other legitimate websites; phishers can easily download it from them. Figure 15 shows a page from a website that has logos of various banks, and hundreds of such websites are available online.

Figure 15: Logos of various banks used in a personal blog website

Secondly, making a copy of a legitimate website’s logo is not difficult for many good designers. After all, how many Internet users can correctly differentiate between a legitimate logo and its copy is still a question to be answered.

One of the prominent works related to proactive strategies to track phishers is that of McRae and Vaughn [2007], which uses web bugs and honeytokens. In their experiment, they used uniquely named HyperText Markup Language (HTML) image tags of one pixel by one pixel for each phishing e-mail as honeytokens or web bugs. The HTML and image links were filled into all the text-type variables in the phishing website forms and submitted. When the phishers viewed the results in an HTML enabled environment that does not filter or block third party images from being loaded, the image was retrieved from the researchers’ server by the attacker. This can be used to gather information about the individuals or groups who view the data collected by phishing schemes. However, this technique, too, can be bypassed using various approaches, for instance:
• Viewing the results of the phishing form in text-only viewers.
• Disabling the HTML code and preventing any referral from being made in the web server log.
• Disabling the loading of third party images in whatever browser is used.
• Using a web proxy (usually some hacked system) to view the results.
Another proactive approach is from Hacker Factor Solutions [2005], who proposed using page encoding to encapsulate each web page and thus stop phishing websites generated using mirroring techniques. The availability of various mirroring techniques (the web browser “Save as” option is the simplest mirroring technique; there are also tools such as “wget”, WebWhacker, Templeton, telnet, and netcat) has drastically reduced the time and effort phishers need to make fake websites. In fact, such techniques are acting as a catalyst for the vigorous growth of phishing and encouraging many novice cybercriminals to perform phishing. However, the problem with this approach is that it uses Javascript code to decode the page content, whilst all popular web browsers have options to disable Javascript and, at the current time, only a few websites require Javascript to be enabled. Figure 16 shows the options to disable Javascript in the Google Chrome browser.

Figure 16: Options to enable and disable JavaScript in Google Chrome
Moreover, there are add-ons, such as NoScript for the Mozilla Firefox browser, that can be used to allow the execution of JavaScript, Java, Flash, and other plugins only for selected websites. Figure 17 shows the options to disable script execution in NoScript.

Figure 17: Options to enable and disable script execution in NoScript
There are many other issues with this approach, some of which are:
 Search engines will be unable to index the page, since it is encoded.
 It cannot provide protection against phishing malware.
 It requires a routine (perhaps weekly) change of the encoding algorithm.
 It needs specialized skills and more time to develop such websites.

Likewise, another technique, from Li et al. [2007], suggests misuse-oriented prevention, i.e., protecting against phishing attacks with the misuse case method from a system design perspective. Security requirements are often not stated during requirements elicitation and analysis, leaving vulnerabilities in future information systems which are later exploited by scammers. Such vulnerabilities can be addressed with a misuse case approach during requirements gathering: a designer is asked to abuse each use case, after which a countermeasure is identified and employed, and this continues iteratively until the design is sufficiently robust. In summary, the misuse case methodology is:
a. Design the use cases of the system;
b. Impersonate a misuser who intends to compromise the system;
c. Design the misuse case for a specific use case;
d. Find a countermeasure for the misuse case;
e. Judge whether the countermeasure is vulnerable; if yes, go to step c, otherwise go to the next step;
f. Find whether there is a further possible vulnerability or misuse; if yes, go to step c, otherwise the security requirements elicitation ends.
Even though the technique could be beneficial for cases in which websites are hacked and compromised to conduct phishing, its ability to prevent the majority of phishing attacks, in which phishers develop independent websites or ask for information through e-mail, is questionable. Moreover, no matter how foolproof a system design is, hackers may still find ways to intrude. This can be seen from the news of the attack on the Pentagon (the headquarters of the United States Department of Defense) computer system in which 24,000 files were stolen [NYDailyNews.com, July 14 2011], and the news that a hacker succeeded in breaking into computer systems owned by Oracle, NASA (National Aeronautics and Space Administration), the U.S. Army, and the U.S. Department of Defense [IDG News Service, May 10 2012].

2.4. Classification of phishing prevention techniques
There are several promising techniques that significantly prevent phishing attacks. These techniques have to deal with both technical and non-technical factors. Therefore, at the first level, phishing prevention techniques can be classified into technical methods and non-technical methods. The technical methods can be further categorized into list based methods and heuristic methods [Dunlop et al., 2010]. A classification hierarchy of phishing prevention techniques is shown in Figure 18.

Figure 18: Classification of phishing prevention techniques
 Technical methods. Technical methods deal with technical vulnerabilities in information systems; tools for phishing detection, prevention, and response; and the design of games, online tutorials, and quizzes for Internet awareness. Some examples are: anti-virus software integrated with phishing prevention; systems built into web browsers; and software tools such as FraudEliminator, the Netcraft toolbar, the eBay toolbar, the EarthLink toolbar, the GeoTrust TrustWatch toolbar, SpoofGuard, the CallingID toolbar, the Cloudmark Anti-Fraud Toolbar, Google Safe Browsing, SpoofStick, TrustBar, AntiPhish, DOMAntiPhish, PwdHash, etc.
  o List based methods. List based methods classify websites as either phishing or trusted and maintain them in a lookup database in the form of either a blacklist or a whitelist. These lists can contain IP addresses, domain names, or URLs. A blacklist is a collection of IP addresses, domain names, or URLs of phishing websites, whilst a whitelist is a collection of IP addresses, domain names, or URLs of legitimate websites. List based methods are discussed in detail in section 3.1.
  o Heuristic methods. Heuristic methods check for one or more characteristics of a website and decide whether it is a phishing or a legitimate website. They utilize properties such as the HTML and script code of the website, the URL, the UI design, and the page content for phishing website identification. Heuristic methods are discussed in more detail in section 3.2.

 Non-technical methods. Non-technical methods deal with factors related to studying Internet users' behaviour, the social engineering principles and techniques used by phishers, the legality of using particular techniques, training Internet users about phishing, information and guidelines for safe browsing, and cyber laws to punish phishing culprits. Since the purpose of this thesis is to concentrate on technical methods, i.e., list based methods and heuristic methods specifically used in browser based applications for phishing prevention, non-technical methods are not discussed further here.

2.5. Phishing prevention applications
Both list based methods and heuristic methods are implemented in server-side applications and in client-side applications (i.e., browser based applications, since client-side applications are mostly realized as web browser toolbars) used for phishing prevention. According to their implementation architecture, client-side applications are further categorized into two types: client-server structured applications and independent applications [Li et al., 2007]. A classification hierarchy of phishing prevention applications is shown in Figure 19.

Figure 19: Classification of phishing prevention applications
(i) Server side applications. Server side applications are employed in servers (e.g., an organizational server, e-mail server, or ISP server) for phishing identification and remedy. Bayesian filters are installed in the server to detect phishing e-mails. Although such filters are an effective technique for phishing prevention, it should be noted that they cannot be one hundred percent accurate and, above all, e-mail is not the sole channel of phishing attacks (other popular channels are message boards, web banner advertising, and instant chats, such as Internet Relay Chat (IRC) and instant messengers). Many other applications that use IP address and URL blacklists, heuristics, and fingerprinting (which compares known samples of phishing messages against incoming e-mails) are deployed in ISPs' servers for phishing prevention.
(ii) Client side applications or browser based applications. The web browser is the most common means by which Internet users access web content. There are other means too, but they are usually tricky and complex, which makes them unsuitable for general Internet users. Furthermore, the browser is the foremost layer with which the Internet user interacts, and tracking the user's activity at this level is potentially more effective. Its strategic position makes it suitable for warning Internet users directly and effectively [Sheng et al., 2009]. A study by Egelman et al. [2008] even found that the phishing warning in Mozilla Firefox 2 was very effective and was able to stop all participants in their study from entering sensitive information into fraudulent websites. In addition, the web browser market is dominated by a small number of browsers, i.e., Google Chrome, Internet Explorer, Mozilla Firefox, Safari, and Opera. Altogether, this makes it easier to handle phishing at the browser level. This does not mean, however, that the use of web browsers is free of limitations. Most browser based techniques act only when the webpage is loaded, which is risky from the perspective of malware and other malicious code used for phishing [Garera et al., 2007; Ma et al., 2009]. Another factor that has always been challenging for researchers and security experts in browser based techniques is the mode of displaying the warning messages. A passive warning used to notify about phishing, such as a change in colour or a pop-up with textual information displayed at the corner or periphery of the browser without interrupting the browsing activity, is either unnoticed or neglected by Internet users [Wu et al., 2006]. The current trend is to use active warnings, which force Internet users to notice and take action by interrupting the browsing activity. However, it is debatable how acceptable such interruptive warnings are, more specifically in the case of a false negative. This might be a reason why IE uses an active warning only when it is confirmed that the website is a phishing website and otherwise uses a passive warning for doubtful websites. Thus, such warnings should be precise and accurate; any wrong warning or alert can raise questions about their reliability, which will ultimately reduce Internet users' trust towards them.
Despite some limitations in the use of browser based techniques for phishing prevention, they are widely used. Nowadays, most phishing prevention applications are concentrated on the most vulnerable client side [Li et al., 2007], and for this purpose browser based applications are highly suitable. Such applications are either built in or they are independent browser toolbars that can be embedded into the web browser. The current versions of all popular browsers (Google Chrome, IE, and Mozilla Firefox) come with an inbuilt phishing prevention system and some other features (e.g., blocking pop-up windows, enabling and disabling JavaScript or Active Scripting in IE, and warning when sites try to install add-ons in Mozilla Firefox) that contribute to the fight against phishing attacks. Some examples of independent browser toolbars are the Netcraft Anti-phishing toolbar, eBay's Account Guard, SpoofGuard, the Microsoft anti-phishing toolbar for IE, etc.
 Client-server structured applications. Client-server structured applications routinely request updates and maintenance from a server. Such toolbars are usually made by commercial organizations, such as Google, Microsoft, and Netcraft. Mozilla Firefox uses Google Safe Browsing and downloads its blacklist for the first time when the feature is enabled; after that, it updates the list every thirty minutes. It communicates with the Google server on two occasions: during the regular update of the blacklist, and when a reported phishing website is encountered, so that before blocking the website it double-checks that the website has not been removed from the list since the last update. Similarly, Google Chrome contacts the Google servers within five minutes of start-up, and approximately every half an hour thereafter, to download updated lists of suspected phishing websites. Likewise, IE from version 8 onwards uses the "SmartScreen Filter" for phishing detection, which performs both local verification and an online lookup for phishing website identification. The SmartScreen Filter uses both a list based method and a heuristic method for phishing website identification. In the beginning, i.e., during local verification, it looks for the website's URL in a whitelist (generated by Microsoft) stored on the user's computer. In case the website is not found in the list, it uses a heuristic method for probable deception detection. When the heuristic method indicates that the website is suspicious, it sends the website's address to the Microsoft online service in order to compare it with its blacklist, i.e., the online lookup. Figure 20 shows the option to enable the "SmartScreen Filter" in IE 9.

Figure 20: SmartScreen Filter in IE9
Similarly, the Netcraft Toolbar also communicates with the Netcraft web server's database to obtain the blacklist of phishing websites [Netcraft]. In addition, the toolbar displays other information related to the website, such as the date it was first surveyed, the country where it is hosted, and its popularity amongst toolbar users, as can be seen in Figure 21.

Figure 21: Netcraft Toolbar [Netcraft]
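To summarize the layered flow described above for client-server structured applications such as the SmartScreen Filter (local whitelist first, then local heuristics, then an online blacklist lookup), the following is a minimal sketch. It is purely illustrative: the lists and the heuristic rule are placeholders, not the actual data or logic used by any vendor.

    # A minimal sketch (not Microsoft's implementation) of the layered flow:
    # local whitelist first, then a heuristic pass, and only then an online
    # blacklist lookup. All lists and the heuristic are placeholders.
    from urllib.parse import urlparse

    LOCAL_WHITELIST = {"paypal.com", "ebay.com", "nordea.fi"}     # shipped with the client
    ONLINE_BLACKLIST = {"paypa1-secure-login.example.net"}        # stands in for the vendor's service

    def looks_suspicious(url: str) -> bool:
        """Very rough stand-in for the heuristic stage."""
        host = urlparse(url).hostname or ""
        return "@" in url or host.replace(".", "").isdigit() or host.count("-") >= 2

    def check(url: str) -> str:
        host = urlparse(url).hostname or ""
        if host in LOCAL_WHITELIST:              # step 1: local whitelist, no network traffic
            return "trusted"
        if not looks_suspicious(url):            # step 2: local heuristics
            return "probably fine"
        if host in ONLINE_BLACKLIST:             # step 3: online lookup only for suspicious URLs
            return "confirmed phishing"
        return "suspicious"

    print(check("http://paypa1-secure-login.example.net/login"))

The design point is that the cheap local checks run first, and the network lookup is performed only for URLs that the heuristics consider suspicious.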

 Independent applications. Independent applications use data stored in the local system to identify a deceptive website. The working mechanism of such toolbars is as follows: after the webpage is downloaded to the local computer, the toolbar compares the characteristics of the website with the data stored locally, and when any anomalies are detected, it warns the Internet user. An example of such a toolbar is SpoofGuard, a plug-in for IE that accesses the IE history file along with three additional files stored in the user profile directory for phishing detection. The three additional files comprise: a read-only file of the host names of e-mail sites, such as Hotmail, Yahoo!Mail, and Gmail, used in the referring-page check; a file of hashed password history (domain name, username, and password); and a file of hashed image history [SpoofGuard].
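The hashed password history mentioned above lends itself to a simple illustration. The sketch below is an assumption-laden simplification (it hashes only the password and keeps the history in memory), not SpoofGuard's actual file format or algorithm; it merely shows how a locally stored hash history can reveal that a known password is being sent to an unfamiliar domain.

    # Illustrative only: a locally kept, hashed credential history.
    import hashlib

    password_history = {}   # maps hash(password) -> set of domains where it was entered

    def _h(value: str) -> str:
        return hashlib.sha256(value.encode()).hexdigest()

    def on_credentials_entered(domain: str, username: str, password: str) -> None:
        key = _h(password)
        seen_on = password_history.setdefault(key, set())
        if seen_on and domain not in seen_on:
            # The same password hash was previously used on a different domain:
            # a typical sign of a spoofed login page.
            print(f"WARNING: password previously used on {seen_on} is being sent to {domain}")
        seen_on.add(domain)

    on_credentials_entered("ebay.com", "alice", "s3cret")
    on_credentials_entered("ebay-login.example.net", "alice", "s3cret")  # triggers the warning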

3. Analysis of strengths and limitations of technical phishing prevention methods

The two technical methods for phishing prevention (i.e., list based methods and heuristic methods) are further decomposed into their constituents depending on the strategies used for phishing detection. Further details about them, their pros and cons, and several related studies are discussed in the following sections.

3.1. List based methods
List based methods are reactive techniques for phishing prevention. They maintain a lookup database of either trusted websites (whitelist) or malicious websites (blacklist). Such a list can be maintained either locally or hosted on a central server.

3.1.1. Whitelist method
A whitelist is a list of trusted websites that an Internet user visits on a regular basis. When the whitelist is exclusive, it allows access only to those websites which are considered trusted, and it is thus highly effective against zero hour phishing. It also does not produce any false positive results unless there is a wrong entry in the whitelist. However, it is very difficult to determine beforehand all the websites which users may want to browse and to update the list accordingly in time. Any failure in updating the whitelist causes a high false negative rate and a severe usability penalty, which might also be a reason behind the low popularity of whitelists. The SmartScreen Filter [MSDN IEBlog] is a feature in the IE8 and IE9 browsers that uses a whitelist for phishing prevention; however, it additionally uses a heuristic method and a blacklist method in order to confirm a phishing webpage. Anti-Phishing IEPlug [Li and Helenius, 2007] is another toolbar, made for Internet Explorer, that uses the whitelist method. It uses a whitelist of domain names maintained by the Internet user or the computer administrator. It checks whether the webpage that the Internet user wants to visit contains a password input field. When a password input field is detected, it checks whether the domain is included in the whitelist. It warns the Internet user when an address to be visited contains a keyword that is saved in the whitelist but the actual domain is different.
There are very few studies that have focused on the improvement of whitelists. One such study is from Cao et al. [2008]. They designed an approach called "Automated Individual White-List (AIWL)" that stores all familiar websites with a Login User Interface (LUI). AIWL uses a Naïve Bayesian classifier in order to identify websites with a login page. Each time an Internet user submits confidential information to a website that is not in the list, the user gets an alert message. A new website is added to the list when the user continues to submit confidential information to the website several times despite the warning. Although this approach includes a mechanism for the automatic update of the whitelist, which differentiates it from a general whitelist method, it possesses several limitations, such as:
 The initial list used by this method is not automated, which means it will either start with zero entries or rely on some other mechanism for the initial list.
 The update mechanism is highly dependent on Internet users' ability to distinguish legitimate websites, when studies have shown that Internet users are not good at identifying phishing websites [Friedman et al., 2002; Dhamija et al., 2006; Karakasiliotis et al., 2007; Jagatic et al., 2007; Herzberg and Jbara, 2004; Odaro and Sanders, 2011].
 The reliability of a method that alerts the user even for legitimate websites, and many times for the same legitimate website, is in itself questionable.
In conclusion, the whitelist method can be an effective technique when used to complement another technique, such as the blacklist method or a heuristic method. It can be used for first-level verification, so that the legitimate websites which Internet users visit very often do not have to go through a time-consuming verification process and, most importantly, do not get misclassified.
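As an illustration of the kind of client-side check performed by whitelist toolbars such as Anti-Phishing IEPlug, the following is a minimal sketch. The whitelist, the page content, and the matching rule are illustrative assumptions, not the toolbar's actual implementation.

    # Illustrative whitelist check: warn when a page asking for a password lives
    # on a domain that merely contains a whitelisted brand name but is not that domain.
    from urllib.parse import urlparse

    WHITELIST = {"paypal.com", "nordea.fi", "ebay.com"}

    def has_password_field(html: str) -> bool:
        return 'type="password"' in html.lower()

    def verdict(url: str, html: str) -> str:
        if not has_password_field(html):
            return "no credentials requested - ignore"
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in WHITELIST):
            return "trusted login page"
        if any(d.split(".")[0] in host or d.split(".")[0] in url.lower() for d in WHITELIST):
            return "WARNING: whitelisted brand name used on a non-whitelisted domain"
        return "unknown login page - be careful"

    page = '<form><input type="password" name="pw"></form>'
    print(verdict("http://paypal.com.secure-verify.example.net/login", page))

In practice the password-field detection would use proper HTML parsing and the keyword check would be more careful, but the sketch captures the warning condition described above: a whitelisted brand name appearing in an address whose actual domain is different.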

3.1.2. Blacklist method
A blacklist is a list of IP addresses, Domain Names (DNs), or URLs of treacherous websites. Although the IP addresses and DNs used by scammers can be blocked, phishers often use hacked DNs and servers [MarkMonitor Inc., 2008], so blocking whole DNs or IP addresses can unintentionally block many legitimate websites which share the same IP addresses and DNs. Therefore, blacklisting URLs is comparatively more appropriate [Sheng et al., 2009]. Blacklisting is a widely used technique for phishing prevention; even the popular web browsers (i.e., Google Chrome, Mozilla Firefox, and IE) use blacklists for phishing detection. A blacklist detects only the malicious websites that are included in it, so it has a very low false positive rate and is favoured over heuristic methods. The low false positive rate and the simplicity of design and implementation, especially within a browser, may be the reasons behind the popularity of the blacklist method. The low false positive rate also reduces the liability risk of incorrectly labelling a legitimate website as a phish. Despite all these benefits and the wide popularity of blacklists, the method faces the following three main challenges.
(i) Zero hour phishing. It takes time to include a new phishing website in the blacklist. Thus, the blacklist is ineffective against zero hour phishing, leaving Internet users vulnerable until the website is discovered. An empirical analysis by Sheng et al. [2009] of tools that use blacklists revealed that most of them are able to catch less than 20% of phish at zero hour. Moreover, the majority of phishing websites are short lived and most of the damage is done during this short time span. Thus, a delay in list updates reduces the effectiveness of the blacklist.
(ii) Update mechanisms. Every day hundreds of new phishing websites are added to the Internet. Most blacklists, for instance PhishTank, rely on manual verification of websites because of its high accuracy, despite the fact that manual verification is a time-inefficient process. There are some blacklists, such as the Google blacklist, that use automatic verification employing heuristics via machine learning techniques, which is a quick process but introduces comparatively more inaccuracy into the list. The compilation and maintenance of a blacklist is in itself a multiple step process, and the two steps are:

 Data (phishing URLs) gathering. It needs the gathering of data (phishing URLs) from various sources, such as spam traps, URLs detected by filters, URLs reported by users (the APWG list, the PhishTank list), and URLs compiled by other parties, such as takedown vendors or financial institutions.
 Verification of websites. After the data gathering, the websites need to be verified in order to identify phishing websites. This verification often relies on human reviewers for reliability, and sometimes verification by multiple reviewers is needed for a more accurate result. PhishTank's statistics showed that the manual review process of URLs takes a considerable amount of time, ranging from a median of over ten hours in March 2009 to a median of over fifty hours in June 2009 for a single URL [Whittaker et al., 2010]. Although PhishTank was able to improve this figure significantly, dropping the median time to identify a phish to 12 hours in January 2010 and to 2.4 hours in January 2011, its verification mechanism still leaves several suspected URLs unidentified [Liu et al., 2011]. The verification mechanism prescribed by PhishTank requires 4 votes to confirm a website as a phish, and those URLs that receive fewer than 4 votes, also called "wasted votes", are declared unidentified URLs. Moreover, it should be noted that the median time of 2.4 hours is counted from the moment the suspected website is submitted, which means there can be an additional delay before the submission of the website to PhishTank, while most victims fall for a phishing scam within eight hours from the start of the attack [Kumaraguru et al., 2009]. Above that, phishing websites grow endlessly, making it difficult to always keep the list up to date. Even human verification is prone to human error. Moore and Clayton [2008] found a power-law issue among the participants of PhishTank (i.e., participants who participate only periodically are more prone to making errors in labelling), and at the same time taking human effort entirely out of the loop is too risky [Edwards et al., 2007].
(iii) Matching mechanisms. The third difficulty of the blacklist method is the way of matching the URLs that an Internet user enters with those in the list. Exact matching of URLs can easily be evaded by automatically generated URLs from phishers [Prakash et al., 2010]; for example, the Rock-Phish gang uses phishing

toolkits to generate a large number of slightly varied URLs for a single phishing website [MarkMonitor Inc., 2008]. A way to tackle such a problem is to include an ability to detect changes in the URLs, but this introduces more inaccuracy into the blacklist. Therefore, it is clear that the efficiency of a blacklist basically depends on the following factors:
 list accuracy,
 list update mechanism, and
 URL matching mechanism.
Several studies have worked on those factors in order to increase the efficiency of the blacklist technique. One such study, by Liu et al. [2011], aims to improve the list update mechanism while maintaining its accuracy. They suggest harnessing the wisdom of crowds to maintain extremely low false positive rates while also reducing the time to verify attacks. They designed an approach called "Aquarium", a crowdsourcing technique that clusters similar phishes together and asks manually trained participants to vote on a cluster rather than on individual phishes. The mechanism uses website URLs that have been submitted to PhishTank but are yet to be verified. The URLs are first passed through a whitelist to filter out some of the legitimate pages and reduce the effort required from reviewers. After that, the remaining URLs are clustered using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and the shingling algorithm, commonly used in search engines for duplicate page detection. Finally, the clustered URLs are submitted as tasks to Amazon's Mechanical Turk (MTurk) system as Human Intelligence Tasks (HITs) for verification by the participants. The weighting model for participants' votes is based on their voting history. Like PhishTank, this approach too requires a minimum of four votes to classify a cluster of URLs as phish. Although this approach improves the reviewers' efficiency in quantity, their efficiency in quality is still questionable. Moreover, limitations such as wasted votes, the power-law issue of participation, limitations arising from MTurk (e.g., each reviewer may have a different browsing experience or get distracted by MTurk's physical environment) [Kittur et al., 2008], and the inability to correct participants' incorrect classifications still persist.
Similarly, to make the classification mechanism swift and timely, Whittaker et al. [2010] designed a scalable machine learning classifier that automatically classifies phishing pages and is used to maintain Google's blacklist. This classifier examines the features which human reviewers look for in suspected websites to identify phishes, e.g., the page's URL, the page's HTML content collected by a crawler, and hosting information. It uses a logistic regression classifier to make the final decision. The classifier classifies the websites submitted by Internet users and also those collected from Gmail's spam filters. Moreover, the blacklist maintained by Google has been found to be more effective in phishing prevention than its contemporaries [Ludl et al., 2007].
The problem with the approach of Whittaker et al. [2010] is that its efficiency depends on the efficiency of Gmail's spam filter, when there are various other ways (e.g., Internet Relay Chat, i.e., IRC, web banner advertising, instant messengers, and other e-mail services such as Hotmail, Yahoo!Mail, RediffMail, and so on) that scammers use to reach their potential victims [IBM Internet Security Systems, 2007], and on the activeness of Internet users in reporting suspected websites, when several studies have shown that Internet users are not good at identifying phishing websites [Friedman et al., 2002; Dhamija et al., 2006; Downs et al., 2006; Wu et al., 2006b].
Likewise, an approach called "PhishNet" by Prakash et al. [2010] attempts to tackle the URL matching problem. PhishNet uses two components:
(i) A URL prediction component. It works offline and systematically generates new URLs that are modified forms of the URLs in the existing blacklist, employing various heuristics, such as: changing the Top Level Domain (TLD); IP address equivalence, i.e., grouping together URLs having the same IP address; directory structure similarity, i.e., grouping together URLs with a similar directory structure; query string substitution; and brand name equivalence, i.e., replacing one brand name with another.
(ii) An approximate URL matching component. It performs an approximate matching of the URLs entered by Internet users against the URLs in the blacklist. In fact, it utilizes the finding that malicious URLs, even after mutation, usually remain syntactically close to each other or semantically the same, i.e., use the same IP address.
The verification of the generated URLs, to find out whether they are indeed malicious, is done with the help of Domain Name System (DNS) queries and content matching techniques in an automated fashion, thus ensuring minimal human effort. The matching is performed using a novel data structure that performs approximate matches on incoming URLs based on regular expressions and hash maps to catch syntactic and semantic variations. Even though this is a novel technique for generating various modified forms of URLs, it seems to utilize very few heuristic features to check whether a newly generated URL belongs to a phish or not. This means that a phishing website may get misclassified, especially since the approach requires ninety percent similarity to the parent URL's webpage in order to declare a page a phish.
In conclusion, an effective blacklist must be comprehensive, error free, and timely. An incomprehensive blacklist fails to protect a portion of its users. Similarly, a blacklist with wrong entries produces unwanted warnings which gradually train Internet users to ignore the warnings [Whittaker et al., 2010]. Moreover, untimely updates can significantly degrade the quality of the list. Therefore, an effective blacklist can be achieved only when it uses an error free automatic classifier with broad sources of suspected websites for verification and possesses a URL matching mechanism that can detect all derivative URLs of phishing URLs. The study by Sheng et al. [2009] found that tools that use a heuristic method to complement a blacklist perform better than those using only a blacklist, especially against zero hour phishing. Table 1 shows the summary of list based methods with their main characteristics, pros, and cons.
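To illustrate the URL prediction idea behind PhishNet's first component, the following is a minimal sketch that derives candidate URLs from a known phishing URL by swapping top-level domains and brand names. The TLD and brand lists are illustrative assumptions; the actual system uses several further heuristics and verifies every generated candidate before adding it to the blacklist.

    # Illustrative URL-variant generation in the spirit of PhishNet's
    # offline prediction component (not its actual implementation).
    from itertools import product
    from urllib.parse import urlsplit, urlunsplit

    TLDS = ["com", "net", "org", "info"]
    BRANDS = ["paypal", "ebay", "bankofamerica"]

    def predict_variants(url: str):
        parts = urlsplit(url)
        host_labels = parts.hostname.split(".")
        variants = set()
        # 1. Swap the top-level domain.
        for tld in TLDS:
            new_host = ".".join(host_labels[:-1] + [tld])
            variants.add(urlunsplit((parts.scheme, new_host, parts.path, parts.query, "")))
        # 2. Replace one brand name with another anywhere in the URL.
        for old, new in product(BRANDS, BRANDS):
            if old != new and old in url:
                variants.add(url.replace(old, new))
        variants.discard(url)   # keep only URLs that differ from the original
        return sorted(variants)

    for v in predict_variants("http://paypal.secure-login.example-phish.com/webscr?id=1"):
        print(v)

Each generated variant would then be verified automatically (e.g., via DNS queries and content matching) before being blacklisted, as described above.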

3.2. Heuristic methods
Heuristic methods examine one or more characteristics of websites in order to detect phishing websites. These characteristics are anomalies in the components of phishing websites. In fact, even the automatic verification of phishing websites used to maintain blacklists employs heuristic methods. Some of the heuristic methods are analyzed next.

3.2.1. The use of visual similarity measures for phishing detection
Phishing websites often imitate the look and feel of official websites with the same layouts, styles, key regions, rendering, blocks, and most of the contents. They use various non-text elements, such as images and flash objects, to display content. Such a mimic of an authentic website, with only the minimal required changes, is often difficult for Internet users to distinguish. Moreover, the use of non-text elements to display web content makes it even harder for general content based anti-phishing techniques. There are some techniques, for instance the technique proposed by Pan and Ding [2006], that use Optical Character Recognition (OCR) to analyze the content in images, but they still fail to analyze website elements such as flash objects and advertisement banners.

However, such cases can be handled efficiently by the use of phishing prevention techniques that employ visual similarity measures to differentiate between bogus and original websites. All visual similarity measures use a database to store genuine websites' data. When a suspicious website is met, its data are compared to the data of the genuine websites stored in the database to detect differences. The genuine websites' data are stored in one of the following forms:
(i) DOM elements of genuine websites. In this case, the DOM elements of genuine websites are compared with those of suspicious websites.
(ii) Captured images of genuine websites. In this case, features in the images of genuine websites are compared with those of suspicious websites using various techniques of Image Recognition (IR).
There are several studies that have used DOM element comparison for the visual similarity measure. One of the approaches is from Wenyin et al. [2005], which consists of four modules:
(i) Suspicious URL detection module. It is the source of suspicious URLs, which are obtained from transformations of the true URLs and from various suspicious URLs detected in e-mails.
(ii) Suspicious webpage processing module. It validates whether any real webpage exists for the URL supplied by the suspicious URL detection module and generates a representation of the found webpage, i.e., the blocks and features of the suspicious webpage.
(iii) True webpage processing module. It obtains a representation of the true webpage, i.e., the blocks, features, and weights of the true webpage.
(iv) Visual similarity assessment module. It compares the true webpage and each suspicious webpage and finally calculates their visual similarity based on their intermediate representations.
The approach by Wenyin et al. [2005] uses three similarity metrics, i.e., block level similarity, layout similarity, and overall style similarity, defined on webpage segmentation, to calculate the visual similarity between two websites. In block level similarity, the similarity of the features that represent text blocks and image blocks is measured. In layout similarity, the ratio of the weighted number of matched blocks in the suspected website to the total number of blocks in the true webpage is calculated. The overall style similarity focuses on the visual style of the webpage, which can be represented by several format definitions, e.g., font family, background colour, text alignment, and line spacing. The final verdict is made on the basis of the similarity weight of the suspected webpage, which needs to exceed a similarity threshold in order for the page to be declared a phishing website.
Another similar approach is from Liu et al. [2006], called "SiteWatcher", which uses visual similarity comparison and comprises two sequential processes: the first process runs at the e-mail server and the second process performs the visual similarity comparison. It requires the registration of true URLs and their associated keywords with the system. The process in the e-mail server monitors and analyzes both incoming and outgoing e-mails to find messages that contain keywords associated with the genuine website. All embedded URLs from the messages that contain keywords are sent for visual similarity assessment. After that, the second process performs the visual similarity assessment at block, layout, and style level. The visual similarity assessment includes the extraction of visual features and the matching of the suspected website against the original website.
This matching is performed at block level, both visually and semantically, and then on position constraints among blocks. It calculates layout similarity (i.e., the weighted number of matched blocks divided by the total number of blocks in the true page), calculates overall similarity on the basis of the distribution of feature values, and takes the correlation coefficient of the two pages' histograms as the overall style similarity. It issues phishing reports to the respective genuine website's owner when the visual similarity exceeds the corresponding threshold values.
The problem with both of the above mentioned approaches is that they rely on a feed of legitimate websites and cannot detect phishing websites that target websites other than those in the database. The approach by Liu et al. [2006] even needs unique keywords that can represent the legitimate website, which is an additional burden on Internet users. Moreover, such approaches that rely completely on code can easily be deceived by the following tricks: rewriting the HTML code so that it gives the same design but uses different DOM objects than the legitimate page, as shown in Figure 22; using images that provide the same look as the spoofed website; and using code obfuscation techniques to alter the code. In addition, such approaches can produce false negatives when the same theme is used to generate different websites.
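As an illustration of the general idea of comparing page structure (and of why purely code-based comparison is fragile), the following is a minimal sketch that measures the similarity of two pages' HTML tag sequences. It is a deliberate simplification and not the block, layout, and style metrics of Wenyin et al. [2005] or Liu et al. [2006].

    # Illustrative DOM-level similarity: compare the tag sequences of two pages.
    from difflib import SequenceMatcher
    from html.parser import HTMLParser

    class TagCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.tags = []
        def handle_starttag(self, tag, attrs):
            self.tags.append(tag)

    def tag_sequence(html: str):
        collector = TagCollector()
        collector.feed(html)
        return collector.tags

    def dom_similarity(html_a: str, html_b: str) -> float:
        return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()

    genuine = "<html><body><div><form><input><input></form></div></body></html>"
    suspect = "<html><body><div><form><input><input></form></div></body></html>"
    print(dom_similarity(genuine, suspect))   # 1.0 -> structurally identical

A phisher who rewrites the markup, replaces text with images, or obfuscates the code keeps the rendered look while driving such a structural score down, which is exactly the weakness discussed above.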

Figure 22: HTML codes and screenshots of the sign-in page in eBay.com [Lam et al., 2009]
An approach that provides a solution for the case mentioned in Figure 22, i.e., the same design built with different DOM objects than the legitimate website, is proposed by Lam et al. [2009]. It uses visual similarity-based phishing detection that is effective even for polymorphic phishing web pages. Polymorphic web pages are visually identical to authentic web pages but use different source code components than the authentic webpage. The approach by Lam et al. [2009] performs page layout analysis and layout block matching to calculate the degree of similarity using image processing techniques. The authentic webpage is stored in a database. When a suspected webpage is found, both the authentic and the suspect web pages are treated as images, and Otsu's thresholding method is applied to transform them into black and white images. The degree of similarity is ranked using a classifier trained to handle such cases. However, this approach, too, cannot detect phishing websites that use code obfuscation techniques to alter the source code. Moreover, using two processes for phishing website validation does not come for free; it is usually accompanied by degraded time performance. In addition, it still cannot detect websites that are not in the database.
The problems in visual similarity measure techniques that arise from the dependency on source code can be overcome by analyzing the features in captured images of legitimate and suspicious websites. An approach by Fu et al. [2006] extracts the URLs from e-mails containing keywords associated with the protected websites. This approach uses the Earth Mover's Distance (EMD) to calculate the visual similarity of web pages. It first extracts the URLs from e-mails and then converts the web pages associated with those URLs into normalized images. Next, it obtains the images' signatures, which comprise colour and coordinate features. Finally, the visual similarity is computed using the linear programming algorithm of EMD. The final classification is made on the basis of the similarity value of the suspected webpage: when the similarity value exceeds the threshold value of a protected webpage, the page is classified as a phishing website. However, the problem with this approach is that it uses a colour histogram, which is unsuitable for web pages, since websites usually contain very few colours [Liu et al., 2006]. Moreover, even a minor change in dynamic components, often unnoticed by Internet users, can significantly vary the colour histogram. In addition, the use of a colour histogram has a high chance of producing false negative results for websites that are designed using a popular theme.
Another approach that uses images, but analyzes many more features of the images, is from Cordero and Blain [2006]. Their approach uses differences in the image rendering of web pages for phishing website identification. It captures a Tagged Image File Format (TIFF) image of the entire rendered web page, which is turned into more manageable feature vectors by calculating a joint histogram with two features, resulting in 256 features per image. It uses the Cocoa/Safari engine for website rendering and GNU Octave and ImageMagick for data pre-processing. Although this approach compares far more features than the approach by Fu et al. [2006], it also possesses various limitations.
It uses the image rendering and layout of the webpage for phishing website detection despite the fact that both of them are affected by changes in window size. Even changes in font type and font size alter the appearance of a webpage. In addition, a website uses several dynamic components, such as advertisement banners and flash objects, that are cumbersome to compare using this approach, since the image changes with each scene.
Likewise, an approach that uses images and also claims to handle the use of dynamic objects in web pages is by Chen et al. [2009]. It considers phishing page detection as an image matching process. It takes a snapshot of the suspicious webpage and uses the Contrast Context Histogram (CCH) to extract discriminative keypoints from the suspected webpage, which are matched with those of the authentic web pages often targeted by phishers. Such authentic webpage data are stored in the database from a reliable source. Computer vision and image processing are used to compare the similarity. The degree of similarity is calculated using the k-means algorithm, and when it exceeds a certain threshold, the suspected webpage is considered to be a phishing website. Even though this approach is effective against dynamic objects, such as advertisement banners, flash objects, and video, it does not mention the degradation in time performance that can occur due to the processing of dynamic objects.
Different from all the above mentioned approaches, Wang et al. [2011] proposed an approach called "Verilogo", which does not analyze the image of the whole webpage; rather, it analyzes only the logo used in the webpage. The main assumption of Verilogo is that a logo is an easy means of recognition and is deeply associated with a given organization, so it is often included in phishing websites to exhibit false originality. It stores heavily phished logos and their related information in a database. It matches the logo used by the suspected webpage against the logos stored in the database using a computer-vision algorithm and then validates whether the suspected webpage has an authorized hosting IP address for using that logo. It warns Internet users when they enter keyboard input into a webpage that is not authorized to use the logo. Even though comparing a logo is lighter than comparing the whole webpage, this approach protects only the websites whose logo information is stored in the database. Moreover, it needs the list of all organizations that are allowed to use a particular logo, which is another unconventional requirement.
In all of the above mentioned techniques that use visual similarity measures for phishing detection, the common limitation is that all of them need to know the legitimate websites beforehand, which is impractical. In order to remove this limitation, Medvet et al. [2008] proposed an approach that uses three features to determine webpage similarity:
 text pieces, which also include style-related features,
 images embedded in the webpage, and
 the overall visual appearance of the webpage as seen by the Internet user (after the web browser has rendered the webpage).
This approach does not need an initial list of legitimate web pages; instead, it remembers the pair consisting of the credentials (e.g., username, password) and the webpage in which the Internet user enters them. When the Internet user enters the same credentials into any new webpage, it performs the similarity comparison.
The procedure is to retrieve the suspicious webpage, transform the webpage into a signature, and compare the signature with the stored signature of the legitimate webpage. In the case of similarity, it raises an alert. However, this approach neglects the fact that several Internet users use the same credentials for different websites. Moreover, some banks and organizations (e.g., Nordea Bank) use one-time passwords, and such cases cannot be protected by this approach. To sum up, visual similarity measures are suitable for server (e.g., ISP server) based phishing prevention techniques, so that the server administrator can maintain the list of phishing-prone websites; however, whether that is feasible remains an open question.
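For the image-based variants discussed above, a minimal sketch of histogram comparison is given below. It assumes the Pillow imaging library is available and that screenshots of the genuine and suspected pages have already been captured to the hypothetical files named in the example; it uses a simple histogram intersection, not the EMD-based signature matching of Fu et al. [2006].

    # Illustrative colour-histogram comparison of two page screenshots.
    from PIL import Image

    def normalised_histogram(path: str, size=(128, 128)):
        img = Image.open(path).convert("RGB").resize(size)
        hist = img.histogram()                 # 256 bins per channel, 768 values
        total = float(sum(hist))
        return [h / total for h in hist]

    def histogram_overlap(path_a: str, path_b: str) -> float:
        """1.0 means identical colour distributions, 0.0 means no overlap."""
        ha, hb = normalised_histogram(path_a), normalised_histogram(path_b)
        return sum(min(a, b) for a, b in zip(ha, hb))

    # Hypothetical screenshot files captured from the genuine and suspected pages.
    print(histogram_overlap("genuine_login.png", "suspected_login.png"))

As noted above, such colour statistics change with minor edits to dynamic components and say little about pages that merely share a popular theme, which is why histogram-style comparison is usually combined with other features.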

3.2.2. Use of search engines in phishing detection
There are several search engines (e.g., Google, Bing, Yahoo!, Baidu) that maintain a crawl database and perform page ranking to display search results. The PageRank algorithm, formulated by the Google founders Larry Page and Sergey Brin, uses factors such as the number of inbound links, the number of outbound links, and other damping factors. Moreover, there is a set of recommended guidelines from the Google webmaster documentation to prevent the removal of websites from the Google search engine index [Google Webmaster Guidelines]. All this suggests that a web page must follow the Google webmaster guidelines and must have as many inbound links as possible in order to gain a high page rank. On the contrary, phishing web pages usually have a very short life span and have even been found to disobey the recommended guidelines [Garera et al., 2007]. Therefore, phishing websites are either absent from the search results or possess a very low page rank. In addition, the number of search results for phishing websites is usually very small and mostly consists of other phishing websites and websites that maintain lists of malicious websites, such as PhishTank. These features of search engines have been applied by many researchers for phishing detection. The two vital components of this approach are the extraction of search keywords and the selection of the search engine. Some of the proposed approaches that use search engines for phishing detection are mentioned next.
An approach that uses a search engine for phishing detection is by Ma [2006]. His approach uses the Google search engine results for phishing detection. His work is a plug-in for the Mozilla Firefox web browser that extracts unique keywords from the website to be analyzed and uses the keywords as the query for the Google search engine. Then the URL of the suspected site is compared with the URLs of the top search results. In case of a mismatch, it interrupts the Internet user and suggests one of the top ranked search results. However, the problem with this approach is that it does not specify the keyword extraction method or the number of search results to be compared.

Another similar approach, which is clear on both of the problems in Ma's [2006] approach, is by Zhang et al. [2007b]. They proposed an approach called "A Content-Based Approach to Detecting Phishing Web Sites", or simply CANTINA, that examines the content of a webpage to identify phishing. It implements the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm used in Information Retrieval (IR) and the Robust Hyperlinks algorithm. The TF-IDF algorithm is used to determine the importance of a word in a document, and the Robust Hyperlinks algorithm is used to deal with broken hyperlinks. The two ideas behind this approach are:
 Phishers usually copy legitimate websites to generate phishing web pages. In that case, the Robust Hyperlinks algorithm can be used to find the original log-in page.
 Phishing websites often contain the original brand name, which is common on the legitimate webpage but relatively rare on the web. Again, in this case, the Robust Hyperlinks algorithm can be applied to determine the actual owner of the webpage.
The general working mechanism is as follows: first, the score of each term on the webpage is calculated using TF-IDF, and then a lexical signature of the top five terms is generated, which, in concatenation with the domain name (even when the signature already contains the domain name), is fed to the search engine (in this case Google). Finally, the suspected webpage is classified as a phishing webpage if its domain name does not appear in the top thirty results of the search engine. Even in the case when the search result count is zero, the suspected webpage is classified as a phishing webpage. The limitations of this approach are:
 It works only with web pages that have content in the English language.
 It takes time, because it involves querying Google.
 It can be bypassed using techniques such as: using image content instead of textual content; using unrelated text in invisible form (i.e., using a font colour that is the same as the webpage's background colour); changing enough words in the webpage; and using a webpage that is already highly ranked in the search engine results.
 It uses a linear classifier, which has its own limitations [Xiang et al., 2011].
Likewise, Xiang and Hong [2009] proposed an approach that uses the search engine technique in association with other techniques for phishing detection. Their approach uses IR methods to recognize the identity of the claimed webpage and to capture a phishing webpage by examining the discrepancies between the claimed identity and its original identity. It uses Named Entity Recognition (NER) algorithms to reduce false positives. The identity oriented component is aided by a keywords-retrieval component that employs search engines to detect potential phishing web pages by searching for keywords of significant importance with respect to IR. It includes whitelist methods and a login-form detector to filter good web pages and control false positive results. Even though this approach handles false positives better, it still shares the limitations mentioned for CANTINA.
Similarly, an approach proposed by Huh and Kim [2011] is lighter than all the above mentioned approaches that use a search engine for phishing detection. It uses the full URL string of the suspected webpage, without parameters, as the query for the search engine, exempting it from the tedious process of keyword extraction. The total number of search results and the ranking of the suspected webpage are used to determine whether it is legitimate or fake.
It uses the fact that legitimate web pages get a large number of search results and are usually ranked first in the search results, whilst phishing web pages get only a few results and usually have a low rank or no rank at all. The validation of this approach was performed using three different reputable search engines: Google, Yahoo!, and Bing. However, the problem with this approach is that it fails to detect phishing web pages which use compromised popular websites.
To sum up, using a search engine is an effective approach for phishing detection. The results are more accurate due to the high quality of the search engines' indexing. Moreover, the approach by Ma [2006] provides an alternative option for Internet users to proceed with browsing. One of the reasons that force Internet users to risk clicking a suspected website despite a warning from the security system could be the lack of an alternative; most phishing prevention systems just warn Internet users and rarely provide any substitute. The approach by Huh and Kim [2011], which uses the whole URL for the search, improves the quality of the search keyword. Apart from that, this approach is independent of other resources, such as a database, and is equally effective against zero hour phishing. However, the use of search engines for phishing detection, too, has several limitations, some of which are mentioned next.
 It is the webmaster policy of the search engine that determines whether a website should be indexed or not. This decision is taken on the basis of how closely the website

adheres to the recommended webmaster guidelines concerning the design, content, technical aspects, and quality of the website. These guidelines help to make a website search engine friendly [Google Webmaster Tools, Bing Webmaster Tools]. The search engine spider crawls a website on the basis of several factors; for instance, Google looks at factors such as PageRank, links to a page, and crawling constraints like the number of parameters in a URL [Google Webmaster Tools]. Moreover, Google PageRank is updated approximately every three months [Huh and Kim, 2011], and the situation is similar with other search engines. However, the concern is how many new legitimate websites actually follow the webmaster guidelines. There are many legitimate websites designed by novice designers who are unacquainted with the webmaster guidelines. The situation might improve when Content Management System (CMS) tools, such as Joomla!, Drupal, and WordPress, are used for webpage design. However, there are still many fresh legitimate websites which rank very low in the search results or do not appear in the search engine results at all. Such websites are misclassified by phishing prevention approaches that use search engines. Some legitimate websites whose rank might eventually improve still suffer misclassification for up to three months in the case of Google.
 Such phishing prevention approaches can easily be bypassed by abusing a legitimate website that already has a top ranking in the search engine results or by registering a legitimate website to conduct phishing, even though such processes are comparatively expensive.
 Phishers can manipulate the ranking algorithms to get a good ranking for their websites in the search engine results.
 Search results vary with the kind of search engine used. Figure 23 shows snapshots of the search results after a legitimate URL is entered as a query into two popular search engines, i.e., Google and Bing.

Figure 23: The same URL searched using Google.com and Bing.com
Thus, the popularity of websites is better used to support other heuristic properties for phishing detection.
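As an illustration of the decision logic used in approaches such as that of Huh and Kim [2011], the following is a minimal sketch that classifies a URL from the number of results returned for it and the rank of its domain. The search() function is a hypothetical placeholder (a real implementation would call a search engine API here), and the thresholds are illustrative, not the values used in the cited study.

    # Illustrative search-engine-based check using the full URL as the query.
    from urllib.parse import urlparse

    def search(query: str) -> list[str]:
        """Hypothetical stand-in: return the result URLs for `query`, best match first."""
        raise NotImplementedError("plug in a real search engine API")

    def classify(url: str, min_results: int = 10, max_rank: int = 3) -> str:
        results = search(url)
        if len(results) < min_results:
            return "suspicious: almost no search results"
        domain = urlparse(url).hostname
        for rank, hit in enumerate(results[:max_rank], start=1):
            if urlparse(hit).hostname == domain:
                return f"likely legitimate (ranked {rank})"
        return "suspicious: domain not among the top results"

Because the query is the full URL, no keyword extraction is needed, but, as the limitations above indicate, fresh legitimate websites with little search presence would be misclassified by such logic.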

3.2.3. Use of anomalies in phishing websites for phishing detection
Phishing websites mimic the look and feel of genuine websites at the interface level, but they are different at the code level. In fact, they also contain many anomalies in their web objects, HyperText Transfer Protocol (HTTP) transactions, and claimed identities [Pan and Ding, 2006]. These anomalies can exist in their URLs, DOM objects, or webpage contents. Several studies have utilized varied sets of these anomalies for phishing detection. Some of the prominent studies are mentioned next.
An approach by Chou et al. [2004] is a browser plug-in called "SpoofGuard", which is designed for client side defence against phishing. SpoofGuard examines properties such as the domain name, URL, links, and images to identify probable spoof attacks. Further, it also looks at the browsing history in order to verify whether the given domain has been visited before. It also checks whether the webpage was opened by clicking a link from an e-mail message. Most importantly, it stores the hash values of post data, i.e., the username and password, and of the domain name where the credentials are used. When Internet users enter any credentials, it compares the post data with the stored credentials and their respective domain names, and warns the user when the credentials match but the domain names differ. The two major problems in this approach are:

 It neglects the fact that many Internet users use the same credentials for different domain names, which can produce false negative results, and
 It does not protect websites which use one-time passwords, i.e., where a password is valid for only one login session; it would store several credentials for a single domain name, precisely one entry for every login.
Likewise, Pan and Ding [2006] proposed an approach which detects phishing from anomalies in the DOM objects of phishing websites. It employs two major components: (i) an Identity Extractor, which uses an IR algorithm and the χ2 test to extract the web identity, and (ii) an SVM based Page Classifier, which takes as input the web identity and a set of structural features (i.e., web objects or properties relevant to the web identity) to determine whether a webpage is a phishing page or not. They also suggest using Optical Character Recognition (OCR) to extract content from phishing websites that use images in place of textual content. The main limitation of this approach is that it relies on the assumption that "the distribution of identity-related words usually deviates from that of other words", which is not completely true, as can be observed from the high false positive results produced by the approach [Xiang et al., 2011].
Similarly, an approach by Alkhozae and Batarfi [2011] looks at violations of the W3C recommendations in webpage source code to identify phishing websites. The general mechanism is to assign an appropriate weight to each characteristic (W3C violation) and an initial weight to the suspected website. Each occurrence of a characteristic in the suspected website subtracts the corresponding characteristic's weight from the initial weight. The final decision is taken on the basis of the remaining weight after the examination: the smaller the weight, the higher the probability of the page being a phishing website. The main problem with this approach is that it depends on violations of the W3C recommendations, when it is unclear how many web developers really know and follow the W3C recommendations. In addition, there are other web standards followed by the web development industry, such as the Internet Standards (STD) documents [IETF]. Moreover, there is a chance of bypassing this approach by using a phishing website that follows most of the W3C recommendations.
A problem with the approaches by Chou et al. [2004], Pan and Ding [2006], and Alkhozae and Batarfi [2011] discussed above is that they load websites in order to identify whether they are phishing websites, which ultimately exposes Internet users to phishing conducted using malicious code. Therefore, to overcome this danger, Garera et al. [2007] proposed a phishing prevention approach that uses only the anomalies in the URLs of phishing websites to detect them. This approach uses various distinguishing features of phishing URLs and a logistic regression classifier (trained with data from Google), which also includes obfuscation style heuristics and general heuristics based on Google's index infrastructure. The main problem of depending solely on URLs for phishing detection is that such an approach can easily be deceived by using either registered domains or compromised legitimate websites to conduct phishing.
Another similar approach that uses only URL analysis is by Ma et al. [2009]. Their approach uses statistical methods from machine learning to identify phishing websites.
It examines the lexical features (i.e., the textual features of URLs) and host based features (i.e., IP address properties, WHOIS properties, domain name properties, and geographical properties) of URLs in order to establish the reputation of websites. The problems with this approach are:
 It can misclassify legitimate websites that use URLs containing the benign tokens stated in the approach.
 It can misclassify legitimate websites that use free hosting services.
 It cannot detect phishing websites that use compromised legitimate websites.
 It can misclassify legitimate websites that use redirection services.
 It can misclassify legitimate websites hosted in reputable geographical regions, such as the USA, despite the fact that more than fifty percent of phishing websites are hosted in the USA [APWG, 2012].
 It can misclassify websites that possess international TLDs but are hosted in the USA.
Even though URL analysis protects Internet users from malicious software, it lacks the accuracy that could be gained by using DOM object and webpage content analysis. A more robust approach, called "CANTINA+", was designed by Xiang et al. [2011]; it uses resources including URLs, HTML DOMs, third party services, and a search engine to detect phishing websites. It uses five features from CANTINA (discussed in section 3.2.2) and ten additional new discriminative features for phishing website identification. It employs two filters:
(i) A hash based filter. It uses the SHA-1 hash algorithm and is used for duplicate page detection.

(ii) Login form detection. It looks for three main characteristics of a login form, i.e., FORM tags, INPUT tags, and login keywords (it searches for 42 different login keywords).
Finally, it employs a machine learning detection model based on the discriminative features, extensively trained as a classifier. Even though CANTINA+ is more robust than CANTINA, it still has some limitations:
• It is unable to detect cross-site scripting attacks,
• It cannot detect phishing that is conducted using compromised legitimate websites, and
• It cannot detect phishing websites that use images instead of textual content.
The above mentioned approaches detect phishing, but they do not report what kind of attack it is. Choi et al. [2011] proposed a machine learning approach to detect malicious URLs of all kinds, including phishing, spamming, and malware infection. Along with detection, it also identifies the attack type. It uses various discriminative features (e.g., lexical, link popularity, webpage content, DNS fluxiness, and network features) for detection. The methodologies used are SVM for the detection of malicious URLs, and RAkEL and ML-kNN for identifying the attack types of malicious URLs. The main problem with a machine learning approach is that its effectiveness is dependent on the type of data used for training. Moreover, phishing schemes are dynamic, and such a classifier has to be updated regularly.
To sum up, anomalies in the URLs and source codes of phishing websites can be a promising way to differentiate between phishing and legitimate websites. An approach designed by Gastellier-Prevost et al. [2011], called “Phishark”, which studied the effectiveness of URL and page content analysis for phishing detection, also showed that anomalies can be an effective means to distinguish between legitimate and phishing websites. The major challenge in using anomalies for phishing prevention is legitimate websites that are developed by novice web developers, or more precisely, web developers who are unaware of Internet security and the various web development standards. Such web developers unintentionally introduce several anomalies into their work, and their websites usually get misclassified. Table 1 is a summary of technical phishing prevention methods with their main characteristics, pros, and cons.


Whitelist method
Characteristics: It uses a list of trusted websites and checks whether a given website is present in the list or not.
Pros: (i) It is effective against zero-hour phishing. (ii) It produces almost no false positive results. (iii) It is simple in design.
Cons: (i) It has a difficult update mechanism.

Blacklist method
Characteristics: It uses a list of treacherous websites and checks whether a given website is present in the list or not.
Pros: (i) It has low false positive results. (ii) It is simple in design.
Cons: (i) It is ineffective against zero-hour phishing. (ii) It has a difficult update mechanism. (iii) It has a difficult URL matching mechanism.

Visual similarity measures
Characteristics: It stores the information of the DOM elements or captured images of the legitimate websites and compares the information from its database with that of the suspicious websites.
Pros: It is effective against phishing attacks targeting websites whose information is stored in its database.
Cons: (i) It needs to store data about the legitimate websites which have to be protected from phishing. (ii) It cannot detect phishing attacks which target websites not in its database.

Use of search engine
Characteristics: It extracts search keywords from the given website and searches the keywords using a search engine. Then, it checks whether the given URL is in the top search results.
Pros: (i) It is simple in design. (ii) It is very suitable for anti-phishing tools that can suggest alternative links to the Internet user.
Cons: (i) It can misclassify many legitimate websites. (ii) Its accuracy depends on the selected search engine.

Use of anomalies in phishing websites
Characteristics: It looks for the characteristics in DOM objects or URLs of the websites.
Pros: (i) It is not dependent on any specific phishing strategy and is equally valid for all kinds of phishing websites. (ii) It does not depend on any external factors, such as databases. (iii) It does not require any changes in user browsing habits.
Cons: (i) It is complex in design. (ii) Its accuracy varies with the list of used phishing characteristics.

Table 1: Summary of technical phishing prevention methods
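To make the mechanism of the two list-based methods in Table 1 concrete, a minimal Python sketch follows. It is my own illustration rather than a reproduction of any of the surveyed tools; the list contents, the host normalization, and the function names are assumptions made only for this example.

from urllib.parse import urlparse

# Illustrative list contents only; a real tool would load and update these
# from maintained sources (a user's trusted sites, a blacklist feed, etc.).
WHITELIST = {"login.live.com", "accounts.google.com"}
BLACKLIST = {"bganketa.com", "prophor.com.ar"}

def host_of(url):
    """Return the lower-cased host name of a URL (empty string if missing)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def list_based_verdict(url):
    """Classify a URL purely by list membership of its host name."""
    host = host_of(url)
    if host in WHITELIST:
        return "trusted"     # whitelist hit: almost no false positives
    if host in BLACKLIST:
        return "phishing"    # blacklist hit: low false positives
    return "unknown"         # neither list helps against zero-hour phishing

The “unknown” outcome illustrates why a blacklist alone is ineffective against zero-hour phishing: a newly created phishing website falls through both lists and must be handled by one of the other methods in Table 1.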

4. Investigating anomalies in phishing websites

One of the main objectives of this thesis is to identify the important anomalies found in the URLs and source codes of phishing websites. Therefore, I compiled as many distinctive anomalies as possible. In order to gather anomalies, I realized there are two possible ways. One way is to analyze phishing websites and the corresponding legitimate websites together to discover their differences, but this is a time-consuming process. Therefore, I selected the second way and chose past studies as the sources of the anomalies, since those anomalies have already been confirmed to occur in phishing websites. I collected several past studies, for example, studies by Chou et al.

[2004], Fette et al. [2006], Pan and Ding [2006], Garera et al. [2007], McGrath and Gupta [2008], Ma et al. [2009], Bian et al. [2009], Alkhozae and Batarfi [2011], Xiang et al. [2011], Choi et al. [2011], and Gastellier-Prevost et al. [2011], and picked all non-redundant anomalies. The anomalies that I have listed are presented next.

4.1. Anomalies found in the URLs of phishing websites

• Use IP address in URLs. Some phishing websites use an IP address in their URLs, either to replace the host name or as a substring of the URL, in order to confuse Internet users. APWG [2012] reported that 1.19%, 1.4%, and 2.09% of the phishing websites had used URLs containing an IP address during the first quarter of 2012. An example of such a URL is: http://184.173.179.200/~agarwal/rbc/ However, some genuine web applications, usually used in intranets, can also contain an IP address in the URL.
• URLs contain brand, or domain, or host name. In this form of phishing websites’ URLs, the target company’s brand, domain, or host name is included in the path segment of the URL. McGrath and Gupta [2008] found that 50%-75% of phishing websites’ URLs contain the targeted brand, domain, or host name. According to the APWG report for the first quarter of 2012, 49.53%, 45.39%, and 55.42% of the phishing websites used URLs containing the targeted company’s brand, domain, or host name. An example of such a URL is: http://fatloss4babyboomers.com/paypal.html However, a brand, domain, or host name is also used by most genuine websites in their URLs.
• URLs use http in place of https, i.e., abnormal SSL certificate. Most phishing websites use an unsecured connection to transfer sensitive information. A valid Secure Socket Layer (SSL) certificate is issued by authorized organizations. The authorized organizations verify the websites before issuing an SSL certificate, which means that acquiring such a certificate makes a phishing website susceptible to detection techniques and sometimes even dangerous for the respective phisher, who might get traced. In addition, Internet users are not good at differentiating between secure and unsecure connections [Gastellier-Prevost et al., 2011]. Some phishing websites were reported to use SSL certificates that are either invalid or

inconsistent with the claimed identity, but currently this is rarely seen in practice, since all the recent versions of popular web browsers, such as Google Chrome, Mozilla Firefox, and IE, have detection systems for them. An example of a phishing website that uses http is: http://coachbronek.com/muz4/index.php. However, there are some authentic websites, such as Facebook and Viadeo, which use SSL only for a very short time to validate the users’ credentials [Gastellier-Prevost et al., 2011].
• URLs contain misspelled or derived domain name. There are various tricks used by phishers to derive a domain name that looks similar to a genuine domain name but disobeys the URL naming conventions. Many times such a derived domain name is a registered domain name. Some of the techniques used to generate derived domain names for phishing websites are:
o Replace the characters of the real domain name with similar looking elements (which can be hexadecimal or integer). An example of such a URL is: http://paypa1.com, where the character ‘l’ is replaced by the number one.
o Introduce a hyphen (-) in the domain name. An example of such a URL is: http://www.adm-ahtuba.astranet.ru/semite.html
o Shift the characters of the domain name. An example of such a URL is: http://www.paypla.com, where the positions of the characters ‘a’ and ‘l’ are interchanged.
However, several genuine websites have URLs that contain meaningless words, and this can complicate the detection of phishing websites’ URLs.
• URLs using long host name. Phishing websites’ URLs are usually longer than normal URLs. McGrath and Gupta [2008] found that URL lengths peak at 22 characters for legitimate websites in the DMOZ, whilst they peak at 67 characters for the URLs in PhishTank and 107 for the URLs in MarkMonitor. They further found that only a few URLs in DMOZ were longer than 75 characters, while the longest URLs found in PhishTank and MarkMonitor were more than 150 characters long. In addition, they found that phishing domains (without TLD) are shorter than legitimate domains. Domain length (without TLD) peaks at 10 characters for the URLs in DMOZ, whilst it peaks at 7 characters for the URLs in PhishTank and MarkMonitor. An example of such a URL is:

http://fodamat.com/templates/fodamat/webscr/PayPal.com/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f998ca054efbdf2c29878a435fe324eec2511727fbf3e9efe4eb694d5cae9e96bf5176d35f4070ec44eb694d5cae9e96bf5176d35f4070ec4
• Use short URLs. Some phishing websites use URL shortening services, such as TinyURL [McGrath and Gupta, 2008; Gastellier-Prevost et al., 2011], to shorten their URLs, which ultimately redirect to long URLs. An example of such a URL is: http://prophor.com.ar/prophor/wells/alerts.php which redirected to the URL http://specialneedssvg.org/wp/wp-admin/import/wellsfargo/wellsfargo/wellsfargo2011/index.php
• Use “//” character in URLs’ path. When a URL’s path contains the “//” character, it is suspicious and there is a greater chance that it will redirect [Gastellier-Prevost et al., 2011]. An example of such a URL is: http://bganketa.com/libraries/eBaiISAPI.dll.htm?https://signin.ebay.co.uk/ws/eBayISAPI.dll?SignIn However, there are some genuine websites that satisfy the condition. An example is the login page URL for Gmail: https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2
• URLs use unknown or unrelated domain name. Sometimes phishers use a domain name that is either completely unknown or unrelated. An example of such a URL targeting PayPal is: http://www.traitembal.com/backoffice/images-backoffice/dossier/ However, it is legal to have a unique domain name.
• URLs use multiple Top Level Domains (TLD) within domain name. Some phishing websites’ URLs use multiple TLDs within the domain name. Such URLs can be detected from the number of dots (.) used in the URLs. It has been found that genuine URLs contain on average less than five dots (.) [Zhang et al., 2007a]. An example of a phishing URL with more than five dots is: http://paypal.com.bin.webscr.skin.a5s4d6a5sdas56d6554y65564y65564y4a56s4d56as4d65sad4.shoppingcarblumenau.com.br/

However, there are some legitimate websites that contain more than five dots. An example of such a URL is: https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&ct=1351023508&rver=6.1.6206.0&wp=MBI&wreply=http:%2F%2Fmail.live.com%2Fdefault.aspx&lc=1033&id=64855&mkt=en-us&cbcxt=mai&snsc=1 which is the URL of the login page of “Hotmail.com”.
• Use encoded URLs. The use of obfuscated text, i.e., the ASCII, Hex, or Oct equivalent of readable text, in a URL is another technique exercised to hide the identity of a phish. Sometimes an encoded IP address is used in the URL. Such text is less likely to be readable and can easily deceive Internet users. An example of such a URL is: http://www.absolutewealthsystem.com/www.paypal.it_service-security_confermation/it/Processing1.php?cmd=_Processing&dispatch=5885d80a13c0db1fb6947b0aeae66fdbfb2119927117e3a6f876e0fd34af4365dcbd1864c8b4dcf443a6f60fef107b96dcbd1864c8b4dcf443a6f60fef107b96
• Uses special character ‘@’ in URLs. The special character ‘@’ is used in a URL to redirect the user to a website different from the one that appears in the address bar. An ‘@’ symbol in a URL disregards the string on the left side of the symbol, and the actual URL is the string on the right side of the symbol [Zhang et al., 2007a]. An example of such a URL is: http://www.amazon.com:[email protected]
• URLs use different port number. Some phishing websites use a port other than port 80 [Gastellier-Prevost et al., 2011]. It was found that 1.19%, 0.68%, and 0.26% of the phishing websites did not use port 80 in January, February, and March of 2012 respectively [APWG, 2012].
• URLs with abnormal DNS record. Legitimate websites usually have a DNS record; however, phishing websites usually do not. In case they have one, most of the information remains empty. Figure 24 shows the DNS lookup result using the My-Addr.com tool for the phishing URL: http://188.138.124.133/www.paypal.com/session_id/87544455623222414898896521458754/index.htm#

Figure 24: DNS record for a phish URL tested using the My-Addr.com tool
However, an incomplete DNS record can also belong to a legitimate website, whilst a complete DNS record can belong to a fake website.
• Life of domain. In general, the life of phishing sites is not long. Even when they have a registered domain, it is usually a recently registered one. Phishing websites become active immediately after registration [McGrath and Gupta, 2008; Zhang et al., 2007a]. However, every day many recently registered legitimate websites are added to the Internet.
• Number of sensitive words in URLs. Several suggestive word-tokens are used in phishing websites’ URLs [Garera et al., 2007]. The eight word-tokens used by Garera et al. [2007] in their classifier are: webscr, secure, banking, ebayisapi, account, confirm, login, and signin. An example of such a URL is: http://paypal.com.cgi.bin.webscr.cmd.login.submit.dispatch.8f9j89u54iu5l5469t6d6sd4.boquetequalityproperties.net/pay/

• Use of free web hosting. Free web hosting services are widely misused by phishers to host their phishing websites [McGrath and Gupta, 2008]. Most phishing websites use a domain that is specifically registered for hosting phishing sites, or they use web hosting services which are available for free [Prakash et al., 2010]. An example of such a URL is: http://arnodits.net/ysCntrlde/webscr_prim.php?YXJub2RpdHMubmV0NTAxNmNmYTVjMzY4NQ==MTM0MzY3MjIyOQ However, many legitimate websites also use free web hosting services.
• URLs popularity. Page rank depicts the relative importance of a website within a set of websites. A higher page rank indicates that the website is more important, and mostly only a legitimate website can achieve it [Garera et al., 2007; Choi et al., 2011]. Techniques by Ma [2006], Zhang et al. [2007a], Xiang and Hong [2009], and Huh and Kim [2011] use search engine ranking for phishing website detection. A screenshot of the results returned by Google for a phishing URL is shown in Figure 25.

Figure 25: Google search results for a phish URL
However, phishing websites can use compromised URLs which are already popular, whilst newly designed websites can have very low popularity. Moreover, the ranking varies with the type of search engine used, as shown in Figure 23.
• No credible in-neighbour search results [Bian et al., 2009]. A legitimate website’s domain usually has inlinks from various credible websites, while phishing websites mostly do not have inlinks from legitimate websites. In fact, most of the time phishing websites do not have inlinks at all. This does not mean that all legitimate websites will have inlinks; several legitimate websites may not have inlinks at all either. Some of the methods that can be used to get the inlinks are: “link:[no space]DomainToSearch” in Google, “link:[space]DomainToSearch” in Yahoo! and Bing, the Bing webmaster tool, and the Google webmaster tool.
• URLs absence in relevant web category [Bian et al., 2009]. When the keywords of a legitimate website are entered into Yahoo! Directory, it lists the websites that are relevant to the provided keywords, which also include the legitimate website. However, a phishing website either does not get any results or is absent from the results. This again does not guarantee that all legitimate websites will have non-zero result counts; several legitimate websites were found to have zero result counts.
• Number of “Bag of words” in URLs. The frequency of strings delimited by ‘/’, ‘?’, ‘.’, ‘=’, ‘-’, ‘_’ can be used for phishing detection [Ma et al., 2009].

In general, phishing websites possess a higher frequency of these symbols in their URLs than normal websites’ URLs. An example of such a URL is: http://artesax.com/~citcompa/paypal_priv8_us_2012/index.htm?cmd=_login-run&dispatch=063c19f9f888ffe32e5abeba112f5b33063c19f9f888ffe32e5abeba112f5b33
• Domain name character composition. McGrath and Gupta [2008] found that domain names from DMOZ resemble the relative letter frequencies of characters in the English language, whilst domain names from PhishTank and MarkMonitor have a less pronounced peak at each of the vowels. Likewise, they also found that the relative popularity of letters of the English language differs between legitimate and phishing domain names. The letters ‘a’, ‘c’, and ‘e’ have significantly different probabilities of appearing in English language documents or DMOZ domain names, but they have very similar probabilities of occurrence in phishing domain names.
• URLs hosted by geographical location. The majority of phishing websites are hosted in the USA [APWG, 2012]. This might be because the USA also hosts the highest number of other websites.
• TLD triplets used in URLs. It has been found that the TLDs that are very often used by spammers are .us, .cn, and .com [Gastellier-Prevost et al., 2011]. However, they are also widely used TLDs for genuine websites.
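Several of the URL anomalies above can be checked with simple string and URL parsing. The following Python sketch is my own illustration of how a heuristic might flag a few of them; the thresholds come from the studies cited above where the text gives them (five dots, 75 characters, the eight sensitive word-tokens), while everything else, including the function name and the choice of anomalies, is illustrative rather than taken from any of the cited techniques.

import re
from urllib.parse import urlparse

# The eight suggestive word-tokens used by Garera et al. [2007].
SENSITIVE_TOKENS = {"webscr", "secure", "banking", "ebayisapi",
                    "account", "confirm", "login", "signin"}

def url_anomalies(url):
    """Flag a few of the URL anomalies from subchapter 4.1 (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # "Bag of words": strings delimited by / ? . = - _ [Ma et al., 2009]
    tokens = [t for t in re.split(r"[/?.=\-_]", url.lower()) if t]
    return {
        "ip_in_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "no_https": parsed.scheme != "https",
        "at_symbol": "@" in url,
        "double_slash_in_path": "//" in parsed.path,
        "many_dots": url.count(".") >= 5,          # Zhang et al. [2007a]
        "long_url": len(url) >= 75,                # McGrath and Gupta [2008]
        "nonstandard_port": parsed.port not in (None, 80, 443),
        "sensitive_tokens": len(SENSITIVE_TOKENS.intersection(tokens)),
        "bag_of_words": len(tokens),
    }

# Example: the IP-address phish URL from the beginning of this subchapter.
print(url_anomalies("http://184.173.179.200/~agarwal/rbc/"))

In a real heuristic each flag would of course be weighted and combined, since, as noted for almost every anomaly above, each one also occurs in some legitimate websites.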

4.2. Anomalies found in the source codes of phishing websites

• Abnormal anchor URLs. Genuine websites use anchors to provide navigational guidance. The URLs used in the anchors are usually from their own domain and sometimes from a different domain. However, in phishing sites such anchor URLs are mostly from a different domain. It has also been found that sometimes an anchor in a phishing website does not link to any page; for example, the anchor URL (AURL) can be “file:///E/” or “#”.
• Abnormal Server Form Handler (SFH). Security is one of the prime concerns for organizations that do online transactions. Such organizations require credentials for login, which are generally a username and password. Thus, their websites include an SFH. Legitimate websites always take action upon the submission of a form; however, in phishing websites the form handler can be either “about:blank” or “#”.

Moreover, a legal site’s SFHs are handled by a server of the same domain. So whenever the form is handled by a foreign domain’s server, it makes the website suspicious.
• Abnormal request URLs. Request URLs (RURLs) are the links to external objects (images, external scripts, CSS), also called resources. The W3C recommends that websites use resources from the page’s own domain, and this is widely followed by genuine websites. However, spoof websites often take these resources from the victim websites to make the phishing websites look and feel similar to the legitimate websites. This means the request URLs used by crook websites are often from a different domain. Some genuine websites also use resources from a domain other than their own; however, they do so for very few resources, whilst phishing websites use a different domain for the RURLs of most of their resources.
• Abnormal cookie. A cookie is used to identify users and their previous activity on a website. It is an important part of portals and online shopping websites. It is always bound to the website server’s domain. However, in phishing websites, it either points to the phishing site’s own domain, which is inconsistent with the claimed identity, or points to the genuine website’s domain, which differs from the phishing domain.
• Mismatch hyperlink. A mismatch hyperlink is used to mislead Internet users. Although the links appear to Internet users to be those of the original websites, when the links are clicked, they direct to the phishing websites. For instance, https://secure.regionset.com/Ebamking/logon/
• Use of illegal pop-up windows. A phisher uses a pop-up and asks Internet users to fill in their information. It could be a borderless window above the real website that looks very much like a part of the genuine website. There are two ways to create pop-up windows. One is using HTML, which is accepted practice, for instance:
<div onClick="window.open('mona.html')">
The other way is using Javascript, which is illegal:
onClick="javascript:popup('mona.html')"
All popular web browsers have features to block pop-up windows [Alkhozae and Batarfi, 2011].

• Harmful forms. Phishing websites usually use a form asking the user to fill in other details along with the username and password [Ludl et al., 2007]. The number of input fields, text fields, password fields, hidden fields, and other fields, such as radio buttons and check boxes, can be used for phishing detection [Ludl et al., 2007]. More precisely, tags that accept text accompanied by words such as “credit card” can indicate phishing [Zhang et al., 2007a]. Such a form usually contains a submit button. Figure 26 shows a form in a phishing website.

Figure 26: A form in a phishing website
• Use of onMouseOver to hide the link. Some phishing websites include an onMouseOver function to hide their abnormal link. An example of a code snippet that performs onMouseOver is below:
ABC onMouseOver="window.status='Click here to go to ABC'; return true"
• Number of script tags. In general, phishing websites have been found to use a larger number of Javascript tags and plain text pages than legitimate websites [Ludl et al., 2007]. Thus, the use of too many Javascript tags in a website makes it suspicious.
• Presence of Javascript functions. There are some native Javascript functions, such as escape(), eval(), link(), unescape(), exec(), and search(), which occur predominantly in phishing websites containing cross-site scripting and web-based malware [Choi et al., 2011]. The presence of these functions in a high count in a website makes it suspicious.
• IFrame redirection. An IFrame is used to embed another webpage within the current webpage. It creates a frame or window on a webpage so that another page can be

inside this frame. A borderless IFrame, which can be hard for Internet users to detect manually, has been found to be used by some phishing websites.
• Mismatch in form fields and domain name. Phishing websites use their own domain name but put the text of the legitimate website in the tag, which makes it a complete mismatch [Gastellier-Prevost et al., 2011]. This can be applied for phishing detection.
• Disabled right click. Some phishing websites disable the right mouse click. A simple Javascript function can be used to disable it. A code snippet that can disable the right click is given below (in the legacy IE event model, the button value 2 denotes the right mouse button):
function disableclick(e) {
  if (event.button == 2) {   // right mouse button
    return false;
  }
}
• Use authentic logo. Almost all phishing websites use the logo of the legitimate website to imitate its appearance [Zhang et al., 2007a]. This verification needs a record of all the logos of the legitimate websites that are highly targeted by phishers, which means a dependency on external data.
• Integrate security logo. Most phishing websites use a security logo, such as VeriSign’s [Gastellier-Prevost et al., 2011], to provide a look of genuineness. Detecting this needs prior knowledge about all existing security logos. Figure 27 shows a phishing website that uses the “VeriSign” logo.
Figure 27: Phishing website with a company’s logo and VeriSign’s logo
• Keyword/Description. These objects and properties provide information about the website, such as copyright, ownership, and the content of the website. Although website mirroring is quite a simple process, and even the “Save as” option of all popular browsers is one of the simplest methods for website mirroring, this information can still be helpful for phishing detection. In fact, there are already some phishing prevention techniques which use it for phishing detection, such as the Bayesian filter.
• Sloppiness or lack of familiarity with English. Some phishing websites bear silly spelling mistakes, grammatical errors, and inconsistencies in the web contents. Sometimes this is done deliberately in order to bypass anti-phishing tools that use a content based filtering technique, i.e., a Bayesian filter. Moreover, designing a tool to check language mistakes is in itself another challenge, and there are many phishing websites that are in languages other than English.
• Email function. Some phishing websites include a function that sends email to the phishers. When a victim enters the information, it sends an email with all the information to the phisher. An example of Javascript code that sends email is:
function sendMail() {
  var link = "mailto:me@example.com"
           + "?cc=myCCaddress@example.com"
           + "&subject=" + escape("This is my subject")
           + "&body=" + escape(document.getElementById('myText').value);
  window.location.href = link;
}
This code can also be written in a server-side programming language, in which case it cannot be seen on the client side.
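Most of the source code anomalies above reduce to comparing the domains used in a page’s DOM objects with the page’s own domain. The following Python sketch is my own illustration of such checks; it is not one of the cited techniques and not the C Sharp utilities used for the experiment in the next subchapter, and the 0.5 threshold as well as the class and function names are assumptions made only for this example.

from html.parser import HTMLParser
from urllib.parse import urlparse

class AnomalyScanner(HTMLParser):
    """Collects anchor targets, resource URLs, and form actions from a page."""
    def __init__(self):
        super().__init__()
        self.anchors, self.resources, self.form_actions = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.anchors.append(attrs["href"])
        elif tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])
        elif tag == "form":
            self.form_actions.append(attrs.get("action") or "")

def foreign_ratio(urls, page_domain):
    """Fraction of absolute URLs whose host differs from the page's own domain."""
    hosts = [urlparse(u).hostname for u in urls]
    hosts = [h for h in hosts if h]          # relative URLs have no host
    if not hosts:
        return 0.0
    return sum(h != page_domain for h in hosts) / len(hosts)

def source_code_anomalies(html_text, page_url):
    """Flag a few anomalies from subchapter 4.2 (illustrative, not exhaustive)."""
    scanner = AnomalyScanner()
    scanner.feed(html_text)
    domain = urlparse(page_url).hostname
    return {
        "abnormal_anchor_urls": foreign_ratio(scanner.anchors, domain) > 0.5,
        "dead_anchors": any(a in ("#", "file:///E/") for a in scanner.anchors),
        "abnormal_request_urls": foreign_ratio(scanner.resources, domain) > 0.5,
        "abnormal_sfh": any(a in ("", "#", "about:blank")
                            or foreign_ratio([a], domain) > 0
                            for a in scanner.form_actions),
    }

Run over the downloaded source of a suspected page together with its URL, this returns a small set of boolean flags that a heuristic could then weight and combine.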
4.3. Verification of the anomalies using online phishing websites

An experiment was conducted to verify the anomalies listed in the aforementioned subchapters (i.e., 4.1 and 4.2). I used twenty online phishing websites, already validated as phishing websites by PhishTank, for the experiment. I selected serially the top twenty phishing websites that were verified as phishing on the 9th of August 2012. The list of the phishing websites’ URLs used for the experiment is included in the Appendix. I verified most of the anomalies, but a few of the anomalies were not verified due to technical complications. This includes anomalies that are related to grammatical mistakes in the web contents. I used mainly the login page of the phishing websites for the experiment, since it is the entry point and phishing has to be detected at this point. I mostly used tools and environments that already exist. The benefit of using existing tools is that these tools are online, stable, and their results can be trusted. The tools and environments used are:
• The Google search engine was used to obtain the popularity of the phishing URLs. The complete URL of each phishing website was used as a search keyword.
• The Google, Yahoo!, and Bing search engines were used for finding the credible in-neighbour search results of the phishing websites’ URLs.
• Yahoo! Directory was used to obtain the relevant web category. The spoofed organization name was used as a keyword.
• The DNS and WHOIS tool in My-Addr.com was used to get the DNS record of each phishing website’s URL.
• The Check/Search Port tool in My-Addr.com was employed to get the port used by the phishing websites.
• Notepad++ was used as a source code viewer, and its ‘find’ feature was used to search for DOM objects’ tags.
• Utility applications designed in the C Sharp programming language (.Net platform) were used for the extraction of properties of URLs and DOM objects.
In order to verify the anomalies, I chose one phishing website at a time and looked for all the anomalies in the website. I always started with the anomalies which require the website to be online, e.g., URLs hosted geographical location, URLs popularity, no credible in-neighbour search results, URLs with abnormal DNS record, URLs use different port number, use of free web hosting, and life of domain. One of the major challenges was that phishing websites do not remain online for a long time. Therefore, I had to make sure I got the required information before somebody took the website down. Then, I downloaded the phishing webpage for source code analysis, and after that I analyzed its URL for anomalies. I analyzed the source code of the phishing website last. During the analysis, I first checked whether the anomalies are present in the selected phishing website or not. Then, I obtained the count of occurrences for those anomalies whose count is necessary to differentiate between a legitimate website and a phishing website, such as the number of “Bag of words” in URLs, the number of script tags, and the use of multiple Top Level Domains (TLD) within the domain name. I also calculated the mean and median values of the counts of occurrences. The mean value is calculated when the data set (i.e., the set of values formed from the count of occurrences of an anomaly in each phishing website) is evenly distributed; otherwise, the median value is calculated. The results from the experiment are listed in Table 2 and Table 3. Table 2 contains the anomaly types and the number of phishing websites containing those anomalies in their URLs.
Properties: Results (Occurrence/Total)
Use IP address in URLs: 2/20
URLs contain brand or domain or host name: 12/20
URLs use http in place of https, i.e., abnormal SSL certificate: 20/20
URLs contain misspelled or derived domain name: 0/20
URLs use large host name: 9/20 with URL length equal to or greater than 75 characters; Mean = 96.9
Use short URLs: 2/20
Use “//” characters in URLs path: 1/20
URLs use unknown or unrelated domain name: 8/20
URLs use multiple Top Level Domains (TLD) within domain name: 20/20; Mean = 3
Use encoded URLs: 4/20
Uses special character ‘@’ in URLs: 0/20
URLs use different port number: 0/20
URLs with abnormal DNS record: Complete = 11; Incomplete = 8; Not found = 1
Number of sensitive words in URLs: 9/20
Number of “Bag of words” in URLs: 20/20; Mean = 9
URLs popularity: 18/20; Median result count = 3
No credible in-neighbour search results: 20/20
URLs absence in relevant web category: 20/20
Life of domain: Unknown, cannot obtain the life of the domain
Use of free web hosting: Unknown, cannot obtain information about the web hosting servers
Domain name character composition: Unable to classify
URLs hosted geographical location: 10/20 United States; 3/20 Spain; 1/20 each for France, Italy, Switzerland, Hong Kong, Vietnam, Turkey; 1/20 unknown
TLD triplets used in URL: 11/20 use .com
Table 2: Number of phishing websites containing anomalies in their URLs

Similarly, Table 3 contains the anomaly types and the number of phishing websites containing those anomalies in their source codes.

Properties: Results (Occurrence/Total)
Abnormal anchor URLs: 18/20
Abnormal Server Form Handler (SFH): 20/20
Abnormal request URLs: 18/20
Abnormal cookie: 3/20
Mismatch hyperlink: 0/20
Use of illegal pop-up windows: 0/20
Harmful forms: 20/20
Use of onMouseOver to hide the link: 0/20
Number of script tags: 20/20; Mean = 28
Presence of Javascript functions: 10/20
IFrame redirection: 0/20
Email functions: 0/20
Mismatch in form fields and domain name: 19/20
Disable right click: 0/20
Use authentic logo: 20/20
Integrate security logo: 11/20
Keyword/Description: Unknown, phishes used various languages
Sloppiness or lack of familiarity with English: Unknown, phishes used various languages
Table 3: Number of phishing websites containing anomalies in their source codes

4.4. Discussion on findings

The anomalies present in source codes are clearer than those found in URLs. Most of the anomalies in source code can be analyzed locally, which means they do not need an Internet connection and they are almost independent of the Internet speed once the web pages are loaded. Likewise, the majority of the anomalies in source codes require only textual matching, except for a few anomalies which need image matching or English grammar rules. One of the major problems in analyzing anomalies in source codes is that the web pages need to be loaded, which exposes Internet users to vulnerabilities from malicious code, keyloggers, and botnets. Although the risk from malicious code, keyloggers, and botnets can be reduced by using a sandboxed browser to load the webpage for analysis, this cannot guarantee complete protection from malware and malicious code [Sabanal and Yason, 2012]. Similarly, the analysis of anomalies in URLs does not need the web pages to be loaded, which means Internet users can stay safe from phishing conducted using malicious software. However, some of the anomalies found in URLs need an Internet connection and are time-consuming to evaluate.
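One partial way to limit the exposure described above, short of a full sandbox, is to fetch the raw HTML over HTTP without rendering it in a browser engine, so that any scripts in the page are never executed; the purely textual checks of subchapter 4.2 can then run on the downloaded source. The following Python sketch is my own illustration of that idea and is not the tooling used for the experiment; it also does not remove all risk, since the request still contacts the phisher’s server, and pages built dynamically by Javascript will differ from what a browser would show.

from urllib.request import Request, urlopen

def fetch_source(url, timeout=10):
    """Download raw page markup for offline analysis without rendering it.

    Because no browser engine parses or executes the page, scripts embedded
    in it never run; this reduces, but does not eliminate, the exposure that
    comes from opening a suspected phishing page in a normal browser.
    """
    request = Request(url, headers={"User-Agent": "anomaly-scanner/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")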
The experiment revealed that not all anomalies are equally important. Some important results from the experiment are:
• A promising set of anomalies, which had a high frequency and were strong indicators of phishing, is listed in Table 4.

Anomaly types
Abnormal Server Form Handler (SFH)
Harmful forms
URLs use http in place of https or abnormal SSL certificate
URLs contain brand or domain or host name
Abnormal anchor URLs
Abnormal request URLs
Mismatch in form fields and domain name
Table 4: Promising anomalies

• Some anomalies are highly occurring and are also important for phishing detection; however, they need prior information about the owner of the legitimate website and the security logo owner. The list of such anomalies is in Table 5.

Anomaly types
Authentic logo used
Security logo integrated
Table 5: Anomalies dependent on external factors

• It was also found that some of the anomalies, which are easy to avoid, are either rarely present (Table 6) or absent (Table 7) in phishing websites.

Anomaly types
Use IP address in URLs
Use encoded URLs
Use ‘//’ characters in URLs path
Abnormal cookie
Use short URLs
Table 6: Important anomalies that are less occurring

Anomaly types
Uses special character ‘@’ in URLs
Mismatch hyperlink
Use of illegal pop-up windows
Use of onMouseOver to hide the link
IFrame redirection
Email functions
Disable right click
URLs contain misspelled or derived domain name
URLs use unknown or unrelated domain name
Table 7: Important anomalies absent in phishing websites

• Some of the anomalies can have a higher time overhead, which can make them unsuitable in certain circumstances, for example, when the Internet speed is slow. The list of such anomalies is in Table 8.

Anomaly types
URLs with abnormal DNS record
No credible in-neighbour search results
URLs absence in relevant web category
Life of domain
Use of free web hosting
URLs hosted geographical location
URLs popularity
URLs use different port number
Table 8: Anomalies with higher time overhead

• There are some anomalies which are not clear, in the sense that the same anomalies also exist in legitimate websites. Therefore, such anomalies need further analysis to clarify exactly when their presence can declare a website a phishing website. The list of such anomalies is in Table 9.

Anomaly types
URLs use multiple TLD within domain name
TLD triplets used in URL
Number of sensitive words in URLs
Number of script tags
Number of ‘Bag of words’ in URLs
URLs use large host name
Presence of Javascript functions
Table 9: Vague anomalies (need further analysis)

Although Zhang et al. [2007a] stated that a genuine website contains less than five dots (‘.’) in its URL, i.e., the anomaly “URLs use multiple TLD within domain name”, only three phishing websites found during the experiment satisfy the condition, whilst there are legitimate websites which have login pages with more than five dots, e.g., https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&ct=1350861003&rver=6.1.6620.0&wp=MBI&wreply=http:%2F%2Fmail.live.com%2Fdefault.aspx&lc=1033&id=64648&mkt=en-us&cbcxt=mai&snsc=1, the login page URL for “Hotmail.com”, which has seven dots. Similarly, McGrath and Gupta mentioned that a long genuine URL can be at most seventy-five characters long and in general is around twenty-two characters.
But some of the phishing websites used for the experiment have URL lengths of less than twenty-two characters, whilst there are genuine websites whose login page URLs are longer than seventy-five characters, e.g., https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2, the login page URL of “Gmail.com”. Similarly, the TLDs stated by the anomaly called “TLD triplets used in URL” are the most common TLDs, and millions of legitimate websites use them. Likewise, for the anomalies “Number of sensitive words in URLs”, “Number of script tags”, and “Number of ‘Bag of words’ in URLs”, even though several websites contain them, what number should indicate phishing is unclear, and interestingly they are also very common in legitimate websites.
• Some anomalies are associated with the English language, while several phishing websites were found to be non-English. The list of such anomalies is in Table 10.

Anomaly types
Sloppiness or lack of familiarity with English
Domain name character composition
Keyword/Description
Table 10: Anomalies dependent on the English language

Anomalies in phishing websites can be an effective way to detect phishing, but there is a need for proper methods for the selection, calibration, and deployment of those anomalies. There is a need to look for an anomaly or a group of anomalies that are hard for phishers to manipulate and are unexpected in legitimate websites during the examination of suspected websites. Some important points that can be utilized during the deployment of anomalies for heuristic methods are:
(i) Priority should be given to the anomalies which phishers cannot easily avoid.
• Eliminating these anomalies costs phishers time, effort, and money. Further, it makes such phishing websites easier to detect, and sometimes it creates a risk for phishers that they might be traced. An example is “URLs use http in place of https, or abnormal SSL certificate”.
• Anomalies which are crucial for usability and social engineering. The removal of such anomalies can easily be noticed by Internet users, so phishers are forced to include them. An example is the authentic logo used in phishing websites.
• Anomalies that are a vital part of phishing and for which phishers usually do not have a good alternative. An example is the use of an abnormal Server Form Handler (SFH).
(ii) Priority should be given depending on the harmfulness of the anomalies.
• The higher the harmfulness of an anomaly when it is included in a website, the more important the anomaly is. An example is the use of an abnormal Server Form Handler (SFH).
(iii) Priority should be given to anomalies on the basis of the time taken for analysis versus the importance of the anomalies.
• It is important to weigh the time required to analyze an anomaly against the impact it makes on the phishing detection procedure. There should not be a time overhead. An example is checking URL popularity, which can have a time overhead when the Internet is slow.
(iv) Priority should be given to independent anomalies.
• Priority should be given to independent anomalies over dependent anomalies. Some anomalies need other anomalies to make sense in phishing detection. Examples of such anomalies are: “Harmful forms” and “URLs use http in place of https, i.e., abnormal SSL certificate”.
(v) There is a possibility that an anomaly will occur in legitimate websites other than the domain owner’s.
• Priority should be given to anomalies that have a high possibility of occurring in legitimate websites but are against recognized standards or practices, rather than anomalies that can occur in legitimate websites and are not objected to by any recognized standard. An example of an anomaly which is against a recognized standard is “Use of illegal pop-up windows”. Similarly, an anomaly which is not against any recognized standard is “Presence of Javascript functions”.
It is recommended to employ anomalies that are strong indicators of phishing in heuristic methods; however, the irony is that most phishers try to get rid of exactly those anomalies. Therefore, heuristic methods also have to rely on anomalies that are not strong indicators and can easily be found in legitimate websites. In addition, many web developers either lack information on standards related to the best practices in web development, such as W3C, ISO, Ecma International, and the Google Guidelines, or they deliberately do not follow these standards. Such developers unintentionally include several anomalies in their websites which are also characteristics of phishing websites, because of which their websites get misclassified. One of the prime reasons for such misclassification is that current heuristic methods that look for anomalies in the URLs and source codes of a suspected website usually look for each anomaly separately and assign a particular score to each of them. The problem with this approach is that they penalize all websites on an equal basis whenever an anomaly is present. Due to this, several unimportant anomalies, which also occur in legitimate and improperly designed websites, accumulate enough score to declare a legitimate website a phishing website. Moreover, this is not the way the human decision-making process works. The human decision-making process looks at other circumstances before making the final verdict, and the verdicts are justifiable. Such decision making should be applied to phishing detection too. A technique like that of Ludl et al. [2007], who employed the J48 algorithm to extract a decision tree to classify phishing and legitimate websites, can be more effective in such a case. It can provide intuitive insight into which features are important in classifying a data set.

5. Conclusions

Phishing is a concept that is almost a decade and a half old, having emerged in the mid-90s. It is also one of the most highly publicised cyber crimes, since it is related to money and adversely impacts businesses and the general public interest. Moreover, the majority of phishing uses a technically simple method, i.e., creating authentic looking forged websites and reaching potential victims through spam. Indeed, there is some phishing which employs complex techniques, such as cross-site request forgery, cross-site scripting, dynamic pharming, botnets, malicious code, and keylogger software. However, there is no countermeasure that can outperform the rest and protect from every kind of phishing. There are a number of studies which have worked on technical and non-technical aspects with the objective of determining remedies for phishing. They claim to be more effective than their contemporaries, but the reality is that most of them perform well only for certain kinds of phishing and usually fail to counter various tricky phishing strategies. This might be because phishing does not just exploit technical vulnerabilities but equally exploits human vulnerabilities.
There can be exact solutions for technical vulnerabilities, but the exploitation of human behaviour and decision making does not have any precise remedy. Additionally, the methods adopted by phishers are constantly changing. When security experts succeed in designing a countermeasure for one, phishers discover new routes to make successful attacks. One of the common mistakes that most phishing prevention techniques make is that they treat users’ purpose for web browsing and security significance as two different components. They inform the user that something is wrong and prohibit proceeding; however, they do not provide suitable alternatives [Ma, 2006; Wu et al., 2006b]. They usually neglect the fact that security is not the prime concern of Internet users, and this pushes Internet users to take risks despite the warnings. Further, the design of phishing prevention techniques is compounded by several issues. Most phishing prevention techniques fail to overcome one or many of these issues. Some of these issues are:
• Accuracy in results. The results from any phishing prevention system should be accurate, i.e., no false positive and no false negative results. Any errors in the results diminish the credibility of phishing prevention systems and ultimately discourage Internet users from using them, or encourage Internet users to take risks and fall for phishing. At the same time, a website that is doubtful but cannot be confirmed as phishing or not poses a challenge for phishing prevention systems.
• Effective warning. It is very important to have an effective method to warn Internet users and stop them from revealing their credentials to phishing websites. This is one of the major challenges for anti-phishing tools. Several past studies have shown that passive alert signals or messages are either unnoticed or ignored by Internet users [Dhamija et al., 2006; Wu et al., 2006a; Zhang et al., 2007b]. An active warning, i.e., refusing to connect, should be absolutely certain; otherwise it is unacceptable. Moreover, in the case of passive warnings, the frequency of the alert messages should be such that no phish is missed, while at the same time it should be comfortable for Internet users. Bombarding them with alert messages can force Internet users to switch off anti-phishing tools. It has also been found that too frequent alert messages desensitize Internet users, and they are then more likely to reveal their personal details to phishing [ITNOW, 2012].
• Execution time matters. Time is an important factor in all kinds of software. It matters even more for client-side phishing prevention toolbars. Client-side phishing prevention toolbars perform the verification of a webpage before loading it. Therefore, a slow system can highly demotivate Internet users from using it. However, this constraint forces the detection of those anomalies that are quick to analyse, even though they might not be very effective in detecting phishing in practice.
• Address security and Internet users’ intentions together. Security and Internet users’ intentions cannot be dealt with separately. The majority of phishing prevention tools make the mistake of separating them. They attempt to solve the security problem and disregard the Internet users’ specific intention. They inform the user that there is something wrong, but never tell them specific ways to continue.
It is recommended to integrate the security concerns into the critical path of the Internet users’ task [Wu et al., 2006b] and to provide them with suitable alternatives when phishing is detected. However, this needs an extra process to determine the alternatives, which affects execution time.
• Scale problem. Phishing is very dynamic, and phishers constantly look for ways to bypass phishing prevention techniques. It also means that the more popular a phishing prevention technique is, the more effort phishers will apply to evade it. Therefore, phishing prevention techniques also have to be constantly updated to cover emerging trends in phishing.
• Usability and Internet users’ behaviour under controlled conditions. Almost all studies of usability and Internet users’ behaviour are performed under controlled conditions due to ethical and legal issues. Such studies are unable to see all the factors that can influence the results. However, such studies cannot be conducted under uncontrolled conditions due to privacy, ethics, and legality issues. Therefore, there is a need for more studies and research to develop robust technical approaches. It equally needs some flexibility from the social and legal side to conduct such studies freely.
The current trends in phishing prevention are mostly reactive techniques. Therefore, there is a need for proactive strategies for phishing prevention. The web development industry needs technology and practices which can make it difficult for phishers to conduct phishing. One of the major factors encouraging scammers to conduct phishing is the low cost of and high benefit from phishing. When their benefits get reduced, fewer and fewer people will be interested in conducting phishing.
Awareness about security and standards among web developers is another necessary factor. For instance, web developers should properly fill in all the different fields of the source code with some information related to their domain name by clearly identifying every HTML tag [Gastellier-Prevost et al., 2011]. In addition, a web developer should not use features that are disallowed by the recognized standards, such as the recommendations from W3C and the standards published by ISO. They should develop code in a way that facilitates phishing prevention methods. Similarly, companies should follow standards and guidelines to make their websites easier to distinguish from phony websites. There is also a need for work on the development of technology that can trace phishers and help the legal authorities to punish them. This does not mean phishing can be eliminated; however, it can be significantly reduced.
Last but not least, non-technical methods can be a vital player in the war against phishing. However, many of the organizations prone to phishing still do not provide information or counselling to their new customers regarding the dangers of phishing unless they are victimized. This might be because conducting counselling needs resources, and there is also a chance that their customers wrongly understand it as a weakness of the organization. Many organizations do include static information about phishing on their websites, but this is dull for many customers, and they hardly read it. Therefore, there is a need for improvement in the presentation of such information. For instance, techniques such as puzzles and games can be motivating and effective ways to teach customers about phishing.
6. Limitations and future development work

In this thesis, the experiment was conducted only on phishing websites, so I believe the results could be more accurate if the same study were conducted on legitimate websites as well. More importantly, the results obtained are based solely on a meta-analysis of past studies followed by an experiment on phishing websites. In order to see a clear picture of the results, it is necessary to apply them in real-time anti-phishing software. Therefore, designing such software is the main future development work arising from this thesis.

References

[APWG, 2012] Phishing activity trends report: 1st half 2012. Report January-March 2012. Available as: http://www.antiphishing.org/reports/apwg_trends_report_q1_2012.pdf (retrieved on 5th May 2012)
[American Bankers Association, 2005] ABA works on fraud: phishing prevention and resolution. Available as: http://www.angelinabank.com/phishing063005.pdf (retrieved on 15th October 2012)
[Bing Webmaster Tools] How to submit a sitemap. Available as: http://onlinehelp.microsoft.com/en-US/bing/hh204487.aspx (retrieved on 7th July 2012)
[CallingID] CallingID toolbar. Available as: http://www.callingid.com/Default.aspx (retrieved on 17th November 2012)
[Cloudmark] Cloudmark Anti-Fraud toolbar. Available as: http://www.cloudmark.com/en/products/cloudmark-desktopone/index (retrieved on 17th November 2012)
[DNSSEC Validator] DNSSEC Validator 1.1.5. Available as: https://addons.mozilla.org/en-us/firefox/addon/dnssec-validator/ (retrieved on 18th November 2012)
[IDG News Service, May 10 2012] NASA and Pentagon hacker TinKode receives two years suspended jail sentence. Available as: http://news.idg.no/cw/art.cfm?id=F21FFE88-01F3-6A5A-F13AD8F4C45D72FC (retrieved on 16th November 2012)
[EarthLink] EarthLink toolbar. Available as: http://www.earthlink.net/software/domore.faces?tab=toolbar (retrieved on 17th November 2012)
[eBay Toolbar’s Account Guard] Using eBay toolbar’s account guard. Available as: http://pages.ebay.com.au/help/account/toolbar-account-guard.html (retrieved on 28th July 2012)
[Fraud Eliminator] Fraud Eliminator toolbar. Available as: http://www.topsecretsoftware.com/fraud-eliminator.html (retrieved on 17th November 2012)
[Geo Trust] Geo Trust Trustwatcher toolbar. Available as: http://dnstree.com/com/trustwatch/ (retrieved on 17th November 2012)
[Google Safe Browsing] Google Safe Browsing API. Available as: https://developers.google.com/safe-browsing/ (retrieved on 17th November 2012)
[Google Support] Phishing and malware detection. Available as: https://support.google.com/chrome/bin/answer.py?hl=en&answer=99020&p=cpn_safe_browsing (retrieved on 31st July 2012)
[Google Webmaster Guidelines] Best practices to help Google find, crawl, and index your site. Available as: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769 (retrieved on 7th July 2012)
[Google Webmaster Tools] How often does Google crawl the web? Available as: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=34439 (retrieved on 7th July 2012)
[Hacker Factor Solutions, 2005] Anti-Phishing: page encoding. Available as: http://www.hackerfactor.com/papers/ap-page_encoding.pdf (retrieved on 2nd March 2012)
[IBM Internet Security Systems, 2007] The phishing guide: understanding & preventing phishing attacks.
Available as: http://www- 935.ibm.com/services/us/iss/pdf/phishing-guide-wp.pdf (retrieved on 2nd March 2012) [ITNOW, 2012] Overload information, ITNOW- The Chartered Institute for IT, autumn 2012. [MarkMonitor Inc., 2008] Whitepaper- Rock phishing: the thread and recommended countermeasures. Available as: https://www.markmonitor.com/download/wp/wp- rock-phish.pdf (retrieved on 2nd March 2012) [MSDN IEBlog] IE8 security part III: SmartScreen filter. Available as: http://blogs.msdn.com/b/ie/archive/2008/07/02/ie8-security-part-iii-smartscreen- filter.aspx (retrieved on 22nd July 2012) [Netcraft] Why use the Netcraft toolbar? Available as: http://toolbar.netcraft.com/ (retrieved on 23rd July 2012) [NYDailyNews.com, July 14 2011] Pentagon hacked, 24,000 files stolen by ‘foreign intruders’ in cyber attack. Available as: http://articles.nydailynews.com/2011-07- 14/news/29792364_1_cyber-attack-terrorist-group-pentagon-computer-system (retrieved on 28th July 2012) [PhishTank] Online valid phishes. Available as: http://www.phishtank.com/phish_search.php?valid=y&active=All&Search=Searc h (retrieved on 9th of August 2012) [SpoofStick] SpoofStick 1.02. Available as: https://whatapp.org/spoofstick/ (retrieved on 28th July 2012) [SpoofGuard] SpoofGuard. Available as: http://crypto.stanford.edu/SpoofGuard/ (retrieved on 9th October 2012) [Aburrous et al., 2010] Maher Aburrous, M.A. Hossain, Keshav Dahal, and Fadi Thabtah, Experimental case studies for investigation e-banking phishing techniques and attacks strategies. Springer Science+ Business Media, LLC 2010. 81 </p><p>[Alkhozae and Batarfi, 2011] Mona Ghotaish Alkhozae and Omar Abdullah Batarfi, Phishing websites detection based on phishing characteristics in the webpage source code. IJICT, Volume 1 No.6, October 2011, ISSN-2223-4985. [Bian et al., 2009] Kaigui Bian, Jung-Min” Jerry” Park, Michael S. Hsiao, France Belanger, and Janine Hiller, Evaluation of online resources in assisting phishing detection. In: Proc. of 2009 Ninth Annual International Symposium on Applications and the Internet, Page 30-36. [Cao et al., 2008] Ye Cao, Weili Han, and Yueran Le, Anti-phishing based on automated individual white list. ACM 978-1-60558-294-8/08/10. [Chen et al., 2009] Kaun-ta Chen, Chun-Rong Huang, Chu-Song Chen, and Jau-Yuan Chen, Fighting phishing with discriminative keypoint features. IEEE Internet Computing, 1089-7801/09. [Choi et al., 2011] Hyunsang Choi, Bin B. Zhu, and Heejo Lee, Detecting malicious web links and indentifying their attack types. In: Proc. of 2nd USENIX Conference on Web Application Development 2011. [Chou et al., 2004] Neil Chou, Robert Ledesma, Yuka Teraguchi, and John C. Mitchell, Client-side defence against web-based identity theft. In: Proc. of 11th Annual Network and Distributed System Security Symposium, 2004. [Cordero and Blain, 2006] Arel Cordero and Tamara Blain, Catching phish: Detecting phishing attacks from rendered website images. University of California, Berkeley, CA, 94720, 12th December, 2012. Also available as: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.9084&rep=rep1&ty pe=pdf (retrieved on 27th July 2012). [Dhamija et al., 2006] Rachna Dhamija, J.D.Tygar, and Marti Hearst, Why phishing works. ACM 1-59593-178-3/06/0004. [Dhamja and Tygar, 2005] Rachna Dhamija and J.D. Tygar, The battle against phishing: Dynamic security skins. In: Proc. Symposium On Usable Privacy and Security (SOUPS) 2005, July 6-8, 2005, Pittsburgh, PA, USA. [Dong et al., 2008] Xun Dong, John A. 
Clark, and Jeremy Jacob, Modeling user- interaction. IEEE 2008, 1-4244-1543-8/08. [Downs et al., 2006] Julie S. Downs, Mandy B. Holbrook, and Lorrie Faith Cranor, Decision strategies and susceptibility to phishing. In: Proc. of Symposium On Usable Privacy and Security (SOUPS), July 12-14, 2006, Pittsburgh, PA, USA. [Dunlop et al., 2010] Matthew Dunlop, Stephen Groat, and David Shelly, GoldPhish: using images for content-based phishing analysis. In: Proc. of Fifth International Conference on Internet Monitoring and Protection, 2010, ICIMP, pp.123-128. [Edwards et al., 2007] W. Keith Edwards, Erika Shehan Poole, and Jennifer Stoll, Security automation considered harmful? ACM 978-1-60558-080-7/07/09. 82 </p><p>[Egelman et al., 2008] Serger Egleman, Lorrie Faith Cranor, and Jason Hang, You’ve been warned: An empirical study of the effectiveness of web browser phishing warning. In: Proc. of CHI 2008, April5-10, 2008, Florence, Italy. ACM 1-59593- 178-3/07/0004. [Fette et al., 2006] Ian Fette, Norman Sadeh, and Anthony Tomasic, Learning to detect phishing emails. Carnegie Mellon University, School of Computer Scienec, Technical Report CMU-CyLab-06-012. Available as: http://www.cs.cmu.edu/~tomasic/doc/2007/FetteSadehTomasicWWW2007.pdf (retrieved on 2nd May 2012). [Florêncio and Herley, 2006] Dinei Florêncio and Cormac Herley, Analysis and improvement of anti-phishing schemes. Security and Privacy in Dynamic Environments IFIP International Federation for Information Processing Volume 201, 2006, pp 148-157. [Friedman et al., 2002] Batya Friedman, Helen Nissenbaum, David Hurley, Daniel C. Howe, and Edward Felten, Users’ conceptions of risks and harms on the web: A comparative study. ACM 1-58113-454-1/02/0004. [Fu et al., 2006] Anthony Y. Fu, Liu Wenyin, and Xiaotie Deng, Detecting phishing web pages with visual similarity assessment based on Earth Mover’s Distance (EMD). In: IEEE Transactions on Dependable and Secure Computing, Vol. 3, No. 4, October-December 2006. [Garera et al., 2007] Sujata Garera, Niels Provos, Monica Chew, and Aviel D. Rubin, A framework for detection and measurement of phishing attacks. ACM 978-1- 59593-886-2/07/0011. [Gastellier-Prevost et al., 2011] Sophie Gastellier-Prevost, Gustavo Gonzalez Granadillo, and Maryline Laurent, Decisive heuristics to differentiate legitimate from phishing sites. In: Proc. of Network and Information System Security (SAR- SSI), 2011 Conference. ACM 978-1-4577-0735-3. [Herzberg and Gbara, 2004] Amir Herzberg and Ahmad Gbara, TrustBar: protecting (even naïve ) web users from spoofing and phishing attacks. Bar Ilan University, Dept. of Computer Science. Available as: http://u.cs.biu.ac.il/~herzbea/Papers/ecommerce/spoofing.htm (retrieved on 23rd July 2012). [Huh and Kim, 2011] Jun Ho Huh and Hyoungshick Kim, Phishing detection with popular search engines: Simple and effective. In: Proc. of Springer-Verlag Berlin Heidelberg 2011, FPS 2011, LNCS 6888, pp.194-207, 2011. [Jagatic et al., 2007] Tom Jagatic, Nathaniel Johnson, Markus Jakobsson, and Filippo Menczer, Social phishing. ACM, Volume 50 Issue 10, October 2007, Pages 94- 100. 83 </p><p>[Jakobsson, 2005] Markus Jakobsson, Modeling and preventing phishing attacks. In: Proc. the 9th International Conference on Financial Cryptography and Data Security, Pages 89-89. [Karakasiliotis et al., 2007] Athanasios Karakasiliotis,Steven Furnell, and Maria Papadaki, An assessment of end-user vulnerability of phishing attacks. Journal of Information Warfare, 6 (1), 2007, pp. 17-28. 
[Kittur et al., 2008] Aniket Kittur, Ed H. Chi, and Bongwon Suh, Crowdsourcing user studies with Mechanical Turk. In: Proc. of CHI 2008, April 5-10, 2008, Florence, Italy. ACM 978-1-60558-011-1/08/04.

[Kumaraguru et al., 2009] Ponnurangam Kumaraguru, Justin Cranshaw, Alessandro Acquisti, Lorrie Cranor, Jason Hong, Mary Ann Blair, and Theodore Pham, School of phish: A real-world evaluation of anti-phishing training. In: Proc. of 5th Symposium on Usable Privacy and Security (SOUPS ’09).

[Lam et al., 2009] Ieng-Fat Lam, Wei-Cheng Xiao, Szu-Chi Wang, and Kuan-Ta Chen, Counteracting phishing page polymorphism: An image layout analysis approach. In: Proc. of ISA 2009.

[Li et al., 2007] Linfeng Li, Marko Helenius, and Eleni Berki, Phishing-resistant systems: security handling with misuse cases design. In: Proc. of SQM07, 389-404, 2007.

[Li and Helenius, 2007] Linfeng Li and Marko Helenius, Usability evaluation of anti-phishing toolbars. Journal in Computer Virology, Volume 3, 163-184, DOI 10.1007/s11416-007-0050-4.

[Liu et al., 2006] Wenyin Liu, Xiaotie Deng, Guanglin Huang, and Anthony Y. Fu, An anti-phishing strategy based on visual similarity assessment. IEEE Internet Computing, ACM 1089-7891/06.

[Liu et al., 2011] Gang Liu, Guang Xiang, Bryan A. Pendleton, Jason I. Hong, and Wenyin Liu, Smartening the crowds: computational techniques for improving human verification to fight phishing scams. In: Proc. of Symposium On Usable Privacy and Security (SOUPS) 2011, July 20-22, 2011, Pittsburgh, PA, USA.

[Ludl et al., 2007] Christian Ludl, Sean McAllister, Engin Kirda, and Christopher Kruegel, On the effectiveness of techniques to detect phishing sites. In: Proc. of DIMVA’07, the 4th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer-Verlag Berlin, Heidelberg, 2007, ISBN 978-3-540-73613-4.

[Ma, 2006] Robert Ma, Phishing attack detection by using a reputable search engine. University of Toronto, Dept. of Electrical and Computer Engineering. Available as: http://www.eecg.toronto.edu/~lie/Courses/ECE1776-2006/Projects/Phishing2a-proposal.pdf (retrieved on 7th July 2012).

[Ma et al., 2009] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker, Beyond blacklists: Learning to detect malicious web sites from suspicious URLs. In: Proc. of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1245-1254, June-July 2009.

[Martino and Perramon, 2010] Antonio San Martino and Xavier Perramon, Phishing secrets: history, effects, and countermeasures. International Journal of Network Security, Vol. 11, No. 3, pp. 163-171, November 2010.

[McGrath and Gupta, 2008] D. Kevin McGrath and Minaxi Gupta, Behind phishing: An examination of phisher modi operandi. In: Proc. of 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats, San Francisco, California, USA. USENIX Association, Berkeley, CA, USA, 2008, Article No. 4.

[McRae and Vaughan, 2007] Craig M. McRae and Rayford B. Vaughn, Phighting the phisher: Using web bugs and honeytokens to investigate the source of phishing attacks. In: Proc. of 40th Annual Hawaii International Conference on System Sciences (HICSS ’07), 0-7695-2755-8/07.

[Medvet et al., 2008] Eric Medvet, Engin Kirda, and Christopher Kruegel, Visual similarity-based phishing detection. ACM ISBN 978-1-60558-241-2.

[Milletary, 2006] Jason Milletary, Technical trends in phishing attacks. United States Computer Emergency Readiness Team (US-CERT), 2006. Available as: http://www.us-cert.gov/reading_room/phishing_trends0511.pdf (retrieved on 2nd May 2012).

[Moore and Clayton, 2008] Tyler Moore and Richard Clayton, Evaluating the wisdom of crowds in assessing phishing websites. In: Proc. of Financial Cryptography and Data Security (FC) 2008, LNCS 5143, pp. 16-30.

[Odaro and Sanders, 2011] Ugiomo S. Odaro and Benjamin G. Sanders, Social engineering: phishing for a solution. In: Proc. of IT Security for the Next Generation - European Cup 2011, Kaspersky Lab.

[Pan and Ding, 2006] Ying Pan and Xuhua Ding, Anomaly based web phishing page detection. In: Proc. of 22nd Annual Computer Security Applications Conference (ACSAC’06), IEEE Computer Society, 2006.

[Prakash et al., 2010] Pawan Prakash, Manish Kumar, Rao Kompella, and Minaxi Gupta, PhishNet: Predictive blacklisting to detect phishing attacks. In: Proc. of IEEE INFOCOM (International Conference on Computer Communications) 2010.

[Rasmussen and Aaron, 2011] Rod Rasmussen and Greg Aaron, Global phishing survey: trends and domain name use in 1H2011. APWG Report, January-June 2011. Available as: http://www.antiphishing.org/reports/APWG_GlobalPhishingSurvey_1H2011.pdf (retrieved on 3rd May 2012).

[Sabanal and Yason, 2012] Paul Sabanal and Mark Vincent Yason, Digging deep into the Flash sandboxes. IBM Security Systems. Available as: http://media.blackhat.com/bh-us-12/Briefings/Sabanal/BH_US_12_Sabanal_Digging_Deep_WP.pdf (retrieved on 17th November 2012).

[Sheng et al., 2007] Steve Sheng, Bryant Magnien, Ponnurangam Kumaraguru, Alessandro Acquisti, Lorrie Faith Cranor, Jason Hong, and Elizabeth Nunge, Anti-Phishing Phil: The design and evaluation of a game that teaches people not to fall for phish. In: Proc. of Symposium on Usable Privacy and Security (SOUPS) 2007, July 18-20, 2007, Pittsburgh, PA, USA.

[Singh, 2007] N.P. Singh, Online frauds in banks with phishing. Journal of Internet Banking and Commerce, August 2007, Vol. 12, No. 2.

[Wang et al., 2011] Ge Wang, He Liu, Sebastian Becerra, Kai Wang, Serge Belongie, Hovav Shacham, and Stefan Savage, Verilogo: Proactive phishing detection via logo recognition. University of California, San Diego, Dept. of Computer Science and Engineering, Technical Report CS211-0969, UC San Diego, August 2011. Available as: http://cseweb.ucsd.edu/~hovav/dist/verilogo.pdf (retrieved on 2nd August 2012).

[Wenyin et al., 2005] Liu Wenyin, Guanglin Huang, Liu Xiaoyue, Zhang Min, and Xiaotie Deng, Detection of phishing webpages based on visual similarity. ACM 1-59593-051-5/05/0005.

[Whittaker et al., 2010] Colin Whittaker, Brian Ryner, and Marria Nazif, Large-scale automatic classification of phishing pages. Google Inc., Research at Google: Research Areas & Publications. Available as: http://research.google.com/pubs/pub35580.html (retrieved on 26th July 2012).

[Wu et al., 2006a] Min Wu, Robert C. Miller, and Greg Little, Web Wallet: Preventing phishing attacks by revealing user intentions. In: Proc. of The Second Symposium on Usable Privacy and Security (SOUPS 2006), pp. 102-113, 2006.

[Wu et al., 2006b] Min Wu, Robert C. Miller, and Simson L. Garfinkel, Do security toolbars actually prevent phishing attacks? ACM 1-59593-178-3/06/0004.

[Xiang and Hong, 2009] Guang Xiang and Jason I. Hong, A hybrid phish detection approach by identity discovery and keywords retrieval. ACM 978-1-60558-487-4/09/04.

[Xiang et al., 2011] Guang Xiang, Jason Hong, Carolyn P. Rose, and Lorrie Cranor, CANTINA+: A feature-rich machine learning framework for detecting phishing websites. ACM Transactions on Information and System Security (TISSEC), Volume 14, Issue 2, September 2011, Article No. 21.

[Zhang et al., 2007a] Yue Zhang, Jason Hong, and Lorrie Cranor, CANTINA: A content-based approach to detecting phishing web sites. ACM 978-1-59593-654-7/07/0005.

[Zhang et al., 2007b] Yue Zhang, Serge Egelman, Lorrie Cranor, and Jason Hong, Phinding phish: Evaluating anti-phishing tools. In: Proc. of the 14th Annual Network and Distributed System Security Symposium (NDSS 2007).

Appendix

Important terminology and definitions

The Anti-Phishing Working Group (APWG): An international consortium formed to fight against phishing and on-line fraud.

Active warning: A warning that forces Internet users to notice it by interrupting their activity.

Code obfuscation: The act of converting code into a form that is difficult to understand; it is mainly performed to protect code from reverse engineering.

Crimeware: Software designed for conducting cybercrime.

Cross-site request forgery: A malicious exploitation of a website in which a legitimate user is forced to execute unauthorized commands.

Cross-site scripting: An attack in which malicious code is injected into the client side of a legitimate webpage.

DNS spoofing: An attack that makes a DNS server return wrong IP addresses and thereby diverts traffic to another computer.

Domain name typos: The act of generating a list of misspelled and mistyped variants of an entered domain name (illustrated in the sketch that follows this glossary).

Denial of Service (DoS): An attack on a network by flooding it with useless traffic.

DOM (Document Object Model) objects: The Document Object Model is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents. [W3C]

DMOZ: A web directory.

False negative: A phishing website is misclassified as a legitimate website.

False positive: A legitimate website is misclassified as a phishing website.

Heuristic methods: Techniques in which various characteristics of a website are checked to decide whether it is a phishing website or not (a second sketch, placed after the URL list at the end of this appendix, illustrates a few such checks).

Malware: Malicious software used to disrupt computer operation; it is also used to conduct phishing.

Malicious code: Any code or script in a software system that is intended to cause an undesired effect, a security breach, or damage to the system. [Wikipedia]

Man in the middle attacks: An intrusion into an existing connection to intercept the exchanged data and inject false information.

MarkMonitor: A company that develops Internet brand protection software and services.

Mirroring of a website: The act of creating an exact copy of another website.

Passive warning: A warning that just displays a message without interrupting the Internet user's activity.

Password harvester: Malicious software that looks for username and password information on the victim's computer.

Pharming: An attack intended to redirect a website's traffic to a bogus website.

PhishTank: An anti-phishing website.

Sandbox: A security mechanism for running programs from untrusted sources.

Session hijacking: The exploitation of a computer session in order to gain unauthorised access to information or services in a computer. [Wikipedia]

Secure Socket Layer: A cryptographic protocol used for secure communication over the Internet.

Spam: Unsolicited bulk messages, usually used for advertisement.

Trojan horse: A kind of malware.
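To make the "Domain name typos" entry above more concrete, the following minimal Python sketch generates a few typo variants of a domain name. It is an illustrative assumption only: the variant strategies (character omission, duplication, and adjacent swap) and the example domain are chosen for this sketch and are not part of the thesis experiment or of any tool cited above.

```python
# Illustrative sketch only (not from the thesis): generate simple typo
# variants of a domain name, as described in the "Domain name typos" entry.
def typo_variants(domain):
    name, dot, tld = domain.partition(".")
    variants = set()
    for i in range(len(name)):
        # omit one character, e.g. "papal.com"
        variants.add(name[:i] + name[i + 1:] + dot + tld)
        # duplicate one character, e.g. "paypall.com"
        variants.add(name[:i] + name[i] * 2 + name[i + 1:] + dot + tld)
        # swap two adjacent characters, e.g. "payapl.com"
        if i + 1 < len(name):
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:] + dot + tld)
    variants.discard(domain)
    return sorted(variants)

if __name__ == "__main__":
    for variant in typo_variants("paypal.com")[:10]:
        print(variant)
```

Phishers register such variants hoping to catch mistyped visits; conversely, a defender can generate the same list in advance and watch for look-alike registrations.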
List of URLs for the valid phishing websites used for the experiment (Source: PhishTank)

1. http://agenciasck.goldenbiz.com.br/ (brand: SCK Imperial)
2. http://credit10.webobo.biz/download.php?id_menu=3441921/ (brand: Haboo)
3. http://deutchland-konto.ntdll.net/img/glyph/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f84f1036d8f209d3d19ebb6f4eeec8bd0eaf4a55ab8d6b037be0813c1fa7ae828caf4a55ab8d6b037be0813c1fa7ae828c (brand: Paypal)
4. http://lehoapaper.com/Paypal_Virefication/1596578fae650778e27f8ffbd70c4502/ (brand: Paypal)
5. http://masterstudio.es/wp-includes/js/crop/ (brand: Paypal)
6. http://ilhanpolat.com/account/id/78550375/paypal/pp/update/webscr/6998GSQ64976W84f356Gi6Bn432/profile/webscr/pp/us/www.Paypal.com/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f8a8350c132bc41e0934cfc023d4e8f9e5fb78214886cead8bcd4c1677f8e7572cfb78214886cead8bcd4c1677f8e7572c (brand: Paypal)
7. http://188.138.124.133/www.paypal.com/session_id/87544455623222414898896521454598/index.htm# (brand: Paypal)
8. http://pornographicrecordings.com/img/icons/tabs/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f84f1036d8f209d3d19ebb6f4eeec8bd0eb8fde1c0e2ec85dcf4341e5b995664adb8fde1c0e2ec85dcf4341e5b995664ad (brand: Paypal)
9. http://sreeramsolutions.com/ayyalu/images/login.php (brand: CAPITEC Bank)
10. http://sreeramsolutions.com/ayyalu/images/capitec.htm (brand: CAPITEC Bank)
11. http://prophor.com.ar/prophor/wells/alerts.php and http://specialneedssvg.org/wp/wp-admin/import/wellsfargo/wellsfargo/wellsfargo2011/index.php (brand: WELLS FARGO)
12. http://rrnow.findhere.org/ (brand: Time Warner Cable)
13. http://paypal.com.login.secure.md5.id.0645654032132165461321.fabianpulido.com/b22668f2a2c3063efb7749ac67fef65a/ (brand: Paypal)
14. http://net77-43-56-76.mclink.it/.ss/ and http://78.188.234.21/.ss3/?https://bankingportal.kreissparkasse-heinsberg.de/portal/portal/StartenIPSTANDARD (brand: Sparkasse)
15. http://godknwswhy.x90x.net/ (brand: Yahoo!Mail)
16. http://zulumarket.com/negocio/index.html (brand: CHASE)
17. http://abnerindonesia.com/billingcenter/aol/XKklowI9292O02/DBMECX8QgQ1BHaQQv4pYZFzemQbF/verify/Accounts/Secure_Area/aol/update.php (brand: AOL Mail)
18. http://abnerindonesia.com/billingcenter/aol/XKklowI9292O02/DBMECX8QgQ1BHaQQv4pYZFzemQbF/verify/Accounts/Secure_Area/aol/ (brand: AOL Mail)
19. http://alex.24openstore.de/PayPal/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f8a8350c132bc41e0934cfc023d4e8f9e5eb7cfbb17ec87b191acc343bb447f8e9eb7cfbb17ec87b191acc343bb447f8e9 (brand: Paypal)
20. http://us.battlle.net.htm.isnyeo.info/battle_net_account.html?ref=https%3A%2F%2Fus.battle.net%2Faccount%2Fmanagement%2Findex.xml&app=bam&t=1 (brand: BATTLENET)
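As a final illustration of the "Heuristic methods" glossary entry, the short Python sketch below checks a few URL anomalies of the kind discussed in this thesis (an IP address used as the host, an unusually deep subdomain chain, an '@' sign in the authority part, and a brand keyword appearing outside the registered domain) against URL number 13 from the list above. The brand keyword list, the thresholds, and the naive approximation of the registered domain are assumptions made for this example; they are not the anomaly set or the evaluation procedure used in the experiment.

```python
# Illustrative sketch only (assumptions noted in the text above); it does not
# reproduce the anomaly set or the evaluation procedure used in the thesis.
import re
from urllib.parse import urlparse

BRAND_KEYWORDS = ["paypal", "wellsfargo", "aol", "chase", "battle.net"]  # assumed list

def url_anomalies(url):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    findings = []
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        findings.append("IP address used as the host")
    if host.count(".") >= 4:
        findings.append("unusually deep subdomain chain")
    if "@" in parsed.netloc:
        findings.append("'@' sign in the authority part of the URL")
    registered = ".".join(host.split(".")[-2:])  # naive approximation of the registered domain
    for brand in BRAND_KEYWORDS:
        if brand in url.lower() and brand not in registered:
            findings.append(f"brand keyword '{brand}' outside the registered domain")
    return findings

if __name__ == "__main__":
    url_13 = ("http://paypal.com.login.secure.md5.id.0645654032132165461321."
              "fabianpulido.com/b22668f2a2c3063efb7749ac67fef65a/")
    for finding in url_anomalies(url_13):
        print(finding)
```

For URL 13 this flags the deep subdomain chain and the "paypal" keyword sitting outside the registered domain (fabianpulido.com), which matches the kind of anomaly the experiment looks for in phishing URLs.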
".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 5) { adcall('pf' + endPage); } } }, { passive: true }); if (totalPage > 0) adcall('pf1'); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html><script data-cfasync="false" src="/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js"></script>