BUILDING BETTER RANKINGS: OPTIMISING FOR HIGHER POSITIONING

A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Management

at

THE UNIVERSITY OF SHEFFIELD

by

PETER GERAINT MASON

September 2004

Abstract

The rapid growth of the World Wide Web has seen a profusion of websites on every conceivable subject. Search engines have developed to make sense of this chaos for users. Most search engines use automated programs known as crawlers or spiders to find web pages by following hyperlinks and indexing the text they find. These results are then ranked according to a relevance algorithm in response to a user’s query.

Many website operators seek to improve the search engine rankings of their sites by a variety of means, collectively known as search engine optimisation. These approaches range from simple principles of good design to more duplicitous means such as cloaking, keyword stuffing and link farms.

An extensive literature review explores a variety of themes surrounding search engines in general, and search engine optimisation in particular. Following this, an experiment involving keyword density analysis of a sample of nearly 300 web pages is presented. Statistical analysis reveals the contribution made by a number of factors (including document length, keyword density and PageRank) to a web page’s search rankings in the Google search engine.

Acknowledgements

There are several people I would like to acknowledge for their help and advice during the course of this dissertation. I would like to thank my supervisor, Dr. Mark Sanderson, for his support and advice, and for being patient with my writer’s block. Thanks also go out to Jean Russell of the University of Sheffield’s Corporate Information and Computing Services for her help in explaining how to turn my mass of data into meaningful information. My parents, Rosalind and Peter Mason, have been of immeasurable support, and have endured the rigours of proof-reading with ease. Finally, I would like to thank Matt Cutts of Google and Danny Sullivan of Jupiter Research for each taking a few minutes to discuss the project and offer their advice.

Contents

Abstract
Acknowledgements
Contents
1. Introduction
1.1 The World Wide Web
1.2 Directories
1.3 Search Engines
2. Literature review
2.1 Background to search engines on the Web
2.2 Link Structure
2.3 The Invisible Web
2.4 Evaluating search engines
2.5 Searcher behaviour
2.6 Site Content
2.7 Architecture
2.8 Dynamic content
2.9 Spam
2.9.1 Text Spam
2.9.2 Redirects
2.9.3 Cloaking
2.9.4 Link Spam
3. Research Methodology
3.1 Query set
3.2 Keyword density analysis
3.3 Limitations
4. Presentation and discussion of results
4.1 GoRank Analysis Results
4.2 Statistical Analysis
4.2.1 Frequency Analysis
4.2.2 Spearman’s Rank Correlation
4.2.3 Regression Analysis
4.3 Summary
5. Conclusions and further research
Appendix: Keyword Density Analysis Results
Bibliography

16,388 Words

1. Introduction

This dissertation will present a study of ‘search engine optimisation’ (SEO): the practice of modifying or tuning web pages so that they will rank higher in search engine results pages (SERPs) for particular queries, known as ‘keywords’ or key phrases. It will consist of a summary of search engine optimisation techniques and a review of literature on the subject, both academic and commercial. The importance of keyword density for rankings in Google, the leading search engine at the time of writing, will be investigated with the use of experimental data.

1.1 The World Wide Web

The World Wide Web is a vast collection of information. In the ten or so years since it originated it has grown into a huge, chaotic mass of heterogeneous documents. Because it has developed organically as the number of Web users has grown, there is little organisation, and indeed it is this very freedom and flexibility which enables many of the Web’s benefits – virtually anyone can use it, and anyone with a modicum of knowledge can publish on it.

The number of people with web access is growing very rapidly, and is currently estimated at over 700 million. Google alone indexes well over 4 billion web pages at the time of writing, according to its homepage. Of course the very decentralised nature of the World Wide Web makes it impossible to measure these figures exactly, but the sheer scale of the document collection clearly creates a difficult information retrieval problem.

Because of the diverse and unstructured nature of the web, it would be almost impossible to locate relevant information through the medium without the aid of some sort of tool. The main tools which have grown up to service these needs are directories and search engines.

1.2 Directories

Directories are categorised collections of hyperlinks that list websites (rather than specific web pages) by topic. These usually follow a library-style hierarchical model, in which broad high-level categories contain smaller, more specific categories, and the user must navigate or ‘drill down’ to the specific subject they are interested in. These directories are frequently compiled and edited by humans, who review websites that are submitted by their owners and decide in what category, if any, the site will be listed. Most directories have some sort of editorial policy and do not accept all site submissions.

Because of the human input involved in running a directory, there is an effective limit on how comprehensive it can be – there are simply too many websites to list them all. Thus, in information retrieval terms, the reach of a directory is typically quite low, as the document collection consists only of those sites that have been submitted and approved. Even Yahoo! – formerly the leading directory and portal – has de-emphasised its directory in favour of search. At its height, the Yahoo! directory had approximately 1,500 human editors. Now it employs just 20. (Risvik, 2004)

1.3 Search Engines

Search engines take a different approach to the problem of finding information on the web. They are more like information retrieval systems in that the user inputs a query, and a list of web pages is returned which are judged to be relevant to that query. These are generally sorted by a mathematical algorithm or ranking function, to present the pages which are most likely to be relevant first. Because they are automated, and each page does not need to be viewed by an individual, the reach of search engines is typically much wider than directories as there are not so many limitations. In addition, because they index the full text of documents, search engines are less reliant on structured and reliable metadata. Search engines are concerned with individual web pages, rather than whole websites.

Because of the huge commercial potential of the web, an entire industry has grown up around the concept of trying to improve a website’s rankings in search engine results. Everyone wants to be at the top of the search results for queries relating to their product or service. Web users typically do not look beyond the first 2 or 3 pages of search results, and a significant proportion does not look beyond the first page.

This ‘search engine optimisation’, as it is known, incorporates a wide variety of techniques, some of which have been viewed as inappropriate by the search engines. As the search engines have grown wise to these tricks, and adapted to overcome them, new techniques have evolved at an impressive rate. This has led to a sort of adversarial system, with webmasters trying to manipulate their rankings and the search engines trying to maintain the ‘purity’ of their results.

Much of this adversarial attitude has come about through the actions of people trying to misuse the search engines, by appearing in results for which their sites are not relevant. This is known as ‘spam’. Early spamming techniques involved such tactics as putting highly popular query terms such as ‘sex’ or the names of famous celebrities in invisible text such as meta-keyword tags on pages. This became such a problem that most major search engines no longer use the keyword tags – which were originally intended to provide useful indexing terms – for judging the relevance of web pages.

The factors that determine ranking in the major search engines can be divided into two main categories – those that relate to the document itself (on-page factors), and those that relate to inbound link popularity (off-page factors). Of these, only the former is fully under the control of the site developer, and so it is this area with which the SEO industry has traditionally been most concerned. More recently, in particular since the rise of Google to its current leading position with its reliance on PageRank, there has been an emphasis on attempting to influence off-page factors as well, for example through link building campaigns.

On-page factors are primarily concerned with trying to increase the apparent relevance of a web page to a given set of query terms, known as keywords or key-phrases. The use of the term keywords does not mean meta-keyword HTML tags in this context, although these terms should probably appear in those tags. Websites and search engine rankings do not exist in a vacuum. Many other factors have an impact on a site’s position which are not directly concerned with the site itself. These might include the number of competing sites, the degree of specialisation or ‘generic-ness’ of the subject of the sites, and other factors. One of the most important areas of consideration, and certainly the one which has received the most attention from both the academic and search optimisation communities, is the link structure of the web itself, and the effect it has on how easily search engines (and users) can find web pages, and how they evaluate them when they do.

Many of the techniques employed in search optimisation have been judged to cross the line into spamming by the search engines, and web pages found to be employing them can find themselves penalised in the rankings or even banned from the indexes. Thus these techniques may provide a short-term gain in rankings, but are likely to be counterproductive in the longer term. The search engines rightly feel that in order to provide a beneficial service to users, and by extension their business partners and advertisers, they must stamp out spam. Because of this, a new trend towards more ethical optimising techniques, which may provide a longer-term advantage, has evolved.

The message from the search engines themselves seems to be that what is good for the search engines is also what is good for the users – that is, web pages with plenty of useful and relevant content on a clearly defined subject, and good website architecture and navigation schemes which are easy to follow, both by human users and by search engine robots. This would make sense, as what both search engines and websites want to achieve is to deliver pages which are beneficial to users or, in the case of commercial sites, which convert visitors to customers – something that seems unlikely if those visitors have arrived at an irrelevant site due to search engine spamming. This does not stop people trying to spam the engines, however.

The development of spam techniques is characterised by a kind of evolving leapfrog approach – the spammers develop a new technique, and the search engines respond by revising their ranking algorithms to invalidate it. The spammers develop a workaround to the change or a new technique, and the search engines again respond. This cycle has continued to occur, with spamming techniques becoming ever more sophisticated. It seems likely that this ‘arms race’ will continue for the foreseeable future.

This continual evolution of the search engines of course has implications for this study of search optimisation techniques – it is likely that many of the techniques discussed here will be rendered obsolete as the search engines modify their methods for finding and ranking web pages.

This project consists of two parts. An extensive literature review explores a variety of themes surrounding search engines in general, and search engine optimisation in particular. Following this, an experiment involving keyword density analysis of a sample of nearly 300 web pages is presented. Statistical analysis reveals the contribution made by a number of factors (including document length, keyword density and PageRank) to a web page’s search rankings in the Google search engine.

2. Literature review

A great deal has been written about web search engines, in both the academic literature and the more practitioner-based commercial press. There is a distinct difference, however, in the focus of these different literature collections. The literature can be divided according to point of view: that which deals with the search engines themselves, which tends to take an information retrieval approach; that which deals with the behaviour of searchers, which is concerned with issues such as information seeking models and human-computer interaction; and that which is concerned with the websites indexed by the search engines and how they can benefit from this. There is very little academic material concerned with the last of these three viewpoints, but it is probably the greatest focus of the non-academic literature.

It is this latter theme which is of most relevance to this dissertation, which is focused on the ways in which website owners can best take advantage of the opportunities presented by the search engines. There is almost no academic writing related to this area, and so the literature reviewed here consists mainly of material from a variety of sources including how-to books and magazine articles, online advice columns and forum and newsgroup posts. To place this material in context however, a number of more academic sources are reviewed, in particular with regard to link analysis and the Google PageRank algorithm, and methods of assessing different search engines.

Whereas academic journal articles and conference papers typically go through a stringent peer review process before publication, this is not generally the case with many of these other sources, and they should therefore be approached with a healthy dose of caution as to their authenticity and veracity. This is particularly important with forum posts where the author’s identity is not clear, although by their nature such posts tend to attract comments from other users in a manner somewhat analogous to an informal peer review process.

As well as these written and online sources, this section discusses material from two conferences which were attended during the course of the project. The first of these was Search Engine Strategies, run by Jupiter Research in London in June 2004, a conference of the search engine optimisation industry. Speakers included prominent SEO professionals, as well as representatives from several major search engines including Google and Yahoo! The second was more academic: the ACM SIGIR 2004 conference, held in Sheffield in July 2004.

2.1 Background to search engines on the Web

The first group of writings identified are those which give an overview of or background information on search technology, such as Arasu et al. (2001). This paper, for example, gives an explanation of how a crawler-based search engine works and then goes on to discuss the different ways in which relevance rankings can be made. The authors draw attention to the fact that many of the information retrieval techniques developed for much smaller, more structured document collections must be extensively modified if they are to scale successfully to such a large, heterogeneous environment as the web.

The World Wide Web is undoubtedly the largest document collection ever devised, and because of its widely distributed nature its size is very difficult to measure. Nevertheless, a number of attempts to do so have been made throughout the last decade, both by academics and by market research organisations. (Lawrence & Giles, 1998) Obviously, with the web growing at such a spectacular rate, these studies date very rapidly. Related to these are discussions of what proportion of the web is covered by the search engines.

Thelwall (2000a, 2001a) is one of the few academics to consider the importance of search from the point of view of the commercial realities of the web, discussing how in the ever expanding online world it is of vital commercial importance for companies’ websites to be easily found by potential customers, and discussing the role of the search engines in this. He highlights the relatively low coverage provided by the major search engines.

Henzinger et al. (2002) give an overview of the different challenges facing web search engines. They define and give examples of these challenges, including spam, the difficulty of evaluating the quality of content, the semi-structured nature of web data and the difficulties caused by dynamic or duplicate data.

Other writers such as Brewer (2001) look to the future rather than the past, seeking new directions for search technologies, in particular the application of web style search engines to the desktop for personal information management.

2.2 Link Structure

The link structure of the web is central to modern search engine design. It is the very interconnectedness of hyperlinked web pages that allows crawler-based search engines to function at all. The numbers of incoming and outgoing links to a given web page are used to characterise pages as either authorities on a particular subject – those pages to which many other related pages contain links – or hubs – those pages which link to many authorities. (Kleinberg, 1998; Gibson et al. 1998) It is these hubs and authorities which make up the core of the web. Broder et al. (2000) provide a model of the link structure of the web as a whole:

[Figure: the ‘bow tie’ model of the link structure of the web. Source: Broder et al. (2000)]

Essentially this model suggests that the web is made up of different types of sites: standalone sites with few or no links in or out, which are almost impossible to locate and which may include personal homepages; sites with few or no incoming links, which are hard to locate; sites with few or no outgoing links, which are a dead end for users and search engines; and sites which are fully integrated into the web, sharing both incoming and outgoing links. The ideal position to be in, from a search engine’s point of view, is in the central area – the Strongly Connected Component (SCC) or ‘knot’ of the bow tie.

Brin & Page (1998) take the authority concept a stage further in their PageRank algorithm, which catapulted Google to its position of pre-eminence in the search engine world. PageRank attempts to compute an authority score for each document in the index. This is made up of the number of incoming links, adjusted to take into the account of the PageRank (PR) of each of those pages. Each page ‘inherits’ a proportion of the PR of each of the pages which link to it, and passes on a fraction of its own PR to each page it links to, divided equally amongst those pages: PR(A) = (1-d) + d(PR(t1)/C(t1) + …… + PR(tn)/C(tn)) Where: PR = PageRank A = Web Page A d = a constant used as a damping factor, set to perhaps 0.85 t1…tn = Pages which link to page A/ C = the number of outbound links from page tn

This idea springs from the concept that the quality of a given document can be assessed by looking at the number and quality of documents that link to it. This is known as citation analysis, and has previously been used to identify the key research papers that have been highly influential in various academic fields.

PageRank is useful because it is more resistant to spam than purely textual ranking systems, since it relies on the co-operation of others in order to work properly. It relies on the simple assertion that what other people say about a page is inherently more reliable than what the owner of that page says, as information on the web is not necessarily trustworthy (Graham & Metaxas, 2003). It is also more spam resistant than a pure link popularity count, because it requires incoming links to be from sources which are themselves authoritative in order to give much of a boost to a page – simply building a network of ‘dummy’ domains to link to a site will be of limited value. (Sullivan, 2004) Some variation of the concept is now incorporated into the ranking processes of all the major search engines to a greater or lesser extent.

Perhaps because it is the only one of the major search engines to have published research papers regarding its algorithm, Google’s PageRank has been the subject of a number of articles and papers. A variety of evolutions and refinements have been proposed. (Langville & Meyer, 2004; Thelwall & Vaughan, 2004) Clausen has called its spam resistance into question. (2003a, 2003b)

Inbound links are important for a variety of reasons: they allow the crawlers to find the pages in the first place, the number and authority of the links contributes to the page’s PageRank, and the anchor text (link text) is viewed as an independent description of what the page is about. This last point may mean that a page may be returned in the results for a query even if it does not contain the query terms, provided enough pages link to it with anchor text containing words in the query. This can be perfectly illustrated by typing the query “click here” into Google. The number one result is the download page for Adobe Acrobat Reader, although the phrase “click here” does not appear anywhere on the page.
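The way anchor text acts as an external description of a target page can be sketched with a short Python example. It uses the standard library’s HTMLParser to collect the text of each link in a page, associating it with the destination URL; the sample HTML and URL are purely hypothetical, and a real search engine would of course aggregate this text across the whole web.

    # Illustrative sketch: collecting anchor text for each link target.
    # A search engine can index this text against the *destination* page.
    from html.parser import HTMLParser

    class AnchorTextCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.current_href = None
            self.anchors = {}  # destination URL -> list of anchor texts

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.current_href = dict(attrs).get("href")

        def handle_data(self, data):
            if self.current_href and data.strip():
                self.anchors.setdefault(self.current_href, []).append(data.strip())

        def handle_endtag(self, tag):
            if tag == "a":
                self.current_href = None

    collector = AnchorTextCollector()
    collector.feed('<p>Get the reader: <a href="http://example.com/reader">click here</a></p>')
    print(collector.anchors)  # {'http://example.com/reader': ['click here']}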

Thus those linking to a page have considerable influence on what queries it will rank well for. This can be used for malicious means – a Google search on the phrase “miserable failure” returns the official White House biography of George W. Bush as the number one result (although it obviously does not contain the phrase) due to a malicious link campaign, for example. In order to rank well for particular key phrases then, it is useful if incoming links use those phrases in the anchor text. Obviously a webmaster usually has little control over who links to the site or how, but it is now common to begin a link building campaign – contacting other site owners and asking them to link to the site. Commonly these links are ‘reciprocal’ – site A links to site B in return for a link from B to A.

Because link popularity is such an important factor in the rankings of websites in modern search engines, it is inevitable that ways around it would be developed. Thelwall (2001) discusses the evolution of hyperlinks into a form of currency. Sites with high PageRank are often known to charge for links, and link-building campaigns are now part of the search engine marketer’s arsenal.

Cowan (2004) contends that not all links are created equal. Neither are they uni-dimensional, in that a number of factors feed into the ranking algorithm to determine the importance of the link and ultimately its contribution to ranking. A useful device for showing this pictorially is the radar graph (Cowan, 2004) where the various attributes’ contributions can be seen. Generally speaking the greater the surface area covered, the higher the quality of the link as rated by the search engines.

[Figure: radar graph of link effectiveness, rating a link from 0 to 10 on attributes such as on-page thematic relevance, on-page placement, relevance of anchor text, expertise, parental authority and context, third-party authority and context, and honesty/integrity. Source: Cowan (2004)]

Metrics may differ depending on the web page in question, so for example, an industry portal site would have a very different profile to that of a reciprocal link on an unrelated page:

[Figure: radar graph for a link on an industry portal site. Source: Cowan (2004)]

[Figure: radar graph for a reciprocated link on a resources page. Source: Cowan (2004)]

Cowan proposes this graph as a method of evaluating potential linking partners to see whether or not they will provide any significant benefit.

2.3 The Invisible Web

The invisible or ‘deep web’ (Sherman & Price, 2001; Bergman, 2001) refers to the large proportion of the World Wide Web which is not covered by the search engines, and therefore inaccessible to the majority of users. There are a number of reasons why web pages may fall into this category. Many websites, in particular newer sites and individual personal homepages, have no incoming links at all, and therefore will never be found by search engine robots unless they are submitted. Understandably, the usual methods of ranking such as link popularity and PageRank do not perform well when applied to these sections of the web. (Eiron et al., 2004)

A large number of pages are specifically excluded from search engines by their owners, using means such as the Robot Exclusion Protocol. (Thurow, 2003) These include proprietary or members-only content which is not meant to be available to the general public, such as subscription services and corporate intranets. In addition, dynamically generated pages (which draw information from databases) can be very difficult to index, due to the fact that they often have complex URLs with large numbers of parameters.

2.4 Evaluating search engines

A common theme in the academic literature is the evaluation of search engines in terms of coverage, precision and recall. Oppenheim et al. (2000) point out the difficulty of measuring recall of web search engines, since the evaluators cannot know the number or type of relevant documents in the way that they could with more controlled collections.

Soboroff et al. (2001) discuss a methodology to judge the effectiveness of information retrieval systems without relevance judgements, using documents randomly assigned as ‘pseudo-relevant.’ Soboroff’s work was carried out with more traditional IR systems using the TREC collections, but Haddon (2001) extends this work to evaluate web search engines, and finds the methodology to perform rather well – better even than Soboroff’s original results. In fact, Soboroff himself discusses the extent to which the TREC web collections can be said to mimic the structure of the web in his later work (2002).

2.5 Searcher behaviour

Another trend in the literature is the study of how users actually use search engines. Ford et al. (2001) point out the differences between the searching strategies of trained researchers with an IR background and those adopted by the typical web user. Whereas IR professionals typically include more than six query terms and make use of Boolean operators, the typical web searcher uses only two or three simple search terms. They find that for many users simple ‘best match’ queries are more effective than more complex Boolean ones. Eastman & Jansen (2003) confirm this for mass-market search engines, suggesting that the majority of users should not be expected to use complex search strategies.

Studies have been conducted which analyse large samples of the query logs of search engines to discover what kinds of things people search for, in what numbers, and how they go about it. (Silverstein et al., 1999) Beitzel et al. (2004) bring this approach up to date and take it one stage further by conducting the analysis on an hourly basis to determine differences in query behaviour at different times of the day or night.

Other work emphasises the importance of understanding what users actually want from search engines, which may not necessarily be reflected in the search strategies of naïve users in particular. Brandt and Uden (2003) borrow from the field of human-computer interaction (HCI) and indicate that searchers may have incomplete or faulty mental models, which will lead to ineffective searching strategies. Broder (2002) and Rose and Levinson (2004) look at the information needs of web searchers and divide their queries into three broad categories:
• Navigational – where the user is looking to locate a particular website.
• Informational – where the user wishes to learn about a particular topic.
• Transactional or Resource – where the user is looking for particular products or resources such as music downloads, currency converters etc.

This concentration on the users has important implications for web site optimisation. Understanding what users are looking for and what they type to try to find it is very important to keyword research in particular. Without this understanding, any attempt at optimisation will likely fail.

Of particular importance is ensuring that the website actually contains the words which the intended audience will use to find it (or at least that incoming links to the site contain those words). The writers of the site’s content will need to think about the intended audience, and the terminology they are likely to use to search. There is little point in optimising a page for the company’s name – unless that name is a major brand – as potential users are unlikely to search by the name of the firm unless they are already familiar with it. A much better strategy is to focus on content reflecting the products or services with which the firm deals. Visitors that reach the site through searching for these items are highly likely to convert to customers, as they have already expressed a demand for the product or service by making the search. (Hansell, 2004)

It is important to think about the language customers are likely to use – if the intended audience is unlikely to have much domain knowledge, for example, the page should be relatively jargon free and use generic layman’s terminology where possible. Both Google and Overture (the firm which provides paid inclusion and pay-per-click advertising to several search engines, including its parent company Yahoo!) have tools that suggest possible keyword alternatives to assist in this process.

In addition, there are a number of third-party keyword research tools such as WordTracker, which identifies conceptually related search terms. Mindel (2004) divides these related terms into vertically similar keywords (those which contain the original term, such as football, football boots, football match, football player etc.) and horizontally similar keywords (those which are conceptually related but do not contain the original term, such as goalkeeper, Sheffield United, FA Cup etc.). Although these tools are primarily intended to assist in targeting paid advertising, they are also useful for targeting keywords for organic rankings (the natural search results).


2.6 Site Content

The overriding theme that comes through from the majority of literature around search engine optimisation is the importance of relevant and unique content. Search engines are essentially text indexing and retrieval systems, and so in this context, content means textual information presented in a machine-readable form. Although search engines make extensive use of many other factors in determining and ranking the relevance of pages to a given query, the central consideration must be the words on the page. If the query terms are not present, or the search engines cannot read them for some reason, the page is unlikely to be returned for a given query. There are exceptions, however – if enough other pages link to the page using the query terms in and around the anchor text, the search engines can use this to infer the meaning of the page’s contents.

As we have seen, optimising the textual content of the site consists of ensuring that pages contain the key words and phrases that the site’s audience are likely to use as search terms. These key phrases are determined through keyword research, making use of resources such as the keyword suggestion tools offered by Google and Overture as well as more sophisticated tools such as WordTracker. Charon Matthew (2004) recommends that each page should be optimised for no more than one or two key phrases. In other words, each page should deal with only one or two subjects. This makes sense intuitively – if a page deals with many different subjects it is difficult to judge which one is the most important, and therefore what the document is about.

Jill Whalen (2004) has written and spoken extensively about optimising textual content on the web. She emphasises the importance of optimising for key phrases, not key words. A single word may have many different meanings, and without additional contextual clues it is difficult to determine the most relevant. Any text that is optimised for key phrases, however, will by definition be optimised for the individual key words as well, although the reverse is not necessarily true. For example, a document which talks about ‘computer games’ gives more information about its meaning than one which just mentions ‘games’, which could just as easily refer to board games or the Olympic Games.

Often a word or phrase can be expressed in more than one way, such as ‘optimisation’ (the English spelling) and ‘optimization’ (the American spelling), or ‘ecommerce’ and ‘e-commerce’. Since users may search using either or both of these versions, it may be necessary to include both of them in the site. However, Whalen advises against mixing both on the same page for two reasons. Firstly, as the two versions are likely to be indexed as separate terms, it reduces the density of each term on the page. Secondly, and perhaps more importantly, it may appear to real users as though the writer doesn’t know how to spell or proof-read. A better strategy is to use one version on some pages and the other version on other pages.

Whalen, Matthew and Thurow all agree that it is important for the key terms to appear in a number of specific areas of the document. If the phrase occurs in the title of the page, in its description, in headings, and several times in the body text – particularly in the first couple of paragraphs near the start of the page, as well as in and around outgoing links – it gives the search engine (and the user) a strong clue that that particular phrase accurately represents the most important content of the page. The document is therefore likely to be regarded as highly relevant if that phrase is then used as a query term. Matthew (2004) suggests a rule of thumb that a key phrase should ideally be mentioned approximately four times per 250 words of page content.
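This rule of thumb can be expressed as a simple keyword density calculation. The Python sketch below, a rough illustration rather than the GoRank tool used in the experiment described later, counts occurrences of a key phrase in a page’s visible text and reports its density as a percentage of the total word count; the sample text and phrase are hypothetical.

    # Illustrative keyword density calculation (not the tool used in the study).
    import re

    def keyword_density(text, phrase):
        words = re.findall(r"[a-z0-9']+", text.lower())
        phrase_words = phrase.lower().split()
        total = len(words)
        # Count occurrences of the phrase as a run of consecutive words.
        hits = sum(1 for i in range(total - len(phrase_words) + 1)
                   if words[i:i + len(phrase_words)] == phrase_words)
        density = 100.0 * hits * len(phrase_words) / total if total else 0.0
        return hits, density

    sample = "Football boots for sale. Our football boots suit every player."
    print(keyword_density(sample, "football boots"))  # -> (2, 40.0)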

As Whalen points out, many web pages that appear to contain a great deal of text actually do not. It is common for web designers to use graphical images of text – in particular for headlines and titles – in order to allow them to make use of special fonts and text effects. Occasionally, entire pages are rendered as a single large graphic image, rather than as discrete text and graphical elements. This is most commonly the case with ‘splash pages’ - introductory pages to a site, where the user must click on a link to enter the site proper. Search engine spiders cannot read this graphical text, and so they are unable to index the content effectively.

Alternative text in the image ‘alt’ tags can go some way to alleviate this, but alt text is weighted less heavily than visible body text by the search engines, because it presents an opportunity for keyword stuffing. This presents a real difficulty for graphical text, especially since it is most commonly used for headings and titles – the very areas that summarise the contents of the page, and would therefore be the ideal place to utilise key words and phrases. Cascading style sheets (CSS) can come to the rescue in this situation however, as they give the web designer much greater control over text layout and formatting than HTML, so that interesting visual designs are possible without resorting to graphical text. (Thurow 2004)

Lloyd-Martin (2004) gives general advice regarding the choice of key phrases, and goes on to suggest that pages should ideally contain around 250 words. This appears contrary to some advice about copywriting for the web, which tends to emphasise the need for brevity to improve readability, suggesting that users don’t like to have to scroll down. She suggests ways to break this up to make it easier to read, such as using short, focussed paragraphs, subheadings and bullet points. The most important information should be at the top of the page, with greater detail below, much as in a tabloid newspaper article.

2.7 Architecture

Another area, which is heavily discussed in relation to search engine performance, is site architecture. Perkins (2004) talks in terms of both technical architecture and information architecture. Technical architecture is represented by the hardware and software, servers, databases etc. The information architecture concerns the organisation, content, navigation and search systems of the site.

Site architecture can have a dramatic impact on search results. Many web technologies can make it more or less difficult for search engines to crawl and index sites properly. For example, sites built using frames often do not index correctly. They can cause problems for crawlers because they “don’t fit the conceptual model of the web (every page corresponds to a single URL).” (Google, 2004) A frameset will have one URL for multiple files, and this can make it difficult to link to them properly, both for search engines and users.


The Robot Exclusion Protocol is used to stop crawlers probing unauthorised areas of the site, and password protection obviously also accomplishes this in a different way. There are several meta-tags that can be used to instruct robots how to behave on a page, such as meta-noindex and meta-nofollow, which instruct the crawler not to index the page or follow its links respectively. There are other similar tags such as meta-revisit, which is supposed to instruct the spider to return more frequently, but the search engines largely ignore these. (Cutts, 2004) The HTTPS protocol, however, does not present a barrier in itself, so other methods such as robots.txt must still be used.
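As an illustration of how a well-behaved crawler consults the Robot Exclusion Protocol before fetching a page, the following Python sketch uses the standard library’s robotparser module; the site URL and user-agent name are hypothetical, and real crawlers implement this logic at much larger scale.

    # Illustrative check of a (hypothetical) site's robots.txt before crawling.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # fetch and parse the robots.txt file

    # A polite spider only requests the page if the rules allow it.
    if rp.can_fetch("ExampleSpider", "http://www.example.com/cgi-bin/search.cgi"):
        print("Allowed to crawl this URL")
    else:
        print("Disallowed by the Robot Exclusion Protocol")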

If for any reason a website or page must be moved, it is necessary to make use of redirects in order to preserve rankings in search results. The only ones acceptable to most search engines are 301 permanent redirects and, in some cases, 302 temporary ones. (Cutts, 2004)
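One simple way to confirm that a moved page really does return a 301 rather than a temporary or scripted redirect is to inspect the HTTP status code directly. The sketch below does this with Python’s standard http.client module; the URL is hypothetical, and the sketch only inspects the first response rather than following a chain of redirects.

    # Illustrative check of the status code returned for a moved page.
    import http.client
    from urllib.parse import urlparse

    def redirect_status(url):
        parts = urlparse(url)
        conn = http.client.HTTPConnection(parts.netloc)
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        # 301 = moved permanently, 302 = found (temporary)
        return response.status, response.getheader("Location")

    print(redirect_status("http://www.example.com/old-page.html"))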

Put simply, search engines are basically text retrieval systems. Site architecture includes URL structure, navigation scheme and directory structure. Navigation schemes can consist of a range of options, from text-based links (the most search engine friendly), through button images, image maps, JavaScript and DHTML menus, to menus and buttons embedded in Flash movie files. Each of these different systems has different problems associated with it as far as search engines are concerned. If one of these navigation schemes is employed, it is important to give a text alternative. (Thurow, 2004a) Search engines basically do three things:
• Index text
• Follow links
• Measure popularity

According to Thurow, web pages come in various types according to their function, such as news, products, services and categories, advertising or brochure pages, forms, shopping carts and ‘credibility pages’ such as terms and conditions, about us pages, privacy policy etc. She suggests that all sites should have: home, about us (the best page to optimise for the company name), content (product/service etc.), help/FAQ, contact details, a sitemap, and a contingency 404 page which enables navigation back to the rest of the site in the event of a broken link.

The navigation structure should be such that all web pages indicate where the user is in the site, which pages have been visited, and which have not yet been visited (e.g. by changing link colour). This allows a crawler (or a user) entering any page on the site from an external link to navigate around the rest of the site with a minimum of difficulty. Cross-linking is important within the site. Cross links fall into two broad categories – vertical, as characterised by ‘breadcrumb trails’, and horizontal, as when a press release mentioning a new product links to the page describing that product.

Search engines generally like the use of CSS style sheets for formatting, for a number of reasons. Firstly, they strip formatting information out of the HTML, reducing the amount of unnecessary code that spiders must wade through. In addition, it is possible to use CSS to mimic many other web design technologies. A rollover button can be built using a text link with a rollover background image in CSS, instead of an image button using a JavaScript rollover. The first of these would be indexed as a text link and would be preferred by most search engines. (Thurow, 2004a)

In summary, Thurow, for example, ranks users, content and usability above search engine optimisation, as she and others feel that in the end these considerations will cause the website to rise up the rankings and transcend any technical adjustments that can be made. After all, search engines aren’t going to spend money on your site!

Information architecture also includes the directory and page structure of the site, and there are various layouts to suit particular situations. For example, a tutorial or e-learning site might adopt a linear structure, where the user passes in one direction from each page to the next. Perkins (2004) recommends the simple hierarchical tree structure, whereby more important or general information is placed towards the top of the hierarchy – nearer the homepage – and more specialised information is placed further down the hierarchy. A matrix structure combines a hierarchical subdivision with heavy cross-linking between pages. This provides excellent connectivity, but may prove difficult to manage; any changes to the site will be more likely to result in dead or broken links.

Linear Structure: Limited usefulness for most applications. Most commonly associated with e-learning facilities.

Hierarchical structure: Logical categorisation of subjects good for both users and crawlers, even on large sites.

Heavily cross-linked structure: Provides good connectivity, but on larger sites may become difficult to manage, increasing the likelihood of broken links.


This hierarchical information structure can suggest a taxonomy of information, which allows the logical navigation of the site. Each page can include a ‘breadcrumb trail’ of links back up the hierarchy. If the categorisation is well chosen, these breadcrumb links will naturally contain keywords, for example on a car dealership website:
• Home > Cars > Ford > Escort > XR3i

Any of these terms (with the exception of ‘home’) are likely search terms for users trying to find information for which the site is relevant, and so their presence in link text should give a boost in ranking as well as improving usability.
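A breadcrumb trail of this kind can be generated automatically from the page’s position in the hierarchy. The following Python sketch builds such a trail from a hypothetical directory-style URL path, so that the keyword-bearing category names appear as link text; the site root and path are illustrative assumptions.

    # Illustrative breadcrumb builder for a hierarchical site structure.
    def breadcrumb(path, site_root="http://www.example.com"):
        """Turn a path such as /cars/ford/escort/xr3i/ into a trail of links."""
        links = ['<a href="%s/">Home</a>' % site_root]
        segments = [s for s in path.strip("/").split("/") if s]
        for i, segment in enumerate(segments):
            href = "%s/%s/" % (site_root, "/".join(segments[:i + 1]))
            label = segment.replace("-", " ").capitalize()
            links.append('<a href="%s">%s</a>' % (href, label))
        # "&gt;" renders as ">" between the links in the HTML output.
        return " &gt; ".join(links)

    print(breadcrumb("/cars/ford/escort/xr3i/"))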

Other advantages of the hierarchical structure include ease of management – if a subject area needs to be updated or modified, for example, this can be done with relatively little impact on the rest of the site. A hierarchical structure can also make management of security easier. For example, the robots.txt file can be used to disallow whole sections of the site that should not be crawled, indexed or linked to by search engines, such as the cgi-bin directory where scripts may be contained.

Web server technologies can make life difficult for search engines. Alan Perkins (2004) suggests a rule of thumb that the more expensive the server technology the more problems it causes for crawlers. For example highly complex and expensive content management systems such as IBM Websphere can be almost impossible to index because absolutely all the content is held in a database, with no static, crawlable text. Search engines appear to be extremely robust when it comes to more conventional web server architecture, for instance they generally do not discriminate between the open source server software Apache and Microsoft’s Internet Information Server (IIS). They also appear to be tolerant and patient of slow and old servers and can accommodate a wide range of file types according to Google’s Matt Cutts (2004).

Mirrors and load balancing systems can cause significant problems for a number of reasons. These types of systems present two or more copies of the same site on different servers in order to reduce the load on individual servers when many concurrent users access the site. This is commonly the case on sites that have resources such as software or music files available for download, and are therefore likely to consume a lot of hardware resources and network bandwidth. Cho et al. (2000) discuss methods of detecting and dealing with replicated content on the web. A mirror or load-balancing system might use URLs similar to the following, with the same content on each URL:

• http://www1.domain.com/
• http://www2.domain.com/

The difficulties caused by these types of systems include the problem of measuring incoming links. Other sites are likely to link to the more usual address http://www.domain.com, and the page request will be directed to whichever of the servers has the most spare capacity at the time. This will mean that neither site has many incoming links to build PageRank or other link popularity metrics. In addition, most search engines do not want to fill their databases with duplicate content – they will usually detect this and index only one or other of the sites, or in some cases neither.
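The duplicate detection mentioned here can be illustrated with a very simple content fingerprinting sketch in Python: pages whose normalised text produces the same hash are treated as copies, and only one is kept. Real search engines use far more sophisticated near-duplicate techniques (see Cho et al., 2000); the URLs and page contents below are hypothetical.

    # Illustrative exact-duplicate detection by hashing normalised page text.
    import hashlib
    import re

    def fingerprint(html_text):
        # Strip tags and collapse whitespace before hashing.
        text = re.sub(r"<[^>]+>", " ", html_text)
        text = " ".join(text.lower().split())
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    pages = {
        "http://www1.domain.com/": "<html><body>Our product range</body></html>",
        "http://www2.domain.com/": "<html><body>Our  product range </body></html>",
    }
    seen = {}
    for url, html in pages.items():
        # Index the page only if an identical copy has not been seen already.
        seen.setdefault(fingerprint(html), url)
    print(list(seen.values()))  # only one of the two mirror URLs is kept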

In general, Perkins suggests, the information architecture of the site – that is, the directories, the link text etc. – should support the navigation process and accurately represent the content. In particular, given that for a given query, the most relevant pages are likely to be the most specialised and detailed, it is important to ensure that the site architecture enables deep-linking. If users are searching for your products, but your site architecture doesn’t allow crawling and linking to your product pages, your site will not rank well in response to their queries. This emphasis on links enables the site to be bound into the core of the web.

2.8 Dynamic content

Search engines are designed primarily to crawl and index static HTML documents. Unfortunately, many web pages do not fit this mould, but instead are dynamically generated when a user requests specific information, typically from a database. The web server assembles the page before sending it to the browser. A number of types of dynamic page can prove very difficult for search engines to index correctly. This can present a problem for a great many websites such as product catalogues and those built with content management systems, where almost all the content is stored in a database.

The problems in indexing dynamic content primarily spring from the URL. Many dynamic pages include parameters in the URL of the page that indicate, for example, what options have been chosen by the user. These parameters are denoted by a sequence after the filename in the URL, beginning with a question mark (?). Many search engines have difficulty indexing pages with more than one or two parameters. Since two or more instances of the same dynamic page could be very different or very similar depending on the parameters in the URL, this can present a problem. Google generally does index pages with only one or two parameters, but pages with any more than this are unlikely to be indexed. (Cutts, 2004b)

One way in which this problem can be avoided is to use server side technologies to re-write the URL of the page to a more search engine friendly form before sending it to the browser. For example: http://www.domain.com/page.php?category=6789&productid=12345 might be rewritten to the static URL: http://www.domain.com/category6789/productid12345/page.php.

This type of operation is possible on most servers – in the Apache web-server this is called mod_rewrite – but can slow download times as there are more operations occurring on the server before the page is displayed. The true URL of the page is actually still the former, but when a browser or robot places a page request using the latter address, the server is configured to rewrite it into the correct format to query the database.
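The effect of such a rewrite rule can be sketched in Python. The function below shows the mapping the server performs: it takes a request for the ‘static-looking’ address and reconstructs the dynamic query-string form that the database-driven script actually expects. This is only an illustration of the idea; on an Apache server the equivalent mapping would normally be expressed as a mod_rewrite rule, and the URL pattern used here is a hypothetical example.

    # Illustrative mapping from a search-engine-friendly URL back to the
    # dynamic form that the server-side script actually expects.
    import re

    def rewrite(static_url):
        pattern = (r"^(?P<base>https?://[^/]+)/category(?P<cat>\d+)"
                   r"/productid(?P<pid>\d+)/page\.php$")
        match = re.match(pattern, static_url)
        if not match:
            return static_url  # no rule applies; pass the URL through unchanged
        return "%s/page.php?category=%s&productid=%s" % (
            match.group("base"), match.group("cat"), match.group("pid"))

    print(rewrite("http://www.domain.com/category6789/productid12345/page.php"))
    # -> http://www.domain.com/page.php?category=6789&productid=12345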

There are other ways in which dynamic information can be passed from page to page, rather than in the URL, but the URL method is useful because it allows the visitor to bookmark the page and return later, or to pass the link on to other people. The use of cookies is one such method. Many sites use cookies to maintain state, and to perform other tasks such as recognising returning users. Provided that cookies are not required to actually navigate the site, they should remain search engine friendly. If, however, the site will not work without them, it will be impossible for search engine robots to crawl and index the site. This is because spiders function as simple browsers, and do not accept cookies or any similar technology.

The problems presented by dynamic websites in terms of search engine positioning have been discussed at some length. Mikkel deMib Svendsen (2004) contends that dynamic sites in themselves are not necessarily a bad thing for search engines, but rather that poor design is. He argues that dynamic sites can in fact be optimised more effectively than their purely static counterparts, provided that sufficient consideration is given to the problems mentioned. He points out that for the major crawlers, the fact that the page contains dynamic content at the server side is not necessarily a problem, provided that the client receives valid HTML. Database driven content is still content, and provided it is presented to the visitor (whether human or robot) as HTML, it can be indexed easily.

Similarly, the presence of dynamic parameters or a question mark in the URL does not automatically hamper the search engines’ ability to index a page. The question mark simply indicates that the page makes use of some sort of template and contains dynamic content. Template pages may cause some problems because they break the relationship whereby one page is represented by one file (and one URL). Not all parameters are equal, however. Session IDs, click IDs and time-stamp parameters are especially problematic because they can create ‘spider traps’ of infinitely re-spawning pages containing the same content but with different URLs.

One particular type of dynamic URL that is especially problematic for search engines is that which includes a session ID in the address. This is a unique variable which is different each time a visitor accesses the site, and is used to maintain state during a particular visit. This is necessary because the HTTP protocol is naturally stateless: without some means of storing this information, the server is not aware that the user whose browser requests a page is the same user who requested the previous page. Session variables in the URL are one method of doing this. When the visitor first accesses the site, or logs in for example, they are assigned a session variable. This variable is passed each time they request a new page from the server, and their session can be tracked. Next time the visitor arrives at the website, they will have a different session ID.

Search engine spiders find this a problem for two reasons. Firstly, every time the spider attempts to re-crawl a given URL, that URL will no longer exist, as the session has expired. Either an error will occur, or a new session ID will be assigned; either way the page in the index will seem to have disappeared. Secondly, assuming the page can be spidered at all, a large collection of seemingly identical pages with different URLs will be built up in the search engine’s index. Search engines do not want their databases clogged with duplicate or inaccessible content, and therefore frequently implement algorithms which ignore pages whose URLs can be recognised as containing a session variable.

In order to avoid this problem then, it is preferable to use another method of tracking a user session if it is necessary to do so, such as cookies. This will keep the session ID out of the URL and ensure that the page is not unnecessarily duplicated in the search engines’ indexes.
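The kind of URL clean-up implied here, whether performed by the site itself or by a search engine trying to canonicalise addresses, can be illustrated with a short Python sketch that strips session-style parameters from a query string. The list of parameter names treated as session identifiers, and the example URL, are hypothetical assumptions.

    # Illustrative canonicalisation: remove session-style parameters from a URL.
    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}  # example list

    def strip_session_ids(url):
        parts = urlparse(url)
        params = [(k, v) for k, v in parse_qsl(parts.query)
                  if k.lower() not in SESSION_PARAMS]
        return urlunparse(parts._replace(query=urlencode(params)))

    print(strip_session_ids(
        "http://www.example.com/page.php?category=6789&PHPSESSID=a1b2c3d4"))
    # -> http://www.example.com/page.php?category=6789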

Solutions to these problems come in two forms – replicating the content on static pages for the benefit of search engines and rewriting the URL on the server into a static form before serving the page to the client. Static replication of content may be accomplished in a variety of ways and may be either carried out in real-time (as the database is queried, the server software generates a static HTML page which is then stored and can be retrieved at a later date if the query is repeated) or on a scheduled basis (in an online store site for example, static catalogue pages may be automatically generated at off-peak times, such as overnight).

[Figure omitted; source: Svendsen (2004)]

Even if a dynamic site is necessary, there is no reason why all the pages need to be dynamically generated. Information pages, which are likely to be easily optimised in their own right, can easily be included as static HTML, and tend to have plenty of textual content to index. It is also possible to include dynamic content in otherwise static pages, for example by means of ‘server side includes’. Dynamic content is easier to keep fresh, and sites that are updated regularly are likely to be spidered more often than sites that rarely change.

Svendsen cites forums and bulletin boards as examples of sites that contain dynamic content and are constantly updated, but contain lots of textual information to be indexed. They often rank very well in search engines for these very reasons.

2.9 Spam

Spam is the name given to attempts to fool the search engines into awarding higher rankings to websites and pages, or into regarding them as relevant when they are not. There are a wide variety of approaches to spam, and the field is constantly expanding and evolving as the search engines adapt to overcome particular techniques. Henzinger et al. (2002) divide spam into three types: text spam, link spam and cloaking. Perkins (2002) gives a much more detailed look at the different types of spam, categorising and defining them. He defines spam as anything that is done solely to increase search engine rankings. Anything that would be done if the search engines did not exist is not spam.

2.9.1 Text Spam

The fairly simple and logical premise that the more times the query terms appear on the page, the greater its relevance, led to a number of rather clumsy approaches to inflating rankings through spamming – in particular keyword stuffing using invisible text. These techniques took advantage of the fact that a search engine crawler would be able to read text that a real user would not – for example, text in meta-keywords and description tags, and text in the same colour as the background. In particular, a large number of adult and gambling sites used the most popular queries of the day – celebrity names such as ‘Jennifer Aniston’ – to attract traffic. Because this was quite clearly spam, it was cracked down on fairly quickly.

Search engines rapidly adapted to these tactics and built functions into their ranking algorithms to penalise a page if it contained text in the same colour as the background, or meta-keywords that did not reflect the contents of the page. Thus spammers would find themselves falling in the rankings, or sometimes excluded from the search engines’ indices entirely. A more modern extension of this technique is to include additional text in HTML <div> tags, and then use CSS style-sheet layers to render the text invisible to browsers, whilst crawlers would still view it.

Most major search engines have now adapted to accommodate this technique, which has implications for sites which legitimately use invisible layers to create DHTML menus which only become visible when the user’s mouse passes over and so on. This abuse of useful web development technologies and techniques has led to a situation where even well behaved web designers must be careful not to work in a way which could be mistaken for spam, and has led to the bizarre situation where meta-keyword tags are not used by the majority of search engines to judge relevance, despite that being the specific function of that particular tag – to provide a list of indexing terms to describe the page.


Although most of the main search engines look for broadly the same things when ranking pages, they vary in which elements receive the greatest weighting in the ranking function: one engine might favour link text over heading tags, for example, or vice versa. Doorway pages or gateway pages are web pages designed in such a way as to score highly in the rankings of a particular search engine. Frequently a site will have several doorway pages, each of which directs the user to the site’s main page. Often these pages are largely identical, but have been tweaked in terms of layout and keyword frequency to perform better in their specific target search engine. Some doorway pages are nothing more than a collection of keywords in different HTML tags to make up the requisite keyword density. Thurow contrasts doorway pages with information pages – pages designed to be read by humans, containing useful keyword-rich content such as help pages, frequently asked questions (FAQ) pages and tutorials.

2.9.2 Redirects

Another technique frequently used by spammers is to optimise pages for particular queries, and then, a fraction of a second after the page starts to load, redirect the user’s browser to another page, or to another site entirely. This is often used in conjunction with doorway pages – the search engine spiders the doorway page, but when a browser accesses it the user is redirected elsewhere before the doorway page has finished loading.

Most of the major search engines have adapted their systems to identify this tactic, and will heavily penalise pages that use it. There are of course legitimate reasons to redirect visitors’ browsers, such as when the site in question has been restructured and a page has moved, and there are ways to set up permanent redirects that are unlikely to be penalised. A particular bugbear of Google is the HTML meta-refresh tag, which is used to redirect the browser to a new location; a refresh that occurs in less than ten seconds or so is considered highly suspect. (Cutts, 2004b)
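For illustration, the sketch below shows what a legitimate permanent redirect looks like at the HTTP level, using Python's standard http.server module and hypothetical page names; a real site would normally configure this in the web server itself rather than in application code.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MovedPageHandler(BaseHTTPRequestHandler):
        """Answer requests for a relocated page with a permanent (301) redirect."""
        def do_GET(self):
            if self.path == "/old-page.html":                 # hypothetical old location
                self.send_response(301)                        # 301 = moved permanently
                self.send_header("Location", "/new-page.html")
                self.end_headers()
            else:
                self.send_response(404)
                self.end_headers()

    # HTTPServer(("", 8080), MovedPageHandler).serve_forever()

A redirect signalled in this way tells the crawler that the move is permanent, in contrast to a near-instant meta-refresh, which the engines treat with suspicion.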

2.9.3 Cloaking

Cloaking is a similar but more sophisticated approach, and refers to the practice of serving different content depending on the client. It is possible to detect whether a page request is being made by a browser or by a search engine spider, and even which crawler it is. If a crawler is detected, a different page is served, designed to rank well in that particular engine. Cloaking, redirects and doorway pages are all considered spam, because they cause users to see different content from that seen by the search engines.
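The detection itself is trivial, which is partly why cloaking spread; a minimal sketch of the mechanism is shown below purely to clarify how it works (the user-agent substrings are illustrative examples of crawler identifiers and the file names are hypothetical). Serving pages in this way is, as noted, treated as spam.

    CRAWLER_SIGNATURES = ("googlebot", "slurp", "msnbot")   # example crawler user-agent substrings

    def page_for(user_agent: str) -> str:
        """Return a different page depending on whether the client looks like a crawler."""
        if any(sig in user_agent.lower() for sig in CRAWLER_SIGNATURES):
            return "keyword_optimised_version.html"   # seen only by the search engine
        return "normal_version.html"                  # seen by human visitors

    print(page_for("Mozilla/4.0 (compatible; MSIE 6.0)"))   # normal_version.html
    print(page_for("Googlebot/2.1"))                        # keyword_optimised_version.html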

To confuse the issue, even cloaking can have legitimate uses – detecting the browser’s settings and returning a text-only version of a page to a user with visual disabilities, for example.

2.9.4 Link Spam

A number of ‘link spam’ techniques have also developed which rapidly build large numbers of links to pages. Bad links include reciprocal links with little or no relevance to users, links to and from link farms, links created automatically by spambots on sites such as guest books and bulletin boards, and deceptive or hidden links.

Many websites include a ‘guestbook’ page that allows a visitor to leave a comment on the site. Weblogs, or ‘blogs’ as they are known, are online diary-style websites arranged in chronological order which often allow readers to add their own comments to the articles. Forums are online discussion boards where users can post questions and answers and discuss a huge variety of topics. Each of these is vulnerable to abuse by link spammers who leave comments whose sole purpose is to contain a link to their websites. This process is frequently automated – search engine spiders are not the only robots that crawl the web. So-called ‘spambots’ do so too, for a variety of purposes, including harvesting email addresses and leaving comment spam. This allows a very rapid build-up of incoming links – but these links are likely to be downgraded by the search engines because of the likelihood of spam.

Link farms or link trees are pages that contain nothing of use but thousands of links. They enable hundreds or thousands of websites to interlink, for no purpose other than to build incoming links. Frequently the links from the member websites to the link farm are concealed in some way, so that they are unlikely to be followed by a human visitor but can still be seen and followed by search engine robots – for example, a single full-stop character in the body text of the page may carry the link. Search engines naturally view this as a deceptive technique, and if it is caught a penalty is likely to result. Link farms are typically set up by SEO firms claiming to offer instant link popularity, and the search engines tend to penalise sites that contain concealed links.

‘Free-for-all’ link sites are websites where webmasters place reciprocal links to their sites. These sites become massive hubs, linking to thousands of sites, but as with link farms the lack of relevance probably downgrades the value of the links.

Google in particular views link farms and free-for-alls as ‘bad neighbourhoods’ and claims to penalise the ranking of sites that link to them. However, it does not penalise sites that are merely linked to from them, as that is not necessarily within the control of the site owner: because anyone can submit a link on these sites, competitors could submit each other’s sites in the hope of earning their rivals a penalty. Linking to a farm, on the other hand, is seen as proof of guilt, since it is entirely within the webmaster’s control.

3. Research Methodology

This study was intended to investigate the effect of keyword density, document length and other factors such as PageRank on the ranking of web pages in the Google search engine. Originally, it was intended to build a series of web-pages – each only slightly different from the last, and each containing the same unique or unusual phrase in differing keyword densities and locations in the page – in order to see which pages performed better in the results and why.

This approach was modified for two reasons. Firstly, due to the short time available for the project, it was deemed unlikely that enough data could be collected to reach any significant conclusions. Secondly, and more importantly, a conversation with Google software engineer Matt Cutts at the Search Engine Strategies conference (2004) indicated that the approach might fall foul of Google’s duplicate content detection software, which attempts to ensure that only one version of a page is indexed. If the pages were too similar they might be excluded from the index, thus invalidating the experiment.

In view of these reservations, a different approach was required, which concentrated on the collection and analysis of data. A sample of queries was selected and the top ten results for each query were captured, resulting in a collection of nearly 300 pages. The keyword density and length of each of these pages was analysed, and statistically tested to establish whether or not there is a pattern or correlation with the search position.

The intention was to discover to what extent on-page search engine optimisation has an effect on Google results, since Google is renowned for its emphasis on off-page factors – both the query-independent PageRank and the presence of query terms in anchor text pointing to the page. PageRank in particular is often credited with great influence, and many involved in search engine optimisation obsess over it at great length.

The following research questions were posed:
1. Can the conventional wisdom regarding optimum values for factors such as document length and keyword density be justified empirically?
2. Does the presence or absence of query terms in different areas of a document, such as headings and link text, really have a noticeable effect on search engine positioning?
3. How influential is PageRank in Google’s ranking algorithm in comparison with these other factors?

3.1 Query set

The queries were chosen from the Google Zeitgeist news page (http://www.google.com/press/zeitgeist.html). This page is a regularly updated listing of the most popular queries entered into the Google search engine from around the world in a given period of time. As a collection of popular queries, the set contained a large number of two-word queries – in particular celebrity names. This set was chosen because it is representative of actual user behaviour; in addition, the majority of these queries lie in highly competitive fields and would be likely to return large numbers of results.

Thirty queries were selected. With ten results returned for each query, this would give a sample size of 300 web pages, which should be adequate to establish statistical significance in any findings. Each of the queries was entered as typed – no operators or quotation marks were used, only the words. It is not clear from the Google Zeitgeist page whether the queries have been normalised to correct common spelling mistakes, or were originally entered with any special operators, but this is unimportant for this research; the queries were chosen simply to provide a representative sample.

Queries collected from Google Zeitgeist 11 August 2004:

Top 10 Gaining Queries, Week Ending Aug. 9, 2004:
1. mega millions
2. doom 3
3. rick james
4. olympics
5. ralph fiennes
6. quincy carter
7. hurricane alex
8. lindsay lohan

Top 10 Declining Queries, Week Ending Aug. 9, 2004:
1. the village
2. jibjab
3. robert sorrells
4. democratic national convention
5. barack obama
6. jennifer love hewitt
7. alexandra kerry
8. lori hacking
9. big brother
10. amish in the city

Popular Google News Queries – June 2004:
1. euro 2004
2. harry potter
3. paul johnson
4. john kerry
5. wimbledon
6. venus transit
7. fahrenheit 911
8. bill clinton
9. ronald reagan
10. scott peterson

3.2 Keyword density analysis

The keyword density statistics were harvested using a free online tool, the GoRank Top 10 Keyword Analyser (http://www.gorank.com). This application makes use of the Google API (Application Programming Interface) to query Google and then follow the links to each of the top ten results returned. It then provides a breakdown of the frequency of the search terms in different areas of the pages, such as title tags, link text and headings, along with a measure of the PageRank of each page.

[Screenshot: the GoRank Keyword Density Analysis Tool]

The results of these analyses were then imported into an Excel spreadsheet and cleaned up to prepare them for statistical analysis. Because the data had been taken directly from a web application, it was necessary to strip away manually all of the HTML formatting that had been copied along with it, in order to render it into a form the spreadsheet software could operate on. This proved to be an extremely time-consuming process.
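(In principle this clean-up could have been scripted; the rough sketch below illustrates the kind of tag-stripping involved, although in this study the work was done by hand in Excel.)

    import re

    def strip_html(fragment: str) -> str:
        """Crude clean-up of copied web-page markup: drop tags and collapse whitespace."""
        text = re.sub(r"<[^>]+>", " ", fragment)    # remove anything that looks like an HTML tag
        return re.sub(r"\s+", " ", text).strip()    # normalise runs of whitespace

    print(strip_html("<td><b>harry potter</b></td>"))   # -> harry potter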

Once the results were rendered into a more manageable form, a number of statistics were calculated. The results were collated in order of ranking position – all the 1’s together, all the 2’s and so on. The mean of each variable was calculated for each position, and examined to see if any patterns were immediately apparent.
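These summary statistics were produced in the spreadsheet; the equivalent calculation can be expressed in a few lines of Python with the pandas library, assuming the cleaned data had been exported to a CSV file (the file name and column names here are hypothetical):

    import pandas as pd

    df = pd.read_csv("gorank_results.csv")    # hypothetical export of the cleaned data
    means_by_position = df.groupby("Position").mean(numeric_only=True)
    print(means_by_position[["PR", "words", "repeats", "density"]])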

Following these simple metrics, the numerical data was imported into the SPSS statistical package for further analysis. Initially, a frequency analysis of each variable was carried out in order to determine whether parametric or non-parametric statistical tests would be most appropriate. The majority of variables did not display a normal distribution, and therefore a non-parametric test was required. Spearman’s rank correlation coefficient was used to calculate the correlation of each of the independent variables with the dependent variable, position. This is a non-parametric test, suitable for use with data that violates parametric assumptions such as normality (Field, 2000), and is commonly used with ranked data.

Spearman’s correlation coefficient could reveal whether or not there was a correlation between each of the independent variables and the dependent variable (the ranking position), but would not convey any information about the fit of the model (the extent to which all of the variables dealt with in the experiment account for the position) or the individual contributions of each factor. To uncover this information it was necessary to convert the results into a form suitable for parametric tests, allowing multiple regression to be used to identify the most significant factors. This was made possible by taking logarithms of the data, which were approximately normally distributed.

3.3 Limitations

The approach taken had some limitations. In particular the query set, being taken from a list of highly popular queries, was quite limited in scope. The majority of the queries were names of famous people, and were therefore two-word queries. The results may possibly have been different with more complex queries.

For the first query, ‘mega millions’, no meaningful results were returned. I attempted to open these pages from the links retrieved by the GoRank tool in order to investigate, but was unable to do so. It seems likely that the pages were dynamically generated in some way, possibly including session variables or some other mechanism making them irretrievable. Given that all ten results for this query returned no keywords, this seemed more likely to be a technical error than simply an anomaly in the results, and so it was decided to ignore this query.

The GoRank keyword density tool also seems to have had some difficulty in capturing some of the results. For several of the queries only the top nine results were returned, rather than the top ten as expected. For this reason, the sample of pages ranking number 10 in the SERPs contained only 23 pages, whereas each of the other positions contained 29 (because of the problem with the ‘mega millions’ query).

4. Presentation and discussion of results

4.1 GoRank Analysis Results

The output from the GoRank keyword density analysis tool produced data on just under 300 web pages. For some reason, for several of the queries in the sample, the tool was only able to capture data for the top nine results, rather than the top ten. Repeating the query did not appear to fix the problem, and so it seems likely to be a technical issue, either with the websites or the tool itself. In addition, for one query ‘mega millions’ the tool was unable to capture any data at all. I attempted to reach these pages by manually entering the URL into a browser, but was unable to do so. I suspect that these sites must all make use of some dynamic parameter such as a session ID or time-stamp, and are no longer available. Given this difficulty, this query was excluded from the sample altogether. This left data on a total of 284 web pages.

The data captured for each page was as follows:
• The page title text
• The URL
• The PageRank on a scale of 1-10 as reported by the Google Toolbar (PR)
• The total number of words (Words)
• The number of times the search word or phrase is repeated (Repeats)
• The overall keyword density – i.e. the percentage of the word count of the page contributed by the search terms (Density)
• The number of times the search terms appear in the page title (Title)
• The number of times the search terms appear in link text on the page (Link)
• The number of times the search terms appear in image ‘alt’ text (Alt)
• The number of times the search terms appear in the keywords meta-tag (KW)
• The number of times the search terms appear in the description meta-tag (Desc)
• The number of times the search terms appear in bold text (Bold)
• The number of times the search terms appear in heading tags such as H1, H2, etc. (HTags)
• The ranking of the document on the Google search results page; this is assumed to be the dependent variable (Position)

The GoRank tool captures all of this information for the query as a whole (i.e. as a phrase) and also for each query term individually. In order to keep the sample down to a manageable amount of data, however, further analysis has only been carried out on the results for the whole queries. For example, with the query ‘harry potter’ the numbers do not reflect occurrences of the individual words, but only of the two together.
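To make the density measure concrete, the sketch below counts whole-phrase occurrences and expresses them as a share of the page's word count, which is roughly what the GoRank 'Density' figure represents; it is an approximation of the tool's behaviour, not its actual code.

    import re

    def phrase_density(text: str, phrase: str):
        """Count whole-phrase repeats and the share of the total word count they make up."""
        words = re.findall(r"\w+", text.lower())
        target = phrase.lower().split()
        repeats = sum(
            words[i:i + len(target)] == target
            for i in range(len(words) - len(target) + 1)
        )
        density = (repeats * len(target)) / len(words) if words else 0.0
        return repeats, density

    # e.g. a 1,000-word page repeating 'harry potter' 15 times has a density of 3%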

4.2 Statistical Analysis

A preliminary analysis of descriptive statistics for the whole sample revealed the following information:

Variable    N     Mean      Median    Std. Deviation   Minimum   Maximum
PR          284   3.70      4.00      2.801            0         8
words       284   963.65    562.50    1316.197         1         10098
repeats     284   6.77      4.00      10.119           0         99
density     284   .037017   .014550   .084809          .0000     1.0000
Title       284   .61       1.00      .509             0         2
Link        284   1.95      .00       5.844            0         81
Alt         284   .49       .00       1.625            0         17
KW          284   .61       .00       2.103            0         27
Desc        284   .39       .00       .717             0         5
Bold        284   .76       .00       2.272            0         28
HTags       284   .12       .00       .325             0         1
Position    284   5.40      5.00      2.831            1         10
(No missing values for any variable.)

As can be seen, the pages have an average PageRank of 3.7 (remembering of course that the PR figures here are an approximation). The average document length is 963 words and the average keyword density is 3.7%. Looking closer, however, we see that there is a considerable spread in the data, as can be deduced from the minimum and maximum figures and the standard deviations.

It is perhaps more useful to look at the averages for each ranking position, in order to begin to detect patterns in the data that might indicate potential correlations between the different variables and the position.


Mean of each variable by ranking position:

Position   PR     words     repeats   density   Title   Link   Alt    KW     Desc   Bold   HTags
1          5.34   463.93    4.62      10.31%    0.62    0.83   0.31   0.34   0.45   0.48   0.14
2          4.34   660.72    5.21      3.84%     0.59    1.28   0.69   0.90   0.28   0.34   0.10
3          4.03   739.62    5.76      4.19%     0.52    1.76   0.45   1.41   0.41   0.45   0.07
4          3.86   1177.97   6.90      2.23%     0.66    1.55   0.24   0.17   0.31   0.86   0.21
5          3.31   895.48    7.14      2.66%     0.55    2.45   1.41   0.38   0.24   0.76   0.10
6          3.48   996.97    4.28      2.33%     0.55    0.90   0.24   0.52   0.41   0.21   0.07
7          3.48   1069.45   8.62      2.45%     0.69    2.00   0.41   0.41   0.41   0.97   0.10
8          3.28   1231.28   9.00      2.88%     0.62    2.79   0.52   0.24   0.28   1.21   0.10
9          3.10   1061.34   8.38      2.50%     0.59    3.79   0.21   0.69   0.59   1.69   0.10
10         2.52   1437.83   8.09      3.61%     0.78    2.26   0.43   1.09   0.52   0.65   0.22

By inspection, there do appear to be some trends here. The average PageRank shows a general downward trend further down the rankings (i.e. as the value of Position increases).

[Figure: Position vs. Mean PageRank – scatter plot with linear fit (R Sq Linear = 0.843).]

In contrast, the data seems to suggest that document length and the number of repeats generally increase going from the highest to the lowest position. These two factors could reasonably be expected to correlate with each other fairly well.

[Figure: Position vs. Mean Document Length (words) – scatter plot with linear fit (R Sq Linear = 0.76).]

There also appears to be a general trend in keyword density, from an average of 10% in the number one position down to 2-3% further down the results. However, a single outlying page in the number one position, on which the keyword density was 100%, may unduly inflate the mean density value. As can be seen from the graph below, the presence of this significant outlier makes it difficult to define a line of best fit to illustrate the trend, except perhaps with a non-linear function. Trends are less apparent in the other variables.

[Figure: Position vs. Mean Overall Keyword Density – scatter plot.]

These results suggested that the most influential factors in the rankings were the PageRank, the document length and the keyword density, much as expected.

4.2.1 Frequency Analysis

However, frequency histograms showed that the data do not generally follow a normal distribution. Rather, the curves generated tend to indicate a Zipfian distribution. This is to be expected to some extent, as Zipf’s law commonly describes the frequency with which words appear in a document or collection of documents. It suggests that the most important words in a document – those that will most readily identify it as unique – are those that occur least frequently, while the most commonly occurring words tend to be of little value in information retrieval, as they appear in the majority of documents.
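For reference, the usual statement of Zipf's law (a standard formulation, not derived from the data collected here) is that the frequency of a term is roughly inversely proportional to its rank in the frequency table,

    f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1

so the second most common word occurs about half as often as the most common, the third about a third as often, and so on, producing the heavily skewed counts seen here.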

[Figure: frequency distribution of document length in words.]

[Figure: frequency distribution of overall keyword density.]

PageRank appears to be distributed a little more closely to the normal curve, although with a significant spike at zero. In any case, it was decided to proceed as though PageRank is approximately normally distributed for the rest of the analysis.

[Figure: frequency distribution of PageRank.]

4.2.2 Spearman’s Rank Correlation

Because of the non-normality of the data, initially only non-parametric statistical tests were appropriate. For each of the independent variables, a bivariate correlation with the dependent variable, position, was calculated using Spearman’s rank correlation coefficient, a non-parametric test commonly used when dealing with ranked data such as this.
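These calculations were carried out in SPSS; for illustration, the same coefficient for any one variable could be obtained in a couple of lines of Python with scipy (the CSV file and column names are hypothetical, as before):

    import pandas as pd
    from scipy.stats import spearmanr

    df = pd.read_csv("gorank_results.csv")               # hypothetical export of the cleaned data
    rho, p_value = spearmanr(df["PR"], df["Position"])   # coefficient and its two-tailed p-value
    print(round(rho, 3), round(p_value, 3))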

Correlation of each independent variable with position:

Variable    Spearman’s rho   Sig. (2-tailed)   N
PageRank    -.241(**)        .000              284
words       .179(**)         .002              284
repeats     .140(*)          .018              284
density     -.055            .352              284
Title       .063             .288              284
Link        .049             .409              284
Alt         -.046            .439              284
KW          .070             .237              284
Desc        .051             .396              284
Bold        .078             .192              284
HTags       .019             .746              284

** Correlation is significant at the 0.01 level (2-tailed).
* Correlation is significant at the 0.05 level (2-tailed).


As can be seen from the above figures, there is a noticeable negative correlation between PageRank and Position. This is to be expected, of course, as the higher a page appears in the search results, the lower the value of Position (i.e. a position of 1 is better than a position of 2, and so on). This correlation is highly significant, so we can be reasonably confident of a connection between PageRank and position.

The only two other variables that give a statistically significant Spearman correlation with position are the number of words in the document (significant at the 0.01 level) and the number of times the query phrase is repeated in the document (significant at the 0.05 level). Both of these variables have a small positive correlation with the value of position – which would indicate that the longer the document, the further down the rankings it tends to appear.

The overall keyword density does give a very slight negative correlation with position, but this is not statistically significant. We therefore cannot conclude from the Spearman test that keyword density has a noticeable effect on position. This seems somewhat counterintuitive given the apparent trend in the mean densities shown above; the discrepancy may be down to one or two outlying densities distorting the mean at position 1 and giving the impression of a trend where none really exists.

4.2.3 Regression Analysis

In order to use more powerful techniques such as multiple regression modelling, it was necessary to bring the data closer to a normal distribution, rendering it suitable for parametric testing. This was accomplished by taking logarithms of the variables that appeared to follow the Zipfian distribution: document length, repeats of the query terms, keyword density, and the appearance of the query terms in keywords, bold, alt, description and link text. Because many of these variables included values of 0, it was necessary to add a constant (0.5) before taking the logs to eliminate the null values. Taking the log10 of each of these variables produced frequency histograms much closer to a normal distribution, as shown below for document length:

[Figure: logarithmic frequency distribution of document length (log.words); Mean = 2.519, Std. Dev. = 0.87266, N = 284.]

It was not necessary to take a log value for PageRank, as its frequency histogram was already much closer to a normal distribution – although, as Kent (2004: 220) speculates, the PageRank value reported here may already be on a logarithmic scale. The HTags variable only gave values of 0 and 1, and was therefore taken to indicate a binary state: the query term was either present in the heading tags or not. The Title variable was very nearly binary, but not quite – of the 284 pages in the sample, only 3 had two repetitions in the title, and none had more than that. It was therefore decided to recode this as a binary variable also – the query terms either appear in the page title (1) or not (0).
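A brief sketch of these transformations in Python (again assuming the hypothetical CSV export and column names used earlier) makes the recoding explicit:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("gorank_results.csv")   # hypothetical export of the cleaned data

    # log10(x + 0.5) for the Zipfian variables; the 0.5 avoids taking the log of zero
    for col in ["words", "repeats", "density", "Link", "Alt", "KW", "Desc", "Bold"]:
        df["log." + col] = np.log10(df[col] + 0.5)

    # HTags and Title recoded as binary presence/absence indicators
    df["HTags.bin"] = (df["HTags"] > 0).astype(int)
    df["Title.bin"] = (df["Title"] > 0).astype(int)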

Having performed these manipulations on the data, it was then possible to fit a multiple regression model. This would accomplish two important objectives. Firstly, it would allow the overall fit of the model to be determined – that is, what contribution the variables measured here make overall to the position of the page. Secondly, it would enable the identification of which of the various variables make the greatest contribution to the position. Multiple linear regression was carried out using the backwards regression method. This method runs the regression several times in an iterative process: each time the regression is run after the first, the variable that was least influential in the previous iteration is removed. This allows the identification of which variables have the greatest influence. The following table indicates the order in which the variables were removed:

Model   Variables Removed   Method
1       (all variables entered: PageRank, log.Link, log.density, HTags (binary), log.KW, log.Bold, log.Alt, Title (binary), log.words, log.Desc, log.repeats)   Enter
2       log.Link            Backward
3       log.repeats         Backward
4       log.Desc            Backward
5       HTags               Backward
6       Title               Backward
7       log.Alt             Backward
8       log.Bold            Backward
9       log.KW              Backward
10      log.density         Backward

Backward criterion: Probability of F-to-remove >= .100.
Dependent Variable: Position

This would appear to indicate that the least influential variable is the presence of the query term in link text, as it was the first to be removed. Somewhat surprisingly the next least influential factor is the number of repeats of the query term in the document, in contrast to the trend suggested by the Spearman correlation coefficients shown above.
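The procedure itself was run in SPSS. A rough equivalent of backward elimination can be sketched with the Python statsmodels library, repeatedly dropping the weakest remaining predictor until every survivor meets the 0.10 removal criterion; this mimics, rather than reproduces, SPSS's F-to-remove rule (the file and column names are the hypothetical ones used above).

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("gorank_transformed.csv")   # hypothetical file holding the log/binary variables
    predictors = ["PR", "log.words", "log.repeats", "log.density", "log.Link",
                  "log.Alt", "log.KW", "log.Desc", "log.Bold", "Title.bin", "HTags.bin"]
    X = sm.add_constant(df[predictors])
    y = df["Position"]

    while True:
        fit = sm.OLS(y, X).fit()
        p = fit.pvalues.drop("const")        # p-values of the remaining predictors
        if p.empty or p.max() < 0.10:        # all survivors meet the removal criterion
            break
        X = X.drop(columns=[p.idxmax()])     # drop the weakest predictor and refit

    print(fit.summary())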


The following table indicates the fit of each iteration of the regression model: the R-square value shows what proportion of the variation in Position is accounted for by the variables measured here.

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .316(a)   .100       .063                2.740                        .100              2.742      11    272   .002
2       .316(b)   .100       .067                2.735                        .000              .013       1     272   .910
3       .315(c)   .099       .070                2.730                        .000              .097       1     273   .756
4       .315(d)   .099       .073                2.726                        .000              .084       1     274   .772
5       .314(e)   .098       .076                2.722                        -.001             .219       1     275   .640
6       .312(f)   .098       .078                2.718                        -.001             .275       1     276   .601
7       .310(g)   .096       .080                2.715                        -.001             .377       1     277   .540
8       .306(h)   .093       .080                2.715                        -.003             .916       1     278   .339
9       .299(i)   .089       .079                2.716                        -.004             1.290      1     279   .257
10      .286(j)   .082       .075                2.722                        -.007             2.250      1     280   .135

Predictors:
(a) (Constant), PageRank, log.Link, log.density, HTags, log.KW, log.Bold, log.Alt, Title, log.words, log.Desc, log.repeats
(b) (Constant), PageRank, log.density, HTags, log.KW, log.Bold, log.Alt, Title, log.words, log.Desc, log.repeats
(c) (Constant), PageRank, log.density, HTags, log.KW, log.Bold, log.Alt, Title, log.words, log.Desc
(d) (Constant), PageRank, log.density, HTags, log.KW, log.Bold, log.Alt, Title, log.words
(e) (Constant), PageRank, log.density, log.KW, log.Bold, log.Alt, Title, log.words
(f) (Constant), PageRank, log.density, log.KW, log.Bold, log.Alt, log.words
(g) (Constant), PageRank, log.density, log.KW, log.Bold, log.words
(h) (Constant), PageRank, log.density, log.KW, log.words
(i) (Constant), PageRank, log.density, log.words
(j) (Constant), PageRank, log.words
Dependent Variable: Position
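The R-square figures above follow the standard definition (quoted here for reference, not taken from the study itself) of the proportion of variance in the dependent variable explained by the fitted model:

    R^{2} = 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}(y_{i} - \bar{y})^{2}}

An R-square of .100 therefore means that the predictors jointly account for about 10% of the observed variation in Position.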

As can be seen, the variables measured in this experiment appear to account for only 10% in total of the overall calculation of the position of a web page. The two most prominent predictors are PageRank and document length, which appear to account for 8.2% between them, followed by keyword density. The following table shows the coefficients and their significance at each iteration of the model.

Coefficients at each iteration (columns: B, Std. Error, Beta, t, Sig.; Beta is not reported for the constant; ‘Title’ denotes the binary ‘search word present in title’ variable):

Model 1
(Constant)      3.256    1.161             2.804    .005
log.words        .292     .283     .090    1.033    .302
log.repeats      .140     .658     .026     .213    .832
Title            .238     .443     .041     .536    .592
log.density    -7.224    4.048    -.128   -1.785    .075
log.Link         .060     .527     .010     .113    .910
log.Alt         -.437     .646    -.045    -.677    .499
log.KW           .650     .676     .071     .962    .337
log.Desc        -.290     .855    -.026    -.339    .735
log.Bold         .386     .569     .047     .679    .498
HTags           -.255     .551    -.029    -.463    .644
PageRank        -.201     .061    -.198   -3.288    .001

Model 2
(Constant)      3.251    1.158             2.807    .005
log.words        .289     .281     .089    1.029    .304
log.repeats      .177     .570     .033     .311    .756
Title            .226     .431     .039     .525    .600
log.density    -7.226    4.040    -.128   -1.789    .075
log.Alt         -.424     .634    -.043    -.668    .504
log.KW           .655     .674     .071     .971    .332
log.Desc        -.292     .853    -.026    -.343    .732
log.Bold         .399     .557     .048     .716    .475
HTags           -.245     .543    -.028    -.452    .652
PageRank        -.200     .061    -.198   -3.294    .001

Model 3
(Constant)      3.322    1.134             2.929    .004
log.words        .342     .223     .106    1.536    .126
Title            .266     .411     .046     .647    .518
log.density    -6.890    3.886    -.122   -1.773    .077
log.Alt         -.353     .590    -.036    -.597    .551
log.KW           .672     .671     .073    1.001    .318
log.Desc        -.243     .837    -.022    -.290    .772
log.Bold         .452     .529     .055     .855    .394
HTags           -.232     .540    -.027    -.430    .668
PageRank        -.203     .060    -.200   -3.375    .001

Model 4
(Constant)      3.329    1.132             2.941    .004
log.words        .337     .222     .104    1.520    .130
Title            .261     .410     .045     .637    .525
log.density    -6.978    3.868    -.124   -1.804    .072
log.Alt         -.348     .589    -.035    -.590    .556
log.KW           .562     .555     .061    1.014    .312
log.Bold         .419     .516     .051     .812    .417
HTags           -.250     .535    -.029    -.468    .640
PageRank        -.201     .060    -.199   -3.368    .001

Model 5
(Constant)      3.415    1.115             3.062    .002
log.words        .325     .220     .100    1.479    .140
Title            .205     .392     .036     .524    .601
log.density    -6.801    3.844    -.120   -1.769    .078
log.Alt         -.365     .587    -.037    -.622    .534
log.KW           .582     .552     .063    1.053    .293
log.Bold         .430     .514     .052     .836    .404
PageRank        -.202     .060    -.200   -3.393    .001

Model 6
(Constant)      3.698     .974             3.795    .000
log.words        .354     .213     .109    1.662    .098
log.density    -6.048    3.561    -.107   -1.698    .091
log.Alt         -.360     .586    -.037    -.614    .540
log.KW           .633     .543     .069    1.166    .245
log.Bold         .501     .495     .060    1.011    .313
PageRank        -.205     .059    -.203   -3.458    .001

Model 7
(Constant)      3.755     .969             3.876    .000
log.words        .335     .210     .103    1.591    .113
log.density    -6.270    3.539    -.111   -1.772    .078
log.KW           .586     .537     .064    1.091    .276
log.Bold         .471     .493     .057     .957    .339
PageRank        -.211     .059    -.209   -3.597    .000

Model 8
(Constant)      3.664     .964             3.801    .000
log.words        .388     .203     .120    1.911    .057
log.density    -5.964    3.524    -.106   -1.692    .092
log.KW           .609     .536     .066    1.136    .257
PageRank        -.212     .059    -.210   -3.618    .000

Model 9
(Constant)      3.697     .964             3.835    .000
log.words        .421     .201     .130    2.098    .037
log.density    -5.188    3.459    -.092   -1.500    .135
PageRank        -.209     .059    -.206   -3.565    .000

Model 10
(Constant)      4.856     .577             8.412    .000
log.words        .528     .188     .163    2.807    .005
PageRank        -.211     .059    -.209   -3.603    .000

Dependent Variable: Position

The smaller the value of Sig., and the larger the absolute value of t, the greater the contribution of the factor in question to the position. The final iteration clearly shows that the contributions of document length and PageRank are highly significant, so we can be fairly confident that these contributions are real. However, as these factors account for only around 8-10% of the position of a page, there are clearly many other factors at work.

The results presented here would seem to indicate that while factors such as document length, keyword frequency and density, and even PageRank do play a part in determining the position of web pages in search results, the contribution is relatively minor – only around 10% of the overall picture. For several of the other variables, such as link text, it was not possible to detect a correlation with position at a statistically significant level. This does not necessarily indicate that no such relationship exists, however; it is possible that there are flaws in the experiment or the sample that confuse the results.

For example, the sample was quite restricted, as it took only the top ten results for each query. Ten pages out of, say, 100,000 is not a particularly representative sample, particularly when those ten are all skewed towards the top end of the result set. It is possible that the differences between the top ten pages are so small that they are difficult or impossible to measure accurately, particularly without access to the whole ranking algorithm.

A better sample might have been to take every fiftieth result for each query, for example. This would have given a wider spread of results, and might have made the differences between the pages more apparent, highlighting any trends more obviously. Another consideration is that each query returns a different number of results, so measures of relevance are clearly relative: there is a significant difference between measuring the relevance of the top ten documents for a query that returns a few hundred results and for one that returns several million.

Also, the counts and associated statistics were calculated only for cases where the query terms appeared as phrases on the pages, and did not include incidences of the individual query words separately. Including these in the calculations may have significantly affected the results. In addition, position is not really an absolute scale – it is impossible to say with any certainty that the difference between position 1 and position 2 is the same as that between position 2 and position 3, and so on. It has been necessary to act as though this is the case in order to carry out these tests.

4.3 Summary

In summary, these results indicate that of the factors measured here, the most significant is PageRank, which appears to account for around 8% of the Google ranking calculation. Given that PageRank is described as the heart of Google, this contribution is not surprising, except that it appears to be of relatively minor importance. The real surprise is how little influence the other factors seem to have, with only document length making a statistically significant contribution in this sample.

While keyword density and the other factors do seem to play a part, it is only very minor indeed. All of the factors dealt with in this study, with the exception of PageRank, would seem to add up to less than 2% of the ranking calculation. There is some suggestion, however, that documents of approximately 400-500 words have a slightly better chance of ranking well in Google than longer or shorter documents, as indicated by the regression line in the graph of position vs. document length shown earlier in this chapter.

5. Conclusions and further research

The experiment sought to answer the following research questions:
1. Can the conventional wisdom regarding optimum values for factors such as document length and keyword density be justified empirically?
2. Does the presence or absence of query terms in different areas of a document, such as headings and link text, really have a noticeable effect on search engine positioning?
3. How influential is PageRank in Google’s ranking algorithm in comparison with these other factors?

Taking the last of these questions first, it does seem possible to say with a fair degree of certainty that the PageRank calculation really does have a significant influence on the ranking of web pages in Google. Furthermore, that influence appears to be measurable, and seems to account for around 8% of the factors that make up the ranking score.

However, it must be borne in mind that the PageRank values used in the experiment are only approximations (as shown on the PageRank indicator on the Google Toolbar), and Google has never released information about how these approximations relate to the actual PageRank calculation. It seems highly unlikely that the relationship is linear, as Kent (2004) points out, so it may be flawed to use this number in a model as was done here. The fact remains, however, that a measurable correlation between PageRank and position was established in the experiment.

Whether this assertion would hold true in all situations, however, is open to debate. As has already been mentioned, the sample used was heavily skewed towards the high positions, and it is possible that the influence of PageRank is greater or lesser elsewhere in the rankings. As a query-independent measure, it may perhaps be invoked as a ‘tie-breaker’ when a number of documents all appear equally relevant, for example.

The answers to the other two questions were less conclusive. There was certainly some indication that shorter web pages (in the 400-500 word range) ranked slightly better than longer ones, although even 400 words is quite long for a web page. This was something of a surprise, as I had anticipated that longer documents would provide more opportunity for the search engine to establish relevance. A possible explanation might be that the larger web pages dealt with more than one subject, whereas the shorter ones were probably more focussed. Also, if the proximity of query terms to the top of the page is a factor in the calculation, longer pages may score lower because the keywords are likely to be more spread out.

The apparently minimal influence of keyword density was somewhat unexpected. While a slight negative correlation with position was suggested by the results – remembering of course that a higher numerical value of position actually translates to a lower ranking – it could not be established at a sufficiently robust level of statistical significance. This was rather puzzling, as logic would seem to dictate that the more times a given term appears in a document, the more relevant that document is likely to be.

Similarly, the other factors all appeared to have little or no influence on the ranking. For some of them this was not unexpected: it has long been known, for example, that Google makes little or no use of the meta-keyword tag in its algorithm because of the ease with which it can be abused. For other factors, however, this was highly surprising – all the advice indicates that the composition of the page title is extremely important, yet the results would appear to show that this is not the case. Indeed, a fair number of these highly ranked pages did not have a title at all.

The apparently low impact of the factors measured in this study on ranking positions raises some interesting questions. If all these factors account for only 10% of what is taken into account by the ranking algorithm, where does the other 90% come from? A number of scenarios are possible:

• There may be a very small number of extremely significant factors that are not taken into account by this experiment, each of which has a large impact on the ranking function.
• There may be a very large range of factors that affect the ranking function, each of which has a very small impact on the outcome, similar to the trends observed in variables such as document length and keyword density.
• There may be a middle ground, with a few factors contributing, say, 80% of the function, but a significant number of minor variables which ‘tune’ the results.

One important factor, which Google uses extensively but which was not measured in this experiment, is the presence of keyword terms in anchor text (the link text of pages pointing to the page in question). This is a very good candidate for a highly significant factor in the above scenarios, as illustrated by the ‘click here’ and ‘miserable failure’ examples described earlier. It may in fact be the most significant of all variables in Google’s current ranking algorithm. There are, then, a number of possible candidate factors that this study has not addressed which could influence the ranking. Some of these are on-page or on-site factors, whereas others are to do with analysis of the pages’ back-links (links pointing to the pages). Other possible internal factors include:
• Proximity of keywords to each other
• Proximity of keywords to the top of the page
• Presence or absence of keywords in the URL (domain name, directory structure or filenames)

Other possible external factors may include:
• Anchor text on back-linked pages
• Contextual text (text surrounding anchors) on back-linked pages
• Whether incoming links are internal (from the same domain) or external (from off-site)
• The total number of pages returned by the search engine as relevant to the query


So, the experiment appears to raise more questions than it answers. In particular, it would be desirable to know what the major factors are that influence the other 90% or so of the ranking function that remains unaccounted for. There is considerable scope for further research in this area in order to determine this.

The GoRank Top Ten tool is useful, but limited in scope. It was selected for this study because of its ability to harvest data from several pages simultaneously, thus building up a reasonably large data set fairly quickly. This was an important consideration because of the very limited time available for the project. It does, however, collect less information about each specific page than some other keyword density analysis tools. In fact, the GoRank site itself has another tool, which analyses a single page in much more depth, and there are also commercial software packages available, such as WebPosition Gold.

A project with fewer time constraints could perhaps repeat this experiment in a modified form, with a wider-ranging sample and collecting more information about each page for analysis. For example, rather than analysing the top ten pages for each query, it might be better to look at pages whose search rankings were more widely dispersed. It would be reasonable to expect that the differences between them would be more pronounced and easier to measure.

In addition, the current study took no account of incoming links to each page other than through their PageRank value. It would be useful to perform a more in-depth link analysis on the result pages, to determine the extent to which factors such as anchor text contribute to the ranking of a page. The query set could also be widened, to see whether the general findings of this study hold true in other situations. All the queries used in this experiment were fairly straightforward and short, and it would be useful to see what happens with more specific or complex queries, perhaps using different operators or facilities from the advanced search page.

This study concentrated on Google because it is the most written-about search engine, and because it is the only one to give away much of a clue to its ranking function, in the form of the published work on the PageRank metric. It would be interesting to carry out a comparative study with other search engines in order to see what similarities or differences are apparent.

An alternative project would take a more hands-on approach, assessing the impact of inbound links on search engine positioning in an empirical experiment. This long-term experiment would involve taking a live, newly launched website and monitoring and recording its position in the major search engines over a period of several months. The aim would be to begin with few or no incoming links, and then on a monthly basis encourage an increasing number of reciprocal links, in order to measure the progress of the site in the search listings over time.

In summary, the results of this experiment do raise some questions as to the value of on-page search engine optimisation techniques. If on-page factors such as keyword density have as limited an effect on rankings as appears to be the case, is it really worth the time and effort involved? I would say it probably is. As has already been discussed, the results are inconclusive, and the other 90% or so of what goes into the ranking remains unaccounted for. It is possible to make an educated guess that a significant proportion of it will be down to anchor text, but there are other factors that may have an influence, and in a competitive environment such as the web, every little helps. In any case, many of the techniques discussed – such as the inclusion of text navigation schemes, cascading style sheets, hierarchical site structures and so on – have usability and accessibility advantages beyond any search engine considerations. They are good for users, and therefore good for search engines.

What the future holds is a source of speculation. With developments such as personal preference based searching, which both Google and Yahoo! are experimenting with, on the horizon, it is difficult to say what the state of SEO will be in even the near future. Whatever happens, the message from the search engines will remain – design for users first, and spiders second.

Appendix: Keyword Density Analysis Results


Bibliography

Alimohammadi, D. (2003) “Meta-tag: a means to control the process of Web indexing”, Online Information Review [Online]. 27 (4) 238-242. http://www.emeraldinsight.com/1468-4527.htm [Accessed 28 August 2004]

Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., Raghavan, S. (2001) “Searching the Web” [Online]. ACM Transactions on Internet Technology 1 (1) 2-43. http://portal.acm.org/ [Accessed 16 May 2004]

Beitzel, S., Jensen, E., Chowdhury, A., Grossman, D., Frieder, O. (2004) “Hourly Analysis of a Very Large Topically Categorized Web Query Log”. In: ACM SIGIR ’04 [Online]. Proceedings of the 27th annual international conference on Research and development in information retrieval. 25-29 June 2004, Sheffield, UK. http://portal.acm.org/ [Accessed 16 May 2004]

Bergman, M.K. (2001) “The Deep Web: Surfacing Hidden Value” [online] in the Journal of Electronic Publishing. 7(1). University of Michigan. July 2001. http://www.press.umich.edu/jep/07-01/bergman.html [accessed 18th October 2003]

Boutin, P (1999) Sending Search Engine Traffic to Your Site, [Online]. Barcelona, Spain: Webmonkey/Terra Lycos. http://hotwired.lycos.com/webmonkey/99/31/index1a.html [Accessed 16 May 2004]

Brewer, E.A. (2001) “When Everything is Searchable”, Communications of the ACM [Online]. 44 (3) 53-55. http://portal.acm.org/ [Accessed 28 August 2004]

Brin, S., Page, L. (1998) “The Anatomy of a Large-Scale Hypertextual Web Search Engine” in 7th International World Wide Web Conference. [Online] 14th-18th April 1998. Brisbane, Australia: World Wide Web Consortium. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm [Accessed 16 May 2004]

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J. (2000) “Graph Structure in the Web” [Online]. In: Proceedings of the 9th International World Wide Web Conference on Computer Networks, 15-19 May 2000, Amsterdam, The Netherlands. http://www.almaden.ibm.com/cs/k53/www9.final/ [Accessed 28 August 2004]

Broder, A. (2002) “A Taxonomy of Web Search”, ACM SIGIR Forum, 23 (2) 3-10.

Brown, E. W., Smeaton, A.F. (1998) “ Information Retrieval for the Web”, ACM SIGIR Forum [Online] 32 (2) 8-13. http://portal.acm.org/ [Accessed 28 August 2004]

Clarke, S., Willett, P. (1997) “Estimating the recall performance of Web search engines”, ASLIB Proceedings, 49 (7) 184-189.

Clausen, A. (2003a) How much does it cost to get a good Google PageRank? [Online]. Unpublished working paper. Victoria, Australia. http://members.optusnet.com.au/clausen/ideas/google/google-subvert.pdf [Accessed 16 May 2004]

Clausen, A. (2003b) Online Reputation Systems: The cost of attack of PageRank [Online]. Unpublished Thesis, Melbourne: University of Melbourne. http://members.optusnet.com.au/clausen/reputation/rep-cost-attack.pdf [Accessed 16 May 2004]

Cho, J., Shivakumar, N., Garcia-Molina, H. (2000) “Finding replicated web collections” In: SIGMOD 2000 [Online]. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. 15-18 May 2000, Dallas, Texas, USA.

Cowan, W. (2004) “Wheat v Chaff: Identifying the links that will make an impact on a ranking” in Search Engine Strategies Conference, 2-3 June 2004, London.

Crockatt, T. (2004) “Search Term Research and Targeting” in Search Engine Strategies Conference, 2-3 June 2004, London.

Cutts, M. (2004a) “Link Building Basics” in Search Engine Strategies Conference, 2-3 June 2004, London.

Cutts, M. (2004b) “Successful Website Architecture” in Search Engine Strategies Conference, 2-3 June 2004, London.

DTI (2004) Search Engine Positioning Factsheet [Online] London: Department of Trade and Industry. http://www.dti.gov.uk [Accessed 28 August 2004]

Eastman, C., Jansen, B. (2003) “Coverage, Relevance and Ranking: The Impact of Query Operators on Web Search Engine Results”, ACM Transactions on Information Systems [Online]. 21 (4) 383-411. http://portal.acm.org/ [Accessed 28 August 2004]

Eiron, N., McCurley, K., Tomlin, J. (2004) “Ranking the Web Frontier” In: WWW 2004 [Online]. Proceedings of the 13th International World Wide Web Conference 2004. May 17-24 2004, New York, USA. http://portal.acm.org/ [Accessed 28 August 2004]

Field, A. (2000) Discovering Statistics Using SPSS for Windows. London: Sage Publications.

Ford, N., Miller, D., Ross, N. (2002) “Web search strategies and retrieval effectiveness: An empirical study”, Journal of Documentation [Online]. 58 (1) 30-48. http://www.emeraldinsight.com/0022-0418.htm [Accessed 28 August 2004]

Galistky, B., Levene, M. (2004) “On the economy of Web links: Simulating the exchange process”, First Monday [Online]. 9 (1). http://firstmonday.org/issues/issue9_1/galitsky/index.html [Accessed 28 August 2004]

Gibson, D., Kleinberg, J., Raghavan, P. (1998) “Inferring Web Communities from Link Topology” [Online]. Proceedings of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and space – structure in hypermedia systems. 20-24 June 1998, Pittsburgh, Pennsylvania, USA. http://portal.acm.org/ [Accessed 28 August 2004]

Goh, D. H., Ang, R.P. (2003) “Relevancy Rankings: Pay for performance search engines in the hot seat”, Online Information Review [Online]. 27 (2) 87-93. www.emeraldinsight.com/oir.htm [Accessed 28 August 2004]

Gomme, B. (2002) Free Search Engine Tips [Online]. No location given. http://www.freesearchenginestips.com/ [Accessed 16 May 2004]

Google (2004a) Webmaster Guidelines [Online]. Mountain View, CA, USA: Google. http://www.google.com/webmasters/guidelines.html [Accessed 16 May 2004]

Google (2004b) Google Zeitgeist [Online]. Mountain View, CA, USA: Google. http://www.google.com/press/zeitgeist.html [Accessed 16 May 2004]

Graham, L., Metaxas, P.T. (2003) “Of Course It’s True; I saw It on the Internet! Critical Thinking in the Internet Era”, Communications of the ACM [Online]. 46 (5) 70-75. http://portal.acm.org/ [Accessed 28 August 2004]

Kent, P. (2004) Search Engine Optimization for Dummies. Hoboken NJ, USA: Wiley.

Kobayashi, M., Takeda, K. (2000) “Information Retrieval on the Web”, ACM Computing Surveys [Online]. 32 (2) 144-173. http://portal.acm.org/ [Accessed 28 August 2004]

Haddon, L (2001) Ranking Internet Search Engines Without Relevance Judgements, Unpublished MSc Dissertation, Sheffield: University of Sheffield.

Hansell, G. (2004) “Language of Your Audience” in Search Engine Strategies Conference, 2-3 June 2004, London.

Haveliwala, T. (2002) “Topic-Sensitive PageRank”. In: WWW2002 [Online]. Proceedings of the Eleventh International World Wide Web Conference, 7-11 May 2002, Honolulu, Hawaii, USA. http://portal.acm.org/ [Accessed 28 August 2004]

Henzinger, M., Heydon, A., Mitzenmacher, M., Najork, M. (2000) “On Near-Uniform URL Sampling” [Online]. In: Proceedings of the 9th International World Wide Web Conference on Computer Networks, 15-19 May 2000, Amsterdam, The Netherlands. http://www9.org/w9cdrom/88/88.html [Accessed 28 August 2004]

Henzinger, M., Motwani, R., Silverstein, C. (2002) “Challenges in Web Search Engines”, ACM SIGIR Forum, 36 (2) 11-22. http://portal.acm.org/ [Accessed 28 August 2004]

Hölscher, C., Strube, G. (2000) “Web search behaviour of Internet experts and newbies”. Computer Networks [Online] 33 () 337-346. http://www.sciencedirect.com [Accessed 16 December 2003]

Kleinberg, J.M. (1999) “Hubs, Authorities and Communities”, ACM Computing Surveys [Online] 31 (4). http://portal.acm.org/ [Accessed 28 August 2004]

Langville, A. N., Meyer, C.D. (2004) “Deeper inside PageRank” [Online]. Not yet published. Accepted by Internet Mathematics. http://meyer.math.ncsu.edu/Meyer/PS_Files/DeeperInsidePR.pdf [Accessed 16 May 2004]

Larson, K., Czerwinski, M. (1998) “Web page Design: Implications of Memory, Structure and Scent for Information Retrieval”. In: CHI ’98 [Online]. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 18-23 April 1998, Los Angeles, USA. http://research.microsoft.com/users/marycz/chi981.htm [Accessed 28 August 2004]

Lawrence, S., Giles, C. (1998) “Searching the World Wide Web”, Science [Online]. 280 (5360) 98-100. http://www.neci.nec.com/~lawrence/science98.html [Accessed 28 August 2004]

Lim, E.P., Tan, C.H., Lim, B.W., Ng, W.K. (1998) “Querying Structured Web Resources”. [Online] In: Proceedings of the third ACM Conference on Digital Libraries. 23-26 June 1998, Pittsburgh, Pennsylvania, USA. http://portal.acm.org/ [Accessed 28 August 2004]

Lloyd-Martin, H. (2004) Search Engine Copywriting: Quick Reference SEO Copywriting Preferred practices Guide. [Online] Bellingham, WA, USA: SuccessWorks. http://www.searchenginewriting.com/makecontact.html [Accessed 28 August 2004]

Mat-Hassan, M., Levene, M. (2003) “Can navigational assistance improve search experience?” First Monday. [Online] 6 (9). http://www.firstmonday.dk/issues/issue6_9/mat/ [Accessed 28 August 2004]


Markel, G. “Optimizing Flash Sites for Search Engines: The inherent challenges of Flash based websites and search engine positioning” in Search Engine Strategies Conference, 2-3 June 2004, London.

Matthew, C. (2004) “Writing for Search Engines” in Search Engine Strategies Conference, 2-3 June 2004, London.

Mindel, A. (2004) “Targeted Keyword Traffic” in Search Engine Strategies Conference, 2-3 June 2004, London.

Mukhopadhyay, D., Giri, D., Singh, S.R. (2003) “An Approach to Confidence Based Page Ranking for User Oriented Web Search”, ACM SIGMOD Record [Online]. 32 (2) 28-33. http://portal.acm.org/ [Accessed 28 August 2004]

Oppenheim, C., Morris, A., McKnight, C. (2000) “The Evaluation of WWW Search Engines”, Journal of Documentation. 56 (2) 190-211.

Pedersen, J., Risvik, K.M. (2004) “Web Search Tutorial” in ACM SIGIR 2004, Proceedings of the 27th annual international conference on Research and development in information retrieval. 25-29 June 2004, Sheffield, UK.

Perkins, A. (2002) “The Classification of Search Engine Spam”, Search Mechanics. [online] White Paper. http://www.ebrandmanagement.com/whitepapers/spam-classification/ [Accessed 28 August 2004]

Perkins, A. (2004) “Successful Site Architecture” in Search Engine Strategies Conference, 2-3 June 2004, London.

Ridings, C. (2001) PageRank Explained: or “Everything you’ve always wanted to know about PageRank” [Online]. No location: The Black Box Group. http://www.googlerank.com/ranking/pagerank.html [Accessed 16 May 2004]

Rose, D., Levinson, D. (2004) “Understanding User Goals in Web Search”. In: WWW 2004 [Online]. Proceedings of the Thirteenth International World Wide Web Conference. 17-24 May 2004, New York, USA. http://portal.acm.org/ [Accessed 28 August 2004]

Search Engine Watch (2004) Search Engine Resources [Online]. Darien, CT, USA: Jupitermedia. http://searchenginewatch.com/resources/index.php [Accessed 16 May 2004]

Search Engine World (2004) [Online]. Moville, IA, USA: PHD Software Systems. http://www.searchengineworld.com/ [Accessed 16 May 2004]

Sherman, C., Price, G. (2001) The Invisible Web: Uncovering Information Sources Search Engines Can’t See. Medford, New Jersey: Cyberage Books.

Silverstein, C., Henzinger, M., Marais, H., Moricz, M. (1999) “Analysis of a Very Large Web Search Engine Query Log”, ACM SIGIR Forum [Online]. 33 (1) 6-12. http://portal.acm.org/ [Accessed 28 August 2004]

Soboroff, I., Nicholas, C., Cahan, P. (2001) “Ranking retrieval systems without relevance judgements”. In: SIGIR ’01 [Online]. Proceedings of the 24th International ACM SIGIR Conference on Research and Development in Information Retrieval. 9-12 September 2001, New Orleans, Louisiana, USA. New York: ACM Press. http://portal.acm.org/ [Accessed 15 May 2004].

Soboroff, I. (2002) “Do TREC Web Collections Look Like the Web?”, ACM SIGIR Forum, 36 (2) 23-31. http://portal.acm.org/ [Accessed 28 August 2004]

Sullivan, D. (2004a) “Intro to Search Engine Marketing” in Search Engine Strategies Conference, 2-3 June 2004, London.

Sullivan, D. (2004b) “Nielsen NetRatings Search Engine Ratings”, Search Engine Watch [Online]. 23 February. Darien, CT, USA: Jupitermedia. http://searchenginewatch.com/reports/article.php/2156451 [Accessed 16 May 2004]

Sullivan, D. (2004c) “Researching Keywords at Major Search Engines” in Search Engine Strategies Conference, 2-3 June 2004, London.

Svendsen, M.D. (2004) “Indexing Dynamic Websites: Summary of problems and solutions” in Search Engine Strategies Conference, 2-3 June 2004, London.

Thelwall, M. (2000a) “Commercial Websites: Lost in cyberspace?”, Internet Research: Electronic Networking Applications and Policy [Online]. 10 (2) 150-159. http://www.emerald-library.com [Accessed 28 August 2004]

Thelwall, M. (2000b) “Web Impact Factors and Search Engine Coverage”, Journal of Documentation [Online]. 56 (2) 185-189. http://www.aslib.co.uk/jdoc/2000/mar/rb01.html [Accessed 28 August 2004]

Thelwall, M. (2001a) “Commercial Website Links”, Internet Research: Electronic Networking Applications and Policy [Online]. 11 (2) 114-124. http://www.emerald-library.com/ft [Accessed 28 August 2004]

Thelwall, M. (2001b) “Web log file analysis: back-links and queries”, Aslib Proceedings [Online]. 53 (6) 217-223. http://www.aslib.co.uk/proceedings/2001/jun/02.html [Accessed 28 August 2004]

Thelwall, M. (2002) “Subject gateway sites and search engine ranking”, Online Information Review [Online]. 26 (2) 101-107. http://www.emeraldinsight.com/1468-4527.htm [Accessed 28 August 2004]

Thelwall, M., Vaughan, L. (2004) “New versions of PageRank employing alternative Web document models”, Aslib Proceedings [Online]. 56 (1) 24-33. http://www.emeraldinsight.com/0001-253X.htm [Accessed 28 August 2004]

Thurow, S. (2003) Search Engine Visibility. Indianapolis IN, USA: New Riders.

Thurow, S. (2004a) “Designing Search Engine Friendly Web Sites” in Search Engine Strategies Conference, 2-3 June 2004, London.

Thurow, S. (2004b) “Optimizing Flash and Non-HTML Sites: PDF Files” in Search Engine Strategies Conference, 2-3 June 2004, London.

Thurow, S. (2004c) “Successful Site Architecture” in Search Engine Strategies Conference, 2-3 June 2004, London.

Tomlin, J.A. (2003) “A New Paradigm for Ranking Pages on the World Wide Web”. In: WWW2003 [Online]. Proceedings of the Twelfth International World Wide Web Conference, 20-24 May 2003, Budapest, Hungary. http://portal.acm.org/ [Accessed 28 August 2004]

Tsoi, A.C., Morini, G., Scarselli, F., Hagenbuchner, M., Maggini, M. (2003) “Adaptive Ranking of Web Pages”. In: WWW2003 [Online]. Proceedings of the Twelfth International World Wide Web Conference, 20-24 May 2003, Budapest, Hungary. http://portal.acm.org/ [Accessed 28 August 2004]

Webmaster World (2004) [Online]. No location given: WebmasterWorld.com http://www.webmasterworld.com/ [Accessed 16 May 2004]

Whalen, J. (2001) Plant your site at the top of Mt. Search Engine [Online]. Ashland, MA, USA: HighRankings.com. http://www.highrankings.com/mtsearch.htm [Accessed 16 May 2004]

Whalen, J. (2004) “Writing for the Search Engines: Editing strategies and opportunities” in Search Engine Strategies Conference, 2-3 June 2004, London.

Wolf, J., Squillante, M., Yu, P., Sethuraman, J., Ozsen, L. (2002) “Optimal Crawling Strategies for Web Search Engines”. In: WWW2002 [Online]. Proceedings of the Eleventh International World Wide Web Conference, 7-11 May 2002, Honolulu, Hawaii, USA. http://portal.acm.org/ [Accessed 28 August 2004]
