BUILDING BETTER RANKINGS: OPTIMISING WEBSITES FOR HIGHER SEARCH ENGINE POSITIONING

A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Management at THE UNIVERSITY OF SHEFFIELD

by PETER GERAINT MASON

September 2004

Abstract

The rapid growth of the World Wide Web has seen a profusion of websites on every conceivable subject. Search engines have developed to make sense of this chaos for users. Most search engines use automated programs known as crawlers or spiders to find web pages by following hyperlinks and indexing text. These results are then ranked according to a relevance algorithm in response to a user's query. Many website operators seek to improve the search engine rankings of their sites by a variety of means, collectively known as search engine optimisation. These approaches range from simple principles of good design to more duplicitous means such as cloaking, keyword stuffing and link trees. An extensive literature review explores a variety of themes surrounding search engines in general, and search engine optimisation in particular. Following this, an experiment involving keyword density analysis of a sample of nearly 300 web pages is presented. Statistical analysis reveals the contribution made by a number of factors (including document length, keyword density and PageRank) to a web page's search rankings in the Google search engine.

Acknowledgements

There are several people I would like to acknowledge for their help and advice during the course of this dissertation. I would like to thank my supervisor, Dr. Mark Sanderson, for his support and advice, and for being patient with my writer's block. Thanks also go to Jean Russell of the University of Sheffield's Corporate Information and Computing Services for her help in explaining how to turn my mass of data into meaningful information.
My parents, Rosalind and Peter Mason, have been of immeasurable support, and have endured the rigours of proof-reading with ease. Finally, I would like to thank Matt Cutts of Google and Danny Sullivan of Jupiter Research for each taking a few minutes to discuss the project and offer their advice.

Contents

Abstract
Acknowledgements
Contents
1. Introduction
   1.1 The World Wide Web
   1.2 Directories
   1.3 Search Engines
2. Literature review
   2.1 Background to search engines on the Web
   2.2 Link Structure
   2.3 The Invisible Web
   2.4 Evaluating search engines
   2.5 Searcher behaviour
   2.6 Site Content
   2.7 Architecture
   2.8 Dynamic content
   2.9 Spam
      2.9.1 Text Spam
      2.9.2 Redirects
      2.9.3 Cloaking
      2.9.4 Link Spam
3. Research Methodology
   3.1 Query set
   3.2 Keyword density analysis
   3.3 Limitations
4. Presentation and discussion of results
   4.1 GoRank Analysis Results
   4.2 Statistical Analysis
      4.2.1 Frequency Analysis
      4.2.2 Spearman's Rank Correlation
      4.2.3 Regression Analysis
   4.3 Summary
5. Conclusions and further research
Appendix: Keyword Density Analysis Results
Bibliography

16,388 Words

1. Introduction

This dissertation will present a study of the practice of 'search engine optimisation' (SEO): modifying or tuning web pages so that they will rank higher in search engine results pages (SERPs) for particular queries, known as 'keywords' or key phrases. It will consist of a summary of search engine optimisation techniques and a review of literature on the subject, both academic and commercial. The importance of keyword density for rankings in Google, the leading search engine at the time of writing, will be investigated with the use of experimental data.

1.1 The World Wide Web

The World Wide Web is a vast collection of information. In the ten or so years since it originated, it has grown into a huge, chaotic mass of heterogeneous documents. Because it has developed organically as the number of Web users has grown, there is little organisation; indeed, it is this very freedom and flexibility which enables many of the Web's benefits: virtually anyone can use it, and anyone with a modicum of knowledge can publish on it. The number of people with web access is growing very rapidly, and is currently estimated at over 700 million. Google alone indexes well over 4 billion web pages at the time of writing, according to its homepage. Of course, the very decentralised nature of the World Wide Web makes it impossible to measure these figures exactly, but the sheer scale of the document collection clearly creates a difficult information retrieval problem.
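As a concrete illustration of the keyword density measure central to this study, the sketch below assumes the common single-keyword formulation (occurrences of the keyword divided by total word count, expressed as a percentage). It is an illustrative calculation only, not the GoRank analysis tool used in the experiment described later; the function and sample text are hypothetical.

```python
def keyword_density(text: str, keyword: str) -> float:
    """Percentage of words in `text` that match `keyword` (case-insensitive).

    Common single-word formulation: density = occurrences / total words * 100.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)


# Hypothetical page text for illustration.
page_text = "cheap flights to spain book cheap flights online today"
print(round(keyword_density(page_text, "flights"), 1))  # → 22.2
```

Real analysis tools refine this in various ways, for example by weighting keywords that appear in titles or headings, or by handling multi-word key phrases, but the underlying ratio is the same.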
Because of the diverse and unstructured nature of the web, it would be almost impossible to locate relevant information through the medium without the aid of some sort of tool. The main tools which have grown up to service these needs are directories and search engines.

1.2 Directories

Directories are categorised collections of hyperlinks that list websites (rather than specific web pages) by topic. These usually follow a library-style hierarchical model, in which broad high-level categories contain smaller, more specific categories, and the user must navigate or 'drill down' to the specific subject they are interested in. These directories are frequently compiled and edited by humans, who review websites that are submitted by their owners and decide in what category, if any, the site will be listed. Most directories have some sort of editorial policy and do not accept all site submissions. Because of the human input involved in running a directory, there is an effective limit on how comprehensive it can be: there are simply too many websites to list them all. Thus, in information retrieval terms, the coverage of a directory is typically quite low, as the document collection consists only of those sites that have been submitted and approved. Even Yahoo!, formerly the leading web directory and portal, has de-emphasised its directory in favour of search. At its height, the Yahoo! directory had approximately 1,500 human editors; now it employs just 20 (Risvik, 2004).

1.3 Search Engines

Search engines take a different approach to the problem of finding information on the web. They are more like information retrieval systems in that the user inputs a query, and a list of web pages is returned which are judged to be relevant to that query. These are generally sorted by a mathematical algorithm or ranking function, to present the pages which are most likely to be relevant first. Because they are automated, and