Link Analysis Algorithms to Handle Hanging and Spam Pages
Total Page:16
File Type:pdf, Size:1020Kb
Department of Computing Link Analysis Algorithms to Handle Hanging and Spam Pages Ravi Kumar Patchmuthu This thesis is presented for the Degree of Doctor of Philosophy of Curtin University June 2014 Declaration Declaration To the best of my knowledge and belief this thesis contains no material previously published by any other person except where due acknowledgment has been made. This thesis contains no material which has been accepted for the award of any other degree or diploma in any university. Signature : ………………………… (RAVI KUMAR PATCHMUTHU) Date : 18th June 2014 Abstract Abstract In the Web, when a page does not have any forward or outgoing links, then that page can be called as hanging page/dangling page/zero-out link page/dead end page. Most of the ranking algorithms used by search engines just ignore the hanging pages. However, hanging pages cannot be just ignored because they may have relevant and useful information like .pdf, .ppt, video and other attachment files. Hanging pages are one of the hidden problems in link structure based ranking algorithms because: They do not propagate the rank scores to other pages (important function of link structure based ranking algorithms). They can be compromised by spammers to induce link spam in the Web. They can affect Website optimization and the performance of link structure based ranking algorithms. A detailed literature survey on link structure based ranking algorithms, hanging pages and their effect on Web information retrieval was conducted. Different link structure based ranking algorithms were explored and compared. Also, literature review on Web spam and in particular link spam was conducted. Finally, a detail literature survey on Web site optimization was conducted. The following are the research objectives: Handling Hanging Pages (HP) in the link structure based ranking algorithms. PageRank is used as the base algorithm throughout this research. I Abstract Developing Hanging Relevancy Algorithm (HRA) to produce fair and relevant ranking results by including only the relevant hanging pages. Developing Link Spam Detection (LSD) algorithm to analyse and detect the effect of hanging pages in link spam contribution. Developing techniques and methods to improve Web Site Optimization (WSO) by studying the effect of hanging pages in Search Engine Optimization (SEO). The following are the methodologies used to meet the research objectives: Web Graph is implemented where nodes are treated as Web pages and edges between nodes are treated as hyperlinks. PageRank algorithm is simulated, matrix interface as well as graph interface is created and the ranks of Web pages are computed. Experiments are conducted to show the effects of hanging pages and methods are proposed to handle hanging pages in ranking Web pages. Hanging Relevancy Algorithm (HRA) is implemented and only the relevant hanging pages are included in the rank computation to reduce the complexity. Link Spam Detection (LSD) algorithm is implemented to detect link spam contributed by hanging pages. Experiments are conducted to show the effect of hanging pages in Search Engine Optimization (SEO) and factors are proposed to build optimized Web sites. PageRank algorithm was implemented and experiments were carried out to show its convergence. Three publicly available datasets, WEBSPAM-UK2006, WEBSPAM- II Abstract UK2007 and EU2010 were used for the experiments, apart from live data from the World Wide Web. Several experiments were conducted to show the effects of hanging pages on Web page ranking. Methods were proposed to include all the hanging pages in PageRank computation and to compare them with PageRank algorithm. The proposed methods were slower than PageRank algorithm but produced more relevant results. The experiments showed the percentage of hanging pages in the following datasets: WEBSPAM-UK2006 - 21.35%, WEBSPAM- UK2007 - 43.11%, EU2010 - 54.21% and the Curtin University (Sarawak) Website - 35.57%. The study showed that the hanging pages are keep increasing in the Web. Hanging Relevancy Algorithm (HRA) is implemented and it produced more relevant results with less computation time compared to including all the hanging pages in the computation. The experiment also showed that the ranks of certain relevant hanging pages were increased by four, signifying that these pages deserved a better ranking. Experiments were carried out using live Web data to prove the contribution of link spam by hanging pages. The results showed that the rank of the target page after link spam had increased by two and the order had also improved. This proved that hanging pages contributed to spam and the proposed method had detected link spam contributed by the hanging pages. Experiments were done to determine the On-Site and Off-Site ranking factors by taking www.curtin.edu.my as a sample Website. The link analysis experiment showed that 90% of the sample Website's back links are external links and only 10% are internal back links. 80% of the sample Website's links are followed back links and only 20% of them are no-followed back links. The experiments showed different ranking factors and also suggested factors to improve the ranking of the particular Website. The research study has therefore, helped to improve the rankings of relevant hanging pages and reduce the link spam contributed by hanging pages in the Search Engine Result Pages of link structure based ranking algorithms. III Acknowledgement Acknowledgement This thesis would not have been possible without the guidance and help of several individuals, who in one way or another, contributed or extended their valuable assistance in the preparation and completion of this research study. First, I would like to thank my Supervisor, Dr. Zhuquan Zang (Associate Professor, HOD, Department of Electrical and Computer Engineering), for providing me with all the administrative support and guidance throughout this research journey. Next, I would like to express my utmost gratitude to my Co-Supervisor/Associate Supervisor, Dr. Ashutosh Kumar Singh (Ex. Associate Professor, Department of Electrical and Computer Engineering, Curtin University, (Miri Campus), currently, Professor, Department of Computer Application, NIT, Kurukshetra, India) to whom I was reporting during this research journey. I have enjoyed his energetic and enthusiastic thought provoking ideas, constructive comments and valuable advices during the research discussions; much of this is reflected in my research and in the thesis as a whole. I am really grateful for Dr. Singh's constant support, guidance and encouragement, and deeply thankful and proud of him as my supervisor. The enthusiasm and dedication he has shown towards research, has been commendable and particularly motivational for me, during the tough stages of this research endeavour. I would like to acknowledge the support given by the Chairperson of the Doctoral Thesis Committee, Dr. Chua Han Bing (Associate Professor, HOD, Chemical Engineering) and the ex-chairperson Dr. K. Shenbaga (ex-Dean, Research and IV Acknowledgement Development, Curtin University). Both of them have provided guidance, encouragement and support and shared tremendous knowledge with me. I am also obliged to acknowledge Professor Dr. Michael Cloke (Dean, SOE), Professor Dr. Marcus Man Kon Lee (Associate Dean, Research Training), Associate Professor Dr. Michael Kobina Danquah (Associate Dean, Research and Development) and Nur Afnizan Johan (Tom) for all the support given to me during the course of this research study. I would also like to extend my sincere gratitude to my research colleague, Alex Goh Kwang Leng for participating in the research discussions and contributing to the research study as a whole. I wish him good luck for his journey towards his doctoral study. I like to acknowledge Dr. Anand Mohan, (Director, NIT, Kurukshetra, India) for his expert advice on this research study. My appreciation also goes to Mdm. Sheila Gopinath for proof reading and editing this thesis, Billy Lau Pik Lik for the initial contribution to this research and Dr. Herbert Raj, my brother-in-law, for correcting few chapters. I extend my thanks to others as well, who have helped me either directly or indirectly in this research study. My sincere gratitude goes to Jefri Bolkiah College of Engineering, Kuala Belait, and Ministry of Education, Brunei for supporting me in this research study. Special thanks go to my family, especially to my wife Jelciana, for her constant support and motivation to complete this thesis. I would also like to acknowledge my son Arun and daughter Cynthea for their love and support, as well as my father and my in-laws for their telephonic support and motivation. Last of all but most importantly; I would like to thank Almighty God for providing me with such great strength and opportunity to complete this thesis. V Publications Publications JOURNAL PAPERS 1. Ravi Kumar Patchmuthu, Ashutosh Kumar Singh and Anand Mohan. ―Efficient Methodologies to Overcome the effects of Hanging Pages in Website Optimization‖, International Journal of Web Engineering and Technology. (Scopus, Under Review). 2. Ravi Kumar Patchmuthu, Goh Kwang Leng, Ashutosh Kumar Singh and Anand Mohan. ―Efficient Methodologies to determine the Relevancy of Hanging Pages with Stability Analysis‖, Cybernetics and Systems: An International Journal. (ISI, Under Review). 3. Ravi Kumar Patchmuthu, Ashutosh Kumar Singh and Anand Mohan. ―A New Algorithm for Detection of Link Spam Contributed by Zero-Out-Link Pages‖, Turkish