Tamper-Resilient Methods for Web-Based Open Systems
Total Page:16
File Type:pdf, Size:1020Kb
TAMPER-RESILIENT METHODS FOR WEB-BASED OPEN SYSTEMS A Thesis Presented to The Academic Faculty by James Caverlee In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the College of Computing Georgia Institute of Technology August 2007 Copyright c 2007 by James Caverlee TAMPER-RESILIENT METHODS FOR WEB-BASED OPEN SYSTEMS Approved by: Ling Liu, Advisor Sham Navathe College of Computing College of Computing Georgia Institute of Technology Georgia Institute of Technology William B. Rouse, Co-Advisor Sushil K. Prasad Tennenbaum Institute, Industrial & Department of Computer Science Systems Engineering (Joint Appointment Georgia State University with College of Computing) Georgia Institute of Technology Jonathon Giffin Calton Pu College of Computing College of Computing Georgia Institute of Technology Georgia Institute of Technology Date Approved: 18 June 2007 For Sherry iii ACKNOWLEDGEMENTS I am pleased to acknowledge the generous support of the Tennenbaum Institute in providing me a Tennenbaum Fellowship for this research. In addition, I am grateful for the positive influence and the many contributions of a number of special people – people who have encouraged, guided, and inspired me. My parents are my foundation. Sam and Ellen Caverlee are two incredibly encouraging and supportive parents. They have been with me every step of the way, encouraging me to push myself to my limits and supporting me when I fall short. They are an inspiration to me and provide a blueprint for how to live a moral life. I love you, Mom and Dad. My brother John has always been there for me. John has a way of grounding me, so that I maintain my focus and don’t float off into a self-imposed nerd orbit. My grandfather Jim is a constant source of inspiration. The rest of my family is equally amazing. I am fortunate to have such a great family. When I first met Ling Liu in the fall of 2002, I knew I had met someone with a fierce intellect and strong opinions. As I have grown to know her, I have seen firsthand her generous spirit and tireless work ethic. Ling has had the single greatest impact on my intellectual and professional development, and she deserves much of the credit for leading me on this journey. Ling is an inspiration to me, and I hope to guide my students as well as she has guided me. I am especially grateful to Bill Rouse. Bill is thoughtful and engaging, and it has been a real pleasure to participate in the growth of the Tennenbaum Institute. Bill has been a model for my professional development, and I would be happy to achieve half of what he has. I am thankful to Calton Pu, another fine model for my professional development. Calton has always offered welcome advice, as well as encouraged me along the way. Leo Mark has been a great source of support from my first days at Georgia Tech. His enthusiasm has been a great motivator. I am indebted to Ling, Bill, Calton, Jonathon Giffin, Sham Navathe, and Sushil K. iv Prasad for reading and commenting on my thesis. Dave Buttler and Dan Rocco welcomed me to Georgia Tech and showed me what it takes to be a graduate student. Li Xiong and Steve Webb were both great officemates and are now my good friends. I have had the great opportunity to collaborate with a number of really smart and interesting characters in my time at Georgia Tech. I am especially grateful to the members of the Distributed Data Intensive Systems Lab. Thanks to Susie McClain and Deborah Mitchell for helping me find my way in the maze that is graduate school. Outside of school, I have been fortunate to have a wonderful group of friends at Marietta First United Methodist Church. Sometimes I think my wife Sherry is a superhero, only without the cape. She is an amazing person, a source of boundless love and energy, and my best friend. She has endured much because of me. I am a better man because of her. Thank you, sweetie. Finally, this acknowledgment would not be complete without a nod to my two beautiful children. Luke and Eva, you are an inspiration. Know that I love you always. v TABLE OF CONTENTS DEDICATION ....................................... iii ACKNOWLEDGEMENTS ................................ iv LIST OF TABLES .................................... xii LIST OF FIGURES .................................... xiii SUMMARY ......................................... xvii I INTRODUCTION .................................. 1 1.1 Research Challenges . 2 1.2 Overview of Thesis . 4 II WEB RISKS AND VULNERABILITIES ..................... 8 2.1 Content-Based Web Risks . 8 2.1.1 Badware . 8 2.1.2 Page Spoofing . 10 2.1.3 Offensive Content . 10 2.1.4 Useless or Derivative Content . 11 2.1.5 Propaganda . 12 2.1.6 Discussion . 12 2.2 Experience-Based Web Risks . 13 2.2.1 Social Engineering . 14 2.2.2 Redirection . 14 2.2.3 Piggy-Back . 15 2.2.4 Web Search Engines . 15 2.2.5 Online Communities . 17 2.2.6 Categorization and Integration Services . 19 2.3 Mitigating Web Risks . 20 III SOURCE-CENTRIC WEB RANKING ...................... 23 3.1 Link-Based Vulnerabilities . 26 3.1.1 Hijacking-Based Vulnerabilities . 26 3.1.2 Honeypot-Based Vulnerabilities . 27 vi 3.1.3 Collusion-Based Vulnerabilities . 28 3.1.4 PageRank’s Susceptibility to Link-Based Vulnerabilities . 29 3.2 Source-Centric Link Analysis . 30 3.2.1 Overview . 31 3.2.2 Parameter 1: Source Definition . 33 3.2.3 Parameter 2: Source-Centric Citation . 35 3.2.4 Parameter 3: Source Size . 38 3.2.5 Parameter 4: Influence Throttling . 39 3.3 Applying SLA to Web Ranking . 39 3.3.1 Mapping Parameters . 41 3.3.2 Spam-Proximity Throttling . 43 3.3.3 Evaluating Objectives . 44 3.4 Spam Resilience Analysis . 45 3.4.1 Link Manipulation Within a Source . 46 3.4.2 Cross-Source Link Manipulation . 47 3.4.3 Comparison with PageRank . 50 3.5 Experimental Evaluation . 53 3.5.1 Experimental Setup . 54 3.5.2 Measures of Ranking Distance . 55 3.5.3 Objective-Driven Evaluation . 56 3.5.4 Time Complexity . 57 3.5.5 Stability – Ranking Similarity . 58 3.5.6 Stability – Link Evolution . 60 3.5.7 Approximating PageRank . 62 3.5.8 Spam-Resilience . 64 3.5.9 Influence Throttling Effectiveness . 71 3.5.10 Summary of Experiments . 72 3.6 Related Work . 73 3.7 Summary . 74 IV CREDIBILITY-BASED LINK ANALYSIS .................... 75 4.1 Reference Model . 77 vii 4.1.1 Web Graph Model . 77 4.1.2 Link-Based Ranking: Overview . 78 4.2 Link Credibility . 80 4.2.1 Naive Credibility . 81 4.2.2 k-Scoped Credibility . 81 4.3 Computing Credibility . 83 4.3.1 Tunable k-Scoped Credibility . 83 4.3.2 Implementation Strategy . 87 4.4 Time-Sensitive Credibility . 88 4.4.1 Credibility Over Time . 88 4.4.2 Choice of Importance Weight . 89 4.4.3 Missing Page Policy . 90 4.5 Credibility-Based Web Ranking . 90 4.6 Blacklist Expansion . 92 4.7 Experimental Evaluation . 93 4.7.1 Setup . 94 4.7.2 Credibility Assignment Evaluation . 95 4.7.3 Spam Resilience Evaluation . 102 4.7.4 Blacklist Expansion . 111 4.8 Related Work . 113 4.9 Summary . 113 V TRUST ESTABLISHMENT IN ONLINE COMMUNITIES . 114 5.1 Reference Model . 116 5.2 Vulnerabilities in Online Social Networks . 119 5.3 The SocialTrust Framework . 121 5.4 Trust Establishment Scope . 123 5.4.1 Trust Group Formation: Browse-Based Search . 124 5.4.2 Trust Group Formation: Forwarding-Based Search . 124 5.5 Assessing Trust Group Feedback . 125 5.5.1 Open Voting . 127 5.5.2 Restricted Voting . 127 viii 5.5.3 Trust-Aware Restricted Voting . 128 5.6 Assessing Relationship Link Quality . 128 5.6.1 Link Quality as a Scoped Random Walk . 129 5.6.2 Correction Factor . 130 5.7 Assessing the Quality Component of Trust . 133 5.7.1 Popularity . 133 5.7.2 Basic Random Walk Model . 134 5.7.3 Incorporating Link Quality . 134 5.7.4 Incorporating Trust Group Feedback . 136 5.7.5 The SocialTrust Model . 136 5.8 Assessing the History Component of Trust . 137 5.8.1 Historical Trust Ratings . 137 5.8.2 Fading Memories . 138 5.9 Personalized Trust Aggregation with mySocialTrust . 139 5.9.1 Basic Personalization . 140 5.9.2 Random Walk Personalization . 140 5.10 Experimental Setup . 141 5.10.1 Data . ..