Increasing the Robustness of Networked Systems Srikanth Kandula
Total Page:16
File Type:pdf, Size:1020Kb
Increasing the Robustness of Networked Systems by Srikanth Kandula M.S., University of Illinois at Urbana-Champaign () B.Tech., Indian Institute of Technology, Kanpur () Submitted to the Department of Electrical Engineering and Computer Science in partial ful llment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February © Srikanth Kandula, MMIX. All rights reserved. e author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part. Author............................................. ................. Department of Electrical Engineering and Computer Science September , Certi edby......................................... ................. Dina Katabi Associate Professor esis Supervisor Acceptedby......................................... ................ Terry P.Orlando Chairman, Department Committee on Graduate Students Increasing the Robustness of Networked Systems by Srikanth Kandula Submitted to the Department of Electrical Engineering and Computer Science on February, , in partial ful llment of the requirements for the degree of Doctor of Philosophy Abstract What popular news do you recall about networked systems? You’ve probably heard about the several hour failure at Amazon’s computing utility that knocked down many startups for several hours, or the attacks that forced the Estonian government web-sites to be inaccessible for several days, or you may have observed inexplicably slow responses or errors from your favorite web site. Needless to say, keeping networked systems robust to attacks and failures is an increasingly signi cant problem. Why is it hard to keep networked systems robust? We believe that uncontrollable in- puts and complex dependencies are the two main reasons. e owner of a web-site has little control on when users arrive; the operator of an ISP has little say in when a ber gets cut; and the administrator of a campus network is unlikely to know exactly which switches or le-servers may be causing a user’s sluggish performance. Despite unpredictable or mali- cious inputs and complex dependencies we would like a network to self-manage itself, i.e., diagnose its own faults and continue to maintain good performance. is dissertation presents a generic approach to harden networked systems by distin- guishing between two scenarios. For systems that need to respond rapidly to unpredictable inputs, we design online solutions that re-optimize resource allocation as inputs change. For systems that need to diagnose the root cause of a problem in the presence of complex sub- system dependencies, we devise techniques to infer these dependencies from packet traces and build functional representations that facilitate reasoning about the most likely causes for faults. We present a few solutions, as examples of this approach, that tackle an important class of network failures. Speci cally, we address () re-routing trac around congestion when trac spikes or links fail in internet service provider networks, () protecting web- sites from denial of service attacks that mimic legitimate users and () diagnosing causes of performance problems in enterprises and campus-wide networks. rough a combination of implementations, simulations and deployments, we show that our solutions advance the state-of-the-art. esis Supervisor: Dina Katabi Title: Associate Professor Contents Prior Published Work Acknowledgments Introduction . Robustness as a Resource Management Problem ............... .. Online Trac Engineering ...................... .. Protecting Against Denial-of-Service Attacks that Mimic Legitimate Requests ................................ . Robustness as a Learning Problem ...................... .. Learning Communication Rules ................... .. Localizing Faults and Performance Problems ............ Defending Against DDoS Attacks at Mimic Flash Crowds . Overview .................................... . reat Model .................................. . Separating Legitimate Users from Attackers ................. .. Activating the Authentication Mechanism .............. .. CAPTCHA-Based Authentication .................. .. Authenticating Users Who Do Not Answer CAPTCHAs ...... .. Criterion to Distinguish Legitimate Users and Zombies ....... . Managing Resources Eciently ........................ .. Results of the Analysis ......................... .. Adaptive Admission Control ..................... . Security Analysis ................................ . Kill-Bots System Architecture ......................... . Evaluation ................................... .. Experimental Environment ...................... .. Microbenchmarks ........................... .. Kill-Bots under Cyberslam ...................... .. Kill-Bots under Flash Crowds ..................... .. Bene t from Admission Control ................... .. User Willingness to Solve Puzzles ................... . Related Work .................................. . Generalizing & Extending Kill-Bots ...................... . Limitations & Open Issues ........................... . Conclusion ................................... Responsive Yet Stable Trac Engineering . Overview .................................... . Problem Formalization ............................. . TeXCP Design ................................. .. Path Selection ............................. .. Probing Network State ........................ .. e Load Balancer ........................... .. Preventing Oscillations and Managing Congestion ......... .. Shorter Paths First ........................... . Analysis ..................................... . Performance .................................. .. Topologies & Trac Demands .................... .. Metric ................................. .. Simulated TE Techniques ....................... .. Comparison With the OSPF Optimizer ............... .. Comparison With MATE ....................... .. TeXCP Convergence Time ...................... .. Number of Active Paths ........................ .. Automatic Selection of Shorter Paths ................. . Implementation & Deployment ........................ . Related Work .................................. . Further Discussion ............................... . Concluding Remarks .............................. Learning Communication Rules . Overview .................................... . Learning Communication Rules ........................ .. From Dependency to Association .................. .. Signi cance of a Communication Rule ................ .. Generic Communication Rules .................... . Algorithms to Mine for Rules ......................... .. Composing Communication Rules .................. .. Towards a Streaming Solution .................... . Evaluation ................................... .. Dataset ................................. .. Metrics ................................. .. Nature of the Rule-Mining Problem ................. .. Evaluation in Controlled Settings ................... .. Micro-Evaluation ........................... .. Case Study: Patterns in the Enterprise ................ .. Case: Patterns in the LabEnterprise .................. .. Case: Patterns on the Lab’s Access Link ................ .. Case Study: Rules for HotSpot Traces ................ . Discussion ................................... . Related Work .................................. . Conclusion ................................... Localizing Faults and Performance Problems . Overview .................................... . Related Work .................................. . e Inference Graph Model .......................... .. e Inference Graph .......................... .. Fault Localization on the Inference Graph .............. . e Sherlock System .............................. .. Constructing the Inference Graph .................. .. Fault Localization Using Ferret .................... . Implementation ................................ . Evaluation ................................... .. Micro-Evaluation ........................... .. Evaluation of Field Deployment ................... .. Comparing Sherlock with Prior Approaches ............. .. Time to Localize Faults ........................ .. Impact of Errors in Inference Graph ................. .. Modeling Redundancy Techniques .................. .. Summary of Results .......................... . Discussion ................................... . Conclusions .................................. Concluding Remarks Appendix A. Analysing Kill-Bots ............................... B. Interaction between Admission Control and Re-transmissions ....... C. Stability of e Kill-Bots Controller ...................... D. Proof of TeXCP Stability due to Explicit Feedback .............. E. Proof of TeXCP Load Balancer Convergence ................. F. Value of Preferring Shorter Paths in TeXCP .................. G. Propagating State for Meta-Nodes in Sherlock ................ Bibliography Prior Published Work is thesis builds on some prior published work. Chapter contains joint work with Dina Katabi, Matthias Jacob and Arthur Berger pub- lished at NSDI’ []. Chapter contains joint work with Dina Katabi, Bruce Davie and Anna Charny that was published at SIGCOMM’ []. Chapter contains joint work with Ranveer Chandra and Dina Katabi published at SIGCOMM’ []. Chapter builds on Sherlock [], joint work with Victor Bahl, Ranveer Chandra, Albert