HIERARCHICAL CLUSTERING WITH GLOBAL OBJECTIVES: APPROXIMATION AND HARDNESS RESULTS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Evangelos Chatziafratis June 2020

© 2020 by Evangelos Chatziafratis. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/bb164pj1759

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Tim Roughgarden, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Moses Charikar, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Li-Yang Tan

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Gregory Valiant

Approved for the Stanford University Committee on Graduate Studies. Stacey F. Bent, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Hierarchical Clustering is an important tool for data analysis, used in diverse areas ranging from Phylogenetics in Biology to YouTube video recommendations and everything in between. The term "hierarchical" is used to contrast with "flat" clustering approaches, like k-means or k-center, which only produce one specific partitioning of a data set. Hierarchical Clustering is a way of understanding the overall structure of large data sets by visualizing them as a collection of successively smaller and smaller clusters, capturing different levels of granularity simultaneously. The end result is a tree, whose internal nodes represent nested clusters and with leaves representing the data points. Despite the plethora of applications and heuristics, the theory behind Hierarchical Clustering was underdeveloped, since no "global" objective function was associated with the final output. A well-formulated objective allows us to compare different algorithms, measure their quality, explain their success or failure and enables continuous optimization methods with gradient descent. This lack of objectives is in stark contrast with the flat clustering literature, where objectives like k-means, k-median or k-center have been studied intensively starting from the 1950s, leading to a comprehensive theory on clustering. In this thesis, we study approximation algorithms and their limitations, making progress towards building such a theory for objective-based Hierarchical Clustering (HC). For the minimization cost function with pairwise similarities introduced recently by Dasgupta (2016) and its maximization variants, we show how standard tools like Sparsest Cut, Balanced Cut, Max Cut or Max Uncut Bisection can be used in a black-box manner to produce provably good HC, and we also complement our algorithms with inapproximability results. Escaping from worst-case analyses, we also study scenarios where extra structure is imposed on the data points (e.g., feature vectors) or qualitative information is provided (e.g., the common triplet or quartet constraints in Phylogenetics). Our results are inspired by geometric ideas (semidefinite programming relaxations, spreading metrics, random projections) and novel connections with ordering Constraint Satisfaction Problems. Overall, this thesis provides state-of-the-art approximation and hardness results for optimizing global HC objectives, reveals unexpected connections with common linkage or graph cut heuristics and advances our understanding of old Phylogenetics problems. As such, we hope it serves the community as the foundation for objective-based Hierarchical Clustering and for future improvements.

Acknowledgments

As I am finishing my thesis from an apartment in New York, I can't help but think how lucky I have been so far, especially throughout those 5 years of my PhD. Featuring a single name on the front cover is merely an understatement, as both this thesis and myself would not be the same without the tremendous help and constant support of several amazing individuals I had the extreme pleasure to meet along my journey in research and in the academic world. Given that no physical PhD defense ever took place at Stanford, I want to take this opportunity to express my gratitude to the people that helped me during this beautiful journey and, more importantly, made it possible. First and foremost, I can't thank enough my advisor Tim Roughgarden for all he has done for me over these past 5 years. From my second quarter at Stanford, when I first rotated and worked with Tim, to our productive research trip to the London School of Economics during his sabbatical, to our path towards New York in Columbia and my next career steps, Tim has always been extremely supportive and patient with my diverse research pursuits, always ready to provide me with his insightful ideas and advice. I owe him so much for his mentorship and ideas, for inspiring me, for teaching me cool stuff and helping me improve my communication skills; and all these, done in the casual environment he created for his group (lunches at Ike's, dinners at NOLA etc.), which allowed me to thrive without ever feeling I was working, rather enjoying research and theory. Outside research, I also want to thank Tim, simply for being lenient with me, accepting my longer-than-usual trips to Greece, without asking questions whether they happened during (early) summer, Christmas, or even March. I also consider myself privileged to have been his unique student simultaneously at Stanford and Columbia, and to have helped with the summer rental car situation in Ikaria and with the cameras and amplifiers during Papafest's talks and band concert. The second person that I want to thank is Moses Charikar. During the spring quarter of my first year, we started exploring hierarchical clustering, a term I had never heard about back then and, unknowingly, the main chapter in this thesis had already started. Collaborating with Moses and trying to make sense of the variety of clustering questions that appeared on our way over time has truly been an immense pleasure. His intuition and clarity of thought, together with his patience in explaining to me possible ways of attacking the problems, have been extremely beneficial. I am also grateful to him for being available for late night Skypes on several occasions and for interrupting his

dinner plans just to answer my texts, admittedly right before the submission deadlines, and finally for battling with the different timezones when I was in Switzerland or in Sweden and he was in India. I also want to thank Jan Vondrak, not only for our collaboration, but also for being a friendly and approachable person in the Math department, whom I always felt comfortable talking to and could always go to with questions and have interesting conversations across many different topics. Of course, I am glad he also agreed to be the Chair of my thesis committee. I want to thank Greg Valiant for being on my orals committee, for letting me TA several of his courses and for his support during summer 2017. Also, for his direct and casual character that always makes you feel comfortable. I also owe a big thanks to Li-Yang Tan for agreeing to be on my quals and my orals committee and for sharing much of his love and expertise in complexity theory. I also want to thank several professors I had the pleasure to work with for a short period: Virginia Vassilevska Williams for an awesome first rotation and for her great teaching style, Nisheeth Vishnoi for an inspiring summer internship at EPFL during summer 2016, which taught me a lot, and Osman Yagan for several visits to CMU in Silicon Valley and in Pittsburgh and for interesting discussions on network robustness. Outside research, I want to thank Omer Reingold for an amazing class on Playback theater and Kumaran Arul for being an excellent and inspiring piano tutor. Watching Ryan Williams playing guitar, singing and improvising during our Theory Music Tuesday Nights has also been quite the experience! A big role in the development of this thesis was played by my amazing collaborators. First of all, I want to thank Rad Niazadeh for the countless hours we spent drawing trees and clusters on the whiteboard, discussing how to round SDPs, late-night editing of papers before submissions, and also for giving me career advice. I want to thank Mohammad Mahdian for being an amazing host in New York during my summer and winter internships, for the always stimulating conversations that may happen at a microkitchen or while searching for available rooms, over video calls, or even while on vacation in Rhodes, Greece. I want to thank Ioannis Panageas for hosting me in Singapore, for his direct and fun character and for his mathematical depth that helped me explore and appreciate the many interesting connections between deep learning and dynamical systems. Moreover, and in no particular order, I also had the pleasure to work with: Neha Gupta, Euiwoong Lee, Sara Ahmadian, Alessandro Epasto, Grigory Yaroslavtsev, Konstantin Makarychev, Sai Ganesh Nagarajan, Xiao Wang, Haris Angelidakis, Avrim Blum, Chen Dan, Pranjal Awasthi, Xue Chen, Aravindan Vijayaraghavan, Osman Yagan, and Joshua Wang. I also want to thank Okke Schrijvers and his team at Facebook for hosting me over the summer of 2018: Julián Mestre, Nicolas Stier-Moses and Birce Tezel. My time at Stanford and New York would not be as fun and carefree as it has been without the necessary balance that came from my friends. First, I would like to thank Mary Ann and Bob Poulos and their amazing family Olivia, Lindsey and Lauren, who took care of me, for their support

and their hospitality. I want to thank my friends in Hellas (Stanford's Greek community) for all the nice events we organized over Easter, Christmas and summer and for all the help and advice they provided during these years: our statistician Nikos Ignatiadis, our tamias (treasurer) Kostis Kaffes and vice-president Faidra Monachou, our doctor Stelios Serghiou and biologist Eirini Tsekitsidou, to (the) Peter Zachares, our petroleum experts Dimitris Belivanis and Philippos Kostakis, our basketball player George Korakitis, our minaras Marios Galanis from Patras, my mentor Yorgos Katsikis and our life-coach Vasilis Verroios. See you in Cholargos, at President's souvlaki or Shisan sushi ;) Also, the other Greeks from my incoming year, George Alexopoulos, George Chrysochoos, Georgios Laskaris, Iakovos Anagnostopoulos, who helped me adapt to the new environment, and Alexios Voulimeneas for the Halloween Sparta costumes. I would also like to thank my friends in New York: Ilias Zadik, Orestis Plevrakis, Themis Melissaris, Lampros Flokas, Manolis Vlatakis, Giannis Karamanolakis, Thodoris Lykouris, and Fotis Iliopoulos for hosting me when in need, for giving me valuable presentation feedback and laughs at online Codenames, and for interesting research discussions. My time in the theory groups at Stanford and Columbia would not be as fun and memorable if it weren't for the many board games, theory retreats, research talks, food and concerns shared with my fellow grad students: in no particular order, thank you to Amir Abboud, Kostas Kollias, Leo Keselman, Josh Alman, Brian Axelrod, Greg Bodwin, Saba Eskadarian, Shivam Garg, Paris Syminelakis, Nicole Wein, Hilal Asi, Sam Kim, Ofir Geri, Neha Gupta, Reyna Hulett, Ray Li, Noah Shutty, Michael Kim, Andrea Lincoln, Annie Mardsen, Tony Kim, Jimmy Wu, Bruce Spang, Kevin Tian, Warut Suksompong, Vasilis Gkatzelis, Weihao Kong, Okke Schrijvers, Dylan McKay, Aditi Raghunathan, Henry Corrigan-Gibbs, Dima Kogan, Yeganeh Alimohammadi, Benedikt Bünz, Florian Tramer, Josh Wang, Vatsal Sharan, Kiran Shiragur, Shashwat Silas, Hongyang Zhang, and Clayton Sanford. I owe huge thanks to Megan Harris, Ruth Harris, Jay Subramanian and Jam Kiattinant for helping me with anything I needed and for always being there for the students. Also, I owe thanks to Katerina Magkel and the Onassis Foundation for supporting in part my studies during the 2019-2020 academic year. Finally, I want to thank my professors Stathis Zachos, Dimitris Fotakis and Aris Pagourtzis for their excellent work, for inspiring so many students during my undergraduate studies at NTUA and for making us love and eventually follow the path towards Algorithms and Theory. Also, thanks to all my friends at EPFL and ETH, Prodromos Kolyvakis, Georgios Damaskinos, Lefteris Kokoris-kogias and Harris Angelidakis, for making my stay in Switzerland so easy and smooth, and my friends in Athens, George Banis, Nikos Mavroulis, Alexandros Banis, Manos Sofoulis, Marios Fouzas, Dennis Triantafillis II, Steve Mitropoulos, Nasos Georgiou, Alexia Athanasiou, Alexis Markazos, and Alexandros Georgopoulos, for their endless supply of interesting conversations, funny stories, memes, discussions and memorable vacations. I want to thank my girlfriend Yuuki Miura for being an amazing companion, for her hospitality

during Big 'Rona, and for her inspiring kindness and good character. Last but not least, I can't thank enough my family for everything they have done for me and continue to do, for everything they have taught me and continue to teach me, and for their endless love and support in whatever I did and continue to do.

Dedicated to my first teachers, my family Chryssa, Anna-Maria, Telemachos, Andreas, Stratos

The First Step

The young poet Evmenis
complained one day to Theocritus:
"I've been writing for two years now
and I've composed only one idyll.
It's my single completed work.
I see, sadly, that the ladder
of Poetry is tall, extremely tall;
and from this first step I'm standing on now
I'll never climb any higher."
Theocritus retorted: "Words like that
are improper, blasphemous.
Just to be on the first step
should make you happy and proud.
To have reached this point is no small deed:
what you've done already is wonderful.
Even this first step
is a long way above the ordinary world.
To stand on this step
you must be in your own right
a member of the city of ideas.
And it's a hard, unusual thing
to be enrolled as a citizen of that city.
Its councils are full of Legislators
no charlatan can fool.
To have reached this point is no small deed:
what you've done already is wonderful."

Constantine P. Cavafy

This thesis is based on previously published and ongoing works:

1. Chapter 4 is based on the following three papers:

• "Approximate Hierarchical Clustering via Sparsest Cut and Spreading Metrics" from SODA 2017 [27]
• "Hierarchical Clustering better than Average-Linkage" from SODA 2019 [28]
• "Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection" from AISTATS 2020 [2].

2. Chapter 5 is based on “Hierarchical Clustering for Euclidean Data” from AISTATS 2019 [29].

3. Chapter 6 is based on the following two papers:

• "Hierarchical Clustering with Structural Constraints" from ICML 2018 [35]
• "Aggregating Inconsistent Information in Ranking, Clustering and Phylogenetic Trees", manuscript under submission, May 2020 [34].

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Basic Concepts in Hierarchical Clustering (HC)
    1.1.1 Trees, Subtrees and Ancestors
    1.1.2 – How many trees are there? – Simply, too many!
  1.2 Global Objectives for Hierarchical Clustering
  1.3 Heuristics for Hierarchical Clustering
  1.4 Constrained Hierarchical Clustering
    1.4.1 Connections to Consistency in Phylogenetics
    1.4.2 Connections to Correlation Clustering and Rankings
  1.5 Further Related Work
  1.6 Structure of this PhD Thesis

2 Gentle Overview of Main Results
  2.1 Black-box Approximations for Hierarchical Clustering
    2.1.1 Minimizing Dasgupta's cost via Graph Cuts and Spreading Metrics
    2.1.2 Maximization HC: Variations on a Theme
    2.1.3 Two Hardness Results from Small-Set Expansion
  2.2 Hierarchical Clustering for Euclidean Data
  2.3 Hierarchical Clustering with Structural Constraints
    2.3.1 Connections to Rankings, Correlation Clustering and Phylogeny Reconstruction

3 Preliminaries
  3.1 Background in Algorithms and Hardness
  3.2 Global Objectives for Hierarchical Clustering

4 Black-box Approximations for Hierarchical Clustering
  4.1 Minimizing HC cost: Graph Cuts and Spreading Metrics
    4.1.1 Recursive Sparsest Cut
    4.1.2 Using Balanced Cut instead of Sparsest Cut
    4.1.3 Generalized HC Cost Function Approximation
    4.1.4 Convex Relaxations for HC
    4.1.5 O(√log n) approximation for Hierarchical Clustering
    4.1.6 An LP-based O(log n) approximation via spreading metrics
    4.1.7 Hardness via Min Linear Arrangement and Small Set Expansion
  4.2 Maximization HC: Variations on a Theme
    4.2.1 Average-Linkage for similarity-HC is a tight 1/3-approximation
    4.2.2 Average-Linkage for dissimilarity-HC is a tight 2/3-approximation
    4.2.3 Beating Average Linkage via SDP for the Moseley-Wang HC Objective
    4.2.4 Beating Average Linkage via MaxCut for Dissimilarity HC
    4.2.5 An improved 0.42-approximation for Moseley-Wang
    4.2.6 Hardness for Moseley-Wang via Small Set Expansion

5 Hierarchical Clustering for Euclidean Data
  5.1 Setting: Feature Vectors with Gaussian Kernel
  5.2 Performance of Average Linkage
  5.3 Greedy Cutting and Single-linkage
  5.4 A Fast Algorithm based on Random Projections
    5.4.1 Gaussian Kernels with small σ
  5.5 Hard Instances with Gaussian Kernel
    5.5.1 Datasets from Section 5.5

6 Hierarchical Clustering with Structural Constraints
  6.1 Motivation and Chapter Overview
  6.2 Minimizing Dasgupta's Objective with Triplets
    6.2.1 Modified Sparsest or Balanced Cut Analysis
    6.2.2 Soft Triplets and Regularization
    6.2.3 Dissimilarity HC and Constraint Dependencies
  6.3 Old Biology Problems: Triplets/Quartets Consistency
    6.3.1 Hardness for Rooted Triplets Consistency
    6.3.2 Hardness for Forbidden Triplets: Random is Optimal

7 Conclusion & Open Questions
  7.1 Conclusion
  7.2 List of Open Problems and Conjectures

A Omitted Proofs, Discussions and Experiments
  A.1 Omitted Proofs from Chapters
    A.1.1 Missing proofs and discussion in Section 6.2.1
    A.1.2 Missing proofs in Section 6.2.2
    A.1.3 Missing proofs in Section 6.2.3
  A.2 Experiments and Discussion from Chapter 6
    A.2.1 Experimental Results
  A.3 Deferred Proofs of Section 4.2.1
  A.4 Deferred Proofs of Section 4.2.3
  A.5 Deferred Proofs of Section 4.2.4

List of Tables

4.1 A guide through the different variable names used in the proof.

5.1 Values of the objective (times 10^3) on the Zoo dataset (averaged over 10 runs).
5.2 Running times of PRC and one pass.

A.1 Results obtained for the Zoo dataset.

List of Figures

1.1 Tree of Life in Biology
1.2 Unrooted tree examples.
1.3 Example of a Dendrogram.
1.4 Caterpillar tree examples.
1.5 Possible combinations for generating binary rooted trees.
1.6 Induced subtrees and least common ancestors.
1.7 Calculating the HC cost function of Dasgupta.
1.8 Resolved quartets on four elements.
1.9 Resolved triplets on three elements.

4.1 A top-down heuristic for finding HC.
4.2 Level t decomposition in HC cost.
4.3 Illustration of HC levels for Balanced Cut.
4.4 Tight Instance for Average-Linkage for Moseley-Wang Objective
4.5 Tight Instance for Average-Linkage for Cohen-Addad et al. Objective
4.6 The layered structure of the optimum tree T* in Case 2.
4.7 Splitting T* to size restricted sets A, B, C.

5.1 Illustration of the merging process in 1D.
5.2 Projecting the triangle (v1, v2, v3) on g.
5.3 Clique Counterexample for Average-Linkage.

6.1 Rooted Triplet Constraints in Hierarchical Clustering
6.2 Decomposition in Constrained Sparsest Cut analysis.
6.3 Transforming a 3-hyperedge to a triangle.
6.4 Counterexample for Recursive Densest Cut.
6.5 Description of a class C with base {x, y}
6.6 Classes C_i, C_j and two situations for having C_i → C_j
6.7 Layered dependency subgraph of class C

Chapter 1

Introduction

I have this nightmare: that scientific historians of the future will say that during the 20th century, there were these sixty very intellectually exciting years during which optimization was not just gradient descent.

-Christos Papadimitriou (2019)

This thesis is all about Hierarchical Clustering which, in some sense, is just a fancier way of referring to what computer scientists are taught to call simply, a "tree". Thinking about trees has been a mathematician's recreation for many centuries. In 1857, the great British mathematician Arthur Cayley published an article "On the theory of the analytical forms called trees" and later in 1889 another one "A theorem on trees", where he gave formulas for counting different types of trees [25, 26]. Such arguments are considered standard by now and typically, in undergraduate algorithms, a student learns about binary trees and some of their amazing powers for speeding up basic algorithmic problems. Here, we view trees as a way of understanding data sets and specifically relationships among data points. Hierarchical Clustering has become by now an essential tool for data analysis useful across diverse areas, ranging from Phylogenetics in Biology to YouTube video recommendations and everything in between. It originated in Biology, where researchers wanted to understand the evolution of species, their genetic similarities, possible speciation events throughout history and build a taxonomy on extinct and extant organisms, usually called the Tree of Life (see Figure 1.1). A major problem then quickly appeared on how to aggregate vast amounts of information and automatically extract groups of organisms that are similar together (e.g., types of different bacteria, birds or mammals etc.). In computer science, the problem of partitioning a given data set into multiple groups is called clustering and is well-studied; chances are that any conference on theory or applications in computer


science features several papers on clustering.

Figure 1.1: In biology, understanding the origins and relations of species relies on using Hierarchical Clustering.

However, in the above example from Taxonomy, it is easy to see that since an organism can belong simultaneously to multiple clusters at different levels of granularity (e.g., a cat belongs to the family of "Felidae", which is a subclass of the class "Mammals"), standard "flat" clustering is no longer the right tool, as what we want is a hierarchy of clusters to capture the naturally hierarchical structure of the evolution of species. As a result, we use the term "hierarchical" (the etymology comes from the Greek word "hierarkhia", meaning "sacred rules") to contrast with "flat" clustering approaches, which only produce one specific partitioning of a data set. Hierarchical Clustering is a way of understanding the overall structure of large data sets by conveniently visualizing them as a collection of successively smaller and smaller clusters, capturing different levels of granularity simultaneously. The end result is a tree, whose internal nodes represent nested clusters and where data points lie at its lowest levels, with dissimilar data points being split towards the top of the tree and with similar data points being split

towards the bottom of the tree. To construct this tree, there are intuitive and easy-to-implement bottom-up merging methods or top-down divisive procedures. Once you are aware of it, you quickly realize that Hierarchical Clustering is everywhere, as data sets are of hierarchical nature. When analyzing social networks, solving community detection, organizing and recommending movies, retrieving relevant tweets and doing text analysis, suggesting ads in online platforms, investigating gene relations to cancer, or investing money on large portfolios, at some point hierarchies on data will be required and Hierarchical Clustering sneaks in. But it's not all good news, as there was something unsatisfactory about Hierarchical Clustering. Despite the plethora of applications and heuristics, the theory behind it was underdeveloped, partly because no "global" objective function, measuring the quality of the final output tree, was studied. A well-formulated objective allows us to compare different algorithms, measure their quality, explain their success or failure, and it enables continuous optimization methods with gradient descent. This lack of objectives is in stark contrast with the flat clustering literature, where objectives like k-means, k-median or k-center have been studied intensively, leading to a comprehensive theory on clustering. However, as is a common curse in computer science, most problems are hard to optimize, and hierarchical clustering is no exception. Thus, we need to turn our attention to approximation algorithms. Such algorithms try to find good solutions close to the optimum and usually they are accompanied by a proof for their performance. This thesis, by studying approximation algorithms and their limitations, makes progress towards building a theory for objective-based Hierarchical Clustering (HC from now on, as the name is simply too long). In a nutshell, we study both minimization and maximization objectives for HC, and we derive algorithms and their limitations based on standard graph partitioning problems.

Main Results

Before continuing with the basic definitions for HC, we want to quickly mention our main findings. We present 4 main results: For the minimization cost function with pairwise similarities introduced recently by Dasgupta [43], we show how using Sparsest Cut or Balanced Cut primitives as a black box produces a HC with no extra loss in approximation. Our result was inspired by a semidefinite programming relaxation (SDP) we originally formulated for HC based on so-called spreading metrics. To complement the upper bounds, we also prove that no efficient algorithm can achieve constant factor approximations. This work appeared in SODA 2017 [27]. For two maximization variants of Dasgupta's HC cost, we give the first tight analysis for the performance of Average-Linkage, which is an important algorithm originally proposed in the context of Biology. We show how to achieve better approximations using our aforementioned SDP together with geometric ideas from convex analysis. In a follow-up work, where we study one of the maximization variants, we show how using Max-Uncut Bisection as a black box produces a solution within 42% of the optimum, and we complement this with hardness results showing there are inherent

limitations on achieving approximations close to optimal. These works appeared in SODA 2019 [28] and in AISTATS 2020 [2]. The above results are general, with no assumptions on the input data points. However, in many scenarios, data points lie in high-dimensional Euclidean space and similarities are derived from the popular Gaussian kernel. We show how to exploit this extra structure to get a fast and simple-to-implement HC algorithm with provable approximation guarantees. It is a one-pass algorithm and is based on a random projection step. This appeared in AISTATS 2019 [29]. Finally, we study the problem of constrained Hierarchical Clustering, motivated by the problem of optimal tree reconstruction in Phylogenetics and gene sequencing. In these problems, on top of the pairwise similarities, we are given triplet or quartet constraints that reflect domain expertise or prior knowledge about the data. The goal is to produce a HC that aggregates the expert constraints as consistently as possible. We provide conceptually simple top-down modifications of Sparsest Cut and Balanced Cut and demonstrate both theoretically and experimentally that they enjoy performance guarantees analogous to unconstrained HC. Part of these results appeared in ICML 2018 [35], and recent findings about the triplet and quartet consistency problem are currently under submission [34]. Overall, this thesis provides state-of-the-art approximation and hardness results for optimizing global HC objectives, reveals unexpected connections with common linkage or graph cut heuristics and, as such, I hope it serves the community as the foundation of objective-based Hierarchical Clustering and of future improvements.

1.1 Basic Concepts in Hierarchical Clustering (HC)

The undisputed queen of killer applications for HC is in phylogenetics [e.g. in 46], where genomic similarity (or dissimilarity) patterns are used to create taxonomies of organisms, with the goal of shedding light on the evolution of species by understanding the ancestral tree of life. The easiest way to view HC is as a recursive partitioning of a set of datapoints into successively smaller clusters, represented by a dendrogram: a rooted tree whose leaves are in one-to-one correspondence with the datapoints. If you are interested more in the biological, rather than computational, perspectives of HC, we refer the reader to the excellent book "Inferring Phylogenies" by Joseph Felsenstein [50].

1.1.1 Trees, Subtrees and Ancestors

We first define an unrooted tree (see Figure 1.2) to be an acyclic connected graph with no vertices of degree two and every leaf (vertex of degree one) labelled uniquely. Internal vertices (vertices that are not leaves) are usually left unlabelled. This, for example, can correspond to a phylogenetic tree in Biology where the leaves are different species. Next, we define a "rooted" tree almost in the same way, except that here there is one special internal vertex, which can have degree two and is called the "root" of

the tree. The main protagonist in this thesis will be the set of rooted binary trees (see Figure 1.3), as these are the most common in data analysis, since they correspond to natural splits into two for a given data set. A special and important case of the aforementioned types of trees are caterpillar trees: an unrooted caterpillar tree (see Figure 1.4) has one central path with leaves branching off of it, and a rooted caterpillar has leaves appended to a single path from the root to a single leaf. Caterpillar trees can describe outliers in a data set by pointing to data points that should be split from the rest.

Figure 1.2: Unrooted tree examples.

Figure 1.3: HC dendrogram with 8 datapoints (leaves); numbers are the sizes of the clusters (tree nodes).

A vertex u in a rooted tree is called a descendant of a vertex v if the path from u to the root r passes through v. Then, v is called an ancestor of u. The vertices that lie directly below a vertex, i.e., its adjacent descendants, are called the children of that vertex, whereas an adjacent ancestor vertex is called the parent of that vertex. An important concept is that of the lowest common ancestor of a set of vertices L, which is defined as the unique ancestor of L that is a descendant of all the ancestors of L. From now on, let T denote a rooted tree with n leaves. We want to understand what kind of meaningful groupings or clusters this tree implies. If we remove the edge between v and its parent, we will get two connected subgraphs. The set of leaves that are descendants of v is called a cluster. Rooting the subgraph containing v at the vertex v, we get the subtree of T rooted at v, which we denote

T_v. If r is the root of T, then the subtrees branching off at the root are called the maximal subtrees of T.

Figure 1.4: Caterpillar tree examples.

As we shall see later, our goal will be to find optimum hierarchical clusterings for a given data set. It is reasonable then to wonder "how many trees are there?", since understanding this question enables us to get a sense of the space over which we optimize.

1.1.2 – How many trees are there? – Simply, too many!

This is considered by now a standard computation, as the counting of trees has been a mathematician's recreation since the fundamental work of Cayley in 1857 and 1889 [25, 26]. Starting with something simple, if we only had three elements of interest a, b, c, there is only one unrooted tree with that leaf set, though there are four unrooted trees for any set of four leaves. Generally, if we want an n-species tree, we can get:

$$1 \times 3 \times 5 \times \cdots \times (2n-5) = \frac{(2n-4)!}{(n-2)!\, 2^{n-2}}$$

possible (labeled) unrooted binary trees. As binary rooted trees with n leaves will be the main protagonist when our optimization problem is formally defined, we review here some observations to help us better understand the complexity of the problem. The proof we will present here for counting them is from Felsenstein's book [50] and was originally given by Cavalli and Edwards [24]. For every three leaves a, b, c there are exactly three binary rooted trees with leaf set {a, b, c}: ab|c or ac|b or bc|a. These are called rooted triplets, or sometimes resolved rooted triplets, as they describe a pair of leaves connected to a third leaf via the root. Generally, with n leaves, there are:

$$1 \times 3 \times 5 \times \cdots \times (2n-5) \times (2n-3) = \frac{(2n-3)!}{(n-2)!\, 2^{n-2}}$$

different (labeled) binary rooted trees (left or right children are indistinguishable, as the trees are considered to be unordered). See Figure 1.5 for the possible combinations with 4 leaves.

Figure 1.5: Possible combinations for generating binary rooted trees.

To prove such a formula, one easy way is to try to build all possible trees, by adding one new leaf at a time. By starting from a list of all possible trees with n species and adding the new species n+1 in all possible places, we will generate all of the trees on n+1 species, each only once. Observe that the tree is bifurcating both before and after the addition, so the new species cannot be connected to an existing interior node; it must instead be connected to a new node, which is placed in the middle of an existing edge. This implies that each edge of a tree is a potential location for an addition. Again, one reasonable question to ask is "will this process lead to all trees?" or "do we generate different trees with this process?". The truth of both statements is argued below. For some number k, let's consider the process of adding species k to a tree that consists of species 1 through k - 1.

Consider also the operation of removing species k from a tree that contains species 1 through k. These two operations are inverses of each other. Suppose that we have a particular tree with n species and we remove one by one species n, n-1, n-2 and so on, until species k+1 is removed. At this point what is left must be one particular tree with species 1 through k. Since the removal operation reverses the addition of the species, there must then be some particular sequence of places to add species k+1, k+2, ... onto that k-species tree to end up with that n-species tree. Furthermore, no other k-species tree can, when those n-k missing species are added, yield that particular n-species tree. If there were another k-species tree that could yield it, then that tree too would be reached by removal of those species from the n-species tree. But that is a logical impossibility, as the same sequence of removals cannot result in two different trees. Thus any n-species tree can be reached from one and only one k-species tree. Therefore, each possible addition sequence leads to a different n-species tree, and all such trees can be generated in that way. When we add a species to a tree, the number of ways in which we can do that is equal to the number of edges, including a potential edge at the root of the tree, if the new species gets separated directly from the rest. There are 3 such branches in a two-species tree. Every time that we add a new species, it adds a new interior node, plus two new edges. Thus after choosing one of the 3 possible places to add the 3rd species, the 4th can be added in any of 5 places, the 5th in any of 7, and so on, which leads to:

$$1 \times 3 \times 5 \times \cdots \times (2n-5) \times (2n-3) = \frac{(2n-3)!}{(n-2)!\, 2^{n-2}}$$

Given the vast space of possibilities (e.g., n = 20 gives rise to 8,200,794,532,637,891,559,375 different trees on 20 species), it is surprising that we will be able to derive conceptually simple, practically implementable polynomial-time algorithms with provable guarantees against the (unknown) optimum tree. But before we do this, let's formally define the hierarchical clustering optimization problem as it was proposed by Dasgupta [43].
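As a quick sanity check on the count above, here is a minimal Python sketch (added for illustration; the function name is ours) that evaluates the product 1 × 3 × 5 × ··· × (2n − 3) and reproduces the n = 20 figure quoted in the text:

```python
def num_rooted_binary_trees(n):
    """Number of labeled rooted binary trees on n leaves:
    1 * 3 * 5 * ... * (2n - 3) = (2n - 3)! / ((n - 2)! * 2^(n - 2))."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count

print(num_rooted_binary_trees(3))   # 3 rooted triplets: ab|c, ac|b, bc|a
print(num_rooted_binary_trees(20))  # 8200794532637891559375, as quoted above
```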

1.2 Global Objectives for Hierarchical Clustering

HC owes its widespread success to several advantages that this tree offers compared to the more traditional "flat" clustering approaches like k-means, k-median or k-center. It is a non-parametric method for unsupervised learning as it does not require a fixed number k of clusters and it provides richer information at all levels of granularity, simultaneously displayed in an intuitive form. Importantly, there are many fast and easy to implement algorithms commonly used in practice to find the tree. Examples are simple linkage-based agglomerative procedures, with Single-Linkage and Average-Linkage being perhaps the most popular [62, 51, 84]. We will define these and related algorithms in later sections.


Figure 1.6: Induced subtrees and least common ancestors.

For now, let's assume that the input consists of pairwise similarities between n data points of interest. These can be, for example, a set of n animals that you want to build a tree on. We would like to hierarchically cluster the n points in a way that is mindful of the given similarity structure. This requires a little terminology. Let T be any rooted, not necessarily binary, tree whose leaves are in one-to-one correspondence with V. For any node u of T, let T_u (or T[u]) be the subtree rooted at u, and let leaves(T_u) ⊆ V (or simply T_u) denote the leaves of this subtree. For leaves i, j ∈ V, the term T_{ij} (or i ∨ j) denotes their lowest common ancestor in T. Equivalently, T_{ij} is the smallest subtree whose leaves include both i and j (Figure 1.6). The input can be a matrix or, equivalently, an undirected graph G = (V, E, w), with one node for each point, edges between pairs of similar points, and positive edge weights w(e) that capture the degree of similarity. We will sometimes omit w, in which case all edges are taken to have unit weight. The edges {i, j} in G, and their strengths w_{ij}, reflect locality. When clustering, we would like to avoid cutting too many edges. But in a hierarchical clustering, all edges do eventually get cut. All we can ask, therefore, is that edges be cut as far down the tree as possible. Accordingly, Dasgupta in his seminal HC paper [43] defined the cost of T to be

$$\mathrm{cost}_G(T) = \sum_{\{i,j\} \in E} w_{ij}\,|\mathrm{leaves}(T_{ij})| = \sum_{\{i,j\} \in E} w_{ij}\,|T_{ij}| \qquad (1.1)$$

Often, it will be clear which graph G we are referring to, so we will omit the subscript G. If an edge {i, j} of unit weight is cut all the way at the top of the tree, it incurs a penalty of n. If it is cut further down, in a subtree that contains a c fraction of the data, then it incurs a smaller penalty of cn. See Figure 1.7 for a simple numerical example. Having an objective function for HC is crucial not only for guiding the optimization, but also for


Figure 1.7: Calculating the HC cost function of Dasgupta. Edge {3, 5} is cut at the root of T and incurs a cost of 6. Edges {1, 2}, {2, 3}, {3, 4} each incur a cost of 4, and the remaining three edges cost 2 apiece. Thus cost_G(T) = 6 + 3 × 4 + 3 × 2 = 24.

having theoretically grounded solutions and comparing different algorithms. This may be obvious, since much of the theory of standard flat clustering has been developed around objectives like k-means etc.; however, for HC it wasn't until 2016 that this lack of global optimization objectives was addressed by Dasgupta [43]. While optimizing this cost is NP-hard, approximation algorithms provide good approximation factors of the unknown optimum solution. In particular, one interesting aspect of this objective, as we will see, is that running Recursive Sparsest Cut (an important algorithm in theoretical computer science and in practice) would produce a HC with provable guarantees with respect to his cost function [27]. Subsequent work was able to shed light on Average Linkage performance, which is a common algorithm for HC; specifically, [78] studied the complement of Dasgupta's cost function and showed that Average Linkage will find a solution within 33% of the optimizer. Further techniques based on semidefinite programming relaxations led to improved approximations [28, 2], however with a significant overhead in runtime. Another positive side-effect of having an objective for a problem is that faster continuous methods can be deployed, giving rise to gradient-descent-based Hierarchical Clustering; see, e.g., Monath et al. [77], who propose a differentiable HC objective by representing intermediate nodes in trees using hyperbolic embeddings and optimizing such embeddings. While this approach yields improvements in scalability and downstream task performance, it also has several limitations: this method requires additional post-processing to map the learned continuous tree representation to an exact tree structure. However, recently hyperbolic hierarchical clustering has received a lot of attention and is at the center of current research focus, especially for ontology graphs that are common for networks [79].
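To make Eq. (1.1) concrete, here is a minimal Python sketch (an illustration, not code from the thesis) that computes Dasgupta's cost for a tree given as a child map; the 4-leaf instance at the end is a hypothetical toy example, not the instance of Figure 1.7:

```python
def dasgupta_cost(edges, children, root):
    """cost_G(T) = sum over edges {i, j} of w_ij * |leaves(T_ij)|, Eq. (1.1).

    edges:    dict {frozenset({i, j}): w_ij}
    children: dict mapping each internal node of T to its list of children;
              nodes that never appear as keys are leaves (the data points).
    """
    leaves_below = {}  # tree node -> set of data points below it

    def collect(v):
        leaves_below[v] = ({v} if v not in children
                           else set().union(*(collect(c) for c in children[v])))
        return leaves_below[v]

    collect(root)
    total = 0
    for e, w in edges.items():
        i, j = tuple(e)
        # T_ij is the smallest subtree whose leaf set contains both i and j,
        # i.e. the subtree rooted at their lowest common ancestor.
        size = min(len(L) for L in leaves_below.values() if i in L and j in L)
        total += w * size
    return total

# Toy instance: T = ((1,2),(3,4)) with unit-weight edges {1,2}, {3,4}, {2,3}.
children = {"root": ["u", "v"], "u": [1, 2], "v": [3, 4]}
edges = {frozenset({1, 2}): 1, frozenset({3, 4}): 1, frozenset({2, 3}): 1}
print(dasgupta_cost(edges, children, "root"))  # 1*2 + 1*2 + 1*4 = 8
```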

1.3 Heuristics for Hierarchical Clustering

Many different heuristics have been proposed based on the idea that similar points should be merged further down the tree, i.e., earlier than less similar points (or clusters of points). Many different ways

can be used based on the idea of merging “nearest neighbours” according to what one defines as the next nearest pair to be merged. These constitute a family of algorithms that build the tree in a bottom-up manner and are usually called agglomerative algorithms. One could also use a top-down approach to build the tree which splits the data set into two (or more) clusters and then recursively continues in each of the produced clusters.

Bottom-up Algorithms: Single, Complete, Average Linkage

We briefly describe algorithms that have been used as bottom-up approaches to construct a tree and that originated in Biology and Phylogenetics. Agglomerative Hierarchical Clustering is one of the first suites of algorithms developed to solve HC (also referred to as bottom-up linkage methods). These are simple and easy-to-implement algorithms that recursively merge similar data points to form some small clusters, and then gradually larger and larger clusters emerge. Well-known heuristics include Single-Linkage, Complete-Linkage and Average-Linkage, which we briefly describe here. All three heuristics start with n datapoints forming singleton clusters initially and perform exactly n - 1 merges in total until they produce a binary tree corresponding to the HC output. At any given step, if A and B denote two already formed clusters, the criterion for which clusters to merge next is to minimize the minimum, maximum and average distance between the two clusters, for Single, Complete and Average Linkage respectively.

• For Single Linkage: min_{a∈A, b∈B} dist(a, b)

• For Complete Linkage: max_{a∈A, b∈B} dist(a, b)

• For Average Linkage: (1/(|A||B|)) Σ_{a∈A, b∈B} dist(a, b)

If instead of pairwise distances the input was given as a similarity graph, analogous formulas can be used, where the criterion is to maximize the respective quantities. These algorithms can be made to run in time O(n^2 log n). Recent work has also made it possible to run approximate versions of such algorithms in sub-quadratic time [1] when the data are Euclidean. As we will use them in the context of similarities, we also provide the pseudocode for them in terms of similarities, along with some comments for each of them.

Average-Linkage: One of the main algorithms used in practice for HC, it starts by merging clusters of data points that have the highest average similarity. As we shall see, it can be shown that it achieves a 1/3-approximation for the Moseley-Wang objective (one of the HC global objectives that we study in this thesis) and this is tight in the worst case. For a formal description, please refer to Algorithm 1.

Single-Linkage: One of the simplest algorithms used in practice for HC. For a formal description, please refer to Algorithm 2. This was the only HC algorithm associated with a global optimization objective before Dasgupta's work. This simple algorithm is equivalent to Kruskal's algorithm that builds the maximum spanning tree (viewed, of course, as a hierarchical clustering).

Complete-Linkage: Another algorithm, which sometimes does better in the presence of outliers by avoiding long chains of points caused by noisy, unreliable edges. For a formal description, please refer to Algorithm 3.

Algorithm 1 Average-Linkage
1: input: Similarity matrix w ∈ R_{≥0}^{n×n}.
2: Initialize clusters C ← ∪_{v∈V} {v}.
3: while |C| ≥ 2 do
4:     Pick A, B ∈ C to maximize: w(A, B) := (1/(|A||B|)) Σ_{a∈A, b∈B} w_{ab}
5:     Set C ← (C ∪ {A ∪ B}) \ {A, B}
6: end while

Algorithm 2 Single-Linkage
1: input: Similarity matrix w ∈ R_{≥0}^{n×n}.
2: Initialize clusters C ← ∪_{v∈V} {v}.
3: while |C| ≥ 2 do
4:     Pick A, B ∈ C to maximize: w(A, B) := max_{a∈A, b∈B} w_{ab}
5:     Set C ← (C ∪ {A ∪ B}) \ {A, B}
6: end while

Algorithm 3 Complete-Linkage
1: input: Similarity matrix w ∈ R_{≥0}^{n×n}.
2: Initialize clusters C ← ∪_{v∈V} {v}.
3: while |C| ≥ 2 do
4:     Pick A, B ∈ C to maximize: w(A, B) := min_{a∈A, b∈B} w_{ab}
5:     Set C ← (C ∪ {A ∪ B}) \ {A, B}
6: end while
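For concreteness, here is a minimal Python sketch of Algorithm 1 (an unoptimized illustration following the pseudocode directly, not one of the O(n^2 log n) implementations mentioned above; the similarity matrix at the end is a made-up example):

```python
import numpy as np

def average_linkage(w):
    """Run Average-Linkage on an n x n nonnegative similarity matrix w.
    Returns the sequence of merges (A, B) as frozensets of point indices;
    each merge corresponds to one internal node of the output tree."""
    clusters = [frozenset([v]) for v in range(w.shape[0])]
    merges = []
    while len(clusters) >= 2:
        best_score, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                A, B = clusters[i], clusters[j]
                # average similarity between clusters A and B
                score = sum(w[a, b] for a in A for b in B) / (len(A) * len(B))
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        i, j = best_pair
        A, B = clusters[i], clusters[j]
        merges.append((A, B))
        clusters = [C for k, C in enumerate(clusters) if k not in (i, j)] + [A | B]
    return merges

# Hypothetical example: two tight pairs {0,1} and {2,3}, weak cross similarities.
w = np.array([[0, 9, 1, 1],
              [9, 0, 1, 1],
              [1, 1, 0, 8],
              [1, 1, 8, 0]], dtype=float)
print(average_linkage(w))  # merges {0,1} first, then {2,3}, then the two pairs
```

Replacing the average in the score computation by a max or a min yields Single-Linkage and Complete-Linkage (Algorithms 2 and 3), respectively.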

Top-down Algorithms: Sparsest/Balanced Cut, Bisecting k-means

Sparsest and Balanced cut are two problems that have been studied for many years in theoretical computer science. We defer the formal definitions to the Preliminaries Section as they will both be important in our later discussions for Dasgupta's objective. The idea behind both of them is to split a given data set into two clusters so that points inside the clusters are more similar to each other compared to points outside the clusters. The way to formalize this is based on the size of the cut normalized by the sizes of the two sides. These can be used to generate HC, simply by recursively splitting each of the two produced clusters. As these algorithms may be slow or require special engineering to accelerate, simpler variants can be used like Bisecting k-means, which simply is the

recursive 2-means algorithm based for example on standard implementations of Lloyd’s k-means algorithm [63] with the parameter k = 2.
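As an illustration of the top-down approach, here is a minimal Bisecting k-means sketch; it assumes scikit-learn's KMeans is available, and a Sparsest Cut or Balanced Cut variant would simply replace the 2-means split with the corresponding cut routine:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans_tree(X, indices=None):
    """Recursively split the points of X with 2-means; returns a nested tuple
    whose leaves are data-point indices, i.e. a rooted binary HC tree."""
    if indices is None:
        indices = np.arange(len(X))
    if len(indices) <= 1:
        return int(indices[0])                     # a single leaf
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:          # degenerate split: stop here
        return tuple(int(i) for i in indices)
    return (bisecting_kmeans_tree(X, left), bisecting_kmeans_tree(X, right))

# Toy example: two well-separated 2D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (4, 2)), rng.normal(10, 1, (4, 2))])
print(bisecting_kmeans_tree(X))
```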

Remark 1 We would like to note that depending on the input characteristics, different algorithms may have better or worse performance and there is no single silver bullet. During my internships at Facebook and Google, I came to realize the many different variations of bottom-up or top-down algorithms that can be deployed, each having drawbacks or advantages. One interesting family of variants is the so-called "percentage-linkage" algorithms, which are defined similarly to the aforementioned linkage heuristics, except that only some top percentile X% (e.g., 50%, 75% or 95%) of points contribute to the calculation of the linkage or cut criterion. This can have some robustness advantages in practice due to the presence of outliers.

1.4 Constrained Hierarchical Clustering

So far, we thought of the input as being presented in the form of pairwise similarities or distances. This is the standard way of collecting the input for clustering. However, hierarchies are much richer objects than partitions and the input can be given in other forms as well. Specifically, information can take the form of triplets or quartets which are extremely useful in phylogenetic tree reconstruction [50, 3].

1.4.1 Connections to Consistency in Phylogenetics

The three unrooted trivalent (i.e., internal nodes having degree 3) trees with four leaves are called quartets. These are ab|cd or ac|bd or ad|bc (see also Figure 1.8). We say that a quartet ab|cd is obeyed (or satisfied, or followed) by a tree T if the path from a to b in T does not intersect the path from c to d in T. If this is not the case, we say the quartet was disobeyed (or violated, or not satisfied) by the tree.

Figure 1.8: Resolved quartets on four elements are basic building blocks for phylogenetic tree reconstruction.

Similarly, a rooted triplet ab|c (see Figure 1.9) is obeyed/violated by a binary rooted tree T if the path from a to b does not share any vertices with the path from c to the root. In other words, the rooted

triplet indicates which of the three nodes is least similar (or the "odd" element among the three) and should be separated first. For example, we could have the triplet {Tuna, Salmon | Lion}.

Figure 1.9: Resolved triplets on three elements are basic building blocks for binary HC.

Triplets and quartets can be thought of as the basic building blocks for conveying local qualitative information on 3 or 4 elements respectively. They are also easy to obtain via experiments or expert advice. However, many interesting questions arise: given a set of triplets or quartets, can we combine them in a tree consistently? If not, can we maximize agreements or minimize disagreements with the given triplets/quartets? These are all old questions in the field of Phylogenetics and form interesting combinatorial problems. We briefly review some results here that are relevant for our discussion and our later results. All 4 problems (max satisfied triplets/quartets or min violated triplets/quartets) are NP-hard. Detecting consistency can be done in polynomial time only for the triplets case, and a simple algorithm for this is BUILD from Aho et al. [3]. For the maximization version of both triplets and quartets consistency, it is easy to see that a random solution (i.e., a random rooted or unrooted binary tree respectively) will achieve a 1/3-approximation. This is also obtained by simple greedy bottom-up or top-down heuristics; however, to this day no one has been able to find a better algorithm, despite significant efforts in different communities [37, 23, 89]. Here we will show that there is a reason for that, as a better algorithm would imply that the Unique Games Conjecture is false. Specifically, we show that no polynomial-time algorithm can obtain better than a 2/3-approximation (assuming the Unique Games Conjecture, to be defined in the Preliminaries), and we conjecture that 1/3 is the best constant one can achieve in polynomial time. In fact, even more variations can be defined, since we may want to avoid a triplet or quartet, i.e., not include it in the final tree. This captures scenarios where a certain triplet/quartet is highly unlikely to be what one is looking for. This problem is called the forbidden triplets or forbidden quartets problem. Again a random solution achieves a 2/3-approximation and this is currently the best performance. As we prove later, assuming the Unique Games Conjecture [65], one cannot hope

to beat this trivial baseline in polynomial time (see also Conjecture 1 in Chapter 7). For the minimization versions, the situation is slightly clearer. As said previously, detecting consistency in quartets is NP-complete, so there is no hope for multiplicative-factor approximation algorithms. For minimizing violated triplets, there is a hardness result due to [37], saying that for every ε > 0, the problem is hard to approximate within a factor of 2^{log^{1-ε} n}, where n is the number of leaves, even on trees formed by multi-edges (coincidentally, and perhaps ironically, I had been working on this problem for several months now, and while writing the acknowledgments for this thesis, I realized I had stumbled upon the work of [37] years back, as their reduction is inspired by influence maximization, which was the start of my undergraduate thesis [32, 57]). This means that there is no sub-linear approximation factor for MinRTI, matching the best (and relatively straightforward) approximation so far, which is an n-approximation based on recursively taking Min-Cuts [60]. Here, we mention a novel and easy way of getting an O(log n log log n)-approximation in the special case of caterpillar trees: we give a reduction to Feedback Arc Set (FAS); simply introduce two arcs a → b and a → c for every triplet bc|a in the input of the triplets consistency problem, and run an approximate FAS algorithm to produce an ordering. This closes this problem almost completely, as an improved algorithm would imply better Feedback Arc Set algorithms, which would be a big deal in the approximation algorithms literature, as Feedback Arc Set is a fundamental problem (see [48]).
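A minimal sketch of the reduction just described; the (b, c, a) tuple encoding for a triplet bc|a is a convention chosen here for illustration:

```python
def triplets_to_fas_arcs(triplets):
    """For the caterpillar special case: each rooted triplet bc|a (b and c stay
    together, a is split off first) contributes the arcs a -> b and a -> c.
    Feeding these arcs to any Feedback Arc Set approximation yields an ordering
    of the leaves, i.e. a caterpillar tree, as described above."""
    arcs = []
    for b, c, a in triplets:          # (b, c, a) encodes the triplet bc|a
        arcs.append((a, b))
        arcs.append((a, c))
    return arcs

# The triplet {Tuna, Salmon | Lion} from Section 1.4.1 becomes two arcs:
print(triplets_to_fas_arcs([("Tuna", "Salmon", "Lion")]))
# [('Lion', 'Tuna'), ('Lion', 'Salmon')]
```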

1.4.2 Connections to Correlation Clustering and Rankings

The triplets and quartets constraints have their analogue in standard clustering and in rankings (i.e., permutations on n labels). These constraints sometimes are referred to as qualitative constraints or ordinal constraints and are used in correlation clustering and voting schemes.

“Must-link/Cannot-Link”

In clustering, the most common types of such qualitative information are "must-link/cannot-link" constraints, specifying that in the desired ground-truth clustering two data points should be in the same or in different clusters. This naturally gives rise to instances of Correlation Clustering [17, 4] and constrained clustering [97, 98], which are both important and well-studied problems in data analysis.

Betweenness and other Ordering Constraint Satisfaction Problems (CSPs)

When talking about ranking players in a tournament or ranking ads in an ad platform, or ranking presidential candidates according to votes etc. it is usually the case that we have ordering constraints. The simplest ordering constraint takes the form of a pairwise relationship a

called the Maximum Acyclic Subgraph or Minimum Feedback Arc Set, which is related to topological sorting. Again, let's discuss consistency. When the given constraints are consistent, i.e., there exists one permutation satisfying all of them, then a simple linear-time algorithm based on Depth-First-Search can find the optimum solution. In the case of inconsistencies, we try to maximize the number (or weight) of forward edges or minimize the number of backward edges in a permutation. Surprisingly, again a random solution is the best one can achieve for the maximization version of Maximum Acyclic Subgraph [59], in polynomial time (assuming the Unique Games Conjecture). A random solution achieves a 1/2-approximation, and the same guarantee is actually obtained by an arbitrary permutation or its reverse. For the Feedback Arc Set problem, the current best is an O(log n log log n)-approximation [48]. Going from pairs to triplet constraints in the case of rankings, one common constraint is that of a betweenness constraint a|b|c, indicating that b should be between a and c in the optimum permutation. Even though it may seem a simple problem, checking consistency of a set of betweenness constraints is NP-complete. Furthermore, a random permutation obtains a 1/3-approximation for maximizing the triplets obeyed by the permutation, and this is the best we can do in polynomial time [58]. Part of our contribution in the later chapters of this thesis is to formally make a connection between ordering problems on permutations (so-called ordering CSPs [58]) and ordering problems on trees, getting near-optimal hardness results for the latter.
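To see the 1/3 baseline for betweenness concretely, here is a small Monte Carlo sketch (random constraints over 10 hypothetical items) that estimates the fraction of betweenness constraints a uniformly random permutation obeys:

```python
import random

def obeyed(pos, a, b, c):
    """A betweenness constraint a|b|c is obeyed if b lies between a and c."""
    return pos[a] < pos[b] < pos[c] or pos[c] < pos[b] < pos[a]

random.seed(0)
n, trials = 10, 2000
constraints = [tuple(random.sample(range(n), 3)) for _ in range(50)]

satisfied = 0
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)
    pos = {x: i for i, x in enumerate(perm)}
    satisfied += sum(obeyed(pos, a, b, c) for a, b, c in constraints)

print(satisfied / (trials * len(constraints)))  # close to 1/3, the random baseline
```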

1.5 Further Related Work

As we mentioned, there is a large body of literature on HC (we refer the reader to [19] for a survey) starting with early works in phylogenetics by [88, 61]. Average-Linkage was one of several methods originating in this field that were subsequently adapted for general-purpose data analysis. Other major applications include image and text classification [90], community detection in social networks [70, 75], bioinformatics [45], finance [94] and more. Following the formulation of HC as a combinatorial optimization problem, and escaping from worst-case analysis Cohen-Addad et al. [38] studied hierarchical extensions of the stochastic block model and showed that an older spectral algorithm of [76] augmented with linkage methods results in an O(1)-approximation to Dasgupta’s objective. In another line of work, hierarchical clustering in the context of “dynamic” or “incremental” clustering, using standard flat-clustering objectives like k-means, k-median or k-center as proxies, has been studied ([30, 42, 81, 72]). Furthermore, there has been recent attention on the “semi- supervised” or “interactive” versions of HC by [12, 13, 15, 16], showing that interactive feedback in the form of cluster split/merge requests can lead to significant improvements, and by [96], providing techniques for incorporating prior knowledge to get better hierarchical trees. CHAPTER 1. INTRODUCTION 17

1.6 Structure of this PhD Thesis

In the next chapter, we give a gentle, yet technical overview of our main results by providing the statements of our main theorems all concentrated together. In Chapter 3, we provide the necessary technical background and preliminaries from the areas of approximation algorithms and hardness of approximation. The main part of the thesis starts in Chapter 4, where we present our findings on approximation algorithms for HC based on standard primitives from graph partitioning and convex optimization, together with some negative impossibility results. Continuing in Chapter 5,westudy a variation where input points have features and we want a fast HC algorithm. In the last technical chapter of the main body, Chapter 6, we study HC with constraints that constitute a common way of incorporating prior knowledge or expert advice in biology and make progress in old Biology problems especially giving hardness for maximizing triplets consistency. Chapter 7 contains a list of future research questions and conjectures. Finally, omitted proofs and experiments are given in Appendix A. Chapter 2

Gentle Overview of Main Results

I often quote myself. It adds spice to my conversation.

-George Bernard Shaw

This chapter is meant to serve as a brief, yet technical summary of the thesis for someone who just wants to dive into our main results about Hierarchical Clustering. We assume some familiarity with graph algorithms as our purpose is to only state our main theorems.

2.1 Black-box Approximations for Hierarchical Clustering

Hierarchical Clustering (HC) objectives are defined and optimized over the space of binary trees with n leaves, where n is the number of data points. Every binary tree with n leaves can be seen as indicating a series of exactly n 1 splits: at each internal node of the tree, simply group together the leaves that belong to the left or right child. The main message of this section is that we can exploit well-known techniques for graph partitioning in a black-box manner and get approximation algo- rithms for the HC objectives. These well-known techniques include algorithms like Sparsest Cut, Balanced Cut, Max Cut, Max Uncut Bisection or spreading metrics semidefinite program- ming (SDPs) relaxations and randomized hyperplane rounding. For the hardness of approximation results, we rely on the Minimum Linear Arrangement problem and Small-Set Expansion hypothesis.

18 CHAPTER 2. GENTLE OVERVIEW OF MAIN RESULTS 19

2.1.1 Minimizing Dasgupta’s cost via Graph Cuts and Spreading Metrics

Recall that on an input similarity graph G, the Dasgupta’s cost for a given tree T is the following:

cost(T )= w T (2.1) ij | ij| (i,j) E X2 where w are pairwise similarities and T is the number of leaves contained in the subtree rooted ij | ij| at the lowest common ancestor of i, j. Finding the optimum tree is an NP-hard task so we need to result in approximate solutions. We prove that repeatedly using Sparsest Cut or Balanced Cut to generate a candidate tree T will produce a good solution for Dasgupta’s objective:

Theorem 2.1.1 Given a weighted graph G and access to an ↵n-approximation of either Sparsest

Cut or Balanced Cut, we can get an O(↵n) approximation for the hierarchical clustering problem based on Dasgupta’s objective.

The current best known value for ↵n is O(plog n). The above theorem may come as a surprise because both algorithms are looking for small cuts normalized by sizes, i.e., ratio cuts, even though there is nothing in Dasgupta’s cost function that involves any ratio cuts; the ratio aspect is just emerging organically from it because the penalties depending on the leaves T are accumulated | ij| in a hierarchical manner. The above theorem is also a justification for recursive top-down parti- tioning methods that have been previously proposed in practice. Independently of our work, Roy and Pokutta [85] achieved a O(log n)-approximation using a spreading metrics linear programming. Shortly after our publication, and independently, Cohen-Addad et al. [39] also presented the same theorem, however with a di↵erent proof approach. We note here that similar results hold for a more general version of the HC objective given by:

cost(T )= w f( T ) (2.2) ij | ij| (i,j) E X2 where f is defined on the nonnegative reals, is strictly increasing, and has f(0) = 0. For instance, we could take f(x) = log(1 + x)orf(x)=x2.Wehave:

Theorem 2.1.2 Given a weighted graph G and access to an ↵n-approximation of either Sparsest

Cut or Balanced Cut, we can get an O(cf ↵n) approximation for the hierarchical clustering · f(r) problem based on Dasgupta’s objective (2.2), where cf , max1 r n f(r/2) f(r/4) .   A natural way to attack the question of approximating the HC cost function (2.1) is by formu- lating a linear programming (LP) or a semidefinite programming (SDP) relaxation. In fact, our Theorem 2.1.1 was originally inspired by our analysis of a hierarchical SDP with spreading metrics. Using our SDP and exploiting a connection with k-Balanced Partitioning studied in [67], we prove that it is an O(plog n) approximation for both the simple and the generalized cost function. CHAPTER 2. GENTLE OVERVIEW OF MAIN RESULTS 20

The advantage of the SDP relative to the Sparsest Cut or Balanced Cut is that it provides an explicit lower bound on the optimal solution and could potentially yield an even better approxi- mation for hierarchical clustering. Finally, before considering an SDP, one can analyze a spreading metric Linear Programming (LP) relaxation and show that it has integrality gap O(log n), hence it gives an O(log n) approximation for Dasgupta’s objective.

Theorem 2.1.3 Using the vectors returned by SDP-HC, and a rounding algorithm for the k- Balanced Partitioning propblem, we can produce a tree within a factor of O(plog n) from the optimum tree, as measured by (2.1).

Below we give the SDP-HC formulation, although a detailed presentation behind each of the constraints will be postponed until Chapter 4. We view a hierarchical clustering of n data points as a collection of partitions of the data, one for each level t = n 1,...,0. The partition at level (t 1) is a refinement of the partition at level t. Alevelt defines a partitioning of the data points into maximal clusters of size at most t. Note that the partition corresponding to t = 1 must consist of n singleton clusters. We represent the partition at level t by the set of variables xt , i, j V ,wherext =1ifi and j are in di↵erent clusters in the ij 2 ij t t n partition at level t and xij =0ifi and j are in the same cluster. Each of the vectors vi lies in R .

n 1 n 1 1 min xt w =min vt vt 2w , (2.3) ij ij 2k i jk2 ij t=0 ij E t=0 ij E X X2 X X2 t t 1 such that: x x ,t= n 1,n 2,...,1 (SDP-HC) ij  ij x0 =1, i, j V and xt 1, i, j, t ij 8 2 ij  8 1 xt = vt vt 2 and vt 2 =1, i, t V ij 2k i jk2 k i k2 8 2 xt xt + xt , i, j, k V, t and xt n t, i, t ij  jk ik 8 2 8 ij 8 j X The constraints xt n t, i, t try to impose the vectors to separate encoding the fact that j ij 8 a tree splits data pointsP as we move towards the leaves. These are the so-called spreading constraints and that’s why we say that spreading metrics are used in the relaxation.

2.1.2 Maximization HC: Variations on a Theme

As we all know, in flat clustering and graph partitioning there is not just one objective to optimize. One may study Mininum Cut, Maximum Cut, k-means, k-median and many more. Each objective may be relevant in di↵erent contexts or may serve di↵erent purposes, like explaining the success of a popular algorithm used in practice. The situation was no di↵erent for hierarchical clustering and shortly after Dasgupta’s proposal for an HC objective, several works proposed variations on it. The CHAPTER 2. GENTLE OVERVIEW OF MAIN RESULTS 21

two most important were the maximization objective by Moseley and Wang [78] and the dissimilarity objective by Cohen-Addad et al. [39], which was also a maximization problem. The objective that Moseley and Wang proposed is the following:

reward(T )= w (n T ) (2.4) ij | ij| (i,j) E X2 Again here the graph is given as pairwise similarities w and the term (n T ) denotes the ij | ij| number of non-leaves of the lowest common ancestor of i, j, i.e., the number of data points lying outside Tij. We note that this is a maximization and not a minimization problem as we had previously. It is easy to see that this objective is the complement of Dasgupta’s cost function, hence the two optimization problems have the same optimum solution. The objective that Cohen-Addad et al. proposed is useful whenever the input has pairwise dissimilarities dij instead of similarities:

score(T )= d T (2.5) ij | ij| (i,j) E X2 This is again a maximization problem. We prove the following results:

Theorem 2.1.4 In the worst case, the approximation ratio achieved by Average-Linkage is no better than 1/3 for the Moseley-Wang maximization objective (2.4) and no better than 2/3 for the Cohen- Addad et al. maximization objective (2.5).

Theorem 2.1.5 We beat the approximation guarantees of Average-Linkage for both objectives: for the Moseley-Wang maximization objective (2.4) we use the hierarchical SDP-HC to get a 0.336379- approximation and for the Cohen-Addad et al. maximization objective (2.5)weuseMax Cut to get a 0.667078-approximation.

Finally, the SDP-HC used in the analysis for the 0.336379-approximation also paved the way to get improved guarantees for objective (2.4) via a black-box connection to Max Uncut Bisection:

Theorem 2.1.6 For objective (2.4), one can get a 0.4246-approximation via Max Uncut Bisec- tion.

Remark 2 Subsequent work by Alon et al. [5], during the writing of this thesis, showed that our algorithm based on Max Uncut Bisection actually yields a 0.585-approximation, thus improving our analysis from 0.4246.

2.1.3 Two Hardness Results from Small-Set Expansion

All 3 objectives defined previously are NP-hard to optimize exactly, but what about inapproximabil- ity? Our two hardness of approximation results in this section are based on the Minimum Linear CHAPTER 2. GENTLE OVERVIEW OF MAIN RESULTS 22

Arrangement problem and the Small-Set Expansion (SSE) Hypothesis. They hold even for unweighted graphs (i.e., wij is either 0 or 1) and are the following:

Theorem 2.1.7 For Dasgupta’s cost (2.1), for every ✏>0,itisSSE-hard to distinguish between the following two cases for a given graph G =(V,E),with V = n: | | Yes: There exists a decomposition tree T of the graph such that cost (T ) ✏n E G  | | No: For any decomposition tree T of the graph cost (T ) cp✏n E . G | | In other words, this implies that no constant factor approximation algorithm can exist that runs in polynomial time, as this would refute the Small-Set Expansion hypothesis.

Theorem 2.1.8 Assuming Small-Set Expansion, there exists ✏>0, such that it is NP-hard to approximate the Moseley-Wang objective (2.4) function within a factor (1 ✏). In other words, this is an APX-hardness result for objective (2.4). Finally, regarding the dissimilarity objective (2.4), a similar APX-hardness result can be proved based on the Unique Games Conjecture. As this is unpublished work in progress [33], we omit the details in this thesis.

2.2 Hierarchical Clustering for Euclidean Data

In the previous sections, we focused on optimizing the various HC objective functions under arbi- trary similarity measures wij or dissimilarities dij. Although this is a general setting, in practice more information is sometimes provided for our data points: specifically, each data point is usually represented with a vector whose coordinates represent features and we need a scalable algorithm. Experimental evidence in small and large datasets (up to 10 million data points) demonstrate orders of magnitude improvement in speed-ups while still maintaining the quality of the output compared to more expensive methods. Our main result here is a fast, one-pass algorithm to handle the high-dimensional case and gets a slightly better approximation factor compared to the slower Average-Linkage:

d Theorem 2.2.1 For any input set of vectors v1,...,vn R the algorithm Projected Random 2 Cut gives an ↵-approximation (in expectation) for the objective (2.4) under the Gaussian kernel 2 2 v v 2 vi vj 2/2 i j 2 similarity measure w ek k where ↵ =(1+)/3 for =min exp( k 2 k ). ij ⇠ i,j 2 More details about the Projected Random Cut algorithm and the motivation behind the Gaussian kernel use, are deferred to the relevant section. Here we also analyzed Average-Linkage and its performance guarantees for di↵erent dimensions:

Theorem 2.2.2 For d =1, under monotone distance-based similarity measures w = w( v v ), ij k i jk 1 Average-Linkage is a 2 -approximation for the objective (2.4). Moreover, there exists a set of CHAPTER 2. GENTLE OVERVIEW OF MAIN RESULTS 23

d 1 vectors v1,...,vn R for d = poly(log n) for which Average-Linkage gets at most + o(1) of 2 3 the optimum.

2.3 Hierarchical Clustering with Structural Constraints

In this final section of the thesis, we explore the question of approximation algorithms for con- strained hierarchical clustering and as it turns out we uncover interesting relations with some old computational biology problems. Furthermore, we provide some experimental evidence that the final hierarchical clustering gets improved when on top of the geometry of the data (i.e., pairwise similar- ities wij), the input contains triplet constraints as they can guide the algorithm towards better trees overall. What is a constraint in this context? In several applications it may be easier to acquire information on triplets of points, instead of pairwise information wij. The triplet information is of the form ab c indicating that a, b are closer in the hierarchy than a, c or a, b. For example, in the | Tree of Life, such a constraint could be lion, tiger penguin since species like lions and tigers share { }| a common genetic signature compared to penguins. How can we change the algorithms mentioned so far to incorporate these triplet constraints? Can we achieve approximation guarantees subject to satisfying all triplet constraints? The answer depends on the consistency of the triplet constraints and the objective at hand. We say the constraints are consistent if there exists a tree compatible with all the given triplets. Our main result here is that a modified version of Sparsest Cut and Balanced Cut will still be reasonably good approximation algorithms:

Theorem 2.3.1 Given a weighted graph G(V,E,w) with k consistent triplet constraints ab c for | a, b, c V ,theConstrained Sparsest Cut algorithm outputs a HC respecting all triplet con- 2 straints and achieves an O(k↵n)-approximation for Dasgupta’s objective (2.1), where ↵n = plog n.

Similar results can be achieved in the case some of the triplets may be inconsistent. This happens for example when the triplets are generated by asking users or domain experts and there is some conflicting opinion regarding some of them. For this, we need to modify Dasgupta’s objective accordingly to reflect the fact that we would like to satisfy as many constraints as possible, as early in the hierarchy as possible, or in other words, to violate as few as possible as late as possible. We postpone the formal definition of the modified objective to the relevant section.

2.3.1 Connections to Rankings, Correlation Clustering and Phylogeny Reconstruction

Here we present mainly hardness results for important problems in computational biology and bioin- formatics on triplets and quartets consistency. As our paper is currently under submission [34], here CHAPTER 2. GENTLE OVERVIEW OF MAIN RESULTS 24

we will only informally state the results and later provide some proofs. We defer the complete proofs to the final version of our paper. By exploiting a connection with ordering constraint satisfaction problems [58], we are able to obtain near-optimal (optimal in some cases) hardness of approximation, thus giving an explanation for why many e↵orts were stuck at the approximation thresholds obtained by trivial baselines like random solutions.

Theorem 2.3.1 For maximizing rooted triplets consistency or forbidden triplets, one cannot achieve 2 better than 3 -approximation, in polynomial time. For the second problem, this is tight as it is obtained by a random tree. Similar results, hold for maximizing quartets consistency.

In our ongoing work [34], we show how to beat the trivial random baselines in a simple random access model, where we can ask noisy queries and obtain triplets or quartets or ordering constraints. Our approach uses interesting variants of Max Cut with negative weights and can also be extended to correlation clustering where we are able to beat the current best 0.766-approximation of Swamy [91]. Finally, we unveil some interesting connection between the recent global objectives that are the focus of this thesis (Dasgupta, Moseley-Wang and Cohen-Adad et al. objectives) with the aforementioned triplets consistency problems. As a byproduct we get a polynomial time approximation scheme for HC instances that are dense. We omit details from the current thesis. Chapter 3

Preliminaries

You shall know a word by the company it keeps.

-JohnRupertFirth

Here, we would like to briefly discuss some important problems and definitions that will frequently come up in the rest of the thesis. Some additional definitions and facts may be presented in the sections for which they are relevant. Section 3.2 where we introduce the Hierarchical Clustering global objectives should be new to most readers; however, a reader with background in theoretical computer science is most likely familiar to the majority of the problems and results presented here.

3.1 Background in Algorithms and Hardness

Linkage-based hierarchical clustering. Among various algorithms popular in practice for HC, we focus on two very popular ones: Single-Linkage and Average-Linkage which are two simple agglomerative clustering algorithms that recursively merge pairs of nodes (or super nodes) to create the final binary tree. Recall that Single-Linkage picks the pair of super nodes with the maximum similarity weight between their data points, i.e., merges A and B maximizing maxi A,j B wij.On 2 2 the contrary, Average-Linkage picks the pair of super-nodes with maximum average similarity

wAB weight at each step, i.e., merges A and B maximizing A B ,wherewAB := i A,j B wij. | |·| | 2 2 P Basic Graph Primitives. All approximations mentioned in this thesis will be multiplicative fac- tor approximations with respect to the unknown optimum solution. Let’s review some standard graph partitioning problems:

25 CHAPTER 3. PRELIMINARIES 26

Sparsest Cut. Given a weighted, undirected graph G =(V,E,w)(V = n) we want to find w(S,V S) | | asetS = ,V that minimizes the ratio: S V \S . It is an NP-hard problem for which many 6 ; | |·| \ | important results are known including the LP relaxation of Leighton-Rao [69] with approximation ratio O(log n) and the SDP relaxation with triangle inequality of Arora, Rao, Vazirani [8]with approximation ratio O(plog n); it is a major open question if we can improve this approximation ratio. c-Balanced Cut. This is another variation of balanced partitioning where we want to have some kind of guarantee on the size of the smallest part produced by the cut. Formally, we are given a weighted undirected graph G on n vertices and the goal is to partition the vertices into 2 pieces (S, S¯)withsizescn S , S¯ (1 c)n and of course with the total weight of the edges connecting | | | | the 2 components being small. It is known ([8]) how to get an O(plog n) pseudo-approximation algorithm for this problem, in the sense that the algorithm will produce a c0-Balanced Cut for some constant c0

Hardness. We will make use in two places of this thesis of the so-called Small Set Expansion (SSE) Hypothesis [82, 83] that is tightly related with an important question in approximability and hardness of approximation called the Unique Games Conjecture (UGC) [65]. As we will not rely on UGC directly, we omit technical details from here and focus only on SSE.

Small-Set Expansion. SSE is a hardness assumption that informally tells us the following: Given a graph G, it should be hard to distinguish between the case where there exists a small set S that has only a few edges leaving it versus the case where for all small sets S there are many edges leaving the sets. For a formal statement see Section 4.2.6. This hardness assumption is closely connected to the Unique Games Conjecture (UGC) of [65] and its variants. In particular, the SSE Hypothesis implies UGC([82]) and it has been used to prove many inapproximability results for CHAPTER 3. PRELIMINARIES 27

problems like balanced separator and minimum linear arrangement ([83]). Minimum Linear Arrangement. Given a weighted undirected multigraph G(V,E,w)(V = | | n) we want to find a permutation ⇡ : V 1, 2,..., V that minimizes: (x,y) E,x

(⇢)(µ)=1 ⇢(µ)/µ where the quantity ⇢(µ) defined as G

⇢(µ) , P(x,y) ⇢ (x t, y t), ⇠G where is the 2-dimensional Gaussian distribution with covariance matrix: G⇢

1 ⇢ "⇢ 1#

and t 0 is such that P(x,y) ⇢ x t = µ ⇠G { }

3.2 Global Objectives for Hierarchical Clustering

Recall the setting for Dasgupta’s cost function [43]. The edges i, j in G, and their strengths { } wij, reflect similarity. When clustering, we would like to avoid cutting too many edges. But in a hierarchical clustering, all edges do eventually get cut. All we can ask, therefore, is that edges be cut as far down the tree as possible. Accordingly, he defined the cost of T to be

cost (T )= w leaves(T [i j]) . (3.1) G ij | _ | i,j E { X}2 where from now on we will simply use T instead of leaves(T [i j]) . | ij | _ | Shortly after Dasgupta’s work, two recent (and independent) works took this objective function viewpoint to understand the performance of Average-Linkage. In the first work, Moseley and Wang [78] introduced a new objective that explicitly favors postponing the cutting of “heavy” edges to CHAPTER 3. PRELIMINARIES 28

when the clusters become small, which is in some sense dual to the objective introduced by Dasgupta:

T ⇤ = argmax wij (n Tij ) (3.2) all trees T · | | (i,j) E X2 In many applications, the geometry in the data is given by dissimilarity scores instead of simi- larities. In the second work, Cohen-Addad, Kanade, Mallmann-Trenn and Mathieu [39] took this view and studied a maximization version of Dasgupta’s objective where pairwise weights wij denote dissimilarities between the endpoints:

T ⇤ = argmax wij Tij (3.3) all trees T ·| | (i,j) E X2 For maximizing the similarity-based objective, the first work showed that Average-Linkage obtains 1 a 3 -approximation. Interestingly, for maximizing the dissimilarity-based objective, the second work 2 also showed that Average-Linkage gives a 3 -approximation [39]. Besides helping with understanding the performance of Average-Linkage, a comprehensive list of desirable properties of the aforemen- tioned objectives together with their proofs can be found in [43, 78, 39, 38]. Here we will only mention some key properties to give some intuition:

Relations of di↵erent objectives: As can be seen by the formulations in 3.1, 3.2, 3.3, all three • objectives look similar. Indeed in terms of optimization these are equivalent. However, in terms of approximation the landscapes change as is common in approximation algorithms.

Interpretation of 3.1: A natural interpretation of the cost function is in terms of cuts. Each • internal node of the tree corresponds to a split in which some subset of nodes S V is ✓ partitioned into two or more pieces. For a binary split S (S ,S ), the splitting cost is ! 1 2 S w(S ,S ), where w(S ,S ) is the weight of the cut, | | 1 2 1 2

w(S1,S2)= wij. i,j E: i S ,j S { }2 X2 1 2 2

This extends in the obvious way to k-ary splits S (S ,S ,...,S ), whose cost ! 1 2 k is S w(S ,...,S ), with | | 1 k

w(S1,...,Sk)= w(Si,Sj). 1 i

We would like to find the hierarchy T that minimizes this cost.

NP-hardness: A consequence of the above discussion is that the optimum tree can always be • assumed to be binary. However, still the problem is computationally hard. Dasgupta showed via a reduction from not-all-equal SAT that optimizing for his objective is NP-hard.

Recovery of Planted Partitions: In the case the graph has disconnected components, one • can prove that the optimum tree will first split apart these disconnected components. This agrees with intuition as they encode di↵erent clusters. A strengthening of this argument was also given under a probabilistic generative model for the input graph, the Stochastic Block Model. For more details see [43]. Moreover, an even stronger version of the recovery result under planted partitions, is the one in [39] where they define a notion of Hierarchical Stochastic Block Model (HSBM). This is a probabilistic model suitable to capture hierarchical structures; they assume that the ground-truth is given by a binary tree (similar to assuming the ground truth is a balanced bipartition in the standard stochastic block model) and the uncertainty for the presence of edges is modeled under the assumption that for two nodes their probability of being connected or not, only depends on their lowest common ancestor in the ground-truth tree. They show that for instances drawn from the HSBM, one can achieve better approximations than what is possible in the worst-case and also that Dasgupta’s optimizer recovers (almost) the ground truth.

Axiomatic Approach to HC: In [39], they also take an axiomatic approach to HC objectives, • trying to formulate specific desiderata that should be satisfied by any “reasonable” objective. For example, one of these desiderata is that whenever the input similarities are generated from a ground-truth instance (that they define based on the notion of ultrametrics), then the optimizer of 3.1 should coincide with the ground-truth. An interesting condition that arises in their framework as a necessary property of such “reasonable” HC objectives, is the fact that when the input is a unit-weight clique, then every tree has the same cost exactly. In some sense, this captures that cliques have no hierarchy to be found. Chapter 4

Black-box Approximations for Hierarchical Clustering

Nice theorem; it seems that we now have two proofs for it, and one counterexample.

-Telemachos

This chapter is the main part of the thesis where we study 3 di↵erent global objectives for Hier- archical Clustering from a worst-case analysis perspective. We will devise approximation algorithms based on standard graph partitioning tools like Sparsest Cut or Balanced Cut,wewillreview certain related work about Average-Linkage and also give several hardness results.

4.1 Minimizing HC cost: Graph Cuts and Spreading Metrics

HC via Sparsest Cut and Balanced Cut

Given a graph G with weighted edges, we have seen that it is NP-hard to find a tree that minimizes cost(T ). We now consider top-down heuristics that begin by choosing a split V (S, V S) according ! \ to some criterion, and then recurse on each half. What is a suitable split criterion? The cost of a split (S, V S)is V w(S, V S). Ideally, we would like to shrink the node set \ | |· \ substantially in the process, since this reduces the multiplier on subsequent splits. For a node chosen at random, the expected amount by which its cluster shrinks as a result of the split is

S V S 2 S V S | | V S + | \ | S = | || \ |. V ·| \ | V ·| | V | | | | | | A natural greedy criterion would therefore be to choose the split that yields the maximum shrinkage

30 CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 31

per unit cost, or equivalently, the minimum ratio

w(S, V S) \ . S V S | |·| \ | This is known as the sparsest cut and has been studied intensively in a wide range of contexts. Although it is NP-hard to find an optimal cut, a variety of good approximation algorithms have been developed [73, 9]. We will assume simply that we have a heuristic whose approximation ratio, on graphs of n nodes, is at most ↵n times optimal, for some positive nondecreasing sequence (↵n).

For instance, the Leighton-Rao algorithm [73] has ↵n = O(log n) and the Arora, Rao, Vazirani algorithm [9] has ↵n = O(plog n).

function MakeTree(V) If V =1: return leaf containing the singleton element in V | | Let (S, V S) be an ↵n-approximation to the sparsest cut of V LeftTree =\ MakeTree(S) RightTree = MakeTree(V S) Return [LeftTree, RightTree\ ]

Figure 4.1: A top-down heuristic for finding HC that approximately minimizes cost(T ).

The resulting hierarchical clustering algorithm is shown in Figure 4.1. The output can be thought of as a binary tree of the sequence of cuts performed by the algorithm. It is interesting that even though the cost function over hierarchies does not explicitly ask for sparse cuts (costG(T ) contains no ratios), the sparsest cut objective emerges organically when one adopts a greedy top-down approach to constructing the tree. We will now see that this algorithm returns a tree of cost at most O(↵n) times the optimal.

4.1.1 Recursive Sparsest Cut

Initially, Dasgupta [43] showed that this simple top-down recursive Sparsest Cut (RSC) heuris- tic, that uses an ↵n-approximation algorithm for uniform sparsest cut gives an approximation of

O(↵n log n) for hierarchical clustering. In this section, by drawing inspiration from our SDP construction and analysis presented later in Section 4.1.4, we present an improved analysis for this simple heuristic, dropping the log n factor and showing that it actually yields an O(↵n) approximation. This is satisfying since any improvement for Sparsest Cut would immediately yield a better approximation result for hierarchical clustering. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 32

Analysis of RSC heuristic

Let the given graph be G =(V,E). We suppose for clarity of presentation that it is unweighted; the analysis applies directly to weighted graphs and later, we see how to generalize it for more general cost functions. Let OPT be the optimal solution for hierarchical clustering (we abuse notation slightly by using OPT to denote both the solution as well as its objective function value). Let OPT(t) be the maximal clusters in OPT of size at most t. Note that OPT(t) is a partition of V .

We denote EOPT(t) the edges that are cut in OPT(t), i.e. edges with end points in di↵erent clusters in OPT(t). For convenience, we also define EOPT(0) , E (see Figure 4.2).

size 4t  size 2t  size t 

Figure 4.2: Illustration for the sets EOPT(t), EOPT(2t), EOPT(4t). Looking at clusters that are max- imal and of size at most t (filled in blue), 2t (yellow border), 4t (brown border), we have that the red edges belong to EOPT(2t), whereas in EOPT(t), we have the red and the green edges. Any edge that goes out of the big cluster of size 4t (brown) contributes to all terms E (t),E (2t),E (4t).  OPT OPT OPT

n 1 Claim 4.1.1 OPT = E (t) . t=0 | OPT | P Proof. Consider any edge (u, v) E. Suppose that the size of the minimal cluster in OPT that 2 contains both u and v is r. Then the contribution of (u, v) to the LHS is r. On the other hand, (u, v) E (t) for all t 0,...,r 1 . Hence the contribution to the RHS is also r. 2 OPT 2{ } It will be convenient to use the following bound that is directly implied by the above claim:

n 1 n 2 OPT = 2 E (t) E ( t/2 ) (4.1) · · | OPT | OPT b c t=0 t=0 X X Let’s look at a cluster A with size A = r in the solution produced by RSC. Using a sparsest | | cut approximation algorithm, we create two clusters B ,B with sizes s, (r s)respectively,with 1 2 B being the smaller, i.e. s r/2 . The contribution of this cut to the hierarchical clustering 1 b c objective function is: E(B ,B ) r. We basically want to charge this cost to OPT( r/2 ) and for | 1 2 |· b c CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 33

that we first observe that the edges cut in OPT( r/2 ), when restricted to the cluster A (i.e. having b c both endpoints in A), satisfy the following:

r s E ( r/2 ) A E ( t/2 ) A (4.2) ·| OPT b c \ | | OPT b c \ | t=r s+1 X This follows easily from the fact that E (t) A E (t 1) A (by definition, the smaller | OPT \ || OPT \ | the value of t, the more edges are cut). Now in order to explain our charging scheme, let’s look at the partition A ,...,A induced inside the cluster A by OPT( r/2 ) A, where by design the size 1 k b c \ of each A = A , 1/2. We have: | i| i| | i  E(A ,A A ) E(A ,A A ) | i \ i | = | i \ i |, i 1,...,k A A A (1 )r2 8 2{ } | i|| \ i| i i We take the minimum over all i (an upper bound on the sparsest cut in A) and we have:

E(A ,A A ) E(A ,A A ) E ( r/2 ) A E ( r/2 ) A min | i \ i | i | i \ i | 2 | OPT b c \ | =4 | OPT b c \ | i (1 )r2  (1 )r2  · r2/2 · r2 i i P i i i The first inequality above,P trivially follows by definition for the minimum and the second in- equality holds because k =1, k 2 1/2 (note that all 1/2) and the factor of 2 i=1 i i=1 i  i  is introduced since we doubleP countedP every edge. We partition A using an ↵r-approximation for sparsest cut and we get (since ↵ ↵ ): r  n E(B ,B ) 4 | 1 2 | ↵ E ( r/2 ) A s(r s)  n · r2 ·| OPT b c \ | since the RHS (without the ↵n factor) is an upper bound of the optimal sparsest cut value. The contribution of this step to the hierarchical clustering objective function is:

4↵ s(r s) r E(B ,B ) n E ( r/2 ) A 4↵ s E ( r/2 ) A (4.3) | 1 2 | r ·| OPT b c \ | n ·| OPT b c \ |

We claim the following:

Claim 4.1.2 Let A be a cluster of size rA in our hierarchical clustering solution, that we split into 2 pieces (B ,B ) of sizes s ,r s respectively with B B (so s stands for the size of the 1 2 A A A | 1|| 2| A small piece B1 after we split A). Then, summing over all clusters A we get:

rA n E ( t/2 ) A E ( t/2 ) | OPT b c \ | | OPT b c | A t=r s +1 t=0 X AX A X Proof. For a fixed value of t and A, the LHS is: E ( t/2 ) A . Consider which clusters A | OPT b c \ | contribute such a term to the LHS. From the fact that r s +1 t r , we need to have that A A   A CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 34

B B B , i.e. if both A’s children are of size less than t, | | | 2|| 1| then this cluster A contributes such a term. The set of all such A form a disjoint partition of V because of the definition for minimality (in order for them to overlap in the hierarchical clustering, one of them needs to be ancestor of the other and this cannot happen because of minimality). Since E ( t/2 ) A for all such A forms a disjoint partition of E ( t/2 ), the claim follows by OPT b c \ OPT b c summing up over all t.

Theorem 4.1.3 Given an unweighted graph G, the Recursive Sparsest Cut algorithm achieves an

O(↵n) approximation for the hierarchical clustering problem.

Proof. The proof follows easily by combining (4.1), (4.2), (4.3), Claim 4.1.2 and summing over all clusters A created by RSC. In particular, we get the following result for the overall performance guarantee:

cost = r E(B ,B ) 4↵ s E ( r /2 ) A RSC A ·| 1 2 | n A| OPT b A c \ | XA XA

rA n 4↵n A t=r s +1 EOPT( t/2 ) A 4↵n t=1 EOPT( t/2 ) 8↵n OPT.  A A | b c \ | | b c | · P P P 4.1.2 Using Balanced Cut instead of Sparsest Cut

We are going to give an analysis similar to the above, but instead of using the Sparsest Cut, we are going to use c-Balanced Cut as a black box. We will get the same approximation factor and at the end, we will have a brief discussion on comparing the two approaches. We will follow the same notation as above and we will use some of the facts and inequalities we previously proved about EOPT(t). Suppose again that we have a cluster A of size r and take E ( r/2 ) A . The important observation here is that the partition A ,...,A induced inside | OPT b c \ | 1 k the cluster A by OPT( r/2 ) A (where by design the size of each A = A , 1/2), can be b c \ | i| i| | i  separated into two groups, let’s say (C ,C ) such that r/3 C , C 2r/3(seeFigure 4.3). In 1 2 | 1| | 2| other words we can demonstrate a c-Balanced Cut for c =1/3. We know that:

E(C ,C ) E ( r/2 ) A | 1 2 || OPT b c \ | since we cut fewer edges when creating C1,C2. At this point, we can invoke an ↵-pseudo-approximation algorithm for Balanced Cut [8] for the cluster A (let OPTBC be the optimal cost of partitioning A into two clusters with sizes at least r/3) and we can get a partition into (B ,B )withc0r B , B (1 c0)r, (the constant c0 depends 1 2 | 1| | 2| only on c and c0 < 1/3, that’s why we call it pseudo-approximation) with the cost guarantee of:

E(B ,B ) ↵ OPT ↵ E(C ,C ) | 1 2 | · BC  ·| 1 2 | CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 35

size r  size r  2

Figure 4.3: An illustration of EOPT(r/2) A for a cluster A with size r. The small clusters have size r and the induced partition| can be separated\ | into two groups (C ,C ) that are 1 : 2 balanced.  2 1 2 3 3 The edges between those two groups (red) are also present in EOPT(r/2) (red and green).

↵ E ( r/2 ) A  ·| OPT b c \ | Looking at the cost of the hierarchical clustering which is r E(B ,B ) and charging it as we did ·| 1 2 | before to s E ( r/2 ) A where now s c0r = r s/c0 we get: ·| OPT b c \ | )  s r E(B1,B2) ↵r E(C1,C2) ↵ E(C1,C2) ·| | ·| | c0 ·| |

s ↵ r ↵ E ( r/2 ) A E ( t/2 ) A  c ·| OPT b c \ |c | OPT b c \ | 0 0 t=r s+1 X Finally, all we have to do is sum up over all the clusters A (now in the summation we should write rA,sA instead of just r, s, since there is dependence in A) produced by the recursive Balanced Cut algorithm for Hierarchical Clustering and we get that we can approximate the HC objective function up to O(plog n) (here we just substituted the best known value for the approximation guarantee ↵).

Remark 3 The results for Balanced Cut generalize for the weighted case and for general cost func- tions. The reason we were interested in Balanced Cut is twofold: Firstly, recall that the running time for Sparsest Cut and Balanced Cut on a graph with n nodes and m edges is O˜(m + n1+✏).Now, the running time of Recursive Sparsest Cut might be worse o↵by a factor of n (you could have very unbalanced splits at every step) in the worst case. However, the running time for Recursive Balanced Cut is still O˜(m + n1+✏), because at each step you ensure that you have similar sized pieces and thus you make progress in the recursion. Secondly, we know that an ↵n-approximation for Sparsest Cut yields an O(↵n)-approximation for Balanced Cut, but not the other way. This means that if we could find a better Balanced Cut algorithm without improving Sparsest Cut, we could still use it for better Hierarchical Clustering. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 36

4.1.3 Generalized HC Cost Function Approximation

In the original [43] paper introducing the objective function of hierarchical clustering, Dasgupta also considered the more general cost function: costG(T )= ij E wijf( leaves(T [i j]) ), where f is 2 | _ | defined on the non-negative reals, is strictly increasing, andP has f(0) = 0 (e.g. f(x)=ln(1+x) or f(x)=x2). For this more general cost function, he proved that a slightly modified greedy top-down w(S,V S) 1 2 heuristic (using min((f S ),f\( V S )) with 3 V S 3 V , instead of Sparsest Cuts) continues to | | | \ | | || | | | 1 f(n0) yield an O(↵n log n cf ) approximation ,wherecf , max1 n0 n ). Now, we analyze the · ·   f(n0/3) previous RSC algorithm (with no modifications), but in the case of a weighted graph G and when we are trying to optimize the generalized cost function. We again make the natural assumptions that the function f acting on the number of leaves in subtrees, is defined on the nonnegative reals, is strictly increasing and f(0) = 0 (also see Remark 4). f(n0) We also define: cf , max1 n0 n f( n /2 ) f( n /4 ) . For what follows, we abuse notation slightly for   b 0 c b 0 c ease of presentation and write r/2,r/4 etc. instead of r/2 , r/4 etc. As in the simple unweighted b c b c case, we use here the same definitions for OPT and EOPT(t). Let w(EOPT(t)) denote the total weight of the edges EOPT(t), i.e. the edges cut by OPT at level t,wherewedefinew( ) = 0 and we r 1 ; also define g(t) f(t + 1) f(t). We note that g(t)=f(r) f(0) = f(r). , t=0 n 1 P Claim 4.1.4 w(E (t)) g(t)=OPT t=0 OPT · P Proof. We will prove that the contributions of an edge e =(u, v) to the LHS and RHS are equal. Let A ( A = r ) be the minimal cluster in the optimal solution that contains both u, v. The contribution | | e of e to the RHS is: w f(r ). As for the contribution to the LHS, since A is minimal and A = r , e · e | | e we deduce that e OPT(t), t

w(cut(A)) 4↵ (w(E (r/2) A) ↵ SC(A) n OPT \ = s(r s)  n  r2 ) 4↵ s w(cut(A))f(r) n w(E (r/2) A)f(r) (4.4)  r OPT \ CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 37

Since w(E (t) A) w(E (t + 1) A), we have: OPT \ OPT \

r/2 1 r/2 1 w(E (t) A)g(t) w(E (r/2) A) g(t)=(f(r/2) f(r/4))w(E (r/2) A) OPT \ OPT \ OPT \ t=Xr/4 t=Xr/4 (4.5)

Using equations (4.4), (4.5), we get that:

r/2 1 s f(r) w(cut(A)) f(r) 4↵ w(E (t) A) g(t) (4.6) ·  n · r · f(r/2) f(r/4) · OPT \ · t=Xr/4 We now sum up the cost contributions of all clusters created in our hierarchical clustering solution. Let s(A) be the size of the smaller piece produced in partitioning A.

A /2 1 s(A) | | cost = w(cut(A)) f( A ) 4↵ c w(E (t) A) g(t) (4.7) RSC · | |  n · f A OPT \ · A A t= A /4 X X | | X| |

n 1 To complete our argument we need to make the comparison between OPT which is: w(E (t)) t=0 OPT · g(t) and the sum P A /2 1 s(A) | | w(E (t) A) g(t) (4.8) A OPT \ · A t= A /4 X | | X| | where the first summation goes over all clusters A in the solution we produce.

s(A) A /2 1 n 1 Claim 4.1.5 | | w(E (t) A) g(t) 2 w(E (t)) g(t) A A t= A /4 OPT \ ·  · t=0 OPT · | | | | P P P Proof. Consider some edge e =(u, v) EOPT(t). Focus on sets A in the solution produced such 2 A /2 1 | | that e EOPT(t) A so that e contributes to the term t= A /4 w(EOPT(t) A) g(t) in the LHS. 2 \ | | \ · For all such clusters A,weneedtohave: A /4 t< A /2= 2t< A 4t. | |  | P| ) | | Let A1,A2,...,Ak 1 be the sets for which the term w(EOPT(t) A) appears: A1 is the largest \ cluster (satisfying 2t< A 4t) that contains the edge e =(u, v) and when split we call its larger | 1| piece A2 (again this set contains e)etc.,Ak 1 is the last set for which the term appears and (u, v) does not appear in Ak (Ak is the larger piece of the two that we got when we partitioned Ak 1). We have:

k 1 k 1 s(Ai) A1 A2 A2 A3 Ak 1 Ak i=1 Ai Ai+1 A1 = | || | + | || | + ... + | || | | || | | | 2. A A A A  min A  2t  i=1 i 1 2 k 1 i i X | | | | | | | | P | | (4.9)

(the constant can be optimized, but it does not change the asymptotic bound). Thus the contribution CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 38

of every edge e E (t) to the LHS is at most 2w g(t). Note that this is exactly the contribution 2 OPT e to the RHS. This establishes the claim.

Theorem 4.1.6 RSC achieves an O(c ↵ ) approximation of the generalized objective function for f · n Hierarchical Clustering.

Proof. We will prove the theorem by combining (4.7), Claim 4.1.5 and Claim 4.1.4. In particular, from (4.7), A /2 1 s(A) | | cost 4↵ c w(E (t) A) g(t) RSC  n · f A OPT \ · A t= A /4 X | | X| | Combining the above with Claim 4.1.5, we get that the total cost of the RSC is at most:

n 1 cost 8c ↵ w(E (t)) g(t) RSC  f n · OPT · t=0 X Finally, using the Claim 4.1.4 we get: cost (8c ↵ ) OPT = O(c ↵ ) OPT. RSC  f n · f · n ·

Remark 4 In order for our guarantee to be useful, we need cf to be a constant (or a slowly growing quantity). This would mean that f is polynomially growing. We observe that in the case where the function f is exponentially growing then our guarantee is not interesting (and in fact we may need to use a di↵erent strategy than RSC) and in the case f is logarithmic, then we would get a factor O(↵ log n) approximation, which is the same guarantee as [43]. ⇡ n

4.1.4 Convex Relaxations for HC

In this section, we present our SDP relaxation for HC based on spreading metrics, we point out its relation with the SDP relaxation of k-balanced partitioning in [67] and we prove that it is an O(plog n) approximation for both the simple and the generalized cost function.

Writing the SDP

We view a hierarchical clustering of n data points as a collection of partitions of the data, one for each level t = n 1,...,0 (where level 0 is identical to level 1, for convenience). The partition for a particular level t satisfies the property that every cluster has size at most t; additionally, for every vertex i, the cluster containing vertex i at level t is the maximal cluster in the hierarchy with size at most t. The partition at level (t 1) is a refinement of the partition at level t. Note that the partition corresponding to t = 1 must consist of n singleton clusters. We represent the partition at level t by the set of variables xt , i, j V ,wherext =1ifi and j are in di↵erent clusters in the ij 2 ij t partition at level t and xij =0ifi and j are in the same cluster. We point out some properties of t these variables xij satisfied by an integer solution corresponding to an actual hierarchical clustering: CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 39

t t 1 1. refinement: x x .Ifi and j are separated at level t, then they continue to be separated ij  ij at level t 1. 2. triangle inequality: xt + xt xt . In the clustering at level t,ifi and j are in the same ij jk ik cluster, j and k are in the same cluster, then i and k are in the same cluster.

2 t 3. `2 metric: The triangle inequality condition implies that xij is a metric. Further, we can associate unit vectors vt with vertices i at level t such that xt = 1 vt vt 2. In order to i ij 2 || i j||2 do this, all vertices in the same cluster at level t are assigned the same vector, and vertices in di↵erent clusters are assigned orthogonal vectors.

4. spreading: xt n t. For the clustering at level t, there are at most t vertices in the j ij same cluster as i. Hence there are at least n t vertices in di↵erent clusters. For each such P t vertex j, xij = 1 implying the inequality.

5. cluster size: The size of the smallest cluster in the hierarchy containing both vertices i and n 1 t j is given by 1 + t=1 xij. Suppose C is the smallest cluster containing both i and j.Then for t C , the partition at level t must contain C or some superset of C. Hence xt = 0 for | | P ij t C . For t< C , the clustering at level t must have i and j in di↵erent clusters, hence | | | | t n 1 t x = 1. Hence x = C 1. Finally, we can write the SDP relaxation SDP-HC as ij t=1 ij | | follows: P n 1 n 1 1 min xt w =min vt vt 2w , ij ij 2k i jk2 ij t=0 ij E t=0 ij E X X2 X X2 t t 1 such that: x x ,t= n 1,n 2,...1 (SDP-HC) ij  ij x0 =1, i, j V and xt 1, i, j, t ij 8 2 ij  8 1 xt = vt vt 2 and vt 2 =1, i V ij 2k i jk2 k i k2 8 2 xt xt + xt , i, j, k V, t and xt n t, i, t ij  jk ik 8 2 8 ij 8 j X It is easy to see that an optimal solution to SDP-HC can be computed in polynomial time (see for example the arguments in [67]). By the preceding discussion, we have shown that SDP-HC is a valid relaxation for HC:

Lemma 4.1.7 The value of an optimal solution to SDP-HC can be computed in polynomial time, and gives a lower bound on the cost of an optimal solution to the hierarchical clustering problem. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 40

Connections of SDP-HC with Balanced Partitioning

The authors of [67] write an SDP relaxation for the problem of k-Balanced Partitioning (k-BP) which was the following (SDP-k-BP):

1 min w v v 2, (SDP-k-BP) ij · 2k i jk2 ij E X2 such that: v v 2 + v v 2 v v 2, i, j, k V k i jk2 k j kk2 k i kk2 8 2 1 n v v 2 S , S V,i S 2k i jk2 | | k 8 ✓ 2 j S X2 Their result was that the above relaxation is an O(plog k log n) approximation (bi-criteria ⌫ = 2) algorithm for k-BP, that will create pieces of size at most 2n/k.

Claim 4.1.8 Let A be a cluster of size r. SDP-HC solution restricted to set A, at level t = r/4 is a valid solution for k-balanced partitioning based on the SDP-k-BP relaxation, where k =4.

Proof. First of all, we have that:

1 SDP (t) xt w = vt vt 2w A , ij ij 2k i jk2 ij ij E,i,j A ij E,i,j A 2 X 2 2 X 2 Basically, we take the SDP-HC solution for the entire instance and we focus on some part of it, i.e. we focus on the terms of the objective function that are relevant for the cluster A. To prove the claim all we need to do is to compare the set of constraints imposed by SDP-HC and SDP-k-BP. In SDP-HC, we have some additional constraints: xt 1 and vt 1, but that is ij  i  fine since imposing extra constraints just makes a stricter relaxation. Now let’s look at the spreading t t constraints: In SDP-HC we have j xij n t = j S xij S t which is basically the ) 2 | | SDP-k-BP spreading constraints.P Thus, by looking atP the SDP-HC solution restricted to set A ( A = r), at level t = r/4, we can get a valid 4-balanced partitioning solution of A. | | In order to produce a hierarchical clustering from the SDP solution, we recursively partition V in a top down fashion: while partitioning a cluster A, we use the SDP-HC solution restricted to set A at level t = A /4 as a valid solution for 4-balanced partitioning and invoke the algorithm of | | [67] as a black box. Let EA be the edges cut by the algorithm when splitting cluster A. From the analysis of [67] , we get that (for us k = 4, so log k is constant):

w(E ) O( log n) SDP (r/4) (4.10) A  · A p and we partition A into pieces of size at most 2 r/4=r/2 (bi-criteria). In the analysis that  · follows, we will use this result as a black box. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 41

4.1.5 O(plog n) approximation for Hierarchical Clustering

Now we go on to see that the integrality gap of our SDP-HC is O(plog n). Let rA be the size of a cluster A in the solution produced. For our charging argument, we observe that we pay r w(E ) A · A where EA are the edges cut by the KNS [67] algorithm when partitioning A. We will charge rA/4 rA this cost to SDPA(t) SDPA(rA/4) (note that as t decreases more edges are cut). t=rA/8+1 8 Thus, usingP [67], the total cost of the solution produced (where costHC is the cost of the rounding algorithm’s solution and rA depends on A):

rA/4 cost = r w(E ) O( log n) SDP (t) (4.11) HC A · A  A A A X p X t=rXA/8+1

A /4 | | Claim 4.1.9 A t= A /8+1 SDPA(t) O(SDP-HC). | |  P P Proof. The flavor of this analysis is similar to our RSC result from Section 4.1.1. Let’s look at an edge e =(u, v) at a fixed level t. For which sets A do we get the term SDPA(t) (i.e. both endpoints u, v A)? Sincet A /8, A /4 = 4t A < 8t. There can be at most one such 2 2 | | | | ) | | A containing both u, v, so LHS is charged only once (of course the RHS is charged xt w for that | | ⇤ ij ij edge). To see why A is unique, suppose we had two such clusters A , A that both contained u, v | 1| | 2| with their sizes A , A [4t, 8t). Since we have a hierarchical decomposition, one of A ,A is | 1| | 2|2 1 2 ancestor of the other. Let’s say, wlog, A1 is ancestor of A2. But then, all of its descendants are of size below the range [4t, 8t) due to the 4-partition, which is a contradiction.

Remark 5 In the above analysis, whenever we write A /4 we mean A /4 . However, this will not | | b| | c a↵ect the result. Additionally, we used the bound O(SDP-HC), because some extra constants might be introduced whenever the set A is small ( A < 8). | | Theorem 4.1.10 The cost of the solution produced by the SDP-HC rounding algorithm is within a factor of O(plog n) from the SDP value.

Proof. Using Claim 4.1.9 and (4.11) we get that cost O(plog n) SDP-HC. HC  ·

The case of the generalized cost function

Now, we consider the performance of SDP-HC-gen for the generalized cost function and we show essentially the same approximation guarantees. We note that the SDP-HC-gen is essentially the same as SDP-HC where each term in the objective function is multiplied by g(t)=f(t + 1) f(t). Formally:

n 1 n 1 1 min g(t) xt w =min g(t) vt vt 2w (SDP-HC-gen) ij ij 2k i jk2 ij t=0 ij E t=0 ij E X X2 X X2 CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 42

We can easily prove the following claim for the generalized cost:

A /4 | | Claim 4.1.11 A t= A /8+1 SDPA(t) g(t) O(SDP-HC-gen). | | ·  Proof. The proofP P of the claim is identical to the proof we gave above for Claim 4.1.9 with only di↵erence being that we need to multiply by g(t) each term.

Theorem 4.1.12 The cost of the solution produced by the SDP-HC-gen rounding algorithm is within f(r) a factor of O(plog n cf ) from the SDP value where cf , maxr 1,...,n f(r/4) f(r/8) . · 2{ } Proof. Let A be a cluster of size A = r and let g(t)=f(t+1) f(t). We want to compare the cost | | of OPT for splitting A with our solution SDPA(t) for levels t = r/8+1,...,r/4. Using (4.10) (once again costHC(A) is just the cost incurred by the rounding algorithm when we partitioned cluster A): cost (A)=f(r) w(E ) O( log n)f(r) SDP (r/4) HC · A  · A  r/4 p r/4 f(r) O( log n) SDP (t) g(t) O( log n) c SDP (t) g(t).  f(r/4) f(r/8) A ·  · f A · p t=Xr/8+1 p t=Xr/8+1 Using now Claim 4.1.11 and summing over all clusters A in the hierarchical clustering we get (for the sum we should substitute r by r ): cost O(plog n) c SDP-HC-gen. A HC  · f · Remark 6 As in Remark 4, here f should be polynomially growing.

4.1.6 An LP-based O(log n) approximation via spreading metrics

In this section, we present an approximation algorithm with ratio O(log n) based on an LP relax- ation. The key ingredients for our purposes are the spreading metrics paradigm of [48] and a graph decomposition lemma by Bartal ([18]). A similar analysis and a O(log n) approximation guarantee was obtained independently from Roy and Pokutta [85] who derived a similar Linear Programming formulation based on ultrametrics and using the idea of sphere growing ([68, 53]) from graph partitioning for the rounding step. They also performed a series of experiments (on synthetic and real data) which suggest that the hierarchies found by using their LP formulation and their rounding algorithm are close to the true optimum (according to the hierarchical clustering cost function) and often have better projections into flat clusters than the standard linkage based algorithms or the k-means algorithm.

Bartal’s decomposition

Bartal presented a graph decomposition lemma and used it in order to prove an O(log n) approxi- mation guarantee for the spreading metrics paradigm in undirected graphs; thus, he improved the results for many problems considered in [48]. How does the decomposition work? At a high level, CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 43

it tries to find a low diameter cluster within the graph, such that the weight of the cut created is small with respect to the weight of the cluster. The decomposition is essentially based on a careful implementation of the decomposition of [53]. In what follows, we state Bartal’s improved approxi- mation guarantee, summarized in the following theorem, and then briefly highlight the main steps in achieving it; proofs can be found in [18].

Theorem 4.1.13 There exists an O(log n) approximation for problems in the spreading metrics paradigm.

We first need to introduce some notation, before explaining the main steps of the proof. Let G = (V,E) be an undirected graph with two weight functions w, l : E R+.Weinterpretl(e)tobe ! the length of the edge e, and the distance d(u, v) between pairs of vertices u, v in the graph, is determined by the length of the shortest path between them. Given a subset S V , G(S) denotes ✓ the subgraph of G induced by S. Given partition (S, S¯), let (S)= (u, v) E; u S, v S¯ { 2 2 2 } and cut(S)= e (S) w(e). Given a subgraph H =(VH ,EH ) of G,letdH denote the distance 2 in H,let(HP) denote the diameter of H, and =(G). We also define the volume of H, (H)= w(e)l(e). e EH 2 ¯ Informally,P the first step needed is to find in G, a partition (S, S) which is “good”, i.e. with low cut value cut(S) with respect to a generalized notion of its volume. The decomposition becomes useful when it is applied recursively. Note, that this is particularly important for our main application which is hierarchical clustering. This gives rise to a recursive approach where we find a good cut, creating two subgraphs and then recuse on each of them. Towards this direction, for the second step needed, we focus on applications which are associated with a cost function cost over subgraphs G of G, that is nonnegative, 0 on singletons and obeys the following natural recursion rule: b cost(G) cost(G(S)) + cost(G(S¯)) + (G) cut(S). (4.12)  · b b b b

Lemma 4.1.14 Any cost function defined by the above recursion rule obeys cost(G) O(log(/ ))  0 · (G), where = (G) and 0 is the minimum value of (G) on non-singleton subgraphs G.

Finally, we can obtain a O(log n) approximation bound (dependingb only on n), by modifyingb the above lemma slightly, by associating a volume (G)/n with the nodes, like in [53]. This ensures that (G)/n and by substituting we get what we want: 0 Lemma 4.1.15 The function defined by the above recursion rule obeys cost(G) O(log n) (G).  · Now we turn our attention to the connection with the spreading metrics paradigm. Having the definition of a spreading metric in mind (see Section 4.1.6) and the previous three steps, we may obtain Theorem 4.1.13. For details see [18]. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 44

LP relaxation and the Spreading Metrics paradigm

We prove here that the hierarchical clustering objective function defined above falls into the divide and conquer approximation algorithms via spreading metrics paradigm of [48]. The spreading metric paradigm applies to minimization problems on undirected graphs G = (V,E) with edge weights w(e) 1. We also have an auxiliary graph H and a scaler function on subgraphs of H (e.g. size of the components of H). A decomposition tree T is a tree with nodes corresponding to non-overlapping subsets of V , forming a recursive partition of the nodes V . For anodet of T , we denote by Vt the set of vertices in the subtree rooted at t. Associated are the subgraphs Gt,Ht induced by Vt. Let Ft be the set of edges that connect vertices that belong to di↵erent children of t, and w(Ft)= e F w(e). The cost of T is cost(T )= t T scaler(Ht) w(Ft). 2 t 2 · P P Definition 4.1.1 A spreading metric is a function on the edges of the graph l : E R+ satisfying ! the following two properties:

1. Lower bound property: The volume of the graph e E w(e)l(e) is a lower bound on the optimal 2 cost. P 2. Diameter property: For any U V and H , the subgraph of H induced by U, has diameter: ✓ U

(U) scaler(H ) U

We closely follow the formulation in [48] for the Linear Arrangement problem, which also falls into the spreading metrics paradigm, but we make the necessary semantic changes. We need to show the divide and conquer applicability and the spreading metrics applicability of their result for our problem. Firstly, to establish the divide and conquer applicability we consider any binary decomposition tree T that fully decomposes the problem (we normalize the edge weights by dividing with the minimum edge weight). Note that there is a 1 1 correspondence between the leaves of T and the vertices of G. The solution to the hierarchical clustering problem that is represented by T is easily given by the cuts, in G, induced by the internal nodes of T . The cost of the tree T is: costG(T )= t T Vt w(Ft), where Vt and Ft are the set of vertices and cut corresponding to the 2 | | tree node t andP w(Ft) is the total weight of the edges cut at this internal node t.Weneedtoshow that this cost bounds the cost of solutions built up from T . For this we prove that for every tree node t the cost of the subtree rooted at t, denoted Tt, bounds the cost of solutions built up from Tt to the hierarchical clustering problem for the subgraph of G induced by the set of vertices Vt.We prove the claim by induction on the level of the tree nodes. The claim clearly holds for all leaves of T . Consider an internal tree node t T and denote its two children by t and t . By induction 2 L R the claim holds for both tL and tR. The solution represented by Tt is given by concatenating the solutions represented by T and T . Note that the additional cost is at most V times the capacity tL tR | t| CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 45

of the cut Ft that separates VtL from VtR . We get

cost (T ) cost (T )+cost (T )+ V w(F ). G t  G tL G tR | t| t

The inductive claim follows. We now show how to compute the spreading metric that assigns length l(e) to an edge e E of 2 the graph. Consider the following linear program (LP1):

min w(e) l(e) such that: (4.13) · e E X2 1 U V, v V : dist (u, v) ( U 2 1) (4.14) 8 ✓ 8 2 l 2 | | u U X2 e E : l(e) 0 (4.15) 8 2

In the linear program, we follow the notation that regards l(e) as edge lengths, and distl(u, v)isthe length of the shortest path from u to v. We will refer to constraint (4.14) as the spreading constraint. The linear program can be solved in polynomial time since we can construct a separation oracle. In order to verify that the spreading constraint (4.14) is satisfied, for each vertex v, we sort the vertices in V in increasing order of distance distl(u, v) and verify the spreading constraint for all prefixes U of this sorted order.

Lemma 4.1.16 Let l(e) denote a feasible solution of the linear program. For every U V with 1 ✓ ( U > 1), and for every vertex v U there is a vertex u U for which dist (u, v) ( U 1). | | 2 2 l 2 | | Proof. The average distance of a node u U v from v is greater than 1 ( U 1), because of the 2 { } 2 | | constraint corresponding to U and v. Therefore, there exists a vertex u U whose distance from v 2 1 is at least the average distance from v, and the lemma follows, since dist (u, v) ( U 1). l 2 | | Note that the previous lemma comes short of the diameter guarantee by a factor of 2: while the diameter guarantee requires that the diameter of a subset U be greater than U , the proven bound | | is only U /2. However, this only a↵ects the constant in the approximation factor. In the next | | lemma, we prove that the volume of an optimal solution of the linear program satisfies the lower bound property.

Lemma 4.1.17 The cost of an optimal solution of the linear program is a lower bound on the cost of an optimal hierarchical clustering of G.

Proof. Consider any binary hierarchical clustering given by the sequence of cuts in the decom- position tree T and define l(e)= leaves(T [i j]) for edge e =(i, j) E. It is easy to see that | _ | 2 this is indeed a metric and it is actually an ultrametric. We show that l(e) is a feasible solution for the linear program above. The cost e E w(e) l(e) equals the cost of the hierarchical clustering 2 · P CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 46

induced by the tree T . The feasibility of l(e) is proved as follows: Consider a subset U V and a ✓ vertex v U. We observe that the average distance from v of the vertices in U will be minimized 2 when U is “packed around” v, meaning that with each cut we peel o↵only one vertex at a time. We have that:

1 dist (u, v)=2+3+... + U ( U 2 2) l | |2 | | u U X2 Hence, l( ) is a feasible solution and the lemma follows. · With the above two lemmas we have proved that our hierarchical clustering objective function falls into the spreading metrics paradigm, because it satisfies the lower bound property and the diameter property. Using Bartal’s decomposition and specifically Theorem 4.1.13 from Section 4.1.6 we get an approximation guarantee of O(log n):

Theorem 4.1.18 There exists an O(log n) approximation for the hierarchical clustering objective function defined by (1.1).

4.1.7 Hardness via Min Linear Arrangement and Small Set Expansion

SSE and hardness amplification

Given a graph G(V,E), define the following quantities for non-empty subsets S V : normalized E(S, V S) ⇢ set size µ(S) , S / V , and edge expansion G(S) , | \ | (here di is the degree of i). The | | | | i S di Small Set Expansion hypothesis was introduced by Raghavendra2 and Steurer [82]. P Problem 4.1.7.1 (Small-Set Expansion(⌘,)) Given a regular graph G(V,E), distinguish be- tween the following two cases: Yes: There exists a non-expanding set S V with µ(S)= and (S) ⌘. ✓ G  No: All sets S V with µ(S)= are highly expanding with (S) 1 ⌘. ✓ G Hypothesis 4.1.7.1 (Hardness of approximating Small-Set Expansion) For all ⌘>0, there exists >0 such that the promise problem Small-Set Expansion(⌘,) is NP-hard.

The authors of [82] showed that the Small Set Expansion Hypothesis implies the Unique Games Conjecture of Khot [65]. A decision problem is said to be SSE-hard if Small-Set Expansion(⌘,) reduces to it by a polynomial time reduction for some constant ⌘ and all >0. Raghavendra, Steurer and Tulsiani [83] showed the following hardness amplification result for graph expansion (see Preliminaries for Gaussian Graphs definitions):

Theorem 4.1.19 For all q N and ✏,> 0, it is SSE-hard to distinguish between the following 2 two cases for a given graph H =(VH ,EH ): Yes: There exist q disjoint sets S ,...,S V satisfying for all l [q]: µ(S )=1/q and (S ) 1 q ✓ H 2 l H l  CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 47

✏ + o(✏).

No: For all sets S VH : H (S) (1 ✏/2)(µ(S)) /µ(S), where (1 ✏/2)(µ(S)) is the ✓ G G expansion of sets of volume µ(S) in the infinite Gaussian graph (1 ✏/2). G

Hierarchical Clustering Inapproximability based on SSE

Now we are ready to prove our SSE hardness result. Our proof follows the argument of [83] for establishing the hardness of Minimum Linear Arrangement. We prove the following:

Theorem 4.1.20 (Hardness of Hierarchical Clustering). For every ✏>0,itisSSE-hard to distinguish between the following two cases for a given graph G =(V,E),with V = n: | | Yes: There exists a decomposition tree T of the graph such that cost (T ) ✏n E G  | | No: For any decomposition tree T of the graph cost (T ) cp✏n E . G | |

Proof. We apply Theorem 4.1.19 for the following values: q = 2/✏ ,✏0 = ✏/3 and = ✏.Weneed d e to first handle the Yes case. We get that the vertices can be divided into sets S1,S2,...,Sq, each having size n/q = n✏/2, such that at most ✏0 + o(✏0) fraction of edges leave the sets (i.e. go across sets). Now consider the hierarchical clustering solution that first partitions the vertices into the sets S ,S ,...,S and then partitions each S arbitrarily. Edges inside the set S contribute at most S 1 2 q i i | i| to the objective function and this is S = n/q = ✏n/2. Moreover, edges whose endpoints are in | i| di↵erent sets will have contribution at most n; but there are at most ✏/2 fraction of such edges and so the overall objective for this hierarchical clustering solution is at most ✏n E . | | Now, we handle the No case by using the argument of [83] for Minimum Linear Arrangement that follows from an observation of [44] and the fact that the objective function of Minimum Linear Arrangement is always less than the cost of Hierarchical Clustering. To see the latter, observe that when we have a hierarchical clustering tree T , if we consider the ordering of the vertices (leaves) as in the DFS order, then the stretch of an edge (u, v) that is cut, can be at most the size of the subtree which corresponds to that edge and this is exactly the quantity: leaves(T [u v]) .Since | _ | we know ([83, 44]) that in the No case, for all orderings ⇡ : V [n], E(u,v) E[ ⇡(u) ⇡(v) ] cp✏n, ! ⇠ | | it immediately follows that: cost (T ) cp✏n E . G | |

4.2 Maximization HC: Variations on a Theme

In this part of the thesis, we focus on the two maximization objectives, introduced recently to give insight into the performance of average-linkage clustering: the similarity based HC objective pro- posed by [78] and the dissimilarity based HC objective proposed by [39]. In both cases, we present 1 2 tight counterexamples showing that average-linkage cannot obtain better than 3 and 3 approxi- mations respectively (in the worst-case), settling an open question raised in [78]. This matches the approximation ratio of a random solution, raising a natural question: can we beat average-linkage for CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 48

these objectives? We answer this in the armative, giving two new algorithms based on semidefinite programming with provably better guarantees. In particular, two recent (and independent) works took this objective function viewpoint to understand the performance of Average-Linkage. Recall the two maximization objectives, the first with similarity weights

T ⇤ = argmax wij (n Tij )(HC-OBJ-2) all trees T · | | (i,j) E X2 and with dissimilarities:

T ⇤ = argmax wij Tij (HC-OBJ-3) all trees T ·| | (i,j) E X2 For maximizing the similarity-based objective, the first work showed that Average-Linkage obtains 1 a 3 -approximation. Interestingly, for maximizing the dissimilarity-based objective, the second work 2 also showed that Average-Linkage gives a 3 -approximation [39]. These two works helped in under- standing the performance of Average-Linkage. Here we build along this direction.

Our Contributions. In this paper, we shed further light on these two hierarchical clustering objectives. Since both of the objectives were recently introduced in the context of explaining the success of Average-Linkage, and as these objectives are NP-hard to optimize [43], it is natural to ask how well these objectives can be approximated. Understanding the approximation factors achievable by other algorithms for these objectives is important in evaluating the explanation for the performance of Average-Linkage by these works. It turns out that a random solution to both these optimization problems achieves an approxima- tion ratio that matches the approximation guarantees established by the above works for Average- Linkage [78, 39]. In particular, [78] suggest that the performance of Average-Linkage may be strictly better than that of a random solution. Our first set of results is negative:

In the worst case, the approximation ratio achieved by Average-Linkage is no better than 1 2 3 for the maximization objective of Moseley and Wang [78] and no better than 3 for the maximization objective of Cohen-Addad et al. [39].

This raises a natural question: is it possible to achieve an approximation factor strictly better than that achieved by Average-Linkage (also by a random solution) for these two objectives? Or is it the case that these objectives are approximation resistant (i.e., beating the performance of a random solution is provably hard)? Our main result here is positive:

1 We show simple algorithms that achieve a 3 + ✏ approximation for the maximization ob- 2 jective of [78] and an algorithm that achieves a 3 + approximation for the maximization CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 49

objective of [39], for constants ✏,> 0.

Our algorithms are conceptually very simple; in the former case, our algorithm is guided by a semidef- inite programming (SDP) solution that has a vector representation for the hierarchical clustering for every level of granularity, and uses spreading metric constraints to strengthen the solution. We use the solution at level n/2 to make a judicious choice of initial partition, followed by a random solution to refine each piece produced (see section 4.2.3 for details). In the latter case, our algorithm follows a simple greedy peeling strategy, followed by a max-cut partition (see section 4.2.4 for details). In addition to shedding light on the two objectives from the point of view of understanding their approximability, an additional lens with which to view our results is a philosophical one: our results raise the question about whether these objectives are indeed the right way to measure the performance of Average-Linkage clustering.

Notations. We abuse notation and use OPT to refer to both the optimum solution and its value for the HC problem at hand. Similarly, Average Linkage denotes both the solution produced by the Average-Linkage algorithm and its objective value. We emphasize that when dealing with similarity weights, the Average-Linkage merging criterion is to maximize average similarity across the sub- clusters available to the algorithm for merging, whereas when dealing with dissimilarity weights, the Average-Linkage merging criterion is to minimize average dissimilarity. We term HC-OBJ-2 as similarity-HC and HC-OBJ-3 as dissimilarity-HC in this paper. We use to denote the total W weight of the edges in the graph, i.e. = e E we. W 2 We start by describing the constructionsP of two families of examples proving that the known per- formance bounds for Average-Linkage for the two objectives, i.e. similarity-HC and dissimilarity- HC, are tight.

1 4.2.1 Average-Linkage for similarity-HC is a tight 3 -approximation Fact 4.2.1 ([43]) If the graph is a clique with uniform weights for all of the edges, any clustering tree T obtains exactly the same cost/reward. We occasionally use this fact in this section.

1 Average-Linkage for similarity-HC is a tight 3 -approximation We will provide a construc- tion where the optimum solution OPT has value n ,butwhereAverage-Linkage only gets ⇡ W 1 n (ignoring lower order terms). The construction does two things: 1) Most of the graph’s weight 3 W is inside subclusters containing n2/3 nodes each. So there exists a solution merging almost all the weight in low levels of the hierarchical decomposition, getting n total value. 2) Average- ⇡ W Linkage cuts most of the graph’s weight in higher levels of the corresponding tree decomposition so that according to HC-OBJ-2, the multiplier of the edge weights is small. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 50

1 Figure 4.4: The tight instance where Average-Linkage is a tight 3 -approximation. The graph 1/3 2/3 has n nodes organized in n vertical groups of n vertices. Each vertical group is a clique Kn2/3 on n2/3 nodes and there are n1/3 such groups. Each horizontal group is a clique on n1/3 nodes. Every edge in the vertical groups has weight 1, whereas every edge in the horizontal groups has weight 1 + ✏.

Construction of the tight instance. To achieve the above, our example (see Figure 4.4)will have n nodes and will contain multiple copies of cliques each of which is either a copy of Kn1/3 or 1/3 Kn2/3 . In particular, the tight instance consists of n copies of Kn2/3 with unit weight. With 2/3 a slight abuse of notation, we fix an arbitrary ordering 1, 2,...,n for the nodes in each Kn2/3 and refer to them by their corresponding order in each clique. Now we augment this construction by adding all pairwise edges connecting nodes of the same order across all the Kn2/3 cliques. This 2/3 creates n additional Kn1/3 cliques and we fix the weight of these additional edges to be 1 + ✏ (for any small constant ✏>0). Note that the total number of nodes is n2/3 n1/3 = n and that the · total weight of the graph is = 1 n2/3 (n2/3 1) n1/3 1+ 1 n1/3 (n1/3 1) n2/3 (1 + ✏)= W 2 · · · 2 · · · 1 5/3 4/3 2 n + O(n ). The following two lemmas compare OPT to Average-Linkage (proofs are deferred to Ap- pendix A.3).

Lemma 4.2.1 In the above instance, the optimum obtains an objective of at least 1 n8/3 O(n7/3) 2 ⇡ n . W Lemma 4.2.2 In the above instance, Average-Linkage gets at most 1 n8/3 + O(n7/3) 1 n . 6 ⇡ 3 W Combining these lemmas, we settle a open question raised in [78]:

1 Proposition 4.2.2 There exists an instance for which Average-Linkage is a 3 +o(1)-approximation for the similarity-HC objective (HC-OBJ-2)introducedin[78]. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 51

2 4.2.2 Average-Linkage for dissimilarity-HC is a tight 3 -approximation When the pairwise scores denote dissimilarities, we focus on HC-OBJ-3.[39] showed that running 1 Average-Linkage gives a 2 -approximation, and a slight modification of their proof (see journal 2 2 version of [39]) gives an improved 3 -approximation bound. Here we show that the 3 ratio is actually tight.

Construction of the tight instance. Let the number of nodes n be even. We start with the complete bipartite graph Kn/2,n/2 with unit weights (let L, R denote the two sides of the graph). We then remove any perfect matching M crossing the (L, R)cut(seeFigure 4.5). Note that the total weight of the edges is = n n n 1 n2. W 2 · 2 2 ⇡ 4 The following two lemmas compare OPT to Average-Linkage (proofs are deferred to Ap- pendix A.3).

2 Figure 4.5: The tight instance where Average Linkage is a tight 3 -approximation. The graph is a complete bipartite graph where we removed a perfect matching M denoted with the red dashed n edges. After one step of Average Linkage , the instance is a clique on 2 supernodes of size 2 with doubled edge weights.

Lemma 4.2.3 In the above instance, the optimum HC decomposition obtains an objective value of at least 1 n3 O(n2) n . 4 ⇡ W Lemma 4.2.4 In the above instance, Average-Linkage gets at most 1 n3 2 n . 6 ⇡ 3 W Combining these lemmas, we get the following result:

2 Proposition 4.2.3 There exists an instance for which Average-Linkage is a 3 +o(1)-approximation for the dissimilarity-HC objective (HC-OBJ-3), introduced by [39]. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 52

4.2.3 Beating Average Linkage via SDP for the Moseley-Wang HC Ob- jective

In this section, we aim to design approximation algorithms for the similarity-based hierarchical clustering, i.e., creating a hierarchical decomposition that approximately maximizes the objective function HC-OBJ-2. To this end, we formulate a semidefinite programming (SDP) relaxation for the problem. We show that using a well chosen set of vectors returned by this SDP and a simple 1 hyperplane rounding scheme, we can beat the average-linkage which is a tight 3 -approximation algorithm. The main challenge in the analysis of the rounding scheme is in lower bounding the probability of certain events related to the triplets of vertices and the order in which they get separated; we do this by exploiting the specific geometry of the vectors in the SDP optimal solution, and with the help of our imposed spreading metric constraints in this SDP relaxation.

SDP relaxation for similarity-HC

Suppose n datapoints with similarity weights w are given. An alternative view of a hierarchical { ij} clustering of these points is a collection of partitions of the points at di↵erent levels t = n 1,...,1, where the partition at level t consists of all the maximal clusters of size at most t.Giventhisview, we can rewrite the similarity-HC objective function (HC-OBJ-2) as following.

n 1 w (n T )= w 1 i and j are not separated at level t’s partitioning ij | ij| ij · { } (i,j) E t=1 (i,j) E X2 X X2 Now, given the optimal hierarchical clustering, consider a vector assignment where at every level t =1,..,n 1 the same unit vectors are assigned to all the nodes in the same maximal cluster, while the assigned vectors to di↵erent clusters are chosen to be orthogonal. Let v(t) be the set { i } of assigned vectors. Clearly, the contribution of an edge wij at level t can be alternatively written as w (v(t) v(t)). This observation suggests a relaxation through semidefinite programming/vector ij i · j programming. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 53

Proposition 4.2.4 The following SDP is a relaxation for the similarity-HC problem (6.4).

n 1 maximize w (1 x(t)) ij ij t=1 (i,j) E X X2 1 subject to x(t) = v(t) v(t) 2, (i, j) E,t [1 : n 1] ij 2k i j k2 8 2 2 x(t) n t, i V,t [1 : n 1] (spreading constraints) ij 8 2 2 j V :j=i 2X6 x(t+1) x(t),x(1) =1 (i, j) E,t [1 : n 1] (monotonicity constraints) ij  ij ij 8 2 2 (t) n (t) 2 v R , v =1, i V,t [1 : n 1] i 2 k i k2 8 2 2 (HC-SDP)

In an integral HC decomposition, each node in a maximal cluster of size at most t has been separated from at least n t vertices at level t (i.e. all of the nodes outside of this cluster). We therefore add the constraints x(t) n t, termed as the spreading constraints.Intuitively,they j ij force the SDP to choose vectorsP that are somewhat separated, thus preventing it from cheating by assigning identical vectors. Finally, monotonocity constraints ensure monotonicty of the separation from top to bottom.

Combining SDP rounding and random to beat average-linkage

Suppose OPT-SDP and OPT denote the optimum solution of the SDP relaxation (HC-SDP) and the optimum integral solution of the similarity-HC objective (HC-OBJ-2) respectively. Our goal is to 1 beat the 3 approximation ratio attained by Average-Linkage . We will consider two simple algorithms for the HC problem. The first algorithm, “random always ”, cuts each cluster recursively and uniformly at random until it reaches to singletons. The second algorithm, “SDP first, random next”, uses the semidefinite program HC-SDP and hyperplane rounding for determining the first cut, and then it picks a random cut for each of the later clusters until it reaches to singletons. As it is also known ([78]), recursively performing random cuts will yield an HC solution with 1 expected value exactly equal to 3 (n 2) ,where , (i,j) E wij. An initial idea is that when W W 2 there is a gap between OPT and the quantity (n 2) ,i.e.whenOPT < (1 ✏ )(n 2) for WP 1 W some small constant ✏1 > 0, then “random always ” already attains an approximation guarantee of 1 1 3(1 ✏ ) > 3 . The name of the game is then to come up with a good approximation algorithm in the 1 case where OPT is actually pretty large, close to (n 2) . Interestingly, a suitable initial cut can be W found by exploiting the SDP relaxation in this case, which can be then used to guide the “random always” algorithm.

Theorem 4.2.5 The best of the “ SDP first, random next” (Algorithm 5)and“random always” (Algorithm 4)isarandomized↵-approximation algorithm for maximizing the similarity-HC objective 1 for hierarchical clustering, where ↵ =0.336379 > 3 . CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 54

Algorithm 4 Random Always 1: input: G =(V,E). 2: if V =1then | | 3: Return the singleton vertex as the only cluster. 4: else 5: Randomly partition the set of vertices into S and S¯. 6: Recursively run “random always ” on G and G ¯ to get clusters and ¯. S S CS CS 7: Return the clusters S,S¯, and ¯. CS CS 8: end if

Algorithm 5 SDP First, Random Next

1: input: G =(V,E) and (similarity) weights wij (i,j) E. { } 2 t 2: Solve the SDP relaxation HC-SDP to get an optimum assignment xij (i,j) E, t=1,...,n 1. 2 ( n/2 1) ( n/2 1) { } 3: Let x⇤ = x b c and v⇤ = v b c be the optimal solution restricted to level t = n/2 1. ij ij i i b c

4: Draw v uniformly at random from unit sphere, and let S = i V : v⇤ v 0 . 0 { 2 i · 0 } 5: Partition the vertices into S and S¯ = V S. \ 6: Run “random always ” (Algorithm 4) on S and S¯ to get clusters and ¯. CS CS 7: Return the clusters S,S¯, and ¯. CS CS

Analysis (Proof of Theorem 4.2.5)

We start by decomposing the similarity-HC objective as a summation over contributions of di↵erent triplets of vertices i, j and k,where(i, j) E and k = i, j. Accordingly, HC-OBJ-2 can be rewritten 2 6 as:

w (n T )= w non-leaves(T ) = w 1 k is not a leaf of T ij | ij| ij| ij | ij { ij} (i,j) E (i,j) E (i,j) E k=i,j X2 X2 X2 X6 (4.16)

The vertex k does not belong to the leaves of Tij if and only if at some point during the execution of the algorithm, k gets separated from i and j,whilei and j still remain in the same cluster. Suppose T (1) and T (2) denote the HC tree returned by Algorithm 4 and Algorithm 5. Moreover, let the random variables and denote the contributions of the edge (i, j) and vertex k = i, j to Zi,j,k Yi,j,k 6 the objective value of Algorithm 4 and Algorithm 5 respectively, i.e.,

w 1 k is a non-leaf of T (1) and w 1 k is a non-leaf of T (2) Zi,j,k , ij { ij } Yi,j,k , ij { ij }

Moreover, let i,j = k=i,j i,j,k and i,j = k=i,j i,j,k. Let OPT be the optimal value of the HC Y 6 Y Z 6 Z objective. Fix ✏1 > 0P and consider two cases. P

Remark 7 Already, from the formulation of the Moseley-Wang objective, as triplets we see that the old biology problem of rooted triplets consistency is a generalization of Moseley-Wang optimization. This actually gives a PTAS for dense instances of HC [41]. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 55

1 Case 1: OPT < (1 ✏1)(n 2) (i,j) E wij. By a simple argument, we claim that E [ i,j,k]= 3 wij. 2 Z n 2 Given this claim, the expected objective value of Algorithm 4 is at least 3 (i,j) E wij. Moreover, P 2 1 OPT (n 2) wij, and hence Algorithm 4 obtains fraction of OPT in this case. To (i,j) E 3(1 ✏1) P  2 see why the claimP holds, think of each random cut as flipping an independent unbiased coin for each vertex, and then placing the vertex on either sides of the cut based on the outcome of its coin. Now, look at the sequence of the coin flips of i, j and k during the execution of Algorithm 4. We want to find the probability of the event that for the first time the three sequences are not matched, but i’s sequence and j’s sequence are still matched. Fixing i’s sequence, the probability i that all three are always matched is i1=1(1/4) =1/3. Due to the symmetry, the rest of the probability will be divided equally betweenP our target event and the event that for the first time these three sequences are not matched, but still i’s sequence and k’s sequence are matched. So, (1) k is not a leaf of Tij =1/3, which proves the claim.

Case 2: OPT (n 2)(1 ✏1) (i,j) E wij. In this case, we want to find a lower bound on the 2 objective value of Algorithm 5. To this end, we show how to bound E [ ] from below for a large P Yi,j enough fraction of edge weights. Consider the following events:

i and j remain together after the first cut , Ei,j , { } i, j and k remain together after the first cut , Ei,j,k , { } i,j k , i, j remain together and k gets separated after the first cut E | { }

We can rewrite E [ ] as follows. Yi,j,k

E [ i,j,k]=E [ i,j1 i,j ]=E [ i,j,k1 i,j,k ]+E i,j,k1 i,j k Y Y {E } Y {E } Y {E | } = E [ i,j,k i,j,k] i,j,k + E i,j,k i,j k i,j⇥ k ⇤ Y |E E Y |E | E | ⇥ ⇤ Now, note that E i,j,k i,j k = wij, as k has been separated from i and j after the first cut. Y |E | Moreover, k is a non-leaf of T (2) =1/3, as Algorithm 5 performs random cuts after the first ⇥ ij⇤ |Ei,j,k cut and the previous argument for analyzing will be applied. So, E [ ]=w /3. Zi,j,k Yi,j,k|Ei,j,k ij Therefore, we have:

wij wij wij E [ i,j,k]= i,j,k + wij i,j k = i,j,k + i,j k +2 i,j k = i,j +2 i,j k , Y 3 E E | 3 E E | E | 3 E E | and hence we have:

wij E [ i,j]= E [ i,j,k]= (n 2) i,j +2 i,j k (4.17) Y Y 3 0 E E | 1 k=i,j k=i,j X6 X6 @ A CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 56

( n/2 1) Fix ✏ > 0. Consider all edges for which x⇤ = x b c ✏ (denoted by E). For each 2 ij ij  2 H✓ (i, j) , by applying the basics of hyperplane rounding, e.g. in [55], we have: 2H

=(v⇤ v )(v⇤ v ) 0=1 ✓ /⇡, Ei,j i · 0 j · 0 ij

1 where ✓ is defined to be the angle between the vectors v⇤ and v⇤,i.e. ✓ = cos (v⇤ v⇤)= ij i j i,j i · j 1 cos (1 x⇤ ). Now, it is clear that ✓ ✓¯ for edges in ,where1 cos(✓¯)=✏ . Therefore, ij ij  H 2

1 ✓¯/⇡, (i, j) (4.18) Ei,j 8 2H

To bound k=i,j i,j k for every edge (i, j) , we first find an explicit closed-form for each 6 E | 2H probability term.P Interestingly, despite the complicated nature of this calculation, our method is simple and is not relaying on any three-dimensional geometry. Hence, it might be of independent interest.

Remark 8 To calculate an explicit closed-form for the probability of the event i,j k, three involved E | correlated random variables v⇤ v , v⇤ v and v⇤ v need to be considered. In the direct approach, i · 0 j · 0 k · 0 e.g. `ala [55], we need to look at the unit projections of these three vectors and v0 onto the span of vi⇤, vj⇤,andvk⇤ (hence a three-dimensional representation for each). Suppose ˜v 0 be the projection of v0 onto the mentioned three-dimensional space. As the entries of v0 are jointly Gaussian, ˜v 0 is indeed a uniformly random point from the three-dimensional sphere. Now, finding the probability of the event that i and j are on one side and k is on the other side of the hyperplane with normal vector ˜v 0 involves a complicated calculation in this three dimensional geometry.

✓ik+✓jk ✓ij Lemma 4.2.5 For every triplet of vertices i,j and k, i,j k = E | 2⇡

Proof. We start by the following key observation, which relates the quantities i,j k, i,k j and k,j i E | E | E | to the original separation probabilities of a hyperplane rounding scheme:

✓ij 1. i,k j + j,k i =1 i,j = E | E | E ⇡

✓jk 2. i,j k + i,k j =1 j,k = E | E | E ⇡

✓ik 3. j,k i + i,j k =1 i,k = E | E | E ⇡

Solving the above 3 3 system, we can obtain the desired closed-form expressions for i,j k, i,k j ⇥ E | E | and k,j i in terms of the angles between the vectors: E |

✓ik+✓jk ✓ij ✓ij +✓jk ✓ik ✓ik+✓ij ✓jk i,j k = , i,k j = , k,j i = E | 2⇡ E | 2⇡ E | 2⇡

This concludes the proof. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 57

In the next step of the analysis, we find a worst-case lower-bound for k=i,j i,j k through a 6 E | factor revealing program. More accurately, we set up a minimization problemP where the objective function is equal to k=i,j i,j k. For the constraints, note that due to the spreading constraints of 6 E | the similarity-HC SDPP relaxation (HC-SDP) for vertices i and j,wehave

x⇤ n n/2 +1 , x⇤ n n/2 +1 ik b c jk b c k=i k=j X6 X6 and therefore k=i cos(✓ik) n/2 1 and k=j cos(✓jk) n/2 1. Now, by applying Theo- 6  6  rem 4.2.5, we can lower bound k=i,j i,j k by the optimal solution of the following optimization P 6 E | P problem: P 1 minimize ✓ + ✓ ✓¯ 2⇡ ik jk k=i,j X6 subject to cos(✓ ) n/2 1, ik  k=i ( ) X6 Plower-bound cos(✓ ) n/2 1, jk  k=j X6 ⇡ ⇡ 0 ✓ , 0 ✓ k  ik  2  jk  2 8

Note that we restrict our attention to 0 ✓ik,✓jk ⇡/2, simply because in HC-SDP we force (t)   x 1, and hence v⇤ v⇤ 0 for all i, j V . We now have this lemma, whose proof is deferred to ij  i · j 2 Appendix A.4.

Lemma 4.2.6 The optimal solution of the factor revealing program is lower-bounded Plower-bound by ¯ (n 2) 1 ✓ 4 2⇡ ⇣ ⌘ By combining eq. (4.17) and eq. (4.18) with Theorem 4.2.6,wehave:

1 ✓¯ 2 1 ✓¯ E [OBJ ] E [ ] (n 2) w 1 + ALG2 Yi,j 0 ij1 3 · ⇡ 3 · 4 2⇡ (i,j) (i,j) ✓ ✓ ◆ ✓ ◆◆ X2H X2H @ A 1 2✓¯ (n 2) w (4.19) 0 ij1 2 3⇡ (i,j) ✓ ◆ X2H @ A ( n/2 1) We finally bound the total weight of edges in . Note that for an edge (i, j) / , x⇤ = x b c > H 2H ij ij ✏ . Therefore, due to the monotonicity constraint in HC-SDP, x(t) >✏, 1 t n/2 1. Now 2 ij 2 8  b c CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 58

we have:

n 1 n/2 1 b c OPT-SDP = w (1 x(t)) (n 2) w w x(t) ij ij  ij ij ij t=1 (i,j) E (i,j) E t=1 (i,j) E X X2 X2 X X2 ✏ (n 2) (n 2) w 2 w  ij 2 · ij (i,j) E (i,j) E X2 X2 \H On the other hand, based on the assumption of Case 2, we know

OPT-SDP OPT (n 2)(1 ✏ ) w (4.20) 1 ij (i,j) E X2

By rearranging the terms we have w (1 2✏1 ) w . Now, combined with (i,j) ij ✏2 (i,j) E ij 2H 2 1 2✏ 1 2cos (1 ✏2) Equation (4.19), we can show Algorithm 5 obtains (1 1 )( P ) fraction of OPT. P ✏2 2 3⇡ By balancing out the two cases, finding the optimal ✏1 as a function of ✏2, and finally by setting ✏ 0.139 we get the desired approximation factor of 0.336379. For more details, refer to 2 ⇡ ⇡ Appendix A.4.

4.2.4 Beating Average Linkage via MaxCut for Dissimilarity HC

In this section we focus on the dissimilarity-HC objective (HC-OBJ-3). As demonstrated in Sec- 2 tion 4.2.2, Average-Linkage fails to obtain better than 3 fraction of the optimum in the worst- case. Similarly, “random always ” fails to beat this approximation ratio, simply because its objective value on any instance is exactly equal to 2 n , while in a bipartite graph the optimum dissimilarity- 3 W HC objective is equal to n . Therefore, one natural question to ask is if there exists a polynomial W 2 time algorithm that can beat the 3 approximation factor. We answer this question in the armative by providing a simple algorithm.

The “Peel-o↵First, Max-cut Next” Algorithm

By looking at the structure of the dissimilarity-HC objective function in eq. (HC-OBJ-3), it is clear that the top-level cuts of the tree, i.e. those corresponding to clusters of larger sizes, have considerable contributions to the objective function. For example, consider a simple algorithm that starts with a random cut and then forms the rest of the tree arbitrarily. This algorithm can still obtain an objective value of 1 n . Inspired by this observation, a tempting idea to beat the approximation 2 W 2 factor of 3 is to start with an approximation algorithm for the max-cut, e.g. [55], and then construct the rest of the hierarchical tree (probably by random cutting or by continuing with the same max-cut algorithm). However, this naive approach fails because of the following instance. Suppose we have a graph with an embedded clique of size ✏n (for an arbitrarily small ✏>0) and the rest of the weights are CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 59

zero. The optimum dissimilarity-HC solution clearly peels o↵vertices of the clique one by one, and obtains an objective value of at least n(1 ✏) . However, the “recursive max-cut” or the “max-cut W first, random next” both cut the clique into two (almost) symmetric halves at each iteration, and obtain an objective value of at most

2+✏ objective-value nW + ✏nW + nW + ✏nW + nW + ✏nW + ... n  2 4 8 16 32 64  3 W ✓ ◆ ✓ ◆ ✓ ◆ The above example suggests a natural modification to our idea, i.e. to first peel o↵ high weighted degree vertices, and then use a max-cut algorithm. Intuitively, if in such a pre-processed instance the optimum objective value of the dissimilarity-HC is close to n , then there should exist a W considerably large cut. This large cut can be detected by an approximate max-cut algorithm, and will provide a large enough objective value for the dissimilarity-HC if used as a top-level cut in the final hierarchical clustering tree. If there is a constant gap between the optimum and n , one can W 2 run “random always ” and already get an approximation factor strictly better than 3 . Formally, we propose “peel-o↵first, max-cut next” (Algorithm 6) and show how the better of this algorithm and 2 the “random always ” (Algorithm 4) beats the 3 -approximation factor by a small constant.

Algorithm 6 Peel-o↵First, Max-cut Next

1: input: G =(V,E), (dissimilarity) weights wij (i,j) E, and parameter >0. { } 2 2: Initialize hierarchical clustering tree T . 2 ; 3: Set the peeling-o↵threshold to be ⌧ = W . n · 4: V˜ V and E˜ E.

5: while avertexv V˜ such that w >⌧ do 9 2 vu u V :(v,u) E 2 X 2 6: Update the HC binary tree T by adding the cut ( v , V˜ v )tothetree. { } \{ } 7: V˜ V˜ v and E˜ E˜ e E : e incident to v . \{ } \{ 2 } 8: end while 9: Run [55] for max-cut on G˜ =(V,˜ E˜). max-cutting phase. {! } 10: Let the resulting cut be (S, V˜ S), and update the HC tree T by adding this cut to the tree. \ 11: Run “random always ” (Algorithm 4) on S and V˜ S. Add the resulting binary trees to T . \ 12: Return the tree T .

Theorem 4.2.6 There exists a choice of >0 so that the best of “peel-o↵first, max-cut next” with parameter (Algorithm 6)and“random always” (Algorithm 4)isan↵-approximation algorithm 2 for maximizing the dissimilarity-HC objective, where ↵ =0.667078 > 3 .

Analysis (Proof of Theorem 4.2.6)

Fix a parameter ✏. Let OPT be the optimal objective value of the dissimilarity-HC. Similar to the proof of Theorem 4.2.5, consider two cases: CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 60

Case 1: OPT < (1 ✏)n . A simple argument (refer to the proof of Theorem 4.2.5) shows that W the expected objective value of Algorithm 4 is exactly equal to 2 n in this case, and therefore it 3 W 2 obtains 3(1 ✏) fraction of OPT (for an exposition of this proof, we refer the reader to 35)

Case 2: OPT (1 ✏)n . In this case, let the optimum (binary) HC tree be T ⇤. Fix another W parameter <1 . The collection of all maximal clusters of size at most n(1 ) forms a partition of 2 the vertices. Here is a recursive way of looking at this partition: Imagine we start from the root of

T ⇤. Each time the optimum tree performs a binary cut, we consider the two produced clusters (see Figure 4.6). If any of these clusters has size at most n(1 ), then it is a maximal cluster of size at most n(1 ) and, by definition, it will be added to the partition. As <1 , either both of these 2 clusters must be of size at most n(1 ), or exactly one of them is smaller than n(1 )whilethe other is strictly larger than n(1 ). If the latter is true, we recursively follow the tree along the bigger cluster, i.e. we make the bigger cluster the new root and we iterate. Otherwise, if the former is true, we stop following the tree as we would have already produced two smaller than n(1 ) pieces. We denote the produced sequence of clusters by (L1,R1),...,(Lk,Rk), where (Li,Ri) are th the two clusters produced by T ⇤ at the i split. Without loss of generality we set:

L n(1 ) > R ,i=1,...,k 1 and max ( L , R )

Based on the above construction, the resulting partition consists of the sets R1,R2,...,Rk and

Lk. Note that Lk + Rk = Lk 1 n(1 ), and therefore the rest of the graph contains | | | | | | V (Lk Rk) = n Lk 1 n vertices. | \ [ | | |

Figure 4.6: The layered structure of the optimum tree T ⇤ in Case 2. CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 61

OPT value of HC for the optimum solution T ⇤

OPTred contribution of edges with at least one red endpoint in optimum OPTblue contribution of blue edges in optimum OPT contribution of blue edges (u, ),u V (L R ) in optimum blue-chain · 2 \ k [ k OPT contribution of blue edges (u, v),u,v L R in optimum blue-cut 2 k [ k ALGpeel objective value gained by our algorithm during peeling-o↵phase 1 ALGcut objective value gained by our algorithm during max-cutting phase 2 MaxCutblue max-cut value only among blue vertices available to our algorithm in phase 2

Table 4.1: A guide through the di↵erent variable names used in the proof.

Before delving into the proof details for Theorem 4.2.6, we first provide an overview of the proof highlighting the main ideas.

Proof Sketch: Our algorithm runs in two phases, namely the peeling-o↵phase and the max- cutting phase and a list of the symbols involved in the proof is provided in Table 4.1.

Step 1: Even though our algorithm removes vertices one by one during the peeling-o↵phase while the optimum tree T ⇤ removes chunks of nodes (i.e. the Ri’s), we will be flexible to ignore their contributions to OPT by only losing a small factor in the approximation, because these pieces are small. Step 2: We want to devise a charging scheme between OPT and ALG. To achieve this, we further divide ALG = ALGpeel + ALGcut (these are the contributions to the HC objective during the peeling-o↵ and max-cutting phase respectively) and OPT = OPTred + OPTblue. Suppose we mark the vertices that the algorithm peels o↵during the first phase as “red” and the rest of the vertices are marked as “blue”.

Step 3: To take care of OPTred (this is the total contribution of edges with at least one red endpoint in the objective value of T ⇤), we only use ALGpeel. Note that there can’t be many high weighted degree vertices, so every vertex removed by ALGpeel had a significant multiplier in the HC 2 objective. Since, OPT could only have a multiplier of n we get that ALG (1 W )OPT red peel n⌧ red (see Theorem 4.2.7). From this point on, we can completely ignore red vertices in the analysis. Step 4: The remaining edges have blue both of their end-points (referred to as blue edges from now on). Let OPTblue be the total contribution of blue edges in the optimum T ⇤. Dealing with OPTblue requires more work. We need to further break OPTblue into: OPTblue = OPTblue-chain + OPTblue-cut (the total contribution of all blue edges with at least one end-point in V (L R ) and the total \ k [ k contribution of all blue edges with both end-points in L R respectively). Note that OPT k [ k blue-chain is negligible because it refers to low weighted degree vertices in small pieces, so OPTblue-chain is a small fraction of n (and hence of OPT which is close to n ). See Theorem 4.2.8. W W Step 5: Finally, we will use ALGcut to take care of the OPTblue-cut entirely. Actually, ALGcut will CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 62

take care not only for the OPTblue-cut (for now ignore some contribution from w(Lk),w(Rk) because it is really small), but also at least half of the OPTblue-chain. See Theorem 4.2.10. The above steps lead us to Theorem 4.2.13 which finishes the proof of Theorem 4.2.6.

2 Lemma 4.2.7 Let ` denote the number of vertices peeled o↵during the first phase. Then ` W  ⌧ 2 and also ALG (1 W )OPT . peel n⌧ red Proof. Every vertex that is peeled o↵during the first phase has weighted degree at least ⌧.Observe 2 that ` cannot be larger than W⌧ , simply because the total sum of the weighted degrees is at most 2 . Moreover, every peeled-o↵vertex u (these are exactly the red vertices) belongs to a cluster of W 2 2 size at least (n `) (n W⌧ ), hence u’s contribution to ALGpeel is at least (n W⌧ ) v V wuv. 2 Note that by the definition of OPTred we have: P

OPT n w red  · uv u is red v V X X2

Summing up the contributions of red vertices to ALGpeel:

2 2 ALG (n W ) w (1 W )OPT peel ⌧ uv n⌧ red u is red v V X X2 This concludes the second part of the claim.

We just obtained an upper bound for OPTred in terms of our algorithm’s ALGpeel so we can ignore from now on the red vertices and turn our attention to OPTblue = OPTblue-chain + OPTblue-cut.The

first step is to upper bound OPTblue-chain:

2 2 Lemma 4.2.8 OPTblue-chain ⌧n 1 ✏ OPT.   Proof. As noted previously, V (L R ) n, hence there are not that many vertices in | \ k [ k | V (L R ) . Since any edge that contributes to OPT , must have, by definition, at least | \ k [ k | blue-chain one endpoint in V (L R ) , there are at most n such edges and because they are blue, again | \ k [ k | by definition, their weighted degree is smaller than ⌧. Noting that the maximum cluster size is at most n, we conclude that OPT (n) ⌧ n = ⌧n2 and substituting ⌧ in terms of , we get blue-chain  · · the lemma.

Let w(Lk,Rk) be the the total weight of blue edges crossing the cut (Lk,Rk) and let w(Rk) and w(Lk) be the total weight of the edges with both end-points in Rk and Lk respectively. An obvious upper bound that can be derived for OPTblue-cut (recall that this refers only to blue edges), by focusing on the graph induced by the blue vertices in L ,R ,isOPT n(w(L ,R )+w(L )+w(R )). k k blue-cut  k k k k After the max-cutting phase is over, we have no further control over the contribution of edges with both end-points in Lk or both end-points in Rk, so we should better have an upper bound for their total weights. Informally, since OPT is large (Case 2) and both Lk,Rk have small size, it can’t CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 63

be the case that significant portion of the weight lies inside Lk,Rk, as otherwise OPT would have to be small (the formal proof is deferred to the Appendix A.5).

✏ ✏ Claim 4.2.9 nw(Lk)+nw(Rk) n (1 ✏) OPT.  W

The next lemma starts by a lower bound for the MaxCutblue value, which is the value of the maximum cut in the graph induced only from the blue vertices available to our algorithm during its max-cutting phase, i.e. after we have removed the red vertices. Our algorithm will of course get only a ⇢ -approximation (⇢ 0.878) to MaxCut , since it uses Goemans-Williamson for GW GW ⇡ blue max-cut.

2 ✏ OPTblue-chain Lemma 4.2.10 ALGcut ⇢GW 1 nW⌧ OPTblue-cut (1 ✏) OPT + 2 ⇣ ⌘ Proof. As mentioned, we know that nw(L ,R ) OPT nw(L ) nw(R ) OPT k k blue-cut k k blue-cut ✏ (1 ✏) OPT. During the max-cutting phase, the vertices available to our algorithm are all the blue vertices. These can be divided into two categories relative to the OPT solution. The first category are blue vertices u L R . The second category are blue vertices v V (L R ). Imagine the 2 k [ k 2 \ k [ k following cut (C, C¯) with weight w(C, C¯): focus only on the vertices u of the first category and split them into two pieces optimally to obtain the maximum cut. Now randomly assign the vertices v of the second category to the two pieces. This cut (C, C¯) would obtain, by definition, an HC objective value of:

¯ OPTblue-chain ✏ OPTblue-chain n w(C, C) n w(Lk,Rk)+ OPTblue-cut (1 ✏) OPT + · · 2 2

Since MaxCutblue is the optimal cut, it can only be better than w(C, C¯) and hence:

✏ OPTblue-chain n MaxCutblue OPTblue-cut (1 ✏) OPT + · 2

We know from Theorem 4.2.7, that at the beginning of the max-cutting phase, our algorithm has 2 removed at most W⌧ vertices and hence, the cluster size at the point where our algorithm uses the 2 Goemans-Williamson algorithm is at least (n W ). The lemma follows since we can only get a ⌧ ⇢GW approximation to MaxCutblue. Finally, we are able to combine all the above together into the final comparison between our algo- rithm’s objective value ALG and the optimum OPT:

2 p✏ Lemma 4.2.11 Let ⌧ = nW , = p . By optimizing for the parameters ,✏, we get an ↵- 2 approximation to the dissimilarity-HC objective, where ↵ =0.667078 > 3 . Proof. The proof involves optimizing for the parameters ✏,, and balancing out the two factors obtained from Case 1 and Case 2. The final equation is:

1 ✏/ 2 ⇢GW 1 1 1 ✏ 1 ✏ = 3(1 ✏) ⇣ ⌘⇣ ⌘ CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 64

We defer the details of this proof to the Appendix A.5. This finishes the proof of Theorem 4.2.6.

4.2.5 An improved 0.42-approximation for Moseley-Wang

Here we present polynomial-time 0.4246-approximation algorithms that use Max-Uncut Bisec- tion as a subroutine for the maximization dual of Dasgupta’s objective. The previous section presented a 0.336-approximation showing that one can beat the performance of Average-Linkage (also of a random tree) which achieves 1/3. Max-Uncut Bisection: This is the complement problem to Min-Cut Bisection (which is perhaps more standard in the literature), and the goal here is to split the vertices of a weighted graph ¯ into two sets (S, S), such that the weight of uncut edges ij E wij i S,j S¯ wij is maximized. It 2 2 2 is known that one can achieve at least .8776 of the optimumP value inP polynomial time [10, 99].

Algorithm 7 HC via Max-Uncut Bisection n n 1: input: Similarity matrix w R ⇥ . 2 0 2: Partition the underlying graph on n vertices with edges weighted by w into two parts S and S¯ using Max-Uncut Bisection as a black box. This creates the top split in the hierarchical clustering tree. 3: Run Average-Linkage on S and on S¯ to get trees and ¯. TS TS 4: Construct the resulting HC tree by first splitting into (S, S¯), then building trees and ¯ on the respective sets. TS TS

Our main result is:

Theorem 4.2.7 Given an instance of hierarchical clustering, our Algorithm 7 outputs a tree achiev- ing 4⇢ o(1) .4246 (for ⇢ =0.8776) of the optimum according to the Moseley-Wang objective, 3(2⇢+1) if a ⇢-approximation for the Max-Uncut Bisection problem is used as a black-box.

Remark: The current best approximation factor achievable for Max-Uncut Bisection in poly- nomial time is ⇢ =0.8776. This makes our analysis almost tight, since one can’t get better than .444 even by using an exact Max-Uncut Bisection algorithm (with ⇢ = 1). Remark: While writing this thesis, a follow-up work by Alon et al. [5] showed that the same 2 algorithm we presented here actually attains a 3 ⇢ =0.585 approximation to the Moseley-Wang objective.

High-level Overview of Proof

Before delving into the technical details of the main proof, we present our high-level strategy through a series of 4 main steps: CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 65

2 Step 1: Consider a binary tree ⇤ corresponding to the optimal solution for the hierarchical T + clustering problem and let OPT = ( ⇤) be the value of the objective function for this tree. Note F T that there exists a subtree in this tree which contains more than n/2leaveswhileitstwochildren T ⇤ contain at most n/2 leaves each (see Figure 4.7). Given this decomposition of the optimum tree into c three size restricted sets A, B, C, we provide an upper bound for OPT as a function of the weight inside and across these sets (see Proposition 4.2.8). We then need to do a case analysis based on whether the weight across or inside these sets is larger.

Step 2: In the former case, things are easy as one can show that OPT is small and that even the 4 contribution from the Average-Linkage part of our algorithm alone yields a 9 -approximation. This is carried out in Proposition 4.2.10 based on the Fact 4.2.9.

Step 3: In the latter case, we show that there exists a split of the graph into two exactly equal pieces, so that the weight of the uncut edges is relatively large. This is crucial in the analysis as having a good solution to the Max-Uncut Bisection directly translates into a high value returned by the ⇢-approximate black box algorithm (see Lemma 4.2.12, Proposition 4.2.11 and Proposition 4.2.12).

Step 4: Finally, from the previous step we know that the returned value of the black box is large, hence taking into account the form of the HC objective, we can derive a lower bound for the value our Algorithm 7. The proof of the main theorem is then completed by Proposition 4.2.13 and Lemma 4.2.13.

Proof of Theorem 4.2.7

For ease of presentation, we assume n is even, to avoid the floor/ceiling notation and we omit the o(1) terms.

Proposition 4.2.8 Let A be the set of leaves in the left subtree of , let B the set of leaves in the T ⇤ 3 right subtree of ⇤ and C = V (A B) be the set of leaves outside of ⇤. Then : T \ [ c T c OPT (w(A)+w(B)+w(C)) (n 2)+ c  ·

+(w(A, B)+w(B,C)+w(A, C)) C ·| | Proof. For an edge (i, j) whose endpoints lie in the same cluster (i.e., A, B or C), its contribution to the objective is at most w (n 2), using the trivial upper bound of (n 2). Consider any pair of ij leaves (i, j) A B in ⇤. The least common ancestor for this pair is the root of and hence the 2 ⇥ T T ⇤ contribution of this pair to the objective is equal to wij(n ⇤ )=wij C . Similarly, for any pair |T | | | c of leaves (i, j) A C (or in B C), their least common ancestor is a predecessor of the root of 2 ⇥ ⇥ c CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 66

Figure 4.7: Splitting ⇤ to size restricted sets A, B, C. T CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 67

and hence the contribution of this pair to the objective is at most w (n )=w C .The T ⇤ ij |T ⇤| ij| | desired bound now follows by summing up all the contributions of all distinct pairs of leaves. c c From now on, let ↵ := w(A)+w(B)+w(C) and let := w(A, B)+w(B,C)+w(A, C) denote the total weights of similarities inside the three sets and crossing a pair of these sets respectively. Note the total weight of all similarities is W = ↵ + .

Fact 4.2.9 (Average-Linkage [78]) The Average-Linkage algorithm gives a solution whose + objective is at least 1 W (n 2) = 1 (↵ + )(n 2). F 3 3 Proposition 4.2.10 If ↵ ,ourAlgorithm 7 outputs a solution of value at least 4/9OPT  0.44OPT, where OPT denotes the HC value of any optimum solution.

n n n 2 Proof. Recall that by definition of C it holds that C < 1 . Hence by | | 2  2  2 n 2 Proposition 4.2.8 we have OPT ↵(n 2) + C ↵(n 2) + . On the other hand,  | |·  2 by Fact 4.2.9, Average-Linkage outputs a solution whose expected value is 1 (↵ + )(n 2). We have 3 1 4 n 2 1 (↵ + )(n 2) (↵(n 2) + )= ( ↵) 0. Hence, the Average-Linkage part of the 3 9 2 9 4 algorithm alone gives a 9 -approximation in this case.

Lemma 4.2.12 Suppose ↵ . Then, there exists a balanced cut (L, R) of the nodes in G, such that c(1 2c) the weight of the uncut edges is at least ↵ (↵ )max(c), where c = C /n and max(c)= (1 3c2) . | | Proof. For the partition (A, B, C) we will refer to edges whose both endpoints are inside one of the three sets as red edges (i.e., (i, j) (A A) (B B) (C C)). We refer to the edges whose two 2 ⇥ [ ⇥ [ ⇥ endpoints are contained in two di↵erent sets as blue edges (i.e., (i, j) (A B) (A C) (B C)). 2 ⇥ [ ⇥ [ ⇥ Our goal here is to give a randomized partitioning scheme that produces the bisection (L, R)with high value of uncut weight lying inside L, R. For simplicity, recall that n is even. The case of odd n is handled similarly. Denote a = A /n | | and b = B /n. Let a =1/2 a, b =1/2 b, and c =1/2 c. Note that a, b are non-negative, and | | c is strictly positive due to the size restrictions. Define: e e e e e e 2bc 2ac 2ab qA = ,qB = ,qC = , (b + c)2 (a + c)2 (a + b)2 ee ee ee and e e e e e e qBqC pA = , qAqB + qBqC + qAqC

qAqC pB = , qAqB + qBqC + qAqC

qAqB pC = . qAqB + qBqC + qAqC CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 68

We also denote the following expression by :

q q q = A B C . qAqB + qBqC + qAqC

Consider the following partitioning procedure:

Pick one of the sets A, B, or C with probability p , p , and p , respectively (note that • A B C pA + pB + pC = 1).

If the chosen set is A, partition it into two random sets S and S of size b A /(b + c) and • B C | | c A /(b + c) and output the cut L = B SB, R = C SC . | | [ [ e e e Similarly,e if the chosen set is B, we partition it into two random sets SA and SC of size • e e a B /(a + c) and c B /(a + c) and output the cut L = C S , R = A S . | | | | [ C [ A If the chosen set is C, we partition it into two random sets S and S of size a C /(a + b) and • e e e e e e A B | | b C /(a + b) and output the cut L = A SA, R = B SB. | | [ [ e e e We firste observee e that each of the output sets L and R has n/2vertices,i.e.,(L, R) is a bisection of the graph. If for instance, the algorithm picks set A at the first step, then the set L contains B + b A /(b + c)vertices.Wehave | | | | e e e b 1/2 b L = B + A = bn + an = | | | | b + c| | 1 b c · e 1/2 b n = ebn +e an = . a · 2 The set R is the complement to L, thus, it also contains n/2 vertices. The cases when the algorithm picks the set B or C are identical. We now compute the expected weight of red edges in the bisection (L, R).

Proposition 4.2.11 The expected weight of uncut red edges is (1 )↵. Proof. Again, assume that the algorithm picks the set A at the first step. Then, the sets B and C are contained in the sets L and R, respectively. Consequently, no edges in B and C are in the cut between L and R. Every edge in A is cut with probability 2bc/(b + c)2. Thus, the weight of red edges in the cut between L and R (denoted as Ered(L, R)) given that the algorithm picks set A e e equals e e red E E (L, R) algorithm picks A at first step = | | 2bc = w(A)=qAw(A). (b + c)2 ee e e CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 69

Similarly, if the algorithm picks the set B or C, the expected sizes of the cuts equal qBw(B) and qC w(C), respectively. Hence, the expected weight of the red edges between L and R (when we do not condition on the first step of the algorithm) equals

red E E (L, R) = pAqAw(A)+pBqBw(B)+pC qC w(C) | |

Observe that qA qB qC pAqA = pBqB = pC qC = = . qAqB + qBqC + qAqC Then, the expected weight of red edges between L and R equals:

red E E (L, R) = (w(A)+w(B)+w(C)) = ↵. | |

Here, we used that w(A)+w(B)+w(C)=↵. The expected weight of uncut red edges equals (1 )↵. We now lower bound the weight of uncut blue edges.

Proposition 4.2.12 The expected weight of uncut blue edges is at least .

Proof. We separately consider edges between sets A and B, B and C, A and C. Consider an edge (u, v) A B. This edge is not in the cut (L, R) if both endpoints u and v belong to L or both 2 ⇥ endpoints belong to R. The former event – u, v L –occursifthesetB is chosen in the first { 2 } step of the algorithm and the set S contains vertex v; the latter event – u, v R –occursifA A { 2 } is chosen in the first step of the algorithm and the set SB contains vertex u. The probability of the union of these events is4

a b [(u, v) / (L, R)] = pB + pA = 2 · a + c · b + c e e (a + c) e e (b + ec) = p q + p q . e B B · 2c A A · 2c e e e e Since pAqA = pBqB = ,wehave e e a + b +2c 1/2+c [(u, v) / (L, R)] = = . 2 · 2c · 2c e e e e The last inequality holds because (1/2+c)/c 2 for all c (0, 1/2]. The same bound holds for e 2 e edges between sets B and C and sets A and C. Therefore, the expected weight of uncut blue edges is at least . e e e By the above two propositions (Proposition 4.2.11 and Proposition 4.2.12) the expected total CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 70

weight of uncut edges is at least:

Ew(L)+w(R) (1 )↵ + = ↵ (↵ ).

Note that we are in the case with ↵ 0. Thus to establish a lower bound on the expectation, we need to show an upper bound on .Write

qA qB qC 1 = = 1 1 1 . qAqB + qBqC + qAqC + + qA qB qC

After plugging in the values of qA, qB, and qC , we obtain the following expression for .

1 = = 2 2 2 (b+c) + (a+c) + (a+b) 2bc 2ac 2ab e e e e e e ee e1e ee = 3+ 1 b+c + a+c + a+b 2 a b c e e e e e e Observe that a + b + c = 1 and a + b + c =1e /2. Thus,e be + c =1/2 a, a + c =1/2 b, and a + b =1/2 c. Hence, e e e 1 e e e e e e = = 1 1/2 a 1/2 b 1/2 c e e e 3+ + + 2 a b c e e e 1 = e e . e 3 + 1 1 + 1 + 1 2 4 a b c

Note that since the function t 1/t is convex for e t>e 0, wee have 7! 1 1 1 2 2 2 + = = 2 a b a + b 1/2 c c ⇣ ⌘ Therefore, e e e e e 1 c(1 2c) = .  3 + 1 + 1 1 3c2 2 c 4c We conclude that the expected weight of uncut edges is at least ↵ (↵ ) (c), where (c)= e max max c(1 2c)/(1 3c2). Proposition 4.2.13 Let ⇢ =0.8776 be the approximation factor of the Max-Uncut Bisection algorithm [10, 99]. Then if ↵,ourAlgorithm 7 outputs a solution of value at least 2⇢ (n 1)((1  3 max(c))↵ + max(c)).

Proof. Let (L, R) be the bisection produced by the ⇢-approximate Max-Uncut Bisection algorithm. This partition satisfies:

w(L)+w(R) ⇢OPT Max-Uncut Bisection CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 71

Our Algorithm 7 produces a tree where at the top level the left subtree is L, the right subtree is R and both of these subtrees are then generated by Average-Linkage. Hence each edge (i, j) L L 2 ⇥ (and similarly for edges in R R) contributes: ⇥

w n + 1 ( n 2) = 2 w (n 1) ij 2 3 2 3 ij to the objective. Thus the overall value of our solution is at least:

2 (n 1)(w(L)+w(R)) 3 2⇢ (n 1) OPT 3 · Max-Uncut Bisection If ↵ then by Lemma 4.2.12 we have that OPT ↵ (↵ ) (c) and  Max-Uncut Bisection max the proof follows by rearranging the terms.

Lemma 4.2.13 The approximation factor ⇠ of our Algorithm 7 is at least 4⇢ 0.42469. 3(2⇢+1) Proof. First, note that if ↵ then by Proposition 4.2.10 the approximation is at least 0.44. Hence it suces to only consider the case when ↵. Recall that by Fact 4.2.9, Average-Linkage  outputs a solution of value 1 (↵ + )(n 2) and by Proposition 4.2.8,wehaveOPT ↵(n 2) + C . 3  | | Hence if 1 (↵ + )(n 2) ⇠ (↵(n 2) + C ) then the desired approximation holds. 3 | | Thus we only need to consider the case when 1 (↵+)(n 2) ⇠ (↵(n 2) + C ) or equivalently: 3  | |

1 C (↵ + ) ⇠ ↵ + n| |2 3  () ⇣ ⌘ 3⇠ 1 ↵. ()  C 1 3 n| |2 ⇠ Let c = 2⇢ (1 (c)) and c = 2⇢ (c). In this case by Proposition 4.2.13, our Algorithm 7 1 3 max 2 3 max gives value at least c (n 1)↵ + c (n 1). Hence it suces to show that c (n 2)↵ + c (n 2) 1 2 1 2 ⇠(↵(n 2) + C ). Or equivalently that: | |

C ( n| |2 ⇠ c2) ↵(c1 ⇠) 

Using the bound on above it suces to show that:

3⇠ 1 C | | C ( n 2 ⇠ c2) c1 ⇠.  1 3 n| |2 ⇠ After simplifying this expression the above bound holds for:

c1 c2 ⇠ cn cn .  1+3n 2 c1 n 2 3c2 CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 72

Hence it suces to find the minimum of the RHS over c [0, 1 1 ]. Plugging in the expressions 2 2 n for c1 and c2 after simplification the RHS is equal to:

2 ⇢(1 c) 3 2c2⇢ +(1 3c2) Di↵erentiating over c one can show that the minimum of this expression is attained for c = 1 1 . 2 n Indeed, the numerator of the derivative is quadratic function with negative leading coecient whose pt(t+1) roots are 1 for t =2⇢ 3. The left root is approximately 0.545 and hence the derivative ± t is negative on [0, 1 1 ]. The value at the minimum c = 1 1 is thus equal5 to (⇢ =0.8776): 2 n 2 n 4⇢ 0.42469 3(2⇢ + 1)

4.2.6 Hardness for Moseley-Wang via Small Set Expansion

We complement our positive results in the previous section by providing APX-hardness for the Moseley-Wang objective (even for 0-1 similarities), under the Small Set Expansion hypothesis.

Theorem 4.2.14 Under the Small Set Expansion ( SSE) hypothesis, there exists ✏>0, such that it is NP-hard to approximate the Moseley-Wang HC objective function within a factor (1 ✏). Initially introduced by Raghavendra and Steurer [82], SSE has been used to prove improved hard- ness results for optimization problems including Balanced Separator and Minimum Linear Arrangement [83]. Given a d-regular, unweighted graph G =(V,E) and S V ,letµ(S):= S / V and (S):= ✓ | | | | E(S, V S) /d S . Raghavendra et al.[83] prove the following strong hardness result. (While it is | \ | | | not explicitly stated that the result holds for regular graphs, it can be checked that their reduction produces a regular graph [93].)

Theorem 4.2.15 (Theorem 3.6 of [83]) Assuming the SSE,foranyq N and ✏,> 0, given 2 a regular graph G =(V,E), it is NP-hard to distinguish the following two cases.

YES: There exist q disjoint sets S ,...,S V such that for all ` [q], • 1 q ✓ 2

µ(S )=1/q and (S ) ✏ + o(✏). ` ` 

NO: For all sets S V , • ✓ (S) 1 ✏/2(µ(S)) /µ(S) CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 73

where 1 ✏/2(µ(S)) is the expansion of the sets of volume µ(S) in the infinite Gaussian graph with correlation 1 ✏/2.

Proof. [Proof of Theorem 4.2.14] Let us consider the instance of Hierarchical Clustering defined by the same graph where each pair has weight 1 if there is an edge, and 0 otherwise. Then W = E is | | the total weight.

YES: The fraction of edges crossing between di↵erent S ’s is at most ✏ + o(✏), and all edges • i inside some S are multiplied by at least n(1 1/q) in the objective function. So the objective i function for Hierarchical clustering is at least

(1 ✏ o(✏))W (1 1 )n nW (1 1 ✏ o(✏)). · q q

NO: Consider an arbitrary binary tree that maximizes the Moseley-Wang objective function. • T For a tree node a ,let be the subtree of rooted at a, and V V be the set of graph 2T Ta T a ✓ vertices corresponding the leaves of Ta. Let b be a highest node such that n/3 V 2n/3 (such a node always exists in a 2T | b| binary tree). By Theorem 4.2.15,wehave

(Vb) 1 ✏/2(µ(Vb)) /µ(Vb) Cp✏

for some absolute constant C. Here we use the fact that

1 ✏/2(µ(Vb)) ⌦(p✏) for µ(Vb) [1/3, 2/3] 2

and take small enough depending on ✏. So the total fraction of edges in E(V ,V V ) is at least b \ b

µ(V ) (V ) Cp✏ . b · b 3

Note that edges in E(T ,V T ) will multiplied by at most n/3 in the objective function. (Let b \ b a be the parent of b.ThenV > 2n/3 by the choice of b and for any edge crossing V ,the | a| b the least common ancestors of the two endpoints will be a or one if its ancestors.) Therefore, the objective function is at most

nW Cp✏ W 2n = nW (1 2Cp✏ ). 3 · 3 9

Therefore, the value is at least nW (1 1/q ✏ o(✏)) in the YES case and nW (1 (2Cp✏/9)) in the NO case. By taking ✏>0 suciently small and q arbitrarily large, there is a constant gap CHAPTER 4. BLACK-BOX APPROXIMATIONS FOR HIERARCHICAL CLUSTERING 74

between the YES case value and the NO case value. Chapter 5

Hierarchical Clustering for Euclidean Data

What’s the most you ever lost on a coin toss?

-No Country for Old Men (2007)

In previous chapters, we were given pairwise similarity information in the input; however, in practice we may have access to informative features about our data points. For example, a new YouTube video is uploaded and we try to gather in an automatic way if this video talks about cats or not, if it is long or short, funny or serious, o↵ensive or not etc. All these are features and the goal then is to find a good hierarchical clustering based on the features.

Contributions. We will start with the simplest case of 1-dimensional Euclidean data. Even in this seemingly simple setting, obtaining an ecient algorithm that produces an exactly optimum solution seems non-trivial, motivating the study of approximation algorithms. We will prove that 1 two algorithms – Random Cut (RC) and Average-Linkage (AL) – obtain 2 -approximation of the optimal solution as measured by the Moseley-Wang objective (HC-OBJ-2). Average-Linkage achieves this deterministically, while RC only in expectation. This beats the best known approx- imation for the general case, which is 0.336 [35] as we saw in the previous chapter. Here RC is substantially faster than Average-Linkage with a running time of O(n log n)vs.O(n2 log n). We next consider the high-dimensional case with the Gaussian Kernel and show that Average- 1 Linkage cannot beat the factor 3 even in poly-logarithmic dimensions. We propose the Projected 1 Random Cut (PRC) algorithm that gets a constant improvement over 3 , irrespective of the dimension d (the improvement is a function of the ratio of the diameter to 2, and drops as this ratio gets large). Furthermore, a simple implementation of PRC runs in O(n(d + log n)) time while Average-Linkage runs in O(dn2 log n) time. Even single-linkage (equivalent to finding a minimum

75 CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 76

spanning tree) runs in almost-linear time only for constant d and has exponential dependence on the dimension (see e.g., [100]) and it is open whether it can be scaled to large d when n is large.

Experiments. Many existing algorithms have time eciency shortcomings, and none of them can be used for really large datasets. On the contrary, our Projected Random Cut (see Section 5.4) is a fast HC algorithm that scales to the largest ML datasets. The running time scales almost linearly and the algorithm can be implemented in one pass without needing to store similarities, so the memory is O(n). We also evaluate its quality on a small dataset (Zoo).

5.1 Setting: Feature Vectors with Gaussian Kernel

The main protagonist here will be the Moseley-Wang objective (HC-OBJ-2), which we will denote as + in our equations. As we know,Average-Linkage obtains a 1 -approximation for maximizing F 3 this objective function, and we showed in the previous chapter, there is an SDP-based algorithm 1 that achieves a ( 3 + ✏)-approximation for the problem, for small constant ✏. One drawback of the prior work on these hierarchical clustering objectives is the fact that they all consider arbitrary similarity scores (specified as an n n matrix); however, there is much more ⇥ structure to such similarity scores in practice and here we study the commonly encountered case of Euclidean data, where the similarity score wij is computed by applying a monotone decreasing function to the Euclidean distance between i and j. Roughly speaking, we show how to exploit this structure to design improved approximation/faster algorithms for hierarchical clustering, and how to re-analyze algorithms commonly used in practice. Arguably the most common distance-based similarity measure used for Euclidean data is the Gaussian kernel. Here we use the spherical version w exp( v v /22). The parameter 2, ij ⇠ k i jk referred to as the bandwidth, plays an important role for applications and a large body of literature exists on selection of this parameter [101]. While it might seem that the case of Euclidean data with the Gaussian kernel is a very restricted class of inputs to the HC problem, we show that for suitably high dimensions and suitable small choice of bandwidth 2 it can simulate arbitrary similarity scores (scaled appropriately). Thus any improvements to approximation guarantees for the Euclidean case (that do not apply to general similar scores) must necessarily involve assumptions about the dimension of the data (not very high) or on 2 (not pathologically small). Such assumptions on 2 are consistent with common methods for computing the bandwidth parameter based on data (e.g., [101]).

Euclidean data. We consider data sets represented as sets of d-dimensional feature vectors. Sup- d pose these vectors are v1,...,vn R . We focus on similarity measures between pairs of data 2 points, denoted by [wij]i,j [n], where the similarities only depend on the underlying vectors, i.e., 2 CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 77

d d wij = f(vi,vj) for some function f : R R [0, 1] and furthermore are fully determined by ⇥ ! monotone functions of distances between them.

Definition 5.1.1 (Distance-based similarity measure) Asimilaritymeasurewij = f(vi,vj) is

“distance-based” if f(vi,vj)=g( vi vj 2) for some function g : R [0, 1],andis“monotone k k ! distance-based” if furthermore g : R [0, 1] is a monotone non-increasing function. ! As an example of the monotone similarity measure it is natural to consider the Gaussian kernel similarity,i.e., v v 2 k i j k2 n 2 wij =(p2⇡) e 2 , (Gaussian Kernel) where is a normalization factor determining the bandwidth of the Gaussian kernel [54]. For n simplicity, we ignore the multiplicative factor (p2⇡) (unless noted otherwise), as our focus is on multiplicative approximations and scaling has no e↵ects.

General upper bounds. In order to analyze the linkage-based clustering algorithms and our proposed algorithms, we propose a natural upper bound on the value of the + objective function. F The idea is to decompose the objective function + into contributions of triple of vertices: F

+( )= w k + w i + w j F T ij Eij jk Ejk ik Eik i

MAX-upper , max(wij,wjk,wik), i

One-dimensional benchmarks. Consider the special case of 1D data points (with any monotone distance-based similarity measure as in Definition 5.1.1), and suppose v1 v2 ... vn R. Now,    2 for any triple i

1D-MAX-upper , max(wij,wjk), i

Again, clearly for all trees we have: T

+( ) 1D-MAX-upper 1D-SUM-upper F T  

5.2 Performance of Average Linkage

In this section we look at the extreme case where the feature vectors have d = 1, and we try to analyze the popular/natural algorithms existing in this domain by evaluating how well they approximate the objective function +. We focus on average-linkage and Random Cut (will be formally defined F later). In particular, Random Cut is a building block of our algorithm for high-dimensional data given in Section 5.4. We show that Average-Linkage gives a 1/2-approximation, and can obtain no better than 3/4 fraction of the optimal objective value in the worst-case. We then show that Random Cut also is a 1/2-approximation (in expectation) and this factor is tight. In Section 5.3, we discuss other simple algorithms: single-linkage and greedy cutting (will be defined formally later). We start by the observation that greedy cutting and single-linkage output the same tree (and so are equivalent). 1 We finish by showing that there is an instance where single-linkage attains only 2 of the optimum objective value. We further show on this instance both average-linkage and random cutting are almost optimal.

For the rest of this section, suppose we have 1D points x1 ..., xn R,wherewij =   2 g( xi xj ) for some monotone non-increasing function g : R [0, 1]. k k !

Average-linkage. In order to simplify the notation, we define wAB to be i A,j B wij for any 2 2 two sets A and B. We first prove a simple structural property of average-linkageP in 1D. Lemma 5.2.1 For d =1, under any monotone distance-based similarity measure w = g( x x ), ij k i jk average-linkage can always merge neighbouring clusters.

Proof. We do the proof by induction. This property holds at the first step of average-linkage (base of induction). Now suppose up to a particular step of average-linkage, the algorithm has always merged neighbours. At this step, we have an ordered collection of super nodes (C1,...,Cm), where for every i

wCiCj wCiCk x C wxCj x C wxCk < 2 i < 2 i C C C C ) C C | i|| j| | i|| k| P | j| P | k| But note that because of the monotone non-increasing distance-based similarity weights, for any triple (x, y, z) C C C we have w w , This is a contradiction to the above inequality, 2 i ⇥ j ⇥ k xy xz which finishes the inductive step. CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 79

Theorem 5.2.1 For d =1, under any monotone distance-based similarity measure w = g( x ij k i x ), average-linkage obtains at least 1 of the 1D-SUM-upper, and hence is a 1 -approximation for jk 2 2 the objective +. F Proof. The proof uses a potential function argument. Given a partitioning of points x x 1  2  ... x into sets S ,...,S , a triple of points i

(S1,...,Sm)= (wij + wjk) i

( +( ) 0)+ F TAL 1 (( x ,...,x ) ( x ,..., x )) 0. 2 { 1 n} { 1} { n}

Plugging values of at the start and the end, we get:

1 +( ) (1D-SUM-upper), F TAL 2 which implies the 1/2-approximation factor. To prove eq. (?), we focus on a single step of average-linkage where by Theorem 5.2.1 some two neighboring clusters denoted as A and B get merged. Let C denote the nodes on the left of A and let D denote the nodes on the right of B (Figure 5.1).1 By merging two clusters A, B, the change in the score of average-linkage is:

Score-AL = w ( C + D ) AB | | | | CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 80

Moreover, any separated triple i

= (w C + w B )+(w D + w A ) AB| | AC | | AB| | BD| |

To compare the two, we show that w ( C + D ) w B + w A , and hence: AB | | | | AC | | BD| |

Score-AL = w ( C + D ) w B + w A AB | | | | AC | | BD| | = Score-AL , which implies eq. (?) as desired. To prove the last claim, note that average-linkage picks the pair (A, B) over both (A, C) and (B,D). Therefore, by definition of average-linkage:

w w AB AC = w C w B A B A C ) AB| | AC | | | || | | || | w w AB BD = w D w A A B B D ) AB| | BD| | | || | | || | By summing above inequalities, we get w ( C + D ) w B + w A ,whichfinishesthe AB | | | | AC | | BD| | proof.

Figure 5.1: Illustration of the merging process in 1D.

As a final note, in Section 5.5 we discuss hard instances for average-linkage under Gaussian kernels when d = 1, where we essentially show the result of Theorem 5.2.1 is tight when comparing against 3 1D-SUM-upper, and there is no hope to get an approximation factor better than 4 for average-linkage in general for d = 1.

Random Cut. The following algorithm (termed as Random Cut) picks a uniformly random point r in the range [x1,xn] and divides the set of points into left and right using r as the splitter. The same process is applied recursively until the leaves are reached.

Lemma 5.2.2 For d =1under any monotone distance-based similarity measure wij = g(xi,xj) the algorithm Random Cut obtains at least 1/2 fraction of the 1D-MAX-upper, and hence gives a CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 81

Algorithm 8 Random Cut Input: Integer n, points x x . 1 ··· n Output: Binary tree with leaves (x1,...,xn)

if n == 1 then return New leaf containing x1. end if Pick r U([x ,x ]) ⇠ 1 n Let m be the largest integer such that xm r. Create new internal tree node x.  x.left = Random Cut(m, x1,...,xm) x.right = Random Cut(n m, xm+1,...,xn) return x

1 -approximation for the objective + in expectation. 2 F Proof. For every triple i

1 1 1 w p + w (1 p )=(w w ) p + (w + w ) (w + w ) jk 1 ij 1 jk ij 1 2 2 ij jk 2 ij jk ✓ ◆ By the linearity of expectation taking the sum over all triples i

5.3 Greedy Cutting and Single-linkage.

Consider a simple algorithm, denoted by Greedy Cut, that picks the interval with maximum length among [xi,xi + 1] i=1:n 1 (lets say [xm,xm+1]), and repeats the same operation recursively { } on (x1,...,xm) and (xm+1,...,xn) until the leaves are reached.

Lemma 5.3.1 For d =1and under any monotone distance-based similarity measure g Greedy Cut and single-linkage return the same HC tree. Moreover, the edges picked by Greedy Cut are exactly the same edges picked by single-linkage, picked in reverse order.

Proof. It is known that single-linkage is essentially the Kruskal algorithm (and hence edges picked by single-linkage form a Maximum Spanning Tree (MST)). Clearly, for any monotone distance-based measure the line connecting x1 to xn is the unique MST (as any tree can be shortcut-ed with this line), and hence single-linkage picks intervals [xi,xi+1] i=1:n 1 in increasing order of their lengths. { } At the same time, Greedy Cut picks also the same intervals, but in decreasing order of their CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 82

length. Moreover, because single-linkage merges the edges picked by Greedy Cut in the reverse ordering (and it creates the HC tree from bottom to top), it returns the same HC tree as Greedy Cut.

Remark 9 As a simple observation, Greedy Cut is equivalent to reverse-Kruskal in 1D; It starts from all edges, goes over them in increasing order of weights (here, in decreasing order of lengths), and only keeps an edge when its removal makes the graph disconnected. We claim the first edge picked by reverse-kurskal corresponds to the interval [xm,xm+1] with the maximum length, and hence the equivalence between the two algorithms by induction. To prove the claim, the line between x1 and xn keeps the graph connected, so the first picked edge corresponds to an interval (xi,xi+1).Also, it should be the interval [xm,xm+1] with maximum length. Moreover, removing this edge makes the graph disconnected, because otherwise there is was cross edge between the left-side (x1,...,xm) and the right-side (xm+1,...,xn). The length of such an edge is more than the length of (xm,xm+1),so it should have been removed before, a contradiction.

We finish the section by demonstrating the lack of performance of single-linkage (and hence Greedy Cut) through an example, which justifies using both average-linkage and Random Cut as smooth versions of single-linkage for 1D data points.

1 1 Lemma 5.3.2 For d =1, single-linkage can obtain at most 2 of the optimum objective, and 2 of the objective values of average-linkage and Random Cut, under Gaussian kernels.

Proof. Consider an example with n equally spaced points on a line, where the distance be- tween any two adjacent node is . We now slightly move the points so that weight of (xi,xi+1) n 2/22 is (p2⇡) (1 (i 1)✏)e for i =1:n 1. Now, single-linkage peels o↵points in the 2 order x1,...,xn. Roughly speaking, we let / to be large enough so we can ignore the similar- ity weights between any two non-adjacent points. Hence single-linkage gets an objective value of ⇡ n 1 n 2/22 2 2 n 2/22 (n i) (p2⇡) e , which evaluates to n /2+o(n ) (p2⇡) e . Average- i=1 ⇣linkageP returns⌘ the symmetric binary tree (and hence the objective value is

n 2/22 2 2 n 2/22 ((n 2)(n 1) n log(n)) (p2⇡) e = n + o(n ) (p2⇡) e . Random returns a random binary tree, with (roughly speaking) a similar objective value of

2 2 n 2/22 n + o(n ) (p2⇡) e in expectation. The fact that optimum objective value is as large as these two quantities finishes the proof. CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 83

5.4 A Fast Algorithm based on Random Projections

We now describe an algorithm Projected Random Cut which we use for high-dimensional data. This algorithm is given as Algorithm 9. It first projects on a random spherical Gaussian vector and then clusters the resulting projections using Random Cut.

Algorithm 9 Projected Random Cut d Input: Integer n, vectors v1,...,vn R . 2 Output: Binary tree with leaves (v1,...,vn)

Pick a random Gaussian vector g Nd(0, 1) Compute dot products x = v , g ⇠ i h i i xi1 ,...xin = Sort(x1,...,xn)

return Random Cut(n, xi1 ,...,xin )

d Theorem 5.4.1 For any input set of vectors v1,...,vn R the algorithm Projected Random 2 Cut gives an ↵-approximation (in expectation) for the objective + under the Gaussian kernel 2 2 F v v 2 vi vj 2/2 i j 2 similarity measure w ek k where ↵ =(1+)/3 for =min exp( k 2 k ). ij ⇠ i,j 2 Proof. Recall an upper bound on the optimum:

OPT MAX-upper = max(w ,w ,w ).  ij ik jk i

Projected Random Cut can also be expressed as ALG = i

E[ALGi,j,k] ↵ max(wij,wik,wjk).

d Fix any triple (vi,vj,vk) which forms a triangle in R . Conditioned on cutting this triangle for the first time let (pij,pik,pjk) be the vector of probabilities corresponding to the events that the corresponding edge is not cut. I.e. this is the probability that we score the contribution of this edge in the objective. Note that pij + pik + pjk = 1. CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 84

v3

✓1 ✓

✓2 ✓3 v1 x v2 (`1) y (`3) (` ) g? g 2

z

Figure 5.2: Projecting the triangle (v1,v2,v3) on g.

Consider any triangle whose vertices are vi,vj,vk. To simplify presentation we set i =1,j = 2,k = 3. We can assume that v v v v v v . Let ✓ = (v v ,v v ), k 2 1kk 2 3kk 1 3k 1 \ 1 3 2 3 ✓ = (v v ,v v ) and ✓ = (v v ,v v ) so that ✓ ✓ ✓ . See Figure 5.2. Note 2 \ 2 1 3 1 3 \ 1 2 3 2 1 2 3 that the probability that the i-th longest side of the triangle has the longest projection is then ✓i/⇡.

Lemma 5.4.1 If (v ,v ) is the shortest edge in the triangle (v ,v ,v ) then it holds that p 1 1 3 1 2 3 13 3

Proof. Suppose that g forms an angle ⇡/2 ✓ with (v v ), i.e. the vector g? orthogonal to g 3 1 forms angle ✓ with v v . We define three auxiliary points x, y, z as follows (see also Figure 5.2.). 3 1 Let `1 be a line parallel to g? going through v3,let`2 to be a line parallel to g going through v2 and let `3 be the line parallel to `1 going through v1.Wethenletx be the intersection of `1 and

(v1,v2), y be the intersection of `1 and `2 and z be the intersection of `2 and `3 (see Fig 5.2). Thus the projections of v v , v v and v v on g are y z, v z and v y. Hence 3 1 2 1 2 3 2 2 conditioned on (v1,v2) having the longest projection the probability of scoring the contribution of 12 y z (v2,v3) is given as p23(✓)= ky v k since we are applying the Random Cut algorithm after the k 2k 12 y z x v1 projection. Note that by Thales’s theorem we have p23(✓)= ky v k = kv v k . Applying the law of k 2k k 2 1k sines to the triangles (v1,v2,v3) and (x, v1,v3)wehave:

sin ✓ sin(✓ + ✓ ) sin ✓ sin(✓ + ✓ ) 1 = 1 2 , = 2 , v v v v v v v x k 1 2k k 1 3k k 1 3k k 1 k where we used the fact that sin ✓ =sin(⇡ ✓ ✓ )=sin(✓ + ✓ ). Similarly, we use the fact that 3 1 2 1 2 CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 85

sin( (v x, v x)) = sin(✓ + ✓ ). Using the above we can express p12(✓) as: \ 3 1 2 23

12 sin ✓ sin(✓1 + ✓2) p23(✓)= sin(✓ + ✓2) sin ✓1

Thus the overall probability of scoring the contribution of the edge (v1,v3) conditioned on (v1,v2) 12 having the longest projection which we denote as p13 is given as:

1 ✓1 p12 = p12d✓ 23 ✓ 23 1 Z0 12 Similarly, consider the probability p13 of scoring the contribution of (v1,v3) conditioned on (v1,v2) having the longest projection. We can express it as:

1 ✓1 p12 = p12(✓)d✓, 13 ✓ 13 1 Z0 where 12 sin ✓ sin(✓1 + ✓3) p13(✓)= . sin(✓ + ✓3) sin ✓1 Below we will show that p12 p12. In fact, we will show that for any fixed ✓ [0,✓ ] it holds that 13 23 2 1 p12(✓) p12(✓). Comparing the expressions for both it suces to show that sin(✓1+✓3) sin(✓1+✓2) 13 23 sin(✓+✓3) sin(✓+✓2) for all ✓ [0,✓ ]. Since ✓ = ⇡ ✓ ✓ this is equivalent to: 2 1 1 2 3

sin ✓3 sin ✓2 sin(✓ + ✓2)  sin(✓ + ✓3)

It suces to show that: sin ✓ sin(✓ + ✓ ) sin ✓ sin(✓ + ✓ ) 3 3  2 2 Using the formula sin ↵ sin = 1 (cos(↵ ) cos(↵ + )) it suces to show that: 2

cos(✓ +2✓ ) cos(✓ +2✓ ). 3 2

The above inequality follows for all ✓ [0,⇡ ✓ ✓ ]since✓ ✓ . This shows that p12 p12. 2 2 3 3  2 13 23 Since the probability that (v1,v2) has the longest projection is ✓1/⇡ we have that the probability

✓1 of scoring (v1,v3) and (v1,v2) having the longest projection is at least 2⇡ . An analogous argument shows that the probability of scoring (v1,v3) and (v2,v3) having the longest projection is at least

✓2 2⇡ . Putting things together: 1 ✓ + ✓ 1 ⇡ ✓ 1 p 1 2 = 3 , 13 2 ⇡ 2 ⇡ 3 where we used that ✓ ⇡/3since✓ ✓ ✓ . 3  3  2  1 CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 86

2 We are now ready to complete the proof of Theorem 5.4.1. Let = 3 and note that /3 (1+) since 1. If p 1/3+ then the desired guarantee follows immediately. Otherwise, if  13 p 1/3+ then we have: 13 

E[ALG1,2,3] p13w13 + p12w12 + p23w23 = OPT1,2,3 w13 1 2 w 1 2 + 12 + 3 3 w 3 3 ✓ ◆ 13 ✓ ◆ 1 2 1+ = + , 3 3 1+ 3 where we used the fact that

v v 2 v v 2 v v 2 w12 k 1 3k2k 1 2k2 k 1 2k2 = e 22 e 22 . w13

5.4.1 Gaussian Kernels with small

Theorem 5.4.1 only provides an improved approximation guarantee for Projected Random Cut compared to the factor 1/3 (i.e., the tight approximation guarantees of average-linkage in high v v 2 dimensions; see Section 5.5)if is not too small, where =min exp( k i j k2 ). In particular, i,j 22 we get constant improvement if =⌦(1). Is this a reasonable assumption? Interestingly, we answer this question in the armative by showing that if we have =exp( ⌦˜(n)), then the Gaussian kernel can encode arbitrary similarity weights (up to scaling, which has no e↵ects on multiplicative approximations). For simplicity, we only prove this result for ✏, 1 weights here, while it can be { } generalized to arbitrary weights.

Theorem 5.4.2 Given any undirected graph G =(V,E) on n nodes and ✏>0, there exist unit d 2 1 vectors kv v V in R and bandwidth parameter R+, such that d = O(n ), 2 =⌦(n log(1/✏)), { } 2 2 and for some ↵>0 we have:

k k 2 k u v k2 (u, v) E : e 22 = ↵, 8 2 k k 2 k u v k2 (u, v) / E : e 22 = ↵✏. 8 2

n d Proof. Our proof is constructive. Let d = . Pick orthogonal vectors xe e E in R such that 2 { } 2 1 d xe 2 = ,wheredu is the degree of node u in graph G. For each v V ,defineyv R as k k dudv 2 2 follows:

yv , (dudv)xuv. (u,v) E X2 CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 87

2 2 2 1 Note that yv 2 = (u,v) E dudv d2 d2 = dv As the next step, pick a set of n orthonormal vectors k k 2 u v d zv in the null space of yv v V . Finally, for each v V , define the final vector kv R as follows: { } P { } 2 2 2

dv 1 kv , 1 zv + yv r n rn First, note that these vectors have unit length:

d y 2 d d k 2 =1 v + k vk2 =1 v + v =1 k vk2 n n n n

Now, pick any two vertices u and v.If(u, v) / E,then: 2

dv 1 dv 1 ku,kv = 1 zu + yu, 1 zv + yv h i hr n rn r n rn i 1 = y ,y =0, nh u vi where we used the fact that z ,z =0, z ,y = z ,y = 0 and the fact that x ,x = 0 for h u vi h u vi h v ui h e e0 i every edge e incident to u and every edge e0 incident to v, as e = e0 when (u, v) / E. Similarly, 6 2 when (u, v) E: 2 1 1 1 k ,k = y ,y = (d2 d2 x 2)= h u vi nh u vi n u vk uvk2 n

1 Now, consider a Gaussian kernel with bandwidth =(n log(1/✏)) 2 and vectors kv v V .Since { } 2 k k 2 = 2(1 k ,k ) from the above calculations of the inner products it follows that: k u vk2 h u vi

2 ku kv 1 1/n k k2 (u, v) E : e 22 = e 2 , 8 2 k k 2 k u v k2 1 (u, v) / E : e 22 = e 2 . 8 2

1 Thus the ratio is e n2 = ✏, as desired.

5.5 Hard Instances with Gaussian Kernel

High-dimensional case. We embed the construction of [35] shown in Figure 5.3 into vectors with similarities given by the Gaussian kernel.

d Theorem 5.5.1 There exists a set of vectors v1,...,vn R for d = poly(log n) for which the 2 average-linkage clustering algorithm achieves an approximation at most 1 + o(1) for + under the 3 F Gaussian kernel similarity measure.

Proof. [Proof of Theorem 5.5] We start by this theorem: CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 88

K K K K K K n1/3 n1/3 n1/3 n1/3 n1/3 n1/3

K n2/3 1

1 ✏ K n2/3

K n2/3

Figure 5.3: Hard instance from [28]. Vertices in orange blocks form cliques of size n2/3 connected by edges of similarity 1 I✏, vertices in blue blocks form cliques of size n1/3 connected by edges of similarity 1, all other pairs have similarity 0.

Theorem 5.5.2 ([35]) For any constant ✏ (0, 1) the instance average-linkage clustering achieves 2 I the value of + at most 1 n8/3 + O(n7/3) while the optimum is at least 1 n8/3 O(n7/3). F 6 2 Let ,⌧> 0 be real-valued parameters to be chosen later. We use indices i and j to index our set of vectors. For i 1, 2,...,n1/3 , j 1, 2,...,n2/3 let 2{ } 2{ }

vi,j =(ei +(1+⌧)ek+j),

1/3 where k = n and ei is the t-th standard unit vector et =(0...,1,...,0) with the 1 in the t-th entry. Then it is easy to see that for any fixed i [n1/3] and j = j [n2/3] it holds that: 2 1 6 2 2

v v 2 = 2(1 + ⌧)22 k i,j1 i,j2 k2

Similarly, for any fixed j [n2/3] and i = i [n1/3] it holds that: 2 1 6 2 2

v v 2 = 22. k i1,j i2,jk2

Otherwise if i = i [n1/3] and j = j [n2/3]then: 1 6 2 2 1 6 2 2

v v 2 = 22 + 2(1 + ⌧)22 42. k i1,j1 i2,j2 k2

By setting 2 =22c log n for a suciently large constant c the contribution of pairs of vectors with i = i and j = j can be made negligible. Let 2(1 + ⌧)22 = ↵ and 22 = .Therest 1 6 2 1 6 2 of the pairs correspond to an hard instance for which average-linkage only achieves a 1 + o(1)- I 3 (2 ↵2)/22 approximation compared to the optimum. By setting ⌧ =1/poly(log(n)) we have e = CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 89

⌦(1) and hence by Theorem 5.5.2 it follows that average-linkage clustering can’t achieve better than 1/3+o(1) approximation for this instance. Finally, note that by applying the Johnson-Lindenstrauss transform we can reduce the dimension required for the above reduction to d = poly(log n). Indeed, projecting on a random subspace of log n 2 dimension O( ⇠2 ) would preserve `2-distances between all pairs of vectors up to a multiplicative factor of (1 ⇠)setting⇠ = ⌧ =1/poly(log n) it follows that our hard instance can be embedded ± 10 in dimension d = poly(log n).

Low-dimensional case. For d = 1 a hard instance for average-linkage clustering can also be constructed.

Lemma 5.5.1 For points x1,...,xn R average-linkage clustering achieves approximation at most 2 3/4 for + under the Gaussian kernel similarity measure. F Proof. Consider the instance consisting of four equally spaced points on a line, i.e. 0, , 2, 3. Then (after shifting the two middle points slightly) the average-linkage clustering algorithm might first connect the two middle points and then connect the two other points in arbitrary order. We denote the cost of this solution as AV G. An alternative solution would be to create two groups (0, ) and (2, 3) and then merge them together. We denote the cost of this solution as OPT. By making suciently large the contribution of pairs at distance more than from each other can be ignored. n 2/22 2/22 n 2/22 We thus have AV G (p2⇡) (3e +o(e )) and OPT (p2⇡) 4e ,which  gives the ratio of 3/4 for suciently large .

Corollary 5.5.3 In the four-point instance of the proof of Theorem 5.5.1,the1D-SUM-upper eval- uates to n 2/22 (p2⇡) 6e , and hence average-linkage cannot obtain more than 1/2 of 1D-SUM-upper.

In this section we demonstrate the quality of the solution returned by Projected Random Cut (PRC) on a small dataset, and highlight that running time only scales linearly on a large dataset. PRC does not compute the similarity weights (i.e., it only takes one pass over the feature vectors with dimension-free memory requirements) and then sorts the projected points in time O(n log n). Hence, it is fast even in use-cases with over millions of datapoints (in contrast to average-linkage, spectral clustering, or even single-linkage which all have superlinear running times in high dimensions). We run PRC algorithm on two real datasets from the UCI ML repository [71]: (i) the Zoo dataset [71, 96](thesmall dataset) that contains 100 animals given as 16D feature vectors (this dataset comes from applications in biology), and (ii) the SIFT10M dataset [52](thelarge dataset) that contains around 10M datapoints, where each datapoint is a 128D Scale Invariant Feature Transform (SIFT) vector (this dataset comes from applications in computer vision). For more details, refer to subsection 5.5.1. CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 90

PRC PRC Spectral AL MAX-upper MAX-upper 1.5 48 61 28 64 0.75 2 64 83 47 87 0.74 2.5 83 100 66 105 0.79 3 100 112 82 117 0.85 3.5 111 121 95 126 0.87 4 117 128 105 132 0.88 4.5 123 133 114 137 0.91 5 129 137 120 140 0.92

3 Table 5.1: Values of the objective (times 10 ) on the Zoo dataset (averaged over 10 runs).

Small dataset (Zoo). We compare PRC to (i) the recursive spectral clustering using the second eigenvector of the normalized Laplacian of the weight matrix (Spectral)[35], (ii) average-linkage (AL), and (iii) MAX-upper, an upper-bound on the objective value of the optimum tree; see Chapter 3. In contrast to PRC, these three benchmarks need to compute the weights and are slow on large datasets. Table 5.1 summarizes the results of this experiment for various choices of the parameter (first column). The second, third and fourth columns report the objective values (i.e., +) of the F trees returned by PRC, Spectral and AL respectively, and the fifth column gives an empirically observed approximation guarantee for PRC by comparing it against the upper bound on optimum MAX-upper. We observe that PRC attains high approximation factors of [0.73,...,0.92] compared ⇡ to MAX-upper . Moreover, PRC obtains competitive objective values compared to Spectral and AL, although it runs faster (in almost linear-time). On the choice of bandwidth parameter , as a rule of thumb, we find an interval [↵,]such that (i) if >, a considerable portion of pairs of datapoints with very di↵erent weights under the cosine similarities [90] have almost equal weights under the Gaussian kernel, and (ii) if <↵,a considerable portion of pairs of datapoints with almost equal weights under the cosine similarities have very di↵erent weights under the Gaussian kernel. We end up with ↵ =1.5, = 5.

Large dataset (SIFT10M) The focus of this experiment is measuring the running time of PRC, and showing that it only scales linearly with the dataset size. Note that evaluating the performance of any other algorithm or upper bound (or even one pass over the similarity matrix) would be prohibitive. We run PRC on truncated versions of SIFT10M of sizes 10K, 100K, 500K, 1M and 10M. is set to 450 2. We emphasize that our PRC algorithm runs extremely fast on a 2014 MacBook3.The running times are summarized in Table 5.2. Observe that PRC scales almost linearly with the data and has almost the same running time as just a single pass over the datapoints. CHAPTER 5. HIERARCHICAL CLUSTERING FOR EUCLIDEAN DATA 91

Size PRC (seconds) 1 Data Pass (seconds) 10K 1.7 1.5 100K 13 9.6 500K 67 46.4 1M 135 99.7 10M 1592 1144

Table 5.2: Running times of PRC and one pass.

5.5.1 Datasets from Section 5.5

We used real data in our experiments. To be able to measure the performance of produced Hierar- chical Clusterings, we need to compute MAX-upper, and hence we use a small dataset. To be able to show linear scaling of running time even for tens of millions of datapoints, we will use large datasets of sizes 10K to 10M. The small dataset: The Zoo dataset contains 100 animals given as 16-dimensional vectors, form- ing 7 di↵erent classes (e.g., mammals, amphibians, etc.). The features contain information about the animal’s characteristics (e.g., if it has tail, hair, fins, etc.). Here, we want to showcase the quality of the solution produced by PRC; in the case of the small dataset, we can a↵ord keeping track of two benchmarks: the performance of the widely-used Spectral algorithm and the MAX-upper4. The large dataset: In the SIFT10M dataset, each data point is a SIFT feature, extracted from Caltech-256 [56] by the open source VLFeat library [95]. Caltech-256 is used as a computer vision benchmark image data set, that features a total of 256 di↵erent classes with high intra-class variations for each category. Each datapoint is a 128-dimensional vector and similar to the Zoo dataset, we use this information as features to perform fast Hierarchical Clustering. Here, we want to test the scalability of our algorithm so we use successively larger datasets from the SIFT10M dataset of 10K, 100K, 500K, 1M and finally 10M datapoints by truncating the data file as necessary. Chapter 6

Hierarchical Clustering with Structural Constraints

By ‘life’ we mean a thing that can nourish itself and grow and decay.

-Aristotle

This chapter is devoted to the case where both quantitative and qualitative information is avail- able to our algorithms. The former takes the form of similarity weights as we previously had, whereas the latter come as structural constraints to be followed by the output hierarchy. For many real-world applications, we would like to exploit prior information about the data that imposes constraints on the clustering hierarchy, and is not captured by the set of features available to the algorithm. A simple example of such constraints is a triplet constraint (or “must-link-before”) constraints ab c: these are the analogue of “must-link/cannot-link” from standard clustering. Notice that hier- | archies on data imply that all datapoints are linked at the highest level and all are separated at the lowest level, hence “cannot-link” and “must-link” constraints are not directly meaningful. A triplet ab c requires that valid solutions contain a sub-cluster with a, b together and c previously separated | from them. For a concrete example from taxonomy of species, a triplet constraint may look like (Tuna, Salmon Lion). | Structural constraints pose major challenges for bottom-up approaches like average/single linkage and even though they can be naturally incorporated into top-down divisive algorithms, no formal guarantees exist on the quality of their output as measured by Dasgupta’s cost and its variants. In this chapter, we provide provable approximation guarantees for two simple top-down algorithms. We show how to find good solutions even in the presence of conflicting prior information, by formulating a constraint-based regularization of the objective. Finally, we demonstrate our approach on a real dataset for the taxonomy application.

92 CHAPTER 6. HIERARCHICAL CLUSTERING WITH STRUCTURAL CONSTRAINTS 93

6.1 Motivation and Chapter Overview

Hierarchical clustering was originally motivated by gene expression data analysis [46] and appli- cations in bioinformatics [45]. In such applications, the user (i.e., biologist) often has background knowledge about the data that may not be captured by the input to the clustering algorithm which is just automatically generated pairwise similarities. This background knowledge comes in the form of constraints on the hierarchy to be found. There is a rich body of work on constrained flat clus- tering formulations that take into account such user input in the form of “cannot link” and “must link” constraints [97, 98, 20]. Very recently, “semi-supervised” versions of HC that incorporate ad- ditional constraints have been studied [96], where the natural form of such constraints is triplet (or “must link before”) constraints ab c. Such triplet constraints, as we formally show later, can encode | more general structural constraints in the form of rooted subtrees. Surprisingly, such simple triplet constraints already pose significant challenges for bottom-up linkage methods. (Figure 6.1).

Figure 6.1: (Left) Example of a triplet constraint uv w and more general rooted tree constraints | on 4 points u, v, w, z. (Right) Example with only two constraints ab c, a0b0 c0 demonstrating that popular distance-based linkage algorithms may fail to produce valid HC.| Here| they get stuck after 3 merging steps (green edges).

Our work in this chapter is motivated by applying the optimization lens to study the interaction of hierarchical clustering algorithms with structural constraints. Constraints can be fairly naturally incorporated into top-down (i.e. divisive) algorithms for hierarchical clustering; but can we estab- lish guarantees on the quality of the solution they produce? Another issue is that incorporating constraints from multiple experts may lead to a conflicting set of constraints; can the optimization viewpoint of hierarchical clustering still help us obtain good solutions even in the presence of infea- sible constraints? Finally, di↵erent objective functions for HC have been studied in the literature; do algorithms designed for these objectives behave similarly in the presence of constraints? To the best of our knowledge, the work presented in this chapter is the first to propose a unified approach for constrained HC through the lens of optimization and to give provable approximation guarantees for a collection of fast and simple top-down algorithms that have been used for unconstrained HC in practice (e.g. community detection in social networks [75]). CHAPTER 6. HIERARCHICAL CLUSTERING WITH STRUCTURAL CONSTRAINTS 94

Two HC Objectives. Recall Dasgupta’s objective from previous chapters, whose goal is to find atreeT ⇤ such that:

T ⇤ = arg min wij Tij (6.1) all trees T ·| | (i,j) E X2 We denote (6.1) as similarity-HC. For applications where the geometry of the data is given by dissimilarities, again denoted by wij (i,j) E, recall Cohen-Addad et al. objective [39]: { } 2

T ⇤ = arg max wij Tij (6.2) all trees T ·| | (i,j) E X2 We denote (6.2) as dissimilarity-HC. A comprehensive list of desirable properties of the aforemen- tioned objectives can be found in previous chapters.

Our Results. i) We design algorithms that take into account both the geometry of the data, in the form of similarities, and the structural constraints imposed by the users. Our algorithms emerge as the natural extensions of Dasgupta’s original recursive sparsest cut algorithm and the recursive balanced cut suggested in [27]. We generalize previous analyses to handle constraints and we prove an O(k↵n)-approximation guarantee, thus surprisingly matching the best approximation guarantee of the unconstrained HC problem for constantly many constraints. ii) In the case of infeasible constraints, we extend the similarity-HC optimization framework, and we measure the quality of a possible tree T by a constraint-based regularized objective. The regularization naturally favors solutions with as few constraint violations as possible and as far down the tree as possible (similar to the motivation behind the similarity-HC objective). For this problem, we provide a top-down O(k↵n)-approximation algorithm by drawing an interesting connection to an instance of the hypergraph sparsest cut problem. iii) We then change gears and study the dissimilarity-HC objective. Surprisingly, we show that known top-down techniques do not cope well with constraints, drawing a contrast with the situation for similarity-HC. Specifically, the (locally) densest cut heuristic suggested in [39] performs poorly even if there is only one triplet constraint, blowing up its approximation factor to O(n). Moreover, 2 we improve upon the state-of-the-art in [39], by showing a simple randomized partitioning is a 3 - approximation algorithm. We also give a deterministic local-search algorithm with the same worst- case guarantee. Furthermore, we show that our randomized algorithm is robust under constraints, mainly because of its “exploration” behavior. In fact, besides the number of constraints, we propose an inherent notion of dependency measure among constraints to capture this behavior quantitatively. This helps us not only to explain why “non-exploring” algorithms may perform poorly, but also gives tight guarantees for our randomized algorithm. CHAPTER 6. HIERARCHICAL CLUSTERING WITH STRUCTURAL CONSTRAINTS 95

Experimental Results. We run simple experiments on the Zoo dataset [71] to demonstrate our approach and the performance of our algorithms for a taxonomy application. We consider a setup where there is a ground-truth tree and extra information regarding this tree is provided for the algorithm in the form of triplet constraints. The upshot is that specific variations of our algorithms can exploit this information. In this practical application, our algorithms have around %9 improvements in the objective compared to the naive recursive sparsest cut that does not use this information. See Appendix A.2 for more details on the setup and precise conclusions of our experiments.

Constrained HC work-flow in Practice. Throughout the chapter, we develop di↵erent tools to handle user-defined structural constraints for hierarchical clustering. Here we describe a recipe on how to use our framework in practice. (1) Preprocessing constraints to form triplets. User-defined structural constraints as rooted bi- nary subtrees are convenient for the user and hence for the usability of our algorithm. The following proposition (whose proof is in the Appendix 6.2.1) allows us to focus on studying HC with just triplet constraints.

Proposition 6.1.1 Given constraints as a rooted binary subtree T on k data points (k 3), there is linear time algorithm that returns an equivalent set of at most k triplet constraints.

(2) Detecting feasibility. The next step is to see if the set of triplet constraints is consistent, i.e., whether there exists a HC satisfying all the constraints. For this, we use a simple linear time algorithm known as BUILD [3]. (3) Hard constraints vs. regularization. BUILD can create a hierarchical decomposition that satisfies triplet constraints, but ignores the geometry of the data, whereas our goal here is to consider both simultaneously. Moreover, in the case that the constraints are infeasible, we aim to output a clustering that minimizes the cost of violating constraints combined with the cost of the clustering itself. Feasible instance: To output a feasible HC, we propose using Constrained Recursive Sparsest • Cut (CRSC) or Constrained Recursive Balanced Cut (CRBC): two simple top-down algorithms which are natural adaptations of recursive sparsest cut [75, 43] or recursive balanced cut [27]torespect constraints (Section 6.2.1). Infeasible instance: In this case, we turn our attention to a regularized version of HC, where • the cost of violating constraints is added to the tree cost. We then propose an adaptation of CRSC, namely Hypergraph Recursive Sparsest Cut (HRSC) for the regularized problem (Section 6.2.2).

Real-world application example. In phylogenetics, which is the study of the evolutionary his- tory and relationships among species, an end-user usually has access to whole genomes data of a group of organisms. There are established methods in phylogeny to infer similarity scores between CHAPTER 6. HIERARCHICAL CLUSTERING WITH STRUCTURAL CONSTRAINTS 96

pairs of datapoints, which give the user the similarity weights wij. Often the user also has access to rare structural footprints of a common ancestry tree (e.g. through gene rearrangement data, gene inversions/transpositions etc.). These rare, yet informative, footprints play the role of the structural constraints. The user can follow our pre-processing step to get triplet constraints from the given rare footprints, and then use Aho’s BUILD algorithm to choose between regularized or hard version of the HC problem. The above illustrates how to use our workflow and why using our algorithms facilitates HC when expert domain knowledge is available.

Further related work. Similar to [96], constraints in the form of triplet queries have been used in an (adaptive) active learning framework by [92, 47], showing that approximately O(n log n)triplet queries are enough to learn an underlying HC. Other forms of user interaction in order to improve the quality of the produced clusterings have been used in [13] where they prove that interactive feedback in the form of cluster split/merge requests can lead to significant improvements. Robust algorithms for HC in the presence of noise were studied in [16, 12] and a variety of sucient conditions on the similarity function that would allow linkage-style methods to produce good clusters was explored in [14]. On a di↵erent setting, the notion of triplets has been used as a measure of distance between hierarchical decomposition trees on the same data points [22]. More technically distant analogs of how to use relations among triplets points have recently been proposed in [66] for defining kernel functions corresponding to high-dimensional embeddings.

6.2 Minimizing Dasgupta’s Objective with Triplets

6.2.1 Modified Sparsest or Balanced Cut Analysis

Given an instance of the constrained hierarchical clustering, our proposed CRSC algorithm uses a blackbox ↵n-approximation algorithm for the sparsest cut problem (the best-known approximation factor for this problem is O(plog n)dueto[8]). Moreover, it also maintains the feasibility of the solution in a top-down approach by recursive partitioning of what we call the supergraph G0. Informally speaking, the supergraph is a simple data structure to track the progress of the algorithm and the resolved constraints. More formally, for every constraint ab c we merge the nodes a and b into a supernode a, b while | { } maintaining the edges in G (now connecting to their corresponding supernodes). Note that G0 may have parallel edges, but this can easily be handled by grouping edges together and replacing them with the sum of their weights. We repeatedly continue this merging procedure until there are no more constraints. Observe that any feasible solution needs to start splitting the original graph G by using a cut that is also present in G0. When cutting the graph G0 =(G ,G ), if a constraint ab c 1 2 | is resolved,1 then we can safely unpack the supernode a, b into two nodes again (unless there is { } another constraint ab c0 in which case we should keep the supernode). By continuing and recursively | CHAPTER 6. HIERARCHICAL CLUSTERING WITH STRUCTURAL CONSTRAINTS 97

finding approximate sparsest cuts on the supergraphs $G_1$ and $G_2$, we can find a feasible hierarchical decomposition of $G$ respecting all triplet constraints. Next, we show the approximation guarantees for our algorithm.
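Before stating CRSC formally (Algorithm 10 below), here is a small Python sketch of the supergraph bookkeeping just described; the code is ours and purely illustrative (edges are keyed by frozensets, and rebuilding the supergraph from the currently active constraints after every split implicitly performs the "unpacking" of resolved supernodes).

def supergraph(nodes, weights, constraints):
    """Contract a and b into one supernode for every active constraint ab|c
    and aggregate parallel edge weights; `weights` maps frozenset({u, v}) -> w."""
    parent = {x: x for x in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b, _c in constraints:              # merge the pair that must stay together
        parent[find(a)] = find(b)
    blocks = {}
    for x in nodes:
        blocks.setdefault(find(x), set()).add(x)
    contracted = {}
    for e, w in weights.items():
        u, v = tuple(e)
        ru, rv = find(u), find(v)
        if ru != rv:                          # edges inside a supernode disappear
            key = frozenset({ru, rv})
            contracted[key] = contracted.get(key, 0.0) + w
    return blocks, contracted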

Algorithm 10 CRSC

1: Given $G$ and the triplet constraints $ab|c$, run BUILD to create the supergraph $G'$.
2: Use blackbox access to an $\alpha_n$-approximation oracle for the sparsest cut problem, i.e. $\arg\min_{S\subseteq V} \frac{w_{G'}(S,\bar S)}{|S|\cdot|\bar S|}$.
3: Given the output cut $(S,\bar S)$, separate the graph $G'$ into two pieces $G_1(S, E_1)$ and $G_2(V\setminus S, E_2)$.

4: Recursively compute a HC $T_1$ for $G_1$ using only $G_1$'s active constraints. Similarly compute $T_2$ for $G_2$.
5: Output $T = (T_1, T_2)$.
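For intuition (and for very small supergraphs only), the approximation oracle of step 2 can be replaced by plain enumeration, as in the sketch below; this is our illustrative stand-in, whereas the stated guarantee relies on the $O(\sqrt{\log n})$-approximation of [8].

from itertools import combinations

def sparsest_cut_bruteforce(nodes, weights):
    """Exact sparsest cut by enumeration over all bipartitions; `weights`
    maps frozenset({u, v}) -> w.  Exponential time: a toy oracle only."""
    nodes = list(nodes)
    best, best_cut = float("inf"), None
    for r in range(1, len(nodes) // 2 + 1):
        for S in combinations(nodes, r):
            S = set(S)
            Sbar = set(nodes) - S
            cut = sum(w for e, w in weights.items() if len(e & S) == 1)
            score = cut / (len(S) * len(Sbar))
            if score < best:
                best, best_cut = score, (S, Sbar)
    return best_cut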

Analysis of CRSC Algorithm. The main result of this section is the following theorem:

Theorem 6.2.1 Given a weighted graph $G(V,E,w)$ with $k$ triplet constraints $ab|c$ for $a,b,c\in V$, the CRSC algorithm outputs a HC respecting all triplet constraints and achieves an $O(k\alpha_n)$-approximation for the HC-similarity objective as in (6.1).

Notations and Definitions. We slightly abuse notation by having OPT denote the optimum hierarchical decomposition or its optimum value as measured by (6.1); similarly for CRSC. For $t\in[n]$, $\mathrm{OPT}(t)$ denotes the maximal clusters in OPT of size at most $t$. Note that $\mathrm{OPT}(t)$ induces a partitioning of $V$. We use $\mathrm{OPT}(t)$ to denote the edges cut by $\mathrm{OPT}(t)$ (i.e. edges with endpoints in different clusters of $\mathrm{OPT}(t)$) or their total weight; the meaning will be clear from context. For convenience, we define $\mathrm{OPT}(0) = \sum_{(i,j)\in E} w_{ij}$. For a cluster $A$ created by CRSC, a constraint $ab|c$ is active if $a,b,c\in A$; otherwise $ab|c$ is resolved and can be discarded.

Overview of the Analysis. There are three main ingredients. The first is to view a HC of $n$ datapoints as a collection of partitions, one for each level $t = n-1,\ldots,1$, as in [27]. For a level $t$, the partition consists of maximal clusters of size at most $t$. The total cost incurred by OPT is then a combination of costs incurred at each level of this partition. This is useful for comparing our CRSC cost with OPT. The second idea is in handling constraints, and it is the main obstacle where previous analyses [27, 39] break down: constraints inevitably limit the possible cuts that are feasible at any level, and since the sets of active constraints differ for CRSC and OPT, a direct comparison between them is impossible. If we have no constraints, we can charge the cost of partitioning a cluster $A$ to lower levels of the OPT decomposition. However, when we have triplet constraints, the partition induced by the lower levels of OPT in a cluster $A$ will not be feasible in general (Figure 6.2). The natural way to overcome this obstacle is merging pieces of this partition so as to respect constraints and using higher levels of OPT, but it still may be impossible to compare CRSC with OPT if all pieces

are merged. We overcome this difficulty by an indirect comparison between the CRSC cost and the lower levels $\lfloor r/(6k_A)\rfloor$ of OPT, where $k_A$ is the number of active constraints in $A$. Finally, after a cluster-by-cluster analysis bounding the CRSC cost for each cluster, we exploit the disjointness of clusters of the same level in the CRSC partition, allowing us to combine their costs.
Proof. [Proof of Theorem 6.2.1] We start by borrowing the following facts from [28], modified slightly for the purpose of our analysis (proofs are provided in the supplementary materials).

Fact 6.2.2 (Decomposition of OPT) The total cost paid by OPT can be decomposed into the costs of the different levels in the OPT partition, i.e. $\mathrm{OPT} = \sum_{t=0}^{n} w(\mathrm{OPT}(t))$.

Fact 6.2.3 (OPT at scaled levels) Let $k$ be the number of constraints. Then, $\mathrm{OPT} \ge \frac{1}{6k}\sum_{t=0}^{n} w\big(\mathrm{OPT}(\lfloor t/(6k)\rfloor)\big)$.

Figure 6.2: The main obstacle in the constrained HC problem is that our algorithm has different active constraints compared to OPT. Both constraints $ab|c$ and $de|f$ are resolved by the cut $\mathrm{OPT}(t)$.

Given the above facts, we look at any cluster A of size r produced by the algorithm. Here is the main technical lemma that allows us to bound the cost of CRSC for partitioning A.

Lemma 6.2.1 Suppose CRSC partitions a cluster $A$ ($|A| = r$) into two clusters $(B_1, B_2)$ (w.l.o.g. $|B_1| = s$, $|B_2| = r-s$, $s \le \lfloor r/2\rfloor \le r-s$). Let the size $r \ge 6k_A$ and let $l = 6k_A$, where $k_A$ denotes the number of active constraints for $A$. Then:
$$r\cdot w(B_1,B_2) \le 4\alpha_n\cdot s\cdot w\big(\mathrm{OPT}(\lfloor r/l\rfloor)\cap A\big).$$
Proof. The cost incurred by CRSC for partitioning $A$ is $r\cdot w(B_1,B_2)$. Now consider $\mathrm{OPT}(\lfloor r/l\rfloor)$. This induces a partitioning of $A$ into pieces $\{A_i\}_{i\in[m]}$, where by design $|A_i| = \lambda_i|A|$ with $\lambda_i \le \frac{1}{l}$, $\forall i\in[m]$. Now, consider the cuts $\{(A_i, A\setminus A_i)\}$. Even though all $m$ cuts are allowed for OPT, for CRSC some of them are forbidden: for example, in Figure 6.2, the constraints $ab|c$, $de|f$ render 4 out of the 6 cuts infeasible. But how many of them can become infeasible with $k_A$ active constraints? Since every constraint is involved in at most 2 cuts, we may have at most $2k_A$ infeasible cuts. Let $F\subseteq[m]$

denote the index set of feasible cuts, i.e. if $i\in F$, the cut $(A_i, A\setminus A_i)$ is feasible for CRSC. To cut $A$, we use an $\alpha_n$-approximation of sparsest cut, whose sparsity is upper bounded by that of any feasible cut:

$$\frac{w(B_1,B_2)}{s(r-s)} \le \alpha_n\cdot \mathrm{SP.CUT}(A) \le \alpha_n \min_{i\in F}\frac{w(A_i, A\setminus A_i)}{|A_i|\,|A\setminus A_i|} \le \alpha_n \frac{\sum_{i\in F} w(A_i, A\setminus A_i)}{\sum_{i\in F}|A_i|\,|A\setminus A_i|},$$
where for the last inequality we used the standard fact that $\min_i \frac{\mu_i}{\nu_i} \le \frac{\sum_i \mu_i}{\sum_i \nu_i}$ for $\mu_i \ge 0$ and $\nu_i > 0$. We also have the following series of inequalities:

$$\alpha_n \frac{\sum_{i\in F} w(A_i, A\setminus A_i)}{\sum_{i\in F}|A_i|\,|A\setminus A_i|} \le \alpha_n\frac{2\,w(\mathrm{OPT}(\lfloor r/l\rfloor)\cap A)}{r^2\sum_{i\in F}\lambda_i(1-\lambda_i)} \le 4\alpha_n\frac{w(\mathrm{OPT}(\lfloor r/l\rfloor)\cap A)}{r^2},$$
where the first inequality holds because we double count some (potentially all) edges of $\mathrm{OPT}(\lfloor r/l\rfloor)\cap A$ (these are the edges cut by $\mathrm{OPT}(\lfloor r/l\rfloor)$ that are also present in cluster $A$, i.e. they have both endpoints in $A$) and the second inequality holds because $\lambda_i \le \frac{1}{6k}$ implies $1-\lambda_i \ge 1-\frac{1}{6k}$ and

$$\sum_{i\in F}\lambda_i(1-\lambda_i) \ge \Big(1-\frac{1}{6k}\Big)\Big(\sum_{i=1}^{m}\lambda_i - \sum_{i\in[m]\setminus F}\lambda_i\Big) \ge \Big(1-\frac{1}{6k}\Big)\Big(1-\frac{2k}{6k}\Big) = \frac{6k-1}{6k}\cdot\frac{4k}{6k} \ge 1/2.$$

Finally, we are ready to prove the lemma by combining the above inequalities (using $\frac{r-s}{r}\le 1$):
$$r\cdot w(B_1,B_2) = r\cdot s(r-s)\cdot\frac{w(B_1,B_2)}{s(r-s)} \le r\cdot s(r-s)\cdot 4\alpha_n\frac{w(\mathrm{OPT}(\lfloor r/l\rfloor)\cap A)}{r^2} \le 4\alpha_n\cdot s\cdot w\big(\mathrm{OPT}(\lfloor r/l\rfloor)\cap A\big).$$

This finishes the lemma. It is clear that we exploited the charging to lower levels of OPT, since otherwise, if all pieces in $A$ were merged, the denominator with the $|A_i|$'s would become 0. The next lemma lets us combine the costs incurred by CRSC for different clusters $A$ (the proof is in the supplementary materials).

Lemma 6.2.2 (Combining the costs of clusters in CRSC) The total CRSC cost for partitioning all clusters $A$ into $(B_1,B_2)$ (with $|A| = r_A$, $|B_1| = s_A$) is bounded by:
(1) $\sum_{A:\,|A|\ge 6k} r_A\cdot w(B_1,B_2) \le O(\alpha_n)\cdot \sum_{t=0}^{n} w\big(\mathrm{OPT}(\lfloor t/(6k)\rfloor)\big)$
(2) $\sum_{A:\,|A| < 6k} r_A\cdot w(B_1,B_2) \le 6k\cdot \mathrm{OPT}$
Combining Fact 6.2.3 and Lemma 6.2.2 finishes the proof.

Remark 10 In the supplementary material, we prove how one can use balanced cut, i.e. finding a

cut $S$ such that
$$S \in \arg\min_{S\subseteq V:\,|S|\ge n/3,\,|\bar S|\ge n/3} w_{G'}(S,\bar S) \qquad (6.3)$$
instead of sparsest cut, and that using approximation algorithms for this problem achieves the same approximation factor as in Theorem 6.2.1, but with better running time.

Remark 11 Optimality of the CRSC algorithm: Note that the complexity-theoretic lower bounds for the unconstrained version of HC from [27] also apply to our setting; more specifically, they show that no constant factor approximation exists for HC assuming the Small-Set Expansion Hypothesis.

Theorem 6.2.4 (The divisive algorithm using balanced cut) Given a weighted graph $G(V,E,w)$ with $k$ triplet constraints $ab|c$ for $a,b,c\in V$, the constrained recursive balanced cut algorithm CRBC (same as CRSC, but using balanced cut instead of sparsest cut) outputs a HC respecting all triplet constraints and achieves an $O(k\alpha_n)$-approximation for Dasgupta's HC objective. Moreover, the running time is almost linear.

6.2.2 Soft Triplets and Regularization

Previously, we assumed that the constraints were feasible. However, in many practical applications, users/experts may disagree, hence our algorithm may receive conflicting constraints as input. Here we want to explore how to still output a HC that is good in terms of objective (6.1) (similarity-HC) and also respects the constraints as much as possible. To this end, we propose a regularized version of Dasgupta's objective, where the regularizer measures quantitatively the degree to which constraints get violated. Informally, the idea is to penalize a constraint more if it is violated at top levels of the decomposition than at lower levels. We also allow having different violation weights for different constraints (potentially depending on the expertise of the users providing the constraints). More concretely, inspired by Dasgupta's original objective function, we consider the following optimization problem:

$$\min_{T\in\mathcal{T}}\ \Big(\sum_{(i,j)\in E} w_{ij}\,|T_{ij}| \;+\; \lambda\sum_{ab|c\,\in\,\mathcal{K}} c_{ab|c}\cdot|T_{ab}|\cdot\mathbf{1}\{ab|c \text{ is violated}\}\Big), \qquad (6.4)$$
where $\mathcal{T}$ is the set of all possible binary HC trees for the given data points, $\mathcal{K}$ is the set of the $k$ triplet constraints, $|T_{ab}|$ is the size of the subtree rooted at the least common ancestor of $a, b$, and $c_{ab|c}$ is defined as the base cost of violating triplet constraint $ab|c$. Note that the regularization parameter $\lambda\ge 0$ allows us to interpolate between satisfying the constraints or respecting the geometry of the data.
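To make (6.4) concrete, the following small evaluator (ours, not part of the thesis) computes the regularized cost of a tree given as nested tuples of leaf labels; `weights` maps pairs $(i,j)$ to $w_{ij}$, `constraints` maps $(a,b,c)$, read as $ab|c$, to the base cost $c_{ab|c}$, and `lam` plays the role of the regularization parameter.

def leaves(tree):
    return {tree} if not isinstance(tree, tuple) else set().union(*map(leaves, tree))

def lca_leaves(tree, targets):
    """Leaf set of the subtree rooted at the LCA of `targets`."""
    if isinstance(tree, tuple):
        for child in tree:
            if targets <= leaves(child):
                return lca_leaves(child, targets)
    return leaves(tree)

def regularized_cost(tree, weights, constraints, lam):
    """Objective (6.4): similarity cost plus penalties for violated triplets.
    ab|c is violated when c still lies below LCA(a, b), i.e. when a and b are
    not kept together until after c has been split away."""
    cost = sum(w * len(lca_leaves(tree, {i, j})) for (i, j), w in weights.items())
    for (a, b, c), base in constraints.items():
        t_ab = lca_leaves(tree, {a, b})
        if c in t_ab:
            cost += lam * base * len(t_ab)
    return cost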

Hypergraph Recursive Sparsest Cut. In order to design approximation algorithms for the regularized objective, we draw an interesting connection to a different problem, which we call 3-Hypergraph Hierarchical Clustering (3HHC). An instance of this problem consists of a hypergraph

$G_H = (V, E, E_H)$ with edges $E$ and hyperedges of size 3, $E_H$, together with similarity weights $\{w_{ij}\}_{(i,j)\in E}$ for the edges and similarity weights $\{w_{ij|k}\}_{(i,j,k)\in E_H}$ for the 3-hyperedges. We now think of HC on the hypergraph $G_H$, where for every binary tree $T$ we define the cost to be the natural extension of Dasgupta's objective:

$$\sum_{(i,j)\in E} w_{ij}\,|T_{ij}| \;+\; \sum_{(i,j,k)\in E_H} w^{T}_{ijk}\,|T_{ijk}|, \qquad (6.5)$$

where $w^{T}_{ijk}$ is either equal to $w_{ij|k}$, $w_{jk|i}$ or $w_{ki|j}$, and $T_{ijk}$ is either the subtree rooted at $\mathrm{LCA}(i,j)$, $\mathrm{LCA}(i,k)$ or $\mathrm{LCA}(k,j)$, all depending on how $T$ cuts the 3-hyperedge $\{i,j,k\}$. The goal is to find a hierarchical clustering of this hypergraph, so as to minimize the cost (6.5) of the tree.

Reduction from Regularization to 3HHC. Given an instance of HC with constraints (with their costs of violation) and a parameter $\lambda$, we create a hypergraph $G_H$ so that the total cost of any binary clustering tree in the 3HHC problem (6.5) corresponds to the regularized objective of the same tree as in (6.4). $G_H$ has exactly the same set of vertices, (normal) edges and (normal) edge weights as the original instance of the HC problem. Moreover, for every constraint $ab|c$ (with cost $c_{ab|c}$) it has a hyperedge $\{a,b,c\}$, to which we assign three weights $w_{ab|c} = 0$, $w_{ac|b} = w_{bc|a} = \lambda\cdot c_{ab|c}$. Therefore, we ensure that any divisive algorithm for the 3HHC problem avoids the cost $\lambda\cdot c_{ab|c}\cdot|T_{abc}|$ only if it chops $\{a,b,c\}$ into $\{a,b\}$ and $\{c\}$ at some level, which matches the regularized objective.

Reduction from 3HHC to Hypergraph Sparsest Cut. A natural generalization of the sparsest cut problem for our hypergraphs, which we call Hyper Sparsest Cut (HSC), is the following problem:

$$\arg\min_{S\subseteq V}\ \frac{w(S,\bar S) + \sum_{(i,j,k)\in E_H} w^{S}_{ijk}}{|S|\,|\bar S|},$$

where $w(S,\bar S)$ is the weight of the cut $(S,\bar S)$ and $w^{S}_{ijk}$ is either equal to $w_{ij|k}$, $w_{jk|i}$ or $w_{ki|j}$, depending on how $(S,\bar S)$ chops the hyperedge $\{i,j,k\}$. Now, similar to [27, 43], we can recursively run a blackbox approximation algorithm for HSC to solve 3HHC. The main result of this section is the following technical proposition, whose proof is analogous to that of Theorem 6.2.1 (provided in the supplementary materials).

Proposition 6.2.5 Given the hypergraph $G_H$ with $k$ hyperedges, and given access to an algorithm which is an $\alpha_n$-approximation for HSC, the Recursive Hypergraph Sparsest Cut (R-HSC) algorithm achieves an $O(k\alpha_n)$-approximation.

Reduction from HSC back to Sparsest Cut. We now show how to get an $\alpha_n$-approximation oracle for our instance of the HSC problem by a general reduction to sparsest cut. Our reduction is simple: given a hypergraph $G_H$ and all the weights, create an instance of sparsest cut with the same vertices, (normal) edges and (normal) edge weights. Moreover, for every 3-hyperedge $\{a,b,c\}$ consider adding a triangle to the graph, i.e. three weighted edges connecting $a, b, c$, where:

$$w'_{ab} = \frac{w_{bc|a}+w_{ac|b}-w_{ab|c}}{2} = \lambda\cdot c_{ab|c},\qquad
w'_{ac} = \frac{w_{bc|a}+w_{ab|c}-w_{ac|b}}{2} = 0,\qquad
w'_{bc} = \frac{w_{ac|b}+w_{ab|c}-w_{bc|a}}{2} = 0.$$

This construction can be seen in Figure 6.3. The important observation is that $w'_{ab}+w'_{ac} = w_{bc|a}$, $w'_{ab}+w'_{bc} = w_{ac|b}$ and $w'_{bc}+w'_{ac} = w_{ab|c}$, which are exactly the weights associated with the corresponding splits of the 3-hyperedge $\{a,b,c\}$. So, correctness of the reduction follows, as the weight of each cut is preserved between the hypergraph and the graph after adding the triangles. For a discussion on extending this gadget more generally, see the supplement.
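In code, the gadget is a one-liner per hyperedge; the sketch below (ours, purely illustrative) computes the triangle weights and notes, in a comment, the special case produced by the regularization reduction.

def triangle_weights(w_bc_a, w_ac_b, w_ab_c):
    """Replace the 3-hyperedge {a, b, c} by a triangle whose cut weights match:
    cutting off a single vertex x then costs exactly w_{yz|x} in both graphs."""
    w_ab = (w_bc_a + w_ac_b - w_ab_c) / 2
    w_ac = (w_bc_a + w_ab_c - w_ac_b) / 2
    w_bc = (w_ac_b + w_ab_c - w_bc_a) / 2
    return w_ab, w_ac, w_bc

# For a regularized constraint ab|c we have w_ab_c = 0 and w_ac_b = w_bc_a = lam * c_ab_c,
# so the gadget degenerates to a single heavy edge (a, b) of weight lam * c_ab_c.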

Remark 12 Reduction to hypergraphs: we would like to emphasize the necessity of the hypergraph version in order for the reduction to work. One might think that just adding extra heavy edges would be sufficient, but there is a technical difficulty with this approach. Consider a triplet constraint $ab|c$; once $c$ is separated from $a$ and $b$ at some level, there is no extra tendency anymore to keep $a$ and $b$ together (i.e. only the similarity weight should play a role after this point). This behavior cannot be captured by only adding heavy-weight edges. Instead, one needs to add a heavy edge between $a$ and $b$ that disappears once $c$ is separated, and this is exactly why we need the hyperedge gadget. One can replace the reduction with a one-shot proof, but we believe it would be less modular and less transparent.

Figure 6.3: Transforming a 3-hyperedge to a triangle.

6.2.3 Dissimilarity HC and Constraint Dependencies

In this section we study dissimilarity-HC, and we look into the problem of designing approximation algorithms for both unconstrained and constrained hierarchical clustering. In [38], they show that average linkage is a $\frac{1}{2}$-approximation for this problem and they propose a top-down approach based

on locally densest cut achieving a $(\frac{2}{3}-\epsilon)$-approximation in time $\tilde{O}\big(\frac{n^2(n+m)}{\epsilon}\big)$. Notably, when $\epsilon$ gets small the running time blows up.
Here, we prove that the most natural randomized algorithm for this problem, i.e. recursive random cutting, is a $\frac{2}{3}$-approximation with expected running time $O(n\log n)$. We further derandomize this algorithm to get a simple deterministic local-search style $\frac{2}{3}$-approximation algorithm.
If we also have structural constraints for the dissimilarity-HC, we show that the existing approaches fail. In fact, we show that they lead to an $\Omega(n)$-approximation factor due to the lack of "exploration" (e.g. recursive densest cut). We then show that recursive random cutting is robust to adding user constraints, and indeed it preserves a constant approximation factor when there are, roughly speaking, constantly many user constraints.

Randomized $\frac{2}{3}$-approximation. Consider the most natural randomized algorithm for hierarchical clustering, i.e. recursively partition each cluster into two, where each point in the current cluster independently flips an unbiased coin and, based on the outcome, it is put in one of the two parts.

Theorem 6.2.6 Recursive-Random-Cutting is a $\frac{2}{3}$-approximation for maximizing the dissimilarity-HC objective.

Proof. [Proof sketch.] An alternative view of Dasgupta's objective is to divide the reward of the clustering tree between all possible triples $\{i,j,k\}$, where $(i,j)\in E$ and $k$ is another point (possibly equal to $i$ or $j$). Now, in any hierarchical clustering tree, if at the moment right before $i$ and $j$ become separated the vertex $k$ is still in the same cluster as $\{i,j\}$, then this triple contributes $w_{ij}$ to the objective function. We claim this event happens with probability exactly $\frac{2}{3}$. To see this, consider an infinite independent sequence of coin flips for $i$, $j$, and $k$. Without loss of generality, condition on $i$'s sequence being all heads. The aforementioned event happens only if $j$'s first tails in its sequence happens no later than $k$'s first tails in its sequence. This happens with probability $\sum_{i\ge 1}\frac{1}{2}\big(\frac{1}{4}\big)^{i-1} = \frac{2}{3}$. Therefore, the algorithm gets total reward $\frac{2n}{3}\sum_{(i,j)\in E} w_{ij}$ in expectation. Moreover, the total reward of any hierarchical clustering is upper-bounded by $n\sum_{(i,j)\in E} w_{ij}$, which completes the proof of the $\frac{2}{3}$-approximation.
Remark 13 This algorithm runs in time $O(n\log n)$ in expectation, due to the fact that the binary clustering tree has expected depth $O(\log n)$ (see for example [40]) and at each level we only perform $n$ operations.
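The randomized algorithm itself is a few lines; the sketch below is ours (nested-tuple trees, a dictionary of edge weights keyed by pairs), and it simply re-flips trivial splits, which matches the infinite-coin-sequence view used in the proof.

import random

def recursive_random_cut(points):
    """Recursive-Random-Cutting: split by independent fair coins and recurse."""
    if len(points) == 1:
        return points[0]
    while True:
        left = [p for p in points if random.random() < 0.5]
        right = [p for p in points if p not in left]
        if left and right:
            break
    return (recursive_random_cut(left), recursive_random_cut(right))

def dissimilarity_value(tree, weights):
    """Dissimilarity-HC value: each edge (i, j) earns w_ij * |leaves(T_ij)|."""
    def leaves(t):
        return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])
    def lca_size(t, i, j):
        while isinstance(t, tuple):
            inside = [c for c in t if {i, j} <= leaves(c)]
            if not inside:
                break
            t = inside[0]
        return len(leaves(t))
    return sum(w * lca_size(tree, i, j) for (i, j), w in weights.items())

Averaging dissimilarity_value over many runs of recursive_random_cut should concentrate around at least two thirds of the trivial upper bound $n\sum_{(i,j)\in E} w_{ij}$.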

We now derandomize the recursive random cutting algorithm using the method of conditional expectations. At every recursion, we go over the points in the current cluster one by one, and decide whether to put them in the "left" partition or the "right" partition for the next recursion. Once we make a decision for a point, we fix that point and go to the next one. Roughly speaking, these local improvements can be done in polynomial time, which results in a simple local-search style deterministic algorithm.

Theorem 6.2.7 There is a deterministic local-search style $\frac{2}{3}$-approximation algorithm for maximizing the dissimilarity-HC objective that runs in time $O(n^2(n+m))$.

Maximizing the Objective with User Constraints. From a practical point of view, one can think of many settings in which the output of the hierarchical clustering algorithm should satisfy user-defined hard constraints. Now, combining the new perspective of maximizing Dasgupta's objective with this practical consideration raises a natural question: which algorithms are robust to adding user constraints, in the sense that a simple variation of these algorithms still achieves a decent approximation factor?

• Failure of "Non-exploring" Approaches. Surprisingly enough, there are convincing reasons why adapting existing algorithms for maximizing Dasgupta's objective (e.g. those proposed in [39]) to handle user constraints is either challenging or hopeless. First, bottom-up algorithms, e.g. average-linkage, fail to output a feasible outcome if they only consider each constraint separately and not all the constraints jointly (as we saw in Figure 6.1). Second, maybe more surprisingly, the natural extension of the (locally) Recursive-Densest-Cut algorithm proposed in [39] to handle user constraints performs poorly in the worst case, even when we have only one constraint. Recursive-Densest-Cut proceeds by repeatedly picking the cut that has maximum density, i.e. $\arg\max_{S\subseteq V}\frac{w(S,\bar S)}{|S|\cdot|\bar S|}$, and making two clusters. To handle the user constraints, we run it recursively on the supergraph generated by the constraints, similar to the approach in Section 6.2.1. Note that once the algorithm resolves a triplet constraint, it also breaks its corresponding supernode.
Now consider the following example in Figure 6.4, in which there is just one triplet constraint $ab|c$. The weight $W$ should be thought of as large and $\epsilon$ as small. By choosing appropriate weights on the edges of the clique $K_n$, we can fool the algorithm into cutting the dense parts in the clique, without ever resolving the $ab|c$ constraint until it is too late. The algorithm gets a gain of $O(n^3 + W)$ whereas OPT gets $\Omega(nW)$ by starting with the removal of the edge $(b,c)$ and then removing $(a,b)$, thus enjoying a gain of $\approx nW$.

• Constrained Recursive Random Cutting. The example in Figure 6.4, although a bit pathological, suggests that a meaningful algorithm for this problem should explore cutting low-weight edges that might lead to resolving constraints, maybe randomly, with the hope of unlocking rewarding edges that were hidden before this exploration. Formally, our approach is to show that the natural extension of recursive random cutting to the constrained problem, i.e. running it on the supergraph generated by the constraints and unpacking supernodes as we resolve the constraints (in a similar fashion to CRSC), achieves a constant factor approximation when the constraints have bounded dependency. In the remainder of this section, we define an appropriate notion of dependency between the constraints, under the name of

Figure 6.4: $\Omega(n)$-approximation lower bound instance for the constrained Recursive-Densest-Cut algorithm.

Figure 6.5: Description of a class $\mathcal{C}$ with base $\{x,y\}$.
dependency measure and analyze the approximation factor of constrained recursive random cutting (Constrained-RRC) based on this notion. Suppose we are given an instance of hierarchical clustering with triplet constraints $\{c_1,\ldots,c_k\}$, where $c_i = x_i|y_iz_i$, $\forall i\in[k]$. For any triplet constraint $c_i$, let us call the pair $\{y_i,z_i\}$ the base, and $x_i$ the key of the constraint. We first partition our constraints into equivalence classes $\mathcal{C}_1,\ldots,\mathcal{C}_N$, where $\mathcal{C}_i\subseteq\{c_1,\ldots,c_k\}$. For every $i,j$, the constraints $c_i$ and $c_j$ belong to the same class $\mathcal{C}$ if they share the same base (see Figure 6.5).

Definition 6.2.1 (Dependency digraph) The dependency digraph is a directed graph with vertex set $\{\mathcal{C}_1,\ldots,\mathcal{C}_N\}$. For every $i,j$, there is a directed edge $\mathcal{C}_i\to\mathcal{C}_j$ if $\exists\, c = x|yz$, $c' = x'|y'z'$, such that $c\in\mathcal{C}_i$, $c'\in\mathcal{C}_j$, and either $\{x,z\} = \{y',z'\}$ or $\{x,y\} = \{y',z'\}$ (see Figure 6.6).

Figure 6.6: Classes $\{\mathcal{C}_i,\mathcal{C}_j\}$ and two situations for having $\mathcal{C}_i\to\mathcal{C}_j$.
The dependency digraph captures how groups of constraints impact each other. Formally, the existence of the edge $\mathcal{C}_i\to\mathcal{C}_j$ implies that all the constraints in $\mathcal{C}_j$ should be resolved before one can separate the two endpoints of the (common) base edge of the constraints in $\mathcal{C}_i$.
Remark 14 If the constraints $\{c_1,\ldots,c_k\}$ are feasible, i.e. there exists a hierarchical clustering that can respect all the constraints, then the dependency digraph is clearly acyclic.

Definition 6.2.2 (Layered dependency subgraph) Given any class $\mathcal{C}$, the layered dependency subgraph of $\mathcal{C}$ is the subgraph of the dependency digraph induced by all the classes that are reachable from $\mathcal{C}$. Moreover, the vertex set of this subgraph can be partitioned into layers $\{\mathcal{I}_0,\mathcal{I}_1,\ldots,\mathcal{I}_L\}$, where $L$ is the maximum length of any directed path leaving $\mathcal{C}$ and $\mathcal{I}_l$ is the subset of classes for which the length of the longest path from $\mathcal{C}$ to each of them is exactly equal to $l$ (see Figure 6.7).
We are now ready to define a crisp quantity for every dependency graph. This will later help us give a more meaningful and refined beyond-worst-case guarantee for the approximation factor of the Constrained-RRC algorithm.

Definition 6.2.3 (Dependency measure) Given any class $\mathcal{C}$, the dependency measure of $\mathcal{C}$ is defined as
$$\mathrm{DM}(\mathcal{C}) \triangleq \prod_{l=0}^{L}\Big(1 + \sum_{\mathcal{C}'\in\mathcal{I}_l}|\mathcal{C}'|\Big),$$

Figure 6.7: Layered dependency subgraph of class $\mathcal{C}$.
where $\mathcal{I}_0,\ldots,\mathcal{I}_L$ are the layers of the dependency subgraph of $\mathcal{C}$, as in Definition 6.2.2. Moreover, the dependency measure of a set of constraints, $\mathrm{DMC}(\{c_1,\ldots,c_k\})$, is defined as $\max_{\mathcal{C}}\mathrm{DM}(\mathcal{C})$, where the maximum is taken over all the classes generated by $\{c_1,\ldots,c_k\}$.
Intuitively speaking, the notion of the dependency measure quantitatively expresses how "deeply" the base of a constraint is protected by the other constraints, i.e. how many constraints need to be resolved before the base of a particular constraint is unpacked and the Constrained-RRC algorithm can enjoy its weight. This intuition is formalized through the following theorem, whose proof is deferred to the supplementary materials.
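The definitions above are easy to compute; the sketch below is ours and should be read with caveats: constraints are encoded as tuples $(x,y,z)$ read as $x|yz$ with base $\{y,z\}$ and key $x$, layer 0 is taken to contain the class itself, and an acyclic dependency digraph is assumed (Remark 14 guarantees this for feasible constraints).

from collections import defaultdict
from functools import lru_cache

def dependency_classes(constraints):
    """Group constraints sharing a base into classes and add an edge Ci -> Cj
    whenever the base of Cj equals {key of some c in Ci, one base endpoint of c}."""
    classes = defaultdict(list)
    for x, y, z in constraints:
        classes[frozenset({y, z})].append((x, y, z))
    edges = set()
    for bi, members in classes.items():
        for x, y, z in members:
            for bj in (frozenset({x, y}), frozenset({x, z})):
                if bj in classes and bj != bi:
                    edges.add((bi, bj))
    return classes, edges

def dependency_measure(classes, edges, c0):
    """DM(c0): layer the classes reachable from c0 by longest-path distance
    and multiply (1 + number of constraints in each layer)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    reachable, stack = set(), [c0]
    while stack:
        u = stack.pop()
        if u not in reachable:
            reachable.add(u)
            stack.extend(succ[u])

    @lru_cache(maxsize=None)
    def layer(v):                              # longest-path distance from c0
        return 0 if v == c0 else 1 + max(layer(u) for u in pred[v] & reachable)

    per_layer = defaultdict(int)
    for v in reachable:
        per_layer[layer(v)] += len(classes[v])
    dm = 1
    for total in per_layer.values():
        dm *= 1 + total
    return dm

# DMC({c_1, ..., c_k}) is then max(dependency_measure(classes, edges, C) for C in classes).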

Theorem 6.2.8 The constrained recursive random cutting (Constrained-RRC) algorithm is an $\alpha$-approximation algorithm for maximizing the dissimilarity-HC objective given a set of feasible constraints $\{c_1,\ldots,c_k\}$, where
$$\alpha = \frac{2(1-k/n)}{3\cdot\mathrm{DMC}(\{c_1,\ldots,c_k\})} = \frac{2(1-k/n)}{3\cdot\max_{\mathcal{C}}\mathrm{DM}(\mathcal{C})}.$$
Corollary 6.2.9 Constrained-RRC is an $O(1)$-approximation for maximizing the dissimilarity-HC objective, given feasible constraints of constant dependency measure.

6.3 Old Biology Problems: Triplets/Quartets Consistency

Here we base our presentation on our manuscript [34]. We will only give some hardness proofs for tree ordering CSPs here; if you are interested in how one can get improvements under a uniform sampling model for sampling constraints, please see the final version of [34]. The high-level picture of our work can be summarized as follows: we consider standard data analysis tasks, where the goal is to construct a ranking, clustering or phylogenetic tree by aggregating, perhaps inconsistent, information on overlapping sets of elements. Such information includes relative ordering constraints for ranking (e.g., pairwise comparisons, "betweenness" etc.) and "must-link/cannot-link" constraints for clustering, whereas for hierarchical clustering, the analogous constraints are ("desired"/"forbidden") triplets or quartets, all specifying ordering relations to be included or avoided in the final output. In many instances, even checking consistency is intractable, so the goal becomes to maximize the extent of agreement with the provided information. In that work, we first address an open question in computational biology raised in several previous works, by presenting near-optimal hardness of approximation for several ordering problems on trees (triplets/quartets consistency). One consequence is optimal hardness for the forbidden triplets problem and, to the best of our knowledge, this is the first tight hardness of approximation result for a tree ordering problem. Then, we show how simple algorithms based on variations of Max Cut with negative weights obtain improved approximations for all considered problems under a query model where noisy constraints are sampled uniformly at random from a ground-truth solution. Recall some definitions about betweenness and non-betweenness from Subsection 1.4.1.

6.3.1 Hardness for Rooted Triplets Consistency

We prove that under the Unique Games Conjecture, it is hard to approximate the Desired Triplets Consistency problem better than a factor of $\frac{2}{3}$, even in the unweighted case. Notice that the current best approximation is $\frac{1}{3}$, achieved by a random tree (or a simple greedy algorithm). In fact our result is slightly stronger: it is hard to distinguish between two instances one of which is almost perfect (e.g., 99% of constraints are consistent) and the other is far from perfect (e.g., 67% of constraints are consistent). We base our hardness result on the following theorem by Guruswami et al. [58, 11] about the Non-Betweenness problem and its $\frac{2}{3}$-inapproximability:
Fact 6.3.1 Let $K$ be the total number of triplet constraints in an instance of Non-Betweenness. For any $\epsilon>0$, it is NP-hard to distinguish between Non-Betweenness instances of the following two cases:

YES: $\mathrm{val}(\pi^*) \ge (1-\epsilon)K$, i.e. the optimal permutation satisfies almost all constraints.
NO: $\mathrm{val}(\pi^*) \le (\frac{2}{3}+\epsilon)K$, i.e. the optimal permutation does not satisfy more than a $2/3$ fraction of the constraints.

Given the above fact, we prove our $\frac{2}{3}$-inapproximability result for Triplets Consistency:

Theorem 6.3.2 Let $K$ be the total number of triplet constraints in an instance of Desired Triplets Consistency. For any $\delta>0$, it is NP-hard to distinguish between instances of the following two cases:
YES: $\mathrm{val}(T^*) \ge (\frac{1}{2}-\delta)K$, i.e. the optimal tree satisfies almost half of all the triplet constraints.
NO: $\mathrm{val}(T^*) \le (\frac{1}{3}+\delta)K$, i.e. the optimal tree does not satisfy more than a $\frac{1}{3}$ fraction of the triplet constraints.

Then, our $\frac{2}{3}$-inapproximability result follows directly from the gap between these instances: $\frac{1/3}{1/2} = \frac{2}{3}$.
Proof. Start with a YES instance of the non-Betweenness problem with optimal permutation

$\pi^*$ and $\mathrm{val}(\pi^*) \ge (1-\epsilon)K$. Viewing each non-Betweenness constraint as a desired triplet, we show how to construct a tree $T$ such that $\mathrm{val}(T) \ge (\frac{1}{2}-\Theta(\epsilon))K$. In fact, the construction is straightforward: simply assign the $n$ labels, either in the order they appear in $\pi^*$ or reversed, as the leaves of a caterpillar tree (every internal node has at least one child that is a leaf). Observe that the better of these two trees satisfies:
$$\mathrm{val}(T) \ge (1-\epsilon)K/2.$$

This is because if a non-Betweenness constraint $ab|c$ was obeyed by $\pi^*$, it will also be obeyed by one of the two caterpillar trees above: if $c$ appears first in the permutation then the former caterpillar will obey $ab|c$ as $c$ gets separated first; otherwise, if $c$ appears last, then the reversed caterpillar tree will obey $ab|c$. Here the $\frac{1}{2}$ factor is tight, since for example the two non-Betweenness constraints $ab|c$ and $bc|a$ are both satisfied by the ordering $abc$, but when viewed as desired triplets, they cannot both be satisfied by a tree.
The NO instance is slightly more challenging. Start with a NO instance of the non-Betweenness problem with optimal $\pi^*$ of value $\mathrm{val}(\pi^*) \le (\frac{2}{3}+\epsilon)K$. Viewing the non-Betweenness constraints as desired triplets, we show that the optimum tree $T^*$ cannot achieve a value better than $(1/3+2\epsilon)K$, because this would imply that $\mathrm{val}(\pi^*) > (\frac{2}{3}+\epsilon)K$, which is a contradiction. For this, assume that some tree $T$ scored a value $\mathrm{val}(T) > (1/3+2\epsilon)K$. We will construct a permutation $\pi$ from the tree $T$ with value $\mathrm{val}(\pi) > (2/3+\epsilon)K$. Observe that directly projecting the leaves of $T$ onto a line (just outputting the $n$ leaves from left to right as they appear in the tree) would already satisfy $> (1/3+2\epsilon)K$ constraints, since every desired triplet $ab|c$ obeyed by the tree will also be obeyed (as a non-Betweenness constraint) by $\pi$, as $c$ will either be first or last among the three labels $a,b,c$. Moreover, there are potentially desired triplet constraints that are disobeyed by the tree $T$, yet obeyed by the permutation. We know that the number of remaining constraints is $K - (1/3+2\epsilon)K = (2/3-2\epsilon)K$. Randomly swapping each left and right child in the tree $T$ before we do the projection to the permutation $\pi$ will actually lead to an excess of $\frac{1}{2}\cdot(2/3-2\epsilon)K = (1/3-\epsilon)K$ satisfied non-Betweenness constraints. To see this, notice that for every triplet that is disobeyed in the tree, there is a $\frac{1}{2}$ probability that it becomes obeyed in the permutation. Summing up, we get $\mathrm{val}(\pi) > (1/3+2\epsilon)K + (1/3-\epsilon)K > (2/3+\epsilon)K \Rightarrow \mathrm{val}(\pi^*) \ge \mathrm{val}(\pi) > (2/3+\epsilon)K$, a

contradiction.

Chapter 7

6.3.2 Hardness for Forbidden Triplets: Random is Optimal

We prove that under the Unique Games Conjecture, it is hard to approximate the Forbidden Triplets Consistency problem better than a factor of $\frac{2}{3}$, even in the unweighted case. Notice that the current best approximation is in fact $\frac{2}{3}$, achieved by a random tree (or a simple greedy algorithm); hence we settle the computational complexity of the problem. Our result is slightly stronger: it is hard to distinguish between two instances one of which is almost perfect (e.g., 99% of constraints are consistent) and the other is far from perfect (e.g., 67% of constraints are consistent). We base our hardness result on the following theorem by Guruswami et al. [58, 11] about the Betweenness problem and its $\frac{1}{3}$-inapproximability:
Fact 6.3.3 Let $K$ be the total number of triplet constraints in an instance of Betweenness. For any $\epsilon>0$, it is UGC-hard to distinguish between Betweenness instances of the following two cases:

YES: $\mathrm{val}(\pi^*) \ge (1-\epsilon)K$, i.e. the optimal permutation satisfies almost all constraints.
NO: $\mathrm{val}(\pi^*) \le (\frac{1}{3}+\epsilon)K$, i.e. the optimal permutation does not satisfy more than a $1/3$ fraction of the constraints.

Given the above fact, we prove our $\frac{2}{3}$-inapproximability result for Forbidden Triplets Consistency:
Theorem 6.3.4 Let $K$ be the total number of triplet constraints in an instance of Forbidden Triplets Consistency. For any $\delta>0$, it is UGC-hard to distinguish between instances of the following two cases:

YES: $\mathrm{val}(T^*) \ge (1-\delta)K$, i.e. the optimal tree satisfies almost all of the triplet constraints.
NO: $\mathrm{val}(T^*) \le (\frac{2}{3}+\delta)K$, i.e. the optimal tree does not satisfy more than a $\frac{2}{3}$ fraction of the triplet constraints.

Then, our $\frac{2}{3}$-inapproximability result follows directly from the gap between these instances: $\frac{2/3}{1} = \frac{2}{3}$.
Proof. Start with a YES instance of the Betweenness problem with optimal permutation $\pi^*$ and $\mathrm{val}(\pi^*) \ge (1-\epsilon)K$. Viewing each Betweenness constraint $a|b|c$ as a forbidden triplet $ac|b$, we show how to construct a tree $T$ such that $\mathrm{val}(T) \ge (1-\Theta(\epsilon))K$. In fact, the construction is straightforward: simply assign the $n$ labels, in the order they appear in $\pi^*$, as the leaves of a caterpillar tree (every internal node has its left child being a leaf). Observe that this caterpillar tree satisfies:

$$\mathrm{val}(T) \ge (1-\epsilon)K.$$

This is because if a Betweenness constraint $a|b|c$ was obeyed by $\pi^*$, it will also be avoided (viewed as a forbidden triplet $ac|b$) by the caterpillar tree above: if $a$ appears first in the permutation then the caterpillar will avoid $ac|b$ as $a$ gets separated first; otherwise, if $c$ appears first, then again the caterpillar tree will avoid $ac|b$ as $c$ gets separated first.

The NO instance is slightly more challenging. Start with a NO instance of the Betweenness problem with optimal $\pi^*$ of value $\mathrm{val}(\pi^*) \le (\frac{1}{3}+\epsilon)K$. Viewing the Betweenness constraints as forbidden triplets, we show that the optimum tree $T^*$ cannot achieve a value better than $(2/3+2\epsilon)K$, because this would imply that $\mathrm{val}(\pi^*) > (\frac{1}{3}+\epsilon)K$, which is a contradiction. For this, assume that some tree $T$ scored a value $\mathrm{val}(T) > (2/3+2\epsilon)K$. We will construct a permutation $\pi$ from the tree $T$ with value $\mathrm{val}(\pi) > (1/3+\epsilon)K$, a contradiction. Notice that there are forbidden triplets that may be avoided by the tree, yet obeyed by the permutation: for example, for a forbidden triplet $t = ac|b$, the tree $R$ that first removes $a$ and then splits $b,c$ will successfully avoid $t$; however, the permutation $acb$ can come from $R$ by projection, and $acb$ does not obey the Betweenness constraint $a|b|c$. Hence directly projecting the leaves of $T$ onto a line may not satisfy $> (1/3+2\epsilon)K$ constraints, since every forbidden triplet $ac|b$ avoided by $T$ can be ordered by this projected permutation in a way that does not obey the corresponding Betweenness constraint $a|b|c$. However, just by randomly swapping each left and right child for every internal node in the tree before we do the projection to the permutation, we would satisfy more than $\frac{1}{2}\cdot(2/3+2\epsilon)K = (1/3+\epsilon)K$ constraints in expectation. To see this, note that with probability $\frac{1}{2}$ a forbidden triplet $ac|b$ avoided by $T$ will be mapped to the ordering $abc$ (and not $acb$) or $cba$ (and not $cab$).

Finally, we get $\mathrm{val}(\pi) > (1/3+\epsilon)K \Rightarrow \mathrm{val}(\pi^*) \ge \mathrm{val}(\pi) > (1/3+\epsilon)K$, contradicting that we were given a NO instance.

Chapter 7

Conclusion & Open Questions

Prediction is very difficult, especially about the future.

-Niels Bohr

In this chapter, we would like to conclude this thesis with a summary of our contributions and also state some open questions and conjectures regarding Hierarchical Clustering.

7.1 Conclusion

Hierarchical Clustering is a tool used across different scientific areas. It originated in Biology and then passed on as a powerful method for computational tasks ubiquitous in computer science. However, no objective had been formulated to enable a theoretical understanding of it. This thesis builds upon a recent "global" objective function associated with hierarchical clustering that was proposed by Dasgupta (2016) [43]. We develop theory for it; building on top of standard approximation algorithms, we unveil some interesting connections with old and new algorithms and make progress on older phylogenetic questions. Given that progress in standard clustering has been based on a variety of objectives, starting from the 1950s, that led to a comprehensive theory on clustering, we hope our work will help develop the much needed analogous theory for hierarchical clustering.

7.2 List of Open Problems and Conjectures

We separate the questions based on the relevant chapters in this thesis:


From Chapter 4:

(1) Can one obtain an approximation preserving reduction from Sparsest Cut to Dasgupta’s cost? So far, we have seen the other way, how to use a black-box for Sparsest Cut in order to obtain a tree.

(2) Can we achieve a better approximation for the Cohen-Addad et al. objective?

(3) Can we achieve a better approximation for the Moseley-Wang objective? Recent work by Alon et al. [5], appearing during the writing of this thesis, improved our analysis of Max Uncut Bisection to get a 0.585-approximation. Can we do even better? Can we prove more than the APX-hardness that we provided? [2].

(4) Can we substantially accelerate HC algorithms to run on huge graphs by using random walks and spectral guarantees, instead of sparsest cut explicitly? This would be really interesting.

(5) Another direction for research is going beyond worst-case analysis for this problem. What can we say about exact recovery on $\gamma$-stable instances under the Bilu-Linial [21] notion of stability? Roughly speaking, a stable instance under this notion has the property that the structure of the optimum solution does not change even if weights in the instance are changed by a factor of $\gamma$. For example, in [74] it is shown that the standard SDP relaxation for Max-Cut is integral if the instance is sufficiently stable ($\gamma \ge c\sqrt{\log n}\,\log n$ for some absolute constant $c>0$). Stability for clustering and other problems has been extensively studied; see [15, 21, 74, 36, 7] and references therein. It would be interesting to study stable instances for hierarchical clustering, in particular the performance of the recursive Sparsest Cut algorithm (or our SDP-HC) on such instances. This would not only explain the success of certain heuristics for HC based on finding sparsest cuts, but also justify their use in practice (assuming that stability is a good model for instances in real applications).

From Chapter 5:

(1) Can one get an improvement over $\frac{1}{3}$ for the problem of maximizing $\mathcal{F}^{+}(\mathcal{T})$ as a function of $d$ for small $d$ (with fast algorithms)?

(2) Can projection on low-dimensional subspaces be used to improve the approximation ratio for the high-dimensional case even further?

(3) Does Average-Linkage achieve a 3/4-approximation for 1-dimensional data?

From Chapter 6, we believe the following conjecture is true for tree ordering CSPs similar to the case for permutations shown in [58]:

Conjecture 1 For any tree ordering CSP, no polynomial time algorithm can achieve a constant factor approximation better than random.

We partially proved the conjecture for the specific problem of forbidden triplets, as random achieves a $\frac{2}{3}$-approximation and we showed that one cannot beat random assuming the Unique Games Conjecture. For maximizing rooted triplets consistency, we were only able to show a $\frac{2}{3}$ hardness result, but we believe the right answer is $\frac{1}{3}$, i.e., the same as the performance of a random tree.

Appendix A

Omitted Proofs, Discussions and Experiments

The ability to play chess is the sign of a gentleman. The ability to play chess well is the sign of a wasted life.

-Paul Morphy

A.1 Omitted Proofs from Chapters.

A.1.1 Missing proofs and discussion in Section 6.2.1

Proof. [Proof of Proposition 6.1.1] For nodes $u, v\in T$, let $P(u)$ denote the parent of $u$ in the tree and $\mathrm{LCA}(u,v)$ denote the lowest common ancestor of $u, v$. For a leaf node $l_i$, $i\in[k]$, we say that its label is $l_i$, whereas for an internal node of $T$, we say that its label is the label of any of its two children. As long as there are any two nodes $a, b$ that are siblings (i.e. $P(a)\equiv P(b)$), we create a constraint $ab|c$ where $c$ is the label of the second child of $P(P(a))$. We delete leaves $a, b$ from the tree and repeat until there are fewer than 3 leaves left. To see why the above procedure will only create at most $k$ constraints, notice that every time a new constraint is created, we delete two nodes of the given tree $T$. Since $T$ has $k$ leaves and is binary, it can have at most $2k-1$ nodes in total. It follows that we create at most $\lfloor\frac{2k-1}{2}\rfloor < k$ constraints. Moreover, every created constraint $ab|c$ is respected by $T$, since $\mathrm{LCA}(a,b) = P(a)$ is a proper descendant of $P(P(a)) =$


$\mathrm{LCA}(a,c)$, and hence $a, b$ become siblings before $a, c$ or $b, c$.
Proof. [Proof of Fact 6.2.2 from [27]] We will measure the contribution of an edge $e = (u,v)\in E$ to the RHS and to the LHS. Suppose that $r$ denotes the size of the minimal cluster in OPT that contains both $u$ and $v$. Then the contribution of the edge $e = (u,v)$ to the LHS is by definition $r\cdot w_e$. On the other hand, $(u,v)\in\mathrm{OPT}(t)$, $\forall t\in\{0,\ldots,r-1\}$. Hence the contribution to the RHS is also $r\cdot w_e$.
Proof. [Proof of Fact 6.2.3 from [27]] We rewrite OPT using the fact that

$w(\mathrm{OPT}(t)) \ge 0$ at every level $t\in[n]$:
$$6k\cdot\mathrm{OPT} = 6k\sum_{t=0}^{n} w(\mathrm{OPT}(t)) = 6k\big(w(\mathrm{OPT}(0)) + \cdots + w(\mathrm{OPT}(n))\big) \ge 6k\big(w(\mathrm{OPT}(0)) + \cdots + w(\mathrm{OPT}(\lfloor n/(6k)\rfloor))\big) \ge \sum_{t=0}^{n} w\big(\mathrm{OPT}(\lfloor t/(6k)\rfloor)\big).$$

Proof. [Proof of Lemma 6.2.2] By using the previous lemma we have:

$$\mathrm{CRSC} = \sum_{A} r_A\,w(B_1,B_2) \le O(\alpha_n)\sum_{A} s_A\,w\big(\mathrm{OPT}(\lfloor r_A/(6k_A)\rfloor)\cap A\big).$$
Observe that $w(\mathrm{OPT}(t))$ is a decreasing function of $t$, since as $t$ decreases, more and more edges are getting cut. Hence we can write:

$$\sum_{A} s_A\cdot w\big(\mathrm{OPT}(\lfloor r_A/(6k_A)\rfloor)\cap A\big) \le \sum_{A}\ \sum_{t=r_A-s_A+1}^{r_A} w\big(\mathrm{OPT}(\lfloor t/(6k_A)\rfloor)\cap A\big).$$
To conclude the proof of the first part, all that remains to be shown is that:

$$\sum_{A}\ \sum_{t=r_A-s_A+1}^{r_A} w\big(\mathrm{OPT}(\lfloor t/(6k_A)\rfloor)\cap A\big) \le \sum_{t=0}^{n} w\big(\mathrm{OPT}(\lfloor t/(6k)\rfloor)\big).$$
To see why this is true, consider the clusters $A$ with a contribution to the LHS for a given $t$. We have that $r_A - s_A + 1 \le t \le r_A$, hence $|B_1| \le |B_2| < t \le |A|$, i.e. if both of $A$'s children are of size less than $t$, then this cluster $A$ contributes such a term. The

set of all such $A$ forms a disjoint partition of $V$ because of minimality (in order for two of them to overlap in the hierarchical clustering, one of them would need to be an ancestor of the other, and this cannot happen because of minimality). Since $\mathrm{OPT}(\lfloor t/(6k)\rfloor)\cap A$ over all such $A$ forms a disjoint partition of $\mathrm{OPT}(\lfloor t/(6k)\rfloor)$, the claim follows by summing up over all $t$.
Note that so far our analysis handles clusters $A$ of size $r_A \ge 6k$. However, for clusters of smaller size $r_A < 6k$ we can get away with a crude bound on their total cost, without affecting the approximation guarantee, which will be dominated by $O(k\alpha_n)$:

$$\sum_{A:\,|A|<6k} r_A\,w(B_1,B_2) < 6k\cdot\sum_{(i,j)\in E} w_{ij} = 6k\cdot\mathrm{OPT}(1) \le 6k\cdot\mathrm{OPT}.$$
This finishes the proof of the fact.

Theorem A.1.1 (The divisive algorithm using balanced cut) Given a weighted graph $G(V,E,w)$ with $k$ triplet constraints $ab|c$ for $a,b,c\in V$, the constrained recursive balanced cut algorithm (same as CRSC, but using balanced cut instead of sparsest cut) outputs a HC respecting all triplet constraints and achieves an $O(k\alpha_n)$-approximation for the HC objective (6.1).

Proof. It is not hard to show that one can use access to balanced cut rather than sparsest cut and achieve the same approximation factor with the recursive balanced cut algorithm. We will follow the same notation as in the sparsest cut analysis, and we will use some of the facts and inequalities we previously proved about $\mathrm{OPT}(t)$. Again, for a cluster $A$ of size $r$, the important observation is that the partition $A_1,\ldots,A_l$ (at the end, we will again choose $l = 6k_A$) induced inside the cluster $A$ by $\mathrm{OPT}(\lfloor r/l\rfloor)$ can be separated into two groups, say $(C_1,C_2)$, such that $r/3 \le |C_1|, |C_2| \le 2r/3$. In other words, we can demonstrate a balanced cut with ratio $\frac{1}{3}:\frac{2}{3}$ for the cluster $A$. Since we cut fewer edges when creating $C_1, C_2$ compared to the partitioning of $\mathrm{OPT}(\lfloor r/l\rfloor)$:

$$w(C_1,C_2) \le w\big(\mathrm{OPT}(\lfloor r/l\rfloor)\cap A\big).$$

Since we used an $\alpha_n$-approximation to balanced cut, we can get the following inequality (similarly to Lemma 6.2.1):

$$r\cdot w(C_1,C_2) \le O(\alpha_n)\cdot s\cdot w\big(\mathrm{OPT}(\lfloor r/l\rfloor)\cap A\big).$$

Finally, we have to sum up over all the clusters $A$ (in the summation we should now write $r_A, s_A$ instead of just $r, s$, since there is a dependence on $A$) produced by the constrained recursive balanced cut algorithm for Hierarchical Clustering, and we get that we can approximate the HC objective function up to $O(k\alpha_n)$.

Remark 15 Using balanced cut can be useful for two reasons. First, the running times of sparsest and balanced cut on a graph with $n$ nodes and $m$ edges are $\tilde{O}(m + n^{1+\epsilon})$. When run recursively, however, as

in our case, taking recursive sparsest cuts might be worse off by a factor of $n$ (in case of unbalanced splits at every step) in the worst case. However, recursive balanced cut is still $\tilde{O}(m + n^{1+\epsilon})$. Second, it is known that an $\alpha$-approximation for sparsest cut yields an $O(\alpha)$-approximation for balanced cut, but not the other way around. This gives more flexibility to the balanced cut algorithm, and there is a chance it can achieve a better approximation factor (although we do not study this further here).

A.1.2 Missing proofs in Section 6.2.2

Proof. [Proof sketch of Proposition 6.2.5] Here the main obstacle is similar to the one we handled when proving Theorem 6.2.1: for a given cluster $A$ created by the R-HSC algorithm, different constraints are, in general, active compared to the OPT decomposition for this cluster $A$. Note, of course, that OPT itself will not respect all constraints, but because we don't know which constraints are active for OPT, we still need to use a charging argument to low levels of OPT. Observe that here we are allowed to cut an edge $ab$ even if we had the $ab|c$ constraint (incurring the corresponding cost $c_{ab|c}$); however, we cannot possibly hope to charge this to the OPT solution, as OPT, for all we know, may have respected this constraint. In the analysis, we crucially use a merging procedure between sub-clusters of $A$ having active constraints between them, and this allows us to compare the cost of our R-HSC with the cost of OPT.
Proof. [3-hyperedges to triangles for general weights] Even though the general reduction presented in Section 6.2.2 (Figure 6.3) to transform a 3-hyperedge into a triangle is valid even for general instances of HSC with 3-hyperedges and arbitrary weights, the reduced sparsest cut problem may have negative weights, e.g. when $w_{bc|a} + w_{ac|b} < w_{ab|c}$.

A.1.3 Missing proofs in Section 6.2.3

Proof. [Proof of Theorem 6.2.6] We start by looking at the objective value of any algorithm as the summation of contributions of different triples $i, j$ and $k$ to the objective, where $(i,j)\in E$ and $k$ is some other point (possibly equal to $i$ or $j$).

$$\mathrm{OBJ} = \sum_{(i,j)\in E} w_{ij}|T_{ij}| = \sum_{(i,j)\in E,\,k\in V} w_{ij}\,\mathbf{1}\{k\in \mathrm{leaves}(T_{ij})\} = \sum_{(i,j)\in E}\sum_{k\in V}\mathcal{Y}_{i,j,k},$$
where the random variable $\mathcal{Y}_{i,j,k}$ denotes the contribution of the edge $(i,j)$ and vertex $k$ to the objective value. The vertex $k$ is a leaf of $T_{ij}$ if and only if right before the time that $i$ and $j$ get separated, $k$

is still in the same cluster as i and j. Therefore,

$$\mathcal{Y}_{i,j,k} = w_{ij}\,\mathbf{1}\{i \text{ separates from } k \text{ no earlier than from } j\}.$$

We now show that $\mathbb{E}[\mathcal{Y}_{i,j,k}] = \frac{2}{3}w_{ij}$. Given this, the expected objective value of the recursive random cutting algorithm will be at least $\frac{2n}{3}\sum_{(i,j)\in E} w_{ij}$. Moreover, the objective value of the optimal hierarchical clustering, i.e. the maximizer of Dasgupta's objective, is no more than $n\sum_{(i,j)\in E} w_{ij}$, and we conclude that recursive random cutting is a $\frac{2}{3}$-approximation. To see why $\mathbb{E}[\mathcal{Y}_{i,j,k}] = \frac{2}{3}w_{ij}$, think of randomized cutting as flipping an independent unbiased coin for each vertex, and then deciding on which side of the cut this vertex belongs based on the outcome of its coin. Look at the sequences of the coin flips of $i$, $j$ and $k$. Our goal is to find the probability of the event that, at the first time the sequences of $i$ and $j$ do not match, $k$'s sequence still matches $i$'s sequence up to this point, or $k$'s sequence still matches $j$'s sequence up to this point. The probability of each of these events is equal to $\frac{1}{3}$. To see this for the first event, suppose $i$'s sequence is all heads (H). We then need the pair of coin flips of $(j,k)$ to be a sequence of (H,H)'s ending with a (T,H), and this happens with probability $\sum_{i\ge 1}(\frac{1}{4})^{i} = \frac{1}{3}$. The probability of the second event is similarly calculated. Now, these events are disjoint. Hence, the probability that $i$ is separated from $k$ no earlier than from $j$ is exactly $\frac{2}{3}$, as desired.
Proof. [Proof of Theorem 6.2.7] We derandomize the recursive random cutting algorithm using the method of conditional expectations. At every recursion, we go over the points in the current cluster one by one, and decide whether to put them in the "left" partition or the "right" partition for the next recursion. Once we make a decision for a point, we fix that point and go to the next one. Now suppose that for a cluster $C$ we have already fixed the points in $S\subseteq C$, and we want to make a decision for $i\in C\setminus S$. The reward of assigning $i$ to the left (right) partition is defined as the expected value of recursive random cutting restricted to $C$, when the points in $S$ are fixed (i.e. it is already decided which points in $S$ are going to the left partition and which ones are going to the right partition), $i$ goes to the left (right) partition, and every $j\in C\setminus(\{i\}\cup S)$ is randomly assigned to either the left or the right. Note that these two rewards (or the difference of the two rewards) can be calculated exactly in polynomial time by considering all triples consisting of an edge and another vertex, and then calculating the probability that this triple contributes to the objective function (this is similar to the proof of Theorem 6.2.6, and we omit the details for brevity). Because we know the randomized assignment of $i$ gives a $\frac{2}{3}$-approximation (Theorem 6.2.6), we conclude that assigning each vertex to the better of the left or right partition remains at least a $\frac{2}{3}$-approximation. For the running time, we have at most $n$ clusters to investigate. Moreover, a careful counting argument shows that the total number of operations required to calculate the differences of the rewards of assigning to the left and right partitions for all vertices is at most $n(n+2m)$. Hence, the running time is bounded by $O(n^2(n+m))$.

Proof. [Proof sketch of Theorem 6.2.8.] Before starting to prove the theorem, we prove the following simple lemma.

Lemma A.1.1 There is no edge between any two classes in the same layer $\mathcal{I}_l$.
Proof. [Proof of Lemma A.1.1] If such an edge existed, then there would be a path of length $l+1$ from $\mathcal{C}$ to a class in $\mathcal{I}_l$, a contradiction.
Now, similar to the proof of Theorem 6.2.6, we consider every triple $\{x,y,z\}$, where $(x,y)\in E$ and $z$ is another point, but this time we only consider $z$'s that are not involved in any triplet constraint (there are at least $n-k$ such points). We claim that with probability at least $\frac{2}{3\cdot\mathrm{DMC}(\{c_1,\ldots,c_k\})}$ the supernode containing $z$ is still in the same cluster as the supernodes containing $x$ and $y$ right before $x$ and $y$ get separated. By summing over all such triples, we show that the algorithm gets a gain of at least $\frac{2(n-k)}{3\cdot\mathrm{DMC}(\{c_1,\ldots,c_k\})}\sum_{(x,y)\in E} w_{xy}$, which proves the $\alpha$-approximation, as the optimal clustering has a reward bounded by $n\sum_{(x,y)\in E} w_{xy}$.
To prove the claim, if $(x,y)$ is not the base of any triplet constraint, then a similar argument as in the proof of Theorem 6.2.6 shows that the desired probability is exactly $\frac{2}{3}$ (with a slight adaptation, i.e. by looking at the coin sequences of the supernodes containing $x$ and $y$, which are going to be disjoint in this case at all iterations, and the coin sequence of $z$). Now suppose $(x,y)$ is the base of some constraint $c$ and suppose $c$ belongs to a class $\mathcal{C}$. Consider the layered dependency subgraph of $\mathcal{C}$ as in Definition 6.2.2 and let the layers be $\mathcal{I}_0,\ldots,\mathcal{I}_L$. In order for $z$ to be in the same cluster as $x$ and $y$ when they get separated, a chain of $L+1$ independent events needs to happen. These events are defined inductively; for the first event, consider the coin sequence of $z$, the coin sequence of (the supernode containing all the bases of) the constraints in $\cup_{l=0}^{L}\mathcal{I}_l$, and the coin sequences of all the keys of constraints in $\mathcal{I}_L$ (there are $\sum_{\mathcal{C}'\in\mathcal{I}_L}|\mathcal{C}'|$ of them). Without loss of generality, suppose the coin sequence of (the supernode containing) $\cup_{l=0}^{L}\mathcal{I}_l$ is all heads. Now the event happens only if, at the time $z$ flips its first tails, all keys of $\mathcal{I}_L$ have already flipped at least one tails. Conditioned on this event happening, all the constraints in $\mathcal{I}_L$ will be resolved and $z$ remains in the same cluster as $x$ and $y$. Now, remove $\mathcal{I}_L$ from the dependency subgraph and repeat the same process to define the events $2,\ldots,L$ in a similar fashion. For the $l$-th event to happen, we need to look at $1+\sum_{\mathcal{C}'\in\mathcal{I}_L}|\mathcal{C}'|$ i.i.d. symmetric geometric random variables (where $\mathcal{I}_L$ now denotes the last remaining layer), and calculate the probability that the first of them is no smaller than the rest. This event happens with probability at least $\frac{1}{1+\sum_{\mathcal{C}'\in\mathcal{I}_L}|\mathcal{C}'|}$. Moreover, the events are independent, as there is no edge between any two classes in $\mathcal{I}_l$ for $l\in[L]$, and different classes have different keys. After these $L$ events, the final event that needs to happen is that, once all the constraints are unlocked, $z$ remains in the same cluster as $x$ and $y$ at the time they get separated. This event happens with probability $\frac{2}{3}$. Multiplying all of these probabilities, by independence, implies the desired approximation factor.

A.2 Experiments and Discussion from Chapter 6

The purpose of this section is to present the benefits of incorporating triplet constraints when performing Hierarchical Clustering. We will focus on real data, using the Zoo dataset ([71]) for a taxonomy application. We demonstrate that, using our approach, the performance of simple recursive spectral clustering algorithms can be improved by approximately 9% as measured by Dasgupta's Hierarchical Clustering cost function (6.1). More specifically:

• The Zoo dataset: It contains 100 animals forming 7 different categories (e.g. mammals, amphibians etc.). The features of each animal are provided by a 16-dimensional vector containing information such as whether the animal has hair or feathers etc.

• Evaluation method: Given the feature vectors, we can create a similarity matrix $M(\cdot,\cdot)$ indexed by the labels of the animals. We choose the widely used cosine similarity to create $M$.

• Algorithms: We use a simple implementation of spectral clustering based on the second eigenvector of the normalized Laplacian of $M$. By applying the spectral clustering algorithm once, we can create two clusters; by applying it recursively we can create a complete hierarchical decomposition, which is ultimately the output of the HC algorithm (a simplified sketch of this recursive procedure, including the constraint filter used in the constrained runs, is given right after this list).

• Baseline comparison: Since triplet constraints are especially useful when there is noisy information (i.e. noisy features), we simulate this situation by hiding some of the features of

our Zoo dataset. Specifically, when we want to find the target HC tree $T^*$, we use the full 16-dimensional feature vectors, but for the comparison between the unconstrained and the constrained HC algorithms we use a noisy version of the feature vectors, which consists of only the first 10 coordinates of every vector. In more detail, the first step in our experiments is to evaluate the cost of the target clustering

$T^*$. For this, we use the full feature vectors and perform repeated spectral clustering to get a hierarchical decomposition (without incorporating any constraints). We call this cost OPT. The second step is to perform unconstrained HC but with noisy information, i.e. to run the spectral clustering algorithm repeatedly on the 10-dimensional feature vectors (again without taking into account any triplet constraints). This outputs a hierarchical tree whose cost, in terms of Dasgupta's HC cost, we call Unconstrained Noisy Cost.

2 The final step is to choose some structural constraints (that are valid in T ⇤) and perform again HC with noisy information. We again use the 10-dimensional feature vectors but the spectral clustering algorithm is allowed only cuts that do not violate any of the given structural constraints. Repeating until we get a decomposition gives us the final output which will have cost in terms of the Dasgupta’s HC cost Constrained Noisy Cost. APPENDIX A. OMITTED PROOFS, DISCUSSIONS AND EXPERIMENTS 122
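The following is a minimal Python sketch of the unconstrained pipeline described above (our own illustrative code with illustrative names, not the implementation used to produce Table A.1): build the cosine-similarity matrix, recursively bisect using the second eigenvector of the normalized Laplacian, and evaluate Dasgupta's cost (6.1) of the resulting tree.

import numpy as np

def cosine_similarity_matrix(X):
    # X: (n, d) array of feature vectors; returns the n x n cosine-similarity matrix M.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    M = Xn @ Xn.T
    np.fill_diagonal(M, 0.0)       # ignore self-similarities
    return np.maximum(M, 0.0)      # keep the edge weights non-negative

def spectral_split(M, idx):
    # Bisect the points in idx using the second eigenvector of the (symmetric)
    # normalized Laplacian of the induced similarity matrix; as a simplification
    # we threshold at the median so both sides are non-empty when possible.
    W = M[np.ix_(idx, idx)]
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(idx)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)
    v2 = vecs[:, 1]
    return idx[v2 < np.median(v2)], idx[v2 >= np.median(v2)]

def recursive_hc(M, idx):
    # Returns a binary tree encoded as nested pairs; leaves are index arrays.
    if len(idx) <= 1:
        return idx
    left, right = spectral_split(M, idx)
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop recursing
        return idx
    return (recursive_hc(M, left), recursive_hc(M, right))

def dasgupta_cost(tree, M):
    # cost(T) = sum over internal nodes of |leaves(node)| * w(cut at that node),
    # which equals sum_{i<j} w_ij * |leaves(LCA(i, j))|, i.e. objective (6.1).
    if not isinstance(tree, tuple):
        return 0.0, np.asarray(tree)
    cost_l, leaves_l = dasgupta_cost(tree[0], M)
    cost_r, leaves_r = dasgupta_cost(tree[1], M)
    leaves = np.concatenate([leaves_l, leaves_r])
    return cost_l + cost_r + len(leaves) * M[np.ix_(leaves_l, leaves_r)].sum(), leaves

Running M = cosine_similarity_matrix(X); tree = recursive_hc(M, np.arange(len(X))); cost, _ = dasgupta_cost(tree, M) on the full and on the truncated feature vectors yields quantities of the same kind as OPT and Unconstrained Noisy Cost above (the exact numbers depend on implementation details such as the split rule and tie-breaking); the constrained variant would additionally restrict each split to cuts that do not separate the given triplets.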

#animals    OPT      Unconstrained Noisy Cost    Constrained Noisy Cost    % Improvement
20          1137     1286                        1142                      12.63
50          23088    25216                       23443                     7.68
80          89256    99211                       90419                     9.85
100         171290   190205                      173499                    9.75

Table A.1: Results obtained for the Zoo dataset. The improvement corresponds to the lower cost of the output HC tree after incorporating structural constraints, even in the presence of noisy features. Observe that in all cases the performance of Constrained Noisy Cost is extremely close to the OPT cost.

The first main result of our experimental evaluation is that the Constrained Noisy Cost is surprisingly close to OPT, even though the features used to obtain the Constrained Noisy Cost were noisy; the second main result is that incorporating the structural constraints yields a ≈ 9% improvement over the noisy unconstrained version of HC with cost Unconstrained Noisy Cost. Now that we have presented the experimental set-up, we can proceed by describing our results and final observations in greater depth.

A.2.1 Experimental Results

We ran our experiments for 20, 50, 80 and 100 animals from the Zoo dataset, and for the evaluation of the % improvement in terms of Dasgupta's HC cost (6.1) we used the following formula:

$$\frac{\text{Unconstrained Noisy Cost} - \text{Constrained Noisy Cost}}{\text{OPT}}$$

The improvements obtained due to the constrained version are presented in Table A.1.
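For instance, for the 100-animal instance in Table A.1 the formula gives $(190205 - 173499)/171290 \approx 0.0975$, i.e., the 9.75% improvement reported in the last column.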

Some observations regarding the structural constraints are the following:

• When we add triplet constraints to the input as advice for the algorithm, it is crucial for the triplet constraints to actually be useful. "Easy" constraints that are readily implied by the similarity scores will have no extra use and will not lead to better solutions.

• We also observed that having "nested" constraints can be really useful. Nested constraints can guide our algorithm to perform good cuts as they refer to a larger portion of the optimum tree $T^*$ (i.e., contiguous subtrees) rather than just different unrelated subtrees of it. The usefulness of the given constraints is correlated with the depth of the nested constraints and their accordance with the optimum tree $T^*$ based on Dasgupta's objective.

• Furthermore, since most of the objective cost comes from the large initial clusters, we focused on the partitions that created large clusters and imposed triplet constraints that ensured good cuts in the beginning. Actually, in some cases, just the first 3 or 4 cuts are enough to guarantee that we get a ≈ 12% improvement.

• Finally, we conclude that just the number of the given triplet constraints may not constitute a good metric for their usefulness. For example, a large number of constraints referring to wildly different parts of $T^*$ may end up being much less useful than a smaller number of constraints guiding towards a good first cut.

A.3 Deferred Proofs of Section 4.2.1

Proof. [Proof of Theorem 4.2.1]

The bottom-up merging strategy that first merges the edges $e$ inside the $K_{n^{2/3}}$ cliques attains HC value close to $nW$ (the rest of the edges don't matter). Since OPT is only better than this merging strategy, the claim follows. To see that, note that all such edges $e$ will have a multiplier of non-leaves $n - n^{2/3}$. Since there are $\frac{1}{2}n^{2/3}\cdot(n^{2/3}-1)\cdot n^{1/3}$ such edges, $OPT \ge (n - n^{2/3})\cdot\frac{1}{2}n^{2/3}\cdot(n^{2/3}-1)\cdot n^{1/3} \ge \frac{1}{2}n^{8/3} - O(n^{7/3})$.
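For concreteness, here is a small Python sketch (our own illustrative bookkeeping, not code from the thesis) of how the value of a bottom-up merging strategy is tallied under the convention used above: when two clusters $A$ and $B$ are merged in a graph on $n$ points, every edge between $A$ and $B$ is counted with the "non-leaves" multiplier $n - |A| - |B|$.

import numpy as np

def merge_sequence_value(w, merges):
    # w: symmetric (n, n) weight matrix; merges: bottom-up list of pairs (A, B)
    # of disjoint index lists that get merged into a single cluster.
    # Each edge between A and B is scored w_xy * (n - |A| - |B|), i.e. its weight
    # times the number of non-leaves of the subtree created by the merge.
    n = w.shape[0]
    return sum((n - len(A) - len(B)) * w[np.ix_(A, B)].sum() for A, B in merges)

In the construction above, merging an edge while its current cluster still lies inside a single $K_{n^{2/3}}$ clique yields a multiplier of at least $n - n^{2/3}$, which is exactly the bound used in the proof.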

Proof. [Proof of Theorem 4.2.2] Because of the $1+\epsilon$ weight of the edges going across the $K_{n^{2/3}}$ cliques,

Average Linkage will first start merging the $K_{n^{1/3}}$ cliques consisting of one node out of each $K_{n^{2/3}}$ clique. There are $n^{2/3}$ such $K_{n^{1/3}}$ cliques, so the total objective contribution of the edges involved in this first phase is insignificant, since it is certainly smaller than $n\cdot\frac{1}{2}n^{1/3}\cdot(n^{1/3}-1)\cdot n^{2/3}\cdot(1+\epsilon) = O(n^{7/3})$. Observe that after the first phase of Average Linkage the remaining subclusters to be merged form a clique on $n^{2/3}$ supernodes, each with size $s = n^{1/3}$ and weighted edges with uniform weights $w = n^{1/3}$. By Theorem 4.2.1 and taking into account the size of every supernode, we obtain the final value for Average Linkage $= \frac{1}{3}n^{2/3}\cdot\binom{n^{2/3}}{2}\cdot w\cdot s = \frac{1}{3}n^{2/3}\cdot\frac{1}{2}n^{2/3}(n^{2/3}-1)\cdot n^{1/3}\cdot n^{1/3} \le \frac{1}{6}n^{8/3} + O(n^{7/3})$.

Proof. [Proof of Theorem 4.2.3] The OPT solution can get all the weight by performing the cut $(L, R)$ and then proceeding arbitrarily. The HC value is then $OPT = nW = \frac{1}{4}n^3 - O(n^2)$.

Proof. [Proof of Theorem 4.2.4] Since there are a lot of 0-weight edges in the graph, Average Linkage first tries to merge endpoints of such edges. Note that Average Linkage is underspecified since there are ties here, but these ties are not affecting the overall outcome, as we can break ties arbitrarily by using small edge weights $\epsilon > 0$. Hence, we can assume that Average Linkage first merged the two endpoints of edges in the perfect matching $M$. After this first step, the remaining subclusters to be merged form a clique on $\frac{n}{2}$ supernodes, each with size $s = 2$ and weighted edges with uniform weights $w = 2$. By using Theorem 4.2.1 and taking into account the size of every supernode, we obtain the final value for Average Linkage $= \frac{2}{3}\cdot\frac{n}{2}\cdot\binom{n/2}{2}\cdot w\cdot s = \frac{2}{3}\cdot\frac{n}{2}\cdot\frac{1}{2}\cdot\frac{n}{2}\left(\frac{n}{2}-1\right)\cdot 2\cdot 2 \le \frac{1}{6}n^3$.

A.4 Deferred Proofs of Section 4.2.3

Proof. [Proof of Theorem 4.2.6] First of all, the optimization $\mathcal{P}_{\text{lower-bound}}$ decomposes over the variables $\{\theta_{ik}\}_{k\neq i}$ and the variables $\{\theta_{jk}\}_{k\neq j}$. Due to symmetry, we only lower-bound the optimal objective value of the following minimization program:

$$\text{minimize} \quad \sum_{k\neq i,j} \theta_{ik}$$
$$\text{subject to} \quad \sum_{k\neq i} \cos(\theta_{ik}) \le n/2 - 1, \qquad \text{(P-1)}$$
$$0 \le \theta_{ik} \le \frac{\pi}{2}, \quad \forall k$$

and then use $OBJ(\mathcal{P}_{\text{lower-bound}}) = 2\cdot\frac{OBJ(\text{P-1}) - (n-2)\bar{\theta}}{2\pi}$. Suppose $\{\theta^*_{ik}\}_{k\neq i}$ is the optimal solution of the above program (P-1). We first claim that in any optimal solution the first constraint is tight, i.e., $\sum_{k\neq i}\cos(\theta^*_{ik}) = n/2 - 1$. This simply holds because otherwise one can slightly decrease one of the non-zero $\theta^*_{ik}$ and strictly decrease the objective, a contradiction. Next, we claim that $\theta^*_{ik}\in\{0, \frac{\pi}{2}\}$ for all $k\neq i$, except for at most one $k = k_0$. To prove by contradiction, suppose it is not true. Therefore, there exist $k_1, k_2\neq i$ such that $0 < \theta^*_{ik_1}\le\theta^*_{ik_2} < \pi/2$. If we decrease $\theta^*_{ik_1}$ by an infinitesimal $d\theta$ and increase $\theta^*_{ik_2}$ by the same $d\theta$, then the objective value does not change. However, because of the concavity of the cosine function over the interval $[0, \pi/2]$, there will be additional slack in the first constraint of P-1, a contradiction to the first claim that in any optimal solution this constraint is tight.

Because $\theta^*_{ik}\in\{0, \pi/2\}$ for $k\neq i, k_0$, we have:

$$\#\{k\neq i, k_0 : \theta^*_{ik} = 0\} = \sum_{k\neq i, k_0}\cos(\theta^*_{i,k}) \le \sum_{k\neq i}\cos(\theta^*_{i,k}) \le n/2 - 1$$

We then conclude that for at least $n - 2 - (n/2 - 1) = \frac{n-2}{2}$ values of $k\neq i, k_0$ we have $\theta^*_{ik} = \pi/2$, and hence the optimal objective value of P-1 is lower-bounded by $\frac{n-2}{2}\cdot\frac{\pi}{2} = \frac{(n-2)\pi}{4}$. This lower bound immediately implies that the optimal objective value of $\mathcal{P}_{\text{lower-bound}}$ is also lower-bounded by
$$2\cdot\frac{\frac{(n-2)\pi}{4} - (n-2)\bar{\theta}}{2\pi} = (n-2)\left(\frac{1}{4} - \frac{\bar{\theta}}{2\pi}\right),$$
which completes the proof of the lemma.

Proof. [Details of final calculations in the proof of Theorem 4.2.5] To get the final approximation factor, we balanced out the two cases:

$$\left(1 - \frac{2\epsilon_1}{\epsilon_2}\right)\left(\frac{1}{2} - \frac{2\cos^{-1}(1-\epsilon_2)}{3\pi}\right) = \frac{1}{3(1-\epsilon_1)} \qquad \text{(A.1)}$$

By solving for $\epsilon_1$, the optimal value of $\epsilon_1$ as a function of $\epsilon_2$ is calculated to be the following function:

$$\epsilon_1^*(\epsilon_2) = \frac{1}{4}\left((\epsilon_2 + 2) - \sqrt{(\epsilon_2 + 2)^2 - 8\epsilon_2\left(1 - \frac{1}{3\left(\frac{1}{2} - \frac{2\arccos(1-\epsilon_2)}{3\pi}\right)}\right)}\right)$$
We then draw $\alpha(\epsilon_2) = \frac{1}{3(1-\epsilon_1^*(\epsilon_2))}$ for $\epsilon_2\in[0,1]$ with the aid of computer software (WolframAlpha). This function peaks at around $\epsilon_2\approx 0.139$. By plugging this number into $\alpha(\epsilon_2)$, we get the final factor.
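The peak described above is easy to re-derive numerically. The short Python sketch below is an illustrative re-computation (not the original WolframAlpha session) based on the formulas as displayed above: it evaluates $\epsilon_1^*(\epsilon_2)$ and $\alpha(\epsilon_2)$ on a grid and reports the maximizer.

import numpy as np

def eps1_star(eps2):
    # Optimal eps_1 as a function of eps_2, following the displayed formula.
    g = 0.5 - 2.0 * np.arccos(1.0 - eps2) / (3.0 * np.pi)
    disc = (eps2 + 2.0) ** 2 - 8.0 * eps2 * (1.0 - 1.0 / (3.0 * g))
    return ((eps2 + 2.0) - np.sqrt(disc)) / 4.0

eps2 = np.linspace(1e-4, 1.0 - 1e-4, 100000)
alpha = 1.0 / (3.0 * (1.0 - eps1_star(eps2)))
i = np.argmax(alpha)
print(eps2[i], alpha[i])   # expected to peak near eps2 ~ 0.14 with alpha ~ 0.336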

A.5 Deferred Proofs of Section 4.2.4

Proof. [Proof of Theorem 4.2.9] Recall that in the partition $(L_k, R_k)$ of $T^*$ we have $\max(|L_k|, |R_k|) \le (1-\delta)n$, so:

$$OPT \le n(1-\delta)\,w(L_k) + n(1-\delta)\,w(R_k) + n\,(W - w(L_k) - w(R_k))$$

Combining with our assumption $OPT \ge (1-\epsilon)nW$, we get:

$$(1-\epsilon)nW \le n(1-\delta)\,w(L_k) + n(1-\delta)\,w(R_k) + n\,(W - w(L_k) - w(R_k))$$
and the claim follows by rearranging the terms after the cancellations.

Proof. [Proof of Theorem 4.2.13] We start by $ALG = ALG_{\text{peel}} + ALG_{\text{cut}}$. From Theorem 4.2.7 and Lemma 4.2.10 we have:
$$ALG_{\text{peel}} \ge \left(1 - \frac{2W}{n\tau}\right)OPT_{\text{red}}$$

$$ALG_{\text{cut}} \ge \rho_{GW}\left(n - \frac{2W}{\tau}\right)\text{MaxCut}_{\text{blue}} \ge \rho_{GW}\left(1 - \frac{2W}{n\tau}\right)\left(OPT_{\text{blue-cut}} - n\,w(L_k) - n\,w(R_k) + \frac{OPT_{\text{blue-chain}}}{2}\right)$$
Hence,

$$ALG \ge \rho_{GW}\left(1 - \frac{2W}{n\tau}\right)\left(OPT_{\text{red}} + OPT_{\text{blue-cut}} + \frac{OPT_{\text{blue-chain}}}{2} - n\,w(L_k) - n\,w(R_k)\right)$$
Because $OPT \ge (1-\epsilon)nW$, i.e., $nW \le \frac{OPT}{1-\epsilon}$, by Claim 4.2.9 and Theorem 4.2.8 we get (choosing $\tau$ so that $\frac{2W}{n\tau} = \frac{1}{\gamma}$):
$$ALG \ge \rho_{GW}\left(1 - \frac{1}{\gamma}\right)\left(1 - \frac{\epsilon/\delta}{1-\epsilon} - \frac{\gamma\delta}{1-\epsilon}\right)OPT$$
We have to balance out the two factors obtained from Case 1 and Case 2, so we get the final equation:
$$\rho_{GW}\left(1 - \frac{1}{\gamma}\right)\left(1 - \frac{\epsilon/\delta}{1-\epsilon} - \frac{\gamma\delta}{1-\epsilon}\right) = \frac{2}{3(1-\epsilon)}$$

In terms of the parameter $\delta$, it is easy to see that the choice of $\delta = \sqrt{\epsilon/\gamma}$ is optimal, so substituting:
$$\rho_{GW}\left(1 - \frac{1}{\gamma}\right)\left(1 - \frac{2\sqrt{\gamma\epsilon}}{1-\epsilon}\right) = \frac{2}{3(1-\epsilon)}$$

Rearranging the terms we get:
$$\rho_{GW}\left(1 - \frac{1}{\gamma}\right)\epsilon + 2\rho_{GW}\left(1 - \frac{1}{\gamma}\right)\cdot\sqrt{\gamma}\,\sqrt{\epsilon} + \left(\frac{2}{3} - \rho_{GW}\left(1 - \frac{1}{\gamma}\right)\right) = 0$$

Maximizing over $\gamma$ for $\sqrt{\epsilon}\ge 0$ (dropping the negative solution):
$$\sqrt{\epsilon} = \frac{-2\sqrt{\gamma} + \sqrt{4\gamma - 4\left(\frac{2}{3\rho_{GW}\left(1 - \frac{1}{\gamma}\right)} - 1\right)}}{2}$$
we get the final optimal answer for $\gamma \approx 11.1$ and $\epsilon \approx 0.000612$.

Bibliography

[1] Amir Abboud, Vincent Cohen-Addad, and Hussein Houdrougé. Subquadratic high-dimensional hierarchical clustering. In Advances in Neural Information Processing Systems, pages 11576–11586, 2019.

[2] Sara Ahmadian, Vaggos Chatziafratis, Alessandro Epasto, Euiwoong Lee, Mohammad Mahdian, Konstantin Makarychev, and Grigory Yaroslavtsev. Bisect and conquer: Hierarchical clustering via max-uncut bisection. The 23rd International Conference on Artificial Intelligence and Statistics, 2020.

[3] Alfred V. Aho, Yehoshua Sagiv, Thomas G. Szymanski, and Jeffrey D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405–421, 1981.

[4] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1–27, 2008.

[5] Noga Alon, Yossi Azar, and Danny Vainstein. Hierarchical clustering: a 0.585 revenue approximation. 2020.

[6] Christoph Ambühl, Monaldo Mastrolilli, and Ola Svensson. Inapproximability results for maximum edge biclique, minimum linear arrangement, and sparsest cut. SIAM Journal on Computing, 40(2):567–596, 2011.

[7] Haris Angelidakis, Pranjal Awasthi, Avrim Blum, Vaggos Chatziafratis, and Chen Dan. Bilu-Linial stability, certified algorithms and the independent set problem. arXiv preprint arXiv:1810.08414, 2018.

[8] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2):1–37, 2009.

[9] Sanjeev Arora, Satish Rao, and Umesh V. Vazirani. Geometry, flows, and graph-partitioning algorithms. Commun. ACM, 51(10):96–105, 2008.


[10] Per Austrin, Siavosh Benabbas, and Konstantinos Georgiou. Better balance by being biased: A 0.8776-approximation for max bisection. ACM Transactions on Algorithms (TALG), 13(1):2, 2016.

[11] Per Austrin, Rajsekar Manokaran, and Cenny Wenner. On the NP-hardness of approximating ordering constraint satisfaction problems. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 26–41. Springer, 2013.

[12] Pranjal Awasthi, Maria Florina Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. The Journal of Machine Learning Research, 18(1):75–109, 2017.

[13] Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In International Conference on Algorithmic Learning Theory, pages 316–328. Springer, 2008.

[14] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 671–680, 2008.

[15] Maria Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. In International Colloquium on Automata, Languages, and Programming, pages 63–74. Springer, 2012.

[16] Maria-Florina Balcan, Yingyu Liang, and Pramod Gupta. Robust hierarchical clustering. Journal of Machine Learning Research, 15:3831, 2014.

[17] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.

[18] Yair Bartal. Graph decomposition lemmas and their role in metric embedding methods. In European Symposium on Algorithms, pages 89–97. Springer, 2004.

[19] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. Springer, 2006.

[20] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the twenty-first international conference on Machine learning, page 11. ACM, 2004.

[21] Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(05):643–660, 2012.

[22] Gerth Stølting Brodal, Rolf Fagerberg, Thomas Mailund, Christian NS Pedersen, and Andreas Sand. Efficient algorithms for computing the triplet and quartet distance between trees of arbitrary degree. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1814–1832. Society for Industrial and Applied Mathematics, 2013.

[23] Jaroslaw Byrka, Sylvain Guillemot, and Jesper Jansson. New results on optimizing rooted triplets consistency. Discrete Applied Mathematics, 158(11):1136–1147, 2010.

[24] Luigi L Cavalli-Sforza and Anthony WF Edwards. Phylogenetic analysis: models and estimation procedures. Evolution, 21(3):550–570, 1967.

[25] Arthur Cayley. XXVIII. On the theory of the analytical forms called trees. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 13(85):172–176, 1857.

[26] Arthur Cayley. A theorem on trees. Quarterly J. Math, 23:376–378, 1889.

[27] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841–854. SIAM, 2017.

[28] Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2291–2304. SIAM, 2019.

[29] Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, and Grigory Yaroslavtsev. Hierarchical clustering for euclidean data. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2721–2730, 2019.

[30] Moses Charikar, Chandra Chekuri, Tomás Feder, and Rajeev Motwani. Incremental clustering and dynamic information retrieval. SIAM Journal on Computing, 33(6):1417–1440, 2004.

[31] Moses Charikar, Mohammad Taghi Hajiaghayi, Howard Karloff, and Satish Rao. $\ell_2^2$ spreading metrics for vertex ordering problems. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1018–1027. Society for Industrial and Applied Mathematics, 2006.

[32] Evangelos Chatziafratis, Yingrui Zhang, and Osman Yağan. On the robustness of power systems: optimal load-capacity distributions and hardness of attacking. In 2016 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE.

[33] Vaggos Chatziafratis, Neha Gupta, and Euiwoong Lee. Hardness for dissimilarity hierarchical clustering and fair correlation clustering. work in progress, 2020.

[34] Vaggos Chatziafratis, Mohammad Mahdian, and Sara Ahmadian. Aggregating inconsistent information in ranking, clustering and phylogenetic trees. In manuscript under submission, 2020.

[35] Vaggos Chatziafratis, Rad Niazadeh, and Moses Charikar. Hierarchical clustering with structural constraints. In International Conference on Machine Learning, pages 774–783, 2018.

[36] Vaggos Chatziafratis, Tim Roughgarden, and Jan Vondrák. Stability and recovery for independence systems. European Symposium on Algorithms (ESA), 2017.

[37] Andrew Chester, Riccardo Dondi, and Anthony Wirth. Resolving rooted triplet inconsistency by dissolving multigraphs. In International Conference on Theory and Applications of Models of Computation, pages 260–271. Springer, 2013.

[38] Vincent Cohen-Addad, Varun Kanade, and Frederik Mallmann-Trenn. Hierarchical clustering beyond the worst-case. In Advances in Neural Information Processing Systems, pages 6202– 6210, 2017.

[39] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. Journal of the ACM (JACM), 66(4):1–42, 2019.

[40] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

[41] Katharina Dannenberg, Jesper Jansson, Andrzej Lingas, and Eva-Marta Lundell. The approximability of maximum rooted triplets consistency with fan triplets and forbidden triplets. Discrete Applied Mathematics, 257:101–114, 2019.

[42] Sanjoy Dasgupta. Performance guarantees for hierarchical clustering. In International Conference on Computational Learning Theory, pages 351–363. Springer, 2002.

[43] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, pages 118–127, New York, NY, USA, 2016. ACM.

[44] Nikhil R Devanur, Subhash A Khot, Rishi Saket, and Nisheeth K Vishnoi. Integrality gaps for sparsest cut and minimum linear arrangement problems. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 537–546. ACM, 2006.

[45] Ibai Diez, Paolo Bonifazi, Iñaki Escudero, Beatriz Mateos, Miguel A Muñoz, Sebastiano Stramaglia, and Jesus M Cortes. A novel brain partition highlights the modular skeleton shared by structure and function. Scientific reports, 5:10532, 2015.

[46] Michael B Eisen, Paul T Spellman, Patrick O Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.

[47] Ehsan Emamjomeh-Zadeh and David Kempe. Adaptive hierarchical clustering using ordinal queries. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 415–429. SIAM, 2018.

[48] Guy Even, Joseph (Seffi) Naor, Satish Rao, and Baruch Schieber. Divide-and-conquer approximation algorithms via spreading metrics. Journal of the ACM (JACM), 47(4):585–616, 2000.

[49] Uriel Feige and James R Lee. An improved approximation ratio for the minimum linear arrangement problem. Information Processing Letters, 101(1):26–29, 2007.

[50] Joseph Felsenstein. Inferring phylogenies, volume 2. Sinauer associates Sunderland, MA, 2004.

[51] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.

[52] Xiping Fu, Brendan McCane, Steven Mills, and Michael Albert. Nokmeans: Non-orthogonal k-means hashing. In Asian Conference on Computer Vision, pages 162–177. Springer, 2014.

[53] Naveen Garg, Vijay V Vazirani, and Mihalis Yannakakis. Approximate max-flow min-(multi) cut theorems and their applications. In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, pages 698–707. ACM, 1993.

[54] Thomas Gärtner. A survey of kernels for structured data. ACM SIGKDD Explorations Newsletter, 5(1):49–58, 2003.

[55] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.

[56] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.

[57] Talha Cihad Gulcu, Vaggos Chatziafratis, Yingrui Zhang, and Osman Yağan. Attack vulnerability of power systems under an equal load redistribution model. IEEE/ACM Transactions on Networking, 26(3):1306–1319, 2018.

[58] Venkatesan Guruswami, Johan Håstad, Rajsekar Manokaran, Prasad Raghavendra, and Moses Charikar. Beating the random ordering is hard: Every ordering CSP is approximation resistant. SIAM Journal on Computing, 40(3):878–914, 2011.

[59] Venkatesan Guruswami, Rajsekar Manokaran, and Prasad Raghavendra. Beating the random ordering is hard: Inapproximability of maximum acyclic subgraph. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 573–582. IEEE, 2008.

[60] Jesper Jansson. Consensus algorithms for trees and strings. 2003.

[61] N Jardine and R Sibson. A model for taxonomy. Mathematical Biosciences, 2(3-4):465–482, 1968.

[62] Stephen C Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.

[63] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE transactions on pattern analysis and machine intelligence, 24(7):881–892, 2002.

[64] Jonathan A Kelner, Yin Tat Lee, Lorenzo Orecchia, and Aaron Sidford. An almost-linear-time algorithm for approximate max flow in undirected graphs, and its multicommodity generalizations. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 217–226. Society for Industrial and Applied Mathematics, 2014.

[65] Subhash Khot. On the power of unique 2-prover 1-round games. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 767–775. ACM, 2002.

[66] Matthäus Kleindessner and Ulrike von Luxburg. Kernel functions based on triplet comparisons. In Advances in Neural Information Processing Systems, pages 6810–6820, 2017.

[67] Robert Krauthgamer, Joseph (Seffi) Naor, and Roy Schwartz. Partitioning graphs into balanced components. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 942–949. Society for Industrial and Applied Mathematics, 2009.

[68] Tom Leighton and Satish Rao. An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In Foundations of Computer Science, 1988., 29th Annual Symposium on, pages 422–431. IEEE, 1988.

[69] Tom Leighton and Satish Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM (JACM), 46(6):787–832, 1999.

[70] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge university press, 2014.

[71] Moshe Lichman. UCI machine learning repository, zoo dataset, 2013.

[72] Guolong Lin, Chandrashekhar Nagarajan, Rajmohan Rajaraman, and David P Williamson. A general approach for incremental approximation and hierarchical clustering. SIAM Journal on Computing, 39(8):3633–3669, 2010.

[73] Michael Luby and Charles Rackoff. How to construct pseudorandom permutations from pseudorandom functions. SIAM Journal on Computing, 17(2):373–386, 1988.

[74] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Bilu-Linial stable instances of max cut and minimum multiway cut. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '14, pages 890–906, Philadelphia, PA, USA, 2014. Society for Industrial and Applied Mathematics.

[75] Charles F Mann, David W Matula, and Eli V Olinick. The use of sparsest cuts to reveal the hierarchical community structure of social networks. Social Networks, 30(3):223–234, 2008.

[76] Frank McSherry. Spectral partitioning of random graphs. In FOCS, page 529. IEEE, 2001.

[77] Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, and Amr Ahmed. Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 714–722, 2019.

[78] Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems, pages 3097–3106, 2017.

[79] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.

[80] Richard Peng. Approximate undirected maximum flows in O(m polylog(n)) time. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1862–1867. SIAM, 2016.

[81] C Greg Plaxton. Approximation algorithms for hierarchical location problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 40–49. ACM, 2003.

[82] Prasad Raghavendra and David Steurer. Graph expansion and the unique games conjecture. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 755–764. ACM, 2010.

[83] Prasad Raghavendra, David Steurer, and Madhur Tulsiani. Reductions between expansion problems. In 2012 IEEE 27th Conference on Computational Complexity, pages 64–73. IEEE, 2012.

[84] F James Rohlf. Adaptive hierarchical clustering schemes. Systematic Biology, 19(1):58–82, 1970.

[85] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Neural Information Processing Systems (NIPS), pages 2316–2324, 2016.

[86] Jonah Sherman. Breaking the multicommodity flow barrier for $O(\sqrt{\log n})$-approximations to sparsest cut. In Foundations of Computer Science, 2009. FOCS'09. 50th Annual IEEE Symposium on, pages 363–372. IEEE, 2009.

[87] Jonah Sherman. Nearly maximum flows in nearly linear time. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 263–269. IEEE, 2013.

[88] Peter HA Sneath and Robert R Sokal. Numerical taxonomy. Nature, 193(4818):855–860, 1962.

[89] Sagi Snir and Satish Rao. Using max cut to enhance rooted trees consistency. IEEE/ACM transactions on computational biology and bioinformatics, 3(4):323–333, 2006.

[90] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. In TextMining Workshop at KDD2000 (May 2000), 2000.

[91] Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 526–527. Society for Industrial and Applied Mathematics, 2004.

[92] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 673–680. Omnipress, 2011.

[93] Madhur Tulsiani, 2019. Personal Communication.

[94] Michele Tumminello, Fabrizio Lillo, and Rosario N Mantegna. Correlation, hierarchies, and networks in financial markets. Journal of economic behavior & organization, 75(1):40–58, 2010.

[95] Andrea Vedaldi and Brian Fulkerson. Vlfeat: An open and portable library of computer vision algorithms. In Proceedings of the 18th ACM international conference on Multimedia, pages 1469–1472. ACM, 2010.

[96] Sharad Vikram and Sanjoy Dasgupta. Interactive Bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081–2090, 2016.

[97] Kiri Wagstaff and Claire Cardie. Clustering with instance-level constraints. AAAI/IAAI, 1097:577–584, 2000.

[98] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. Constrained k-means clustering with background knowledge. In ICML, volume 1, pages 577–584, 2001.

[99] Chenchen Wu, Donglei Du, and Dachuan Xu. An improved semidefinite programming hierarchies rounding approximation algorithm for maximum graph bisection problems. Journal of Combinatorial Optimization, 29(1):53–66, 2015.

[100] Grigory Yaroslavtsev and Adithya Vadapalli. Massively parallel algorithms and hardness for single-linkage clustering under $\ell_p$-distances. arXiv preprint arXiv:1710.01431, 2017.

[101] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In Advances in neural information processing systems, pages 1601–1608, 2005.