FAIR AND DIVERSE DATA REPRESENTATION IN MACHINE LEARNING
A Dissertation Presented to The Academic Faculty
By
Uthaipon Tantipongpipat
In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Algorithms, Combinatorics, and Optimization
Georgia Institute of Technology
August 2020
Copyright © Uthaipon Tantipongpipat 2020
Approved by:
Dr. Mohit Singh, Advisor
School of Industrial and Systems Engineering
Georgia Institute of Technology

Dr. Santosh Vempala
School of Computer Science
Georgia Institute of Technology

Dr. Sebastian Pokutta
Institute of Mathematics
Technical University of Berlin

Dr. Rachel Cummings
School of Industrial and Systems Engineering
Georgia Institute of Technology

Dr. Aleksandar Nikolov
Department of Computer Science
University of Toronto

Date Approved: May 8, 2020

ACKNOWLEDGEMENTS
This thesis would not be complete without many great collaborators and much support I received. My advisor, Mohit Singh, has been extremely supportive of my exploring my own research interests, while always happy to discuss even the smallest technical detail when I need help. His advising aim has always been what is best for me – for my career growth as a researcher and for a fulfilling life as a person. He is unquestionably an integral part of my success today.

I want to thank Rachel Cummings, Aleksandar (Sasho) Nikolov, Sebastian Pokutta, and Santosh Vempala for the many research discussions we had together and their willingness to serve on my dissertation committee amidst the COVID-19 disruption. I want to thank my other coauthors along my academic path: Digvijay Boob, Sara Krehbiel, Kevin Lai, Vivek Madan, Jamie Morgenstern, Samira Samadi, Amaresh (Ankit) Siva, Chris Waites, and Weijun Xie. I am grateful for friendly and supportive peers at Georgia Tech: Prateek Bhakta, Sarah Cannon, Zongchen Chen, Ben Cousins, David Durfee, Majid Farhadi, Yu Gao, Bhuvesh Kumar, Samantha Petti, Saurabh Sawlani, Sadra Yazdanbod, Sina Yazdanbod, and many others.

Janardhan Kulkarni and Sergey Yekhanin were wonderful supervisors and collaborators during my internship at Microsoft Research, Redmond. I enjoyed our research discussions, which broadened my research horizons. Both of them were a memorable part of my internship experience that has significantly shaped my research and career path.

I would not be here today without the continuous support of my family, friends, and mentors in my life, especially the support they gave when I needed it most. I want to especially thank Jocelyn Davis, Doug Johns, Aaron Menikoff, Bryan Pillsbury, and MaryAnne Turner for their extremely helpful and timely wisdom on my career planning and personal life.
TABLE OF CONTENTS
Acknowledgments ...... iii
List of Tables ...... x
List of Figures ...... xi
Chapter 1: Introduction ...... 1
1.1 Diverse Subset Selection ...... 1
1.1.1 Introduction ...... 1
1.1.2 Other Applications of Subset Selection and Related Work ...... 3
1.1.3 Summary of Contributions ...... 5
1.2 Fair Dimensionality Reduction ...... 6
1.2.1 Introduction ...... 6
1.2.2 Related Work ...... 7
1.2.3 Summary of Contribution ...... 8
1.2.4 Fast implementations ...... 9
1.2.5 Experiments ...... 9
1.3 Future Directions ...... 10
1.3.1 Generalized Linear Models ...... 10
1.3.2 Ridge Regression ...... 11
1.3.3 Other Applications ...... 11
Chapter 2: Preliminaries ...... 12
2.1 Convex Relaxation and its Dual of A-optimal Design ...... 12
2.2 Convex Relaxation of D-optimal Design ...... 14
2.3 Integrality Gaps ...... 14
2.3.1 Tightness of Approximations...... 15
2.3.2 Dual-Fitting ...... 15
2.4 Local Search and Greedy Algorithms ...... 16
Chapter 3: Sampling-Based Approximation Algorithm for Subset Selection ...... 17
3.1 Introduction ...... 17
3.1.1 Our Contributions and Results ...... 18
3.1.2 Related Work ...... 24
3.1.3 Problem Variants ...... 26
3.1.4 Organization ...... 27
3.2 Approximation via Near Independent Distributions ...... 28
3.2.1 Approximately Independent Distributions ...... 28
3.3 Approximating Optimal Design without Repetitions ...... 32
3.3.1 d-approximation for k = d ...... 32
3.3.2 (1 + )-approximation ...... 33
3.4 Approximately Optimal Design with Repetitions ...... 35
3.5 Generalizations ...... 38
3.5.1 k-Approximation Algorithm for k ≤ d ...... 38
3.5.2 Restricted Invertibility Principle for Harmonic Mean ...... 41
3.5.3 The Generalized Ratio Objective ...... 43
3.6 Efficient Algorithms ...... 52
3.6.1 Efficient Randomized Proportional Volume ...... 53
3.6.2 Efficient Deterministic Proportional Volume ...... 56
3.6.3 Efficient Randomized Implementation of k/(k − d + 1)-Approximation Algorithm With Repetitions ...... 58
3.6.4 Efficient Deterministic Implementation of k/(k − d + 1)-Approximation Algorithm With Repetitions ...... 63
3.6.5 Efficient Implementations for the Generalized Ratio Objective . . . 64
3.7 Integrality Gaps ...... 67
3.7.1 Integrality Gap for E-Optimality ...... 67
3.7.2 Integrality Gap for A-optimality ...... 72
3.8 Hardness of Approximation ...... 73
3.9 Regularized Proportional Volume Sampling for Ridge Regression ...... 79
3.9.1 Background ...... 79
3.9.2 λ-Regularized A-Optimal Design and λ-Regularized Proportional Volume Sampling ...... 85
3.9.3 Related Work ...... 86
3.9.4 Reduction of Approxibility to Near-Pairwise Independence ...... 88
3.9.5 Constructing a Near-Pairwise-Independent Distribution ...... 91
3.9.6 The Proof of the Main Result ...... 92
3.9.7 Efficient Implementation of λ-Regularized Proportional Volume Sampling . 96
Chapter 4: Combinatorial Algorithms for Optimal Design ...... 99
4.1 Introduction ...... 99
4.1.1 Main Approximation Results of Combinatorial Algorithms ...... 99
4.1.2 Related Work ...... 102
4.1.3 Organization ...... 102
4.2 Local Search for D-DESIGN ...... 102
4.2.1 Local Search Algorithm ...... 103
4.2.2 Relaxations ...... 103
4.2.3 D-DESIGN without Repetitions ...... 107
4.3 Local Search for A-DESIGN ...... 107
4.3.1 Capping Vectors ...... 107
4.3.2 Local Search Algorithm ...... 109
4.3.3 Instances with Bad Local Optima ...... 111
4.4 Proofs from Section 4.2 ...... 113
4.4.1 Local Search for D-DESIGN without Repetitions ...... 116
4.5 Proofs from Section 4.3 ...... 118
4.5.1 Proof of Performance of Modified Local Search Algorithm for A-DESIGN . 118
4.5.2 Guessing A-Optimum Value φA(V ) ...... 139
4.5.3 Example of Instances to A-DESIGN ...... 140
4.6 Approximate Local Search for D-DESIGN ...... 142
4.7 Approximate Local Search for A-DESIGN ...... 147
4.8 Greedy Algorithm for D-DESIGN ...... 150
4.9 Greedy Algorithm for A-DESIGN ...... 157
Chapter 5: Multi-Criteria Dimensionality Reduction with Applications to Fairness . . . 167
5.1 Introduction ...... 167
5.1.1 Results and Techniques ...... 170
5.1.2 Related Work ...... 176
5.2 Low-Rank Solutions of MULTI-CRITERIA-DIMENSION-REDUCTION ...... 177
5.3 Approximation Algorithm for FAIR-PCA ...... 182
5.4 Iterative Rounding Framework with Applications to FAIR-PCA ...... 183
5.5 Polynomial Time Algorithm for Fixed Number of Groups ...... 187
5.5.1 Proof of Theorem 5.1.6 ...... 189
5.6 Hardness ...... 190
5.7 Integrality Gap ...... 192
5.8 Experiments ...... 193
5.9 Scalability of the Algorithms ...... 195
5.9.1 Multiplicative Weight Update ...... 197
5.9.2 Frank-Wolfe ...... 199
5.9.3 Parameter Tuning ...... 200
5.9.4 Practical Considerations and Findings ...... 202
5.9.5 Runtime Results ...... 202
Chapter 6: Conclusion ...... 205
References ...... 217
LIST OF TABLES
1.1 Summary of approximation ratios of optimal design from previous work and our work. Cells with an asterisk * indicate our results that improve the previous ones. No integrality gap results existed before our work. ...... 5
3.1 Summary of approximation ratios of A-optimal results. We list the best applicable previous work for comparison. The cells with asterisk * indicate that the ratios are tight with matching integrality gap of the convex relaxation (2.1)-(2.3)...... 18
3.2 Summary of variants of objectives for A-optimal design problems and their corre- sponding modifications of algorithms and approximation guarantees . . . . . 27
3.3 Summary of approximation ratio obtained by our work on generalized ratio prob- lem...... 44
3.4 Distributions of model and prediction errors in ridge regression ...... 80
3.5 Expected square loss of model and prediction errors in ridge regression ...... 82
5.1 Runtime of MW and FW for solving MULTI-CRITERIA-DIMENSION-REDUCTION on different fairness objectives and numbers of dimensions in the original data. Times reported are in seconds. ...... 203
LIST OF FIGURES
3.1 The values of the coordinates of $u_e$ for $e \in C_\ell$ ...... 74
3.2 The values of the coordinates of $u_e$ for $e \notin C_\ell$ ...... 75
3.3 The subgraph with edges $E_f$ for the triple $f = \{x, y, z\}$. (Adapted from [GJ79]) ...... 77
5.1 Iterative Rounding Algorithm ITERATIVE-SDP...... 184
5.2 Marginal loss function of standard PCA compared to our SDP-based algorithms on Default Credit data. SDPRoundNSW and SDPRoundMar-Loss are two runs of the SDP-based algorithms maximizing NSW and minimizing marginal loss. Left: k = 4 groups. Right: k = 6...... 194
5.3 NSW objective of standard PCA compared to our SDP-based algorithms on Default Credit data. SDPRoundNSW and SDPRoundMar-Loss are two runs of the SDP algorithms maximizing NSW objective and minimizing maximum marginal loss. Left: k = 4 groups. Right: k = 6...... 195
5.4 Marginal loss and NSW objective of standard PCA compared to our SDP-based al- gorithms on Adult Income data. SDPRoundNSW and SDPRoundMar-Loss are two runs of the SDP algorithms maximizing NSW objective and minimizing maximum marginal loss...... 196
SUMMARY
The work consists of two major topics: subset selection and multi-criteria dimensionality reduction with an application to fairness. Subset selection can be applied to a classical problem in statistics, optimal design, and to many other problems in machine learning, including diverse sampling.

Our first contribution is to show that approximability of many criteria for subset selection can be obtained by novel polynomial-time sampling algorithms, improving upon the best previous approximation ratios in the literature. The results apply to several generalizations of the problem, many of which are novel. We also show that optimizing the A-optimal criterion is NP-hard and that the best-known approximation for the E-optimal criterion is tight with respect to the natural convex relaxation. One of the most common heuristics used in practice to solve the A- and D-optimal criteria is the local search heuristic, also known as Fedorov's exchange method [Fed72], owing to its simplicity and its empirical performance [CN80, MN94, ADT07]. However, despite its wide usage, no theoretical bound had been proven for this algorithm. We bridge this gap and prove approximation guarantees for the local search algorithms for the A- and D-optimal criteria.

This thesis also extends the arguably most commonly used dimensionality reduction technique, Principal Component Analysis (PCA), to satisfy a fairness criterion of choice. We model an additional fairness constraint as multi-criteria dimensionality reduction, where we are given multiple objectives that need to be optimized simultaneously. Our model of multi-criteria dimensionality reduction captures several fairness criteria for dimensionality reduction motivated by the economics literature. Our technical contribution is to prove new low-rank properties of extreme point solutions to semi-definite programs, which give theoretical performance guarantees for our algorithms for multi-criteria dimensionality reduction. Finally, we perform experiments on real-world datasets indicating the effectiveness of the algorithms and demonstrating the empirical scalability of our proposed implementations in practice.
CHAPTER 1 INTRODUCTION
This thesis comprises two topics: subset selection and multi-criteria dimensionality reduction. For each topic, we give an introduction that includes the problem formulation and motivation, previous and related work, a summary of contributions, and future directions.
1.1 Diverse Subset Selection
1.1.1 Introduction
Choosing a diverse representative subset of items from a given large set arises in many settings such as feature selection [BMI13], sensor placement [JB09], matrix sparsification [BSS12a, SS11], and column subset selection in numerical linear algebra [AB13]. In statistics, this subset selection problem captures the classical problem of optimal design, also known as design of experiments [Fed72, Puk06]. Its real-world applications include efficient design of science experiments and CPU processors [WYS17], and material design [WUSK18]. In order to motivate the mathematical formulation of this problem, we first outline the motivation from the optimal design problem. We later present several applications of the mathematical formulation in the related work section.
Motivation and Problem Formulation from Optimal Design. In many settings of supervised learning, obtaining labels is costly, but the analyst has the ability to choose the pool of datapoints from which labels are obtained; this is also known as an active learning setting. The problem of optimal design is to choose the best small set of datapoints whose labels maximize the accuracy and confidence of the model learned from the labelled datapoints. The standard form of optimal design concerns the linear regression model, which is arguably the most fundamental concept in supervised machine learning.
Optimal design can be defined mathematically as follows. Let $v_1, v_2, \ldots, v_n \in \mathbb{R}^d$ be given unlabelled datapoints. We assume a linear regression model: there exists an unknown regression coefficient vector $x^* \in \mathbb{R}^d$ such that, for any $i \in [n]$, the label $y_i$ received from the datapoint $v_i$ satisfies

$$y_i = v_i^\top x^* + \eta_i$$

where $\eta_i$ is random i.i.d. Gaussian noise. The goal of optimal design is to approximate $x^*$ with the least error. We are allowed to choose at most $k$ of the design points, $S \subseteq [n]$, and observe $y_i = v_i \cdot x^* + \eta_i$ for each $i \in S$.

Suppose we have picked a subset $S \subseteq [n]$ of size $k$. Let $V_S$ be the $d$-by-$k$ matrix whose columns are $v_i$, $i \in S$, and let $y_S$ be the column vector of the $y_i$, $i \in S$. The best unbiased estimator $\hat{x}$ of $x^*$ is the least squares estimator

$$\hat{x} = \operatorname*{argmin}_{x \in \mathbb{R}^d} \| y_S - V_S^\top x \|_2^2,$$

which has the closed-form solution

$$\hat{x} = (V_S V_S^\top)^{-1} \sum_{i \in S} y_i v_i.$$

Suppose the $\eta_i$ are i.i.d. Gaussian with $\eta_i \sim N(0, \delta)$; then $\hat{x} - x^*$ is distributed as the $d$-dimensional Gaussian $\delta N(0, (V_S V_S^\top)^{-1})$. The matrix $\Sigma = (V_S V_S^\top)^{-1}$ therefore characterizes the error of the estimate, and the goal is to minimize $\Sigma$. Multiple criteria have been proposed for minimizing $\Sigma$. Some of the common ones are A-, D-, and E-optimal design, whose objectives are to minimize $\operatorname{tr} \Sigma$, $\det(\Sigma)$, and $\lambda_{\max}(\Sigma) = \|\Sigma\|_{\mathrm{spec}}$, respectively. Therefore, optimal design can be stated as a discrete optimization problem:

$$\min_{S \subseteq [n],\ |S| = k} f\left( (V_S V_S^\top)^{-1} \right) \qquad (1.1)$$

for a given criterion $f$ of interest.

Similarly to these variants of the objective, one may generalize the constraint beyond the cardinality constraint $|S| = k$. For example, each datapoint $v_i$ may belong to an experiment in one of $m$ laboratories, and each laboratory has its own size budget $k_i$. This is a partitioning constraint, where the $v_i$ are partitioned into $m$ sets, each of which has its own cardinality constraint. Though optimal design is motivated by statistics, the optimization (1.1) is general enough to capture many problems in other areas, including graph and network design, welfare economics, and diversity. We provide more details in the related work section.
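As a concrete illustration of the quantities above, the following sketch (assuming NumPy; the function names are ours, not from the thesis, and the noise scale is omitted) computes the least squares estimator and the A-, D-, and E-optimality objectives for a chosen subset S.

    import numpy as np

    def least_squares_estimate(V_S, y_S):
        # V_S: d x k matrix of chosen datapoints, y_S: length-k label vector
        # xhat = (V_S V_S^T)^{-1} V_S y_S
        return np.linalg.solve(V_S @ V_S.T, V_S @ y_S)

    def design_objectives(V_S):
        # Sigma = (V_S V_S^T)^{-1} is the error covariance (up to the noise scale)
        Sigma = np.linalg.inv(V_S @ V_S.T)
        return {
            "A": np.trace(Sigma),                  # average variance over directions
            "D": np.linalg.det(Sigma),             # volume of the confidence ellipsoid
            "E": np.linalg.eigvalsh(Sigma).max(),  # worst-direction variance
        }

    # toy usage: n = 6 points in d = 3 dimensions, a subset of size k = 4
    rng = np.random.default_rng(0)
    V = rng.normal(size=(3, 6))
    S = [0, 2, 3, 5]
    print(design_objectives(V[:, S]))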
Previous results. It is known that optimal design for the D and E criteria is NP-hard [ÇMI09]. As a result, work has focused on efficient approximation algorithms, both randomized and deterministic, for solving optimal design. Previous approaches to optimal design in statistics have no strong theoretical guarantees (only a guarantee with an approximation ratio depending on $n$ [AB13] exists). Existing common approaches studied in theory and used in practice include local search heuristics, such as Fedorov exchange [Fed+55], and approximate design, which solves the continuous relaxation of the problem and uses heuristic rounding. Recently, a new perspective on the optimal design problem through more sophisticated randomized rounding algorithms gave reasonable approximation ratio guarantees within polynomial running time [WYS17, ALSW17a].
1.1.2 Other Applications of Subset Selection and Related Work
As mentioned earlier, subset selection applies not only to optimal design but also to many other problems. This section lists some of those applications and related topics in some detail.
Welfare economics of indivisible goods. There are $t$ indivisible items to be distributed among $d$ individuals, and the utility of item $i$ to person $j$ is $p_{i,j}$. The utility $u_j$ of person $j$ is the sum of the utilities of the items person $j$ receives. One criterion for distributing items is to maximize the product of the $u_j$'s, also known as Nash social welfare [KN79]. Another is to maximize the minimum of the $u_j$'s, also known as the Santa Claus problem [BS06]. Both the Nash social welfare and Santa Claus problems are special cases of D- and E-optimal design with partitioning constraints, where each item is one partition containing $d$ vectors, one along each of the $d$ axes, and the budget for each partition is one.
Graph sparsification. Given a graph $G = (V, E)$, graph sparsification is the problem of finding a subset $S \subseteq E$ of size at most $k$ which retains the values of all graph cuts [BSS12a, SS11]. This is closely related to E-optimal design, where one wants to maximize the minimum eigenvalue of the sum of rank-1 matrices $v_i v_i^\top$, namely $\lambda_{\min}\left(\sum_{i \in S} v_i v_i^\top\right)$. To relate E-optimal design to graph sparsification, one can define an instance of E-optimal design with input vectors $e_{ij}$, $\{i, j\} \in E$. We note two stark differences between the two problems: E-optimal design requires an unweighted selection of edges, and graph sparsification requires a two-sided bound on the eigenvalues of $\sum_{i \in S} v_i v_i^\top$.

Network design. Similar to graph sparsification, we are given a graph $G = (V, E)$, which corresponds to an instance of an optimal design problem with input vectors $e_{ij}$. We want to pick a subset of edges $F \subseteq E$ so that the subgraph $H = (V, F)$ is well-connected. One measure of connectivity is effective resistance [GBS08, SS11], a notion of connectivity in between the two notions of edge connectivity and shortest path distance. The effective resistance in an electric circuit corresponds to the A-optimal objective [GBS08]. There is also another notion of connectivity, namely maximizing the number of spanning trees in the subgraph $H$ (see [LPYZ19] and references therein for other applications). This notion corresponds to maximizing the determinant of the covariance matrix of the selected vectors, i.e. the D-optimal design problem.
Diversity sampling. Intuitively, the analyst in the optimal design setting seeks to find a small set of datapoints that spreads over a wide region of space in order to maximize learning over the entire space. Optimal design naturally gives rise to a notion of diversity sampling, where one seeks to maximize the diversity of a smaller set from a given pool of items. Diversity sampling has many
Table 1.1: Summary of approximation ratios of optimal design from previous work and our work. Cells with an asterisk * indicate our results that improve the previous ones. No integrality gap results existed before our work.
Previous work Our work Lower Problems bound By relax- Combina- By relax- Combina- Integrality (for general ation torial ation torial gaps k, d) n d+1 k A-optimal, N/A k−d+1 k N/A k d+1 1 + c for k close to d − [NST19]* [NST19− ]* some small c [NST19]* n d+1 A-optimal, 1 + , for k−d+1 1 + , 1 + , 1 + , for d − d k >> d k Ω 2 for k for k k Ω ≥ ln≥1 d ln2 1 ≥ ≥ Ω d + Ω [NST19]* 2 3 [NST19 ]* [MSTX19 ]* n d+1 3 D-optimal, 1 + , − 1 + , 1 + , for N/A k d+1 2√2 − d k >> d for k for k k Ω [ÇMI09] ln≥1 ln≥1 ≥ Ω d + Ω d + [MSTX19]* 2 2 [NST19 ] 1 3 2(k 1) E-optimal, 1 + , for N/A N/A N/A 1 + , for 2 − d d k >> d k Ω 2 k Ω 2 [ÇMI09] ≥ [NST19≥ ]* connections with machine learning, such as determinantal point processes (DPPs) [KT+12] and fair representation of the data in machine learning [CDKV16].
1.1.3 Summary of Contributions
The work in this direction consists of two papers: proportional volume sampling, with Mohit Singh and Aleksandar Nikolov [NST19], and combinatorial algorithms for optimal design, with Mohit Singh, Vivek Madan, and Weijun Xie [MSTX19]. Results from both papers are summarized in Table 1.1. The first work [NST19] focuses on A-optimal design, yet we also show its applicability to D-design and an integrality gap for E-design. The second [MSTX19] shows approximation factors for the A- and D-optimal design problems. The bound for D-design is better than that for A-design, and is currently the best known. The integrality gap in Table 1.1 refers to the worst-case ratio between the optimum of the relaxation and (1.1), where the relaxation refers to relaxing the solution space $S \subseteq [n], |S| = k$ in (1.1) to $\pi \in \mathbb{R}^n$, $1 \ge \pi \ge 0$, $\sum_{i=1}^{n} \pi_i = k$, and replacing $V_S V_S^\top$ with $\sum_{i=1}^{n} \pi_i v_i v_i^\top$.

Proportional volume sampling. To solve the optimal design problem, [NST19] first solves the natural relaxation of optimal design, then uses the solution of the relaxation to define a novel distribution called proportional volume sampling. Sampling from this distribution provably obtains the best approximation ratio for A-optimal design and the best-known ratio for D-optimal design for $k \gg d$, as well as a $k$-approximation for any $k \le d$. [NST19] does not improve the approximation guarantee for E-optimal design, but shows a tight integrality gap result which implies that any rounding algorithm based on the natural relaxation cannot improve upon the previous work. Additionally, [NST19] also shows an integrality gap and the NP-hardness of A-optimal design.
Combinatorial algorithms. In [MSTX19], we give the first approximation guarantees that are independent of $n$ for optimal design via combinatorial algorithms, i.e. algorithms that do not rely on solving the convex relaxation of optimal design. The approximation ratio proven for D-optimal design is also the best in the literature. This work gives a theoretical underpinning of known simple heuristics [Fed+55] which are observed to work well in practice. The heuristics also avoid solving the convex relaxation, which in practice is observed to be the bottleneck compared to the existing rounding schemes [ALSW17a].
1.2 Fair Dimensionality Reduction
1.2.1 Introduction
Fairness in machine learning is a recent and growing area in computer science. There are many instances of machine learning algorithms' outputs that are perceived as biased or unfair by users. For example, Google Photos returns results for queries about CEOs with images that are overwhelmingly male and white [KMM15]; searches for black-sounding names trigger arrest-record advertisements with higher frequency than searches for white-sounding names [Swe13]; facial recognition has wildly different accuracy for white men than for dark-skinned women [BG18]; and recidivism prediction software labels low-risk African Americans as high-risk at higher rates than low-risk white people [ALMK18]. There are many speculations on the source of bias. The past literature focuses on either bias in the training data or in the algorithms (see the survey [CR18], for example). We (Jamie Morgenstern, Samira Samadi, Mohit Singh, Santosh Vempala, and I) identify another source of bias: the data processing [Sam+18]. Using one of the most common preprocessing algorithms, PCA (Principal Component Analysis, [Pea01, Jol86, Hot33, RSA99, IP91]), we show a gap in PCA's performance between majority and minority groups in real datasets. This gap persists even after reweighting the groups to have equal weights.

[Sam+18] and [Tan+19] propose the fair dimensionality reduction problem, which seeks to resolve this bias. The problem can be stated as follows.
Definition 1.2.1 (Fair dimensionality reduction). Given $m$ data points in $\mathbb{R}^n$ with subgroups $A$ and $B$, the fair PCA problem with social welfare objective $f$ is to find low-rank data $U$ by optimizing

$$\min_{U \in \mathbb{R}^{m \times n},\ \mathrm{rank}(U) \le d} f\left( \frac{1}{|A|} \|A - U_A\|_F^2,\ \frac{1}{|B|} \|B - U_B\|_F^2 \right) \qquad (1.2)$$

where $U_A$ and $U_B$ are the matrices with rows corresponding to the rows of $U$ for groups $A$ and $B$, respectively.
The choice of f depends on the context. One natural choice is to let f be the max of two reconstruction errors, which equalizes the error to both groups. Also, the problem can be naturally generalized to more than two groups when U is partitioned into k parts and f has k arguments.
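A small sketch of how the objective in (1.2) with $f = \max$ can be evaluated for a candidate rank-$d$ projection (assuming NumPy; the helper names are illustrative, not from the thesis):

    import numpy as np

    def group_reconstruction_errors(M, groups, P):
        # M: m x n data matrix, groups: length-m array of group labels,
        # P: n x d matrix with orthonormal columns (the low-rank map)
        errors = {}
        for g in np.unique(groups):
            Mg = M[groups == g]
            Ug = Mg @ P @ P.T                    # rank-d reconstruction of group g
            errors[g] = np.linalg.norm(Mg - Ug, "fro") ** 2 / len(Mg)
        return errors

    def fair_pca_objective(M, groups, P):
        # f = max of the groups' average reconstruction errors
        return max(group_reconstruction_errors(M, groups, P).values())

    # toy usage: top-d principal components of the pooled data as a baseline P
    rng = np.random.default_rng(1)
    M = rng.normal(size=(100, 10))
    groups = rng.integers(0, 2, size=100)
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    P = Vt[:4].T                                 # n x d with d = 4
    print(fair_pca_objective(M, groups, P))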
1.2.2 Related Work
This line of work is new, and therefore there is minimal related work directly comparable to ours. However, related work that is helpful in developing our algorithms is cited in the Summary of Contribution.
1.2.3 Summary of Contribution
Both [Sam+18] and [Tan+19] develop algorithms for approximately solving fair dimensionality reduction for a wide class of functions f. We summarize the algorithms and results as follows.
Convex relaxation and LP rounding.
[Sam+18] solves the convex relaxation of the fair dimensionality reduction problem for $f(u_A, u_B) = \max\{a u_A + \alpha,\ b u_B + \beta\}$ with real constants $a, b, \alpha, \beta$. The technique relies on solving the convex relaxation of the problem, defining a polytope whose points are guaranteed to perform as well as the optimum, and then rounding the fractional solution to an extreme point of that polytope. Using properties of LP duality, the solution is guaranteed to perform as well as the optimum, and violates the rank constraint by at most one dimension.
Convex relaxation and SDP rounding.
[Tan+19] generalizes and improves the theoretical guarantee of [Sam+18] to solving, for $k$ groups, any $f$ that is concave and decreasing in each group's reconstruction error. The technique also relies on the convex relaxation, and then defines a semi-definite cone, instead of a polytope, that maintains the objective value. We build on the low-rank property of extreme point solutions of an SDP by [Pat98] and show that the solution is guaranteed to perform as well as the optimum, and violates the rank constraint by at most $\lfloor \sqrt{2k + \frac{1}{4}} - \frac{3}{2} \rfloor$ dimensions. In particular, in the case of two groups, we can solve the problem exactly. [Tan+19] also generalizes iterative LP rounding [LRS11] to iterative SDP rounding and applies the result to fair dimensionality reduction. Additionally, [Tan+19] discusses some complexity results, including NP-hardness and an integrality gap of the convex relaxation formulation for the dimensionality reduction problem.
8 1.2.4 Fast implementations
Solving the SDP becomes slow when the number of original dimensions $n$ increases beyond moderate sizes (e.g. $n \approx 50$–$100$). We consider two alternative algorithms for solving the SDP: multiplicative weight update (MW) and Frank-Wolfe (FW).

MW has been considered and analyzed in [Sam+18] for solving this SDP. By the regret analysis from online learning theory [AHK12], the runtime of MW is $O\left(\frac{\log k}{\epsilon^2}\right)$ iterations of standard PCA for $k$ groups and a given desired error bound $\epsilon > 0$. In practice, MW can be tuned to obtain a runtime of $O(1)$ iterations of standard PCA, showing that incorporating fairness, for the fairness criteria to which MW applies, costs only a constant runtime overhead. In this thesis, we obtain MW through convex duality, yielding a primal-dual algorithm which gives a duality gap and hence a stopping condition with an error bound. We expand the experiments from [Sam+18] to large datasets to demonstrate the effectiveness of MW. We also propose FW, which performs better than MW for differentiable fairness objectives. Both algorithms can be tuned by using a more aggressive learning rate, giving heuristics for solving the SDP that are much faster than a standard SDP solver in practice.
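The following is a simplified sketch in the spirit of the MW heuristic described above (assuming NumPy); the learning rate, stopping rule, and the use of the last iterate rather than an averaged or rounded solution are our simplifications, not the exact procedure analyzed in the thesis. Each iteration is one standard PCA on a reweighted combination of the groups' second-moment matrices, and groups that are currently reconstructed worst get upweighted.

    import numpy as np

    def mw_fair_pca(M, groups, d, eta=0.5, iters=50):
        labels = np.unique(groups)
        # per-group second-moment matrices
        B = {g: (M[groups == g].T @ M[groups == g]) / np.sum(groups == g) for g in labels}
        w = np.ones(len(labels)) / len(labels)
        P = None
        for _ in range(iters):
            A = sum(w[i] * B[g] for i, g in enumerate(labels))
            eigvals, eigvecs = np.linalg.eigh(A)
            P = eigvecs[:, -d:]                        # best response: top-d eigenvectors
            losses = np.array([
                np.linalg.norm(M[groups == g] - M[groups == g] @ P @ P.T, "fro") ** 2
                / np.sum(groups == g)
                for g in labels
            ])
            w *= np.exp(eta * losses / (losses.max() + 1e-12))  # upweight worse-off groups
            w /= w.sum()
        return P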
1.2.5 Experiments
We run our algorithms on two real datasets: Default Credit data [YL09] and Adult Income data [UC]. We evaluate the performance of PCA solutions based on two fairness criteria motivated by welfare economics. Our results show that our algorithms perform significantly better than standard PCA on both fairness criteria. The experiment details can be found in Section 5.8. We also show how the two heuristics, MW and FW, scale to a large dataset in practice for several fairness criteria on multiple groups. We show their efficiency on Census data [AA], which has more than 600,000 datapoints partitioned into 16 groups and lies in thousands of dimensions. The details can be found in Section 5.9. The experiments and heuristics are publicly available at:
https://github.com/SDPforAll/multiCriteriaDimReduction.
1.3 Future Directions
1.3.1 Generalized Linear Models
Generalized linear models (GLMs) generalize the linear regression model: the mean of the response $y$ is not a linear function of the features $v_i^\top x^*$ but another function, called the mean function $\mu$, that depends on $v_i^\top x^*$. The function that relates the mean $\mu$ back to $v_i^\top x^*$ is called the link function, denoted by $g(\mu)$. An example of a GLM is logistic regression, where the mean is $\mu = \frac{1}{1 + e^{-v_i^\top x^*}}$ and the link function is $g(\mu) = \log \frac{\mu}{1 - \mu}$.

Optimal design for GLMs has been studied; see e.g. [SY12]. The optimal design objective is
$$\min_{S \subseteq [n],\ |S| = k} f\left( (V_S W_{x^*} V_S^\top)^{-1} \right) \qquad (1.3)$$

where $W_{x^*}$ is a diagonal matrix whose entries $w_i$, for $i = 1, \ldots, n$, are weights on each example $i$. The challenge is that the weights depend not only on the model (mean and link function) and the input vectors $v_i$, which are constants given as input to the problem, but also on $x^*$, which is unknown. If $x^*$ were known, then one could scale the vectors $v_i \mapsto \sqrt{w_i}\, v_i$ and apply a standard optimal design algorithm. This gives rise to several ways to optimize (1.3).
One way is to assume a prior on $x^*$, such as a Gaussian centered at zero. This is also known as Bayesian experimental design, and it reduces to optimal design with an $\ell_2$ regularizer. We treat this topic in more detail, including giving our guarantee, in Section 3.9.
One may also assume boundary conditions on $x^*$, e.g. that it lies inside a box $[-M, M]^d$ for some real $M > 0$. Then, one can solve (1.3) as a robust optimization: optimize (1.3) over the worst-case $x^*$ within this boundary.
Another way is to estimate $x^*$ and use the estimate to approximately solve (1.3), then use the responses from the design to get a better estimate of $x^*$. This is similar to alternating minimization, where one alternately minimizes (1.3) with respect to the subset $S$ and with respect to $x^*$. Besides Bayesian optimal design, which reduces to optimal design with a regularizer, it is unknown whether good approximations exist, and whether our approach in this thesis can lead to any theoretical results. This remains an open direction for the future.
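As an illustration of the scaling idea, the sketch below (assuming NumPy) computes the logistic-regression mean at a current estimate of $x^*$ and scales each design vector by the square root of its weight. The weight formula $w_i = \mu_i (1 - \mu_i)$ is the standard Fisher-information weight for logistic regression; it is our addition for concreteness and is not stated in the text above.

    import numpy as np

    def logistic_design_vectors(V, x_est):
        # V: d x n design points, x_est: current estimate of x*
        mu = 1.0 / (1.0 + np.exp(-(V.T @ x_est)))   # mean function of logistic regression
        w = mu * (1.0 - mu)                         # assumed Fisher-information weights
        return V * np.sqrt(w)                       # column i becomes sqrt(w_i) * v_i

    # any standard A-/D-optimal design routine can then be run on the scaled vectors.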
1.3.2 Ridge Regression
We showed that our algorithm can solve an objective motivated by the expected square loss of ridge regression (see Inequality (3.47) and its derivation). However, the objective relies on an assumption on the regularization parameter $\lambda$ and is an upper bound on the true square loss, not exact. It remains an open question whether an approximation algorithm exists, for general $\lambda$, for minimizing the square loss of ridge regression directly.
1.3.3 Other Applications
To show the approximability of A-optimal design, we give an efficient implementation of proportional volume sampling and its generalizations for a large class of parameters (Theorem 3.6.2 and other results in Section 3.6 in general, and the regularized version in Section 3.9.7). It remains an open question to which other problems the efficiency of these classes of sampling may be applicable.
CHAPTER 2 PRELIMINARIES
To show approximation results and lower bounds on approximation for diverse subset selection problems, we utilize the technique of relaxation and duality. Here we list the applications of relaxations and integrality gaps to the A- and D-optimal design problems. In general, it is crucial that the relaxations are convex problems to allow for efficient solvability of the relaxations. It is known that the relaxations of A- and D-optimal design are convex ([BV04]). We note that though the objective of D-optimal design is not concave, its logarithm is; hence, the relaxation is still efficiently solvable. For details of convex duality in general, we refer readers to [BV04].
2.1 Convex Relaxation and its Dual of A-optimal Design
We first consider the convex relaxation for the A-optimal design problem, given below for the settings without and with repetitions. The difference in the setting with repetitions is that there is no upper bound on the value of $x_i$ (the same holds for the other optimal design criteria). This relaxation is classical, and already appears in, e.g., [Che52]. It is easy to see that the objective $\operatorname{tr}\left(\sum_{i=1}^{n} x_i v_i v_i^\top\right)^{-1}$ is convex ([BV04], section 7.5).
(a) A-REL(V):
$$\min_{x \in \mathbb{R}^n} \ \operatorname{tr}\left( \sum_{i=1}^{n} x_i v_i v_i^\top \right)^{-1} \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i \le k, \qquad x_i \ge 0 \ \ \forall i \in [n]$$

(b) A-REL-DUAL(V):
$$\max_{\lambda \in \mathbb{R},\ Y \in \mathbb{R}^{d \times d}} \ 2 \operatorname{tr}\left( Y^{1/2} \right) - k\lambda \quad \text{s.t.} \quad \lambda - v_i^\top Y v_i \ge 0 \ \ \forall i \in [n], \qquad Y \succeq 0$$

(c) Convex relaxation and its dual for the A-DESIGN problem
With repetition and without repetition:
$$\min \ \operatorname{tr}\left( \sum_{i=1}^{n} x_i v_i v_i^\top \right)^{-1} \qquad (2.1)$$
$$\text{s.t.} \quad \sum_{i=1}^{n} x_i = k \qquad (2.2)$$
$$0 \le x_i \ \ \forall i \in [n] \ \text{(with repetition)}, \qquad 0 \le x_i \le 1 \ \ \forall i \in [n] \ \text{(without repetition)} \qquad (2.3)$$
Let us denote the optimal value of (2.1)–(2.3) by CP. By plugging in the indicator vector of an optimal integral solution for $x$, we see that $\mathrm{CP} \le \mathrm{OPT}$, where OPT denotes the value of the optimal solution. We also present the dual for A-optimal design in Figure 2.1c.
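A sketch of solving the relaxation above with an off-the-shelf convex modeling tool (we assume the CVXPY library here; the thesis does not prescribe a solver). The atom matrix_frac(I, M) equals tr(M^{-1}) and is recognized by CVXPY as convex.

    import cvxpy as cp
    import numpy as np

    def solve_a_rel(V, k, with_repetition=True):
        d, n = V.shape
        x = cp.Variable(n, nonneg=True)
        M = sum(x[i] * np.outer(V[:, i], V[:, i]) for i in range(n))
        objective = cp.Minimize(cp.matrix_frac(np.eye(d), M))   # tr(M^{-1})
        constraints = [cp.sum(x) == k]
        if not with_repetition:
            constraints.append(x <= 1)                           # upper bound as in (2.3)
        cp.Problem(objective, constraints).solve()
        return x.value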
2.2 Convex Relaxation of D-optimal Design
We describe the relaxation of D-DESIGN with repetitions in Figure 2.2c below. Let $\phi^D_f$ denote the common optimum value of (D-REL) and its dual (D-REL-DUAL). Let $I^\star$ denote the indices of the vectors in the optimal solution and let $\phi^D = \det\left( \sum_{i \in I^\star} v_i v_i^\top \right)^{\frac{1}{d}}$ be its objective. Then we have that $\phi^D_f \ge \log \phi^D$ by plugging the indicator vector of an optimal integral solution for $x$ into the relaxation. We also present the convex relaxation and its dual without repetition in Figure 2.3c.
(a) D-REL(V):
$$\max_{x \in \mathbb{R}^n} \ \frac{1}{d} \log \det\left( \sum_{i=1}^{n} x_i v_i v_i^\top \right) \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i \le k, \qquad x_i \ge 0 \ \ \forall i \in [1, n]$$

(b) D-REL-DUAL(V):
$$\min_{\mu \in \mathbb{R},\ Y \in \mathbb{R}^{d \times d}} \ \frac{1}{d} \log \det(Y) + \frac{k}{d}\mu - 1 \quad \text{s.t.} \quad \mu - v_i^\top Y^{-1} v_i \ge 0 \ \ \forall i \in [1, n], \qquad Y \succeq 0$$

(c) Convex relaxation and its dual for the D-DESIGN problem
2.3 Integrality Gaps
In this thesis, we use integrality gaps to show tightness of approximation and to show approximation factors without solving the convex relaxation, also known as dual-fitting. We refer readers to [Vaz13] for more examples and the significance of integrality gaps, including more uses of dual-fitting techniques in approximation algorithms.
(a) D-REL(V) without repetitions:
$$\max_{x \in \mathbb{R}^n} \ \frac{1}{d} \log \det\left( \sum_{i=1}^{n} x_i v_i v_i^\top \right) \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i \le k, \qquad 1 \ge x_i \ge 0 \ \ \forall i \in [1, n]$$

(b) D-REL-DUAL(V) without repetitions:
$$\min_{\mu \in \mathbb{R},\ \eta \in \mathbb{R}^n,\ Y \in \mathbb{R}^{d \times d}} \ \frac{1}{d} \log \det(Y) + \frac{k}{d}\mu + \frac{1}{d} \sum_{i=1}^{n} \eta_i - 1 \quad \text{s.t.} \quad \mu + \eta_i - v_i^\top Y^{-1} v_i \ge 0, \ \ \eta_i \ge 0 \ \ \forall i \in [1, n], \qquad Y \succeq 0$$

(c) Convex relaxation and its dual for the D-DESIGN problem without repetitions
2.3.1 Tightness of Approximations.
To show tightness of approximation, we lower-bound the achievable approximation by showing integrality gaps of the relaxations. In particular, if the relaxation of a problem has integrality gap $\alpha$, any rounding method based on the relaxation achieves an approximation no better than a factor of $\alpha$. We use this to show the tightness of the relaxation of E-optimal design and a lower bound on the approximation of A-optimal design in this thesis. We also show that the relaxation of Fair Dimensionality Reduction is not tight, through the existence of a gap.
2.3.2 Dual-Fitting
Convex relaxations have corresponding dual problems. For most well-behaved convex problems, strong duality holds, i.e. the optimum of the dual equals the primal optimum. Hence, one can use dual feasible (and not necessarily optimal) solutions to bound the primal optimum. Here we give the primals and duals of the relaxations of the A- and D-optimal design problems, and note that strong duality holds in both cases.
2.4 Local Search and Greedy Algorithms
The local search algorithm to maximize an objective $f$ over sets $S \in \mathcal{P}$, where $\mathcal{P}$ is a given collection of sets of the same size, starts with any initial feasible solution in $\mathcal{P}$. Then, in each step, it checks whether any swap of an element increases $f$, i.e. whether deleting one current element of $S$ and adding another one increases $f$ while keeping the new set in $\mathcal{P}$.

The greedy algorithm to maximize an objective $f$ over sets $S \in \mathcal{P}$, where $\mathcal{P}$ is a given collection of sets, starts with any initial feasible solution in $\mathcal{P}$ (usually the empty set). In each step, the algorithm adds the element that increases $f$ by the largest amount. The algorithm stops when it reaches a terminating condition required to remain feasible; for example, if $\mathcal{P}$ consists of sets of size at most $k$, then it terminates when the set reaches size $k$.

Local search and greedy algorithms are among the most basic combinatorial algorithms. They do not necessarily have theoretical guarantees, yet sometimes the algorithms (or their modifications) do for some problems despite their simplicity. We refer readers to [Vaz13, CLRS09] for examples of their applications in approximation algorithms.
CHAPTER 3 SAMPLING-BASED APPROXIMATION ALGORITHM FOR SUBSET SELECTION
3.1 Introduction
Given a collection of vectors, a common problem is to select a subset of size $k \le n$ that represents the given vectors. To quantify the representativeness of the chosen set, one typically considers spectral properties of certain natural matrices defined by the vectors. Such problems arise as experimental design ([Fed72, Puk06]) in statistics; feature selection ([BMI13]) and sensor placement problems ([JB09]) in machine learning; and matrix sparsification ([BSS12a, SS11]) and column subset selection ([AB13]) in numerical linear algebra. In this work, we consider the optimization
problem of choosing the representative subset that aims to optimize the A-optimality criterion in experimental design. Experimental design is a classical problem in statistics ([Puk06]) with recent applications in
machine learning ([JB09, WYS16]). Here the goal is to estimate an unknown vector $w \in \mathbb{R}^d$ via linear measurements of the form $y_i = v_i^\top w + \eta_i$, where the $v_i$ are possible experiments and $\eta_i$ is assumed to be small i.i.d. unbiased Gaussian error introduced in the measurement. Given a set $S$ of linear measurements, the maximum likelihood estimate $\hat{w}$ of $w$ can be obtained via a least squares computation. The error vector $w - \hat{w}$ has a Gaussian distribution with mean $0$ and covariance matrix $\left( \sum_{i \in S} v_i v_i^\top \right)^{-1}$. In the optimal experimental design problem the goal is to pick a cardinality-$k$ set $S$ out of the $n$ vectors such that the measurement error is minimized. Minimality is measured according to different criteria, which quantify the "size" of the covariance matrix. In this thesis, we study the classical A-optimality criterion, which aims to minimize the average variance over directions, or equivalently the trace of the covariance matrix, which is also the expectation of the squared Euclidean norm of the error vector $w - \hat{w}$.
Table 3.1: Summary of approximation ratios of our A-optimal results. We list the best applicable previous work for comparison. The cells with an asterisk * indicate that the ratios are tight, with a matching integrality gap of the convex relaxation (2.1)–(2.3).

Case $k = d$: our result $d$ *; previous work $n - d + 1$ ([AB13]).
Asymptotic $k \gg d$, without repetition: our result $1 + \epsilon$ for $k \ge \Omega\left(\frac{d}{\epsilon} + \frac{\log 1/\epsilon}{\epsilon^2}\right)$; previous work $1 + \epsilon$ for $k \ge \Omega\left(\frac{d}{\epsilon^2}\right)$ ([ALSW17a]).
Arbitrary $k$ and $d$, with repetition: our result $\frac{k}{k - d + 1}$ *; previous work $n - d + 1$ ([AB13]).
Asymptotic $k \gg d$, with repetition: our result $1 + \epsilon$ for $k \ge d + \frac{d}{\epsilon}$ *; previous work $1 + \epsilon$ for $k \ge \Omega\left(\frac{d}{\epsilon^2}\right)$ ([ALSW17a]).
We let $V$ denote the $d \times n$ matrix whose columns are the vectors $v_1, \ldots, v_n$ and $[n] = \{1, \ldots, n\}$. For any set $S \subseteq [n]$, we let $V_S$ denote the $d \times |S|$ submatrix of $V$ whose columns correspond to the vectors indexed by $S$. Formally, in the A-optimal design problem our aim is to find a subset $S$ of cardinality $k$ that minimizes the trace of $(V_S V_S^\top)^{-1} = \left( \sum_{i \in S} v_i v_i^\top \right)^{-1}$. We also consider the A-optimal design problem with repetitions, where the chosen $S$ can be a multi-set, thus allowing a vector to be chosen more than once.

Apart from experimental design, the above formulation finds application in other areas such as sensor placement in wireless networks ([JB09]), sparse least squares regression ([BDM11]), feature selection for $k$-means clustering ([BMI13]), and matrix approximation ([AB13]). For example, in matrix approximation ([HM07, HM11, AB13]), given a $d \times n$ matrix $V$, one aims to select a set $S$ of $k$ of its columns such that the Frobenius norm of the Moore-Penrose pseudoinverse of the selected matrix $V_S$ is minimized. It is easy to observe that this objective equals the A-optimality criterion for the vectors given by the columns of $V$.
3.1.1 Our Contributions and Results
Our main contribution is to introduce the proportional volume sampling class of probability mea-
sures to obtain improved approximation algorithms for the A-optimal design problem. We obtain
improved algorithms for the problem with and without repetitions in regimes where $k$ is close to $d$ as well as in the asymptotic regime where $k \ge d$. The improvement is summarized in Table 3.1.
Let $\mathcal{U}_k$ denote the collection of subsets of $[n]$ of size exactly $k$ and $\mathcal{U}_{\le k}$ denote the subsets of $[n]$ of size at most $k$. We will consider distributions on sets in $\mathcal{U}_k$ as well as $\mathcal{U}_{\le k}$ and state the following definition more generally.
Definition 3.1.1. Let $\mu$ be a probability measure on sets in $\mathcal{U}_k$ (or $\mathcal{U}_{\le k}$). Then the proportional volume sampling with measure $\mu$ picks a set $S \in \mathcal{U}_k$ (or $\mathcal{U}_{\le k}$) with probability proportional to $\mu(S) \det(V_S V_S^\top)$.
Observe that when $\mu$ is the uniform distribution and $k \le d$, then we obtain standard volume sampling, where one picks a set $S$ proportional to $\det(V_S V_S^\top)$, or, equivalently, to the volume of the parallelepiped spanned by the vectors indexed by $S$. The volume sampling measure was first introduced by [DRVW06] for low-rank matrix approximation (with an optimal guarantee in [DV06]). It has received much attention, and efficient algorithms are known for sampling from it ([DR10, GS12]). More recently, efficient algorithms were obtained even when $k \ge d$ ([LJS17, SX18]). We discuss the computational issues of sampling from proportional volume sampling in Lemma 3.1.10 and Section 3.6.2.
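For intuition, a brute-force sketch of Definition 3.1.1 over sets of size exactly k (exponential in n and for illustration only; the efficient implementations are the subject of Section 3.6). It assumes NumPy and takes mu as an explicit table of weights.

    import numpy as np
    from itertools import combinations

    def proportional_volume_sample(V, k, mu, rng=np.random.default_rng()):
        # V: d x n matrix; mu: dict mapping frozenset(S) -> nonnegative weight mu(S)
        n = V.shape[1]
        sets, weights = [], []
        for S in combinations(range(n), k):
            VS = V[:, S]
            w = mu.get(frozenset(S), 0.0) * np.linalg.det(VS @ VS.T)
            sets.append(S)
            weights.append(w)
        weights = np.array(weights) / np.sum(weights)
        return sets[rng.choice(len(sets), p=weights)]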
Our first result shows that approximating the A-optimal design problem can be reduced to finding distributions on $\mathcal{U}_k$ (or $\mathcal{U}_{\le k}$) that are approximately independent. First, we define the exact formulation of approximate independence needed in our setting.
Definition 3.1.2. Given integers $d \le k \le n$ and a vector $x \in [0, 1]^n$ such that $\mathbf{1}^\top x = k$, we call a measure $\mu$ on sets in $\mathcal{U}_k$ (or $\mathcal{U}_{\le k}$) $\alpha$-approximate $(d-1, d)$-wise independent with respect to $x$ if for any subsets $T, R \subseteq [n]$ with $|T| = d - 1$ and $|R| = d$, we have

$$\frac{\Pr_{S \sim \mu}[T \subseteq S]}{\Pr_{S \sim \mu}[R \subseteq S]} \le \alpha \, \frac{x^T}{x^R}$$

where $x^L := \prod_{i \in L} x_i$ for any $L \subseteq [n]$. We omit "with respect to $x$" when the context is clear.

Observe that if the measure $\mu$ corresponds to picking each element $i$ independently with probability $x_i$, then $\frac{\Pr_{S \sim \mu}[T \subseteq S]}{\Pr_{S \sim \mu}[R \subseteq S]} = \frac{x^T}{x^R}$. However, this distribution has support on all sets and not just sets in $\mathcal{U}_k$ or $\mathcal{U}_{\le k}$, so it is not allowed by the definition above.

Our first result reduces the search for approximation algorithms for A-optimal design to the construction of approximate $(d-1, d)$-wise independent distributions. This result generalizes the connection between volume sampling and A-optimal design established in [AB13] to proportional volume sampling, which allows us to exploit the power of the convex relaxation and get a significantly improved approximation.
Theorem 3.1.3. Given integers $d \le k \le n$, suppose that for any vector $x \in [0, 1]^n$ such that $\mathbf{1}^\top x = k$ there exists a distribution $\mu$ on sets in $\mathcal{U}_k$ (or $\mathcal{U}_{\le k}$) that is $\alpha$-approximate $(d-1, d)$-wise independent. Then proportional volume sampling with measure $\mu$ gives an $\alpha$-approximation algorithm for the A-optimal design problem.

In the above theorem, we in fact only need an approximately independent distribution $\mu$ for the optimal solution $x$ of the natural convex relaxation for the problem, which is given in (2.1)–(2.3). The result also bounds the integrality gap of the convex relaxation by $\alpha$. Theorem 3.1.3 is proved in Section 3.2.
Theorem 3.1.3 reduces our aim to constructing distributions that have approximate $(d-1, d)$-wise independence. One way to construct such a distribution is through a general class of hard-core distributions, defined as follows.

Definition 3.1.4. We call a distribution $\mu$ on $\mathcal{U}_k$ (or $\mathcal{U}_{\le k}$) a hard-core distribution with parameter $\lambda \in \mathbb{R}^n_+$ if $\mu(S) \propto \lambda^S := \prod_{i \in S} \lambda_i$ for each set in $\mathcal{U}_k$ (or $\mathcal{U}_{\le k}$).

Convex duality implies that hard-core distributions have the maximum entropy among all distributions which match the marginals of $\mu$ ([BV04]). Observe that, while $\mu$ places non-zero probability on exponentially many sets, it is enough to specify $\mu$ succinctly by describing $\lambda$. Hard-core distributions over various structures, including spanning trees ([GSS11]) or matchings ([Kah96, Kah00]) in a graph, display approximate independence, and this has found use in combinatorics as well as in algorithm design. Following this theme, we show that certain hard-core distributions on $\mathcal{U}_k$ and $\mathcal{U}_{\le k}$ exhibit approximate $(d-1, d)$-wise independence when $k = d$ and in the asymptotic regime when $k \gg d$.
Theorem 3.1.5. Given integers $d \le k \le n$ and a vector $x \in [0, 1]^n$ such that $\mathbf{1}^\top x = k$, there exists a hard-core distribution $\mu$ on sets in $\mathcal{U}_k$ that is $d$-approximate $(d-1, d)$-wise independent when $k = d$. Moreover, for any $\epsilon > 0$, if $k = \Omega\left(\frac{d}{\epsilon} + \frac{1}{\epsilon^2} \log \frac{1}{\epsilon}\right)$, then there is a hard-core distribution $\mu$ on $\mathcal{U}_{\le k}$ that is $(1+\epsilon)$-approximate $(d-1, d)$-wise independent. Thus we obtain a $d$-approximation algorithm for the A-optimal design problem when $k = d$ and a $(1+\epsilon)$-approximation algorithm when $k = \Omega\left(\frac{d}{\epsilon} + \frac{1}{\epsilon^2} \log \frac{1}{\epsilon}\right)$.

The above theorem relies on two natural hard-core distributions. In the first one, we consider the hard-core distribution with parameter $\lambda = x$ on sets in $\mathcal{U}_k$, and in the second we consider the hard-core distribution with parameter $\lambda = \frac{(1-\epsilon)x}{1 - (1-\epsilon)x}$ (defined coordinate-wise) on sets in $\mathcal{U}_{\le k}$. We prove the theorem in Section 3.3.
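Concretely, the two hard-core weightings in the theorem can be written down from a fractional solution x of the relaxation and plugged into the brute-force sampler sketched after Definition 3.1.1 (illustrative code only; for the asymptotic regime the sampler should range over sets of size at most k rather than exactly k).

    import numpy as np
    from itertools import combinations

    def hardcore_mu(lam, n, k):
        # mu(S) proportional to prod_{i in S} lam_i, here over subsets of size exactly k
        return {frozenset(S): float(np.prod([lam[i] for i in S]))
                for S in combinations(range(n), k)}

    def lambdas_from_relaxation(x, eps=None):
        x = np.asarray(x, dtype=float)
        if eps is None:            # the k = d regime: lambda = x
            return x
        y = (1 - eps) * x          # the asymptotic regime: lambda = (1-eps)x / (1-(1-eps)x)
        return y / (1 - y)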
Our techniques also apply to the A-optimal design problem with repetitions, where we obtain an even stronger result, described below. The main idea is to introduce multiple, possibly exponentially many, copies of each vector, depending on the fractional solution, and then apply proportional volume sampling to obtain the following result.
Theorem 3.1.6. For all $k \ge d$ and $0 < \epsilon \le 1$, there is a $\left(\frac{k}{k - d + 1} + \epsilon\right)$-approximation algorithm for the A-optimal design problem with repetitions. In particular, there is a $(1+\epsilon)$-approximation when $k \ge d + \frac{d}{\epsilon}$.

We remark that the integrality gap of the natural convex relaxation is at least $\frac{k}{k - d + 1}$ (see Section 3.7.2) and thus the above theorem results in an exact characterization of the integrality gap of the convex program (2.1)–(2.3), stated in the following corollary. The proof of Theorem 3.1.6 appears in Section 3.6.3.

Corollary 3.1.7. For any integers $k \ge d$, the integrality gap of the convex program (2.1)–(2.3) for the A-optimal design problem with repetitions is exactly $\frac{k}{k - d + 1}$.
We also show that A-optimal design is NP-hard for k = d and moreover, hard to approximate within a constant factor.
Theorem 3.1.8. There exists a constant c > 1 such that the A-optimal design problem is NP-hard to c-approximate when k = d.
The $k \le d$ case. The A-optimal design problem has a natural extension to choosing fewer than $d$ vectors: our objective in this case is to select a set $S \subseteq [n]$ of size $k$ so that we minimize $\sum_{i=1}^{k} \lambda_i^{-1}$, where $\lambda_1, \ldots, \lambda_k$ are the $k$ largest eigenvalues of the matrix $V_S V_S^\top$. While this problem no longer corresponds to minimizing the variance in an experimental design setting, we will abuse terminology and still call it the A-optimal design problem. This is a natural formulation of the geometric problem of picking a set of vectors which are as "spread out" as possible. If $v_1, \ldots, v_n$ are the points in a dataset, we can see an optimal solution as a maximally diverse representative sample of the dataset. Similar problems, but with a determinant objective, have been widely studied in computational geometry, linear algebra, and machine learning: for example the largest volume simplex problem, and the maximum subdeterminant problem (see [Nik15] for references to prior work). [ÇMI09] also studied an analogous problem with the sum in the objective replaced by a maximum (which extends E-optimal design).

While our rounding extends easily to the $k \le d$ regime, coming up with a convex relaxation becomes less trivial. We do find such a relaxation and obtain the following result, whose proof appears in Section 3.5.1.
Theorem 3.1.9. There exists a poly$(d, n)$-time $k$-approximation algorithm for the A-optimal design problem when $k \le d$.
A, D, E, G, T , each corresponding to a different function of the covariance matrix of the error w − wˆ. A natural question is whether they all behave similarly in terms of approximation algorithms. Indeed, recent results of [ALSW17a, ALSW17b] and [WYS16] give the (1 + )-approximation
2 algorithm in the asymptotic regime, k Ω d and k Ω d , for many of these variants. ≥ 2 ≥ In contrast, we show the optimal bounds that can be obtained via the standard convex relaxation are different for different objectives. We show that for the E-optimality criterion (in which we minimize the largest eigenvalue of the covariance matrix) getting a (1 + )-approximation with the
d natural convex relaxation requires k = Ω( 2 ), both with and without repetitions. This is in sharp contrast to results we obtain here for A-optimality. Thus, different criteria behave differently in terms of approximability. Our proof of the integrality gap (in Section 3.7.1) builds on a connection to spectral graph theory and in particular on the Alon-Boppana bound ([Alo86, Nil91]). We prove an Alon-Boppana style bound for the unnormalized Laplacian of not necessarily regular graphs with a given average degree.
Computational Issues. While it is not clear whether sampling from proportional volume sampling is possible under general assumptions (for example, given a sampling oracle for $\mu$), we obtain an efficient sampling algorithm when $\mu$ is a hard-core distribution.
Lemma 3.1.10. There exists a poly$(d, n)$-time algorithm that, given a $d \times n$ matrix $V$, an integer $k \le n$, and a hard-core distribution $\mu$ on sets in $\mathcal{U}_k$ (or $\mathcal{U}_{\le k}$) with parameter $\lambda$, efficiently samples a set from the proportional volume measure defined by $\mu$.
When $k \le d$ and $\mu$ is a hard-core distribution, proportional volume sampling can be implemented via standard volume sampling after scaling the vectors appropriately. When $k > d$, such a method does not suffice and we appeal to properties of hard-core distributions to obtain the result. We also present an efficient implementation of Theorem 3.1.6 which runs in time polynomial in $\log(1/\epsilon)$. This requires more work since the basic description of the algorithm involves implementing proportional volume sampling on an exponentially-sized ground set. This is done in Section 3.6.3. We also outline efficient deterministic implementations of the algorithms in Theorems 3.1.5 and 3.1.6 in Sections 3.6.2 and 3.6.4.
3.1.2 Related Work
Experimental design is the problem of maximizing the information obtained from selecting subsets of experiments to perform, which is equivalent to minimizing the covariance matrix $\left( \sum_{i \in S} v_i v_i^\top \right)^{-1}$. We focus on A-optimality, one of the criteria that has been studied intensely. We restrict our attention to approximation algorithms for these problems and refer the reader to [Puk06] for a broad survey on experimental design.
[AB13] studied the A- and E-optimal design problems and analyzed various combinatorial algorithms and algorithms based on volume sampling, and achieved an approximation ratio of $\frac{n - d + 1}{k - d + 1}$. [WYS16] found connections between optimal design and matrix sparsification, and used these connections to obtain a $(1+\epsilon)$-approximation when $k \ge \frac{d^2}{\epsilon}$, and also approximation algorithms under certain technical assumptions. More recently, [ALSW17a, ALSW17b] obtained a $(1+\epsilon)$-approximation when $k = \Omega\left(\frac{d}{\epsilon^2}\right)$, both with and without repetitions. We remark that their result also applies to other criteria such as E- and D-optimality, which aim to maximize the minimum eigenvalue and the geometric mean of the eigenvalues of $\sum_{i \in S} v_i v_i^\top$, respectively. More generally, their result applies to any objective function that satisfies certain regularity criteria.
d 1 1 all k and d, and (1+)-approximation algorithm when k = Ω( + 2 log ), with a weaker condition of k 2d if repetitions are allowed. The D-optimality criterion when k d has also been ≥ ≤ extensively studied. It captures maximum a-posteriori inference in constrained determinantal point process models ([KT+12]), and also the maximum volume simplex problem. [Nik15], improving on a long line of work, gave a e-approximation. The problem has also been studied under more
24 general matroid constraints rather than cardinality constraints ([NS16, AG17, SV17]).
[ÇMI09] also studied several related problems in the k d regime, including D- and E- ≤ optimality. We are not aware of any prior work on A-optimality in this regime. The criterion of E-optimality, whose objective is to maximize the minimum eigenvalue of
i S vivi>, is closely related to the problem of matrix sparsification ([BSS12a, SS11]) but incom- ∈ Pparable. In matrix sparsification, we are allowed to weigh the selected vectors, but need tobound both the largest and the smallest eigenvalue of the matrix we output. The restricted invertibility principle was first proved in the work of [BT87], and was later strengthened by [Ver01], [SS10], and [NY17]. Spielman and Srivastava gave a deterministic al- gorithm to find the well-invertible submatrix whose existence is guaranteed by the theorem. Be- sides its numerous applications in geometry (see [Ver01] and [You14]), the principle has also found applications to differential privacy ([NTZ16]), and to approximation algorithms for discrep- ancy ([NT15]).
Volumesampling [DRVW06] where a set S is sampled with probability proportional to det(VSVS>) has been studied extensively and efficient algorithms were given by [DR10] and improved by [GS12]. The probability distribution is also called a determinantal point process (DPP) and finds many ap- plications in machine learning ([KT+12]). Recently, fast algorithms for volume sampling have been considered in [DW17a, DW17b].
While NP-hardness is known for the D- and E-optimality criteria ([ÇMI09]), to the best of our knowledge no NP-hardness for A-optimality was known prior to our work. Proving such a hardness result was stated as an open problem in [AB13].
Restricted Invertibility Principle for Harmonic Mean. As an application of Theorem 3.1.9, we prove a new restricted invertibility principle (RIP) ([BT87]) for the harmonic mean of singular
values. The RIP is a robust version of the elementary fact in linear algebra that if V is a d n rank × r matrix, then it has an invertible submatrix V for some S [n] of size r. The RIP shows that if S ⊆
25 V has stable rank r, then it has a well-invertible submatrix consisting of Ω(r) columns. Here the stable rank of V is the ratio ( V 2 / V 2), where = tr(VV ) is the Hilbert-Schmidt, k kHS k k k ∙ kHS > or Frobenius, norm of V , and is the operator norm. The classicalp restricted invertibility princi- k ∙ k ple ([BT87, Ver01, SS10]) shows that when the stable rank of V is r, then there exists a subset of its columns S of size k = Ω(r) so that the k-th singular value of V is Ω( V /√m). [Nik15] S k kHS showed there exists a submatrix VS of k columns of V so that the geometric mean its top k singular values is on the same order, even when k equals the stable rank. We show an analogous result for the harmonic mean when k is slightly less than r. While this is implied by the classical restricted invertibility principle, the dependence on parameters is better in our result for the harmonic mean.
For example, when k = (1 )r, the harmonic mean of squared singular values of V can be made − S at least Ω( V 2 /m), while the tight restricted invertibility principle of Spielman and Srivas- k kHS tava ([SS11]) would only give 2 in the place of . See Theorem 3.5.4 for the precise formulation of our restricted invertibility principle.
3.1.3 Problem Variants
We discuss several generalizations of the A-optimal objective and the corresponding modifications to the algorithms and our results in this chapter. We summarize these in Table 3.2. Here, E_ℓ(M) denotes the elementary symmetric polynomial of degree ℓ in the eigenvalues of the matrix M.
Our first variant is the case k ≤ d, where we generalize the d-approximation when k = d to a k-approximation when k ≤ d. The objective is modified accordingly, as k selected vectors span only k and not d dimensions. Details can be found in Section 3.5.1. We also generalize the A-optimal design objective to the generalized ratio objective, a special case of which has been considered by [MS17]. We show that all approximation results for A-optimal design apply to this setting, with a better bound on k. This generalization includes D-optimal design, and hence proportional volume sampling also gives approximation algorithms for D-design. A summary of the approximation results is given in Table 3.3 in Section 3.5.3, which also includes details of this variant.
Table 3.2: Summary of variants of objectives for A-optimal design problems and their corresponding modifications of algorithms and approximation guarantees
Variants | Objectives | Sampling distributions | Modification of the results | Integrality gaps
Original | tr(V_S V_S^⊤)^{-1} | μ(S) ∝ det(V_S V_S^⊤) | N/A | k/(k−d+1)
k ≤ d | E_{k−1}(Σ_{i=1}^n x_i v_i v_i^⊤) / E_k(Σ_{i=1}^n x_i v_i v_i^⊤) | μ(S) ∝ E_k(V_S V_S^⊤) | k-approximation instead of d |
Generalized ratio | E_{l′}(V_S V_S^⊤) / E_l(V_S V_S^⊤) | μ(S) ∝ E_l(V_S V_S^⊤) | Replace d with l in the bounds of k | k/(k−l+1)
Ridge | tr(V_S V_S^⊤ + λI)^{-1} | μ(S) ∝ det(V_S V_S^⊤ + λI) | (1+ε′)-approximation, where ε′ = ε (same as original) as λ → 0 and ε′ → 0 as λ → ∞ | Not yet analyzed
Another variant is ridge regression, which motivates the A-design objective with a regularizer term added. We show that the main result for A-optimal design, namely the (1 + ε)-approximation without repetition for large k, generalizes to this setting and improves as the regularizer parameter increases. We have not attempted to generalize the other results on A-optimal design in this chapter, though we suspect that similar analyses can be done. We also have not attempted to check the integrality gap in this setting. Details can be found in Section 3.9. Finally, we note that each modified version can be implemented efficiently (including its deterministic, derandomized counterpart).
3.1.4 Organization
In this chapter, we first show the reduction of the A-optimal design problem to constructing an efficient α-approximate (d − 1, d)-independent distribution μ, i.e. Theorem 3.1.3, in Section 3.2. We show the d-approximation and the asymptotically optimal approximation for A-optimal design without repetition in Section 3.3. We show approximation results when repetitions are allowed in Section 3.4. We discuss several generalizations of A-optimal design in Section 3.5. In Section 3.6, we provide efficient randomized implementations of proportional volume sampling with parameter μ for any hard-core measure μ, together with their deterministic derandomizations. In Section 3.7, we show integrality gap results for A- and E-optimal design. In Section 3.8, we prove APX-hardness of A-optimal design. In Section 3.9, we discuss A-optimal design with an ℓ2-regularizer added, also known as ridge regression, and show that a modification of proportional volume sampling still achieves an approximation guarantee for this problem.
3.2 Approximation via Near Independent Distributions
In this section, we prove Theorem 3.1.3 and give an α-approximation algorithm for the A-optimal
design problem given an α-approximate (d − 1, d)-independent distribution μ. We first consider the convex relaxation of the problem for the settings without and with repetitions; the relaxation is stated in the Preliminaries at (2.1)–(2.3). Let us denote the optimal value of (2.1)–(2.3) by CP, and denote the value of the optimal integral solution by OPT. For this section, we focus on the case when repetitions are not allowed.
3.2.1 Approximately Independent Distributions
Let us use the notation x^S = Π_{i∈S} x_i, V_S for the matrix of column vectors v_i ∈ R^d for i ∈ S, and V_S(x) for the matrix of column vectors √x_i v_i ∈ R^d for i ∈ S. Let e_k(x_1, …, x_n) be the degree-k elementary symmetric polynomial in the variables x_1, …, x_n, i.e. e_k(x_1, …, x_n) = Σ_{S∈U_k} x^S. By convention, e_0(x) = 1 for any x. For any positive semidefinite n × n matrix M, we define E_k(M) to be e_k(λ_1, …, λ_n), where λ(M) = (λ_1, …, λ_n) is the vector of eigenvalues of M. Notice that E_1(M) = tr(M) and E_n(M) = det(M).

To prove Theorem 3.1.3, we give the following algorithm 𝒜 (Algorithm 3.1), which is a general framework for sampling a set 𝒮 to solve the A-optimal design problem. We first prove the following lemma, which is needed for proving Theorem 3.1.3.
Algorithm 3.1 The proportional volume sampling algorithm
1: Given an input V = [v_1, …, v_n] where v_i ∈ R^d, a positive integer k, and a measure μ on sets in U_k (or U_{≤k}).
2: Solve the convex relaxation CP to get a fractional solution x ∈ R^n_+ with Σ_{i=1}^n x_i = k.
3: Sample a set 𝒮 (from U_k or U_{≤k}) where Pr[𝒮 = S] ∝ μ(S) det(V_S V_S^⊤) for any S ∈ U_k (or U_{≤k}).   ▷ μ(S) may be defined using the solution x
4: Output 𝒮 (if |𝒮| < k, add k − |𝒮| arbitrary vectors to 𝒮 first).
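The following is a minimal brute-force sketch of Algorithm 3.1 for small n, enumerating all size-k subsets explicitly; the efficient implementations are the subject of Section 3.6. The choice of μ as a hard-core measure with a stand-in fractional solution x is illustrative only.

```python
import itertools
import numpy as np

def proportional_volume_sample(V, k, mu, rng):
    """Brute force: sample S with Pr[S] proportional to mu(S) * det(V_S V_S^T).
    V is d x n (columns v_1, ..., v_n); mu maps a tuple of indices to a nonnegative weight."""
    n = V.shape[1]
    subsets = list(itertools.combinations(range(n), k))
    weights = np.array([mu(S) * np.linalg.det(V[:, S] @ V[:, S].T) for S in subsets])
    return subsets[rng.choice(len(subsets), p=weights / weights.sum())]

rng = np.random.default_rng(0)
d, n, k = 3, 8, 3
V = rng.normal(size=(d, n))
x = np.full(n, k / n)                       # stand-in for a solution of the relaxation CP
mu = lambda S: np.prod(x[list(S)])          # hard-core measure with parameter x
print(proportional_volume_sample(V, k, mu, rng))
```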
Lemma 3.2.1. Let T ⊆ [n] be of size no more than d. Then

$$\det(V_T(x)^\top V_T(x)) = x^T \det(V_T^\top V_T).$$

Proof. The statement follows from the multilinearity of the determinant and the exact formula for V_T(x)^⊤ V_T(x), as follows. The matrix V_T(x)^⊤ V_T(x) has (i, j) entry

$$\left(V_T(x)^\top V_T(x)\right)_{i,j} = \sqrt{x_i}\, v_i \cdot \sqrt{x_j}\, v_j = \sqrt{x_i x_j}\; v_i \cdot v_j$$

for each pair i, j ∈ [|T|]. By the multilinearity of the determinant, we can take the factor √x_i out of each row i of V_T(x)^⊤ V_T(x) and the factor √x_j out of each column j of V_T(x)^⊤ V_T(x). This gives

$$\det(V_T(x)^\top V_T(x)) = \prod_{i \in [|T|]} \sqrt{x_i} \prod_{j \in [|T|]} \sqrt{x_j}\, \det(V_T^\top V_T) = x^T \det(V_T^\top V_T).$$
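A quick numerical sanity check of Lemma 3.2.1 on a random instance (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t = 4, 3
V_T = rng.normal(size=(d, t))            # columns v_i for i in T, with |T| = t <= d
x = rng.uniform(0.1, 1.0, size=t)

V_T_x = V_T * np.sqrt(x)                 # V_T(x): column i scaled by sqrt(x_i)
lhs = np.linalg.det(V_T_x.T @ V_T_x)
rhs = np.prod(x) * np.linalg.det(V_T.T @ V_T)
print(np.isclose(lhs, rhs))              # expected: True
```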
We also need the following identity, which is well known and extends the Cauchy-Binet formula for the determinant to the functions E_k:

$$E_k(VV^\top) = E_k(V^\top V) = \sum_{S \in \mathcal{U}_k} \det(V_S^\top V_S). \qquad (3.1)$$
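The identity (3.1) is easy to check numerically on a small instance; the sketch below computes E_k from the eigenvalues and compares it to the sum of subset determinants (sizes are illustrative).

```python
import itertools
import numpy as np

def E_k(M, k):
    # E_k(M): degree-k elementary symmetric polynomial of the eigenvalues of a symmetric M
    lam = np.linalg.eigvalsh(M)
    return sum(np.prod(c) for c in itertools.combinations(lam, k))

rng = np.random.default_rng(2)
d, n, k = 3, 6, 2
V = rng.normal(size=(d, n))
lhs = E_k(V @ V.T, k)
rhs = sum(np.linalg.det(V[:, S].T @ V[:, S]) for S in itertools.combinations(range(n), k))
print(np.isclose(lhs, rhs))              # expected: True
```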
The identity (3.1) appeared in [MS17] and, specifically for k = d − 1, as Lemma 3.8 in [AB13]. Now we are ready to prove Theorem 3.1.3.
Proof of Theorem 3.1.3: Let μ′ denote the sampling distribution over U, where U = U_k or U_{≤k}, with probability of sampling S ∈ U proportional to μ(S) det(V_S V_S^⊤). Because tr(Σ_{i∈[n]} x_i v_i v_i^⊤)^{-1} = CP ≤ OPT, it is enough to show that

$$\mathbb{E}_{\mathcal{S}\sim\mu'}\left[\operatorname{tr}\Big(\sum_{i \in \mathcal{S}} v_i v_i^\top\Big)^{-1}\right] \le \alpha\, \operatorname{tr}\Big(\sum_{i \in [n]} x_i v_i v_i^\top\Big)^{-1}. \qquad (3.2)$$

Note that in case |𝒮| < k, algorithm 𝒜 adds k − |𝒮| arbitrary vectors to 𝒮, which can only decrease the objective value of the solution.
First, a simple but important observation ([AB13]): for any d × d matrix M of rank d, we have

$$\operatorname{tr}\left(M^{-1}\right) = \sum_{i=1}^{d} \frac{1}{\lambda_i(M)} = \frac{e_{d-1}(\lambda(M))}{e_d(\lambda(M))} = \frac{E_{d-1}(M)}{\det M}. \qquad (3.3)$$
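A short numerical check of (3.3) on a random positive definite matrix (illustrative only):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d))
M = A @ A.T + 0.1 * np.eye(d)            # a full-rank d x d positive definite matrix

lam = np.linalg.eigvalsh(M)
e_dm1 = sum(np.prod(c) for c in itertools.combinations(lam, d - 1))
print(np.isclose(np.trace(np.linalg.inv(M)), e_dm1 / np.linalg.det(M)))   # expected: True
```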
Therefore, we have

$$\mathbb{E}_{\mathcal{S}\sim\mu'}\left[\operatorname{tr}\Big(\sum_{i \in \mathcal{S}} v_i v_i^\top\Big)^{-1}\right] = \sum_{S \in \mathcal{U}} \Pr_{\mu'}[\mathcal{S} = S]\, \operatorname{tr}\big(V_S V_S^\top\big)^{-1} = \sum_{S \in \mathcal{U}} \frac{\mu(S)\det(V_S V_S^\top)}{\sum_{S' \in \mathcal{U}} \mu(S')\det(V_{S'} V_{S'}^\top)} \cdot \frac{E_{d-1}(V_S V_S^\top)}{\det(V_S V_S^\top)} = \frac{\sum_{S \in \mathcal{U}} \mu(S)\, E_{d-1}(V_S V_S^\top)}{\sum_{S \in \mathcal{U}} \mu(S)\det(V_S V_S^\top)}.$$

We can now apply the Cauchy-Binet formula (3.1) for E_{d−1}, E_d = det, and the matrix V_S V_S^⊤ to the numerator and denominator on the right hand side, and we get

$$\mathbb{E}_{\mathcal{S}\sim\mu'}\left[\operatorname{tr}\Big(\sum_{i \in \mathcal{S}} v_i v_i^\top\Big)^{-1}\right] = \frac{\sum_{S \in \mathcal{U}} \mu(S) \sum_{|T|=d-1,\, T \subseteq S} \det(V_T^\top V_T)}{\sum_{S \in \mathcal{U}} \mu(S) \sum_{|R|=d,\, R \subseteq S} \det(V_R^\top V_R)} = \frac{\sum_{|T|=d-1,\, T \subseteq [n]} \det(V_T^\top V_T) \sum_{S \in \mathcal{U},\, S \supseteq T} \mu(S)}{\sum_{|R|=d,\, R \subseteq [n]} \det(V_R^\top V_R) \sum_{S \in \mathcal{U},\, S \supseteq R} \mu(S)} = \frac{\sum_{|T|=d-1,\, T \subseteq [n]} \det(V_T^\top V_T)\, \Pr_{\mu}[\mathcal{S} \supseteq T]}{\sum_{|R|=d,\, R \subseteq [n]} \det(V_R^\top V_R)\, \Pr_{\mu}[\mathcal{S} \supseteq R]},$$

where we change the order of summation at the second-to-last equality. Next, we apply (3.3) and
the Cauchy-Binet formula (3.1) in a similar way to the matrix V(x)V(x)^⊤:

$$\operatorname{tr}\left(V(x)V(x)^\top\right)^{-1} = \frac{E_{d-1}(V(x)V(x)^\top)}{\det(V(x)V(x)^\top)} = \frac{\sum_{|T|=d-1,\, T \subseteq [n]} \det(V_T(x)^\top V_T(x))}{\sum_{|R|=d,\, R \subseteq [n]} \det(V_R(x)^\top V_R(x))} = \frac{\sum_{|T|=d-1,\, T \subseteq [n]} \det(V_T^\top V_T)\, x^T}{\sum_{|R|=d,\, R \subseteq [n]} \det(V_R^\top V_R)\, x^R},$$

where we use the facts that det(V_R(x)^⊤ V_R(x)) = x^R det(V_R^⊤ V_R) and det(V_T(x)^⊤ V_T(x)) = x^T det(V_T^⊤ V_T), by Lemma 3.2.1, in the last equality. Hence, the inequality (3.2) which we want to show is equivalent to
$$\frac{\sum_{|T|=d-1,\, T \subseteq [n]} \det(V_T^\top V_T)\, \Pr_{\mu}[\mathcal{S} \supseteq T]}{\sum_{|R|=d,\, R \subseteq [n]} \det(V_R^\top V_R)\, \Pr_{\mu}[\mathcal{S} \supseteq R]} \le \alpha\, \frac{\sum_{|T|=d-1,\, T \subseteq [n]} \det(V_T^\top V_T)\, x^T}{\sum_{|R|=d,\, R \subseteq [n]} \det(V_R^\top V_R)\, x^R}, \qquad (3.4)$$

which is equivalent to
$$\sum_{|T|=d-1,\, |R|=d} \det(V_T^\top V_T) \cdot \det(V_R^\top V_R) \cdot x^R\, \Pr_{\mu}[\mathcal{S} \supseteq T] \;\le\; \alpha \sum_{|T|=d-1,\, |R|=d} \det(V_T^\top V_T) \cdot \det(V_R^\top V_R) \cdot x^T\, \Pr_{\mu}[\mathcal{S} \supseteq R]. \qquad (3.5)$$
By the assumption that $\frac{\Pr_{\mu}[\mathcal{S} \supseteq T]}{\Pr_{\mu}[\mathcal{S} \supseteq R]} \le \alpha\, \frac{x^T}{x^R}$ for each pair of subsets T, R ⊆ [n] with |T| = d − 1 and |R| = d, we have

$$\det(V_T^\top V_T) \cdot \det(V_R^\top V_R) \cdot x^R\, \Pr_{\mu}[\mathcal{S} \supseteq T] \;\le\; \alpha\, \det(V_T^\top V_T) \cdot \det(V_R^\top V_R) \cdot x^T\, \Pr_{\mu}[\mathcal{S} \supseteq R]. \qquad (3.6)$$

Summing (3.6) over all T, R proves (3.5).
3.3 Approximating Optimal Design without Repetitions
In this section, we prove Theorem 3.1.5 by constructing α-approximate (d − 1, d)-independent distributions for appropriate values of α. We first consider the case when k = d and then the asymptotic case when k = Ω(d/ε + log(1/ε)/ε²). We also remark that the argument for k = d can be generalized to all k ≤ d, and we discuss this generalization in Section 3.5.1.
3.3.1 d-approximation for k = d
We prove the following lemma which, together with Theorem 3.1.3, implies the d-approximation for A-optimal design when k = d.
Lemma 3.3.1. Let k = d. The hard-core distribution μ on U_k with parameter x is d-approximate (d − 1, d)-independent.

Proof. Observe that for any S ∈ U_k, we have μ(S) = x^S/Z where Z = Σ_{S′∈U_k} x^{S′} is the normalization factor. For any T ⊆ [n] such that |T| = d − 1, we have

$$\Pr_{\mathcal{S}\sim\mu}[\mathcal{S} \supseteq T] = \sum_{S \in \mathcal{U}_k : S \supseteq T} \frac{x^S}{Z} = \frac{x^T}{Z} \cdot \sum_{i \in [n] \setminus T} x_i \le d\, \frac{x^T}{Z},$$

where we use k = d and Σ_{i∈[n]\T} x_i ≤ k = d. For any R ⊆ [n] such that |R| = d, we have

$$\Pr_{\mathcal{S}\sim\mu}[\mathcal{S} \supseteq R] = \sum_{S \in \mathcal{U}_k : S \supseteq R} \frac{x^S}{Z} = \frac{x^R}{Z}.$$

Thus, for any T, R ⊆ [n] such that |T| = d − 1 and |R| = d, we have

$$\frac{\Pr_{\mathcal{S}\sim\mu}[\mathcal{S} \supseteq T]}{\Pr_{\mathcal{S}\sim\mu}[\mathcal{S} \supseteq R]} \le d\, \frac{x^T}{x^R}.$$
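As a sanity check, the brute-force sketch below verifies the ratio bound of Lemma 3.3.1 on a tiny random instance by enumerating all size-d sets; the instance sizes and the random x are illustrative only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 3
k = d
x = rng.uniform(size=n)
x = k * x / x.sum()                                   # fractional weights with sum_i x_i = k

sets = list(itertools.combinations(range(n), k))
w = np.array([np.prod(x[list(S)]) for S in sets])     # hard-core measure mu(S) ~ x^S
mu = w / w.sum()

def prob_contains(T):
    return sum(p for S, p in zip(sets, mu) if set(T) <= set(S))

ok = all(
    prob_contains(T) / prob_contains(R)
    <= d * np.prod(x[list(T)]) / np.prod(x[list(R)]) * (1 + 1e-9)
    for T in itertools.combinations(range(n), d - 1)
    for R in itertools.combinations(range(n), d)
)
print(ok)                                             # expected: True
```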
3.3.2 (1 + ε)-approximation
Now, we show that there is a hard-core distribution μ on U_{≤k} that is (1 + ε)-approximate (d − 1, d)-independent when k = Ω(d/ε + log(1/ε)/ε²).

Lemma 3.3.2. Fix some 0 < ε ≤ 2, and let k = Ω(d/ε + log(1/ε)/ε²). The hard-core distribution μ on U_{≤k} with parameter λ, defined by

$$\lambda_i = \frac{x_i}{1 + \frac{\epsilon}{4} - x_i},$$

is (1 + ε)-approximate (d − 1, d)-wise independent.
Proof. For simplicity of notation, let us denote β = 1 + ε/4 and ξ_i = x_i/β. Observe that the probability mass under μ of any set S of size at most k is proportional to Π_{i∈S} ξ_i Π_{i∉S} (1 − ξ_i). Thus, μ is equivalent to the following distribution: sample a set ℬ ⊆ [n] by including every i ∈ [n] in ℬ independently with probability ξ_i; then we have μ(S) = Pr[ℬ = S | |ℬ| ≤ k] for every S of size at most k. Let us fix for the rest of the proof arbitrary sets T, R ⊆ [n] of size d − 1 and d, respectively. By the observation above, for 𝒮 sampled according to μ, and ℬ as above, we have

$$\frac{\Pr[\mathcal{S} \supseteq T]}{\Pr[\mathcal{S} \supseteq R]} = \frac{\Pr[\mathcal{B} \supseteq T \text{ and } |\mathcal{B}| \le k]}{\Pr[\mathcal{B} \supseteq R \text{ and } |\mathcal{B}| \le k]} \le \frac{\Pr[\mathcal{B} \supseteq T]}{\Pr[\mathcal{B} \supseteq R \text{ and } |\mathcal{B}| \le k]}.$$
We have Pr[ℬ ⊇ T] = ξ^T = x^T/β^{d−1}. To simplify the probability in the denominator, let us introduce, for each i ∈ [n], the indicator random variable Y_i, defined to be 1 if i ∈ ℬ and 0 otherwise. By the choice of ℬ, the Y_i's are independent Bernoulli random variables with means ξ_i, respectively. We can write

$$\Pr[\mathcal{B} \supseteq R \text{ and } |\mathcal{B}| \le k] = \Pr\Big[\forall i \in R : Y_i = 1 \text{ and } \sum_{i \notin R} Y_i \le k - d\Big] = \Pr[\forall i \in R : Y_i = 1] \cdot \Pr\Big[\sum_{i \notin R} Y_i \le k - d\Big],$$

where the last equality follows by the independence of the Y_i. The first probability on the right hand side is just ξ^R = x^R/β^d, and plugging into the inequality above, we get
$$\frac{\Pr[\mathcal{S} \supseteq T]}{\Pr[\mathcal{S} \supseteq R]} \le \beta\, \frac{x^T}{x^R\, \Pr\big[\sum_{i \notin R} Y_i \le k - d\big]}. \qquad (3.7)$$

We claim that

$$\Pr\Big[\sum_{i \notin R} Y_i \le k - d\Big] \ge 1 - \frac{\epsilon}{4}$$

as long as k = Ω(d/ε + log(1/ε)/ε²). The proof follows from standard concentration of measure arguments. Let Y = Σ_{i∉R} Y_i, and observe that E[Y] = (1/β)(k − x(R)), where x(R) is shorthand for Σ_{i∈R} x_i. By Chernoff's bound,

$$\Pr[Y > k - d] < e^{-\frac{\delta^2 (k - x(R))}{3\beta}}, \qquad (3.8)$$

where

$$\delta = \frac{\beta(k - d)}{k - x(R)} - 1 = \frac{(\beta - 1)k + x(R) - \beta d}{k - x(R)}.$$
The exponent on the right hand side of (3.8) simplifies to

$$\frac{\delta^2 (k - x(R))}{3\beta} = \frac{\big((\beta - 1)k + x(R) - \beta d\big)^2}{3\beta(k - x(R))} \ge \frac{\big((\beta - 1)k - \beta d\big)^2}{3\beta k}.$$

For the bound Pr[Y > k − d] ≤ ε/4, it suffices to have

$$(\beta - 1)k - \beta d \ge \sqrt{3\beta \log(4/\epsilon)\, k}.$$

Assuming that k ≥ (C/ε²) log(4/ε) for a sufficiently big constant C, the right hand side is at most εk/8. So, as long as k ≥ βd/(β − 1 − ε/8), the inequality is satisfied and Pr[Y > k − d] < ε/4, as we claimed.

The proof of the lemma now follows since for any |T| = d − 1 and |R| = d, we have

$$\frac{\Pr[\mathcal{S} \supseteq T]}{\Pr[\mathcal{S} \supseteq R]} \le \beta\, \frac{x^T}{x^R\, \Pr\big[\sum_{i \notin R} Y_i \le k - d\big]} \le \frac{1 + \frac{\epsilon}{4}}{1 - \frac{\epsilon}{4}} \cdot \frac{x^T}{x^R}, \qquad (3.9)$$

and (1 + ε/4)/(1 − ε/4) ≤ 1 + ε.
The (1 + ε)-approximation for large enough k in Theorem 3.1.5 now follows directly from Lemma 3.3.2 and Theorem 3.1.3.
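The equivalence used at the start of the proof of Lemma 3.3.2, between the hard-core distribution on U_{≤k} and independent Bernoulli inclusions conditioned on |ℬ| ≤ k, suggests a simple rejection sampler. The sketch below only illustrates that equivalence; the parameters are placeholders, and this is not the efficient implementation of Section 3.6.

```python
import numpy as np

def hardcore_sample_leq_k(x, k, eps, rng, max_tries=100000):
    """Sample from the hard-core measure on sets of size <= k with parameter
    lambda_i = x_i / (1 + eps/4 - x_i), via Bernoulli inclusions conditioned on |B| <= k."""
    beta = 1.0 + eps / 4.0
    xi = np.asarray(x) / beta                  # inclusion probabilities xi_i = x_i / beta
    for _ in range(max_tries):
        B = np.flatnonzero(rng.random(len(xi)) < xi)
        if len(B) <= k:                        # rejecting draws with |B| > k implements the conditioning
            return B
    raise RuntimeError("rejection sampling did not terminate")

rng = np.random.default_rng(5)
n, d, eps, k = 50, 3, 0.5, 30                  # illustrative; the lemma needs k = Omega(d/eps + log(1/eps)/eps^2)
x = np.full(n, k / n)                          # stand-in fractional solution with sum_i x_i = k
print(hardcore_sample_leq_k(x, k, eps, rng))
```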
3.4 Approximately Optimal Design with Repetitions
In this section, we consider A-optimal design without the bound x_i ≤ 1 and prove Theorem 3.1.6. That is, we allow the sample set S to be a multi-set. We obtain a tight bound on the integrality gap in this case. Interestingly, we reduce the problem to a special case of A-optimal design without repetitions that allows us to obtain an improved approximation.
We first describe a sampling algorithm, Algorithm 3.2, that achieves a k(1+ε)/(k−d+1)-approximation for any ε > 0. In the algorithm, we introduce poly(n, 1/ε) copies of each vector to ensure that the fractional solution assigns an equal fractional value to each copy of each vector. Then we use proportional volume sampling, where the measure μ is defined on sets of the new, larger ground set U over copies of the original input vectors. The distribution μ is just the uniform distribution over subsets of size k of U, so we are effectively using traditional volume sampling over U. Notice, however, that the induced distribution over multisets of the original set of vectors is different. The proportional volume sampling used in the algorithm can be implemented in the same way as the one used in the without-repetition setting, as described in Section 3.6.1, which runs in poly(n, d, k, 1/ε) time. In Section 3.6.3, we describe a new implementation of the proportional volume sampling procedure which improves the running time to poly(n, d, k, log(1/ε)). The new algorithm is still efficient even when U has exponential size, by exploiting the facts that μ is uniform and that U contains at most n distinct vectors.
Algorithm 3.2 Approximation Algorithm for A-optimal design with repetitions
1: Given x ∈ R^n_+ with Σ_{i=1}^n x_i = k, ε > 0, and vectors v_1, …, v_n.
2: Let q = 2n/(εk). Set x_i′ := ((k − n/q)/k)·x_i for each i, and round each x_i′ up to a multiple of 1/q.
3: If Σ_{i=1}^n x_i′ < k, add 1/q to any x_i′ until Σ_{i=1}^n x_i′ = k.
4: Create q·x_i′ copies of vector v_i for each i ∈ [n]. Denote by W the set, of size Σ_{i=1}^n q·x_i′ = qk, of all those copies of vectors. Denote by U the new index set of W, of size qk.   ▷ This implies that we can assume that our new fractional solution y_i = 1/q is equal over all i ∈ U
5: Sample a subset 𝒮 of U of size k where Pr[𝒮 = S] ∝ det(W_S W_S^⊤) for each S ⊆ U of size k.
6: Set X_i = Σ_{w∈𝒮} 1(w is a copy of v_i) for all i ∈ [n].   ▷ Get an integral solution X by counting the numbers of copies of v_i in 𝒮
7: Output X.
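A small sketch of the copy-creation preprocessing in steps 2–4 of Algorithm 3.2. Taking q as an integer via a ceiling and incrementing the smallest coordinate first are illustrative choices not fixed by the algorithm statement ("add 1/q to any x_i′"); the sampling step itself is as in Algorithm 3.1 with μ uniform.

```python
import numpy as np

def build_copies(x, k, eps):
    """Steps 2-4 of Algorithm 3.2, sketched: rescale x, round up to multiples of 1/q,
    and create q * x_i' copies of each index so every copy carries fractional value 1/q."""
    n = len(x)
    q = int(np.ceil(2 * n / (eps * k)))        # assumption: q rounded up to an integer
    x_prime = (k - n / q) / k * np.asarray(x, dtype=float)
    units = np.ceil(q * x_prime).astype(int)   # q * x_i' after rounding x_i' up to a multiple of 1/q
    while units.sum() < q * k:                 # add 1/q to some x_i' until the total is exactly k
        units[np.argmin(units)] += 1
    copies = np.repeat(np.arange(n), units)    # ground set U: each entry records which v_i it copies
    return q, copies

x = np.array([1.7, 0.9, 0.4])                  # a fractional solution with sum x_i = k = 3 (repetitions allowed)
q, copies = build_copies(x, k=3, eps=0.5)
print(q, len(copies), copies)                  # |U| = q * k copies in total
```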
Lemma 3.4.1. Algorithm 3.2, when given as input x ∈ R^n_+ s.t. Σ_{i=1}^n x_i = k, 1 ≥ ε > 0, and v_1, …, v_n, outputs a random X ∈ Z^n_+ with Σ_{i=1}^n X_i = k such that

$$\mathbb{E}\left[\operatorname{tr}\Big(\sum_{i=1}^n X_i v_i v_i^\top\Big)^{-1}\right] \le \frac{k(1+\epsilon)}{k - d + 1}\, \operatorname{tr}\Big(\sum_{i=1}^n x_i v_i v_i^\top\Big)^{-1}.$$
Proof. Define x_i′, y, W, U, 𝒮, X as in the algorithm. We will show that
$$\mathbb{E}\left[\operatorname{tr}\Big(\sum_{i=1}^n X_i v_i v_i^\top\Big)^{-1}\right] \le \frac{k}{k - d + 1}\, \operatorname{tr}\Big(\sum_{i=1}^n x_i' v_i v_i^\top\Big)^{-1} \le \frac{k(1+\epsilon)}{k - d + 1}\, \operatorname{tr}\Big(\sum_{i=1}^n x_i v_i v_i^\top\Big)^{-1}.$$

The second inequality follows by observing that the scaling x_i′ := ((k − n/q)/k)·x_i multiplies the objective tr(Σ_{i=1}^n x_i v_i v_i^⊤)^{-1} by a factor of ((k − n/q)/k)^{-1} = (1 − ε/2)^{-1} ≤ 1 + ε, and that rounding x_i′ up and adding 1/q to any x_i′ can only decrease the objective.
To show the first inequality, we first translate the two key quantities tr(Σ_{i=1}^n x_i′ v_i v_i^⊤)^{-1} and tr(Σ_{i=1}^n X_i v_i v_i^⊤)^{-1} from the with-repetition setting over V and [n] to the without-repetition setting over W and U. First, tr(Σ_{i=1}^n x_i′ v_i v_i^⊤)^{-1} = tr(Σ_{i∈U} y_i w_i w_i^⊤)^{-1}, where the y_i = 1/q are all equal over all i ∈ U, and w_i is the copied vector in W at index i ∈ U. Second, tr(Σ_{i=1}^n X_i v_i v_i^⊤)^{-1} = tr(Σ_{i∈𝒮⊆U} w_i w_i^⊤)^{-1}.