<<

Investigations in Applied Probability and High-Dimensional Statistics

by Julia Gaudio

ScB, Brown University (2016); ScM, Brown University (2016)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 2020.

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Sloan School of Management, May 1, 2020

Certified by: David Gamarnik, Nanyang Technological University Professor of Operations Research, Thesis Supervisor

Certified by: Patrick Jaillet, Dugald C. Jackson Professor, Department of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Georgia Perakis, William F. Pounds Professor of Management Science, Co-director, Operations Research Center

Investigations in Applied Probability and High-Dimensional Statistics

by Julia Gaudio

Submitted to the Sloan School of Management on May 1, 2020, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research

Abstract

This thesis makes contributions to the areas of applied probability and high-dimensional statistics. We introduce the Attracting Random Walks model, a model of interacting particles on a graph. In the Attracting Random Walks model, particles move among the vertices of a graph, with transitions depending on the locations of the other particles. The model is designed so that transitions to more occupied vertices are more likely. We analyze the mixing time of the model under different values of the parameter governing the attraction. We additionally consider the repelling version of the model, in which particles are more likely to move to vertices with low occupancy. Next, we contribute to the methodology of Markov processes by studying convergence rates for Markov processes under perturbation. We specifically consider parametrized stochastically ordered Markov processes, such as queues. We bound the time until a given Markov process converges to stationarity after its parameter experiences a perturbation. The following chapter considers the random instance of the Traveling Salesman Problem. Namely, 푛 points (cities) are placed uniformly at random in the unit square. It was shown by Beardwood et al. (1959) that the optimal tour length through these points, divided by √푛, converges to a constant [2]. Determining the value of the constant is an open problem. We improve the lower bound over the original bound given in [2]. Finally, we study a statistical model: isotonic regression. Isotonic regression is the problem of estimating a coordinate-wise monotone function from data. We introduce the sparse version of the problem, and study it in the high-dimensional setting. We provide optimization-based algorithms for the recovery of the ground truth function, and provide guarantees for function estimation in terms of 퐿2 loss.

Thesis Supervisor: David Gamarnik Title: Nanyang Technological University Professor of Operations Research

Thesis Supervisor: Patrick Jaillet

Title: Dugald C. Jackson Professor, Department of Electrical Engineering and Computer Science

Acknowledgments

I would first like to thank my advisors Patrick Jaillet and David Gamarnik. Patrick was always supportive in allowing me to pursue a variety of research directions. The freedom and trust that he placed in me helped me to develop my own identity as a researcher. David helped me develop the skill of finding interesting open research areas. His patience and encouragement gave me the energy to work through difficult proofs. When I (hopefully) have students of my own to advise, I will remember the excellent advising I received from Patrick and David.

Thank you to Guy Bresler and Ankur Moitra, the other two members of my thesis committee. I was fortunate to take a class taught by each of them, and have truly been inspired by their research. During my PhD I had many research discussions with Yury Polyanskiy, who was always enthusiastic and kindly invited me to his group meetings. During my first year, I collaborated with Saurabh Amin, who encouraged me to shape my research to match my interests. Thank you to Pascal Van Hentenryck, who was my Masters thesis advisor, for introducing me to the field of operations research. When I was an undergraduate, an accidental situation led me to take a course with Stuart Geman, who got me excited about probability.

I am grateful to Microsoft Research for providing me with a Microsoft Research PhD Fellowship for my last two years in graduate school. During the summer of 2018, I completed an internship at Microsoft Research Redmond, mentored by Ishai Menache. Ishai was highly supportive, and was always looking out for opportunities for me. He is also a role model to me as someone who puts a high priority on family life. During that summer, I was also fortunate to collaborate with Luke Marshall and Ece Kamar. During the summer of 2019, I completed an internship at Microsoft Research New England, mentored by Henry Cohn. I worked with Christian Borgs, Jennifer Chayes, Samantha Petti, and Subhabrata Sen on large deviations for stochastic block models, using graphon theory. I feel fortunate to have learned about graphons from Jennifer and Christian, who are pioneers in the area. Subhabrata taught me the value of breaking up long proofs into more manageable components. I was inspired by my fellow intern Sam’s continuous drive, and enjoyed our funny moments.

The leaders and administrators at the ORC and LIDS have created amazing research communities: Laura Rose, Andrew Carvalho, Patrick Jaillet, Dimitris Bertsimas, and Georgia Perakis from the ORC, and John Tsitsiklis, Eytan Modiano, Asu Ozdaglar, Francisco Jaimes, Richard Lay, and Rachel Cohen from LIDS. I am fortunate to have made many friends over the course of my PhD: Agni Orfanoudaki, Álvaro Fernandez Galeano, Elisabeth Paulson, Emma Gibson, Emily Meigs, Jackie Baek, Arthur Delarue, Jean Pauphilet, Sébastien Martin, Andrew Li, Brad Sturt, Matthew Sobiesk, Konstantina Mellou, Ilias Zadik, Colin Pawlowski, Yee Sian Ng, Alireza Fallah, Igor Kadota, and others. I am thankful for my local friendships outside of MIT: Tamar Kaminski, Adam Gosselin, Angelica Cygan, and the friends I have made through various communities: the Tech Catholic Community, St. Mary of the Annunciation Parish, and Westgate. I am thankful for my continuing friendships from Toronto: Karen Morenz, Alisa Ugodnikov, Natalie Landon-Brace, Marilyn Verghis, and Katie Lawrence, among many others (UTS never ends!).

Thank you to my family, in particular my parents. From the beginning, they always wanted the best educational opportunities for me. They always believed that it was worth it. This thesis is dedicated to my wonderful husband Joey. I met Joey the day I arrived at MIT, at an orientation barbecue. We were engaged in March 2017, and married in June 2018. Our first child Luca was born in November 2019. Joey and Luca reminded me of the importance of a balanced life and put the academic component in perspective. Thank you to Joey for countless amazing memories over the course of this PhD.

Contents

1 Introduction

2 Attracting Random Walks
  2.1 Introduction
  2.2 The Model
    2.2.1 Definitions and Main Results
    2.2.2 Connection to the Potts Model
  2.3 Mixing Time on General Graphs
    2.3.1 Slow Mixing
    2.3.2 Fast Mixing
  2.4 Repelling Random Walks
    2.4.1 The Case 훽 = −∞
    2.4.2 The Complete Graph Case
  2.5 Conclusion
  2.6 Appendix

3 Exponential Convergence Rates for Stochastically Ordered Markov Processes Under Perturbation
  3.1 Introduction
  3.2 Related work
  3.3 Main result
  3.4 M/M/1 queues
    3.4.1 Queue length process
    3.4.2 Workload process
  3.5 Conclusion
  3.6 Appendix

4 An Improved Lower Bound on the Traveling Salesman Constant
  4.1 Introduction
  4.2 Approaches for the Lower Bound
  4.3 Derivation of the Lower Bound
  4.4 An Improvement

5 Sparse High-Dimensional Isotonic Regression
  5.1 Introduction
  5.2 Algorithms for sparse isotonic regression
    5.2.1 The Simultaneous Algorithm
    5.2.2 The Two-Stage Algorithm
  5.3 Results on the Noisy Output Model
    5.3.1 Statistical consistency
    5.3.2 Support recovery
  5.4 Results on the Noisy Input Model
    5.4.1 Statistical consistency
    5.4.2 Support recovery
  5.5 Experimental results
    5.5.1 Support recovery
    5.5.2 Cancer classification using gene expression data
  5.6 Conclusion
  5.7 Appendix
    5.7.1 Proofs for the Noisy Output Model
    5.7.2 Proofs for the Noisy Input Model

6 Future Directions

List of Figures

2-1 Simulation of the Attracting Random Walks model on a grid graph
2-2 Correspondence of the Curie–Weiss Potts model to the Attracting Random Walks model. A Potts configuration is drawn on the left, and the corresponding ARW configuration is drawn on the right.
2-3 Initial state of a cycle that breaks Kolmogorov’s criterion.
2-4 Single-particle Markov chain from the 푍 chain (퐷 ≥ 2).
2-5 Single-particle Markov chain from the 푍 chain (퐷 = 1).
2-6 Pairing of particles in the coupling. The edges between vertices are omitted.
2-7 Simulation of the Attracting Random Walks model on an 8 × 8 grid graph after 10^6 steps for 푛 = 320, 훽 = −500.

4-1 The six stubs associated with vertices 푎, 푏, and 푐.
4-2 Conditioning on the location of point 푐. The gray regions indicate where point 푐 may lie.

5-1 Illustration of the TSIR + S-LPSR algorithm. Blue and red markers correspond to lung and skin cancer, respectively.
5-2 Robustness to error of TSIR + S-LPSR.
5-3 Illustration of a partition in 푑 = 2 with 푚 = 10. The partition cells are indicated in gray, and the border cells are marked.

List of Tables

5.1 Performance of support recovery algorithms on synthetic instances. Each line of the table corresponds to 100 trials.
5.2 Accuracy of isotonic regression on synthetic instances. Each line of the table corresponds to 100 trials.
5.3 Comparison of classifier success rates on COSMIC data. Top row data is according to the “min” interpolation rule and bottom row data is according to the “max” interpolation rule.

Chapter 1

Introduction

The field of operations research is concerned with building models to make decisions, drawing on many areas of mathematics, including optimization, probability, and statistics. This thesis makes contributions in the areas of applied probability and high-dimensional statistics. This chapter provides a high-level overview of the content of the thesis. More in-depth introductions are included in the main chapters.

Markov chains, introduced by Russian mathematician Andrey Markov in 1906, have revolutionized the field of probability [53]. A Markov chain is a random process (푋1, 푋2, . . . ) whose defining characteristic is that the state 푋푘 depends probabilistically only on the state 푋푘−1. Markov chains are used for modeling and statistical prediction. They also have a wide range of applications in other fields, including statistical physics and computer science [35]. Markov chains that are both aperiodic and irreducible converge to a unique stationary distribution. A central quantity in the analysis of Markov chains is the mixing time, which is the number of time steps until the distribution of the chain is very close to the stationary distribution.

A first example of a Markov chain is a simple random walk on a graph. Let 풢 = (풱, ℰ) be a graph, where 풱 denotes the vertices and ℰ denotes the edges. A simple random walk on 풢 is a random process that moves among the vertices of the graph. At each time step, it moves to a neighboring vertex uniformly at random.

Since the transitions of the particle depend only on the current vertex location, a random walk is a Markov chain. This thesis introduces a variation of the classical random walk, known as Attracting Random Walks, in Chapter 2. In the Attracting Random Walks (ARW) model, there are many particles walking on a graph. At each time step, one of the particles is selected uniformly at random, and moves to a neighboring vertex. The probability of moving to a given vertex depends on the number of other particles at that vertex. The model is designed so that moving to a more highly occupied vertex is more likely than moving to a less occupied vertex. The ARW model on the complete graph is closely related to the Potts model from statistical physics. We analyze how the mixing time of the ARW model depends on the parameter governing the degree of attraction. When the particles are strongly attracted to each other, the mixing time is exponential in the number of particles (slow mixing), while when the particles are weakly attracted to each other, the mixing time is polynomial in the number of particles (fast mixing). Chapter 2 additionally includes analysis of the related Repelling Random Walks model. In the Repelling Random Walks model, particles are more likely to make transitions to less occupied vertices.

So far, we have discussed discrete time, discrete state space Markov chains. Continuous time Markov processes, which may have discrete or continuous state spaces, are also widely used in applied probability, with queueing systems being one of the key application areas. In operations research, queues are used to model customers awaiting service [25]. A queue is typically modeled as a continuous time, discrete state space Markov process. Queueing theory has applications in traffic management, telecommunications, supply chain management, and many other fields. In Chapter 3, we consider continuous time Markov processes. The processes we consider are parametrized and stochastically ordered, such as queues. We are again interested in the convergence to the stationary distribution. As an example, we can take an M/M/1 queue. In an M/M/1 queue with arrival rate 휆 and service rate 휇, customers arrive as a Poisson process with rate 휆, and are served in a time period that is distributed as an exponential random variable with rate 휇. Suppose that the queue is initially distributed according to the stationary distribution associated with parameters (휆0, 휇0). At time 푡 = 0, the parameters are perturbed to (휆, 휇). We analyze how long the queue will take to reach the new stationary distribution associated with (휆, 휇). More generally, we provide bounds for convergence of stochastically ordered Markov processes under perturbation.

In Chapter 4, we consider the Traveling Salesman Problem (TSP), a famous problem in optimization, from a probabilistic standpoint. In the TSP, a salesman must navigate a set of cities, finding a tour of the shortest possible length. Finding the optimal TSP tour is NP-hard [54]. We consider the random TSP, where 푛 cities fall uniformly at random in the square [0, 1]². It is known that the optimal tour length, normalized by dividing by √푛, converges to a constant [2]. The value of the constant is an open problem. We improve the lower bound of the constant. This type of limiting behavior is not unique to the TSP; it is a common phenomenon in Euclidean combinatorial optimization [55]. Analyzing the behavior of random instances of difficult combinatorial problems provides insight into the limiting behaviors of these problems.

Studying high-dimensional models in probability is valuable for understanding limiting or aggregate behavior, as in Chapters 2 and 4. On the other hand, high-dimensional models in statistics are necessitated by the presence of big data. In many data-rich settings, the number of features (dimension) may be higher than the number of samples. These high-dimensional settings are challenging for statistical inference. One method to handle high-dimensional models is to introduce some form of regularization, such as sparsity.

In Chapter 5, we consider the sparse high-dimensional isotonic regression problem. Isotonic regression is the problem of estimating a coordinate-wise monotone function given noisy measurements. It is a natural and widely studied model of non-parametric regression. For example, one might model a patient’s risk for a disease based on his or her age, weight, blood pressure, and other factors. Because it makes few assumptions, isotonic regression is often a more appropriate modeling choice than using linear regression. The chapter introduces the sparse version of the problem. In sparse isotonic regression, only a few variables, referred to as the active coordinates, determine the function value. The chapter provides algorithms for estimating the unknown sparsely monotone function, using mathematical programming techniques. The first approach jointly estimates the active coordinates and the function values. The second approach uses two steps: (1) Estimate the active coordinates using a linear program, and (2) Estimate the function values given the estimated active coordinates. The second approach is more tractable empirically. We provide guarantees on the 퐿2 loss of the estimated function, using both approaches. In addition, we provide empirical evidence for the applicability of the algorithms on real-life data. Using the Catalogue of Somatic Mutations in Cancer (COSMIC) database, we measure the ability of the algorithms to classify between two cancer types given patient gene expression values. This is a proof-of-concept experiment that supports the applicability of the algorithms.
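To give a concrete, simplified illustration of the sparsity structure (this is not the optimization-based algorithm of Chapter 5; the generating function, the dimensions, and the concordance score below are all illustrative choices), one can generate data from a coordinate-wise monotone function of two active coordinates out of ten and rank the coordinates by a simple monotone-association statistic:

import numpy as np

# Toy sketch: data from a coordinate-wise monotone function of s = 2 active
# coordinates out of d = 10, with a simple concordance score standing in for
# the support-recovery step (illustrative only).
rng = np.random.default_rng(0)
n, d, s = 300, 10, 2
X = rng.uniform(size=(n, d))
y = X[:, 0] ** 2 + np.sqrt(X[:, 1]) + 0.1 * rng.standard_normal(n)  # monotone in coords 0, 1

def concordance(xj, y, pairs=2000):
    # Fraction of random pairs on which xj and y move in the same direction.
    i, k = rng.integers(n, size=pairs), rng.integers(n, size=pairs)
    agree = np.sign(xj[i] - xj[k]) * np.sign(y[i] - y[k])
    return agree[agree != 0].mean()

scores = np.array([concordance(X[:, j], y) for j in range(d)])
print(sorted(np.argsort(scores)[-s:]))  # estimated active coordinates; ideally [0, 1]

In this toy setting the two active coordinates receive the highest scores; the algorithms of Chapter 5 replace this heuristic scoring with mathematical programming formulations that come with provable guarantees.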

Bibliographic Information

Chapter 2 is in submission to the Electronic Journal of Probability, and a major revision has been requested. A preliminary version appears on arXiv (arXiv:1903.00427) [21]. This work benefitted significantly from the contributions of others. Their contributions are indicated at the beginning of the chapter. Chapter 3 appears in Systems and Control Letters, co-authored with Patrick Jaillet and Saurabh Amin [22]. Chapter 4 appears in Operations Research Letters, co-authored with Patrick Jaillet [23]. I am aware of a much more significant improvement to the lower bound of the TSP constant, in unpublished work by Thomas Oliver Herz, produced simultaneously and independently from my own work. Chapter 5, co-authored with David Gamarnik, appeared in the proceedings of the Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019) [18]. I thank all the anonymous reviewers for their helpful comments, which greatly improved the clarity of this work. Chapter-specific acknowledgments appear at the start of the corresponding chapter.

Chapter 2

Attracting Random Walks

This chapter introduces the Attracting Random Walks model, which describes the dynamics of a system of particles on a graph with certain attraction properties. In the model, particles move between adjacent vertices of a graph 풢, with transition probabilities that depend positively on particle counts at neighboring vertices. From an applied standpoint, the model captures the “rich get richer” phenomenon. We show that the Markov chain underlying the dynamics exhibits a phase transition in mixing time, as the parameter governing the attraction is varied. Namely, mixing is fast when the temperature is sufficiently high and slow when it is sufficiently low. When 풢 is the complete graph, the model is a projection of the Potts model, whose phase transition is known. On the other hand, when the graph is incomplete, the model is non-reversible, and the stationary distribution is unknown. We demonstrate the existence of a phase transition in mixing time for general graphs.

Acknowledgements

Many thanks to Yury Polyanskiy, David Gamarnik, and Patrick Jaillet for numerous helpful discussions. The proof of fast mixing for small 훽 > 0 is due to Y. Polyanskiy. I appreciate the careful editing by D. Gamarnik. The work benefited in a pivotal way from discussions with Eyal Lubetzky and Reza Gheissari, especially in the proof of slow mixing. The idea of using a lower-bounding comparison chain is due to R. Gheissari. I am grateful to E. Lubetzky for kindly hosting me at NYU. I acknowledge Yuval Peres for several helpful discussions, including an idea used in the proof of fast mixing for Repelling Random Walks on the complete graph. Thank you to the reviewer for pointing out a strengthening of Lemma 6.

2.1 Introduction

In this chapter, we introduce the Attracting Random Walks (ARW) model. The motivation of the model is to understand the formation of wealth disparities in an economic network. Consider a network of economic agents, each with a certain number of coins representing their wealth. At each time step, one coin is selected uniformly at random, and moves to a neighbor of its owner with a probability that depends on how wealthy the neighbors are. Those who are well-connected and initially wealthy will tend to accumulate more wealth. We refer to particles instead of coins in what follows. This is a flexible model based on a few principles: there are a fixed number of particles moving around on a graph; movements are asynchronous, and particles make choices about where to move based on their local environment. The model can encompass a variety of situations. Further, the model can be extended by allowing for multiple particle types, with intra- and inter-group attraction parameters, though we do not consider this extension in this chapter. There are many more applications beyond the economic application. As an interacting particle system, it could be relevant to physics and other applications.

This chapter analyzes the Attracting Random Walks model and establishes phase transition properties. The difficulty in bounding mixing times, particularly in finding lower bounds, is due to the fact that the stationary distribution cannot be simply formulated. Additionally, the model is not reversible unless the graph is complete (Theorem 3), meaning that many techniques do not apply. We establish the existence of a phase transition in mixing time as the attraction parameter, 훽, is varied. Slow mixing for 훽 large enough is established by relating the mixing time to a suitable hitting time. Fast mixing for 훽 small enough is proven by a path coupling approach that relates the Attracting Random Walks chain to the simple random walk on the same graph (i.e. with 훽 = 0). An alternative proof of fast mixing is to use a variable-length path coupling, as introduced in [28]. The alternative proof is included in the appendix of this chapter. We emphasize that even though the stationary distribution is not known analytically for general graphs, we have shown that it undergoes a phase transition by arguing through mixing times.

The rest of the chapter is structured as follows. We describe the dynamics of the model in Section 2.2, along with some possible applications. The remainder of the chapter is focused on properties of the Markov chain governing the dynamics. In Section 2.2.2 we discuss a link to the Potts model. Section 2.3 proves the existence of phase transition in mixing time for general graphs, and is the main theoretical contribution of this work. In Section 2.4, we collect partial results on the version of the model in which particles repel each other instead of attracting, a model we call “Repelling Random Walks.”

2.2 The Model

2.2.1 Definitions and Main Results

The model is a discrete time process on a simple graph 풢 = (풱, ℰ), where 풱 is the set of vertices and ℰ is the set of undirected edges. We assume throughout that 풢 is connected. We write 푖 ∼ 푗 if (푖, 푗) ∈ ℰ. Let 푘 = |풱|. Initially, 푛 indistinguishable particles are placed on the vertices of 풢 in some configuration. Let 푥(푖) be the number of particles at vertex 푖. The particle configuration is updated in two stages, according to a fixed parameter 훽:

1. Choose a particle uniformly at random. Let 푖 be the location of that particle.

2. Move the particle to a vertex 푗 ∼ 푖, 푗 ≠ 푖, with probability proportional to exp((훽/푛) 푥(푗)). Keep the particle at vertex 푖 with probability proportional to exp((훽/푛)(푥(푖) − 1)) (with the same constant of proportionality).

Let 푃 be the transition probability matrix of the resulting Markov chain. Let 푒푖 denote the 푖th standard basis vector in ℝ^푘. Then for two configurations 푥 and 푦 such that 푦 = 푥 − 푒푖 + 푒푗 for 푖 ∼ 푗 or 푖 = 푗, we have

\[
P(x, y) =
\begin{cases}
\dfrac{x(i)}{n}\cdot\dfrac{\exp\left(\frac{\beta}{n}x(j)\right)}{\sum_{l\sim i}\exp\left(\frac{\beta}{n}x(l)\right)+\exp\left(\frac{\beta}{n}(x(i)-1)\right)} & \text{if } i \sim j, \\[2ex]
\dfrac{x(i)}{n}\cdot\dfrac{\exp\left(\frac{\beta}{n}(x(i)-1)\right)}{\sum_{l\sim i}\exp\left(\frac{\beta}{n}x(l)\right)+\exp\left(\frac{\beta}{n}(x(i)-1)\right)} & \text{if } i = j.
\end{cases}
\]
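For concreteness, the update rule above can be simulated directly; the following minimal Python sketch performs one step of the dynamics (the small graph, the values of 푛 and 훽, and the helper name arw_step are arbitrary illustrative choices, not tied to any experiment in this chapter):

import numpy as np

# Minimal sketch of one ARW update step.
rng = np.random.default_rng(1)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # a small connected graph
n, beta = 12, 2.0
x = np.array([3, 3, 3, 3])                            # particles per vertex

def arw_step(x, adj, beta, n, rng):
    # 1. Choose a particle uniformly at random; i is its current vertex.
    i = rng.choice(len(x), p=x / n)
    # 2. Move it to j ~ i with prob proportional to exp(beta * x(j) / n),
    #    or keep it at i with prob proportional to exp(beta * (x(i) - 1) / n).
    targets = adj[i] + [i]
    weights = np.array([np.exp(beta * x[j] / n) for j in adj[i]]
                       + [np.exp(beta * (x[i] - 1) / n)])
    j = rng.choice(targets, p=weights / weights.sum())
    x = x.copy()
    x[i] -= 1
    x[j] += 1
    return x

for _ in range(5):
    x = arw_step(x, adj, beta, n, rng)
print(x, x.sum())   # particle counts; the total stays equal to n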

The probabilities are a function of the numbers of particles at each vertex, excluding the particle that is to move. This modeling choice means that the moving particle is neutral toward itself, and relates the ARW model to the Potts model, as will be explained below. When 훽 is positive (ferromagnetic dynamics), the particle is more likely to travel to a vertex that has more particles. Greater 훽 encourages stronger aggregation of the particles. On the other hand, taking 훽 < 0 (antiferromagnetic dynamics) encourages particles to spread. Note that 훽 = 0 corresponds to the case of independent random walks. For an application with 훽 < 0, consider an ensemble of identical gas particles in a container. We can discretize the container into blocks. Each block becomes a vertex in our graph. Vertices are connected by an edge whenever the corresponding blocks share a face. Since gas particles primarily repel each other, it makes sense to consider 훽 < 0 in this scenario. Taking 훽 ≪ 0 discourages particles from occupying the same block. To get an idea of the effect of 훽, Figure 2-1 displays some instances of the Attracting Random Walks model run for 10^6 steps for different values of 훽. The graph is the 8 × 8 grid graph, with 푛 = 320, for an average of 5 particles per vertex.

We now state our main results regarding the phase transition in mixing time. We let ‖푃 − 푄‖TV denote the total variation distance between two discrete probability measures 푃 and 푄, and let 푑(푋, 푡) ≜ max_{푥∈풳} ‖푃^푡(푥, ·) − 휋‖TV be the worst-case (with respect to the initial state) total variation distance for a chain {푋푡} with stationary distribution 휋. Let 푡mix(푋, 휖) ≜ min{푡 : 푑(푋, 푡) ≤ 휖} denote the mixing time of a chain {푋푡}.

(a) 훽 = 0 (b) 훽 = 100 (c) 훽 = 200 (d) 훽 = 300 (e) 훽 = 400 (f) 훽 = 500

Figure 2-1: Simulation of the Attracting Random Walks model on a grid graph

Theorem 1. For any graph 풢, there exists 훽+ > 0 such that if 훽 > 훽+, the mixing time of the ARW model is exponential in 푛.

Theorem 2. For any graph 풢, there exists 훽− > 0 such that if 0 ≤ 훽 < 훽−, the mixing time of the ARW model is 푂(푛 log 푛).

Note that we do not prove that one value 훽+ = 훽− satisfies both statements.

2.2.2 Connection to the Potts Model

In the case where 풢 is the complete graph, the Attracting Random Walks model is a projection of Glauber dynamics of the Curie–Weiss Potts model. The Potts model is a multicolor generalization of the Ising model, and the Curie–Weiss version considers a complete graph. In the Curie–Weiss Potts model, the vertices of a complete graph are assigned a color from [푞] = {1, . . . , 푞}. Setting 푞 = 2 corresponds to the Ising model. Let 푠(푖) be the color of vertex 푖 for each 1 ≤ 푖 ≤ 푛. Define

\[
\delta(s(i), s(j)) \triangleq
\begin{cases}
1, & \text{for } s(i) = s(j), \\
0, & \text{for } s(i) \neq s(j).
\end{cases}
\]

The stationary distribution of the Potts model, with no external field, is
\[
\pi(s) = \frac{1}{Z}\exp\left(\frac{\beta}{n}\sum_{(i,j),\, i\neq j}\delta(s(i), s(j))\right).
\]

The Glauber dynamics for the Curie–Weiss Potts model are as follows:

1. Choose a vertex 푖 uniformly at random.

2. Update the color of vertex 푖 to color 푘 ∈ [푞] with probability proportional to exp((훽/푛) ∑_{푗≠푖} 훿(푘, 푠(푗))).

Observe that the summation ∑_{푗≠푖} 훿(푘, 푠(푗)) is equal to the number of vertices, apart from vertex 푖, that have color 푘. Therefore, if each vertex in the Potts model corresponds to a particle in the ARW model, and each color in the Potts model corresponds to a vertex in the ARW model, then the ARW model is a projection of the Glauber dynamics for the Potts model. The correspondence is illustrated in Figure 2-2. Under the correspondence, the ARW chain is exactly the “vector of proportions” chain in the Potts model.

Figure 2-2: Correspondence of the Curie–Weiss Potts model to the Attracting Random Walks model. A Potts configuration is drawn on the left, and the corresponding ARW configuration is drawn on the right.

Let 푣(푖) be the vertex location of the 푖th particle in the ARW model, for 1 ≤ 푖 ≤ 푛. By the correspondence, we show that the stationary distribution of the ARW model is

\begin{align*}
\pi(x) &= \frac{1}{Z}\binom{n}{x(1), x(2), \ldots, x(k)}\exp\left(\frac{\beta}{n}\sum_{(i,j),\, i\neq j}\delta(v(i), v(j))\right) \\
&= \frac{1}{Z}\binom{n}{x(1), x(2), \ldots, x(k)}\exp\left(\frac{\beta}{2n}\sum_{i=1}^{n}\left(x(v(i)) - 1\right)\right) \\
&= \frac{1}{Z}\binom{n}{x(1), x(2), \ldots, x(k)}\exp\left(\frac{\beta}{2n}\sum_{i=1}^{k}x(i)^2 - \frac{\beta}{2}\right) \\
&= \frac{1}{Z'}\binom{n}{x(1), x(2), \ldots, x(k)}\exp\left(\frac{\beta}{2n}\sum_{i=1}^{k}x(i)^2\right).
\end{align*}
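Since the graph is complete here, this formula can also be checked numerically by brute force on small instances; the sketch below builds the full ARW transition matrix and verifies that the displayed distribution is stationary (the values of 푛, 푘, and 훽 are arbitrary test choices):

import itertools
from math import factorial, exp
import numpy as np

# Brute-force check, for small n and k, that the displayed distribution is
# stationary for the ARW chain on the complete graph.
n, k, beta = 4, 3, 1.5
states = [x for x in itertools.product(range(n + 1), repeat=k) if sum(x) == n]
index = {x: a for a, x in enumerate(states)}

def multinomial(x):
    m = factorial(n)
    for c in x:
        m //= factorial(c)
    return m

pi = np.array([multinomial(x) * exp(beta / (2 * n) * sum(c * c for c in x)) for x in states])
pi /= pi.sum()

P = np.zeros((len(states), len(states)))
for a, x in enumerate(states):
    for i in range(k):
        if x[i] == 0:
            continue
        denom = sum(exp(beta * x[l] / n) for l in range(k) if l != i) \
                + exp(beta * (x[i] - 1) / n)
        for j in range(k):
            w = exp(beta * (x[i] - 1) / n) if j == i else exp(beta * x[j] / n)
            y = list(x); y[i] -= 1; y[j] += 1
            P[a, index[tuple(y)]] += (x[i] / n) * w / denom

print(np.allclose(pi @ P, pi))  # expect True: pi is stationary on the complete graph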

Observe that the exp((훽/(2푛)) ∑_푖 푥(푖)²) factor encourages particle aggregation, while the multinomial coefficient encourages particle spread. The reader is encouraged to refer to [9] for a detailed study of the mixing time of the Curie–Weiss Potts model, for different values of 훽. For instance, [9] shows that there exists 훽푠(푞) such that if 훽 < 훽푠(푞), the mixing time is Θ(푛 log 푛), and if 훽 > 훽푠(푞), the mixing time is exponential in 푛. In the ARW context, these results hold with 푞 replaced by 푘. On the other hand, when 풢 is not the complete graph, the correspondence to the Potts model is lost. In fact, the following can be shown:

Theorem 3. For 푛 ≥ 3, the ARW Markov chain is reversible for all 훽 if and only if the graph 풢 is complete.

The non-reversibility can be shown by applying Kolmogorov’s cycle criterion, demonstrating a cycle of states (configurations) that violates the criterion.

Lemma 1 (Kolmogorov’s criterion). A finite state space Markov chain associated with the transition probability matrix 푃 is reversible if and only if for all cyclic sequences of states 푖1, 푖2, . . . , 푖푙−1, 푖푙, 푖1 it holds that

\[
\left(\prod_{j=1}^{l-1}P(i_j, i_{j+1})\right)P(i_l, i_1) = P(i_1, i_l)\left(\prod_{j=0}^{l-2}P(i_{l-j}, i_{l-j-1})\right).
\]

In other words, the forward product of transition probabilities must equal the reverse product, for all cycles of states.

Proof of Theorem 3. First, if the graph is complete, then the chain is a projection of Glauber dynamics, which is automatically reversible. Now suppose 풢 is not complete. We apply Kolmogorov’s cycle criterion. In the ARW model, a state is a particle configuration. A cycle of states is then a sequence of particle configurations such that

Figure 2-3: Initial state of a cycle that breaks Kolmogorov’s criterion.

1. Subsequent configurations differ by the movement of a single particle.

2. The first and last configurations are the same.

If 풢 is not a complete graph, then it is straightforward to show that there exist three vertices 푢 ∼ 푣 ∼ 푤 such that 푢 ≁ 푤. Now we demonstrate a cycle of states that breaks Kolmogorov’s criterion. We have the following situation, illustrated by Figure 2-3. The values 푑푢, 푑푣, and 푑푤 indicate the degrees of the vertices, excluding the named vertices. Place 푛 − 2 particles at 푢 and 2 particles at 푣. The particle movements are as follows: 푣 → 푢, 푣 → 푤, 푢 → 푣, 푤 → 푣. For clarity, let 푓(푧) = exp((훽/푛)푧). The forward transition probabilities are

\[
\left(\frac{2}{n}\cdot\frac{f(n-2)}{f(n-2)+f(1)+1+d_v}\right)
\left(\frac{1}{n}\cdot\frac{1}{f(n-1)+1+1+d_v}\right)
\left(\frac{n-1}{n}\cdot\frac{1}{f(n-2)+1+d_u}\right)
\left(\frac{1}{n}\cdot\frac{f(1)}{f(1)+1+d_w}\right).
\]

The reverse transition probabilities are

\[
\left(\frac{2}{n}\cdot\frac{1}{f(n-2)+f(1)+1+d_v}\right)
\left(\frac{1}{n}\cdot\frac{f(n-2)}{f(n-2)+1+f(1)+d_v}\right)
\left(\frac{1}{n}\cdot\frac{1}{1+1+d_w}\right)
\left(\frac{n-1}{n}\cdot\frac{f(1)}{f(n-2)+f(1)+d_u}\right).
\]

Canceling factors that appear in both products, we are left comparing

(푓(푛 − 1) + 1 + 1 + 푑푣)(푓(푛 − 2) + 1 + 푑푢)(푓(1) + 1 + 푑푤)

to

(푓(푛 − 2) + 1 + 푓(1) + 푑푣) (1 + 1 + 푑푤)(푓(푛 − 2) + 푓(1) + 푑푢) .

Observe that 푓(푧1)푓(푧2) = 푓(푧1 + 푧2). Taking leading terms, the first product is therefore a degree-(2푛 − 2) polynomial in 푒훽. Since 푛 − 2 ≥ 1, the second is a degree-(2푛 − 4) polynomial in 푒훽. Since these polynomials have different degrees, they can agree for only finitely many values of 푒훽, and therefore of 훽 itself. Therefore the Markov chain is not reversible for all 훽.
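The cycle computation above can also be carried out numerically; the following sketch evaluates Kolmogorov’s criterion on the path graph 푢 ∼ 푣 ∼ 푤 of Figure 2-3 with 푑푢 = 푑푣 = 푑푤 = 0 (the helper trans_prob and the test values of 푛 and 훽 are illustrative assumptions):

import numpy as np

# Numerical check of the cycle from the proof on the path graph u - v - w.
beta, n = 2.0, 5
adj = {"u": ["v"], "v": ["u", "w"], "w": ["v"]}

def trans_prob(x, i, j, adj, beta, n):
    # P(move one particle from vertex i to vertex j) under the ARW dynamics.
    weights = {l: np.exp(beta * x[l] / n) for l in adj[i]}
    weights[i] = np.exp(beta * (x[i] - 1) / n)
    return (x[i] / n) * weights[j] / sum(weights.values())

x0 = {"u": n - 2, "v": 2, "w": 0}
moves = [("v", "u"), ("v", "w"), ("u", "v"), ("w", "v")]   # returns to x0

def cycle_product(x, moves):
    prod, x = 1.0, dict(x)
    for i, j in moves:
        prod *= trans_prob(x, i, j, adj, beta, n)
        x[i] -= 1
        x[j] += 1
    return prod

forward = cycle_product(x0, moves)
reverse = cycle_product(x0, [(j, i) for i, j in reversed(moves)])
print(forward, reverse, np.isclose(forward, reverse))   # expect False: not reversible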

2.3 Mixing Time on General Graphs

In this section, we show the existence of a phase transition in mixing time in the ARW model when 훽 is varied, for a general fixed graph. First, we show exponentially slow mixing for 훽 suitably large; that is, we prove Theorem 1 by relating mixing times to hitting times. Next, we show polynomial-time mixing for small values of 훽. The proof is by an adaptation of path coupling. For a reference to standard definitions around Markov chains, see [35].

2.3.1 Slow Mixing

The idea of the proof of slow mixing is to show that with substantial probability, the chain takes an exponential time to access a constant portion of the state space. We now outline the proof, deferring the proofs of the lemmas. First we state a helper lemma.

Lemma 2. For any graph 풢 = (풱, ℰ), there exists a vertex 푣 ∈ 풱 such that for the set of configurations 푆푣 ≜ {푥 : 푥(푣) = max_푤 푥(푤)}, it holds that 휋(푆푣) ≥ 1/푘. In other words, the states where 푣 has the greatest number of particles contribute at least 1/푘 to the stationary probability mass.

By Lemma 2, there exists a vertex 푣 such that 휋(푆푣) ≥ 1/푘. Choose any other vertex 푢. Whenever 푥(푢) > 푛/2, we can be sure that 푣 is not the maximizing vertex, and therefore that a set of states having at least 1/푘 mass under the stationary measure has not been reached. It therefore suffices to lower bound the time until vertex 푢 has lost sufficient particles for vertex 푣 to have the maximum number of particles.

Let 푇푥 ≜ inf{푡 : 푋푡(푢) ≤ 푛/2, 푋0 = 푥}. If the probability that {푋푡} has reached the set {푥 ∈ Ω : 푥(푢) ≤ 푛/2} by time 푡 is less than some 푝, then the total variation distance at time 푡 is at least (1 − 푝)/푘. Therefore we get the following relationship between the mixing time and hitting time:

27 Proposition 1.

\[
t_{\mathrm{mix}}\left(X, (1-p)\frac{1}{k}\right) \ge \inf\left\{t : \min_x \mathbb{P}(T_x \le t) \ge p\right\}.
\]

The problem now reduces to lower bounding this hitting time. The idea is that when particles leave vertex 푢, there is a strong drift back to 푢. However, controlling the hitting times of a multidimensional Markov chain is challenging, and direct comparison is difficult to establish. We instead reason by comparison to another Markov chain, 푍, which lower-bounds the particle occupancy at vertex 푢.

Let 푙(푤) be the length of the shortest path connecting vertex 푢 to vertex 푤. Let 푋̃푡 be a projection of the 푋푡 chain defined by 푋̃푡(푑) ≜ ∑_{푤:푙(푤)=푑} 푋푡(푤), and let Ω̃ be its state space. In other words, the 푑th coordinate of the projected chain counts the number of particles that are a distance 푑 away from vertex 푢. Note that 푋̃푡(0) = 푋푡(푢). We let 퐹 denote this projection, writing 푋̃ = 퐹(푋). For any 0 < 훿 < 1/2, define
\[
T_x(\delta) \triangleq \inf\{t : X_t(u) \le (1-\delta)n,\ X_0 = x\} = \inf\{t : \tilde{X}_t(0) \le (1-\delta)n,\ \tilde{X}_0 = F(x)\}.
\]
For some 훿 > 0 to be determined, let
\[
S \triangleq \{x \in \tilde{\Omega} : x(0) > (1-\delta)n\} \quad \text{and} \quad S^c \triangleq \tilde{\Omega} \setminus S.
\]

We now build a chain 푍 on Ω̃ coupled to 푋̃ such that as long as 푋̃푡 ∈ 푆, 푍푡(0) ≤_st 푋̃푡(0). Then 푇푥(훿) ≥_st inf_푡{푍푡 ∈ 푆^푐}. The remainder of the proof of slow mixing is as follows.

1. Construct a lower-bounding comparison chain 푍 satisfying 푍푡(0) ≤_st 푋̃푡(0) when 푡 ≤ 푇푥(훿).

2. Compute E_{휋_푍}[푍(0)] and use a concentration bound to show that 푍(0) ∼ 휋_푍 places exponentially little mass on the set 푆^푐.

3. Comparing the chain 푋 to 푍, show that 푋 takes exponential time to achieve 푋(푢) ≤ (1 − 훿)푛. The result then follows since 1 − 훿 > 1/2.

We now define the lower-bounding comparison chain 푍, which is a chain on 푛 independent particles. These particles move on the discrete line with points {0, 1, . . . , 퐷}, where 퐷 = diam(풢). We first describe the case 퐷 ≥ 2. Since the comparison needs to hold only when 푋̃푡(0) ≥ (1 − 훿)푛, we assume that 푋̃푡(0) ≥ (1 − 훿)푛. The idea is to identify a uniform constant lower bound on the probability of a particle moving closer to 푢 under this assumption, which tells us that once the particle is at 푢, there is a high probability of remaining there. Let 풩(푢) denote the neighbourhood of 푢, i.e. 풩(푢) = {푤 : 푤 ∼ 푢}. In the 푋 chain, when a particle is at a vertex 푤 ∉ {푢} ∪ 풩(푢), its probability of moving to any one of its neighbors is at least
\[
p \triangleq \frac{1}{e^{\beta\delta} + \Delta},
\]
where Δ is the maximum degree of the graph. This is because the lowest probability when 훽 is large corresponds to placing all 훿푛 movable particles at some other neighbor of 푤. When a particle is at vertex 푢, it stays there with probability at least
\[
q \triangleq \frac{\exp\left(\beta(1-\delta) - \frac{\beta}{n}\right)}{\exp\left(\beta(1-\delta) - \frac{\beta}{n}\right) + e^{\beta\delta} + \Delta - 1}.
\]
When a particle is at a vertex 푤 ∈ 풩(푢), it moves to 푢 with probability at least
\[
\frac{\exp(\beta(1-\delta))}{\exp(\beta(1-\delta)) + e^{\beta\delta} + \Delta - 1} > q.
\]
Note that 푞 > 푝. The transitions of the 푍 chain are chosen in order to maintain the comparison. At each time step, a particle is selected uniformly at random. When the chosen particle is located at 푑 ∉ {0, 1}, the particle moves to 푑 − 1 with probability 푝 and moves to min{푑 + 1, 퐷} with probability 1 − 푝. When the chosen particle is located at 푑 ∈ {0, 1}, it moves to 0 with probability 푞, and moves to 푑 + 1 with probability 1 − 푞. The transition probabilities for single-particle movements are depicted in Figure 2-4. When 퐷 = 1 (i.e., 풢 is the complete graph), we instead have the transitions depicted by Figure 2-5. Lemma 3 establishes the comparison.

Figure 2-4: Single-particle Markov chain from the 푍 chain (퐷 ≥ 2)

Figure 2-5: Single-particle Markov chain from the 푍 chain (퐷 = 1)

Let 휋_푍 denote the stationary distribution of the 푍 chain, and let 휆(푤) be the probability according to 휋_푍 of a particular particle being located at vertex 푤 in the line graph. The following results about the 푍 chain are required to complete the proof.

Lemma 3. For a configuration 푥 ∈ Ω, set 푍0 = 푋̃0 = 퐹(푥). As long as 푡 ≤ 푇푥(훿), the chain 푍푡 satisfies
\[
\sum_{r=0}^{d} Z_t(r) \ \le_{\mathrm{st}}\ \sum_{r=0}^{d} \tilde{X}_t(r)
\]
for all 푑 ∈ {0, 1, . . . , 퐷} and 푡 ∈ {0, 1, 2, . . . }. In particular, 푍푡(0) ≤_st 푋̃푡(0).

Lemma 4. Recall that 퐷 = diam(풢). Let 훿 = 1/(3퐷) and fix 0 < 휖 < 훿/2. For all 훽 large enough, E_{휋_푍}[푍(0)] ≥ (1 − 훿 + 휖)푛. Moreover,
\[
\mathbb{P}_{\pi_Z}\left(Z(0) \le \mathbb{E}_{\pi_Z}[Z(0)] - \epsilon n\right) \le 2\exp\left(-2\epsilon^2 n\right),
\]
which implies
\[
\mathbb{P}_{\pi_Z}\left(Z(0) \le (1-\delta)n\right) \le 2\exp\left(-2\epsilon^2 n\right).
\]

Proof of Theorem 1. Recall the choices of 푢 and 푣 above. Lemma 4 tells us that the 푍 chain places exponentially little stationary mass on the set 푆푐. We now combine this fact with the comparison established in Lemma 3.

Recall 푇푥(훿) = inf{푡 : 푋푡(푢) ≤ (1 − 훿)푛, 푋0 = 푥} = inf{푡 : 푋̃푡(0) ≤ (1 − 훿)푛, 푋̃0 = 퐹(푥)}. Applying Proposition 1 with 푝 = 1/2,
\[
t_{\mathrm{mix}}\left(X, \frac{1}{2k}\right) \ge \min\left\{t : \min_x \mathbb{P}\left(T_x\left(\tfrac{1}{2}\right) \le t\right) \ge \frac{1}{2}\right\}.
\]
Since 1/2 < 1 − 훿, it also holds that
\begin{align}
t_{\mathrm{mix}}\left(X, \frac{1}{2k}\right) &\ge \min\left\{t : \min_x \mathbb{P}(T_x(\delta) \le t) \ge \frac{1}{2}\right\} \nonumber \\
&= \min\left\{t : \mathbb{P}(T_x(\delta) \le t) \ge \frac{1}{2},\ \forall x \in \Omega\right\} \nonumber \\
&= \min\left\{t : \mathbb{P}(T_x(\delta) \le t) \ge \frac{1}{2},\ \forall x \in S\right\}. \tag{2.1}
\end{align}

The last equality is due to the fact that P(푇푥(훿) ≤ 푡) = 1 for all 푥 in 푆^푐. Additionally define
\[
T_x^Z \triangleq \inf\left\{t : Z_t \in S^c,\ Z_0 = F(x)\right\}.
\]

Now because 푍푡 is a lower-bounding chain, it holds that

\[
\mathbb{P}(T_x(\delta) \le t) \le \mathbb{P}\left(T_x^Z \le t\right)
\]
for all 푥 ∈ 푆 and 푡 ≥ 0. Therefore,

\[
t_{\mathrm{mix}}\left(X, \frac{1}{2k}\right) \ge \min\left\{t : \mathbb{P}\left(T_x^Z \le t\right) \ge \frac{1}{2},\ \forall x \in S\right\}.
\]

Finally, from Lemma 4 we know that 휋_푍(푆^푐) ≤ 2 exp(−2휖²푛). Suppose that 푍0 is distributed according to 휋_푍 and consider the hitting time 푇^푍_{휋_푍}. It holds that

\[
T_{\pi_Z}^{Z} \ \ge_{\mathrm{st}}\ \mathrm{Geom}\left(2\exp\left(-2\epsilon^2 n\right)\right).
\]

Therefore, 푡 = 푒^{Θ(푛)} time is required for P(푇^푍_{휋_푍} ≤ 푡) ≥ 1/2. The same is true when 푍0 = 푥, for some 푥 ∈ 푆. Therefore min{푡 : P(푇^푍_푥 ≤ 푡) ≥ 1/2, ∀푥 ∈ 푆} = 푒^{Θ(푛)} and 푡mix(푋, 1/(2푘)) = 푒^{Ω(푛)}, which proves Theorem 1.

We now provide the deferred proofs.

Proof of Lemma 2. By the Union Bound,
\[
\sum_{v \in \mathcal{V}} \pi(S_v) \ge \pi\left(\bigcup_{v \in \mathcal{V}} \left\{x(v) = \max_w x(w)\right\}\right) = 1.
\]
Hence 휋(푆푣) ≥ 1/푘 for some 푣 ∈ 풱.

Proof of Lemma 3. We show that there exists a coupling (푋̃푡, 푍푡) satisfying
\[
\sum_{r=0}^{d} Z_t(r) \le \sum_{r=0}^{d} \tilde{X}_t(r) \tag{2.2}
\]

for all 푑 ∈ {0, 1, . . . , 퐷} and 푡 ≤ 푇푥(훿). Since 푍0 = 푋̃0, we can pair up the particles at time 푡 = 0 and design a synchronous coupling, i.e. when a certain particle is chosen in the 푋̃ process, its copy is chosen in the 푍 chain. We design the coupling so that for each particle, the 푋̃-copy is at least as close to 0 as the 푍-copy, for all 푡 ≤ 푇푥(훿). Note that this implies (2.2) for all 푑 ∈ {0, 1, . . . , 퐷} and 푡 < 푇푥(훿). The uniformity of 푝 and 푞 over all configurations in 푆 ensures that the coupling will maintain the requirement (2.2), which is established by induction on 푡. The following analysis applies to both 퐷 ≥ 2 and 퐷 = 1 by considering the relevant cases.

The base case (푡 = 0) holds since 푍0 = 푋̃0. Suppose that at time 푡 < 푇푥(훿), each particle in the 푋̃ chain is at least as close to 0 as its copy in the 푍 chain. We will show that the same property holds for time 푡 + 1. First consider a particle located at 0 in the 푍 chain. By the inductive hypothesis, its copy must be located at 0 in the 푋̃ process also, and the corresponding particle in the 푋 chain must be at 푢. The probability of the particle staying at 0 in the 푍 chain is smaller than the probability of the corresponding particle staying at 푢 in the 푋 chain, since 푞 is a uniform lower bound on the probability of staying at 푢. Therefore in this case, the property is maintained in the next time step. Next consider a particle located at vertex 푑 ≠ 0 in the 푍 chain and suppose its copy is located at vertex 푑′ in the 푋̃ process. By the inductive hypothesis, 푑′ ≤ 푑. If 푑′ < 푑 − 1, then clearly the property is maintained in the next step. It remains to consider the cases 푑′ = 푑 and 푑′ = 푑 − 1. Consider the case 푑 = 푑′. We couple the particles so that if the particle in the 푍 chain moves left to vertex 푑 − 1, then the particle in the 푋̃ process makes the same transition. This coupling is possible by the uniformity of 푝 and 푞. Otherwise, the particle in the 푍 chain moves right, and the property is maintained. Next consider the case 푑′ = 푑 − 1. It suffices to design a coupling such that if the particle in the 푋̃ process moves right, then so does the particle in the 푍 chain. If 푑 ≥ 3, this is possible due to the fact that 1 − 푝 is a uniform upper bound on the probability of moving right from these states. Next suppose that 푑 = 2 and 푑′ = 1. The particle in the 푋̃ process moves right with probability upper-bounded by 1 − 푞, which is smaller than 1 − 푝 for 훿 sufficiently small and 푛 sufficiently large. Therefore we can ensure the property in the next step. Finally, suppose 푑 = 1 and 푑′ = 0. Due to the fact that 1 − 푞 is a uniform upper bound on the probability of moving right from these states, we can again construct a coupling that maintains the property.

To prove Lemma 4, we need the stationary probability 휆(0).

Proposition 2. It holds that
\[
\lambda(0) =
\begin{cases}
q & \text{if } D = 1, \\[1ex]
q\left[1 + \dfrac{(1-q)^2}{p}\left(\dfrac{p}{1-p}\right)^{2-D}\dfrac{1-\left(\frac{p}{1-p}\right)^{D-1}}{1-\frac{p}{1-p}}\right]^{-1} & \text{if } D \ge 2.
\end{cases}
\]

The proof of Proposition 2 is deferred to the appendix.
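As a sanity check on this closed form, one can compare it with a direct numerical computation of the stationary distribution of the single-particle chain in Figure 2-4; in the sketch below the values of 푝, 푞, and 퐷 are arbitrary test choices (with 푝 ≠ 1/2):

import numpy as np

# Compare the closed form for lambda(0) (D >= 2 case) with the stationary
# distribution of the single-particle chain on {0, 1, ..., D}.
p, q, D = 0.2, 0.9, 4

T = np.zeros((D + 1, D + 1))
T[0, 0], T[0, 1] = q, 1 - q
T[1, 0], T[1, 2] = q, 1 - q
for d in range(2, D + 1):
    T[d, d - 1] = p
    T[d, min(d + 1, D)] += 1 - p

# Stationary distribution via the left eigenvector with eigenvalue 1.
w, V = np.linalg.eig(T.T)
lam = np.real(V[:, np.argmin(np.abs(w - 1))])
lam /= lam.sum()

s = p / (1 - p)
closed_form = q / (1 + (1 - q) ** 2 / p * s ** (2 - D) * (1 - s ** (D - 1)) / (1 - s))
print(lam[0], closed_form)   # the two values should agree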

Proof of Lemma 4. When 퐷 = 1, we have 휆(0) = 푞. Since 푞(훽) → 1 as 훽 → ∞, it holds that 휆(0) ≥ 1 − 훿 + 휖 for 훽 large enough. Next, for 퐷 ≥ 2 we have

\begin{align}
\lambda(0) &= \frac{q}{1 + \frac{(1-q)^2}{p}\left(\frac{p}{1-p}\right)^{2-D}\frac{1-\left(\frac{p}{1-p}\right)^{D-1}}{1-\frac{p}{1-p}}} \nonumber \\
&\ge \frac{q}{1 + \frac{(1-q)^2}{p}\left(\frac{p}{1-p}\right)^{1-D}\frac{\frac{p}{1-p}}{1-\frac{p}{1-p}}} \nonumber \\
&= \frac{q}{1 + \frac{(1-q)^2}{1-2p}\left(\frac{p}{1-p}\right)^{1-D}}. \tag{2.3}
\end{align}

We show that 휆(0) ≥ 1 − 훿 + 휖 for 훽 and 푛 large enough. Again, for 훽 large enough, we have 푞 ≥ 1 − 훿 + 2휖. Next,

\begin{align}
1 - q &= \frac{e^{\beta\delta} + \Delta - 1}{\exp\left(\beta(1-\delta) - \frac{\beta}{n}\right) + e^{\beta\delta} + \Delta - 1} \nonumber \\
&\le \frac{e^{\beta\delta} + \Delta - 1}{\exp\left(\beta(1-\delta) - \frac{\beta}{n}\right)} \nonumber \\
&= \exp\left(\beta\left(\frac{2}{3D} - 1 + \frac{1}{n}\right)\right) + (\Delta - 1)\exp\left(\beta\left(\frac{1}{3D} - 1 + \frac{1}{n}\right)\right) \tag{2.4} \\
&\le \exp\left(\beta\left(\frac{1}{D} - 1\right)\right), \nonumber
\end{align}
where the last inequality holds for 훽 and 푛 large enough, since the first term of (2.4) dominates. Since 푝 < 1/(Δ + 1), we have 1 − 2푝 > (Δ − 1)/(Δ + 1). Finally,

\[
\left(\frac{p}{1-p}\right)^{1-D} = \left(e^{\beta\delta} + \Delta - 1\right)^{D-1} = \left(e^{\beta/(3D)} + \Delta - 1\right)^{D-1} \le \left(e^{\beta/(2D)}\right)^{D-1} < e^{\beta/2},
\]
where the first inequality holds for 훽 large enough. Substituting into (2.3), we obtain for 훽 and 푛 large enough

\[
\lambda(0) \ge \frac{1-\delta+2\epsilon}{1 + \frac{\Delta+1}{\Delta-1}\exp\left(2\beta\left(\frac{1}{D}-1\right) + \frac{\beta}{2}\right)} = \frac{1-\delta+2\epsilon}{1 + \frac{\Delta+1}{\Delta-1}\exp\left(\beta\left(\frac{2}{D}-\frac{3}{2}\right)\right)} \ge 1-\delta+\epsilon,
\]
where the second inequality holds for 훽 large enough, due to 2/퐷 − 3/2 < 0 for 퐷 ≥ 2. We conclude that the expectation is linearly separated from the boundary:

\[
\mathbb{E}_{\pi_Z}[Z(0)] = \lambda(0)\, n \ge (1-\delta+\epsilon)\, n.
\]

Next we show concentration. Label all the particles, and define 푈푖 = 1 if particle 푖 is at vertex 0 in the line graph, and 푈푖 = 0 otherwise. Then 푍(0) = ∑_푖 푈푖, and 푈푖 is independent of 푈푗 for all 푖 ≠ 푗. Applying Hoeffding’s inequality,

\[
\mathbb{P}_{\pi_Z}\left(\left|Z(0) - \mathbb{E}_{\pi_Z}[Z(0)]\right| \ge cn\right) \le 2\exp\left(-\frac{2(cn)^2}{n}\right) = 2\exp\left(-2c^2 n\right)
\]

for 푐 > 0. Let 푐 = 휖. Then the above implies

\begin{align*}
\mathbb{P}_{\pi_Z}\left(Z(0) \le \mathbb{E}_{\pi_Z}[Z(0)] - \epsilon n\right) &\le 2\exp\left(-2\epsilon^2 n\right) \\
\implies \mathbb{P}_{\pi_Z}\left(Z(0) \le (1-\delta)n\right) &\le 2\exp\left(-2\epsilon^2 n\right).
\end{align*}

2.3.2 Fast Mixing

The proof of fast mixing is due to Yury Polyanskiy. The proof is by a modification of path coupling, which is a method to find an upper bound on the mixing time through contraction of the Wasserstein distance. The following definition can be found in [35], p. 189.

Definition 1 (Transportation metric). Given a metric 휌 on a state space Ω, the associated transportation metric 휌_푇 for two probability distributions 휇 and 휈 is defined as
\[
\rho_T(\mu, \nu) \triangleq \inf_{X \sim \mu,\, Y \sim \nu} \mathbb{E}\left[\rho(X, Y)\right],
\]
where the infimum is over all couplings of 휇 and 휈 on Ω × Ω.

Definition 2 (Wasserstein distance). Let 푃 be the transition probability matrix of a Markov chain on a state space Ω, and let 휌 be a metric on Ω. The Wasserstein distance 푊^푃_휌(푥, 푦) of two states 푥, 푦 ∈ Ω with respect to 푃 and 휌 is defined as follows:
\[
W_\rho^P(x, y) \triangleq \rho_T(P(x, \cdot), P(y, \cdot)) = \inf_{X_1 \sim P(x,\cdot),\, Y_1 \sim P(y,\cdot)} \mathbb{E}_{X_1, Y_1}\left[\rho(X_1, Y_1)\right].
\]

In other words, the Wasserstein distance is the transportation metric distance between the next state distributions from initial states 푥 and 푦.
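Definition 1 can be made concrete on a finite state space, where the infimum over couplings is a small linear program; the sketch below computes 휌_푇(휇, 휈) for an arbitrary illustrative metric and pair of distributions (the cost matrix and the distributions are not taken from the thesis):

import numpy as np
from scipy.optimize import linprog

# Transportation metric between mu and nu under cost/metric rho via its LP
# formulation: minimize sum rho(a,b) pi(a,b) subject to the marginal constraints.
rho = np.array([[0, 1, 2],
                [1, 0, 1],
                [2, 1, 0]], dtype=float)
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.2, 0.6])

m = len(mu)
c = rho.flatten()
A_eq = []
for a in range(m):                     # row marginals equal mu
    row = np.zeros((m, m)); row[a, :] = 1; A_eq.append(row.flatten())
for b in range(m):                     # column marginals equal nu
    col = np.zeros((m, m)); col[:, b] = 1; A_eq.append(col.flatten())
b_eq = np.concatenate([mu, nu])

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print(res.fun)                         # transportation distance rho_T(mu, nu)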

The following lemma is the path coupling result, which can be found in [5] and [35]. Given a Markov chain on state space Ω with transition probability matrix 푃, consider a connected graph ℋ = (Ω, ℰ_ℋ), i.e. the vertices of ℋ are the states in Ω and the edges are ℰ_ℋ. Let 푙 be a “length function” for the edges of ℋ, which is an arbitrary function 푙 : ℰ_ℋ → [1, ∞). For 푥, 푦 ∈ Ω, define 휌(푥, 푦) to be the path metric, i.e. 휌(푥, 푦) is the length of the shortest path from 푥 to 푦 in terms of 푙 and ℋ.

Lemma 5 (Path Coupling). Under the above construction, if there exists 훿 > 0 such that for all 푥, 푦 that are connected by an edge in ℋ it holds that
\[
W_\rho^P(x, y) \le (1-\delta)\rho(x, y),
\]
then
\[
d(X, t) \le (1-\delta)^t \operatorname{diam}(\Omega),
\]
where diam(Ω) = max_{푥,푦∈Ω} 휌(푥, 푦) is the diameter of the graph ℋ with respect to 휌.

Our proof of rapid mixing for small enough 훽 relies on rapid mixing of a single random walk. The following lemma demonstrates the existence of a contracting metric for a single random walk. It is possible that such a result appears elsewhere, but we are not aware of a published proof.

Lemma 6. Consider a random walk on 풢 which makes a uniform choice among staying or moving to any of the neighbors and denote by 푄 its transition matrix. Let 푑(푥, 푦) be the expected meeting time of two independent copies of a random walk on a graph started from states 푥 and 푦. Then 푑(푥, 푦) is a metric and 푄 contracts the respective Wasserstein distance. In particular,

\[
W_d^Q(x, y) \le \left(1 - \frac{1}{d_{\max}}\right) d(x, y),
\]
where 푑_max = max_{푥,푦} 푑(푥, 푦). Furthermore, as noted by the reviewer of the journal submission, if 푥 ∼ 푦, then
\[
W_d^Q(x, y) \le \left(1 - \frac{1}{d'_{\max}}\right) d(x, y),
\]
where 푑′_max = max_{푥,푦:푥∼푦} 푑(푥, 푦).

Remark 1. In fact, we can show a stronger result (i.e. with a smaller value in the place of 푑_max): we can allow an arbitrary Markovian coupling between two copies of the random walk and define 푑(푥, 푦) to be the expected meeting time under that coupling.
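On a finite graph, the metric 푑(푥, 푦) of Lemma 6 can be computed by solving the linear system satisfied by the expected meeting times of two independent copies of the lazy walk 푄; the following sketch does so for a small illustrative graph (the graph itself is an arbitrary choice):

import numpy as np
import itertools

# d(x, y) = expected meeting time of two independent copies of the stay-or-move
# walk Q, obtained by solving d(x,y) = 1 + sum_{a,b} Q(x,a) Q(y,b) d(a,b), d(x,x) = 0.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
k = len(adj)
Q = np.zeros((k, k))
for i, nbrs in adj.items():
    for j in nbrs + [i]:
        Q[i, j] = 1.0 / (len(nbrs) + 1)

pairs = [(x, y) for x in range(k) for y in range(k) if x != y]
idx = {p: a for a, p in enumerate(pairs)}
A = np.eye(len(pairs))
b = np.ones(len(pairs))
for (x, y), a in idx.items():
    for xa, yb in itertools.product(range(k), range(k)):
        if xa != yb:
            A[a, idx[(xa, yb)]] -= Q[x, xa] * Q[y, yb]

d = np.linalg.solve(A, b)
print(d.max())   # d_max; the contraction factor in Lemma 6 is at most 1 - 1/d_max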

In order to apply path coupling, we let ℋ = (Ω, ℰ_ℋ) be a graph on particle configurations, where (푥, 푦) ∈ ℰ_ℋ whenever 푦 = 푥 − 푒푖 + 푒푗 for some pair of distinct vertices 푖 and 푗 in 풢. In other words, 푥 and 푦 differ by the position of a single particle. Note that 푖 and 푗 need not be neighboring vertices in 풢. For such a pair of neighboring configurations (푥, 푦), let 푙(푥, 푦) = 푑(푖, 푗). Clearly, 푙(푥, 푦) ≥ 1{푥 ≠ 푦}. Now for any two configurations 푥, 푦 ∈ Ω, let 휌(푥, 푦) denote the path metric induced by ℋ and 푙(·, ·). We show that 휌(푥, 푦) = 푙(푥, 푦) for neighboring configurations.

Proposition 3. For any two configurations 푥, 푦 such that 푦 = 푥 − 푒푖 + 푒푗, it holds that 휌(푥, 푦) = 푙(푥, 푦).

Let 푃푥(푖, ·) be the probability distribution of the next location of the selected particle, when it is initially located at vertex 푖 ∈ 풱 in configuration 푥. Recall that 푄(푖, ·) is the probability distribution of the next location of a simple random walk on 풢, initially located at vertex 푖. Note that when 훽 = 0, it holds that 푃푥(푖, ·) = 푄(푖, ·). When 훽 is small, 푃푥(푖, ·) ≈ 푄(푖, ·). Lemma 7 quantifies this statement.

Lemma 7. For all configurations 푥 and vertices 푖 ∈ 풢, it holds that

\[
\| P_x(i, \cdot) - Q(i, \cdot) \|_{\mathrm{TV}} \le \frac{e^{\beta/2} - 1}{e^{\beta/2} + 1}.
\]
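The bound in Lemma 7 can be spot-checked numerically by enumerating all configurations on a small graph; in the sketch below the graph, 푛, and 훽 are arbitrary test values, and P_row and Q_row are helper names introduced here for illustration:

import numpy as np
import itertools

# Spot check of Lemma 7: TV(P_x(i,.), Q(i,.)) <= (e^{b/2}-1)/(e^{b/2}+1).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
k, n, beta = 4, 5, 1.0

def P_row(x, i):
    # Next-location distribution of a particle at vertex i in configuration x.
    probs = np.zeros(k)
    denom = sum(np.exp(beta * x[l] / n) for l in adj[i]) + np.exp(beta * (x[i] - 1) / n)
    for j in adj[i]:
        probs[j] = np.exp(beta * x[j] / n) / denom
    probs[i] = np.exp(beta * (x[i] - 1) / n) / denom
    return probs

def Q_row(i):
    probs = np.zeros(k)
    for j in adj[i] + [i]:
        probs[j] = 1.0 / (len(adj[i]) + 1)
    return probs

worst = 0.0
for x in itertools.product(range(n + 1), repeat=k):
    if sum(x) != n:
        continue
    for i in range(k):
        if x[i] > 0:
            worst = max(worst, 0.5 * np.abs(P_row(x, i) - Q_row(i)).sum())

bound = (np.exp(beta / 2) - 1) / (np.exp(beta / 2) + 1)
print(worst, bound, worst <= bound + 1e-12)   # expect True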

Next, consider two neighbouring configurations 푥 and 푦. Because only the position of one particle is different between the two configurations, 푃푥(푣, ·) ≈ 푃푦(푣, ·). The following lemma makes this precise.

Lemma 8. Let 푥 and 푦 be neighbouring configurations. Recall that Δ is the maximum degree of the vertices in 풱. The following holds:

\[
\| P_x(v, \cdot) - P_y(v, \cdot) \|_{\mathrm{TV}} \le \frac{\Delta + 1}{n}\beta.
\]
With these results stated, we prove Theorem 2.

Proof of Theorem 2. Suppose 푑(푖, 푗) ≥ 1{푖 ≠ 푗} is a metric on 풢 such that a single-particle random walk’s kernel 푄 satisfies
\[
W_d^Q(i, j) \le (1-\delta) d(i, j) \tag{2.5}
\]
for all 푖 ≠ 푗, and 푑(푖, 푗) ≤ 푑_max. Note that the existence of such a metric 푑(·, ·) was established in Lemma 6 with an estimate of 훿 = 1/푑_max.

Now we wish to bound 푊^푃_휌(푥, 푦) for all neighboring particle configurations 푥 and 푦 related by 푦 = 푥 − 푒푖 + 푒푗. We may choose any coupling in order to obtain an upper bound. The coupling will be synchronous: the choice of particle to be moved will be coordinated between the chains. Namely, if the “extra” particle is chosen in configuration 푥, then so too will the “extra” particle be chosen in configuration 푦. Similarly, if some other particle is chosen in 푥, then a particle at the same vertex will be chosen in 푦. For an illustration, see Figure 2-6.

Figure 2-6: Pairing of particles in the coupling. The edges between vertices are omitted.

Let 푋1 ∼ 푃(푥, ·) and 푌1 ∼ 푃(푦, ·) denote the coupled random variables corresponding to the next configurations. Let 푝⋆ be the “extra” particle. Let 푝̃ be a random variable that denotes the uniformly selected particle. Since our coupling gives an upper bound, we can write
\[
W_\rho^P(x, y) \le \frac{1}{n}\,\mathbb{E}\left[\rho(X_1, Y_1) \mid \tilde{p} = p^\star\right] + \frac{n-1}{n}\,\mathbb{E}\left[\rho(X_1, Y_1) \mid \tilde{p} \neq p^\star\right]. \tag{2.6}
\]

First, suppose the “extra” particle, 푝⋆, is chosen in both chains. This happens with probability 1/푛. By Lemma 7, we can couple the distributions 푃푥(푖, ·) and 푃푦(푗, ·) to 푄(푖, ·) and 푄(푗, ·) respectively with probability at least 1 − (푒^{훽/2} − 1)/(푒^{훽/2} + 1). In that case, we get contraction by a factor of (1 − 훿). With the remaining probability, we assume the worst-case distance of 푑_max. Therefore, the conditional Wasserstein distance is upper bounded as follows:
\[
\mathbb{E}\left[\rho(X_1, Y_1) \mid \tilde{p} = p^\star\right] \le \left(1 - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\right)(1-\delta)\, d(i,j) + \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\, d_{\max}. \tag{2.7}
\]

Next, suppose some other particle (located at 푣) is chosen in both chains. This happens with probability (푛 − 1)/푛. We claim
\[
\mathbb{E}\left[\rho(X_1, Y_1) \mid \tilde{p} \neq p^\star\right] \le \rho(x, y) + 2 d_{\max}\,\frac{\Delta+1}{n}\,\beta. \tag{2.8}
\]
Indeed, by Lemma 8, we can couple particle 푝̃ so that it moves to the same vertex in both chains with probability at least
\[
1 - \frac{\Delta + 1}{n}\,\beta.
\]

By Proposition 3, it holds that 휌(푋1, 푌1) = 푑(푖, 푗) = 휌(푥, 푦) in the case that the particle 푝̃ moves to the same vertex in both chains. Otherwise, an additional distance of at most 2푑_max is incurred. Finally, we substitute the bounds (2.7) and (2.8) into (2.6).

\begin{align}
W_\rho^P(x, y) &\le \rho(x,y) + \frac{1}{n}\left(-\rho(x,y) + \left(1 - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\right)(1-\delta)\rho(x,y) + \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\, d_{\max}\right) + \frac{n-1}{n}\left(\frac{\Delta+1}{n}\right)2\beta d_{\max} \nonumber \\
&= \rho(x,y)\left[1 - \frac{1}{n}\left(1 - \left(1 - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\right)(1-\delta) - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\frac{d_{\max}}{\rho(x,y)} - \frac{n-1}{n}\frac{(\Delta+1)2\beta d_{\max}}{\rho(x,y)}\right)\right] \nonumber \\
&\le \rho(x,y)\left[1 - \frac{1}{n}\left(1 - \left(1 - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\right)(1-\delta) - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\, d_{\max} - (\Delta+1)2\beta d_{\max}\right)\right] \tag{2.9}
\end{align}

where the last inequality is due to 휌(푥, 푦) ≥ 1 and (푛 − 1)/푛 < 1. In order to show contraction, it is sufficient that the expression multiplying 1/푛 be positive:
\begin{align*}
& 1 - \left(1 - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\right)(1-\delta) - \frac{e^{\beta/2}-1}{e^{\beta/2}+1}\, d_{\max} - (\Delta+1)2\beta d_{\max} > 0 \\
\iff\ & \frac{e^{\beta/2}-1}{e^{\beta/2}+1} < \frac{\delta - (\Delta+1)2\beta d_{\max}}{d_{\max} + \delta - 1}.
\end{align*}

For an example of a satisfying 훽, choose 훽 so that
\[
(\Delta+1)2\beta d_{\max} < \frac{\delta}{2} \quad \text{and} \quad \frac{e^{\beta/2}-1}{e^{\beta/2}+1} = \tanh\left(\frac{\beta}{4}\right) < \frac{\delta}{2(d_{\max}+\delta-1)}.
\]

Therefore, we can choose

\[
0 < \beta_- < \min\left\{\frac{\delta}{4 d_{\max}(\Delta+1)},\ 4\tanh^{-1}\left(\frac{\delta}{2(d_{\max}+\delta-1)}\right)\right\}.
\]

Substituting 훽 = 훽− into (2.9), we obtain for some 훿′ > 0
\[
W_\rho^P(x, y) \le \rho(x, y)\left(1 - \frac{1}{n}\,\delta'\right).
\]

Applying the path coupling lemma (Lemma 5), we obtain

\[
d(X, t) \le \left(1 - \frac{1}{n}\,\delta'\right)^t \operatorname{diam}(\Omega) \le \left(1 - \frac{1}{n}\,\delta'\right)^t n\, d_{\max}.
\]

Setting the right hand side to be less than 휖 > 0 in order to bound 푡mix(푋, 휖),

\[
\left(1 - \frac{\delta'}{n}\right)^t n\, d_{\max} \le \epsilon \iff t \ge \frac{\log\left(\frac{\epsilon}{n d_{\max}}\right)}{\log\left(1 - \frac{1}{n}\delta'\right)} \iff t \ge \frac{\log\left(\frac{n d_{\max}}{\epsilon}\right)}{\log\left(\frac{n}{n-\delta'}\right)}.
\]

Since
\[
\log\left(\frac{n}{n-\delta'}\right) = \log\left(1 + \frac{\delta'}{n-\delta'}\right) \ge \frac{\frac{\delta'}{n-\delta'}}{1 + \frac{\delta'}{n-\delta'}},
\]

we have
\[
\frac{\log\left(\frac{n d_{\max}}{\epsilon}\right)}{\log\left(\frac{n}{n-\delta'}\right)} \le \log\left(\frac{n d_{\max}}{\epsilon}\right)\left(1 + \frac{\delta'}{n-\delta'}\right)\frac{n-\delta'}{\delta'} = O(n\log n).
\]

Therefore, 푡mix(푋, 휖) = 푂(푛 log 푛), which completes the proof of Theorem 2.

Remark 2. Arguably, a more natural approach to show fast mixing would be through a more traditional path coupling argument: let ℋ have an edge between configurations 푥 and 푦 = 푥 − 푒푖 + 푒푗 if 푖 and 푗 are adjacent vertices in 풢. Set 푙(푥, 푦) = 1 for adjacent configurations. However, this approach does not yield contraction in the Wasserstein distance, which we show at the end of this section.

We now provide the deferred proofs.

Proof of Lemma 6. First we verify that 푑(푥, 푦) is a metric. It holds that 푑(푥, 푦) = 푑(푦, 푥), and 푑(푥, 푦) ≥ 0 with equality if and only if 푥 = 푦. To show the triangle inequality, start three random walks from vertices 푥, 푦, 푧 and let 휏(푥, 푦) be the meeting time of the walks started from 푥 and 푦. The three random walks are advanced according to the independent coupling, and if a pair of walks collides, they are advanced identically starting from that time. Under this coupling, observe that

휏(푥, 푧) ≤ max{휏(푥, 푦), 휏(푦, 푧)} ≤ 휏(푥, 푦) + 휏(푦, 푧)

and take expectations. Next we show that 푊_푑^푄(푥, 푦) ≤ 푑(푥, 푦) − 1 for 푥 ≠ 푦. We can choose any coupling of 푋1 ∼ 푄(푥, ·) and 푌1 ∼ 푄(푦, ·) to show an upper bound. Letting 푋1 ∼ 푄(푥, ·) and 푌1 ∼ 푄(푦, ·) be independent, we have
\[
W_d^Q(x, y) \le \mathbb{E}[\tau(X_1, Y_1)] = \sum_{a,b} Q(x, a)\, Q(y, b)\, \mathbb{E}[\tau(a, b)]
\]
and
\[
d(x, y) = \mathbb{E}[\tau(x, y)] = 1 + \sum_{a,b} Q(x, a)\, Q(y, b)\, \mathbb{E}[\tau(a, b)].
\]
These two equations imply 푊_푑^푄(푥, 푦) ≤ 푑(푥, 푦) − 1. Finally, 푑(푥, 푦) − 1 ≤ 푑(푥, 푦)(1 − 1/푑_max). If 푥 ∼ 푦, then we conclude 푑(푥, 푦) − 1 ≤ 푑(푥, 푦)(1 − 1/푑′_max).

Proof of Proposition 3. Consider any path from 푥 to 푦: (푥 = 푥0, 푥1, . . . , 푥푚−1, 푥푚 = 푦), where 푥푟+1 = 푥푟 − 푒푖푟 + 푒푗푟 for 푟 ∈ {0, 1, . . . , 푚 − 1}. Then we have

\[
\sum_{r=0}^{m-1} l(x_r, x_{r+1}) = \sum_{r=0}^{m-1} d(i_r, j_r).
\]

We claim that we can rearrange this summation to be of the form

\[
d(i, l_1) + \sum_{r=1}^{m-2} d(l_r, l_{r+1}) + d(l_{m-1}, j)
\]

for some sequence 푙1, . . . , 푙푚−1. Indeed, let ℐ = {푖푟 : 0 ≤ 푟 ≤ 푚 − 1} and 풥 = {푗푟 : 0 ≤ 푟 ≤ 푚 − 1} be the multisets that collect the “outbound” and “inbound” particle transfers, respectively. The value 푖 must appear one more time in ℐ than in 풥. Similarly, the value 푗 must appear one more time in 풥 than in ℐ. All other values appear an equal number of times in ℐ and 풥. By choosing terms 푑(푖푟, 푗푟) in order, beginning with 푑(푖, 푙1), it is possible to rearrange the sum into the given form. By the triangle inequality for 푑(·, ·),

\[
d(i, l_1) + \sum_{r=1}^{m-2} d(l_r, l_{r+1}) + d(l_{m-1}, j) \ge d(i, j) = l(x, y).
\]

Therefore, the shortest distance between 푥 and 푦 is along the edge connecting them, and we conclude that 휌(푥, 푦) = 푙(푥, 푦) for neighboring configurations.

To prove Lemma 7, we state the following proposition.

Proposition 4. The set of distributions {푃푥(푖, ·): 푥 ∈ Ω} parametrized by the configuration 푥 is contained within the convex set

\[
P_\beta \triangleq \left\{ (p_0, \dots, p_d) : \frac{p_i}{p_j} \le e^\beta \ \forall i, j;\ \sum_{i=0}^{d} p_i = 1;\ p_i \ge 0\ \forall i \right\}.
\]

Proof. To show this claim, we compute the ratio $\frac{P_x(i, j_1)}{P_x(i, j_2)}$ when $j_1, j_2 \in \mathcal{N}(i) \cup \{i\}$ and $j_1 \ne j_2$, and show that it is upper bounded by $e^\beta$. There are three cases to consider.

1. The case 푗1 = 푖.

\[
\frac{P_x(i, j_1)}{P_x(i, j_2)} = \frac{\exp\left(\frac{\beta}{n}(x(j_1) - 1)\right)}{\exp\left(\frac{\beta}{n}x(j_2)\right)} = \exp\left(\frac{\beta}{n}\left(x(j_1) - x(j_2) - 1\right)\right).
\]

Since $x(j_1) - x(j_2) - 1 \le n - 1 < n$, it holds that $\frac{P_x(i, j_1)}{P_x(i, j_2)} < e^\beta$.

2. The case 푗2 = 푖.

\[
\frac{P_x(i, j_1)}{P_x(i, j_2)} = \exp\left(\frac{\beta}{n}\left(x(j_1) - x(j_2) + 1\right)\right).
\]

Since $j_2 = i$, we have $x(j_2) \ge 1$. Therefore, again $\frac{P_x(i, j_1)}{P_x(i, j_2)} < e^\beta$.

3. The case 푗1, 푗2 ̸= 푖.

\[
\frac{P_x(i, j_1)}{P_x(i, j_2)} = \exp\left(\frac{\beta}{n}\left(x(j_1) - x(j_2)\right)\right) \le e^\beta.
\]

Proof of Lemma 7. Recall that 풩 (푖) is the neighbor set of vertex 푖 in graph 풢. Let 푑 = |풩 (푖)|. We have

\[
P_x(i, j) =
\begin{cases}
\dfrac{\exp\left(\frac{\beta}{n}x(j)\right)}{\sum_{l \sim i}\exp\left(\frac{\beta}{n}x(l)\right) + \exp\left(\frac{\beta}{n}(x(i) - 1)\right)} & \text{if } i \sim j \\[3ex]
\dfrac{\exp\left(\frac{\beta}{n}(x(i) - 1)\right)}{\sum_{l \sim i}\exp\left(\frac{\beta}{n}x(l)\right) + \exp\left(\frac{\beta}{n}(x(i) - 1)\right)} & \text{if } i = j \\[3ex]
0 & \text{otherwise}
\end{cases}
\]
and
\[
Q(i, j) =
\begin{cases}
\frac{1}{d + 1} & \text{if } i \sim j \\
\frac{1}{d + 1} & \text{if } i = j \\
0 & \text{otherwise.}
\end{cases}
\]

Using Proposition 4,

\[
\max_x \|P_x(i, \cdot) - Q(i, \cdot)\|_{\mathrm{TV}} \le \sup_{p \in P_\beta} \|p - Q(i, \cdot)\|_{\mathrm{TV}} = \frac{e^\beta}{d + e^\beta} - \frac{1}{d + 1}. \tag{2.10}
\]

The inequality is due to the fact that {푃푥(푖, ·): 푥 ∈ Ω} ⊂ 푃훽 and the equality is due to the fact that the maximum of a convex function over a closed and bounded convex

set is achieved at an extreme point, namely $\left(\frac{e^\beta}{d + e^\beta}, \frac{1}{d + e^\beta}, \dots, \frac{1}{d + e^\beta}\right)$. To maximize the right hand side of (2.10), let $f(d) = \frac{e^\beta}{d + e^\beta} - \frac{1}{d + 1}$. Then

\[
f'(d) = -\frac{e^\beta}{(d + e^\beta)^2} + \frac{1}{(d + 1)^2}
= \frac{(d + e^\beta)^2 - e^\beta (d + 1)^2}{(d + e^\beta)^2 (d + 1)^2}
= \frac{\left(d + e^\beta - e^{\beta/2}(d + 1)\right)\left(d + e^\beta + e^{\beta/2}(d + 1)\right)}{(d + e^\beta)^2 (d + 1)^2}.
\]

Setting 푓 ′(푑) = 0 we obtain the solutions 푑 = ±푒훽/2. The solution 푑 = 푒훽/2 is the maximizer. Substituting 푑 = 푒훽/2 into (2.10),

\[
\max_{x, i} \|P_x(i, \cdot) - Q(i, \cdot)\|_{\mathrm{TV}} \le \frac{e^\beta}{e^{\beta/2} + e^\beta} - \frac{1}{e^{\beta/2} + 1} = \frac{e^{\beta/2}}{e^{\beta/2} + 1} - \frac{1}{e^{\beta/2} + 1} = \frac{e^{\beta/2} - 1}{e^{\beta/2} + 1},
\]
which completes the proof.

Proof of Lemma 8. First,

1 ∑︁ ‖푃 (푣, ·), 푃 (푣, ·)‖ = |푃 (푣, 푤) − 푃 (푣, 푤)| . 푥 푦 TV 2 푥 푦 푤∈풩 (푣)∪{푣}

2훽 We will show that each term is upper bounded by 푛 . Since there are at most Δ + 1 terms, the bound follows.

We compute max푥,푦:푥∼푦 |푃푥(푣, 푤) − 푃푦(푣, 푤)| for 푤 ∈ 풩 (푣) ∪ {푣}. Since 푥 and 푦 are interchangeable, we can drop the absolute value.

max |푃푥(푣, 푤) − 푃푦(푣, 푤)| = max 푃푥(푣, 푤) − 푃푦(푣, 푤). 푥,푦:푥∼푦 푥,푦:푥∼푦

First consider the case that 푣 ̸= 푤. Then

max 푃푥(푣, 푤) − 푃푦(푣, 푤) 푥,푦:푥∼푦 exp (︀ 훽 푥(푤))︀ exp (︀ 훽 푦(푤))︀ = max 푛 − 푛 . 푥,푦:푥∼푦 (︀ 훽 )︀ ∑︀ (︀ 훽 )︀ (︀ 훽 )︀ ∑︀ (︀ 훽 )︀ exp 푛 (푥(푣) − 1) + 푢∼푣 exp 푛 푥(푢) exp 푛 (푦(푣) − 1) + 푢∼푣 exp 푛 푦(푢)

Let (︂훽 )︂ ∑︁ (︂훽 )︂ 퐴(푧) = exp (푧(푣) − 1) + exp 푧(푢) . 푛 푛 푢∼푣

Note that 퐴(푦) ≤ 푒훽/푛퐴(푥) for 푥 ∼ 푦. We have

(︀ 훽 )︀ (︀ 훽 )︀ exp 푛 푥(푤) exp 푛 푦(푤) max 푃푥(푣, 푤) − 푃푦(푣, 푤) = max − 푥,푦:푥∼푦 푥,푦:푥∼푦 퐴(푥) 퐴(푦) exp (︀ 훽 푥(푤))︀ exp (︀ 훽 푦(푤))︀ ≤ max 푛 − 푛 푥,푦:푥∼푦 퐴(푥) 푒훽/푛퐴(푥) exp (︀ 훽 푥(푤))︀ − exp (︀ 훽 (푦(푤) − 1))︀ = max 푛 푛 푥,푦:푥∼푦 퐴(푥) exp (︀ 훽 푥(푤))︀ − exp (︀ 훽 (푥(푤) − 2))︀ ≤ max 푛 푛 푥 퐴(푥) exp (︀ 훽 (푥(푤))︀1 (︀ − 푒−2훽/푛)︀ = max 푛 푥 퐴(푥) ≤ 1 − 푒−2훽/푛 2훽 ≤ . 푛

Next, we consider the case 푣 = 푤. We have

(︀ 훽 )︀ (︀ 훽 )︀ exp 푛 (푥(푣) − 1) exp 푛 (푦(푣) − 1) max 푃푥(푣, 푣) − 푃푦(푣, 푣) = max − 푥,푦:푥∼푦 푥,푦:푥∼푦 퐴(푥) 퐴(푦) exp (︀ 훽 (푥(푣) − 1))︀ exp (︀ 훽 (푦(푣) − 1))︀ ≤ max 푛 − 푛 푥,푦:푥∼푦 퐴(푥) 푒훽/푛퐴(푥) exp (︀ 훽 (푥(푣) − 1))︀ − exp (︀ 훽 (푦(푣) − 2))︀ = max 푛 푛 푥,푦:푥∼푦 퐴(푥) exp (︀ 훽 (푥(푣) − 1))︀ − exp (︀ 훽 (푥(푣) − 3))︀ ≤ max 푛 푛 푥 퐴(푥) exp (︀ 훽 ((푥(푣) − 1))︀1 (︀ − 푒−2훽/푛)︀ = max 푛 푥 퐴(푥)

45 ≤ 1 − 푒−2훽/푛 2훽 ≤ . 푛

We now show that the approach for proving Theorem 2 based on the natural one-step path coupling does not yield the required contraction.

Theorem 4. Let ℋ have an edge between configurations 푥 and 푦 = 푥−푒푖 +푒푗 whenever 푖 and 푗 are adjacent vertices in 풢. Let 푙(푥, 푦) = 1 for adjacent configurations. There exists a graph 풢 such that for 훽 = 0,

\[
W_\rho^P(x, y) \ge 1
\]

for some adjacent configurations 푥, 푦.

Proof. Let 풢 be the 4-vertex path graph. Label the vertices 1, 2, 3, 4 in order along the

path, and consider 푥 and 푦 related by 푦 = 푥 − 푒2 + 푒3 so that the two configurations differ by a transfer from one middle vertex to the other. When 훽 = 0, the transition probabilities are simple: given that a particle is chosen at vertex 푣, it moves to vertex

1 푤 ∈ 풩 (푣) ∪ {푣} with probability 푑푒푔(푣)+1 . The optimal coupling of 푃 (푥, ·) and 푃 (푦, ·) may be expressed as an optimal solution of a linear program, as follows. Write 푥′ ∼ 푥 if 푥′ is adjacent to 푥 in ℋ or 푥′ = 푥. For each 푥′ ∼ 푥 and 푦′ ∼ 푦, let 푧(푥′, 푦′) be the probability of the next states being 푥′ and 푦′ in a coupling. The constraints require the collection of 푧 variables to be a valid coupling, and the objective function calculates the expected distance under the coupling.

∑︁ min 푧(푥′, 푦′)휌(푥′, 푦′) 푥′∼푥,푦′∼푦 ∑︁ s.t. 푧(푥′, 푦′) = 푃 (푥, 푥′) ∀푥′ ∼ 푥 푦′∼푦 ∑︁ 푧(푥′, 푦′) = 푃 (푦, 푦′) ∀푦′ ∼ 푦 푥′∼푥 푧 ≥ 0

This linear program is known as a Kantorovich problem. Our goal is to show that the optimal objective value is at least 1. We will first write down the dual problem. By weak duality, any feasible solution to the dual problem gives a lower bound to the optimal value of the primal problem. Next we will construct a primal solution with objective value equal to 1, and apply the complementary slackness condition to help us construct a dual solution whose objective value is also equal to 1. Finally we will conclude that the optimal solution to the primal problem is equal to 1, by strong duality. For a reference to linear programming duality, see e.g. Chapter 4 of [3]. First we take the dual of the linear program, introducing dual variables 푢(푥′) for 푥′ ∼ 푥 and 푣(푦′) for 푦′ ∼ 푦:

∑︁ ∑︁ max 푢(푥′)푃 (푥, 푥′) + 푣(푦′)푃 (푦, 푦′) 푥′∼푥 푦′∼푦 s.t. 푢(푥′) + 푣(푦′) ≤ 휌(푥′, 푦′) ∀푥′ ∼ 푥, 푦′ ∼ 푦

This linear program is a Kantorovich dual problem. By weak duality, if there exists a dual solution with objective value 푍, then the optimal solution of the primal is at least 푍. Therefore our goal is to find a dual solution with objective value at least 1.

′ ′ 푥(푎) For 푥 = 푥 − 푒푎 + 푒푏 with 푎, 푏 ∈ {1, 2, 3, 4}, 푃 (푥, 푥 ) = 푛(deg(푎)+1) . Similarly, for ′ ′ 푦(푎) 푦 = 푦 − 푒푎 + 푒푏, 푃 (푦, 푦 ) = 푛(deg(푎)+1) . The value of 휌 is given by

\[
\rho(x', y') =
\begin{cases}
0 & \text{if } [x' = y' = x] \text{ or } [x' = y' = y] \\
1 & \text{if } [x' = y,\ y' \ne y] \text{ or } [y' = x,\ x' \ne x] \\
1 & \text{if } [x' = x - e_a + e_b,\ y' = y - e_a + e_b,\ a \ne b] \\
2 & \text{if } [x' = x,\ y' \notin \{x, y\}] \text{ or } [y' = y,\ x' \notin \{x, y\}] \\
3 & \text{otherwise}
\end{cases}
\]

There exists a primal solution with objective value 1: Set

푧(푥 − 푒푎 + 푒푏, 푦 − 푒푎 + 푒푏) = min {푃 (푥, 푥 − 푒푎 + 푒푏), 푃 (푦, 푦 − 푒푎 + 푒푏)} ,

47 1 푧(푥 − 푒 + 푒 , 푦 − 푒 + 푒 ) = , 2 1 3 2 3푛 and 1 푧(푥 − 푒 + 푒 , 푦 − 푒 + 푒 ) = . 2 3 3 4 3푛 Other values of 푧(푥′, 푦′) are set to zero. In other words, 푧 describes a synchronous coupling according to the pairing in Figure 2-6, with particles moving in the same direction always. Now supposing this is an optimal solution, we apply complementary slackness to identify candidate dual optimal solutions. The complementary slackness condition states that if 푧 and (푢, 푣) are optimal primal and dual solutions, then it holds that for all 푥′ ∼ 푥, 푦′ ∼ 푦,

푧(푥′, 푦′)[휌(푥′, 푦′) − 푢(푥′) − 푣(푦′)] = 0.

If our primal solution 푧 is optimal, then whenever 푧(푥′, 푦′) ̸= 0, we need 푢(푥′)+푣(푦′) = 휌(푥′, 푦′). These additional constraints help us construct the following dual feasible solution:

푢(푥) = 1, 푢(푥 − 푒1 + 푒2) = 0, 푢(푥 − 푒2 + 푒1) = 2, 푢(푥 − 푒2 + 푒3) = 0,

푢(푥 − 푒3 + 푒2) = 2, 푢(푥 − 푒3 + 푒4) = 0, 푢(푥 − 푒4 + 푒3) = 0

푣(푦) = 0, 푣(푦 − 푒1 + 푒2) = 1, 푣(푦 − 푒2 + 푒1) = −1, 푣(푦 − 푒2 + 푒3) = 1,

푣(푦 − 푒3 + 푒2) = −1, 푣(푦 − 푒3 + 푒4) = 1, 푣(푦 − 푒4 + 푒3) = 1.

We find that the objective value of this solution is equal to 1. By strong duality, we conclude that the optimal value of the primal problem is equal to 1, and therefore there does not exist a contractive coupling.
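For small examples like the one above, the optimal coupling and its cost can also be checked numerically: the primal program in the proof is a standard transportation linear program. The following Python sketch is our own illustration (the function name and the use of `scipy.optimize.linprog` are not part of the thesis); it computes the minimum-cost coupling of two given marginal distributions under a given cost matrix.

```python
import numpy as np
from scipy.optimize import linprog

def min_cost_coupling(p, q, rho):
    """Solve the primal (Kantorovich) transportation LP: minimize E[rho(X', Y')] over all
    couplings z of the marginals p and q. Returns the optimal objective value."""
    m, k = len(p), len(q)
    c = rho.reshape(-1)                       # cost of each variable z[i, j], flattened row-major
    A_eq = np.zeros((m + k, m * k))
    for i in range(m):                        # row marginals: sum_j z[i, j] = p[i]
        A_eq[i, i * k:(i + 1) * k] = 1.0
    for j in range(k):                        # column marginals: sum_i z[i, j] = q[j]
        A_eq[m + j, j::k] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Toy usage: with 0/1 costs the optimal value is the total variation distance of the marginals.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(min_cost_coupling(p, q, 1.0 - np.eye(3)))   # 0.3
```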

Remark 3. The argument in the proof of Theorem 4 should apply to all graphs 풢 that contain a four-vertex path graph as a subgraph, and possibly to other graphs as well.

2.4 Repelling Random Walks

Throughout our analysis, we have only considered 훽 ≥ 0. However, the case 훽 < 0 (“Repelling Random Walks”) is also theoretically and practically interesting to study. Simulations confirm the intuition that the particles behave like independent random walks when 훽 is close to zero, and spread evenly when 훽 is very negative (see Figure 2-7). We conjecture that there are no hard-to-escape subsets of the state space for any 훽 < 0.

Figure 2-7: Simulation of the Attracting Random Walks model on an 8 × 8 grid graph after $10^6$ steps for 푛 = 320, 훽 = −500.
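A simulation along the lines of Figure 2-7 can be reproduced with a short script. The sketch below is our own minimal implementation (the grid-construction helper, parameter defaults, and random-number choices are illustrative assumptions, not the thesis code); it runs the single-site dynamics with weights proportional to $\exp\left(\frac{\beta}{n}x(\cdot)\right)$, so that $\beta < 0$ produces the repelling behavior discussed above.

```python
import numpy as np

def grid_neighbors(side):
    """Adjacency lists of the side-by-side grid graph; vertices are indexed 0..side^2 - 1."""
    nbrs = [[] for _ in range(side * side)]
    for r in range(side):
        for c in range(side):
            v = r * side + c
            if r + 1 < side:
                nbrs[v].append(v + side); nbrs[v + side].append(v)
            if c + 1 < side:
                nbrs[v].append(v + 1); nbrs[v + 1].append(v)
    return nbrs

def simulate(beta, n=320, side=8, steps=10**5, seed=0):
    rng = np.random.default_rng(seed)
    nbrs = grid_neighbors(side)
    x = np.full(side * side, n // (side * side), dtype=int)
    x[: n % (side * side)] += 1                      # any starting configuration with n particles
    for _ in range(steps):
        i = rng.choice(len(x), p=x / n)              # vertex of a uniformly chosen particle
        options = nbrs[i] + [i]
        occ = np.array([x[j] if j != i else x[i] - 1 for j in options], dtype=float)
        logits = beta * occ / n                      # weight exp(beta * occupancy / n)
        w = np.exp(logits - logits.max())            # numerically stable softmax weights
        j = rng.choice(options, p=w / w.sum())
        x[i] -= 1
        x[j] += 1
    return x.reshape(side, side)

print(simulate(beta=-500.0))   # occupancies spread out nearly evenly for very negative beta
```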

Conjecture 1. For all 훽 < 0 and any graph, the mixing time of the ARW model is polynomial in 푛.

We consider two cases: the extreme case of 훽 = −∞, and the case where 풢 is the complete graph, for certain values of 훽.

2.4.1 The Case 훽 = −∞

Theorem 5. When 훽 = −∞, the mixing time of the Attracting Random Walks model is 푂(푛2).

Proof. When 훽 = −∞, the dynamics are simplified. Suppose a particle is chosen at vertex 푖. Let 퐴 be the set of vertices corresponding to the minimal value(s) of {푥(푖) − 1} ∪ {푥(푗): 푗 ∼ 푖}. The chosen particle moves to a vertex among those in 퐴, uniformly at random.

Our goal is to show that the set
\[
C \triangleq \left\{ x : x(v) \in \left\{ \left\lfloor \tfrac{n}{k} \right\rfloor, \left\lfloor \tfrac{n}{k} \right\rfloor + 1 \right\} \ \forall v \in V \right\} \cap \Omega
\]

satisfies the following three properties: (1) It is absorbing, meaning that once the chain enters 퐶, it cannot escape 퐶; (2) The chain enters 퐶 in polynomial time; (3) Within 퐶, the chain mixes in constant time with respect to 푛. We claim that the maximum particle occupancy cannot increase, and the mini- mum particle occupancy cannot decrease. We now show that the maximum particle

occupancy, 푀푡 , max푣 푋푡(푣), is monotonically non-increasing over time. Suppose that at time 푡, a particle at vertex 푖 is selected and moves to vertex 푗. There are five cases:

1. 푖 = 푗. The maximum does not change.

2. 푖 ̸= 푗, and both are maximizers. This case is not possible, since 푥(푗) > 푥(푖) − 1.

3. 푖 ≠ 푗, 푖 is a maximizer, and 푗 is not. The new maximum value is at most 푀푡, in

the case that 푋푡(푗) = 푋푡(푖) − 1.

4. 푖 ≠ 푗, 푖 is not a maximizer, and 푗 is. This case is not possible, since 푥(푗) > 푥(푖)−1.

5. 푖 ≠ 푗, 푖 and 푗 are not maximizers. The new maximum value is at most 푀푡, in

the case that 푋푡(푗) = 푋푡(푖) − 1.

Therefore 푀푡+1 ≤ 푀푡. A similar argument shows that the minimum particle occupancy is monotonically non-decreasing over time. Together, they imply Property (1).

Next, we show Property (2). Assume 푋푡 ∈/ 퐶. Let ℳ푡 be the set of maximizing vertices at time 푡. We claim there exists at least one vertex 푢 ∈ ℳ푡 such that there

exists a path of distinct vertices 푢 = 푖1 ∼ 푖2 ∼ · · · ∼ 푖푝 satisfying 푥푖2 = 푥푖3 = ··· =

푥푖푝−1 = 푀푡 − 1 and 푥푖푝 ≤ 푀푡 − 2 (allowing 푝 = 2). In other words, there is a walkable

path from 푢 = 푖1 to 푖푝. The maximum length of the path is 푘 −1. The probability that a particle is transferred along this path before any other events happen is therefore

50 lower bounded by (︂푀 1 )︂푘−1 (︂1 1 )︂푘−1 푡 · ≥ · . 푛 Δ + 1 푘 Δ + 1

Therefore the probability that such a transfer happens within 푇1 trials is at least

(︃ )︃푇1 (︂1 1 )︂푘−1 푝 1 − 1 − · . , 푘 Δ + 1

If there had been at least two maximizing vertices to start, the number of maximizing vertices would have fallen by 1. If there had been only one maximizing vertex to start, the maximum value itself would have fallen by 1. We see that there are two types of “good” events: reducing the number of maximiz- ing vertices while the maximum value stays the same, or reducing the maximum value. We claim that the number of “good” events that happen before the chain enters the set 퐶 is upper bounded by 푛2. Indeed, imagine that the particles at each vertex are stacked vertically. A particle movement from vertex 푖 to vertex 푗 is interpreted as a particle moving from the top of the stack at vertex 푖 to the top of the stack at vertex 푗. Observe that the height of a particle cannot increase. Further, each particle’s height can fall by at most 푛 − 1 units over time, and can therefore drop at most 푛 − 1 times. Since all good events require a particle’s height to drop, the number of good events is

2 2 1 at most 푛(푛 − 1) < 푛 . Let 푇2 = ⌈2푛 푝 ⌉ be the number of trials of length 푇1 each.

Let 푁 be the number of successes during the 푇2 trials. By the Hoeffding inequality,

(︃ )︃ (︂ 2푛4 )︂ 2푛4 (︀|푁 − [푁]| ≥ 푛2)︀ ≤ 2 exp − ≤ 2 exp − . P E 2 1 푇2 2푛 푝 + 1

2 1 2 Since E[푁] = 푝⌈2푛 푝 ⌉ ≥ 2푛 ,

(︃ )︃ 2푛4 (︂ 1 )︂ (︀푁 ≤ 푛2)︀ ≤ 2 exp − ≤ 2 exp − 푝푛2 . P 2 1 2푛 푝 + 1 2

Therefore the probability that the chain is in 퐶 after 푇1 × (푘 − 1) × 푇2 steps is at

51 1 2 least 1 − 2 exp(− 2 푝푛 ). For an example, we can even set 푇1 = 1. Then

(︂1 1 )︂푘−1 (︂1 1 )︂1−푘 푝 = · and 푇 ≤ 1 + 2푛2 · , 푘 Δ + 1 2 푘 Δ + 1

Therefore, within $O(n^2)$ steps, the chain is in $C$ with high probability. Finally, we show Property (3). Once the chain is in $C$, there are two types of vertices: those that have $\lfloor \frac{n}{k} \rfloor$ particles, and those that have $\lfloor \frac{n}{k} \rfloor + 1$ particles. Note that there are always $\tilde{k} \triangleq n - k\lfloor \frac{n}{k} \rfloor$ vertices with the higher number of particles. Therefore it is equivalent to study an exclusion process with just $\tilde{k}$ particles on the graph $\mathcal{G}$. With probability $\lfloor \frac{n}{k} \rfloor \cdot \frac{k - \tilde{k}}{n}$, an unoccupied vertex is selected, and the chain stays in place. With the remaining probability, an occupied vertex is chosen uniformly at random. Its particle then moves to a neighboring empty vertex or stays where it is, uniformly at random. Equivalently, the chain is lazy with probability $\lfloor \frac{n}{k} \rfloor \cdot \frac{k - \tilde{k}}{n}$, and otherwise one of the $\tilde{k}$ particles is chosen, and either stays or moves to a neighbor. Since the number of particles $\tilde{k}$ can be upper and lower bounded by constants ($0 \le \tilde{k} \le k$), the mixing time within $C$ is independent of $n$. Therefore, we conclude that the overall mixing time is $O(n^2)$.
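As an illustration of the simplified dynamics used in this proof, the $\beta = -\infty$ move rule can be written in a few lines. The sketch below is our own code (the adjacency-list format and example graph are assumptions for illustration), implementing one step: a uniformly random particle at vertex $i$ moves to a minimizer of $\{x(i) - 1\} \cup \{x(j) : j \sim i\}$, with ties broken uniformly.

```python
import numpy as np

def step_beta_minus_infinity(x, nbrs, rng):
    """One step of the beta = -infinity dynamics: a uniformly random particle at vertex i moves
    to a minimizer of {x(i) - 1} union {x(j) : j ~ i}, with ties broken uniformly at random."""
    n = int(x.sum())
    i = rng.choice(len(x), p=x / n)
    options = nbrs[i] + [i]
    values = [x[j] if j != i else x[i] - 1 for j in options]
    best = min(values)
    j = rng.choice([v for v, val in zip(options, values) if val == best])
    x[i] -= 1
    x[j] += 1
    return x

# Usage on a 4-vertex path graph with 10 particles: the chain quickly enters the balanced set C
# (occupancies {3, 3, 2, 2} in some order) and, as shown in the proof, never leaves it.
rng = np.random.default_rng(0)
nbrs = [[1], [0, 2], [1, 3], [2]]
x = np.array([10, 0, 0, 0])
for _ in range(200):
    step_beta_minus_infinity(x, nbrs, rng)
print(x)
```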

2.4.2 The Complete Graph Case

Note that the complete graph case for 훽 < 0 is equivalent to the vector of proportions chain in the antiferromagnetic Curie–Weiss Potts model.

Theorem 6. On the complete graph with $k$ vertices, the mixing time is $O(n \log n)$ for all $\beta$ satisfying $-\frac{k}{10} < \beta \le 0$.

The proof relies on the following two lemmas.

Lemma 9. Let (푋푡, 푡 ≥ 0) be the ARW chain for any 훽 < 0 and let (푌푡, 푡 ≥ 0) be a chain of independent particles (훽 = 0). Set 푋0 = 푌0. For every vertex 푣 and time 푡,

\[
\left|X_t(v) - \frac{n}{k}\right| \;\le_{\mathrm{st}}\; \left|Y_t(v) - \frac{n}{k}\right|.
\]

For $\lambda \ge 0$, let $C(\lambda) \triangleq \left\{ x : \left|x(v) - \frac{n}{k}\right| \le \lambda n \ \forall v \right\}$.

Lemma 10. On the complete graph, if 푦 = 푥 − 푒푖 + 푒푗 and 푥, 푦 ∈ 퐶(휆), then

\[
\|P_x(v, \cdot) - P_y(v, \cdot)\|_{\mathrm{TV}} \le \frac{-5\beta/n}{2 + (k - 2)e^{2\lambda\beta}}
\]

for $n \ge \frac{-3\beta}{\log\left(\frac{5}{4}\right)}$. The proof of Lemma 9 appears later in this section, and the proof of Lemma 10 is deferred to the appendix due to its technical nature.

Proof of Theorem 6. We may assume that 푛 is large enough so that

−3훽 5 푛 ≥ ⇐⇒ 푒−3훽/푛 ≤ . (︀ 5 )︀ 4 log 4

Let {푌 (푣), 푣 ∈ 풱} be a random variable distributed according to the stationary

distribution of the {푌푡(푣), 푣 ∈ 풱, 푡 ≥ 0} chain at stationarity. At stationarity, the vertex occupancies are strongly concentrated around their means. By the Hoeffding Inequality, for every 휆 > 0,

(︁⃒ 푛⃒ )︁ 2 ⃒푌 (푣) − ⃒ ≥ 휆푛 ≤ 2푒−2휆 푛, P ⃒ 푘 ⃒

for every vertex 푣.

Fix 휖 > 0. We wish to upper bound 푡mix(푋, 휖). Note that the mixing time of the 푌 chain is 푂(푛 log 푛). To see this, consider a synchronous coupling. The expected amount of time to select all the particles is 푂(푛 log 푛), and whenever a particle is selected, it moves to a uniformly random location, which is coupled. Now, for all 휖′,

′ 푇1 , 푡mix (푌, 휖 ) = 푂(푛 log 푛). Therefore at time 푇1, for every 휆 > 0,

(︁⃒ 푛⃒ )︁ 2 ⃒푌 (푣) − ⃒ ≥ 휆푛 ≤ 2푒−2휆 푛 + 휖′, P ⃒ 푇1 푘 ⃒

53 for every vertex 푣. By Lemma 9, it also holds that for every 휆 > 0,

(︁⃒ 푛⃒ )︁ 2 ⃒푋 (푣) − ⃒ ≥ 휆푛 ≤ 2푒−2휆 푛 + 휖′ P ⃒ 푇1 푘 ⃒

{︀ ⃒ 푛 ⃒ }︀ for every vertex 푣. Recall that 퐶(휆) = 푥 : ⃒푥(푣) − 푘 ⃒ ≤ 휆푛 . Then by the Union Bound,

(︁ −2휆2푛 ′)︁ P (푋푇1 ∈/ 퐶(휆)) ≤ 푘 2푒 + 휖 ,

for every 휆 and 푣. We observe that for 푛 large enough, there is always an 휖′ small enough so that (︁ 2 )︁ 휖 푘 2푒−2휆 푛 + 휖′ ≤ . 2

휖 Then with probability at least 1 − /2, 푋푇1 belongs to 퐶(휆).

Next, we establish that for every 훽 < 0, there exists 휆훽 such that (1) once the

chain enters 퐶(휆훽), it takes exponential time to leave 퐶(2휆훽), with high probability;

(2) we can apply path coupling within $C(2\lambda_\beta)$. The first claim is due to comparison with the $\beta = 0$ chain, as established above. We now demonstrate the required contraction for path coupling within $C(2\lambda)$.

Recall that we need to define the edges of the graph ℋ = (Ω, ℰℋ) and choose a length

function on the edges. Let (푥, 푦) ∈ ℰℋ if 푦 = 푥 − 푒푖 + 푒푗 for some 푖 ̸= 푗, and let 푙(푥, 푦) = 1. Consider any pair of neighboring configurations 푥 and 푦. We employ a synchronous coupling, as in Figure 2-6. Namely, the “extra” particle at vertex 푖 in configuration 푥 is paired to the “extra” particle at vertex 푗 in configuration 푦. All other particles are paired by vertex location. When a particle is selected to be moved in the 푥 configuration, the particle that it is paired to in the 푦 configuration is also selected to be moved.

푛−1 With probability 푛 , one of the (푛 − 1) pairs that has the same vertex location is chosen. Suppose it is located at vertex 푣. We couple the transitions in the two chains

according to the coupling achieving the total variation distance ‖푃푥(푣, ·) − 푃푦(푣, ·)‖TV. By Lemma 10, when one of the (푛 − 1) particles paired by vertex location is chosen,

54 we can couple them so that they move to the same vertex with probability at least

−5훽/푛 1 − . 2 + (푘 − 2)푒4휆훽

With the remaining probability, the distance increases by at most 2.

1 With the remaining 푛 probability, the “extra” particle is chosen in both chains.

The chains can then equalize with probability 1 because 푃푥(푖, ·) = 푃푦(푗, ·) on the complete graph. Therefore, we can bound the Wasserstein distance as follows:

\[
W_\rho^P(x, y) \le 1 + 2\cdot\frac{-5\beta/n}{2 + (k - 2)e^{4\lambda\beta}} - \frac{1}{n} = 1 - \frac{1}{n}\left(1 + \frac{10\beta}{2 + (k - 2)e^{4\lambda\beta}}\right).
\]

Therefore, in order to achieve contraction, it suffices that

\[
1 + \frac{10\beta}{2 + (k - 2)e^{4\lambda\beta}} > 0 \iff -10\beta < 2 + (k - 2)e^{4\lambda\beta}. \tag{2.11}
\]

1 Fix 0 < 훿 < 1, and let 휆훽 = 4훽 log(1 − 훿) > 0. Then substituting 휆 = 휆훽, we obtain the condition

−10훽 < 2 + (푘 − 2)(1 − 훿) = 푘 + 훿(2 − 푘). (2.12)

When 훽 ≤ 0 is such that −10훽 < 푘, there exists 훿 > 0 small enough so that the

condition (2.12) holds. We conclude that contraction holds for −푘/10 < 훽 ≤ 0. To summarize the argument, we have shown that in time 푂(푛 log 푛), the chain

enters 퐶(휆훽). After that, the chain leaves the larger set, 퐶(2휆훽), with exponentially

small probability, which can be disregarded. Within 퐶(2휆훽), the Wasserstein distance (︀ (︀ 1 )︀)︀ with respect to the chosen ℋ and 휌 contracts by a factor of 1 − 휃 푛 , so an additional 푂 (푛 log 푛) steps are sufficient. Therefore, the overall mixing time is 푂 (푛 log 푛).

Proof of Lemma 9. We claim that there exists a coupling of {푋푡, 푌푡} such that for all ⃒ 푛 ⃒ ⃒ 푛 ⃒ ˜ ⃒ 푛 ⃒ ˜ 푣 and 푡, ⃒푋푡(푣) − 푘 ⃒ ≤ ⃒푌푡(푣) − 푘 ⃒. Let 푋푡(푣) = ⃒푋푡(푣) − 푘 ⃒ and define 푌푡(푣) similarly.

55 푛 We claim that for all configurations 푥 and vertices 푣, if 푥(푣) ̸= 푘 , then

(︁ ˜ ˜ )︁ (︁ ˜ ˜ )︁ P 푋푡+1(푣) = 푋푡(푣) + 1|푋푡 = 푥 ≤ P 푌푡+1(푣) = 푌푡(푣) + 1|푌푡 = 푥 (2.13)

and

(︁ ˜ ˜ )︁ (︁ ˜ ˜ )︁ P 푋푡+1(푣) = 푋푡(푣) − 1|푋푡 = 푥 ≥ P 푌푡+1(푣) = 푌푡(푣) − 1|푌푡 = 푥 . (2.14)

푛 If 푥(푣) = 푘 , then

(︁ 푛 )︁ (︁ 푛 )︁ 푋 (푣) = + 1|푋 = 푥 ≤ 푌 (푣) = + 1|푌 = 푥 (2.15) P 푡+1 푘 푡 P 푡+1 푘 푡

and

(︁ 푛 )︁ (︁ 푛 )︁ 푋 (푣) = − 1|푋 = 푥 ≤ 푌 (푣) = − 1|푌 = 푥 . (2.16) P 푡+1 푘 푡 P 푡+1 푘 푡

In other words, the inequalities (2.13)-(2.16) state that the 푋 chain is less likely to move in the absolute value–increasing direction, and more likely to move in the absolute value–decreasing direction. These inequalities, along with the fact that

푋0 = 푌0, suffice to prove the lemma.

(︁ 푌푡(푣) )︁ 1 The transitions for the 푌푡(푣) process are +1 with probability 1 − 푛 푘 , and −1

푌푡(푣) 푘−1 with probability 푛 푘 . With the remaining probability, 푌푡+1(푣) = 푌푡(푣). Suppose 푛 푛 푥(푣) ̸= 푘 . There are two cases to analyze when 푥(푣) ̸= 푘 :

푛 1. 푋푡(푣) < 푘 . The probability that 푋푡+1(푣) = 푋푡(푣) − 1 is upper bounded by

푋푡(푣) 푘−1 푛 푘 , because vertex 푣 is a more likely than average destination. In other words, it is harder to lose a particle from vertex 푣 that has fewer than the average number of particles when 훽 < 0, compared to when 훽 = 0. Formally,

(︃ )︃ 푋 (푣) exp (︀ 훽 (푋 (푣) − 1))︀ 푋 (푣) (︂ 1)︂ 푡 1 − 푛 푡 < 푡 1 − . 푛 (︀ 훽 )︀ ∑︀ (︀ 훽 )︀ 푛 푘 exp 푛 (푋푡(푣) − 1) + 푤̸=푣 exp 푛 푋푡(푤)

For the same reason, the probability that 푋푡+1(푣) = 푋푡(푣) + 1 is lower bounded

56 by (︂ 푋 (푣))︂ 1 1 − 푡 . 푛 푘 Therefore, inequalities (2.13) and (2.14) hold in this case.

푛 2. 푋푡(푣) > 푘 . This time, 푣 is a less likely than average destination. The probability

that 푋푡+1(푣) = 푋푡(푣) − 1 is lower bounded by

푋 (푣) 푘 − 1 푡 . 푛 푘

The probability that 푋푡+1(푣) = 푋푡(푣) + 1 is upper bounded by

(︂ 푋 (푣))︂ 1 1 − 푡 . 푛 푘

Therefore, inequalities (2.13) and (2.14) hold in this case also.

푛 Finally, suppose 푥(푣) = 푘 . Then the probability of losing a particle is upper bounded 1 푘−1 푘−1 1 by 푘 푘 , and the probability of gaining a particle is upper bounded by 푘 푘 . Therefore, inequalities (2.15) and (2.16) hold. We conclude that such a coupling exists, and therefore the stochastic dominance holds.

2.5 Conclusion

In this chapter we have introduced a new interacting particle system model. We have shown that for any fixed graph, the mixing time of the Attracting Random Walks Markov chain exhibits a phase transition. We have also partially investigated the Repelling Random Walks model, and we conjecture that the model is always fast mixing. Beyond theoretical results, it is our hope that the model will find practical use.

2.6 Appendix

Proof of Proposition 2. To compute the stationary probabilities 휆(푟), 푟 ∈ {0, 1, . . . , 퐷}, note that we can disregard the initial uniform particle choice, and simply consider a Markov chain on a graph with (퐷 + 1) nodes as in Figure 2-4 or 2-5. When 퐷 = 1, we have 휆(0) = 푞휆(0) + 푞휆(1) =⇒ 휆(0) = 푞. Next, consider 퐷 = 2. We have

푝 휆(2) = (1 − 푝)휆(2) + (1 − 푞)휆(1) =⇒ 휆(1) = 휆(2) 1 − 푞 푝푞 휆(0) = 푞휆(0) + 푞휆(1) =⇒ 휆(0) = 휆(2). (1 − 푞)2

Since 휆(0) + 휆(1) + 휆(2) = 1, we have

(︂ 푝푞 푝 )︂ + + 1 휆(2) = 1, (1 − 푞)2 1 − 푞 and so

푝푞 2 푞 휆(0) = (1−푞) = . 푝푞 + 푝 + 1 (1−푞)2 (1−푞)2 1−푞 1 + 푝

Finally, consider the case 퐷 ≥ 3. We solve the equations for the stationary distribution.

휆(퐷) = (1 − 푝)휆(퐷) + (1 − 푝)휆(퐷 − 1) 푝 =⇒ 휆(퐷 − 1) = 휆(퐷) 1 − 푝 휆(퐷 − 1) = 푝휆(퐷) + (1 − 푝)휆(퐷 − 2) 1 (︂ 푝 )︂ (︂ 푝 )︂2 =⇒ 휆(퐷 − 2) = 휆(퐷) − 푝휆(퐷) = 휆(퐷) 1 − 푝 1 − 푝 1 − 푝 ... (︂ 푝 )︂푖 휆(퐷 − 푖) = 휆(퐷) for 0 ≤ 푖 ≤ 퐷 − 2 (2.17) 1 − 푝 휆(2) = 푝휆(3) + (1 − 푞)휆(1) (2.18)

휆(0) = 푞휆(1) + 푞휆(0) (2.19)

Using Equations (2.17)-(2.19),

(︃ )︃ 1 (︂ 푝 )︂퐷−2 (︂ 푝 )︂퐷−3 휆(1) = 휆(퐷) − 푝 휆(퐷) 1 − 푞 1 − 푝 1 − 푝 푝 (︂ 푝 )︂퐷−2 = 휆(퐷) 1 − 푞 1 − 푝 푝푞 (︂ 푝 )︂퐷−2 휆(0) = 휆(퐷). (2.20) (1 − 푞)2 1 − 푝

∑︀퐷 Since 푖=0 휆(푖) = 1,

(︃ 퐷−2 퐷−2 퐷−2 푖)︃ 푝푞 (︂ 푝 )︂ 푝 (︂ 푝 )︂ ∑︁ (︂ 푝 )︂ 휆(퐷) + + = 1 2 1 − 푝 1 − 푞 1 − 푝 1 − 푝 (1 − 푞) 푖=0 1 휆(퐷) = . 퐷−2 (︂ 푝 퐷−1 )︂ 푝 (︁ 푝 )︁ 1−( 1−푝 ) 2 + 푝 (1−푞) 1−푝 1− 1−푝

Substituting into (2.20)

푞 휆(0) = 퐷−1 . 2 2−퐷 (︂ 푝 )︂ (1−푞) (︁ 푝 )︁ 1−( 1−푝 ) 1 + 푝 푝 1−푝 1− 1−푝

Proof of Lemma 10. Let

∑︁ (︂훽 )︂ (︂훽 )︂ 퐵(푥) = exp 푥(푢) + 1 exp (푥(푣) − 1) 푛 푣̸∈{푖,푗} 푛 푢̸∈{푖,푗,푣} (︂훽 )︂ (︂훽 )︂ (︂훽 )︂ (︂훽 )︂ 퐶(푥) = 1 exp 푥(푖) + 1 exp (푥(푖) − 1) + 1 exp 푥(푗) + 1 exp (푥(푗) − 1) , and 푣̸=푖 푛 푣=푖 푛 푣̸=푗 푛 푣=푗 푛 (︂훽 )︂ (︂훽 )︂ (︂훽 )︂ (︂훽 )︂ 퐷(푥) = 1 exp (푥(푖) − 1) + 1 exp (푥(푖) − 2) + 1 exp (푥(푗) + 1) + 1 exp 푥(푗) . 푣̸=푖 푛 푣=푖 푛 푣̸=푗 푛 푣=푗 푛

Then we can write

1 exp (︀ 훽 푥(푤))︀ + 1 exp (︀ 훽 (푥(푤) − 1))︀ 푃 (푣, 푤) = 푣̸=푤 푛 푣=푤 푛 푥 퐵(푥) + 퐶(푥)

and

1 exp (︀ 훽 푦(푤))︀ + 1 exp (︀ 훽 (푦(푤) − 1))︀ 푃 (푣, 푤) = 푣̸=푤 푛 푣=푤 푛 . 푦 퐵(푥) + 퐷(푥)

To check the sign of 푃푥(푣, 푤) − 푃푦(푣, 푤), it is equivalent to check the sign of

(︂훽 )︂ (︂훽 )︂ exp 푥(푤) (퐵(푥) + 퐷(푥)) − exp 푦(푤) (퐵(푥) + 퐶(푥)) . 푛 푛

Next we show that for fixed $v$, the sign of $P_x(v, w) - P_y(v, w)$ is the same for all $w \notin \{i, j\}$. Suppose $w \notin \{i, j\}$. Then $\exp\left(\frac{\beta}{n}x(w)\right) = \exp\left(\frac{\beta}{n}y(w)\right)$, and it is equivalent to check the sign of the expression $D(x) - C(x)$. Since this expression does not depend on $w$, we conclude that the sign is the same for all $w \notin \{i, j\}$.

If $P_x(v, w) - P_y(v, w) \ge 0$ for all $w \notin \{i, j\}$, then

‖푃푥(푣, ·) − 푃푦(푣, ·)‖TV = max{푃푦(푣, 푖) − 푃푥(푣, 푖), 0} + max{푃푦(푣, 푗) − 푃푥(푣, 푗), 0}.

Similarly, if $P_x(v, w) - P_y(v, w) < 0$ for all $w \notin \{i, j\}$, then

‖푃푥(푣, ·) − 푃푦(푣, ·)‖TV = max{푃푥(푣, 푖) − 푃푦(푣, 푖), 0} + max{푃푥(푣, 푗) − 푃푦(푣, 푗), 0}.

Therefore,

‖푃푥(푣, ·) − 푃푦(푣, ·)‖TV ≤ |푃푥(푣, 푖) − 푃푦(푣, 푖)| + |푃푥(푣, 푗) − 푃푦(푣, 푗)| .

Consider the ratio of denominators of 푃푥(푣, 푤) and 푃푦(푣, 푤). We have

퐵(푥) + 퐶(푥) 푒훽/푛 ≤ ≤ 푒−훽/푛. 퐵(푥) + 퐷(푥)

We first bound |푃푥(푣, 푖) − 푃푦(푣, 푖)|. If 푣 ̸= 푖, we obtain

|푃푥(푣, 푖) − 푃푦(푣, 푖)| {︂⃒ (︂ )︂ (︂ )︂ ⃒ ⃒ (︂ )︂ (︂ )︂ ⃒}︂ 1 ⃒ 훽 훽 훽/푛⃒ ⃒ 훽 훽 −훽/푛⃒ ≤ max ⃒exp 푥(푖) − exp (푥(푖) − 1) 푒 ⃒ , ⃒exp 푥(푖) − exp (푥(푖) − 1) 푒 ⃒ 퐵(푥) + 퐶(푥) ⃒ 푛 푛 ⃒ ⃒ 푛 푛 ⃒

60 exp (︀ 훽 푥(푖))︀ = 푛 (︀푒−2훽/푛 − 1)︀ . 퐵(푥) + 퐶(푥)

Similarly, if 푣 = 푖, we obtain

exp (︀ 훽 푥(푖))︀ |푃 (푣, 푖) − 푃 (푣, 푖)| ≤ 푒−훽/푛 푛 (︀푒−2훽/푛 − 1)︀ . 푥 푦 퐵(푥) + 퐶(푥)

We similarly bound |푃푥(푣, 푗) − 푃푦(푣, 푗)|. If 푣 ̸= 푗, we obtain

|푃푥(푣, 푗) − 푃푦(푣, 푗)| {︂⃒ (︂ )︂ (︂ )︂ ⃒ ⃒ (︂ )︂ (︂ )︂ ⃒}︂ 1 ⃒ 훽 훽 훽/푛⃒ ⃒ 훽 훽 −훽/푛⃒ ≤ max ⃒exp 푥(푗) − exp (푥(푗) + 1) 푒 ⃒ , ⃒exp 푥(푗) − exp (푥(푗) + 1) 푒 ⃒ 퐵(푥) + 퐶(푥) ⃒ 푛 푛 ⃒ ⃒ 푛 푛 ⃒ exp (︀ 훽 푥(푗))︀ = 푛 (︀1 − 푒2훽/푛)︀ . 퐵(푥) + 퐶(푥)

If 푣 = 푗, we obtain

exp (︀ 훽 푥(푗))︀ |푃 (푣, 푗) − 푃 (푣, 푗)| ≤ 푒−훽/푛 푛 (︀1 − 푒2훽/푛)︀ . 푥 푦 퐵(푥) + 퐶(푥)

For any choice of 푣,

(︂훽 )︂ (︂훽 )︂ 퐶(푥) ≥ exp 푥(푖) + exp 푥(푗) . 푛 푛

Therefore,

|푃푥(푣, 푖) − 푃푦(푣, 푖)| + |푃푥(푣, 푗) − 푃푦(푣, 푗)| 푒−훽/푛 (︂ (︂훽 )︂ (︂훽 )︂ )︂ ≤ exp 푥(푖) (︀푒−2훽/푛 − 1)︀ + exp 푥(푗) (︀1 − 푒2훽/푛)︀ (︀ 훽 )︀ (︀ 훽 )︀ 푛 푛 퐵(푥) + exp 푛 푥(푖) + exp 푛 푥(푗) 푒−훽/푛 (︀1 − 푒2훽/푛)︀ (︂ (︂훽 )︂ (︂훽 )︂)︂ = exp 푥(푖) 푒−2훽/푛 + exp 푥(푗) (︀ 훽 )︀ (︀ 훽 )︀ 푛 푛 퐵(푥) + exp 푛 푥(푖) + exp 푛 푥(푗) 푒−3훽/푛 (︀1 − 푒2훽/푛)︀ (︂ (︂훽 )︂ (︂훽 )︂)︂ ≤ exp 푥(푖) + exp 푥(푗) . (︀ 훽 )︀ (︀ 훽 )︀ 푛 푛 퐵(푥) + exp 푛 푥(푖) + exp 푛 푥(푗)

Recall that 푥 ∈ 퐶(휆). We upper bound by setting 푥(푖) and 푥(푗) to their lower bounds,

61 and 푥(푢) to its upper bound for 푢 ̸∈ {푖, 푗}.

|푃푥(푣, 푖) − 푃푦(푣, 푖)| + |푃푥(푣, 푗) − 푃푦(푣, 푗)| 푒−3훽/푛 (︀1 − 푒2훽/푛)︀ (︂훽 (︁푛 )︁)︂ ≤ 2 exp − 휆푛 (︀ 훽 (︀ 푛 )︀)︀ (︀ 훽 (︀ 푛 )︀)︀ 푛 푘 (푘 − 2) exp 푛 푘 + 휆푛 + 2 exp 푛 푘 − 휆푛 2푒−3훽/푛 (︀1 − 푒2훽/푛)︀ = (푘 − 2)푒2휆훽 + 2 −3훽/푛 2푒 (−2훽/푛) ≤ (푘 − 2)푒2휆훽 + 2 −5훽/푛 ≤ , (푘 − 2)푒2휆훽 + 2 where in the second-last inequality we have used the fact that 1 + 푧 ≤ 푒푧 and the last

−3훽/푛 inequality holds when 푒 ≤ 5/4.

We now provide an alternate proof of fast mixing.

Proof of Theorem 2. We apply the Path Coupling Lemma (Lemma 5). Let ℋ have an edge between configurations 푥 and 푦 if 푥 and 푦 differ by the position of a single particle,

meaning that we can write 푦 = 푥 − 푒푖 + 푒푗. Note that 푖 and 푗 need not be neighbors in 풢. Set 푙(푥, 푦) = 1 for such adjacent configurations 푥, 푦. Then the path metric 휌(푥, 푦) is equal to the graph distance in ℋ, which is the number of mismatched particles between 푥 and 푦.

Consider two arbitrary adjacent configurations 푥 and 푦 such that 푦 = 푥 − 푒푖 + 푒푗. We use the same synchronous coupling pairing as pictured in the Figure 2-6, and label the particles from 1 to 푛 according to this pairing. We say that the 푛 − 1 particles that are paired by vertex location are “matched,” and the other pair is “separated.”

퐷 Let 푅 = 2 be the radius of graph 풢. Instead of considering a single step at a time, we consider blocks of 푅푛 steps and construct a variable-length coupling, a technique introduced by [28].

푅푛 Let {푁푡}푡=0 denote the number of matched pairs at step 푡 during this time block.

Then 푁0 = 푛 − 1. The 푁푡 process changes by a value in {−1, 0, 1} at each time step. 퐹 푅푛 For ease of analysis, we define an auxiliary stopped process, denoted by {푁푡 }푡=0.

62 퐹 Let 푇퐹 = inf{푡 : 푁푡 ∈ {푛 − 2, 푛}}, and define 푁푡 = 푁min{푡,푇푓 }. In other words, the stopped process advances as the 푁푡 process until the value changes by −1 or +1. At that point, the stopped process maintains its value. Note that this stopping can only increase the time for the chains to meet, so it is sufficient to show contraction in the stopped process. Another interpretation is that we are measuring contraction not in one step, but after min{푇퐹 , 푅푛} steps. At the end of the time block, the stopped process has changed by a value in {−1, 0, 1}. We therefore need an upper bound on

퐹 퐹 P(푁푅푛 = 푛 − 2) and a lower bound on P(푁푅푛 = 푛). 퐹 퐹 To analyze P(푁푅푛 = 푛 − 2), observe that the event {푁푅푛 = 푛 − 2} implies the event that at least one matched pair separated in the 푁푡 process. Therefore it suffices to find an upper bound on the probability of the latter event. First let us find an upper bound on the probability of separation in one step conditioned on choosing a matched particle. Lemma 8 tells us that the total variation distance on the next

Δ+1 particle location is at most 푛 훽. We set the joint distribution of the next particle location according to the optimal coupling.

Let 퐸1 be the event that at least one pair separates. Then

(︂ Δ + 1 )︂푅푛 (퐸푐) ≥ 1 − 훽 P 1 푛 (︂ Δ + 1 )︂푅푛 =⇒ (퐸 ) ≤ 1 − 1 − 훽 . P 1 푛

퐹 Next, let 퐸2 be the event that 푁푅푛 = 푛, meaning that in the 푁푡 process, the separated pair matches before any matched pair separates. Also define 퐸3 to be the event that 푐 in the 푁푡 process, the separated pair matches. The events 퐸3 and 퐸1 (no matched pair separates in the 푁푡 process) together imply the event 퐸2:

푐 P(퐸2) ≥ P(퐸3 ∩ 퐸1) ≥ P(퐸3) − P(퐸1).

(︀ 1 )︀ To analyze P(퐸3), let 푀 ∼ 퐵푖푛 푅푛, 푛 be the number of steps allocated to the separated pair in the 푁푡 process. Given that the pair receives 푅 steps, the probability 푅 (︁ 1 )︁ of matching is at least 푒훽 +Δ , using the uniform lower bound on particle movements.

63 The probability that the chains coalesce in the stopped process is therefore lower bounded by

(︃ )︃ (︂ 1 )︂푅 (︂ Δ + 1 )︂푅푛 (퐸 ) ≥ (푀 = 푅) − 1 − 1 − 훽 . P 2 P 푒훽 + Δ 푛

Combining the two bounds, the expected distance after one block is upper bounded as

퐹 퐹 푊휌(푥, 푦) ≤ 1 + P(푁푅푛 = 푛 − 2) − P(푁푅푛 = 푛) [︃ ]︃ [︃ (︃ )︃]︃ (︂ Δ + 1 )︂푅푛 (︂ 1 )︂푅 (︂ Δ + 1 )︂푅푛 ≤ 1 + 1 − 1 − 훽 − (푀 = 푅) − 1 − 1 − 훽 푛 P 푒훽 + Δ 푛 [︃ ]︃ [︃ ]︃ (︂ Δ + 1 )︂푅푛 (︂ 1 )︂푅 = 1 + 2 1 − 1 − 훽 − (푀 = 푅) . 푛 P 푒훽 + Δ

Since (︂ Δ + 1 )︂푅푛 lim 1 − 훽 = exp (−푅(Δ + 1)훽) , 푛→∞ 푛 we have (︂ Δ + 1 )︂푅푛 1 − 훽 ≥ exp (−푅(Δ + 2)훽) 푛 for 푛 large enough. To analyze the second term in brackets,

(︂푅푛)︂ (︂ 1 )︂푅 (︂ 1 )︂푅푛−푅 (푀 = 푅) = 1 − P 푅 푛 푛 (︂푅푛)︂푅 (︂ 1 )︂푅 (︂ 1 )︂푅푛−푅 ≥ 1 − 푅 푛 푛 (︂ 1 )︂푅푛 ≥ 1 − → 푒−푅 푛

1 For 푛 large enough, P (푀 = 푅) ≥ 2푒푅 . Combining the two bounds, we have

1 (︂ 1 )︂푅 푊 (푥, 푦) ≤ 1 + 2 [1 − exp (−푅(Δ + 2)훽)] − 휌 2푒푅 푒훽 + Δ 1 (︂ 1 )︂푅 = 1 + 2 [1 − exp (−푅(Δ + 2)훽)] − 2 푒 (푒훽 + Δ)

64 (︂ )︂푅 for 푛 large enough. Let 푓(훽) = 2 [1 − exp (−푅(Δ + 2)훽)] and 푔(훽) = 1 1 , 2 푒(푒훽 +Δ) so that

푊휌(푥, 푦) ≤ 1 + 푓(훽) − 푔(훽).

Both 푓 and 푔 are continuous functions, where 푓 is increasing and 푔 is decreasing, and

푔(0) > 푓(0). Therefore there exists a computable 훽− > 0 such that for all 훽 ≤ 훽−, 푔(훽) > 푓(훽), and we achieve contraction. We consider 훽 ≤ log(Δ). Thus,

1 (︂ 1 )︂푅 푔(훽) ≥ . 2 2푒Δ

Therefore, it suffices that

1 (︂ 1 )︂푅 > 푓(훽) 2 2푒Δ 1 (︂ 1 )︂푅 > 2 [1 − exp (−푅(Δ + 2)훽)] 2 2푒Δ (︃ )︃ 1 (︂ 1 )︂푅 − 푅(Δ + 2)훽 > log 1 − 4 2푒Δ 1 (︂4(2푒Δ)푅 − 1)︂ 훽 < log . 푅(Δ + 2) 4(2푒Δ)푅

We conclude that we can take

{︂ 1 (︂4(2푒Δ)푅 − 1)︂ }︂ 1 (︂4(2푒Δ)푅 − 1)︂ 훽 < min log , log(Δ) = log . − 푅(Δ + 2) 4(2푒Δ)푅 푅(Δ + 2) 4(2푒Δ)푅

Choose

1 (︂4(2푒Δ)푅 − 1)︂ 훽 = log . − 푅(Δ + 3) 4(2푒Δ)푅

Let 훿 = 푔(훽−) − 푓(훽−), so that the Wasserstein distance contracts by a factor of (1 − 훿). Using Lemma 5 on the block chain,

푑(푡) ≤ (1 − 훿)푡푑푖푎푚(Ω) = (1 − 훿)푡푛.

65 1 푡 log 푛+log 휖 For any 휖 > 0, set (1 − 훿) 푛 = 휖. Then 푡 = 1 , and 푡mix(휖) = 푂 (log 푛). Since log( 1−훿 ) each block is 푅푛 steps long, the mixing time for the original chain is 푂 (푛 log 푛). We conclude that there exists a computable 훽0 such that the mixing time of the Attracting Random Walks chain is 푂(푛 log 푛), proving Theorem 2.

Chapter 3

Exponential Convergence Rates for Stochastically Ordered Markov Processes Under Perturbation

We find computable exponential convergence rates for a large class of stochastically ordered Markov processes. We extend the result of Lund, Meyn, and Tweedie (1996), who found exponential convergence rates for stochastically ordered Markov processes starting from a fixed initial state, by allowing for a random initial condition that is also stochastically ordered. Our bounds are formulated in terms of moment-generating functions of hitting times. To illustrate our result, we find an explicit exponential convergence rate for an M/M/1 queue beginning in equilibrium and then experiencing a change in its arrival or departure rates, a setting which has not been studied to our knowledge.

3.1 Introduction

This chapter is concerned with parametrized stochastically ordered Markov processes. Consider, for example, a stable M/M/1 queue with service rate 휇 and arrival rate

휆 < 휇. For a fixed 휇, let {푋푡(휋, 휆)}푡≥0 be the queue-length process with arrival rate 휆

and initial distribution 휋. Then 푋푡(휋, 휆) is stochastically increasing in 휆, for all 푡 ≥ 0.

That is,

′ P (푋푡(휋, 휆) ≥ 푥) ≤ P (푋푡(휋, 휆 ) ≥ 푥)

′ for all 푥 ∈ Z+ if 휆 ≤ 휆 . Similarly, 푋푡(휋, 휇) is stochastically decreasing in 휇 for fixed 휆 and 휋. The focus of this chapter is to analyze the convergence of a parametrized stochastically ordered Markov process to its stationary distribution, when its initial state is distributed according to a stationary distribution for another parameter choice. This will be stated more precisely below. The Markov process is described by its transition kernel and its initial distribution. We assume that the initial distribution is the stationary distribution associated with

setting the parameter equal to 푟0, and we let 푟 be the parameter setting of the transition kernel. The parameter change happens once, at 푡 = 0. In other words, if

푟 = 푟0, the process is always in equilibrium, and when 푟 ̸= 푟0, the system starts in

the equilibrium associated with 푟0 and transitions over time to the one associated with 푟. The equilibrium distributions are denoted by 휋(푟0) and 휋(푟). When 푟 ̸= 푟0 we

say the system is “perturbed.” These Markov processes will be denoted by 푋푡 (푟0, 푟).

We sometimes refer to the collection {푋푡 (푟0, 푟)}푟0,푟 as a “system.” Note that there could be multiple parameters. For example, to study an M/M/1 queue starting in the

stationary distribution associated with (휆0, 휇0), operating under parameters (휆, 휇), we would have 푟0 = (휆0, 휇0) and 푟 = (휆, 휇). In the M/M/1 setting, we say that 푟 = 푟0 if

휆0 = 휆 and 휇0 = 휇. Otherwise, 푟0 ̸= 푟. As in [38], we consider Markov processes that take value in [0, ∞). In this chapter, we consider the total variation distance between a parametrized continuous time Markov process and its stationary distribution. Recall the definition of total variation distance:

Definition 3. The total variation distance between two measures 푃 and 푄 on state space Ω is given by

‖푃 − 푄‖TV = sup |푃 (퐴) − 푄(퐴)| . 퐴⊂Ω For a given random variable 푋, let ℒ(푋) denote the distribution law of 푋. We

seek a convergence bound of the form

\[
\|\mathcal{L}(X_t(r_0, r)) - \pi(r)\|_{\mathrm{TV}} \le C e^{-\alpha t}.
\]

The value 훼 is referred to as the “convergence rate.” Prior work in the area of the convergence of continuous-time Markov processes focuses on convergence assuming a particular deterministic initial state. However, this type of analysis is limiting, because the initial state of a process is often unknown. In situations where the initial state is unobservable, it may be more reasonable to assume a particular initial distribution rather than a particular initial state. Our extension of the result by [38] allows one to analyze a system in equilibrium that undergoes a perturbation of its parameters, pushing it towards another equilibrium. For example, one might wish to analyze the effect of a disruption on a queue of customers waiting for service. The bounds in this chapter would allow one to study how quickly the queue length process reaches the new equilibrium after being perturbed. We start by reviewing the existing literature on the convergence of stochastically ordered Markov processes, focusing on a paper by Lund, Meyn and Tweedie ([38]). We extend the result of [38], allowing the initial state of the system to be distributed according to a stationary distribution from the family of distributions parametrized by the system parameters. To illustrate the value of our result, we apply it to the analysis of perturbed M/M/1 queues. We also analyze a control system of parallel M/M/1 queues, in which the controller seeks to equalize the queue lengths in response to perturbations of the service rates. More importantly, our result applies to a broader class of Markov processes, namely any parametrized Markov process whose initial distribution is a stationary distribution.

3.2 Related work

Lund, Meyn, and Tweedie ([38]) establish convergence rates for nonnegative Markov processes that are stochastically ordered in their initial state, starting from a fixed

69 initial state. Examples of such Markov processes include: M/G/1 queues, birth-and- death processes, storage processes, insurance risk processes, and reflected diffusions. We reproduce here the main theorem, Theorem 2.1 from [38], which will be extended in this chapter.

Theorem 7. ([38]) Suppose that {푋푡} is a Markov process on Ω = [0, ∞) that is stochastically increasing in its initial state, with parameter setting 푟. Let 휏0(푥) be the hitting time to zero of 푋푡 given that 푋0 = 푥, and let 휏0(휋(푟)) be the hitting time to zero of 푋푡 given that 푋0 is distributed according to the stationary distribution 휋(푟).

[︀ 훼휏0(푥)]︀ Let ℒ (푋푡(푥, 푟)) be the distribution law of 푋푡 given that 푋0 = 푥. If E 푒 < ∞ for some 훼 > 0 and some 푥 > 0, then

\[
\|\mathcal{L}(X_t(x, r)) - \pi(r)\|_{\mathrm{TV}} \le \left(\mathbb{E}\left[e^{\alpha\tau_0(x)}\right] + \mathbb{E}\left[e^{\alpha\tau_0(\pi(r))}\right]\right) e^{-\alpha t} \tag{3.1}
\]

for every 푥 ≥ 0 and 푡 ≥ 0.

The significance of this theorem is that it provides computable rates of convergence for a large class of Markov processes by relating the total variation distance from equilibrium to moment generating functions of hitting times to zero. We extend this

result to the situation where 푋0 is distributed according to the stationary distribution corresponding to a different parameter choice. The proof is analogous to the one given in [38] and is based on a coupling approach. The second major result in [38] is to connect a drift condition to the convergence rate in (3.1), which is Theorem 2.2 (i) in [38], reproduced below:

Theorem 8. ([38]) Suppose that {푋푡} is a Markov process that is stochastically increasing in its initial state. Let 풜 be the extended generator of the process. If there exists a drift function 푉 :Ω → [1, ∞) and constants 푐 > 0 and 푏 < ∞ such that for all 푥 ∈ Ω

\[
\mathcal{A}V(x) \le -cV(x) + b\,\mathbf{1}_{\{0\}}(x) \tag{3.2}
\]

[︀ 푐휏 (푥)]︀ then E 푒 0 < ∞ for all 푥 > 0, which implies that (3.1) holds for 훼 ≤ 푐.

70 We also connect Theorem 8 to our extension of Theorem 7. Theorem 7 is applied to several univariate systems in [38]: finite capacity stores, dam processes, diffusion models, periodic queues, and M/M/1 queues. Additionally, one multivariate system is considered in [38]: two M/M/1 queues in series. The paper by Lund et al (1996) has inspired numerous related papers, some of which we reference here. Several directly apply the main results; for example Novak and Watson (2009) used Theorem 7 to derive the convergence rate of an M/D/1 queue. In a more applied work, Kiessler (2008) used the result of [38] to prove the convergence of an estimator for traffic intensity. Other works build on the derivation of bounds for other processes, or more general bounds. For example, Liu et al (2008) applied the main theorem in order to bound the best uniform convergence rate for strongly ergodic Markov chains. Other processes studied are Langevin diffusions ([45]) and jump diffusions ([49]). Hou et al (2005) also used a coupling method, focusing on establishing subgeometric convergence rates. In related work to [30], Liu et al (2010) established subgeometric convergence rates via first hitting times and drift functions. Douc et al (2004) were able to generalize convergence bounds to time-inhomogeneous chains using coupling and drift conditions. Baxendale (2005) derives convergence bounds for geometrically ergodic Markov processes with an alternate approach to [38], though also using a drift condition. Few papers allow for a random initial condition. Roberts and Tweedie (2000) found convergence bounds for stochastically ordered Markov processes with a random initial condition, allowing for no minimal reachable element. However, their bound is stated in terms of a drift condition, which may be challenging to verify because it requires finding a drift function. Rosenthal (2002) also derives a convergence bound for an initial distribution for more general chains, using drift and minorization conditions, via a coupling approach.

3.3 Main result

We begin with some definitions that we utilize in this chapter. We let {푋푡(푟0, 푟)}

denote the process governed by 푟 with initial distribution corresponding to 푟0. Similarly, we let {푋푡(푟)|푋0(푟) = 푥} denote the process governed by 푟 with initial state 푥.

Definition 4. A set 퐴 is said to be increasing if

∀푥 ∈ 퐴, 푦 ≥ 푥 =⇒ 푦 ∈ 퐴.

Definition 5. For a family of nonnegative Markov processes {푋푡(푟0, 푟)} with transition kernel parametrized by 푟 with starting stationary distribution parametrized by 푟0, we say that 푋푡 is stochastically increasing in 푟0 if for all 푡 ≥ 0 and all increasing sets 퐴 ⊂ Ω,

′ P (푋푡(푟0, 푟) ∈ 퐴) ≤ P (푋푡(푟0, 푟) ∈ 퐴)

′ whenever 푟0 ≤ 푟0. Note that for a univariate process, 퐴 is of the form {푦 ∈ Ω: 푦 ≥ 푥} for some 푥.

Definition 6. Define 휏0(푟0, 푟) to be the hitting time to the zero state of {푋푡(푟0, 푟)}.

Similarly, define 휏0(푥, 푟) to be the hitting time to the zero state of {푋푡(푟)|푋0(푟) = 푥}

For a Markov process {푋푡(푟0, 푟)}, let

[︀ 훼휏0(푟0,푟)]︀ 퐺(푟0, 푟, 훼) = E 푒

[︀ 훼휏0(푥,푟)]︀ and similarly for a Markov process {푋푡(푟)|푋0(푟) = 푥}, define 퐺(푥, 푟, 훼) = E 푒 .

We now extend Theorem 7 to allow for a random initial condition.

Theorem 9. Consider a family of nonnegative Markov processes {푋푡(푟0, 푟)} that is stochastically increasing in 푟, where 푟 = 푟0 corresponds to the system being in equilibrium. Let 푟푚 = max{푟0, 푟}. If 퐺(푟푚, 푟, 훼) < ∞ for some 훼 > 0, then

\[
\|\mathcal{L}(X_t(r_0, r)) - \pi(r)\|_{\mathrm{TV}} \le G(r_m, r, \alpha) e^{-\alpha t}. \tag{3.3}
\]

72 d Proof. Note that 푋푡(푟, 푟) = 휋(푟). Using the coupling inequality, we have

‖ℒ (푋푡(푟0, 푟)) − 휋(푟)‖TV ≤ P (푋푡(푟0, 푟) ̸= 푋푡(푟, 푟))

where (푋푡(푟0, 푟), 푋푡(푟, 푟)) is any coupling.

d d Either {푋푡(푟푚, 푟)} = {푋푡(푟0, 푟)} or {푋푡(푟푚, 푟)} = {푋푡(푟, 푟)}. We can create copies ′ ′ ′ ′ 푋푡(푟0, 푟) , 푋푡(푟, 푟) so that 푋푡(푟푚, 푟) = 푋푡(푟0, 푟) ≥ 푋푡(푟, 푟) if 푟푚 = 푟0, and 푋푡(푟푚, 푟) = ′ ′ 푋푡(푟, 푟) ≥ 푋푡(푟0, 푟) if 푟푚 = 푟. This is possible by an extension of Strassen’s Theorem to stochastic processes, developed in [32] and as cited by [38]. We take

′ ′ (푋푡(푟0, 푟) , 푋푡(푟, 푟) ) as the coupling. Then, the process 푋푡(푟푚, 푟) acts as a bounding process. Observe that

{푋푡(푟푚, 푟) = 0} =⇒ {푋푡(푟0, 푟) = 푋푡(푟, 푟) = 0}

and the coupling occurs at or before time 푡. So then we have

P (푋푡(푟0, 푟) ̸= 푋푡(푟, 푟)) ≤ P (휏0 (푟푚, 푟) > 푡) .

Exponentiating and using Markov’s inequality, we obtain the desired result:

(︀ 훼휏0(푟푚,푟) 훼푡)︀ ‖ℒ (푋푡(푟0, 푟)) − 휋(푟)‖TV ≤ P 푒 > 푒 for 훼 > 0

[︀ 훼휏0(푟푚,푟)]︀ −훼푡 ≤ E 푒 푒

−훼푡 = 퐺(푟푚, 푟, 훼)푒

However, the challenge in applying Theorem 9 is finding 훼 > 0 for which 퐺(푟푚, 푟, 훼)

is finite. Note that 퐺(푟푚, 푟, 훼) is a moment generating function (MGF), so {훼 :

퐺(푟푚, 푟, 훼) < ∞} is an interval containing zero, typically referred to as the domain. For some Markov processes, the domain is precisely known. One such example is the

M/M/1 queue with fixed service rate, where the arrival rate is perturbed from 푟0 = 휆0

to 푟 = 휆. For processes where the domain is difficult to find but 푟푚 = 푟, we can apply Theorem 8.

Corollary 1. If 푟푚 = 푟 and the drift condition (3.2) holds for a Markov process

푋푡(푥, 푟) with some 푉 (·), 푏, 푐, then (3.3) holds with 훼 = 푐.

Proof. If the drift condition holds then 퐺(푥, 푟, 훼) < ∞, by Theorem 8. Applying Lemma 3.1 from [38], we also have that 퐺(푟, 푟, 푐) < ∞.

We now apply Theorem 9 to the analysis of a single M/M/1 queue.

3.4 M/M/1 queues

3.4.1 Queue length process

We study the queue length process and consider perturbing the arrival and service

rates from 푟0 = (휆0, 휇0) to 푟 = (휆, 휇). Throughout, we assume the stability conditions

휆0 < 휇0 and 휆 < 휇. First we consider the case of changing the arrival rate while keeping the service rate fixed. We then show how to find bounds for any change of the two parameters, as long as the service rate is greater than the arrival rate.

Suppose that 휇 = 휇0. The two processes are then 푋푡 ((휆0, 휇0), (휆, 휇0)) and

푋푡 ((휆, 휇0), (휆, 휇0)). For clarity of presentation, we omit the service rate in the

notation, and refer to the two processes as 푋푡(휆0, 휆) and 푋푡(휆, 휆), respectively. Let

휆푚 = max{휆0, 휆}. From Theorem 9, we have

\[
\|\mathcal{L}(X_t(\lambda_0, \lambda)) - \pi(\lambda)\|_{\mathrm{TV}} \le G(\lambda_m, \lambda, \alpha) e^{-\alpha t}. \tag{3.4}
\]

Let us analytically compute 퐺(휆푚, 휆, 훼). Let 휏푦(푥, 휆) be the hitting time to 푦 of the M/M/1 queue with parameters set to (휆, 휇), started from a queue length of 푥, and write

[︀ 훼휏0(푥)]︀ 퐺(푥, 휆, 훼) = E 푒 .

Then by conditioning on the initial state, we obtain

\[
G(\lambda_m, \lambda, \alpha) = \mathbb{E}\left[e^{\alpha\tau_0(\lambda_m, \lambda)}\right] = \sum_{x=0}^{\infty} \left(1 - \frac{\lambda_m}{\mu}\right)\left(\frac{\lambda_m}{\mu}\right)^x G(x, \lambda, \alpha).
\]

Now by decomposing the hitting time and noting the independence and stationarity of the incremental hitting times,

[︀ 훼휏0(푥,휆)]︀ 퐺(푥, 휆, 훼) = E 푒 [︃ 푥 ]︃ ∏︁ 훼휏 (푥−푖+1,휆) = E 푒 푥−푖 푖=1 푥 ∏︁ [︀ 훼휏푥−푖(푥−푖+1,휆)]︀ = E 푒 푖=1 푥 ∏︁ [︀ 훼휏0(1,휆)]︀ = E 푒 푖=1 = (퐺(1, 휆, 훼))푥

Therefore

\[
G(\lambda_m, \lambda, \alpha) = \sum_{x=0}^{\infty} \left(1 - \frac{\lambda_m}{\mu}\right)\left(\frac{\lambda_m}{\mu}\right)^x \left(G(1, \lambda, \alpha)\right)^x = \frac{1 - \frac{\lambda_m}{\mu}}{1 - \frac{\lambda_m}{\mu} G(1, \lambda, \alpha)} \tag{3.5}
\]

휆푚 as long as 휇 퐺(1, 휆, 훼) < 1. Next we compute 퐺(1, 휆, 훼). (︁√ √ )︁2 Theorem 10. Assume 휆 < 휇. For 훼 ≤ 휇 − 휆 ,

\[
G(1, \lambda, \alpha) = \mathbb{E}\left[e^{\alpha\tau_0(1, \lambda)}\right] = \frac{1}{2\lambda}\left(\lambda + \mu - \alpha - \sqrt{(\lambda + \mu - \alpha)^2 - 4\lambda\mu}\right). \tag{3.6}
\]

Proof. In order to calculate the MGF, we condition on whether a departure or an

arrival happens first. Let 퐸퐴 be the event that an arrival happens first and let 퐸퐷 be the event that a departure happens first. Let 휏(퐴, 휆) be the time required for the arrival, conditioned on an arrival happening first; we define 휏(퐷, 휆) similarly. Using properties of exponential random variables, we have

[︀ 훼휏0(1,휆)]︀ [︀ 훼휏0(1,휆) ]︀ [︀ 훼휏0(1,휆) ]︀ E 푒 = E 푒 |퐸퐴 P(퐸퐴) + E 푒 |퐸퐷 P(퐸퐷) 휆 휇 = [︀푒훼(휏0(2,휆)+휏(퐴,휆))]︀ + [︀푒훼휏(퐷,휆)]︀ E 휆 + 휇 E 휆 + 휇

75 2 휆 휇 = [︀푒훼휏0(1,휆)]︀ [︀푒훼휏(퐴,휆)]︀ + [︀푒훼휏(퐷,휆)]︀ . E E 휆 + 휇 E 휆 + 휇

Now since 휏(퐴, 휆) =d 휏(퐷, 휆) ∼ 푒푥푝(휆 + 휇),

휆 + 휇 [︀푒훼휏(퐴,휆)]︀ = [︀푒훼휏(퐷,휆)]︀ = E E 휆 + 휇 − 훼

(︁√ √ )︁2 √ so long as 훼 < 휆+휇. In fact, this is the case: 훼 ≤ 휇 − 휆 = 휇+휆−2 휆휇 < 휆+휇.

[︀ 훼휏 (1,휆)]︀ Now in order to find E 푒 0 we solve the resulting quadratic to obtain

1 (︁ √︀ )︁ [︀푒훼휏0(1,휆)]︀ = 휆 + 휇 − 훼 ± (휆 + 휇 − 훼)2 − 4휆휇 . (3.7) E 2휆

(︁√ √ )︁2 In order for the discriminant to be nonnegative, we need 훼 ≤ 휇 − 휆 or (︁√ √ )︁2 훼 ≥ 휇 + 휆 . However, the second condition is overruled by the condition 훼 < 휆 + 휇. To identify the correct root, we use the differentiation property of moment generating functions:

푑 ⃒ [︀ 훼휏0(1,휆)]︀ ⃒ E [휏0(1, 휆)] = E 푒 ⃒ . 푑훼 ⃒훼=0

Again conditioning on whether an arrival or departure happens first, we have

E [휏0(1, 휆)] (︂ 1 )︂ 휆 (︂ 1 )︂ 휇 = [휏 (2, 휆)] + + E 0 휆 + 휇 휆 + 휇 휆 + 휇 휆 + 휇 (︂ 1 )︂ 휆 (︂ 1 )︂ 휇 = [휏 (1, 휆)] = 2 [휏 (1, 휆)] + + E 0 E 0 휆 + 휇 휆 + 휇 휆 + 휇 휆 + 휇 1 =⇒ [휏 (1, 휆)] = . E 0 휇 − 휆

⃒ 푑 [︀ 훼휏 (1,휆)]︀ ⃒ 휇 The + root of Equation (3.7) gives E 푒 0 ⃒ = < 0 and the − root 푑훼 ⃒ 휆(휆−휇) ⃒ 훼=0 푑 [︀ 훼휏0(1,휆)]︀ ⃒ 1 gives 푑훼 E 푒 ⃒ = 휇−휆 = E [휏0(1, 휆)] . This concludes the proof. ⃒훼=0 Remark 4. After proving Theorem 10, we came to know of an alternate proof in [43],

pp. 92-95.
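Formula (3.6) can be sanity-checked by simulation. The following sketch is our own code (helper names and parameter values are illustrative): it simulates the busy period of an M/M/1 queue started from one customer and compares the empirical value of $\mathbb{E}\left[e^{\alpha\tau_0(1,\lambda)}\right]$ with the closed form, for an $\alpha$ strictly below the threshold $(\sqrt{\mu}-\sqrt{\lambda})^2$ (at the threshold itself the estimator is heavy-tailed).

```python
import numpy as np

def busy_period(lam, mu, rng):
    """Hitting time of state 0 for an M/M/1 queue started with one customer."""
    t, q = 0.0, 1
    while q > 0:
        t += rng.exponential(1.0 / (lam + mu))                 # time until the next event
        q += 1 if rng.random() < lam / (lam + mu) else -1      # arrival with prob lam/(lam+mu)
    return t

lam, mu = 0.5, 1.0
alpha = 0.5 * (np.sqrt(mu) - np.sqrt(lam)) ** 2                # strictly inside the admissible range
rng = np.random.default_rng(1)
samples = np.array([busy_period(lam, mu, rng) for _ in range(200_000)])
empirical = np.exp(alpha * samples).mean()
closed_form = (lam + mu - alpha - np.sqrt((lam + mu - alpha) ** 2 - 4 * lam * mu)) / (2 * lam)
print(empirical, closed_form)                                  # the two values closely agree
```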

Now we apply Theorem 9 to the convergence of M/M/1 queue with arrival rate

perturbed from 푟0 = 휆0 to 푟 = 휆, using Theorem 10. There are two cases:

Case 1: 휆푚 = 휆 ≥ 휆0 (︁√ √ )︁2 Set 훼 = 휇 − 휆 in Equation (3.6) to obtain

√︂휇 퐺(1, 휆, 훼) = . 휆

휆푚 To substitute into Equation (3.5), we need to verify that 휇 퐺(1, 휆, 훼) < 1.

√︃ 휆 휆√︂휇 휆 푚 퐺(1, 휆, 훼) = = < 1. 휇 휇 휆 휇

Thus, we obtain 휆 √︃ 1 − 휇 휆 퐺(휆푚, 휆, 훼) = = 1 + √︁ 휆 휇 1 − 휇 and

\[
\|\mathcal{L}(X_t(\lambda_0, \lambda)) - \pi(\lambda)\|_{\mathrm{TV}} \le \left(1 + \sqrt{\frac{\lambda}{\mu}}\right) e^{-\left(\sqrt{\mu} - \sqrt{\lambda}\right)^2 t}.
\]
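As a numerical illustration of the Case 1 bound (our own script; the parameter values are arbitrary and not from the thesis):

```python
import numpy as np

def tv_bound_case1(lam, mu, t):
    """Case 1 bound: (1 + sqrt(lam/mu)) * exp(-(sqrt(mu) - sqrt(lam))^2 * t), valid when lam_m = lam."""
    alpha = (np.sqrt(mu) - np.sqrt(lam)) ** 2
    return (1.0 + np.sqrt(lam / mu)) * np.exp(-alpha * t)

# Arrival rate perturbed upward to lam = 0.5 with service rate mu = 1: evaluate the bound over time.
for t in [0, 10, 50, 100]:
    print(t, tv_bound_case1(0.5, 1.0, t))
```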

Case 2: 휆푚 = 휆0 > 휆 (︁ √ )︁2 휆0 √ We need to pick 훼 for which 1) 휇 퐺(1, 휆, 훼) < 1 and 2) 훼 ≤ 휇 − 휆 . Condition 1) is equivalent to

휆 (︂ 1 (︂ √︁ )︂)︂ 0 휆 + 휇 − 훼 − (휆 + 휇 − 훼)2 − 4휆휇 < 1 휇 2휆 √︁ 2휆휇 (휆 + 휇 − 훼)2 − 4휆휇 > − + 휆 + 휇 − 훼. 휆0

To determine when Condition 1) holds, we set these quantities equal to each other.

√︁ 2휆휇 (휆 + 휇 − 훼)2 − 4휆휇 = − + 휆 + 휇 − 훼 휆0

77 (︂ 2휆휇 )︂2 (휆 + 휇 − 훼)2 − 4휆휇 = − + 휆 + 휇 − 훼 휆0 휆휇 훼 = 휆 + 휇 − 휆0 − 휆0

Squaring may have introduced additional solutions. With this value of 훼, the left side is equal to

√︃ (︂ )︂2 √︁ 2 휆휇 (휆 + 휇 − 훼) − 4휆휇 = 휆0 + − 4휆휇 휆0 √︃ (︂ 휆휇)︂2 = 휆0 − 휆0 ⃒ ⃒ ⃒ 휆휇⃒ = ⃒휆0 − ⃒ . ⃒ 휆0 ⃒

휆휇 √ The right side is equal to 휆0 − . If 휆0 > 휆휇 there is a solution, otherwise there is no 휆0 solution. Setting 훼 = 0, the left side is equal to 휇 − 휆, while the right side is less than √ 휆휇 휇 − 휆 (setting 휆0 = 휇 − 휖). Therefore when 휆0 > 휆휇, we pick 훼 < 휆 + 휇 − 휆0 − . 휆0 √ 2 휆휇 (︁√ )︁ √ We verify that 휆 + 휇 − 휆0 − ≤ 휇 − 휆 . Otherwise, when 휆0 ≤ 휆휇, we are 휆0 (︁√ √ )︁2 free to pick 훼 = 휇 − 휆 . 휆휇 Therefore Theorem 9 is satisfied by substituting either 훼 = 휆 + 휇 − 휆0 − − 휖 휆0 (︁√ √ )︁2 or 훼 = 휇 − 휆 , depending on the value of 휆0. Intuitively, large values of 휆0 correspond to more “contraction” when the system goes to equilibrium, and therefore the convergence rate 훼 should be smaller.

Remark 5. The function

\[
f(\lambda_0) =
\begin{cases}
\left(\sqrt{\mu} - \sqrt{\lambda}\right)^2 & \text{if } \lambda_0 \le \sqrt{\lambda\mu} \\[1ex]
\lambda + \mu - \lambda_0 - \dfrac{\lambda\mu}{\lambda_0} & \text{if } \lambda_0 > \sqrt{\lambda\mu}
\end{cases}
\]

is continuous in 휆0. In other words, the convergence rate changes continuously in 휆0. (︁√ √ )︁2 Remark 6. The rate 훼⋆ = 휇 − 휆 is well-known as the best convergence rate for the M/M/1 queue length process starting in a fixed initial condition (see e.g.12 [ ] in addition to [38]). However, it is not immediately clear that the same result would

78 hold in our setting where the initial state of the queue has a distribution:

‖ℒ (푋푡(휆0, 휆)) − 휋(휆)‖TV  E푋∼휋(휆0) [‖ℒ (푋푡(휆)|푋0 = 푋) − 휋(휆)‖TV] .

In other words, we cannot simply go from quenched to annealed convergence.

In the Appendix, we show another technique that gives a convergence rate of

\[
\alpha = \frac{\log\frac{\mu}{\lambda_0}}{\log\sqrt{\frac{\mu}{\lambda}}}\left(\sqrt{\mu} - \sqrt{\lambda}\right)^2
\]
when $\lambda_0 > \sqrt{\lambda\mu}$. Therefore, the best known convergence rate in the $\lambda_0 > \sqrt{\lambda\mu}$ case is

\[
\max\left\{ \lambda + \mu - \lambda_0 - \frac{\lambda\mu}{\lambda_0},\ \frac{\log\frac{\mu}{\lambda_0}}{\log\sqrt{\frac{\mu}{\lambda}}}\left(\sqrt{\mu} - \sqrt{\lambda}\right)^2 \right\}.
\]
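The two candidate rates in the $\lambda_0 > \sqrt{\lambda\mu}$ case are straightforward to compare numerically. The sketch below is our own code (parameter values are illustrative); it evaluates the direct rate, the truncation-based rate from the Appendix, and their maximum.

```python
import numpy as np

def case2_rates(lam0, lam, mu):
    """Both known rates when lam0 > sqrt(lam * mu): the direct rate and the truncation-based rate."""
    assert lam0 > np.sqrt(lam * mu)
    direct = lam + mu - lam0 - lam * mu / lam0
    alpha_star = (np.sqrt(mu) - np.sqrt(lam)) ** 2
    truncation = np.log(mu / lam0) / np.log(np.sqrt(mu / lam)) * alpha_star
    return direct, truncation, max(direct, truncation)

lam, mu = 0.5, 1.0
for lam0 in [0.75, 0.85, 0.95]:           # all exceed sqrt(lam * mu) ~ 0.707
    print(lam0, case2_rates(lam0, lam, mu))
```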

We now consider a more general perturbation. Suppose the parameters of the

M/M/1 queue change from (휆0, 휇0) to (휆, 휇). We can relate these parameters by

푎푏휆0 = 휆 and 푏휇0 = 휇. First, observe that 휋 ((푏휇0, 푎푏휆0)) ≡ 휋 ((휇0, 푎휆0)). Second,

observe that the process 푋푡((휇0, 휆0), (푏휇0, 푎푏휆0)) is a sped-up version of the process

푋푡((휇0, 휆0), (휇0, 푎휆0)), by a factor of 푏. Therefore,

ℒ (푋푡((휇0, 휆0), (푏휇0, 푎푏휆0))) ≡ ℒ (푋푏푡((휇0, 휆0), (휇0, 푎휆0))) .

These two observations allow us to write

‖ℒ (푋푡((휇0, 휆0), (푏휇0, 푎푏휆0)) − 휋 ((푏휇0, 푎푏휆0))‖TV

= ‖ℒ (푋푏푡((휇0, 휆0), (휇0, 푎휆0)) − 휋 ((휇0, 푎휆0))‖TV .

We then conclude

−훼푏푡 ‖ℒ (푋푡((휇0, 휆0), (휇, 휆)) − 휋 ((휇, 휆))‖TV ≤ 퐺 ((휇0, 휆0), (휇0, 푎휆0), 훼) 푒 .

Thus, we are left with 퐺 ((휇0, 휆0), (휇0, 푎휆0), 훼) which is of the same form as Equation

79 (3.4), allowing us to apply Theorem 9 and Theorem 10 in order to calculate a bound.

Example 1. We now apply our work to a simple control system. Consider two parallel

M/M/1 queues with arrival rates 휆1 and 휆2, and service rates 휇1 and 휇2, respectively. This queueing system could be a model for two parallel road segments, for example.

Let 휆 = 휆1 + 휆2 be the total arrival rate. The controller chooses 휆1 and 휆2 so that the expected queue lengths are equal, by setting

\[
\lambda_1 = \frac{\mu_1 \lambda}{\mu_1 + \mu_2}, \qquad \lambda_2 = \frac{\mu_2 \lambda}{\mu_1 + \mu_2}.
\]

We assume that 휆 < 휇1 + 휇2, so that the queueing system is stable. Suppose that at 푡 = 0, the queueing system is in equilibrium, meaning that each

′ ′ queue is in equilibrium. Suddenly, the service rates are perturbed to 휇1 and 휇2, which ′ ′ are assumed to satisfy 휆 < 휇1 + 휇2, and are known to the controller. The controller ′ ′ responds by setting the new arrival rates (휆1, 휆2) to be

′ ′ ′ 휇1휆 ′ 휇2휆 휆1 = ′ ′ , 휆2 = ′ ′ . 휇1 + 휇2 휇1 + 휇2

Our methods can be used to analyze the rate of convergence of the queue length of each 휇′ 휆′ queue to equilibrium. Consider the first queue. Let 푏 = 1 and 푎 = 1 . Then by the 휇1 푏휆1 above analysis,

′ ′ ′ ′ −훼푏푡 ‖ℒ (푋푡((휇1, 휆1), (휇1, 휆1)) − 휋 ((휇1, 휆1))‖TV ≤ 퐺 ((휇1, 휆1), (휇1, 푎휆1), 훼) 푒

Therefore the convergence rate is 훼⋆푏, where

\[
\alpha^\star =
\begin{cases}
\left(\sqrt{\mu_1} - \sqrt{a\lambda_1}\right)^2 & \text{if } \lambda_1 \le \sqrt{a\lambda_1\mu_1} \\[1ex]
\alpha' & \text{if } \lambda_1 > \sqrt{a\lambda_1\mu_1}
\end{cases},
\qquad\text{and}\qquad
\alpha' = \max\left\{ a\lambda_1 + \mu_1 - \lambda_1 - \frac{a\lambda_1\mu_1}{\lambda_1},\ \frac{\log\frac{\mu_1}{\lambda_1}}{\log\sqrt{\frac{\mu_1}{a\lambda_1}}}\left(\sqrt{\mu_1} - \sqrt{a\lambda_1}\right)^2 \right\}.
\]

The first condition can be rewritten as $\lambda_1 \le \mu_1\sqrt{\frac{\lambda_1'}{\mu_1'}} = \mu_1\sqrt{\frac{\lambda}{\mu_1' + \mu_2'}}$. The analysis for the second queue is analogous.
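A small numerical illustration of the controller's rate computation (our own script; the service-rate values are made-up assumptions) following the case analysis above:

```python
import numpy as np

def rate_first_queue(lam1, mu1, lam1_new, mu1_new):
    """Convergence rate alpha_star * b for the first queue after the perturbation (Example 1)."""
    b = mu1_new / mu1
    a = lam1_new / (b * lam1)
    alpha_fixed = (np.sqrt(mu1) - np.sqrt(a * lam1)) ** 2
    if lam1 <= np.sqrt(a * lam1 * mu1):
        alpha_star = alpha_fixed
    else:
        direct = a * lam1 + mu1 - lam1 - a * lam1 * mu1 / lam1
        truncation = np.log(mu1 / lam1) / np.log(np.sqrt(mu1 / (a * lam1))) * alpha_fixed
        alpha_star = max(direct, truncation)
    return alpha_star * b

# Total arrival rate 1.2 split proportionally to service rates; service rates perturbed from (1, 1) to (1.5, 0.9).
lam, mu1, mu2, mu1_new, mu2_new = 1.2, 1.0, 1.0, 1.5, 0.9
lam1 = mu1 * lam / (mu1 + mu2)
lam1_new = mu1_new * lam / (mu1_new + mu2_new)
print(rate_first_queue(lam1, mu1, lam1_new, mu1_new))
```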

3.4.2 Workload process

Next we consider the workload process, {푊푡}, for an M/M/1 queue. The value

푊푡 ∈ R≥0 is the time remaining until the queue is empty, starting from time 푡. As for

the queue length process, we consider changing the arrival rate from 휆0 to 휆 while

keeping the service rate fixed at 휇0. The process {푊푡} is stochastically increasing in

휆. Applying Theorem 9, we need to calculate 퐺푊 (휆푚, 휆, 훼) for the process {푊푡}. But

{푊푡 = 0} = {푋푡 = 0} since the workload is zero if and only if the queue length is zero.

Therefore 퐺푊 (휆푚, 휆, 훼) = 퐺푋 (휆푚, 휆, 훼), and the same convergence results follow. (︁√ √ )︁2 In [38], it is shown that 훼⋆ = 휇 − 휆 is the best possible convergence rate

for the M/M/1 workload process beginning with initial condition 푊0 = 0. Precisely, ⋆ [38] show that if 훼 > 훼 and 푊0 = 0,

훼푡 lim sup 푒 ‖ℒ(푊푡) − 휋‖TV = ∞. 푡→∞

We investigate whether a similar property holds when 푊0 is distributed according √ 2 √ ⋆ (︁√ )︁ to the parameters (휇0, 휆0). When 휆0 ≤ 휆휇, it turns out that 훼 = 휇 − 휆

is in fact the best rate. We use the bounding process idea again with 푊푡(휆0, 휆) and

푊푡(휆, 휆), which is analogous to the proof of Theorem 2.3 in [38]. Let 푇 = inf푡{푡 :

푊푡(휆0, 휆) = 푊푡(휆, 휆)}.

‖ℒ(푊푡(휆0, 휆)) − 휋(휆)‖TV

= sup |P (푊푡(휆0, 휆) ∈ 퐴) − 휋(퐴; 휆)| 퐴

≥ |P (푊푡(휆0, 휆) = 0) − 휋(0; 휆)|

= P (푊푡(min{휆0, 휆}, 휆) = 0, 푇 > 푡)

≥ P (푊푡(min{휆0, 휆}, 휆) = 0, 푇 > 푡|푊0(min{휆0, 휆}, 휆) = 0)

× P (푊 (min{휆0, 휆}, 휆) = 0)

81 (︂ 휆 )︂ = (푊 (휆) = 0, 푇 > 푡|푊 (휆) = 0) 1 − 0 P 푡 0 휇

(︁√ √ )︁2 It is shown in the proof of Theorem 2.3 in [38] that for 훼 > 휇 − 휆 ,

훼푡 lim sup 푒 P (푊푡(휆) = 0, 푇 > 푡|푊0(휆) = 0) = ∞. 푡→∞

(︁ )︁ 휆0 Multiplying the left side by the constant 1 − 휇 ,

(︂ )︂ 훼푡 휆0 lim sup 푒 P (푊푡(휆) = 0, 푇 > 푡|푊0(휆) = 0) 1 − = ∞, 푡→∞ 휇

and we conclude that

훼푡 lim sup 푒 ‖ℒ(푊푡(휆0, 휆)) − 휋(휆)‖TV = ∞ 푡→∞

(︁√ √ )︁2 when 훼 > 휇 − 휆 . √ When 휆0 ≥ 휆휇, we have a gap between the best known rate

{︃ 휇 }︃ log (︁√ √ )︁2 휆휇 훼 = max 휆0 휇 − 휆 , 휆 + 휇 − 휆 − √︀ 휇 0 휆 log 휆 0

(︁√ √ )︁2 and the upper bound on the rate, 훼⋆ = 휇 − 휆 .

3.5 Conclusion

In this chapter we presented a method for finding exponential convergence rates for stochastically ordered Markov processes with a random initial condition. This method of analysis is useful for perturbation analysis of Markov processes, such as various queueing systems. Furthermore, we provided an explicit exponential bound for convergence in total variation distance of an M/M/1 queue that begins in an equilibrium distribution, and applied it in the analysis of a control system. The method developed in this chapter can certainly be applied to other systems, such as

82 M/G/1 queues, as long as one can identify the domain of the moment generating function of the hitting time to the zero state.

3.6 Appendix

Using a truncation technique, we can improve the convergence rate of the M/M/1 queue-length process (and therefore the workload process as well) in the case $\lambda_0 > \sqrt{\lambda\mu}$.

Theorem 11. There exists a computable 퐶 such that

\[
\|\mathcal{L}(X_t(\lambda_0, \lambda)) - \pi(\lambda)\|_{\mathrm{TV}} \le C e^{-\alpha t}, \quad\text{where}\quad \alpha = \frac{\log\frac{\mu}{\lambda_0}}{\log\sqrt{\frac{\mu}{\lambda}}}\left(\sqrt{\mu} - \sqrt{\lambda}\right)^2.
\]

Proof.

‖ℒ(푋푡(휆0, 휆)) − 휋(휆)‖푇 푉 = sup |P (푋푡(휆0, 휆) ∈ 퐴) − 휋(퐴; 휆)| 퐴

= sup (P (푋푡(휆0, 휆) ∈ 퐴) − 휋(퐴; 휆)) 퐴 (︃ ∞ )︃ ∑︁ = sup P (푋푡(휆) ∈ 퐴|푋0 = 푥) 휋(푥; 휆0) − 휋(퐴; 휆) 퐴 푥=0 ∞ ∑︁ = sup 휋(푥; 휆0)[P (푋푡(휆) ∈ 퐴|푋0 = 푥) − 휋(퐴; 휆)] 퐴 푥=0

The first equality is due to the fact that for any 퐴,

푐 푐 P (푋푡(휆0, 휆) ∈ 퐴) − 휋(퐴; 휆) = − (P (푋푡(휆0, 휆) ∈ 퐴 ) − 휋(퐴 ; 휆)) .

Since one of these differences of probabilities must be nonnegative, the absolute value can be dropped.

83 {︀ ∑︀∞ }︀ We now truncate 휋(휆0). Let 푁(휖) = min 푁 : 푥=푁+1 휋(푥; 휆0) ≤ 휖 . Continuing,

푁(휖) ∑︁ ≤ sup [휋(푥; 휆0)(P (푋푡(휆) ∈ 퐴|푋0 = 푥) − 휋(퐴; 휆))] + 휖 (3.8) 퐴 푥=0 푁(휖) ∑︁ [︂ ]︂ ≤ 휋(푥; 휆0) sup (P (푋푡(휆) ∈ 퐴|푋0 = 푥) − 휋(퐴; 휆)) + 휖 (3.9) 푥=0 퐴

(︁√ √ )︁2 Let 훼⋆ = 휇 − 휆 . Applying Theorem 2.1 from [38], we can write

푁(휖) ∑︁ [︁ ⋆ −훼⋆푡]︁ ≤ 휋(푥; 휆0)(퐺(푥, 휆, 훼) + 퐺(휆, 휆, 훼 )) 푒 + 휖 푥=0 (︃ √︃ )︃ 푁(휖) 휆 ⋆ ∑︁ [︁ ⋆ ]︁ ≤ (1 − 휖) 1 + 푒−훼 푡 + 휖 + 휋(푥; 휆 )퐺(푥, 휆, 훼)푒−훼 푡 휇 0 푥=0 (︃ √︃ )︃ 푁(휖) 휆 ⋆ ∑︁ [︁ ⋆ ]︁ = (1 − 휖) 1 + 푒−훼 푡 + 휖 + 휋(푥; 휆 )(퐺(1, 휆, 훼))푥 푒−훼 푡 휇 0 푥=0 (︃ √︃ )︃ 푁(휖) [︂(︂ )︂ (︂ )︂푥 (︂√︂ )︂푥 ]︂ 휆 ⋆ ∑︁ 휆0 휆0 휇 ⋆ = (1 − 휖) 1 + 푒−훼 푡 + 휖 + 1 − 푒−훼 푡 휇 휇 휇 휆 푥=0 ⎛(︁ )︁푁(휖)+1 ⎞ (︃ √︃ )︃ (︂ )︂ √휆0 − 1 휆 ⋆ 휆 휆휇 ⋆ = (1 − 휖) 1 + 푒−훼 푡 + 휖 + 1 − 0 ⎜ ⎟ 푒−훼 푡. (3.10) 휇 휇 ⎝ √휆0 − 1 ⎠ 휆휇

Set 휖 = 푒−훼푡 in order to fold in the 휖 term into a convergence bound. Then 푁(휖) must satisfy

(︂ )︂ 푁(휖) (︂ )︂푥 휆0 ∑︁ 휆0 1 − ≥ 1 − 푒−훼푡 휇 휇 푥=0 1 푁(휖) ≥ 훼푡 − 1. log 휇 휆0

⌈︂ ⌉︂ 1 1 Substituting the value 푁(휖) = log 휇 훼푡 ≥ log 휇 훼푡 − 1 back into the bound (3.10), 휆0 휆0 the last term in the bound becomes

1 (︁ 휆 )︁ 휇 훼푡+1 √ 0 1 ⎛(︁ )︁ log ⎞ ⎛ log 휇 훼푡 ⎞ 휆0 휆 휆 휆휇 log (︂ )︂ √ 0 − 1 (︂ )︂ √ 0 휆0 휆 휆휇 ⋆ 휆 푒 − 1 ⋆ 1 − 0 ⎜ ⎟ 푒−훼 푡 = 1 − 0 ⎜ 휆휇 ⎟ 푒−훼 푡. 휇 ⎝ √휆0 ⎠ 휇 ⎝ √휆0 ⎠ 휆휇 − 1 휆휇 − 1

84 (︁ )︁ √휆0 1 ⋆ If log 휆휇 log 휇 훼 < 훼 , then we get convergence at rate 휆0

⎧ (︁ )︁ ⎫ √휆0 ⎨ log 휆휇 ⎬ min 훼, 훼⋆ − 훼 . log 휇 ⎩ 휆0 ⎭

log 휇 ⋆ 휆0 Let 훼 = 푐훼 with 푐 < (︁ 휆 )︁ . Then we seek to maximize log √ 0 휆휇

⎧ (︁ )︁ ⎫ √휆0 ⎨ log 휆휇 ⎬ min 푐훼⋆, 훼⋆ − 푐훼⋆ log 휇 ⎩ 휆0 ⎭

(︁ 휆 )︁ log √ 0 √ 휆휇 over 푐. When 휆0 > 휆휇 the factor log 휇 is positive, and the optimal 푐 is found by 휆0 log 휇 휆0 setting the two quantities equal to each other, leading to 푐 = √ 휇 . We verify that log 휆 log 휇 휆0 this value is less than (︁ 휆 )︁ . Therefore the best rate obtained by this method is log √ 0 휆휇

휇 log √ 2 휆0 (︁√ )︁ 훼 = √︀ 휇 휇 − 휆 . log 휆

Remark 7. The function

⎧(︁√ √ )︁2 √ ⎨⎪ 휇 − 휆 if 휆0 ≤ 휆휇 푔(휆 ) = 0 log 휇 (︁√ √ )︁2 √ ⎪ √휆0 ⎩ 휇 휇 − 휆 if 휆0 > 휆휇 log 휆 is continuous. In other words, the convergence rate changes continuously in 휆0.

For certain values of (휆0, 휆, 휇) this rate is better than the rate previously computed, 휆휇 ⋆ √ 훼 = 휆 + 휇 − 휆0 − . However, 훼 < 훼 when 휆0 > 휆휇, so there is still a gap, and 휆0 we do not know the best convergence rate in this case. We suspect that the rate 훼 is not the best possible, since the step from expression (3.8) to expression (3.9), which exchanges the order of a supremum with a sum, can be quite loose.

85 86 Chapter 4

An Improved Lower Bound on the Traveling Salesman Constant

2 Let 푋1, 푋2, . . . , 푋푛 be independent uniform random variables on [0, 1] . Let 퐿(푋1, . . . , 푋푛) be the length of the shortest Traveling Salesman tour through these points. Beardwood et al (1959) showed that there exists a constant 훽 such that

퐿(푋 , . . . , 푋 ) lim 1 √ 푛 = 훽 푛→∞ 푛

almost surely. It was shown that 훽 ≥ 0.625. Building upon an approach proposed by Steinerberger (2015), we improve the lower bound to 훽 ≥ 0.6277.

4.1 Introduction

2 Let 푋1, . . . , 푋푛 be independent uniform random variables on [0, 1] . Let 푑(푥, 푦) =

‖푥 − 푦‖2 be the Euclidean distance. Let 퐿(푋1, . . . , 푋푛) be the distance of the optimal Traveling Salesman tour through these points, under distance 푑(·, ·). In seminal work, Beardwood et al (1959) analyzed the limiting behavior of the value of the optimal Traveling Salesman tour length, under the random Euclidean model.

87 Theorem 12 ([2]). There exists a constant 훽 such that

퐿(푋 , . . . , 푋 ) lim 1 √ 푛 = 훽 푛→∞ 푛 almost surely.

This limiting behavior is true of other problems in Euclidean combinatorial opti- mization; please see [55]. The value of 훽 is presently unknown. Empirical analysis has shown that 훽 ≈ 0.71 [31]. The optimal tour length for large values of 푛 can be approximated using the relaxation technique proposed by Held and Karp [29]; see [24] for a probabilistic analysis of the Held-Karp lower bound.

The authors additionally showed in [2] that 0.625 ≤ 훽 ≤ 훽+, where

√ ∫︁ ∞ ∫︁ 3 √︁ √ (︂ 푧 )︂ 2 2 − 3푧1 2 훽+ = 2 푧1 + 푧2푒 1 − √ 푑푧2푑푧1. 0 0 3

This integral is equal to approximately 0.92116 ([56]). To date, the only improvement

to the upper bound was given in [56], showing that 훽 ≤ 훽+ − 휖0, for an explicit 9 −6 휖0 > 16 10 . In [56], the author also claimed to improve the lower bound; however, we have found a fault in the argument. The rest of this note is structured as follows. In Section 4.2, we present the proof of 훽 ≥ 0.625 by [2]. We then outline the approach of [56] to improve the bound. Section

19 4.3 corrects the result in [56], giving the lower bound 훽 ≥ 0.625 + 10368 ≈ 0.6268. Finally, Section 4.4 tightens the argument of [56] to derive the improved bound, 훽 ≥ 0.6277.

4.2 Approaches for the Lower Bound

By the following lemma, we can equivalently study the limiting behavior of

[퐿(푋 , . . . , 푋 )] E 1√ 푛 . 푛

88 Lemma 11 ([2]). It holds that

[퐿(푋 , . . . , 푋 )] E 1√ 푛 → 훽. 푛

Further, we can switch to a Poisson process with intensity 푛. Let 풫푛 denote a Poisson process with intensity 푛 on [0, 1]2.

Lemma 12 ([2]). It holds that

[퐿(풫 )] E √ 푛 → 훽. 푛

[2] gave the following lower bound on 훽.

5 Theorem 13 ([2]). The value 훽 is lower bounded by 8 .

Proof. (Sketch) We outline the proof given by [2], giving a lower bound on E [퐿(풫푛)]. Observe that in a valid Traveling Salesman tour, every point is connected to exactly two other points. To lower bound, we can connect each point to its two closest points.

We can further assume that the Poisson process is over all of R2, rather than just [0, 1]2, in order to remove the boundary effect. The expected distance of a point to

√1 its closest neighbor is shown to be 2 푛 , and the expected distance to the next closes √3 neighbor is shown to be 4 푛 . Each point contributes half the expected lengths to the closest two other points. Since the number of points is concentrated around 푛, it holds 1 (︀ 1 3 )︀ that 훽 ≥ 2 2 + 4 .

Certainly there is room to improve the lower bound. Observe that short cycles are likely to appear when we connect each point to the two closest other points. In [56], the author gave an approach to identify situations in which 3-cycles appear, and then lower-bounded the contribution of correcting these 3-cycles. We outline the approach below.

1. For point 푎, let 푟1 be the distance of 푎 to the closest point, and let 푟2 be the

distance to the next closest point. Let 퐸푎 be the event that the third closest

point is at a distance of 푟3 ≥ 푟1 + 2푟2.

89 7 2. The probability that 퐸푎 occurs is calculated to be 324 for a given point 푎. Therefore, the expected number of points satisfying this geometric property is

7 1 7 324 푛, and the number of triples involved is at least 3 324 푛 in expectation.

3. Using the relationship 푟3 ≥ 푟1 + 2푟2, we can show that if {푎, 푏, 푐, 푑} satisfy the

geometric property with ‖푎 − 푏‖ = 푟1, ‖푎 − 푐‖ = 푟2, and ‖푎 − 푑‖ = 푟3 ≥ 푟1 + 2푟2, then the closest two points to 푏 are 푎 and 푐, and the closest two points to 푐 are 푎 and 푏. Therefore, the “count the closest two distances” method would create a triangle in this situation.

4. To correct for the triangle, subtract the lengths coming from the triangle and add a lower bound on the new lengths. The final adjustment is the sum of contributions for each triple that satisfies the geometric property.

The analysis requires careful bookkeeping of edge lengths. We may count length contributions from the perspective of vertices, giving each vertex two “stubs.” These stubs are connected to other vertices, and may form edges if there are agreements. A

1 stub from vertex 푎 to vertex 푏 contributes 2 ‖푎 − 푏‖ to the path length. In this way, a triangle comprises six stubs, and the contribution to the path length is the sum of the edge lengths. The analysis in [56] contains two errors in Step (4), both due to inconsistency

in counting edge lengths. On page 35, the author writes 푟1 + 푟2 + 2‖푎 − 푐‖ as the

contribution of the triangle. This is probably a typo and likely 푟1 + 푟2 + 2‖푏 − 푐‖ was

meant instead. However, it should be 푟1 + 푟2 + ‖푏 − 푐‖ ≤ 2(푟1 + 푟2). Next, six stubs must be redirected, and their length contributions determined. We break edge (푏, 푐), which means we need to redirect two stubs, while the four stubs that comprise the edges (푎, 푏) and (푎, 푐) remain. This is illustrated in Figure 4-1. The

1 1 redirected stubs contribute 2 ‖푏−푑‖+ 2 ‖푐−푒‖. The six stubs therefore yield an overall 1 1 1 1 contribution of ‖푎−푏‖+‖푎−푐‖+ 2 ‖푏−푑‖+ 2 ‖푐−푒‖ ≥ 푟1+푟2+ 2 (푟3 − 푟1)+ 2 (푟3 − 푟2) = 1 푟3 + 2 (푟1 + 푟2). In the analysis above Figure 5 in [56], the author includes the full lengths ‖푏 − 푑‖ and ‖푐 − 푒‖. The effect of this is to give points 푑 and 푒 a third stub each.

90 To summarize, the overall contribution for the triangle scenario, after breaking

1 3 3 edge (푏, 푐), is 푟3 + 2 (푟1 + 푟2) − 2(푟1 + 푟2) = 푟3 − 2 푟1 − 2 푟2.

d c e

a b

Figure 4-1: The six stubs associated with vertices 푎, 푏, and 푐.

4.3 Derivation of the Lower Bound

In this section we use the approach of [56] to derive a lower bound on 훽.

5 19 Theorem 14. It holds that 훽 ≥ 8 + 10368 .

The proof of Theorem 14 requires Lemmas 13 and 14.

2 Lemma 13 (Lemma 4 in [56]). Let 풫푛 be a on R with intensity 푛. Then for any fixed point 푝 ∈ R2, the probability distribution of the distance between 푝 and the the three closest points to 푝 is given by

⎧ −푛휋푟3 3 ⎨⎪푒 3 (2푛휋) 푟1푟2푟3 if 푟1 < 푟2 < 푟3 ℎ(푟1, 푟2, 푟3) = ⎩⎪0 otherwise.

Lemma 14.

∫︁ ∞ ∫︁ ∞ ∫︁ ∞ (︂ )︂ 3 3 −푛휋푟2 19 푟3 − 푟1 − 푟2 푒 3 푟1푟2푟3푑푟3푑푟2푑푟1 = 3 7 푟1=0 푟2=푟1 푟3=푟1+2푟2 2 2 27648휋 푛 2

Proof. We can change the order of integration to compute the integral more easily.

∫︁ ∞ ∫︁ ∞ ∫︁ ∞ (︂ )︂ 3 3 −푛휋푟2 푟3 − 푟1 − 푟2 푒 3 푟1푟2푟3푑푟3푑푟2푑푟1 푟1=0 푟2=푟1 푟3=푟1+2푟2 2 2 ∫︁ ∞ ∫︁ 푟3 ∫︁ 푟3−푟1 (︂ )︂ 3 2 3 3 −푛휋푟2 = 푟3 − 푟1 − 푟2 푒 3 푟1푟2푟3푑푟2푑푟1푑푟3 푟3=0 푟1=0 푟2=푟1 2 2

91 ∫︁ ∞ ∫︁ 푟3 ∫︁ 푟3−푟1 (︂ )︂ −푛휋푟2 3 2 3 3 = 푟3푒 3 푟1 푟2 푟3 − 푟1 − 푟2 푑푟2푑푟1푑푟3 푟3=0 푟1=0 푟2=푟1 2 2 푟 ∫︁ ∞ ∫︁ 3 (︂ 2 (︂ )︂ )︂ 푟3−푟1 2 3 푟 3 1 ⃒ 2 −푛휋푟3 2 3 = 푟3푒 푟1 푟3 − 푟1 − 푟2 ⃒ 푑푟1푑푟3 ⃒푟 =푟 푟3=0 푟1=0 2 2 2 2 1 푟3 2 ∫︁ ∞ ∫︁ (︃(︀ 푟3−푟1 )︀ 2 (︂ )︂ (︃(︂ )︂3 )︃)︃ 2 3 − 푟 3 1 푟3 − 푟1 −푛휋푟3 2 1 3 = 푟3푒 푟1 푟3 − 푟1 − − 푟1 푑푟1푑푟3 푟3=0 푟1=0 2 2 2 2 푟 ∫︁ ∞ ∫︁ 3 (︂ 4 3 2 2 3 )︂ −푛휋푟2 3 9푟1 3푟1푟3 푟1푟3 푟1푟3 = 푟3푒 3 − − + 푑푟1푑푟3 푟3=0 푟1=0 8 16 4 16 ∫︁ ∞ (︂ 5 4 3 2 2 3 )︂ 푟3 −푛휋푟2 9푟1 3푟1푟3 푟1푟3 푟1푟3 ⃒ 3 = 푟3푒 3 − − + ⃒ 푑푟3 ⃒푟 =0 푟3=0 40 64 12 32 1 5 4 3 2 ∫︁ ∞ (︃ (︀ 푟3 )︀ (︀ 푟3 )︀ (︀ 푟3 )︀ 2 (︀ 푟3 )︀ 3 )︃ −푛휋푟2 9 3 3 3 푟3 3 푟3 3 푟3 = 푟3푒 3 − − + 푑푟3 푟3=0 40 64 12 32 (︃ (︀ 1 )︀5 (︀ 1 )︀4 (︀ 1 )︀3 (︀ 1 )︀2 )︃ ∞ 9 3 ∫︁ 2 3 3 3 3 6 −푛휋푟3 = − − + 푟3푒 푑푟3 40 64 12 32 푟3=0 ∞ 19 ∫︁ 2 19 15 19 6 −푛휋푟3 = 푟3푒 푑푟3 = 7 = 7 3 3 25920 푟3=0 25920 16휋 푛 2 27648휋 푛 2

Proof of Theorem 14. First we verify that the lower bound from breaking edge (푏, 푐) is valid. If edge (푎, 푏) is broken instead, the new stub lengths are ‖푎 − 푐‖ + ‖푏 − 푐‖ +

1 1 2 ‖푎 − 푑‖ + 2 ‖푏 − 푒‖. The difference after subtracting the original stub lengths is then equal to

1 1 ‖푎 − 푐‖ + ‖푏 − 푐‖ + ‖푎 − 푑‖ + ‖푏 − 푒‖ − (‖푎 − 푐‖ + ‖푏 − 푐‖ + ‖푎 − 푏‖) 2 2 1 1 = ‖푎 − 푑‖ + ‖푏 − 푒‖ − ‖푎 − 푏‖ 2 2 1 1 ≥ 푟 + (‖푎 − 푒‖ − ‖푎 − 푏‖) − 푟 2 3 2 1 1 1 3 ≥ 푟 + (푟 − 푟 ) − 푟 = 푟 − 푟 . 2 3 2 3 1 1 3 2 1

3 Similarly, if edge (푎, 푐) is broken, the contribution is lower bounded by 푟3 − 2 푟2. Since 3 3 3 3 3 3 푟3 − 2 푟1 − 2 푟2 ≤ 푟3 − 2 푟2 ≤ 푟3 − 2 푟1, we conclude that 푟3 − 2 푟1 − 2 푟1 from breaking edge (푏, 푐) is a valid lower bound. Therefore, from the discussion in Section 4.2 and Lemma 13 we adjust the integral in [56] to give

√ ∫︁ ∞ ∫︁ ∞ ∫︁ ∞ (︂ )︂ 5 푛 3 3 −푛휋푟2 3 훽 ≥ + 푟3 − 푟1 − 푟2 푒 3 (2푛휋) 푟1푟2푟3푑푟3푑푟2푑푟1. 8 3 푟1=0 푟2=푟1 푟3=푟1+2푟2 2 2

92 From Lemma 14, √ 5 푛 3 19 5 19 훽 ≥ + (2푛휋) 7 = + ≈ 0.626833. 8 3 27648휋3푛 2 8 10368

4.4 An Improvement

In this section, we improve upon the bound in Section 4.3 by tightening the triangle inequality.

Theorem 15. It holds that

(︃ √ )︃ 5 1 (︂ 19 )︂ 1 3072 2 − 4325 훽 ≥ + + ≥ 0.6277. 8 2 10368 2 5376

Proof. Place a Cartesian grid so that point 푎 is at the origin and point 푏 is at (푟1, 0). 1 Then with probability 2 , point 푐 falls into the first or fourth quadrant, and with 1 probability 2 , point 푐 falls into the second or third quadrant. Conditioned on point 푐 √︀ 2 2 falling into the first or fourth quadrant, the maximum length of ‖푏 − 푐‖ is 푟1 + 푟2. Conditioned on point 푐 falling into the second or third quadrant, the maximum length

of ‖푏 − 푐‖ is 푟1 + 푟2, which corresponds to the computation in Section 4.3. See Figure 4-2 for an illustration of this conditioning.

Q2 Q1 Q2 Q1

푎 푏 푎 푏 Q3 Q4 Q3 Q4

Figure 4-2: Conditioning on the location of point 푐. The gray regions indicate where point 푐 may lie.

Conditioned on point 푐 falling into the first or fourth coordinate, the length contribution from breaking edge (푏, 푐) is at least

1 (︂ √︁ )︂ 1 1 √︁ 푟 + (푟 + 푟 ) − 푟 + 푟 + 푟2 + 푟2 = 푟 − 푟 − 푟 − 푟2 + 푟2. 3 2 1 2 1 2 1 2 3 2 1 2 2 1 2

93 1 If edge (푎, 푏) is broken instead, the new stub lengths are ‖푎 − 푐‖ + ‖푏 − 푐‖ + 2 ‖푎 − 푑‖ + 1 2 ‖푏 − 푒‖. The difference after subtracting the original stub lengths is then equal to 1 1 ‖푎 − 푐‖ + ‖푏 − 푐‖ + ‖푎 − 푑‖ + ‖푏 − 푒‖ − (‖푎 − 푐‖ + ‖푏 − 푐‖ + ‖푎 − 푏‖) 2 2 1 1 = ‖푎 − 푑‖ + ‖푏 − 푒‖ − ‖푎 − 푏‖ 2 2 1 1 ≥ 푟 + (‖푎 − 푒‖ − ‖푎 − 푏‖) − 푟 2 3 2 1 1 1 3 ≥ 푟 + (푟 − 푟 ) − 푟 = 푟 − 푟 . 2 3 2 3 1 1 3 2 1

3 Similarly, if edge (푎, 푐) is broken, the contribution is lower bounded by 푟3− 2 푟2. Since 1 1 √︀ 2 2 3 3 1 1 √︀ 2 2 푟3− 2 푟1− 2 푟2− 푟1 + 푟2 ≤ 푟3− 2 푟2 ≤ 푟3− 2 푟1, we conclude that 푟3− 2 푟1− 2 푟2− 푟1 + 푟2 from breaking edge (푏, 푐) is a valid lower bound. We therefore break edge (푏, 푐).

√︀ 2 2 Proposition 5. If 푟3 ≥ 푟2 + 푟1 + 푟2, then the closest points to each of 푎, 푏, 푐 are the other two points in the set {푎, 푏, 푐}, whenever point 푏 is in the first or fourth quadrant.

Proof. Point 푎: 푑(푎, 푏) = 푟1, 푑(푎, 푐) = 푟2, and for any 푑∈ / {푎, 푏, 푐}, it holds that √︀ 2 2 푑(푎, 푑) ≥ 푟3 ≥ 푟2 + 푟1 + 푟2. Therefore 푑(푎, 푑) ≥ 푑(푎, 푏) and 푑(푎, 푑) ≥ 푑(푎, 푐). √︀ 2 2 Point 푏: 푑(푎, 푏) = 푟1, 푑(푏, 푐) ≤ 푟1 + 푟2, and for any 푑∈ / {푎, 푏, 푐}, it holds that √︀ 2 2 푑(푏, 푑) ≥ 푑(푎, 푑) − 푑(푎, 푏) ≥ 푟2 + 푟1 + 푟2 − 푟1. Therefore 푑(푏, 푑) ≥ 푑(푎, 푏) and 푑(푏, 푑) ≥ 푑(푏, 푐).

√︀ 2 2 Point 푐: 푑(푎, 푐) = 푟2, 푑(푏, 푐) ≤ 푟1 + 푟2, and for any 푑∈ / {푎, 푏, 푐}, it holds that √︀ 2 2 √︀ 2 2 푑(푐, 푑) ≥ 푑(푎, 푑) − 푑(푎, 푐) ≥ 푟2 + 푟1 + 푟2 − 푟2 = 푟1 + 푟2. Therefore 푑(푐, 푑) ≥ 푑(푎, 푐) and 푑(푐, 푑) ≥ 푑(푏, 푐).

The lower bound on 훽 is therefore √ 5 푛 ∫︁ ∞ ∫︁ ∞ ∫︁ ∞ + √ 푓푛(푟1, 푟2, 푟3)푑푟3푑푟2푑푟1, 8 3 2 2 푟1=0 푟2=푟1 푟3=푟2+ 푟1+푟2 where

(︂ √︁ )︂ 1 1 −푛휋푟2 3 푓 (푟 , 푟 , 푟 ) = 푟 − 푟 − 푟 − 푟2 + 푟2 푒 3 (2푛휋) 푟 푟 푟 . 푛 1 2 3 3 2 1 2 2 1 2 1 2 3

94 Lemma 15. Let 훼 = 1√ . It holds that 1+ 2

∞ ∞ ∞ (︂ )︂ ∫︁ ∫︁ ∫︁ 1 1 √︁ 2 2 2 −푛휋푟3 √ 푟3 − 푟1 − 푟2 − 푟1 + 푟2 푒 푟1푟2푟3푑푟3푑푟2푑푟1 2 2 2 2 푟1=0 푟2=푟1 푟3=푟2+ 푟1+푟2 [︂ 8 7 6 √ 4 3 2 ]︂ 훼 훼 훼 1 (︁ )︁ 5 13훼 훼 훼 15 = − − − + 13 + 16 2 훼 − − + 7 . 8 · 48 7 · 16 6 · 16 120 64 48 32 16휋3푛 2

Proof. Again we change the order of integration to compute the integral more easily. √︀ 푟 Given 푟 , the upper bound on 푟 is derived by setting 푟 = 푟 + 2푟2 ⇐⇒ 푟 = √3 . 3 1 3 1 1 1 1+ 2 2 2 √︀ 2 2 푟3−푟1 Given 푟3 and 푟1, set 푟3 = 푟2 + 푟 + 푟 . We rearrange to obtain 푟2 = . Therefore, 1 2 2푟3

∞ ∞ ∞ (︂ )︂ ∫︁ ∫︁ ∫︁ 1 1 √︁ 2 2 2 −푛휋푟3 √ 푟3 − 푟1 − 푟2 − 푟1 + 푟2 푒 푟1푟2푟3푑푟3푑푟2푑푟1 2 2 2 2 푟1=0 푟2=푟1 푟3=푟2+ 푟1+푟2 2 2 푟 푟 −푟 ∞ 3√ 3 1 (︂ )︂ ∫︁ 2 ∫︁ 1+ 2 ∫︁ 2푟3 1 1 √︁ −푛휋푟3 2 2 = 푟3푒 푟1 푟2 푟3 − 푟1 − 푟2 − 푟1 + 푟2 푑푟2푑푟1푑푟3 푟3=0 푟1=0 푟2=푟1 2 2 푟3 푟2−푟2 ∫︁ ∞ ∫︁ √ [︂ 2 (︂ )︂ 3 ]︂ 3 1 2 1+ 2 푟 1 1 1 ⃒ 2푟 −푛휋푟3 2 3 (︀ 2 2)︀ 2 3 = 푟3푒 푟1 푟3 − 푟1 − 푟2 − 푟1 + 푟2 ⃒ 푑푟1푑푟3 ⃒푟 =푟 푟3=0 푟1=0 2 2 6 3 2 1 ⎡ 2 2 2 3 푟 (︁ 푟 −푟 )︁ ∫︁ ∞ ∫︁ 3√ 3 1 (︂ )︂ (︂ 2 2 )︂3 (︃ (︂ 2 2 )︂2)︃ 2 2 1+ 2 2푟3 1 1 푟 − 푟 1 푟 − 푟 −푛휋푟3 ⎢ 3 1 2 3 1 = 푟3푒 푟1 ⎣ 푟3 − 푟1 − − 푟1 + 푟3=0 푟1=0 2 2 6 2푟3 3 2푟3

푟2 (︂ 1 )︂ 1 1 3 ]︂ − 1 푟 − 푟 + 푟3 + (︀푟2 + 푟2)︀ 2 푑푟 푑푟 2 3 2 1 6 1 3 1 1 1 3

⎡ 2 2 2 3 푟 (︁ 푟 −푟 )︁ ∫︁ ∞ ∫︁ 3√ 3 1 (︂ )︂ (︂ 2 2 )︂3 (︃(︀ 2 2)︀2 )︃ 2 −푛휋푟2 1+ 2 2푟3 1 1 푟3 − 푟1 1 푟1 + 푟3 = 푟 푒 3 푟 ⎢ 푟 − 푟 − − 3 1 ⎣ 3 1 2 푟3=0 푟1=0 2 2 6 2푟3 3 4푟3

2 (︃ 3 )︃ ]︃ 푟 푟 1 1 2 2 − 1 3 + + + 푟3 푑푟 푑푟 2 4 6 3 1 1 3

⎡ 2 2 2 푟 (︁ 푟 −푟 )︁ ∫︁ ∞ ∫︁ 3√ 3 1 (︂ )︂ (︂ 2 2 )︂3 (︂ 2 2 )︂3 2 1+ 2 2푟3 1 1 푟 − 푟 1 푟 + 푟 −푛휋푟3 ⎢ 3 1 1 3 = 푟3푒 푟1 ⎣ 푟3 − 푟1 − − 푟3=0 푟1=0 2 2 6 2푟3 3 2푟3

2 (︃ 3 )︃ ]︃ 푟 푟 1 1 2 2 − 1 3 + + + 푟3 푑푟 푑푟 2 4 6 3 1 1 3

푟 ∫︁ ∞ ∫︁ 3√ [︃ 7 6 5 (︃ 3 )︃ 3 2 1+ 2 푟 푟 푟 1 1 1 2 2 13푟 푟3 −푛휋푟3 1 1 1 4 1 = 푟3푒 − 3 − 2 − + + + + 푟1 − 푟3=0 푟1=0 48푟3 16푟3 16푟3 8 4 6 3 16 푟2푟2 푟 푟3 ]︂ − 1 3 + 1 3 푑푟 푑푟 16 16 1 3 ∫︁ ∞ [︃ 8 7 6 (︃ 3 )︃ 4 2 푟 푟 푟 1 1 1 1 2 2 13푟 푟3 −푛휋푟3 1 1 1 5 1 = 푟3푒 − 3 − 2 − + + + + 푟1 − 푟3=0 8 · 48푟3 7 · 16푟3 6 · 16푟3 5 8 4 6 3 64

95 푟 3 2 2 3 ]︂ 3√ 푟1푟3 푟1푟3 ⃒ 1+ 2 − + ⃒ 푑푟3 48 32 ⃒푟1=0 ∫︁ ∞ [︃ 8 7 6 (︃ 3 )︃ 2 (훼푟3) (훼푟3) (훼푟3) 1 1 1 1 2 2 −푛휋푟3 5 = 푟3푒 − 3 − 2 − + + + + (훼푟3) 푟3=0 8 · 48푟3 7 · 16푟3 6 · 16푟3 5 8 4 6 3 ]︃ 13 (훼푟 )4 푟 (훼푟 )3 푟2 (훼푟 )2 푟3 − 3 3 − 3 3 + 3 3 푑푟 64 48 32 3 [︂ 8 7 6 4 3 2 ]︂ ∞ 훼 훼 훼 1 (︁ √ )︁ 13훼 훼 훼 ∫︁ 2 5 6 −푛휋푟3 = − − − + 13 + 16 2 훼 − − + 푟3푒 푑푟3 8 · 48 7 · 16 6 · 16 120 64 48 32 푟3=0 [︂ 8 7 6 √ 4 3 2 ]︂ 훼 훼 훼 1 (︁ )︁ 5 13훼 훼 훼 15 = − − − + 13 + 16 2 훼 − − + 7 . 8 · 48 7 · 16 6 · 16 120 64 48 32 16휋3푛 2

√ 푛(2푛휋)3 Multiplying the value of the integral in Lemma 15 by 3 , we obtain the following lower bound.

5 5 [︂ 훼8 훼7 훼6 1 (︁ √ )︁ 13훼4 훼3 훼2 ]︂ + − − − + 13 + 16 2 훼5 − − + 8 2 8 · 48 7 · 16 6 · 16 120 64 48 32 √ 5 3072 2 − 4325 5 = + ≈ + 0.003621. 8 5376 8

Finally, conditioning on the quadrant, the overall lower bound is

(︃ √ )︃ 5 1 (︂ 19 )︂ 1 3072 2 − 4325 훽 ≥ + + ≥ 0.6277. 8 2 10368 2 5376

96 Chapter 5

Sparse High-Dimensional Isotonic Regression

We consider the problem of estimating an unknown coordinate-wise monotone function given noisy measurements, known as the isotonic regression problem. Often, only a small subset of the features affects the output. This motivates the sparse isotonic regression setting, which we consider here. We provide an upper bound on the expected VC entropy of the space of sparse coordinate-wise monotone functions, and identify the regime of statistical consistency of our estimator. We also propose a linear program to recover the active coordinates, and provide theoretical recovery guarantees. We close with experiments on cancer classification, and show that our method significantly outperforms several standard methods. Acknowlegdements Thank you to Jackie Baek for an introduction to using the COSMIC database.

5.1 Introduction

Given a partial order ⪯ on R푑, we say that a function 푓 : R푑 → R is monotone if for 푑 all 푥1, 푥2 ∈ R such that 푥1 ⪯ 푥2, it holds that 푓(푥1) ≤ 푓(푥2). In this chapter, we study the univariate isotonic regression problem under the standard Euclidean partial

푑 order. Namely, we define the partial order ⪯ on R as follows: 푥1 ⪯ 푥2 if 푥1,푖 ≤ 푥2,푖

97 for all 푖 ∈ {1, . . . , 푑}. If 푓 is monotone according to the Euclidean partial order, we say 푓 is coordinate-wise monotone. This chapter introduces the sparse isotonic regression problem, defined as follows.

푑 Write 푥1 ⪯퐴 푥2 if 푥1,푖 ≤ 푥2,푖 for all 푖 ∈ 퐴. We say that a function 푓 on R is 푠-sparse coordinate-wise monotone if for some set 퐴 ⊆ [푑] with |퐴| = 푠, it holds that

푥1 ⪯퐴 푥2 =⇒ 푓(푥1) ≤ 푓(푥2). We call 퐴 the set of active coordinates. The sparse isotonic regression problem is to estimate the 푠-sparse coordinate-wise monotone function 푓 from samples, knowing the sparsity level 푠 but not the set 퐴. Observe

that if 푥 and 푦 are such that 푥푖 = 푦푖 for all 푖 ∈ 퐴, then 푥 ⪯퐴 푦 and 푦 ⪯퐴 푥, so that 푓(푥) = 푓(푦). In other words, the value of 푓 is determined by the active coordinates. We consider two different noise models. In the Noisy Output Model, the input 푋 is a random variable supported on [0, 1]푑, and 푊 is zero-mean noise that is independent from 푋. The model is 푌 = 푓(푋) + 푊 . Let ℛ be the range of 푓 and let supp(푊 ) be the support of 푊 . We assume that both ℛ and supp(푊 ) are bounded. Without loss of generality, let ℛ + supp(푊 ) ⊆ [0, 1], where + is the Cartesian sum. In the Noisy Input Model, 푌 = 푓(푋 + 푊 ), and we exclusively consider the classification problem,

namely 푓 : R푑 → {0, 1}. In either noise model, we assume that 푛 independent samples

(푋1, 푌1),..., (푋푛, 푌푛) are given. ^ The goal of this chapter is to produce an estimator 푓푛 and give statistical guarantees for it. To our knowledge, the only work that provides statistical guarantees on isotonic regression estimators in the Euclidean partial order setting with 푑 ≥ 3 is

the paper of Han et al ([26]). The authors give guarantees of the empirical 퐿2 [︂ (︁ )︁2]︂ ^ 1 ∑︀푛 ^ loss, defined as 푅(푓푛, 푓) = E 푛 푖=1 푓푛(푋푖) − 푓(푋푖) , where the expectation is

over the samples 푋1, . . . 푋푛. In this chapter, we instead expand on the approach in Gamarnik ([19]), to the high-dimensional sparse setting. It is shown in [19] that the expected Vapnik-Chervonenkis entropy of the class of coordinate-wise monotone functions grows subexponentially. The main result of [19] is a guarantee on the tail of ^ 2 ‖푓푛 − 푓‖2. When 푋 ∈ [0, 1] and 푌 ∈ [0, 1] almost surely, it is claimed that

(︁ )︁ 4 √ 휖4푛 ^ ⌈ 2 ⌉ 푛− 256 P ‖푓푛 − 푓‖2 > 휖 ≤ 푒 휖 ,

98 ^ where 푓푛 is a coordinate-wise monotone function, estimated based on empirical mean squared error. However, the constants of the result are incorrect due to a calculation error, which we correct. This result shows that the estimated function converges to

the true function in 퐿2, almost surely ([19]). In this chapter, we extend the work of [19] to the sparse high-dimensional setting, where the problem dimension 푑 and the sparsity 푠 may diverge to infinity as the sample size 푛 goes to infinity. We propose two algorithms for the estimation of the unknown 푠-sparse coordinate- wise monotone function 푓. The simultaneous algorithm determines the active coordi- nates and the estimated function values in a single optimization formulation based on integer programming. The two-stage algorithm first determines the active coordinates via a linear program, and then estimates function values. The sparsity level is treated as constant or moderately growing. We give statistical consistency and support re- covery guarantees for the Noisy Output Model, analyzing both the simultaneous and {︁ }︁ two-stage algorithms. We show that when 푛 = max 푒휔(푠2), 휔 (푠 log 푑) , the estimator ^ 푓푛 from the simultaneous procedure is statistically consistent. In particular, when the sparsity 푠 level is of constant order, the dimension 푑 is allowed to be much larger than the sample size. We note that, remarkably, when the maximum is dominated by 휔(푠 log 푑), our sample performance nearly matches the one of high-dimensional linear regression [6, 17]. For the two-stage approach, we show that if a certain signal strength {︁ }︁ condition holds and 푛 = max 푠푒휔(푠2), 휔(푠3 log 푑)) , the estimator is consistent. We also give statistical consistency guarantees for the simultaneous and two-stage algo- rithms in the Noisy Input Model, assuming that the components of 푊 are independent. We show that in the regime where a signal strength condition holds, 푠 is of constant order, and 푛 = 휔(log 푑), the estimators from both algorithms are consistent. The isotonic regression problem has a long history in the statistics literature; see for example the books [44] and [47]. The emphasis of most research in the area of isotonic regression has been the design of algorithms: for example, the Pool Adjacent Violators algorithm ([34]), active set methods ([4], [10]), and the Isotonic Recursive

Partitioning algorithm ([39]). In addition to the univariate setting (푓 : R푑 → R), the multivariate setting (푓 : R푑 → R푞, 푞 ≥ 2) has also been considered; see e.g. [50]

99 and [51]. In the multivariate setting, whenever 푥1 ⪯ 푥2 according to some defined

partial order ⪯, it holds that 푓(푥1)⪯˜ 푓(푥2), where ⪯˜ is some other defined partial order. There are many applications for the coordinate-wise isotonic regression problem. For example, Dykstra and Robertson (1982) showed that isotonic regression could be used to predict college GPA from standardized test scores and high school GPA. Luss et al (2012) applied isotonic regression to the prediction of baseball players’ salaries, from the number of runs batted in and the number of hits. Isotonic regression has found rich applications in and , particularly to build disease models ([39], [52]). The rest of the chapter is structured as follows. Section 5.2 gives the simultaneous and two-stage algorithms for sparse isotonic regression. Section 5.3 and Section 5.4 of the appendix provide statistical consistency and recovery guarantees for the Noisy Output and Noisy Input models. All proofs can be found in the appendix. In Section 5.5, we provide experimental evidence for the applicability of our algorithms. We test our algorithm on a cancer classification task, using gene expression data. Our algorithm achieves a success rate of about 96% on this task, significantly outperforming the 푘-Nearest Neighbors classifier and the Support Vector Machine.

5.2 Algorithms for sparse isotonic regression

In this section, we present our two algorithmic approaches for sparse isotonic regression: the simultaneous and two-stage algorithms. Recall that ℛ is the range of 푓. In the Noisy Output Model, ℛ ⊆ [0, 1], and in the Noisy Input Model, ℛ = {0, 1}. We assume the following throughout.

Assumption 1. For each 푖 ∈ 퐴, the function 푓(푥) is not constant with respect to 푥푖, i.e.

∫︁ ⃒ ∫︁ ⃒ ⃒ ⃒ ⃒푓(푥) − 푓(푧)푑푧⃒ 푑푥 > 0. 푥∈풳 ⃒ 푧∈풳 ⃒

100 5.2.1 The Simultaneous Algorithm

The simultaneous algorithm solves the following problem.

푛 ∑︁ 2 min (푌푖 − 퐹푖) (5.1) 퐴,퐹 푖=1 s.t. |퐴| = 푠 (5.2)

퐹푖 ≤ 퐹푗 if 푋푖 ⪯퐴 푋푗 (5.3)

퐹푖 ∈ ℛ ∀푖 (5.4)

^ The estimated function 푓푛 is determined by interpolating from the pairs (푋1, 퐹1),..., (푋푛, 퐹푛) ^ in a straightforward way. In particular, 푓푛(푥) = max{퐹푖 : 푋푖 ⪯ 푥}. In other words, we

identify all points 푋푖 such that 푋푖 ⪯ 푥 and select the smallest consistent function value. We call this the “min” interpolation rule because it selects the smallest interpolation ^ values. The “max” interpolation rule is 푓푛(푥) = min{퐹푖 : 푋푖 ⪰ 푥}.

Definition 7. For inputs 푋1, . . . , 푋푛, let 푞(푖, 푗, 푘) = 1 if 푋푖,푘 > 푋푗,푘, and 푞(푖, 푗, 푘) = 0 otherwise.

Problem (5.1)-(5.4) can be encoded as a single mixed-integer convex minimization. We refer to the resulting Algorithm 1 as Integer Programming Isotonic Regression

(IPIR) and provide its formulation below. Binary variables 푣푘 indicate the estimated

active coordinates; 푣푘 = 1 means that the optimization program has determined that

coordinate 푘 is active. The variables 퐹푖 represent the estimated function values at

data points 푋푖. Algorithm 1 Integer Programming Isotonic Regression (IPIR)

Input: Values (푋1, 푌1),..., (푋푛, 푌푛); sparsity level 푠 ^ Output: An estimated function 푓푛 1: Solve the following optimization problem.

푛 ∑︁ 2 min (푌푖 − 퐹푖) (5.5) 푣,퐹 푖=1 푑 ∑︁ s.t. 푣푘 = 푠 (5.6) 푘=1

101 푑 ∑︁ 푞(푖, 푗, 푘)푣푘 ≥ 퐹푖 − 퐹푗 ∀푖, 푗 ∈ {1, . . . , 푛} (5.7) 푘=1

푣푘 ∈ {0, 1} ∀푘 ∈ {1, . . . , 푑} (5.8)

퐹푖 ∈ ℛ ∀푖 ∈ {1, . . . , 푛} (5.9)

^ 2: Return the function 푓푛(푥) = max{퐹푖 : 푋푖 ⪯ 푥}. We claim that Problem (5.5)-(5.9) is equivalent to Problem (5.1)-(5.4). Indeed, the

monotonicity requirement is 푋푖 ⪯퐴 푋푗 =⇒ 푓(푋푖) ≤ 푓(푋푗). The contrapositive of

this statement is 푓(푋푖) > 푓(푋푗) =⇒ 푋푖 ̸⪯퐴 푋푗; alternatively, 푓(푋푖) > 푓(푋푗) =⇒

∃푘 ∈ 퐴 s.t. 푋푖푘 > 푋푗푘. The contrapositive is expressed by Constraints (5.7). Recall that in the Noisy Input Model, the function 푓 is binary-valued, i.e. ℛ =

+ − 푛 {0, 1}. Let 풮 = {푖 : 푌푖 = 1} and 풮 = {푖 : 푌푖 = 0}. When {퐹푖}푖=1 are binary-valued, ∑︀푛 2 ∑︀ ∑︀ it holds that 푖=1 (푌푖 − 퐹푖) = 푖∈풮+ (1 − 퐹푖) + 푖∈풮− 퐹푖. Therefore, if we replace ∑︀ ∑︀ the objective function (5.5) by 푖∈풮+ (1 − 퐹푖) + 푖∈풮− 퐹푖, we obtain an equivalent linear integer program. Algorithm 1 when applied to the Noisy Output Model is a mixed-integer convex optimization program. When applied to the Noisy Input Model, it s a mixed integer linear optimization program. While both are formally NP-hard in general, moderately- sized instances are solvable in practice.

5.2.2 The Two-Stage Algorithm

Algorithm 1 is slow, both in theory and in practice. Motivated by this, we propose an alternative two-stage algorithm. The two-stage algorithm estimates the active coordinates through a linear program, using these to then estimate the function values. The process of estimating the active coordinates is referred to as support recovery. The active coordinates may be estimated all at once (Algorithm 2) or sequentially (Algorithm 3). Algorithm 2 is referred to as Linear Programming Support Recovery (LPSR) and Algorithm 3 is referred to as Sequential Linear Programming Support ^ Recovery (S-LPSR). The two-stage algorithm for estimating 푓푛 first estimates the set of active coordinates using the LPSR or S-LPSR algorithm, and then estimates the

102 function values. The results algorithm is referred to as Two Stage Isotonic Regression (TSIR) (Algorithm 4).

Algorithm 2 Linear Programming Support Recovery (LPSR)

Input: Values (푋1, 푌1),..., (푋푛, 푌푛); sparsity level 푠 Output: The estimated support, 퐴^

1: Solve the following optimization problem.

푛 푛 푑 ∑︁ ∑︁ ∑︁ min 푐푖푗 (5.10) 푣,푐 푘 푖=1 푗=1 푘=1 푑 ∑︁ s.t. 푣푘 = 푠 (5.11) 푘=1 푑 푑 ∑︁ (︀ 푖푗)︀ ∑︁ 푞(푖, 푗, 푘) 푣푘 + 푐푘 ≥ 1 if 푌푖 > 푌푗 and 푞(푖, 푗, 푘) ≥ 1 푘=1 푘=1 (5.12)

0 ≤ 푣푘 ≤ 1 ∀푘 ∈ {1, . . . , 푑} (5.13)

푖푗 푐푘 ≥ 0 ∀푖 ∈ {1, . . . , 푛}, 푗 ∈ {1, . . . , 푛}, 푘 ∈ {1, . . . , 푝} (5.14)

^ 2: Determine the 푠 largest values 푣푖, breaking ties arbitrarily. Let 퐴 be the set of the corresponding 푠 indices.

In Problem (5.10)-(5.14), the 푣푘 variables are meant to indicate the active coordinates, 푖푗 while the 푐푘 variables act as correction in the monotonicity constraints. For example, ∑︀푑 if for one of the constraints (5.12), 푘=1 푞(푖, 푗, 푘)푣푘 = 0.7, then we will need to set 푖푗 푐푘 = 0.3 for some (푖, 푗, 푘) such that 푞(푖, 푗, 푘) = 1. The 푣푘’s should therefore be chosen in a way to minimize the correction. Algorithm 3 determines the active coordinates one at a time, setting 푠 = 1 in Problem (5.10)-(5.14). Once a coordinate 푖 is included in the set of active coordinates, variable 푣푖 is set to zero in future iterations. Algorithm 3 Sequential Linear Programming Support Recovery (S-LPSR)

103 Input: Values (푋1, 푌1),..., (푋푛, 푌푛); sparsity level 푠 Output: The estimated support, 퐴^

1: 퐵 ← ∅

2: while |퐵| < 푠 do

3: Solve the optimization problem in Algorithm 2 with 푠 = 1:

푛 푛 푑 ∑︁ ∑︁ ∑︁ 푖푗 min 푐푘 (5.15) 푖=1 푗=1 푘=1 푑 ∑︁ s.t. 푣푘 = 1 (5.16) 푘=1

푣푖 = 0 ∀푖 ∈ 퐵 (5.17)

푑 푑 ∑︁ (︀ 푖푗)︀ ∑︁ 푞(푖, 푗, 푘) 푣푘 + 푐푘 ≥ 1 if 푌푖 > 푌푗 and 푞(푖, 푗, 푘) ≥ 1 푘=1 푘=1 (5.18)

0 ≤ 푣푘 ≤ 1 ∀푘 ∈ {1, . . . , 푑} (5.19)

푖푗 푐푘 ≥ 0 ∀푖 ∈ {1, . . . , 푛}, 푗 ∈ {1, . . . , 푛}, 푘 ∈ {1, . . . , 푑} (5.20)

⋆ 4: Identify 푖 such that 푣푖⋆ = max푖{푣푖}, breaking ties arbitrarily. Set 퐵 ←

퐵 ∪ {푖max}. 5: end while

6: Return 퐴^ = 퐵. Algorithm 3’ is defined to be the batch version of Algorithm 3. Namely, there are 푛

푛 samples in total, divided into 푠 batches. The first iteration of the sequential procedure is performed on the first batch, the second iteration on the second batch, and so on. ^ We are now ready to state the two-stage algorithm for estimating the function 푓푛. Algorithm 4 Two Stage Isotonic Regression (TSIR)

Input: Values (푋1, 푌1),..., (푋푛, 푌푛); sparsity level 푠 ^ Output: The estimated function, 푓푛

104 ^ ^ 1: Estimate 퐴 by using Algorithm 2, 3, or 3’. Let 푣푘 = 1 if 푘 ∈ 퐴 and 푣푘 = 0 otherwise.

2: Solve the following optimization problem.

푛 ∑︁ 2 min (푌푖 − 퐹푖) (5.21) 푖=1 푑 ∑︁ s.t. 푞(푖, 푗, 푘)푣푘 ≥ 퐹푖 − 퐹푗 ∀푖, 푗 ∈ {1, . . . , 푛} (5.22) 푘=1

퐹푖 ∈ ℛ ∀푖 ∈ {1, . . . , 푛} (5.23)

∑︀ ∑︀ In the Noisy Input Model, replace the objective with 푖∈풮+ (1 − 퐹푖) + 푖∈풮− 퐹푖. ^ 3: Return the function 푓푛(푥) = max{퐹푖 : 푋푖 ⪯ 푥}.

Both algorithms for support recovery are linear programs, which can be solved in polynomial time. The second step of Algorithm 4 when applied to the Noisy Output Model is a linearly-constrained quadratic minimization program that can be solved in polynomial time. The following lemma shows that Step 2 of Algorithm 4 when applied to the Noisy Input Model can be reduced to a linear program.

Lemma 16. Under the Noisy Input Model, replacing the constraints 퐹푖 ∈ {0, 1} with

퐹푖 ∈ [0, 1] in Problems (5.5)-(5.9) and (5.21)-(5.23) does not change the optimal value. Furthermore, there always exists an integer optimal solution that can be constructed from an optimal solution in polynomial time.

5.3 Results on the Noisy Output Model

Recall the Noisy Output Model: 푌 = 푓(푋)+푊 , where 푓 is an 푠-sparse coordinate-wise monotone function with active coordinates 퐴. We assume throughout this section that 푋 is a uniform random variable on [0, 1]푑, 푊 is a zero-mean random variable independent from 푋, and the domain of 푓 is [0, 1]푑. We additionally assume that 푌 ∈ [0, 1] almost surely. Up to shifting and scaling, this is equivalent to assuming that 푓 has a bounded range and 푊 has a bounded support.

105 5.3.1 Statistical consistency

In this section, we extend the results of [19], in order to demonstrate the statistical consistency of the estimator produced by Algorithm 1. The consistency will be stated

in terms of the 퐿2 norm error.

^ Definition 8 (퐿2 Norm Error). For an estimator 푓푛, define

∫︁ 2 ^ 2 (︁ ^ )︁ ‖푓푛 − 푓‖2 , 푓푛(푥) − 푓(푥) 푑푥. 푥∈[0,1]푑

^ We call ‖푓푛 − 푓‖2 the 퐿2 norm error.

^ Definition 9 (Consistent Estimator). Let 푓푛 be a estimator for the function 푓. We ^ say that 푓푛 is consistent if for all 휖 > 0, it holds that

(︁ ^ )︁ lim ‖푓푛 − 푓‖2 ≥ 휖 → 0. 푛→∞ P

^ Theorem 16. The 퐿2 error of the estimator 푓푛 obtained from Algorithm 1 is upper bounded as

(︂ )︂ [︂(︂ )︂ 4 ]︂ (︁ )︁ 푑 128 log(2) 64 푠 푠−1 휖 푛 ‖푓^ − 푓‖ ≥ 휖 ≤ 8 exp + 2 휖2 2 푛 푠 − . P 푛 2 푠 휖2 512

{︁ }︁ 휔(푠2) ^ Corollary 2. When 푛 = max 푒 , 휔(푠 log(푑)) , the estimator 푓푛 from Algorithm ^ 1 is consistent. Namely, ‖푓푛 − 푓‖2 → 0 in probability as 푛 → ∞. In particular, if the sparsity level 푠 is constant, the sample complexity is only logarithmic in the dimension.

5.3.2 Support recovery

In this subsection, we give support recovery guarantees for Algorithm 3. The guarantees will be in terms of the values 푝푘, defined below.

Definition 10. Let 푌1 = 푓(푋1) + 푊1 and 푌2 = 푓(푋2) + 푊2 be two independent

106 samples from the model. For 푘 ∈ 퐴, let

푝푘 , P (푌1 > 푌2 | 푞(1, 2, 푘) = 1) − P (푌1 < 푌2 | 푞(1, 2, 푘) = 1) .

Assume without loss of generality that 퐴 = {1, 2, . . . , 푠} and 푝1 ≤ 푝2 ≤ · · · ≤ 푝푠.

Lemma 17. It holds that 푝푘 > 0 for all 푘. In other words, when 푋1 is greater than

푋2 in at least one active coordinate, the output corresponding to 푋1 is likely to be larger than the one corresponding to 푋2.

Theorem 17. Let 퐵 be the set of indices corresponding to running Algorithm 3’ using 푛 samples. Then it holds that 퐵 = 퐴 with probability at least

(︂ 푝2푛 )︂ 1 − 푑푠 exp − 1 . 64푠3

Corollary 3. Assume that 푝1 = Θ(1). Let 푛 be the number of samples used by Algorithm 3’. If 푛 = 휔(푠3 log(푑)), then Algorithm 3’ recovers the true support w.h.p. as 푛 → ∞.

푑 For 푥 ∈ R , let 푥퐴 denote 푥 restricted to coordinates defined by the set 퐴. Suppose

that 푠 is constant, and the sequence of functions {푓푑} extends a function on 푠 variables, 푠 i.e. 푓푑 is defined as 푓푑(푥) = 푔(푥퐴) for some 푔 : [0, 1] → ℛ. In that case, 푝1 = Θ(1). We can now give a guarantee of the success of Algorithm 4, using Algorithm 3’ for support recovery.

Corollary 4. Assume that 푝1 = Θ(1). Consider running Algorithm 4 using 푛 samples 푛 for sequential recovery. Let 푚 = 푠 . Consider using an additional 푚 samples for ^ function value estimation, so that the total sample size is 푛 + 푚. Let 푓푛+푚 be the {︁ }︁ 3 휔(푠2) ^ estimated function. If 푛 = max 휔(푠 log(푑)), 푠푒 , then 푓푛+푚 is a consistent estimator.

Corollary 4 shows that if 푠 is constant and the sequence of functions {푓푑} extends a function of 푠 variables, then Algorithm 4 produces a consistent estimator with 푛 = 휔(log(푑)) samples. In the appendix, we state the statistical consistency results for the Noisy Input Model.

107 5.4 Results on the Noisy Input Model

Recall the Noisy Input Model: 푌 = 푓(푋 + 푊 ), where 푓 is an 푠-sparse coordinate-wise monotone function with active coordinates 퐴. We assume throughout this section that 푋 is a uniform random variable on [0, 1]푑, 푊 is a zero-mean random variable

independent from 푋 with independent coordinates, and 푓 : R푑 → {0, 1}. In this section, we prove the statistical consistency of Two-Stage Isotonic Regres- sion, with Sequential Linear Programming Support Recovery as the support recovery algorithm. In Subsection 5.4.1 we consider the setting where the set of active coordi-

nates is known, and provide an upper bound on the resulting 퐿2-norm error of our estimator. In Subsection 5.4.2 we provide a guarantee on the probability of correctly estimating the support, using S-LPSR. These results are combined to give Corollary 8, stated at the end of the section. As a special case of the corollary, if 푠 is constant and

the sequence of functions {푓푑} extends a function of 푠 variables, and 푛 = 휔(log(푑)) ^ samples are used by TSIR, then the estimator 푓푛 that is produced is consistent.

5.4.1 Statistical consistency

Suppose that the set of active coordinates, 퐴, is known. Then we can apply Problem (5.21)-(5.23) within Algorithm 4 to estimate the function values, with the variables

푣푖 that indicate the active coordinates set to 1 if 푖 ∈ 퐴, and set to 0 otherwise. The coordinates outside the active set do not influence the solution of the optimization problem, and therefore do not affect the estimated function. Therefore, the setting where 퐴 is known is equivalent to the non-sparse setting with dimension 푑 = 푠. We investigate the regime under which Problem (5.21)-(5.23) produces a consistent estimator, in the non-sparse setting (푑 = 푠). To state our guarantees, it is convenient to represent binary coordinate-wise monotone functions in terms of monotone partitions.

Definition 11 (Monotone Partition). We say that (푆0, 푆1) is a monotone partition of R푑 if

푑 푑 1. 푆0 and 푆1 form a partition of R . That is, 푆0 ∪ 푆1 = R and 푆0 ∩ 푆1 = ∅.

108 푑 2. For all 푥, 푦 ∈ R , if 푥 ⪯ 푦, then either (i) 푥, 푦 ∈ 푆0, (ii) 푥, 푦 ∈ 푆1, or (iii)

푥 ∈ 푆0, 푦 ∈ 푆1.

푑 Let ℳ푑 be the set of all monotone partitions of R .

Note that there is a one-to-one correspondence between monotone partitions and binary coordinate-wise monotone functions. Let 푌 = 푓(푋 + 푊 ) represent our model, with 푑 = 푠, and with 푓 corresponding

⋆ ⋆ ⋆ to a monotone partition (푆0 , 푆1 ). That is, 푓(푥) = 0 for 푥 ∈ 푆0 and 푓(푥) = 1 for ⋆ 푥 ∈ 푆1 . Let ℎ0(푥) be the probability density function of 푋, conditional on 푌 = 0.

Similarly, let ℎ1(푥) be the probability density function of 푋, conditional on 푌 = 1.

For (푆0, 푆1) ∈ ℳ푑, let

∫︁ ∫︁ 퐻0(푆1) = ℎ0(푧)푑푧 and 퐻1(푆0) = ℎ1(푧)푑푧. 푧∈푆1 푧∈푆0

Finally, let 푝 be the probability that 푌 = 0. Let

푞(푆0, 푆1) , 푝퐻0(푆1) + (1 − 푝)퐻1(푆0).

The value of 푞(푆0, 푆1) is the probability of misclassification, under the monotone

partition (푆0, 푆1).

⋆ ⋆ Assumption 2. We assume that 푞 has a unique minimizer on ℳ푑, which is (푆0 , 푆1 ).

′ ′ Definition 12 (Discrepancy). For two monotone partitions (푆0, 푆1) and (푆0, 푆1), the discrepancy function 퐷 : ℳ푑 × ℳ푑 → [0, 1] is defined as follows.

′ ′ ′ ′ 퐷 ((푆0, 푆1), (푆0, 푆1)) , P (푋 ∈ 푆0 ∩ 푆1) + P (푋 ∈ 푆0 ∩ 푆1)

Also let

⋆ ⋆ ⋆ ⋆ 퐵훿 (푆0 , 푆1 ) , {(푆0, 푆1) ∈ ℳ푑 : 퐷 ((푆0, 푆1), (푆0 , 푆1 )) ≤ 훿}

⋆ ⋆ be the set of monotone partitions with discrepancy at most 훿 from (푆0 , 푆1 ).

109 Theorem 18. Let 푑 = 푠. Suppose Assumption 2 holds, and the components of 푊 are ^ independent. Let 푓푛 be the estimator derived from Algorithm 1, and let

⋆ ⋆ ⋆ ⋆ 푞min(훿) , min {푞(푆0, 푆1):(푆0, 푆1) ̸∈ 퐵훿(푆0 , 푆1 )} > 푞(푆0 , 푆1 ).

Then for any 0 < 훿 ≤ 1,

(︁ ^ )︁ P ‖푓푛 − 푓‖2 > 훿 ≤

[︁ 푠 푠−1 ]︁ exp (2 + 2 log(2) − 1) 푛 푠 (︃ ⋆ ⋆ 2 )︃ (︁ [︁ 2푠−1 ]︁ )︁ (푞min (훿) − 푞 (푆0 , 푆1 )) 푛 + exp 푛 2푠 + 1 exp − . [︁ 2푠−1 ]︁ 36 exp 푛 2푠

⋆ ⋆ Corollary 5. Suppose that 푞min (훿) − 푞 (푆0 , 푆1 ) = Θ(1), that is, constant in 푠. When 휔(푠2) ^ 푑 = 푠 and 푛 = 푒 , the estimator 푓푛 produced by Algorithm 1 is consistent.

Theorem 18 has an analogous version in the sparse setting (푠 < 푑). First we need

some definitions, similar to those that precede Theorem 18. We write 푥 =퐴 푦 if 푥 ⪯퐴 푦

and 푥 ⪰퐴 푦.

Definition 13 (푠-Sparse Monotone Partition). We say that (푆0, 푆1) is an 푠-sparse monotone partition of R푑 if

푑 푑 1. 푆0 and 푆1 form a partition of R . That is, 푆0 ∪ 푆1 = R and 푆0 ∩ 푆1 = ∅.

푑 2. There exists a set 퐴 ⊂ [푑] such that for all 푥, 푦 ∈ R , if 푥 ⪯퐴 푦, then either (i)

푥, 푦 ∈ 푆0, (ii) 푥, 푦 ∈ 푆1, or (iii) 푥 ∈ 푆0, 푦 ∈ 푆1. Note that this implies that if

푥 =퐴 푦, then either 푥, 푦 ∈ 푆0 or 푥, 푦 ∈ 푆1.

푑 Let ℳ푠,푑 be the set of all 푠-sparse monotone partitions of R .

Note that there is a one-to-one correspondence between monotone partitions and 푠-sparse binary coordinate-wise monotone functions. Let 푌 = 푓(푋 + 푊 ) represent our model, with 푑 < 푠, and with 푓 corresponding to

⋆ ⋆ ⋆ an 푠-sparse monotone partition (푆0 , 푆1 ). That is, 푓(푥) = 0 for 푥 ∈ 푆0 and 푓(푥) = 1 ⋆ for 푥 ∈ 푆1 . Let ℎ0(푥) be the probability density function of 푋, conditional on 푌 = 0.

110 Similarly, let ℎ1(푥) be the probability density function of 푋, conditional on 푌 = 1.

For (푆0, 푆1) ∈ ℳ푠,푑, let

∫︁ ∫︁ 퐻0(푆1) = ℎ0(푧)푑푧 and 퐻1(푆0) = ℎ1(푧)푑푧. 푧∈푆1 푧∈푆0

Finally, let 푝 be the probability that 푌 = 0. Let

푞(푆0, 푆1) , 푝퐻0(푆1) + (1 − 푝)퐻1(푆0).

The value of 푞(푆0, 푆1) is the probability of misclassification, under the 푠-sparse mono-

tone partition (푆0, 푆1).

⋆ ⋆ Assumption 3. We assume that 푞 has a unique minimizer on ℳ푠,푑, which is (푆0 , 푆1 ).

Definition 14 (Discrepancy). For two 푠-sparse monotone partitions (푆0, 푆1) and ′ ′ (푆0, 푆1), the discrepancy function 퐷 : ℳ푠,푑 × ℳ푠,푑 → [0, 1] is defined as follows.

′ ′ ′ ′ 퐷 ((푆0, 푆1), (푆0, 푆1)) , P (푋 ∈ 푆0 ∩ 푆1) + P (푋 ∈ 푆0 ∩ 푆1)

Also let

푠 ⋆ ⋆ ⋆ ⋆ 퐵훿 (푆0 , 푆1 ) , {(푆0, 푆1) ∈ ℳ푠,푑 : 퐷 ((푆0, 푆1), (푆0 , 푆1 )) ≤ 훿}

⋆ ⋆ be the set of 푠-sparse monotone partitions with discrepancy at most 훿 from (푆0 , 푆1 ).

Theorem 19. Suppose Assumption 3 holds, and the components of 푊 are independent. ^ Let 푓푛 be the estimator derived from Algorithm 1 and let

푠 ⋆ ⋆ ⋆ ⋆ 푞min(훿) , min {푞(푆0, 푆1):(푆0, 푆1) ̸∈ 퐵훿 (푆0 , 푆1 )} > 푞(푆0 , 푆1 )}.

Then for any 0 < 훿 ≤ 1,

(︁ ^ )︁ P ‖푓푛 − 푓‖2 > 훿 ≤

[︁ 푠 푠−1 ]︁ exp (2 + 2 log(2) − 1) 푛 푠 (︂(︂ )︂ )︂ (︃ ⋆ ⋆ 2 )︃ 푑 [︁ 2푠−1 ]︁ (푞min (훿) − 푞 (푆0 , 푆1 )) 푛 + exp 푛 2푠 + 1 exp − . [︁ 2푠−1 ]︁ 푠 36 exp 푛 2푠

111 Theorem 18 allows us to state the following corollary regarding the IPIR algorithm.

Corollary 6. Suppose 푠 is constant and the sequence of functions {푓푑} extends ^ a function of 푠 variables. Let 푓푛 be the estimator produced by Algorithm 1. If ^ 푛 = 휔(log(푑)), then 푓푛 is a consistent estimator.

5.4.2 Support recovery

In this subsection, we give support recovery guarantees for Algorithm 3’. The guaran- tees will be in terms of differences of probabilities.

Definition 15. Let 푌1 = 푓(푋1 + 푊1) and 푌2 = 푓(푋2 + 푊2) be two independent samples from the model. For 푘 ∈ 퐴, define

푝푘 , P (푌1 = 1, 푌2 = 0 | 푞(1, 2, 푘) = 1) − P (푌1 = 0, 푌2 = 1 | 푞(1, 2, 푘) = 1) .

Assume without loss of generality that 퐴 = {1, . . . , 푠} and 푝1 ≤ 푝2 ≤ · · · ≤ 푝푠.

Lemma 18. For all 푘 ∈ 퐴, it holds that 푝푘 > 0.

Theorem 20. Let 퐵 be the set of indices corresponding to running Algorithm 3’ using 푛 samples. Then it holds that 퐵 = 퐴 with probability at least

(︂ 푝2푛 )︂ 1 − 푑푠 exp − 1 . 64푠3

We can now give a guarantee of the success of Algorithm 4, using Algorithm 3 for support recovery.

Corollary 7. Assume that 푝1 = Θ(1). Let 푛 be the number of samples used by Algorithm 3’. If 푛 = 휔(푠3 log(푑)), then Algorithm 3’ recovers the true support w.h.p. as 푛 → ∞.

⋆ ⋆ Corollary 8. Suppose that 푞min (훿) − 푞 (푆0 , 푆1 ) = Θ(1). Suppose also that 푝1 = Θ(1), and that the components of 푊 are independent. Consider running Algorithm 4 using

푛 푛 samples for sequential support recovery. Let 푚 = 푠 . Consider using an additional

112 푚 samples for function value estimation, so that the total number of samples is 푛 + 푚. ^ 3 휔(푠2) ^ Let 푓푛+푚 be the estimated function. If 푛 = 휔(푠 log(푑)) and 푛 = 푠푒 , then 푓푛+푚 is a consistent estimator.

5.5 Experimental results

All algorithms were implemented in Java version 8, using Gurobi version 6.0.0.

5.5.1 Support recovery

We test the support recovery algorithms on random synthetic instances. Let 퐴 = {1, . . . , 푠} without loss of generality. First, randomly sample 푟 “anchor points” in [0, 1]푑,

calling them 푍1, . . . , 푍푟. The parameter 푟 governs the complexity of the function

produced. In our experiment, we set 푟 = 10. Next, randomly sample 푋1, . . . , 푋푛 in 푑 [0, 1] . For 푖 ∈ {1, . . . , 푛}, assign 푌푖 = 1 + 푊푖 if 푍푗 ⪯퐴 푋푖 for some 푗 ∈ {1, . . . , 푟},

and assign 푌푖 = 푊푖 otherwise. The linear programming based algorithms for support recovery, LPSR and S-LPSR, are compared to the simultaneous approach, IPIR, which estimates the active coordinates while also estimating the function values. Note that even though the proof of support recovery using S-LPSR requires fresh data at each iteration, our experiments do not use fresh data. We keep 푠 = 3 fixed and vary 푑 and 푛. The error is Gaussian with mean 0 and variance 0.1, independent across coordinates. We report the percentages of successful recovery (see Table 5.1). The IPIR algorithm performs the best on nearly all settings of (푛, 푑). This suggests that the objective of the IPIR algorithm- to minimize the number of misclassifications on the data- gives the algorithm an advantage in selecting the true active coordinates. The LPSR algorithm outperforms the S-LPSR algorithm when 푑 = 5, but the situation reverses for 푑 ∈ {10, 20}. For 푛 = 200 samples and 푑 = 5, the LPSR algorithm correctly recovers the coordinates all but one time, while S-LPSR succeeds 86% of the time. For 푑 = 10, LPSR and S-LPSR succeed 46 and 75% of the time, respectively; for 푑 = 20, LPSR and S-LPSR succeed 30 and 63% of the time, respectively. It appears that determining the coordinates one at a time provides implicit regularization for larger 푑.

113 We additionally compare the accuracy in function estimation (Table 5.2). We found these results to be extremely encouraging. For 푛 = 200 samples, the IPIR and S-LPSR algorithms had accuracy rates in the range of 87 − 90%. Table 5.1: Performance of support recovery algorithms on synthetic instances. Each line of the table corresponds to 100 trials.

IPIR LPSR S-LPSR 푑 = 푑 = 푑 = 푛 5 10 20 5 10 20 5 10 20 50 62 55 57 76 29 1 62 33 26 100 92 85 90 92 33 13 76 56 49 150 98 94 91 99 50 16 86 71 65 200 95 99 92 99 46 30 86 75 63

Table 5.2: Accuracy of isotonic regression on synthetic instances. Each line of the table corresponds to 100 trials.

IPIR LPSR S-LPSR 푑 = 푑 = 푑 = 푛 5 10 20 5 10 20 5 10 20 50 78.2 77.8 77.6 77.4 74.2 65.9 77.1 76.1 74.3 100 85.1 85.8 84.6 84.1 77.6 75.0 84.2 83.9 81.7 150 87.9 87.8 86.8 87.8 81.3 77.9 87.1 86.6 85.9 200 89.2 89.8 88.3 89.1 83.6 83.4 89.0 88.9 87.5

5.5.2 Cancer classification using gene expression data

The presence or absence of a disease is believed to follow a monotone relationship with respect to gene expression. Similarly, classifying patients as having one of two diseases amounts to applying the monotonicity principle to a subpopulation of individuals having one of the two diseases. In order to assess the applicability of our sparse monotone regression approach, we apply it to cancer classification using gene expression data. The motivation for using a sparse model for disease classification is that certain genes should be more responsible for disease than others. Sparsity can be viewed as a kind of regularization; to prevent overfitting, we allow the regression to explain the results using only a small number of genes.

114 The data is drawn from the COSMIC database [16], which is widely used in quantitative research in cancer biology. Each patient in the database is identified as having a certain type of cancer. For each patient, gene expressions of tumor cells are

reported as a z-score. Namely, if 휇퐺 and 휎퐺 are the mean and standard deviation of the gene expression of gene 퐺 and 푥 is the gene expression of a certain patient, then his or her z-score would be equal to 푥−휇퐺 . We filter the patients by cancer type, 휎퐺 selecting those with skin and lung cancer, two common cancer types. There are 236698 people with lung or skin cancer in the database, though the database only includes gene expression data for 1492 of these individuals. Of these, 1019 have lung cancer and 473 have skin cancer. A classifier always selecting “lung” would have an expected correct classification rate of 1019/1492 ≈ 68%. Therefore this rate should be regarded as the baseline classification rate. Our goal is to use gene expression data to classify the patients as having either skin or lung cancer. We associate skin cancer as a “0” label and lung cancer as a “1” label. We only include the 20 most associated genes for each of the two types, according to the COSMIC website. This leaves 31 genes, since some genes appear on both lists. We additionally include the negations of the gene expression values as coordinates, since a lower gene expression of certain genes may promote lung cancer over skin cancer. The number of coordinates is therefore equal to 62. The number of active genes is ranged between 1 and 5. We perform both simultaneous and two-stage isotonic regression, comparing the IPIR and TSIR algorithms, using S-LPSR to recover the coordinates in the two-stage approach. Since for every gene, its negation also corresponds to a coordinate, we

added additional constraints. In IPIR, we use variables 푣푘 ∈ {0, 1} to indicate whether coordinate 푘 is in the estimated set of active coordinates. In LPSR and S-LPSR, we

use variables 푣푘 ∈ [0, 1] instead. In order to incorporate the constraints regarding

negation of coordinates in IPIR, we included the constraint 푣푖 + 푣푗 ≤ 1 for pairs (푖, 푗) such that coordinate 푗 is the negation of coordinate 푖. In S-LPSR, once a coordinate

푣푖 was selected, its negation was set to zero in future iterations. The LPSR algorithm, however, could not be modified to take this additional structure into account without

115 using integer variables. Adding the constraints 푣푖 + 푣푗 ≤ 1 when coordinate 푗 is the negation of coordinate 푖 proved to be insufficient. Therefore, we do not include the LPSR algorithm in our experiments on the COSMIC database. We compare our isotonic regression algorithms to two classical algorithms: 푘- Nearest Neighbors ([15]) and the Support Vector Machine ([8]). Given a test sample 푥 and an odd number 푘, the 푘-Nearest Neighbors algorithm finds the 푘 closest training samples to 푥. The label of 푥 is chosen according to the majority of the labels of the 푘 closest training samples. The SVM algorithm used is the soft-margin classifier with penalty 퐶 and polynomial kernel given by 퐾(푥, 푦) = (1 + 푥 · 푦)푚. We have additionally implemented a version of kNN with dimensionality reduction, in an effort to reduce the curse-of-dimensionality suffered by kNN. Data points are compressed by Principal Component Analysis ([42]) prior to nearest-neighbor classification. However, this version of kNN performed worse than the basic version, and we omit the results. In Table 5.3, each row is based on 10 trials, with 1000 test data points chosen uniformly and separately from the training points. The two-stage method was generally faster than the simultaneous method. With 200 training points and 푠 = 3, the simultaneous method took 260 seconds on average per trial, while the two-stage method took only 42 seconds per trial. The simultaneous method became prohibitively slow for higher values of 푛. The averages for 푘-Nearest Neighbors and Support Vector Machine are taken as the best over parameter choices in hindsight. For 푘- Nearest Neighbors, 푘 ∈ {1, 3, 5, 7, 9, 11, 15}, and for SVM, 퐶 ∈ {10, 100, 500, 1000} and 푚 ∈ {1, 2, 3, 4}. The fact that the sparse isotonic regression method outperforms the 푘-NN classifier and the polynomial kernel SVM by such a large margin can be explained by a difference in structural assumptions; the results suggest that monotonicity, rather than proximity or a polynomial functional relationship, is the correct property to leverage. The results suggest that the correct sparsity level is 푠 = 3. With 푛 = 400 samples, the classification accuracy rate is 95.7%. When the sparsity level is too low, the monotonicity model is too simple to accurately describe the monotonicity pattern. On the other hand, when the sparsity level is too high, fewer points are comparable, which

116 Table 5.3: Comparison of classifier success rates on COSMIC data. Top row data is according to the “min” interpolation rule and bottom row data is according to the “max” interpolation rule.

IPIR TSIR + S-LPSR 푘-NN SVM 푛 푠 = 푠 = 1 2 3 4 5 1 2 3 4 5 83.1 84.6 76.8 66.2 53.8 82.4 84.6 77.8 73.0 65.4 69.8 63.8 100 83.9 91.8 91.0 85.7 75.7 82.9 90.4 88.9 87.4 83.3 85.4 88.1 84.3 73.9 62.7 85.4 89.3 86.7 81.2 76.9 76.6 72.6 200 85.8 92.6 96.4 88.9 83.9 85.8 94.5 95.9 95.3 93.0 - - - - - 84.7 91.7 89.0 84.4 80.2 76.6 74.2 300 - - - - - 85.1 94.2 95.6 95.9 94.8 - - - - - 85.6 91.8 89.7 87.3 81.7 78.6 77.4 400 - - - - - 85.8 94.0 95.7 96.4 95.7

leads to fewer monotonicity constraints. For 푛 ∈ {100, 200} and 푑 ∈ {1, 2, 3, 4, 5}, TSIR + S-LPSR does at least as well as IPIR on 15 out of 20 of (푛, 푑) pairs, and outperforms on 12 of these. This result is surprising, because synthetic experiments show that IPIR outperforms S-LPSR on support recovery. We further investigate the TSIR + S-LPSR algorithm. Figure 5-1 shows how the two-stage procedure labels the training points. The high success rate of the sparse isotonic regression method suggests that this nonlinear picture is quite close to reality. The observed clustering of points may be a feature of the distribution of patients, or could be due to a saturation in measurement. Figure 5-2 studies the robustness of TSIR + S-LPSR. Additional synthetic zero-mean Gaussian noise is added to the inputs, with varying standard deviation. The “max” classification rule is used. 200 training points and 1000 test points were used. Ten trials were run, with one standard deviation error bars indicated in gray. The results indicate that TSIR + S-LPSR is robust to moderate levels of noise. We note that because the gene expression is measured from tumor cells, much of the variation between the lung and skin cancer patients can be attributed to intrinsic differences between lung and skin cells. Still, this classification task is highly non-linear and challenging, as evidenced by the poor performance of other classifiers. We view

117 these experiments as a proof-of-concept, showing that our algorithm can perform well on real data. An example of a more medically relevant application of our algorithm would be identifying patients as having cancer or not, using blood protein levels [7].

1

0 1

-1 0

-2 -1

-2 -3

-3 -4

-4 1 -5 0.5 0 0 -0.5 -1 -1 -1.5 -6 -2 -2 -6 -5 -4 -3 -2 -1 0 1 -2.5 (a) 푠 = 2. (b) 푠 = 3.

Figure 5-1: Illustration of the TSIR + S-LPSR algorithm. Blue and red markers correspond to lung and skin cancer, respectively.

100 90 80 70 Accuracy (%) 60 0 0.1 0.2 0.3 0.4 0.5 Standard deviation of additional synthetic noise

Figure 5-2: Robustness to error of TSIR + S-LPSR.

5.6 Conclusion

In this chapter, we have considered the sparse isotonic regression problem under two noise models: Noisy Output and Noisy Input. We have formulated optimization problems to recover the active coordinates, and then estimate the underlying monotone function. We provide explicit guarantees on the performance of these estimators. We leave the analysis of Linear Programming Support Recovery (Algorithm 2) as an open problem. Finally, we demonstrate the applicability of our approach to a cancer classification task, showing that our methods outperform widely-used classifiers. While the task of classifying patients with two cancer types is relatively simple, the accuracy rates illustrate the modeling power of the sparse monotone regression approach.

118 5.7 Appendix

Proof of Lemma 16. Consider Problem (5.21)-(5.23), with the objective function replaced by $\sum_{i \in \mathcal{S}^+} (1 - F_i) + \sum_{i \in \mathcal{S}^-} F_i$. Here, the vector $v$ is fixed. Since $F_i \in [0,1]$ for all $i \in \{1, \dots, n\}$, $F_i - F_j \in [-1, 1]$. The left side of Constraint (5.22) takes value in $\{0, 1, \dots, s\}$. Therefore, the constraint is tight only when the left side is equal to 0. Therefore, we only require that for $(i,j)$ such that $\sum_{k=1}^d q(i,j,k) v_k = 0$, it holds that $F_i \le F_j$.

Let $\epsilon = \min\left\{ \min_{F_i > 0}\{F_i\}, \min_{F_i < 1}\{1 - F_i\} \right\}$. In other words, $\epsilon$ is the margin to the endpoints of $[0,1]$. Suppose that $F$ is an optimal solution with some values $F_i \in (0,1)$. Then $\epsilon > 0$. Let $C = \{ i : F_i \in (0,1) \}$. Consider adding $\epsilon$ to each $F_i$ such that $i \in C$, and call the new solution $F^{+\epsilon}$. Clearly, $F^{+\epsilon}$ is feasible. The change in the objective is equal to

$$\sum_{i \in \mathcal{S}^+} \left( 1 - F_i^{+\epsilon} \right) + \sum_{i \in \mathcal{S}^-} F_i^{+\epsilon} - \sum_{i \in \mathcal{S}^+} (1 - F_i) - \sum_{i \in \mathcal{S}^-} F_i = \epsilon \left( \left| \{ i : i \in C, i \in \mathcal{S}^- \} \right| - \left| \{ i : i \in C, i \in \mathcal{S}^+ \} \right| \right).$$

On the other hand, consider subtracting $\epsilon$ from each $F_i$ such that $i \in C$, and call the new solution $F^{-\epsilon}$. By construction, $F^{-\epsilon}$ is also feasible. The change in the objective is equal to $\epsilon \left( \left| \{ i : i \in C, i \in \mathcal{S}^+ \} \right| - \left| \{ i : i \in C, i \in \mathcal{S}^- \} \right| \right)$. Since we have assumed that $F$ is an optimal solution, both changes must be nonnegative. Since they are negations of each other, they must both be equal to zero. Therefore, the solutions $F^{+\epsilon}$ and $F^{-\epsilon}$ have the same objective value as the solution $F$. If $\epsilon = \min_{F_i > 0}\{F_i\}$, choose $F^{-\epsilon}$, and if $\epsilon = \min_{F_i < 1}\{1 - F_i\}$, choose $F^{+\epsilon}$. This leads to the size of the set $C$ decreasing by at least one. Repeating this process inductively, we eventually produce a solution with $C = \emptyset$. Therefore, we have shown that for fixed $v$, there always exists an integer optimal solution, and the process of converting an optimal solution into an integer optimal solution runs in polynomial time. Varying $v$, the set of optimal solutions to Problem (5.5)-(5.9) remains the same, and the same procedure converts an optimal solution into an integer optimal solution in polynomial time.
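The rounding argument above is constructive. The following Python sketch mirrors it under the assumption, established in the proof, that shifting all fractional coordinates of $F$ by the margin $\epsilon$ preserves feasibility and the objective value; it is an illustration rather than the implementation used in this chapter.

\begin{verbatim}
import numpy as np

def round_to_integer_solution(F, tol=1e-9):
    """Repeatedly shift the fractional coordinates of F by the margin epsilon
    (either direction preserves the objective at an optimum) until F lies in
    {0,1}^n, following the argument of Lemma 16."""
    F = np.asarray(F, dtype=float).copy()
    while True:
        fractional = (F > tol) & (F < 1 - tol)
        if not fractional.any():
            return np.round(F)
        eps_down = F[F > tol].min()             # min_{F_i > 0} F_i
        eps_up = (1 - F)[F < 1 - tol].min()     # min_{F_i < 1} (1 - F_i)
        if eps_down <= eps_up:
            F[fractional] -= eps_down           # pass to F^{-eps}
        else:
            F[fractional] += eps_up             # pass to F^{+eps}
        # each pass drives at least one fractional coordinate to 0 or 1
\end{verbatim}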

5.7.1 Proofs for the Noisy Output Model

We will build toward a proof of Theorem 16. We note that Algorithm 1 selects an

푠-sparse coordinate-wise monotone function that minimizes the empirical 퐿2 loss. To prove the statistical consistency of the estimated function, we need to introduce the

expected VC entropy ([57]). Let $\mathcal{F}_{s,d}$ be the set of $s$-sparse coordinate-wise monotone functions on $[0,1]^d$. Following [19], let $Q(x, y, f) = (y - f(x))^2$ for $x \in [0,1]^d$, $y \in \mathbb{R}$, and $f \in \mathcal{F}_{s,d}$. For a fixed sequence $(x_1, y_1), \dots, (x_n, y_n) \in [0,1]^d \times [0,1]$, consider the

set of vectors Q = {(푄(푥1, 푦1, 푓), . . . , 푄(푥푛, 푦푛, 푓)) , 푓 ∈ ℱ푠,푑}. In other words, we vary

over ℱ푠,푑 and produce the associated error vectors. Let 푁 (휖, ℱ푠,푑, (푥1, 푦1),..., (푥푛, 푦푛)) be the size of the minimal 휖-net of the set Q. Namely, a set 퐸 is an 휖-net for Q if for

every 푞 ∈ Q there exists 푣 ∈ 퐸 such that ‖푞 − 푣‖∞ ≤ 휖. For any 휖 > 0, the expected

VC entropy of ℱ푠 is defined as

푁ℱ푠,푑 (휖, 푛) = E [푁 (휖, ℱ푠,푑, (푋1, 푌1),..., (푋푛, 푌푛))] .

The expectation is over the random variables (푋푖, 푌푖). The expected VC entropy

measures the complexity of the class ℱ푠,푑, and can be used to prove convergence in 퐿2. The following proposition follows from Corollary 1 (pp. 45) of [27].

Proposition 6.

$$\mathbb{P}\left( \sup_{\hat f \in \mathcal{F}_{s,d}} \left| \int Q(x,y,\hat f)\, dF(x,y) - \frac{1}{n}\sum_{i=1}^n Q(X_i, Y_i, \hat f) \right| > \epsilon \right) \le 4\, N_{\mathcal{F}_{s,d}}\!\left(\frac{\epsilon}{16}, n\right) \exp\left(-\frac{\epsilon^2 n}{128}\right).$$

Proposition 7. If $Y = f(X) + W \in [0,1]$ almost surely, then

$$\mathbb{P}\left( \|\hat f_n - f\|_2 > \epsilon \right) \le 8\, N_{\mathcal{F}_{s,d}}\!\left(\frac{\epsilon^2}{32}, n\right) \exp\left(-\frac{\epsilon^4 n}{512}\right).$$

Proof. Equivalently, we show

$$\mathbb{P}\left( \|\hat f_n - f\|_2^2 > \epsilon \right) \le 8\, N_{\mathcal{F}_{s,d}}\!\left(\frac{\epsilon}{32}, n\right) \exp\left(-\frac{\epsilon^2 n}{512}\right).$$

As shown by [20],

$$\|\hat f_n - f\|_2^2 = \int Q(x,y,\hat f_n)\, dF(x,y) - \int Q(x,y,f)\, dF(x,y).$$

Therefore,

$$\mathbb{P}\left( \|\hat f_n - f\|_2^2 > \epsilon \right) = \mathbb{P}\left( \int Q(x,y,\hat f_n)\, dF(x,y) - \int Q(x,y,f)\, dF(x,y) > \epsilon \right).$$

By optimality of $\hat f_n$, it holds that $\sum_{i=1}^n Q(X_i, Y_i, f) - \sum_{i=1}^n Q(X_i, Y_i, \hat f_n) \ge 0$. We therefore have

$$\mathbb{P}\left( \|\hat f_n - f\|_2^2 > \epsilon \right) \le \mathbb{P}\left( \int Q(x,y,\hat f_n)\, dF(x,y) - \sum_{i=1}^n Q(X_i, Y_i, \hat f_n) + \sum_{i=1}^n Q(X_i, Y_i, f) - \int Q(x,y,f)\, dF(x,y) > \epsilon \right).$$

Grouping the first two terms and the last two terms, we obtain by the Union Bound,

$$\begin{aligned}
\mathbb{P}\left( \|\hat f_n - f\|_2^2 > \epsilon \right)
&\le \mathbb{P}\left( \int Q(x,y,\hat f_n)\, dF(x,y) - \sum_{i=1}^n Q(X_i, Y_i, \hat f_n) > \frac{\epsilon}{2} \right) + \mathbb{P}\left( \sum_{i=1}^n Q(X_i, Y_i, f) - \int Q(x,y,f)\, dF(x,y) > \frac{\epsilon}{2} \right) \\
&\le \mathbb{P}\left( \left| \int Q(x,y,\hat f_n)\, dF(x,y) - \sum_{i=1}^n Q(X_i, Y_i, \hat f_n) \right| > \frac{\epsilon}{2} \right) + \mathbb{P}\left( \left| \sum_{i=1}^n Q(X_i, Y_i, f) - \int Q(x,y,f)\, dF(x,y) \right| > \frac{\epsilon}{2} \right) \\
&\le 2\, \mathbb{P}\left( \sup_{\hat f \in \mathcal{F}_{s,d}} \left| \int Q(x,y,\hat f)\, dF(x,y) - \sum_{i=1}^n Q(X_i, Y_i, \hat f) \right| > \frac{\epsilon}{2} \right) \\
&\le 8\, N_{\mathcal{F}_{s,d}}\!\left(\frac{\epsilon}{32}, n\right) \exp\left(-\frac{\epsilon^2 n}{512}\right),
\end{aligned}$$
where the last inequality follows from Proposition 6.

Therefore, if the expected VC entropy of $\mathcal{F}_{s,d}$ grows subexponentially in $n$, the estimator $\hat f_n$ derived from Algorithm 1 converges to the true function in $L_2$. Define the non-sparse class $\mathcal{F}_d = \mathcal{F}_{d,d}$.

Proposition 8. $N_{\mathcal{F}_{s,d}}(\epsilon, n) \le \binom{d}{s} N_{\mathcal{F}_s}(\epsilon, n)$.

Proof. The set $\mathcal{F}_{s,d}$ can be written as a union of $\binom{d}{s}$ function classes, depending on which subset of the coordinates is active.

Our goal is now to bound the expected VC entropy of the class ℱ푑. The expected VC entropy is related to a combinatorial quantity known as the labeling number.

Definition 16 (Labeling Number ([19])). For a sequence of points $x_1, \dots, x_n \in [0,1]^d$ and a positive integer $m$, the labeling number $L(m, x_1, \dots, x_n)$ is the number of functions $\varphi : \{x_1, \dots, x_n\} \to \{1, 2, \dots, m\}$ such that $\varphi(x_i) \le \varphi(x_j)$ whenever $x_i \preceq x_j$, for $i, j \in \{1, \dots, n\}$.
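For very small instances the labeling number can be computed by brute force, which is a convenient sanity check on the definition. The sketch below is an illustrative enumeration (exponential in $n$), not an algorithm used elsewhere in the chapter.

\begin{verbatim}
import itertools
import numpy as np

def labeling_number(points, m):
    """L(m, x_1, ..., x_n): number of maps phi into {1, ..., m} with
    phi(x_i) <= phi(x_j) whenever x_i <= x_j coordinate-wise."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    leq = [(i, j) for i in range(n) for j in range(n)
           if i != j and np.all(pts[i] <= pts[j])]
    count = 0
    for phi in itertools.product(range(1, m + 1), repeat=n):
        if all(phi[i] <= phi[j] for i, j in leq):
            count += 1
    return count

# Two comparable points in [0,1]^2 admit 3 monotone labelings into {1, 2}.
print(labeling_number([[0.1, 0.2], [0.5, 0.7]], m=2))   # prints 3
\end{verbatim}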

Proposition 9. For any $(x_1, y_1), \dots, (x_n, y_n) \in [0,1]^d \times [0,1]$,

$$N\left( \epsilon, \mathcal{F}_d, (x_1, y_1), \dots, (x_n, y_n) \right) \le L\left( \left\lceil \frac{2}{\epsilon} \right\rceil, x_1, \dots, x_n \right).$$

Let $\bar{\mathcal{F}}_d$ be the set of coordinate-wise monotone functions $f : [0,1]^d \to \{0,1\}$. Then

$$N\left( \epsilon, \bar{\mathcal{F}}_d, (x_1, y_1), \dots, (x_n, y_n) \right) \ge L\left( \left\lfloor \sqrt{\frac{3}{2\epsilon}} \right\rfloor - 3, x_1, \dots, x_n \right).$$

Proof. For the lower bound, let $\delta = \sqrt{\frac{2\epsilon}{3}}$, and let $N = \left\lfloor \frac{1}{\delta} \right\rfloor - 3$. Define the sequence $q_i = \delta(i+1)$, for $i \in \{1, \dots, N\}$. The monotone labelings supported on $\{q_1, \dots, q_N\}$ are a subset of the coordinate-wise monotone functions. Our goal is to show that for every two distinct labelings $l_1$ and $l_2$, it holds that

$$\left\| \left( Q(x_1, y_1, l_1), \dots, Q(x_n, y_n, l_1) \right) - \left( Q(x_1, y_1, l_2), \dots, Q(x_n, y_n, l_2) \right) \right\|_\infty > 2\epsilon.$$

If this relation holds for all distinct pairs of labelings, then at least 퐿(푁, 푥1, 푥2, . . . , 푥푛) points are required to form an 휖-net of the set Q.

If $l_1$ and $l_2$ are distinct labelings, then there exists $k \in \{1, \dots, n\}$ such that $l_1(x_k) \neq l_2(x_k)$. Therefore,

$$\begin{aligned}
|Q(x_k, y_k, l_1) - Q(x_k, y_k, l_2)| &= \left| (l_1(x_k) - y_k)^2 - (l_2(x_k) - y_k)^2 \right| \\
&= \left| l_1(x_k)^2 - 2 y_k l_1(x_k) + 2 y_k l_2(x_k) - l_2(x_k)^2 \right| \\
&= \left| (l_1(x_k) - l_2(x_k))(l_1(x_k) + l_2(x_k)) - 2 y_k (l_1(x_k) - l_2(x_k)) \right| \\
&= |l_1(x_k) - l_2(x_k)| \, |l_1(x_k) + l_2(x_k) - 2 y_k| \\
&\ge \delta \cdot 2 \left| \frac{l_1(x_k) + l_2(x_k)}{2} - y_k \right| \\
&\ge \delta \cdot 2 \min\{ q_1, 1 - q_N \} \\
&\ge 4\delta^2 = \frac{8}{3}\epsilon > 2\epsilon.
\end{aligned}$$

We conclude that $L(N, \bar{\mathcal{F}}_d, x_1, \dots, x_n) \le N\left( \epsilon, \bar{\mathcal{F}}_d, (x_1, y_1), \dots, (x_n, y_n) \right)$.

For the upper bound, the proof comes from the proof of Proposition 3 in [19]. Let $N = \left\lceil \frac{2}{\epsilon} \right\rceil$ and let $q_i = \frac{i-1}{N}$ for $i \in \{1, \dots, N, N+1\}$. Define

$$G \triangleq \left\{ \left( (y_1 - g_1)^2, (y_2 - g_2)^2, \dots, (y_n - g_n)^2 \right) : g_i \in \{q_1, \dots, q_N\},\ x_i \preceq x_j \implies g_i \le g_j \right\}.$$

Then $|G| \le L(N, x_1, \dots, x_n)$. We now show that $G$ is an $\epsilon$-net of $\mathcal{Q} = \left\{ \left( Q(x_1, y_1, f), \dots, Q(x_n, y_n, f) \right),\ f \in \bar{\mathcal{F}}_d \right\}$. For each sample $i \in \{1, \dots, n\}$, find $k_i$ such that $f(x_i) \in [q_{k_i}, q_{k_i+1})$. Set $g_i = q_{k_i}$. Now,

$$\begin{aligned}
\left| (y_i - f(x_i))^2 - (y_i - q_{k_i})^2 \right| &= \left| y_i^2 - 2 y_i f(x_i) + f(x_i)^2 - y_i^2 + 2 y_i q_{k_i} - q_{k_i}^2 \right| \\
&= \left| f(x_i)^2 + 2 y_i (q_{k_i} - f(x_i)) - q_{k_i}^2 \right| \\
&= \left| (f(x_i) - q_{k_i})(f(x_i) + q_{k_i}) - 2 y_i (f(x_i) - q_{k_i}) \right| \\
&= (f(x_i) - q_{k_i}) \left| f(x_i) + q_{k_i} - 2 y_i \right| \\
&\le \frac{2}{N} = \frac{2}{\lceil 2/\epsilon \rceil} \le \epsilon.
\end{aligned}$$

It remains to show that 푥푖 ⪯ 푥푗 =⇒ 푔푖 ≤ 푔푗. Since 푓 is coordinate-wise monotone,

푥푖 ⪯ 푥푗 =⇒ 푓(푥푖) ≤ 푓(푥푗). Then also 푔푖 ≤ 푔푗. Therefore, we have shown that 퐺 is a valid 휖-net, and we conclude that the size of the smallest 휖-net is at most

퐿(푁, 푥1, . . . , 푥푛).

The 푚-labeling number is in turn related to the binary labeling number.

Proposition 10 ([19]). It holds that $L(m, x_1, \dots, x_n) \le \left( L(2, x_1, \dots, x_n) \right)^{m-1}$.

Proof. The proof can be found in the proof of Lemma 3 in [19], with the correction that
$$g_2(x_i) = \begin{cases} 1 & \text{if } g(x_i) \le m, \\ 2 & \text{if } g(x_i) = m + 1. \end{cases}$$

Propositions 9 and 10 suggest that the binary labeling number is a good proxy for the VC entropy.

Theorem 21. Let $X_1, \dots, X_n$ be distributed uniformly and independently in $[0,1]^d$. Let $L(X_1, \dots, X_n)$ be the number of binary monotone labelings of the points $X_1, \dots, X_n$. Then for $k \ge 1$,

$$\exp\left[ \frac{\log(2)(1 - e^{-1})k}{(d-1)!}\, n^{\frac{d-1}{d}} \right] \le \mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le \exp\left[ \left( 2\log(2)k + 2^{k+d} \right) n^{\frac{d-1}{d}} \right].$$

We also have

$$\mathbb{E}\left[ L(X_1, \dots, X_n) \right] \le \exp\left[ \left( 2^d + 2\log(2) - 1 \right) n^{\frac{d-1}{d}} \right].$$

In order to prove the upper bound in Theorem 21, we relate the binary labeling number to the number of integer partitions.

Figure 5-3: Illustration of a partition in 푑 = 2 with 푚 = 10. The partition cells are indicated in gray, and the border cells are marked.

Definition 17 (Integer Partition). An integer partition of dimension $(d-1)$ with values in $\{0, 1, \dots, m\}$ is a collection of values $A_{i_1, i_2, \dots, i_{d-1}} \in \{0, 1, \dots, m\}$, where $i_k \in \{1, \dots, m\}$ and $A_{i_1, i_2, \dots, i_{d-1}} \le A_{j_1, j_2, \dots, j_{d-1}}$ whenever $i_k \le j_k$ for all $k \in \{1, \dots, d-1\}$. The set of integer partitions of dimension $(d-1)$ with values in $\{0, 1, \dots, m\}$ is denoted by $P([m]^d)$.

Note: the definition is in terms of $(d-1)$ because when the monotone regression problem is in dimension $d$, we will consider partitions of dimension $(d-1)$. To illustrate the definition, consider setting $d = 2$ (see Figure 5-3). An integer partition of dimension 1 is an assignment of values $(A_1, A_2, \dots, A_m)$ that is non-increasing, and each $A_k$ takes value in $\{0, 1, \dots, m\}$. A 1-dimensional partition can be seen to divide the $m \times m$ grid in a monotonic way. Next we define the concept of a border cell.

Definition 18 (Border Cell). Label the cells in the $[m]^d$ grid according to cell coordinates, namely entries $(x_1, x_2, \dots, x_d)$, where $x_k \in \{1, \dots, m\}$ for each $k \in \{1, \dots, d\}$. For a partition $p \in P([m]^d)$ with entries in $\{1, \dots, m\}$, consider its values $A_{i_1, i_2, \dots, i_{d-1}}$. The cells corresponding to the partition (which we call the partition cells) are given by $(x_1, x_2, \dots, x_{d-1}, x)$, for $x \le A_{x_1, x_2, \dots, x_{d-1}}$ and where each $x_k$ ranges in $\{1, \dots, m\}$. We say that two cells are adjacent if they share a face or a corner. The border cells are defined to be the partition cells that are adjacent to at least one cell that is not a partition cell.

Lemma 19. The number of border cells in any $(d-1)$-dimensional integer partition with entries from $\{1, \dots, m\}$ is at most $m^d - (m-1)^d$.

Proof. When $d = 2$, the number of cells on the border of any (1-dimensional) partition with values in $\{1, \dots, m\}$ is at most $2m - 1 = m^2 - (m-1)^2$, corresponding to a path from $(1, m)$ to $(m, 1)$. When $d = 3$, the number of border cells in any (2-dimensional) partition with values in $\{1, \dots, m\}$ is at most the number corresponding to border cells that include $(1, m, m)$ and $(m, 1, 1)$. All partitions with such border cells have the same number of border cells. The simplest of these is the one where each cell is on the perimeter of the cube. The number of border cells in such a partition is equal to $3m^2 - 3m + 1 = m^3 - (m-1)^3$. For general $d$, the number of border cells in a $(d-1)$-dimensional partition taking values in $\{1, \dots, m\}$ is upper bounded by the total number of cells minus the number of cells in an $(m-1)^d$ grid, in other words, $m^d - (m-1)^d$.
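The bound of Lemma 19 is easy to check numerically when $d = 2$. The sketch below counts the border cells of random monotone staircase partitions and verifies that the count never exceeds $2m - 1 = m^2 - (m-1)^2$; it is a sanity check under the stated adjacency convention (cells outside the grid are not counted as non-partition cells), not part of the proof.

\begin{verbatim}
import numpy as np

def border_cell_count(A):
    """A: non-increasing column heights in {1, ..., m} for a 1-dimensional
    partition; cell (i, j) is a partition cell iff j <= A[i-1]. A border cell
    is a partition cell adjacent (face or corner) to a grid cell outside the
    partition."""
    m = len(A)
    def in_partition(i, j):
        return 1 <= i <= m and 1 <= j <= m and j <= A[i - 1]
    count = 0
    for i in range(1, m + 1):
        for j in range(1, A[i - 1] + 1):
            nbrs = [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0)]
            if any(1 <= a <= m and 1 <= b <= m and not in_partition(a, b)
                   for a, b in nbrs):
                count += 1
    return count

rng = np.random.default_rng(0)
m = 10
for _ in range(100):
    A = np.sort(rng.integers(1, m + 1, size=m))[::-1]   # random monotone heights
    assert border_cell_count(A) <= 2 * m - 1             # Lemma 19 with d = 2
\end{verbatim}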

The key idea of the proof of the upper bound in Theorem 21 comes from the following lemma.

Lemma 20. Let $k \ge 1$ and let $N \sim \mathrm{Binom}\left( n, \frac{m^d - (m-1)^d}{m^d} \right)$. It holds that

$$\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le \left| P([m]^d) \right|^k\, \mathbb{E}\left[ 2^{kN} \right].$$

Proof. The idea of the proof comes from the proof of Theorem 13.13 in [11], who showed a similar result for 푑 = 2 and 푘 = 1. Consider a binary coordinate-wise

monotone function $f$, with domain $[0,1]^d$. Let $S_0 = \{ x \in [0,1]^d : f(x) = 0 \}$ and $S_1 = \{ x \in [0,1]^d : f(x) = 1 \}$. The number of binary labelings of a set of points $X_1, \dots, X_n$ is equal to the number of partitions $(S_0, S_1)$ producing distinct labelings. To upper-bound the number of dividing surfaces, we divide the $d$-dimensional cube into an $m^d$ grid, $[m]^d$. That is, each cell in the grid has side length $\frac{1}{m}$. Let $B$ be the intersection of the boundaries of $S_0$ and $S_1$. For example, if

$$f(x) = \begin{cases} 0 & \text{if } x_1 + x_2 < 1, \\ 1 & \text{if } x_1 + x_2 \ge 1, \end{cases}$$

then $B = \{ x : x_1 + x_2 = 1 \}$. Now consider the subset of cells that contain at least one element of $B$. These cells are necessarily the border cells of some $(d-1)$-dimensional integer partition with values from $\{1, \dots, m\}$. Therefore, we can upper bound the number of labelings as follows. For a boundary $B$ corresponding to a partition $(S_0, S_1)$, let $\bar{B}$ be the border cells of the corresponding integer partition. Define the set $\mathcal{B}$ to contain all such $\bar{B}$, noting that $|\mathcal{B}| = \left| P([m]^d) \right|$. Let $N_{\bar{B}}$ be the number of points falling into the cells comprising $\bar{B}$. For each $\bar{B}$ that corresponds to a partition $(S_0, S_1)$, we add a contribution of $2^{N_{\bar{B}}}$. This contribution corresponds to all (valid or invalid) labelings of the points within the border cells. Points outside the border cells are labeled 0 if they fall in $S_0$ and 1 if they fall in $S_1$. Since we have potentially overcounted the number of binary labelings by including invalid labelings, we have the following upper bound.

$$L(X_1, \dots, X_n) \le \sum_{\bar{B} \in \mathcal{B}} 2^{N_{\bar{B}}}.$$

Therefore, we also have

$$L(X_1, \dots, X_n)^k \le \left( \sum_{\bar{B} \in \mathcal{B}} 2^{N_{\bar{B}}} \right)^k.$$

By Jensen’s inequality, we have that for $a_i \ge 0$ and $k \ge 1$,

$$\left( \frac{1}{n} \sum_{i=1}^n a_i \right)^k \le \frac{1}{n} \sum_{i=1}^n a_i^k.$$

Therefore,

$$\left( \sum_{\bar{B} \in \mathcal{B}} 2^{N_{\bar{B}}} \right)^k \le |\mathcal{B}|^{k-1} \sum_{\bar{B} \in \mathcal{B}} 2^{k N_{\bar{B}}},$$

and we have

$$L(X_1, \dots, X_n)^k \le |\mathcal{B}|^{k-1} \sum_{\bar{B} \in \mathcal{B}} 2^{k N_{\bar{B}}}, \qquad
\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le \mathbb{E}\left[ |\mathcal{B}|^{k-1} \sum_{\bar{B} \in \mathcal{B}} 2^{k N_{\bar{B}}} \right] = |\mathcal{B}|^{k-1} \sum_{\bar{B} \in \mathcal{B}} \mathbb{E}\left[ 2^{k N_{\bar{B}}} \right].$$

From Lemma 19, the number of points in the border cells of a partition with the maximal number of border cells is distributed as a binomial random variable $N$ with parameters $\left( n, \frac{m^d - (m-1)^d}{m^d} \right)$. We therefore have

$$\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le |\mathcal{B}|^{k-1} \sum_{\bar{B} \in \mathcal{B}} \mathbb{E}\left[ 2^{kN} \right] = |\mathcal{B}|^k\, \mathbb{E}\left[ 2^{kN} \right] = \left| P([m]^d) \right|^k \mathbb{E}\left[ 2^{kN} \right].$$

Proof of Theorem 21. Upper bound. From Lemma 20, we know that

$$\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le \left| P([m]^d) \right|^k \cdot \mathbb{E}\left[ 2^{kN} \right].$$

Now,

$$\mathbb{E}\left[ 2^{kN} \right] = \mathbb{E}\left[ e^{\log(2) k N} \right] = M_N(\log(2) k),$$

where $M_N(\cdot)$ is the moment-generating function of the random variable $N$. A binomial random variable $Z$ with parameters $(n, p)$ has moment-generating function $M_Z(\theta) = (1 - p + p e^{\theta})^n$. Additionally, [40] showed that

$$\left| P([m]^d) \right| \le \binom{2m}{m}^{m^{d-2}}.$$

Substituting,

$$\begin{aligned}
\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] &\le \binom{2m}{m}^{k m^{d-2}} \left( 1 - \frac{m^d - (m-1)^d}{m^d} + \frac{m^d - (m-1)^d}{m^d}\, e^{\log(2)k} \right)^n \\
&\le \binom{2m}{m}^{k m^{d-2}} \left( 1 + 2^k\, \frac{m^d - (m-1)^d}{m^d} \right)^n \\
&\le \left( 2^{2m} \right)^{k m^{d-2}} \left( 1 + 2^k\, \frac{m^d - (m-1)^d}{m^d} \right)^n \\
&= 2^{2 k m^{d-1}} \left( 1 + 2^k\, \frac{m^d - (m-1)^d}{m^d} \right)^n \\
&= \exp\left[ 2\log(2) k m^{d-1} + n \log\left( 1 + 2^k\, \frac{m^d - (m-1)^d}{m^d} \right) \right].
\end{aligned}$$

Choosing $m = n^{1/d}$,

$$\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le \exp\left[ 2\log(2) k n^{\frac{d-1}{d}} + n \log\left( 1 + 2^k\, \frac{n - \left( n^{1/d} - 1 \right)^d}{n} \right) \right].$$

Since $\log(1+x) \le x$,

$$\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le \exp\left[ 2\log(2) k n^{\frac{d-1}{d}} + 2^k \left( n - \left( n^{1/d} - 1 \right)^d \right) \right].$$

Applying the Binomial Theorem,

$$\begin{aligned}
n - \left( n^{1/d} - 1 \right)^d &= n - \sum_{k=0}^{d} \binom{d}{k} n^{\frac{d-k}{d}} (-1)^k \\
&= -\sum_{k=1}^{d} \binom{d}{k} n^{\frac{d-k}{d}} (-1)^k \\
&\le \left( \sum_{k=1}^{d} \binom{d}{k} \right) \cdot \max_{k \in \{1, \dots, d\}} n^{\frac{d-k}{d}} (-1)^{k+1} \\
&= \left( 2^d - 1 \right) n^{\frac{d-1}{d}}.
\end{aligned}$$

Substituting, we obtain

$$\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \le \exp\left[ 2\log(2) k n^{\frac{d-1}{d}} + 2^k \left( 2^d - 1 \right) n^{\frac{d-1}{d}} \right] \le \exp\left[ \left( 2\log(2) k + 2^{k+d} \right) n^{\frac{d-1}{d}} \right].$$

The proof for 푘 = 1 is similar.
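The $k = 1$ upper bound can be illustrated numerically for a tiny instance: enumerate the binary monotone labelings of a small uniform sample in $d = 2$ and compare a Monte Carlo estimate of $\mathbb{E}[L(X_1, \dots, X_n)]$ with $\exp\left[ \left( 2^d + 2\log(2) - 1 \right) n^{(d-1)/d} \right]$. The sketch below is a sanity check only; the sample size and trial count are arbitrary choices.

\begin{verbatim}
import itertools
import numpy as np

def binary_labeling_number(pts):
    """Number of labelings y in {0,1}^n that respect the coordinate-wise
    order on the sample points."""
    n = len(pts)
    leq = [(i, j) for i in range(n) for j in range(n)
           if i != j and np.all(pts[i] <= pts[j])]
    return sum(all(lab[i] <= lab[j] for i, j in leq)
               for lab in itertools.product((0, 1), repeat=n))

rng = np.random.default_rng(1)
n, d, trials = 8, 2, 200
estimate = np.mean([binary_labeling_number(rng.uniform(size=(n, d)))
                    for _ in range(trials)])
upper = np.exp((2 ** d + 2 * np.log(2) - 1) * n ** ((d - 1) / d))
print(estimate, "<=", upper)   # the Monte Carlo estimate sits below the bound
\end{verbatim}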

Lower bound. Let $N$ be an integer, which will be specified later. Divide $[0,1]^d$ into $N^d$ cells of side length $\frac{1}{N}$. The cells are labeled in the natural coordinate system, writing $C = (x_1, \dots, x_d) \in [N]^d$. We say that two cells $C_1$ and $C_2$ are incomparable if for all $x \in C_1$ and $y \in C_2$, neither $x \preceq y$ nor $x \succeq y$. Let us find the number of incomparable cells.

Lemma 21. The number of incomparable cells is at least $\binom{N+d-2}{d-1}$.

Proof. Consider any two cells $C_1 = (x_1, x_2, \dots, x_d)$ and $C_2 = (y_1, y_2, \dots, y_d)$. If $\sum_{i=1}^d x_i = \sum_{i=1}^d y_i$, then either $(x_1, \dots, x_d) = (y_1, \dots, y_d)$, or $(x_1, \dots, x_d) \not\preceq (y_1, \dots, y_d)$ and $(x_1, \dots, x_d) \not\succeq (y_1, \dots, y_d)$. Observe that if $(x_1, \dots, x_d) \not\preceq (y_1, \dots, y_d)$ and $(x_1, \dots, x_d) \not\succeq (y_1, \dots, y_d)$, then $C_1$ and $C_2$ are incomparable. In dimension $d$, let us therefore count the number of cells whose coordinates sum to $N + d - 1$. This corresponds to the number of integer compositions of $(N + d - 1)$ into $d$ parts, which is given by $\binom{N+d-2}{d-1}$.
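Lemma 21 can be checked directly for small grids: enumerate the cells whose coordinates sum to $N + d - 1$, confirm that they are pairwise incomparable, and compare the count with $\binom{N+d-2}{d-1}$. The following brute-force sketch is illustrative only.

\begin{verbatim}
import itertools
from math import comb

def incomparable_diagonal_cells(N, d):
    """Cells of the [N]^d grid whose coordinates sum to N + d - 1; they are
    pairwise incomparable and their number equals binom(N + d - 2, d - 1)."""
    cells = [c for c in itertools.product(range(1, N + 1), repeat=d)
             if sum(c) == N + d - 1]
    for a, b in itertools.combinations(cells, 2):
        assert not all(x <= y for x, y in zip(a, b))   # not a <= b
        assert not all(x >= y for x, y in zip(a, b))   # not a >= b
    return len(cells)

for N, d in [(4, 2), (4, 3), (5, 3)]:
    assert incomparable_diagonal_cells(N, d) == comb(N + d - 2, d - 1)
\end{verbatim}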

The number of incomparable points, $Y_n$, is at least the number of occupied incomparable cells, which we call $\Delta$. For $d \ge 2$,

$$\begin{aligned}
\mathbb{E}[\Delta] &\ge \binom{N+d-2}{d-1} \left( 1 - \left( 1 - \frac{1}{N^d} \right)^n \right) \\
&= \frac{(N+d-2)!}{(d-1)!(N-1)!} \left( 1 - \left( 1 - \frac{1}{N^d} \right)^n \right) \\
&\ge \frac{N^{d-1}}{(d-1)!} \left( 1 - \left( 1 - \frac{1}{N^d} \right)^n \right).
\end{aligned}$$

Now let $N = \left\lceil n^{1/d} \right\rceil$. Then

$$\mathbb{E}[\Delta] \ge \left( 1 - e^{-1} \right) \frac{n^{\frac{d-1}{d}}}{(d-1)!}.$$

We can now lower bound the expected value of the labeling number raised to the power $k$. First,

$$L(X_1, \dots, X_n) \ge 2^{\Delta}, \qquad L(X_1, \dots, X_n)^k \ge 2^{k\Delta}.$$

By Jensen’s inequality,

$$\mathbb{E}\left[ L(X_1, \dots, X_n)^k \right] \ge \mathbb{E}\left[ 2^{k\Delta} \right] \ge 2^{k \mathbb{E}[\Delta]} \ge 2^{k (1 - e^{-1}) \frac{n^{(d-1)/d}}{(d-1)!}} = \exp\left[ \frac{\log(2)(1 - e^{-1}) k}{(d-1)!}\, n^{\frac{d-1}{d}} \right].$$

Finally, we tie together the above results to prove Theorem 16.

Proof of Theorem 16. The proof is by chaining the inequalities from Propositions 7–10, along with Theorem 21. By Proposition 7,

$$\mathbb{P}\left( \|\hat f_n - f\|_2 > \epsilon \right) \le 8\, N_{\mathcal{F}_{s,d}}\!\left( \frac{\epsilon^2}{32}, n \right) \exp\left( -\frac{\epsilon^4 n}{512} \right).$$

By Proposition 8, $N_{\mathcal{F}_{s,d}}(\epsilon, n) \le \binom{d}{s} N_{\mathcal{F}_s}(\epsilon, n)$. Therefore,

$$\mathbb{P}\left( \|\hat f_n - f\|_2 > \epsilon \right) \le 8 \binom{d}{s} N_{\mathcal{F}_s}\!\left( \frac{\epsilon^2}{32}, n \right) \exp\left( -\frac{\epsilon^4 n}{512} \right).$$

By Proposition 9,

$$N\left( \epsilon, \mathcal{F}_s, (x_1, y_1), \dots, (x_n, y_n) \right) \le L\left( \left\lceil \frac{2}{\epsilon} \right\rceil, x_1, \dots, x_n \right),$$

where $x_i \in [0,1]^s$ for $i \in \{1, \dots, n\}$. Substituting,

$$\begin{aligned}
\mathbb{P}\left( \|\hat f_n - f\|_2 > \epsilon \right) &\le 8 \binom{d}{s} \mathbb{E}\left[ N\left( \frac{\epsilon^2}{32}, \mathcal{F}_s, (x_1, y_1), \dots, (x_n, y_n) \right) \right] \exp\left( -\frac{\epsilon^4 n}{512} \right) \\
&\le 8 \binom{d}{s} \mathbb{E}\left[ L\left( \left\lceil \frac{64}{\epsilon^2} \right\rceil, X_1, \dots, X_n \right) \right] \exp\left( -\frac{\epsilon^4 n}{512} \right),
\end{aligned}$$

where $X_1, \dots, X_n$ are distributed independently and uniformly at random in $[0,1]^s$. By Proposition 10, $L(m, x_1, \dots, x_n) \le \left( L(2, x_1, \dots, x_n) \right)^{m-1}$. Therefore,

$$\begin{aligned}
\mathbb{P}\left( \|\hat f_n - f\|_2 > \epsilon \right) &\le 8 \binom{d}{s} \mathbb{E}\left[ L(X_1, \dots, X_n)^{\lceil 64/\epsilon^2 \rceil - 1} \right] \exp\left( -\frac{\epsilon^4 n}{512} \right) \\
&\le 8 \binom{d}{s} \mathbb{E}\left[ L(X_1, \dots, X_n)^{64/\epsilon^2} \right] \exp\left( -\frac{\epsilon^4 n}{512} \right).
\end{aligned}$$

Finally, by Theorem 21,

$$\begin{aligned}
\mathbb{P}\left( \|\hat f_n - f\|_2 > \epsilon \right) &\le 8 \binom{d}{s} \exp\left[ \left( 2\log(2)\frac{64}{\epsilon^2} + 2^{\frac{64}{\epsilon^2} + s} \right) n^{\frac{s-1}{s}} \right] \exp\left( -\frac{\epsilon^4 n}{512} \right) \\
&= 8 \binom{d}{s} \exp\left[ \left( \frac{128\log(2)}{\epsilon^2} + 2^{\frac{64}{\epsilon^2} + s} \right) n^{\frac{s-1}{s}} - \frac{\epsilon^4 n}{512} \right].
\end{aligned}$$

Proof of Corollary 2. Equivalently, we show that $s = o\left( \sqrt{\log(n)} \right)$ and $d = e^{o(n/s)}$ suffice. Analyzing the leading term in the exponent,

$$2^{\frac{64}{\epsilon^2}}\, 2^{s}\, n^{\frac{s-1}{s}} = n^{1 + \left( s + \frac{64}{\epsilon^2} \right)\frac{\log(2)}{\log(n)} - \frac{1}{s}}.$$

Analyzing the exponent,

$$\begin{aligned}
1 + \left( s + \frac{64}{\epsilon^2} \right)\frac{\log(2)}{\log(n)} - \frac{1}{s} &= 1 + \frac{o\left( \sqrt{\log(n)} \right)}{\log(n)} - \frac{1}{o\left( \sqrt{\log(n)} \right)} \\
&= 1 + o\left( \frac{1}{\sqrt{\log(n)}} \right) - \omega\left( \frac{1}{\sqrt{\log(n)}} \right) \\
&= 1 - \omega\left( \frac{1}{\sqrt{\log(n)}} \right).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\exp\left\{ \left( \frac{128\log(2)}{\epsilon^2} + 2^{\frac{64}{\epsilon^2} + s} \right) n^{\frac{s-1}{s}} - \frac{\epsilon^4 n}{512} \right\}
&= \exp\left\{ \Theta(1)\, n^{1 - \omega\left( \frac{1}{\sqrt{\log(n)}} \right)} - \Theta(n) \right\} \\
&= \exp\left\{ \Theta(n) \left( n^{-\omega\left( \frac{1}{\sqrt{\log(n)}} \right)} - 1 \right) \right\} \\
&= \exp\left\{ \Theta(n) \left( e^{-\omega\left( \frac{1}{\sqrt{\log(n)}} \right)\log(n)} - 1 \right) \right\} \\
&= \exp\left\{ \Theta(n) \left( e^{-\omega\left( \sqrt{\log(n)} \right)} - 1 \right) \right\} \\
&= \exp\left\{ \Theta(n) \left( o\left( e^{-\sqrt{\log(n)}} \right) - 1 \right) \right\} \\
&= \exp\{ -\Theta(n) \}.
\end{aligned}$$

Next,

$$\binom{d}{s} \le d^s = e^{s\log(d)}.$$

We need $s\log(d) = o(n)$, or equivalently, $d = e^{o(n/s)}$.

To prove Theorem 17, we first prove Lemma 17.

Proof of Lemma 17. Consider the following procedure. We sample $X_1$ and $X_2$ independently and uniformly on $[0,1]^d$. Fix $k \in A$. Let

$$X_+ = \begin{cases} X_1 & \text{if } X_{1,k} > X_{2,k}, \\ X_2 & \text{otherwise}, \end{cases}
\qquad \text{and} \qquad
X_- = \begin{cases} X_1 & \text{if } X_{1,k} \le X_{2,k}, \\ X_2 & \text{otherwise.} \end{cases}$$

In other words, 푋+ is the right point according to coordinate 푘 and 푋− is the left point according to the same coordinate. Now,

P (푓(푋1) + 푊1 > 푓(푋2) + 푊2|푋1,푘 > 푋2,푘)

= P (푓(푋1) + 푊1 > 푓(푋2) + 푊2|푋1,푘 = 푋+, 푋2,푘 = 푋−)

= P (푓(푋+) + 푊1 > 푓(푋−) + 푊2|푋1,푘 = 푋+, 푋2,푘 = 푋−)

We claim that the conditioning in the last expression can be dropped. Indeed,

1 (푋 = 푋 , 푋 = 푋 ) = (푋 = 푋 , 푋 = 푋 ) = P 1,푘 + 2,푘 − P 1,푘 − 2,푘 + 2

so that

P (푓(푋+) + 푊1 > 푓(푋−) + 푊2)

133 1 = (푓(푋 ) + 푊 > 푓(푋 ) + 푊 |푋 = 푋 , 푋 = 푋 ) 2P + 1 − 2 1,푘 + 2,푘 − 1 + (푓(푋 ) + 푊 > 푓(푋 ) + 푊 |푋 = 푋 , 푋 = 푋 ) 2P + 1 − 2 1,푘 − 2,푘 +

= P (푓(푋+) + 푊1 > 푓(푋−) + 푊2|푋1,푘 = 푋+, 푋2,푘 = 푋−) .

The last equality is due to the two probabilities taking the same value, by symmetry. Therefore, we have

P (푓(푋1) + 푊1 > 푓(푋2) + 푊2|푋1,푘 > 푋2,푘) = P (푓(푋+) + 푊1 > 푓(푋−) + 푊2) .

Similarly,

P (푓(푋1) + 푊1 < 푓(푋2) + 푊2|푋1,푘 > 푋2,푘) = P (푓(푋+) + 푊1 < 푓(푋−) + 푊2) .

Therefore, we can equivalently define 푝푘 as

푝푘 = P (푓(푋+) + 푊1 > 푓(푋−) + 푊2) − P (푓(푋+) + 푊1 < 푓(푋−) + 푊2) .

Our goal is to show that

P (푓(푋+) + 푊1 > 푓(푋−) + 푊2) > P (푓(푋+) + 푊1 < 푓(푋−) + 푊2) .

Let 푘 = 1. By Assumption 1, the function 푓 is not constant with respect to the first coordinate.

We now construct a coupling (푋−, 푋+, 푊 1, 푊 2) ∼ (푋−, 푋+, 푊1, 푊2) (⋆) such that

(︀ )︀ P 푓(푋+) + 푊 1 > 푓(푋−) + 푊 2 > P (푓(푋+) + 푊1 < 푓(푋−) + 푊2) .

The coupling is given by setting 푋+,1 = 푋+,1, 푋−,1 = 푋−,1. Set 푋+,푖 = 푋−,푖 and

푋−,푖 = 푋+,푖 for all 푖 ∈ {2, . . . , 푑}. Finally, set 푊 1 = 푊2 and 푊 2 = 푊1. Observe that

134 by monotonicity, 푓(푋+) ≥ 푓(푋−) and similarly 푓(푋−) ≤ 푓(푋+). Therefore,

{푓(푋−) − 푓(푋+) ≥ 푊1 − 푊2} =⇒ {푓(푋+) − 푓(푋−) > 푊 2 − 푊 1}.

Furthermore, (⋆) holds. We conclude that

(︀ )︀ P 푓(푋+) + 푊 1 > 푓(푋−) + 푊 2 ≥ P (푓(푋+) + 푊1 < 푓(푋−) + 푊2) .

To show the strict inequality, it suffices to show that there is a non-zero probability of the event

{푓(푋+) + 푊 1 > 푓(푋−) + 푊 2} ∩ {푓(푋+) + 푊1 ≥ 푓(푋−) + 푊2}

= {푓(푋+) + 푊2 > 푓(푋−) + 푊1} ∩ {푓(푋+) + 푊1 ≥ 푓(푋−) + 푊2}

= {푓(푋+) − 푓(푋−) > 푊1 − 푊2} ∩ {푓(푋+) − 푓(푋−) ≥ 푊2 − 푊1}

= {푓(푋−) − 푓(푋+) < 푊2 − 푊1 ≤ 푓(푋+) − 푓(푋−)}

This last expression holds with non-zero probability because there exists some 휖 > 0

such that with positive probability, 푓(푋+) ≥ 푓(푋−) + 휖 and similarly 푓(푋−) ≤

푓(푋+) − 휖. (Otherwise, 푓 would be constant with respect to the first coordinate).
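The quantity $p_k$ from Lemma 17 is straightforward to estimate by simulation, which makes the positivity claim concrete. In the sketch below, the monotone function, the ambient dimension, and the noise level are illustrative assumptions, not choices made in the thesis.

\begin{verbatim}
import numpy as np

def estimate_p_k(f, d, k, n_pairs=200_000, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of
    p_k = P(Y_1 > Y_2 | X_{1,k} > X_{2,k}) - P(Y_1 < Y_2 | X_{1,k} > X_{2,k})."""
    rng = np.random.default_rng(seed)
    X1 = rng.uniform(size=(n_pairs, d))
    X2 = rng.uniform(size=(n_pairs, d))
    Y1 = f(X1) + rng.normal(0.0, noise_sd, n_pairs)
    Y2 = f(X2) + rng.normal(0.0, noise_sd, n_pairs)
    cond = X1[:, k] > X2[:, k]
    return np.mean(Y1[cond] > Y2[cond]) - np.mean(Y1[cond] < Y2[cond])

f = lambda X: X[:, 0] + X[:, 1] ** 2 + X[:, 2]   # active coordinates {0, 1, 2}
print(estimate_p_k(f, d=5, k=0))   # strictly positive, as Lemma 17 asserts
print(estimate_p_k(f, d=5, k=4))   # approximately zero for an inactive coordinate
\end{verbatim}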

Proposition 11. Consider stage 푡 in Algorithm 3’. Suppose that the first 푡 − 1 coordinates recovered by the algorithm are correct, i.e. 푘푖 ∈ {1, . . . , 푠} for all 푖 ∈

{1, . . . , 푡 − 1}. Let 푅 = {1, . . . , 푠} ∖ {푘1, . . . , 푘푡−1}. Let (푋1, 푌1) and (푋2, 푌2) be independent samples from the model. There exists 푟 ∈ 푅 so that for all 푘 ∈ 푅,

P (푌1 > 푌2|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0) ≥ P (푌2 > 푌1|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0) .

Proof. Let

푓(푟, 푘) = P (푌1 > 푌2|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0)−P (푌2 > 푌1|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0) .

We first claim that 푓(푟, 푘) = −푓(푘, 푟). Using the fact that 푞(1, 2, 푘) = 1 ⇐⇒ 푞(2, 1, 푘) = 0 for all but a measure-zero set,

푓(푘, 푟) = P (푌1 > 푌2|푞(1, 2, 푘) = 1, 푞(1, 2, 푟) = 0) − P (푌2 > 푌1|푞(1, 2, 푘) = 1, 푞(1, 2, 푟) = 0)

= P (푌1 > 푌2|푞(2, 1, 푘) = 0, 푞(2, 1, 푟) = 1) − P (푌2 > 푌1|푞(2, 1, 푘) = 0, 푞(2, 1, 푟) = 1)

= P (푌2 > 푌1|푞(1, 2, 푘) = 0, 푞(1, 2, 푟) = 1) − P (푌1 > 푌2|푞(1, 2, 푘) = 0, 푞(1, 2, 푟) = 1) = −푓(푟, 푘).

If there are one or two indices remaining to be found, then clearly such an 푟 exists. Otherwise, let 푎, 푏, and 푐 be correct indices that have not yet been found. Our next claim is that

{푓(푎, 푏) ≥ 푓(푏, 푎), 푓(푏, 푐) ≥ 푓(푐, 푏)} =⇒ 푓(푎, 푐) ≥ 푓(푐, 푎).

Suppose 푓(푎, 푏) ≥ 푓(푏, 푎) and 푓(푏, 푐) ≥ 푓(푐, 푏). Then 푓(푎, 푏) ≥ 0 and 푓(푏, 푐) ≥ 0. Observe that

푓(푎, 푏) ≥ 0

⇐⇒ P (푌1 > 푌2|푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0) − P (푌2 > 푌1|푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0) ≥ 0

⇐⇒ P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0) − P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0) ≥ 0

Similarly,

푓(푏, 푐) ≥ 0

⇐⇒ P (푌1 > 푌2, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0) − P (푌2 > 푌1, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0) ≥ 0 and

푓(푎, 푐) ≥ 0

⇐⇒ P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0) − P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0) ≥ 0.

By the Law of Total Probability,

P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0)

= P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

+ P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 0) . (5.24)

Consider the first term of Equation (5.24).

P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

= P (푌1 > 푌2, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0) − P (푌1 > 푌2, 푞(1, 2, 푎) = 0, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

Similarly, consider the second term of Equation (5.24).

P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 0)

= P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0) − P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 1)

Adding the terms,

P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0)

= P (푌1 > 푌2, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0) − P (푌1 > 푌2, 푞(1, 2, 푎) = 0, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

+ P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0) − P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 1) .

Analyzing the expression P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0) similarly, we obtain

P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0)

= P (푌2 > 푌1, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0) − P (푌2 > 푌1, 푞(1, 2, 푎) = 0, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

+ P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0) − P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 1) .

Recall that we need to show

P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0) − P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0) ≥ 0.

Taking the difference,

P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0) − P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푐) = 0) = 푓(푎, 푏) + 푓(푏, 푐)

− P (푌1 > 푌2, 푞(1, 2, 푎) = 0, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

− P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 1)

+ P (푌2 > 푌1, 푞(1, 2, 푎) = 0, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

+ P (푌2 > 푌1, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 1) = 푓(푎, 푏) + 푓(푏, 푐)

− P (푌1 > 푌2, 푞(1, 2, 푎) = 0, 푞(1, 2, 푏) = 1, 푞(1, 2, 푐) = 0)

− P (푌1 > 푌2, 푞(1, 2, 푎) = 1, 푞(1, 2, 푏) = 0, 푞(1, 2, 푐) = 1)

+ P (푌2 > 푌1, 푞(2, 1, 푎) = 1, 푞(2, 1, 푏) = 0, 푞(2, 1, 푐) = 1)

+ P (푌2 > 푌1, 푞(2, 1, 푎) = 0, 푞(2, 1, 푏) = 1, 푞(2, 1, 푐) = 0) = 푓(푎, 푏) + 푓(푏, 푐)

≥ 0.

Therefore, we have proven the claim

{푓(푎, 푏) ≥ 푓(푏, 푎), 푓(푏, 푐) ≥ 푓(푐, 푏)} =⇒ 푓(푎, 푐) ≥ 푓(푐, 푎).

This fact shows that the indices can be totally ordered, i.e. by writing 푎 ≥ 푏 when 푓(푎, 푏) ≥ 푓(푏, 푎). We let 푟 be the largest element of the order.

Lemma 22. Consider stage $t$ in Algorithm 3', which uses $m = \frac{n}{s}$ samples. The probability that the first coordinate is correctly recovered is at least

$$1 - (d-1)\exp\left( -\frac{p_1^2 m}{64 s^2} \right).$$

Suppose that the first $t-1$ coordinates recovered by the algorithm are correct, i.e. $k_i \in \{1, \dots, s\}$ for all $i \in \{1, \dots, t-1\}$. Then $k_t \in \{1, \dots, s\}$ with probability at least

$$1 - (d-t)\exp\left( -\frac{p_1^2 m}{64 (s-t+1)^2} \right).$$

Proof. Applying Proposition 11, let 푟 be an element of {1, . . . , 푠} ∖ {푘1, . . . , 푘푡−1} such that

P (푌1 > 푌2|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0) ≥ P (푌2 > 푌1|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0) .

for all $k \in \{1, \dots, s\} \setminus \{k_1, \dots, k_{t-1}\}$. Next, we consider the optimization problem at step $t$. Fix a feasible solution $v = \bar{v}$. Recall that $\sum_{k=1}^d \bar{v}_k = 1$. For this fixed value $\bar{v}$, the optimal choice for the variables $c_k^{ij}$ must satisfy

$$\sum_{k=1}^d q(i,j,k)\, c_k^{ij} = 1 - \sum_{k=1}^d q(i,j,k)\, \bar{v}_k$$

for $i, j$ such that $Y_i > Y_j$ and $\sum_{k=1}^d q(i,j,k) \ge 1$, with $c_k^{ij} = 0$ whenever $q(i,j,k) = 0$. Note that $\sum_{k=1}^d q(i,j,k)\, c_k^{ij} = \sum_{k=1}^d c_k^{ij}$. Therefore, the objective function is equal to

$$z(\bar{v}) \triangleq \sum_{i=1}^m \sum_{j=1}^m \mathbb{1}\{ Y_i > Y_j,\ X_i \not\preceq X_j \} \left( 1 - \sum_{k=1}^d q(i,j,k)\, \bar{v}_k \right).$$
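To make the role of $z(\bar{v})$ concrete, the sketch below evaluates this objective for a candidate direction $v$. It assumes $q(i,j,k) = \mathbb{1}\{X_{i,k} > X_{j,k}\}$, which is consistent with the properties of $q$ used in this section ($q(i,j,k) = 1 \iff q(j,i,k) = 0$, and $X_i \preceq X_j$ forces $q(i,j,k) = 0$); since the exact definition appears in an earlier section, treat this form of $q$ as an assumption of the illustration.

\begin{verbatim}
import numpy as np

def z_objective(X, Y, v):
    """z(v) = sum_{i,j} 1{Y_i > Y_j, X_i not<= X_j} (1 - sum_k q(i,j,k) v_k),
    with q(i, j, k) taken to be 1{X_{i,k} > X_{j,k}} (assumption)."""
    m, d = X.shape
    total = 0.0
    for i in range(m):
        for j in range(m):
            if Y[i] > Y[j] and not np.all(X[i] <= X[j]):
                q_ij = (X[i] > X[j]).astype(float)
                total += 1.0 - q_ij @ v
    return total

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 4))
Y = X[:, 0] + 0.1 * rng.normal(size=50)   # coordinate 0 drives the response
print(z_objective(X, Y, np.eye(4)[0]))    # typically small for the informative e_r
print(z_objective(X, Y, np.eye(4)[3]))    # typically larger for an uninformative e_k
\end{verbatim}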

Let $v^\star = e_r$. Let $F_t$ be the feasible set for the vector of variables $v$ at step $t$, i.e.

$$F_t = \left\{ v \in \mathbb{R}^d : v_i \ge 0\ \forall i \in \{1, \dots, d\},\ v_i = 0\ \forall i \in \{k_1, \dots, k_{t-1}\},\ \sum_{i=1}^d v_i = 1 \right\}.$$

Let $\bar{F}_t = \{ v \in F_t : \arg\max_i v_i \cap \{s+1, \dots, d\} \neq \emptyset \}$. In other words, $\bar{F}_t$ is the set of feasible solutions that lead to an incorrect coordinate choice at step $t$. We will give an upper bound on the probability

(︀ ⋆ )︀ P ∃푣 ∈ 퐹 푡 : 푧(푣) ≤ 푧(푣 ) .

⋆ Note that the complementary event, {푧(푣) > 푧(푣 ), ∀푣 ∈ 퐹 푡}, implies that the

139 optimization problem will choose a coordinate among {1, . . . , 푠} ∖ {푘1, . . . , 푘푡−1}. ⋆ ⋆ Let 푣 ∈ 퐹 푡 and write 푣 = 푣 + 푢. Observe that since the coordinates of 푣 and 푣

both sum to 1, the coordinates of 푢 sum to 0. Also, 푢푟 < 0 and 푢푘 ≥ 0 for 푘 ̸= 푟. Now,

푧(푣) − 푧(푣⋆)

푚 푚 (︃(︃ 푑 )︃ (︃ 푑 )︃)︃ ∑︁ ∑︁ 1 ∑︁ ∑︁ ⋆ = {푌푖 > 푌푗, 푋푖 ̸⪯ 푋푗} 1 − 푞(푖, 푗, 푘)푣푘 − 1 − 푞(푖, 푗, 푘)푣푘 푖=1 푗=1 푘=1 푘=1 푚 푚 (︃ 푑 푑 )︃ ∑︁ ∑︁ 1 ∑︁ ⋆ ∑︁ = {푌푖 > 푌푗, 푋푖 ̸⪯ 푋푗} 푞(푖, 푗, 푘)푣푘 − 푞(푖, 푗, 푘)푣푘 푖=1 푗=1 푘=1 푘=1 푚 푚 푑 ∑︁ ∑︁ 1 ∑︁ ⋆ = {푌푖 > 푌푗, 푋푖 ̸⪯ 푋푗} 푞(푖, 푗, 푘)(푣푘 − 푣푘) 푖=1 푗=1 푘=1 푚 푚 푑 ∑︁ ∑︁ ∑︁ = − 1 {푌푖 > 푌푗, 푋푖 ̸⪯ 푋푗} 푞(푖, 푗, 푘)푢푘 푖=1 푗=1 푘=1 푑 푚 푚 ∑︁ ∑︁ ∑︁ = − 푢푘 1 {푌푖 > 푌푗, 푋푖 ̸⪯ 푋푗} 푞(푖, 푗, 푘). 푘=1 푖=1 푗=1

Now,

1 {푌푖 > 푌푗, 푋푖 ̸⪯ 푋푗} 푞(푖, 푗, 푘) = 1 {푌푖 > 푌푗} 1 {푋푖 ̸⪯ 푋푗} 푞(푖, 푗, 푘)

= 1 {푌푖 > 푌푗} (1 − 1 {푋푖 ⪯ 푋푗}) 푞(푖, 푗, 푘)

= 1 {푌푖 > 푌푗} 푞(푖, 푗, 푘) − 1 {푌푖 > 푌푗} 1 {푋푖 ⪯ 푋푗} 푞(푖, 푗, 푘)

= 1 {푌푖 > 푌푗} 푞(푖, 푗, 푘),

where the last equality is due to the fact that 푋푖 ⪯ 푋푗 implies 푞(푖, 푗, 푘) = 0. Substi- tuting,

푑 푚 푚 ⋆ ∑︁ ∑︁ ∑︁ 푧(푣) − 푧(푣 ) = − 푢푘 1 {푌푖 > 푌푗} 푞(푖, 푗, 푘) 푘=1 푖=1 푗=1 푚 푚 푛 푛 ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ = −푢푟 1 {푌푖 > 푌푗} 푞(푖, 푗, 푟) − 푢푘 1 {푌푖 > 푌푗} 푞(푖, 푗, 푘) 푖=1 푗=1 푘̸=푟 푖=1 푗=1

140 푚 푚 푛 푛 ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ = 푢푘 1 {푌푖 > 푌푗} 푞(푖, 푗, 푟) − 푢푘 1 {푌푖 > 푌푗} 푞(푖, 푗, 푘) 푘̸=푟 푖=1 푗=1 푘̸=푟 푖=1 푗=1 푚 푚 ∑︁ ∑︁ ∑︁ = 푢푘 1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) . 푘̸=푟 푖=1 푗=1

(︀ ⋆ )︀ Recall that we seek to upper bound the probability P ∃푣 ∈ 퐹 푡 : 푧(푣) ≤ 푧(푣 ) . From the above,

(︀ ⋆ )︀ P ∃푣 ∈ 퐹 푡 : 푧(푣) ≤ 푧(푣 ) (︀ ⋆ ⋆ ⋆ )︀ = P ∃푣 + 푢 ∈ 퐹 푡 : 푧(푣 + 푢) ≤ 푧(푣 ) (︃ 푚 푚 )︃ ⋆ ∑︁ ∑︁ ∑︁ 1 = P ∃푣 + 푢 ∈ 퐹 푡 : 푢푘 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 0 . 푘̸=푟 푖=1 푗=1

⋆ Consider 푣 + 푢 ∈ 퐹 푡. Recalling that 푢푘 = 0 for all 푘 ∈ {푘1, . . . , 푘푡−1}, observe that

∑︁ 1 ∑︁ 푢 ≥ 푢 푘 푠 − 푡 푘 푘∈{푠+1,...,푑} 푘∈{1,...,푠}∖{푟} ∑︁ ∑︁ ⇐⇒ (푠 − 푡) 푢푘 ≥ −푢푟 − 푢푘 푘∈{푠+1,...,푑} 푘∈{푠+1,...,푑} ∑︁ 1 ⇐⇒ 푢 ≥ (−푢 ). (5.25) 푘 푠 − 푡 + 1 푟 푘∈{푠+1,...,푑}

Since −푢푟 > 0,

(︃ 푚 푚 )︃ ⋆ ∑︁ ∑︁ ∑︁ 1 P ∃푣 + 푢 ∈ 퐹 푡 : 푢푘 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 0 푘̸=푟 푖=1 푗=1 (︃ 푚 푚 )︃ 1 ⋆ ∑︁ ∑︁ ∑︁ 1 = P ∃푣 + 푢 ∈ 퐹 푡 : 푢푘 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 0 . −푢푟 푘̸=푟 푖=1 푗=1

푝푟 ⋆ Let 0 < Δ ≤ 4(푠−푡+1) . Observe that the existence of 푣 + 푢 ∈ 퐹 푡 such that 1 ∑︀ 푢 ∑︀푚 ∑︀푚 1 {푌 > 푌 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 0 implies at least one of −푢푟 푘̸=푟 푘 푖=1 푗=1 푖 푗 the following occurs:

1. There exists 푘 ∈ {1, . . . , 푠} ∖ {푟, 푘1, . . . , 푘푡−1} such that

푚 푚 ∑︁ ∑︁ 1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 푚(푚 − 1)(−Δ). 푖=1 푗=1

2. There exists 푘 ∈ {푠 + 1, . . . , 푑} such that

푚 푚 ∑︁ ∑︁ (︂1 )︂ 1 {푌 > 푌 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 푚(푚 − 1) 푝 − Δ . 푖 푗 4 푟 푖=1 푗=1

⋆ Indeed, if none of these events occur, then for every 푣 + 푢 ∈ 퐹 푡,

푚 푚 1 ∑︁ ∑︁ ∑︁ 푢푘 1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) −푢푟 푘̸=푟 푖=1 푗=1 푚 푚 1 ∑︁ ∑︁ ∑︁ = 푢푘 1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) −푢푟 푘∈{1,...,푠}∖{푟,푘1,...,푘푡−1} 푖=1 푗=1 푚 푚 1 ∑︁ ∑︁ ∑︁ + 푢푘 1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) −푢푟 푘∈{푠+1,...,푑} 푖=1 푗=1 ⎡ ⎤ 1 ∑︁ 1 ∑︁ (︂1 )︂ > 푚(푚 − 1) ⎣ 푢푘(−Δ) + 푢푘 푝푟 − Δ ⎦ −푢푟 −푢푟 4 푘∈{1,...,푠}∖{푟,푘1,...,푘푡−1} 푘∈{푠+1,...,푑} ⎡ ⎤ −Δ ∑︁ 푝푟 ∑︁ = 푚(푚 − 1) ⎣ 푢푘 + 푢푘⎦ −푢푟 −4푢푟 푘̸=푟 푘∈{푠+1,...,푑} ⎡ ⎤ 푝푟 ∑︁ = 푚(푚 − 1) ⎣−Δ + 푢푘⎦ −4푢푟 푘∈{푠+1,...,푑} [︂ 푝 ]︂ ≥ 푚(푚 − 1) −Δ + 푟 (5.26) 4(푠 − 푡 + 1) [︂ 푝 푝 ]︂ ≥ 푚(푚 − 1) − 푟 + 푟 4(푠 − 푡 + 1) 4(푠 − 푡 + 1) = 0.

The inequality (5.26) holds by (5.25). Therefore, by the Union Bound,

(︃ 푚 푚 )︃ 1 ⋆ ∑︁ ∑︁ ∑︁ 1 P ∃푣 + 푢 ∈ 퐹 푡 : 푢푘 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 0 −푢푟 푘̸=푟 푖=1 푗=1 (︃ 푚 푚 )︃ ∑︁ ∑︁ ∑︁ 1 ≤ P {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ −푚(푚 − 1)Δ 푘∈{1,...,푠}∖{푟,푘1,...,푘푡−1} 푖=1 푗=1 (︃ 푚 푚 )︃ ∑︁ ∑︁ ∑︁ (︂1 )︂ + 1 {푌 > 푌 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 푚(푚 − 1) 푝 − Δ P 푖 푗 4 푟 푘∈{푠+1,...,푑} 푖=1 푗=1

We upper bound each probability by establishing concentration. Fix 푖, 푗 ∈ {1, . . . , 푚} with 푖 ̸= 푗, and 푘 ̸= 푟. By the Law of Total Expectation,

1 E [ {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘))] 1 = E [ {푌푖 > 푌푗} |푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0] P (푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) 1 − E [ {푌푖 > 푌푗} |푞(푖, 푗, 푟) = 0, 푞(푖, 푗, 푘) = 1] P (푞(푖, 푗, 푟) = 0, 푞(푖, 푗, 푘) = 1)

= P (푌푖 > 푌푗|푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) P (푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0)

− P (푌푖 > 푌푗|푞(푖, 푗, 푟) = 0, 푞(푖, 푗, 푘) = 1) P (푞(푖, 푗, 푟) = 0, 푞(푖, 푗, 푘) = 1)

= P (푌푖 > 푌푗, 푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) − P (푌푖 > 푌푗, 푞(푖, 푗, 푟) = 0, 푞(푖, 푗, 푘) = 1) .

Now, 푞(푖, 푗, 푘) = 1 ⇐⇒ 푞(푗, 푖, 푘) = 0. Therefore,

1 E [ {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘))]

= P (푌푖 > 푌푗, 푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) − P (푌푖 > 푌푗, 푞(푗, 푖, 푟) = 1, 푞(푗, 푖, 푘) = 0)

= P (푌푖 > 푌푗, 푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) − P (푌푗 > 푌푖, 푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0)

= [P (푌푖 > 푌푗|푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) − P (푌푗 > 푌푖|푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0)] P (푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) 1 = [ (푌 > 푌 |푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0) − (푌 > 푌 |푞(푖, 푗, 푟) = 1, 푞(푖, 푗, 푘) = 0)] , 4 P 푖 푗 P 푗 푖 where we have swapped 푖 and 푗 in the second equality, due to symmetry. We now

consider the two cases for 푘. First consider 푘 ∈ {1, . . . , 푠} ∖ {푟, 푘1, . . . , 푘푡−1}. Due to the choice of 푟, the expectation is nonnegative, and we lower bound it by 0.

Next consider 푘 ∈ {푠 + 1, . . . , 푑}. Due to the independence of the coordinates of 푋, the values of the non-active coordinates do not influence the value of the active coordinates. Also, the value of the function is determined entirely by the active coordinates. Therefore, we can drop the conditioning on the ordering on the inactive coordinate 푘. For 푘 ∈ {푠 + 1, . . . , 푑}, we therefore have

1 [1 {푌 > 푌 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘))] = [ (푌 > 푌 |푞(푖, 푗, 푟) = 1) − (푌 > 푌 |푞(푖, 푗, 푟) = 1)] E 푖 푗 4 P 푖 푗 P 푗 푖 1 = 푝 . 4 푟

[︁ ]︁ ∑︀푚 ∑︀푚 1 Let 푘 ∈ {1, . . . , 푠}∖{푟, 푘1, . . . , 푘푡−1}. Since E 푖=1 푗=1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≥ 0, we have

⎛ 푚 푚 ⎞ ∑︁ ∑︁ P ⎝ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ −푚(푚 − 1)Δ⎠ 푖=1 푗=1

⎛ 푚 푚 ⎡ 푚 푚 ⎤ ⎞ ∑︁ ∑︁ ∑︁ ∑︁ ≤ P ⎝ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ E ⎣ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘))⎦ − 푚(푚 − 1)Δ⎠ . 푖=1 푗=1 푖=1 푗=1

Similarly for 푘 ∈ {푠 + 1, . . . , 푑}, we have

⎛ 푚 푚 ⎞ ∑︁ ∑︁ (︂ 1 )︂ P ⎝ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 푚(푚 − 1) 푝푟 − Δ ⎠ 4 푖=1 푗=1

⎛ 푚 푚 ⎡ 푚 푚 ⎤ ⎞ ∑︁ ∑︁ ∑︁ ∑︁ = P ⎝ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ E ⎣ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘))⎦ − 푚(푚 − 1)Δ⎠ . 푖=1 푗=1 푖=1 푗=1

∑︀푚 ∑︀푚 1 Consider the summation 푖=1 푗=1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)), as a function of the 푋푖 and 푊푖 variables, for fixed 푘. We now establish the bounded differences property for the 푋푖 and 푊푖 variables. Suppose we change the value of 푊푖. The affected terms are 1 {푌푖 > 푌푗} (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) and 1 {푌푗 > 푌푖} (푞(푗, 푖, 푟) − 푞(푗, 푖, 푘)), for all 푗 ≠ 푖. Fix 푗 ̸= 푖. The largest absolute change is 2, and occurs when 푞(푖, 푗, 푟) = 1,

푞(푖, 푗, 푘) = 0, and 푌푖 > 푌푗, and changing 푊푖 switches the order on 푌푖 and 푌푗. Adding the contributions for all 푗 ̸= 푖, the total change corresponding to changing 푊푖 is bounded by 2(푚 − 1). By similar reasoning, changing any 푋푖 may change the summation by up to 2(푚 − 1).

Applying the McDiarmid inequality, we obtain for every 푘 ̸∈ {푟, 푘1, . . . , 푘푡−1},

⎛ 푚 푚 ⎡ 푚 푚 ⎤ ⎞ ∑︁ ∑︁ ∑︁ ∑︁ P ⎝ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ E ⎣ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘))⎦ − 푚(푚 − 1)Δ⎠ 푖=1 푗=1 푖=1 푗=1 (︃ )︃ 2 (Δ푚(푚 − 1))2 ≤ exp − 2푚(2(푚 − 1))2 (︂ Δ2푚2(푚 − 1)2 )︂ = exp − 4푚(푚 − 1)2 (︂ 1 )︂ = exp − Δ2푚 . 4

푝푟 Substituting Δ = 4(푠−푡+1) , we obtain

⎛ 푚 푚 ⎡ 푚 푚 ⎤ ⎞ ∑︁ ∑︁ ∑︁ ∑︁ P ⎝ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ E ⎣ 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘))⎦ − 푚(푚 − 1)Δ⎠ 푖=1 푗=1 푖=1 푗=1 (︂ 푝2푚 )︂ ≤ exp − 푟 . 64(푠 − 푡 + 1)2

Finally,

⎛ ⎞ 푚 푚 (︂ 2 )︂ ⋆ 1 ∑︁ ∑︁ ∑︁ 푝푟푚 P ⎝∃푣 + 푢 ∈ 퐹 푡 : 푢푘 1 {푌푖 > 푌푗 } (푞(푖, 푗, 푟) − 푞(푖, 푗, 푘)) ≤ 0⎠ ≤ (푑 − 푡) exp − . −푢 64(푠 − 푡 + 1)2 푟 푘̸=푟 푖=1 푗=1

We conclude that the probability that the 푡th coordinate is correct is lower-bounded by

(︂ 푝2푚 )︂ (︂ 푝2푚 )︂ 1 − (푑 − 푡) exp − 푟 ≥ 1 − (푑 − 푡) exp − 1 . 64(푠 − 푡 + 1)2 64(푠 − 푡 + 1)2

Proof of Theorem 17. Let 푘푡 be the 푡th coordinate recovered, where 푡 ∈ {1, . . . , 푠}.

By Lemma 22, the probability that $k_1 \notin \{1, \dots, s\}$ is upper bounded by $(d-1)\exp\left( -\frac{p_1^2 m}{64 s^2} \right)$. Next, the probability that $k_2 \notin \{1, \dots, s\}$ given that $k_1 \in \{1, \dots, s\}$ is upper bounded by $(d-2)\exp\left( -\frac{p_1^2 m}{64 (s-1)^2} \right)$. In general, the probability that $k_t \notin \{1, \dots, s\}$ given that $k_i \in \{1, \dots, s\}$ for all $i \in \{1, \dots, t-1\}$ is at most $(d-t)\exp\left( -\frac{p_1^2 m}{64 (s-t+1)^2} \right)$. Therefore, the probability of error in any coordinate is upper bounded by

$$\sum_{t=1}^{s} (d-t)\exp\left( -\frac{p_1^2 m}{64 (s-t+1)^2} \right) \le \sum_{t=1}^{s} (d-t)\exp\left( -\frac{p_1^2 m}{64 s^2} \right) \le ds\exp\left( -\frac{p_1^2 n}{64 s^3} \right).$$

Proof of Corollary 3.

$$ds\exp\left( -\frac{p_1^2 n}{64 s^3} \right) = \exp\left( \log(d) + \log(s) - \frac{p_1^2 n}{64 s^3} \right) \le \exp\left( 2\log(d) - \frac{p_1^2 n}{64 s^3} \right). \qquad (5.27)$$

Therefore, if $n = \omega(s^3 \log(d))$, then (5.27) goes to zero.

Proof of Corollary 4. Support recovery fails with probability at most

(︂ 푝2푛 )︂ 푑푠 exp − 1 . 64푠3

If it succeeds, the probability of the 퐿2 norm error exceeding 휖 is upper bounded by (︁ )︁ 푛 ^ 2 the value in Theorem 16, with 푑 set to 푠 and 푛 set to 푚 = 푠 . Then P ‖푓푛 − 푓‖2 > 휖 is at most

(︂ 2 )︂ [︂(︂ 12 11 )︂ 3 ]︂ 푝1푛 2 2 푠 푠−1 3휖 푚 푑푠 exp − + 6 exp log(2) + 2 휖2 2 푚 푠 − . 64푠3 휖2 41 × 210

Therefore, if $n = \omega(s^3 \log(d))$ and $n = s e^{\omega(s^2)}$, the estimator is consistent, by Corollaries 2 and 3.

5.7.2 Proofs for the Noisy Input Model

Proof of Theorem 18. To illustrate the proof idea, we show the claim for 푑 = 푠 = 1

first. Observe that for any monotone partition (푆0, 푆1) in R, either 푆0 = {푥 : 푥 ≤ 푟}

or 푆0 = {푥 : 푥 < 푟} for some 푟. When 푑 = 푠 = 1, the optimization problem (5.5)-(5.9) amounts to finding a boundary 푟 ∈ R. Let

$$g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) = \sum_{i=1}^n \mathbb{1}\{ f(X_i + W_i) = 1, X_i \in S_0 \} + \mathbb{1}\{ f(X_i + W_i) = 0, X_i \in S_1 \}$$

denote the corresponding value of the objective function. Observe that the value of $g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right)$ can change by at most $\pm 2$ when any one of the random variables is changed. Applying the McDiarmid inequality, for all $\epsilon > 0$, it holds that

$$\mathbb{P}\left( g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) - \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] \ge \epsilon n \right) \le \exp\left( -\frac{2\epsilon^2 n^2}{2n \cdot 2^2} \right) = \exp\left( -\frac{\epsilon^2 n}{4} \right).$$

Similarly,

$$\mathbb{P}\left( g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) - \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] \le -\epsilon n \right) \le \exp\left( -\frac{\epsilon^2 n}{4} \right).$$

We now calculate $\mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right]$:

$$\mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] = n\left[ p \int_{t \in S_1} h_0(t)\, dt + (1-p) \int_{t \in S_0} h_1(t)\, dt \right] = n\left[ p H_0(S_1) + (1-p) H_1(S_0) \right] = n \cdot q(S_0, S_1).$$
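For the $d = s = 1$ case just described, the objective $g$ is simply a count of mislabeled points for a threshold partition, and its minimizer concentrates around the true boundary. The sketch below is illustrative: the true function $f(x) = \mathbb{1}\{x \ge 0.5\}$ and the noise level are assumptions chosen for the example.

\begin{verbatim}
import numpy as np

def g_threshold(X, Y, r):
    """g(X_{1:n}, W_{1:n}; (S_0, S_1)) with S_0 = {x <= r}, Y_i = f(X_i + W_i)."""
    return np.sum((Y == 1) & (X <= r)) + np.sum((Y == 0) & (X > r))

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=n)
W = rng.normal(0.0, 0.1, size=n)
Y = (X + W >= 0.5).astype(int)                   # noisy-input observations
grid = np.linspace(0.0, 1.0, 201)
r_hat = grid[np.argmin([g_threshold(X, Y, r) for r in grid])]
print(r_hat)   # concentrates near the true boundary 0.5
\end{verbatim}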

By Assumption 2, the expectation has a unique minimizer $(S_0^\star, S_1^\star) \in \mathcal{M}_1$. Observe that

$$\mathbb{P}\left( \|\hat f_n - f\|_2^2 > \delta \right) = \mathbb{P}\left( D\left( (S_0, S_1), (S_0^\star, S_1^\star) \right) > \delta \right) = \mathbb{P}\left( (S_0, S_1) \notin B_\delta(S_0^\star, S_1^\star) \right).$$

We therefore need to analyze the probability that there exists a monotone partition outside $B_\delta(S_0^\star, S_1^\star)$ with a smaller value of $g$ than $g\left( X_{1:n}, W_{1:n}; (S_0^\star, S_1^\star) \right)$. For all $(S_0, S_1) \in \mathcal{M}_1$,

$$\mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] - \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0^\star, S_1^\star) \right) \right] = n\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right).$$

We now use the concentration result with $\epsilon$ set to $\frac{1}{3}\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right)$. For any $(S_0, S_1)$, with probability at least

$$1 - \exp\left( -\frac{\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right),$$

it holds that

$$g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \ge \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] - \frac{n}{3}\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right).$$

Similarly, with the same probability, it holds that

$$g\left( X_{1:n}, W_{1:n}; (S_0^\star, S_1^\star) \right) \le \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0^\star, S_1^\star) \right) \right] + \frac{n}{3}\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right).$$

For a given $(S_0, S_1) \neq (S_0^\star, S_1^\star)$, both of these events occur with probability at least

$$1 - 2\exp\left( -\frac{\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right).$$

In that case,

$$\begin{aligned}
&g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) - g\left( X_{1:n}, W_{1:n}; (S_0^\star, S_1^\star) \right) \\
&\ge \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] - \frac{n}{3}\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right) - \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0^\star, S_1^\star) \right) \right] - \frac{n}{3}\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right) \\
&= n\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right) - \frac{2n}{3}\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right) \\
&= \frac{n}{3}\left( q(S_0, S_1) - q(S_0^\star, S_1^\star) \right).
\end{aligned}$$

Therefore, in this situation, solution $(S_0, S_1)$ is suboptimal compared to solution $(S_0^\star, S_1^\star)$.

Observe that the cardinality of the set $\{ g(X_{1:n}, W_{1:n}; (S_0, S_1)) : (S_0, S_1) \in \mathcal{M}_d \}$ is at most $n+1$. In other words, $g$ has at most $n+1$ possible values when we range over all possible monotone partitions. Recall the definition of $q_{\min}(\delta) = \min_{(S_0, S_1) \notin B_\delta(S_0^\star, S_1^\star)} q(S_0, S_1)$. By the previous analysis and the Union Bound, the chosen sets $(S_0, S_1)$ satisfy

$$\mathbb{P}\left( (S_0, S_1) \notin B_\delta(S_0^\star, S_1^\star) \right) \le (n+2)\exp\left( -\frac{\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right).$$

Therefore, with probability at least

$$1 - (n+2)\exp\left( -\frac{\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right),$$

it holds that $\|\hat f_n - f\|_2^2 \le \delta$.

For $d \ge 2$ and $(S_0, S_1) \in \mathcal{M}_d$, let

$$g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) = \sum_{i=1}^n \mathbb{1}\{ f(X_i + W_i) = 1, X_i \in S_0 \} + \mathbb{1}\{ f(X_i + W_i) = 0, X_i \in S_1 \}.$$

The function $g$ represents the error associated with partition $(S_0, S_1)$. Applying the McDiarmid inequality,

$$\mathbb{P}\left( g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) - \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] \ge \epsilon n \right) \le \exp\left( -\frac{\epsilon^2 n}{4} \right)$$

and

$$\mathbb{P}\left( g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) - \mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] \le -\epsilon n \right) \le \exp\left( -\frac{\epsilon^2 n}{4} \right).$$

Calculating the expectation,

$$\mathbb{E}\left[ g\left( X_{1:n}, W_{1:n}; (S_0, S_1) \right) \right] = n\left[ p \int_{t \in S_1} h_0(t)\, dt + (1-p) \int_{t \in S_0} h_1(t)\, dt \right] = n\left[ p H_0(S_1) + (1-p) H_1(S_0) \right] = n \cdot q(S_0, S_1).$$

By Assumption 2, the function $q(S_0, S_1)$ has a unique minimizer, $(S_0^\star, S_1^\star)$, that corresponds to the true function $f$. Therefore, if $\|\hat f_n - f\|_2^2$ is greater than $\delta$, then the function $\hat f_n$ must be outside of $B_\delta(S_0^\star, S_1^\star)$. Then it must be the case that some $(S_0, S_1)$ outside of $B_\delta(S_0^\star, S_1^\star)$ attained a lower value of $g$ than $g\left( X_{1:n}, W_{1:n}; (S_0^\star, S_1^\star) \right)$. We use concentration to upper bound the probability of this event. First, we need to know how many possible objective values there are. This is upper

bounded by the number of binary labelings of the set $\{X_1, \dots, X_n\}$. By Theorem 21, it holds that

$$\mathbb{E}\left[ L(X_1, \dots, X_n) \right] \le \exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right].$$

For any $t > 0$, the Markov inequality tells us that

$$\mathbb{P}\left( L(X_1, \dots, X_n) \ge t \right) \le \frac{\mathbb{E}\left[ L(X_1, \dots, X_n) \right]}{t} \le \frac{\exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right]}{t}.$$

Setting $t = \exp\left[ n^{\frac{2s-1}{2s}} \right]$,

$$\mathbb{P}\left( L(X_1, \dots, X_n) \ge \exp\left[ n^{\frac{2s-1}{2s}} \right] \right) \le \frac{\exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right]}{\exp\left[ n^{\frac{2s-1}{2s}} \right]}.$$

Therefore, with probability at least $1 - \frac{\exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right]}{\exp\left[ n^{\frac{2s-1}{2s}} \right]}$, there are at most $\exp\left[ n^{\frac{2s-1}{2s}} \right]$ labelings, and therefore function values. We bound the $L_2$ loss similarly to the proof for the case $d = s = 1$, above. Recall that $q_{\min}(\delta) = \min_{(S_0, S_1) \notin B_\delta(S_0^\star, S_1^\star)} q(S_0, S_1)$. Set $\epsilon = \frac{1}{3}\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)$ in the McDiarmid bound so that the optimal value remains separated from the alternatives.

$$\begin{aligned}
\mathbb{P}\left( \|\hat f - f\|_2^2 > \delta \right) &= \mathbb{P}\left( (S_0, S_1) \notin B_\delta(S_0^\star, S_1^\star) \right) \\
&\le \frac{\exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right]}{\exp\left[ n^{\frac{2s-1}{2s}} \right]} + \left( \exp\left[ n^{\frac{2s-1}{2s}} \right] + 1 \right) \exp\left( -\frac{\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right).
\end{aligned}$$

Proof of Corollary 5. We equivalently show that $s = o\left( \sqrt{\log(n)} \right)$ is sufficient. Analyzing the first term,

{︂ (︂ )︂}︂ {︁ (︁ 푠 log (2) 1 )︁ − 1 }︁ 1− 1 푠 log (2) 1 1 exp 푛 푛 푛 + 2 log(2) − 1 − 푛 2푠 푛 푠 ≤ exp 푛 푠 푛 푛 − 푛 2푠 + 2 {︂ (︂ )︂}︂ 1− 1 푠 log (2)− 1 1 − 1 = exp 푛 2푠 푛 푛 2푠 − 1 + 푛 2푠 2 {︂ (︂ )︂}︂ 푠 log (2)− 1 1 ≤ exp 푛 푛 푛 2푠 − 2 {︂ (︂ )︂}︂ 1 푠2 log (2)− 1 1 = exp 푛 푛 푠 ( 푛 2 ) − 2 {︂ (︂ )︂}︂ 1 표(1)− 1 1 = exp 푛 푛 푠 ( 2 ) − 2 {︂ (︂ )︂}︂ −Θ(1) 1 1 = exp 푛 푛 푠 − 2 {︃ (︃ (︂ )︂ )︃}︃ −휔 √ 1 1 = exp 푛 푛 log(푛) − 2

{︂ (︂ (︂ − √ 1 )︂ 1)︂}︂ = exp 푛 표 푛 log(푛) − 2 {︂ (︂ )︂}︂ (︁ − 1 )︁ 1 = exp 푛 표 푛 log(푛) − 2 {︂ (︂ )︂}︂ (︁ − log푛(2) )︁ 1 = exp 푛 표 푛 log(2) − 2 {︂ (︂ )︂}︂ (︁ − 1 )︁ 1 = exp 푛 표 2 log(2) − 2 {︂ (︂ 1)︂}︂ = exp 푛 표 (1) − 2 = exp {−Θ(1)푛}

⋆ ⋆ 2 We have assumed that the expression (푞min (훿) − 푞 (푆0 , 푆1 )) is constant in 푠. Analyzing the second term,

$$\begin{aligned}
\exp\left[ n^{\frac{2s-1}{2s}} \right] \exp\left( -\frac{\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right) &= \exp\left\{ n\left( n^{-\frac{1}{2s}} - \Theta(1) \right) \right\} \\
&= \exp\left\{ n\left( n^{-\frac{1}{2 o(\sqrt{\log(n)})}} - \Theta(1) \right) \right\} \\
&= \exp\left\{ n\left( o\left( n^{-\frac{1}{2\sqrt{\log(n)}}} \right) - \Theta(1) \right) \right\} \\
&= \exp\left\{ n\left( o(1) - \Theta(1) \right) \right\} \\
&= \exp\{ -\Theta(1) n \}.
\end{aligned}$$

Proof of Theorem 19. The proof is analogous to the proof of Theorem 18, with the above definition for the function 푞. Recall that in the proof of Theorem 18, we needed to upper bound the number of possible function values. Here, the number of possible function values is upper bounded by the number of 푠-sparse binary labelings, which

are those labelings corresponding to 푠-sparse monotone partitions. Let 퐿푠(푋1, . . . 푋푛) be the number of 푠-sparse binary labelings. By Theorem 21, it holds that

$$\mathbb{E}\left[ L_s(X_1, \dots, X_n) \right] \le \binom{d}{s} \exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right].$$

For any $t > 0$, the Markov inequality tells us that

$$\mathbb{P}\left( L_s(X_1, \dots, X_n) \ge t \right) \le \frac{\mathbb{E}\left[ L_s(X_1, \dots, X_n) \right]}{t} \le \frac{\binom{d}{s}\exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right]}{t}.$$

Setting $t = \binom{d}{s}\exp\left[ n^{\frac{2s-1}{2s}} \right]$,

$$\mathbb{P}\left( L_s(X_1, \dots, X_n) \ge \binom{d}{s}\exp\left[ n^{\frac{2s-1}{2s}} \right] \right) \le \frac{\exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right]}{\exp\left[ n^{\frac{2s-1}{2s}} \right]}.$$

Therefore, with probability at least $1 - \frac{\exp\left[ \left( 2^s + 2\log(2) - 1 \right) n^{\frac{s-1}{s}} \right]}{\exp\left[ n^{\frac{2s-1}{2s}} \right]}$, there are at most $\binom{d}{s}\exp\left[ n^{\frac{2s-1}{2s}} \right]$ $s$-sparse binary labelings, and therefore function values.

Proof of Corollary 6. We have assumed that 푠 is constant and the sequence of functions

{푓푑} extends a function of 푠 variables. For fixed (푆0, 푆1), the value of 푞(푆0, 푆1) does not change if we increase the overall dimension, because of the uniformity of 푋 and the independence of the coordinates of 푊 . Therefore, 푞 does not depend on 푑 when 푠

is fixed, and so $q_{\min}(\delta) - q(S_0^\star, S_1^\star) = \Theta(1)$. We now analyze the bound in Theorem 19. Since $s$ is constant, the first term goes to zero. Analyzing the second term,

$$\begin{aligned}
&\left( \binom{d}{s}\exp\left[ n^{\frac{2s-1}{2s}} \right] + 1 \right) \exp\left( -\frac{\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right) \\
&\le \left( \exp\left[ s\log(d) + n^{\frac{2s-1}{2s}} \right] + 1 \right) \exp\left( -\frac{\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)^2 n}{36} \right) \\
&= \exp\left[ s\log(d) + n^{\frac{2s-1}{2s}} - \Theta(1) n \right] + e^{-\Theta(1) n}.
\end{aligned}$$

If 푛 = 휔(log(푑)), the second term goes to zero.

The proof of Theorem 20 requires Lemmas 18 and 23.

Proof of Lemma 18. We need to show that

P (푌1 = 1, 푌2 = 0|푋1,푘 > 푋2,푘) > P (푌1 = 0, 푌2 = 1|푋1,푘 > 푋2,푘) .

The proof is similar to the proof of Lemma 17. Consider the following procedure. We

sample $X_1$ and $X_2$ independently and uniformly on $[0,1]^d$. Fix $k \in A$. Let

$$X_+ = \begin{cases} X_1 & \text{if } X_{1,k} > X_{2,k}, \\ X_2 & \text{otherwise}, \end{cases}
\qquad \text{and} \qquad
X_- = \begin{cases} X_1 & \text{if } X_{1,k} \le X_{2,k}, \\ X_2 & \text{otherwise.} \end{cases}$$

In other words, 푋+ is the right point according to coordinate 푘 and 푋− is the left

point. As in the proof of Lemma 17, we can equivalently define 푝푘 as

푝푘 = P (푓(푋+ + 푊1) > 푓(푋− + 푊2)) − P (푓(푋+ + 푊1) < 푓(푋− + 푊2)) .

Let 푘 = 1. By Assumption 1, the function 푓 is not constant with respect to the first

coordinate. Our goal is to show that

P (푓(푋+ + 푊1) > 푓(푋− + 푊2)) > P (푓(푋+ + 푊1) < 푓(푋− + 푊2)) .

We now construct a coupling (푋+, 푋−, 푊 1, 푊 2) ∼ (푋+, 푋−, 푊1, 푊2) (⋆). Let 푋+,1 =

푋+,1 and 푋−,1 = 푋−,1. Let 푋+,푖 = 푋−,푖 and 푋−,푖 = 푋+,푖 for 푖 ∈ {2, . . . , 푑}. Finally,

let 푊 1 = 푊2 and 푊 2 = 푊1. By monotonicity, 푓(푋+ + 푊 1) ≥ 푓(푋− + 푊2). Similarly,

푓(푋− + 푊 2) ≤ 푓(푋+ + 푊1). Therefore, the event {푓(푋+ + 푊1) < 푓(푋− + 푊2)}

implies the event {푓(푋+ + 푊 1) > 푓(푋− + 푊 2)}. Furthermore, (⋆) holds. This shows

P (푓(푋+ + 푊1) > 푓(푋− + 푊2)) ≥ P (푓(푋+ + 푊1) < 푓(푋− + 푊2)) .

To show a strict inequality, we need to show that the following event happens with positive probability.

{푓(푋+ + 푊 1) > 푓(푋− + 푊 2)} ∩ {푓(푋+ + 푊1) ≥ 푓(푋− + 푊2)}

= {푓(푋+ + 푊2) > 푓(푋− + 푊1)} ∩ {푓(푋+ + 푊1) ≥ 푓(푋− + 푊2)}

Observe that there exists 휖 such that 푓(푋+ +푊2) ≥ 푓(푋− +푊2)+휖 and 푓(푋− +푊1) ≤

푓(푋+ +푊1)−휖 with positive probability. Otherwise, 푓 would be constant with respect to the first coordinate. This completes the proof.

The following proposition is the analogue of Proposition 11.

Proposition 12. Consider stage 푡 in Algorithm 3’. Suppose that the first 푡 − 1 coordinates recovered by the algorithm are correct, i.e. 푘푖 ∈ {1, . . . , 푠} for all 푖 ∈

{1, . . . , 푡 − 1}. Let 푅 = {1, . . . , 푠} ∖ {푘1, . . . , 푘푡−1}. Let (푋1, 푌1) and (푋2, 푌2) be independent samples from the model. There exists 푟 ∈ 푅 so that for all 푘 ∈ 푅,

P (푌1 > 푌2|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0) ≥ P (푌2 > 푌1|푞(1, 2, 푟) = 1, 푞(1, 2, 푘) = 0) .

Proof. The proof is identical to the proof of Proposition 11.

The following lemma is the analogue of Lemma 22.

Lemma 23. Consider stage $t$ in Algorithm 3', which uses $m = \frac{n}{s}$ samples. The probability that the first coordinate is correctly recovered is at least

$$1 - (d-1)\exp\left( -\frac{\bar{p}_1^2 m}{64 s^2} \right).$$

Suppose that the first $t-1$ coordinates recovered by the algorithm are correct, i.e. $k_i \in \{1, \dots, s\}$ for all $i \in \{1, \dots, t-1\}$. Then $k_t \in \{1, \dots, s\}$ with probability at least

$$1 - (d-t)\exp\left( -\frac{\bar{p}_1^2 m}{64 (s-t+1)^2} \right).$$

Proof. The proof is nearly identical to the proof of Lemma 22, with $p$ replaced by $\bar{p}$. Lemma 18 guarantees that $\bar{p}_k > 0$ for all $k \in A$, and Proposition 12 establishes the existence of a special coordinate $r$. The bounded differences analysis for the application of the McDiarmid inequality again shows that each $X_i$ or $W_i$ can change the summation $\sum_{i=1}^m \sum_{j=1}^m \mathbb{1}\{Y_i > Y_j\}\left( q(i,j,r) - q(i,j,k) \right)$ by up to $2(m-1)$.

Proof of Theorem 20. The proof is nearly identical to the proof of Theorem 17, and relies on Lemma 23.

Proof of Corollary 7. The proof is identical to the proof of Corollary 3, with $p$ replaced by $\bar{p}$.

Proof of Corollary 8. Support recovery fails with probability at most

$$ds\exp\left( -\frac{\bar{p}_1^2 n}{64 s^3} \right).$$

If it succeeds, the probability of the $L_2$ norm error exceeding $\delta$ is upper bounded by the value in Theorem 18. Then $\mathbb{P}\left( \|\hat f_n - f\|_2^2 > \epsilon \right)$ is at most

$$ds\exp\left( -\frac{\bar{p}_1^2 n}{64 s^3} \right) + \frac{\exp\left[ \left( 2^s + 2\log(2) - 1 \right) m^{\frac{s-1}{s}} \right]}{\exp\left[ m^{\frac{s-1}{s} + \epsilon} \right]} + \left( \exp\left[ m^{\frac{s-1}{s} + \epsilon} \right] + 1 \right) \exp\left( -\frac{\left( q_{\min}(\delta) - q(S_0^\star, S_1^\star) \right)^2 m}{36} \right).$$

Therefore, if $n = \omega(s^3 \log(d))$ and $n = s e^{\omega(s^2)}$, the estimator is consistent under the assumptions of Corollary 8.

Chapter 6

Future Directions

We provide possible future directions inspired by the work of this thesis.

1. Attracting Random Walks. It was conjectured that the Repelling Random Walks model mixes in polynomial time for all 훽 < 0. We have provided partial results in this direction. Another possible research direction (discussed with John Sylvester and Luca Zanetti) concerns the hitting times of the ARW model on the line graph. For example, suppose that at time 푡 = 0 all particles are placed at the origin of Z. Let 푚 ∈ Z. What is the expected time for a particle to reach the set {−푚, 푚}?

2. Exponential Convergence Rates for Stochastically Ordered Markov Processes Under Perturbation. The chapter provides upper bounds on convergence. It would be valuable, though significantly more difficult, to derive accompanying lower bounds.

3. An Improved Lower Bound on the Traveling Salesman Constant. This chapter analyzes the lower bound on the TSP constant. Very little progress has been made on the upper bound since the pioneering work of [2]. This is a natural future direction. Another possible direction is to establish computational complexity results around the approximability of the TSP constant.

4. Sparse High-Dimensional Isotonic Regression. Let $\beta$ be a $d \times k$ matrix and let $f$ be a function from $\mathbb{R}^k$ to $\mathbb{R}$. The model $Y = f(\beta^T X) + Z$, where $Z$ is zero-mean noise, is called a multi-index model. Both $\beta$ and $f$ are unknown. Multi-index models are a generalization of linear models. We may consider $f$ to lie in a function class, such as the set of coordinate-wise monotone functions. Monotone multi-index models are a generalization of isotonic regression. Future work will apply the estimation techniques for isotonic regression to estimation of monotone multi-index models; a sketch of such a data-generating process is given below.
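The following sketch generates data from a monotone multi-index model of the kind just described. The particular $\beta$, link function, and noise distribution are illustrative assumptions only.

\begin{verbatim}
import numpy as np

def sample_monotone_multi_index(n, d=10, k=2, noise_sd=0.1, seed=0):
    """Draw (X, Y) with Y = f(beta^T X) + Z for a coordinate-wise monotone f."""
    rng = np.random.default_rng(seed)
    beta = np.zeros((d, k))
    beta[0, 0], beta[1, 0], beta[2, 1] = 1.0, 0.5, 1.0     # unknown index directions
    f = lambda T: np.maximum(T[:, 0], 0.0) + T[:, 1] ** 3  # monotone link function
    X = rng.uniform(size=(n, d))
    Z = rng.normal(0.0, noise_sd, size=n)
    Y = f(X @ beta) + Z            # rows of X @ beta are beta^T x_i
    return X, Y, beta

X, Y, beta = sample_monotone_multi_index(500)
\end{verbatim}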

Bibliography

[1] P. H. Baxendale. Renewal theory and computable convergence rates for geometrically ergodic Markov chains. The Annals of Applied Probability, 15(1B):700–738, 2005.

[2] Jillian Beardwood, J. H. Halton, and J. M. Hammersley. The shortest path through many points. Mathematical Proceedings of the Cambridge Philosophical Society, 55(4):299–327, 1959.

[3] D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, Dynamic Ideas, 1997.

[4] Michael J. Best and Nilotpal Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47:425–439, 1990.

[5] R. Bubley and M. E. Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. 38th Annual Symposium on Foundations of Computer Science, 1997.

[6] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

[7] Joshua D Cohen, Lu Li, Yuxuan Wang, Christopher Thoburn, Bahman Afsari, Ludmila Danilova, Christopher Douville, Ammar A Javed, Fay Wong, Austin Mattox, et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science, 359(6378):926–930, 2018.

[8] Corinna Cortes and Vladimir Vapnik. Support-Vector networks. Machine Learn- ing, 20:273–297, 1995.

[9] P. Cuff, J. Ding, O. Louidor, E. Lubetzky, Y. Peres, and A. Sly. Glauber dynamics for the mean-field Potts model. Journal of Statistical Physics, 149:432–477, 2012.

[10] Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool- Adjacent-Violoators Algorithm (PAVA) and active set methods. Journal of Statistical Software, 32(5):1–24, 2009.

[11] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[12] E. A. Van Doorn. Conditions for exponential ergodicity and bounds for the decay parameter of a birth-death process. Advances in Applied Probability, 17(3), 1985.

[13] R. Douc, E. Moulines, and J. S. Rosenthal. Quantitative bounds for geometric convergence rates of Markov chains. Annals of Applied Probability, 14(4):1643– 1665, 2004.

[14] Richard L. Dykstra and Tim Robertson. An algorithm for isotonic regression for two or more independent variables. The Annals of Statistics, 10(3):708–716, 1982.

[15] E. Fix and J.L. Hodges. Discriminatory analysis. nonparametric discrimination; consistency properties. Technical Report Report Number 4, Project Number 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas., 1951.

[16] Simon A. Forbes, Nidhi Bindal, Sally Bamford, Charlotte Cole, Chai Yin Kok, David Beare, Mingming Jia, Rebecca Shepherd, Kenric Leung, Andrew Menzies, Jon W. Teague, Peter J. Campbell, Michael R. Stratton, and P. Andrew Futreal. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Research, 39(1):D945–D950, 2011.

[17] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Birkhäuser Springer, 2013.

[18] D. Gamarnik and J. Gaudio. Sparse high-dimensional isotonic regression. Thirty- third Conference on Neural Information Processing Systems, 2019.

[19] David Gamarnik. Efficient learning of monotone concepts via quadratic optimiza- tion. In COLT, 1999.

[20] Dimitris Bertsimas, David Gamarnik, and John N. Tsitsiklis. Estimation of time-varying parameters in statistical models: an optimization approach. Machine Learning, 35(3):225–245, 1999.

[21] Julia Gaudio. Attracting random walks. arXiv:1903.00427, 2019.

[22] Julia Gaudio, Saurabh Amin, and Patrick Jaillet. Exponential convergence rates for stochastically ordered Markov processes under perturbation. Systems & Control Letters, 133, 2019.

[23] Julia Gaudio and Patrick Jaillet. An improved lower bound for the traveling salesman constant. arXiv:1907.02390, 2019.

[24] Michael X. Goemans and Dimitris J. Bertsimas. Probabilistic analysis of the Held and Karp lower bound for the Euclidean Traveling Salesman Problem. Mathematics of Operations Research, 16(1):72–89, 1991.

[25] Donald Gross. Fundamentals of queueing theory. John Wiley & Sons, 2008.

[26] Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, and Richard J. Samworth. Isotonic regression in general dimensions. arXiv 1708.0946v1, 2017.

[27] David Haussler. Overview of the Probably Approximately Correct (PAC) learning framework. https://hausslergenomics.ucsc.edu/wp-content/uploads/2017/08/smo_0.pdf, 1995.

[28] Thomas P. Hayes and Eric Vigoda. Variable length path coupling. Random Structures and Algorithms, 31(3):251–272, 2007.

[29] Michael Held and Richard M. Karp. The Traveling-Salesman Problem and minimum spanning trees. Operations Research, 18(6):1138–1162, 1970.

[30] Z. Hou, Y. Liu, and H. Zhang. Subgeometric rates of convergence for a class of continuous-time Markov process. Journal of Applied Probability, 42(3):698–712, 2005.

[31] D. S. Johnson, L. A. McGeoch, and E. E. Rothberg. Asymptotic experimental analysis for the Held-Karp Traveling Salesman bound. Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 341–350, 1996.

[32] T. Kamae, U. Krengel, and G. L. O’Brien. Stochastic inequalities on partially ordered spaces. The Annals of Probability, 5(899-912), 1977.

[33] P. C. Kiessler and R. Lund. Traffic intensity estimation. Naval Research Logistics, 56(4):385–387, 2008.

[34] J. B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psy- chometrika, 29(2):115–129, 1964.

[35] David A. Levin and Yuval Peres. Markov Chains and Mixing Times. American Mathematical Society, 2 edition, 2017.

[36] Y. Liu, H. Zhang, and Y. Zhao. Computable strongly ergodic rates of convergence for continuous-time Markov chains. ANZIAM Journal, 49(4):463–478, 2008.

[37] Y. Liu, H. Zhang, and Y. Zhao. Subgeometric ergodicity for continuous-time Markov chains. Journal of Mathematical Analysis and Applications, 368:178–189, 2010.

[38] R. B. Lund, S. P. Meyn, and R. L. Tweedie. Computable exponential conver- gence rates for stochastically ordered Markov processes. The Annals of Applied Probability, 61(1):218–237, 1996.

[39] Ronny Luss, Saharon Rosset, and Moni Shahar. Efficient regularised isotonic regression with application to gene-gene interaction search. The Annals of Applied Statistics, 6(1):253–283, 2012.

[40] Guy Moshkovitz and Asaf Shapira. Ramsey theory, integer partitions and a new proof of the Erdos-Szekeres theorem. Advances in Mathematics, 262:1107–1129, 2014.

[41] A. Novak and R. Watson. Determining an adequate probe separation for estimat- ing the arrival rate in an M/D/1 queue using single-packet probing. Queueing Systems, 61:255–272, 2009.

[42] Karl Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

[43] N.U. Prabhu. Stochastic Storage Processes: Queues, Insurance Risk, Dams, and Data Communication. Springer-Verlag, 2 edition, 1998.

[44] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical inference under order restrictions. John Wiley & Sons, 1973.

[45] G. O. Roberts and J. S. Rosenthal. Hitting time and convergence rate bound for symmetric Langevin diffusions. Methodology and Computing in Applied Probability, 2017.

[46] G. O. Roberts and R. L. Tweedie. Rates of convergence of stochastically monotone and continuous time Markov models. Journal of Applied Probability, 37:359–373, 2000.

[47] T. Robertson, F. T. Wright, and R. L. Dykstra. Order restricted statistical inference. John Wiley & Sons, 1988.

[48] J. S. Rosenthal. Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7:123–128, 2002.

[49] A. Sarantsev. Explicit rates of exponential convergence for reflected jump- diffusions on the half-line. Latin American Journal of Probability and Mathematical Statistics, 13:1069–1093, 2015.

[50] S. Sasabuchi, M. Inutsuka, and D. D. S. Kulatunga. A multivariate version of isotonic regression. Biometrika, 70(2):465–472, 1983.

[51] Syoichi Sasabuchi, Makoto Inutsuka, and D. D. Sarath Kulatunga. An algorithm for computing multivariate isotonic regression. Hiroshima Mathematical Journal, 22(551-560), 1992.

[52] Michael J. Schell and Bahadur Singh. The reduced monotonic regression method. Journal of the American Statistical Association, 92(437):128–135, 1997.

[53] Eugene Seneta. Markov and the creation of Markov chains. Markov Anniversary Meeting, pages 1–20, 2006.

[54] Michael Sipser. Introduction to the Theory of Computation. Cengage Learning, 2012.

[55] J. Michael Steele. Subadditive Euclidean functionals and nonlinear growth in geometric probability. The Annals of Probability, 9(3):365–376, 1981.

[56] Stefan Steinerberger. New bounds for the Traveling Salesman constant. Advances in Applied Probability, 47:27–36, 2015.

[57] V. Vapnik. Nature of Learning Theory. Springer-Verlag, 1996.
