
Linear Regression With Shuffled Data: Statistical and Computational Limits of Permutation Recovery

Ashwin Pananjady, Martin J. Wainwright, and Thomas A. Courtade

Manuscript received October 15, 2016; revised October 12, 2017; accepted November 5, 2017. Date of publication November 21, 2017; date of current version April 19, 2018. This work was supported in part by NSF under Grant CCF-1528132 and Grant CCF-0939370 (Center for Science of Information), in part by the Office of Naval Research MURI under Grant DOD-002888, in part by the Air Force Office of Scientific Research under Grant AFOSR-FA9550-14-1-001, in part by the Office of Naval Research under Grant ONR-N00014, and in part by the National Science Foundation under Grant CIF-31712-23800. This work was presented in part at the 2016 Allerton Conference on Communication, Control and Computing [1] and is available on Arxiv [2]. A. Pananjady and T. A. Courtade are with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]). M. J. Wainwright is with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720 USA, and also with the Department of Statistics, University of California at Berkeley, Berkeley, CA 94720 USA. Communicated by G. Moustakides, Associate Editor for Sequential Methods. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIT.2017.2776217

Abstract—Consider a noisy linear observation model with an unknown permutation, based on observing y = Π*Ax* + w, where x* ∈ R^d is an unknown vector, Π* is an unknown n × n permutation matrix, and w ∈ R^n is additive Gaussian noise. We analyze the problem of permutation recovery in a random design setting in which the entries of the matrix A are drawn independently from a standard Gaussian distribution, and establish sharp conditions on the signal-to-noise ratio, sample size n, and dimension d under which Π* is exactly and approximately recoverable. On the computational front, we show that the maximum likelihood estimate of Π* is NP-hard to compute for general d, while also providing a polynomial time algorithm when d = 1.

Index Terms—Correspondence estimation, permutation recovery, unlabeled sensing, information-theoretic bounds, random projections.

I. INTRODUCTION

Recovery of a vector based on noisy linear measurements is the classical problem of linear regression, and is arguably the most basic form of statistical estimation. A variant, the "errors-in-variables" model [3], allows for errors in the measurement matrix; classical examples include additive or multiplicative noise [4]. In this paper, we study a form of errors-in-variables in which the measurement matrix is perturbed by an unknown permutation of its rows.

More concretely, we study an observation model of the form

y = Π*Ax* + w,    (1)

where x* ∈ R^d is an unknown vector, A ∈ R^{n×d} is a measurement (or design) matrix, Π* is an unknown n × n permutation matrix, and w ∈ R^n is observation noise. We refer to the setting where w = 0 as the noiseless case. As with linear regression, there are two settings of interest, corresponding to whether the design matrix is (i) deterministic (the fixed design case), or (ii) random (the random design case).

There are also two complementary problems of interest: recovery of the unknown Π*, and recovery of the unknown x*. In this paper, we focus on the former problem; the latter problem is also known as unlabelled sensing [5].

The observation model (1) is frequently encountered in scenarios where there is uncertainty in the order in which measurements are taken. An illustrative example is that of sampling in the presence of jitter [6], in which the uncertainty about the instants at which measurements are taken results in an unknown permutation of the measurements. A similar synchronization issue occurs in timing and molecular channels [7]. Here, identical molecular tokens are received at the receptor at different times, and their signatures are indistinguishable. The vectors of transmitted and received times correspond to the signal and the observations, respectively, where the latter is some permuted version of the former with additive noise. A recent paper [8] also points out an application of this observation model in flow cytometry.

Another such scenario arises in multi-target tracking problems [9]. For example, in the robotics problem of simultaneous localization and mapping [10], the environment in which measurements are made is unknown, and part of the problem is to estimate relative permutations between measurements. Archaeological measurements [11] also suffer from an inherent lack of ordering, which makes it difficult to estimate the chronology.
Another compelling example of such an observation model is in data anonymization, in which the order, or "labels", of measurements are intentionally deleted to preserve privacy. The inverse problem of data de-anonymization [12] is to estimate these labels from the observations.

Also, in large sensor networks, it is often the case that the number of bits of information that each sensor records and transmits to the server is exceeded by the number of bits it transmits in order to identify itself to the server [13]. In applications where sensor measurements are linear, model (1) corresponds to the case where each sensor only sends its measurement but not its identity. The server is then tasked with recovering sensor identities, or equivalently, with determining the unknown permutation.
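The observation model (1) is straightforward to simulate. The following sketch is a minimal illustration only; the function name and the indexing convention used for the permutation are our own and not from the paper.

```python
import numpy as np

def sample_shuffled_regression(n, d, sigma, rng):
    """Draw one instance of model (1): y = Pi* A x* + w (up to the indexing convention below)."""
    A = rng.standard_normal((n, d))      # random design with i.i.d. N(0, 1) entries
    x_star = rng.standard_normal(d)      # fixed but unknown regression vector
    pi_star = rng.permutation(n)         # unknown permutation: y_i = (A x*)_{pi_star[i]} + w_i
    w = sigma * rng.standard_normal(n)   # additive Gaussian noise
    y = (A @ x_star)[pi_star] + w
    return y, A, x_star, pi_star

rng = np.random.default_rng(0)
y, A, x_star, pi_star = sample_shuffled_regression(n=20, d=3, sigma=0.1, rng=rng)
```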

Fig. 1. Example of pose and correspondence estimation. The camera introduces an unknown linear transformation corresponding to the pose. The unknown permutation represents the correspondence between points, which is shown in the picture via coloured shapes, and needs to be estimated.

The pose and correspondence estimation problem in image processing [14], [15] is also related to the observation model (1). The capture of a 3D object by a 2D image can be modelled by an unknown linear transformation called the "pose", and an unknown permutation representing the "correspondence" between points in the two spaces. One of the central goals in image processing is to identify this correspondence information, which in this case is equivalent to permutation estimation in the linear model. An illustration of the problem is provided in Figure 1. Image stitching from multiple camera angles [16] also involves the resolution of unknown correspondence information between point clouds of images.

The discrete analog of the model (1), in which the vectors x* and y and the matrix A are all constrained to belong to some finite alphabet/field, corresponds to the permutation channel studied by Schulman and Zuckerman [17], with A representing the (linear) encoding matrix. However, techniques for the discrete problem do not carry over to the continuous problem (1).

Another line of work that is related in spirit to the observation model (1) is the genome assembly problem from shotgun reads [18], in which an underlying vector x* ∈ {A, T, G, C}^d must be assembled from an unknown permutation of its contiguous sub-vector measurements, called "reads". Two aspects, however, render it a particularization of our observation model, besides the obvious fact that x* in the genome assembly problem is constrained to a finite alphabet: (i) in genome assembly, the matrix A is fixed and consists of shifted identity matrices that select sub-vectors of x*, and (ii) the permutation matrix of genome assembly is in fact a block permutation matrix that permutes sub-vectors instead of coordinates as in equation (1).

It is worth noting that both the permutation recovery and vector recovery problems have an operational interpretation in applications. Permutation recovery is equivalent to "correspondence estimation" in vision tasks [14], [15], and vector recovery is equivalent to "pose estimation". In sensor network examples, permutation recovery corresponds to sensor identification [13], while vector recovery corresponds to signal estimation. Clearly, accurate permutation estimation allows for recovery of the regression vector, while the reverse may not be true. From a theoretical standpoint, such a distinction is similar to the difference between the problems of subset selection [19] and sparse vector recovery [20] in high dimensional linear regression, where studying the model selection and parameter estimation problems together helped advance our understanding of the problem in its entirety.

A. Related Work

Previous work related to the observation model (1) can be broadly classified into two categories – those that focus on x* recovery, and those focused on recovering the underlying permutation. We discuss the most relevant results below.

1) Latent Vector Estimation: The observation model (1) appears in the context of compressed sensing with an unknown sensor permutation [21]. The authors consider the matrix-based observation model Y = Π*AX* + W, where X* is a matrix whose columns are composed of multiple, unknown, sparse vectors. Their contributions include a branch and bound algorithm to recover the underlying X*, which they show to perform well empirically for small instances under the setting in which the entries of the matrix A are drawn i.i.d. from a Gaussian distribution.

In the context of pose and correspondence estimation, the paper [15] considers the noiseless observation model (1), and shows that if the permutation matrix maps a sufficiently large number of positions to themselves, then x* can be recovered reliably.

In the context of molecular channels, the model (1) has been analyzed for the case when x* is some random vector, A = I, and w represents non-negative noise that models delays introduced between emitter and receptor. Rose et al. [7] provide lower bounds on the capacity of such channels. In particular, their results yield closed-form lower bounds for some special noise distributions, e.g., exponentially random noise.

A more recent paper [5] that is most closely related to our model considers the question of when the equation (1) has a unique solution x*, i.e., the identifiability of the noiseless model. The authors show that if the entries of A are sampled i.i.d. from any continuous distribution with n ≥ 2d, then equation (1) has a unique solution x* with probability 1.
They also provide a converse showing that if n < 2d, any matrix A whose entries are sampled i.i.d. from a continuous distribution does not (with probability 1) have a unique solution x* to equation (1). While the paper shows uniqueness, the question of designing an efficient algorithm to recover a solution, unique or not, is left open. The paper also analyzes the stability of the noiseless solution, and establishes that x* can be recovered exactly when the SNR goes to infinity.

Since a version of this paper was made available online [2], a few recent papers have also considered variants of the observation model (1). Elhami et al. [22] show that there is a careful choice of the measurement matrix A such that it is possible to recover the vector x* in time O(dn^{d+1}) in the noiseless case. Hsu et al. [23] show that the vector x* can be recovered efficiently in the noiseless setting (1) when the design matrix A is i.i.d. Gaussian. They also demonstrate that in the noisy setting, it is not possible to recover the vector x* reliably unless the signal-to-noise ratio is sufficiently high. See Section 4 of their paper [23] for a detailed comparison of their results with our own. Our own follow-up work [24] establishes the minimax rate of prediction for the more general multivariate setting, and proposes an efficient algorithm for that setting with guaranteed recovery provided some technical conditions are satisfied. Haghighatshoar and Caire [25] consider a variant of the observation model (1) in which the permutation matrix is replaced by a row selection matrix, and provide an alternating minimization algorithm with theoretical guarantees.
We also briefly compare the model (1) with the problem of vector recovery in unions of subspaces, studied widely in the compressive sensing literature [26], [27]. In the compressive sensing setup, the vector x* lies in the union of finitely many subspaces, and must be recovered from linear measurements with a known measurement matrix, without a permutation. In our model, on the other hand, the vector x* is unrestricted, and the observation y lies in the union of n! subspaces – one for each permutation. While the two models share a superficial connection, results do not carry over from one to the other in any obvious way. In fact, our model is fundamentally different from traditional compressive sensing, since the unknown permutation acts on the row space of the design matrix A. In contrast, restricting x* to a union of subspaces (or more specifically, restricting its sparsity) influences the column space of A.

2) Latent Permutation Estimation: While our paper seems to be the first to consider permutation recovery in the linear regression model (1), there are many related problems for which permutation recovery has been studied. We mention only those that are most closely related to our work.

The problem of feature matching in machine learning [28] bears a superficial resemblance to our observation model. There, observations take the form Y = X* + W and Y′ = Π*X* + W′, with all of (X*, Y, Y′, W, W′) representing matrices of appropriate dimensions, and the goal is to recover Π* from the tuple (Y, Y′). The paper [28] establishes minimax rates on the separation between the rows of X* (as a function of problem parameters n, d, σ) that allow for exact permutation recovery.

The problem of statistical seriation [29] involves an observation model of the form Y = Π*X* + W, with the matrix X* obeying some shape constraint. In particular, if the columns of X* are unimodal (or, as a special case, monotone), then Flammarion et al. [29] establish minimax rates for the problem in the prediction error metric ∥Π̂X̂ − Π*X*∥_F² by analyzing the least squares estimator. The seriation problem without noise was also considered by Fogel et al. [30] in the context of designing convex relaxations to permutation problems.

Estimating network structure from co-occurrence data [31], [32] also involves the estimation of multiple underlying permutations from unordered observations. In particular, Rabbat et al. [31] consider the problem of estimating the structure of an unknown directed graph from multiple, unordered paths formed by simple random walks on the graph. They propose an EM-type algorithm with importance sampling to tackle the problem. The fundamental limits of such a problem were later considered by Gripon and Rabbat [32].

Permutation estimation has also been considered in other observation models involving matrices with structure, particularly in the context of ranking [33]–[35], or even more generally, in the context of identity management [36]. While we mention both of these problems because they are related in spirit to permutation recovery, the problem setups do not bear too much resemblance to our linear model (1).

Algorithmic approaches to solving for Π* in equation (1) are related to the multi-dimensional assignment problem. In particular, while finding the correct permutation mapping between two vectors minimizing some loss function between them corresponds to the 1-dimensional assignment problem, here we are faced with an assignment problem between subspaces. While we do not elaborate on the vast literature that exists on solving variants of assignment problems, we note that broadly speaking, assignment problems in higher dimensions are much harder than the 1-D assignment problem. A survey on the quadratic assignment problem [37] and references therein provide examples and methods that are currently used to solve these problems.
B. Contributions

Our primary contribution addresses permutation recovery in the noisy version of observation model (1), with a random design matrix A. In particular, when the entries of A are drawn i.i.d. from a standard Gaussian distribution, we show sharp conditions on the SNR under which exact permutation recovery is possible. We also derive necessary conditions for approximate permutation recovery to within a prescribed Hamming distortion.

We also briefly address the computational aspect of the permutation recovery problem. We show that the information-theoretically optimal estimator we propose for exact permutation recovery is NP-hard to compute in the worst case. For the special case of d = 1, however, we show that it can be computed in polynomial time. Our results are corroborated by numerical simulations.

C. Organization

The remainder of this paper is organized as follows. In the next section, we set up notation and formally state the problem. In Section III, we state our main results and discuss some of their implications. We provide proofs of the main results in Section IV, deferring the more technical lemmas to the appendices.

II. BACKGROUND AND PROBLEM SETTING

In this section, we set up notation, state the formal problem, and provide concrete examples of the noiseless version of our observation model by considering some fixed design matrices.

A. Notation

Since most of our analysis involves metrics involving permutations, we introduce all the relevant notation in this section.

Permutations are denoted by π and permutation matrices by Π. We use π(i) to denote the image of an element i under the permutation π. With a minor abuse of notation, we let P_n denote both the set of permutations on n objects as well as the corresponding set of permutation matrices. We sometimes use the compact notation y_π (or y_Π) to denote the vector y with entries permuted according to the permutation π (or Π).

We let d_H(π, π′) denote the Hamming distance between two permutations. More formally, we have d_H(π, π′) := #{i | π(i) ≠ π′(i)}. Additionally, we let d_H(Π, Π′) denote the Hamming distance between two permutation matrices, which is to be interpreted as the Hamming distance between the corresponding permutations.

The notation v_i denotes the ith entry of a vector v. We denote the ith standard basis vector in R^d by e_i. We use the notation a_i^⊤ to refer to the ith row of A. We also use the standard shorthand notation [n] := {1, 2, ..., n}.

We also make use of standard asymptotic notation. Specifically, for two real sequences f_n and g_n, f_n = O(g_n) means that f_n ≤ C g_n for a universal constant C > 0, and f_n = Ω(g_n) denotes that the relation g_n = O(f_n) holds. Lastly, all logarithms denoted by log are to the base e, and we use c1, c2, etc. to denote absolute constants that are independent of other problem parameters.

B. Formal Problem Setting and Permutation Recovery

As mentioned in the introduction, we focus exclusively on the noisy observation model in the random design setting. In other words, we obtain an n-vector of observations y from the model (1) with n ≥ d to ensure identifiability, and with the following assumptions:

Signal Model: The vector x* ∈ R^d is fixed, but unknown. We note that this is different from the adversarial signal model of Unnikrishnan et al. [5], and we provide clarifying examples in Section II-C.
Measurement Matrix: The measurement matrix A ∈ R^{n×d} is a random matrix of i.i.d. standard Gaussian variables chosen without knowledge of x*. Our assumption of i.i.d. standard Gaussian designs easily extends to accommodate the more general case when rows of A are drawn i.i.d. from the distribution N(0, Σ). In particular, writing A = W√Σ, where W is an n × d standard Gaussian matrix and √Σ denotes the symmetric square root of the (non-singular) covariance matrix Σ, our observation model takes the form

y = Π*W√Σx* + w,

and the unknown vector is now √Σx* in the model (1).

Noise Variables: The vector w ∼ N(0, σ²I_n) represents uncorrelated noise variables, each of (possibly unknown) variance σ². As will be made clear in the analysis, our assumption that the noise is Gaussian also readily extends to accommodate i.i.d. σ-sub-Gaussian noise. Additionally, the permutation noise represented by the unknown permutation matrix Π* is arbitrary.

The main recovery criterion addressed in this paper is that of exact permutation recovery, which is formally described below. We also address the problem of approximate permutation recovery.

Exact Permutation Recovery: The problem of exact permutation recovery is to recover Π*, and the risk of an estimator is evaluated on the 0-1 loss. More formally, given an estimator of Π* denoted by Π̂(y, A) : R^n × R^{n×d} → P_n, we evaluate its risk by

Pr{Π̂ ≠ Π*} = E[1{Π̂ ≠ Π*}],    (2)

where the probability in the LHS is taken over the randomness in y induced by both A and w.

Approximate Permutation Recovery: It is reasonable to think that recovering Π* up to some distortion is sufficient for many applications. Such a relaxation of exact permutation recovery allows the estimator to output a Π̂ such that d_H(Π̂, Π*) ≤ D, for some distortion D to be specified. The risk of such an estimator is again evaluated on the 0-1 loss of this error metric, given by Pr{d_H(Π̂, Π*) ≥ D}, with the probability again taken over both A and w. While our results are derived mainly in the context of exact permutation recovery, they can be suitably modified to also yield results for approximate permutation recovery.

We now provide some examples in which the noiseless version of the observation model (1) is identifiable.

C. Illustrative Examples of the Noiseless Model

In this section, we present two examples to illustrate the problem of permutation recovery and highlight the difference between our signal model and that of Unnikrishnan et al. [5].

Example 1: Consider the noiseless case of the observation model (1). Let ν_i, ν′_i (i = 1, 2, ..., d) represent i.i.d. continuous random variables, and form the design matrix A by choosing

a_{2i−1}^⊤ := ν_i e_i^⊤ and a_{2i}^⊤ := ν′_i e_i^⊤,  i = 1, 2, ..., d.

Note that n = 2d. Now consider our fixed but unknown signal model for x*. Since the permutation is arbitrary, our observations can be thought of as the unordered set {ν_i x*_i, ν′_i x*_i | i ∈ [d]}. With probability 1, the ratios r_i := ν_i/ν′_i are distinct for each i, and also ν_i x*_i ≠ ν_j x*_j with probability 1, by assumption of a fixed x*. Therefore, there is a one-to-one correspondence between the ratios r_i and x*_i. All ratios are computable in time O(n²), and x* can be exactly recovered. Using this information, we can also exactly recover Π*.

Example 2: A particular case of this example was already observed by Unnikrishnan et al. [5], but we include it to illustrate the difference between our signal model and the adversarial signal model. Form the fixed design matrix A by including 2^{i−1} copies of the vector e_i among its rows.¹ We therefore have n = Σ_{i=1}^{d} 2^{i−1} = 2^d − 1. Our observations therefore consist of 2^{i−1} repetitions of x*_i for each i ∈ [d]. The value of x*_i can therefore be recovered by simply counting the number of times it is repeated, with our choice of the number of repetitions also accounting for cases when x*_i = x*_j for some i ≠ j. Notice that we can now recover any vector x*, even those chosen adversarially with knowledge of the A matrix. Therefore, such a design matrix allows for an adversarial signal model, in the flavor of compressive sensing [20].

¹ Unnikrishnan et al. [5] proposed that e_i be repeated i times, but it is easy to see that this does not ensure recovery of an adversarially chosen x*.
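To make Example 1 concrete, the following sketch implements the ratio-based recovery in the noiseless case. It is our own illustration under the stated assumptions, and it ignores degenerate events (such as zero entries), which occur with probability zero for continuous random draws.

```python
import numpy as np

def recover_example1(y, nu, nu_prime, tol=1e-9):
    """Noiseless recovery for Example 1, where rows 2i-1 and 2i of A are nu_i*e_i and nu'_i*e_i.
    For each coordinate i, search for a pair of observations whose ratio equals r_i = nu_i/nu'_i;
    with probability 1 that pair is (nu_i*x_i, nu'_i*x_i), which reveals x_i."""
    d = len(nu)
    x_hat = np.zeros(d)
    for i in range(d):
        target = nu[i] / nu_prime[i]
        found = False
        for j in range(len(y)):
            for k in range(len(y)):
                if j != k and abs(y[j] / y[k] - target) < tol:
                    x_hat[i] = y[j] / nu[i]
                    found = True
                    break
            if found:
                break
    return x_hat

rng = np.random.default_rng(1)
d = 4
nu, nu_prime, x_star = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
observations = rng.permutation(np.concatenate([nu * x_star, nu_prime * x_star]))
print(np.allclose(recover_example1(observations, nu, nu_prime), x_star))  # True with probability 1
```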

Having provided examples of the noiseless observation model, we now return to the noisy setting of Section II-B, and state our main results.

III. MAIN RESULTS

In this section, we state our main theorems and discuss some of their consequences. Proofs of the theorems can be found in Section IV.

A. Statistical Limits of Exact Permutation Recovery

Our main theorems in this section provide necessary and sufficient conditions under which the probability of error in exactly recovering the true permutation goes to zero. In brief, provided that d is sufficiently small, we establish a threshold phenomenon that characterizes how the signal-to-noise ratio snr := ∥x*∥₂²/σ² must scale relative to n in order to ensure identifiability. More specifically, defining the ratio

Δ(n, snr) := log(1 + snr) / log n,

we show that the maximum likelihood estimator recovers the true permutation with high probability provided Δ(n, snr) ≫ c, where c denotes an absolute constant. Conversely, if Δ(n, snr) ≪ c, then exact permutation recovery is impossible. For illustration, we have plotted the behaviour of the maximum likelihood estimator for the case when d = 1 in Figure 2. Evidently, there is a sharp phase transition between error and exact recovery as the ratio Δ(n, snr) varies from 3 to 5.

Fig. 2. Empirical frequency of the event {Π̂_ML = Π*} over 1000 independent trials with d = 1, plotted against Δ(n, snr) for different values of n. The probability of successful permutation recovery undergoes a phase transition as Δ(n, snr) varies from 3 to 5. This is consistent with the prediction of Theorems 1 and 2.

Let us now turn to more precise statements of our results. We first define the maximum likelihood estimator (MLE) as

(Π̂_ML, x̂_ML) = arg min_{Π ∈ P_n, x ∈ R^d} ∥y − ΠAx∥₂².    (3)

The following theorem provides an upper bound on the probability of error of Π̂_ML, with (c1, c2) denoting absolute constants.

Theorem 1: For any d < n and ε < √n, if

log(∥x*∥₂²/σ²) ≥ c1 (n/(n − d) + ε) log n,    (4)

then Pr{Π̂_ML ≠ Π*} ≤ c2 n^{−2ε}.

Theorem 1 provides conditions on the signal-to-noise ratio snr = ∥x*∥₂²/σ² that are sufficient for permutation recovery in the non-asymptotic, noisy regime. In contrast, the results of Unnikrishnan et al. [5] are stated in the limit snr → ∞, without an explicit characterization of the scaling behavior.

We also note that Theorem 1 holds for all values of d < n, whereas the results of Unnikrishnan et al. [5] require n ≥ 2d for identifiability of x* in the noiseless case. Although the recovery of Π* and x* are not directly comparable, it is worth pointing out that the discrepancy also arises due to the difference between our fixed and unknown signal model, and the adversarial signal model assumed in the paper [5].

We now turn to the following converse result, which complements Theorem 1.

Theorem 2: For any δ ∈ (0, 2), if

log(1 + ∥x*∥₂²/(2σ²)) ≤ (2 − δ) log n,    (5)

then Pr{Π̂ ≠ Π*} ≥ 1 − c3 e^{−c4 nδ} for any estimator Π̂.

Theorem 2 serves as a "strong converse" for our problem, since it guarantees that if condition (5) is satisfied, then the probability of error of any estimator goes to 1 as n goes to infinity. Indeed, it is proved using the strong converse argument for the Gaussian channel [38], which yields a converse result for any fixed design matrix A (see (15)). It is also worth noting that the converse result of Theorem 2 holds uniformly over d.

Taken together, Theorems 1 and 2 provide a crisp characterization of the problem when d ≤ pn for some fixed p < 1. In particular, setting ε and δ in Theorems 1 and 2 to be small constants and letting n grow, we recover the threshold behavior of identifiability in terms of Δ(n, snr) that was discussed above and illustrated in Figure 2. In the next section, we find that a similar phenomenon occurs even with approximate permutation recovery.

When d can be arbitrarily close to n, the characterization obtained using these bounds is no longer sharp. In this regime, we conjecture that Theorem 1 provides the correct characterization of the limits of the problem, and that Theorem 2 can be sharpened.

B. Limits of Approximate Permutation Recovery

The techniques we used to prove results for exact permutation recovery can be suitably modified to obtain results for approximate permutation recovery to within a Hamming distortion D. In particular, we show the following converse result for approximate recovery.

Theorem 3: For any 2 < D ≤ n − 1, if

log(1 + ∥x*∥₂²/σ²) ≤ ((n − D + 1)/n) log((n − D + 1)/(2e)),    (6)

then Pr{d_H(Π̂, Π*) ≥ D} ≥ 1/2 for any estimator Π̂.

Note that for any D ≤ pn with p ∈ (0, 1), Theorems 1 and 3 provide a set of sufficient and necessary conditions for approximate permutation recovery that match up to constant factors. In particular, the necessary condition resembles that for exact permutation recovery, and the same SNR threshold behaviour is seen even here.

Remark 1: The converse results given by Theorem 2 and Theorem 3 hold even when the estimator has exact knowledge of x*.

C. Computational Aspects

In the previous sections, we considered the MLE given by equation (3) and analyzed its statistical properties. However, since equation (3) involves a combinatorial minimization over permutations, it is unclear if Π̂_ML can be computed efficiently. The following theorem addresses this question.

Theorem 4: For d = 1, the MLE Π̂_ML can be computed in time O(n log n) for any choice of the measurement matrix A. In contrast, if d = Ω(n), then Π̂_ML is NP-hard to compute.

The algorithm used to prove the first part of the theorem involves a simple sorting operation, which introduces the O(n log n) complexity. We emphasize that the algorithm assumes no prior knowledge about the distribution of the data; for every given A and y, it returns the optimal solution to problem (3).

The second part of the theorem asserts that the algorithmic simplicity enjoyed by the d = 1 case does not extend to general d. The proof proceeds by a reduction from the NP-complete partition problem. We stress here that the NP-hardness claim holds over worst case input instances. In particular, it does not preclude the possibility that there exists a polynomial time algorithm that solves problem (3) with high probability when A is chosen randomly as in our original setting. In fact, since this paper was posted online, Hsu et al. [23] have proposed an efficient algorithm for the noiseless version of the problem with a random design matrix A, by leveraging the connection to the partition (subset-sum) problem. It is also worth noting that the same paper [23] provides a (1 + ε)-approximation algorithm for the MLE objective (3) running in time O((n/ε)^{O(d)}).
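For d = 1, the sorting-based procedure behind the first part of Theorem 4 (Algorithm 1 in the proofs) is simple to implement. The sketch below is a minimal version under our own indexing convention: since the inner minimization over x leaves the objective of maximizing |⟨Πa, y⟩|, the rearrangement inequality says the optimum pairs the order statistics of a with those of y, or with those of −y to account for a negative scalar x*.

```python
import numpy as np

def mle_permutation_d1(y, a):
    """Sorting-based MLE for d = 1 (cf. Theorem 4 and Algorithm 1).
    Returns perm such that observation y_i is matched to design entry a[perm[i]],
    i.e. the fitted model reads y_i ~ a[perm[i]] * x."""
    order_y = np.argsort(y)
    order_a = np.argsort(a)
    perm_pos = np.empty(len(y), dtype=int)   # candidate for x* > 0: align the sort order of a with y
    perm_pos[order_y] = order_a
    perm_neg = np.empty(len(y), dtype=int)   # candidate for x* < 0: align the sort order of a with -y
    perm_neg[order_y[::-1]] = order_a
    # Keep whichever candidate gives the larger |<Pi a, y>|; by the rearrangement inequality
    # no other permutation can exceed both extremes in absolute value.
    if abs(a[perm_pos] @ y) >= abs(a[perm_neg] @ y):
        return perm_pos
    return perm_neg
```

On random instances with d = 1 this agrees with the brute-force estimator sketched earlier, while running in O(n log n) time.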
IV. PROOFS OF MAIN RESULTS

In this section, we prove our main results. Technical details are deferred to the appendices. Throughout the proofs, we assume that n is larger than some universal constant. The case where n is smaller can be handled by changing the constants in our proofs appropriately. We also use the notation c, c′ to denote absolute constants that can change from line to line. Technical lemmas used in our proofs are deferred to the appendix.

We begin with the proof of Theorem 1. At a high level, it involves bounding the probability that any fixed permutation is preferred to Π* by the estimator. The analysis requires precise control on the lower tails of χ²-random variables, and tight bounds on the norms of random projections, for which we use results derived in the context of dimensionality reduction by Dasgupta and Gupta [39].

In order to simplify the exposition, we first consider the case when d = 1 in Section IV-A, and later make the necessary modifications for the general case in Section IV-B. In order to understand the technical subtleties, we recommend that the reader fully understand the d = 1 case along with the technical lemmas before moving on to the proof of the general case.

A. Proof of Theorem 1: d = 1 Case

Recall the definition of the maximum likelihood estimator

(Π̂_ML, x̂_ML) = arg min_{Π ∈ P_n} min_{x ∈ R^d} ∥y − ΠAx∥₂².

For a fixed permutation matrix Π, assuming that A has full column rank,² the minimizing argument x is simply (ΠA)†y, where X† = (X^⊤X)^{−1}X^⊤ represents the pseudoinverse of a matrix X. By computing the minimum over x ∈ R^d in the above equation, we find that the maximum likelihood estimate of the permutation is given by

Π̂_ML = arg min_{Π ∈ P_n} ∥P_Π^⊥ y∥₂²,    (7)

where P_Π^⊥ = I − ΠA(A^⊤A)^{−1}(ΠA)^⊤ denotes the projection onto the orthogonal complement of the column space of ΠA.

For a fixed Π ∈ P_n, define the random variable

Δ(Π, Π*) := ∥P_Π^⊥ y∥₂² − ∥P_{Π*}^⊥ y∥₂².    (8)

For any permutation Π, the estimator (7) prefers the permutation Π to Π* if Δ(Π, Π*) ≤ 0. The overall error event occurs when Δ(Π, Π*) ≤ 0 for some Π, meaning that

{Π̂_ML ≠ Π*} = ⋃_{Π ∈ P_n \ {Π*}} {Δ(Π, Π*) ≤ 0}.    (9)

Equation (9) holds for any value of d. We shortly specialize to the d = 1 case. Our strategy for proving Theorem 1 boils down to bounding the probability of each error event in the RHS of equation (9) using the following key lemma, proved in Section V-A.1. Technically speaking, the proof of this lemma contains the meat of the proof of Theorem 1, and the interested reader is encouraged to understand these details before embarking on the proof of the general case. Recall the definition of d_H(Π, Π′), the Hamming distance between two permutation matrices.

Lemma 1: For d = 1 and any two permutation matrices Π and Π*, and provided ∥x*∥₂²/σ² > 1, we have

Pr{Δ(Π, Π*) ≤ 0} ≤ c′ exp( −c d_H(Π, Π*) log(∥x*∥₂²/σ²) ).

We are now ready to prove Theorem 1.

² An n × d i.i.d. Gaussian random matrix has full column rank with probability 1 as long as d ≤ n.
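The quantities in equations (7)–(8) are easy to compute numerically, which is useful for sanity-checking the error events in (9) on small simulated instances. The following is a minimal sketch under our own indexing convention (the candidate permutation re-orders the rows of A); the function names are not from the paper.

```python
import numpy as np

def residual_sq(y, A, perm):
    """||P_perp_Pi y||^2 from (7): squared distance from y to the column space of Pi A."""
    PA = A[perm, :]
    x_hat, *_ = np.linalg.lstsq(PA, y, rcond=None)
    return float(np.sum((y - PA @ x_hat) ** 2))

def delta_statistic(y, A, perm, perm_star):
    """Delta(Pi, Pi*) from (8); the estimator (7) prefers Pi to Pi* exactly when this is <= 0."""
    return residual_sq(y, A, perm) - residual_sq(y, A, perm_star)
```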

Proof of Theorem 1 for d 1: Fix ϵ>0andassumethat Now define b(k) Pr +(!,! ) 0 . ! dH(!,!∗) k ∗ the following consequence of= condition (4) holds: Applying Lemma 2 then:= yields: = { ≤ } $ 2 (n k) x∗ − !b(k) c log ∥ ∥2 (1 ϵ)log n, (10) n σ 2 ≥ + % & ! n c′ max exp n log , ≤ − 2 where c is the same as in Lemma 1. Now, observe that 1 ) 2, x∗ 2 2n Pr ! !∗ exp ck log ∥ ∥ log n . (13) { ML ̸= } − σ 2 − n d % % % & − && 2 Pr +(!,!∗) 0 ≤! { ≤ } We upper bound b(k) by splitting the analysis into two cases. ! n !∗ ∈.P \ a) Case 1: If the first term attains the maximum in the (i) 2 x∗ 2 RHS of inequality (13), then for all 2 k n,wehave c′ exp c d (!,!∗) log ∥ ∥ ≤ − H σ 2 ≤ ≤ ! ! % % && n ∗ b(k) c′n exp( n log n n log 2) ∈.P \ ≤ ! − + 2 (i) k x∗ 2 √ c′ n exp cklog ∥ ∥ c′e n exp( n log n n log 2 n n log n) ≤ − σ 2 ≤ − + +− + 2 k n % % && (ii) c′ ≤.≤ 2ϵ 1 , (ii) ≤ n ϵk + c′ n− ≤ where inequality (i) follows from the well-known upper bound 2 k n n n ≤.≤ n e√n ,andinequality(ii) holds since ϵ (0, √n). 1 !≤ e ∈ c′ , b) Case 2: Alternatively, if the maximum is attained by ≤ nϵ (nϵ 1) * + − the second term in the RHS of inequality (13), then we have k where step (i) follows since # ! dH(!,!∗) k n ,and 2 { : = }≤ k x∗ 2 2n step (ii) follows from condition (10). Relabelling the constants b(k) n c′ exp ck log ∥ 2∥ log n ≤ %− % % σ & − n d && in condition (10) proves the theorem. ! − (iii) In the next section, we prove Theorem 1 for the general ϵk c′n− , case. ≤ where step (iii) follows from condition (12). Combining the two cases, we have B. Proof of Theorem 1: Case d 2, 3,...,n 1 ∈{ − } ϵh 2ϵ 1 ϵh 2ϵ 1 b(k) max c′n− , cn− − c′n− cn− − . In order to be consistent, we follow the same proof structure ≤ { }≤ + as for the d 1case.Recallthedefinitionof+(!,!∗) The last step is to use the union) bound to obtain , from equation= (8). We begin with an equivalent of the key lemma to bound the probability of the event +(!,!∗) 0 . Pr ! !∗ b(k) { ≤ } { ML ̸= }≤ As in the d 1case,thisconstitutesthetechnicalcoreofthe 2 k n = ≤.≤ result. ! ϵh 2ϵ 1 c′n− cn− − Lemma 2: For any 1 < d < n, any two permutation ≤ + 2 k n matrices ! and !∗ at Hamming distance h, and provided ≤.≤ ) , 2 2n (iv) x∗ 5 2ϵ ∥ ∥2 n n d > ,wehave cn− , (14) σ 2 − − 4 ≤ ' ( where step (iv) follows by a calculation similar to the one Pr +(!,!∗) 0 carried out for the d 1 case. Relabelling the constants in { ≤ } = n condition (12) completes the proof. ! c′ max exp n log , ≤ − 2 / ) , x 2 2n C. Proof of Theorem 2 exp ch log ∥ ∗∥2 log n . (11) σ 2 − n d We begin by assuming that the design matrix A is fixed, and % % % & − && 0 that the estimator has knowledge of x a-priori. Note that the We prove Lemma 2 in Section V-B.1. Taking it as given, ∗ latter cannot make the estimation task any easier. In proving we are ready to prove Theorem 1 for the general case. this lower bound, we can also assume that the entries of Ax Proof of Theorem 1, General Case: As before, we use the ∗ union bound to prove the theorem. We begin by fixing some are distinct, since otherwise, perfect permutation recovery is impossible. ϵ (0, √n) and assuming that the following consequence of Given this setup, we now cast the problem as one of condition∈ (4) holds: coding over a Gaussian channel. Toward this end, consider 2 the codebook x∗ 2 2n c log ∥ 2∥ 1 ϵ c log n. (12) % σ & ≥ + + n d !Ax∗ ! n . ' − ( C ={ | ∈ P } PANANJADY et al.:LINEARREGRESSIONWITHSHUFFLEDDATA 3293

We may view !Ax∗ as the codeword corresponding to the Applying the chain rule for mutual information yields permutation !,whereeachpermutationisassociatedtoone I (!∗ y, A) I (!∗ y A) I (!∗, A) of n equally likely messages. Note that each codeword has ; = ; | + ! 2 (i) power Ax∗ 2. I (!∗ y A) The∥ codeword∥ is then sent over a Gaussian channel with = ; | EA I (!∗ y A α) , (16) noise power equal to n σ 2 nσ 2.Thedecodingproblem = ; | = i 1 = is to ascertain from the noisy= observations which message was where step (i) follows since" !∗ is chosen# independently sent, or in other words,$ to identify the correct permutation. of A.Wenowevaluatethemutualinformationterm I (! y A α),whichwedenotebyIα(! y).Letting We now use the non-asymptotic strong converse for the ∗; | = ∗; Gaussian channel [40]. In particular, using Lemma 11 (see Hα(y) H (y A α) denote the conditional entropy of y log n given a:= fixed realization| = of A,wehave Appendix B-C) with R n ! then yields that for any δ′ > 0, if = Iα(!∗ y) Hα(y) Hα(y ! ) ; = − | ∗ log n 1 δ Ax 2 (ii) ′ ∗ 2 1 log det cov yy n log σ 2, ! > + log 1 ∥ 2∥ , 2 ⊤ 2 n 2 % + nσ & ≤ − where the covariance is evaluated with A α,andinstep(ii), then for any estimator !,wehavePr! ! 1 2 2 nδ′ . we have used two key facts: = { ̸= }≥ − · − For the choice δ δ/(2 δ),wehavethatif (a) Gaussians maximize entropy for a fixed covariance, ′ = − ! ! which bounds the first term, and n Ax 2 ∗ 2 (b) For a fixed realization of ! ,thevectory is composed of (2 δ)log > log 1 ∥ 2∥ , (15) ∗ − e % + nσ & ) , n uncorrelated Gaussians. This leads to an explicit evaluation nδ/2 of the second term. then Pr ! ! 1 2 2− .Notethattheonly { ̸= }≥ − · Now taking expectations over A and noting that cov yy randomness assumed so far was in the noise w and the random ⊤ ≼ choice of!!. Ew yy⊤ ,wehavefromtheconcavityofthelogdeterminant function and Jensen’s inequality that We now specialize the result for the case when A is " # Gaussian. Toward that end, define the event I (!∗ y A) EA Iα(!∗ y) 2 ; | = ; Ax∗ 1 n 2 (δ) 1 δ ∥ ∥2 . log" det E yy#⊤ log σ , (17) 2 ≤ 2 − 2 E = 3 + ≥ n x∗ 2 4 ∥ ∥ where the expectation in the last5 line6 is now taken over Conditioned on the event (δ),itcanbeverifiedthat E randomness in both A and w. condition (5) implies condition (15). We also have Furthermore, using the AM-GM inequality for PSD matrices n 2 det X 1 trace X ,andbynotingthatthediagonalentries Ax∗ 2 n Pr (δ) 1 Pr ∥ ∥ > 1 δ ≤ 2 2 {E }= − n x 2 + of the matrix E yy⊤ are all equal to x∗ 2 σ ,weobtain 3 ∥ ∗∥2 4 * + ∥ ∥ + (i) " # n x 2 1 c e cnδ, ∗ 2 ′ − I (!∗ y, A) log 1 ∥ 2∥ . ≥ − ; ≤ 2 % + σ & where step (i) follows by using the sub-exponential tail bound 2 Ax∗ 2 2 Combining the pieces, we now have that (see Lemma 9 in Appendix B-B), since ∥ ∥2 χn . x∗ 2 ∼ Pr ! !∗ 1/2if Putting together the pieces, we have that∥ ∥ provided condi- { ̸= }≥ 2 tion (5) holds, x∗ n D 1 n log! 1 ∥ ∥2 (n D 1) log − + , (18) + σ 2 ≤ − + 2e Pr ! !∗ Pr ! !∗ (δ) Pr (δ) % & ' ( { ̸= }≥ { ̸= n|δE/2 } {E cn}δ (1 2 2− )(1 c′e− ) which completes the proof. ! ! = −! · cnδ − 1 c′e− . ≥ − E. Proofs of Computational Aspects ! In this section, we prove Theorem 4 by providing an efficient algorithm for the d 1caseandshowing D. Proof of Theorem 3 NP-hardness for the d > 1case. = We now prove Theorem 3 for approximate permutation 1) Proof of Theorem 4: d 1Case: In order to prove recovery. For any estimator !,wedenotebytheindicator the theorem, we need to show= an algorithm that performs the random variable E(!, D) whether or not the ! 
has acceptable optimization (7) efficiently. Accordingly, note that for the case distortion, i.e., E(!, D) I !dH(!,!∗) D ,withE 1 when d 1, equation (7) can be rewritten as = [ ≥ ] = = representing the error! event. For !∗ picked! uniformly at 2 !ML arg max a!⊤ y random in n,Lemma6statedandprovedinSectionV-C! ! = ! ∥ ∥ lower boundsP the probability of error as: ! arg max max a!⊤ y, a!⊤y = ! − I (!∗ y, A) log 2 Pr E(!, D) 1 1 ; + . 7 2 8 2 { = }≥ − n arg min max a! y 2, a! y 2 , (19) log n log (n D! 1) = ! ∥ − ∥ ∥ + ∥ !− − + ! ! 7 8 3294 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 5, MAY 2018

2 2 where the last step follows since 2a!⊤y a y a! has a solution (!, x) if and only if there exists a subset 2 =∥ ∥ +∥ ∥ −∥ − y 2,andsincethefirsttwotermsdonotinvolveoptimizing S d such that i S bi i d S bi . ∥ ⊂[ ] ∈ = ∈[ ]\ over !. By converting to , we see that yπ Ax Once the optimization problem has been written in the if and only if $ $ = form (19), it is easy to see that it can be solved in polynomial yπ(i) yπ(i), (20) time. In particular, using the fact that for fixed vectors p and = i π(i) d i π(i)>d q, p! q is minimized for ! that sorts p according to the | .≤ | . ∥ − ∥ order of q,weseethatAlgorithm1computes!ML exactly. and equation (20) holds, by construction, if and only if for This is a classical result [41] known as the rearrangement S i π(i) d d ,wehave ={ | ≤ }∩[ ] inequality (see, e.g., [2, Example 2]). ! bi bi . = i S i d S Algorithm 1 Exact Algorithm for Implementing .∈ ∈.[ ]\ Equation (7) for the Case When d 1 This completes the proof. ! = Input:designmatrix(vector)a,observationvectory V. D ISCUSSION 1 !1 permutation that sorts a according to y ← 2 !2 permutation that sorts a according to y We analyzed the problem of exact permutation recovery ← − in the linear regression model, and provided necessary and 3 !ML arg max a!⊤ y , a!⊤ y ← {| 1 | | 2 |} sufficient conditions that are tight in most regimes of n and d. Output: !ML ! We also provided a converse for the problem of approximate ! permutation recovery to within some Hamming distortion. The procedure defined by Algorithm 1 is clearly the correct It is still an open problem to characterize the fundamental thing to do in the noiseless case: in this case, x∗ is a limits of exact and approximate permutation recovery for scalar value that scales the entries of a,andsothecorrect all regimes of n, d and the allowable distortion D.Inthe permutation can be identified by a simple sorting operation. context of exact permutation recovery, we believe that the limit Two such operations suffice, one to account for when x∗ is suggested by Theorem 1 is tight for all regimes of n and d, positive and one more for when it is negative. Since each sort but showing this will likely require a different technique. operation takes (n log n) steps, Algorithm 1 can be executed In particular, as pointed out in Remark 1, all of our lower in nearly linearO time. ! bounds assume that the estimator is provided with x∗ as side 2) Proof of Theorem 4: NP-Hardness: In this section, we information; it is an interesting question as to whether stronger n n d show that given a vector matrix pair (y, A) R R × , lower bounds can be obtained without this side information. it is NP-hard to determine whether the equation∈ y × !Ax = On the computational front, many open questions remain. has a solution for a permutation matrix ! n and vector ∈ P The primary question concerns the design of computationally x Rd .Clearly,thisissufficienttoshowthattheproblem(3) ∈ efficient estimators that succeed in similar SNR regimes. is NP-hard to solve in the case when A and y are arbitrary. We have already shown that the maximum likelihood esti- Our proof involves a reduction from the PARTITIO N prob- mator, while being statistically optimal for moderate d,is lem, the decision version of which is defined as the following. computationally hard to evaluate in the worst case. Showing a Definition 1 (PARTITIO N): Given d integers b1, b2,..., corresponding hardness result for random A with noise is also bd ,doesthereexistasubsetS d such that ⊂[ ] an open problem. 
Finally, while this paper mainly addresses the problem of permutation recovery, the complementary bi bi ? = problem of recovering x∗ is also interesting, and we plan to i S i d S .∈ ∈.[ ]\ investigate its fundamental limits in future work. It is well known [43] that PARTITIO N is NP-complete. Also note that asking whether or not equation (1) has a APPENDIX A solution (!, x) is equivalent to determining whether or not PROOFS OF TECHNICAL LEMMAS there exists a permutation π and a vector x such that yπ Ax = In this section, we provide statements and proofs of the has a solution. We are now ready to prove the theorem. technical lemmas used in the proofs of our main theorems. Given an instance b1, bd of PARTITIO N,defineavector 2d 1 ··· y Z + with entries ∈ A. Supporting Proofs for Theorem 1: d 1Case = bi , if i d We provide a proof of Lemma 1 in this section; see yi ∈[ ] := 30, otherwise. Section IV-A for the proof of Theorem 1 (case d 1) given Lemma 1. We begin by restating the lemma for convenience.= Also define the 2d 1 2d matrix 1) Proof of Lemma 1: Before the proof, we establish + × I notation. For each δ>0, define the events A 2d . 1 1 2 2 := d⊤ d⊤ 1(δ) P⊥ y P⊥w δ , and (21a) / − 0 F = |∥ !∗ ∥2 −∥ ! ∥2|≥ Clearly, the pair (y, A) can be constructed from b1, bd in 7 2 2 8 ··· 2(δ) P!⊥ y 2 P!⊥w 2 2δ . (21b) time polynomial in n 2d 1. We now claim that yπ Ax F = ∥ ∥ −∥ ∥ ≤ = + = 7 8 PANANJADY et al.:LINEARREGRESSIONWITHSHUFFLEDDATA 3295

x 2 log snr Evidently, Using the shorthand snr ∥ ∗∥2 ,settingt h ,and σ 2 snr noting that t 0, h since snr:= > 1, we have = +(!,!∗) 0 1(δ) 2(δ). (22) ∈[ ] { ≤ }⊆F ∪ F Pr +(!,!∗) 0 Indeed, if neither 1(δ) nor 2(δ) occurs F F { ≤ } c′ exp ( ch log snr) 2 2 2 2 ≤ − +(!,!∗) P!⊥ y 2 P!⊥w 2 P!⊥ y 2 P!⊥w 2 h snr log snr = ∥ ∥ −∥ ∥ − ∥ ∗ ∥ −∥ ∥ 6exp log 1 . > )2δ δ , ) , + −10 log snr + snr − − ' / ' ( 0( δ. = It is easily verified that for all snr > 1, we have Thus, to prove Lemma 1, we shall bound the probability of snr log snr log snr the two events 1(δ) and 2(δ) individually, and then invoke log 1 > . (27) F F log snr + snr − 4 the union bound. Note that inequality (22) holds for all values ' ( 1 2 of δ>0; it is convenient to choose δ∗ 3 P!⊥!∗ Ax∗ 2. With this choice, the following lemma bounds:= the∥ probabilities∥ Hence, after substituting for snr,wehave of the individual events over randomness in w conditioned 2 x∗ 2 on a given A.Itsproofispostponedtotheendofthe Pr +(!,!∗) 0 c exp ch log ∥ ∥ . (28) { ≤ }≤ ′ − σ 2 section. ' ' (( 1 2 Lemma 3: For any δ>0 and with δ∗ 3 P!⊥!∗ Ax∗ 2, we have = ∥ ∥ ! 2) Proof of Lemma 3: We prove each claim of the lemma δ separately. Prw 1(δ) c′ exp c ,and (23a) {F }≤ − σ 2 a) Proof of claim (23a): To start, note that by definition ' ( 2 2 δ∗ of the linear model, we have P!⊥ y 2 P!⊥ w 2.Letting Prw 2(δ∗) c′ exp c . (23b) 2 ∥ ∗ ∥ =∥ ∗ ∥ {F }≤ − σ 2 Zℓ denote a χ random variable with ℓ degrees of freedom, ' ( we claim that The next lemma, proved in Section V-A.3, is needed in order to incorporate the randomness in A into the required 2 2 P⊥ w P⊥w Zk Zk, tail bound. It is convenient to introduce the shorthand ∥ !∗ ∥2 −∥ ! ∥2 = − ˜ T P ! Ax 2. ! !⊥ ∗ ∗ 2 where k min(d, d (!,! )). Lemma:= ∥ 4: For d∥ 1 and any two permutation matrices ! H ∗ For the:= rest of the proof, we adopt the shorthand ! ! and ! at Hamming= distance h, we have ′ ∗ range(!A) range(! A),and! ! range(\!A):= \ ′ ∩ ′ := ∩ 2 h h t range(!′ A).Now,bythePythagoreantheorem,wehave Pr A T! t x∗ 6exp log 1 (24) { ≤ ∥ ∥2}≤ −10 t + h − ' / 0( 2 2 2 2 P⊥ w P⊥w P!w P! w . for all t 0, h . ∥ !∗ ∥2 −∥ ! ∥2 =∥ ∥2 −∥ ∗ ∥2 We now∈[ have] all the ingredients to prove Lemma 1. Splitting it up further, we can then write Proof of Lemma 1: Applying Lemma 3 and using the union bound yields 2 2 2 P!w 2 P! ! w 2 (P! P! ! )w 2, ∥ ∥ =∥ ∩ ∗ ∥ +∥ − ∩ ∗ ∥ Prw +(!,!∗) 0 Prw 1(δ∗) Prw 2(δ∗) { ≤ }≤ {F }+ {F } T where we have used the fact that P! ! P! P! ! ! ∩ ∗ = ∩ ∗ = c′ exp c . (25) P P . ≤ − σ 2 ! !∗ !∗ ' ( ∩ 2 Similarly for the second term, we have P!∗ w 2 Combining bound (25) with Lemma 4 yields 2 2 ∥ ∥ = P! ! w 2 (P! P! ! )w 2, and hence, ∥ ∩ ∗ ∥ +∥ ∗ − ∩ ∗ ∥ Pr +(!,!∗) 0 2 2 2 { ≤ } P!w 2 P! w 2 (P! P! ! )w 2 (P! ∗ ∩ ∗ ∗ 2 ∥ ∥ −∥ ∥ =∥ − ∥ −∥ 2 t x∗ 2 2 P! !∗ )w 2. c′ exp c ∥ 2∥ Pr A T! t x∗ 2 − ∩ ∥ ≤ %− σ & { ≥ ∥ ∥ } 3 2 Now each of the two projection matrices above has rank Pr A T! t x∗ 2 + { ≤ ∥ ∥ } dim(! !∗) k,whichcompletestheproofoftheclaim. 2 \ = t x∗ 2 To prove the lemma, note that for any δ>0, we can write c′ exp c ∥ 2∥ ≤ %− σ & Pr 1(δ) Pr Zk k δ/2 Pr Zk k δ/2 . h h t {F }≤ {| − |≥ }+ {| ˜ − |≥ } 6exp log 1 , (26) + −10 t + h − 2 ' / 0( Using the sub-exponential tail-bound on χ random variables where the last inequality holds provided that t 0, h ,and (see Lemma 9 in Appendix B-B) completes the proof. ! the probability in the LHS is now taken over randomness∈[ ] in both w and A. 3With probability 1 3296 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 5, MAY 2018

b) Proof of claim (23b): We begin by writing union bound. Finally, bounds on the lower tails of χ2 random variables (see Lemma 8 in Appendix B-B) yield

2 hi hi Prw 2(δ) Prw P⊥!∗ Ax∗ 2 P⊥!∗ Ax∗, P⊥w 2δ . Pr Zh t Pr Zh t {F }= ⎧∥ ! ∥2 + ⟨ ! ! ⟩≤ ⎫ i ≤ h = i ≤ h ⎨⎪ R(A,w) ⎬⎪ 1 2 1 2 (iv) t t hi /2 expD 1 We see that conditioned⎩⎪= on A,therandomvariable>? R@(A,w)⎭⎪ ≤ h − h 2 ' ' (( is distributed as (T!, 4σ T!),wherewehaveusedthe h/10 N 2 (v) t t shorthand T! P⊥!∗ Ax∗ . exp 1 . := ∥ ! ∥2 ≤ h − h So applying standard Gaussian tail bounds (see, for exam- ' ' (( thi ple, Boucheron et al. [44]), we have Here, inequality (iv) is valid provided hi ,orequiva- h ≤ 2 lently, if t h,whereasinequality(v) follows since hi h/5 (T! 2δ) ≤ 1 x ≥ Prw 2(δ) exp − . and the function xe 0, 1 for all x 0, 1 .Combining {F }≤ − 8σ 2T − ∈[ ] ∈[ ] ' ! ( the pieces proves Lemma 4 for h 3. ≥ 1 b) Case h 2: In this case, we have Setting δ δ∗ 3 T! completes the proof. ! = 3) Proof= of Lemma:= 4: In the case d 1, the matrix A is 2 2 a a! 2 d a a! 2 d n = ∥ − ∥ 2Z1, and ∥ + ∥ 2Z1 Zn 2. composed of a single vector a R .Recallingtherandom 2 2 − 2 ∈ = = + variable T! P!⊥!∗ Ax∗ 2,wehave =∥ ∥ Proceeding as before by applying the unionD boundD and Lemma 8, we have that for t 2, the random variable T! 2 2 1 2 ≤ T! (x∗) a 2 2 a!, a obeys the tail bound = %∥ ∥ − a ⟨ ⟩ & ∥ ∥2 1/2 (i) 2 t t 2 2 Pr T! t(x∗) 2 exp 1 (x∗) a 2 a, a! { ≤ }≤ 2 − 2 ≥ ∥ ∥ −|⟨ ⟩| ' ' (( 2 ) , h/10 (x∗) 2 2 t t min a a! , a a! , 6 exp 1 , for h 2. = 2 ∥ − ∥2 ∥ + ∥2 ≤ h − h = ) , ' ' (( where step (i) follows from the Cauchy Schwarz inequality. ! Applying the union bound then yields

2 2 2 B. Supporting Proofs for Theorem 1: d > 1Case Pr T! t(x∗) Pr a a! 2 2t Pr a a! 2 2t . { ≤ }≤ {∥ − ∥ ≤ }+ {∥ + ∥ ≤ } We now provide a proof of Lemma 2. See Section IV-B Let Z and Z denote (not necessarily independent) χ2 ℓ ˜ ℓ for the proof of Theorem 1 (case d > 1) given Lemma 2. random variables with ℓ degrees of freedom. We split the We begin by restating the lemma for convenience. analysis into two cases. 1) Proof of Lemma 2: The first part of the proof is exactly a) Case h 3: Lemma 7 from Appendix B-A guarantees the same as that of Lemma 1. In particular, Lemma 3 applies ≥ that without modification to yield a bound identical to the inequal- 2 ity (25), given by a a! 2 d ∥ − ∥ Zh Zh Zh , and (29a) 2 = 1 + 2 + 3 T! Prw +(!,!∗) 0 c′ exp c , (30) a a 2 { ≤ }≤ − σ 2 ! 2 d ' ( ∥ + ∥ Zh1 Zh2 Zh3 Zn h , (29b) − 2 2 = + + + where T! P ! Ax ,asbefore. =∥ !⊥ ∗ ∗∥2 d The major difference from the d 1caseisintherandom where denotes equalityD in distributionD D andDh , h , h h 1 2 3 5 variable T .Accordingly,westatethefollowingparallel= with h = h h h.Anapplicationoftheunionbound≥ ! 1 2 3 lemma to Lemma 4. then yields+ + = Lemma 5: For 1 < d < n, any two permutation matrices 2n 3 ! and !∗ at Hamming distance h, and t hn− n d ,wehave 2 hi ≤ − Pr a a! 2 2t Pr Zhi t . {∥ − ∥ ≤ }≤ ≤ h Pr T t x 2 2max T , T , (31) i 1 1 2 A ! ∗ 2 1 2 .= { ≤ ∥ ∥ }≤ { } Similarly, provided that h 3, we have where ≥ n 2 T1 exp n log , and Pr a a! 2 2t Pr Zh1 Zh2 Zh3 Zn h t {∥ + ∥ ≤ }≤ { + + + − ≤ } = − 2 (ii) ) , 2n h h tn n d Pr Zh1 Zh2 Zh3 t − ≤ {D + D + D ≤ D} T2 6exp log 2n 1 . 3 = −10 n d + h − (iii) hi % E 'tn − ( F& DPr ZhD t D , ≤ i ≤ h The proof of Lemma 5 appears in Section V-B.2. We are now i 1 1 2 .= ready to prove Lemma 2. D where inequality (ii) follows from the non-negativity of Zn h, Proof of Lemma 2: We prove Lemma 2 from Lemma 5 and the monotonicity of the CDF; and inequality (iii) from− the and equation (30) by an argument similar to the one before. PANANJADY et al.:LINEARREGRESSIONWITHSHUFFLEDDATA 3297

We are now ready to prove Lemma 2.

Proof of Lemma 2: We prove Lemma 2 from Lemma 5 and equation (30) by an argument similar to the one before. In particular, in a similar vein to the steps leading up to equation (26), we have
$$\Pr\{\Delta(\Pi,\Pi^*) \le 0\} \le c' \exp\Big(-c\,\frac{t\|x^*\|_2^2}{\sigma^2}\Big) + \Pr_A\{T_\Pi \le t\|x^*\|_2^2\}. \quad (32)$$
We now use the shorthand $\mathrm{snr} := \frac{\|x^*\|_2^2}{\sigma^2}$ and let $t^* := \frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)\cdot\frac{h}{\mathrm{snr}}$. Noting that $\mathrm{snr}\cdot n^{-\frac{2n}{n-d}} > 5/4$ yields $t^* \le h\, n^{-\frac{2n}{n-d}}$, we set $t = t^*$ in inequality (32) to obtain
$$\Pr\{\Delta(\Pi,\Pi^*) \le 0\} \le c' \exp\Big(-c\, h\,\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)\Big) + \Pr_A\{T_\Pi \le t^*\|x^*\|_2^2\}. \quad (33)$$
Since $\Pr_A\{T_\Pi \le t^*\|x^*\|_2^2\}$ can be bounded by a maximum of two terms (31), we now split the analysis into two cases depending on which term attains the maximum.

a) Case 1: First, suppose that the second term attains the maximum in inequality (31), i.e.,
$$\Pr_A\{T_\Pi \le t^*\|x^*\|_2^2\} \le 12\exp\left(-\frac{h}{10}\left[\log\frac{h}{t^* n^{\frac{2n}{n-d}}} + \frac{t^* n^{\frac{2n}{n-d}}}{h} - 1\right]\right).$$
Substituting for $t^*$, we have
$$\Pr_A\{T_\Pi \le t^*\|x^*\|_2^2\} \le 12\exp\left(-\frac{h}{10}\log\frac{\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}}{\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)}\right)\cdot\exp\left(-\frac{h}{10}\left[\frac{\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)}{\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}} - 1\right]\right).$$
We have $\mathrm{snr}\cdot n^{-\frac{2n}{n-d}} > \frac{5}{4}$, a condition which leads to the following pair of easily verifiable inequalities:
$$\log\frac{\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}}{\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)} + \frac{\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)}{\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}} - 1 \ge \frac{1}{4}\,\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big), \quad (34\text{a})$$
and
$$\log\frac{\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}}{\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)} + \frac{\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)}{\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}} - 1 \le 5\,\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big). \quad (34\text{b})$$
Using inequality (34a), we have
$$\Pr_A\{T_\Pi \le t^*\|x^*\|_2^2\} \le 12\exp\Big(-c\, h \log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)\Big). \quad (35)$$
Inequality (34b) will be useful in the second case to follow.

Now using inequalities (35) and (33) together yields
$$\Pr\{\Delta(\Pi,\Pi^*) \le 0\} \le c' \exp\Big(-c\, h \log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)\Big). \quad (36)$$
It remains to handle the second case.

b) Case 2: Suppose now that $\Pr_A\{T_\Pi \le t^*\|x^*\|_2^2\} \le 2\exp\big(-n\log\frac{n}{2}\big)$, i.e., that the first term in the RHS of inequality (31) attains the maximum when $t = t^*$. In this case, we have
$$c'\exp\Big(-c\,\frac{t^*\|x^*\|_2^2}{\sigma^2}\Big) \stackrel{(i)}{\le} c'\exp\Big(-c\, h\,\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)\Big),$$
where step (i) follows from the right inequality (34b). Now substituting into inequality (33), we have
$$\Pr\{\Delta(\Pi,\Pi^*) \le 0\} \le c'\exp\Big(-c\, h\,\frac{2n}{n-d}\log\big(\mathrm{snr}\cdot n^{-\frac{2n}{n-d}}\big)\Big) + 2\exp\Big(-n\log\frac{n}{2}\Big) \le c'\exp\Big(-n\log\frac{n}{2}\Big). \quad (37)$$
Combining equations (36) and (37) completes the proof of Lemma 2. ∎

2) Proof of Lemma 5: We begin by reducing the problem to the case $x^* = e_1\|x^*\|_2$, where $e_1$ represents the first standard basis vector in $\mathbb{R}^d$. In particular, if $W x^* = e_1\|x^*\|_2$ for a $d\times d$ orthogonal matrix $W$ and writing $\bar{A} = AW$, we have by rotation invariance of the Gaussian distribution that the entries of $\bar{A}$ are distributed as i.i.d. standard Gaussians. It can be verified that
$$T_\Pi \stackrel{d}{=} \big\|\big(I - \Pi\bar{A}(\bar{A}^\top\bar{A})^{-1}(\Pi\bar{A})^\top\big)\,\Pi^*\bar{A}e_1\big\|_2^2\,\|x^*\|_2^2.$$
Since $\bar{A} \stackrel{d}{=} A$, the reduction is complete.
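The rotation-invariance reduction above lends itself to a quick numerical check. The sketch below (an illustration under stated assumptions, not part of the paper) compares the empirical distribution of $T_\Pi$, with $\Pi^*$ taken to be the identity, for a generic $x^*$ against that for $e_1\|x^*\|_2$; the two sets of quantiles should agree up to Monte Carlo error. NumPy is assumed, and the helper name `T_pi`, the sample sizes, and the particular $h$-cycle are hypothetical choices made for illustration.

```python
import numpy as np

def T_pi(A, x, perm):
    """T_Pi = || (I - P_{Pi A}) A x ||^2 with Pi* taken to be the identity."""
    Q, _ = np.linalg.qr(A[perm, :])          # orthonormal basis of col(Pi A)
    v = A @ x
    return float(v @ v - (Q.T @ v) @ (Q.T @ v))

rng = np.random.default_rng(1)
n, d, h, trials = 60, 4, 8, 5000
perm = np.arange(n)
perm[:h] = np.roll(perm[:h], 1)              # Pi at Hamming distance h from the identity

x = rng.standard_normal(d)
e1 = np.zeros(d)
e1[0] = np.linalg.norm(x)                    # e_1 * ||x||_2

draws_x  = [T_pi(rng.standard_normal((n, d)), x,  perm) for _ in range(trials)]
draws_e1 = [T_pi(rng.standard_normal((n, d)), e1, perm) for _ in range(trials)]

# The two empirical distributions should agree up to Monte Carlo error.
for q in (0.1, 0.5, 0.9):
    print(q, float(np.quantile(draws_x, q)), float(np.quantile(draws_e1, q)))
```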

In order to keep the notation uncluttered, we denote the first column of $A$ by $a$. We also denote the span of the first column of $\Pi A$ by $S_1$ and that of the last $d-1$ columns of $\Pi A$ by $S_{-1}$. Denote their respective orthogonal complements by $S_1^\perp$ and $S_{-1}^\perp$. We then have
$$T_\Pi = \|x^*\|_2^2\,\|P^\perp_{\Pi A}\,a\|_2^2 = \|x^*\|_2^2\,\|P_{S_{-1}^\perp\cap S_1^\perp}\,a\|_2^2 = \|x^*\|_2^2\,\|P_{S_{-1}^\perp\cap S_1^\perp}P_{S_1^\perp}\,a\|_2^2.$$
We now condition on $a$. Consequently, the subspace $S_1^\perp$ is a fixed $(n-1)$-dimensional subspace. Additionally, $S_{-1}^\perp \cap S_1^\perp$ is the intersection of a uniformly random $(n-(d-1))$-dimensional subspace with a fixed $(n-1)$-dimensional subspace, and is therefore a uniformly random $(n-d)$-dimensional subspace within $S_1^\perp$. Writing $u = \frac{P_{S_1^\perp}a}{\|P_{S_1^\perp}a\|_2}$, we have
$$T_\Pi = \|x^*\|_2^2\,\|P_{S_{-1}^\perp\cap S_1^\perp}\,u\|_2^2\,\|P_{S_1^\perp}a\|_2^2.$$
Now since $u \in S_1^\perp$, note that $\|P_{S_{-1}^\perp\cap S_1^\perp}u\|_2^2$ is the squared length of a projection of an $(n-1)$-dimensional unit vector onto a uniformly chosen $(n-d)$-dimensional subspace. In other words, denoting a uniformly random projection from $m$ dimensions to $k$ dimensions by $P^m_k$ and noting that $u$ is a unit vector, we have
$$\|P_{S_{-1}^\perp\cap S_1^\perp}u\|_2^2 \stackrel{d}{=} \|P^{n-1}_{n-d}v_1\|_2^2 \stackrel{(i)}{=} 1 - \|P^{n-1}_{d-1}v_1\|_2^2,$$
where $v_1$ represents a fixed standard basis vector in $n-1$ dimensions. The quantities $P^{n-1}_{n-d}$ and $P^{n-1}_{d-1}$ are projections onto orthogonal subspaces, and step (i) is a consequence of the Pythagorean theorem.

Now removing the conditioning on $a$, we see that the term $T_\Pi$ for $d > 1$ can be lower bounded by the corresponding $T_\Pi$ for $d = 1$, but scaled by a random factor – the norm of a random projection. Using $T^1_\Pi := \|P_{S_1^\perp}a\|_2^2\,\|x^*\|_2^2$ to denote $T_\Pi$ when $d = 1$, we have
$$T_\Pi \ge (1 - X_{d-1})\,T^1_\Pi, \quad (38)$$
where we have introduced the shorthand $X_{d-1} = \|P^{n-1}_{d-1}v_1\|_2^2$. We first handle the random projection term in equation (38) using Lemma 10 in Appendix B-B. In particular, substituting $\beta = (1-z)\frac{n-1}{d-1}$ in inequality (47) yields
$$\Pr\{1 - X_{d-1} \le z\} \le \Big(\frac{n-1}{d-1}\Big)^{\frac{d-1}{2}}\Big(\frac{z(n-1)}{n-d}\Big)^{\frac{n-d}{2}} \stackrel{(i)}{\le} \sqrt{\binom{n-1}{d-1}\binom{n-1}{n-d}}\; z^{\frac{n-d}{2}} = \binom{n-1}{d-1}\, z^{\frac{n-d}{2}} \stackrel{(ii)}{\le} 2^{n-1}\, z^{\frac{n-d}{2}},$$
where in steps (i) and (ii), we have used the standard inequalities $\big(\frac{n}{r}\big)^r \le \binom{n}{r} \le 2^n$. Now setting $z = n^{-\frac{2n}{n-d}}$, which ensures that $(1-z)\frac{n-1}{d-1} > 1$ for all $d < n$ and large enough $n$, we have
$$\Pr\{1 - X_{d-1} \le n^{-\frac{2n}{n-d}}\} \le \exp\Big(-n\log\frac{n}{2}\Big). \quad (39)$$
Applying the union bound then yields
$$\Pr\{T_\Pi \le t\|x^*\|_2^2\} \le \Pr\{1 - X_{d-1} \le n^{-\frac{2n}{n-d}}\} + \Pr\{T^1_\Pi \le t\, n^{\frac{2n}{n-d}}\|x^*\|_2^2\}. \quad (40)$$
We have already computed an upper bound on $\Pr\{T^1_\Pi \le t\, n^{\frac{2n}{n-d}}\|x^*\|_2^2\}$ in Lemma 4. Applying it yields that provided $t \le h\, n^{-\frac{2n}{n-d}}$, we have
$$\Pr\{T^1_\Pi \le t\, n^{\frac{2n}{n-d}}\|x^*\|_2^2\} \le 6\left[\frac{t\, n^{\frac{2n}{n-d}}}{h}\exp\Big(1 - \frac{t\, n^{\frac{2n}{n-d}}}{h}\Big)\right]^{h/10}. \quad (41)$$
Combining equations (41) and (39) with the union bound (40) and performing some algebraic manipulation then completes the proof of Lemma 5. ∎

C. Supporting Proofs for Theorem 3

The following technical lemma was used in the proof of Theorem 3, and we recall the setting for convenience. For any estimator $\hat\Pi$, we denote by $E(\hat\Pi, D)$ the indicator random variable of whether or not $\hat\Pi$ has acceptable distortion, i.e., $E(\hat\Pi, D) = \mathbb{I}\big[d_H(\hat\Pi,\Pi^*) \ge D\big]$, with $E = 1$ representing the error event. Assume $\Pi^*$ is picked uniformly at random in $\mathcal{P}_n$.

Lemma 6: The probability of error is lower bounded as
$$\Pr\{E(\hat\Pi, D) = 1\} \ge 1 - \frac{I(\Pi^*; y, A) + \log 2}{\log n! - \log\frac{n!}{(n-D+1)!}}. \quad (42)$$

Proof of Lemma 6: We use the shorthand $E := E(\hat\Pi, D)$ in this proof to simplify notation. Proceeding by the usual proof of Fano's inequality, we begin by expanding $H(E, \Pi^* \mid y, A, \hat\Pi)$ in two ways:
$$H(E, \Pi^* \mid y, A, \hat\Pi) = H(\Pi^* \mid y, A, \hat\Pi) + H(E \mid \Pi^*, y, A, \hat\Pi) \quad (43\text{a})$$
$$\qquad\qquad\qquad\quad\; = H(E \mid y, A, \hat\Pi) + H(\Pi^* \mid E, y, A, \hat\Pi). \quad (43\text{b})$$
Since $\Pi^* \to (y, A) \to \hat\Pi$ forms a Markov chain, we have $H(\Pi^* \mid y, A, \hat\Pi) = H(\Pi^* \mid y, A)$. Non-negativity of entropy yields $H(E \mid \Pi^*, y, A, \hat\Pi) \ge 0$. Since conditioning cannot increase entropy, we have $H(E \mid y, A, \hat\Pi) \le H(E) \le \log 2$, and $H(\Pi^* \mid E, y, A, \hat\Pi) \le H(\Pi^* \mid E, \hat\Pi)$. Combining all of this with the pair of equations (43) yields
$$H(\Pi^* \mid y, A) \le H(\Pi^* \mid E, \hat\Pi) + \log 2 = \Pr\{E=1\}\,H(\Pi^* \mid E=1, \hat\Pi) + (1 - \Pr\{E=1\})\,H(\Pi^* \mid E=0, \hat\Pi) + \log 2. \quad (44)$$
We now use the fact that uniform distributions maximize entropy to bound the two terms as $H(\Pi^* \mid E=1, \hat\Pi) \le H(\Pi^*) = \log n!$, and $H(\Pi^* \mid E=0, \hat\Pi) \le \log\frac{n!}{(n-D+1)!}$, where the last inequality follows since $E = 0$ reveals that $\Pi^*$ is within a Hamming ball of radius $D - 1$ around $\hat\Pi$, and the cardinality of that Hamming ball is at most $\frac{n!}{(n-D+1)!}$. Substituting back into inequality (44) yields
$$\Pr\{E=1\}\Big[\log n! - \log\frac{n!}{(n-D+1)!}\Big] \ge H(\Pi^* \mid y, A) - \log 2 - \log\frac{n!}{(n-D+1)!},$$
where we have added the term $H(\Pi^*) = \log n!$ to both sides so that $I(\Pi^*; y, A) = H(\Pi^*) - H(\Pi^* \mid y, A)$ appears. Simplifying then yields inequality (42). ∎

APPENDIX B
AUXILIARY RESULTS

In this section, we prove a preliminary lemma about permutations that is useful in many of our proofs. We also derive tight bounds on the lower tails of $\chi^2$ random variables and state an existing result on tail bounds for random projections.

D. Independent Sets of Permutations

In this section, we prove a combinatorial lemma about permutations. Given a Gaussian random vector $Z \in \mathbb{R}^n$, we use this lemma to characterize the distribution of $Z \pm \Pi Z$ as a function of the permutation $\Pi$. In order to state the lemma, we need to set up some additional notation. For a permutation $\pi$ on $k$ objects, let $G_\pi$ denote the corresponding undirected incidence graph, i.e., $V(G_\pi) = [k]$, and $(i, j) \in E(G_\pi)$ iff $j = \pi(i)$ or $i = \pi(j)$.

Lemma 7: Let $\pi$ be a permutation on $k \ge 3$ objects such that $d_H(\pi, I) = k$. Then the vertices of $G_\pi$ can be partitioned into three sets $V_1, V_2, V_3$ such that each is an independent set, and $|V_1|, |V_2|, |V_3| \ge \lfloor k/3\rfloor \ge k/5$.

Proof: Note that for any permutation $\pi$, the corresponding graph $G_\pi$ is composed of disjoint cycles, so it suffices to construct the partition separately within each cycle. Consider one such cycle. We can go through the vertices in the order induced by the cycle, and alternate placing them in each of the 3 partitions. Clearly, this produces independent sets, and furthermore, having 3 partitions ensures that the last vertex in the cycle has some partition with which it does not share edges. If the cycle length $C \equiv 0 \pmod 3$, then each partition gets $C/3$ vertices, otherwise the smallest partition has $\lfloor C/3\rfloor$ vertices. The partitions generated from the different cycles can then be combined (with relabelling, if required) to ensure that the largest partition has cardinality at most 1 more than that of the smallest partition. ∎
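The colouring argument in the proof of Lemma 7 is constructive, and a direct implementation is sketched below. This is a minimal illustration rather than code from the paper: the helper name `three_partition` is hypothetical, and the final relabelling step that balances the part sizes across different cycles is omitted, so the printed sizes are only approximately balanced. Within each cycle, the first $C-1$ vertices alternate among the three parts, and the last vertex is placed in a part containing neither of its two neighbours.

```python
def three_partition(pi):
    """Partition the vertices of the incidence graph G_pi of a fixed-point-free
    permutation pi (pi[i] = image of i) into three independent sets, following
    the cycle-by-cycle colouring used in the proof of Lemma 7."""
    n = len(pi)
    seen = [False] * n
    colour = [None] * n
    for start in range(n):
        if seen[start]:
            continue
        cycle = []
        i = start
        while not seen[i]:          # walk the cycle containing `start`
            seen[i] = True
            cycle.append(i)
            i = pi[i]
        for pos, v in enumerate(cycle[:-1]):
            colour[v] = pos % 3      # alternate the three parts along the cycle
        last, first = cycle[-1], cycle[0]
        prev = cycle[-2] if len(cycle) >= 2 else cycle[0]
        # the last vertex is adjacent to both the first and the previous vertex;
        # with three parts there is always a part containing neither neighbour
        colour[last] = ({0, 1, 2} - {colour[first], colour[prev]}).pop()
    return [{v for v in range(n) if colour[v] == c} for c in range(3)]

# Example: a permutation made of a 7-cycle and a transposition (no fixed points).
pi = [1, 2, 3, 4, 5, 6, 0, 8, 7]
V1, V2, V3 = three_partition(pi)
print(sorted(map(len, (V1, V2, V3))))   # sizes of the three independent sets
# independence check: no vertex shares a part with its image under pi
assert all(pi[v] not in part for part in (V1, V2, V3) for v in part)
```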

E. Tail Bounds on $\chi^2$ Random Variables and Random Projections

In our analysis, we require tight control on lower tails of $\chi^2$ random variables. The following lemma provides one such bound.

Lemma 8: Let $Z_\ell$ denote a $\chi^2$ random variable with $\ell$ degrees of freedom. Then for all $p \in [0, \ell]$, we have
$$\Pr\{Z_\ell \le p\} \le \left[\frac{p}{\ell}\exp\Big(1 - \frac{p}{\ell}\Big)\right]^{\ell/2} = \exp\left(-\frac{\ell}{2}\Big[\log\frac{\ell}{p} + \frac{p}{\ell} - 1\Big]\right). \quad (45)$$

Proof: The lemma is a simple consequence of the Chernoff bound. In particular, we have for all $\lambda > 0$ that
$$\Pr\{Z_\ell \le p\} = \Pr\{\exp(-\lambda Z_\ell) \ge \exp(-\lambda p)\} \le \exp(\lambda p)\,\mathbb{E}[\exp(-\lambda Z_\ell)] = \exp(\lambda p)(1 + 2\lambda)^{-\ell/2}, \quad (46)$$
where in the last step, we have used $\mathbb{E}[\exp(-\lambda Z_\ell)] = (1 + 2\lambda)^{-\ell/2}$, which is valid for all $\lambda > -1/2$. Minimizing the last expression over $\lambda > 0$ then yields the choice $\lambda^* = \frac{1}{2}\big(\frac{\ell}{p} - 1\big)$, which is greater than 0 for all $0 \le p \le \ell$. Substituting this choice back into equation (46) proves the lemma. ∎

We also state the following lemma for general sub-exponential random variables (see, e.g., Boucheron et al. [44]). We use it in the context of $\chi^2$ random variables.

Lemma 9: Let $X$ be a sub-exponential random variable. Then for all $t > 0$, we have
$$\Pr\{|X - \mathbb{E}[X]| \ge t\} \le c' e^{-ct}.$$

Lastly, we require tail bounds on the norms of random projections, a problem that has been studied extensively in the literature on dimensionality reduction. The following lemma, a consequence of the Chernoff bound, is taken from Dasgupta and Gupta [39, Lemma 2.2b].

Lemma 10 ([39]): Let $x$ be a fixed $n$-dimensional vector, and let $P^n_d$ be a projection from $n$-dimensional space onto a uniformly randomly chosen $d$-dimensional subspace, where $d \le n$. Then we have for every $\beta > 1$ that
$$\Pr\Big\{\|P^n_d x\|_2^2 \ge \frac{\beta d}{n}\|x\|_2^2\Big\} \le \beta^{d/2}\left(1 + \frac{(1-\beta)d}{n-d}\right)^{(n-d)/2}. \quad (47)$$
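Inequality (45) above is straightforward to check numerically against the exact $\chi^2$ CDF. The short script below (not from the paper; it assumes SciPy is available, and the grid of degrees of freedom and quantile fractions is an arbitrary choice) evaluates both sides and verifies that the stated bound dominates the exact lower-tail probability.

```python
import numpy as np
from scipy.stats import chi2

def chi2_lower_tail_bound(p, ell):
    """Right-hand side of inequality (45): ((p/ell) * exp(1 - p/ell)) ** (ell / 2)."""
    r = p / ell
    return (r * np.exp(1.0 - r)) ** (ell / 2.0)

for ell in (5, 20, 100):
    for frac in (0.1, 0.5, 0.9):
        p = frac * ell
        exact = chi2.cdf(p, df=ell)            # Pr{Z_ell <= p}
        bound = chi2_lower_tail_bound(p, ell)
        assert exact <= bound + 1e-12          # the Chernoff bound dominates the CDF
        print(f"ell={ell:3d}  p={p:6.1f}  Pr={exact:.3e}  bound={bound:.3e}")
```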

F. Strong Converse for Gaussian Channel Capacity

The following result due to Shannon [38] provides a strong converse for the Gaussian channel. The non-asymptotic version as stated here was also derived by Yoshihara [40].

Lemma 11 ([40]): Consider a vector Gaussian channel on $n$ coordinates with message power $P$ and noise power $\sigma^2$, whose capacity is given by $R = \frac{1}{2}\log\big(1 + \frac{P}{\sigma^2}\big)$. For any codebook $\mathcal{C}$ with $|\mathcal{C}| = 2^{n\tilde{R}}$, if for some $\epsilon > 0$ we have
$$\tilde{R} > (1 + \epsilon)R,$$
then the probability of error satisfies $p_e \ge 1 - 2\cdot 2^{-n\epsilon}$ for $n$ large enough.

REFERENCES

[1] A. Pananjady, M. J. Wainwright, and T. A. Courtade, "Linear regression with an unknown permutation: Statistical and computational limits," in Proc. IEEE 54th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2016, pp. 417–424.
[2] A. Pananjady, M. J. Wainwright, and T. A. Courtade. (Aug. 2016). "Linear regression with an unknown permutation: Statistical and computational limits." [Online]. Available: https://arxiv.org/abs/1608.02902
[3] R. J. A. Little and D. B. Rubin, Statistical Analysis With Missing Data. Hoboken, NJ, USA: Wiley, 2014.
[4] P.-L. Loh and M. J. Wainwright, "Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2012, pp. 2601–2605.
[5] J. Unnikrishnan, S. Haghighatshoar, and M. Vetterli. (2015). "Unlabeled sensing with random linear measurements." [Online]. Available: https://arxiv.org/abs/1512.00115
[6] A. Balakrishnan, "On the problem of time jitter in sampling," IRE Trans. Inf. Theory, vol. 8, no. 3, pp. 226–236, Apr. 1962.
[7] C. Rose, I. S. Mian, and R. Song. (2012). "Timing channels with multiple identical quanta." [Online]. Available: https://arxiv.org/abs/1208.1070
[8] A. Abid, A. Poon, and J. Zou. (May 2017). "Linear regression with shuffled labels." [Online]. Available: https://arxiv.org/abs/1705.01342
[9] A. B. Poore and S. Gadaleta, "Some assignment problems arising from multiple target tracking," Math. Comput. Modeling, vol. 43, nos. 9–10, pp. 1074–1091, 2006.
[10] S. Thrun and J. J. Leonard, "Simultaneous localization and mapping," in Springer Handbook of Robotics. Berlin, Germany: Springer, 2008, pp. 871–889.
[11] W. S. Robinson, "A method for chronologically ordering archaeological deposits," Amer. Antiquity, vol. 16, no. 4, pp. 293–301, 1951.
[12] A. Narayanan and V. Shmatikov, "Robust de-anonymization of large sparse datasets," in Proc. IEEE Symp. Security Privacy (SP), May 2008, pp. 111–125.
[13] L. Keller, M. J. Siavoshani, C. Fragouli, K. Argyraki, and S. Diggavi, "Identity aware sensor networks," in Proc. IEEE INFOCOM, Apr. 2009, pp. 2177–2185.
[14] P. David, D. Dementhon, R. Duraiswami, and H. Samet, "SoftPOSIT: Simultaneous pose and correspondence determination," Int. J. Comput. Vis., vol. 59, no. 3, pp. 259–284, 2004.
[15] M. Marques, M. Stošić, and J. Costeira, "Subspace matching: Unique solution to point matching with geometric constraints," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 1288–1294.
[16] S. Mann, "Compositing multiple pictures of the same scene," in Proc. 46th IS&T Annu. Conf., 1993, pp. 50–52.
[17] L. J. Schulman and D. Zuckerman, "Asymptotically good codes correcting insertions, deletions, and transpositions," IEEE Trans. Inf. Theory, vol. 45, no. 7, pp. 2552–2557, Nov. 1999.
[18] X. Huang and A. Madan, "CAP3: A DNA sequence assembly program," Genome Res., vol. 9, no. 9, pp. 868–877, 1999.
[19] M. J. Wainwright, "Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting," IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5728–5741, Dec. 2009.
[20] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.
[21] V. Emiya, A. Bonnefoy, L. Daudet, and R. Gribonval, "Compressed sensing with unknown sensor permutation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 1040–1044.
[22] G. Elhami, A. Scholefield, B. B. Haro, and M. Vetterli, "Unlabeled sensing: Reconstruction algorithm and theoretical guarantees," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 4566–4570.
[23] D. Hsu, K. Shi, and X. Sun. (2017). "Linear regression without correspondence." [Online]. Available: https://arxiv.org/abs/1705.07048
[24] A. Pananjady, M. J. Wainwright, and T. A. Courtade, "Denoising linear models with permuted data," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 446–450.
[25] S. Haghighatshoar and G. Caire. (2017). "Signal recovery from unlabeled samples." [Online]. Available: https://arxiv.org/abs/1701.08701
[26] Y. M. Lu and M. N. Do, "A theory for sampling signals from a union of subspaces," IEEE Trans. Signal Process., vol. 56, no. 6, pp. 2334–2345, Jun. 2008.
[27] T. Blumensath, "Sampling and reconstructing signals from a union of linear subspaces," IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4660–4671, Jul. 2011.
[28] O. Collier and A. S. Dalalyan, "Minimax rates in permutation estimation for feature matching," J. Mach. Learn. Res., vol. 17, no. 1, pp. 162–192, 2016.
[29] N. Flammarion, C. Mao, and P. Rigollet. (2016). "Optimal rates of statistical seriation." [Online]. Available: https://arxiv.org/abs/1607.02435
[30] F. Fogel, R. Jenatton, F. Bach, and A. d'Aspremont, "Convex relaxations for permutation problems," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 1016–1024.
[31] M. G. Rabbat, M. A. T. Figueiredo, and R. D. Nowak, "Network inference from co-occurrences," IEEE Trans. Inf. Theory, vol. 54, no. 9, pp. 4053–4068, Sep. 2008.
[32] V. Gripon and M. Rabbat, "Reconstructing a graph from path traces," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2013, pp. 2488–2492.
[33] N. B. Shah, S. Balakrishnan, A. Guntuboyina, and M. J. Wainwright, "Stochastically transitive models for pairwise comparisons: Statistical and computational issues," IEEE Trans. Inf. Theory, vol. 63, no. 2, pp. 934–959, Feb. 2017.
[34] S. Chatterjee, "Matrix estimation by universal singular value thresholding," Ann. Stat., vol. 43, no. 1, pp. 177–214, 2015.
[35] A. Pananjady, C. Mao, V. Muthukumar, M. J. Wainwright, and T. A. Courtade, "Worst-case vs average-case design for estimation from fixed pairwise comparisons," CoRR, Jul. 2017. [Online]. Available: http://arxiv.org/abs/1707.06217
[36] J. Huang, C. Guestrin, and L. Guibas, "Fourier theoretic probabilistic inference over permutations," J. Mach. Learn. Res., vol. 10, pp. 997–1070, 2009.
[37] E. M. Loiola, N. M. M. de Abreu, P. O. Boaventura-Netto, P. Hahn, and T. Querido, "A survey for the quadratic assignment problem," Eur. J. Oper. Res., vol. 176, no. 2, pp. 657–690, 2007.
[38] C. E. Shannon, "Probability of error for optimal codes in a Gaussian channel," Bell Syst. Tech. J., vol. 38, no. 3, pp. 611–656, 1959.
[39] S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Struct. Algorithms, vol. 22, no. 1, pp. 60–65, 2003.
[40] K.-I. Yoshihara, "Simple proofs for the strong converse theorems in some channels," Kodai Math. Seminar Rep., vol. 16, no. 4, pp. 213–222, 1964.
[41] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities. Cambridge, U.K.: Cambridge Univ. Press, 1952.
[42] A. Vince, "A rearrangement inequality and the permutahedron," Amer. Math. Monthly, vol. 97, no. 4, pp. 319–323, 1990. [Online]. Available: http://dx.doi.org/10.2307/2324517
[43] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. North Chelmsford, MA, USA: Courier Corporation, 1998.
[44] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford, U.K.: Oxford Univ. Press, 2013.

Ashwin Pananjady is a PhD candidate in the Department of Electrical Engineering and Computer Sciences (EECS) at the University of California, Berkeley. He received the B.Tech. degree (with honors) in electrical engineering from IIT Madras in 2014. His research interests include statistics, optimization, and information theory. He is a recipient of the Governor's Gold Medal from IIT Madras, and an Outstanding Graduate Student Instructor Award from UC Berkeley.

Martin J. Wainwright is currently Chancellor's Professor at the University of California at Berkeley, with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences (EECS). He received a Bachelor's degree in Mathematics from University of Waterloo, Canada, and a Ph.D. degree in EECS from Massachusetts Institute of Technology (MIT). His research interests include high-dimensional statistics, information theory, statistical machine learning, and optimization theory. Among other awards, he has received Best Paper Awards from the IEEE Signal Processing Society (2008) and the IEEE Information Theory Society (2010); a Medallion Lectureship (2013) from the Institute of Mathematical Statistics (IMS); a Section Lectureship at the International Congress of Mathematicians (2014); the COPSS Presidents' Award (2014) from the Joint Statistical Societies; and the IMS David Blackwell Lectureship (2017).

Thomas A. Courtade received the B.Sc. degree (summa cum laude) in electrical engineering from Michigan Technological University, Houghton, MI, USA, in 2007, and the M.S. and Ph.D. degrees from the University of California, Los Angeles (UCLA), CA, USA, in 2008 and 2012, respectively. He is an Assistant Professor with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA. Prior to joining UC Berkeley in 2014, he was a Postdoctoral Fellow supported by the NSF Center for Science of Information. Prof. Courtade's honors include a Distinguished Ph.D. Dissertation Award and an Excellence in Teaching Award from the UCLA Department of Electrical Engineering, and a Jack Keil Wolf Student Paper Award for the 2012 International Symposium on Information Theory. He was the recipient of a Hellman Fellowship in 2016.