
The Most Informative Order Statistic and its Application to Image Denoising

Alex Dytso⋆, Martina Cardone∗, Cynthia Rush†
⋆ New Jersey Institute of Technology, Newark, NJ 07102, USA, Email: [email protected]
∗ University of Minnesota, Minneapolis, MN 55404, USA, Email: [email protected]
† Columbia University, New York, NY 10025, USA, Email: [email protected]

Abstract—We consider the problem of finding the subset of order statistics that contains the most information about a sample of random variables drawn independently from some known parametric distribution. We leverage information-theoretic quantities, such as entropy and mutual information, to quantify the level of informativeness and rigorously characterize the amount of information contained in any subset of the complete collection of order statistics. As an example, we show how these informativeness metrics can be evaluated for a sample of discrete Bernoulli and continuous Uniform random variables. Finally, we unveil how our most informative order statistics framework can be applied to image processing applications. Specifically, we investigate how the proposed measures can be used to choose the coefficients of the L-estimator filter to denoise an image corrupted by random noise. We show that, both for discrete (e.g., salt-pepper noise) and continuous (e.g., mixed Gaussian noise) noise distributions, the proposed method is competitive with off-the-shelf filters, such as the median and the total variation filters, as well as with wavelet-based denoising methods.

I. INTRODUCTION

Consider a random sample X1, X2, . . . , Xn drawn independently from some known parametric distribution p(x|θ), where the parameter θ may or may not be known. Let the random variables (r.v.) X(1) ≤ X(2) ≤ . . . ≤ X(n) represent the order statistics of the sample. In particular, X(1) corresponds to the minimum value of the sample, X(n) corresponds to the maximum value of the sample, and X(n/2) (provided that n is even) corresponds to the median of the sample. We denote the collection of the random samples as X^n := (X1, X2, . . . , Xn), and we use [n] to denote the collection {1, 2, . . . , n}.

Some order statistics have traditionally been preferred over others. Although such a universal¹ choice can be justified when there is no knowledge of the underlying distribution, in scenarios where some knowledge is available a natural question arises: can we somehow leverage such knowledge to choose which order statistics are the "best" to consider? The main goal of this paper is to answer the above question. Towards this end, we introduce and analyze a theoretical framework for performing 'optimal' order statistic selection to fill the aforementioned theoretical gap. Specifically, our framework allows us to rigorously identify the subset of order statistics that contains the most information on a random sample. As an application, we show how the developed framework can be used for image denoising to produce approaches that are competitive with off-the-shelf filters, as well as with wavelet-based denoising methods. Similar ideas also have the potential to benefit other fields where order statistics find application, such as radar detection and classification. With the goal of developing a theoretical framework for 'optimal' order statistic selection, in this work we are interested in answering the following questions:

(1) How much 'information' does a single order statistic X(i) contain about the random sample X^n for each i ∈ [n]? We refer to the X(i) that contains the most information about the sample as the most informative order statistic.

(2) Let S ⊆ [n] be a set of cardinality |S| = k and let X(S) = {X(i)} for i ∈ S. Which subset of order statistics X(S) of size k is the most informative with respect to the sample X^n?

(3) Given a set S ⊆ [n] and the collection of order statistics X(S), which additional order statistic X(i), with i ∈ [n] but i ∉ S, adds the most information about the sample X^n?

As illustrated by comprehensive survey texts [1], [2], order

arXiv:2101.11667v1 [cs.IT] 27 Jan 2021 statistics have a broad of applications including survival One approach for defining the most informative order statis- and reliability analysis, life testing, statistical , tics, and the one that we investigate in this work, is to consider filtering theory, signal processing, robustness and classification the mutual information as a base measure of informativeness. studies, radar target detection, and wireless communication. In Recall that, intuitively, the mutual information between two such a wide variety of practical situations, some order statistics variables X and Y , denoted as I(X; Y ) = I(Y ; X), measures – such as the minimum, the maximum, and the median – have the reduction in uncertainty about one of the variables given been analyzed and adopted more than others. For instance, in the knowledge of the other. Let p(x, y) be the joint density the context of image processing (see also Section V), a widely of (X,Y ) and let p(x), p(y) be the marginals. The mutual employed order statistic filter is the median filter. However, to the best of our knowledge, there is not a theoretical study 1A large body of the literature has focused on analyzing information that justifies why certain order statistics should be preferred measures of the (continuous or discrete) parent population of ordered statistics (examples include the differential entropy [3], the Rényi entropy [4], [5], the cumulative entropies [6], the Fisher information [7], and the f- [8]) The work of M. Cardone was supported in part by the U.S. National Science and trying to show universal (i.e., distribution-free) properties for such Foundation under Grant CCF-1849757. information measures, see for instance [5], [9], [8], [10]. n information is calculated as Definition 1. Let Z := (Z1,Z2,...,Zn) be a vector n ZZ  p(x, y)  of i.i.d. standard Gaussian r.v. independent of X = I(X; Y ) = p(x, y) log dx dy. (1) (X ,X ,...,X ). Let S ⊆ [n] be defined as p(x)p(y) 1 2 n The base of the logarithm determines the units of the measure, S = {(i1, i2, . . . , ik) : 1 ≤ i1 < i2 < . . . < ik ≤ n}, and throughout the paper we use base e. Notice that there is a with |S| = k. We define the following three measures of order relationship between the mutual information and the differential statistic informativeness: entropy, namely, r (S,Xn) = I(Xn; X ), (3) I(X; Y ) = h(Y ) − h(Y |X), (2) 1 (S) n 2 n n r2(S,X ) = lim 2σ I(X + σZ ; X(S)), (4) where the entropy and the conditional entropy are de- σ→∞ n 2 n k R r3(S,X ) = lim 2σ I(X ; X(S) + σZ ). (5) fined as h(Y ) = − p(y) log p(y) dy, and h(Y |X) = σ→∞ RR p(x, y) log(p(x)/p(x, y))dy dx. The discrete analogue n of (1) replaces the integrals with sums, and (2) holds with In Definition 1, the measure r1(S,X ) computes the mutual the differential entropy h(Y ) being replaced with its discrete information between a subset of order statistics X(S) and n n P the sample X . The measure r2(S,X ) computes the slope version, denoted as H(Y ) = − y p(y) log p(y). In particular, if X and Y are independent – so knowing of the mutual information at σ = ∞: intuitively, as noise becomes large, only the most informative X(S) should maintain one delivers no information about the other – then the mutual n X the largest mutual information. The measure r3(S,X ) is an information is zero. 
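As a small numerical sanity check of the discrete analogue of (1), the sketch below enumerates a toy Bernoulli sample, builds the joint pmf of (X^n, X(i)), and evaluates the double sum directly; because X(i) is a deterministic function of X^n, the result equals H(X(i)). The function name and the toy parameters are ours, not part of the paper.

```python
import itertools
import math

def mi_sample_vs_order_stat(n=4, p=0.3, i=2):
    """Brute-force evaluation of the discrete analogue of (1): I(X^n; X_(i)).

    All 2^n Bernoulli(p) samples are enumerated, the joint pmf of
    (X^n, X_(i)) is tabulated, and the double sum is computed.  Since
    X_(i) is a function of X^n, the value coincides with H(X_(i)).
    """
    joint, marg_x, marg_y = {}, {}, {}
    for xs in itertools.product([0, 1], repeat=n):
        prob = math.prod(p if x == 1 else 1 - p for x in xs)
        y = sorted(xs)[i - 1]               # the i-th smallest entry
        joint[(xs, y)] = joint.get((xs, y), 0.0) + prob
        marg_x[xs] = marg_x.get(xs, 0.0) + prob
        marg_y[y] = marg_y.get(y, 0.0) + prob
    return sum(q * math.log(q / (marg_x[xs] * marg_y[y]))
               for (xs, y), q in joint.items() if q > 0)

print(mi_sample_vs_order_stat())   # in nats, base e as in the paper
```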
Differently, if is a deterministic function n Y Y X alternative to r2(S,X ), with noise added to X(S) instead of of and is a deterministic function of , then knowing n one gives us complete information on the other. If additionally, X . The limits in (4) and (5) always exist, but may be infinity. X and Y are discrete, the mutual information is then the same One might also consider similar measures as in (4) and (5), as the amount of information contained in X or Y alone, as but in the limit of σ that goes to zero, namely measured by the entropy, H(Y ), since H(Y |X) = 0. If X n n n I(X + σZ ; X(S)) and Y are continuous, the mutual information is infinite since r4(S,X ) = lim , σ→0 1 log(1 + 1 ) h(Y |X) = −∞ (because (X,X) is singular with respect to 2 σ2 n k (6) the Lebesgue measure on 2). n I(X ; X(S) + σZ ) R r5(S,X ) = lim . σ→0 1 1 2 log(1 + σ2 ) II.MEASURESOF INFORMATIVENESS OF ORDER n STATISTICS In particular, the intuition behind r4(S,X ) is that the most informative set X(S) should have the largest increase in the In this section, we propose several metrics, all of which mutual information as the observed sample becomes less noisy. leverage the mutual information as a base measure of infor- n n The measure r5(S,X ) is an alternative to r4(S,X ) where mativeness. We start by considering the mutual information the noise is added to X instead of Xn. However, as we n (S) between the sample X and any order statistic X(i), i.e., prove next, these measures evaluate to n I(X(i); X ) and find the index i ∈ [n] that results in the n largest mutual information. In the case of discrete r.v., we have r4(S,X ) = 0, continuous and discrete r.v.,  n n n n k, continuous r.v., I(X(i); X ) = H(X ) − H(X |X(i)) r5(S,X ) =  n  0, discrete r.v.. X X p(x(i), x ) = p(x , xn) log . (i) n Hence, these are not useful measures of information. n p(x(i))p(x ) x(i) x n Such an approach works only when the sample is composed Proof. To characterize r4(S,X ) in (6), recall that by the of discrete r.v. and does not work for continuous r.v. The processing inequality, if X → Y → Z is a Markov n n n reason for this is that, as highlighted in Section I, when Xn chain then I(X; Z|Y ) = 0. Now, since X + σZ → X → n n n is a collection of continuous r.v., then I(X ; Xn) = ∞ as X(S) is a Markov chain and I(X + σZ ; X(S)|X ) = 0, (i) n n n h(X |Xn) = −∞. we therefore have that I(X + σZ ; X(S)) = I(X + (i) n n This idea of using mutual information, however, can be σZ ; X ,X(S)). Then, by the chain rule of the mutual n n n n n n salvaged by introducing noise to the sample. For example, information, I(X + σZ ; X ,X(S)) = I(X + σZ ; X ) − n n n I(X + σZ ; X |X(S)), and, the informativeness of X(i) can be measured by considering n n n I(X(i); X + σZ ) where Z := (Z1,Z2,...,Zn) is a vector I(Xn + σZn; X ) n n (S) of i.i.d. Gaussian r.v. independent of X with σ being the r4(S,X ) = lim σ→0 1 log(1 + 1 ) noise . Next, based on the above discussion, 2 σ2 I(Xn + σZn; Xn) − I(Xn + σZn; Xn|X ) we propose three potential measures of informativeness of = lim (S) n σ→0 1 1 order statistics about the sample X , all based on the mutual 2 log(1 + σ2 ) information measure. n n = d(X ) − d(X |X(S)), where d(Xn) is known as the information dimension or Rényi Proof. For simplicity, we focus on the case V = ∅. The proof dimension [11], [12], namely for arbitrary V follows along the same lines. First, assume that Xn is a sequence of discrete r.v. Then, by using the relationship  n continuous r.v. 
d(Xn) = (7) between mutual information and entropy given in (2) we have, 0 discrete r.v.. n n I(X ; X(S)) = H(X(S)) − H(X(S)|X ) = H(X(S)), where k n the last equality uses that H(X |Xn) = 0 since X is Similarly, since (X(S) + σZ ) → X(S) → X is a Markov (S) (S) k n fully determined given the value of the sequence Xn. As chain with I(X(S) + σZ ; X(S)|X ) = 0, we obtain mentioned in Section II, if Xn is a sequence of continuous n k n n n I(X ; X(S) + σZ ) r.v. then I(X ; X(S)) = h(X(S)) − h(X(S)|X ) = ∞ since r5(S,X ) = lim 1 1 n n σ→0 h(X |X ) = −∞. This characterizes r1(S,X ) 2 log(1 + σ2 ) (S) n k We now characterize the measure r2(S,X ). We have that I(X(S); X(S) + σZ ) = lim = d(X(S)), σ→0 1 1 n 2 n n 2 log(1 + σ2 ) r2(S,X ) = 2 lim σ I(X + σZ ; X(S)) σ→∞ √ n n where d(·) is defined in (7). (a) I( snrX + Z ; X ) = 2 lim (S) snr→0 snr Remark 1. We emphasize that the scaling and Gaussian noise (b) d √ n n used above were not chosen artificially. It can be shown that = 2 I( snrX + Z ; X(S)) dsnr snr=0 any absolutely continuous perturbation with a finite Fisher (c)  n n n 2 n n n 2 information would result in equivalent limits [13]. Therefore, = E kX − E[X |Z ]k − kX − E[X |Z ,X(S)]k the choice of Gaussian noise was simply made for the ease of (d) = kXn − [Xn]k2− kXn − [Xn|X ]k2 , (14) exposition and the proof. E E E E (S) There are a few shortcomings of the measures just introduced. where the labeled equalities follow from: (a) defining snr = 2 For instance, the elements of the most informative set are not 1/σ and noting that I(aX; Y ) = I(X; Y ) for a constant a; ordered based on the amount of information that each element (b) using the fact that provides. Moreover, at this point, we are unable to quantify the f(snr) − f(0) d amount of information that an additional order statistic adds to lim = f(a) , snr→0 snr da a=0 a given collection X(S) of order statistics. These shortcomings √ n n can be remedied by considering a conditional version of the where f(a) = I( aX + Z ; X(S)) with f(0) = n measures introduced in Definition 1. I(Z ; X(S)) = 0; (c) using the generalized I-MMSE rela- n √ n n tionship [14, Thm. 10] since X(S) → X → ( snrX + Z ) Definition 2. Under the assumptions in Definition 1, let V ⊂ is a Markov chain; and (d) since Zn is independent of Xn. [n] such that S ∩ V = . Then, we define three conditional n ∅ To conclude the proof of r2(S,X ) in (12), we would like measures of order statistic informativeness: n n 2 to show that (14) is equal to E[kE[X ] − E[X |X(S)]k ]. We n n start by noting that r1(S,X |V) = I(X ; X(S)|X(V)), (8) n 2 n n  n n 2 r2(S,X |V) = lim 2σ I(X + σZ ; X(S)|X(V)), (9) k [X ] − [X |X ]k σ→∞ E E E (S) n 2 n k h n n n n 2i r3(S,X |V) = lim 2σ I(X ; X(S) + σZ |X(V)). (10) = ( [X ] − X ) + (X − [X |X ]) σ→∞ E E E (S)  n n 2  n n 2 III.CHARACTERIZATION OF THE INFORMATIVENESS = E kE[X ] − X k + E kX − E[X |X(S)]k  n n T n n  MEASURES + 2E (E[X ] − X ) (X − E[X |X(S)]) . (15) In this section, we characterize the measures of informative- Moreover, we note that ness of order statistics proposed in Definition 1 and Definition 2.  n n T n n  In particular, we have the following theorem. − 2E (E[X ] − X ) (X − E[X |X(S)])  n n T n n n T n  Theorem 1. Let S ⊆ [n] such that |S| = k, and V ⊂ [n] such = 2E (X − E[X ]) X − (X − E[X ]) E[X |X(S)] that S ∩ V = ∅. 
Then, the metrics in Definition 2 evaluate to (a)  n T n n  = 2E (X ) (X − E[X |X(S)])  H(X |X ), for discrete r.v., (b) r (S,Xn|V)= (S) (V) (11) = 2 kXn − [Xn|X ]k2 , (16) 1 ∞, otherwise, E E (S) n n n 2 r2(S,X |V)=E[kE[X |X(V)]−E[X |X(S),X(V)]k ], (12) where the labeled equalities follow from: (a) the fact that n 2 r3(S,X |V) = E[kX(S) − E[X(S)|X(V)]k ]. (13)  n T n n  E (E[X |) (X − E[X |X(S)]) n T  n n  Taking V = ∅ gives an evaluation of the metrics in Def- = (E[X |) E X − E[X |X(S)] n inition 1, namely r1(S,X ) = H(X(S)) for discrete r.v. = ( [Xn|)T [Xn] − [ [Xn|X ]] and r (S,Xn) = ∞ otherwise, r (S,Xn) = [k [Xn] − E E E E (S) 1 2 E E n T n n n 2 n 2 = ( [X |) ( [X ] − [X ]) = 0, E[X |X(S)]k ], and r3(S,X ) = E[kX(S) − E[X(S)]k ]. E E E where in the third equality we have used the law of total approach to use (i.e., which of the three questions raised in expectation; and (b) using the orthogonality principle [15], Section I is most relevant for the problem at hand). which states that [( [Xn|X ])T (Xn − [Xn|X ])] = 0. E E (S) E (S) IV. EVALUATION OF THE INFORMATIVENESS MEASURES By substituting (16) back into (15), we obtain A. Discrete Random Variables: The Bernoulli Case  n n 2 E kE[X ] − E[X |X(S)]k We assess the three measures in Theorem 1 for the case of a  n n 2  n n 2 = E kE[X ] − X k − E kX − E[X |X(S)]k , sample of discrete r.v. in Lemma 2 (proof in Appendix B). In particular, Lemma 2 studies the Bernoulli case, and in Section V which is precisely (14). Hence, r (S,Xn) = [k [Xn] − 2 E E we consider another discrete distribution with applications to [Xn|X ]k2]. E (S) image processing. The results presented here rely heavily on We now characterize r (S,Xn). It follows by the data 3 Lemma 5 in Appendix A-A to compute the joint distribution processing inequality, that I(X; Z) = I(X; Y ) for a Markov of k order statistics. chain X → Y → Z if I(X; Y |Z) = 0. Notice that k n n in our problem, (X(S) + σZ ) → X(S) → X forms a Lemma 2. Let X be sampled as i.i.d. Bernoulli with success k n 0 Markov chain with I(X(S) + σZ ; X(S)|X ) = 0. Thus, probability p. Let B be a Binomial(n, 1 − p) r.v. and B be a k n k I(X(S) + σZ ; X ) = I(X(S) + σZ ; X(S)). Therefore, Binomial(n − 1, 1 − p) r.v. Then, n 2 n k r (i, Xn) = h (P (B < i)), (20) r3(S,X ) = lim 2σ I(X ; X(S) + σZ ) 1 b σ→∞ 2 2 2 k n np h 0 i = lim 2σ I(X(S); X(S) + σZ ) r2(i, X ) = P (B < i) σ→∞ P (B < i)  2 2 = E kX(S) − E[X(S)]| , np h i2 + P (B0 ≥ i) − np2, (21) where the last limit is a standard result and can for example P (B ≥ i) n be found in [16, Corollary 2]. r3(i, X ) = P (B < i)P (B ≥ i), (22) By leveraging Theorem 1, we can now construct procedures where hb(t) := −t log(t) − (1 − t) log(1 − t) is the binary that answer the three questions raised in Section I. Specifically, entropy function. given m ∈ [3], we propose the following three approaches: Remark 2. Consider x(1 − x) for x ∈ (0, 1), which is (1) Marginal Approach: Generate one set of cardinality k symmetric and convex with the maximum occurring at x = 1/2. n according to Thus, r3(i, X ) in (22) is maximized by the i such that P (B ≥ i) or 1 − P (B ≥ i) is as close to 1/2 as possible. S¯M = {(i , . . . , i ): r (i ,Xn) ≥ ... ≥ r (i ,Xn), m 1 k m 1 m k Hence, the maximizer i is a median of B, namely, 1 ≤ i1 < . . . < ik ≤ n}. (17) ? n n i3(X ) = arg max r3(i, X ) ¯M i∈[n] This approach generates an ordered set Sm of indices of order n 1 o (23) statistics, listed from the (first) most informative to the k-th = arg min P (B ≥ i) − . 
most informative, and quantifies the amount of information i∈{bn(1−p)c,dn(1−p)e} 2 that an individual order statistic contains about the sample. Moreover, since the binary entropy function hb(t) is increasing (2) Joint Approach: Generate one set of cardinality k with on 0 < t ≤ 1/2 and decreasing on 1/2 ≤ t < 1, the maximizer n ¯J n for r1(i, X ) in (20) will also be given by (23). Sm ∈ arg max rm(S,X ). (18) S⊆[n], |S|=k When Xn is sampled i.i.d. Bernoulli with probability p, the ¯J ‘information’ in the order statistics 0 ≤ X(1) ≤ X(2) ≤ ... ≤ Now Sm contains the indices of the k order statistics that are the most informative about the sample. X(n) ≤ 1 is simply the counts of 0’s and 1’s present in the data. In terms of the order statistics, the ‘information’ lies in (3) Sequential Approach: Generate one set of cardinality k the location of the switch point (if there is one), i.e., the i according to n where X(i) = 0 but X(i+1) = 1. Since we expect E[X ] = np ¯S Sm = {(i1, . . . , ik): of the samples to take the value 1, the switch point is expected n n to occur at round(n(1 − p)), and Remark 2 (at least for r (·, ·) rm(it,X |Vt−1) ≥ max rm(j, X |Vt−1), 1 j∈[n]:j∈V / t−1 and r3(·, ·)) tells us that the ‘most informative’ order statistic Vt = (i1, . . . , it), t ∈ [k], V0 = ∅}. (19) is where we expect the switch point to occur. In the next proposition, we further show that, as the sample size grows, S¯S This approach produces an ordered set, m, of indices of order the most informative order statistic significantly dominates the i statistics where t is the most informative order statistic given other statistics for measures (20) and (22). that the information of t − 1 order statistics has already been incorporated (captured by the conditioning term). Proposition 3. Let Xn be i.i.d. Bernoulli with success proba- ¯M ¯J ¯S In the next section we show that the sets Sm , Sm and Sm bility p ∈ (0, 1). For any c ∈ (0, 1) independent of n, we obtain may not be the same, even in simple cases. Thus, the application  n log(2), c = (1 − p), lim r1(bcnc,X ) = of interest and target analysis should guide the choice of which n→∞ 0, otherwise,  n 1/4, c = (1 − p), lim r3(bcnc,X ) = 2 n→∞ 0, otherwise. n a i(n + 1 − i) r3(i, X ) = 2 , and (24b) The same result holds when b·c is replaced by d·e. (n + 1) (n + 2)     Proof. From de Moivre-Laplace theorem [17], we know that ? n ? n n + 1 n + 1 i2(X ) = i3(X ) ∈ , . (24c) for B ∼ Binomial(n, 1 − p) the distribution of B√−n(1−p) 2 2 np(1−p) converges to the standard . Hence, Remark 3. Lemma 4 also encompasses the case where Xn is sampled as i.i.d. U(a, b) for general a < b since the mutual lim P (B < bcnc) n n n→∞ information, which characterizes r2(i, X ) and r3(i, X ) (see ! B − n(1 − p) bcnc − n(1 − p) Definition 1), has the property that I(X + c; Y ) = I(X; Y ) = lim P < when c is some constant. n→∞ pnp(1 − p) pnp(1 − p) !  1, c > (1 − p), Remark 4. For c ∈ (0, 1) independent of n, metrics r2(·, ·) bcnc − n(1 − p)  = lim Φ = 1/2, c = (1 − p), and r3(·, ·) have the following behaviors as n goes to infinity: n→∞ p np(1 − p)  n 2 0, c < (1 − p), lim r2(bcnc,X ) = a c(1 − c)/4, n→∞ n 2 where Φ(·) is the cumulative distribution function of the lim n · r3(bcnc,X ) = a c(1 − c). standard normal. Inserting the limit above into the expressions n→∞ for r1(·, ·) and r3(·, ·) in (20) and (22) completes the proof. We conclude this section by again considering the sets of ¯M ¯J ¯S most informative order statistics Sm , Sm and Sm in (17)–(19). 
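The three selection rules in (17)-(19) can be made concrete for the Bernoulli case, where Theorem 1 reduces r1 to the entropy of the selected order statistics. The following sketch (function names and tie-breaking are ours, not the authors' code) evaluates H(X(S)) exactly through the distribution of the number of zeros and then applies the marginal, joint, and sequential rules.

```python
import itertools
import math

def entropy_of_order_subset(S, n, p):
    """r1(S, X^n) = H(X_(S)) for an i.i.d. Bernoulli(p) sample (Theorem 1).

    For Bernoulli data the sorted sample is a block of 0's followed by a
    block of 1's, so X_(S) is determined by how many indices in S the
    zero-count K ~ Binomial(n, 1 - p) reaches; H(X_(S)) is the entropy of
    that discretisation of K.
    """
    pmf_K = [math.comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(n + 1)]
    cells = {}
    for k, q in enumerate(pmf_K):
        cell = sum(k >= i for i in S)        # which switch-point region K falls in
        cells[cell] = cells.get(cell, 0.0) + q
    return -sum(q * math.log(q) for q in cells.values() if q > 0)

def select_sets(n, p, k):
    """Toy versions of the marginal (17), joint (18) and sequential (19) rules."""
    h1 = {i: entropy_of_order_subset([i], n, p) for i in range(1, n + 1)}
    marginal = sorted(h1, key=h1.get, reverse=True)[:k]
    joint = max(itertools.combinations(range(1, n + 1), k),
                key=lambda S: entropy_of_order_subset(S, n, p))
    sequential = []
    for _ in range(k):
        best = max((i for i in range(1, n + 1) if i not in sequential),
                   key=lambda i: entropy_of_order_subset(sequential + [i], n, p))
        sequential.append(best)
    return marginal, list(joint), sequential

print(select_sets(n=19, p=0.5, k=4))
```

For n = 19 and p = 0.5 this recovers sets of the kind reported in Section IV-A, up to how ties between equally informative indices are broken.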
In the above, we focused our analysis on the single most Specifically, for an i.i.d. sample uniform on (0, a) with a = 1 ¯M informative order statistic. We now want to consider sets Sm , and n = 5, the sets of sizes k ∈ [4] are given by ¯J ¯S Sm and Sm defined in (17)–(19). For simplicity, we consider ¯M measure r1(·, ·) and an i.i.d. Bernoulli sample of size n = 19 S3 → {3}, {3, 2}, {3, 2, 4}, {3, 2, 4, 1}; with p = 0.5. Then for set sizes k ∈ [4] we find ¯J S3 → {3}, {3, 2}, {3, 2, 4}, {3, 2, 4, 1}; ¯M ¯S S1 → {10}, {10, 9}, {10, 9, 11}, {10, 9, 11, 8}; S3 → {3}, {3, 5}, {3, 5, 1}, {3, 5, 1, 4}. ¯J S1 → {10}, {9, 11}, {10, 8, 12}, {10, 8, 12, 9}; Similarly to the discrete case, we see that it is possible for the ¯S S1 → {10}, {10, 8}, {10, 8, 12}, {10, 8, 12, 9}. approaches to result in different sets. To interpret the above, ¯M consider only the k = 2 collection. From S3 , we know that Notice that the three sets can all be different (e.g., when k = 2) the 3rd statistic (the median) is the most informative and the and we find that this difference becomes more drastic when any nd ¯J 2 is the second most. By S3 , the same order statistics form of the following occurs: n increases, the size of the r.v. support ¯S the most informative pair. However, from S3 , we know that increases, or the distribution becomes more asymmetric. To given the most informative (the 3rd), the 5th provides the most interpret the above, consider only the k = 2 collection. From additional information. ¯M th S1 , we know that the 10 statistic is the most informative and the 9th is the second most. However, the pair of most V. APPLICATIONS th th ¯J ¯S informative statistics is the 9 and 11 by S1 . From S1 , In this section, we show how the informativeness framework we know that, given the most informative (the 10th), the 8th for order statistics just developed can be used in image provides the most additional information. processing applications. We begin by reviewing some of the details about order statistics filters, which represent a class of B. Continuous Random Variables: The Uniform Case non-linear filters. Now we look at an example for a sample of continuous random variables in Lemma 4 (proof in Appendix C). Remem- A. Order Statistics Filtering ber that, from Theorem 1, we have that the metric r1(·, ·) is Consider the following discrete-time filter, referred to as an infinity for continuous r.v., and hence we here focus on r2(·, ·) L-estimator in the remainder of the paper. and r (·, ·). In particular, Lemma 4 studies a Uniform sample, 3 Definition 3. Define a filter and in Section V we consider another continuous distribution n with applications to image processing. Throughout this section X we use Lemma 6, in Appendix A-B, to compute the joint Yt = αkX(k), t ∈ Z, (25) distribution of k order statistics. k=1 where: (i) X , for k ∈ [n], is the k-th order statistic of an i.i.d. Lemma 4. Let Xn be sampled as i.i.d. U(0, a) for a > 0, i.e., (k) sequence X , for i ∈ [n+1]; (ii) n is the filtering window sampled i.i.d. uniform on the interval (0, a) and, for k ∈ {2, 3} t+i−1 ? n n width; and (iii) αk ≥ 0’s, for k ∈ [n], are the coefficients of define ik(X ) = arg maxi∈[n] rk(i, X ). Then, Pn the filter such that k=1 αk = 1. This filter is known as an a2i(n + 1 − i) L-estimator in [18] and as an order statistics r (i, Xn) = , (24a) 2 4n(n + 2) filter in image processing [19], [20]. The general form of the L-estimator encompasses a large number of linear and non-linear filters. 
Examples are: 1) moving-average filter: αk = 1/n, for all k ∈ [n]; 2) median filter: (by considering odd values of n) αk = 1 for k = (n + 1)/2 and αk = 0 for k 6= (n + 1)/2; 3) maximum filter: αn = 1 and αk = 0 for k 6= n; 4) minimum filter: α1 = 1 and αk = 0 for k 6= 1; 5) midpoint filter: α1 = αn = 1/2 and αk = 0 for k 6= 1, n; 6) r-th ranked-order filter: αr = 1 and αk = 0 for k 6= r. The L-estimator in Definition 3 has been extensively studied in the literature [21], [22], [1], [23], [24], [25], [26], [27]. A comprehensive survey of their applications and, more generally, of order statistics is given in [1]. It is important to highlight that the L-estimator in (25) forms a restricted class of estimators, and, as such, it is possible that other estimators, like the maximum likelihood, may have better efficiency. Nonetheless, Fig. 1: Test image. it was shown in [23] that for a certain choice of weights, the estimator in (25) attains the Cramér-Rao bound asymptotically and, hence, is asymptotically efficient. For an excellent survey that it can be applied to both continuous and discrete models. on L-estimators, the interested reader is referred to [24]. Moreover, reliance on the MSE can be avoided, and signal The optimal choice of the coefficients in (25) has received fidelity can instead be measured using alternative quantities like considerable attention in the context of scale-and-shift models. the entropy. Our goal is to show that selecting the L-estimator Specifically, suppose that the Xi’s are generated i.i.d. according coefficients using the most informative order statistics is a x−λ viable and competitive approach, worth further exploration. We to a cumulative distribution function, F ( σ ), where the , λ, and the scaling parameter, σ, are compare the performance of the proposed L-estimator to that unknown. The best unbiased estimator of (λ, σ) under the of several state-of-the-art denoising methods such as the total squared error (MSE) criterion was found in [25]. This variation filter [30], and three different implementations of approach, however, requires computation and inversion of the wavelet-based filters namely empirical Bayes [31], Stein’s matrices of order statistics and is often prohibitive. Unbiased Estimate of Risk (SURE) [32] and False Discovery To overcome this, the authors of [26] proposed a choice of Rate (FDR) [33]. coefficients resulting in an approximately minimum , Our simulations use the image in Fig. 1, which has N = 2 while depending only on F (·) and the probability density (512) pixels. As there is no universally-used performance function (pdf), and only requiring inversion of a 2 × 2 matrix. metric for image reconstruction, we consider several well- Our interest in this work lies in applications of order statistics known ones: (i) the MSE normalized by N; (ii) the peak to image processing, where the median filter is the most popular signal-to-noise ratio (PSNR), measured in dB; (iii) the structural choice [27]. The work in [20] also applies the L-estimator to similarity (SSIM) index [34], taking values between 0 and 1 image processing in a setting where the image is assumed to where 1 is perfect reconstruction; and (iv) the image quality be corrupted by additive noise and the optimal MSE estimator index IQI [35], taking values between −1 and 1 where 1 is of [25] was used. A comprehensive survey of applications of perfect reconstruction. order statistics to digital image processing can be found in [19]. B. 
Image Denoising in Salt and Pepper Noise For image processing, using a parametric scale-and-shift We analyze gray scale image denoising where pixels are model might be too simplistic as it only models additive noise typically 8-bit data values ranging from 0 (black) to 255 and a variety of widely-used image processing noise models, (white). We use an observation model where an unknown such as salt and pepper or speckle noise, cannot be modeled as pixel x ∈ [0 : 255] is corrupted by the salt and pepper noise. additive. Moreover, the majority of the distortions encountered Let P (x is corrupted by pepper noise|x is noisy) = ρ and in practice are discrete in nature, and hence one needs to 1 P (x is noisy) = ρ. We model the noisy observation X with a work with discrete, instead of continuous, order statistics. probability mass function (pmf): Another issue that arises with the aforementioned approaches to choosing the optimal coefficients in (25) is the use of the P (x corrupted by pepper noise)=P (X =0)=ρ1ρ, (26a) MSE as the fidelity criterion. Indeed, it turns out that the MSE P (x noise-free) = P (X = x) = 1 − ρ, (26b) is not a good approximation of the human perception of image P (x )=P (X =255)=(1−ρ )ρ. fidelity [28], [29]. Thus, coefficients that are optimal for the corrupted by salt noise 1 MSE might not be the best choice if the goal is to optimize (26c) the human perceptual criterion for image quality. In the above, ρ corresponds to the percentage of pixels We will use the measures in Section III to choose the L- corrupted by noise, and ρ1 is the percentage of pixels corrupted estimator coefficients. This approach benefits from the fact by pepper noise. The pseudocode in Algorithm 1 summarizes our general 1 image denoising algorithm based on the L-estimator. In particular, we use a square-shaped window of size w × w 0.8 ) to sample the pixels of an image. Moreover, if ρ1 and ρ are n 0.6

unknown, their estimates can be computed as i, X

( 0.4 1 PN 1 r ρˆ = t=1 {Xt=0} 0.2 1 PN (1{X =0} + 1{X =255}) t=1 t t (27) 0 N 0 5 10 15 1 X ρˆ = (1 + 1 }), N {Xt=0} {Xt=255 i t=1 0.8 where 1{·} is the indicator function, and N is the number of pixels in the image. The estimators in (27) perform well if the 0.6 original image contains very few pixel values exactly equal to 0.4

0 and 255, but since these are the extremes of possible pixel pmf values, this is often reasonable to assume. 0.2 Choosing r1(·, ·) as the performance metric offers several benefits. First, the received data Xn for n = w2 is discrete, and 0 hence entropy is a natural choice for informativeness measure. 0 50 100 150 200 250 Second, the measures r2(·, ·) and r3(·, ·) depend on the values Support of X n of the support of X . Thus, one would need to specify the n Fig. 2: ρ = 0.3; ρ1 = 0.05. Above: r1(i, X ) for i ∈ [n], n = 16; value of the unknown parameter x in (26). In contrast, the Below: pmf of X. measure r1(·, ·) does not depend on the support values but only on the relative positions of the support points. Hence, the Low-Noise Regime, ρ < 0.5. In this regime, the noise-free parameter x can be left unspecified, and we only assume that pixels are the most common or typical. Now, recall that the it lies in the range [0 : 255]. entropy can be interpreted as the average rate at which a stochastic source produces information, where typical events Algorithm 1 Image denoising based on the L-estimator. are assigned less weight than extreme probability events. Hence, n Input: Image; Size w of the square-shaped window; Probabil- we expect that r1(i, X ) is smaller for values of i that fall in the middle chunk of samples (that consists of noise-free pixels) ities ρ1 and ρ or their estimates in (27). Output: Reconstructed image. compared to values of i corresponding to other samples. Hence, in this regime, we choose the coefficients of the L-estimator 1: Set the length of the sequence n = w2. Sample the square window of size w × w and collect the samples in a vector to be inversely proportional to r1(·, ·) as shown in (28) for of length n. This constitutes the noisy sequence Xn = ρ < 0.5, where the normalization is needed to ensure that the estimator is unbiased. {Xi, for i ∈ [n]}. n As an example, we consider a low-noise regime with ρ = 0.3 2: Compute r1(k, X ) in Theorem 1 for all k ∈ [n] by using (26). and ρ1 = 0.05, where we expect that roughly 30% of the image is corrupted by noise and the noise is mostly salt. In Fig. 2, 3: Compute the coefficients αk’s for all k ∈ [n] for the L- n estimator in Definition 3 as follows: we plot the measure r1(i, X ) for i ∈ [n] and n = 16 (i.e., 4 × 4 window). Observe that in this regime, approximately −1 n r1 (k, X ) 0.24 samples are corrupted by pepper noise, 4.56 samples are If ρ < 0.5, assign αk = n , P r−1 (i, Xn) 11.2 i=1 1 (28) corrupted by salt noise, and samples are noise-free. r (k, Xn) We now show that in the low-noise regime, our procedure in otherwise, assign α = 1 . k Pn n Algorithm 1 competes with some of the state-of-the-art filters. i=1 r1(i, X ) The simulation results are presented in Fig. 3. In the simulation, 4: Apply the L-estimator in Definition 3 to the samples in estimated values of the parameters ρ and ρ are used to train Step 1 with the coefficients in (28). 1 the L-estimator. The estimates are computed as in (27) and are given by ρˆ = 0.3007 and ρˆ1 = 0.0508 (recall the true We now explain our choice of the coefficients in (28) for the values are ρ = 0.3 and ρ1 = 0.05). The coefficients of the low-noise regime, i.e., ρ < 0.5, and for the high-noise regime, L-estimator in (25) are computed by using (28) for ρ < 0.5 i.e., ρ ≥ 0.5. We start by noting that in the ordered sample where the values of r1(·, ·) are those in Fig. 2. From Fig. 
3, 2 X(1),...,X(n) with n = w , approximately: (i) the first ρ1ρn we observe that we have the following performance across the samples are corrupted by pepper noise; (ii) the middle chunk four considered metrics: of samples of length (1−ρ)n consists of noise-free pixels; and MSE: Total Variation L-Estimator Median Filter (iii) the last chunk of samples of length (1 − ρ1)ρn consists of pixels corrupted by salt noise. Avg. Filter E. Bayes Filter FDR Filter SURE Filter, (a) Noisy Image. (b) Average Filter. (c) Median Filter. (d) Total Variation Filter. (e) L-Estimator in (28). MSE=0.022, PSNR=10.510, MSE=0.007, PSNR= 15.398, MSE=0.003, PSNR=19.375, MSE = 1.74 · 10−4, PSNR=31.537, MSE = 6.95 · 10−4, PSNR=25.525, SSIM=0.099, IQI=0.037. SSIM=0.366, IQI=0.062. SSIM=0.560, IQI=0.664. SSMI=0.956, IQI=0.130. SSMI=0.914, IQI=0.779.

(f) FDR filter: MSE=0.012, PSNR=13.235, SSIM=0.318, IQI=0.042. (g) SURE filter: MSE=0.013, PSNR=12.972, SSIM=0.195, IQI=0.045. (h) Bayes filter: MSE=0.011, PSNR=13.643, SSIM=0.303, IQI=0.044.

Fig. 3: Denoising salt & pepper noise with ρ = 0.3, ρ1 = 0.05.
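To make the pipeline concrete, here is a minimal NumPy sketch of Algorithm 1 for the salt-and-pepper model in (26) with the coefficient rule (28). The helper names, the small ε-guard against zero-entropy order statistics, and the reflect padding are our choices rather than the authors' implementation; ρ and ρ1 can be the true values or the estimates from (27), and the clean pixel value x is assumed to lie strictly between 0 and 255.

```python
import numpy as np
from math import comb, log

def binom_sf(n, p, k):
    """P(Binomial(n, p) >= k)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def r1_salt_pepper(n, rho, rho1):
    """r1(k, X^n) = H(X_(k)), k = 1..n, under the pmf (26).

    X_(k) = 0 iff at least k pixels are pepper, X_(k) = 255 iff at least
    n - k + 1 pixels are salt, and X_(k) = x otherwise; the entropy does
    not depend on the clean value x (assumed strictly inside (0, 255)).
    """
    p_pepper, p_salt = rho1 * rho, (1.0 - rho1) * rho
    r1 = np.zeros(n)
    for k in range(1, n + 1):
        p0 = binom_sf(n, p_pepper, k)            # P(X_(k) = 0)
        p255 = binom_sf(n, p_salt, n - k + 1)    # P(X_(k) = 255)
        probs = [p0, p255, max(1.0 - p0 - p255, 0.0)]
        r1[k - 1] = -sum(q * log(q) for q in probs if q > 0)
    return r1

def l_estimator_denoise(img, w, rho, rho1):
    """Sketch of Algorithm 1: slide a w x w window, sort, weight by (28)."""
    n = w * w
    r1 = r1_salt_pepper(n, rho, rho1)
    eps = 1e-12                                  # guard against degenerate r1 values
    weights = 1.0 / (r1 + eps) if rho < 0.5 else r1 + eps
    alpha = weights / weights.sum()              # coefficients in (28)
    pad = w // 2
    padded = np.pad(img.astype(float), pad, mode="reflect")
    out = np.empty(img.shape, dtype=float)
    H, W = img.shape                             # grayscale image assumed
    for i in range(H):
        for j in range(W):
            window = np.sort(padded[i:i + w, j:j + w], axis=None)
            out[i, j] = alpha @ window           # L-estimator (25)
    return np.clip(out, 0, 255)
```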

PSNR: Total Variation ≻ L-Estimator ≻ Median Filter ≻ Avg. Filter ≻ E. Bayes Filter ≻ FDR Filter ≻ SURE Filter,

SSIM: Total Variation ≻ L-Estimator ≻ Median Filter ≻

Avg. Filter FDR Filter E. Bayes Filter SURE Filter, i, X ( 0.5 1 IQI: L-Estimator Median Filter Total Variation r Avg. Filter SURE Filter E. Bayes Filter FDR Filter, 0 where, for a given metric M, the notation A B that 0 10 20 30 A outperforms B when M is considered. The fact that the i median outperforms the total variation when the IQI metric is considered stems from the fact that the median filter allows for a better edge recovery compared to the total variation 0.4 filter. Moreover, the L-estimator outperforms the median for

all considered metrics, and has a competitive performance to pmf 0.2 that of the total variation filter (i.e., the performance is slightly worse over the MSE, PSNR and SSIM metrics, but significantly better over the IQI metric). Finally, the L-estimator outperforms 0 the wavelet-based filters over all metrics. 0 100 200 300 High-Noise Regime, ρ ≥ 0.5. Arguably, the noise-dominated Support of X regime is the most interesting case both theoretically and n Fig. 4: ρ = 0.7; ρ1 = 0.3. Above: r1(i, X ) for i ∈ [n], n = 36; practically. Consider, ρ = 0.7 and ρ1 = 0.3, where we expect Below: pmf of X. that 70% of the image is corrupted by mostly salt noise. In n the majority of the samples is corrupted. We have the following Fig. 4, we plot r1(i, X ) for n = 36 (i.e., 6 × 6 window). Here, approximately 7.56 samples are corrupted by pepper performance across the metrics: noise, 17.64 samples are corrupted by salt noise, and 10.8 MSE, PSNR: L-Estimator FDR filter = E. Bayes filter samples are noise-free. Thus, noisy pixels are the most common, = SURE filter Avg. Filter Total Variation, which is a fundamental difference from the low-noise regime, and justifies our choice of the L-estimator coefficients in (28) SSIM: E. Bayes filter FDR filter Total Variation for ρ ≥ 0.5. In other words, these coefficients are chosen to SURE filter Avg. Filter Median Filter, be directly proportional to r1(·, ·). The performance of the IQI: Total Variation FDR filter = E. Bayes filter proposed filter is evaluated in Fig. 5 (top (a)-(h)), where the L-Estimator Avg. Filter SURE filter, estimates of ρ and ρ1 are computed from (27) and given by ρˆ = 0.7003 and ρˆ1 = 0.2995. where, for a given metric M, the notation A B means that We observe that the median filter performs the worst for A outperforms B when M is considered. The result above all the four considered image quality metrics, except for the suggests that the L-estimator is very much competitive with SSIM metric where it outperforms the L-estimator. This is the total variation filter and wavelet-based filters, and most of expected since the median filter performance degrades once the time it also outperforms the average filter. It is also worth (a) Noisy Image. (b) Average Filter. (c) Median Filter. (d) Total Variation Filter. (e) L-Estimator in (28). MSE=0.055, PSNR=6.548, MSE=0.015, PSNR=12.177, MSE=0.032, PSNR=8.958, MSE= 0.021, PSNR=10.666, MSE=0.010, PSNR=14.043, SSIM=0.010, IQI=0.007. SSIM=0.205, IQI=0.019. SSIM=0.190, IQI=0.016. SSIM=0.750, IQI=0.136. SSIM= 0.114, IQI=0.021.

(f) FDR filter: MSE=0.014, PSNR=12.593, SSIM=0.770, IQI=0.021. (g) SURE filter: MSE=0.014, PSNR=12.620, SSIM=0.726, IQI=0.016. (h) Empirical Bayes Filter: MSE=0.014, PSNR=12.599, SSIM=0.771, IQI=0.021.

(i) Noisy Image: MSE=0.072, PSNR=5.384, SSIM=0.010, IQI=0.005. (j) Median Filter: MSE=0.088, PSNR=4.517, SSIM=0.010, IQI=0.000. (k) Total Variation Filter: MSE=0.073, PSNR=5.336, SSIM=0.017, IQI=0.005. (l) L-Estimator in (28): MSE=0.018, PSNR=11.505, SSIM=0.061, IQI=0.019. (m) Sequential L-Estimator in (29): MSE=0.016, PSNR=11.933, SSIM=0.048, IQI=0.019.

(n) FDR filter. (o) SURE filter. (p) Empirical Bayes Filter. MSE=0.044, PSNR=7.487, MSE=0.048, PSNR=7.166, MSE=0.045, PSNR=7.445, SSIM=0.408, SSIM=0.498, IQI=0.016. SSIM=0.078, IQI=0.006. IQI=0.009.

Fig. 5: Denoising salt & pepper noise. ρ = 0.7, ρ1 = 0.3 top (a)-(h); ρ = 0.8, ρ1 = 0.9 bottom (i)-(p). noting that visually the total variation filter appears to have coefficients. We highlight that d is introduced for computational the worst performance across all the four filters in terms of purposes to speed the simulations, and with reference to Fig. 5 recovering the shapes, but this observation is not captured by we have d = 4. Fig. 5 shows that the total variation and median the SSIM and IQI metrics. filters perform on the level of the noisy image. We also note In the extremely high-noise regime, we observe through that the L-estimator with coefficients as in (29) offers better extensive simulations that the L-estimator performs significantly MSE and PSNR metrics than the L-estimator with coefficients better than the total variation filter and the wavelet-based as in (28), but performs either the same or worse for IQI and filters. The two bottom rows of Fig. 5 (i)-(p) show the filters SSIM metrics. Finally, we highlight that the L-estimator based performance for n = 16 in an extremely noisy setting where on the joint approach in (18) was also simulated, and observed ρ = 0.8 and ρ1 = 0.9. Here, we expect that 80% of the image to offer a similar performance to the sequential L-estimator is corrupted by noise, and this perturbation is dominated by in (28). The performance of all filters is as follows: pepper noise. The estimates of ρ and ρ1 are computed from (27) MSE: Seq. L-Estimator L-Estimator FDR filter and given by ρˆ = 0.7990 and ρˆ1 = 0.8992. In addition to the already used filters, in this regime Fig. 5 also shows the L- E. Bayes filter SURE filter Total Variation, estimator performance where the coefficients are chosen based PSNR: Seq. L-Estimator L-Estimator FDR filter on the sequential approach, discussed in (19), i.e., for i ∈ S¯S, k 1 E. Bayes filter SURE filter Total Variation, n r1(ik,X |Vk−1) αk = , when k ≤ d, and SSIM: FDR filter E. Bayes filter SURE filter Pd r (i ,Xn|V ) k=1 1 k k−1 (29) L-Estimator Seq. L-Estimator Total Variation, α = 0, k > d, k when IQI: Seq. L-Estimator L-Estimator FDR filter where Vk−1 is the set that contains the first k − 1 indices of E. Bayes filter SURE filter Total Variation. ¯S S1 and d is the truncation parameter. The idea is to choose the k-th coefficient by conditioning on the information that has The above suggests that the L-estimator is very much com- been already incorporated into the previously selected k − 1 petitive with the total variation filter and wavelet-based filters. (a) Noisy Image. (b) Mean Filter. (c) Median Filter. (d) Total Variation Filter . (e) L-Estimator. MSE=0.032, PSNR=8.917, MSE=0.015, PSNR=12.141, MSE=0.026, PSNR=9.794, MSE=0.029, PSNR=9.382, MSE=0.014, PSNR=12.533, SSIM=0.020, IQI=0.005. SSIM=0.221, IQI=0.012. SSIM=0.056, IQI=0.002. SSIM=0.237, IQI=-0.001. SSIM=0.627, IQI=0.021.

(f) FDR filter: MSE=0.016, PSNR=11.960, SSIM=0.780, IQI=0.048. (g) SURE filter: MSE=0.016, PSNR=11.977, SSIM=0.775, IQI=0.016. (h) Empirical Bayes filter: MSE=0.016, PSNR=11.960, SSIM=0.781, IQI=0.048.
Fig. 6: Denoising mixed Gaussian noise with µ = [−2, 2], σ² = [0.15, 0.1] and p = [0.5, 0.5].

In particular, L-estimators perform better than wavelet-based denoisers over the MSE, PSNR and IQI metrics, and better than the total variation denoiser over all metrics.

C. Image Denoising in Additive Continuous Noise

( −1 Now we consider image denoising under the signal model 3 10 r X = x + Z, where x is the unknown pixel value and Z is random noise. We consider two example noise distributions, 10−2 Cauchy and mixed Gaussian. In particular, here we focus on 0 5 10 15 20 25 the mixed Gaussian case, and an in the next subsection we i will focus on the case when Z is Cauchy. We also performed simulations for Gaussian Z and observed that the total variation filter always outperforms our proposed L-estimator. We believe 0.8 this is due to the fact that the total variation filter was designed for Gaussian noise perturbation. 0.6 With mixed Gaussian noise, our denoising works as in pdf 0.4 Algorithm 1, but the coefficients of the L-estimator are now 2 chosen with respect to the r3(·, ·) measure : 0.2 r (k, Xn) α = 3 , (30) 0 k Pn n −5 0 5 i=1 r3(i, X ) x where n = 25 (i.e., 5 × 5 window). The Gaussian mixture n has two components with means µ = [−2, 2], σ2 = Fig. 7: r3(i, X ), for i ∈ [n] and n = 25 (top), and pdf of the mixed Gaussian Z with µ = −2 2 , [0.15, 0.1] and weights p = [0.5, 0.5]. Fig. 7 shows r (i, Xn). 3 σ2 = 0.15 0.1 and p = 0.5 0.5 (bottom). Fig. 6 shows all filters performance for this setting, assuming known µ, σ and p. The performance of all filters is as follows: Here the L-estimator always outperforms the total variation filter. Moreover, it outperforms all filters over the MSE and MSE: L-Estimator Avg. Filter SURE filter PSNR metrics and its performance is comparable to those of E. Bayes filter = FDR filter Median Filter, wavelet-based filters over the SSIM and IQI metrics. PSNR: L-Estimator Avg. Filter SURE filter D. Cauchy Noise Distribution FDR filter E. Bayes filter Median Filter, Now we consider a continuous noise model as discussed SSIM: E. Bayes filter FDR filter SURE filter in Section V-C. We let the noise, Z, be distributed according L-Estimator Avg. Filter Median Filter, to a . This is a heavy tail distribution that IQI: E. Bayes filter FDR filter L-Estimator models impulsive noise, which occurs commonly in image SURE filter Avg. Filter Median Filter. processing applications [36]. In the presence of Cauchy noise, our denoising algorithm works as in Algorithm 1, however, the 2 coefficients of the L-estimator in (25) are now chosen with Simulations were performed also for the r2(·, ·) measure and observed to have similar performance as for the r3(·, ·) measure. respect to the r3(·, ·) measure as in (30). (a) Noisy Image. (b) Mean Filter. (c) Median Filter. (d) Total Variation Filter. (e) L-Estimator. MSE=0.052, PSNR=6.765, MSE=0.021, PSNR=10.810, MSE=0.052, PSNR=6.765, MSE=0.007, PSNR=15.682, MSE=0.004, PSNR=18.471, SSIM=0.690, IQI=0.513 SSIM=0.765, IQI=0.529. SSIM=0.690, IQI=0.653. SSIM=0.887, IQI=0.710. SSIM=0.770, IQI=0.732. Fig. 8: Denoising Cauchy noise with parameter γ = 0.0002.
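Since, by Theorem 1 (with V = ∅ and a single index), r3(k, X^n) is simply the variance of the k-th order statistic and is invariant to the unknown shift x, the coefficients in (30) can be estimated by plain Monte Carlo. The sketch below (function names and sample sizes are ours, not the authors' implementation) does this for the Gaussian-mixture noise of this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_gaussian_noise(size):
    """Two-component Gaussian mixture used in Section V-C (Fig. 6 and Fig. 7)."""
    comp = rng.integers(0, 2, size=size)                     # weights p = [0.5, 0.5]
    means = np.where(comp == 0, -2.0, 2.0)                   # mu = [-2, 2]
    stds = np.where(comp == 0, np.sqrt(0.15), np.sqrt(0.1))  # sigma^2 = [0.15, 0.1]
    return rng.normal(means, stds)

def r3_coefficients(noise_sampler, n=25, trials=200_000, x=128.0):
    """Monte Carlo sketch of the coefficient rule (30).

    Simulate windows X = x + Z, sort each window, estimate Var(X_(k)) per
    position (this is r3(k, X^n) by Theorem 1), and normalize into
    L-estimator weights.
    """
    windows = x + noise_sampler((trials, n))
    order_stats = np.sort(windows, axis=1)        # row-wise order statistics
    r3 = order_stats.var(axis=0)                  # estimates of Var(X_(k))
    return r3 / r3.sum()

alpha = r3_coefficients(mixed_gaussian_noise)     # weights for the 5x5 window
```

For heavy-tailed noise such as the Cauchy model of Section V-D, the extreme positions have infinite variance (r3(1, X^n) = r3(n, X^n) = ∞, as noted with Fig. 9), so they should be excluded or handled through closed-form expressions rather than raw simulation.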

continuous Uniform random samples. As an application, the proposed measures have been used to choose the coefficients of the L-estimator filter to denoise an image corrupted by random noise. To show the utility of our approach, several examples of various noise mechanisms (e.g., salt and pepper,

mixed Gaussian) have been considered, and the proposed filters have been shown to be competitive with off-the-shelf filters (e.g., median, total variation and wavelet).

APPENDIX A
JOINT DISTRIBUTION OF k ORDERED STATISTICS
A. Discrete Random Variables

Lemma 5. Let X1, X2, . . . , Xn be i.i.d. r.v. from a discrete distribution with cumulative distribution function F(x). Let S = {(i1, i2, . . . , ik) : 1 ≤ i1 < i2 < . . . < ik ≤ n} and let P(X(S) = x(S)) := P(∩i∈S {X(i) = x(i)}), where x(i) denotes the observation associated to index i. Then, P(X(S) =

x(S)) is non-zero only if x(i1) ≤ x(i2) ≤ ... ≤ x(ik) and when 0 this is true we have that −1 −0.5 0 0.5 1  x P X(S) = x(S) ·10−3 k   Fig. 9: Cauchy random variable Z with x0 = 0 and γ = 0.0002. X v−1 X − n = F(S)(x(S)) − (−1) F(I,Ic)(x(I), x(Ic)) , Top: r3(i, X ), for i ∈ [n] with n = 25; Bottom: pdf of Z. v=1 I⊆S |I|=v Using location parameter, x0 = 0, and , n (31a) γ = 0.0002, in Fig. 9 we plot r3(i, X ) for i ∈ {2, . . . , n − 1}  −  and n = 25 (i.e., 5×5 window) and the pdf of Z. We highlight F(I,Ic) x , x(Ic) n n (I) that r3(1,X ) = r3(n, X ) = ∞, which is due to the infinite n k  Pk  variance of the Cauchy distribution. However, r3(i, X ) < ∞ X Y n − u=j+1 tu − = g(t[k], {x(I) ∪ x(Ic)}), (31b) for i ∈ {2, . . . , n − 1}, as we observe from Fig. 9. tj t ∈T j=1 Fig. 8 shows the performance of all the four filters for the [k] g(t , y )=[1−F (y )]tk [F (y )−F (y )]tk−1 ... case where the Cauchy scale parameter is given by γ = 0.0002, [k] (S) (ik) (ik) (ik−1) Pk and it is assumed to be known. In this example, the L-estimator t1 n− u=1 tu [F (y(i2)) − F (y(i1))] [F (y(1))] , (31c) has the best performance as compared to all other filters across Pk all four metrics, except the SSIM where the total variation where T = {t[k] ≥ 0 : m=j tm ≤ n − ij, ∀j ∈ [k]}, with filter has a slightly better performance. It is also important to t[k] = {t1, t2, . . . , tk}, tu ≥ 0 ∀u ∈ [k]. note that the MSE and PSNR metrics might not be meaningful Proof. For all t ∈ [k], we define the event in this case since the Cauchy noise has infinite variance. c (At) = {X(it) = x(it) | X(it) ≤ x(it)}, (32) VI.CONCLUSION c This work has proposed an information-theoretic framework where (·) denotes the complement of the event. First notice for finding the order statistic that contains the most information that by De Morgan’s Law we have that  about the random sample. Specifically, the work has proposed P X(S) = x(S) | X(S) ≤ x(S) three different information-theoretic measures to quantify the = P ∩k (A )c | X ≤ x  informativeness of order statistics. As an example, all three t=1 t (S) (S)  k c  measures have been evaluated for discrete Bernoulli and = P ∪t=1At |X(S) ≤ x(S) k  − = 1 − P ∪t=1At | X(S) ≤ x(S) . (33) Equivalently, we also note that F(I,Ic)(x(I), x(Ic)) can be computed as the probability that: Next we study the probability on the right side of (33). First, • i ∈ I j ∈ [k] (n − i ) applying the inclusion-exclusion principle and, for any subset For all j with there are at most j observations greater than or equal to x(ij ); I ⊆ S, defining the event AI := ∩i∈I Ai, we find c • For all it ∈ I with t ∈ [k] there are at most (n − it) k  P ∪t=1At | X(S) ≤ x(S) observations greater than x(it). k Thus, computing P (X = x ) boils down to computing X  X  (S) (S) = (−1)t−1 P A | X ≤ x  . (34) − I (S) (S) F(I,Ic)(x(I), x(Ic)) for all subsets I ⊆ S. Finally, simple t=1 I⊆S − counting techniques are used to show that F c (x , x c ) |I|=t (I,I ) (I) (I ) is equal to (31b) with the function g(·, ·) is defined in (31c). Next notice that P (X | Y) = P (X , Z | Y) for Z ⊆ Y. Then, This concludes the proof of Lemma 5. for any set I ⊆ S, denoting Ic = S\I, B. Continuous Random Variables P (AI | X ≤ x ) (S) (S) We state a lemma from [37] that computes the joint = P (AI ,X(Ic) ≤ x(Ic) | X(S) ≤ x(S)) distribution of k order statistics, and is the counterpart of = P (X(I) < x(I),X(Ic) ≤ x(Ic) | X(S) ≤ x(S)), (35) Lemma 5 for the case of continuous random variables. where in the last equality we use the definition of A’s from (32). 
Lemma 6. Let X1,X2,...,Xn be i.i.d. r.v. from an abso- Now combining (33)-(35), we have that lutely continuous distribution with cumulative distribution function F (x) and probability density function f(x). Let P (X(S) = x(S) | X(S) ≤ x(S)) = 1− S = {(i1, i2, . . . , ik) : 1 ≤ i1 < i2 < . . . < ik ≤ n} and k X t−1 X (−1) P (X

(36) denotes the observation associated to index i. Then, fX(S) (x(S)) is non-zero only if −∞ < x < x < . . . < x < ∞, We now note that the event in the conditioning in (36), namely, (i1) (i2) (ik) and, when this is true, its expression is given by X(S) ≤ x(S), is a superset of the other event considered, X(S) = x(S). It therefore follows that by multiplying both fX(S) (x(S)) sides of (36) by P (X(S) ≤ x(S)), we obtain our probability k k+1 of interest. In other words, Y Y  it−it−1−1 = g(n, i(S)) f(x(it)) F (x(it)) − F (xit−1 ) ,  t=1 t=1 P (X(S) ≤ x(S))P X(S) = x(S) | X(S) ≤ x(S)  where x(i ) = −∞, x(i ) = +∞, and, with i0 = 0 and = P X(S) = x(S) and X(S) ≤ x(S) 0 k+1 ik+1 = n + 1, = P (X(S) = x(S)). n! g(n, i ) = . Using the above in (36), we find a representation for P (X(S) = (S) Qk+1 t=1 (it − it−1 − 1)! x(S)) as: APPENDIX B P (X = x ) = P (X ≤ x ) (S) (S) (S) (S) PROOFOF LEMMA 2 k X v−1 X  First, for any i ∈ [n], by Lemma 5, we have − (−1) P X(I) < x(I),X(Ic) ≤ x(Ic) . (37) n v=1 I⊆S X n |I|=v P (X = 0) = (1 − p)kpn−k, (i) k We finally note that the probability on the right side of (37) k=i is equal to the result given in (31a), which can be seen by P (X(i) = 1) = 1 − P (X(i) = 0). defining, F(S)(x(S)) := P (X(S) ≤ x(S)), and for all I ⊆ S, Thus, X(i) is Bernoulli distributed with success probability   −  v(i), i.e., X(i) ∼ Ber(v(i)), where F(I,Ic) x(I), x(Ic) :=P X(I)

observations less than or equal to x(it). random variable. n We first consider the measure r1(i, X ) = H(X(i)) where APPENDIX C the equality follows by Theorem 1. Since X(i) ∼ Ber(v(i)), PROOFOF LEMMA 4 the entropy is given by 0 1 If Xis are i.i.d. ∼ U(0, a), then a X(i) ∼ Beta(i, n − i + 1) n r1(i, X ) = H(X(i)) = hb(v(i)), (39) with mean and variance given by 2 where hb(t) := −t log(t) − (1 − t) log(1 − t) is defined to be ai a i(n + 1 − i) E[X(i)] = , and Var(X(i)) = . (44) the binary entropy function. n + 1 (n + 1)2(n + 2) n Next, we focus on the metric r3(i, X ). By Theorem 1, we Thus, by Theorem 1, we have n 2 have r3(i, X ) = E[(X(i) − E[X(i)]) ] = Var(X(i)). By the n  2 result just discussed, X(i) ∼ Ber(v(i)) and therefore r3(i, X ) = E (X(i) − E[X(i)]) a2i(n + 1 − i) Var(X(i)) = v(i)(1 − v(i)) = P (B < i)P (B ≥ i). (40) = Var(X(i)) = 2 . n (n + 1) (n + 2) Finally, we study the measure r2(i, X ). We have n n  n n 2 By taking the first derivative of r3(i, X ) above with respect r2(i, X ) = E kE[X ] − E[X |X(i)]k ? n to i and equating it to zero, we obtain i3(X ) as in (24c). n n X  2 We now compute r2(i, X ). Using (12), we have = E (E[Xj] − E[Xj|X(i)]) . (41) n  n n 2 j=1 r2(i, X ) = E kE[X ] − E[X |X(i)]k n Now consider just a single term inside the sum in (41): X = ( [X ] − [X |X ])2 . (45)  2  2 E E j E j (i) E (E[Xj] − E[Xj|X(i)]) = E (p − E[Xj|X(i)]) j=1 2  2   = p + E (E[Xj|X(i)]) − 2pE E[Xj|X(i)] Now we look at computing the expectation E[Xj|X(i) = x(i)].  2 2 = E (E[Xj|X(i)]) − p . (42) By the law of total expectation, Moreover, we notice that E[Xj|X(i) = x(i)]  2 2   E (E[Xj|X(i)]) = P (X(i) = 1) E[Xj|X(i) = 1] = E Xj|X(i) = x(i), {Xj = X(i)} P (Xj = X(i)) 2   + P (X(i) = 0) E[Xj|X(i) = 0] . (43) + E Xj|X(i) = x(i), {Xj < X(i)} P (Xj < X(i)) + X |X = x , {X > X } P (X > X ). (46) With the above in mind, we study the expectations E[Xj|X(i) = E j (i) (i) j (i) j (i) 1] = P (Xj = 1|X(i) = 1) and E[Xj|X(i) = 0] = P (Xj = Now we simplify the three terms of the above. First notice 1|X(i) = 0). First, by Bayes rule, that the probabilities can be computed using the fact that any P (X(i) = 1|Xj = 1)P (Xj = 1) Xj is equally likely to produce the i-th order statistic, so P (Xj = 1|X(i) = 1) = P (X(i) = 1) 1 P (Xj = X(i)) = , p · P (X(i) = 1|Xj = 1) n = . i − 1 v(i) P (X < X ) = , j (i) n Now we study the probability P (X(i) = 1|Xj = 1). First n − i notice that this equals the probability that there are at least P (Xj > X(i)) = . 0 n n n − i + 1 total 1 s in the sample X , given that Xj = 1, or in Next we compute the expectations in (46). Clearly, other words, this equals the probability that there are at least   n − i total 10s from the n − 1 other sample values (excluding E Xj|X(i) = x(i), {Xj = X(i)} = x(i). Moreover, we note the jth one). Using this rationale, that Xj is independent of the event {X(i) = x(i)} given {Xj > X } and hence n−1   (i) X n − 1 n−1−k k P (X(i) = 1|Xj = 1) = (1 − p) p     a + x(i) k E Xj|X(i), {Xj > X(i)} = E Xj|Xj > x(i) = . k=n−i 2 i−1 X n − 1 Similarly, = (1 − p)kpn−1−k = P (B0 < i), k     x(i) k=0 X |X , {X < X } = X |X < x = . E j (i) j (i) E j j (i) 2 where B0 ∼ Binomial(n−1, 1−p). Putting this all together, we p 0 Plugging these results into (46), we find have that E[Xj|X(i) = 1] = v(i) P (B < i). Similar reasoning, and the fact that P (X(i) = 0|Xj = 1) = 1 − P (X(i) = 2nE[Xj|X(i) = x(i)] = 2x(i) + (i − 1)x(i) + (n − i)(a + x(i)) p 0 1|Xj = 1), shows that [Xj|X(i) = 1] = P (B ≥ i). E 1−v(i) = (1 + n)x(i) + a(n − i). 
(47) Now, plugging the above results into the work in (41)-(43), 2 2 Now we use the result in (47) to simplify (45). First, n np 0 2 np 0 2 2 r2(i, X ) = [P (B < i)] + [P (B ≥ i)] − np , n v(i) 1 − v(i) n X  2 r2(i, X ) = E (E[Xj] − E[Xj|X(i)]) where recall that v(i) = P (B < i). j=1 n X = ( [X ])2 − 2 [X ]  [X |X ] + ( [X |X ])2 [9] K. M. Wong and S. Chen, “The entropy of ordered sequences and order E j E j E E j (i) E E j (i) statistics,” IEEE Transactions on Information Theory, vol. 36, no. 2, pp. j=1 276–284, 1990. n [10] N. Ebrahimi, E. S. Soofi, and H. Zahedi, “Information properties of 2 X  2 order statistics and spacings,” IEEE Transactions on Information Theory, = −n(E[X1]) + E (E[Xj|X(i)]) , vol. 50, no. 1, pp. 177–183, 2004. j=1 [11] A. Guionnet and D. Shlyakhtenko, “On classical analogues of free entropy dimension,” Journal of Functional Analysis, vol. 251, no. 2, pp. 738–771, where in the final equality we have used E[Xj]E[E[Xj|X(i)]] = 2 2 2 2007. (E[Xj]) and that (E[Xj]) = (E[X1]) for all j ∈ [n]. [12] Y. Wu and S. Verdú, “Optimal phase transitions in compressed sensing,” 2 2 Therefore, using that n(E[X1]) = na /4 and plugging the IEEE Transactions on Information Theory, vol. 58, no. 10, pp. 6241–6263, result in (47) into the above, we have 2012. [13] D. Guo, S. Shamai, and S. Verdú, “Additive non-Gaussian noise channels: 2 Mutual information and conditional mean estimation,” in Proceedings. n −na 1 h 2i r2(i, X ) = + E (1 + n)X(i) + a(n − i) International Symposium on Information Theory (ISIT), 2005, pp. 719– 4 4n 723. −na2 (n + 1)2  a(n − i)2 = + X + [14] ——, “Mutual information and minimum mean-square error in Gaussian 4 4n E (i) n + 1 channels,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1282, 2005. −na2 = [15] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation 4 Theory. Prentice Hall, 1997. (n + 1)2  a(n − i)2 a(n − i) [16] V. V. Prelov and S. Verdú, “Second-order asymptotics of mutual 2 information,” IEEE Transactions on Information Theory, vol. 50, no. 8, + E[X(i)] + + 2E[X(i)] . 4n n + 1 n + 1 pp. 1567–1580, 2004. [17] W. Feller, An Introduction to and its Applications. Next, note that by (44), John Wiley & Sons, 2008, vol. 2. 2 2 [18] P. J. Huber, Robust Statistics. John Wiley & Sons, 2004, vol. 523. E[X(i)] = Var[X(i)] + (E[X(i)]) [19] I. Pitas and A. N. Venetsanopoulos, “Order statistics in digital image a2i(n + 1 − i) a2i2 a2i(i + 1) processing,” Proceedings of the IEEE, vol. 80, no. 12, pp. 1893–1921, = + = . 1992. (n + 1)2(n + 2) (n + 1)2 (n + 1)(n + 2) [20] A. Bovik, T. Huang, and D. Munson, “A generalization of median filtering using linear combinations of order statistics,” IEEE Transactions on Therefore, using the above and E[X(i)] = ai/(n + 1), Acoustics, Speech, and Signal Processing, vol. 31, no. 6, pp. 1342–1350, 1983. n r2(i, X ) [21] R. Viswanathan, “Order statistics application to CFAR radar target −na2 detection,” Handbook of Statistics, vol. 17, pp. 643–671, 1998. = [22] H.-C. Yang and M.-S. Alouini, Order Statistics in Wireless Communi- 4 cations: Diversity, Adaptation, and Scheduling in MIMO and OFDM (n + 1)2  a2i(i + 1) a2(n − i)2 2a2i(n − i) Systems. Cambridge University Press, 2011. [23] H. Chernoff, J. L. Gastwirth, and M. V. Johns, “Asymptotic distribution + + 2 + 2 4n (n + 1)(n + 2) (n + 1) (n + 1) of linear combinations of functions of order statistics with applications a2[(n + 1)i(i + 1) + (n + i)(n − i)(n + 2) − n2(n + 2)] to estimation,” The Annals of , vol. 38, no. 1, pp. = 52–72, 1967. 
4n(n + 2) [24] J. Hosking, “L-estimation,” Handbook of Statistics, vol. 17, pp. 215–235, a2i(n + 1 − i) 1998. = , [25] E. Lloyd, “Least-squares estimation of location and scale parameters 4n(n + 2) using order statistics,” Biometrika, vol. 39, no. 1/2, pp. 88–95, 1952. ? n [26] G. Blom, “Nearly best linear estimates of location and scale parameters,” which has maximum value for i2(X ) as reported in (24c). Contributions to Order Statistics, vol. 3446, 1962. [27] J. Tukey, “Nonlinear (nonsuperposable) methods for smoothing data,” REFERENCES Proc. Cong. Rec. EASCOM’74, pp. 673–681, 1974. [1] C. R. Rao and V. Govindaraju, Handbook of Statistics. Elsevier, 2006, [28] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A vol. 17. new look at signal fidelity measures,” IEEE Signal Processing Magazine, [2] H. A. David and H. N. Nagaraja, Order Statistics, Third edition. John vol. 26, no. 1, pp. 98–117, 2009. Wiley & Sons, 2003. [29] T. N. Pappas, R. J. Safranek, and J. Chen, “Perceptual criteria for image [3] S. Baratpour, J. Ahmadi, and N. R. Arghami, “Some characterizations quality evaluation,” Handbook of Image and Video Processing, vol. 110, based on entropy of order statistics and record values,” Communications 2000. in Statistics-Theory and Methods, vol. 36, no. 1, pp. 47–57, 2007. [30] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based [4] ——, “Characterizations based on rényi entropy of order statistics and noise removal algorithms,” Physica D: Nonlinear Phenomena, vol. 60, record values,” Journal of Statistical Planning and Inference, vol. 138, no. 1-4, pp. 259–268, 1992. no. 8, pp. 2544–2551, 2008. [31] I. M. Johnstone, B. W. Silverman et al., “Needles and straw in haystacks: [5] M. Abbasnejad and N. R. Arghami, “Renyi entropy properties of order Empirical bayes estimates of possibly sparse sequences,” The Annals of statistics,” Communications in Statistics-Theory and Methods, vol. 40, Statistics, vol. 32, no. 4, pp. 1594–1649, 2004. no. 1, pp. 40–52, 2010. [32] D. L. Donoho and J. M. Johnstone, “Ideal spatial adaptation by wavelet [6] N. Balakrishnan, F. Buono, and M. Longobardi, “On cumulative entropies shrinkage,” biometrika, vol. 81, no. 3, pp. 425–455, 1994. in terms of moments of order statistics,” arXiv preprint arXiv:2009.02029, [33] A. Pizurica, A. M. Wink, E. Vansteenkiste, W. Philips, and B. J. Roerdink, 2020. “A review of wavelet denoising in MRI and ultrasound brain imaging,” [7] G. Zheng, N. Balakrishnan, and S. Park, “Fisher information in ordered Current Medical Imaging, vol. 2, no. 2, pp. 247–260, 2006. data: A review,” Statistics and its Interface, vol. 2, pp. 101–113, 2009. [34] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image [8] A. Dytso, M. Cardone, and C. Rush, “Measuring dependencies quality assessment: From error visibility to structural similarity,” IEEE of order statistics: An information theoretic perspective,” arXiv: Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. https://arxiv.org/abs/2009.12337, to appear in IEEE ITW 2020, September [35] Z. Wang and A. C. Bovik, “A universal image quality index,” IEEE 2020. Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002. [36] V. Barnett, “Order statistics estimators of the location of the Cauchy distribution,” Journal of the American Statistical Association, vol. 61, no. 316, pp. 1205–1218, 1966. [37] B. C. Arnold, N. Balakrishnan, and H. N. Nagaraja, A First Course in Order Statistics. Siam, 1992, vol. 54.