Compressibility Analysis of Asymptotically Mean Stationary Processes

Jorge F. Silva∗, Information and Decision Systems, Department of Electrical Engineering, University of Chile, Av. Tupper 2007, Room 508, Santiago, Chile

Abstract

This work provides new results for the analysis of random sequences in terms of $\ell_p$-compressibility. The results characterize the degree to which a random sequence can be approximated by its best $k$-sparse version under different rates of significant coefficients (compressibility analysis). In particular, the notion of strong $\ell_p$-characterization is introduced to denote a random sequence that has a well-defined asymptotic limit (sample-wise) of its best $k$-term approximation error when a fixed rate of significant coefficients is considered (fixed-rate analysis). The main theorem of this work shows that the rich family of asymptotically mean stationary (AMS) processes has a strong $\ell_p$-characterization. Furthermore, we present results that characterize and analyze the $\ell_p$-approximation error for this family of processes. Adding ergodicity to the analysis of AMS processes, we introduce a theorem demonstrating that the approximation error function is constant and determined in closed form by the stationary mean of the process. Our results and analyses contribute to the theory and understanding of discrete-time sparse processes and, on the technical side, confirm how instrumental the point-wise ergodic theorem is to determine the compressibility expression of discrete-time processes even when stationarity and ergodicity assumptions are relaxed.

Keywords: Sparse models, discrete-time random processes, best $k$-term approximation error analysis, compressible priors, processes with ergodic properties, AMS processes, the ergodic decomposition theorem.

1. Introduction

Quantifying sparsity and compressibility for random sequences has been a topic of active research largely motivated by the results on sparse signal recovery and compressed sensing (CS) [1, 2, 3, 4, 5, 6, 7]. Sparsity and compressibility can

Email address: [email protected] (Jorge F. Silva∗)

be understood, in general, as the degree to which one can represent a random sequence (perfectly and loosely, respectively) by its best $k$-sparse version in the non-trivial regime when $k$ (the number of significant coefficients) is smaller than the signal or ambient dimension. Various forms of compressibility for a random sequence have been used in signal processing problems, for instance in regression [8], signal reconstruction (in the classical random Gaussian linear measuring setting used in CS) [2, 3], and inference-decision [9, 10]. Compressibility for random sequences has also been used to analyze continuous-time processes [7, 5], for instance, in periodic generalized Lévy processes [11].

A discrete-time process is a well-defined infinite-dimensional random object; however, the standard approach used to quantify compressibility for finite-dimensional signals (based on the rate of decay of the absolute approximation error) does not extend naturally to this infinite-dimensional analysis. Addressing this issue, Amini et al. [1] and Gribonval et al. [2] proposed the use of a relative approximation error analysis to measure compressibility, with the objective of quantifying the rate of the best $k$-term approximation error with respect to the energy of the signal when the number of significant coefficients scales at a rate proportional to the dimension of the signal. This approach offered a meaningful way to determine the energy (and, more generally, the $\ell_p$-norm) concentration signature of independent and identically distributed (i.i.d.) processes [1, 2]. In particular, they introduced the concept of $\ell_p$-compressibility to name a random sequence that has the capacity to concentrate (with very high probability) almost all of its $\ell_p$-relative energy in an arbitrarily small number of coordinates (relative to the ambient dimension) of the canonical or innovation domain.

Two important results were presented for i.i.d. processes. [1, Theorem 3] showed that i.i.d. processes with heavy-tail distributions (including the generalized Pareto, Student's t and log-logistic) are $\ell_p$-compressible for some $\ell_p$-norms. On the other hand, [1, Theorem 1] showed that i.i.d. processes with exponentially decaying tails (such as Gaussian, Laplacian and generalized Gaussians) are not $\ell_p$-compressible for any $\ell_p$-norm. Completing this analysis, Silva et al. [3] stipulated a necessary and sufficient condition over the process distribution to be $\ell_p$-compressible (in the sense of Amini et al. [1, Def. 6]) that reduces to looking at the $p$-moment of the 1D marginal stationary distribution of the process.

Importantly, the proof of the result in [3] was rooted in the almost sure convergence of two empirical distributions (random objects that are functions of the process) to their respective probabilities as the number of samples goes to infinity¹. This argument offered the context to move from using the law of large numbers (to characterize i.i.d. processes) to the use of the point-wise ergodic theorem [12, 13]. Then a necessary and sufficient condition for $\ell_p$-compressibility was obtained for the family of stationary and ergodic sources under the mild assumption that the process distribution projected on one coordinate, i.e., its 1D

¹These almost sure convergences created a family of typical sets that was used to prove the main result in [3, Theorem 1].

marginal distribution on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, has a density [3, Theorem 1]. Furthermore, for non-$\ell_p$-compressible processes, Silva et al. [3] provided a closed-form expression for the so-called $\ell_p$-approximation error function, meaning that a stable asymptotic value of the relative $\ell_p$-approximation error is obtained when the rate of significant coefficients is given (fixed-rate analysis).

Considering that the proof of [3, Theorem 1] relies heavily on an almost sure (with probability one) convergence of empirical means to their respective expectations, the idea of relaxing some of the assumptions on the process, in particular stationarity, suggests an interesting direction in the pursuit of extending results for the analysis of $\ell_p$-compressibility for general discrete-time processes. In this work, we extend the compressibility analysis to a family of random sequences where stationarity or ergodicity is not assumed by examining the rich family of processes with ergodic properties and, in particular, the important family of asymptotically mean stationary (AMS) processes [12, 14]. This family of processes has been studied in the context of source coding and channel coding problems, where its ergodic properties (with respect to the family of indicator functions) have been used to extend fundamental performance limits in source and channel coding problems. Our interest in AMS processes centers on the fact that the $\ell_p$-characterization in [3] is fundamentally rooted in a form of ergodic property over a family of indicator functions; this family of measurable functions is precisely where AMS sources have (by definition) a stable almost-sure asymptotic behavior [12].

1.1. Contributions of this Work

Specifically, we apply a more refined and relevant (sample-wise) almost sure fixed-rate analysis of $\ell_p$-approximation errors, first considered by Gribonval et al. [2], to the analysis of a process. Through this analysis, we determine the relationship between the rate of significant coefficients and the $\ell_p$-approximation of a process in two main results. Our first main result (Theorem 1) shows that this rate vs. approximation error has a well-defined expression as a function of the process distribution (in particular, the stationary mean of the process) for the complete collection of AMS and ergodic processes. This result relaxes stationarity as well as some of the regularity assumptions used in [3, Theorem 1]; consequently, it is a significant extension of that result. As a direct implication of this theorem, we extend the dichotomy of the $\ell_p$-compressible process presented in [3, Theorem 1] to the family of AMS ergodic processes (Theorem 2 in Section 3.3).

The second main result of this work (Theorem 3) uses the ergodic decomposition theorem (EDT) [12] to extend the strong $\ell_p$-characterization to the family of AMS processes, where the ergodicity and stationarity assumptions on the process have been relaxed. Remarkably, we show that this family of processes does have a stable (almost sure) asymptotic $\ell_p$-approximation error for any given rate of significant coefficients as the block length of the analysis tends to infinity. Interestingly, this limiting value is in general a measurable (non-constant) function of the process, which is fully determined by the so-called ergodic decomposition (ED) function that maps elements of the sample space of the process to stationary and ergodic components [12].

1.2. Organization of the Paper

The rest of the paper is organized as follows. Section 2 introduces notation, preliminary results and some basic elements of the $\ell_p$-compressibility analysis. In particular, Section 2.1 introduces the fixed-rate almost sure approximation error analysis that is the focus of this work. Sections 3 and 4 present the two main results of this paper for AMS processes. The summary and final discussion of the results are presented in Section 5. To conclude, Section 6 provides some context for the construction of AMS processes based on the basic principle of passing an innovation process through deterministic (coding) and random (channel) processing stages. Section 7 presents a numerical strategy to estimate the $\ell_p$-approximation error function and some examples to illustrate the main results of this work. The proofs of the two main results (Theorems 1 and 3) are presented in Sections 8 and 9, respectively, while the proofs of supporting results are relegated to the Appendices.

2. Preliminaries

For any vector $x^n = (x_1, \ldots, x_n)$ in $\mathbb{R}^n$, let $(x_{n,1}, \ldots, x_{n,n}) \in \mathbb{R}^n$ denote the ordered vector such that $|x_{n,1}| \geq |x_{n,2}| \geq \cdots \geq |x_{n,n}|$. For $p > 0$ and $k \in \{1, \ldots, n\}$, let
$$\sigma_p(k, x^n) \equiv \left( |x_{n,k+1}|^p + \cdots + |x_{n,n}|^p \right)^{1/p} \tag{1}$$
be the best $k$-term $\ell_p$-approximation error of $x^n$, in the sense that if

n n n n Σk ≡ {x ∈ R : σp(k, x ) = 0}

n n n is the collection of k-sparse signals, then σ (k, x ) = min n n ||x − x˜ || . p x˜ ∈Σk `p Amini et al. [1] and Gribonval et al. [2] proposed the following relative best k-term `p-approximation error

$$\tilde{\sigma}_p(k, x^n) \equiv \frac{\sigma_p(k, x^n)}{\| x^n \|_{\ell_p}} \in [0, 1], \quad k \in \{1, \ldots, n\}, \tag{2}$$
for the analysis of infinite sequences, with the objective of extending notions of compressibility to sequences that have infinite $\ell_p$-norms in $\mathbb{R}^{\mathbb{N}}$. More precisely, let $X_1, \ldots, X_n, \ldots$ be a one-sided random sequence with values in $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$. $(X_n)_{n \geq 1}$ is fully characterized by its consistent family of finite-dimensional probabilities denoted by $\mu = \{\mu_n \in \mathcal{P}(\mathbb{R}^n) : n \geq 1\}$ [13], where $X^n \equiv (X_1, \ldots, X_n) \sim \mu_n$ for all $n \geq 1$ and $\mathcal{P}(\mathbb{R}^n)$ is the collection of probabilities on the space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ [13, 12].²

²In other words, $\mu_n(A) = P\{X^n \in A\}$ for any $A \in \mathcal{B}(\mathbb{R}^n)$ and, therefore, $\mu_n$ is a probability on the measurable space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$.
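As a quick illustration of the quantities in (1) and (2), the following minimal Python sketch computes the best $k$-term $\ell_p$-approximation error and its relative version for a finite vector (the function names are ours, chosen for illustration):

```python
import numpy as np

def sigma_p(x, k, p):
    """Best k-term l_p-approximation error of x, Eq. (1): the l_p-norm
    of the n - k smallest-magnitude entries of x."""
    mags = np.sort(np.abs(x))[::-1]           # |x_{n,1}| >= ... >= |x_{n,n}|
    return np.sum(mags[k:] ** p) ** (1.0 / p)

def sigma_p_rel(x, k, p):
    """Relative best k-term l_p-approximation error, Eq. (2)."""
    return sigma_p(x, k, p) / np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -0.1, 0.05, 2.0, -0.02])
print(sigma_p_rel(x, k=2, p=2.0))             # most l_2 energy sits in two entries
```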

For $d \in (0, 1)$, $n \geq 1$ and $k \in \{1, \ldots, n\}$, let us define the following set

$$A_d^{n,k} \equiv \left\{ x^n \in \mathbb{R}^n : \tilde{\sigma}_p(k, x^n) \leq d \right\}. \tag{3}$$
At this point, we need to introduce the following:

Definition 1. For a process $(X_n)_{n \geq 1}$, equipped with $\mu = \{\mu_n, n \geq 1\}$, and for any $n \geq 1$, a set $A \in \mathcal{B}(\mathbb{R}^n)$ is said to be $\epsilon$-typical for $X^n$ (or $\mu_n$) if

$$\mu_n(A) \geq 1 - \epsilon. \tag{4}$$

For the following three definitions, let $(X_n)_{n \geq 1}$ be a process equipped with a distribution $\mu = \{\mu_n, n \geq 1\}$:

Definition 2. [1, Defs. 5 and 6] For $\epsilon > 0$, $n \geq 1$ and $d \in (0, 1)$, let

$$\tilde{\kappa}_p(d, \epsilon, \mu_n) \equiv \min \left\{ k \in \{1, \ldots, n\} : \mu_n(A_d^{n,k}) \geq 1 - \epsilon \right\} \tag{5}$$

denote the critical number of terms that makes $A_d^{n,k}$ in (3) $\epsilon$-typical for $\mu_n$ (Def. 1).
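The quantity in (5) is defined through $\mu_n$, which is rarely available in closed form; when the process can be simulated, it can be approximated by replacing $\mu_n(A_d^{n,k})$ with an empirical frequency over independent realizations of $X^n$. The following Python sketch (a Monte Carlo approximation of ours, with illustrative names) shows this idea:

```python
import numpy as np

def kappa_p_hat(samples, d, eps, p):
    """Monte Carlo estimate of kappa_tilde_p(d, eps, mu_n) in Eq. (5).
    `samples` is an (m, n) array with m independent realizations of X^n;
    mu_n(A_d^{n,k}) is replaced by an empirical frequency over the rows."""
    m, n = samples.shape
    mags_p = np.sort(np.abs(samples), axis=1)[:, ::-1] ** p    # descending |x|^p
    tails = np.cumsum(mags_p[:, ::-1], axis=1)[:, ::-1]        # tails[:, k] = sum_{j >= k}
    tails = np.concatenate([tails, np.zeros((m, 1))], axis=1)  # sigma_p(n, x^n) = 0
    rel_err = (tails / tails[:, [0]]) ** (1.0 / p)             # sigma_tilde_p(k, x^n)
    for k in range(1, n + 1):
        if np.mean(rel_err[:, k] <= d) >= 1.0 - eps:
            return k
    return n

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 200))      # i.i.d. Gaussian realizations, n = 200
print(kappa_p_hat(X, d=0.1, eps=0.05, p=2.0))
```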

Using Definition 2, we can study the asymptotic rate of innovation of $(X_n)_{n \geq 1}$ relative to the $\ell_p$-approximation error using $\tilde{\kappa}_p(d, \epsilon, \mu_n)$ in (5):

Definition 3. For any $d \in (0, 1)$ and $\epsilon > 0$, let us define:

$$\tilde{r}_p^+(d, \epsilon, \mu) \equiv \limsup_{n \to \infty} \frac{\tilde{\kappa}_p(d, \epsilon, \mu_n)}{n}, \tag{6}$$

$$\tilde{r}_p^-(d, \epsilon, \mu) \equiv \liminf_{n \to \infty} \frac{\tilde{\kappa}_p(d, \epsilon, \mu_n)}{n}. \tag{7}$$
Alternatively to the expressions in Definition 3, we can consider the following fixed-rate asymptotic analysis for $(X_n)_{n \geq 1}$:

Definition 4. [3, Defs. 5 and 6] Let us consider $\epsilon \in (0, 1)$, $r \in (0, 1)$ and $d \in (0, 1)$. The rate-distortion pair $(r, d)$ is said to be $\ell_p$-achievable for $(X_n)_{n \geq 1}$ with probability $\epsilon$, if there exists a sequence of positive integers $(k_n)_{n \geq 1}$ such that $\limsup_{n \to \infty} \frac{k_n}{n} \leq r$ and

$$\liminf_{n \to \infty} \mu_n(A_d^{n,k_n}) \geq 1 - \epsilon. \tag{8}$$

Then, the rate-approximation error function of $(X_n)_{n \geq 1}$ with probability $\epsilon$ is given by

$$r_p(d, \epsilon, \mu) \equiv \inf \left\{ r \in [0, 1] : (r, d) \text{ is } \ell_p\text{-achievable for } (X_n) \text{ with probability } \epsilon \right\}. \tag{9}$$

In general, it follows that $r_p(d, \epsilon, \mu) \leq \tilde{r}_p^+(d, \epsilon, \mu)$ [3, Prop. 2]. Furthermore, for the important case of stationary and ergodic processes, it was shown in [3, Th. 1] that
$$r_p(d, \epsilon, \mu) = \tilde{r}_p^+(d, \epsilon, \mu) = \tilde{r}_p^-(d, \epsilon, \mu) \quad \text{for all } d \in (0, 1). \tag{10}$$

2.1. Revisiting the $\ell_p$-Approximation Error Analysis

The approximation properties of a process $(X_n)_{n \geq 1}$ presented above rely on a weak convergence (in probability) of the event $A_d^{n,k}$ (see Defs. 2 and 4). Here, we introduce a stronger (almost sure) convergence of the approximation error at a given rate of innovation to study a more essential asymptotic indicator of the best $k$-term $\ell_p$-approximation attributes of $(X_n)_{n \geq 1}$. This notion will be meaningful for a large collection of processes (details presented in Section 2.2), and it will imply specific approximation attributes for $\mu$ in terms of $r_p(d, \epsilon, \mu)$, $\tilde{r}_p^+(d, \epsilon, \mu)$, and $\tilde{r}_p^-(d, \epsilon, \mu)$.

Definition 5. A process $X = (X_n)_{n \geq 1}$, with distribution $\mu = \{\mu_n, n \geq 1\}$, is said to have a strong rate vs. $\ell_p$ best $k$-term approximation error characterization (in short, a strong $\ell_p$-characterization) if, for any $r \in (0, 1]$ and for any non-negative sequence of integers $(k_n)_{n \geq 1}$ satisfying $\frac{k_n}{n} \to r$:
i) the limit $\lim_{n \to \infty} \tilde{\sigma}_p(k_n, X^n)$ is well defined in $\mathbb{R}$ with probability one ($\mu$-almost surely), and

ii) this limit only depends on $r$ and not on the sequence $(k_n)_{n \geq 1}$.

Provided that $X$ has a strong $\ell_p$-characterization, we define and denote the approximation error function of $X$ by

$$f_{p,\mu}(X, r) \equiv \lim_{n \to \infty} \tilde{\sigma}_p(k_n, X^n), \tag{11}$$
which is a function of $X$ and $r$.
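To make the fixed-rate analysis in (11) concrete, the following minimal Python sketch (our own illustration, for an i.i.d. standard Gaussian process) evaluates $\tilde{\sigma}_p(k_n, X^n)$ along a single realization with $k_n = \lfloor r n \rfloor$; the printed values stabilize as $n$ grows, consistent with a strong $\ell_p$-characterization:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 2.0, 0.2                       # fixed rate of significant coefficients
x = rng.standard_normal(200_000)      # one realization of an i.i.d. N(0,1) process

for n in (1_000, 10_000, 100_000, 200_000):
    kn = int(r * n)                   # k_n / n -> r
    mags_p = np.sort(np.abs(x[:n]))[::-1] ** p
    rel_err = (mags_p[kn:].sum() / mags_p.sum()) ** (1.0 / p)
    print(n, rel_err)                 # stabilizes as n grows (Definition 5)
```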

A process with a strong $\ell_p$-characterization has an almost everywhere asymptotic (with $n$) pattern for its $\ell_p$-approximation error when a finite rate of significant coefficients is considered (i.e., a fixed-rate analysis). On top of the structure introduced in Definition 5, a relevant scenario to consider is when the limiting function $f_{p,\mu}(X, r)$ in (11) is constant (independent of $X$) $\mu$-almost surely. This can be interpreted as an ergodic property of $X$ with respect to its best $k$-term $\ell_p$-approximation error, reflecting a typical (almost sure) approximation attribute that is constant for the entire process³. The following result offers a connection between $f_{p,\mu}(X, r)$ and $r_p(d, \epsilon, \mu)$ in this very special case.

Lemma 1. Let us consider a process $X = (X_n)_{n \geq 1}$ and its process distribution $\mu$. Let us assume that $X$ has a strong $\ell_p$-characterization (Def. 5) and that its limiting function in (11) is constant $\mu$-almost surely, denoted by $(f_{p,\mu}(r))_{r \in (0,1]}$. Assume that $d_o = f_{p,\mu}(r_o)$ for some $r_o \in (0, 1]$⁴. Then, we have the following:

³The next section shows that $f_{p,\mu}(X, r)$ is a constant function for the family of AMS and ergodic processes [12]. However, it is not a constant function for stationary and AMS processes in general, as presented in Section 4.
⁴This means that $d_o \in \{f_{p,\mu}(r), r \in (0, 1]\}$.

i) If $d_o > 0$, then $r_o > 0$ and $r_o$ is the unique solution of $d_o = f_{p,\mu}(r)$ for $r \in (0, 1)$. Furthermore, for any $r \in (0, 1)$ and any $(k_n)_{n \geq 1}$ such that $\frac{k_n}{n} \to r$, we have that
$$\lim_{n \to \infty} \mu_n(A_{d_o}^{n,k_n}) = \begin{cases} 1 & \text{if } r > r_o \\ 0 & \text{if } r < r_o. \end{cases} \tag{12}$$

ii) If $d_o = 0$, then $\forall d \in (0, 1)$, for any $r \geq r_o$, and any $(k_n)_{n \geq 1}$ such that $\frac{k_n}{n} \to r$, it follows that

$$\lim_{n \to \infty} \mu_n(A_d^{n,k_n}) = 1. \tag{13}$$

This result shows that for a process with a strong $\ell_p$-characterization and a constant rate-approximation error function, there is a 0-1 phase transition on the asymptotic probability of the events $A_d^{n,k_n}$ when $k_n/n \to r$, which is governed by $(f_{p,\mu}(r))_{r \in (0,1]}$ in (11). More precisely, we have the following corollary, which uses the fact that the inverse $f_{p,\mu}^{-1}(d)$ is well defined for any $d \in \{f_{p,\mu}(r), r \in (0, 1]\} \setminus \{0\}$ (see Lemma 1 i)):

Corollary 1. Under the assumptions of Lemma 1, for any $d \in \{f_{p,\mu}(r), r \in (0, 1]\} \setminus \{0\}$ and $\epsilon > 0$,

$$r_p(d, \epsilon, \mu) = \tilde{r}_p^+(d, \epsilon, \mu) = \tilde{r}_p^-(d, \epsilon, \mu) = f_{p,\mu}^{-1}(d). \tag{14}$$

On the other hand, if $f_{p,\mu}(r_o) = 0$ for some $r_o \in (0, 1]$, then $\forall \epsilon \in (0, 1)$, $\forall d \in (0, 1)$,
$$r_p(d, \epsilon, \mu) \leq \tilde{r}_p^+(d, \epsilon, \mu) \leq r_o. \tag{15}$$

It is worth noting in (14) that the weak $\ell_p$-approximation error function $r_p(d, \epsilon, \mu)$ is independent of $\epsilon$ and fully determined by $(f_{p,\mu}^{-1}(d))_{d \in \{f_{p,\mu}(r), r \in (0,1]\}}$. This is consistent with the results obtained for i.i.d. processes in [1] and for stationary and ergodic processes in [3]. The proofs of Lemma 1 and Corollary 1 are presented in Appendix A.

2.2. AMS Processes

A process $X = (X_n)_{n \geq 1}$ is fully represented by the probability space $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu)$, where the process distribution $\mu$ is the central object. One way to model structure and dynamics on $(X_n)_{n \geq 1}$ (and indeed on $\mu$) is through the introduction of a measurable function $T: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$. Then we have an augmented object $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$ to analyze (and sometimes to represent) the process $(X_n)_{n \geq 1}$. In this work, we focus exclusively on the standard shift operator to look at the time dynamics and invariances of a process.⁵

⁵For $\bar{x} \in \mathbb{R}^{\mathbb{N}}$, $\bar{z} = T(\bar{x})$ is given by the coordinate-wise relationship $z_i = x_{i+1}$ for all $i \geq 1$ [12].

Definition 6. For the shift operator, the process $(X_n)_{n \geq 1}$ is said to be stationary (relative to $T$) if for any $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$, $\mu(F) = \mu(T^{-1}(F))$.

The definitions and properties presented in this section extend this basic notion of stationarity for $(X_n)_{n \geq 1}$. To that end, we will use $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$ as the underlying dynamical system representation of $X = (X_n)_{n \geq 1}$, where $T$ is the shift operator.⁶ Given the above context, let us briefly introduce the family of AMS processes, which is the main object of study of this work.

Definition 7. A process $X = (X_n)_{n \geq 1}$ (or its underlying dynamical system $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$) is said to have an ergodic property with respect to a measurable function $f: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ if the sample average of $f$, defined by

$$\langle f \rangle_n (X) \equiv \frac{1}{n} \sum_{i=0}^{n-1} f(T^i(X)), \tag{16}$$
converges $\mu$-almost surely as $n$ tends to infinity to a measurable function $\langle f \rangle (X)$ from $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Definition 8. A process $X = (X_n)_{n \geq 1}$ (or $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$) is said to have an ergodic property with respect to a class of measurable functions $\mathcal{M}$ if, for any $f \in \mathcal{M}$, the sequence $(\langle f \rangle_n (X))_{n \geq 1}$ converges to a well-defined limit $\langle f \rangle (X)$, $\mu$-almost surely.

For any $n > 0$, let us define the arithmetic mean probability by

$$\mu^n(\mu, F) \equiv \frac{1}{n} \sum_{i=0}^{n-1} \mu(T^{-i}(F)), \tag{17}$$

for all $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$, where it is clear that $(\mu^n(\mu, F))_{F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})} \in \mathcal{P}(\mathbb{R}^{\mathbb{N}})$ for any $n \geq 1$.

Definition 9. A process $X = (X_n)_{n \geq 1}$ (or $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$) is said to be asymptotically mean stationary (AMS)⁷ if $(\mu^n(\mu, F))_{n \geq 1}$ in (17) converges as $n$ goes to infinity for any $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$. This limit is denoted by $(\bar{\mu}(F))_{F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})} \in \mathcal{P}(\mathbb{R}^{\mathbb{N}})$ and is called the stationary mean of $X$.⁸

Remark 1. If the limit of the arithmetic mean probabilities of $\mu$ exists, in the sense that $(\mu^n(\mu, F))_{n \geq 1}$ converges in $\mathbb{R}$ as $n$ tends to infinity for any event $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$, these values (indexed by $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$) induce a well-defined probability in $\mathcal{P}(\mathbb{R}^{\mathbb{N}})$ that we denote by $\bar{\mu}$ in Definition 9 [12, Lemma 7.4].

⁶A complete exposition of sources with ergodic properties viewed as dynamical systems is presented in [12, Chapts. 7, 8 and 10].
⁷By definition, if $(X_n)_{n \geq 1}$ is stationary, then it is AMS.
⁸$\bar{\mu}$ is a function of $\mu$, but for the sake of simplicity this dependency will be considered implicit.

As expected, it can be proved that if $X$ is AMS, then $\bar{\mu}$ is a stationary probability (with respect to $T$) in the sense that $\bar{\mu}(F) = \bar{\mu}(T^{-1}(F))$ for all $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$. The following important result connects processes with ergodic properties and AMS processes:

Lemma 2. [12, Ths. 7.1 and 8.1] A necessary and sufficient condition for a process $X = (X_n)_{n \geq 1}$ to be AMS is that it has an ergodic property (Definition 8) with respect to the family of indicator functions, i.e., $\{\mathbf{1}_F(\cdot) : F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})\}$.

2.3. Ergodicity

Let us first introduce a stronger version of Definition 8.

Definition 10. A process $X = (X_n)_{n \geq 1}$ (or $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$) has a constant ergodic property over a class $\mathcal{M}$ if: i) $X = (X_n)_{n \geq 1}$ has an ergodic property over $\mathcal{M}$ (Definition 8), and ii) for any $f \in \mathcal{M}$, its limit $\langle f \rangle (X)$ is a constant function, $\mu$-almost surely.

The following definition derives from the celebrated point-wise ergodic theorem for AMS sources [12, Th. 7.5]:

Definition 11. A process $X = (X_n)_{n \geq 1}$ (or $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$) is said to be ergodic if the collection of invariant events (the events $F$ such that $T^{-1}(F) = F$) has $\mu$-probability 1 or 0 [13, 12].

The following result connects these last definitions:

Lemma 3. [12, Th. 7.5 and Lem. 7.14] A necessary and sufficient condition for an AMS process $X = (X_n)_{n \geq 1}$ to be ergodic is that $X$ has a constant ergodic property for $\{\mathbf{1}_F(x) : F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})\}$.

In general, AMS processes are not ergodic. In fact, the following instrumental result provides a condition for $X$ to meet ergodicity that can be considered a form of weak mixing (asymptotic independence) condition [12].

Lemma 4. [12, Lem. 7.15] A necessary and sufficient condition for an AMS process $X = (X_n)_{n \geq 1}$ to be ergodic is that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu(T^{-i}(F) \cap F) = \bar{\mu}(F) \, \mu(F) \tag{18}$$
for all $F \in \mathcal{F}$, where $\mathcal{F} \subset \mathcal{B}(\mathbb{R}^{\mathbb{N}})$ is a subfamily that generates $\mathcal{B}(\mathbb{R}^{\mathbb{N}})$.

For the case when the process is stationary, it follows that $\bar{\mu}(F) = \mu(F)$ for all $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$; then the condition in (18) can be interpreted as a mixing (asymptotic independence) property on $(X_n)_{n \geq 1}$. Finally, we have the point-wise ergodic theorem for AMS and ergodic processes:

Lemma 5. [12, Th. 7.5] Let $X = (X_n)_{n \geq 1}$ be an AMS and ergodic process and let $f: (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be an $\ell_1$-integrable function with respect to $\bar{\mu}_1$. Then it follows that
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(X_i) = \mathbb{E}_{X \sim \bar{\mu}_1}(f(X)) < \infty, \quad \mu\text{-a.s.} \tag{19}$$

3. Strong $\ell_p$-Characterization for AMS and Ergodic Processes

To present the main result of this section, we first need to introduce some notation and definitions for the statement of Theorem 1. Let $X = (X_n)_{n \geq 1}$ be an AMS process with stationary mean $\bar{\mu} = \{\bar{\mu}_n : n \geq 1\}$. If the $p$-moment of the 1D marginal $\bar{\mu}_1$ is well defined, i.e., $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$, then we can introduce the following induced probability $v_p \in \mathcal{P}(\mathbb{R})$, with $v_p \ll \bar{\mu}_1$, by
$$v_p(B) \equiv \frac{\int_B |x|^p \, d\bar{\mu}_1(x)}{\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x)}, \quad \forall B \in \mathcal{B}(\mathbb{R}). \tag{20}$$
In addition, let us define the following tail sets:

$$B_\tau \equiv (-\infty, -\tau] \cup [\tau, \infty) \quad \text{and} \quad C_\tau \equiv (-\infty, -\tau) \cup (\tau, \infty), \tag{21}$$
for any $\tau \geq 0$. With this, we define the following admissible set for $\bar{\mu}_1$:
$$\mathcal{R}^*_{\bar{\mu}_1} \equiv \{\bar{\mu}_1(B_\tau), \tau \in [0, \infty)\}. \tag{22}$$
Finally, let $\lambda$ denote the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

3.1. Main Result

When an AMS process satisfies the mixing condition in (18) and, consequently, is ergodic, we can state the following result:

Theorem 1. Let $X = (X_n)_{n \geq 1}$ (or $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu, T)$) be an AMS and ergodic process, and let $\bar{\mu} = \{\bar{\mu}_n : n \geq 1\}$ be its stationary mean (Definition 9). Then $X$ has a strong $\ell_p$-characterization (Definition 5). More precisely, for any $p > 0$, $r \in (0, 1]$, and $(k_n)_{n \geq 1}$ satisfying $\frac{k_n}{n} \to r$, it follows that
$$f_{p,\bar{\mu}}(r) = \lim_{n \to \infty} \tilde{\sigma}_p(k_n, X^n), \quad \mu\text{-almost surely}, \tag{23}$$
where $f_{p,\bar{\mu}}(r)$ is a well-defined function of the stationary mean $\bar{\mu}$. Furthermore, $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ is an exclusive function of $\bar{\mu}_1 \in \mathcal{P}(\mathbb{R})$ (the 1D marginal of $\bar{\mu}$) with the following characterization:

i) If $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$, then it follows that $f_{p,\bar{\mu}}(r) = 0$, $\forall r \in (0, 1]$.

ii) If $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$ and $\bar{\mu}_1 \ll \lambda$, then for any $r \in (0, 1]$,
$$f_{p,\bar{\mu}}(r) = \sqrt[p]{1 - v_p(B_{\tau(r)})},$$

where $\tau(r) > 0$ is the unique solution of $\bar{\mu}_1(B_\tau) = r$.

iii) If $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$ and $\bar{\mu}_1$ is not absolutely continuous with respect to $\lambda$⁹, we have two cases:

a) $r \in \mathcal{R}^*_{\bar{\mu}_1}$, where it follows that
$$f_{p,\bar{\mu}}(r) = \sqrt[p]{1 - v_p(B_{\tau(r)})},$$

and $\tau(r)$ is the solution of $\bar{\mu}_1(B_\tau) = r$.

b) $r \notin \mathcal{R}^*_{\bar{\mu}_1}$, where there is $\tau_o > 0$ such that $\bar{\mu}_1(\{-\tau_o, \tau_o\}) > 0$ and $r \in$

$[\bar{\mu}_1(C_{\tau_o}), \bar{\mu}_1(B_{\tau_o}))$. Then $\exists \alpha_o \in [0, 1)$ such that

$$r = \bar{\mu}_1(C_{\tau_o}) + \alpha_o \left( \bar{\mu}_1(B_{\tau_o}) - \bar{\mu}_1(C_{\tau_o}) \right),$$
where
$$f_{p,\bar{\mu}}(r) = \sqrt[p]{1 - v_p(C_{\tau_o}) - \alpha_o \left( v_p(B_{\tau_o}) - v_p(C_{\tau_o}) \right)}.$$
In the last expression, we have that

$$v_p(B_{\tau_o}) - v_p(C_{\tau_o}) = v_p(\{-\tau_o, \tau_o\}) = |\tau_o|^p \cdot \bar{\mu}_1(\{-\tau_o, \tau_o\}) / \| |x|^p \|_{L_1(\bar{\mu}_1)}.$$

The proof of this result is presented in Section 8.
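For a concrete instance of case ii), the following Python sketch (a minimal illustration of ours, assuming a symmetric density for $\bar{\mu}_1$ with finite $p$-moment) solves $\bar{\mu}_1(B_\tau) = r$ numerically and evaluates $f_{p,\bar{\mu}}(r) = \sqrt[p]{1 - v_p(B_{\tau(r)})}$:

```python
import numpy as np
from scipy import integrate, optimize, stats

def f_p_closed_form(r, p, density, tau_max=50.0):
    """f_{p,mu_bar}(r) from Theorem 1 part ii), assuming a symmetric density
    for mu_bar_1 with finite p-moment: solve mu_bar_1(B_tau) = r for tau and
    return (1 - v_p(B_tau))**(1/p), with v_p as in Eq. (20)."""
    mu_tail = lambda t: 2.0 * integrate.quad(density, t, np.inf)[0]   # mu_bar_1(B_t)
    tau = optimize.brentq(lambda t: mu_tail(t) - r, 0.0, tau_max)
    mom = lambda a: 2.0 * integrate.quad(lambda x: x**p * density(x), a, np.inf)[0]
    return (1.0 - mom(tau) / mom(0.0)) ** (1.0 / p)                   # p-th root of 1 - v_p

# Gaussian stationary mean, p = 2: f_{2,mu_bar}(r) for a few rates
for r in (0.05, 0.2, 0.5):
    print(r, f_p_closed_form(r, p=2.0, density=stats.norm.pdf))
```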

3.2. Analysis and Interpretations of Theorem 1

1. The general result in (23) shows that any ergodic AMS process has a strong $\ell_p$-characterization (Def. 5), where its point-wise (almost sure) approximation error function in (11) is completely determined by the 1D projection of its stationary mean, i.e., $\bar{\mu}_1 \in \mathcal{P}(\mathbb{R})$.

2. Two important scenarios can be highlighted: the case $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$, in which $f_{p,\bar{\mu}}(r) = 0$, $\forall r \in (0, 1]$, and the case $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$, which has a non-trivial approximation error function expressed by the following collection of (rate, distortion) pairs:
$$\{(r, f_{p,\bar{\mu}}(r)), r \in (0, 1]\} = \left\{ \left( \bar{\mu}_1(B_\tau), \sqrt[p]{1 - v_p(B_\tau)} \right), \tau \in [0, \infty) \right\} \cup \bigcup_{\tau_n \in Y_{\bar{\mu}_1}} \bigcup_{\alpha \in [0,1)} \left\{ \left( \bar{\mu}_1(C_{\tau_n}) + \alpha \bar{\mu}_1(\{-\tau_n, \tau_n\}), \sqrt[p]{1 - v_p(C_{\tau_n}) - \alpha v_p(\{-\tau_n, \tau_n\})} \right) \right\}, \tag{24}$$
where $Y_{\bar{\mu}_1} = \{\tau \in [0, \infty) : \bar{\mu}_1(\{-\tau, \tau\}) > 0\}$, which is shown to be at most a countable set. It is worth noting that the expression in (24) summarizes the continuous and non-continuous results stated in ii) and iii). The details of this analysis are presented in Section 8.

3. From the proof of Theorem 1, the following properties of the function $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ in (23) can be obtained:

⁹In other words, $\bar{\mu}_1$ has atomic components.

Proposition 1. Assuming that $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$, the function $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ is continuous and strictly decreasing on the domain $f_{p,\bar{\mu}}^{-1}((0, 1)) = (0, 1 - \bar{\mu}_1(\{0\})) \subset (0, 1)$, and satisfies that $f_{p,\bar{\mu}}(r) = 0$ $\forall r \in [1 - \bar{\mu}_1(\{0\}), 1]$ and $\lim_{r \to 0} f_{p,\bar{\mu}}(r) = 1$.

The proof follows from Lemma 10 in Section 8 (and its proof in Appendix B).

4. Proposition 1 implies that $\forall d \in [0, 1)$ there exists $r \in (0, 1]$ such that $f_{p,\bar{\mu}}(r) = d$, meaning that $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ achieves all values in $[0, 1)$, and $f_{p,\bar{\mu}}^{-1}(d) > 0$ is well defined for any $d \in (0, 1)$. Some illustrations of $f_{p,\bar{\mu}}(\cdot)$ are presented in Figure 1.

5. Under the assumption that $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$, we have two important scenarios: the case when $\bar{\mu}_1(\{0\}) \in (0, 1)$ ($\bar{\mu}_1$ has an atomic mass at 0), and the case when $\bar{\mu}_1(\{0\}) = 0$. i) For the scenario where $\bar{\mu}_1(\{0\}) > 0$, Proposition 1 and Theorem 1 tell us that zero approximation error can be achieved at rates strictly smaller than 1. More precisely, we have that $f_{p,\bar{\mu}}(r) = 0$ $\forall r \in [1 - \bar{\mu}_1(\{0\}), 1]$, and $f_{p,\bar{\mu}}(r) > 0$ $\forall r \in (0, 1 - \bar{\mu}_1(\{0\}))$. ii) On the other hand, for the scenario where $\bar{\mu}_1(\{0\}) = 0$, zero distortion is exclusively achieved at a rate equal to 1, meaning that $f_{p,\bar{\mu}}(r) > 0$ for any $r \in (0, 1)$ (from Proposition 1). These two scenarios are illustrated in Figure 1.

6. Finally, we recognize two signatures for the general structure of $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$: $f_{p,\bar{\mu}}(\cdot)$ is either a constant function equal to zero everywhere, or it is a non-constant, strictly decreasing and continuous function, from Proposition 1 (see Figure 1). Indeed, we have the following dichotomy:

Corollary 2. $f_{p,\bar{\mu}}(r) = 0$ for any $r \in (0, 1)$ if, and only if, $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$.¹⁰

The proof follows directly from Proposition 1 and Theorem 1.

3.3. $\ell_p$-Compressible AMS and Ergodic Processes

In light of Theorem 1, it is worth revisiting the important concept of $\ell_p$-compressible processes introduced by Amini et al. [1], which is originally based on the weak $\ell_p$-characterization presented in Definition 4.

Definition 12. [1, Def. 6] A process $(X_n)_{n \geq 1}$, with distribution $\mu$, is said to be $\ell_p$-compressible for $p > 0$ if, for any $\epsilon \in (0, 1)$ and $d \in (0, 1)$, $\tilde{r}_p^+(d, \epsilon, \mu) = 0$.

Looking at Theorem 1, Lemma 1 and Corollary 1 (in particular, the relationship expressed in Eq. (14) that connects the strong $\ell_p$-characterization in Definition 5 with the weak $\ell_p$-characterization in (9)), the following result can be stated for the family of AMS ergodic processes:

¹⁰For this last statement, we exclude the trivial process where $\bar{\mu}_1(\{0\}) = 1$.

Figure 1: Illustrations of the function $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ in (24), assuming that $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$. The left and right panels present the scenarios where $\bar{\mu}_1(\{0\}) \in (0, 1)$ and $\bar{\mu}_1(\{0\}) = 0$, respectively.

Theorem 2. A necessary and sufficient condition for an AMS ergodic process (with process distribution $\mu$ and stationary mean $\bar{\mu}$) to be $\ell_p$-compressible (Def. 12) is that $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$. More specifically, if $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$, then for any $d \in (0, 1)$ and $\forall \epsilon > 0$,

$$r_p(d, \epsilon, \mu) = \tilde{r}_p^+(d, \epsilon, \mu) = \tilde{r}_p^-(d, \epsilon, \mu) = f_{p,\bar{\mu}}^{-1}(d) > 0,$$
and otherwise, i.e., if $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$, for any $d \in (0, 1)$ and $\epsilon > 0$,

$$r_p(d, \epsilon, \mu) = \tilde{r}_p^+(d, \epsilon, \mu) = \tilde{r}_p^-(d, \epsilon, \mu) = 0.$$

Theorem 2 extends the dichotomy known from [3, Theorem 1] for ergodic and stationary processes to the family of AMS and ergodic processes: i.e., a process is $\ell_p$-compressible if, and only if, $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$.

Proof: Let us consider that $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$. From Theorem 1, we have that $f_{p,\bar{\mu}}(r) = 0$ $\forall r \in (0, 1]$. Then, using Corollary 1 (see Eq. (15)), it follows directly that $\sup_{\epsilon \in (0,1)} \sup_{d \in (0,1)} \tilde{r}_p^+(d, \epsilon, \mu) \leq r$ for any arbitrarily small $r \in (0, 1)$. Consequently, we have that $r_p(d, \epsilon, \mu) = \tilde{r}_p^+(d, \epsilon, \mu) = \tilde{r}_p^-(d, \epsilon, \mu) = 0$ for any $\epsilon \in (0, 1)$ and any $d \in (0, 1)$.

Assuming that $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$, Proposition 1 shows that $\{f_{p,\bar{\mu}}(r), r \in (0, 1]\} = [0, 1)$. Then, using Corollary 1 (see Eq. (14)), we have that for any $d \in (0, 1)$ and $\epsilon > 0$, $r_p(d, \epsilon, \mu) = \tilde{r}_p^+(d, \epsilon, \mu) = \tilde{r}_p^-(d, \epsilon, \mu) = f_{p,\bar{\mu}}^{-1}(d)$. Finally, we know from the properties of $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ stated in Proposition 1 that $f_{p,\bar{\mu}}^{-1}(d) > 0$ for any $d \in (0, 1)$.

To conclude, from Definition 12 and the two previous results on $\tilde{r}_p^+(d, \epsilon, \mu)$, the main dichotomy stated in Theorem 2 is obtained.

3.3.1. Examples for $\bar{\mu}_1$

Here we show classes of distributions $\bar{\mu}_1$ for which the key condition of $\ell_p$-compressibility stated in Theorem 2 can be tested. As this implies an isolated analysis of $\bar{\mu}_1$, we revisit some of the conditions and examples studied in [3] to verify whether $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x)$ is finite or not. Importantly, this analysis reduces to looking at the tail of $\bar{\mu}_1$ [3, 1].

We begin by covering the case of distributions (equipped with a density function) with exponential tails. [3, Corollary 1] shows that if $\int_{\mathbb{R}} e^{\gamma |x|} \, d\bar{\mu}_1(x) < \infty$ for some $\gamma > 0$, then $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$ for any $p > 0$. From this condition, it is simple to check that AMS ergodic processes with stationary mean ($\bar{\mu}_1$) following a Gaussian, generalized Gaussian, Laplacian or Gamma distribution (all of them with exponential tails) verify that $\mathbb{E}_{X \sim \bar{\mu}_1}(e^{\gamma |X|}) < \infty$ for some $\gamma > 0$, and these AMS ergodic processes are not $\ell_p$-compressible (Def. 12) for any $p > 0$. Furthermore, if $\bar{\mu}_1$ is finitely supported, i.e., $\exists C > 0$ such that $\bar{\mu}_1([-C, C]) = 1$, then its AMS ergodic process is not $\ell_p$-compressible for any $p > 0$.

On the other hand, we can consider AMS ergodic processes where $\bar{\mu}_1$ has a heavy-tail distribution. For this class, [3, Corollary 2] shows that if $\bar{\mu}_1 \ll \lambda$ and its density function $f_{\bar{\mu}_1}(x) = \frac{d\bar{\mu}_1}{d\lambda}(x)$ decays as $O(|x|^{-(\tau+1)})$ for some $\tau > 0$,¹¹ then
$$\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty \quad \text{if, and only if,} \quad p \geq \tau. \tag{25}$$
An example of this class of heavy-tail distributions is the family of Student's t-distributions with $q > 0$ degrees of freedom,¹² whose pdf decays (when $|x|$ tends to infinity) as $O(|x|^{-(q+1)})$. Consequently, from Theorem 2 and the condition in (25), an AMS ergodic process with stationary mean $\bar{\mu}_1$ following a Student's t-distribution with parameter $q > 0$ is $\ell_p$-compressible for any $p \geq q$ and non-$\ell_p$-compressible for $p < q$.¹³
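The moment condition in (25) can be probed empirically. Under our simplifying assumption (an i.i.d. sample in place of a general AMS ergodic realization), the running empirical $p$-moment of Student's t samples stabilizes when $p < q$ and keeps growing when $p \geq q$; a minimal Python sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
q = 3.0                                    # degrees of freedom of the Student's t
x = np.abs(rng.standard_t(q, size=10**6))  # i.i.d. heavy-tailed sample

for p in (1.0, 2.0, 4.0):                  # p < q: finite p-moment; p >= q: infinite
    running = np.cumsum(x ** p) / np.arange(1, x.size + 1)
    checkpoints = running[[10**4 - 1, 10**5 - 1, 10**6 - 1]]
    print(p, checkpoints)                  # stable for p = 1, 2; grows for p = 4
```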

3.4. Estimation of $f_{p,\bar{\mu}}(r)$ from Samples of $X$

Theorem 1 shows that an AMS ergodic process has a strong $\ell_p$-characterization that is an exclusive function of $\bar{\mu}_1$. However, the determination of $\bar{\mu}_1$ can be a non-trivial technical task, as can the derivation of $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ from the expression presented in (24). To contextualize this observation and illustrate some examples, Section 6 presents results that show how AMS and ergodic processes can be constructed from basic processing stages. From these constructions, it is

¹¹A function $f(x)$ decays as $|x|^{-(\tau+1)}$ if there exist $x_o > 0$, $0 < K_1 < K_2 < \infty$ and $\lim_{x \to \infty} f(-x)/f(x) = \rho \in \mathbb{R}^+ \cup \{\infty\}$, where: if $\rho = 0$, then $\forall x > x_o$, $K_1 x^{-(\tau+1)} \leq f(x) \leq K_2 x^{-(\tau+1)}$; if $\rho = \infty$, then $\forall x < -x_o$, $K_1 |x|^{-(\tau+1)} \leq f(x) \leq K_2 |x|^{-(\tau+1)}$; and otherwise, $\forall |x| > x_o$, $K_1 |x|^{-(\tau+1)} \leq f(x) \leq K_2 |x|^{-(\tau+1)}$.
¹²The pdf of a Student's t-distribution with $q > 0$ degrees of freedom is given by $f_{\bar{\mu}_1}(x) = \frac{\Gamma((q+1)/2)}{\sqrt{q\pi}\,\Gamma(q/2)} (1 + x^2/q)^{-(q+1)/2}$, where $\Gamma(\cdot)$ is the gamma function.
¹³More examples and discussion about the verification of $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$ can be found in [3].

observed that the determination of $\bar{\mu}_1$ can be a non-trivial task in many cases. Interestingly, if we can sample the process, we can estimate $f_{p,\bar{\mu}}(\cdot)$ (using the expression in (24)) and numerically evaluate how $f_{p,\bar{\mu}}(\cdot)$ behaves. These estimates can be used to compare the compressibility patterns of different AMS and ergodic processes. A simple estimation strategy to infer $f_{p,\bar{\mu}}(\cdot)$ and some numerical examples are presented in Section 7.

4. Strong $\ell_p$-Characterization for AMS Processes

Relaxing the ergodic assumption for an AMS source is the focus of this part. It is worth noting that the ergodic result in Theorem 1 will be instrumental for this analysis in view of the ergodic decomposition (ED) theorem for AMS sources, nicely presented in [12, Ths. 8.3 and 10.1] and references therein. In a nutshell, the ED theorem shows that the stationary mean of an AMS process (see Def. 9) can be decomposed as a convex combination of stationary and ergodic distributions (called the ergodic components) in $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$.

It is important to introduce one specific aspect of this result for the statement of the following theorem. Let us consider an arbitrary AMS process $(X_n)_{n \geq 1}$ equipped with its process distribution $\mu \in \mathcal{P}(\mathbb{R}^{\mathbb{N}})$ and its induced stationary mean $\bar{\mu} \in \mathcal{P}(\mathbb{R}^{\mathbb{N}})$. If we denote by $\tilde{\mathcal{P}} \subset \mathcal{P}(\mathbb{R}^{\mathbb{N}})$ the family of stationary and ergodic probabilities with respect to the shift operator, then one of the implications of the ergodic decomposition theorem [12, Ths. 8.3 and 10.1] is that there is a measurable space $(\Lambda, \mathcal{L})$ indexing this family, i.e., $\tilde{\mathcal{P}} = \{\mu_\lambda, \lambda \in \Lambda\}$. More importantly, there is a measurable function $\Psi: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\Lambda, \mathcal{L})$ that maps points in the sequence space $\mathbb{R}^{\mathbb{N}}$ to stationary and ergodic components (more details will be given in Section 9). Then, using $\Psi$, there is a probability measure $W_\Psi$ on $(\Lambda, \mathcal{L})$ induced by $\mu$ in the standard way, where $\forall A \in \mathcal{L}$ we have that $W_\Psi(A) = \mu(\Psi^{-1}(A))$. One of the implications of the ED theorem [12, Ths. 8.3 and 10.1] is that for all $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$,¹⁴
$$\bar{\mu}(F) = \int_\Lambda \mu_\lambda(F) \, dW_\Psi(\lambda). \tag{26}$$

In other words, $\bar{\mu}$ can be expressed as a convex combination of the stationary and ergodic components $\{\mu_\lambda, \lambda \in \Lambda\}$, where the mixture probability on $(\Lambda, \mathcal{L})$ is induced by the decomposition function $\Psi$. This last function is universal, meaning that $\Psi$ is valid to decompose any stationary distribution on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ into stationary and ergodic components as presented in (26).

4.1. Main Result

The following result uses the ED theorem for AMS sources [12, Ths. 8.3 and 10.1] and Theorem 1 to show that AMS sources have a strong $\ell_p$-characterization

¹⁴The assumption here is that, for any $F$, $\mu_\lambda(F)$, as a function of $\lambda$, is measurable from $(\Lambda, \mathcal{L})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ [12].

as stated in Definition 5. Furthermore, the result offers an expression to specify the limit $(f_{p,\mu}(X, r))_{r \in (0,1]}$ in (11).

Theorem 3. Let $X = (X_n)_{n \geq 1}$ be an AMS process with process distribution $\mu$. Let us consider $\tilde{\mathcal{P}} = \{\mu_\lambda, \lambda \in \Lambda\}$, the collection of stationary and ergodic probabilities, and the decomposition function $\Psi: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\Lambda, \mathcal{L})$ presented in the ED theorem [12, Th. 10.1]. Then it follows that:

i) The process $X = (X_n)_{n \geq 1}$ has a strong $\ell_p$-characterization (Def. 5), where for any $r \in (0, 1]$ and $(k_n)_{n \geq 1}$ such that $k_n/n \to r$,

$$\lim_{n \to \infty} \tilde{\sigma}_p(k_n, X_1^n) = f_{p,\mu}(X, r) = f_{p,\mu_{\Psi(X)}}(r), \quad \mu\text{-almost surely}, \tag{27}$$

where $(f_{p,\mu_\lambda}(r))$ has been introduced and developed in Theorem 1.¹⁵

ii) For any $r \in (0, 1]$, $d \in [0, 1)$, and $(k_n)_{n \geq 1}$ such that $k_n/n \to r$,
$$\lim_{n \to \infty} \mu_n(A_d^{n,k_n}) = \lim_{n \to \infty} \int_\Lambda \mu_\lambda^n(A_d^{n,k_n}) \, dW_\Psi(\lambda) = \mu\left( \left\{ x \in \mathbb{R}^{\mathbb{N}} : \mu_{\Psi(x)} \text{ is } \ell_p\text{-compressible} \right\} \right) + \mu\left( \left\{ x \in \mathbb{R}^{\mathbb{N}} : \mu_{\Psi(x)} \text{ is not } \ell_p\text{-compressible and } f_{p,\mu_{\Psi(x)}}(r) \leq d \right\} \right). \tag{28}$$

The proof of this result is presented in Section 9.

4.2. Analysis and Interpretations of Theorem 3

• The first almost-sure point-wise result in (27) provides a closed-form expression for the $\ell_p$-characterization of the process $X$, given by $f_{p,\mu}(X, r)$, which is a function of $X$ (not a constant function in general) through the ED function $\Psi(\cdot)$ that maps $X$ to stationary and ergodic components in $\tilde{\mathcal{P}}$.

• An interesting interpretation of the result in (27), which is a consequence of the ED theorem, is that this limiting behaviour can be seen as if one selects at $t = 0$ an ergodic component $\mu_\lambda \in \tilde{\mathcal{P}}$, and then the process evolves with the statistics of $\mu_\lambda$, which has a strong $\ell_p$-characterization (Def. 5) given by Theorem 1. This is equivalent to stating that there is one stationary ergodic component that is active all the time, but we do not know a priori which component. In fact, to resolve which component is active, we need to know the entire process $X$, as the active component $\lambda \in \Lambda$ is given by $\Psi(X)$. This interpretation has a natural connection with the standard setting used in universal source coding, as clearly argued by Gray and Kieffer in [14], where it is assumed that a process is fixed and

¹⁵Note that $\forall \lambda \in \Lambda$, $\mu_\lambda \in \tilde{\mathcal{P}}$ is a stationary and ergodic process.

belongs to a family of process distributions from beginning to end, but the observer (or the designer of the coding scheme) does not know which specific distribution is active. Therefore, when observing a realization of an AMS process, what we are really observing is a realization of one (unknown a priori) stationary and ergodic component in $\tilde{\mathcal{P}}$ and, consequently, its limiting behaviour is well defined as expressed in (27). The fact that this limit is expressed as a function of $\Psi$ can be understood from the perspective that $\Psi$ is the object that chooses the active component in $\tilde{\mathcal{P}}$ from $X$.

• An intriguing aspect of this result, which is again a consequence of the ED theorem for AMS sources, is that if we look at the limit $f_{p,\mu}(X, r)$ in (27), this is equal to $f_{p,\mu_{\Psi(X)}}(r)$, which does not depend on $\mu$ explicitly as long as $\mu$ is AMS. Then, we could say that the ED function $\Psi$ characterizes the asymptotic limit for any AMS source universally.

• When we move to the weak $\ell_p$-characterization result expressed in (28) (see Defs. 2 and 4), we can observe explicitly the role of the distribution $\mu$ in the analysis, which is consistent with the almost sure result in (27). In the expression on the RHS of (28), we note that the probability $\mu_n(A_d^{n,k_n})$ has a limit determined by the pair $(r, d)$ and the distribution $\mu$.

• Complementing the previous point, there are two clear terms in (28): the first is the probability (over $\mu$) of the sequences that map through $\Psi(\cdot)$ to $\ell_p$-compressible components (Def. 12) in $\tilde{\mathcal{P}}$. The second term is the probability of the sequences that map through $\Psi(\cdot)$ to ergodic components that are not $\ell_p$-compressible and whose $\ell_p$-approximation error function (which is characterized in Theorem 1), evaluated at the rate $r$, is smaller than the distortion $d$. Note that these two events on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ are distribution independent (universal) and, therefore, can be determined a priori (independently of $\mu$) for this weak $\ell_p$-compressibility analysis.

5. Summary and Discussion of the Results

In this work, we revisit the notion of $\ell_p$-compressibility, focusing on the study of the almost sure (with probability one) limit of the $\ell_p$-relative best $k$-term approximation error when a fixed rate of significant coefficients is considered for the analysis. We study processes with general ergodic properties, relaxing the stationarity and ergodicity assumptions considered in previous work. Interestingly, we found that the family of asymptotically mean stationary (AMS) processes has an (almost sure) stable $\ell_p$-approximation error behavior (sample-wise) when considering any arbitrary rate of significant coefficients per dimension of the signal. In particular, our two main results (Theorems 1 and 3) offer expressions for this limit, which is a function of the entire process through the known ergodic decomposition (ED) mapping used in the proof of the celebrated ED theorem. When ergodicity is added and we assume an AMS ergodic

source, the $\ell_p$-approximation error function reduces to a closed-form expression of the stationary mean of the process. As a direct consequence of this analysis, we extend (in Theorem 2) the dichotomy between $\ell_p$-compressibility and non-$\ell_p$-compressibility observed in a previous result [3, Th. 1].

In summary, the two main theorems of this paper significantly extend previous results in the literature on this problem, which are valid under the assumptions of stationarity, ergodicity and some extra regularity conditions on the process distributions. On the technical side, these new theorems show the important role that the general point-wise ergodic theorem and, in particular, the ED theorem play in the extension of the $\ell_p$-compressibility analysis to families of processes with general ergodic properties. Finally, from the proof of Theorem 1, we notice that imposing an ergodic property on the family of indicator functions is essential to obtain a stable (almost sure) result for the $\ell_p$-approximation error function, in the way expressed in Definition 5. Consequently, the AMS assumption (see Lemma 2) seems to be crucial to achieve the desired strong (almost sure) $\ell_p$-approximation property declared in Definition 5.

6. On the Construction and Processing of AMS Processes

To conclude this paper, we provide some context to support the application of our results in Sections 3 and 4. We consider a general generative scenario in which a process is constructed as the output of an innovation source passing through a signal processor (or coding process) and a random corruption (or channel). This scenario permits us to observe a family of operations on a stationary and ergodic source (for example, an i.i.d. source) that produces a process with a strong $\ell_p$-characterization (Def. 5). For that, we briefly revisit known results that guarantee that a process has stationarity and/or ergodic properties when it is produced (deterministically or randomly) from a stationary and ergodic source.¹⁶

A general way of representing a transformation of a process $X = (X_n)_{n \geq 1}$ into another process is using the concept of a channel.

Definition 13. A channel, denoted by $C$, is a collection of probabilities (or process distributions) in $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ indexed by elements of $\mathbb{R}^{\mathbb{N}}$. More precisely, $C = \{v_x, x \in \mathbb{R}^{\mathbb{N}}\} \subset \mathcal{P}(\mathbb{R}^{\mathbb{N}})$, where for any $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$, $v_x(F)$ is a measurable function from $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Given the process distribution of $(X_n)_{n \geq 1}$, denoted by $\mu$, a channel $C = \{v_x, x \in \mathbb{R}^{\mathbb{N}}\}$ induces a joint distribution in the product space $(\mathbb{R}^{\mathbb{N}} \times \mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}} \times \mathbb{R}^{\mathbb{N}}))$ by
$$\mu C(F \times G) = \int_{\bar{x} \in F} v_{\bar{x}}(G) \, d\mu(\bar{x}), \quad \forall F, G \in \mathcal{B}(\mathbb{R}^{\mathbb{N}}).$$
The joint process distribution is denoted by $\mu C$. Then a new process $Y = (Y_n)_{n \in \mathbb{N}}$ is obtained at the output of the channel when $(X_n)_{n \in \mathbb{N}}$ is its input. If

16A complete exposition can be found in [15, Ch.2].

we denote the distribution of $Y$ by $v$, it is obtained by the marginalization of $\mu C$, i.e., $v(G) \equiv P(Y \in G) = \mu C(\mathbb{R}^{\mathbb{N}} \times G)$ for all $G \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$.

Definition 14. Considering the shift operator $T$ (used to characterize stationarity and ergodic properties in Sec. 2.2), a channel $C = \{v_x, x \in \mathbb{R}^{\mathbb{N}}\}$ is said to be stationary with respect to $T$ if [15, Sec. 2.3]

$$v_{T(x)}(G) = v_x(T^{-1}(G)), \quad \forall x \in \mathbb{R}^{\mathbb{N}}, \ \forall G \in \mathcal{B}(\mathbb{R}^{\mathbb{N}}).$$
Then, the following result can be obtained:

Lemma 6. [15, Lemma 2.2] Let us consider an AMS process $X$ (with stationary mean $\bar{\mu}$) as the input of a stationary channel $C = \{v_x, x \in \mathbb{R}^{\mathbb{N}}\}$. Then the output process $Y$ is AMS and its stationary mean is given by¹⁷

$$\bar{v}(G) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} v(T^{-i}G) = \bar{\mu}C(\mathbb{R}^{\mathbb{N}} \times G) \quad \text{for all } G \in \mathcal{B}(\mathbb{R}^{\mathbb{N}}).$$
Remarkably, Lemma 6 shows a general random approach to produce AMS processes from another AMS process. Furthermore, the result provides a closed expression for the resulting stationary mean (a function of the stationary mean of the input $\bar{\mu}$ and the channel $C$), which is the object that determines its strong $\ell_p$-compressibility signature from Theorems 1 and 3. Furthermore, adding ergodicity, we highlight the following result:

Lemma 7. [15, Lemma 2.7] If the channel $C = \{v_x, x \in \mathbb{R}^{\mathbb{N}}\}$ is weakly mixing in the sense that for all $x \in \mathbb{R}^{\mathbb{N}}$ and measurable events $F, G \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \left[ v_x(T^{-i}(F) \cap G) - v_x(T^{-i}(F)) \, v_x(G) \right] = 0,$$
then if the input process is AMS and ergodic, the output of the channel is also AMS and ergodic.

We will cover two important families of channels used in many applications of statistical signal processing, source coding, and communications.

6.1. Deterministic Channels: Stationary Codes and LTI Systems

A deterministic transformation (or measurable function) of an AMS process $X = (X_n)_{n \geq 1}$ can be seen as an important example of the channel framework presented above. Let us consider a measurable function $f: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ and the induced process $Y = f(X)$, whose process distribution is $v(G) = \mu(f^{-1}(G))$ for all $G \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$. This form of encoding $X$ is a special case of a channel, where $v_x^f(G) = \mathbf{1}_{f^{-1}(G)}(x)$ for all $x \in \mathbb{R}^{\mathbb{N}}$. Importantly, it follows that:

¹⁷The result shows, more generally, that the joint process $(X, Y)$ is AMS with respect to $T \times T$ ($T \times T(\bar{x}, \bar{y}) = (T(\bar{x}), T(\bar{y}))$), where its stationary mean is $\bar{\mu}C$.

Lemma 8. [15] The deterministic channel (or coding) $C^f \equiv \{v_x^f, x \in \mathbb{R}^{\mathbb{N}}\}$ induced by $f$ is stationary if, and only if, $f(T(x)) = T(f(x))$ for all $x \in \mathbb{R}^{\mathbb{N}}$. In this context, we say that $f$ produces a stationary coding of $X$.

Corollary 3. Any stationary coding of an AMS process produces an AMS process, where the stationary mean of $Y$ is given by $\bar{v}(G) = \bar{\mu}(f^{-1}(G))$ for all $G \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$.

The proof of this result follows directly from Lemma 6. There is a stronger result for deterministic and stationary channels:

Lemma 9. [15, Lemma 2.4] Let us consider a deterministic and stationary channel $C^f$. If the input process to $C^f$ is AMS and ergodic, then the output process is AMS and ergodic.

It is worth noting that a direct way of constructing a stationary coding is by a scalar measurable function $\phi: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, where, given $x \in \mathbb{R}^{\mathbb{N}}$, the output is produced by $y_n = \phi(T^{n-1}(x))$ for all $n \geq 1$.¹⁸ Then, there is an infinite collection of stationary codings that preserve the AMS and ergodic characteristics of an input process. Two emblematic cases to consider are the finite-length sliding block code, where $y_n = \phi(X_{n+M}, \ldots, X_{n+D})$ with $D > M \geq 0$ and $\phi: \mathbb{R}^{D-M+1} \to \mathbb{R}$, and the case when $\phi$ is a linear function, i.e., $\phi(x) = \sum_{i=1}^{D-M+1} a_i \cdot x_i$, in which case $f$ produces a linear and time-invariant (LTI) coding of $X$.¹⁹ A minimal code sketch of this construction follows.
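The following Python sketch (ours, illustrative) implements a finite-length sliding block code with a linear $\phi$, i.e., an FIR/LTI coding of an i.i.d. input; by Lemma 8 this coding is stationary, and by Lemma 9 the output remains AMS and ergodic:

```python
import numpy as np

def sliding_block_code(x, a):
    """Finite-length sliding block code with linear phi:
    y_n = sum_j a_j * x_{n+j-1}, i.e., an LTI (FIR) coding of the input."""
    return np.convolve(x, a[::-1], mode="valid")    # y[m] = sum_j a[j] * x[m + j]

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)                     # stationary, ergodic (i.i.d.) input
y = sliding_block_code(x, np.array([0.5, 0.3, 0.2]))
# The output y is AMS and ergodic (Lemma 9).
```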

6.2. Memoryless Channels

Definition 15. A channel $C = \{v_x, x \in \mathbb{R}^{\mathbb{N}}\}$ is said to be memoryless if, for any finite-dimensional cylinder $\times_{i \in J} F_i \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$ and for any $x \in \mathbb{R}^{\mathbb{N}}$, it follows that $v_x(\times_{i \in J} F_i) = \prod_{i \in J} p_{x_i}(F_i)$, where $\{p_x, x \in \mathbb{R}\} \subset \mathcal{P}(\mathbb{R})$.

Basically, if $C$ is memoryless, we have that $v_x$ decomposes as the product of its marginals (memorylessness) for any $x \in \mathbb{R}^{\mathbb{N}}$. The classical example is the additive white Gaussian noise (AWGN) channel, used widely in signal processing and communications, where $p_x = \mathcal{N}(\mu_x, \sigma)$ is a normal distribution with its mean depending on $x$ and $\sigma > 0$. It is easy to check that memoryless channels are stationary. Consequently, Lemma 6 tells us that a memoryless corruption of an AMS process produces an AMS process. In addition, the mixing condition of Lemma 7 is easily verified for memoryless channels [15]. Consequently, a memoryless corruption of an AMS and ergodic process preserves the ergodicity of the input at the output of the channel.

Finally, using Lemmas 6, 7 and 9, we have a rich collection of processing steps where AMS alone, and AMS and ergodicity, are preserved from the input to the

¹⁸Conversely, for any stationary code $f$ there is a function $\phi(x) = \pi_1(f(x))$ that induces $f$, where $\pi_1(\cdot)$ denotes the first-coordinate projection on $\mathbb{R}^{\mathbb{N}}$.
¹⁹Stationary codings play an important role in ergodic theory for the analysis of isomorphic processes [12].

output and, consequently, Theorems 1 and 3 can be adopted for the compressibility analysis of these induced processes, as illustrated in the sketch below.
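As an illustration of a memoryless (hence stationary and weakly mixing) channel, the following Python sketch (ours) passes an AMS ergodic input through an AWGN corruption; by Lemmas 6 and 7, the output is again AMS and ergodic:

```python
import numpy as np

def awgn_channel(x, sigma, rng):
    """Memoryless AWGN channel: each output is x_i plus independent
    N(0, sigma^2) noise, i.e., p_x = N(x, sigma) in Definition 15."""
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(4)
x = rng.standard_normal(10_000)          # AMS and ergodic input (here, i.i.d.)
y = awgn_channel(x, sigma=0.5, rng=rng)  # output is AMS and ergodic (Lemmas 6 and 7)
```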

6.3. Final Examples

To conclude this section, we cover a result that offers a condition for a one-sided process to be AMS, using the following definition:

Definition 16. Let us consider two probabilities $\mu$ and $v$ on the measurable sequence space $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$. We say that $v$ asymptotically dominates $\mu$ with respect to the shift operator $T$ if, for any $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$ such that $v(F) = 0$, it follows that $\lim_{n \to \infty} \mu(T^{-n}(F)) = 0$.²⁰

Then, we have the following result by Rechard [16], revisited in [14, Theorem 2]:

Theorem 4. [16] Let $X = (X_n)_{n \geq 1}$ be a process with process distribution $\mu$ on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$. The process $X$ is AMS if, and only if, there is a probability $v$ on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ that is stationary with respect to $T$ (Def. 6) and asymptotically dominates $\mu$ (Def. 16).

Using a stationary process (or probability $v$), for example an i.i.d. process, Theorem 4 provides a way of constructing AMS processes. To illustrate this, the following is a selection of examples presented in [14].

Example 1. [14] Let $v$ be a stationary measure on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ and let $f: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a non-negative integrable function with respect to $v$. If $v_f$ is the probability induced by $f$ via
$$v_f(F) \equiv \frac{\int_F f(x) \, dv(x)}{\int_{\mathbb{R}^{\mathbb{N}}} f(x) \, dv(x)}, \quad \forall F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}}),$$
then $v_f$ is AMS (from Theorem 4 and the fact that, by construction, $v_f \ll v$). Conversely, from the Radon-Nikodym theorem [13], for any $\mu \ll v$ there exists $f(x) = \frac{d\mu}{dv}(x)$ ($v$-integrable) and, consequently, $\mu$ is AMS from Theorem 4. Therefore, any integrable function $f$ with respect to a stationary probability $v$ creates an AMS process that is not stationary in general.

Example 2. [14] Let $v$ be a stationary measure on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$ and let $F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}})$ be such that $v(F) > 0$. If $\mu_F$ is the conditional distribution of $v$ given $F$ on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$, then $\mu_F$ is AMS (from Theorem 4 and the fact that, by definition, $\mu_F \ll v$).

Therefore, conditioning a stationary probability on any non-trivial measurable set $F$ creates an AMS process that is not stationary in general.

²⁰If $\mu \ll v$, then $v$ asymptotically dominates $\mu$. The proof of this statement follows directly from [14, Theorem 3].

Example 3. [14] If $\mu$ is stationary with respect to $T^N$, for some $N > 1$, then $\mu$ is AMS (with respect to $T$) with stationary mean given by

$$\bar{\mu}(F) = \frac{1}{N} \sum_{i=0}^{N-1} \mu(T^{-i}(F)), \quad \forall F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}}). \tag{29}$$
A practical case of this example is applying an $N$-block to $N$-block non-overlapping mapping (function) over a stationary process to create a process that is $T^N$-stationary by construction and, consequently, AMS. Details and examples of this block-coding construction and its use in information theory are presented in [15].
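A minimal Python sketch (ours, with a hypothetical block map chosen only for illustration) of this non-overlapping $N$-block construction; the output is $T^N$-stationary, hence AMS, when the input is stationary:

```python
import numpy as np

def block_code(x, N, g):
    """Non-overlapping N-block to N-block mapping: applies g to each block
    (x_{kN+1}, ..., x_{(k+1)N}); the output is T^N-stationary, hence AMS,
    when the input is stationary (Example 3)."""
    blocks = x[: (len(x) // N) * N].reshape(-1, N)
    return np.apply_along_axis(g, 1, blocks).reshape(-1)

rng = np.random.default_rng(5)
x = rng.standard_normal(9_999)                       # stationary (i.i.d.) input
y = block_code(x, N=3, g=lambda b: b - b.mean())     # hypothetical block map
# y is not stationary with respect to T in general, but it is T^3-stationary and AMS.
```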

7. Estimating $f_{p,\bar{\mu}}(r)$

The purpose of this last section is twofold. First, we introduce an estimation strategy to approximate the function $f_{p,\bar{\mu}}(\cdot)$ with arbitrary precision for an AMS and ergodic process, without the need to obtain its invariant distribution $\bar{\mu}$ in closed form (Theorem 1). Second, we illustrate the trend of the rate vs. $\ell_p$-approximation error pairs in (30) for some simple cases of $\bar{\mu}_1 \in \mathcal{P}(\mathbb{R})$ (induced by Gaussian and $\alpha$-stable distributions [13, 1]). For this analysis, our main focus is the family of non-$\ell_p$-compressible processes (see Definition 12 and Theorem 2), where in light of Theorem 1 the behavior of $(f_{p,\bar{\mu}}(r))_{r \in (0,1]}$ is non-trivial (see Corollary 2).

7.1. Non-$\ell_p$-Compressible Case

Let us consider the case of a non-$\ell_p$-compressible AMS and ergodic process $X = (X_n)_{n \geq 1}$, i.e., $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$. Let us assume for simplicity that $\bar{\mu}_1 \ll \lambda$ and, consequently, $\bar{\mu}_1$ has a density function. In this context, Theorem 1 part ii) shows that $f_{p,\bar{\mu}}(\cdot)$ is fully determined by the probabilities $\bar{\mu}_1$ and $v_p$ evaluated on the tail events $\{B_\tau, \tau \geq 0\}$ in (21). More precisely, we have that:
$$\{(r, f_{p,\bar{\mu}}(r)), r \in (0, 1]\} = \left\{ \left( \bar{\mu}_1(B_\tau), \sqrt[p]{1 - v_p(B_\tau)} \right), \tau \in [0, \infty) \right\}. \tag{30}$$

At this point, we can use a realization of the process $X = (X_n)_{n \geq 1}$ and the point-wise ergodic theorem (PED) in Lemma 5 to approximate with arbitrary precision (and almost surely) any of the pairs $\left( \bar{\mu}_1(B_\tau), \sqrt[p]{1 - v_p(B_\tau)} \right)$ that

define the rate vs. $\ell_p$-approximation error of $X$. This is achieved without having to derive an expression for $\bar{\mu}$. Indeed, for any $\tau > 0$, we have from the PED theorem that
$$\lim_{n \to \infty} \underbrace{\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{B_\tau}(X_i)}_{\equiv \hat{\bar{\mu}}_1(B_\tau, X^n)} = \bar{\mu}_1(B_\tau), \quad \mu\text{-a.s.}, \tag{31}$$

Figure 2: Estimated curves of the approximation error function $f_{1,\bar{\mu}}(r)$ for the case of an i.i.d. Gaussian process and i.i.d. processes with Student's t-distributions with $q = 1.2, 2.1, 3, 7$ and $10$.

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{B_\tau}(X_i) \cdot |X_i|^p = \int_{B_\tau} |x|^p \, d\bar{\mu}_1(x), \quad \mu\text{-a.s.}, \tag{32}$$
and, consequently,

$$\lim_{n \to \infty} \underbrace{\frac{\sum_{i=1}^{n} \mathbf{1}_{B_\tau}(X_i) \cdot |X_i|^p}{\sum_{i=1}^{n} |X_i|^p}}_{\equiv \hat{v}_p(B_\tau, X^n)} = \frac{\int_{B_\tau} |x|^p \, d\bar{\mu}_1(x)}{\| |x|^p \|_{L_1(\bar{\mu}_1)}} = v_p(B_\tau), \quad \mu\text{-a.s.} \tag{33}$$

Therefore, using a finite-length realization of the process $X^n = (X_1, \ldots, X_n)$, for any $\tau > 0$, we have that the empirical probabilities $\left( \hat{\bar{\mu}}_1(B_\tau, X^n), \sqrt[p]{1 - \hat{v}_p(B_\tau, X^n)} \right)$ converge almost surely to the true expressions $\left( \bar{\mu}_1(B_\tau), \sqrt[p]{1 - v_p(B_\tau)} \right)$ as $n$ tends to infinity. This estimation-based strategy offers a numerical way to estimate, from a sample of the process, the monotonic behavior of $f_{p,\bar{\mu}}(\cdot)$ (see Proposition 1) using the empirical distributions in (31) and (33). In effect, this approach needs a strategy to sample the process and a finite collection of threshold points $\tau_1 < \tau_2 < \cdots < \tau_L$ selected to capture the diversity of the range $[0, \infty)$. Note that, as long as $L < \infty$, the almost sure convergence of $\left\{ \left( \hat{\bar{\mu}}_1(B_{\tau_i}, X^n), \sqrt[p]{1 - \hat{v}_p(B_{\tau_i}, X^n)} \right), i = 1, \ldots, L \right\}$ to $\left\{ \left( \bar{\mu}_1(B_{\tau_i}), \sqrt[p]{1 - v_p(B_{\tau_i})} \right), i = 1, \ldots, L \right\}$ is guaranteed from the union bound [12, 13]. In the following subsection, we illustrate this estimation strategy with some examples (a code sketch is given below). Finally, this strategy extends naturally to the more general scenario presented in Theorem 1 part iii), for which the almost sure convergence results presented in (58) and (59) can be adopted (see the proof of Theorem 1 in Section 8.2).
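The following Python sketch (ours; the function name is illustrative) implements this strategy: it computes the empirical pairs $\left( \hat{\bar{\mu}}_1(B_{\tau_i}, X^n), \sqrt[p]{1 - \hat{v}_p(B_{\tau_i}, X^n)} \right)$ from a single realization, using the estimators in (31) and (33):

```python
import numpy as np

def estimate_fp_curve(x, p, taus):
    """Empirical rate vs. l_p-approximation error pairs from one realization x
    of an AMS ergodic process: for each threshold tau, returns the pair
    (mu_hat_1(B_tau, x), (1 - v_hat_p(B_tau, x))**(1/p)), cf. Eqs. (31), (33)."""
    mags = np.abs(x)
    mags_p = mags ** p
    total = mags_p.sum()
    pairs = []
    for tau in taus:
        in_tail = mags >= tau                    # indicator of B_tau = (-inf,-tau] U [tau,inf)
        r_hat = in_tail.mean()                   # empirical mu_1(B_tau), Eq. (31)
        v_hat = mags_p[in_tail].sum() / total    # empirical v_p(B_tau), Eq. (33)
        pairs.append((r_hat, (1.0 - v_hat) ** (1.0 / p)))
    return pairs

rng = np.random.default_rng(6)
x = rng.standard_t(3.0, size=10**5)              # i.i.d. Student's t with q = 3
for r_hat, f_hat in estimate_fp_curve(x, p=1.0, taus=np.linspace(0.0, 20.0, 9)):
    print(f"rate {r_hat:.3f}  approx. error {f_hat:.3f}")
```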

Figure 3: Estimated curves of the approximation error function $f_{2,\bar{\mu}}(r)$ for the case of an i.i.d. Gaussian process and i.i.d. processes with Student's t-distributions with $q = 1.2, 2.1, 3, 7$ and $10$.

7.2. $\ell_p$-Compressible Case

The sampling-based strategy presented in Section 7.1 can be adopted to estimate $f_{p,\bar{\mu}}(\cdot)$ when the process is $\ell_p$-compressible, i.e., $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty$. In this scenario, we know from Theorem 1 part i) that $f_{p,\bar{\mu}}(r) = 0$ for any $r \in (0, 1]$. Although, in theory, the probability $v_p$ is not well defined, because we need $\int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) < \infty$ (see Eq. (20)), we can still use the same empirical expressions $\hat{\bar{\mu}}_1(B_\tau, X^n)$ and $\hat{v}_p(B_\tau, X^n)$ presented in (31) and (33). The main difference in this case is that the PED theorem implies that
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} |X_i|^p = \int_{\mathbb{R}} |x|^p \, d\bar{\mu}_1(x) = \infty, \quad \mu\text{-a.s.}, \tag{34}$$
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{B_\tau}(X_i) \cdot |X_i|^p = \int_{B_\tau} |x|^p \, d\bar{\mu}_1(x) = \infty, \quad \mu\text{-a.s.}, \tag{35}$$
using that $B_\tau = (-\infty, -\tau] \cup [\tau, \infty)$. (34) and (35) imply directly that $\lim_{n \to \infty} (1 - \hat{v}_p(B_\tau, X^n)) = 0$, $\mu$-a.s., for any $\tau > 0$. Consequently, we have that the pair of empirical probabilities $\left( \hat{\bar{\mu}}_1(B_\tau, X^n), \sqrt[p]{1 - \hat{v}_p(B_\tau, X^n)} \right)$ converges almost surely to the true expression $(r_\tau = \bar{\mu}_1(B_\tau), f_{p,\bar{\mu}}(r_\tau) = 0)$ as $n$ tends to infinity.

Therefore, we can use a finite collection of threshold points

$\tau_1 < \tau_2 < \cdots < \tau_L$ to sample $f_{p,\bar{\mu}}(\cdot)$, where in particular we have that
$$\left\{ \left( \hat{\bar{\mu}}_1(B_{\tau_i}, X^n), \sqrt[p]{1 - \hat{v}_p(B_{\tau_i}, X^n)} \right), i = 1, \ldots, L \right\}$$
converges almost surely (as $n$ tends to infinity) to the true sample points

{(ri =µ ¯1(Bτi ), fp,µ¯(ri) = 0) : i = 1, .., L} of the zero constant function fp,µ¯(·). n  o ˆ n pp n Finally, looking at the trend of µ¯1(Bτi ,X ), 1 − vˆp(Bτi ,X ) , i = 1, .., L as n tends to infinity, we could have the ability to infer whether fp,µ¯(·) is the triv- R p ial constant function equal to zero (associated to the case |x| dµ¯1(x) = ∞) R p R or not (associated to the case R |x| dµ¯1(x) < ∞). This dichotomy is observed in the next section in Figure 3.
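As a simple numerical probe of this dichotomy (an illustrative check of ours, using the i.i.d. Student's t setup of the next subsection), one can monitor the running empirical p-moment \frac{1}{n}\sum_{i=1}^{n}|X_i|^p: it stabilizes when \int_{\mathbb{R}}|x|^p\, d\bar{\mu}_1(x) < \infty and drifts upward, consistently with (34), in the \ell_p-compressible case.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 2.0
for q in (1.2, 3.0):   # q = 1.2 < p: l_2-compressible; q = 3 > p: non-compressible
    x = rng.standard_t(df=q, size=10**6)
    running = np.cumsum(np.abs(x) ** p) / np.arange(1, x.size + 1)
    # Checkpoints of the running p-moment: it keeps growing when
    # E|X|^p = infinity (q <= p) and settles near q/(q-2) when q > p = 2.
    print(f"q = {q}:", running[[10**3 - 1, 10**5 - 1, 10**6 - 1]])
```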

7.3. Numerical Examples

We begin by revisiting some of the examples used in [3], but in the context of the refined sample-wise almost sure approximation error analysis presented in this paper. We begin with i.i.d. processes induced by different densities: from the Gaussian case, which has an exponential tail, to the case of the Student's t-distribution with different values of its parameter q > 0. The density of a Student's t-distribution with q degrees of freedom is said to be heavy tailed, as its tail goes to zero (with x) at the polynomial rate O(|x|^{-q-1}).^{21} From Theorem 2, we know that the Gaussian process is non-\ell_p-compressible for any p > 0. On the other hand, an i.i.d. process driven by a Student's t-distribution with parameter q > 0 is non-\ell_p-compressible if, and only if, p < q (see Eq.(25) in Section 3.3.1).

Using the estimation strategy presented above, and for all the aforementioned scenarios, we consider a good range of points \tau \in [0, \infty) and a sufficiently large number of samples for each process (n = 10^5) to obtain a precise estimation of a finite collection of rate vs. \ell_p-approximation error pairs in (30). These points are illustrated in Figures 2 and 3. Beginning with p = 1, Figure 2 presents the estimated functions f_{1,\bar{\mu}}(r) for the Gaussian case and Student's t-distributions with q = 1.2, 2.1, 3, 7, and 10. All the presented curves are consistent with the observation that all the selected processes are non-\ell_1-compressible (see Section 3.3.1) and, consequently, for each example, the estimation of f_{p,\bar{\mu}}(\cdot) should be non-zero (from Theorem 1 part ii)), continuous, and strictly decreasing (from Proposition 1). In addition, the curves show that the Gaussian i.i.d. process (white noise) is the least compressible and that the degree of compressibility decreases from case to case as a function of how fast the tail of the density of the i.i.d. process goes to zero. This observation is consistent with previous results

^{21} The density of a Student's t-distribution with parameter q > 0 (degrees of freedom) is f(x) = \frac{\Gamma((q+1)/2)}{\sqrt{q\pi}\,\Gamma(q/2)}\left(1+x^2/q\right)^{-(q+1)/2}, where \Gamma(\cdot) denotes the Gamma function.

that show that heavy tail processes are better approximated by their best sparse versions than processes with exponential tails [1, 3].

The same set of curves is obtained for p = 2 in Figure 3. Using the analysis presented in Section 3.3.1 (for p = 2), we have that only one of the selected processes is \ell_2-compressible (the case with a Student's t-distribution with q = 1.2), and the rest are non-\ell_2-compressible. The estimations of f_{p,\bar{\mu}}(\cdot) presented in Figure 3 are consistent with Theorem 1 part ii), showing non-zero functions with decreasing trends for all the examples that are non-\ell_2-compressible. For the \ell_2-compressible case (Student's t-distribution with q = 1.2), the estimated curve matches with good precision the constant equal to zero function predicted by Theorem 1 part i).

To complement this analysis, Figure 4 illustrates f_{p,\bar{\mu}}(\cdot) for the same i.i.d. Gaussian process for different p values (p \in \{0.2, 0.5, 0.8, 1.5, 2\}). There is a clear monotonic behavior of the curves as the magnitude of p increases in the analysis, where the Gaussian process becomes less compressible as p increases. For this analysis, we know that the Gaussian i.i.d. process is non-\ell_p-compressible for any possible value of p (see Section 3.3.1), which is shown consistently in the estimated curves presented in Figure 4.


Figure 4: Estimated curves of the approximation error function (f_{p,\bar{\mu}}(r))_{r\in(0,1]} for an i.i.d. Gaussian process.

Finally, we estimate f_{p,\bar{\mu}}(\cdot) for an AMS and ergodic process that is obtained as the result of an i.i.d. innovation X = (X_n)_{n\ge 1} passing through a stationary coding (see Lemma 8). In particular, we consider the case of a linear operator of the form \phi(x) = \sum_{i=1}^{M} a_i\cdot x_{M+1-i}. This is one of the classical LTI constructions used to model random signals in statistical signal processing [17]. In this case, the invariant distribution of the resulting process can be obtained from the product distribution of the input (X_n)_{n\ge 1} and Corollary 3. However, we use the

estimation approach, for which we only need a sufficiently large sample of (X_n)_{n\ge 1} and the application of the convolution equation to reproduce Y_n = \phi(T^{n-1}(X)) for any n \ge 1. Figure 5 shows f_{p,\bar{\mu}}(r) (for X with stationary mean \bar{\mu}) and f_{p,\bar{v}}(r) (for Y = (Y_n)_{n\ge 1} with stationary mean \bar{v}) for the case of the LTI filter with (a_i)_{i=1}^{M} = (1, 1, .., 1) and when X is i.i.d. driven by a Student's t-distribution with parameter q = 2.1 and p = 2 (i.e., non-\ell_p-compressible for p = 2). The filter (parametrized by (a_i)_{i=1}^{M}) clearly changes the compressibility signature of the output process and, furthermore, the effect of increasing M on f_{p,\bar{v}}(r) (the output) in its relationship with f_{p,\bar{\mu}}(r) (the input) is evident. To clearly observe these changes from the input to the output, we choose the process that showed (from the previous analysis) the most compressible curve in Figure 3, i.e., the Student's t-distribution with q = 2.1. At the other end, Figure 6 shows the same analysis when the input process X is an i.i.d. Gaussian process (see Figure 3). In this case, however, the effect of the filter (a_i)_{i=1}^{M} and its length on the compressibility of Y is imperceptible. We observe from these two last examples (in Figures 5 and 6) that a linear stationary coding makes the output process less compressible than its input process, which is expected from the convolutional nature of the mapping; however, when the input process has an exponential tail (white noise), the effect of linear filtering is imperceptible.
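The sketch below (illustrative; it assumes the helper empirical_rate_error_pairs from the sketch in Section 7.1 and the all-ones taps used in Figure 5) implements the convolution step and the estimation of the output curves:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_t(df=2.1, size=10**5)      # i.i.d. innovation: Student's t, q = 2.1

curves = {}
for M in (5, 10, 50, 300):
    a = np.ones(M)                           # (a_i)_{i=1}^M = (1, 1, .., 1)
    y = np.convolve(x, a, mode="valid")      # Y_n: stationary (LTI) coding of X
    taus = np.quantile(np.abs(y), np.linspace(0.0, 0.999, 50))
    curves[M] = empirical_rate_error_pairs(y, p=2, taus=taus)
```

Longer filters average more innovation samples, so the output drifts toward a less compressible, Gaussian-like signature, consistent with the trend observed in Figure 5.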


Figure 5: Approximation error curves for an i.i.d. process driven by a Student's t-distribution with q = 2.1 and for the convolution of this process (innovation) with the LTI system (a_i)_{i=1}^{M} = (1, 1, .., 1) for M \in \{5, 10, 15, 20, 50, 100, 300\}.

Figure 6: The same set of curves presented in Figure 5, but using an i.i.d. Gaussian process as the input of the LTI system.

8. Proof of Theorem 1

First, we introduce a number of preliminary results, definitions, and properties that will be essential to elaborate the main argument used to prove Theorem 1.

8.1. Preliminaries

For the case of AMS and ergodic sources (see Lemmas 2 and 4), the ergodic theorem [12, Th. 7.5] (see Lemma 5) tells us that for any \ell_1-integrable function with respect to \bar{\mu}_1, f: (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R})), the sample mean (computed with a realization of X) converges with probability one (with respect to \mu) to the expectation of f with respect to \bar{\mu}_1, i.e.,

\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} f(X_i) \;=\; \mathbb{E}_{X\sim\bar{\mu}_1}(f(X)) < \infty, \quad \mu\text{-a.s.} \quad (36)

Therefore, we have that for any B \in \mathcal{B}(\mathbb{R}),

\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{B}(X_i) \;=\; \bar{\mu}_1(B), \quad \mu\text{-a.s.} \quad (37)

In addition, if \int_{\mathbb{R}}|x|^p\, d\bar{\mu}_1(x) < \infty, then for any B \in \mathcal{B}(\mathbb{R}),

\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{B}(X_i)\cdot|X_i|^p \;=\; \int_{B}|x|^p\, d\bar{\mu}_1(x) \;=\; \|(x^p\,\mathbf{1}_B)\|_{L_1(\bar{\mu}_1)}, \quad \mu\text{-a.s.}, \quad (38)

and, consequently,

\lim_{n\to\infty}\frac{\sum_{i=1}^{n}\mathbf{1}_{B}(X_i)\cdot|X_i|^p}{\sum_{i=1}^{n}|X_i|^p} \;=\; v_p(B) \;=\; \frac{\int_{B}|x|^p\, d\bar{\mu}_1(x)}{\|(x^p)\|_{L_1(\bar{\mu}_1)}}, \quad \mu\text{-a.s.} \quad (39)

Let us define the tail distribution function of a probability v in (\mathbb{R}, \mathcal{B}(\mathbb{R})), an object that will play a relevant role in the argument to prove Theorem 1.

Definition 17. For v \in \mathcal{P}(\mathbb{R}), let us define its tail distribution function by \phi_v(\tau) \equiv v(B_\tau) for all \tau \in [0, \infty).

It is simple to verify the following:

Proposition 2. For any v \in \mathcal{P}(\mathbb{R}), it follows that:

i) if \tau_1 > \tau_2 then \phi_v(\tau_1) \le \phi_v(\tau_2), and \phi_v(\tau_1) = \phi_v(\tau_2) if, and only if, v([\tau_2, \tau_1)\cup(-\tau_1, -\tau_2]) = 0,

ii) \phi_v(0) = 1 and \lim_{\tau\to\infty}\phi_v(\tau) = 0, and

iii) (\phi_v(\tau))_{\tau\ge 0} is left continuous and \phi_v^{+}(\tau) \equiv \lim_{t_n\to\tau,\ t_n>\tau}\phi_v(t_n) = \phi_v(\tau) - v(\{\tau\}\cup\{-\tau\}).

The proof is presented in Appendix C.

Therefore, (\phi_v(\tau)) is a continuous function except at the points where v has atomic mass (see Fig. 7). From a well-known result in real analysis [18], using the fact that (\phi_v(\tau)) is monotone (non-increasing), this function has at most a countable number of discontinuities. This means that v has at most a countable number of non-zero probability events in the collection \{\{\tau\}\cup\{-\tau\},\ \tau\in[0,\infty)\}\subset\mathcal{B}(\mathbb{R}), which we index and denote by \mathcal{Y}_v \subset [0, \infty).
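As a concrete illustration of Definition 17 and Proposition 2 (an example of ours, not taken from the original exposition), consider v = \frac{1}{2}\,\mathrm{Unif}[-1,1] + \frac{1}{2}\,\delta_{2}. Then

\phi_v(\tau) \;=\;
\begin{cases}
1-\tau/2, & \tau\in[0,1],\\
1/2, & \tau\in(1,2],\\
0, & \tau>2,
\end{cases}
\qquad
\phi_v^{+}(2) \;=\; 0 \;=\; \phi_v(2) - v(\{2\}\cup\{-2\}),

so \phi_v is non-increasing and left continuous, with a single discontinuity at the atom, giving \mathcal{Y}_v = \{2\}.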


Figure 7: Illustration of the tail distribution function \phi_{\bar{\mu}}(\tau) of \bar{\mu} \in \mathcal{P}(\mathbb{R}) with a single discontinuity point at \tau_0 > 0 in [0, \infty).

By the definition of v_p (stated in (20)), the discontinuity points of (\phi_{\bar{\mu}_1}(\tau)) and (\phi_{v_p}(\tau)) agree,^{22} from the fact that if \tau > 0 then \bar{\mu}_1(\{\tau\}\cup\{-\tau\}) = 0 if, and only if, v_p(\{\tau\}\cup\{-\tau\}) = 0.

^{22} There is only one exception, when \tau = 0.

Therefore, we have that \mathcal{Y}_{v_p} = \mathcal{Y}_{\bar{\mu}_1}\setminus\{0\}. For the rest of the proof, it is relevant to consider the range of these tail functions. These can be characterized as follows (see Figure 7):

\mathcal{R}^{*}_{\bar{\mu}_1} \equiv \{\phi_{\bar{\mu}_1}(\tau),\ \tau\ge 0\} \;=\; (0,1]\setminus\bigcup_{\tau_n\in\mathcal{Y}_{\bar{\mu}_1}} [\bar{\mu}_1(C_{\tau_n}),\ \bar{\mu}_1(B_{\tau_n})), \quad (40)

\mathcal{R}^{*}_{v_p} \equiv \{\phi_{v_p}(\tau),\ \tau\ge 0\} \;=\; (0,1]\setminus\bigcup_{\tau_n\in\mathcal{Y}_{\bar{\mu}_1}\setminus\{0\}} [v_p(C_{\tau_n}),\ v_p(B_{\tau_n})), \quad (41)

where \mathcal{Y}_{\bar{\mu}_1} is either the empty set, a finite set, or a countable set.

With the tail functions (\phi_{\bar{\mu}_1}(\tau)) and (\phi_{v_p}(\tau)), we can introduce the collection

\left\{\left(\phi_{\bar{\mu}_1}(\tau),\ \sqrt[p]{1-\phi_{v_p}(\tau)}\right),\ \tau\in[0,\infty)\right\} \quad (42)

that in its first coordinate covers the range \mathcal{R}^{*}_{\bar{\mu}_1}. For the non-continuous case, i.e., |\mathcal{Y}_{\bar{\mu}_1}| > 0, we can complete the range in the first coordinate to cover the non-achievable values \bigcup_{\tau_n\in\mathcal{Y}_{\bar{\mu}_1}} [\bar{\mu}_1(C_{\tau_n}), \bar{\mu}_1(B_{\tau_n})) in (40) (Fig. 7 illustrates this range when \mathcal{Y}_{\bar{\mu}_1} = \{\tau_0\}) by the following simple extension:

\mathcal{F}_{\bar{\mu}_1} \equiv \left\{\left(\phi_{\bar{\mu}_1}(\tau),\ \sqrt[p]{1-\phi_{v_p}(\tau)}\right),\ \tau\in[0,\infty)\right\} \;\cup\; \bigcup_{\tau_n\in\mathcal{Y}_{\bar{\mu}_1}} \left\{\left(\bar{\mu}_1(C_{\tau_n}) + \alpha\,\bar{\mu}_1(\{-\tau_n,\tau_n\}),\ \sqrt[p]{1-v_p(C_{\tau_n})-\alpha\, v_p(\{-\tau_n,\tau_n\})}\right),\ \alpha\in[0,1)\right\}. \quad (43)

Importantly, we have the following simple result for the points in \mathcal{F}_{\bar{\mu}_1}:

Proposition 3. The collection of pairs in \mathcal{F}_{\bar{\mu}_1} defines a function from (0,1] to [0,1) that we denote by (f_{\bar{\mu}_1}(r))_{r\in(0,1]}.

Proof: For any r \in (0,1], we have that either: r \in \mathcal{R}^{*}_{\bar{\mu}_1} in (40), for which there is a unique \tau^{*} \ge 0 such that r = \phi_{\bar{\mu}_1}(\tau^{*}) and, consequently, there is only one d = \sqrt[p]{1-\phi_{v_p}(\tau^{*})} \in [0,1) such that (r,d)\in\mathcal{F}_{\bar{\mu}_1}; or r \in \bigcup_{\tau_n\in\mathcal{Y}_{\bar{\mu}_1}} [\bar{\mu}_1(C_{\tau_n}), \bar{\mu}_1(B_{\tau_n})), for which there is a unique pair (\tau^{*}, \alpha^{*}) \in \mathcal{Y}_{\bar{\mu}_1}\times[0,1) such that r = \bar{\mu}_1(C_{\tau^{*}}) + \alpha^{*}\,\bar{\mu}_1(\{-\tau^{*}\}\cup\{\tau^{*}\}) and, consequently, a unique d = \sqrt[p]{1-v_p(C_{\tau^{*}})-\alpha^{*}\, v_p(\{-\tau^{*}\}\cup\{\tau^{*}\})} such that again (r,d)\in\mathcal{F}_{\bar{\mu}_1}.

In addition, from the properties of (\phi_{\bar{\mu}_1}(\tau)) and (\phi_{v_p}(\tau)), the following can be stated:

Lemma 10. The function (f_{\bar{\mu}_1}(r))_{r\in(0,1]} induced by the set \mathcal{F}_{\bar{\mu}_1} in (43) has the following properties:

i) (f_{\bar{\mu}_1}(r)) is continuous in the domain r \in (0,1].

ii) (f_{\bar{\mu}_1}(r)) is strictly decreasing in the domain r \in f_{\bar{\mu}_1}^{-1}((0,1)) \subset (0,1). More precisely, for any pair (r_1, r_2) such that 0 < r_1 < r_2 \le 1-\bar{\mu}_1(\{0\}), it follows that f_{\bar{\mu}_1}(r_2) < f_{\bar{\mu}_1}(r_1).

iii) f_{\bar{\mu}_1}^{-1}((0,1)) = (0, 1-\bar{\mu}_1(\{0\})) \subset (0,1). Consequently, f_{\bar{\mu}_1}^{-1}((0,1)) = (0,1) if, and only if, \bar{\mu}_1(\{0\}) = 0 (i.e., 0 \notin \mathcal{Y}_{\bar{\mu}_1}).

iv) f_{\bar{\mu}_1}(r) = 0\ \forall r \in [1-\bar{\mu}_1(\{0\}), 1] and \lim_{r\to 0} f_{\bar{\mu}_1}(r) = 1.

v) The range of (f_{\bar{\mu}_1}(r))_{r\in(0,1]} is [0,1).

The proof of this result is presented in Appendix B. We are now in a position to prove the main result.

8.2. Main Argument — Case \int_{\mathbb{R}}|x|^p\, d\bar{\mu}_1(x) < \infty

Proof: Let us assume that \int_{\mathbb{R}}|x|^p\, d\bar{\mu}_1(x) < \infty. Let us consider an arbitrary r \in (0,1] and a sequence (k_n)_{n\ge 1} such that k_n/n \to r as n tends to infinity.

Case 1: Continuous scenario: Let us first consider the case where r \in \mathrm{int}(\mathcal{R}^{*}_{\bar{\mu}_1}), i.e., r \in \mathcal{R}^{*}_{\bar{\mu}_1}\setminus\{\bar{\mu}_1(B_{\tau_n}),\ \tau_n\in\mathcal{Y}_{\bar{\mu}_1}\}, and, consequently, there is \tau_o such that r = \phi_{\bar{\mu}_1}(\tau_o), with \tau_o a continuity point of the tail function \phi_{\bar{\mu}_1}(\cdot) (see iii) in Proposition 2).

Let us define n_\tau(x^n) \equiv \sum_{i=1}^{n}\mathbf{1}_{B_\tau}(x_i). Then, using the (point-wise) ergodic theorem in (38), it follows that for all \tau \ge 0,

\lim_{n\to\infty}\frac{n_\tau(X^n)}{n} \;=\; \phi_{\bar{\mu}_1}(\tau), \quad \mu\text{-a.s.}, \quad (44)

and from (39) and (2),

\lim_{n\to\infty}\tilde{\sigma}_p(n_\tau(X^n), X^n) \;=\; \sqrt[p]{1-\phi_{v_p}(\tau)}, \quad \mu\text{-a.s.} \quad (45)

In other words, we have the following family of (typical) sets:

\mathcal{A}^{\tau} \equiv \left\{(x_n)_{n\ge 1}:\ \lim_{n\to\infty}\frac{n_\tau(x^n)}{n} = \phi_{\bar{\mu}_1}(\tau)\right\} \quad (46)

\mathcal{B}^{\tau} \equiv \left\{(x_n)_{n\ge 1}:\ \lim_{n\to\infty}\tilde{\sigma}_p(n_\tau(x^n), x^n) = \sqrt[p]{1-\phi_{v_p}(\tau)}\right\}, \quad (47)

satisfying that \mu(\mathcal{A}^{\tau}\cap\mathcal{B}^{\tau}) = 1 for all \tau \ge 0.

Using the fact that \phi_{\bar{\mu}_1}(\cdot) is continuous at \tau_o and the observation that \phi_{\bar{\mu}_1}(\cdot) has at most a countable number of discontinuities, there is \delta \in (0, r) such that the interval (r-\delta, r+\delta) defines an open domain containing \tau_o, given by (\tau_1, \tau_2) = \phi_{\bar{\mu}_1}^{-1}((r-\delta, r+\delta)), where the function \phi_{\bar{\mu}_1}(\cdot) is continuous (see Figure 8). Associated with this domain, we can consider \{\phi_{v_p}(\tau),\ \tau\in(\tau_1,\tau_2)\} = (v_2, v_1), where by monotonicity v_1 = \phi_{v_p}(\tau_1) > v_2 = \phi_{v_p}(\tau_2) (see Figure 8 for an illustration). It is simple to show (by the construction of v_p from \bar{\mu}_1)^{23} that for any \tau > 0 and \epsilon > 0,

\phi_{\bar{\mu}_1}(\tau+\epsilon) < \phi_{\bar{\mu}_1}(\tau) \quad\text{if, and only if,}\quad \phi_{v_p}(\tau+\epsilon) < \phi_{v_p}(\tau). \quad (48)

^{23} Note that for any B \in \mathcal{B}(\mathbb{R}) with 0 \notin B, \bar{\mu}_1(B) = 0 if, and only if, v_p(B) = 0.

Therefore, this mutual absolute continuity property between \bar{\mu}_1 and v_p implies that

\phi_{\bar{\mu}_1}(\tau_1) > \phi_{\bar{\mu}_1}(\tau_o) = r \ \Leftrightarrow\ \phi_{v_p}(\tau_1) = v_1 > \phi_{v_p}(\tau_o), \quad\text{and} \quad (49)

\phi_{\bar{\mu}_1}(\tau_2) < \phi_{\bar{\mu}_1}(\tau_o) = r \ \Leftrightarrow\ \phi_{v_p}(\tau_2) = v_2 < \phi_{v_p}(\tau_o). \quad (50)


Figure 8: Illustration of the tail distribution functions of \bar{\mu}_1 and v_p at a continuity point \tau_0, where r = \phi_{\bar{\mu}_1}(\tau_0).

We can then find M > 0 sufficiently large such that for all m \ge M, \phi_{v_p}(\tau_o)+1/m < v_1. For any of these m \ge M, there is \tau_m \in (\tau_1, \tau_2) (from the continuity of \phi_{v_p}(\cdot) in (\tau_1, \tau_2)) such that \phi_{v_p}(\tau_m) = \phi_{v_p}(\tau_o)+1/m, where again by (48) \phi_{\bar{\mu}_1}(\tau_m) > \phi_{\bar{\mu}_1}(\tau_o) = r. Therefore, for any m \ge M and any (x_n)_{n\ge 1} \in \mathcal{A}^{\tau_m}\cap\mathcal{B}^{\tau_m}, the condition n_{\tau_m}(x^n) > k_n is met eventually in n as n tends to infinity.

This comes from the assumption that \lim_{n\to\infty} k_n/n = r < \phi_{\bar{\mu}_1}(\tau_m) and the definition of \mathcal{A}^{\tau_m} given in (46). Consequently, under this context, it follows that

\tilde{\sigma}_p(n_{\tau_m}(x^n), x^n) \ \le\ \tilde{\sigma}_p(k_n, x^n) \quad (51)

τ is met eventually in n. Finally, using explicitly that (xn)n≥1 ∈ B m (see Eq.(47)), we have that q p n 1 − (φv (τo) + 1/m) ≤ lim inf σ˜p(kn, x ). (52) p n→∞

Repeating this argument, if (x_n)_{n\ge 1} \in \bigcap_{m\ge M}(\mathcal{A}^{\tau_m}\cap\mathcal{B}^{\tau_m}), it follows from (52) that^{24}

\sqrt[p]{1-\phi_{v_p}(\tau_o)} \ \le\ \liminf_{n\to\infty}\tilde{\sigma}_p(k_n, x^n). \quad (53)

^{24} This is obtained by taking the supremum over m \ge M in the LHS of (52) and using the continuity of the function \sqrt[p]{1-x} in x \in (0,1).

By sigma additivity [13] and the fact that, from the ergodic theorem, \mu(\mathcal{A}^{\tau_m}\cap\mathcal{B}^{\tau_m}) = 1 for any m \ge 1, it follows that

\sqrt[p]{1-\phi_{v_p}(\tau_o)} \ \le\ \liminf_{n\to\infty}\tilde{\sigma}_p(k_n, X^n), \quad \mu\text{-a.s.} \quad (54)

The exact same argument can be used to prove that

\limsup_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \le\ \sqrt[p]{1-\phi_{v_p}(\tau_o)}, \quad \mu\text{-a.s.}, \quad (55)

by using the sequence \tilde{\tau}_m such that \phi_{v_p}(\tilde{\tau}_m) = \phi_{v_p}(\tau_o)-1/m for m \ge \tilde{M}, with \tilde{M} sufficiently large. This part of the argument is omitted for the sake of space. Finally, (54) and (55) prove the result in the continuous case.^{25}

Case 2: Discontinuous scenario: Let us consider the case where r \notin \mathcal{R}^{*}_{\bar{\mu}_1} (see

Eq.(40)), which means that ∃τi ∈ Yµ¯1 such that

r \in [\bar{\mu}_1(C_{\tau_i}),\ \bar{\mu}_1(B_{\tau_i})) \quad (56)

(see the illustration in Fig. 9). For the moment, let us assume that r \in (\bar{\mu}_1(C_{\tau_i}), \bar{\mu}_1(B_{\tau_i})).^{26} Then there is a unique \alpha_o \in (0,1) such that

r = \bar{\mu}_1(C_{\tau_i}) + \alpha_o\cdot\bar{\mu}_1(\{-\tau_i, \tau_i\}). \quad (57)

Here we need to use an extended version of the point-wise ergodic theorem in (36). For that, let us introduce an i.i.d. Bernoulli process Y = (Y_i)_{i\ge 1} of parameter \rho \in [0,1], where \mathbb{P}(Y_i = 1) = \rho for all i \ge 1, which is independent of X = (X_n)_{n\ge 1}. Let us denote by \eta its (i.i.d.) process distribution on \{0,1\}^{\mathbb{N}}. Then, from the ergodic theorem for AMS processes in (36), it follows, as a natural extension of (37), that for all \tau \ge 0,

\lim_{n\to\infty}\ \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{C_\tau}(X_i) + \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{-\tau,\tau\}}(X_i)\cdot Y_i \;=\; \bar{\mu}_1(C_\tau) + \bar{\mu}_1(\{-\tau,\tau\})\,\rho, \quad (58)

\lim_{n\to\infty}\ \frac{\sum_{i=1}^{n}\left(\mathbf{1}_{C_\tau}(X_i) + \mathbf{1}_{\{-\tau,\tau\}}(X_i)\,Y_i\right)|X_i|^p}{\sum_{i=1}^{n}|X_i|^p} \;=\; \frac{\int_{C_\tau}|x|^p\, d\bar{\mu}_1(x) + \bar{\mu}_1(\{-\tau,\tau\})\,\tau^p\,\rho}{\|(x^p)\|_{L_1(\bar{\mu}_1)}}, \quad (59)

with probability one with respect to the joint process distribution of (X, Y), denoted by \mu\times\eta.
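To see the role of the auxiliary process (an illustration of ours, reusing the toy measure \bar{\mu}_1 = \frac{1}{2}\delta_0 + \frac{1}{4}\delta_1 + \frac{1}{4}\delta_{-1} introduced after Proposition 3), consider the discontinuity at \tau = 1, where no deterministic threshold achieves a rate in (0, 1/2). Evaluating the limits in (58) and (59) with parameter \rho gives

\bar{\mu}_1(C_1) + \bar{\mu}_1(\{-1,1\})\,\rho \;=\; 0 + \frac{\rho}{2},
\qquad
\frac{\int_{C_1}|x|^p\, d\bar{\mu}_1(x) + \bar{\mu}_1(\{-1,1\})\cdot 1^p\cdot\rho}{\|(x^p)\|_{L_1(\bar{\mu}_1)}} \;=\; \frac{0+\rho/2}{1/2} \;=\; \rho,

so choosing \rho = \alpha_o attains the rate \alpha_o/2 with asymptotic error \sqrt[p]{1-\alpha_o} = f_{\bar{\mu}_1}(\alpha_o/2), exactly the value prescribed by (43).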

^{25} The proof assumes that r < 1. The proof for the asymmetric case when r = 1 \in \mathrm{int}(\mathcal{R}^{*}_{\bar{\mu}_1}), i.e., when r = 1 is a continuity point of \phi_{\bar{\mu}_1}(\cdot), follows from the same argument presented above. On the one hand, \tilde{\sigma}_p(k_n, x^n) \ge 0 for any k_n by definition. On the other hand, the argument used to obtain (55) follows without any problem in this context, implying that \limsup_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \le \sqrt[p]{1-\phi_{v_p}(\tau_o)} = 0, \mu\text{-a.s.}, considering that \tau_o = 0 in this case.

^{26} We leave the case r \in \{\bar{\mu}_1(C_{\tau_n}):\ \tau_n\in\mathcal{Y}_{\bar{\mu}_1}\}\cup\{\bar{\mu}_1(B_{\tau_n}):\ \tau_n\in\mathcal{Y}_{\bar{\mu}_1}\} for the mixed scenario below.


Figure 9: Illustration of the tail distribution functions of \bar{\mu}_1 and v_p at a discontinuity point \tau_i > 0.

Returning to the argument, let us consider an arbitrary (k_n)_{n\ge 1} such that k_n/n \to r as n tends to infinity. Let us consider \alpha_m \equiv \alpha_o + 1/m and M sufficiently large to make \alpha_M < 1. For any m \ge M, let us construct an auxiliary i.i.d. Bernoulli process Y(\alpha_m) \equiv (Y_i)_{i\ge 1}, where \mathbb{P}(Y_i = 1) = \alpha_m. The process distribution of Y(\alpha_m) is denoted by \eta_m. In this context, if we define the joint count function

n_\tau(x^n, y^n) \equiv \sum_{i=1}^{n}\mathbf{1}_{C_\tau}(x_i) + \sum_{i=1}^{n}\mathbf{1}_{\{\tau\}\cup\{-\tau\}}(x_i)\cdot y_i,

it follows from (58) and (59) that for \tau_i introduced in (56),

\lim_{n\to\infty}\frac{n_{\tau_i}(X^n, Y^n)}{n} \;=\; \bar{\mu}_1(C_{\tau_i}) + (\alpha_o+1/m)\cdot\bar{\mu}_1(\{-\tau_i, \tau_i\}), \quad (60)

\lim_{n\to\infty}\tilde{\sigma}_p(n_{\tau_i}(X^n, Y^n), X^n) \;=\; \sqrt[p]{1-\left(v_p(C_{\tau_i}) + (\alpha_o+1/m)\cdot v_p(\{-\tau_i, \tau_i\})\right)}, \quad (61)

\mu\times\eta_m-almost surely. Importantly, in (61), v_p(\{-\tau_i, \tau_i\}) > 0 from the fact that \bar{\mu}_1(\{-\tau_i, \tau_i\}) > 0.^{27}

Let us consider an arbitrary (typical) sequence ((x_n)_{n\ge 1}, (y_n)_{n\ge 1}) satisfying the limiting conditions in (60) and (61). From (60), it follows that the condition n_{\tau_i}(x^n, y^n) > k_n happens eventually in n, as k_n/n \to r = \bar{\mu}_1(C_{\tau_i}) + \alpha_o\cdot\bar{\mu}_1(\{-\tau_i, \tau_i\}) < \bar{\mu}_1(C_{\tau_i}) + (\alpha_o+1/m)\cdot\bar{\mu}_1(\{-\tau_i, \tau_i\}) by construction. Therefore, the condition

\tilde{\sigma}_p(n_{\tau_i}(x^n, y^n), x^n) \ \le\ \tilde{\sigma}_p(k_n, x^n) \quad\text{holds eventually in } n. \quad (62)

The left hand side of (62) converges to \sqrt[p]{1-\left(v_p(C_{\tau_i}) + (\alpha_o+1/m)\cdot v_p(\{-\tau_i, \tau_i\})\right)} as n tends to infinity, by the construction of ((x_n)_{n\ge 1}, (y_n)_{n\ge 1}) and (61). Finally, by the almost sure convergence in (60) and (61), it follows that

\liminf_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \ge\ \sqrt[p]{1-\left(v_p(C_{\tau_i}) + (\alpha_o+1/m)\cdot v_p(\{-\tau_i, \tau_i\})\right)}, \quad (63)

\mu-almost surely.^{28}

Let us denote \mathcal{D}^{\tau_m} \equiv \{(x_n)_{n\ge 1}:\ \text{(63) holds}\}. From (63), \mu(\mathcal{D}^{\tau_m}) = 1 and, by sigma additivity [13], it follows that \mu(\bigcap_{m\ge M}\mathcal{D}^{\tau_m}) = 1, which implies that

\liminf_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \ge\ \sqrt[p]{1-\left(v_p(C_{\tau_i}) + \alpha_o\cdot v_p(\{-\tau_i, \tau_i\})\right)}, \quad \mu\text{-a.s.} \quad (64)

To conclude, an equivalent (symmetric) argument can be used to prove that

\limsup_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \le\ \sqrt[p]{1-\left(v_p(C_{\tau_i}) + \alpha_o\cdot v_p(\{-\tau_i, \tau_i\})\right)}, \quad \mu\text{-a.s.}, \quad (65)

using \tilde{\alpha}_m \equiv \alpha_o - 1/m and \tilde{M} sufficiently large to make \tilde{\alpha}_{\tilde{M}} > 0. For the sake of space, the proof of this part is omitted. This concludes the result in this case.

Case 3: Mixed scenario: Here we consider the scenario where

r \in \left\{\bar{\mu}_1(B_{\tau_n}),\ \bar{\mu}_1(C_{\tau_n}):\ \tau_n\in\mathcal{Y}_{\bar{\mu}_1}\right\}.

^{27} Here we assume that \tau_i > 0. The important sparse case when \tau_i = 0 will be treated below.

^{28} We remove the dependency on \eta_m, as both terms in (63) (in the limit) turn out to be independent of the auxiliary process (Y_i)_{i\ge 1}.

The proof reduces to the same procedure presented above in the continuous and discontinuous scenarios, but adopted in a mixed form. A sketch with the basic steps of the argument is provided here, as no new technical elements are needed.

For r = \bar{\mu}_1(B_{\tau_i}), with \tau_i \in \mathcal{Y}_{\bar{\mu}_1} and \tau_i \ne 0, the same argument adopted in the continuous case (to obtain (54)) can be used here to obtain that

\liminf_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \ge\ \sqrt[p]{1-\phi_{v_p}(\tau_i)}, \quad \mu\text{-a.s.}, \quad (66)

for any sequence (k_n)_{n\ge 1} such that k_n/n \to \bar{\mu}_1(B_{\tau_i}). For the other inequality, the strategy with the auxiliary Bernoulli process presented in the proof of the discontinuous case can be adopted, considering \alpha_o = 1 and \tilde{\alpha}_m = 1-1/m for m sufficiently large. Then, a result equivalent to (65) is obtained, meaning in this specific context that

\limsup_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \le\ \sqrt[p]{1-\phi_{v_p}(\tau_i)}, \quad \mu\text{-a.s.} \quad (67)

For r = \bar{\mu}_1(C_{\tau_i}), with \tau_i \in \mathcal{Y}_{\bar{\mu}_1} and \tau_i \ne 0, the same argument with the auxiliary Bernoulli process used to obtain (64) can be adopted here, considering \alpha_o = 0 and \alpha_m = 1/m for m sufficiently large, to obtain that

\liminf_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \ge\ \sqrt[p]{1-v_p(C_{\tau_i})}, \quad \mu\text{-a.s.}, \quad (68)

for any sequence (k_n)_{n\ge 1} such that k_n/n \to \bar{\mu}_1(C_{\tau_i}). For the other inequality, the argument of the continuous case proposed to obtain (55) can be adopted here (with no differences) to obtain that

\limsup_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) \ \le\ \sqrt[p]{1-v_p(C_{\tau_i})}, \quad \mu\text{-a.s.} \quad (69)

Case 4: The case when 0 \in \mathcal{Y}_{\bar{\mu}_1}: This case deserves a special treatment because it offers some insight into a property of the function (f_{p,\bar{\mu}}(r))_{r\in(0,1]} when \bar{\mu}_1(\{0\}) > 0, i.e., when \bar{\mu}_1 has atomic mass at 0.

Let us consider the case where \bar{\mu}_1(\{0\}) = \rho_o > 0. Then \phi_{\bar{\mu}_1}^{+}(0) = \lim_{\tau\to 0^{+}}\phi_{\bar{\mu}_1}(\tau) = \bar{\mu}_1(C_0) = 1-\rho_o \in (0,1) (see the illustration in Fig. 10). On the other hand, we have that v_p(\{0\}) = \frac{0^p\cdot\bar{\mu}_1(\{0\})}{\|(x^p)\|_{L_1(\bar{\mu}_1)}} = 0. Therefore, (\phi_{v_p}(\tau)) is continuous at \tau = 0

(Fig. 10). From the fact that \mathcal{Y}_{\bar{\mu}_1} is at most a countable set, there is \tau_1 > 0 with \phi_{\bar{\mu}_1}(\tau_1) < 1-\rho_o such that \phi_{\bar{\mu}_1}(\cdot) is continuous in (0, \tau_1) and, consequently, so is \phi_{v_p}(\cdot) in (0, \tau_1) (from Proposition 5 in Appendix B). If we consider the range of \phi_{v_p}(\cdot) in this continuous domain, we have that \{\phi_{v_p}(\tau),\ \tau\in(0,\tau_1)\} = (v_1, 1), where v_1 = \phi_{v_p}(\tau_1) < 1.

Figure 10: Illustration of the tail distribution functions of \bar{\mu}_1 and v_p in the sparse case where \bar{\mu}_1(\{0\}) = \rho_0 > 0.

Here, we adopt the same argument used in the continuous scenario to obtain the upper bound in (55). Let us consider an arbitrary sequence (k_n)_{n\ge 1} such that k_n/n \to 1-\rho_o with n. By the continuity of \phi_{v_p}(\tau) in (0, \tau_1), for any m sufficiently large such that 1-1/m > v_1, there is \tau_m > 0 such that \phi_{v_p}(\tau_m) = 1-1/m.^{29} For any of these \tau_m, it follows that \phi_{\bar{\mu}_1}(\tau_m) < 1-\rho_o. Then, we can consider the set of typical sequences defined in (46) and (47), where if (x_n)_{n\ge 1} \in \mathcal{A}^{\tau_m}\cap\mathcal{B}^{\tau_m}, then eventually in n it follows that k_n > n_{\tau_m}(x^n) (from the fact that \phi_{\bar{\mu}_1}(\tau_m) < 1-\rho_o) and, consequently,

\limsup_{n\to\infty}\tilde{\sigma}_p(k_n, x^n) \ \le\ \sqrt[p]{1/m}. \quad (70)

This last result follows from the definition of \mathcal{B}^{\tau_m} and the construction of \tau_m (i.e., \phi_{v_p}(\tau_m) = 1-1/m). Then, if (x_n)_{n\ge 1} \in \bigcap_{m\ge M}(\mathcal{A}^{\tau_m}\cap\mathcal{B}^{\tau_m}), where M > 0 is set such that 1-1/M > v_1, then

\limsup_{n\to\infty}\tilde{\sigma}_p(k_n, x^n) \ \le\ 0. \quad (71)

Finally, from the (point-wise) ergodic theorem for AMS sources in (36), it follows that \mu(\bigcap_{m\ge M}\mathcal{A}^{\tau_m}\cap\mathcal{B}^{\tau_m}) = 1, meaning from (71) that \lim_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) = 0, \mu-almost surely.

The last observation to conclude this part is that if (\tilde{k}_n) dominates (k_n), in the sense that \tilde{k}_n \ge k_n eventually, then by definition \tilde{\sigma}_p(\tilde{k}_n, x^n) \le \tilde{\sigma}_p(k_n, x^n) for all x^n. Therefore, from (71), for any r \in [1-\rho_o, 1] and for any (k_n) such that k_n/n \to r, it follows that

\lim_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) = 0, \quad \mu\text{-a.s.} \quad (72)

Then we obtain in this case that, \forall r \in [1-\bar{\mu}_1(\{0\}), 1],

f_{p,\bar{\mu}}(r) = 0, \quad (73)

while f_{p,\bar{\mu}}(r) > 0 if r \in (0, 1-\bar{\mu}_1(\{0\})).

^{29} This follows from the fact that if \phi_{v_p}(\tau) < \phi_{v_p}(\tilde{\tau}) then \phi_{\bar{\mu}_1}(\tau) < \phi_{\bar{\mu}_1}(\tilde{\tau}), from the definition of v_p.

Remark 2. The result in (72) is consistent with the statement of Theorem 1 because, if r \in [1-\rho_o, 1], then it can be written as r = \bar{\mu}_1(C_0) + \alpha\cdot\bar{\mu}_1(\{0\}) for some \alpha \in [0,1], where f_{p,\bar{\mu}}(r) = \sqrt[p]{1-v_p(C_0)-\alpha\cdot v_p(\{0\})} = 0.

8.3. Main Argument — Case \int_{\mathbb{R}}|x|^p\, d\bar{\mu}_1(x) = \infty

Proof: When \int_{\mathbb{R}}|x|^p\, d\bar{\mu}_1(x) = \infty, it follows that, \forall\tau \ge 0,

\lim_{n\to\infty}\frac{\sum_{i=1}^{n}(1-\mathbf{1}_{B_\tau}(X_i))\cdot|X_i|^p}{\sum_{i=1}^{n}|X_i|^p} = 0, \quad \mu\text{-a.s.}, \quad (74)

this from the (point-wise) ergodic theorem in (36) and the fact that \int_{\mathbb{R}}|x|^p\, d\bar{\mu}_1(x) = \infty. Then, from (37) and (74), it follows in this case that

\lim_{n\to\infty}\frac{n_\tau(X^n)}{n} \;=\; \phi_{\bar{\mu}_1}(\tau), \quad \mu\text{-a.s.}, \quad (75)

\lim_{n\to\infty}\tilde{\sigma}_p(n_\tau(X^n), X^n) \;=\; 0, \quad \mu\text{-a.s.}, \quad (76)

for all \tau \ge 0. Again, we can consider \mathcal{A}^{\tau} = \left\{(x_n)_{n\ge 1}:\ \lim_{n\to\infty} n_\tau(x^n)/n = \phi_{\bar{\mu}_1}(\tau)\right\} and

\mathcal{B}^{\tau} = \left\{(x_n)_{n\ge 1}:\ \lim_{n\to\infty}\tilde{\sigma}_p(n_\tau(x^n), x^n) = 0\right\},

where \mu(\mathcal{A}^{\tau}\cap\mathcal{B}^{\tau}) = 1 for all \tau. Let us fix r \in (0,1] and (k_n)_{n\ge 1} such that k_n/n \to r. We can consider \bar{r} < r and \tau_o such that \phi_{\bar{\mu}_1}(\tau_o) = \bar{r}. Then, for any (x_n)_{n\ge 1} \in \mathcal{A}^{\tau_o}\cap\mathcal{B}^{\tau_o}, it follows that the condition k_n > n_{\tau_o}(x^n) happens eventually in n (from the fact that r > \bar{r} and the definition of \mathcal{A}^{\tau_o}); therefore, eventually we have that \tilde{\sigma}_p(n_{\tau_o}(x^n), x^n) \ge \tilde{\sigma}_p(k_n, x^n). Finally, from the definition of \mathcal{B}^{\tau_o}, \lim_{n\to\infty}\tilde{\sigma}_p(k_n, x^n) = 0. The proof concludes by noting that \mu(\mathcal{A}^{\tau_o}\cap\mathcal{B}^{\tau_o}) = 1.

9. Proof of Theorem 3

First, we introduce formally the ergodic decomposition (ED) theorem:

Theorem 5. [12, Th. 10.1] Let X = (X_n)_{n\ge 1} be an AMS process characterized by (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}), \mu). Then there is a measurable space (\Lambda, \mathcal{L}) that parametrizes the family of stationary and ergodic distributions, i.e., \tilde{\mathcal{P}} = \{\mu_\lambda,\ \lambda\in\Lambda\}, and a measurable function \Psi: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\Lambda, \mathcal{L}) such that:

i) \Psi is invariant with respect to T, i.e., \Psi(x) = \Psi(T(x)) for all x \in \mathbb{R}^{\mathbb{N}}.

ii) Using the stationary mean \bar{\mu} of X and its induced probability in (\Lambda, \mathcal{L}), denoted by W_\Psi, it follows that, \forall F \in \mathcal{B}(\mathbb{R}^{\mathbb{N}}),

\bar{\mu}(F) = \int \mu_\lambda(F)\, \partial W_\Psi(\lambda). \quad (77)

iii) Finally, for any L_1(\bar{\mu})-integrable and measurable function^{30} f: (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R})),

\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} f(T^{i}(X)) \;=\; \mathbb{E}_{Z\sim\mu_{\Psi(X)}}(f(Z)), \quad \mu\text{-almost surely}, \quad (78)

where Z in (78) denotes a stationary and ergodic process in (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) with process distribution given by \Psi(X) \in \Lambda.

Proof: Let us first prove the almost sure sample-wise convergence in (27). For r \in (0,1] and (k_n)_{n\ge 1} such that k_n/n \to r, we need to study the limit of the random object Y_n = \tilde{\sigma}_p(k_n, X_1^n). As in the proof of Theorem 1, we consider the tail events

B_\tau = (-\infty, -\tau]\cup[\tau, \infty) \quad\text{and}\quad C_\tau = (-\infty, -\tau)\cup(\tau, \infty) \quad (79)

for \tau \ge 0. From Theorem 5, it follows that for any B \in \mathcal{B}(\mathbb{R}),

\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_B(X_i) \;=\; \mathbb{E}_{Z\sim\mu_{\Psi(X)}}(\mathbf{1}_B(Z_1)) \;=\; \mu_{1,\Psi(X)}(B), \quad \mu\text{-a.s.}, \quad (80)

where Z = (Z_i)_{i\ge 1} and \mu_{1,\Psi(X)} denotes the probability of Z_1 (the 1D marginalization of the process distribution \mu_{\Psi(X)}) in (\mathbb{R}, \mathcal{B}(\mathbb{R})). In addition, from Theorem 5 we have that

\lim_{n\to\infty}\frac{\sum_{i=1}^{n}\mathbf{1}_{B_\tau}(X_i)\cdot|X_i|^p}{\sum_{i=1}^{n}|X_i|^p} \;=\; \xi_p(X, B_\tau), \quad \mu\text{-a.s.}, \quad (81)

where

\xi_p(X, B_\tau) \equiv \begin{cases} \dfrac{\int_{B_\tau}|x|^p\, d\mu_{1,\Psi(X)}(x)}{\|(x^p)\|_{L_1(\mu_{1,\Psi(X)})}} & \text{if } (x^p)_{x\in\mathbb{R}} \in L_1(\mu_{1,\Psi(X)}), \\ 1 & \text{if } (x^p)_{x\in\mathbb{R}} \notin L_1(\mu_{1,\Psi(X)}). \end{cases} \quad (82)

From the results in (80) and (81), we can proceed with the same arguments used in the proof of Theorem 1 to obtain that^{31}

\lim_{n\to\infty}\tilde{\sigma}_p(k_n, X_1^n) \;=\; f_{p,\mu_{\Psi(X)}}(r), \quad \mu\text{-almost surely}, \quad (83)

where f_{p,\mu_{\Psi(X)}}(r) is the almost-sure asymptotic limit of the stationary and ergodic component \mu_{\Psi(X)} \in \tilde{\mathcal{P}} stated in (23) and elaborated in the statement of Theorem 1. This proves the first part of the result.

^{30} This result can be interpreted as a more sophisticated re-statement of the point-wise ergodic theorem for AMS sources under the assumption of a standard space, which is the case for (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})). Details and interpretations of this result are presented in [12, Chs. 8 and 10].

^{31} We omit the argument here, as it is redundant, following directly the structure presented in Section 8.

For the second part, we consider again r \in (0,1] and (k_n)_{n\ge 1} such that k_n/n \to r. Let us denote the almost sure limit in (83) by f_p(X, r),^{32} which is in general a random variable from (\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}})) to (\mathbb{R}, \mathcal{B}(\mathbb{R})). For an arbitrary d \in [0,1), we need to analyze the asymptotic limit of \mu_n(A_d^{n,k_n}). By additivity, we decompose this probability into two terms:

\mu_n(A_d^{n,k_n}) = \mu\left(\left\{\bar{x} = (x_i)_{i\ge 1}\in\mathbb{R}^{\mathbb{N}}:\ \tilde{\sigma}_p(k_n, x_1^n) \le d,\ f_p(\bar{x}, r) \le d\right\}\right) + \mu\left(\left\{\bar{x}\in\mathbb{R}^{\mathbb{N}}:\ \tilde{\sigma}_p(k_n, x_1^n) \le d,\ f_p(\bar{x}, r) > d\right\}\right). \quad (84)

For the first term (from left to right) on the RHS of (84), we can consider the following bounds:

\mu\left(\{\bar{x}:\ f_p(\bar{x}, r) \le d\}\right) \ \ge\ \mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) \le d,\ f_p(\bar{x}, r) \le d\}\right) \ \ge\ \mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) \le f_p(\bar{x}, r),\ f_p(\bar{x}, r) \le d\}\right). \quad (85)

The lower and upper bounds in (85) have the same asymptotic limit, i.e.,

\lim_{n\to\infty}\mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) \le f_p(\bar{x}, r),\ f_p(\bar{x}, r) \le d\}\right) \;=\; \lim_{n\to\infty}\mu\left(\{\bar{x}:\ f_p(\bar{x}, r) \le d\}\right). \quad (86)

This can be shown by the following equality

\mu\left(\{\bar{x}:\ f_p(\bar{x}, r) \le d\}\right) = \mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) \le f_p(\bar{x}, r),\ f_p(\bar{x}, r) \le d\}\right) + \mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) > f_p(\bar{x}, r),\ f_p(\bar{x}, r) \le d\}\right), \quad (87)

and \mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) > f_p(\bar{x}, r),\ f_p(\bar{x}, r) \le d\}\right) \le \mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) > f_p(\bar{x}, r)\}\right), where the almost sure convergence of \tilde{\sigma}_p(k_n, X_1^n) to f_p(X, r) in (83) implies that

\lim_{n\to\infty}\mu\left(\{\bar{x}:\ \tilde{\sigma}_p(k_n, x_1^n) > f_p(\bar{x}, r),\ f_p(\bar{x}, r) \le d\}\right) = 0,

obtaining the result in (86). Consequently, we have from (85) that

\lim_{n\to\infty}\mu\left(\left\{\bar{x}\in\mathbb{R}^{\mathbb{N}}:\ \tilde{\sigma}_p(k_n, x_1^n) \le d,\ f_p(\bar{x}, r) \le d\right\}\right) \;=\; \lim_{n\to\infty}\mu\left(\left\{\bar{x}\in\mathbb{R}^{\mathbb{N}}:\ f_p(\bar{x}, r) \le d\right\}\right). \quad (88)

For the second term on the RHS of (84), it is simple to verify that

\mu\left(\left\{\bar{x}\in\mathbb{R}^{\mathbb{N}}:\ \tilde{\sigma}_p(k_n, x_1^n) \le d,\ f_p(\bar{x}, r) > d\right\}\right) \le \mu\left(\left\{\bar{x}\in\mathbb{R}^{\mathbb{N}}:\ \tilde{\sigma}_p(k_n, x_1^n) < f_p(\bar{x}, r)\right\}\right);

then the almost sure convergence in (83) implies that

\lim_{n\to\infty}\mu\left(\left\{\bar{x}\in\mathbb{R}^{\mathbb{N}}:\ \tilde{\sigma}_p(k_n, x_1^n) \le d,\ f_p(\bar{x}, r) > d\right\}\right) = 0.

^{32} We omit the dependency on \mu in the notation because this limit (as a random variable of X) is independent of \mu.

Putting this result in (84) and using (88), it follows that

\lim_{n\to\infty}\mu_n(A_d^{n,k_n}) \;=\; \lim_{n\to\infty}\mu\left(\left\{\bar{x}\in\mathbb{R}^{\mathbb{N}}:\ f_p(\bar{x}, r) \le d\right\}\right), \quad (89)

which concludes the argument.

Finally, to obtain the specific statement presented in (28), we first note that f_p(\bar{x}, r) = f_{p,\mu_{\Psi(\bar{x})}}(r) for all \bar{x}\in\mathbb{R}^{\mathbb{N}}, where (f_{p,\mu_\lambda}(r))_{r\in(0,1]} is the expression that has been fully characterized in Theorem 1 for any \mu_\lambda \in \tilde{\mathcal{P}}. In addition, we can use Theorem 1 part i), stating that when \mu_\lambda is \ell_p-compressible, meaning that (x^p)_{x\in\mathbb{R}} \notin L_1(\mu_{\lambda,1}), then f_{p,\mu_\lambda}(r) = 0 for all r \in (0,1]. Therefore, all the stationary and ergodic components \mu_{\Psi(\bar{x})} that are \ell_p-compressible satisfy that f_p(\bar{x}, r) = f_{p,\mu_{\Psi(\bar{x})}}(r) = 0 \le d, independent of the pair (r, d); this observation explains the first term in the expression presented in (28).
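As an illustration of this limit (our example, not contained in the original text), let \mu = \frac{1}{2}\mu_a + \frac{1}{2}\mu_b be a mixture of two stationary and ergodic i.i.d. components, with \mu_a Gaussian and \mu_b driven by a Student's t-distribution with q < p (hence \ell_p-compressible). Here W_\Psi puts mass 1/2 on each component and, for r \in (0,1), f_p(X, r) takes the values f_{p,\mu_a}(r) > 0 and 0 with probability 1/2 each, so (89) reduces to

\lim_{n\to\infty}\mu_n\!\left(A^{n,k_n}_{d}\right) \;=\; \mu\!\left(\{\bar{x}:\ f_p(\bar{x}, r)\le d\}\right) \;=\; \frac{1}{2} + \frac{1}{2}\,\mathbf{1}\!\left\{f_{p,\mu_a}(r)\le d\right\},

where the constant 1/2 is the probability mass of the \ell_p-compressible component, i.e., the first term in (28).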

10. Acknowledgment

This material is based on work supported by grants of CONICYT-Chile, Fondecyt 1210315, and the Advanced Center for Electrical and Electronic Engineering, Basal Project FB0008. I want to thank the two anonymous reviewers for providing valuable comments and suggestions that helped to improve the technical quality and organization of this paper. The author thanks Professor Martin Adams for providing valuable comments about the organization and presentation of this paper. The author thanks Sebastian Espinosa, Felipe Cordova, and Mario Vicuna for helping with the figures and the simulations presented in this work. Finally, I thank Diane Greenstein for editing and proofreading all this material.

Appendix A. Proof of Lemma 1 (and Corollary 1)

First, some properties of (f_{p,\mu}(r))_{r\in(0,1]} will be needed.

Proposition 4. It follows that:

• If 0 < r1 < r2 ≤ 1, then fp,µ(r2) ≤ fp,µ(r1).

• If 0 < r1 < r2 ≤ 1 and fp,µ(r2) = fp,µ(r1), then fp,µ(r2) = fp,µ(r1) = 0.

The proof of this result derives directly from the definition of \tilde{\sigma}_p(k_n, X^n) and some basic inequalities.^{33} From Proposition 4, f_{p,\mu}(\cdot) is strictly monotonic and injective in the domain f_{p,\mu}^{-1}((0,1)). Therefore, f_{p,\mu}^{-1}(d) is well defined for any d \in \{f_{p,\mu}(r),\ r\in(0,1]\}\setminus\{0\}.

Proof: Let us first consider the case d \in \{f_{p,\mu}(r),\ r\in(0,1]\}\setminus\{0\}, assuming for a moment that this set is non-empty. Then, there exists r_o \in (0,1) such

^{33} This result is revisited and proved (including additional properties) in Lemma 10, Section 8.

that d = f_{p,\mu}(r_o), where, by the strict monotonicity of f_{p,\mu}(\cdot), we have that f_{p,\mu}(r_2) < d < f_{p,\mu}(r_1) for any r_1 < r_o < r_2 \le 1. On the other hand, using the convergence of the approximation error to the function f_{p,\mu}(r) in (11) and the definition of A_d^{n,k} in (3), it follows that for any r \in (0,1), k_n with k_n/n \to r, and \epsilon > 0,

\lim_{n\to\infty}\mu_n\left(A_{f_{p,\mu}(r)+\epsilon}^{n,k_n}\right) = 1 \quad\text{and} \quad (A.1)

\lim_{n\to\infty}\mu_n\left(A_{f_{p,\mu}(r)-\epsilon}^{n,k_n}\right) = 0. \quad (A.2)

Then, assuming that k_n/n \to r_o: if r > r_o, then f_{p,\mu}(r) < d, and from (A.1) we obtain that \lim_{n\to\infty}\mu_n(A_d^{n,k_n}) = 1. On the other hand, if r < r_o, then f_{p,\mu}(r) > d, and from (A.2) we obtain that \lim_{n\to\infty}\mu_n(A_d^{n,k_n}) = 0. This proves (12).

Remark 3. (for Corollary 1) Adopting the definition in (9) and setting \epsilon > 0, it follows from (A.1) and (A.2) that for any arbitrarily small \delta > 0, r_o-\delta \le r_p(d, \epsilon, \mu) \le r_o+\delta and, consequently, r_p(d, \epsilon, \mu) = r_o = f_{p,\mu}^{-1}(d). Furthermore, adopting the definition of \tilde{\kappa}_p(d, \epsilon, \mu^n) in (5) with a fixed \epsilon > 0 and its asymptotic limits (with n) in (6) and (7), it follows from (A.1) and (A.2) that for any arbitrarily small \delta > 0, r_o-\delta \le \tilde{r}_p^{-}(d, \epsilon, \mu) \le \tilde{r}_p^{+}(d, \epsilon, \mu) \le r_o+\delta and, consequently, \tilde{r}_p^{-}(d, \epsilon, \mu) = \tilde{r}_p^{+}(d, \epsilon, \mu) = r_o = f_{p,\mu}^{-1}(d).

Concerning the second part of the result, let us assume r_o \in (0,1) is such that f_{p,\mu}(r_o) = 0. From the convergence in (11) assumed in this result, we have that if (k_n)_{n\ge 1} is such that k_n/n \to r_o, then

\lim_{n\to\infty}\tilde{\sigma}_p(k_n, X^n) = 0, \quad \mu\text{-a.s.} \quad (A.3)

Then, adopting A_d^{n,k} in (3), it follows from (A.3) that for any d > 0,

\lim_{n\to\infty}\mu_n(A_d^{n,k_n}) = 1, \quad (A.4)

which proves (13).

Remark 4. (for Corollary 1) Using the definition of \tilde{\kappa}_p(d, \epsilon, \mu^n) in (5), from (A.4) it is clear that for any \epsilon > 0, \tilde{\kappa}_p(d, \epsilon, \mu^n) \le k_n eventually (in n). From (6), this last inequality implies that \tilde{r}_p^{+}(d, \epsilon, \mu) \le r_o.

Appendix B. Proof of Lemma 10

The following properties of the tail functions (which define \mathcal{F}_{\bar{\mu}_1} in (43)) will be used:

Proposition 5.

• \mathcal{Y}_{v_p} = \mathcal{Y}_{\bar{\mu}_1}\setminus\{0\}, meaning that for all \tau > 0, \phi_{\bar{\mu}_1}(\cdot) is continuous at \tau if, and only if, \phi_{v_p}(\cdot) is continuous at \tau.

• \forall\tau_1 > \tau_2 > 0, \phi_{\bar{\mu}_1}(\tau_1) = \phi_{\bar{\mu}_1}(\tau_2) if, and only if, \phi_{v_p}(\tau_1) = \phi_{v_p}(\tau_2).

The proof of this result is presented in Appendix D.

Proof: Proof of i): Let us first show that (f_{\bar{\mu}_1}(\cdot)) is continuous in (0,1]. It is sufficient to prove continuity of the function \tilde{f}_{\bar{\mu}_1}(r) \equiv 1 - f_{\bar{\mu}_1}(r)^p, which is induced by the following simpler relationship:^{34}

\tilde{\mathcal{F}}_{\bar{\mu}_1} \equiv \left\{\left(\phi_{\bar{\mu}_1}(\tau),\ \phi_{v_p}(\tau)\right),\ \tau\in[0,\infty)\right\} \quad (B.1)

\cup\ \bigcup_{\tau_n\in\mathcal{Y}_{\bar{\mu}_1}} \left\{\left(\bar{\mu}_1(C_{\tau_n}) + \alpha\,\bar{\mu}_1(\{-\tau_n,\tau_n\}),\ v_p(C_{\tau_n}) + \alpha\, v_p(\{-\tau_n,\tau_n\})\right),\ \alpha\in[0,1)\right\}. \quad (B.2)

There are three distinct scenarios to consider:

• Let us first focus on the case when r \in \mathcal{R}^{*}_{\bar{\mu}_1}\setminus\{\bar{\mu}_1(B_{\tau_n}),\ \tau_n\in\mathcal{Y}_{\bar{\mu}_1}\} (see Eq.(40)). Under this assumption, there exists \tau_o \in [0,\infty)\setminus\mathcal{Y}_{\bar{\mu}_1} (in the domain where \phi_{\bar{\mu}_1}(\cdot) is continuous) where r = \phi_{\bar{\mu}_1}(\tau_o). From Proposition 5, \phi_{v_p}(\cdot) is also continuous at \tau_o, where by construction in (B.1) \tilde{f}_{\bar{\mu}_1}(r) = \phi_{v_p}(\tau_o). Let us consider an arbitrary \epsilon > 0. From the continuity of \phi_{v_p}(\cdot) at \tau_o, there exists \delta > 0 such that \{\phi_{v_p}(\tau),\ \tau\in B_\delta(\tau_o)\} \subset B_\epsilon(\tilde{f}_{\bar{\mu}_1}(r)).^{35} Without loss of generality, we can assume that \phi_{v_p}(\tau_o-\delta) > \tilde{f}_{\bar{\mu}_1}(r) = \phi_{v_p}(\tau_o) > \phi_{v_p}(\tau_o+\delta). Then, from Proposition 5, it follows that \phi_{\bar{\mu}_1}(\tau_o-\delta) > r = \phi_{\bar{\mu}_1}(\tau_o) > \phi_{\bar{\mu}_1}(\tau_o+\delta). Then, there exists \bar{\delta} > 0 such that B_{\bar{\delta}}(r) \subset \{\phi_{\bar{\mu}_1}(\tau),\ \tau\in B_\delta(\tau_o)\}. Therefore, from (B.1), we have that for any \bar{r} \in B_{\bar{\delta}}(r), there exists \tau_{\bar{r}} \in B_\delta(\tau_o) where \bar{r} = \phi_{\bar{\mu}_1}(\tau_{\bar{r}}) and, consequently, \tilde{f}_{\bar{\mu}_1}(\bar{r}) = \phi_{v_p}(\tau_{\bar{r}}) \in B_\epsilon(\tilde{f}_{\bar{\mu}_1}(r)), which concludes the argument in this case.

• Let us assume that r \in \bigcup_{\tau_n\in\mathcal{Y}_{\bar{\mu}_1}} (\bar{\mu}_1(C_{\tau_n}), \bar{\mu}_1(B_{\tau_n})) (see Eq.(40)). Then there is \tau_n \in \mathcal{Y}_{\bar{\mu}_1} and a unique \alpha_o \in (0,1) such that r = \bar{\mu}_1(C_{\tau_n}) + \alpha_o\cdot\bar{\mu}_1(B_{\tau_n}\setminus C_{\tau_n}) and, consequently, \tilde{f}_{\bar{\mu}_1}(r) = v_p(C_{\tau_n}) + \alpha_o\cdot v_p(B_{\tau_n}\setminus C_{\tau_n}) from (B.2). Without loss of generality, let us consider \epsilon > 0 small enough such that B_\epsilon(\tilde{f}_{\bar{\mu}_1}(r)) \subset (v_p(C_{\tau_n}), v_p(B_{\tau_n})). Then, from the continuity of the affine function g(\alpha) \equiv v_p(C_{\tau_n}) + \alpha\cdot v_p(B_{\tau_n}\setminus C_{\tau_n}) in (0,1), there exists \delta > 0 (a function of \epsilon) such that \{g(\alpha),\ \alpha\in B_\delta(\alpha_o)\} \subset B_\epsilon(\tilde{f}_{\bar{\mu}_1}(r)). Therefore, for any \bar{r} \in \{\bar{\mu}_1(C_{\tau_n}) + \alpha\cdot\bar{\mu}_1(B_{\tau_n}\setminus C_{\tau_n}),\ \alpha\in(\alpha_o-\delta, \alpha_o+\delta)\}, \tilde{f}_{\bar{\mu}_1}(\bar{r}) \in B_\epsilon(\tilde{f}_{\bar{\mu}_1}(r)) from the construction in (B.2). Finally, fixing \bar{\delta} = \delta\cdot\bar{\mu}_1(B_{\tau_n}\setminus C_{\tau_n}), we have that \{\tilde{f}_{\bar{\mu}_1}(\bar{r}),\ \bar{r}\in B_{\bar{\delta}}(r)\} \subset B_\epsilon(\tilde{f}_{\bar{\mu}_1}(r)), which concludes the argument in this case.

• Finally, we need to consider the case where r \in \{\bar{\mu}_1(C_{\tau_n}),\ \tau_n\in\mathcal{Y}_{\bar{\mu}_1}\}\cup\{\bar{\mu}_1(B_{\tau_n}),\ \tau_n\in\mathcal{Y}_{\bar{\mu}_1}\}. The argument mixes the steps already presented in the two previous scenarios and, for the sake of space, it is omitted here, as no new technical elements are needed.

^{34} This follows from the continuity of the function g(x) = \sqrt[p]{1-x} in x \in [0,1].

^{35} B_\epsilon(x) \equiv (x-\epsilon, x+\epsilon) \subset \mathbb{R} denotes the open ball of radius \epsilon > 0 centered at x \in \mathbb{R}.

Proof of iv): Using the fact that \lim_{\tau\to\infty}\phi_{\bar{\mu}_1}(\tau) = 0 and \lim_{\tau\to\infty}\phi_{v_p}(\tau) = 0, from the construction of f_{\bar{\mu}_1}(r) in (43) we have that \lim_{r\to 0} f_{\bar{\mu}_1}(r) = 1. For the other condition, let us consider r = 1-\bar{\mu}_1(\{0\}). There are two cases. The simplest case to analyze is when 0 \notin \mathcal{Y}_{\bar{\mu}_1}. In this case, we are only looking at the point r = 1. This point is achieved at \tau = 0, i.e., r = \phi_{\bar{\mu}_1}(0) = 1, which is mapped to f_{\bar{\mu}_1}(1) = \sqrt[p]{1-\phi_{v_p}(0)} = 0. When 0 \in \mathcal{Y}_{\bar{\mu}_1}, then from (43) we can focus on the following range of pairs determining f_{\bar{\mu}_1}(r):

\left\{\left(\bar{\mu}_1(C_0) + \alpha\,\bar{\mu}_1(\{0\}),\ \sqrt[p]{1-v_p(C_0)-\alpha\cdot v_p(\{0\})}\right),\ \alpha\in[0,1)\right\},

looking at \tau = 0 \in \mathcal{Y}_{\bar{\mu}_1}. We know that 0 \notin \mathcal{Y}_{v_p}; then v_p(\{0\}) = 0 and v_p(C_0) = v_p(B_0) = 1. On the other hand, \bar{\mu}_1(C_0) = \bar{\mu}_1(B_0) - \bar{\mu}_1(\{0\}) = 1-\bar{\mu}_1(\{0\}). Therefore, by exploring all the values of \alpha \in [0,1), we have that f_{\bar{\mu}_1}(r) = 0 for all r \in [1-\bar{\mu}_1(\{0\}), 1).

Proof of ii): Let us consider r_2 > r_1 and assume that both belong to \mathcal{R}^{*}_{\bar{\mu}_1} in (40). This means that there exist \tau_1 > \tau_2 \ge 0 such that r_1 = \phi_{\bar{\mu}_1}(\tau_1) and r_2 = \phi_{\bar{\mu}_1}(\tau_2). Then \phi_{v_p}(\tau_1) < \phi_{v_p}(\tau_2) from Proposition 5, which implies the result by the construction of f_{\bar{\mu}_1}(\cdot) in (43). Another important scenario to cover is the case when r_1 = \bar{\mu}_1(C_{\tau_n}) + \alpha_1\,\bar{\mu}_1(B_{\tau_n}\setminus C_{\tau_n}) and r_2 = \bar{\mu}_1(C_{\tau_n}) + \alpha_2\,\bar{\mu}_1(B_{\tau_n}\setminus C_{\tau_n}), with \alpha_2 > \alpha_1, \tau_n \in \mathcal{Y}_{\bar{\mu}_1}, and \tau_n > 0. Then, in this case, v_p(C_{\tau_n}) + \alpha_2\, v_p(B_{\tau_n}\setminus C_{\tau_n}) > v_p(C_{\tau_n}) + \alpha_1\, v_p(B_{\tau_n}\setminus C_{\tau_n}), because from Proposition 5 it follows that v_p(B_{\tau_n}\setminus C_{\tau_n}) > 0 if \tau_n \in \mathcal{Y}_{\bar{\mu}_1}\setminus\{0\}. Again, the result in this case follows from (43). Mixing these two scenarios and using the monotonic property of the tail functions (\phi_{\bar{\mu}_1}(\cdot), \phi_{v_p}(\cdot)), we can prove the strict monotonic property of (f_{\bar{\mu}_1}(\cdot)) in (0,1] if 0 \notin \mathcal{Y}_{\bar{\mu}_1}, and the strict monotonic property of (f_{\bar{\mu}_1}(\cdot)) in (0,1]\setminus[1-\bar{\mu}_1(\{0\}), 1] if 0 \in \mathcal{Y}_{\bar{\mu}_1}.

Proof of iii): From the fact that (f_{\bar{\mu}_1}(\cdot)) is strictly monotonic in r \in (0, 1-\bar{\mu}_1(\{0\})) (proof of ii)) and the conditions in iv) (proved above), we have that f_{\bar{\mu}_1}(r) \in (0,1) if, and only if, r \in (0, 1-\bar{\mu}_1(\{0\})). This suffices to show that f_{\bar{\mu}_1}^{-1}((0,1)) = (0, 1-\bar{\mu}_1(\{0\})).

Proof of v): This part comes directly from the continuity of f_{\bar{\mu}_1}(\cdot) in (0,1] and the limiting values of f_{\bar{\mu}_1}(\cdot) (i.e., f_{\bar{\mu}_1}(1) = 0 and \lim_{r\to 0} f_{\bar{\mu}_1}(r) = 1).

Appendix C. Proof of Proposition 2

Proof: The statement in i) follows from the definition of \phi_v(\cdot), and the statement in ii) comes from the continuity of a measure under a monotone sequence of events converging to a limit [19]. The left continuity of \phi_v(\cdot) and the fact that \phi_v^{+}(\tau) = \phi_v(\tau) - v(\{\tau\}\cup\{-\tau\}) (stated in iii)) follow mainly from the continuity of a measure [19].

Appendix D. Proof of Proposition 5

Proof: The proofs of these two points derive directly from the definition of the tail function and the construction of v_p from \bar{\mu}_1. More precisely, both results derive from the observation that these two measures are almost mutually absolutely continuous, in the sense that for all B \in \mathcal{B}(\mathbb{R}) such that 0 \notin B, \bar{\mu}_1(B) = 0 if, and only if, v_p(B) = 0. In fact, for all B \in \mathcal{B}(\mathbb{R}) such that 0 \notin B,

v_p(B) = \int_{B} \frac{|x|^p}{\|(x^p)\|_{L_1(\bar{\mu}_1)}}\, d\bar{\mu}_1(x) \quad\text{and, conversely,}\quad \bar{\mu}_1(B) = \int_{B} \frac{\|(x^p)\|_{L_1(\bar{\mu}_1)}}{|x|^p}\, dv_p(x).

References

[1] A. Amini, M. Unser, F. Marvasti, Compressibility of deterministic and random infinite sequences, IEEE Transactions on Signal Processing 59 (11) (2011) 5193–5201.

[2] R. Gribonval, V. Cevher, M. E. Davies, Compressible distributions for high-dimensional statistics, IEEE Transactions on Information Theory 58 (8) (2012) 5016–5034.

[3] J. F. Silva, M. S. Derpich, On the characterization of \ell_p-compressible ergodic sequences, IEEE Transactions on Signal Processing 63 (11) (2015) 2915–2928.

[4] V. Cevher, Learning with compressible priors, in: Neural Inf. Process. Syst. (NIPS), Canada, 2008.

[5] A. Amini, M. Unser, Sparsity and infinite divisibility, IEEE Transactions on Information Theory 60 (4) (2014) 2346–2358.

[6] M. Unser, P. Tafti, A. Amini, H. Kirshner, A unified formulation of Gaussian versus sparse stochastic processes — Part II: Discrete-domain theory, IEEE Transactions on Information Theory 60 (5) (2014) 3036–3050.

[7] M. Unser, P. D. Tafti, An Introduction to Sparse Stochastic Processes, Cambridge Univ. Press, 2014.

[8] R. Gribonval, Should penalized least squares regression be interpreted as maximum a posteriori estimation?, IEEE Transactions on Signal Processing 59 (5) (2011) 2405–2410.

[9] A. Amini, U. Kamilov, E. Bostan, M. Unser, Bayesian estimation for continuous-time sparse stochastic processes, IEEE Transactions on Signal Processing 61 (4) (2013) 907–929.

[10] R. Prasad, C. Murthy, Cramér-Rao-type bounds for sparse Bayesian learning, IEEE Transactions on Signal Processing 61 (3) (2013) 622–632.

[11] J. Fageot, M. Unser, J. P. Ward, The n-term approximation of periodic generalized Lévy processes, Journal of Theoretical Probability 33 (2020) 180–200.

[12] R. M. Gray, Probability, Random Processes, and Ergodic Properties, 2nd Edition, Springer, 2009.

[13] L. Breiman, Probability, Addison-Wesley, 1968.

[14] R. M. Gray, J. Kieffer, Asymptotically mean stationary measures, The Annals of Probability 8 (5) (1980) 962–973.

[15] R. M. Gray, Entropy and Information Theory, Springer-Verlag, New York, 1990.

[16] O. W. Rechard, Invariant measures for many-one transformations, Duke J. Math. 23 (1956) 477–488.

[17] R. Gray, L. D. Davisson, Introduction to Statistical Signal Processing, Cambridge Univ. Press, 2004.

[18] H. L. Royden, P. Fitzpatrick, Real Analysis, Pearson Education, 2010.

[19] S. Varadhan, Probability Theory, American Mathematical Society, 2001.
