
An Approach to Optimal Discretization of Continuous Real Random Variables with Application to Machine Learning

Carlos Antonio Pinzón Henao
Student identification: 8945263

Director: Camilo Rocha, PhD
Co-director: Jorge Finke, PhD

Engineering and Sciences
Master in Engineering, Emphasis on Computer Science

Pontificia Universidad Javeriana Cali, Cali, Colombia, October 2019

Contents

1 Introduction 6

2 Preliminaries 10
2.1 Random variables of interest 11
2.2 Entropy and redundancy 12
2.2.1 Shannon entropy 12
2.2.2 Differential entropy and LDDP 13
2.2.3 Rényi entropy 15
2.2.4 Redundancy 16
2.3 Discretization 17
2.3.1 Digitalization 17
2.3.2 n-partitions 18
2.3.3 Distortion and entropy 21
2.3.4 Discretization functions 21
2.3.5 Redundancy and scaled distortion 23

3 Optimal quantizers 25
3.1 Context 25
3.2 Formal definition 26
3.3 Application in sensorics 27

4 Uses in machine learning 30
4.1 Preprocessing 31
4.2 Intra-class variance split criterion 33
4.2.1 Construction of decision trees 34
4.2.2 Min-variance split criterion 37
4.2.3 Non-uniqueness 43
4.2.4 Non-optimality 45
4.3 Discussion 47

5 A real analysis approach 48
5.1 Formal derivation 48
5.1.1 Deriving Theorem 10 49
5.1.2 Deriving Theorem 17 57
5.1.3 Deriving Theorem 19 63
5.1.4 Optimality theorems 67
5.2 Summary of formulas and quantizers 73
5.3 Some generalizations 74
5.4 An open problem 75

6 Conclusion and future work 77

Bibliography 78

A Optimal quantizer in the rate-distortion plane 81
A.1 Rate-distortion function 81
A.2 Optimal quantizers in the rate-distortion plane 83

Acknowledgment

I would like to thank all the people that made this thesis possible, including Dr. Camilo Rocha and Dr. Jorge Finke for their trust and guidance, all the supervisors that agreed to review my work, the professors at the University whose teachings and advice helped me directly or indirectly throughout this period, including Isabel García, Carlos Ramírez and Frank Valencia, my friend Juan Pablo Berón for providing insight into some proofs, and my friends and family, especially my future wife Adriana, for their constant support. Without their help, it would not have been possible to conclude this thesis successfully.

In addition, I thank the growing Mathematics Stack Exchange community for having provided the sketch for two of the proofs presented in this document: question 3339209 for Theorem 9 and question 154721 for Theorem 12.

Chapter 1

Introduction

This document explores some discretization methods for continuous real random variables, their performance, and some of their applications. In particular, this work borrows a well-studied discretization method from the signal processing community, called the minimal distortion high resolution quantizer1, and uses it in machine learning with the purpose of illustrating how and when it can improve commonly used techniques and methodologies.

Discretization is the process of converting a continuous variable into a discrete one. It is useful because, unlike the continuous input, the discrete output can be encoded and treated with discrete methods and tools, including computers and microcontrollers. Therefore, it occurs in all modern sensor systems. It is also used in data analysis, either for a final categorization of the data or internally as a bridge between the continuous nature of the data and the discrete nature of some algorithms. Despite its advantages, discretization necessarily introduces an error because the discrete output is unable to mimic the continuous input with complete precision. This error can

1 Herbert Gish and John Pierce. “Asymptotically efficient quantizing”. In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676–683.

be forced to be very small by using properly designed discretization methods or at the expense of using more bits in the encoding.

Discretization has been deeply studied under the term quantization, especially in the signal processing community, which not only proposed the minimal distortion high resolution quantizer2, but also an iterative method, called Lloyd's method, for computing optimal quantizers for any resolution level3, and several generalized versions of this algorithm that support multiple dimensions and are therefore called vector quantizers4. These algorithms have a tremendous advantage over the high resolution quantizer, namely that they operate directly on samples and do not require the distribution to be known or estimated beforehand.

Furthermore, the book Foundations of Quantization for Probability Distributions5 by Graf and Luschgy provides an extensive review of quantization methods and, using advanced topics in mathematics, supplies common notation and a rigorous theory supporting them. The book covers several types of quantizers, including one-dimensional optimal quantizers, one-dimensional high resolution optimal quantizers and vector quantizers.

This work explores just one of these types of quantizers using undergraduate mathematics, so as to be readable and useful for a broader, less experienced audience. It focuses exclusively on the discretization of one-dimensional continuous real random variables X for which a Riemann-integrable probability density function with domain [a, b] exists, for reals a < b. These constraints, which might seem very strong from a mathematical point of view, are

2 Herbert Gish and John Pierce. “Asymptotically efficient quantizing”. In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676–683.
3 Stuart Lloyd. “Least squares quantization in PCM”. In: IEEE Transactions on Information Theory 28.2 (1982), pp. 129–137.
4 Yoseph Linde, Andres Buzo, and Robert Gray. “An algorithm for vector quantizer design”. In: IEEE Transactions on Communications 28.1 (1980), pp. 84–95; Robert Gray. “Vector quantization”. In: IEEE ASSP Magazine 1.2 (1984), pp. 4–29.
5 Siegfried Graf and Harald Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2007.

common in engineering applications and allow the study to be reduced to undergraduate mathematics. The analysis focuses on the following quantizers, especially on the optimal quantizer.

1. Equal width quantizer, or uniform quantizer: the simplest quantizer, used in most sensors. It consists of dividing the input interval into n bins of equal width.

2. Equal frequency quantizer: it consists of dividing the input interval into n bins of equal area under the curve f. As a consequence, it maximizes the entropy of the random output, which can be understood, less precisely, as its diversity.

3. Minimal distortion high resolution quantizer, called optimal quantizer throughout the document: it minimizes the error between the input and the output when n is sufficiently large. It is explained in detail in Chapter 3.

4. Recursive binary quantizer: used in binary decision trees in machine learning. As explained in Chapter 4, it is constructed with a recursive and greedy procedure. Namely, it divides the interval into two by means of a minimal distortion quantizer with n = 2, and repeats the procedure on each subinterval, yielding 4, 8, and in general 2^k parts.

The main contributions of this dissertation can be summarized as follows:

1. It brings the optimal quantizer to the attention of the machine learning community as an alternative to the typical equal width and equal frequency quantizers when the input distribution is known or well estimated, especially targeting readers with little background in advanced mathematics.

2. It illustrates with some examples how and when the optimal quantizer can replace typical non-iterative techniques in machine learning for discretizing a dataset, and how and when it can be used as a split criterion for decision trees.

3. It presents real-analysis proofs of the optimality and uniqueness of the optimal quantizer under the constraints of being bounded and having a Riemann-integrable density. In contrast to existing proofs6, these are comprehensible to the majority of undergraduate engineers because they do not use measure theory.

4. It summarizes the generalizations of these results to other distance metrics and provides insight on how to generalize them as well for different entropy functions.

This document is organized as follows.

Chapter 1 is the introduction. Chapter 2 provides definitions that are used along the document and introduces the reader to discretizations with some plots and examples. Chapter 3 reviews the optimal quantizer and presents its application in sensorics. Chapter 4 compares the optimal quantizer against the others in machine learning. More precisely, the chapter compares the performance of some predictors that are fed with the output of the discretizers, and provides a split criterion for decision trees based on the optimal quantizer. Chapter 5 uses real analysis to prove that the optimal quantizer is indeed optimal and unique. It also provides a generalization of this quantizer and an open problem that emerged naturally from the generalizations. Chapter 6 concludes the document with a short summary. Appendix A explores the optimal quantizer from an information-theoretical perspective. In particular, some plots of optimal quantizers on the rate-distortion plane are provided.

6 Siegfried Graf and Harald Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2007; Herbert Gish and John Pierce. “Asymptotically efficient quantizing”. In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676–683.

Chapter 2

Preliminaries

This chapter presents the formal definition of discretization and some other additional definitions that are needed throughout the document.

Section 2.1 presents the specific mathematical objects that are studied in this work and, as a consequence, it determines precisely the scenarios in which the results of this work are guaranteed to be valid and applicable.

Section 2.2 presents some classical definitions of entropy from information theory as well as the term ‘redundancy’ as used in this work. These are related to discretizations mainly, though not exclusively, because they can be used to evaluate them. A high-entropy discretization has a more diverse output and usually a lower input-output error than a low-entropy one. The reader is referred to these references1 for a deeper understanding of the contents of Section 2.2.

Finally, Section 2.3 presents discretizations (n-partitions) and discretization functions, which are introduced and deeply studied in this work. In addition, it presents the importance of and relationship between discretization and digitalization, the properties of n-partitions and the properties of discretization functions.

1 Claude Elwood Shannon. “A mathematical theory of communication”. In: Bell System Technical Journal 27.3 (1948), pp. 379–423; Charles Marsh. “Introduction to continuous entropy”. In: Department of Computer Science, Princeton University (2013).

2.1 Random variables of interest

The theory in this document is concerned exclusively with one-dimensional continuous bounded real random variables for which a Riemann-integrable probability density function exists. Some definitions and results may be extended to non-continuous random variables, unbounded domains or n dimensions. To simplify the assumptions of each theorem, these sufficient conditions are required everywhere.

Definition 1. For any a, b ∈ R with a < b,

1. RPDF[a,b] denotes the set of Riemann-integrable probability density functions with domain [a, b], also called in this text simply Riemann- densities. More precisely, a Riemann-density over [a, b] is a Riemann- integrable function with domain [a, b] and integral 1.

2. CRPDF[a,b] denotes the set of continuous Riemann-densities over [a, b].

Clearly, CRPDF[a,b] ⊆ RPDF[a,b].

3. RPDF+[a,b] denotes the set of positive Riemann-densities over [a, b]. Clearly RPDF+[a,b] ⊆ RPDF[a,b].

4. CRPDF+[a,b] denotes the set of continuous positive Riemann-densities over [a, b]. Clearly CRPDF+[a,b] = CRPDF[a,b] ∩ RPDF+[a,b].

Theorem 1. If f ∈ RPDF[a,b], then f is bounded everywhere and continuous almost everywhere.

Proof. This is a well-known fact. A proof can be found in Apostol2.

2.2 Entropy and redundancy

This section presents some classical definitions of entropy from information theory3 as well as the term ‘redundancy’ as used in this work.

2.2.1 Shannon entropy

Claude Shannon introduced the concept of information entropy in 19484. Since then, it has become one of the pillars of information theory. Informally, information entropy, or simply entropy, is a measure of the average ‘information content’ that comes from realizations of a given discrete random variable.

A discrete random variable X is a function5 X : Ω → [0, 1] such that its domain Ω, called the state space, is finite or countably infinite, and \sum_{ω∈Ω} X(ω) = 1. As usual, the notation X(ω) is replaced by P(X = ω), which denotes the probability that a random realization of X is ω.

Definition 2. The entropy of a discrete random variable X over a state space Ω is defined as

h(X) := \sum_{ω∈Ω} \begin{cases} 0, & \text{if } P(X = ω) = 0 \\ -P(X = ω) \log P(X = ω), & \text{otherwise.} \end{cases}

For simplicity, it is written simply as

h(X) := -\sum_{ω∈Ω} P(X = ω) \log P(X = ω),

where p \log p is taken to be zero when p = 0, in accordance with the limit \lim_{p→0^+} p \log p = 0.

2 Tom M. Apostol. Mathematical Analysis. 1964, pp. 169–172.
3 Thomas M. Cover. Elements of Information Theory. John Wiley & Sons, 1999.
4 Claude Elwood Shannon. “A mathematical theory of communication”. In: Bell System Technical Journal 27.3 (1948), pp. 379–423.
5 Jean Jacod and Philip Protter. Probability Essentials. Springer Science & Business Media, 2012.

Unless explicitly set to 2 (for entropy in bits), the base of the logarithm is assumed to be e throughout this document; thus the entropy unit is the nat, where 1 bit = log 2 nats.

An important claim about entropy is that it effectively measures the expected amount of information per realization of a discrete random variable. Technically, no lossless compression scheme can achieve an expectation of less than h(X) bits per realization, but it is always possible to achieve an expectation of less than h(X) + ε for any ε > 0.6 Therefore, the term “information content per realization”, which may appear subjective in principle, can be simply understood as the minimum average number of bits per realization required to encode a large sequence of realizations.

In particular, the Huffman encoding7 produces a compression rate r that falls in the range r ∈ [h(X), h(X) + ε] with ε = 1. It works by encoding each realization separately and is optimal under this constraint. More generally, by removing the constraint and allowing a single code to represent several realizations, the constant ε can be made smaller.
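The gap between h(X) and the Huffman rate can be checked numerically. The following is a minimal sketch (Python, standard library only; the example distribution is arbitrary and not taken from the text) that builds a Huffman code and compares its expected length in bits with the Shannon entropy.

import heapq, math

def shannon_entropy_bits(p):
    # h(X) in bits, with the convention 0 * log(0) = 0
    return -sum(q * math.log2(q) for q in p if q > 0)

def huffman_lengths(p):
    # Code length of each outcome of a Huffman code for the distribution p.
    # Heap items are (probability, tiebreaker, list of outcome indices).
    heap = [(q, i, [i]) for i, q in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        q1, _, ids1 = heapq.heappop(heap)
        q2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:          # merging two trees adds one bit to each symbol inside
            lengths[i] += 1
        heapq.heappush(heap, (q1 + q2, counter, ids1 + ids2))
        counter += 1
    return lengths

p = [0.5, 0.25, 0.15, 0.07, 0.03]                    # arbitrary example distribution
lengths = huffman_lengths(p)
expected = sum(q * l for q, l in zip(p, lengths))    # expected bits per realization
h = shannon_entropy_bits(p)
print(h, expected)                                   # h <= expected < h + 1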

2.2.2 Differential entropy and LDDP

For the continuous case, the discrete entropy formula is no longer of use because the state space of continuous random variables is uncountable. Therefore, Shannon proposed a different name and a different formula.

Definition 3. The differential entropy of a continuous random variable X with density f over a non-empty interval [a, b] is defined as

h(X) := -\int_a^b f(x) \log f(x) dx,

where p \log p is taken to be zero when p = 0, in accordance with the limit.

6 Claude Elwood Shannon. “A mathematical theory of communication”. In: Bell System Technical Journal 27.3 (1948), pp. 379–423.
7 Paul Cuff. Course Notes of the Information Signals Course (Kulkarni). URL: https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf.

Differential entropy has been useful in the continuous setting, but it does not retain the original meaning of entropy nor its relationship with optimal compression. As a matter of fact, differential entropy can be negative: let X have a uniform distribution over the interval [0, r] for some r ∈ (0, 1). Then h(X) = -\int_0^r (1/r) \log(1/r) dx = \log(r) < 0. Moreover, h(X) → −∞ as r → 0^+.

Edwin Jaynes8 was apparently the first to write about this issue and proposed an adjustment called the limiting density of discrete points (LDDP). According to Jaynes, the correct way to extend the definition of entropy is not to simply replace the sum \sum with an integral \int, as it appears in Shannon's formula, but to consider N points in the domain and to study the entropy as N grows, as shown in Definition 4.

Definition 4. Let X be any continuous random variable with domain [a, b] ⊆ ℝ and density f, and let m be another density on [a, b]. Then, the entropy of X with respect to m is defined as

lddp(X, m) = \lim_{N→∞} \left( \log(N) - \int_a^b f(t) \log \frac{f(t)}{m(t)} dt \right) = \lim_{N→∞} \left( \log(N) - D_{KL}(f \| m) \right),

where D_{KL} denotes the Kullback–Leibler divergence.

This definition transports the relationship between entropy and information content from the discrete to the continuous domain directly. However, it forcibly introduces a second density function and diverges towards infinity by definition. Nevertheless, as pointed out by Marsh9, it is important to highlight LDDP before introducing differential entropy, because although the latter is the standard extension of discrete entropy to continuous domains, the former is the natural way of doing it. Marsh also suggests that for continuous random variables it is more natural to consider relative entropy between pairs than absolute entropy of individual random variables.

8 Edwin Thompson Jaynes. “Information Theory and Statistical Mechanics”. In: Brandeis University Summer Institute Lectures in Theoretical Physics (1963), pp. 182–218.

2.2.3 Rényi entropy

Rényi entropy, Definition 5, is a generalization of Shannon entropy obtained by introducing an additional parameter α ∈ [0, ∞]10.

Definition 5. The α-entropy of a discrete random variable X over a state space Ω is defined as

h_α(X) := \begin{cases} \frac{1}{1-α} \log \sum_{ω∈Ω} P(X = ω)^{α}, & \text{if } α ∈ [0, 1) ∪ (1, ∞) \\ \lim_{a→α} h_a(X), & \text{if } α = 1 \text{ or } α = ∞. \end{cases}

The parameter α can be tuned as follows to obtain different notions of entropy, including Shannon entropy.

• If α = 0, the result is called Hartley entropy.

• If α = 1, Shannon entropy is obtained: h_1(X) = h(X).
• If α = 2, the result is called collision entropy, although it is also ambiguously called Rényi entropy.
• If α = ∞, the result is called min-entropy.
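As a small numerical illustration (a sketch in Python with NumPy; the distribution below is an arbitrary example), the following function evaluates h_α for a discrete distribution and recovers the special cases listed above.

import numpy as np

def renyi_entropy(p, alpha):
    # alpha-entropy (in nats) of a discrete distribution p
    p = np.asarray([q for q in p if q > 0], dtype=float)
    if alpha == 1:                         # Shannon entropy (limit alpha -> 1)
        return float(-(p * np.log(p)).sum())
    if alpha == np.inf:                    # min-entropy (limit alpha -> infinity)
        return float(-np.log(p.max()))
    return float(np.log((p ** alpha).sum()) / (1 - alpha))

p = [0.5, 0.25, 0.125, 0.125]
for a in [0, 1, 2, np.inf]:
    print(a, renyi_entropy(p, a))
# alpha = 0 gives the logarithm of the support size (Hartley entropy);
# the values are non-increasing in alpha.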

9 Charles Marsh. “Introduction to continuous entropy”. In: Department of Computer Science, Princeton University (2013).
10 Thomas M. Cover. Elements of Information Theory. John Wiley & Sons, 1999.

15 Rényi entropy can also be extended to real continuous domains as follows.

Definition 6. The differential entropy of a continuous random variable X with density f over a non-empty interval [a, b] is defined as

h(X) := -\int_a^b f(x) \log f(x) dx,

where, again, p \log p is taken to be zero when p = 0, and the differential α-entropy as

h_α(X) := \begin{cases} \frac{1}{1-α} \log \int_a^b f(x)^{α} dx, & \text{if } α ∈ [0, 1) ∪ (1, ∞) \\ \lim_{a→α} h_a(X), & \text{if } α = 1 \text{ or } α = ∞. \end{cases}

Again, for differential entropy, it also holds that h1(X) = h(X).

2.2.4 Redundancy

For the purposes of this thesis, the notion of redundancy is introduced.

Definition 7. The α-redundancy of a discrete random variable Xn with n possible outcomes is defined as

Rα(Xn) = log(n) − hα(Xn), and the redundancy as

R(Xn) := R1(Xn).

The term redundancy is justified by the fact that if the outcomes of X_n are encoded using log n nats11 per outcome, then R(X_n) nats will be redundant.

The definition forces R(Xn) ≥ 0, with equality holding if and only if the outcomes of Xn are all equally likely, i.e. uniform.

2.3 Discretization

This section presents two formal definitions around discretization and their properties, namely n-partitions and discretization functions. These are the two fundamental objects of study of this work.

Additionally, the importance and relationship between discretization and digitalization is presented in prose, as well as some intuition behind the mathematical definitions.

2.3.1 Digitalization

Realizations from continuous random variables are often converted into binary data that can be stored, transmitted and processed by computers. This process consists of two stages called discretization and encoding. Firstly, a function lossily maps the continuous input into a discrete, i.e. countable, set of outputs to make it encodable; secondly, an encoding scheme translates the discrete outputs into binary strings.

By lossy we mean that the discrete output is a mere approximation of the continuous input, although the two can be very similar. In other words, the process distorts the continuous input. Having no distortion at all is usually unnecessary and, in general, impossible simply because of the cardinality mismatch between the uncountable set of possible inputs and the countable set of possible outputs.

The composition of the two stages yields digitalization: a lossy conversion from realizations of a given random variable into binary strings. Digitalization is found in every sensor with digital output, specifically in the analog-to-digital conversion circuit. If several samples are to be digitalized for further transmission, storage or direct processing, there must be a physical constraint due to the channel bandwidth during transmission or the maximum storage capacity. In any case, it is desirable to represent each sample using as few bits as possible, so that transmission and storage are feasible. The number of bits used per sample depends on the encoder function, and it is not necessarily fixed, i.e. it may be different for different samples. However, the expected number of bits used per sample can always be computed given the encoder function and the probability distribution of the discrete source. Moreover, compression can be included to reduce the expected number of bits used per sample, although not indefinitely.

11 A nat is a unit of information.

2.3.2 n-partitions

Discretization is the process in which a continuous variable is mapped into a countable set of values. Concretely, this thesis introduces the notion of n-partitions to discretize continuous random variables, as presented in Definition 8.

Definition 8. Let n ≥ 1 and a < b. An n-partition of a random variable X ∼ f ∈ RPDF[a,b], or simply of f, is a sequence (x_i)_{i=0}^n with tags (x_i^*)_{i=1}^n such that a = x_0 < x_1 < ··· < x_n = b and x_{i-1} ≤ x_i^* ≤ x_i for i = 1, ..., n.

If the sequence of tags is not explicitly given in the document, the tags are assumed to be the local centroids (x_i^* = μ_i) given by

μ_i := \begin{cases} \frac{1}{p_i} \int_{x_{i-1}}^{x_i} x f(x) dx, & \text{if } p_i > 0 \\ \frac{x_i + x_{i-1}}{2}, & \text{if } p_i = 0, \end{cases}

where p_i := \int_{x_{i-1}}^{x_i} f(x) dx.

The discrete random variable induced by the n-partition is defined as X_n := q(X), where q is the discrete mapping given by

q(x) := \begin{cases} x_1^*, & \text{if } x_0 ≤ x < x_1 \\ x_2^*, & \text{if } x_1 ≤ x < x_2 \\ \;\vdots \\ x_n^*, & \text{if } x_{n-1} ≤ x ≤ x_n. \end{cases}

Figure 2.3.1: Discretization of a continuous signal (blue) into n = 4 (above) and n = 16 (below) possible output values (red) and the resulting discretized output (orange). The similarity between the input and the output improves as n grows.

Figure 2.3.1 shows the discretization of a fictitious example signal into 4 and 16 evenly separated output values. At each time t ∈ [0, 1], the output signal (orange) is computed by rounding the input (blue) to the nearest possible output value (red). Intuitively, the larger the number of possible values, the better the approximation. The quality of the approximation, or fidelity, may be understood as the opposite of the average squared error between the input and the discretized output, or as the average absolute error.

Figure 2.3.2: Uniform placement vs. finely tuned placement of n = 8 output values for a given input histogram. The tuned placement improves the similarity between the input and the discretized output.

The placement of the output values in Figure 2.3.1 is even along the range of the input signal. This uniform placement is typical in generic analog-to-digital converters, in which no additional properties of the input signal are known during the circuit design; only the minimum and maximum values are assumed. However, if the probability density function of the input is known, the placement of the output values can be tuned to improve the similarity between the output and the input.

Indeed, as shown in Figure 2.3.2, for a given known histogram of the input (green), the output values can be placed non-uniformly in such a way that fidelity is improved. Intuitively, it makes sense to sacrifice fidelity around less frequent regions in order to improve fidelity in the more frequent ones.
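To make Definition 8 concrete, the following sketch (Python with NumPy and SciPy; the truncated normal density and the equal-width 4-partition are stand-in choices, not examples from the text) computes the local-centroid tags μ_i and the induced mapping q by numerical integration.

import numpy as np
from scipy import integrate, stats

a, b = -5.0, 5.0
Z = stats.norm.cdf(b) - stats.norm.cdf(a)
f = lambda x: stats.norm.pdf(x) / Z              # standard normal density truncated to [a, b]

def local_centroids(f, xs):
    # tags x_i^* chosen as the local centroids mu_i of Definition 8
    mus = []
    for lo, hi in zip(xs[:-1], xs[1:]):
        p, _ = integrate.quad(f, lo, hi)
        if p > 0:
            m, _ = integrate.quad(lambda x: x * f(x), lo, hi)
            mus.append(m / p)
        else:
            mus.append((lo + hi) / 2)
    return np.array(mus)

def make_q(xs, tags):
    # induced mapping q: each input is sent to the tag of its bin
    def q(x):
        i = np.clip(np.searchsorted(xs, x, side='right') - 1, 0, len(tags) - 1)
        return tags[i]
    return q

n = 4
xs = np.linspace(a, b, n + 1)                    # an equal-width n-partition
tags = local_centroids(f, xs)
q = make_q(xs, tags)
print(xs, tags, q(0.3))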

2.3.3 Distortion and entropy

In order to evaluate the quality of a discretization Xn of a continuous random variable X, it is natural to consider the following metrics.

Definition 9. Given an n-partition ρ = (x_i)_{i=0}^n of X ∼ f ∈ RPDF[a,b], we define the following properties of the n-partition.

1. α-entropy: h_α(X_n).
2. k-distortion: D_k(X_n) := E[|X − X_n|^k].
3. Scaled k-distortion: S_k(X_n) := n^k D_k(X_n).
4. α-redundancy: R_α(X_n) := \log n − h_α(X_n).

The last two properties are useful for discretization functions (Section 2.3.5).

Notice in particular that the 2-distortion accounts for the expected squared error E[(X − X_n)^2] and the 1-distortion accounts for the expected absolute error E[|X − X_n|].
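These quantities can be approximated numerically. The sketch below (Python with NumPy and SciPy; the uniform density and the midpoint tags are illustrative choices, and the scaled distortion is taken as S_k = n^k D_k following Definition 9) evaluates the 2-distortion, the scaled 2-distortion and the redundancy of a given n-partition.

import numpy as np
from scipy import integrate

def partition_metrics(f, xs, tags, k=2):
    # k-distortion D_k, scaled k-distortion S_k = n^k D_k and redundancy R_1 (Definition 9)
    n = len(tags)
    Dk, probs = 0.0, []
    for lo, hi, t in zip(xs[:-1], xs[1:], tags):
        d, _ = integrate.quad(lambda x: abs(x - t) ** k * f(x), lo, hi)
        p, _ = integrate.quad(f, lo, hi)
        Dk += d
        probs.append(p)
    probs = np.array(probs)
    probs = probs[probs > 0]
    h = -np.sum(probs * np.log(probs))            # Shannon entropy of X_n, in nats
    return {'D_k': Dk, 'S_k': n ** k * Dk, 'R_1': np.log(n) - h}

# Uniform density on [0, 1], equal-width 4-partition with midpoint (centroid) tags:
f = lambda x: 1.0
xs = np.linspace(0.0, 1.0, 5)
tags = (xs[:-1] + xs[1:]) / 2
print(partition_metrics(f, xs, tags))             # D_2 = 1/192, S_2 = 1/12, R_1 = 0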

2.3.4 Discretization functions

Discretization functions make it possible to systematically discretize continuous random variables into n bins for all n ≥ 1.

Definition 10. Let X ∼ f ∈ RPDF[a,b]. A discretization function for X is any function g ∈ RPDF+[a,b]. For each n ≥ 1, the n-partition of X induced by g is the sequence (x_i)_{i=0}^n given by x_i := G^{-1}(i/n), where G(x) = \int_a^x g(t) dt.

Notice that since g(x) > 0 for all x ∈ [a, b], it follows that G : [a, b] → [0, 1] is an increasing bijection, thus G^{-1} is well-defined over [0, 1] and is also an increasing bijection.

In addition to that of Definition 8, we use the following notation to simplify the algebra:

• Center and radius:

c_i = \frac{x_{i-1} + x_i}{2}; \qquad r_i = \frac{x_i − x_{i-1}}{2},

so that [x_{i-1}, x_i] = [c_i − r_i, c_i + r_i].

• Mass:

p_i = P(x_{i-1} ≤ X ≤ x_i) = \int_{x_{i-1}}^{x_i} f(x) dx.

• Centroid:

μ_i := \begin{cases} \frac{1}{p_i} \int_{x_{i-1}}^{x_i} x f(x) dx, & \text{if } p_i > 0 \\ c_i, & \text{if } p_i = 0. \end{cases}

A discretization function for a given random variable X with domain [a, b] ⊆ ℝ is any positive density g over [a, b]. For any positive integer n ≥ 1, g induces a discretization X_n of X whose split points x_0, x_1, ..., x_n are such that

\int_a^{x_i} g(x) dx = \frac{i}{n}.

Informally, g(x) determines, for sufficiently large values of n, the precision or resolution of the discretization around the value x. Figure 2.3.3 depicts this intuitive phenomenon: since g(0) > g(1), there are more intervals around 0 than around 1.
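A discretization function can be turned into an n-partition by inverting G numerically. The following sketch (Python with NumPy and SciPy; the truncated standard normal is used for g, loosely mirroring Figure 2.3.3, and the truncation is an assumption so that g is a density on the interval) computes x_i = G^{-1}(i/n) by bisection.

import numpy as np
from scipy import integrate, optimize, stats

a, b = -5.0, 5.0
Zg = stats.norm.cdf(b) - stats.norm.cdf(a)
g = lambda x: stats.norm.pdf(x) / Zg             # discretization function (a density on [a, b])

def G(x):
    v, _ = integrate.quad(g, a, x)
    return v

def induced_partition(n):
    # x_i = G^{-1}(i / n), obtained by solving G(x) = i / n with bisection
    xs = [a]
    for i in range(1, n):
        xs.append(optimize.brentq(lambda x, t=i / n: G(x) - t, a, b))
    xs.append(b)
    return np.array(xs)

print(np.round(induced_partition(8), 3))         # more bins near 0, where g is larger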

The discretization function g can be chosen so that X ∼ g, that is, g = f. In that case, P(x_{i-1} ≤ X ≤ x_i) is constant for all i = 1, ..., n. This particular choice for g is a good example of a discretization that improves fidelity around likely outcomes of X by sacrificing it around less likely values, but it is still usually non-optimal.

Figure 2.3.3: Normal random variable with μ = 1, σ = 2 discretized into n = 32 intervals using a normal distribution g with μ = 0, σ = 1 as discretization function. Any other n could be chosen to obtain an n-partition.

2.3.5 Redundancy and scaled distortion

For measuring the quality of discretization functions, the first natural approach is to consider the behavior of distortion and entropy as n → ∞. However, for most discretization functions, the expected squared error drops to zero and the entropy increases indefinitely as n → ∞, that is, E[(X − X_n)^2] → 0 and h(X_n) → ∞. Therefore, these quantities must be rescaled so as to cancel out this default behavior and make the limit useful. This gives rise to the notion of redundancy of Definition 7, to some of the quantities in Definition 9, and to Definition 11.

Definition 11. Given a discretization function g ∈ RPDF+[a,b] for X ∼ f ∈ RPDF[a,b], we define the following properties of (f, g).

1. Scaled k-distortion: S_k(f, g) := \lim_{n→∞} S_k(X_n).

2. α-redundancy: R_α(f, g) := \lim_{n→∞} R_α(X_n).

Chapter 3

Optimal quantizers

This chapter presents the minimal distortion high resolution quantizer of Gish and Pierce1 from the signal processing community, which minimizes distortion for large resolutions. Concretely, the quantizer is a discretization function whose scaled distortion is minimal.

The formulas that characterize this optimal quantizer are presented formally in this chapter, and they are generalized and derived independently using real analysis in Chapter5.

3.1 Context

Signal processing is an area closely related to electronics, and therefore also to sensorics and discretization. Several important facts related to discretization have been found in this area, but the terminology is different. In signal processing, discretization refers exclusively to time sampling, while quantization refers to approximating the continuous value of the samples by a predefined discrete set of options. More precisely, the term discretization (as used in this work) coincides with the term quantization from the signal processing community.

1 Herbert Gish and John Pierce. “Asymptotically efficient quantizing”. In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676–683.

Two main related theorems about quantizers are considered in this document: the 6 dB/bit rule and the asymptotically optimal quantizer2. On the one hand, the 6 dB/bit rule claims that for uniform discretizations, adding 1 bit decreases the distortion by a factor of 4, i.e. 6 dB. On the other hand, the formulas about asymptotically optimal quantizers by Gish and Pierce state that the distortion caused by a discretization function g over a density f approaches \frac{1}{12 n^2} \int \frac{f}{g^2}, and that it is minimized when g(x) = \sqrt[3]{f(x)} \big/ \int \sqrt[3]{f}. The distortion formula implies that the 6 dB/bit rule applies not only to the uniform distribution, but asymptotically to any other distribution and discretization function.
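The asymptotic distortion formula can be checked numerically in the simplest case of a uniform discretization function, for which (1/(12n^2)) ∫ f/g^2 reduces to (b − a)^2/(12n^2). The sketch below (Python with NumPy and SciPy; the truncated exponential density is an illustrative choice echoing Table 3.1) compares the exact 2-distortion of the induced partition, with centroid tags, against the formula.

import numpy as np
from scipy import integrate

a, b = 0.0, 8.0
Zf = 1 - np.exp(-b)
f = lambda x: np.exp(-x) / Zf                    # exponential density truncated to [0, 8]

def distortion(f, xs):
    # 2-distortion of the partition xs with local-centroid tags
    D = 0.0
    for lo, hi in zip(xs[:-1], xs[1:]):
        p, _ = integrate.quad(f, lo, hi)
        if p == 0:
            continue
        m, _ = integrate.quad(lambda x: x * f(x), lo, hi)
        mu = m / p
        d, _ = integrate.quad(lambda x: (x - mu) ** 2 * f(x), lo, hi)
        D += d
    return D

# Uniform discretization function g = 1/(b - a): the asymptotic formula reduces
# to (b - a)^2 / (12 n^2) because the integral of f over [a, b] is 1.
for n in [8, 32, 128]:
    xs = np.linspace(a, b, n + 1)
    print(n, distortion(f, xs), (b - a) ** 2 / (12 * n ** 2))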

3.2 Formal definition

Definition 12. For any f ∈ RPDF[a,b], the optimal quantizer for f is defined as the discretization function \hat{f} ∈ RPDF[a,b] given for all x ∈ [a, b] by

\hat{f}(x) = \frac{\sqrt[3]{f(x)}}{\int_a^b \sqrt[3]{f(t)} dt}.

This quantizer has the minimal scaled distortion (in the limit) among all discretization functions. This fact is proven in Theorem 21 of Chapter 5.

2 Herbert Gish and John Pierce. “Asymptotically efficient quantizing”. In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676–683.

26 3.3 Application in sensorics

The vast majority of electronic sensors feed data to a digital receiver, such as a microprocessor, a computer or a database. This implies not only that discretization takes place in almost all sensor systems, but also that it should be designed properly. Discretization, as a conceptual component of digitalization, occurs physically in the circuitry of these sensors. The digitalization stage converts the sensor's continuous input into one of several possible outputs, each with an associated binary code.

The standard discretization of sensor data is uniform. That is, the interval [a, b] where the input variable is expected to occur is divided into n disjoint intervals of equal length. Regarding the uniform discretization, two main claims are made.

1. The discretization can be improved when the distribution of the input is known.

2. If the distribution of the input is unknown, but the interval [a, b] is known, then the uniform discretization should be chosen.

On the one hand, consider any input X ∼ f ∈ RPDF[a,b], any n ≫ 2 and any discretization function g ∈ RPDF+[a,b]. Consider also the quantile function Q := G^{-1} : [0, 1] → [a, b], which states that, for a given fraction p ∈ [0, 1], the fraction p of the total number of discretization bins is located in the interval [a, Q(p)]. The thresholds (x_i)_{i=0}^n that divide the interval [a, b] into n bins are x_i = Q(i/n).

The quantile function Q defines the discretization uniquely; therefore, it can be tuned for minimal distortion. Assuming that n is sufficiently large, it is recommended to design Q so as to yield an asymptotically optimal quantizer, so that

\int_a^{Q(p)} \sqrt[3]{f(t)} dt = p \int_a^b \sqrt[3]{f(t)} dt.
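In practice Q(p) can be obtained by root finding. The following sketch (Python with NumPy and SciPy; the truncated normal density is a stand-in example) computes the thresholds x_i = Q(i/n) of the asymptotically optimal quantizer from the equation above.

import numpy as np
from scipy import integrate, optimize, stats

a, b = -5.0, 5.0
Z = stats.norm.cdf(b) - stats.norm.cdf(a)
f = lambda x: stats.norm.pdf(x) / Z              # truncated normal density
f13 = lambda x: f(x) ** (1.0 / 3.0)
total, _ = integrate.quad(f13, a, b)             # integral of f^(1/3) over [a, b]

def Q(p):
    # quantile of the optimal quantizer: integral_a^{Q(p)} f^(1/3) = p * total
    return optimize.brentq(lambda x: integrate.quad(f13, a, x)[0] - p * total, a, b)

n = 16
thresholds = [a] + [Q(i / n) for i in range(1, n)] + [b]
print(np.round(thresholds, 3))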

27 Figure 3.3.1: Uniform quantizer vs finely tuned. n = 16. The total distortion is recorded in Table 3.1.

Particularly, for a truncated normal density and a truncated exponential density, Table 3.1 and Figure 3.3.1 compare the uniform quantile function with the asymptotically optimal quantizer. Distortion is improved with the latter.

On the other hand, if the input distribution f is not known, the maximum entropy principle implies that among all the distributions with domain [a, b], the uniform is the best candidate because it has the maximum entropy. This can also be inferred from the fact that if we distribute m balls randomly into n bins, where m ≫ n ≫ 1, then the distribution of the balls is expected to be nearly uniform.

Technically, however, if the ADC hardware were programmable and allowed changing the internal threshold values, then a better approach could be used. As the sensor senses, f can be estimated and updated, and the threshold values can be updated frequently to decrease the expected future distortion.

n     Normal uniform   Normal tuned   Exp. uniform   Exp. tuned
2     3.64e-01         3.64e-01       5.71e-01       3.47e-01
4     2.46e-01         1.21e-01       2.42e-01       1.03e-01
8     1.02e-01         3.59e-02       7.51e-02       2.76e-02
16    3.06e-02         9.74e-03       2.03e-02       7.01e-03
256   1.32e-04         4.16e-05       8.12e-05       2.77e-05

Table 3.1: Average distortion (300 000 samples) after applying two quantile functions on a normal distribution truncated to [−5, 5] and an exponential truncated to [0, 8]. The tuned intervals introduce less distortion.

Chapter 4

Uses in machine learning

This chapter compares several standard discretization mechanisms in machine learning against the optimal quantizer of Chapter 3. The optimal quantizer is shown to be useful as an alternative to classic discretization methods for the following tasks, under the assumption that the input distribution is known or well estimated:

1. Discretizing a dataset for subsequent classification with a discrete classifier (Section 4.1). This task is important in machine learning because it allows complex discrete methods to be used over continuous data. The performance of these discrete methods may surpass that of the continuous methods or at least serve as a point of comparison.

2. Splitting a continuous dataset into two (Section 4.2). This task underlies decision trees and, more generally, all classifiers based on them, such as random forests or xgboost. Since this task is performed several times during the training of these data structures, it is crucial to do it properly, maximizing an adequate objective function.

The classic discretization methods that are compared against the optimal quantizer are:

1. Equal width bins1: the continuous dataset is divided into bins of equal width.

2. Equal frequency bins2: the continuous dataset is divided into bins containing approximately equal numbers of samples.

Given that the optimal quantizer can be easily described in a single formula, this work considers only these two simple classic discretization methods. Unsupervised iterative methods3 are not reviewed in this work.

4.1 Preprocessing

This section compares different discretization methods over synthetic datasets. More precisely, it compares the 5-fold cross-validated scores of 6 different classifiers from the library sklearn over 12 different synthetic datasets of 4 different kinds. The datasets have 2000 samples, each with three continuous features (x_i, y_i, z_i) and one binary output t_i. Each column was discretized into 10 categories separately, using the same discretization method each time, and the performance of the classifiers was evaluated on each scenario.

Table 4.1 shows the results of the experiment. The highest scores were obtained on average when there was no discretization at all (raw), followed by an approximation to the optimal quantizer of Chapter 3, closely followed by the equal frequency discretization method, and lastly by the equal width method with a relatively poor performance. Equal frequency divides the data into bins containing approximately the same number of samples, while equal width divides the range uniformly into bins of equal width. The implementation details are available at www.caph.info/wiki/#Files with a copy at bitbucket.org/caph1993/thesis-test.

Table 4.1: Average classification scores obtained after discretizing the input columns with different methods, and a colored percentual comparison of improvement with respect to the row average values.

1 Jason Catlett. “On changing continuous attributes into ordered discrete attributes”. In: European Working Session on Learning. Springer, 1991, pp. 164–178; James Dougherty, Ron Kohavi, and Mehran Sahami. “Supervised and unsupervised discretization of continuous features”. In: Machine Learning Proceedings 1995. Elsevier, 1995, pp. 194–202; Randy Kerber. “Chimerge: Discretization of numeric attributes”. In: Proceedings of the Tenth National Conference on Artificial Intelligence. 1992, pp. 123–128.
2 See the references in footnote 1.
3 Daniela Joiţa. “Unsupervised static discretization methods in data mining”. In: Titu Maiorescu University, Bucharest, Romania (2010).
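The exact datasets and classifiers are those of the linked repository; purely as an illustration of the workflow, the following sketch (Python with NumPy, SciPy and scikit-learn; the synthetic dataset, the single decision-tree classifier and the Gaussian feature model are stand-ins, not the thesis setup) compares equal width, equal frequency and optimal-quantizer binning. For a normal column the optimal quantizer has a convenient closed form, since the normalized cube root of a N(0, σ^2) density is again a normal density with scale σ√3; the truncation to a bounded interval required by the theory is ignored in this sketch.

import numpy as np
from scipy import stats
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_bins = 2000, 10
X = rng.normal(size=(n_samples, 3))                           # three continuous features
t = (X[:, 0] + X[:, 1] ** 2 + 0.3 * rng.normal(size=n_samples) > 1).astype(int)

def score(Xd):
    clf = DecisionTreeClassifier(max_depth=6, random_state=0)
    return cross_val_score(clf, Xd, t, cv=5).mean()

# equal width and equal frequency binning
for strategy in ['uniform', 'quantile']:
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy=strategy)
    print(strategy, score(disc.fit_transform(X)))

# optimal quantizer, assuming the (standard normal) density of each column is known:
# its bin edges are the equal-frequency quantiles of a normal with scale sqrt(3)
edges = stats.norm.ppf(np.linspace(0, 1, n_bins + 1), scale=np.sqrt(3.0))
Xopt = np.stack([np.digitize(c, edges[1:-1]) for c in X.T], axis=1)
print('optimal', score(Xopt))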

In conclusion, the optimal quantizer showed good results during the preprocessing stage of discretizing continuous data before passing it as input to discrete methods. In particular, it has the best scores on average when compared with the equal width and equal frequency quantizers. However, its performance is similar to that of the equal frequency quantizer, which is simpler.

4.2 Intra-class variance split criterion

This section briefly describes decision trees and explores the minimal weighted variance split criterion that they use. It concludes with explicit counterexamples to the uniqueness of the split point, and with examples of an alternative criterion that yields better results when the tree is sufficiently large. The proposed method assumes that the continuous distribution of the dataset is known.

Section 4.2.1 describes decision trees and their construction. Then, in Section 4.2.2, a thresholding technique from image processing, called Otsu’s method4, is presented and compared with the intra-class variance minimization method. Finally, the two main assertions of this chapter are explained and justified in Sections 4.2.3 and 4.2.4.

4Mehmet Sezgin and Bülent Sankur. “Survey over image thresholding techniques and quantitative performance evaluation”. In: Journal of Electronic imaging 13.1 (2004), pp. 146–166.

33 4.2.1 Construction of decision trees

Decision trees are data structures that cluster a continuous, discrete, or mixed set of data by recursively partitioning it into two. From a graph theory perspective, a decision tree is a rooted binary5 tree, with a split criterion on each non-leaf node and a set of samples on each leaf. The sets of samples on the leaves form a partition of the training dataset, that is, every sample of the training dataset lies in exactly one leaf.

The split criteria of a given tree determine to which leaf any given sample belongs: if the sample satisfies the criterion at the root node, then it belongs to the left sub-tree, otherwise to the right one; this process is repeated on the selected subtree until a leaf is reached. Notice that this procedure can also be applied to non-training samples. Therefore, a decision tree partitions the whole space of possible samples by means of its split criteria.

The quality of a decision tree is determined by the partition it induces over the space of possible samples. In general terms, it is ideal to have similar samples in the same group and dissimilar ones in different groups. Since the split criteria determine the partition, it is mandatory to set them carefully and, for speed and simplicity, to make them simple. It is typical to have very simple criteria and to tune them locally using greedy strategies. These tunings are very fast to apply and they usually reach good quality, but they do not guarantee global optimality.

For discrete sets of data, split criteria are of the form ‘does x belong to X?’, where x is the sample and X is a subset of the domain. The tuning is usually achieved by maximizing either Shannon entropy or Rényi entropy6, also called Gini diversity. For continuous sets, the split criteria are of the form ‘x ≤ t?’

5 There are variants that allow more than 2 children, but they are not covered in this document.
6 Catuscia Palamidessi and Marco Romanelli. “Feature selection with Rényi min-entropy”. In: IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, 2018, pp. 226–239.

34 where t is a threshold, and the tuning is achieved by maximizing inter-class variance, or equivalently, by minimizing intra-class variance.

The construction of a decision tree when the dataset is, in particular, a set of one-dimensional samples from a real continuous random variable is referred to here as the recursive thresholding method (Definition 13).

Definition 13. The recursive thresholding method is a function that maps a density f ∈ RPDF[a,b] and a power of two, say 2^k for some non-negative integer k, to a 2^k-partition (x_i)_{i=0}^{2^k} by applying the following algorithm.

1. Let j = 0 and start with the 1-partition given by x_0 = a and x_1 = b.

2. Use the 2^j-partition (x_i)_{i=0}^{2^j} to compute a 2^{j+1}-partition (z_i)_{i=0}^{2^{j+1}} given by

z_{2i} = x_i

and

z_{2i+1} = \arg\min_{x ∈ [x_i, x_{i+1}]} V_{[x_i, x_{i+1}]}(x),

where V_{[l,r]}, which is detailed further in Definition 14, is the intra-class variance curve of the function f restricted to the interval [l, r] and scaled properly to be a density.

3. Rename (x_i)_{i=0}^{2^{j+1}} := (z_i)_{i=0}^{2^{j+1}}, increase the variable j by one, and repeat until a 2^k-partition (x_i)_{i=0}^{2^k} is obtained.

As explained in Proposition 7, the arg min procedure might have more than one solution. It is assumed that the smallest argument is chosen greedily.

Figure 4.2.1 depicts the procedure, which goes on until the specified number 2^k of bins is reached.
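A direct numerical transcription of Definition 13 is sketched below (Python with NumPy and SciPy; the truncated normal density and the coarse grid search used for the arg min are assumptions of this sketch, not part of the text).

import numpy as np
from scipy import integrate, stats

a, b = -5.0, 5.0
Z = stats.norm.cdf(b) - stats.norm.cdf(a)
f = lambda x: stats.norm.pdf(x) / Z              # truncated standard normal density

def moments(l, r):
    # unnormalized mass, first and second moments of f on [l, r]
    p, _ = integrate.quad(f, l, r)
    m, _ = integrate.quad(lambda x: x * f(x), l, r)
    s, _ = integrate.quad(lambda x: x * x * f(x), l, r)
    return p, m, s

def V(x, l, r):
    # weighted intra-class variance of the 2-partition (l, x, r);
    # normalization by p(l, r) is omitted since it does not move the arg min
    total = 0.0
    for lo, hi in ((l, x), (x, r)):
        p, m, s = moments(lo, hi)
        if p > 0:
            total += s - m * m / p               # equals p * sigma^2 (Theorem 2)
    return total

def best_split(l, r, grid=100):
    # coarse grid search for the arg min of V on [l, r]
    xs = np.linspace(l, r, grid + 2)[1:-1]
    return xs[int(np.argmin([V(x, l, r) for x in xs]))]

def recursive_thresholding(l, r, k):
    # 2^k-partition obtained by greedy recursive splitting (Definition 13)
    if k == 0:
        return [l, r]
    x = best_split(l, r)
    return recursive_thresholding(l, x, k - 1)[:-1] + recursive_thresholding(x, r, k - 1)

print(np.round(recursive_thresholding(a, b, 3), 3))   # 9 thresholds, i.e. 8 bins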

35 Figure 4.2.1: Recursive thresholding of a normal distribution truncated to [−5, 5]. The distribution is split into two bins, then, each bin is split into two, and so on.

36 4.2.2 Min-variance split criterion

The minimum intra-class variance split criterion is a criterion that tells how to split a dataset of continuous samples into two disjoint subsets. It aims to minimize the sum of the variances of each subset weighted by their size. More precisely, it minimizes the intra-class variance curve, defined in Definition 14.

Definition 14. (Intra-class variance curve) Let X ∼ f ∈ RPDF[a,b]. For any sub-interval [l, r] ⊆ [a, b] we define

1. p(l, r) := P(X ∈ [l, r]), or equivalently,

p(l, r) := \int_l^r f(x) dx.

2. μ(l, r) := E[X | X ∈ [l, r]], or equivalently,

μ(l, r) := \frac{\int_l^r x f(x) dx}{\int_l^r f(x) dx}

when l ≠ r, and μ(l, r) = l when l = r.

3. σ^2(l, r) := Var[X | X ∈ [l, r]], or equivalently,

σ^2(l, r) := \frac{\int_l^r (x − μ(l, r))^2 f(x) dx}{\int_l^r f(x) dx}

when l ≠ r, and σ^2(l, r) = 0 when l = r.

4. The intra-class variance curve for X is the function V : [a, b] → ℝ^{≥0} given by

V(x) := p(a, x) σ^2(a, x) + p(x, b) σ^2(x, b).

Notice that for any X ∼ f ∈ RPDF[a,b], the 2-partitions are of the form (a, x, b) and their distortion is precisely given by V (x). Therefore, finding a 2-partition of minimal distortion and minimizing V (x) are equivalent

37 problems.

Figure 4.2.2: Example probability density function and the plot of the weighted intra-class variance V as a function of the threshold location. The optimum is around x = 0.56.

Figure 4.2.2 depicts an example of the intra-class variance curve for a given distribution.

As a side note, the intra-class variance curve appears not only in machine learning, but also in image processing, where a common problem is that of converting a color or gray image into only black and white. A methodology for addressing this problem was proposed by Otsu7. Otsu’s method converts a gray image into black and white by selecting a threshold as follows:

1. Compute the histogram of the image. Let N(x) be the number of pixels in the input image with color x.

2. For each possible gray value x, consider the sets of colors L = {y | y ≤ x} and R = {y | y > x}. Let N_L = \sum_{y∈L} N(y) and N_R = \sum_{y∈R} N(y). Compute the variances V_L and V_R using the distributions f_L(y) := N(y)/N_L and f_R(y) := N(y)/N_R respectively (set the variances to 0 when N_L = 0 or N_R = 0). Compute the weighted intra-class variance as V(x) := \frac{V_L N_L + V_R N_R}{N_L + N_R}.

3. Select the gray value x^* that minimizes the weighted intra-class variance.

4. For each pixel in the input image, set it to white if the pixel value is above the threshold x^*, otherwise set it to black.

7 Nobuyuki Otsu. “A threshold selection method from gray-level histograms”. In: IEEE Transactions on Systems, Man, and Cybernetics 9.1 (1979), pp. 62–66.
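The steps above translate directly into code. The following sketch (Python with NumPy; the synthetic two-mode 'image' is an arbitrary stand-in, not data from the text) selects the Otsu threshold by exhaustive search over the 256 gray levels.

import numpy as np

def otsu_threshold(image):
    # image: array of integer gray levels in 0..255
    counts = np.bincount(image.ravel(), minlength=256).astype(float)
    levels = np.arange(256)
    best_t, best_v = 0, np.inf
    for t in range(255):
        nL, nR = counts[:t + 1].sum(), counts[t + 1:].sum()
        if nL == 0 or nR == 0:
            continue
        muL = (levels[:t + 1] * counts[:t + 1]).sum() / nL
        muR = (levels[t + 1:] * counts[t + 1:]).sum() / nR
        vL = ((levels[:t + 1] - muL) ** 2 * counts[:t + 1]).sum() / nL
        vR = ((levels[t + 1:] - muR) ** 2 * counts[t + 1:]).sum() / nR
        v = (vL * nL + vR * nR) / (nL + nR)   # weighted intra-class variance
        if v < best_v:
            best_t, best_v = t, v
    return best_t

rng = np.random.default_rng(0)
img = np.concatenate([rng.normal(60, 10, 500), rng.normal(180, 15, 500)])
img = np.clip(img, 0, 255).astype(int).reshape(20, 50)
t_star = otsu_threshold(img)
binary = img > t_star                         # white where the pixel exceeds the threshold
print(t_star, binary.mean())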

In the rest of this section, several properties of the intra-class variance curve are derived. Mainly,

1. Theorem 3: the curve is continuous.
2. Theorem 5: the curve is differentiable.
3. Theorem 6: the minima of the curve satisfy a simple equation.

These properties can be exploited to speed up the algorithm behind the method. In particular, most numerical methods, including ternary search, require continuity, and Newton's method requires differentiability.

Theorem 2. For any X ∼ f ∈ RPDF[a,b] and any [l, r] ⊆ [a, b] we have

p(l, r) σ^2(l, r) = \int_l^r x^2 f(x) dx − p(l, r) μ^2(l, r).

Proof. If p(l, r) = 0, the equality 0 = 0 holds, and if p(l, r) ≠ 0, then

p(l, r) σ^2(l, r) = \int_l^r (x − μ(l, r))^2 f(x) dx
= \int_l^r x^2 f(x) dx + μ^2(l, r) p(l, r) − 2 μ(l, r) \int_l^r x f(x) dx
= \int_l^r x^2 f(x) dx − p(l, r) μ^2(l, r).

Theorem 3. For any X ∼ f ∈ RPDF[a,b], the function V is continuous and given by

V(x) := p(a, x) σ^2(a, x) + p(x, b) σ^2(x, b) = σ^2(a, b) + μ^2(a, b) − p(a, x) μ^2(a, x) − p(x, b) μ^2(x, b).

Proof. Using Theorem 2, we have

V(x) = p(a, x) σ^2(a, x) + p(x, b) σ^2(x, b)
= \int_a^x t^2 f(t) dt − p(a, x) μ(a, x)^2 + \int_x^b t^2 f(t) dt − p(x, b) μ(x, b)^2
= \int_a^b t^2 f(t) dt − p(a, b) μ(a, b)^2 + p(a, b) μ(a, b)^2 − p(a, x) μ(a, x)^2 − p(x, b) μ(x, b)^2
= p(a, b) σ^2(a, b) + p(a, b) μ(a, b)^2 − p(a, x) μ(a, x)^2 − p(x, b) μ(x, b)^2
= σ^2(a, b) + μ(a, b)^2 − p(a, x) μ(a, x)^2 − p(x, b) μ(x, b)^2.

This function is continuous on (a, b) because both p and μ are integrals. Moreover, V also happens to be continuous at a and b because

\lim_{x→a^+} V(x) = σ^2(a, b) + μ(a, b)^2 − 0 − p(a, b) μ(a, b)^2 = σ^2(a, b) = V(a)

and

\lim_{x→b^-} V(x) = σ^2(a, b) + μ(a, b)^2 − p(a, b) μ(a, b)^2 − 0 = σ^2(a, b) = V(b).

Corollary 4. Any X ∼ f ∈ RPDF[a,b] has a minimal-distortion 2-partition. In other words, the function V attains its minimum.

Proof. By the extreme value theorem (a consequence of the Bolzano-Weierstrass theorem), since [a, b] is a compact set and V is continuous over it (see Theorem 3), V must attain its maximum and minimum.

Theorem 5. For any X ∼ f ∈ RPDF[a,b], the function V is differentiable on the interval {x ∈ (a, b) | F(x) ∈ (0, 1)}, where F is the cumulative distribution function of X, and it is given on that interval by

\frac{d}{dx} V(x) = f(x) (μ(x, b) − μ(a, x)) (2x − μ(a, x) − μ(x, b)).

Proof. For all x ∈ [a, b], since \frac{d}{dx} p(a, x) = f(x) and \frac{d}{dx} (p(a, x) μ(a, x)) = \frac{d}{dx} \int_a^x t f(t) dt = x f(x), then

\frac{d}{dx} \big( p(a, x) μ^2(a, x) \big) = \frac{d}{dx} \left( \frac{(p(a, x) μ(a, x))^2}{p(a, x)} \right)
= \frac{2 p(a, x) μ(a, x) x f(x) p(a, x) − (p(a, x) μ(a, x))^2 f(x)}{p^2(a, x)}
= f(x) μ(a, x) (2x − μ(a, x)).

Similarly, since \frac{d}{dx} p(x, b) = −f(x) and \frac{d}{dx} (p(x, b) μ(x, b)) = \frac{d}{dx} \int_x^b t f(t) dt = −x f(x), then

\frac{d}{dx} \big( p(x, b) μ^2(x, b) \big) = \frac{d}{dx} \left( \frac{(p(x, b) μ(x, b))^2}{p(x, b)} \right)
= \frac{−2 p(x, b) μ(x, b) x f(x) p(x, b) + (p(x, b) μ(x, b))^2 f(x)}{p^2(x, b)}
= f(x) μ(x, b) (μ(x, b) − 2x).

Therefore,

\frac{d}{dx} V(x) = \frac{d}{dx} \Big( σ^2(a, b) + μ(a, b)^2 − p(a, x) μ^2(a, x) − p(x, b) μ^2(x, b) \Big)
= −\frac{d}{dx} \Big( p(a, x) μ(a, x)^2 + p(x, b) μ(x, b)^2 \Big)
= −f(x) \Big( μ(a, x)(2x − μ(a, x)) + μ(x, b)(μ(x, b) − 2x) \Big)
= f(x) \Big( μ(a, x)^2 − 2x μ(a, x) − μ(x, b)^2 + 2x μ(x, b) \Big)
= f(x) (μ(x, b) − μ(a, x)) (2x − μ(a, x) − μ(x, b)).

Theorem 6. For any X ∼ f ∈ RPDF[a,b], if x is a local minimum of V, it must satisfy

x = \frac{μ(a, x) + μ(x, b)}{2},

and p(a, x) > 0 and p(x, b) > 0.

Proof. From Theorem 5 it follows that the extreme points of V over (a, b) must satisfy

\frac{d}{dx} V(x) = f(x) (μ(x, b) − μ(a, x)) (2x − μ(a, x) − μ(x, b)) = 0.

More precisely, since f(x) ≥ 0 and μ(x, b) − μ(a, x) > x − x = 0, the sign of \frac{d}{dx} V(x), understood as one of {−1, 0, 1}, is given by

\operatorname{sign}\left( \frac{d}{dx} V(x) \right) = \operatorname{sign}(f(x)) \operatorname{sign}(2x − μ(a, x) − μ(x, b)),

thus a local minimum can only occur where 2x − μ(a, x) − μ(x, b) is zero and changes sign from negative to positive.

The rest of the proof deals with showing that p(a, x) > 0 and p(x, b) > 0 must also hold. Let A = {x ∈ [a, b] | p(a, x) = 0}, B = {x ∈ [a, b] | p(x, b) = 0}, a^* = \sup A and b^* = \inf B.

Notice that a ∈ A, b ∈ B, and

\lim_{x→(a^*)^+} \big( 2x − μ(a, x) − μ(x, b) \big) = 2a^* − a^* − μ(a^*, b) < 0,

and

\lim_{x→(b^*)^-} \big( 2x − μ(a, x) − μ(x, b) \big) = 2b^* − μ(a, b^*) − b^* > 0.

Hence \lim_{x→(a^*)^+} \frac{d}{dx} V(x) ≤ 0. Moreover, for all x with a ≤ x < a^* we have f(x) = 0 and \operatorname{sign}\big( \frac{d}{dx} V(x) \big) = 0. It follows that the points in A are not local minima.

Similarly, given \lim_{x→(b^*)^-} \frac{d}{dx} V(x) ≥ 0, and f(x) = 0 and \operatorname{sign}\big( \frac{d}{dx} V(x) \big) = 0 for all x with b^* < x ≤ b, it follows that the points in B are not local minima.
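Theorem 6 suggests a simple way to locate the split point numerically: solve 2x − μ(a, x) − μ(x, b) = 0. The sketch below (Python with NumPy and SciPy; the truncated standard normal density is a stand-in example) does so by bracketing; note that when V has several local minima, as in Section 4.2.3, the equation has several roots and this method returns only one of them.

import numpy as np
from scipy import integrate, optimize, stats

a, b = -5.0, 5.0
Z = stats.norm.cdf(b) - stats.norm.cdf(a)
f = lambda x: stats.norm.pdf(x) / Z              # truncated standard normal density

def mu(l, r):
    # conditional mean of X on [l, r]
    p, _ = integrate.quad(f, l, r)
    if p == 0:
        return l
    m, _ = integrate.quad(lambda x: x * f(x), l, r)
    return m / p

# Theorem 6: any local minimum of V satisfies 2x = mu(a, x) + mu(x, b).
phi = lambda x: 2 * x - mu(a, x) - mu(x, b)
# phi is negative near a and positive near b, so a root exists in between
x_star = optimize.brentq(phi, a + 1e-6, b - 1e-6)
print(x_star)                                    # approximately 0 for this symmetric density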

4.2.3 Non-uniqueness

Proposition 7 shows that Otsu's method may have several optimal values. Therefore, the greedy procedure of ‘minimizing the intra-class variance’ is not deterministic on its own, that is, it requires an additional tiebreaker rule to decide which of the optimal values to choose.

Figure 4.2.3: Example of a density for which the function V has two different local minima whose values are approximately equal.

Proposition 7. There are some X ∼ f ∈ RPDF[a,b] for which the minimum of the function V is not unique.

Proof. This proposition was named a proposition instead of a theorem because the proof, although numerically precise, is not a formal proof.

With the help of a computer, a function f was found such that the curve V (see Definition 14) has two local minima V(x^*) and V(x^{**}) such that V(x^*) ≈ V(x^{**}). The plots of the functions f and V are depicted in Figure 4.2.3. They were found using random brute-force search followed by an evolutionary algorithm.

The curve of f is given by joining the following array of points with line segments. The relative error between V(x^*) and V(x^{**}) is around 2.40 × 10^{-8} and the absolute error is around 2.52 × 10^{-10}.

XY = [
    (0.11705603947710559, 0.012196874213098162),
    (0.6351229049473819, 0.12284926152112471),
    (0.6803465227293913, 1.4104952135907167),
    (0.8580277590515838, 0.6038725439277689),
    (1.0198390439788831, 0.3127715797956027)
]

4.2.4 Non-optimality

This subsection presents an alternative split criterion for decision trees. In some cases, this criterion provides better results than the standard intra-class variance minimization; thus, it also serves as a counterexample to the optimality of the standard methodology.

For decision trees over a (one-dimensional) continuous random variable X ∼ f ∈ RPDF[a,b], if f is known and the function I(x) = \int_a^x \sqrt[3]{f(t)} dt is at hand, then the following split criterion for thresholding any interval [l, r] can be used, and the weighted intra-class variance at the leaves of the tree is likely to be improved, especially if n is large.

The proposed split criterion for thresholding the interval [l, r] consists of finding the threshold value x such that

2I(x) = I(l) + I(r).

This is justified by the fact that if 2I(x) = I(l) + I(r), then x is the median of the optimal discretization function restricted to [l, r].
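The proposed criterion reduces to one-dimensional root finding. The following sketch (Python with NumPy and SciPy; the truncated exponential density is an illustrative choice matching Table 4.2) computes the threshold x with 2I(x) = I(l) + I(r) for a given interval.

import numpy as np
from scipy import integrate, optimize

a, b = 0.0, 8.0
Zf = 1 - np.exp(-b)
f = lambda x: np.exp(-x) / Zf                    # exponential density truncated to [0, 8]

def I(x):
    # I(x) = integral from a to x of f(t)^(1/3) dt
    v, _ = integrate.quad(lambda t: f(t) ** (1.0 / 3.0), a, x)
    return v

def proposed_split(l, r):
    # threshold x in [l, r] such that 2 I(x) = I(l) + I(r)
    target = 0.5 * (I(l) + I(r))
    return optimize.brentq(lambda x: I(x) - target, l, r)

print(proposed_split(a, b))                      # first split of the whole interval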

Indeed, Figure 4.2.4 and Table 4.2 show a comparison between the standard greedy thresholding algorithm (see Definition 13) and the proposed split criterion.

Figure 4.2.4: Recursive thresholding method using the standard split criterion vs. the proposed one. The proposed criterion slightly reduces the concentration of bins around very frequent values and favors sparse regions. The total distortion is recorded in Table 4.2.

This method is only valid if the density f is already known. If it is not, then the criterion would consist of estimating the split point x instead of simply computing it. The exact procedure for estimating x is left as an open problem. It can be stated as follows: given a sequence of samples x_1, x_2, ..., x_n drawn independently from f ∈ RPDF[l,r], estimate the median x of the random variable Y ∼ g, where g(x) = c \sqrt[3]{f(x)} and c is a constant such that \int_l^r g(x) dx = 1.

2^k   Normal standard   Normal proposed   Exp. standard   Exp. proposed
2     3.64e-01          3.64e-01          3.40e-01        3.47e-01
4     1.17e-01          1.21e-01          1.02e-01        1.03e-01
8     3.49e-02          3.59e-02          2.85e-02        2.76e-02
16    9.85e-03          9.74e-03          7.34e-03        7.01e-03
256   4.82e-05          4.16e-05          2.94e-05        2.77e-05

Table 4.2: Average distortion (300 000 samples) after applying the two split criteria on a normal distribution truncated to [−5, 5] and an exponential truncated to [0, 8]. As n → ∞, the proposed criterion introduces less distortion. Figure 4.2.4 shows the case 2^k = 16.

4.3 Discussion

The optimal quantizer of Chapter 3 is a simple discretization method that can be used as an alternative to the classic equal width or equal frequency discretizations.

For preprocessing, its performance is on average better than that of the other two methods, although similar to that of the equal frequency quantizer. For splitting a set of data recursively into 2^k disjoint sets, the proposed criterion significantly decreases the overall intra-class variance, especially when k > 4. However, the criterion assumes that the distribution of the data is known or properly estimated.

Chapter 5

A real analysis approach

The main goal of this chapter is to provide an alternative proof (Theorem 21) of the formula of the optimal quantizer of Chapter 3, restricted to Riemann-integrable functions. Unlike the original proof, which uses measure theory extensively, this one uses only basic concepts from real analysis. The optimal quantizer was not initially known to the author of this work, hence it was derived independently.

In addition, this chapter contributes:

1. A proof that the quantizer of equal frequency is optimal (Theorems 19 and 23, Section 5.1).
2. Some generalizations of the minimum distortion quantizer (Section 5.3).
3. An open problem (Section 5.4).

5.1 Formal derivation

This section contains several definitions and theorems that are required to prove the main conclusions: Theorems 19, 21, 23 and 24.

Theorem 19 is particularly complex to prove. It provides the objective function for a minimal-distortion quantizer by reducing the limit of S_n(f) as n → ∞, for a given distribution f ∈ RPDF[a,b], to a simple formula. The section is organized with respect to the dependencies of Theorem 19 as follows:

• Subsection 5.1.1 derives Theorem 10.
• Subsection 5.1.2 derives Theorem 17.
• Subsection 5.1.3 uses Theorems 10 and 17 for deriving Theorem 19.
• Subsection 5.1.4 presents theorems about the optimality of some quantizers.

5.1.1 Deriving Theorem 10

This subsection derives Theorem 10, which states the behavior as n → ∞ of a formula that is similar to distortion.

In detail, Definition 15 provides the notation used for partitions, Theorem 9 is a real analysis fact about the product of Riemann-integrable functions, and Theorem 10 handles the most significant term of Theorem 19, which is the pivotal theorem of this section.

Claim 8. Let (a_i)_{i=1}^n and (b_i)_{i=1}^n be two sequences of positive real numbers and let B ≥ a_i, b_i for all i = 1, ..., n. Then

\left| \prod_{i=1}^{n} a_i − \prod_{i=1}^{n} b_i \right| ≤ B^{n-1} \sum_{i=1}^{n} |a_i − b_i|.

Proof. Notice that

π_n := a_1 ⋯ a_n − b_1 ⋯ b_n
= a_1 ⋯ a_{n-1} a_n − b_1 ⋯ b_{n-1} a_n + b_1 ⋯ b_{n-1} a_n − b_1 ⋯ b_{n-1} b_n
= (a_1 ⋯ a_{n-1} − b_1 ⋯ b_{n-1}) a_n + b_1 ⋯ b_{n-1} (a_n − b_n)
= π_{n-1} a_n + b_1 ⋯ b_{n-1} (a_n − b_n).

Therefore, recursively we have

π_n = π_{n-1} a_n + b_1 ⋯ b_{n-1} (a_n − b_n)
= π_{n-2} a_{n-1} a_n + b_1 ⋯ b_{n-2} (a_{n-1} − b_{n-1}) a_n + b_1 ⋯ b_{n-2} b_{n-1} (a_n − b_n)
= ... (recursively) ...
= (a_1 − b_1) a_2 a_3 ⋯ a_n
+ b_1 (a_2 − b_2) a_3 ⋯ a_n
+ b_1 b_2 (a_3 − b_3) a_4 ⋯ a_n
⋮
+ b_1 b_2 ⋯ b_{n-1} (a_n − b_n).

Since 0 < a_i, b_i ≤ B, each product multiplying (a_i − b_i) is at most B^{n-1}, hence

|π_n| = \left| \prod_{i=1}^{n} a_i − \prod_{i=1}^{n} b_i \right| ≤ B^{n-1} \sum_{i=1}^{n} |a_i − b_i|.

Theorem 9. Let f_1, f_2, ..., f_K : [a, b] → ℝ^{>0} and their product φ = f_1 · f_2 ⋯ f_K be Riemann-integrable functions.

For each n = 1, 2, ..., let (x_i)_{i=0}^n be a partition of [a, b]. Let the sequence of partitions satisfy

\lim_{n→∞} \max_{i=1,...,n} \{x_i − x_{i-1}\} = 0.

For each partition (x_i)_{i=0}^n, let (t_{1,i})_{i=1}^n, (t_{2,i})_{i=1}^n, ..., (t_{K,i})_{i=1}^n be K arbitrary tag sequences with x_{i-1} ≤ t_{k,i} ≤ x_i for all i = 1, ..., n. Then

\lim_{n→∞} \sum_{i=1}^{n} \left( \prod_{k=1}^{K} f_k(t_{k,i}) \right) (x_i − x_{i-1}) = \int_a^b φ(t) dt.

Proof. Let $\varepsilon > 0$. We will show that there exists $N \in \mathbb{Z}^{\ge 1}$ such that if $n > N$, then there exist $L^*, U^* \in \mathbb{R}$ with $U^* - L^* < \varepsilon$ and
$$L^* \le L\!\left(\varphi, (x_i)_{i=1}^n\right) \le U^*,\qquad
L^* \le \sum_{i=1}^n\left(\prod_{k=1}^K f_k(t_{k,i})\right)(x_i - x_{i-1}) \le U^*,\qquad
L^* \le U\!\left(\varphi, (x_i)_{i=1}^n\right) \le U^*,$$
where $U$ and $L$ denote the upper and lower Darboux sums.

Since $f_1, f_2, \ldots, f_K$ are Riemann-integrable, they have a common upper bound, say $B > 0$. Let $\varepsilon^* := \frac{\varepsilon}{K B^{K-1}}$.

Again, since $f_1, f_2, \ldots, f_K$ are Riemann-integrable, there must exist $N \in \mathbb{Z}^{\ge 1}$ such that if $n > N$, then for every $k = 1, \ldots, K$,
$$\left|U\!\left(f_k, (x_i)_{i=1}^n\right) - L\!\left(f_k, (x_i)_{i=1}^n\right)\right| < \varepsilon^*.$$

Fix any $n > N$. Let $M_{k,i} := \sup\{f_k(x) \mid x_{i-1} \le x \le x_i\}$ and $m_{k,i} := \inf\{f_k(x) \mid x_{i-1} \le x \le x_i\}$.

Notice that
$$m_i^* := \prod_{k=1}^K m_{k,i} \ \le\ \prod_{k=1}^K f_k(t_{k,i}) \ \le\ \prod_{k=1}^K M_{k,i} =: M_i^*,$$
therefore, if we let $L^* := \sum_{i=1}^n m_i^*(x_i - x_{i-1})$ and $U^* := \sum_{i=1}^n M_i^*(x_i - x_{i-1})$, we have
$$L^* \le \sum_{i=1}^n\left(\prod_{k=1}^K f_k(t_{k,i})\right)(x_i - x_{i-1}) \le U^*.$$

Moreover, $m_i^* \le \prod_{k=1}^K f_k(y_i) \le M_i^*$ holds for any tag sequence $(y_i)_{i=1}^n$. In particular, the inequality carries over to the lower and upper Darboux sums of $\varphi$. Hence
$$L^* \le L\!\left(\varphi, (x_i)_{i=1}^n\right) \le U^*
\qquad\text{and}\qquad
L^* \le U\!\left(\varphi, (x_i)_{i=1}^n\right) \le U^*.$$

It remains to show that $U^* - L^* < \varepsilon$. Claim 8 guarantees that
$$M_i^* - m_i^* \le B^{K-1}\sum_{k=1}^K \left(M_{k,i} - m_{k,i}\right).$$
Therefore
$$\begin{aligned}
U^* - L^* &= \sum_{i=1}^n (M_i^* - m_i^*)(x_i - x_{i-1})\\
&\le B^{K-1}\sum_{i=1}^n\sum_{k=1}^K (M_{k,i} - m_{k,i})(x_i - x_{i-1})\\
&= B^{K-1}\sum_{k=1}^K\sum_{i=1}^n (M_{k,i} - m_{k,i})(x_i - x_{i-1})\\
&= B^{K-1}\sum_{k=1}^K\left(U\!\left(f_k, (x_i)_{i=1}^n\right) - L\!\left(f_k, (x_i)_{i=1}^n\right)\right)\\
&< B^{K-1}\, K\,\varepsilon^* = \varepsilon.
\end{aligned}$$
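The statement of Theorem 9 is easy to probe numerically. The sketch below is an illustrative check only (the factors and tag choices are arbitrary assumptions of mine): even when every factor is evaluated at its own tag inside each sub-interval, the sum converges to $\int_a^b \varphi$.

```python
import numpy as np

a, b = 0.0, 1.0
f1 = lambda t: 1.0 + t            # positive Riemann-integrable factors
f2 = lambda t: np.exp(-t)

exact = 2.0 - 3.0 * np.exp(-1.0)  # closed form of the integral of (1+t)e^{-t} on [0, 1]

for n in (10, 100, 1000, 10000):
    x = np.linspace(a, b, n + 1)
    rng = np.random.default_rng(n)
    # independent, arbitrary tags inside each sub-interval, one per factor
    t1 = x[:-1] + rng.uniform(size=n) * np.diff(x)
    t2 = x[:-1] + rng.uniform(size=n) * np.diff(x)
    s = float(np.sum(f1(t1) * f2(t2) * np.diff(x)))
    print(n, s, abs(s - exact))
```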

Definition 15. For two given probability density functions $f \in \mathrm{RPDF}_{[a,b]}$ and $g \in \mathrm{RPDF}^+_{[a,b]}$, we use the following notation.

$G(x) := \int_a^x g(t)\,dt$ is the (continuous) cumulative distribution function of $g$. Since $g(x) > 0$ for all $x \in [a,b]$, $G : [a,b] \to [0,1]$ is a bijection, thus $G^{-1}$ is well-defined over $[0,1]$ and is also a bijection.

For a given integer $n \ge 1$, the $n$-partition $\rho_n := (x_i)_{i=0}^n$ is given by $x_i := G^{-1}\!\left(\frac{i}{n}\right)$. Only if there are several integers $n_1, n_2$ under consideration do we use the notation $(x_i^{(n_1)})_{i=0}^{n_1}$ and $(x_i^{(n_2)})_{i=0}^{n_2}$ to resolve the ambiguity.

For any fixed positive integer $n \ge 1$ and any sub-interval $[x_{i-1}, x_i]$ with $i = 1, \ldots, n$, we use the following notation, where $X \sim f$:

Center and radius: $c_i = \frac{x_{i-1}+x_i}{2}$ and $r_i = \frac{x_i - x_{i-1}}{2}$, so that $[x_{i-1}, x_i] = [c_i - r_i,\ c_i + r_i]$.

1. Mass:
$$p_i = P(x_{i-1} \le X \le x_i) = \int_{x_{i-1}}^{x_i} f(x)\,dx.$$

2. Centroid:
$$\mu_i = E(X \mid x_{i-1} \le X \le x_i) := \begin{cases} \dfrac{1}{p_i}\displaystyle\int_{x_{i-1}}^{x_i} x\,f(x)\,dx & \text{if } p_i > 0,\\ c_i & \text{if } p_i = 0.\end{cases}$$

3. Information:
$$h_i := \begin{cases} p_i \log p_i & \text{if } p_i > 0,\\ 0 & \text{if } p_i = 0.\end{cases}$$

4. Variance:
$$\operatorname{Var}(X \mid x_{i-1} \le X \le x_i) := \begin{cases} \dfrac{1}{p_i}\displaystyle\int_{x_{i-1}}^{x_i} (x - \mu_i)^2 f(x)\,dx & \text{if } p_i > 0,\\ 0 & \text{if } p_i = 0.\end{cases}$$

Theorem 10. Assuming the notation of Definition 15, for all $f \in \mathrm{RPDF}_{[a,b]}$ and $g \in \mathrm{RPDF}^+_{[a,b]}$,
$$\lim_{n\to\infty}\ n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)\,(x - c_i)^2\,dx \;=\; \frac{1}{12}\int_a^b \frac{f(x)}{g^2(x)}\,dx.$$

Moreover, fixing for each $n \ge 1$ any tag sequence $(x_i^*)_{i=1}^n$ with $x_{i-1} \le x_i^* \le x_i$,
$$\lim_{n\to\infty}\ n^2\sum_{i=1}^n f(x_i^*)\,\frac{2 r_i^3}{3} \;=\; \frac{1}{12}\int_a^b \frac{f(x)}{g^2(x)}\,dx.$$

Proof. Let
$$I := \lim_{n\to\infty}\ n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)\,(x - c_i)^2\,dx.$$

Fix for each $n$ any tag sequence $(x_i^*)_{i=1}^n$ such that $x_{i-1} \le x_i^* \le x_i$, let $Q := G^{-1}$, and consider the following limit:
$$\begin{aligned}
\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x_i^*)(x - c_i)^2\,dx
&= \lim_{n\to\infty} n^2\sum_{i=1}^n f(x_i^*)\int_{x_{i-1}}^{x_i}(x - c_i)^2\,dx\\
&= \lim_{n\to\infty} n^2\sum_{i=1}^n f(x_i^*)\,\frac{2 r_i^3}{3}\\
&= \lim_{n\to\infty} n^2\sum_{i=1}^n f(x_i^*)\,\frac{\left(Q(\tfrac{i}{n}) - Q(\tfrac{i-1}{n})\right)^3}{12}\\
&\overset{(a)}{=} \frac{1}{12}\lim_{n\to\infty} n^2\sum_{i=1}^n f\!\left(Q(y_i^*)\right)\left(Q(\tfrac{i}{n}) - Q(\tfrac{i-1}{n})\right)^3\\
&\overset{(b)}{=} \frac{1}{12}\lim_{n\to\infty} n^2\sum_{i=1}^n f\!\left(Q(y_i^*)\right)\left(Q'(y_i^{**})\left(\tfrac{i}{n} - \tfrac{i-1}{n}\right)\right)^3\\
&= \frac{1}{12}\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n f\!\left(Q(y_i^*)\right)\,Q'(y_i^{**})^3\\
&\overset{(c)}{=} \frac{1}{12}\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n f\!\left(Q(y_i^*)\right)\frac{1}{g^3\!\left(Q(y_i^{**})\right)}\\
&\overset{(d)}{=} \frac{1}{12}\int_0^1 \frac{f\!\left(G^{-1}(y)\right)}{g^3\!\left(G^{-1}(y)\right)}\,dy\\
&= \frac{1}{12}\int_a^b \frac{f(x)}{g^3(x)}\,g(x)\,dx
= \frac{1}{12}\int_a^b \frac{f(x)}{g^2(x)}\,dx.
\end{aligned}$$

(a) For some $y_i^* \in \left[\tfrac{i-1}{n}, \tfrac{i}{n}\right]$ with $Q(y_i^*) = x_i^*$.
(b) For some $y_i^{**} \in \left[\tfrac{i-1}{n}, \tfrac{i}{n}\right]$, because of the mean value theorem.
(c) Because of the inverse function theorem.
(d) Because of Theorem 9.

If $f$ is the constant function $\frac{1}{b-a}$, then $f(x) = f(x_i^*)$ on every sub-interval and hence
$$I = \lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)(x - c_i)^2\,dx
= \lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x_i^*)(x - c_i)^2\,dx
= \frac{1}{12}\int_a^b \frac{f(x)}{g^2(x)}\,dx.$$

If $f$ is not constant, since the tags were chosen arbitrarily, we may fix $(x_i^*)_{i=1}^n$ so as to approach the upper and lower Darboux sums as closely as desired. Indeed, let $\varepsilon > 0$, let
$$\varepsilon^* := \frac{\varepsilon}{\frac{1}{12}\int_a^b \frac{1}{g^2(x)}\,dx},$$
and for each $i = 1, \ldots, n$ let $m_i, M_i \in [x_{i-1}, x_i]$ be points such that
$$f(m_i) - \varepsilon^* < \inf_{x\in[x_{i-1},x_i]} f(x)
\qquad\text{and}\qquad
\sup_{x\in[x_{i-1},x_i]} f(x) < f(M_i) + \varepsilon^*.$$

Notice that
$$\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}\varepsilon^*(x-c_i)^2\,dx
= (b-a)\,\varepsilon^*\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}\frac{(x-c_i)^2}{b-a}\,dx
= (b-a)\,\varepsilon^*\,\frac{1}{12}\int_a^b\frac{1}{b-a}\,\frac{1}{g^2(x)}\,dx
= \varepsilon.$$

Hence
$$\begin{aligned}
\frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx - \varepsilon
&= \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(m_i)(x-c_i)^2\,dx\right) - \varepsilon\\
&\le \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} \left(f(x)+\varepsilon^*\right)(x-c_i)^2\,dx\right) - \varepsilon\\
&= \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)(x-c_i)^2\,dx\right)
 + \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} \varepsilon^*(x-c_i)^2\,dx\right) - \varepsilon\\
&= I\\
&\le \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)(x-c_i)^2\,dx\right)
 - \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} \varepsilon^*(x-c_i)^2\,dx\right) + \varepsilon\\
&= \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} \left(f(x)-\varepsilon^*\right)(x-c_i)^2\,dx\right) + \varepsilon\\
&\le \left(\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(M_i)(x-c_i)^2\,dx\right) + \varepsilon\\
&= \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx + \varepsilon.
\end{aligned}$$

Since $\varepsilon > 0$ is arbitrarily small, necessarily $I = \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx$.

5.1.2 Deriving Theorem 17

The proof of Theorem 17 consists, firstly, of considering exclusively positive locally Lipschitz continuous functions, for which the statement is easier to prove, and, secondly, of generalizing to arbitrary distributions by taking advantage of the density of the Lipschitz continuous distributions in the space of arbitrary distributions. Because of this, several definitions and theorems in this section are related to Lipschitz continuity.

Definition 16. A function $f : X \subseteq \mathbb{R} \to \mathbb{R}$ is said to be locally Lipschitz continuous if for all $x \in X$ there exists $B_x > 0$ such that $|f(x) - f(y)| < B_x\,|x - y|$ for all $y \in X$. If, in addition, $B_x$ can be chosen constant for all $x \in X$, then $f$ is said to be Lipschitz continuous.

In addition, LipLoc[a,b] and Lip[a,b] denote the sets of all functions with domain [a, b] that are locally Lipschitz continuous, and Lipschitz continuous respectively.

Definition 17. Let $U$ be the set of functions with domain $[a,b] \subseteq \mathbb{R}$. A subset $S \subseteq U$ is said to be uniformly dense in $U$ if for any $f \in U$ and any $\varepsilon > 0$ there exists $g \in S$ such that $|g(x) - f(x)| < \varepsilon$ for all $x \in [a,b]$.1

Theorem 11. The set CRPDF[a,b] ∩ LipLoc[a,b] is uniformly dense in

CRPDF[a,b].

1For a more general definition in metric spaces, see (VV Tkachuk. “Properties of function spaces reflected by uniformly dense subspaces”. In: Topology and its Applications 132.2 [2003], pp. 183–193).

Proof. The core of this theorem lies in the fact that $\mathrm{LipLoc}_{[a,b]}$ is uniformly dense in the set $C_{[a,b]}$ of all continuous functions with domain $[a,b]$. This was already proven.2 3

2 M Isabel Garrido and Jesus A Jaramillo. "Homomorphisms on function lattices". In: Monatshefte für Mathematik 141.2 (2004), pp. 127–146.
3 Corollary 2.8. It applies even for arbitrary metric spaces and arbitrary subsets.

Let $f \in \mathrm{CRPDF}_{[a,b]}$ and $\varepsilon > 0$. Since $f$ is positive and continuous over a compact domain, the extreme value theorem implies that $f$ attains its (positive) maximum and minimum.

Let $M > \max\{f(x) \mid a \le x \le b\}$, let $0 < m < \min\left\{\min_{a\le x\le b} f(x),\ \frac{\varepsilon}{3M},\ \frac{1}{2}\right\}$, and let $\varepsilon^* := \min\left\{\frac{\varepsilon}{6},\ \frac{m}{2},\ \frac{m}{b-a}\right\}$.

Since $\mathrm{LipLoc}_{[a,b]}$ is uniformly dense in $C_{[a,b]}$, there must exist a locally Lipschitz continuous function $g$ with $|f(x) - g(x)| < \varepsilon^*$ for all $x \in [a,b]$. The function $g$ attains its minimum, and this minimum is positive because
$$\min\{g(x) \mid a \le x \le b\} \ \ge\ \min\{f(x) \mid a \le x \le b\} - \varepsilon^* \ \ge\ m - \varepsilon^* \ \ge\ \frac{m}{2} \ >\ 0.$$

Therefore, we may define $g^* \in \mathrm{CRPDF}_{[a,b]}$ as $g^*(x) := \frac{g(x)}{\int_a^b g(t)\,dt}$.

Notice that $\int_a^b g(x)\,dx > 1 - m$ because
$$\int_a^b g(x)\,dx > \int_a^b \left(f(x) - \varepsilon^*\right)dx = 1 - \varepsilon^*(b-a) > 1 - m.$$

This implies, for all $x \in [a,b]$, that
$$\begin{aligned}
|f(x) - g^*(x)| &< \left|f(x) - \frac{g(x)}{1-m}\right|\\
&= \left|\frac{f(x) - g(x)}{1-m} - \frac{m\,f(x)}{1-m}\right|\\
&\le \frac{|f(x) - g(x)|}{1-m} + \frac{m\,f(x)}{1-m}\\
&< \frac{\varepsilon^*}{1-m} + \frac{m\,M}{1-m}\\
&< \frac{\varepsilon^*}{1-m} + \frac{\varepsilon}{3(1-m)}\\
&\le 2\varepsilon^* + \frac{2}{3}\varepsilon\\
&\le \frac{1}{3}\varepsilon + \frac{2}{3}\varepsilon = \varepsilon.
\end{aligned}$$

That is, there exists a function $g^* \in \mathrm{CRPDF}_{[a,b]} \cap \mathrm{LipLoc}_{[a,b]}$ with $|f(x) - g^*(x)| < \varepsilon$ for all $x \in [a,b]$.

Theorem 12.

Lip[a,b] = LipLoc[a,b].

Proof. By definition, Lip[a,b] ⊆ LipLoc[a,b] holds. It remains to show

LipLoc[a,b] ⊆ Lip[a,b].

Towards a contradiction, suppose $f \in \mathrm{LipLoc}_{[a,b]} \setminus \mathrm{Lip}_{[a,b]}$. Since $f$ is not Lipschitz continuous, there exist two sequences $(x_i)_{i=1}^\infty$ and $(y_i)_{i=1}^\infty$ in $[a,b]$ such that $x_i \ne y_i$ for all $i = 1, 2, \ldots$ and
$$\frac{|f(x_n) - f(y_n)|}{|x_n - y_n|} \to \infty.$$

Since $f$ is continuous on a compact set, it is bounded, say by $M > 0$. Given that $|f(x_n) - f(y_n)| \le 2M$, it must be the case that $|x_n - y_n| \to 0$.

Since $(x_i)_{i=1}^\infty$ is a sequence in a compact set, it must have a convergent subsequence, say $(x_{n_i})_{i=1}^\infty$ with $x_{n_i} \to x^*$, and we also have
$$\frac{|f(x_{n_i}) - f(y_{n_i})|}{|x_{n_i} - y_{n_i}|} \to \infty,$$
because any subsequence of a sequence that diverges to infinity also diverges to infinity.

The contradiction follows from the fact that $f$ cannot be locally Lipschitz continuous at $x^*$: for any $k > 0$ there exists $i \in \mathbb{Z}^{\ge 1}$ such that $\frac{|f(x_{n_i}) - f(y_{n_i})|}{|x_{n_i} - y_{n_i}|} > k$, or equivalently, $|f(x_{n_i}) - f(y_{n_i})| > k\,|x_{n_i} - y_{n_i}|$.

Theorem 13. Let $f \in \mathrm{CRPDF}^+_{[a,b]}$, $x \in [a,b]$, $X \sim f$, and
$$h_r(x) := \frac{1}{r}\,E\!\left[X - x \,\middle|\, |X - x| < r\right].$$

If there exists $r^* > 0$ such that $\inf_{x - r^* \le t \le x + r^*} f(t) > 0$, then the following functional limit converges uniformly:
$$\lim_{r\to 0^+} h_r = 0.$$

Proof. The proof consists of two claims. Firstly, that the statement holds for Lipschitz continuous functions, and secondly, that it holds in general.

Assume $f \in \mathrm{CRPDF}^+_{[a,b]} \cap \mathrm{Lip}_{[a,b]}$. Let $\varepsilon \in (0,1)$, let $k$ be the Lipschitz constant of $f$, and let $0 < m < \min\{f(x) \mid a \le x \le b\}$. For all $r < \frac{\varepsilon m}{2k}$ we have
$$\begin{aligned}
\frac{1}{r}\,E\!\left[X - x \mid |X - x| < r\right]
&= \frac{1}{r}\,\frac{\int_{-r}^{r} t\,f(x+t)\,dt}{\int_{-r}^{r} f(x+t)\,dt}\\
&\le \frac{1}{r}\,\frac{\int_{-r}^{r} \left(t\,f(x) + k t^2\right)dt}{\int_{-r}^{r} \left(f(x) - k|t|\right)dt}
= \frac{f(x)\int_{-r}^{r} t\,dt + k\int_{-r}^{r} t^2\,dt}{r\,f(x)\int_{-r}^{r} dt - r\,k\int_{-r}^{r} |t|\,dt}\\
&= \frac{0 + \frac{2}{3}k r^3}{2 r^2 f(x) - k r^3}
= \frac{2}{3}\cdot\frac{k r}{2 f(x) - k r}
< \frac{2}{3}\cdot\frac{k r}{m - k r}\\
&< \frac{2}{3}\cdot\frac{\frac{\varepsilon m}{2}}{m - \frac{\varepsilon m}{2}}
= \frac{2}{3}\cdot\frac{\varepsilon}{2 - \varepsilon}
< \varepsilon.
\end{aligned}$$

Thus $\lim_{r\to 0^+} h_r = 0$ whenever $f$ is Lipschitz continuous or, equivalently by Theorem 12, whenever $f$ is locally Lipschitz continuous.

Now assume only that $f \in \mathrm{CRPDF}^+_{[a,b]}$. Let $X \sim f$, let $\varepsilon > 0$ and let $0 < m < \min\{f(x) \mid a \le x \le b\}$.

Theorem 11 guarantees the existence of a function $g \in \mathrm{CRPDF}^+_{[a,b]} \cap \mathrm{LipLoc}_{[a,b]}$ such that $|f(x) - g(x)| < \varepsilon^* := \min\left\{\frac{\varepsilon}{4}, \frac{m}{4}\right\}$ for all $x \in [a,b]$. Moreover, because of the first case of this proof, there exists $R > 0$ such that $\frac{1}{r}\,E\!\left[Y - x \mid |Y - x| < r\right] < \varepsilon^*$ for any $r \in (0, R)$, where $Y \sim g$. Let
$$r < \min\left\{R,\ \frac{\sqrt{\varepsilon}}{2\varepsilon^*}\right\}.$$

Notice that $\min\{g(x) \mid a \le x \le b\} - \varepsilon^* \ \ge\ \min\{f(x) \mid a \le x \le b\} - 2\varepsilon^* \ >\ m - 2\varepsilon^* \ \ge\ \frac{m}{2} \ >\ 0$, and recall the following algebraic inequality4 for positive constants $c_1, c_2, c_3$ with $c_3 < c_2$:
$$\frac{c_1}{c_2 - c_3} < \frac{c_1}{c_2}\,(1 + c_3).$$

Using these two facts, we derive

$$\begin{aligned}
\frac{1}{r}\,E\!\left[X - x \mid |X - x| < r\right]
&= \frac{1}{r}\,\frac{\int_{-r}^{r} t\,f(x+t)\,dt}{\int_{-r}^{r} f(x+t)\,dt}\\
&\le \frac{1}{r}\,\frac{\int_{-r}^{r} t\left(g(x+t) + \varepsilon^*\right)dt}{\int_{-r}^{r} \left(g(x+t) - \varepsilon^*\right)dt}\\
&= \frac{\int_{-r}^{r} t\,g(x+t)\,dt + 0}{r\int_{-r}^{r} g(x+t)\,dt - 2\varepsilon^* r^2}\\
&< \frac{\int_{-r}^{r} t\,g(x+t)\,dt}{r\int_{-r}^{r} g(x+t)\,dt}\,\left(1 + 2\varepsilon^* r^2\right)\\
&< \varepsilon^*\left(1 + 2\varepsilon^* r^2\right)\\
&< \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{aligned}$$

4 The inequality follows directly from $c_1 c_2 < c_1 c_2 + c_1 c_2 c_3$.

Claim 14. Let $a, b, c, d > 0$. Then $|ab - cd| \le a\,|d - b| + b\,|c - a| + |d - b|\,|c - a|$.

Proof.
$$\begin{aligned}
|ab - cd| &= \left|ab - \left(a + (c - a)\right)\left(b + (d - b)\right)\right|\\
&= \left|ab - \left(ab + a(d - b) + b(c - a) + (c - a)(d - b)\right)\right|\\
&\le a\,|d - b| + b\,|c - a| + |d - b|\,|c - a|.
\end{aligned}$$

Theorem 15. Assuming the notation of Definition 15, for all $f \in \mathrm{CRPDF}^+_{[a,b]}$ and $g \in \mathrm{RPDF}^+_{[a,b]}$,
$$\lim_{n\to\infty}\ n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)\,(\mu_i - c_i)^2\,dx = 0.$$

Proof. We proceed using Theorem 13 and its definition of the function $h_r$. Let $\varepsilon > 0$. Theorem 13 guarantees the existence of $R > 0$ such that $|h_r(x)| < \varepsilon^*$ for all $x \in [a,b]$ and $r \in (0, R)$, where $\varepsilon^* := \sqrt{\frac{\varepsilon}{3I}}$ and $I := \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx$.

Since $G$ is a continuous bijection, the width of the largest sub-interval of the partition can be made arbitrarily small by increasing $n$. In particular, there exists $N \in \mathbb{Z}^{+}$ such that $\max_{i=1,\ldots,n} 2r_i < R$ for all $n > N$; in particular $r_i < R$ for all $i = 1, \ldots, n$, and
$$\begin{aligned}
\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}(\mu_i - c_i)^2 f(x)\,dx
&= \lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}\left(r_i\,h_{r_i}(c_i)\right)^2 f(x)\,dx\\
&< \lim_{n\to\infty} n^2\sum_{i=1}^n r_i^2\,\varepsilon^{*2}\int_{x_{i-1}}^{x_i} f(x)\,dx\\
&\overset{(a)}{=} \lim_{n\to\infty} n^2\sum_{i=1}^n r_i^2\,\varepsilon^{*2}\, f(x_i^*)\,|x_i - x_{i-1}|\\
&\le \lim_{n\to\infty} \varepsilon^{*2}\, n^2\sum_{i=1}^n f(x_i^*)\,2 r_i^3\\
&= 3\,\varepsilon^{*2}\lim_{n\to\infty} n^2\sum_{i=1}^n f(x_i^*)\,\frac{2 r_i^3}{3}\\
&= 3\,\varepsilon^{*2} I \le \varepsilon.
\end{aligned}$$

(a) For some $x_i^* \in [x_{i-1}, x_i]$, because of the mean value theorem for definite integrals.

Since $\varepsilon > 0$ is arbitrarily small, the limit is necessarily zero.

5.1.3 Deriving Theorem 19

This subsection uses Theorems 10 and 17 to derive Theorem 19, which provides the formula of the limit distortion of a given quantizer.

Theorem 16. $\mathrm{CRPDF}^+_{[a,b]}$ is uniformly dense in $\mathrm{CRPDF}_{[a,b]}$.

Proof. Let $f \in \mathrm{CRPDF}_{[a,b]}$ and $\varepsilon > 0$. Let $f^* \in \mathrm{CRPDF}^+_{[a,b]}$ be given by
$$f^*(x) = \frac{f(x) + \varepsilon^*}{1 + \varepsilon^*(b-a)},$$
where $\varepsilon^* := \varepsilon / \left(1 + (b-a)\sup_{x\in[a,b]} f(x)\right) > 0$. Then
$$|f(x) - f^*(x)| = \frac{\varepsilon^*\left|f(x)(b-a) - 1\right|}{1 + \varepsilon^*(b-a)} \le \varepsilon^*\left|f(x)(b-a) - 1\right| \le \varepsilon.$$

That is, $\|f - f^*\|_\infty := \sup_{x\in[a,b]}|f(x) - f^*(x)| \le \varepsilon$, thus $\mathrm{CRPDF}^+_{[a,b]}$ is uniformly dense in $\mathrm{CRPDF}_{[a,b]}$.

Theorem 17. Assuming the notation of Definition 15, for all $f \in \mathrm{CRPDF}_{[a,b]}$ and $g \in \mathrm{RPDF}^+_{[a,b]}$,
$$\lim_{n\to\infty}\ n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)\,(\mu_i - c_i)^2\,dx = 0.$$

Proof. Consider an arbitrary continuous function $f^*$ with domain $[a,b]$, denote by $\mu_i^*$ the interval centroids over $f^*$ induced by the partition $(x_i)_{i=0}^n$, and write $h^*_r$ for the function of Theorem 13 taken with respect to $f^*$.

Fix $n \in \mathbb{Z}^{+}$ and notice that for each $i = 1, \ldots, n$ we have
$$\begin{aligned}
\left|(\mu_i^* - c_i)^2 - (\mu_i - c_i)^2\right|
&= \left|\mu_i^* - \mu_i\right|\,\left|2 c_i - \mu_i - \mu_i^*\right|\\
&= \left|r_i\,h^*_{r_i}(\mu_i)\right|\,\left|2 c_i - \mu_i - \mu_i^*\right|\\
&\le 2 r_i^2\,\left|h^*_{r_i}(\mu_i)\right|.
\end{aligned}$$

This, the fact that $|\mu_i^* - c_i| \le 2 r_i$, and the inequality of Claim 14 imply that
$$\begin{aligned}
\Delta_i &:= \left|(\mu_i^* - c_i)^2\int_{x_{i-1}}^{x_i} f^*(x)\,dx - (\mu_i - c_i)^2\int_{x_{i-1}}^{x_i} f(x)\,dx\right|\\
&\le (\mu_i^* - c_i)^2\int_{x_{i-1}}^{x_i}\left|f(x) - f^*(x)\right|dx
 + 2 r_i^2\,\left|h^*_{r_i}(\mu_i)\right|\int_{x_{i-1}}^{x_i} f^*(x)\,dx
 + 2 r_i^2\,\left|h^*_{r_i}(\mu_i)\right|\int_{x_{i-1}}^{x_i}\left|f(x) - f^*(x)\right|dx\\
&\le \int_{x_{i-1}}^{x_i}\left|f(x) - f^*(x)\right|dx\,\left(4 r_i^2 + 2 r_i^2\,\left|h^*_{r_i}(\mu_i)\right|\right)
 + 4 r_i^3\, f^*(x^*)\,\left|h^*_{r_i}(\mu_i)\right|
\end{aligned}$$
for some $x^* \in [x_{i-1}, x_i]$.

Now, let $f \in \mathrm{CRPDF}_{[a,b]}$, let $\varepsilon \in (0,1)$ and let $J := \frac{1}{12}\int_a^b\frac{1}{b-a}\,\frac{1}{g^2(x)}\,dx$.

From Theorem 16, since $\mathrm{CRPDF}^+_{[a,b]}$ is uniformly dense in $\mathrm{CRPDF}_{[a,b]}$, there is $f^* \in \mathrm{CRPDF}^+_{[a,b]}$ such that
$$\|f - f^*\|_\infty < \varepsilon_J := \frac{\varepsilon}{36(b-a)(J+1) + 2}.$$

From Theorem 10 there must exist $N_1 \ge 1$ such that for all $n \ge N_1$,
$$\left|J - n^2\sum_{i=1}^n \frac{2 r_i^3}{3}\,\frac{1}{b-a}\right| < 1
\qquad\text{and}\qquad
\left|I^* - n^2\sum_{i=1}^n \frac{2 r_i^3}{3}\,f^*(x_i^*)\right| < 1,$$
where $I^* := \frac{1}{12}\int_a^b\frac{f^*(x)}{g^2(x)}\,dx$ and $(x_i^*)_{i=1}^n$ is any tag sequence for each $n$.

Since $h^*_r \to 0$ uniformly as $r \to 0$, there exists $R > 0$ such that for all $r \in (0, R]$ it holds that
$$\|h^*_r\|_\infty < \delta := \min\left\{\frac{\varepsilon_J}{6(I^*+1)},\ 1\right\}.$$

Since $G^{-1}$ is a continuous bijection, there must exist $N_2 \ge 1$ such that for all $n \ge N_2$,
$$\sup_{i=1,\ldots,n}\left(x_i - x_{i-1}\right) = 2\sup_{i=1,\ldots,n} r_i < R.$$

From Theorem 15, since $f^*$ is continuous and positive, it holds that
$$\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}(\mu_i^* - c_i)^2 f^*(x)\,dx = 0,$$
thus there exists $N_3 \ge 1$ such that for any $n \ge N_3$,
$$n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}(\mu_i^* - c_i)^2 f^*(x)\,dx < \frac{\varepsilon}{2}.$$

Then, for all $n \ge \max\{N_1, N_2, N_3\}$,
$$\begin{aligned}
n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}(\mu_i - c_i)^2 f(x)\,dx
&\le n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}(\mu_i^* - c_i)^2 f^*(x)\,dx + n^2\sum_{i=1}^n \Delta_i\\
&< \frac{\varepsilon}{2} + n^2\sum_{i=1}^n\left(8 r_i^3\,\varepsilon_J + 4 r_i^3\, f^*(x^*)\left|h^*_{r_i}(\mu_i)\right| + 4 r_i^3\,\varepsilon_J\left|h^*_{r_i}(\mu_i)\right|\right)\\
&\le \frac{\varepsilon}{2} + \varepsilon_J\,(8 + 4\delta)\left(n^2\sum_{i=1}^n r_i^3\right) + 4\delta\left(n^2\sum_{i=1}^n r_i^3\, f^*(x_i^*)\right)\\
&\le \frac{\varepsilon}{2} + \varepsilon_J\,(12 + 6\delta)\,(b-a)(J+1) + 6\delta\,(I^*+1)\\
&\le \frac{\varepsilon}{2} + \varepsilon_J\left(18(b-a)(J+1) + 1\right)\\
&= \varepsilon.
\end{aligned}$$

This implies
$$\lim_{n\to\infty}\ n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i}(\mu_i - c_i)^2 f(x)\,dx = 0.$$

Fact 18. For any random variable $X$ with finite variance, mean $\mu := E[X]$, and any $c \in \mathbb{R}$,
$$\operatorname{Var}[X] = \operatorname{Var}[X - c] = E\!\left[(X - c)^2\right] - (\mu - c)^2.$$

Theorem 19. Assuming the notation of Definition 15, for all $f \in \mathrm{RPDF}_{[a,b]}$ and $g \in \mathrm{RPDF}^+_{[a,b]}$,
$$\lim_{n\to\infty} S(X_n) = \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx.$$

Proof. Based on Fact 18 (applied conditionally on each sub-interval), Theorem 10 and Theorem 17, we conclude that
$$\begin{aligned}
\lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)\,(x - \mu_i)^2\,dx
&= \lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} f(x)\,(x - c_i)^2\,dx
 - \lim_{n\to\infty} n^2\sum_{i=1}^n\int_{x_{i-1}}^{x_i} (\mu_i - c_i)^2\, f(x)\,dx\\
&= \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx - 0.
\end{aligned}$$
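Assuming, as in Chapter 2, that $S(X_n)$ is the distortion of the $n$-th discretization scaled by $n^2$, Theorem 19 can be checked numerically. The sketch below is my own toy example (with $f(x) = 2x$ and $g(x) = (2+2x)/3$ on $[0,1]$, both arbitrary choices): it compares the scaled distortion of the $g$-induced partition with the limit $\frac{1}{12}\int_a^b f/g^2$.

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

f = lambda x: 2.0 * x                          # density on [0, 1]
g = lambda x: (2.0 + 2.0 * x) / 3.0            # positive density on [0, 1]
Ginv = lambda u: -1.0 + np.sqrt(1.0 + 3.0 * u) # inverse CDF of g

xx = np.linspace(0.0, 1.0, 200001)
limit = trapz(f(xx) / g(xx) ** 2, xx) / 12.0

def scaled_distortion(n):
    cuts = Ginv(np.linspace(0.0, 1.0, n + 1))
    total = 0.0
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        t = np.linspace(lo, hi, 2001)
        p = trapz(f(t), t)
        mu = trapz(t * f(t), t) / p if p > 0 else 0.5 * (lo + hi)
        total += trapz(f(t) * (t - mu) ** 2, t)
    return n ** 2 * total

for n in (4, 16, 64, 256):
    print(n, scaled_distortion(n), "->", limit)
```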

5.1.4 Optimality theorems

This subsection provides the formula for two optimal quantizers, and an example that shows that the recursive thresholding method (Definition 13) does not guarantee minimal distortion as k → ∞.

In detail, Theorem 21 (along with Theorem 19) provides the argument and the formula for constructing a minimum-distortion quantizer for a given distribution, while Theorems 22 and 23 provide the argument and the formula for constructing a minimum-redundancy quantizer for a given distribution.

Theorem 20. Endow $\mathrm{RPDF}_{[a,b]}$ with the metric $d(f_1, f_2) := \sup_{x\in[a,b]}|f_1(x) - f_2(x)|$. Then the following operator $F$ is uniformly continuous with respect to $f \in \mathrm{RPDF}_{[a,b]}$ and continuous with respect to $g \in \mathrm{RPDF}^+_{[a,b]}$:
$$F(f, g) := \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx.$$

Moreover, for any $\alpha \in [0,1]$ and any $f_1, f_2 \in \mathrm{RPDF}_{[a,b]}$ it holds that $F(\alpha f_1 + (1-\alpha) f_2,\, g) = \alpha\,F(f_1, g) + (1-\alpha)\,F(f_2, g)$.

Proof. Given $\varepsilon > 0$ and $g \in \mathrm{RPDF}^+_{[a,b]}$, let $\delta := \frac{12\,\varepsilon}{\int_a^b g^{-2}(x)\,dx}$. For any $f_1, f_2 \in \mathrm{RPDF}_{[a,b]}$ with $d(f_1, f_2) < \delta$, it holds that
$$\left|\frac{1}{12}\int_a^b\frac{f_1(x)}{g^2(x)}\,dx - \frac{1}{12}\int_a^b\frac{f_2(x)}{g^2(x)}\,dx\right|
\le \frac{1}{12}\int_a^b\frac{|f_1(x) - f_2(x)|}{g^2(x)}\,dx
\le \frac{1}{12}\int_a^b\frac{\delta}{g^2(x)}\,dx
= \varepsilon.$$

Similarly, given $\varepsilon > 0$ and $f \in \mathrm{RPDF}_{[a,b]}$, let $g_1 \in \mathrm{RPDF}^+_{[a,b]}$ and
$$\delta := \min\left\{1,\ \frac{12\,\varepsilon}{\int_a^b f(x)\,\frac{2 g_1(x) + 1}{g_1^4(x)}\,dx}\right\}.$$
For any $g_2 \in \mathrm{RPDF}^+_{[a,b]}$ with $d(g_1, g_2) < \delta$, it holds that
$$\begin{aligned}
\left|\frac{1}{12}\int_a^b\frac{f(x)}{g_1^2(x)}\,dx - \frac{1}{12}\int_a^b\frac{f(x)}{g_2^2(x)}\,dx\right|
&\le \frac{1}{12}\int_a^b f(x)\,\frac{\left|g_1^2(x) - g_2^2(x)\right|}{g_1^2(x)\,g_2^2(x)}\,dx\\
&\le \frac{1}{12}\int_a^b f(x)\,\delta\,\frac{g_1(x) + g_2(x)}{g_1^2(x)\,g_2^2(x)}\,dx\\
&\le \frac{\delta}{12}\int_a^b f(x)\,\frac{2 g_1(x) + \delta}{g_1^2(x)\left(g_1(x) + \delta\right)^2}\,dx\\
&\le \frac{\delta}{12}\int_a^b f(x)\,\frac{2 g_1(x) + 1}{g_1^4(x)}\,dx
\le \varepsilon.
\end{aligned}$$

The second part of the theorem is trivial.

Theorem 21. Let $f \in \mathrm{RPDF}_{[a,b]}$. Consider the functional $F(g) = \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx$ over the set $\mathrm{CRPDF}_{[a,b]}$.

There is a unique function $\hat f \in \mathrm{RPDF}_{[a,b]}$ that minimizes $F$, given for all $x \in [a,b]$ by $\hat f(x) = c\,\sqrt[3]{f(x)}$, where $c = \frac{1}{\int_a^b\sqrt[3]{f(t)}\,dt}$.

Proof. Consider the function $H(x, G, g) := \frac{1}{12}\,\frac{f(x)}{g^2(x)}$. According to the Euler-Lagrange equation, for $g$ to be an extremum of the integral $\int_a^b H(x, G, g)\,dx$, it must hold that
$$0 = \frac{\partial H}{\partial G} - \frac{d}{dx}\frac{\partial H}{\partial g}
= 0 - \frac{d}{dx}\left(\frac{-f(x)}{6\,g^3(x)}\right)
= \frac{1}{6}\,\frac{d}{dx}\,\frac{f(x)}{g^3(x)}.$$

This implies that $\frac{f(x)}{g^3(x)}$ is constant and, given that $f, g \in \mathrm{RPDF}_{[a,b]}$, that $g(x) = c\,\sqrt[3]{f(x)}$ for all $x \in [a,b]$, where $c = \frac{1}{\int_a^b\sqrt[3]{f(t)}\,dt}$.
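Theorem 21 can also be illustrated numerically: random density perturbations of $\hat f \propto f^{1/3}$ never decrease the functional. The sketch below is my own illustration (the test density and the perturbation family are arbitrary choices, not part of the thesis).

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

x = np.linspace(0.0, 1.0, 100001)
f = 2.0 * x + 0.1                 # positive (unnormalized) density on [0, 1]
f /= trapz(f, x)

F = lambda g: trapz(f / g ** 2, x) / 12.0

g_hat = np.cbrt(f)
g_hat /= trapz(g_hat, x)          # candidate minimizer f^(1/3) / C
print("F(f^(1/3)/C) =", F(g_hat))

rng = np.random.default_rng(1)
for _ in range(5):
    # random perturbation that stays positive and integrates to one
    bump = 1.0 + 0.2 * np.sin(rng.uniform(1, 6) * np.pi * x + rng.uniform(0, np.pi))
    g = g_hat * bump
    g /= trapz(g, x)
    print("F(perturbed g) =", F(g), ">=", F(g_hat))
```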

Theorem 22. Let $f, g \in \mathrm{CRPDF}_{[a,b]}$, with cumulative distribution functions $F$ and $G$ respectively. For each $n = 1, 2, \ldots$, let $X_n$ be a discrete random variable whose $n$ outcomes have probabilities $p_i := \int_{x_{i-1}}^{x_i} f(x)\,dx$, where $x_i := G^{-1}\!\left(\frac{i}{n}\right)$.

Then
$$\lim_{n\to\infty}\left(\log(n) - h(X_n)\right) = \int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx \ \ge\ 0,$$
where $h(X_n)$ is the (discrete) entropy of $X_n$.

Moreover, $\int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx = -h(Z)$, where $h(Z)$ is the differential entropy of the continuous random variable $Z \in [0,1]$ whose cumulative distribution function is given by $F \circ G^{-1}$.

Proof. Let $R := F \circ G^{-1} : [0,1] \to [0,1]$ and $r := R'$. Notice that, since
$$f(x) = \frac{d}{dx}F(x) = \frac{d}{dx}R(G(x)) = r(G(x))\,g(x),$$
then
$$p_i = \int_{x_{i-1}}^{x_i} f(x)\,dx = \int_{x_{i-1}}^{x_i} r(G(x))\,g(x)\,dx = \int_{\frac{i-1}{n}}^{\frac{i}{n}} r(y)\,dy.$$

Since $r(y) = \frac{f(G^{-1}(y))}{g(G^{-1}(y))}$ for all $y \in [0,1]$, $r$ is continuous. Fixing any $n \ge 1$, the mean value theorem for integrals implies that for all $i = 1, \ldots, n$ there exists $y_i^*$ with $\frac{i-1}{n} \le y_i^* \le \frac{i}{n}$ and $p_i = \int_{\frac{i-1}{n}}^{\frac{i}{n}} r(y)\,dy = \frac{r(y_i^*)}{n}$. Therefore
$$\begin{aligned}
\lim_{n\to\infty}\left(\log(n) - h(X_n)\right)
&= \lim_{n\to\infty}\left(\log(n) + \sum_{i=1}^n p_i\log p_i\right)\\
&= \lim_{n\to\infty}\left(\log(n) + \sum_{i=1}^n p_i\log\frac{r(y_i^*)}{n}\right)\\
&= \lim_{n\to\infty}\left(\log(n) + \sum_{i=1}^n p_i\left(\log r(y_i^*) - \log n\right)\right)\\
&= \lim_{n\to\infty}\log(n)\left(1 - \sum_{i=1}^n p_i\right) + \lim_{n\to\infty}\sum_{i=1}^n p_i\log r(y_i^*)\\
&= 0 + \lim_{n\to\infty}\sum_{i=1}^n \frac{r(y_i^*)}{n}\log r(y_i^*)\\
&= \int_0^1 r(y)\log r(y)\,dy\\
&= \int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx.
\end{aligned}$$
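Theorem 22 lends itself to a quick numerical illustration. In the sketch below (my own toy choice of $f$ and $g$ on $[0,1]$, entropies in nats), $\log n - h(X_n)$ approaches $\int_a^b f\log(f/g)$ as $n$ grows.

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

f = lambda x: (1.0 + 2.0 * x) / 2.0            # positive density on [0, 1]
g = lambda x: (2.0 + 2.0 * x) / 3.0            # positive density on [0, 1]
Ginv = lambda u: -1.0 + np.sqrt(1.0 + 3.0 * u) # inverse CDF of g

xx = np.linspace(0.0, 1.0, 200001)
kl = trapz(f(xx) * np.log(f(xx) / g(xx)), xx)  # integral of f log(f/g), natural log

for n in (4, 16, 64, 256, 1024):
    cuts = Ginv(np.linspace(0.0, 1.0, n + 1))
    p = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        t = np.linspace(lo, hi, 501)
        p.append(trapz(f(t), t))
    p = np.array(p)
    p /= p.sum()                               # guard against rounding
    h = float(-np.sum(p * np.log(p)))
    print(n, np.log(n) - h, "->", kl)
```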

Theorem 23. Let $f \in \mathrm{CRPDF}_{[a,b]}$. Consider the functional
$$F(g) = \int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx$$
over the set $\mathrm{CRPDF}_{[a,b]}$, where $f(x)\log\frac{f(x)}{g(x)}$ is taken to be $0$ whenever $f(x) = 0$.

There is a unique function g ∈ CRPDF[a,b] that minimizes F(g) and is given for all x ∈ [a, b] by g(x) = f(x).

Proof. Let $R := F \circ G^{-1}$, $r := R'$ and $Z \sim r$. Since $F$ and $G$ are monotonically increasing, so is $R$, thus $r \in \mathrm{CRPDF}_{[0,1]}$.

Notice that if we let $q(y) := 1$ for $0 \le y \le 1$, we have
$$\int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx = -h(Z) = \int_0^1 r(y)\log\frac{r(y)}{q(y)}\,dy = D_{\mathrm{KL}}(r\,\|\,q).$$

Because of the properties of the Kullback-Leibler divergence,
$$\int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx = D_{\mathrm{KL}}(r\,\|\,q) \ge 0$$
for all $g \in \mathrm{CRPDF}_{[a,b]}$, and
$$\int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx = D_{\mathrm{KL}}(r\,\|\,q) = 0$$
if and only if $r(y) = q(y) = 1$ for all $y \in [0,1]$, that is, whenever $y = R(y) = F(G^{-1}(y))$, which implies $G(x) = F(x)$ and $g(x) = f(x)$ for all $x \in [a,b]$.

Theorem 24. The $2^k$-partition generated by the recursive thresholding method is not necessarily optimal; that is, its distortion is not minimal.

Proof. A counterexample using the exponential distribution is presented.

Let $X \sim f \in \mathrm{CRPDF}_{[0,M]}$, where $M \gg 0$, $f(x) = c_1 e^{-x}$ and $c_1$ is such that $\int_0^M f(x)\,dx = 1$. From Theorem 21, the optimal (minimum-distortion) discretization function is $\hat f \in \mathrm{CRPDF}_{[0,M]}$ given by $\hat f(x) := c_2 e^{-x/3}$, whose median approaches $3\log 2$, the median of an exponential distribution with rate $1/3$, as $M \to \infty$. Therefore, if the recursive thresholding method were asymptotically optimal, its first cut should also approach $3\log 2$ as $M \to \infty$.

However, it does not. It approaches $w = 2 + W(-2e^{-2}) \approx 1.594 \ne 3\log 2$, where $W$ is the Lambert W function, because $w$ is the only point in $[0, \infty)$ that, under the notation of Definition 14, satisfies

2w = µ(0, w) + µ(w, ∞).

Figure 4.2.4 shows how the actual partitions differ.
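The fixed point $w$ used in this counterexample can be computed directly with SciPy's Lambert W implementation. The sketch below is my own check (it uses the closed-form conditional means of a unit-rate exponential in place of Definition 14's $\mu(\cdot,\cdot)$ and the untruncated $M \to \infty$ limit).

```python
import numpy as np
from scipy.special import lambertw

w = 2.0 + lambertw(-2.0 * np.exp(-2.0)).real
print("w                  =", w)                  # ≈ 1.5936
print("3*log(2)           =", 3.0 * np.log(2.0))  # median of the optimal density, ≈ 2.0794

# Conditional means of a unit-rate exponential on [0, w] and [w, infinity).
mu_left  = (1.0 - (w + 1.0) * np.exp(-w)) / (1.0 - np.exp(-w))
mu_right = w + 1.0
print("2w                 =", 2.0 * w)
print("mu(0,w)+mu(w,inf)  =", mu_left + mu_right)  # equals 2w up to rounding
```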

5.2 Summary of formulas and quantizers

For all $X \sim f \in \mathrm{RPDF}_{[a,b]}$ and any discretization function $g \in \mathrm{RPDF}^+_{[a,b]}$, if $X_n$ denotes the $n$-th discretization of $X$ using $g$, then
$$R(f, g) := \lim_{n\to\infty} R(X_n) = \int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx$$
and, if $f$ is continuous,
$$S(f, g) := \lim_{n\to\infty} S(X_n) = \frac{1}{12}\int_a^b\frac{f(x)}{g^2(x)}\,dx.$$

The scaled distortion is minimized with the quantizer given by

$$g(x) = \hat f(x) := \frac{\sqrt[3]{f(x)}}{\int_a^b\sqrt[3]{f(t)}\,dt}.$$

In that case, the scaled distortion becomes

$$S(f, \hat f) = \frac{1}{12}\left(\int_a^b\sqrt[3]{f(t)}\,dt\right)^3.$$

The redundancy is minimized to R(f, g) = 0 when g(x) = f(x).

Some interesting corollaries are the following.

The redundancy with respect to the minimal-scaled-distortion discretization function is
$$R(f, \hat f) = \frac{2}{3}\left(h_{1/3}(X) - h(X)\right),$$
and the scaled distortion with respect to the minimal-redundancy discretization function is
$$S(f, f) = \frac{1}{12}\int_a^b f^{-1}(x)\,dx.$$
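The summary above is straightforward to turn into code. The sketch below is my own illustration (with an arbitrary positive density on $[0,1]$): it computes the minimum-distortion quantizer $\hat f$, checks the two equivalent expressions for its scaled distortion, evaluates $S(f, f)$, and extracts the cut points $x_i = \hat G^{-1}(i/n)$ of the $n$-level quantizer, where $\hat G$ is the CDF of $\hat f$.

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

a, b = 0.0, 1.0
x = np.linspace(a, b, 200001)
f = (1.0 + 6.0 * x * (1.0 - x)) / 2.0       # positive density on [0, 1]
f /= trapz(f, x)                            # numerical safety

# Minimum-distortion quantizer: f_hat proportional to f^(1/3).
f_hat = np.cbrt(f)
f_hat /= trapz(f_hat, x)

# Minimal scaled distortion, two equivalent expressions from the summary.
S_min_formula  = (trapz(np.cbrt(f), x) ** 3) / 12.0
S_min_integral = trapz(f / f_hat ** 2, x) / 12.0
print(S_min_formula, S_min_integral)

# Scaled distortion of the minimum-redundancy quantizer (g = f).
print("S(f, f) =", trapz(1.0 / f, x) / 12.0)

# Cut points of the n-level minimum-distortion quantizer.
n = 8
cdf_hat = np.concatenate(([0.0], np.cumsum((f_hat[1:] + f_hat[:-1]) * np.diff(x) / 2.0)))
cdf_hat /= cdf_hat[-1]
cuts = np.interp(np.linspace(0.0, 1.0, n + 1), cdf_hat, x)
print("cuts:", np.round(cuts, 4))
```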

5.3 Some generalizations

The same procedure that was used to reach the general formulas for $S(f, g)$ and $R(f, g)$ can be applied more generally to the $k$-distortion and the $\alpha$-redundancy. The following formulas are not derived formally in this chapter, but they are well known5.

On the one hand, the limit of the scaled k-distortion is given by

$$S_k(f, g) = \frac{1}{(k+1)\,2^k}\int_a^b\frac{f(x)}{g^k(x)}\,dx.$$

It is minimized when

$$g(x) = \hat f(x) := \frac{1}{C_{\frac{1}{k+1}}}\,f(x)^{\frac{1}{k+1}},
\qquad\text{where}\qquad
C_r := \int_a^b f^r(x)\,dx.$$

And the minimal possible scaled $k$-distortion for a given function $f$ is

$$S_k(f, \hat f) = \frac{1}{(k+1)\,2^k}\left(C_{\frac{1}{k+1}}\right)^{k+1}.$$

On the other hand, the limit of the α-redundancy is given by

$$R_\alpha(f, g) := \begin{cases}
\dfrac{1}{\alpha - 1}\,\log\displaystyle\int_a^b\frac{f^\alpha(x)}{g^{\alpha-1}(x)}\,dx & \text{for }\alpha \ne 1,\\[2ex]
\displaystyle\int_a^b f(x)\log\frac{f(x)}{g(x)}\,dx & \text{for }\alpha = 1.
\end{cases}$$

It is minimized when g = f, in which case Rα(f, f) = 0.

5Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer, 2007.

Finally, the scaled $k$-distortion of the minimal $\alpha$-redundancy function and the $\alpha$-redundancy of the minimal scaled $k$-distortion function are, respectively,

$$S_k(f, f) = \frac{1}{(k+1)\,2^k}\int_a^b f^{1-k}(x)\,dx
\qquad\text{and}\qquad
R_\alpha(f, \hat f) = \frac{k}{k+1}\left(h_{\frac{1}{k+1}}(X) - h_{\frac{k\alpha+1}{k+1}}(X)\right).$$
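The generalized formulas can be checked the same way. The sketch below is again an illustration under my own choice of density: it confirms that $S_k(f, \hat f)$ computed from its definition matches the closed form $\frac{1}{(k+1)2^k}\,C_{1/(k+1)}^{\,k+1}$ for several values of $k$.

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

x = np.linspace(0.0, 1.0, 200001)
f = (1.0 + 6.0 * x * (1.0 - x)) / 2.0          # positive density on [0, 1]

for k in (1, 2, 3, 4):
    C = trapz(f ** (1.0 / (k + 1)), x)
    f_hat = f ** (1.0 / (k + 1)) / C           # minimizer of the scaled k-distortion
    lhs = trapz(f / f_hat ** k, x) / ((k + 1) * 2 ** k)   # S_k(f, f_hat) by definition
    rhs = C ** (k + 1) / ((k + 1) * 2 ** k)               # closed form from the summary
    print(k, lhs, rhs)
```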

5.4 An open problem

This work led to an open problem related to the maximum entropy problem. The maximum entropy problem consists of finding a probability distribution that maximizes entropy while satisfying some set of constraints.

As expressed by Marsh6, “intuitively, the maximum entropy problem focuses on finding the ‘most random’ distribution under some conditions. ... We further motivate maximum entropy by noting the following: 1. Maximizing entropy will minimize the amount of “prior information” built into the probability distribution. 2. Physical systems tend to move towards maximum entropy as time progresses.”

The minimum redundancy-distortion problem is to find a probability distribution that minimizes $R(f, \hat f)$ while satisfying some set of constraints, where $R(f, \hat f)$ denotes the redundancy of $f$ with respect to its minimum-distortion discretization function $\hat f$. In other words, given a set of densities F that serves as the constraints, the problem is to find

$$\operatorname*{arg\,min}_{f\in F}\ \frac{2}{3}\left(h_{1/3}(f) - h(f)\right).$$

6Charles Marsh. “Introduction to continuous entropy”. In: Department of Computer Science, Princeton University (2013).

When F is the set of all distributions on a given domain, this problem does not seem to be easy to solve. In a recent paper7, several results about minimizing the Rényi entropy $h_\alpha(f)$ are presented; interestingly, however, those results are only applicable when $\alpha > 1/3$.

Formally, the problem is stated as follows.

Problem 25. (minimum redundancy-distortion problem) Let F be a set of densities over [a, b] ⊆ R. Find the function f ∈ F that minimizes

$$\frac{2}{3}\left(h_{1/3}(f) - h(f)\right).$$

7Agapitos N Hatzinikitas. “Self-similar solutions of Rényi’s entropy and the concavity of its entropy power”. In: Entropy 17.9 (2015), pp. 6056–6071.
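To make the objective of Problem 25 concrete, the sketch below (purely illustrative; the exponential-shaped family on $[0,1]$ is an arbitrary choice of mine, not a proposed solution) evaluates $\frac{2}{3}\left(h_{1/3}(f) - h(f)\right)$ over a one-parameter family of densities. Within this family the objective is smallest for the flattest member, consistent with the fact that, unconstrained, the objective vanishes exactly for the uniform density.

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

x = np.linspace(0.0, 1.0, 200001)

def objective(f):
    # (2/3) * (h_{1/3}(f) - h(f)), entropies in nats
    h_13 = (1.0 / (1.0 - 1.0 / 3.0)) * np.log(trapz(f ** (1.0 / 3.0), x))
    h_1  = -trapz(f * np.log(f), x)
    return (2.0 / 3.0) * (h_13 - h_1)

# A toy family: densities proportional to exp(-c * x) on [0, 1].
for c in (0.01, 0.5, 1.0, 2.0, 4.0):
    f = np.exp(-c * x)
    f /= trapz(f, x)
    print(c, objective(f))
```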

Chapter 6

Conclusion and future work

The optimal quantizer of Chapter 3 can be used in machine learning for continuous data discretization and as a split criterion. It exhibited better performance on average than the classic non-iterative discretization methods (k-means is excluded from the analysis) over the multiple synthetic datasets that were considered.

This quantizer is governed by a simple formula that makes it non-iterative, but it requires the distribution of the input to be known or well estimated. The formula is supported by the original proof1 as well as by the proof presented in Chapter 5, which uses only real analysis, for the case of bounded variables with Riemann-integrable densities. Chapter 5 also proves that the equal-frequency quantizer achieves minimum redundancy, which explains why it usually outperforms the equal-width quantizer.

Future work includes resolving the open problem presented in Section 5.4, testing the performance of the generalizations of the optimal quantizer presented in Section 5.3, and exploring their performance on the rate-distortion plane, as done in Appendix A for the fixed case k = 2 and α = 1.

1Herbert Gish and John Pierce. “Asymptotically efficient quantizing”. In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676–683.

Bibliography

[1] Tom M Apostol. "Mathematical analysis". In: (1964), pp. 169–172.
[2] Suguru Arimoto. "An algorithm for computing the capacity of arbitrary discrete memoryless channels". In: IEEE Trans. Information Theory 18 (1972), pp. 14–20.
[3] Richard Blahut. "Computation of channel capacity and rate-distortion functions". In: IEEE Transactions on Information Theory 18.4 (1972), pp. 460–473.
[4] Jason Catlett. "On changing continuous attributes into ordered discrete attributes". In: European Working Session on Learning. Springer, 1991, pp. 164–178.
[5] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
[6] Paul Cuff. Course Notes of Information Signals Course, Kulkarni. URL: https://www.princeton.edu/~cuff/ele201/kulkarni_text/information.pdf.
[7] James Dougherty, Ron Kohavi, and Mehran Sahami. "Supervised and unsupervised discretization of continuous features". In: Machine Learning Proceedings 1995. Elsevier, 1995, pp. 194–202.
[8] Nariman Farvardin and James Modestino. "Optimum quantizer performance for a class of non-Gaussian memoryless sources". In: IEEE Transactions on Information Theory 30.3 (1984), pp. 485–497.
[9] M Isabel Garrido and Jesus A Jaramillo. "Homomorphisms on function lattices". In: Monatshefte für Mathematik 141.2 (2004), pp. 127–146.
[10] Herbert Gish and John Pierce. "Asymptotically efficient quantizing". In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676–683.
[11] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer, 2007.
[12] Robert Gray. "Vector quantization". In: IEEE ASSP Magazine 1.2 (1984), pp. 4–29.
[13] Agapitos N Hatzinikitas. "Self-similar solutions of Rényi's entropy and the concavity of its entropy power". In: Entropy 17.9 (2015), pp. 6056–6071.
[14] Jean Jacod and Philip Protter. Probability essentials. Springer Science & Business Media, 2012.
[15] Edwin Thompson Jaynes. "Information Theory and Statistical Mechanics". In: Brandeis University Summer Institute Lectures in Theoretical Physics (1963), pp. 182–218.
[16] Daniela Joiţa. "Unsupervised static discretization methods in data mining". In: Titu Maiorescu University, Bucharest, Romania (2010).
[17] Randy Kerber. "Chimerge: Discretization of numeric attributes". In: Proceedings of the Tenth National Conference on Artificial Intelligence. 1992, pp. 123–128.
[18] Yoseph Linde, Andres Buzo, and Robert Gray. "An algorithm for vector quantizer design". In: IEEE Transactions on Communications 28.1 (1980), pp. 84–95.
[19] Stuart Lloyd. "Least squares quantization in PCM". In: IEEE Transactions on Information Theory 28.2 (1982), pp. 129–137.
[20] Charles Marsh. "Introduction to continuous entropy". In: Department of Computer Science, Princeton University (2013).
[21] Nobuyuki Otsu. "A threshold selection method from gray-level histograms". In: IEEE Transactions on Systems, Man, and Cybernetics 9.1 (1979), pp. 62–66.
[22] Catuscia Palamidessi and Marco Romanelli. "Feature selection with Rényi min-entropy". In: IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, 2018, pp. 226–239.
[23] Mehmet Sezgin and Bülent Sankur. "Survey over image thresholding techniques and quantitative performance evaluation". In: Journal of Electronic Imaging 13.1 (2004), pp. 146–166.
[24] Claude Elwood Shannon. "A mathematical theory of communication". In: Bell System Technical Journal 27.3 (1948), pp. 379–423.
[25] Gary J Sullivan. "Efficient scalar quantization of exponential and Laplacian random variables". In: IEEE Transactions on Information Theory 42.5 (1996), pp. 1365–1374.
[26] VV Tkachuk. "Properties of function spaces reflected by uniformly dense subspaces". In: Topology and its Applications 132.2 (2003), pp. 183–193.

Appendix A

Optimal quantizer in information theory

This section depicts a relationship between the optimal quantizers of Chapter 3 and the rate-distortion function by plotting several examples of optimal quantizers in the rate-distortion plane.

Quantization has already been used in the information-theoretic literature to introduce the rate-distortion function1. In this appendix, the goal is to provide intuition and some visual examples of how the optimal quantizer approaches the optimal Pareto boundary as n increases.

A.1 Rate-distortion function

There is a whole field in information theory for studying the quality of discretizations, not necessarily deterministic ones. It is called rate-distortion theory and it deals with the trade-off that exists between achieving high precision and using few bits.

1Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.

One of the main results of rate-distortion theory is the Blahut-Arimoto algorithm, proposed independently by Blahut2 and Arimoto3. For any given distribution, the Blahut-Arimoto algorithm produces a coding scheme that is optimal in the sense that no other scheme achieves higher precision without using more bits. In other words, when looking at the rate-distortion plane, the algorithm produces a coding scheme whose distortion and bit rate lie on the so-called rate-distortion function. The rate-distortion function is the Pareto frontier of low distortion and low bit rate, and it depends only on the input distribution.

Four important properties of the Blahut-Arimoto algorithm are the following.

1. It has a parameter β that can be tuned from 0 to ∞ to move along the rate-distortion function.
2. It is iterative, and it must be stopped after a certain number of iterations or once a low error threshold is met. Therefore, it is a practical solution to the problem that does not produce a symbolic optimum, but an approximation as close as desired.
3. Its output is a probabilistic encoding scheme, that is, a non-deterministic transformation of inputs to outputs (codes) in which some fixed probabilities determine which outputs are more likely to occur for some inputs.
4. The sets of input and output values are given to the algorithm, and it computes the probability matrix that relates them.
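A minimal sketch of the iteration may help fix ideas. The code below is my own illustration of the textbook Blahut update for a finite source under squared-error distortion (rates in nats; the source, reproduction points, and β values are arbitrary assumptions), not an implementation used in this thesis.

```python
import numpy as np

def blahut_arimoto(p_x, x_vals, y_vals, beta, n_iter=500):
    """One rate-distortion point for a discrete source under squared error.

    p_x: source probabilities, beta: Lagrange/temperature parameter.
    Returns (rate in nats, distortion)."""
    d = (x_vals[:, None] - y_vals[None, :]) ** 2          # distortion matrix d(x, y)
    q_y = np.full(len(y_vals), 1.0 / len(y_vals))         # initial output marginal
    for _ in range(n_iter):
        w = q_y[None, :] * np.exp(-beta * d)              # unnormalized Q(y|x)
        Q = w / w.sum(axis=1, keepdims=True)
        q_y = p_x @ Q                                     # updated output marginal
    D = float(np.sum(p_x[:, None] * Q * d))
    with np.errstate(divide="ignore", invalid="ignore"):
        r_terms = np.where(Q > 0, Q * np.log(Q / q_y[None, :]), 0.0)
    R = float(np.sum(p_x[:, None] * r_terms))
    return R, D

# Example: a discretized standard normal source, 8 reproduction points.
x_vals = np.linspace(-4.0, 4.0, 201)
p_x = np.exp(-x_vals ** 2 / 2.0); p_x /= p_x.sum()
y_vals = np.linspace(-3.0, 3.0, 8)
for beta in (0.5, 2.0, 8.0):
    print(beta, blahut_arimoto(p_x, x_vals, y_vals, beta))
```

Sweeping β traces out points along the rate-distortion curve, which is how the dashed reference curves of the figures below can be approximated in practice.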

The Blahut-Arimoto algorithm is elegant for its simplicity and generality, yet some authors have attempted to produce faster algorithms or formulas for specific inputs by taking advantage of their properties. Farvardin4 proposed an algorithm for normally distributed inputs, and Sullivan5 one for the exponential distribution.

2 Richard Blahut. "Computation of channel capacity and rate-distortion functions". In: IEEE Transactions on Information Theory 18.4 (1972), pp. 460–473.
3 Suguru Arimoto. "An algorithm for computing the capacity of arbitrary discrete memoryless channels". In: IEEE Trans. Information Theory 18 (1972), pp. 14–20.
4 Nariman Farvardin and James Modestino. "Optimum quantizer performance for a class of non-Gaussian memoryless sources". In: IEEE Transactions on Information Theory 30.3 (1984), pp. 485–497.

[Figure A.2.1: Several deterministic discretizations of the normal distribution in the rate-distortion plane. Each dot is a discretization, and the minimum-distortion quantizers are represented with an 'x'. The dashed curve is the Shannon lower bound and corresponds also to the rate-distortion function in this particular case.]

A.2 Optimal quantizers in the rate-distortion plane

It is tempting to think that optimal quantizers lie on the rate-distortion function because they are optimal; however, this is not necessarily true, because these quantizers are restricted to have a fixed number of output values, to be deterministic, and to minimize distortion regardless of the rate, i.e., the entropy. Indeed, as shown in Figure A.2.1, the optimal quantizers for the standard normal distribution do not lie on the rate-distortion function, although, as n increases, the distance between the minimum-distortion quantizers and the rate-distortion function decreases.
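The qualitative picture of Figure A.2.1 can be reproduced as follows. The sketch below is my own illustration (standard normal truncated to [−8, 8], entropy of the quantized variable in bits, Shannon lower bound $D(R) = 2^{-2R}$ for unit variance): it prints the rate-distortion coordinates of the minimum-distortion $n$-level quantizers, which stay above the bound but approach it as $n$ grows.

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Standard normal truncated to [-8, 8]; the truncated mass is negligible.
grid = np.linspace(-8.0, 8.0, 400001)
norm = trapz(np.exp(-grid ** 2 / 2.0), grid)
f = lambda t: np.exp(-t ** 2 / 2.0) / norm

# CDF of the minimum-distortion quantizer density f_hat proportional to f^(1/3).
fh = np.cbrt(f(grid)); fh /= trapz(fh, grid)
cdf_hat = np.concatenate(([0.0], np.cumsum((fh[1:] + fh[:-1]) * np.diff(grid) / 2.0)))
cdf_hat /= cdf_hat[-1]

print(" n   rate(bits)  distortion  Shannon bound D(R)")
for n in (2, 4, 8, 16, 32):
    cuts = np.interp(np.linspace(0.0, 1.0, n + 1), cdf_hat, grid)
    D, H = 0.0, 0.0
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        t = np.linspace(lo, hi, 4001)
        p = trapz(f(t), t)
        mu = trapz(t * f(t), t) / p
        D += trapz(f(t) * (t - mu) ** 2, t)
        H += -p * np.log2(p)
    print(f"{n:3d}  {H:9.4f}  {D:10.5f}  {2.0 ** (-2.0 * H):10.5f}")
```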

For the uniform distribution, a similar observation can be made. Figure A.2.2 shows several discretizations of the uniform distribution in the rate-distortion plane. Unlike the Gaussian case, for each n, the maximal-rate discretizations are precisely the minimal-distortion discretizations. These correspond to discretizations with uniform width among the bins.

5 Gary J Sullivan. "Efficient scalar quantization of exponential and Laplacian random variables". In: IEEE Transactions on Information Theory 42.5 (1996), pp. 1365–1374.

[Figure A.2.2: Several deterministic discretizations of the uniform distribution in the rate-distortion plane. Each dot is a discretization, and the minimum-distortion quantizers are represented with an 'x'.]