An Approach to Optimal Discretization of Continuous Real Random Variables with Application to Machine Learning
An Approach to Optimal Discretization of Continuous Real Random Variables with Application to Machine Learning

Carlos Antonio Pinzón Henao
Student identification: 8945263
Director: Camilo Rocha, PhD
Co-director: Jorge Finke, PhD

Engineering and Sciences
Master in Engineering - Emphasis on Computer Science
Pontificia Universidad Javeriana Cali, Cali, Colombia, October 2019

Contents

1 Introduction
2 Preliminaries
  2.1 Random variables of interest
  2.2 Entropy and redundancy
    2.2.1 Shannon entropy
    2.2.2 Differential entropy and LDDP
    2.2.3 Rényi entropy
    2.2.4 Redundancy
  2.3 Discretization
    2.3.1 Digitalization
    2.3.2 n-partitions
    2.3.3 Distortion and entropy
    2.3.4 Discretization functions
    2.3.5 Redundancy and scaled distortion
3 Optimal quantizers
  3.1 Context
  3.2 Formal definition
  3.3 Application in sensorics
4 Uses in machine learning
  4.1 Preprocessing
  4.2 Intra-class variance split criterion
    4.2.1 Construction of decision trees
    4.2.2 Min-variance split criterion
    4.2.3 Non-uniqueness
    4.2.4 Non-optimality
  4.3 Discussion
5 A real analysis approach
  5.1 Formal derivation
    5.1.1 Deriving Theorem 10
    5.1.2 Deriving Theorem 17
    5.1.3 Deriving Theorem 19
    5.1.4 Optimality theorems
  5.2 Summary of formulas and quantizers
  5.3 Some generalizations
  5.4 An open problem
6 Conclusion and future work
Bibliography
A Optimal quantizer in information theory
  A.1 Rate-distortion function
  A.2 Optimal quantizers in the rate-distortion plane

Acknowledgment

I would like to thank all the people who made this thesis possible, including Dr. Camilo Rocha and Dr. Jorge Finke for their trust and guidance, all the supervisors who agreed to review my work, the professors at the University whose teachings and advice helped me directly or indirectly throughout this period, including Isabel García, Carlos Ramírez and Frank Valencia, my friend Juan Pablo Berón for providing insight into some proofs, and my friends and family, especially my future wife Adriana, for their constant support. Without their help, it would not have been possible to conclude this thesis successfully. In addition, I thank the growing Mathematics Stack Exchange community for having provided the sketches for two of the proofs presented in this document: question 3339209 for Theorem 9 and question 154721 for Theorem 12.

Chapter 1
Introduction

This document explores some discretization methods for continuous real random variables, their performance, and some of their applications. In particular, this work borrows a well-studied discretization method from the signal processing community, called the minimal distortion high resolution quantizer [1], and uses it in machine learning to illustrate how and when it can improve commonly used techniques and methodologies.

Discretization is the process of converting a continuous variable into a discrete one. It is useful because, unlike the continuous input, the discrete output can be encoded and treated with discrete methods and tools, including computers and microcontrollers. It therefore occurs in all modern sensor systems. It is also used in data analysis, either for a final categorization of the data or internally as a bridge between the continuous nature of the data and the discrete nature of some algorithms.

Despite its advantages, discretization necessarily introduces an error, because the discrete output cannot mimic the continuous input with complete precision. This error can be made very small by using properly designed discretization methods, or at the expense of using more bits in the encoding.
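As a rough illustration of this error-versus-bits trade-off, the following sketch (not part of the thesis; all names are illustrative) quantizes standard-normal samples with an equal-width quantizer and reports how the mean squared distortion shrinks as more bits are spent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)            # continuous input samples
a, b = x.min(), x.max()                 # observed input interval [a, b]

for bits in (2, 4, 8):
    n = 2 ** bits                       # bins encodable with `bits` bits
    edges = np.linspace(a, b, n + 1)    # equal-width bin edges
    idx = np.clip(np.digitize(x, edges) - 1, 0, n - 1)
    mids = (edges[:-1] + edges[1:]) / 2 # each sample maps to its bin midpoint
    print(bits, "bits -> MSE", np.mean((x - mids[idx]) ** 2))
```

Each additional bit doubles the number of bins and, for a smooth density, roughly quarters the mean squared error.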
Discretization has been studied in depth under the term quantization, especially in the signal processing community, which proposed not only the minimal distortion high resolution quantizer [1], but also an iterative method, called the Lloyd method, for computing optimal quantizers at any resolution level [2], as well as generalized versions of this algorithm that support multiple dimensions and are therefore called vector quantizers [3, 4]. These algorithms have a tremendous advantage over the high resolution quantizer: they operate directly on samples and do not require the distribution to be known or estimated beforehand. Furthermore, the book Foundations of Quantization for Probability Distributions [5] by Graf and Luschgy provides an extensive review of quantization methods, together with common notation and a rigorous theory, built on advanced topics in mathematics, to support them. The book covers several types of quantizers, including one-dimensional optimal quantizers, one-dimensional high resolution optimal quantizers, and vector quantizers.

This work explores just one of these types of quantizers using undergraduate mathematics, so as to be readable and useful for a broader, less experienced audience. It focuses exclusively on the discretization of one-dimensional continuous real random variables X for which a Riemann-integrable probability density function with domain [a, b] exists, for reals a < b. These constraints, which might seem very strong from a mathematical point of view, are common in engineering applications and allow the study to be reduced to undergraduate mathematics.

The analysis focuses on the following quantizers, especially on the optimal quantizer; sketches of some of these constructions, and of the Lloyd method, follow the list.

1. Equal width quantizer, or uniform quantizer: the simplest quantizer and the one used in most sensors. It divides the input interval into n bins of equal width.

2. Equal frequency quantizer: it divides the input interval into n bins of equal area under the density f. As a consequence, it maximizes the entropy of the random output, which can be loosely understood as its diversity.

3. Minimal distortion high resolution quantizer, called the optimal quantizer throughout the document: it minimizes the error between the input and the output when n is sufficiently large. It is explained in detail in Chapter 3.

4. Recursive binary quantizer: used in binary decision trees in machine learning. As explained in Chapter 4, it is constructed with a recursive, greedy procedure: it divides the interval into two by means of a minimal distortion quantizer with n = 2, and repeats the procedure on each subinterval, yielding 4, 8, and in general 2^k parts.
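As a concrete, hedged companion to the list above, the sketch below estimates bin edges for quantizers 1, 2 and 4 directly from samples (function names are illustrative, not the thesis's notation; quantizer 3 needs the density-based construction of Chapter 3 and is omitted here):

```python
import numpy as np

def equal_width_edges(x, n):
    # quantizer 1: inner edges of n bins of equal width over the input interval
    return np.linspace(x.min(), x.max(), n + 1)[1:-1]

def equal_frequency_edges(x, n):
    # quantizer 2: inner edges of n bins of (approximately) equal probability
    # mass, estimated here by sample quantiles
    return np.quantile(x, np.arange(1, n) / n)

def min_variance_threshold(x):
    # minimal distortion split with n = 2: the cut that minimizes the summed
    # within-part squared error, found with prefix sums over the sorted data
    xs = np.sort(x)
    s1, s2 = np.cumsum(xs), np.cumsum(xs ** 2)
    k = np.arange(1, len(xs))
    left = s2[:-1] - s1[:-1] ** 2 / k              # SSE of the first k samples
    right = (s2[-1] - s2[:-1]) - (s1[-1] - s1[:-1]) ** 2 / (len(xs) - k)
    i = int(np.argmin(left + right))
    return (xs[i] + xs[i + 1]) / 2

def recursive_binary_edges(x, depth):
    # quantizer 4: greedy recursion yielding 2**depth bins
    if depth == 0 or len(x) < 2:
        return []
    t = min_variance_threshold(x)
    return (recursive_binary_edges(x[x <= t], depth - 1) + [t]
            + recursive_binary_edges(x[x > t], depth - 1))
```

For example, recursive_binary_edges(x, 3) returns the 7 inner cut points of the 8 = 2^3 bins produced by the greedy recursion.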
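The Lloyd method mentioned earlier likewise operates directly on samples; a minimal one-dimensional sketch, assuming squared-error distortion and illustrative naming, is:

```python
import numpy as np

def lloyd_quantizer(x, n, iters=100):
    # Lloyd iteration with squared-error distortion: alternately
    # (1) cut the line at midpoints between representatives and
    # (2) move each representative to the mean of its bin
    reps = np.quantile(x, (np.arange(n) + 0.5) / n)   # spread-out initial reps
    for _ in range(iters):
        edges = (reps[:-1] + reps[1:]) / 2            # nearest-neighbor cuts
        idx = np.searchsorted(edges, x)               # bin index of each sample
        reps = np.array([x[idx == j].mean() if np.any(idx == j) else reps[j]
                         for j in range(n)])
    return reps, (reps[:-1] + reps[1:]) / 2           # representatives, edges
```

Each iteration alternates the two optimality conditions (nearest-neighbor boundaries, centroid representatives), so the distortion never increases from one pass to the next.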
The main contributions of this dissertation can be summarized as follows:

1. It brings the optimal quantizer to the attention of the machine learning community as an alternative to the typical equal width and equal frequency quantizers when the input distribution is known or well estimated, especially targeting readers with little background in advanced mathematics.

2. It illustrates with examples how and when the optimal quantizer can replace typical non-iterative techniques in machine learning for discretizing a dataset, and how and when it can be used as a split criterion for decision trees.

3. It presents real-analysis proofs of the optimality and uniqueness of the optimal quantizer under the constraints of being bounded and having a Riemann-integrable density. In contrast to existing proofs [5, 1], these are comprehensible to the majority of undergraduate engineers because they do not use measure theory.

4. It summarizes the generalizations of these results to other distance metrics and provides insight into how to generalize them to different entropy functions as well.

This document is organized as follows. Chapter 1 is the introduction. Chapter 2 provides definitions that are used throughout the document and introduces the reader to discretization with some plots and examples. Chapter 3 reviews the optimal quantizer and presents its application in sensorics. Chapter 4 compares the optimal quantizer against the others in machine learning. More precisely, the chapter compares the performance of some predictors that are fed with the output of the discretizers, and provides a split criterion for decision trees based on the optimal quantizer. Chapter 5 uses real analysis to prove that the optimal quantizer is indeed optimal and unique. It also provides a generalization of this quantizer and an open problem that emerged naturally from the generalizations. Chapter 6 concludes the document with a short summary. Appendix A explores the optimal quantizer from an information-theoretic perspective. In particular, some plots of optimal quantizers on the rate-distortion plane are provided.

References

[1] Herbert Gish and John Pierce. "Asymptotically efficient quantizing". In: IEEE Transactions on Information Theory 14.5 (1968), pp. 676-683.
[2] Stuart Lloyd. "Least squares quantization in PCM". In: IEEE Transactions on Information Theory 28.2 (1982), pp. 129-137.
[3] Yoseph Linde, Andres Buzo, and Robert Gray. "An algorithm for vector quantizer design". In: IEEE Transactions on Communications 28.1 (1980), pp. 84-95.
[4] Robert Gray. "Vector quantization". In: IEEE ASSP Magazine 1.2 (1984), pp. 4-29.
[5] Siegfried Graf and Harald Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2007.