True Random Number Generation using Genetic Algorithms on High Performance Architectures

by

Jose Juan Mijares Chan

A thesis submitted to The Faculty of Graduate Studies of The University of Manitoba in partial fulfillment of the requirements of the degree of

Doctor of Philosophy

Department of Electrical and Computer Engineering The University of Manitoba Winnipeg, Manitoba, Canada October 2016

© Copyright 2016 by Jose Juan Mijares Chan

Thesis advisors: Gabriel Thomas, Parimala Thulasiraman

Author: Jose Juan Mijares Chan

True Random Number Generation using Genetic Algorithms on High Performance Architectures

Abstract

Many real-world applications use random numbers generated by pseudo-random number generators (PRNGs) and true random number generators (TRNGs). Unlike pseudo-random number generators, which rely on an input seed to generate random numbers, a TRNG relies on a non-deterministic source to generate aperiodic random numbers. In this research, we develop a novel and generic software-based TRNG using a random source extracted from the compute architectures of today. We show that non-deterministic events such as race conditions between compute threads follow a near Gamma distribution, independent of the architecture, whether multi-core or co-processor. Our design improves the distribution towards a uniform distribution, ensuring the stationarity of the sequence of random variables.

We mitigate the statistical deficiencies of the random numbers by using a post-processing stage based on a heuristic evolutionary algorithm. Our post-processing algorithm is composed of two phases: (i) Histogram Specification and (ii) Stationarity Enforcement.

We propose two techniques for histogram equalization, Exact Histogram Equalization (EHE) and Adaptive EHE (AEHE), that map the random numbers' distribution to a user-specified distribution. EHE is an offline algorithm with O(N log N) complexity. AEHE is an online algorithm that improves performance using a sliding window and achieves O(N). Both algorithms ensure a normalized entropy of (0.95, 1.0].

The stationarity enforcement phase uses genetic algorithms to mitigate the statistical deficiencies from the output of histogram equalization by permuting the random numbers until wide-sense stationarity is achieved. By measuring the standard deviation of the power spectral density, we ensure that the quality of the numbers generated from the genetic algorithms is within the level of error specified by the user. We develop two algorithms: a naive algorithm with an expected exponential complexity of E[O(e^N)], and an accelerated FFT-based algorithm with an expected quadratic complexity of E[O(N^2)]. The accelerated FFT-based algorithm exploits the parallelism found in genetic algorithms on a homogeneous multi-core cluster. We evaluate the effects of its scalability and data size on a standardized battery of tests, TestU01, finding the tuning parameters to ensure wide-sense stationarity on long runs.

Contributions

The goal of this thesis is to develop a novel TRNG using a random source extracted from today's compute architectures, with the extracted random numbers permuted using evolutionary algorithms for improved solution quality. The following points summarize my contributions.

• Extraction of a generic random source: I observed that events involving race conditions between computer threads follow a near Gamma distribution, independently of the modern computer architecture (CPU & GPU) and the underlying process scheduler. This observation led me to propose the following.

• TRNG construction: Given that the Gamma distribution found above is of poor quality for an ideal random number generator, I designed a software-based TRNG that improves the distribution towards a uniform distribution and ensures the stationarity of the sequence of random variables. To the best of my knowledge, no one has attempted this before.

• The proposed TRNG is composed of two post-processing stages: Histogram Specification and Stationarity Enforcement.

  – Histogram Specification: I developed two techniques, Exact Histogram Equalization (an offline algorithm) and Adaptive EHE (an online algorithm), that map the random numbers' distribution to a uniform distribution.

  – Stationarity Enforcement: I used genetic algorithms to mitigate the statistical deficiencies from the histogram specification stage by permuting the random numbers until wide-sense stationarity is achieved. I developed two algorithms to achieve this: a naive algorithm and an accelerated FFT-based algorithm.

  – Parallelization: The accelerated FFT-based algorithm using genetic algorithms is parallelized on a homogeneous multi-core cluster using Intel Ivy Bridge processors, using the MapReduce programming model to increase performance.

• Evaluation: I presented a group of evaluations that highlight the performance and the quality conditions of my algorithm.

  – Performance analysis: The results of the algorithm's scalability and data size on the performance suggested that an expected quadratic complexity in computation time can be achieved.

  – Quality analysis: Using the parallel version of the accelerated FFT-based algorithm with the genetic algorithms, I observed a relation between the window size and the quality of the results from a standardized battery of tests, TestU01. As well, I presented sub-optimal solutions with tuning parameters that ensure wide-sense stationarity on long runs.

Contents

Abstract ...... ii Contributions ...... iv Table of Contents ...... viii List of Figures ...... ix List of Tables ...... xv Acknowledgments ...... xvi Dedication ...... xvii Acronyms ...... 1 List of symbols ...... 4

1 Introduction 8 1.1 The history of random numbers ...... 9 1.2 Random Numbers ...... 16 1.3 Characteristics of random numbers ...... 19 1.3.1 The uniform distribution of random numbers ...... 19 1.3.2 The correlation of random numbers ...... 23 1.3.3 The spectral density of random numbers ...... 25 1.3.4 The stationarity of random numbers ...... 25 1.4 Types of random number generators ...... 29

2 The Extraction layer on True Random Number Generators 34 2.1 The computer architectures of today ...... 34 2.2 The idea behind a TRNG on today’s computer architectures . . . . . 39 2.2.1 Data hazards ...... 39 2.2.2 Schedulers ...... 40 2.2.3 Strategy for designing a TRNG using a compute architecture 43 2.3 Case study: TNRG design on GPUs ...... 46 2.3.1 Source of randomness found in GPUs ...... 50 The measurement of race conditions times ...... 52 The measurement of in-chip temperature calculation times . . 52 2.3.2 Evaluation of TRNGs sources in GPUs ...... 53


Phenomena observation...... 53 Effects due to architectural changes...... 53 Behaviour consistency over long runs...... 60 2.4 Case study: TRNGs in Modern microprocessors ...... 64 Effects due to architectural changes...... 65 Behaviour consistency over long runs...... 68 Other TRNGs developed in hardware for CPUs...... 70 2.5 Summary ...... 71

3 Distribution shape enhancement on TRNGs 73 3.1 Modulus-based algorithms ...... 74 3.1.1 Optimizing the random numbers representation format . . . . 75 3.1.2 Modulus ...... 76 3.2 Histogram-based algorithms ...... 79 3.2.1 Histogram equalization ...... 79 3.2.2 Exact Histogram Equalization ...... 83 3.2.3 Adaptive Exact Histogram Equalization ...... 83 3.3 Asymptotic analysis of algorithms ...... 88 3.4 Summary ...... 91

4 Stationarity enforcement on True Random Number Generators 93 4.1 Naive approach of stationarity enforcement ...... 95 4.1.1 Evaluation methodology ...... 104 4.1.2 Results ...... 105 4.2 Stationarity enforcement accelerated by a FFT-based algorithm . . . 109 4.2.1 Correlation metrics ...... 110 Autocorrelation ...... 110 Power spectral density ...... 112 4.2.2 Evaluation criteria ...... 115 4.2.3 Enhancements on the stationary enforcement block ...... 116 4.2.4 Evaluation methodology ...... 122 Characteristics of a GA post-processing stage ...... 122 4.2.5 Results ...... 123 Results from the Characteristics of a GA post-processing stage 123 4.3 Asymptotic analysis of algorithms ...... 128 4.4 Summary ...... 131

5 Quality Evaluation in True Random Number Generators 132 5.1 Parallel scheme ...... 133 5.2 SmallCrush evaluation ...... 134 5.3 Results ...... 137 5.4 Discussion ...... 140

5.5 Summary ...... 142

6 Summary of work 144

7 Conclusions and Future work 146 7.1 Conclusions ...... 146 7.2 Future Work ...... 149

Bibliography 164

List of Figures

1.1 On the left, Pakua or Bāguà (八卦) representation, literally ”eight trigrams” by BenduKiwi, licensed under CC-AS 3.0, the 8 basic tria- grams are presented surrounding the Yin and Yang, the duality of all things in nature. On the right, the I Ching hexagrams are presented in ascending order, starting at 0 in the upper left corner, going left to right and up to down, to 63 in the lower right corner...... 10 1.2 On the left, The Royal game of Ur from Mesopotamia by British Mu- seum licensed under CC-BY-3.0. On the right, an Egyptian dice (600- 800 Before Common Era (BCE)) by Swiss Museum of Games licensed under CC-SA 4.0...... 11 1.3 An Illustration from the Mahabharata: The Humiliation of Draupadi by an unknown author, India, 1830-40. This moment depicts when the Pandavas are tricked by the Kauravas, into a fixed game of dice in which they lose their all their wealth and kingdom...... 13 1.4 19th century Tsimshian Bag with 65 Inlaid Gambling Sticks, Maple wood, abalone, pigment, hide and tooth presented at Brooklyn Mu- seum in the Museum Expedition 1905...... 14 1.5 Mapping Diagram of a Random variable X and its corresponding sam- pling space S...... 17 1.6 (a) Example of random values x from a computer generated algorithm. (b) Correlation plot between random values...... 18 1.7 Probability mass function (a), autocorrelation (b) and power spectral density (c) of x random numbers in an ideal case scenario. Probability mass function (d), autocorrelation (e) and power spectral density (f) of x random numbers from a computer generated algorithm...... 20 1.8 Miminum block design of a True random number generator ...... 30

2.1 Simple scoreboard implementation ...... 41 2.2 Simple Tomasulo’s algorithm implementation ...... 42 2.3 G80-ENG-A1 graphic processing unit (GPU) ...... 47


2.4 Left: compute unified device architecture (CUDA) Memory Hierarchy and execution model. Right: Thread organization diagram. Both from NVIDIA CUDA Programming Guide version 3.0, licensed by CC-A3.0. 48 2.5 Warp Scheduling in single-instruction multiple-thread (SIMT) pipelines. 50 2.6 Probability mass function (pmf ) of random sources while being sam- pled for 106 samples with a bin size of 0.5µ sec. The measurement of race conditions times following Listing 2.4, tRC , present a sharper distribution; while the measurement of times to calculate the in-chip temperature, tTC , is wider. ©2011 IEEE...... 54 2.7 Pmf of random sources using the three suggested kernels on a Tesla architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 55 2.8 Pmf of random sources using the three suggested kernels on a Kepler architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 55 2.9 Pmf of random sources using kernel GMSHM and changing the num- ber of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Tesla architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 57 2.10 Pmf of random sources using kernel GM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Tesla architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 57 2.11 Pmf of random sources using kernel SHM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Tesla architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 58 2.12 Pmf of random sources using kernel GMSHM and changing the num- ber of repetitions, M, and the number of threads from 1 to 1024. The kernel is run over a Kepler architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 59 2.13 Pmf of random sources using kernel GM and changing the number of repetitions, M, and the number of threads from 1 to 1024. The kernel is run over a Kepler architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 59 2.14 Pmf of random sources using kernel SHM and changing the number of repetitions, M, and the number of threads from 1 to 1024. The kernel is run over a Kepler architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 60 List of Figures xi

2.15 Sliding window of size 1024 samples, creating a pmf of a random source using kernel GMSHM where M=100 repetitions, N=33 threads. The kernel is run over a Tesla architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 61 2.16 Sliding window of size 1024 samples, creating a pmf of random sources using kernel GM where M=100 repetitions, N=33 threads. The kernel is run over a Tesla architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 61 2.17 Sliding window of size 1024 samples, creating a pmf of random sources using kernel SHM where M=100 repetitions, N=33 threads. The ker- nel is run over a Tesla architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 62 2.18 Sliding window of size 1024 samples, creating a pmf of a random source using kernel GMSHM where M=100 repetitions, N=33 threads. The kernel is run over a Kepler architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 62 2.19 Sliding window of size 1024 samples, creating a pmf of random sources using kernel GM where M=100 repetitions, N=33 threads. The kernel is run over a Kepler architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 63 2.20 Sliding window of size 1024 samples, creating a pmf of random sources using kernel SHM where M=100 repetitions, N=33 threads. The ker- nel is run over a Kepler architecture, while being sampled for 106 sam- ples with a bin size of 0.5µ sec...... 63 2.21 Scheduling in pipelines using Skylake product line...... 65 2.22 Pmf of random sources using the three suggested kernels on a Skylake architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 66 2.23 Pmf of random sources using kernel GMSHM and changing the num- ber of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Skylake architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 66 2.24 Pmf of random sources using kernel GM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Skylake architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 67 2.25 Pmf of random sources using kernel SHM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Skylake architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 67 xii List of Figures

2.26 Sliding window of size 1024 samples, creating a pmf of a random source using kernel GMSHM where M=100 repetitions, N=33 threads. The kernel is run over a Skylake architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 68 2.27 Sliding window of size 1024 samples, creating a pmf of random sources using kernel GM where M=100 repetitions, N=33 threads. The kernel is run over a Skylake architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 69 2.28 Sliding window of size 1024 samples, creating a pmf of random sources using kernel SHM where M=100 repetitions, N=33 threads. The ker- nel is run over a Skylake architecture, while being sampled for 106 samples with a bin size of 0.5µ sec...... 69 2.29 Pmf of random sources using Skylake RDSEED, while being sampled for 106 samples with a bin size of 0.5µ from the Normalized numbers. 70 2.30 Sliding window of size 1024 samples, creating a pmf of random sources using Skylake RDSEED, while being sampled for 106 samples with a bin size of 0.5µ from the Normalized numbers...... 71

3.1 Extracted random numbers while being sampled for 20000 samples with a bin size of 0.5µ from the Normalized numbers...... 78 3.2 Extracted random numbers after applying Mright (·) from Eq. 3.2 with k = 1 sampled for 106 samples with a bin size of 0.5µ from the Nor- malized numbers...... 79 3.3 Transformation function using Histogram Equalization...... 82 3.4 Transformed random numbers after applying Mright (·) with k = 1 6 and MHE (·) sampled for 10 samples with a bin size of 0.5µ from the Normalized numbers...... 82 3.5 Transformation function using Exact Histogram Equalization. . . . . 84 3.6 Transformed random numbers after applying Mright (·) with k = 1 6 and MEHE sampled for 10 samples with a bin size of 0.5µ from the Normalized numbers...... 84 3.7 Change on the pmf of MAEHE output by using an sliding window of 64 numbers and 64 bins. For MAEHE, the window size is 128 random numbers...... 87 3.8 Normalized entropy of the Pmf of MAEHE output from Fig 3.7. . . . 89 3.9 Run-time complexity for Modulus-based and Histogram-based algo- rithms...... 90

4.1 Generic architecture of a TRNG ...... 94 4.2 The stationarity enforcement block ...... 96 4.3 The semi-internal random numbers, x ...... 96 4.4 The pmf of the semi-internal random numbers ...... 97

4.5 The autocorrelation of the semi-internal random numbers ...... 97 4.6 The power spectral density of the semi-internal random numbers . . . 98 4.7 Initial case: The semi-internal random numbers (si random numbers) (left) with periodic traces shows strong side lobes in its autocorrela- tion (center), while the power spectral density present a exponential distribution (right) ...... 106 4.8 Middle case: At iteration 23, the si random numbers (left) with small periodic traces shows small side lobes in its autocorrelation (center), while the power spectral density starts to show a Gaussian distribution (right) ...... 106 4.9 End case: At iteration 2403, the si random numbers (left) without periodic traces shows nearly no side lobes in its autocorrelation (cen- ter), while the power spectral density presents a Gaussian distribution around 1, with the target standard deviation(right) ...... 106 4.10 Proof of concept: The standard deviation shows to smoothly converge to 0.25 ...... 107 4.11 Probability mass function (a), autocorrelation (b) and power spectral density (c) of x random numbers in an ideal case scenario. Probability mass function (d), autocorrelation (e) and power spectral density (f) of x random numbers from a computer generated algorithm...... 112 4.12 Comparison of the autocorrelation, FFT-based autocorrelation and power spectral density algorithms analyzing (a) the computational in- tensity, (b) floating point operations and (c) memory access operation in terms of the data size N...... 114 4.13 The Architecture of the Stationarity Enforcement Block ...... 115 4.14 DAS Random numbers (a), its probabilistic mass function (b) and au- tocorrelation (c). Semi-internal random numbers (d), its probabilistic mass function (e) and autocorrelation (f)...... 116 4.15 (a) Random numbers at generation 0, (b) its probability mass function, (c) the autocorrelation, (d) the power spectral density ...... 124 4.16 (a) Random numbers at generation 300, (b) its probability mass func- tion, (c) the autocorrelation, (d) the power spectral density ...... 124 4.17 (a) Random numbers at generation 4779992, (b) its probability mass function, (c) the autocorrelation, (d) the power spectral density . . . 124 4.18 Standard deviation of the Power spectral density plotted against the change on generations ...... 125 4.19 Standard deviation of the Power spectral density from the parents, children and external pool plotted against the change on generations 126 4.20 Standard deviation of the Power spectral density from the parents and children pool without external pool plotted against the change on generations ...... 127 4.21 Time complexity for naive and accelerated approach ...... 129 xiv List of Figures

4.22 Space complexity for naive and accelerated approach ...... 129

5.1 SmallCrush results from TestU01 battery test using 2 processors ...... 138 5.2 SmallCrush results from TestU01 battery test using 4 processors ...... 138 5.3 SmallCrush results from TestU01 battery test using 8 processors ...... 139

List of Tables

2.1 Evolution of Intel’s pipeline and µ-decoder cache ...... 37 2.2 CUDA memory relative speed and scope ...... 49 2.3 Gamma Distribution approximations ...... 56

3.1 Normalized Shannon Entropy of noise sources ...... 74 3.2 Entropy analysis per decimal digit ...... 77 3.3 Asymptotic analysis of the Modulus-based and Histogram-based algo- rithms ...... 90

4.1 Results of using one-point cross over ...... 108 4.2 Results of using modular cross over ...... 109 4.3 Approximation of complexity model for Naive approach ...... 131 4.4 Approximation of complexity model for Accelerated approach . . . . 131

5.1 Small Crush test configurations ...... 135

Acknowledgments

Firstly, I would like to express my most sincere gratitude to my advisor and co-advisor, Dr. Thomas and Dr. Thulasiraman, for their continuous support during my

Ph.D study and related research. Thanks for all the patience, motivation, support and the opportunity to join your labs. But most important of all, thanks for your sharp guidance and sharing your immense knowledge during the time of this research.

Besides my advisors, I would like to thank the rest of my thesis committee: Dr.

Yahampath and Dr. Pistorius for their comments that always were insightful and motivated me to expand my research, by not being afraid of answering the hard questions.

I extend my gratitude to my fellow labmates in IDEAS/CFD lab and Dr. Thomas’s lab for the stimulating discussions and valuable feedback, all the hard work we shared, and the fun we had.

Last but not the least, I would like to thank my family: my parents, my sister, my girlfriend and all my friends, for supporting me throughout writing this thesis and all the hard times.

This thesis is dedicated to my family.


Acronyms

cdf cumulative distribution function

fpo floating point operations

iid independent and identically distributed

mao memory access operations

pmf probability mass function

si random numbers semi-internal random numbers

sss strict or strong-sense stationarity

wss wide-sense stationarity

AEHE adaptive exact histogram equalization

AES advanced encryption standard

AHE adaptive histogram equalization

ALU arithmetic logic unit

ASIC application-specific integrated circuit

BCE Before Common Era

BOHE block overlapped histogram equalization

CBC-MAC cipher block chaining message authentication code

CISC complex instruction set computers

CPU central processing unit

CUDA compute unified device architecture

DAS random numbers digitized analog signal random numbers

DRNG digital random number generator

EHE exact histogram equalization

FFT fast Fourier transform

FU functional unit

GA genetic algorithm

GPU graphic processing unit

H normalized Shannon entropy

HE histogram equalization

Kb kilo-

MAD multiply-add blocks

MIPS microprocessor without interlocked pipeline stages

NIST National Institute of Standards and Technology

NV-API NVIDIA application programming interface

OoO out-of-order execution

OSTE on-line self test entropy

PRNG pseudo random number generator

RAW read-after-write

RISC reduced instruction set computers

SFU special functions units

SIMD single-instruction multiple-data

SIMT single-instruction multiple-thread

SR Latch set-reset latch

TRNG true random number generators

VLIW very-long-instruction-word

VP vector processors

WAR write-after-read

WAW write-after-write

x86 Intel's processors ending in ”86”, including the 8086, 80186, 80286, 80386, and 80486 processors

List of symbols

COmod Modular crossover selector for GA crossover in Sec. 4.1

C0, C1 Children from the groups of parents, P0 and P1, for GA crossover in Sec. 4.1

E Expected histogram for the Chi-squared goodness-of-fit test

LP Cardinality of the group of parents for GA crossover in Sec. 4.1

M Number of segments in x

N Index of the last element of a finite set or subset with cardinality N + 1

O Observed histogram for the Chi-squared goodness-of-fit test

PX Probability distribution of X, which is assumed to be a uniform distribution U[0, 1]

P0, P1 Groups of parents for GA crossover in Sec. 4.1

Rm(g) Autocorrelation function of xm(g)

Rxx Autocorrelation function of x

RtElit Elitism ratio for GA crossover in Sec. 4.1

RtMut Mutation ratio for GA crossover in Sec. 4.1

RtSel Selection ratio for GA selection in Sec. 4.1

SD{σS(·)} σS standard deviation error from the user-specified standard deviation level, σ0, in Sec. 4.1

Sm(g) Power spectral density function of xm(g)

Sxx Power spectral density of x

X Random variable

Γ{·} Gamma function

α Gamma distribution first parameter, or shape

α0 Predetermined level of significance for the X² goodness-of-fit test

β Gamma distribution second parameter, or scale

α̂ Gamma distribution first parameter estimator

β̂ Gamma distribution second parameter estimator

x̂i Subset of x, such that x̂i = {xn ∈ x | i ≤ n < i + k}

OQ(·) σS error from the user-specified standard deviation level, σ0, expressed as the Euclidean distance between the mean and standard deviation errors in Sec. 4.1

Z The set of integers, Z = {..., −2, −1, 0, 1, 2, ...}

Z+ The set of positive integers, Z+ = {1, 2, 3, ...}

F{·} Fourier transform

H0 Null hypothesis

H1 Alternative (not null) hypothesis

N{·,·} Normal distribution

X²{·,·} Chi-squared goodness-of-fit test

µ Mean

R̄m(g) Mean of Rm(g)

σ̄S(·) σS mean error from the user-specified standard deviation level, σ0, in Sec. 4.1

x̄ Sample mean

ϕ{·} Characteristic function of a probabilistic distribution

σ Standard deviation

σm(g) Standard deviation of Rm(g)

σ0 User-specified standard deviation level for Sm

σSm(g) Standard deviation of Sm(g)

g{·,·} Gamma distribution

ς Sample standard deviation

°C Degrees Celsius

dini, dend Modular crossover origin and destination positions for GA crossover in Sec. 4.1

df Degrees of freedom in the X² goodness-of-fit test

e Euler's number

f Frequency

fx Probability mass function of x

j Imaginary unit, √−1

k Interval or lag interval

p0, p1 Elements from the groups of parents, P0 and P1, for GA crossover in Sec. 4.1

x Subset of random values

xm(g) The m-th segment of x at the g-th generation

Chapter 1

Introduction

According to ancient Chinese mythology, in the beginning of time there was the great emperor Fu Xi (伏羲) who ruled the world. He was a wise and curious mythological being, with a human face and the body of a snake. As an emperor, he wanted to give something to help his people. One day, while observing the Heavens and the Earth, he became inspired and began drawing patterns and markings. He called these patterns the Eight Trigrams. This was his gift to his people, a tool that would allow them to connect to their Gods by randomly generating the trigrams from yarrow stalks [1]. According to historians, it is believed that Fu Xi was a human being during the early patriarchal society in China near mid-2800 BCE [2], and there is evidence of his legend having roots in the Neolithic period in China (8500-2070 BCE) [3], making him one of the first people in history to think about the concept of randomness. Since the beginning, humanity has always sought to explain the nature of randomness, but it is its role in probabilistic thinking that is paramount to its presence in history. And even today, we still seek the answer to how we can achieve true randomness.

1.1 The history of random numbers

The trigrams, associated with the conscious state of mind, play a major philosophical role in Chinese culture. A trigram is formed by three lines which are either continuous or discontinuous, also known as Yin and Yang (阴阳). The Yin and Yang have a strong influence on the mentality in Asia, polarizing the Universe by balancing between the Yin at one state and the Yang at the opposite state. Two trigrams can then be combined to form a hexagram, giving 64 possible hexagrams. The hexagram is a key concept in the I Ching (易经), also known as the Book of Changes, one of the oldest and most reviewed oracles even in the present day [4, 5]. This oracle works by randomly generating a hexagram from which interpretations allow for predictions of your own fortune.

There exist a couple of methods that involve either coins or yarrow stalks to generate the hexagrams. The oldest and most traditional method to generate the hexagram consists of manipulating a pile of yarrow stalks [4]. First, a pile of 50 stalks is selected, with one stalk removed. Then, the remaining 49 stalks are separated into two piles. From the first pile one stalk is removed, and from the second pile, the remainder of four is recorded. From the first pile, the same process is repeated, and the two recorded remainders are added. If the addition is eight you count two, or if it is four you count three. The process then gets repeated two more times, extracting one stalk every time at the beginning. At the end of the process, the three numbers obtained from each iteration are summed, which will add to either six, seven, eight, or nine. This corresponds to the four types of lines in the hexagrams, which later are reduced to either a Yin or a Yang. This process is repeated six times, which eventually results in one of the hexagrams from Fig. 1.1.

Figure 1.1: On the left, the Pakua or Bāguà (八卦) representation, literally ”eight trigrams”, by BenduKiwi, licensed under CC-AS 3.0; the 8 basic trigrams are presented surrounding the Yin and Yang, the duality of all things in nature. On the right, the I Ching hexagrams are presented in ascending order, starting at 0 in the upper left corner, going left to right and top to bottom, to 63 in the lower right corner.

The I Ching has been a book of cultural influence all across Asia, and even today it continues to influence decisions and lives. In almost every ancient civilization, devices and methods were fused with mythology to relate the effects of randomness with simple purposes of divination, decision making, or entertainment [6]. Among those devices, archaeologists have found multi-sided dice, stalks with marked sides, bones, pebbles, and game boards that in combination were the centre of moral stories and adventures. Later in history, thanks to commerce, exploration, and war, these wonderful stories and games were disseminated across distant cultures.

Figure 1.2: On the left, The Royal Game of Ur from Mesopotamia, by the British Museum, licensed under CC-BY-3.0. On the right, an Egyptian die (600-800 Before Common Era (BCE)) by the Swiss Museum of Games, licensed under CC-SA 4.0.

Today's cubic die has its origins in ancient Mesopotamia, near Northern Iraq [7, 8], and also in Mohenjo-Daro in Pakistan [9], both around 2750 BCE. During this

period other types of dice were also used, such as triangular and rectangular prisms.

In 1320 BCE, it has been found that the Egyptians also used a similar cubic die,

as shown in Fig.1.2, made out of bone [10], which later was discovered to be used

by the Greeks and Romans [11] as well. The Greeks were fascinated by the effects

of randomness and started assigning god’s names to the different combinations of

numbers in a throw, such as a ”Venus throw” where all the numbers are different

[6]. In Egypt near Thebes, around the year 2800 BCE, another interesting board

game was played using reeds with convex and concave sides. In this game, the reeds

were tossed similar to how coins are tossed, making it very popular. This would later

result in its depiction on the walls of the Beni Hassan tombs [6]. In the same way,

a similar game called ”odd-or-even” became popular among Greeks and Romans at

a later time. More elaborate games would also be developed and gain in popularity.

Around 1876-1786 BCE, the Egyptians had a game similar to Snakes and Ladders that is believed to be accompanied by shells or pebbles that were thrown as dice [12].

The interpreted numbers would then move ivory hounds and jackals along a path, where the winner is the individual who finishes the path in the shortest number of throws.

Across history, there are numerous records of the moral effects of gambling and the use of random numbers. Some served as warnings or lessons about the abuse and dangers of gambling. One of the earliest records among the Hindus is the Rgveda

Samhita (ऋ嵍वेद संिहता) [13]. In this poem, called the Lament of the Gambler, the gambling obsession of a man results in him losing everything and distances him from his family. In this poem, the man blames his misfortune on his addiction to the magic of the dice, not God's interventions or fate. In another epic narrative, the Mahabharata (महाभारतम्) [14], the story is centred on gambling and its moral implications.

The narrative starts on the struggle between two far relatives, Duryodhana from the

Kauravas, and Yudhishthira from the Pandavas. Both argue to be first in line to the throne of the kingdom of Hastinapura, which eventually divides the royal family into two sides: one in favour of the Kauravas and the other in favour of the Pandavas. The core narrative of the story is how an uncle from the Kauravas challenges Yudhishthira from the Pandavas to a game of dice. The uncle intentionally uses loaded dice and cheats Yudhishthira. The dice are made out of hardened nuts with five flat sides, and the game they played involves calling the thrown dice as odd or even and counting them. If the player guessed right, the stakes were won; otherwise, the bet was lost.

But under all the pressure, Yudhishthira was destined to lose everything including his wealth, his properties, and even his family to servitude. At the end, he even loses his wife on a bet.

Figure 1.3: An Illustration from the Mahabharata: The Humiliation of Draupadi, by an unknown author, India, 1830-40. This moment depicts when the Pandavas are tricked by the Kauravas into a fixed game of dice in which they lose all their wealth and kingdom.

But the game continued until the Pandavas were forced into exile for 12 years, as depicted in Fig. 1.3. Upon their return, the Pandavas allied with other families and the God Krishna, then went to war with the Kauravas. Inevitably, the

Mahabharata ends when the Pandavas win the war. The last Pandavas ascend to

Heaven and Krishna leaves Earth [14]. The Mahabharata is a story of morals, war, and justified acts that centre around gambling.

Other civilizations created chance devices with materials that were abundant in their regions. The American Indians had chance devices such as two-sided dice made out of bones, wood, seeds, beaver teeth, claws, walnut shells, or stones, as seen in

Fig. 1.4. Some were not symmetric in shape, such as the Eskimo's six-sided die in the shape of a chair, while the Papago Indians used bison vertebrae as two-sided dice [6, 15]. In the 1600s, during the Renaissance, gambling and dice-based games were already well established and their randomness started to catch the attention of physicians and mathematicians.

Figure 1.4: 19th century Tsimshian Bag with 65 Inlaid Gambling Sticks; maple wood, abalone, pigment, hide and tooth; presented at the Brooklyn Museum in the Museum Expedition, 1905.

Among them was Girolamo Cardano who wrote the

Liber de Ludo Aleae [16]. The book demonstrates ways to calculate the winning odds in the common games during his time. This simple work would later evolve into a broader area of research around the understanding of patterns in the random numbers of the games, propelling the very beginnings of quantitative sciences during the

Renaissance. For the next two centuries a large group of mathematicians, including

Blaise Pascal and Pierre de Fermat, dominated the study of randomness [17]. It was not until the 1800s, when calculus and the representation of improbable events was in its maturity, that the modern theory of probability was established. In 1713, Jakob

Bernoulli [18] and Abraham de Moivre [19] started to represent simple events, like flipping coins, with their possible outcomes. This is also the point where the concept of the sample average is introduced. The sample average is the proportion of cases with a known outcome over a number of tries. Bernoulli, in his studies, showed that as the sample size, or number of tries, was increased the sample average would not deviate

within a fixed margin of error. This was the beginning of what we recognize today

as the ”Law of Large Numbers”, but it is to the credit of Moivre who discovered the

curve shape that describes the behaviour of the tries around the sample average, also

known as a normal bell or Gaussian curve shape. This breakthrough happens to be

found in many sources of data around us. It was not until the last two centuries that

the study of uncertainty was dispersed and began to connect the study of physics

and chemistry, developing into a more sophisticated field. In 1933, the Russian

mathematician Andrei Kolmogorov formalized the understanding of random events

and an axiom system, using set and measure theory, that elegantly allowed the explanation of random behaviour [20, 21]. Later, Kolmogorov contributed to the Allied efforts during World War II (WWII) with his study on firing dispersion theory and the characterization of efficient firing strategies [22]. His contributions extend to all

of mathematics, but his particular strength was his contributions to statistics.

Prior to WWII Claude Shannon, an innovative mathematician at Bell Telephone

Laboratories, formulated the idea of calculating the information content of a message

which eventually marked him as the father of information theory. His contributions

to communications are enormous, but his view of how randomness can be measured

marked a new chapter. Shannon had the idea of having a pool of choices or symbols

where the uncertainty of selecting a symbol over a message was defined by the number

of possible choices and the length of the message [23]. For example, in a binary set

with only 0 and 1 as symbols and a message of length n, there are 2^n possible messages, and as the length of the message increases the degree of uncertainty also increases. Shannon's major contribution was to define the average information content as the sum of the likelihoods of every possible symbol in the message, which he denoted as

Shannon Entropy.

The study of random numbers has drastically accelerated since the 1970s, with the rise of computer science advancing algorithm development. Random numbers have made an impact in the arts, simulations, gaming, and security applications. The study of computer-generated random numbers is a fascinating area of research that has been addressed by multiple researchers [24, 25, 26, 27]. Today, the established theoretical foundations behind random numbers highlight conditions which are ideal for encryption, gambling, and simulation, among others. However, these conditions have a major trade-off between accuracy, portability, speed, and scalability. Therefore, the need to develop a balance among these properties, and a solution for the customized type of generator, becomes paramount.

In the following section, we will discuss the concepts around random numbers.

1.2 Random Numbers

Among the different disciplines of applied sciences, random numbers are the pillar of many theories. This theoretical, elegant, and abstract concept explains how elements are selected from a given set of possible outcomes, associating a probabilistic function that describes their variable behaviour. Even in its simplicity, its exact implementation is nearly impossible. Therefore, random variables are represented by the outcome of deterministic algorithms or by the extraction of samples from physical phenomena. These methods produce a sequence of random numbers or objects as the outcome, whose behaviour is hard to distinguish from the behaviour of a random variable, as mentioned by L'Ecuyer [24].

Figure 1.5: Mapping diagram of a random variable X and its corresponding sampling space S.

In his work, he presents his point of view on the interaction between random number requirements and the context of their application. For example, in random numbers generated for security or gambling, the sequence plays a key role; while for other applications like the Monte

Carlo method the uniformity of the distribution, or entropy, has a major role.

Formally, we can define a random number, x ∈ R, as an outcome of a random variable X, X being a function that maps a sample, s, in the sample space S to x. A discrete random variable is a random variable whose set of values is finite or countably infinite. For example, in a finite sequence {x_0, x_1, x_2, ..., x_N}, X is the discrete random variable that maps x_i = X(s_i), as shown in the mapping diagram in Fig. 1.5. Random variables can be described by their statistical properties, like their probabilistic distribution, P(·) or f. The notation P(X = a) is conveniently used to express the probability of the set of outcomes at which the random variable X takes the value of a.

Figure 1.6: (a) Example of random values x from a computer generated algorithm. (b) Correlation plot between random values.

From this point forward, we will consider the use of discrete random variables only.

For example, as in Fig. 1.6(a), we can observe multiple realizations of X, where the i-th realization of X, x_i ∈ [0, 1], is mapped by s_i ∈ [0, 2^{24} − 1], representing a sample between zero and one at the maximum resolution using an IEEE 754 floating point format. As well, in Fig. 1.6(b), random numbers can be plotted in terms of their correlation by observing how patterns align between consecutive values. In general, random numbers should show a uniform distribution and be unpredictable. The following subsections will discuss these qualifications in more detail.
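As a concrete illustration of this mapping (a sketch only: the uniform integer source and the 24-bit resolution below are assumptions for the example, not the thesis' extraction stage), one can draw raw samples s_i and rescale them to realizations x_i ∈ [0, 1]:

```python
import numpy as np

rng = np.random.default_rng()           # stand-in source of raw samples s_i
s = rng.integers(0, 2**24, size=1024)   # s_i in the sample space [0, 2^24 - 1]
x = s / (2**24 - 1)                     # realizations x_i = X(s_i) mapped into [0, 1]
```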

1.3 Characteristics of random numbers

The abstract concept of a random number and its properties can be reduced to the behaviour described over a sample of numbers and the behaviour inferred about the population as a state of the system that generates random numbers.

1.3.1 The uniform distribution of random numbers

The distribution of X can be described by the pmf , or fx. The pmf is characterized as having three distinctive properties:

1. The pmf is always non-negative: \( f_{x_i} \geq 0, \ \forall i \).

2. The area below the pmf is always 1: \( \sum_i f_{x_i} = 1 \).

3. The cumulative distribution function of X at a can be calculated as \( F(X = a) = \sum_{i=-\infty}^{a} f_{x_i} \).

A pmf can be easily calculated by creating a histogram, which only shows the counts per bin ( or per interval), and then recalculating the probability per bin.
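A minimal sketch of that procedure, assuming K equally spaced bins over [0, 1] (the bin count is an illustrative choice):

```python
import numpy as np

def empirical_pmf(x, k=64):
    """Estimate the pmf of samples x in [0, 1] from a K-bin histogram."""
    counts, edges = np.histogram(x, bins=k, range=(0.0, 1.0))
    pmf = counts / counts.sum()                  # probabilities per bin sum to 1
    centres = 0.5 * (edges[:-1] + edges[1:])     # pmf samples aligned with bin centres
    return centres, pmf
```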

In engineering, as in many other fields, arbitrary probabilistic distributions are often used. It is a common practice to use random number generators that output the target distribution. As well, other methods rely on modifying a source distribution into the target distribution. This is done by applying integration by substitution of a new discrete random variable, Y, over the probability mass function f, where Y = T(X) is the transformation function from X to Y.

Figure 1.7: Probability mass function (a), autocorrelation (b) and power spectral density (c) of x random numbers in an ideal case scenario. Probability mass function (d), autocorrelation (e) and power spectral density (f) of x random numbers from a computer generated algorithm.

For example, P(a ≤ X < b) = P(T(a) ≤ Y < T(b)) can be solved as follows,

\[
P(T(a) \le Y < T(b)) = \int_{T(a)}^{T(b)} f\!\left(f_y^{-1}\right)\frac{dx}{dy}\,dy = \int_{T(a)}^{T(b)} g(y)\,dy
\tag{1.1}
\]

where g(y) is the pmf associated with Y. This is simplified in the particular case of a Uniform distribution selected over the support that makes f = 1. For example,

considering the Uniform distribution U(0, 1) as in Fig. 1.7(a), with the following pmf,

\[
f_{U(x)} =
\begin{cases}
1 & 0 \le x < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{1.2}
\]

the transformation function is simplified from \( g(y) = f(f_y^{-1})\frac{dx}{dy} \) to \( g(y) = \frac{dx}{dy} \). In reality, for some random number generators obtaining Fig. 1.7(a) is difficult to achieve

and their designs are instead intended to get the best quality across different statistical

characteristics. A typical response is shown in Fig.1.7(d).

A common way to evaluate the shape of the distribution is by calculating its

normalized Shannon entropy (H), as proposed in [28] and commented in [29]. This

metric is a measure of the uncertainty associated with the random variable, or the

unpredictability of the information content. Formally, the Shannon Entropy[30] of a

discrete random variable X is

\[
H_o(X) = \mathbb{E}\left[-\ln(P(X))\right]
        = -\sum_{i=0}^{K} P(X = x_i)\log(P(X = x_i))
        = -\sum_{i=0}^{K} f_{x_i}\log(f_{x_i})
\tag{1.3}
\]

where E is the expected value operator, and K is the number of possible values of

X. Another way of calculating P (X = xi) or fxi is by constructing a normalized

histogram, where the pmf samples are aligned with the centres of the histogram bins. In a histogram of the possible values of X, the entire range is divided into a series of intervals or bins. At each bin, the number of cases in which a value falls within the interval is counted and normalized by the total number of values of X. From Eq. 1.3, we

can calculate the entropy of a Uniform distribution: considering that \( f_{x_i} \approx \frac{1}{K} \), then H_U(X) = log(K). Therefore, the normalized Shannon Entropy can be represented as follows,

\[
H(X) = -\frac{1}{\log(K)}\sum_{i=0}^{K} f_{x_i}\log(f_{x_i})
\tag{1.4}
\]

where the entropy is normalized by the entropy value of a Uniform distribution, H_U(X). In Eq. 1.4, H(X) is bounded to the range [0, 1], in which a value closer to 1 indicates that the random numbers generated are more uniformly distributed. In the opposite case, if H(X) tends to 0, then the random numbers generated are close to the same value, since their distribution is of an impulse shape.
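Eq. 1.4 follows directly from such a normalized histogram; a short sketch (the convention 0·log(0) = 0 is applied by dropping empty bins):

```python
import numpy as np

def normalized_entropy(pmf):
    """Normalized Shannon entropy H(X) in [0, 1]; values near 1 indicate a near-uniform pmf."""
    pmf = np.asarray(pmf, dtype=float)
    k = len(pmf)
    p = pmf[pmf > 0]                         # empty bins contribute 0 to the sum
    return float(-np.sum(p * np.log(p)) / np.log(k))
```

For instance, a pmf concentrated in a single bin gives H(X) = 0, while a flat pmf over K bins gives H(X) = 1.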

The normalized Shannon Entropy is not the only metric that can be used to describe the shape of the distribution compared to the Uniform distribution. An extensive and comprehensive summary of the different metrics for uncertainty and fuzziness is presented in [29, 31, 32, 33, 34] and [35], from which the most simple and well-suited metrics are the normalized Shannon entropy suggested by Kaufmann, and the Kullback-Leibler divergence metric. The Kullback-Leibler divergence or discrimination information measure (D_KL) [36] compares the entropy difference between two distributions. D_KL is often used to represent the information gain or loss against a reference distribution. In this way, considering that f_y is the reference pmf of Y and f_x is the pmf of X under evaluation, both random variables over the same space,

\[
D_{KL}(f_x \parallel f_y) = H(X, Y) - H_o(X)
                          = -\sum_{i=0}^{K} f_{x_i}\log(f_{y_i}) + \sum_{i=0}^{K} f_{x_i}\log(f_{x_i})
\tag{1.5}
\]

where H(X, Y) is the cross entropy between f_x and f_y. While D_KL seems to be an

appropriate conceptual metric compared to the normalized Shannon Entropy, it is

related by the following equation derived from [37]

\[
H(X) = \frac{\log(K) - D_{KL}(f_x \parallel f_y)}{\log(K)} = 1 - \frac{D_{KL}(f_x \parallel f_y)}{\log(K)}
\tag{1.6}
\]

considering that fy is the pmf of a Uniform distribution. Here, DKL ≥ 0 and only

if both distributions are Uniform then DKL = 0. The log(K) acts as a scale factor

or normalization, similarly done as in the normalized Shannon Entropy. After the

normalization, the term \( D_{KL}(f_x \parallel f_y)/\log(K) \) is bounded between [0, 1], acting as the complement to 1 of the normalized Shannon Entropy. From here on, we will focus on using the

normalized Shannon Entropy over Kullback-Leibler divergence, for simplicity.
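The relation in Eq. 1.6 can be checked numerically under the same binning; the sketch below computes D_KL against a uniform reference (a small constant is added only to avoid log(0) on empty bins, which slightly perturbs the identity):

```python
import numpy as np

def kl_to_uniform(pmf, eps=1e-12):
    """Kullback-Leibler divergence D_KL(f_x || f_y), with f_y the uniform pmf over K bins."""
    pmf = np.asarray(pmf, dtype=float)
    k = len(pmf)
    fx = pmf + eps
    fx = fx / fx.sum()
    fy = np.full(k, 1.0 / k)
    return float(np.sum(fx * np.log(fx / fy)))

# Eq. 1.6: normalized_entropy(pmf) ≈ 1 - kl_to_uniform(pmf) / np.log(len(pmf))
```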

1.3.2 The correlation of random numbers

For a random variable X with the pmf of a Uniform distribution, the realizations of X should be uncorrelated between themselves. This can be observed as non-geometric patterns or trends between pairs of contiguous realizations, such as (x_0, x_1), (x_1, x_2), ..., (x_{N−1}, x_N), as in Fig. 1.6(b).

In time series, the cross-correlation, RX,Y , is a mathematical tool that calculates the statistical relationship including dependence between two time series, X and Y .

The cross-correlation is not limited to evaluating distinct time series; it can also be used to calculate the correlation of a time series with itself, which is known as autocorrelation and denoted R_{X,X}. The autocorrelation R_{X,X} is of an impulse shape only

when X is independent and long enough so RX,X ≈ δ, as shown in Fig.1.7(b). When

the autocorrelation shape is different from an impulse, it presents important features

that identify the non-randomness nature of X by highlights the periodic features of

X[38]. Formally, the autocorrelation function, RXX [k], is defined by the expectation

between X separated by a k lag under the assumption that all realizations, or x, are equi-distant and share the same space[39],

\[
R_{xx}[k] = \frac{\mathbb{E}\left[(X_i - \mu)(X_{i+k} - \mu)\right]}{\sigma^2}
\tag{1.7}
\]

where k ∈ [1 − N, N − 1] is the lag interval, N is the number of realizations of X, and µ and σ are the mean and standard deviation of X. As well, R_{XX} can be calculated by replacing µ and σ² with the sample mean, \( \bar{x} = \frac{1}{N}\sum_{i=0}^{N-1} x_i \), and the sample variance,

\[
R_{XX}[k] = \frac{\sum_{i=0}^{N-k-1} (x_i - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=0}^{N-1} (x_i - \bar{x})^2}
\tag{1.8}
\]

When there is no lag or k = 0, the autocorrelation achieves its maximum value,

RXX [0] = 1, which is interpreted as the lag at which X is highly correlated with itself.

Alternatively, as k ≠ 0 the autocorrelation is always |RXX [k ≠ 0]| < 1 indicating the strength of the periodic patterns as it is closer to 1.

In Eq. 1.8, the denominator is intended to normalize the autocorrelation function by bounding it to [−1, 1]. Commonly, realizations of random variables, like X, are

generated through computer-generated random numbers, which may contain small

traces of correlation. In general, the autocorrelation of computer generated random

numbers has a close-to-zero response, R_{XX}[k ≠ 0] ≈ 0, as shown in Fig. 1.7(e). As well, other random numbers are intentionally designed to have a non-zero autocorrelation.
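A direct implementation of Eq. 1.8 for non-negative lags (a sketch; R_{XX} is symmetric in k, and an FFT-based formulation such as the one used later in Chapter 4 is preferable for large N):

```python
import numpy as np

def autocorrelation(x):
    """Normalized sample autocorrelation R_XX[k] of Eq. 1.8 for lags k = 0 .. N-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc * xc)                      # normalization bounds R_XX to [-1, 1]
    return np.array([np.sum(xc[:n - k] * xc[k:]) / denom for k in range(n)])
```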

1.3.3 The spectral density of random numbers

Another useful tool to review the uncorrelation of random numbers is the power spectral density. It relies on the fact that, through Fourier analysis, X can be decomposed into frequency components describing the energy distribution per unit of frequency, a distinctive signature between sequences of numbers. The relation between the autocorrelation and the power spectral density can be established with the

Wiener-Khinchin theorem, which works under the assumption that X is the output

of a stationary process. The stationarity of random numbers will be addressed in the

following subsection. For the time being, consider that the power spectral density

function is formally defined as:

\[
S_{XX}[f] = \mathcal{F}\{R_{XX}\} = \sum_{k=1-N}^{N-1} R_{XX}[k]\, e^{-2\pi f k j}
\tag{1.9}
\]

where f denotes the frequencies at which the power spectral density is evaluated, and j is the imaginary unit.
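Eq. 1.9 can be evaluated as the discrete Fourier transform of the two-sided autocorrelation sequence; the sketch below builds the lags k = 1−N ... N−1 directly and takes the magnitude to discard the linear phase introduced by the index shift:

```python
import numpy as np

def power_spectral_density(x):
    """S_XX[f] of Eq. 1.9: DFT of the normalized autocorrelation over lags 1-N .. N-1."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    r_full = np.correlate(xc, xc, mode="full") / np.sum(xc * xc)   # lags 1-N .. N-1
    return np.abs(np.fft.fft(r_full))
```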

1.3.4 The stationarity of random numbers

Another important characteristic of random variables is the stability over time of their probability distributions. In many fields, including engineering, statistics, and economics, the study of data collected through time is called time series analysis. A time series is a sequence of observations ordered by a time index t that takes values from an index set Ṡ; this index set is finite in the case of a discrete-time time series.

When a sequence of random variables is defined with a probabilistic distribution ordered as a time series, we refer to it as a stochastic process[40], {Xt(·)}.

While in classical statistical inference a random variable and a selection scheme based on events from a sample space are used to extract the observations, or realizations, as random values corresponding to the selected event (as defined in Sec. 1.2), in time series analysis [41] a stochastic process {X_t(·)} is observed at a single event w, with a finite number of observations happening at t = 1 ... N. Therefore, an observation of event w is defined as x_t = X_t(w). This highlights the need to make the assumption of time-homogeneity, so the time series can be treated as in classical statistical inference; otherwise, every observation x_t is as if it were sampled from a different random variable X_t. Stationary processes are processes that make such an assumption possible, which fits phenomena that are well modelled by smooth transitions over limited periods of time.

In time series analysis, the assumption of stationarity is one of the most important considerations. This assumption comes at two levels: strict or strong-sense stationarity (sss) and wide-sense stationarity (wss). The sss makes the assumption of a time-invariant probabilistic distribution over the whole data generation process.

Formally, this is

\[
(X_{t_1}, X_{t_2}, \ldots, X_{t_k}) \stackrel{d}{=} (X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_k+h}),
\qquad \forall k, h \in \mathbb{Z},\ (t_1, t_2, \cdots, t_k) \in \mathbb{Z}^{k}
\tag{1.10}
\]

where the process {X_t(·)} is sss and \( \stackrel{d}{=} \) means equal in distribution. In other words, an sss process is characterized by keeping its pmf constant for any displacement over the sample indexes, that is, \( \forall h \in \mathbb{Z},\ f_{X_{t_k}}(\cdot) = f_{X_{t_k+h}}(\cdot) \). While all the statistical moments are considered to be time-invariant for sss, for wss only the first two moments are time-invariant. That is,

\[
\mathbb{E}(X_t) = \mu,
\qquad
\mathrm{Cov}(X_t, X_{t-h}) = \gamma^{*}(h)
\tag{1.11}
\]

where γ*(0) < ∞. Also, from this definition and as explained in [42], the following five properties can be derived for the case of sss:

1. The random variables X_t are identically distributed.

2. \( (X_{t_1}, X_{t_2}, \ldots, X_{t_k}) \stackrel{d}{=} (X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_k+h}) \) holds for all k, h ∈ ℤ.

3. {X_t} is also weakly stationary if \( \mathbb{E}(X_t^2) < \infty \) for all t.

4. Weak stationarity does not imply strict stationarity.

5. An independent and identically distributed (iid) sequence is strictly stationary [42].

The first two properties follow the definitions in 1.10. Then, if the covariance is finite

and the two first properties stand, then the mean and covariance are independent of

time, which proves the third property. The fourth property implies that having the two first moments be time-invariant does not guarantee time-invariance for all moments. The last property [42, 43] is established under the assumption that an iid sequence can be regarded as sharing the same distribution over the data generation process; if this iid sequence is also the process {X_t(·)}, then the joint distribution function of (X_{1+h}, ..., X_{N+h}) at (x_1, ..., x_N) is \( f_{x_1} \cdots f_{x_N} \), which is independent of h.

Unfortunately, in most cases sss is not achievable; therefore, a less restrictive condition is used, wss or covariance stationarity. A wss process requires only the mean and the autocorrelation to be constant over all sample indexes, presenting a more relaxed constraint.

According to Basu et al. [44], one way to confirm the wss of a process is to compare the power spectral density among all segments of the output. Considering that X is the output, we will extend the notation of X, where X̂_i is the i-th segment, so X̂_i ⊂ X. Then,

\[
\forall i, k \quad S_{\hat{X}_i \hat{X}_i}[f] = S_{\hat{X}_{i+k} \hat{X}_{i+k}}[f]
\tag{1.12}
\]

where X̂_i and X̂_{i+k} represent two different segments of X. Given that X has M segments, comparing the i-th segment to every other M − 1 segments is computationally expensive. Therefore, in Chapter 4 we propose another strategy that relaxes this constraint and reduces the computational burden of [45].
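A minimal sketch of this check, comparing the PSD of each segment against the first segment only and reusing the power_spectral_density sketch from Sec. 1.3.3 (the segment length and tolerance are illustrative choices, not values prescribed by the thesis):

```python
import numpy as np

def looks_wss(x, seg_len=1024, tol=0.05):
    """Rough wide-sense stationarity check in the spirit of Eq. 1.12."""
    x = np.asarray(x, dtype=float)
    segments = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)]
    psds = [power_spectral_density(s) for s in segments]
    ref = psds[0]
    # All per-segment PSDs should stay within a relative tolerance of the reference.
    return all(np.mean(np.abs(p - ref)) / np.mean(ref) < tol for p in psds[1:])
```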

1.4 Types of random number generators

A widely accepted distinction between random numbers is based on their generation method: either by generating a long sequence or by sampling a physical phenomenon.

The first type of generator is called pseudo random number generator (PRNG), which is the most commonly used generator. PRNGs are portable, fast, and reproducible.

Among their vast number of applications, PRNGs are used in encryption, simulations, art, and communications, just to name a few. At its core, a PRNG walks through a long sequence, generating random numbers using a deterministic algorithm with a finite number of states, which makes it predictable. Another natural inconvenience of this design is requiring an initial seed state, and the generator's entropy is dictated by the entropy of the seed, as mentioned in [46]. Although PRNGs are indisputably widely used, their entropy dependency and seed selection make it challenging to exploit their parallelism and potential to scale up. PRNGs can be classified depending on the type of algorithm, for example as congruential generators, feedback shift register generators, generators based on cellular automata, and generators based on chaotic systems. A few good basic references on the details of PRNG designs are the works of [25], [26][Vol. 3], and [27].

True random number generators (TRNG) are the second type of generator. This type of generator is designed to sample a non-deterministic source, or a noise source, to generate the random numbers. Today's research on TRNGs is focused mainly on noise sources [46, 47]. The designs are based on noise such as radio noise, radioactive decay, thermal noise generated from a semiconductor diode, or thermal flow. Other approaches sample the user's interactions or hardware events, such as keystrokes,

mouse movement, hard-drive seek times, system activity, and configuration or delays, as explained in [46].

Figure 1.8: Minimum block design of a true random number generator

In general, the TRNG process can be simplified into two blocks: extraction and post-processing, as seen in Fig. 1.8. In the majority of cases, the extracted values, known as digitized analog signal random numbers (DAS random numbers), will not satisfy the target statistical properties, that is, being iid or having a uniform distribution, as explained in [24]. For this reason, a post-processing phase is adopted where the DAS random numbers are modified to meet the target statistical properties. At present, it is common for a TRNG to be selected as an input to a PRNG, in an effort to alleviate the entropy dependency on the seed.

Although PRNG research has been exhaustively covered, the generation of the unknown seeds based on a TRNG has been overlooked [48]. The National Institute of Standards and Technology (NIST) is making a great effort in developing a draft set of regulations for the appropriate use of TRNGs. Still, it has not specified a set of approved TRNGs for standardization, due to the large diversity of techniques.

In the TRNG literature, there are multiple types of sources, designs, and extraction techniques [49, 50, 51]. In some cases, the design uses a thermal noise circuit that is amplified and sampled [49]. Other designs focus on sampling the jitter of a phase-locked loop circuit found on application-specific integrated circuit (ASIC) based technologies [52]. In other cases, the trend seems to exploit purely the artifacts found in digital circuits, like the metastable state of flip-flop circuits [53]. Additionally, researchers have also explored the use of random sources from audio and video signals [54]. In 2011, Intel re-engineered [55] its own TRNG design to target a solution to security threats on the Ivy Bridge processor. In this new design, the noise circuit relies on thermal noise and the meta-stable state of a set-reset latch (SR latch), where the source of random binary digits is located at the output of the SR latch. One of the main features of this new design is the on-line health tests, or on-line self test entropy (OSTE), which are introduced to empirically test the frequency of binary patterns within a certain range and to test against intentional attacks. Only the healthy sequences are appended to an entropy pool. The entropy pool is then passed through a conditioner block that uses the advanced encryption standard (AES) cipher block chaining message authentication code (CBC-MAC) technique until it is refined enough to pass the OSTE, and it becomes the seed of a NIST SP800-90A based PRNG.

Another outstanding, though not very common, category of TRNGs is based on the abnormalities found in the computational architecture. It is well known that allowing competition among central processing unit (CPU) threads for updating a shared variable leads to random results in the updated variable. Colesa et al. [50] found a way to take advantage of this phenomenon. Colesa's study suggested that the execution environment's irregularities, the number of cache misses, the number of instructions executed in the pipeline, and the imprecision of the hardware clock used for the timer interrupts were the main contributors to the random behaviour observed in the shared variable. In [50], Colesa et al. claim this design passes over 90% of the NIST tests after applying a Von Neumann method.
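For context, the Von Neumann method referred to above removes bias from a binary stream by looking at non-overlapping pairs of bits. The following minimal C sketch (the one-bit-per-byte layout is an assumption made only for illustration) shows the idea:

/* Von Neumann de-biasing: non-overlapping bit pairs 01 -> 0, 10 -> 1,
 * 00/11 -> discarded. Input and output hold one bit per byte. */
#include <stddef.h>

size_t von_neumann(const unsigned char *in, size_t nbits, unsigned char *out) {
    size_t produced = 0;
    for (size_t i = 0; i + 1 < nbits; i += 2) {
        unsigned char a = in[i] & 1u, b = in[i + 1] & 1u;
        if (a != b)               /* 01 or 10: keep one unbiased bit */
            out[produced++] = a;  /* 0 for the pair 01, 1 for 10     */
        /* 00 and 11 carry the bias, so they are dropped */
    }
    return produced;              /* number of output bits */
}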

In [56], we developed a TRNG that generates high quality random numbers by exploiting the natural sources of randomness on NVIDIA GPUs, such as race conditions during concurrent memory access. The digitized noise signal generated from non-deterministic sources is passed through a post-processing algorithm, which ensures random numbers with a uniform distribution. In this thesis, the unpredictability and randomness are further elaborated, focusing on improving the post-processing algorithm to generate iid random numbers. In the post-processing step, we apply an evolutionary, genetic-algorithm-based heuristic to produce uniform random numbers that hold the same statistical properties over time. This work is discussed in depth in Chapters 2 and 3.

In summary, existing TRNG work has concentrated on CPU-based architectures. Nevertheless, in my thesis I present a generalized scheme answering the following questions:

• How can a computer architecture be re-utilized to build a TRNG?

• Can the statistical properties of an arbitrary random source be improved?

• Is the TRNG design scalable?

In my thesis, I present a novel TRNG scheme that exploits a generalization of heterogeneous multicore architectures, making it applicable to the majority of compute architectures of today. I show that, independently of the architecture, the extracted random numbers can be improved from a Gamma distribution to a uniform distribution by using heuristics that ensure the improvement of the statistical properties.

The rest of the thesis is organized as follows. Chapter 2 concentrates on the analysis of the extraction techniques for TRNGs, their design, and influencing factors. In Chapter 3, an analysis of the first proposed stage for post-processing is conducted, focusing on the statistical stretching technique and its impact on random number distributions. Chapter 4 focuses on stationarity correction techniques and their impact as a secondary post-processing stage. Chapter 5 evaluates the quality of the post-processing stage using well known evaluation packages for random numbers. Finally, Chapter 6 presents a summary of the work, and Chapter 7 concludes this work by highlighting quality and performance trade-offs, as well as future suggestions and recommendations.

Chapter 2

The Extraction layer on True Random Number Generators

This chapter introduces the extraction of random numbers based on the abnormalities found in computational architectures, their noise source design and variations, implementation details, and performance characteristics. To do this, we discuss a group of underlying computational architectures selected based on their popularity and performance, with particular consideration of the scheduler. Then, we elaborate on the design details, followed by the behaviour of the noise source, and we conclude with two case studies on different architectures, GPU and CPU.

2.1 The computer architectures of today

Over the past 30 years, the most dominant processor architectures have always exploited the concurrency of instruction execution with the primary goal of increasing their performance. Nowadays, processor architectures combine multiple techniques to achieve performance goals. Among the most common ones are pipelining, multiprocessing (or multi-cores), multi-issue instructions, out-of-order execution, and multi-execution of the same instruction [57].

Historically, reduced instruction set computers (RISC) and complex instruction set computers (CISC) have been the trends for concurrent execution. RISC architectures include short load/store instructions. These instructions do not compute directly on memory, but via registers. This is contrary to the CISC architecture, where memory access instructions dominate the instruction set. In this chapter, we discuss RISC architectures, the most dominant architectures in the market.

From the RISC category, the microprocessor without interlocked pipeline stages (MIPS) and vector processors (VP) became the major trends, while for the CISC category the Intel processors ending in "86", including the 8086, 80186, 80286, 80386 and 80486 processors (x86), became the most popular architecture [58].

The MIPS architectures execute instructions out-of-order. That is, the effects of execution of instructions are independent of each other. Using this idea, differ- ent approaches were taken to find independent instructions. Some during compile time (before the program is executed), others during run-time (while the program is executed) [59].

This allows MIPS to promote two different architectures: very-long-instruction-word (VLIW) and super-scalar processors. VLIW processors execute operations in parallel based on the program, and schedule the instructions at compile time. This makes the architecture simpler compared to super-scalar processors, which find independent instructions at run-time. The VP approach focuses on the fact that the program explicitly states many of the data operands to be vectors or arrays, or even loops whose data references can be expressed as vector operands. These architectures are also known as single-instruction multiple-data (SIMD). As well, VP rely on the compiler's and hardware's ability to convert loops into sequences of vector operations. VP use pipelines independent of instructions and pipelines independent of data, and VP applications show increased performance when the data is parallel. Today, the SIMD model exists in graphics cards, for example the Nvidia CUDA design [57].

On the other hand, a super-scalar processor uses instruction level parallelism. The scheduling of instructions is done dynamically during runtime, which is a way of executing more than one instruction during a clock cycle by distributing the execution of instructions across multiple execution units. The main idea behind dynamic scheduling is to issue instructions in order but execute them out of order (OoO). The key behind super-scalar is the scheduling of independent operations, which is widely segmented into control-flow and data-flow based scheduling. For control-flow the best known implementation is the scoreboard, while for data-flow it is the Tomasulo algorithm implementation [57].

The x86 super-scalar architecture emerged as the world's predominant personal computer CPU architecture, not as the fastest or most efficient architecture, but as the most popular design. Table 2.1 summarizes the product line of the major contributor to the x86 architecture, Intel Corp. The x86 architecture relies on the use of complex instructions and a complicated addressing mode, with a limited number of registers, which is inefficient at executing parallel instructions due to potential dependencies. In order to become competitive against the RISC architectures, the x86 evolved into having RISC-like instructions called µ-instructions. The µ-instructions were combined with OoO techniques for the x86, with the exception of instruction scheduling at compile time and large register files to reduce memory access [58]. As shown in Table 2.1, Intel has been increasing the number of threads and the depth of the reordering buffer for the OoO execution, while keeping a moderately sized pipeline.

Table 2.1: Evolution of Intel's pipeline and µ-decoder cache

Year      Model(s)                               Multi-threading   Pipeline          Reordering buffer   OoO       Throughput
                                                 support           depth             depth (µ-ops)       support   (µ-ops/cycle)
1993      Pentium, Pentium MMX                   -                 Double pipeline   4                   no        1.88
1995-99   Pentium Pro, Pentium II, Pentium III   -                 10                40                  yes       3
2000      Pentium 4, Pentium 4E (NetBurst)       2                 20, >20           126                 yes       3
2003-06   Pentium M, Core Solo, Core Duo         -                 13-14             40                  yes       3-4
2008      Core 2, Nehalem                        8                 14, 14-16         96, 128             yes       3-6
2011      Sandy Bridge, Ivy Bridge               4-16              14-16             168                 yes       3-6
2013      Haswell, Broadwell                     4-36              14-16             192                 yes       4
2015      Skylake                                4-8               14-16             224                 yes       4

In today's microprocessor market, the dominant architectures are based on the x86, combining most of the super-scalar techniques to improve performance. This is also known by researchers as decoupled x86 super-scalar processors [58]. Among the major contributors to the evolution of these architectures is Intel Corp. In 2015, Intel Corp. held nearly 80% of the PC market, 60% of the graphics accelerators, and 99% of the server market [60, 61]. While Intel has focused on data server chips, it has not put as much effort into the mobile market, where the ARM processor continues to be the dominant RISC processor, present in 52% of mobile devices [62]. Still, Intel processors are the most widely used processors of today.

Despite Intel's dominant position in the graphics sector, Nvidia has been able to overtake it in performance with different models of graphics cards. As well, Nvidia video cards have been used as general purpose accelerators, achieving massive performance. According to the TOP500 list [63], Nvidia's video cards have been used in the world's fastest computer design twice over the last 10 years. An example would be the Chinese Tianhe-1A system at the National Supercomputer Center in Tianjin with 2.57 petaflop/s in 2010. Another example is the 2012 U.S. Oak Ridge National Laboratory's Titan with 27 petaflop/s. This places Nvidia among the most accepted high performance graphics card designs.

2.2 The idea behind a TRNG on today's computer architectures

In this thesis, random numbers are extracted from hardware, and in particular from race conditions generated in the program. Race conditions occur due to data and instruction dependencies. The hardware schedules these instructions to avoid data hazards and to ensure program order, obtaining the correct results.

Therefore, in order to understand the proposed algorithm, this section is devoted to describing the scheduling algorithm at the hardware level.

2.2.1 Data hazards

In computer architecture, data hazards are a condition where the next instruction in a pipeline is not allowed to execute because it would create an erroneous result. The data dependencies are characteristics of possible data hazards, and they can be classified into three categories according to the situation in which they may occur. These are read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR).

1. RAW. This data hazard refers to the situation where the current instruction

execution depends on the output result from another instruction. This is also

known as true dependency, and an example can be observed in Listing 2.1.

Listing 2.1: RAW hazard example. The subtraction in line 2 requires R1 before the addition in line 1 finishes writing to R1.

1 add R1, R2, R3
2 sub R5, R1, R4

2. WAW. This data hazard represents the situations where two instructions are

concurrently executed having the same output resource, as shown in Listing 2.2.

In this situation, the order of execution is not predictable and it is also known

as an output dependency.

Listing 2.2: WAW hazard example. The addition in line 1 and subtraction in line 2 are competing to write on R1. Only one can execute and the hardware scheduler guarantees line 1 is executed preserving program order.

1 add R1, R2, R3
2 sub R1, R2, R4

3. WAR. This data hazard represents the situations where the currently executed

instruction writes an operand before the earlier instruction reads it, an example

can be observed in Listing 2.3. This is also known as an anti-dependency.

Listing 2.3: WAR hazard example. The subtraction in line 2 cannot write to R1 before the addition in line 1 reads R1.

1 add R3, R2, R1
2 sub R1, R2, R4

2.2.2 Schedulers

As mentioned before, there are two types of scheduling: control-flow based scheduling and data-flow based scheduling. Control-flow is best known by its scoreboard implementation. The scoreboard method is an old concept invented by Seymour Cray for the CDC 6600 mainframes in the 60s [64]. This technique uses a table to track instructions being fetched, issued and executed, resource availability, and the history of changes in registers. Score-boarding is a solution for instruction scheduling in processing pipelines. It is classified as dynamic scheduling since it uses OoO processing to avoid hardware conflicts. The technique relies on using at least four cycles to execute the four stages: decoding instructions (or issue), reading operands, execution, and writing results. Its implementation is characterized by having three blocks: an instruction status block, a functional unit (FU) status block, and the register results status block, as shown in Fig. 2.1. The instruction status block handles the sequence in which the instructions will be executed. The FU status block presents the execution units (FU) as resources. Then the register results status block works as a reference for the results. Most of the time, the technique is successful, except for certain data hazards that cause the instruction execution to halt until the data dependency is resolved [65].

Figure 2.1: Simple scoreboard implementation

For data-flow, the best known implementation is Tomasulo's algorithm. It differs from score-boarding by its ability to rename registers, which overcomes the limitations of score-boarding. The Tomasulo algorithm is a sophisticated algorithm that prevents bottlenecks on registers and data hazards by renaming registers and reordering the instruction sequence, something that the score-boarding algorithm does not handle well. Other optimizations from the Tomasulo algorithm include loop unrolling and branch prediction. In short, the Tomasulo algorithm has three stages: issue, execution, and write back. Its implementation is similar to the scoreboard; it involves a map table, reservation stations, and the vectors of registers, as shown in Fig. 2.2. The map table is similar to the register results status block, and the reservation stations are similar to the FU in combination with the vectors of registers, V1 and V2. During the issue stage, if there is no hazard between resources allocated indirectly between multiple threads, then the reservation station issues the instruction and sends the operands, renaming the registers. During the execution stage, the operands are executed only when they are ready. Otherwise, the execution waits for operands over the common data bus. During the write back stage, the result is broadcast over the common data bus to all the resources that reserve the result.

Figure 2.2: Simple Tomasulo's algorithm implementation

From the three categories of data hazards, the control-flow schedulers based on the score-boarding method do not handle WAW and WAR hazards. This mechanism causes the instruction issue or execution to stall until the hazards are cleared. From a hardware implementation perspective, this technique has a physical limitation on the number of instructions it can stall or solve, but this can easily be overcome by combining other techniques to override the stalling.

On the other hand, data-flow schedulers based on Tomasulo's algorithm present a more robust solution to data hazards by tracking operands for instructions in reservation stations and renaming registers in hardware. Still, there is a physical limitation in that the number of registers and reservation stations is limited, and these are important for parallel execution.

2.2.3 Strategy for designing a TRNG using a compute architecture

As mentioned previously, stressing the number of data hazards can affect the response time of the scheduler independently of the computer architecture. However, there are further variables which can stress the execution behaviour. The following are the unified principal variables that can have a major effect on data-flow and control-flow schedulers.

1. The number of threads. Given a finite number of computer units, the sched-

uler determines the number of threads that can be executed across their execu-

tion units to avoid data hazards. Some of the threads may be scheduled at a

later execution time due to the hardware limitations.

2. The number of induced race conditions. By intentionally repeating data

hazards and consuming the available resources, the scheduler is forced to stall

or wait until resources are available. This is possible since both scheduling

methods issue instructions in order but execute in OoO.

3. The memory access latency. Depending on the implementation in hardware,

the memory access latency changes with the penalty cycles used while waiting

to read from memory. In some cases access latency is one or two clock cycles,

and in others 300 to 400 clock cycles [66, 67, 68]. For most types of memory

modules, the delay time is not a fixed response time. In addition, the coherency

protocol will also influence the response time, since the protocol takes care of

updating the values or making it consistent across threads [69].

4. The instructions latency. The majority of instructions will have a fixed

completion time, which can be used as building blocks for stalling and exposing

the delay time of the algorithm.

For these reasons, and based on our findings in [56], we suggest the following

evaluation kernels targeting the interaction of two types of memories: local (on-chip)

and global (off-chip) memory. Local memory is also known as shared memory in

some cases. Our strategy is based on creating a loop that generates a data hazard

continuously. The accumulated effect of these events is sampled with the internal

timers to generate a customized source of randomness.
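Before turning to the OpenCL kernels below, the timing idea can be illustrated with a CPU-only analogue in plain C and POSIX threads (this is not the thesis implementation; the thread count of 33 and repetition count M = 128 mirror the experiments later in this chapter, everything else is an assumption for illustration): a group of threads repeatedly writes its own slot of a shared array and reads its neighbour's, and the wall-clock time of each launch becomes one DAS sample.

/* CPU-only sketch of the strategy above: induce repeated data hazards on a
 * shared array and time each launch; one elapsed time = one DAS sample.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N_THREADS 33
#define M         128

static volatile unsigned int shared[N_THREADS + 1];

static void *racer(void *arg) {
    unsigned int tid = (unsigned int)(size_t)arg;
    for (unsigned int k = 0; k < M; k++) {
        shared[tid] = tid;                 /* write own slot             */
        shared[tid] = shared[tid + 1];     /* read neighbour: a hazard   */
    }
    return NULL;
}

static double das_sample(void) {           /* one timed launch, in ns */
    pthread_t th[N_THREADS];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N_THREADS; i++)
        pthread_create(&th[i], NULL, racer, (void *)i);
    for (size_t i = 0; i < N_THREADS; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    for (int i = 0; i < 10; i++)
        printf("%.0f\n", das_sample());     /* ten DAS samples */
    return 0;
}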

Listing 2.4, to which we will refer as kernel GMSHM, was our first approach to understanding this phenomenon; here, we combined the effects of global and local memory race conditions on a GPU, as demonstrated in [56]. We also created two

other methods to target the effects of the individual memories, kernel GM for the

global memory race conditions (Listing 2.5), and kernel SHM for the local memory

Listing 2.4: OpenCL kernel causing race conditions using global memory and local memory.

1  __kernel void kernel_GMSHM(__global unsigned int* dMem, unsigned int M, __local unsigned int* dMem2, __local unsigned int* dMem3){
2      int tid = get_global_id(0);
3      unsigned int k = 0;
4      do{
5          dMem2[tid] = tid;
6          dMem3[tid] = dMem2[tid+1];
7          dMem[tid] = dMem3[tid];
8          k++;
9      } while(k < M);
10 }

Listing 2.5: OpenCL kernel causing race conditions using global memory.

1  __kernel void kernel_GM(__global unsigned int* dMem, unsigned int M){
2      int tid = get_global_id(0);
3      unsigned int k = 0;
4      do{
5          dMem[tid] = tid;
6          dMem[tid] = dMem[tid+1];
7          k++;
8      } while(k < M);
9  }

race conditions (Listing 2.6). The programs are written in the high level parallel

language, OpenCL.

The three kernels present a few possible scenarios that provide the reaction time

of the scheduler against race conditions. In lines 5-7 of kernel GMSHM, dMem[tid] cannot be written before executing dMem3[tid] in line 6. This is a WAW hazard.

Similarly, lines 5-6 of kernel GM and kernel SHM also create a WAW hazard. The

methods presented here are not a unique solution to achieve similar behaviour, but

definitely provides an insight into the scheduler architecture. 46 Chapter 2: The Extraction layer on True Random Number Generators

Listing 2.6: OpenCL kernel causing race conditions using local memory.

1  __kernel void kernel_SHM(__local unsigned int* dMem, unsigned int M){
2      int tid = get_global_id(0);
3      unsigned int k = 0;
4      do{
5          dMem[tid] = tid;
6          dMem[tid] = dMem[tid+1];
7          k++;
8      } while(k < M);
9  }

It is expected that the behaviour of the scheduler handling the race conditions,

especially the shape of the pmf , is not affected by the differences between architec-

tures. It can only be affected by the scheduler implementation. In some cases, like

the scoreboard implementations, we expect the distributions to be wider in com-

parison to Tomasulo’s algorithm implementation. Nevertheless, the distributions will

highlight the underlying differences.

2.3 Case study: TRNG design on GPUs

Over the last three decades, GPUs have been one of the fastest growing and

most revolutionary architectures. Since the late 70s, starting as a cheaper option

to buffer complete images before being displayed, graphics cards became a standard

solution to rendering simplified figures by performing fast calculations. During the

80s and the beginning of the 90s, the demand for fast two-dimensional special effects

in video games made GPUs popular, and this demand propelled miniaturization and

faster circuitry. At the end of the 90s, real-time 3D graphics assisted by CPUs started to move towards using GPUs due to their low manufacturing cost. One example is the rise

Figure 2.3: NVIDIA G80-ENG-A1 GPU

of the G80 model from Nvidia, Fig. 2.3. Later on with this model, Nvidia became

one of the main distributors of GPUs worldwide [70, 71].

By the end of 2006, Nvidia decided to develop a generic GPU architecture, the

Tesla micro-architecture. This new architecture reduced the power consumption,

standardized its stream processors usage and programming by unifying several ar-

chitectures into a generic architecture programmable under a common programming

language. This architecture would later be referred to as CUDA.

Eventually, CUDA’s purpose extended beyond graphics applications and into an

effective general purpose computing architecture, specifically on data-parallel prob-

lems by easily exploiting the inherent parallelism. For Nvidia, the major step was

taken on the G80 base model, seen in Fig. 2.3. It was the first GPU supporting the C programming language while unifying the pixel, vertex, and geometry processors into one stream processor. The G80 introduced the SIMT execution model, where multiple threads are concurrently managed to execute one instruction, in a similar way as SIMD. It also introduced the use of shared memory and synchronization barriers to strengthen the inter-thread communication. CUDA parallel programs are executed by thread kernels, in warps or groups of 32 concurrent threads. During execution, the warp scheduling is handled by the hardware, creating zero-overhead context switching with the capability of dealing with long latency operations.

Figure 2.4: Left: CUDA Memory Hierarchy and execution model. Right: Thread organization diagram. Both from NVIDIA CUDA Programming Guide version 3.0, licensed by CC-A3.0.

In CUDA, the threads are organized in blocks and grids of thread blocks. A thread has access to its own registers or local memory, while each thread block has access to the warp program counter, the arithmetic logic unit (ALU), and shared memory, as shown in Fig. 2.4. In an attempt to provide versatility, all threads have access to larger and slower memories: global memory, constant memory, and texture memory for inter-block communication [72]. Each type of memory has different speeds and scopes, as shown in Table 2.2, in which local communication is favoured with speed, while inter-block communication is penalized. Shared memory is a peculiar case, in that it is optimized to have a high memory bandwidth for concurrent access but it is bounded to be shared within a warp. Shared memory has a size in multiples of 16 kilobytes (KB), which is divided into banks of the same size; when multiple threads within the same warp try to access the same shared memory bank, the conflicts are resolved by serializing and separating the request into conflict-free requests as necessary. This reduces the effective bandwidth in proportion to the number of separated memory requests, but simplifies the scheduling.

Table 2.2: CUDA memory relative speed and scope

Memory type   Relative speed (in instr.)   Scope
Register      1                            Thread
Local         ≥ 200                        Thread
Shared        2-3                          Block
Global        ≥ 200                        All
Constant      ≥ 200                        All
Texture       2-200 (cached)               All

The stream multiprocessor warp scheduler in CUDA works by executing the warps with instructions that are available and prioritized, given an operand score-boarding mechanism over the pipeline of instructions, as shown in Fig. 2.5. First, a set of instructions located in the L1 cache are fetched and placed in the instruction buffer, where the warp scheduler selects and prioritizes the instructions by a round-robin vs. age decision. Then, only instructions free of hazards are issued and their corresponding operands are read. Next, the instructions are executed using either the scalar multiply-add blocks (MAD) or the special function units (SFU). Once the execution is finished, the outputs are written back to their corresponding memories or registers.

Figure 2.5: Warp Scheduling in SIMT pipelines.

The warp scheduler uses a scoreboard that, in combination with the age of the instruction, decides the instruction ordering. Meanwhile, the way of avoiding data hazards is by stalls. A race condition can lead to an unknown behaviour, as described by Nvidia [72]. As well, Nvidia has provided other means to mitigate the effect of this situation, for example by introducing synchronization barriers at block and grid level, or the use of atomic operations.

2.3.1 Source of randomness found in GPUs

As suggested in [73], sources of randomness can be found in most computational devices that fall under three categories: Chapter 2: The Extraction layer on True Random Number Generators 51

1. Hardware sources. Temperature noise, vibrations, or electrical noise.

2. Pseudo random number generators. Sources that are originated by a

PRNG and that could undergo through another process.

3. Complex processes. Page allocation, systems schedulers or race conditions.

Therefore, GPU kernels are not an exception to non-deterministic behaviours in race conditions, in which an unknown behaviour is observed from multiple desynchronized threads writing and reading from the same memory location [74]. During a race

condition, as mentioned by Boyer et al. [74], a source of randomness can be obtained; thus, the measurements of the elapsed time of a group of possible race conditions is an indirect measurement of a source of randomness.

Another non-deterministic behaviour present in all semiconductors is the core temperature change [47]. It has been observed that the temperature extracted from the NVIDIA application programming interface (NV-API) using the GPU thermal control interface follows a non-deterministic behaviour in the elapsed time of its calculation. Additionally, race conditions and thermal flow are also well known sources of randomness. Nevertheless, to the best of our knowledge, the use of the above mentioned random sources to extract true random numbers from GPUs has not been considered before; other related findings are included in [56, 75].

Referring back to the extraction stage block of Fig. 1.8 in Chapter 1, we analyze two cases of noise sources: the measurement of the race conditions, and the measurement of the temperature of the core.

The measurement of race conditions times

The Nvidia warp scheduler does an excellent job reordering thread execution to

avoid race conditions. But to our surprise, the scheduler behaviour is irregular as

observed in Listings 2.4 to 2.6, in particular when the number of threads is slightly

over a multiple of the warp size, such as 32n + 1, n ∈ Z+. For example, in Listing

2.4, SIZE = 33 is the number of threads that forces the instruction buffer to saturate with intentional WAW hazards and causes an undefined behaviour. In part, this is delayed by the shared and global memory latency.

The measurement of in-chip temperature calculation times

In CUDA family line of products, the cores are provided with at least one diode

that acts as a built-in temperature measurement device. Diode thermometers are

commonly used as temperature sensors due to their low cost and their stability of nearly 0.8°C within the range of −55°C to 150°C. Additionally, diodes have linearity, simple circuitry, and good sensitivity [76].

As the temperature changes in the diode, the forward voltage across the p-n

junction changes linearly. This voltage is digitized and stored in a register, which

indirectly represents the core temperature. The time for retrieving the register by a

call from the NV-API behaves like a non-deterministic process. Any CUDA device

is subject to the round-robin selection method [77], such as the ADT7473 diode.

Therefore, the measurement of the completion time for extracting the temperature

measurement is a valid source of randomness.

Unfortunately, for the CUDA family the location and number of diode thermometers is inconsistent, giving more room for the post-processing stage to stabilize its outcome. As well, the number of diodes is limited, which bounds its scalability in comparison to the thread race-condition time measurement.

2.3.2 Evaluation of TRNGs sources in GPUs

In this subsection, we present three fundamental parts to understanding the observed phenomena. First, we dive into the behaviour of the source of randomness by observing the pmf while identifying some of the key parameters. Second, we focus on the change in the pmf across different architectures, which are subject to small modifications in their circuitry. Finally, we provide some conclusions on the behaviour of the pmf over long-term runs.

Phenomena observation.

Here we present some results on the behaviour of the two proposed sources of randomness [56], in which the distributions appear to be multi-modal and similar to a Gamma distribution, as in Fig. 2.6. These results were evaluated over a low-performance GPU, the Nvidia GeForce 9600M from the GeForce 9 series, containing 32 cores.

Effects due to architectural changes.

Further experiments on the race condition kernels from Listings 2.4-2.6 were conducted over two different architectures, trying to highlight the behaviour found at low and high performance devices. This includes the Nvidia GeForce 320M with 48 cores from the Tesla architecture and the Nvidia GeForce GT 750M with 768 cores from the Kepler architecture. For both experiments, the number of threads was adjusted to be slightly over the warp size, SIZE = 33, and the number of iterations for accumulating the effect of the race conditions was set to M = 128.

Figure 2.6: Pmf of random sources while being sampled for 10^6 samples with a bin size of 0.5 µsec. The measurement of race condition times following Listing 2.4, t_RC, presents a sharper distribution, while the measurement of times to calculate the in-chip temperature, t_TC, is wider. ©2011 IEEE.

As observed in Fig. 2.7, the three proposed kernels show a similar shape for the Tesla architecture. Surprisingly, the effect has a similar shape on the Kepler architecture, as in Fig. 2.8. As mentioned before in subsection 2.3.1, the shape of the pmf is not drastically modified; therefore, the pmf is estimated as a Gamma distribution using the method of moments. The estimators are given in Eqs. 2.1 and 2.2.

\hat{\alpha} = \left( \frac{\bar{x}}{\varsigma} \right)^{2} \qquad (2.1)


Figure 2.7: Pmf of random sources using the three suggested kernels on a Tesla architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.


Figure 2.8: Pmf of random sources using the three suggested kernels on a Kepler architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

\hat{\beta} = \frac{\varsigma^{2}}{\bar{x}} \qquad (2.2)

Table 2.3: Gamma distribution approximations

Architecture   Kernel   α̂         β̂
Tesla          GmShm    3.694     0.100
Tesla          Gm       0.294     1.829
Tesla          Shm      7.614     0.056
Kepler         GmShm    309.562   0.000
Kepler         Gm       21078.9   0.000
Kepler         Shm      26208     0.000
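The estimators of Eqs. 2.1 and 2.2 reduce to simple functions of the sample mean and sample standard deviation (defined just below); a minimal C sketch, assuming the DAS samples are already available in an array of doubles, is:

/* Method-of-moments fit of Eqs. 2.1 and 2.2: alpha_hat = (mean/std)^2 and
 * beta_hat = std^2/mean, computed from N DAS samples in x[]. */
#include <math.h>

void gamma_fit(const double *x, int N, double *alpha_hat, double *beta_hat) {
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < N; i++) mean += x[i];
    mean /= N;
    for (int i = 0; i < N; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= N;                                 /* population variance */
    double sd = sqrt(var);
    *alpha_hat = (mean / sd) * (mean / sd);   /* Eq. 2.1 */
    *beta_hat  = var / mean;                  /* Eq. 2.2 */
}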

where \bar{x} = \frac{1}{N}\sum_{i=0}^{N-1} x_i is the sample mean and \varsigma = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1} (x_i - \bar{x})^2} is the sample standard deviation. Then, the calculated estimators are shown in Table 2.3. One reason why the Gamma distribution is selected is because it comprises a two-

parameter family of continuous distributions with the property of changing its shape

and scale through the α and β parameters. We approximate these parameters with α ≈ \hat{\alpha} and β ≈ \hat{\beta}. Furthermore, the Gamma distribution is expressed by

g(x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)} \qquad (2.3)

where α ≥ 0 is the shape parameter, β ≥ 0 is the scale parameter, the support of x is defined on 0 ≤ x < ∞, and \Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha-1} e^{-x}\,dx is the Gamma function. This last function can be approximated using the series form of the incomplete Gamma function, \Gamma(\alpha, k=30) = k^{\alpha} e^{-k} \sum_{i=0}^{100} \frac{k^{i}}{\alpha(\alpha+1)\cdots(\alpha+i)}.

Additionally, in Fig. 2.9, 2.10, and 2.11 we present the effects of stressing the

number of threads and the number of repetitions. Evidently, due to the limitations of

the Tesla architecture, it is easy to saturate the scheduler and make the distribution

wider, especially in the case of using global memory and a large number of repetitions,

Figure 2.9: Pmf of random sources using kernel GMSHM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Tesla architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.10: Pmf of random sources using kernel GM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Tesla architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.11: Pmf of random sources using kernel SHM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Tesla architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

M=1000, in Fig. 2.10. Despite the effects of using local or global memory, a consistent

peak in the distribution between 0.3 and 0.4 millisecond is observed across the three

figures.

Similar behaviour is observed in Fig. 2.12, 2.13, and 2.14, where the same conditions are applied to the three kernels on a Kepler architecture. In this case, the capabilities of the Kepler architecture are not affected easily by changing either the number of threads or the number of induced race conditions. Among the three figures, the global memory still has a wider distribution, in Fig. 2.13. As well, small peaks appear distributed across the first three milliseconds.

Figure 2.12: Pmf of random sources using kernel GMSHM and changing the number of repetitions, M, and the number of threads from 1 to 1024. The kernel is run over a Kepler architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.13: Pmf of random sources using kernel GM and changing the number of repetitions, M, and the number of threads from 1 to 1024. The kernel is run over a Kepler architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.14: Pmf of random sources using kernel SHM and changing the number of repetitions, M, and the number of threads from 1 to 1024. The kernel is run over a Kepler architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Behaviour consistency over long runs.

Despite the partial consistency of the pmf over different architectures, we also

reviewed the changes of the pmf over long runs. For this condition, a sliding window

of 1024 in size is created, which slides over 65536 samples, creating the pmf.
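A minimal C sketch of this sliding-window pmf follows; the window size (1024) and bin width (0.5 µs) come from the text, while the number of bins, the assumed sample range, and the emit() callback are illustrative assumptions only.

/* Sliding-window histogram: a window of WIN samples is histogrammed with
 * BINW-second bins, then advanced one sample at a time by removing the
 * oldest sample and adding the newest. pmf = counts / WIN. */
#include <string.h>

#define WIN   1024
#define BINW  0.5e-6      /* 0.5 microsecond bins */
#define NBINS 4096        /* assumed range: [0, NBINS*BINW) seconds */

static int bin_of(double t) {
    int b = (int)(t / BINW);
    return (b < 0) ? 0 : (b >= NBINS ? NBINS - 1 : b);
}

void sliding_pmf(const double *t, int n, void (*emit)(const unsigned *counts)) {
    unsigned counts[NBINS];
    memset(counts, 0, sizeof counts);
    for (int i = 0; i < n; i++) {
        counts[bin_of(t[i])]++;             /* add newest sample  */
        if (i >= WIN)
            counts[bin_of(t[i - WIN])]--;   /* drop oldest sample */
        if (i >= WIN - 1)
            emit(counts);                   /* one pmf per window position */
    }
}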

In Fig. 2.15, 2.16, and 2.17 we observe that global memory gives a distinctive

cloud of points around 1 to 1.5 milliseconds. Also, a significant peak around 0.3 and

0.4 millisecond is visible across the three figures.

Fig. 2.18, 2.19, and 2.20 present a similar effect to that of the Tesla architecture, and it is more consistent. Still, in the case of using global memory, it gives a distinctive cloud of points around 0.5 to 1 millisecond. Also, it produces a significant peak around 0.1 and 0.2 millisecond across the three figures.

Figure 2.15: Sliding window of size 1024 samples, creating a pmf of a random source using kernel GMSHM where M=100 repetitions, N=33 threads. The kernel is run over a Tesla architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.16: Sliding window of size 1024 samples, creating a pmf of random sources using kernel GM where M=100 repetitions, N=33 threads. The kernel is run over a Tesla architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.17: Sliding window of size 1024 samples, creating a pmf of random sources using kernel SHM where M=100 repetitions, N=33 threads. The kernel is run over a Tesla architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.18: Sliding window of size 1024 samples, creating a pmf of a random source using kernel GMSHM where M=100 repetitions, N=33 threads. The kernel is run over a Kepler architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.19: Sliding window of size 1024 samples, creating a pmf of random sources using kernel GM where M=100 repetitions, N=33 threads. The kernel is run over a Kepler architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.20: Sliding window of size 1024 samples, creating a pmf of random sources using kernel SHM where M=100 repetitions, N=33 threads. The kernel is run over a Kepler architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

2.4 Case study: TRNGs in Modern microprocessors

Today, Intel stands as one of the pillars in the micro-processing manufacturing market. Their most recent design, the Skylake product family, is the 6th generation of the Intel Core processor architectures, maturing its manufacturing process for 14nm size transistors. This generation primarily targets mobile devices ranging from 4 to

91 Watts in power consumption, which is similar to other competitors. Intel uses the combination of a CPU bundled with a GPU at low power consumption. The design presents automated extraction of instruction level parallelism from code, in addition to having instructions dispatched, queued, and retired within a clock cycle. Skylake improves on its cache misses, which have a major impact on overall performance. As well, in this generation the scalability to peripherals was boosted by using 16 PCIe lanes and other high speed buses like DMI 3.0, SLI, Gigabit Ethernet, and SATA Express support, among many others. In terms of power consumption, the integration of voltage regulators on the processor reduces the cost and improves the performance for lower voltages.

On Skylake cores, we can execute from two to eight hardware threads simultaneously. The hardware takes care of managing the instruction scheduling by using the instruction scheduler model based on the Tomasulo algorithm. The Skylake scheduling is depicted in Fig. 2.21. In Skylake, instructions are fetched and then decoded into granular µ-ops using Tomasulo's algorithm, and the execution order is reordered.

Figure 2.21: Scheduling in pipelines using Skylake product line.

Effects due to architectural changes.

Similar to the GPUs, we conduct experiments on the race condition kernels

from Listings 2.4-2.6, over the Skylake M3-6Y30 quad processor supporting multi-

threading.

In Fig. 2.22 we can observe the three kernels with a similar shape to the ones observed on the GPU architecture. For this experiment, the number of threads was adjusted to be SIZE = 33, and the number of iterations for accumulating the effect of

the race conditions set to M = 128, which is the same setup as the GPU architecture.

From the results, a significant peak around 0.005 and 0.03 millisecond is visible across

the three kernels.

Similarly, in Fig. 2.23, 2.24, and 2.25 we present the effect of stressing the number of threads and the number of repetitions. Given that the response time is faster, slimmer distributions can be observed compared with the GPU results. As well,

Figure 2.22: Pmf of random sources using the three suggested kernels on a Skylake architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.23: Pmf of random sources using kernel GMSHM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Skylake architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.24: Pmf of random sources using kernel GM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Skylake architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.25: Pmf of random sources using kernel SHM and changing the number of repetitions, M, and the number of threads from 1 to 512. The kernel is run over a Skylake architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.26: Sliding window of size 1024 samples, creating a pmf of a random source using kernel GMSHM where M=100 repetitions, N=33 threads. The kernel is run over a Skylake architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

the effect of repeating the race conditions saturating the scheduler on the Skylake

architecture is similar to the GPU case, as in Fig. 2.24 and 2.25.

Behaviour consistency over long runs.

Now, we review the changes of the pmf over long runs. A sliding window of 1024

in size is used to create a pmf as it slides over 65536 samples, similar to the results

presented in the case of GPUs.

As observed in Fig. 2.26, 2.27, and 2.28, the concentration of the distribution happens in a very consistent way in the first microseconds. But in Fig. 2.26, the concentration fluctuates around a few microseconds over time, given that the scheduler is saturated with long requests from the race conditions. In the cases of Fig. 2.27 and 2.28, the process is more consistent.

Figure 2.27: Sliding window of size 1024 samples, creating a pmf of random sources using kernel GM where M=100 repetitions, N=33 threads. The kernel is run over a Skylake architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.28: Sliding window of size 1024 samples, creating a pmf of random sources using kernel SHM where M=100 repetitions, N=33 threads. The kernel is run over a Skylake architecture, while being sampled for 10^6 samples with a bin size of 0.5 µsec.

Figure 2.29: Pmf of random sources using Skylake RDSEED, while being sampled for 10^6 samples with a bin size of 0.5µ from the normalized numbers.

Other TRNGs developed in hardware for CPUs.

As part of the ALU sections in Fig. 2.21, Skylake has a special instruction, RDRAND, that returns a random number from the Intel on-chip hardware random number generator. The instruction RDRAND has been available since the Ivy Bridge architecture. RDRAND takes a random source that is passed through an AES conditioner, and the conditioner's output feeds a PRNG called CTR DRBG.

Alternatively, RDSEED returns values from the Intel Secure Key TRNG, which uses thermal noise within the chip and is intended to re-seed a pseudo-random number generator. RDSEED first became available with the commercialization of Intel Broadwell CPUs.
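For reference, RDSEED can be read from C through the compiler intrinsic _rdseed64_step (GCC/Clang, compiled with -mrdseed on a supporting CPU); this is only a usage sketch of the instruction, not part of the proposed TRNG. The instruction can report that no entropy is available yet, so the read is retried:

/* Minimal RDSEED read with bounded retries. */
#include <immintrin.h>
#include <stdio.h>

static int rdseed64(unsigned long long *out) {
    for (int tries = 0; tries < 100; tries++)
        if (_rdseed64_step(out))   /* 1 = success, 0 = entropy not ready */
            return 1;
    return 0;                      /* give up after too many retries */
}

int main(void) {
    unsigned long long v;
    if (rdseed64(&v))
        printf("%llu\n", v);
    return 0;
}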

In Fig. 2.29, we present the distribution of RDSEED. As observed, the distribution appears to be uniform, but it still holds some peaks showing higher and lower probability at different spots. As well, RDSEED was analyzed over long runs, in Fig. 2.30. This confirms the shape of the distribution and how it changes over time, having blobs with high and low probability that remain as the window slides over the output of RDSEED.

Figure 2.30: Sliding window of size 1024 samples, creating a pmf of random sources using Skylake RDSEED, while being sampled for 10^6 samples with a bin size of 0.5µ from the normalized numbers.

2.5 Summary

In this chapter, we presented the foundations for developing a TRNG using popular CPUs or GPUs. We concentrated on creating the basic road-map for developing a TRNG based on exploiting the anomalies of the architecture, showing that the design is independent of the architecture. Furthermore, the TRNG design was analyzed for similarities between control-flow and data-flow schedulers. It was concluded that similar behaviour is shared when the schedulers are saturated, and all three different strategies led to the same conclusion. As well, we presented two case studies, the GPU and the CPU. These case studies highlighted the effect of the variables that influence the saturation of the schedulers and change the shape of the distribution, which in most cases fits a Gamma distribution. In the following chapters, we will address ways of improving the quality of the DAS random numbers by addressing two characteristics, the entropy and the autocorrelation of the numbers.

Chapter 3

Distribution shape enhancement on TRNGs

In Chapter 2, we observed the shape of the distribution for the DAS random num-

bers of the two case studies. We concluded that for most cases a Gamma distribution

is followed, which is far from being uniform. A better, quantitative way of reviewing

the shape of the distribution is by analyzing the normalized entropy. Table 3.1 shows

the normalized entropy is low for the different architectures experimented with in

Chapter 2, indicating the need for algorithms that improve the quality and shape of the distribution. In this chapter, we will focus on algorithms that correspond to the post-processing block seen in Fig. 1.8. First, we will concentrate on modulus-based al-

gorithms for compressing the distribution. Then, we will continue with enhancement

techniques such as histogram equalization (HE), histogram specification or exact his-

togram equalization (EHE), followed by the effects of a proposed adaptive technique,

adaptive exact histogram equalization (AEHE). We will conclude with a complexity


Table 3.1: Normalized Shannon Entropy of noise sources

Case study   Architecture   Kernel   Reference Figure   Normalized Entropy (H)
GPU          Tesla          GMSHM    2.7                0.645
GPU          Tesla          GM       2.7                0.771
GPU          Tesla          SHM      2.7                0.717
GPU          Kepler         GMSHM    2.8                0.667
GPU          Kepler         GM       2.8                0.631
GPU          Kepler         SHM      2.8                0.621
CPU          Skylake        GMSHM    2.23               0.541
CPU          Skylake        GM       2.23               0.676
CPU          Skylake        SHM      2.23               0.627

analysis and a comparison of the normalized entropy.

3.1 Modulus-based algorithms

With the intention of increasing the uncertainty of the extracted numbers in Chap-

ter 2, we develop a method that focuses on reshaping the distribution. We propose

two steps,

1. Selecting a suitable representation of the extracted numbers by focusing on the

digits that expose a higher normalized entropy.

2. Exploiting the properties of the modulo operation, or modulus, by concentrating

the distribution within a defined range.

In the following subsections, we will discuss the details of these two techniques.

3.1.1 Optimizing the random numbers representation format

In communications, messages can be encoded for transmission. More importantly, the representation format of a message can have an effect on some of the statistical properties, such as the maximum amount of information that can be transmitted.

Additionally, the entropy per letter, word, or symbol is a useful tool for statistically measuring the amount of information being transmitted [78]. In principle, we can consider the extracted DAS random numbers from Chapter 2 as messages.

Despite the low entropy in their native representation, as shown in Table 3.1, the extracted numbers can be analyzed at every digit. In a similar manner, there are many ways of selecting the group of digits that present a higher entropy. For example, consider the DAS random numbers from the Tesla architecture using the global and local memory kernel (kernel GMSHM) from Fig. 2.7, shown in Table 3.2, in which we can observe the entropy at every decimal digit. First, let us consider the two following mapping functions,

M_{digit}(k, i) = 10^{k} x_i \bmod 10.0 \qquad (3.1)

M_{right}(k, i) = 10^{k} x_i \bmod 1.0 \qquad (3.2)

where M_{digit}(·) will be used for isolating the behaviour of the k-th digit at the i-th sample of the random numbers x. Alternatively, M_{right}(·) concentrates on the fractional part of x. In both equations, the modulus operation ensures that a specific group of digits of x is observed. The properties of the modulus will be addressed in the next subsection.

In Table 3.2, the entropy per digit shows that digits to the right of 10^-6 are equally well distributed, while the first digits present an impulse-shaped distribution, which makes them predictable. A simple selection method is to select all the digits to the right of a given digit that exposes a higher entropy; in our example, to the right of the 10^-7 digit the entropy improves. Other selection methods could include the analysis of the joint entropy between digits, but the analysis per digit can certainly reduce the number of options when deciding on a selection scheme.
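A minimal C sketch of this per-digit analysis is given below; it isolates the k-th decimal digit of each sample as in Eq. 3.1, accumulates a 10-bin histogram, and reports the base-10 normalized Shannon entropy, while m_right() applies Eq. 3.2. The in-memory layout of the samples is an assumption for illustration only.

/* Per-digit entropy analysis behind Table 3.2. */
#include <math.h>
#include <string.h>

double m_right(double x, int k) {                     /* Eq. 3.2 */
    return fmod(pow(10.0, k) * x, 1.0);
}

double digit_entropy(const double *x, int N, int k) {
    unsigned count[10];
    memset(count, 0, sizeof count);
    for (int i = 0; i < N; i++) {
        int d = (int)fmod(pow(10.0, k) * x[i], 10.0); /* digit via Eq. 3.1 */
        count[d]++;
    }
    double H = 0.0;
    for (int d = 0; d < 10; d++) {
        if (count[d] == 0) continue;
        double p = (double)count[d] / N;
        H -= p * log(p);
    }
    return H / log(10.0);     /* normalized: 1.0 means all 10 digits equally likely */
}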

3.1.2 Modulus

In mathematics, the result of the modulo operation is the remainder of the Euclidean division. This convention differs across different programming languages and hardware implementations, but most definitions follow these conditions for the division,

q ∈ Z    (3.3)
D = dq + r,    D, d ∈ R, d ≠ 0    (3.4)
|r| < |d|    (3.5)

where D is the dividend, d is the divisor, q is the quotient, and r is the remainder. The remainder is strictly related to the method of obtaining the quotient. Some methods truncate the quotient towards zero [79], round it towards negative infinity [80], round it towards the closest integer [81], or keep the quotient positive [82]. For further details of the different methods, an extensive survey can be found in [83].

Table 3.2: Entropy analysis per decimal digit

Digit  | Normalized Entropy by digit | Normalized Entropy by all digits to the right
10^0   | 0.000      | 0.645
10^-1  | 0.000      | 0.903
10^-2  | 1.78x10^-5 | 0.838
10^-3  | 0.001      | 0.722
10^-4  | 0.062      | 0.519
10^-5  | 0.462      | 0.476
10^-6  | 0.812      | 0.562
10^-7  | 0.930      | 0.849
10^-8  | 0.999      | 0.886
10^-9  | 0.999      | 0.920
10^-10 | 0.999      | 0.860
10^-11 | 0.304      | 0.935
10^-12 | 0.642      | 0.920
10^-13 | 0.999      | 0.921
10^-14 | 0.998      | 0.907
10^-15 | 0.999      | 0.929
10^-16 | 0.999      | 0.922

A well-accepted approach is the implementation of rounding towards negative infinity, suggested by Knuth [80]. The implementation considers,

q = D div d = ⌊D/d⌋ (3.6)

r = D mod d = D − dq (3.7)

In our case, the random numbers extracted in Chapter 2 are positive and larger than zero.

Figure 3.1: Extracted random numbers while being sampled for 20000 samples with a bin size of 0.5µ from the Normalized numbers.

When considering Eq. 3.2, the effect on the distribution is folding it within the

range of [0, 10^-3] seconds. In our research paper [56], we proposed a group of operators to be tested for their impact on the distribution. The effects were insignificant compared to the techniques of the next sections. Among the tested operators were addition, subtraction, multiplication, exponentiation, and modulus of two sources. Here, we selected the modulus operator as the most representative of concentrating the distribution within a range. Taking our previous example, we can observe the distribution of the extracted random numbers in Fig. 3.1. Then, after applying M_right(·) from Eq. 3.2 with k = 1, in Fig. 3.2 we can observe how the distribution is concentrated within [0, 10^-3] seconds. There are trade-offs when applying the modulus operator. For example, statistical weaknesses are masked and transformed into others, affecting some statistical moments and autocorrelation side-lobes while the entropy increases [84].

Figure 3.2: Extracted random numbers after applying M_right(·) from Eq. 3.2 with k = 1, sampled for 10^6 samples with a bin size of 0.5µ from the Normalized numbers.

3.2 Histogram-based algorithms

Another way of improving the shape of the distribution of the random numbers is by mapping the distribution to a desired shape. Unfortunately, the task can be computationally expensive, creating a trade-off between speed and precision. In the following subsections, we will introduce three techniques based on image processing equalization, histogram equalization (HE), exact histogram equalization (EHE), and adaptive exact histogram equalization (AEHE), which are constructed to address the problem of improving the quality of random numbers.

3.2.1 Histogram equalization

HE is a technique used in image processing that enhances the contrast of an image by guaranteeing a final image with a distribution of grey level values that resembles the uniform distribution [85]. HE is based on the following mathematical observations: given a discrete random variable, X, with the pmf or normalized histogram composed of a group of bins representing the frequencies of the random variable (x_i), defined within [0, 1], the cumulative distribution function (cdf), the accumulated normalized histogram, is

F(x_i) = P(X ≤ x_i) = Σ_{0}^{x_i} f_x    (3.8)

The HE technique suggests using Eq. 3.8 as a transformation function in the form

M_HE(x_i) = F(x_i)    (3.9)

where the transformation function will linearize the cumulative distribution function (cdf) of x along the value range, and then M_HE is supposed to have a uniform distribution within the range of [0, 1]. A slight modification is proposed since the distribution inside the bins needs to be preserved and not replaced by an integer value. The modification then consists of interpolating the value within the limits of the bin and, only in the case where the same value is present in the bin, this value is changed and uniformly distributed along the bin. HE relies on the construction of the histogram as presented in Algorithm 1. The steps for its implementation are summarized in Algorithm 2, which is described below.

In the first step, the number of elements per bin is calculated. In line 3, the pmf is calculated. Then, lines 4-6 calculate the cdf. Finally, in the last iteration through the data set x, the outputs are recalculated and stored in y (lines 7-10).

Although HE can be considered as one of the most popular approaches among the image enhancement techniques, it can show spikes and gaps in the final uniform histogram. In our case, the random numbers show no exception to this condition.

We can illustrate this by using the example presented before. First, let us consider the transformed output of Mright as in Fig. 3.2. Then by applying MHE with the transformation function in Fig. 3.3 we obtain the equalized output showing spikes and gaps, Fig. 3.4.

Algorithm 1 Histogram Construction Algorithm, hist
Require: x is a random variable. M is the number of bins. N is the length of the vector x.
Ensure: h is the frequency. sum is the cdf of x.
1: B ← N/M
2: sum ← 0
3: h ← 0
4: for x_i in x do
5:     k = ⌊x_i / B⌋
6:     h[k] = h[k] + 1
7:     sum = sum + 1
8: end for

Algorithm 2 Histogram Equalization Algorithm, M_HE
Require: x is a random variable. M is the number of bins. N is the length of the vector x.
Ensure: x ∈ [0, 10^-3] at input. y ∈ [0, 1] at output.
1: B ← N/M
2: sum ← 0
3: H, sum ← hist(x, M, N)
4: for k = 1 to M do
5:     cdf[k] = H[k] / sum[M]
6: end for
7: for x_i in x do
8:     k = ⌊x_i / B⌋
9:     y[i] = cdf[k]
10: end for
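A condensed Python sketch of the histogram-equalization mapping is shown below. It follows the standard cumulative-histogram formulation rather than mirroring Algorithms 1-2 line by line; numpy and the chosen function name are assumptions for this illustration.

import numpy as np

def histogram_equalization(x, m_bins):
    # Build the histogram, derive the normalized cumulative histogram (cdf),
    # and map each sample to the cdf value of its bin (Eqs. 3.8-3.9 in spirit).
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=m_bins)
    cdf = np.cumsum(counts) / counts.sum()
    bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, m_bins - 1)
    return cdf[bin_idx]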

Figure 3.3: Transformation function using Histogram Equalization.

Figure 3.4: Transformed random numbers after applying M_right(·) with k = 1 and M_HE(·), sampled for 10^6 samples with a bin size of 0.5µ from the Normalized numbers.

3.2.2 Exact Histogram Equalization

The deficiencies found in the HE output in Fig. 3.4 are overcome in EHE, shown in Fig. 3.6, which yields a final histogram that is completely flat without the undesired effects mentioned above [86]. The necessary steps to achieve exact histogram equalization are summarized in Algorithm 3. The first step is ordering the elements

Algorithm 3 Exact Histogram Equalization Algorithm, M_EHE
Require: x is a random variable. M is the number of bins. N is the length of the vector x.
Ensure: x ∈ [0, 10^-3] at input. y ∈ [0, 1] at output.
1: xs, i ← SORT(x)
2: B ← N/M
3: for k in i do
4:     y[k] = k/(BN) + xs[i[k]] mod (1/N)
5: end for

in the data-set such that x_0 < x_1 < ··· < x_N. Then, in line 2, the number of elements per bin is determined and stored in a variable B. Next, a value is assigned to each group. Specifically, in this step a similar interpolation as in HE is introduced in order to preserve the source distribution within the bin. The scheme of EHE yields exact results, since the data-set is transformed exactly into the desired distribution by fixing the bin element counts. This transformation will be referred to as M_EHE.
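The sort-based idea can be sketched in a few lines of Python. This hedged illustration replaces every sample by its normalized rank, which forces an exactly flat histogram; the within-bin interpolation of Algorithm 3 (the modulus term) is omitted here for brevity.

import numpy as np

def exact_histogram_equalization(x):
    # Sort the samples and assign each one its normalized rank in [0, 1),
    # so that every output bin receives exactly the same number of samples.
    x = np.asarray(x, dtype=float)
    n = x.size
    ranks = np.empty(n, dtype=float)
    ranks[np.argsort(x, kind="stable")] = np.arange(n)
    return (ranks + 0.5) / n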

3.2.3 Adaptive Exact Histogram Equalization

One of the limitations of HE and EHE is that they require the entire, complete data-set to perform the computations, presenting a memory bound. Due to this, we introduce a localized version of EHE that works over a sliding window, with the main aim of defining a local transformation function based on its neighbourhood, AEHE.

Figure 3.5: Transformation function using Exact Histogram Equalization.

Figure 3.6: Transformed random numbers after applying M_right(·) with k = 1 and M_EHE, sampled for 10^6 samples with a bin size of 0.5µ from the Normalized numbers.

This method has similarity with block overlapped histogram equalization (BOHE) [87] and adaptive histogram equalization (AHE) [88], since it relies on the use of a sliding window, but it differs in the equalization technique.

The main idea behind AEHE is to generate random numbers that fill and are

post-processed on a sliding window. The sliding window is divided into two segments

of equal size, a left segment and a right segment. At the beginning, EHE is applied

to the sliding window, then the window is shifted by the segment size and filled

with new random numbers. While the sliding window is shifted, the left segment

replaces its values with the right segment, while the right segment gets filled with

un-equalized random values. Since one of the segments was purged from the sliding

window, the left segment is still connected to the purged random numbers and it cannot be modified, otherwise the uniform distribution will no longer hold. Therefore,

the distribution of the left segment needs to be complemented to become uniform

for the left segment and un-equalized random numbers. This is done by using EHE

only over the right segment for a histogram that complements the left segment to

form a uniform distribution. In fact, the distribution of the right segment needs to be

identical to the one from the purged random values. Once EHE is applied, the sliding

window is shifted again and the process is repeated. Once the process has finished the

last four segments, it is convenient to reset any possible pattern or relation between

two consecutive complement histograms. This can be easily done by shifting the sliding window again and filling the two segments with un-equalized random numbers.

Then the process can be repeated all over again or concluded at this point.

In Algorithm 4, we present a summarized version of the AEHE implementation. It

starts by applying MEHE from Algorithm 3 to two contiguous segments, left segment

(xL) and right segment (xR), from the vector x. Then, from lines 2-12 we repeat for

N_A steps as the sliding window takes two segments at each step and only updates the x_R segment. This is performed twice using M_AEHE-loop from Algorithm 5. After

equalizing over four segments, Algorithm 3 is applied to new random numbers in line

12, which resets the dependencies between contiguous segments. In Algorithm 5 from

Algorithm 4 Adaptive Exact Histogram Equalization algorithm, M_AEHE
Require: x is a vector of random numbers. M is the number of bins. N_L, N_R are the lengths of the x_L, x_R segments of x. N_all is the length of x, where N_all >> N_L + N_R.
Ensure: x ∈ [0, 10^-3] at input. y ∈ [0, 1] at output.
1: x[0 : N_L + N_R - 1] = M_EHE({x_L, x_R}, M, N_L + N_R)
2: for k = N_L + N_R - 1 to N_all - 3N_R in 3N_R steps do
3:     k0 = k
4:     x_L ← x[k0 - N_L + 1 : k0]
5:     x_R ← x[k0 + 1 : k0 + N_R]
6:     x[k0 + 1 : k0 + N_R] ← M_AEHE-loop(x_L, x_R, M, N_L, N_R)
7:     k1 = k0 + N_R
8:     x_L ← x[k1 - N_L + 1 : k1]
9:     x_R ← x[k1 + 1 : k1 + N_R]
10:    x[k1 + 1 : k1 + N_R] ← M_AEHE-loop(x_L, x_R, M, N_L, N_R)
11:    k2 = k1 + N_R
12:    x[k2 : k2 + N_L + N_R - 1] = M_EHE({x_L, x_R}, M, N_L + N_R)
13: end for

lines 1-5, a histogram is constructed for x_L, which will not be modified any further since it depends on the previous windows. The counts of the x_L histogram, h, are key in determining the shape of the complement histogram, which together assemble the uniform distribution as mentioned before. For this reason, in line 3, x_R is sorted into SxR together with its indexes before sorting, SxRi. From lines 7-16, the new indexes are calculated for the complement histogram. From lines 9-13, the new index positions, SxRi_new, are assigned, filling the complement histogram in incremental order. Here, k_new is the new index position counter, while i_new is the index of the bin in the complement histogram. Therefore, if a bin of h does not require to be

Figure 3.7: Change in the pmf of the M_AEHE output using a sliding window of 64 numbers and 64 bins. For M_AEHE, the window size is 128 random numbers.

complemented, the k_new counter is incremented by the bin size in line 15. In lines 18-22, with the constructed complement histogram, the new values of x_R are recalculated in a sorted manner. Each value of SxR is transferred to its corresponding bin, defined in line 19. The signature of SxR is preserved by applying the modulus of the bin size. Then, when the bin number j is multiplied by the inverse of the size of the sliding window, N, it behaves as an offset to the SxR signatures. Together, in line 21, the new random numbers are calculated and updated into x_R. At the end, both x_L and x_R are uniform. Since x_L was left intact, x_R is returned to update the random numbers of x. In Fig. 3.7, we constructed an experiment to demonstrate the effects of M_AEHE over 2^15 random numbers, using a small window size of 128, making it ideal for online random number generation. As well, the normalized entropy is calculated

Algorithm 5 Adaptive Exact Histogram Equalization Loop Algorithm, M_AEHE-loop
Require: x_L, x_R are two vectors of random numbers. M is the number of bins. N_L, N_R are the lengths of x_L, x_R.
Ensure: x_L, x_R ∈ [0, 10^-3] at input. {x_L, x_R} ∈ U[0, 1] at output.
1: N ← N_L + N_R
2: h ← hist(x_L, M, N_L)
3: SxR, SxRi ← SORT(x_R)
4: k_new ← 0
5: i_new ← 0
6: for k = 1 to M do
7:     if h[k] < N/M then
8:         k_new = k_new + h[k]
9:         for l = 1 to N/M - h[k] do
10:            k_new = k_new + 1
11:            i_new = i_new + 1
12:            SxRi_new[i_new] = k_new
13:        end for
14:    else
15:        k_new = k_new + N/M
16:    end if
17: end for
18: for k = 1 to N_R do
19:    i = SxRi[k]
20:    j = SxRi_new[k]
21:    x_R[i] = SxR[k] mod (1/N) + (j - 1)/N
22: end for

as the window slides in increments of one random number in Fig. 3.8. We observe that the entropy oscillates between 0.96 and 1.00, which is close to having a consistent uniform distribution, confirmed in Fig. 3.7 with no distinguishable pattern.

3.3 Asymptotic analysis of algorithms

In computational theory, the complexity of a problem, algorithm, or structure can be measured in terms of its resources, bounds, and quantitative relations with the

Figure 3.8: Normalized entropy of the pmf of the M_AEHE output from Fig. 3.7.

intention to investigate its solvability [89]. Among the most important resources and restrictions are time and storage space. A general approach is to count the worst-case number of operations being executed in terms of the input size, which is formalized by using the Big-O complexity notation. The Big-O notation is helpful for deriving the asymptotic analysis of the algorithm, either for run-time or storage space. The

Big-O notation is defined as follows: let f, g be two functions; if a constant c > 0 exists such that f ≤ cg, then f = O(g), read as f is of order g. The function is defined by the number of operations, including arithmetic-logic operations, loops, assignments, and recursion [90, 91].

Based on the algorithms presented in Sections 3.1 and 3.2, Table 3.3 presents the

summary of complexity of each method, versus the achieved entropy on the examples

previously presented. As we can observe, EHE presents the highest values in the

normalized entropy with the trade-off of having the largest time complexity among

Figure 3.9: Run-time complexity for Modulus-based and Histogram-based algorithms.

Table 3.3: Asymptotic analysis of the Modulus-based and Histogram-based algorithms

Algorithm | Reference | Time complexity | Space complexity | Normalized Entropy | Online/Offline algorithm
Modulo | Eq. 3.2 | O(N) | O(N) | 0.903 | Online
HE     | Alg. 2  | O(N) | O(N) | 0.959 | Offline
EHE    | Alg. 3  | O(N log N) | O(N) | 1.000 | Offline
AEHE   | Alg. 4  | O((N N_L / 2) log(2N_L)), N > (2N_L)^(N_L/2) | O(N) | (0.95, 1.0] | Online

the algorithms. A better option with similar normalized entropy is AEHE, with a time complexity smaller than EHE as long as the segment size is significantly smaller than the total number of random numbers, N. In Fig. 3.9, the run-time is presented in terms of the input size, where AEHE is a faster option than EHE as long as the

run-time of AEHE is smaller than EHE. This happens when,

O(N log(N)) > O((N N_L / 2) log(2N_L))
log(N) > (N_L / 2) log(2N_L)
N > (2N_L)^(N_L/2)    (3.10)

Therefore, the selection of a small segment size, N_L, presents a suitable linear run-time solution for a large N, as in Big Data applications.

3.4 Summary

In this chapter, we presented a group of post-processing techniques with the aim of improving the shape of the DAS random numbers distribution. The techniques are broadly classified into two groups: modulus-based techniques and histogram-based techniques. Among the modulus-based techniques, we concentrated on selecting a format representation scheme for DAS random numbers based on the normalized entropy found per digit, and by selecting a group of digits to the right of 10^-3. Next, this technique was combined with the modulus operation that concentrates the random numbers within a given range, [0, 10^-3]. On the histogram-based techniques, we presented two approaches that depend directly on having a priori the complete set of DAS random numbers to be post-processed. These approaches are Histogram Equalization and Exact Histogram Equalization. HE concentrates on creating a mapping function based on the cdf, while EHE uses a sort algorithm to specify or assign the correct shape of the distribution. Evidently, EHE is a robust technique but shows a higher time complexity. Therefore, the Adaptive Exact Histogram Equalization solution is introduced to reduce the time complexity by implementing EHE over a sliding-window complemented histogram. The idea behind the complemented histograms ensures a consistent transition between histograms as the window slides. AEHE is relaxed on the accuracy by delivering, most of the time, a nearly perfectly uniform distribution with a normalized entropy of [0.95, 1.0]. With these algorithms, the challenge of the distribution shape is solved, but the relation between random numbers and their sequence is left to be solved and analyzed in Chapter 4.

Chapter 4

Stationarity enforcement on True

Random Number Generators

Summarizing Chapters 1-3, we present the three phases found in a TRNG architecture (Fig. 4.1):

i. The digitized noise source/extraction: This phase is composed of a noise source

and a digitizer that periodically samples and outputs the DAS random num-

bers [56].

ii. The post-processing phase: This phase transforms the DAS random numbers from the previous phase into internal random numbers by

meeting the target statistical properties.

iii. The buffer phase: This is the optional final phase which buffers the internal

random numbers into the output as random numbers.

In Chapter 3, we presented a group of techniques suitable for improving the pmf of



the DAS random numbers, from which we proved that histogram-based techniques improved the normalized entropy of the pmf, especially EHE and AEHE. For simplicity, we refer to both as the histogram specification. In this chapter, we expand on the post-processing phase by dividing it into two blocks, histogram specification and stationarity enforcement. As mentioned before, the histogram specification block maps and statistically stretches the output of step (i), or DAS random numbers, to the specified output distribution, which in our case is a uniform distribution. This block [56] was studied in detail in Chapter 3. In Fig. 4.1, the output of this block, referred to as the semi-internal (si) random numbers, is fed into the stationarity enforcement block.

Figure 4.1: Generic architecture of a TRNG.

For the stationarity enforcement block we present two approaches:

i. Naive approach. This first approach considers the autocorrelation and power

spectral density, using a genetic algorithm (GA) to permute the si random

numbers to obtain the best result. This is addressed in Sec. 4.1.

ii. Accelerated approach. This second approach is a simplified and improved version using fast Fourier transform (FFT)-based metrics and a GA, which will be addressed in Sec. 4.2.

In general, both stationarity enforcement blocks use a GA to permute the si random

numbers aiming to meet wss within a specified level of quality. The quality level is defined by the expected standard deviation of a Gaussian distribution from the si random numbers' power spectral density. Within this chapter, we will discuss the two

approaches, evaluation criteria, and results in-depth.

4.1 Naive approach of stationarity enforcement

In order to understand the reasoning behind each block in Fig. 4.2, we will focus

on the underlying theoretical aspects that link each sub-block. The main strategy is to

partition the evaluation of stationarity by performing local and global evaluations of

the si random numbers, followed by using GA to rearrange the si random numbers so

that the wss condition is met. This implies that only the first two statistical moments are constant over time. In addition, the autocovariance between two different points

only depends on its separation.

The semi-internal random numbers: The si random numbers, or for simplicity x, shown in Fig. 4.3, may present small traces of periodicity not corrected by the histogram specification block. Instead, x is guaranteed to have a uniform distribution, Fig. 4.4. Formally, the nature of x can be defined as {x[i]}_{i=0}^{N-1}, being wss over N observations with a U{0, 1} distribution. Also, x can be partitioned into M non-overlapping


Figure 4.2: The stationarity enforcement block

Figure 4.3: The semi-internal random numbers, x

Figure 4.4: The pmf of the semi-internal random numbers

Figure 4.5: The autocorrelation of the semi-internal random numbers

and contiguous segments, where x_m^(g) is the m-th segment that has been pushed into a buffer, as shown in Fig. 4.2. In order to differentiate the changes in x, the superscript in parentheses, (g), denotes the g-th generation or iteration the algorithm

Figure 4.6: The power spectral density of the semi-internal random numbers

has evolved over x.

Autocorrelation calculation: The purpose of this block is to identify periodic

traces. As seen in Fig. 4.5, the traces or patterns found are easily reflected in the side-lobes of the autocorrelation of x_m^(g). Expanding on the definition of the autocorrelation from Eq. 1.8, the autocorrelation of x_m^(g) is defined as [39],

R_m^(g)[k] = ( Σ_{i=0}^{M-k-1} (x_m^(g)[i] - x̄)(x_m^(g)[i+k] - x̄) ) / ( Σ_{i=0}^{M-k-1} (x_m^(g)[i] - x̄)^2 )    (4.1)

where k ∈ [1-M, M-1] is the number of delayed samples, and M is the size of x_m^(g). As mentioned before, the overall shape resembles an impulse centred at 0 lags, as seen in Fig. 4.5, where the side-lobes are symmetrical and have an asymptotic convergence.
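A direct (O(M^2)) Python evaluation of Eq. 4.1 might look as follows; it is a hedged sketch using the per-lag normalization written in the equation, and numpy as well as the function name are assumptions.

import numpy as np

def autocorrelation(x):
    # Eq. 4.1: mean-removed autocorrelation, normalized per lag, for k = 0..M-1.
    # The side-lobes for k > 0 are the values inspected by the normality check.
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m = x.size
    return np.array([
        np.sum(d[: m - k] * d[k:]) / np.sum(d[: m - k] ** 2)
        for k in range(m)
    ])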

Based on the Basu et al. approach [44], we use the Wiener-Khinchin theorem for

discrete-time processes to link the power spectral density as the Fourier transform of the autocorrelation, R_m^(g). Expanding on the definition from Eq. 1.9, the power spectral density [39] can be defined as,

S_m^(g)[f] = F{R_m^(g)} = Σ_{k=1-N_m}^{N_m-1} R_m^(g)[k] e^{-2πfkj}    (4.2)

where f is the frequency at which the power spectral density is evaluated. Since the

autocorrelation in Fig. 4.5 presents a similar shape to an impulse function, the power

spectral density in Fig. 4.6 shows a high concentration around 1, as it would be

expected for a unit step function. However, the error is visible over a target unit step

function. The point spread is caused by the side-lobes found in the autocorrelation,

and it can be analyzed by observing the distribution of the side-lobe.

In this particular case, the shape of the autocorrelation side-lobe resembles a

normal distribution skewed towards 0, and in most cases, we can infer the distribution

of S_m^(g). This is possible, given that the characteristic function of a normal distribution is,

ϕ{S_m^(g)} = F{ P(R_m^(g)[k > 0]) }
           = F{ (1 / (√(2π) σ_m^(g))) exp( -(R_m^(g)[k] - R̄_m^(g))^2 / (2 (σ_m^(g))^2) ) }    (4.3)
           = exp( j R̄_m^(g) f - (1/2) (σ_m^(g))^2 f^2 )

where ϕ is the Fourier transform of the probabilistic distribution of the autocorrelation, P(R_m^(g)). The estimate of the standard deviation of R_m^(g) is σ_m^(g), and R̄_m^(g) is the mean estimate. j is the imaginary unit, so j^2 = -1. As we can observe from Eq. 4.3, while the normal distribution in the side-lobes of the autocorrelation is present, it will be reflected as normally distributed values around the mean of S_{x_m}^(g).

Normality check on the autocorrelation: In [44] and expanding from Eq. 1.12,

Basu et al. stated that the following conditions must always be present in order to

guarantee wss,

∀i, k: S_{x_i}^(g)[f] = S_{x_{i+k}}^(g)[f]    (4.4)

However, since a full comparison is implied, storing all segments is required as new segments are being pushed in to be compared. It is here, where we propose a different strategy in an on-line testing fashion.

The strategy is based upon ensuring the normality of R_m^(g)[k > 0], which will guarantee the normal distribution of S_m^(g) with a specified standard deviation, σ_0. These evaluations can be assembled by using the χ² goodness-of-fit test, implying the comparison between the observed histogram of R_m^(g), of 6 bins (where the minimum necessary for χ² to be significant is 5 bins), and an expected histogram. The two evaluations can be easily identified as testing the normality of R_m^(g)[k > 0] and testing the similarity of the distribution of S_m^(g) against a given N{1, σ_0}. For the first test this is,

H_0: R_m^(g)[k > 0] ∼ N{R̄_m^(g), σ_m^(g)}    (4.5)
H_1: Not H_0

Σ_i (O_i - E_i)^2 / E_i > χ²_{1-α_0, df}    (4.6)

where, in order to reject the null hypothesis, H_0, Eq. 4.6 must be true. In Eq. 4.6, O_i and E_i represent the i-th frequency of the observed and expected histogram, respectively. α_0 and df are the predetermined level of significance, usually 0.95, and the degrees of freedom of the test, in our case 3. In case the test rejects the null hypothesis, x_m^(g) is aggregated with the new si random numbers, until the null hypothesis fails to be rejected and the evaluation is continued over the spectral properties.
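As an illustration of this test, the sketch below builds the 6-bin observed histogram of the autocorrelation side-lobes and compares it against the expected histogram of a fitted normal distribution with a χ² goodness-of-fit test (df = 6 - 1 - 2 = 3). It is a hedged example using scipy; the function name, the conventional alpha level and the bin handling are assumptions, not the thesis code.

import numpy as np
from scipy import stats

def acf_normality_check(acf_sidelobes, bins=6, alpha=0.05):
    # H0 (Eq. 4.5): the side-lobes R[k > 0] follow N(mean, std) estimated from the data.
    r = np.asarray(acf_sidelobes, dtype=float)
    mu, sigma = r.mean(), r.std(ddof=1)
    edges = np.linspace(r.min(), r.max(), bins + 1)
    observed, _ = np.histogram(r, bins=edges)
    expected = np.diff(stats.norm.cdf(edges, loc=mu, scale=sigma)) * r.size
    expected *= observed.sum() / expected.sum()               # match totals for the test
    _, p_value = stats.chisquare(observed, expected, ddof=2)  # df = bins - 1 - 2
    return p_value > alpha       # True: fail to reject H0 (normality holds)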

Spectral comparison on the autocorrelation: In this test, we apply the χ² test against a user-specified distribution, N{1, σ_0}. This is done by modifying the null hypothesis in Eq. 4.5 with the user-specified standard deviation level, σ_0, for S_m,

H_0: S_m^(g) ∼ N{1, σ_0}    (4.7)
H_1: Not H_0

then, we perform the χ² goodness-of-fit test, using Eq. 4.6. In the event both

evaluations are successful and the sequence passes the NIST FIPS 140-2 requirements [48], the segment holds the properties of a wss process output within a certain error level defined by σ_0. On the other hand, the rejection of the hypothesis implies a condition where the order of the elements in x_m^(g) causes side-lobes in its autocorrelation, and as a result S_m^(g) spreads more than σ_0. This explains the reason behind a shuffling algorithm guided by a GA in the next block.

Given the fact that the standard deviation, σ_{S_m^(g)}, of S_{x_m}^(g) can be large at some points in the evaluation, the χ² goodness-of-fit test tends to concentrate on high values, saturating the test. In these situations, σ_{S_m^(g)} becomes an ideal metric, although when σ_{S_m^(g)} is close to reaching σ_0, the χ² goodness-of-fit test becomes sensitive.

Shuffling random numbers with a genetic algorithm: Given the case where the spectral comparison fails, x_m^(g) is passed through a GA to alter its sequence order.

The GA is composed of,

a. Definition of the population: x_m^(g) is a chromosome solution for the problem over the M! solutions from all possible permutations of the elements of x_m^(g).

b. Fitness evaluation: This evaluates the difference between σ_{S_m^(g)} and the σ_0 specified by the user.

c. Selection: The probabilistic binary tournament selection scheme [92, 93] is used.

This scheme randomly picks a group of elements of x_m^(g), on a selection ratio,

RtSel, and then performs a tournament by pairs, where the winners with a

probability of (0.5, 1] are passed to a mating pool.

d. Crossover: In this recombination stage, the mating pool is divided into two

groups of parents, P0 and P1, each group holds a cardinality of LP , where

their elements are p0 and p1 respectively. Then, we apply either of the two

following operators, single point crossover or modular crossover. This is defined

by COmod = 1 for modular crossover, and COmod = 0 otherwise.

The single point crossover [94] relies on indexing all the elements in each parent,

P_0 = {p_0[i]}_{i=1}^{L_P} and P_1 = {p_1[i]}_{i=1}^{L_P}; then, by swapping all the elements above a

random cross-point, px, the children are defined as

C_0 = {p_0[i] ∪ p_1[i]}_{i=1}^{p_x}
C_1 = {p_0[i] ∪ p_1[i]}_{i=p_x+1}^{L_P}    (4.8)

The modular crossover operator focuses on swapping pairs of p_0 and p_1 by selecting a random origin position, d_ini ∼ U(1, L_P), and a destination position defined by d_end ≡ p_0[d_ini] mod d_ini. Then the children can be defined as,

C_0 = {p_1[d_end[i]]}_{i=1}^{L_P}
C_1 = {p_0[d_ini[i]]}_{i=1}^{L_P}    (4.9)

e. Mutation: The algorithm randomly selects two elements from a permutation

set and swaps them [95]. In our case, this operator is applied only in 1% of the evaluations for C_0 and C_1, given by the mutation probability or mutation ratio, Rt_Mut.

f. Replacement and Elitism: After mutation, only the children with better genetic

content than their parent survive to the next generation. Otherwise they are

replaced by their parents which smooths the fitness function as a consequence.

Elitism [96, 97] proves to reduce the drift while converging to the solution, as

it adjusts itself by selecting a ratio of RtElit with only the best among the past

and current generation.

g. Termination condition: The GA exits when the fitness satisfies σ_{S_m^(g)} ≤ σ_0. If the condition is successful, the algorithm loops back to calculate the autocorrelation,

as shown in Fig. 4.2. Otherwise the algorithm loops back to the selection stage

on the GA.

Output random numbers: In the last stage, the si random numbers meet the wss

condition. These are pushed out and referred to as internal random numbers.

4.1.1 Evaluation methodology

In this section, we introduce a group of tests intended to demonstrate two aspects:

a proof of concept defining the characteristics of the GA, and a performance and

quality study that will highlight the best conditions for its set up. The results of

these tests are presented in section 4.1.2.

Proof of concept. For the purpose of a proof of concept, we selected the first segment, x_0^(g), from a sequence of si random numbers with strong periodic traces, such that all its elements are distributed over the range [0, 255]. Step by step, we recorded the change in x_0^(g), the autocorrelation, R_0^(g), the power spectral density, S_0^(g), and its distribution at the g-th generation by sampling according to the block diagram in Fig. 4.2.

Having x_0^(g) with M = 512 samples, the algorithm was evaluated on its convergence capability given an error level (target) of σ_0 = 0.25. For simplicity, the normality χ² check on the autocorrelation and the spectral comparison were fixed to use (α_0, df) = (0.05, 3). With the didactic purpose of explaining the behaviour of the algorithm, we will present the results from three different generations: first at generation 0, at the end or g_f, and arbitrarily in-between the last two, g_mid.

For the GA parameters, the modular crossover technique was selected, COmod = 1. Then, we fixed the selection ratio to Rt_Sel = 0.75, or L_Sel = 384 parents. The mutation ratio was set to Rt_Mut = 0.01 and the elitism was applied to all selected parents, Rt_Elit = 1.0.

Convergence accuracy. In this evaluation we will confirm the accuracy of the algorithm over different sets of input parameters along 30 runs each. This will allow the user to select the best configuration that will hit the target σ0 = 0.5. The principal GA input parameters were selected as follows: RtSel = {0.25, 0.5, 0.75, 1},

COmod = {0, 1}, RtMut = {0.01, 0.023, 0.036, 0.05}, and RtElit = {0, 0.33, 0.66, 1},

giving a total of 128 different permutations.

For this test, using the same segment and specifications from the previous subsection, we focused on calculating the mean and the standard deviation of σ_{S_0^(100)} at the 100-th generation, given the four parameters used in the GA. We will denote the mean difference from σ_0 as σ_S(·) and the standard deviation as SD{σ_S(·)}, which form an ordered pair of coordinates (σ_S(·), SD{σ_S(·)}) representing a point, Q(·), in a Cartesian plane. We also measured the Euclidean distance to the origin, OQ(·) = sqrt( σ_S(·)^2 + SD{σ_S(·)}^2 ), which will be smaller for all the points that are

close to σ0 and larger for the ones that don’t converge fast enough by the 100-th

generation.

4.1.2 Results

Results of the proof of concept. In Figs. 4.7 to 4.9, we present the results, grouping them by generation g = {0, 23, 2403}, where the 2403-th generation is the last generation, g_f, and arbitrarily we select the 23-rd generation as g_mid, which has simi-

Figure 4.7: Initial case: The si random numbers (left) with periodic traces show strong side lobes in their autocorrelation (center), while the power spectral density presents an exponential distribution (right)

Figure 4.8: Middle case: At iteration 23, the si random numbers (left) with small periodic traces shows small side lobes in its autocorrelation (center), while the power spectral density starts to show a Gaussian distribution (right)

Figure 4.9: End case: At iteration 2403, the si random numbers (left) without periodic traces shows nearly no side lobes in its autocorrelation (center), while the power spectral density presents a Gaussian distribution around 1, with the target standard deviation(right)

larities with gf .

As observed in Figs. 4.7 to 4.9, the effect of the periodic traces found in x_0^(0), for numbers with magnitudes between 63 and 191, diminished in x_0^(23) and completely vanishes in x_0^(2403). Therefore, the effect of the serially correlated values in x is projected in the autocorrelation. In R_0^(0) it starts with high side lobes, on lags near 0,

Figure 4.10: Proof of concept: The standard deviation is shown to smoothly converge to 0.25

showing the strong correlation on values that are displaced within a short lag, which

is another reason why the traces are visible in x_0^(0). Looking at R_0^(23), we observe how the side lobes are reduced, showing small fluctuations, while in R_0^(2403) the fluctuations are minimized.

Moreover, we can observe the effect of the autocorrelation side lobes on the power

spectral density. Starting at S_0^(0), the range is 8 times larger than that observed in S_0^(23) and S_0^(2403). The shape of S_0^(0) shows two major peaks, which relate to the distance between repeating values. It tends to have an inverse exponential distribution, as shown in the right of Fig. 4.7, with σ_{S_0^(0)} = 2.2224. As the peaks are reduced in S_0^(23), the distribution tends to be normal around 1 (Fig. 4.8), with σ_{S_0^(23)} = 0.61184. In Fig. 4.9 it is in S_0^(2403), where σ_{S_0^(2403)} = 0.24498 fits a normal distribution and achieves the target. It is here where the GA prunes the solution using the modular operator, and smooths the search by applying the full elitism, as in Fig. 4.10. Nevertheless, full

elitism can cause premature convergence to a local minimum, but the mutation ratio tends

to slowly compensate, as it moves the search window away from the plateaus at a

finer-grid pace.

Results on the convergence accuracy. From Tables 4.1 & 4.2, the change of the parameters in the GA reflects the evident effects on the end result. Independently of

Table 4.1: Results of using one-point cross over

                        Rt_Mut
Rt_Sel | Rt_Elit | 0.01   | 0.023  | 0.0367 | 0.05
0.25   | 0.00    | 1.0273 | 0.9918 | 1.0007 | 0.9782
0.25   | 0.33    | 0.9592 | 1.0184 | 1.0223 | 1.0090
0.25   | 0.66    | 1.0124 | 1.0145 | 0.9840 | 0.9770
0.25   | 1.00    | 0.9995 | 0.9858 | 1.0014 | 0.9714
0.50   | 0.00    | 0.9531 | 0.9590 | 0.9407 | 0.9251
0.50   | 0.33    | 0.9185 | 0.9564 | 0.9237 | 0.9312
0.50   | 0.66    | 0.9365 | 0.9255 | 0.9306 | 0.9463
0.50   | 1.00    | 0.9031 | 0.9830 | 0.9047 | 0.9329
0.75   | 0.00    | 0.8667 | 0.8097 | 0.8130 | 0.8145
0.75   | 0.33    | 0.8454 | 0.8336 | 0.7823 | 0.8351
0.75   | 0.66    | 0.8473 | 0.8036 | 0.8324 | 0.8398
0.75   | 1.00    | 0.9016 | 0.8663 | 0.8320 | 0.8716
1.00   | 0.00    | 0.2744 | 0.2424 | 0.2373 | 0.2369
1.00   | 0.33    | 0.2438 | 0.2412 | 0.2381 | 0.2370
1.00   | 0.66    | 0.2422 | 0.2405 | 0.2368 | 0.2386
1.00   | 1.00    | 0.3189 | 0.2496 | 0.2372 | 0.2368

choosing either one of the two types of crossover operators, one-point crossover or modular crossover, we see that the trend is affected by Rt_Sel. When Rt_Sel ≤ 0.5, the resulting distance is near 1.0 for all mutation and elitism values. However, when Rt_Sel = 1.0, the Euclidean distance moves closer to 0.23, for which all mutation and elitism values exceed the target already. On this condition, the modular crossover settles around

0.23 to 0.25, while the one-point crossover has small fluctuations when the elitism is either removed, Rt_Elit = 0, or fully applied, Rt_Elit = 1.0. It is noticeable that the best selection of parameters is Rt_Sel = 1.0, COmod = 1, Rt_Mut = 0.05, and Rt_Elit = 0.66, which converges and reaches the target with a mean of σ_S(·) = |0.5 − 0.4810| = 0.019

Table 4.2: Results of using modular cross over

                        Rt_Mut
Rt_Sel | Rt_Elit | 0.01   | 0.023  | 0.0367 | 0.05
0.25   | 0.00    | 1.0321 | 1.0019 | 1.0124 | 1.0047
0.25   | 0.33    | 0.9974 | 1.0231 | 0.9936 | 1.0204
0.25   | 0.66    | 1.0218 | 0.9929 | 1.0248 | 1.0040
0.25   | 1.00    | 1.0089 | 0.9988 | 0.9925 | 0.9988
0.50   | 0.00    | 1.0424 | 1.0190 | 0.9936 | 1.0130
0.50   | 0.33    | 0.9882 | 1.0007 | 0.9693 | 0.9485
0.50   | 0.66    | 0.9854 | 1.0183 | 1.0041 | 0.9986
0.50   | 1.00    | 0.9991 | 0.9992 | 0.9747 | 0.9283
0.75   | 0.00    | 0.9042 | 0.8781 | 0.8806 | 0.8991
0.75   | 0.33    | 0.8645 | 0.8764 | 0.8519 | 0.9034
0.75   | 0.66    | 0.9006 | 0.8907 | 0.8702 | 0.8868
0.75   | 1.00    | 0.8994 | 0.8833 | 0.8846 | 0.8746
1.00   | 0.00    | 0.2433 | 0.2402 | 0.2354 | 0.2379
1.00   | 0.33    | 0.2424 | 0.2412 | 0.2370 | 0.2355
1.00   | 0.66    | 0.2431 | 0.2391 | 0.2384 | 0.2316
1.00   | 1.00    | 0.2543 | 0.2387 | 0.2384 | 0.2334

and a standard deviation of SD {σS (·)} = 0.0161.

4.2 Stationarity enforcement accelerated by an FFT-based algorithm

In an attempt to reduce the computation burden of calculating the stationarity enforcement block, we redesigned the algorithm presented in Sec. 4.1 by addressing three important parts: the correlation metrics, the evaluation criteria, and the architecture of the stationarity enforcement block. In the following subsection, we will address each of the improvements as well as the algorithm evaluations, using DAS random numbers from the natural sources of randomness found in the Intel digital random number generator (DRNG), available on Ivy Bridge processors [98], sampled from /dev/random/ over a Linux kernel.

4.2.1 Correlation metrics

Autocorrelation

As we have discussed before, one of the main metrics in stationarity enforcement encompasses the correlation of random numbers. Principally, this is solved by using metrics like autocorrelation. However, as previously mentioned, the sensitivity of the autocorrelation is inadequate for small periodic traces at different lags, which are represented as small side-lobes. The execution of the autocorrelation function from

Eq. 1.8 is a computationally expensive task that requires a total of 4N^2 + N floating point operations and 4N^2 + 9N + 4 memory accesses, where N is the length of the input data. There are many other efficient algorithms, the method based on the Wiener-Khinchin theorem being among them. This method uses two FFTs to calculate the autocorrelation. In combination with a single-pass variance calculation, like the Welford method [99, 100], it takes 12N log_2(N) + 9N − 2 floating point operations and 10N log_2(2N) + 12N − 4 memory accesses to calculate it. The algorithm includes the following operations,

S_xx[f] = FFT[x − x̄] · FFT*[x − x̄] / ((2N − 1) ς^2)    (4.10)

R_xx[k] = IFFT[S_xx]    (4.11)

where, in Eq. 4.10, FFT*[·] represents the complex conjugate of the FFT[·] operation, and the FFT[·] operation involves calculating the FFT of x after removing x̄ and zero-padding it to 2N in size. Then, S_xx[f] is the power spectral density of x in terms of frequency f, while in Eq. 4.11, IFFT[·] is the inverse FFT operation. As well, in Eq. 4.10, the sample variance, ς^2, is the Welford method implementation based on

the following equations,

M_k = { x_k,                             k = 1
      { M_{k-1} + (x_k − M_{k-1})/k,     else         (4.12)

ς_k = { 0,                                       k = 1
      { ς_{k-1} + (x_k − M_{k-1})(x_k − M_k),    else  (4.13)

ς^2 = ς_N / (N − 1)    (4.14)

This implementation works well against losing precision due to catastrophic cancellation, even though it is slightly slower than the naive method,

ς^2 = ( Σ_{k=1}^{N} x_k^2 − (1/N)(Σ_{k=1}^{N} x_k)^2 ) / N    (4.15)
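A compact Python sketch of Eqs. 4.10-4.14, combining the single-pass Welford variance with the FFT-based power spectral density and autocorrelation, is given below; numpy and the function names are assumptions for this illustration, not the thesis implementation.

import numpy as np

def welford_variance(x):
    # Single-pass sample variance (Eqs. 4.12-4.14), robust to catastrophic cancellation.
    mean, m2 = 0.0, 0.0
    for k, xk in enumerate(x, start=1):
        delta = xk - mean
        mean += delta / k
        m2 += delta * (xk - mean)
    return m2 / (len(x) - 1)

def psd_and_autocorrelation(x):
    # Eq. 4.10: normalized power spectral density of the mean-removed, zero-padded signal.
    # Eq. 4.11: autocorrelation recovered through the inverse FFT (Wiener-Khinchin).
    x = np.asarray(x, dtype=float)
    n = x.size
    xf = np.fft.fft(x - x.mean(), 2 * n)
    sxx = (xf * np.conj(xf)).real / ((2 * n - 1) * welford_variance(x))
    rxx = np.fft.ifft(sxx).real
    return sxx, rxx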


Figure 4.11: Probability mass function (a), autocorrelation (b) and power spectral density (c) of x random numbers in an ideal case scenario. Probability mass function (d), autocorrelation (e) and power spectral density (f) of x random numbers from a computer generated algorithm.

Power spectral density

As previously introduced in Eq. 1.9, the power spectral density is defined by,

S_xx[f] = F{R_xx} = Σ_{k=1−N}^{N−1} R_xx[k] e^{−2πfkj}

The implementation can follow the proposed algorithm in Eq. 1.9, which only involves

6N log_2(N) + 9N − 2 floating point operations and 5N log_2(2N) + 12N − 4 memory accesses. Under ideal conditions, the power spectral density of random numbers has

accesses. Under ideal conditions, the power spectral density of random numbers has Chapter 4: Stationarity enforcement on True Random Number Generators 113

a step function response at 1, as shown in Fig. 4.11(c). In reality, due to the small

peaks found at k ≠ 0 in the autocorrelation of Fig. 4.11(e), the power spectral density

spreads around 1 as in Fig. 4.11(f).

Another property between the autocorrelation and the power spectral density is the distribution of the peaks found at k ≠ 0 in the autocorrelation, as shown in Eq. 4.3. For example, in Fig. 4.11(e), r_x ∼ N(R̄_m^(0) = −0.00048, σ_m^(0) = 0.02204).

Therefore, when mapping the autocorrelation towards the power spectral density,

we can analytically find the shape of the points distribution around the ideal step

function in the power spectral density, Fig. 4.11(c), which happens to also follow a

normal distribution [45] as previously explained in Sec. 4.1.

In most cases, the characteristic function is used to describe a probabilistic distri-

bution behaviour over time. In our case, it is a helpful tool to compare the standard

deviations between r_x and its characteristic function. From r_x, the standard deviation, σ_m^(g), is always smaller than the standard deviation from ϕ{S_m^(g)}. In Eq. 4.3, the variance, (σ_m^(g))^2, is used to magnify the growth of the exponential, making it more sensitive to changes. This means that S_xx from Eq. 4.10 is more sensitive to variations

from 1, than Rxx from Eq. 4.11 or 1.8 to the changes against an impulse shape.

Besides the sensitivity of the metric, the computational intensity, I = fpo/mao, of each algorithm can express the relation between the number of floating point operations

(fpo) and the memory access operations (mao) as in [101]. Additionally, the compu-

tational intensity can be evaluated in terms of the data size, as shown in Fig. 4.12(a).

The autocorrelation from Eq. 1.8 is data intensive in its majority, and only after the data size passes 2^10 does the number of floating point operations grow close


Figure 4.12: Comparison of the autocorrelation, FFT-based autocorrelation and power spectral density algorithms analyzing (a) the computational intensity, (b) floating point operations and (c) memory access operations in terms of the data size N.

to the number of memory access operations, Fig. 4.12(b) and Fig. 4.12(c), where

both keep a ratio of 1. However, the other two algorithms are computationally intensive and follow a similar trend, since the power spectral density is necessary to

Eq. 4.10 is higher, but the number of floating point operations is smaller than the

autocorrelation, Fig. 4.12(b). Therefore the power spectral density is suited for long

data segments where the computing architecture is ideal for massive floating point

operations. Chapter 4: Stationarity enforcement on True Random Number Generators 115

Figure 4.13: The Architecture of the Stationarity Enforcement Block

4.2.2 Evaluation criteria

As explained in Sec. 4.1 expanding from [44] and Eq. 1.12, we relax Eq. 4.4 by

taking advantage of the fact that S_m^(g) has a normal distribution; therefore, we can reduce the

That is,

∀m: σ_0 ≥ σ_{S_m^(g)}    (4.16)

The details on the algorithm and how (4.16) becomes an essential metric will be

discussed in subsection 4.2.3.
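A minimal sketch of how Eq. 4.16 can be used as an acceptance check is shown below; it recomputes the normalized power spectral density of a candidate segment and compares its standard deviation with the user-specified σ_0. The function name and the direct use of numpy are assumptions for this illustration.

import numpy as np

def meets_wss_target(segment, sigma0):
    # Eq. 4.16: the segment is accepted when the spread of its power spectral
    # density around 1 does not exceed the user-specified level sigma0.
    x = np.asarray(segment, dtype=float)
    n = x.size
    xf = np.fft.fft(x - x.mean(), 2 * n)
    sxx = (xf * np.conj(xf)).real / ((2 * n - 1) * x.var(ddof=1))
    return sxx.std() <= sigma0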


Figure 4.14: DAS random numbers (a), their probability mass function (b) and autocorrelation (c). Semi-internal random numbers (d), their probability mass function (e) and autocorrelation (f).

4.2.3 Enhancements on the stationarity enforcement block

Fig. 4.13 illustrates the stationarity enforcement block. In this block, the si random numbers, x, obtained from the histogram specification block are the input to the stationarity enforcement block in Fig. 4.1. The si random numbers may have small traces of periodicity not corrected by the histogram specification block. However, the histogram specification block guarantees x to follow a uniform distribution.

Fig. 4.14(a)-(f) shows an example of DAS random numbers and semi-internal random numbers, including their pmf and autocorrelation.

In Fig. 4.13, each block is considered as one phase, or step, in the proposed iterative algorithm.

i. Sampling semi-internal random numbers: As previously defined, x exhibits U(0, 1), a uniform distribution. x will be considered to have a size of N, and x̂_j will be the j-th segment of size M that is sampled from x, such that the concatenation of all x̂_j forms x.

Algorithm 6 Parallel Permutation-based Genetic Algorithm

P_pool, Ch_pool, Ext_pool are the parents, children and external populations.
n_P, n_Ch, n_Ext are the parents, children and external population sizes.
P_val is the parents pool fitness score vector. x_Best is the best solution.
x_Ext is the external random numbers. σ_Best is the best solution fitness score.
σ_0 is the target fitness score, also an input.
Th_done is a vector shared among threads. myid is the thread id, unique to each thread.
1: In parallel do:
2:     Th_done[myid] = false
3:     P_pool = Selection(n_P, N)
4:     P_val = Evaluation(P_pool, x_m)
5:     σ_Best = Min(P_val)
6:     x_Best = x_m[P_pool[ArgMin(P_val)]]
7:     while σ_Best > σ_0 & Any(Th_done) = false do
8:         Ch_pool = CrossOver(P_pool, n_Ch)
9:         Ch_pool = Mutation(Ch_pool)
10:        x_Ext ← N new random numbers
11:        Ext_pool = Selection(n_Ext, N)
12:        σ_Best, x_Best, P_pool, P_val = Replacement(P_pool, Ch_pool, Ext_pool, x_Ext, x_Best)
13:    end while
14:    Th_done[myid] = true
15:    if σ_Best ≤ σ_0 then
16:        return x_Best
17:    end if

ii. Shuffling random numbers with GA: In this stage, xm is passed through a

GA to alter its sequence order, which in consequence will have a major impact on

the power spectral density. We explain our parallel GA implementation in Alg. 6,

which relies on the data non-dependencies to avoid synchronization as much as

possible. Given the nature of GA, parallelism can be exploited at a coarse-grain

level. In this algorithm, we focus on removing the data dependencies so that a

large number of threads can be launched without a synchronization constraint,

allowing a finer-grain parallelism.

A traditional implementation of permutations-based GA involves generating the

parent population of random numbers, followed by calculating the permutation

indexes. For simplicity, the parent populations are sorted so that the crossover

operations are consistent among parents. A drawback of this implementation

is the need of unsorting the random numbers before evaluating them. Sort-

ing, calculating the permutation indexes and unsorting them, can take from

2 Nlog2 (N) + N operations at its best, and in its worst case Nlog2 (N) + N op- erations. On the other hand, our implementation involves generating the pop-

ulations of permutation indexes representing the parents. This approach avoids

sharing values between parents. Instead, it only shares the indexes transferring

the order of pattern to the children. This approach involves only N operations.

The idea behind this algorithm is presented in Algorithm 6. The sequence order

or indexes will be considered as a chromosome solution. As well, we introduce

the idea of having an external population that migrates part of its genetic

content to the pool, avoiding stagnation. Since its parallel implementation takes

advantage of the data independence, every thread holds a unique id, myid, and

keeps updating a shared array, Thdone, in which only one memory position is

accessed by a thread. In the following, we will explain all the processes that are used in Alg. 6.

a. Selection: This procedure selects the pool of permutation indexes. At this

stage, the intention is to sample and select a group of permutation indexes

that will be later used on the mating process inside the GA. The selection

procedure is illustrated in Alg. 7. We implement the Fisher-Yates shuffle

algorithm due to its uniformity in picking a random permutation and

its generation speed [100]. So given that n is the number of vectors and M

is the length of a vector, this algorithm will return a pool of permutation

indexes, a.

Algorithm 7 Selection method
a is a pool of n vectors of length M.
a_k is the k-th index vector, initialized in ascending order from 0 to M − 1.
n, M are inputs.
1: for k = 0 to n − 1 do
2:     for i = 0 to M − 2 do   {Fisher-Yates Shuffle Algorithm}
3:         j ← random integer number from {0, 1, ···, M − 1 − i}
4:         swap a_k[i] and a_k[i + j]
5:     end for
6: end for
7: return a
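A Python sketch of this selection step might look as follows; the Fisher-Yates shuffle is applied to each index vector of the pool. The function name and the use of the standard random module are assumptions for this illustration.

import random

def selection(n, m):
    # Build a pool of n permutation-index vectors of length m (Alg. 7 in spirit).
    pool = []
    for _ in range(n):
        a = list(range(m))
        for i in range(m - 1):               # Fisher-Yates shuffle
            j = random.randint(i, m - 1)     # uniform position in [i, m-1]
            a[i], a[j] = a[j], a[i]
        pool.append(a)
    return pool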

b. Evaluation: This process is focused on evaluating a pool of permutation

indexes based on calculating the standard deviation of the power spectral

density from Eq.(4.10).

c. Crossover: This process is focused on the cross over operation. The

Order 1 cross over is a simple permutation based crossover that connects

one group of indexes from one parent with the remaining indexes of the

other parent. The process is shown in detail in Alg. 9.

Algorithm 8 Evaluation method
a is a vector of the pool fitness scores. a_k is the k-th element.
x_m is the segment of random numbers. b is the pool of indexes of size n by M.
x_m, b are inputs.
1: for k = 0 to n − 1 do
2:     x_a ← x_m[b_k]
3:     a_k ← st. dev. of S_{x_a x_a}   {From Eq. (4.10)}
4: end for
5: return a

Algorithm 9 Order 1 Cross Over method
a is the pool of vectors; the pool has a size of n, and each vector a size of M.
a_k is the k-th vector. a, n are inputs.
P_A, P_B will be referred to as parent A and parent B.
Pos_0, Pos_1 will be referred to as partition point 0 and point 1.
A = {A_0, A_1, A_2} is the child vector, product of the cross over operation.
1: for k = 0 to 2n − 1 in steps of 2 do
2:     P_A ← a_k
3:     P_B ← a_{k+1}
4:     Pos_0 ← random number from 0 to M − 2
5:     Pos_1 ← random number from Pos_0 to M − 1
6:     A_1 ← P_A[Pos_0 : Pos_1]
7:     A_0 ← only the first Pos_0 elements from P_B \ A_1
8:     A_2 ← elements from P_B \ {A_0 ∥ A_1}
9:     a_k ← A_0 ∥ A_1 ∥ A_2
10: end for
11: return a
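The Order 1 operator can be sketched in Python as below: the child keeps a random slice of parent A in place and fills the remaining positions with parent B's genes in their original order. The helper name and the numpy random generator are assumptions for this illustration.

import numpy as np

def order1_crossover(parent_a, parent_b, rng=None):
    # Order 1 crossover on permutation indexes (Alg. 9 in spirit).
    rng = rng or np.random.default_rng()
    m = len(parent_a)
    pos0 = int(rng.integers(0, m - 1))       # partition point 0 in [0, m-2]
    pos1 = int(rng.integers(pos0, m))        # partition point 1 in [pos0, m-1]
    a1 = list(parent_a[pos0:pos1 + 1])       # slice inherited from parent A
    rest = [g for g in parent_b if g not in set(a1)]   # parent B order, minus A1
    return rest[:pos0] + a1 + rest[pos0:]    # A0 || A1 || A2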

d. Mutation: This process is focused on the mutation operation. The exchange mutation, as in Algorithm 10, is a simple technique that, only 5% of the time, randomly selects two indexes and swaps their positions.

e. Replacement: This process is dedicated to applying elitism by selecting

the best sets of indexes based on their fitness values [96, 97]. It also

updates the pool of parents with the best sets possible. This process takes into consideration the pool of children and the pool of externals, which have

Algorithm 10 Exchange Mutation method
a is the pool of vectors; the pool has a size of n, and each vector a size of M.
a_k is the k-th vector. a is an input.
Pos_0, Pos_1 will be referred to as partition point 0 and point 1.
1: for k = 0 to n − 1 do
2:     Pos_0 ← random number from 0 to M − 1
3:     Pos_1 ← random number from 0 to M − 1
4:     Pr ← random number from 0 to 1
5:     if Pr < 0.05 then
6:         swap a_k[Pos_0] and a_k[Pos_1]
7:     end if
8: end for
9: return a

a key purpose of escaping local minima or stagnation over the fitness

function. The replacement process is described further in Algorithm 11.

Algorithm 11 Replacement method
Ppool, Chpool, Extpool are the parents, children and external pools of indexes
Pval, Chval, Extval are the parents, children and external pool score vectors
nP is the number of parents in Ppool
1: Pval = Evaluate(Ppool, xBest)
2: Chval = Evaluate(Chpool, xBest)
3: Extval = Evaluate(Extpool, xExt)
4: σP = Min(Pval)
5: σCh = Min(Chval)
6: σExt = Min(Extval)
7: Ppool ← select only the best nP indexes from {Ppool, Chpool, Extpool} using {Pval, Chval, Extval}
8: Pval ← select only the best nP scores from {Pval, Chval, Extval}
9: xBest ← select the best indexes from {Ppool, Chpool, Extpool} using either xBest or xExt
10: σBest ← the best value from {σP, σCh, σExt}
11: return σBest, xBest, Ppool, Pval
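A simplified Julia sketch of this elitist replacement is given below. It reuses the evaluate sketch above and, as a simplification of Algorithm 11 (which scores the external pool against xExt), scores all three pools against the same segment:

# Merge the parent, children and external pools, keep the nP best-scoring
# index vectors as the new parents, and return the best score found.
function replacement(ppool, chpool, extpool, xm, nP)
    pool   = vcat(ppool, chpool, extpool)
    scores = vcat(evaluate(ppool, xm), evaluate(chpool, xm), evaluate(extpool, xm))
    order  = sortperm(scores)             # ascending: smaller std. dev. of the PSD is better
    best   = order[1:nP]
    return scores[order[1]], pool[best], scores[best]   # σBest, new parents, their scores
end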

f. Termination condition: The GA exits when the fitness is σBest ≤ σ0.

If the condition is successfully met, the algorithm proceeds to update

Thdone [myid] = true, and consequently any other running thread will stop

by the end of its iteration, as shown in Alg. 6 at line 6. Then, the algorithm

returns the best sequence of random numbers found, xBest or the answer

for the j-th segment, ŷj. Otherwise, if the condition is not met, the

algorithm loops back to the crossover stage of the GA at line 7.

iii. Handling the output buffer: In Fig. 4.13, the output of the GA in ii, ŷj, is

pushed into the buffer y. Then, if the buffer is not full yet, it will loop back to

the algorithm to sample the semi-internal random numbers at step i. Otherwise,

y is pushed out as the internal random numbers.

4.2.4 Evaluation methodology

In this section, we introduce a group of tests to show the characteristics of the

random number generation aided by the GA post-processing. The results of these

tests are presented in subsection 4.2.5.

Characteristics of a GA post-processing stage

In this evaluation, we follow the m-th segment of random numbers, analyzing step-by-

step the change in xm, its autocorrelation or Rxmxm from Eq. 4.11, its power spectral

density or Sxmxm from Eq. 4.10 and its distribution described by its probability mass

function or fxm . The evaluation is iterated over a loop from step 6-12 in Alg. 6 or

generation, until the standard deviation of the power spectral density, or σBest, is smaller than the user specified quality level, σ0.

Having xm with N = 1024 samples, the algorithm is evaluated on its convergence capability given an error level (target) of σ0 = 0.415. The parents, children, and

external population sizes are set to M = 32 each. With the intention of explaining the behaviour of the algorithm, we present a snapshot at three different generations: at the first generation or gini = 0, in the middle or gmid = 300, and then in the last

generation or gend, which will correspond with the output.

For the GA parameters, the mutation ratio was set to 0.05 or 5% and the elitism

was applied at the replacement stage to all individuals, letting only the best-fitted

individuals survive, which at the next generation will be considered as the parents pool.

4.2.5 Results

In this section, we present the results according to the evaluation methodology

described in Section 4.2.4.

Results from the Characteristics of a GA post-processing stage

As observed in Fig. 4.18, the standard deviation of the power spectral density,

or σBest, always tends to decrease as the algorithm advances through the generations,

which is the expected behaviour. This is caused by the elitism implementation in the

Replacement Algorithm 11. The elitism determines whether σBest decreases or

stagnates, keeping only the best individuals available for the next generation.

In Fig. 4.18, in a didactic manner, we select three points to compare the progress of the distribution of the power spectral density: its respective shape of the autocorrelation, the shape of the probability mass function, and the distribution pattern of the points. In Fig. 4.15, at generation gini = 0, the random numbers show some traces of

periodicity, which can be seen as peaks at lags other than 0 in the autocorrelation of Fig. 4.15(c).


Figure 4.15: (a) Random numbers at generation 0, (b) its probability mass function, (c) the autocorrelation, (d) the power spectral density


Figure 4.16: (a) Random numbers at generation 300, (b) its probability mass func- tion, (c) the autocorrelation, (d) the power spectral density


Figure 4.17: (a) Random numbers at generation 4779992, (b) its probability mass function, (c) the autocorrelation, (d) the power spectral density

This is also shown in the power spectral density, Fig. 4.15(d), as points that overshoot above 5.0.


Figure 4.18: Standard deviation of the Power spectral density plotted against the change on generations

These random numbers, at generation 0, are already of outstandingly good quality, σBest = 0.4876, but as shown in Fig. 4.15(c-d) there is still room for improvement. Nevertheless, because we only manipulate the order and not the magnitudes of those numbers, the probability mass function remains uniform across all generations.

In Fig. 4.16(d), at generation gmid = 300, the overshoots in the power spectral

density converge to below 5.0, still with traces of periodicity in the autocorrelation

of Fig. 4.16(c), but closer to the solution. At this generation, with σBest = 0.4382, the distribution pattern of the points in Fig. 4.16(a) is sparser than the one in Fig. 4.15(a).

The algorithm converges at generation gend = 4779992, where the power spectral density is mostly concentrated below 3.0 with σBest = 0.4150, as can be seen in Fig. 4.17(d). As well, the improvement in the power spectral density is reflected in the autocorrelation as a closer resemblance to an impulse shape,

Fig. 4.17(c). The overall benefit of the algorithm can be observed in Fig. 4.17(a),

where the points are spread covering the majority of the space, with less visible

agglomerates, or darker spots, than in Fig. 4.16(a) and Fig. 4.15(a).

As we have reviewed, the underlying idea of Algorithm 6 is to benefit from the interactions within the pools of parents, children, and externals. To review these interactions in more detail, we look at the effects of those distributions in the following two examples.

Figure 4.19: Standard deviation of the power spectral density from the parents, children and external pools plotted against the change in generations

It is well known that one of the major drawbacks of elitism is the reduction of genetic content over the population, which perpetuates the stagnation of the fitness function. However, elitism is characterized by a smooth fitness function. So, in order to compensate for the lack of genetic content, the external pool is introduced. In Fig.

4.19, for example, we can observe that the distribution of the children pool scores mostly overlaps the external pool scores. At each iteration, the minimum among the two distributions is selected as the best, as described in Algorithm

11. Here, the external pool becomes another set of points, uniformly distributed over the solution space.

Figure 4.20: Standard deviation of the power spectral density from the parents and children pools, without an external pool, plotted against the change in generations

Meanwhile, the children pool is distributed similarly to the

parent distribution, since it keeps traces of the permutation orders described in

Algorithms 7, 9, and 10.

Contrary to the example in Fig. 4.19, we also present, in another example, Fig.

4.20, the effects of the lack of an external pool. Here, the fitness function tends to stagnate for longer periods, and it is also noted that the children do not impact σBest as much as in the case of Fig. 4.19, since the sample of the solution space is smaller.

Consequently, this leads to a lack of genetic diversity and the proliferation of dominant

traits in the children, observed across different generations as stagnation.

Balancing these two factors is beyond the scope of this thesis, but it gives a good

insight into which variables contribute to modifying the fitness curve: the pool size and the children-to-external pool ratio.

4.3 Asymptotic analysis of algorithms

As previously presented in Sec. 3.3, the Big-O notation is helpful for highlighting the upper bound, or worst case scenario, of the execution time. The algorithms presented in Chapter 3 are easily analyzed from a theoretical point of view since they are deterministic. The analysis becomes more complex when the algorithm contains multiple sources of randomness as part of its logic. These types of algorithms are known as randomized algorithms. A useful approach for analyzing the complexity of randomized algorithms [102, 103] is to calculate the expectation,

E[t], of the worst execution time, te, given an input size, N, and a random variable, X.

That is,

E[t] = \sum_{i=0}^{M} t_e[i] \, P\{t = t_e[i]\}     (4.17)

From Eq. 4.17, we can then approximate the expected complexity of the runtime as E[O(·)]. In our case, by fixing the user specified quality level to σS = 0.5, we can proceed to analyze the complexity of the two algorithms presented in this chapter. In Fig. 4.21-4.22 we present the observed execution time and storage space in terms of the input data size for both algorithms.

Figure 4.21: Time complexity for naive and accelerated approach

Figure 4.22: Space complexity for naive and accelerated approach

By making a series of educated guesses, we can observe that the naive approach from Sec. 4.1 has an exponential trend for both execution time and storage space. Then, by using least squares fitting [104], we can approximate its exponential form, y_i = A e^{B x_i}, where

A = \exp\left( \frac{\sum_i (x_i^2 y_i) \sum_i (y_i \ln y_i) - \sum_i (x_i y_i) \sum_i (x_i y_i \ln y_i)}{\sum_i y_i \sum_i (x_i^2 y_i) - \left(\sum_i x_i y_i\right)^2} \right)

and

B = \frac{\sum_i y_i \sum_i (x_i y_i \ln y_i) - \sum_i (x_i y_i) \sum_i (y_i \ln y_i)}{\sum_i y_i \sum_i (x_i^2 y_i) - \left(\sum_i x_i y_i\right)^2}.

As well, we evaluate the coefficient of determination as a metric of how well the approximation model represents the observed samples [105]. The coefficient of determination is also known as

R^2 = 1 - \frac{\sum_i (y_i - y_o)^2}{\sum_i (y_i - \mu)^2},

where y_i is an observed point, y_o is the modelled point, and \mu is the mean of the observed points. In Table 4.3, we show the values of A, B and the models' coefficient of determination. Also, in Fig. 4.21-4.22, the two models are overlapped with the observed data as continuous lines. On the other hand, for the accelerated approach from Sec. 4.2, we can make an educated guess that it follows a power-law functional form in both cases. Therefore, in a similar manner, we approximate using y_i = A x_i^B [106], where

A = \exp\left( \frac{\sum_i \ln y_i - B \sum_i \ln x_i}{|y|} \right),
B = \frac{|y| \sum_i (\ln x_i \ln y_i) - \sum_i (\ln x_i) \sum_i (\ln y_i)}{|y| \sum_i (\ln x_i)^2 - \left(\sum_i \ln x_i\right)^2},

and |y| is the cardinality of y. In Table 4.4, we confirm the estimated model, which is represented in Fig. 4.21-4.22 as a continuous line. From both figures, we can conclude that the naive algorithm can be approximated and simplified to follow E{O(e^N)}, while the accelerated algorithm follows an E{O(N^c)}, where c is a constant. Therefore, the accelerated algorithm intersects with the exponential models in both cases, making it more efficient than the naive algorithm in either time complexity or space storage for larger values of N. For example, E{O(e^N)} > E{O(N^c)} for N > 280, so in the worst average case the accelerated algorithm is faster than the worst average case of the naive algorithm. For

N > 70, the worst average case of the accelerated algorithm needs less space than the worst average case of the naive algorithm.

Table 4.3: Approximation of complexity model for the naive approach

Variable         A        B       R^2
Execution time   2x10^-5  0.0432  0.882
Space storage    2419     0.0455  0.914

Table 4.4: Approximation of complexity model for the accelerated approach

Variable         A        B       R^2
Execution time   0.0008   1.777   0.951
Space storage    43659    0.0717  0.933
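For reference, the A and B values reported in Tables 4.3-4.4 follow the standard least-squares formulas of [105, 106]; a minimal Julia sketch of those fits and of R^2 (function names are ours, not the thesis code) could be:

using Statistics

# Least-squares fit of y = A*exp(B*x), linearized as ln y = ln A + B*x
# and weighted by y, following Weisstein [105].
function fit_exponential(x, y)
    sy, sxy = sum(y), sum(x .* y)
    sx2y = sum(x.^2 .* y)
    syly, sxyl = sum(y .* log.(y)), sum(x .* y .* log.(y))
    d = sy * sx2y - sxy^2
    return exp((sx2y * syly - sxy * sxyl) / d), (sy * sxyl - sxy * syly) / d   # A, B
end

# Least-squares fit of the power law y = A*x^B, following Weisstein [106].
function fit_powerlaw(x, y)
    n, lx, ly = length(x), log.(x), log.(y)
    B = (n * sum(lx .* ly) - sum(lx) * sum(ly)) / (n * sum(lx.^2) - sum(lx)^2)
    return exp((sum(ly) - B * sum(lx)) / n), B                                 # A, B
end

# Coefficient of determination between observed y and modelled yhat.
r2(y, yhat) = 1 - sum((y .- yhat).^2) / sum((y .- mean(y)).^2)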

4.4 Summary

In this chapter, we presented two algorithms that enforce stationarity within a

user specified quality level, σ0. These approaches are independent of the algorithms

and architectures presented in Chapter 3, since their only purpose is to modify the sequence of numbers. We conclude that the use of a post-processing stage for TRNGs using a GA, intended for seed generation, proves to satisfy the specified quality level within a desirable number of generations. We have presented the theoretical foundations that guarantee the algorithm's convergence to a nearby point of σ0, as well as

the analysis of a GA approach aided to avoid stagnation, with its respective complexity

and performance justification. We also highlight that the second algorithm, in

Sec. 4.2, proves to be efficient compared to the naive implementation from Sec. 4.1.

Leaving open for review in Chapter 5 are the effects of the segment size over longer

runs with standardized evaluations.

Chapter 5

Quality Evaluation in True Random Number Generators

In Chapter 4, we presented two techniques, a naive and an accelerated algorithm, which shuffle the positions of the random numbers in order to improve the quality of the si random numbers. The complexity of both techniques is determined by the segment size of si random numbers that are shuffled at a time. For small segment sizes, the naive approach is faster and a more compact solution. However, for larger segment sizes, a better approach is to use the accelerated FFT-based algorithm. This becomes a key point when the data size is large, as in today’s big data. Therefore, in this chapter we extend Chapter 4’s stationarity enforcement accelerated FFT-based algorithm to analyze the effects of large data sets and their impact on long runs, which is especially applicable to cloud-related services.

In this chapter, we will introduce the parallelization scheme used to evaluate the algorithm, focusing on the programming model of MapReduce for homogeneous

clusters with multi-cores over a distributed shared memory model using Julia as a programming language. Then, we will proceed with a battery of tests from the

TestU01 evaluation package for random number generation [107], focusing on the group of tests called SmallCrush.

5.1 Parallel scheme

In Chapter 4, the stationarity enforcement accelerated FFT-based algorithm was presented as a general method. Here, we extend the algorithm design to a specific implementation on a homogeneous cluster as our case study.

In this case study, the implementation was completed on the University of Manitoba InterDisciplinary Evolving Algorithmic Science (IDEAS)/Computational Finance Derivatives (CFD) lab homogeneous cluster, with three Dell PowerEdge R420 nodes using two Intel Xeon E5-2400 processors and 12 cores, supporting multi-threading. For its implementation, we use the MapReduce programming model.

The MapReduce programming model is a well-known model and implementation for large data sets and large pools of resources. In this model, the problem is decomposed into a set of functions that exploit the data non-dependency of the problem, in such a way that the mapping function allows the problem to be distributed across the computational resources with its corresponding data partition. In this way, the data partitioning and task scheduling are handled automatically during run time, without intervention from the user, providing fault tolerance. As well, the model offers a reduction function that is used to collapse the partial results into one, after a global

synchronization barrier [108, 109]. This model is suitable for embarrassingly parallel

problems.

Today, we can find multiple implementations of MapReduce in different program-

ming languages, and the Julia language is no exception. Julia is a flexible dynamic

language designed for scientific, numerical, and high performance computing. Also,

Julia provides a group of instructions for parallel computing, including the implemen-

tation of the mapping function called pmap(). In our case, we implemented Algorithm

6 in Julia, using pmap to partition the execution of parallel GAs given that it is an embarrassingly parallel solution to the optimization search of the random numbers quality problem described in Chapter 4. In the next section, we will continue to

explain and evaluate the implementation using a battery of quality and performance

tests.
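As a minimal sketch of this parallel scheme, independent per-segment GA runs can be dispatched with pmap as shown below; enforce_stationarity is a placeholder name for the per-segment GA of Algorithm 6, and the worker count, segment length, and target σ0 are illustrative:

using Distributed
addprocs(7)                                     # add 7 worker processes (illustrative)

@everywhere function enforce_stationarity(segment; σ0 = 0.45)
    # Placeholder for the per-segment GA of Algorithm 6: permute the segment
    # until the std. dev. of its power spectral density drops below σ0.
    return segment
end

segments = [rand(1024) for _ in 1:64]           # semi-internal random numbers, by segment
results  = pmap(enforce_stationarity, segments) # one independent GA run per segment
output   = vcat(results...)                     # concatenate the processed segments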

5.2 SmallCrush evaluation

This evaluation is intended to highlight the weaknesses and strengths of the post-

processing method. It evaluates a vast group of configurations as shown in Table 5.1.

The evaluations are done considering xm with Nm = 1024 samples, and an error level

(target) of σ0 = 0.45. The parents, children, and external population sizes are set to

N = 32 each.

SmallCrush is a battery test that forms part of the statistical package TestU01

[107]. This is a modern test package that includes most available test packages,

surpassing them in flexibility, and extends popular packages such as DIEHARD [110] and

NIST [111]. As is standard, TestU01 uses statistical testing where the test outcome, or p-value, fails if it is outside of the interval [0.001, 0.999]. In essence, SmallCrush consists of 10 tests: the birthday spacing test [112], the collision test, the gap test, the simplified poker test, the coupon collector test, the maximum-of-t test [100], a weight distribution test proposed by Matsumoto and Kurita [113], the rank of a random binary matrix test [112], two tests of independence between the Hamming weights of successive blocks [114], and various tests based on a random walk over the set of integers [107].

Table 5.1: SmallCrush test configurations

Number of processors: 2, 4, 8
Segment size: 64, 16384, 17408, 32768, 49152, 65536, 1048576

i The birthday spacing test. The birthday spacing test checks the separation

between numbers by mapping them to points in a cube. It checks that the

number of collisions between their spacing follows a Poisson distribution.

ii The collision test. This test checks for n-dimensional uniformity, by checking

that the number of collisions per subdivision of the interval [0, 1] is uniformly

distributed.

iii The gap test. This test checks the number of times that random numbers within

[0, 1) fall outside of an expected interval. It applies chi-square testing between

the observed and expected number of observations.

iv The simplified poker test. This test checks for the number of different integers

that are generated. It applies chi-square testing between the observed and

expected number of observations.

v The coupon collector test. This test checks, over a group of random

integers, for integers without collisions and unique numbers, then applies a chi-

square test between the observed and expected number of observations among

the group of random numbers.

vi The maximum-of-t test. This test first breaks a sequence into subsequences of

equal length, then it checks that the maximum random

number of each subsequence, bounded within [0, 1), follows an exponential

distribution via chi-square and Anderson-Darling tests.

vii The weight distribution test. This test takes a group of uniform distributions;

then it computes a binomial distribution, which is compared via chi-square

against the expected distribution.

viii The rank of a random binary matrix test. This test generates a binary matrix

with uniform random numbers and then it calculates the rank of the matrix, or

the number of linearly independent rows. Then, the probability distribution is

compared to a known distribution via chi-square.

ix The independence test between hamming weights of successive blocks. This in-

volves two tests that check for independence among Hamming weights. The first

test considers only the most significant bits of random numbers in a block and

its corresponding Hamming weight. Then the Hamming weights are arranged

in successive pairs that later are evaluated for the number of possibilities, and

compared against expected values via a chi-square test. The second test consid-

ers mapping the counting from the first test into a 2-dimensional plane, where it

is segmented into four blocks where its values are related to an expected value.

Therefore, via a chi-square test the observed quantities can be compared to the

expected values.

x The random walk test. This test performs a random walk over integers, having

an equal probability to move to right or left during the walk. A group of

statistics and the distribution of its end position in the test are, both, compared

against their theoretical targets via a chi-square test.

5.3 Results

In order to simplify the results review, given that there are 10 SmallCrush evalua-

tions with the three different configurations on the numbers of processors, and seven

different segment sizes, we present a coded colour map highlighting the p-value in Fig.

5.1 to 5.3. In a SmallCrush test, the test is passed only if all the 10 evaluations have a p-value within the range of [0.001, 0.999]. During these evaluations, sequences of 32-bit

IEEE 754 floating-point random numbers were fed into the battery evaluation.

The first configuration uses 2 processors, where the coded colour results are

presented at different segment sizes in Fig. 5.1. The evaluation of SmallCrush in this

configuration is close to passing the battery test, but it fails in at least one test, either the

maximum-of-t test or the random walks. Independent of this, the battery of tests

passes in full when the segment size is 17408.


Figure 5.1: SmallCrush results from TestU01 battery test using 2 processors.


Figure 5.2: SmallCrush results from TestU01 battery test using 4 processors.

The second configuration uses 4 processors and the results are presented in Fig.

5.2. In this configuration, all the evaluations failed either the maximum-of-t or the random walks test, similar to the tests in Fig. 5.1. The results tend to fail exclusively on one of the two tests. The only evaluation at which the failing test is close to passing is when the segment size is 17408.

Figure 5.3: SmallCrush results from TestU01 battery test using 8 processors.

The third configuration uses 8 processors. The results are shown in Fig. 5.3.

In a similar manner, the failed evaluations are mutually exclusive, alternating between the maximum-of-t and random walks tests. The only evaluation that passes has a segment size of 17408.

Summarizing the overall SmallCrush tests, the segment size plays a key role in the quality of the test results. As well, only maximum-of-t and random walks fail across different segment sizes and across configurations with different numbers of processors. The maximum-of-t test focuses on extracting the maxima from a sample set of random values, where the maximum follows an exponential distribution. Therefore, failing maximum-of-t is due to inherent statistical properties of nearly iid uniform distributions. This improves at certain segment sizes, because the segment size aligns with the sample set size of the maximum-of-t test. The random walks test relies on the properties of Markov chains, where the probability of the current position depends on the previous position. This means that failing random walks is due to a lack of consistency among contiguous segments. This evaluation passes when the segment size fits the sampling size used over the random walk, so this connecting property is satisfied. So far, a segment size of 17408 random numbers is one sub-optimal solution for centring the evaluations, and it is almost independent of the number of processors. These conditions seem to remain constant as the number of processors is increased, since passing the test is connected to the consistency of σBest among segments, which is aided by increasing the sampling size. This is benefited by the parallelism exposed in the GA, as seen in Algorithm 6.

5.4 Discussion

During the evaluation of the SmallCrush tests, the results tend to have a positive output. In this section, we discuss the effects of the post-processing stages on the results for each of the tests.

i The birthday spacing test. The post-processing takes care of distributing the

random numbers avoiding overlapping. Then, this gets reflected on the distri-

bution over the cube representation in the test. The results are therefore of

good quality.

ii The collision test. Similar to the birthday spacing test, the evaluation gets

favoured at large segment sizes, since it reduces the chances of overlapping

random numbers.

iii The gap test. This test is always correct, since all the random numbers are

bounded to [0, 1).

iv The simplified poker test. This test presents good results since the first stage

of the post-processing ensures the distribution of the range of [0, 1) into 2^b random

numbers, where b is the number of bits in the representation, in our case b =

32.

v The coupon collector test. This presents relatively good results, since it concen-

trates on the distribution of unique random numbers. This is taken care of by

the first stage of the post-processing by distributing uniformly all the values.

Larger segment sizes should present better results.

vi The maximum-of-t test. This test does not present good results, since it looks

at the distribution of the maximum of sub-sequences after breaking the input

sequence. The post-processing does not consider optimizing the sequence at

different sizes.

vii The weight distribution test. This test always presents good results. Since it

tries to check a binomial distribution by combining uniform distributions, the

first stage of the post-processing takes care of keeping a uniform distribution

even if it will be concatenated later on in the test.

viii The rank of a random binary matrix test. This presents relatively good results.

The combination of the two post-processing stages makes sure to have random

numbers well distributed and with no linear dependencies.

ix The independence test between hamming weights of successive blocks. This eval-

uation presents good results, because the stationarity enforcement stage takes

care of reducing the autocorrelation of the random variable towards an impulse

shape.

x The random walk test. This evaluation does not present good results, since

it considers a binary sequence as input. The sequence is tested against its

behaviour as if it was a random walk over a 2D space. In the case of the post-

processing, stationarity was enforced only on the floating-point values of the

random numbers, not on the effects of their binary representation

when concatenated as done in the evaluation.

To recapitulate the results of the evaluation, the two post-processing stages present

good results on every test except the maximum-of-t and random walk tests, since the statistical properties are not optimized for sub-sequences of all different sizes. As well, considering binary inputs in the random walk test could make for a fairer comparison.

5.5 Summary

In this chapter, we presented an evaluation of the implementation of Chapter 4's

stationarity enforcement accelerated FFT-based algorithm, using a homogeneous clus-

ter. We introduced the SmallCrush battery of tests from the package TestU01 for

evaluating random number generation to evaluate the boundaries of the proposed algorithm. In addition, guidelines have been presented showing the influence of different parameters that connect the statistical properties among the random numbers and the task decomposition from the post-processing algorithm. From these parameters, we observe that the segment size in the post-processing is key for centring the quality of the random numbers. Specifically, a sub-optimal solution is found when the segment size is 17408.

Chapter 6

Summary of work

In this thesis, we designed and developed a software-based true random number generator that is customized to the user requirements in terms of quality and speed.

In Chapter 1, the importance of random numbers, their historical precedence, and their conceptual characteristics were presented as background information to explain the design of a random number generator.

In Chapter 2, we introduced the foundations of a generic architectural design for a true random number generator. The design focused on exploiting and reusing the available random sources on the architecture, based on the most popular instruction schedulers, data-flow, and control-flow schedulers in the microprocessor and graphic cards market.

In Chapter 3-4, we introduced a group of post-processing techniques to mitigate the statistical deficiencies. For this, we divided the problem into two smaller problems related to the most important characteristics of a target statistical distribution. In our case, the target distribution is a uniform distribution with wss.


The first part addressed the shape of the distribution by presenting modulus-based

and histogram-based techniques. The modulus-based technique was shown to

concentrate the distribution within a specified range, while the histogram-based technique

mapped to a target distribution and improved the normalized entropy of the numbers.

Once the shape of the distribution was solved, the second addressed property

was the correlation of the random numbers. In Chapter 4, we presented two algo-

rithms and provided a proof of concept for a user specified quality level, σ0. The two approaches enforced stationarity by modifying the sequence of numbers. The first approach, or naive algorithm, relied on selectively shuffling sequences ofnum- bers with GA in which a normal distribution was found in the autocorrelation and the power spectral density. On the other hand, the second approach, or acceler- ated FFT-based algorithm, was an optimization upon the deficiencies of the naive algorithm. It reduced the space complexity and mitigate the rapid growth on the time complexity by implementing a simpler evaluation criteria that concentrated on the standard deviation of the power spectral density aided by a parallel GA search scheme.

In Chapter 5, we evaluated the effects of a large data-size and its parallel imple- mentation under a standardized battery of tests, SmallCrush from TestU01. Here, we used a prototype cluster to evaluate the capabilities of the post processing algorithm.

In the following chapter, we present the conclusions of this thesis, as well as future work and trends.

Chapter 7

Conclusions and Future work

7.1 Conclusions

The concept of randomness has historically been a topic of great interest to human- ity. Today, the established theoretical foundations behind random numbers highlight conditions which are ideal for encryption, gambling, and simulation among others.

However, these conditions have a major trade off between accuracy, portability, speed, and scalability. Therefore, the need to develop a balance among these properties, and a solution for the customized type of generator, becomes paramount.

This thesis explored the design, development and implementation of a software-based TRNG and its capability to re-utilize modern compute architectures (CPU &

GPU) to create a non-deterministic random source. The work also contemplated the characterization of the proposed random source. Through experiments, we observed a near Gamma distribution on the proposed random source. Due to its poor quality in generating random numbers, we developed and improved two post-processing

stages to produce a uniform distribution while ensuring its stationarity. In general, software-based TRNGs have been overlooked in the literature; other studies have instead focused on specialized hardware designs that create a random source with a better distribution. In this study, I sought to answer the following questions:

• How can a computer architecture be re-utilized to build a TRNG?

• Can the statistical properties of a random source be improved?

• Is this design scalable?

Answering the first question, a generic TRNG was introduced that requires no hardware modifications, since its implementation was done entirely in software by exploiting the computer architecture. By combining the anomalies found in the architecture with the characteristic behaviour of computer threads under race conditions, a consistent and versatile random source was presented with low entropy, but with the wss properties of a Gamma distribution, independent of the architecture and the process scheduler.

Given the low entropy of the random source, in answering the second question, the problem was divided into two parts. The first part addressed the shape of the distribution by presenting modulus-based and histogram-based techniques. The modulus-based technique was shown to concentrate the distribution within a specified range, while the histogram-based technique proved to achieve a high normalized entropy, (0.95, 1.0], especially with Exact Histogram Equalization (EHE) and Adaptive Exact Histogram

Equalization (AEHE), solutions for offline and online generation, respectively. The

AEHE proved to be an improvement over the compute-bound EHE, which has a complexity of O(N log N). The improvement was implemented through a sliding window and equalization on the complementary histogram, reaching a complexity of O(cN), where c << N is a constant calculated in terms of the window size.

The second part of the answer is related to ensuring the stationarity of the random variables. Ensuring stationarity is presented through two algorithms, a naive algorithm and an accelerated FFT-based algorithm. Both algorithms are controlled by a user-specified quality level. The naive algorithm converges with an expected complexity of E[O(e^N)], while the accelerated

FFT-based algorithm presents an expected complexity of E[O(N^c)], where c is a constant, which was always smaller than the exponential for large data sizes. This suggests the use of the accelerated FFT-based algorithm for big data cases where the space and time complexity are critical.

To answer the third question, from a standardized TestU01 test it was concluded that the task-decomposition and the segment size were key in improving the overall performance and quality. This targeted a road-map for re-engineering products or applications that involved random number generators with a versatile design.

In contrast to the TRNGs in the literature, which focus on hardware designs that guarantee a predefined distribution, the benefit of using a software-based TRNG in combination with post-processing stages has been shown to be an alternative to using specialized hardware for true random number generation, due to being re-configurable, scalable and adaptable.

7.2 Future Work

Today, random number generation is an essential tool, mostly for securing confidential information. Security is being affected by the exponential growth in computational capacity, since it is bounded by its determinism, which leaves us to rethink the approach to finding appropriate sources of randomness. In the coming years, the following topics will have to be addressed in random number generation:

a. The random numbers quality accelerated by new paradigms. In today’s com-

pute architectures, quantum computers are rapidly moving towards being an

accelerator within cloud services. Quantum computers present a scalable

option to speed up the search over parallel solution spaces,

strengthening the security of random numbers.

b. The stationarity enforcement solution space search acceleration. In this the-

sis, we presented genetic algorithms as a possible solution, since they cover the

solution space uniformly and find the solution in a suitable time. Other

approaches can be taken regarding the combination of meta-heuristics, as in

hyper-heuristics, with adaptive algorithmic behaviour that prunes the solution

space depending on its surroundings.

c. Expanding towards re-configurable architecture solutions. Re-configurable ar-

chitectures, such as FPGAs, present a cheap and fast solution that can be scaled

for compute or data centres; they can add an extra layer by re-configuring the ar-

chitecture in an aperiodic manner, besides making a robust randomness source.

d. Kernel optimization for True Random Number Generation. In this thesis, we

presented three kernels that exploit race conditions, leaving the possibility of

other kernels to be defined in the future. This will have to involve finding the

relation across instructions and the effects on the scheduler behaviour.

Bibliography

[1] Stephen Karcher. Ta Chuan, The Great Treatise: The Key to Understanding

the I Ching and its Place in Your Life. London: Carroll & Brown, 2000.

[2] Encyclopedia Britannica. Fu Xi | Chinese mythological emperor, 2016. URL

http://www.britannica.com/topic/Fu-Xi.

[3] Alfred Schinz. The magic square: cities in ancient China. Edition Axel Menges,

1996.

[4] Margaret Pearson. The Original I Ching: An Authentic Translation of The

book of Changes. Tuttle Publishing, 2011.

[5] James Legge. The I Ching: The book of changes. Courier Corporation, 2012.

[6] Deborah J Bennett. Randomness. Harvard University Press, 2009.

[7] Charles Tam. The royal game of Ur. 2008.

[8] Stephen Bertman. Handbook to life in ancient Mesopotamia. Oxford University

Press, 2005.


[9] John Marshall. Mohenjo-daro and the Indus Civilization. Asian Educational

Services, 1931.

[10] John Gardner Wilkinson. The manners and customs of the ancient Egyptians,

volume 2. J. Murray, 1878.

[11] Roland G Austin. Roman board games. I. Greece and Rome, 4(10):24–34, 1934.

[12] Josie Glausiusz. Trading Bronze Age technology. Nature, 456(7223):709–709,

2008.

[13] Barend A Van Nooten and Gary B Holland. Rig Veda: A metrically restored

text with an introduction and notes, volume 1. Harvard Univ Pr, 1994.

[14] Chakravarti Rajagopalachari. Mahabharata, volume 1. Diamond Pocket Books

(P) Ltd., 1958.

[15] Stewart Culin. Games of the North American indians, volume 24. Courier

Corporation, 1975.

[16] Girolamo Cardano. The book on games of chance:(Liber de ludo aleae). Holt,

Rinehart and Winston, 1961.

[17] Keith Devlin. The unfinished game: Pascal, Fermat, and the seventeenth-

century letter that made the world modern. Basic Books, 2010.

[18] Elart von Collani. Jacob Bernoulli deciphered. Bernoulli News, 13(2), 2006.

[19] Abraham De Moivre. The doctrine of chances: or, A method of calculating the

probabilities of events in play, volume 1. Chelsea Publishing Company, 1756.

[20] Carmen Batanero, Michel Henry, and Bernard Parzysz. The nature of chance

and probability. In Exploring Probability in School, pages 15–37. Springer, 2005.

[21] Andrei N Kolmogorov and Vladimir A Uspenskii. Algorithms and randomness.

Theory of Probability & Its Applications, 32(3):389–412, 1988.

[22] Albert N Shiryaev. On the defense work of AN Kolmogorov during World War

II. Springer, 2003.

[23] Claude E Shannon. Prediction and entropy of printed English. Bell system

technical journal, 30(1):50–64, 1951.

[24] Pierre L’Ecuyer. Random number generation. Springer, 2012.

[25] James E Gentle. Random number generation and Monte Carlo methods.

Springer Science & Business Media, 2006.

[26] Donald E. Knuth. Art of Computer Programming, Volumes 1-4A Boxed

Set. Addison-Wesley Professional, 3rd edition, 2011. ISBN 0321751043,

9780321751041.

[27] Harald Niederreiter. Recent trends in random number and random vector gen-

eration. Annals of Operations Research, 31(1):323–345, 1991.

[28] Arnold Kaufmann and David L Swanson. Introduction to the theory of fuzzy

subsets, volume 1. Academic Press New York, 1975.

[29] Nikhil R Pal and James C Bezdek. Measuring fuzzy uncertainty. Fuzzy Systems,

IEEE Transactions on, 2(2):107–118, 1994.

[30] Claude Elwood Shannon. A mathematical theory of communication. ACM

SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 2001.

[31] Leandro Pardo. Statistical inference based on divergence measures. CRC Press,

2005.

[32] Aman Ullah. Entropy, divergence and distance measures with econometric

applications. Journal of Statistical Planning and Inference, 49(1):137–162, 1996.

[33] Gavin E Crooks. On measures of entropy and information, 2015.

[34] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Trans-

actions on Information theory, 37(1):145–151, 1991.

[35] Amir Dembo, Thomas M Cover, and Joy A Thomas. Information theoretic

inequalities. IEEE Transactions on Information Theory, 37(6):1501–1518, 1991.

[36] James M. Joyce. Kullback-Leibler Divergence, pages 720–722. Springer

Berlin Heidelberg, Berlin, Heidelberg, 2011. ISBN 978-3-642-04898-2.

doi: 10.1007/978-3-642-04898-2_327. URL http://dx.doi.org/10.1007/

978-3-642-04898-2_327.

[37] Annick Lesne. Shannon entropy: a rigorous mathematical notion at the cross-

roads between probability, information theory, dynamical systems and statisti-

cal physics. Mathematical Structures in Computer Science, pages 1–40, 2011.

[38] George EP Box, Gwilym M Jenkins, and Gregory C Reinsel. Time series

analysis: forecasting and control, volume 734. John Wiley & Sons, 2011.

[39] J. G. Proakis. Digital Signal Processing: Principles, Algorithms, And Applica-

tions. Pearson Education, 4th edition, 2007. ISBN 9788131710005.

[40] Howard M Taylor and Samuel Karlin. An introduction to stochastic modeling.

Academic press, 2014.

[41] A. C. Harvey. Time Series Models, volume 2nd ed. MIT Press, 1993.

ISBN 9780262082242. URL http://uml.idm.oclc.org/login?url=http:

//search.ebscohost.com/login.aspx?direct=true&db=nlebk&AN=11358&

site=ehost-live.

[42] Peter J Brockwell and Richard A Davis. Introduction to time series and fore-

casting. Springer Science & Business Media, 2006.

[43] Matteo Pelagatti. Stationary processes, 2016. URL http://www.statistica.

unimib.it/utenti/p_matteo/lessons/SSE/stationarity.pdf.

[44] Prabahan Basu, Daniel Rudoy, and Patrick J Wolfe. A nonparametric test for

stationarity based on local Fourier analysis. In IEEE International Conference

on Acoustics, Speech and Signal Processing, pages 3005–3008, Taipei, China,

April 19-24 2009.

[45] Jose Juan Mijares Chan, Parimala Thulasiraman, Gabriel Thomas, and Ruppa

Thulasiram. Stationarity enforcement of accelerator based TRNG by genetic

algorithm. In Trustcom/BigDataSE/ISPA, 2015 IEEE, volume 1, pages 1122–

1128. IEEE, 2015.

[46] Benjamin Jun and Paul Kocher. The Intel random number generator. Cryp-

tography Research Inc. white paper, 1999.

[47] Marco Bucci and Raimondo Luzzi. Design of testable random bit generators.

In Cryptographic Hardware and Embedded Systems–CHES 2005, pages 147–156.

Springer, 2005.

[48] Elaine Barker and John Kelsey. NIST DRAFT special publication 800-90b

recommendation for the entropy sources used for random bit generation, 2012.

[49] Marco Bucci, Lucia Germani, Raimondo Luzzi, Alessandro Trifiletti, and Mario

Varanonuovo. A high-speed oscillator-based truly random number source for

cryptographic applications on a smart card IC. IEEE Transactions on Com-

puters, 52(4):403–409, 2003.

[50] Adrian Colesa, Radu Tudoran, and Sebastian Banescu. Software random num-

ber generation based on race conditions. In IEEE 10th International Symposium

onSymbolic and Numeric Algorithms for Scientific Computing, pages 439–444,

Timisoara, Romania, 2008.

[51] Charles W O’donnell, G Edward Suh, and Srinivas Devadas. PUF-based

random number generation. In MIT CSAIL CSG Technical Memo, 481, 2004.

[52] Viktor Fischer and Miloš Drutarovskỳ. True random number generator em-

bedded in reconfigurable hardware. In Cryptographic Hardware and Embedded

Systems-CHES 2002, pages 415–430. Springer, 2003.

[53] Michael Epstein, Laszlo Hars, Raymond Krasinski, Martin Rosner, and Hao

Zheng. Design and implementation of a true random number generator based

on digital circuit artifacts. In Cryptographic Hardware and Embedded Systems-

CHES 2003, pages 152–165. Springer, 2003.

[54] I-Te Chen. Random numbers generated from audio and video sources. Mathe-

matical Problems in Engineering, 2013, 2013.

[55] Mike Hamburg, Paul Kocher, and Mark E Marson. Analysis of Intel’s Ivy Bridge

digital random number generator. 2012. URL http://www.cryptography.com/

public/pdf/Intel_TRNG_Report_20120312.pdf.

[56] Jose Juan Mijares Chan, Bhanu Sharma, Jiaqing Lv, Gabriel Thomas, Ruppa

Thulasiram, and Parimala Thulasiraman. True random number generator using

GPUs and histogram equalization techniques. In IEEE 13th International Con-

ference on High Performance Computing and Communications, pages 161–170,

Banff, Alberta, Canada, 2011.

[57] Michael J. Flynn. Computer Architecture: Pipelined and Parallel Processor

Design. Jones and Bartlett Publishers, Inc., USA, 1st edition, 1995. ISBN

0867202041.

[58] Jason Patterson. Modern microprocessors: A 90 minute guide! Cortex, 15:A57,

2003.

[59] David A. Patterson and John L. Hennessy. Computer Organization and Design:

The Hardware/Software Interface (The Morgan Kaufmann Series in Computer

Architecture and Design). Morgan Kaufmann Publishers Inc., San Francisco,

CA, USA, 4th edition, 2008. ISBN 0123744938, 9780123744937.

[60] Ian King. Intel forecast shows rising server demand, PC share gains,

Jul 2015. URL http://www.bloomberg.com/news/articles/2015-07-15/

intel-forecast-shows-server-demands-makes-up-for-pc-market-woes.

[61] Roger Kay. Intel and AMD: The juggernaut vs. the squid - Forbes,

Nov 2014. URL http://www.forbes.com/sites/rogerkay/2014/11/25/

intel-and-amd-the-juggernaut-vs-the-squid/#6f8673d81a1b.

[62] Strategy Analytics. Global market revenue share of lead-

ing smartphone applications processor vendors in 2014, May

2015. URL http://www.statista.com/graphic/1/233415/

global-market-share-of-applications-processor-suppliers.jpg.

[63] Top500 List November. TOP500 Supercomputer sites, 2012.

[64] James E. Thornton. Parallel operation in the Control Data 6600. pages 33–40,

1965. doi: 10.1145/1464039.1464045. URL http://doi.acm.org/10.1145/

1464039.1464045.

[65] David Budde, Robert Riches, Michael T Imel, Glen Myers, and Konrad Lai.

Register scoreboarding on a microprocessor chip, January 2, 1990. US Patent

4,891,753.

[66] Daniel J Sorin, Mark D Hill, and David A Wood. A primer on memory consis-

tency and cache coherence. Synthesis Lectures on Computer Architecture, 6(3):

1–212, 2011.

[67] David Levinthal. Performance analysis guide for Intel core i7 processor and

Intel Xeon 5500 processors. Intel Performance Analysis Guide, 30, 2009.

[68] Vinodh Cuppu, Bruce Jacob, Brian Davis, and Trevor Mudge. A performance

comparison of contemporary DRAM architectures. In ACM SIGARCH Com-

puter Architecture News, volume 27, pages 222–233. IEEE Computer Society,

1999.

[69] Ulrich Drepper. What every programmer should know about memory. Red Hat,

Inc, 11:2007, 2007.

[70] Chris McClanahan. History and evolution of GPU architecture. A Survey

Paper, 2010.

[71] Thomas Scott Crow. Evolution of the graphical processing unit. PhD thesis,

Citeseer, 2004.

[72] Nvidia. Cuda programming guide, 2008.

[73] Michal Hocko and Tomas Kalibera. Reducing performance non-determinism

via cache-aware page allocation strategies. In Proceedings of the first joint

WOSP/SIPEW international conference on Performance engineering, pages

223–234. ACM, 2010.

[74] Michael Boyer, Kevin Skadron, and Westley Weimer. Automated dynamic anal-

ysis of CUDA programs. In Third Workshop on Software Tools for MultiCore

Systems, 2008.

[75] Parimala Thulasiraman, Ruppa K Thulasiram, Jose Juan Mijares Chan, Bhanu

Sharma, Jiaqing Lv, and Gabriel Thomas. True random number generator

using GPU and signal processing techniques, February 8 2012. US Patent App.

13/977,066.

[76] PRN Childs, JR Greenwood, and CA Long. Review of temperature measure-

ment. Review of scientific instruments, 71(8):2959–2978, 2000.

[77] ON Semiconductors. ADT7473: Remote thermal monitor and fan control, 2009.

URL http://www.onsemi.com/pub_link/Collateral/ADT7473-D.PDF.

[78] John R Pierce. An introduction to information theory: symbols, signals and

noise. Courier Corporation, 2012.

[79] Tucker S Taft and Robert A Duff. Ada 95 Reference Manual. Language and

Standard Libraries: International Standard ISO/IEC 8652: 1995 (E), volume

8652. Springer Science & Business Media, 1997.

[80] Donald E Knuth. The art of computer programming vol. 1, fundamental algo-

rithms. Addison-Wesley, Reading, MA, 9:364–369, 1968.

[81] IEEE Standards Committee et al. 754-2008 IEEE standard for floating-point

arithmetic. IEEE Computer Society Std, 2008, 2008.

[82] Raymond T Boute. The Euclidean definition of the functions div and mod.

ACM Transactions on Programming Languages and Systems (TOPLAS), 14

(2):127–144, 1992.

[83] Daan Leijen. Division and modulus for computer scientists. University of

Utrecht, 2001. URL http://www.cs.uu.nl/~daan/lvm.html.

[84] Werner Schindler and Wolfgang Killmann. Evaluation criteria for true (physical)

random number generators used in cryptographic applications. In Cryptographic

Hardware and Embedded Systems-CHES 2002, pages 431–449. Springer, 2002.

[85] Rafael C Gonzalez and Richard E Woods. Digital image processing. Prentice

hall Upper Saddle River, 2 edition, 2002.

[86] Dinu Coltuc, Philippe Bolon, and Jean-Marc Chassery. Exact histogram spec-

ification. Image Processing, IEEE Transactions on, 15(5):1143–1152, 2006.

[87] Tae Keun Kim, Joon Ki Paik, and Bong Soon Kang. Contrast enhancement

system using spatially adaptive histogram equalization with temporal filtering.

IEEE Transactions on Consumer Electronics, 44(1):82–87, 1998.

[88] John B Zimmerman, Stephen M Pizer, Edward V Staab, J Randolph Perry,

William McCartney, and Bradley C Brenton. An evaluation of the effectiveness

of adaptive histogram equalization for contrast enhancement. IEEE Transac-

tions on Medical Imaging, 7(4):304–312, 1988.

[89] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach.

Cambridge University Press, 2009.

[90] David Harel and Yishai A Feldman. Algorithmics: the spirit of computing.

Pearson Education, 2004.

[91] Christos H. Papadimitriou. Computational complexity. In Encyclopedia of

Computer Science, pages 260–265. John Wiley and Sons Ltd., Chichester, UK.

ISBN 0-470-86412-5. URL http://dl.acm.org/citation.cfm?id=1074100.

1074233.

[92] David E Goldberg and Kalyanmoy Deb. A comparative analysis of selection

schemes used in genetic algorithms. In Foundations of Genetic Algorithms,

1991.

[93] David Beasley, RR Martin, and DR Bull. An overview of genetic algorithms:

Part 1, fundamentals. University computing, 15:58–58, 1993.

[94] Mandavilli Srinivas and Lalit M Patnaik. Genetic algorithms: A survey. Com-

puter, 27(6):17–26, 1994.

[95] Pedro Larrañaga, Cindy M. H. Kuijpers, Roberto H. Murga, Iñaki Inza, and

Sejla Dizdarevic. Genetic algorithms for the travelling salesman problem: A

review of representations and operators. Artificial Intelligence Review, 13(2):

129–170, 1999.

[96] Chang Wook Ahn and Rudrapatna S Ramakrishna. Elitism-based compact

genetic algorithms. IEEE Transactions on Evolutionary Computation, 7(4):

367–385, 2003.

[97] Patrick M Reed, Barbara S Minsker, and David E Goldberg. The practitioner’

s role in competent search and optimization using genetic algorithms. Bridging

the Gap, 10(40569):97, 2001.

[98] Gael Hofemeier. Intel Digital Random Number Generator (DRNG) software

implementation guide, 2012.

[99] BP Welford. Note on a method for calculating corrected sums of squares and

products. Technometrics, 4(3):419–420, 1962.

[100] D Knuth. The art of computer programming. volume 2 (seminumerical algo-

rithms), 1997.

[101] Werner Augustin, Vincent Heuveline, and Jan-Philipp Weiss. Optimized stencil

computation using in-place calculation on modern multicore systems. In Euro-

Par 2009 Parallel Processing, pages 772–784. Springer, 2009.

[102] Thomas H Cormen. Introduction to algorithms. MIT press, 2009.

[103] Rajeev Motwani and Prabhakar Raghavan. Algorithms and theory of compu-

tation handbook. chapter 15: Randomized Algorithms, pages 12–12. Chapman

& Hall/CRC, 2010. ISBN 978-1-58488-822-2. URL http://dl.acm.org.uml.

idm.oclc.org/citation.cfm?id=1882757.1882769.

[104] Eric W Weisstein. Least squares fitting. 2002.

[105] Eric W Weisstein. Least squares fitting–exponential. MathWorld-A

Wolfram Web Resource., 2011. URL http://mathworld.wolfram.com/

LeastSquaresFittingExponential.html.

[106] Eric W Weisstein. Least squares fitting–power law. From MathWorld–A Wol-

fram Web Resource, 2007.

[107] Pierre L’Ecuyer and Richard Simard. TestU01: a software library in ANSI C

for empirical testing of random number generators. 2007.

[108] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing

on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[109] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing

tool. Communications of the ACM, 53(1):72–77, 2010.

[110] George Marsaglia. DIEHARD: a battery of tests of randomness. 1996. URL

http://stat.fsu.edu/~geo/diehard.html.

[111] Al Ruhkin. Testing randomness: A suite of statistical procedures. Theory of

Probability & Its Applications, 45(1):111–132, 2001.

[112] George Marsaglia. A current view of random number generators. In Computer

Science and Statistics, Sixteenth Symposium on the Interface. Elsevier Science

Publishers, North-Holland, Amsterdam, pages 3–10, 1985.

[113] Makoto Matsumoto and Yoshiharu Kurita. Twisted GFSR generators II. ACM

Transactions on Modeling and Computer Simulation (TOMACS), 4(3):254–266,

1994.

[114] Pierre L’ecuyer and Richard Simard. Beware of linear congruential generators

with multipliers of the form a = ±2^q ± 2^r. ACM Transactions on Mathematical

Software (TOMS), 25(3):367–374, 1999.