Master of Science Thesis in Electrical Engineering Department of Electrical Engineering, Linköping University, 2016

Turbo Code Performance Analysis using Hardware Acceleration

Oskar Nordmark
LiTH-ISY-EX--16/5010--SE

Supervisor: Niclas Wiberg, Ericsson AB
Examiner: Oscar Gustafsson, ISY, Linköpings universitet

Computer Engineering
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2016 Oskar Nordmark

Abstract

The upcoming mobile communications system promises to enable use cases requiring ultra-reliable and low-latency communications. Researchers therefore require more detailed information about aspects such as channel coding performance at very low block error rates. The simulations needed to obtain such results are very time consuming, and this poses a challenge to studying the problem. This thesis investigates the use of hardware acceleration for performing fast simulations of turbo code performance. Special interest is taken in investigating different methods for generating normally distributed noise based on pseudo-random number generator algorithms executed on DSPs. A comparison is also done regarding how well different simulator program structures utilize the hardware. Results show that even a simple program for utilizing parallel DSPs can achieve good usage of hardware accelerators and enable fast simulations. It is also shown that for the studied process the bottleneck is the conversion of hard bits to soft bits with addition of normally distributed noise. It is indicated that methods for noise generation which do not adhere to a true normal distribution can further speed up this process and yet yield simulation quality comparable to methods adhering to a true Gaussian distribution. Overall, it is shown that the proposed use of hardware acceleration in combination with the DSP software simulator program can in a reasonable time frame generate results for turbo code performance at block error rates as low as 10^-9.


Acknowledgments

This work was performed at Ericsson Research in Linköping, Sweden, during the spring of 2016. I would very much like to thank my supervisor Niclas Wiberg for proposing this work and giving me the possibility to perform it. It has truly been a privilege to work with you and take part of your knowledge and insight, as well as your continuous input and guidance. I would also like to thank professor Oscar Gustafsson at Linköping University for his time and the good discussions. My thanks also extend to Patrik Sträng, Mirsad Cirkic and the Flake team in Linköping for their help and explanations. Lastly I would like to thank all the colleagues at LINLAB for taking an interest in my work and, above all, for making my stay so enjoyable.

Stockholm, April 2017
Oskar Nordmark


Contents

Notation

1 Introduction
  1.1 Purpose
  1.2 Problem formulation
  1.3 Related work
  1.4 Method
  1.5 System overview
  1.6 Limitations
    1.6.1 Random numbers
    1.6.2 Channel model
    1.6.3 Modulation

2 Background Theory
  2.1 The binomial distribution and interval estimation
  2.2 Channel coding and turbo codes
  2.3 Pseudo-random number generation
    2.3.1 Matrix linear recurrence modulo 2
    2.3.2 Equidistribution
    2.3.3 Mersenne Twister
    2.3.4 Xorshift
  2.4 Methods for generating normally distributed variables
    2.4.1 The Box-Müller method
    2.4.2 The Marsaglia Polar method
    2.4.3 The Ziggurat method
  2.5 Hardware acceleration for turbo codes

3 Method
  3.1 Fundamental simulator model
  3.2 General principles of the simulator setup
    3.2.1 Soft bits and noise addition
    3.2.2 SNR calculations
    3.2.3 Number of coding attempts


  3.3 Simulator program structure
    3.3.1 Busy-wait simulator structure
    3.3.2 Callback simulator structure
  3.4 The table method for noise generation

4 Results
  4.1 BLER curves
    4.1.1 Reference curves and axis
    4.1.2 Different noise methods
    4.1.3 Different simulator structures
  4.2 Execution time for PRNG methods
  4.3 Normal distribution methods
  4.4 Hard to soft conversion with noise addition
  4.5 Simulator throughput
  4.6 Distribution of execution time
  4.7 Low error-rate BLER-curves

5 Conclusion and Discussion
  5.1 Simulator output and speed
  5.2 Program structure
  5.3 Bottleneck
  5.4 Normally distributed noise

6 Future Work
  6.1 Improved usability of the simulator
  6.2 Comparison with theoretical properties
  6.3 Optimizations of the simulator

A Normal distribution table

Bibliography

Notation

Abbreviations

Abbreviation   Definition
LTE            Long Term Evolution
3GPP           3rd Generation Partnership Project
SoC            System on Chip
PRNG           Pseudo-Random Number Generator
DSP            Digital Signal Processor
BLER           Block error rate
SNR            Signal-to-noise ratio
CDF            Cumulative distribution function
PDF            Probability density function


1 Introduction

The evolution of mobile communication systems over the past decades has brought about a new technological generation about every ten years. In the current discussions of the next technological generation, 5G, many in the industry point out the demand for low latency and high reliability as important requirements [12, 23]. One area of high interest in this context is the channel coding used for controlling errors in the transmission of data on a wireless channel. In the discussions for 5G there are proposals for different channel coding techniques such as polar codes and low density parity check (LDPC) codes [11]. In the current LTE system the channel coding used is turbo coding [22, 9]. In some use cases for 5G the high requirements on the communication conditions correspond to a maximum block error rate of 10^-9 and latency down to a millisecond or less [12]. This creates a need to be able to perform simulations in order to evaluate the performance of proposed algorithms for this type of context. Obtaining accurate results in the case of very low error probabilities requires a very large number of simulations. This emphasises the need for quick simulations and motivates the use of hardware acceleration.

1.1 Purpose

The aim of this work is to study how simulations concerning channel coding can be carried out utilizing hardware acceleration. The overall objective with this type of approach is to achieve faster simulations than can be accomplished using a software implementation running on a general purpose processor. More specifically, this thesis will study the use of system-on-chip (SoC) products containing hardware acceleration for turbo codes conforming to the 3GPP specification of turbo codes used in LTE. The system used in this thesis has several digital signal processors (DSPs) as well as a hardware accelerator for turbo

codes. In order to study the use of hardware acceleration for turbo code performance simulations, a simulator set-up has to be implemented for the specific system. One key factor needed for this type of simulator set-up to study the performance of the error-correcting codes is the simulation of random noise. In this thesis this will be accomplished by implementing pseudo-random number generation (PRNG) algorithms on the DSPs, which will ensure repeatability of the simulations. The use of well-known and tested standard methods also provides a reliable base for discussing the statistical quality of the results. The various performance limitations and bottlenecks of the simulator set-up are of interest to determine the capabilities of the simulator and of the type of hardware utilized.

1.2 Problem formulation

This thesis aims to answer the following:

1. How can a simulator be set up in a multi core environment to enable fast simulations using accelerators?

2. What are the limitations on simulation speed and accuracy with the employed simulator set-up and hardware?

3. How efficiently can random numbers be generated in the system using PRNG on DSP processors? Can a potential performance improvement motivate the use of other approaches or methods?

4. How well does the LTE turbo code perform around very low block error rates? What is the general shape of this graph in the area of block error rates from 10^-4 to 10^-9?

1.3 Related work

Since turbo codes were introduced by Berrou et al. in 1993 [4], the algorithm has been studied, improved, and adapted by many researchers, and implemented in many systems [10, 22]. Many results have been presented concerning the performance of turbo codes, both in the form of analytical work and empirical studies where the simulated performance is presented. Work has been presented concerning the specific turbo coding used in LTE, but much of it focuses on relatively high BLER values. In many cases, a BLER value of 10% is seen as a reasonable error rate to perform transmissions around. To the best of our knowledge, there has not been any work presented concerning the use of dedicated hardware acceleration in simulations for turbo code performance.

1.4 Method

In this thesis, an implementation of a simulator using DSP and hardware acceleration components is presented. Several different PRNG methods have been researched through literature studies and some of these methods have been implemented. Methods for generating normally distributed values have also been researched and implemented for the simulator. A very simple method for generating normally distributed variables which is not found in the literature studies is also implemented and presented. The implemented simulator is benchmarked by recording the simulation results and the execution time, both for the overall program and for different segments and algorithms. These results are then used to analyse the performance and capability of the simulator and the hardware system. The software for the simulator is implemented in C code and the simulation results are compared to reference results from an existing simulator. It should however be stated that this thesis does not aim to guarantee an implementation completely free of flaws or bugs. However, care is taken to verify that the results appear reasonable and relevant for the conclusions.

1.5 System overview

The simulator set-up studied in this thesis utilises both DSP resources and hardware acceleration resources. The system is schematically illustrated in figure 1.1.

Figure 1.1: An illustration of the type of hardware system studied for use in performing simulations: a SoC system with multiple DSPs and hardware accelerators for turbo encoding and decoding.

1.6 Limitations

Here is a presentation of the limitations used to narrow the scope of the thesis. The limitations are divided into the three subsections below.

1.6.1 Random numbers

The implemented simulator should ideally replicate the random nature of a transmission channel. Generation of seemingly random numbers has been an area of interest for a very long time. In 1951 the article "Various techniques used in connection with random digits" by John von Neumann was published in the National Bureau of Standards Applied Mathematics Series [24]. Von Neumann therein discusses the use of either physical methods or algorithmic methods to generate random numbers. Physical methods can generate true randomness but make it impossible to repeat the process to check for errors. On the other hand, von Neumann very neatly summarizes a fundamental problem of algorithmic methods as "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin." [24]. These principal concerns are still relevant today. If repeatability is required, then methods with true randomness are abandoned in favour of algorithmic pseudo-random methods. However, an algorithmic method will never generate an actually random number, and von Neumann points out that these "cooking recipes" for making digits "should merely be judged by their results. Some statistical study of the digits generated by a given recipe should be made, but exhaustive tests are impractical." For the simulator set-up presented in this thesis, only PRNG methods implemented in software and executed on the DSPs are studied.

1.6.2 Channel model

The simulator set-up used in this thesis is based on an AWGN channel model. The concept of a channel in the context of a communication link is introduced in section 2.2, and the AWGN model is presented in section 3.1. This thesis will not discuss the relevance of AWGN channels or how appropriate an AWGN model is for this type of simulation; we simply observe that the model is commonly used, and we will apply it in this context of utilising hardware acceleration for performing simulations. For a further discussion of channel models and background information concerning random signals, see for example [13, 30, 8].

1.6.3 Modulation

When data is actually transferred over a transmission medium it is necessary to have some form of modulation which maps the binary information sequence to a continuous physical signal that can be transmitted. In LTE, several different modulation techniques are available, such as quadrature phase-shift keying (QPSK) and quadrature amplitude modulation (QAM) [9]. This thesis will not study the possible effect of using different modulation schemes, and no actual modulation to a physical signal will ever take place. We will only use a simple binary phase-shift keying (BPSK) model where a logical one or zero will be mapped to a positive or negative integer value.

2 Background Theory

In the following sections some background information relevant to the scope of the thesis is presented. First there is some very brief information about the binomial distribution and confidence intervals in section 2.1. Information about channel coding and turbo codes is presented in section 2.2, theory and methods for generating pseudo-random variables are presented in 2.3, how these random variables can be altered to achieve a normal distribution is discussed in 2.4, and finally a brief overview of available products with hardware acceleration for turbo codes is presented in 2.5.

2.1 The binomial distribution and interval estimation

The binomial distribution is a discrete probability distribution. If a series of n independent experiments is conducted, each with a probability of success p, then the number of successful experiments X will have a binomial distribution. We write this as X ∼ B(n, p). The binomial distribution has an expected value of np and a standard deviation of √(np(1 − p)).

For large values of n the binomial distribution can be approximated by the normal distribution, so that approximately X ∼ N(np, √(np(1 − p))). This approximation can typically be done if np(1 − p) ≥ 10.

Interval estimation in statistics is used to calculate a confidence interval of possible values for an unknown parameter based on some experimental sample data. A confidence interval with a confidence level of 1 − α will contain the true parameter value with a probability of 1 − α. A commonly used confidence level is 95%.

Say that we have an unknown population parameter Θ and the point estimator for this parameter is Θ∗. Let this estimator be approximately normally distributed with expected value Θ and standard deviation D. Then a confidence interval for Θ with a confidence level of 1 − α is shown in (2.1).

I_Θ = (Θ∗ − λ_{α/2} D, Θ∗ + λ_{α/2} D)    (2.1)

where D can be replaced by a suitable estimate d if D depends on Θ [5].

2.2 Channel coding and turbo codes

This section presents an introduction to some components of a modern digital communication link, including the general concept of channel coding and the specific turbo codes used in LTE. In a digital communication link, as described in [13], a message consisting of a binary string is transmitted over a waveform channel, which for example can serve as a model for the air interface between a transmitter and a receiver in a mobile wireless communication system such as LTE. This type of link is illustrated in figure 2.1, where only the components of the link relevant to this discussion are shown. Here we see that the binary message goes through the steps of channel coding and modulation before the waveform is transmitted over the channel. In the channel coding step, which can also be referred to as error protection, the message is mapped to a codeword by adding artificial redundancy to it. This is done in order to make it possible to decode the message even in the case of errors induced by the channel. Typically this step also includes an interleaver which creates a permutation of the order of the coded bits, in order to spread the bit errors over the codeword. During the modulation this codeword is then mapped onto a continuous waveform which can be transmitted over the channel. In the receiver, the steps are undone. The received waveform is first demodulated into a digital sequence. In practice, soft demodulation is almost always used, in which case the demodulator does not produce bits, but rather soft bits, which contain reliability information. This is also what is used in this study. These soft bits then undergo the reverse process of the channel coding, called decoding. This decoding will ideally eliminate errors induced during the transmission so that the output string from the decoder resembles the input string to the encoder.

Figure 2.1: An illustration of the described digital communication link for sending a message over a waveform channel

In the LTE mobile communication system, the channel coding used is turbo coding as specified in [22]. Turbo codes were first published in 1993 [4] and use a design where two encoders at the transmitter and two decoders at the receiver work in parallel. The two decoders at the receiver can exchange decoding information between them in an iterative way. This iterative process can either continue until both the decoders agree on the result or until a pre-determined number of iterations is reached. The number of iterations used in the decoding process is typically in the range of four to ten. According to [10], the name "turbo codes" was chosen due to a noticed similarity to a turbocharger in an engine. Turbo codes use the output of the decoders to improve the decoding process, much like the turbocharger uses the exhaust gas to improve the combustion by forcing more air into the engine. The specific turbo encoder in LTE uses two 8-state constituent encoders and a turbo code internal interleaver. The coding rate is 1/3 and the codewords support information bit sequence sizes for a subset of the sizes in the interval between 40 and 6144 bits [22].

2.3 Pseudo-random number generation

There are many methods available to generate a series of pseudo-random numbers, and below is a presentation of two algorithms which have been used in the work of this thesis: the Mersenne Twister and the family of xorshift generators. Both these methods are based on linear recursion over the two-element field [25, 29]. The quality of these PRNG methods is studied using empirical tests. Several standardised tests of statistical properties for pseudo-random number generators have been developed. Among the most notable are the Diehard test suite [15] by George Marsaglia and TestU01 [14] by Pierre L'Ecuyer and Richard Simard. The performance of the algorithms in these two test suites has been studied in the cited sources, and the main results for some of the PRNG implementations are briefly presented for the respective algorithms.

2.3.1 Matrix linear recurrence modulo 2

Linear recurrence over the two-element field uses the theory of finite fields, also known as Galois fields. Typically, we let F2 = GF(2) be the finite field with the two elements {0, 1}. We write the field operations as + and ×. If 0 is regarded as "false" and 1 as "true", then the field operations are "exclusive or" (⊕) and "and" (∧). The vectors and matrices used have elements in F2 [7]. As described in [25], a PRNG which uses linear recurrence over the two-element field is based on the relationships shown in 2.2-2.4.

x_i = A x_{i-1} mod 2,                                          (2.2)
y_i = B x_i mod 2,                                              (2.3)
u_i = Σ_{l=1}^{w} y_{i,l-1} 2^{-l} = .y_{i,0} y_{i,1} y_{i,2} ...    (2.4)

In the equations above, x_i = (x_{i,0}, ..., x_{i,k-1})^T and y_i = (y_{i,0}, ..., y_{i,w-1})^T are the k-bit state and the w-bit output vector at step i, A is a k × k binary transition matrix, B is a w × k binary output transformation matrix, k and w are positive integers, and u_i ∈ [0, 1) is the output at step i. Several well-known algorithms belong to this class of PRNGs. Apart from the Mersenne Twister and the xorshift family of generators, which are described in further detail below, it includes others such as the linear feedback shift register (LFSR), the generalized feedback shift register (GFSR), the twisted GFSR and the WELL generators [25]. In fact, it has been shown in [7] that the xorshift generators are equivalent to certain linear feedback shift registers.

2.3.2 Equidistribution

It is desirable that a PRNG should generate a sequence which behaves like independent samples of stochastic variables with the same distribution. However, there is no decisive definition of what good "randomness" for a PRNG means. Equidistribution, which in this context can also be referred to as the k-distribution test, is one property which is used as a measure of randomness [19, 25, 29]. For a sequence x_i of real numbers, equidistribution in one dimension means that the values in the sequence are uniformly distributed over the interval of the sequence. The proportion of terms falling in a subinterval is thus proportional to the length of the interval. If the values in the sequence are for example taken as digits in a decimal representation of π, then each of the integers 1, 2, 3, ..., 9 and 0 should appear on average in a tenth of the cases. In the same way, if we take a sequence of 32 bits, or w bits in general, then each of the 2^w possible combinations should occur equally many times in a period. A slight flaw is permitted in the case of the all-zero combination, which appears once less often. Equidistribution in 2 dimensions means that we study pairs of the w-bit values in the sequence, and require that all the 2^{2w} possible combinations occur equally many times in a period. For the general case of t dimensions, we require that all the 2^{tw} possible combinations occur equally many times in a period, except for the all-zero combination, which occurs once less often. The geometrical meaning is taken from viewing the values in the sequence as points in a t-dimensional space. For a sequence of w-bit integers, we divide each value by 2^w to normalize it into a pseudo-random real number in the [0, 1] interval, and place each consecutive t-tuple in the t-dimensional unit hypercube.
The sequence is t-dimensionally equidistributed if these points are uniformly distributed in the t-dimensional unit hypercube [5, 19, 29, 25]. This can be further generalized. Instead of studying all w bits in the output word, we study only the l most significant ones. This corresponds to truncating each value in the sequence and taking the value formed by the leading l bits. The sequence is then said to be (t, l)-equidistributed if each of the 2^{tl} possible combinations of bits occurs the same number of times in a period, except for the all-zero combination, which occurs once less often. In the geometrical interpretation of this, we still study the t-dimensional unit hypercube. We divide the interval [0, 1] into 2^l equally long segments, so that we partition the unit hypercube into 2^{tl} cubic cells of equal size. The sequence is said to be (t, l)-equidistributed if each of these cubic cells contains the same number of points [19, 25].

2.3.3 Mersenne Twister

The Mersenne Twister is a pseudo-random number generator proposed by Makoto Matsumoto and Takuji Nishimura in 1998 [19]. It has become a very popular choice as a PRNG for many software systems. In the words of Cleve Moler, founder of MathWorks and author of the first MATLAB [27]: "Mersenne Twister is, by far, today's most popular pseudorandom number generator. It is used by every widely distributed mathematical software package. It has been available as an option in MATLAB since it was invented and has been the default for almost a decade." [21] The Mersenne Twister in its typical implementation, the MT19937, has a very large period of 2^19937 − 1. It creates a sequence of 32-bit integers and this sequence is 623-dimensionally equidistributed. To achieve this the Mersenne Twister uses a state consisting of 624 32-bit words, meaning roughly 2.5 kB of memory [19]. The Mersenne Twister passes the Diehard tests [19] but fails some of the tests in TestU01 [14].

2.3.4 Xorshift

In the paper "Xorshift RNGs" [16], George Marsaglia proposes a class of simple and very fast random number generators based on the repeated use of the logical operation exclusive or (xor) of a computer word with a shifted version of itself. This operation can easily be implemented in software as a computer instruction and can be executed very quickly by a processor. In C code, the xorshift operation is y^(y << a) for left shifts and y^(y >> a) for right shifts. Marsaglia establishes the requirement needed for these PRNGs to have a full period, and supplies all sets of parameters which achieve this for certain types of xorshift generators. To showcase the simple structure of the xorshift generators, Marsaglia provides the essential C code for generating a random sequence of 32-bit integers with a period of 2^128 − 1. This code is shown below in listing 2.1. As visible from the code, only three xorshift operations are used, and the generator requires four random 32-bit seeds x, y, z, w.

Listing 2.1: C code example for xorshift generator

tmp = (x ^ (x << 15));
x = y; y = z; z = w;
return w = (w ^ (w >> 21)) ^ (tmp ^ (tmp >> 4));

The xorshift generators are mathematically modelled by Marsaglia as linear transformations over the binary vector space, characterized by a nonsingular n × n binary matrix T. The value of n is typically selected to be 32 or a multiple thereof, to facilitate the implementation on a computer. In this notation, the linear transformation of a binary vector y is denoted by yT. The xorshift operation for a left shift is represented by T = I + L^a, where L is the n × n binary matrix that effects a left shift of one position on a binary vector y, that is, L is all 0's except for 1's on the principal subdiagonal. Similarly, R is used to denote a right shift of one position on a binary vector, so that the xorshift operation for a right shift is represented by T = I + R^a. For the mathematical model we also define the seed set Z as the set of 1 × n binary vectors β = (b_1, b_2, ..., b_n), excluding the zero vector. Such a binary vector can also be referred to as the state of the xorshift generator. If β is a uniform random choice from Z, then each member of the sequence βT, βT^2, βT^3, ... is also uniformly distributed over Z. Marsaglia presents a theorem, shown here as theorem 2.1, for establishing when the xorshift generators will have a full period.

Theorem 2.1. In order that a nonsingular n × n binary matrix T produce all possible non-null 1 × n binary vectors in the sequence βT, βT^2, βT^3, ... for every non-null initial 1 × n binary vector β, it is necessary and sufficient that, in the group of nonsingular n × n binary matrices, the order of T is 2^n − 1.
If the order of T is 2^n − 1, then each of the matrices T, T^2, T^3, ..., T^k is distinct and nonsingular, and it follows from the theorem that the sequence βT, βT^2, βT^3, ... must have a period of k = 2^n − 1. When using only two xorshift operations for n = 32 or n = 64, there are no values for the shift lengths which will create a matrix T of the desired order 2^n − 1 and thus a sequence of the maximum period. But when using three consecutive xorshifts on the same binary vector y, i.e. the operation yT where T is of the form T = (I + L^a)(I + R^b)(I + L^c), there are several sets of shift lengths (a, b, c) which will yield the desired property of the matrix, and thus attain the maximum period of the sequence. Marsaglia presents all the possible choices which achieve this for both n = 32 and n = 64. In a later examination of the xorshift generators [25], it has been shown that to reach the maximal period, both left and right xorshifts must be used.
Panneton and L'Ecuyer have in a subsequent analysis of the xorshift generators pointed out some weaknesses [25]. Their report included an analysis of the theoretical properties, a search for the best xorshift generators according to equidistribution, and empirical tests using the TestU01 suite. In further detail, these tests included all full-period xorshift generators which have either 32-bit xorshift operations and a 32-bit state, or 64-bit xorshift operations and a 64-bit state. They also analysed xorshift generators with a larger state, which utilise different xorshift operations on different words in the state. However, among this type of generators they only empirically tested the best generators they found according to equidistribution. They concluded that the xorshift generators are fast, but not reliable according to their analysis of equidistribution and results from empirical statistical tests. They state that the generators which only use 32 bits or 64 bits of state are doomed due to their short period. To overcome the other limitations, they propose using more xorshifts than three or combining the xorshift generators with other RNGs from different classes [25]. The proposed use of a non-linear operation to scramble the result of xorshift generators is explored in a paper by Sebastiano Vigna [29]. The scrambling is done with multiplication by a suitable constant to achieve what is called an xorshift* generator. Vigna restricts the search to consider only 64-bit shifts, and states consisting of 64 bits or a power of two thereof. Several xorshift generators of different state sizes using this type of multiplication are presented. Their results in empirical tests with TestU01 are compared with those of other generators, among them the Mersenne Twister. Among other things, Vigna concludes that an xorshift generator followed by a multiplication can give very good statistical quality.
He remarks that the Mersenne Twister and other well-known generators have more linear artifacts than even a 64-bit state xorshift generator followed by multiplication. The Mersenne Twister and other generators with extremely long periods also have problems when the state has many zeros. Vigna also points out problems with evaluating generators according to equidistribution, and remarks on the high-bit bias of TestU01, such that the result is very different when the bits are tested in reverse order. Other methods for scrambling are presented by Saito and Matsumoto with what they call the xorshift-add (XSadd) generator [26, 28]. The proposed generator has a 128-bit internal state and uses addition as a non-linear output function. It passes the BigCrush test of TestU01, but fails when reversed. Sebastiano Vigna further develops this use of addition with xorshift generators and creates what he calls the xorshift128+ generator. This generator has 128 bits of state and uses 64-bit operations. It passes the BigCrush test of TestU01, even reversed, and is presently used by the JavaScript engines of Chrome, Firefox and Safari according to the author [28]. The current recommendation from Vigna is to use a successor to the xorshift128+ generator called xoroshiro128+, but he also states that the xorshift* generators "are an excellent choice for all non-cryptographic applications" [3].
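To make the discussion concrete, a C sketch of a 128-bit xorshift generator of the kind referred to above is shown below. It follows Marsaglia's example with state (x, y, z, w) and the updated w as output; the shift constants (11, 8, 19) are one of the full-period triples from his paper. The struct and function names are ours, not part of any established API.

```c
#include <stdint.h>

/* State of a 128-bit xorshift generator: four 32-bit words. */
typedef struct { uint32_t x, y, z, w; } xor128_state;

/* One step of Marsaglia's 128-bit xorshift generator.  The oldest
   word x is xorshifted, the state words are shifted along, and the
   output is the updated w, as in the generator discussed in the text. */
uint32_t xor128_next(xor128_state *s)
{
    uint32_t t = s->x ^ (s->x << 11);            /* xorshift on the oldest word */
    s->x = s->y; s->y = s->z; s->z = s->w;       /* shift the state along */
    s->w = s->w ^ (s->w >> 19) ^ (t ^ (t >> 8)); /* combine and output */
    return s->w;
}
```

Given the same seed, two instances of the state produce the same sequence, which is what makes simulations reproducible.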

2.4 Methods for generating normally distributed variables

If you have access to independent uniformly distributed random variables it is in principle possible to generate any one-dimensional probability distribution. If we desire to generate a random variable X from a distribution with a cumulative distribution function F, this can be done by using the inverse of the CDF and setting X = F^−1(U), where U is a random variable with uniform distribution, U ∼ U(0, 1) [5]. For the normal distribution, it is not very easy to calculate the inverse of the CDF Φ(x). Instead other methods are used. Two common methods are the Box-Müller method and the Marsaglia Polar method. The Box-Müller method is more computationally demanding than the Polar method since it requires calculating the sine and cosine of an angle. Below are more detailed descriptions of the two methods, as well as a description of the Ziggurat method, a more complex method which aims at being more computationally efficient.

2.4.1 The Box-Müller method

The method proposed by George Edward Pelham Box and Mervin Edgar Müller in 1958 generates a pair of independent random variables X1, X2 from a normal distribution with zero mean and unit variance using a pair of independent random variables U1, U2 with uniform distribution over [0, 1]. The fundamental relationship is shown in 2.5 [6].

X1 = sqrt(−2 ln U1) cos(2πU2)
X2 = sqrt(−2 ln U1) sin(2πU2)    (2.5)

2.4.2 The Marsaglia Polar method

Proposed by George Marsaglia and Thomas Bray in 1964, this method is similar to the Box-Müller method in that it also generates a pair of normally distributed variables of the form shown in 2.6.

X1 = R cos φ
X2 = R sin φ    (2.6)

However, this method avoids the use of the trigonometric functions. Instead it is based on first finding a random point which is uniformly distributed in the unit circle, and using a ratio of polynomials for the sine and cosine of the angle to this point. To find a point with uniform distribution in the unit circle, we generate values U1, U2 which are uniform in [−1, 1] and reject the point (U1, U2) until U1^2 + U2^2 < 1. Then the normally distributed pair X1, X2 shown in 2.7 is returned.

X1 = sqrt(−2 ln(U1^2 + U2^2)) · U1 / sqrt(U1^2 + U2^2)
X2 = sqrt(−2 ln(U1^2 + U2^2)) · U2 / sqrt(U1^2 + U2^2)    (2.7)

We can see that the sine and cosine terms in 2.6 are replaced by rational expressions, and the corresponding angle will be uniformly distributed since the point (U1, U2) is uniformly distributed in the unit circle. The distance R to the new point (X1, X2) is, both in this case and in the Box-Müller method, given by an expression of the form sqrt(−2 ln U). This expression is the inverse of the CDF for the Rayleigh distribution; a full motivation for the use of this can be seen in [5]. The Marsaglia Polar method uses U = U1^2 + U2^2, and this is based on the fact that U1^2 + U2^2 is uniform on [0, 1] and is independent of U1/U2, and hence independent of U1/sqrt(U1^2 + U2^2) and U2/sqrt(U1^2 + U2^2) [17]. Since the Marsaglia Polar method initially rejects all points (U1, U2) unless U1^2 + U2^2 < 1, it is referred to as a rejection sampling technique. And since the area of the unit circle is π, it will reject on average 1 − π/4 ≈ 21% of the generated uniform random variables. This will require the method to generate 1/(π/4) − 1 ≈ 27% more values than required by the application.

2.4.3 The Ziggurat method

This method for generating variables from a normal distribution was published by George Marsaglia and Wai Wan Tsang in 2000, but is based on a method the two authors developed in the 1980s [18]. It is the method used for MATLAB's randn function [20].

The Ziggurat method is essentially a table lookup algorithm which uses a rejection method. It can be used to sample from any decreasing density function. The approach is to cover the target density with the union of a collection of sets from which it is easy to choose uniform points, and then use the rejection method. If we denote the area under the plot of the PDF by C and the union of the covering sets by Z, note that C ⊂ Z. Then we can describe the rejection method like this: select random points (x, y) in Z until you get a point that is in C, and return the value of x. This way the returned values will belong to the desired distribution. The Ziggurat method covers the PDF of the normal distribution by stacking horizontal rectangles on top of a base strip which tails off to infinity. An important aspect of the algorithm is that all of these sets, both the rectangles and the base strip, have the same area. This is illustrated in figure 2.2 by using 8 sets: 7 rectangles and a base strip. The number of sets is best chosen to be a power of 2 to make random selection from a table easy. It is also preferable to utilize a lot more sets than 8 in order to reduce the area which is covered in excess of the desired PDF. In the C code provided in [18], the method uses 128 sets for the normal distribution.

Figure 2.2: The Ziggurat method with 7 rectangles and a base strip. This figure is copied without alterations from the article [18] by George Marsaglia and Wai Wan Tsang, provided under the Creative Commons Attribution 3.0 Unported License

The Ziggurat method gets its name from the shape of the rectangles. Since each of the rectangles has an equal area, the top ones are taller than the bottom ones, so that the layered rectangles resemble a ziggurat step pyramid. The reason for desiring the area of all sets to be equal is that this allows us to select a set at random using a uniform index. We let the top rectangle be number one, and we denote the rightmost point of each rectangle i as x_i, and let r be the rightmost x_i. We also assume an empty rectangle R0 with x_0 = 0, the left edge of R1. Let W be a uniform value in [−1, 1], and U be uniform in [0, 1]. Let f(x) be the PDF of the desired normal distribution. Using this notation we can describe the Ziggurat algorithm with the following steps:

1. Select a random index i from a uniform distribution to specify the set

2. Set x = W·x_i

3. If |x| < x_{i-1}, return x

4. If i = 0, return an x from the tail by a special method

5. If [f(x_{i-1}) − f(x_i)]·U < f(x) − f(x_i), return x

6. Go to step 1

The special method for returning a value from the tail is as follows: Generate x = −ln(U1)/r and y = −ln(U2) until y + y > x·x, then return r + x or −(r + x) depending on the sign of the original x. When 256 sets are used to cover the PDF, the algorithm will return in step 3 around 99% of the time. [18]

2.5 Hardware acceleration for turbo codes

This section presents some commercially available hardware systems which include DSP processors and hardware acceleration for turbo codes. Texas Instruments offers several products for telecom infrastructure. Their product TCI6618 [2] is a multicore system-on-a-chip DSP system with hardware accelerators for bit rate processing such as turbo encoding and decoding. It can perform the turbo decoding with a throughput of 582 Mbps for the 6144 bit code block size and 6 iterations. The TCI6618 system is based on the C66x DSP core. The level-1 (L1) program and data memories on the TCI6618 device are 32 kB each per core. The level-2 (L2) memory is shared between program and data space for a total of 4,096 kB (1,024 kB per core). The TCI6618 contains 2,048 kB of multicore shared memory (MSM) that is used as a shared L2 SRAM or shared L3 SRAM. Another hardware system that is commercially available is the MSC8157 [1] from NXP Semiconductors (previously Freescale Semiconductor, soon to be acquired by ). The MSC8157 is a system-on-chip DSP system with hardware acceleration for baseband applications such as turbo encoding and decoding. It can perform the turbo decoding with a throughput of up to 330 Mbps. The MSC8157 includes six StarCore SC3850 DSP subsystems, each with an SC3850 DSP core, 32 kB L1 instruction cache, 32 kB L1 data cache, unified 512 kB L2 cache configurable as M2 memory in 64 kB increments, and an M3 shared memory of 3072 kB.

3 Method

This chapter describes how the simulator has been implemented, both in a wider view where the fundamental model and general implementation considerations for the simulator setup are described, and also on a detailed level describing the proposed method for using a look-up table to generate random noise. First the fundamental simulator model is described in 3.1, and then the general principles for the implementation and use of the simulator are described in 3.2. This is followed in section 3.3 by a description of the different program structures which have been implemented and tested. Lastly, the proposed table method for generating random noise is described in section 3.4.

3.1 Fundamental simulator model

A typical approach used to study the performance of channel coding techniques is to model the transmission medium as an additive white Gaussian noise (AWGN) channel. This is the underlying model and general idea used for the simulator described below. For a time-continuous channel, where X is the transmitted signal and Y is the received signal, the AWGN channel is modelled by 3.1

Y = X + W    (3.1)

where the noise W is white and Gaussian. The property of being white is a strictly mathematical model which signifies that the energy of the signal is uniformly distributed over all frequencies. White noise is characterized by a single parameter σ² called the energy per degree of freedom or spectral density. That the noise is Gaussian means that the coefficients of the signal have a Gaussian or normal distribution with zero mean: Wn ∼ N(0, σ). In the application at hand, we will only utilize a time-discrete channel. In this case, the samples of W are independent and belong to a Gaussian or normal distribution. [13]


3.2 General principles of the simulator setup

The objective with the simulator is to be able to generate curves showing the turbo code block error rate (BLER) as a function of the signal-to-noise ratio (SNR). The overall method to achieve this is very simple: try to perform the turbo coding process several times and see how often it fails. This general principle can be described in the few steps below:

1. Generate input message

2. Perform turbo encoding

3. Generate random noise according to AWGN model and add to data

4. Perform decoding

5. Compare the output to the input and store statistics

6. Repeat until sufficient statistics have been obtained

The input message generated in step 1 could be any bit sequence. However, it can be advantageous to use a random sequence rather than all ones or all zeros to avoid suffering from potential implementation flaws, such as a decoder which has a bias to the all-zero codeword. Since hardware acceleration components will be used for the encoding and decoding of turbo codes, the task of the simulator is mainly to interface these components and to use suitable methods to simulate disturbance from an AWGN channel. General considerations about interfacing hardware acceleration components and adding noise are described in section 3.2.1. To generate a curve over the BLER performance the simulations have to be run at several different SNR values. How to perform the needed calculations concerning SNR is described in section 3.2.2. Considerations about how many times this process has to be repeated for each SNR value are described in section 3.2.3.

3.2.1 Soft bits and noise addition

The turbo decoder used in this thesis accepts input in the form of 8-bit integer values. These 8 bits are together referred to as a soft bit, and the original binary value from the encoding process is referred to as an encoded hard bit. So each encoded hard bit in the set {0, 1} must be mapped to a soft bit value belonging to the integer set {−128, −127, ..., 126, 127}. To begin describing how this mapping is done we first select a positive value µ. This µ can be referred to as the signal level, and the mapping F of encoded hard bits to soft bit values can be described by the relations in 3.2.

F(0) = µ
F(1) = −µ    (3.2)

Based on the AWGN model, we want to add white Gaussian noise to each soft bit value. If we let W ∼ N(0, σ), then the mapping of encoded hard bits to soft bits with additive white Gaussian noise can be described by the relations in 3.3.

F(0) = µ + W
F(1) = −µ + W    (3.3)

With this notation, a soft bit X has a distribution of X ∼ N(±µ, σ), where the sign depends on if the hard bit value is zero or one. In order to facilitate the implementation in code and to achieve effective execution, we will first generate a random value G ∼ N(µ, σ) regardless of the hard bit value, and then assign a positive or negative sign depending on the hard bit value. The random value is generated using PRNG methods implemented to run on DSP processors. The different methods used have been described in chapter 2. Using this approach, we can express the implementation in code as shown below in 3.1.

Listing 3.1: C implementation of hard to soft bit conversion with additive white Gaussian noise

soft_bit_value = (hard_bit_value == 0 ? G : -G);

Due to the integer nature of the soft bits, the value G first has to be rounded to the nearest integer value in the interval [−127, 127] in order to be sure that both G and −G belong to the set {−128, −127, ..., 126, 127}. However, with this method, the soft bit value -128 will never be utilized. It should be noted that the use of integer values implies that a quantization noise will be present in the resulting values.

3.2.2 SNR calculations

The signal-to-noise ratio measures the energy of a useful signal relative to the noise part of a signal. So if we have a useful signal X which is contaminated by additive noise W, then the signal-to-noise ratio of the resulting signal is defined as the ratio of the energy in X to that of W. We thus get SNR = E_X / E_W. The SNR is typically expressed in the logarithmic decibel scale as shown in 3.4. [13]

SNR_dB = 10 log10(SNR) = 10 log10(E_X / E_W)    (3.4)

In our case, with the signal defined in 3.3, the SNR is µ²/σ², so that we arrive at 3.5

SNR_dB = 10 log10(µ²/σ²)    (3.5)

With the current application of finding the BLER value at a particular SNR level, we are however more interested in the reverse calculation order. We wish to

find the standard deviation σ corresponding to a certain SNR value. For a given value of the signal level µ, this standard deviation for the normally distributed noise can be calculated according to 3.6

σ = sqrt(µ² / 10^(SNR_dB/10))    (3.6)

SNR can also be presented in another way using the notation Eb/N0, but this alternative will not be used in this thesis.

3.2.3 Number of coding attempts

The simulator has been configured to continue running successive attempts of encoding and decoding until 100 errors are encountered. This is a typical approach used for this type of simulation: it puts a limit on how many iterations will be performed and it gives the generated values a certain statistical significance without having to perform any calculations for the desired number of iterations in advance. For an SNR level which corresponds to a BLER value of 10^−9 we would expect to have to run 10^11 iterations of attempted encoding and decoding processes. The statistical significance of the generated values may also depend on other aspects, such as how well the PRNG methods actually mimic true randomness. There is some background theory for this supplied in chapter 2, but for the remainder of this section and the description of the method for determining the statistical significance of values we assume true randomness. First we recall the background theory concerning a binomial distribution presented in section 2.1. If we perform n different attempts of encoding and decoding, and we model this as independent trials each with a probability of failure p, then the number of failed attempts x will belong to a binomial distribution. If n is large this can be approximated by a normal distribution where µ = np and σ = sqrt(np(1 − p)). As also described in section 2.1, if we have a point estimator which is approximately normally distributed, and we have the estimators Θ* = np = 100 and d = sqrt(np(1 − p)) ≈ 10, then a confidence interval with a confidence level of 95% is given by 3.7

I_Θ = (Θ* − λ_{α/2}·d, Θ* + λ_{α/2}·d) = (100 − 1.96·10, 100 + 1.96·10) = (80.4, 119.6)    (3.7)

If we were to calculate BLER values from an estimator which has this type of confidence interval, it would give the calculated BLER value roughly one significant figure.
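The interval in 3.7 is easy to reproduce numerically; a small sketch (names ours):

```c
#include <math.h>

/* Confidence interval for an approximately normal point estimator
   (equation 3.7): theta is the estimate, d its standard deviation
   and lambda the normal quantile (1.96 for a 95% confidence level). */
void confidence_interval(double theta, double d, double lambda,
                         double *lo, double *hi)
{
    *lo = theta - lambda * d;
    *hi = theta + lambda * d;
}
```

With theta = 100, d = 10 and lambda = 1.96 this reproduces the interval (80.4, 119.6) in equation 3.7.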

3.3 Simulator program structure

Two different program structures for performing the simulations have been implemented and tested in this thesis. These program structures define how the

flow of execution should be passed along between the different steps of the simulation process and how multiple DSP resources should be utilized to enable a parallel execution of the program. In a sense, the implementation according to these two program structures corresponds to the implementation of two different simulator programs. Much of the program can, and has, been divided into different subroutines which are callable from a library. However, the two implementations differ completely concerning the part of the simulator program from where the simulation is started and ended, and the execution flow is controlled throughout the duration of the simulation. The two different program structures are implemented to study how well the program can be executed in parallel over multiple DSP resources, and to aid in studying how well both the hardware acceleration and DSP components are utilized. Due to the differences in the implementation, the two simulator programs have also been used as references to each other, for comparing the simulation results and searching for potential bugs in the program as an attempt to verify the implementations. The two different program structures implemented and tested in this thesis are referred to as the busy-wait simulator structure and the callback simulator structure.

3.3.1 Busy-wait simulator structure

The busy-wait simulator structure is a very simple program structure which is probably seen by many as a straightforward and natural way to implement the steps of the simulator described in chapter 3.2. With this program structure, every iteration of the simulator, or independent trial of coding and decoding, is run to completion before the next iteration or trial is started. All the steps in the simulation process are executed in series in a single DSP resource, so that all data can be handled in memory local to only that DSP resource, except of course for the final statistics which need to be shared by and accessible from all DSP resources. The busy-wait structure is illustrated in figure 3.1.

Generate input message and send to encoder

Wait for encoder to complete

Fetch output, add noise and send to decoder

Wait for decoder to complete

Fetch output, compare to input message and store statistics

If more attempts are needed, repeat process

Figure 3.1: A graphical representation of the busy-wait program structure

3.3.2 Callback simulator structure

The callback simulator structure allows a new iteration, or an independent trial of coding and decoding, to be started before the previous one is finished. This behaviour is achieved by making each of the steps 1, 3, 5 described in chapter 3.2 into self-contained and separate jobs. These jobs must still be executed in a serial way since the results obtained in one step are needed for the next. The steps are made self-contained by requiring that all data needed for the execution of the respective procedure be supplied as arguments. In this way, subroutines can be defined which can perform each step without the need for the different steps to be executed with access to the same memory. The different steps can therefore be executed on different DSP resources and new independent trials can be started before the previous iteration has completed. The aim with this program structure is to better utilize the DSP resources by eliminating the need to wait for the accelerator components to finish their task. Instead the DSP resource can be utilized to perform a subroutine belonging to a different step from another independent trial of coding and decoding. The term callback in computer programming is used to denote a function passed as an argument to another function with the intention that at an appropriate time, the executing function makes a call back to the argument function. This is precisely how program control is passed along when a coding or decoding job is sent to the hardware accelerator components. The call to the accelerator is bundled with arguments to the subroutine which will perform the continuation of the process, and the data needed to perform it. When the hardware acceleration process is complete, the callback function is executed on an available DSP resource.
In the actual implementation, the steps 1 and 5 in chapter 3.2, i.e. the step performing the initialization of an attempt and the step performing the evaluation of the results from decoding, are combined into one self-contained job. In this way, there are two separate, self-contained jobs performed by software executed on a DSP, in addition to the encoding and decoding jobs executed in the hardware acceleration components. We can refer to the job which executes steps 5 and 1 as A and the job which executes step 3 as B. This program structure is illustrated in figure 3.2, where an additional software job, C, is defined to initialize the very first iteration.

[Figure: diagram of the callback structure, with software jobs A (analyze decoding result, start next encoding), B (add noise) and C (start first encoding) interleaved with the turbo encoding and decoding accelerator stages]

Figure 3.2: A graphical representation of the callback simulator structure: the circles represent execution of software in DSPs, the rectangles represent execution in hardware accelerators. The dotted lines represent how program control is passed along by callback instructions, the solid lines show the chronological execution path

3.4 The table method for noise generation

The table method for generating noise is a very simple table look-up method which is proposed, implemented and tested in this thesis. The idea of the method is to generate approximately normally distributed random variables in a very fast way based on uniformly distributed random values. The reason for proposing this method is to study if even an extremely simple method generating only approximately normally distributed values can be used in the simulator to achieve a comparable level of quality in the generated BLER curves. The hope is also that this method can supply a relevant baseline concerning the execution speed of generating noise values in a DSP system, in order to compare to the execution speed of other and more established methods which adhere to a true normal distribution. As stated above, this method is a very simple table look-up method. First a table is set up according to a desired normal distribution, and then a random integer can be generated and used as an index to the table. We used a table containing 256 values so that a positive integer of 8 random bits can be used as an index. The PRNG methods used in this thesis generate a 32-bit integer value, so each generated value was divided up into four different random indexes. In section 3.2.1 there is a description of the general approach used in the implementation to create a signal with additive noise. For each simulation at a specified SNR value, there is a need to generate a large number of random variables from a normal distribution with a constant mean µ and a constant standard deviation σ. The table method for noise generation is designed to enable the generation of such values in a fast way.
In general, this method can be described like this: For a probability distribution specified by a constant parameter set P, a table consisting of N elements is set up such that the values in the table represent points {x1, x2, ..., xN} on the x-axis of the distribution CDF, which we can denote by Φ(x). These points are centred around the expected value of the distribution and are equidistant points in terms of probability. By this we mean that for each pair of adjacent points x_i and x_{i+1}, the probability mass between these points is the same, i.e. they adhere to the rule shown in 3.8.

Φ(x_{i+1}) − Φ(x_i) = C,  ∀ i ∈ {1, 2, ..., N−1},  for a constant C ∈ R    (3.8)

Furthermore, the values have been set so that the probability mass in the tails of the distribution which are beyond the largest or smallest value in the table should equal C/2, corresponding to the relationships in 3.9

Φ(x_1) − Φ(−∞) = Φ(x_1) = C/2
Φ(∞) − Φ(x_N) = 1 − Φ(x_N) = C/2    (3.9)

With this set-up of the tables, the probability mass in the tails of the distribution not covered by the table values will amount to 1/N. In the current context and implementation, the distribution is of course the normal distribution specified by P = {µ, σ} and the table contains N = 256 entries. This means that the probability mass of the distribution not covered by the table amounts to 0.4%. The table entries are all integer values due to the integer nature required by the accelerator components as described in section 3.2.1. In order to facilitate an easy implementation in the DSP software for setting up a table according to a current parameter set, there is a table stored in memory containing the 256 floating point values for a normal N(0, 1) distribution. These values can be scaled, shifted and rounded as needed, according to the desired values of µ and σ, to get a table with integer entries. The table of floating point values used for a normal N(0, 1) distribution can be generated using the MATLAB code shown in listing 3.2 below. The actual values of the table are also presented in appendix A.

Listing 3.2: MATLAB code for generating a table for the standard normal distribution

N = 256;
P = 1/(2*N) : 1/N : 1;
X = norminv(P);

4 Results

In this chapter the results are presented. First are some results showing output from the simulator, i.e. BLER-curves, for the different simulator structures and methods for generating normally distributed values. These are shown in section 4.1. Then results showing the execution time for the implemented PRNGs are shown in section 4.2. The execution times for the implemented methods which convert uniform variables to a normal distribution are shown in section 4.3, together with examples of the empirical distribution of the methods. The execution times for the resulting different ways of converting a hard bit value to a soft bit value are shown in section 4.4. Results showing the throughput of the simulator are shown in section 4.5, and after this the distribution of the execution time among the different steps of the simulator is shown in section 4.6. Finally, in section 4.7, some BLER-curves for very low error rates are presented.

4.1 BLER curves

4.1.1 Reference curves and axis

The BLER-curves presented here will be shown together with a reference curve. Results for two different code block sizes will be presented, both for 40 bit code blocks and for 6144 bit code blocks, which are the smallest and largest code blocks available. These reference curves are previous results from a completely different simulator implemented entirely in software. The two reference curves for the different code block sizes are shown in 4.1 and 4.2. The BLER curves show the block error rates, i.e. the probability that the codeword is incorrectly decoded, for different signal-to-noise values. All BLER-curves shown are for turbo codes with a code rate of 1/3. All BLER-curves are also based on the principles described in 3.2.3, i.e. that for each SNR-value the simulation

has run until 100 failed decodings have been encountered, unless stated otherwise. For all BLER-curves, the SNR values shown on the x-axis are set to be zero where the reference curve for 6144 bit code blocks has a block error rate of 1%.

[Figure: BLER (log scale) vs. scaled SNR [dB], titled "Turbo code 40 bit code block, 1/3 code rate", showing the reference curve]

Figure 4.1: The reference curve for 40 bit code blocks

[Figure: BLER (log scale) vs. scaled SNR [dB], titled "Turbo code 6144 bit code block, 1/3 code rate", showing the reference curve]

Figure 4.2: The reference curve for 6144 bit code blocks

4.1.2 Different noise methods

The figures below each show three different generated BLER-curves, in addition to a reference curve. The curves in each figure display the simulation results when using different methods for generating the additive noise. The results for 40 bit code blocks are shown in 4.3 and the results for 6144 bit code blocks are shown in 4.4.

[Figure: BLER (log scale) vs. scaled SNR [dB], titled "Turbo code 40 bit code block, 1/3 code rate, Callback simulator"; legend: Reference curve, Table method, Ziggurat method, Polar method]

Figure 4.3: Simulation results for 40 bit code block size

[Figure: BLER (log scale) vs. scaled SNR [dB], titled "Turbo code 6144 bit code block, 1/3 code rate, Callback simulator"; legend: Reference curve, Table method, Ziggurat method, Polar method]

Figure 4.4: Simulation results for 6144 bit code block size

4.1.3 Different simulator structures

The figures below show the simulation results of the implementations in the different simulator structures for the same methods for generating the noise. The random seeds for the different simulations also differ. We see the results of the different simulator structures for the table method in 4.5, for the Ziggurat method in 4.6, and for the Polar method in 4.7.

[Figure: BLER (log scale) vs. scaled SNR [dB], titled "Turbo code 6144 bit code block, 1/3 code rate, table method noise"; legend: Reference curve, Callback simulator, Busy-wait simulator]

Figure 4.5: Simulation results of the different simulator structures for 6144 bit code block size using the table method

[Plot: Turbo code 6144 bit code block, 1/3 code rate, Ziggurat method noise; BLER vs. scaled SNR (dB); curves: reference, callback simulator, busy-wait simulator]

Figure 4.6: Simulation results of the different simulator structures for 6144 bit code block size using the Ziggurat method

[Plot: Turbo code 6144 bit code block, 1/3 code rate, Polar method noise; BLER vs. scaled SNR (dB); curves: reference, callback simulator, busy-wait simulator]

Figure 4.7: Simulation results of the different simulator structures for 6144 bit code block size using the Polar method

4.2 Execution time for PRNG methods

The execution times for the two implemented PRNG methods are shown in figure 4.8. The values shown are calculated as an average over 10 000 consecutive calls to the PRNG. The actual values are also shown in table 4.1.

Figure 4.8: The average execution time measured in clock cycles for generating a 32-bit pseudo random integer

Table 4.1: The average execution time, measured in clock cycles, for generating a 32-bit pseudo random integer

Method              Clock cycles
xorshift                   4.002
Mersenne Twister          46.363

4.3 Normal distribution methods

Three different methods for generating values with a normal distribution have been implemented: the Polar method and the Ziggurat algorithm described in chapter 2, as well as the proposed table method described in chapter 3.4. The results presented here show the execution time of these methods as implemented in the simulator, as well as an example of their empirical distributions. As described in the sections mentioned above, all three methods rely on an underlying PRNG for uniformly distributed values, which are used to generate normally distributed values. For all results presented here concerning the normal distribution methods, the underlying PRNG is the xorshift random number generator described in chapter 2.3. It should be reiterated that the tested normal distribution methods require different numbers of calls to the underlying PRNG to generate one normally distributed value. The output formats of the methods also differ. In the current application, the value from a normal distribution method eventually has to take the form of an 8-bit signed integer, as described in chapter 3.2.1. The execution times presented here are therefore for the average generation time of an 8-bit integer value; for the Polar method and the Ziggurat algorithm, this means the output has been converted to an 8-bit integer. The execution time for generating an 8-bit integer value with each normal distribution method is shown in figure 4.9. The values are calculated as an average over 10 000 consecutive calls to the respective method. The values are also shown in table 4.2.

Figure 4.9: The average execution time measured in clock cycles for generating an 8-bit normally distributed integer value

Table 4.2: The average execution time, measured in clock cycles, for generating an 8-bit normally distributed integer value

Method               Clock cycles
Table method                 9.51
Ziggurat algorithm         126.27
Polar method               249.68

The empirical distributions of these methods can be seen in figure 4.10. In this figure, the mean µ and the standard deviation σ have been set to 32 and 10 respectively. The theoretical CDF of integer values from a normal distribution with these parameters is also shown in the figure. The empirical CDFs for the Polar method and the Ziggurat algorithm have been created using 10 000 samples from their output, with the xorshift PRNG as the underlying source of uniformly distributed random variables. The CDF of the table method has instead been created directly from the table generated in this method. The x-axis is set to the interval [-128, 127], which is the available range for the 8-bit soft bit values. Figure 4.11 shows the same graph zoomed in around where the CDF approaches 1.

[Plot: empirical CDF vs. 8-bit value, x-axis -150 to 150; curves: theoretical, Polar method, Ziggurat algorithm, table method]

Figure 4.10: The empirical CDFs for µ = 32 and σ = 10 of the table method, Ziggurat algorithm and Polar method alongside the theoretical CDF for a normal distribution of integer values

[Plot: empirical CDF vs. 8-bit value, zoomed in where the CDF approaches 1 (x-axis 50 to 85, y-axis 0.98 to 1); curves: theoretical, Polar method, Ziggurat algorithm, table method]

Figure 4.11: The empirical CDFs for µ = 32 and σ = 10 of the table method, Ziggurat algorithm and Polar method alongside the theoretical CDF for a normal distribution of integer values

4.4 Hard to soft conversion with noise addition

The generated values with a normal distribution are used in the process of converting hard bits to soft bits as described in chapter 3.2.1. The execution time for this whole process, when using the different normal distribution methods, is shown below for both code block sizes of 40 bits and 6144 bits. In figure 4.12 we see the average execution time for converting a 40 bit code block from hard bits to soft bits, both without the addition of any noise and including addition of noise from the three methods discussed. The timings are taken as an average over 1000 code blocks. The corresponding results for a code block size of 6144 bits are shown in figure 4.13. For that figure, the timings are taken as an average over 100 code blocks.

Figure 4.12: The average execution time measured in clock cycles for converting hard bits to soft bits using different normal distribution methods for a 40 bit code block

Figure 4.13: The average execution time measured in clock cycles for converting hard bits to soft bits using different normal distribution methods for a 6144 bit code block

A larger code block size means more hard bits: for the 40 bit code block there are 132 hard bits and for the 6144 bit code block there are 18 444 hard bits. Since each hard to soft bit conversion with added noise requires a normally distributed value, as described in chapter 3.2.1, we also present the execution timing results normalized by the number of hard bits in figures 4.14 and 4.15.

Figure 4.14: The average execution time measured in clock cycles for converting one hard bit to a soft bit using different normal distribution methods for a 40 bit code block

Figure 4.15: The average execution time measured in clock cycles for converting one hard bit to a soft bit using different normal distribution methods for a 6144 bit code block

We will also present these results in a third way. We take the execution time for converting hard bits to soft bits without the addition of noise, and subtract this value from that of the conversions which include additive noise. This value is then normalized by the number of hard bits and presented in figures 4.16 and 4.17.

Figure 4.16: The figure shows both the average execution time measured in clock cycles for generating an 8-bit normally distributed value for each normal distribution method, denoted as Series 2, and the normalized execution time measured in clock cycles for converting hard bits to soft bits for a 40 bit code block using the different normal distribution methods, with the value for conversions without noise subtracted, denoted as Series 1

Figure 4.17: The figure shows both the average execution time measured in clock cycles for generating an 8-bit normally distributed value for each normal distribution method, denoted as Series 2, and the normalized execution time measured in clock cycles for converting hard bits to soft bits for a 6144 bit code block using the different normal distribution methods, with the value for conversions without noise subtracted, denoted as Series 1

The values for the figures above are shown in table 4.3 for 40 bit code blocks and in table 4.4 for 6144 bit code blocks.

Table 4.3: The execution time values for converting hard bits to soft bits for 40 bit code blocks with different methods for generating the additive noise. The table shows the total execution time measured in clock cycles, this value normalized by the number of hard bits (132), and the normalized value with the cost of the conversion without additive noise subtracted

Method              Total execution time   Normalized   With subtraction
No noise                            1739         13.2                  -
Table method                        4427         33.6               20.4
Ziggurat algorithm                 20473        155.1              141.9
Polar method                       37758        286.0              272.9

Table 4.4: The execution time values for converting hard bits to soft bits for 6144 bit code blocks with different methods for generating the additive noise. The table shows the total execution time measured in clock cycles, this value normalized by the number of hard bits (18444), and the normalized value with the cost of the conversion without additive noise subtracted

Method              Total execution time   Normalized   With subtraction
No noise                          227212         12.3                  -
Table method                      597507         32.4               20.1
Ziggurat algorithm               2778203        150.6              138.3
Polar method                     5187963        281.3              269.0

4.5 Simulator throughput

We define the throughput of the simulator as the average number of information bits which can be processed per second. By processed we mean passing through all the steps of the simulator as described in chapter 3. In the results presented below, we also include the corresponding value of the throughput for the turbo code decoding hardware accelerator unit as stated in the product specification. In figure 4.18 we see the throughput of the callback structure simulator for 40 bit code blocks. A zoomed-in version of this is shown in figure 4.19.

[Plot: Simulator throughput, 40 bit code block; percent of maximum throughput vs. number of virtual processes; curves: theoretical max, no conversion, no noise, Polar, table, Ziggurat]

Figure 4.18: The throughput in information bits per second for 40 bit code blocks of the callback structure simulator using different methods for converting hard bits to soft bits and adding noise

[Plot: Simulator throughput, 40 bit code block, zoomed to 0 to 12 percent of maximum throughput vs. number of virtual processes; curves: theoretical max, no conversion, no noise, Polar, table, Ziggurat]

Figure 4.19: The throughput in information bits per second for 40 bit code blocks of the callback structure simulator using different methods for converting hard bits to soft bits and adding noise

The corresponding results for the throughput of the callback structure simulator for a 6144 bit code block size are shown in figure 4.20.

[Plot: Simulator throughput, 6144 bit code block; percent of maximum throughput vs. number of virtual processes; curves: theoretical max, no conversion, no noise, Polar, table, Ziggurat]

Figure 4.20: The throughput in information bits per second for 6144 bit code blocks of the callback structure simulator using different methods for converting hard bits to soft bits and adding noise

Figure 4.21 shows the throughput of both the callback structure simulator and the busy-wait structure simulator for a 6144 bit code block size with noise addition using the table method. In this graph, the difference in maximum throughput between the two structures is 8.9%.

[Plot: Simulator throughput, 6144 bit code block; percent of maximum throughput vs. number of virtual processes; curves: busy-wait simulator, callback simulator]

Figure 4.21: The throughput in information bits per second for 6144 bit code blocks of the callback structure simulator and the busy-wait structure simulator using the table method for generating additive noise

4.6 Distribution of execution time

The execution time for performing simulations is spread over the different simulator steps described in chapter 3. We repeat these steps below for ease of readability:

1. Generate input message
2. Perform turbo encoding
3. Generate random noise and add to data
4. Perform decoding
5. Compare the output to the input and store statistics

The distribution of the execution time over these steps has been measured and is shown for the 40 bit code block size in table 4.5 and for the 6144 bit code block size in table 4.6.

Table 4.5: The distribution of the execution time over different steps in the simulator process for a 40 bit code block size. The execution time for each method is presented as the fraction of the total execution time

Step   Table method   Ziggurat algorithm   Polar method
1            0.0557               0.0486         0.0363
2            0.1489               0.1141         0.0786
3            0.3893               0.5259         0.6690
4            0.2786               0.1912         0.1245
5            0.1222               0.1150         0.0878

Table 4.6: The distribution of the execution time over different steps in the simulator process for a 6144 bit code block size. The execution time for each method is presented as the fraction of the total execution time

Step   Table method   Ziggurat algorithm   Polar method
1            0.0090               0.0021         0.0011
2            0.0129               0.0032         0.0017
3            0.9166               0.9797         0.9892
4            0.0468               0.0114         0.0060
5            0.0132               0.0032         0.0017

4.7 Low error-rate BLER-curves

This section presents results from the simulator when it is allowed to run for a longer period of time. Figure 4.22 shows results for the 40 bit code block size and figure 4.23 shows results for the 6144 bit code block size. For both these plots, the simulator has been configured to run until a maximum of 10^9 iterations.

[Plot: Turbo code 40 bit code block, 1/3 code rate; BLER (10^0 down to 10^-8) vs. scaled SNR (dB); curves: reference, table method, Ziggurat method]

Figure 4.22: The results of simulations concerning block error rate for turbo code with a code rate of 1/3 for a 40 bit code block size, with additive noise generated using either the table method or the Ziggurat algorithm

[Plot: Turbo code 6144 bit code block, 1/3 code rate; BLER (10^0 down to 10^-8) vs. scaled SNR (dB); curves: reference, table method, Ziggurat method]

Figure 4.23: The results of simulations concerning block error rate for turbo code with a code rate of 1/3 for a 6144 bit code block size, with additive noise generated using either the table method or the Ziggurat algorithm

5 Conclusion and Discussion

This thesis has investigated the use of hardware acceleration for performing link level simulations for turbo codes. Different program structures have been implemented and results concerning the execution speed and simulator output have been recorded. Conclusions and a discussion of the results are presented in the following sections. Section 5.1 covers aspects concerning the feasibility, quality and speed of this type of simulator. Aspects concerning how the simulator can be set up and implemented using different program structures are covered in section 5.2. The following section, 5.3, contains conclusions and a discussion concerning the bottlenecks of the tested simulator set-up. Lastly, section 5.4 offers some brief conclusions and remarks on normal distribution methods and the tested table method.

5.1 Simulator output and speed

The main output of the simulator is BLER curves, of which several different examples are presented in chapter 4. From these figures we can see that the generated BLER curves are very consistent across the different methods and program structures tested. The BLER curves for the 6144 bit code block size are typically within 0.1 dB of the reference curve. For the 40 bit code block size the curves are typically within 0.5 dB. This discrepancy is within a range we deem acceptable for presenting the results as plausible. As noted in the introduction, the aim of this thesis is not to guarantee that the implementation is free of flaws or bugs, but rather to study the possibility of using hardware acceleration components for simulations.

The use of hardware acceleration for the turbo encoding and decoding makes the simulator much faster than a comparable simulator based on a pure software implementation executed on standard processors. How much faster is hard to say and depends very much on the implemented codecs, and of course on what hardware platform is taken as the basis of the comparison. However, comparing the implemented simulator to a software simulator running on a typical desktop workstation, we feel fairly confident in saying that the improvement in speed should be at least a factor of 10^3. From these results we draw the conclusion that hardware acceleration can successfully be used to perform the described type of link simulations, with clear advantages in terms of the execution time needed.

As a further remark on the simulator set-up, it should be noted that a high degree of parallelization is required to reach a high throughput. In order to accurately study the empirical performance at very low BLER values, a very large number of independent attempts at encoding and decoding a message have to be performed. With the employed methodology of running simulations until 100 errors have been encountered, if we want to study the performance of turbo codes at an SNR value which corresponds to a BLER of 10^-9, then we will on average need to run 10^11 iterations. The largest defined codeword for turbo code in LTE release 8 is 6144 bits [22]. So if we use a code rate of 1/3 we get 18432 encoded bits, and we need to add noise to each of these bits. For this example we would therefore need to generate almost 2 x 10^15 (2 thousand trillion) random values belonging to a normal distribution.

To put this number in a time perspective, assume we use the xorshift number generator described in chapter 2.3. It is one of the fastest available PRNG methods [16, 25, 29] and it can generate a uniform random 32-bit integer in about 4.4 nanoseconds (running on a 1800 MHz PC) [16]. So if each of the random numbers were generated in sequence, the total execution time would be 4.4 x 10^-9 x 18432 x 10^11 seconds, which is more than 93 days. And this is only for generating uniform random integers; converting these into normally distributed values is not included, let alone the turbo encoding and decoding itself. Running all these computations in a serial fashion would thus result in long execution times even when using hardware acceleration for the coding and decoding.

5.2 Program structure

This thesis has presented two different program structures for implementing a simulator utilizing hardware acceleration, referred to as the busy-wait structure and the callback structure. The maximum throughput of the two simulator structures is shown in figure 4.21, and these results form the ground for some interesting conclusions.

With a system having the relative hardware resources of the tested system, we see in table 4.6 that over 90% of the total execution time is spent in the DSP:s. With this level of bottleneck behaviour, there is not much idle time for the DSP:s, even if they do nothing while they wait for the hardware acceleration to finish encoding or decoding. So implementing and using a more refined program structure, such as the tested callback structure, does not yield any large improvements in total speed or throughput, as we can see in the results.

If the relationship between the hardware resources had been different, say that the execution time needed to perform a coding and decoding attempt had been split equally between DSP and hardware acceleration, then the total throughput or speed could be doubled with a simulator structure making use of the waiting time. But at a level similar to the tested system, where DSP usage takes up 90% of the time, we only have a possible theoretical improvement of 100/90 - 1, or about 11%, from fully utilizing the DSP during the waiting time.

So the possible advantage, in terms of simulator speed, of utilizing a more refined program structure depends on the relationship between the hardware resources. In a system where this relationship is similar to the tested system, the overall improvements from a more refined program structure are not very large, as shown in the results. It should be noted that a more advanced program structure requires more development work and, since it is more complex, might be more prone to bugs.

One practical aspect of using a program structure similar to the tested callback structure in a multi-core DSP system is having to transfer data back and forth between the DSP:s and a memory common to all the DSP cores. This process means added work, which makes it impossible to reach the ideal theoretical improvements indicated by the relative amount of time the DSP processors are idle.

We can rephrase the conclusion in this way: even a very simple program structure can utilize the hardware resources very well, and yield a fast simulator with very little development work, depending on the relationship between the different hardware resources.

5.3 Bottleneck

The bottleneck of the described simulator is the conversion of hard encoded bits to soft bits with added noise. This conclusion can be drawn from two separate observations: first, we see in figures 4.20 and 4.18 that the simulator does not utilize the full potential of the hardware acceleration. Secondly, in tables 4.5 and 4.6 we see that the vast majority of the execution time is spent in the DSP:s performing the conversion of hard bits to soft bits with added noise.

Both the process of generating the noise values and the process of converting hard bits to soft bits are quite time consuming, and these processes are hard to keep apart. In table 4.2 we can see the execution time to generate a noise value as a stand-alone process. Tables 4.3 and 4.4 show the comparable time taken to generate a noise value when it is part of the process of converting hard bits to noisy soft bits. These results consistently show that the execution time for the combined process of generating noisy soft bits is larger than the sum of execution times for the two separate processes of converting hard bits to soft bits and generating noise values. In the case of the table method, where this effect creates the largest relative difference, the execution time to generate a noise value as a stand-alone process is 9.5 clock cycles, compared to the 20.1 clock cycles it takes as part of the process of creating noisy soft bits. From this we draw two potential conclusions:

a) The current implementation can possibly be improved by as much as this difference indicates

b) Improving one of these processes might not necessarily improve the combination, and therefore may not provide any substantial improvement to the simulator as a whole

How feasible it is to achieve the level of improvement hinted at in (a) is hard to say. There are of course fundamental aspects with respect to the hardware used, but also considerable aspects with respect to creating the software.
On a personal note, I believe that even though the hardware might be able to support quite substantial improvements, they might come at a high cost in development work. Although not fully described in this thesis, a not negligible amount of work has been devoted to testing variations of the software implementations and compiler options. This has included things such as when and how the memory is accessed, different loop structures and the effect of using different compiler-known functions. These changes have been studied with regard to their effect on execution time performance as well as the assembler output from the compiler. This process has yielded improvements in the overall execution time and speed of the simulator, but I believe it would be hard and time consuming to substantially improve on this further without some more fundamental change on an algorithmic level.

The conclusion in (b) is interesting regarding possible future work on improving the speed of the simulator. One possibility, which we discussed at a very early stage of this work, is using FPGA implementations to generate normally distributed noise. This definitely has the potential to improve the speed of the noise generation, but it might not improve the overall speed of the simulator. If the process of converting hard bits to soft bits is still done with software running on the DSP, then we get the added complexity of transferring the results from the FPGA to the DSP, which could possibly lead to an overall slower simulator. It should also be kept in mind that the current implementation provides a good baseline for comparisons.
With an average generation time of only about 4 clock cycles for a batch-generated random uniform integer from the xorshift PRNG, and a corresponding value of only 9.5 clock cycles for the whole process of generating a random integer with a normal distribution from the table method, even if it were possible to generate a random value with a normal distribution in a single clock cycle from an FPGA, it might not lead to an overall improvement once access time to the memory is taken into account. There is also the possibility of performing a larger part, or indeed the whole process, of converting hard bits to noisy soft bits with FPGA resources directly, but this would also require more development work.

The overall interpretation of the results is that this process as a whole must be addressed in order to achieve a faster simulator with higher throughput. Optimizing only one part of the process might not yield the desired results.

5.4 Normally distributed noise

Regarding the different methods of generating normally distributed noise, we can see in figure 4.20 that the proposed table method is substantially faster than the standard methods. From table 4.4 we can see that this holds both as a stand-alone method and as part of the process of converting hard bits to soft bits with added noise. However, the table method does not represent a true normal distribution, and the difference in the empirical distribution can be seen in figure 4.11. The BLER curves also show a discrepancy between the table method and the two other methods, visible in figure 4.4. The table method seems to give a slightly optimistic estimate of the code performance.

A further study of the cause of this would be of interest, in order to determine whether this is due to a fundamental need for an exact normal distribution, or whether smaller alterations to the method, such as a larger table, would change this. In this context, it should be noted that since the soft bits are integer values limited to a small number of bits, we will always have discretization effects with resulting quantization noise, and thus never achieve a "true" normal distribution for the noise, no matter which method is used.

The conclusion we draw from the available results is that methods such as the proposed table method are highly interesting for achieving a simulator with high throughput. Since the resulting BLER curves for the different methods are so similar, it definitely seems possible that methods which do not adhere to a strictly correct normal distribution can be usable. We also see that such methods can yield quite considerable improvements in the speed of the simulator.

6 Future Work

There are several areas which would be interesting to study further concerning the use of hardware acceleration for simulations. These can roughly be divided into the three sections below.

6.1 Improved usability of the simulator

It would be very interesting, and good for the usability of the simulator, to implement and test a framework which allows different steps of the simulator process to be interchanged with other implementations. For example, it would be good to easily be able to test different channel models in the simulator, or to test other codecs. If it were possible to swap the hardware accelerators used in this simulator for a generic module based on either a software implementation, an implementation in FPGA or another hardware accelerator implementation, that would make the simulator set-up a powerful tool for fast and accurate simulations adaptable to various needs. Being able to use a codec module implemented in FPGA would allow for much quicker prototyping and make the simulator more useful for development work.

6.2 Comparison with theoretical properties

A further analysis and comparison of the empirical results with theoretical properties would be interesting, both regarding the output from the simulator, where the slope of the BLER curves could be set in relation to theoretical aspects such as the minimum and mean distance between the codewords, and regarding the noise generation. The performance of the table method for generating normally distributed random noise is fairly good, perhaps surprisingly so. It would be interesting to try to determine what the fundamental requirements on the distribution of the noise are to achieve good quality in simulations and generation of BLER curves, and to use these results to analyse the discrepancies and similarities of the empirical simulation results from the different methods of generating noise.

6.3 Optimizations of the simulator

Naturally it would be interesting to try to further optimize the presented simulator set-up, especially concerning the process of converting hard bits to noisy soft bits, since this is the bottleneck of the implemented simulator. Possible options include improving the current software implementations, using an FPGA implementation, or perhaps testing some kind of solution including an A/D converter, which would allow truly random noise to be used in the simulator. It is also possible to test other software implementations. Since the input to the decoder is an 8-bit value, it would be interesting to try to implement some kind of 8-step binary search tree, which could generate noise with a true normal distribution, possibly in a very fast way.

Improving the tested table method would be very interesting since it produced such good results despite being so simple. As mentioned above, some kind of theoretical study of the requirements on the probability distribution used for this purpose would be interesting. Would it perhaps be possible to achieve even better results from the table method by slightly altering the size of the table or the values used within it? The main drawback of the current implementation is probably the truncation of the distribution, so that values from far out in the tails never occur in the generated sequence. A possible way to handle this is to use nested tables to improve the granularity for covering values in the tails. For example, if the generated index is 0 or 255, a new random index can be generated for a second table which covers only the tail, and a value from this second table can be returned. This methodology could also be used in several steps.

Appendix

A Normal distribution table

These are the 256 floating point values used for the N(0, 1) normal distribution table in the table method:

-2.8856349124267626, -2.5205022171903644, -2.3352330400688177, -2.2065752165371353, -2.1065540881628193, -2.0240136237191630, -1.9533237077453984, -1.8912292378201130, -1.8356715369125438, -1.7852624904353240, -1.7390199717299040, -1.6962225050066093, -1.6563238653408074, -1.6189001435373598, -1.5836154758017889, -1.5501990407917607, -1.5184291411525910, -1.4881218960233811, -1.4591230250215927, -1.4313017591024759, -1.4045462481588744, -1.3787600432219229, -1.3538593640751062, -1.3297709501812092, -1.3064303511275646, -1.2837805526081674, -1.2617708616359862, -1.2403559942306717, -1.2194953228462240, -1.1991522509932744, -1.1792936900106508, -1.1598896185252787, -1.1409127093423133, -1.1223380117021657, -1.1041426792922295, -1.0863057362981008, -1.0688078752591987, -1.0516312816573354, -1.0347594810882446, -1.0181772056006679, -1.0018702763769824, -0.9858255004050610, -0.9700305791772410, -0.9544740277674430, -0.9391451028960623, -0.9240337388053882, -0.9091304899448472, -0.8944264796122241, -0.8799133538196812, -0.8655832397563087, -0.8514287083055712, -0.8374427401492454, -0.8236186950515332, -0.8099502839698922, -0.7964315436842329, -0.7830568136747741, -0.7698207150120411, -0.7567181310510782, -0.7437441897466541, -0.7308942474276287, -0.7181638738872305, -0.7055488386621753, -0.6930450983876637, -0.6806487851276465, -0.6683561955905790, -0.6561637811503790, -0.6440681386006926, -0.6320660015779468, -0.6201542325952074, -0.6083298156346320, -0.5965898492514455, -0.5849315401469274, -0.5733521971719525, -0.5618492257262568, -0.5504201225218276,
-0.5390624706817188, -0.5277739351481804, -0.5165522583763212, -0.5053952562916195, -0.4943008144914711, -0.4832668846726728, -0.4722914812682608, -0.4613726782785133, -0.4505086062821783, -0.4396974496151201, -0.4289374437046087, -0.4182268725484092, -0.4075640663286797, -0.3969473991514547, -0.3863752869031996, -0.3758461852165612, -0.3653585875380231, -0.3549110232907173, -0.3445020561261195, -0.3341302822588196, -0.3237943288789543, -0.3134928526372756, -0.3032245381981662, -0.2929880968562333, -0.2827822652124026, -0.2726058039056998, -0.2624574963971555, -0.2523361478024895, -0.2422405837704474, -0.2321696494038453, -0.2221222082205574, -0.2120971411518491, -0.2020933455755961, -0.1921097343820800, -0.1821452350701681, -0.1721987888718080, -0.1622693499028698, -0.1523558843384725, -0.1424573696110161, -0.1325727936292278, -0.1227011540166061, -0.1128414573677141, -0.1029927185208379, -0.0931539598455833, -0.0833242105440363, -0.0735025059641587, -0.0636878869241329, -0.0538793990464062, -0.0440760921002204, -0.0342770193514366, -0.0244812369184941, -0.0146878031333598, -0.0048957779063425, 0.0048957779063425, 0.0146878031333598, 0.0244812369184941, 0.0342770193514366, 0.0440760921002204, 0.0538793990464062, 0.0636878869241329, 0.0735025059641587, 0.0833242105440363, 0.0931539598455833, 0.1029927185208379, 0.1128414573677141, 0.1227011540166061, 0.1325727936292278, 0.1424573696110161, 0.1523558843384725, 0.1622693499028698, 0.1721987888718080, 0.1821452350701681, 0.1921097343820800, 0.2020933455755961, 0.2120971411518491, 0.2221222082205574, 0.2321696494038453, 0.2422405837704474, 0.2523361478024895, 0.2624574963971555, 0.2726058039056998, 0.2827822652124026, 0.2929880968562333, 0.3032245381981662, 0.3134928526372756, 0.3237943288789543, 0.3341302822588196, 0.3445020561261195, 0.3549110232907173, 0.3653585875380231, 0.3758461852165612, 0.3863752869031996, 0.3969473991514547, 0.4075640663286797, 0.4182268725484092, 0.4289374437046087, 0.4396974496151201, 
0.4505086062821783, 0.4613726782785133, 0.4722914812682608, 0.4832668846726728, 0.4943008144914711, 0.5053952562916195, 0.5165522583763212, 0.5277739351481804, 0.5390624706817188, 0.5504201225218276, 0.5618492257262568, 0.5733521971719525, 0.5849315401469274, 0.5965898492514455, 0.6083298156346320, 0.6201542325952074, 0.6320660015779468, 0.6440681386006926, 0.6561637811503790, 0.6683561955905790, 0.6806487851276465, 0.6930450983876637, 0.7055488386621753, 0.7181638738872305, 0.7308942474276287, 0.7437441897466541, 0.7567181310510782, 0.7698207150120411, 0.7830568136747741, 0.7964315436842329, 0.8099502839698922, 0.8236186950515332, 0.8374427401492454, 0.8514287083055712, 0.8655832397563087, 0.8799133538196812, 0.8944264796122241, 0.9091304899448472, 0.9240337388053882, 0.9391451028960623, 0.9544740277674430,

0.9700305791772410, 0.9858255004050610, 1.0018702763769824, 1.0181772056006679, 1.0347594810882446, 1.0516312816573354, 1.0688078752591987, 1.0863057362981008, 1.1041426792922295, 1.1223380117021657, 1.1409127093423133, 1.1598896185252787, 1.1792936900106508, 1.1991522509932744, 1.2194953228462240, 1.2403559942306717, 1.2617708616359862, 1.2837805526081674, 1.3064303511275646, 1.3297709501812092, 1.3538593640751062, 1.3787600432219229, 1.4045462481588744, 1.4313017591024759, 1.4591230250215927, 1.4881218960233811, 1.5184291411525910, 1.5501990407917607, 1.5836154758017889, 1.6189001435373598, 1.6563238653408074, 1.6962225050066093, 1.7390199717299040, 1.7852624904353240, 1.8356715369125438, 1.8912292378201130, 1.9533237077453984, 2.0240136237191630, 2.1065540881628193, 2.2065752165371353, 2.3352330400688177, 2.5205022171903644, 2.8856349124267626
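The values above appear to be the standard normal inverse CDF evaluated at the 256 probability midpoints (i + 0.5)/256 for i = 0, ..., 255; this construction is inferred from the numbers themselves, not stated in the thesis. Under that assumption, the table can be regenerated with a few lines:

```python
from statistics import NormalDist

# Regenerate the 256-entry N(0, 1) table, assuming the entries are the
# inverse CDF at the probability midpoints (i + 0.5)/256. This rule is
# inferred from the listed values, not taken from the thesis text.
norm = NormalDist()  # standard normal: mu = 0, sigma = 1
table = [norm.inv_cdf((i + 0.5) / 256) for i in range(256)]

# The first entry should then be close to -2.8856349124267626 and the
# table antisymmetric: table[i] == -table[255 - i].
```

Note that sampling uniformly from such a table truncates the distribution at roughly ±2.89, which is exactly the drawback discussed in the future-work section.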
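The 8-step binary search tree suggested in the future-work discussion could be sketched as below. This is a hypothetical illustration, not the thesis implementation: the amplitude range [-4, 4), the 256 equal-width output levels, and all names are assumptions made for the example. Since the decoder input is an 8-bit soft value, choosing among 256 output levels with their normal probability masses requires only eight comparisons per sample.

```python
import random
from statistics import NormalDist

LEVELS = 256           # one output level per 8-bit soft value
STEP = 8.0 / LEVELS    # bin width over the assumed range [-4, 4)
norm = NormalDist()

# CDF value at every bin edge; these are the comparison thresholds
# of the 8-level binary search tree.
thresholds = [norm.cdf(-4.0 + i * STEP) for i in range(LEVELS + 1)]

def draw_noise(u: float) -> float:
    """Map a uniform u in [0, 1) to a bin centre using 8 comparisons."""
    lo, hi = 0, LEVELS
    for _ in range(8):            # log2(256) = 8 steps
        mid = (lo + hi) // 2
        if u < thresholds[mid]:
            hi = mid
        else:
            lo = mid
    return -4.0 + (lo + 0.5) * STEP

# In the simulator, u would come from one of the studied xorshift-type
# generators rather than Python's random module.
random.seed(1)
samples = [draw_noise(random.random()) for _ in range(10000)]
```

Each level is drawn with essentially its exact normal probability mass (tail mass beyond ±4 is clamped to the outermost levels), so the quantized output follows a true normal distribution as closely as an 8-bit representation allows.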
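The nested-table idea for covering the tails could look roughly like the following sketch. It is an assumption-laden illustration: the tables are built from inverse-CDF midpoints (the construction the appendix values appear to follow), and a real DSP implementation would use fixed-point values and one of the studied PRNGs instead of Python's random module.

```python
import random
from statistics import NormalDist

M = 256
norm = NormalDist()

# Main table: inverse CDF at the 256 probability midpoints (assumed
# to match the construction of the appendix table).
main = [norm.inv_cdf((i + 0.5) / M) for i in range(M)]

# Tail tables: each refines the outermost 1/256 of probability mass
# into 256 sub-bins, extending coverage to roughly +/- 4.3.
lo_tail = [norm.inv_cdf((i + 0.5) / (M * M)) for i in range(M)]
hi_tail = [-z for z in lo_tail]   # mirror for the upper tail

def table_noise(rng=random):
    """Draw one approximately N(0, 1) sample using nested tables."""
    i = rng.getrandbits(8)        # random index into the main table
    if i == 0:                    # extreme index: redirect to a tail table
        return lo_tail[rng.getrandbits(8)]
    if i == M - 1:
        return hi_tail[rng.getrandbits(8)]
    return main[i]

random.seed(7)
samples = [table_noise() for _ in range(10000)]
```

The same redirection could be repeated inside the tail tables to cover even deeper tails, as suggested in the future-work discussion, at the cost of one extra random index per redirected sample.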

Bibliography

[1] MSC8157 Six-Core Digital Signal Processor. Cited on page 15.

[2] TMS320TCI6618 Doubling performance for wireless base stations. Cited on page 15.

[3] The University of Milan http://xoroshiro.di.unimi.it/. Cited on page 12.

[4] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: turbo-codes. In IEEE ICC ’93, Geneva, pages 1064–1070, 1993. Cited on pages 2 and 7.

[5] Gunnar Blom, Jan Enger, Gunnar Englund, Jan Grandell, and Lars Holst. Sannolikhetsteori och statistikteori med tillämpningar. Studentlitteratur, 2005. Cited on pages 6, 9, 12, and 13.

[6] George Edward Pelham Box and Mervin Edgar Muller. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 1958. Cited on page 12.

[7] Richard P. Brent. Note on Marsaglia’s xorshift random number generators. Journal of Statistical Software, 2004. Cited on page 8.

[8] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, Inc, 1991. Cited on page 4.

[9] Erik Dahlman, Stefan Parkvall, and Johan Sköld. 4G LTE/LTE-Advanced for Mobile Broadband. Academic Press, 2011. Cited on pages 1 and 4.

[10] Erico Guizzo. Closing in on the perfect code. IEEE Spectrum http://spectrum.ieee.org/computing/software/closing-in-on-the-perfect-code, 03 2004. Cited on pages 2 and 7.

[11] Song-Nam Hong, Dennis Hui, and Ivana Marić. Capacity-achieving rate-compatible polar codes. 8th North American School of Information Theory, 2015. Cited on page 1.


[12] N.A. Johansson, Y.-P.E. Wang, E. Eriksson, and M. Hessler. Radio access for ultra-reliable and low-latency 5G communications. IEEE International Conference on Communication Workshop (ICCW), pages 1184–1189, 2015. Cited on page 1.

[13] Erik G. Larsson. Signals, Information and Communications. LiU-press, 2014. Cited on pages 4, 6, 17, and 19.

[14] Pierre L’Ecuyer and Richard Simard. TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, 2007. Cited on pages 7 and 9.

[15] George Marsaglia. The Marsaglia random number cdrom, with the Diehard battery of tests of randomness. Florida State University http://stat.fsu.edu/pub/diehard/, 1995. Cited on page 7.

[16] George Marsaglia. Xorshift RNGs. Journal of Statistical Software, 2003. Cited on pages 9, 11, and 48.

[17] George Marsaglia and Thomas Bray. A convenient method for generating normal variables. Society for Industrial and Applied Mathematics, 1964. Cited on page 13.

[18] George Marsaglia and Wai Wan Tsang. The ziggurat method for generating random variables. Journal of Statistical Software, 2000. Cited on pages 13, 14, and 15.

[19] Makoto Matsumoto and Takuji Nishimura. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 1998. Cited on pages 8 and 9.

[20] Cleve Moler. Normal behavior. MathWorks https://se.mathworks.com/company/newsletters/articles/normal-behavior.html?requestedDomain=uk.mathworks.com, 2001. Cited on page 13.

[21] Cleve Moler. Random number generators, Mersenne Twister. MathWorks http://blogs.mathworks.com/cleve/2015/04/17/random-number-generator-mersenne-twister/, 04 2015. Cited on page 9.

[22] Technical Specification Group Radio Access Network. TS 36.212: Multiplexing and channel coding (Release 8). Technical report, 3GPP, 2009. Cited on pages 1, 2, 7, and 48.

[23] Nokia Networks. Looking ahead to 5G. http://networks.nokia.com/file/28771/5g-white-paper, 2016. Cited on page 1.

[24] John Von Neumann. Various techniques used in connection with random digits. NBS Applied Mathematics Series, 1951. Cited on page 4.

[25] François Panneton and Pierre L’Ecuyer. On the xorshift random number generators. ACM Transactions on Modeling and Computer Simulation, 2005. Cited on pages 7, 8, 9, 11, and 48.

[26] Mutsuo Saito and Makoto Matsumoto. XSadd. Hiroshima University http://www.math.sci.hiroshima-u.ac.jp/~m-mat@math.sci.hiroshima-u.ac.jp/MT/XSADD/, 04 2015. Cited on page 12.

[27] The MathWorks, Inc. Founders. MathWorks https://se.mathworks.com/company/aboutus/founders/clevemoler.html?s_tid=srchtitle&requestedDomain=se.mathworks.com, 2016. Cited on page 9.

[28] Sebastiano Vigna. Further scramblings of Marsaglia’s xorshift generators. arXiv Data Structures and Algorithms, 2016. Cited on page 12.

[29] Sebastiano Vigna. An experimental exploration of Marsaglia’s xorshift generators, scrambled. ACM Transactions on Mathematical Software, 2016. Cited on pages 7, 8, 9, 11, and 48.

[30] John M. Wozencraft and Irwin Mark Jacobs. Principles of communication engineering. Waveland Press, Inc, 1990. Cited on page 4.