
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

An efficient Hardware implementation of the Peak Cancellation Crest Factor Reduction Algorithm

MATTEO BERNINI

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

Master’s Thesis at KTH Information and Communication Technology
Supervisor: Shafqat Ullah
Examiner: Johnny Öberg

TRITA-ICT-EX-2016:187

Abstract

A significant part of the cost of a base station comes from the Power Amplifier driving the array of antennas. The cost can be split into Capital and Operational Expenditure, due to the high design and realization costs and to the low energy efficiency of the Power Amplifier, respectively. Both cost components are related to the Crest Factor of the input signal. To reduce both costs, it would be possible to lower the average power level of the transmitted signal; for a more efficient transmission, on the other hand, a more energized signal would allow the receiver to better distinguish the message from noise and interference. These opposing needs motivate the research and development of solutions that reduce the excursion of the signal without sacrificing its average power level. One of the algorithms addressing this problem is the Peak Cancellation Crest Factor Reduction. This work documents the design of a hardware implementation of this method, targeting a possible future ASIC for Ericsson AB. SystemVerilog is the Hardware Description Language used for both the design and the verification of the project, together with a MATLAB model used both to explore some design choices and to validate the design against the output of the simulation. The two main goals of the design have been an efficient use of the hardware, aiming at a smaller area footprint on the integrated circuit, and the adoption of some innovative design solutions in the control part of the design, for example the management of the cancelling pulse coefficients and the use of a time-division multiplexing strategy to further save area on the chip. In the contexts where both solutions can compete, the proposed one shows better results in terms of area and delay than the methods currently in use at Ericsson, and it also provides suggestions and ideas for further improvements.

Keywords: CFR, PC-CFR, PAPR Reduction, OFDM

Sammanfattning (Summary)

An efficient hardware implementation of the Peak Cancellation algorithm for crest factor reduction

An important component of the cost of a radio base station is the amplifier used to drive the antennas. The cost of the amplifier can be split into an initial cost, related to the development and manufacturing of the circuit, and a running cost, related to the energy efficiency of the circuit. Both costs are tied to a property of the amplifier's input signal, namely the ratio between the maximum power of the signal and its average power, the so-called crest factor. To reduce these costs it is possible to lower the average power of the signal, but a high average power improves the radio transmission, since it is easier for the receiver to distinguish a high-energy signal from noise and interference. These two opposing requirements motivate the research and development of solutions that reduce the maximum value of the signal without reducing its average power. One algorithm that can be used to reduce the crest factor of the signal is Peak Cancellation. This report presents the design and hardware implementation of Peak Cancellation, intended for use by Ericsson AB in future integrated circuits. The hardware description language SystemVerilog was used for both design and testing in the project. MATLAB was used to explore design alternatives, to model the algorithm and to compare its output with that of the hardware implementation in simulation. The two main goals of the design were to use the hardware efficiently, in order to achieve as small a circuit area as possible, and to use a number of innovative solutions for the control part of the design. Examples of such solutions are the handling of the coefficients of the pulses used to reduce the peaks in the signal, and the use of time-division multiplexing to further reduce the circuit area. In usage scenarios where both solutions can compete, the proposed solution shows better results in terms of circuit area and latency than the solutions currently used by Ericsson. Suggestions for further improvements of the implementation are also given.

Keywords: CFR, PC-CFR, PAPR Reduction, OFDM

List of Acronyms and Abbreviations

ACLR Adjacent Channel Leakage Ratio

AM Amplitude Modulation

ASIC Application Specific Integrated Circuit

ASM Algorithmic State Machine

BPSK Binary Phase Shift Keying

CAF Clipping and Filtering Technique

CapEx Capital Expenditure

CCDF Complementary Cumulative Distribution Function

CF Crest Factor

CORDIC Coordinate Rotation Digital Computer

CS Clip Stage

EVM Error Vector Magnitude

FDM Frequency Division Multiplexing

FIR Finite Impulse Response

FM Frequency Modulation

FPGA Field Programmable Gate Array

GSM Global System for Mobile communication

(H)PA (High) Power Amplifier

(I)DCT (Inverse) Discrete Cosine Transform

IFFT Inverse Fast Fourier Transform

I/Q In-phase / Quadrature signal

LTE Long Term Evolution

MSR Multi Standard Radio

NS Noise Shaping

OFDM Orthogonal Frequency Division Multiplexing

OOB Out Of Band

OpEx Operating Expenditure

PA(P)R Peak to Average (Power) Ratio

PC, PC-CFR Peak Cancellation Crest Factor Reduction

PCU Peak Cancelling Unit

PDF Probability Density Function

PF Peak Filtering

PM Phase Modulation

PM Peak Manager

PTS Partial Transmit Sequence

PW Peak Windowing

QPSK Quadrature Phase-Shift Keying

RMS Root Mean Square

RTL Register Transfer Level

SLM SeLective Mapping

SV SystemVerilog

TC Turbo Clipping

TDM Time Division Multiplexing

TI Tone Injection

TR Tone Reservation

WCDMA Wideband Code Division Multiple Access

Contents

1 Introduction
  1.1 Background and statement of the problem
  1.2 Purpose of the design project

2 Background and related work
  2.1 Background
    2.1.1 Orthogonal Frequency Division Multiplexing (OFDM)
    2.1.2 Definitions: CF, PAPR, EVM and ACLR
    2.1.3 Overview of the main CFR methods
  2.2 Related Work

3 The proposed implementation of the PC-CFR
  3.1 General description of the PC-CFR algorithm
  3.2 Structural description of the proposed implementation
    3.2.1 The Clip Stage
    3.2.2 The Peak Manager

4 Future work and suggested improvements
  4.1 Programmable or dynamic CS–PCU mapping
  4.2 Bypassable PC-CFR module
  4.3 Clip Stages with different delay memories and cancelling pulses length
  4.4 Truncation of cancelling pulses
  4.5 Variable length Peak Search Window
  4.6 Priority-based acceptance of peaks
  4.7 Generation of multiple cancelling pulses from the same time slot

5 Results and conclusions
  5.1 Comparative synthesis results
  5.2 Some input and model configuration exploration
    5.2.1 Observations

Bibliography

Appendices
  A The MATLAB golden model

Chapter 1

Introduction

If the cost of a typical transmitting radio base station is analyzed, it turns out that the Capital Expenditure (CapEx)¹ and the Operating Expenditure (OpEx)² relative to the radio cards alone cover roughly 50% of the total cost [1]. The radio cards house the Power Amplifier (PA), whose low efficiency is the main culprit for the OpEx part of the overall costs: only a small share of the power consumed by the radio cards becomes transmitted power. Similar considerations hold for the consumer electronics market: every mobile device relying on wireless communication suffers from the non-optimal efficiency of the PA, which has a substantial negative effect on battery life. In many low-cost applications, this issue alone might prevent the whole system from being considered convenient, or even feasible, to design.

The efficiency of the PA is a function of the characteristics of the input signal, in particular of its Peak to Average Power Ratio (PAPR, or PAR) or Crest Factor (CF), which are the ratios between the powers or the magnitudes, respectively, associated with the largest and the average values of the signal. Figure 1.1 shows a small segment of data in a typical scenario. The maximum values, that is the peaks (a more accurate definition of peaks is given in Section 3.1; for now the intuitive understanding is sufficient), are responsible for the high PAPR of a given signal. It is not surprising that the industry is striving to reduce this phenomenon, and thus the costs and inefficiencies, by investigating several alternatives.
Basically, the two most relevant ways to deal with the problem are: 1) introducing some changes in the signal to be transmitted (without, of course, compromising its information content) in order to prevent the occurrence of high peaks, at the cost of an increased complexity of the transmitter and/or of sacrificing some data rate for the transmission of the side information needed on the receiver side to reconstruct the information, or 2) digitally processing the signal as it is (either in the time or in the frequency domain) in order to limit the occurrence and magnitude of the unavoidable peaks, at the cost of some introduced distortion.

This thesis work focuses on the design, modeling and verification of an algorithm belonging to the digital processing category, namely the Peak Cancellation Crest Factor Reduction (PC-CFR), targeted to an Application Specific Integrated Circuit (ASIC). The thesis project was performed at Ericsson AB in Kista, Stockholm.

Figure 1.1: A segment of a typical signal amplitude showing high variability and, as a consequence, a high ratio between the maximum and average values.

¹ Resources invested by a company to buy or upgrade fixed, physical, non-consumable assets.
² Day-to-day costs of operation.

1.1 Background and statement of the problem

Widely used multi-carrier signals such as Orthogonal Frequency Division Multiplexing (OFDM) show a higher PAPR than single-carrier systems. Also, several radio access technologies such as Long Term Evolution (LTE), Wideband Code Division Multiple Access (WCDMA), etc. are used in Multi Standard Radio (MSR) transmitters situated in base stations. These signals do not exhibit a constant envelope, but rather a fluctuating envelope with a high CF (see Figure 1.2, [2]). The main reason is that the sum of multiple sub-carriers creates a compound signal whose real and imaginary parts approach a Gaussian Probability Density Function (PDF), due to the Central Limit Theorem, whereas the amplitude approaches a Rayleigh PDF. The Global System for Mobile communication (GSM), on the other hand, uses a constant-envelope Gaussian modulation.

Figure 1.2: Comparative view of PAPR for different transmission protocols (source: [2]).

The input-output static characteristics of a PA show a linear region bounded by a non-linear part (see Figure 1.3). The part of the input signal that falls outside the linear region entails significant Out Of Band (OOB) emissions, caused by the inter-modulation products on the adjacent channels. Therefore the linear part of the PA’s characteristics needs to be wide enough to contain the whole swing of the input signal that has to be amplified and fed to the antenna(s). In order for the PAs to accommodate signals with such a high voltage swing, either they have to be dimensioned for the maximum peak value (thus increasing the CapEx), or they have to operate with more back-off³ from the most convenient operating point, which translates to a less efficient use of energy (thus increasing the OpEx). In other words, PAs with larger linear ranges are more expensive and make worse use of electric power than those with a smaller linear input range.

Figure 1.3: Power Amplifier characteristics before PAR reduction (source: [3]).

What is desirable, instead, is to deal with signals with limited PAPR (or CF), because then it is possible to increase their average power level without the risk of falling into the saturation region of the PA. The increased transmitting power guarantees a higher signal strength with respect to the unavoidable noise and thus an overall more efficient transmission of information. Figure 1.4 shows the input-output characteristics of a PA after a 6 dB reduction of the PAPR. Notice that it is now possible to place the operating point of the signal at a higher power level thanks to the reduction of the PAPR.

Figure 1.4: Power Amplifier characteristics after PAR reduction. Note the increased average output voltage (thus power) available thanks to the reduction of the PAR (source: [3]).
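The difference between a constant-envelope single carrier and a sum of sub-carriers can be illustrated numerically. The following is a minimal sketch and not part of the thesis design: it assumes unit-magnitude symbols on N sub-carriers, builds the time-domain signal with a naive inverse DFT, and uses hypothetical helper names (`multicarrier`, `papr`). A single carrier has a PAPR of 1 (0 dB), whereas in the worst case, when all N sub-carriers add in phase, the PAPR equals N.

```python
import cmath
import math
import random


def multicarrier(symbols):
    """Sum of len(symbols) sub-carriers (naive inverse DFT, no scaling)."""
    n_sc = len(symbols)
    return [
        sum(s * cmath.exp(2j * math.pi * k * n / n_sc) for k, s in enumerate(symbols))
        for n in range(n_sc)
    ]


def papr(signal):
    """Peak-to-average power ratio (linear) of a complex sequence."""
    powers = [abs(v) ** 2 for v in signal]
    return max(powers) / (sum(powers) / len(powers))


def papr_db(signal):
    """Same ratio expressed in dB."""
    return 10 * math.log10(papr(signal))


N = 64
# Single carrier: only sub-carrier 1 is active, so the envelope is constant.
single = multicarrier([0, 1] + [0] * (N - 2))
# Worst case: all sub-carriers carry the same symbol and add in phase at n = 0.
aligned = multicarrier([1] * N)
# Random QPSK symbols (unit magnitude) on every sub-carrier.
rng = random.Random(0)
qpsk = [cmath.exp(1j * (math.pi / 4 + rng.randrange(4) * math.pi / 2)) for _ in range(N)]
random_ofdm = multicarrier(qpsk)
```

For unit-magnitude symbols the PAPR of `random_ofdm` always lies between these two extremes; calling `papr_db(random_ofdm)` gives the value for this particular block.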

³ The back-off is the deliberate reduction of the average input power to the PA.

1.2 Purpose of the design project

The purpose of the project described in this report is the design, verification and performance test of an innovative implementation of the Peak Cancellation (PC) algorithm, which will possibly be implemented in one of Ericsson’s ASICs in the future. The design is as generic and configurable as possible, so that the user can compare different parameter options against existing solutions already implemented at Ericsson. The programmability of the PC-CFR module is another desirable characteristic of the project, because some actions might be taken in response to changes in the input signal properties, for example a change of the length of the search window (related to the granularity of the detection of the peaks).

One of the most attractive aspects of the PC algorithm, as opposed to other solutions, is its low hardware complexity, which translates to a smaller area occupancy on the ASIC and to a lower power consumption of the module. The drawback of the PC is that each peak must be treated separately, by dedicating hardware resources to it for the entire duration of the corresponding cancelling pulse. When the density of the detected peaks in the input signal is such that the available hardware resources are too few to cancel them all, some of them pass untouched and eventually reach the PA.

The PC-CFR architecture proposed in this thesis report is new and possibly innovative in some aspects, compared to the documented existing implementations [1][3][4][5]. The aspect of the design that required most of the effort was the optimization of the hardware resources and, at the same time, the minimization of the probability of a peak leak. To fulfill these requirements, most of the hardware resources are not used exclusively but are shared more efficiently in a Time Division Multiplexing (TDM) configuration, thanks to the availability of a second, faster clock and to several design expedients.

The Register Transfer Level (RTL) design and the testbench are written in the SystemVerilog (SV) language and simulated and synthesized with the software tools made available by Ericsson. A MATLAB golden model has been written so as to match both the expected behaviour of the PC-CFR algorithm and, as accurately as possible, all the processing of the data taking place in the target hardware implementation. The output of this model was compared against that of the RTL version when driven with the same input data: the target RTL implementation is considered compliant with the model when the two outputs match sample by sample.

Chapter 2

Background and related work

2.1 Background

2.1.1 Orthogonal Frequency Division Multiplexing (OFDM)

Communication systems use a physical channel to provide a reliable means to transfer information through a technique called modulation: by superimposing some coded version of the information over one or more of the characteristics of a properly chosen sinusoidal signal, called carrier, it is possible to overcome the physical limits of the communication channel in terms of available bandwidth and maximum power. Depending on whether the modulated characteristic of the carrier is frequency, phase or amplitude (or a combination of them), there are several types of modulation (such as Amplitude Modulation (AM), Phase Modulation (PM), Frequency Modulation (FM), Binary Phase Shift Keying (BPSK), Quadrature Phase-Shift Keying (QPSK), etc.), each with different advantages and drawbacks.

If more than one line of communication needs to be established over the same physical channel, then some means to share it must be employed, such as multiplexing (these independent paths of communication can be thought of as logical channels, or as pairs of users). In Time Division Multiplexing (TDM) each user occupies the entire bandwidth of the channel for a given time frame in a round-robin fashion, with some silence time between two successive frames, whereas in Frequency Division Multiplexing (FDM) the whole channel bandwidth is divided into segments separated by guard intervals, and each user has at its disposal a specific bandwidth arranged around a carrier for the entire duration of the communication. The carriers can be in any relation to one another, the only constraint being that the frequency bands of the channels do not overlap.

In OFDM there is a specific relationship among the carrier frequencies: they are all multiples of a single frequency. This simple expedient relaxes the requirement that the bands do not overlap, so that they can be packed together to make better use of the channel resource. The fact that all carriers are multiples of a common frequency entails their orthogonality¹, and this makes the recovery of the transmitted information on the receiver side much easier and, most of all, possible even if the signals overlap in frequency. OFDM (which can be considered a special case of FDM) is a so-called multi-carrier modulation technique because it makes use of several carriers at the same time, each capable of conveying information modulated according to a different mapping (BPSK, QPSK, etc.).

Figure 2.1: Block diagram of the generation of an OFDM signal. The sinusoidal carriers are orthogonal (source: [6]).

The communication quality through channels affected by frequency selective fading² benefits from OFDM, in the sense that the fading can be compensated more easily at the receiver side: with OFDM, instead of compensating for the fading of the channel as a continuous function of frequency over a large range (which is a more involved operation), the receiver can divide the frequency range into small segments, each corresponding to a sub-carrier, and approximate the fading as a constant in each of these segments. The advantage is that constant fading can be fought more easily by using error correction and other techniques. A block-level diagram is shown in Figure 2.1 (see also [6]).

¹ Two signals are said to be orthogonal if their scalar product is zero.
² Frequency selective fading is a radio propagation anomaly due to the partial cancellation of a signal by itself, because the signal arrives over at least two different paths, at least one of which is lengthening or shortening.
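The orthogonality of carriers that are multiples of a common frequency can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses discrete-time complex exponentials over one symbol of N samples, and the function names (`carrier`, `inner_product`) are hypothetical. The scalar product of two different carriers is zero, while that of a carrier with itself is N, which is what allows a receiver to isolate each sub-carrier even though they overlap in time and frequency.

```python
import cmath
import math


def carrier(k, n_samples):
    """Complex sub-carrier at k times the base frequency, one symbol long."""
    return [cmath.exp(2j * math.pi * k * n / n_samples) for n in range(n_samples)]


def inner_product(a, b):
    """Scalar product of two complex sequences (second argument conjugated)."""
    return sum(x * y.conjugate() for x, y in zip(a, b))


N = 16
c3, c5 = carrier(3, N), carrier(5, N)

# Two different multiples of the base frequency are orthogonal ...
cross = inner_product(c3, c5)
# ... while a carrier correlated with itself gives its energy, N.
energy = inner_product(c3, c3)

# A receiver can therefore recover the symbol sent on sub-carrier 3 even
# though sub-carrier 5 occupies the same time interval.
tx = [(1 + 1j) * a + (1 - 1j) * b for a, b in zip(c3, c5)]
recovered = inner_product(tx, c3) / N
```

Here `recovered` is the QPSK symbol placed on sub-carrier 3; the contribution of sub-carrier 5 vanishes in the scalar product.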

2.1.2 Definitions: CF, PAPR, EVM and ACLR

As already stated, the problem with non-constant envelope signals is the excessive variability of their amplitude, which is harmful for the design and power efficiency of the PA. This phenomenon is closely related to the presence of groups of samples whose magnitude exceeds a certain desired value called threshold. Some of the techniques proposed to mitigate this behavior are briefly listed in the following, but first more quantitative definitions of Crest Factor and Peak to Average (Power) Ratio are presented.

We define the Crest Factor as the ratio between the maximum value of the magnitude and the average (root mean square) value of a signal, observed in a certain temporal window:

$$\mathrm{CF} = \frac{\|s(n)\|_{\max}}{s_{\mathrm{rms}}}$$

We also define the more commonly used Peak to Average (Power) Ratio, again for a given interval of time, or for a certain number of samples in discrete-time contexts:

$$\mathrm{PAPR} = \frac{\|s(n)\|_{\max}^{2}}{s_{\mathrm{rms}}^{2}}, \quad \text{or} \quad \mathrm{PAPR}_{\mathrm{dB}} = 10 \log_{10} \frac{\|s(n)\|_{\max}^{2}}{s_{\mathrm{rms}}^{2}}$$

Note that PAPR = CF². The desired effect of the various CF reduction techniques is to reduce the PAR of the signal without introducing too much distortion. Some of the techniques do not introduce any distortion at all, at the price of a greater complexity and/or a reduction of the data rate, whereas others inject some unavoidable distortion both in-band (in the bandwidth occupied by the signal being transmitted) and out of band (in the adjacent bands). Both side effects are of course undesirable and, in order to quantify them, two parameters exist: Error Vector Magnitude (EVM) and Adjacent Channel Leakage Ratio (ACLR).

EVM is a measurement that quantifies the global displacement of the received (output) signal compared to the expected ideal one, due to any disturbances (such as noise) and, as in our case, to the CFR intervention too. We define it as (see Figure 2.2):

$$\mathrm{EVM}(\mathrm{dB}) = 10 \log_{10} \frac{P_{\mathrm{error}}}{P_{\mathrm{ref}}}, \quad \text{or} \quad \mathrm{EVM}(\%) = \sqrt{\frac{P_{\mathrm{error}}}{P_{\mathrm{ref}}}} \cdot 100$$

where P_error is the sum of all the error vector powers and P_ref is the sum of all the reference (expected) signal powers. The error vector is the vector in the I/Q plane that connects the received symbol with its ideal, expected position in the plane (the position corresponding to the exact transmitted symbol). For each received symbol, the corresponding power is computed and averaged, then divided by a properly chosen value representative of the modulation scheme. The result is a cumulative measure of how close the whole transmitter-receiver chain is to the ideal from the accuracy point of view. In an ideal transmission system, each received symbol would fall exactly on one of the possible points in the plane, corresponding to the coding of the sent symbol. The less ideal the communication system is, the more pronounced is the scattering of the received symbols around the constellation of the expected symbols. In the present case, the in-band distortion introduced by the CFR algorithm has a direct effect on the EVM which, as a consequence, is considered a measure of the performance of the method.

Figure 2.2: I/Q plane with representations of the reference and the measured (or received, in a communication channel) vectors. The powers of the error and the reference vectors are used to compute the EVM (source: [7]).

Adjacent Channel Leakage Ratio (ACLR) is the measurement concerning the out-of-band distortion. It is defined as the ratio of the power leaked into the adjacent channels to the power in the carrier channels (see also Figure 2.3 and [8]):

$$\mathrm{ACLR} = \frac{\text{Adjacent Channel Power}}{\text{Main Channel Power}}$$

Figure 2.3: The components at the base of the definition of the ACLR (source: [8]).

The most important reason behind the desire to keep the ACLR to a low level is that otherwise unexpected and unwanted power will pour outside of the frequency band of interest. If the adjacent frequency intervals are used as the main channels of other communication systems, it means that we are injecting interference into them. The second reason driving the effort in keeping the ACLR as low as possible is simply the fact that high ACLR translates to some energy (supposed to be in the main channel) wasted over adjacent channels therefore reducing the efficiency of transmission.
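The definitions of CF, PAPR and EVM given in this section translate directly into a few lines of code. The following is a minimal sketch rather than a reference implementation: the function names are hypothetical, the signal is a list of complex samples observed over one window, and P_error and P_ref are taken as plain sums of squared magnitudes. ACLR is left out because it requires a spectral estimate of the signal.

```python
import math


def rms(signal):
    """Root mean square magnitude of a complex sequence."""
    return math.sqrt(sum(abs(v) ** 2 for v in signal) / len(signal))


def crest_factor(signal):
    """CF: maximum magnitude divided by the RMS value."""
    return max(abs(v) for v in signal) / rms(signal)


def papr_db(signal):
    """PAPR in dB; PAPR = CF^2, hence 20*log10(CF)."""
    return 20 * math.log10(crest_factor(signal))


def evm_percent(received, reference):
    """EVM(%) = sqrt(P_error / P_ref) * 100."""
    p_error = sum(abs(r - s) ** 2 for r, s in zip(received, reference))
    p_ref = sum(abs(s) ** 2 for s in reference)
    return math.sqrt(p_error / p_ref) * 100


def evm_db(received, reference):
    """EVM in dB = 10*log10(P_error / P_ref)."""
    return 20 * math.log10(evm_percent(received, reference) / 100)


# Four samples with one dominant peak: rms = sqrt((1 + 1 + 1 + 9) / 4) = sqrt(3).
x = [1 + 0j, -1 + 0j, 1j, 3 + 0j]
cf = crest_factor(x)          # 3 / sqrt(3) = sqrt(3)
papr = papr_db(x)             # 10*log10(3), about 4.77 dB

# QPSK reference symbols and a received copy with a small error on each one.
ref = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
rx = [s + 0.1 for s in ref]
evm = evm_percent(rx, ref)    # sqrt(4 * 0.01 / 8) * 100, about 7.07 %
```

Computing the PAPR in dB as 20·log10(CF) is just the relation PAPR = CF² written in logarithmic form.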

2.1.3 Overview of the main CFR methods

Several techniques have been proposed to mitigate the PAPR problem of OFDM signals. These techniques can be roughly, and only partially, divided into the following categories: coding technique, probabilistic (scrambling) technique, adaptive pre-distortion technique and clipping technique. This last category will be explored further given its importance to this thesis work.

Coding technique

The coding technique pursues PAPR reduction via an appropriate choice of the codes of the modulation to be transmitted for each sub-carrier. This method causes no distortion, either in-band or OOB, but it suffers from non-optimal bandwidth usage because a smaller number of data words is mapped to a greater number of code words. The complexity of the algorithm is also non-negligible, because both the computational effort needed to choose the most appropriate symbol to send and the area required to store the look-up tables grow rapidly with the number of sub-carriers, up to the point of becoming computationally intractable for common useful signals.

Probabilistic (scrambling) technique

This technique entails the scrambling (meaning, in this context, the manipulation of a signal with a well-known sequence to alter its properties without introducing distortion) of the OFDM input signal with several scrambling sequences, one block of samples at a time, and then choosing, among the resulting sequences, the one exhibiting the lowest PAPR. This approach cannot guarantee a desired PAPR level (although it provides the minimum among the candidate sequences), reduces the bandwidth utilization because of the additional information to be sent to the receiver, and its complexity increases rapidly with the number of sub-carriers. This category includes the SeLective Mapping (SLM), Partial Transmit Sequence (PTS), Tone Injection (TI) and Tone Reservation (TR) algorithms.

As an example, we might very briefly consider SeLective Mapping (see Figure 2.4). This technique requires the OFDM signal to be independently multiplied by U phase sequences

$$P_v^{u} = e^{j\varphi_v^{u}}, \quad u = 1, 2, \ldots, U.$$

The U resulting sequences are passed through U IFFT (Inverse Fast Fourier Transform) blocks and the output sequences x^u are compared in order to determine the one yielding the lowest PAPR. The side information about the selected sequence needs to be sent over the channel for the receiver to be able to reconstruct the original OFDM message. Therefore, the SLM algorithm requires U IFFT blocks, the transmission of the side information, and a block that selects the version of the OFDM signal with the smallest PAPR through a proper measurement and comparison.

Figure 2.4: Block diagram of the selective mapping technique for PAPR reduction (source: [9]).
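The selection step of SLM can be sketched in a few lines. This is an illustration under stated assumptions, not the scheme of [9]: the block holds random QPSK symbols, the phase angles are drawn from {0, π}, a naive inverse DFT stands in for each IFFT block, and all names (`idft`, `papr`, `selective_mapping`) are hypothetical. The first phase sequence is all zeros, so the unmodified block is always among the candidates.

```python
import cmath
import math
import random


def idft(symbols):
    """Naive inverse DFT standing in for one IFFT block."""
    n = len(symbols)
    return [
        sum(s * cmath.exp(2j * math.pi * k * t / n) for k, s in enumerate(symbols)) / n
        for t in range(n)
    ]


def papr(signal):
    """Linear peak-to-average power ratio."""
    powers = [abs(v) ** 2 for v in signal]
    return max(powers) / (sum(powers) / len(powers))


def selective_mapping(symbols, phase_sequences):
    """Return (index, time-domain signal) of the candidate with the lowest PAPR.

    phase_sequences[u][v] is the angle phi_v^u applied to sub-carrier v.
    The index u is the side information the receiver needs.
    """
    candidates = []
    for phases in phase_sequences:
        rotated = [s * cmath.exp(1j * phi) for s, phi in zip(symbols, phases)]
        candidates.append(idft(rotated))
    best = min(range(len(candidates)), key=lambda u: papr(candidates[u]))
    return best, candidates[best]


rng = random.Random(1)
N, U = 32, 4
# Hypothetical input block: random QPSK symbols on N sub-carriers.
data = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(N)]
# First sequence leaves the block unchanged; the others use phases in {0, pi}.
sequences = [[0.0] * N] + [
    [rng.choice((0.0, math.pi)) for _ in range(N)] for _ in range(U - 1)
]
index, best_signal = selective_mapping(data, sequences)
```

The returned `index` is the side information; the selected signal can never have a higher PAPR than the unscrambled block, but no particular PAPR level is guaranteed.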

Adaptive pre-distortion

The idea behind the adaptive pre-distortion is to distort the signal according to a non-linear function in order to compensate for the subsequent, well-known, non-linear characteristics of the PA. Some solutions are capable of dealing with time-variable characteristics of the PA by dynamically and efficiently changing the input constellation.

Clipping technique

This technique has the advantage of being the simplest to implement, but it incurs in-band distortion, out-of-band interference, and the disruption of the orthogonality of the sub-carriers. The method requires some sort of digital processing in the time and/or frequency domain. Among others, this category includes: Clipping and Filtering Technique (CAF), block-scaling technique, Peak Windowing technique (PW), Peak Cancellation technique (PC), and Fourier projection technique.

In order to introduce the scope of this work, a brief description of some of the algorithms belonging to this category follows. The algorithms have been chosen because of their conceptual and practical affinities with the approach proposed in this work. For all the following algorithms (Peak Filtering, Peak Cancellation and Peak Windowing), the concept of threshold is of utmost importance. The threshold is the desired maximum value for the magnitude of the input signal. It can be either hardwired inside the algorithm or programmed during its operating life. In any case, by setting a certain value for the threshold, we also implicitly program a desired PAPR, because the magnitude of the signal is monotonically related to the power. The three algorithms described differ in the way they reduce the maximum magnitude of the signal (and thus the PAPR) to the desired level, but they all process the signal digitally and thus introduce some distortion, which they try to minimize.
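The link between the threshold and the programmed PAPR follows from the relation PAPR = CF². If the peaks are brought down to the threshold and the RMS value of the signal is assumed to stay roughly unchanged (an approximation, since any peak reduction removes some power), the threshold for a target PAPR is s_rms · 10^(PAPR_dB/20). A small sketch, with hypothetical function names:

```python
import math


def threshold_for_papr(s_rms, target_papr_db):
    """Magnitude threshold giving the target PAPR for a signal of RMS value s_rms.

    Assumes the processing leaves the RMS value essentially unchanged.
    """
    return s_rms * 10 ** (target_papr_db / 20)


def papr_db_for_threshold(s_rms, threshold):
    """Inverse relation: PAPR (dB) obtained when the peak magnitude equals threshold."""
    return 20 * math.log10(threshold / s_rms)


# Example: a signal normalized to unit RMS and a 6 dB target PAPR.
thr = threshold_for_papr(1.0, 6.0)   # about 1.995, i.e. roughly twice the RMS value
```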

Peak Filtering (PF)

The Peak Filtering algorithm, sometimes referred to as Noise Shaping (NS), consists of extracting the part of the input signal whose magnitude exceeds the threshold, called the clip error sequence, then filtering it and finally subtracting it from a properly delayed version of the original signal. The purpose of the delay is to compensate for all the latencies generated during the detection and extraction of the clip error and during the filtering. The clip error generation consists first of the generation of a clipped version of the signal, B(n), according to the formula (note that the clipped signal retains its complex nature, see also Figure 2.5):

$$B(n) = \begin{cases} x(n) & \text{if } \|x(n)\| \le \text{threshold} \\ \dfrac{x(n) \cdot \text{threshold}}{\|x(n)\|} & \text{otherwise} \end{cases}$$

and second, of the subtraction of the signal so generated from the original one:

$$e(n) = x(n) - B(n)$$

where x(n) is the original signal and e(n) is the clip error (see Figure 2.6). The clip error signal e(n) is then filtered by a filter whose coefficients are computed off-line and stored in a memory. The filter design is tailored to the specific type of signal the algorithm will work with (i.e. number and bandwidth of the carriers). After each iteration of the algorithm, it is possible that some peaks are created by the filtering operation itself (the so-called peak regrowth phenomenon), so successive applications of the algorithm might be necessary; this is accomplished by cascading several stages of PF.

Figure 2.5: Reduction of a complex sample to a version with the same phase and magnitude equal to a set threshold. (Axes: Re, Im.)

Figure 2.6: The generation of the clip error from the original signal. (Panels: input and output amplitude versus samples, with the threshold marked.)

Another reason justifying the cascading of several PF stages is the fact that a discrete-time signal does not necessarily exhibit the maxima of the true analog signal that it samples and that will reach the Power Amplifier [9]. It is indeed possible for two successive elements of the discrete-time signal to both have a lower amplitude than the peak that the analog signal reaches between them, because of the very nature of the discrete-time representation of a continuous-time signal. In order to expose these hidden peaks, fractional delay filters are often interposed between successive stages of the PF. The effect of these filters is equivalent to a conversion from digital to analog followed by a slightly time-shifted sampling process at the same sample rate as the original.
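The clip error generation and one PF stage can be sketched as follows. This is a simplified illustration, not the filter used in any real PF design: the FIR coefficients are placeholders for the off-line designed filter, the delay is taken as the group delay of a symmetric FIR, and all function names are hypothetical.

```python
def clip(sample, threshold):
    """B(n): keep the phase, limit the magnitude to the threshold."""
    magnitude = abs(sample)
    if magnitude <= threshold:
        return sample
    return sample * threshold / magnitude


def clip_error(signal, threshold):
    """e(n) = x(n) - B(n); zero wherever the sample is below the threshold."""
    return [x - clip(x, threshold) for x in signal]


def fir_filter(signal, coeffs):
    """Causal FIR filter, standing in for the off-line designed PF filter."""
    out = []
    for n in range(len(signal)):
        acc = 0j
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * signal[n - k]
        out.append(acc)
    return out


def peak_filtering_stage(signal, threshold, coeffs):
    """One PF stage: subtract the filtered clip error from the delayed input."""
    delay = (len(coeffs) - 1) // 2          # group delay of a symmetric FIR
    shaped = fir_filter(clip_error(signal, threshold), coeffs)
    delayed = [0j] * delay + list(signal[: len(signal) - delay])
    return [d - s for d, s in zip(delayed, shaped)]


x = [0.5 + 0.5j, 3 + 4j, -0.2j, -6 + 0j]
e = clip_error(x, 2.5)
y = peak_filtering_stage(x, 2.5, [1.0])     # trivial filter: plain clipping
```

With the trivial one-tap filter the stage reduces to hard clipping; a real band-limiting filter spreads each clip error over neighbouring samples, which is where the peak regrowth comes from.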

Peak Cancellation (PC)

Contrary to the PF, the Peak Cancellation algorithm (see Figure 2.7) does not filter the clip error sequence, but explicitly isolates a single input sample among those identified within a certain Peak Search Window interval (a more formal definition will be given when the algorithm is described in more depth). Each time the algorithm detects these elements, called peaks, it cancels them individually by subtracting a properly shaped cancelling pulse from the signal, one for each peak. The major advantage of the PC is its reduced complexity compared to the PF, because no actual filtering of a clip error is performed.

Figure 2.7: A very simplified top-level architecture of the Peak Cancellation algorithm.

In Figure 2.7, the Peak Extractor is the block that detects the samples whose magnitude is greater than the threshold, and it is basically the same in PF and in PW, whereas the Peak Detector, present only in the PC algorithm, isolates the maximum of these samples, which as said is defined as the peak. The PC algorithm reduces the PAPR by cancelling the detected peaks with cancelling pulses that are generated only when peaks are detected. The stored pulse is, similarly to the PF filter, a combined impulse response of all the input carrier filters, modulated to the correct frequency within the multi-carrier frequency band. Such a cancelling pulse can be generated in advance (off-line) and depends only on the carrier configuration of the input signal. For each peak, a pulse with the correct amplitude and phase is generated and subtracted. Some peak regrowth can occur as a consequence of the subtraction of the cancelling pulses from the input signal, therefore the algorithm has to be run several times. For example, in Figure 2.8 it can be seen that, because of the application of the cancelling pulse (in red), the two minima surrounding the peak add in phase with the pulse itself, thus generating two more peaks.
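The behaviour described above can be sketched in software. This is a simplified model and not the hardware architecture of this thesis: each contiguous run of samples above the threshold is treated as one search window (an assumption, since the formal definition of the Peak Search Window is given later), and the cancelling pulse is a hypothetical Hann-shaped pulse with unit centre tap rather than a pulse derived from the carrier configuration.

```python
import math


def find_peaks(signal, threshold):
    """Index of the largest sample in each run of samples above the threshold."""
    peaks, run = [], []
    for n, x in enumerate(signal):
        if abs(x) > threshold:
            run.append(n)
        elif run:
            peaks.append(max(run, key=lambda i: abs(signal[i])))
            run = []
    if run:
        peaks.append(max(run, key=lambda i: abs(signal[i])))
    return peaks


def cancel_peaks(signal, threshold, pulse):
    """Subtract one scaled copy of `pulse` per detected peak.

    `pulse` has odd length and its centre tap equals 1, so the peak sample
    itself is brought onto the threshold with its phase preserved.
    """
    half = len(pulse) // 2
    out = list(signal)
    for p in find_peaks(signal, threshold):
        x = signal[p]
        scale = x * (1 - threshold / abs(x))   # complex excess over the threshold
        for k, c in enumerate(pulse):
            n = p - half + k
            if 0 <= n < len(out):
                out[n] -= scale * c
    return out


# Hypothetical cancelling pulse: a short Hann-shaped pulse with unit centre tap.
L = 9
pulse = [0.5 * (1 - math.cos(2 * math.pi * (k + 1) / (L + 1))) for k in range(L)]

x = [0.3 + 0.1j] * 20
x[10] = 3 + 4j          # isolated peak of magnitude 5
x[9] = 2 + 2j           # neighbour also above the threshold, but not the maximum
y = cancel_peaks(x, 2.0, pulse)
```

Only one pulse is generated for the run formed by samples 9 and 10, centred on the larger sample. The neighbouring samples are modified by the tails of the pulse, which is the mechanism behind the peak regrowth mentioned above.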

Figure 2.8: The effect of the cancelling pulse on the adjacent samples of the targeted peak. Note the regrowth of the peaks as a consequence.

Peak Windowing (PW)

Figure 2.9: Top-level architecture of a Peak Windowing algorithm (blocks: Peak Extractor, Window generator, Delay).

The peak windowing algorithm (see Figure 2.9) is based on multiplying the signal by an attenuating window W(k) rather than adding a correction to the signal. When a peak is detected in the input signal, a set of coefficients (a window, see Figure 2.10) is either generated at run-time or read from a memory where it is stored after being pre-computed off-line. Before the window is applied to the signal, the coefficients are scaled by a real number C, chosen in such a way that the peaks are attenuated to the desired level (the threshold). The signal around the maximum peak sample $n_p$ is multiplied by the attenuating window according to:

\[ y(n) = x(n) \cdot \bigl(1 - C \cdot W(n - n_p + K/2)\bigr) \]

where $K$ is the number of coefficients of the window. The input signal is delayed to compensate for the delay of the peak search part of the algorithm and to make the peak sample correspond to the maximum of the window. The windowing operation corresponds to a subtraction, from the original signal, of a windowed part of itself, whereas in the frequency domain it corresponds to the convolution of the signal with the Fourier transform of the window.

Figure 2.10: Window to be multiplied with the signal in order to reduce the magnitude of the peaks.

Among the advantages of the algorithm is the fact that, if the window amplitude changes smoothly, not much OOB emission is expected to appear. On the other hand, the lack of knowledge about the exact frequency characteristics of the attenuating window (because it is tailored to the particular input segment around the peak) makes it harder to guarantee a required or specified OOB performance. It would be desirable to minimize both the EVM and the OOB emission, but a trade-off must be made on the length of the window because, as will be clarified later, the longer the window, the worse the in-band distortion (and thus the EVM) but the better the effect on the adjacent channel (and thus the OOB emission), and vice versa. Furthermore, if closely spaced peaks are detected, the algorithm tends to overcompensate, and this again has a negative effect on the EVM. Figure 2.11 shows the effect of the windowing on a segment of the input signal. Processing the signal in this way when successive windows overlap has the unfortunate effect of reducing the overall average power instead of the PAPR. This can be partially mitigated by adding some complexity to the algorithm, such as coefficients that take into account the presence of earlier windows, or the search for and detection of closely spaced peaks followed by the generation of a single window. The best way to reduce the risk of an excessive attenuation is to cascade several PW stages, each attenuating the peaks by a smaller amount. This of course introduces a longer delay as well. The PW is the least complex of the presented algorithms but also the one with the worst (and least predictable) performance in terms of in-band and out-of-band emissions.


Figure 2.11: Effect of the application of the window on a segment of the signal containing peaks.
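The windowing formula of the Peak Windowing subsection can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the window is a periodic Hann window with an even number of coefficients (so that its maximum, equal to one, falls at index K/2), and the scale factor C is derived so that the peak sample lands exactly on the threshold. Neither choice is prescribed by the text, and the names `hann_window` and `apply_peak_window` are hypothetical.

```python
import math

def hann_window(K):
    """Periodic Hann window with K coefficients; for even K its maximum (1.0) is at K/2."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * j / K)) for j in range(K)]

def apply_peak_window(x, n_p, threshold, window):
    """Compute y(n) = x(n) * (1 - C * W(n - n_p + K/2)) around the peak at n_p."""
    K = len(window)
    # Assumption: C is chosen so that |y(n_p)| equals the threshold.
    C = (1.0 - threshold / abs(x[n_p])) / window[K // 2]
    y = list(x)
    for j, w in enumerate(window):
        n = n_p + j - K // 2      # so that j = n - n_p + K/2
        if 0 <= n < len(y):
            y[n] = x[n] * (1.0 - C * w)
    return y

# Hypothetical input segment; x[4] is the peak with magnitude 2.0.
x = [0.2 + 0.1j, -0.3 + 0.4j, 0.5 - 0.2j, 0.9 + 0.3j, 1.2 + 1.6j,
     0.8 + 0.1j, -0.4 - 0.3j, 0.1 + 0.2j, 0.3 - 0.3j, -0.2 + 0.1j]
y = apply_peak_window(x, n_p=4, threshold=1.0, window=hann_window(8))
print(abs(y[4]))  # magnitude of the peak sample after windowing
```

Because the correction is a real gain between zero and one, every sample inside the window keeps its phase and is attenuated, which also illustrates why overlapping windows lower the average power.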

2.2 Related Work

In Xilinx application note 1033 (XAPP1033 [1]), the company proposes a PC-CFR algorithm, together with an implementation for their Virtex-4 and Virtex-5 families of FPGAs, based on a simple architecture featuring a peak detector and four cancelling pulse generators. The coefficients of the unscaled cancelling pulse are generated off-line by superposing as many prototype filter masks, properly shifted in frequency, as the number of carriers the input signal is made of. The algorithm is compared against a Peak Windowing CFR (PW-CFR) and a Noise Shaping CFR (NS-CFR). With the frequency and number of coefficients chosen by the authors of the application note for the comparison, the PC-CFR outperforms both the NS-CFR and the PW-CFR solutions in terms of ACLR and EVM.

In [5], Song and Ochiai propose a Field Programmable Gate Array (FPGA) implementation of the PC-CFR. The added value of their solution is a workaround for the problem of overlapping cancelling pulses caused by too closely spaced detected peaks. When the detected peaks are too closely spaced (in terms of number of samples), the corresponding cancelling pulses might overlap and add in phase, thus both reducing the effect of the peak reduction and generating peak regrowth. When the measured distance between successive peaks falls under a certain value, the authors propose generating a truncated version of the cancelling pulses in order to avoid the overlap. Such a truncation introduces discontinuities in the signal and, as a consequence, OOB emission. The authors state that a simple moving-average filter is good enough to take care of these emissions and satisfy the ACLR requirements. Results show that the proposed solution is satisfactory in terms of both EVM and ACLR, although the hardware complexity is higher than that of the plain PC-CFR solution because of the added circuitry for the detection and truncation of the pulses.

In [10], Schmidt and Schlee propose a PC method that generates a cancelling pulse shaped only on the carrier that, at the moment of the peak detection, contributes the most to the aggregated signal. By doing so, the algorithm should minimize both the in-band and the OOB emissions. The knowledge about which sub-carrier is responsible for the largest part of the peak should be available from the measurement of the time at which the peak is detected. The cancelling pulses are also dynamically conditioned by a set of weights that may change according to several scenarios that might occur (e.g. if a carrier is idle for a certain amount of time, the corresponding spectral range could be "occupied" without any risk of introducing distortion).

In [11], Bäuml et al. use the term selected mapping for the first time. The selected mapping algorithm can be used to mitigate the PAPR of signals consisting of an arbitrary number of carriers and any signal constellation. This method provides significant advantages at the cost of a moderate additional complexity.

In [12], Wang et al. describe the first nonlinear companding³ transform (NCT) for PAPR reduction, based on the µ-law algorithm used in speech processing. It shows better performance than the clipping algorithm.

In [13], Jean Armstrong transforms the OFDM signal into the time domain via an over-sized IDFT, which gives rise to trigonometric interpolation. The signal is then clipped and filtered via a forward and an inverse DFT in order to remove OOB emissions. These results are further improved by the same author (see [14]) by repeatedly clipping and filtering. In particular, the author claims that this method causes no increase in OOB emissions.

In [15], unlike the µ-law companding scheme, which reduces the PAPR by enlarging only the small portions of the signal, Jiang et al. propose a solution based on the exponential companding technique, which adjusts small and large signal samples together, keeping the average power unchanged but transforming the power density distribution from Rayleigh to uniform, and also generating fewer spectrum side-lobes. A similar approach is pursued in [16] by Al-Azzo et al., where the distribution density is instead transformed from Rayleigh to Gaussian and, as a consequence, the peak and average values are changed so that the overall PAPR is reduced. Improvements are shown in the in-band distortion too.

In 2008, Carole et al. [17] present a method that exploits the unused carriers in OFDM systems in order to decrease the PAPR of the signal without introducing significant OOB and in-band distortion (compared to clipping and windowing techniques), because no interference with the proper data channels exists.

In 2013, Sroy et al. [18] propose a version of the Iterative Clipping and Filtering (ICF) algorithm for the PAPR reduction of OFDM-type signals using the (Inverse) Discrete Cosine Transform (IDCT/DCT), showing better results than the regular DFT/IDFT-based approach in [14].

³ From the combination of the words compressing and expanding.


Chapter 3

The proposed implementation of the PC-CFR

3.1 General description of the PC-CFR algorithm

A detailed description of the implementation of the PC-CFR algorithm is given in the following section of this chapter, but first a more in-depth discussion of the algorithm from a general point of view is necessary in order to better understand the design choices that have been made.

The PC-CFR module is usually placed after the aggregator (which combines all the signals coming from the different channels) and before the Digital Pre-Distorter (DPD), when present (see Figure 3.1). The input of the system is a fixed-point signal made of two parts (in-phase and quadrature¹). It is the sum of the components relative to the various carriers, and the result is a high-PAPR discrete-time signal. The output of the PC-CFR is a lower-PAPR, delayed signal of the same format.

[Figure 3.1 diagram. Blocks shown: per-carrier inputs x1 ... xK with filters h1 ... hK and frequency shifts e^{iω1Ts} ... e^{iωKTs}, summation Σ, CFR, DPD, DAC, mixing with e^{iωt}, HPA, antenna.]

Figure 3.1: Typical positioning of the CFR inside the communication chain

The purpose of the algorithm is to reduce the PAPR of the input signal to a desired value, and this is achieved by monitoring and, when necessary, reducing the values of the samples exceeding a certain threshold. The value of this threshold is directly related to the final desired PAPR.

The PC-CFR performs time-domain signal processing on limited, selected portions of the input signal. Such portions are selected according to the presence of peaks, which can be defined as follows: given the interval of samples of the input signal that starts from the first sample having magnitude greater than the threshold and finishes after a fixed number of samples, the peak is the element having the maximum magnitude inside this interval. Because the detection of the peaks is based on the magnitude of the input samples, a conversion from rectangular to polar form, or some other means of exposing the magnitude of the input samples, is needed as one of the first steps of the algorithm.

For each detected peak, a cancelling pulse is generated and subtracted from the input signal in order to reduce the value of the peak to the value of the threshold. The complex coefficients of the cancelling pulse are stored in a memory; these coefficients are the same for each peak being cancelled, but in order for the cancelling pulse to be shaped accurately on the peak it is expected to cancel, they are multiplied by the peak characteristics before being subtracted from the input signal. To be clearer, the cancelling pulse used to cancel the peak from the complex input signal by subtraction is generated by a simple complex multiplication between each of the coefficients of the stored unscaled cancelling pulse and a single complex number coming from the peak detection part of the algorithm, this operation being performed for each peak independently. The characteristics of the peak p that are needed for the generation of the cancelling pulse are the difference between the magnitude of the sample selected as peak ($s_k$, for some $k$) and the threshold, and the phase of such element:

\[ p = \rho_P \cdot e^{i\theta_P}, \quad \text{where } \rho_P = \lVert s_k \rVert - \text{threshold} \]

The cancelling pulse elements $c[n]$ are generated according to this formula:

\[ c[n] = \rho_P \cdot \rho[n] \cdot e^{i(\theta_P + \theta[n])} \]

where $\rho[n] \cdot e^{i\theta[n]}$ are the coefficients of the unscaled cancelling pulse, for all the values of $n$. It should be noted that this operation is much less computationally intensive (i.e. it requires a much lower amount of hardware resources) than other filtering-based CFR signal processing algorithms. At the output of the multiplier, the complex data is converted back to rectangular form², ready to be subtracted from the input signal, thus finally cancelling the peaks.

Of course, it may happen that more than one cancelling pulse needs to be generated at the same time, so that portions of their intervals overlap. In order to provide the cumulative effect of all the cancelling pulses, the coefficients of all the active pulses must be added together and then subtracted from the signal at each sample of interest.

Another observation is that the cancelling pulse effectively cancels the peak element and that element only: the central element of the unscaled cancelling pulse is the one that, when multiplied by the peak characteristics and subtracted from the signal, yields an element whose magnitude matches the threshold value exactly. It follows that the value of such element must be real and equal to one. In Figure 2.5 the effect of the subtraction, and the consequent reduction of the peak to a magnitude matching the threshold, is shown on the complex plane. All the neighbouring input samples are modified, as already explained, in such a way that their magnitude is generally reduced too, but it should be noted that the algorithm has no accurate control over these elements, therefore some undesirable phenomena are unavoidable, as will be illustrated shortly.

¹ The in-phase/quadrature components format can be formally considered as a complex signal, with the real and imaginary parts corresponding to the in-phase and quadrature components respectively. In the rest of the text the two formalisms (complex and I/Q) will be used interchangeably.

² The rectangular form of the complex numbers is much more suitable than the polar form to perform additions and subtractions.

The algorithm is usually applied more than once to the signal, by letting the output of one module, or stage, become the input of the next one in a cascade-like structure (see Figure 3.2). The reasons for which this is usually done are the following:

• Peak Leak. If an implementation of the PC-CFR algorithm poses an upper limit on the number of simultaneous cancelling pulses that can be generated by a single stage, then, when such limit is reached and a new peak is detected, the peak simply passes uncancelled through the stage and, in the case of the last stage, reaches the Power Amplifier, which is the event we strive to avoid in the first place. By cascading several stages, the probability of such an event decreases. The scenario depicted should not be considered unlikely, because peaks may come in bursts separated by relatively long periods of inactivity, so the utilization of the resources of the module is not uniform over time, alternating between high intensity and long idle periods. It is crucial to understand that what counts as a peak, and hence the presence, density and magnitude of the peaks, depends on the parameter values the Clip Stage is configured with. For example, if no peaks are detected for a certain value of the threshold, the same set of input values may exhibit one or more peaks for a lower threshold. The number of closely spaced detected peaks also depends on the Peak Search Window length (i.e. how many samples are observed in search of the peak): the same set of input elements can give rise to a larger or smaller number of detected peaks according to the length of this interval (the longer the interval, the fewer the detected peaks, because larger numbers of samples are associated with single peaks).

• Peak Regrowth. It can be observed (Figure 2.8) that, because the subtraction of the cancelling pulses from the signal affects a much larger number of samples than the peak alone, some of the samples that were smaller than the threshold before the cancellation of a peak may rise above it because of the constructive summation of the cancelling pulses, thus becoming peaks themselves although they were not in the beginning, and creating the so-called peak regrowth phenomenon. The magnitude of the regrown peaks is correlated with the height of the original cancelled peak, in the sense that the greater a peak is, the more likely and the higher the regrown peaks that appear after its cancellation. By cascading several stages, the regrown peaks can be taken care of as well.

• Gradual peak reduction. It may happen that, in an interval where several cancelling pulses operate simultaneously, one or more peaks are not cancelled efficiently (not completely, or too much) because of the reciprocal interactions among the cancelling pulses. This is an unavoidable phenomenon, which is more likely to happen, and whose effects are more severe, the greater the peak to cancel is. In order to mitigate this and the effect of the peak regrowth phenomenon, a smart strategy consists of gradually reducing the magnitude of the peaks by applying progressively decreasing thresholds to successive stages of the PC-CFR, instead of trying to cancel them completely in one pass. This can be easily achieved with a cascading architecture because each iteration of the PC-CFR may be independently configured with a different set of parameters, such as the threshold.
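The Peak Leak scenario in the list above can be made concrete with a small Python model of a stage that owns a limited number of pulse generators. This is only a sketch: the "first free unit" policy, the function name `dispatch_peaks` and all the numbers are assumptions made for illustration and do not describe the dispatching policy of the actual design.

```python
def dispatch_peaks(peak_times, num_units, pulse_length):
    """Assign each detected peak to a free pulse generator, if any.

    peak_times   : sample indices at which peaks are detected
    num_units    : number of cancelling pulses that can run simultaneously
    pulse_length : number of samples a generator stays busy per pulse
    Returns (cancelled, leaked) lists of peak times.
    """
    free_at = [0] * num_units          # first sample at which each unit is free again
    cancelled, leaked = [], []
    for t in sorted(peak_times):
        for unit in range(num_units):
            if free_at[unit] <= t:     # hypothetical policy: take the first free unit
                free_at[unit] = t + pulse_length
                cancelled.append(t)
                break
        else:
            leaked.append(t)           # all units busy: the peak passes uncancelled
    return cancelled, leaked

# A burst of three close peaks followed by two isolated ones.
burst = [0, 3, 5, 12, 40]
print(dispatch_peaks(burst, num_units=2, pulse_length=10))  # → ([0, 3, 12, 40], [5])
print(dispatch_peaks(burst, num_units=3, pulse_length=10))  # → ([0, 3, 5, 12, 40], [])
```

The example shows how a burst can exhaust the available units even though they are idle most of the time, which is the non-uniform utilization mentioned in the Peak Leak item.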

The implementation of numerous clip stages not only requires a larger area (and thus higher power consumption) on the chip, but also introduces a longer delay on the signal, which in general is an undesirable effect, especially for the more recent communication protocols. The delay in the signal data path is purposely introduced so that all the computations constituting the algorithm have the time they need to execute. The largest portion of the delay is by far the group delay of the cancelling pulse itself, which obviously cannot start before the actual detection of the peak.

[Figure 3.2 diagram. The higher-PAPR I/Q signal passes through Clip Stage 1 and Clip Stage 2 and leaves as the lower-PAPR I/Q signal. The Clip Stages send the peak detection notifications and peak characteristics to the Peak Manager (management of HW resources), which returns the cancelling pulses.]

Figure 3.2: Simplified block-level view of the architecture of the PC-CFR module, with two cascaded Clip Stages as an example.

As previously stated, this algorithm involves some signal processing which in turn modifies the characteristics of the input signal, thus introducing both in-band and out-of-band distortion. In order to reduce this undesirable consequence, the unscaled cancelling pulse is chosen so that its frequency spectrum matches that of the input signal as closely as possible. The spectrum of the input signal depends on the number, bandwidth and relative positions of the carriers and is either known or estimable. A trade-off must be made on the length of the cancelling pulse. The longer the cancelling pulse (that is, the more coefficients it is made of), the more severe the effect on the input signal when the pulse is subtracted from it, because the operation affects a larger number of elements, impacting negatively on the EVM; also, longer cancelling pulses require larger memories for their storage and impose longer delays. On the other hand, the steeper the frequency response of the cancelling pulse³ is, the more accurately we can intervene on the signal spectrum while at the same time reducing the consequences over the frequency intervals that do not belong to the input signal, yielding lower OOB emissions. This is a desirable behaviour because the total frequency bandwidth is a resource shared among several users, thus its integrity must be preserved.

One notable limitation of the PC-CFR algorithm compared to other types of signal processing algorithms for CFR is the fact that, every time a cancelling pulse is generated, it requires the exclusive use of some hardware resources, which of course amount to a finite quantity. Other algorithms, based essentially on the filtering of the signal or of portions of it, do not suffer from this limitation but, on the other hand, the complexity of the filters (which translates to higher area occupancy and in general more power consumed by the ASIC) limits their attractiveness.

A notable advantage of the PC-CFR over the Turbo Clipping (TC) and other filter-based algorithms is its inherent flexibility with respect to changes of the input signal characteristics. For the PC-CFR algorithm, adapting to a completely different configuration of the input signal carriers is just a matter of changing the coefficients of the unscaled cancelling pulse via a re-configuration of the pulse memory, which makes the module useful in several contexts. The TC algorithm, instead, operates on each carrier independently via a properly designed branch consisting of one or more decimators, Finite Impulse Response (FIR) filters and interpolators. It follows that the entire hardware architecture of the TC is shaped upon a particular configuration of the input signal carriers, and it cannot be reconfigured as easily. On the other hand, the per-carrier filtering of the TC allows a more accurate, and therefore more effective, intervention on the input signal, whereas the cancelling pulse in the PC-CFR is generally obtained from the cumulative characteristics of the entire input signal carrier configuration and is therefore sub-optimal with respect to each carrier.

³ According to the theory of digital signal processing, longer sequences in the discrete-time domain correspond to steeper profiles in the frequency domain.

3.2 Structural description of the proposed implementation

The proposed architecture consists of a parameterizable number of cascaded Clip Stages (CSs), each of them communicating with a centralized controlling module called Peak Manager (PM) (see Figure 3.2 for an example scenario with two Clip Stages). The cascade of CSs constitutes the data path of the signal and allows the algorithm to be iterated the desired number of times, but not necessarily with the same set of configuration values (every CS can be configured with a local threshold and Peak Search Window length, for example). In each CS, the following operations are performed: the conversion of the input signal from rectangular to polar form, the peak detection, the delaying of the input signal and the subtraction of the cancelling pulse from it.

The PM is responsible for dispatching the detected peaks coming from the various Clip Stages to the available Peak Cancelling Units (PCUs)⁴ by implementing a dispatching policy. The generation of a cancelling pulse requires the availability of a PCU for the entire duration of the pulse itself. Such PCU appears busy, and is therefore unavailable for the generation of other cancelling pulses, for the entire period. There is a finite number of PCUs in the Peak Manager. The PM receives the notifications about (and the characteristics of) the detected peaks from all the connected CSs, and then generates and dispatches the cancelling pulses to them (again, if at least one PCU is available). The PM is made of several components: one memory to store the coefficients of the unscaled cancelling pulse, a complex multiplier, a Coordinate Rotation Digital Computer (CORDIC) unit dedicated to the conversion of the data from the polar back to the rectangular form, and an adder to combine all the cancelling pulses before sending them to the various CSs for the final cancellations. A controlling unit and a pulse generator are responsible for the overall management of the whole subsystem.

In Figure 3.3, the top-level diagram of the entire PC-CFR is presented with the names of the input/output ports and the principal configurable parameters, with the names as they appear in the SV code. The following is a list describing each of these signals.

⁴ What is referred to here as a PCU is the set of hardware and physical resources (a time-slot in the Time-Division Multiplexing rotation is a physical resource) needed for the generation of a cancelling pulse.

[Figure 3.3 diagram. The PC-CFR block with the ports i_data_real, i_data_imag, i_dtg, i_thr_lvs, i_cmd_strst, clk, clk_1G, rst_n, o_data_real, o_data_imag, o_data_stats and the parameters cr_thr_c, cr_psw_length_c.]

Figure 3.3: Top-level view of the input/output signals and parameters of the PC-CFR module

• i_data_real. Data, input. In-phase component of the input data.
• i_data_imag. Data, input. Quadrature component of the input data.
• i_dtg. Configuration register, input. Input data toggle. At each toggle of this signal the module processes one data sample.
• i_thr_lvs. Configuration register, input. This input provides the mapping between the peak scale values and the length of the cancelling pulses, as explained in the report.
• i_cmd_strst. Configuration register, input. Synchronous reset of the peak statistics.
• clk. Clock. Main clock of the module. Its frequency is 250 MHz.
• clk_1G. Clock. Secondary, faster clock of the module, used for time-division multiplexing. Its frequency is 4 × clk = 1 GHz.
• rst_n. Reset. Active-low, asynchronous reset.
• cr_thr_c. Configuration register, input. Values of the thresholds for the Clip Stages.
• cr_psw_length_c. Configuration register, input. Values of the PSW length for the Clip Stages.
• o_data_real. Data, output. In-phase component of the output data.
• o_data_imag. Data, output. Quadrature component of the output data.
• o_data_stats. Status register, output. Statistics about the peak height distribution.

[Figure 3.4 diagram. The higher-PAPR I/Q data feeds both a Mag/phase CORDIC followed by the Peak Detector (configured with the Threshold and the Peak Search Window (PSW) length, and producing the peak statistics) and a Delay block (CORDIC + PSW + group delay + various). The Peak Detector sends the peak scale, peak phase and displacement to the Peak Manager; the cancelling pulses coming from the Peak Manager are subtracted from the delayed data to give the lower-PAPR I/Q data.]

Figure 3.4: Block level diagram of the Clip Stage. The Peak Detector isolates the peaks in the input signal and collects statistics on them.

3.2.1 The Clip Stage

Each Clip Stage (Figure 3.4) receives the data to be processed from the previous CS (or from the previous module in the processing chain, in case of the first clip stage), in the form of an I/Q fixed-point signal. The inputs of the CS are the clock, the active-low reset, the input signal (real and imaginary parts), the input data toggle command, the synchronous reset of the peak statistics and the cancelling pulse(s) coming from the PM. The output is the registered difference between the (delayed) input signal and the cancelling pulse(s).

CORDIC

The first module encountered by the signal inside the CS is the CORDIC. The CORDIC is a flexible iterative algorithm capable of computing approximations of several transcendental functions without the need for multipliers, so it is conveniently used in hardware design in order to minimize the area. Inside the CS, it is used to convert the complex input signal from the rectangular to the polar form, so that the magnitude of the input samples is exposed and the peaks can be detected. The implemented CORDIC can be configured to be synthesized in a pipelined or a non-pipelined version. The latter performs all the iterations combinatorially in a single clock cycle, thus offering a significantly lower delay, but might not be synthesizable at higher clock frequencies. To keep the PC-CFR as configurable as possible, both options are available.
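As an illustration of the rectangular-to-polar conversion, the following Python sketch implements a textbook vectoring-mode CORDIC. It is not the SV implementation of the design: it uses floating-point arithmetic, the function name `cordic_to_polar` and the iteration count are assumptions, and the multiplications by powers of two stand in for the shifts a hardware version would use. In hardware the arctangent values would typically come from a small table, and the final gain correction is a single constant factor.

```python
import math

def cordic_to_polar(x, y, iterations=16):
    """Vectoring-mode CORDIC: return (magnitude, phase) of the complex number x + iy."""
    # Pre-rotate by 180 degrees when needed so that the vector lies in the
    # right half-plane, where the iterations below converge.
    angle = 0.0
    if x < 0:
        angle = math.pi if y >= 0 else -math.pi
        x, y = -x, -y
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i              # stands in for an arithmetic right shift by i
        if y > 0:                     # rotate clockwise to drive y towards zero
            x, y = x + y * step, y - x * step
            angle += math.atan(step)
        else:                         # rotate counter-clockwise
            x, y = x - y * step, y + x * step
            angle -= math.atan(step)
        gain *= math.sqrt(1.0 + step * step)   # constant for a fixed iteration count
    return x / gain, angle

mag, pha = cordic_to_polar(3.0, 4.0)
print(mag, pha)   # close to 5.0 and atan2(4, 3)
```

Each iteration only needs shifts, additions and one table lookup, which is why the algorithm is attractive when area must be minimized; a pipelined version would register the state between iterations, while a non-pipelined one would chain them combinatorially.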

Peak Detector

In the following, the term Peak Detector refers to what, with reference to Figure 2.7, corresponds to the combined functions of the Peak Detector and the Peak Extractor. Therefore, the goals of the Peak Detector are:

• To identify the peaks in the input signal. Every time a new peak is detected, the module sends the peak characteristics and a notification pulse to the Peak Manager (PM).
• To collect information on the height of the detected peaks. This data is collected either for statistical purposes or with a view to using it to adjust the threshold and the Peak Search Window (PSW) length (not yet possible in the present implementation).

The Peak Detector can be configured with two values, the threshold and the PSW length (cr_thr_c and cr_psw_length_c in the SV code, respectively), which can be set independently for each CS. The module is implemented as a two-state Finite State Machine (FSM) (see Figure 3.5 for the Algorithmic State Machine (ASM) of the Peak Detector, with pseudo-code or plain English in place of the actual SV statements or variable identifiers, in order to favor clarity over formality). In the IDLE state, the input samples pass unaffected and no action is taken until a sample exceeds the programmed threshold. The state machine then moves to the PEAK_SEARCH state during which, for the fixed number of samples dictated by the PSW length register, successive input samples are compared to the last detected maximum in order to find the maximum sample within the entire interval (the definition of peak). This is performed simply by comparing the magnitude of each new input sample with the current maximum, which is stored in a register together with the corresponding phase. At the end of the interval, the value of the threshold parameter is subtracted from the maximum found, thus defining what will be referred to as peak scale in the rest of the report. The peak scale, the corresponding phase and a trigger signal are sent to the PM, and the statistics of the peaks are updated with the new arrival.

A fundamental aspect of the peak detection process has been neglected so far: within the PSW interval, the sample that is elected as the peak can be found at any position (i.e. it could be the first, the second or the last sample in the interval), and this positional information is necessary for the proper alignment between the cancelling pulse that will be generated by the PM and the input signal. The Peak Detector keeps track of this displacement of the peak inside the PSW interval via a counter (reported as displacement in Figure 3.5), and this is the last piece of information sent by the Peak Detector to the PM when a new peak is detected.
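The behaviour of the two-state machine described above can be sketched in Python as follows. This is a behavioural model written for illustration, not the SV code: the function name `detect_peaks` is hypothetical, the window is assumed to contain exactly `psw_length` samples including the one that triggers the search, and a window still open at the end of the input is assumed to be discarded.

```python
import cmath

def detect_peaks(samples, threshold, psw_length):
    """Return a list of (peak_scale, peak_phase, displacement) tuples."""
    peaks = []
    state = "IDLE"
    max_mag = max_pha = 0.0
    count = displacement = 0
    for s in samples:
        mag, pha = cmath.polar(s)          # role played by the CORDIC in the Clip Stage
        if state == "IDLE":
            if mag > threshold:            # first sample over the threshold opens the window
                max_mag, max_pha = mag, pha
                count, displacement = 1, 0
                state = "PEAK_SEARCH"
        else:
            if mag > max_mag:              # new maximum inside the window
                max_mag, max_pha, displacement = mag, pha, count
            count += 1
        if state == "PEAK_SEARCH" and count == psw_length:
            # End of the window: report the peak and return to IDLE.
            peaks.append((max_mag - threshold, max_pha, displacement))
            state = "IDLE"
    return peaks

# Hypothetical input: two windows are opened, at index 1 and at index 6.
data = [0.5, 1.2, 1.5j, 1.1, 0.4, 0.3, -1.3, 0.2, 0.1, 0.9]
for scale, phase, disp in detect_peaks(data, threshold=1.0, psw_length=4):
    print(round(scale, 3), round(phase, 3), disp)
```

In the first window the maximum is the second sample of the window (displacement 1), whereas in the second window it is the triggering sample itself (displacement 0).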

[Figure 3.5 flowchart. IDLE: if i_data_mag > thr., save i_data_mag and i_data_pha, reset the PSW counter and move to PEAK_SEARCH; otherwise stay in IDLE. PEAK_SEARCH: PSW = PSW + 1 (increases the samples count); if i_data_mag > current max, update the current max mag, phase and displacement; if the end of the PSW is reached, notify the Peak Manager, create and send pk_scale = max mag - thr., pk_pha and displacement, and update the peak statistics.]

Figure 3.5: ASM of the Peak Detector. Please note that "thr." and "End of PSW" correspond to the cr_thr_c and cr_psw_length_c parameters respectively.

Delay Memory

The input signal to the CS is sent to both the CORDIC and a delay memory whose purpose is to compensate for the delays due to the various aforementioned steps of the processing on the signal in the CS and in the PM. The delay can be split into two components, which gives a clearer understanding of their origins and relative sizes (expressed in data-rate periods): the smaller component compensates for the CORDIC (one period if it has been configured as non-pipelined, eleven otherwise⁵), the Peak Detector (a number of periods equal to the PSW length), and the whole processing chain of the PM. The largest component by far is the group delay of the cancelling pulse generation, which amounts to approximately half the number of coefficients of the pulse.

Final registered subtractor

The output of the CS is generated as the difference between the delayed signal and the sum of the cancelling pulses coming from the PM, registered. This costs another data period of delay per Clip Stage but is not compensated by the delay memory because the final registered subtraction is the very last operation applied to the signal.

3.2.2 The Peak Manager

The Peak Manager (see Figure 3.6) is the centralized unit that receives the notifications and the characteristics of the detected peaks from all the Clip Stages, generates the cancelling pulses accordingly and sends them back to the appropriate Clip Stage, where they finally cancel the peaks. One of the most crucial tasks of the PM is the management of the Peak Cancelling Units (PCUs), whose optimal utilization has been the main effort in this design. In the most naive approach, the availability of N PCUs would require N replicas of all the resources needed for the generation of a single cancelling pulse: N memories storing the cancelling pulse coefficients, N complex multipliers, N CORDICs for the conversion from polar to rectangular form and N inputs to an adder that combines all the cancelling pulses.

Figure 3.6: Basic conceptual scheme of the relation between Clip Stages and Peak Manager. 1. detected peaks are notified and sent to the PM; 2. the PM searches for available HW resources in order to generate the cancelling pulses and, if available, 3. sends the cancelling pulses to the Clip Stages for the cancellation of the peaks. [Diagram not reproduced: the I/Q signal passes through Clip Stage 0 to Clip Stage 3; each Clip Stage exchanges signals 1 and 3 with the Peak Manager, which searches for and allocates HW resources to peaks (2) among its Peak Cancelling Units (PCUs).]

In order to minimise the area of the PC-CFR module, as anticipated in the introduction, the present implementation uses a time-division multiplexing approach for a more efficient exploitation of these hardware resources. To make this possible, a second, faster clock is used, and the ratio between the faster and the slower clock frequencies is set by the parameter num_ts_c (number of time slots, see Figure 3.7). As in every time-division multiplexing scenario, a single resource is shared among several users in different slots of time that form a partition (that is, without any overlap) of a longer interval, which repeats periodically. In the present implementation the shared resource is the set of hardware resources mentioned above (coefficient memory, multiplier, etc.), the time slot is the period of the faster clock and the longer interval is the period of the slower clock. Ultimately, the PM shares the hardware among as many cancelling pulses as the ratio between the slower and the faster clock periods (equivalently, the ratio between the faster and the slower clock frequencies, num_ts_c). The higher this ratio, the more efficiently the available hardware is used: the same hardware can be seen as exclusively available for the generation of multiple cancelling pulses, with the only constraint that they take turns in accessing it, without overlapping.

The present PC-CFR module has been simulated and synthesised with the slower clock frequency constrained to 250 MHz (corresponding to the input data rate) and with a faster clock, generated internally to the ASIC, four times faster (1 GHz), thus giving a ratio of four. In order not to limit the reusability and flexibility of the project, however, the parameter num_ts_c can be set to any other positive integer whenever a different ratio is convenient and/or physically available. In conclusion, the PM has num_ts_c PCUs available instead of only one, which translates into the possibility of cancelling up to num_ts_c peaks simultaneously.

Figure 3.7: The relation between the two clocks determines the definition of the time slots. It can be seen that in this time-division multiplexing exactly four time slots are available. [Timing diagram not reproduced: fCLK = 250 MHz, fCLK_1G = 1 GHz; each fCLK period contains the time slots TS0, TS1, TS2 and TS3.]

The PM maintains a table holding the information about the peaks being dealt with, and in doing so it also keeps track of the availability of the resources for the generation of the cancelling pulses when a new peak is detected; in particular, each row of the table corresponds to a PCU.⁶ Each PCU is also associated with one, and only one, CS; this information is stored in a separate field, CS (see Table 3.1⁷). For example, if the PC-CFR has four Clip Stages and the PM is capable of generating up to eight cancelling pulses simultaneously, the first four PCUs may be dedicated to the first CS, the next two to the second one, and one each of the last two to the third and fourth. The mapping between the PCUs and the Clip Stages is at the moment not configurable, but it can be changed by modifying the SV code. The way the Clip Stages are mapped to the rows, and therefore to the PCUs, has however been designed in such a way that it can easily be made configurable in a possible future improvement (see Section 4.1).

When a CS notifies the PM of a detected peak, the PM checks whether a PCU is available for that particular CS by scanning the busy bit of the rows: a peak is accepted only if, among the rows associated with the CS the notification comes from, at least one has the busy bit at zero. If so, the peak scale and the peak phase arriving from the CS are stored in the table at the corresponding row, the displacement is sent to another part of the PM that will be discussed shortly, and the busy bit is asserted; otherwise the peak is ignored and leaks from the present Clip Stage. The busy bit is deasserted and the pk_scale and pk_pha fields are reset at the end of the generation of the corresponding cancelling pulse, thus making the PCU available for the generation of another cancelling pulse.

PCU n.   busy   CS   pk_scale   pk_pha
0        0      0    0          0
1        1      0    2345       221
2        1      0    1632       -110
3        1      0    995        1298
4        0      1    0          0
5        1      1    4632       -1121
6        0      2    0          0
7        1      3    832        -450

Table 3.1: An example of a possible PCU table at a certain instant. In this example we have 8 PCUs mapped on 4 CSs in this order: 4, 2, 1, 1. The choice of mapping more resources to earlier CSs is generally considered a convenient rule of thumb.

Whenever a detected peak is accepted and inserted into the table, the generation of the corresponding cancelling pulse starts. As we have seen in the general description of the algorithm, a set of operations has to be performed to this end, the first being the complex multiplication. The two operands are the (peak_scale, peak_phase) pair (which characterizes the specific detected peak and comes from the PCU table) and each of the coefficients of the unscaled cancelling pulse (the same for all the peaks, coming from the data bus of the pulse memory). Some meta-data are generated, attached to the data and consumed along the processing chain of the cancelling pulses; see Figure 3.8 for a possible representation of the data and meta-data at the output of the complex multiplier. The valid bit is a flag that tells the following stages whether the corresponding data should actually be processed. The direction bit identifies the data as being part of the first or the second half of the cancelling pulse, in order to take advantage of the symmetry of the pulse itself (see Section 3.2.2). The Clip Stage information travels together with the data because, once the data reach the adder and dispatcher, it is used to send them to the correct destination CS. Figure 3.9 shows the detailed steps; the design choices that have been made, together with the rationale behind them, are described in the following for each of the two datapaths feeding the operands of the first step in the chain, the complex multiplier.

Figure 3.8: For each unit of data some meta-data are associated. These meta-data are consumed at various points during the elaboration chain. [Diagram not reproduced: each data word carries a magnitude component and a phase component, together with the meta-data fields V (valid bit), D (direction bit) and CS (destination Clip Stage).]

⁵ The number eleven comes from the precision of the data processed by the CORDIC: the number of iterations of this algorithm is roughly the same as the number of bits used to represent the data.

⁶ The terms rows and PCUs will be used interchangeably in the following.

⁷ Although the technology constraint for this design project imposes only four time slots, Table 3.1 shows eight PCUs available. This is not only for explanatory purposes: as will be shown later, the limit of num_ts_c time slots can be bypassed with some hardware redundancy.
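The acceptance rule for a new peak (scan the rows reserved for the notifying CS and take one whose busy bit is zero, otherwise let the peak leak) can be sketched in Python as follows. This is a behavioural illustration only: the names PcuRow, build_table, accept_peak and release are hypothetical, and choosing the first free row in table order is an assumption, since the text only requires that at least one free row exists.

```python
from dataclasses import dataclass


@dataclass
class PcuRow:
    cs: int              # Clip Stage this PCU is reserved for (static mapping)
    busy: bool = False
    pk_scale: int = 0
    pk_pha: int = 0


def build_table(pcus_per_cs):
    """Expand a mapping such as [4, 2, 1, 1] into one row per PCU."""
    return [PcuRow(cs) for cs, n in enumerate(pcus_per_cs) for _ in range(n)]


def accept_peak(table, cs, pk_scale, pk_pha):
    """Store the peak in a free row reserved for `cs` (assumed: the first one).

    Returns the row index, or None when every row of that CS is busy,
    in which case the peak is ignored and leaks from the Clip Stage.
    """
    for idx, row in enumerate(table):
        if row.cs == cs and not row.busy:
            row.busy, row.pk_scale, row.pk_pha = True, pk_scale, pk_pha
            return idx
    return None


def release(table, idx):
    """End of the cancelling pulse: reset the fields and free the PCU."""
    row = table[idx]
    row.busy, row.pk_scale, row.pk_pha = False, 0, 0
```

With the 4, 2, 1, 1 mapping of Table 3.1, a second peak notified by the third Clip Stage while its only PCU is busy is rejected, whereas the first Clip Stage can have up to four peaks in flight.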

The PCU table datapath

At the faster clock rate, the rows of the PCU table are read continuously, from the first to the last and then again from the first, and the values of pk_scale and pk_pha are sent to the complex multiplier regardless of whether a peak is actually stored, and being cancelled, at that row. As can be seen from Table 3.1, the rows associated with an idle PCU hold 0 for both the peak scale and the phase, so even if they were processed they would yield a zero result. Each row reading corresponds to a time slot, and the reading of the entire table corresponds to one period of the slower clock. At this point the CS information is also sent along with the peak scale and phase, to be used later by the adder.

Figure 3.9: The schematic shows the most important parts constituting the Peak Manager. On the left, the unscaled cancelling pulse datapath is composed of the address generators and the pulse memory, and in the bottom part the PCU table is filled with some example values. The rows and address generators that have the same colors are matched. [Schematic not reproduced: address generators 0 to 3 drive, through a MUX, the cancelling pulse memory, which outputs the coefficient ρC·e^{jϑC}; a second MUX selects one PCU table row per time slot and outputs ρP·e^{jϑP}; the CMUL produces ρC·ρP·e^{j(ϑC+ϑP)} in mag/phase form, the CORDIC converts it to a + jb, and the adder Σ sends the results to CS0, CS1, CS2 and CS3. Example PCU table (busy bit, Clip Stage, peak scale, peak phase): time slot 0: 1, 0, 241, -25; time slot 1: 1, 1, 125, 119; time slot 2: 0, 2, 0, 0; time slot 3: 1, 3, 78, -66. The peak detection notifications and characteristics come from the Clip Stages.]
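The cyclic reading of the PCU table can be pictured with a few lines of Python. This is a sketch under the assumption of one table row per time slot (the basic configuration); the function name read_schedule is hypothetical and the example rows are the values shown in Figure 3.9.

```python
def read_schedule(table, num_fast_cycles):
    """Return (slow_period, time_slot, row) for each edge of the faster clock.

    One row is read per time slot; a full pass over the table spans
    exactly one period of the slower clock.
    """
    num_ts = len(table)
    schedule = []
    for cycle in range(num_fast_cycles):
        slot = cycle % num_ts
        slow_period = cycle // num_ts
        schedule.append((slow_period, slot, table[slot]))
    return schedule


# Rows as (busy, cs, pk_scale, pk_pha): the example values of Figure 3.9.
EXAMPLE_TABLE = [(1, 0, 241, -25), (1, 1, 125, 119), (0, 2, 0, 0), (1, 3, 78, -66)]
```

Over six fast clock cycles the slots visited are TS0, TS1, TS2, TS3, TS0, TS1, the same sequence drawn in Figure 3.7. The idle row of time slot 2 is read like the others and simply contributes zeros.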

The unscaled cancelling pulse datapath

The second operand of the complex multiplier comes from the pulse memory, which is shared among all the PCUs. As discussed above, at each time slot a different row of the table is read and its peak scale and phase are sent to the multiplier, so the correct coefficient must likewise be fetched from the memory and sent to it. The purpose of the address generators is, firstly, to generate the addresses for the cancelling pulse memory and, secondly, to manage the valid and direction bits. The address generators are implemented as programmable up-down counters with an enable input. The cancelling pulse coefficients are stored sequentially in the pulse memory, so the count outputs of the address generators are connected directly to the address bus of the memory via a multiplexer, which alternates the addresses coming from the various counters in a time-shared fashion. The PM contains as many address generators as PCUs, which in turn is the same as the number of time slots available inside a sample rate period. The PCU table rows and the address generators are matched, in the sense that they always work in pairs (the first row with the first address generator, and so on). Because the various simultaneous cancelling pulses must be generated independently, the address generators must work independently as well: each one must keep track of the address of its own coefficient inside the shared pulse memory independently of the other pulses, and resume counting from the point where it stopped the last time it had its time slot. A single address generator therefore cannot be shared among the time slots, which is why a number of address generators equal to the number of PCUs (stored in the parameter num_PCU_c) has necessarily been instantiated in the code.

The insertion of a new peak into the PCU table, which happens if an available CS/busy bit combination is found, is immediately followed by the programming of the corresponding address generator with the following information: the displacement coming from the Clip Stage (see 3.2.1), a start address and a finish address. The start address can be 0, in the case of a complete cancelling pulse, but, as will be discussed later, it can also be a larger number (the cancelling pulse is then not generated completely, but only a portion of it, still symmetric around the central element). In either case the address follows the same progression: it increases up to 511 (the central element of the cancelling pulse, which is the one responsible for cancelling the peak), then it decreases down to the start value⁸. As soon as this information is available, an enable signal triggers the counting, at the end of which the address generator asserts a pulse that notifies the PM of the end of the cancelling pulse. This in turn clears the busy bit in the corresponding PCU table row and makes the PCU available to process another peak.

At every rising edge of the faster clock, the system moves to a different time slot. The effect on the PCU table is that the information of one of the rows is sent to the multiplier; the effect on the address generators is that only the matched one advances (increasing or decreasing) the address driving the address bus of the pulse memory. All the other address generators, which are not involved in the present time slot, hold their state and wait for their turn to change the count.

The displacement of the detected peak (see 3.2.1) is used by the address generators to delay the start of the counting, and thus the generation of the addresses towards the pulse memory, by this amount; in this way the position of the peak sample inside the PSW is taken into account and the central element of the cancelling pulse is aligned with the actual position of the peak. More precisely, during the initial delay phase, the address generator is loaded with the value of the displacement and commanded to count down; when zero is reached, the proper address generation starts. The reason for the valid bit should now be clearer: during the down-counting required for the compensation of the displacement, addresses are sent to the pulse memory anyway, but the coefficients that consequently appear on the data bus of the memory must not be used for any cancelling pulse. Associating a zero valid bit with these data ensures that they are not used in the computation of the cancelling pulse.

A notable saving of memory area has been made possible by the symmetry properties of the unscaled cancelling pulse (see Figure 3.10): the magnitude of the pulse is symmetric with respect to the central element, whereas the phase is anti-symmetric, again with respect to the central element. This simple property makes it possible to store only half of the coefficients, and thus to implement a smaller memory.⁹

Figure 3.10: Magnitude and phase of the unscaled cancelling pulse (that is, before the multiplication with the peak characteristics). For clarity of representation only the central part is shown, but the symmetry of the former and the anti-symmetry of the latter can be seen. [Plots not reproduced: magnitude (0 to 1) and phase (-4000 to 4000) versus coefficient number (350 to 650); the central element has magnitude 1 and phase 0.]

⁸ This addressing scheme is justified by the symmetry properties of the cancelling pulse, which are illustrated later in this report.

⁹ The memory area is an important component of the overall area footprint of an ASIC.
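The behaviour of a single address generator over its own time slots can be sketched as a Python generator. It is an illustrative model, not the SV counter: the name address_sequence is hypothetical, the number of invalid turns is assumed to equal the displacement, the counter value is assumed to be what appears on the address bus during the delay phase, and the central element is assumed to carry direction bit 0 (its phase is zero, so the choice has no effect on the result).

```python
def address_sequence(displacement, start_addr=0, centre=511):
    """Yield (address, valid, direction) for each turn of one address generator."""
    # Delay phase: count the displacement down. The memory is still addressed,
    # but the data are flagged as invalid (valid = 0).
    for count in range(displacement, 0, -1):
        yield count, 0, 0
    # Ascending half, central element included: direction bit 0.
    for addr in range(start_addr, centre + 1):
        yield addr, 1, 0
    # Descending half, back to the start address: direction bit 1,
    # meaning that the stored phase has to be taken with inverted sign.
    for addr in range(centre - 1, start_addr - 1, -1):
        yield addr, 1, 1
```

For a complete pulse (start address 0, no displacement) the sequence contains 512 + 511 = 1023 valid entries. A larger start address shortens the sequence while keeping address 511 in the middle.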

Figure 3.11: The actual portion of coefficients that are stored in the pulse memory. During the increasing and the decreasing part of the address generation, the direction bit is 0 and 1 respectively in order to "fix" the anti-symmetry of the phase for the complex multiplication. [Plots not reproduced: magnitude (0 to 1) and phase (-4000 to 4000) versus coefficient number (50 to 500), annotated with "address generators increase counting", "address generators decrease counting", "direction bit = 0" and "direction bit = 1".]

This is the reason why the address generators are up-down counters: when the central value 511 has been reached (out of a total of 1023 elements), the counting direction is inverted. From the point of view of the memory data bus, a complete symmetric cancelling pulse appears, coefficient after coefficient, with the only difference that, during the descending part, the phase appears with the opposite sign with respect to the one required. To understand the rationale behind the direction bit, remember that in order to exploit the anti-symmetry of the phase of the cancelling pulse, the phases must be inverted in sign during the descending part of the address generation. The direction bit is sent together with the data, with the value 0 or 1, to tell the multiplier whether the phases have to be added or subtracted, respectively (see Figure 3.11).

Another opportunity for simplification emerges from the study of the various ways in which a complex multiplication can be performed. In rectangular form it consists of four real multiplications and two additions, whereas in polar form it requires only one real multiplication and one addition, with an evident saving of area (see Equations 3.1 and 3.2 for the complex multiplication in rectangular and polar form, respectively). As has been hinted repeatedly, this second approach has been chosen, by storing the coefficients of the cancelling pulse in the memory in magnitude/phase form; the peak scale and the peak phase are already in the convenient form.

\[ (a + ib) \cdot (c + id) = (ac - bd) + i(ad + bc) \tag{3.1} \]

\[ \rho_A e^{i\theta_A} \cdot \rho_B e^{i\theta_B} = \rho_A \rho_B e^{i(\theta_A + \theta_B)} \tag{3.2} \]

The multiplier basically implements Equation 3.3, using the valid and direction bits as follows: if the valid bit is 0, no multiplication is performed (the clock is gated in order to save power); otherwise the multiplication is executed with the phases added together if the direction bit is 0, and subtracted otherwise (that is, with the sign of the phase coming from the cancelling pulse branch inverted). The CS information of each data word passes through untouched.

\[ c_P[n] = \begin{cases} \rho_P \, \rho[n] \cdot e^{i(\theta_P + \theta[n])} & \text{if direction bit} = 0 \\ \rho_P \, \rho[n] \cdot e^{i(\theta_P - \theta[n])} & \text{otherwise} \end{cases} \tag{3.3} \]

The next step in the creation of the cancelling pulse is the conversion of the data from polar to rectangular form; for this, the CORDIC algorithm is used again, this time in its 11-stage pipelined form, since the whole computation chain of the cancelling pulses works at the higher clock frequency. The final stage of the chain, and the last part working in time-division multiplexing, is the adder/dispatcher. The complex data, now in rectangular form, are easily added together on a per-Clip-Stage basis. The meta-data that play a role here are the valid bit and the CS information: the adder sums all the data that have the valid bit set and belong to the same CS. At the end of each period of the slower clock, all the results are dispatched to the respective Clip Stages.
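The last steps of the chain can be illustrated with a short Python sketch of the polar multiplication controlled by the valid and direction bits, followed by the per-Clip-Stage accumulation. The function names polar_cmul and add_and_dispatch are hypothetical, phases are floating-point radians here (the hardware uses fixed-point phase codes), and cmath.rect merely stands in for the polar-to-rectangular conversion that the design performs with a CORDIC.

```python
import cmath


def polar_cmul(peak, coeff, valid, direction):
    """Polar product of Equation 3.3: one real multiplication, one add or subtract.

    peak and coeff are (magnitude, phase) pairs. Returns None for invalid data.
    """
    if not valid:
        return None                      # flagged invalid: nothing is computed
    rho_p, theta_p = peak
    rho, theta = coeff
    phase = theta_p - theta if direction else theta_p + theta
    return rho_p * rho, phase


def add_and_dispatch(items, num_cs):
    """Sum, per destination Clip Stage, the valid data of one slow clock period.

    items is an iterable of (polar_value_or_None, cs) pairs.
    """
    sums = [0j] * num_cs
    for polar, cs in items:
        if polar is not None:
            sums[cs] += cmath.rect(*polar)   # polar -> rectangular
    return sums
```

With direction bit 1 the result equals the product of the peak with the complex conjugate of the stored coefficient, which is exactly the second half of a pulse whose magnitude is symmetric and whose phase is anti-symmetric.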

Implemented improvements

The duration of the cancelling pulses is the main obstacle that prevents the PC-CFR algorithm from processing a higher number of peaks, because the longer the pulse, the longer the corresponding PCU stays busy. On the other hand, as discussed in 3.1, shortening the cancelling pulses may have harmful effects on the OOB distortion; this observation, however, does not take into account the magnitude (basically, the associated energy) of the cancelling pulses. It is reasonable to expect that, for the cancellation of a very small peak, the effect of the cancelling pulse on the energy of the signal is smaller than for bigger peaks: the cancelling pulse is the product of the unscaled cancelling pulse and the detected peak scale, therefore the cancelling pulses associated with smaller peaks are smaller in magnitude, and in energy as well. It is therefore conceivable to use shorter cancelling pulses to cancel smaller peaks, because the side lobes of the cancelling pulse, which are small anyway, become even more negligible after the multiplication by a small peak scale. The advantage of this strategy is that the PCUs associated with the smaller peaks are kept busy for less time, which increases the overall availability of the PCUs for new peaks.

With this in mind since the early design phases of the project, the address generators have been designed so that this strategy could be implemented easily. The generation of shorter pulses, indeed, requires nothing more than programming the start address of the address generator not with 0 (the very first coefficient of the complete cancelling pulse) but with a greater value, corresponding to a later element, which will also be the final address once the up/down counting has completed the descending part. In this way the central element of the pulse, corresponding to the actual peak to cancel, is still present and is still the central element of the shorter pulse, and the address generator sends the end pulse to the PM earlier, thus freeing the corresponding PCU sooner. In order to further reduce the risk of OOB distortion, several pairs of peak scale and pulse length are provided to the PC-CFR module via the i_thr_lvs input port (see Table 3.2 for a possible mapping, corresponding to that of Figure 3.12). In the present implementation, the choice of the mapping between peak magnitude and pulse length is left to the software, in the sense that it is held in a configuration register. Figure 3.12 shows some possible examples of reduced-length cancelling pulses.

Height        max   90+% of max   50+% of max   20+% of max   else
Start addr.   0     75            182           293           402

Table 3.2: Mapping between the height of the detected peaks and the start address in the pulse memory.
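One possible reading of Table 3.2 is sketched below in Python, together with the resulting pulse length. The interpretation is an assumption: the levels are treated as lower bounds expressed as fractions of a reference height max_height (a hypothetical parameter standing for the "max" of the table), and the names start_address and pulse_length are illustrative. The centre address 511 is the one used throughout the text.

```python
# (fraction of the reference height, start address), from Table 3.2.
LEVELS = [(1.0, 0), (0.9, 75), (0.5, 182), (0.2, 293)]
ELSE_START = 402       # start address for every smaller peak
CENTRE = 511           # address of the central element of the pulse


def start_address(pk_height, max_height):
    """Pick the start address of the address generator for a given peak height."""
    for fraction, start in LEVELS:
        if pk_height >= fraction * max_height:
            return start
    return ELSE_START


def pulse_length(start):
    """Number of coefficients generated when counting start..511..start."""
    return 2 * (CENTRE - start) + 1
```

Under this reading the smallest peaks occupy a PCU for 219 coefficients instead of 1023, that is, for roughly a fifth of the time needed by a complete pulse.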
Another opportunity for improvement comes from the observation that, with the constraint imposed on the ratio between the faster and the slower clocks, only num_ts_c = 4 time slots are available (which, as is known by now, provide as many PCUs, and therefore as many simultaneous cancelling pulses). To overcome this limit, some hardware redundancy has been introduced, at the cost of some more area on the ASIC (see Figure 3.13). The number of rows in the PCU table has been increased to 8, and so has the number of address generators. At every time slot, two paths are now active in parallel: two rows are read from the PCU table (the first and the fifth, and so on) and the two matched address generators are enabled in parallel, each connected to one of two independent pulse memories storing the same coefficients. Basically, the hardware resources in the Peak Manager have been doubled. This approach is parameterizable: by changing some constants in the pc_cfr_pk.sv package file, the number of PCUs can be increased further at the expense of more hardware, so it is possible to have any multiple of num_ts_c PCUs in the PC-CFR module, up to the point at which the hardware complexity of the module is no longer justified by the expected density of peaks to cancel.
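The pairing of rows and branches can be made explicit with a one-line rule. The Python sketch below assumes that branch b serves rows b * num_ts_c to b * num_ts_c + num_ts_c - 1, which is consistent with "the first and the fifth" rows being read together; the function name rows_for_slot and its parameter names are hypothetical.

```python
def rows_for_slot(slot, num_ts=4, num_branches=2):
    """PCU table rows read in parallel during a given time slot, one per branch."""
    return [branch * num_ts + slot for branch in range(num_branches)]
```

With the default values, time slot 0 reads rows 0 and 4 and time slot 3 reads rows 3 and 7, for a total of num_ts * num_branches = 8 PCUs.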

Figure 3.12: Several possible choices for incomplete cancelling pulses, showing magnitudes only. Note that they are progressively shorter going from A to D, corresponding to peaks from the highest to the smallest respectively. The points of start and finish for the limited pulses are chosen at the passing through zero or at very small values of the magnitude in order to minimize the discontinuities and the OOB emissions. [Plots not reproduced: four panels, A to D, of magnitude (0 to 1) versus coefficient number, with horizontal axes ticked from 200 to 800 (A and B), from 300 to 700 (C) and from 450 to 600 (D).]

Figure 3.13: Still four time slots available, but this time up to two cancelling pulses can be generated at each time slot by these two branches working in parallel, for a total of eight maximum cancelling pulses simultaneously (the idea can be extended to any number of branches N, yielding N*4 cancelling pulses). [Schematic not reproduced: branch 0 (address generators 0 to 3, cancelling pulse memory 0, CMUL0, CORDIC0, Σ0) sends its output to CS0; branch 1 (address generators 4 to 7, cancelling pulse memory 1, CMUL1, CORDIC1, Σ1) sends its outputs to CS1, CS2 and CS3.]


Chapter 4

Future work and suggested improvements

4.1 Programmable or dynamic CS–PCU mapping

The mapping between the Clip Stages and the PCUs in the PCU table of the Peak Manager (stored in the CS field) is, in the presented implementation, static, and can only be changed by modifying the SV code. This lack of flexibility could easily be removed by reading the mapping from a configuration register. For example, an initial configuration such as 4, 2, 1, 1 (4 PCUs for the first CS, 2 for the second and so on) might change at run time to, say, 2, 2, 2, 2 (see Figure 4.1). Such a change can be useful depending on the density of the detected peaks at the various Clip Stages, on the probability of peak regrowth, and on the need or decision to reduce the peaks progressively instead of trying to cancel them in a single pass (see 3.1). The reconfiguration could be software programmable, or performed automatically on the basis of the statistics collected by the various Clip Stages.

Figure 4.1: A mapping between PCUs and Clip Stages at a certain instant is reconfigured in order to redistribute the PCUs among the Clip Stages, according for example to some changes in the input signal. [Diagram not reproduced: PCU0 to PCU7 assigned to CS0 to CS3, shown in the initial arrangement and after automatic reconfiguration or reprogramming.]

4.2 Bypassable PC-CFR module

The delay introduced by every module in a communication chain is an undesirable but unavoidable characteristic. As has been shown, the PC-CFR is no exception: every Clip Stage it is made of introduces a significant latency, whose main component is the group delay of the cancelling pulse. The delay is present even when no peaks are detected, and the periods without peaks are (hopefully) significantly longer than those with some peak activity. A possible improvement to the basic PC-CFR module that could partially mitigate this problem is the insertion of an observer stage before the PC-CFR, with the task of inserting or bypassing the module, totally or partially, according to the presence of peak activity. Detecting such peak activity is exactly the task performed by the Peak Detector already present inside every Clip Stage (preceded by a CORDIC, which will not be mentioned again); the Peak Detector in the first CS (PD0 in Figure 4.2) could therefore be used as the observer of the input signal. During the periods in which the PC-CFR is bypassed, the module (with the exclusion of PD0, which must stay on in order to be able to resume operation) could also be clock-gated in order to save power. On the other hand, it should be noted that switching between configurations with different delays might not be well tolerated by some of the more recent communication protocols; more research is therefore needed to evaluate the feasibility and convenience of this feature.

Figure 4.2: A possible insertion of a mechanism to bypass the PC-CFR module if a period of absence of peaks is detected, in order to reduce the delay on the signal path and save power. PD0 is the Peak Detector of the first Clip Stage (CS0). [Diagram not reproduced: PD0 observes the I/Q input and drives both a select signal, which inserts or removes the PC-CFR module from the signal path, and the turning on and off of the module.]

4.3 Clip Stages with different delay memories and cancelling pulses length

Another opportunity to reduce the delay of the PC-CFR comes from the analysis of the components of this delay. As discussed, the greatest part of it is due to the generation of the cancelling pulse itself, and this group delay is directly proportional to the number of coefficients constituting the pulse. In general, the later Clip Stages of a well balanced CFR are expected to deal with smaller peaks, because of the effect already applied by the previous Clip Stages. As a consequence, smaller cancelling pulses will be generated, and they can also be made shorter (see the observations behind the choice of implementing variable length cancelling pulses in 3.2.2) in order to reduce the overall delay of the PC-CFR module.

In 3.2.2 the motivation is not the reduction of the delay but the increased availability of the PCUs due to the shorter occupancy of the HW resources. In the implemented solution the delay of a Clip Stage is always the maximum delay, the one needed by the longest (complete) cancelling pulse corresponding to the highest peak that the module is expected to receive, even when shorter pulses are generated, because the delay memory is statically configured and cannot change its delay as a function of the pulses (consider that the same Clip Stage can generate several cancelling pulses of different length at the same time, so it could not provide different delays at the same time in any case). Here, instead, later CSs might feature shorter delay memories because they use shorter maximum cancelling pulses. They therefore introduce less net delay, but also greater OOB emissions because, as stated repeatedly, the shorter the cancelling pulse, the broader the flanks of its Fourier spectrum, which as a consequence matches the spectrum of the signal more poorly. A final FIR filter can then take care of these emissions (see Figure 4.3 for two comparative scenarios). Of course, several combinations of shorter and normal length pulses may be explored in order to find an optimal configuration.
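The two totals quoted in Figure 4.3 can be checked with a few lines of Python. The sketch only accounts for the group delay of each stage (half the length of its cancelling pulse) plus the group delay of the optional FIR filter, and neglects the smaller delay components; lengths are expressed in units of N, and the function name total_delay is hypothetical.

```python
from fractions import Fraction


def total_delay(pulse_lengths, fir_group_delay=0):
    """Sum of the group delays of the Clip Stages, plus that of a final FIR.

    Each Clip Stage contributes half the length of its cancelling pulse.
    """
    return sum(Fraction(length) / 2 for length in pulse_lengths) + fir_group_delay


half = Fraction(1, 2)
base = total_delay([1, 1, 1, 1])                       # four full pulses
alternative = total_delay([1, half, half, half], half)  # shorter pulses + FIR
```

Here base evaluates to 2 and alternative to 7/4, in agreement with the 2N and 7/4 N of the figure: the three shortened stages save N/4 each, and the FIR filter gives back N/2 of that saving.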

4.4 Truncation of cancelling pulses

[Figure 4.3 diagram. Base configuration: CS0, CS1, CS2, CS3, each with delay N/2 (each stage uses a full pulse), total delay ≈ 2N. Alternative configuration: CS0 with delay N/2, then CS1, CS2, CS3 with delay N/4 each, then a FIR filter with delay N/2 (one full pulse, the others half width, plus FIR), total delay ≈ 7/4 N.]

Figure 4.3: In the base configuration, four Clip Stages use a full length cancelling pulse of N elements, yielding a total delay of approximately 2N time units. The alternative configuration uses the full cancelling pulse only for the first stage and half length pulses for the successive stages, so that the overall delay (delay of the CSs plus the group delay of the FIR filter) is only 7/4 N. A final FIR filter can be inserted to reduce the emissions introduced by the shorter cancelling pulses; its delay has been taken into consideration.

When two or more closely spaced peaks are being cancelled by their respective pulses, the pulses may overlap considerably, because the length of a cancelling pulse is usually much greater than the distance that might be observed between successive peaks, especially when short PSWs are used. The consequence is that the peaks and the neighbor input samples are lowered more than required (recall that, ideally, the peaks should settle exactly at the threshold value after the cancellation), thus introducing considerable in-band distortion. In Figure 4.4, two closely spaced peaks are detected and cancelled (PSW1 and PSW2 are the two corresponding Peak Search Windows, which are contiguous). A possible expedient to mitigate this problem is to truncate the cancelling pulse already in progress somewhere close to the middle point between the two successive peaks and to start immediately the generation of the partial pulse for the next one (not from its beginning, but from a proper point). Particular care should be taken in choosing the "breaking" point between the cancelling pulses, because this discontinuity will translate into OOB emissions. This procedure also provides the benefit of using a single PCU for two cancelling pulses; the more involved part is the careful management of the PCU table and especially of the address generators.
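The effect of the truncation can be illustrated with a small MATLAB sketch. It is not taken from the design: the Hann-shaped real pulse, the peak positions and the excess values are hypothetical, and the break point is simply placed half way between the two peaks.

```matlab
% Sketch of the truncation idea with a hypothetical real-valued pulse.
% Assumptions: Hann-shaped pulse with unit peak, two peaks 20 samples apart.
L = 101;                                     % pulse length (odd, so it has a centre sample)
h = (L - 1) / 2;                             % half length
pulse = 0.5 - 0.5 * cos(2 * pi * (0:L-1) / (L - 1));   % unit peak at index h+1

n_tot = 300;
p = [140, 160];                              % peak positions (hypothetical)
e = [1.0, 0.8];                              % excess over the threshold at each peak

% Full, overlapping pulses: each one is centred on its own peak
c1 = zeros(1, n_tot);  c1(p(1)-h : p(1)+h) = e(1) * pulse;
c2 = zeros(1, n_tot);  c2(p(2)-h : p(2)+h) = e(2) * pulse;
c_full = c1 + c2;

% Truncated version: pulse 1 up to the break point, pulse 2 afterwards
brk = floor((p(1) + p(2)) / 2);              % middle point between the peaks
c_trunc = [c1(1:brk), c2(brk+1:end)];

over_full  = c_full(p)  - e;                 % over-cancellation at the two peaks
over_trunc = c_trunc(p) - e;                 % each peak sees only its own pulse
jump = abs(c_trunc(brk+1) - c_trunc(brk));   % discontinuity at the break point
```

With the overlapping pulses both peaks are lowered by more than their excess, while the truncated sequence hits both exactly. The non-zero value of jump is the discontinuity that would show up as OOB emissions, which is why the choice of the break point matters.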

4.5 Variable length Peak Search Window

[Figure 4.4 plot: "Original signal (red) vs peak-cancelled signal (black)"; magnitude (×10^4, from 1.7 to 2.7) versus sample index (450 to 650), with Peak 1, Peak 2, PSW1, PSW2 and the threshold marked.]

Figure 4.4: This interval shows the input (red) and the output (black) signals of the peak cancellation algorithm. PSW1 and PSW2 are the two (contiguous) Peak Search Windows associated with the two closely spaced detected peaks, so the respective cancelling pulses overlap for a considerable number of samples. The notable effect is the excessive reduction of the power of the signal in the affected interval.

Sometimes the peaks arrive in bursts. If these peaks are tackled individually, as in the implemented design, the pool of available PCUs will soon be depleted, with the consequence of a possible peak leak. If this case is detected, it might be convenient to let the search window length increase (up to a certain maximum) so that it embraces a larger number of input samples and detects and cancels a single peak (or, in any case, fewer peaks) among them, instead of isolating more peaks and cancelling them individually (see Figure 4.5). It is useful to remember that the Peak Cancellation algorithm exactly cancels only the peak sample, not its neighbors; therefore, by enlarging the PSW too much and thus cancelling fewer peaks (given the same amount of input elements for comparison), the resulting cancellation will be less accurate and will eventually require more passes through the algorithm. As for the CS/PCU mapping already discussed, the configuration of the PSW of the various stages may be software programmable, or it can be given some degree of automatism based on the observation of the peak occurrence at the first Clip Stage.
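The dependence of the number of detected peaks on the PSW length can be reproduced with a simplified peak search. The sketch below is an assumption about the search rule (a window opens at the first sample above the threshold and only its largest sample is reported), not the Peak Detector of the design, and the input data are hypothetical.

```matlab
% Simplified peak search: count the peaks found for two PSW lengths.
% Assumption: a window opens at the first sample above the threshold and
% the largest sample inside it is reported as the single peak.
mag = 2.0 + 0.1 * sin(2 * pi * (0:199) / 50);       % slow envelope (hypothetical data)
mag([60, 80, 100, 120]) = [2.5, 2.6, 2.55, 2.45];   % a burst of four peaks
threshold = 2.3;

psw_list  = [16, 32];
num_peaks = zeros(size(psw_list));
peaks     = cell(size(psw_list));
for k = 1:numel(psw_list)
    n = 1;
    while n <= numel(mag)
        if mag(n) > threshold
            last = min(n + psw_list(k) - 1, numel(mag));
            [~, idx] = max(mag(n:last));            % position of the maximum in the window
            peaks{k}(end + 1) = n + idx - 1;
            num_peaks(k) = num_peaks(k) + 1;
            n = last + 1;                           % resume the search after the window
        else
            n = n + 1;
        end
    end
end
```

With these data the 16-sample window isolates all four peaks, whereas the 32-sample window reports only two of them and leaves the samples at positions 60 and 120 to be reduced only by the tails of the pulses, which is the loss of accuracy mentioned above.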

4.6 Priority-based acceptance of peaks

[Figure 4.5 plots: two panels of signal magnitude (×10^4, from 2 to 2.6) versus sample index (480 to 600). Left panel: "PSW length = 16 => 4 peaks detected" (Peak 1 to Peak 4). Right panel: "PSW length = 32 => 2 peaks detected" (Peak 1 and Peak 2).]

Figure 4.5: On the left, a Peak Search Window of 16 elements is configured. As a consequence, four peaks are detected and as many PCUs are used. On the right, the same input interval is passed through a PSW of 32 elements, and only two peaks are detected. In this case, half as many PCUs are needed as in the first scenario.

In a further effort to optimize the utilization of the PCUs, it could be convenient to interrupt the generation of a cancelling pulse, according to some criterion, if all the PCUs are busy and a new peak is detected. It is reasonable to assume that higher peaks are more "dangerous" than smaller ones, in the sense that if they leaked and reached the Power Amplifier, they would cause a larger non-linear distortion. But when peaks are detected by the Peak Detector of the Clip Stages, they are all given the same importance: if PCUs are available, they are all accepted and cancelled completely (with longer or shorter cancelling pulses according to their magnitude, as discussed in 3.2.2). It is reasonable, instead, to give priority to higher peaks over smaller ones. The difficulty is that the presence and height of a peak are unknown until it is detected, and at that point it is too late to start cancelling it if all the PCUs are already busy. So, if a very high peak is detected just after the last PCU has been assigned to a much smaller one, it will leak and eventually reach the PA. The proposed improvement is instead based on a priority attributed to the newly detected peak in relation to the "importance" of the cancelling pulses that are being generated, all of this as a function of time. If the detected peak is "more important" than at least one of the cancelling pulses keeping the PCUs busy, the least "important" of those pulses could be interrupted to accommodate the newly arrived peak. The condition could be the comparison between two numbers:

on the new peak side, the height of the peak itself is the metric that, as highlighted above, determines the severity of the consequences on the PA. On the cancelling pulses side, the priority could be defined as the product of the height of the peak being cancelled and a number that depends on the portion of the cancelling pulse being generated at the moment the new peak is detected. This can be justified by observing that interrupting a cancelling pulse at its central point entails more severe consequences in terms of OOB distortion than interrupting it on one of the tails; however, if the newly arrived peak is high enough, it might still be worth truncating a cancelling pulse even in the middle. A product therefore gives a measure that takes both aspects into consideration. In Figure 4.6, some "weights" for the various parts of the cancelling pulse are proposed; the fine tuning of such weights could be an object of further study. The priority so defined, the product of the height of the cancelled peak and these weights, is therefore a function of time. It could be stored as an additional field of the PCU table and updated every time the generation of the corresponding cancelling pulse passes from one of the regions of Figure 4.6 to another. Note that the weights have been chosen as powers of two in order to minimize the effort needed to compute the priorities (only a shift of the respective peak scale is needed).
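The comparison described above can be sketched as follows. The weights 1, 4, 8, 4, 1 are those suggested in Figure 4.6; the split of the pulse into five regions of equal length, the table contents and all variable names are hypothetical.

```matlab
% Sketch of the priority-based pre-emption test.
% Assumptions: five equal-length regions with weights 1, 4, 8, 4, 1;
% the contents of the PCU table are hypothetical.
weight_shift = [0, 2, 3, 2, 0];          % log2 of the weights 1, 4, 8, 4, 1
pulse_len    = 1000;                     % samples in a complete cancelling pulse

% One entry per busy PCU: height of the peak being cancelled and index of
% the pulse sample currently being generated
peak_height = [300, 120, 900, 60];
progress    = [ 50, 500, 950, 700];

region   = min(floor(progress * 5 / pulse_len) + 1, 5);   % region index 1..5
% In hardware this product is a left shift by weight_shift(region) bits
priority = peak_height .* 2 .^ weight_shift(region);

new_peak = 250;                          % height of the newly detected peak
[min_prio, victim] = min(priority);      % least "important" pulse in progress
preempt = new_peak > min_prio;           % interrupt it only if the new peak wins
```

In this example the fourth PCU is cancelling a small peak and is already past the centre of its pulse, so it has the lowest priority and is the one that would be released for the new peak.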

4.7 Generation of multiple cancelling pulses from the same time slot

[Figure 4.6 plot: a cancelling pulse (amplitude from -0.1 to 0.9, about 1000 samples long) divided into five consecutive regions with weights 1, 4, 8, 4, 1.]

Figure 4.6: A possible association between portions of the cancelling pulse being generated and weight/importance. Interrupting a cancelling pulse around its central part has more severe consequences than stopping it at a very early stage or when it is about to end anyway.

The input signal passes through all the Clip Stages of the PC-CFR module. It is useful to think of each CS as monitoring the signal at a different instant of time; when two or more of them are in the PEAK_SEARCH state of the respective Peak Detector, they will eventually notify the PM about the detected peaks. Now, if the displacements inside the corresponding search windows are the same, the generation of the cancelling pulses will use exactly the same addresses for all the peaks. The peak scales and phases will of course still be different for each detected peak, but the unscaled cancelling pulse path will output exactly the same coefficients for all of them at the same instants. Therefore, as depicted in Figure 4.7, in these cases a single time slot could be used for the generation of multiple pulses. Note that, in order to accommodate this solution, each row of the PCU table must hold the information about as many peaks as the number of Clip Stages present in the system, because in the best case all the Clip Stages will detect a peak at the same time. When the time slot relative to the multiple peaks arrives, the entire row is read so that all the peak information is sent to as many complex multipliers (so some hardware redundancy is needed, namely in the complex multipliers and the CORDICs), while only one address generator is used. So, basically, up to num_CS_c cancelling pulses can be generated in every time slot. The event this solution is based upon is the simultaneous detection of a peak by more than one Clip Stage in the same position of their respective search windows. This event might not happen with a high enough probability to justify the increase in hardware complexity and design effort, so the condition needed for the mechanism to trigger may be relaxed: even when the peaks are not detected with exactly the same displacement (in other words, not at the same relative distance from the beginning of the search window), but their displacements still differ by no more than some programmable interval ("tolerance"), the system will still produce multiple cancelling pulses from a single PCU. By doing this, clearly not all the peaks will be cancelled exactly, because the central element of the cancelling pulse will be synchronized with one peak only, and the other detected peaks will receive a cancelling pulse displaced by a number of samples equal (in the worst case) to the length of the tolerance interval, which could be a few elements. This approximation can be tolerated if we observe that the characteristics of the input signal samples are similar for closely spaced elements; therefore the cancellation of a peak with a not perfectly aligned cancelling pulse may still provide a substantial benefit.

[Figure 4.7 diagram: on the left, the peaks seen by CS0, CS1, CS2 and CS3 and the resulting cancelling pulse; on the right, the PCU table, whose rows feed the address generators, the CORDICs and the adders.]

PCU table shown in the figure:

Time slot   busy   sca0   pha0   sca1   pha1   sca2   pha2   sca3   pha3
0           1      123    -45    229    78     93     12     312    -321
1           0      0      0      0      0      0      0      0      0
2           0      0      0      0      0      0      0      0      0
3           1      43     55     148    -654   345    -221   432    331

Figure 4.7: On the left of the Figure, the input signal as it appears to the Clip Stages at the moment of the peak detection. Note that the peaks are not perfectly aligned to the same relative position. All the information about the four detected peaks is sent to the generation of the cancelling pulses during a single time slot.
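The relaxed trigger condition of this section can be sketched as a simple grouping test. The displacement values, the tolerance and the choice of aligning the shared slot to the peak of CS0 are all hypothetical; the sketch also assumes one detected peak per Clip Stage.

```matlab
% Sketch: decide which Clip Stages can share one time slot.
% Assumptions: one detected peak per Clip Stage, hypothetical displacement
% values, and a programmable tolerance expressed in samples.
displacement = [12, 13, 27, 11];     % peak position inside the PSW of CS0..CS3
tolerance    = 2;                    % maximum accepted misalignment

reference  = displacement(1);        % the shared slot is aligned to the peak of CS0
share_slot = abs(displacement - reference) <= tolerance;   % stages served by the slot
misalign   = displacement(share_slot) - reference;         % residual offsets, in samples

slots_needed = 1 + sum(~share_slot); % every other peak needs a slot of its own
```

Here three of the four peaks share a slot, with a residual misalignment of at most one sample, and two time slots are used instead of four.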


Chapter 5

Results and conclusions

5.1 Comparative synthesis results

Several configurations (i.e. different numbers of Clip Stages with different numbers of PCUs) of the designed PC-CFR have been successfully synthesized in order to explore how the logic area, the memory area and the gate count scale. The results are compared with a particular configuration of the Ericsson Turbo Clipping. As Figure 5.1 shows, the much lower area occupancy of the PC-CFR, for both the memory and the logic gate parts, makes it an interesting alternative for the CFR module in future Ericsson ASICs.

Figure 5.1: Comparative results between one Turbo Clipping and three PC-CFR configurations.

The data shown in the table should be considered, though, only after some observations. The most complete configuration of the PC-CFR that has been taken into consideration (and synthesized) provides only 8 PCUs, which translates to only 8 simultaneous cancelling pulses. On the other hand, the SV code is parameterized so that it is possible to include as many PCUs as needed, in groups of num_ts_c elements at a time, of course at the price of increased gate count and memory area. The fact that the design has been successfully synthesized with 8 PCUs (that is, with 2 time-division shared PCUs) gives reasonable confidence that it could be synthesized under the same frequency constraint even with more PCUs, because such modules operate in parallel, so they do not contribute to the cumulative delay of the critical path of the design (see footnote 1). A reasonable total number of PCUs to be included inside the PC-CFR in order to be comparable with the TC for some realistic and useful carrier configurations of the input signal would be 32, which means 8 time-division shared PCUs, or 8 HW structures like those in Figure 3.9.

Another important aspect that has been neglected so far is that the input and output data passing through the various Clip Stages are discrete-time, quantized signals originating from the sampling of analog signals, and that the final output of the processing chain of which the PC-CFR is part will be converted back to analog form before being sent to the Power Amplifier and transmitted through the antenna array. Although the sampling frequency of the input signals has been chosen so that the Nyquist-Shannon theorem is satisfied, the PC-CFR has no certainty that the samples it sees and uses to detect peaks correspond to the actual maxima and minima (and therefore the maximum magnitudes) of the underlying analog signal. In other words, the algorithm bases the detection of peaks and the computation of the peak scales on the values of the samples of the true analog signal; it then generates the cancelling pulses and cancels the detected peaks according to these values, but the signal that actually reaches the Power Amplifier is the analog one, whose true peaks might pass undetected between two successive samples. The result is that the true PAPR of the analog signal is always greater than or equal to the PAPR of the discrete-time signal the PC-CFR has worked on.
In order to address this issue, two solutions are usually considered: fractional delay filters can be interposed between every pair of successive Clip Stages, or the Peak Detection module can be made to work not on the data signal as it enters the Clip Stage, but on an interpolated version of it (and therefore at a higher sampling frequency). Without entering into details, both solutions entail an increase in the complexity of the CFR, translating as usual into higher gate count and larger area. The Ericsson Turbo Clipping, being a FIR-based CFR, does not suffer from the shortage of PCUs as the PC-CFR does, and it fully takes into account the problem of the "hidden" peaks between samples; the higher area figures of the TC are therefore also due to this added complexity. Nevertheless, the numeric differences between TC and PC-CFR in Table 5.1 are so prominent that, even taking into account all the observations discussed, the PC-CFR is still an attractive solution.
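The "hidden" peak effect can be made concrete with a toy example. The test signal below is hypothetical (a real tone with 8 samples per period, sampled half a sample away from its maximum), and a grid 8 times denser simply stands in for the analog signal; no interpolation filter of the actual design is modelled.

```matlab
% Illustration of a peak hidden between samples (hypothetical test signal).
% The tone has 8 samples per period and is sampled with an offset of half
% a sample, so no sample falls on the true maximum.
papr_db = @(x) 10 * log10(max(abs(x) .^ 2) / mean(abs(x) .^ 2));
tone    = @(t) cos(2 * pi * (t - 0.5) / 8);

t_coarse = 0:63;                 % sampling grid seen by the algorithm
t_fine   = 0:0.125:63.875;       % 8x denser grid, standing in for the analog signal

papr_coarse = papr_db(tone(t_coarse));
papr_fine   = papr_db(tone(t_fine));
fprintf('PAPR on samples: %.2f dB, on dense grid: %.2f dB\n', papr_coarse, papr_fine);
```

The dense grid shows a PAPR of about 3.01 dB against about 2.32 dB measured on the samples alone, so a peak detector working only on the coarse samples would underestimate the peak by roughly 0.7 dB in this case.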

5.2 Some input and model configuration exploration

This section explores the behaviour of the PC-CFR algorithm under several configuration scenarios (see Table 5.2). A MATLAB model (see Appendix A) has been written in such a way that it is sample- and bit-accurate with the SV design project.

1 The critical path is the slowest point-to-point path present in a given circuit. If the cumulative delay of such a path can fit into the fastest clock (that is, the smallest clock period), then the synthesizer is capable of synthesizing all the parts of the digital subsystem with this given clock frequency.


Figure 5.2: The table summarizing the configuration values and parameters used to test the PC-CFR. The last two columns are the two output values taken into consideration.

This means that the outputs of the MATLAB model are exactly the same as those of the SV simulation, the only difference being the timing aspects (data rate of the input and output signals, delays, etc.), which are not modeled. This model has been used in this section for the purposes mentioned above. For all the configurations the following characteristics are fixed:

• All the Clip Stages have the same threshold

• All the Clip Stages have the same Peak Search Window length

• The PCUs are evenly distributed among the various Clip Stages

One of the objectives of the present section is to expose the limitations of the algorithm, especially as a consequence of the lack of PCUs; the other is to help gain a deeper insight into the relationships among the various aspects of the algorithm. In Table 5.2, 8 configurations are listed. For each of them, the parameters are set in the MATLAB model and then the script is run. The outputs of the simulations that have been reported are the EVM, the "target" PAPR, the effective PAPR of the output signal and the CCDF diagram. The EVM is computed inside the script as:

evm = sqrt(var(canc_pulse_tot)/var(sig_inout_fix(:,1))) * 100;

where canc_pulse_tot is the cumulative cancelling pulse obtained as the sum of all the cancelling pulses of all the Clip Stages. It has no physical counterpart inside the algorithm, since each CS applies its own cancelling sequence individually, but it is needed to compute the EVM because canc_pulse_tot is the only "perturbation" applied to the input signal; it therefore carries the information about how much the algorithm has changed the original signal in order to perform the peak cancellation. All digital signal processing algorithms change the signal to some extent, and the EVM is the quantity that measures such change, which is desirably small. It should be noted, though, that this measurement carries very little (if not misleading) information when the algorithm does not fulfill the required PAPR reduction. As can easily be seen in Figures 5.3 and 5.4, only configurations 2, 4 and 6 fulfill the requirements in a satisfactory way. For the cases in which the PC-CFR cannot cancel all the peaks, the EVM measurement makes little sense and should be neglected. It is, on the other hand, helpful as a comparative gauge among configurations that fulfill the desired PAPR reduction, to guide the choice towards a more convenient configuration, or among several signal processing CFR methods, to estimate the in-band distortion of each. The target PAPR parameter is computed in the script as follows:

PAPR_tar = 10*log10(((threshold(1)/1.6474) .^ 2) /...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)).^ 2)));

That is, it is defined as the ratio between the power corresponding to the threshold value (which is the same for all the CSs in this set-up) and the average power of the output signal of the PC-CFR. This parameter is just a reference value, because a true target value for the output PAPR is hard to define. Perhaps the ratio between the power corresponding to the threshold value and the average power of the input signal clipped to that threshold could be considered a theoretical limit, although it would correspond to an unrealistic and undesired scenario (clipping the input signal as a means of managing the PAPR was discarded immediately because of the OOB emissions). The definition of the output PAPR is straightforward:

PAPR_out = 10*log10((max(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)) /...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)));
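The three expressions can be exercised together on a toy signal. The data below are hypothetical and deliberately trivial (a unit-magnitude tone with one enlarged sample, "cancelled" by a single-sample correction rather than by a real cancelling pulse); only the three formulas are taken from the script.

```matlab
% Self-contained toy run of the three metrics (hypothetical data).
% sig_inout_fix has one column per point of the chain: the input, then the
% output of each Clip Stage. A single stage and a single peak are used here.
num_CS_c  = 1;
n         = (0:255).';
threshold = 1.2 * 1.6474;                      % desired level 1.2, scaled as in the script

sig_in      = exp(1j * 2 * pi * n / 16);       % unit-magnitude tone
sig_in(100) = 2 * sig_in(100);                 % one sample well above the desired level

canc_pulse_tot      = zeros(size(sig_in));
canc_pulse_tot(100) = sig_in(100) * (1 - 1.2 / abs(sig_in(100)));  % brings the peak to 1.2

sig_inout_fix = [sig_in, sig_in - canc_pulse_tot];

evm = sqrt(var(canc_pulse_tot) / var(sig_inout_fix(:, 1))) * 100;
PAPR_tar = 10 * log10(((threshold(1) / 1.6474) .^ 2) / ...
    (mean(abs(sig_inout_fix(:, num_CS_c + 1)) .^ 2)));
PAPR_out = 10 * log10((max(abs(sig_inout_fix(:, num_CS_c + 1)) .^ 2)) / ...
    (mean(abs(sig_inout_fix(:, num_CS_c + 1)) .^ 2)));
```

Because the only peak is brought exactly to the desired level, PAPR_out coincides with PAPR_tar in this toy case; in the real model the two differ whenever a peak leaks or neighboring samples remain above the threshold.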

The CCDF (Complementary Cumulative Distribution Function) is a representation of the distribution of the instantaneous-to-average power ratio in a given signal interval. It can be interpreted as follows: for each value on the x-axis (representing values of the ratio between instantaneous and average power of the signal, expressed in dB), the corresponding value on the y-axis is the relative frequency (see footnote 2) of the signal elements having such a ratio greater than or equal to the value on the x-axis (in other words, the fraction of samples whose instantaneous-to-average power ratio is greater than or equal to the value on the x-axis). Each graph of Figures 5.3 and 5.4 shows in red and in black the CCDFs of the input and the output signals respectively. The blue vertical line represents the target PAPR as discussed earlier. It should be kept in mind that the target PAPR is not only a function of the threshold but also of the average power of the output signal after the application of the algorithm, therefore we cannot expect it to be the same among configurations having the same threshold. A well working PC-CFR should yield an output CCDF clearly to the left of the input CCDF and approaching the blue line. When this does not happen, it is because one or more peaks have leaked and contribute to raising the values of the corresponding relative frequency function (so they "push" the black graph to the right). The peaks leak because, in most of the configurations, the chosen values of thresholds and PSW lengths yield a density of peaks that cannot be tackled by the limited number of PCUs.

2 Intended as the ratio between the number of occurrences of an event and the total number of trials.

In Figure 5.3, 2 Clip Stages are used. The top-left image refers to the configuration with the smallest number of PCUs, which as a consequence completely fails to cancel all the peaks. In addition, the overlapping of the cancelling pulses actually makes things worse, so that the output signal shows an even higher PAPR than the input signal. The top-right image refers to the case with 8 PCUs, which indeed successfully reaches the desired PAPR. But when the threshold is lowered (that is, when a stricter requirement is imposed on the desired PAPR reduction), this configuration also fails (bottom-left image of Figure 5.3). The simple enlargement of the Peak Search Window from 32 to 64 samples reduces the density of the detected peaks to the point that the available PCUs are again sufficient to deal with the peaks, and so the configuration performs decently (bottom-right image).
It should be noted, though, that the enlargement of the PSW comes at the price of a less accurate cancellation of the input elements that are close to the peak, so their magnitude does not always fall under the threshold and the desired PAPR requirement is not exactly met.
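The CCDF described in this section can be computed empirically as sketched below. This is not the plotting code of the model: the deterministic multi-tone test signal, the subcarrier indices and the dB grid are hypothetical.

```matlab
% Sketch of an empirical CCDF of the instantaneous-to-average power ratio.
% Assumption: x may be any complex baseband vector; a deterministic
% multi-tone signal (hypothetical subcarrier indices) is used here.
n = (0:1023).';
k = [3, 7, 12, 20, 31, 45];
x = sum(exp(1j * 2 * pi * n * k / 1024), 2);   % six tones, all in phase at n = 0

ratio_db = 10 * log10(abs(x) .^ 2 / mean(abs(x) .^ 2));   % per-sample power ratio in dB

x_axis = 0:0.5:10;                             % values on the x-axis, in dB
ccdf   = zeros(size(x_axis));
for i = 1:numel(x_axis)
    ccdf(i) = mean(ratio_db >= x_axis(i));     % fraction of samples at or above the value
end

papr_db = max(ratio_db);                       % the CCDF drops to zero beyond this value
```

For this signal the six tones add coherently at the first sample, so the PAPR equals 10*log10(6) dB and the CCDF is zero for every x-axis value above it. A leaked peak in the output of the PC-CFR would show up in the same way, as a CCDF tail extending to the right of the target PAPR.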

5.2.1 Observations

Some observations arise from Figure 5.4. Each image in the Figure has the same model parameters and configuration as the image in the same position in Figure 5.3, the only difference being the number of Clip Stages (this time there are 4 CSs). The combination of the number of PCUs per CS (two) and the length of the PSW yields the unexpected result of the bottom right image in the Figure. The large PSW cancels the peak element but, as already observed, leaves some elements over the threshold, which are detected as peaks by the following stages. These very small peaks occupy all the PCUs and prevent the algorithm from taking care of the bigger one, which is basically never cancelled and leaks to the end of the algorithm. This did not happen with only 2 Clip Stages with 4 PCUs each, because 4 PCUs were enough to deal with both the smallest peaks and the big one. The input and output signals of configuration 8 are represented in Figure 5.5. The experiment makes it clear that the configuration of the PC-CFR is a delicate operation in which some non-independent parameters interact in ways that are not obvious.


(a) threshold = 22500, PSW length = 32, 4 PCUs for the left image, 8 PCUs for the right image

(b) 8 PCUs, threshold = 22000, PSW length = 32 for the left image, 64 for the right image

Figure 5.3: Configurations from 1 to 4 according to Table 5.2. All the configurations use 2 CSs; the two top images use a threshold of 22500, the two bottom images use 22000.


(a) threshold = 22500, PSW length = 32, 4 PCUs for the left image, 8 PCUs for the right image

(b) 8 PCUs, threshold = 22000, PSW length = 32 for the left image, 64 for the right image

Figure 5.4: Configurations from 5 to 8 according to Table 5.2. All the configurations use 4 CSs; the two top images use a threshold of 22500, the two bottom images use 22000.


Figure 5.5: Input and output signals for configuration 8, with an uncancelled peak clearly visible (peak leak).


Appendix A

The MATLAB golden model

A model of the described RTL implementation of the PC-CFR has been written in MATLAB. The purpose of the model is to replicate the behavior of the design as faithfully as possible, so no priority has been given to performance or memory efficiency in its development, although pre-allocation and vectorization of the two CORDIC functions are used. A properly chosen interval of the data file 2carriers.dat has been used as input for both the model and the SV testbench, and the outputs of the model and of the simulation have been compared in order to check for a full 100% match. This condition has been considered as proof of the consistency between the model and the RTL implementation. The code of the script, as well as the functions it depends upon, is listed in this Appendix. The only missing files are the input data and the cancelling pulse coefficients. The segment of input data taken from the much larger 2carriers.dat has been chosen because it exhibits a variety of peak densities and envelope shapes useful to show several characteristics and limitations of the PC-CFR (see Figure A.2 for the data segments before and after the PC-CFR).

The script starts in Listing A.1, where some parameters for the PC-CFR configuration (thresholds, PSW lengths, number of CSs and PCUs) are set. Modifying these and running the script makes it possible to explore the performance of the PC-CFR both for different PAPR reduction requirements (by changing the threshold) and for several HW configurations (by selecting the number of CSs, the distribution of PCUs among them, etc.). Input data and coefficients are loaded from external files and are closely related to each other (if the carrier configuration of the input data stream changes, a new, corresponding set of coefficients must be used for the cancelling pulse). Finally, the signals sig_inout_fix and canc_pulse_tot are defined. The first holds all the successive signals of the data path chain, from the input up to the output of the last stage (i.e. the output of the PC-CFR).


Listing A.1: Part 1 of the MATLAB script. Configuration of the model and loading of input data segment and pulse coefficients.

%% This is the golden model for thePC-CFR %********* % Part1 * %********* % Parameters for the model % num_CS_c is number of Clip Stages. It translates into number of % iterations of thePC-CFR algorithm threshold is an array of thresholds % for each Clip Stage. The desired values are to be multiplied by the % value1.6474 to take into account the scaling effect introduced by the % CORDIC during the conversion from the rectangular to the polar forms. % psw is an array of Peak Search Window lengths for each Clip Stage % num_PCU_c represents how the PCUs are distributed among the Clip % Stages(it is actually not parameterized, since it describesa4 Clip % Stages case). num_cdc_iter is the number of iterations the CORDIC will % do. The value must match the number of elements of the array % lut_table, which contains the amount of rotations for each iteration %(in radians * 2^10). These two last parameters have been inserted % directly into the code of the two CORDIC function in order to better % vectorize them. pk_table maps the peak scales with the starting % elements of the cancelling pulse. This models the variable length % cancelling pulses clear; num_CS_c = 4; threshold = zeros(1, num_CS_c); psw = zeros(1, num_CS_c); num_PCU_c = [4, 2, 1, 1]; for i = 1:num_CS_c threshold(i) = 37066;%= 22500 *1.6474; psw(i) = 32; end pk_table = [1000, 2000, 4000, 5000, 131071; 403, 294, 183, 76, 1];

%% Read the input data file %a subset of 5600 elements is isolated and used as input of the model fid = fopen('input_data/2carriers.dat','r'); data = textscan(fid,'%f%f'); fclose(fid); i_data = data{1}; i_data = i_data(278751:284350); q_data = data{2}; q_data = q_data(278751:284350); iq_data = complex(i_data, q_data); data_length = length(iq_data); clear data i_data q_data fid;

%% Load the cancelling pulse coefficients, in mag/phase form load('pulse_coeffs'); clear coeffs;

%% The matrix sig_inout_fix stores all the signals of the chain from the
% input to the output, including the outputs of the intermediate CSs
sig_inout_fix = zeros(data_length, num_CS_c+1);
sig_inout_fix(:, 1) = iq_data;

%% We iterate the algorithm over num_CS_c times. Each iteration uses the
% output of the previous iteration as its input, in order to model a
% cascade-like structure. canc_pulse_tot stores the cumulative
% cancelling pulse signals, composition of all the cancelling pulses of
% all the stages. It is needed for the computation of EVM at the end
% of the algorithm
canc_pulse_tot = zeros(data_length, 1);

In Listing A.2 the actual iterations start, modeling the cascade-like structure of the Clip Stages. First, the CORDIC exposes the magnitude of the samples so that the subsequent Peak Detector can find the peaks (this snippet of code populates a table with all the peak characteristics). The next snippet defines, for each detected peak, the starting and ending indexes of the corresponding cancelling pulse interval, including the special cases in which one of the interval limits falls outside the range of the input signal. A very important snippet of code follows: here the algorithm checks whether a peak can be cancelled, given the availability of PCUs. If it cannot, the peak simply leaks, which means that no cancelling pulse is generated for it in the following part of the script. If it can, a counter is incremented for the whole duration of the corresponding cancelling pulse interval, so that the PCU availability seen by the following peaks takes this into account (one less PCU available).
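The PCU bookkeeping described above can be sketched in isolation. The following is a minimal illustration and not the golden model itself: the peak positions, the busy time per peak and the PCU count are made-up values, and the occupancy interval is simplified to start at the peak position rather than at the end of the PSW.

```matlab
% Hypothetical toy values, chosen only for illustration
num_pcu   = 2;               % PCUs available in this Clip Stage
busy_len  = 10;              % samples a PCU stays busy per peak (assumed)
peak_pos  = [5, 8, 11, 30];  % detected peak positions (assumed)
sig_len   = 50;              % length of the toy signal

occupancy = zeros(sig_len, 1);      % number of PCUs in use at each sample
leaks     = zeros(size(peak_pos));  % 1 = peak leaks, 0 = peak is cancelled

for n = 1:numel(peak_pos)
    first = peak_pos(n);
    last  = min(first + busy_len - 1, sig_len);
    if occupancy(first) + 1 > num_pcu
        leaks(n) = 1;               % no free PCU: the peak leaks
    else
        % reserve one PCU for the whole duration of the pulse
        occupancy(first:last) = occupancy(first:last) + 1;
    end
end
disp(leaks);
```

With these values the third peak arrives while both PCUs are still busy, so it is the only one flagged as leaking, and the occupancy never exceeds the number of PCUs.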

Listing A.2: Part 2 of the MATLAB script. In order: conversion of input samples to polar form, peak detection, definition of cancelling pulses on the input signal and flagging of the peaks that will leak and thus will not be cancelled. The very last part generates a graphical representation of the PCU occupancy divided per CS, as can be seen in Figure A.1.

%*********
% Part 2 *
%*********
for m = 1:num_CS_c
    % CORDIC
    [data_mag_array, data_pha_array] = arrayfun(@cordic_c2p,...
        real(sig_inout_fix(:, m)), imag(sig_inout_fix(:, m)));

    %% Peak detection
    % in the first column we put the absolute index of the detected
    % peak, in the second the value of the scale, in the third the
    % phase, in the fourth the starting address of the relative
    % cancelling pulse according to the scale (check pk_table array).
    % In the fifth, later in the algorithm, will be set a 1 or a 0
    % according to the fact that the relative peak leaks or not
    % (if all the resources of the present CS are already busy).
    % In the sixth we put the number of elements from the peak position
    % to the end of the PSW
    peak_info = zeros(250, 6);
    state = 'IDLE';
    j = 1; % counter inside the PSW
    k = 1; % peak table index
    for i = 1:data_length
        switch state
            case 'IDLE'
                if (data_mag_array(i) > threshold(m))
                    state = 'PEAK_SEARCH';
                    index_temp = i;
                    mag_temp = data_mag_array(i);
                    pha_temp = data_pha_array(i);
                    index_in_psw = 1;
                end
            case 'PEAK_SEARCH'
                j = j + 1;
                if (data_mag_array(i) > mag_temp)
                    index_temp = i;
                    index_in_psw = j;
                    mag_temp = data_mag_array(i);
                    pha_temp = data_pha_array(i);
                end
                if j == psw(m)
                    peak_info(k,1) = index_temp;
                    peak_info(k,2) = mag_temp - threshold(m);
                    peak_info(k,3) = pha_temp;
                    peak_info(k,6) = psw(m) - index_in_psw;
                    index_in_psw = psw(m);
                    j = 1;
                    % Choice of the pulse length according to the
                    % magnitude of the peak scale
                    for q = 1:5
                        if peak_info(k,2) < pk_table(1,q)
                            peak_info(k,4) = pk_table(2,q);
                            break;
                        end
                    end
                    % move to the next row of the peak table
                    k = k + 1;
                    state = 'IDLE';
                    mag_temp = 0;
                    pha_temp = 0;
                    index_temp = 1;
                end
        end
    end
    peak_info = peak_info(1:k-1,:);

    %% Study the density of cancelling pulses needed to deal with peaks
    % for each detected peak, start_end_idx stores the initial and the
    % final indexes of the relative cancelling pulses, if they will be
    % generated (the peak leak condition is checked further in the code)
    start_end_idx = zeros(size(peak_info, 1), 2);
    for n = 1:size(peak_info, 1)
        delay = 512 - peak_info(n,4);
        if ((peak_info(n,1) - delay) <= 0)
            start_end_idx(n,1) = 1;
            start_end_idx(n,2) = peak_info(n,1) + delay;
        elseif ((peak_info(n,1) + delay) >= data_length)
            start_end_idx(n,1) = peak_info(n,1) - delay;
            start_end_idx(n,2) = data_length;
        else
            start_end_idx(n,1) = peak_info(n,1) - delay;
            start_end_idx(n,2) = peak_info(n,1) + delay;
        end
    end

    % This part of code detects whether a peak can be cancelled
    % according to the availability of PCUs for the actual CS (i.e.
    % iteration) or it will leak. In the latter case, no corresponding
    % cancelling pulse will be generated
    peak_leak = zeros(size(peak_info, 1), 1);
    pulse_intervals = zeros(length(iq_data) + 2*psw(m) + 1023, 1);
    for n = 1:(size(peak_info, 1))
        % we manage a counter in a proper interval of the input
        % signal. The counting represents how many PCUs are already in
        % use at that point. The interval starts just after the end of
        % the PSW and lasts for: the displacement between the actual
        % peak index position and the start of the PSW plus 512
        % elements needed to reach the central element of the
        % cancelling pulse plus the remaining elements of the
        % cancelling pulse before the PCU will be free again
        if pulse_intervals(peak_info(n,1) + peak_info(n,6)) + 1 >...
                num_PCU_c(m)
            peak_leak(n) = 1;
        else
            for t = peak_info(n,1)+peak_info(n,6):...
                    peak_info(n,1)+peak_info(n,6) +...
                    (psw(m) - peak_info(n,6)) + 512 + (512-peak_info(n,4))
                pulse_intervals(t) = pulse_intervals(t) + 1;
            end
        end
    end
    peak_info(:,5) = peak_leak;

    figure(1);
    subplot(num_CS_c, 1, m);
    plot(pulse_intervals);
    xlabel('Input signal element index');
    ylabel('PCUs');
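As a worked example of how Listing A.2 turns a peak scale into a cancelling pulse interval, the sketch below repeats the table lookup and the boundary clipping for a single peak. The peak scale and peak index are hypothetical values. The `max`/`min` clipping is a compact stand-in for the `if`/`elseif` chain of the script, assumed equivalent as long as a pulse cannot overrun both ends of the segment at once.

```matlab
% Copied from Part 1 of the script: peak-scale upper bounds (row 1)
% and the matching start addresses in the coefficient table (row 2)
pk_table = [1000, 2000, 4000, 5000, 131071;
             403,  294,  183,   76,      1];

peak_scale  = 1500;   % hypothetical peak excess over the threshold
peak_index  = 100;    % hypothetical peak position in the input segment
data_length = 5600;   % segment length used by the script

% First column whose bound exceeds the scale (same result as the q-loop)
col        = find(peak_scale < pk_table(1, :), 1, 'first');
start_addr = pk_table(2, col);
delay      = 512 - start_addr;   % samples on each side of the peak

% Clip the pulse interval to the signal boundaries
first = max(1, peak_index - delay);
last  = min(data_length, peak_index + delay);
fprintf('start_addr=%d delay=%d interval=[%d, %d]\n', ...
        start_addr, delay, first, last);
```

Since the start address decreases as the scale bound grows, larger peaks are assigned longer cancelling pulses, and a peak close to the beginning of the segment gets an interval truncated at index 1.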

Finally, in Listing A.3 the cancelling pulses are generated by modeling the same sequence of operations implemented in the RTL design: complex multiplication in polar form, followed by conversion to rectangular form and subtraction from the input signal. The results are computed and displayed in the final Part 4.
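The sequence of operations can be illustrated in floating point, leaving aside the integer scaling and the CORDIC used by the golden model. In this sketch the 3-tap pulse, the threshold of 1.0 and the peak values are invented for the example. It only shows that multiplying in polar form (magnitudes multiply, phases add) matches a direct complex multiplication, and that subtracting the centre tap brings the peak back to the threshold.

```matlab
% Hypothetical floating-point values (the golden model uses scaled integers)
thr        = 1.0;                      % clipping threshold (assumed)
peak_scale = 0.25;                     % peak magnitude minus threshold
peak_phase = 0.7;                      % phase of the peak sample, in radians
x_peak     = (thr + peak_scale) * exp(1j * peak_phase);

% Toy 3-tap cancelling pulse with unit centre tap, in magnitude/phase form
coef_mag = [0.2; 1.0; 0.2];
coef_pha = [0.1; 0.0; -0.1];

% Polar-form multiplication: magnitudes multiply, phases add
mag = peak_scale * coef_mag;
pha = peak_phase + coef_pha;

% Conversion to rectangular form
pulse = mag .* cos(pha) + 1j * mag .* sin(pha);

% Reference: direct complex multiplication
ref = peak_scale * exp(1j * peak_phase) * (coef_mag .* exp(1j * coef_pha));
err = max(abs(pulse - ref));

% Subtraction from the input: the peak sample lands on the threshold
residual = abs(x_peak - pulse(2));
```

In the script the same three steps appear as the product with `coeffs_mag`, the sum with `coeffs_pha`, the call to `cordic_p2c` and the final subtraction from `sig_inout_fix`.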


Listing A.3: Part 3 and 4 of the MATLAB script. Generation of the cancelling pulses and their subtraction from the input signal, followed by the computation and display of the results.

    %% Creation and application of the cancelling pulses to the signal
    %*********
    % Part 3 *
    %*********
    canc_pulse_i = zeros(data_length,1);
    canc_pulse_q = zeros(data_length,1);
    for j = 1:size(peak_info, 1) % for every detected peak
        if peak_info(j, 5) == 0 % if the peak did not leak
            range = start_end_idx(j,1):start_end_idx(j,2);
            pulseRange = range - peak_info(j,1) + 1 + 511;
            canc_pulse_mag = zeros(data_length, 1);
            canc_pulse_pha = zeros(data_length, 1);
            canc_i = zeros(data_length, 1);
            canc_q = zeros(data_length, 1);
            % 48334 is to compensate for the gain of the two CORDICs
            canc_pulse_mag = 48334 * peak_info(j,2) *...
                coeffs_mag(pulseRange);
            canc_pulse_pha = peak_info(j,3) + coeffs_pha(pulseRange);
            [canc_i(range), canc_q(range)] =...
                arrayfun(@cordic_p2c, canc_pulse_mag, canc_pulse_pha);
            canc_pulse_i(range) = canc_pulse_i(range) + canc_i(range);
            canc_pulse_q(range) = canc_pulse_q(range) + canc_q(range);
        end
    end
    % The right shift by 34 positions is to compensate
    % for the cancelling pulse coefficients (which were multiplied
    % by 2^17-1) and for the constant (48334 = 0.3687562... * 2^17)
    temp_i = floor(bitshift(canc_pulse_i, -34, 'int64'));
    temp_q = floor(bitshift(canc_pulse_q, -34, 'int64'));
    canc_pulse = complex(temp_i, temp_q);
    canc_pulse_tot = canc_pulse_tot + canc_pulse;
    % actual cancellation of the peaks
    sig_inout_fix(:, m+1) = sig_inout_fix(:, m) - canc_pulse;
end % end of algorithm iterations loop

%*********
% Part 4 *
%*********
figure;
plot(abs(iq_data), 'r');
hold on;
plot(abs(sig_inout_fix(:,num_CS_c+1)), 'k');
plot(ones(1, data_length)*threshold(m)/1.6474, 'k');
title('Input signal (red) vs peak-cancelled signal (black)');

%% Calculate the FFT of the input signal and the output of the PC-CFR
% X = (fftshift(fft(iq_data, data_length)));
% XX = X .* conj(X) / (data_length^2);
% Y = (fftshift(fft(sig_inout_fix(:,num_CS_c+1), data_length)));
% YY = Y .* conj(Y) / (data_length^2);
% figure;
% % plot input power spectrum
% subplot(2,1,1);
% plot(10*log10(XX));
% title('Power Spectrum Using Log Scale, input sequence');
% ylabel('Power in DB');
% % plot output power spectrum
% subplot(2,1,2);
% plot(10*log10(YY));
% title('Power Spectrum Using Log Scale, output sequence');
% ylabel('Power in DB');

%% computation of PAPR
PAPR_in = 10*log10(max(abs(iq_data) .^ 2) / mean(abs(iq_data) .^ 2));
PAPR_out = 10*log10((max(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)) /...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)));
% target PAPR, computed on threshold of CS0
PAPR_tar = 10*log10(((threshold(1)/1.6474) .^ 2) /...
    (mean(abs(sig_inout_fix(:,num_CS_c+1)) .^ 2)));
str_in = sprintf('PAPR of input signal=%f dB', PAPR_in);
str_out = sprintf('PAPR of output signal=%f dB', PAPR_out);
str_tar = sprintf('Target PAPR=%f dB', PAPR_tar);
disp(str_in);
disp(str_out);
disp(str_tar);

%% computation of EVM
evm = sqrt(var(canc_pulse_tot)/var(sig_inout_fix(:,1))) * 100;
fprintf('Percent EVM=%f %%\n', evm);

%% computation of CCDF
figure;
CDF_plot(iq_data, 'r');
hold on;
CDF_plot(sig_inout_fix(:,num_CS_c+1), 'k');
hold on
line([PAPR_tar PAPR_tar], ylim);
legend('PC-CFR input', 'PC-CFR output', 'Target PAPR');
title('CCDF of input and output signals');
xlabel('PAR');
ylabel('Relative frequency');

% end of PC-CFR model
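The PAPR and EVM figures computed in Part 4 can be checked by hand on a very small signal. The two 4-sample vectors below are invented for the example: `y` is the same as `x` except that the single peak of magnitude 3 is assumed to have been reduced to magnitude 2.

```matlab
x = [1; 1i; -1; 3];     % toy input: one sample peaks at magnitude 3
y = [1; 1i; -1; 2];     % toy output: the peak is reduced to magnitude 2

% PAPR in dB: peak power over average power, as in the script
papr_db  = @(s) 10*log10(max(abs(s).^2) / mean(abs(s).^2));
papr_in  = papr_db(x);  % powers are [1 1 1 9], so this is 10*log10(9/3)
papr_out = papr_db(y);  % powers are [1 1 1 4], so this is 10*log10(4/1.75)

% EVM in percent: spread of the removed signal relative to the input
removed = x - y;
evm = sqrt(var(removed) / var(x)) * 100;

fprintf('PAPR in = %.2f dB, PAPR out = %.2f dB\n', papr_in, papr_out);
```

The example makes the trade-off visible: lowering the peak reduces the PAPR, while the removed signal shows up as a non-zero EVM.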

In Listings A.4, A.5 and A.6 the MATLAB code for the two CORDICs and for the function that computes and displays the Complementary Cumulative Distribution Function (CCDF) is provided (see Appendix 5.2 for a brief explanation of the CCDF representation). Note that throughout the MATLAB code, fixed-point arithmetic and representation have been "implemented" through the careful use of integers and scaling. Figure A.3 shows the comparative CCDF of the input and output signals.
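The integer-and-scaling approach can be illustrated with the constants that appear in Listing A.3. The sketch below is only an illustration under stated assumptions: the peak scale of 1000 is a made-up value, the coefficient is taken at its full-scale value 2^17-1, the CORDIC stage between the product and the shift is left out, and `floor` of a division by a power of two is used as a stand-in for the right shift performed with `bitshift` in the script.

```matlab
Q = 17;                                     % fractional bits of the scaled constants

gain_comp  = round(0.3687562 * 2^Q);        % compensation constant, 48334
coef_unity = 2^Q - 1;                       % full-scale pulse coefficient
peak_scale = 1000;                          % hypothetical peak scale (integer)

% Integer product: stays far below 2^53, so it is exact in a double
acc = gain_comp * peak_scale * coef_unity;

% Right shift by 2*Q = 34 positions removes both scalings
result = floor(acc / 2^(2*Q));

% Floating-point reference for comparison
ref = peak_scale * 0.3687562 * (coef_unity / 2^Q);
fprintf('integer result = %d, floating-point reference = %.3f\n', result, ref);
```

The integer result equals the floating-point reference truncated towards minus infinity, which is the behaviour a hardware right shift would show.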


Listing A.4: The CORDIC function used to convert from rectangular to polar

function [mag, pha] = cordic_c2p(x, y)
    pi_div_2 = 1571;
    n = 11;
    inpLUT = [804, 475, 251, 127, 64, 32, 16, 8, 4, 2, 1];

    if x < 0
        if y < 0
            tmp = x;
            x = -y;
            y = tmp;
            z = -pi_div_2;
        else
            tmp = x;
            x = y;
            y = -tmp;
            z = pi_div_2;
        end
    else
        z = 0;
    end

    for idx = 1:n
        xtmp = floor(bitshift(x, -(idx-1), 'int32'));
        ytmp = floor(bitshift(y, -(idx-1), 'int32'));
        %xtmp = floor(bitshift(x, -(idx-1))); % octave version
        %ytmp = floor(bitshift(y, -(idx-1))); % octave version
        if y < 0
            z = z - inpLUT(idx);
            x = x - ytmp;
            y = y + xtmp;
        else
            z = z + inpLUT(idx);
            x = x + ytmp;
            y = y - xtmp;
        end
    end
    mag = x;
    pha = z;
end
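The constants of the CORDIC can be reproduced from the standard CORDIC relations. This short sketch recomputes the rotation table as atan(2^-k) scaled by 2^10 and the theoretical gain of 11 iterations. The variable names are chosen for the example and do not come from the script.

```matlab
n = 11;                                   % number of CORDIC iterations
k = 0:n-1;                                % iteration indexes

% Elementary rotation angles, in radians scaled by 2^10 and rounded
lut = round(atan(2.^(-k)) * 2^10);

% Theoretical CORDIC gain: product of the stretch of each micro-rotation
gain = prod(sqrt(1 + 2.^(-2*k)));

disp(lut);
fprintf('gain = %.5f, 1/gain^2 = %.7f\n', gain, 1/gain^2);
```

The computed table matches `inpLUT` entry by entry. The theoretical gain comes out at about 1.6468, close to the factor 1.6474 applied to the thresholds in Part 1, and its inverse square agrees with the constant 0.3687562 quoted in Listing A.3 for the compensation of two cascaded conversions.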

Listing A.5: The CORDIC function used to convert from polar to rectangular

function [x0, y0] = cordic_p2c(mag, pha)
    pi_div_2 = 1571;
    pi = 3142;
    two_pi = 6284;
    n = 11;
    inpLUT = [804, 475, 251, 127, 64, 32, 16, 8, 4, 2, 1];

    if pha < -pi
        pha = pha + two_pi;
    elseif pha > pi
        pha = pha - two_pi;
    end

    if pha < 0
        pha = pha + pi_div_2;
        pos_rot = 0;
    else
        pha = pha - pi_div_2;
        pos_rot = 1;
    end

    x = mag;
    y = 0;
    z = pha;

    for idx = 1:n
        xtmp = floor(bitshift(x, -(idx-1), 'int64'));
        ytmp = floor(bitshift(y, -(idx-1), 'int64'));
        %xtmp = floor(bitshift(x, -(idx-1))); % octave version
        %ytmp = floor(bitshift(y, -(idx-1))); % octave version
        if z < 0
            z = z + inpLUT(idx);
            x = x + ytmp;
            y = y - xtmp;
        else
            z = z - inpLUT(idx);
            x = x - ytmp;
            y = y + xtmp;
        end
    end

    if pos_rot == 1
        x0 = -y;
        y0 = x;
    else
        x0 = y;
        y0 = -x;
    end
end
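To see what the iteration loop of the polar-to-rectangular conversion converges to, the following sketch runs a textbook rotation-mode CORDIC in floating point. It is a simplified illustration and not the listing above: it uses real-valued shifts instead of integer ones, exact `atan` values instead of the rounded table, no quadrant pre-rotation, and a hypothetical input whose phase lies within the convergence range of the iterations.

```matlab
n   = 11;            % number of iterations, as in the listings
mag = 1000;          % hypothetical magnitude
pha = 0.6;           % hypothetical phase in radians, within convergence range

x = mag;  y = 0;  z = pha;
for k = 0:n-1
    d = 1;
    if z < 0, d = -1; end          % rotate so that z is driven towards zero
    x_new = x - d * y * 2^(-k);
    y     = y + d * x * 2^(-k);
    x     = x_new;
    z     = z - d * atan(2^(-k));
end

% The result is stretched by the CORDIC gain, which must be compensated
gain = prod(sqrt(1 + 2.^(-2*(0:n-1))));
fprintf('x = %.2f (expected %.2f), y = %.2f (expected %.2f)\n', ...
        x / gain, mag * cos(pha), y / gain, mag * sin(pha));
```

After 11 iterations the residual angle is bounded by atan(2^-10), roughly 0.001 rad, which for a magnitude of 1000 limits the error on each component to about one unit.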

Listing A.6: Function used to plot the CCDF of the input and output signals

function [tmp] = CDF_plot(Y, color)
    P_average = mean(abs(Y) .^ 2);
    Instantaneous_power = abs(Y) .^ 2;
    [n, X] = hist(Instantaneous_power, length(Y));
    m = cumsum(n);
    semilogy(10*log10(X/P_average), 1 - m/max(m), color);
    grid on;
    axis([-2 12 2e-4 1])
end
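The quantity plotted by this function can also be computed directly, without a histogram, as the fraction of samples whose instantaneous-to-average power ratio exceeds a given level. The sketch below does so for a toy signal. The signal and the three levels are invented for the example, and the direct counting is an alternative formulation and not the method of Listing A.6.

```matlab
y = [1; 1; 1; 3];                     % toy signal with a single peak
p = abs(y) .^ 2;                      % instantaneous power: [1 1 1 9]
par_db = 10*log10(p / mean(p));       % power relative to the average, in dB

levels = [-6, 0, 5];                  % hypothetical levels in dB
% CCDF: fraction of samples strictly above each level
ccdf = arrayfun(@(t) mean(par_db > t), levels);

disp(ccdf);
```

Three samples sit about 4.77 dB below the average power and one sits about 4.77 dB above it, so every sample exceeds -6 dB, one in four exceeds 0 dB and none exceeds 5 dB.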


Figure A.1: PCU usage for each Clip Stage. Note that the maximum number of PCUs per stage (in this example 4, 2, 1, 1) cannot be exceeded.

Figure A.2: The input data segment is compared with the output of the PC-CFR. Note that for this configuration of CSs, threshold, PSW length, and number and distribution of PCUs, the algorithm is fully capable of satisfying the PAPR requirement.

Figure A.3: The CCDF of the input and output signals. The output signal fully satisfies the target PAPR requirement.

TRITA ICT-EX-2016:187

www.kth.se