CONFIDENTIAL

Low power digital baseband architecture for wireless sensor nodes

Yuteng Hao Master of Science Thesis

Department of Microelectronics


Low power digital baseband architecture for wireless sensor nodes

Master of Science Thesis

For the degree of Master of Science in Microelectronics at Delft University of Technology

Yuteng Hao

June 22, 2015

Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) · Delft University of Technology

The work in this thesis was supported by Holst Centre. Their cooperation is hereby gratefully acknowledged.

Copyright © Department of Microelectronics
All rights reserved.

Delft University of Technology
Department of Microelectronics

The undersigned hereby certify that they have read and recommend to the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) for acceptance a thesis entitled Low power digital baseband architecture for wireless sensor nodes by Yuteng Hao in partial fulfillment of the requirements for the degree of Master of Science Microelectronics

Dated: June 22, 2015

Supervisor(s): Dr.ir. Nick van der Meijs

Dr. Christian A. Bachmann

Committee member(s): Dr.ir. Gerard Janssen

Dr.ir. Rene Van Leuken

Dr. R.R. Venkatesha Prasad

Abstract

This thesis presents a digital baseband design for an upcoming wireless standard: IEEE 802.11ah. It is a branch of the Wi-Fi (IEEE 802.11) standards. Compared with previous Wi-Fi standards, this new standard has a larger coverage range and consumes less energy, which makes it particularly suited for energy-constrained sensor applications. In contrast to the Digital Basebands (DBBs) of other Wi-Fi standards, this design consumes much less power. The basic modulation method of the system is Orthogonal Frequency Division Multiplexing (OFDM), and the detailed algorithms are explored. To prove the robustness of the system, error tests are performed. A gate-level hardware design and the synthesis netlist are also presented to demonstrate the low-power design. Based on the synthesis results, a series of optimizations is applied to lower the power consumption. The DBB, including the key blocks of the system, has been implemented in a 40 nm low-power CMOS process to prove the concept. Measurement results show that the DBB for IEEE 802.11ah is suitable for low-power applications. The power consumption of this DBB is around 200 - 400 µW, which is hundreds of times less than that of a traditional 802.11 baseband.

Keywords: IEEE 802.11ah, Digital baseband, OFDM, Low-power, Synchronizer


Table of Contents

Glossary
    List of Acronyms
    List of Symbols

Acknowledgements

1 Introduction
    1-1 Background
    1-2 Motivations and objectives
    1-3 Thesis Overview

2 High level modeling
    2-1 OFDM Basics
    2-2 OFDM Frame Format
    2-3 Parameters for all the supplied cases
    2-4 Digital transmitter and receiver
        2-4-1 Convolutional encoder and Viterbi decoder
        2-4-2 Interleaver and deinterleaver
        2-4-3 Modulator and demodulator
        2-4-4 Pilot inserter and remover
        2-4-5 Fast Fourier Transform
        2-4-6 Cyclic Prefix (CP) insertion and removing
        2-4-7 Timing synchronization
        2-4-8 Channel estimation
        2-4-9 Oversampling
    2-5 High level simulation results
        2-5-1 Verification
        2-5-2 Floating-point simulation
        2-5-3 Fixed-point simulation
    2-6 Summary

3 Hardware implementation
    3-1 Implementation design flow
    3-2 Quantization
    3-3 RTL Implementation
        3-3-1 Modulator and demodulator
        3-3-2 Pilot inserter and remover
        3-3-3 IFFT and FFT
        3-3-4 Cyclic Prefix Extender
        3-3-5 Synchronizer
        3-3-6 Packet buffers
    3-4 Logic synthesis
    3-5 Summary

4 Simulation results and analysis
    4-1 Synthesis results
    4-2 Power consumption
    4-3 Summary

5 Conclusions and future work
    5-1 Conclusions
    5-2 Future work

A The comparison of channel estimation using LS and MMSE estimators

B Verification for the functionality of RTL model and gate-level netlist

List of Figures

1-1 Smart grid with an 802.11ah AP (after [3])
1-2 Magnitude spectrum of baseband signal and high frequency signal
1-3 Digital baseband and front-end

2-1 Frequency-Time Representation of an OFDM signal [14]
2-2 Spectrum of FDM subcarriers
2-3 Spectrum of OFDM symbols with overlapping subcarriers
2-4 Architecture of the OFDM system (Transmitter and Receiver)
2-5 802.11ah PPDU frame format (bandwidth: 1 MHz) [1]
2-6 Subcarrier frequency allocation (bandwidth is 1 MHz) [1]
2-7 802.11ah PPDU frame format (bandwidth ≥ 1 MHz) [1]
2-8 Convolutional encoder (K = 7) [1]
2-9 The coded bits before and after interleaving
2-10 The constellation diagrams of BPSK and QPSK
2-11 The constellation diagrams of 16-QAM
2-12 The QPSK constellation diagrams with noise
2-13 The basic concept of cyclic prefix
2-14 Inter-symbol interference with a delayed signal (without CP)
2-15 Magnitude of short training sequence in S1G_1M
2-16 Block diagram of the auto-correlation algorithm (Equation 2-14 with N = 16)
2-17 Output of auto-correlation for an incoming signal with an SNR of 20 dB (S1G_1M)
2-18 Block diagram of the cross-correlation algorithm (S1G_1M)
2-19 Output of cross-correlation for an incoming signal with an SNR of 20 dB (S1G_1M)
2-20 Correlation output for an incoming signal with an SNR of 20 dB (S1G_1M)
2-21 Channel estimation principle [33]
2-22 Channel frequency response for the original channel and estimated channel (overlapping)
2-23 Oversampling principle
2-24 The bit streams which can be compared to verify the modules
2-25 Theoretical BER for BPSK, QPSK and 16-QAM
2-26 BER simulation over an AWGN channel and theoretical BER for BPSK (these two curves are overlapping)
2-27 PER vs SNR for all the supported S1G_1M cases
2-28 Effect of timing synchronization on PER (CBW1, MCS1)
2-29 Effect of Rayleigh Fading channel (CBW1, MCS1)
2-30 Fixed-point simulation (CBW1, MCS3)

3-1 The design flow for hardware implementation
3-2 A basic example for "dia − doa" structure
3-3 The OFDM transmitter block diagram in RTL design
3-4 The OFDM receiver block diagram in RTL design
3-5 The circuitry used for constellation mapping (BPSK, QPSK and 16-QAM)
3-6 The waveform of the signals in the modulator (BPSK, S1G_1M)
3-7 The circuitry used for constellation demapping (BPSK, QPSK and 16-QAM)
3-8 The circuitry used for pilot insertion and zero padding
3-9 The circuitry used for cyclic prefix extension
3-10 The circuitry used for removing cyclic prefix
3-11 The circuitry designed for auto-correlation
3-12 The circuitry designed for cross-correlation
3-13 The circuitry of synchronization
3-14 The waveform of the signals in the synchronizer
3-15 Timing diagram of OFDM symbols at the transmitter
3-16 The waveform of the "fake" clock signal
3-17 Block diagram for Ping-Pong buffer
3-18 The waveform of signals in packet buffer at Tx
3-19 Top level view of the RTL circuitry
3-20 The synthesis flow

4-1 Cell area vs clock frequency constraints
4-2 Power consumption over time (top level, CBW1, MCS3)
4-3 Power consumption for each block (CBW1, MCS3)
4-4 Power consumption of synchronizer (CBW1, MCS3)
4-5 Average power consumption for all the supplied cases
4-6 Average power consumption distribution (CBW1, MCS3)
4-7 Power consumption of each module during transmitting period (CBW1, MCS3)
4-8 Power consumption of each module during receiving period (CBW1, MCS3)
4-9 Power consumption of each module during receiving period (CBW1, MCS3)

A-1 Comparison of BER for no channel estimation, LS channel estimation and MMSE channel estimation

B-1 Comparison between the input and output bits


List of Tables

1-1 Typical transmission range of some protocol standards
1-2 Sub 1 GHz spectra specified in the 802.11ah channelization
1-3 Power consumption of DBB for some wireless communication standards
1-4 Area of DBB for some wireless communication standards

2-1 Fields of S1G PPDU
2-2 S1G-MCSs for 1 MHz
2-3 S1G-MCSs for 2 MHz
2-4 The definitions of the related parameters
2-5 Values for some parameters related to the interleaver
2-6 SNRs where PER equals 0.1 for all the supplied cases

3-1 Number of bits in each block (MCS0 and MCS1)
3-2 Basic timing-related parameters [1]
3-3 Minimum frequencies in the transmitter processor

4-1 Cell area of each block
4-2 The number of gates for different gate types
4-3 Average power consumption of each period for all the supplied cases

5-1 Summary of IEEE 802.11ah DBB circuitry
5-2 Summary of IEEE 802.11ah DBB circuitry


Glossary

List of Acronyms

AP Access Point

PAR Project Authorization Request

IoT Internet of Things

M2M Machine to Machine

WPAN Wireless Personal Area Network

WLAN Wireless Local Area Networks

OFDM Orthogonal Frequency Division Multiplexing

PHY Physical Layer

MAC Medium Access Control

DTT Digital Terrestrial Television

DVB Digital Video Broadcast

FDM Frequency Division Multiplexing

RF Radio Frequency

DBB Digital Baseband

IFFT Inverse Fast Fourier Transform

FFT Fast Fourier Transform

BTLE Bluetooth Smart (Low Energy)

STF Short Training Field

LTF Long Training Field


SIG Signal Field

S1G Sub 1 GHz

DC Direct Current

GI Guard Interval

CBW Channel Bandwidth

PPDU Physical Layer Convergence Protocol Data Unit

PSK Phase Shift Keying

QAM Quadrature Amplitude Modulation

BPSK Binary Phase Shift Keying

QPSK Quadrature Phase Shift Keying

IDFT Inverse Discrete Fourier transform

ISI Inter-symbol Interference

CP Cyclic Prefix

AGC Automatic Gain Control

AWGN Additive White Gaussian Noise

SNR Signal To Noise Ratio

CIR Channel Impulse Response

CFR Channel Frequency Response

LS Least Square

MMSE Minimum Mean Square Error

MSE Mean-Square Error

DAC Digital-to-analog Converter

LPF Low-pass Filter

RTL Register Transfer Level

HDLs Hardware Description Languages

BER Bit Error Rate

PER Packet Error Rate

Sync Synchronization

VCD Value Change Dump

Acknowledgements

I am grateful to all the people who, in one way or another, have helped me during my master thesis project. Without the support of others, it would have been impossible for me to finish this project with this performance.

First, I would like to thank my supervisor Dr.ir. Nick van der Meijs for his guidance during the project. I learned a lot from his lectures on digital integrated circuits, which led me to the firm decision to choose this direction for my master thesis. It was he who inspired me to explore the sea of digital integrated circuits. During the last 10 months, he tracked my progress well and was always willing to give me advice and suggestions. I would also like to thank him for carefully reading the thesis and correcting my writing style.

Next, I would like to express my thanks to my supervisor Dr. Christian A. Bachmann for his continuous assistance and encouragement throughout my master thesis project. Every time I got stuck, he was able to provide the best solution. He explained the concept of digital baseband design well and also helped me explore very detailed algorithms. It is my honor to work with him.

I also appreciate my colleagues at Holst Centre for their support. I thank Pepijn Boer and Peng Zhang for sharing their rich experience with telecommunication systems. Thank you for all the meetings and discussions with me, which were very valuable for my design. I thank Bo Liu for his contribution to the richness of this research, including digital circuit design and synthesis. I also want to thank you for reviewing my thesis, which improved it a lot. I thank Tobias Gemmeke for helping me optimize the hardware design and giving me access to the technology library. I thank all the colleagues here for all the kind help and advice. I gained a lot from chatting with you. I must express my gratitude to Michel Berkelaar. Thanks for your help in the early stage of my MSc thesis work, especially the advice on my presentations.

I would like to thank my friends here. Thank you for keeping good company and sharing the ups and downs during these two academic years. The general help and friendship were all greatly appreciated. Thanks are given to all my friends on Facebook, Weibo and WeChat. Thank you for all the likes and comments, which always keep me energetic and optimistic. I would also like to thank the readers and committee members for reading my thesis.


In the end, I would like to thank my parents for their selfless love during all these years. This thesis is dedicated to my parents.

Delft, University of Technology Yuteng Hao June 22, 2015

“Our life always expresses the result of our dominant thoughts.” — Soren Kierkegaard

Chapter 1

Introduction

1-1 Background

Internet of Things (IoT) and Machine to Machine (M2M) communication are developing rapidly in today’s world. In such systems, all devices that are provided with unique identifiers can transfer data over a network. ZigBee (over IEEE 802.15.4), Bluetooth (over IEEE 802.15.1) and Wi-Fi (over IEEE 802.11) are three protocol standards that can be used in different scenarios in IoT and M2M systems. In many scenarios, the IoT consists of a large number of sensor nodes that communicate with each other through wireless sensor networks. These nodes must operate for many years without battery replacement. Therefore, energy efficiency is a critical constraint when designing an IoT network [2]. Until recently, the IEEE 802.15 standard family (which contains ZigBee and Bluetooth) was a better option for low-power design than the conventional Wi-Fi standards. However, Bluetooth and ZigBee were developed for Wireless Personal Area Network (WPAN) communication and their operating range is relatively short (as shown in Table 1-1) [3]. Wi-Fi, on the other hand, is oriented to Wireless Local Area Networks (WLAN), which have a larger coverage range. As a result, Wi-Fi is currently more suitable for specific applications such as large-scale networks and outdoor systems. To make Wi-Fi fit a tight power budget as well, the IEEE 802.11ah standard has been proposed by the IEEE 802.11 working group.

Table 1-1: Typical transmission range of some protocol standards

Standard    IEEE spec      Range (m)
ZigBee      802.15.4       10..100
Bluetooth   802.15.1       1..10
Wi-Fi       802.11a/b/g    ≥ 100
Wi-Fi       802.11ah       ≥ 1000


The conventional 802.11 WLANs operate in the 2.4 GHz and 5 GHz bands, which ensure high throughput and data rates. However, these high frequency bands limit the transmission range and make these standards less suitable for outdoor applications. Furthermore, mutual interference becomes more and more critical as IEEE 802.11-based wireless networks become more widespread. Therefore, IEEE 802.11ah targets operation in sub 1 GHz bands, which are specified per country in Table 1-2 [4].
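The link between carrier frequency and range can be illustrated with the free-space path loss (Friis) model. This is a simplified sketch and not a model used in the thesis: it assumes an ideal line-of-sight link with isotropic antennas, and 915 MHz is picked only as an illustrative frequency inside the US band of Table 1-2. The function name `fspl_db` is hypothetical.

```python
import math

def fspl_db(distance_m: float, freq_hz: float) -> float:
    """Free-space path loss in dB between isotropic antennas (Friis model)."""
    c = 299_792_458.0  # speed of light in m/s
    return 20.0 * math.log10(4.0 * math.pi * distance_m * freq_hz / c)

# Assumed example frequencies: 915 MHz (sub 1 GHz) versus 2.4 GHz (conventional Wi-Fi).
loss_s1g = fspl_db(1000.0, 915e6)
loss_24g = fspl_db(1000.0, 2.4e9)

# Path-loss advantage of the lower band at the same distance, in dB.
gain_db = loss_24g - loss_s1g

# Distance ratio at which both bands see the same free-space loss.
range_ratio = 10 ** (gain_db / 20.0)

print(f"path-loss advantage: {gain_db:.2f} dB, equal-loss range ratio: {range_ratio:.2f}")
```

Under these assumptions the lower band gains roughly 8 dB of link budget, or a factor equal to the frequency ratio in distance. Real deployments also depend on antennas, obstacles and transmit power, so this only indicates the trend.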

Table 1-2: Sub 1 GHz spectra specified in the 802.11ah channelization

Country      Available spectra (MHz)
US           902-928
Europe       863-868
China        755-787
Korea        917.5-923.5
Japan        916.5-927.5
Singapore    866-869, 920-925

The main advantages of a standardized Sub 1 GHz (S1G) WLAN are summarized in [5]:

• Long transmission range (see Table 1-1), suitable for large scale networks.

• Lower power consumption, compared with other Wi-Fi standards.

• Operates at sub 1 GHz license-exempt bands.

• Avoids interference issues.

• Easy to understand and implement for network device manufacturers.

A practical application, a large-scale smart grid, is introduced in [5]. A smart grid monitors the real-time status of various utility consumptions and informs the users of this status [6]. By making use of an 802.11ah Access Point (AP), the coverage of a one-hop transmission can be much wider (more than 1 km), which allows more devices to be supported in a single network [4].

1-2 Motivations and objectives

The research on the 802.11ah standard is still in an early stage. The Project Authorization Request (PAR) document, which describes the purpose and scope of the 802.11ah project, was generated in 2010. The IEEE 802.11ah amendment is currently being developed and is expected to be finalized in 2016 [7]. The current amendment defines an Orthogonal Frequency Division Multiplexing (OFDM) Physical Layer (PHY) and a Medium Access Control (MAC) layer to support this PHY. The function of the PHY is to handle the details of data transmission and reception between stations. It connects the MAC layer to a physical medium such as a wireless channel.


Figure 1-1: Smart grid with an 802.11ah AP (after [3])

The functionality of the PHY layer is usually realized by two subsystems: the analog front-end and the "baseband". The baseband signal is generated at low frequencies, as shown in Fig. 1-2(a). The signal is limited to the range from −fb to fb; this channel is called the baseband channel and fb is the baseband bandwidth. In contrast, the front-end contains analog circuits which operate at high frequencies (Fig. 1-2(b)).

Figure 1-2: Magnitude spectrum of baseband signal and high frequency signal


The communication system is illustrated in Fig. 1-3. In a wireless communication system, the signal is transmitted at high frequencies. However, at the chip level, only a small fraction of the circuit operates in the Radio Frequency (RF) range, while the rest performs low-frequency baseband analog and digital signal processing [8]. The data source contains the digital data from the sensor nodes. The digital baseband mainly performs the necessary processing of the input data and limits the input signal bandwidth to the required value (−fb to fb in Fig. 1-2). Then, in the front-end, the signal is up-converted by mixing it with the carrier frequency and transmitted via the antenna. Similarly, in the receiver, the RF signal is down-converted before digital processing.

Figure 1-3: Digital baseband and front-end

The 802.11ah PHY layer is based on the down-clocked operation of IEEE 802.11ac’s PHY. The 802.11ac standard provides 20 MHz, 40 MHz, 80 MHz and 160 MHz channel bandwidths. Down-clocking by a factor of 10 yields suitable bandwidths: the 802.11ah amendment supports channel bandwidths of 2 MHz, 4 MHz, 8 MHz and 16 MHz. In addition, a 1 MHz channel is defined by 802.11ah for the purpose of extended coverage. All 802.11ah stations shall support 1 MHz and 2 MHz (the others are optional) [2].

In this work, a digital baseband design for an IEEE 802.11ah PHY layer is presented. Since IEEE 802.11ah is an upcoming Wi-Fi standard, no research on its Digital Baseband (DBB) design has been published so far. As described above, this new standard aims to realize long-range communication with relatively low power consumption, so the power consumption of this DBB should be compared with that of the DBBs of other standards (Table 1-3). In this table, "Tx" stands for the DBB in the transmitter and "Rx" stands for the DBB in the receiver. The power consumption of the DBB in this work should be much smaller than that of other Wi-Fi standards and comparable with that of the 802.15 standard family. By extrapolating from the table, the target power consumption of this work for the DBBs in Tx and Rx is around 1 mW.

Table 1-3: Power consumption of DBB for some wireless communication standards

Standard                               Power consumption             Technology
IEEE 802.15.4 (Zigbee)                 70.3 µW (Tx) / 1.7 mW (Rx)    0.18 µm CMOS [9]
IEEE 802.15.1 (Bluetooth)              0.5 - 3.3 mW                  0.13 µm CMOS [10]
IEEE 802.15.4 (Zigbee)                 80 µW (Tx) / 200 µW (Rx)      40 nm CMOS [11]
Bluetooth Smart (Low energy) (BTLE)    80 µW (Tx) / 180 µW (Rx)      40 nm CMOS [11]
IEEE 802.15.6                          60 µW (Tx) / 140 µW (Rx)      40 nm CMOS [11]
IEEE 802.11a                           104 mW (Tx) / 146 mW (Rx)     0.065 µm CMOS [12]
IEEE 802.11n                           336 mW (Tx) / 372 mW (Rx)     0.13 µm CMOS [13]

In addition to the power consumption, area is also an important parameter to evaluate the performance. In this work, the technology for hardware implementation is 40 nm CMOS, and the area reported in some of the above work can be scaled to 40 nm. This is shown in Table 1-4. Similar to the power consumption, the area of this DBB should be smaller than the DBB areas in other Wi-Fi standards. Here, we define an area target of 0.5 mm².

Table 1-4: Area of DBB for some wireless communication standards

Standard                     Technology            Area (original)    Area (scaled)
IEEE 802.15.4 (Zigbee)       0.18 µm CMOS          1.13 mm²           0.056 mm² [9]
IEEE 802.15.1 (Bluetooth)    0.13 µm CMOS          2.42 mm²           0.23 mm² [10]
IEEE 802.11a                 0.065 µm CMOS         21 mm²             7.95 mm² [12]
IEEE 802.11n                 0.13 µm CMOS          25 mm²             2.37 mm² [13]

In this thesis, concepts for energy-efficient digital baseband architectures, sub-modules and low power optimization techniques for this wireless standard are explored. A literature search regarding existing solutions and the state of the art is performed, leading to a concept for an energy-efficient DBB architecture and sub-modules. Noise and distortion during transmission are also included in this model. By implementing the proposed concept and performing power simulations, the performance is also evaluated. The target is to make the power consumption lower than 1 mW and the area less than 0.5 mm².
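The scaled areas in Table 1-4 can be reproduced with a simple sketch. The thesis does not state its scaling rule; the code below assumes that area scales with the square of the feature-size ratio, which turns out to match the tabulated values to within rounding. The helper name `scale_area` is hypothetical.

```python
def scale_area(area_mm2: float, node_from_nm: float, node_to_nm: float = 40.0) -> float:
    """Scale a layout area quadratically with the feature-size ratio (assumed rule)."""
    return area_mm2 * (node_to_nm / node_from_nm) ** 2

# (original area in mm^2, original node in nm, scaled area reported in Table 1-4)
table_1_4 = {
    "IEEE 802.15.4 [9]": (1.13, 180.0, 0.056),
    "IEEE 802.15.1 [10]": (2.42, 130.0, 0.23),
    "IEEE 802.11a [12]": (21.0, 65.0, 7.95),
    "IEEE 802.11n [13]": (25.0, 130.0, 2.37),
}

for name, (area, node, reported) in table_1_4.items():
    print(f"{name}: computed {scale_area(area, node):.3f} mm^2, table {reported} mm^2")
```

With this rule, even the smallest scaled Wi-Fi baseband in the table (2.37 mm² for 802.11n) is well above the 0.5 mm² target defined in the text.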

1-3 Thesis Overview

The next chapters in the thesis are organized as follows:

Chapter 2 describes the high level model for the digital baseband system. First, OFDM is introduced as the modulation technique. Then, the specified parameters are briefly discussed. After this, the subsystems are described, including the wireless communication theory and the high level implementation. Finally, the simulation results of the high level modeling are presented and analyzed.

Chapter 3 presents the hardware implementation for the digital baseband system. This is a Register Transfer Level (RTL) description of the OFDM system. First, the design flow is introduced. Then, the quantization is proposed based on the simulation in Chapter 2. This is followed by the description of the critical modules in hardware. In the end, the synthesis flow is introduced briefly.

Chapter 4 evaluates the performance of this DBB circuitry, including the cell area, the number of gates and the power consumption. The results are compared with the targets we set in Chapter 1.

Chapter 5 summarizes the work of the thesis and proposes future directions based on this work.

Chapter 2

High level modeling

Orthogonal Frequency Division Multiplexing (OFDM), as used in 802.11ah DBB, has emerged as a popular modulation technique for digital communication. As the final target of this project is to implement this system in hardware, it is desirable to make a software model for the OFDM system available. By designing the software model for the system, the basic structure and algorithms can be determined and the preliminary performance of the system can be estimated. The main challenge of this step is to understand the principle of the telecommunication theory and realize it in software block by block. To make the software model more reliable, noise and distortion are also added in this system. This chapter is intended to present the software model of the digital baseband design. First the basic method of the digital communication system, the OFDM technique, is introduced and the system is then broken down into multiple subsystems. Next, the subsystems and the algorithms are presented. Noise and the distortion are also described briefly. This is followed by the preliminary simulation results and analysis for it. The chapter ends with a short conclusion. The software model is built in MATLAB.

2-1 OFDM Basics

OFDM is a modulation and multiplexing technique which is widely adopted in wideband digital communication. It has been used in Wi-Fi standards such as 802.11a and 802.11n. It has also been adopted for a multitude of broadcast standards, from Digital Terrestrial Television (DTT) to the Digital Video Broadcast (DVB) standards. The principle of an OFDM system is shown in Fig. 2-1 [14]: the main signal is first divided into a number of independent signals (which are called subcarriers) and then transmitted. As a result, the original data stream is divided into a set of channels. A large number of subcarrier signals are closely spaced. These subcarriers are orthogonal to each other, which means that any two subcarriers of an OFDM signal do not interfere with each other. The data can be carried on the subcarrier signals in parallel. Each subcarrier is modulated with a conventional modulation scheme at a relatively low symbol rate. By an Inverse Fast Fourier Transform (IFFT), these subcarriers are transformed from the frequency domain into the time domain. In the time domain, multiple OFDM symbols are concatenated to generate the final OFDM burst signal. At the receiver side, a Fast Fourier Transform (FFT) is performed to recover the original signals.
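The orthogonality and the IFFT/FFT round trip described above can be checked numerically. The following is a minimal sketch in Python rather than the MATLAB model of the thesis. It uses a naive DFT in place of a fast transform (same result, slower), a 32-point size as used for the 1 MHz mode in Section 2-2, and an arbitrary BPSK pattern; all function names are hypothetical.

```python
import cmath

N = 32  # transform size; the 1 MHz mode divides the channel into 32 subcarriers

def idft(X):
    """Inverse DFT: frequency-domain subcarrier values to time-domain samples."""
    n_pts = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / n_pts) for k in range(n_pts)) / n_pts
            for n in range(n_pts)]

def dft(x):
    """Forward DFT: time-domain samples back to subcarrier values."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]

def subcarrier(k):
    """Time-domain waveform of subcarrier k over one symbol."""
    return [cmath.exp(2j * cmath.pi * k * n / N) for n in range(N)]

def inner(a, b):
    """Complex inner product of two sample sequences."""
    return sum(x * y.conjugate() for x, y in zip(a, b))

# Arbitrary BPSK values on subcarriers 1..13 and -13..-1 (negative k stored at index N - k).
freq = [0j] * N
for k in range(1, 14):
    freq[k] = complex(1 if k % 2 else -1)
    freq[N - k] = complex(-1 if k % 3 else 1)

time_symbol = idft(freq)      # transmitter: frequency domain -> time domain
recovered = dft(time_symbol)  # receiver: time domain -> frequency domain

max_error = max(abs(r - f) for r, f in zip(recovered, freq))
cross_talk = abs(inner(subcarrier(3), subcarrier(4)))
print(f"round-trip error: {max_error:.2e}, cross-talk between subcarriers 3 and 4: {cross_talk:.2e}")
```

Both printed values are at the level of floating-point rounding: the subcarriers do not leak into each other over one symbol, and the receiver transform recovers every subcarrier value in the absence of noise and channel distortion.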

Figure 2-1: Frequency-Time Representation of an OFDM signal [14]

The OFDM method is based on the Frequency Division Multiplexing (FDM) scheme, and the major advantage of an OFDM system is its efficient use of the spectrum. In an FDM system, the symbols are also transmitted in parallel but the total bandwidth is divided into a series of non-overlapping sub-bands. As a result, each data symbol occupies the entire available bandwidth (Fig. 2-2). In contrast, in an OFDM system, when the subcarriers have appropriate spacing to satisfy orthogonality, their spectra overlap (see Fig. 2-3), which means the spectrum of an individual data subcarrier only occupies a small part of the available bandwidth [15].

Figure 2-2: Spectrum of FDM subcarriers


Figure 2-3: Spectrum of OFDM symbols with overlapping subcarriers

The architecture of the system is shown in Fig. 2-4. The upper path of the system is the transmitter: it receives the input data and generates the OFDM symbol with the modulator. Through the IFFT, the OFDM symbol is mapped into the time domain and is then transmitted after passing the filter. On the receiver side, a synchronization block is added to find the start point of the received signal, and channel estimation is needed to reduce the distortion introduced by the fading channel. These blocks and their algorithms are introduced in detail in the following sections.

Figure 2-4: Architecture of the OFDM system (Transmitter and Receiver)


2-2 OFDM Frame Format

In the 802.11 standard, the basic unit of a transmitted signal is called the Physical Layer Convergence Protocol Data Unit (PPDU). Each PPDU is divided into two basic sections, the preamble and the payload. The preamble is a hard-coded sequence which is processed and prepended to the data field during transmission. At the receiver side, the PHY preamble is used for time synchronization and channel estimation. The payload is also called the data field. In the payload, data is sent in the form of OFDM symbols. An OFDM symbol consists of a number of carriers, and the FFT size is determined by the number of carriers. Here, three types of carriers are used [16]:

• Data carriers: Occupied by data

• Pilot carriers: Occupied by pilots (see Section 2-4-4), which are defined by the standard

• Null carriers: no transmission at all, for guard bands and Direct Current (DC) carrier

The guard band (or Guard Interval (GI)) is added to ensure that the data is only sampled when the signal is stable and no new delayed signals arrive that would influence the signal. There are two bandwidth-dependent frame formats for the S1G PHY: S1G_1M (Sub 1 GHz with Channel Bandwidth = 1 MHz) and S1G_SHORT (Sub 1 GHz with Channel Bandwidth = 2 MHz and Short Guard Interval). These two cases can also be distinguished as Channel Bandwidth (CBW)1 and CBW2, respectively.

Figure 2-5: 802.11ah PPDU frame format (bandwidth: 1 MHz) [1]

The general structure for S1G_1M is defined in Fig. 2-5. The preamble consists of two 4-symbol sections and one 6-symbol field. The first field is called the Short Training Field (STF), which is a repetition of ten 16-sample sequences. This field is used for time synchronization at the receiver to detect the start point of the received signal. The second field consists of 5 repetitions of a 32-sample sequence and is called the Long Training Field (LTF). It is used for channel estimation to recover the signals after passing a fading channel. The last section of the preamble is called the Signal Field (SIG); information such as the data length is stored in this field. The receiver can detect this information and use the correct methods to recover the received data. The other fields of the S1G PPDU formats are summarized in Table 2-1.
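The field lengths quoted above are mutually consistent, as a short calculation shows. This sketch assumes one sample per microsecond (a 1 MHz sampling rate) and a symbol of 40 samples, that is, 32 FFT samples plus an 8-sample guard interval; neither assumption is stated in this paragraph, and the variable names are illustrative.

```python
# Assumption: 32 FFT samples + 8 guard-interval samples per OFDM symbol at 1 Msample/s.
SAMPLES_PER_SYMBOL = 32 + 8

# Preamble field lengths in OFDM symbols, as given in the text for S1G_1M.
preamble_symbols = {"STF": 4, "LTF": 4, "SIG": 6}
preamble_samples = {field: n * SAMPLES_PER_SYMBOL for field, n in preamble_symbols.items()}

# Lengths implied by the repetition structure described in the text.
stf_from_repetitions = 10 * 16  # ten 16-sample sequences
ltf_from_repetitions = 5 * 32   # five repetitions of a 32-sample sequence

total_samples = sum(preamble_samples.values())
print(preamble_samples, "total:", total_samples)
```

Under these assumptions the 4-symbol STF and LTF each span 160 samples, which equals both 10 × 16 and 5 × 32, and the whole preamble spans 560 samples.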


Table 2-1: Fields of S1G PPDU

Field    Description
LTS      Long training symbol
GI       Guard interval
DGI      Double guard interval

The transmitted data is stored in the payload. In this mode, the frames are transmitted over a channel with a bandwidth of 1 MHz. The channel is divided into 32 subcarriers. Two pilot signals are inserted in subcarriers -7 and 7. The signal is transmitted on subcarriers -13 to -1 and 1 to 13. The other subcarriers are occupied by zeros. This is illustrated in Fig. 2-6.

Figure 2-6: Subcarrier frequency allocation (bandwidth is 1 MHz) [1]

Figure 2-7: 802.11ah PPDU frame format (bandwidth ≥ 1 MHz) [1]

The structure for S1G_SHORT is defined in Fig. 2-7. The format is used for transmission using 2 MHz, 4 MHz, 8 MHz and 16 MHz frames. The structure of the training fields is similar to that of S1G_1M. The signal field consists of 2 repetitions of 32 pre-defined samples and needs to be processed at the receiver. In the payload, the channel is divided into 64 subcarriers. Four pilot signals are inserted in subcarriers -21, -7, 7 and 21. The signal is transmitted on subcarriers -28 to -1 and 1 to 28. The other subcarriers are occupied by zeros.
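The subcarrier allocation of both modes can be written down as a small map from subcarrier index to role. This is an illustrative sketch, not code from the thesis; the function `allocate` and the role labels are hypothetical, while the index ranges and pilot positions are the ones given in the text.

```python
def allocate(n_fft, edge, pilots):
    """Classify each subcarrier index in [-n_fft/2, n_fft/2) as data, pilot or null."""
    roles = {}
    for k in range(-n_fft // 2, n_fft // 2):
        if k in pilots:
            roles[k] = "pilot"
        elif k != 0 and -edge <= k <= edge:
            roles[k] = "data"
        else:
            roles[k] = "null"  # DC carrier and guard bands carry zeros
    return roles

def count(roles, role):
    """Number of subcarriers with the given role."""
    return sum(1 for r in roles.values() if r == role)

cbw1 = allocate(32, 13, {-7, 7})            # 1 MHz mode
cbw2 = allocate(64, 28, {-21, -7, 7, 21})   # 2 MHz mode

for name, roles in (("CBW1", cbw1), ("CBW2", cbw2)):
    print(name, {role: count(roles, role) for role in ("data", "pilot", "null")})
```

The resulting counts are 24 data and 2 pilot subcarriers for CBW1 and 52 data and 4 pilot subcarriers for CBW2, which agrees with the NSD and NSP columns of Tables 2-2 and 2-3.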


2-3 Parameters for all the supplied cases

The PHY layer of 802.11ah is based on the OFDM scheme. In a typical indoor communication scenario, the channel can vary due to different conditions. In order to improve the radio link capability, multiple PHY layer modulation schemes are applied [17]. The parameters for CBW1 and CBW2 are given in the tables listed below. The S1G-MCS (modulation and coding scheme) is a value that determines the modulation and coding used in the Data field of the PPDU. In this work, MCS0, 1 and 3 are supported and the related parameters are shown in the tables. The goal of this thesis was to focus on the mandatory modes of 802.11ah (MCSs 0 and 1) in order to make an ultra-low power design possible. For higher CBW and other MCSs, this becomes infeasible, especially when also considering the analog front-end. MCS3 is added as an optional mode in this DBB to explore the impact on area and power, which is shown in Chapter 3 and Chapter 4.

Table 2-2: S1G-MCSs for 1 MHz

MCS Index   Modulation   NSD   NSP   NCBPS   NBPSC   Data rate (kbps)
                                                     8 µs GI    4 µs GI
0           BPSK         24    2     24      1       300.0      333.3
1           QPSK         24    2     48      2       600.0      666.7
3           16-QAM       24    2     96      4       1200.0     1333.3

Table 2-3: S1G-MCSs for 2 MHz

MCS Index   Modulation   NSD   NSP   NCBPS   NBPSC   Data rate (kbps)
                                                     8 µs GI    4 µs GI
0           BPSK         52    4     52      1       650.0      722.2
1           QPSK         52    4     104     2       1300.0     1444.4
3           16-QAM       52    4     208     4       2600.0     2888.9

The parameters in the above tables are described in Table 2-4:

Table 2-4: The definitions of the related parameters

Parameter   Description
NSD         Number of data subcarriers per OFDM symbol
NSP         Number of pilot subcarriers per OFDM symbol
NCBPS       Number of coded bits per symbol
NBPSC       Number of coded bits per subcarrier
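The data rates in Tables 2-2 and 2-3 can be reproduced from NCBPS with a short calculation. The sketch below is only illustrative and rests on two assumptions that the tables do not state explicitly: a coding rate of 1/2 for MCS 0, 1 and 3, and a useful OFDM symbol time of 32 µs, so that one symbol lasts 40 µs with the 8 µs GI and 36 µs with the 4 µs GI. The function name `data_rate_kbps` is hypothetical.

```python
from fractions import Fraction

# Assumptions (not listed in the tables): rate-1/2 coding for MCS 0, 1 and 3,
# and a 32 us useful symbol time on top of which the guard interval is added.
CODE_RATE = Fraction(1, 2)
USEFUL_SYMBOL_US = 32


def data_rate_kbps(n_cbps, gi_us):
    """Data rate in kbps for n_cbps coded bits per symbol and a GI given in us."""
    data_bits_per_symbol = n_cbps * CODE_RATE
    symbol_time_us = USEFUL_SYMBOL_US + gi_us
    # bits per microsecond is Mbps, so multiply by 1000 to obtain kbps
    return float(data_bits_per_symbol * 1000 / symbol_time_us)


# N_CBPS values taken from Table 2-2 (1 MHz) and Table 2-3 (2 MHz)
for n_cbps in (24, 48, 96, 52, 104, 208):
    print(n_cbps, round(data_rate_kbps(n_cbps, 8), 1), round(data_rate_kbps(n_cbps, 4), 1))
```

Under these assumptions the printed values agree with the two data-rate columns of both tables.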

2-4 Digital transmitter and receiver

As described above, the OFDM system can be divided into three parts: transmitter, receiver and channel. In this work, transmitter and receiver are modeled in MATLAB block by block.


The channel model is an in-house design from Holst Centre [18]. The communication theory behind these modules and their implementation are introduced in detail in this section.

2-4-1 Convolutional encoder and Viterbi decoder

The convolutional encoder and the Viterbi decoder are mandatory modules in the IEEE 802.11ah standard. Convolutional coding has been widely used in communication systems since it was introduced by Elias in 1955 [19]. It protects the data against errors when it is transmitted over a noisy channel, which is achieved by adding redundancy to the source symbols. The convolutional encoder is usually characterized by the following parameters [20]:

1) Input bit stream length: the number of data bits entering the encoder, represented as k.
2) Output bit stream length: the number of data bits coming out of the encoder, represented as n.
3) Code rate: the ratio of the input bit rate to the output bit rate, given as k/n.
4) Constraint length: the number of stages which the input bits go through in the encoding shift register, represented as K.

In this work, the DATA field shall be coded with a convolutional encoder of coding rate R ∈ {1/2, 2/3, 3/4}, corresponding to the desired data rate [1]. An example of the encoding operation with a constraint length of 7 is shown in Fig. 2.8. The code rate of this example is 1/2: if the input stream contains 48 bits, the output stream will contain 96 bits.

Figure 2-8: Convolutional encoder (K = 7) [1]
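The rate-1/2, K = 7 encoding operation can be sketched in a few lines. The example below assumes the generator polynomials 133 and 171 (octal), which are the values commonly used in the IEEE 802.11 family; they are not quoted in the text above, and the function names are hypothetical. It is an illustration, not the MATLAB routine used in this work.

```python
G0 = 0o133  # generator polynomial of output A (assumed value)
G1 = 0o171  # generator polynomial of output B (assumed value)
K = 7       # constraint length


def parity(x):
    """Return 1 if x has an odd number of set bits, otherwise 0."""
    return bin(x).count("1") & 1


def conv_encode(bits):
    """Rate-1/2 convolutional encoder: two coded bits (A, B) per input bit."""
    reg = 0  # K-bit shift register, newest bit in the most significant position
    out = []
    for b in bits:
        reg = (reg >> 1) | (b << (K - 1))
        out.append(parity(reg & G0))
        out.append(parity(reg & G1))
    return out


coded = conv_encode([1, 0, 1, 1] + [0] * 44)  # 48 input bits
print(len(coded))  # 96 coded bits
```

A single 1 followed by zeros reproduces the taps of the two generators on the A and B outputs, which is a convenient sanity check for the register orientation.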

At the receiver, the data is decoded by a Viterbi decoder. The Viterbi decoding algorithm was proposed in 1967 [21]. In the convolutional encoder, each coded sequence follows a particular path through the encoder states, which makes it possible to find that path at the receiver and thereby estimate the coded sequence that was transmitted [22]. In this work, these two modules are implemented with MATLAB routines and the related parameters are taken from [1].


2-4-2 Interleaver and deinterleaver

Subcarrier interleaving also increases the resistance to noisy (fading) channels. For example, if one part of the channel bandwidth fades, subcarrier interleaving ensures that the bit errors which result from the subcarriers in the faded part of the bandwidth are spread out in the bit-stream rather than being concentrated. In this OFDM system, all encoded data bits shall be interleaved by a block interleaver. The interleaver is defined by a two-step permutation. The first permutation ensures that adjacent coded bits are mapped onto nonadjacent subcarriers. The second permutation ensures that adjacent coded bits are mapped alternately onto less and more significant bits of the constellation (the next module of the OFDM system). As a result, long runs of low reliability bits are avoided. If we denote the index of the coded bit before the first permutation by k and the index of the coded bit after the first permutation by i, the first permutation can be defined by the rule [1]

$$ i = N_{ROW} \, (k \bmod N_{COL}) + \mathrm{Floor}(k / N_{COL}), \quad k = 0, 1, \ldots, N_{CBPS} - 1 \tag{2-1} $$

In this equation, the function Floor(.) denotes the largest integer not exceeding the parameter and "a mod b" denotes the modulo operation. The parameter NCBPS is defined in Table 2-2 and 2-3. NCOL and NROW denote the number of columns and rows in the interleaver, respectively. If we denote the index of the coded bit after the second permutation by j, the second permutation can be defined by the rule [1]

$$ j = s \cdot \mathrm{Floor}(i / s) + \left( i + N_{CBPS} - \mathrm{Floor}(N_{COL} \cdot i / N_{CBPS}) \right) \bmod s, \quad i = 0, 1, \ldots, N_{CBPS} - 1 \tag{2-2} $$

The value of s is determined by the number of coded bits per subcarrier, NBPSC, according to [1]

$$ s = \max(N_{BPSC} / 2, \, 1) \tag{2-3} $$

The values for NCOL and NROW are shown in Table 2-5.

Table 2-5: Values for some parameters related to the interleaver

Parameter   CBW1         CBW2
NCOL        8            13
NROW        3 × NBPSC    4 × NBPSC

The deinterleaver, which is used to recover the order of the coded bits, is also defined by two permutations. Here the index of the original received bit before the first permutation is denoted by j and i is the index after the first permutation. The first permutation is defined by the rule [1]

$$ i = s \cdot \mathrm{Floor}(j / s) + \left( j + \mathrm{Floor}(N_{COL} \cdot j / N_{CBPS}) \right) \bmod s, \quad j = 0, 1, \ldots, N_{CBPS} - 1 \tag{2-4} $$

where s is defined in Equation 2-3. The index of the coded bits after the second permutation is denoted by k, and the second permutation is defined as [1]

$$ k = N_{COL} \cdot i - (N_{CBPS} - 1) \cdot \mathrm{Floor}(i / N_{ROW}), \quad i = 0, 1, \ldots, N_{CBPS} - 1 \tag{2-5} $$


An example is shown in Fig. 2.9. In this case, the bandwidth is 1 MHz and the MCS is 3. From Table 2-2 and Table 2-5, the values of the parameters in the above equations can be derived. The number of subcarriers is doubled after interleaving and the new order in the second picture fulfills the equations above. As a specific example, consider the case of k = 1 (the second bit after encoding, as shown in Fig. 2.9). It is mapped by the first permutation to i = 12 and by the second permutation to j = 13 (the 14th bit after interleaving). At the receiver side, the input data is recovered by the Viterbi decoder. The output of the Viterbi decoder can be compared with the input of the convolutional encoder: if no noise is added, these two data sequences should match exactly.

Figure 2-9: The coded bits before and after interleaving
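The two permutations and their inverses can be checked numerically. The sketch below implements Equations 2-1 to 2-3 and the corresponding inverse mapping; it assumes that the number of rows follows from NROW = NCBPS / NCOL (a property of a block interleaver that fills its array completely), and the function names are hypothetical.

```python
def interleave_index(k, n_cbps, n_bpsc, n_col):
    """Return (i, j): the index after the first and after the second permutation."""
    n_row = n_cbps // n_col              # assumption: N_ROW = N_CBPS / N_COL
    s = max(n_bpsc // 2, 1)              # Equation 2-3
    i = n_row * (k % n_col) + k // n_col                        # Equation 2-1
    j = s * (i // s) + (i + n_cbps - n_col * i // n_cbps) % s   # Equation 2-2
    return i, j


def deinterleave_index(j, n_cbps, n_bpsc, n_col):
    """Inverse mapping: recover the original coded-bit index k from j."""
    n_row = n_cbps // n_col
    s = max(n_bpsc // 2, 1)
    i = s * (j // s) + (j + n_col * j // n_cbps) % s   # undo the second permutation
    return n_col * i - (n_cbps - 1) * (i // n_row)     # undo the first permutation


# 1 MHz, MCS3: N_CBPS = 96, N_BPSC = 4, N_COL = 8
print(interleave_index(1, 96, 4, 8))  # (12, 13)
```

For k = 1 this gives i = 12 and j = 13, the values quoted in the example, and applying `deinterleave_index` to every j returns the original k.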

2-4-3 Modulator and demodulator

Before a digital bit sequence can be transmitted over a radio channel, it has to be pre-processed: among other things, it has to be transformed into an analog signal in continuous time and modulated onto a carrier frequency [23]. The objective of the modulator is to convert the bit stream into many parallel data pipes using segments of different sinusoidal waveforms. Basic digital modulation techniques vary only one of the three parameters of the sinusoidal wave (amplitude, frequency or phase) according to the binary data to be transmitted. The basic unit in these techniques is a symbol, which is composed of a segment of sinusoidal wave [24]. It is typical practice to describe a symbol by a point in a constellation diagram. The real and imaginary axes are often called the in-phase, or I-axis, and the quadrature, or Q-axis, respectively.

In this work, 3 modulation types are adopted (see Section 2-3), so that the modulation type can be adapted to the changing channel condition and rate request. These modulations can be divided into Phase Shift Keying (PSK) and Quadrature Amplitude Modulation (QAM). In PSK, the carrier phase is used to carry the symbol information, and the modulation signal set is

$$ s_i(t) = A \cos(\Omega_c t + \phi_i(t)), \quad 0 \le t \le T_s, \quad 1 \le i \le M = 2^q \tag{2-6} $$


where Ts is the symbol period, A is the constant carrier amplitude, M is the number of symbol points in constellation diagram and q stands for the number of bits of one symbol. In this work, q can be 1 or 2. The constellation diagrams are shown in Fig. 2.10. The conversion is performed according to Gray-coded constellation mappings. Based on different qs, they are called Binary Phase Shift Keying (BPSK) and Quadrature Phase Shift Keying (QPSK), respectively.

Figure 2-10: The constellation diagrams of BPSK and QPSK

BPSK is the simplest form of phase shift keying. It uses only two phases, which are separated by π. In the diagram, the quadrature branch is not used, so BPSK can only modulate 1 bit/symbol. QPSK modulates two bits per symbol with a minimum phase separation of π/2 and maps to 4 positions in the constellation diagram. For the 16-QAM modulation used in this work, the incoming data uses four bits to create each of the 16 possible complex-valued QAM symbols (or information symbols), as can be seen in Fig. 2.11. This technique uses both the phase and the amplitude of the sinusoidal wave. Compared to PSK, QAM yields a higher spectral efficiency and a higher data rate but costs more energy, since more bits are carried by each symbol.

At the receiver, the demodulator performs the inverse task of the modulator: it recovers the data bits from the symbols. Because of noise and distortion during the transmission, the incoming I and Q signals at the demodulator are not exactly the values assigned in the modulator. The received constellation is shown in Fig. 2.12 (here the Signal To Noise Ratio (SNR) is 10 dB). The demodulator therefore needs a threshold detector that makes a decision on each received bit. For example, in a QPSK scheme, the incoming bit-stream "10" is mapped to "1 + 1i". At the receiver, this signal may be changed to "0.7 + 0.9i". This signal is then mapped back to "10" if the threshold detector is implemented correctly. BPSK is the most robust modulation scheme, since it takes the highest level of noise or distortion to make the demodulator reach an incorrect decision. Higher order modulation has less noise margin but offers faster data rates.
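The threshold decision can be illustrated with a minimal QPSK mapper and hard-decision demapper. The bit labelling used here (bit 1 maps to +1 and bit 0 maps to -1 on each axis) is an assumption for illustration and may differ from the labelling of the constellation in Fig. 2.10; the function names are hypothetical.

```python
def qpsk_map(b0, b1):
    """Map two bits to a constellation point (assumed labelling: 1 -> +1, 0 -> -1)."""
    return complex(2 * b0 - 1, 2 * b1 - 1)


def qpsk_demap(sym):
    """Hard decision: compare the I and Q components against a threshold of zero."""
    return (1 if sym.real > 0 else 0, 1 if sym.imag > 0 else 0)


tx = qpsk_map(1, 1)       # ideal point 1 + 1j
rx = complex(0.7, 0.9)    # the same point after noise and distortion
print(tx, qpsk_demap(rx))
```

As long as the disturbance does not push the I or Q component across the threshold, the demapper returns the transmitted bit pair.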


Figure 2-11: The constellation diagrams of 16-QAM

2-4-4 Pilot inserter and remover

Each OFDM symbol in IEEE 802.11ah has a certain number of pilot subcarriers. Pilot signals are known signals transmitted on known subcarriers, so that the receiver can find them easily. In this system, the pilots are added in order to make the coherent detection robust against frequency offsets and phase noise. The locations of the pilots were introduced in Section 2-2. The pilots and zeros are inserted into the data after modulation, so that the number of subcarriers equals the FFT size.
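The subcarrier allocation of Section 2-2 can be sketched as follows. The defaults correspond to the 1 MHz case (32 subcarriers, data on -13 to -1 and 1 to 13, pilots on -7 and 7). The pilot value of 1 is a placeholder assumption, since the actual pilot sequence is defined in the standard, and the function names are hypothetical.

```python
def build_symbol(data, fft_size=32, data_range=13, pilot_pos=(-7, 7), pilot_value=1):
    """Place data and pilots on subcarriers -fft_size/2 .. fft_size/2 - 1; the rest stay zero."""
    half = fft_size // 2
    carriers = {k: 0 for k in range(-half, half)}
    symbols = iter(data)
    for k in range(-data_range, data_range + 1):
        if k == 0:
            continue  # the DC subcarrier is not used
        carriers[k] = pilot_value if k in pilot_pos else next(symbols)
    return carriers


def remove_pilots(carriers, data_range=13, pilot_pos=(-7, 7)):
    """Return only the data subcarriers, in the order in which they were inserted."""
    return [carriers[k] for k in range(-data_range, data_range + 1)
            if k != 0 and k not in pilot_pos]


data = list(range(101, 125))            # 24 placeholder data symbols (N_SD = 24)
symbol = build_symbol(data)
print(len(symbol), remove_pilots(symbol) == data)
```

For the 2 MHz case the same functions can be called with `fft_size=64`, `data_range=28` and `pilot_pos=(-21, -7, 7, 21)`, which leaves 52 data subcarriers.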

2-4-5 Fast Fourier Transform

As described in the above sections, sinusoidal components are used in an OFDM transmitter. Generally, an OFDM signal consists of N orthogonal subcarriers modulated by N parallel data streams. Each baseband subcarrier can be represented as

$$ \varphi_k(t) = e^{j 2 \pi f_k t}, \tag{2-7} $$

where $f_k$ is the frequency of the kth subcarrier. OFDM consists of many carriers. Thus the complex signal s(t) is represented as follows:

$$ s(t) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} x_k \varphi_k(t), \quad 0 < t < NT, \tag{2-8} $$

where $x_k$ is the kth complex symbol, which is derived from the modulator described above, and NT is the length of the OFDM symbol. This form of OFDM symbol could typically be received by using a bank of matched filters, which costs too much power. However, if the signal is sampled using a sampling frequency of 1/T (with $f_k = k/(NT)$), then the resulting signal is represented by

$$ s(n) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} x_k e^{j 2 \pi n k / N}, \quad 0 \le n \le N - 1. \tag{2-9} $$

Equation 2-9 can now be compared with the general form of the Inverse Discrete Fourier Transform (IDFT)

$$ g(nT) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} G\!\left(\frac{k}{NT}\right) e^{j 2 \pi n k / N}. \tag{2-10} $$

From these two equations (Equation 2-9 and Equation 2-10) we can see that the IDFT is a suitable method to generate the OFDM signal. Normally it is implemented by an IFFT, since the IFFT is an efficient way to compute it. At the transmitter, the OFDM signal is defined in the frequency domain, and each OFDM carrier corresponds to one element of the discrete Fourier spectrum that is fed to the IFFT. The amplitudes and phases of the carriers depend on the data to be transmitted and the modulation type used in the modulator [25]. At the receiver, an FFT is used to recover the data from the Fourier spectrum. Thanks to the orthogonality property, the OFDM receiver demodulates the spectrum values at those points corresponding to the maximum of the individual subcarriers [26].

Figure 2-12: The QPSK constellation diagrams with noise
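The relation between the subcarrier symbols and the time samples in Equation 2-9 can be tried out with a direct, non-optimized transform. The sketch below evaluates the sums literally (an O(N²) DFT rather than an FFT) with the 1/√N scaling used above; the function names are hypothetical.

```python
import cmath
import math


def idft(x):
    """Equation 2-9: time samples s(n) from the N subcarrier symbols x_k."""
    n_fft = len(x)
    return [sum(x[k] * cmath.exp(2j * math.pi * n * k / n_fft) for k in range(n_fft))
            / math.sqrt(n_fft) for n in range(n_fft)]


def dft(s):
    """Receiver side: recover the subcarrier symbols from the time samples."""
    n_fft = len(s)
    return [sum(s[n] * cmath.exp(-2j * math.pi * n * k / n_fft) for n in range(n_fft))
            / math.sqrt(n_fft) for k in range(n_fft)]


x = [1, 1j, -1, 0, 0.5, 0, 0, -1j]      # placeholder subcarrier symbols
recovered = dft(idft(x))
print(max(abs(a - b) for a, b in zip(x, recovered)) < 1e-9)
```

Without noise or channel distortion, the forward transform at the receiver returns the transmitted symbols up to rounding error.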

2-4-6 Cyclic Prefix (CP) insertion and removing

The basic concept of the OFDM cyclic prefix is straightforward: it is created so that each OFDM symbol is preceded by a copy of the end part of the same symbol. This concept is illustrated in Fig. 2.13. The size of the cyclic prefix field depends on the system.


Figure 2-13: The basic concept of cyclic prefix

In an OFDM system, the insertion of the cyclic prefix extends the symbol length. This method is used because it can eliminate Inter-symbol Interference (ISI), provided that the channel delay spread does not exceed the length of the cyclic prefix. When the signal is transmitted over a multi-path fading channel, multiple copies of the transmitted signal are received at different time instants, which causes interference. Figure 2.14 shows this kind of interference. The copies arrive at different times at the receiver. Without a cyclic prefix, a portion of the (n − 1)th symbol in signal 1 interferes with the nth symbol in signal 2 because of the delay. This effect changes the amplitude and phase of the signals. The cyclic prefix acts as a buffer between two consecutive OFDM symbols. The insertion of the CP allows the useful data window to be positioned so that there is no overlap with a neighbouring symbol. As a result, ISI is removed.

Figure 2-14: Inter-symbol interference with a delayed signal (without CP)
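The buffering effect of the cyclic prefix can be reproduced with two consecutive symbols and a two-path channel. In the sketch below, all values (symbol contents, an echo of half amplitude delayed by two samples, a CP of two samples) are made-up illustration values and the function names are hypothetical. With the CP, the received window of the second symbol equals the circular convolution of that symbol with the channel, so no samples of the first symbol leak in.

```python
def add_cp(symbol, n_cp):
    """Prepend a copy of the last n_cp samples of the symbol."""
    return symbol[len(symbol) - n_cp:] + symbol


def remove_cp(samples, n_cp):
    return samples[n_cp:]


def convolve(x, h):
    """Linear convolution, modelling a multi-path channel with taps h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for m, hm in enumerate(h):
            y[n + m] += xn * hm
    return y


def circular_convolve(x, h):
    n = len(x)
    return [sum(h[m] * x[(i - m) % n] for m in range(len(h))) for i in range(n)]


def second_symbol_at_receiver(sym1, sym2, h, n_cp):
    """Send sym1 then sym2 through the channel and cut out the window of sym2."""
    tx = add_cp(sym1, n_cp) + add_cp(sym2, n_cp)
    rx = convolve(tx, h)
    start = len(sym1) + n_cp  # first sample of the CP-extended second symbol
    return remove_cp(rx[start:start + n_cp + len(sym2)], n_cp)


sym1 = [1, 2, 3, 4, 5, 6, 7, 8]
sym2 = [8, -7, 6, -5, 4, -3, 2, -1]
h = [1, 0, 0.5]  # direct path plus an echo delayed by two samples
print(second_symbol_at_receiver(sym1, sym2, h, 2) == circular_convolve(sym2, h))
print(second_symbol_at_receiver(sym1, sym2, h, 0) == circular_convolve(sym2, h))
```

The first comparison holds because the echo delay fits inside the CP; the second one fails because, without a CP, the first two samples of the window contain the tail of the previous symbol.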

2-4-7 Timing synchronization

At the receiver, timing synchronization is an important block of the digital baseband that strongly influences complexity and power consumption. This block estimates the end point of the STF (see Section 2-2). Because the cyclic prefix extends each symbol, a certain amount of frame timing error can be tolerated.

As described in Section 2-2, the STF at the front of the preamble is used for timing synchronization. The STF is also used for energy detection and Automatic Gain Control (AGC); these two blocks are not included in this work. In general, because of the energy detection and AGC modules, only part of the training symbols is available for timing synchronization. For the high level model, we assume that two thirds of the short training symbols are left after the AGC process. In the S1G_1M frame format, 4 OFDM symbols are placed at the front of the OFDM burst. The short training symbols in the frequency domain are given by [1]

$$ STF_{-16,15} = \sqrt{2/3} \cdot \{0, 0, 0, 0, 0.5 + 0.5j, 0, 0, 0, -1 - j, 0, 0, 0, 1 + j, 0, 0, 0, 0, 0, 0, 0, -1 - j, 0, 0, 0, -1 - j, 0, 0, 0, 1 + j\}, \tag{2-11} $$

where 6 out of 32 subcarriers are used. The constant $\sqrt{2/3}$ normalizes the short training sequence such that the average transmitted power is 1. Before transmission, the training symbols also pass through the inverse FFT and the cyclic prefix insertion. Moreover, similar to the data field, the training sequence needs to be padded with zeros before the inverse FFT. The final extended sequence in the S1G_1M frame format represents 20 "short symbols" in total, and each symbol has 16 samples (see Fig. 2.15).

Figure 2-15: Magnitude of short training sequence in S1G_1M

Similarly, in the S1G_SHORT frame format the training symbols in the frequency domain are given by [1]:

$$ STF_{-32,31} = \sqrt{1/2} \cdot \{0, 0, 0, 0, 0, 0, 0, 0, 1 + j, 0, 0, 0, -1 - j, 0, 0, 0, 1 + j, 0, 0, 0, -1 - j, 0, 0, 0, -1 - j, 0, 0, 0, 1 + j, 0, 0, 0, 0, 0, 0, 0, -1 - j, 0, 0, 0, -1 - j, 0, 0, 0, 1 + j, 0, 0, 0, 1 + j, 0, 0, 0, 1 + j, 0, 0, 0, 1 + j, 0, 0, 0, 0, 0, 0, 0\}. \tag{2-12} $$


The final extended sequence in the S1G_SHORT frame format represents 10 "short symbols" in total, and each symbol has 32 samples. The total length of the STF is therefore the same for both frame formats, namely 320 samples.

Based on the short training symbol and its periodicity, several synchronization schemes have been proposed. Schmidl and Cox [27] presented a method for time synchronization: if two training symbols are placed at the beginning of the frame, the correlation between these two symbols can be used to compute a timing metric in the receiver. However, this metric suffers from a plateau, which leads to potential errors in determining the start point of the frame. The method is improved in [28] and [29] by modifying the structure of the STF, which reduces the plateau problem but does not solve it completely. All these algorithms perform a so-called auto-correlation of the received signal to detect the symbol. An OFDM cross-correlation algorithm was proposed in [30] and [31], in which the received signal is correlated with a known training symbol. This method avoids the plateau issue but increases the computational complexity. The two approaches are combined in this work to achieve an accurate synchronization while keeping the resulting power consumption in hardware low. This is explained in detail in the remainder of this section.

The first step of symbol timing estimation is carried out by auto-correlation, which measures the similarity between observations as a function of the time lag between them. Since the training symbol is periodic, the samples of the STF can be processed by the auto-correlator. It can be derived as [27]

$$ P(d) = \sum_{k=0}^{N-1} r(d + k) \, r^{*}(d + k - N) \tag{2-13} $$

where $P(d)$ is the result of the auto-correlation at time index d, N is the length of one short training symbol, and $r^{*}(d + k - N)$ is the complex conjugate of $r(d + k - N)$.

If the received signal is inside the range of the STF, the products of these pairs of samples all have almost the same phase, so the magnitude of the sum is large. We therefore expect the plot of $P(d)$ to be steady at the start of the frame and to fall quickly once the incoming signal is outside the STF.

To build this auto-correlator in MATLAB, Equation 2-13 can be implemented with the iterative formula [27]

$$ P(d + 1) = P(d) + r(d + N) \, r^{*}(d) - r(d) \, r^{*}(d - N) \tag{2-14} $$

Using Equation 2-14 instead of Equation 2-13 saves a number of multiplications, because each new output requires only two products instead of N. This is valuable for saving power in the hardware implementation; the analysis is presented in the next chapter. To normalize by the received power, the following quantity is used [27]:

$$ R(d) = \sum_{k=0}^{N-1} |r(d + k)|^2, \tag{2-15} $$

which can also be calculated iteratively. A timing metric can then be defined as [27]

$$ M(d) = \frac{|P(d)|}{|R(d)|} \tag{2-16} $$
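The iterative update can be checked against the direct sum with a small model. The sketch below indexes the sums with k = 0, ..., N − 1, which is the convention under which the update of Equation 2-14 holds exactly; the test signal in the usage example is made up and the function names are hypothetical.

```python
def autocorr_direct(r, d, n):
    """P(d) evaluated as a direct sum over one short training symbol of length n."""
    return sum(r[d + k] * r[d + k - n].conjugate() for k in range(n))


def timing_metric(r, n):
    """Return {d: M(d)} using iterative updates of P(d) and R(d)."""
    d = n
    p = autocorr_direct(r, d, n)
    energy = sum(abs(r[d + k]) ** 2 for k in range(n))
    metric = {}
    while True:
        metric[d] = abs(p) / energy if energy > 0 else 0.0
        if d + n >= len(r):
            break
        # Slide the window by one sample: add the newest product, drop the oldest.
        p += r[d + n] * r[d].conjugate() - r[d] * r[d - n].conjugate()
        energy += abs(r[d + n]) ** 2 - abs(r[d]) ** 2
        d += 1
    return metric


short_symbol = [1 + 1j, -1 + 0.5j, 0.5 - 1j, -1 - 1j]          # made-up training symbol
received = short_symbol * 5 + [0.3 + 0j, -0.2j, 0.1 + 0j, 0.4j]  # followed by other samples
m = timing_metric(received, len(short_symbol))
print([round(m[d], 3) for d in sorted(m)])
```

While both correlation windows lie inside the repeated part of the signal, the metric stays at 1; it changes once samples from outside the periodic part enter the window.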


Figure 2-16: Block diagram of the auto-correlation algorithm (Equation 2-14 with N = 16)

The values of M(d) should fall between 0 and 1, and they are compared with a predefined threshold, Mth. The auto-correlation is illustrated in a more straightforward way in Fig. 2.16. This block diagram shows the algorithm for the S1G_1M frame format. In this case, the short training field contains 20 short training symbols of 16 samples each, so the parameter N is 16. The received sample r(d) is correlated with the conjugate of r(d − 16). After that, the correlated samples are averaged in the moving average block over a period of time. The result from the moving average block is P(d) in the above equations, which can be normalized into M(d). The value is compared with the predefined threshold. If it is larger than the threshold, we can assume that the corresponding sample is still inside the STF range.

Fig. 2.17 shows the auto-correlator output for the S1G_1M case with an SNR of 20 dB. We assume that 100 samples are used in the energy detection and AGC process, so the start point of the incoming burst is the 101st sample of the training field and 220 samples of the STF are left for synchronization. Since N is 16 in this case, the auto-correlation starts with the 17th incoming sample, after which the output is steady. When the incoming samples no longer belong to the STF, the output starts to fall. However, it is hard to find this sample exactly by using only auto-correlation because of the plateau issue mentioned above. In this step we therefore only define a threshold and find the first point which is smaller than the threshold. To define the threshold, the auto-correlation is pre-calculated for 2 repeating sequences of the STF, and the threshold is derived by scaling the pre-calculated value. This point will be used in the second step of synchronization.

By auto-correlation we find an index near the end point of the STF. To find the exact position of this point, cross-correlation is used. It is carried out between the incoming sequence and a predefined sequence (which is one repeating sequence of the STF) stored in the receiver. It can be described as

$$ X(d) = \sum_{i=0}^{N-1} r(d + i) \, s^{*}(i) \tag{2-17} $$

The result can be normalized into C(d) in the same way as for the auto-correlation. Unlike the auto-correlation, the cross-correlation cannot be computed iteratively. At the time index d − 1, the cross-correlation function can be written as

$$ X(d - 1) = r(d - 1) s^{*}(0) + r(d) s^{*}(1) + \ldots + r(d + N - 2) s^{*}(N - 1) \tag{2-18} $$


Figure 2-17: Output of auto-correlation for an incoming signal with an SNR of 20 dB (S1G_1M)

At the time index d, Equation 2-17 can be written as

$$ X(d) = r(d) s^{*}(0) + r(d + 1) s^{*}(1) + \ldots + r(d + N - 1) s^{*}(N - 1) \tag{2-19} $$

Comparing the above two functions, the multiplication terms are completely different from each other, so the whole calculation has to be repeated for each sample. This property makes cross-correlation more compute-intensive than auto-correlation. The block diagram of the cross-correlation for the S1G_1M case is shown in Fig. 2.18. Ideally, the predefined sequence s(d) should be exactly the same as a part of the training symbols. As a result, the ideal peak value of C(d) is 1. Unlike the auto-correlation output, the cross-correlation output shows periodic peaks instead of steady values while the incoming samples are inside the short training field. This makes the method more accurate, so that the exact end point of the STF can be detected. Fig. 2.19 shows the cross-correlation output for an incoming signal with an SNR of 20 dB. This is for the S1G_1M case and the incoming sequence is the same as in Fig. 2.17. The peaks occur every 16 samples and the ratio between the peaks and the rest of the output is large enough for them to be distinguished.

The output of the combined algorithm is shown in Fig. 2.20. The output signal with the flat shape represents the first step, which is carried out by the auto-correlator. This flat signal cannot be used to find the target point accurately, since it is difficult to find the end of the flat curve. By comparing with the threshold, the point "A" with a time index later than the end of the STF is found. The cross-correlation is then performed from the point "A". The nearest peak of the cross-correlation output curve is found and regarded as the end point of the STF ("B" in the figure). Adding the cross-correlation increases the related power consumption in hardware only minimally, since it is performed on only a few samples, while the robustness of the synchronization is greatly improved. If only the cross-correlation were used (as shown in Fig. 2.19), all the samples in the STF would have to be processed by the cross-correlator, and the resulting power consumption would be much larger.
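The two-step search can be sketched on a toy signal. The example below is a simplified illustration under several assumptions: the threshold value, the search range of one symbol length before point A, and the signal itself (a 4-sample symbol repeated 5 times, followed by unrelated samples) are made up, and the function names are hypothetical. It is not the MATLAB model of this work.

```python
def autocorr_metric(r, d, n):
    """M(d): auto-correlation with the previous symbol, normalized by the window energy."""
    p = sum(r[d + k] * r[d + k - n].conjugate() for k in range(n))
    energy = sum(abs(r[d + k]) ** 2 for k in range(n))
    return abs(p) / energy if energy > 0 else 0.0


def crosscorr_metric(r, s, d):
    """C(d): cross-correlation with the stored symbol s, normalized in the same way."""
    x = sum(r[d + i] * s[i].conjugate() for i in range(len(s)))
    energy = sum(abs(r[d + i]) ** 2 for i in range(len(s)))
    return abs(x) / energy if energy > 0 else 0.0


def find_stf_end(r, s, threshold):
    """Return (A, end): the coarse index A and the estimated first index after the STF."""
    n = len(s)
    # Step 1 (coarse): first index whose auto-correlation metric drops below the threshold.
    # next() raises StopIteration if the metric never drops below the threshold.
    a = next(d for d in range(n, len(r) - n + 1) if autocorr_metric(r, d, n) < threshold)
    # Step 2 (fine): strongest cross-correlation peak within one symbol length before A.
    b = max(range(a - n, a + 1), key=lambda d: crosscorr_metric(r, s, d))
    return a, b + n  # b is the start of the last training symbol


s = [1, 2, 3, 4]                   # made-up short training symbol
r = s * 5 + [-4, 1, -2, 3]         # 20 training samples followed by other samples
print(find_stf_end(r, s, 0.8))     # (17, 20)
```

In this toy case the coarse step stops at index 17, and the cross-correlation peak at index 16 marks the start of the last training symbol, so the training field is found to end after sample 19. The cross-correlator is evaluated at only five indices instead of over the whole training field.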


Figure 2-18: Block diagram of the cross-correlation algorithm (S1G_1M)

Figure 2-19: Output of cross-correlation for an incoming signal with an SNR of 20 dB (S1G_1M)

2-4-8 Channel estimation

In OFDM systems, the data symbols coming from the transmitter suffer from the channel distortion and noise. As a result, the received signal y(t) is represented as

$$ y(t) = x(t) * h(t) + w(t) \tag{2-20} $$

where the symbol '$*$' denotes convolution, x(t) denotes the signal from the transmitter, w(t) denotes the Additive White Gaussian Noise (AWGN) and h(t) denotes the Channel Impulse Response (CIR). For different wireless communication fading channels, the channel impulse response is modeled using various statistical distributions such as Rayleigh, Nakagami-m, etc. In this work, a frequency selective Rayleigh fading channel model is adopted. It is the most commonly used channel model in Wi-Fi standards since it mimics most real scenarios (see [32]).


Figure 2-20: Correlation output for an incoming signal with an SNR of 20 dB (S1G_1M)

It is defined as follows:

$$ h(n) = \frac{1}{\sqrt{n}} \left[ h_1(t - t_1) + h_2(t - t_2) + \ldots + h_n(t - t_n) \right]. \tag{2-21} $$

In this equation, the term $1/\sqrt{n}$ is used to normalize the average channel power. The other terms in this equation stand for the channel coefficients of the corresponding taps. If inter-symbol interference is eliminated by the cyclic prefix, the above equation can be transformed into the frequency domain as:

$$ Y_k = H_k X_k + w_k, \quad k = 0, 1, \ldots, N - 1, \tag{2-22} $$

where $H_k$ is the frequency response of h(t):

$$ H_k = \sum_{m=0}^{v-1} h_m e^{-j 2 \pi m k / N}, \tag{2-23} $$

Here, Hk is also called the Channel Frequency Response (CFR). Xk and wk are the frequency responses of x(t) and w(t), respectively. We assume that h(t) (and hence Hk) remains constant over many OFDM blocks. The system can be viewed as feeding the sequence Xk into a digital finite impulse response filter and then sampling the output. These equations show that, in order to recover the transmitted signal, h(t) or Hk needs to be estimated. This is the purpose of the channel estimation module. The principle is shown in Fig. 2.21.

The channel estimation block of this baseband system is based on the LTF, which was introduced in Section 2-2. The 802.11ah amendment uses a block-type pilot structure, since the OFDM symbols of the LTF transmit training symbols across all 32 or 64 subcarriers. In the S1G_1M frame format, 4 OFDM symbols are placed directly after the STF. The long training symbols in the frequency domain are given by [1]

$$ LTF_{-16,15} = \{0, 0, 0, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 0, -1, -1, -1, 1, -1, -1, -1, 1, -1, 1, 1, 1, -1, 0, 0\} \tag{2-24} $$


Figure 2-21: Channel estimation principle [33]

In the S1G_SHORT frame format, the length of each OFDM symbol is doubled and 2 OFDM symbols are put in front of the OFDM burst. As a result, the length of LTF in this case is also 320 samples. The long training symbols in the frequency domain are given by [1]

$$ LTF_{-32,31} = \{0, 0, 0, 0, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 0 \ldots 1, -1, -1, 1, 1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, -1, -1, 0, 0, 0\} \tag{2-25} $$

Similar to the STF, the long training sequence also needs to be modulated by the IFFT and the subsequent processing steps, which are described in the IEEE 802.11 standard [1]. Since the training field is block-type pilot based, the channel can be estimated using a Least Square (LS) estimator or a Minimum Mean Square Error (MMSE) estimator [34]. In the frequency domain, the signal and the channel response can be written as matrices. If we write Equation 2-22 in matrix form, the signals can be represented as

$$ \begin{aligned} X &= \mathrm{diag}\{X_0, X_1, \ldots, X_{N-1}\} \\ Y &= [Y_0, Y_1, \ldots, Y_{N-1}]^T \\ W &= [W_0, W_1, \ldots, W_{N-1}]^T \\ H &= [H_0, H_1, \ldots, H_{N-1}]^T \end{aligned} \tag{2-26} $$

The LS estimate of the channel is represented as:

$$ H_p = \frac{Y_p}{X_p}, \quad p = 0, 1, \ldots, N - 1 \tag{2-27} $$

Here p stands for the pilot index of the LTF (see Equation 2-24 and 2-25). Compared to an MMSE estimator, the LS estimator suffers from a higher Mean-Square Error (MSE) but achieves very low complexity (see Appendix A). In this work, the channel model is a slowly fading channel and the MSE caused by an LS estimator is acceptable. To obtain a low-complexity algorithm and a low power consumption in hardware, the LS estimator is chosen for channel estimation. The estimated channel is then used to equalize the data field in order to recover the transmitted signal.

To verify this module, the channel estimated by the LS estimator can be compared with the frequency response of the channel model, as shown in Fig. 2.22. Here, the channel bandwidth is 2 MHz and no noise is added. The CFR converted from the original channel model function matches the estimated CFR perfectly, which verifies the LS estimator.

Figure 2-22: Channel frequency response for the original channel and estimated channel (overlapping)
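The LS estimate of Equation 2-27 and the subsequent equalization amount to one division per subcarrier. The sketch below uses a made-up 4-subcarrier training symbol and made-up channel coefficients rather than the LTF and channel model of this work; subcarriers on which the training symbol is zero cannot be estimated and are marked with `None`. The function names are hypothetical.

```python
def ls_estimate(y, x):
    """H_p = Y_p / X_p on every subcarrier where the training symbol is non-zero."""
    return [yp / xp if xp != 0 else None for yp, xp in zip(y, x)]


def equalize(y, h_est):
    """Divide the received data by the estimated channel; unused subcarriers become 0."""
    return [yk / hk if hk else 0 for yk, hk in zip(y, h_est)]


training = [0, 1, -1, 1]                         # made-up training symbol
channel = [0.5 + 0.5j, 1 - 0.2j, 0.3j, -0.8]     # made-up channel frequency response
rx_training = [h * x for h, x in zip(channel, training)]  # noise-free reception

h_est = ls_estimate(rx_training, training)
data = [0, 1 + 1j, -1 + 1j, 1 - 1j]
rx_data = [h * d for h, d in zip(channel, data)]
print(h_est)
print(equalize(rx_data, h_est))
```

Without noise, the estimate equals the channel on every used subcarrier and the equalizer returns the transmitted data, which mirrors the noise-free comparison described above.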

2-4-9 Oversampling

As described in Chapter 1, the digital baseband system needs to be connected to the front-end. The values at the output of the transmitter are the samples from which the Digital-to-analog Converter (DAC) generates the analogue signal. In this work, the sampling frequency of the front-end is fixed at 8 MHz, so the transmitted data (1 MHz or 2 MHz) needs to be oversampled by a factor of 8 or 4, respectively, before transmission. The sampling process produces aliases next to the OFDM signal. To separate the OFDM signal from the aliases, zero-padding is introduced during the IFFT process. Fig. 2.23 (A) shows the block diagram of the oversampling process. In addition to the input sequence, zeros are also fed into the IFFT module, which doubles the FFT size. For instance, in the S1G_1M case, the information is coded on 32 subcarriers but a 64-point IFFT/FFT is adopted. This provides an oversampling rate of 2 (also referred to as frequency-domain oversampling). The OFDM symbol is then oversampled again in the time domain before transmission; this step is performed by the Low-pass Filter (LPF). Since the number of samples in each OFDM symbol in the S1G_SHORT case is twice that in the S1G_1M case, the oversampling rates in this step are not the same: the remaining factor is 4 for S1G_1M and 2 for S1G_SHORT.

In the receiver, down-sampling is performed in a similar way. The effect of the LPF is shown in Fig. 2.23 (B): the zero-padded frequencies are those around the Nyquist frequency, and the OFDM signal is placed onto the subcarriers around 0 Hz. As a result, the aliases are shifted away from the OFDM signal and can be removed.

Figure 2-23: Oversampling principle
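The frequency-domain oversampling step can be illustrated with a small Python sketch. It assumes a naive inverse DFT with 1/N scaling and an arbitrary 32-value test spectrum; the helper names are hypothetical. Zeros are inserted around the Nyquist bin so that the occupied subcarriers stay centred on 0 Hz.

```python
import cmath

def idft(spectrum):
    """Naive inverse DFT with 1/N scaling."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

def zero_pad_spectrum(spectrum, n_out):
    """Insert zeros around the Nyquist bin; positive and negative
    frequencies keep their positions relative to 0 Hz."""
    n_in = len(spectrum)
    half = n_in // 2
    return spectrum[:half] + [0j] * (n_out - n_in) + spectrum[half:]

n_sub = 32
# Arbitrary +/-1 values standing in for the mapped subcarriers.
spectrum = [complex(1 if (k * 7) % 5 < 3 else -1) for k in range(n_sub)]

x_base = idft(spectrum)                                # 32 time samples
x_over = idft(zero_pad_spectrum(spectrum, 2 * n_sub))  # 64 time samples
```

Every second sample of the 64-point output equals the corresponding 32-point sample scaled by 0.5 (a consequence of the 1/N scaling), so the zero-padded IFFT returns the same waveform on a grid that is twice as fine.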

2-5 High level simulation results

This high level model is built block by block. The modules are first verified by comparing the preliminary simulation results with the theory and formulas.

Once a module in Tx and its corresponding module in Rx are built, a bit stream can be used to verify the functionality of these modules. After all the modules in the OFDM system are built and verified, the performance of this high level model is evaluated by Bit Error Rate (BER) and Packet Error Rate (PER) tests. BER (PER) is the number of bit (packet) errors divided by the total number of transferred bits (packets). In a wireless communication system, the error rate may be affected by channel noise, distortion, channel fading, timing synchronization problems, etc. In a packet-based communication scenario, PER is the more practical indicator.


2-5-1 Verification

The verification is performed for each module in the OFDM system by comparing bit streams, for instance the bit stream fed into the interleaver and the bit stream output by the deinterleaver. In this step, no noise is introduced into the system and the bit streams are compared bit by bit. The block diagram in Fig. 2.24 shows the bit streams that can be compared. In this block diagram, the modules on the left are the Tx modules and the modules on the right are the Rx modules. The related bit streams can be compared to verify the modules, as shown in this figure. The basic verification of the channel estimator and the synchronizer is not shown in this figure. The channel estimator is verified by comparing the estimated channel with the original channel (see Fig. 2.22). The synchronizer can be verified by checking whether the detected point is the last point in the STF.

Figure 2-24: The bit streams which can be compared to verify the modules

2-5-2 Floating-point simulation

After all the modules in this system are modeled and verified, the simulation in system level is performed.


In theory, each type of modulation has its own error rate function [35]. In general, the higher the order of the modulation scheme, the less robust the system is. This was explained in Section 2-4-3.

The bit error rate, Pb, for uncoded BPSK is [35]

\[ P_b = Q\left(\sqrt{\frac{2E_b}{N_0}}\right) \tag{2-28} \]

Here, Eb is the bit energy and N0 is the noise power spectral density. Q is defined by

\[ Q(x) = \frac{1}{2}\,\mathrm{erfc}\left(\frac{x}{\sqrt{2}}\right) \tag{2-29} \]

Eb/N0 is a normalized SNR measure and is an important parameter in digital communication which reflects the reliability of a communication system. Eb can be represented by

\[ E_b = \frac{S}{R_b} \tag{2-30} \]

Here, Rb is the bit rate and S is the signal power. To build the relation between Eb/N0 and SNR, the noise power N0 is then introduced in the above equation:

\[ \frac{E_b}{N_0} = \frac{S}{R_b N_0} \tag{2-31} \]

As a result, the relation between Eb/N0 and SNR is derived:

\[ \mathrm{SNR} = \frac{S}{N_0} = \frac{R_b E_b}{N_0} \tag{2-32} \]

The theoretical BER for QPSK is also given by Equation 2-28. This is because in QPSK, two independently modulated carriers (I and Q) are used. Each of these carriers can be viewed as a BPSK modulation process. As a result, the BER for QPSK is the same as for BPSK. The theoretical BER for 16-QAM is [35]

\[ P_b = \frac{3}{8}\,\mathrm{erfc}\left(\sqrt{\frac{2E_b}{5N_0}}\right) \tag{2-33} \]
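Equations 2-28, 2-29 and 2-33 can be evaluated directly with the Python standard library. The sketch below assumes that Eb/N0 is supplied in dB, which is a convention chosen for this example; the function names are illustrative.

```python
import math

def q_function(x):
    """Q(x) = 0.5 * erfc(x / sqrt(2)), Equation 2-29."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def ber_bpsk(ebn0_db):
    """Uncoded BPSK (and QPSK) bit error rate, Equation 2-28."""
    ebn0 = 10 ** (ebn0_db / 10)      # dB to linear ratio
    return q_function(math.sqrt(2 * ebn0))

def ber_16qam(ebn0_db):
    """Uncoded 16-QAM bit error rate, Equation 2-33."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.375 * math.erfc(math.sqrt(2 * ebn0 / 5))

# Print a few points of both curves.
for ebn0_db in (0, 4, 8):
    print(ebn0_db, ber_bpsk(ebn0_db), ber_16qam(ebn0_db))
```

At any given Eb/N0 the 16-QAM value is larger than the BPSK value, which is the robustness ordering stated at the start of this subsection.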

The theoretical BER curves for these three cases are shown in Fig. 2.25. The first step of high-level verification is to compare the basic BER simulation results with the theoretical error probability. This is shown in Fig. 2.26. The simulated BER versus Eb/N0 (BPSK modulation scheme) matches the theoretical BER curve well, which validates the OFDM communication system. Similar comparisons are performed for QPSK and 16-QAM, which also match their corresponding theoretical BER curves.

Figure 2-25: Theoretical BER for BPSK, QPSK and 16-QAM

Figure 2-26: BER simulation over an AWGN channel and theoretical BER for BPSK (the two curves overlap)

In a wireless communication system, a number of OFDM packets are transmitted at a time. Therefore, PER is a better parameter to reflect the robustness of the system. In contrast to BER, PER also depends on the packet length. Most communication systems have a certain tolerance for bit errors. According to the standard [1], the maximum allowable PER in the system presented in this thesis is 10^-1. Figure 2.27 shows the PER versus SNR curves for S1G_1M. The number of OFDM symbols in each packet is 50 and the packet length varies depending on the modulation and coding scheme. The signal is transmitted over an AWGN channel and the frame synchronization is assumed to be correct, so the only contribution to errors comes from the data field. The simulation results are summarized in Table 2-6 and presented as "Testcase 1".

Figure 2-27: PER vs SNR for all the supported S1G_1M cases

In Fig. 2.28, the PER versus SNR with the timing synchronizer is compared with the corresponding curve in Fig. 2.27 (CBW1, MCS1). This figure shows that the timing synchronizer costs around 2.5 dB in this case. The PERs for all the cases are summarized in Table 2-6 and presented as "Testcase 2".

Figure 2-28: Effect of timing synchronization on PER (CBW1, MCS1)

The PER simulation is also performed over a Rayleigh fading channel with the channel estimation block added to the system (named "Testcase 3"). The corresponding simulation result is compared with the PER for the AWGN channel in Fig. 2.29. In this case, the fading channel requires 3.5 dB more SNR than the AWGN channel at a PER of 10^-1. The PER simulation results for all the supported cases are presented in Table 2-6. The SNR listed in the table is the SNR at which the PER is 0.1.
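The dependence of PER on packet length can be made concrete with a simple model. The sketch below assumes independent bit errors and no channel coding, which is a textbook simplification and not the coded system simulated in this chapter; the packet lengths are hypothetical.

```python
def packet_error_rate(ber, packet_bits):
    """PER of a packet of packet_bits bits when every bit fails
    independently with probability ber: 1 - (1 - ber)^packet_bits."""
    return 1.0 - (1.0 - ber) ** packet_bits

# Same BER, two hypothetical packet lengths.
per_short = packet_error_rate(1e-4, 1200)
per_long = packet_error_rate(1e-4, 2400)
```

Under this model a packet is lost as soon as one of its bits is wrong, so doubling the packet length at a fixed BER raises the PER. This is why the packet length must be stated alongside a PER figure.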

2-5-3 Fixed-point simulation

To convert the high level model into hardware, the first step is to convert the data type from floating-point to fixed-point. This is because in hardware, due to implementation limitations, the number of bits in each block is fixed and floating-point numbers cannot be used. However, fixed-point numbers introduce inaccuracy. As a result, a trade-off between the power consumption, the area and the accuracy should be considered (see Section 3-1). Fixed-point modeling begins by converting the floating-point numbers into fixed-point numbers or integers by quantization. The fixed-point simulation can then be run with the new numbers. The goal of the fixed-point simulation is to find the minimum number of bits in each block that does not influence the PER vs SNR curve significantly. The fixed-point simulation result (CBW1, MCS3) for different resolutions is shown in Fig. 2.30. In this simulation, the resolution of all the sub-modules in the system is the same. The figure shows that the number of bits for the system should be 7 or more. For a more accurate estimate, the fixed-point simulation can be broken down so that each sub-module has a different number of bits. For example, to explore the resolution of the FFT module, the number of bits of the FFT module is varied while the number of bits of the other modules is set to 7. With these simulation results and the implementation in hardware, the resolution of each module is determined, as shown in Chapter 3.

Figure 2-29: Effect of Rayleigh Fading channel (CBW1, MCS1)

Table 2-6: SNRs where PER equals 0.1 for all the supported cases

CBW   MCS   SNR (dB)
            Testcase 1   Testcase 2   Testcase 3
1     0     -2.5         2            4.8
1     1     2            4.5          8
1     3     9.8          10.2         10.8
2     0     0.5          2.5          6
2     1     4            4.6          8.5
2     3     10           11.8         14.8


Figure 2-30: Fixed-point simulation (CBW1, MCS3)
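The quantization step can be sketched as follows. This is a minimal example that assumes signed two's-complement codes, round-to-nearest and saturation at the limits; the scaling convention, the sample values and the function names are illustrative and not taken from the thesis model.

```python
def quantize(value, n_bits, full_scale=1.0):
    """Map value to a signed n_bits integer code, saturating at the limits."""
    max_code = 2 ** (n_bits - 1) - 1
    min_code = -(2 ** (n_bits - 1))
    code = round(value / full_scale * max_code)
    return max(min_code, min(max_code, code))

def dequantize(code, n_bits, full_scale=1.0):
    """Convert an integer code back to a real value."""
    return code * full_scale / (2 ** (n_bits - 1) - 1)

def worst_error(samples, n_bits):
    """Largest absolute quantization error over the given samples."""
    return max(abs(s - dequantize(quantize(s, n_bits), n_bits)) for s in samples)

# Illustrative sample values within the full-scale range.
samples = [0.7071, -0.3162, 0.05, -1.0]
for n_bits in (4, 7, 8):
    print(n_bits, worst_error(samples, n_bits))
```

The worst-case error shrinks as bits are added, while every extra bit widens the registers and arithmetic in hardware. The fixed-point simulation looks for the smallest word length at which the PER curve is still essentially unchanged.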

2-6 Summary

In this chapter, the concept of OFDM system was introduced. It is an effective technique for wireless communication. The sub-blocks and the related telecommunication theory of the baseband system were also described. The input data stream is modulated, mapped onto subcarriers, converted into time domain by IFFT and combined into OFDM symbols. These symbols are over-sampled and transmitted by the transmitter. At the receiver, the received signal suffers from noise and distortions caused by the channel, so synchronization and channel estimation are added to recover the signal. These are followed by FFT, demapping and demodulation modules. The simulation results in high level are also presented in this chapter.

Chapter 3

Hardware implementation

Register Transfer Level (RTL) design is a method of describing a circuit and is widely used for the digital design of complex logic circuits. An RTL description is a high level representation of a circuit. Rather than describing the gate-level implementation, it describes the circuit in terms of the flow of signals between hardware registers and the logical operations performed on these signals. In other words, it models how signals are transformed and transferred between registers. An RTL design can be written in Hardware Description Languages (HDLs) such as Verilog and VHDL. An RTL design allows the synthesis tool to optimize the functionality the designer has specified and to perform the gate-level implementation.

This chapter presents an RTL design for the OFDM system discussed in Chapter 2. It is written in VHDL and tested by a SystemVerilog testbench. The chapter starts with an overview of the RTL description and its design flow. Then the RTL implementation is described in detail. In addition to the critical modules introduced in Chapter 2, some new modules are introduced in the RTL design to make it more practical. This is followed by a description of the logic synthesis. Finally, a brief summary of this chapter is given.

3-1 Implementation design flow

The design flow for hardware implementation is shown in Fig. 3.1. It starts with the floating-point modeling described in Chapter 2. The optimization and modification at system level are performed in this step. This step is followed by fixed-point modeling and simulation. Since the number of bits in each block determines the complexity and the power consumption of the system, the minimum number of bits needs to be explored to decrease the complexity. However, decreasing the number of bits introduces more inaccuracy into the system. As a result, a trade-off between the complexity, the power consumption and the performance of this system is needed. Based on the fixed-point simulation results in Section 2-5-3, the number of bits in each block is determined (see Section 3-2).


Figure 3-1: The design flow for hardware implementation

After the fixed-point simulation, the design is modeled in terms of registers and combinational logic. This RTL description is implemented in VHDL and verified by a SystemVerilog testbench. The environment for RTL simulation is Cadence IUS. Once the VHDL code is complete, the synthesis tool translates the RTL description into register elements and combinational logic, and then into an optimized gate-level netlist. The power consumption and performance of this circuit can be simulated based on the netlist and are presented in Chapter 4.

3-2 Quantization

By comparing the performance (PERs) of systems with different resolutions (see Section 2-5-3), the number of bits for each block in RTL is determined and presented in Table 3-1. These parameters work for the MCS0 and MCS1 cases. Note that only the inputs and outputs of each block are shown in this table. In some blocks, internal signals need to be defined with a different number of bits, since there are complex computations inside these blocks. For example, in the synchronizer, the inputs are 8-bit integers. However, as described in Section 2-4-7, multiplication and summation are involved in this module (for auto-correlation and cross-correlation). As a result, the signals that hold the products from the correlator need a higher resolution. The resolution of the input and the output of a module can also differ. For instance, the inputs of the pilot inserter are 8-bit, while the outputs of this module are 16-bit. This is because the following IFFT module needs 16-bit inputs, which are the outputs of the pilot inserter.

Table 3-1: Number of bits in each block (MCS0 and MCS1)

Block            Inputs (number of bits)   Outputs (number of bits)
Modulator        8                         8
Pilot inserter   8                         16
IFFT             16                        16
CP adder         16                        8
Synchronizer     8                         8
CP remover       8                         16
FFT              16                        16
Pilot remover    16                        8
Demodulator      8                         8

If the modulation type is 16-QAM (MCS3), the resolution of the FFT and IFFT modules needs to be higher, since the computation must be more accurate. In this case, the number of bits for the FFT and IFFT modules (input and output) is 24.

3-3 RTL Implementation

Based on the OFDM model in Chapter 2, the critical modules are implemented in RTL (the convolutional encoder, Viterbi decoder and channel estimator are not included in this RTL model). The OFDM system is a synchronous system, which means that all the registers are driven by a single clock signal. However, the number of clock cycles required differs from block to block. As a result, the latency of each block should be considered to achieve a real time implementation. To solve this problem, a synchronous enable signal is added to each block to compensate for the different numbers of clock cycles [36]. In general, an enable signal is a signal which turns a module on or off. In this work, the enable signals are divided into two types: the dia signal and the doa signal. The dia signal is an input to a module and the doa signal is an output of a module. When the dia signal is high, the inputs are valid. Likewise, when the output is valid, the doa signal is high. A basic example is shown in Fig. 3.2: when the output of block "A" is ready, its doa signal goes high and acts as the dia signal of block "B" to trigger block "B". With the dia and doa signals, the blocks run only when their dia signals are high. As a result, the dynamic power consumption is reduced during the "off" time.

A block diagram of the transmitter is presented in Fig. 3.3. The input data coming from the MATLAB file is first fed into the mapping (BPSK, QPSK and 16-QAM) module. Each data word is converted into an 8-bit integer here. Then the pilots are inserted and zeros are padded. This is followed by two other blocks, "IFFT" and "CP Extension". At the end of the transmitter, a buffer is added to output the generated OFDM symbols in real time. This buffer also provides a feedback signal to the transmitter controller to trigger the "processor" blocks.

The block diagram of the receiver is shown in Fig. 3.4. The receiver is split into 6 major blocks. The first block is the synchronizer. After the last sample in the STF is detected, the buffer is used to split the continuous OFDM symbols into periodic symbols. These symbols are then processed in the "processor" blocks, which are introduced in detail in the following subsections.


Figure 3-2: A basic example for "dia − doa" structure
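The dia − doa chaining can be mimicked with a small cycle-based Python model. This is a behavioural sketch only, not the VHDL of this work: the `Block` class, the latencies of 3 and 2 cycles and the arithmetic performed by each block are hypothetical.

```python
class Block:
    """Toy clocked block: doa goes high for one cycle once `latency`
    clock edges (including the edge that samples dia) have passed."""

    def __init__(self, latency, func):
        self.latency = latency
        self.func = func
        self.remaining = 0      # clock cycles left before the result is valid
        self.value = None
        self.busy_cycles = 0    # cycles in which the block is active

    def clock(self, dia, data_in):
        """One rising clock edge. Returns (doa, data_out)."""
        if dia:                             # valid input: latch it and start
            self.value = self.func(data_in)
            self.remaining = self.latency
        if self.remaining == 0:
            return False, None              # idle: nothing toggles
        self.busy_cycles += 1
        self.remaining -= 1
        if self.remaining == 0:
            return True, self.value         # result ready: doa high
        return False, None

block_a = Block(latency=3, func=lambda x: x + 1)
block_b = Block(latency=2, func=lambda x: x * 2)

results = []
doa_a, out_a = False, None
for cycle in range(12):
    dia_a = cycle == 0                      # one valid input word at cycle 0
    # The doa of block A from the previous edge acts as the dia of block B.
    doa_b, out_b = block_b.clock(doa_a, out_a)
    doa_a, out_a = block_a.clock(dia_a, 5)
    if doa_b:
        results.append((cycle, out_b))
```

In this run block A is active for 3 of the 12 cycles and block B for 2. The remaining cycles are the "off" time in which a gated block does not toggle.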

Figure 3-3: The OFDM transmitter block diagram in RTL design

Figure 3-4: The OFDM receiver block diagram in RTL design


3-3-1 Modulator and demodulator

The circuitry used for constellation mapping at the transmitter is shown in Fig. 3.5. As described in Section 2-4-3, 3 modes of mapping are adopted in this work: BPSK, QPSK and 16-QAM. In this module, the "CLK", "RESET" and "MODE" signals stand for the "clock", "reset" and "modulation mode" signals, respectively. They are all controlled by the transmitter controller at the top level (the same holds for the other modules in the following sections). The first sub-block in this figure is a shift register. It consists of a counter and a register. Depending on the modulation mode, the input bits (in groups of N bits) are converted to parallel and latched in the signal "Data_par" for N clock cycles. For instance, if QPSK is used, the input data is latched every 2 bits in Data_par[1] and Data_par[0] and the "doa" signal is high every 2 clock cycles. The "Con" signal (2-bit) is used as a counter to track and control the status of this sub-block.

Figure 3-5: The circuitry used for constellation mapping (BPSK, QPSK and 16-QAM)

The second sub-block is called "Mapper", which maps the parallel input data (Data_par) onto a constellation point. The real and imaginary components of the mapped data are separated into I and Q, which are formatted as 8-bit 2's complement numbers (Out_I and Out_Q in the figure). The mapping scale in this figure is different from that of Section 2-4-3. In this work, I and Q are in the range of -127 to 127 (2^7). As discussed in Chapter 2, each OFDM symbol contains 24 (S1G_1M) or 52 (S1G_SHORT) data samples, and these 24 or 52 outputs are generated continuously. The waveform of the signals in this block is shown in Fig. 3.6. The modulation type in this example is QPSK and the channel bandwidth is 1 MHz (the related parameters can be found in Table 2-2). The input data stream contains 48 bits during one dia cycle and these 48 bits are converted into integers. At the receiver, the demodulator demaps the I and Q input samples into bits by comparing each input sample with a threshold. In the circuitry (see Fig. 3.7), the "Demapper" can be regarded as a comparator. The principle of comparison is similar to that shown in Fig. 2.12 (only the range of the constellation points is different). The following register is used to convert the N-bit integer "Data_par" into N serial output bits. When the "doa" signal is high, the output data is stored in the receiver controller and can be compared with the input data in Fig. 3.5.


Figure 3-6: The waveform of the signals in the modulator (BPSK, S1G_1M)

Figure 3-7: The circuitry used for constellation demapping (BPSK, QPSK and 16-QAM)
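The serial-to-parallel conversion, mapping and threshold demapping can be summarized in a short behavioural sketch for BPSK and QPSK. The bit labelling (bit 1 maps to +127, bit 0 to -127, first bit on I) and the function names are assumptions made for this example; 16-QAM is left out because its level assignment is not restated here.

```python
FULL_SCALE = 127  # amplitude of I and Q in 8-bit 2's complement

def serial_to_parallel(bits, bits_per_symbol):
    """Group the serial bit stream into words of bits_per_symbol bits."""
    return [bits[i:i + bits_per_symbol]
            for i in range(0, len(bits), bits_per_symbol)]

def level(bit):
    """Assumed labelling: bit 1 -> +127, bit 0 -> -127."""
    return FULL_SCALE if bit else -FULL_SCALE

def map_symbol(group):
    """Map a 1-bit (BPSK) or 2-bit (QPSK) word to an (I, Q) pair."""
    if len(group) == 1:
        return level(group[0]), 0
    if len(group) == 2:
        return level(group[0]), level(group[1])
    raise ValueError("only BPSK and QPSK are sketched here")

def demap_symbol(i_val, q_val, bits_per_symbol, threshold=0):
    """Comparator-style demapper: compare each component with the threshold."""
    bits = [1 if i_val > threshold else 0]
    if bits_per_symbol == 2:
        bits.append(1 if q_val > threshold else 0)
    return bits

# 48 input bits give 24 QPSK symbols, i.e. one S1G_1M OFDM symbol.
tx_bits = [(i * 5 + i // 3) % 2 for i in range(48)]
symbols = [map_symbol(g) for g in serial_to_parallel(tx_bits, 2)]
rx_bits = [b for i_val, q_val in symbols for b in demap_symbol(i_val, q_val, 2)]
```

With no noise between mapper and demapper, the recovered bits equal the input bits, which mirrors the bit-by-bit comparison carried out by the receiver controller.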

3-3-2 Pilot inserter and remover

The pilots are unmodulated signals on the defined subcarriers. During one dia cycle of this block, the incoming signals are a group of 24 (S1G_1M) or 52 (S1G_SHORT) words. The circuitry is shown in Fig. 3.8 (note that the "clock", "reset" and "mode" signals are not shown in this figure). The first sub-block in this module is a "Reorder register", which is used to reorder the input I and Q by generating the address signal (Add_w in the figure). When the dia signal goes low, the "En" signal turns high. This means that all the data signals in this OFDM symbol have been stored. This register is followed by an Nfft * Ndata memory (note that in this work, all the "memories" are register based). Here, Nfft stands for the FFT size and Ndata stands for the word length of I or Q. The input I and Q are stored in this memory according to the address signal "Add_w". When the enable signal of this memory (En_i) is high, the empty positions in the memory are filled with pilots and zeros. At the next rising edge of the clock signal, the "En_o" signal turns high and I (Q) is transmitted serially to the next block in the new order stored in the memory. Note that the output I and Q are converted into 16-bit 2's complement numbers, since the following IFFT module requires higher precision.

Figure 3-8: The circuitry used for pilot insertion and zero padding

This module illustrates the purpose of the dia − doa signals proposed above. For instance, in the S1G_1M frame format, the input data of this module is a sequence containing 24 words and the FFT size is 64. Without the dia − doa structure, a new input sequence would enter this module before the 64-sample output has been passed to the next module, which would cause interference between the two modules.

The circuitry of the pilot remover in RTL is similar to that of the pilot inserter. The data sequence is first stored in a memory and then converted into a group of data in serial.
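The reordering performed by the pilot inserter and undone by the pilot remover can be sketched as an address table applied to a 64-entry memory. The subcarrier layout below (used subcarriers at -13..-1 and 1..13, pilots at -7 and +7) and the constant pilot value are assumptions for illustration; the actual positions and pilot polarities are defined by the standard [1].

```python
N_FFT = 64
# Assumed S1G_1M layout, for illustration only.
PILOT_CARRIERS = (-7, 7)
DATA_CARRIERS = [k for k in range(-13, 14)
                 if k != 0 and k not in PILOT_CARRIERS]
PILOT_VALUE = 127  # hypothetical constant pilot amplitude

def insert_pilots(data_words):
    """Write 24 data words, 2 pilots and zeros into an N_FFT-entry memory."""
    if len(data_words) != len(DATA_CARRIERS):
        raise ValueError("expected %d data words" % len(DATA_CARRIERS))
    memory = [0] * N_FFT                    # unused positions stay zero
    for carrier, word in zip(DATA_CARRIERS, data_words):
        memory[carrier % N_FFT] = word      # negative carriers wrap to the top
    for carrier in PILOT_CARRIERS:
        memory[carrier % N_FFT] = PILOT_VALUE
    return memory

def remove_pilots(memory):
    """Read the data words back in their original order."""
    return [memory[carrier % N_FFT] for carrier in DATA_CARRIERS]

data_in = list(range(1, 25))                # 24 placeholder data words
memory = insert_pilots(data_in)
data_out = remove_pilots(memory)
```

Here the list `DATA_CARRIERS` plays the role of the address signal: 24 words go in, 64 words come out, and the remover applies the same table in the opposite direction.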

3-3-3 IFFT and FFT

The FFT module in this work is based on a previous design from Holst Centre. This memory-based design targets low power consumption. It uses the radix-4 FFT algorithm as a baseline. The original version was adapted for this project and the FFT size is 64 [37].

3-3-4 Cyclic Prefix Extender

Fig. 3.9 illustrates the RTL circuitry for cyclic prefix extension. The first step is to store the input I and Q data (the outputs of the IFFT module). The addresses of these signals are controlled by a counter, which counts from 1 to Nfft (the FFT size). When the output of this counter reaches Nfft, the cyclic prefix is copied as described in Section 2-4-6. At the next rising edge of the clock signal, the "En_o" signal of the memory turns high, which enables the next sub-block to output the OFDM symbol. The output index of the OFDM symbol is controlled by a second counter, which counts from 1 to the length of one OFDM symbol.

At the receiver, the circuitry used for removing the cyclic prefix (shown in Fig. 3.10) is simpler. Only a counter and a register are used in this module. The counter counts from 1 to the length of the cyclic prefix (Ncp). When it reaches Ncp, it gives an enable signal to the register, which then converts the input I and Q to 8-bit 2's complement numbers. The dynamic power consumption of this circuitry is therefore expected to be far lower than that of the cyclic prefix extension.


Figure 3-9: The circuitry used for cyclic prefix extension

Figure 3-10: The circuitry used for removing cyclic prefix
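Functionally, the two circuits reduce to a copy and a skip. The sketch below assumes a 64-sample symbol with a 16-sample prefix, so that one extended symbol has 80 samples; the function names are illustrative.

```python
def add_cyclic_prefix(symbol, n_cp):
    """Prepend the last n_cp samples of the symbol to its front."""
    return symbol[-n_cp:] + symbol

def remove_cyclic_prefix(samples, n_cp):
    """Drop the first n_cp samples and keep the rest."""
    return samples[n_cp:]

n_fft, n_cp = 64, 16
symbol = list(range(n_fft))                 # placeholder IFFT output
extended = add_cyclic_prefix(symbol, n_cp)  # 80 samples
recovered = remove_cyclic_prefix(extended, n_cp)
```

The extender has to hold the whole symbol before it can emit the prefix, whereas the remover only counts Ncp samples and passes the rest through. This matches the difference in circuit size between the two modules.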

3-3-5 Synchronizer

In Section 2-4-7, an algorithm for timing synchronization using the STF is proposed. The critical sub-blocks in the synchronizer are the auto-correlator and the cross-correlator. The structure of the auto-correlator (for the I signal) is illustrated in Fig. 3.11. Nstf stands for the length of one repeating unit in the STF (16 for the S1G_1M frame format and 32 for the S1G_SHORT frame format). When the dia signal of the synchronizer is high, the first 2 ∗ Nstf input signals are stored in two memories. At the next clock cycle, the signals stored in "Memory II" are multiplied by the signals stored at the same row address of "Memory I". The resulting products are then added up as "S_auto_I". As described in Section 2-4-7, auto-correlation is iterative. This property is fully exploited in this circuitry: once the first auto-correlation result has been derived, the iterative calculation can be performed. In the next clock cycle, I[Nstf -1] of "Memory I" is multiplied by the incoming signal "I_i" and the product is added to "S_auto_I". The two "I[Nstf -1]" values in the memories are also multiplied together and the product is subtracted from "S_auto_I". If the incoming "I_i" is still in the STF, the resulting "S_auto_I" stays constant. In the same clock cycle, the signals stored in the memories are shifted to fulfill Formula 2-14.

Figure 3-11: The circuitry designed for auto-correlation
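The iterative update can be checked against the direct sum with a short sketch. It uses a generic real-valued sliding-window correlation with lag Nstf and a toy repeating unit of length 4; Formula 2-14 is not reproduced here, so the indexing below is an assumption about its form.

```python
def autocorr_direct(x, start, n_stf):
    """Direct sum of n_stf products between samples n_stf apart."""
    return sum(x[start + k] * x[start + k + n_stf] for k in range(n_stf))

def autocorr_sliding(x, n_stf):
    """Same values for every window position, using one added and one
    subtracted product per step instead of a full sum."""
    s = autocorr_direct(x, 0, n_stf)
    out = [s]
    for start in range(1, len(x) - 2 * n_stf + 1):
        s += x[start + n_stf - 1] * x[start + 2 * n_stf - 1]  # newest product
        s -= x[start - 1] * x[start - 1 + n_stf]              # oldest product
        out.append(s)
    return out

n_stf = 4
unit = [3, -1, 4, -2]                            # toy repeating unit
x = unit * 4 + [0, 1, 0, -1, 0, 0, 1, 0]         # STF followed by other samples
s_auto = autocorr_sliding(x, n_stf)
```

While the window lies inside the repeated part, the result stays at the energy of one unit (30 in this toy case) and it drops once non-repeating samples enter. This is the plateau and fall-off that the threshold comparison relies on.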

The circuitry of the cross-correlator for the I data is shown in Fig. 3.12. The repeating sequence of the STF is stored in Memory II and the sequence to be detected is stored in Memory I. The signals stored at the same addresses of these two memories are multiplied together. The products are then summed up to form the output of this correlator. The signals are also shifted to the next address at the rising edge of the clock signal.

Cross-correlation is not iterative. Each cross-correlation needs Nstf multiplications and Nstf −1 additions, which costs much power. To minimize the use of cross-correlation, the top-level synchronizer shown in Fig. 3.13 is designed. The input signal is stored in a memory. At the beginning of the synchronization phase, the auto-correlator works and its output "S_auto" is compared with a pre-defined threshold. The output of this comparator is fed back to "MUX_1". If "S_auto_I" is smaller than the threshold, the signals stored in the memory are fed into the cross-correlator instead of the auto-correlator. The cross-correlation output "S_cross_I" is also compared with a pre-defined threshold until "S_cross_I" is larger than the threshold. During the cross-correlation phase, the input signal is stored in another register-based memory with 2 ∗ Nstf columns.

Figure 3-12: The circuitry designed for cross-correlation

Figure 3-13: The circuitry of synchronization

The waveform of the signals in the synchronizer is shown in Fig. 3.14. The signal "s_auto" represents the auto-correlation output and the signal "s_cro" represents the cross-correlation output. At the beginning, the auto-correlator works and "s_auto" stays constant. When the incoming signals "i_I" and "i_Q" leave the STF, "s_auto" starts to drop. When it falls below the threshold, the cross-correlator works until the first peak of "s_cro" is found. Then the doa signal of the synchronizer goes high and the signals are passed to the next module.

3-3-6 Packet buffers

In RTL, the real time constraints should also be taken into consideration. The basic timing-related parameters are listed in Table 3-2 [1]. The transmitted OFDM packet should be continuous and each OFDM symbol should last for 40 µs. However, since the dia − doa method is used in RTL, the OFDM symbols are produced one at a time. In other words, there is a timing gap between two OFDM symbols. To realize the real time design, "buffer" modules are introduced in the transmitter and receiver.

Figure 3-14: The waveform of the signals in the synchronizer

Table 3-2: Basic timing-related parameters [1].

Parameter   CBW1 / CBW2 (µs)                 Description
Tsyml       40                               Duration of OFDM symbol with normal GI
TGI         8                                Guard interval duration
TSTF        160 = 4×Tsyml / 80 = 2×Tsyml     STF field duration
TLTF        40                               LTF field duration

At the transmitter, the buffer is used to compensate for the gaps between OFDM symbols and to convert these symbols into one OFDM packet. The timing diagram is illustrated in Fig. 3.15. The first row in this figure is the timing diagram of the input data. To avoid timing interference, the next input data symbol is sent to the system only when the corresponding OFDM symbol is ready (see the second row). As a result, the generated OFDM symbols are discrete. The function of the packet buffer at the transmitter is to combine the discrete OFDM symbols into one OFDM frame, which is shown in the third row of this figure.

Figure 3-15: Timing diagram of OFDM symbols at the transmitter

As indicated in the figure, the OFDM sample duration in the "processor" (see Fig. 3.3) is different from that in the OFDM frame. A simple way to compensate for this difference is to use two clock signals: a faster clock in the "processor" and a slower clock in the packet buffer. However, in a synchronous circuit, all the parts should be synchronized by one clock signal. For this reason, a "fake" clock signal is introduced by making use of a counter. The output signal is still controlled by the system clock signal, but it is only updated when the counter reaches a certain number (see Fig. 3.16). In this case, the "cont_div" signal is the output of the counter and the outputs are transferred every 5 clock cycles.

Figure 3-16: The waveform of the "fake" clock signal

An OFDM symbol contains 80 samples and lasts for 40 µs [1]. Therefore, each sample should last for 0.5 µs, which means the "fake" clock frequency is 2 MHz. In addition, the "fake" clock is controlled by a counter, so the system clock frequency should be an integer multiple of 2 MHz. The number of clock cycles for generating one OFDM symbol in the processor should thus be divisible by 80. As a result, delays in the processor are needed. The resulting minimum frequencies in the processor are summarized in Table 3-3.

Table 3-3: Minimum frequencies in the transmitter processor

CBW   MCS   Number of clock cycles                                  Minimum clock
            Processor (original)   Delay   Processor (delayed)     frequency (MHz)
1     0     315                    5       320                     8
1     1     339                    61      400                     10
1     3     387                    13      400                     10
2     0     343                    57      400                     10
2     1     395                    5       400                     10
2     3     499                    61      560                     14
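The entries of Table 3-3 follow from the two constraints stated above: the cycle count per symbol is rounded up to a multiple of 80, and all of these cycles must fit into 40 µs. The sketch below reproduces this arithmetic; the function and constant names are illustrative.

```python
SAMPLES_PER_SYMBOL = 80   # samples in one OFDM symbol
SYMBOL_DURATION_US = 40   # duration of one OFDM symbol in microseconds

def processor_timing(cycles_needed):
    """Return (delay cycles, padded cycle count, minimum clock in MHz)."""
    # Round up to the next multiple of 80 using integer arithmetic.
    padded = -(-cycles_needed // SAMPLES_PER_SYMBOL) * SAMPLES_PER_SYMBOL
    delay = padded - cycles_needed
    clock_mhz = padded / SYMBOL_DURATION_US   # cycles per microsecond = MHz
    return delay, padded, clock_mhz

# Original processor cycle counts from Table 3-3.
for cycles in (315, 339, 387, 343, 395, 499):
    print(cycles, processor_timing(cycles))
```

For the CBW1, MCS0 case, 315 cycles are padded with 5 delay cycles to 320, and 320 cycles in 40 µs give the 8 MHz listed in the table.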


The packet buffer module adopts a "Ping-Pong Buffering" technique. The principle of this method is shown in Fig. 3.17. Instead of only one buffer, two buffers are used for data transfer in this technique. This is because, if only one buffer were used as the packet buffer, new data would overwrite the data currently being transmitted. Here, the "Ping" buffer is used to store the incoming data while the "Pong" buffer is used to transfer the stored data. When the first OFDM symbol is sent into this module, Buffer1 starts to store the input symbol. Once Buffer1 has been filled, it starts to transfer the stored samples at the "fake" clock frequency. At the same time, Buffer2 starts to store the next input symbol and behaves as the "Ping" buffer. In short, the roles of the two buffers are switched after every symbol.

Figure 3-17: Block diagram for Ping-Pong buffer
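The role switching can be captured in a few lines. This is a behavioural sketch with a hypothetical `PingPongBuffer` class that stores whole symbols at once; the register-based RTL buffer writes and reads sample by sample.

```python
class PingPongBuffer:
    """Two buffers: one is written ("Ping") while the other is read ("Pong")."""

    def __init__(self):
        self.buffers = [[], []]
        self.write_index = 0    # buffer currently acting as "Ping"

    def write_symbol(self, samples):
        """Store one symbol in the Ping buffer, then swap the roles."""
        self.buffers[self.write_index] = list(samples)
        self.write_index ^= 1

    def read_symbol(self):
        """Return the Pong buffer, i.e. the most recently filled one."""
        return self.buffers[self.write_index ^ 1]

buffer = PingPongBuffer()
buffer.write_symbol([1, 2, 3])
first_out = buffer.read_symbol()    # symbol 1 is being transmitted
buffer.write_symbol([4, 5, 6])      # symbol 2 arrives in the other buffer
second_out = buffer.read_symbol()
```

The second write lands in the other buffer, so the samples of the first symbol are still intact while they are being sent out. With a single buffer they would have been overwritten.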

The waveform of the signals in the buffer is shown in Fig. 3.18 for the CBW1, MCS0 case. According to Table 3-3, the clock frequency in this case is 8 MHz. When an OFDM symbol is generated, a "delay counter" runs for 4 clock cycles and then gives a feedback to the top-level transmitter controller. Once the feedback signal ("o_fb_tx" in the waveform) is high, the input data starts to be sent into the transmitter and the next processing round starts. The timing gap between the two cursors ("Baseline" and "Time A") shows the duration of one output OFDM symbol, which is exactly 40 µs.

Figure 3-18: The waveform of signals in packet buffer at Tx

At the receiver, a packet buffer is also needed to split the received OFDM frame. It is located after the synchronizer (see Fig. 3.4). The structure of the receiver buffer is similar to that of the packet buffer at the transmitter.

The top level view of this RTL model is shown in Fig. 3.19. The testbench provides high level control of the whole system. In the testbench, the data sequence is transferred from the data source into Tx. When the processing in Tx is finished, the output data of Tx is fed into Rx in the testbench. In the end, the output data of Rx is collected and compared with the data source in the testbench. The resolution of the inputs and outputs of the submodules is also shown in this figure. For MCS3, the number of bits for the input and output of the IFFT/FFT modules is 24. The preamble is also added in the testbench, which is not shown in this figure.

3-4 Logic synthesis

As described at the beginning of this chapter, RTL is a high level representation of a circuit. To implement the circuit at gate level, a logic synthesis process is needed to map the RTL description onto a gate-level netlist. Logic synthesis uses a standard cell library. This library contains simple cells, such as basic logic gates (AND, OR and NOR), and macro cells, such as adders, muxes, memories and flip-flops. Together, the standard cells form the technology library. In this work, a CMOS 40 nm technology provides the standard cell library. The synthesis flow is shown in Fig. 3.20. These steps are included in the synthesis script. The flow starts by loading the HDL files into the RTL compiler. Then elaboration is performed by the RTL compiler. Elaboration is the process of expanding the HDL description so that all instances of all modules in the VHDL code become unique objects. After elaboration, constraints can be applied. The design constraints include:

• Clock definition, including clock period, clock name, clock source and offset. This definition is limited by the RTL description and has a great influence on the performance, which is explored in the next chapter.

• Input constraints, i.e. the load and timing constraints of all the input ports (except clocks).

• Output constraints. Similar to the input constraints, all output ports should have two types of constraints: load and timing.

Then optimization is performed by the RTL compiler. In this work, the optimization is mainly carried out by clock gating, a technique that switches off the clock of unused units and hence reduces their power consumption.
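As background for why clock gating helps, the first-order switching power model of CMOS logic can be written as follows. This is a standard textbook relation rather than a result of this thesis; the symbols (activity factor α, switched capacitance C_L, supply voltage V_DD and clock frequency f_clk) are introduced here for illustration only.

```latex
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f_{\mathrm{clk}}
```

Gating the clock of an idle unit drives its effective activity factor towards zero, so its dynamic power is removed while its static (leakage) power remains.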


Figure 3-19: Top level view of the RTL circuitry

Based on the constraints and optimizations, the synthesis is carried out in two steps:

• Synthesize the design to generic logic.

• Generate the netlist by mapping the generic logic to the technology library.

Apart from the netlist file, the reports on timing, area and other log files are also generated by synthesis. These files give an overall description of the gate-level circuit. The power simulation is then performed based on the netlist.

3-5 Summary

In this chapter, the hardware implementation of the DBB is presented. Based on the high level model described in Chapter 2, the quantization is first performed, which determines the number of bits of each block's input and output. Then the RTL description is introduced block by block. In addition to the circuit structure, the waveforms of the signals in some critical modules are also presented to verify these modules. This is followed by the synthesis process. The simulation results are presented in Chapter 4, including the cell area, the power consumption and some other important performance figures.

Figure 3-20: The synthesis flow

Chapter 4

Simulation results and analysis

This thesis focuses on the low power design for IEEE 802.11ah. As described in Chapter 1, the power consumption of this DBB should be less than 1 mW. In Chapter 3, the gate-level synthesized netlist was implemented. Based on this netlist, the performance of the DBB is evaluated in this chapter.

The hardware simulation is executed in the following steps. First, the functionality of the RTL is verified by a SystemVerilog testbench. In the testbench, the output data from the receiver is collected in a sequence and then compared with the input data sequence bit by bit. If the two sequences match, the functionality of the RTL code can be considered correct. The result is shown in Appendix B. After this, the synthesis is performed and the functionality is verified again at gate level, since the synthesis may introduce errors.

In this chapter, the generated reports from the synthesis are first introduced. Then the power simulation results are analyzed. The area and power consumption of this DBB are compared with the targets set in Chapter 1. Finally, a summary is presented.

4-1 Synthesis results

The minimum clock frequency constraint for synthesis depends on the RTL description and the possible supply voltage. To determine the applicable range of clock frequency constraints, the synthesis is performed under different clock frequency constraints and the result is shown in Fig. 4.1. As can be seen in this figure, when the clock frequency constraint is in the range of 10-40 MHz, the synthesized cell area of the circuit changes little. Above 40 MHz, the cell area increases rapidly until a timing violation occurs. From Table 3-3, the minimum operation frequency in this design ranges from 8 MHz to 14 MHz. As a result, it is not difficult to meet the timing constraint.

Figure 4-1: Cell area vs clock frequency constraints

The cell area of each block in the generated circuit is listed in Table 4-1. Here, the clock frequency constraint is 20 MHz. The synchronizer occupies the most area, followed by the FFT and IFFT modules. In Chapter 1, the target area of this DBB was set as 0.5 mm². The total cell area of this design is 0.176 mm². If the modules which are not included in this design (convolutional encoder, Viterbi decoder and channel estimator) and the net area are also taken into consideration, the total chip area for this DBB will be a little smaller than 0.5 mm², which fulfills the objective.

Table 4-1: Cell area of each block

Block                 Cell area (µm²)
Synchronizer          52485
IFFT                  39680
FFT                   38483
Packet buffer (Rx)    13908
Packet buffer (Tx)    12482
CP adder              6447
Pilot inserter        6445
Pilot remover         5121
Demodulator           223
Modulator             198
CP remover            131
Total                 176191
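The relative contribution of each block in Table 4-1 can be worked out with a few lines of Python. The sketch below is only an illustration: the dictionary and function names are hypothetical, and the reported total from the table is used as the denominator rather than the sum of the listed rows.

```python
# Cell areas in µm², copied from Table 4-1.
CELL_AREA_UM2 = {
    "Synchronizer": 52485,
    "IFFT": 39680,
    "FFT": 38483,
    "Packet buffer (Rx)": 13908,
    "Packet buffer (Tx)": 12482,
    "CP adder": 6447,
    "Pilot inserter": 6445,
    "Pilot remover": 5121,
    "Demodulator": 223,
    "Modulator": 198,
    "CP remover": 131,
}
TOTAL_CELL_AREA_UM2 = 176191  # "Total" row of Table 4-1
TARGET_AREA_MM2 = 0.5         # area target from Chapter 1


def area_share_percent(*blocks):
    """Share of the reported total cell area taken by the given blocks."""
    area = sum(CELL_AREA_UM2[name] for name in blocks)
    return 100.0 * area / TOTAL_CELL_AREA_UM2


def um2_to_mm2(area_um2):
    """Convert square micrometres to square millimetres."""
    return area_um2 / 1e6


print("Synchronizer share (%):", round(area_share_percent("Synchronizer"), 1))
print("FFT + IFFT share (%):", round(area_share_percent("FFT", "IFFT"), 1))
print("Total cell area (mm²):", um2_to_mm2(TOTAL_CELL_AREA_UM2))
print("Below target:", um2_to_mm2(TOTAL_CELL_AREA_UM2) < TARGET_AREA_MM2)
```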

The number of gates in the generated circuit is listed in Table 4-2. In this design, 73063 gates are implemented.

Table 4-2: The number of gates for different gate types

Gate type                       Instances
Sequential                      19496
Inverter                        6312
Buffer                          297
Clock_gating_integrated_cell    873
Logic                           46085
Total                           73063

4-2 Power consumption

The transient gate-level power estimation is performed with Synopsys PrimeTime (Version H-2012.12-SP1 for RHEL64) based on Value Change Dump (VCD) data generated by netlist simulations. A custom MATLAB script is used to visualize the power estimation data.

The power consumption over time for a specific case (CBW1, MCS3) is shown in Fig. 4.2. The clock frequency of the circuit is 10 MHz. Both the dynamic power and the static power are included in this figure. The peaks in the figure correspond to the dynamic power, which is 20 times larger than the static power in this case. "Tx", "Synchronization (Sync)" and "Rx" operate in different time periods (here, the "Rx" period is the period during which the data field is processed at the receiver). This OFDM frame contains 10 OFDM symbols, so there are 10 peaks in "Tx". The peak value in this figure is around 952 µW, which means the maximum power consumption is 952 µW (in the "Rx" range). There is a period of low power consumption between the synchronizer and the receiver, which is the time period in which the LTF and the SIG are processed. In this RTL model, channel estimation is not included. Therefore, the "channel estimation" process only behaves as an ideal register and consumes a small amount of power.

The power consumption over time is broken down per sub-module in Fig. 4.3. From this figure, the spikes at the top level mainly come from the FFT and IFFT modules. According to [24], FFT and IFFT modules occupy a large portion of the circuit area as well as the power consumption in an OFDM system. It is therefore reasonable that these modules contribute the most to the power consumption here.

Fig. 4.4 shows the power consumption of the synchronizer. In the circuit, it is divided into the auto-correlator and the cross-correlator, which is also reflected in the power simulation. As described above, the auto-correlator consumes less power since it is iterative. On the other hand, the cross-correlator involves more computation and its power consumption is much larger. When the last sample of the STF is found, the synchronizer behaves as a series of registers, which consumes only a tiny amount of power.
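To illustrate why an iterative auto-correlator needs less computation than a cross-correlator, the following Python sketch compares the two. It is an illustration under stated assumptions, not the RTL implementation: the delay-and-correlate window form, the window length `L` and all function names are hypothetical.

```python
def autocorr_direct(r, d, L):
    """Delay-and-correlate metric at offset d: L products conj(r[k]) * r[k + L]."""
    return sum(r[d + k].conjugate() * r[d + k + L] for k in range(L))


def autocorr_iterative(r, L):
    """Same metric for every offset, updated with one add and one subtract."""
    n_offsets = len(r) - 2 * L + 1
    p = autocorr_direct(r, 0, L)  # only the first window is computed in full
    out = [p]
    for d in range(1, n_offsets):
        p += r[d + L - 1].conjugate() * r[d + 2 * L - 1]  # product entering the window
        p -= r[d - 1].conjugate() * r[d + L - 1]          # product leaving the window
        out.append(p)
    return out


def crosscorr(r, ref):
    """Cross-correlation with a known reference: len(ref) products per offset."""
    n_offsets = len(r) - len(ref) + 1
    return [
        sum(r[d + k] * ref[k].conjugate() for k in range(len(ref)))
        for d in range(n_offsets)
    ]


# Deterministic test sequence (hypothetical data, integer-valued for exact results).
r = [complex((3 * k) % 7 - 3, (5 * k) % 11 - 5) for k in range(40)]
L = 8
iterative = autocorr_iterative(r, L)
direct = [autocorr_direct(r, d, L) for d in range(len(iterative))]
print("max difference:", max(abs(a - b) for a, b in zip(iterative, direct)))
```

In this sketch the iterative form needs two complex multiplications per new offset regardless of `L`, whereas the cross-correlator needs one multiplication per reference sample at every offset, which is in line with the higher power consumption observed for the cross-correlator.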

The average power consumption for all the supplied modes of each period (Tx, Sync and Rx, which are illustrated in Fig. 4.2 and Fig. 4.3) is summarized in Table 4-3. The "Overall average" column shows the average power consumption during the whole processing period.

Figure 4-2: Power consumption over time (top level, CBW1, MCS3)

Figure 4-3: Power consumption for each block (CBW1, MCS3)

Figure 4-4: Power consumption of synchronizer (CBW1, MCS3)

Table 4-3: Average power consumption of each period for all the supplied cases

CBW   MCS   Clock frequency (MHz)   Tx (µW)   Rx (µW)   Sync (µW)   Overall average (µW)
1     0     8                       125.5     115.8     101.3       110.0
1     1     10                      145.3     137.3     113.2       132.0
1     3     10                      191.2     172.0     142.0       169.4
2     0     10                      145.2     136.3     122.9       134.0
2     1     10                      142.8     137       130.5       135.0
2     3     14                      234.2     224.6     175.7       220.2

The best case is CBW1, MCS0. This case benefits from the lowest clock frequency and ends up with the lowest dynamic power consumption. For MCS3, as described in Section 3-2, the number of bits in the FFT and IFFT modules is 24, which is 8 bits more than in the other cases. This increases the computational complexity and the power consumption. The worst case is CBW2, MCS3. In addition to the high resolution of the FFT module, it also suffers from the highest clock frequency. The comparison is given in Fig. 4.5.

Figure 4-5: Average power consumption for all the supplied cases

The average power consumption is also broken down per module. An example is given for CBW1, MCS3 (shown in Fig. 4.6). The IFFT and FFT modules contribute the most, accounting for about 70% of the overall power consumption. The synchronizer contributes only 10% of the total power consumption. This benefits from the algorithm and the implementation of the synchronizer described in Chapter 2 and Chapter 3.

Figure 4-6: Average power consumption distribution (CBW1, MCS3)

The average power consumption during the transmitting period is shown in Fig. 4.7. When the transmitter is operating, the IFFT module consumes most of the power, which is 56% in this case. The receiver is off during this period but still consumes 26% of the total power. This power consumption is considered as the static power and leakage power, which can be greatly reduced by implementing power gating in future work.

Figure 4-7: Power consumption of each module during transmitting period (CBW1, MCS3)

The average power consumption of each sub-module during the receiving period is shown in Fig. 4.8. When the receiver is working, the FFT module contributes the most. It takes 51% of the total power. During this time, the transmitter consumes 28% of the total power, which mainly comes from the IFFT module in Tx. In future work, this can be optimized by using power gating.

In Fig. 4.9, the average power consumption during the synchronizing period is shown. During this period, the synchronizer is on and both the transmitter and the receiver are off. The power consumption of the transmitter, the receiver and the synchronizer is almost equal. Here, power gating is more critical: if it is implemented, the average power consumption can be drastically reduced.

These pie charts show the contribution of each module in the different processing periods for CBW1, MCS3. The results for the other supplied cases are similar.

In Chapter 1, the target power consumption set for the DBB is less than 1 mW. From Table 4-3, the overall average power consumption for the worst case is 220.2 µW. Although the gate-level circuitry for the DBB is not finished, it is reasonable to assume that the power consumption of the whole DBB will be less than 1 mW. The modules which are not yet implemented are the convolutional encoder, Viterbi decoder, channel estimator and equalizer. According to [24] and the computation estimate based on Chapter 2, the additional dynamic power consumption will be smaller than that of the IFFT/FFT. Therefore, the average power consumption of the whole DBB will be less than twice that of the current circuitry. Under this assumption, the average power consumption of the whole DBB is in the range of 200 - 400 µW. If power gating is added in future work, the power consumption can be further reduced by around 30%. In conclusion, the target power consumption can be fulfilled.
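The comparisons made in this section can be reproduced from the overall averages in Table 4-3. The Python sketch below normalizes the average power by the clock frequency and applies the factor-of-two upper bound argued above. It is an illustration only: the names are hypothetical, the factor of two is the stated upper bound rather than a measured value, and the thesis quotes the rounded range 200 - 400 µW.

```python
# (CBW, MCS): (clock frequency in MHz, overall average power in µW), from Table 4-3.
TABLE_4_3 = {
    (1, 0): (8, 110.0),
    (1, 1): (10, 132.0),
    (1, 3): (10, 169.4),
    (2, 0): (10, 134.0),
    (2, 1): (10, 135.0),
    (2, 3): (14, 220.2),
}
TARGET_UW = 1000.0  # 1 mW target from Chapter 1


def power_per_mhz(table):
    """Overall average power divided by clock frequency, in µW/MHz."""
    return {case: power / clock for case, (clock, power) in table.items()}


def worst_case_bound(table, factor=2.0):
    """Upper bound for the complete DBB: factor times the worst overall average."""
    worst = max(power for _, power in table.values())
    return factor * worst


for (cbw, mcs), value in power_per_mhz(TABLE_4_3).items():
    print(f"CBW{cbw}, MCS{mcs}: {value:.2f} µW/MHz")

bound = worst_case_bound(TABLE_4_3)
print("Upper bound for complete DBB (µW):", bound)
print("Below 1 mW target:", bound < TARGET_UW)
print("With about 30% saving from power gating (µW):", 0.7 * bound)
```

The normalized values separate the two effects discussed in the text: the MCS3 cases have the highest power per MHz, which reflects the 24-bit FFT/IFFT, while CBW2, MCS3 additionally pays for the highest clock frequency.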


Figure 4-8: Power consumption of each module during receiving period (CBW1, MCS3)

Figure 4-9: Power consumption of each module during synchronizing period (CBW1, MCS3)


4-3 Summary

In this chapter, the performance of the gate-level circuitry is evaluated. The reports generated by synthesis (timing, area, number of gates) are introduced. The chip area of this DBB is shown to meet the objective set in Chapter 1. Then the power consumption is presented and analyzed in detail. The results show that the power consumption target is also attained. The breakdown shows that the synchronizer consumes only a small amount of power, which supports the discussion and design decisions in Chapter 2 and Chapter 3.

Chapter 5

Conclusions and future work

5-1 Conclusions

A digital baseband system targeted at energy-constrained sensor applications has been proposed in this thesis. The design is based on the new Wi-Fi standard, IEEE 802.11ah. The system takes advantage of low-complexity algorithms and supports the OFDM scheme. The performance of the system is demonstrated by the simulation results of the high level model (Chapter 2). The power consumption of the DBB is much lower than that of the previous Wi-Fi standards, as shown by the power consumption results in Chapter 4. The area of this DBB is also smaller than the originally targeted chip area.

The main contribution of this thesis is listed in Table 5-1. This work starts with the exploration of IEEE 802.11ah and related telecommunication theories. Then the high level model is built in MATLAB. The structures and algorithms for some critical modules, such as the synchronizer and the channel estimator, are designed and optimized for low power consumption. In addition to these models, the synthesis and measurements are also performed to demonstrate this low power design.

Table 5-1: Summary of IEEE 802.11ah DBB circuitry

                    Language        Number of modules   Lines of code
High level model    MATLAB          15                  3409
RTL model           VHDL            11                  2306
Testbench           SystemVerilog   -                   459

This leads to the following conclusions:

1. The algorithm of the synchronizer enables the overall system to achieve 10% PER at SNR of 4.8 dB (the best case) and 14.8 dB (the worst case).


2. The hardware implementation for the synchronizer consumes only 20-40 µW of power¹.

3. The average power consumption of the RTL circuitry is 110 µW and 220 µW for the best case and the worst case¹, respectively. Assuming the remaining modules (convolutional encoder, Viterbi decoder, channel estimator and equalizer) are added, the average power consumption of the complete DBB is expected to be in the range of 200 - 400 µW.

4. The cell area of this circuit is 0.176 mm² (40 nm CMOS). Assuming the remaining modules are added, the total area of the complete DBB should be smaller than 0.5 mm².

5. The hardware performance is summarized in Table 5-2. The number of gates in this circuitry is 73063.

6. The minimum operation clock frequency to achieve real-time performance ranges from 8 MHz to 14 MHz. The clock frequency constraint for synthesis is in the range of 10 - 40 MHz. This means the timing constraint is easy to meet.

7. Packet buffers are added in the RTL design. These buffers are not included in the high level model, since the high level model does not take real time into consideration. The buffers in Tx and Rx are important modules in RTL for making the design operate in real time.

Table 5-2: Summary of IEEE 802.11ah DBB circuitry

Technology                                  40 nm CMOS
Supply voltage                              1.1 V
Clock frequency constraint for synthesis    20 MHz
Clock frequency                             10 - 14 MHz
Average power consumption                   200 - 400 µW
Gate count                                  73063
Cell area                                   176191 µm²

5-2 Future work

The digital baseband design is a large system and, from the view of the whole system, it is not finished. To realize the complete system with a good performance, the future work is listed below.

• Optimize the high level model and continue the algorithm exploration. A better trade-off between power consumption and other performance metrics can be pursued.

• Implement the remaining modules in RTL, including the convolutional encoder, Viterbi decoder, channel estimator and equalizer.

¹ 40 nm CMOS, RCTYP corner in synthesis netlist power simulation


• Optimize the existing RTL modules, especially the FFT module. This optimization should aim to lower the computational complexity. It can also be realized by decreasing the number of clock cycles the module needs. If the FFT module needs fewer clock cycles, a lower clock frequency can be used, which provides a larger range for scaling the clock frequency.

• Implement voltage scaling. The voltage scaling process should be verified by simulation.

• Connect with analog and RF parts to implement the whole system.


Appendix A

The comparison of channel estimation using LS and MMSE estimators

Equation 2-22 can be rewritten as:

$$Y = HFX + W, \tag{A-1}$$

where

$$F = \begin{bmatrix} W_N^{00} & \cdots & W_N^{0(N-1)} \\ \vdots & \ddots & \vdots \\ W_N^{(N-1)0} & \cdots & W_N^{(N-1)(N-1)} \end{bmatrix}$$

The other parameters are the same as in Equation 2-26. F is defined as the DFT matrix, with the corresponding weights given as:

$$W_N^{nk} = \frac{1}{N} e^{-j 2\pi \frac{nk}{N}} \tag{A-2}$$

If the MMSE estimator is used, the estimated channel response is [38]:

$$\hat{H}_{MMSE} = F R_{hY} R_{YY}^{-1} Y, \tag{A-3}$$

where

$$R_{hY} = E\{hY^H\} = R_{hh} F^H X^H \tag{A-4}$$

$$R_{YY} = E\{YY^H\} = X F R_{hh} F^H X^H + \sigma_n^2 I_N \tag{A-5}$$

Here, $R_{hY}$ is the cross-correlation matrix between h and Y, $R_{YY}$ is the auto-correlation matrix of Y and $R_{hh}$ is the auto-correlation matrix of h. $R_{hh}$ and $\sigma_n^2$ (the noise variance) are assumed to be known parameters [39].

In Fig. A.1, the comparison of BER vs SNR for different channel estimation schemes is shown. The modulation type in this case is QPSK. At BER = 0.1, the SNR required by the LS estimator is 2 dB worse than that of the MMSE estimator. This means the MMSE estimator is slightly more robust than the LS estimator. However, the complexity of the MMSE estimator is much higher. To save power in the hardware, the LS estimator is chosen as the channel estimation scheme.
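To make the complexity argument concrete, a least-squares estimate only needs one complex division per known subcarrier, whereas the MMSE estimator in (A-3) requires a matrix inversion. The Python sketch below shows the textbook per-subcarrier LS form; it is not taken from the MATLAB model of this thesis, and the function name, pilot values and channel values are hypothetical.

```python
def ls_channel_estimate(rx_symbols, tx_symbols):
    """Per-subcarrier least-squares estimate H_LS[k] = Y[k] / X[k]."""
    if len(rx_symbols) != len(tx_symbols):
        raise ValueError("received and transmitted sequences must have equal length")
    return [y / x for y, x in zip(rx_symbols, tx_symbols)]


# Hypothetical QPSK training symbols and channel response, noise-free for clarity.
tx = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
channel = [0.9 + 0.1j, 0.5 - 0.4j, 1.1 + 0.0j, 0.3 + 0.7j]
rx = [h * x for h, x in zip(channel, tx)]

estimate = ls_channel_estimate(rx, tx)
print("max estimation error:", max(abs(e - h) for e, h in zip(estimate, channel)))
```

Without noise the LS estimate recovers the channel up to rounding error; with noise, the error on each subcarrier is the noise sample divided by the training symbol, which is the robustness penalty relative to MMSE discussed above.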


Figure A-1: Comparison of BER for no channel estimation, LS channel estimation and MMSE channel estimation

Appendix B

Verification for the functionality of RTL model and gate-level netlist

To verify the RTL design and the synthesized gate-level netlist, the output data from the receiver should match the input data exactly. This check is performed by the testbench. An example is shown in Fig. B.1. Here, the signals "s_data_o_del" and "s_data_o_ref" stand for the output and input bit sequences, respectively. The signal "err_cnt" tracks the number of unmatched bits. The functionality of the RTL model is correct if "err_cnt" is 0.

Figure B-1: Comparison between the input and output bits

The functionality of the generated gate-level circuit after synthesis is verified similarly. The output and input bit sequence can also be compared in Cadence in the same way.
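The comparison performed by the testbench can be sketched in a few lines of Python. This is an illustrative software model of the check, not the SystemVerilog testbench itself; the function name and the example bit sequences are hypothetical.

```python
def count_bit_errors(ref_bits, out_bits):
    """Count positions where the received bits differ from the reference bits."""
    if len(ref_bits) != len(out_bits):
        raise ValueError("bit sequences must have the same length")
    return sum(1 for ref, out in zip(ref_bits, out_bits) if ref != out)


# Hypothetical input (reference) and receiver output sequences.
reference = [1, 0, 1, 1, 0, 0, 1, 0]
received_ok = list(reference)
received_bad = [1, 0, 0, 1, 0, 1, 1, 0]

print("err_cnt (matching):", count_bit_errors(reference, received_ok))
print("err_cnt (two flipped bits):", count_bit_errors(reference, received_bad))
```

A result of 0 corresponds to the pass condition described above.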


Bibliography

[1] IEEE P802.11ah Draft. Part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications. November 2014.

[2] Ali Hazmi, Jukka Rinne, and Mikko Valkama. Feasibility study of IEEE 802.11 ah radio technology for IoT and M2M use cases. In Globecom Workshops (GC Wkshps), 2012 IEEE, pages 1687–1692. IEEE, 2012.

[3] Jin-Shyan Lee, Yu-Wei Su, and Chung-Chou Shen. A comparative study of wireless protocols: Bluetooth, UWB, ZigBee, and Wi-Fi. In Industrial Electronics Society, 2007. IECON 2007. 33rd Annual Conference of the IEEE, pages 46–51. IEEE, 2007.

[4] Weiping Sun, Munhwan Choi, and Sunghyun Choi. IEEE 802.11ah: A long range 802.11 WLAN at sub 1 GHz. Journal of ICT Standardization, 1(1):83–108, 2013.

[5] Stefan Aust, R Venkatesha Prasad, and Ignas GMM Niemegeers. IEEE 802.11 ah: Advantages in standards and further challenges for sub 1 GHz Wi-Fi. In Communications (ICC), 2012 IEEE International Conference on, pages 6885–6889. IEEE, 2012.

[6] NIST Priority Action Plan. 2-Guidelines for Assessing Wireless Standards for Smart Grid Applications. National Institute of Standards and Technology Std, 2011.

[7] Stephen McCann and Alex Ashley. Official IEEE 802.11 working group project timelines. Online. Available at http://grouper.ieee.org/groups/802/11/Reports/802.11_Timelines.htm. Last accessed 2010.

[8] Behzad Razavi. RF microelectronics, volume 1. Prentice Hall, New Jersey, 1998.

[9] Kai-Hsin Chen and Hsi-Pin Ma. A low power zigbee baseband processor. In SoC Design Conference, 2008. ISOCC’08. International, volume 1, pages I–40. IEEE, 2008.

[10] Satyam Dwivedi, Bharadwaj Amrutur, and Navakanta Bhat. Power scalable digital baseband architecture for .15.4. In VLSI Design (VLSI Design), 2011 24th International Conference on, pages 30–35. IEEE, 2011.


[11] Christian Bachmann, Gert-Jan van Schaik, Benjamin Busze, Mario Konijnenburg, Yan Zhang, Jan Stuyt, Maryam Ashouei, Guido Dolmans, Tobias Gemmeke, and Harmke de Groot. 10.6 A 0.74 V 200 µW multi-standard transceiver digital baseband in 40 nm LP-CMOS for 2.4 GHz Bluetooth Smart/ZigBee/IEEE 802.15.6 personal area networks. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, pages 186–187. IEEE, 2014.

[12] Anthony Chun, Kyle McCanta, Edgar Borrayo Sandoval, and Kapil Gulati. Overview of the scalable communications core: A reconfigurable wireless baseband in 65nm CMOS. In VLSI, 2009. ISVLSI’09. IEEE Computer Society Annual Symposium on, pages 1–6. IEEE, 2009.

[13] Masoud Zargari, Lalitkumar Y Nathawad, Hirad Samavati, Srenik S Mehta, Alireza Kheirkhahi, Phoebe Chen, Ke Gong, Babak Vakili-Amini, J Hwang, S-WM Chen, et al. A dual-band CMOS MIMO radio SoC for IEEE 802.11n wireless lan. Solid-State Circuits, IEEE Journal of, 43(12):2882–2895, 2008.

[14] Concepts of Orthogonal Frequency Division Multiplexing (OFDM) and 802.11 WLAN. Keysight technologies. http://rfmw.em.keysight.com/wireless/helpfiles/89600B/ WebHelp/subsystems/wlan-ofdm/Content/ofdm_basicprinciplesoverview.html. Accessed Jan 22, 2015.

[15] L Zha, Zh H Yu, and LH Xing. Frequency Controlling and Synchronization in OFDM communication system. In Industrial Electronics and Applications, 2006 1ST IEEE Conference on, pages 1–5. IEEE, 2006.

[16] Lili Zhang. A study of IEEE 802.16a OFDM-PHY baseband. 2005.

[17] Marc Engels. Wireless OFDM Systems: How to make them work? Springer Science & Business Media, 2002.

[18] Peng Zhang Johan van den Heuvel, Yan Zhang. Ulpwifi radio front-end v1 top level specifications, Technical Note TN-14-WATS-TP2-171. Holst Centre, Eindhoven, 2014.

[19] Peter Elias. Coding for noisy channels. In PROCEEDINGS OF THE INSTITUTE OF RADIO ENGINEERS, volume 43, pages 356–356, 1955.

[20] Sklar Bernard. Digital communications fundamentals and applications. Chap15, Prentice-Hall International, inc, 2001.

[21] Andrew J Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Information Theory, IEEE Transactions on, 13(2):260–269, 1967.

[22] Brett William Werling. A Hardware Implementation of the Soft Output Viterbi Algorithm for Serially Concatenated Convolutional Codes. PhD thesis, University of Kansas, 2010.

[23] Alle-Jan van der Veen and Geert Leus. Signal processing for communications. Delft University of Technology, 2005.

[24] Tzi-Dar Chiueh and Pei-Yun Tsai. OFDM Baseband Receiver Design for Wireless Communications. Wiley Online Library, 2007.


[25] Jan-Jaap van de Beek, P Ödling, SK Wilson, and PO Börjesson. Orthogonal frequency-division multiplexing (OFDM). Review of Radio Science 1996-99, Intern. Union of Radio Science (URSI), 1999.

[26] Eduardo Heras Miguel. Fiber-based orthogonal frequency division multiplexing transmission systems. 2010.

[27] Timothy M Schmidl and Donald C Cox. Robust frequency and timing synchronization for OFDM. Communications, IEEE Transactions on, 45(12):1613–1621, 1997.

[28] Hlaing Minn, Mao Zeng, and Vijay K Bhargava. On timing offset estimation for ofdm systems. Communications Letters, IEEE, 4(7):242–244, 2000.

[29] Byungjoon Park, Hyunsoo Cheon, Changeon Kang, and Daesik Hong. A novel timing estimation method for OFDM systems. Communications Letters, IEEE, 7(5):239–241, 2003.

[30] Kun-Wah Yip, Yik-Chung Wu, and Tung-Sang Ng. Timing-synchronization analysis for IEEE 802.11 a wireless LANs in frequency-nonselective Rician fading environments. Wireless Communications, IEEE Transactions on, 3(2):387–394, 2004.

[31] Apurva N Mody and Gordon L Stuber. Synchronization for mimo ofdm systems. In Global Telecommunications Conference, 2001. GLOBECOM’01. IEEE, volume 1, pages 509–513. IEEE, 2001.

[32] V Erceg et al. IEEE P802.11 wireless LANs: TGn channel models. IEEE 802.11 document, 2004.

[33] Communication-Deconvolution. Sharetechnote. http://www.sharetechnote.com/ html/Communication_Deconvolution.html. Accessed Jan 22, 2015.

[34] Yuping Zhao and Aiping Huang. A novel channel estimation method for ofdm mobile communication systems based on pilot signals and transform-domain processing. In Vehicular Technology Conference, 1997, IEEE 47th, volume 3, pages 2089–2093. IEEE, 1997.

[35] John G Proakis. Digital communications. 1995. McGraw-Hill, New York.

[36] Pong P Chu. RTL hardware design using VHDL: coding for efficiency, portability, and scalability. John Wiley & Sons, 2006.

[37] Dongsuk Jeon. Reconfigurable FFT accelerator, Technical Note. Holst Centre, Eindhoven, 2015.

[38] Meng-Han Hsieh and Che-Ho Wei. Channel estimation for ofdm systems based on comb- type pilot arrangement in frequency selective fading channels. Consumer Electronics, IEEE Transactions on, 44(1):217–225, 1998.

[39] Sinem Colieri, Mustafa Ergen, Anuj Puri, and Ahmad Bahai. A study of channel estimation in OFDM systems. In Vehicular Technology Conference, 2002. Proceedings. VTC 2002-Fall. 2002 IEEE 56th, volume 2, pages 894–898. IEEE, 2002.


[40] Raymond Steele, H Ahmadi, and A Krishna. Mobile radio communications. In IEEE Proceedings, volume 82, pages 1468–1468. Institute of Electrical and Electronics Engineers, New York, NY, 1994.

[41] Nathaniel Pinckney, Ronald G Dreslinski, Korey Sewell, David Fick, Trevor Mudge, Dennis Sylvester, and David Blaauw. Limits of parallelism and boosting in dim silicon. Micro, IEEE, 33(5):30–37, 2013.
